Computational Complexity Theory, Techniques, and Applications
This book consists of selections from the Encyclopedia of Complexity and Systems Science edited by Robert A. Meyers, published by Springer New York in 2009.
Robert A. Meyers (Ed.)
Computational Complexity Theory, Techniques, and Applications
With 1487 Figures and 234 Tables
ROBERT A. MEYERS, Ph.D.
Editor-in-Chief
RAMTECH LIMITED
122 Escalle Lane
Larkspur, CA 94939
USA
[email protected]
Library of Congress Control Number: 2011940800
ISBN: 978-1-4614-1800-9
This publication is also available as:
Print publication under ISBN: 978-1-4614-1799-6
Print and electronic bundle under ISBN: 978-1-4614-1801-6

© 2012 Springer Science+Business Media, LLC. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

This book consists of selections from the Encyclopedia of Complexity and Systems Science edited by Robert A. Meyers, published by Springer New York in 2009.

springer.com

Printed on acid-free paper
Preface
Complex systems are systems that comprise many interacting parts with the ability to generate a new quality of collective behavior through self-organization, e.g. the spontaneous formation of temporal, spatial or functional structures. They are therefore adaptive, as they evolve, and may contain self-driving feedback loops. Thus, complex systems are much more than the sum of their parts. Complex systems are often characterized by extreme sensitivity to initial conditions as well as by emergent behavior that is not readily predictable or even completely deterministic. The conclusion is that a reductionist (bottom-up) approach is often an incomplete description of a phenomenon. This recognition that the collective behavior of the whole system cannot simply be inferred from an understanding of the behavior of the individual components has led to many new concepts and sophisticated mathematical and modeling tools for application to many scientific, engineering, and societal issues that can be adequately described only in terms of complexity and complex systems.

The inherent difficulty, or hardness, of computational problems in complex systems is a fundamental concept in computational complexity theory. This compendium, Computational Complexity, presents a detailed integrated view of the theoretical basis, computational methods and newest applicable approaches to solving inherently difficult problems whose solution requires extensive resources approaching the practical limits of present-day computer systems. Key components of computational complexity are detailed, integrated and utilized, ranging from Parameterized Complexity Theory (e.g. see the articles Quantum Computing and Analog Computation), the Exponential Time Hypothesis (e.g. see the articles on Cellular Automata and Language Theory) and Complexity Class P (e.g. see Quantum Computational Complexity and Cellular Automata, Universality of), to Heuristics (e.g. Social Network Visualization, Methods of and Repeated Games with Incomplete Information) and Parallel Algorithms (e.g. Cellular Automata as Models of Parallel Computation and Optical Computing).

There are 209 articles, organized into 14 sections, each headed by a recognized expert in the field and supported by peer reviewers in addition to the section editor. The sections are:
Agent Based Modeling and Simulation
Cellular Automata, Mathematical Basis of
Complex Networks and Graph Theory
Data Mining and Knowledge Discovery
Game Theory
Granular Computing
Intelligent Systems
Probability and Statistics in Complex Systems
Quantum Information Science
Social Network Analysis
Social Science, Physics and Mathematics, Applications in
Soft Computing
Unconventional Computing
Wavelets
The complete listing of articles and section editors is presented on pages VII to XII.
The articles are written for an audience of advanced university undergraduate and graduate students, professors, and professionals in a wide range of fields who must manage complexity on scales ranging from the atomic and molecular to the societal and global. Each article was selected and peer reviewed by one of our 13 Section Editors, with advice and consultation provided by Board Members Lotfi Zadeh, Stephen Wolfram and Richard Stearns, and by the Editor-in-Chief. This level of coordination assures that the reader can have a level of confidence in the relevance and accuracy of the information far exceeding that generally found on the World Wide Web. Accessibility is also a priority, and for this reason each article includes a glossary of important terms and a concise definition of the subject.

Robert A. Meyers
Editor-in-Chief
Larkspur, California
July 2011
Sections
Agent Based Modeling and Simulation, Section Editor: Filippo Castiglione
Agent Based Computational Economics
Agent Based Modeling and Artificial Life
Agent Based Modeling and Computer Languages
Agent Based Modeling and Simulation, Introduction to
Agent Based Modeling, Large Scale Simulations
Agent Based Modeling, Mathematical Formalism for
Agent-Based Modeling and Simulation
Cellular Automaton Modeling of Tumor Invasion
Computer Graphics and Games, Agent Based Modeling in
Embodied and Situated Agents, Adaptive Behavior in
Interaction Based Computing in Physics
Logic and Geometry of Agents in Agent-Based Modeling
Social Phenomena Simulation
Swarm Intelligence

Cellular Automata, Mathematical Basis of, Section Editor: Andrew Adamatzky
Additive Cellular Automata
Algorithmic Complexity and Cellular Automata
Cellular Automata and Groups
Cellular Automata and Language Theory
Cellular Automata as Models of Parallel Computation
Cellular Automata in Hyperbolic Spaces
Cellular Automata Modeling of Physical Systems
Cellular Automata on Triangular, Pentagonal and Hexagonal Tessellations
Cellular Automata with Memory
Cellular Automata, Classification of
Cellular Automata, Emergent Phenomena in
Cellular Automata, Universality of
Chaotic Behavior of Cellular Automata
Dynamics of Cellular Automata in Non-compact Spaces
Ergodic Theory of Cellular Automata
Evolving Cellular Automata
Firing Squad Synchronization Problem in Cellular Automata
Gliders in Cellular Automata
Growth Phenomena in Cellular Automata
Identification of Cellular Automata
Mathematical Basis of Cellular Automata, Introduction to
Phase Transitions in Cellular Automata
Quantum Cellular Automata
Reversible Cellular Automata
Self-organised Criticality and Cellular Automata
Self-Replication and Cellular Automata
Structurally Dynamic Cellular Automata
Tiling Problem and Undecidability in Cellular Automata
Topological Dynamics of Cellular Automata

Complex Networks and Graph Theory, Section Editor: Geoffrey Canright
Community Structure in Graphs
Complex Gene Regulatory Networks – From Structure to Biological Observables: Cell Fate Determination
Complex Networks and Graph Theory
Complex Networks, Visualization of
Food Webs
Growth Models for Networks
Human Sexual Networks
Internet Topology
Link Analysis and Web Search
Motifs in Graphs
Non-negative Matrices and Digraphs
Random Graphs, a Whirlwind Tour of
Synchronization Phenomena on Networks
World Wide Web, Graph Structure

Data Mining and Knowledge Discovery, Section Editor: Peter Kokol
Data and Dimensionality Reduction in Data Analysis and System Modeling
Data-Mining and Knowledge Discovery, Introduction to
Data-Mining and Knowledge Discovery, Neural Networks in
Data-Mining and Knowledge Discovery: Case Based Reasoning, Nearest Neighbor and Rough Sets
Decision Trees
Discovery Systems
Genetic and Evolutionary Algorithms and Programming: General Introduction and Application to Game Playing
Knowledge Discovery: Clustering
Machine Learning, Ensemble Methods in
Manipulating Data and Dimension Reduction Methods: Feature Selection

Game Theory, Section Editor: Marilda Sotomayor
Bayesian Games: Games with Incomplete Information
Cooperative Games
Cooperative Games (Von Neumann–Morgenstern Stable Sets)
Correlated Equilibria and Communication in Games
Cost Sharing
Differential Games
Dynamic Games with an Application to Climate Change Models
Evolutionary Game Theory
Fair Division
Game Theory and Strategic Complexity
Game Theory, Introduction to
Implementation Theory
Inspection Games
Learning in Games
Market Games and Clubs
Mechanism Design
Networks and Stability
Principal-Agent Models
Repeated Games with Complete Information
Repeated Games with Incomplete Information
Reputation Effects
Signaling Games
Static Games
Stochastic Games
Two-Sided Matching Models
Voting
Voting Procedures, Complexity of
Zero-sum Two Person Games

Granular Computing, Section Editor: Tsau Y. Lin
Cooperative Multi-Hierarchical Query Answering Systems
Dependency and Granularity in Data Mining
Fuzzy Logic
Fuzzy Probability Theory
Fuzzy System Models Evolution from Fuzzy Rulebases to Fuzzy Functions
Genetic-Fuzzy Data Mining Techniques
Granular Model for Data Mining
Granular Computing and Data Mining for Ordered Data: The Dominance-Based Rough Set Approach
Granular Computing and Modeling of the Uncertainty in Quantum Mechanics
Granular Computing System Vulnerabilities: Exploring the Dark Side of Social Networking Communities
Granular Computing, Information Models for
Granular Computing, Introduction to
Granular Computing, Philosophical Foundation for
Granular Computing, Principles and Perspectives of
Granular Computing: Practices, Theories and Future Directions
Granular Neural Network
Granulation of Knowledge: Similarity Based Approach in Information and Decision Systems
Multi-Granular Computing and Quotient Structure
Non-standard Analysis, an Invitation to
Rough and Rough-Fuzzy Sets in Design of Information Systems
Rough Set Data Analysis
Rule Induction, Missing Attribute Values and Discretization
Social Networks and Granular Computing

Intelligent Systems, Section Editor: James A. Hendler
Artificial Intelligence in Modeling and Simulation
Intelligent Control
Intelligent Systems, Introduction to
Learning and Planning (Intelligent Systems)
Mobile Agents
Semantic Web

Probability and Statistics in Complex Systems, Section Editor: Henrik Jeldtoft Jensen
Bayesian Statistics
Branching Processes
Complexity in Systems Level Biology and Genetics: Statistical Perspectives
Correlations in Complex Systems
Entropy
Extreme Value Statistics
Field Theoretic Methods
Fluctuations, Importance of: Complexity in the View of Stochastic Processes
Hierarchical Dynamics
Levy Statistics and Anomalous Transport: Levy Flights and Subdiffusion
Probability and Statistics in Complex Systems, Introduction to
Probability Densities in Complex Systems, Measuring
Probability Distributions in Complex Systems
Random Matrix Theory
Random Walks in Random Environment
Record Statistics and Dynamics
Stochastic Loewner Evolution: Linking Universality, Criticality and Conformal Invariance in Complex Systems
Stochastic Processes

Quantum Information Science, Section Editor: Joseph F. Traub
Quantum Algorithms
Quantum Algorithms and Complexity for Continuous Problems
Quantum Computational Complexity
Quantum Computing Using Optics
Quantum Computing with Trapped Ions
Quantum Cryptography
Quantum Error Correction and Fault Tolerant Quantum Computing
Quantum Information Processing
Quantum Information Science, Introduction to

Social Network Analysis, Section Editor: John Scott
Network Analysis, Longitudinal Methods of
Positional Analysis and Blockmodelling
Social Network Analysis, Estimation and Sampling in
Social Network Analysis, Graph Theoretical Approaches to
Social Network Analysis, Large-Scale
Social Network Analysis, Overview of
Social Network Analysis, Two-Mode Concepts in
Social Network Visualization, Methods of
Social Networks, Algebraic Models for
Social Networks, Diffusion Processes in
Social Networks, Exponential Random Graph (p*) Models for

Social Science, Physics and Mathematics Applications in, Section Editor: Andrzej Nowak
Minority Games
Rational, Goal-Oriented Agents
Social Processes, Simulation Models of

Soft Computing, Section Editor: Janusz Kacprzyk
Aggregation Operators and Soft Computing
Evolving Fuzzy Systems
Fuzzy Logic, Type-2 and Uncertainty
Fuzzy Optimization
Fuzzy Sets Theory, Foundations of
Hybrid Soft Computing Models for Systems Modeling and Control
Neuro-fuzzy Systems
Possibility Theory
Rough Sets in Decision Making
Rough Sets: Foundations and Perspectives
Soft Computing, Introduction to
Statistics with Imprecise Data

Unconventional Computing, Section Editor: Andrew Adamatzky
Amorphous Computing
Analog Computation
Artificial Chemistry
Bacterial Computing
Cellular Computing
Computing in Geometrical Constrained Excitable Chemical Systems
Computing with Solitons
DNA Computing
Evolution in Materio
Immunecomputing
Mechanical Computing: The Computational Complexity of Physical Devices
Membrane Computing
Molecular Automata
Nanocomputers
Optical Computing
Quantum Computing
Reaction-Diffusion Computing
Reversible Computing
Thermodynamics of Computation
Unconventional Computing, Introduction to
Unconventional Computing, Novel Hardware for

Wavelets, Section Editor: Edward Aboufadel
Bivariate (Two-dimensional) Wavelets
Comparison of Discrete and Continuous Wavelet Transforms
Curvelets and Ridgelets
Multivariate Splines and Their Applications
Multiwavelets
Numerical Issues When Using Wavelets
Popular Wavelet Families and Filters and Their Use
Statistical Applications of Wavelets
Wavelets and PDE Techniques in Image Processing, a Quick Tour of
Wavelets and the Lifting Scheme
Wavelets, Introduction to
About the Editor-in-Chief
Robert A. Meyers
President: RAMTECH Limited
Manager, Chemical Process Technology, TRW Inc.
Post-doctoral Fellow: California Institute of Technology
Ph.D., Chemistry, University of California at Los Angeles
B.A., Chemistry, California State University, San Diego
Biography

Dr. Meyers has worked with more than 25 Nobel laureates during his career.

Research

Dr. Meyers was Manager of Chemical Technology at TRW (now Northrop Grumman) in Redondo Beach, CA, and is now President of RAMTECH Limited. He is co-inventor of the Gravimelt process for desulfurization and demineralization of coal for air pollution and water pollution control. Dr. Meyers is the inventor of, and was project manager for, the DOE-sponsored Magnetohydrodynamics Seed Regeneration Project, which resulted in the construction and successful operation of a pilot plant for production of potassium formate, a chemical utilized for plasma electricity generation and air pollution control. Dr. Meyers managed the pilot-scale DOE project for determining the hydrodynamics of synthetic fuels. He is a co-inventor of several thermo-oxidatively stable polymers which have achieved commercial success as the GE PEI, Upjohn Polyimides and Rhône-Poulenc bismaleimide resins. He has also managed projects in photochemistry, chemical lasers, flue gas scrubbing, oil shale analysis and refining, petroleum analysis and refining, global change measurement from space satellites, analysis and mitigation (carbon dioxide and ozone), hydrometallurgical refining, soil and hazardous waste remediation, novel polymer synthesis, modeling of the economics of space transportation systems, space rigidizable structures and chemiluminescence-based devices. He is a senior member of the American Institute of Chemical Engineers, a member of the American Physical Society, a member of the American Chemical Society, and serves on the UCLA Chemistry Department Advisory Board. He was a member of the joint USA-Russia working group on air pollution control and the EPA-sponsored Waste Reduction Institute for Scientists and Engineers.
Dr. Meyers has more than 20 patents and 50 technical papers. He has published in primary literature journals including Science and the Journal of the American Chemical Society, and is listed in Who’s Who in America and Who’s Who in the World. Dr. Meyers’ scientific achievements have been reviewed in feature articles in the popular press, in publications such as The New York Times Science Supplement and The Wall Street Journal, as well as in more specialized publications such as Chemical Engineering and Coal Age. A public service film about Dr. Meyers’ chemical desulfurization invention for air pollution control was produced by the Environmental Protection Agency.

Scientific Books

Dr. Meyers is the author or Editor-in-Chief of 12 technical books, one of which won the Association of American Publishers Award as the best book in technology and engineering.

Encyclopedias

Dr. Meyers conceived and has served as Editor-in-Chief of the Academic Press (now Elsevier) Encyclopedia of Physical Science and Technology. This is an 18-volume publication of 780 twenty-page articles written for an audience of university students and practicing professionals. This encyclopedia, first published in 1987, was very successful and, because of this, was revised and reissued in 1992 as a second edition. The third edition was published in 2001 and is now online. Dr. Meyers has completed two editions of the Encyclopedia of Molecular Cell Biology and Molecular Medicine for Wiley-VCH publishers (1995 and 2004). These cover molecular- and cellular-level genetics, biochemistry, pharmacology, diseases and structure determination, as well as cell biology. His eight-volume Encyclopedia of Environmental Analysis and Remediation was published in 1998 by John Wiley & Sons, and his 15-volume Encyclopedia of Analytical Chemistry was published in 2000, also by John Wiley & Sons; all are available online.
Editorial Board Members
LOTFI A. ZADEH
Professor in the Graduate School, Computer Science Division
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
STEPHEN WOLFRAM
Founder and CEO, Wolfram Research
Creator, Mathematica®
Author, A New Kind of Science
RICHARD E. STEARNS
1993 Turing Award for foundations of computational complexity
Current interests include: computational complexity, automata theory, analysis of algorithms, and game theory.
Section Editors
Agent Based Modeling and Simulation
FILIPPO CASTIGLIONE
Research Scientist
Institute for Computing Applications (IAC) “M. Picone”
National Research Council (CNR), Italy

Complex Networks and Graph Theory
GEOFFREY CANRIGHT
Senior Research Scientist
Telenor Research and Innovation
Fornebu, Norway
Cellular Automata, Mathematical Basis of
ANDREW ADAMATZKY
Professor
Faculty of Computing, Engineering and Mathematical Science
University of the West of England

Data Mining and Knowledge Discovery
PETER KOKOL
Professor
Department of Computer Science
University of Maribor, Slovenia
Game Theory
MARILDA SOTOMAYOR
Professor
Department of Economics, University of São Paulo, Brazil
Department of Economics, Brown University, Providence

Probability and Statistics in Complex Systems
HENRIK JELDTOFT JENSEN
Professor of Mathematical Physics
Department of Mathematics and Institute for Mathematical Sciences
Imperial College London
Granular Computing
TSAU Y. LIN
Professor
Computer Science Department
San Jose State University

Quantum Information Science
JOSEPH F. TRAUB
Edwin Howard Armstrong Professor of Computer Science
Computer Science Department
Columbia University
Intelligent Systems
JAMES A. HENDLER
Senior Constellation Professor of the Tetherless World Research Constellation
Rensselaer Polytechnic Institute

Social Network Analysis
JOHN SCOTT
Professor of Sociology
School of Social Science and Law
University of Plymouth
Social Science, Physics and Mathematics Applications in
ANDRZEJ NOWAK
Director of the Center for Complex Systems, University of Warsaw
Assistant Professor, Psychology Department, Florida Atlantic University
Soft Computing
JANUSZ KACPRZYK
Deputy Director for Scientific Affairs, Professor
Systems Research Institute
Polish Academy of Sciences
Unconventional Computing
ANDREW ADAMATZKY
Professor
Faculty of Computing, Engineering and Mathematical Science
University of the West of England
Wavelets
EDWARD ABOUFADEL
Professor of Mathematics
Grand Valley State University
Table of Contents
Additive Cellular Automata Burton Voorhees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1
Agent Based Computational Economics Moshe Levy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18
Agent Based Modeling and Artificial Life Charles M. Macal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
39
Agent Based Modeling and Computer Languages Michael J. North, Charles M. Macal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
58
Agent Based Modeling, Large Scale Simulations Hazel R. Parry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
76
Agent Based Modeling, Mathematical Formalism for Reinhard Laubenbacher, Abdul S. Jarrah, Henning S. Mortveit, S.S. Ravi . . . . . . . . . . . . . . . . . . .
88
Agent Based Modeling and Simulation Stefania Bandini, Sara Manzoni, Giuseppe Vizzari . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
105
Agent Based Modeling and Simulation, Introduction to Filippo Castiglione . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
118
Aggregation Operators and Soft Computing Vicenç Torra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
122
Algorithmic Complexity and Cellular Automata Julien Cervelle, Enrico Formenti . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
132
Amorphous Computing Hal Abelson, Jacob Beal, Gerald Jay Sussman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
147
Analog Computation Bruce J. MacLennan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
161
Artificial Chemistry Peter Dittrich . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
185
Artificial Intelligence in Modeling and Simulation Bernard Zeigler, Alexandre Muzy, Levent Yilmaz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
204
Bacterial Computing Martyn Amos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
228
Bayesian Games: Games with Incomplete Information Shmuel Zamir . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
238
Bayesian Statistics David Draper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
254
Bivariate (Two-dimensional) Wavelets Bin Han . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
275
XXII
Table of Contents
Branching Processes Mikko J. Alava, Kent Bækgaard Lauritsen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
285
Cellular Automata as Models of Parallel Computation Thomas Worsch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
298
Cellular Automata, Classification of Klaus Sutner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
312
Cellular Automata, Emergent Phenomena in James E. Hanson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
325
Cellular Automata and Groups Tullio Ceccherini-Silberstein, Michel Coornaert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
336
Cellular Automata in Hyperbolic Spaces Maurice Margenstern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
350
Cellular Automata and Language Theory Martin Kutrib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
359
Cellular Automata with Memory Ramón Alonso-Sanz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
382
Cellular Automata Modeling of Physical Systems Bastien Chopard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
407
Cellular Automata in Triangular, Pentagonal and Hexagonal Tessellations Carter Bays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
434
Cellular Automata, Universality of Jérôme Durand-Lose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
443
Cellular Automaton Modeling of Tumor Invasion Haralambos Hatzikirou, Georg Breier, Andreas Deutsch . . . . . . . . . . . . . . . . . . . . . . . . . . . .
456
Cellular Computing Christof Teuscher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
465
Chaotic Behavior of Cellular Automata Julien Cervelle, Alberto Dennunzio, Enrico Formenti . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
479
Community Structure in Graphs Santo Fortunato, Claudio Castellano . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
490
Comparison of Discrete and Continuous Wavelet Transforms Palle E. T. Jorgensen, Myung-Sin Song . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
513
Complex Gene Regulatory Networks – from Structure to Biological Observables: Cell Fate Determination Sui Huang, Stuart A. Kauffman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
527
Complexity in Systems Level Biology and Genetics: Statistical Perspectives David A. Stephens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
561
Complex Networks and Graph Theory Geoffrey Canright . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
579
Complex Networks, Visualization of Vladimir Batagelj . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
589
Computer Graphics and Games, Agent Based Modeling in Brian Mac Namee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
604
Computing in Geometrical Constrained Excitable Chemical Systems Jerzy Gorecki, Joanna Natalia Gorecka . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
622
Computing with Solitons Darren Rand, Ken Steiglitz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
646
Table of Contents
Cooperative Games Roberto Serrano . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
666
Cooperative Games (Von Neumann–Morgenstern Stable Sets) Jun Wako, Shigeo Muto . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
675
Cooperative Multi-hierarchical Query Answering Systems Zbigniew W. Ras, Agnieszka Dardzinska . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
690
Correlated Equilibria and Communication in Games Françoise Forges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
695
Correlations in Complex Systems Renat M. Yulmetyev, Peter Hänggi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
705
Cost Sharing Maurice Koster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
724
Curvelets and Ridgelets Jalal Fadili, Jean-Luc Starck . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
754
Data and Dimensionality Reduction in Data Analysis and System Modeling Witold Pedrycz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
774
Data-Mining and Knowledge Discovery: Case-Based Reasoning, Nearest Neighbor and Rough Sets Lech Polkowski . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
789
Data-Mining and Knowledge Discovery, Introduction to Peter Kokol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
810
Data-Mining and Knowledge Discovery, Neural Networks in Markus Brameier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
813
Decision Trees Vili Podgorelec, Milan Zorman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
827
Dependency and Granularity in Data-Mining Shusaku Tsumoto, Shoji Hirano . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
846
Differential Games Marc Quincampoix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
854
Discovery Systems Petra Povalej, Mateja Verlic, Gregor Stiglic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
862
DNA Computing Martyn Amos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
882
Dynamic Games with an Application to Climate Change Models Prajit K. Dutta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
897
Dynamics of Cellular Automata in Non-compact Spaces Enrico Formenti, Petr K˚urka . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
914
Embodied and Situated Agents, Adaptive Behavior in Stefano Nolfi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
925
Entropy Constantino Tsallis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
940
Ergodic Theory of Cellular Automata Marcus Pivato . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
965
Evolutionary Game Theory William H. Sandholm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1000
Evolution in Materio Simon Harding, Julian F. Miller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1030
XXIII
XXIV
Table of Contents
Evolving Cellular Automata Martin Cenek, Melanie Mitchell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1043
Evolving Fuzzy Systems Plamen Angelov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1053
Extreme Value Statistics Mario Nicodemi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1066
Fair Division Steven J. Brams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1073
Field Theoretic Methods Uwe Claus Täuber . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1080
Firing Squad Synchronization Problem in Cellular Automata Hiroshi Umeo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1094
Fluctuations, Importance of: Complexity in the View of Stochastic Processes Rudolf Friedrich, Joachim Peinke, M. Reza Rahimi Tabar . . . . . . . . . . . . . . . . . . . . . . . . . . .
1131
Food Webs Jennifer A. Dunne . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1155
Fuzzy Logic Lotfi A. Zadeh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1177
Fuzzy Logic, Type-2 and Uncertainty Robert I. John, Jerry M. Mendel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1201
Fuzzy Optimization Weldon A. Lodwick, Elizabeth A. Untiedt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1211
Fuzzy Probability Theory Michael Beer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1240
Fuzzy Sets Theory, Foundations of Janusz Kacprzyk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1253
Fuzzy System Models Evolution from Fuzzy Rulebases to Fuzzy Functions I. Burhan Türkşen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1274
Game Theory, Introduction to Marilda Sotomayor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1289
Game Theory and Strategic Complexity Kalyan Chatterjee, Hamid Sabourian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1292
Genetic and Evolutionary Algorithms and Programming: General Introduction and Appl. to Game Playing Michael Orlov, Moshe Sipper, Ami Hauptman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1309
Genetic-Fuzzy Data Mining Techniques Tzung-Pei Hong, Chun-Hao Chen, Vincent S. Tseng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1321
Gliders in Cellular Automata Carter Bays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1337
Granular Computing and Data Mining for Ordered Data: The Dominance-Based Rough Set Approach Salvatore Greco, Benedetto Matarazzo, Roman Słowiński . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1347
Granular Computing, Information Models for Steven A. Demurjian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1369
Granular Computing, Introduction to Tsau Young Lin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1377
Granular Computing and Modeling of the Uncertainty in Quantum Mechanics Kow-Lung Chang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1381
Granular Computing, Philosophical Foundation for Zhengxin Chen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1389
Granular Computing: Practices, Theories, and Future Directions Tsau Young Lin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1404
Granular Computing, Principles and Perspectives of Jianchao Han, Nick Cercone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1421
Granular Computing System Vulnerabilities: Exploring the Dark Side of Social Networking Communities Steve Webb, James Caverlee, Calton Pu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1433
Granular Model for Data Mining Anita Wasilewska, Ernestina Menasalvas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1444
Granular Neural Network Yan-Qing Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1455
Granulation of Knowledge: Similarity Based Approach in Information and Decision Systems Lech Polkowski . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1464
Growth Models for Networks Sergey N. Dorogovtsev . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1488
Growth Phenomena in Cellular Automata Janko Gravner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1499
Hierarchical Dynamics Martin Nilsson Jacobi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1514
Human Sexual Networks Fredrik Liljeros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1535
Hybrid Soft Computing Models for Systems Modeling and Control Oscar Castillo, Patricia Melin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1547
Identification of Cellular Automata Andrew Adamatzky . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1564
Immunecomputing Jon Timmis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1576
Implementation Theory Luis C. Corchón . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1588
Inspection Games Rudolf Avenhaus, Morton J. Canty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1605
Intelligent Control Clarence W. de Silva . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1619
Intelligent Systems, Introduction to James Hendler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1642
Interaction Based Computing in Physics Franco Bagnoli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1644
Internet Topology Yihua He, Georgos Siganos, Michalis Faloutsos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1663
Knowledge Discovery: Clustering Pavel Berkhin, Inderjit S. Dhillon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1681
Learning in Games John Nachbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1695
Learning and Planning (Intelligent Systems) Ugur Kuter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1706
Lévy Statistics and Anomalous Transport: Lévy Flights and Subdiffusion Ralf Metzler, Aleksei V. Chechkin, Joseph Klafter . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1724
Link Analysis and Web Search Johannes Bjelland, Geoffrey Canright, Kenth Engø-Monsen . . . . . . . . . . . . . . . . . . . . . . . . . . .
1746
Logic and Geometry of Agents in Agent-Based Modeling Samson Abramsky . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1767
Machine Learning, Ensemble Methods in Sašo Džeroski, Panče Panov, Bernard Ženko . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1781
Manipulating Data and Dimension Reduction Methods: Feature Selection Huan Liu, Zheng Zhao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1790
Market Games and Clubs Myrna Wooders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1801
Mathematical Basis of Cellular Automata, Introduction to Andrew Adamatzky . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1819
Mechanical Computing: The Computational Complexity of Physical Devices John H. Reif . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1821
Mechanism Design Ron Lavi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1837
Membrane Computing Gheorghe Păun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1851
Minority Games Chi Ho Yeung, Yi-Cheng Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1863
Mobile Agents Niranjan Suri, Jan Vitek . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1880
Molecular Automata Joanne Macdonald, Darko Stefanovic, Milan Stojanovic . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1894
Motifs in Graphs Sergi Valverde, Ricard V. Solé . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1919
Multi-Granular Computing and Quotient Structure Ling Zhang, Bo Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1929
Multivariate Splines and Their Applications Ming-Jun Lai . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1939
Multiwavelets Fritz Keinert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1981
Nanocomputers Ferdinand Peper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1998
Network Analysis, Longitudinal Methods of Tom A. B. Snijders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2029
Networks and Stability Frank H. Page Jr., Myrna Wooders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2044
Neuro-fuzzy Systems Leszek Rutkowski, Krzysztof Cpałka, Robert Nowicki, Agata Pokropińska, Rafał Scherer . . . . . . . . . . .
2069
Non-negative Matrices and Digraphs Abraham Berman, Naomi Shaked-Monderer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2082
Non-standard Analysis, an Invitation to Wei-Zhe Yang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2096
Numerical Issues When Using Wavelets Jean-Luc Starck, Jalal Fadili . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2121
Optical Computing Thomas J. Naughton, Damien Woods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2138
Phase Transitions in Cellular Automata Nino Boccara . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2157
Popular Wavelet Families and Filters and Their Use Ming-Jun Lai . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2168
Positional Analysis and Blockmodeling Patrick Doreian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2226
Possibility Theory Didier Dubois, Henri Prade . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2240
Principal-Agent Models Inés Macho-Stadler, David Pérez-Castrillo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2253
Probability Densities in Complex Systems, Measuring Gunnar Pruessner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2267
Probability Distributions in Complex Systems Didier Sornette . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2286
Probability and Statistics in Complex Systems, Introduction to Henrik Jeldtoft Jensen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2301
Quantum Algorithms Michele Mosca . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2303
Quantum Algorithms and Complexity for Continuous Problems Anargyros Papageorgiou, Joseph F. Traub . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2334
Quantum Cellular Automata Karoline Wiesner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2351
Quantum Computational Complexity John Watrous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2361
Quantum Computing Viv Kendon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2388
Quantum Computing with Trapped Ions Wolfgang Lange . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2406
Quantum Computing Using Optics Gerard J. Milburn, Andrew G. White . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2437
Quantum Cryptography Hoi-Kwong Lo, Yi Zhao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2453
Quantum Error Correction and Fault Tolerant Quantum Computing Markus Grassl, Martin Rötteler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2478
Quantum Information Processing Seth Lloyd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2496
Quantum Information Science, Introduction to Joseph F. Traub . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2534
Random Graphs, a Whirlwind Tour of Fan Chung . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2536
Random Matrix Theory Güler Ergün . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2549
Random Walks in Random Environment Ofer Zeitouni . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2564
Rational, Goal-Oriented Agents Rosaria Conte . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2578
Reaction-Diffusion Computing Andrew Adamatzky . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2594
Record Statistics and Dynamics Paolo Sibani, Henrik Jeldtoft Jensen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2611
Repeated Games with Complete Information Olivier Gossner, Tristan Tomala . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2620
Repeated Games with Incomplete Information Jérôme Renault . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2635
Reputation Effects George J. Mailath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2656
Reversible Cellular Automata Kenichi Morita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2668
Reversible Computing Kenichi Morita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2685
Rough and Rough-Fuzzy Sets in Design of Information Systems Theresa Beaubouef, Frederick Petry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2702
Rough Set Data Analysis Shusaku Tsumoto . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2716
Rough Sets in Decision Making Roman Słowiński, Salvatore Greco, Benedetto Matarazzo . . . . . . . . . . . . . . . . . . . . . . . . . .
2727
Rough Sets: Foundations and Perspectives James F. Peters, Andrzej Skowron, Jarosław Stepaniuk . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2761
Rule Induction, Missing Attribute Values and Discretization Jerzy W. Grzymala-Busse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2772
Self-organized Criticality and Cellular Automata Michael Creutz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2780
Self-Replication and Cellular Automata Gianluca Tempesti, Daniel Mange, André Stauffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2792
Semantic Web Wendy Hall, Kieron O’Hara . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2810
Signaling Games Joel Sobel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2830
Social Network Analysis, Estimation and Sampling in Ove Frank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2845
Social Network Analysis, Graph Theoretical Approaches to Wouter de Nooy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2864
Social Network Analysis, Large-Scale Vladimir Batagelj . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2878
Social Network Analysis, Overview of John Scott . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2898
Social Network Analysis, Two-Mode Concepts in Stephen P. Borgatti . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2912
Social Networks, Algebraic Models for Philippa Pattison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2925
Social Networks, Diffusion Processes in Thomas W. Valente . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2940
Social Networks, Exponential Random Graph (p*) Models for Garry Robins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2953
Social Networks and Granular Computing Churn-Jung Liau . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2968
Social Network Visualization, Methods of Linton C. Freeman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2981
Social Phenomena Simulation Paul Davidsson, Harko Verhagen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2999
Social Processes, Simulation Models of Klaus G. Troitzsch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3004
Soft Computing, Introduction to Janusz Kacprzyk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3020
Static Games Oscar Volij . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3023
Statistical Applications of Wavelets Sofia Olhede . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3043
Statistics with Imprecise Data María Ángeles Gil, Olgierd Hryniewicz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3052
Stochastic Games Eilon Solan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3064
Stochastic Loewner Evolution: Linking Universality, Criticality and Conformal Invariance in Complex Systems Hans C. Fogedby . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3075
Stochastic Processes Alan J. McKane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3097
Structurally Dynamic Cellular Automata Andrew Ilachinski . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3114
Swarm Intelligence Gerardo Beni . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3150
Synchronization Phenomena on Networks Guanrong Chen, Ming Zhao, Tao Zhou, Bing-Hong Wang . . . . . . . . . . . . . . . . . . . . . . . . . . .
3170
Thermodynamics of Computation H. John Caulfield, Lei Qian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3187
Tiling Problem and Undecidability in Cellular Automata Jarkko Kari . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3198
Topological Dynamics of Cellular Automata Petr Kůrka . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3212
Two-Sided Matching Models Marilda Sotomayor, Ömer Özak . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3234
Unconventional Computing, Introduction to Andrew Adamatzky . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3258
Unconventional Computing, Novel Hardware for Tetsuya Asai . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3260
Voting Alvaro Sandroni, Jonathan Pogach, Michela Tincani, Antonio Penta, Deniz Selman . . . . . . . . . . . . .
3280
Voting Procedures, Complexity of Olivier Hudry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3291
Wavelets, Introduction to Edward Aboufadel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3314
Wavelets and the Lifting Scheme Anders La Cour–Harbo, Arne Jensen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3316
Wavelets and PDE Techniques in Image Processing, a Quick Tour of Hao-Min Zhou, Tony F. Chan, Jianhong Shen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3341
World Wide Web, Graph Structure Lada A. Adamic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3358
Zero-Sum Two Person Games T.E.S. Raghavan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3372
List of Glossary Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3397
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3417
Contributors
ABELSON, HAL Massachusetts Institute of Technology Cambridge USA
ANGELOV, PLAMEN Lancaster University Lancaster UK
ABOUFADEL, EDWARD Grand Valley State University Allendale USA
ASAI, TETSUYA Hokkaido University Sapporo Japan
ABRAMSKY, SAMSON Oxford University Computing Laboratory Oxford UK
AVENHAUS, RUDOLF Armed Forces University Munich Neubiberg Germany
ADAMATZKY, ANDREW University of the West of England Bristol UK
BAGNOLI, FRANCO University of Florence Florence Italy
ADAMIC, LADA A. University of Michigan Ann Arbor USA
BANDINI, STEFANIA University of Milan-Bicocca Milan Italy
ALAVA, MIKKO J. Espoo University of Technology Espoo Finland
ALONSO-SANZ, RAMÓN Universidad Politécnica de Madrid Madrid Spain
BATAGELJ, VLADIMIR University of Ljubljana Ljubljana Slovenia
BAYS, CARTER University of South Carolina Columbia USA
AMOS, MARTYN Manchester Metropolitan University Manchester UK
BEAL, JACOB Massachusetts Institute of Technology Cambridge USA
ÁNGELES GIL, MARÍA University of Oviedo Oviedo Spain
BEAUBOUEF, THERESA Southeastern Louisiana University Hammond USA
BEER, MICHAEL National University of Singapore Kent Ridge Singapore
CANTY, MORTON J. Forschungszentrum Jülich Jülich Germany
BENI , GERARDO University of California Riverside Riverside USA
CASTELLANO, CLAUDIO “Sapienza” Università di Roma Roma Italy
BERKHIN, PAVEL eBay Inc. San Jose USA
BERMAN, ABRAHAM Technion – Israel Institute of Technology Haifa Israel
BJELLAND, JOHANNES Telenor R&I Fornebu Norway
BOCCARA, NINO University of Illinois Chicago USA CE Saclay Gif-sur-Yvette France
BORGATTI, STEPHEN P. University of Kentucky Lexington USA
CASTIGLIONE, FILIPPO Institute for Computing Applications (IAC) – National Research Council (CNR) Rome Italy
CASTILLO, OSCAR Tijuana Institute of Technology Tijuana Mexico
CAULFIELD, H. JOHN Fisk University Nashville USA
CAVERLEE, JAMES Texas A&M University College Station USA
CECCHERINI-SILBERSTEIN, TULLIO Università del Sannio Benevento Italy
BRAMEIER, MARKUS University of Aarhus Århus Denmark
CENEK, MARTIN Portland State University Portland USA
BRAMS, STEVEN J. New York University New York USA
CERCONE, N ICK York University Toronto Canada
BREIER, GEORG Technische Universität Dresden Dresden Germany
CERVELLE, JULIEN Université Paris-Est Marne la Vallée France
CANRIGHT, GEOFFREY Telenor R&I Fornebu Norway
CHANG, KOW-LUNG National Taiwan University Taipeh Taiwan
CHAN, TONY F. University of California Los Angeles USA
CHATTERJEE, KALYAN The Pennsylvania State University University Park USA
CHECHKIN, ALEKSEI V. Institute for Theoretical Physics NSC KIPT Kharkov Ukraine
CHEN, CHUN-HAO National Cheng–Kung University Tainan Taiwan
CHEN, GUANRONG City University of Hong Kong Hong Kong China
CHEN, ZHENGXIN University of Nebraska at Omaha Omaha USA
CHOPARD, BASTIEN University of Geneva Geneva Switzerland
CHUNG, FAN University of California San Diego USA
CONTE, ROSARIA CNR Rome Italy
CPAŁKA, KRZYSZTOF Częstochowa University of Technology Częstochowa Poland Academy of Humanities and Economics Lodz Poland
CREUTZ, MICHAEL Brookhaven National Laboratory Upton USA
DARDZINSKA, AGNIESZKA Białystok Technical University Białystok Poland
DAVIDSSON, PAUL Blekinge Institute of Technology Ronneby Sweden
DEMURJIAN, STEVEN A. The University of Connecticut Storrs USA
DENNUNZIO, ALBERTO Università degli Studi di Milano-Bicocca Milan Italy
DE NOOY, WOUTER University of Amsterdam Amsterdam The Netherlands
DE SILVA, CLARENCE W. University of British Columbia Vancouver Canada
DEUTSCH, ANDREAS Technische Universität Dresden Dresden Germany
COORNAERT, MICHEL Université Louis Pasteur et CNRS Strasbourg France
DHILLON, INDERJIT S. University of Texas Austin USA
CORCHÓN, LUIS C. Universidad Carlos III Madrid Spain
DITTRICH, PETER Friedrich Schiller University Jena Jena Germany
DOREIAN, PATRICK University of Pittsburgh Pittsburgh USA
FADILI, JALAL École Nationale Supérieure d'Ingénieurs de Caen Caen Cedex France
DOROGOVTSEV, SERGEY N. Universidade de Aveiro Aveiro Portugal A. F. Ioffe Physico-Technical Institute St. Petersburg Russia
FALOUTSOS, MICHALIS University of California Riverside USA
DRAPER, DAVID University of California Santa Cruz USA
DUBOIS, DIDIER Université Paul Sabatier Toulouse Cedex France
DUNNE, JENNIFER A. Santa Fe Institute Santa Fe USA Pacific Ecoinformatics and Computational Ecology Lab Berkeley USA
DURAND-LOSE, JÉRÔME Université d'Orléans Orléans France
DUTTA, PRAJIT K. Columbia University New York USA
DŽEROSKI, SAŠO Jožef Stefan Institute Ljubljana Slovenia
FOGEDBY, HANS C. University of Aarhus Aarhus Denmark Niels Bohr Institute Copenhagen Denmark
FORGES, FRANÇOISE Université Paris-Dauphine Paris France
FORMENTI, ENRICO Université de Nice Sophia Antipolis Sophia Antipolis France
FORTUNATO, SANTO ISI Foundation Torino Italy
FRANK, OVE Stockholm University Stockholm Sweden
FREEMAN, LINTON C. University of California Irvine USA
ENGØ-MONSEN, KENTH Telenor R&I Fornebu Norway
FRIEDRICH, RUDOLF University of Münster Münster Germany
ERGÜN, GÜLER University of Bath Bath UK
GORECKA, JOANNA NATALIA Polish Academy of Science Warsaw Poland
GORECKI, JERZY Polish Academy of Science Warsaw Poland Cardinal Stefan Wyszynski University Warsaw Poland
GOSSNER, OLIVIER Northwestern University Paris France
GRASSL, MARKUS Austrian Academy of Sciences Innsbruck Austria
GRAVNER, JANKO University of California Davis USA
GRECO, SALVATORE University of Catania Catania Italy
GRZYMALA-BUSSE, JERZY W. University of Kansas Lawrence USA Polish Academy of Sciences Warsaw Poland
HANSON, JAMES E. IBM T.J. Watson Research Center Yorktown Heights USA
HARDING, SIMON Memorial University St. John's Canada
HATZIKIROU, HARALAMBOS Technische Universität Dresden Dresden Germany
HAUPTMAN, AMI Ben-Gurion University Beer-Sheva Israel
HENDLER, JAMES Rensselaer Polytechnic Institute Troy USA
HE, YIHUA University of California Riverside USA
HIRANO, SHOJI Shimane University, School of Medicine Enya-cho Izumo City, Shimane Japan
HALL, WENDY University of Southampton Southampton United Kingdom
HONG, TZUNG-PEI National University of Kaohsiung Kaohsiung Taiwan
HAN, BIN University of Alberta Edmonton Canada
HRYNIEWICZ, OLGIERD Systems Research Institute Warsaw Poland
HÄNGGI, PETER University of Augsburg Augsburg Germany
HUANG, SUI Department of Biological Sciences, University of Calgary Calgary Canada
HAN, JIANCHAO California State University Dominguez Hills, Carson USA
HUDRY, OLIVIER École Nationale Supérieure des Télécommunications Paris France
ILACHINSKI, ANDREW Center for Naval Analyses Alexandria USA
KENDON, VIV University of Leeds Leeds UK
JARRAH, ABDUL S. Virginia Polytechnic Institute and State University Virginia USA
KLAFTER, JOSEPH Tel Aviv University Tel Aviv Israel University of Freiburg Freiburg Germany
JENSEN, ARNE Aalborg University Aalborg East Denmark
JENSEN, HENRIK JELDTOFT Institute for Mathematical Sciences London UK Imperial College London London UK
JOHN, ROBERT I. De Montfort University Leicester United Kingdom
JORGENSEN, PALLE E. T. The University of Iowa Iowa City USA
KACPRZYK, JANUSZ Polish Academy of Sciences Warsaw Poland
KOKOL, PETER University of Maribor Maribor Slovenia
KOSTER, MAURICE University of Amsterdam Amsterdam Netherlands
KŮRKA, PETR Université de Nice Sophia Antipolis Nice France Academy of Sciences and Charles University Prague Czechia
KUTER, UGUR University of Maryland College Park USA
KUTRIB, MARTIN Universität Giessen Giessen Germany
KARI , JARKKO University of Turku Turku Finland
LA COUR–HARBO, ANDERS Aalborg University Aalborg East Denmark
KAUFFMAN, STUART A. Department of Biological Sciences, University of Calgary Calgary Canada
LAI , MING-JUN The University of Georgia Athens USA
KEINERT, FRITZ Iowa State University Ames USA
LANGE, W OLFGANG University of Sussex Brighton UK
LAUBENBACHER, REINHARD Virginia Polytechnic Institute and State University Virginia USA
LAURITSEN, KENT BÆKGAARD Danish Meteorological Institute Copenhagen Denmark
LAVI, RON The Technion – Israel Institute of Technology Haifa Israel
LEVY, MOSHE The Hebrew University Jerusalem Israel
LIAU, CHURN-JUNG Academia Sinica Taipei Taiwan
MACAL, CHARLES M. Center for Complex Adaptive Agent Systems Simulation (CAS2) Argonne USA
MACDONALD, JOANNE Columbia University New York USA
MACHO-STADLER, INÉS Universitat Autònoma de Barcelona Barcelona Spain
MACLENNAN, BRUCE J. University of Tennessee Knoxville USA
MACNAMEE, BRIAN Dublin Institute of Technology Dublin Ireland
LILJEROS, FREDRIK Stockholm University Stockholm Sweden
MAILATH, GEORGE J. University of Pennsylvania Philadelphia USA
LIN, TSAU YOUNG San Jose State University San Jose USA
MANGE, DANIEL Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne Switzerland
LIU, HUAN Arizona State University Tempe USA
MANZONI , SARA University of Milan-Bicocca Milan Italy
LLOYD, SETH MIT Cambridge USA
MARGENSTERN, MAURICE Université Paul Verlaine Metz France
LODWICK, W ELDON A. University of Colorado Denver Denver USA
MATARAZZO, BENEDETTO University of Catania Catania Italy
LO, HOI -KWONG University of Toronto Toronto Canada
MCKANE, ALAN J. University of Manchester Manchester UK
MELIN, PATRICIA Tijuana Institute of Technology Tijuana Mexico
MUTO, SHIGEO Institute of Technology Tokyo Japan
MENASALVAS, ERNESTINA Facultad de Informatica Madrid Spain
MUZY, ALEXANDRE Università di Corsica Corte France
MENDEL, JERRY M. University of Southern California Los Angeles USA
METZLER, RALF Technical University of Munich Garching Germany
MILBURN, GERARD J. The University of Queensland Brisbane Australia
MILLER, JULIAN F. University of York Heslington UK
MITCHELL, MELANIE Portland State University Portland USA
MORITA, KENICHI Hiroshima University Higashi-Hiroshima Japan
MORTVEIT, HENNING S. Virginia Polytechnic Institute and State University Virginia USA
MOSCA, MICHELE University of Waterloo Waterloo Canada St. Jerome's University Waterloo Canada Perimeter Institute for Theoretical Physics Waterloo Canada
N ACHBAR, JOHN Washington University St. Louis USA N AUGHTON, THOMAS J. National University of Ireland Maynooth County Kildare Ireland University of Oulu, RFMedia Laboratory Ylivieska Finland N ICODEMI , MARIO University of Warwick Coventry UK N ILSSON JACOBI , MARTIN Chalmers University of Technology Gothenburg Sweden N OLFI , STEFANO National Research Council (CNR) Rome Italy N ORTH, MICHAEL J. Center for Complex Adaptive Agent Systems Simulation (CAS2 ) Argonne USA N OWICKI , ROBERT Cz˛estochowa University of Technology Cz˛estochowa Poland O’HARA , KIERON University of Southampton Southampton United Kingdom
OLHEDE, SOFIA University College London London UK
PEINKE, JOACHIM Carl-von-Ossietzky University Oldenburg Oldenburg Germany
ORLOV, MICHAEL Ben-Gurion University Beer-Sheva Israel
PENTA, ANTONIO University of Pennsylvania Philadelphia USA
ÖZAK, ÖMER Brown University Providence USA
PEPER, FERDINAND National Institute of Information and Communications Technology Kobe Japan
PAGE JR., FRANK H. Indiana University Bloomington USA Université Paris 1 Panthéon–Sorbonne France PANOV, PANČE Jožef Stefan Institute Ljubljana Slovenia
PAPAGEORGIOU, ANARGYROS Columbia University New York USA PARRY, HAZEL R. Central Science Laboratory York UK PATTISON, PHILIPPA University of Melbourne Parkville Australia PĂUN, GHEORGHE Institute of Mathematics of the Romanian Academy București Romania PEDRYCZ, WITOLD University of Alberta Edmonton Canada Polish Academy of Sciences Warsaw Poland
PÉREZ-CASTRILLO, DAVID Universitat Autònoma de Barcelona Barcelona Spain PETERS, JAMES F. University of Manitoba Winnipeg Canada PETRY, FREDERICK Stennis Space Center Mississippi USA PIVATO, MARCUS Trent University Peterborough Canada PODGORELEC, VILI University of Maribor Maribor Slovenia POGACH, JONATHAN University of Pennsylvania Philadelphia USA POKROPIŃSKA, AGATA Jan Długosz University Częstochowa Poland
POLKOWSKI, LECH Polish-Japanese Institute of Information Technology Warsaw Poland
POVALEJ, PETRA University of Maribor Maribor Slovenia
RENAULT, JÉRÔME Université Paris Dauphine Paris France
PRADE, HENRI Universite Paul Sabatier Toulouse Cedex France
REZA RAHIMI TABAR, M. Sharif University of Technology Tehran Iran
PRUESSNER, GUNNAR Imperial College London London UK PU, CALTON Georgia Institute of Technology Atlanta USA QIAN, LEI Fisk University Nashville USA QUINCAMPOIX, MARC Université de Bretagne Occidentale Brest France RAGHAVAN, T.E.S. University of Illinois Chicago USA RAND, DARREN Massachusetts Institute of Technology Lexington USA RAŚ, ZBIGNIEW W. University of North Carolina Charlotte USA Polish Academy of Sciences Warsaw Poland
ROBINS, GARRY University of Melbourne Melbourne Australia RÖTTELER, MARTIN NEC Laboratories America, Inc. Princeton USA RUTKOWSKI, LESZEK Częstochowa University of Technology Częstochowa Poland SABOURIAN, HAMID University of Cambridge Cambridge UK SANDHOLM, WILLIAM H. University of Wisconsin Madison USA SANDRONI, ALVARO University of Pennsylvania Philadelphia USA SCHERER, RAFAŁ Częstochowa University of Technology Częstochowa Poland
RAVI, S.S. University at Albany – State University of New York New York USA
SCOTT, JOHN University of Plymouth Plymouth UK
REIF, JOHN H. Duke University Durham USA
SELMAN, DENIZ University of Pennsylvania Philadelphia USA
SERRANO, ROBERTO Brown University Providence USA IMDEA-Social Sciences Madrid Spain SHAKED-MONDERER, NAOMI Emek Yezreel College Emek Yezreel Israel SHEN, JIANHONG Barclays Capital New York USA SIBANI, PAOLO SDU Odense Denmark SIGANOS, GEORGOS University of California Riverside USA SIPPER, MOSHE Ben-Gurion University Beer-Sheva Israel SKOWRON, ANDRZEJ Warsaw University Warsaw Poland SŁOWIŃSKI, ROMAN Poznań University of Technology Poznań Poland Polish Academy of Sciences Warsaw Poland
SOLAN, EILON Tel Aviv University Tel Aviv Israel SOLÉ, RICARD V. Santa Fe Institute Santa Fe USA SONG, MYUNG-SIN Southern Illinois University Edwardsville USA SORNETTE, DIDIER ETH Zurich Zurich Switzerland SOTOMAYOR, MARILDA University of São Paulo/SP São Paulo Brazil Brown University Providence USA STARCK, JEAN-LUC CEA/Saclay Gif sur Yvette France STAUFFER, ANDRÉ Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne Switzerland STEFANOVIC, DARKO University of New Mexico Albuquerque USA STEIGLITZ, KEN Princeton University Princeton USA
SNIJDERS, TOM A. B. University of Oxford Oxford United Kingdom
STEPANIUK, JAROSŁAW Białystok University of Technology Białystok Poland
SOBEL, JOEL University of California San Diego USA
STEPHENS, DAVID A. McGill University Montreal Canada
STIGLIC, GREGOR University of Maribor Maribor Slovenia
TORRA, VICENÇ Institut d’Investigació en Intel·ligència Artificial – CSIC Bellaterra Spain
STOJANOVIC, MILAN Columbia University New York USA
TRAUB, JOSEPH F. Columbia University New York USA
SURI, NIRANJAN Institute for Human and Machine Cognition Pensacola USA SUSSMAN, GERALD JAY Massachusetts Institute of Technology Cambridge USA SUTNER, KLAUS Carnegie Mellon University Pittsburgh USA
TROITZSCH, KLAUS G. Universität Koblenz-Landau Koblenz Germany TSALLIS, CONSTANTINO Centro Brasileiro de Pesquisas Físicas Rio de Janeiro Brazil Santa Fe Institute Santa Fe USA
TÄUBER, UWE CLAUS Virginia Polytechnic Institute and State University Blacksburg USA
TSENG, VINCENT S. National Cheng–Kung University Tainan Taiwan
TEMPESTI, GIANLUCA University of York York UK
TSUMOTO, SHUSAKU Faculty of Medicine, Shimane University Shimane Japan
TEUSCHER, CHRISTOF Los Alamos National Laboratory Los Alamos USA
TÜRKŞEN, I. BURHAN TOBB-ETÜ (Economics and Technology University of the Union of Turkish Chambers and Commodity Exchanges) Ankara Republic of Turkey
TIMMIS, JON University of York York UK
UMEO, HIROSHI University of Osaka Osaka Japan
TINCANI, MICHELA University of Pennsylvania Philadelphia USA
UNTIEDT, ELIZABETH A. University of Colorado Denver Denver USA
TOMALA, TRISTAN HEC Paris Paris France
VALENTE, THOMAS W. University of Southern California Alhambra USA
VALVERDE, SERGI Parc de Recerca Biomedica de Barcelona Barcelona Spain
WEBB, STEVE Georgia Institute of Technology Atlanta USA
VERHAGEN, HARKO Stockholm University and Royal Institute of Technology Stockholm Sweden
WHITE, ANDREW G. The University of Queensland Brisbane Australia
VERLIC, MATEJA University of Maribor Maribor Slovenia
WIESNER, KAROLINE University of Bristol Bristol UK
VITEK, JAN Purdue University West Lafayette USA VIZZARI, GIUSEPPE University of Milan-Bicocca Milan Italy VOLIJ, OSCAR Ben-Gurion University Beer-Sheva Israel
WOODERS, MYRNA Vanderbilt University Nashville USA University of Warwick Coventry UK
VOORHEES, BURTON Athabasca University Athabasca Canada
WOODS, DAMIEN University College Cork Cork Ireland University of Seville Seville Spain
WAKO, JUN Gakushuin University Tokyo Japan
WORSCH, THOMAS Universität Karlsruhe Karlsruhe Germany
WANG, BING-HONG University of Science and Technology of China Hefei Anhui China Shanghai Academy of System Science Shanghai China
YANG, WEI-ZHE National Taiwan University Taipei Taiwan
WASILEWSKA, ANITA Stony Brook University Stony Brook USA WATROUS, JOHN University of Waterloo Waterloo Canada
YEUNG, CHI HO The Hong Kong University of Science and Technology Hong Kong China Université de Fribourg Pérolles, Fribourg Switzerland University of Electronic Science and Technology of China (UESTC) Chengdu China
YILMAZ, LEVENT Auburn University Alabama USA
ZHANG, YAN-QING Georgia State University Atlanta USA
YULMETYEV, RENAT M. Kazan State University Kazan Russia Tatar State University of Pedagogical and Humanities Sciences Kazan Russia
ZHANG, YI-CHENG The Hong Kong University of Science and Technology Hong Kong China Université de Fribourg Pérolles, Fribourg Switzerland University of Electronic Science and Technology of China (UESTC) Chengdu China
ZADEH, LOTFI A. University of California Berkeley USA ZAMIR, SHMUEL Hebrew University Jerusalem Israel ZEIGLER, BERNARD University of Arizona Tucson USA ZEITOUNI, OFER University of Minnesota Minneapolis USA ŽENKO, BERNARD Jožef Stefan Institute Ljubljana Slovenia
ZHAO, MING University of Science and Technology of China Hefei Anhui China ZHAO, YI University of Toronto Toronto Canada ZHAO, ZHENG Arizona State University Tempe USA ZHOU, HAO-MIN Georgia Institute of Technology Atlanta USA
ZHANG, BO Tsinghua University Beijing China
ZHOU, TAO University of Science and Technology of China Hefei Anhui China
ZHANG, LING Anhui University Hefei Anhui China
ZORMAN, MILAN University of Maribor Maribor Slovenia
Additive Cellular Automata
Additive Cellular Automata BURTON VOORHEES Center for Science, Athabasca University, Athabasca, Canada
Article Outline
Glossary
Definition of the Subject
Introduction
Notation and Formal Definitions
Additive Cellular Automata in One Dimension
d-Dimensional Rules
Future Directions
Bibliography
Glossary
Cellular automata Cellular automata are dynamical systems that are discrete in space, time, and value. A state of a cellular automaton is a spatial array of discrete cells, each containing a value chosen from a finite alphabet. The state space of a cellular automaton is the set of all such configurations.
Alphabet of a cellular automaton The alphabet of a cellular automaton is the set of symbols or values that can appear in each cell. The alphabet contains a distinguished symbol called the null or quiescent symbol, usually indicated by 0, which satisfies the condition of an additive identity: 0 + x = x.
Cellular automata rule The rule, or update rule, of a cellular automaton describes how any given state is transformed into its successor state. The update rule of a cellular automaton is described by a rule table, which defines a local neighborhood mapping, or equivalently by a global update mapping.
Additive cellular automata An additive cellular automaton is a cellular automaton whose update rule satisfies the condition that its action on the sum of two states is equal to the sum of its actions on the two states separately.
Linear cellular automata A linear cellular automaton is a cellular automaton whose update rule satisfies the condition that the sum of its actions on two states separately equals its action on the sum of the two states plus its action on the state in which all cells contain the quiescent symbol. Note that some researchers reverse the definitions of additivity and linearity.
Neighborhood The neighborhood of a given cell is the set of cells that contribute to the update of the value in that cell under the specified update rule.
Rule table The rule table of a cellular automaton is a listing of all neighborhoods together with the symbol that each neighborhood maps to under the local update rule.
Local maps of a cellular automaton The local mapping for a cellular automaton is a map from the set of all neighborhoods of a cell to the automaton alphabet.
State transition diagram The state transition diagram (STD) of a cellular automaton is a directed graph with each vertex labeled by a possible state and an edge directed from a vertex x to a vertex y if and only if the state labeling vertex x maps to the state labeling vertex y under application of the automaton update rule.
Transient states A transient state of a cellular automaton is a state that can appear at most once in the evolution of the automaton rule.
Cyclic states A cyclic state of a cellular automaton is a state lying on a cycle of the automaton update rule; hence it is periodically revisited in the evolution of the rule.
Basins of attraction The basins of attraction of a cellular automaton are the equivalence classes of cyclic states together with their associated transient states, with two states being equivalent if they lie on the same cycle of the update rule.
Predecessor state A state x is a predecessor of a state y if and only if x maps to y under application of the cellular automaton update rule. More specifically, a state x is an nth-order predecessor of a state y if it maps to y under n applications of the update rule.
Garden-of-Eden A Garden-of-Eden state is a state that has no predecessor. It can be present only as an initial condition.
Surjectivity A mapping is surjective (or onto) if every state has a predecessor.
Injectivity A mapping is injective (one-to-one) if every state in its domain maps to a unique state in its range.
That is, if states x and y both map to a state z, then x = y.
Reversibility A mapping X is reversible if and only if a second mapping X⁻¹ exists such that if X(x) = y then X⁻¹(y) = x. For finite state spaces, reversibility and injectivity are identical.

Definition of the Subject
Cellular automata are discrete dynamical systems in which an extended array of symbols from a finite alphabet is iteratively updated according to a specified local rule. They were originally developed by John von Neumann [1,2] in 1948, following suggestions from Stanislaw Ulam, for the purpose of showing that self-replicating automata could be constructed. Von Neumann’s construction followed a complicated set of reproduction rules, but later work showed that self-reproducing automata can be constructed with only simple update rules, e.g. [3]. More generally, cellular automata are of interest because they show that highly complex patterns can arise from the application of very simple update rules. While conceptually simple, they provide a robust modeling class for application in a variety of disciplines, e.g. [4], as well as fertile ground for theoretical research. Additive cellular automata are the simplest class of cellular automata. They have been extensively studied from both theoretical and practical perspectives.

Introduction
A wide variety of cellular automata applications, in a number of differing disciplines, has appeared in the past fifty years; see, e.g. [5,6,7,8]. Among other things, cellular automata have been used to model growth and aggregation processes [9,10,11,12]; discrete reaction-diffusion systems [13,14,15,16,17]; spin exchange systems [18,19]; biological pattern formation [20,21]; disease processes and transmission [22,23,24,25,26]; DNA sequences and gene interactions [27,28]; spiral galaxies [29]; social interaction networks [30]; and forest fires [31,32]. They have been used for language and pattern recognition [33,34,35,36,37,38,39]; image processing [40,41]; as parallel computers [42,43,44,45,46,47]; parallel multipliers [48]; sorters [49]; and prime number sieves [50]. In recent years, cellular automata have become important for VLSI logic circuit design [51].
Circuit designers need “simple, regular, modular, and cascadable logic circuit structure to realize a complex function” and cellular automata, which show a significant advantage over linear feedback shift registers, the traditional circuit building block, satisfy this need (see [52] for an extensive survey). Cellular automata, in particular additive cellular automata, are of value for producing high-quality pseudorandom sequences [53,54,55,56,57]; for pseudoexhaustive and deterministic pattern generation [58,59,60,61,62,63]; for signature analysis [64,65,66]; error correcting codes [67,68]; pseudoassociative memory [69]; and cryptography [70]. In this discussion, attention focuses on the subclass of additive cellular automata. These are the simplest cellular automata, characterized by the property that the action of the update rule on the sum of two states is equal to the sum of the rule acting on each state separately. Hybrid
additive rules (i.e., with different cells evolving according to different additive rules) have proved particularly useful for the generation of pseudorandom and pseudoexhaustive sequences, signature analysis, and other circuit design applications, e.g. [52,71,72]. The remainder of this article is organized as follows: Sect. “Notation and Formal Definitions” introduces definitions and notational conventions. In Sect. “Additive Cellular Automata in One Dimension”, consideration is restricted to one-dimensional rules. The influence of boundary conditions on the evolution of one-dimensional rules, conditions for rule additivity, generation of fractal space-time outputs, equivalent forms of rule representation, injectivity and reversibility, transient lengths, and cycle periods are discussed using several approaches. Taking X as the global operator of an additive cellular automaton, a method for analytic solution of equations of the form X(μ) = β is described. Section “d-Dimensional Rules” describes work on d-dimensional rules defined on tori. The discrete baker transformation is defined and used to generalize one-dimensional results on transient lengths, cycle periods, and similarity of state transition diagrams. Extensive references to the literature are provided throughout, and a set of general references is provided at the end of the bibliography.

Notation and Formal Definitions
Let S(L) = {s_i} be the set of lattice sites of a d-dimensional lattice L, with n_r equal to the number of lattice sites on dimension r. Denote by A a finite symbol set with |A| = p (usually prime). An A-configuration on L is a map ν : S(L) → A that assigns a symbol from A to each site in S(L). In this way, every A-configuration defines a size n₁ × ⋯ × n_d, d-dimensional matrix of symbols drawn from A. Denote the set of all A-configurations on L by E(A, L).
Each s_i ∈ S(L) is labeled by an integer vector i⃗ = (i₁, …, i_d), where i_r is the number of sites along the rth dimension separating s_i from the assigned origin in L. The shift operator on the rth dimension of L is the map σ_r : L → L defined by

σ_r(s_i) = s_j ,  j⃗ = (i₁, …, i_r − 1, …, i_d) .  (1)

Equivalently, the shift maps the value at site i⃗ to the value at site j⃗. Let μ(s_i, t) = μ(i₁, …, i_d, t) ∈ A be the entry of μ corresponding to site s_i at iteration t, for any discrete dynamical system having E(A, L) as state space. Given a finite set of integer d-tuples N = {(k₁, …, k_d)}, define the N-neighborhood of a site s_i ∈ S(L) as

N(s_i) = { s_j | j⃗ = i⃗ + k⃗, k⃗ ∈ N } .  (2)
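As an illustrative one-dimensional sketch (not from the article): the neighborhood of Eq. (2), together with a rule table of the kind introduced next, determines a global map on cyclic configurations. The function names and the dictionary encoding of the rule table are of course arbitrary choices.

```python
# Illustrative sketch (not from the article): in one dimension with
# N = {-1, 0, 1}, the N-neighborhood of site i on a cyclic lattice, and
# the global map X induced by a rule table mapping neighborhood
# configurations to the alphabet.
N = (-1, 0, 1)

def neighborhood(i, n):
    return [(i + k) % n for k in N]     # Eq. (2) with periodic wrap

def global_map(rule):                   # rule: neighborhood config -> symbol
    def X(config):
        n = len(config)
        return tuple(rule[tuple(config[j] for j in neighborhood(i, n))]
                     for i in range(n))
    return X

# Rule table of elementary rule 90 as a dictionary.
rule90 = {(a, b, c): (a + c) % 2
          for a in (0, 1) for b in (0, 1) for c in (0, 1)}
X = global_map(rule90)
assert X((1, 0, 0, 0)) == (0, 1, 0, 1)
```

The same construction works for any alphabet size p by letting the dictionary range over A = {0, …, p−1}.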
A neighborhood configuration is an assignment ν : N(s₀) → A of symbols to the sites of the standard neighborhood. Denote the set of all neighborhood configurations by E_N(A). The rule table for a cellular automaton acting on the state space E(A, L) with standard neighborhood N(s₀) is defined by a map x : E_N(A) → A (note that this map need not be surjective or injective). The value of x for a given neighborhood configuration is called the (value of the) rule component of that configuration. The map x : E_N(A) → A induces a global map X : E(A, L) → E(A, L) as follows: for any given element μ(t) ∈ E(A, L), the set C_μ(s_i) = { μ(s_j, t) | s_j ∈ N(s_i) } is a neighborhood configuration for the site s_i, hence the map μ(s_i, t) ↦ x(C_μ(s_i)) for all s_i produces a new symbol μ(s_i, t+1). The site s_i is called the mapping site. Taken over all mapping sites, this produces a matrix μ(t+1) that is the representation of X(μ(t)). A cellular automaton is indicated by reference to its rule table or to the global map defined by this rule table.

A cellular automaton with global map X is additive if and only if, for all pairs of states μ and β,

X(μ + β) = X(μ) + X(β) .  (3)

Addition of states is carried out site-wise mod p on the matrix representations of μ and β; for example, for a one-dimensional six-site lattice with p = 3, the sum of 120112 and 021212 is 111021. The definition of additivity given in [52] differs slightly from this standard definition. There, a binary-valued cellular automaton is called “linear” if its local rule involves only the XOR operation and “additive” if it involves XOR and/or XNOR. A rule involving XNOR can be written as the binary complement of a rule involving only XOR. In terms of the global operator of the rule, this means that it has the form 1 + X, where X satisfies Eq. (3) and 1 represents the rule that maps every site to 1. Thus, (1 + X)(μ + β) equals 1…1 + X(μ + β), while

(1 + X)(μ) + (1 + X)(β) = 1…1 + 1…1 + X(μ) + X(β) = X(μ) + X(β) mod 2 .

In what follows, an additive rule is defined strictly as one obeying Eq. (3), corresponding to rules that are “linear” in [52]. Much of the formal study of cellular automata has focused on the properties and forms of representation of the map X : E(A, L) → E(A, L). The structure of the state
transition diagram STD(X) of this map is of particular interest.

Example 1 (Continuous Transformations of the Shift Dynamical System) Let L be isomorphic to the set of integers Z. Then E(A, Z) is the set of infinite sequences with entries from A. With the product topology induced by the discrete topology on A, and with σ the left shift map, the system (E(A, Z), σ) is the shift dynamical system on A. The set of cellular automata maps X : E(A, Z) → E(A, Z) constitutes the class of continuous shift-commuting transformations of (E(A, Z), σ), a fundamental result of Hedlund [73].

Example 2 (Elementary Cellular Automata) Let L be isomorphic to Z with A = {0, 1} and N = {−1, 0, 1}. The neighborhood of site s_i is {s_{i−1}, s_i, s_{i+1}} and E_N(A) = {000, 001, 010, 011, 100, 101, 110, 111}. In this one-dimensional case, the rule table can be written as x_i = x(i₀i₁i₂), where i₀i₁i₂ is the binary form of the index i. Listing this gives the standard form for the rule table of an elementary cellular automaton:

000  001  010  011  100  101  110  111
x₀   x₁   x₂   x₃   x₄   x₅   x₆   x₇
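As an illustrative sketch (not from the article), the correspondence between such a rule table and the rule-numbering scheme described next can be coded directly; the function names below are arbitrary.

```python
# Illustrative sketch (not from the article): converting between an
# elementary rule table and the rule number sum_{i=0}^{7} x_i 2^i.
def wolfram_number(table):
    # table maps neighborhood strings '000'..'111' to 0 or 1
    return sum(table[format(i, '03b')] << i for i in range(8))

def rule_table(number):
    return {format(i, '03b'): (number >> i) & 1 for i in range(8)}

rule90 = {'000': 0, '001': 1, '010': 0, '011': 1,
          '100': 1, '101': 0, '110': 1, '111': 0}
assert wolfram_number(rule90) == 90
assert rule_table(90) == rule90
```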
The standard labeling scheme for elementary cellular automata was introduced by Wolfram [74], who observed that the rule table for elementary rules defines the binary number ∑_{i=0}^{7} x_i 2^i and used this number to label the corresponding rule.

Example 3 (The Game of Life) This simple two-dimensional cellular automaton was invented by John Conway to illustrate a self-reproducing system. It was first presented in 1970 by Martin Gardner [75,76]. The game takes place on a square lattice, either infinite or toroidal. The neighborhood of a cell consists of the eight cells surrounding it. The alphabet is {0, 1}: a 1 in a cell indicates that the cell is alive, a 0 that it is dead. The update rules are: (a) if a cell contains a 0, it remains 0 unless exactly three of its neighbors contain a 1; (b) if a cell contains a 1, then it remains a 1 if and only if two or three of its neighbors are 1. This cellular automaton produces a number of interesting patterns, including a variety of fixed points (still lifes); oscillators (period 2); and moving patterns (gliders, spaceships); as well as more exotic patterns such as glider guns, which generate a stream of glider patterns.

Additive Cellular Automata in One Dimension
Much of the work on cellular automata has focused on rules in one dimension (d = 1). This section reviews some of this work.
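As an aside before turning to one-dimensional rules, the Game of Life update rules (a) and (b) above can be sketched directly (illustrative code, not from the article), here on a toroidal grid.

```python
# Illustrative sketch (not from the article): one synchronous update of
# Conway's Game of Life on a toroidal grid, following rules (a) and (b).
def life_step(grid):
    n, m = len(grid), len(grid[0])
    def neighbors(i, j):
        return sum(grid[(i + di) % n][(j + dj) % m]
                   for di in (-1, 0, 1) for dj in (-1, 0, 1)
                   if (di, dj) != (0, 0))
    return [[1 if (grid[i][j] == 1 and neighbors(i, j) in (2, 3))
             or (grid[i][j] == 0 and neighbors(i, j) == 3) else 0
             for j in range(m)] for i in range(n)]

# A "blinker": a period-2 oscillator.
blinker = [[0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 1, 1, 1, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0]]
assert life_step(life_step(blinker)) == blinker
```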
Boundary Conditions and Additivity
In the case of one-dimensional cellular automata, the lattice L can be isomorphic to the integers; to the non-negative integers; to the finite set {0, …, n−1} ⊂ Z; or to the integers modulo an integer n. In the first case there are no boundary conditions; in the remaining three cases, different boundary conditions apply. If L is isomorphic to Z_n, the integers mod n, the boundary conditions are periodic and the lattice is circular (it is a p-adic necklace). This is called a cylindrical cellular automaton [77] because evolution of the rule can be represented as taking place on a cylinder. If the lattice is isomorphic to {0, …, n−1}, null (or Dirichlet) boundary conditions are set [78,79,80]. That is, the symbol assigned to all sites in L outside of this set is the null symbol. When the lattice is isomorphic to the non-negative integers Z⁺, null boundary conditions are set at the left boundary. In these latter two cases, the neighborhood structure assumed may influence the need for null conditions.

Example 4 (Elementary Rule 90) Let δ represent the global map for the elementary cellular automaton rule 90, with rule table

000  001  010  011  100  101  110  111
 0    1    0    1    1    0    1    0

For a binary string μ in Z or Z_n, the action of rule 90 is defined by [δ(μ)]_i = μ_{i−1} + μ_{i+1} mod 2, where all indices are taken mod n in the case of Z_n. In the remaining cases, null boundary conditions apply; that is, sites outside the lattice are treated as containing 0.
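The mod-p generalization of rule 90 makes both the additivity condition of Eq. (3) and the role of boundary conditions easy to check numerically. The following sketch (illustrative, not from the article; the mod-3 rule is simply an assumed analogue of rule 90) verifies the site-wise sum 120112 + 021212 = 111021 quoted earlier and prints the familiar Sierpinski-like space-time pattern of rule 90.

```python
# Illustrative sketch (not from the article): the additive rule
# [X(s)]_i = s_{i-1} + s_{i+1} (mod p) on a cyclic or null-bounded
# lattice, used to check additivity, Eq. (3), and to evolve rule 90.
def X(s, p=2, boundary='cyclic'):
    n = len(s)
    if boundary == 'cyclic':
        return [(s[i - 1] + s[(i + 1) % n]) % p for i in range(n)]
    padded = [0] + s + [0]          # null boundaries: outside sites are 0
    return [(padded[i] + padded[i + 2]) % p for i in range(n)]

def add(s, t, p):
    return [(a + b) % p for a, b in zip(s, t)]

# Site-wise sum from the text: 120112 + 021212 = 111021 (mod 3).
s, t = [1, 2, 0, 1, 1, 2], [0, 2, 1, 2, 1, 2]
assert add(s, t, 3) == [1, 1, 1, 0, 2, 1]
assert X(add(s, t, 3), p=3) == add(X(s, p=3), X(t, p=3), 3)   # Eq. (3)

# Rule 90 (p = 2) from a single seed: Sierpinski-like space-time output.
row = [0] * 15 + [1] + [0] * 15
for _ in range(8):
    print(''.join('#' if c else '.' for c in row))
    row = X(row, boundary='null')
```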
Algorithmic Complexity and Cellular Automata

K(w|v) < K_φ(w|v) + c. Let z be a shortest program for x knowing v, and let z′ be a shortest program for y knowing ⟨v, x⟩. As ⟨z, z′⟩ is a program for ⟨x, y⟩ knowing v with representation system φ, one has

K(⟨x, y⟩|v) < K_φ(⟨x, y⟩|v) + c ≤ |⟨z, z′⟩| + c ≤ K(x|v) + K(y|⟨v, x⟩) + 2 log₂(K(x|v)) + 3 + c ,

which completes the proof.

Of course, this inequality is not an equality since, if x = y, the complexity K(⟨x, y⟩) is the same as K(x) up to some constant. When the equality holds, it is said that x and y have no mutual information. Note that if we can represent the combination of programs z and z′ in a shorter way, the upper bound is decreased by the same amount. For instance, if we choose ⟨x, y⟩ = ℓ(ℓ(x))²01ℓ(x)xy, it becomes 2 log₂ log₂ |x| + log₂ |x| + |x| + |y|. If we know that a program never contains a special word u, then with ⟨x, y⟩ = xuy it becomes |x| + |y|.

Examples
The properties seen so far allow one to prove most of the relations between complexities of words. For instance, choosing f : w ↦ ww and g the identity in Corollary 1, we obtain that there is a constant c such that |K(ww|v) − K(w|v)| < c. In the same way, letting f be the identity and g be defined by g(x) = ε in Theorem 3, we get that there is a constant c such that K(w|v) < K(w) + c. By Theorem 4, Corollary 1, and choosing f as ⟨x, y⟩ ↦ xy and g as the identity, we have that there is a constant c such that

K(xy) < K(x) + K(y) + 2 min(log₂(K(x)), log₂(K(y))) + c .

The next result proves the existence of incompressible words.

Proposition 2 For any c ∈ ℝ⁺, there are at least 2^{n+1} − 2^{n−c} c-incompressible words w such that |w| ≤ n.

Proof Each program produces only one word; therefore there cannot be more words whose complexity is below n than the number of programs of size less than n. Hence, one finds

|{w, K(w) < n}| ≤ 2ⁿ − 1 .

This implies that

|{w, K(w) < |w| − c and |w| ≤ n}| ≤ |{w, K(w) < n − c}| < 2^{n−c} ,

and, since

{w, K(w) ≥ |w| − c and |w| ≤ n} = {w, |w| ≤ n} \ {w, K(w) < |w| − c and |w| ≤ n} ,

it holds that

|{w, K(w) ≥ |w| − c and |w| ≤ n}| ≥ 2^{n+1} − 2^{n−c} .

From this proposition one deduces that half of the words whose size is less than or equal to n reach the upper bound of their Kolmogorov complexity. This incompressibility method relies on the fact that most words are incompressible.

Martin–Löf Tests
Martin–Löf tests give another equivalent definition of random sequences. The idea is simple: a sequence is random if it does not pass any computable test of singularity (i.e.,
a test that selects words from a “negligible” recursively enumerable set). As for Kolmogorov complexity, this notion needs a proper definition and implies a universal test. However, in this review we prefer to express tests in terms of Kolmogorov complexity. (We refer the reader to [13] for more on Martin–Löf tests.)

Let us start with an example. Consider the test “to have the same number of 0s and 1s.” Define the set E = {w, w has the same number of 0s and 1s} and order it in military order (length first, then lexicographic order), E = {e₀, e₁, …}. Consider the computable function f : x ↦ e_x (which is in fact computable for any decidable set E). Note that there are (2n choose n) words in E of length 2n. Thus, if e_x has length 2n, then x is less than (2n choose n), whose logarithm is 2n − log₂ √(πn) + O(1). Then, using Theorem 3 with function f, one finds that

K(e_x) < K(x) < log₂ x < |e_x| − log₂ √(π|e_x|/2) ,

up to an additive constant. We conclude that all members e_x of E are not c-incompressible whenever they are long enough. This notion of test corresponds to “belonging to a small set” in Kolmogorov complexity terms. The next proposition formalizes these ideas.

Proposition 3 Let E be a recursively enumerable set such that |E ∩ {0,1}ⁿ| = o(2ⁿ). Then, for all constants c, there is an integer M such that all words of E of length greater than M are not c-incompressible.

Proof In this proof, we represent integers as binary strings. Let ⟨x, y⟩ be defined, for integers x and y, by

⟨x, y⟩ = |x|²01xy .

Let f be the computable function

f : {0,1}* → {0,1}* ,  ⟨x, y⟩ ↦ the yth word of length x in the enumeration of E .

From Theorems 2 and 3, there is a constant d such that for all words e of E,

K(e) < |f⁻¹(e)| + d .

Write u_n = |E ∩ {0,1}ⁿ|. From the definition of the function f one has

|f⁻¹(e)| ≤ 3 + 2 log₂(|e|) + log₂(u_{|e|}) .

As u_n = o(2ⁿ), log₂(u_{|e|}) − |e| tends to −∞, so there is an integer M such that when |e| > M, K(e) < |f⁻¹(e)| + d < |e| − c. This proves that no members of E whose length is greater than M are c-incompressible.
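The balanced-words test above can be checked numerically: counting the members of E up to a given length shows that a word's index in the military-order enumeration is a description noticeably shorter than the word itself (an illustrative sketch, not from the article; the length 20 is an arbitrary choice).

```python
# Illustrative sketch (not from the article): members of
# E = {w : w has equally many 0s and 1s} are compressible, because the
# index of a word in the military-order enumeration of E is a shorter
# description.  Count members of E of length <= 2m and compare
# log2(count) -- the index length in bits -- with the word length 2m.
from math import comb, log2

m = 10                                              # word length 2m = 20
count = sum(comb(k, k // 2) for k in range(0, 2 * m + 1, 2))
assert count == 250953                              # members of E, length <= 20
print(log2(count))                                  # just under 18 bits, versus 20
```

The gap log₂ √(πn) grows with the word length, matching the bound in the text.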
Dynamical Systems and Symbolic Factors
A (discrete-time) dynamical system is a structure ⟨X, f⟩, where X is the set of states of the system and f is a function from X to itself that, given a state, tells which is the next state of the system. In the literature, the state space is usually assumed to be a compact metric space and f a continuous function. The study of the asymptotic properties of a dynamical system is, in general, difficult. Therefore, it can be interesting to associate the studied system with a simpler one and deduce the properties of the original system from the simpler one. Indeed, under the compactness assumption, one can associate each system ⟨X, f⟩ with its symbolic factor as follows. Consider a finite open covering {β₀, β₁, …, β_k} of X. Label each set β_i with a symbol α_i. For any orbit of initial condition x ∈ X, we build an infinite word w_x on {α₀, α₁, …, α_k} such that for all i ∈ ℕ, w_x(i) = α_j if f^i(x) ∈ β_j for some j ∈ {0, 1, …, k}. If f^i(x) belongs to β_j ∩ β_h (for j ≠ h), then arbitrarily choose either α_j or α_h. Denote by S_x the set of infinite words associated with the initial condition x ∈ X, and set S = ∪_{x∈X} S_x. The system ⟨S, σ⟩ is the symbolic system associated with ⟨X, f⟩; σ is the shift map defined by σ(x)_i = x_{i+1} for all x ∈ S and i ∈ ℕ. When {β₀, β₁, …, β_k} is a clopen partition, then ⟨S, σ⟩ is a factor of ⟨X, f⟩ (see [12] for instance). Dynamical systems theory has made great strides in recent decades, and a huge quantity of new systems has appeared. As a consequence, scientists have tried to classify them according to different criteria. From interactions among physicists was born the idea that in order to simulate a certain physical phenomenon, one should use a dynamical system whose complexity is not higher than that of the phenomenon under study. The problem is the meaning of the word “complexity”; in general it has a different meaning for each scientist.
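A minimal sketch of the itinerary construction above (not from the article), using the doubling map f(x) = 2x mod 1 on [0, 1) as an assumed example system, with the two pieces β₀ = [0, 1/2) and β₁ = [1/2, 1):

```python
# Illustrative sketch (not from the article): the word w_x recording
# which piece beta_j contains f^i(x), for the doubling map on [0, 1).
def itinerary(x, steps):
    word = []
    for _ in range(steps):
        word.append(0 if x < 0.5 else 1)   # the j with f^i(x) in beta_j
        x = (2.0 * x) % 1.0
    return word

print(itinerary(0.3, 8))
```

Here {β₀, β₁} is a partition, so each orbit yields a unique word; for a genuine open covering, overlapping pieces would make S_x contain several words.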
From a computer-science point of view, if we look at the state spaces of factor systems, they can be seen as sets of bi-infinite words on a fixed alphabet. Hence, each factor ⟨S, σ⟩ can be associated with a language as follows. For each pair of finite words u, w, write u ⊑ w if u is a factor of w. Given a finite word u and a bi-infinite word v, with an abuse of notation we write u ⊑ v if u occurs as a factor in v. The language L(v) associated with a bi-infinite word v is defined as

L(v) = { u ∈ A* , u ⊑ v } .

Finally, the language L(S) associated with the symbolic factor ⟨S, σ⟩ is given by

L(S) = { u ∈ A* , u ⊑ v for some v ∈ S } .
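For finite approximations, the factor language is directly computable; a small illustrative sketch (not from the article):

```python
# Illustrative sketch (not from the article): the factor language L(v)
# of a finite word v, i.e. the set of all factors (substrings) of v.
def factors(v):
    return {v[i:j] for i in range(len(v)) for j in range(i + 1, len(v) + 1)}

L = factors('01101')
assert '110' in L and '00' not in L
```

For a symbolic factor one would take the union of factors(w) over longer and longer central blocks of the words w ∈ S.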
The idea is that the complexity of the system $\langle X, f\rangle$ is proportional to the language complexity of its symbolic factor $\langle S, \sigma\rangle$ (see, for example, Topological Dynamics of Cellular Automata). In [4,5], Brudno proposes to evaluate the complexity of symbolic factors using Kolmogorov complexity. Indeed, the complexity of the orbit of initial condition $x \in X$ according to the finite open covering $\{\beta_0, \beta_1, \dots, \beta_k\}$ is defined as
$$K(x, f, \{\beta_0, \beta_1, \dots, \beta_k\}) = \limsup_{n \to \infty} \; \min_{w \in S_x} \frac{K(w_{0:n})}{n} \, .$$
Finally, the complexity of the orbit of x is given by $K(x, f) = \sup_\beta K(x, f, \beta)$, where the supremum is taken over all possible finite open coverings $\beta$ of X. Brudno has proven the following result.

Theorem 5 ([4])  Consider a dynamical system $\langle X, f\rangle$ and an ergodic measure $\mu$. For $\mu$-almost all $x \in X$, $K(x, f) = H_\mu(f)$, where $H_\mu(f)$ is the measure entropy of $\langle X, f\rangle$ (for more on the measure entropy see, for example, Ergodic Theory of Cellular Automata).

Cellular Automata

Cellular automata (CA) are (discrete-time) dynamical systems that have received increased attention in the last few decades as formal models of complex systems based on local interaction rules. This is mainly due to their great variety of distinct dynamical behaviors and to the possibility of easily performing large-scale computer simulations. In this section we quickly review the definitions and results that are needed in the sequel. Consider the set of configurations $\mathcal{C}$, which consists of all functions from $\mathbb{Z}^D$ into A. The space $\mathcal{C}$ is usually equipped with the Cantor metric $d_C$ defined as
$$\forall a, b \in \mathcal{C}, \quad d_C(a, b) = 2^{-n}, \quad \text{with } n = \min_{\vec{v} \in \mathbb{Z}^D} \{\|\vec{v}\|_\infty : a(\vec{v}) \neq b(\vec{v})\} \, , \tag{3}$$
where $\|\vec{v}\|_\infty$ denotes the maximum of the absolute values of the components of $\vec{v}$. The topology induced by $d_C$ coincides with the product topology induced by the discrete topology on A. With this topology, $\mathcal{C}$ is a compact, perfect, and totally disconnected space. Let $N = \{\vec{u}_1, \dots, \vec{u}_s\}$ be an ordered set of vectors of $\mathbb{Z}^D$ and $\delta : A^s \to A$ be a function.

Definition 5 (CA)  The D-dimensional CA based on the local rule $\delta$ and the neighborhood frame N is the pair $\langle \mathcal{C}, f\rangle$, where $f : \mathcal{C} \to \mathcal{C}$ is the global transition rule defined as follows:
$$\forall c \in \mathcal{C}, \; \forall \vec{v} \in \mathbb{Z}^D, \quad f(c)(\vec{v}) = \delta\big(c(\vec{v} + \vec{u}_1), \dots, c(\vec{v} + \vec{u}_s)\big) \, . \tag{4}$$
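Equation (4) translates directly into code. The following Python sketch implements the global rule of a 1-D CA on a finite cyclic configuration; the choice of elementary rule 110 and of the neighborhood $\{-1, 0, 1\}$ is illustrative:

```python
# Sketch of the global rule of Eq. (4) for a 1-D CA on a cyclic (periodic)
# finite configuration. Rule 110 and the neighborhood are illustrative choices.

RULE = 110
NEIGHBORHOOD = (-1, 0, 1)

def delta(states):
    """Local rule: look up the 3-bit neighborhood in rule 110's truth table."""
    index = states[0] * 4 + states[1] * 2 + states[2]
    return (RULE >> index) & 1

def global_rule(config):
    """f(c)(v) = delta(c(v+u_1), ..., c(v+u_s)), with wrap-around boundaries."""
    n = len(config)
    return [delta(tuple(config[(v + u) % n] for u in NEIGHBORHOOD))
            for v in range(n)]

c = [0, 0, 0, 1, 0, 0, 0]
print(global_rule(c))
```

Iterating `global_rule` produces successive rows of a (cyclic) space-time diagram; the wrap-around boundary is a finite-size convenience, since the definition above is over the infinite lattice $\mathbb{Z}^D$.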
Note that the mapping f is (uniformly) continuous with respect to $d_C$. Hence, the pair $\langle \mathcal{C}, f\rangle$ is a proper (discrete-time) dynamical system. In [7], a new metric on the phase space was introduced to better match the intuitive idea of chaotic behavior with its mathematical formulation. More results in this research direction can be found in [1,2,3,15]. This volume dedicates a whole chapter to the subject (Dynamics of Cellular Automata in Non-compact Spaces). Here we simply recall the definition of the Besicovitch distance, since it will be used in Subsect. "Example 2". Consider the Hamming distance between two words u, v over the same alphabet A: $\#(u, v) = |\{i \in \mathbb{N} \mid u_i \neq v_i\}|$. This distance can easily be extended to factors of bi-infinite words as follows: $\#_{h,k}(u, v) = |\{i \in [h, k] \mid u_i \neq v_i\}|$, where $h, k \in \mathbb{Z}$ and $h < k$. Finally, the Besicovitch pseudodistance is defined for any pair of bi-infinite words u, v as
$$d_B(u, v) = \limsup_{n \to \infty} \frac{\#_{-n,n}(u, v)}{2n + 1} \, .$$
The pseudodistance $d_B$ can be turned into a distance by taking its restriction to $A^{\mathbb{Z}}/\!\doteq$, where $\doteq$ is the relation of "being at null $d_B$ distance." Roughly speaking, $d_B$ measures the upper density of differences between two bi-infinite words. The space-time diagram is a graphical representation of a CA orbit. Formally, for a D-dimensional CA f with state set A, it is the function $\Gamma : A^{\mathbb{Z}^D} \times \mathbb{N} \times \mathbb{Z}^D \to A$ defined as $\Gamma(x, i, j) = f^i(x)_j$, for all $x \in A^{\mathbb{Z}^D}$, $i \in \mathbb{N}$, and $j \in \mathbb{Z}^D$. The limit set $\Omega_f$ captures the long-term behavior of a CA on a given set U and is defined as follows:
$$\Omega_f(U) = \bigcap_{n \in \mathbb{N}} f^n(U) \, .$$
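The Besicovitch pseudodistance can be estimated on a finite window $[-n, n]$. The following Python sketch (with illustrative words, represented as functions from $\mathbb{Z}$ to symbols) computes the density of differences $\#_{-n,n}(u, v)/(2n+1)$:

```python
# Sketch: finite-n estimate of the Besicovitch pseudodistance d_B, i.e. the
# density of differences #_{-n,n}(u, v) / (2n + 1); the words are illustrative.

def besicovitch_estimate(u, v, n):
    """u, v: functions Z -> symbols; returns #_{-n,n}(u, v) / (2n + 1)."""
    diff = sum(1 for i in range(-n, n + 1) if u(i) != v(i))
    return diff / (2 * n + 1)

u = lambda i: 0                 # the all-zero bi-infinite word
v = lambda i: 1 if i % 3 == 0 else 0   # differs from u on every third cell
print(besicovitch_estimate(u, v, 7))
```

As $n$ grows the estimate tends to the density of differences, here 1/3; the limsup in the definition makes this robust even when the density oscillates.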
Unfortunately, any nontrivial property of CA limit sets is undecidable [11]. Other interesting information on a CA's long-term behavior is given by the orbit limit set, defined as follows:
$$\Lambda_f(U) = \bigcup_{u \in U} \big(O_f(u)\big)' \, ,$$
where $H'$ denotes the set of adherence points of a set H, and $O_f(u)$ is the orbit of u. Note that in general the limit set and the orbit limit set give different information. For example, consider a rich configuration c, i.e., a configuration containing all possible finite patterns (here we adopt the terminology of [6]), and the shift map $\sigma$. Then $\Lambda_\sigma(\{c\}) = A^{\mathbb{Z}}$, while $\Omega_\sigma(\{c\})$ is countable.

Algorithmic Complexity as a Demonstration Tool

The incompressibility method is a valuable formal tool for decreasing the combinatorial complexity of problems. It is essentially based on the following ideas (see also Chap. 6 in [13]). First, incompressible words cannot be produced by short programs; hence, an incompressible (infinite) word cannot be obtained algorithmically. Second, most words are incompressible; hence, a word taken at random can usually be considered incompressible without loss of generality. Therefore, if one "proves" a recursive property of incompressible words, then, by Proposition 3, we have a contradiction.
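The second idea rests on a simple counting argument: there are at most $2^m - 1$ programs of length less than m, so fewer than $2^{n-c}$ of the $2^n$ words of length n can have complexity below $n - c$. A Python sketch of the count (n = 8, c = 2 are illustrative values):

```python
# Sketch: the counting fact behind "most words are incompressible" — at most
# 2^(n-c) - 1 programs have length < n - c, so at most that many of the 2^n
# words of length n can have complexity below n - c. Values are illustrative.

n, c = 8, 2
words = 2 ** n                      # binary words of length n
short_programs = 2 ** (n - c) - 1   # binary programs of length < n - c
fraction = short_programs / words   # upper bound on the compressible fraction
print(fraction)
```

So here more than 75% of the words of length 8 have complexity at least $n - 2$; the bound improves exponentially in c.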
Application to Cellular Automata  A portion of a space-time diagram of a CA can be computed from its local rule and the initial condition. More precisely, the part of the diagram that depends only on a given segment of the initial condition has at most the Kolmogorov complexity of that segment (Fig. 1). This fact often implies strong computational dependencies between the initial part and the final part of a portion of the diagram: if the final part has high complexity, then the initial part must be at least as complex. Using this basic idea, proofs are structured as follows: assume that the final part has high complexity; use the hypothesis to prove that the initial part is not complex; this yields a contradiction. This technique provides a faster and clearer way to prove results that could otherwise be obtained only by technical combinatorial arguments. The first example illustrates this fact by rewriting a combinatorial proof in terms of Kolmogorov complexity. The second is a result that was directly proved in terms of Kolmogorov complexity.

Example 1  Consider the following result about languages recognizable by CA.

Proposition 4 ([17])  The language $L = \{uvu : u, v \in \{0, 1\}^\ast, |u| > 1\}$ is not recognizable by a real-time one-way CA.
Algorithmic Complexity and Cellular Automata, Figure 1  States in the gray zone can be computed from the states of the black line: K(gray zone) $\leq$ K(black line) up to an additive constant
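The dependency cone of Fig. 1 is easy to exhibit: with radius r = 1, a segment of m cells determines m − 2 cells after one step, m − 4 after two, and so on. The following Python sketch (elementary rule 90, an illustrative choice) computes the whole triangle determined by an initial segment; since the triangle is computed from the segment, K(triangle) $\leq$ K(segment) + O(1):

```python
# Sketch of the dependency cone of Fig. 1: with radius r = 1, each step
# determines one fewer cell on each side. Rule 90 (XOR of the two neighbors)
# is an illustrative choice.

def step_segment(seg):
    """One step of rule 90, losing one cell on each side."""
    return [seg[i - 1] ^ seg[i + 1] for i in range(1, len(seg) - 1)]

def triangle(segment):
    """All rows of the space-time triangle determined by `segment` alone."""
    rows = [segment]
    while len(rows[-1]) > 2:
        rows.append(step_segment(rows[-1]))
    return rows

for row in triangle([0, 1, 0, 0, 1, 1, 0]):
    print(row)
```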
Before giving the proof in terms of Kolmogorov complexity, we must recall some concepts. A language L is accepted in real time by a CA if each word $x \in L$ is accepted after exactly $|x|$ transitions. An input word is accepted by a CA if, after that many transitions, the observed cell is in an accepting state. A CA is one-way if the neighborhood is $\{0, 1\}$ (i.e., the central cell and the one on its right).

Proof  One-way CA have some interesting properties. First, from a one-way CA recognizing L, one can build a one-way CA recognizing $L_\Sigma = \{uvu : u, v \in \Sigma^\ast, |u| > 1\}$, where $\Sigma$ is any finite alphabet. The idea is to code the ith letter of $\Sigma$ by $1u_0 1u_1 \cdots 1u_k 00$, where $u_0 u_1 \cdots u_k$ is i written in binary, and to apply the former CA. Second, for any one-way CA of global rule f we have that $f^{|w|}(xwy) = f^{|w|}(xw)\, f^{|w|}(wy)$ for all words x, y, and w. Now, assume that a one-way CA recognizes L in real time. Then there is an algorithm F that, for any integer n, computes a local rule of a one-way CA recognizing $L_{\{0,\dots,n-1\}}$ in real time. Fix an integer n and choose a 0-incompressible word w of length $n(n-1)$. Let A be the set of pairs $(x, y)$ with $x, y \in \{0, \dots, n-1\}$ and $x \neq y$ defined by
$$A = \{(x, y) : \text{the } (nx + y)\text{th bit of } w \text{ is } 1\} \, .$$
Since A can be computed from w and vice versa, one finds that $K(A) = K(w) = |w| = n(n-1)$ up to an additive constant that does not depend on n. Order the set $A = \{(x_0, y_0), (x_1, y_1), \dots\}$ and build a new word u as follows:
$$u_0 = y_0 x_0 \, , \qquad u_{i+1} = u_i\, y_{i+1} x_{i+1}\, u_i \, , \qquad u = u_{|A|-1} \, .$$
From Lemma 1 in [17], one finds that for all $x, y \in \{0, \dots, n-1\}$, $(x, y) \in A \Leftrightarrow xuy \in L$. Let $\delta$ be the local rule produced by F(n), Q the set of accepting states, and f the associated global rule. Since
$$f^{|u|}(xuy) = f^{|u|}(xu)\, f^{|u|}(uy) \, ,$$
one has that
$$(x, y) \in A \;\Leftrightarrow\; xuy \in L \;\Leftrightarrow\; \delta\big(f^{|u|}(xu), f^{|u|}(uy)\big) \in Q \, .$$
Hence, from the knowledge of each $f^{|u|}(xu)$ and $f^{|u|}(uy)$ for $x, y \in \{0, \dots, n-1\}$, a list of 2n words, one can compute A. We conclude that $K(A) \leq 2n \log_2(n)$ up to an additive constant that does not depend on n. This contradicts $K(A) = n(n-1)$ for large enough n.

Example 2  Notation: given $x \in A^{\mathbb{Z}^D}$, we denote by $x_{\to n}$ the word $w \in A^{(2n+1)^D}$ obtained by taking all the states $x_i$ for $i \in \{-n, \dots, n\}^D$ in some fixed computable order. The next example shows a proof that directly uses the incompressibility method.

Theorem 6  In the Besicovitch topological space there is no transitive CA.

Recall that a CA f is transitive if for all nonempty open sets A and B there exists $n \in \mathbb{N}$ such that $f^n(A) \cap B \neq \emptyset$.

Proof  By contradiction, assume that there exists a transitive CA f of radius r with $C = |S|$ states. Let x and y be two configurations such that
$$\forall n \in \mathbb{N}, \quad K(x_{\to n} \mid y_{\to n}) \geq \frac{n}{2} \, .$$
A simple counting argument proves that such configurations x and y always exist. Since f is transitive, there are two configurations $x'$ and $y'$ such that for all $n \in \mathbb{N}$
$$\#_{-n,n}(x_{\to n}, x'_{\to n}) \leq 4\varepsilon n \, , \tag{5}$$
$$\#_{-n,n}(y_{\to n}, y'_{\to n}) \leq 4\delta n \, ,$$
and an integer u (which only depends on $\varepsilon$ and $\delta$) such that
$$f^u(y') = x' \, , \tag{6}$$
where $\varepsilon = \delta = (4 e^{10 \log_2 C})^{-1}$. In what follows only n varies, while C, u, x, y, $x'$, $y'$, $\delta$, and $\varepsilon$ are fixed and independent of n. By Eq. (6), one may compute the word $x'_{\to n}$ from the following items:

- $y'_{\to n}$, f, u, and n;
- the two words of length ur that surround $y'_{\to n}$ and that are missing to compute $x'_{\to n}$ with Eq. (6).

We obtain that
$$K(x'_{\to n} \mid y'_{\to n}) \leq 2ur + K(u) + K(n) + K(f) + O(1) \leq o(n) \tag{7}$$
(the notations O and o are defined with respect to n). Note that u is fixed, so K(u) is a constant, while K(n) is bounded by $\log_2 n + O(1)$. Similarly, r and S are fixed, and hence K(f) is a constant bounded by $C^{2r+1} \log_2 C + O(1)$.

Let us now evaluate $K(y_{\to n} \mid y'_{\to n})$. Let $a_1, a_2, \dots, a_k$ be the positive positions at which $y_{\to n}$ and $y'_{\to n}$ differ, sorted in increasing order. Let $b_1 = a_1$ and $b_i = a_i - a_{i-1}$ for $2 \leq i \leq k$. By Eq. (5) we know that $k \leq 4\delta n$. Note that $\sum_{i=1}^{k} b_i = a_k \leq n$. Symmetrically, let $a'_1, a'_2, \dots, a'_{k'}$ be the absolute values of the strictly negative positions at which $y_{\to n}$ and $y'_{\to n}$ differ, sorted in increasing order; let $b'_1 = a'_1$ and $b'_i = a'_i - a'_{i-1}$ for $2 \leq i \leq k'$. Equation (5) also gives $k' \leq 4\delta n$. Since the logarithm is a concave function, one has
$$\frac{\sum \ln b_i}{k} \leq \ln \frac{\sum b_i}{k} \leq \ln \frac{n}{k} \, ,$$
and hence
$$\sum \ln b_i \leq k \ln \frac{n}{k} \, , \tag{8}$$
which also holds for the $b'_i$ and $k'$.

Knowledge of the $b_i$, the $b'_i$, and the $k + k'$ states of the cells at which $y_{\to n}$ differs from $y'_{\to n}$ is enough to compute $y_{\to n}$ from $y'_{\to n}$. Hence,
$$K(y_{\to n} \mid y'_{\to n}) \leq \sum \ln(b_i) + \sum \ln(b'_i) + (k + k') \log_2 C + O(1) \, .$$
Equation (8) states that
$$K(y_{\to n} \mid y'_{\to n}) \leq k \ln \frac{n}{k} + k' \ln \frac{n}{k'} + (k + k') \log_2 C + O(1) \, .$$
The function $k \mapsto k \ln \frac{n}{k}$ is increasing on $[0, \frac{n}{e}]$. As $k \leq 4\delta n = \frac{n}{e^{10 \log_2 C}}$, we have that
$$k \ln \frac{n}{k} \leq 4\delta n \ln \frac{n}{4\delta n} \leq \frac{n}{e^{10 \log_2 C}} \, \ln e^{10 \log_2 C} = \frac{10 n \log_2 C}{e^{10 \log_2 C}}$$
and that
$$(k + k') \log_2 C \leq \frac{2n \log_2 C}{e^{10 \log_2 C}} \, .$$
Replacing a, b, and k by $a'$, $b'$, and $k'$, the same sequence of inequalities leads to a similar result. One deduces that
$$K(y_{\to n} \mid y'_{\to n}) \leq \frac{(2 \log_2 C + 20) n}{e^{10 \log_2 C}} + O(1) \, . \tag{9}$$
Similarly, Eq. (9) also holds for $K(x_{\to n} \mid x'_{\to n})$. The triangle inequality for Kolmogorov complexity, $K(a \mid b) \leq K(a \mid c) + K(c \mid b) + O(1)$ (a consequence of Theorems 3 and 4), gives
$$K(x_{\to n} \mid y_{\to n}) \leq K(x_{\to n} \mid x'_{\to n}) + K(x'_{\to n} \mid y'_{\to n}) + K(y_{\to n} \mid y'_{\to n}) + O(1) \, .$$
Equations (9) and (7) allow one to conclude that
$$K(x_{\to n} \mid y_{\to n}) \leq \frac{(2 \log_2 C + 20) n}{e^{10 \log_2 C}} + o(n) \, .$$
The hypothesis on x and y was $K(x_{\to n} \mid y_{\to n}) \geq \frac{n}{2}$. This implies that
$$\frac{n}{2} \leq \frac{(2 \log_2 C + 20) n}{e^{10 \log_2 C}} + o(n) \, .$$
The last inequality is false for big enough n.
Measuring CA Structural Complexity  Another use of Kolmogorov complexity in the study of CA is to understand the maximum complexity they can produce, by exhibiting examples of CA with high-complexity characteristics. The question is what "high complexity characteristics" should mean and, more precisely, which characteristics to consider. This section is devoted to structural characteristics of CA, that is to say, complexity that can be observed through static particularities.

The Case of Tilings  In this section, we give the original example, which was given for tilings, often considered as a static version of CA. In [10], Durand et al. construct a tile set whose tilings have maximal complexity. This paper contains two main results: the first is an upper bound on the complexity of tilings, and the second is an example of a tiling that reaches this bound. First we recall the definitions about tilings of the plane by Wang tiles.

Definition 6 (Tilings with Wang tiles)  Let C be a finite set of colors. A Wang tile is a quadruple (n, s, e, w) of four colors from C, corresponding to a square tile whose top color is n, left color is w, right color is e, and bottom color
is s. A Wang tile cannot be rotated, but arbitrarily many copies of it may be used. Given a set of Wang tiles T, we say that the plane can be tiled by T if one can place tiles from T on the square grid $\mathbb{Z}^2$ such that adjacent borders of neighboring tiles have the same color. A set of tiles T that can tile the plane is called a palette.

The notion of local constraint gives a point of view closer to CA than tilings. Roughly speaking, it specifies the local constraints that a tiling of the plane by 0s and 1s must satisfy. Note that this notion can be defined over any alphabet, but we can equivalently code any letter with 0s and 1s.

Definition 7 (Tilings by local constraints)  Let r be a positive integer called the radius. Let C be a set of square patterns of size $2r + 1$ made of 0s and 1s (formally, functions from $\{-r, \dots, r\}^2$ to $\{0, 1\}$). The set C is said to tile the plane if there is a way to place zeros and ones on the 2-D grid (formally, a function from $\mathbb{Z}^2$ to $\{0, 1\}$) all of whose patterns of size $2r + 1$ belong to C. The possible layouts of zeros and ones on the (2-D) grid are called the tilings acceptable for C.

Seminal papers on this subject used Wang tiles. We translate these results in terms of local constraints in order to apply them more smoothly to CA. Theorem 7 proves that, among the tilings acceptable for a local constraint, there is always one that is not too complex. Note that the bound is meaningful: the radius-1 constraint that allows all possible patterns accepts every tiling, and in particular tilings of high complexity.

Theorem 7 ([10])  Let C be a local constraint. There is a tiling acceptable for C such that the Kolmogorov complexity of its central pattern of size n is O(n) (recall that the maximal complexity of a square pattern $n \times n$ is $O(n^2)$).
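Checking whether a finite pattern is acceptable for a local constraint (Definition 7) is a direct scan of all its (2r+1) × (2r+1) sub-squares. A Python sketch, with an illustrative radius-0 constraint that simply forbids the symbol 1:

```python
# Sketch: acceptability of a finite 0/1 pattern for a local constraint of
# radius r — every (2r+1)x(2r+1) sub-square must belong to the allowed set C.
# Constraint and pattern below are illustrative.

def subsquare(grid, i, j, size):
    return tuple(tuple(row[j:j + size]) for row in grid[i:i + size])

def acceptable(grid, C, r):
    size = 2 * r + 1
    return all(subsquare(grid, i, j, size) in C
               for i in range(len(grid) - size + 1)
               for j in range(len(grid[0]) - size + 1))

# Radius 0 (1x1 "squares"): the constraint whose only allowed pattern is 0.
C = {((0,),)}
print(acceptable([[0, 0], [0, 0]], C, 0))
print(acceptable([[0, 1], [0, 0]], C, 0))
```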
Proof  The idea is simple: if one knows the bits present in a border of width r of a pattern of size n, there are finitely many possibilities to fill the interior, so the first one in any computable order (for instance, lexicographically, reading all horizontal lines one after the other) has at most the complexity of the border, since an algorithm can enumerate all possible fillings and take the first one in the chosen order. Then, if one knows the bits in the borders of width r of all central square patterns of size $2^n$ for all positive integers n (Fig. 2), one can recursively compute for each gap the first possible filling in a given computable order. The tiling obtained in this way (this actually defines a tiling since all cells are eventually assigned a value) has the required complexity: in order to compute the central pattern of size n,
Algorithmic Complexity and Cellular Automata, Figure 2  Nested squares of size $2^k$ and border width r
the algorithm simply needs the borders of the nested squares of size $2^k$ with $2^k \leq 2n$, whose total length is at most O(n).

The next result proves that this bound is almost reached.

Theorem 8  Let r be any computable, monotone, and unbounded function. There exists a local constraint $C_r$ such that, for all tilings acceptable for $C_r$, the complexity of the central square pattern of size n is $\Omega(n / r(n))$.

The original statement does not have the asymptotic part. However, note that the simulation of Wang tiles by a local constraint uses a square of size $\ell = \lceil \sqrt{\log k} \rceil$ to simulate a single tile from a set of k Wang tiles; a square pattern of size n in the tiling then corresponds to a square pattern of size $n\ell$ in the original tiling. The function r can grow very slowly (for instance, like the inverse of the Ackermann function) provided it grows monotonically to infinity and is computable.

Proof  The proof is rather technical, and we only give a sketch. The basic idea consists in taking the tiling constructed by Robinson [16] to prove that it is undecidable whether a local constraint can tile the plane. This tiling is self-similar and is represented in Fig. 3. As it contains increasingly large squares that occur periodically (note that the whole tiling is not periodic but only quasiperiodic), one can perform more and more computation steps within these squares (the periodicity is required to ensure that squares are present).
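The greedy filling step in the proof of Theorem 7 can be sketched directly: enumerate the interior assignments in lexicographic order and keep the first one passing the constraint. Everything here (the 3 × 3 square and the toy constraint requiring horizontally adjacent cells to be equal) is illustrative:

```python
# Sketch of the greedy filling in the proof of Theorem 7: given the border of
# a small square, take the lexicographically first interior assignment that
# the constraint accepts. Square size and constraint are illustrative.

from itertools import product

def first_filling(border_grid, interior_cells, ok):
    """border_grid: grid with None at interior_cells; ok(grid) tests acceptability."""
    for bits in product([0, 1], repeat=len(interior_cells)):  # lexicographic order
        grid = [row[:] for row in border_grid]
        for (i, j), b in zip(interior_cells, bits):
            grid[i][j] = b
        if ok(grid):
            return grid
    return None  # no acceptable filling exists

# Toy constraint: horizontally adjacent cells must be equal.
ok = lambda g: all(g[i][j] == g[i][j + 1] for i in range(3) for j in range(2))
border = [[1, 1, 1], [0, None, 0], [1, 1, 1]]
print(first_filling(border, [(1, 1)], ok))
```

Because the chosen filling is the *first* acceptable one in a computable order, its description cost is that of the border plus a constant, which is the heart of the O(n) bound.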
Algorithmic Complexity and Cellular Automata, Figure 3 Robinson’s tiling
Using this tiling, Robinson builds a local constraint that simulates any Turing machine. Note that in the present case the situation is trickier than it seems, since some technical features must be ensured, such as the fact that a square must deal with the smaller squares inside it, or the constraint added to make sure that smaller squares
have the same input as the bigger one. Using the constraint to forbid the occurrence of any final state, one gets that the compatible tilings only simulate computations on inputs for which the machine does not halt. To finish the proof, Durand et al. build a local constraint C that simulates a special Turing machine that halts exactly on inputs of small complexity. Such a Turing machine exists since, although Kolmogorov complexity is not computable, running all programs from $\varepsilon$ up to $1^n$ allows a Turing machine to detect words of complexity below n and to halt if its input is one of them. Then all tilings compatible with C contain in each square an input on which the Turing machine does not halt, hence an input of high complexity. The function r occurs in the technical arguments because the computation zones do not grow as fast as the sides of the squares.

The Case of Cellular Automata  As we have seen so far, one of the results of [10] is that a tile set always produces tilings whose central square patterns of size n have Kolmogorov complexity O(n), and not $n^2$ (the maximal complexity). In the case of CA, something similar holds for space-time diagrams. Indeed, if one knows the initial row, one can compute the triangular part of the space-time diagram that depends on it (see Fig. 1). Then, as with tilings, the complexity of an $n \times n$ square is the same as that of its first line, i.e., O(n). However, unlike tilings, CA impose no restriction on the initial configuration, so this O(n) bound holds trivially for every CA, and Kolmogorov complexity is not of great help here. One idea to improve the results is to study how the complexity of configurations evolves under iteration of the global rule. This aspect is particularly interesting with respect to dynamical properties; it is the subject of the next section.

Consider a CA f with radius r and local rule $\delta$. Its orbit limit set cannot be empty. Indeed, let a be a state of f. Consider the uniform configuration $^\omega a^\omega$ and let $s_a = \delta(a, \dots, a)$. Then $f(^\omega a^\omega) = {}^\omega s_a{}^\omega$. Consider now the graph whose vertices are the states of the CA and whose edges are the pairs $(a, s_a)$. Since each vertex has an outgoing edge (actually exactly one), the graph must contain a cycle $a_0 \to a_1 \to \dots \to a_k \to a_0$. Then each of the configurations $^\omega a_i{}^\omega$ for $0 \leq i \leq k$ is in the orbit limit set, since $f^{k+1}(^\omega a_i{}^\omega) = {}^\omega a_i{}^\omega$. This simple fact proves that any orbit limit set (and any limit set) of a CA must contain at least one monochromatic configuration, whose complexity is low (for any reasonable definition). However, one can build a CA whose orbit limit set contains only complex configurations, except
for the mandatory monochromatic one, using the local-constraints technique discussed in the previous section.

Proposition 5  There exists a CA whose orbit limit set contains only complex configurations.

Proof  Let r be any computable, monotone, and unbounded function and $C_r$ the associated local constraint. Let A be the alphabet over which $C_r$ is defined. Consider the 2-D CA $f_r$ on the alphabet $A \cup \{\#\}$ (we assume $\# \notin A$), with radius r (the same radius as $C_r$). The local rule $\delta$ of $f_r$ is defined as follows:
$$\delta(P) = \begin{cases} P_0 & \text{if } P \in C_r \, , \\ \# & \text{otherwise} \, , \end{cases}$$
where $P_0$ denotes the central cell of the pattern P. Using this local rule, one can verify the following fact: if the configuration c is not acceptable for $C_r$, then $(O(c))' = \{^\omega \#^\omega\}$; otherwise $(O(c))' = \{c\}$. Indeed, if c is acceptable for $C_r$, then $f_r(c) = c$; otherwise there is a position i such that the pattern around c(i) is not valid for $C_r$, and then $f_r(c)(i) = \#$. By simple induction, all cells at distance less than $kr$ from position i become $\#$ after k steps. Hence, for all $n > 0$, after $k \geq (n + 2|i|)/r$ steps the Cantor distance between $f_r^k(c)$ and $^\omega \#^\omega$ is less than $2^{-n}$, i.e., the orbit of c tends to $^\omega \#^\omega$.

Measuring the Complexity of CA Dynamics  The results of the previous section have limited range, since they say something about the quasi-complexity but nothing about the plain complexity of the limit set. To extend our study we need to introduce some general concepts, namely randomness spaces.

Randomness Spaces  Roughly speaking, a randomness space is a structure made of a topological space and a measure that together help in defining which points of the space are random. More formally, we can give the following.

Definition 8 ([6])  A randomness space is a structure $\langle X, B, \mu\rangle$ where X is a topological space, $B : \mathbb{N} \to 2^X$ a total numbering of a subbase of X, and $\mu$ a Borel measure. Given a numbering B of a subbase of X, one can produce a numbering $B'$ of a base as follows:
$$B'(i) = \bigcap_{j \in D(i+1)} B(j) \, ,$$
where $D : \mathbb{N} \to \{E \mid E \subseteq \mathbb{N} \text{ and } E \text{ is finite}\}$ is the bijection defined by $D^{-1}(E) = \sum_{i \in E} 2^i$. $B'$ is called the base derived from the subbase B. Given two sequences of open sets $(V_n)$
and $(U_n)$ of X, we say that $(V_n)$ is U-computable if there exists a recursively enumerable set $H \subseteq \mathbb{N}$ such that
$$\forall n \in \mathbb{N}, \quad V_n = \bigcup_{i \in \mathbb{N},\, \langle n, i\rangle \in H} U_i \, ,$$
where $\langle i, j\rangle = \frac{(i + j)(i + j + 1)}{2} + j$ is the classical bijection between $\mathbb{N}^2$ and $\mathbb{N}$. Note that this bijection can be extended to $\mathbb{N}^D$ (for $D > 1$) as follows: $\langle x_1, x_2, \dots, x_k\rangle = \langle x_1, \langle x_2, \dots, x_k\rangle\rangle$.

Definition 9 ([6])  Given a randomness space $\langle X, B, \mu\rangle$, a randomness test on X is a $B'$-computable sequence $(U_n)$ of open sets such that $\forall n \in \mathbb{N}, \mu(U_n) \leq 2^{-n}$. Given a randomness test $(U_n)$, a point $x \in X$ is said to pass the test $(U_n)$ if $x \in \bigcap_{n \in \mathbb{N}} U_n$.

In other words, tests select points belonging to sets of null measure. The computability of the tests ensures the computability of the selected null-measure sets.

Definition 10 ([6])  Given a randomness space $\langle X, B, \mu\rangle$, a point $x \in X$ is nonrandom if x passes some randomness test. The point $x \in X$ is random if it is not nonrandom.

Finally, note that for any $D \geq 1$, $\langle A^{\mathbb{Z}^D}, B, \mu\rangle$ is a randomness space when setting B as
$$B(j + |A| \cdot \langle i_1, \dots, i_D\rangle) = \big\{c \in A^{\mathbb{Z}^D} : c_{(i_1, \dots, i_D)} = a_j\big\} \, ,$$
where $A = \{a_1, \dots, a_j, \dots, a_{|A|}\}$ and $\mu$ is the classical product measure built from the uniform Bernoulli measure over A.

Theorem 9 ([6])  Consider a D-dimensional CA f. Then the following statements are equivalent:
1. f is surjective;
2. $\forall c \in A^{\mathbb{Z}^D}$, if c is rich (i.e., c contains all possible finite patterns), then f(c) is rich;
3. $\forall c \in A^{\mathbb{Z}^D}$, if c is random, then f(c) is random.

Theorem 10 ([6])  Consider a D-dimensional CA f. Then $\forall c \in A^{\mathbb{Z}^D}$, if c is not rich, then f(c) is not rich.

Theorem 11 ([6])  Consider a 1-D CA f. Then $\forall c \in A^{\mathbb{Z}}$, if c is nonrandom, then f(c) is nonrandom.
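The Cantor pairing bijection $\langle i, j\rangle = (i+j)(i+j+1)/2 + j$ used above, together with its nested extension to tuples, can be checked directly in Python:

```python
# Sketch: the Cantor pairing bijection between N^2 and N, and its extension
# to longer tuples by nesting, as used in the numbering of randomness tests.

def pair(i, j):
    """<i, j> = (i + j)(i + j + 1)/2 + j."""
    return (i + j) * (i + j + 1) // 2 + j

def pair_many(*xs):
    """<x1, x2, ..., xk> = <x1, <x2, ..., xk>>."""
    return xs[0] if len(xs) == 1 else pair(xs[0], pair_many(*xs[1:]))

print(pair(0, 0), pair(1, 0), pair(0, 1))
```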
Note that the result in Theorem 11 is proved only for 1-D CA; its generalization to higher dimensions is still an open problem. Open problem 1 Do D-dimensional CA for D > 1 preserve nonrandomness? From Theorem 9 we know that the property of preserving randomness (resp. richness) is related to surjectivity.
Hence, randomness (resp. richness) preserving is decidable in one dimension and undecidable in higher dimensions. The opposite relations are still open.

Open problem 2  Is nonrichness (resp. nonrandomness) a decidable property?

Algorithmic Distance  In this section we review an approach to the study of CA dynamics from the point of view of algorithmic complexity that is completely different from the one reported in the previous section. For more details on the algorithmic distance see Chaotic Behavior of Cellular Automata. In this new approach we define a new distance using Kolmogorov complexity, in such a way that two points x and y are near if it is "easy" to transform x into y, or vice versa, using a computer program. In this way, if a CA turned out to be sensitive to initial conditions with respect to this distance, it would mean that it is able to create new information. Indeed, we will see that this is not the case.

Definition 11  The algorithmic distance between $x \in A^{\mathbb{Z}^D}$ and $y \in A^{\mathbb{Z}^D}$ is defined as
$$d_a(x, y) = \limsup_{n \to \infty} \frac{K(x_{\to n} \mid y_{\to n}) + K(y_{\to n} \mid x_{\to n})}{2(2n + 1)^D} \, .$$

It is not difficult to see that $d_a$ is only a pseudodistance, since many pairs of points are at null distance (those that differ only on a finite number of cells, for example). Consider the relation $\cong$ of "being at null $d_a$ distance," i.e., $\forall x, y \in A^{\mathbb{Z}^D}$, $x \cong y$ if and only if $d_a(x, y) = 0$. Then $\langle A^{\mathbb{Z}^D}\!/\!\cong, d_a\rangle$ is a metric space. Note that the definition of $d_a$ does not depend on the chosen additively optimal universal description mode, since the additive constant disappears when dividing by $2(2n+1)^D$ and taking the superior limit. Moreover, by Theorem 2, the distance is bounded by 1. The following results summarize the main properties of this new metric space.

Theorem 12 (Chaotic Behavior of Cellular Automata)  The metric space $\langle A^{\mathbb{Z}^D}\!/\!\cong, d_a\rangle$ is perfect, pathwise connected, infinite dimensional, nonseparable, and noncompact.

Theorem 12 says that the new topological space has enough interesting properties to make the study of CA dynamics on it worthwhile. The first interesting result obtained by this approach concerns surjective CA. Recall that surjectivity plays a central role in the study of chaotic behavior, since it is a necessary condition for many other properties used to define deterministic chaos, such as expansivity, transitivity, ergodicity, and so on (see Topological Dynamics of Cellular Automata and [8] for more on this subject). The following result proves that in the new topology $A^{\mathbb{Z}^D}\!/\!\cong$ the situation is completely different.

Proposition 6 (Chaotic Behavior of Cellular Automata)  If f is a surjective CA, then $d_a(x, f(x)) = 0$ for any $x \in A^{\mathbb{Z}^D}\!/\!\cong$. In other words, every CA behaves like the identity in $A^{\mathbb{Z}^D}\!/\!\cong$.

Proof  In order to compute $f(x)_{\to n}$ from $x_{\to n}$, one need only know the index of f in the set of CA with radius r, state set S, and dimension D; therefore
$$K(f(x)_{\to n} \mid x_{\to n}) \leq 2Dr(2n + 2r + 1)^{D-1} \log_2 |S| + K(f) + 2 \log_2 K(f) + c \, ,$$
and similarly
$$K(x_{\to n} \mid f(x)_{\to n}) \leq 2Dr(2n + 1)^{D-1} \log_2 |S| + K(f) + 2 \log_2 K(f) + c \, .$$
Dividing by $2(2n+1)^D$ and taking the superior limit, one finds that $d_a(x, f(x)) = 0$.

The result of Proposition 6 means that surjective CA can neither create new information nor destroy it. Hence, they have a high degree of stability from an algorithmic point of view. This contrasts with what happens in the Cantor topology. We conclude that the classical notion of deterministic chaos is orthogonal to "algorithmic chaos," at least in what concerns the class of CA.

Proposition 7 (Chaotic Behavior of Cellular Automata)  Consider a CA f that is neither surjective nor constant. Then there exist two configurations $x, y \in A^{\mathbb{Z}^D}\!/\!\cong$ such that $d_a(x, y) = 0$ but $d_a(f(x), f(y)) \neq 0$.

In other words, Proposition 7 says that nonsurjective, nonconstant CA are not compatible with $\cong$ and hence not continuous. This means that, for some pairs of configurations x, y at null distance, this kind of CA either completely destroys the information content of both x and y, or preserves it in one, say x, and destroys it in y.
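A crude, purely heuristic stand-in for $d_a$ replaces Kolmogorov complexity with compressed size (a common approximation; this is not the article's definition, and zlib is only a weak proxy for K). The following Python sketch estimates the distance on equal-length windows:

```python
# Sketch: a compression-based stand-in for the algorithmic distance d_a,
# with zlib-compressed size as a heuristic proxy for Kolmogorov complexity.

import zlib

def cond_size(a, b):
    """Crude proxy for K(a|b): size of compressing b+a minus that of b alone."""
    return max(0, len(zlib.compress(b + a)) - len(zlib.compress(b)))

def da_estimate(x, y):
    """(K(x|y) + K(y|x)) / (2 * window length), on equal-length byte windows."""
    assert len(x) == len(y)
    return (cond_size(x, y) + cond_size(y, x)) / (2 * len(x))

x = bytes(1000)                          # a highly regular window
y = bytes(i % 256 for i in range(1000))  # a different, still regular window
print(round(da_estimate(x, x), 4), round(da_estimate(x, y), 4))
```

For genuinely regular windows both estimates are tiny, mirroring the fact that simple configurations cluster near one another in the $d_a$ pseudometric.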
However, the following result says that some weak form of continuity still persists (see Topological Dynamics of Cellular Automata and [8] for the definitions of equicontinuity point and sensitivity to initial conditions).

Proposition 8 (Chaotic Behavior of Cellular Automata)  Consider a CA f and let $^\omega a^\omega$ be the configuration with all cells in state a. Then $^\omega a^\omega$ is both a fixed point and an equicontinuity point for f.
Even though CA are not continuous on $A^{\mathbb{Z}^D}\!/\!\cong$, one can still wonder what happens with respect to the usual properties used to define deterministic chaos. For instance, by Proposition 8, it is clear that no CA is sensitive to initial conditions. The following question is still open.
Open problem 3  Is $^\omega a^\omega$ the only equicontinuity point for CA on $A^{\mathbb{Z}^D}\!/\!\cong$?

Future Directions

In this paper we have illustrated how algorithmic complexity can help in the study of CA dynamics. We essentially used it as a powerful tool to decrease the combinatorial complexity of problems. These kinds of applications are only at their beginning, and many more are expected in the future. For example, in view of the results of Subsect. "Example 1", we wonder whether Kolmogorov complexity can help in proving the famous conjecture that languages recognizable in real time by CA are a strict subclass of linear-time recognizable languages (see [9,14]). Another, completely different, development would consist in finding how and whether Theorem 11 extends to higher dimensions. How this property can be restated in the context of the algorithmic distance is also of great interest. Finally, how to extend the results obtained for CA to other dynamical systems is a research direction that must be explored. We are rather confident that this can shed new light on the complexity behavior of such systems.

Acknowledgments

This work has been partially supported by the ANR Blanc Project "Sycomore".

Bibliography

Primary Literature

1. Blanchard F, Formenti E, Kurka P (1999) Cellular automata in the Cantor, Besicovitch and Weyl topological spaces. Complex Syst 11:107–123
2. Blanchard F, Cervelle J, Formenti E (2003) Periodicity and transitivity for cellular automata in Besicovitch topologies. In: Rovan B, Vojtas P (eds) MFCS 2003, vol 2747. Springer, Bratislava, pp 228–238
3. Blanchard F, Cervelle J, Formenti E (2005) Some results about chaotic behavior of cellular automata. Theor Comput Sci 349(3):318–336
4. Brudno AA (1978) The complexity of the trajectories of a dynamical system. Russ Math Surv 33(1):197–198
5. Brudno AA (1983) Entropy and the complexity of the trajectories of a dynamical system. Trans Moscow Math Soc 44:127
6. Calude CS, Hertling P, Jürgensen H, Weihrauch K (2001)
Randomness on full shift spaces. Chaos Solitons Fractals 12(3):491–503
7. Cattaneo G, Formenti E, Margara L, Mazoyer J (1997) A shift-invariant metric on S^Z inducing a non-trivial topology. In: Privara I, Ruzicka P (eds) MFCS'97, vol 1295. Springer, Bratislava, pp 179–188
8. Cervelle J, Durand B, Formenti E (2001) Algorithmic information theory and cellular automata dynamics. In: Mathematical Foundations of Computer Science (MFCS'01). Lecture Notes in Computer Science, vol 2136. Springer, Berlin, pp 248–259
9. Delorme M, Mazoyer J (1999) Cellular automata as languages recognizers. In: Cellular automata: A parallel model. Kluwer, Dordrecht
10. Durand B, Levin L, Shen A (2001) Complex tilings. In: STOC '01: Proceedings of the 33rd annual ACM symposium on theory of computing, pp 732–739
11. Kari J (1994) Rice's theorem for the limit set of cellular automata. Theor Comput Sci 127(2):229–254
12. Kůrka P (1997) Languages, equicontinuity and attractors in cellular automata. Ergod Theory Dyn Syst 17:417–433
13. Li M, Vitányi P (1997) An introduction to Kolmogorov complexity and its applications, 2nd edn. Springer, Berlin
14. M Delorme EF, Mazoyer J (2000) Open problems. Research Report LIP 2000-25, Ecole Normale Supérieure de Lyon 15. Pivato M (2005) Cellular automata vs. quasisturmian systems. Ergod Theory Dyn Syst 25(5):1583–1632 16. Robinson RM (1971) Undecidability and nonperiodicity for tilings of the plane. Invent Math 12(3):177–209 17. Terrier V (1996) Language not recognizable in real time by oneway cellular automata. Theor Comput Sci 156:283–287 18. Wolfram S (2002) A new kind of science. Wolfram Media, Champaign. http://www.wolframscience.com/
Books and Reviews
Batterman RW, White HS (1996) Chaos and algorithmic complexity. Found Phys 26(3):307–336
Bennett CH, Gács P, Li M, Vitányi P, Zurek W (1998) Information distance. IEEE Trans Inf Theory 44(4):1407–1423
Calude CS (2002) Information and randomness. Texts in theoretical computer science, 2nd edn. Springer, Berlin
Cover TM, Thomas JA (2006) Elements of information theory, 2nd edn. Wiley, New York
White HS (1993) Algorithmic complexity of points in dynamical systems. Ergod Theory Dyn Syst 13:807–830
Amorphous Computing
HAL ABELSON, JACOB BEAL, GERALD JAY SUSSMAN
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, USA

Article Outline

Glossary
Definition of the Subject
Introduction
The Amorphous Computing Model
Programming Amorphous Systems
Amorphous Computing Paradigms
Primitives for Amorphous Computing
Means of Combination and Abstraction
Supporting Infrastructure and Services
Lessons for Engineering
Future Directions
Bibliography

Glossary

Amorphous computer A collection of computational particles dispersed irregularly on a surface or throughout a volume, where individual particles have no a priori knowledge of their positions or orientations.

Computational particle A (possibly faulty) individual device for an amorphous computer. Each particle has modest computing power and a modest amount of memory. The particles are not synchronized, although they are all capable of operating at similar speeds, since they are fabricated by the same process. All particles are programmed identically, although each particle has means for storing local state and for generating random numbers.

Field A function assigning a value to every particle in an amorphous computer.

Gradient A basic amorphous computing primitive that estimates the distance from each particle to the nearest particle designated as a source of the gradient.

Definition of the Subject

The goal of amorphous computing is to identify organizational principles and create programming technologies for obtaining intentional, pre-specified behavior from the cooperation of myriad unreliable parts that are arranged in unknown, irregular, and time-varying ways. The heightened relevance of amorphous computing today stems from the emergence of new technologies that could serve as substrates for information processing systems of
immense power at unprecedentedly low cost, if only we could master the challenge of programming them.

Introduction

Even as the foundations of computer science were being laid, researchers could hardly help noticing the contrast between the robustness of natural organisms and the fragility of the new computing devices. As John von Neumann remarked in 1948 [53]:

    With our artificial automata we are moving much more in the dark than nature appears to be with its organisms. We are, and apparently, at least at present, have to be much more 'scared' by the occurrence of an isolated error and by the malfunction which must be behind it. Our behavior is clearly that of overcaution, generated by ignorance.

Amorphous computing emerged as a field in the mid-1990s, from the convergence of three factors:

1. Inspiration from the cellular automata models for fundamental physics [13,34].
2. Hope that understanding the robustness of biological development could both help overcome the brittleness typical of computer systems and also illuminate the mechanisms of developmental biology.
3. The prospect of nearly free computers in vast quantities.
Microfabrication

One technology that has come to fruition over the past decade is micro-mechanical electronic component manufacture, which integrates logic circuits, micro-sensors, actuators, and communication on a single chip. Aggregates of these can be manufactured extremely inexpensively, provided that not all the chips need work correctly, and that there is no need to arrange the chips into precise geometrical configurations or to establish precise interconnections among them. A decade ago, researchers envisioned "smart dust" elements small enough to be borne on air currents to form clouds of communicating sensor particles [26]. Airborne sensor clouds are still a dream, but networks of millimeter-scale particles are now commercially available for environmental monitoring applications [40]. With low enough manufacturing costs, we could mix such particles into bulk materials to form coatings like "smart paint" that can sense data and communicate their actions to the outside world. A smart paint coating on a wall could sense
vibrations, monitor the premises for intruders, or cancel noise. Bridges or buildings coated with smart paint could report on traffic and wind loads and monitor structural integrity. If the particles have actuators, then the paint could even heal small cracks by shifting the material around. Making the particles mobile opens up entirely new classes of applications that are beginning to be explored by research in swarm robotics [35] and modular robotics [49].
Cellular Engineering

The second disruptive technology that motivates the study of amorphous computing is microbiology. Biological organisms have served as motivating metaphors for computing since the days of calculating engines, but it is only over the past decade that we have begun to see how biology could literally be a substrate for computing, through the possibility of constructing digital-logic circuits within individual living cells. In one such technology, logic signals are represented not by electrical voltages and currents, but by concentrations of DNA binding proteins, and logic elements are realized as binding sites where proteins interact through promotion and repression. As a simple example, if A and B are proteins whose concentrations represent logic levels, then an "inverter" can be implemented in DNA as a genetic unit through which A serves as a repressor that blocks the production of B [29,54,55,61]. Since cells can reproduce themselves and obtain energy from their environment, the resulting information processing units could be manufactured in bulk at very low cost.

A technology of cellular engineering is beginning to take shape that can tailor-make programmable cells to function as sensors or delivery vehicles for pharmaceuticals, or even as chemical factories for the assembly of nanoscale structures. Researchers in this emerging field of synthetic biology are starting to assemble registries of standard logical components implemented in DNA that can be inserted into E. coli [48]. The components have been engineered to permit standard means of combining them, so that biological logic designers can assemble circuits in a mix-and-match way, similar to how electrical logic designers create circuits from standard TTL parts. There is even an International Genetically Engineered Machine Competition, where student teams from universities around the world compete to create novel biological devices from parts in the registry [24].

Either of these technologies—microfabricated particles or engineered cells—provides a path to cheaply fabricate aggregates of massive numbers of computing elements. But harnessing these for computing is quite a different matter, because the aggregates are unstructured. Digital computers have always been constructed to behave as precise arrangements of reliable parts, and almost all techniques for organizing computations depend upon this precision and reliability. Amorphous computing seeks to discover new programming methods that do not require precise control over the interaction or arrangement of the individual computing elements, and to instantiate these techniques in new programming languages.

The Amorphous Computing Model
Amorphous computing models the salient features of an unstructured aggregate through the notion of an amorphous computer, a collection of computational particles dispersed irregularly on a surface or throughout a volume, where individual particles have no a priori knowledge of their positions or orientations. The particles are possibly faulty, may contain sensors and effect actions, and in some applications might be mobile. Each particle has modest computing power and a modest amount of memory. The particles are not synchronized, although they are all capable of operating at similar speeds, since they are fabricated by the same process. All particles are programmed identically, although each particle has means for storing local state and for generating random numbers. There may also be several distinguished particles that have been initialized to particular states.

Each particle can communicate with a few nearby neighbors. In an electronic amorphous computer the particles might communicate via short-distance radio, whereas bioengineered cells might communicate by chemical signals. Although the details of the communication model can vary, the maximum distance over which two particles can communicate effectively is assumed to be small compared with the size of the entire amorphous computer. Communication is assumed to be unreliable, and a sender has no assurance that a message has been received. (Higher-level protocols with message acknowledgement can be built on such unreliable channels.)

We assume that the number of particles may be very large (on the order of 10^6 to 10^12). Algorithms appropriate to run on an amorphous computer should be relatively independent of the number of particles: the performance should degrade gracefully as the number of particles decreases.
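The model just described can be made concrete with a small simulation: particles scattered at random over the unit square, with a communication link between any two particles closer than a small fixed radius. This is only an illustrative sketch; the function name, the radius value, and the use of synchronous data structures are all inventions for this example, not part of any published amorphous-computing system.

```python
import random

def make_amorphous_computer(n, radius, seed=0):
    """Scatter n identically-programmed particles over the unit square and
    link every pair whose distance is at most the communication radius.
    The radius is assumed small compared with the whole aggregate."""
    rng = random.Random(seed)
    positions = [(rng.random(), rng.random()) for _ in range(n)]
    neighbors = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            dx = positions[i][0] - positions[j][0]
            dy = positions[i][1] - positions[j][1]
            if dx * dx + dy * dy <= radius * radius:
                # Links are symmetric; the model's message loss is ignored here.
                neighbors[i].append(j)
                neighbors[j].append(i)
    return positions, neighbors
```

With, say, 500 particles and a radius of 0.1, each particle typically sees on the order of a dozen neighbors, so the communication range is indeed small relative to the aggregate, as the model assumes.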
Thus, the entire amorphous computer can be regarded as a massively parallel computing system, and previous investigations into massively parallel computing, such as research in cellular automata, are one source of ideas for dealing with amorphous computers. However, amorphous computing differs from investigations into cellular automata, because amorphous mechanisms must be
independent of the detailed configuration, reliability, and synchronization of the particles.

Programming Amorphous Systems

A central theme in amorphous computing is the search for programming paradigms that work within the amorphous model. Here, biology has been a rich source of metaphors for inspiring new programming techniques. In embryonic development, even though the precise arrangements and numbers of the individual cells are highly variable, the genetic "programs" coded in DNA nevertheless produce well-defined intricate shapes and precise forms. Amorphous computers should be able to achieve similar results.

One technique for programming an amorphous computer uses diffusion. One particle (chosen by some symmetry-breaking process) broadcasts a message. This message is received by each of its neighbors, which propagate it to their neighbors, and so on, to create a wave that spreads throughout the system. The message contains a count, and each particle stores the received count and increments it before re-broadcasting. Once a particle has stored its count, it stops re-broadcasting and ignores future count messages. This count-up wave gives each particle a rough measure of its distance from the original source. One can also produce regions of controlled size, by having the count message relayed only if the count is below a designated bound.

Two such count-up waves can be combined to identify a chain of particles between two given particles A and B. Particle A begins by generating a count-up wave as above. This time, however, each intermediate particle, when it receives its count, performs a handshake to identify its "predecessor"—the particle from which it received the count (and whose own count will therefore be one less). When the wave of count messages reaches B, B sends a "successor" message, informing its predecessor that it should become part of the chain and should send a message to its predecessor, and so on, all the way back to A.
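The count-up wave and the chain construction can be sketched in a few lines of Python. This is a simplification: a synchronous worklist stands in for the model's unsynchronized, unreliable broadcasts, and all names (`count_up_wave`, `chain`, the graph encoding) are invented for the sketch.

```python
from collections import deque

def count_up_wave(neighbors, source, bound=None):
    """Spread a count outward from `source`.  Each particle stores the first
    count it hears and relays count+1; later count messages are ignored.
    With `bound`, counts above the bound are not relayed, carving out a
    region of controlled size.  Returns {particle: stored count}."""
    counts = {source: 0}
    queue = deque([source])
    while queue:
        p = queue.popleft()
        c = counts[p] + 1
        if bound is not None and c > bound:
            continue                      # count too large: stop relaying
        for q in neighbors[p]:
            if q not in counts:           # already-counted particles ignore it
                counts[q] = c
                queue.append(q)
    return counts

def chain(neighbors, a, b):
    """Identify a chain of particles from a to b: launch a count-up wave
    from a, then walk back from b along predecessors (a neighbor whose
    count is one less), mimicking the successor-message handshake."""
    counts = count_up_wave(neighbors, a)
    if b not in counts:
        return None                       # no path from a to b
    path = [b]
    while path[-1] != a:
        p = path[-1]
        path.append(min(neighbors[p], key=lambda q: counts.get(q, float("inf"))))
    return path[::-1]
```

On a simple line of four particles 0–1–2–3, a wave from particle 0 stores counts 0, 1, 2, 3, and the chain from 0 to 3 recovers the particles in order, as the text describes.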
Note that this method works even though the particles are irregularly distributed, provided there is a path from A to B. The motivating metaphor for these two programs is chemical gradient diffusion, which is a foundational mechanism in biological development [60].

In nature, biological mechanisms not only generate elaborate forms, they can also maintain forms and repair them. We can modify the above amorphous line-drawing program so that it produces a self-repairing line: first, particles keep rebroadcasting their count and successor messages; second, the status of a particle as having a count or being in the chain decays over time unless it is refreshed by new messages. That is, a particle that stops hearing successor messages intended for it will eventually revert to not being in the chain and will stop broadcasting its own successor messages. A particle that stops hearing its count being broadcast will start acting as if it never had a count, pick up a new count from the messages it hears, and start broadcasting count messages with the new count. Clement and Nagpal [12] demonstrated that this mechanism can be used to generate self-repairing lines and other patterns, and even to re-route lines and patterns around "dead" regions where particles have stopped functioning.

The relationship with biology flows in the other direction as well: the amorphous algorithm for repair is a model that is not obviously inconsistent with the facts of angiogenesis in the repair of wounds. Although the existence of the algorithm has no bearing on the facts of the matter, it may stimulate systems-level thinking about models in biological research. For example, Patel et al. use amorphous ideas to analyze the growth of epithelial cells [43].

Amorphous Computing Paradigms

Amorphous computing is still in its infancy. Most of the linguistic investigations based on the amorphous computing model have been carried out in simulation. Nevertheless, this work has yielded a rich variety of programming paradigms demonstrating that one can in fact achieve robustness in the face of the unreliability of individual particles and the absence of precise organization among them.

Marker Propagation for Amorphous Particles

Weiss's Microbial Colony Language [55] is a marker propagation language for programming the particles in an amorphous computer. The program to be executed, which is the same for each particle, is constructed as a set of rules. The state of each particle includes a set of binary markers, and rules are enabled by boolean combinations of the markers.
The rules, which have the form (trigger, condition, action), are triggered by the receipt of labelled messages from neighboring particles. A rule may test conditions, set or clear various markers, and broadcast further messages to its neighbors. Each message carries a count that determines how far it will diffuse, and each marker has a lifetime that determines how long its value lasts. Supporting the language's rules is a runtime system that automatically propagates messages and manages the lifetimes of markers, so that the programmer need not deal with these operations explicitly.

Weiss's system is powerful, but the level of abstraction is very low. This is because it was motivated by cellular engineering—as something that can be directly implemented by genetic regulatory networks. The language is therefore more useful as a tool set in which to implement higher-level languages such as GPL (see below), serving as a demonstration that, in principle, these higher-level languages can be implemented by genetic regulatory networks as well. Figure 1 shows an example simulation programmed in this language, which organizes an initially undifferentiated column of particles into a structure with bands of two alternating colors: a caricature of somites in a developing vertebrate.

Amorphous Computing, Figure 1 A Microbial Colony Language program organizes a tube into a structure similar to that of somites in the developing vertebrate (from [55])

Amorphous Computing, Figure 2 A pattern generated by GPL whose shape mimics a chain of CMOS inverters (from [15])

The Growing Point Language

Coore's Growing Point Language (GPL) [15] demonstrates that an amorphous computer can be configured by a program that is common to all the computing elements to generate highly complex patterns, such as the pattern representing the interconnection structure of an arbitrary electrical circuit as shown in Fig. 2. GPL is inspired by a botanical metaphor based on growing points and tropisms. A growing point is a locus of activity in an amorphous computer. A growing point propagates through the computer by transferring its activity from one computing element to a neighbor.
As a growing point passes through the computer, it effects the differentiation of the behaviors of the particles it visits. Particles secrete "chemical" signals whose count-up waves define gradients, and these attract or repel growing points as directed by programmer-specified "tropisms". Coore demonstrated that these mechanisms are sufficient to permit amorphous computers to generate any arbitrary prespecified graph-structure pattern, up to topology. Unlike real biology, however, once a pattern has been constructed, there is no clear mechanism to maintain it in the face of changes to the material. Also, from a programming-linguistic point of view, there is no clear way to compose shapes by composing growing points. More recently, Gayle and Coore have shown how GPL may be extended to produce arbitrarily large patterns such as arbitrary text strings [21]. D'Hondt and D'Hondt have explored the use of GPL for geometrical constructions and its relations with computational geometry [18,19].

Origami-Based Self-Assembly

Amorphous Computing, Figure 3 Folding an envelope structure (from [38]). A pattern of lines is constructed according to origami axioms. Elements then coordinate to fold the sheet using an actuation model based on epithelial cell morphogenesis. In the figure, black indicates the front side of the sheet, grey indicates the back side, and the various colored bands show the folds and creases that are generated by the amorphous process. The small white spots show gaps in the sheet caused by "dead" or missing cells—the process works despite these

Nagpal [38] developed a prototype model for controlling programmable materials. She showed how to organize a program to direct an amorphous sheet of deformable particles to cooperate to construct a large family of globally-specified predetermined shapes. Her method, which is inspired by the folding of epithelial tissue, allows a programmer to specify a sequence of folds, where the set of available folds is sufficient to create any origami shape (as shown by Huzita's axioms for origami [23]). Figure 3 shows a sheet of amorphous particles, where particles can
cooperate to create creases and folds, so that the sheet assembles itself into the well-known origami "cup" structure. Nagpal showed how this language of folds can be compiled into a low-level program that can be distributed to all of the particles of the amorphous sheet, similar to Coore's GPL or Weiss's MCL. With a few differences of initial state (for example, particles at the edges of the sheet know that they are edge particles), the particles run their copies of the program, interact with their neighbors, and fold up to make the predetermined shape. This technique is quite robust. Nagpal studied the range of shapes that can be constructed using her method, and their sensitivity to errors of communication, random cell death, and density of the cells.

As a programming framework, the origami language has more structure than the growing point language, because the origami methods allow composition of shape constructions. On the other hand, once a shape has been constructed, there is no clear mechanism to maintain existing patterns in the face of changes to the material.

Dynamic Recruitment

In Butera's "paintable computing" [10], processes dynamically recruit computational particles from an amorphous computer to implement their goals. As one of his examples, Butera uses dynamic recruitment to implement a robust storage system for streaming audio and images (Fig. 4). Fragments of the image and audio stream circulate freely through the amorphous computer and are marshaled to a port when needed. The audio fragments also sort themselves into a playable stream as they migrate to the port. To enhance robustness, there are multiple copies of the lower-resolution image fragments and fewer copies of the higher-resolution image fragments. Thus, the image is hard to destroy; with lost fragments the image is degraded but not destroyed.
Amorphous Computing, Figure 4 Butera dynamically controls the flow of information through an amorphous computer. In a, image fragments spread through the computer so that a degraded copy can be recovered from any segment; the original image is on the left, the blurry copy on the right has been recovered from the small region shown below. In b, audio fragments sort themselves into a playable stream as they migrate to an output port; cooler colors are earlier times (from [10])
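The replication trick behind this robustness—more copies of coarse fragments than of fine ones—can be caricatured in a few lines. Everything here (the function names, the doubling replication schedule) is invented for the sketch and is not taken from the paintable-computing implementation.

```python
import random

def scatter_fragments(levels, n_particles, seed=0):
    """Place copies of image fragments on random particles.  Level 0 is the
    coarsest resolution and receives the most copies, so the coarse image
    is the last thing lost as particles fail.  Returns {particle: set of
    resolution levels held}."""
    rng = random.Random(seed)
    store = {p: set() for p in range(n_particles)}
    for level in range(levels):
        copies = 2 ** (levels - level)    # coarser level -> more copies
        for _ in range(copies):
            store[rng.randrange(n_particles)].add(level)
    return store

def surviving_levels(store, dead):
    """Resolution levels still recoverable after the particles in `dead` fail."""
    alive = set()
    for p, fragments in store.items():
        if p not in dead:
            alive |= fragments
    return alive
```

Killing a random half of the particles usually destroys some of the sparsely replicated fine levels while the heavily replicated coarse levels survive: the image degrades rather than disappears.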
Amorphous Computing, Figure 5 A tracking program written in Proto sends the location of a target region (orange) to a listener (red) along a channel (small red dots) in the network (indicated by green lines). The continuous space and time abstraction allows the same program to run at different resolutions
Clement and Nagpal [12] also use dynamic recruitment in the development of active gradients, as described below.

Growth and Regeneration

Kondacs [30] showed how to synthesize arbitrary two-dimensional shapes by growing them. These computing units, or "cells", are identically programmed and decentralized, with the ability to engage in only limited, local communication. Growth is implemented by allowing cells to multiply. Each of his cells may create a child cell and place it randomly within a ring around the mother cell. Cells may also die, a fact which Kondacs puts to use for temporary scaffolding when building complex shapes. If a structure requires construction of a narrow neck between two other structures, it can be built precisely by laying down a thick connection and later trimming it to size.

Attributes of this system include scalability, robustness, and the ability for self-repair. Just as a starfish can regenerate its entire body from part of a limb, his system can self-repair in the event of agent death: his sphere-network representation allows the structure to be grown starting from any sphere, and every cell contains all necessary information for reproducing the missing structure.

Abstraction to Continuous Space and Time

The amorphous model postulates computing particles distributed throughout a space. If the particles are dense, one can imagine the particles as actually filling the space, and create programming abstractions that view the space itself as the object being programmed, rather than the collection of particles. Beal and Bachrach [1,7] pursued this approach by creating a language, Proto, where programmers specify
the behavior of an amorphous computer as though it were a continuous material filling the space it occupies. Proto programs manipulate fields of values spanning the entire space. Programming primitives are designed to make it simple to compile global operations to operations at each point of the continuum. These operations are approximated by having each device represent a nearby chunk of space. Programs are specified in space and time units that are independent of the distribution of particles and of the particulars of communication and execution on those particles (Fig. 5). Programs are composed functionally, and many of the details of communication and composition are made implicit by Proto's runtime system, allowing complex programs to be expressed simply. Proto has been applied to applications in sensor networks like target tracking and threat avoidance, to swarm robotics, and to modular robotics, e.g., generating a planar wave for coordinated actuation.

Newton's language Regiment [41,42] also takes a continuous view of space and time. Regiment is organized in terms of stream operations, where each stream represents a time-varying quantity over a part of space, for example, the average value of the temperature over a disc of a given radius centered at a designated point. Regiment, also a functional language, is designed to gather streams of data from regions of the amorphous computer and accumulate them at a single point. This assumption allows Regiment to provide region-wide summary functions that are difficult to implement in Proto.
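The field abstraction can be caricatured in a few lines of Python. This is not Proto syntax; it is only the shape of the idea: a field maps each particle to a value, global operations are pointwise and therefore compile to purely local work, and the only non-local step is talking to immediate neighbors. The tiny hand-built "computer" and all function names are inventions for this sketch.

```python
# A toy "computer": particle positions and a neighbor relation.  In the
# amorphous model these would come from the physical distribution.
positions = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0)]
neighbors = {0: [1], 1: [0, 2], 2: [1]}

def field(f):
    """A field assigns a value to every particle, here computed from position."""
    return {p: f(x, y) for p, (x, y) in enumerate(positions)}

def fmap(g, *fields):
    """Pointwise combination of fields: a global operation that compiles to
    purely local work at each particle, in the spirit of Proto."""
    return {p: g(*(fld[p] for fld in fields)) for p in fields[0]}

def nbr_mean(fld):
    """Average a field over each particle's neighborhood -- the only kind of
    non-local step available: exchanging values with immediate neighbors."""
    return {p: (fld[p] + sum(fld[q] for q in neighbors[p])) / (1 + len(neighbors[p]))
            for p in fld}

temp = field(lambda x, y: 10 * x)   # a linearly varying "temperature" field
smoothed = nbr_mean(temp)
```

Because every operator is defined over the whole space but executed locally, the same program is meaningful regardless of how many particles happen to fill the space, which is the point of the continuous abstraction.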
Primitives for Amorphous Computing

The previous section illustrated some paradigms that have been developed for programming amorphous systems, each paradigm building on some organizing metaphor.
But eventually, meeting the challenge of amorphous systems will require a more comprehensive linguistic framework. We can approach the task of creating such a framework following the perspective in [59], which views languages in terms of primitives, means of combination, and means of abstraction.

The fact that amorphous computers consist of vast numbers of unreliable and unsynchronized particles, arranged in space in ways that are locally unknown, constrains the primitive mechanisms available for organizing cooperation among the particles. While amorphous computers are naturally massively parallel, the kind of parallelism they are most suited for is one that does not depend on explicit synchronization or on atomic operations to control concurrent access to resources. However, there are large classes of useful behaviors that can be implemented without these tools. Primitive mechanisms that are appropriate for specifying behavior on amorphous computers include gossip, random choice, fields, and gradients.

Gossip

Gossip, also known as epidemic communication [17,20], is a simple communication mechanism. The goal of a gossip computation is to obtain agreement about the value of some parameter. Each particle broadcasts its opinion of the parameter to its neighbors, and computation is performed by each particle combining the values that it receives from its neighbors, without consideration of the identity of the source. If the computation changes a particle's opinion of the value, it rebroadcasts its new opinion. The process concludes when there are no further broadcasts.

For example, an aggregate can agree upon the minimum of the values held by all the particles as follows. Each particle broadcasts its value. Each recipient compares its current value with the value that it receives. If the received value is smaller than its current value, it changes its current value to that minimum and rebroadcasts the new value.
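The minimum-agreement example can be sketched directly. A worklist simulates the asynchronous rebroadcasts, and, as in the real mechanism, no record is kept of where a value came from. The function name and graph encoding are invented for this sketch.

```python
def gossip_min(neighbors, values):
    """Agree on the minimum of the particles' initial values by gossip.

    Each particle broadcasts its current opinion; a recipient that hears a
    smaller value adopts it and rebroadcasts.  The worklist below simulates
    the broadcasts; it empties exactly when no particle has anything new
    to say, which is when the computation concludes."""
    opinion = dict(values)
    active = list(opinion)            # particles with something to broadcast
    while active:
        p = active.pop()
        for q in neighbors[p]:
            if opinion[p] < opinion[q]:
                opinion[q] = opinion[p]   # adopt the smaller value...
                active.append(q)          # ...and rebroadcast it
    return opinion
```

On any connected neighbor graph every particle ends up holding the global minimum, no matter in what order the broadcasts are delivered.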
The advantage of gossip is that it flows in all directions and is very difficult to disrupt. The disadvantage is that the lack of source information makes it difficult to revise a decision.

Random Choice

Random choice is used to break symmetry, allowing the particles to differentiate their behavior. The simplest use of random choice is to establish local identity of particles: each particle chooses a random number to identify
itself to its neighbors. If the number of possible choices is large enough, then it is unlikely that any nearby particles will choose the same number, and this number can thus be used as an identifier for the particle to its neighbors. Random choice can be combined with gossip to elect leaders, either for the entire system or for local regions. (If collisions in choice can be detected, then the number of choices need not be much higher than the number of neighbors. Also, using gossip to elect leaders makes sense only when we expect a leader to be long-lived, due to the difficulty of changing the decision to designate a replacement leader.)

To elect a single leader for the entire system, every particle chooses a value, then gossips to find the minimum. The particle with the minimum value becomes the leader. To elect regional leaders, we instead use gossip to carry the identity of the first leader a particle has heard of. Each particle uses random choice as a "coin flip" to decide when to declare itself a leader; if the flip comes up heads enough times before the particle hears of another leader, the particle declares itself a leader and broadcasts that fact to its neighbors. The entire system is thus broken up into contiguous domains of particles who first heard some particular particle declare itself a leader.

One challenge in using random choice on an amorphous computer is to ensure that the particulars of the particle distribution do not have an unexpected effect on the outcome. For example, if we wish to control the expected size of the domain that each regional leader presides over, then the probability of becoming a leader must depend on the density of the particles.

Fields

Every component of the state of the computational particles in an amorphous computer may be thought of as a field over the discrete space occupied by those particles.
If the density of particles is large enough, this field of values may be thought of as an approximation of a field on the continuous space. We can make amorphous models that approximate the solutions of the classical partial differential equations of physics, given appropriate boundary conditions. The amorphous methods can be shown to be consistent, convergent and stable.

For example, the algorithm for solving the Laplace equation with Dirichlet conditions is analogous to the way it would be solved on a lattice. Each particle repeatedly updates the value of the solution to be the average of the solutions posted by its neighbors, while the boundary points keep their values fixed. This algorithm will eventually converge, although very slowly, independent of the order of the updates and the details of the local connectedness of the network. There are optimizations, such as over-relaxation, that are just as applicable in the amorphous context as on a regular grid.

Katzenelson [27] has shown similar results for the diffusion equation, complete with analytic estimates of the errors that arise from the discrete and irregularly connected network. In the diffusion equation there is a conserved quantity, the amount of material diffusing. Rauch [47] has shown how this can work with the wave equation, illustrating that systems that conserve energy and momentum can also be effectively modeled with an amorphous computer. The simulation of the wave equation does require that the communicating particles know their relative positions, but it is not hard to establish local coordinate systems.

Amorphous Computing, Figure 6 A line being maintained by active gradients, from [12]. A line (black) is constructed between two anchor regions (dark grey) based on the active gradient emitted by the right anchor region (light grays). The line is able to rapidly repair itself following failures because the gradient actively maintains itself

Gradients

An important primitive in amorphous computing is the gradient, which estimates the distance from each particle to the nearest particle designated as a source. The gradient is inspired by the chemical-gradient diffusion process that is crucial to biological development. Amorphous computing builds on this idea, but does not necessarily compute the distance using diffusion, because simulating diffusion can be expensive. The common alternative is a linear-time mechanism that depends on active computation and relaying of information rather than passive diffusion. Calculation of a gradient starts with each source particle setting its distance estimate to zero, and every other particle setting its distance
estimate to infinity. The sources then broadcast their estimates to their neighbors. When a particle receives a message from its neighbor, it compares its current distance estimate to the distance through its neighbor. If the distance through its neighbor is less, it chooses that to be its estimate, and broadcasts its new estimate onwards. Although the basic form of the gradient is simple, there are several ways in which gradients can be varied to better match the context in which they are used. These choices may be made largely independently, giving a wide variety of options when designing a system. Variations which have been explored include: Active Gradients An active gradient [12,14,16] monitors its validity in the face of changing sources and device failure, and maintains correct distance values. For example, if the supporting sources disappear, the gradient is deallocated. A gradient may also carry version information, allowing its source to change more smoothly. Active gradients can provide self-repairing coordinate systems, as a foundation for robust construction of patterns. Figure 6 shows the “count-down wave” line described above in the introduction to this article. The line’s implementation in terms of active gradients provides for self-repair when the underlying amorphous computer is damaged. Polarity A gradient may be set to count down from a positive value at the source, rather than to count up from zero. This bounds the distance that the gradient can span, which can help limit resource usage, but may limit the scalability of programs.
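The relaxation just described is easy to simulate. The sketch below is illustrative only: the unit-disc network model, the function names, and the hop-count distance measure are simplifying assumptions of ours, not part of any particular amorphous computing system.

```python
import random

def make_network(n, radius, size=1.0, seed=0):
    """Scatter n particles at random in a square and connect every pair
    within the communication radius (unit-disc model)."""
    rng = random.Random(seed)
    pos = [(rng.uniform(0, size), rng.uniform(0, size)) for _ in range(n)]
    nbrs = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]
            if dx * dx + dy * dy <= radius * radius:
                nbrs[i].append(j)
                nbrs[j].append(i)
    return pos, nbrs

def gradient(nbrs, sources):
    """Hop-count gradient: sources start at 0, everyone else at infinity,
    and estimates relax downward as particles re-broadcast improvements."""
    est = {i: (0 if i in sources else float("inf")) for i in nbrs}
    frontier = list(sources)
    while frontier:                       # message-driven relaxation
        nxt = []
        for i in frontier:
            for j in nbrs[i]:
                if est[i] + 1 < est[j]:   # shorter route through i
                    est[j] = est[i] + 1
                    nxt.append(j)         # j re-broadcasts its new estimate
        frontier = nxt
    return est
```

In a dense unit-disc network, multiplying a hop count by the approximate communication radius gives a crude distance estimate, in the spirit of the hop-count distance measures discussed in this section.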
Adaptivity As described above, a gradient relaxes once to a distance estimate. If communication is expensive and precision unimportant, the gradient can take the first value that arrives and ignore all subsequent values. If we want the gradient to adapt to changes in the distribution of particles or sources, then the particles need to broadcast at regular intervals. We can then make the estimates converge smoothly to precise values by adding a restoring force that acts opposite to the relaxation, allowing the gradient to rise when unconstrained by its neighbors. If, on the other hand, we value adaptation speed over smoothness, then each particle can recalculate its distance estimate from scratch with each new batch of values. Carrier Normally, the distance value calculated by the gradient is the signal we are interested in. A gradient may instead be used to carry an arbitrary signal outward from the source. In this case, the value at each particle is the most recently arrived value from the nearest source. Distance Measure A gradient’s distance measure is, of course, dependent on how much knowledge we have about the relative positions of neighbors. It is sometimes advantageous to discard good information and use only hop-count values, since it is easier to make an adaptive gradient using hop-count values. Non-linear distance measures are also possible, such as a count-down gradient that decays exponentially from the source. Finally, the value of a gradient may depend on more sources than the nearest (this is the case for a chemical gradient), though this may be very expensive to calculate. Coordinates and Clusters Computational particles may be built with restrictions on what can be known about local geometry. A particle may know that it can reliably communicate with a few neighbors. If we assume that these neighbors are all within a disc of some approximate communication radius, then distances to others may be estimated by minimum hop count [28].
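The carrier variation described above can be sketched by attaching a payload to the basic hop-count relaxation. This is a hypothetical illustration (the function name and the payload representation are our own, not any published system's interface):

```python
def carrier_gradient(nbrs, sources):
    """Gradient used as a carrier: each particle ends up holding the
    payload of its nearest source, together with the hop count to it."""
    # `sources` maps each source particle to the payload it broadcasts.
    best = {i: (0, sources[i]) if i in sources else (float("inf"), None)
            for i in nbrs}
    frontier = list(sources)
    while frontier:
        nxt = []
        for i in frontier:
            d, payload = best[i]
            for j in nbrs[i]:
                if d + 1 < best[j][0]:        # shorter route: adopt its payload
                    best[j] = (d + 1, payload)
                    nxt.append(j)
        frontier = nxt
    return best
```

On a chain of particles with sources at both ends, each interior particle adopts the payload of whichever end is nearer, exactly the "most recently arrived value from the nearest source" behavior described above.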
However, it is possible that more elaborate particles can estimate distances to near neighbors. For example, the Cricket localization system [45] uses the fact that sound travels more slowly than radio, so the distance is estimated by the difference in time of arrival between simultaneously transmitted signals. McLurkin’s swarmbots [35] use the ISIS communication system that gives bearing and range information. Nevertheless, a sufficiently dense amorphous computer can produce local coordinate systems for its particles with even the crudest method of determining distances. We can make an atlas of overlapping coordinate systems, using random symmetry breaking to make new starting baselines [3]. These coordinate systems can be combined and made consistent to form a manifold, even if the amorphous computer is not flat or simply connected. One way to establish coordinates is to choose two initial particles that are a known distance apart. Each one serves as the source of a gradient. A pair of rectangular axes can be determined by the shortest path between them and by a bisector constructed where the two gradients are equal. These may be refined by averaging and calibrated using the known distance between the selected particles. After the axes are established, they may source new gradients that can be combined to make coordinates for the region near these axes. The coordinate system can then be refined by further averaging. Other natural coordinate constructions, such as bipolar elliptical coordinates, are also possible. This kind of construction was pioneered by Coore [14] and Nagpal [37]. Katzenelson [27] did early work to determine the kind of accuracy that can be expected from such a construction. Spatial clustering can be accomplished with any of a wide variety of algorithms, such as the clubs algorithm [16], LOCI [36], or persistent node partitioning [4]. Clusters can themselves be clustered, forming a hierarchical clustering of logarithmic height. Means of Combination and Abstraction A programming framework for amorphous systems requires more than primitive mechanisms. We also need suitable means of combination, so that programmers can combine behaviors to produce more complex behaviors, and means of abstraction so that the compound behaviors can be named and manipulated as units. Here are a few means of combination that have been investigated with amorphous computing. Spatial and Temporal Sequencing Several behaviors can be strung together in a sequence. The challenge in controlling such a sequence is to determine when one phase has completed and it is safe to move on to the next.
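The two-anchor coordinate construction described above can be illustrated with a standard bilateration formula. This is our illustrative stand-in for the averaging-and-bisector refinement in the actual constructions of Coore and Nagpal; it converts a particle's two gradient distances into rectangular coordinates, up to a reflection about the axis through the anchors:

```python
import math

def local_coords(d1, d2, baseline):
    """Convert gradient distances d1, d2 to two anchors a known
    `baseline` apart into (x, y).  The sign of y is ambiguous without
    a third reference, so the non-negative branch is returned."""
    x = (d1 ** 2 - d2 ** 2 + baseline ** 2) / (2 * baseline)
    y = math.sqrt(max(d1 ** 2 - x ** 2, 0.0))  # clamp small negatives from noise
    return x, y
```

With noisy hop-count distances the raw formula is crude, which is why the constructions described above refine the coordinates by repeated local averaging.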
Trigger rules can be used to detect completion locally. In Coore’s Growing Point Language [15], all of the sequencing decisions are made locally, with different growing points progressing independently. There is no difficulty of synchronization in this approach because the only time when two growing points need to agree is when they have become spatially coincident. When growing points merge, the independent processes are automatically synchronized.
Nagpal’s origami language [38] has long-range operations that cannot overlap in time unless they are in non-interacting regions of the space. The implementation uses barrier synchronization to sequence the operations: when completion is detected locally, a signal is propagated throughout a marked region of the sheet, and the next operation begins after a waiting time determined by the diameter of the sheet. With adaptive gradients, we can use the presence of an inducing signal to run an operation. When the induction signal disappears, the operation ceases and the particles begin the next operation. This allows sequencing to be triggered by the last detection of completion rather than by the first. Pipelining If a behavior is self-stabilizing (meaning that it converges to a correct state from an arbitrary state) then we can use it in a sequence without knowing when the previous phase completes. The evolving output of the previous phase serves as the input of the next phase, and once the preceding behavior has converged, the self-stabilizing behavior will converge as well. If the previous phase evolves smoothly towards its final state, then by the time it has converged, the next phase may have almost converged as well, working from its partial results. For example, the coordinate system mechanism described above can be pipelined; the final coordinates are being formed even as the farther particles learn that they are not on one of the two axes. Restriction to Spatial Regions Because the particles of an amorphous computer are distributed in space, it is natural to assign particular behaviors to specific spatial regions. In Beal and Bachrach’s work, restriction of a process to a region is a primitive [7]. As another example, when Nagpal’s system folds an origami construction, regions on different faces may differentiate so that they fold in different patterns. These folds may, if the physics permits, be performed simultaneously.
It may be necessary to sequence later construction that depends on the completion of the substructures. Regions of space can be named using coordinates, clustering, or implicitly through calculations on fields. Indeed, one could implement solid modeling on an amorphous computer. Once a region is identified, a particle can test whether it is a member of that region when deciding whether to run a behavior. It is also necessary to specify how a particle should change its behavior if its membership in a region varies with time.
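Restriction can be sketched as a guard wrapped around a behavior, with the region expressed as a predicate over a particle's estimated coordinates. The representation of particles and regions here is our own illustrative choice, not the actual interface of Proto or any other amorphous language:

```python
def restrict(region, behavior):
    """Run `behavior` on a particle only while the particle's
    estimated position satisfies the region predicate."""
    def guarded(particle):
        if region(particle["pos"]):
            behavior(particle)
    return guarded

# A region named implicitly, by a calculation on coordinates:
left_half = lambda pos: pos[0] < 0.5
```

Because membership is re-tested every time the guarded behavior runs, a particle whose coordinates drift across the region boundary simply starts (or stops) participating, which is one way to handle time-varying membership.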
Modularity and Abstraction Standard means of abstraction may be applied in an amorphous computing context, such as naming procedures, data structures, and processes. The question for amorphous computing is what collection of entities is useful to name. Because geometry is essential in an amorphous computing context, it becomes appropriate to describe computational processes in terms of geometric entities. Thus there are new opportunities for combining geometric structures and naming the combinations. For example, it is appropriate to compute with, combine, and name regions of space, intervals of time, and fields defined on them [7,42]. It may also be useful to describe the propagation of information through the amorphous computer in terms of the light cone of an event [2]. Not all traditional abstractions extend nicely to an amorphous computing context because of the challenges of scale and the fallibility of parts and interconnect. For example, atomic transactions may be excessively expensive in an amorphous computing context. And yet, some of the goals that a programmer might use atomic transactions to accomplish, such as the approximate enforcement of conservation laws, can be obtained using techniques that are compatible with an amorphous environment, as shown by Rauch [47]. Supporting Infrastructure and Services Amorphous computing languages, with their primitives, means of combination, and means of abstraction, rest on supporting services. One example, described above, is the automatic message propagation and decay in Weiss’s Microbial Colony Language [55]. MCL programs do not need to deal with this explicitly because it is incorporated into the operating system of the MCL machine. Experience with amorphous computing is beginning to identify other key services that amorphous machines must supply. Particle Identity Particles must be able to choose identifiers for communicating with their neighbors. 
More generally, there are many operations in an amorphous computation where the particles may need to choose numbers, with the property that individual particles choose different numbers. If we are willing to pay the cost, it is possible to build unique identifiers into the particles, as is done with current macroscopic computers. We need only locally unique identifiers, however, so we can obtain them using pseudorandom-number generators. On the surface of it,
this may seem problematic, since the particles in the amorphous computer are assumed to be manufactured identically, with identical programs. There are, however, ways to obtain individualized random numbers. For example, the particles are not synchronized, and they are not really physically identical, so they will run at slightly different rates. This difference is enough to allow pseudorandom-number generators to get locally out of synch and produce different sequences of numbers. Amorphous computing particles that have sensors may also get seeds for their pseudorandom-number generators from sensor noise. Local Geometry and Gradients Particles must maintain connections with their neighbors, tracking who they can reliably communicate with, and whatever local geometry information is available. Because particles may fail or move, this information needs to be maintained actively. The geometry information may include distance and bearing to each neighbor, as well as the time it takes to communicate with each neighbor. But many implementations will not be able to give significant distance or bearing information. Since all of this information may be obsolete or inaccurately measured, the particles must also maintain information on how reliable each piece of information is. An amorphous computer must know the dimension of the space it occupies. This will generally be a constant: either the computer covers a surface or fills a volume. In rare cases, however, the effective dimension of a computer may change: for example, paint is three-dimensional in a bucket and two-dimensional once applied. Combining this information with how the number of accessible correspondents changes with distance, an amorphous process can derive curvature and local density information.
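The locally unique identifier scheme described above can be sketched in a few lines. The names are ours, and a single seeded generator stands in for the per-particle generators, which in a real system would be desynchronized by clock skew or seeded from sensor noise:

```python
import random

def choose_local_ids(nbrs, bits=8, seed=0):
    """Each particle draws a random identifier and redraws whenever it
    collides with a neighbor.  Only local uniqueness is required, so a
    small identifier space usually suffices."""
    rng = random.Random(seed)           # stand-in for per-particle generators
    ids = {i: rng.getrandbits(bits) for i in nbrs}
    changed = True
    while changed:                      # repeat until no local collisions remain
        changed = False
        for i in nbrs:
            while any(ids[i] == ids[j] for j in nbrs[i]):
                ids[i] = rng.getrandbits(bits)
                changed = True
    return ids
```

Two particles far apart may well share an identifier; that is harmless, since identifiers are only used to address messages within a neighborhood.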
An amorphous computer should also support gradient propagation as part of the infrastructure: a programmer should not have to explicitly deal with the propagation of gradients (or other broadcast communications) in each particle. A process may explicitly initiate a gradient, or explicitly react to one that it is interested in, but the propagation of the gradient through a particle should be automatically maintained by the infrastructure. Implementing Communication Communication between neighbors can occur through any number of mechanisms, each with its own set of properties: amorphous computing systems have been built that communicate through directional infrared [35], RF broadcast [22], and low-speed serial cables [9,46]. Simulated systems have also included other mechanisms such as signals superimposed on the power connections [11] and chemical diffusion [55]. Communication between particles can be made implicit with a neighborhood shared memory. In this arrangement, each particle designates some of its internal state to be shared with its neighbors. The particles regularly communicate, giving each particle a best-effort view of the exposed portions of the states of its neighbors. The contents of the exposed state may be specified explicitly [10,35,58] or implicitly [7]. The shared memory allows the system to be tuned by trading off communication rate against the quality of the synchronization, and decreasing transmission rates when the exposed state is not changing. Lessons for Engineering As von Neumann remarked half a century ago, biological systems are strikingly robust when compared with our artificial systems. Even today, software is fragile. Computer science is currently built on a foundation that largely assumes the existence of a perfect infrastructure. Integrated circuits are fabricated in clean-room environments, tested deterministically, and discarded if even a single defect is uncovered. Entire software systems fail with single-line errors. In contrast, biological systems rely on local computation, local communication, and local state, yet they exhibit tremendous resilience. Although this contrast is most striking in computer science, amorphous computing can provide lessons throughout engineering. Amorphous computing concentrates on making systems flexible and adaptable at the expense of efficiency. Amorphous computing requires an engineer to work under extreme conditions. The engineer must arrange the cooperation of vast numbers of identical computational particles to accomplish prespecified goals, but may not depend upon the numbers. We may not depend on any prespecified interconnect of the particles. We may not depend on synchronization of the particles.
We may not depend on the stability of the communications system. We may not depend on the long-term survival of any individual particles. The combination of these obstacles forces us to abandon many of the comforts that are available in more typical engineering domains. By restricting ourselves in this way, we obtain some robustness and flexibility, at the cost of potentially inefficient use of resources, because the appropriate algorithms are ones that do not rely on these assumptions. Algorithms that work well in an amorphous context depend on the average behavior of participating particles. For example, in Nagpal’s origami system a fold that
is specified will be satisfactory if it is approximately in the right place and if most of the particles on the specified fold line agree that they are part of the fold line: dissenters will be overruled by the majority. In Proto a programmer can address only regions of space, assumed to be populated by many particles. The programmer may not address individual particles, so failures of individual particles are unlikely to make major perturbations to the behavior of the system. An amorphous computation can be quite immune to details of the macroscopic geometry as well as to the interconnectedness of the particles. Since amorphous computations make their own local coordinate systems, they are relatively independent of coordinate distortions. In an amorphous computation we accept a wide range of outcomes that arise from variations of the local geometry. Tolerance of local variation can lead to surprising flexibility: the mechanisms which allow Nagpal’s origami language to tolerate local distortions allow programs to distort globally as well, and Nagpal shows how such variations can account for the variations in the head shapes of related species of Drosophila [38]. In Coore’s language one specifies the topology of the pattern to be constructed, but only limited information about the geometry. The topology will be obtained, regardless of the local geometry, so long as there is sufficient density of particles to support the topology. Amorphous computations based on a continuous model of space (as in Proto) are naturally scale independent. Since an amorphous computer is composed of unsynchronized particles, a program may not depend upon a priori timing of events. The sequencing of phases of a process must be determined by either explicit termination signals or with times measured dynamically. So amorphous computations are time-scale independent by construction. 
A program for an amorphous computer may not depend on the reliability of the particles or the communication paths. As a consequence it is necessary to construct the program so as to dynamically compensate for failures. One way to do this is to specify the result as the satisfaction of a set of constraints, and to build the program as a homeostatic mechanism that continually drives the system toward satisfaction of those constraints. For example, an active gradient continually maintains each particle’s estimate of the distance to the source of the gradient. This can be used to establish and maintain connections in the face of failures of particles or relocation of the source. If a system is specified in this way, repair after injury is a continuation of the development process: an injury causes some constraints to become unsatisfied, and the development process builds new structure to heal the injury.
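The constraint-driven view can be made concrete with an active gradient that is continually re-derived: each particle repeatedly re-enforces the constraint that its estimate equals one plus the minimum of its neighbors' estimates, so failures are healed by the same process that built the values. This is a synchronous-round toy model of our own, not a published implementation:

```python
def repair_round(nbrs, sources, est):
    """One homeostatic update: every surviving particle re-derives its
    distance estimate from its current neighbors' latest values."""
    return {i: 0 if i in sources else
               1 + min((est[j] for j in nbrs[i]), default=float("inf"))
            for i in nbrs}
```

Running a few rounds from scratch converges to correct hop counts; deleting a particle and its links, then running a few more rounds, re-converges to the distances of the damaged network, so repair is literally a continuation of the original development process.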
By restricting the assumptions that a programmer can rely upon, we increase the flexibility and reliability of the programs that are constructed. However, it is not yet clear how this limits the range of possible applications of amorphous computing. Future Directions Computer hardware is almost free, and in the future it will continue to decrease in price and size. Sensors and actuators are improving as well. Future systems will have vast numbers of computing mechanisms with integrated sensors and actuators, to a degree that outstrips our current approaches to system design. When the numbers become large enough, the appropriate programming technology will be amorphous computing. This transition has already begun to appear in several fields: Sensor networks The success of sensor network research has encouraged the planning and deployment of ever-larger numbers of devices. The ad-hoc, time-varying nature of sensor networks has encouraged amorphous approaches, such as communication through directed diffusion [25] and Newton’s Regiment language [42]. Robotics Multi-agent robotics is much like sensor networks, except that the devices are mobile and have actuators. Swarm robotics considers independently mobile robots working together as a team like ants or bees, while modular robotics considers robots that physically attach to one another in order to make shapes or perform actions, working together like the cells of an organism. Gradients are being used to create “flocking” behaviors in swarm robotics [35,44]. In modular robotics, Stoy uses gradients to create shapes [51] while De Rosa et al. form shapes through stochastic growth and decay [49]. Pervasive computing Pervasive computing seeks to exploit the rapid proliferation of wireless computing devices throughout our everyday environment. Mamei and Zambonelli’s TOTA system [32] is an amorphous computing implementation supporting a model of programming using fields and gradients [33].
Servat and Drogoul have suggested combining amorphous computing and reactive agent-based systems to produce something they call “pervasive intelligence” [50]. Multicore processors As it becomes more difficult to increase processor speed, chip manufacturers are looking for performance gains through increasing the number of processing cores per chip. Butera’s work [10] looks toward a future in which there are thousands of cores per chip and it is no longer reasonable to assume they are all working or have them communicate all-to-all. While much of amorphous computing research is inspired by biological observations, it is also likely that insights and lessons learned from programming amorphous computers will help elucidate some biological problems [43]. Some of this will be stimulated by the emerging engineering of biological systems. Current work in synthetic biology [24,48,61] is centered on controlling the molecular biology of cells. Soon synthetic biologists will begin to engineer biofilms and perhaps direct the construction of multicellular organs, where amorphous computing will become an essential technological tool.
Bibliography Primary Literature 1. Bachrach J, Beal J (2006) Programming a sensor network as an amorphous medium. In: DCOSS 2006 Posters, June 2006 2. Bachrach J, Beal J, Fujiwara T (2007) Continuous space-time semantics allow adaptive program execution. In: IEEE International Conference on Self-Adaptive and Self-Organizing Systems, 2007 3. Bachrach J, Nagpal R, Salib M, Shrobe H (2003) Experimental results and theoretical analysis of a self-organizing global coordinate system for ad hoc sensor networks. Telecommun Syst J, Special Issue on Wireless System Networks 26(2–4):213–233 4. Beal J (2003) A robust amorphous hierarchy from persistent nodes. In: Commun Syst Netw 5. Beal J (2004) Programming an amorphous computational medium. In: Unconventional Programming Paradigms International Workshop, September 2004 6. Beal J (2005) Amorphous medium language. In: Large-Scale Multi-Agent Systems Workshop (LSMAS). Held in Conjunction with AAMAS-05 7. Beal J, Bachrach J (2006) Infrastructure for engineered emergence on sensor/actuator networks. In: IEEE Intelligent Systems, 2006 8. Beal J, Sussman G (2005) Biologically-inspired robust spatial programming. Technical Report AI Memo 2005-001, MIT, January 2005 9. Beebee W M68hc11 gunk api book. http://www.swiss.ai.mit. edu/projects/amorphous/HC11/api.html. Accessed 31 May 2007 10. Butera W (2002) Programming a Paintable Computer. Ph D thesis, MIT 11. Campbell J, Pillai P, Goldstein SC (2005) The robot is the tether: Active, adaptive power routing for modular robots with unary inter-robot connectors. In: IROS 2005 12. Clement L, Nagpal R (2003) Self-assembly and self-repairing topologies. In: Workshop on Adaptability in Multi-Agent Systems, RoboCup Australian Open, January 2003 13. Codd EF (1968) Cellular Automata. Academic Press, New York 14. Coore D (1998) Establishing a coordinate system on an amorphous computer. In: MIT Student Workshop on High Performance Computing, 1998
15. Coore D (1999) Botanical Computing: A Developmental Approach to Generating Interconnect Topologies on an Amorphous Computer. Ph D thesis, MIT 16. Coore D, Nagpal R, Weiss R (1997) Paradigms for structure in an amorphous computer. Technical Report AI Memo 1614, MIT 17. Demers A, Greene D, Hauser C, Irish W, Larson J, Shenker S, Stuygis H, Swinehart D, Terry D (1987) Epidemic algorithms for replicated database maintenance. In: 7th ACM Symposium on Operating Systems Principles, 1987 18. D’Hondt E, D’Hondt T (2001) Amorphous geometry. In: ECAL 2001 19. D’Hondt E, D’Hondt T (2001) Experiments in amorphous geometry. In: 2001 International Conference on Artificial Intelligence 20. Ganesan D, Krishnamachari B, Woo A, Culler D, Estrin D, Wicker S (2002) An empirical study of epidemic algorithms in large scale multihop wireless networks. Technical Report IRB-TR-02003, Intel Research Berkeley 21. Gayle O, Coore D (2006) Self-organizing text in an amorphous environment. In: ICCS 2006 22. Hill J, Szewcyk R, Woo A, Culler D, Hollar S, Pister K (2000) System architecture directions for networked sensors. In: ASPLOS, November 2000 23. Huzita H, Scimemi B (1989) The algebra of paper-folding. In: First International Meeting of Origami Science and Technology, 1989 24. igem 2006: international genetically engineered machine competition (2006) http://www.igem2006.com. Accessed 31 May 2007 25. Intanagonwiwat C, Govindan R, Estrin D (2000) Directed diffusion: a scalable and robust communication paradigm for sensor networks. In: Mobile Computing and Networking, pp 56–67 26. Kahn JM, Katz RH, Pister KSJ (1999) Mobile networking for smart dust. In: ACM/IEEE Int. Conf. on Mobile Computing and Networking (MobiCom 99), August 1999 27. Katzenelson J (1999) Notes on amorphous computing. (Unpublished Draft) 28. Kleinrock L, Sylvester J (1978) Optimum transmission radii for packet radio networks or why six is a magic number. In: IEEE Natl Telecommun. Conf, December 1978, pp 4.3.1–4.3.5 29. 
Knight TF, Sussman GJ (1998) Cellular gate technology. In: First International Conference on Unconventional Models of Computation (UMC98) 30. Kondacs A (2003) Biologically-inspired self-assembly of 2d shapes, using global-to-local compilation. In: International Joint Conference on Artificial Intelligence (IJCAI) 31. Mamei M, Zambonelli F (2003) Spray computers: Frontiers of self-organization for pervasive computing. In: WOA 2003 32. Mamei M, Zambonelli F (2004) Spatial computing: the tota approach. In: WOA 2004, pp 126–142 33. Mamei M, Zambonelli F (2005) Physical deployment of digital pheromones through rfid technology. In: AAMAS 2005, pp 1353–1354 34. Margolus N (1988) Physics and Computation. Ph D thesis, MIT 35. McLurkin J (2004) Stupid robot tricks: A behavior-based distributed algorithm library for programming swarms of robots. Master’s thesis, MIT 36. Mittal V, Demirbas M, Arora A (2003) Loci: Local clustering service for large scale wireless sensor networks. Technical Report OSU-CISRC-2/03-TR07, Ohio State University
37. Nagpal R (1999) Organizing a global coordinate system from local information on an amorphous computer. Technical Report AI Memo 1666, MIT 38. Nagpal R (2001) Programmable Self-Assembly: Constructing Global Shape using Biologically-inspired Local Interactions and Origami Mathematics. Ph D thesis, MIT 39. Nagpal R, Mamei M (2004) Engineering amorphous computing systems. In: Bergenti F, Gleizes MP, Zambonelli F (eds) Methodologies and Software Engineering for Agent Systems, The Agent-Oriented Software Engineering Handbook. Kluwer, New York, pp 303–320 40. Dust Networks. http://www.dust-inc.com. Accessed 31 May 2007 41. Newton R, Morrisett G, Welsh M (2007) The regiment macroprogramming system. In: International Conference on Information Processing in Sensor Networks (IPSN’07) 42. Newton R, Welsh M (2004) Region streams: Functional macroprogramming for sensor networks. In: First International Workshop on Data Management for Sensor Networks (DMSN), August 2004 43. Patel A, Nagpal R, Gibson M, Perrimon N (2006) The emergence of geometric order in proliferating metazoan epithelia. Nature 442:1038–1041 44. Payton D, Daily M, Estowski R, Howard M, Lee C (2001) Pheromone robotics. Autonomous Robotics 11:319–324 45. Priyantha N, Chakraborty A, Balakrishnan H (2000) The cricket location-support system. In: ACM International Conference on Mobile Computing and Networking (ACM MOBICOM), August 2000 46. Raffle H, Parkes A, Ishii H (2004) Topobo: A constructive assembly system with kinetic memory. CHI 47. Rauch E (1999) Discrete, amorphous physical models. Master’s thesis, MIT 48. Registry of standard biological parts. http://parts.mit.edu. Accessed 31 May 2007 49. De Rosa M, Goldstein SC, Lee P, Campbell J, Pillai P (2006) Scalable shape sculpting via hole motion: Motion planning in lattice-constrained module robots. In: Proceedings of the 2006 IEEE International Conference on Robotics and Automation (ICRA ‘06), May 2006
50. Servat D, Drogoul A (2002) Combining amorphous computing and reactive agent-based systems: a paradigm for pervasive intelligence? In: AAMAS 2002 51. Stoy K (2003) Emergent Control of Self-Reconfigurable Robots. Ph D thesis, University of Southern Denmark 52. Sutherland A (2003) Towards rseam: Resilient serial execution on amorphous machines. Master’s thesis, MIT 53. von Neumann J (1951) The general and logical theory of automata. In: Jeffress L (ed) Cerebral Mechanisms for Behavior. Wiley, New York, p 16 54. Weiss R, Knight T (2000) Engineered communications for microbial robotics. In: Sixth International Meeting on DNA Based Computers (DNA6) 55. Weiss R (2001) Cellular Computation and Communications using Engineered Genetic Regulatory Networks. Ph D thesis, MIT 56. Welsh M, Mainland G (2004) Programming sensor networks using abstract regions. In: Proceedings of the First USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI ’04), March 2004 57. Werfel J, Bar-Yam Y, Nagpal R (2005) Building patterned structures with robot swarms. In: IJCAI 58. Whitehouse K, Sharp C, Brewer E, Culler D (2004) Hood: a neighborhood abstraction for sensor networks. In: Proceedings of the 2nd international conference on Mobile systems, applications, and services. 59. Abelson H, Sussman GJ, Sussman J (1996) Structure and Interpretation of Computer Programs, 2nd edn. MIT Press, Cambridge 60. Ashe HL, Briscoe J (2006) The interpretation of morphogen gradients. Development 133:385–94 61. Weiss R, Basu S, Hooshangi S, Kalmbach A, Karig D, Mehreja R, Netravali I (2003) Genetic circuit building blocks for cellular computation, communications, and signal processing. Natural Computing 2(1):47–84
Books and Reviews Abelson H, Allen D, Coore D, Hanson C, Homsy G, Knight T, Nagpal R, Rauch E, Sussman G, Weiss R (1999) Amorphous computing. Technical Report AIM-1665, MIT
Analog Computation
Analog Computation
BRUCE J. MACLENNAN
Department of Electrical Engineering & Computer Science, University of Tennessee, Knoxville, USA

Article Outline
Glossary
Definition of the Subject
Introduction
Fundamentals of Analog Computing
Analog Computation in Nature
General-Purpose Analog Computation
Analog Computation and the Turing Limit
Analog Thinking
Future Directions
Bibliography

Glossary
Accuracy The closeness of a computation to the corresponding primary system.
BSS The theory of computation over the real numbers defined by Blum, Shub, and Smale.
Church–Turing (CT) computation The model of computation based on the Turing machine and other equivalent abstract computing machines; commonly accepted as defining the limits of digital computation.
EAC Extended analog computer defined by Rubel.
GPAC General-purpose analog computer.
Nomograph A device for the graphical solution of equations by means of a family of curves and a straightedge.
ODE Ordinary differential equation.
PDE Partial differential equation.
Potentiometer A variable resistance, adjustable by the computer operator, used in electronic analog computing as an attenuator for setting constants and parameters in a computation.
Precision The quality of an analog representation or computation, which depends on both resolution and stability.
Primary system The system being simulated, modeled, analyzed, or controlled by an analog computer, also called the target system.
Scaling The adjustment, by constant multiplication, of variables in the primary system (including time) so that the corresponding variables in the analog systems are in an appropriate range.
TM Turing machine.
Definition of the Subject Although analog computation was eclipsed by digital computation in the second half of the twentieth century, it is returning as an important alternative computing technology. Indeed, as explained in this article, theoretical results imply that analog computation can escape from the limitations of digital computation. Furthermore, analog computation has emerged as an important theoretical framework for discussing computation in the brain and other natural systems. Analog computation gets its name from an analogy, or systematic relationship, between the physical processes in the computer and those in the system it is intended to model or simulate (the primary system). For example, the electrical quantities voltage, current, and conductance might be used as analogs of the fluid pressure, flow rate, and pipe diameter. More specifically, in traditional analog computation, physical quantities in the computation obey the same mathematical laws as physical quantities in the primary system. Thus the computational quantities are proportional to the modeled quantities. This is in contrast to digital computation, in which quantities are represented by strings of symbols (e. g., binary digits) that have no direct physical relationship to the modeled quantities. According to the Oxford English Dictionary (2nd ed., s.vv. analogue, digital), these usages emerged in the 1940s. However, in a fundamental sense all computing is based on an analogy, that is, on a systematic relationship between the states and processes in the computer and those in the primary system. In a digital computer, the relationship is more abstract and complex than simple proportionality, but even so simple an analog computer as a slide rule goes beyond strict proportion (i. e., distance on the rule is proportional to the logarithm of the number). 
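The slide rule's principle, that multiplication reduces to adding logarithmic distances, can be sketched digitally. The following is an illustrative sketch, not part of the original article:

```python
import math

def slide_rule_multiply(a, b):
    """Multiply two positive numbers the way a slide rule does:
    add the logarithmic 'distances' of a and b on the scales,
    then read the product back off the logarithmic scale."""
    distance_a = math.log10(a)  # position of a on the fixed scale
    distance_b = math.log10(b)  # offset added by the sliding scale
    return 10 ** (distance_a + distance_b)
```

On a physical rule the two distances are set mechanically and the answer is read to perhaps three significant figures; here the only error is floating-point round-off.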
In both analog and digital computation—indeed in all computation—the relevant abstract mathematical structure of the problem is realized in the physical states and processes of the computer, but the realization may be more or less direct [40,41,46]. Therefore, despite the etymologies of the terms "analog" and "digital", in modern usage the principal distinction between digital and analog computation is that the former operates on discrete representations in discrete steps, while the latter operates on continuous representations by means of continuous processes (e. g., MacLennan [46], Siegelmann [p. 147 in 78], Small [p. 30 in 82], Weyrick [p. 3 in 89]). That is, the primary distinction resides in the topologies of the states and processes, and it would be more accurate to refer to discrete and continuous computation [p. 39 in 25]. (Consider so-called analog and digital clocks. The principal difference resides in the continuity or discreteness of the representation of time; the motion of the two (or three) hands of an "analog" clock does not mimic the motion of the rotating earth or the position of the sun relative to it.)

Introduction

History

Pre-electronic Analog Computation

Just like digital calculation, analog computation was originally performed by hand. Thus we find several analog computational procedures in the "constructions" of Euclidean geometry (Euclid, fl. 300 BCE), which derive from techniques used in ancient surveying and architecture. For example, Problem II.51 is "to divide a given straight line into two parts, so that the rectangle contained by the whole and one of the parts shall be equal to the square of the other part". Also, Problem VI.13 is "to find a mean proportional between two given straight lines", and VI.30 is "to cut a given straight line in extreme and mean ratio". These procedures do not make use of measurements in terms of any fixed unit or of digital calculation; the lengths and other continuous quantities are manipulated directly (via compass and straightedge). On the other hand, the techniques involve discrete, precise operational steps, and so they can be considered algorithms, but over continuous magnitudes rather than discrete numbers. It is interesting to note that the ancient Greeks distinguished continuous magnitudes (Grk., megethoi), which have physical dimensions (e. g., length, area, rate), from discrete numbers (Grk., arithmoi), which do not [49]. Euclid axiomatizes them separately (magnitudes in Book V, numbers in Book VII), and a mathematical system comprising both discrete and continuous quantities was not achieved until the nineteenth century in the work of Weierstrass and Dedekind.
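To make the algorithmic character of these constructions concrete: the mean proportional w between two lines of lengths u and v satisfies u : w = w : v, so w² = uv and w = √(uv); the compass-and-straightedge procedure is thus an analog square-root algorithm. A digital rendering of the same relation (illustrative only):

```python
import math

def mean_proportional(u, v):
    """The mean proportional w between lengths u and v satisfies
    u : w = w : v, i.e. w**2 = u * v, so w = sqrt(u * v)."""
    return math.sqrt(u * v)
```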
The earliest known mechanical analog computer is the "Antikythera mechanism", which was found in 1900 in a shipwreck near the Greek island of Antikythera (between Kythera and Crete). It dates to the second century BCE and appears to be intended for astronomical calculations. The device is sophisticated (at least 70 gears) and well engineered, suggesting that it was not the first of its type, and therefore that other analog computing devices may have been used in the ancient Mediterranean world [22]. Indeed, according to Cicero (Rep. 22) and other authors, Archimedes (c. 287–c. 212 BCE) and other ancient scientists also built analog computers, such as armillary spheres, for astronomical simulation and computation. Other antique mechanical analog computers include the astrolabe, which is used for the determination of longitude and a variety of other astronomical purposes, and the torquetum, which converts astronomical measurements between equatorial, ecliptic, and horizontal coordinates.

A class of special-purpose analog computer, which is simple in conception but may be used for a wide range of purposes, is the nomograph (also, nomogram, alignment chart). In its most common form, it permits the solution of quite arbitrary equations in three real variables, f(u, v, w) = 0. The nomograph is a chart or graph with scales for each of the variables; typically these scales are curved and have non-uniform numerical markings. Given values for any two of the variables, a straightedge is laid across their positions on their scales, and the value of the third variable is read off where the straightedge crosses the third scale. Nomographs were used to solve many problems in engineering and applied mathematics. They improve intuitive understanding by allowing the relationships among the variables to be visualized, and facilitate exploring their variation by moving the straightedge. Lipka (1918) is an example of a course in graphical and mechanical methods of analog computation, including nomographs and slide rules.

Until the introduction of portable electronic calculators in the early 1970s, the slide rule was the most familiar analog computing device. Slide rules use logarithms for multiplication and division, and they were invented in the early seventeenth century shortly after John Napier's description of logarithms.

The mid-nineteenth century saw the development of the field analogy method by G. Kirchhoff (1824–1887) and others [33]. In this approach an electrical field in an electrolytic tank or conductive paper was used to solve two-dimensional boundary problems for temperature distributions and magnetic fields [p. 34 in 82].
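The nomograph's straightedge operation described above, fixing two variables and reading off the third, can be imitated numerically by solving f(u, v, w) = 0 for the remaining variable. The following sketch is illustrative only, with a made-up equation, and uses bisection in place of the straightedge:

```python
def solve_third_variable(f, u, v, lo, hi, tol=1e-9):
    """Given f(u, v, w) = 0 with u and v known, find w by bisection
    on [lo, hi]; f(u, v, lo) and f(u, v, hi) must differ in sign."""
    g = lambda w: f(u, v, w)
    a, b = lo, hi
    sign_a = g(a) > 0
    while b - a > tol:
        mid = 0.5 * (a + b)
        if (g(mid) > 0) == sign_a:
            a = mid          # root lies in the upper half-interval
        else:
            b = mid          # root lies in the lower half-interval
    return 0.5 * (a + b)

# Made-up example: f(u, v, w) = u*v - w, so w is read off as u*v.
w = solve_third_variable(lambda u, v, w: u * v - w, 3.0, 4.0, 0.0, 100.0)
```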
It is an early example of analog field computation.

In the nineteenth century a number of mechanical analog computers were developed for integration and differentiation (e. g., Lipka 1918, pp. 246–256; Clymer [15]). For example, the planimeter measures the area under a curve or within a closed boundary. While the operator moves a pointer along the curve, a rotating wheel accumulates the area. Similarly, the integraph is able to draw the integral of a given function as its shape is traced. Other mechanical devices can draw the derivative of a curve or compute a tangent line at a given point.

In the late nineteenth century William Thomson, Lord Kelvin, constructed several analog computers, including a "tide predictor" and a "harmonic analyzer", which computed the Fourier coefficients of a tidal curve [85,86]. In 1876 he described how the mechanical integrators invented by his brother could be connected together in a feedback loop in order to solve second and higher order differential equations (Small [pp. 34–35, 42 in 82], Thomson [84]). He was unable to construct this differential analyzer, which had to await the invention of the torque amplifier in 1927.

The torque amplifier and other technical advancements permitted Vannevar Bush at MIT to construct the first practical differential analyzer in 1930 [pp. 42–45 in 82]. It had six integrators and could also do addition, subtraction, multiplication, and division. Input data were entered in the form of continuous curves, and the machine automatically plotted the output curves continuously as the equations were integrated. Similar differential analyzers were constructed at other laboratories in the US and the UK.

Setting up a problem on the MIT differential analyzer took a long time; gears and rods had to be arranged to define the required dependencies among the variables. Bush later designed a much more sophisticated machine, the Rockefeller Differential Analyzer, which became operational in 1947. With 18 integrators (out of a planned 30), it provided programmatic control of machine setup, and permitted several jobs to be run simultaneously. Mechanical differential analyzers were rapidly supplanted by electronic analog computers in the mid-1950s, and most were disassembled in the 1960s (Bowles [10], Owens [61], Small [pp. 50–45 in 82]).

During World War II, and in later conflicts, an important application of optical and mechanical analog computation was in "gun directors" and "bomb sights", which performed ballistic computations to accurately target artillery and dropped ordnance.
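Kelvin's feedback arrangement described above, in which integrators feed one another so that the machine solves its own defining equation, can be imitated numerically. The sketch below is an illustration of the idea, not a description of Kelvin's hardware; it chains two integrators in a loop to solve the second-order equation u″ = −u:

```python
import math

def two_integrator_loop(t_end, dt=1e-5):
    """Solve u'' = -u, u(0) = 1, u'(0) = 0, with two integrators
    connected in a feedback loop: the output u is fed back (negated)
    as the input u'' of the first integrator."""
    u, du = 1.0, 0.0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        ddu = -u            # feedback path supplies u'' = -u
        du += ddu * dt      # first integrator: accumulates u'' into u'
        u += du * dt        # second integrator: accumulates u' into u
    return u

# u(t) tracks cos(t); e.g. u(pi) comes out close to -1.
```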
Electronic Analog Computation in the 20th Century

It is commonly supposed that electronic analog computers were superior to mechanical analog computers, and they were in many respects, including speed, cost, ease of construction, size, and portability [pp. 54–56 in 82]. On the other hand, mechanical integrators produced higher precision results (0.1% vs. 1% for early electronic devices) and had greater mathematical flexibility (they were able to integrate with respect to any variable, not just time). However, many important applications did not require high precision and focused on dynamic systems for which time integration was sufficient.

Analog computers (non-electronic as well as electronic) can be divided into active-element and passive-element computers; the former involve some kind of amplification, the latter do not [pp. 2-1–4 in 87]. Passive-element computers included the network analyzers, which were developed in the 1920s to analyze electric power distribution networks, and which continued in use through the 1950s [pp. 35–40 in 82]. They were also applied to problems in thermodynamics, aircraft design, and mechanical engineering. In these systems networks or grids of resistive elements or reactive elements (i. e., involving capacitance and inductance as well as resistance) were used to model the spatial distribution of physical quantities such as voltage, current, and power (in electric distribution networks), electrical potential in space, stress in solid materials, temperature (in heat diffusion problems), pressure, fluid flow rate, and wave amplitude [p. 2-2 in 87]. That is, network analyzers dealt with partial differential equations (PDEs), whereas active-element computers, such as the differential analyzer and its electronic successors, were restricted to ordinary differential equations (ODEs) in which time was the independent variable. Large network analyzers are early examples of analog field computers.

Electronic analog computers became feasible after the invention of the DC operational amplifier ("op amp") c. 1940 [pp. 64, 67–72 in 82]. Already in the 1930s scientists at Bell Telephone Laboratories (BTL) had developed the DC-coupled feedback-stabilized amplifier, which is the basis of the op amp. In 1940, as the USA prepared to enter World War II, D. L. Parkinson at BTL had a dream in which he saw DC amplifiers being used to control an antiaircraft gun. As a consequence, with his colleagues C. A. Lovell and B. T. Weber, he wrote a series of papers on "electrical mathematics", which described electrical circuits to "operationalize" addition, subtraction, integration, differentiation, etc. The project to produce an electronic gun director led to the development and refinement of DC op amps suitable for analog computation.
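The way the network analyzers described above "computed" a steady-state field can be imitated by relaxation: by Kirchhoff's current law, each interior node of a uniform resistor chain settles to the average voltage of its neighbors, which is exactly the discrete one-dimensional Laplace equation. A minimal sketch with illustrative values, not drawn from any historical machine:

```python
def resistor_chain_voltages(v_left, v_right, interior_nodes=9, sweeps=2000):
    """Relax a uniform 1D resistor chain to steady state. Each interior
    node's voltage is repeatedly replaced by the average of its two
    neighbors (Kirchhoff's current law with equal resistances), which
    solves the discrete Laplace equation; the result is the linear
    voltage profile a physical network would settle into."""
    v = [v_left] + [0.0] * interior_nodes + [v_right]
    for _ in range(sweeps):
        for i in range(1, interior_nodes + 1):
            v[i] = 0.5 * (v[i - 1] + v[i + 1])
    return v
```

The physical network reaches this state essentially instantaneously and in parallel; the digital imitation must iterate.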
The war-time work at BTL was focused primarily on control applications of analog devices, such as the gun director. Other researchers, such as E. Lakatos at BTL, were more interested in applying them to general-purpose analog computation for science and engineering, which resulted in the design of the General Purpose Analog Computer (GPAC), also called "Gypsy", completed in 1949 [pp. 69–71 in 82]. Building on the BTL op amp design, fundamental work on electronic analog computation was conducted at Columbia University in the 1940s. In particular, this research showed how analog computation could be applied to the simulation of dynamic systems and to the solution of nonlinear equations. Commercial general-purpose analog computers (GPACs) emerged in the late 1940s and early 1950s [pp. 72–73 in 82]. Typically they provided several dozen
integrators, but several GPACs could be connected together to solve larger problems. Later, large-scale GPACs might have up to 500 amplifiers and compute with 0.01%–0.1% precision [p. 2-33 in 87]. Besides integrators, typical GPACs provided adders, subtracters, multipliers, fixed function generators (e. g., logarithms, exponentials, trigonometric functions), and variable function generators (for user-defined functions) [Chaps. 1.3, 2.4 in 87]. A GPAC was programmed by connecting these components together, often by means of a patch panel. In addition, parameters could be entered by adjusting potentiometers (attenuators), and arbitrary functions could be entered in the form of graphs [pp. 172–81, 2-154–156 in 87]. Output devices plotted data continuously or displayed it numerically [pp. 3-1–30 in 87].

The most basic way of using a GPAC was in single-shot mode [pp. 168–170 in 89]. First, parameters and initial values were entered into the potentiometers. Next, putting a master switch in "reset" mode controlled relays to apply the initial values to the integrators. Turning the switch to "operate" or "compute" mode allowed the computation to take place (i. e., the integrators to integrate). Finally, placing the switch in "hold" mode stopped the computation and stabilized the values, allowing them to be read from the computer (e. g., on voltmeters). Although single-shot operation was also called "slow operation" (in comparison to "repetitive operation", discussed next), it was in practice quite fast. Because all of the devices computed in parallel and at electronic speeds, analog computers usually solved problems in real time, and often much faster (Truitt and Rogers [pp. 1-30–32 in 87], Small [p. 72 in 82]).

One common application of GPACs was to explore the effect of one or more parameters on the behavior of a system. To facilitate this exploration of the parameter space, some GPACs provided a repetitive operation mode, which worked as follows (Weyrick [p.
170 in 89], Small [p. 72 in 82]). An electronic clock switched the computer between reset and compute modes at an adjustable rate (e. g., 10–1000 cycles per second) [p. 280, n. 1 in 2]. In effect the simulation was rerun at the clock rate, but if any parameters were adjusted, the simulation results would vary along with them. Therefore, within a few seconds, an entire family of related simulations could be run. More importantly, the operator could acquire an intuitive understanding of the system's dependence on its parameters.

The Eclipse of Analog Computing

A common view is that electronic analog computers were a primitive predecessor of the digital computer, and that their use was just a historical episode, or even a digression, in the inevitable triumph of digital technology. It is supposed that the current digital hegemony is a simple matter of technological superiority. However, the history is much more complicated, and involves a number of social, economic, historical, pedagogical, and also technical factors, which are outside the scope of this article (see Small [81] and Small [82], especially Chap. 8, for more information). In any case, beginning after World War II and continuing for twenty-five years, there was lively debate about the relative merits of analog and digital computation.

Speed was an oft-cited advantage of analog computers [Chap. 8 in 82]. While early digital computers were much faster than mechanical differential analyzers, they were slower (often by several orders of magnitude) than electronic analog computers. Furthermore, although digital computers could perform individual arithmetic operations rapidly, complete problems were solved sequentially, one operation at a time, whereas analog computers operated in parallel. Thus it was argued that increasingly large problems required more time to solve on a digital computer, whereas on an analog computer they might require more hardware but not more time. Even as digital computing speed was improved, analog computing retained its advantage for several decades, but this advantage eroded steadily.

Another important issue was the comparative precision of digital and analog computation [Chap. 8 in 82]. Analog computers typically computed with three or four digits of precision, and it was very expensive to do much better, due to the difficulty of manufacturing the parts and other factors. In contrast, digital computers could perform arithmetic operations with many digits of precision, and the hardware cost was approximately proportional to the number of digits. Against this, analog computing advocates argued that many problems did not require such high precision, because the measurements were known to only a few significant figures and the mathematical models were approximations.
Further, they distinguished between precision and accuracy, which refers to the conformity of the computation to physical reality, and they argued that digital computation was often less accurate than analog, due to numerical limitations (e. g., truncation, cumulative error in numerical integration). Nevertheless, some important applications, such as the calculation of missile trajectories, required greater precision, and for these, digital computation had the advantage. Indeed, to some extent precision was viewed as inherently desirable, even in applications where it was unimportant, and it was easily mistaken for accuracy. (See Sect. "Precision" for more on precision and accuracy.)

There was even a social factor involved, in that the written programs, precision, and exactness of digital computation were associated with mathematics and science, but the hands-on operation, parameter variation, and approximate solutions of analog computation were associated with engineers, and so analog computing inherited "the lower status of engineering vis-à-vis science" [p. 251 in 82]. Thus the status of digital computing was further enhanced as engineering became more mathematical and scientific after World War II [pp. 247–251 in 82].

Already by the mid-1950s the competition between analog and digital had evolved into the idea that they were complementary technologies. This resulted in the development of a variety of hybrid analog/digital computing systems [pp. 251–253, 263–266 in 82]. In some cases this involved using a digital computer to control an analog computer by using digital logic to connect the analog computing elements, set parameters, and gather data. This improved the accessibility and usability of analog computers, but had the disadvantage of distancing the user from the physical analog system. The intercontinental ballistic missile program in the USA stimulated the further development of hybrid computers in the late 1950s and 1960s [81]. These applications required the speed of analog computation to simulate the closed-loop control systems and the precision of digital computation for accurate computation of trajectories. However, by the early 1970s hybrids were being displaced by all-digital systems. Certainly part of the reason was the steady improvement in digital technology, driven by a vibrant digital computer industry, but contemporaries also pointed to an inaccurate perception that analog computing was obsolete and to a lack of education about the advantages and techniques of analog computing.
Another argument made in favor of digital computers was that they were general-purpose, since they could be used in business data processing and other application domains, whereas analog computers were essentially special-purpose, since they were limited to scientific computation [pp. 248–250 in 82]. Against this it was argued that all computing is essentially computing by analogy, and therefore analog computation was general-purpose because the class of analog computers included digital computers! (See also Sect. "Definition of the Subject" on computing by analogy.) Be that as it may, analog computation, as normally understood, is restricted to continuous variables, and so it was not immediately applicable to discrete data, such as that manipulated in business computing and other nonscientific applications. Therefore business (and eventually consumer) applications motivated the computer industry's investment in digital computer technology at the expense of analog technology.
Although it is commonly believed that analog computers quickly disappeared after digital computers became available, this is inaccurate, for both general-purpose and special-purpose analog computers have continued to be used in specialized applications to the present time. For example, a general-purpose electrical (vs. electronic) analog computer, the Anacom, was still in use in 1991. This is not technological atavism, for "there is no doubt considerable truth in the fact that Anacom continued to be used because it effectively met a need in a historically neglected but nevertheless important computer application area" [3]. As mentioned, the reasons for the eclipse of analog computing were not simply the technological superiority of digital computation; the conditions were much more complex. Therefore a change in conditions has necessitated a reevaluation of analog technology.

Analog VLSI

In the mid-1980s, Carver Mead, who already had made important contributions to digital VLSI technology, began to advocate for the development of analog VLSI [51,52]. His motivation was that "the nervous system of even a very simple animal contains computing paradigms that are orders of magnitude more effective than are those found in systems made by humans" and that they "can be realized in our most commonly available technology—silicon integrated circuits" [p. xi in 52]. However, he argued, since these natural computation systems are analog and highly non-linear, progress would require understanding neural information processing in animals and applying it in a new analog VLSI technology. Because analog computation is closer to the physical laws by which all computation is realized (which are continuous), analog circuits often use fewer devices than corresponding digital circuits. For example, a four-quadrant adder (capable of adding two signed numbers) can be fabricated from four transistors [pp.
87–88 in 52], and a four-quadrant multiplier from nine to seventeen, depending on the required range of operation [pp. 90–96 in 52]. Intuitions derived from digital logic about what is simple or complex to compute are often misleading when applied to analog computation. For example, two transistors are sufficient to compute the logarithm or exponential, five for the hyperbolic tangent (which is very useful in neural computation), and three for the square root [pp. 70–71, 97–99 in 52]. Thus analog VLSI is an attractive approach to "post-Moore's Law computing" (see Sect. "Future Directions" below). Mead and his colleagues demonstrated a number of analog VLSI devices inspired by the nervous system, including a "silicon retina" and an "electronic cochlea" [Chaps. 15–16 in 52], research that has led to a renaissance of interest in electronic analog computing.
Non-Electronic Analog Computation

As will be explained in the body of this article, analog computation suggests many opportunities for future computing technologies. Many physical phenomena are potential media for analog computation provided they have useful mathematical structure (i. e., the mathematical laws describing them are mathematical functions useful for general- or special-purpose computation), and they are sufficiently controllable for practical use.

Article Roadmap

The remainder of this article will begin by summarizing the fundamentals of analog computing, starting with the continuous state space and the various processes by which analog computation can be organized in time. Next it will discuss analog computation in nature, which provides models and inspiration for many contemporary uses of analog computation, such as neural networks. Then we consider general-purpose analog computing, both from a theoretical perspective and in terms of practical general-purpose analog computers. This leads to a discussion of the theoretical power of analog computation and in particular to the issue of whether analog computing is in some sense more powerful than digital computing. We briefly consider the cognitive aspects of analog computing, and whether it leads to a different approach to computation than does digital computing. Finally, we conclude with some observations on the role of analog computation in "post-Moore's Law computing".

Fundamentals of Analog Computing

Continuous State Space

As discussed in Sect. "Introduction", the fundamental characteristic that distinguishes analog from digital computation is that the state space is continuous in analog computation and discrete in digital computation. Therefore it might be more accurate to call analog and digital computation continuous and discrete computation, respectively. Furthermore, since the earliest days there have been hybrid computers that combine continuous and discrete state spaces and processes.
Thus, there are several respects in which the state space may be continuous. In the simplest case the state space comprises a finite (generally modest) number of variables, each holding a continuous quantity (e. g., voltage, current, charge). In a traditional GPAC they correspond to the variables in the ODEs defining the computational process, each typically having some independent meaning in the analysis of the problem. Mathematically, the variables are taken to contain bounded real numbers, although complex-valued
variables are also possible (e. g., in AC electronic analog computers). In a practical sense, however, their precision is limited by noise, stability, device tolerance, and other factors (discussed below, Sect. "Characteristics of Analog Computation"). In typical analog neural networks the state space is larger in dimension but more structured than in the former case. The artificial neurons are organized into one or more layers, each composed of a (possibly large) number of artificial neurons. Commonly each layer of neurons is densely connected to the next layer. In general the layers each have some meaning in the problem domain, but the individual neurons constituting them do not (and so, in mathematical descriptions, the neurons are typically numbered rather than named). The individual artificial neurons usually perform a simple computation such as this:

y = σ(s), where s = b + Σ_{i=1}^{n} w_i x_i,

and where y is the activity of the neuron, x_1, …, x_n are the activities of the neurons that provide its inputs, b is a bias term, and w_1, …, w_n are the weights or strengths of the connections. Often the activation function σ is a real-valued sigmoid ("S-shaped") function, such as the logistic sigmoid,

σ(s) = 1 / (1 + e^{−s}),

in which case the neuron activity y is a real number, but some applications use a discontinuous threshold function, such as the Heaviside function,

U(s) = 1 if s ≥ 0; U(s) = 0 if s < 0,

in which case the activity is a discrete quantity. The saturated-linear or piecewise-linear sigmoid is also used occasionally:

σ(s) = 1 if s > 1; σ(s) = s if 0 ≤ s ≤ 1; σ(s) = 0 if s < 0.

Regardless of whether the activation function is continuous or discrete, the bias b and connection weights w_1, …, w_n are real numbers, as is the "net input" s = b + Σ_i w_i x_i to the activation function. Analog computation may be used to evaluate the linear combination s and the activation function σ(s), if it is real-valued. The biases
and weights are normally determined by a learning algorithm (e. g., back-propagation), which is also a good candidate for analog implementation. In summary, the continuous state space of a neural network includes the bias values and net inputs of the neurons and the interconnection strengths between the neurons. It also includes the activity values of the neurons, if the activation function is a real-valued sigmoid function, as is often the case. Often large groups ("layers") of neurons (and the connections between these groups) have some intuitive meaning in the problem domain, but typically the individual neuron activities, bias values, and interconnection weights do not.

If we extrapolate the number of neurons in a layer to the continuum limit, we get a field, which may be defined as a continuous distribution of continuous quantity. Treating a group of artificial or biological neurons as a continuous mass is a reasonable mathematical approximation if their number is sufficiently large and if their spatial arrangement is significant (as it generally is in the brain). Fields are especially useful in modeling cortical maps, in which information is represented by the pattern of activity over a region of neural cortex. In field computation the state space is continuous in two ways: it is continuous in variation but also in space. Therefore, field computation is especially applicable to solving PDEs and to processing spatially extended information such as visual images. Some early analog computing devices were capable of field computation [pp. 114–17, 2-2–16 in 87]. For example, as previously mentioned (Sect. "Introduction"), large resistor and capacitor networks could be used for solving PDEs such as diffusion problems. In these cases a discrete ensemble of resistors and capacitors was used to approximate a continuous field, while in other cases the computing medium was spatially continuous.
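The artificial-neuron computation defined earlier, y = σ(b + Σᵢ wᵢxᵢ) with the logistic sigmoid, is easy to sketch digitally. An illustration with made-up weights, not from the article:

```python
import math

def neuron_activity(inputs, weights, bias):
    """y = sigma(s), where s = b + sum_i w_i * x_i and sigma is the
    logistic sigmoid sigma(s) = 1 / (1 + exp(-s))."""
    s = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-s))

# Made-up example: s = -1.5 + 2.0*1.0 + (-1.0)*0.5 = 0, so y = 0.5.
y = neuron_activity([1.0, 0.5], [2.0, -1.0], -1.5)
```

In an analog implementation the weighted sum and the sigmoid would be evaluated by physical processes rather than by this sequential arithmetic.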
Spatially continuous media made use of conductive sheets (for two-dimensional fields) or electrolytic tanks (for two- or three-dimensional fields). When they were applied to steady-state spatial problems, these analog computers were called field plotters or potential analyzers. The ability to fabricate very large arrays of analog computing devices, combined with the need to exploit massive parallelism in real-time computation and control applications, creates new opportunities for field computation [37,38,43]. There is also renewed interest in using physical fields in analog computation. For example, Rubel [73] defined an abstract extended analog computer (EAC), which augments Shannon's [77] general purpose analog computer with (unspecified) facilities for field computation, such as PDE solvers (see Sects. "Shannon's Analysis" – "Rubel's Extended Analog Computer" below). J. W. Mills
has explored the practical application of these ideas in his artificial neural field networks and VLSI EACs, which use the diffusion of electrons in bulk silicon or conductive gels and plastics for 2D and 3D field computation [53,54].

Computational Process

We have considered the continuous state space, which is the basis for analog computing, but there are a variety of ways in which analog computers can operate on the state. In particular, the state can change continuously in time or be updated at distinct instants (as in digital computation).

Continuous Time

Since the laws of physics on which analog computing is based are differential equations, many analog computations proceed in continuous real time. Also, as we have seen, an important application of analog computers in the late 19th and early 20th centuries was the integration of ODEs in which time is the independent variable. A common technique in analog simulation of physical systems is time scaling, in which the differential equations are altered systematically so the simulation proceeds either more slowly or more quickly than the primary system (see Sect. "Characteristics of Analog Computation" for more on time scaling). On the other hand, because analog computations are close to the physical processes that realize them, analog computing is rapid, which makes it very suitable for real-time control applications.

In principle, any mathematically describable physical process operating on time-varying physical quantities can be used for analog computation. In practice, however, analog computers typically provide familiar operations that scientists and engineers use in differential equations [70,87]. These include basic arithmetic operations, such as algebraic sum and difference (u(t) = v(t) ± w(t)), constant multiplication or scaling (u(t) = cv(t)), variable multiplication and division (u(t) = v(t)w(t), u(t) = v(t)/w(t)), and inversion (u(t) = −v(t)).
Transcendental functions may be provided, such as the exponential (u(t) = exp v(t)), logarithm (u(t) = ln v(t)), trigonometric functions (u(t) = sin v(t), etc.), and resolvers for converting between polar and rectangular coordinates. Most important, of course, is definite integration (u(t) = v₀ + ∫₀ᵗ v(τ) dτ), but differentiation may also be provided (u(t) = v̇(t)). Generally, however, direct differentiation is avoided, since noise tends to have a higher frequency than the signal, and therefore differentiation amplifies noise; typically problems are reformulated to avoid direct differentiation [pp. 26–27 in 89]. As previously mentioned, many GPACs include (arbitrary) function generators, which allow the use of functions defined
only by a graph and for which no mathematical definition might be available; in this way empirically defined functions can be used [pp. 32–42 in 70]. Thus, given a graph (x, f(x)), or a sufficient set of samples, (x_k, f(x_k)), the function generator approximates u(t) = f(v(t)). Rather less common are generators for arbitrary functions of two variables, u(t) = f(v(t), w(t)), in which the function may be defined by a surface, (x, y, f(x, y)), or by sufficient samples from it. Although analog computing is primarily continuous, there are situations in which discontinuous behavior is required. Therefore some analog computers provide comparators, which produce a discontinuous result depending on the relative value of two input values. For example,

u = k if v ≥ w,  u = 0 if v < w.

Typically, this would be implemented as a Heaviside (unit step) function applied to the difference of the inputs, u = k U(v − w). In addition to allowing the definition of discontinuous functions, comparators provide a primitive decision-making ability, and may be used, for example, to terminate a computation (switching the computer from "operate" to "hold" mode). Other operations that have proved useful in analog computation are time delays and noise generators [Chap. 7 in 31]. The function of a time delay is simply to retard the signal by an adjustable delay T > 0: u(t + T) = v(t). One common application is to model delays in the primary system (e. g., human response time). Typically a noise generator produces time-invariant Gaussian-distributed noise with zero mean and a flat power spectrum (over a band compatible with the analog computing process). The standard deviation can be adjusted by scaling, the mean can be shifted by addition, and the spectrum altered by filtering, as required by the application. Historically noise generators were used to model noise and other random effects in the primary system, to determine, for example, its sensitivity to effects such as turbulence.
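In a discrete-time simulation these auxiliary operations can be mimicked directly. The sketch below (function names invented for illustration; a real machine implements these with op amps, delay lines, and noise sources) shows a comparator, a sample-buffer time delay, and a Gaussian noise source:

```python
import random

def comparator(v, w, k=1.0):
    """Heaviside-style comparator: u = k*U(v - w)."""
    return k if v >= w else 0.0

def make_delay(T, dt, v0=0.0):
    """Time delay u(t + T) = v(t), realized as a FIFO of T/dt samples."""
    n = max(1, round(T / dt))
    buf = [v0] * n
    def delay(v):
        buf.append(v)
        return buf.pop(0)
    return delay

def noise(sigma=1.0, mu=0.0):
    """Gaussian noise; scale sets the deviation, addition shifts the mean."""
    return mu + sigma * random.gauss(0.0, 1.0)
```

For instance, `make_delay(0.3, 0.1)` returns a delay of three samples: its first three outputs are the initial value, after which inputs reappear three calls later.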
However, noise can make a positive contribution in some analog computing algorithms (e. g., for symmetry breaking and in simulated annealing, weight perturbation learning, and stochastic resonance). As already mentioned, some analog computing devices for the direct solution of PDEs have been developed. In general a PDE solver depends on an analogous physical process, that is, on a process obeying the same class of PDEs that it is intended to solve. For example, in Mills's EAC, diffusion of electrons in conductive sheets or solids is used to solve diffusion equations [53,54]. Historically, PDEs were solved on electronic GPACs by discretizing all but one of the independent variables, thus replacing the differential equations by difference equations [pp. 173–193 in 70]. That is, computation over a field was approximated by computation over a finite real array. Reaction-diffusion computation (see Reaction-Diffusion Computing) is an important example of continuous-time analog computing. The state is represented by a set of time-varying chemical concentration fields, c_1, …, c_n. These fields are distributed across a one-, two-, or three-dimensional space Ω, so that, for x ∈ Ω, c_k(x, t) represents the concentration of chemical k at location x and time t. Computation proceeds in continuous time according to reaction-diffusion equations, which have the form:

∂c/∂t = D∇²c + F(c),

where c = (c_1, …, c_n)ᵀ is the vector of concentrations, D = diag(d_1, …, d_n) is a diagonal matrix of positive diffusion rates, and F is a nonlinear vector function that describes how the chemical reactions affect the concentrations. Some neural net models operate in continuous time and thus are examples of continuous-time analog computation. For example, Grossberg [26,27,28] defines the activity of a neuron by differential equations such as this:

ẋ_i = −a_i x_i + Σ_{j=1}^n b_ij w^(+)_ij f_j(x_j) − Σ_{j=1}^n c_ij w^(−)_ij g_j(x_j) + I_i .

This describes the continuous change in the activity of neuron i resulting from passive decay (first term), positive feedback from other neurons (second term), negative feedback (third term), and input (last term). The f_j and g_j are nonlinear activation functions, and the w^(+)_ij and w^(−)_ij are adaptable excitatory and inhibitory connection strengths, respectively. The continuous Hopfield network is another example of continuous-time analog computation [30]. The output y_i of a neuron is a nonlinear function of its internal state x_i, y_i = σ(x_i), where the hyperbolic tangent is usually used as the activation function, σ(x) = tanh x, because its range is [−1, 1]. The internal state is defined by a differential equation,

τ_i ẋ_i = −a_i x_i + b_i + Σ_{j=1}^n w_ij y_j ,

where τ_i is a time constant, a_i is the decay rate, b_i is the bias, and w_ij is the connection weight to neuron i from
neuron j. In a Hopfield network every neuron is symmetrically connected to every other (w_ij = w_ji) but not to itself (w_ii = 0). Of course analog VLSI implementations of neural networks also operate in continuous time (e. g., [20,52]). Concurrent with the resurgence of interest in analog computation have been innovative reconceptualizations of continuous-time computation. For example, Brockett [12] has shown that dynamical systems can solve a number of problems normally considered to be intrinsically sequential. In particular, a certain system of ODEs (a nonperiodic finite Toda lattice) can sort a list of numbers by continuous-time analog computation. The system is started with the vector x equal to the values to be sorted and a vector y initialized to small nonzero values; the y vector converges to a sorted permutation of x. Sequential Time computation refers to computation in which discrete computational operations take place in succession but at no definite interval [88]. Ordinary digital computer programs take place in sequential time, for the operations occur one after another, but the individual operations are not required to have any specific duration, so long as they take finite time. One of the oldest examples of sequential analog computation is provided by the compass-and-straightedge constructions of traditional Euclidean geometry (Sect. "Introduction"). These computations proceed by a sequence of discrete operations, but the individual operations involve continuous representations (e. g., compass settings, straightedge positions) and operate on a continuous state (the figure under construction). Slide rule calculation might seem to be an example of sequential analog computation, but on closer inspection we see that although the operations are performed by an analog device, the intermediate results are recorded digitally (and so this part of the state space is discrete). Thus it is a kind of hybrid computation.
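The continuous Hopfield dynamics above can be illustrated by direct numerical integration. The following is an Euler sketch with arbitrary made-up weights (a real analog implementation integrates these equations in physical time rather than stepping them):

```python
import numpy as np

def hopfield_step(x, W, a, b, tau, dt):
    """One Euler step of tau_i x'_i = -a_i x_i + b_i + sum_j W_ij y_j, y = tanh(x)."""
    y = np.tanh(x)
    return x + dt * (-a * x + b + W @ y) / tau

# Symmetric weights with zero diagonal, as the text requires (w_ij = w_ji, w_ii = 0).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
a = np.ones(4); b = np.zeros(4); tau = np.ones(4)

x = rng.normal(size=4)
for _ in range(20_000):                 # integrate to t = 200
    x = hopfield_step(x, W, a, b, tau, dt=0.01)

# With symmetric weights the state settles into a fixed-point attractor, x' ≈ 0.
residual = np.abs(-a * x + b + W @ np.tanh(x))
```

The symmetry of W is what guarantees convergence to a fixed point (the dynamics descend an energy function), which is why the residual of the right-hand side shrinks toward zero.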
The familiar digital computer automates sequential digital computations that once were performed manually by human "computers". Sequential analog computation can be similarly automated. That is, just as the control unit of an ordinary digital computer sequences digital computations, so a digital control unit can sequence analog computations. In addition to the analog computation devices (adders, multipliers, etc.), such a computer must provide variables and registers capable of holding continuous quantities between the sequential steps of the computation (see also Sect. "Discrete Time" below). The primitive operations of sequential-time analog computation are typically similar to those in continuous-time computation (e. g., addition, multiplication, transcendental functions), but integration and differentiation with respect to sequential time do not make sense. However, continuous-time integration within a single step, and space-domain integration, as in PDE solvers or field computation devices, are compatible with sequential analog computation. In general, any model of digital computation can be converted to a similar model of sequential analog computation by changing the discrete state space to a continuum, and making appropriate changes to the rest of the model. For example, we can make an analog Turing machine by allowing it to write a bounded real number (rather than a symbol from a finite alphabet) onto a tape cell. The Turing machine's finite control can be altered to test for tape markings in some specified range. Similarly, in a series of publications Blum, Shub, and Smale developed a theory of computation over the reals, which is an abstract model of sequential-time analog computation [6,7]. In this "BSS model" programs are represented as flowcharts, but they are able to operate on real-valued variables. Using this model they were able to prove a number of theorems about the complexity of sequential analog algorithms. The BSS model, and some other sequential analog computation models, assume that it is possible to make exact comparisons between real numbers (analogous to exact comparisons between integers or discrete symbols in digital computation) and to use the result of the comparison to control the path of execution. Comparisons of this kind are problematic because they imply infinite precision in the comparator (which may be defensible in a mathematical model but is impossible in physical analog devices), and because they make the execution path a discontinuous function of the state (whereas analog computation is usually continuous). Indeed, it has been argued that this is not "true" analog computation [p. 148 in 78].
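A toy interpreter can make the flowchart idea concrete. The node format and names below are invented for this sketch (they are not the BSS formalism itself); note how the branch node relies on exactly the kind of exact real comparison the text flags as physically unrealizable:

```python
def run_bss(program, state, start, max_steps=10_000):
    """Interpret a flowchart over real-valued registers.

    program maps a node name to ('halt',), ('compute', f, next_node),
    or ('branch', predicate, if_true, if_false).
    """
    node = start
    for _ in range(max_steps):
        kind = program[node][0]
        if kind == 'halt':
            return state
        if kind == 'compute':
            _, f, node = program[node]
            state = f(state)
        else:  # branch: exact comparison on reals steers execution
            _, pred, t_node, f_node = program[node]
            node = t_node if pred(state) else f_node
    raise RuntimeError('program did not halt')

# Newton's method for sqrt(c), written as a flowchart over real registers.
sqrt_prog = {
    'test': ('branch', lambda s: abs(s['x'] * s['x'] - s['c']) < 1e-12, 'done', 'step'),
    'step': ('compute', lambda s: {**s, 'x': (s['x'] + s['c'] / s['x']) / 2}, 'test'),
    'done': ('halt',),
}
result = run_bss(sqrt_prog, {'c': 2.0, 'x': 1.0}, 'test')
```

The whole program is a finite discrete structure, yet every register holds an arbitrary real, which is the essential character of sequential-time analog computation.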
Many artificial neural network models are examples of sequential-time analog computation. In a simple feed-forward neural network, an input vector is processed by the layers in order, as in a pipeline. That is, the output of layer n becomes the input of layer n + 1. Since the model does not make any assumptions about the amount of time it takes a vector to be processed by each layer and to propagate to the next, execution takes place in sequential time. Most recurrent neural networks, which have feedback, also operate in sequential time, since the activities of all the neurons are updated synchronously (that is, signals propagate through the layers, or back to earlier layers, in lockstep). Many artificial neural-net learning algorithms are also sequential-time analog computations. For example, the
back-propagation algorithm updates a network's weights, moving sequentially backward through the layers. In summary, the correctness of sequential time computation (analog or digital) depends on the order of operations, not on their duration, and similarly the efficiency of sequential computations is evaluated in terms of the number of operations, not on their total duration. Discrete Time analog computation has similarities to both continuous-time and sequential analog computation. Like the latter, it proceeds by a sequence of discrete (analog) computation steps; like the former, these steps occur at a constant rate in real time (e. g., some "frame rate"). If the real-time rate is sufficient for the application, then discrete-time computation can approximate continuous-time computation (including integration and differentiation). Some electronic GPACs implemented discrete-time analog computation by a modification of repetitive operation mode, called iterative analog computation [Chap. 9 in 2]. Recall (Sect. "Electronic Analog Computation in the 20th Century") that in repetitive operation mode a clock rapidly switched the computer between reset and compute modes, thus repeating the same analog computation, but with different parameters (set by the operator). However, each repetition was independent of the others. Iterative operation was different in that analog values computed by one iteration could be used as initial values in the next. This was accomplished by means of an analog memory circuit (based on an op amp) that sampled an analog value at the end of one compute cycle (effectively during hold mode) and used it to initialize an integrator during the following reset cycle. (A modified version of the memory circuit could be used to retain a value over several iterations.) Iterative computation was used for problems such as determining, by iterative search or refinement, the initial conditions that would lead to a desired state at a future time.
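Digitally, this iterative-search use of iterative operation resembles a shooting method: each compute cycle integrates an ODE from a trial initial condition, and the sampled end value steers the next trial. A minimal sketch (the ODE, parameters, and function names are all invented for illustration):

```python
def compute_cycle(x0, T=1.0, dt=0.001, k=1.0):
    """One 'compute cycle': integrate x' = -k*x from x(0) = x0 and sample x(T)."""
    x = x0
    for _ in range(int(T / dt)):
        x += dt * (-k * x)
    return x

def find_initial_condition(target, lo=0.0, hi=100.0, cycles=60):
    """Iterative refinement (bisection) of x0 so that the sampled x(T) hits target.

    Each pass plays the role of one iterative-operation cycle: run, sample,
    use the result to set up the next run.
    """
    for _ in range(cycles):
        mid = (lo + hi) / 2
        if compute_cycle(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x0 = find_initial_condition(target=5.0)
```

Bisection works here because the end state grows monotonically with the initial condition; the analog machine did the same search with the inner integration performed in physical time.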
Since the analog computations were iterated at a fixed clock rate, iterative operation is an example of discrete-time analog computation. However, the clock rate is not directly relevant in some applications (such as the iterative solution of boundary value problems), in which case iterative operation is better characterized as sequential analog computation. The principal contemporary examples of discrete-time analog computing are in neural network applications to time-series analysis and (discrete-time) control. In each of these cases the input to the neural net is a sequence of discrete-time samples, which propagate through the net and generate discrete-time output signals. Many of these neural nets are recurrent, that is, values from later layers
are fed back into earlier layers, which allows the net to remember information from one sample to the next. Analog Computer Programs The concept of a program is central to digital computing, both practically, for it is the means for programming general-purpose digital computers, and theoretically, for it defines the limits of what can be computed by a universal machine, such as a universal Turing machine. Therefore it is important to discuss means for describing or specifying analog computations. Traditionally, analog computers were used to solve ODEs (and sometimes PDEs), and so in one sense a mathematical differential equation is one way to represent an analog computation. However, since the equations were usually not suitable for direct solution on an analog computer, the process of programming involved the translation of the equations into a schematic diagram showing how the analog computing devices (integrators etc.) should be connected to solve the problem. These diagrams are the closest analogies to digital computer programs and may be compared to flowcharts, which were once popular in digital computer programming. It is worth noting, however, that flowcharts (and ordinary computer programs) represent sequences among operations, whereas analog computing diagrams represent functional relationships among variables, and therefore a kind of parallel data flow. Differential equations and schematic diagrams are suitable for continuous-time computation, but for sequential analog computation something more akin to a conventional digital program can be used. Thus, as previously discussed (Sect. “Sequential Time”), the BSS system uses flowcharts to describe sequential computations over the reals. Similarly, C. Moore [55] defines recursive functions over the reals by means of a notation similar to a programming language. 
In principle any sort of analog computation might involve constants that are arbitrary real numbers, which therefore might not be expressible in finite form (e. g., as a finite string of digits). Although this is of theoretical interest (see Sect. "Real-valued Inputs, Outputs, and Constants" below), from a practical standpoint these constants could be set with at most about four digits of precision [p. 11 in 70]. Indeed, automatic potentiometer-setting devices were constructed that read a series of decimal numerals from punched paper tape and used them to set the potentiometers for the constants [pp. 3-58–60 in 87]. Nevertheless it is worth observing that analog computers do allow continuous inputs that need not be expressed in digital notation, for example, when the parameters of a simulation are continuously varied by the operator. In principle, therefore, an analog program can incorporate constants that are represented by a real-valued physical quantity (e. g., an angle or a distance), which need not be expressed digitally. Further, as we have seen (Sect. "Electronic Analog Computation in the 20th Century"), some electronic analog computers could compute a function by means of an arbitrarily drawn curve, that is, one not represented by an equation or a finite set of digitized points. Therefore, in the context of analog computing it is natural to expand the concept of a program beyond discrete symbols to include continuous representations (scalar magnitudes, vectors, curves, shapes, surfaces, etc.). Typically such continuous representations would be used as adjuncts to conventional discrete representations of the analog computational process, such as equations or diagrams. However, in some cases the most natural static representation of the process is itself continuous, in which case it is more like a "guiding image" than a textual prescription [42]. A simple example is a potential surface, which defines a continuum of trajectories from initial states (possible inputs) to fixed-point attractors (the results of the computations). Such a "program" may define a deterministic computation (e. g., if the computation proceeds by gradient descent), or it may constrain a nondeterministic computation (e. g., if the computation may proceed by any potential-decreasing trajectory). Thus analog computation suggests a broadened notion of programs and programming. Characteristics of Analog Computation Precision Analog computation is evaluated in terms of both accuracy and precision, but the two must be distinguished carefully [pp. 25–28 in 2, pp. 12–13 in 89, pp. 257–261 in 82].
Accuracy refers primarily to the relationship between a simulation and the primary system it is simulating or, more generally, to the relationship between the results of a computation and the mathematically correct result. Accuracy is a result of many factors, including the mathematical model chosen, the way it is set up on a computer, and the precision of the analog computing devices. Precision, therefore, is a narrower notion, which refers to the quality of a representation or computing device. In analog computing, precision depends on resolution (fineness of operation) and stability (absence of drift), and may be measured as a fraction of the represented value. Thus a precision of 0.01% means that the representation will stay within 0.01% of the represented value for a reasonable period of time. For purposes of comparing analog devices, the precision is usually expressed as a fraction of full-scale
variation (i. e., the difference between the maximum and minimum representable values). It is apparent that the precision of analog computing devices depends on many factors. One is the choice of physical process and the way it is utilized in the device. For example a linear mathematical operation can be realized by using a linear region of a nonlinear physical process, but the realization will be approximate and have some inherent imprecision. Also, associated, unavoidable physical effects (e. g., loading, and leakage and other losses) may prevent precise implementation of an intended mathematical function. Further, there are fundamental physical limitations to resolution (e. g., quantum effects, diffraction). Noise is inevitable, both intrinsic (e. g., thermal noise) and extrinsic (e. g., ambient radiation). Changes in ambient physical conditions, such as temperature, can affect the physical processes and decrease precision. At slower time scales, materials and components age and their physical characteristics change. In addition, there are always technical and economic limits to the control of components, materials, and processes in analog device fabrication. The precision of analog and digital computing devices depends on very different factors. The precision of a (binary) digital device depends on the number of bits, which influences the amount of hardware, but not its quality. For example, a 64-bit adder is about twice the size of a 32-bit adder, but can be made out of the same components. At worst, the size of a digital device might increase with the square of the number of bits of precision. This is because binary digital devices only need to represent two states, and therefore they can operate in saturation. The fabrication standards sufficient for the first bit of precision are also sufficient for the 64th bit. Analog devices, in contrast, need to be able to represent a continuum of states precisely.
Therefore, the fabrication of high-precision analog devices is much more expensive than that of low-precision devices, since the quality of components, materials, and processes must be much more carefully controlled. Doubling the precision of an analog device may be expensive, whereas the cost of each additional bit of digital precision is incremental; that is, the cost is proportional to the logarithm of the precision expressed as a fraction of full range. The foregoing considerations might seem to be a convincing argument for the superiority of digital to analog technology, and indeed they were an important factor in the competition between analog and digital computers in the middle of the twentieth century [pp. 257–261 in 82]. However, as was argued at that time, many computer applications do not require high precision. Indeed, in many engineering applications, the input data are known to only a few digits, and the equations may be approximate or derived from experiments. In these cases the very high precision of digital computation is unnecessary and may in fact be misleading (e. g., if one displays all 14 digits of a result that is accurate to only three). Furthermore, many applications in image processing and control do not require high precision. More recently, research in artificial neural networks (ANNs) has shown that low-precision analog computation is sufficient for almost all ANN applications. Indeed, neural information processing in the brain seems to operate with very low precision (perhaps less than 10% [p. 378 in 50]), for which it compensates with massive parallelism. For example, by coarse coding a population of low-precision devices can represent information with relatively high precision [pp. 91–96 in 74, 75]. Scaling An important aspect of analog computing is scaling, which is used to adjust a problem to an analog computer. First is time scaling, which adjusts a problem to the characteristic time scale at which a computer operates, which is a consequence of its design and the physical processes by which it is realized [pp. 37–44 in 62, pp. 262–263 in 70, pp. 241–243 in 89]. For example, we might want a simulation to proceed on a very different time scale from the primary system. Thus a weather or economic simulation should proceed faster than real time in order to get useful predictions. Conversely, we might want to slow down a simulation of protein folding so that we can observe the stages in the process. Also, for accurate results it is necessary to avoid exceeding the maximum response rate of the analog devices, which might dictate a slower simulation speed. On the other hand, too slow a computation might be inaccurate as a consequence of instability (e. g., drift and leakage in the integrators). Time scaling affects only time-dependent operations such as integration. For example, suppose t, time in the primary system or "problem time", is related to τ, time in the computer, by τ = βt.
Therefore, an integration u(t) = ∫₀ᵗ v(t′) dt′ in the primary system is replaced by the integration u(τ) = β⁻¹ ∫₀^τ v(τ′) dτ′ on the computer. Thus time scaling may be accomplished simply by decreasing the input gain to the integrator by a factor of β. Fundamental to analog computation is the representation of a continuous quantity in the primary system by a continuous quantity in the computer. For example, a displacement x in meters might be represented by a potential V in volts. The two are related by an amplitude or magnitude scale factor, V = αx (with units volts/meter), chosen to meet two criteria [pp. 103–106 in 2, Chap. 4 in 62, pp. 127–128 in 70, pp. 233–240 in 89]. On the one hand, α must be sufficiently small so that the range of the problem variable is accommodated within the range of values supported by the computing device. Exceeding the device's intended operating range may lead to inaccurate results (e. g., forcing a linear device into nonlinear behavior). On the other hand, the scale factor should not be too small, or relevant variation in the problem variable will be less than the resolution of the device, also leading to inaccuracy. (Recall that precision is specified as a fraction of full-range variation.) In addition to the explicit variables of the primary system, there are implicit variables, such as the time derivatives of the explicit variables, and scale factors must be chosen for them too. For example, in addition to displacement x, a problem might include velocity ẋ and acceleration ẍ. Therefore, scale factors α, α′, and α″ must be chosen so that αx, α′ẋ, and α″ẍ have an appropriate range of variation (neither too large nor too small). Once the scale factors have been chosen, the primary system equations are adjusted to obtain the analog computing equations. For example, if we have scaled u = αx and v = α′ẋ, then the integration x(t) = ∫₀ᵗ ẋ(t′) dt′ would be computed by the scaled equation

u(t) = (α/α′) ∫₀ᵗ v(t′) dt′ .

This is accomplished by simply setting the input gain of the integrator to α/α′. In practice, time scaling and magnitude scaling are not independent [p. 262 in 70]. For example, if the derivatives of a variable can be large, then the variable can change rapidly, and so it may be necessary to slow down the computation to avoid exceeding the high-frequency response of the computer. Conversely, small derivatives might require the computation to be run faster to avoid integrator leakage etc. Appropriate scale factors are determined by considering both the physics and the mathematics of the problem [pp. 40–44 in 62]. That is, first, the physics of the primary system may limit the ranges of the variables and their derivatives.
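The gain rule for magnitude scaling is easy to check numerically. In the sketch below (hypothetical scale factors; Euler integration stands in for the analog integrator), the primary variable is x = sin t, the scaled derivative v = α′ẋ is fed to an integrator with input gain α/α′, and the output tracks u = αx:

```python
import math

alpha, alpha_p = 10.0, 2.0   # scale factors for x and its derivative (e.g., volts per unit)
dt, T = 1e-4, 1.0

# Primary system: x(t) = sin t, so x'(t) = cos t and x(0) = 0.
u = 0.0                      # u = alpha * x, starting from x(0) = 0
t = 0.0
for _ in range(int(T / dt)):
    v = alpha_p * math.cos(t)          # scaled derivative v = alpha' * x'
    u += dt * (alpha / alpha_p) * v    # integrator with input gain alpha/alpha'
    t += dt
# u now approximates alpha * sin(T), i.e., the scaled displacement alpha * x(T)
```

Note that the α′ factors cancel inside the loop, which is exactly why a single gain setting on the integrator suffices on a real machine.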
Second, analysis of the mathematical equations describing the system can give additional information on the ranges of the variables. For example, in some cases the natural frequency of a system can be estimated from the coefficients of the differential equations; the maximum of the nth derivative is then estimated as the nth power of this frequency [p. 42 in 62, pp. 238–240 in 89]. In any case, it is not necessary to have accurate values for the ranges; rough estimates giving orders of magnitude are adequate. It is tempting to think of magnitude scaling as a problem unique to analog computing, but before the invention of floating-point numbers it was also necessary in digital computer programming. In any case it is an essential
aspect of analog computing, in which physical processes are more directly used for computation than they are in digital computing. Although the necessity of scaling has been a source of criticism, advocates for analog computing have argued that it is a blessing in disguise, because it leads to improved understanding of the primary system, which was often the goal of the computation in the first place [5, Chap. 8 in 82]. Practitioners of analog computing are more likely to have an intuitive understanding of both the primary system and its mathematical description (see Sect. "Analog Thinking"). Analog Computation in Nature Computational processes—that is to say, information processing and control—occur in many living systems, most obviously in nervous systems, but also in the self-organized behavior of groups of organisms. In most cases natural computation is analog, either because it makes use of continuous natural processes, or because it makes use of discrete but stochastic processes. Several examples will be considered briefly. Neural Computation In the past neurons were thought of as binary computing devices, something like digital logic gates. This was a consequence of the "all or nothing" response of a neuron, which refers to the fact that it does or does not generate an action potential (voltage spike) depending, respectively, on whether its total input exceeds a threshold or not (more accurately, it generates an action potential if the membrane depolarization at the axon hillock exceeds the threshold and the neuron is not in its refractory period). Certainly some neurons (e. g., so-called "command neurons") do act something like logic gates. However, most neurons are analyzed better as analog devices, because the rate of impulse generation represents significant information.
In particular, an amplitude code, the membrane potential near the axon hillock (which is a summation of the electrical influences on the neuron), is translated into a rate code for more reliable long-distance transmission along the axons. Nevertheless, the code is low precision (about one digit), since information theory shows that it takes at least N milliseconds (and probably more like 5N ms) to discriminate N values [39]. The rate code is translated back to an amplitude code by the synapses, since successive impulses release neurotransmitter from the axon terminal, which diffuses across the synaptic cleft to receptors. Thus a synapse acts as a leaky integrator to time-average the impulses. As previously discussed (Sect. “Continuous State Space”), many artificial neural net models have real-valued
neural activities, which correspond to rate-encoded axonal signals of biological neurons. On the other hand, these models typically treat the input connections as simple real-valued weights, which ignores the analog signal processing that takes place in the dendritic trees of biological neurons. The dendritic trees of many neurons are complex structures, which often have thousands of synaptic inputs. The binding of neurotransmitters to receptors causes minute voltage fluctuations, which propagate along the membrane, and ultimately cause voltage fluctuations at the axon hillock, which influence the impulse rate. Since the dendrites have both resistance and capacitance, to a first approximation the signal propagation is described by the "cable equations", which describe passive signal propagation in cables of specified diameter, capacitance, and resistance [Chap. 1 in 1]. Therefore, to a first approximation, a neuron's dendritic net operates as an adaptive linear analog filter with thousands of inputs, and so it is capable of quite complex signal processing. More accurately, however, it must be treated as a nonlinear analog filter, since voltage-gated ion channels introduce nonlinear effects. The extent of analog signal processing in dendritic trees is still poorly understood. In most cases, then, neural information processing is treated best as low-precision analog computation. Although individual neurons have quite broadly tuned responses, accuracy in perception and sensorimotor control is achieved through coarse coding, as already discussed (Sect. "Characteristics of Analog Computation"). Further, one widely used neural representation is the cortical map, in which neurons are systematically arranged in accord with one or more dimensions of their stimulus space, so that stimuli are represented by patterns of activity over the map.
(Examples are tonotopic maps, in which pitch is mapped to cortical location, and retinotopic maps, in which cortical location represents retinal location.) Since neural density in the cortex is at least 146 000 neurons per square millimeter [p. 51 in 14], even relatively small cortical maps can be treated as fields and information processing in them as analog field computation. Overall, the brain demonstrates what can be accomplished by massively parallel analog computation, even if the individual devices are comparatively slow and of low precision. Adaptive Self-Organization in Social Insects Another example of analog computation in nature is provided by the self-organizing behavior of social insects, microorganisms, and other populations [13]. Often such organisms respond to concentrations, or gradients in the concentrations, of chemicals produced by other members
of the population. These chemicals may be deposited and diffuse through the environment. In other cases, insects and other organisms communicate by contact, but may maintain estimates of the relative proportions of different kinds of contacts. Because the quantities are effectively continuous, all these are examples of analog control and computation. Self-organizing populations provide many informative examples of the use of natural processes for analog information processing and control. For example, diffusion of pheromones is a common means of self-organization in insect colonies, facilitating the creation of paths to resources, the construction of nests, and many other functions [13]. Real diffusion (as opposed to sequential simulations of it) executes, in effect, a massively parallel search of paths from the chemical’s source to its recipients and allows the identification of near-optimal paths. Furthermore, if the chemical degrades, as is generally the case, then the system will be adaptive, in effect continually searching out the shortest paths, so long as the source continues to function [13]. Simulated diffusion has been applied to robot path planning [32,69]. Genetic Circuits Another example of natural analog computing is provided by the genetic regulatory networks that control the behavior of cells, in multicellular organisms as well as single-celled ones [16]. These networks are defined by the mutually interdependent regulatory genes, promoters, and repressors that control the internal and external behavior of a cell. The interdependencies are mediated by proteins, the synthesis of which is governed by genes, and which in turn regulate the synthesis of other gene products (or themselves). Since it is the quantities of these substances that are relevant, many of the regulatory motifs can be described in computational terms as adders, subtracters, integrators, etc. Thus the genetic regulatory network implements an analog control system for the cell [68].
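The description of regulatory motifs as analog computational elements can be made concrete with a small simulation. The following sketch is not from the cited literature; the function name and all rate constants are hypothetical, chosen only for illustration. It integrates a one-gene repression motif, d[P]/dt = α/(1 + (R/K)^n) − γ[P], and shows that it behaves as a leaky integrator whose steady state is a smooth, analog function of the repressor concentration R.

```python
# Sketch (hypothetical constants): a one-gene repression motif treated as an
# analog computing element. Protein P is synthesized at a rate set by a
# repressor concentration R through a Hill function, and degraded at rate
# gamma, so d[P]/dt = alpha / (1 + (R/K)**n) - gamma * P.

def simulate_motif(R, alpha=1.0, K=0.5, n=2, gamma=0.1, dt=0.01, T=200.0):
    """Euler-integrate the motif and return the (near steady-state) protein level."""
    P = 0.0
    for _ in range(int(T / dt)):
        synthesis = alpha / (1.0 + (R / K) ** n)  # repressed synthesis rate
        P += dt * (synthesis - gamma * P)         # leaky integration
    return P

low = simulate_motif(R=0.0)   # no repressor: settles near alpha/gamma = 10
high = simulate_motif(R=5.0)  # strong repression: settles near zero
```

The steady state, (α/γ)·1/(1 + (R/K)^n), varies continuously with R, which is the sense in which the motif computes an analog function of its input concentration.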
It might be argued that the number of intracellular molecules of a particular protein is a (relatively small) discrete number, and therefore that it is inaccurate to treat it as a continuous quantity. However, the molecular processes in the cell are stochastic, and so the relevant quantity is the probability that a regulatory protein will bind to a regulatory site. Further, the processes take place in continuous real time, and so the rates are generally the significant quantities. Finally, although in some cases gene activity is either on or off (more accurately: very low), in other cases it varies continuously between these extremes [pp. 388–390 in 29].
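The point about stochasticity can also be illustrated numerically. In the deliberately minimal two-state sketch below (hypothetical rates, not a full chemical simulation), a single regulatory site flips between bound and unbound states; although its instantaneous state is discrete, its time-averaged occupancy k_on/(k_on + k_off) is an effectively continuous quantity, and it is this quantity to which the cell's dynamics respond.

```python
import random

# Sketch (hypothetical rates): a single regulatory site flips between bound
# (1) and unbound (0). The state is discrete, but the time-averaged
# occupancy k_on / (k_on + k_off) varies continuously with the rates.

def mean_occupancy(k_on, k_off, steps=200_000, dt=0.01, seed=1):
    rng = random.Random(seed)
    bound, total = 0, 0
    for _ in range(steps):
        if bound:
            if rng.random() < k_off * dt:   # unbinding event
                bound = 0
        elif rng.random() < k_on * dt:      # binding event
            bound = 1
        total += bound
    return total / steps

occ = mean_occupancy(k_on=1.0, k_off=1.0)   # expected occupancy near 0.5
```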
Embryological development combines the analog control of individual cells with the sort of self-organization of populations seen in social insects and other colonial organisms. Locomotion of the cells and the expression of specific genes are controlled by chemical signals, among other mechanisms [16,17]. Thus PDEs have proved useful in explaining some aspects of development; for example, reaction-diffusion equations have been used to describe the formation of hair-coat patterns and related phenomena [13,48,57]; see Reaction-Diffusion Computing. Therefore the developmental process is governed by naturally occurring analog computation. Is Everything a Computer? It might seem that any continuous physical process could be viewed as analog computation, which would make the term almost meaningless. As the question has been put, is it meaningful (or useful) to say that the solar system is computing Kepler’s laws? In fact, it is possible and worthwhile to make a distinction between computation and other physical processes that happen to be described by mathematical laws [40,41,44,46]. If we recall the original meaning of analog computation (Sect. “Definition of the Subject”), we see that the computational system is used to solve some mathematical problem with respect to a primary system. What makes this possible is that the computational system and the primary system have the same, or systematically related, abstract (mathematical) structures. Thus the computational system can inform us about the primary system, or be used to control it, etc. Although from a practical standpoint some analogs are better than others, in principle any physical system can be used that obeys the same equations as the primary system. Based on these considerations we may define computation as a physical process the purpose of which is the abstract manipulation of abstract objects (i. e., information processing); this definition applies to analog, digital, and hybrid computation [40,41,44,46].
Therefore, to determine if a natural system is computational we need to look to its purpose or function within the context of the living system of which it is a part. One test of whether its function is the abstract manipulation of abstract objects is to ask whether it could still fulfill its function if realized by different physical processes, a property called multiple realizability. (Similarly, in artificial systems, a simulation of the economy might be realized equally accurately by a hydraulic analog computer or an electronic analog computer [5].) By this standard, the majority of the nervous system is purely computational; in principle it could be
replaced by electronic devices obeying the same differential equations. In the other cases we have considered (self-organization of living populations, genetic circuits) there are instances of both pure computation and computation mixed with other functions (for example, where the specific substances used have other – e. g. metabolic – roles in the living system). General-Purpose Analog Computation The Importance of General-Purpose Computers Although special-purpose analog and digital computers have been developed, and continue to be developed, for many purposes, the importance of general-purpose computers, which can be adapted easily for a wide variety of purposes, has been recognized since at least the nineteenth century. Babbage’s plans for a general-purpose digital computer, his analytical engine (1835), are well known, but a general-purpose differential analyzer was advocated by Kelvin [84]. Practical general-purpose analog and digital computers were first developed at about the same time: from the early 1930s through the war years. General-purpose computers of both kinds permit the prototyping of special-purpose computers and, more importantly, permit the flexible reuse of computer hardware for different or evolving purposes. The concept of a general-purpose computer is useful also for determining the limits of a computing paradigm. If one can design—theoretically or practically—a universal computer, that is, a general-purpose computer capable of simulating any computer in a relevant class, then anything uncomputable by the universal computer will also be uncomputable by any computer in that class. This is, of course, the approach used to show that certain functions are uncomputable by any Turing machine because they are uncomputable by a universal Turing machine. For the same reason, the concepts of general-purpose analog computers, and in particular of universal analog computers, are theoretically important for establishing limits to analog computation.
General-Purpose Electronic Analog Computers Before taking up these theoretical issues, it is worth recalling that a typical electronic GPAC would include linear elements, such as adders, subtracters, constant multipliers, integrators, and differentiators; nonlinear elements, such as variable multipliers and function generators; other computational elements, such as comparators, noise generators, and delay elements (Sect. “Electronic Analog Computation in the 20th Century”). These are, of
course, in addition to input/output devices, which would not affect its computational abilities. Shannon’s Analysis Claude Shannon did an important analysis of the computational capabilities of the differential analyzer, which applies to many GPACs [76,77]. He considered an abstract differential analyzer equipped with an unlimited number of integrators, adders, constant multipliers, and function generators (for functions with only a finite number of finite discontinuities), with at most one source of drive (which limits possible interconnections between units). This was based on prior work that had shown that almost all the generally used elementary functions could be generated with addition and integration. We will summarize informally a few of Shannon’s results; for details, please consult the original paper. First Shannon proves that, by setting up the correct ODEs, a GPAC with the mentioned facilities can generate a function if and only if it is not hypertranscendental (Theorem II); thus the GPAC can generate any differentially algebraic function (a very large class, including the elementary algebraic and transcendental functions), but not, for example, Euler’s gamma function or Riemann’s zeta function. He also shows that the GPAC can generate functions derived from generable functions, such as the integrals, derivatives, inverses, and compositions of generable functions (Thms. III, IV). These results can be generalized to functions of any number of variables, and to their compositions, partial derivatives, and inverses with respect to any one variable (Thms. VI, VII, IX, X). Next Shannon shows that a function of any number of variables that is continuous over a closed region of space can be approximated arbitrarily closely over that region with a finite number of adders and integrators (Thms. V, VIII). Shannon then turns from the generation of functions to the solution of ODEs and shows that the GPAC can solve any system of ODEs defined in terms of non-hypertranscendental functions (Thm. XI).
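Shannon's notion of generating a function by setting up ODEs among integrators can be illustrated digitally. The sketch below is an illustrative Euler approximation, not part of Shannon's analysis: it wires two integrators in a feedback loop to realize y′ = z, z′ = −y, which generates y = sin t from the initial settings y(0) = 0, z(0) = 1.

```python
import math

# Two integrators in a feedback loop realize the ODE system
#   y' = z,  z' = -y,
# whose solution from y(0) = 0, z(0) = 1 is y = sin(t), z = cos(t).
# A real GPAC integrates continuously in time; here the same wiring is
# approximated with small explicit Euler steps.

def gpac_sine(t, dt=1e-5):
    y, z = 0.0, 1.0                     # initial settings of the integrators
    for _ in range(int(t / dt)):
        y, z = y + dt * z, z - dt * y   # each integrator accumulates its input
    return y

approx = gpac_sine(1.0)                 # compare with math.sin(1.0)
```

The same pattern (a closed loop of integrators and adders) is how an electronic analog computer would be patched to generate sine waves.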
Finally, Shannon addresses a question that might seem of limited interest, but turns out to be relevant to the computational power of analog computers (see Sect. “Analog Computation and the Turing Limit” below). To understand it we must recall that he was investigating the differential analyzer—a mechanical analog computer—but similar issues arise in other analog computing technologies. The question is whether it is possible to perform an arbitrary constant multiplication, u = kv, by means of gear ratios. He shows that if we have just two gear ratios a and b (a, b ≠ 0, 1), such that b is not a rational power of a, then
by combinations of these gears we can approximate k arbitrarily closely (Thm. XII). That is, to approximate multiplication by arbitrary real numbers, it is sufficient to be able to multiply by a, b, and their inverses, provided a and b are not related by a rational power. Shannon mentions an alternative method of constant multiplication, which uses integration, kv = ∫₀^v k dv, but this requires setting the integrand to the constant function k. Therefore, multiplying by an arbitrary real number requires the ability to input an arbitrary real as the integrand. The issue of real-valued inputs and outputs to analog computers is relevant both to their theoretical power and to practical matters of their application (see Sect. “Real-valued Inputs, Output, and Constants”). Shannon’s proofs, which were incomplete, were eventually refined by M. Pour-El [63] and finally corrected by L. Lipshitz and L.A. Rubel [35]. Rubel [72] proved that Shannon’s GPAC cannot solve the Dirichlet problem for Laplace’s equation on the disk; indeed, it is limited to initial-value problems for algebraic ODEs. Specifically, the Shannon–Pour-El Thesis is that the outputs of the GPAC are exactly the solutions of the algebraic differential equations, that is, equations of the form P[x, y(x), y′(x), y″(x), …, y^(n)(x)] = 0, where P is a polynomial that is not identically vanishing in any of its variables (these are the differentially algebraic functions) [71]. (For details please consult the cited papers.) The limitations of Shannon’s GPAC motivated Rubel’s definition of the Extended Analog Computer. Rubel’s Extended Analog Computer The combination of Rubel’s [71] conviction that the brain is an analog computer together with the limitations of Shannon’s GPAC led him to propose the Extended Analog Computer (EAC) [73]. Like Shannon’s GPAC (and the Turing machine), the EAC is a conceptual computer intended to facilitate theoretical investigation of the limits of a class of computers.
The EAC extends the GPAC in a number of respects. For example, whereas the GPAC solves equations defined over a single variable (time), the EAC can generate functions over any finite number of real variables. Further, whereas the GPAC is restricted to initial-value problems for ODEs, the EAC solves both initial- and boundary-value problems for a variety of PDEs. The EAC is structured into a series of levels, each more powerful than the ones below it, from which it accepts inputs. The inputs to the lowest level are a finite number of real variables (“settings”). At this level it operates on real
polynomials, from which it is able to generate the differentially algebraic functions. The computing on each level is accomplished by conceptual analog devices, which include constant real-number generators, adders, multipliers, differentiators, “substituters” (for function composition), devices for analytic continuation, and inverters, which solve systems of equations defined over functions generated by the lower levels. Most characteristic of the EAC is the “boundary-value-problem box”, which solves systems of PDEs and ODEs subject to boundary conditions and other constraints. The PDEs are defined in terms of functions generated by the lower levels. Such PDE solvers may seem implausible, and so it is important to recall that field-computing devices for this purpose were implemented in some practical analog computers (see Sect. “History”) and more recently in Mills’ EAC [54]. As Rubel observed, PDE solvers could be implemented by physical processes that obey the same PDEs (heat equation, wave equation, etc.). (See also Sect. “Future Directions” below.) Finally, the EAC is required to be “extremely well-posed”, which means that each level is relatively insensitive to perturbations in its inputs; thus “all the outputs depend in a strongly deterministic and stable way on the initial settings of the machine” [73]. Rubel [73] proves that the EAC can compute everything that the GPAC can compute, but also such functions as the gamma and zeta, and that it can solve the Dirichlet problem for Laplace’s equation on the disk, all of which are beyond the GPAC’s capabilities. Further, whereas the GPAC can compute differentially algebraic functions of time, the EAC can compute differentially algebraic functions of any finite number of real variables.
In fact, Rubel did not find any real-analytic function that is not computable on the EAC, but he observes that if the EAC can indeed generate every real-analytic function, it would be too broad to be useful as a model of analog computation. Analog Computation and the Turing Limit Introduction The Church–Turing Thesis asserts that anything that is effectively computable is computable by a Turing machine, but the Turing machine (and equivalent models, such as the lambda calculus) are models of discrete computation, and so it is natural to wonder how analog computing compares in power, and in particular whether it can compute beyond the “Turing limit”. Superficial answers are easy to obtain, but the issue is subtle: it depends upon choices among definitions, none of which is obviously correct; it involves the foundations of mathematics and its
philosophy, and it raises epistemological issues about the role of models in scientific theories. Nevertheless, this is an active research area, although many of the results are apparently inconsistent because of the differing assumptions on which they are based. Therefore this section will be limited to a mention of a few of the interesting results, without attempting a comprehensive, systematic, or detailed survey; Siegelmann [78] can serve as an introduction to the literature. A Sampling of Theoretical Results Continuous-Time Models P. Orponen’s [59] 1997 survey of continuous-time computation theory is a good introduction to the literature as of that time; here we give a sample of these and more recent results. There are several results showing that—under various assumptions—analog computers have at least the power of Turing machines (TMs). For example, M.S. Branicky [11] showed that a TM could be simulated by ODEs, but he used non-differentiable functions; O. Bournez et al. [8] provide an alternative construction using only analytic functions. They also prove that GPAC computability coincides with (Turing-)computable analysis, which is surprising, since the gamma function is Turing-computable but, as we have seen, the GPAC cannot generate it. The paradox is resolved by a distinction between generating a function and computing it, with the latter, broader notion permitting convergent computation of the function (that is, as t → ∞). However, the computational power of general ODEs has not been determined in general [p. 149 in 78]. M.B. Pour-El and I. Richards exhibit a Turing-computable ODE that does not have a Turing-computable solution [64,66]. M. Stannett [83] also defined a continuous-time analog computer that could solve the halting problem. C. Moore [55] defines a class of continuous-time recursive functions over the reals, which includes a zero-finding operator μ.
Functions can be classified into a hierarchy depending on the number of uses of μ, with the lowest level (no μs) corresponding approximately to Shannon’s GPAC. Higher levels can compute non-Turing-computable functions, such as the decision procedure for the halting problem, but he questions whether this result is relevant in the physical world, which is constrained by “noise, quantum effects, finite accuracy, and limited resources”. O. Bournez and M. Cosnard [9] have extended these results and shown that many dynamical systems have super-Turing power. S. Omohundro [58] showed that a system of ten coupled nonlinear PDEs could simulate an arbitrary cellular automaton (see Mathematical Basis of Cellular Automata, Introduction to), which implies that PDEs have at least Turing power. Further, D. Wolpert and B.J. MacLennan [90,91] showed that any TM can be simulated by a field computer with linear dynamics, but the construction uses Dirac delta functions. Pour-El and Richards exhibit a wave equation in three-dimensional space with Turing-computable initial conditions, but for which the unique solution is Turing-uncomputable [65,66]. Sequential-Time Models We will mention a few of the results that have been obtained concerning the power of sequential-time analog computation. Although the BSS model has been investigated extensively, its power has not been completely determined [6,7]. It is known to depend on whether just rational numbers or arbitrary real numbers are allowed in its programs [p. 148 in 78]. A coupled map lattice (CML) is a cellular automaton with real-valued states (see Mathematical Basis of Cellular Automata, Introduction to); it is a sequential-time analog computer, which can be considered a discrete-space approximation to a simple sequential-time field computer. P. Orponen and M. Matamala [60] showed that a finite CML can simulate a universal Turing machine. However, since a CML can simulate a BSS program or a recurrent neural network (see Sect. “Recurrent Neural Networks” below), it actually has super-Turing power [p. 149 in 78]. Recurrent neural networks are some of the most important examples of sequential analog computers, and so the following section is devoted to them. Recurrent Neural Networks With the renewed interest in neural networks in the mid-1980s, many investigators wondered whether recurrent neural nets have super-Turing power. M. Garzon and S. Franklin showed that a sequential-time net with a countable infinity of neurons could exceed Turing power [21,23,24]. Indeed, Siegelmann and E.D. Sontag [80] showed that finite neural nets with real-valued weights have super-Turing power, but W.
Maass and Sontag [36] showed that recurrent nets with Gaussian or similar noise had sub-Turing power, illustrating again the dependence of these results on assumptions about what is a reasonable mathematical idealization of analog computing. For recent results on recurrent neural networks, we will restrict our attention to the work of Siegelmann [78], who addresses the computational power of these networks in terms of the classes of languages they can recognize. Without loss of generality the languages are restricted to sets of binary strings. A string to be tested is fed to the network one bit at a time, along with an input that indicates when the end of the input string has been reached. The network is said to decide whether the string is in the language if it correctly indicates whether it is in the set or not, after some finite number of sequential steps since input began. Siegelmann shows that, if exponential time is allowed for recognition, finite recurrent neural networks with real-valued weights (and saturated-linear activation functions) can compute all languages, and thus they are more powerful than Turing machines. Similarly, stochastic networks with rational weights also have super-Turing power, although less power than the deterministic nets with real weights. (Specifically, they compute P/poly and BPP/log* respectively; see Siegelmann [Chaps. 4, 9 in 78] for details.) She further argues that these neural networks serve as a “standard model” of (sequential) analog computation (comparable to Turing machines in Church–Turing computation), and therefore that the limits and capabilities of these nets apply to sequential analog computation generally. Siegelmann [p. 156 in 78] observes that the super-Turing power of recurrent neural networks is a consequence of their use of non-rational real-valued weights. In effect, a real number can contain an infinite number of bits of information. This raises the question of how the non-rational weights of a network can ever be set, since it is not possible to define a physical quantity with infinite precision. However, although non-rational weights may not be able to be set from outside the network, they can be computed within the network by learning algorithms, which are analog computations. Thus, Siegelmann suggests, the fundamental distinction may be between static computational models, such as the Turing machine and its equivalents, and dynamically evolving computational models, which can tune continuously variable parameters and thereby achieve super-Turing power.
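The observation that a real number can hold an infinite number of bits can be illustrated with a toy encoding. The sketch below is only the simplest illustration of the principle, not Siegelmann's actual construction (which uses a more robust base-4 Cantor-set encoding): a bit string is packed into a single "weight" as a binary fraction, and an iterated shift map reads the bits back one per step.

```python
from fractions import Fraction

# Pack a bit string into one number w = sum_i b_i * 2**-(i+1), then recover
# the bits by iterating the shift map x -> 2x mod 1, reading off one leading
# bit per iteration. Fractions make the arithmetic exact, so the recovery
# is lossless for any finite prefix.

def encode(bits):
    return sum(Fraction(b, 2 ** (i + 1)) for i, b in enumerate(bits))

def decode(w, n):
    bits = []
    for _ in range(n):
        w *= 2
        bit = int(w >= 1)   # leading bit of the binary expansion
        bits.append(bit)
        w -= bit            # keep the fractional part (the shift map)
    return bits

msg = [1, 0, 1, 1, 0, 0, 1]
w = encode(msg)             # a single "weight" storing the whole string
```

In a recurrent network the analogous read-out is performed by the network's own dynamics; the point here is only that a single continuous value suffices to store an arbitrarily long string.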
Dissipative Models Beyond the issue of the power of analog computing relative to the Turing limit, there are also questions of its relative efficiency. For example, could analog computing solve NP-hard problems in polynomial or even linear time? In traditional computational complexity theory, efficiency issues are addressed in terms of the asymptotic number of computation steps to compute a function as the size of the function’s input increases. One way to address corresponding issues in an analog context is by treating an analog computation as a dissipative system, which in this context means a system that decreases some quantity (analogous to energy) so that the system state converges to a point attractor. From this perspective, the initial state of the system incorporates the input
to the computation, and the attractor represents its output. Therefore, H.T. Siegelmann, S. Fishman, and A. Ben-Hur have developed a complexity theory for dissipative systems, in both sequential and continuous time, which addresses the rate of convergence in terms of the underlying rates of the system [4,79]. The relation between dissipative complexity classes (e. g., P_d, NP_d) and corresponding classical complexity classes (P, NP) remains unclear [p. 151 in 78]. Real-Valued Inputs, Outputs, and Constants A common argument, with relevance to the theoretical power of analog computation, is that an input to an analog computer must be determined by setting a dial to a number or by typing a number into a digital-to-analog conversion device, and therefore that the input will be a rational number. The same argument applies to any internal constants in the analog computation. Similarly, it is argued, any output from an analog computer must be measured, and the accuracy of measurement is limited, so that the result will be a rational number. Therefore, it is claimed, real numbers are irrelevant to analog computing, since any practical analog computer computes a function from the rationals to the rationals, and can therefore be simulated by a Turing machine. (See related arguments by Martin Davis [18,19].) There are a number of interrelated issues here, which may be considered briefly. First, the argument is couched in terms of the input or output of digital representations, and the numbers so represented are necessarily rational (more generally, computable). This seems natural enough when we think of an analog computer as a calculating device, and in fact many historical analog computers were used in this way and had digital inputs and outputs (since this is our most reliable way of recording and reproducing quantities).
However, in many analog control systems, the inputs and outputs are continuous physical quantities that vary continuously in time (also a continuous physical quantity); that is, according to current physical theory, these quantities are real numbers, which vary according to differential equations. It is worth recalling that physical quantities are neither rational nor irrational; they can be so classified only in comparison with each other or with respect to a unit, that is, only if they are measured and digitally represented. Furthermore, physical quantities are neither computable nor uncomputable (in a Church-Turing sense); these terms apply only to discrete representations of these quantities (i. e., to numerals or other digital representations).
Therefore, in accord with ordinary mathematical descriptions of physical processes, analog computations can be treated as having arbitrary real numbers (in some range) as inputs, outputs, or internal states; like other continuous processes, continuous-time analog computations pass through all the reals in some range, including non-Turing-computable reals. Paradoxically, however, these same physical processes can be simulated on digital computers. The Issue of Simulation by Turing Machines and Digital Computers Theoretical results about the computational power, relative to Turing machines, of neural networks and other analog models of computation raise difficult issues, some of which are epistemological rather than strictly technical. On the one hand, we have a series of theoretical results proving the super-Turing power of analog computation models of various kinds. On the other hand, we have the obvious fact that neural nets are routinely simulated on ordinary digital computers, which have at most the power of Turing machines. Furthermore, it is reasonable to suppose that any physical process that might be used to realize analog computation—and certainly the known processes—could be simulated on a digital computer, as is done routinely in computational science. This would seem to be incontrovertible proof that analog computation is no more powerful than Turing machines. The crux of the paradox lies, of course, in the non-Turing-computable reals. These numbers are a familiar, accepted, and necessary part of standard mathematics, in which physical theory is formulated, but from the standpoint of Church–Turing (CT) computation they do not exist. This suggests that the paradox is not a contradiction, but reflects a divergence between the goals and assumptions of the two models of computation.
The Problem of Models of Computation These issues may be put in context by recalling that the Church–Turing (CT) model of computation is in fact a model, and therefore that it has the limitations of all models. A model is a cognitive tool that improves our ability to understand some class of phenomena by preserving relevant characteristics of the phenomena while altering other, irrelevant (or less relevant) characteristics. For example, a scale model alters the size (taken to be irrelevant) while preserving shape and other characteristics. Often a model achieves its purposes by making simplifying or idealizing assumptions, which facilitate analysis or simulation of the system. For example, we may use a linear mathematical model of a physical process that is only approximately linear. For a model to be effective it must preserve characteristics and make simplifying assumptions that are appropriate to the domain of questions it is intended to answer, its frame of relevance [46]. If a model is applied to problems outside of its frame of relevance, then it may give answers that are misleading or incorrect, because they depend more on the simplifying assumptions than on the phenomena being modeled. Therefore we must be especially cautious in applying a model outside of its frame of relevance, or even at the limits of its frame, where the simplifying assumptions become progressively less appropriate. The problem is aggravated by the fact that often the frame of relevance is not explicitly defined, but resides in a tacit background of practices and skills within some discipline. Therefore, to determine the applicability of the CT model of computation to analog computing, we must consider the frame of relevance of the CT model. This is easiest if we recall the domain of issues and questions it was originally developed to address: issues of effective calculability and derivability in formalized mathematics. This frame of relevance determines many of the assumptions of the CT model, for example, that information is represented by finite discrete structures of symbols from a finite alphabet, that information processing proceeds by the application of definite formal rules at discrete instants of time, and that a computational or derivational process must be completed in a finite number of these steps.1 Many of these assumptions are incompatible with analog computing and with the frames of relevance of many models of analog computation. Relevant Issues for Analog Computation Analog computation is often used for control. Historically, analog computers were used in control systems and to simulate control systems, but contemporary analog VLSI is also frequently applied in control.
Natural analog computation also frequently serves a control function, for example, sensorimotor control by the nervous system, genetic regulation in cells, and self-organized cooperation in insect colonies. Therefore, control systems provide one frame of relevance for models of analog computation. In this frame of relevance real-time response is a critical issue, which models of analog computation, therefore, ought to be able to address. Thus it is necessary to be able to relate the speed and frequency response of analog computation to the rates of the physical processes by which the computation is realized.

1 See MacLennan [45,46] for a more detailed discussion of the frame of relevance of the CT model.

Traditional methods of algorithm
179
180
Analog Computation
analysis, which are based on sequential time and asymptotic behavior, are inadequate in this frame of relevance. On the one hand, the constants (time scale factors), which reflect the underlying rate of computation are absolutely critical (but ignored in asymptotic analysis); on the other hand, in control applications the asymptotic behavior of algorithm is generally irrelevant, since the inputs are typically fixed in size or of a limited range of sizes. The CT model of computation is oriented around the idea that the purpose of a computation is to evaluate a mathematical function. Therefore the basic criterion of adequacy for a computation is correctness, that is, that given a precise representation of an input to the function, it will produce (after finitely many steps) a precise representation of the corresponding output of the function. In the context of natural computation and control, however, other criteria may be equally or even more relevant. For example, robustness is important: how well does the system respond in the presence of noise, uncertainty, imprecision, and error, which are unavoidable in real natural and artificial control systems, and how well does it respond to defects and damage, which arise in many natural and artificial contexts. Since the real world is unpredictable, flexibility is also important: how well does an artificial system respond to inputs for which it was not designed, and how well does a natural system behave in situations outside the range of those to which it is evolutionarily adapted. Therefore, adaptability (through learning and other means) is another important issue in this frame of relevance.2 Transcending Turing Computability Thus we see that many applications of analog computation raise different questions from those addressed by the CT model of computation; the most useful models of analog computing will have a different frame of relevance. 
In order to address traditional questions such as whether analog computers can compute "beyond the Turing limit", or whether they can solve NP-hard problems in polynomial time, it is necessary to construct models of analog computation within the CT frame of relevance. Unfortunately, constructing such models requires making commitments about many issues (such as the representation of reals and the discretization of time) that may affect the answers to these questions but are fundamentally unimportant in the frame of relevance of the most useful applications of the concept of analog computation. Therefore, being overly focused on traditional problems in the theory of computation (which was formulated for a different frame of relevance) may distract us from formulating models of analog computation that can address important issues in its own frame of relevance.

² See MacLennan [45,46] for a more detailed discussion of the frames of relevance of natural computation and control.

Analog Thinking

It will be worthwhile to say a few words about the cognitive implications of analog computing, which are a largely forgotten aspect of the analog vs. digital debates of the late 20th century. For example, it was argued that analog computing provides a deeper intuitive understanding of a system than the alternatives do [5, Chap. 8 in 82]. On the one hand, analog computers afforded a means of understanding analytically intractable systems by means of "dynamic models". By setting up an analog simulation, it was possible to vary the parameters and explore interactively the behavior of a dynamical system that could not be analyzed mathematically. Digital simulations, in contrast, were orders of magnitude slower and did not permit this kind of interactive investigation. (Performance has improved sufficiently in contemporary digital computers so that in many cases digital simulations can be used as dynamic models, sometimes with an interface that mimics an analog computer; see [5].) Analog computing is also relevant to the cognitive distinction between knowing how (procedural knowledge) and knowing that (factual knowledge) [Chap. 8 in 82]. The latter ("know-that") is more characteristic of scientific culture, which strives for generality and exactness, often by designing experiments that allow phenomena to be studied in isolation, whereas the former ("know-how") is more characteristic of engineering culture; at least it was so through the first half of the twentieth century, before the development of "engineering science" and the widespread use of analytic techniques in engineering education and practice.
Engineers were faced with analytically intractable systems, with inexact measurements, and with empirical relationships (characteristic curves, etc.), all of which made analog computers attractive for solving engineering problems. Furthermore, because analog computing made use of physical phenomena that were mathematically analogous to those in the primary system, the engineer's intuition and understanding of one system could be transferred to the other. Some commentators have mourned the loss of hands-on intuitive understanding attendant on the increasingly scientific orientation of engineering education and the disappearance of the analog computer [5,34,61,67]. I will mention one last cognitive issue relevant to the differences between analog and digital computing. As already discussed in Sect. "Characteristics of Analog Computation", it is generally agreed that it is less expensive to achieve high precision with digital technology than with analog technology. Of course, high precision may not be important, for example when the available data are inexact or in natural computation. Further, some advocates of analog computing argue that high-precision digital results are often misleading [p. 261 in 82]. Precision does not imply accuracy, and the fact that an answer is displayed with 10 digits does not guarantee that it is accurate to 10 digits; in particular, engineering data may be known to only a few significant figures, and the accuracy of digital calculation may be limited by numerical problems. Therefore, on the one hand, users of digital computers might fall into the trap of trusting their apparently exact results, while users of modest-precision analog computers were more inclined to healthy skepticism about their computations. Or so it was claimed.

Future Directions

Certainly there are many purposes that are best served by digital technology; indeed there is a tendency nowadays to think that everything is done better digitally. Therefore it will be worthwhile to consider whether analog computation should have a role in future computing technologies. I will argue that the approaching end of Moore's Law [56], which has predicted exponential growth in digital logic densities, will encourage the development of new analog computing technologies. Two avenues present themselves as ways toward greater computing power: faster individual computing elements and greater densities of computing elements. Greater density increases power by facilitating parallel computing and by enabling greater computing power to be put into smaller packages.
Other things being equal, the fewer the layers of implementation between the computational operations and the physical processes that realize them, that is to say, the more directly the physical processes implement the computations, the more quickly they will be able to proceed. Since most physical processes are continuous (defined by differential equations), analog computation is generally faster than digital. For example, we may compare analog addition, implemented directly by the additive combination of physical quantities, with the sequential process of digital addition. Similarly, other things being equal, the fewer physical devices required to implement a computational element, the greater will be the density of these elements. Therefore, in general, the closer the computational process is to the physical processes that realize it, the fewer devices will be required, and so the continuity of physical law suggests that analog computation has the potential for greater density than digital. For example, four transistors can realize analog addition, whereas many more are required for digital addition. Both considerations argue for an increasing role of analog computation in post-Moore's Law computing. From this broad perspective, there are many physical phenomena that are potentially usable for future analog computing technologies. We seek phenomena that can be described by well-known and useful mathematical functions (e. g., addition, multiplication, exponential, logarithm, convolution). These descriptions do not need to be exact for the phenomena to be useful in many applications, for which limited range and precision are adequate. Furthermore, in some applications speed is not an important criterion; for example, in some control applications, small size, low power, robustness, etc. may be more important than speed, so long as the computer responds quickly enough to accomplish the control task. Of course there are many other considerations in determining whether given physical phenomena can be used for practical analog computation in a given application [47]. These include stability, controllability, manufacturability, and the ease of interfacing with input and output transducers and other devices. Nevertheless, in the post-Moore's Law world, we will have to be willing to consider all physical phenomena as potential computing technologies, and in many cases we will find that analog computing is the most effective way to utilize them. Natural computation provides many examples of effective analog computation realized by relatively slow, low-precision operations, often through massive parallelism. Therefore, post-Moore's Law computing has much to learn from the natural world.

Bibliography

Primary Literature

1. Anderson JA (1995) An Introduction to Neural Networks. MIT Press, Cambridge
2. Ashley JR (1963) Introduction to Analog Computing. Wiley, New York
3. Aspray W (1993) Edwin L.
Harder and the Anacom: Analog computing at Westinghouse. IEEE Ann Hist Comput 15(2):35–52
4. Ben-Hur A, Siegelmann HT, Fishman S (2002) A theory of complexity for continuous time systems. J Complex 18:51–86
5. Bissell CC (2004) A great disappearing act: The electronic analogue computer. In: IEEE Conference on the History of Electronics, Bletchley, June 2004, pp 28–30
6. Blum L, Cucker F, Shub M, Smale S (1998) Complexity and Real Computation. Springer, Berlin
7. Blum L, Shub M, Smale S (1988) On a theory of computation and complexity over the real numbers: NP completeness, recursive functions and universal machines. Bulletin Am Math Soc 21:1–46
8. Bournez O, Campagnolo ML, Graça DS, Hainry E (2006) The General Purpose Analog Computer and computable analysis are two equivalent paradigms of analog computation. In: Theory and Applications of Models of Computation (TAMC 2006). Lecture Notes in Computer Science, vol 3959. Springer, Berlin, pp 631–643
9. Bournez O, Cosnard M (1996) On the computational power of dynamical systems and hybrid systems. Theor Comput Sci 168(2):417–59
10. Bowles MD (1996) US technological enthusiasm and British technological skepticism in the age of the analog brain. Ann Hist Comput 18(4):5–15
11. Branicky MS (1994) Analog computation with continuous ODEs. In: Proceedings IEEE Workshop on Physics and Computation, Dallas, pp 265–274
12. Brockett RW (1988) Dynamical systems that sort lists, diagonalize matrices and solve linear programming problems. In: Proceedings 27th IEEE Conference on Decision and Control, Austin, December 1988, pp 799–803
13. Camazine S, Deneubourg J-L, Franks NR, Sneyd J, Theraulaz G, Bonabeau E (2001) Self-organization in Biological Systems. Princeton Univ. Press, Princeton
14. Changeux J-P (1985) Neuronal Man: The Biology of Mind (trans: Garey LL). Oxford University Press, Oxford
15. Clymer AB (1993) The mechanical analog computers of Hannibal Ford and William Newell. IEEE Ann Hist Comput 15(2):19–34
16. Davidson EH (2006) The Regulatory Genome: Gene Regulatory Networks in Development and Evolution. Academic Press, Amsterdam
17. Davies JA (2005) Mechanisms of Morphogenesis. Elsevier, Amsterdam
18. Davis M (2004) The myth of hypercomputation. In: Teuscher C (ed) Alan Turing: Life and Legacy of a Great Thinker. Springer, Berlin, pp 195–212
19. Davis M (2006) Why there is no such discipline as hypercomputation. Appl Math Comput 178:4–7
20. Fakhraie SM, Smith KC (1997) VLSI-Compatible Implementation for Artificial Neural Networks. Kluwer, Boston
21. Franklin S, Garzon M (1990) Neural computability. In: Omidvar OM (ed) Progress in Neural Networks, vol 1. Ablex, Norwood, pp 127–145
22. Freeth T, Bitsakis Y, Moussas X, Seiradakis JH, Tselikas A, Mangou H, Zafeiropoulou M, Hadland R, Bate D, Ramsey A, Allen M, Crawley A, Hockley P, Malzbender T, Gelb D, Ambrisco W, Edmunds MG (2006) Decoding the ancient Greek astronomical calculator known as the Antikythera mechanism. Nature 444:587–591
23. Garzon M, Franklin S (1989) Neural computability II (extended abstract). In: Proceedings, IJCNN International Joint Conference on Neural Networks, vol 1. Institute of Electrical and Electronic Engineers, New York, pp 631–637
24. Garzon M, Franklin S (1990) Computation on graphs. In: Omidvar OM (ed) Progress in Neural Networks, vol 2. Ablex, Norwood
25. Goldstine HH (1972) The Computer from Pascal to von Neumann. Princeton University Press, Princeton
26. Grossberg S (1967) Nonlinear difference-differential equations in prediction and learning theory. Proc Nat Acad Sci USA 58(4):1329–1334
27. Grossberg S (1973) Contour enhancement, short term memory, and constancies in reverberating neural networks. Stud Appl Math LII:213–257
28. Grossberg S (1976) Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors. Biol Cybern 23:121–134
29. Hartl DL (1994) Genetics, 3rd edn. Jones & Bartlett, Boston
30. Hopfield JJ (1984) Neurons with graded response have collective computational properties like those of two-state neurons. Proc Nat Acad Sci USA 81:3088–92
31. Howe RM (1961) Design Fundamentals of Analog Computer Components. Van Nostrand, Princeton
32. Khatib O (1986) Real-time obstacle avoidance for manipulators and mobile robots. Int J Robot Res 5:90–99
33. Kirchhoff G (1845) Ueber den Durchgang eines elektrischen Stromes durch eine Ebene, insbesondere durch eine kreisförmige. Ann Phys Chem 140(64)(4):497–514
34. Lang GF (2000) Analog was not a computer trademark! Why would anyone write about analog computers in year 2000? Sound Vib 34(8):16–24
35. Lipshitz L, Rubel LA (1987) A differentially algebraic replacement theorem. Proc Am Math Soc 99(2):367–72
36. Maass W, Sontag ED (1999) Analog neural nets with Gaussian or other common noise distributions cannot recognize arbitrary regular languages. Neural Comput 11(3):771–782
37. MacLennan BJ (1987) Technology-independent design of neurocomputers: The universal field computer. In: Caudill M, Butler C (eds) Proceedings of the IEEE First International Conference on Neural Networks, vol 3. IEEE Press, pp 39–49
38. MacLennan BJ (1990) Field computation: A theoretical framework for massively parallel analog computation, parts I–IV. Technical Report CS-90-100, Department of Computer Science, University of Tennessee, Knoxville. Available from http://www.cs.utk.edu/~mclennan. Accessed 20 May 2008
39. MacLennan BJ (1991) Gabor representations of spatiotemporal visual images. Technical Report CS-91-144, Department of Computer Science, University of Tennessee, Knoxville. Available from http://www.cs.utk.edu/~mclennan. Accessed 20 May 2008
40. MacLennan BJ (1994) Continuous computation and the emergence of the discrete. In: Pribram KH (ed) Origins: Brain & Self-Organization. Lawrence Erlbaum, Hillsdale, pp 121–151
41. MacLennan BJ (1994) "Words lie in our way". Minds Mach 4(4):421–437
42. MacLennan BJ (1995) Continuous formal systems: A unifying model in language and cognition. In: Proceedings of the IEEE Workshop on Architectures for Semiotic Modeling and Situation Analysis in Large Complex Systems, Monterey, August 1995, pp 161–172. Also available from http://www.cs.utk.edu/~mclennan. Accessed 20 May 2008
43. MacLennan BJ (1999) Field computation in natural and artificial intelligence. Inf Sci 119:73–89
44. MacLennan BJ (2001) Can differential equations compute? Technical Report UT-CS-01-459, Department of Computer Science, University of Tennessee, Knoxville. Available from http://www.cs.utk.edu/~mclennan. Accessed 20 May 2008
45. MacLennan BJ (2003) Transcending Turing computability. Minds Mach 13:3–22
46. MacLennan BJ (2004) Natural computation and non-Turing models of computation. Theor Comput Sci 317:115–145
47. MacLennan BJ (in press) Super-Turing or non-Turing? Extending the concept of computation. Int J Unconv Comput
48. Maini PK, Othmer HG (eds) (2001) Mathematical Models for Biological Pattern Formation. Springer, New York
49. Maziarz EA, Greenwood T (1968) Greek Mathematical Philosophy. Frederick Ungar, New York
50. McClelland JL, Rumelhart DE, the PDP Research Group (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol 2: Psychological and Biological Models. MIT Press, Cambridge
51. Mead C (1987) Silicon models of neural computation. In: Caudill M, Butler C (eds) Proceedings, IEEE First International Conference on Neural Networks, vol I. IEEE Press, Piscataway, pp 91–106
52. Mead C (1989) Analog VLSI and Neural Systems. Addison-Wesley, Reading
53. Mills JW (1996) The continuous retina: Image processing with a single-sensor artificial neural field network. In: Proceedings IEEE Conference on Neural Networks. IEEE Press, Piscataway
54. Mills JW, Himebaugh B, Kopecky B, Parker M, Shue C, Weilemann C (2006) "Empty space" computes: The evolution of an unconventional supercomputer. In: Proceedings of the 3rd Conference on Computing Frontiers, New York, May 2006. ACM Press, pp 115–126
55. Moore C (1996) Recursion theory on the reals and continuous-time computation. Theor Comput Sci 162:23–44
56. Moore GE (1965) Cramming more components onto integrated circuits. Electronics 38(8):114–117
57. Murray JD (1977) Lectures on Nonlinear Differential-Equation Models in Biology. Oxford University Press, Oxford
58. Omohundro S (1984) Modeling cellular automata with partial differential equations. Physica D 10:128–34
59. Orponen P (1997) A survey of continuous-time computation theory. In: Advances in Algorithms, Languages, and Complexity. Kluwer, Dordrecht, pp 209–224
60. Orponen P, Matamala M (1996) Universal computation by finite two-dimensional coupled map lattices. In: Proceedings, Physics and Computation 1996, New England Complex Systems Institute, Cambridge, pp 243–7
61. Owens L (1986) Vannevar Bush and the differential analyzer: The text and context of an early computer. Technol Culture 27(1):63–95
62. Peterson GR (1967) Basic Analog Computation. Macmillan, New York
63. Pour-El MB (1974) Abstract computability and its relation to the general purpose analog computer (some connections between logic, differential equations and analog computers). Trans Am Math Soc 199:1–29
64. Pour-El MB, Richards I (1979) A computable ordinary differential equation which possesses no computable solution. Ann Math Log 17:61–90
65. Pour-El MB, Richards I (1981) The wave equation with computable initial data such that its unique solution is not computable. Adv Math 39:215–239
66. Pour-El MB, Richards I (1982) Noncomputability in models of physical phenomena. Int J Theor Phys 21:553–555
67. Puchta S (1996) On the role of mathematics and mathematical
knowledge in the invention of Vannevar Bush's early analog computers. IEEE Ann Hist Comput 18(4):49–59
68. Reiner JM (1968) The Organism as an Adaptive Control System. Prentice-Hall, Englewood Cliffs
69. Rimon E, Koditschek DE (1989) The construction of analytic diffeomorphisms for exact robot navigation on star worlds. In: Proceedings of the 1989 IEEE International Conference on Robotics and Automation, Scottsdale AZ. IEEE Press, New York, pp 21–26
70. Rogers AE, Connolly TW (1960) Analog Computation in Engineering Design. McGraw-Hill, New York
71. Rubel LA (1985) The brain as an analog computer. J Theor Neurobiol 4:73–81
72. Rubel LA (1988) Some mathematical limitations of the general-purpose analog computer. Adv Appl Math 9:22–34
73. Rubel LA (1993) The extended analog computer. Adv Appl Math 14:39–50
74. Rumelhart DE, McClelland JL, the PDP Research Group (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol 1: Foundations. MIT Press, Cambridge
75. Sanger TD (1996) Probability density estimation for the interpretation of neural population codes. J Neurophysiol 76:2790–2793
76. Shannon CE (1941) Mathematical theory of the differential analyzer. J Math Phys Mass Inst Technol 20:337–354
77. Shannon CE (1993) Mathematical theory of the differential analyzer. In: Sloane NJA, Wyner AD (eds) Claude Elwood Shannon: Collected Papers. IEEE Press, New York, pp 496–513
78. Siegelmann HT (1999) Neural Networks and Analog Computation: Beyond the Turing Limit. Birkhäuser, Boston
79. Siegelmann HT, Ben-Hur A, Fishman S (1999) Computational complexity for continuous time dynamics. Phys Rev Lett 83(7):1463–6
80. Siegelmann HT, Sontag ED (1994) Analog computation via neural networks. Theor Comput Sci 131:331–360
81. Small JS (1993) General-purpose electronic analog computing. IEEE Ann Hist Comput 15(2):8–18
82. Small JS (2001) The Analogue Alternative: The electronic analogue computer in Britain and the USA, 1930–1975. Routledge, London & New York
83. Stannett M (1990) X-machines and the halting problem: Building a super-Turing machine. Form Asp Comput 2:331–341
84. Thomson W (Lord Kelvin) (1876) Mechanical integration of the general linear differential equation of any order with variable coefficients. Proc Royal Soc 24:271–275
85. Thomson W (Lord Kelvin) (1878) Harmonic analyzer. Proc Royal Soc 27:371–373
86. Thomson W (Lord Kelvin) (1938) The tides. In: The Harvard Classics, vol 30: Scientific Papers. Collier, New York, pp 274–307
87. Truitt TD, Rogers AE (1960) Basics of Analog Computers. John F. Rider, New York
88. van Gelder T (1997) Dynamics and cognition. In: Haugeland J (ed) Mind Design II: Philosophy, Psychology and Artificial Intelligence, revised & enlarged edition. MIT Press, Cambridge MA, Chap 16, pp 421–450
89. Weyrick RC (1969) Fundamentals of Analog Computers. Prentice-Hall, Englewood Cliffs
90. Wolpert DH (1991) A computationally universal field computer which is purely linear. Technical Report LA-UR-91-2937. Los Alamos National Laboratory, Los Alamos
91. Wolpert DH, MacLennan BJ (1993) A computationally universal field computer that is purely linear. Technical Report CS-93-206. Department of Computer Science, University of Tennessee, Knoxville
Books and Reviews

Fifer S (1961) Analog computation: Theory, techniques and applications, 4 vols. McGraw-Hill, New York
Bissell CC (2004) A great disappearing act: The electronic analogue computer. In: IEEE Conference on the History of Electronics, Bletchley, June 2004, pp 28–30
Lipka J (1918) Graphical and mechanical computation. Wiley, New York
Mead C (1989) Analog VLSI and neural systems. Addison-Wesley, Reading
Siegelmann HT (1999) Neural networks and analog computation: Beyond the Turing limit. Birkhäuser, Boston
Small JS (2001) The analogue alternative: The electronic analogue computer in Britain and the USA, 1930–1975. Routledge, London & New York
Small JS (1993) General-purpose electronic analog computing: 1945–1965. IEEE Ann Hist Comput 15(2):8–18
Artificial Chemistry

PETER DITTRICH
Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany
Article Outline

Glossary
Definition of the Subject
Introduction
Basic Building Blocks of an Artificial Chemistry
Structure-to-Function Mapping
Space
Theory
Evolution
Information Processing
Future Directions
Bibliography
Glossary

Molecular species A molecular species is an abstract class denoting an ensemble of identical molecules. Equivalently, the terms "species", "compound", or just "molecule" are used; in some specific contexts also the terms "substrate" or "metabolite".

Molecule A molecule is a concrete instance of a molecular species. Molecules are those entities of an artificial chemistry that react. Note that sometimes the term "molecule" is used equivalently to molecular species.

Reaction network A set of molecular species together with a set of reaction rules. Formally, a reaction network is equivalent to a Petri net. A reaction network describes the stoichiometric structure of a reaction system.

Order of a reaction The order of a reaction is the sum of the exponents of the concentrations in the kinetic rate law (if given). Note that an order can be fractional. If only the stoichiometric coefficients of the reaction rule are given (Eq. (1)), the order is the sum of the coefficients of the left-hand-side species. When assuming mass-action kinetics, both definitions are equivalent.

Autocatalytic set A (self-maintaining) set where each molecule is produced catalytically by molecules from that set. Note that an autocatalytic set may produce molecules not present in that set.

Closure A set of molecules A is closed if no combination of molecules from A can react to form a molecule outside A. Note that the term "closure" has also been used to denote the catalytic closure of an autocatalytic set.

Self-maintaining A set of molecules is called self-maintaining if it is able to maintain all its constituents. In a purely autocatalytic system under flow conditions, this means that every molecule can be catalytically produced by at least one reaction among molecules from the set.

(Chemical) Organization A closed and self-maintaining set of molecules.

Multiset A multiset is like a set, but its elements can appear more than once; that is, each element has a multiplicity, e. g., {a, a, b, c, c, c}.

Definition of the Subject

Artificial chemistries are chemical-like systems or abstract models of chemical processes. They are studied in order to illuminate and understand fundamental principles of chemical systems, as well as to exploit the chemical metaphor as a design principle for information processing systems in fields like chemical computing or nonlinear optimization. An artificial chemistry (AC) is usually a formal (and, more rarely, a physical) system that consists of objects called molecules, which interact according to rules called reactions. Compared to conventional chemical models, artificial chemistries are more abstract in the sense that there is usually not a one-to-one mapping between the molecules (and reactions) of the artificial chemistry and real molecules (and reactions). An artificial chemistry aims at capturing the logic of chemistry rather than trying to explain a particular chemical system. More formally, an artificial chemistry can be defined by a set of molecular species M, a set of reaction rules R, and a specification of the dynamics, e. g., kinetic laws, update algorithm, and geometry of the reaction vessel. The scientific motivation for AC research is to build abstract models in order to understand chemical-like systems in all kinds of domains, ranging from physics, chemistry, and biology to computer science and sociology.
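The formal core of an AC, a species set M and a rule set R, can be represented directly in code, and the closure property defined in the glossary can then be checked mechanically. The following Python sketch is illustrative only (the species names and the rules are invented for this example); it encodes second-order reaction rules as a map from ordered colliding pairs to products and repeatedly adds products until nothing new can be formed:

```python
# Compute the closure of a molecule set under binary reaction rules.
# rules maps an (ordered) pair of colliding molecules to a product;
# the species names a, b, c, d are hypothetical examples.
def closure(initial, rules):
    closed = set(initial)
    changed = True
    while changed:
        changed = False
        for m1 in list(closed):
            for m2 in list(closed):
                product = rules.get((m1, m2))
                if product is not None and product not in closed:
                    closed.add(product)
                    changed = True
    return closed

rules = {("a", "b"): "c", ("c", "c"): "d", ("d", "a"): "a"}
print(sorted(closure({"a", "b"}, rules)))  # ['a', 'b', 'c', 'd']
```

The result is closed by construction: no combination of molecules in it can react to form anything outside it. A self-maintenance check would additionally verify that every member is the product of some reaction among members.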
Abstract chemical models to study fundamental principles, such as spatial pattern formation and self-replication, can be traced back to the beginnings of modern computer science in the 1940s [91,97]. Since then a growing number of approaches have tackled questions concerning the origin of complex forms, the origin of life itself [19,81], and cellular diversity [48]. In the same way, a variety of approaches for constructing ACs have appeared, ranging from simple ordinary differential equations [20] to complex individual-based algebraic transformation systems [26,36,51].
The engineering motivation for AC research aims at employing the chemical metaphor as a programming and computing paradigm. Approaches can be distinguished according to whether chemistry serves as a paradigm to program or construct conventional in-silico information processing systems [4], or whether real molecules are used for information processing, as in molecular computing [13] or DNA computing [2] (see Cellular Computing, DNA Computing).

Introduction

The term artificial chemistry appeared around 1990 in the field of artificial life [72]. However, models that fall under the definition of artificial chemistry appeared decades earlier, such as von Neumann's universal self-replicating automata [97]. In the 1950s, a famous abstract chemical system was introduced by Turing [91] in order to show how spatial diffusion can destabilize a chemical system, leading to spatial pattern formation. Turing's artificial chemistries consist of only a handful of chemical species interacting according to a couple of reaction rules, each carefully designed. The dynamics is simulated by a partial differential equation. Turing's model possesses the structure of a conventional reaction kinetics chemical model (Subsect. "The Chemical Differential Equation"). However, it does not aim at modeling a particular reaction process, but at exploring how, in principle, spatial patterns can be generated by simple chemical processes. Turing's artificial chemistries allow one to study whether a particular mechanism (e. g., destabilization through diffusion) can explain a particular phenomenon (e. g., symmetry breaking and pattern formation). The example illustrates that studying artificial chemistries is a more synthetic approach. That is, understanding should be gained through the synthesis of an artificial system and its observation under various conditions [45,55].
The underlying assumption is that understanding a phenomenon and knowing how that phenomenon can be (re-)created (without copying it) are closely related [96]. Today's artificial chemistries are much more complex than Turing's model. They consist of a huge, sometimes infinite, number of different molecular species, and their reaction rules are defined not manually one by one, but implicitly (Sect. "Structure-to-Function Mapping") and can evolve (Sect. "Evolution"). Questions that are tackled include the structure and organization of large chemical networks, the origin of life, prebiotic chemical evolution, and the emergence of chemical information and its processing.
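The synthetic approach exemplified by Turing's model can be sketched in a few lines: simulate a two-species reaction-diffusion system and observe what happens to a perturbed homogeneous state. The kinetics and coefficients below are generic activator-inhibitor choices made up for this illustration, not Turing's original model; whether spatial patterns actually grow depends on the coefficients and the wavelength of the perturbation:

```python
import math

# Minimal 1-D two-species reaction-diffusion simulation (explicit Euler).
# Kinetics and coefficients are illustrative choices, not Turing's.
N, steps, dt = 64, 200, 0.01
Du, Dv = 0.01, 0.5        # the inhibitor v diffuses much faster than u

# homogeneous steady state u = v = 1, perturbed at a short wavelength
u = [1.0 + 0.01 * math.sin(2 * math.pi * 15 * i / N) for i in range(N)]
v = [1.0] * N

def laplacian(x, i):
    # discrete Laplacian on a ring (periodic boundary conditions)
    return x[(i - 1) % N] - 2.0 * x[i] + x[(i + 1) % N]

for _ in range(steps):
    un, vn = u[:], v[:]
    for i in range(N):
        f = u[i] * u[i] / v[i] - u[i]    # activator kinetics
        g = u[i] * u[i] - v[i]           # inhibitor kinetics
        un[i] = u[i] + dt * (Du * laplacian(u, i) + f)
        vn[i] = v[i] + dt * (Dv * laplacian(v, i) + g)
    u, v = un, vn

amplitude = max(u) - min(u)
print(f"activator amplitude after {steps} steps: {amplitude:.4f}")
```

Varying the diffusion ratio Dv/Du and the perturbation wavelength and watching the amplitude grow or decay is exactly the kind of interactive, synthetic experimentation the article describes.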
Introductory Example

It is relatively easy to create a complex, infinitely sized reaction network. Assume that the set of possible molecules M consists of strings over an alphabet. Next, we have to define a reaction function describing what happens when two molecules a1, a2 ∈ M collide. A general approach is to map a1 to a machine, operator, or function f = fold(a1) with f : M → M ∪ {elastic}. The mapping fold can be defined in various ways, such as by interpreting a1 as a computer program, a Turing machine, or a lambda expression (Sect. "Structure-to-Function Mapping"). Next, we apply the machine f derived from a1 to the second molecule. If f(a2) = elastic, the molecules collided elastically and nothing happens. This could be the case if the machine f does not halt within a predefined amount of time. Otherwise the molecules react and a new molecule a3 = f(a2) is produced. How this new molecule changes the reaction vessel depends on the algorithm simulating the dynamics of the vessel. Algorithm 1 is a simple, widely applied algorithm that simulates second-order catalytic reactions under flow conditions. Because the population size is constant and thus limited, there is competition, and only those molecules that are somehow created will persist. First, the algorithm chooses two molecules randomly, which simulates a collision. Note that a realistic measure of time takes the number of collisions, not the number of reaction events, into account. Then we determine stochastically whether the selected molecules react. If the molecules react, a randomly selected molecule from the population is replaced by the new product, which ensures that the population size stays constant. The replaced molecules constitute the outflow. The inflow is implicitly assumed and is generated by the catalytic production of molecules.
From a chemical point of view, the algorithm assumes that a product is built from an energy-rich substrate, which does not appear in the algorithm and which is assumed to be available at constant concentration. The following section (Sect. “Basic Building Blocks of an Artificial Chemistry”) describes the basic building blocks of an artificial chemistry in more detail. In Sect. “Structure-to-Function Mapping” we will explore various techniques to define reaction rules. The role of space is briefly discussed in Sect. “Space”. Then the important theoretical concepts of an autocatalytic set and a chemical organization are explained (Sect. “Theory”). Sect. “Evolution” shows how artificial chemistries can be used to study (chemical) evolution. This article does not discuss information processing in detail, because there are a series of other specialized articles on that topic, which are summarized in Sect. “Information Processing”.
Artificial Chemistry
INPUT: P : population (array of k molecules)

randomPosition() := randomInt(0, k-1);
reaction(r1, r2) := (fold(r1))(r2);

while not terminate() do
  reactant1 := P[randomPosition()];
  reactant2 := P[randomPosition()];
  product := reaction(reactant1, reactant2);
  if product ≠ elastic then
    P[randomPosition()] := product;
  fi
  t := t + 1/k ;; increment simulated time
od

Artificial Chemistry, Algorithm 1
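A runnable Python version of Algorithm 1 might look as follows; the integer molecules and the modular-addition fold are assumptions chosen only to make the sketch self-contained, not one of the mappings discussed in the text:

```python
import random

def simulate(pop, fold, steps=100_000):
    """Algorithm 1: second-order catalytic reactions under flow, constant population size."""
    k = len(pop)
    t = 0.0
    for _ in range(steps):
        a1 = pop[random.randrange(k)]           # draw two molecules: a collision
        a2 = pop[random.randrange(k)]
        product = fold(a1)(a2)                  # fold a1 into a function, apply it to a2
        if product is not None:                 # None plays the role of 'elastic'
            pop[random.randrange(k)] = product  # product replaces a random molecule (outflow)
        t += 1.0 / k                            # time is counted per collision, not per reaction
    return pop, t

# Toy structure-to-function mapping (illustrative assumption):
# molecules are integers, and a1 acts on a2 by addition modulo 100.
random.seed(0)
pop, t = simulate([random.randrange(100) for _ in range(1000)],
                  fold=lambda a1: (lambda a2: (a1 + a2) % 100))
```

The collision counter advances t by 1/k per draw, so steps/k units of simulated time elapse regardless of how many collisions were elastic.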
Basic Building Blocks of an Artificial Chemistry
Molecules

The first step in defining an AC requires that we define the set of all molecular species that can appear in the AC. The easiest way to specify this set is to enumerate explicitly all molecules as symbols, for example: M = {H2, O2, H2O, H2O2} or, equivalently, M = {a, b, c, d}. A symbol is just a name without any structure. Most conventional (bio-)chemical reaction system models are defined this way. Also many artificial chemistries where networks are designed "by hand" [20], created randomly [85], or evolved by changing the reaction rules [41,42,43] use only symbolic molecules without any structure. Alternatively, molecules can possess a structure. In that case, the set of molecules is implicitly defined, for example: M = {all polymers that can be made from the two monomers a and b}. The set of all possible molecules can then become quite large or even infinite. A vast variety of such rule-based molecule definitions can be found in different approaches. For example, structured molecules may be abstract character sequences [3,23,48,64], sequences of instructions [1,39], lambda-expressions [26], combinators [83], binary strings [5,16,90], numbers [8], machine code [72], graphs [78], swarms [79], or proofs [28]. We call a molecule's representation its structure, in contrast to its function, which is given by the reaction rules R. The description of the valid molecules and their structure is usually the first step when an AC is defined. This is analogous to the part of chemistry that describes which atom configurations form stable molecules and how these molecules appear.

Reaction Rules

The set of reaction rules R describes how molecules from M = {a1, a2, ...} interact. A rule ρ ∈ R can be written
according to the chemical notation of reaction rules in the form

l_{a1,ρ} a1 + l_{a2,ρ} a2 + ⋯ → r_{a1,ρ} a1 + r_{a2,ρ} a2 + ⋯ .   (1)

The stoichiometric coefficients l_{a,ρ} and r_{a,ρ} describe the amount of molecular species a ∈ M in reaction ρ ∈ R on the left-hand and right-hand side, respectively. Together, the stoichiometric coefficients define the stoichiometric matrix

S = (s_{a,ρ}) = (r_{a,ρ} − l_{a,ρ}) .   (2)
An entry s_{a,ρ} of the stoichiometric matrix denotes the net amount of molecules of type a produced in reaction ρ. A reaction rule determines a multiset of molecules (the left-hand side) that can react and subsequently be replaced by the molecules on the right-hand side. Note that the sign "+" is not an operator here, but only separates the components on either side. The set of all molecules M and the set of reaction rules R define the reaction network ⟨M, R⟩ of the AC. The reaction network is equivalent to a Petri net [71]. It can be represented by the two matrices (l_{a,ρ}) and (r_{a,ρ}) of stoichiometric coefficients. Equivalently, we can represent the reaction network as a hypergraph, or as a directed bipartite graph with two node types denoting reaction rules and molecular species, respectively, and with edge labels for the stoichiometry. A rule is applicable only if certain conditions are fulfilled. The major condition is that all of the left-hand side components must be available. This condition can
be broadened easily to include other parameters such as neighborhood, rate constants, probability for a reaction, influence of modifier species, or energy consumption. In such a case, a reaction rule would also contain additional information or further parameters. Whether or not these additional predicates are taken into consideration depends on the objectives of the artificial chemistry. If it is meant to simulate real chemistry as accurately as possible, it is necessary to integrate these parameters into the simulation system. If the goal is to build an abstract model, many of these parameters can be omitted. As with the set of molecules, we can define the set of reaction rules explicitly by enumerating all rules symbolically [20], or implicitly by referring to the structure of the molecules. Implicit definitions of reaction rules use, for example, string matching/string concatenation [3,60,64], lambda-calculus [26,27], Turing machines [40,90], finite state machines [98] or machine code language [1,16,78,87], proof theory [27], matrix multiplication [5], swarm dynamics [79], or simple arithmetic operations [8]. Note that in some cases the reactions can emerge as the result of the interaction of many atomic particles, whose dynamics is specified by force fields or rules [79]. Section "Structure-to-Function Mapping" will discuss implicitly defined reactions in more detail.

Dynamics The third element of an artificial chemistry is the specification of how the reaction rules give rise to a dynamic process of molecules reacting in some kind of reaction vessel. It is assumed that the reaction vessel contains multiple copies of molecules, which can be seen as instances of the molecular species M. This section summarizes how the dynamics of a reaction vessel (which usually contains a huge number of molecules) can be modeled and simulated.
The approaches can be characterized roughly by whether each molecule is treated explicitly or whether all molecules of one type are represented by a number denoting their frequency or concentration.

The Chemical Differential Equation A reaction network ⟨M, R⟩ specifies the structure of a reaction system, but does not contain any notion of time. A common way to specify the dynamics of the reaction system is by a system of ordinary differential equations of the form

ẋ(t) = S v(x(t)) ,   (3)

where x = (x_1, …, x_m)^T ∈ ℝ^m is a concentration vector
depending on time t, S is the stoichiometric matrix derived from R (Eq. (2)), and v = (v_1, …, v_r)^T ∈ ℝ^r is a flux vector depending on the current concentration vector. A flux v_ρ ≥ 0 describes the velocity or turnover rate of reaction ρ ∈ R. The actual value of v_ρ usually depends on the concentration of the species participating in the reaction (i.e., LHS(ρ) = {a ∈ M : l_{a,ρ} > 0}) and sometimes on additional modifier species, whose concentration is not influenced by that reaction. There are two important assumptions that are due to the nature of reaction systems. These assumptions relate the kinetic function v to the reaction rules R:

Assumption 1 A reaction ρ can take place only if all species of its left-hand side LHS(ρ) are present. This implies that for all molecules a ∈ M and reactions ρ ∈ R with a ∈ LHS(ρ), if x_a = 0 then v_ρ = 0. The flux v_ρ must be zero if the concentration x_a of a molecule appearing on the left-hand side of this reaction is zero. This assumption meets the obvious fact that a molecule has to be present to react.

Assumption 2 If all species LHS(ρ) of a reaction ρ ∈ R are present in the reactor (i.e., for all a ∈ LHS(ρ), x_a > 0), the flux of that reaction is positive (i.e., v_ρ > 0). In other words, the flux v_ρ must be positive if all molecules required for that reaction are present, even in small quantities. Note that this assumption implies that a modifier species necessary for that reaction must appear as a species on the left- (and right-) hand side of ρ. In short, taking Assumptions 1 and 2 together, we demand for a chemical differential equation:

v_ρ > 0 ⇔ ∀a ∈ LHS(ρ): x_a > 0 .   (4)
Note that Assumption 1 is absolutely necessary. Although Assumption 2 is very reasonable, there might be situations where it is not true. Assume, for example, a reaction of the form a + a → b. If there is only one single molecule a left in the reaction vessel, the reaction cannot take place. Note, however, that if we want to model this effect, an ODE is the wrong approach, and we should choose one of the methods described in the following sections. There is a large number of kinetic laws fulfilling these assumptions (Eq. (4)), including all laws that are usually applied in practice. The most fundamental such kinetic law is mass-action kinetics, which is just the product of the concentrations of the interacting molecules (cf. [32]):

v_ρ(x) = ∏_{a∈M} x_a^{l_{a,ρ}} .   (5)
The sum of the exponents in the kinetic law, ∑_{a∈M} l_{a,ρ}, is called the order of the reaction. Most more complicated laws, like Michaelis–Menten kinetics, are derived from mass-action kinetics. In this sense, mass-action kinetics is the most general approach, allowing one to emulate all more complicated laws derived from it, but at the expense of a larger number of molecular species.

Explicitly Simulated Collisions The chemical differential equation is a time- and state-continuous approximation of a discrete system in which molecules collide stochastically and react. In the introduction we saw how this can be simulated by a rather simple algorithm, which is widely used in AC research [5,16,26]. The algorithm presented in the introduction simulates a second-order catalytic flow system with constant population size. It is relatively easy to extend this algorithm to include reaction rates and arbitrary orders. First, the algorithm chooses a subset from P, which simulates a collision among molecules. Note that the resulting subset could be empty, which can be used to simulate an inflow (i.e., reactions of the form ∅ → A, where the left-hand side is empty). By defining the probability of obtaining a subset of size n, we can control the rate of reactions of order n. If the molecules react, the reactants are removed from the population and the products are added. Note that this algorithm can be interpreted as a stochastic rewriting system operating on a multiset of molecules [70,88]. For large population size k, the dynamics created by the algorithm tends to the continuous dynamics of the chemical differential equation assuming mass-action kinetics. For the special case of second-order catalytic flow, as described by the algorithm in the introduction, the dynamics can be described by the catalytic network equation [85]

ẋ_k = ∑_{i,j∈M} α^k_{i,j} x_i x_j − x_k Φ(x) ,  with  Φ(x) = ∑_{i,j,k∈M} α^k_{i,j} x_i x_j ,   (6)
which is a generalization of the replicator equation [35]. If the reaction function is deterministic, we can set the kinetic constants to

α^k_{i,j} = 1 if reaction(i, j) = k, and 0 otherwise.   (7)

This dynamic is of particular interest because competition is generated by the limited population size. Note that a limited population size is equivalent to an unlimited population size with a limited substrate, e.g., [3].
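As a concrete illustration of Eq. (3) with mass-action kinetics (Eq. (5)), the following sketch integrates a toy two-species network with explicit Euler steps; the network (a + b → 2b, b → a), the rate constants, and the step size are assumptions made only for illustration:

```python
# Reaction network (a toy example, not from the text):
#   rho 1: a + b -> 2b      rho 2: b -> a
species = ["a", "b"]
l = {("a", 1): 1, ("b", 1): 1, ("b", 2): 1}   # left coefficients  l_{a,rho}
r = {("b", 1): 2, ("a", 2): 1}                # right coefficients r_{a,rho}
S = {(a, rho): r.get((a, rho), 0) - l.get((a, rho), 0)
     for a in species for rho in (1, 2)}      # stoichiometric matrix, Eq. (2)

def v(x, k={1: 1.0, 2: 1.0}):
    """Mass-action fluxes, Eq. (5): v_rho = k_rho * prod_a x_a ** l_{a,rho}."""
    flux = {}
    for rho in (1, 2):
        f = k[rho]
        for a in species:
            f *= x[a] ** l.get((a, rho), 0)
        flux[rho] = f
    return flux

x = {"a": 0.9, "b": 0.1}
dt = 0.01
for _ in range(2000):                          # explicit Euler steps of Eq. (3)
    flux = v(x)
    x = {a: x[a] + dt * sum(S[(a, rho)] * flux[rho] for rho in (1, 2))
         for a in species}
print(x)   # b is consumed; total concentration is conserved by this network
```

Because each column of S sums to zero, the total concentration x_a + x_b stays constant along the trajectory, which is a useful sanity check for any hand-built network.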
Discrete-Event Simulation If the copy number of a particular molecular species becomes high, or if the reaction rates differ strongly, it is more efficient to use discrete-event simulation techniques. The most famous technique is Gillespie's algorithm [32]. Roughly, in each step the algorithm generates two random numbers, depending on the current population state, to determine the next reaction to occur as well as the time interval Δt after which it will occur. The simulated time is then advanced, t := t + Δt, and the molecule count of the population is updated according to the reaction that occurred. The Gillespie algorithm generates a statistically correct trajectory of the chemical master equation. The chemical master equation is the most general way to formulate the stochastic dynamics of a chemical system: a first-order differential equation describing the time evolution of the probability of the reaction system to occupy each one of its possible discrete states. In a well-stirred (non-spatial) AC, a state is equivalent to a multiset of molecules.

Structure-to-Function Mapping A fundamental "logic" of real chemical systems is that molecules possess a structure and that this structure determines how a molecule interacts with other molecules. Thus, there is a mapping from the structure of a molecule to its function, which determines the reaction rules. In real chemistry the mapping from structure to function is given by natural or physical laws. In artificial chemistries the structure-to-function mapping is usually given by some kind of algorithm. In most cases, an artificial chemistry using a structure-to-function mapping does not aim at modeling real molecular dynamics in detail. Rather, the aim is to capture the fact that there is a structure, a function, and a relation between them. Having an AC with a structure-to-function mapping at hand, we can study strongly constructive dynamical systems [27,28].
These are systems where, through the interaction of their components, novel molecular species appear.

while not terminate() do
  reactants := chooseASubsetFrom(P);
  if randomReal(0, 1) < reactionProbability(reactants) then
    products := reaction(reactants);
    P := remove(P, reactants);
    P := insert(P, products);
  fi
  t := t + 1/sizeOf(P) ;; increment simulated time
od

Artificial Chemistry, Algorithm 2

Example: Lambda-Calculus (AlChemy) This section demonstrates a structure-to-function mapping that employs a concept from computer science: the λ-calculus. The λ-calculus has been used by Fontana [26] and Fontana and Buss [27] to define a constructive artificial chemistry. In the so-called AlChemy, a molecule is a normalized λ-expression. A λ-expression is a word over an alphabet A = {λ, ., (, )} ∪ V, where V = {x1, x2, …} is an infinite set of available variable names. The set of expressions Λ is defined for x ∈ V, s1 ∈ Λ, s2 ∈ Λ by

x ∈ Λ (variable name),  λx.s2 ∈ Λ (abstraction),  (s2)s1 ∈ Λ (application).
An abstraction λx.s2 can be interpreted as a function definition, where x is the parameter in the "body" s2. The expression (s2)s1 can be interpreted as the application of s2 to s1, which is formalized by the rewriting rule

(λx.s2)s1 = s2[x ← s1] ,   (8)
where s2[x ← s1] denotes the term generated by replacing every unbound occurrence of x in s2 by s1. A variable x is bound if it appears in a form like …(λx. … x …)…. It is also not allowed to apply the rewriting rule if a variable would become bound by it. Example: Let s1 = λx1.(x1)λx2.x2 and s2 = λx3.x3; then we can derive:

(s1)s2 ⇒ (λx1.(x1)λx2.x2)λx3.x3 ⇒ (λx3.x3)λx2.x2 ⇒ λx2.x2 .   (9)
The simplest way to define a set of second-order catalytic reactions R is by applying a molecule s1 to another molecule s2:

s1 + s2 → s1 + s2 + normalForm((s1)s2) .   (10)
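The reaction of Eq. (10) can be sketched with a miniature normal-order β-reducer. Here expressions are nested tuples and a fixed step budget stands in for the time/memory bound that makes over-long reductions count as elastic; the representation and the budget are assumptions of this sketch, not Fontana's implementation:

```python
import itertools

fresh = (f"v{i}" for i in itertools.count())   # fresh names for capture avoidance

def free_vars(term):
    kind = term[0]
    if kind == 'var': return {term[1]}
    if kind == 'app': return free_vars(term[1]) | free_vars(term[2])
    return free_vars(term[2]) - {term[1]}       # abstraction binds term[1]

def subst(term, x, s):
    """Replace free occurrences of variable x in term by s (capture-avoiding)."""
    kind = term[0]
    if kind == 'var':
        return s if term[1] == x else term
    if kind == 'app':
        return ('app', subst(term[1], x, s), subst(term[2], x, s))
    y, body = term[1], term[2]
    if y == x:
        return term                              # x is bound here, stop
    if y in free_vars(s):                        # rename bound variable to avoid capture
        z = next(fresh)
        body = subst(body, y, ('var', z))
        y = z
    return ('lam', y, subst(body, x, s))

def step(term):
    """One normal-order beta-reduction step; None if term is in normal form."""
    if term[0] == 'app':
        f, a = term[1], term[2]
        if f[0] == 'lam':
            return subst(f[2], f[1], a)          # Eq. (8): (lambda x.s2)s1 = s2[x <- s1]
        r = step(f)
        if r is not None: return ('app', r, a)
        r = step(a)
        if r is not None: return ('app', f, r)
    elif term[0] == 'lam':
        r = step(term[2])
        if r is not None: return ('lam', term[1], r)
    return None

def normal_form(term, max_steps=100):
    """Reduce to normal form; None ('elastic collision') if the budget is exceeded."""
    for _ in range(max_steps):
        nxt = step(term)
        if nxt is None:
            return term
        term = nxt
    return None

# The derivation of Eq. (9): s1 = lambda x1.(x1) lambda x2.x2,  s2 = lambda x3.x3
s1 = ('lam', 'x1', ('app', ('var', 'x1'), ('lam', 'x2', ('var', 'x2'))))
s2 = ('lam', 'x3', ('var', 'x3'))
print(normal_form(('app', s1, s2)))   # ('lam', 'x2', ('var', 'x2')), i.e. lambda x2.x2
```

Running it reproduces the result of Eq. (9): the collision product is λx2.x2.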
The procedure normalForm reduces its argument to normal form, which in practice is bounded by a maximum of available time and memory; if these resources are exceeded before termination, the collision is considered elastic. It should be noted that the λ-calculus allows an elegant generalization of the collision rule by defining it through a λ-expression Φ ∈ Λ: s1 + s2 → s1 + s2 + normalForm(((Φ)s1)s2). In order to simulate the artificial chemistry, Fontana and Buss [27] used an algorithm like the one described in Subsect. "Explicitly Simulated Collisions",
which simulates second-order catalytic mass-action kinetics in a well-stirred population under flow conditions. AlChemy possesses a couple of important general properties: Molecules come in two forms, as passive data possessing a structure and as active machines operating on those structures. This generates a "strange loop" as in typogenetics [36,94], which allows molecules to refer to themselves and to operate on themselves. Structures operate on structures, and by doing so change the way structures operate on structures, and so on, creating a "strange", self-referential dynamic loop. Obviously, there are always some laws that cannot be changed; here these are the rules of the λ-calculus, which define the structure-to-function mapping. Thus, we can interpret the fixed and predefined laws of an artificial chemistry as its natural or physical laws.

Arithmetic and Logic Operations One of the easiest ways to define reaction rules implicitly is to apply arithmetic operations from mathematics. Even the simplest rules can generate interesting behaviors. Assume, for example, that the molecules are natural numbers, M = {0, 1, 2, …, n−1}, and that the reaction rules are defined by division: reaction(a1, a2) = a1/a2 if a1 is a multiple of a2; otherwise the molecules do not react. For the dynamics we assume a finite, discrete population and an algorithm like the one described in Subsect. "Explicitly Simulated Collisions". With increasing population size the resulting reaction system displays a phase transition, at which the population becomes able to produce prime numbers with high probability (see [8] for details). In a typical simulation, the initially high diversity of a random population is reduced, leaving a set of non-reacting prime numbers. A quite different behavior can be obtained by simply replacing the division by addition: reaction(a1, a2) = a1 + a2 mod n, with n = |M| the number of molecular species. Initializing the well-stirred tank reactor only with
the molecular species 1, the diversity rapidly increases toward its maximum, and after a very short transient phase the reactor behaves in an apparently totally chaotic way (for a sufficiently large population size). However, there are still regularities, which depend on the prime factors of n (cf. [15]).

Matrix-Multiplication Chemistry More complex reaction operators can be defined that operate on vectors or strings instead of scalar numbers. The matrix multiplication chemistry introduced by Banzhaf [5,6,7] uses binary strings as molecules. The reaction between two binary strings is performed by folding one of them into a matrix, which then operates on the other string by multiplication.
The system has a couple of interesting properties: Despite the relatively simple definition of the basic reaction mechanism, the resulting complexity of the reaction system and its dynamics is surprisingly high. As in typogenetics and Fontana's lambda-chemistry, molecules appear in two forms, as passive data (binary strings) and as active operators (binary matrices). The folding of a binary string into a matrix is the central operation of the structure-to-function mapping. The matrix is multiplied with substrings of the operand, so some kind of locality is preserved, which mimics the local operation of macromolecules (e.g., ribosomes or restriction enzymes) on polymers (e.g., RNA or DNA). Locality is also conserved in the folding, because bits that are close in the string are also close in the matrix.
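A minimal sketch of this fold-and-multiply chemistry for 4-bit molecules follows; the threshold value, the random seed, and the collision-driver loop are assumptions made for illustration:

```python
import random

def fold(s):
    """Fold a 4-bit molecule (s11, s12, s13, s14) into a 2x2 binary matrix."""
    return [[s[0], s[1]], [s[2], s[3]]]

def threshold_mult(M, x, theta=1):
    """Threshold product: y_j = 1 iff sum_i x_i * M_ij reaches the threshold theta."""
    return tuple(1 if sum(x[i] * M[i][j] for i in range(2)) >= theta else 0
                 for j in range(2))

def reaction(s1, s2):
    """s1 + s2 => s3: fold s1, multiply it with both halves of s2, concatenate."""
    M = fold(s1)
    return threshold_mult(M, s2[:2]) + threshold_mult(M, s2[2:])

# Drive the reaction with the collision algorithm from the introduction.
random.seed(0)
pop = [tuple(random.randint(0, 1) for _ in range(4)) for _ in range(100)]
for _ in range(10_000):
    s1, s2 = random.choice(pop), random.choice(pop)
    pop[random.randrange(len(pop))] = reaction(s1, s2)
```

For example, a molecule that folds into the identity matrix, (1,0,0,1), reproduces any collision partner unchanged, which is the simplest self-supporting operator in this chemistry.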
Example of a reaction for 4-bit molecules Assume a reaction s1 + s2 ⇒ s3. The general approach is:

1. Fold s1 into the matrix M. Example: for s1 = (s11, s12, s13, s14),

M = ( s11 s12 ; s13 s14 ) .   (11)

2. Multiply M with subsequences of s2. Example: let s2 = (s21, s22, s23, s24) be divided into the two subsequences s2_{12} = (s21, s22) and s2_{34} = (s23, s24). Then we can multiply M with each subsequence:

s3_{12} = M ⊙ s2_{12} ,  s3_{34} = M ⊙ s2_{34} .   (12)

3. Compose s3 by concatenating the products. Example: s3 = s3_{12} ⊕ s3_{34}.

There are various ways of defining the vector-matrix product ⊙. It was mainly used with the following threshold multiplication. Given a bit vector x = (x1, …, xn) and a bit matrix M = (M_ij), the term y = M ⊙ x is defined by:

y_j = 0 if ∑_{i=1}^{n} x_i M_{i,j} < Φ, and y_j = 1 otherwise.   (13)

The threshold multiplication is similar to the common matrix-vector product, except that the resulting vector is mapped to a binary vector by using the threshold Φ. Simulating the dynamics by an ODE or by explicit molecular collisions (Subsect. "Explicitly Simulated Collisions"), one observes that such a system develops into a steady state where some string species support each other in production and thus form a stable autocatalytic cycle, whereas others disappear due to the competition in the reactor.

Autocatalytic Polymer Chemistries
In order to study the emergence and evolution of autocatalytic sets [19,46,75], Bagley, Farmer, Fontana, Kauffman and others [3,23,47,48,59] have used artificial chemistries where the molecules are character sequences (e.g., M = {a, b, aa, ab, ba, bb, aaa, aab, …}) and the reactions are concatenation and cleavage, for example:

aa + babb ⇌ aababb   (slow) .   (14)

Additionally, each molecule can act as a catalyst, enhancing the rate of a concatenation reaction:

aa + babb + bbb ⇌ bbb + aababb   (fast) .   (15)
Figure 1 shows an example of an autocatalytic network that appeared. There are two time scales: reactions that are not catalyzed are assumed to be very slow and are not depicted, whereas catalyzed reactions, inflow, and decay are fast. Typical experiments simulate a well-stirred flow reactor using a meta-dynamical ODE framework [3]; that is, the ODE model is dynamically changed when new molecular species appear or present species vanish. Note that the structure of the molecules does not fully determine the reaction rules: which molecule catalyzes which reaction (the dotted arrows in Fig. 1) is assigned explicitly at random. An interesting aspect is that the catalytic or autocatalytic polymer sets (or reaction networks) evolve without having a genome [47,48]. The simulation studies show how small, spontaneous fluctuations can be amplified by an autocatalytic network, possibly leading to a modification of the entire network [3]. Recent studies by Fernando and Rowe [24] include further aspects like energy conservation and compartmentalization.
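A toy stochastic version of such a catalytic polymer chemistry can be sketched as follows. This is a hedged illustration only: the lazily drawn catalyst assignments, the rates, the maximum polymer length, and the omission of cleavage are all simplifying assumptions, far from the meta-dynamical ODE framework of [3]:

```python
import random
from collections import Counter

random.seed(1)

# Which molecule catalyzes which concatenation is assigned at random, not derived
# from structure, mirroring the random catalytic assignments described above.
catalyst = {}

def is_catalyzed(x, y, present):
    """Concatenation x + y -> xy is fast only if its (lazily drawn) catalyst is present."""
    if (x, y) not in catalyst:
        catalyst[(x, y)] = "".join(random.choice("ab")
                                   for _ in range(random.randint(1, 3)))
    return catalyst[(x, y)] in present

# Well-stirred flow reactor: a product replaces a random molecule (dilution flow);
# the monomers a and b flow in at a small rate.
pop = ["a"] * 50 + ["b"] * 50
for _ in range(20_000):
    x, y = random.choice(pop), random.choice(pop)
    if len(x + y) <= 6 and is_catalyzed(x, y, set(pop)):
        pop[random.randrange(len(pop))] = x + y
    if random.random() < 0.02:
        pop[random.randrange(len(pop))] = random.choice("ab")

print(Counter(pop).most_common(5))   # the surviving, mutually catalyzed polymers
```

Even this crude sketch shows the qualitative effect discussed above: only polymers whose production pathway is catalyzed by species that are themselves maintained survive the dilution flow.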
Artificial Chemistry, Figure 1
Example of an autocatalytic polymer network. The dotted lines represent catalytic links, e.g., aa + ba + aaa → aaba + aaa. All molecules are subject to a dilution flow. In this example, all polymerization reactions are reversible and there is a continuous inflow of the monomers a and b. Note that the network contains further autocatalytic networks, for example, {a, b, aa, ba} and {a, aa}, of which some are closed and thus organizations, for example, {a, b, aa, ba}. The set {a, aa} is not closed, because b and ba can be produced
Bagley and Farmer [3] showed that autocatalytic sets can be silhouetted against a noisy background of spontaneously reacting molecules under moderate (that is, neither too low nor too high) flow.

Artificial Chemistries Inspired by Turing Machines The concept of an abstract automaton or Turing machine provides a basis for a variety of structure-to-function mappings. In these approaches molecules are usually represented as sequences of bits, characters, or instructions [36,51,63,86]. A sequence of bits specifies the behavior of a machine by coding for its state transition function. Thus, as in the matrix-multiplication chemistry and the lambda-chemistry, a molecule appears in two forms, namely as passive data (e.g., a binary string) and as an active machine. Here, too, we can call the mapping from a binary string to its machine folding; this mapping might be nondeterministic and may depend on other molecules (e.g., [40]). In the early 1970s Laing [51] argued for abstract, non-analogous models in order to develop a general theory of living systems. Developing such a general theory would require so-called artificial organisms. Laing suggested a series of artificial organisms [51,52,53] that should allow one to study general properties of life and thus to derive a theory that is not restricted to the instance of life we observe on earth. The artificial organisms consist of different compartments, e.g., a "brain" plus "body" parts. These compartments contain binary strings as molecules. Strings are translated into a sequence of instructions forming a three-dimensional shape (cf. [78,86,94]). In order to perform a reaction, two molecules are attached such that they touch at one position (Fig. 2). One of the molecules is executed like a Turing machine, manipulating the passive data molecule as its tape. Laing proved that his artificial organisms are able to perform universal computation. He also demonstrated different forms of self-reproduction, self-description and self-inspection using his molecular machines [53]. Typogenetics is a similar approach, introduced in 1979 by Hofstadter to illustrate the "formal logic of life" [36]. Later, typogenetics was simulated and investigated in more detail [67,94,95]. The molecules of typogenetics are character sequences (called strands) over the alphabet A, C, G, T. The reaction rules are "typographic" manipulations based on a set of predefined basic operations like cutting, insertion, or deletion of characters. A sequence of such operations forms a unit (called an enzyme), which may operate on a character sequence like a Turing machine on its tape. A character string can be "translated" into an enzyme (i.e., a sequence of operations) by mapping two characters to one operation according to a predefined "genetic code".
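The translate-then-edit pattern of typogenetics can be sketched in a few lines. Note that the code table and the operations below are invented stand-ins for illustration, not Hofstadter's actual genetic code:

```python
# A toy "genetic code": each character pair names an editing operation.
# The table and operations are illustrative stand-ins, not Hofstadter's code.
CODE = {"AA": "cut", "AC": "del", "AG": "ins_A", "AT": "ins_T"}

def translate(strand):
    """Map a strand to an enzyme: a list of operations, two characters per operation."""
    return [CODE.get(strand[i:i + 2], "noop") for i in range(0, len(strand) - 1, 2)]

def apply_enzyme(enzyme, strand, pos=0):
    """Run the operations left to right on the strand, moving a cursor."""
    for op in enzyme:
        if op == "del" and pos < len(strand):
            strand = strand[:pos] + strand[pos + 1:]
        elif op == "ins_A":
            strand = strand[:pos] + "A" + strand[pos:]; pos += 1
        elif op == "ins_T":
            strand = strand[:pos] + "T" + strand[pos:]; pos += 1
        elif op == "cut":
            return strand[:pos]          # cutting terminates, keeping the left part
    return strand

enzyme = translate("ACAG")               # -> ["del", "ins_A"]
print(apply_enzyme(enzyme, "GGGG"))      # "AGGG": delete the G at pos 0, insert A
```

The "strange loop" discussed above lives in exactly this round trip: the product strand can itself be translated into an enzyme that edits further strands.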
Artificial Chemistry, Figure 2
Illustration of Laing's molecular machines. A program molecule is associated with a data molecule. Figure from [52]

Another artificial chemistry whose reaction mechanism is inspired by the Turing machine was suggested by McCaskill [63] in order to study the self-organization of complex chemical systems consisting of catalytic polymers. Variants of this AC were later realized in a specially designed parallel reconfigurable computer based on FPGAs (Field Programmable Gate Arrays) [12,18,89]. In this approach, molecules are binary strings of fixed length (e.g., 20 [63]) or of variable length [12]. As in previous approaches, a string codes for an automaton able to manipulate other strings. And again, pattern matching is used to check whether two molecules can react and to obtain the "binding site", i.e., the location where the active molecule (machine) manipulates the passive molecule (tape). The general reaction scheme can be written as:

s_M + s_T → s_M + s_T' .   (16)

In experimental studies, self-replicating strings appeared frequently. In coevolution with parasites, an evolutionary arms race started among these species, and the self-replicating strings diversified to an extent that the parasites could not coadapt and went extinct. In a spatial environment (e.g., a 2-D lattice), sets of cooperating polymers evolved, interacting in a hypercyclic fashion. The authors also observed a chemoton-like [30] cooperative behavior, with spatially isolated, membrane-bounded evolutionarily stable molecular organizations.

Machines with Fixed Tape Size In order to perform large-scale systematic simulation studies, such as the investigation of noise [40] or intrinsic evolution [16], it makes sense to limit the study to molecules of fixed, tractable length. Ikegami and Hashimoto [40] developed an abstract artificial chemistry with two types of molecular species: 7-bit long tapes, which are mapped to 16-bit long machines. Tapes and machines form two separate populations, simulated in parallel. Reactions take place between a tape and a machine according to the following reaction scheme:

s_M + s_T → s_M + s_T + s_M' + s_T' .   (17)
A machine s_M can react with a tape s_T if its head matches a substring of the tape and its tail matches a different substring of the tape. The machine operates only between these two substrings (called the reading frame), which results in a tape s_T'. The tape s_T' is then "translated" (folded) into a machine s_M'. Ikegami and Hashimoto [40] showed that under the influence of low noise, simple autocatalytic loops are formed. When the noise level is increased, the reaction network is destabilized by parasites, but after a relatively long transient phase (about 1000 generations) a very stable, dense reaction network appears, called a core network [40]. A core network maintains its relatively high diversity even if the noise is deactivated, and its active mutation rate is high. When the noise level is very high, only small, degenerate core networks emerge, with low diversity and very low (or even no) active mutation. The core networks that emerged under the influence of external noise are very stable, so that there is no further development after they have appeared.

Assembler Automata An assembler automaton is like a parallel von Neumann machine. It consists of a core memory and a set of processing units running in parallel. Inspired by Core Wars [14], assembler automata have been used to create certain artificial life systems, such as Coreworld [72,73], Tierra [74], Avida [1], and Amoeba [69]. Although these machines have been classified as artificial chemistries [72], it is in general difficult to identify molecules or reactions. Furthermore, the assembler automaton Tierra has explicitly
been designed to imitate the Cambrian explosion and not a chemical process. Nevertheless, in some cases we can interpret the execution of an assembler automaton as a chemical process; this is possible in an especially clear way in experiments with Avida. Here, a molecule is a single assembler program, which is protected by a memory management system. The system is initialized with manually written programs that are able to self-replicate. Therefore, in a basic version of Avida, only unimolecular first-order reactions occur, which are of replicator type. The reaction scheme can be written as a → a + mutation(a). The function mutation represents the possible errors that can occur while the program is self-replicating.

Lattice Molecular Systems In this section we discuss systems that consist of a regular lattice, where each lattice site can hold a part (e.g., an atom) of a molecule. Between parts, bonds can be formed, so that a molecule covers many lattice sites. This is different from systems where a lattice site holds a complete molecule, as in Avida. The important difference is that in lattice molecular systems, the space of the molecular structure is identical to the space in which the molecules are floating around. In systems where a molecule covers just one lattice site, the molecular structure is described in a different space, independently from the space in which the molecule is located. Lattice molecular systems have been used intensively to model polymers [92], protein folding [82], and RNA structures. Besides approaches that intend to model real molecular dynamics as accurately as possible, there are also approaches that try to build abstract models. These models should give insight into statistical properties of polymers, like their energy landscapes and folding processes [76,77], but they are not meant to address questions concerning the origin and evolution of self-maintaining organizations or molecular information processing.
For these questions more abstract systems are studied as described in the following. Varela, Maturana, and Uribe introduced in [93] a lattice molecular system to illustrate their concept of autopoiesis (cf. [65,99]). The system consists of a 2-D square lattice. Each lattice site can be occupied by one of the following atoms: Substrate S, catalyst K, and monomer L. Atoms may form bonds and thus form molecular structures on the lattice. If molecules come close to each other they may react according to the following reaction rules: K C 2S ! K C L (1) Composition: Formation of a monomer. : : : -L-L C L ! : : : -L-L-L (2) Concatenation:
Artificial Chemistry, Figure 3 Illustration of a lattice molecular automaton that contains an autopoietic entity. Its membrane is formed by a chain of L monomers and encloses a catalyst K. Only substrate S may diffuse through the membrane. Substrate inside is catalyzed by K to form free monomers. If the membrane is damaged by disintegration of a monomer L it can be quickly repaired by a free monomer. See [65,93] for details
L + L → L-L
Formation of a bond between a monomer and another monomer with no more than one bond. This reaction is inhibited by a doubly-bonded monomer [65].

L → 2S    (3)
Disintegration: decomposition of a monomer.

Figure 3 illustrates an autopoietic entity that may arise. Note that it is quite difficult to find the right dynamical conditions under which such autopoietic structures are stable [65,68].
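A toy sketch of how such rules can be realized computationally. This ignores bonds, membranes, and diffusion, so it is far from the full model of [93]; the lattice size, initial mixture, and rates are illustrative. It checks one useful invariant: both rules conserve the substrate equivalents #S + 2·#L.

```python
import random

# Toy lattice sketch (illustrative only): rules (1) and (3) of the
# Varela/Maturana/Uribe chemistry, without bonds, membranes, or diffusion.
N = 10
rng = random.Random(1)
grid = {(x, y): rng.choice("SSSK.") for x in range(N) for y in range(N)}

def neighbors(x, y):
    """Moore neighborhood on a toroidal lattice."""
    return [((x + dx) % N, (y + dy) % N)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]

def step(grid, p_disint=0.01):
    for pos, atom in list(grid.items()):
        # (1) Composition: K + 2S -> K + L (two substrates near a catalyst fuse)
        if atom == "K":
            subs = [n for n in neighbors(*pos) if grid[n] == "S"]
            if len(subs) >= 2:
                grid[subs[0]] = "L"
                grid[subs[1]] = "."
        # (3) Disintegration: L -> 2S (the second S needs a free site)
        elif atom == "L" and rng.random() < p_disint:
            free = [n for n in neighbors(*pos) if grid[n] == "."]
            if free:
                grid[pos] = "S"
                grid[free[0]] = "S"

def s_equivalents(grid):
    """Both rules conserve #S + 2*#L."""
    vals = list(grid.values())
    return vals.count("S") + 2 * vals.count("L")

before = s_equivalents(grid)
for _ in range(20):
    step(grid)
print(s_equivalents(grid) == before)   # True: substrate mass is conserved
```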
Cellular Automata

Cellular automata (see Mathematical Basis of Cellular Automata and Cellular Automata, Introduction to) can be used as a medium to simulate chemical-like processes in various ways. An obvious approach is to use the cellular automaton to model space, where each cell can hold a molecule (as in Avida) or an atom (as in lattice molecular automata). However, there are approaches where it is not clear at the outset what a molecule or reaction is. The model specification does not contain any notion of a molecule or reaction, so an observer has to identify them. Molecules can be equated
Artificial Chemistry, Figure 4 Example of mechanical artificial chemistries. a Self-assembling magnetic tiles by Hosokawa et al. [38]. b Rotating magnetic discs by Grzybowski et al. [33]. Figures taken (and modified) from [33,38]
with self-propagating patterns like gliders, self-reproducing loops [54,80], or with the moving boundary between two homogeneous domains [37]. Note that in the latter case, particles become visible as space-time structures, which are defined by boundaries and not by connected sets of cells with specific states. An interaction between these boundaries is then interpreted as a reaction.

Mechanical Artificial Chemistries

There are also physical systems that can be regarded as artificial chemistries. Hosokawa et al. [38] presented a mechanical self-assembly system consisting of triangular shapes that form bonds via permanent magnets. Interpreting attachment and detachment as chemical reactions, Hosokawa et al. [38] derived a chemical differential equation (Subsect. "The Chemical Differential Equation") modeling the kinetics of the structures appearing in the system. Another approach uses millimeter-sized magnetic disks at a liquid-air interface, subject to a magnetic field produced by a rotating permanent magnet [33]. These magnetic disks exhibit various types of structures, which might even interact in a way resembling chemical reactions. The important difference from the first approach is that the rotating magnetic disks form dissipative structures, which require a continuous energy supply.
Semi-Realistic Molecular Structures and Graph Rewriting

Recent developments in artificial chemistry aim at more realistic molecular structures and reactions, like the oo chemistry by Bersini [10] (see also [57]) or toy chemistry [9]. These approaches apply graph rewriting, which is a powerful and quite universal approach for defining transformations of graph-like structures [50]. In toy chemistry, molecules are represented as labeled graphs, i.e., by their structural formulas; their basic properties are derived by a simplified version of the extended Hückel MO theory that operates directly on the graphs; chemical reaction mechanisms are implemented as graph-rewriting rules acting on the structural formulas; and reactivities and selectivities are modeled by a variant of frontier molecular orbital theory based on the extended Hückel scheme. Figure 5 shows an example of a graph-rewriting rule for unimolecular reactions.

Space

Many approaches discussed above assume that the reaction vessel is well-stirred. However, especially in living systems, space plays an important role. Cells, for example, are not a bag of enzymes but are spatially highly structured.
Artificial Chemistry, Figure 5 An example of a graph-rewriting rule (top) and its application to the synthesis of a bridge ring system (bottom). Figure from [9]
Moreover, it is assumed that space played an important role in the origin of life.

Techniques for Modeling Space

In chemistry, space is usually modeled by assuming a Euclidean space (partial differential equations or particle-based models) or by compartments, which we also find in many artificial chemistries [29,70]. However, some artificial chemistries use more abstract, "computer friendly" spaces, such as a core memory as in Tierra [74] or a planar triangular graph [83]. There are systems like MGS [31] that allow one to specify various topological structures easily. Approaches like P systems and membrane computing [70] even allow one to change the spatial structure dynamically (see Membrane Computing).

Phenomena in Space

In general, space delays the extinction of species by fitter species, which is, for example, used in Avida to obtain more complex evolutionary patterns [1]. Space usually leads to higher diversity. Moreover, systems that are unstable in a well-stirred reaction vessel can become stable in a spatial setting. An example is the hypercycle [20], which stabilizes against parasites when space is introduced [11]. Conversely, space can destabilize an equilibrium, leading to symmetry breaking, which has been suggested as an important mechanism underlying morphogenesis [91]. Space can support the co-existence of chemical species that would not co-exist in a well-stirred system [49]. Space is also necessary for the formation of autopoietic structures [93] and the formation of units that undergo Darwinian evolution [24].
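A minimal sketch of the compartment technique mentioned above: space is discretized into cells that exchange material with their neighbors via an explicit Euler diffusion step. The cell count and diffusion constant are illustrative; a real model would interleave reaction steps with the diffusion.

```python
# Minimal compartment-based diffusion sketch (illustrative parameters).
# Each step, every cell on a ring exchanges a fraction D of the
# concentration difference with its two neighbors (explicit Euler scheme).
def diffuse(conc, D=0.2):
    n = len(conc)
    return [conc[i] + D * (conc[(i - 1) % n] + conc[(i + 1) % n] - 2 * conc[i])
            for i in range(n)]

conc = [0.0] * 10
conc[0] = 1.0                      # all material starts in one compartment
for _ in range(50):
    conc = diffuse(conc)
print(round(sum(conc), 6))         # 1.0: diffusion conserves total mass
```

Note that D must stay below 0.5 for this explicit scheme to remain stable.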
Theory

There is a large body of theory from domains like chemical reaction system theory [21], Petri net theory [71], and rewriting system theory (e.g., P systems [70]) that also applies to artificial chemistries. In this section, however, we shall investigate in more detail those theoretical concepts that have emerged in artificial chemistry research. When working with complex artificial chemistries, we are usually more interested in the qualitative than in the quantitative nature of the dynamics. That is, we study how the set of molecular species present in the reaction vessel changes over time, rather than studying a more detailed trajectory in concentration space. The most prominent qualitative concept is the autocatalytic set, which has been proposed as an important element in the origin of life [19,46,75]. An autocatalytic set is a set of molecular species where each species is produced by at least one catalytic reaction within the set [41,47] (Fig. 1). This property has also been called self-maintaining [27] or (catalytic) closure [48]. Formally: a set of species A ⊆ M is called an autocatalytic set (or sometimes catalytically closed set) if for every species¹ a ∈ A there is a catalyst a′ ∈ A and a reaction ρ ∈ R producing a, such that a′ is a catalyst in ρ (i.e., a′ ∈ LHS(ρ) and a′ ∈ RHS(ρ)) and ρ can take place in A (i.e., LHS(ρ) ⊆ A).

Example 1 (autocatalytic sets): R = {a → 2a, a → a + b, a → ∅, b → ∅}. Molecule a catalyzes its own production and the production of b. Both molecules are subject to a dilution flow (or, equivalently, spontaneous decay), which is usually assumed in ACs studying autocatalytic sets. In this example there are three autocatalytic sets: two non-trivial autocatalytic sets {a} and {a, b}, and the empty set {}, which from a mathematical point of view is also an autocatalytic set. The term "autocatalytic set" makes sense only in ACs where catalysis is possible; it is not useful when applied to arbitrary reaction networks.
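For a network as small as Example 1, the definition can be checked mechanically by enumerating all subsets of species. A brute-force sketch, practical only for tiny networks:

```python
from collections import Counter
from itertools import chain, combinations

# Example 1 network: R = {a -> 2a, a -> a+b, a -> 0, b -> 0}
reactions = [
    (Counter("a"), Counter("aa")),   # a -> 2a
    (Counter("a"), Counter("ab")),   # a -> a + b
    (Counter("a"), Counter()),       # a -> (dilution)
    (Counter("b"), Counter()),       # b -> (dilution)
]
species = {"a", "b"}

def is_autocatalytic(A):
    """Every a in A must be produced by some reaction rho that can take
    place in A and is catalyzed by some a' in A (a' on both sides)."""
    for a in A:
        ok = any(
            rhs[a] > lhs[a]                              # rho produces a
            and set(lhs) <= A                            # rho runs in A
            and any(c in lhs and c in rhs for c in A)    # catalyst from A
            for lhs, rhs in reactions)
        if not ok:
            return False
    return True

subsets = chain.from_iterable(combinations(sorted(species), k)
                              for k in range(len(species) + 1))
result = [set(s) for s in subsets if is_autocatalytic(set(s))]
print(result)   # the three autocatalytic sets: {}, {a}, and {a, b}
```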
For this reason Dittrich and Speroni di Fenizio [17] introduced a general notion of self-maintenance, which includes the autocatalytic set as a special case. Formally: given a reaction network ⟨M, R⟩ with m = |M| molecules and r = |R| reactions, let S = (s_{a,ρ}) be the (m × r) stoichiometric matrix implied by the reaction rules R, where s_{a,ρ} denotes the number

¹ For general reaction systems, this definition has to be refined. When A contains species that are part of the inflow, like a and b in Fig. 1, but which are not produced in a catalytic way, we might want them to be part of an autocatalytic set. Assume, for example, the set A = {a, b, aa, ba} from Fig. 1, where aa catalyzes the production of aa and ba, while using up "substrate" a and b, which are not catalytically produced.
of molecules of type a produced in reaction ρ. A set of molecules C ⊆ M is called self-maintaining if there exists a flux vector v ∈ R^r such that the following three conditions hold: (1) for all reactions ρ that can take place in C (i.e., LHS(ρ) ⊆ C), the flux v_ρ > 0; (2) for all remaining reactions ρ (i.e., LHS(ρ) ⊄ C), the flux v_ρ = 0; and (3) for all molecules a ∈ C, the production rate (Sv)_a ≥ 0. Here v_ρ denotes the element of v describing the flux (i.e., rate) of reaction ρ, and (Sv)_a is the production rate of molecule a given flux vector v. In Example 1 there are three self-maintaining (autocatalytic) sets. Interestingly, there are self-maintaining sets that cannot make up a stationary state. In Example 1 only two of the three self-maintaining sets are species combinations that can make up a stationary state (i.e., a state x′ for which 0 = Sv(x′) holds). The self-maintaining set {a} cannot make up a stationary state because a generates b through the reaction a → a + b. Thus, there is no stationary state (of Eq. (3) with Assumptions 1 and 2) in which the concentration of a is positive while the concentration of b is zero. In order to filter out these less interesting self-maintaining sets, Fontana and Buss [27] introduced a concept taken from mathematics called closure. Formally: a set of species A ⊆ M is closed if for all reactions ρ with LHS(ρ) ⊆ A (the reactions that can take place in A), the products are also contained in A, i.e., RHS(ρ) ⊆ A. Closure and self-maintenance lead to the important concept of a chemical organization [17,27]: given an arbitrary reaction network ⟨M, R⟩, a set of molecular species that is closed and self-maintaining is called an organization. The importance of an organization is illustrated by a theorem roughly saying that, given a fixed point of the chemical ODE (Eq. (3)), the species with positive concentrations form an organization [17].
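The closure and self-maintenance conditions can likewise be tested mechanically on Example 1. The sketch below replaces the proper linear-programming test for a flux vector with a coarse grid search over a few flux values, which suffices for this tiny network:

```python
from collections import Counter
from itertools import chain, combinations, product

# Example 1 network: R = {a -> 2a, a -> a+b, a -> 0, b -> 0}
reactions = [
    (Counter("a"), Counter("aa")),   # a -> 2a
    (Counter("a"), Counter("ab")),   # a -> a + b
    (Counter("a"), Counter()),       # a -> (dilution)
    (Counter("b"), Counter()),       # b -> (dilution)
]
species = ["a", "b"]

def closed(C):
    """All products of reactions that can run in C stay inside C."""
    return all(set(rhs) <= C for lhs, rhs in reactions if set(lhs) <= C)

def self_maintaining(C):
    """Crude grid search for a flux vector v: v_rho > 0 on reactions
    applicable in C, v_rho = 0 elsewhere, and (Sv)_a >= 0 for all a in C.
    A linear program would be the proper tool; a grid suffices here."""
    applicable = [(lhs, rhs) for lhs, rhs in reactions if set(lhs) <= C]
    if not applicable:
        return True                  # the empty flux vector works
    for v in product((0.25, 0.5, 1.0, 2.0, 4.0), repeat=len(applicable)):
        prod_rate = {a: sum(f * (rhs[a] - lhs[a])
                            for f, (lhs, rhs) in zip(v, applicable))
                     for a in C}
        if all(rate >= 0 for rate in prod_rate.values()):
            return True
    return False

subsets = [set(s) for s in chain.from_iterable(
    combinations(species, k) for k in range(len(species) + 1))]
orgs = [C for C in subsets if closed(C) and self_maintaining(C)]
print(orgs)   # the two organizations of Example 1: {} and {a, b}
```

Here {a} is self-maintaining but not closed (it produces b), and {b} is closed but not self-maintaining; only {} and {a, b} satisfy both conditions.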
In other words, we only have to check those species combinations that are organizations for stationary states. The set of all organizations can be visualized nicely in a Hasse diagram, which sketches the hierarchical (organizational) structure of the reaction network (Fig. 6). The dynamics of the artificial chemistry can then be explained within this Hasse diagram as a movement from organization to organization [61,84]. Note that in systems under flow condition, the (finite) set of all organizations forms an algebraic lattice [17]. That is, given two organizations, there is always a unique organization union and organization intersection. Although the concept of autopoiesis [93] has been described informally in quite some detail, a stringent formal definition is lacking. In a way, we can interpret the formal concepts above as an attempt to approach necessary properties of an autopoietic system step by step by precise formal means.

Artificial Chemistry, Figure 6
Lattice of organizations of the autocatalytic network shown in Fig. 1. An organization is a closed and self-maintaining set of species. Two organizations are connected by a line if one is contained in the other and there is no organization in between. The vertical position of an organization is determined by the number of species it contains

Obviously, being (contained in) at least one organization is a necessary condition for a chemical autopoietic system. But it is not sufficient. Missing are a notion of robustness and a spatial concept that formalizes a system's ability to maintain (and self-create) its own identity, e.g., through maintaining a membrane (Fig. 3).

Evolution

With artificial chemistries we can study chemical evolution. Chemical evolution (also called "pre-biotic evolution") describes the first step in the development of life, such as the formation of complex organic molecules from simpler (in)organic compounds [62]. Because there are no pre-biotic fossils, the study of chemical evolution has to rely on experiments [56,66] and theoretical (simulation) models [19]. Artificial chemistries aim at capturing the
constructive nature of these chemical systems and try to reproduce their evolution in computer simulations. The approaches can be distinguished by whether the evolution is driven by external operators like mutation, or whether variation and selection appear intrinsically through the chemical dynamics itself.

Extrinsic Evolution

In extrinsic approaches, an external variation operator changes the reaction network by adding, removing, or manipulating a reaction, which may also lead to the addition or removal of chemical species. In this approach, a molecule does not need to possess a structure. The following example by Jain and Krishna [41] shows that this approach allows one to create an evolving system quite elegantly. Assume that the reaction network consists of m species M = {1, ..., m}. There are first-order catalytic reaction rules of the form (i → i + j) ∈ R and a general dilution flow (a → ∅) ∈ R for all species a ∈ M. The reaction network is completely described by a directed graph represented by the adjacency matrix C = (c_{i,j}), i, j ∈ M, c_{i,j} ∈ {0, 1}, with c_{i,j} = 1 if molecule j catalyzes the production of i, and c_{i,j} = 0 otherwise. In order to prevent self-replicators from dominating the system, Jain and Krishna assume c_{i,i} = 0 for all molecules i ∈ M. At the beginning, the reaction network is randomly initialized; that is, for i ≠ j, c_{i,j} = 1 with probability p and c_{i,j} = 0 otherwise. In order to simulate the dynamics, we assume a population of molecules represented by the concentration vector x = (x_1, ..., x_m), where x_i represents the current concentration of species i. The whole system is simulated in the following way:

Step 1: Simulate the chemical differential equation

ẋ_i = Σ_{j∈M} c_{i,j} x_j - x_i Σ_{k,j∈M} c_{k,j} x_j    (18)

until a steady state is reached. Note that this steady state is usually independent of the initial concentrations, cf. [85].
Step 2: Select the "mutating" species i, which is the species with the smallest concentration in the steady state.
Step 3: Update the reaction network by replacing the mutating species with a new species, which is created randomly in the same way as the initial species. That is, the entries of the ith row and ith column of the adjacency matrix C are replaced by randomly chosen entries with the same probability p as during initialization.
Step 4: Go to Step 1.

Note that there are two explicitly simulated time scales: a slow, evolutionary time scale at which the reaction network evolves (Steps 2 and 3), and a fast time scale at which the molecules catalytically react (Step 1). In this model we can observe how an autocatalytic set inevitably appears after a period of disorder. After its arrival, the largest autocatalytic set rapidly increases its connectivity until it spans the whole network. Subsequently, the connectivity converges to a steady state determined by the rate of external perturbations. The resulting highly non-random network is not fully stable, so that we can also study the causes of crashes and recoveries. For example, Jain and Krishna [44] identified that, in the absence of large external perturbations, the appearance of a new viable species is a major cause of large extinctions and recoveries. Furthermore, crashes can be caused by the extinction of a "keystone species". Note that for these observations, the new species created in Step 3 do not need to inherit any information from other species.

Artificial Chemistry, Figure 7
Illustration of syntactically and semantically closed organizations. Each organization consists of an infinite number of molecular species (connected by a dotted line). A circle represents a molecule having the structure λx1.λx2. ... [27]

Intrinsic Evolution

Evolutionary phenomena can also be caused by the intrinsic dynamics of the (artificial) chemistry. In this case, external operators like those mutating the reaction network are not required. As opposed to the previous approach, we can define the whole reaction network at the outset and let only the composition of molecules present in the reaction vessel evolve. The reaction rules are (usually) defined by a structure-to-function mapping similar to those described in Sect. "Structure-to-Function Mapping". The advantage of this approach is that we need not define an external operator changing the reaction network's topology. Furthermore, the evolution can be more realistic because, for example, molecular species that have vanished can re-enter at a later time, which does not happen in an approach like the one described previously. Also, when we have a structure-to-function mapping, we can study how the structure of the molecules is related to the emergence of chemical organizations. There are various approaches using, for example, Turing machines [63], lambda-calculus [26,27], abstract automata [16], or combinators [83] for the structure-to-function mapping. In all of these approaches we can observe an important phenomenon: while the AC evolves, autocatalytic sets or, more generally, chemical organizations become visible. As in the system by Jain and Krishna, this effect is also indicated by an increase of the connectivity of the species within the population. When an organization becomes visible, the system has focused on a sub-space of the whole set of possible molecules. The molecules belonging to the emerged organization usually possess relatively high concentrations, because they are generated by many reactions. The closure of this set is usually smaller than the universe M. Depending on the setting, the emerged organization can consist of a single self-replicator, a small set of mutually producing molecules, or a large set of different molecules. Note that in the latter case the organization can be so large (even infinite) that not all of its molecules are present in the population (see Fig. 7 for an example). Nevertheless, the population can carry the organization if the system can keep a generating set of molecules within the population. An emerged organization can be quite stable, but it can also change, either by external perturbations like randomly inserted molecules, or by internal fluctuations caused by reactions among molecules that are not part of the emerged organization but have remained in small quantities in the population (cf. [61] for an example). Dittrich and Banzhaf [16] have shown that it is even possible to obtain evolutionary behavior without any externally generated variation, that is, without any inflow of random molecules and without any explicitly defined fitness function or selection process. Selection emerges as a result of the dilution flow and the limited population size.
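The basic population algorithm of such intrinsic systems (draw two random molecules, insert their reaction product, and displace a random molecule to realize the non-selective dilution flow at constant population size) can be sketched as follows. The bitwise reaction rule is an illustrative stand-in, not the actual string-to-finite-state-machine mapping of [16]:

```python
import random

rng = random.Random(0)
BITS = 8

def react(a: int, b: int) -> int:
    """Illustrative stand-in reaction: derive a product from two colliding
    molecules. (In [16], one string is mapped to a finite state machine
    operating on the other; here we just use a fixed bitwise rule.)"""
    return ((a ^ (b >> 1)) + b) % (1 << BITS)

# Population algorithm with constant size: inserting each reaction product
# displaces a randomly chosen molecule (non-selective dilution flow).
pop = [rng.randrange(1 << BITS) for _ in range(200)]
for _ in range(5000):
    i, j = rng.sample(range(len(pop)), 2)   # two distinct random educts
    pop[rng.randrange(len(pop))] = react(pop[i], pop[j])

print(len(pop))   # population size stays constant at 200
```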
Artificial Chemistry, Figure 8
Example of a self-evolving artificial chemistry. The figure shows the concentrations of some selected species in a reaction vessel containing approximately 10^4 different species. In this example, molecules are binary strings of length 32 bits. Two binary strings react by mapping one of them to a finite state machine operating on the second binary string. There is only a general, non-selective dilution flow. No other operators like mutation, variation, or selection are applied. Note that the structure of a new species tends to be similar to the structure of the species it has been created from. The dynamics are simulated by the algorithm described in the introduction. Population size k = 10^6 molecules. Figure from [16]
Syntactic and Semantic Closure

In many "constructive" implicitly defined artificial chemistries we can observe the appearance of species that share syntactical and functional similarities which are invariant under the reaction operation. Fontana and Buss [27] called this phenomenon syntactic and semantic closure. Syntactic closure refers to the observation that the molecules within an emerged organization O ⊆ M are structurally similar; that is, they share certain structural features. This allows one to describe the set of molecules O compactly by a formal language or grammar. If we instead picked a subset A of M at random, the expected length of A's description would be on the order of its size. Assume, for example, that we pick one million strings of length 100 at random from the set of all strings of length 100; then we would need about 100 million characters to describe the resulting set. Interestingly, the organizations that appear in implicitly defined ACs can be described much more compactly by a grammar. Syntactic and semantic closure can be illustrated with an example taken from [27], where O ⊆ M is even infinite in size. In particular experiments with the lambda-chemistry, molecules appeared that possess the following structure:

A_{i,j} ≡ λx1.λx2. ... λxi.xj  with  j ≤ i,    (19)

for example λx1.λx2.λx3.x2. Syntactical closure means that we can specify such structural rules characterizing the molecules of O and that these structural rules are invariant under the reaction mechanism: if molecules with the structure given by Eq. (19) react, their product can also be described by Eq. (19). Semantic closure means that we can describe the reactions taking place within O by referring to the molecules' grammatical structure (e.g., Eq. (19)). In our example, all reactions within O can be described by the following laws (illustrated by Fig. 7):
∀ i, j > 1 and all k, l:  A_{i,j} + A_{k,l} ⇒ A_{i-1,j-1};
∀ i > 1, j = 1 and all k, l:  A_{i,j} + A_{k,l} ⇒ A_{k+i-1,l+i-1}.    (20)

For example, λx1.λx2.λx3.λx4.x3 + λx1.λx2.x2 ⇒ λx1.λx2.λx3.x2 and λx1.λx2.x1 + λx1.λx2.λx3.λx4.x3 ⇒ λx1.λx2.λx3.λx4.λx5.x4, respectively (Fig. 7). As a result of the semantic closure we need not refer to the underlying reaction mechanism (e.g., the lambda-calculus). Instead, we can explain the reactions within O on a more abstract level by referring to the grammatical structure (e.g., Eq. (19)). Note that syntactic and semantic closure also appear in real chemical systems and are exploited by chemistry to organize chemical explanations. It might even be stated that without this phenomenon, chemistry as we know it would not be possible.

Information Processing

The chemical metaphor gives rise to a variety of computing concepts, which are explained in detail in further articles of this encyclopedia. In approaches like amorphous computing (see Amorphous Computing), chemical mechanisms are used as a sub-system for communication among a huge number of simple, spatially distributed processing units. In membrane computing (see Membrane Computing), rewriting operations are used not only to react molecules, but also to model active transport between compartments and to change the spatial compartment structure itself. Besides these theoretical and in-silico artificial chemistries, there are approaches that aim at using real molecules: spatial patterns like waves and solitons in excitable media can be exploited to compute (see Reaction-Diffusion Computing and Computing in Geometrical Constrained Excitable Chemical Systems), whereas other approaches like conformational computing and DNA computing (see DNA Computing) do not rely on spatially structured reaction vessels. Other closely related topics are molecular automata (see Molecular Automata) and bacterial computing (see Bacterial Computing).

Future Directions

Currently, we can observe a convergence of artificial chemistries and chemical models towards more realistic artificial chemistries [9]. On the one hand, models of systems biology are inspired by techniques from artificial chemistries, e.g., in the domain of rule-based modeling [22,34]. On the other hand, artificial chemistries become more realistic, for example, by adopting more realistic molecular structures or by using techniques from computational chemistry to calculate energies [9,57]. Furthermore, the relation between artificial chemistries and real biological systems is made more and more explicit [25,45,58]. In the future, novel theories and techniques have to be developed to handle complex, implicitly defined reaction systems. This development will especially be driven by the needs of systems biology, as implicitly defined (i.e., rule-based) models become more frequent. Another open challenge lies in the creation of realistic (chemical) open-ended evolutionary systems. For example, an artificial chemistry with "true" open-ended evolution, or the ability to show a satisfyingly long transient phase comparable to the natural process in which novelties continuously appear, has not been presented yet. The reason for this might be lacking computing power, insufficient man-power to implement what we know, or missing knowledge concerning the fundamental mechanisms of (chemical) evolution. Artificial chemistries could provide a powerful platform to study whether the mechanisms that we believe explain (chemical) evolution are indeed candidates. As sketched above, there is a broad range of currently explored practical application domains of artificial chemistries in technical artifacts, such as ambient computing, amorphous computing, organic computing, or smart materials. Eventually, artificial chemistry might come back to the realm of real chemistry and inspire the design of novel computing reaction systems in fields like molecular computing, molecular communication, bacterial computing, or synthetic biology.
Bibliography

Primary Literature
1. Adami C, Brown CT (1994) Evolutionary learning in the 2D artificial life system Avida. In: Brooks RA, Maes P (eds) Proc artificial life IV. MIT Press, Cambridge, pp 377–381. ISBN 0-262-52190-3
2. Adleman LM (1994) Molecular computation of solutions to combinatorial problems. Science 266:1021
3. Bagley RJ, Farmer JD (1992) Spontaneous emergence of a metabolism. In: Langton CG, Taylor C, Farmer JD, Rasmussen S (eds) Artificial life II. Addison-Wesley, Redwood City, pp 93–140. ISBN 0-201-52570-4
4. Banâtre J-P, Métayer DL (1986) A new computational model and its discipline of programming. Technical Report RR-0566. INRIA, Rennes
5. Banzhaf W (1993) Self-replicating sequences of binary numbers – foundations I and II: General and strings of length n = 4. Biol Cybern 69:269–281
6. Banzhaf W (1994) Self-replicating sequences of binary numbers: The build-up of complexity. Complex Syst 8:215–225
7. Banzhaf W (1995) Self-organizing algorithms derived from RNA interactions. In: Banzhaf W, Eeckman FH (eds) Evolution and Biocomputing. LNCS, vol 899. Springer, Berlin, pp 69–103
8. Banzhaf W, Dittrich P, Rauhe H (1996) Emergent computation by catalytic reactions. Nanotechnology 7(1):307–314
9. Benkö G, Flamm C, Stadler PF (2003) A graph-based toy model of chemistry. J Chem Inf Comput Sci 43(4):1085–1093. doi:10.1021/ci0200570
10. Bersini H (2000) Reaction mechanisms in the oo chemistry. In: Bedau MA, McCaskill JS, Packard NH, Rasmussen S (eds) Artificial life VII. MIT Press, Cambridge, pp 39–48
11. Boerlijst MC, Hogeweg P (1991) Spiral wave structure in prebiotic evolution: Hypercycles stable against parasites. Physica D 48(1):17–28
12. Breyer J, Ackermann J, McCaskill J (1999) Evolving reaction-diffusion ecosystems with self-assembling structure in thin films. Artif Life 4(1):25–40
13. Conrad M (1992) Molecular computing: The key-lock paradigm. Computer 25:11–22
14. Dewdney AK (1984) In the game called core war hostile programs engage in a battle of bits. Sci Amer 250:14–22
15. Dittrich P (2001) On artificial chemistries. PhD thesis, University of Dortmund
16. Dittrich P, Banzhaf W (1998) Self-evolution in a constructive binary string system. Artif Life 4(2):203–220
17. Dittrich P, Speroni di Fenizio P (2007) Chemical organization theory. Bull Math Biol 69(4):1199–1231. doi:10.1007/s11538-006-9130-8
18. Ehricht R, Ellinger T, McCaskill JS (1997) Cooperative amplification of templates by cross-hybridization (CATCH). Eur J Biochem 243(1/2):358–364
19. Eigen M (1971) Selforganization of matter and the evolution of biological macromolecules. Naturwissenschaften 58(10):465–523
20. Eigen M, Schuster P (1977) The hypercycle: A principle of natural self-organisation, part A. Naturwissenschaften 64(11):541–565
21. Érdi P, Tóth J (1989) Mathematical models of chemical reactions: Theory and applications of deterministic and stochastic models. Princeton University Press, Princeton
22. Faeder JR, Blinov ML, Goldstein B, Hlavacek WS (2005) Rule-based modeling of biochemical networks. Complexity. doi:10.1002/cplx.20074
23. Farmer JD, Kauffman SA, Packard NH (1986) Autocatalytic replication of polymers. Physica D 22:50–67
24. Fernando C, Rowe J (2007) Natural selection in chemical evolution. J Theor Biol 247(1):152–167. doi:10.1016/j.jtbi.2007.01.028
25. Fernando C, von Kiedrowski G, Szathmáry E (2007) A stochastic model of nonenzymatic nucleic acid replication: Elongators sequester replicators. J Mol Evol 64(5):572–585. doi:10.1007/s00239-006-0218-4
26. Fontana W (1992) Algorithmic chemistry. In: Langton CG, Taylor C, Farmer JD, Rasmussen S (eds) Artificial life II.
Addison-Wesley, Redwood City, pp 159–210
27. Fontana W, Buss LW (1994) 'The arrival of the fittest': Toward a theory of biological organization. Bull Math Biol 56:1–64
28. Fontana W, Buss LW (1996) The barrier of objects: From dynamical systems to bounded organization. In: Casti J, Karlqvist A (eds) Boundaries and barriers. Addison-Wesley, Redwood City, pp 56–116
29. Furusawa C, Kaneko K (1998) Emergence of multicellular organisms with dynamic differentiation and spatial pattern. Artif Life 4:79–93
30. Gánti T (1975) Organization of chemical reactions into dividing and metabolizing units: The chemotons. Biosystems 7(1):15–21
31. Giavitto J-L, Michel O (2001) MGS: A rule-based programming language for complex objects and collections. Electron Note Theor Comput Sci 59(4):286–304
32. Gillespie DT (1976) General method for numerically simulating stochastic time evolution of coupled chemical reactions. J Comput Phys 22(4):403–434
33. Grzybowski BA, Stone HA, Whitesides GM (2000) Dynamic self-assembly of magnetized, millimetre-sized objects rotating at a liquid-air interface. Nature 405(6790):1033–1036
34. Hlavacek W, Faeder J, Blinov M, Posner R, Hucka M, Fontana W (2006) Rules for modeling signal-transduction systems. Sci STKE 2006:re6
35. Hofbauer J, Sigmund K (1988) Dynamical systems and the theory of evolution. University Press, Cambridge
36. Hofstadter DR (1979) Gödel, Escher, Bach: An eternal golden braid. Basic Books Inc, New York. ISBN 0-465-02685-0
37. Hordijk W, Crutchfield JP, Mitchell M (1996) Embedded-particle computation in evolved cellular automata. In: Toffoli T, Biafore M, Leão J (eds) PhysComp96. New England Complex Systems Institute, Cambridge, pp 153–158
38. Hosokawa K, Shimoyama I, Miura H (1994) Dynamics of self-assembling systems: Analogy with chemical kinetics. Artif Life 1(4):413–427
39. Hutton TJ (2002) Evolvable self-replicating molecules in an artificial chemistry. Artif Life 8(4):341–356
40. Ikegami T, Hashimoto T (1995) Active mutation in self-reproducing networks of machines and tapes. Artif Life 2(3):305–318
41. Jain S, Krishna S (1998) Autocatalytic sets and the growth of complexity in an evolutionary model. Phys Rev Lett 81(25):5684–5687
42. Jain S, Krishna S (1999) Emergence and growth of complex networks in adaptive systems. Comput Phys Commun 122:116–121
43. Jain S, Krishna S (2001) A model for the emergence of cooperation, interdependence, and structure in evolving networks. Proc Natl Acad Sci USA 98(2):543–547
44. Jain S, Krishna S (2002) Large extinctions in an evolutionary model: The role of innovation and keystone species. Proc Natl Acad Sci USA 99(4):2055–2060. doi:10.1073/pnas.032618499
45. Kaneko K (2007) Life: An introduction to complex systems biology. Springer, Berlin
46. Kauffman SA (1971) Cellular homeostasis, epigenesis and replication in randomly aggregated macromolecular systems. J Cybern 1:71–96
47. Kauffman SA (1986) Autocatalytic sets of proteins. J Theor Biol 119:1–24
48. Kauffman SA (1993) The origins of order: Self-organization and selection in evolution. Oxford University Press, New York
49. Kirner T, Ackermann J, Ehricht R, McCaskill JS (1999) Complex patterns predicted in an in vitro experimental model system for the evolution of molecular cooperation. Biophys Chem 79(3):163–186
50. Kniemeyer O, Buck-Sorlin GH, Kurth W (2004) A graph grammar approach to artificial life. Artif Life 10(4):413–431. doi:10.1162/1064546041766451
51. Laing R (1972) Artificial organisms and autonomous cell rules. J Cybern 2(1):38–49
52. Laing R (1975) Some alternative reproductive strategies in artificial molecular machines. J Theor Biol 54:63–84
53. Laing R (1977) Automaton models of reproduction by self-inspection. J Theor Biol 66:437–456
54. Langton CG (1984) Self-reproduction in cellular automata. Physica D 10(1–2):135–144
55. Langton CG (1989) Artificial life. In: Langton CG (ed) Proc of artificial life. Addison-Wesley, Redwood City, pp 1–48
56. Lazcano A, Bada JL (2003) The 1953 Stanley L. Miller experiment: Fifty years of prebiotic organic chemistry. Orig Life Evol Biosph 33(3):235–242
57. Lenaerts T, Bersini H (2009) A synthon approach to artificial chemistry. Artif Life 9 (in press) 58. Lenski RE, Ofria C, Collier TC, Adami C (1999) Genome complexity, robustness and genetic interactions in digital organisms. Nature 400(6745):661–4 59. Lohn JD, Colombano S, Scargle J, Stassinopoulos D, Haith GL (1998) Evolution of catalytic reaction sets using genetic algorithms. In: Proc IEEE International Conference on Evolutionary Computation. IEEE, New York, pp 487–492 60. Lugowski MW (1989) Computational metabolism: Towards biological geometries for computing. In: Langton CG (ed) Artificial Life. Addison-Wesley, Redwood City, pp 341–368. ISBN 0201-09346-4 61. Matsumaru N, Speroni di Fenizio P, Centler F, Dittrich P (2006) On the evolution of chemical organizations. In: Artmann S, Dittrich P (eds) Proc of the 7th german workshop of artificial life. IOS Press, Amsterdam, pp 135–146 62. Maynard Smith J, Szathmáry E (1995) The major transitions in evolution. Oxford University Press, New York 63. McCaskill JS (1988) Polymer chemistry on tape: A computational model for emergent genetics. Internal report. MPI for Biophysical Chemistry, Göttingen 64. McCaskill JS, Chorongiewski H, Mekelburg D, Tangen U, Gemm U (1994) Configurable computer hardware to simulate longtime self-organization of biopolymers. Ber Bunsenges Phys Chem 98(9):1114–1114 65. McMullin B, Varela FJ (1997) Rediscovering computational autopoiesis. In: Husbands P, Harvey I (eds) Fourth european conference on artificial life. MIT Press, Cambridge, pp 38–47 66. Miller SL (1953) A production of amino acids under possible primitive earth conditions. Science 117(3046):528–9 67. Morris HC (1989) Typogenetics: A logic for artificial life. In: Langton CG (ed) Artif life. Addison-Wesley, Redwood City, pp 341–368 68. Ono N, Ikegami T (2000) Self-maintenance and self-reproduction in an abstract cell model. J Theor Biol 206(2):243–253 69. Pargellis AN (1996) The spontaneous generation of digital “life”. 
Physica D 91(1–2):86–96 70. Pˇaun G (2000) Computing with membranes. J Comput Syst Sci 61(1):108–143 71. Petri CA (1962) Kommunikation mit Automaten. Ph D thesis, University of Bonn 72. Rasmussen S, Knudsen C, Feldberg R, Hindsholm M (1990) The coreworld: Emergence and evolution of cooperative structures in a computational chemistry. Physica D 42:111–134 73. Rasmussen S, Knudsen C, Feldberg R (1992) Dynamics of programmable matter. In: Langton CG, Taylor C, Farmer JD, Rasmussen S (eds) Artificial life II. Addison-Wesley, Redwood City, pp 211–291. ISBN 0-201-52570-4 74. Ray TS (1992) An approach to the synthesis of life. In: Langton CG, Taylor C, Farmer JD, Rasmussen S (eds) Artificial life II. Addison-Wesley, Redwood City, pp 371–408 75. Rössler OE (1971) A system theoretic model for biogenesis. Z Naturforsch B 26(8):741–746 76. Sali A, Shakhnovich E, Karplus M (1994) How does a protein fold? Nature 369(6477):248–251 77. Sali A, Shakhnovich E, Karplus M (1994) Kinetics of protein folding: A lattice model study of the requirements for folding to the native state. J Mol Biol 235(5):1614–1636 78. Salzberg C (2007) A graph-based reflexive artificial chemistry. Biosystems 87(1):1–12
Artificial Chemistry
79. Sayama H (2009) Swarm chemistry. Artif Life. (in press) 80. Sayama H (1998) Introduction of structural dissolution into Langton’s self-reproducing loop. In: Adami C, Belew R, Kitano H, Taylor C (eds) Artificial life VI. MIT Press, Cambridge, pp 114–122 81. Segre D, Ben-Eli D, Lancet D (2000) Compositional genomes: Prebiotic information transfer in mutually catalytic noncovalent assemblies. Proc Natl Acad Sci USA 97(8):4112–4117 82. Socci ND, Onuchic JN (1995) Folding kinetics of proteinlike heteropolymers. J Chem Phys 101(2):1519–1528 83. Speroni di Fenizio P (2000) A less abstract artficial chemistry. In: Bedau MA, McCaskill JS, Packard NH, Rasmussen S (eds) Artificial life VII. MIT Press, Cambridge, pp 49–53 84. Speroni Di Fenizio P, Dittrich P (2002) Artificial chemistry’s global dynamics. Movement in the lattice of organisation. J Three Dimens Images 16(4):160–163. ISSN 13422189 85. Stadler PF, Fontana W, Miller JH (1993) Random catalytic reaction networks. Physica D 63:378–392 86. Suzuki H (2007) Mathematical folding of node chains in a molecular network. Biosystems 87(2–3):125–135. doi:10. 1016/j.biosystems.2006.09.005 87. Suzuki K, Ikegami T (2006) Spatial-pattern-induced evolution of a self-replicating loop network. Artif Life 12(4):461–485. doi: 10.1162/artl.2006.12.4.461 88. Suzuki Y, Tanaka H (1997) Symbolic chemical system based on abstract rewriting and its behavior pattern. Artif Life Robotics 1:211–219 89. Tangen U, Schulte L, McCaskill JS (1997) A parallel hardware evolvable computer polyp. In: Pocek KL, Arnold J (eds) IEEE symposium on FPGAs for custopm computing machines. IEEE Computer Society, Los Alamitos
90. Thürk M (1993) Ein Modell zur Selbstorganisation von Automatenalgorithmen zum Studium molekularer Evolution. Ph D thesis, Universität Jena 91. Turing AM (1952) The chemical basis of morphogenesis. Phil Trans R Soc London B 237:37–72 92. Vanderzande C (1998) Lattice models of polymers. Cambridge University Press, Cambridge 93. Varela FJ, Maturana HR, Uribe R (1974) Autopoiesis: The organization of living systems. BioSystems 5(4):187–196 94. Varetto L (1993) Typogenetics: An artificial genetic system. J Theor Biol 160(2):185–205 95. Varetto L (1998) Studying artificial life with a molecular automaton. J Theor Biol 193(2):257–285 96. Vico G (1710) De antiquissima Italorum sapientia ex linguae originibus eruenda librir tres. Neapel 97. von Neumann J, Burks A (ed) (1966) The theory of self-reproducing automata. University of Illinois Press, Urbana 98. Zauner K-P, Conrad M (1996) Simulating the interplay of structure, kinetics, and dynamics in complex biochemical networks. In: Hofestädt R, Lengauer T, Löffler M, Schomburg D (eds) Computer science and biology GCB’96. University of Leipzig, Leipzig, pp 336–338 99. Zeleny M (1977) Self-organization of living systems: A formal model of autopoiesis. Int J General Sci 4:13–28
Books and Reviews Adami C (1998) Introduction to artificial life. Springer, New York Dittrich P, Ziegler J, Banzhaf W (2001) Artificial chemistries – a review. Artif Life 7(3):225–275 Hofbauer J, Sigmund K (1998) Evolutionary games and population dynamics. Cambridge University Press, Cambridge
Artificial Intelligence in Modeling and Simulation
Bernard Zeigler1, Alexandre Muzy2, Levent Yilmaz3
1 Arizona Center for Integrative Modeling and Simulation, University of Arizona, Tucson, USA
2 CNRS, Università di Corsica, Corte, France
3 Auburn University, Alabama, USA

Article Outline
Glossary
Definition of the Subject
Introduction
Review of System Theory and Framework for Modeling and Simulation
Fundamental Problems in M&S
AI-Related Software Background
AI Methods in Fundamental Problems of M&S
Automation of M&S
SES/Model Base Architecture for an Automated Modeler/Simulationist
Intelligent Agents in Simulation
Future Directions
Bibliography

Glossary
Behavior The observable manifestation of an interaction with a system.
DEVS The Discrete Event System Specification formalism describes models developed for simulation; applications include simulation-based testing of collaborative services.
Endomorphic agents Agents that contain models of themselves and/or of other endomorphic agents.
Levels of interoperability Levels at which systems can interoperate, such as syntactic, semantic, and pragmatic. The higher the level, the more effective the information exchange among participants.
Levels of system specification Levels at which dynamic input/output systems can be described, known, or specified, ranging from behavioral to structural.
Metadata Data that describe other data; a hierarchical concept in which metadata form a descriptive abstraction above the data they describe.
Model-based automation Automation of system development and deployment that employs models or system specifications, such as DEVS, to derive artifacts.
Modeling and simulation ontology The SES is interpreted as an ontology for the domain of hierarchical, modular simulation models specified with the DEVS formalism.
Net-centric environment Network-centered, typically Internet-centered or web-centered, information exchange medium.
Ontology Language that describes a state of the world from a particular conceptual view; usually pertains to a particular application domain.
Pragmatic frame A means of characterizing the consumer's use of the information sent by a producer; formalized using the concept of a processing network model.
Pragmatics Pragmatics is based on Speech Act Theory and focuses on elucidating the intent of the semantics constrained by a given context. Metadata tags to support pragmatics include Authority, Urgency/Consequences, Relationship, Tense, and Completeness.
Predicate logic An expressive form of declarative language that can describe ontologies using symbols for individuals, operations, variables, and functions with governing axioms and constraints.
Schema An advanced form of XML document definition; extends the DTD concept.
Semantics Semantics determines the content of messages in which information is packaged. The meaning of a message is the eventual outcome of the processing that it supports.
Sensor Device that can sense or detect some aspect of the world or some change in such an aspect.
System specification Formalism for describing or specifying a system. There are levels of system specification ranging from behavior to structure.
Service-oriented architecture Web service architecture in which services are designed to be (1) accessed without knowledge of their internals through well-defined interfaces and (2) readily discoverable and composable.
Structure The internal mechanism that produces the behavior of a system.
System entity structure Ontological basis for modeling and simulation. Its pruned entity structures can describe both static data sets and dynamic simulation models.
Syntax Prescribes the form of messages in which information is packaged.
UML The Unified Modeling Language is a software development language and environment that can be used for ontology development and has tools that map UML specifications into XML.
XML The eXtensible Markup Language provides a syntax for document structures containing tagged information
Artificial Intelligence in Modeling and Simulation, Table 1
Hierarchy of system specifications

Level | Name            | What we specify at this level
------|-----------------|------------------------------
4     | Coupled Systems | System built up by several component systems that are coupled together
3     | I/O System      | System with state and state transitions to generate the behavior
2     | I/O Function    | Collection of input/output pairs constituting the allowed behavior, partitioned according to the initial state the system is in when the input is applied
1     | I/O Behavior    | Collection of input/output pairs constituting the allowed behavior of the system from an external Black Box view
0     | I/O Frame       | Input and output variables and ports together with allowed values
where tag definitions set up the basis for semantic interpretation.

Definition of the Subject
This article discusses the role of Artificial Intelligence (AI) in Modeling and Simulation (M&S). AI is the field of computer science that attempts to construct computer systems that emulate human problem-solving behavior, with the goal of understanding human intelligence. M&S is a multidisciplinary field of systems engineering, software engineering, and computer science that seeks to develop robust methodologies for constructing computerized models, with the goal of providing tools that can assist humans in all activities of the M&S enterprise. Although each of these disciplines has its core community, there have been numerous intersections and cross-fertilizations between the two fields. From the perspective of this article, we view M&S as presenting some fundamental and very difficult problems whose solution may benefit from the concepts and techniques of AI.

Introduction
To state the M&S problems that may benefit from AI, we first briefly review a system-theory-based framework for M&S that provides a language and concepts to facilitate definitive problem statement. We then introduce some key problem areas: verification and validation, reuse and composability, and distributed simulation and systems-of-systems interoperability. After some further review of software and AI-related background, we go on to outline some areas of AI that have direct applicability to these problems in M&S. To provide a unifying theme for the problems and solutions, we then raise the question of whether all of M&S can be automated into an integrated autonomous artificial modeler/simulationist. We then proceed to explore an approach to developing such an intelligent agent and present a concrete means by which such an agent could engage in M&S. We close with consideration of an advanced feature that such an agent must have if it is to fully emulate human capability: the ability, to a limited but significant extent, to construct and employ models of its own "mind" as well as of the "minds" of other agents.

Review of System Theory and Framework for Modeling and Simulation

Hierarchy of System Specifications
Systems theory [1] deals with a hierarchy of system specifications which defines levels at which a system may be known or specified. Table 1 shows this Hierarchy of System Specifications (in simplified form; see [2] for a full exposition).
At level 0 we deal with the input and output interface of a system.
At level 1 we deal with purely observational recordings of the behavior of a system. This is an I/O relation consisting of a set of pairs of input behaviors and associated output behaviors.
At level 2 we have knowledge of the initial state when the input is applied. This allows partitioning the input/output pairs of level 1 into non-overlapping subsets, each subset associated with a different starting state.
At level 3 the system is described by a state space and state transition functions. The transition function describes the state-to-state transitions caused by the inputs and the outputs generated thereupon.
At level 4 a system is specified by a set of components and a coupling structure. The components are systems in their own right, with their own state sets and state transition functions, and the coupling structure defines how they interact. A property of coupled systems called "closure under coupling" guarantees that a coupled system at level 4 itself specifies a system at level 3. This property allows hierarchical construction of systems, i.e., that
coupled systems can be used as components in larger coupled systems. As we shall see in a moment, the system specification hierarchy provides a mathematical underpinning for defining a framework for modeling and simulation. Each of the entities (e.g., real world, model, simulation, and experimental frame) will be described as a system known or specified at some level of specification. The essence of modeling and simulation lies in establishing relations between pairs of system descriptions. These relations pertain to the validity of a system description at one level of specification relative to another system description at a different (higher, lower, or equal) level of specification. On the basis of the arrangement of system levels as shown in Table 1, we distinguish between vertical and horizontal relations. A vertical relation is called an association mapping and takes a system at one level of specification and generates its counterpart at another level of specification. The downward motion in the structure-to-behavior direction formally represents the process by which the behavior of a model is generated. This is relevant in simulation and testing, when the model generates behavior that can then be compared with the desired behavior. The opposite, upward mapping relates a system description at a lower level to one at a higher level of specification. While the downward association of specifications is straightforward, the upward association is much less so, because in the upward direction information is introduced while in the downward direction information is reduced. Many structures exhibit the same behavior, and recovering a unique structure from a given behavior is not possible. The upward direction, however, is fundamental in the design process, where a structure (system at level 3) has to be found that is capable of generating the desired behavior (system at level 1).
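To make the downward association concrete, the following sketch generates the level-1 I/O behavior of a small level-3 state-transition system. This is plain Python written for illustration only; the function names (make_counter_system, generate_behavior) and the example system are our own, not part of any established M&S toolkit.

```python
# Hypothetical sketch: generating level-1 I/O behavior from a level-3
# state-transition specification (all names are illustrative).

def make_counter_system():
    """Level 3: a transition function and an output function over integer states."""
    def delta(state, x):      # state-to-state transition caused by input x
        return state + x
    def lam(state):           # output generated from the current state
        return state % 2      # parity of the running sum
    return delta, lam

def generate_behavior(delta, lam, initial_state, input_trajectory):
    """Downward association: run the state system to obtain the
    level-1 I/O relation, a list of (input, output) pairs."""
    state, pairs = initial_state, []
    for x in input_trajectory:
        state = delta(state, x)
        pairs.append((x, lam(state)))
    return pairs

delta, lam = make_counter_system()
behavior = generate_behavior(delta, lam, 0, [1, 2, 3])
# states visited: 1, 3, 6 -> outputs 1, 1, 0
print(behavior)  # [(1, 1), (2, 1), (3, 0)]
```

Note that the upward direction has no such direct recipe: many choices of delta and lam produce the same list of pairs, which is why recovering a unique structure from observed behavior is not possible in general.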
Framework for Modeling and Simulation
The Framework for M&S as described in [2] establishes the entities and relationships that are central to the M&S enterprise (see Fig. 1). The entities of the Framework are source system, model, simulator, and experimental frame; they are related by the modeling and the simulation relationships. Each entity is formally characterized as a system at an appropriate level of specification of a generic dynamic system.

Source System
The source system is the real or virtual environment that we are interested in modeling. It is viewed as a source of
Artificial Intelligence in Modeling and Simulation, Figure 1 Framework entities and relationships
observable data, in the form of time-indexed trajectories of variables. The data that have been gathered from observing or otherwise experimenting with a system are called the system behavior database. These data are viewed or acquired through experimental frames of interest to the model developer and user. As we shall see, in the case of model validation these data are the basis for comparison with data generated by a model. Thus, the data must be sufficient to enable reliable comparison and must be accepted by both the model developer and the test agency as the basis for comparison. Data sources for this purpose might be measurements taken in prior experiments, mathematical representations of the measured data, or expert knowledge of the system behavior by accepted subject matter experts.

Experimental Frame
An experimental frame is a specification of the conditions under which the system is observed or experimented with [3]. An experimental frame is the operational formulation of the objectives that motivate an M&S project. A frame is realized as a system that interacts with the system of interest to obtain the data of interest under specified conditions. An experimental frame specification consists of four major subsections:
Input stimuli Specification of the class of admissible time-dependent input stimuli. This is the class from which individual samples will be drawn and injected into the model or system under test for particular experiments.
Control Specification of the conditions under which the model or system will be initialized, continued under examination, and terminated.
Artificial Intelligence in Modeling and Simulation, Figure 2 Experimental frame and components
Metrics Specification of the data summarization functions and the measures to be employed to provide quantitative or qualitative measures of the input/output behavior of the model. Examples of such metrics are performance indices, goodness-of-fit criteria, and error accuracy bounds.
Analysis Specification of the means by which the results of data collection in the frame will be analyzed to arrive at final conclusions. The data collected in a frame consist of pairs of input/output time functions.
When an experimental frame is realized as a system that interacts with the model or system under test, the specifications become components of the driving system. For example, a generator of output time functions implements the class of input stimuli. An experimental frame is the operational formulation of the objectives that motivate a modeling and simulation project. Many experimental frames can be formulated for the same system (both source system and model), and the same experimental frame may apply to many systems. Why would we want to define many frames for the same system, or apply the same frame to many systems? For the same reason that we might have different objectives in modeling the same system, or the same objective in modeling different systems. There are two equally valid views of an experimental frame. One views a frame as a definition of the type of data elements that will go into the database. The second views a frame as a system that interacts with the system of interest to obtain the data of interest under specified conditions. In this view, the frame is characterized by its implementation as a measurement system or observer. In this implementation, a frame typically has three types of components (as shown in Fig. 2 and Fig. 3): a generator, which generates input segments to the system; an acceptor, which monitors an experiment to see that the desired experimental conditions are met; and a transducer, which observes and analyzes the system output segments.
Artificial Intelligence in Modeling and Simulation, Figure 3 Experimental frame and its components
Figure 2b illustrates a simple but ubiquitous pattern for experimental frames that measure typical job-processing performance metrics, such as round-trip time and throughput. Illustrated in the web context, a generator produces service request messages at a given rate. The time that elapses between the sending of a request and its return from a server is the round-trip time. A transducer notes the departures and arrivals of requests, allowing it to compute the average round-trip time and other related statistics, as well as the throughput and the number of unsatisfied (or lost) requests. An acceptor notes whether performance achieves the developer's objectives, for example, whether the throughput exceeds the desired level and/or whether, say, 99% of the round-trip times are below a given threshold. Objectives for modeling relate to the role of the model in systems design, management, or control. Experimental frames translate the objectives into more precise experimentation conditions for the source system or its models. We can distinguish between objectives concerning verification and validation of (a) models and (b) systems. In the case of models, experimental frames translate the objectives into more precise experimentation conditions for the source system and/or its models, and a model under test is expected to be valid for the source system in each such frame. Having stated the objectives, there is presumably a best level of resolution to answer the questions they raise: the more demanding the questions, the greater the resolution likely to be needed to answer them. Thus, the choice of appropriate levels of abstraction also hinges on the objectives and their experimental frame counterparts. In the case of objectives for verification and validation of systems, we need to be given, or be able to formulate, the requirements for the behavior of the system at the I/O behavior level.
The experimental frame then is formulated to translate these requirements into a set of possible experiments to test whether the system actually performs its required behavior. In addition, we can formulate measures of the effectiveness (MOEs) of a system in accomplishing its goals; we call such measures outcome measures. In order to compute such measures, the system must expose relevant variables, which we will call output variables, whose values can be observed during execution runs of the system.
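The generator/acceptor/transducer pattern for the round-trip-time frame can be sketched in a few lines of Python. This is an illustrative toy, not the interface of any actual frame implementation; the rates, service delay, horizon, and thresholds are invented for the example.

```python
# Illustrative sketch of the three experimental-frame components for a
# round-trip-time / throughput frame (all names and numbers are ours).

def generator(rate, n):
    """Emit n request departure times at a fixed request rate."""
    return [i / rate for i in range(n)]

def transducer(departures, arrivals, horizon):
    """Summarize the observed output: average round-trip time and throughput."""
    rtts = [a - d for d, a in zip(departures, arrivals)]
    return {
        "avg_rtt": sum(rtts) / len(rtts),
        "throughput": len(arrivals) / horizon,
    }

def acceptor(metrics, max_rtt, min_throughput):
    """Check whether the run meets the developer's objectives."""
    return metrics["avg_rtt"] <= max_rtt and metrics["throughput"] >= min_throughput

departures = generator(rate=10.0, n=5)       # requests at t = 0.0, 0.1, ...
arrivals = [d + 0.05 for d in departures]    # each served 50 ms later
metrics = transducer(departures, arrivals, horizon=1.0)
print(acceptor(metrics, max_rtt=0.1, min_throughput=4.0))  # True
```

In a real frame the generator would drive the model or system under test directly, and the acceptor would typically monitor the experiment while it runs rather than only after it completes.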
Model
A model is a system specification, such as a set of instructions, rules, equations, or constraints for generating input/output behavior. Models may be expressed in a variety of formalisms that may be understood as means for specifying subclasses of dynamic systems. The Discrete Event System Specification (DEVS) formalism delineates the subclass of discrete event systems, and it can also represent the systems specified within traditional formalisms such as differential (continuous) and difference (discrete time) equations [4]. In DEVS, as in systems theory, a model can be atomic, i.e., not further decomposed, or coupled, in which case it consists of components that are coupled or interconnected together.

Simulator
A simulator is any computation system (such as a single processor, a processor network, or, more abstractly, an algorithm) capable of executing a model to generate its behavior. The more general purpose a simulator is, the greater the extent to which it can be configured to execute a variety of model types. In order of increasing capability, simulators can be:
Dedicated to a particular model or a small class of similar models;
Capable of accepting all (practical) models from a wide class, such as an application domain (e.g., communication systems);
Restricted to models expressed in a particular modeling formalism, such as continuous differential equation models;
Capable of accepting multi-formalism models (having components from several formalism classes, such as continuous and discrete event).
A simulator can take many forms, such as a single computer or multiple computers executing on a network.

Fundamental Problems in M&S
We have now reviewed a system-theory-based framework for M&S that provides a language and concepts in which to formulate key problems in M&S. Next on our agenda is to discuss problem areas including verification and validation, reuse and composability, and distributed simulation and systems-of-systems interoperability. These are challenging, and heretofore unsolved, problems at the core of the M&S enterprise.

Validation and Verification
The basic concepts of verification and validation (V&V) have been described in different settings, levels of detail, and points of view, and are still evolving. These concepts have been studied by a variety of scientific and engineering disciplines, and various flavors of validation and
Artificial Intelligence in Modeling and Simulation, Figure 4 Basic approach to model validation
verification concepts and techniques have emerged from a modeling and simulation perspective. Within the modeling and simulation community, a variety of methodologies for V&V have been suggested in the literature [5,6,7]. A categorization of 77 verification, validation, and testing techniques, along with 15 principles, has been offered to guide the application of these techniques [8]. However, these methods – e.g., alpha testing, induction, cause-and-effect graphing, inference, predicate calculus, proof of correctness, and user interface testing – vary extensively and are only loosely related to one another. Therefore, such a categorization can serve only as an informal guideline for developing a process for V&V of models and systems. Validation and verification concepts are themselves founded on more primitive concepts, such as system specifications and homomorphism, as discussed in the framework of M&S [2]. In this framework, the entities system, experimental frame, model, and simulator take on real importance only when properly related to each other. For example, we build a model of a particular system for some objective, and only some models, not others, are suitable. Thus, it is critical to the success of a simulation modeling effort that certain relationships hold. Two of the most important are validity and simulator correctness. The basic modeling relation, validity, refers to the relation between a model, a source system, and an experimental frame. The most basic concept, replicative validity, is affirmed if, for all the experiments possible within the experimental frame, the behavior of the model and system
agree within acceptable tolerance. The term accuracy is often used in place of validity. Another term, fidelity, is often used for a combination of both validity and detail. Thus, a high-fidelity model may refer to a model that is both highly detailed and valid (in some understood experimental frame). When used this way, however, the assumption seems to be that high detail alone is needed for high fidelity, as if validity were a necessary consequence of high detail. In fact, it is possible to have a very detailed model that is nevertheless very much in error, simply because some of the highly resolved components function in a different manner than their real system counterparts. The basic approach to model validation is comparison of the behavior generated by a model and by the source system it represents within a given experimental frame. The basis for comparison serves as the reference (Fig. 4) against which the accuracy of the model is measured. The basic simulation relation, simulator correctness, is a relation between a simulator and a model. A simulator correctly simulates a model if it is guaranteed to faithfully generate the model's output behavior given its initial state and its input trajectory. In practice, as suggested above, simulators are constructed to execute not just one model but a family of possible models. For example, a network simulator provides both a simulator and a class of network models it can simulate. In such cases, we must establish that a simulator will correctly execute the particular class of models it claims to support. Conceptually, the approach to testing for such execution, illustrated in Fig. 5, is
Artificial Intelligence in Modeling and Simulation, Figure 5 Basic approach to simulator verification
Artificial Intelligence in Modeling and Simulation, Figure 6 Basic approach to system validation
to perform a number of test cases in which the same model is provided to the simulator under test and to a "gold standard simulator" that is known to correctly simulate the model. Of course, such test-case models must lie within the class supported by the simulator under test and must be presented in the form it expects to receive them. Comparison of the output behaviors, in the same manner as with model validation, is then employed to check the agreement between the two simulators.
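The test-case approach can be sketched as follows. This Python fragment is a schematic of the comparison loop only; the simulator and model representations (functions here) and all names are illustrative assumptions, not a real simulator interface.

```python
# Sketch of simulator verification against a gold standard simulator:
# run the same test models on both and record any disagreements.

def run_tests(simulator_under_test, gold_simulator, test_models, input_trajectories):
    """Compare output behaviors of the two simulators on each test case."""
    failures = []
    for model in test_models:
        for traj in input_trajectories:
            got = simulator_under_test(model, traj)
            expected = gold_simulator(model, traj)
            if got != expected:
                failures.append((model.__name__, traj))
    return failures

# A trivial test model: cumulative sum of the input trajectory.
def cumsum_model(traj):
    out, total = [], 0
    for x in traj:
        total += x
        out.append(total)
    return out

gold = lambda model, traj: model(traj)          # known-correct execution
correct = lambda model, traj: model(traj)       # agrees with the gold standard
buggy = lambda model, traj: model(traj)[::-1]   # reverses the output trajectory

print(run_tests(correct, gold, [cumsum_model], [[1, 2, 3]]))  # []
print(run_tests(buggy, gold, [cumsum_model], [[1, 2, 3]]))    # one failure recorded
```

As the text notes, such testing only samples the model class; proving correctness over the whole class requires access to the specifications themselves.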
If the specifications of both the simulator and the model are available in separate form, where each can be accessed independently, it may be possible to prove correctness mathematically. The case of system validation is illustrated in Fig. 6. Here the system is considered as a hardware and/or software implementation to be validated against requirements for its input/output behavior. The goal is to develop test models that can stimulate the implemented system with
inputs and can observe its outputs to compare them with those required by the behavior requirements. Also shown is a dotted path in which a reference model is constructed that is capable of simulation execution. Such a reference model is more difficult to develop than the test models, since it requires not only knowing in advance what output to test for, but actually generating such an output. Although a reference model is not required, it may be desirable in situations in which the extra cost of development is justified by the additional range of tests it makes possible and the consequent increase in coverage this may provide.
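A minimal sketch of this comparison step, under the assumption that requirements are given as (input, required output) pairs at the I/O behavior level with a numeric tolerance, might look as follows. The names, the toy "implemented system," and the tolerance values are all invented for illustration.

```python
# Sketch of system validation: stimulate the implemented system and
# compare observed outputs against required behavior within a tolerance.

def validate(system, required_pairs, tolerance):
    """required_pairs: (stimulus, required_output) pairs at the I/O
    behavior level. Returns the cases whose outputs miss the requirement."""
    misses = []
    for stimulus, required in required_pairs:
        observed = system(stimulus)
        if abs(observed - required) > tolerance:
            misses.append((stimulus, observed, required))
    return misses

# Toy implemented "system": doubles its input, with a small implementation error.
implemented = lambda x: 2.0 * x + 0.01

requirements = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
print(validate(implemented, requirements, tolerance=0.05))   # [] -> valid in this frame
print(validate(implemented, requirements, tolerance=0.001))  # all three cases miss
```

The tolerance here plays the role of the "acceptable tolerance" in the replicative validity relation: the same implementation passes or fails depending on the experimental frame's demands.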
Model Reuse and Composability

Model reuse and composability are two sides of the same coin—it is patently desirable to reuse models, the fruits of earlier or others' work. However, such models will typically become components in a larger composite model and must be able to interact meaningfully with the other components. While software development disciplines are successfully applying a component-based approach to build software systems, the additional system dynamics involved in simulation models has resisted straightforward reuse and composition approaches. A model is only reusable to the extent that its original dynamic systems assumptions are consistent with the constraints of the new simulation application. Consequently, without contextual information to guide selection and refactoring, a model may not be reused to advantage within a new experimental frame. Davis and Anderson [9] argue that to foster such reuse, model representation methods should distinguish, and separately specify, the model, the simulator, and the experimental frame. However, Yilmaz and Oren [10] pointed out that more contextual information is needed beyond the set of experimental frames to which a model is applicable [11], namely, a characterization of the context in which the model was constructed. These authors extended the basic model-simulator-experimental frame perspective to emphasize the role of context in reuse. They make a sharp distinction between the objective context within which a simulation model is originally defined and the intentional context in which the model is being qualified for reuse. They extend the system theoretic levels of specification discussed earlier to define certain behavioral model dependency relations needed to formalize the conceptual, realization, and experimental aspects of context. As the scope of simulation applications grows, it is increasingly the case that more than one modeling paradigm is needed to adequately express the dynamics of the different components. For systems composed of models with dynamics that are intrinsically heterogeneous, it is crucial to use multiple modeling formalisms to describe them. However, combining different model types poses a variety of challenges [9,12,13]. Sarjoughian [14] introduced an approach to multi-formalism modeling that employs an interfacing mechanism called a Knowledge Interchange Broker (KIB) to compose model components expressed in diverse formalisms. The KIB supports translation from the semantics of one formalism into that of a second to ensure coordinated and correct execution of the simulation algorithms of the distinct modeling formalisms.

Distributed Simulation and System of Systems Interoperability

The problems of model reuse and composability manifest themselves strongly in the context of distributed simulation, where the objective is to enable existing, geographically dispersed simulators to meaningfully interact, or federate, together. We briefly review experience with interoperability in the distributed simulation context and a linguistically based approach to the System of Systems (SoS) interoperability problem [15]. Sage and Cuppan [16] drew the parallel between viewing the construction of SoS as a federation of systems and the federation that is supported by the High Level Architecture (HLA), an IEEE standard fostered by the DoD to enable composition of simulations [17,18]. HLA is a network middleware layer that supports message exchanges among simulations, called federates, in a neutral format. However, experience with HLA has been disappointing and has forced an acknowledgment of the difference between technical interoperability and substantive interoperability [19]. The first only enables heterogeneous simulations to exchange data; it does not guarantee the second, the desired outcome of exchanging meaningful data, namely, that coherent interaction among federates takes place. Tolk and Muguira [20] introduced the Levels of Conceptual Interoperability Model (LCIM), which identifies seven levels of interoperability among participating systems. These levels can be viewed as a refinement of the operational interoperability type, which is one of three defined in [15]. The operational type concerns linkages between systems in their interactions with one another, the environment, and with users. The other types apply to the context in which systems are constructed and acquired: constructive, relating to linkages between organizations responsible for system construction, and programmatic, relating to linkages between program offices that manage system acquisition.
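The distinction between technical and substantive interoperability can be made concrete with a small, hypothetical illustration: two federates exchange a syntactically valid message, yet disagree on the meaning of one of its fields:

```python
# Illustration (hypothetical) of technical vs. substantive interoperability:
# two federates exchange a well-formed message, but disagree on the
# meaning of the "range" field (kilometers vs. miles).

import json

def sender_federate():
    # Sender encodes range in kilometers; 100 km intended.
    return json.dumps({"unit_id": "A1", "range": 100})

def receiver_federate(message):
    data = json.loads(message)      # technical interoperability: parse OK
    return data["range"] * 1609.34  # but receiver assumes miles -> meters

msg = sender_federate()
interpreted = receiver_federate(msg)
print(interpreted)  # 160934.0 m, though the sender meant 100000.0 m
```

The exchange succeeds at the data level while the interaction is incoherent, which is precisely the gap between the two kinds of interoperability noted above.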
AI-Related Software Background

To proceed to the discussion of the role of AI in addressing key problems in M&S, we need to provide some further software and AI-related background. We offer a brief historical account of object-orientation and agent-based systems as a springboard to discuss the upcoming concepts of object frameworks, ontologies and endomorphic agents.

Object-Orientation and Agent-Based Systems

Many of the software technology advances of the last 30 years were initiated from the field of M&S. Objects, as code modules with both structure and behavior, were first introduced in the SIMULA simulation language [21]. Objects blossomed in various directions and became institutionalized in the widely adopted programming language C++ and later in the infrastructure for the web in the form of Java [22] and its variants. The freedom from straight-line procedural programming that object-orientation championed was taken up in AI in two directions: various forms of knowledge representation, and autonomy. Rule-based systems aggregate modular if-then logic elements—the rules—that can be activated in some form of causal sequence (inference chains) by an execution engine [23]. In their passive state, rules represent static discrete pieces of inferential logic, called declarative knowledge. However, when activated, a rule influences the state of the computation and the activation of subsequent rules, giving the system a dynamic, or procedural, knowledge characteristic as well. Frame-based systems further expanded knowledge representation flexibility and inferencing capability by supporting slots and constraints on their values—the frames—as well as taxonomies based on generalization/specialization relationships [24]. Convergence with object-orientation became apparent in that frames could be identified with objects, and their taxonomic organization with classes in object-style organizations based on sub-class hierarchies.
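The activation of if-then rules in inference chains can be sketched with a minimal forward-chaining engine (the rules and facts below are invented for illustration):

```python
# A minimal forward-chaining rule engine, illustrating how rules
# (declarative knowledge) acquire a procedural character when activated
# in inference chains. Rule contents are hypothetical.

rules = [
    ({"has_wings", "lays_eggs"}, "is_bird"),
    ({"is_bird", "can_fly"}, "is_flying_bird"),
]

def forward_chain(facts, rules):
    """Repeatedly fire any rule whose conditions are satisfied."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)  # rule activation updates the state
                changed = True
    return facts

print(forward_chain({"has_wings", "lays_eggs", "can_fly"}, rules))
```

Firing the first rule adds "is_bird", which in turn enables the second rule: a simple inference chain of the kind an execution engine drives.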
On the other hand, the modular nature of objects, together with their behavior and interaction with other objects, led to the concept of agents, which embodied increased autonomy and self-determination [25]. Agents represent individual threads of computation and are typically deployable in distributed form over computer networks, where they interact with local environments and communicate/coordinate with each other. A wide variety of agent types exists, in large part determined by the variety and sophistication of their processing capacities—ranging from agents that simply gather information on packet traffic in a network to logic-based entities with elaborate reasoning capacity and authority to make decisions (employing the knowledge representations just mentioned). The step from agents to their aggregates is natural, leading to the concept of multi-agent systems or societies of agents, especially in the realm of modeling and simulation [26]. To explore the present role of AI in M&S, we project the historical background just given onto the current concepts of object frameworks, ontologies and endomorphic agents. The Unified Modeling Language (UML) has gained a strong foothold as the de facto standard for object-based software development. Starting as a diagrammatic means of software system representation, it has evolved into a formally specified language in which the fundamental properties of objects are abstracted and organized [27,28]. Ontologies are models of the world relating to specific aspects or applications; they are typically represented in frame-based languages and form the knowledge components for logical agents on the Semantic Web [29]. A convergence is underway that reinforces the common object-based origins of AI and software engineering. UML is being extended to incorporate ontology representations so that software systems in general will have more explicit models of their domains of operation. As we shall soon see, endomorphic agents refer to agents that include abstractions of their own structure and behavior within their ontologies of the world [30].

The M&S Framework Within Unified Modeling Language (UML)

With object-orientation as unified by UML, and some background in agent-based systems, we are in a position to discuss the computational realization of the M&S framework discussed earlier. The computational framework is based on the Discrete Event System Specification (DEVS) formalism and has been implemented in various object-oriented environments. Using UML we can represent the framework as a set of classes and relations, as illustrated in Figs. 7 and 8.
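As a rough sketch of how such a framework is realized in an object-oriented setting, the following stripped-down rendering of a DEVS atomic model illustrates the characteristic elements (state, time advance, external and internal transitions, output). It is an illustrative sketch only, not the API of any particular DEVS implementation:

```python
# A stripped-down rendering of a DEVS atomic model as a Python class.
# The processor example and all names are illustrative.

class Processor:
    """Atomic model: processes one job at a time with a fixed service time."""
    def __init__(self, service_time=2.0):
        self.service_time = service_time
        self.state = ("idle", None)      # (phase, current job)

    def time_advance(self):
        """Time until the next internal event in the current state."""
        phase, _ = self.state
        return self.service_time if phase == "busy" else float("inf")

    def ext_transition(self, elapsed, job):
        """React to an external input event (a job arrival)."""
        if self.state[0] == "idle":
            self.state = ("busy", job)   # accept the job

    def int_transition(self):
        """Internal event: service completes."""
        self.state = ("idle", None)

    def output(self):
        """Output generated just before the internal transition."""
        return self.state[1]             # emit the finished job

p = Processor()
p.ext_transition(0.0, "job-1")
print(p.time_advance())  # 2.0
print(p.output())        # job-1
p.int_transition()
print(p.time_advance())  # inf
```

A simulator object, in the sense of the framework, would drive these methods by scheduling the next internal event at the time returned by `time_advance` and routing inputs to `ext_transition`.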
Various software implementations of DEVS support different subsets of the classes and relations. In particular, we mention a recent implementation of DEVS within a Service-Oriented Architecture (SOA) environment called DEVS/SOA [31,32]. This implementation exploits some of the benefits afforded by the web environment mentioned earlier and provides a context for consideration of the primary target of our discussion, comprehensive automation of the M&S enterprise. We use one of the UML constructs, the use case diagram, to depict the various capabilities that would be involved in automating all or parts of the M&S enterprise. Use cases are represented by ovals that connect to at least
Artificial Intelligence in Modeling and Simulation, Figure 7 M&S framework formulated within UML
Artificial Intelligence in Modeling and Simulation, Figure 8 M&S framework classes and relations in a UML representation
one actor (stick figure) and to other use cases through “includes” relations, shown as dotted arrows. For example, a sensor (actor) collects data (use case) which includes storage of data (use case). A memory actor stores and
retrieves models, which include storage and retrieval (respectively) of data. Constructing models includes retrieving stored data within an experimental frame. Validating models includes retrieving models from memory as components and simulating the composite model to generate data within an experimental frame. The emulator-simulator actor executes the model so that its generated behavior can be matched against the stored data in the experimental frame. The objectives of the human modeler drive the model evaluator and hence the choice of experimental frames to consider as well as models to validate. Models can be used in (at least) two time frames [33]. In the long term, they support planning of actions to be taken in the future. In the short term, models support real-time execution and control of previously planned actions.

AI Methods in Fundamental Problems of M&S

The enterprise of modeling and simulation is characterized by activities such as model, simulator and experimental frame creation, construction, reuse, composition, verification and validation. We have seen that valid model construction requires significant expertise in all the components of the M&S enterprise, e.g., modeling formalisms, simulation methods, and domain understanding and knowledge. Needless to say, few people can bring all such elements to the table, and this situation creates a significant bottleneck to progress in such projects. Among the contributing factors are the lack of trained personnel, the expense of such highly capable experts, and the time needed to construct models at the resolution required for most objectives. This section introduces some AI-related technologies that can ameliorate this situation: Service-Oriented Architecture and Semantic Web, ontologies, constrained natural language capabilities, and genetic algorithms. Subsequently we will consider these as components in unified, comprehensive, and autonomous automation of M&S.

Artificial Intelligence in Modeling and Simulation, Figure 9 UML use case formulation of the overall M&S enterprise

Service-Oriented Architecture and Semantic Web

On the World Wide Web, a Service-Oriented Architecture (SOA) is a marketplace of open and discoverable web services incorporating, as they mature, Semantic Web technologies [34]. The eXtensible Markup Language (XML) is the standard format for encoding data sets, and there are standards for sending and receiving XML [35]. Unfortunately, the problem only starts at this level. There are myriad ways, or schemata, to encode data into XML, and a good number of such schemata have already been developed. More often than not, they differ in detail when applied to the same domains. What explains this incompatibility? In a Service-Oriented Architecture, the producer sends messages containing XML documents generated in accordance with a schema. The consumer receives and interprets these messages using the same schema in which they were sent. Such a message encodes a world state description (or changes in it) that is a member of a set delineated by an ontology. The ontology takes into account the pragmatic frame, i.e., a description of how the information will be used in downstream processing. In a SOA environment, data dissemination may be dominated by "user pull of data," incremental transmission, discovery using metadata, and automated retrieval of data to meet user pragmatic frame specifications. This is the SOA concept of data-centered, interface-driven, loose coupling between producers and consumers. The SOA concept requires the development of platform-independent, community-accepted standards that allow raw data to be syntactically packaged into XML and accompanied by metadata that describes the semantic and pragmatic information needed to effectively process the data into increasingly higher-value products downstream.

Artificial Intelligence in Modeling and Simulation, Figure 10 Interoperability levels in distributed simulation
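The schema incompatibility problem can be made concrete with a small sketch: the same world state encoded under two different, hypothetical XML schemata. A consumer built for one schema cannot directly interpret the other without mediation:

```python
# Sketch of the schema-incompatibility problem: the same observation
# encoded under two different (hypothetical) XML schemata.

import xml.etree.ElementTree as ET

def encode_schema_a(temp_celsius):
    """Schema A: <observation><temperature units="C">...</temperature>..."""
    root = ET.Element("observation")
    ET.SubElement(root, "temperature", units="C").text = str(temp_celsius)
    return ET.tostring(root, encoding="unicode")

def encode_schema_b(temp_celsius):
    """Schema B: different element names AND different units (Fahrenheit)."""
    root = ET.Element("obs")
    ET.SubElement(root, "tempF").text = str(temp_celsius * 9 / 5 + 32)
    return ET.tostring(root, encoding="unicode")

print(encode_schema_a(20.0))  # <observation><temperature units="C">20.0</temperature></observation>
print(encode_schema_b(20.0))  # <obs><tempF>68.0</tempF></obs>
```

Both documents are valid XML carrying the same world state, yet they differ in element names, structure, and units; reconciling them requires exactly the kind of semantic and pragmatic metadata discussed above.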
Ontologies

Semantic Web researchers typically seek to develop intelligent agents that can draw logical inferences from diverse, possibly contradictory, ontologies such as a web search might discover. Semantic Web research has led to a focus on ontologies [34]. These are logical languages that provide a common vocabulary of terms and axiomatic relations among them for a subject area. In contrast, the newly emerging area of ontology integration assumes that human understanding and collaboration will not be replaced by intelligent agents. Therefore, the goal is to create concepts and tools that help people develop practical solutions to the incompatibility problems that impede "effective" exchange of data, and ways of testing that such solutions have been correctly implemented. As illustrated in Fig. 10, interoperability of systems can be considered at three linguistically inspired levels: syntactic, semantic, and pragmatic. The levels are summarized in Table 2. More detail is provided in [36].

Constrained Natural Language

Model development can be substantially aided by enabling users to specify modeling constructs using some form of
constrained natural language [37]. The goal is to overcome modeling complexity by letting users with limited or nonexistent formal modeling or programming background convey essential information using natural language, a form of expression that is natural and intuitive. Practicality demands constraining the actual expressions that can be used so that the linguistic processing is tractable and the input can be interpreted unambiguously. Some techniques allow the user to narrow down essential components for model construction. Their goal is to reduce ambiguity between the user’s requirements and essential model construction components. A natural language interface allows model specification in terms of a verb phrase consisting of a verb, noun, and modifier, for example “build car quickly.” Conceptual realization of a model from a verb phrase ties in closely with Checkland’s [38] insight that an appropriate verb should be used to express the root definition, or core purpose, of a system. The main barrier between many people and existing modeling software is their lack of computer literacy and this provides an incentive to develop natural language interfaces as a means of bridging this gap. Natural language
expression could create modelers out of people who think semantically but do not have the requisite computer skills to express these ideas. A semantic representation frees the user to explore the system on the familiar grounds of natural language and opens the way for brainstorming, innovation and testing of models before they leave the drawing board.

Artificial Intelligence in Modeling and Simulation, Table 2 Linguistic levels

Pragmatic – how information in messages is used. A collaboration of systems or services interoperates at this level if the receiver reacts to the message in a manner that the sender intends. Example: an order from a commander is obeyed by the troops in the field as the commander intended. A necessary condition is that the information arrives in a timely manner and that its meaning has been preserved (semantic interoperability).

Semantic – shared understanding of the meaning of messages. A collaboration interoperates at this level if the receiver assigns the same meaning as the sender did to the message. Example: an order from a commander to multi-national participants in a coalition operation is understood in a common manner despite translation into different languages. A necessary condition is that the information can be unequivocally extracted from the data (syntactic interoperability).

Syntactic – common rules governing the composition and transmission of messages. A collaboration interoperates at this level if the consumer is able to receive and parse the sender's message. Example: a common network protocol (e.g. IPv4) is employed, ensuring that all nodes on the network can send and receive data bit arrays adhering to a prescribed format.

Genetic Algorithms

The genetic algorithm is a subset of evolutionary algorithms that model biological processes to search in highly complex spaces. A genetic algorithm (GA) allows a population composed of many individuals to evolve under specified selection rules to a state that maximizes the "fitness." The theory was developed by John Holland [39] and popularized by Goldberg, who solved a difficult problem involving the control of gas pipeline transmission for his dissertation [40]. Numerous applications of GAs have since been chronicled [41,42]. Recently, GAs have been applied to cutting-edge problems in automated construction of simulation models, as discussed below [43].

Automation of M&S

We are now ready to suggest a unifying theme for the problems in M&S and possible AI-based solutions, by raising the question of whether all of M&S can be automated into an integrated autonomous artificial modeler/simulationist. First, we provide some background needed to explore an approach to developing such an intelligent agent based on the System Entity Structure/Model Base framework, a hybrid methodology that combines elements of AI and M&S.

System Entity Structure

The System Entity Structure (SES) concepts were first presented in [44]. They were subsequently extended and implemented in a knowledge-based design environment [45]. Application to model base management originated with [46]. Subsequent formalizations and implementations were developed in [47,48,49,50,51]. Applications to various domains are given in [52].
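The genetic-algorithm search introduced in the Genetic Algorithms subsection can be sketched minimally as follows; the "one-max" fitness function and all parameters are illustrative only:

```python
# A minimal genetic algorithm: a population of bit-string individuals
# evolves under selection, crossover, and mutation toward higher fitness.
# Problem and parameters are illustrative.

import random

random.seed(1)
LENGTH, POP, GENS = 16, 30, 60

def fitness(ind):
    """'One-max' toy problem: fitness is the count of 1 bits."""
    return sum(ind)

def select(pop):
    """Tournament selection of size 2."""
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    """Single-point crossover."""
    cut = random.randrange(1, LENGTH)
    return p1[:cut] + p2[cut:]

def mutate(ind, rate=0.02):
    """Flip each bit with a small probability."""
    return [bit ^ 1 if random.random() < rate else bit for bit in ind]

pop = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENS):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP)]

best = max(pop, key=fitness)
print(fitness(best))  # close to LENGTH after evolution
```

In the automated model construction setting cited above, the individuals would encode model design choices rather than raw bit strings, and fitness would reflect the modeling objective.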
A System Entity Structure is a knowledge representation formalism which focuses on certain elements and relationships that relate to M&S. Entities represent things that exist in the real world, or sometimes in an imagined world. Aspects represent ways of decomposing things into finer-grained ones. Multi-aspects are aspects for which the components are all of one kind. Specializations represent categories or families of specific forms that a thing can assume. The SES provides the means to represent a family of models as a labeled tree. Two of its key features are support for decomposition and specialization. The former allows decomposing a large system into smaller systems. The latter supports representation of alternative choices. Specialization enables representing a generic model (e.g., a computer display model) as one of its specialized variations (e.g., a flat panel display or a CRT display). On the basis of SES axiomatic specifications, a family of models (a design space) can be represented and then automatically pruned to generate a simulation model. Such models can be systematically studied and experimented with based on alternative design choices. An important, salient feature of the SES is its ability to represent models not only in terms of their decomposition and specialization, but also in terms of their aspects: the SES represents alternative decompositions via aspects. The SES formalism provides an operational language for specifying such hierarchical structures. An SES is a structural knowledge representation scheme that systematically organizes a family of possible structures of a system. Such a family characterizes the decomposition, coupling, and taxonomic relationships among entities. An entity represents a real world object. The decomposition of an entity concerns how it may be broken down into sub-entities. In addition, coupling specifications tell how sub-entities may be coupled together to reconstitute the entity, and are associated with an aspect. The taxonomic relationship concerns admissible variants of an entity. The SES/Model-Base framework [52] is a powerful means to support the plan-generate-evaluate paradigm in systems design. Within the framework, entity structures organize models in a model base.
Thus, modeling activity within the framework consists of three sub-activities: specification of model composition structure, specification of model behavior, and synthesis of a simulation model. The SES is governed by an axiomatic framework in which entities alternate with the other items. For example, a thing is made up of parts; therefore, its entity representation has a corresponding aspect which, in turn, has entities representing the parts. A System Entity Structure specifies a family of hierarchical, modular simulation models, each of which corresponds to a complete pruning of the SES. Thus, the SES formalism can be viewed as an ontology with the set of all simulation models as its domain of discourse. The mapping from SES to the Systems formalism, particularly to the DEVS formalism, is discussed in [36]. We note that simulation models include both static and dynamic elements in any application domain, hence represent an advanced form of ontology framework.
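The pruning operation just described can be sketched concretely. In the following hypothetical illustration, an SES for the computer display example is encoded as a dictionary of entities with aspects and specializations, and pruning resolves each specialization according to the given choices:

```python
# Illustrative sketch of a System Entity Structure: entities with aspects
# (decompositions) and specializations (alternative variants). "Pruning"
# selects one variant per specialization, yielding one model structure.
# Node names and the encoding are hypothetical.

ses = {
    "Computer": {"aspect": ["Processor", "Display"]},
    "Display": {"specialization": ["FlatPanel", "CRT"]},
    "Processor": {},
    "FlatPanel": {},
    "CRT": {},
}

def prune(ses, entity, choices):
    """Resolve specializations by the given choices; expand aspects."""
    node = ses.get(entity, {})
    if "specialization" in node:
        return prune(ses, choices[entity], choices)
    if "aspect" in node:
        return {entity: [prune(ses, child, choices) for child in node["aspect"]]}
    return entity

print(prune(ses, "Computer", {"Display": "FlatPanel"}))
# {'Computer': ['Processor', 'FlatPanel']}
```

Each complete assignment of choices yields one pruned structure, i.e., one member of the family of hierarchical, modular simulation models the SES specifies.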
Artificial Intelligence in Modeling and Simulation, Figure 11 SES/Model base architecture for automated M&S
SES/Model Base Architecture for an Automated Modeler/Simulationist

In this section, we raise the challenge of creating a fully automated modeler/simulationist that can autonomously carry out all the separate functions identified in the M&S framework, as well as the high-level management of these functions that is currently under exclusively human control. Recall the use case diagram in Fig. 9 that depicts the various capabilities that would need to be involved in realizing a completely automated modeler/simulationist. To link up with the primary modules of mind, we assign model construction to the belief generator—interpreting beliefs as models [54]. Motivations outside the M&S component drive the belief evaluator and hence the choice of experimental frames to consider as well as models to validate. External desire generators stimulate the imaginer/envisioner to run models to make predictions within pragmatic frames and assist in action planning. The use case diagram of Fig. 9 is itself a model of how modeling and simulation activities may be carried out within human minds. We need not be committed to particular details at this early stage, but will assume that such a model can be refined to provide a useful representation of human mental activity from the perspective of M&S. This provides the basis for examining how such an artificially intelligent modeler/simulationist might work and for considering the requirements for comprehensive automation of M&S. The SES/MB methodology, introduced earlier, provides a basis for formulating a conceptual architecture for automating all of the M&S activities depicted earlier in Fig. 9 into an integrated system. As illustrated in Fig. 11, the dichotomy between real-time use of models for acting in the real world and the longer-term development of models that can be employed in such real-time activities is manifested in the distinction between passive and active models. Passive models are stored in a repository that can be likened to long-term memory. Such models undergo the life cycle mentioned earlier, in which they are validated and employed within experimental frames of interest for long-term forecasting or decision-making. However, in addition to this quite standard concept of operation, there is an input pathway from the short-term, or working, memory in which models are executed in real time. Such execution, in real-world environments, often reveals deficiencies, which provide the impetus and requirements for instigating new model construction. Whereas long-term model development and application objectives are characterized by experimental frames, the short-term execution objectives are characterized by pragmatic frames. As discussed in [36], a pragmatic frame provides a means of
characterizing the use for which an executable model is being sought. Such models must be simple enough to execute within usually stringent real-time deadlines.

The M&S Framework Within Mind Architecture

An influential formulation of recent work relating to mind and brain [53] views mind as the behavior of the brain, where mind is characterized by a massively modular architecture. This means that mind is composed of a large number of modules, each responsible for different functions, and each largely independent of, and sparsely connected with, the others. Evolution is assumed to favor such differentiation and specialization since, in suitably weakly interactive environments, specialized modules are less redundant and more efficient in consuming space and time resources. Indeed, this formulation is reminiscent of the class of problems characterized by Systems of Systems (SoS), in which the attempt is made to integrate existing systems, originally built to perform specific functions, into a more comprehensive and multifunctional system. As discussed in [15], the components of each system can be viewed as communicating with each other within a common ontology, or model of the world, that is tuned to the smooth functioning of the organization. However, such ontologies may well be mismatched to support integration at the systems level. Although it works on the results of a long history of pre-human evolution, the fact that consciousness seems to provide a unified and undifferentiated picture of mind suggests that human evolution has to a large extent solved the SoS integration problem.

The Activity Paradigm for Automated M&S

At least for the initial part of its life, a modeling agent needs to work on a "first order" assumption about its environment, namely, that it can exploit only semantics-free properties [39]. Regularities, such as periodic behaviors, and stimulus-response associations are one source of such semantics-free properties.
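A "first order", semantics-free regularity detector can be sketched as follows: a periodic pattern is found in an observed signal by a simple autocorrelation search, with no knowledge of what the signal means. The signal and search bounds are illustrative:

```python
# Sketch of semantics-free regularity detection: find the period of an
# observed signal by autocorrelation, with no knowledge of its meaning.

def detect_period(signal, max_lag=None):
    """Return the lag (>0) at which the signal best matches a shift of itself."""
    n = len(signal)
    max_lag = max_lag or n // 2
    best_lag, best_score = None, float("-inf")
    for lag in range(1, max_lag + 1):
        # Average product of the signal with its lagged copy.
        score = sum(signal[i] * signal[i + lag] for i in range(n - lag)) / (n - lag)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# A noiseless square wave with period 4.
signal = [1, 1, -1, -1] * 8
print(detect_period(signal))  # 4
```

Such a detector exploits only the structure of the observations themselves, which is exactly the kind of property available to an agent before any semantic interpretation is in place.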
In this section, we will focus on a fundamental property of systems, such as the brain's neural network, that have a large number of components. The distribution of such a system's activity over space and time provides a rich and semantics-free substrate from which models can be generated. Proposing structures and algorithms to track and replicate this activity should support automated modeling and simulation of patterns of reality. The goal of the activity paradigm is to extract mechanisms from natural phenomena and behaviors to automate and guide the M&S process. The brain offers a quintessential illustration of activity and its potential use in constructing models. Figure 12
Artificial Intelligence in Modeling and Simulation, Figure 12 Brain module activities
describes brain activities [54]. Positron emission tomography (PET) is used to record activity inside the brain. The PET scans show what happens inside the brain when resting and when stimulated by words and music. The red areas indicate high brain activity. Language and music produce responses in opposite sides of the brain (showing the sub-system specializations). There are many levels of activity (ranging from low to high). There is a strong link between modularity and the applicability of activity measurement as a useful concept. Indeed, modules represent loci for activity—a distribution of activity would not be discernible over a network were there no modules that could be observed to be in different states of activity. As just illustrated, neuroscientists are exploiting this activity paradigm to associate brain areas with functions and to gain insight into areas that are active or inactive over different, but related, functions, such as language and music processing. We can generalize this approach as a paradigm for an automated modeling agent.

Component, Activity and Discrete-Event Abstractions

Figure 13 depicts a brain description through components and activities. First, the modeler considers one brain activity (e.g., listening to music). This first-level activity corresponds to simulation components at a lower level. At this level, the activity of the components can be considered to be only on (grey boxes) or off (white boxes). At lower levels, structure and behaviors can be decomposed. The structure of the higher component level is detailed at lower levels (e.g., down to the neuronal network). The behavior is also detailed through activity and discrete-event abstraction [55]. At the finest levels, the activity of components can be detailed.
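The on/off activity abstraction can be sketched for a set of components: a component is marked active when its measured activity exceeds a quantum threshold. The component names and values below are hypothetical:

```python
# Sketch of the on/off activity abstraction: components are marked active
# when measured activity exceeds a quantum threshold. Values are invented.

measurements = {
    "auditory_cortex": 0.92,
    "visual_cortex": 0.08,
    "language_area": 0.75,
    "motor_area": 0.12,
}

QUANTUM = 0.5  # measure of significant change / activity threshold

def activity_abstraction(measurements, quantum):
    """Map continuous activity levels to the coarse on/off abstraction."""
    return {name: ("on" if level >= quantum else "off")
            for name, level in measurements.items()}

print(activity_abstraction(measurements, QUANTUM))
```

Lowering or raising the quantum changes the resolution of the abstraction, in the same way that the quantum size filters the continuous flow in the quantization discussion that follows.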
Artificial Intelligence in Modeling and Simulation, Figure 13 Hierarchy of components, activity and discrete-event abstractions
Using pattern detection and quantization methods, continuously changing variables can also be treated within the activity paradigm. As illustrated in Fig. 14, small slopes and small peaks can signal low activity, whereas high slopes and peaks can signal high activity levels. To provide for scale, discrete-event abstraction can be achieved using quantization [56]. To determine the activity level, a quantum, or measure of significant change, has to be chosen. The quantum size acts as a filter on the continuous flow. For example, one can notice that in the figure, using the displayed quantum, the smallest peaks will not be significant. Thus, different levels of resolution can be achieved by employing different quantum sizes. A genetic algorithm can be used to find the optimum level of resolution for a given modeling objective [43].

Artificial Intelligence in Modeling and Simulation, Figure 14 Activity sensitivity and discrete-events

Activity Tracking

Within the activity paradigm, M&S consists of capturing activity paths through component processing and transformations. To determine the basic structure of the whole system, an automated modeler has to answer questions of the form: where and how is activity produced, received, and transmitted? Figure 15 represents a component-based view of activity flow in a neuronal network. Activity paths through components are represented by full arrows. Activity is represented by full circles. Components are represented by squares.

Artificial Intelligence in Modeling and Simulation, Figure 15 Activity paths in neurons

The modeler must generate such a graph based on observed data—but how does it obtain such data? One approach is characteristic of the current use of PET scans by neuroscientists. This approach exploits a relationship between activity and energy: activity requires consumption of energy; therefore, observing areas of high energy consumption signals areas of high activity. Notice that this correlation requires that energy consumption be
localizable to modules in the same way that activity is so localized. For example, current computer architectures that provide a single power source to all components do not lend themselves to such observation. How activity is passed on from component to component can be related to the modularity styles (none, weak, strong) of the components. Concepts relating such modularity styles to activity transfer need to be developed to support an activity-tracking methodology that goes beyond reliance on the energy correlation.

Activity Model Validation

Recall that having generated a model of an observed system (whether through activity tracking or by other means), the next step is validation. In order to perform such validation, the modeler needs an approach to generating activity profiles from simulation experiments on the model and to comparing these profiles with those observed in the real system. Muzy and Nutaro [57] have developed algorithms that exploit activity tracking to achieve efficient simulation of DEVS models. These algorithms can be
Artificial Intelligence in Modeling and Simulation, Figure 16 Evolution of the use of intelligent agents in simulation
adapted to provide an activity-tracking pattern applicable to a given simulation model, to extract its activity profiles for comparison with those of the modeled system. A forthcoming monograph will develop the activity paradigm for M&S in greater detail [58].

Intelligent Agents in Simulation

Recent trends in technology as well as the use of simulation in exploring complex artificial and natural information processes [62,63] have made it clear that simulation model fidelity and complexity will continue to increase dramatically in the coming decades. The dynamic and distributed nature of simulation applications, the significance of exploratory analysis of complex phenomena [64], and the need for modeling the micro-level interactions, collaboration, and cooperation among real-world entities are bringing a shift in the way systems are conceptualized. Using intelligent agents in simulation models is based on the idea that it is possible to represent the behavior of active entities in the world in terms of the interactions of an assembly of agents with their own operational autonomy.
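The profile comparison described under Activity Model Validation can be sketched as follows. Everything here is invented for illustration: the profiles, the normalization, the total-variation distance as the comparison metric, and the acceptance tolerance are all assumptions, not part of the algorithms of [57].

```python
# Hypothetical validation check: compare a simulated activity profile
# with one observed on the real system by normalizing both and measuring
# their total-variation distance; accept the model if the distance is small.

def normalize(profile):
    total = sum(profile.values())
    return {k: v / total for k, v in profile.items()}

def tv_distance(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

observed = {"A": 30, "B": 60, "C": 10}    # activity counts from the system
simulated = {"A": 28, "B": 62, "C": 10}   # activity counts from the model

dist = tv_distance(normalize(observed), normalize(simulated))
valid = dist < 0.05    # tolerance chosen purely for illustration
```

The choice of distance and tolerance would in practice depend on the modeling objective, just as the quantum size does for discrete-event abstraction.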
Artificial Intelligence in Modeling and Simulation, Figure 17 Agent-directed simulation framework
The early pervading view on the use of agents in simulation stems from developments in Distributed Artificial Intelligence (DAI), as well as advances in agent architectures and agent-oriented programming. The DAI perspective of modeling systems in terms of entities that are capable of solving problems by means of reasoning through symbol manipulation resulted in various technologies that constitute the basic elements of agent systems. The early work on the design of agent simulators within the DAI community focused on answering the question of how the goals and intentions of agents emerge and how they lead to the execution of actions that change the state of their environment. The agent-directed approach to simulating agent systems lies at the intersection of several disciplines: DAI, Control Theory, Complex Adaptive Systems (CAS), and Discrete-Event Systems/Simulation. As shown in Fig. 16, these core disciplines gave direction to technology, languages, and possible applications, which then influenced the evolution of the synergy between simulation and agent systems.

Distributed Artificial Intelligence and Simulation

While progress in agent simulators and interpreters resulted in various agent architectures and their computational engines, the ability to coordinate agent ensembles was recognized early as a key challenge [65]. The MACE system [66] is considered one of the major milestones in DAI. Specifically, the proposed DAI system integrated concepts from concurrent programming (e.g., the actor formalism [67]) and knowledge representation to reason symbolically about skills and beliefs pertaining to modeling the environment. Task allocation and coordination were considered fundamental challenges in early DAI systems. The contract net protocol developed by [68] provided the basis for modeling collaboration in the simulation of distributed problem solving.

Agent Simulation Architectures

One of the first agent-oriented simulation languages, AGENT-0 [69], provided a framework that enabled the representation of the beliefs and intentions of agents. Unlike object-oriented simulation languages such as SIMULA 67 [70], the first object-oriented language for specifying discrete-event systems, AGENT-0 and McCarthy's Elephant 2000 language incorporated speech act theory to provide flexible communication mechanisms for agents. DAI and cognitive psychology influenced the development of cognitive agents such as those found in AGENT-0, e.g., the Belief-Desires-Intentions (BDI) framework [71]. Procedural reasoning and control theory, on the other hand, provided a basis for the design and implementation of reactive agents. Classical control theory enables the specification of a mathematical model that describes the interaction of a control system and its environment. The analogy between an agent and a control system facilitated the formalization of agent interactions in terms of a formal specification of dynamic systems. The shortcomings of reactive agents (i.e., lack of mechanisms for goal-directed behavior) and cognitive agents (i.e., issues pertaining to computational tractability in deliberative reasoning) led to the development of hybrid architectures such as the RAP system [72]. Agents are often viewed as design metaphors in the development of models for simulation and gaming. Yet this narrow view limits the potential of agents in improving various other dimensions of simulation. To this end, Fig. 17 presents a unified paradigm of Agent-Directed Simulation that consists of two categories as follows:
(1) Simulation for Agents (agent simulation), i.e., simulation of systems that can be modeled by agents (in engineering, human and social dynamics, military applications, etc.), and (2) Agents for Simulation, which can be divided into two groups: agent-supported simulation and agent-based simulation.
Agent Simulation

Agent simulation involves the use of agents as design metaphors in developing simulation models. It involves the use of simulation conceptual frameworks (e.g., discrete-event, activity scanning) to simulate the behavioral dynamics of agent systems, and incorporates autonomous agents that function in parallel to achieve their goals and objectives. Agents possess high-level interaction mechanisms independent of the problem being solved. Communication protocols and mechanisms for interaction via task allocation, coordination of actions, and conflict resolution at varying levels of sophistication are primary elements of agent simulations. Simulating agent systems requires understanding the basic principles, organizational mechanisms, and technologies underlying such systems.

Agent-Based Simulation

Agent-based simulation is the use of agent technology to monitor and generate model behavior. This is similar to the use of AI techniques for the generation of model behavior (e.g., qualitative simulation and knowledge-based simulation). The development of novel and advanced simulation methodologies such as multisimulation suggests the use of intelligent agents as simulator coordinators, where run-time decisions for model staging and updating take place to facilitate dynamic composability. The perception feature of agents makes them pertinent for monitoring tasks. Agent-based simulation is also useful for conducting complex experiments and deliberative knowledge processing such as planning, deciding, and reasoning. Agents are also critical enablers for improving the composability and interoperability of simulation models [73].

Artificial Intelligence in Modeling and Simulation, Figure 18 M&S within mind

Agent-Supported Simulation

Agent-supported simulation deals with the use of agents as a support facility to enable computer assistance by enhancing cognitive capabilities in problem specification and solving. Hence, agent-supported simulation involves the use of intelligent agents to improve simulation and gaming infrastructures or environments. Agent-supported simulation is used for the following purposes: to provide computer assistance for front-end and/or back-end interface functions; to process elements of a simulation study symbolically (for example, for consistency checks and built-in reliability); and to provide cognitive abilities to the elements of a simulation study, such as learning or understanding abilities. For instance, in simulations with defense applications, agents are often used as support facilities to see the battlefield; to fuse, integrate, and de-conflict the information presented to the decision-maker; to generate alarms based on the recognition of specific patterns; to filter, sort, track, and prioritize the disseminated information; and to generate contingency plans and courses of action. A significant requirement for the design and simulation of agent systems is the distributed knowledge that represents
Artificial Intelligence in Modeling and Simulation, Figure 19 Emergence of endomorphic agents
the mental model that characterizes each agent's beliefs about the environment, itself, and other agents. Endomorphic agent concepts provide a framework for addressing the difficult conceptual issues that arise in this domain.

Endomorphic Agents

We now consider an advanced feature that an autonomous, integrated and comprehensive modeler/simulationist agent must have if it is to fully emulate human capability. This is the ability, to a limited but significant extent, to construct and employ models of its own mind as well as of the minds of other agents. We use the term "mind" in the sense just discussed. The concept of an endomorphic agent is illustrated in Figs. 18 and 19 in a sequence of related diagrams. The diagram labeled with an oval with embedded number 1 is that of Fig. 9, with the modifications mentioned earlier to match up with human motivation and desire generation modules. In diagram 2, the label "mind" refers to the set of M&S capabilities depicted in Fig. 9. As in [30], an agent, human or technological, is considered to be composed of a mind and a body. Here, "body" represents the external manifestation of the agent, which is observable by other agents. Mind, in contrast, is hidden from view and must exist as a construct, or model, held by other agents. In other words, to use the language of evolutionary psychology, agents must develop a "theory of mind" about other
agents from observation of their external behavior. An endomorphic agent is represented in diagram 3 with a mental model of the body and mind of the agent in diagram 2. This second agent is shown more explicitly in diagram 4, with a mental representation of the first agent's body and mind. Diagram 5 depicts the recursive aspect of endomorphism, where the (original) agent of diagram 2 has developed a model of the second agent's body and mind. But the latter model contains the just-mentioned model of the first agent's body and mind. This leads to a potentially infinite regress in which, apparently, each agent can have a representation of the other agent, and by reflection, of himself, that increases in depth of nesting without end. Hofstadter [59] represents a similar concept in the diagram on page 144 of his book, in which the comic character Sluggo is "dreaming of himself dreaming of himself dreaming of himself, without end." He then uses the label on the Morton Salt box on page 145 to show how not all self-reference involves infinite recursion. On the label, the girl carrying the salt box obscures its label with her arm, thereby shutting down the regress. Thus, the salt box has a representation of itself on its label, but this representation is only partial. Reference [30] related the termination of self-reference to the agent's objectives and requirements in constructing models of himself, other agents, and the environment. Briefly, the agent need only go as deep as needed to get a reliable model of the other agents. The agent can
Artificial Intelligence in Modeling and Simulation, Figure 20 Interacting models of others in baseball
stop at level 1 with a representation of the others' bodies. However, this might not allow predicting another's movements, particularly if the latter has a mind in control of these movements. This would force the first agent to include at least a crude model of the other agent's mind. In a competitive situation, having such a model might give the first agent an advantage, and this might lead the second agent to likewise develop a predictive model of the first agent. With the second agent now seeming to become less predictable, the first agent might develop a model of the second agent's mind that restores lost predictive power. This would likely have to include a reflected representation of himself, although the impending regression could be halted if this representation did not, itself, contain a model of the other agent. Thus, the depth to which competitive endomorphic agents have models of themselves and others might be the product of a co-evolutionary "mental arms race" in which an improvement on one side triggers a contingent improvement on the other, the improvement being an incremental refinement of the internal models by successively adding more levels of nesting. Minsky [60] conjectured that termination of the potentially infinite regress in agents' models of each other within a society of mind might be constrained by sheer limitations on the ability to marshal the resources required to support the necessary computation. We can go further by assuming that agents have differing mental capacities to support such computational nesting. Therefore, an agent
with greater capacity might be able to "out-think" one of lesser capability. This is illustrated by the following real-life story drawn from a recent newspaper account of a critical play in a baseball game.

Interacting Models of Others in Competitive Sport

The following account is illustrated in Fig. 20.

A Ninth Inning to Forget: Cordero Can't Close, Then Base-Running Gaffe Ends Nats' Rally
Steve Yanda, Washington Post Staff Writer, Jun 24, 2007. Copyright The Washington Post Company.
Indians 4, Nationals 3. Nook Logan played out the ending of last night's game in his head as he stood on second base in the bottom of the ninth inning. The bases were loaded with one out and the Washington Nationals trailed by one run. Even if Felipe Lopez, the batter at the plate, grounded the ball, say, right back to Cleveland Indians closer Joe Borowski, the pitcher merely would throw home. Awaiting the toss would be catcher Kelly Shoppach, who would tag the plate and attempt to nail Lopez at first. By the time Shoppach's throw reached first baseman Victor Martinez, Logan figured he would be gliding across the plate with the tying run. Lopez did ground to Borowski, and the closer did fire the ball home. However, Shoppach elected to throw to third instead of first, catching Logan drifting too far off the bag for the final out in the Nationals' 4–3 loss at RFK Stadium. "I thought [Shoppach] was going to throw to first," Logan said. And if the catcher had, would Logan have scored all the way from second? "Easy."

We'll analyze this account to show how it throws light on the advantage conferred by an endomorphic capability to process to a nesting depth exceeding that of an opponent. The situation starts in the bottom of the ninth inning with the Washington Nationals at bat, the bases loaded, one out, and the team trailing by one run. The runner on second base, Nook Logan, plays out the ending of the game in his head. This can be interpreted in terms of endomorphic models as follows. Logan makes a prediction using his models of the opposing pitcher and catcher, namely, that the pitcher would throw home and the catcher would tag the plate and attempt to nail the batter at first. Logan then makes a prediction using a model of himself, namely, that he would be able to reach home plate while the catcher's thrown ball was traveling to first base. In actual play, the catcher threw the ball to third and caught Logan out. This is evidence that the catcher was able to play out the simulation to a greater depth than Logan was. The catcher's model of the situation agreed with that of Logan as it related to the other actors. The difference was that the catcher used a model of Logan that predicted that Logan would predict that he (the catcher) would throw to first. Having this prediction, the catcher decided instead to throw the ball to third, which resulted in putting Logan out. We note that the catcher's model was based on his model of Logan's model, so it was one level greater in depth of nesting than the latter.
To have succeeded, Logan would have had to support one more level of nesting, namely, a model of the catcher predicting that the catcher would use his (Logan's) own model to out-think him, and then make the countermove of not starting toward third base. The enigma of such endomorphic agents provides extreme challenges for further research in AI and M&S. The formal and computational framework that the M&S framework discussed here provides may be of particular advantage to cognitive psychologists and philosophers interested in an active area of investigation in which the terms "theory of mind," "simulation," and "mind reading" are employed without much in the way of definition [61].
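The nesting-depth argument in the baseball play can be sketched as a level-k reasoning model, in which each actor best-responds to a simulation of the opponent one level shallower. The choices, functions, and levels below are invented for the sketch; they are not derived from the article's formalism or from real baseball data.

```python
# Illustrative level-k model of the play. A level-0 actor follows the
# conventional play; a level-k actor simulates the opponent at level k-1
# and counters. Deeper nesting wins, as in the account above.

def runner_choice(level):
    # Level 0: naive runner assumes the throw goes to first and runs home.
    if level == 0:
        return "run_home"
    # Deeper runner anticipates the catcher one level down.
    return "hold" if catcher_choice(level - 1) == "throw_third" else "run_home"

def catcher_choice(level):
    # Level 0: conventional play, throw to first.
    if level == 0:
        return "throw_first"
    # Deeper catcher predicts the runner one level down and counters.
    return "throw_third" if runner_choice(level - 1) == "run_home" else "throw_first"

# Logan reasoned at level 1 (he modeled the catcher at level 0);
# Shoppach reasoned at level 2 (he modeled Logan's level-1 model).
logan = runner_choice(1)       # expects a throw to first, so runs home
shoppach = catcher_choice(2)   # predicts that, so throws to third
```

Raising Logan to level 3 would flip the outcome again, which is exactly the co-evolutionary "mental arms race" described above.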
Future Directions

M&S presents some fundamental and very difficult problems whose solution may benefit from the concepts and techniques of AI. We have discussed some key problem areas, including verification and validation, reuse and composability, and distributed simulation and systems-of-systems interoperability. We have also considered some areas of AI that have direct applicability to problems in M&S, such as Service-Oriented Architecture and the Semantic Web, ontologies, constrained natural language, and genetic algorithms. In order to provide a unifying theme for the problems and solutions, we raised the question of whether all of M&S can be automated into an integrated autonomous artificial modeler/simulationist. We explored an approach to developing such an intelligent agent based on the System Entity Structure/Model Base framework, a hybrid methodology that combines elements of AI and M&S. We proposed a concrete methodology by which such an agent could engage in M&S based on activity tracking. Implementing such a methodology in automated form presents numerous challenges to AI. We closed with consideration of endomorphic modeling capability, an advanced feature that such an agent must have if it is to fully emulate human M&S capability. Since this capacity implies an infinite regress in which models contain models of themselves without end, it can only be had to a limited degree. However, the critical insights it may offer into competitive co-evolutionary behavior in humans and other higher-order primates warrant more intensive research into model nesting depth, that is, the degree to which an endomorphic agent can marshal the mental resources needed to construct and employ models of its own "mind" as well as of the "minds" of other agents. The enigma of such endomorphic agents provides extreme challenges for further research in AI and M&S, as well as related disciplines such as cognitive science and philosophy.
Bibliography

Primary Literature

1. Wymore AW (1993) Model-based systems engineering: An introduction to the mathematical theory of discrete systems and to the tricotyledon theory of system design. CRC, Boca Raton 2. Zeigler BP, Kim TG, Praehofer H (2000) Theory of modeling and simulation. Academic Press, New York 3. Ören TI, Zeigler BP (1979) Concepts for advanced simulation methodologies. Simulation 32(3):69–82 4. http://en.wikipedia.org/wiki/DEVS Accessed Aug 2008 5. Knepell PL, Aragno DC (1993) Simulation validation: a confidence assessment methodology. IEEE Computer Society Press, Los Alamitos
6. Law AM, Kelton WD (1999) Simulation modeling and analysis, 3rd edn. McGraw-Hill, Columbus 7. Sargent RG (1994) Verification and validation of simulation models. In: Winter simulation conference, pp 77–84 8. Balci O (1998) Verification, validation, and testing. In: Winter simulation conference 9. Davis KP, Anderson AR (2003) Improving the composability of department of defense models and simulations, RAND technical report. http://www.rand.org/pubs/monographs/MG101/. Accessed Nov 2007; J Def Model Simul Appl Methodol Technol 1(1):5–17 10. Yilmaz L, Ören TI (2004) A conceptual model for reusable simulations within a model-simulator-context framework. In: Conference on conceptual modeling and simulation, Italy, 28–31 October 11. Traoré M, Muzy A (2004) Capturing the dual relationship between simulation models and their context. Simulation Practice and Theory. Elsevier 12. Page E, Opper J (1999) Observations on the complexity of composable simulation. In: Proceedings of winter simulation conference, Orlando, pp 553–560 13. Kasputis S, Ng H (2000) Composable simulations. In: Proceedings of winter simulation conference, Orlando, pp 1577–1584 14. Sarjoughian HS (2006) Model composability. In: Perrone LF, Wieland FP, Liu J, Lawson BG, Nicol DM, Fujimoto RM (eds) Proceedings of the winter simulation conference, pp 104–158 15. DiMario MJ (2006) System of systems interoperability types and characteristics in joint command and control. In: Proceedings of the 2006 IEEE/SMC international conference on system of systems engineering, Los Angeles, April 2006 16. Sage AP, Cuppan CD (2001) On the systems engineering and management of systems of systems and federation of systems. Information Knowledge Systems Management, vol 2, pp 325–345 17. Dahmann JS, Kuhl F, Weatherly R (1998) Standards for simulation: as simple as possible but not simpler: the high level architecture for simulation. Simulation 71(6):378 18.
Sarjoughian HS, Zeigler BP (2000) DEVS and HLA: Complementary paradigms for M&S? Trans SCS 4(17):187–197 19. Yilmaz L (2004) On the need for contextualized introspective simulation models to improve reuse and composability of defense simulations. J Def Model Simul 1(3):135–145 20. Tolk A, Muguira JA (2003) The levels of conceptual interoperability model (LCIM). In: Proceedings fall simulation interoperability workshop. http://www.sisostds.org Accessed Aug 2008 21. http://en.wikipedia.org/wiki/Simula Accessed Aug 2008 22. http://en.wikipedia.org/wiki/Java Accessed Aug 2008 23. http://en.wikipedia.org/wiki/Expert_system Accessed Aug 2008 24. http://en.wikipedia.org/wiki/Frame_language Accessed Aug 2008 25. http://en.wikipedia.org/wiki/Agent_based_model Accessed Aug 2008 26. http://www.swarm.org/wiki/Main_Page Accessed Aug 2008 27. Unified Modeling Language (UML) http://www.omg.org/technology/documents/formal/uml.htm 28. Object Modeling Group (OMG) http://www.omg.org 29. http://en.wikipedia.org/wiki/Semantic_web Accessed Aug 2008
30. Zeigler BP (1990) Object oriented simulation with hierarchical, modular models: intelligent agents and endomorphic systems. Academic Press, Orlando 31. http://en.wikipedia.org/wiki/Service_oriented_architecture Accessed Aug 2008 32. Mittal S, Mak E, Nutaro JJ (2006) DEVS-based dynamic modeling & simulation reconfiguration using enhanced DoDAF design process. Special issue on DoDAF. J Def Model Simul, Dec (3)4:239–267 33. Zeigler BP (1988) Simulation methodology/model manipulation. In: Encyclopedia of systems and controls. Pergamon Press, England 34. Alexiev V, Breu M, de Bruijn J, Fensel D, Lara R, Lausen H (2005) Information integration with ontologies. Wiley, New York 35. Kim L (2003) Official XMLSPY handbook. Wiley, Indianapolis 36. Zeigler BP, Hammonds P (2007) Modeling & simulation-based data engineering: introducing pragmatics into ontologies for net-centric information exchange. Academic Press, New York 37. Simard RJ, Zeigler BP, Couretas JN (1994) Verb phrase model specification via system entity structures. AI and planning in high autonomy systems: distributed interactive simulation environments. In: Proceedings of the fifth annual conference, 7–9 Dec 1994, pp 192–1989 38. Checkland P (1999) Soft systems methodology in action. Wiley, London 39. Holland JH (1992) Adaptation in natural and artificial systems: An introductory analysis with applications to biology, control, and artificial intelligence. MIT Press, Cambridge 40. Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley Professional, Princeton 41. Davis L (1987) Genetic algorithms and simulated annealing. Morgan Kaufmann, San Francisco 42. Michalewicz Z (1996) Genetic algorithms + data structures = evolution programs. Springer, Heidelberg 43. Cheon S (2007) Experimental frame structuring for automated model construction: application to simulated weather generation. Doctoral dissertation, Dept of ECE, University of Arizona, Tucson 44.
Zeigler BP (1984) Multifaceted modelling and discrete event simulation. Academic Press, London 45. Rozenblit JW, Hu J, Zeigler BP, Kim TG (1990) Knowledge-based design and simulation environment (KBDSE): foundational concepts and implementation. J Oper Res Soc 41(6):475–489 46. Kim TG, Lee C, Zeigler BP, Christensen ER (1990) System entity structuring and model base management. IEEE Trans Syst Man Cybern 20(5):1013–1024 47. Zeigler BP, Zhang G (1989) The system entity structure: knowledge representation for simulation modeling and design. In: Widman LA, Loparo KA, Nielsen N (eds) Artificial intelligence, simulation and modeling. Wiley, New York, pp 47–73 48. Luh C, Zeigler BP (1991) Model base management for multifaceted systems. ACM Trans Model Comp Sim 1(3):195–218 49. Couretas J (1998) System entity structure alternatives enumeration environment (SEAS). Doctoral dissertation, Dept of ECE, University of Arizona 50. Park HC, Kim TG (1998) A relational algebraic framework for VHDL models management. Trans SCS 15(2):43–55 51. Chi SD, Lee J, Kim Y (1997) Using the SES/MB framework to analyze traffic flow. Trans SCS 14(4):211–221 52. Cho TH, Zeigler BP, Rozenblit JW (1996) A knowledge based simulation environment for hierarchical flexible manufacturing. IEEE Trans Syst Man Cybern Part A: Syst Hum 26(1):81–91
53. Carruthers P (2006) The architecture of the mind: massive modularity and the flexibility of thought. Oxford University Press, USA 54. Wolpert L (2004) Six impossible things before breakfast: The evolutionary origin of belief. W.W. Norton, London 55. Zeigler BP (2005) Discrete event abstraction: an emerging paradigm for modeling complex adaptive systems. In: Booker L (ed) Perspectives on adaptation in natural and artificial systems: essays in honor of John Holland. Oxford University Press, Oxford 56. Nutaro J, Zeigler BP (2007) On the stability and performance of discrete event methods for simulating continuous systems. J Comput Phys 227(1):797–819 57. Muzy A, Nutaro JJ (2005) Algorithms for efficient implementation of the DEVS & DSDEVS abstract simulators. In: 1st Open International Conference on Modeling and Simulation (OICMS), Clermont-Ferrand, France, pp 273–279 58. Muzy A: The activity paradigm for modeling and simulation of complex systems (in process) 59. Hofstadter D (2007) I am a strange loop. Basic Books, New York 60. Minsky M (1988) Society of mind. Simon & Schuster, New York 61. Goldman AI (2006) Simulating minds: the philosophy, psychology, and neuroscience of mindreading. Oxford University Press, USA 62. Denning PJ (2007) Computing is a natural science. Commun ACM 50(7):13–18 63. Luck M, McBurney P, Preist C (2003) Agent technology: enabling next generation computing: a roadmap for agent based computing. Agentlink, Liverpool 64. Miller JH, Page SE (2007) Complex adaptive systems: an introduction to computational models of social life. Princeton University Press, Princeton 65. Ferber J (1999) Multi-agent systems: an introduction to distributed artificial intelligence. Addison-Wesley, Harlow 66. Gasser L, Braganza C, Herman N (1987) MACE: an extensible testbed for distributed AI research. In: Distributed artificial intelligence – research notes in artificial intelligence, pp 119–152 67.
Agha G, Hewitt C (1985) Concurrent programming using actors: exploiting large-scale parallelism. In: Proceedings of the foundations of software technology and theoretical computer science, Fifth Conference, pp 19–41
68. Smith RG (1980) The contract net protocol: high-level communication and control in a distributed problem solver. IEEE Trans Comput 29(12):1104–1113 69. Shoham Y (1993) Agent-oriented programming. Artif Intell 60(1):51–92 70. Dahl OJ, Nygaard K (1967) SIMULA 67 common base definition. Norwegian Computing Center, Norway 71. Rao AS, Georgeff MP (1995) BDI-agents: from theory to practice. In: Proceedings of the first international conference on multiagent systems, San Francisco 72. Firby RJ (1992) Building symbolic primitives with continuous control routines. In: Proceedings of the first international conference on AI planning systems, College Park, MD, pp 62–29 73. Yilmaz L, Paspuletti S (2005) Toward a meta-level framework for agent-supported interoperation of defense simulations. J Def Model Simul 2(3):161–175
Books and Reviews

Alexiev V, Breu M, de Bruijn J, Fensel D, Lara R, Lausen H (2005) Information integration with ontologies. Wiley, New York
Carruthers P (2006) The architecture of the mind: massive modularity and the flexibility of thought. Oxford University Press, USA
Goldman AI (2006) Simulating minds: the philosophy, psychology, and neuroscience of mindreading. Oxford University Press, USA
Holland JH (1992) Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT Press, Cambridge
Zeigler BP (1990) Object oriented simulation with hierarchical, modular models: intelligent agents and endomorphic systems. Academic Press, Orlando
Zeigler BP, Hammonds P (2007) Modeling & simulation-based data engineering: introducing pragmatics into ontologies for net-centric information exchange. Academic Press, New York
Zeigler BP, Kim TG, Praehofer H (2000) Theory of modeling and simulation. Academic Press, New York
Bacterial Computing

MARTYN AMOS
Department of Computing and Mathematics, Manchester Metropolitan University, Manchester, UK

Article Outline

Glossary
Definition of the Subject
Introduction
Motivation for Bacterial Computing
The Logic of Life
Rewiring Genetic Circuitry
Successful Implementations
Future Directions
Bibliography

Glossary

DNA Deoxyribonucleic acid. Molecule that encodes the genetic information of cellular organisms.
Operon Set of functionally related genes with a common promoter ("on switch").
Plasmid Small circular DNA molecule used to transfer genes from one organism to another.
RNA Ribonucleic acid. Molecule similar to DNA, which helps in the conversion of genetic information to proteins.
Transcription Conversion of a genetic sequence into RNA.
Translation Conversion of an RNA sequence into an amino acid sequence (and, ultimately, a protein).

Definition of the Subject

Bacterial computing is a conceptual subset of synthetic biology, which is itself an emerging scientific discipline largely concerned with the engineering of biological systems. The goals of synthetic biology may be loosely partitioned into four sets: (1) to better understand the fundamental operation of the biological system being engineered, (2) to extend synthetic chemistry and create improved systems for the synthesis of molecules, (3) to investigate the "optimization" of existing biological systems for human purposes, and (4) to develop and apply rational engineering principles to the design and construction of biological systems. It is on the last two goals that we focus in the current article. The main benefits that may accrue from these studies are both theoretical and practical; the construction and study of synthetic biosystems could improve our quantitative understanding of the fundamental underlying processes, as well as suggesting plausible applications in fields as diverse as pharmaceutical synthesis and delivery, biosensing, tissue engineering, bionanotechnology, biomaterials, energy production and environmental remediation.

Introduction

Complex natural processes may often be described in terms of networks of computational components, such as Boolean logic gates or artificial neurons. The interaction of biological molecules and the flow of information controlling the development and behavior of organisms is particularly amenable to this approach, and these models are well established in the biological community. However, only relatively recently have papers appeared proposing the use of such systems to perform useful, human-defined tasks. For example, rather than merely using the network analogy as a convenient technique for clarifying our understanding of complex systems, it may now be possible to harness the power of such systems for the purposes of computation. Despite the relatively recent emergence of biological computing as a distinct research area, the link between biology and computer science is not a new one. Of course, for years biologists have used computers to store and analyze experimental data. Indeed, it is widely accepted that the huge advances of the Human Genome Project (as well as other genome projects) were only made possible by the powerful computational tools available. Bioinformatics has emerged as "the science of the 21st century", requiring the contributions of truly interdisciplinary scientists who are equally at home at the lab bench or writing software at the computer. However, the seeds of the relationship between biology and computer science were sown over fifty years ago, when the latter discipline did not even exist.
When, in the 17th century, the French mathematician and philosopher René Descartes declared to Queen Christina of Sweden that animals could be considered a class of machines, she challenged him to demonstrate how a clock could reproduce. Three centuries later, in 1951, with the publication of "The General and Logical Theory of Automata" [38], John von Neumann showed how a machine could indeed construct a copy of itself. Von Neumann believed that the behavior of natural organisms, although orders of magnitude more complex, was similar to that of the most intricate machines of the day. He believed that life was based on logic. We now begin to look at how this view of life may be used, not simply as a useful analogy, but as the practical foundation of a whole new engineering discipline.
Motivation for Bacterial Computing
Here we consider the main motivations behind recent work on bacterial computing (and, more broadly, synthetic biology). Before recombinant DNA technology made it possible to construct new genetic sequences, biologists were restricted to crudely "knocking out" individual genes from an organism's genome, and then assessing the damage caused (or otherwise). Such knock-outs gradually allowed them to piece together fragments of causality, but the process was very time-consuming and error-prone. Since the dawn of genetic engineering – with the ability to synthesize novel gene segments – biologists have been in a position to make much more finely-tuned modifications to their organism of choice, thus generating much more refined data. Other advances in biochemistry have also contributed, allowing scientists to investigate, for example, new types of genetic systems with twelve bases, rather than the traditional four [14]. Such creations have yielded valuable insights into the mechanics of mutation, adaptation and evolution. Researchers in synthetic biology are now extending their work beyond the synthesis of single genes, introducing whole new gene complexes into organisms. The objectives behind this work are both theoretical and practical. As Benner and Seymour argue [5], " . . . a synthetic goal forces scientists to cross uncharted ground to encounter and solve problems that are not easily encountered through [top-down] analysis. This drives the emergence of new paradigms ["world views"] in ways that analysis cannot easily do." Drew Endy agrees. "Worst-case scenario, it's a complementary approach to traditional discovery science" [18]. "Best-case scenario, we get systems that are simpler and easier to understand . . . " Or, to put it bluntly, "Let's build new biological systems – systems that are easier to understand because we made them that way" [31].
As well as shedding new light on the underlying biology, these novel systems may well have significant practical utility. Such new creations, according to Endy's "personal wish list", might include "generating biological machines that could clean up toxic waste, detect chemical weapons, perform simple computations, stalk cancer cells, lay down electronic circuits, synthesize complex compounds and even produce hydrogen from sunlight" [18]. In the next Section we begin to consider how this might be achieved, by first describing the underlying logic of genetic circuitry.

The Logic of Life
Twenty years after von Neumann's seminal paper, François Jacob and Jacques Monod identified specific natural
processes that could be viewed as behaving according to logical principles: "The logic of biological regulatory systems abides not by Hegelian laws but, like the workings of computers, by the propositional algebra of George Boole" [29]. This conclusion was drawn from earlier work of Jacob and Monod [30]. In addition, Jacob and Monod described the "lactose system" [20], which is one of the archetypal examples of a Boolean biosystem. We describe this system shortly, but first give a brief introduction to the operation of genes in general terms.

DNA as the Carrier of Genetic Information
The central dogma of molecular biology [9] is that DNA produces RNA, which in turn produces proteins. The basic "building blocks" of genetic information are known as genes. Each gene codes for one specific protein and may be turned on (expressed) or off (repressed) when required.

Transcription and Translation
We now describe the processes that determine the structure of a protein, and hence its function. Note that in what follows we assume the processes described occur in bacteria, rather than in higher organisms such as humans. For a full description of the structure of the DNA molecule, see the chapter on DNA computing. In order for a DNA sequence to be converted into a protein molecule, it must be read (transcribed) and the transcript converted (translated) into a protein. Transcription of a gene produces a messenger RNA (mRNA) copy, which can then be translated into a protein. Transcription proceeds as follows. The mRNA copy is synthesized by an enzyme known as RNA polymerase. In order to do this, the RNA polymerase must be able to recognize the specific region to be transcribed. This specificity requirement facilitates the regulation of genetic expression, thus preventing the production of unwanted proteins. Transcription begins at specific sites within the DNA sequence, known as promoters.
These promoters may be thought of as “markers”, or “signs”, in that they are not transcribed into RNA. The regions that are transcribed into RNA (and eventually translated into protein) are referred to as structural genes. The RNA polymerase recognizes the promoter, and transcription begins. In order for the RNA polymerase to begin transcription, the double helix must be opened so that the sequence of bases may be read. This opening involves the breaking of the hydrogen bonds between bases. The RNA polymerase then
moves along the DNA template strand in the 3′ → 5′ direction. As it does so, the polymerase creates an antiparallel mRNA chain (that is, the mRNA strand is the equivalent of the Watson–Crick complement of the template). However, there is one significant difference, in that RNA contains uracil instead of thymine. Thus, in mRNA terms, "U binds with A." The RNA polymerase moves along the DNA, the DNA re-coiling into its double-helix structure behind it, until it reaches the end of the region to be transcribed. The end of this region is marked by a terminator which, like the promoter, is not transcribed.

Genetic Regulation
Each step of the conversion, from stored information (DNA), through mRNA (messenger), to protein synthesis (effector), is itself catalyzed by other effector molecules. These may be enzymes or other factors that are required for a process to continue (for example, sugars). Consequently, a loop is formed, where products of one gene are required to produce further gene products, and may even influence that gene's own expression. This process was first described by Jacob and Monod in 1961 [20], a discovery that earned them a share of the 1965 Nobel Prize in Physiology or Medicine. Genes are composed of a number of distinct regions, which control and encode the desired product. These regions are generally of the form promoter–gene–terminator. Transcription may be regulated by effector molecules known as inducers and repressors, which interact with the promoter and increase or decrease the level of transcription. This allows effective control over the expression of proteins, avoiding the production of unnecessary compounds. It is important to note at this stage that, in reality, genetic regulation does not conform to the digital "on-off" model that is popularly portrayed; rather, it is continuous or analog in nature.

The Lac Operon
One of the most well-studied genetic systems is the lac operon.
An operon is a set of functionally related genes with a common promoter. An example of this is the lac operon, which contains three structural genes that allow E. coli to utilize the sugar lactose. When E. coli is grown on the sugar glucose, the product of the (separate, and unrelated to the lac operon) lacI gene represses the transcription of the lacZYA operon (i. e., the operon is turned off). However, if lactose is supplied together with glucose, a lactose by-product is produced which interacts with the repressor molecule, preventing
it from repressing the lacZYA operon. This de-repression does not itself initiate transcription, since it would be inefficient to utilize lactose if the more common sugar glucose were still available. The operon is positively regulated (i. e., "encouraged") by a different molecule, whose level increases as the amount of available glucose decreases. Therefore, if lactose were present as the sole carbon source, the lacI repression would be relaxed and the high "encouraging" levels would activate transcription, leading to the synthesis of the lacZYA gene products. Thus, the promoter is under the control of two sugars, and the lacZYA operon is only transcribed when lactose is present and glucose is absent. In essence, Jacob and Monod showed how a gene may be thought of (in very abstract terms) as a binary switch, and how the state of that switch might be affected by the presence or absence of certain molecules. Monod's point, made in his classic book Chance and Necessity and quoted above, was that the workings of biological systems operate not by Hegel's philosophical or metaphysical logic of understanding, but according to the formal, mathematically grounded logical system of George Boole. What Jacob and Monod found was that the transcription of a gene may be regulated by molecules known as inducers and repressors, which either increase or decrease the "volume" of a gene (corresponding to its level of transcription, which isn't always as clear-cut and binary as Monod's quote might suggest). These molecules interact with the promoter region of a gene, allowing the gene's level to be finely "tuned". The lac genes are so-called because, in the E. coli bacterium, they combine to produce a variety of proteins that allow the cell to metabolise the sugar lactose (which is most commonly found in milk, hence the derivation from the Latin, lact, meaning milk). For reasons of efficiency, these proteins should only be produced (i.
e., the genes be turned on) when lactose is present in the cell's environment. Making these proteins when lactose is absent would be a waste of the cell's resources, after all. However, a different sugar – glucose – will always be preferable to lactose, if the cell can get it, since glucose is an "easier" form of sugar to metabolise. So, the input to and output from the lac operon may be expressed as a truth table, with G and L standing for glucose and lactose (1 if present, 0 if absent), and O standing for the output of the operon (1 if on, 0 if off):

G L O
0 0 0
0 1 1
1 0 0
1 1 0
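Read as a logic gate, the behaviour in the table can be captured in a one-line function; the sketch below simply reproduces the truth table (an abstraction, of course – as noted earlier, real genetic regulation is analog rather than digital):

```python
def lac_operon(glucose, lactose):
    """Boolean abstraction of the lac operon: the structural genes are
    transcribed only when lactose is present (the lacI repressor is
    disabled) and glucose is absent (positive regulation is active)."""
    return lactose and not glucose

# Reproduce the truth table above as rows of (G, L, O)
table = [(g, l, int(bool(lac_operon(g, l)))) for g in (0, 1) for l in (0, 1)]
```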
The Boolean function that the lac operon therefore physically computes is (L AND (NOT G)), since it only outputs 1 if L=1 (lactose present) and G=0 (glucose is absent). By showing how one gene could affect the expression of another – just as a transistor feeds into the input of another and affects its state – Jacob and Monod laid the foundations for a new way of thinking about genes; not simply in terms of protein blueprints, but of circuits of interacting parts, or dynamic networks of interconnected switches and logic gates. This view of the genome is now well-established [22,23,27], but in the next Section we show how it might be used to guide the engineering of biological systems.

Rewiring Genetic Circuitry
A key difference between the wiring of a computer chip and the circuitry of the cell is that "electronics engineers know exactly how resistors and capacitors are wired to each other because they installed the wiring. But biologists often don't have a complete picture. They may not know which of thousands of genes and proteins are interacting at a given moment, making it hard to predict how circuits will behave inside cells" [12]. This makes the task of reengineering vastly more complex. Rather than trying to assimilate the huge amounts of data currently being generated by the various genome projects, synthetic biologists are taking a novel route – simplify and build. "They create models of genetic circuits, build the circuits, see if they work, and adjust them if they don't – learning about biology in the process. 'I view it as a reductionist approach to systems biology,' says biomedical engineer Jim Collins of Boston University" [12]. The field of systems biology has emerged in recent years as an alternative to the traditional reductionist way of doing science. Rather than simply focussing on a single level of description (such as individual proteins), researchers are now seeking to integrate information from many different layers of complexity.
By studying how different biological components interact, rather than simply looking at their structure, systems biologists are attempting to build models of systems from the bottom up. A model is simply an abstract description of how a system operates – for example, a set of equations that describe how a disease spreads throughout a population. The point of a model is to capture the essence of a system's operation, and it should therefore be as simple as possible. Crucially, a model should also be capable of making predictions, which can then be tested against reality using real data (for example, if infected people are placed in quarantine for a week, how does this affect the progress of the
disease in question, or what happens if I feed this signal into the chip?). The results obtained from these tests may then feed back into further refinements of the model, in a continuous cycle of improvement. When a model suggests a plausible structure for a synthetic genetic circuit, the next stage is to engineer it into the chosen organism, such as a bacterium. Once the structure of the DNA molecule was elucidated and the processes of transcription and translation were understood, molecular biologists were frustrated by the lack of suitable experimental techniques that would facilitate more detailed examination of the genetic material. However, in the early 1970s, several techniques were developed that allowed previously impossible experiments to be carried out (see [8,33]). These techniques quickly led to the first ever successful cloning experiments [19,25]. Cloning is generally defined as " . . . the production of multiple identical copies of a single gene, cell, virus, or organism." [35]. This is achieved as follows: a specific sequence (corresponding, perhaps, to a novel gene) is inserted in a circular DNA molecule, known as a plasmid, or vector, producing a recombinant DNA molecule. The vector acts as a vehicle, transporting the sequence into a host cell (usually a bacterium, such as E. coli). Cloning single genes is well-established, but is often done on an ad hoc basis. If biological computing is to succeed, it requires some degree of standardization, in the same way that computer manufacturers build different computers, but using a standard library of components. "Biobricks are the first example of standard biological parts," explains Drew Endy [21]. "You will be able to use biobricks to program systems that do whatever biological systems do." He continues.
“That way, if in the future, someone asks me to make an organism that, say, counts to 3,000 and then turns left, I can grab the parts I need off the shelf, hook them together and predict how they will perform” [15]. Each biobrick is a simple component, such as an AND gate, or an inverter (NOT). Put them together one after the other, and you have a NAND (NOT-AND) gate, which is all that is needed to build any Boolean circuit (an arbitrary circuit can be translated to an equivalent circuit that uses only NAND gates. It will be much bigger than the original, but it will compute the same function. Such considerations were important in the early stages of integrated circuits, when building different logic gates was difficult and expensive). Just as transistors can be used together to build logic gates, and these gates then combined into circuits, there exists a hierarchy of complexity with biobricks. At the bottom are the “parts”, which generally correspond to coding regions for proteins. Then, one level up, we have “devices”, which are built from parts – the oscillator of Elowitz and
Leibler, for example, could be constructed from three inverter devices chained together, since all an inverter does is "flip" its signal from 1 to 0 (or vice versa). This circuit would be an example of the biobricks at the top of the conceptual tree – "systems", which are collections of parts to do a significant task (like oscillating or counting). Tom Knight at MIT made the first six biobricks, each held in a plasmid ready for use. As we have stated, plasmids can be used to insert novel DNA sequences in the genomes of bacteria, which act as the "testbed" for the biobrick circuits. "Just pour the contents of one of these vials into a standard reagent solution, and the DNA will transform itself into a functional component of the bacteria," he explains [6]. Drew Endy was instrumental in developing this work further, one invaluable resource being the Registry of Standard Biological Parts [32], the definitive catalogue of new biological components. At the start of 2006, it contained 224 "basic" parts and 459 "composite" parts, with 270 parts "under construction". Biobricks are still at a relatively early stage, but "Eventually we'll be able to design and build in silico and go out and have things synthesized," says Jay Keasling, head of Lawrence Berkeley National Laboratory's new synthetic biology department [12].

Successful Implementations
Although important foundational work had been performed by Arkin and Ross as early as 1994 [2], the year 2000 was a particularly significant one for synthetic biology. In January, two foundational papers appeared back-to-back in the same issue of Nature. "Networks of interacting biomolecules carry out many essential functions in living cells, but the 'design principles' underlying the functioning of such intracellular networks remain poorly understood, despite intensive efforts including quantitative analysis of relatively simple systems.
Here we present a complementary approach to this problem: the design and construction of a synthetic network to implement a particular function" [11]. That was the introduction to a paper that Drew Endy would call "the high-water mark of a synthetic genetic circuit that does something" [12]. In the first of the two articles, Michael Elowitz and Stanislas Leibler (then both at Princeton) showed how to build microscopic "Christmas tree lights" using bacteria.

Synthetic Oscillator
In physics, an oscillator is a system that produces a regular, periodic "output". Familiar examples include a pendulum, a vibrating string, or a lighthouse. Linking several oscillators together in some way gives rise to synchrony – for example, heart cells repeatedly firing in unison, or millions of fireflies blinking on and off, seemingly as one [36]. Leibler actually had two articles published in the same high-impact issue of Nature. The other was a short communication, co-authored with Naama Barkai – also at Princeton, but in the Department of Physics [3]. In their paper, titled "Circadian clocks limited by noise", Leibler and Barkai showed how a simple model of biochemical networks could oscillate reliably, even in the presence of noise. They argued that such oscillations (which might, for example, control the internal circadian clock that tells us when to wake up and when to be tired) are based on networks of genetic regulation. They built a simulation of a simple regulatory network using a Monte Carlo algorithm. They found that, however they perturbed the system, it still oscillated reliably, although, at the time, their results existed only in silico. The other paper by Leibler was much more applied, in the sense that they had constructed a biological circuit [11]. Elowitz and Leibler had succeeded in constructing an artificial genetic oscillator in cells, using a synthetic network of repressors. They called this construction the repressilator. Rather than investigating existing oscillator networks, Elowitz and Leibler decided to build one entirely from first principles. They chose three repressor–promoter pairs that had already been sequenced and characterized, and first built a mathematical model in software. By running the sets of equations, they identified from their simulation results certain molecular characteristics of the components that gave rise to so-called limit cycle oscillations; those that are robust to perturbations. This information from the model's results led Elowitz and Leibler to select strong promoter molecules and repressor molecules that would rapidly decay.
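Their design wired three genes into a ring, each repressing the next. As a purely illustrative caricature (the real repressilator is continuous and was modeled with differential equations, not synchronous Boolean updates), even a digital version of such a ring refuses to settle:

```python
def step(state):
    """One synchronous update of a three-gene repression ring:
    A is repressed by C, B by A, and C by B, so each gene is ON
    only if its repressor was OFF at the previous step."""
    a, b, c = state
    return (not c, not a, not b)

state = (True, False, False)
states = [state]
for _ in range(12):
    state = step(state)
    states.append(state)
# The ring never reaches a fixed point: it cycles through six
# distinct states, a digital echo of the repressilator's oscillation.
```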
In order to implement the oscillation, they chose three genes, each of which affected one of the others by repressing it, or turning it off. For the sake of illustration, we call the genes A, B and C. The product of gene A turns off (represses) gene B. The absence of B (which represses C) allows C to turn on. C is chosen such that it turns gene A off again, and the three genes loop continuously in a "daisy chain" effect, turning on and off in a repetitive cycle. However, some form of reporting is necessary in order to confirm that the oscillation is occurring as planned. Green fluorescent protein (GFP) is a molecule occurring naturally in the jellyfish Aequorea victoria. Biologists find it invaluable because it has one interesting property – when placed under ultraviolet light, it glows. Biologists quickly sequenced the gene responsible for producing this protein, as they realized that it could have
many applications as a reporter. By inserting the gene into an organism, you have a ready-made "status light" – when placed into bacteria, they glow brightly if the gene is turned on, and look normal if it's turned off. We can think of it in terms of a Boolean circuit – if the circuit outputs the value 1, GFP is produced to turn on the light. If the value is 0, the GFP gene isn't expressed, no protein is produced, and the light stays off. Elowitz and Leibler set up their gene network so that the GFP gene would be expressed whenever gene C was turned off – when it was turned on, the GFP would gradually decay and fade away. They synthesized the appropriate DNA sequences and inserted them into a plasmid, eventually yielding a population of bacteria that blinked on and off in a repetitive cycle, like miniature lighthouses. Moreover, and perhaps most significantly, the period between flashes was longer than the time taken for the cells to divide, showing that the state of the system had been passed on during reproduction.

Synthetic Toggle Switch
Rather than modeling an existing circuit and then altering it, Elowitz and Leibler had taken a "bottom up" approach to learning about how gene circuits operate. The other notable paper to appear in that issue was written by Timothy Gardner, Charles Cantor and Jim Collins, all of Boston University in the US. In 2000, Gardner, Collins and Cantor observed that genetic switching (such as that observed in the lambda phage [34]) had not yet been "demonstrated in networks of non-specialized regulatory components" [13]. That is to say, at that point nobody had been able to construct a switch out of genes that hadn't already been "designed" by evolution to perform that specific task. The team had a similar philosophy to that of Elowitz and Leibler, in that their main motivation was being able to test theories about the fundamental behaviour of gene regulatory networks.
"Owing to the difficulty of testing their predictions," they explained, "these theories have not, in general, been verified experimentally. Here we have integrated theory and experiment by constructing and testing a synthetic, bistable [two-state] gene circuit based on the predictions of a simple mathematical model." "We were looking for the equivalent of a light switch to flip processes on or off in the cell," explained Gardner [10]. "Then I realized a way to do this with genes instead of with electric circuits." The team chose two genes that were mutually inhibitory – that is, each produced a molecule that would turn the other off. One important thing to bear in mind is that the system didn't have a single input. Although the team acknowledged that bistability might be possible – in theory – using only a single promoter that regulated itself, they anticipated possible problems with robustness and experimental tunability if they used that approach. Instead, they decided to use a system whereby each "side" of the switch could be "pressed" by a different stimulus – the addition of a chemical on one, and a change in temperature on the other. Things were set up so that if the system was in the state induced by the chemical, it would stay in that state until the temperature was changed, and would only change back again if the chemical was reintroduced. Importantly, these stimuli did not have to be applied continuously – a "short sharp" burst was enough to cause the switch to flip over. As with the other experiment, Gardner and his colleagues used GFP as the system state reporter, so that the cells glowed in one state, and looked "normal" in the other. In line with the bottom-up approach, they first created a mathematical model of the system and made some predictions about how it would behave inside a cell. Within a year, Gardner had spliced the appropriate genes into the bacteria, and he was able to flip them – at will – from one state to the other. As McAdams and Arkin observed, synthetic "one way" switches had been created in the mid-1980s, but "this is perhaps the first engineered design exploiting bistability to produce a switch with capability of reversibly switching between two . . . stable states" [28]. The potential applications of such a bacterial switch were clear. As they state in the conclusion to their article, "As a practical device, the toggle switch . . . may find applications in gene therapy and biotechnology." They also borrowed from the language of computer programming, using an analogy between their construction and the short "applets" written in the Java language, which now allow us to download and run programs in our web browser.
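The bistability at the heart of the toggle can be seen in the two-equation model that Gardner and colleagues analyzed, in which each repressor inhibits the other's synthesis. The parameter values below are invented for illustration; the published analysis derives the conditions under which such a system is genuinely bistable:

```python
def settle(u, v, a1=10.0, a2=10.0, beta=2.0, gamma=2.0, dt=0.05, steps=2000):
    """Forward-Euler integration of the toggle-switch equations:
       du/dt = a1 / (1 + v**beta) - u    (repressor 1, inhibited by 2)
       dv/dt = a2 / (1 + u**gamma) - v   (repressor 2, inhibited by 1)"""
    for _ in range(steps):
        du = a1 / (1.0 + v ** beta) - u
        dv = a2 / (1.0 + u ** gamma) - v
        u, v = u + dt * du, v + dt * dv
    return u, v

# A transient "push" to either side leaves the switch latched there:
high_u = settle(5.0, 1.0)   # settles with u high, v low
high_v = settle(1.0, 5.0)   # settles with u low, v high
```

The two calls mimic the chemical and temperature pulses described above: once pushed past the midpoint, the system stays latched in the chosen state with no continuing stimulus.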
"Finally, as a cellular memory unit, the toggle forms the basis for 'genetic applets' – self-contained, programmable, synthetic gene circuits for the control of cell function".

Engineered Communication
Towards the end of his life, Alan Turing did some foundational work on pattern formation in nature, in an attempt to explain how zebras get their striped coats or leopards their spots. The study of morphogenesis (from the Greek, morphe – shape, and genesis – creation; "amorphous" therefore means "without shape or structure") is concerned with how cells split to assume new roles and communicate with one another to form very precise shapes, such as tissues and organs. Turing postulated that the diffusion of chemical signals both within and between cells is
the main driving force behind such complex pattern formation [37]. Although Turing's work was mainly concerned with the processes occurring amongst cells inside a developing embryo, it is clear that chemical signalling also goes on between bacteria. Ron Weiss of Princeton University was particularly interested in Vibrio fischeri, a bacterium that has a symbiotic relationship with a variety of aquatic creatures, including the Hawaiian squid. This relationship is due mainly to the fact that the bacteria exhibit bioluminescence – they generate a chemical known as luciferase (coded by the lux gene), a version of which is also found in fireflies, and which causes them to glow when gathered together in numbers. Cells within the primitive light organs of the squid draw in bacteria from the seawater and encourage them to grow. Crucially, once enough bacteria are packed into the light organ they produce a signal to tell the squid cells to stop attracting their colleagues, and only then do they begin to glow. The cells get a safe environment in which to grow, protected from competition, and the squid has a light source by which to navigate and catch prey. The mechanism by which the Vibrio "know" when to start glowing is known as quorum sensing, since there have to be sufficient "members" present for luminescence to occur. The bacteria secrete an autoinducer molecule, known as VAI (Vibrio Auto Inducer), which diffuses through the cell wall. The lux gene (which generates the glowing chemical) needs to be activated (turned on) by a particular protein – which attracts the attention of the polymerase – but the protein can only do this with help from the VAI. Its particular 3D structure is such that it can't fit tightly onto the gene unless it's been slightly bent out of shape. When there's enough VAI present, it locks onto the protein and alters its conformation, so that it can turn on the gene.
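This all-or-nothing response to autoinducer concentration can be caricatured as a steep dose–response curve. In the sketch below a Hill function stands in for VAI binding to and reshaping the activator protein; the threshold and steepness values are invented purely for illustration:

```python
def lux_activation(vai, threshold=50.0, steepness=4.0):
    """Fraction of maximal lux expression at a given VAI concentration:
    a Hill function that is near 0 below the threshold and near 1 above it."""
    return vai ** steepness / (threshold ** steepness + vai ** steepness)

# Below the quorum the colony stays dark; above it, the cells glow.
sparse = lux_activation(10.0)   # well below the threshold
dense = lux_activation(200.0)   # well above the threshold
```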
Thus, the concentration of VAI is absolutely crucial; once a critical threshold has been passed, the bacteria "know" that there are enough of them present, and they begin to glow. Weiss realized that this quorum-based cell-to-cell communication mechanism could provide a powerful framework for the construction of bacterial devices – imagine, for example, a tube of solution containing engineered bacteria that can be added to a sample of seawater, causing it to glow only if the concentration of a particular pollutant exceeds a certain threshold. Crucially, as we will see shortly, it also allows the possibility of generating precise "complex patterned" development. Weiss set up two colonies of E. coli, one containing "senders", and the other "receivers". The idea was that the senders would generate a chemical signal made up of VAI,
which could diffuse across a gap and then be picked up by the receivers. Once a strong enough signal was being communicated, the receivers would glow using GFP to signal that it had been picked up. Weiss cloned the appropriate gene sequences (corresponding to a type of biobrick) into his bacteria, placed colonies of sender and receiver cells on a plate, and the receivers started to glow in acknowledgment.

Synthetic Circuit Evolution

In late 2002, Weiss and his colleagues published another paper, this time describing how rigorous engineering principles may be brought to bear on the problem of designing and building entirely new genetic circuitry. The motivation was clear – “biological circuit engineers will have to confront their inability to predict the precise behavior of even the most simple synthetic networks, a serious shortcoming and challenge for the design and construction of more sophisticated genetic circuitry in the future” [39]. Together with colleagues Yohei Yokobayashi and Frances Arnold, Weiss proposed a two-stage strategy: first, design a circuit from the bottom up, as Elowitz and others had before, and clone it into bacteria. Such circuits are highly unlikely to work first time, “because the behavior of biological components inside living cells is highly context-dependent, the actual circuit performance will likely differ from the design predictions, often resulting in a poorly performing or nonfunctional circuit.” Rather than simply abandoning their design, Weiss and his team decided to then tune the circuit inside the cell itself, by applying the principles of evolution. By inducing mutations in the DNA that they had just introduced, they were able to slightly modify the behaviour of the circuit that it represented. Of course, many of these changes would be catastrophic, giving even worse performance than before, but, occasionally, they observed a minor improvement.
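In computational terms, this mutate-and-select cycle is a simple hill-climbing loop: propose a random variant, keep it only if it performs better. A schematic sketch follows, in which the fitness function is an arbitrary stand-in for measured circuit performance (a real experiment would score fluorescence, not evaluate a formula).

```python
import random

# Schematic 'design then mutate' loop: random mutation followed by
# selection of the fittest variant. The fitness function is a made-up
# stand-in for experimentally measured circuit performance.

def fitness(params):
    # Hypothetical target: the circuit performs best when every
    # parameter equals 1.0 (fitness 0 is the optimum).
    return -sum((p - 1.0) ** 2 for p in params)

def mutate(params, scale=0.1, rng=random):
    """Perturb each parameter slightly, mimicking induced mutations."""
    return [p + rng.gauss(0, scale) for p in params]

def evolve(params, generations=500, seed=42):
    rng = random.Random(seed)
    best, best_fit = params, fitness(params)
    for _ in range(generations):
        candidate = mutate(best, rng=rng)
        cand_fit = fitness(candidate)
        if cand_fit > best_fit:          # keep only improvements
            best, best_fit = candidate, cand_fit
    return best, best_fit

final, score = evolve([0.0, 0.0])        # start from a 'broken' circuit
```

Starting from a poorly performing parameter setting, repeated rounds of mutation and selection steadily improve the score, just as the team's repeated cycles did for their circuits.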
In that case, they kept the “winning” bacteria, and subjected them to another round of mutation, in a repeated cycle. In a microcosmic version of Darwinian evolution, mutation followed by selection of the fittest took an initially unpromising pool of broken circuits and transformed them into winners. “Ron is utilizing the power of evolution to design networks in ways so that they perform exactly the way you want them to,” observed Jim Collins [16]. In a commentary article in the same issue of the journal, Jeff Hasty called this approach “design then mutate” [17]. The team showed how a circuit made up of three genetic gates could be fine-tuned in vivo to give the correct performance, and they concluded that “the approach we have outlined should serve as a robust and widely applicable route to obtaining circuits, as well as new genetic devices, that function inside living cells.”

Pattern Formation

The next topic studied by Weiss and his team was the problem of space – specifically, how to get a population of bacteria to cover a surface with a specific density. This facility could be useful when designing bacterial biosensors – devices that detect chemicals in the environment and produce a response. By controlling the density of the microbial components, it might be possible to tune the sensitivity of the overall device. More importantly, the ability of cells to control their own density would provide a useful “self-destruct” mechanism were these genetically modified bugs ever to be released into the environment for “real world” applications. In “Programmed population control by cell-cell communication and regulated killing” [40], Weiss and his team built on their previous results to demonstrate the ability to keep the density of an E. coli population artificially low – that is, below the “natural” density that could be supported by the available nutrients. They designed a genetic circuit that caused the bacteria to generate a different Vibrio signalling molecule, only this time, instead of making the cells glow, a sufficient concentration would flip a switch inside the cell, turning on a killer gene encoding a protein that was toxic in sufficient quantities. The system behaved exactly as predicted by their mathematical model. The culture grew at an exponential rate (that is, doubling every time step) for seven hours, before hitting the defined density threshold. At that point the population dropped sharply, as countless cells expired, until the population settled at a steady density significantly (ten times) lower than that of an unmodified “control” colony.
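The reported dynamics (exponential growth up to a density threshold, then a sharp cull as the killer gene switches on, settling to a lowered steady level) can be caricatured in a few lines of discrete-time simulation. All constants here are illustrative assumptions, not the parameters of the published model.

```python
# Discrete-time caricature of the population-control circuit:
# the culture doubles each step until the density-dependent signal
# crosses a threshold, which switches on the killer gene and culls
# the population. Growth rate, threshold and death fraction are
# illustrative assumptions.

def step(n, growth=2.0, threshold=1e6, death=0.7):
    n = n * growth                      # doubling per time step
    if n > threshold:                   # quorum signal trips the switch
        n = n * (1.0 - death)           # killer gene culls the population
    return n

def simulate(n0=1.0, steps=40):
    trajectory = [n0]
    for _ in range(steps):
        trajectory.append(step(trajectory[-1]))
    return trajectory

traj = simulate()
```

The trajectory grows exponentially at first, then oscillates around a plateau well below what unconstrained doubling would reach, qualitatively matching the behaviour the team observed.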
The team concluded that “The population-control circuit lays the foundations for using cell-cell communication to programme interactions among bacterial colonies, allowing the concept of communication-regulated growth and death to be extended to engineering synthetic ecosystems” [40]. The next stage was to programme cells to form specific spatial patterns in the dish. As we have already mentioned briefly, pattern formation is one characteristic of multicellular systems. This is generally achieved using some form of chemical signalling, combined with a differential response – that is, different cells, although genetically identical, may “read” the environmental signals and react in different ways, depending on their internal state. For example, one cell might be “hungry” and choose to move towards a food source, while an identical cell might choose
to remain in the same spot, since it has adequate levels of energy. The team used a variant of the sender-receiver model, only this time adding a “distance detection” component to the receiver circuit. The senders were placed in the center of the dish, and the receivers distributed uniformly across the surface. The receivers were constructed so that they could measure the strength of the signal being “beamed” from the senders, a signal which decayed over distance (a little like a radio station gradually breaking up as you move out of the reception area). The cells were engineered so that only those that were either “near” to the senders or “far” from the senders would generate a response (those in the middle region were instructed to keep quiet). These cells are genetically identical, and are uniformly distributed over the surface – the differential response comes in the way that they assess the strength of the signal, and make a decision on whether or not to respond. The power of the system was increased further by making the near cells glow green, and those far away glow red (using a different fluorescent protein). When the team set the system running, they observed the formation of a “dartboard” pattern, with the “bullseye” being the colony of senders (instructed to glow cyan, or light blue), which was surrounded by a green ring, which in turn was surrounded by a red ring. By placing three sender colonies in a triangle, they were also able to obtain a green heart-shaped pattern, formed by the intersection of three green circles, as well as other patterns, determined solely by the initial configuration of senders [4].

Bacterial Camera

Rather than generating light, a different team decided to use bacteria to detect light – in the process, building the world’s first microbial camera. By engineering a dense bed of E.
coli, a team of students led by Chris Voigt at Berkeley developed light-sensitive “film” capable of storing images at a resolution of 100 megapixels per square inch. E. coli are not normally sensitive to light, so the group took genes coding for photoreceptors from blue-green algae, and spliced them into their bugs [24]. When light was shone on the cells, it turned on a genetic switch that caused a chemical inside them to permanently darken, thus generating a black “pixel”. By projecting an image onto a plate of bacteria, the team were able to obtain several monochrome images, including the Nature logo and the face of team member Andrew Ellington. Nobel Laureate Sir Harry Kroto, discoverer of “buckyballs”, called the team’s camera an “extremely exciting advance” [26], going on to say that “I have always thought that the first major nanotechnology advances would involve some sort of chemical modification of biology.”

Future Directions

Weiss and his team suggest that “the integration of such systems into higher-level organisms and with different cell functions will have practical applications in three-dimensional tissue engineering, biosensing, and biomaterial fabrication” [4]. One possible use for such a system might lie in the detection of bio-weapons – spread a culture of bacteria over a surface, and, with the appropriate control circuit, they will be able to accurately pinpoint the location of any pathogens. Programmed cells could eventually replace artificial tissues, or even organs – current attempts to build such constructions in the laboratory rely on cells arranging themselves around an artificial scaffold. Controlled cellular structure formation could do away with the need for such support – “The way we’re doing tissue engineering, right now, . . . is very unnatural,” argues Weiss. “Clearly cells make scaffolds themselves. If we’re able to program them to do that, we might be able to embed them in the site of injury and have them figure out for themselves what the pattern should be” [7]. In addition to building structures, others are considering engineering cells to act as miniature drug delivery systems – fighting disease or infection from the inside. Adam Arkin and Chris Voigt are currently investigating the use of modified E. coli to battle against cancer tumours, while Jay Keasling and co-workers at Berkeley are looking at engineering circuits into the same bacteria to persuade them to generate a potent antimalarial drug that is normally found in small amounts in wormwood plants. Clearly, bacterial computing/synthetic biology is still at a relatively early stage in its development, although the field is growing at a tremendous pace.
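Returning to the “dartboard” experiment described under Pattern Formation, the band-detect logic amounts to two thresholds applied to a distance-decaying signal: respond if the signal is strong (near) or weak (far), and stay quiet in between. The decay law and threshold values in this sketch are illustrative assumptions.

```python
# Band-detector sketch for the 'dartboard' pattern: receivers respond
# only when the sender signal is strong ('near', green) or weak
# ('far', red); the middle band stays dark. The decay law and the
# two thresholds are illustrative assumptions.

def signal(distance, strength=100.0):
    """Signal strength decaying with distance from the sender colony."""
    return strength / (1.0 + distance ** 2)

def colour(distance, high=20.0, low=2.0):
    s = signal(distance)
    if s >= high:
        return "green"   # near ring
    if s <= low:
        return "red"     # far ring
    return "dark"        # middle band keeps quiet

rings = [colour(d) for d in (1, 4, 10)]
```

Sweeping the distance from the sender outwards produces a green band, then a dark band, then a red band, i.e. the concentric rings the team observed.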
It could be argued, with some justification, that the dominant science of the new millennium may well prove to be at the intersection of biology and computing. As biologist Roger Brent argues, “I think that synthetic biology . . . will be as important to the 21st century as [the] ability to manipulate bits was to the 20th” [1].

Bibliography

Primary Literature
1. Anon (2004) Roger Brent and the alpha project. ACM Ubiquity 5(3)
2. Arkin A, Ross J (1994) Computational functions in biochemical reaction networks. Biophys J 67:560–578
3. Barkai N, Leibler S (2000) Circadian clocks limited by noise. Nature 403:267–268
4. Basu S, Gerchman Y, Collins CH, Arnold FH, Weiss R (2005) A synthetic multicellular system for programmed pattern formation. Nature 434:1130–1134
5. Benner SA, Sismour M (2005) Synthetic biology. Nat Rev Genet 6:533–543
6. Brown C (2004) BioBricks to help reverse-engineer life. EE Times, June 11
7. Brown S (2005) Command performances. San Diego Union-Tribune, December 14
8. Brown TA (1990) Gene cloning: an introduction, 2nd edn. Chapman and Hall, London
9. Crick F (1970) Central dogma of molecular biology. Nature 227:561–563
10. Eisenberg A (2000) Unlike viruses, bacteria find a welcome in the world of computing. New York Times, June 1
11. Elowitz M, Leibler S (2000) A synthetic oscillatory network of transcriptional regulators. Nature 403:335–338
12. Ferber D (2004) Synthetic biology: microbes made to order. Science 303(5655):158–161
13. Gardner T, Cantor R, Collins J (2000) Construction of a genetic toggle switch in Escherichia coli. Nature 403:339–342
14. Geyer CR, Battersby TR, Benner SA (2003) Nucleobase pairing in expanded Watson–Crick-like genetic information systems. Structure 11:1485–1498
15. Gibbs WW (2004) Synthetic life. Scientific American, April 26
16. Gravitz L (2004) 10 emerging technologies that will change your world. MIT Technology Review, February
17. Hasty J (2002) Design then mutate. Proc Natl Acad Sci 99(26):16516–16518
18. Hopkin K (2004) Life: the next generation. The Scientist 18(19):56
19. Jackson DA, Symons RH, Berg P (1972) Biochemical method for inserting new genetic information into DNA of simian virus 40: circular SV40 DNA molecules containing lambda phage genes and the galactose operon of Escherichia coli. Proc Natl Acad Sci 69:2904–2909
20. Jacob F, Monod J (1961) Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol 3:318–356
21. Jha A (2005) From the cells up. The Guardian, March 10
22. Kauffman S (1971) Gene regulation networks: a theory for their global structure and behaviors. Curr Top Dev Biol 6:145–182
23. Kauffman SA (1993) The origins of order: self-organization and selection in evolution. Oxford University Press, New York
24. Levskaya A, Chevalier AA, Tabor JJ, Simpson ZB, Lavery LA, Levy M, Davidson EA, Scouras A, Ellington AD, Marcotte EM, Voigt CA (2005) Engineering Escherichia coli to see light. Nature 438:441–442
25. Lobban PE, Sutton CA (1973) Enzymatic end-to-end joining of DNA molecules. J Mol Biol 78(3):453–471
26. Marks P (2005) For ultrasharp pictures, use a living camera. New Scientist, November 26, p 28
27. McAdams HH, Shapiro L (1995) Circuit simulation of genetic networks. Science 269(5224):650–656
28. McAdams HH, Arkin A (2000) Genetic regulatory circuits: advances toward a genetic circuit engineering discipline. Curr Biol 10:318–320
29. Monod J (1970) Chance and Necessity. Penguin, London
30. Monod J, Changeux JP, Jacob F (1963) Allosteric proteins and cellular control systems. J Mol Biol 6:306–329
31. Morton O (2005) Life, reinvented. Wired 13(1), January
32. Registry of Standard Biological Parts. http://parts.mit.edu/
33. Old R, Primrose S (1994) Principles of gene manipulation: an introduction to genetic engineering, 5th edn. Blackwell, Boston
34. Ptashne M (2004) A genetic switch: phage lambda revisited, 3rd edn. Cold Spring Harbor Laboratory Press, Woodbury
35. Roberts L, Murrell C (eds) (1998) An introduction to genetic engineering. Department of Biological Sciences, University of Warwick
36. Strogatz S (2003) Sync: the emerging science of spontaneous order. Penguin, London
37. Turing AM (1952) The chemical basis of morphogenesis. Phil Trans Roy Soc B 237:37–72
38. von Neumann J (1951) The general and logical theory of automata. In: Cerebral mechanisms in behavior. Wiley, New York
39. Yokobayashi Y, Weiss R, Arnold FH (2002) Directed evolution of a genetic circuit. Proc Natl Acad Sci 99(26):16587–16591
40. You L, Cox RS III, Weiss R, Arnold FH (2004) Programmed population control by cell-cell communication and regulated killing. Nature 428:868–871
Books and Reviews
Alon U (2006) An introduction to systems biology: design principles of biological circuits. Chapman and Hall/CRC
Amos M (ed) (2004) Cellular computing. Series in Systems Biology. Oxford University Press
Amos M (2006) Genesis machines: the new science of biocomputing. Atlantic Books, London
Benner SA (2003) Synthetic biology: act natural. Nature 421:118
Endy D (2005) Foundations for engineering biology. Nature 436:449–453
Kobayashi H, Kaern M, Araki M, Chung K, Gardner TS, Cantor CR, Collins JJ (2004) Programmable cells: interfacing natural and engineered gene networks. Proc Natl Acad Sci 101(22):8414–8419
Sayler GS, Simpson ML, Cox CD (2004) Emerging foundations: nano-engineering and bio-microelectronics for environmental biotechnology. Curr Opin Microbiol 7:267–273
Bayesian Games: Games with Incomplete Information

Shmuel Zamir
Center for the Study of Rationality, Hebrew University, Jerusalem, Israel
Article Outline

Glossary
Definition of the Subject
Introduction
Harsanyi’s Model: The Notion of Type
Aumann’s Model
Harsanyi’s Model and Hierarchies of Beliefs
The Universal Belief Space
Belief Subspaces
Consistent Beliefs and Common Priors
Bayesian Games and Bayesian Equilibrium
Bayesian Equilibrium and Correlated Equilibrium
Concluding Remarks and Future Directions
Acknowledgments
Bibliography

Glossary

Bayesian game An interactive decision situation involving several decision makers (players) in which each player has beliefs about (i.e. assigns a probability distribution to) the payoff-relevant parameters and the beliefs of the other players.
State of nature Payoff-relevant data of the game, such as payoff functions, the value of a random variable, etc. It is convenient to think of a state of nature as a full description of a ‘game-form’ (actions and payoff functions).
Type Also known as state of mind; a full description of a player’s beliefs (about the state of nature), beliefs about the beliefs of the other players, beliefs about the beliefs about his beliefs, etc. ad infinitum.
State of the world A specification of the state of nature (payoff-relevant parameters) and the players’ types (beliefs of all levels). That is, a state of the world is a state of nature and a list of the states of mind of all players.
Common prior and consistent beliefs The beliefs of players in a game with incomplete information are said to be consistent if they are derived from the same probability distribution (the common prior) by conditioning on each player’s private information. In other words, if the beliefs are consistent, the only source of differences in beliefs is difference in information.
Bayesian equilibrium A Nash equilibrium of a Bayesian game: a list of behaviors and beliefs such that each player is doing his best to maximize his payoff, according to his beliefs about the behavior of the other players.
Correlated equilibrium A Nash equilibrium in an extension of the game in which there is a chance move, and each player has only partial information about its outcome.

Definition of the Subject

Bayesian games (also known as games with incomplete information) are models of interactive decision situations in which the decision makers (players) have only partial information about the data of the game and about the other players. Clearly this is typically the situation we are facing, and hence the importance of the subject: the basic underlying assumption of classical game theory, according to which the data of the game is common knowledge (CK) among the players, is too strong and often implausible in real situations. The importance of Bayesian games is in providing the tools and methodology to relax this implausible assumption, to enable modeling of the overwhelming majority of real-life situations in which players have only partial information about the payoff-relevant data. As a result of the interactive nature of the situation, this methodology turns out to be rather deep and sophisticated, both conceptually and mathematically: adopting the classical Bayesian approach of statistics, we encounter the need to deal with an infinite hierarchy of beliefs: what does each player believe that the other player believes about what he believes . . . is the actual payoff associated with a certain outcome? It is not surprising that this methodological difficulty was a major obstacle in the development of the theory, and this article is largely devoted to explaining and resolving this methodological difficulty.
Introduction

A game is a mathematical model for an interactive decision situation involving several decision makers (players) whose decisions affect each other. A basic, often implicit, assumption is that the data of the game, which we call the state of nature, are common knowledge (CK) among the players. In particular the actions available to the players and the payoff functions are CK. This is a rather strong assumption that says that every player knows all actions and payoff functions of all players, every player knows that all other players know all actions and payoff functions, every player knows that every player knows that every player knows . . . etc. ad infinitum. Bayesian games (also known as
games with incomplete information), which are the subject of this article, are models of interactive decision situations in which each player has only partial information about the payoff-relevant parameters of the given situation. Adopting the Bayesian approach, we assume that a player who has only partial knowledge about the state of nature has some beliefs, namely a prior distribution, about the parameters which he does not know or is uncertain about. However, unlike in a statistical problem which involves a single decision maker, this is not enough in an interactive situation: as the decisions of other players are relevant, so are their beliefs, since they affect their decisions. Thus a player must have beliefs about the beliefs of other players. For the same reason, a player needs beliefs about the beliefs of other players about his beliefs, and so on. This interactive reasoning about beliefs leads unavoidably to infinite hierarchies of beliefs, which look rather intractable. The natural emergence of hierarchies of beliefs is illustrated in the following example:

Example 1 Two players, P1 and P2, play a $2 \times 2$ game whose payoffs depend on an unknown state of nature $s \in \{1, 2\}$. Player P1’s actions are $\{T, B\}$, player P2’s actions are $\{L, R\}$, and the payoffs are given in the following matrices:
Assume that the belief (prior) of P1 about the event $\{s = 1\}$ is $p$ and the belief of P2 about the same event is $q$. The best action of P1 depends both on his prior and on the action of P2, and similarly for the best action of P2. This is given in the following tables:

Now, since the optimal action of P1 depends not only on his belief $p$ but also on the, unknown to him, action of P2, which depends on his belief $q$, player P1 must therefore have beliefs about $q$. These are his second-level beliefs, namely beliefs about beliefs. But then, since this is relevant and unknown to P2, he must have beliefs about that, which will be third-level beliefs of P2, and so on. The whole infinite hierarchies of beliefs of the two players pop out naturally in the analysis of this simple two-person game of incomplete information. The objective of this article is to model this kind of situation. Most of the effort will be devoted to the modeling of the mutual beliefs structure, and only then do we add the underlying game which, together with the beliefs structure, defines a Bayesian game, for which we define the notion of Bayesian equilibrium.

Harsanyi’s Model: The Notion of Type
As suggested by our introductory example, the straightforward way to describe the mutual beliefs structure in a situation of incomplete information is to specify explicitly the whole hierarchies of beliefs of the players, that is, the beliefs of each player about the unknown parameters of the game, each player’s beliefs about the other players’ beliefs about these parameters, each player’s beliefs about the other players’ beliefs about his beliefs about the parameters, and so on ad infinitum. This may be called the explicit approach; it is in fact feasible, and was explored and developed at a later stage of the theory (see [18,5,6,7]). We will come back to it when we discuss the universal belief space. However, for obvious reasons, the explicit approach is mathematically rather cumbersome and hardly manageable. Indeed this was a major obstacle to the development of the theory of games with incomplete information at its early stages. The breakthrough was provided by John Harsanyi [11] in a seminal work that earned him the Nobel Prize some thirty years later. While Harsanyi actually formulated the problem verbally, in an explicit way, he suggested a solution that ‘avoided’ the difficulty of having to deal with infinite hierarchies of beliefs, by providing a much more workable implicit, encapsulated model, which we present now.
The key notion in Harsanyi’s model is that of type. Each player can be of several types, where a type is to be thought of as a full description of the player’s beliefs about the state of nature (the data of the game), beliefs about the beliefs of other players about the state of nature and about his own beliefs, etc. One may think of a player’s type as his state of mind; a specific configuration of his brain that contains an answer to any question regarding beliefs about the state of nature and about the types of the other players. Note that this implies self-reference (of a type to itself through the types of other players), which is unavoidable in an interactive decision situation. A Harsanyi game of incomplete information consists of the following ingredients (to simplify notation, assume all sets to be finite):

$I$ – the players’ set.
$S$ – the set of states of nature.
$T_i$ – the type set of player $i \in I$. Let $T = \times_{i \in I} T_i$ denote the type set, that is, the set of type profiles.
$Y \subseteq S \times T$ – a set of states of the world.
$p \in \Delta(Y)$ – a probability distribution on $Y$, called the common prior. (For a set $A$, we denote the set of probability distributions on $A$ by $\Delta(A)$.)

Remark A state of the world $\omega$ thus consists of a state of nature and a list of the types of the players. We denote it as

$$\omega = (s(\omega), t_1(\omega), \ldots, t_n(\omega)) .$$

We think of the state of nature as a full description of the game, which we call a game-form. So, if it is a game in strategic form, we write the state of nature at state of the world $\omega$ as:

$$s(\omega) = (I, (A_i(\omega))_{i \in I}, (u_i(\cdot, \omega))_{i \in I}) .$$

The payoff functions $u_i$ depend only on the state of nature and not on the types. That is, for all $i \in I$:

$$s(\omega) = s(\omega') \Rightarrow u_i(\cdot, \omega) = u_i(\cdot, \omega') .$$

The game with incomplete information is played as follows: (1) A chance move chooses $\omega = (s(\omega), t_1(\omega), \ldots, t_n(\omega)) \in Y$ using the probability distribution $p$. (2) Each player is told his chosen type $t_i(\omega)$ (but not the chosen state of nature $s(\omega)$ and not the other players’ types $t_{-i}(\omega) = (t_j(\omega))_{j \neq i}$). (3) The players choose actions simultaneously: player $i$ chooses $a_i \in A_i(\omega)$ and receives a payoff $u_i(a, \omega)$, where $a = (a_1, \ldots, a_n)$ is the vector of chosen actions and $\omega$ is the state of the world chosen by the chance move.

Remark The set $A_i(\omega)$ of actions available to player $i$ in state of the world $\omega$ must be known to him. Since his only information is his type $t_i(\omega)$, we must impose that $A_i(\omega)$ is $T_i$-measurable, i.e., $t_i(\omega) = t_i(\omega') \Rightarrow A_i(\omega) = A_i(\omega')$.

Note that if $s(\omega)$ were commonly known among the players, it would be a regular game in strategic form. We use the term ‘game-form’ to indicate that the players have only partial information about $s(\omega)$. The players do not know which $s(\omega)$ is being played. In other words, in the extensive form game of Harsanyi, the game-forms $(s(\omega))_{\omega \in Y}$ are not subgames, since they are interconnected by information sets: player $i$ does not know which $s(\omega)$ is being played since he does not know $\omega$; he knows only his own type $t_i(\omega)$. An important application of Harsanyi’s model is made in auction theory, as an auction is a clear situation of incomplete information. For example, in a closed private-value auction of a single indivisible object, the type of a player is his private value for the object, which is typically known to him and not to the other players. We come back to this in the section entitled “Examples of Bayesian Equilibria”.

Aumann’s Model

A frequently used model of incomplete information was given by Aumann [2].

Definition 2 An Aumann model of incomplete information is $(I, Y, (\pi_i)_{i \in I}, P)$ where: $I$ is the players’ set; $Y$ is a (finite) set whose elements are called states of the world; for $i \in I$, $\pi_i$ is a partition of $Y$; and $P$ is a probability distribution on $Y$, also called the common prior.

In this model a state of the world $\omega \in Y$ is chosen according to the probability distribution $P$, and each player $i$ is informed of $\pi_i(\omega)$, the element of his partition that contains the chosen state of the world $\omega$. This is the informational structure, which becomes a game with incomplete information if we add a mapping $s : Y \to S$. The state of nature $s(\omega)$ is the game-form corresponding to the state of the world $\omega$ (with the requirement that the action sets $A_i(\omega)$ are $\pi_i$-measurable). It is readily seen that Aumann’s model is a Harsanyi model in which the type set $T_i$ of player $i$ is the set of
his partition elements, i.e., $T_i = \{\pi_i(\omega) \mid \omega \in Y\}$, and the common prior on $Y$ is $P$. Conversely, any Harsanyi model is an Aumann model in which the partitions are those defined by the types, i.e., $\pi_i(\omega) = \{\omega' \in Y \mid t_i(\omega') = t_i(\omega)\}$.

Harsanyi’s Model and Hierarchies of Beliefs

As our starting point in modeling incomplete information situations was the appearance of hierarchies of beliefs, one may ask how the Harsanyi (or Aumann) model is related to hierarchies of beliefs and how it captures this unavoidable feature of incomplete information situations. The main observation towards answering this question is the following:

Proposition 3 Any state of the world in Aumann’s model or any type profile $t \in T$ in Harsanyi’s model defines (uniquely) a hierarchy of mutual beliefs among the players.

Let us illustrate the idea of the proof by the following example:

Example Consider a Harsanyi model with two players, I and II, each of whom can be of two types: $T_I = \{I_1, I_2\}$ and $T_{II} = \{II_1, II_2\}$, and thus $T = \{(I_1, II_1), (I_1, II_2), (I_2, II_1), (I_2, II_2)\}$. The probability $p$ on types is given by $p(I_1, II_1) = 3/12$, $p(I_1, II_2) = 3/12$, $p(I_2, II_1) = 4/12$ and $p(I_2, II_2) = 2/12$.
Denote the corresponding states of nature by $a = s(I_1, II_1)$, $b = s(I_1, II_2)$, $c = s(I_2, II_1)$ and $d = s(I_2, II_2)$. These are the states of nature about which there is incomplete information. The game in extensive form:
Assume that the state of nature is a. What are the belief hierarchies of the players?
First-level beliefs are obtained by each player from $p$, by conditioning on his type:

$I_1$: With probability $\frac{1}{2}$ the state is $a$ and with probability $\frac{1}{2}$ the state is $b$.
$I_2$: With probability $\frac{2}{3}$ the state is $c$ and with probability $\frac{1}{3}$ the state is $d$.
$II_1$: With probability $\frac{3}{7}$ the state is $a$ and with probability $\frac{4}{7}$ the state is $c$.
$II_2$: With probability $\frac{3}{5}$ the state is $b$ and with probability $\frac{2}{5}$ the state is $d$.

Second-level beliefs (using shorthand notation for the above beliefs: $\frac{1}{2}a + \frac{1}{2}b$, etc.):

$I_1$: With probability $\frac{1}{2}$, player II believes $\frac{3}{7}a + \frac{4}{7}c$, and with probability $\frac{1}{2}$, player II believes $\frac{3}{5}b + \frac{2}{5}d$.
$I_2$: With probability $\frac{2}{3}$, player II believes $\frac{3}{7}a + \frac{4}{7}c$, and with probability $\frac{1}{3}$, player II believes $\frac{3}{5}b + \frac{2}{5}d$.
$II_1$: With probability $\frac{3}{7}$, player I believes $\frac{1}{2}a + \frac{1}{2}b$, and with probability $\frac{4}{7}$, player I believes $\frac{2}{3}c + \frac{1}{3}d$.
$II_2$: With probability $\frac{3}{5}$, player I believes $\frac{1}{2}a + \frac{1}{2}b$, and with probability $\frac{2}{5}$, player I believes $\frac{2}{3}c + \frac{1}{3}d$.

Third-level beliefs:

$I_1$: With probability $\frac{1}{2}$, player II believes that: “With probability $\frac{3}{7}$, player I believes $\frac{1}{2}a + \frac{1}{2}b$, and with probability $\frac{4}{7}$, player I believes $\frac{2}{3}c + \frac{1}{3}d$”. And with probability $\frac{1}{2}$, player II believes that: “With probability $\frac{3}{5}$, player I believes $\frac{1}{2}a + \frac{1}{2}b$, and with probability $\frac{2}{5}$, player I believes $\frac{2}{3}c + \frac{1}{3}d$”.

And so on and so on. The idea is very simple and powerful: since each player of a given type has a probability distribution (beliefs) both about the types of the other players
and about the set S of states of nature, the hierarchies of beliefs are constructed inductively: if the kth-level beliefs (about S) are defined for each type, then the beliefs about types generate the (k+1)th level of beliefs. Thus the compact model of Harsanyi does capture the whole hierarchies of beliefs, and it is rather tractable. The natural question is whether this model can be used for all hierarchies of beliefs. In other words, given any hierarchy of mutual beliefs of a set of players I about a set S of states of nature, can it be represented by a Harsanyi game? This was answered by Mertens and Zamir [18], who constructed the universal belief space; that is, given a set S of states of nature and a finite set I of players, they looked for the space Ω of all possible hierarchies of mutual beliefs about S among the players in I. This construction is outlined in the next section.

The Universal Belief Space

Given a finite set of players I = {1, ..., n} and a set S of states of nature, which are assumed to be compact, we first identify the mathematical spaces in which the hierarchies of beliefs lie. Recall that Δ(A) denotes the set of probability distributions on A, and define inductively the sequence of spaces (X_k)_{k=1}^∞ by

X_1 = Δ(S),   (1)
X_{k+1} = X_k × Δ(S × X_k^{n-1}),   for k = 1, 2, ....   (2)
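To make the inductive construction concrete, the short sketch below generates the first- and second-level beliefs of the four-state example above (states a, b, c, d) by conditioning on types. The prior p = (1/4, 1/4, 1/3, 1/6) is not displayed in the text; it is reconstructed here from the stated conditional beliefs, which pin it down uniquely.

```python
from fractions import Fraction as F

# Prior over the states of nature (reconstructed from the stated beliefs).
p = {'a': F(1, 4), 'b': F(1, 4), 'c': F(1, 3), 'd': F(1, 6)}

# Each type of each player is identified with the set of states it considers
# possible (player I is told {a,b} vs {c,d}; player II is told {a,c} vs {b,d}).
types = {'I1': {'a', 'b'}, 'I2': {'c', 'd'},
         'II1': {'a', 'c'}, 'II2': {'b', 'd'}}

def first_level(t):
    """First-level belief of type t: the prior conditioned on t's event."""
    tot = sum(p[s] for s in types[t])
    return {s: p[s] / tot for s in sorted(types[t])}

def about_types(t, others):
    """Belief of type t about which of the types in `others` the other
    player has: the next level of the hierarchy, again by conditioning."""
    tot = sum(p[s] for s in types[t])
    return {u: sum(p[s] for s in types[t] & types[u]) / tot for u in others}

# Matches the hierarchy listed above:
assert first_level('I1') == {'a': F(1, 2), 'b': F(1, 2)}
assert first_level('II1') == {'a': F(3, 7), 'c': F(4, 7)}
assert about_types('I1', ['II1', 'II2']) == {'II1': F(1, 2), 'II2': F(1, 2)}
assert about_types('II2', ['I1', 'I2']) == {'I1': F(3, 5), 'I2': F(2, 5)}
```

Iterating `about_types` through the type sets reproduces the second- and third-level statements listed above; the entire infinite hierarchy is generated by this single conditioning step, which is exactly the point of Harsanyi's compact model.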
Any probability distribution on S can be a first-level belief and is thus in X_1. A second-level belief is a joint probability distribution on S and the first-level beliefs of the other (n-1) players. This is an element of Δ(S × X_1^{n-1}), and therefore a two-level hierarchy is an element of the product space X_1 × Δ(S × X_1^{n-1}), and so on for any level. Note that at each level, a belief is a joint probability distribution on S and the previous-level beliefs, allowing for correlation between the two. In dealing with these probability spaces we need some mathematical structure. More specifically, we make use of the weak topology:

Definition 4 A sequence (F_n)_{n=1}^∞ of probability measures (on Ω) converges in the weak topology to the probability measure F if and only if lim_{n→∞} ∫_Ω g(ω) dF_n = ∫_Ω g(ω) dF for all bounded and continuous functions g: Ω → R.

It follows from the compactness of S that all spaces defined by (1)-(2) are compact in the weak topology. However, for k > 1, not every element of X_k represents a coherent hierarchy of beliefs of level k. For example, if (μ_1, μ_2) ∈ X_2, where μ_1 ∈ Δ(S) = X_1 and μ_2 ∈ Δ(S × X_1^{n-1}), then for this to describe meaningful
beliefs of a player, the marginal distribution of μ_2 on S must coincide with μ_1. More generally, any event A in the space of k-level beliefs must have the same (marginal) probability in any higher-level beliefs. Furthermore, not only must each player's beliefs be coherent, but he also considers only coherent beliefs of the other players (only those are in the support of his beliefs). Expressing this coherency condition formally yields a selection T_k ⊆ X_k such that T_1 = X_1 = Δ(S). It is proved that the projection of T_{k+1} on X_k is T_k (that is, any coherent k-level hierarchy can be extended to a coherent (k+1)-level hierarchy) and that all the sets T_k are compact. Therefore, the projective limit, T = lim_{←k} T_k, is well defined and nonempty.¹

(¹ The projective limit (also known as the inverse limit) of the sequence (T_k)_{k=1}^∞ is the space T of all sequences (μ_1, μ_2, ...) ∈ ∏_{k=1}^∞ T_k which satisfy: for any k ∈ N, there is a probability distribution ν_k ∈ Δ(S × T_k^{n-1}) such that μ_{k+1} = (μ_k, ν_k).)

Definition 5 The universal type space T is the projective limit of the spaces (T_k)_{k=1}^∞.

That is, T is the set of all coherent infinite hierarchies of beliefs regarding S of a player in I. It does not depend on i since, by construction, it contains all possible hierarchies of beliefs regarding S, and it is therefore the same for all players. It is determined only by S and the number of players n.

Proposition 6 The universal type space T is compact and satisfies

T ≈ Δ(S × T^{n-1}).   (3)

The sign ≈ in (3) is to be read as an isomorphism, and Proposition 6 says that a type of a player can be identified with a joint probability distribution on the state of nature and the types of the other players. The implicit equation (3) reflects the self-reference and circularity of the notion of type: the type of a player is his beliefs about the state of nature and about all the beliefs of the other players, in particular their beliefs about his own beliefs.

Definition 7 The universal belief space (UBS) is the space Ω defined by:

Ω = S × T^n.   (4)

An element of Ω is called a state of the world. Thus a state of the world is ω = (s(ω); t_1(ω), t_2(ω), ..., t_n(ω)), with s(ω) ∈ S and t_i(ω) ∈ T for all i in I. This is the specification of the state of nature and the types of all players. The universal belief space Ω is what we looked for: the set of all incomplete information and mutual belief configurations of a set of n players regarding the state of
nature. In particular, as we will see later, all Harsanyi and Aumann models are embedded in Ω, but it also includes belief configurations that cannot be modeled as Harsanyi games. As we noted before, the UBS is determined only by the set of states of nature S and the set of players I, so it should be denoted Ω(S, I). For the sake of simplicity we shall omit the arguments and write Ω, unless we wish to emphasize the underlying sets S and I.

The execution of the construction of the UBS according to the outline above involves some non-trivial mathematics, as can be seen in Mertens and Zamir [18]. The reason is that even with a finite number of states of nature, the space of first-level beliefs is a continuum, the second level is the space of probability distributions on a continuum, and the third level is the space of probability distributions on the space of probability distributions on a continuum. This requires some structure for these spaces: for a (Borel) measurable event E, let B_i^p(E) be the event "player i of type t_i believes that the probability of E is at least p", that is,

B_i^p(E) = {ω ∈ Ω | t_i(ω)(E) ≥ p}.

Since this is the object of beliefs of players other than i (beliefs of j ≠ i about the beliefs of i), this set must also be measurable. Mertens and Zamir used the weak topology, which is the minimal topology in which the event B_i^p(E) is (Borel) measurable for any (Borel) measurable event E. In this topology, if A is a compact set then Δ(A), the space of all probability distributions on A, is also compact. However, the hierarchic construction can also be carried out with stronger topologies on Δ(A) (see [9,12,17]). Heifetz and Samet [14] worked out the construction of the universal belief space without topology, using only a measurable structure (which is implied by the assumption that the beliefs of the players are measurable). All these explicit constructions of the belief space belong to what is called the semantic approach. Aumann [6] provided another construction of a belief system using the syntactic approach, based on sentences and logical formulas specifying explicitly what each player believes about the state of nature, about the beliefs of the other players about the state of nature, and so on. For a detailed construction see Aumann [6], Heifetz and Mongin [13], and Meier [16]. For a comparison of the syntactic and semantic approaches see Aumann and Heifetz [7].

Belief Subspaces

In constructing the universal belief space we implicitly assumed that each player knows his own type, since we specified only his beliefs about the state of nature and about the beliefs of the other players. In view of that, and since
by (3) a type of player i is a probability distribution on S × T^{I\{i}}, we can view a type t_i also as a probability distribution on Ω = S × T^I in which the marginal distribution on T_i (player i's own coordinate) is a degenerate delta function at t_i; that is, if ω = (s(ω); t_1(ω), t_2(ω), ..., t_n(ω)), then for all i in I, t_i(ω) ∈ Δ(Ω) and

t_i(ω)[t_i = t_i(ω)] = 1.   (5)

In particular it follows that, if Supp(t_i) denotes the support of t_i, then

ω′ ∈ Supp(t_i(ω)) ⟹ t_i(ω′) = t_i(ω).   (6)
Let P_i(ω) = Supp(t_i(ω)) ⊆ Ω. This defines a possibility correspondence: at state of the world ω, player i does not consider as possible any point not in P_i(ω). By (6),

P_i(ω) ∩ P_i(ω′) ≠ ∅ ⟹ P_i(ω) = P_i(ω′).

However, unlike in Aumann's model, P_i does not define a partition of Ω, since it is possible that ω ∉ P_i(ω), and hence the union ∪_{ω∈Ω} P_i(ω) may be strictly smaller than Ω (see Example 7). If ω ∈ P_i(ω) ⊆ Y holds for all ω in some subspace Y ⊆ Ω, then (P_i(ω))_{ω∈Y} is a partition of Y.

As we said, the universal belief space includes all possible beliefs and mutual belief structures over the state of nature. However, in a specific situation of incomplete information, it may well be that only part of Ω is relevant for describing the situation. If the state of the world is ω, then clearly all states of the world in ∪_{i∈I} P_i(ω) are relevant; but this is not all, because if ω′ ∈ P_i(ω), then all states in P_j(ω′), for j ≠ i, are also relevant in the considerations of player i. This observation motivates the following definition:

Definition 8 A belief subspace (BL-subspace) is a closed subset Y of Ω which satisfies:

P_i(ω) ⊆ Y   for all i ∈ I and all ω ∈ Y.   (7)

A belief subspace is minimal if it has no proper subset which is also a belief subspace. Given ω ∈ Ω, the belief subspace at ω, denoted Y(ω), is the minimal subspace containing ω.

Since Ω is a BL-subspace, Y(ω) is well defined for all ω ∈ Ω. A BL-subspace is a closed subset of Ω which is also closed under the beliefs of the players: for any ω ∈ Y, it contains all states of the world which are relevant to the situation. If ω′ ∉ Y, then no player believes that ω′ is possible, no player believes that any other player believes that ω′ is possible, no player believes that any player believes that any player believes that ω′ is possible, and so on.
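For a finite piece of the belief space, the minimal belief subspace Y(ω) can be computed as a closure: start from {ω} and keep adding every state that some player considers possible. A minimal sketch (the data layout — states as strings, each P_i given as a dict of supports — is an assumption for illustration; the two-state data anticipates Example 7 below):

```python
def minimal_belief_subspace(omega, possibility):
    """possibility[i][w] = P_i(w) = Supp(t_i(w)).
    Returns Y(omega): the closure of {omega} under all P_i, i.e. the
    smallest set containing omega and satisfying condition (7)."""
    Y, frontier = {omega}, [omega]
    while frontier:
        w = frontier.pop()
        for P_i in possibility.values():
            for w2 in P_i[w] - Y:   # new states some player considers possible
                Y.add(w2)
                frontier.append(w2)
    return Y

# Data anticipating Example 7: player I considers both states possible
# everywhere, while player II considers only w2 possible.
possibility = {'I': {'w1': {'w1', 'w2'}, 'w2': {'w1', 'w2'}},
               'II': {'w1': {'w2'}, 'w2': {'w2'}}}

assert minimal_belief_subspace('w1', possibility) == {'w1', 'w2'}
# Even starting from w2, player I's beliefs pull w1 back in:
assert minimal_belief_subspace('w2', possibility) == {'w1', 'w2'}
```

Note that here the union of player II's possibility sets over Y is {w2}, strictly smaller than Y — exactly the phenomenon discussed in Example 7.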
Remark 9 The subspace Y(ω) is meant to be the minimal subspace which is belief-closed by all players at the state ω. Thus, a natural definition would be: Ỹ(ω) is the minimal BL-subspace containing P_i(ω) for all i in I. However, if for every player the state ω is not in P_i(ω), then ω ∉ Ỹ(ω). Yet, even if it is not in the belief closure of the players, the real state ω is still relevant (at least for the analyst) because it determines the true state of nature; that is, it determines the true payoffs of the game. This is the reason for adding the true state of the world ω, even though "it may not be in the mind of the players".

It follows from (5), (6) and (7) that a BL-subspace Y has the following structure:

Proposition 10 A closed subset Y of the universal belief space Ω is a BL-subspace if and only if it satisfies the following conditions:
1. For any ω = (s(ω); t_1(ω), t_2(ω), ..., t_n(ω)) ∈ Y and for all i, the type t_i(ω) is a probability distribution on Y.
2. For any ω and ω′ in Y, ω′ ∈ Supp(t_i(ω)) ⟹ t_i(ω′) = t_i(ω).

In fact, condition 1 follows directly from Definition 8, while condition 2 follows from the general property of the UBS expressed in (6).

Given a BL-subspace Y in Ω(S, I), we denote by T_i the type set of player i,

T_i = {t_i(ω) | ω ∈ Y},

and note that, unlike in the UBS, in a specific model Y the type sets are typically not the same for all players. The analogue of (4) is

Y ⊆ S × T_1 × ... × T_n.

A BL-subspace is a model of incomplete information about the state of nature. As we saw in Harsanyi's model, in any model of incomplete information about a fixed set S of states of nature, involving the same set of players I, a state of the world ω defines (encapsulates) an infinite hierarchy of mutual beliefs of the players I on S. By the universality of the belief space Ω(S, I), there is ω′ ∈ Ω(S, I) with the same hierarchy of beliefs as that of ω. The mapping of each ω to its corresponding ω′ in Ω(S, I) is called a belief morphism, as it preserves the belief structure. Mertens and Zamir [18] proved that the space Ω(S, I) is universal in the sense that any model Y of incomplete information of the set of players I about the state of nature s ∈ S can be embedded in Ω(S, I) via a belief morphism φ: Y → Ω(S, I) so that φ(Y) is a belief subspace in Ω(S, I). In the following examples we give the BL-subspaces representing some known models.
Examples of Belief Subspaces

Example 1 (A game with complete information) If the state of nature is s_0 ∈ S, then in the universal belief space Ω(S, I) the game is described by a BL-subspace Y consisting of a single state of the world:

Y = {ω}, where ω = (s_0; [1ω], ..., [1ω]).

Here [1ω] is the only possible probability distribution on Y, namely the trivial distribution supported by ω. In particular, the state of nature s_0 (i.e., the data of the game) is commonly known.

Example 2 (Commonly known uncertainty about the state of nature) Assume that the players' set is I = {1, ..., n} and there are k states of nature representing, say, k possible n-dimensional payoff matrices G_1, ..., G_k. At the beginning of the game, the payoff matrix is chosen by a chance move according to the probability vector p = (p_1, ..., p_k), which is commonly known by the players, but no player receives any information about the outcome of the chance move. The set of states of nature is S = {G_1, ..., G_k}. The situation described above is embedded in the UBS Ω(S, I) as the following BL-subspace Y consisting of k states of the world (denoting a distribution in Δ(Y) by [p_1 ω_1, ..., p_k ω_k]):

Y = {ω_1, ..., ω_k}
ω_1 = (G_1; [p_1 ω_1, ..., p_k ω_k], ..., [p_1 ω_1, ..., p_k ω_k])
ω_2 = (G_2; [p_1 ω_1, ..., p_k ω_k], ..., [p_1 ω_1, ..., p_k ω_k])
...
ω_k = (G_k; [p_1 ω_1, ..., p_k ω_k], ..., [p_1 ω_1, ..., p_k ω_k])

There is a single type, [p_1 ω_1, ..., p_k ω_k], which is the same for all players. It should be emphasized that the type is a distribution on Y (and not just on the states of nature), which implies that the beliefs [p_1 G_1, ..., p_k G_k] about the state of nature are commonly known by the players.

Example 3 (Two players with incomplete information on one side) There are two players, I = {I, II}, and two possible payoff matrices, S = {G_1, G_2}. The payoff matrix is chosen at random with P(s = G_1) = p, known to both players. The outcome of this chance move is known only to player I. Aumann and Maschler have studied such situations in which the chosen matrix is played repeatedly and the issue is how the informed player strategically uses his information (see Aumann and Maschler [8] and its references). This situation is presented in the UBS by the following BL-subspace:

Y = {ω_1, ω_2}
ω_1 = (G_1; [1ω_1], [pω_1, (1-p)ω_2])
ω_2 = (G_2; [1ω_2], [pω_1, (1-p)ω_2])

Player I has two possible types: I_1 = [1ω_1] when he is informed of G_1, and I_2 = [1ω_2] when he is informed of G_2. Player II has only one type, II = [pω_1, (1-p)ω_2]. We describe this situation in the following extensive-form-like figure, in which the oval forms describe the types of the players at the various vertices.

Example 4 (Incomplete information about the other players' information) In the next example, taken from Sorin and Zamir [23], one of two players always knows the state of nature, but he may be uncertain whether the other player knows it. There are two players, I = {I, II}, and two possible payoff matrices, S = {G_1, G_2}. It is commonly known to both players that the payoff matrix is chosen at random by a toss of a fair coin: P(s = G_1) = 1/2. The outcome of this chance move is told to player I. In addition, if (and only if) the matrix G_1 was chosen, another fair coin toss determines whether to inform player II which payoff matrix was chosen. In any case, player I is not told the result of the second coin toss. This situation is described by the following belief space with three states of the world:

Y = {ω_1, ω_2, ω_3}
ω_1 = (G_1; [1/2 ω_1, 1/2 ω_2], [1ω_1])
ω_2 = (G_1; [1/2 ω_1, 1/2 ω_2], [1/3 ω_2, 2/3 ω_3])
ω_3 = (G_2; [1ω_3], [1/3 ω_2, 2/3 ω_3])

Each player has two types and the type sets are:

T_I = {I_1, I_2} = {[1/2 ω_1, 1/2 ω_2], [1ω_3]}
T_II = {II_1, II_2} = {[1ω_1], [1/3 ω_2, 2/3 ω_3]}

Example 5 (Incomplete information on two sides: A Harsanyi game) In this example, the set of players is again I = {I, II} and the set of states of nature is S = {s_11, s_12, s_21, s_22}. In the universal belief space Ω(S, I), consider the following BL-subspace consisting of four states of the world:
Y = {ω_11, ω_12, ω_21, ω_22}
ω_11 = (s_11; [3/7 ω_11, 4/7 ω_12], [3/5 ω_11, 2/5 ω_21])
ω_12 = (s_12; [3/7 ω_11, 4/7 ω_12], [4/5 ω_12, 1/5 ω_22])
ω_21 = (s_21; [2/3 ω_21, 1/3 ω_22], [3/5 ω_11, 2/5 ω_21])
ω_22 = (s_22; [2/3 ω_21, 1/3 ω_22], [4/5 ω_12, 1/5 ω_22])

Again, each player has two types and the type sets are:

T_I = {I_1, I_2} = {[3/7 ω_11, 4/7 ω_12], [2/3 ω_21, 1/3 ω_22]}
T_II = {II_1, II_2} = {[3/5 ω_11, 2/5 ω_21], [4/5 ω_12, 1/5 ω_22]}

The type of a player determines his beliefs about the type of the other player. For example, player I of type I_1 assigns probability 3/7 to the state of the world ω_11, in which player II is of type II_1, and probability 4/7 to the state of the world ω_12, in which player II is of type II_2. Therefore, the beliefs of type I_1 about the types of player II are P(II_1) = 3/7, P(II_2) = 4/7. The mutual beliefs about each other's type are given in the following tables:

Beliefs of player I:
        II_1   II_2
I_1     3/7    4/7
I_2     2/3    1/3

Beliefs of player II:
        II_1   II_2
I_1     3/5    4/5
I_2     2/5    1/5

Note that in all our examples of belief subspaces, condition (6) is satisfied: the support of a player's type contains only states of the world in which he has that type. The game with incomplete information is described in the following figure.

These are precisely the beliefs of Bayesian players if the pair of types (t_I, t_II) ∈ T = T_I × T_II is chosen according to the prior probability distribution p below, and each player is then informed of his own type:

p:      II_1    II_2
I_1     3/10    4/10
I_2     2/10    1/10
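This consistency can be verified mechanically: taking the prior with entries p(I_1, II_1) = 3/10, p(I_1, II_2) = 4/10, p(I_2, II_1) = 2/10, p(I_2, II_2) = 1/10 (the unique values determined by the belief tables; cf. Proposition 13 below), conditioning on each type reproduces every belief above. A sketch in exact arithmetic:

```python
from fractions import Fraction as F

# Prior on type pairs (t_I, t_II), as determined by the belief tables.
p = {('I1', 'II1'): F(3, 10), ('I1', 'II2'): F(4, 10),
     ('I2', 'II1'): F(2, 10), ('I2', 'II2'): F(1, 10)}

def cond_row(tI):
    """Player I of type tI: conditional distribution over player II's types."""
    tot = sum(v for (i, j), v in p.items() if i == tI)
    return {j: v / tot for (i, j), v in p.items() if i == tI}

def cond_col(tII):
    """Player II of type tII: conditional distribution over player I's types."""
    tot = sum(v for (i, j), v in p.items() if j == tII)
    return {i: v / tot for (i, j), v in p.items() if j == tII}

# Conditioning reproduces the two belief tables above:
assert cond_row('I1') == {'II1': F(3, 7), 'II2': F(4, 7)}
assert cond_row('I2') == {'II1': F(2, 3), 'II2': F(1, 3)}
assert cond_col('II1') == {'I1': F(3, 5), 'I2': F(2, 5)}
assert cond_col('II2') == {'I1': F(4, 5), 'I2': F(1, 5)}
```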
In other words, this BL-subspace is a Harsanyi game with type sets T_I, T_II and the prior probability distribution p on the types. Actually, as there is a one-to-one mapping between the type set T and the set S of states of nature, the situation is generated by a chance move choosing the state of nature s_ij ∈ S according to the distribution p (that is, P(s_ij) = P(I_i, II_j) for i and j in {1, 2}), after which player I is informed of i and player II is informed of j. As a matter of fact, all the BL-subspaces in the previous examples can also be written as Harsanyi games, mostly in a trivial way.

Example 6 (Inconsistent beliefs) In the same universal belief space Ω(S, I) of the previous example, consider now another BL-subspace Ỹ which differs from Y only by changing the type II_1 of player II from [3/5 ω_11, 2/5 ω_21] to [1/2 ω_11, 1/2 ω_21]; that is,

Ỹ = {ω_11, ω_12, ω_21, ω_22}
ω_11 = (s_11; [3/7 ω_11, 4/7 ω_12], [1/2 ω_11, 1/2 ω_21])
ω_12 = (s_12; [3/7 ω_11, 4/7 ω_12], [4/5 ω_12, 1/5 ω_22])
ω_21 = (s_21; [2/3 ω_21, 1/3 ω_22], [1/2 ω_11, 1/2 ω_21])
ω_22 = (s_22; [2/3 ω_21, 1/3 ω_22], [4/5 ω_12, 1/5 ω_22])

with type sets:

T_I = {I_1, I_2} = {[3/7 ω_11, 4/7 ω_12], [2/3 ω_21, 1/3 ω_22]}
T_II = {II_1, II_2} = {[1/2 ω_11, 1/2 ω_21], [4/5 ω_12, 1/5 ω_22]}

Now, the mutual beliefs about each other's type are:

Beliefs of player I:
        II_1   II_2
I_1     3/7    4/7
I_2     2/3    1/3

Beliefs of player II:
        II_1   II_2
I_1     1/2    4/5
I_2     1/2    1/5

Unlike in the previous example, these beliefs cannot be derived from a prior distribution p. According to Harsanyi, these are inconsistent beliefs. A BL-subspace with inconsistent beliefs cannot be described as a Harsanyi or Aumann model; it cannot be described as a game in extensive form.
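That no common prior exists here can be checked with a two-line computation: for a 2×2 type space, a full-support prior p reproducing the beliefs exists iff the odds product p_11·p_22/(p_12·p_21) computed from player I's beliefs equals the one computed from player II's beliefs. A sketch of that test (the function name is illustrative):

```python
from fractions import Fraction as F

def common_prior_exists(a1, a2, b1, b2):
    """2x2 type space: a1, a2 = probability that I's types I1, I2 assign
    to II1; b1, b2 = probability that II's types II1, II2 assign to I1.
    A full-support common prior exists iff both players induce the same
    odds product p11*p22 / (p12*p21)."""
    from_I = (a1 / (1 - a1)) * ((1 - a2) / a2)    # via p11/p12 and p21/p22
    from_II = (b1 / (1 - b1)) * ((1 - b2) / b2)   # via p11/p21 and p12/p22
    return from_I == from_II

# Example 5 (consistent): both sides give odds product 3/8.
assert common_prior_exists(F(3, 7), F(2, 3), F(3, 5), F(4, 5))
# Example 6: changing II1's belief to 1/2 gives 3/8 vs 1/4 — no common prior.
assert not common_prior_exists(F(3, 7), F(2, 3), F(1, 2), F(4, 5))
```

This odds-product criterion is exactly the measure-zero condition derived for the general 2×2 case in Example 8 below.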
Example 7 ("Highly inconsistent" beliefs) In the previous example, even though the beliefs of the players were inconsistent in all states of the world, the true state was considered possible by all players (for example, in the state ω_12 player I assigns to this state probability 4/7 and player II assigns to it probability 4/5). As was emphasized before, the UBS contains all belief configurations, including highly inconsistent or wrong beliefs, as the following example shows. The belief subspace of the two players I and II concerning the state of nature, which can be s_1 or s_2, is given by:

Y = {ω_1, ω_2}
ω_1 = (s_1; [1/2 ω_1, 1/2 ω_2], [1ω_2])
ω_2 = (s_2; [1/2 ω_1, 1/2 ω_2], [1ω_2])

In the state of the world ω_1, the state of nature is s_1; player I assigns equal probabilities to s_1 and s_2, but player II assigns probability 1 to s_2. In other words, player II does not consider as possible the true state of the world (and also the true state of nature): ω_1 ∉ P_II(ω_1), and consequently ∪_{ω∈Y} P_II(ω) = {ω_2}, which is strictly smaller than Y. By the definition of a belief subspace and condition (6), this also implies that ∪_{ω∈Ω} P_II(ω) is strictly smaller than Ω (as it does not contain ω_1).

Consistent Beliefs and Common Priors

A BL-subspace Y is a semantic belief system presenting, via the notion of types, the hierarchies of belief of a set of players having incomplete information about the state of nature. A state of the world captures the situation at what is called the interim stage: each player knows his own type and has beliefs about the state of nature and the types of the other players. The question "what is the real state of the world ω?" is not addressed. In a BL-subspace, there is no chance move with an explicit probability distribution that chooses the state of the world, whereas such a probability distribution is part of a Harsanyi or an Aumann model. Yet, in the belief space Y of Example 5 in the previous section, such a prior distribution p emerged endogenously from the structure of Y. More specifically, if the state ω ∈ Y is chosen by a chance move according to the probability distribution p and each player i is told his type t_i(ω), then his beliefs are precisely those described by t_i(ω). This is a property of the BL-subspace that we call consistency (which does not hold, for instance, for the BL-subspace Ỹ in Example 6) and that we define now. Let Y ⊆ Ω be a BL-subspace.
Definition 11
(i) A probability distribution p ∈ Δ(Y) is said to be consistent if for any player i ∈ I,

p = ∫_Y t_i(ω) dp.   (8)

(ii) A BL-subspace Y is said to be consistent if there is a consistent probability distribution p with Supp(p) = Y. A consistent BL-subspace will be called a C-subspace. A state of the world ω ∈ Ω is said to be consistent if it is a point in a C-subspace.

The interpretation of (8) is that the probability distribution p is "the average" of the types t_i(ω) of player i (which are also probability distributions on Y), when the average is taken over Y according to p. This definition is not transparent; it is not clear how it captures the consistency property we have just explained in terms of a chance move choosing ω ∈ Y according to p. However, it turns out to be equivalent. For ω ∈ Y denote π_i(ω) = {ω′ ∈ Y | t_i(ω′) = t_i(ω)}; then we have:

Proposition 12 A probability distribution p ∈ Δ(Y) is consistent if and only if

t_i(ω)(A) = p(A | π_i(ω))   (9)

holds for all i ∈ I and for any measurable set A ⊆ Y.

In particular, a Harsanyi or an Aumann model is represented by a consistent BL-subspace since, by construction, the beliefs are derived from a common prior distribution which is part of the data of the model. The role of the prior distribution p in these models is actually not that of an additional parameter of the model, but rather that of an additional assumption on the belief system, namely, the consistency assumption. In fact, if a minimal belief subspace is consistent, then the common prior p is uniquely determined by the beliefs, as we saw in Example 5; there is no need to specify p as additional data of the system.

Proposition 13 If ω ∈ Ω is a consistent state of the world, and if Y(ω) is the smallest consistent BL-subspace containing ω, then the consistent probability distribution p on Y(ω) is uniquely determined. (The formulation of this proposition requires some technical qualification if Y(ω) is a continuum.)

Consistency (that is, the existence of a common prior) is quite a strong assumption. It assumes that differences in beliefs (i.e., in probability assessments) are due only to differences in information; players having precisely the same information will have precisely the same beliefs. It is no surprise that this assumption has strong consequences, the best known of which is due to Aumann [2]: players with consistent beliefs cannot agree to disagree. That is, if at some state of the world it is commonly known that one player assigns probability q_1 to an event E and another player assigns probability q_2 to the same event, then it must be the case that q_1 = q_2. Variants of this result appear under the title of "no trade theorems" (see, e.g., [19]): rational players with consistent beliefs cannot believe that they both can gain from a trade or a bet between them.

The plausibility and the justification of the common prior assumption has been extensively discussed in the literature (see, e.g., [4,10,11]); it is sometimes referred to as the Harsanyi doctrine. Here we only make the observation that, within the set of BL-subspaces in Ω, the set of consistent BL-subspaces is a set of measure zero. To see the idea of the proof, consider the following example:

Example 8 (Generalization of Examples 5 and 6) Consider a BL-subspace as in Examples 5 and 6 but with type sets:

T_I = {I_1, I_2} = {[α_1 ω_11, (1-α_1) ω_12], [α_2 ω_21, (1-α_2) ω_22]}
T_II = {II_1, II_2} = {[β_1 ω_11, (1-β_1) ω_21], [β_2 ω_12, (1-β_2) ω_22]}

For any (α_1, α_2, β_1, β_2) ∈ [0, 1]^4 this is a BL-subspace. The mutual beliefs about each other's type are:
Beliefs of player I:
        II_1     II_2
I_1     α_1      1-α_1
I_2     α_2      1-α_2

Beliefs of player II:
        II_1     II_2
I_1     β_1      β_2
I_2     1-β_1    1-β_2

If the subspace is consistent, these beliefs are obtained as conditional distributions from some prior probability distribution p on T = T_I × T_II, with p_ij denoting the probability of the type pair (I_i, II_j).
This implies (assuming p_ij ≠ 0 for all i and j):

p_11/p_12 = α_1/(1-α_1),   p_21/p_22 = α_2/(1-α_2),

and hence

(p_11 p_22)/(p_12 p_21) = [α_1/(1-α_1)] · [(1-α_2)/α_2].

Similarly,

p_11/p_21 = β_1/(1-β_1),   p_12/p_22 = β_2/(1-β_2),

and hence

(p_11 p_22)/(p_12 p_21) = [β_1/(1-β_1)] · [(1-β_2)/β_2].

It follows that the types must satisfy

[α_1/(1-α_1)] · [(1-α_2)/α_2] = [β_1/(1-β_1)] · [(1-β_2)/β_2],   (10)

which is generally not the case. More precisely, the set of (α_1, α_2, β_1, β_2) ∈ [0, 1]^4 satisfying condition (10) is a set of measure zero; it is a three-dimensional set in the four-dimensional set [0, 1]^4. Nyarko [21] proved that even the ratio of the dimensions of the set of consistent BL-subspaces to the dimension of the set of BL-subspaces goes to zero as the latter goes to infinity. Summing up, most BL-subspaces are inconsistent and thus do not satisfy the common prior condition.

Bayesian Games and Bayesian Equilibrium

As we said, a game with incomplete information played by Bayesian players, often called a Bayesian game, is a game in which the players have incomplete information about the data of the game. Being Bayesian, each player has beliefs (a probability distribution) about any relevant data he does not know, including the beliefs of the other players. So far, we have developed the belief structure of such a situation, which is a BL-subspace Y in the universal belief space Ω(S, I). Now we add the action sets and the payoff functions. These are actually part of the description of the state of nature: the mapping s: Ω → S assigns to each state of the world ω the game-form s(ω) played at this state. To emphasize this interpretation of s(ω) as a game-form, we denote it also as

Γ_ω = (I, (A_i(t_i(ω)))_{i∈I}, (u_i(ω))_{i∈I}),

where A_i(t_i(ω)) is the action set (pure strategies) of player i at ω, u_i(ω): A(ω) → R is his payoff function, and A(ω) = ∏_{i∈I} A_i(t_i(ω)) is the set of action profiles at state ω. Note that while the actions of a player depend only on his type, his payoff depends on the actions and types of all the players. For a vector of actions a ∈ A(ω), we write u_i(ω; a) for u_i(ω)(a). Given a BL-subspace Y ⊆ Ω(S, I), we define the Bayesian game on Y as follows:

Definition 14 The Bayesian game on Y is a vector payoff game in which:

I = {1, ..., n} – the players' set.
Σ_i – the strategy set of player i, the set of mappings σ_i: Y → A_i which are T_i-measurable; in particular,

t_i(ω_1) = t_i(ω_2) ⟹ σ_i(ω_1) = σ_i(ω_2).

Let Σ = ∏_{i∈I} Σ_i. The payoff function u_i for player i is a vector-valued function u_i = (u_{t_i})_{t_i∈T_i}, where u_{t_i} (the payoff function of player i of type t_i) is a mapping u_{t_i}: Σ → R defined by

u_{t_i}(σ) = ∫_Y u_i(ω; σ(ω)) dt_i(ω).   (11)
Note that u_{t_i} is T_i-measurable, as it should be. When Y is a finite BL-subspace, the above-defined Bayesian game is an n-person "game" in which the payoff for player i is a vector with one payoff for each of his types (hence a vector of dimension |T_i|). It becomes a regular game-form for a given state of the world ω, since then the payoff to player i is u_{t_i(ω)}. However, these game-forms are not regular games, since they are interconnected: the players do not know which of these "games" they are playing (since they do not know the state of the world ω). Thus, just like a Harsanyi game, a Bayesian game on a BL-subspace Y consists of a family of connected game-forms, one for each ω ∈ Y. However, unlike a Harsanyi game, a Bayesian game has no chance move that chooses the state of the world (or the vector of types). A way to transform a Bayesian game into a regular game was suggested by R. Selten and was named by Harsanyi the Selten game G (see p. 496 in [11]). This is a game with |T_1|·|T_2|·...·|T_n| players (one for each type) in which each player t_i ∈ T_i chooses a strategy and then selects his (n-1) partners, one from each T_j, j ≠ i, according to his beliefs t_i.
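For a finite Y, the integral in (11) is a weighted sum over the states in the support of t_i. A minimal sketch with invented data (two states, illustrative payoffs — none of these numbers are from the text):

```python
from fractions import Fraction as F

Y = ['w1', 'w2']
# Player I's type at each state: here a single type believing (1/2, 1/2),
# as in a one-sided-information situation (illustrative numbers).
t_I = {'w1': {'w1': F(1, 2), 'w2': F(1, 2)},
       'w2': {'w1': F(1, 2), 'w2': F(1, 2)}}

def u_I(w, aI, aII):
    # Illustrative state-dependent payoff: player I wants to match player II
    # at w1 and to mismatch at w2.
    return 1 if (aI == aII) == (w == 'w1') else 0

# Strategies map states to actions and must be T_i-measurable:
# constant wherever the player's type is the same.
sigma_I = {'w1': 'T', 'w2': 'T'}    # player I has one type, hence constant
sigma_II = {'w1': 'T', 'w2': 'B'}   # player II's type varies with the state

def u_type(w, sI, sII):
    """Equation (11) for finite Y: the payoff of player I's type at w."""
    return sum(t_I[w][v] * u_I(v, sI[v], sII[v]) for v in Y)

# Matches at w1 and mismatches at w2, so the type payoff is 1:
assert u_type('w1', sigma_I, sigma_II) == 1
```

Replacing `t_I` by the dictionary of types of any finite BL-subspace gives exactly the vector payoff (u_{t_i}) of Definition 14.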
Bayesian Equilibrium

Although a Bayesian game is not a regular game, the Nash equilibrium concept based on the notion of best reply can be adapted to yield the solution concept of Bayesian equilibrium (also called Nash–Bayes equilibrium).

Definition 15 A vector of strategies σ = (σ_1, ..., σ_n) in a Bayesian game is called a Bayesian equilibrium if for all i in I and for all t_i in T_i,

u_{t_i}(σ) ≥ u_{t_i}(σ_{-i}, σ̃_i)   for all σ̃_i ∈ Σ_i,   (12)

where, as usual, σ_{-i} = (σ_j)_{j≠i} denotes the vector of strategies of players other than i.

Thus, a Bayesian equilibrium specifies a behavior for each player which is a best reply to what he believes is the behavior of the other players, that is, a best reply to the strategies of the other players given his type. In a game with complete information, which corresponds to a BL-subspace with one state of the world (Y = {ω}), there is only one type for each player and all beliefs assign probability one to a singleton, so the Bayesian equilibrium is just the well-known Nash equilibrium.

Remark 16 It is readily seen that when Y is finite, any Bayesian equilibrium is a Nash equilibrium of the Selten game G, in which each type is a player who selects the types of his partners according to his beliefs. Similarly, we can transform the Bayesian game into an ordinary game in strategic form by defining the payoff function to player i to be ũ_i = Σ_{t_i∈T_i} λ_{t_i} u_{t_i}, where the λ_{t_i} are strictly positive constants. Again, independently of the values of the constants λ_{t_i}, any Bayesian equilibrium is a Nash equilibrium of this game and vice versa. In particular, if we choose the constants so that Σ_{t_i∈T_i} λ_{t_i} = 1, we obtain the game suggested by Aumann and Maschler in 1967 (see p. 95 in [8]), and again the set of Nash equilibria of this game is precisely the set of Bayesian equilibria.

The Harsanyi Game Revisited

As we observed in Example 5, the belief structure of a consistent BL-subspace is the same as that of a Harsanyi game after the chance move choosing the types. That is, the embedding of the Harsanyi game as a BL-subspace in the universal belief space is only at the interim stage, after the moment that each player gets to know his type. The Harsanyi game, on the other hand, is at the ex ante stage, before a player knows his type.
Then, what is the relation between the Nash equilibrium in the Harsanyi game at the ex ante stage and the equilibrium at the interim stage, namely, the Bayesian equilibrium of the corresponding BL-subspace?
This is an important question concerning the embedding of the Harsanyi game in the UBS since, as we said before, the chance move choosing the types does not appear explicitly in the UBS. The answer to this question was given by Harsanyi (1967–8) (assuming that each type t_i has positive probability):

Theorem 17 (Harsanyi) The set of Nash equilibria of a Harsanyi game is identical to the set of Bayesian equilibria of the equivalent BL-subspace in the UBS.

In other words, this theorem states that any equilibrium at the ex ante stage is also an equilibrium at the interim stage and vice versa. In modeling situations of incomplete information, the interim stage is the natural one; if a player knows his beliefs (type), then why should he analyze the situation, as Harsanyi suggests, from the ex ante point of view, as if his type were not known to him and he could equally well be of another type? Theorem 17 provides a technical answer to this question: the equilibria are the same in both games, and the equilibrium strategy of the ex ante game specifies for each type precisely his equilibrium strategy at the interim stage. In that respect, for a player who knows his type, the Harsanyi model is just an auxiliary game to compute his equilibrium behavior. Of course, the deeper answer to the question above comes from the interactive nature of the situation: even though player i knows he is of type t_i, he knows that his partners do not know that, and that they may consider the possibility that he is of type t̃_i; since this affects their behavior, the behavior of type t̃_i is also relevant to player i, who knows he is of type t_i. Finally, Theorem 17 makes the Bayesian equilibrium the natural extension of the Nash equilibrium concept to games with incomplete information, with consistent or inconsistent beliefs, when the Harsanyi ordinary game model is unavailable.
Examples of Bayesian Equilibria

In Example 6 there are two players of two types each, with inconsistent mutual beliefs given by:

Beliefs of player I (about player II's type):
  type I1:  [3/7 (II1), 4/7 (II2)]
  type I2:  [2/3 (II1), 1/3 (II2)]

Beliefs of player II (about player I's type):
  type II1: [1/2 (I1), 1/2 (I2)]
  type II2: [4/5 (I1), 1/5 (I2)]

Assume that the payoff matrices G11, G12, G21, G22 for the four type profiles are:
Bayesian Games: Games with Incomplete Information
As the beliefs are inconsistent, they cannot be represented by a Harsanyi game. Yet we can compute the Bayesian equilibrium of this Bayesian game. Let (x, y) be the strategy of player I: play the mixed strategy [x(T), (1−x)(B)] when of type I1, and play [y(T), (1−y)(B)] when of type I2. Let (z, t) be the strategy of player II: play the mixed strategy [z(L), (1−z)(R)] when of type II1, and play [t(L), (1−t)(R)] when of type II2. For 0 < x, y, z, t < 1, each player of each type must be indifferent between his two pure actions; this yields the equilibrium values

x = 3/5 ,  y = 2/5 ,  z = 7/9 ,  t = 2/9 .
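The indifference conditions are linear: the two types of player I pin down player II's mixtures (z, t), and the two types of player II pin down (x, y), so each pair solves a 2×2 linear system. The following is a sketch of such a solver; since the payoff matrices of Example 6 are not reproduced above, the dictionary G and the sanity-check game below are illustrative assumptions, not data from the text:

```python
import numpy as np

def bayesian_eq_2x2(G, beliefs_I, beliefs_II):
    """Interior Bayesian equilibrium of a 2-player, 2-type, 2-action game.

    G[(k, s)] is a 2x2x2 array for the type profile (I_k, II_s):
    G[(k, s)][r, c] = (payoff to I, payoff to II) at row r (0=T) and column c (0=L).
    beliefs_I[k]  = probability type I_k assigns to II being of type II_1.
    beliefs_II[s] = probability type II_s assigns to I being of type I_1.
    Returns (x, y, z, t): P(T) for types I_1, I_2 and P(L) for types II_1, II_2.
    """
    # Player I of type k is indifferent between T and B; writing II_s's mixture
    # as w_s (w_0 = z, w_1 = t), this condition is linear in (z, t):
    A, b = np.zeros((2, 2)), np.zeros(2)
    for k in range(2):
        for s in range(2):
            p = beliefs_I[k] if s == 0 else 1 - beliefs_I[k]
            d = G[(k, s)][0, :, 0] - G[(k, s)][1, :, 0]  # T-minus-B gap vs (L, R)
            A[k, s] = p * (d[0] - d[1])
            b[k] -= p * d[1]
    z, t = np.linalg.solve(A, b)
    # Symmetrically, player II's indifference conditions are linear in (x, y):
    C, e = np.zeros((2, 2)), np.zeros(2)
    for s in range(2):
        for k in range(2):
            q = beliefs_II[s] if k == 0 else 1 - beliefs_II[s]
            d = G[(k, s)][:, 0, 1] - G[(k, s)][:, 1, 1]  # L-minus-R gap vs (T, B)
            C[s, k] = q * (d[0] - d[1])
            e[s] -= q * d[1]
    x, y = np.linalg.solve(C, e)
    return x, y, z, t

# Sanity check on a game where the answer is known: if every type profile plays
# matching pennies, the unique equilibrium is uniform regardless of the beliefs.
MP = np.zeros((2, 2, 2))
MP[:, :, 0] = [[1, -1], [-1, 1]]   # player I wins on a match
MP[:, :, 1] = -MP[:, :, 0]
G = {(k, s): MP for k in range(2) for s in range(2)}
print(bayesian_eq_2x2(G, (3/7, 2/3), (1/2, 4/5)))   # ≈ (0.5, 0.5, 0.5, 0.5)
```

Applied to the actual matrices of Example 6 with the beliefs above, the same computation recovers the values x = 3/5, y = 2/5, z = 7/9, t = 2/9 quoted in the text.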
There is no overall "expected payoff," since this is a Bayesian game and not an ordinary game; the expected payoffs depend on the actual state of the world, i.e., on the actual types of the players and the actual payoff matrix. For example, when the state of the world is ω11 = (G11, I1, II1), the expected payoffs are

γ(ω11) = (3/5, 2/5) G11 (7/9, 2/9)ᵀ = (46/45, 6/45) .

Similarly:

γ(ω12) = (3/5, 2/5) G12 (2/9, 7/9)ᵀ = (18/45, 4/45) ,
γ(ω21) = (2/5, 3/5) G21 (7/9, 2/9)ᵀ = (21/45, 21/45) ,
γ(ω22) = (2/5, 3/5) G22 (2/9, 7/9)ᵀ = (28/45, 70/45) .
However, these are the objective payoffs as viewed by the analyst; the players view them differently. For player i of type ti the relevant payoff is his subjective payoff u_ti(·) defined in (11). For example, at state ω11 (or ω12) player I believes that with probability 3/7 the state is ω11, in which case his payoff is 46/45, and with probability 4/7 the state is ω12, in which case his payoff is 18/45. Therefore his subjective expected payoff at state ω11 is (3/7)(46/45) + (4/7)(18/45) = 2/3. Similar computations show that in states ω21 or ω22 player I "expects" a payoff of 7/15, while player II "expects" 3/10 at states ω11 or ω21 and 86/225 at states ω12 or ω22.

Bayesian equilibrium is widely used in auction theory, which constitutes an important and successful application of the theory of games with incomplete information. The simplest example is that of two buyers bidding in a first-price auction for an indivisible object. If each buyer i has a private value vi for the object (independent of the private value vj of the other buyer), and if he further believes that vj is uniformly distributed on [0, 1], then this is a Bayesian game in which the type of a player is his private valuation; that is, the type sets are T1 = T2 = [0, 1], a continuum. This is a consistent Bayesian game (that is, a Harsanyi game), since the beliefs are derived from the uniform probability distribution on T1 × T2 = [0, 1]². A Bayesian equilibrium of this game is that in which each player bids half of his private value: bi(vi) = vi/2 (see, e.g., Chap. III in [25]). Although auction theory has developed far beyond this simple example, almost all the models studied so far are Bayesian games with consistent beliefs, that is, Harsanyi games. The main reason, of course, is that consistent Bayesian games are more manageable, since they can be described in terms of an equivalent ordinary game in strategic form.
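The equilibrium bid bi(vi) = vi/2 can be checked numerically: if the opponent bids vj/2 with vj uniform on [0, 1], a bidder with value v who bids b wins with probability min(2b, 1), so his expected payoff is (v − b)·min(2b, 1), which is maximized at b = v/2. A minimal sketch (the value v = 0.8 is an arbitrary illustration):

```python
import numpy as np

def expected_payoff(b, v):
    """Expected payoff of bidding b with value v in a first-price auction
    against an opponent who bids v_j / 2 with v_j ~ Uniform[0, 1]."""
    win_prob = np.minimum(2 * b, 1.0)   # P(b > v_j / 2) = min(2b, 1)
    return (v - b) * win_prob

v = 0.8
bids = np.linspace(0, v, 100001)
best = bids[np.argmax(expected_payoff(bids, v))]
print(best)   # ≈ 0.4 = v / 2
```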
However, inconsistent beliefs are quite plausible and exist in the marketplace in general, and even more so in auction situations. One example is collusion among bidders: when a bidding ring is formed, some of the bidders outside the ring may be unaware of its existence and behave under the belief that all bidders are competitive. The members of the ring may or may not know whether the other bidders know about the ring, or they may be uncertain about it. This quite plausible mutual belief situation is typically inconsistent and has to be treated as an inconsistent Bayesian game, for which a Bayesian equilibrium is to be found.
Bayesian Equilibrium and Correlated Equilibrium
Correlated equilibrium was introduced in Aumann (1974) as the Nash equilibrium of a game extended by adding to
it random events about which the players have partial information. Basically, starting from an ordinary game, Aumann added a probability space and an information structure and obtained a game with incomplete information, the equilibrium of which he called a correlated equilibrium of the original game. The fact that the Nash equilibrium of a game with incomplete information is the Bayesian equilibrium suggests that the concept of correlated equilibrium is closely related to that of Bayesian equilibrium. In fact, Aumann noticed this and discussed it in a second paper, entitled "Correlated equilibrium as an expression of Bayesian rationality" [3]. In this section we review briefly, by way of an example, the concept of correlated equilibrium, and state formally its relation to the concept of Bayesian equilibrium.

Example 18 Consider a two-person game with actions {T, B} for player 1 and {L, R} for player 2, with corresponding payoffs given in the following matrix:
This game has three Nash equilibria: (T, R) with payoff (2, 7), (B, L) with payoff (7, 2), and the mixed equilibrium ([2/3(T), 1/3(B)], [2/3(L), 1/3(R)]) with payoff (4⅔, 4⅔). Suppose that we add to the game a chance move that chooses an element of {T, B} × {L, R} according to the following probability distribution μ:
Let us now extend the game G to a game with incomplete information G* in which a chance move chooses an element of {T, B} × {L, R} according to the probability distribution μ above. Player 1 is then informed of the first (left) component of the chosen element, and player 2 is informed of the second (right) component. Each player then chooses an action in G, and the payoff is made. If we interpret the partial information as a suggestion of which action to choose, then it is readily verified that following the suggestion is a Nash equilibrium of the extended game, yielding a payoff (5, 5). Aumann called this a correlated equilibrium of the original game G. In our terminology, the extended game G* is a Bayesian game, and its
Nash equilibrium is its Bayesian equilibrium. Thus, a correlated equilibrium of a game is just the Bayesian equilibrium of its extension to a game with incomplete information. We now make this a general formal statement. For simplicity, we use the Aumann model of a game with incomplete information. Let G = (I, (Ai)i∈I, (ui)i∈I) be a game in strategic form, where I is the set of players, Ai is the set of actions (pure strategies) of player i, and ui is his payoff function.

Definition 19 Given a game in strategic form G, an incomplete information extension (the I-extension) of the game G is the game

G* = (I, (Ai)i∈I, (ui)i∈I, (Y, p), (πi)i∈I) ,

where (Y, p) is a finite probability space and πi is a partition of Y (the information partition of player i).

This is an Aumann model of incomplete information and, as we noted before, it is also a Harsanyi type-based model in which the type of player i at state ω ∈ Y is ti(ω) = πi(ω), and a strategy of player i is a mapping from his type set to his mixed actions, σi : Ti → Δ(Ai). We identify a correlated equilibrium in the game G with a probability distribution μ on the vectors of actions A = A1 × ··· × An. Thus μ ∈ Δ(A) is a correlated equilibrium of the game G if, when a ∈ A is chosen according to μ and each player i is suggested to play ai, his best reply is in fact to play the action ai. Given a game with incomplete information G* as in Definition 19, any vector of strategies of the players σ = (σ1, ..., σn) induces a probability distribution on the vectors of actions a ∈ A, which we denote by μσ ∈ Δ(A). We can now state the relation between correlated and Bayesian equilibria:

Theorem 20 Let σ be a Bayesian equilibrium of the game with incomplete information G* = (I, (Ai)i∈I, (ui)i∈I, (Y, p), (πi)i∈I); then the induced probability distribution μσ is a correlated equilibrium of the basic game G = (I, (Ai)i∈I, (ui)i∈I).
The other direction is:

Theorem 21 Let μ be a correlated equilibrium of the game G = (I, (Ai)i∈I, (ui)i∈I); then G has an extension to a game with incomplete information G* = (I, (Ai)i∈I, (ui)i∈I, (Y, p), (πi)i∈I) with a Bayesian equilibrium σ for which μσ = μ.
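The correlated equilibrium of Example 18 can be verified directly: conditional on any suggestion, deviating must not pay. The payoff matrix and the distribution μ did not survive typesetting here, so the sketch below uses the game from Aumann (1974) consistent with the equilibria quoted above (payoffs (6,6), (2,7), (7,2), (0,0), with μ putting weight 1/3 on each of (T,L), (T,R), (B,L)); treat these numbers as a reconstruction, not as given in the text:

```python
from fractions import Fraction as F

# Reconstructed data for Example 18 (assumption; see lead-in).  Entries are (u1, u2).
U = {('T', 'L'): (6, 6), ('T', 'R'): (2, 7), ('B', 'L'): (7, 2), ('B', 'R'): (0, 0)}
mu = {('T', 'L'): F(1, 3), ('T', 'R'): F(1, 3), ('B', 'L'): F(1, 3), ('B', 'R'): F(0)}

rows, cols = ('T', 'B'), ('L', 'R')

def is_correlated_eq(U, mu):
    # Player 1: conditional on any suggested row r, no deviation r2 should gain.
    for r in rows:
        for r2 in rows:
            if sum(mu[r, c] * (U[r2, c][0] - U[r, c][0]) for c in cols) > 0:
                return False
    # Player 2: conditional on any suggested column c, no deviation c2 should gain.
    for c in cols:
        for c2 in cols:
            if sum(mu[r, c] * (U[r, c2][1] - U[r, c][1]) for r in rows) > 0:
                return False
    return True

payoffs = tuple(int(sum(mu[a] * U[a][k] for a in mu)) for k in range(2))
print(is_correlated_eq(U, mu), payoffs)   # True (5, 5)
```

The same check applied to the three Nash equilibria (viewed as product distributions) also returns True, since every Nash equilibrium is a correlated equilibrium.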
Concluding Remarks and Future Directions

The Consistency Assumption

To the heated discussion of the merits and justification of the consistency assumption in economic and game-theoretic models, we would like to add a couple of remarks. In our opinion, the appropriate way of modeling an incomplete information situation is at the interim stage, that is, when a player knows his own beliefs (type). The Harsanyi ex ante model is just an auxiliary construction for the analysis. Actually, this was also the view of Harsanyi, who justified his model by proving that it provides the same equilibria as the interim stage situation it generates (Theorem 17). The Harsanyi doctrine says roughly that our models "should be consistent," and that if we obtain an inconsistent model, it cannot be a "correct" model of the situation at hand. This becomes less convincing if we agree that the interim stage is what we are interested in: not only are most mutual beliefs inconsistent, as we saw in the section entitled "Consistent Beliefs and Common Priors" above, but it is hard to argue convincingly that the model in Example 5 describes an adequate mutual belief situation while the model in Example 6 does not; the only difference between the two is that in one model a certain type's beliefs are [3/5(ω11), 2/5(ω21)] while in the other model his beliefs are [1/2(ω11), 1/2(ω21)]. A related point is that if players' beliefs are the data of the situation (in the interim stage), then these are typically imprecise and rather hard to measure. Therefore any meaningful result of our analysis should be robust to small changes in the beliefs. This cannot be achieved within consistent belief systems, which form a thin set of measure zero in the universal belief space.

Knowledge and Beliefs

Our interest in this article was mostly in the notion of beliefs of players and less in the notion of knowledge. These are two related but different notions.
Knowledge is defined through a knowledge operator satisfying some axioms; beliefs are defined by means of probability distributions. Aumann's model, discussed in the section entitled "Aumann's Model" above, has both elements: knowledge is generated by the partitions of the players, while beliefs are generated by the probability P on the space Y (and the partitions). Being interested in the subjective beliefs of the player, we could interpret "at state of the world ω ∈ Ω player i knows the event E ⊆ Ω" to mean "at state of the world ω ∈ Ω player i assigns to the event E ⊆ Ω probability 1". However, in the universal belief space, "belief with probability 1" does not satisfy a central axiom of the knowledge operator, namely: if at ω ∈ Ω player i knows the event E ⊆ Ω, then ω ∈ E. That is, if a player knows an event, then this event has in fact happened. In the universal belief space, where all coherent beliefs are possible, at a state ω ∈ Ω a player may assign probability 1 to the event {ω′} with ω′ ≠ ω. In fact, if in a BL-subspace Y the condition ω ∈ Pi(ω) is satisfied for all i and all ω ∈ Y, then belief with probability 1 is a knowledge operator on Y. This was indeed the case in Aumann's and Harsanyi's models, where, by construction, the support of a player's beliefs at state ω always included ω. For a detailed discussion of the relationship between knowledge and beliefs in the universal belief space, see Vassilakis and Zamir [24].

Future Directions

We have not said much about the existence of Bayesian equilibrium, mainly because it has not been studied enough and there are no general results, especially in the non-consistent case. We can readily see that a Bayesian game on a finite BL-subspace in which each state of nature s(ω) is a finite game-form has a Bayesian equilibrium in mixed strategies. This can be proved, for example, by transforming the Bayesian game into an ordinary finite game (see Remark 16) and applying the Nash theorem for finite games. For games with incomplete information with a continuum of strategies and payoff functions that are not necessarily continuous, there are no general existence results; even in consistent auction models, existence has been proved only for specific models separately (see [15,20,22]). Establishing general existence results for large families of Bayesian games is clearly an important direction of future research. Since, as we argued before, most games are Bayesian games, the existence of a Bayesian equilibrium should, and could, reach at least the level of generality available for the existence of a Nash equilibrium.

Acknowledgments

I am grateful to two anonymous reviewers for their helpful comments.
Bibliography

1. Aumann R (1974) Subjectivity and correlation in randomized strategies. J Math Econ 1:67–96
2. Aumann R (1976) Agreeing to disagree. Ann Stat 4:1236–1239
3. Aumann R (1987) Correlated equilibrium as an expression of Bayesian rationality. Econometrica 55:1–18
4. Aumann R (1998) Common priors: A reply to Gul. Econometrica 66:929–938
5. Aumann R (1999) Interactive epistemology I: Knowledge. Intern J Game Theory 28:263–300
6. Aumann R (1999) Interactive epistemology II: Probability. Intern J Game Theory 28:301–314
7. Aumann R, Heifetz A (2002) Incomplete information. In: Aumann R, Hart S (eds) Handbook of Game Theory, vol 3. Elsevier, pp 1666–1686
8. Aumann R, Maschler M (1995) Repeated Games with Incomplete Information. MIT Press, Cambridge
9. Brandenburger A, Dekel E (1993) Hierarchies of beliefs and common knowledge. J Econ Theory 59:189–198
10. Gul F (1998) A comment on Aumann's Bayesian view. Econometrica 66:923–927
11. Harsanyi J (1967–8) Games with incomplete information played by 'Bayesian' players, parts I–III. Manag Sci 8:159–182, 320–334, 486–502
12. Heifetz A (1993) The Bayesian formulation of incomplete information, the non-compact case. Intern J Game Theory 21:329–338
13. Heifetz A, Mongin P (2001) Probability logic for type spaces. Games Econ Behav 35:31–53
14. Heifetz A, Samet D (1998) Topology-free typology of beliefs. J Econ Theory 82:324–341
15. Maskin E, Riley J (2000) Asymmetric auctions. Rev Econ Stud 67:413–438
16. Meier M (2001) An infinitary probability logic for type spaces. CORE Discussion Paper 2001/61
17. Mertens J-F, Sorin S, Zamir S (1994) Repeated Games, Part A: Background Material. CORE Discussion Paper No. 9420
18. Mertens J-F, Zamir S (1985) Formulation of Bayesian analysis for games with incomplete information. Intern J Game Theory 14:1–29
19. Milgrom PR, Stokey N (1982) Information, trade and common knowledge. J Econ Theory 26:17–27
20. Milgrom PR, Weber RJ (1982) A theory of auctions and competitive bidding. Econometrica 50:1089–1122
21. Nyarko Y (1991) Most games violate the Harsanyi doctrine. C.V. Starr working paper #91–39, NYU
22. Reny P, Zamir S (2004) On the existence of pure strategy monotone equilibria in asymmetric first-price auctions. Econometrica 72:1105–1125
23. Sorin S, Zamir S (1985) A 2-person game with lack of information on 1½ sides. Math Oper Res 10:17–23
24. Vassilakis S, Zamir S (1993) Common beliefs and common knowledge. J Math Econ 22:495–505
25. Wolfstetter E (1999) Topics in Microeconomics. Cambridge University Press, Cambridge
Bayesian Statistics
DAVID DRAPER
Department of Applied Mathematics and Statistics, Baskin School of Engineering, University of California, Santa Cruz, USA

Article Outline

Glossary
Definition of the Subject and Introduction
The Bayesian Statistical Paradigm
Three Examples
Comparison with the Frequentist Statistical Paradigm
Future Directions
Bibliography

Glossary

Bayes' theorem; prior, likelihood and posterior distributions Given (a) θ, something of interest which is unknown to the person making an uncertainty assessment, conveniently referred to as You, (b) y, an information source which is relevant to decreasing Your uncertainty about θ, (c) a desire to learn about θ from y in a way that is both internally and externally logically consistent, and (d) B, Your background assumptions and judgments about how the world works, as these assumptions and judgments relate to learning about θ from y, it can be shown that You are compelled in this situation to reason within the standard rules of probability as the basis of Your inferences about θ, predictions of future data y*, and decisions in the face of uncertainty (see below for contrasts between inference, prediction and decision-making), and to quantify Your uncertainty about any unknown quantities through conditional probability distributions. When inferences about θ are the goal, Bayes' Theorem provides a means of combining all relevant information internal and external to y:

p(θ | y, B) = c · p(θ | B) · l(θ | y, B) .   (1)
Here, for example in the case in which θ is a real-valued vector of length k, (a) p(θ|B) is Your prior distribution about θ given B (in the form of a probability density function), which quantifies all relevant information available to You about θ external to y, (b) c is a positive normalizing constant, chosen to make the density on the left side of the equation integrate to 1, (c) l(θ|y, B) is Your likelihood distribution for θ given y and B, which is defined to be a density-normalized multiple of Your sampling distribution p(y*|θ, B) for future data values y* given θ and B, but re-interpreted as a function of θ for fixed y, and (d) p(θ|y, B) is Your posterior distribution about θ given y and B, which summarizes Your current total information about θ and solves the basic inference problem.

Bayesian parametric and non-parametric modeling (1) Following de Finetti [23], a Bayesian statistical model is a joint predictive distribution p(y1, ..., yn) for observable quantities yi that have not yet been observed, and about which You are therefore uncertain. When the yi are real-valued, often You will not regard them as probabilistically independent (informally, the yi are independent if information about any of them does not help You to predict the others); but it may be possible to identify a parameter vector θ = (θ1, ..., θk) such that You would judge the yi conditionally independent given θ, and would therefore be willing to model them via the relation

p(y1, ..., yn | θ) = ∏_{i=1}^{n} p(yi | θ) .   (2)
When combined with a prior distribution p(θ) on θ that is appropriate to the context, this is Bayesian parametric modeling, in which p(yi|θ) will often have a standard distributional form (such as binomial, Poisson or Gaussian). (2) When a (finite) parameter vector that induces conditional independence cannot be found, if You judge Your uncertainty about the real-valued yi exchangeable (see below), then a representation theorem of de Finetti [21] states informally that all internally logically consistent predictive distributions p(y1, ..., yn) can be expressed in a way that is equivalent to the hierarchical model (see below)

(F | B) ~ p(F | B)
(yi | F, B) ~ IID F ,   (3)

where (a) F is the cumulative distribution function (CDF) of the underlying process (y1, y2, ...) from which You are willing to regard p(y1, ..., yn) as (in effect) like a random sample, and (b) p(F|B) is Your prior distribution on the space F of all CDFs on the real line. This (placing probability distributions on infinite-dimensional spaces such as F) is Bayesian non-parametric modeling, in which priors involving Dirichlet processes and/or Pólya trees (see Sect. "Inference: Parametric and Non-Parametric Modeling of Count Data") are often used.
Exchangeability A sequence y = (y1, ..., yn) of random variables (for n ≥ 1) is (finitely) exchangeable if the joint probability distribution p(y1, ..., yn) of the elements of y is invariant under permutation of the indices (1, ..., n), and a countably infinite sequence (y1, y2, ...) is (infinitely) exchangeable if every finite subsequence is finitely exchangeable.

Hierarchical modeling Often Your uncertainty about something unknown to You can be seen to have a nested or hierarchical character. One class of examples arises in cluster sampling in fields such as education and medicine, in which students (level 1) are nested within classrooms (level 2) and patients (level 1) within hospitals (level 2); cluster sampling involves random samples (and therefore uncertainty) at two or more levels in such a data hierarchy (examples of this type of hierarchical modeling are given in Sect. "Strengths and Weaknesses of the Two Approaches"). Another, quite different, class of examples of Bayesian hierarchical modeling is exemplified by equation (3) above, in which it was helpful to decompose Your overall predictive uncertainty about (y1, ..., yn) into (a) uncertainty about F and then (b) uncertainty about the yi given F (examples of this type of hierarchical modeling appear in Sect. "Inference and Prediction: Binary Outcomes with No Covariates" and "Inference: Parametric and Non-Parametric Modeling of Count Data").

Inference, prediction and decision-making; samples and populations Given a data source y, inference involves drawing probabilistic conclusions about the underlying process that gave rise to y, prediction involves summarizing uncertainty about future observable data values y*, and decision-making involves looking for optimal behavioral choices in the face of uncertainty (about either the underlying process, or the future, or both).
In some cases inference takes the form of reasoning backwards from a sample of data values to a population: a (larger) universe of possible data values from which You judge that the sample has been drawn in a manner that is representative (i.e., so that the sampled and unsampled values in the population are likely to be similar in relevant ways).

Mixture modeling Given y, unknown to You, and B, Your background assumptions and judgments relevant to y, You have a choice: You can either model (Your uncertainty about) y directly, through the probability distribution p(y|B), or (if that is not feasible) You can identify a quantity x upon which You judge y to depend and model y hierarchically, in two stages: first by modeling x, through the probability distribution p(x|B), and then by modeling y given x, through the probability distribution p(y|x, B):

p(y | B) = ∫_X p(y | x, B) p(x | B) dx ,   (4)

where X is the space of possible values of x over which Your uncertainty is expressed. This is mixture modeling, a special case of hierarchical modeling (see above). In hierarchical notation, (4) can be re-expressed as

y = { x ; (y | x) } .   (5)

Examples of mixture modeling in this article include (a) equation (3) above, with F playing the role of x; (b) the basic equation governing Bayesian prediction, discussed in Sect. "The Bayesian Statistical Paradigm"; (c) Bayesian model averaging (Sect. "The Bayesian Statistical Paradigm"); (d) de Finetti's representation theorem for binary outcomes (Sect. "Inference and Prediction: Binary Outcomes with No Covariates"); (e) random-effects parametric and non-parametric modeling of count data (Sect. "Inference: Parametric and Non-Parametric Modeling of Count Data"); and (f) integrated likelihoods in Bayes factors (Sect. "Decision-Making: Variable Selection in Generalized Linear Models; Bayesian Model Selection").

Probability – frequentist and Bayesian In the frequentist probability paradigm, attention is restricted to phenomena that are inherently repeatable under (essentially) identical conditions; then, for an event A of interest, P_f(A) is the limiting relative frequency with which A occurs in the (hypothetical) repetitions, as the number of repetitions n → ∞. By contrast, Your Bayesian probability P_B(A|B) is the numerical weight of evidence, given Your background information B relevant to A, in favor of a true-false proposition A whose truth status is uncertain to You, obeying a series of reasonable axioms to ensure that Your Bayesian probabilities are internally logically consistent.

Utility To ensure internal logical consistency, optimal decision-making proceeds by (a) specifying a utility function U(a, θ0) quantifying the numerical value associated with taking action a if the unknown θ is really θ0 and (b) maximizing expected utility, where the expectation is taken over uncertainty in θ as quantified by the posterior distribution p(θ|y, B).

Definition of the Subject and Introduction

Statistics may be defined as the study of uncertainty: how to measure it, and how to make choices in the
face of it. Uncertainty is quantified via probability, of which there are two leading paradigms, frequentist (discussed in Sect. "Comparison with the Frequentist Statistical Paradigm") and Bayesian. In the Bayesian approach to probability the primitive constructs are true-false propositions A whose truth status is uncertain, and the probability of A is the numerical weight of evidence in favor of A, constrained to obey a set of axioms to ensure that Bayesian probabilities are coherent (internally logically consistent).

The discipline of statistics may be divided broadly into four activities: description (graphical and numerical summaries of a data set y, without attempting to reason outward from it; this activity is almost entirely non-probabilistic and will not be discussed further here), inference (drawing probabilistic conclusions about the underlying process that gave rise to y), prediction (summarizing uncertainty about future observable data values y*), and decision-making (looking for optimal behavioral choices in the face of uncertainty). Bayesian statistics is an approach to inference, prediction and decision-making that is based on the Bayesian probability paradigm, in which uncertainty about an unknown θ (this is the inference problem) is quantified by means of a conditional probability distribution p(θ|y, B); here y is all available relevant data and B summarizes the background assumptions and judgments of the person making the uncertainty assessment. Prediction of a future y* is similarly based on the conditional probability distribution p(y*|y, B), and optimal decision-making proceeds by (a) specifying a utility function U(a, θ0) quantifying the numerical reward associated with taking action a if the unknown θ is really θ0 and (b) maximizing expected utility, where the expectation is taken over uncertainty in θ as quantified by p(θ|y, B).
The Bayesian Statistical Paradigm

Statistics is the branch of mathematical and scientific inquiry devoted to the study of uncertainty: its consequences, and how to behave sensibly in its presence. The subject draws heavily on probability, a discipline which predates it by about 100 years: basic probability theory can be traced [48] to work of Pascal, Fermat and Huygens in the 1650s, and the beginnings of statistics [34,109] are evident in work of Bayes published in the 1760s. The Bayesian statistical paradigm consists of three basic ingredients:

θ, something of interest which is unknown (or only partially known) to the person making the uncertainty assessment, conveniently referred to, in a convention proposed by Good (1950), as You. Often θ is a parameter vector of real numbers (of finite length k, say) or
a matrix, but it can literally be almost anything: for example, a function (three leading examples are a cumulative distribution function (CDF), a density, or a regression surface), a phylogenetic tree, an image of a region on the surface of Mars at a particular moment in time, ...

y, an information source which is relevant to decreasing Your uncertainty about θ. Often y is a vector of real numbers (of finite length n, say), but it can also literally be almost anything: for instance, a time series, a movie, the text in a book, ...

A desire to learn about θ from y in a way that is both coherent (internally consistent: in other words, free of internal logical contradictions; Bernardo and Smith [11] give a precise definition of coherence) and well-calibrated (externally consistent: for example, capable of making accurate predictions of future data y*).

It turns out [23,53] that You are compelled in this situation to reason within the standard rules of probability (see below) as the basis of Your inferences about θ, predictions of future data y*, and decisions in the face of uncertainty, and to quantify Your uncertainty about any unknown quantities through conditional probability distributions, as in the following three basic equations of Bayesian statistics:

p(θ | y, B) = c · p(θ | B) · l(θ | y, B)
p(y* | y, B) = ∫_Θ p(y* | θ, B) p(θ | y, B) dθ
a* = argmax_{a ∈ A} E_(θ|y,B) U(a, θ) .   (6)

(The basic rules of probability [71] are: for any true-false propositions A and B and any background assumptions and judgments B, (convexity) 0 ≤ P(A|B) ≤ 1, with equality at 1 iff A is known to be true under B; (multiplication) P(A and B | B) = P(A|B) P(B|A, B) = P(B|B) P(A|B, B); and (addition) P(A or B | B) = P(A|B) + P(B|B) − P(A and B | B).)

The meaning of the equations in (6) is as follows. B stands for Your background (often not fully stated) assumptions and judgments about how the world works, as these assumptions and judgments relate to learning about θ from y. B is often omitted from the basic equations (sometimes with unfortunate consequences), yielding the simpler-looking forms

p(θ | y) = c · p(θ) · l(θ | y)
p(y* | y) = ∫_Θ p(y* | θ) p(θ | y) dθ
a* = argmax_{a ∈ A} E_(θ|y) U(a, θ) .   (7)
p(θ|B) is Your prior information about θ given B, in the form of a probability density function (PDF) or probability mass function (PMF) if θ lives continuously or discretely on ℝ^k (this is generically referred to as Your prior distribution), and p(θ|y, B) is Your posterior distribution about θ given y and B, which summarizes Your current total information about θ and solves the basic inference problem. These are actually not very good names for p(θ|B) and p(θ|y, B), because (for example) p(θ|B) really stands for all (relevant) information about θ (given B) external to y, whether that information was obtained before or after y arrives, but (a) they do emphasize the sequential nature of learning and (b) through long usage it would be difficult for more accurate names to be adopted. c (here and throughout) is a generic positive normalizing constant, inserted into the first equation in (6) to make the left-hand side integrate (or sum) to 1 (as any coherent distribution must). p(y*|θ, B) is Your sampling distribution for future data values y* given θ and B (and presumably You would use the same sampling distribution p(y|θ, B) for (past) data values y, mentally turning the clock back to a point before the data arrive and thinking about what values of y You might see). This assumes that You are willing to regard Your data as like random draws from a population of possible data values (a heroic assumption in some cases, for instance with observational rather than randomized data; this same assumption arises in the frequentist statistical paradigm, discussed below in Sect. "Comparison with the Frequentist Statistical Paradigm"). l(θ|y, B) is Your likelihood function for θ given y and B, which is defined to be any positive constant multiple of the sampling distribution p(y|θ, B) but re-interpreted as a function of θ for fixed y:

l(θ | y, B) = c · p(y | θ, B) .   (8)
both internal and external to y (quantified in the posterior distribution p( jy; B)), via the likelihood function l( jy; B): You multiply the prior and likelihood pointwise in and normalize so that the posterior distribution p( jy; B) integrates (or sums) to 1. According to the second equation in (6), p(y jy; B), Your (posterior) predictive distribution for future data y given (past) data y and B, which solves the basic prediction problem, must be a weighted average of Your sampling distribution p(y j ; B) weighted by Your current best information p( jy; B) about given y and B; in this integral is the space of possible values of over which Your uncertainty is expressed. (The second equation in (6) contains a simplifying assumption that should be mentioned: in full generality the first term p(y j ; B) inside the integral would be p(y jy; ; B), but it is almost always the case that the information in y is redundant in the presence of complete knowledge of , in which case p(y jy; ; B) D p(y j ; B); this state of affairs could be described by saying that the past and future are conditionally independent given the truth. A simple example of this phenomenon is provided by coin-tossing: if You are watching a Bernoulli( ) process unfold (see Sect. “Inference and Prediction: Binary Outcomes with No Covariates”) whose success probability is unknown to You, the information that 8 of the first 10 tosses have been heads is definitely useful to You in predicting the 11th toss, but if instead You somehow knew that was 0.7, the outcome of the first 10 tosses would be irrelevant to You in predicting any future tosses.) 
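The prior-to-posterior-to-predictive chain in (6) can be made concrete with a small numerical sketch of the coin-tossing example just described, approximating the integrals on a discrete grid of θ values; the uniform prior and the grid resolution here are illustrative choices, not part of the article's development.

```python
# Sketch: Bayes' theorem and the posterior predictive distribution for the
# coin-tossing example (8 heads in 10 tosses), approximated on a grid of
# theta values. The flat prior and grid size N are illustrative assumptions.

N = 1000
grid = [(i + 0.5) / N for i in range(N)]           # theta values in (0, 1)
prior = [1.0 / N] * N                              # flat prior p(theta | B)

# Likelihood for 8 heads in 10 tosses: l(theta | y, B) ∝ theta^8 (1 - theta)^2
lik = [t**8 * (1 - t)**2 for t in grid]

# Posterior: multiply prior and likelihood pointwise, then normalize to sum to 1
unnorm = [p * l for p, l in zip(prior, lik)]
c = 1.0 / sum(unnorm)
post = [c * u for u in unnorm]

# Posterior predictive for the 11th toss: the weighted average of the sampling
# probability p(y* = 1 | theta, B) = theta, weighted by the posterior
p_heads = sum(t * w for t, w in zip(grid, post))
print(round(p_heads, 3))  # close to (8 + 1) / (10 + 2) = 0.75, the exact conjugate answer
```

The grid answer agrees with the closed-form Beta-family result that appears later in the article, which is a useful check on the approximation.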
Finally, in the context of making a choice in the face of uncertainty, A is Your set of possible actions, U(a, θ_0) is the numerical value (utility) You attach to taking action a if the unknown θ is really θ_0 (specified, without loss of generality, so that large utility values are preferred by You), and the third equation in (6) says that to make the choice coherently You should find the action a* that maximizes expected utility (MEU); here the expectation

E_(θ | y, B) U(a, θ) = ∫_Θ U(a, θ) p(θ | y, B) dθ   (9)
is taken over uncertainty in θ as quantified by the posterior distribution p(θ | y, B). This summarizes the entire Bayesian statistical paradigm, which is driven by the three equations in (6). Examples of its use include clinical trial design [56] and analysis [105]; spatio-temporal modeling, with environmental applications [101]; forecasting and dynamic linear models [115]; non-parametric estimation of receiver operating characteristic curves, with applications in medicine and agriculture [49]; finite selection models, with health policy applications [79]; Bayesian CART model search, with applications in breast cancer research [16]; construction of radiocarbon calibration curves, with archaeological applications [15]; factor regression models, with applications to gene expression data [114]; mixture modeling for high-density genotyping arrays, with bioinformatic applications [100]; the EM algorithm for Bayesian fitting of latent process models [76]; state-space modeling, with applications in particle filtering [92]; causal inference [42,99]; hierarchical modeling of DNA sequences, with genetic and medical applications [77]; hierarchical Poisson regression modeling, with applications in health care evaluation [17]; multiscale modeling, with engineering and financial applications [33]; expected posterior prior distributions for model selection [91]; nested Dirichlet processes, with applications in the health sciences [96]; Bayesian methods in the study of sustainable fisheries [74,82]; hierarchical non-parametric meta-analysis, with medical and educational applications [81]; and structural equation modeling of multilevel data, with applications to health policy [19]. Challenges to the paradigm include the following:

Q: How do You specify the sampling distribution/likelihood function that quantifies the information about the unknown θ internal to Your data set y?
A: (1) The solution to this problem, which is common to all approaches to statistical inference, involves imagining future data y* from the same process that has yielded or will yield Your data set y; often the variability You expect in future data values can be quantified (at least approximately) through a standard parametric family of distributions (such as the Bernoulli/binomial for binary data, the Poisson for count data, and the Gaussian for real-valued outcomes), and the parameter vector θ of this family becomes the unknown of interest. (2) Uncertainty in the likelihood function is referred to as model uncertainty [67]; a leading approach to quantifying this source of uncertainty is Bayesian model averaging [18,25,52], in which uncertainty about the models M in an ensemble ℳ of models (specifying ℳ is part of B) is assessed and propagated for a quantity, such as a future data value y*, that is common to all models via the expression

p(y* | y, B) = ∫_ℳ p(y* | y, M, B) p(M | y, B) dM .   (10)
In other words, to make coherent predictions in the presence of model uncertainty You should form a weighted average of the conditional predictive distributions p(y* | y, M, B), weighted by the posterior model probabilities p(M | y, B). Other potentially useful approaches to model uncertainty include Bayesian non-parametric modeling, which is examined in Sect. "Inference: Parametric and Non-Parametric Modeling of Count Data", and methods based on cross-validation [110], in which (in Bayesian language) part of the data is used to specify the prior distribution on ℳ (which is an input to calculating the posterior model probabilities) and the rest of the data is employed to update that prior.

Q: How do You quantify information about the unknown θ external to Your data set y in the prior probability distribution p(θ | B)?

A: (1) There is an extensive literature on elicitation of prior (and other) probabilities; notable references include O'Hagan et al. [85] and the citations given there. (2) If θ is a parameter vector and the likelihood function is a member of the exponential family [11], the prior distribution can be chosen in such a way that the prior and posterior distributions for θ have the same mathematical form (such a prior is said to be conjugate to the given likelihood); this may greatly simplify the computations, and often prior information can (at least approximately) be quantified by choosing a member of the conjugate family (see Sect. "Inference and Prediction: Binary Outcomes with No Covariates" for an example of both of these phenomena).
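When the ensemble contains finitely many models, the integral in (10) reduces to a sum, which the following sketch illustrates; the two models and all the probabilities below are hypothetical numbers, purely for illustration.

```python
# Sketch of Bayesian model averaging with a finite ensemble, where the
# integral in (10) reduces to a sum. The posterior model probabilities and
# the per-model predictions are hypothetical, purely illustrative values.

post_model = {"M1": 0.7, "M2": 0.3}        # p(M | y, B), assumed already computed
pred_given_model = {"M1": 0.6, "M2": 0.2}  # p(y* = 1 | y, M, B) under each model

# p(y* = 1 | y, B): conditional predictions weighted by posterior model probabilities
p_star = sum(post_model[m] * pred_given_model[m] for m in post_model)
print(round(p_star, 2))  # 0.7*0.6 + 0.3*0.2 = 0.48
```

Note that the averaged prediction lies between the two conditional predictions, pulled toward the model with higher posterior probability.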
In situations where it is not precisely clear how to quantify the available information external to y, two sets of tools are available:

Sensitivity analysis [30], also known as pre-posterior analysis [4]: Before the data have begun to arrive, You can (a) generate data similar to what You expect You will see, (b) choose a plausible prior specification and update it to the posterior on the quantities of greatest interest, (c) repeat (b) across a variety of plausible alternatives, and (d) see if there is substantial stability in conclusions across the variations in prior specification. If so, fine; if not, this approach can be combined with hierarchical modeling [68]: You can collect all of the plausible priors and add a layer hierarchically to the prior specification, with the new layer indexing variation across the prior alternatives.

Bayesian robustness [8,95]: If, for example, the context of the problem implies that You only wish to specify that the prior distribution belongs to an infinite-dimensional class (such as, for priors on (0, 1), the class of monotone non-increasing functions)
with (for instance) bounds on the first two moments, You can in turn quantify bounds on summaries of the resulting posterior distribution, which may be narrow enough to demonstrate that Your uncertainty in specifying the prior does not lead to differences that are large in practical terms. Often context suggests specification of a prior that has relatively little information content in relation to the likelihood information; for reasons that are made clear in Sect. "Inference and Prediction: Binary Outcomes with No Covariates", such priors are referred to as relatively diffuse or flat (the term non-informative is sometimes also used, but this seems worth avoiding, because any prior specification takes a particular position regarding the amount of relevant information external to the data). See Bernardo [10] and Kass and Wasserman [60] for a variety of formal methods for generating diffuse prior distributions.

Q: How do You quantify Your utility function U(a, θ) for optimal decision-making?

A: There is a rather less extensive statistical literature on elicitation of utility than of probability; notable references include Fishburn [35,36], Schervish et al. [103], and the citations in Bernardo and Smith [11]. There is a parallel (and somewhat richer) economics literature on utility elicitation; see, for instance, Abdellaoui [1] and Blavatskyy [12]. Sect. "Decision-Making: Variable Selection in Generalized Linear Models; Bayesian Model Selection" provides a decision-theoretic example.

Suppose that θ = (θ_1, …, θ_k) is a parameter vector of length k. Then (a) computing the normalizing constant in Bayes' Theorem,

c = [ ∫ ⋯ ∫ p(y | θ_1, …, θ_k, B) p(θ_1, …, θ_k | B) dθ_1 ⋯ dθ_k ]^(−1) ,   (11)

involves evaluating a k-dimensional integral; (b) the predictive distribution in the second equation in (6) involves another k-dimensional integral; and (c) the posterior p(θ_1, …, θ_k | y, B) is a k-dimensional probability distribution, which for k > 2 can be difficult to visualize, so that attention often focuses on the marginal posterior distributions

p(θ_j | y, B) = ∫ ⋯ ∫ p(θ_1, …, θ_k | y, B) dθ_{−j}   (12)

for j = 1, …, k, where θ_{−j} is the θ vector with component j omitted; each of these marginal distributions involves a (k − 1)-dimensional integral. If k is large these
integrals can be difficult or impossible to evaluate exactly, and a general method for computing accurate approximations to them proved elusive from the time of Bayes in the eighteenth century until recently (in the late eighteenth century Laplace [63] developed an analytical method, which today bears his name, for approximating integrals that arise in Bayesian work [11], but his method is not as general as the computationally-intensive techniques in widespread current use). Around 1990 there was a fundamental shift in Bayesian computation, with the belated discovery by the statistics profession of a class of techniques – Markov chain Monte Carlo (MCMC) methods [41,44] – for approximating high-dimensional Bayesian integrals in a computationally-intensive manner, which had been published in the chemical physics literature in the 1950s [78]; these methods came into focus for the Bayesian community at a moment when desktop computers had finally become fast enough to make use of such techniques. MCMC methods approximate integrals associated with the posterior distribution p(θ | y, B) by (a) creating a Markov chain whose equilibrium distribution is the desired posterior and (b) sampling from this chain from an initial value θ^(0) (i) until equilibrium has been reached (all draws up to this point are typically discarded) and (ii) for a sufficiently long period thereafter to achieve the desired approximation accuracy.
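Steps (a) and (b) can be illustrated with a minimal random-walk Metropolis sampler (one member of the MCMC family); the Beta target here stands in for a posterior p(θ | y, B) of the kind derived later in the article, and the proposal scale, chain length, and burn-in are arbitrary illustrative choices, not recommendations.

```python
import math, random

# Minimal random-walk Metropolis sketch targeting a Beta(a, b) density as a
# stand-in for a posterior p(theta | y, B). All tuning choices (proposal
# standard deviation, chain length, burn-in) are illustrative.

random.seed(1)
a, b = 73.5, 341.5                      # e.g. a Beta posterior's parameters

def log_post(theta):
    # log of the (unnormalized) Beta(a, b) density
    if not 0.0 < theta < 1.0:
        return -math.inf
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

theta, draws = 0.5, []
for i in range(60000):
    prop = theta + random.gauss(0.0, 0.05)          # symmetric proposal
    # Metropolis acceptance: move with probability min(1, post(prop)/post(theta))
    if math.log(random.random()) < log_post(prop) - log_post(theta):
        theta = prop
    if i >= 10000:                                   # discard burn-in draws
        draws.append(theta)

mcmc_mean = sum(draws) / len(draws)
print(round(mcmc_mean, 3))                           # near the exact mean a / (a + b)
```

The Monte Carlo average of the retained draws approximates the posterior mean a / (a + b) ≈ 0.177, replacing an exact integral with a long-run chain average.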
With the advent and refinement of MCMC methods since 1990, the Bayesian integration problem has been solved for a wide variety of models, with more ambitious sampling schemes made possible year after year with increased computing speeds: for instance, in problems in which the dimension of the parameter space is not fixed in advance (an example is regression changepoint problems [104], where the outcome y is assumed to depend linearly (apart from stochastic noise) on the predictor(s) x but with an unknown number of changes of slope and intercept and unknown locations for those changes), ordinary MCMC techniques will not work; in such problems methods such as reversible-jump MCMC [47,94] and Markov birth-death processes [108], which create Markov chains that permit trans-dimensional jumps, are required. The main drawback of MCMC methods is that they do not necessarily scale well as n (the number of data observations) increases; one alternative, popular in the machine learning community, is variational methods [55], which convert the integration problem into an optimization problem by (a) approximating the posterior distribution of interest by a family of distributions yielding a closed-form approximation to the integral
and (b) finding the member of the family that maximizes the accuracy of the approximation. Bayesian decision theory [6], based on maximizing expected utility, is unambiguous in its normative recommendation for how a single agent (You) should make a choice in the face of uncertainty, and it has had widespread success in fields such as economics (e. g., [2,50]) and medicine (e. g., [87,116]). It is well known, however [3,112], that Bayesian decision theory (or indeed any other formal approach that seeks an optimal behavioral choice) can be problematic when used normatively for group decision-making, because of conflicts in preferences among members of the group. This is an important unsolved problem.
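For a single agent, the MEU criterion in the third equation of (6) is mechanical once the posterior and the utilities are in hand; the following sketch uses a discrete action set, a grid-approximated Beta-shaped posterior, and entirely hypothetical utility values (nothing in the article specifies these actions or numbers).

```python
# Sketch of maximizing expected utility (MEU) over a discrete action set,
# with posterior uncertainty about theta approximated on a grid. The two
# actions and their utility functions are hypothetical, purely illustrative.

N = 400
grid = [(i + 0.5) / N for i in range(N)]

# A Beta(73.5, 341.5)-shaped posterior on the grid, normalized numerically
w = [t**72.5 * (1 - t)**340.5 for t in grid]
s = sum(w)
post = [x / s for x in w]

def utility(action, theta):
    # Hypothetical utilities U(a, theta); theta plays the role of a death rate
    if action == "intervene":
        return 10.0 - 40.0 * theta   # mitigates losses when theta is large
    return 12.0 - 60.0 * theta       # "do_nothing": better only for small theta

# Expected utility of each action under the posterior; choose the maximizer
eu = {a: sum(utility(a, t) * p for t, p in zip(grid, post))
      for a in ("intervene", "do_nothing")}
best = max(eu, key=eu.get)
print(best, round(eu[best], 2))
```

With the posterior mass concentrated near θ ≈ 0.18, above the break-even point θ = 0.1 of these two utility lines, "intervene" maximizes expected utility.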
Three Examples

Inference and Prediction: Binary Outcomes with No Covariates

Consider the problem of measuring the quality of care at a particular hospital H. One way to do this is to examine the outcomes of that care, such as mortality, after adjusting for the burden of illness brought by the patients to H on admission. As an even simpler version of this problem, consider just the n binary mortality observables y = (y_1, …, y_n) (with mortality measured within 30 days of admission, say; 1 = died, 0 = lived) that You will see from all of the patients at H with a particular admission diagnosis (heart attack, say) during some prespecified future time window. You acknowledge Your uncertainty about which elements in the sequence will be 0s and which 1s, and You wish to quantify this uncertainty using the Bayesian paradigm. As de Finetti [20] noted, in this situation Your fundamental imperative is to construct a predictive distribution p(y_1, …, y_n | B) that expresses Your uncertainty about the future observables, rather than – as is perhaps more common – to reach immediately for a standard family of parametric models for the y_i (in other words, to posit the existence of a vector θ = (θ_1, …, θ_k) of parameters and to model the observables by appeal to a family p(y_i | θ, B) of probability distributions indexed by θ). Even though the y_i are binary, with all but the smallest values of n it still seems a formidable task to elicit from Yourself an n-dimensional predictive distribution p(y_1, …, y_n | B). De Finetti [20] showed, however, that the task is easier than it seems. In the absence of any further information about the patients, You notice that Your uncertainty about them is exchangeable: if someone (without telling You) were to rearrange the order in which their mortality outcomes become known to You, Your predictive distribution would not change. This still seems to leave p(y_1, …, y_n | B) substantially unspecified (where B now includes the judgment of exchangeability of the y_i), but de Finetti [20] proved a remarkable theorem which shows (in effect) that all exchangeable predictive distributions for a vector of binary observables are representable as mixtures of Bernoulli sampling distributions: if You're willing to regard (y_1, …, y_n) as the first n terms in an infinitely exchangeable binary sequence (y_1, y_2, …) (which just means that every finite subsequence is exchangeable), then to achieve coherence Your predictive distribution must be expressible as

p(y_1, …, y_n | B) = ∫_0^1 θ^(s_n) (1 − θ)^(n − s_n) p(θ | B) dθ ,   (13)
where s_n = Σ_{i=1}^n y_i. Here the quantity θ on the right side of (13) is more than just an integration variable: the equation says that in Your predictive modeling of the binary y_i You may as well proceed as if

– There is a quantity called θ, interpretable both as the marginal death probability p(y_i = 1 | θ, B) for each patient and as the long-run mortality rate in the infinite sequence (y_1, y_2, …) (which serves, in effect, as a population of values to which conclusions from the data can be generalized);
– Conditional on θ and B, the y_i are independent identically distributed (IID) Bernoulli(θ); and
– θ can be viewed as a realization of a random variable with density p(θ | B).

In other words, exchangeability of Your uncertainty about a binary process is functionally equivalent to assuming the simple Bayesian hierarchical model [27]

(θ | B) ~ p(θ | B) ,
(y_i | θ, B) ~ IID Bernoulli(θ) ,   (14)
and p(θ | B) is recognizable as Your prior distribution for θ, the underlying death rate for heart attack patients similar to those You expect will arrive at hospital H during the relevant time window. Consider now the problem of quantitatively specifying prior information about θ. From (13) and (14) the likelihood function is

l(θ | y, B) = c θ^(s_n) (1 − θ)^(n − s_n) ,   (15)
which (when interpreted in the Bayesian manner as a density in θ) is recognizable as a member of the Beta family of probability distributions: for α, β > 0 and 0 < θ < 1,

θ ~ Beta(α, β)  iff  p(θ) = c θ^(α − 1) (1 − θ)^(β − 1) .   (16)
Moreover, this family has the property that the product of two Beta densities is another Beta density, so by Bayes' Theorem if the prior p(θ | B) is chosen to be Beta(α, β) for some (as-yet unspecified) α > 0 and β > 0, then the posterior will be Beta(α + s_n, β + n − s_n): this is conjugacy (Sect. "The Bayesian Statistical Paradigm") of the Beta family for the Bernoulli/binomial likelihood. In this case the conjugacy leads to a simple interpretation of α and β: the prior acts like a data set with α 1s and β 0s, in the sense that if person 1 does a Bayesian analysis with a Beta(α, β) prior and sample data y = (y_1, …, y_n) and person 2 instead merges the corresponding "prior data set" with y and does a maximum-likelihood analysis (Sect. "Comparison with the Frequentist Statistical Paradigm") on the resulting merged data, the two people will get the same answers. This also shows that the prior sample size n_0 in the Beta–Bernoulli/binomial model is (α + β). Given that the mean of a Beta(α, β) distribution is α / (α + β), calculation reveals that the posterior mean (α + s_n) / (α + β + n) of θ is a weighted average of the prior mean and the data mean ȳ = (1/n) Σ_{i=1}^n y_i, with prior and data weights n_0 and n, respectively:

(α + s_n) / (α + β + n) = [ n_0 · α / (α + β) + n ȳ ] / (n_0 + n) .   (17)
These facts shed intuitive light on how Bayes' Theorem combines information internal and external to a given data source: thinking of prior information as equivalent to a data set is a valuable intuition, even in non-conjugate settings.

The choice of α and β naturally depends on the available information external to y. Consider for illustration two such specifications:

Analyst 1 does a web search and finds that the 30-day mortality rate for heart attack (given average quality of care and average patient sickness at admission) in her country is 15%. The information she has about hospital H is that its care and patient sickness are not likely to be wildly different from the country averages but that a mortality deviation from the mean, if present, would be more likely to occur on the high side than the low. Having lived in the community served by H for some time and having not heard anything either outstanding or deplorable about the hospital, she would be surprised to find that the underlying heart attack death rate at H was less than (say) 5% or greater than (say) 30%. One way to quantify this information is to set the prior mean to 15% and to place (say) 95% of the prior mass between 5% and 30%. Numerical integration reveals that (α, β) = (4.5, 25.5), with a prior sample size of 30.0, satisfies Analyst 1's constraints.

Analyst 2 has little information external to y and thus wishes to specify a relatively diffuse prior distribution that does not dramatically favor any part of the unit interval. Such a diffuse prior evidently corresponds to a rather small prior sample size; a variety of positive values of α and β near 0 are possible, all of which will lead to a relatively flat prior.

Suppose for illustration that the time period in question is about four years in length and H is a medium-size US hospital; then there will be about n = 385 heart attack patients in the data set y. Suppose further that the observed mortality rate at H comes out ȳ = s_n / n = 69 / 385 ≈ 18%. Figure 1 summarizes the prior-to-posterior updating with this data set and the two priors, for Analysts 1 (left panel) and 2 (right panel), with α = β = 1 (the Uniform distribution) for Analyst 2. Even though the two priors are rather different – Analyst 1's prior is skewed, with a prior mean of 0.15 and n_0 = 30; Analyst 2's prior is flat, with a prior mean of 0.5 and n_0 = 2 – it is evident that the posterior distributions are nearly the same in both cases; this is because the data sample size n = 385 is so much larger than either of the prior sample sizes, so that the likelihood information dominates. With both priors the likelihood and posterior distributions are nearly the same, another consequence of n_0 ≪ n.

[Bayesian Statistics, Figure 1: Prior-to-posterior updating with two prior specifications in the mortality data set (in both panels, prior: long dotted lines; likelihood: short dotted lines; posterior: solid lines). The left and right panels give the updating with the priors for Analysts 1 and 2, respectively.]
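The posterior summaries for the two analysts follow in closed form from the conjugate Beta updating described above; this sketch computes them and verifies the weighted-average identity (17) along the way.

```python
import math

# Closed-form conjugate Beta-Bernoulli updating for the two analysts' priors,
# using the results in the text: the posterior is Beta(alpha + s_n, beta + n - s_n),
# and its mean obeys the weighted-average identity (17).

n, s = 385, 69                  # heart attack patients and 30-day deaths
ybar = s / n

def update(alpha, beta):
    a, b = alpha + s, beta + n - s                       # posterior parameters
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    n0 = alpha + beta                                    # prior sample size
    # sanity check: the weighted-average identity (17)
    assert abs(mean - (n0 * alpha / (alpha + beta) + n * ybar) / (n0 + n)) < 1e-12
    return mean, sd

m1, sd1 = update(4.5, 25.5)     # Analyst 1's skewed prior
m2, sd2 = update(1.0, 1.0)      # Analyst 2's Uniform prior
print(round(m1, 3), round(sd1, 4))   # 0.177 0.0187
print(round(m2, 3), round(sd2, 4))   # 0.181 0.0195
```

Despite the rather different priors, the two posterior means differ by less than half a percentage point, illustrating the dominance of the likelihood when n_0 ≪ n.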
For Analyst 1 the posterior mean, standard deviation, and 95% central posterior interval for θ are (0.177, 0.0187, (0.142, 0.215)), and the corresponding numerical results for Analyst 2 are (0.181, 0.0195, (0.144, 0.221)); again it is clear that the two sets of results are almost identical. With a large sample size, careful elicitation – like that undertaken by Analyst 1 – will often yield results similar to those with a diffuse prior. The posterior predictive distribution p(y_{n+1} | y_1, …, y_n, B) for the next observation, having observed the first n, is also straightforward to calculate in closed form with the conjugate prior in this model. It is clear that p(y_{n+1} | y, B) has to be a Bernoulli(θ*) distribution for some θ*, and
intuition says that θ* should just be the mean α* / (α* + β*) of the posterior distribution for θ given y, in which α* = α + s_n and β* = β + n − s_n are the parameters of the Beta posterior. To check this, making use of the fact that the normalizing constant in the Beta(α, β) family is Γ(α + β) / [Γ(α) Γ(β)], the second equation in (6) gives

p(y_{n+1} | y_1, …, y_n, B)
  = ∫_0^1 θ^(y_{n+1}) (1 − θ)^(1 − y_{n+1}) · [Γ(α* + β*) / (Γ(α*) Γ(β*))] θ^(α* − 1) (1 − θ)^(β* − 1) dθ
  = [Γ(α* + β*) / (Γ(α*) Γ(β*))] ∫_0^1 θ^((α* + y_{n+1}) − 1) (1 − θ)^((β* − y_{n+1} + 1) − 1) dθ
  = [Γ(α* + y_{n+1}) Γ(β* − y_{n+1} + 1) / (Γ(α*) Γ(β*))] · [Γ(α* + β*) / Γ(α* + β* + 1)] ;   (18)

this, combined with the fact that Γ(x + 1) / Γ(x) = x for any real x, yields, for example in the case y_{n+1} = 1,

p(y_{n+1} = 1 | y, B) = [Γ(α* + 1) / Γ(α*)] · [Γ(α* + β*) / Γ(α* + β* + 1)] = α* / (α* + β*) ,   (19)
confirming intuition.

Inference: Parametric and Non-Parametric Modeling of Count Data

Most elderly people in the Western world say they would prefer to spend the end of their lives at home, but many instead finish their lives in an institution (a nursing home or hospital). How can elderly people living in their communities be offered health and social services that would help to prevent institutionalization? Hendriksen et al. [51] conducted an experiment in the 1980s in Denmark to test the effectiveness of in-home geriatric assessment (IHGA), a form of preventive medicine in which each person's medical and social needs are assessed and acted upon individually. A total of n = 572 elderly people living in noninstitutional settings in a number of villages were randomized, n_C = 287 to a control group, who received standard health care, and n_T = 285 to a treatment group, who received standard care plus IHGA. The number of hospitalizations during the two-year life of the study was an outcome of particular interest. The data are presented and summarized in Table 1. Evidently IHGA lowered the mean hospitalization rate per
two years (for the elderly Danish people in the study, at least) by (0.944 − 0.768) = 0.176, which is about an 18% reduction from the control level, a clinically large difference.

Bayesian Statistics, Table 1: Distribution of number of hospitalizations in the IHGA study

              Number of Hospitalizations              Sample
  Group       0    1   2   3   4  5  6  7    n    Mean   Variance
  Control     138  77  46  12  8  4  0  2    287  0.944  1.54
  Treatment   147  83  37  13  3  1  1  0    285  0.768  1.02

The question then becomes, in Bayesian inferential language: what is the posterior distribution for the treatment effect in the entire population P of patients judged exchangeable with those in the study?

Continuing to refer to the relevant analyst as You: with a binary outcome variable and no covariates in Sect. "Inference and Prediction: Binary Outcomes with No Covariates" the model arose naturally from a judgment of exchangeability of Your uncertainty about all n outcomes, but such a judgment of unconditional exchangeability would not be appropriate initially here; to make such a judgment would be to assert that the treatment and control interventions have the same effect on hospitalization, and it was the point of the study to see if this is true. Here, at least initially, it would be more scientifically appropriate to assert exchangeability separately and in parallel within the two experimental groups, a judgment de Finetti [22] called partial exchangeability and which has more recently been referred to as conditional exchangeability [28,72] given the treatment/control status covariate.

Considering for the moment just the control group outcome values C_i, i = 1, …, n_C, and seeking as in Sect. "Inference and Prediction: Binary Outcomes with No Covariates" to model them via a predictive distribution p(C_1, …, C_{n_C} | B), de Finetti's previous representation theorem is not available because the outcomes are real-valued rather than binary, but he proved [21] another theorem for this situation as well: if You're willing to regard (C_1, …, C_{n_C}) as the first n_C terms in an infinitely exchangeable sequence (C_1, C_2, …) of values on R (which plays the role of the population P, under the control condition, in this problem), then to achieve coherence Your predictive distribution must be expressible as

p(C_1, …, C_{n_C} | B) = ∫_ℱ ∏_{i=1}^{n_C} F(C_i) dG(F | B) ;   (20)

here (a) F has an interpretation as F(t) = lim_{n_C → ∞} F_{n_C}(t), where F_{n_C} is the empirical CDF based on (C_1, …, C_{n_C}); (b) G(F | B) = lim_{n_C → ∞} p(F_{n_C} | B), where p(· | B) is Your joint probability distribution on (C_1, C_2, …); and (c) ℱ is the space of all possible CDFs on R. Equation (20) says informally that exchangeability of Your uncertainty about an observable process unfolding on the real line is functionally equivalent to assuming the Bayesian hierarchical model

(F | B) ~ p(F | B) ,
(y_i | F, B) ~ IID F ,   (21)

where p(F | B) is a prior distribution on ℱ. Placing distributions on functions, such as CDFs and regression surfaces, is the topic addressed by the field of Bayesian non-parametric (BNP) modeling [24,80], an area of statistics that has recently moved completely into the realm of day-to-day implementation and relevance through advances in MCMC computational methods. Two rich families of prior distributions on CDFs about which a wealth of practical experience has recently accumulated include (mixtures of) Dirichlet processes [32] and Pólya trees [66].

Parametric modeling is of course also possible with the IHGA data: as noted by Krnjajić et al. [62], who explore both parametric and BNP models for data of this kind, Poisson modeling is a natural choice, since the outcome consists of counts of relatively rare events. The first Poisson model to which one would generally turn is a fixed-effects model, in which (C_i | λ_C) are IID Poisson(λ_C) (i = 1, …, n_C = 287) and (T_j | λ_T) are IID Poisson(λ_T) (j = 1, …, n_T = 285), with a diffuse prior on (λ_C, λ_T) if little is known, external to the data set, about the underlying hospitalization rates in the control and treatment groups. However, the last two columns of Table 1 reveal that the sample variance is noticeably larger than the sample mean in both groups, indicating substantial Poisson over-dispersion. For a second, improved, parametric model this suggests a random-effects Poisson model of the form

(C_i | λ_iC) ~ indep. Poisson(λ_iC) ,
(log(λ_iC) | β_0C, σ²_C) ~ IID N(β_0C, σ²_C) ,   (22)

and similarly for the treatment group, with diffuse priors for (β_0C, σ²_C, β_0T, σ²_T). As Krnjajić et al. [62] note, from a medical point of view this model is more plausible than the fixed-effects formulation: each patient in the control group has his/her own latent (unobserved) underlying rate of hospitalization λ_iC, which may well differ from the underlying rates of the other control patients because of unmeasured differences in factors such as health status at the beginning of the experiment (and similarly for the treatment group).
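The over-dispersion that motivates the random-effects model is easy to reproduce by simulation: mixing Poisson counts over a Lognormal distribution of per-subject rates inflates the variance above the mean, just as in Table 1. The parameter values below are illustrative choices, not estimates from the IHGA data.

```python
import math, random, statistics

# Simulating the random-effects structure of model (22): each subject gets a
# latent rate lambda_i with log(lambda_i) ~ N(beta0, sigma^2), then a
# Poisson(lambda_i) count. beta0 and sigma are illustrative; the point is
# that the mixture is over-dispersed (variance > mean).

random.seed(4)
beta0, sigma = math.log(0.7), 0.8

def poisson(lam):
    # Knuth's multiplication method for Poisson draws (adequate for small rates)
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

counts = []
for _ in range(20000):
    lam = math.exp(random.gauss(beta0, sigma))   # latent per-subject rate
    counts.append(poisson(lam))

m, v = statistics.mean(counts), statistics.variance(counts)
print(round(m, 2), round(v, 2))   # the sample variance exceeds the sample mean
```

A pure fixed-effects Poisson model would force these two summaries to be (approximately) equal, which is exactly what Table 1 contradicts.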
Model (22), when complemented by its analogue in the treatment group, specifies a Lognormal mixture of Poisson distributions for each group and is straightforward to fit by MCMC, but the Gaussian assumption for the mixing distribution is conventional, not motivated by the underlying science of the problem, and if the distribution of the latent variables is not Gaussian – for example, if it is multimodal or skewed – model (22) may well lead to incorrect inferences. Krnjajić et al. [62] therefore also examine several BNP models that are centered on the random-effects Poisson model but which permit learning about the true underlying distribution of the latent variables instead of assuming it is Gaussian. One of their models, when applied (for example) to the control group, was

(C_i | λ_iC) ~ indep. Poisson(λ_iC) ,
(log(λ_iC) | G) ~ IID G ,
(G | α, μ, σ²) ~ DP[α · N(μ, σ²)] .   (23)
Here DP[α · N(μ, σ²)] refers to a Dirichlet process prior distribution, on the CDF G of the latent variables, which is centered at the N(μ, σ²) model with precision parameter α. Model (23) is an expansion of the random-effects Poisson model (22), in that the latter is a special case of the former (obtained by letting α → ∞). Model expansion is a common Bayesian analytic tool which helps to assess and propagate model uncertainty: if You are uncertain about a particular modeling detail, instead of fitting a model that assumes this detail is correct with probability 1, embed it in a richer model class of which it is a special case, and let the data tell You about its plausibility. With the IHGA data, models (22) and (23) turned out to arrive at similar inferential conclusions – in both cases point estimates of the ratio of the treatment mean to the control mean were about 0.82 with a posterior standard deviation of about 0.09, and a posterior probability that the (population) mean ratio was less than 1 of about 0.95 – so that evidence is strong that IHGA lowers mean hospitalizations not just in the sample but in the collection P of elderly people to whom it is appropriate to generalize. But the two modeling approaches need not yield similar results: if the latent variable distribution is far from Gaussian, model (22) will not be able to adjust to this violation of one of its basic assumptions. Krnjajić et al. [62] performed a simulation study in which data sets with 300 observations were generated from various Gaussian and non-Gaussian latent variable distributions and a variety of parametric and BNP models were fit to the resulting count data; Fig. 2 summarizes the prior and posterior predictive distributions from models (22; top panel)
and (23; bottom panel) with a bimodal latent variable distribution. The parametric Gaussian random-effects model cannot fit the bimodality on the data scale, but the BNP model – even though centered on the Gaussian as the random-effects distribution – adapts smoothly to the underlying bimodal reality. Decision-Making: Variable Selection in Generalized Linear Models; Bayesian Model Selection Variable selection (choosing the “best” subset of predictors) in generalized linear models is an old problem, dating back at least to the 1960s, and many methods [113] have been proposed to try to solve it; but virtually all of them ignore an aspect of the problem that can be important: the cost of data collection of the predictors. An example, studied by Fouskakis and Draper [39], which is an elaboration of the problem examined in Sect. “Inference and Prediction: Binary Outcomes with No Covariates”, arises in the field of quality of health care measurement, where patient sickness at admission is often assessed by using logistic regression of an outcome, such as mortality within 30 days of admission, on a fairly large number of sickness indicators (on the order of 100) to construct a sickness scale, employing standard variable selection methods (for instance, backward selection from a model with all predictors) to find an “optimal” subset of 10–20 indicators that predict mortality well. The problem with such benefit-only methods is that they ignore the considerable differences among the sickness indicators in the cost of data collection; this issue is crucial when admission sickness is used to drive programs (now implemented or under consideration in several countries, including the US and UK) that attempt to identify substandard hospitals by comparing observed and expected mortality rates (given admission sickness), because such quality of care investigations are typically conducted under cost constraints. 
Bayesian Statistics, Figure 2: Prior (open circles) and posterior (solid circles) predictive distributions under models (22) and (23) (top and bottom panels, respectively), based on a data set generated from a bimodal latent variable distribution. In each panel, the histogram plots the simulated counts.

When both data-collection cost and accuracy of prediction of 30-day mortality are considered, a large variable-selection problem arises in which the only variables that make it into the final scale should be those that achieve a good cost-benefit tradeoff. Variable selection is an example of the broader process of model selection, in which questions such as "Is model M_1 better than M_2?" and "Is M_1 good enough?" arise. These inquiries cannot be addressed, however, without first answering a new set of questions: good enough (better than) for what purpose? Specifying this purpose [26,57,61,70] identifies model selection as a decision problem that should be approached by constructing a contextually relevant utility function and maximizing expected utility. Fouskakis and Draper [39] create a utility
function, for variable selection in their severity of illness problem, with two components that are combined additively: a data-collection component (in monetary units, such as US$), which is simply the negative of the total amount of money required to collect data on a given set of patients with a given subset of the sickness indicators; and a predictive-accuracy component, in which a method
is devised to convert increased predictive accuracy into decreased monetary cost by thinking about the consequences of labeling a hospital with bad quality of care "good" and vice versa. One aspect of their work, with a data set (from a RAND study: [58]) involving p = 83 sickness indicators gathered on a representative sample of n = 2,532 elderly American patients hospitalized in the period 1980–86 with pneumonia, focused only on the p = 14 variables in the original RAND sickness scale; this subset was chosen because 2^14 = 16,384 was a small enough number of possible models to permit brute-force enumeration of the estimated expected utility (EEU) of all the models.

Bayesian Statistics, Figure 3: Estimated expected utility of all 16,384 variable subsets in the quality of care study based on RAND data.

Figure 3 is a parallel boxplot of the EEUs of all 16,384 variable subsets, with the boxplots sorted by the number of variables in each model. The model with no predictors does poorly, with an EEU of about −US$14.5, but from a cost-benefit point of view the RAND sickness scale with all 14 variables is even worse (−US$15.7), because it includes expensive variables that do not add much to the predictive power in relation to cheaper variables that predict almost as well. The best subsets have 4–6 variables and would save about US$8 per patient when compared with the entire 14-variable scale; this would amount to significant savings if the observed-versus-expected assessment method were applied widely.

Returning to the general problem of Bayesian model selection, two cases can be distinguished: situations in which the precise purpose to which the model will be put can be specified (as in the variable-selection problem above), and settings in which at least some of the end uses to which the modeling will be put are not yet known. In
this second situation it is still helpful to reason in a decision-theoretic way: the hallmark of a good (bad) model is that it makes good (bad) predictions, so a utility function based on predictive accuracy can be a good general-purpose choice. With (a) a single sample of data y, (b) a future data value y*, and (c) two models M_j (j = 1, 2) for illustration, what is needed is a scoring rule that measures the discrepancy between y* and its predictive distribution p(y* | y, M_j, B) under model M_j. It turns out [46,86] that the optimal (impartial, symmetric, proper) scoring rules are linear functions of log p(y* | y, M_j, B), which has a simple intuitive motivation: if the predictive distribution is Gaussian, for example, then values of y* close to the center (in other words, those for which the prediction has been good) will receive a greater reward than those in the tails. An example [65], in this one-sample setting, of a model selection criterion (a) based on prediction, (b) motivated by utility considerations and (c) with good model discrimination properties [29] is the full-sample log score

LS_FS(M_j | y, B) = (1/n) Σ_{i=1}^n log p(y_i | y, M_j, B) .     (24)
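The full-sample log score (24) is easy to compute whenever the posterior predictive distribution is available in closed form. The following sketch uses an illustrative conjugate setup (data, prior values, and the Poisson–Gamma model choice are all assumptions for the example, not from the source): for counts y_i ~ Poisson(λ) with λ ~ Gamma(a, b), the posterior predictive for a single future count is Negative Binomial.

```python
import math

# Illustrative count data and two candidate priors (hypothetical values)
y = [0, 1, 1, 2, 0, 3, 1, 0, 2, 1]
n, s = len(y), sum(y)

def nb_logpmf(k, r, p):
    # Negative Binomial pmf C(k+r-1, k) p^r (1-p)^k, computed via log-gammas
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log(1.0 - p))

def log_score(a, b):
    """Full-sample log score (24) under y_i ~ Poisson(lam), lam ~ Gamma(a, b):
    the posterior predictive is Negative Binomial(a + s, (b + n)/(b + n + 1))."""
    r, p = a + s, (b + n) / (b + n + 1.0)
    return sum(nb_logpmf(k, r, p) for k in y) / n

ls1 = log_score(a=1.0, b=1.0)   # M1: fairly diffuse prior
ls2 = log_score(a=50.0, b=5.0)  # M2: prior concentrated near lam = 10
# The model with the larger (less negative) log score is preferred;
# here the data have mean about 1.1, so M1 scores better than M2.
```

The same recipe applies with MCMC output: replace the closed-form predictive density by its Monte Carlo estimate at each y_i.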
This log score is related to the conditional predictive ordinate criterion [90]. Other Bayesian model selection criteria in current use include the following:

Bayes factors [60]: Bayes' Theorem, written in odds form for discriminating between models M_1 and M_2, says that

p(M_1 | y, B) / p(M_2 | y, B) = [ p(M_1 | B) / p(M_2 | B) ] · [ p(y | M_1, B) / p(y | M_2, B) ] ;     (25)

here the prior odds in favor of M_1, p(M_1 | B) / p(M_2 | B), are multiplied by the Bayes factor p(y | M_1, B) / p(y | M_2, B) to produce the posterior odds p(M_1 | y, B) / p(M_2 | y, B). According to the logic of this criterion, models with high posterior probability are to be preferred, and if all the models under consideration are equally plausible a priori this reduces to preferring models with larger Bayes factors in their favor. One problem with this approach is that – in parametric models in which model M_j has parameter vector θ_j defined on parameter space Θ_j – the integrated likelihoods p(y | M_j, B) appearing in the Bayes factor can be expressed as

p(y | M_j, B) = ∫_{Θ_j} p(y | θ_j, M_j, B) p(θ_j | M_j, B) dθ_j = E_{(θ_j | M_j, B)} [ p(y | θ_j, M_j, B) ] .     (26)

In other words, the numerator and denominator ingredients in the Bayes factor are each expressible as expectations of likelihood functions with respect to the prior distributions on the model parameters, and if context suggests that these priors should be specified diffusely the resulting Bayes factor can be unstable as a function of precisely how the diffuseness is specified. Various attempts have been made to remedy this instability of Bayes factors (for example, {partial, intrinsic, fractional} Bayes factors, well-calibrated priors, conventional priors, intrinsic priors, expected posterior priors, . . . ; [9]); all of these methods appear to require an appeal to ad-hockery which is absent from the log score approach.

Deviance Information Criterion (DIC): Given a parametric model p(y | θ_j, M_j, B), Spiegelhalter et al. [106] define the deviance information criterion (DIC) (by analogy with other information criteria) to be a tradeoff between (a) an estimate of the model lack of fit, as measured by the deviance D(θ̄_j) (where θ̄_j is the posterior mean of θ_j under M_j; for the purpose of DIC, the deviance of a model [75] is minus twice the logarithm of the likelihood for that model), and (b) a penalty for model complexity equal to twice the effective number of parameters p_{D_j} of the model:

DIC(M_j | y, B) = D(θ̄_j) + 2 p̂_{D_j} .     (27)

When p_{D_j} is difficult to read directly from the model (for example, in complex hierarchical models, especially those with random effects), Spiegelhalter et al. motivate the following estimate, which is easy to compute from standard MCMC output:

p̂_{D_j} = D̄(θ_j) − D(θ̄_j) ;     (28)

in other words, p̂_{D_j} is the difference between the posterior mean of the deviance and the deviance evaluated at the posterior mean of the parameters. DIC is available as an option in several MCMC packages, including WinBUGS [107] and MLwiN [93]. One difficulty with DIC is that the MCMC estimate of p_{D_j} can be poor if the marginal posteriors for one or more parameters (using the parameterization that defines the deviance) are far from Gaussian; reparameterization (onto parameter scales where the posteriors are approximately Normal) helps but can still lead to mediocre estimates of p_{D_j}.

Other notable recent references on the subject of Bayesian variable selection include Brown et al. [13], who examine multivariate regression in the context of compositional data, and George and Foster [43], who use empirical Bayes methods in the Gaussian linear model.

Comparison with the Frequentist Statistical Paradigm

Strengths and Weaknesses of the Two Approaches

Frequentist statistics, which has concentrated mainly on inference, proceeds by (i) thinking of the values in a data set y as like a random sample from a population P (a set to which it is hoped that conclusions based on the data can validly be generalized), (ii) specifying a summary θ of interest in P (such as the population mean of the outcome variable), (iii) identifying a function θ̂ of y that can
serve as a reasonable estimate of θ, (iv) imagining repeating the random sampling from P to get other data sets and therefore other values of θ̂, and (v) using the random behavior of θ̂ across these repetitions to make inferential probability statements involving θ. A leading implementation of the frequentist paradigm [37] is based on using the value θ̂_MLE that maximizes the likelihood function as the estimate of θ and obtaining a measure of uncertainty for θ̂_MLE from the curvature of the logarithm of the likelihood function at its maximum; this is maximum likelihood inference.

Each of the frequentist and Bayesian approaches to statistics has strengths and weaknesses. The frequentist paradigm has the advantage that repeated-sampling calculations are often more tractable than manipulations with conditional probability distributions, and it has the clear strength that it focuses attention on the scientifically important issue of calibration: in settings where the true data-generating process is known (e.g., in simulations of random sampling from a known population P), how often does a particular method of statistical inference recover known truth? The frequentist approach has the disadvantage that it only applies to inherently repeatable phenomena, and therefore cannot be used to quantify uncertainty about many true-false propositions of real-world interest (for example, if You are a doctor to whom a new patient (male, say) has just come, strictly speaking You cannot talk about the frequentist probability that this patient is HIV positive; he either is or he is not, and his arriving at Your office is not the outcome of any repeatable process that is straightforward to identify).
In practice the frequentist approach also has the weaknesses that (a) model uncertainty is more difficult to assess and propagate in this paradigm, (b) predictive uncertainty assessments are not always straightforward to create from the frequentist point of view (the bootstrap [31] is one possible solution) and (c) inferential calibration may not be easy to achieve when the sample size n is small. An example of several of these drawbacks arises in the construction of confidence intervals [83], in which repeated-sampling statements such as

P_f ( θ̂_low < θ < θ̂_high ) = 0.95     (29)

(where P_f quantifies the frequentist variability in θ̂_low and θ̂_high across repeated samples from P) are interpreted in the frequentist paradigm as suggesting that the unknown θ lies between θ̂_low and θ̂_high with 95% "confidence." Two difficulties with this are that
(a) equation (29) looks like a probability statement about θ but is not, because in the frequentist approach θ is a fixed unknown constant that cannot be described probabilistically, and (b) with small sample sizes nominal 95% confidence intervals based on maximum likelihood estimation can have actual coverage (the percentage of time in repeated sampling that the interval includes the true θ) substantially less than 95%.

The Bayesian approach has the following clear advantages: (a) it applies (at least in principle) to uncertainty about anything, whether associated with a repeatable process or not; (b) inference is unambiguously based on the first equation in (6), without the need to face questions such as what constitutes a "reasonable" estimate of θ (step (iii) in the frequentist inferential paradigm above); (c) prediction is straightforwardly and unambiguously based on the second equation in (6); and (d) in the problem of decision analysis a celebrated theorem of Wald [111] says informally that all good decisions can be interpreted as having been arrived at by maximizing expected utility as in the third equation of (6), so the Bayesian approach appears to be the way forward in decision problems rather broadly (but note the final challenge at the end of Sect. "The Bayesian Statistical Paradigm"). The principal disadvantage of the Bayesian approach is that coherence (internal logical consistency) by itself does not guarantee good calibration: You are free in the Bayesian paradigm to insert strong prior information in the modeling process (without violating coherence), and – if this information is seen after the fact to have been out of step with the world – Your inferences, predictions and/or decisions may also be off-target (of course, the same is true in both the frequentist and Bayesian paradigms with regard to Your modeling of the likelihood information).
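The notion of actual versus nominal coverage can be checked directly by Monte Carlo, following the repeated-sampling recipe in steps (iv)–(v) above. This sketch uses an illustrative setup chosen for this example, not taken from the source: a skewed (Exponential) population, a small sample size, and the usual normal-theory interval for the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Repeatedly sample from a known skewed population (Exponential, true mean
# theta = 1), form the nominal 95% interval ybar +/- 1.96 s/sqrt(n), and
# count how often it contains theta.
theta, n, reps = 1.0, 10, 5000
hits = 0
for _ in range(reps):
    y = rng.exponential(theta, size=n)
    half = 1.96 * y.std(ddof=1) / np.sqrt(n)
    if y.mean() - half < theta < y.mean() + half:
        hits += 1
coverage = hits / reps
# With n this small and a skewed population, the actual coverage falls
# noticeably short of the nominal 95%.
```

The same harness, applied to Bayesian posterior intervals, is exactly the frequentist evaluation of Bayesian procedures discussed later in this section.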
Two examples of frequentist inferences having poor calibration properties in small samples were given by Browne and Draper [14]. Their first example again concerns the measurement of quality of care, which is often studied with cluster samples: a random sample of J hospitals (indexed by j) and a random sample of N total patients (indexed by i) nested in the chosen hospitals is taken, and quality of care for the chosen patients and various hospital- and patient-level predictors are measured. With y_ij as the quality of care score for patient i in hospital j, a first step would often be to fit a variance-components model with random effects at both the hospital and patient levels, to assess the relative magnitudes of within- and between-hospital variability in quality of care:

y_ij = β_0 + u_j + e_ij ,   i = 1, ..., n_j ,   j = 1, ..., J ,   Σ_{j=1}^J n_j = N ,
(u_j | σ_u²) ~ N(0, σ_u²) IID ,   (e_ij | σ_e²) ~ N(0, σ_e²) IID .     (30)
Browne and Draper [14] used a simulation study to show that, with a variety of maximum-likelihood-based methods for creating confidence intervals for σ_u², the actual coverage of nominal 95% intervals ranged from 72–94% across realistic sample sizes and true parameter values in the fields of education and medicine, versus 89–94% for Bayesian methods based on diffuse priors. Their second example involved a re-analysis of a Guatemalan National Survey of Maternal and Child Health [89,97], with three-level data (births nested within mothers within communities), working with the random-effects logistic regression model

(y_ijk | p_ijk) ~ Bernoulli(p_ijk) , independently, with
logit(p_ijk) = β_0 + β_1 x_1ijk + β_2 x_2jk + β_3 x_3k + u_jk + v_k ,     (31)

where y_ijk is a binary indicator of modern prenatal care or not and where u_jk ~ N(0, σ_u²) and v_k ~ N(0, σ_v²) were random effects at the mother and community levels (respectively). Simulating data sets with 2,449 births by 1,558 women living in 161 communities (as in the Rodríguez and Goldman study [97]), Browne and Draper [14] showed that things can be even worse for likelihood-based methods in this model, with actual coverages (at nominal 95%) as low as 0–2% for intervals for σ_u² and σ_v², whereas Bayesian methods with diffuse priors again produced actual coverages from 89–96%. The technical problem is that the marginal likelihood functions for random-effects variances are often heavily skewed, with maxima at or near 0 even when the true variance is positive; Bayesian methods, which integrate over the likelihood function rather than maximizing it, can have (much) better small-sample calibration performance as a result.

Some Historical Perspective

The earliest published formal example of an attempt to do statistical inference – to reason backwards from effects to causes – seems to have been Bayes [5], who defined conditional probability for the first time and noted that the result we now call Bayes' Theorem was a trivial consequence of the definition.
From the 1760s until the 1920s,
all (or almost all) statistical inference was Bayesian, using the paradigm that Fisher and others referred to as inverse probability; prominent Bayesians of this period included Gauss [40], Laplace [64] and Pearson [88]. This Bayesian consensus changed with the publication of Fisher [37], which laid out a user-friendly program for maximum-likelihood estimation and inference in a wide variety of problems. Fisher railed against Bayesian inference; his principal objection was that in settings where little was known about a parameter (vector) external to the data, a number of prior distributions could be put forward to quantify this relative ignorance. He believed passionately in the late Victorian–Edwardian goal of scientific objectivity, and it bothered him greatly that two analysts with somewhat different diffuse priors might obtain somewhat different posteriors. (There is a Bayesian account of objectivity: a probability is objective if many different people more or less agree on its value. An example would be the probability of drawing a red ball from an urn known to contain 20 red and 80 white balls, if a sincere attempt is made to thoroughly mix the balls without looking at them and to draw the ball in a way that does not tend to favor one ball over another.) There are two problems with Fisher's argument, which he never addressed:

1. He would be perfectly correct to raise this objection to Bayesian analysis if investigators were often forced to do inference based solely on prior information with no data, but in practice with even modest sample sizes the posterior is relatively insensitive to the precise manner in which diffuseness is specified in the prior, because the likelihood information in such situations is relatively so much stronger than the prior information; Sect. "Inference and Prediction: Binary Outcomes with No Covariates" provides an example of this phenomenon.

2.
If Fisher had looked at the entire process of inference with an engineering eye to sensitivity and stability, he would have been forced to admit that uncertainty in how to specify the likelihood function has inferential consequences that are often an order of magnitude larger than those arising from uncertainty in how to specify the prior. It is an inescapable fact that subjectivity, through assumptions and judgments (such as the form of the likelihood function), is an integral part of any statistical analysis in problems of realistic complexity.

In spite of these unrebutted flaws in Fisher's objections to Bayesian inference, two schools of frequentist inference – one based on Fisher's maximum-likelihood estimation and significance tests [38], the other based on the confidence intervals and hypothesis tests of Neyman [83] and Neyman and Pearson [84] – came to dominate statistical practice from the 1920s at least through the 1980s. One major reason for this was practical: the Bayesian paradigm is based on integrating over the posterior distribution, and accurate approximations to high-dimensional integrals were not available during the period in question. Fisher's technology, based on differentiation (to find the maximum and curvature of the logarithm of the likelihood function) rather than integration, was a much more tractable approach for its time. Jeffreys [54], working in the field of astronomy, and Savage [102] and Lindley [69], building on de Finetti's results, advocated forcefully for the adoption of Bayesian methods, but prior to the advent of MCMC techniques (in the late 1980s) Bayesians were often in the position of saying that they knew the best way to solve statistical problems but the computations were beyond them. MCMC has removed this practical objection to the Bayesian paradigm for a wide class of problems. The introduction of the bootstrap by Efron [31] in 1979, together with the increased availability of affordable computers with decent CPU throughput in the 1980s, also helped to overcome the objection raised in Sect. "Strengths and Weaknesses of the Two Approaches" against likelihood methods, that they can produce poorly-calibrated inferences with small samples.

At this writing (a) both the frequentist and Bayesian paradigms are in vigorous inferential use, with the proportion of Bayesian articles in leading journals continuing an increase that began in the 1980s; (b) Bayesian MCMC analyses are often employed to produce meaningful predictive conclusions, with the use of the bootstrap increasing for frequentist predictive calibration; and (c) the Bayesian paradigm dominates decision analysis.
A Bayesian-Frequentist Fusion

During the 20th century the debate over which paradigm to use was often framed in such a way that it seemed necessary to choose one approach and defend it against attacks from people who had chosen the other, but there is nothing that forces an analyst to choose a single paradigm. Since both approaches have strengths and weaknesses, it seems worthwhile instead to seek a fusion of the two that makes best use of the strengths. Because (a) the Bayesian paradigm appears to be the most flexible way so far developed for quantifying all sources of uncertainty and (b) its main weakness is that coherence does not guarantee good calibration, a number of statisticians, including Rubin [98], Draper [26], and Little [73], have suggested a fusion in which inferences, predictions and decisions are formulated using Bayesian methods and then evaluated for their calibration properties using frequentist methods, for example by using Bayesian models to create 95% predictive intervals for observables not used in the modeling process and seeing if approximately 95% of these intervals include the actual observed values. Analysts more accustomed to the purely frequentist (likelihood) paradigm who prefer not to explicitly make use of prior distributions may still find it useful to reason in a Bayesian way, by integrating over the parameter uncertainty in their likelihood functions rather than maximizing over it, in order to enjoy the superior calibration properties that integration has been demonstrated to provide.

Future Directions

Since the mid- to late 1980s the Bayesian statistical paradigm has made significant advances in many fields of inquiry, including agriculture, archaeology, astronomy, bioinformatics, biology, economics, education, environmetrics, finance, health policy, and medicine (see Sect. "The Bayesian Statistical Paradigm" for recent citations of work in many of these disciplines). Three areas of methodological and theoretical research appear particularly promising for extending the useful scope of Bayesian work, as follows:

Elicitation of prior distributions and utility functions: It is arguable that too much use is made in Bayesian analysis of diffuse prior distributions, because (a) accurate elicitation of non-diffuse priors is hard work and (b) lingering traces still remain of a desire to at least appear to achieve the unattainable Victorian–Edwardian goal of objectivity, the (false) argument being that the use of diffuse priors somehow equates to an absence of subjectivity (see, e.g., the papers by Berger [7] and Goldstein [45] and the ensuing discussion for a vigorous debate on this issue).
It is also arguable that too much emphasis was placed in the 20th century on inference at the expense of decision-making, with inferential tools such as the Neyman–Pearson hypothesis testing machinery (Sect. “Some Historical Perspective”) used incorrectly to make decisions for which they are not optimal; the main reason for this, as noted in Sect. “Strengths and Weaknesses of the Two Approaches” and “Some Historical Perspective”, is that (a) the frequentist paradigm was dominant from the 1920s through the 1980s and (b) the high ground in decision theory is dominated by the Bayesian approach.
Relevant citations of excellent recent work on elicitation of prior distributions and utility functions were given in Sect. "The Bayesian Statistical Paradigm"; it is natural to expect that there will be a greater emphasis on decision theory and non-diffuse prior modeling in the future, and elicitation in those fields of Bayesian methodology is an important area of continuing research.

Group decision-making: As noted in Sect. "The Bayesian Statistical Paradigm", maximizing expected utility is an effective method for decision-making by a single agent, but when two or more agents are involved in the decision process this approach cannot be guaranteed to yield a satisfying solution: there may be conflicts in the agents' preferences, particularly if their relationship is at least partly adversarial. With three or more possible actions, transitivity of preference – if You prefer action a_1 to a_2 and a_2 to a_3, then You should prefer a_1 to a_3 – is a criterion that any reasonable decision-making process should obey; informally, a well-known theorem by Arrow [3] states that even if all of the agents' utility functions obey transitivity, there is no way to combine their utility functions into a single decision-making process that is guaranteed to respect transitivity. However, Arrow's theorem is temporally static, in the sense that the agents do not share their utility functions with each other and iterate after doing so, and it assumes that all agents have the same set 𝒜 of feasible actions. If agents A_1 and A_2 have action spaces 𝒜_1 and 𝒜_2 that are not identical and they share the details of their utility specification with each other, it is possible that A_1 may realize that one of the actions in 𝒜_2 that (s)he had not considered is better than any of the actions in 𝒜_1, or vice versa; thus a temporally dynamic solution to the problem posed by Arrow's theorem may be possible, even if A_1 and A_2 are partially adversarial. This is another important area for new research.
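The intransitivity at the heart of Arrow's theorem can be seen in the classic Condorcet-cycle example (a standard textbook illustration, not taken from this article): three agents, each with a perfectly transitive ranking over actions a_1, a_2, a_3, whose pairwise majority preferences nonetheless form a cycle.

```python
# Each agent's list is a transitive ranking, best action first
rankings = [
    ["a1", "a2", "a3"],   # agent 1: a1 > a2 > a3
    ["a2", "a3", "a1"],   # agent 2: a2 > a3 > a1
    ["a3", "a1", "a2"],   # agent 3: a3 > a1 > a2
]

def majority_prefers(x, y):
    # True if a strict majority of agents ranks x above y
    votes = sum(1 for r in rankings if r.index(x) < r.index(y))
    return votes > len(rankings) / 2

cycle = (majority_prefers("a1", "a2")
         and majority_prefers("a2", "a3")
         and majority_prefers("a3", "a1"))
# cycle is True: the majority preference a1 > a2 > a3 > a1 is intransitive,
# even though every individual ranking is transitive.
```

Majority rule is only one aggregation scheme, but Arrow's theorem says that every scheme satisfying his fairness conditions is vulnerable to this kind of failure.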
Bayesian computation: Since the late 1980s, simulation-based computation based on Markov chain Monte Carlo (MCMC) methods has made useful Bayesian analyses possible in an increasingly broad range of application areas, and (as noted in Sect. "The Bayesian Statistical Paradigm") increases in computing speed and sophistication of MCMC algorithms have enhanced this trend significantly. However, if a regression-style data set is visualized as a matrix with n rows (one for each subject of inquiry) and k columns (one for each variable measured on the subjects), MCMC methods do not necessarily scale well in either n or k, with the result that they can be too slow to be of practical use with large data sets (e.g., at current desktop computing speeds, with n and/or k on the order of 10^5 or greater). Improving the scaling of MCMC methods, or finding a new approach to Bayesian computation that scales better, is thus a third important area for continuing study.
Bibliography

1. Abdellaoui M (2000) Parameter-free elicitation of utility and probability weighting functions. Manag Sci 46:1497–1512
2. Aleskerov F, Bouyssou D, Monjardet B (2007) Utility Maximization, Choice and Preference, 2nd edn. Springer, New York
3. Arrow KJ (1963) Social Choice and Individual Values, 2nd edn. Yale University Press, New Haven CT
4. Barlow RE, Wu AS (1981) Preposterior analysis of Bayes estimators of mean life. Biometrika 68:403–410
5. Bayes T (1764) An essay towards solving a problem in the doctrine of chances. Philos Trans Royal Soc Lond 53:370–418
6. Berger JO (1985) Statistical Decision Theory and Bayesian Analysis. Springer, New York
7. Berger JO (2006) The case for objective Bayesian analysis (with discussion). Bayesian Anal 1:385–472
8. Berger JO, Betro B, Moreno E, Pericchi LR, Ruggeri F, Salinetti G, Wasserman L (eds) (1995) Bayesian Robustness. Institute of Mathematical Statistics Lecture Notes–Monograph Series, vol 29. IMS, Hayward CA
9. Berger JO, Pericchi LR (2001) Objective Bayesian methods for model selection: introduction and comparison. In: Lahiri P (ed) Model Selection. Institute of Mathematical Statistics Lecture Notes–Monograph Series, vol 38. IMS, Beachwood, pp 135–207
10. Bernardo JM (1979) Reference posterior distributions for Bayesian inference (with discussion). J Royal Stat Soc, Series B 41:113–147
11. Bernardo JM, Smith AFM (1994) Bayesian Theory. Wiley, New York
12. Blavatskyy P (2006) Error propagation in the elicitation of utility and probability weighting functions. Theory Decis 60:315–334
13. Brown PJ, Vannucci M, Fearn T (1998) Multivariate Bayesian variable selection and prediction. J Royal Stat Soc, Series B 60:627–641
14. Browne WJ, Draper D (2006) A comparison of Bayesian and likelihood methods for fitting multilevel models (with discussion). Bayesian Anal 1:473–550
15. Buck C, Blackwell P (2008) Bayesian construction of radiocarbon calibration curves (with discussion).
In: Case Studies in Bayesian Statistics, vol 9. Springer, New York
16. Chipman H, George EI, McCulloch RE (1998) Bayesian CART model search (with discussion). J Am Stat Assoc 93:935–960
17. Christiansen CL, Morris CN (1997) Hierarchical Poisson regression modeling. J Am Stat Assoc 92:618–632
18. Clyde M, George EI (2004) Model uncertainty. Stat Sci 19:81–94
19. Das S, Chen MH, Kim S, Warren N (2008) A Bayesian structural equations model for multilevel data with missing responses and missing covariates. Bayesian Anal 3:197–224
20. de Finetti B (1930) Funzione caratteristica di un fenomeno aleatorio. Mem R Accad Lincei 4:86–133
21. de Finetti B (1937) La prévision: ses lois logiques, ses sources subjectives. Ann Inst H Poincaré 7:1–68 (reprinted in translation as de Finetti B (1980) Foresight: its logical laws, its subjective sources. In: Kyburg HE, Smokler HE (eds) Studies in Subjective Probability. Dover, New York, pp 93–158)
22. de Finetti B (1938/1980) Sur la condition d'équivalence partielle. Actual Sci Ind 739 (reprinted in translation as de Finetti B (1980) On the condition of partial exchangeability. In: Jeffrey R (ed) Studies in Inductive Logic and Probability. University of California Press, Berkeley, pp 193–206)
23. de Finetti B (1970) Teoria delle Probabilità, vol 1 and 2. Einaudi, Torino (reprinted in translation as de Finetti B (1974–75) Theory of Probability, vol 1 and 2. Wiley, Chichester)
24. Dey D, Müller P, Sinha D (eds) (1998) Practical Nonparametric and Semiparametric Bayesian Statistics. Springer, New York
25. Draper D (1995) Assessment and propagation of model uncertainty (with discussion). J Royal Stat Soc, Series B 57:45–97
26. Draper D (1999) Model uncertainty yes, discrete model averaging maybe. Comment on: Hoeting JA, Madigan D, Raftery AE, Volinsky CT (eds) Bayesian model averaging: a tutorial. Stat Sci 14:405–409
27. Draper D (2007) Bayesian multilevel analysis and MCMC. In: de Leeuw J, Meijer E (eds) Handbook of Multilevel Analysis. Springer, New York, pp 31–94
28. Draper D, Hodges J, Mallows C, Pregibon D (1993) Exchangeability and data analysis (with discussion). J Royal Stat Soc, Series A 156:9–37
29. Draper D, Krnjajić M (2008) Bayesian model specification. Submitted
30. Duran BS, Booker JM (1988) A Bayes sensitivity analysis when using the Beta distribution as a prior. IEEE Trans Reliab 37:239–247
31. Efron B (1979) Bootstrap methods. Ann Stat 7:1–26
32. Ferguson T (1973) A Bayesian analysis of some nonparametric problems. Ann Stat 1:209–230
33.
Ferreira MAR, Lee HKH (2007) Multiscale Modeling. Springer, New York 34. Fienberg SE (2006) When did Bayesian inference become “Bayesian”? Bayesian Anal 1:1–40 35. Fishburn PC (1970) Utility Theory for Decision Making. Wiley, New York 36. Fishburn PC (1981) Subjective expected utility: a review of normative theories. Theory Decis 13:139–199 37. Fisher RA (1922) On the mathematical foundations of theoretical statistics. Philos Trans Royal Soc Lond, Series A 222:309–368 38. Fisher RA (1925) Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh 39. Fouskakis D, Draper D (2008) Comparing stochastic optimization methods for variable selection in binary outcome prediction, with application to health policy. J Am Stat Assoc, forthcoming 40. Gauss CF (1809) Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium, vol 2. Perthes and Besser, Hamburg 41. Gelfand AE, Smith AFM (1990) Sampling-based approaches to calculating marginal densities. J Am Stat Assoc 85:398–409
42. Gelman A, Meng X-L (2004) Applied Bayesian Modeling and Causal Inference From Incomplete-Data Perspectives. Wiley, New York 43. George EI, Foster DP (2000) Calibration and empirical Bayes variable selection. Biometrika 87:731–747 44. Gilks WR, Richardson S, Spiegelhalter DJ (eds) (1996) Markov Chain Monte Carlo in Practice. Chapman, New York 45. Goldstein M (2006) Subjective Bayesian analysis: principles and practice (with discussion). Bayesian Anal 1:385–472 46. Good IJ (1950) Probability and the Weighing of Evidence. Charles Griffin, London 47. Green P (1995) Reversible jump Markov chain Monte carlo computation and Bayesian model determination. Biometrika 82:711–713 48. Hacking I (1984) The Emergence of Probability. University Press, Cambridge 49. Hanson TE, Kottas A, Branscum AJ (2008) Modelling stochastic order in the analysis of receiver operating characteristic data: Bayesian non-parametric approaches. J Royal Stat Soc, Series C (Applied Statistics) 57:207–226 50. Hellwig K, Speckbacher G, Weniges P (2000) Utility maximization under capital growth constraints. J Math Econ 33:1–12 51. Hendriksen C, Lund E, Stromgard E (1984) Consequences of assessment and intervention among elderly people: a three year randomized controlled trial. Br Med J 289:1522–1524 52. Hoeting JA, Madigan D, Raftery AE, Volinsky CT (1999) Bayesian model averaging: a tutorial. Stat Sci 14:382–417 53. Jaynes ET (2003) Probability Theory: The Logic of Science. Cambridge University Press, Cambridge 54. Jeffreys H (1931) Scientific Inference. Cambridge University Press, Cambridge 55. Jordan MI, Ghahramani Z, Jaakkola TS, Saul L (1999) An introduction to variational methods for graphical models. Mach Learn 37:183–233 56. Kadane JB (ed) (1996) Bayesian Methods and Ethics in a Clinical Trial Design. Wiley, New York 57. Kadane JB, Dickey JM (1980) Bayesian decision theory and the simplification of models. In: Kmenta J, Ramsey J (eds) Evaluation of Econometric Models. 
Academic Press, New York 58. Kahn K, Rubenstein L, Draper D, Kosecoff J, Rogers W, Keeler E, Brook R (1990) The effects of the DRG-based Prospective Payment System on quality of care for hospitalized Medicare patients: An introduction to the series (with discussion). J Am Med Assoc 264:1953–1955 59. Kass RE, Raftery AE (1995) Bayes factors. J Am Stat Assoc 90:773–795 60. Kass RE, Wasserman L (1996) The selection of prior distributions by formal rules. J Am Stat Assoc 91:1343–1370 61. Key J, Pericchi LR, Smith AFM (1999) Bayesian model choice: what and why? (with discussion). In: Bernardo JM, Berger JO, Dawid AP, Smith AFM (eds) Bayesian Statistics 6. Clarendon Press, Oxford, pp 343–370 62. Krnjaji´c M, Kottas A, Draper D (2008) Parametric and nonparametric Bayesian model specification: a case study involving models for count data. Comput Stat Data Anal 52:2110–2128 63. Laplace PS (1774) Mémoire sur la probabilité des causes par les évenements. Mém Acad Sci Paris 6:621–656 64. Laplace PS (1812) Théorie Analytique des Probabilités. Courcier, Paris
Bayesian Statistics
65. Laud PW, Ibrahim JG (1995) Predictive model selection. J Royal Stat Soc, Series B 57:247–262 66. Lavine M (1992) Some aspects of Pólya tree distributions for statistical modelling. Ann Stat 20:1222–1235 67. Leamer EE (1978) Specification searches: Ad hoc inference with non-experimental data. Wiley, New York 68. Leonard T, Hsu JSJ (1999) Bayesian Methods: An Analysis for Statisticians and Interdisciplinary Researchers. Cambridge University Press, Cambridge 69. Lindley DV (1965) Introduction to Probability and Statistics. Cambridge University Press, Cambridge 70. Lindley DV (1968) The choice of variables in multiple regression (with discussion). J Royal Stat Soc, Series B 30:31–66 71. Lindley DV (2006) Understanding Uncertainty. Wiley, New York 72. Lindley DV, Novick MR (1981) The role of exchangeability in inference. Ann Stat 9:45–58 73. Little RJA (2006) Calibrated Bayes: A Bayes/frequentist roadmap. Am Stat 60:213–223 74. Mangel M, Munch SB (2003) Opportunities for Bayesian analysis in the search for sustainable fisheries. ISBA Bulletin 10:3–5 75. McCullagh P, Nelder JA (1989) Generalized Linear Models, 2nd edn. Chapman, New York 76. Meng XL, van Dyk DA (1997) The EM Algorithm: an old folk song sung to a fast new tune (with discussion). J Royal Stat Soc, Series B 59:511–567 77. Merl D, Prado R (2007) Detecting selection in DNA sequences: Bayesian modelling and inference (with discussion). In: Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M (eds) Bayesian Statistics 8. University Press, Oxford, pp 1–22 78. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equations of state calculations by fast computing machine. J Chem Phys 21:1087–1091 79. Morris CN, Hill J (2000) The Health Insurance Experiment: design using the Finite Selection Model. In: Morton SC, Rolph JE (eds) Public Policy and Statistics: Case Studies from RAND. Springer, New York, pp 29–53 80. 
Müller P, Quintana F (2004) Nonparametric Bayesian data analysis. Stat Sci 19:95–110 81. Müller P, Quintana F, Rosner G (2004) Hierarchical meta-analysis over related non-parametric Bayesian models. J Royal Stat Soc, Series B 66:735–749 82. Munch SB, Kottas A, Mangel M (2005) Bayesian nonparametric analysis of stock-recruitment relationships. Can J Fish Aquat Sci 62:1808–1821 83. Neyman J (1937) Outline of a theory of statistical estimation based on the classical theory of probability. Philos Trans Royal Soc Lond A 236:333–380 84. Neyman J, Pearson ES (1928) On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika 20:175–240 85. O’Hagan A, Buck CE, Daneshkhah A, Eiser JR, Garthwaite PH, Jenkinson DJ, Oakley JE, Rakow T (2006) Uncertain Judgements. Eliciting Experts’ Probabilities. Wiley, New York 86. O’Hagan A, Forster J (2004) Bayesian Inference, 2nd edn. In: Kendall’s Advanced Theory of Statistics, vol 2B. Arnold, London 87. Parmigiani G (2002) Modeling in medical decision-making: A Bayesian approach. Wiley, New York
88. Pearson KP (1895) Mathematical contributions to the theory of evolution, II. Skew variation in homogeneous material. Proc Royal Soc Lond 57:257–260 89. Pebley AR, Goldman N (1992) Family, community, ethnic identity, and the use of formal health care services in Guatemala. Working Paper 92-12. Office of Population Research, Princeton 90. Pettit LI (1990) The conditional predictive ordinate for the Normal distribution. J Royal Stat Soc, Series B 52:175–184 91. Pérez JM, Berger JO (2002) Expected posterior prior distributions for model selection. Biometrika 89:491–512 92. Polson NG, Stroud JR, Müller P (2008) Practical filtering with sequential parameter learning. J Royal Stat Soc, Series B 70:413–428 93. Rashbash J, Steele F, Browne WJ, Prosser B (2005) A User’s Guide to MLwiN, Version 2.0. Centre for Multilevel Modelling, University of Bristol, Bristol UK; available at www.cmm.bristol. ac.uk Accessed 15 Aug 2008 94. Richardson S, Green PJ (1997) On Bayesian analysis of mixtures with an unknown number of components (with discussion). J Royal Stat Soc, Series B 59:731–792 95. Rios Insua D, Ruggeri F (eds) (2000) Robust Bayesian Analysis. Springer, New York 96. Rodríguez A, Dunston DB, Gelfand AE (2008) The nested Dirichlet process. J Am Stat Assoc, 103, forthcoming 97. Rodríguez G, Goldman N (1995) An assessment of estimation procedures for multilevel models with binary responses. J Royal Stat Soc, Series A 158:73–89 98. Rubin DB (1984) Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann Stat 12:1151– 1172 99. Rubin DB (2005) Bayesian inference for causal effects. In: Rao CR, Dey DK (eds) Handbook of Statistics: Bayesian Thinking, Modeling and Computation, vol 25. Elsevier, Amsterdam, pp 1–16 100. Sabatti C, Lange K (2008) Bayesian Gaussian mixture models for high-density genotyping arrays. J Am Stat Assoc 103:89–100 101. 
Sansó B, Forest CE, Zantedeschi D (2008) Inferring climate system properties using a computer model (with discussion). Bayesian Anal 3:1–62 102. Savage LJ (1954) The Foundations of Statistics. Wiley, New York 103. Schervish MJ, Seidenfeld T, Kadane JB (1990) State-dependent utilities. J Am Stat Assoc 85:840–847 104. Seidou O, Asselin JJ, Ouarda TMBJ (2007) Bayesian multivariate linear regression with application to change point models in hydrometeorological variables. In: Water Resources Research 43, W08401, doi:10.1029/2005WR004835. 105. Spiegelhalter DJ, Abrams KR, Myles JP (2004) Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Wiley, New York 106. Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A (2002) Bayesian measures of model complexity and fit (with discussion). J Royal Stat Soc, Series B 64:583–640 107. Spiegelhalter DJ, Thomas A, Best NG (1999) WinBUGS Version 1.2 User Manual. MRC Biostatistics Unit, Cambridge 108. Stephens M (2000) Bayesian analysis of mixture models with an unknown number of components – an alternative to reversible-jump methods. Ann Stat 28:40–74
273
274
Bayesian Statistics
109. Stigler SM (1986) The History of Statistics: The Measurement of Uncertainty Before 1900. Harvard University Press, Cambridge 110. Stone M (1974) Cross-validation choice and assessment of statistical predictions (with discussion). J Royal Stat Soc, Series B 36:111–147 111. Wald A (1950) Statistical Decision Functions. Wiley, New York 112. Weerahandi S, Zidek JV (1981) Multi-Bayesian statistical decision theory. J Royal Stat Soc, Series A 144:85–93
113. Weisberg S (2005) Applied Linear Regression, 3rd edn. Wiley, New York 114. West M (2003) Bayesian factor regression models in the “large p, small n paradigm.” Bayesian Statistics 7:723–732 115. West M, Harrison PJ (1997) Bayesian Forecasting and Dynamic Models. Springer, New York 116. Whitehead J (2006) Using Bayesian decision theory in doseescalation studies. In: Chevret S (ed) Statistical Methods for Dose-Finding Experiments. Wiley, New York, pp 149–171
Bivariate (Two-dimensional) Wavelets
BIN HAN
Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Canada

Article Outline
Glossary
Definitions
Introduction
Bivariate Refinable Functions and Their Properties
The Projection Method
Bivariate Orthonormal and Biorthogonal Wavelets
Bivariate Riesz Wavelets
Pairs of Dual Wavelet Frames
Future Directions
Bibliography

Glossary
Dilation matrix  A $2\times 2$ matrix $M$ is called a dilation matrix if all the entries of $M$ are integers and all the eigenvalues of $M$ are greater than one in modulus.
Isotropic dilation matrix  A dilation matrix $M$ is said to be isotropic if $M$ is similar to a diagonal matrix and all its eigenvalues have the same modulus.
Wavelet system  A wavelet system is a collection of square-integrable functions that are generated from a finite set of functions (which are called wavelets) by using integer shifts and dilations.

Definitions
Throughout this article, $\mathbb{R}$, $\mathbb{C}$, and $\mathbb{Z}$ denote the real line, the complex plane, and the set of all integers, respectively. For $1 \leqslant p \leqslant \infty$, $L_p(\mathbb{R}^2)$ denotes the set of all Lebesgue measurable bivariate functions $f$ such that $\|f\|_{L_p(\mathbb{R}^2)}^p := \int_{\mathbb{R}^2} |f(x)|^p \, dx < \infty$. In particular, the space $L_2(\mathbb{R}^2)$ of square integrable functions is a Hilbert space under the inner product
$$\langle f, g \rangle := \int_{\mathbb{R}^2} f(x)\, \overline{g(x)} \, dx, \qquad f, g \in L_2(\mathbb{R}^2),$$
where $\overline{g(x)}$ denotes the complex conjugate of the complex number $g(x)$. In applications such as image processing and computer graphics, the following are commonly used isotropic
dilation matrices:
$$M_{\sqrt{2}} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \quad Q_{\sqrt{2}} = \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}, \quad M_{\sqrt{3}} = \begin{bmatrix} 1 & -2 \\ 2 & -1 \end{bmatrix}, \quad dI_2 = \begin{bmatrix} d & 0 \\ 0 & d \end{bmatrix}, \tag{1}$$
where $d$ is an integer with $|d| > 1$. Using a dilation matrix $M$, a bivariate $M$-wavelet system is generated by integer shifts and dilates from a finite set $\{\psi^1, \ldots, \psi^L\}$ of functions in $L_2(\mathbb{R}^2)$. More precisely, the set of all the basic wavelet building blocks of an $M$-wavelet system generated by $\{\psi^1, \ldots, \psi^L\}$ is given by
$$X_M(\{\psi^1, \ldots, \psi^L\}) := \{\psi^{\ell;M}_{j,k} : j \in \mathbb{Z},\ k \in \mathbb{Z}^2,\ \ell = 1, \ldots, L\}, \tag{2}$$
where
$$\psi^{\ell;M}_{j,k}(x) := |\det M|^{j/2}\, \psi^{\ell}(M^j x - k), \qquad x \in \mathbb{R}^2. \tag{3}$$
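The defining properties above are easy to check numerically. The sketch below (function names are ours, and the matrix entries follow the forms of (1) as commonly cited in the literature) computes the eigenvalue moduli of a $2\times 2$ integer matrix from its trace and determinant:

```python
import cmath

def eigenvalue_moduli(m):
    """Moduli of the eigenvalues of a 2x2 matrix m = [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)
    return sorted(abs((tr + s * disc) / 2) for s in (1, -1))

def is_dilation(m):
    """All entries integers and all eigenvalues greater than 1 in modulus."""
    return (all(isinstance(x, int) for row in m for x in row)
            and eigenvalue_moduli(m)[0] > 1)

def is_isotropic(m):
    """Both eigenvalue moduli equal (these examples are also diagonalizable)."""
    lo, hi = eigenvalue_moduli(m)
    return abs(lo - hi) < 1e-12

M_sqrt2 = [[1, 1], [1, -1]]   # eigenvalue moduli sqrt(2)
M_sqrt3 = [[1, -2], [2, -1]]  # eigenvalue moduli sqrt(3)
twoI2   = [[2, 0], [0, 2]]    # d*I2 with d = 2

for M in (M_sqrt2, M_sqrt3, twoI2):
    assert is_dilation(M) and is_isotropic(M)
```

Note that $M_{\sqrt 3}$ has complex eigenvalues $\pm i\sqrt 3$, so isotropy here is a statement about moduli, not about the eigenvalues being real.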
One of the main goals in wavelet analysis is to find wavelet systems $X_M(\{\psi^1, \ldots, \psi^L\})$ with some desirable properties such that any two-dimensional function or signal $f \in L_2(\mathbb{R}^2)$ can be sparsely and efficiently represented under the $M$-wavelet system $X_M(\{\psi^1, \ldots, \psi^L\})$:
$$f = \sum_{\ell=1}^{L} \sum_{j \in \mathbb{Z}} \sum_{k \in \mathbb{Z}^2} h^{\ell}_{j,k}(f)\, \psi^{\ell;M}_{j,k}, \tag{4}$$
where the $h^{\ell}_{j,k} : L_2(\mathbb{R}^2) \to \mathbb{C}$ are linear functionals. There are many types of wavelet systems studied in the literature. In the following, let us outline some of the most important types of wavelet systems.

Orthonormal Wavelets
We say that $\{\psi^1, \ldots, \psi^L\}$ generates an orthonormal $M$-wavelet basis in $L_2(\mathbb{R}^2)$ if the system $X_M(\{\psi^1, \ldots, \psi^L\})$ is an orthonormal basis of the Hilbert space $L_2(\mathbb{R}^2)$. That is, the linear span of elements in $X_M(\{\psi^1, \ldots, \psi^L\})$ is dense in $L_2(\mathbb{R}^2)$ and
$$\langle \psi^{\ell;M}_{j,k}, \psi^{\ell';M}_{j',k'} \rangle = \delta_{\ell\ell'}\, \delta_{jj'}\, \delta_{kk'}, \qquad \forall\, j, j' \in \mathbb{Z},\ k, k' \in \mathbb{Z}^2,\ \ell, \ell' = 1, \ldots, L, \tag{5}$$
where $\delta$ denotes the Dirac sequence such that $\delta_0 = 1$ and $\delta_k = 0$ for all $k \neq 0$. For an orthonormal wavelet basis $X_M(\{\psi^1, \ldots, \psi^L\})$, the linear functional $h^{\ell}_{j,k}$ in (4) is
given by $h^{\ell}_{j,k}(f) = \langle f, \psi^{\ell;M}_{j,k} \rangle$ and the representation in (4) becomes
$$f = \sum_{\ell=1}^{L} \sum_{j \in \mathbb{Z}} \sum_{k \in \mathbb{Z}^2} \langle f, \psi^{\ell;M}_{j,k} \rangle\, \psi^{\ell;M}_{j,k}, \qquad f \in L_2(\mathbb{R}^2), \tag{6}$$
with the series converging in $L_2(\mathbb{R}^2)$.

Riesz Wavelets
We say that $\{\psi^1, \ldots, \psi^L\}$ generates a Riesz $M$-wavelet basis in $L_2(\mathbb{R}^2)$ if the system $X_M(\{\psi^1, \ldots, \psi^L\})$ is a Riesz basis of $L_2(\mathbb{R}^2)$. That is, the linear span of elements in $X_M(\{\psi^1, \ldots, \psi^L\})$ is dense in $L_2(\mathbb{R}^2)$ and there exist two positive constants $C_1$ and $C_2$ such that
$$C_1 \sum_{\ell=1}^{L} \sum_{j \in \mathbb{Z}} \sum_{k \in \mathbb{Z}^2} |c^{\ell}_{j,k}|^2 \leqslant \Big\| \sum_{\ell=1}^{L} \sum_{j \in \mathbb{Z}} \sum_{k \in \mathbb{Z}^2} c^{\ell}_{j,k}\, \psi^{\ell;M}_{j,k} \Big\|_{L_2(\mathbb{R}^2)}^2 \leqslant C_2 \sum_{\ell=1}^{L} \sum_{j \in \mathbb{Z}} \sum_{k \in \mathbb{Z}^2} |c^{\ell}_{j,k}|^2$$
for all finitely supported sequences $\{c^{\ell}_{j,k}\}_{j \in \mathbb{Z},\, k \in \mathbb{Z}^2,\, \ell = 1, \ldots, L}$. Clearly, a Riesz $M$-wavelet generalizes an orthonormal $M$-wavelet by relaxing the orthogonality requirement in (5). For a Riesz basis $X_M(\{\psi^1, \ldots, \psi^L\})$, it is well known that there exists a dual Riesz basis $\{\tilde\psi^{\ell}_{j,k} : j \in \mathbb{Z},\ k \in \mathbb{Z}^2,\ \ell = 1, \ldots, L\}$ of elements in $L_2(\mathbb{R}^2)$ (this set is not necessarily generated by integer shifts and dilates from some finite set of functions) such that (5) still holds after replacing $\psi^{\ell';M}_{j',k'}$ by $\tilde\psi^{\ell'}_{j',k'}$. For a Riesz wavelet basis, the linear functional in (4) becomes $h^{\ell}_{j,k}(f) = \langle f, \tilde\psi^{\ell}_{j,k} \rangle$. In fact, $\tilde\psi^{\ell}_{j,k} := \mathcal{F}^{-1}(\psi^{\ell;M}_{j,k})$, where $\mathcal{F} : L_2(\mathbb{R}^2) \to L_2(\mathbb{R}^2)$ is defined to be
$$\mathcal{F}(f) := \sum_{\ell=1}^{L} \sum_{j \in \mathbb{Z}} \sum_{k \in \mathbb{Z}^2} \langle f, \psi^{\ell;M}_{j,k} \rangle\, \psi^{\ell;M}_{j,k}, \qquad f \in L_2(\mathbb{R}^2). \tag{7}$$

Wavelet Frames
A further generalization of a Riesz wavelet is a wavelet frame. We say that $\{\psi^1, \ldots, \psi^L\}$ generates an $M$-wavelet frame in $L_2(\mathbb{R}^2)$ if the system $X_M(\{\psi^1, \ldots, \psi^L\})$ is a frame of $L_2(\mathbb{R}^2)$. That is, there exist two positive constants $C_1$ and $C_2$ such that
$$C_1 \|f\|_{L_2(\mathbb{R}^2)}^2 \leqslant \sum_{\ell=1}^{L} \sum_{j \in \mathbb{Z}} \sum_{k \in \mathbb{Z}^2} |\langle f, \psi^{\ell;M}_{j,k} \rangle|^2 \leqslant C_2 \|f\|_{L_2(\mathbb{R}^2)}^2, \qquad \forall\, f \in L_2(\mathbb{R}^2). \tag{8}$$
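A finite-dimensional analogue of the frame inequality (8) may help build intuition: for finitely many vectors in $\mathbb{R}^2$, the optimal constants $C_1, C_2$ are the smallest and largest eigenvalues of the frame operator $S = \sum_i v_i v_i^{\mathsf T}$. A minimal sketch (the example frame and all names are ours):

```python
import math

def frame_bounds_2d(vectors):
    """Optimal frame bounds C1, C2 for a finite family of vectors in R^2:
    the extreme eigenvalues of the frame operator S = sum_i v_i v_i^T."""
    s11 = sum(v[0] * v[0] for v in vectors)
    s12 = sum(v[0] * v[1] for v in vectors)
    s22 = sum(v[1] * v[1] for v in vectors)
    mean = (s11 + s22) / 2
    radius = math.hypot((s11 - s22) / 2, s12)  # eigenvalues of a symmetric 2x2
    return mean - radius, mean + radius

# The "Mercedes-Benz" frame: three unit vectors at 120-degree angles.
mb = [(0.0, 1.0), (-math.sqrt(3) / 2, -0.5), (math.sqrt(3) / 2, -0.5)]
C1, C2 = frame_bounds_2d(mb)
assert abs(C1 - 1.5) < 1e-12 and abs(C2 - 1.5) < 1e-12  # tight: C1 = C2 = 3/2
```

Since $C_1 = C_2$ here, the three redundant vectors behave like an orthonormal basis up to the constant $3/2$, mirroring the tight-frame case discussed below.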
It is not difficult to check that a Riesz $M$-wavelet is an $M$-wavelet frame. Then (8) guarantees that the frame operator $\mathcal{F}$ in (7) is a bounded and invertible linear operator. For an $M$-wavelet frame, the linear functional in (4) can be chosen to be $h^{\ell}_{j,k}(f) = \langle f, \mathcal{F}^{-1}(\psi^{\ell;M}_{j,k}) \rangle$; however, such functionals may not be unique. The fundamental difference between a Riesz wavelet and a wavelet frame lies in the fact that for any given function $f \in L_2(\mathbb{R}^2)$, the representation in (4) is unique under a Riesz wavelet, while it may not be unique under a wavelet frame. The representation in (4) with the choice $h^{\ell}_{j,k}(f) = \langle f, \mathcal{F}^{-1}(\psi^{\ell;M}_{j,k}) \rangle$ is called the canonical representation of a wavelet frame. In other words, $\{\mathcal{F}^{-1}(\psi^{\ell;M}_{j,k}) : j \in \mathbb{Z},\ k \in \mathbb{Z}^2,\ \ell = 1, \ldots, L\}$ is called the canonical dual frame of the given wavelet frame $X_M(\{\psi^1, \ldots, \psi^L\})$.

Biorthogonal Wavelets
We say that $(\{\psi^1, \ldots, \psi^L\}, \{\tilde\psi^1, \ldots, \tilde\psi^L\})$ generates a pair of biorthogonal $M$-wavelet bases in $L_2(\mathbb{R}^2)$ if each of $X_M(\{\psi^1, \ldots, \psi^L\})$ and $X_M(\{\tilde\psi^1, \ldots, \tilde\psi^L\})$ is a Riesz basis of $L_2(\mathbb{R}^2)$ and (5) still holds after replacing $\psi^{\ell';M}_{j',k'}$ by $\tilde\psi^{\ell';M}_{j',k'}$. In other words, the dual Riesz basis of $X_M(\{\psi^1, \ldots, \psi^L\})$ has the wavelet structure and is given by $X_M(\{\tilde\psi^1, \ldots, \tilde\psi^L\})$. For a biorthogonal wavelet, the wavelet representation in (4) becomes
$$f = \sum_{\ell=1}^{L} \sum_{j \in \mathbb{Z}} \sum_{k \in \mathbb{Z}^2} \langle f, \tilde\psi^{\ell;M}_{j,k} \rangle\, \psi^{\ell;M}_{j,k}, \qquad f \in L_2(\mathbb{R}^2). \tag{9}$$
Obviously, $\{\psi^1, \ldots, \psi^L\}$ generates an orthonormal $M$-wavelet basis in $L_2(\mathbb{R}^2)$ if and only if $(\{\psi^1, \ldots, \psi^L\}, \{\psi^1, \ldots, \psi^L\})$ generates a pair of biorthogonal $M$-wavelet bases in $L_2(\mathbb{R}^2)$.

Dual Wavelet Frames
Similarly, we have the notion of a pair of dual wavelet frames. We say that $(\{\psi^1, \ldots, \psi^L\}, \{\tilde\psi^1, \ldots, \tilde\psi^L\})$ generates a pair of dual $M$-wavelet frames in $L_2(\mathbb{R}^2)$ if each of $X_M(\{\psi^1, \ldots, \psi^L\})$ and $X_M(\{\tilde\psi^1, \ldots, \tilde\psi^L\})$ is an $M$-wavelet frame in $L_2(\mathbb{R}^2)$ and
$$\langle f, g \rangle = \sum_{\ell=1}^{L} \sum_{j \in \mathbb{Z}} \sum_{k \in \mathbb{Z}^2} \langle f, \tilde\psi^{\ell;M}_{j,k} \rangle \langle \psi^{\ell;M}_{j,k}, g \rangle, \qquad f, g \in L_2(\mathbb{R}^2).$$
It follows from the above identity that (9) still holds for a pair of dual $M$-wavelet frames. We say that $\{\psi^1, \ldots, \psi^L\}$ generates a tight $M$-wavelet frame if $(\{\psi^1, \ldots, \psi^L\},$
$\{\psi^1, \ldots, \psi^L\})$ generates a pair of dual $M$-wavelet frames in $L_2(\mathbb{R}^2)$. For a tight wavelet frame $\{\psi^1, \ldots, \psi^L\}$, the wavelet representation in (6) still holds. So, a tight wavelet frame is a generalization of an orthonormal wavelet basis.

Introduction
Bivariate (two-dimensional) wavelets are of interest for representing and processing two-dimensional data such as images and surfaces. In this article we shall discuss some basic background and results on bivariate wavelets. We denote by $\Pi_J$ the set of all bivariate polynomials of total degree at most $J$. For compactly supported functions $\psi^1, \ldots, \psi^L$, we say that $\{\psi^1, \ldots, \psi^L\}$ has $J$ vanishing moments if $\langle P, \psi^{\ell} \rangle = 0$ for all $\ell = 1, \ldots, L$ and all polynomials $P \in \Pi_{J-1}$. The advantages of wavelet representations largely lie in the following aspects:

1. There is a fast wavelet transform (FWT) for computing the wavelet coefficients $h^{\ell}_{j,k}(f)$ in the wavelet representation (4).
2. The wavelet representation has good time and frequency localization. Roughly speaking, the basic building blocks $\psi^{\ell;M}_{j,k}$ have good time localization and smoothness.
3. The wavelet representation is sparse. For a smooth function $f$, most wavelet coefficients are negligible. Generally, the wavelet coefficient $h^{\ell}_{j,k}(f)$ only depends on the information of $f$ in a small neighborhood of the support of $\psi^{\ell;M}_{j,k}$ (more precisely, the support of $\tilde\psi^{\ell;M}_{j,k}$ if $h^{\ell}_{j,k}(f) = \langle f, \tilde\psi^{\ell;M}_{j,k} \rangle$). If $f$ in a small neighborhood of the support of $\psi^{\ell;M}_{j,k}$ is smooth or behaves like a polynomial to a certain degree, then by the vanishing moments of the dual wavelet functions, $h^{\ell}_{j,k}(f)$ is often negligible.
4. For a variety of function spaces $\mathcal{B}$ such as Sobolev and Besov spaces, the norm $\|f\|_{\mathcal{B}}$, $f \in \mathcal{B}$, is equivalent to a certain weighted sequence norm of the wavelet coefficients $\{h^{\ell}_{j,k}(f)\}_{j \in \mathbb{Z},\, k \in \mathbb{Z}^2,\, \ell = 1, \ldots, L}$.

For more details on advantages and applications of wavelets, see [6,8,14,16,21,58].

For a dilation matrix $M = dI_2$, the easiest way to obtain a bivariate wavelet is to use the tensor product method. In other words, all the functions in the generator set $\{\psi^1, \ldots, \psi^L\} \subseteq L_2(\mathbb{R}^2)$ take the form $\psi^{\ell}(x, y) = f^{\ell}(x) g^{\ell}(y)$, $x, y \in \mathbb{R}$, where $f^{\ell}$ and $g^{\ell}$ are some univariate functions in $L_2(\mathbb{R})$. However, tensor product (also called separable) bivariate wavelets give preference to the horizontal and vertical directions, which may not be desirable in applications such as image processing ([9,16,50,58]).
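The tensor product construction has a direct discrete counterpart: a separable 2D transform applies a 1D transform along rows and then along columns, so every 2D basis function is a product $f^{\ell}(x) g^{\ell}(y)$. A toy sketch with one level of the 1D Haar averaging/differencing pair (an assumed illustration; all names are ours):

```python
def haar_1d(row):
    """One level of the 1D Haar transform: pairwise averages, then differences."""
    avg = [(row[2 * i] + row[2 * i + 1]) / 2 for i in range(len(row) // 2)]
    dif = [(row[2 * i] - row[2 * i + 1]) / 2 for i in range(len(row) // 2)]
    return avg + dif

def haar_2d(img):
    """Separable 2D Haar step: 1D transform along rows, then along columns.
    The four output quadrants correspond to the tensor products
    phi(x)phi(y), phi(x)psi(y), psi(x)phi(y), psi(x)psi(y)."""
    rows = [haar_1d(r) for r in img]
    cols = [haar_1d(list(c)) for c in zip(*rows)]   # transpose, transform
    return [list(r) for r in zip(*cols)]            # transpose back

out = haar_2d([[1.0, 2.0], [3.0, 4.0]])
assert out[0][0] == 2.5  # the low-low entry is the average of all four pixels
```

The diagonal detail `out[1][1]` mixes horizontal and vertical differences, which is precisely the directional bias of separable wavelets mentioned above.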
Also, for a non-diagonal dilation matrix, it is difficult or impossible to use the tensor product method to obtain bivariate wavelets. Therefore, nonseparable bivariate wavelets are themselves of importance and interest in both theory and application. In this article, we shall address several aspects of bivariate wavelets with a general two-dimensional dilation matrix.

Bivariate Refinable Functions and Their Properties
In order to have a fast wavelet transform to compute the wavelet coefficients in (4), the generators $\psi^1, \ldots, \psi^L$ in a wavelet system are generally obtained from a refinable function via a multiresolution analysis [6,15,16,58]. Let $M$ be a $2\times 2$ dilation matrix. For a function $\phi$ in $L_2(\mathbb{R}^2)$, we say that $\phi$ is $M$-refinable if it satisfies the following refinement equation
$$\phi(x) = |\det M| \sum_{k \in \mathbb{Z}^2} a_k\, \phi(Mx - k), \qquad \text{a.e. } x \in \mathbb{R}^2, \tag{10}$$
where $a : \mathbb{Z}^2 \to \mathbb{C}$ is a finitely supported sequence on $\mathbb{Z}^2$ satisfying $\sum_{k \in \mathbb{Z}^2} a_k = 1$. Such a sequence $a$ is often called a mask in wavelet analysis and computer graphics, or a low-pass filter in signal processing. Using the Fourier transform $\hat\phi$ of $\phi \in L_2(\mathbb{R}^2) \cap L_1(\mathbb{R}^2)$ and the Fourier series $\hat a$ of $a$, which are defined to be
$$\hat\phi(\xi) := \int_{\mathbb{R}^2} \phi(x)\, e^{-i x \cdot \xi}\, dx \quad\text{and}\quad \hat a(\xi) := \sum_{k \in \mathbb{Z}^2} a_k\, e^{-i k \cdot \xi}, \qquad \xi \in \mathbb{R}^2, \tag{11}$$
the refinement equation (10) can be equivalently rewritten as
$$\hat\phi(M^{\mathsf T}\xi) = \hat a(\xi)\, \hat\phi(\xi), \qquad \xi \in \mathbb{R}^2, \tag{12}$$
where $M^{\mathsf T}$ denotes the transpose of the dilation matrix $M$. Since $\hat a(0) = 1$ and $\hat a$ is a trigonometric polynomial, one can define a function $\hat\phi$ by
$$\hat\phi(\xi) := \prod_{j=1}^{\infty} \hat a\big((M^{\mathsf T})^{-j}\xi\big), \qquad \xi \in \mathbb{R}^2, \tag{13}$$
with the product converging uniformly on any compact set of $\mathbb{R}^2$. It is known ([4,16]) that $\phi$ is a compactly supported tempered distribution and clearly satisfies the refinement equation in (12) with mask $a$. We call $\phi$ the standard refinable function associated with mask $a$ and dilation $M$. Now the generators $\psi^1, \ldots, \psi^L$ of a wavelet system are often obtained from the refinable function $\phi$ via
$$\widehat{\psi^{\ell}}(M^{\mathsf T}\xi) := \widehat{a^{\ell}}(\xi)\, \hat\phi(\xi), \qquad \xi \in \mathbb{R}^2,\ \ell = 1, \ldots, L,$$
for some $2\pi$-periodic trigonometric polynomials $\widehat{a^{\ell}}$. Such sequences $a^{\ell}$ are called wavelet masks or high-pass filters.

Except for very few special masks $a$, such as masks for box spline refinable functions (see [20] for box splines), most refinable functions $\phi$ obtained in (13) from mask $a$ and dilation $M$ do not have explicit analytic expressions, and a so-called cascade algorithm or subdivision scheme is used to approximate the refinable function $\phi$. Let $\mathcal{B}$ be a Banach space of bivariate functions. Starting with a suitable initial function $f \in \mathcal{B}$, we iteratively compute a sequence $\{Q^n_{a,M} f\}_{n=0}^{\infty}$ of functions, where
$$Q_{a,M} f := |\det M| \sum_{k \in \mathbb{Z}^2} a_k\, f(M \cdot {} - k). \tag{14}$$
If the sequence $\{Q^n_{a,M} f\}_{n=0}^{\infty}$ of functions converges to some $f_{\infty} \in \mathcal{B}$ in the Banach space $\mathcal{B}$, then we have $Q_{a,M} f_{\infty} = f_{\infty}$. That is, as a fixed point of the cascade operator $Q_{a,M}$, $f_{\infty}$ is a solution to the refinement equation (10). A cascade algorithm plays an important role in the study of refinable functions in wavelet analysis and of subdivision surfaces in computer graphics. For more details on cascade algorithms and subdivision schemes, see [4,19,21,22,23,31,36,54] and the numerous references therein.

One of the most important properties of a refinable function $\phi$ is its smoothness, which is measured by its $L_p$ smoothness exponent $\nu_p(\phi)$, defined to be
$$\nu_p(\phi) := \sup\{\nu : \phi \in W^{\nu}_p(\mathbb{R}^2)\}, \qquad 1 \leqslant p \leqslant \infty, \tag{15}$$
where $W^{\nu}_p(\mathbb{R}^2)$ denotes the fractional Sobolev space of order $\nu > 0$ in $L_p(\mathbb{R}^2)$. The notion of sum rules of a mask is closely related to the vanishing moments of a wavelet system ([16,44]). For a bivariate mask $a$, we say that $a$ satisfies the sum rules of order $J$ with the dilation matrix $M$ if
$$\sum_{k \in \mathbb{Z}^2} a_{j+Mk}\, P(j+Mk) = \sum_{k \in \mathbb{Z}^2} a_{Mk}\, P(Mk), \qquad \forall\, P \in \Pi_{J-1},\ j \in \mathbb{Z}^2. \tag{16}$$
Equivalently, $\partial^{\mu_1}_1 \partial^{\mu_2}_2 \hat a(2\pi\omega) = 0$ for all $\mu_1 + \mu_2 < J$ and $\omega \in [(M^{\mathsf T})^{-1}\mathbb{Z}^2] \setminus \mathbb{Z}^2$, where $\partial_1$ and $\partial_2$ denote the partial derivatives in the first and second coordinate, respectively. Throughout the article, we denote by $\operatorname{sr}(a, M)$ the highest order of sum rules satisfied by the mask $a$ with the dilation matrix $M$.

To investigate various properties of refinable functions, next we introduce a quantity $\nu_p(a, M)$ in wavelet analysis (see [31]). For a bivariate mask $a$ and a dilation matrix $M$, we denote
$$\nu_p(a, M) := -\log_{\rho(M)}\big[|\det M|^{1-1/p}\, \rho(a, M, p)\big], \qquad 1 \leqslant p \leqslant \infty, \tag{17}$$
where $\rho(M)$ denotes the spectral radius of $M$ and
$$\rho(a, M, p) := \max\Big\{ \limsup_{n \to \infty} \|a_{n,(\mu_1,\mu_2)}\|^{1/n}_{\ell_p(\mathbb{Z}^2)} : \mu_1 + \mu_2 = \operatorname{sr}(a, M),\ \mu_1, \mu_2 \in \mathbb{N} \cup \{0\} \Big\},$$
where the sequence $a_{n,(\mu_1,\mu_2)}$ is defined via its Fourier series as follows: for $\xi = (\xi_1, \xi_2) \in \mathbb{R}^2$,
$$\widehat{a_{n,(\mu_1,\mu_2)}}(\xi) := (1 - e^{-i\xi_1})^{\mu_1} (1 - e^{-i\xi_2})^{\mu_2}\, \hat a\big((M^{\mathsf T})^{n-1}\xi\big)\, \hat a\big((M^{\mathsf T})^{n-2}\xi\big) \cdots \hat a(M^{\mathsf T}\xi)\, \hat a(\xi).$$
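Returning to the cascade iteration (14), a minimal numerical sketch, assuming $M = 2I_2$ and a separable Haar-type mask (an illustrative choice; all names below are ours). For this mask the characteristic function of the unit square is exactly the fixed point $f_\infty$:

```python
def cascade_step(f, mask, det_m=4):
    """One cascade iteration (Q_{a,M} f)(x) = |det M| * sum_k a_k f(Mx - k),
    here with M = 2*I2, acting on a function of two variables."""
    def g(x, y):
        return det_m * sum(a * f(2 * x - k1, 2 * y - k2)
                           for (k1, k2), a in mask.items())
    return g

# Separable Haar-type mask: a_k = 1/4 on {0,1}^2 (the coefficients sum to 1).
mask = {(k1, k2): 0.25 for k1 in (0, 1) for k2 in (0, 1)}

def box(x, y):  # characteristic function of the unit square [0,1)^2
    return 1.0 if 0 <= x < 1 and 0 <= y < 1 else 0.0

g = cascade_step(box, mask)
# box is the standard refinable function for this mask: Q_{a,M} box = box.
assert g(0.3, 0.7) == 1.0 and g(1.2, 0.5) == 0.0
```

For smoother masks the fixed point has no closed form, and repeated application of `cascade_step` is exactly how subdivision schemes approximate $\phi$ in practice.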
It is known ([31,36]) that $\nu_p(a, M) \geqslant \nu_q(a, M) \geqslant \nu_p(a, M) + (1/q - 1/p)\log_{\rho(M)}|\det M|$ for all $1 \leqslant p \leqslant q \leqslant \infty$. Parts of the following result essentially appeared in various forms in many papers in the literature. For the study of refinable functions and cascade algorithms, see [4,10,12,15,16,22,23,25,29,31,45,48,53] and references therein. The following is from [31].

Theorem 1  Let $M$ be a $2\times 2$ isotropic dilation matrix and $a$ a finitely supported bivariate mask. Let $\phi$ denote the standard refinable function associated with mask $a$ and dilation $M$. Then $\nu_p(\phi) \geqslant \nu_p(a, M)$ for all $1 \leqslant p \leqslant \infty$. If the shifts of $\phi$ are stable, that is, $\{\phi(\cdot - k) : k \in \mathbb{Z}^2\}$ is a Riesz system in $L_p(\mathbb{R}^2)$, then $\nu_p(\phi) = \nu_p(a, M)$. Moreover, for every nonnegative integer $J$, the following statements are equivalent:

1. For every compactly supported function $f \in W^J_p(\mathbb{R}^2)$ (if $p = \infty$, we require $f \in C^J(\mathbb{R}^2)$) such that $\hat f(0) = 1$ and $\partial^{\mu_1}_1 \partial^{\mu_2}_2 \hat f(2\pi k) = 0$ for all $\mu_1 + \mu_2 < J$ and $k \in \mathbb{Z}^2 \setminus \{0\}$, the cascade sequence $\{Q^n_{a,M} f\}_{n=0}^{\infty}$ converges in the Sobolev space $W^J_p(\mathbb{R}^2)$ (in fact, the limit function is $\phi$).
2. $\nu_p(a, M) > J$.
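As a small numerical spot check of the sum-rule condition stated before Theorem 1, the sketch below uses the tensor-product mask $\hat a(\xi_1,\xi_2) = \cos^4(\xi_1/2)\cos^4(\xi_2/2)$ (an assumed example) and verifies that $\hat a$ is $O(h^4)$ near the points $2\pi\omega$, $\omega \in [(M^{\mathsf T})^{-1}\mathbb{Z}^2] \setminus \mathbb{Z}^2$, for $M = 2I_2$:

```python
import math

def a_hat(x1, x2):
    """Tensor-product mask symbol of the order-4 spline (an assumed example)."""
    return math.cos(x1 / 2) ** 4 * math.cos(x2 / 2) ** 4

# For M = 2*I2, sum rules of order 4 require a_hat to vanish (with all partial
# derivatives of total order < 4) at the points (pi,0), (0,pi), (pi,pi).
h = 1e-3
for (p1, p2) in [(math.pi, 0.0), (0.0, math.pi), (math.pi, math.pi)]:
    # a fourth-order zero: values at distance ~h are O(h^4)
    assert abs(a_hat(p1 + h, p2 + h)) < 10 * h ** 4
```

This only samples along one direction, so it is a plausibility check rather than a proof; the full condition involves all partial derivatives of total order below $J$.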
Symmetry of a refinable function is another important property of a wavelet system in applications. For example, symmetry is desirable in order to handle the boundary of an image or to improve the visual quality of a surface ([16,21,22,32,50,58]). Let $M$ be a $2\times 2$ dilation matrix and $G$ a finite set of $2\times 2$ integer matrices. We say that $G$ is a symmetry group with respect to $M$ ([27,29]) if $G$ is a group under matrix
multiplication and $MEM^{-1} \in G$ for all $E \in G$. In dimension two, two commonly used symmetry groups are $D_4$ and $D_6$:
$$D_4 := \Big\{ \pm\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix},\ \pm\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix},\ \pm\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},\ \pm\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \Big\}$$
and
$$D_6 := \Big\{ \pm\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix},\ \pm\begin{bmatrix} 1 & -1 \\ 1 & 0 \end{bmatrix},\ \pm\begin{bmatrix} 0 & -1 \\ 1 & -1 \end{bmatrix},\ \pm\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},\ \pm\begin{bmatrix} 1 & 0 \\ 1 & -1 \end{bmatrix},\ \pm\begin{bmatrix} 1 & -1 \\ 0 & -1 \end{bmatrix} \Big\}.$$
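The group axioms for $D_6$ can be verified mechanically. The sketch below builds the twelve matrices from a 60-degree lattice rotation $R$ and a reflection $S$ (a standard presentation of $D_6$; the helper names are ours) and checks closure under multiplication:

```python
def matmul(A, B):
    return tuple(tuple(sum(A[i][k] * B[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))

def neg(A):
    return tuple(tuple(-x for x in row) for row in A)

I = ((1, 0), (0, 1))
R = ((1, -1), (1, 0))    # rotation by 60 degrees on the hexagonal lattice
S = ((0, 1), (1, 0))     # a reflection
rotations = [I, R, matmul(R, R)]
reflections = [matmul(S, Q) for Q in rotations]
D6 = set(rotations + reflections)
D6 |= {neg(A) for A in D6}   # 12 elements in total, since R^3 = -I

assert len(D6) == 12
# Closure under multiplication, so D6 is a group of integer matrices.
assert all(matmul(A, B) in D6 for A in D6 for B in D6)
```

The same helpers can check the compatibility condition $MEM^{-1} \in G$ for a candidate dilation matrix $M$.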
For symmetric refinable functions, we have the following result ([32, Proposition 1]):

Proposition 2  Let $M$ be a $2\times 2$ dilation matrix and $G$ a symmetry group with respect to $M$. Let $a$ be a bivariate mask and $\phi$ the standard refinable function associated with mask $a$ and dilation $M$. Then $a$ is $G$-symmetric with center $c_a$, that is, $a_{E(k - c_a) + c_a} = a_k$ for all $k \in \mathbb{Z}^2$ and $E \in G$, if and only if $\phi$ is $G$-symmetric with center $c$:
$$\phi(E(\cdot - c) + c) = \phi, \qquad \forall\, E \in G, \quad \text{with } c := (M - I_2)^{-1} c_a.$$

In the following, let us present some examples of bivariate refinable functions.

Example 3  Let $M = 2I_2$ and $\hat a(\xi_1, \xi_2) := \cos^4(\xi_1/2)\cos^4(\xi_2/2)$. Then $a$ is a tensor product mask with $\operatorname{sr}(a, 2I_2) = 4$ and $a$ is $D_4$-symmetric with center $0$. The associated standard refinable function $\phi$ is the tensor product spline of order 4. The Catmull-Clark subdivision scheme for quadrilateral meshes in computer graphics is based on this mask $a$ ([21]).

Example 4  Let $M = 2I_2$ and $\hat a(\xi_1, \xi_2) := \cos^2(\xi_1/2)\cos^2(\xi_2/2)\cos^2(\xi_1/2 + \xi_2/2)$. Then $\operatorname{sr}(a, 2I_2) = 4$ and $a$ is $D_6$-symmetric with center $0$. The associated refinable function is the convolution of the three-direction box spline with itself ([20]). The Loop subdivision scheme for triangular meshes in computer graphics is based on this mask $a$ ([21]).

Another important property of a refinable function $\phi$ is interpolation. We say that a bivariate function $\phi$ is interpolating if $\phi$ is continuous and $\phi(k) = \delta_k$ for all $k \in \mathbb{Z}^2$.

Theorem 5  Let $M$ be a $2\times 2$ dilation matrix and $a$ a finitely supported bivariate mask. Let $\phi$ denote the standard refinable function associated with mask $a$ and dilation $M$. Then $\phi$ is an interpolating function if and only if $\nu_{\infty}(a, M) > 0$ and $a$ is an interpolatory mask with dilation $M$, that is, $a_0 = |\det M|^{-1}$ and $a_{Mk} = 0$ for all $k \in \mathbb{Z}^2 \setminus \{0\}$.

Example 6  Let $M = 2I_2$ and
$$\begin{aligned} \hat a(\xi_1, \xi_2) := {} & \cos(\xi_1/2)\cos(\xi_2/2)\cos(\xi_1/2 + \xi_2/2)\, \big[ 1 + 2\cos(\xi_1) + 2\cos(\xi_2) + 2\cos(\xi_1 + \xi_2) \\ & - \cos(2\xi_1 + \xi_2) - \cos(\xi_1 + 2\xi_2) - \cos(\xi_1 - \xi_2) \big] / 4. \end{aligned} \tag{18}$$
Then $\operatorname{sr}(a, 2I_2) = 4$, $a$ is $D_6$-symmetric with center $0$, and $a$ is an interpolatory mask with dilation $2I_2$. Since $\nu_2(a, 2I_2) \approx 2.44077$, we have $\nu_{\infty}(a, 2I_2) \geqslant \nu_2(a, 2I_2) - 1 \approx 1.44077 > 0$. By Theorem 5, the associated standard refinable function $\phi$ is interpolating. The butterfly interpolatory subdivision scheme for triangular meshes is based on this mask $a$ (see [23]).

For more details on bivariate interpolatory masks and interpolating refinable functions, see [4,19,22,23,25,26,27,31,37,38,39,59].
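The claims of Example 6 can be spot-checked numerically. The sketch below encodes the butterfly symbol (18) (with the signs as reconstructed here) and verifies $\hat a(0,0) = 1$ together with the interpolatory condition for dilation $2I_2$, namely that the four $\pi$-shifts of $\hat a$ sum to one:

```python
import math

def a_hat(x1, x2):
    """Butterfly mask symbol of Example 6 (signs as reconstructed in (18))."""
    prefactor = math.cos(x1 / 2) * math.cos(x2 / 2) * math.cos((x1 + x2) / 2)
    bracket = (1 + 2 * math.cos(x1) + 2 * math.cos(x2) + 2 * math.cos(x1 + x2)
               - math.cos(2 * x1 + x2) - math.cos(x1 + 2 * x2)
               - math.cos(x1 - x2))
    return prefactor * bracket / 4

assert abs(a_hat(0.0, 0.0) - 1.0) < 1e-12
# Interpolatory with dilation 2*I2: the four half-lattice shifts sum to 1,
# since only the coefficient a_0 = 1/4 survives on the even sublattice.
for (x1, x2) in [(0.7, -1.3), (2.1, 0.4)]:
    s = sum(a_hat(x1 + p1 * math.pi, x2 + p2 * math.pi)
            for p1 in (0, 1) for p2 in (0, 1))
    assert abs(s - 1.0) < 1e-9
```

The $D_6$-symmetry of the mask is also visible numerically, e.g. $\hat a(\xi_1,\xi_2) = \hat a(\xi_2,\xi_1)$ for all sampled points.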
The Projection Method

In applications, one is interested in analyzing some optimal properties of multivariate wavelets. The projection method is useful for this purpose. Let $r$ and $s$ be two positive integers with $r \leqslant s$. Let $P$ be an $r \times s$ real-valued matrix. For a compactly supported function $\phi$ and a finitely supported mask $a$ in dimension $s$, we define the projected function $P\phi$ and the projected mask $Pa$ in dimension $r$ by
$$\widehat{P\phi}(\xi) := \hat\phi(P^{\mathsf T}\xi) \quad\text{and}\quad \widehat{Pa}(\xi) := \sum_{j=1}^{t} \hat a(P^{\mathsf T}\xi + 2\pi\varepsilon_j), \qquad \xi \in \mathbb{R}^r, \tag{19}$$
where $\hat\phi$ and $\hat a$ are understood to be continuous and $\{\varepsilon_1, \ldots, \varepsilon_t\}$ is a complete set of representatives of the distinct cosets of $[P^{\mathsf T}\mathbb{R}^r]/\mathbb{Z}^s$. If $P$ is an integer matrix, then $\widehat{Pa}(\xi) = \hat a(P^{\mathsf T}\xi)$.

Now we have the following result on projected refinable functions ([27,28,34,35]).

Theorem 7  Let $N$ be an $s \times s$ dilation matrix and $M$ an $r \times r$ dilation matrix. Let $P$ be an $r \times s$ integer matrix such that $PN = MP$ and $P\mathbb{Z}^s = \mathbb{Z}^r$. Let $\hat a$ be a $2\pi$-periodic trigonometric polynomial in $s$ variables with $\hat a(0) = 1$ and let $\phi$ be the standard $N$-refinable function associated with mask $a$.
Then sr(a; N) ≤ sr(Pa; M) and $\widehat{P\phi}(M^T\xi) = \widehat{Pa}(\xi)\,\widehat{P\phi}(\xi)$. That is, Pφ is M-refinable with mask Pa. Moreover, for all 1 ≤ p ≤ ∞,

$$\nu_p(\phi) \le \nu_p(P\phi) \quad\text{and}\quad |\det M|^{1-1/p}\,\rho(Pa; M; p) \le |\det N|^{1-1/p}\,\rho(a; N; p)\,.$$

If we further assume that ρ(M) = ρ(N), then ν_p(a; N) ≤ ν_p(Pa; M).

As pointed out in [34,35], the projection method is closely related to box splines. For a given r × s (direction) integer matrix Ξ of rank r with r ≤ s, the Fourier transforms of its associated box spline M_Ξ and of its mask a_Ξ are given by (see [20])

$$\widehat{M_\Xi}(\xi) := \prod_{k\in\Xi} \frac{1-e^{-ik\cdot\xi}}{ik\cdot\xi} \quad\text{and}\quad \widehat{a_\Xi}(\xi) := \prod_{k\in\Xi} \frac{1+e^{-ik\cdot\xi}}{2}, \qquad \xi\in\mathbb{R}^r, \tag{20}$$

where k ∈ Ξ means that k is a column vector of Ξ and k goes through all the columns of Ξ once and only once. Let χ_{[0,1]^s} denote the characteristic function of the unit cube [0,1]^s. From (20), it is evident that the box spline M_Ξ is just the projected function Ξχ_{[0,1]^s}, since $\widehat{\Xi\chi_{[0,1]^s}}(\xi) = \widehat{\chi_{[0,1]^s}}(\Xi^T\xi) = \widehat{M_\Xi}(\xi)$ for ξ ∈ R^r. Note that M_Ξ is 2I_r-refinable with the mask a_Ξ, since $\widehat{M_\Xi}(2\xi) = \widehat{a_\Xi}(\xi)\,\widehat{M_\Xi}(\xi)$. As an application of the projection method in Theorem 7, we have ([25, Theorem 3.5]):

Corollary 8 Let M = 2I_s and a be an interpolatory mask with dilation M such that a is supported inside [−3, 3]^s. Then ν_∞(a; 2I_s) ≤ 2 and therefore φ ∉ C²(R^s), where φ is the standard refinable function associated with mask a and dilation 2I_s.

We use proof by contradiction. Suppose ν_∞(a; 2I_s) > 2. Let P = [1, 0, …, 0] be a 1 × s matrix. Then we must have sr(a; 2I_s) ≥ 3 which, combined with the other assumptions on a, forces $\widehat{Pa}(\xi) = 1/2 + (9/16)\cos(\xi) - (1/16)\cos(3\xi)$. Since ν_∞(Pa; 2) = 2, by Theorem 7 we must have ν_∞(a; 2I_s) ≤ ν_∞(Pa; 2) = 2, a contradiction. So ν_∞(a; 2I_s) ≤ 2. In particular, the refinable function in the butterfly subdivision scheme in Example 6 is not C².

The projection method can be used to construct interpolatory masks painlessly ([27]).

Theorem 9 Let M be an r × r dilation matrix. Then there is an r × r integer matrix H such that MZ^r = HZ^r and H^r = |det M| I_r. Let P := |det M|^{−1} H. Then for any (tensor product) interpolatory mask a with dilation |det M| I_r, the projected mask Pa is an interpolatory mask with dilation M and sr(Pa; M) ≥ sr(a; |det M| I_r).

Example 10 For M = M_{√2} or M = Q_{√2}, we can take H := M_{√2} in Theorem 9.

For more details on the projection method, see [25,27,28,32,34,35].

Bivariate Orthonormal and Biorthogonal Wavelets

In this section, we discuss the analysis and construction of bivariate orthonormal and biorthogonal wavelets. For the analysis of biorthogonal wavelets, the following result is well known (see [9,11,16,24,26,31,46,51,53] and references therein):

Theorem 11 Let M be a 2 × 2 dilation matrix and a, ã be two finitely supported bivariate masks. Let φ and φ̃ be the standard M-refinable functions associated with the masks a and ã, respectively. Then φ, φ̃ ∈ L₂(R²) and satisfy the biorthogonality relation

$$\langle \phi, \tilde\phi(\cdot-k)\rangle = \delta_k, \qquad k\in\mathbb{Z}^2, \tag{21}$$

if and only if ν₂(a; M) > 0, ν₂(ã; M) > 0, and (a, ã) is a pair of dual masks:

$$\sum_{\gamma\in\Omega_{M^T}} \hat a(\xi+2\pi\gamma)\,\overline{\hat{\tilde a}(\xi+2\pi\gamma)} = 1, \qquad \xi\in\mathbb{R}^2, \tag{22}$$
where Ω_{M^T} is a complete set of representatives of the distinct cosets of [(M^T)^{−1}Z²]/Z² with 0 ∈ Ω_{M^T}. Moreover, if (21) holds and there exist 2π-periodic trigonometric polynomials â¹, …, â^{m−1}, ã̂¹, …, ã̂^{m−1} with m := |det M| such that

$$\mathcal{M}_{[\hat a,\,\hat a^1,\dots,\hat a^{m-1}]}(\xi)\;\overline{\mathcal{M}_{[\hat{\tilde a},\,\hat{\tilde a}^1,\dots,\hat{\tilde a}^{m-1}]}(\xi)}^{\,T} = I_m, \qquad \xi\in\mathbb{R}^2, \tag{23}$$

where, for ξ ∈ R²,

$$\mathcal{M}_{[\hat a,\,\hat a^1,\dots,\hat a^{m-1}]}(\xi) :=
\begin{bmatrix}
\hat a(\xi+2\pi\gamma_0) & \hat a^1(\xi+2\pi\gamma_0) & \cdots & \hat a^{m-1}(\xi+2\pi\gamma_0)\\
\hat a(\xi+2\pi\gamma_1) & \hat a^1(\xi+2\pi\gamma_1) & \cdots & \hat a^{m-1}(\xi+2\pi\gamma_1)\\
\vdots & \vdots & \ddots & \vdots\\
\hat a(\xi+2\pi\gamma_{m-1}) & \hat a^1(\xi+2\pi\gamma_{m-1}) & \cdots & \hat a^{m-1}(\xi+2\pi\gamma_{m-1})
\end{bmatrix} \tag{24}$$

with {γ₀, …, γ_{m−1}} := Ω_{M^T} and γ₀ := 0. Define ψ¹, …, ψ^{m−1}, ψ̃¹, …, ψ̃^{m−1} by

$$\hat\psi^{\ell}(M^T\xi) := \hat a^{\ell}(\xi)\,\hat\phi(\xi) \quad\text{and}\quad \hat{\tilde\psi}^{\ell}(M^T\xi) := \hat{\tilde a}^{\ell}(\xi)\,\hat{\tilde\phi}(\xi), \qquad \ell = 1,\dots,m-1\,. \tag{25}$$

Then ({ψ¹, …, ψ^{m−1}}, {ψ̃¹, …, ψ̃^{m−1}}) generates a pair of biorthogonal M-wavelet bases in L₂(R²).
As a direct consequence of Theorem 11, for orthonormal wavelets we have

Corollary 12 Let M be a 2 × 2 dilation matrix and a be a finitely supported bivariate mask. Let φ be the standard refinable function associated with mask a and dilation M. Then φ has orthonormal shifts, that is, ⟨φ, φ(· − k)⟩ = δ_k for all k ∈ Z², if and only if ν₂(a; M) > 0 and a is an orthogonal mask (that is, (22) is satisfied with ã̂ replaced by â). If in addition there exist 2π-periodic trigonometric polynomials â¹, …, â^{m−1} with m := |det M| such that $\mathcal{M}_{[\hat a,\hat a^1,\dots,\hat a^{m-1}]}(\xi)\;\overline{\mathcal{M}_{[\hat a,\hat a^1,\dots,\hat a^{m-1}]}(\xi)}^{\,T} = I_m$, then {ψ¹, …, ψ^{m−1}}, defined in (25), generates an orthonormal M-wavelet basis in L₂(R²).

The masks a and ã are called low-pass filters, and the masks a¹, …, a^{m−1}, ã¹, …, ã^{m−1} are called high-pass filters in the engineering literature. The equation in (23) is called the matrix extension problem in the wavelet literature (see [6,16,46,58]). Given a pair of dual masks a and ã, though the existence of â¹, …, â^{m−1}, ã̂¹, …, ã̂^{m−1} (without symmetry) in the matrix extension problem in (23) is guaranteed by the Quillen–Suslin theorem, it is far from a trivial task to construct them (in particular, with symmetry) algorithmically. For the orthogonal case, the general theory of the matrix extension problem in (23) with ã = a and ã^ℓ = a^ℓ, ℓ = 1, …, m − 1, remains unanswered. However, when |det M| = 2, the matrix extension problem is trivial. In fact, letting ε ∈ Ω_{M^T} \ {0} (that is, Ω_{M^T} = {0, ε}), one defines $\hat a^1(\xi) := e^{-i\gamma\cdot\xi}\,\overline{\hat{\tilde a}(\xi+2\pi\varepsilon)}$ and $\hat{\tilde a}^1(\xi) := e^{-i\gamma\cdot\xi}\,\overline{\hat a(\xi+2\pi\varepsilon)}$, where γ ∈ Z² satisfies ε·γ = 1/2.

It is not an easy task to construct nonseparable orthogonal masks with desirable properties for a general dilation matrix. For example, for the dilation matrix Q_{√2} in (1), it is not known so far in the literature [9] whether there is a compactly supported C¹ Q_{√2}-refinable function with orthonormal shifts. See [1,2,9,25,27,30,32,50,51,53,57] for more details on the construction of orthogonal masks and orthonormal refinable functions. However, for any dilation matrix, "separable" orthogonal masks with arbitrarily high orders of sum rules can be easily obtained via the projection method. Let M be a 2 × 2 dilation matrix. Then M = E diag(d₁, d₂) F for some integer matrices E and F such that |det E| = |det F| = 1 and d₁, d₂ ∈ N. Let a be a tensor product orthogonal mask with the diagonal dilation matrix diag(d₁, d₂) satisfying the sum rules of any preassigned order J. Then the projected mask Ea (that is, (Ea)_k := a_{E^{−1}k}, k ∈ Z²) is an orthogonal mask with dilation M and sr(Ea; M) ≥ J. See [27, Corollary 3.4] for more detail.
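As a concrete illustration of the |det M| = 2 construction just described, consider the quincunx dilation matrix (our choice for illustration; any dilation matrix with |det M| = 2 works the same way):

```latex
M = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad
(M^T)^{-1} = \tfrac12 \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad
\Omega_{M^T} = \{0,\varepsilon\}, \quad \varepsilon = (\tfrac12,\tfrac12).
\\[6pt]
\text{Choosing } \gamma = (1,0) \in \mathbb{Z}^2 \text{ gives } \varepsilon\cdot\gamma = \tfrac12, \text{ so}
\\[6pt]
\hat a^{1}(\xi_1,\xi_2) := e^{-i\xi_1}\,\overline{\hat{\tilde a}(\xi_1+\pi,\ \xi_2+\pi)}, \qquad
\hat{\tilde a}^{1}(\xi_1,\xi_2) := e^{-i\xi_1}\,\overline{\hat a(\xi_1+\pi,\ \xi_2+\pi)} .
```

Here the phase shift 2πε = (π, π) is exactly the shift appearing in the dual-mask condition (22) for this M.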
Using the high-pass filters in the matrix extension problem, for a given mask a, dual masks ã of a (without any guaranteed order of sum rules) can be obtained by the lifting scheme in [63] and by the stable completion method in [3]. Special families of dual masks with sum rules can be obtained by the convolution method in [24, Proposition 3.7] and [43], but the constructed dual masks generally have long supports relative to their orders of sum rules. Another method, which we cite here, for constructing all finitely supported dual masks with any preassigned orders of sum rules is the CBC (coset by coset) algorithm proposed in [5,25,26].

For μ = (μ₁, μ₂) and ν = (ν₁, ν₂), we say ν ≤ μ if ν₁ ≤ μ₁ and ν₂ ≤ μ₂. We also denote |μ| = μ₁ + μ₂, μ! := μ₁! μ₂! and x^μ := x₁^{μ₁} x₂^{μ₂} for x = (x₁, x₂). We denote by Ω_M a complete set of representatives of the distinct cosets of Z²/[MZ²] with 0 ∈ Ω_M. The following result is from [25].

Theorem 13 (CBC Algorithm) Let M be a 2 × 2 dilation matrix and a be an interpolatory mask with dilation M. Let J be any preassigned positive integer.

1. Compute the quantities h_a^μ, |μ| < J, from the mask a by the recursive formula

$$h_a^{\mu} := \delta_{\mu} - \sum_{0\le\nu<\mu} \frac{\mu!}{\nu!\,(\mu-\nu)!}\,(-1)^{|\mu-\nu|}\, h_a^{\nu} \sum_{k\in\mathbb{Z}^2} a_k\, k^{\mu-\nu}\,.$$

For a mask a and high-pass filters a¹, …, a^{m−1} with dilation M, if ν₂(a; M) > 0 and ν₂(ã; M) > 0, where ã̂ is the (1,1)-entry of the matrix $\big[\mathcal{M}_{[\hat a,\hat a^1,\dots,\hat a^{m-1}]}(\xi)^T\big]^{-1}$, then {ψ¹, …, ψ^{m−1}}, which are defined in (25), generates a Riesz M-wavelet basis in L₂(R²).

Example 17 Let M = 2I₂ and â(ξ₁, ξ₂) := cos²(ξ₁/2) cos²(ξ₂/2) cos²(ξ₁/2 + ξ₂/2) in Example 4. Then sr(a; 2I₂) = 4 and a is D₆-symmetric with center 0. Define ([41,60])

$$\hat a^1(\xi_1,\xi_2) := e^{-i(\xi_1+\xi_2)}\,\hat a(\xi_1+\pi,\ \xi_2), \qquad
\hat a^2(\xi_1,\xi_2) := e^{-i\xi_2}\,\hat a(\xi_1,\ \xi_2+\pi), \qquad
\hat a^3(\xi_1,\xi_2) := e^{-i\xi_1}\,\hat a(\xi_1+\pi,\ \xi_2+\pi)\,.$$

Then all the conditions in Theorem 16 are satisfied and {ψ¹, ψ², ψ³} generates a Riesz 2I₂-wavelet basis in L₂(R²) ([41]). This Riesz wavelet derived from the Loop scheme has been used in [49] for mesh compression in computer graphics with impressive performance.

Pairs of Dual Wavelet Frames

In this section, we mention a method for constructing bivariate dual wavelet frames. The following Oblique Extension Principle (OEP) has been proposed in [18] (and independently in [7]; see also [17]) for constructing pairs of dual wavelet frames.

Theorem 18 Let M be a 2 × 2 dilation matrix and a, ã be two finitely supported masks. Let φ, φ̃ be the standard M-refinable functions with masks a and ã, respectively, such that φ, φ̃ ∈ L₂(R²). If there exist 2π-periodic trigonometric
polynomials Θ̂, â¹, …, â^L, ã̂¹, …, ã̂^L such that Θ̂(0) = 1, â¹(0) = ⋯ = â^L(0) = ã̂¹(0) = ⋯ = ã̂^L(0) = 0, and

$$\mathcal{M}_{[\hat\Theta(M^T\cdot)\hat a,\,\hat a^1,\dots,\hat a^L]}(\xi)\;\overline{\mathcal{M}_{[\hat{\tilde a},\,\hat{\tilde a}^1,\dots,\hat{\tilde a}^L]}(\xi)}^{\,T} = \operatorname{diag}\!\big(\hat\Theta(\xi+2\pi\gamma_0), \dots, \hat\Theta(\xi+2\pi\gamma_{m-1})\big), \tag{26}$$

where {γ₀, …, γ_{m−1}} = Ω_{M^T} with γ₀ = 0 as in Theorem 11. Then ({ψ¹, …, ψ^L}, {ψ̃¹, …, ψ̃^L}), defined in (25), generates a pair of dual M-wavelet frames in L₂(R²).

For dimension one, many interesting tight wavelet frames and pairs of dual wavelet frames in L₂(R) have been constructed via the OEP method in the literature; for more details, see [7,17,18,24,30,52,61,62] and references therein. The application of the OEP in high dimensions is much more difficult, mainly due to the matrix extension problem in (26). The projection method can also be used to obtain pairs of dual wavelet frames ([34,35]).

Theorem 19 Let M be an r × r dilation matrix and N be an s × s dilation matrix with r ≤ s. Let P be an r × s integer matrix of rank r such that MP = PN and P^T(Z^r \ [M^T Z^r]) ⊆ Z^s \ [N^T Z^s]. Let ψ¹, …, ψ^L, ψ̃¹, …, ψ̃^L be compactly supported functions in L₂(R^s) such that

$$\nu_2(\psi^{\ell}) > 0 \quad\text{and}\quad \nu_2(\tilde\psi^{\ell}) > 0 \qquad \text{for all } \ell = 1,\dots,L\,.$$

If ({ψ¹, …, ψ^L}, {ψ̃¹, …, ψ̃^L}) generates a pair of dual N-wavelet frames in L₂(R^s), then ({Pψ¹, …, Pψ^L}, {Pψ̃¹, …, Pψ̃^L}) generates a pair of dual M-wavelet frames in L₂(R^r).

Example 20 Let Ξ be an r × s (direction) integer matrix such that Ξ^T(Z^r \ [2Z^r]) ⊆ Z^s \ [2Z^s]. Let M = 2I_r and N = 2I_s. Let {ψ¹, …, ψ^{2^s−1}} be the generators of the tensor product Haar orthonormal wavelet in dimension s, derived from the Haar orthonormal refinable function φ_s := χ_{[0,1]^s}. Then by Theorem 19, {Ξψ¹, …, Ξψ^{2^s−1}} generates a tight 2I_r-wavelet frame in L₂(R^r), and all the projected wavelet functions are derived from the refinable box spline function Ξφ_s = M_Ξ.

Future Directions

There are still many challenging problems on bivariate wavelets. In the following we mention only a few.

1. For any 2 × 2 dilation matrix M, e.g., M = Q_{√2} in (1), can one always construct a family of MRA compactly supported orthonormal M-wavelet bases with arbitrarily high smoothness (and with symmetry if |det M| > 2)?
2. The matrix extension problem in (23) for orthogonal masks. That is, for a given orthogonal mask a with dilation M, find finitely supported high-pass filters a¹, …, a^{m−1} (with symmetry if possible) such that (23) holds with ã = a and ã^ℓ = a^ℓ.
3. The matrix extension problem in (23) for a given pair of dual masks with symmetry. That is, for a given pair of symmetric dual masks a and ã with dilation M, find finitely supported symmetric high-pass filters a¹, …, a^{m−1}, ã¹, …, ã^{m−1} such that (23) is satisfied.
4. Directional bivariate wavelets. In order to handle edges of different orientations in images, directional wavelets are of interest in applications. See [13,55] and many references on this topic.

Bibliography

Primary Literature
1. Ayache A (2001) Some methods for constructing nonseparable, orthonormal, compactly supported wavelet bases. Appl Comput Harmon Anal 10:99–111
2. Belogay E, Wang Y (1999) Arbitrarily smooth orthogonal nonseparable wavelets in R². SIAM J Math Anal 30:678–697
3. Carnicer JM, Dahmen W, Peña JM (1996) Local decomposition of refinable spaces and wavelets. Appl Comput Harmon Anal 3:127–153
4. Cavaretta AS, Dahmen W, Micchelli CA (1991) Stationary subdivision. Mem Amer Math Soc 93(453):1–186
5. Chen DR, Han B, Riemenschneider SD (2000) Construction of multivariate biorthogonal wavelets with arbitrary vanishing moments. Adv Comput Math 13:131–165
6. Chui CK (1992) An introduction to wavelets. Academic Press, Boston
7. Chui CK, He W, Stöckler J (2002) Compactly supported tight and sibling frames with maximum vanishing moments. Appl Comput Harmon Anal 13:224–262
8. Cohen A (2003) Numerical analysis of wavelet methods. North-Holland, Amsterdam
9. Cohen A, Daubechies I (1993) Nonseparable bidimensional wavelet bases. Rev Mat Iberoamericana 9:51–137
10. Cohen A, Daubechies I (1996) A new technique to estimate the regularity of refinable functions. Rev Mat Iberoamericana 12:527–591
11. Cohen A, Daubechies I, Feauveau JC (1992) Biorthogonal bases of compactly supported wavelets. Comm Pure Appl Math 45:485–560
12. Cohen A, Gröchenig K, Villemoes LF (1999) Regularity of multivariate refinable functions. Constr Approx 15:241–255
13. Cohen A, Schlenker JM (1993) Compactly supported bidimensional wavelet bases with hexagonal symmetry. Constr Approx 9:209–236
14. Dahmen W (1997) Wavelet and multiscale methods for operator equations. Acta Numer 6:55–228
15. Daubechies I (1988) Orthonormal bases of compactly supported wavelets. Comm Pure Appl Math 41:909–996
16. Daubechies I (1992) Ten lectures on wavelets. CBMS-NSF Series. SIAM, Philadelphia
17. Daubechies I, Han B (2004) Pairs of dual wavelet frames from any two refinable functions. Constr Approx 20:325–352
18. Daubechies I, Han B, Ron A, Shen Z (2003) Framelets: MRA-based constructions of wavelet frames. Appl Comput Harmon Anal 14:1–46
19. Dahlke S, Gröchenig K, Maass P (1999) A new approach to interpolating scaling functions. Appl Anal 72:485–500
20. de Boor C, Höllig K, Riemenschneider SD (1993) Box splines. Springer, New York
21. DeRose T, Forsey DR, Kobbelt L, Lounsbery M, Peters J, Schröder P, Zorin D (1998) Subdivision for modeling and animation (course notes)
22. Dyn N, Levin D (2002) Subdivision schemes in geometric modeling. Acta Numer 11:73–144
23. Dyn N, Gregory JA, Levin D (1990) A butterfly subdivision scheme for surface interpolation with tension control. ACM Trans Graph 9:160–169
24. Han B (1997) On dual wavelet tight frames. Appl Comput Harmon Anal 4:380–413
25. Han B (2000) Analysis and construction of optimal multivariate biorthogonal wavelets with compact support. SIAM J Math Anal 31:274–304
26. Han B (2001) Approximation properties and construction of Hermite interpolants and biorthogonal multiwavelets. J Approx Theory 110:18–53
27. Han B (2002) Symmetry property and construction of wavelets with a general dilation matrix. Linear Algebra Appl 353:207–225
28. Han B (2002) Projectable multivariate refinable functions and biorthogonal wavelets. Appl Comput Harmon Anal 13:89–102
29. Han B (2003) Computing the smoothness exponent of a symmetric multivariate refinable function. SIAM J Matrix Anal Appl 24:693–714
30. Han B (2003) Compactly supported tight wavelet frames and orthonormal wavelets of exponential decay with a general dilation matrix. J Comput Appl Math 155:43–67
31. Han B (2003) Vector cascade algorithms and refinable function vectors in Sobolev spaces. J Approx Theory 124:44–88
32. Han B (2004) Symmetric multivariate orthogonal refinable functions. Appl Comput Harmon Anal 17:277–292
33. Han B (2006) On a conjecture about MRA Riesz wavelet bases. Proc Amer Math Soc 134:1973–1983
34. Han B (2006) The projection method in wavelet analysis. In: Chen G, Lai MJ (eds) Modern methods in mathematics. Nashboro Press, Brentwood, pp 202–225
35. Han B (2008) Construction of wavelets and framelets by the projection method. Int J Appl Math Appl 1:1–40
36. Han B, Jia RQ (1998) Multivariate refinement equations and convergence of subdivision schemes. SIAM J Math Anal 29:1177–1199
37. Han B, Jia RQ (1999) Optimal interpolatory subdivision schemes in multidimensional spaces. SIAM J Numer Anal 36:105–124
38. Han B, Jia RQ (2002) Quincunx fundamental refinable functions and quincunx biorthogonal wavelets. Math Comp 71:165–196
39. Han B, Jia RQ (2006) Optimal C² two-dimensional interpolatory ternary subdivision schemes with two-ring stencils. Math Comp 75:1287–1308
40. Han B, Jia RQ (2007) Characterization of Riesz bases of wavelets generated from multiresolution analysis. Appl Comput Harmon Anal 23:321–345
41. Han B, Shen Z (2005) Wavelets from the Loop scheme. J Fourier Anal Appl 11:615–637
42. Han B, Shen Z (2006) Wavelets with short support. SIAM J Math Anal 38:530–556
43. Ji H, Riemenschneider SD, Shen Z (1999) Multivariate compactly supported fundamental refinable functions, duals, and biorthogonal wavelets. Stud Appl Math 102:173–204
44. Jia RQ (1998) Approximation properties of multivariate wavelets. Math Comp 67:647–665
45. Jia RQ (1999) Characterization of smoothness of multivariate refinable functions in Sobolev spaces. Trans Amer Math Soc 351:4089–4112
46. Jia RQ, Micchelli CA (1991) Using the refinement equation for the construction of pre-wavelets II: Powers of two. In: Laurent PJ, Le Méhauté A, Schumaker LL (eds) Curves and surfaces. Academic Press, New York, pp 209–246
47. Jia RQ, Wang JZ, Zhou DX (2003) Compactly supported wavelet bases for Sobolev spaces. Appl Comput Harmon Anal 15:224–241
48. Jiang QT (1998) On the regularity of matrix refinable functions. SIAM J Math Anal 29:1157–1176
49. Khodakovsky A, Schröder P, Sweldens W (2000) Progressive geometry compression. In: Proc SIGGRAPH
50. Kovačević J, Vetterli M (1992) Nonseparable multidimensional perfect reconstruction filter banks and wavelet bases for Rⁿ. IEEE Trans Inform Theory 38:533–555
51. Lai MJ (2006) Construction of multivariate compactly supported orthonormal wavelets. Adv Comput Math 25:41–56
52. Lai MJ, Stöckler J (2006) Construction of multivariate compactly supported tight wavelet frames. Appl Comput Harmon Anal 21:324–348
53. Lawton W, Lee SL, Shen Z (1997) Stability and orthonormality of multivariate refinable functions. SIAM J Math Anal 28:999–1014
54. Lawton W, Lee SL, Shen Z (1998) Convergence of multidimensional cascade algorithm. Numer Math 78:427–438
55. Le Pennec E, Mallat S (2005) Sparse geometric image representations with bandelets. IEEE Trans Image Process 14:423–438
56. Lorentz R, Oswald P (2000) Criteria for hierarchical bases in Sobolev spaces. Appl Comput Harmon Anal 8:32–85
57. Maass P (1996) Families of orthogonal two-dimensional wavelets. SIAM J Math Anal 27:1454–1481
58. Mallat S (1998) A wavelet tour of signal processing. Academic Press, San Diego
59. Riemenschneider SD, Shen Z (1997) Multidimensional interpolatory subdivision schemes. SIAM J Numer Anal 34:2357–2381
60. Riemenschneider SD, Shen ZW (1992) Wavelets and pre-wavelets in low dimensions. J Approx Theory 71:18–38
61. Ron A, Shen Z (1997) Affine systems in L₂(R^d): the analysis of the analysis operator. J Funct Anal 148:408–447
62. Ron A, Shen Z (1997) Affine systems in L₂(R^d) II: dual systems. J Fourier Anal Appl 3:617–637
63. Sweldens W (1996) The lifting scheme: a custom-design construction of biorthogonal wavelets. Appl Comput Harmon Anal 3:186–200
Books and Reviews
Cabrelli C, Heil C, Molter U (2004) Self-similarity and multiwavelets in higher dimensions. Mem Amer Math Soc 170(807):1–82
Branching Processes

Mikko J. Alava¹, Kent Bækgaard Lauritsen²
¹ Department of Engineering Physics, Espoo University of Technology, Espoo, Finland
² Research Department, Danish Meteorological Institute, Copenhagen, Denmark
Article Outline

Glossary
Definition of the Subject
Introduction
Branching Processes
Self-Organized Branching Processes
Scaling and Dissipation
Network Science and Branching Processes
Conclusions
Future Directions
Acknowledgments
Bibliography

Glossary

Markov process A process characterized by a set of probabilities to go from a certain state at time t to another state at time t + 1. These transition probabilities are independent of the history of the process and depend only on a fixed probability assigned to the transition.

Critical properties and scaling The behavior of equilibrium and many non-equilibrium systems in steady states contains critical points where the systems display scale invariance and the correlation functions exhibit an algebraic behavior characterized by so-called critical exponents. A characteristic of this type of behavior is the lack of finite length and time scales (also reminiscent of fractals). The behavior near the critical points can be described by scaling functions that are universal and that do not depend on the detailed microscopic dynamics.

Avalanches When a system is perturbed in such a way that a disturbance propagates throughout the system, one speaks of an avalanche. The local avalanche dynamics may either conserve energy (particles) or dissipate energy. The avalanche may also lose energy when it reaches the system boundary. In the neighborhood of a critical point the avalanche sizes follow a power-law distribution.

Self-organized criticality (SOC) SOC is the surprising "critical" state in which many systems, from physics to biology to social ones, find themselves. In physics jargon, they exhibit scale invariance, which means that the dynamics – consisting of avalanches – has no typical scale in time or space. The really necessary ingredient is that there is a hidden, fine-tuned balance between how such systems are driven to create the dynamic response, and how they dissipate the input ("energy") to still remain in balance.

Networks These are descriptions of interacting systems, where in graph-theoretical language nodes or vertices are connected by links or edges. The interesting thing can be the structure of the network and its dynamics, or that of a process on top of it, like the spreading of computer viruses on the Internet.

Definition of the Subject

Consider the fate of a human population on a small, isolated island. It consists of a certain number of individuals, and the most obvious question, of importance in particular for the inhabitants of the island, is whether this number will go to zero. Humans die and reproduce in steps of one, and therefore one can try to analyze this fate mathematically by writing down what are called master equations, to describe the dynamics as a "branching process" (BP). The branching here means that if at time t = 0 there are N humans, at the next step t = 1 there can be N − 1 (or N + 1, or N + 2 if the only change from t = 0 was that a pair of twins was born). The outcome will depend in the simplest case on a "branching number" λ, or the number of offspring that a human being will have [1,2,3,4]. If the offspring created are too few, then the population will decay, or reach an "absorbing state" out of which it will never escape. Likewise, if they are many (the Malthusian case in reality, perhaps), exponential growth in time will ensue in the simplest case. In between, there is an interesting twist: a phase transition that separates these two outcomes at a critical value λ_c. As is typical of such transitions in statistical physics, one runs into scale-free behavior.
The lifetime of the population suddenly has no typical scale, and its total size will be a stochastic quantity, described by a probability distribution that again has no typical scale exactly at λ_c. The example of a small island also illustrates the many different twists that one can find in branching processes. The population can be "spatially dispersed", such that the individuals are separated by distance. There are in fact two interacting populations, called "male" and "female", and if the size of one of the populations becomes zero the other one will die out soon as well. The people on the island eat, and there is thus a hidden variable in the dynamics, the availability of food. This causes a history
effect which makes the dynamics of human populations what is called "non-Markovian". Imagine, as above, that we look at the number of persons on the island at discrete times. A Markovian process is such that the probability to go from a state (say of) N to state N + δN depends only on the fixed probability assigned to the "transition" N → N + δN. Clearly, any relatively faithful description of the death and birth rates of human beings has to consider the average state of nourishment, or whether there is enough food for reproduction.

Introduction

Branching processes are often perfect models of complex systems, or in other words exhibit deep complexity themselves. Consider the following example of a one-dimensional model of activated random walkers [5]. Take a line of sites x_i, i = 1, …, L. Fill the sites randomly to a certain density n = N/L, where N is the preset number of individuals performing the activated random walk. Now let us apply the simple rule that if there are two or more walkers at the same x_j, two of them get "activated" and hop to j − 1 or j + 1, again at random. In other words, this version of the drunken bar-hoppers problem has the twist that they do not like each other. If the system is "periodic", i.e. site i = 1 is connected to i = L, then the dynamics is controlled by the density n. For a critical value n_c (estimated by numerical simulations to be about 0.9488… [6]) a phase transition takes place, such that for n < n_c the asymptotic state is the "absorbing" one, where all the walkers are immobilized since N_i = 1 or 0. In the opposite case, for n > n_c, there is an active phase such that (in the infinite-L limit) the activity persists forever. This particular model is unquestionably non-Markovian if one only considers the number of active walkers or their density ρ. One needs to know the full state of the N_i to be able to write down exact probabilities for how ρ changes in a discrete step of time.
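The activated-walker rules just described are easy to simulate directly. The sketch below is our own minimal illustration (function names and parameter values are not from the text): it fills a periodic lattice to density n and relaxes it, stopping if the absorbing state is reached.

```python
import random

def step(state, L, rng):
    """One sweep of the activated-walker rules: every site holding two
    or more walkers emits two of them to randomly chosen nearest
    neighbors (periodic boundaries). Returns the number of active sites."""
    moves = []
    for i in range(L):
        if state[i] >= 2:
            state[i] -= 2
            moves += [(i + rng.choice((-1, 1))) % L for _ in range(2)]
    for j in moves:
        state[j] += 1
    return sum(1 for n_i in state if n_i >= 2)

def run(L=100, n=0.8, steps=2000, seed=2):
    """Fill L sites randomly to density n, then relax.
    Returns the final number of active sites (0 = absorbing state)."""
    rng = random.Random(seed)
    state = [0] * L
    for _ in range(int(n * L)):
        state[rng.randrange(L)] += 1
    active = 0
    for _ in range(steps):
        active = step(state, L, rng)
        if active == 0:   # absorbing state: every site holds 0 or 1 walkers
            break
    return active
```

Well below n_c ≈ 0.95 the system quickly falls into the absorbing state; for n > 1 it never can, since with more walkers than sites some site always holds at least two.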
The most interesting things happen if one changes the one-dimensional lattice by adopting two new rules. First, if a walker walks out (to i = 0 or i = L + 1), it disappears. Second, if there are no active walkers (ρ = 0), one adds one new walker at a random site. Now the activity ρ(t) is always at a marginal value, and the long-term average of n becomes a (possibly L-dependent) constant such that the first statement is true, i.e. the system becomes critical. With these rules, the model of activated random walkers is also known as the Manna model, after the Indian physicist [5], and it exhibits the phenomenon dubbed "Self-Organized Criticality" (SOC) [7]. Figure 1 shows an example of the dynamics by using what is called an "activity plot",
Branching Processes, Figure 1 We follow the "activity" in a one-dimensional system of random activated walkers. The walkers stroll around the x-axis, and the pattern becomes in fact scale-invariant. This system is such that some of the walkers disappear (by escaping through the open boundaries), and to maintain a constant density new ones are added. One question one may ask is the waiting time until another (or the same) walker is activated at the same location after a period of inactivity. As one can see from the figure, this can be a result of old activity getting back, or of a new "avalanche" starting from the addition of an outsider (courtesy of Lasse Laurson)
where those locations x_i are marked both in space and time which happen to contain just-activated walkers. One can now apply several kinds of measures to the system, but the figure already hints at the reasons why these simple models have found much interest. The structure of the activity is a self-affine fractal (for discussions about fractals see other reviews in this volume). The main enthusiasm about SOC comes from the avalanches. These are, in other words, the bursts of activity that separate quiescent times (when ρ = 0). The silence is broken by the addition of a particle or a walker, and it creates an integrated quantity (volume) of activity, $s = \int_0^T \rho(t)\,dt$, where ρ > 0 for 0 < t < T and ρ = 0 at the endpoints 0 and T. The original boost to SOC took place after Per Bak, Chao Tang, and Kurt Wiesenfeld published in 1987 a highly influential paper in the premium physics journal Physical Review Letters, and introduced what is called the Bak–Tang–Wiesenfeld (BTW) sandpile model – of which the Manna one is a relative [7]. The BTW and Manna models and many others exhibit the important property that the avalanche sizes s have a scale-free probability distribution, but note that this simple criticality is not always true, not even for the BTW model, which shows so-called multiscaling [8,9]. Scale-free probability distributions are usually written as

$$P(s) \sim s^{-\tau_s}\, f_s(s/L^{D_s})\,; \tag{1}$$
here all the subscripts refer to the fact that we look at avalanches. τ_s and D_s define the avalanche exponent and the cut-off exponent, respectively. f_s is a cut-off function that, together with D_s, captures the fact that the avalanches are restricted somehow by the system size (in the one-dimensional Manna model, if s becomes too large many walkers are lost, and first n drops and then the activity goes to zero). Similar statements can be made about the avalanche durations (T), area or support A, and so forth [10,11,12,13,14]. The discovery of simple-to-define models, yet of very complex behavior, has been a welcome gift, since there are many, many phenomena that exhibit apparently scale-free statistics and/or bursty or intermittent dynamics similar to SOC models. These come from the natural sciences – but not only, since economics and sociology are also giving rise to cases that need explanations. One particular field where these ideas have found much interest is the physics of materials, ranging from understanding earthquakes to looking at the behavior of vortices in superconductors. The Gutenberg–Richter law of earthquake magnitudes is a power law, and one can measure similar statistics from fracture experiments by measuring acoustic emission event energies from tensile tests on ordinary paper [15]. The modern theory of networks is concerned with graph or network structures which exhibit scale-free characteristics; in particular, the number of neighbors a vertex has is often a stochastic quantity exhibiting a power-law-like probability distribution [16,17]. Here, branching processes are also finding applications. They can describe the development of network structures, e.g., as measured by the degree distribution, the number of neighbors a node (vertex) is connected to, and, as an extension of more usual models, give rise to interesting results when the dynamics of populations on networks are considered.
Whenever comparing with real, empirical phenomena, models based on branching processes give a paradigm on two levels. In the case of SOC this is given by a combination of "internal dynamics" and "an ensemble". Of these, the first means e.g. that activated walkers or particles of type A are moved around with certain rules, and of course there is an enormous variety of possible models; e.g. in the Manna model this is obvious if one splits the walkers into categories A (active) and B (passive). Then there is the question of how the balance (assuming this balance is true) is maintained. The SOC models do this by a combination of dissipation (by e.g. particles dropping off the edge of a system) and drive (by addition of B's), where the rates are actually chosen pretty carefully. In the case of growing
networks, one can similarly ask what kind of growth laws allow for scale-free distributions of the node degree. For the theory of complex systems, branching processes thus give rise to two questions: what classes of models are there? What kinds of truly different ensembles are there? The theory of reaction-diffusion systems has tried to answer the first question since the 1970s, and the developments reflect the idea of universality in statistical physics (see e.g. the article of Dietrich Stauffer in this volume). There, this means that the behavior of systems at "critical points" – such as defined by the λ_c and n_c from above – follows from the dimension and the "universality class" at hand. The activated walkers' model, it has recently been established, belongs to the "fixed energy sandpile" class, and is closely related to other seemingly far-fetched ones such as the depinning/pinning of domain walls in magnets (see e.g. [18]). The second question can be stated in two ways: forgetting about the detailed model at hand, when can one expect complex behavior such as power-law distributions, avalanches etc.? Second, and more technically, can one derive the exponents such as τ_s and D_s from those of the same model at its "usual" phase transition? The theory of branching processes provides many answers to these questions, and in particular helps to illustrate the influence of boundary conditions and modes of driving on the expected behavior. Thus one gets a clear idea of the kind of complexity one can expect to see in the many different kinds of systems where avalanches, intermittency, and scaling are observed.

Branching Processes

The mathematical branching process is defined for a set of objects that do not interact. At each iteration, each object can give rise to new objects with some probability p (or in general it can be a set of probabilities). By continuing this iterative process, the objects will form what is referred to as a cluster or an avalanche.
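Such clusters are easy to generate numerically. The sketch below is our own illustration (not from the text): it grows one cluster of a branching process with a Poisson offspring law of mean m and returns its total size.

```python
import math
import random

def poisson(m, rng):
    """Sample a Poisson(m) variate by inversion (Knuth's method)."""
    L, k, p = math.exp(-m), 0, 1.0
    while True:
        p *= rng.random()
        if p < L:
            return k
        k += 1

def total_progeny(m, rng, cap=100_000):
    """Total number of objects ever produced by a branching process
    with Poisson(m) offspring, starting from a single object
    (truncated at `cap` so supercritical runs stay finite)."""
    size, generation = 0, 1
    while generation and size < cap:
        size += generation
        # every object of the current generation branches independently
        generation = sum(poisson(m, rng) for _ in range(generation))
    return size
```

For a subcritical process (m < 1) the mean cluster size is finite, 1/(1 − m); exactly at m = 1 the size distribution becomes scale-free, P(s) ∼ s^{−3/2}.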
We can now ask questions of the following type: Will the process continue forever? Will the process die out and stop after a finite number of iterations? What will the average lifetime be? What is the average size of the clusters? And what is (the asymptotic form of) the probability that the process is still active after a certain number of iterations, etc.?

BP Definition

We will consider a simple discrete BP denoted the Galton–Watson process. For other types of BPs (including processes in continuous time) we refer to the book by Harris [1]. We will denote the number of objects at generation n as z_0, z_1, z_2, …, z_n, …. The index n is referred to
Branching Processes
as time, and time t = 0 corresponds to the 0th generation where we take z_0 = 1, i. e., the process starts from one object. We will assume that the transition from generation n to n + 1 is given by a probability law that is independent of n, i. e., it is assumed that it forms a Markov process. And finally it will be assumed that different objects do not interact with one another. The probability measure for the process is characterized by probabilities as follows: p_k is the probability that an object in the nth generation will have k offspring in the (n + 1)th generation. We assume that p_k is independent of n. The probabilities p_k thus read p_k = Prob(z_1 = k) and fulfill:

∑_k p_k = 1 .   (2)
Branching Processes, Figure 3 Schematic drawing of the behavior of the probability of extinction q of the branching process. The quantity 1 − q is similar to the order parameter for systems in equilibrium at their critical points, and the quantity m_c is referred to as the critical point for the BP
Figure 2 shows an example of the tree structure for a branching process and the resulting avalanche. One can define the probability generating function f(s) associated to the transition probabilities:

f(s) = ∑_k p_k s^k .   (3)
The first and second moments of the number z_1 are denoted as:

m = E z_1 ,   σ² = Var z_1 .   (4)

By taking derivatives of the generating function at s = 1 it follows that:

E z_n = m^n ,   Var z_n = n σ²  (for m = 1) .   (5)

For the BP defined in Fig. 2, it follows that m = 2p and σ² = 4p(1 − p). An important quantity is the probability for extinction q. It is obtained as follows:

q = P(z_n → 0) = P(z_n = 0 for some n) = lim_{n→∞} P(z_n = 0) .   (6)
Branching Processes, Figure 2 Schematic drawing of an avalanche in a system with a maximum of n = 3 avalanche generations corresponding to N = 2^{n+1} − 1 = 15 sites. Each black site relaxes with probability p to two new black sites and with probability 1 − p to two white sites (i. e., p_0 = 1 − p, p_1 = 0, p_2 = p). The black sites are part of an avalanche of size s = 7, whereas the active sites at the boundary yield a boundary avalanche of size σ_3(p, t) = 2
It can be shown that for m ≤ 1: q = 1, and that for m > 1 there exists a solution that fulfills q = f(q), where 0 < q < 1 [1]. It is possible to show that lim_{n→∞} P(z_n = k) = 0, for k = 1, 2, 3, …, and that z_n → 0 with probability q and z_n → ∞ with probability 1 − q. Thus, the sequence {z_n} does not remain positive and bounded [1]. The quantity 1 − q is similar to an order parameter for systems in equilibrium, and its behavior is schematically shown in Fig. 3. The behavior around the value m_c = 1, the so-called critical value for the BP (see below), can, in analogy to second-order phase transitions in equilibrium systems, be described by a critical exponent β defined as follows (β = 1, cf. [1]):

Prob(survival) = 0 ,  m ≤ m_c ;
Prob(survival) ∝ (m − m_c)^β ,  m > m_c .   (7)
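For the binary process of Fig. 2 the generating function is f(s) = (1 − p) + p s², and the extinction probability can be found by iterating the fixed-point equation q = f(q) (a small numerical sketch of our own; the analytic value q = (1 − p)/p for m > 1 follows from solving the quadratic):

```python
def extinction_probability(p, iterations=500):
    """Iterate q <- f(q) = (1 - p) + p * q**2 starting from q = 0;
    the iteration converges to the smallest non-negative root of
    q = f(q), i.e. the extinction probability of the binary BP."""
    q = 0.0
    for _ in range(iterations):
        q = (1.0 - p) + p * q * q
    return q

# m = 2p: extinction is certain for m <= 1; for m > 1, q = (1 - p)/p < 1.
print(extinction_probability(0.3))   # m = 0.6: q -> 1
print(extinction_probability(0.75))  # m = 1.5: q -> (1 - p)/p = 1/3
```

Starting the iteration from q = 0 is the standard trick for selecting the smallest root, which is the physically relevant one.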
Avalanches and Critical Point

We will next consider the clusters, or avalanches, in more detail and obtain the asymptotic form of the probability distributions. The size of the cluster is given by the sum
s = z_0 + z_1 + z_2 + ⋯. One can also consider other types of clusters, e. g., the activity of the boundary (of a finite tree) and define the boundary avalanche (cf. Fig. 2). For concreteness, we consider the BP defined in Fig. 2. The quantities P_n(s, p) and Q_n(σ, p) denote the probabilities of having an avalanche of size s and boundary size σ in a system with n generations. The corresponding generating functions are defined by [1]

f_n(x, p) ≡ ∑_s P_n(s, p) x^s ,   (8)
g_n(x, p) ≡ ∑_σ Q_n(σ, p) x^σ .   (9)

Due to the hierarchical structure of the branching process, it is possible to write down recursion relations for P_n(s, p) and Q_n(σ, p):

f_{n+1}(x, p) = x [(1 − p) + p f_n²(x, p)] ,   (10)
g_{n+1}(x, p) = (1 − p) + p g_n²(x, p) ,   (11)
where f_0(x, p) = g_0(x, p) = x. The avalanche distribution D(s) is determined by P_n(s, p) by using the recursion relation (10). The stationary solution of Eq. (10) in the limit n ≫ 1 is given by

f(x, p) = [1 − √(1 − 4x²p(1 − p))] / (2xp) .   (12)

By expanding Eq. (12) as a series in x, comparing with the definition (8), and using Stirling's formula for the high-order terms, we obtain for sizes such that 1 ≪ s ≲ n the following behavior:

P_n(s, p) = √(2(1 − p)/(π p)) s^{−3/2} exp(−s/s_c(p)) .   (13)

The cutoff s_c(p) is given by s_c(p) = −2/ln[4p(1 − p)]. As p → 1/2, s_c(p) → ∞, thus showing explicitly that the critical value for the branching process is p_c = 1/2 (i. e., m_c = 1), and that the mean-field avalanche exponent, cf. Eq. (1), for the critical branching process is τ = 3/2. The expression (13) is only valid for avalanches which are not affected by the finite size of the system. For avalanches with n ≲ s ≲ N, it is possible to solve the recursion relation (10), and then obtain P_n(s, p) for p ≃ p_c by the use of a Tauberian theorem [19,20,21]. By carrying out such an analysis one obtains after some algebra P_n(s, p) ≃ A(p) exp(−s/s_0(p)), with functions A(p) and s_0(p) which cannot be determined analytically. Nevertheless, we see that for any p the probabilities P_n(s, p) will decay exponentially. One can also calculate the asymptotic form of Q_n(σ, p) for 1 ≪ σ ≲ n and p ≃ p_c by the use of a Tauberian theorem [19,20,21]. We will return to study this in the next section, where we will also define and investigate the distribution of the time to extinction.

Self-Organized Branching Processes

We now return to discussing the link between self-organized criticality, as mentioned in the introduction, and branching processes. The simplest theoretical approach to SOC is mean-field theory [22], which allows for a qualitative description of the behavior of the SOC state. Mean-field exponents for SOC models have been obtained by various approaches [22,23,24,25] and it turns out that their values (e. g., τ = 3/2) are the same for all the models considered thus far. This fact can easily be understood since the spreading of an avalanche in mean-field theory can be described by a front consisting of non-interacting particles that can either trigger subsequent activity or die out. This kind of process is reminiscent of a branching process. The connection between branching processes and SOC has been investigated, and it has been argued that the mean-field behavior of sandpile models can be described by a critical branching process [26,27,28]. For a branching process to be critical one must fine-tune a control parameter to a critical value. This, by definition, cannot be the case for a SOC system, where the critical state is approached dynamically without the need to fine-tune any parameter. In the so-called self-organized branching process (SOBP), the coupling of the local dynamical rules to a global condition drives the system into a state that is indeed described by a critical branching process [29]. It turns out that the mean-field theory of SOC models can be exactly mapped to the SOBP model. In the mean-field description of the sandpile model (d → ∞) one neglects correlations, which implies that avalanches do not form loops and hence spread as a branching process.
In the SOBP model, an avalanche starts with a single active site, which then relaxes with probability p, leading to two new active sites. With probability 1 − p the initial site does not relax and the avalanche stops. If the avalanche does not stop, one repeats the procedure for the new active sites until no active site remains. The parameter p is the probability that a site relaxes when it is triggered by an external input. For the SOBP branching process, there is a critical value, p_c = 1/2, such that for p > p_c the probability to have an infinite avalanche is non-zero, while for p < p_c all avalanches are finite. Thus, p = p_c corresponds to the critical case, where avalanches are power-law distributed. In this description, however, the boundary conditions are not taken into account – even though they are crucial for the self-organization process. The boundary conditions can be introduced in the problem in a natural way by allowing for no more than n generations for each avalanche. Schematically, we can view the evolution of a single avalanche of size s as taking place on a tree of size N = 2^{n+1} − 1 (see Fig. 2). If the avalanche reaches the boundary of the tree, one counts the number of active sites σ_n (which in the sandpile language corresponds to the energy leaving the system), and we expect that p decreases for the next avalanche. If, on the other hand, the avalanche stops before reaching the boundary, then p will slightly increase. The number of generations n can be thought of as some measure of the linear size of the system. The above avalanche scenario is described by the following dynamical equation for p(t):

p(t + 1) = p(t) + [1 − σ_n(p, t)]/N ,   (14)
where σ_n, the size of an avalanche reaching the boundary, fluctuates in time and hence acts as a stochastic driving force. If σ_n = 0, then p increases (because some energy has been put into the system without any output), whereas if σ_n > 0 then p decreases (due to energy leaving the system). Equation (14) describes the global dynamics of the SOBP, as opposed to the local dynamics, which is given by the branching process. One can study the model for a fixed value of n, and then take the limit n → ∞. In this way, we perform the long-time limit before the "thermodynamic" limit, which corresponds exactly to what happens in sandpile models. We will now show that the SOBP model provides a mean-field theory of self-organized critical systems. Consider for simplicity the sandpile model of activated random walkers from the Introduction [5]: When a particle is added to a site i, the site will relax if z_i = 1. In the limit d → ∞, the avalanche will never visit the same site more than once. Accordingly, each site in the avalanche will relax with the same probability p = P(z = 1). Eventually, the avalanche will stop, and σ_n particles will leave the system. Thus, the total number of particles M(t) evolves according to
Branching Processes, Figure 4 The value of p as a function of time for a system with n = 10 generations. The two curves refer to two different initial conditions, above p_c and below p_c. After a transient, the control parameter p(t) reaches its critical value p_c and fluctuates around it with short-range correlations
where η(p, t) ≡ ⟨σ_n⟩ − σ_n(p, t), with ⟨σ_n⟩ = (2p)^n, describes the fluctuations in the steady state. Thus, η is obtained by measuring σ_n for each realization of the process. Without the last term, Eq. (16) has a fixed point (dp/dt = 0) for p = p_c = 1/2. On linearizing Eq. (16), one sees that the fixed point is attractive, which demonstrates the self-organization of the SOBP model, since the noise η/N will have a vanishingly small effect in the thermodynamic limit [29]. Figure 4 shows the value of p as a function of time. Independent of the initial conditions, one finds that after a transient p(t) reaches the self-organized state described by the critical value p_c = 1/2 and fluctuates around it with short-range correlations (of the order of one time unit). By computing the variance of p(t), one finds that the fluctuations can be very well described by a Gaussian distribution ρ(p) [29]. In the limit N → ∞, the distribution ρ(p) approaches a delta function, δ(p − p_c).

Avalanche Distributions
M(t + 1) = M(t) + 1 − σ_n(p, t) .   (15)

The dynamical Eq. (14) for the SOBP model is recovered by noting that M(t) = N P(z = 1) = N p. By taking the continuum-time limit of Eq. (14), it is possible to obtain the following expression:

dp/dt = [1 − (2p)^n]/N + η(p, t)/N .   (16)
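The feedback of Eq. (14) is easy to simulate directly (an illustrative sketch of our own; the function names and the parameter values n = 8, 20 000 steps are arbitrary choices, not from the article):

```python
import random

rng = random.Random(1)

def boundary_activity(p, n):
    """sigma_n: number of active sites that reach generation n of the
    binary branching process with branching probability p."""
    active = 1
    for _ in range(n):
        active = sum(2 for _ in range(active) if rng.random() < p)
        if not active:
            break
    return active

def sobp(p0, n=8, steps=20_000):
    """Iterate p(t+1) = p(t) + [1 - sigma_n(p, t)]/N, Eq. (14), and
    return the average of p over the second half of the run."""
    N = 2 ** (n + 1) - 1
    p, tail = p0, []
    for t in range(steps):
        p += (1.0 - boundary_activity(p, n)) / N
        if t >= steps // 2:
            tail.append(p)
    return sum(tail) / len(tail)

# Starting far above or far below p_c = 1/2, p(t) self-organizes to ~1/2.
a, b = sobp(0.9), sobp(0.1)
print(a, b)
```

Both runs settle near p_c = 1/2 after a transient, mimicking the behavior of Fig. 4.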
Figure 5 shows the avalanche size distribution D(s) for different values of the number of generations n. One notices that there is a scaling region (D(s) ∝ s^{−τ} with τ = 3/2), whose size increases with n, followed by an exponential cutoff. This power-law scaling is a signature of the mean-field criticality of the SOBP model. The distribution of active sites at the boundary, D(σ), for different values of the number of generations falls off exponentially [29].
The avalanche lifetime distribution L(t) ∝ t^{−y} yields the probability to have an avalanche that lasts for a time t. For a system with m generations one obtains L(m) ∝ m^{−2} [1]. Identifying the number of generations m of an avalanche with the time t, we thus obtain the mean-field value y = 2, in agreement with simulations of the SOBP model [29]. In summary, the self-organized branching process captures the physical features of the self-organization mechanism in sandpile models. By explicitly incorporating the boundary conditions it follows that the dynamics drives the system into a stationary state, which in the thermodynamic limit corresponds to the critical branching process.

Branching Processes, Figure 5 Log–log plot of the avalanche distribution D(s) for different system sizes. The number of generations n increases from left to right. A line with slope τ = 3/2 is plotted for reference, and it describes the behavior of the data for intermediate s values, cf. Eq. (18). For large s, the distributions fall off exponentially

In the limit where n ≫ 1 one can obtain various analytical results and, e. g., calculate the avalanche distribution D(s) for the SOBP model. In addition, one can obtain results for finite, but large, values of n. The distribution D(s) can be calculated as the average value of P_n(s, p) with respect to the probability density ρ(p), i. e., according to the formula

D(s) = ∫₀¹ dp ρ(p) P_n(s, p) .   (17)

The simulation results in Fig. 4 yield that ρ(p) for N ≫ 1 approaches the delta function δ(p − p_c). Thus, from Eqs. (13) and (17) we obtain the power-law behavior

D(s) = √(2/π) s^{−τ} ,   (18)

where τ = 3/2, and for s ≳ n we obtain an exponential cutoff ∝ exp(−s/s_0(p_c)). These results are in complete agreement with the numerical results shown in Fig. 5. The deviations from the power-law behavior (18) are due to the fact that Eq. (13) is only valid for 1 ≪ s ≲ n. One can also calculate the asymptotic form of Q_n(σ, p) for 1 ≪ σ ≲ n and p ≃ p_c by the use of a Tauberian theorem [19,20,21]; the result shows that the boundary avalanche distribution is

D(σ) = ∫₀¹ dp ρ(p) Q_n(σ, p) = (8/n²) exp(−2σ/n) ,   (19)

which agrees with simulation results for n ≫ 1, cf. [29].

Scaling and Dissipation

Sometimes it can be difficult to determine whether the cutoff in the scaling is due to finite-size effects or due to the fact that the system is not at, but rather only close to, the critical point. In this respect, it is important to test the robustness of critical behavior by understanding which perturbations destroy the critical properties. It has been shown numerically [30,31,32] that the breaking of the conservation of particle numbers leads to a characteristic size in the avalanche distributions. We will now allow for dissipation in branching processes and show how the system self-organizes into a sub-critical state. In other words, the degree of nonconservation is a relevant parameter in the renormalization-group sense [33]. Consider again the two-state model introduced by Manna [5]. Some degree of nonconservation can be introduced in the model by allowing for energy dissipation in a relaxation event. In a continuous-energy model this can be done by transferring to the neighboring sites only a fraction (1 − ε) of the energy lost by the relaxing site [30]. In a discrete-energy model, such as the Manna two-state model, one can introduce dissipation as the probability ε that the two particles transferred by the relaxing site are annihilated [31]. For ε = 0 one recovers the original two-state model. Numerical simulations [30,31] show that different ways of considering dissipation lead to the same effect: a characteristic length is introduced into the system and the criticality is lost. As a result, the avalanche size distribution decays not as a pure power law but rather as

D(s) ∝ s^{−τ} h_s(s/s_c) .   (20)

Here h_s(x) is a cutoff function and the cutoff size scales as

s_c ∼ ε^{−φ} .   (21)
The size s is defined as the number of sites that relax in an avalanche. We define the avalanche lifetime T as the number of steps comprising an avalanche. The corresponding distribution decays as

D(T) ∝ T^{−y} h_T(T/T_c) ,   (22)

where h_T(x) is another cutoff function and T_c is a cutoff that scales as

T_c ∼ ε^{−Δ} .   (23)

The cutoff or "scaling" functions h_s(x) and h_T(x) fall off exponentially for x ≫ 1. To construct the mean-field theory one proceeds as follows [34]: When a particle is added to an arbitrary site, the site will relax if a particle was already present, which occurs with probability p = P(z = 1), the probability that the site is occupied. If a relaxation occurs, the two particles are transferred with probability 1 − ε to two of the infinitely many nearest neighbors, or they are dissipated with probability ε (see Fig. 6). The avalanche process in the mean-field limit is a branching process. Moreover, the branching process can be described by the effective branching probability

p̃ ≡ p(1 − ε) ,   (24)

where p̃ is the probability to create two new active sites. We know that there is a critical value for p̃ = 1/2, or

p = p_c ≡ 1/[2(1 − ε)] .   (25)

Thus, for p > p_c the probability to have an infinite avalanche is non-zero, while for p < p_c all avalanches are finite. The value p = p_c corresponds to the critical case where avalanches are power law distributed.
The Properties of the Steady State

To address the self-organization, consider the evolution of the total number of particles M(t) in the system after each avalanche:

M(t + 1) = M(t) + 1 − σ(p, t) − κ(p, t) .   (26)

Here σ is the number of particles that leave the system from the boundaries and κ is the number of particles lost by dissipation. Since (cf. Sect. "Self-Organized Branching Processes") M(t) = N P(z = 1) = N p, we obtain an evolution equation for the parameter p:

p(t + 1) = p(t) + [1 − σ(p, t) − κ(p, t)]/N .   (27)

This equation reduces to the SOBP model for the case of no dissipation (ε = 0). In the continuum limit one obtains [34]

dp/dt = [1 − (2p(1 − ε))^n − ε p H(p(1 − ε))]/N + η(p, t)/N .   (28)

Here, we defined the function H(p(1 − ε)), which can be obtained analytically, and introduced the function η(p, t) to describe the fluctuations around the average values of σ and κ. It can be shown numerically that the effect of this "noise" term is vanishingly small in the limit N → ∞ [34].

Branching Processes, Figure 6 Schematic drawing of an avalanche in a system with a maximum of n = 3 avalanche generations corresponding to N = 2^{n+1} − 1 = 15 sites. Each black site can relax in three different ways: (i) with probability p(1 − ε) to two new black sites, (ii) with probability 1 − p the avalanche stops, and (iii) with probability pε two particles are dissipated at a black site, which then becomes a marked site, and the avalanche stops. The black sites are part of an avalanche of size s = 6, whereas the active sites at the boundary yield σ_3(p, t) = 2. There was one dissipation event such that κ = 2

Without the noise term one can study the fixed points of Eq. (28), and one finds that there is only one fixed point,

p* = 1/2 ,   (29)
independent of the value of ε; the corrections to this value are of the order O(1/N). By linearizing Eq. (28), it follows that the fixed point is attractive. This result implies that the SOBP model self-organizes into a state with p = p*. Figure 7 shows the value of p as a function of time for different values of the dissipation ε. We find that, independent of the initial conditions, after a transient p(t) reaches the self-organized steady state described by the
Branching Processes, Figure 7 The value of the control parameter p(t) as a function of time for a system with different levels of dissipation. After a transient, p(t) reaches its fixed-point value p* = 1/2 and fluctuates around it with short-range time correlations
Branching Processes, Figure 8 Phase diagram for the SOBP model with dissipation. The dashed line shows the fixed points p* = 1/2 of the dynamics, with the flow being indicated by the arrows. The solid line shows the critical points, cf. Eq. (25)
fixed-point value p* = 1/2 and fluctuates around it with short-range correlations (of the order of one time unit). The fluctuations around the fixed-point value decrease with the system size as 1/N. It follows that in the limit N → ∞ the distribution ρ(p) of p approaches a delta function, ρ(p) → δ(p − p*). By comparing the fixed-point value (29) with the critical value (25), we see that in the presence of dissipation (ε > 0) the self-organized steady state of the system is subcritical. Figure 8 is a schematic picture of the phase diagram of the model, including the line p = p_c of critical behavior (25) and the line p = p* of fixed points (29). These two lines intersect only for ε = 0.

Avalanche and Lifetime Distributions

In analogy to the results in Sect. "Self-Organized Branching Processes", we obtain similar formulas for the avalanche size distributions, but with p̃ replacing p. As a result we obtain the distribution

D(s) = √(2/π) (1 + Cε + ⋯) s^{−τ} exp(−s/s_c(ε)) .   (30)

We can expand s_c(p̃(ε)) = −2/ln[4p̃(1 − p̃)] in ε, with the result

s_c(ε) ≃ 2 ε^{−φ} ,   φ = 2 .   (31)

Furthermore, the mean-field exponent for the critical branching process is obtained by setting ε = 0, i. e.,

τ = 3/2 .   (32)
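The expansion leading to Eq. (31) can be verified numerically from the explicit cutoff formula (a small check of our own; the function name is arbitrary):

```python
import math

def cutoff(eps):
    """s_c = -2/ln[4 p~ (1 - p~)] evaluated at the fixed point p* = 1/2,
    where the effective branching probability is p~ = (1 - eps)/2."""
    p_eff = (1.0 - eps) / 2.0
    return -2.0 / math.log(4.0 * p_eff * (1.0 - p_eff))

# Since 4 p~ (1 - p~) = 1 - eps**2, one has s_c ~ 2/eps**2 for small eps,
# i.e. phi = 2: the product s_c * eps**2 approaches 2.
for eps in (0.1, 0.01, 0.001):
    print(eps, cutoff(eps) * eps ** 2)
```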
Branching Processes, Figure 9 Log–log plot of the avalanche distribution D(s) for different levels of dissipation. A line with slope τ = 3/2 is plotted for reference, and it describes the behavior of the data for intermediate s values, cf. Eq. (30). For large s, the distributions fall off exponentially. The data collapse is produced according to Eq. (30)
These results are in complete agreement with the SOBP model and the simulation of D(s) for the SOBP model with dissipation (cf. Fig. 9). The deviations from the power-law behavior (30) are due to the fact that Eq. (13) is only valid for 1 ≪ s ≲ n. Next, consider the avalanche lifetime distribution D(T), characterizing the probability to obtain an avalanche which spans m generations. It can be shown that the result
Branching Processes, Figure 10 Log–log plot of the lifetime distribution D(T) for different levels of dissipation. A line with slope y = 2 is plotted for reference. Note the initial deviations from the power law for ε = 0 due to the strong corrections to scaling. The data collapse is produced according to Eq. (33)
can be expressed in the scaling form [1,34]

D(T) ∝ T^{−y} exp(−T/T_c) ,   (33)

where

T_c ∼ ε^{−Δ} ,   Δ = 1 .   (34)

The lifetime exponent y was defined in Eq. (22), wherefrom we confirm the mean-field result

y = 2 .   (35)
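The self-organization to p* = 1/2 in the presence of dissipation, Eq. (27), can also be illustrated by a direct simulation (again a sketch of our own construction; function names and parameter values are arbitrary assumptions):

```python
import random

rng = random.Random(2)

def avalanche(p, eps, n):
    """One avalanche of the dissipative branching process of Fig. 6.
    Returns (sigma, kappa): active sites reaching the boundary and
    particles lost through dissipation."""
    active, kappa = 1, 0
    for _ in range(n):
        nxt = 0
        for _ in range(active):
            r = rng.random()
            if r < p * (1.0 - eps):   # relax: two new active sites
                nxt += 2
            elif r < p:               # relax with dissipation: 2 particles lost
                kappa += 2
        active = nxt
        if not active:
            break
    return active, kappa

def steady_p(eps, n=8, steps=20_000, p0=0.7):
    """Iterate Eq. (27): p(t+1) = p(t) + [1 - sigma - kappa]/N, and
    return the average of p over the second half of the run."""
    N = 2 ** (n + 1) - 1
    p, tail = p0, []
    for t in range(steps):
        sigma, kappa = avalanche(p, eps, n)
        p += (1.0 - sigma - kappa) / N
        if t >= steps // 2:
            tail.append(p)
    return sum(tail) / len(tail)

# The steady state stays at p* ~ 1/2, below p_c = 1/(2(1 - eps)) ~ 0.556:
s = steady_p(0.1)
print(s, 1.0 / (2.0 * (1.0 - 0.1)))
```

The gap between the measured steady-state p and p_c is the subcriticality induced by dissipation.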
In Fig. 10, we show the data collapse produced by Eq. (33) for the lifetime distributions for different values of ε. In summary, the effect of dissipation on the dynamics of the sandpile model in the mean-field limit (d → ∞) is described by a branching process. The evolution equation for the branching probability has a single attractive fixed point, which in the presence of dissipation is not a critical point. The level of dissipation therefore acts as a relevant parameter for the SOBP model. These results show, in the mean-field limit, that criticality in the sandpile model is lost when dissipation is present.

Network Science and Branching Processes

An interesting mixture of two kinds of "complex systems" is achieved by considering what kind of applications one can find of branching processes applied to, or on, networks.
Here, network is a loose term in the way it is often used – as in the famous "six degrees" concept of how the social interactions of human beings can be measured to have a "small world" character [16,17,35]. These structures are usually thought of in terms of graphs, with vertices forming a set G, and edges a set E such that e_{gg′} ∈ E means that the edge connects two sites g, g′ ∈ G. An edge or link can be either symmetric or asymmetric. Another article in this volume discusses in depth the modern view on networks, providing more information. The structural properties of heterogeneous networks are most usually measured by the degree k_i of site i ∈ G, i. e. the number of nearest neighbors the vertex i is connected to. One can go further by defining weights w_ij for edges e_ij [36,37,38]. The interest in heterogeneous networks is largely due to two typical properties one can find. In a scale-free network the structure is such that the degree distribution has a power-law form P(k) ∼ k^{−α} up to a size-dependent cut-off k_max. One can now show, using mean-field-like reference models (such as the "configuration model", which has no structural correlations and a prefixed P(k)), that such networks tend to have a "small world" character: the average vertex–vertex distance is only logarithmic in the number of vertices N, d(G) ∼ ln N. This property is correlated with the fact that the various moments of k depend on α. For ⟨k⟩ to remain finite, α > 2 is required. In many empirically measured networks it appears that α < 3, which implies that ⟨k²⟩ diverges (we are assuming that P(k) is cut off at a maximum degree k_max for which ∫_{k_max}^∞ P(k) dk ∼ 1/N, from which the divergences follow) [16,17,35]. The first central question of the two we pose here is: what happens to branching processes on top of such networks? Again, this can be translated into an inquiry about the "phase diagram" – given a model – and its particulars.
We would like to understand what the critical threshold λ_c is, and what happens in sub-critical and supercritical systems (with appropriate values of λ). The answers may of course depend on the model at hand, i. e. the universality classes are still an important issue. One easy analytical result for un-correlated (in particular tree-like) networks is that activity will persist in the branching process if the number of offspring is at least one for an active "individual" or vertex. Consider now one such vertex, i, and look at its neighbors of neighbors. There are of the order O(k_i (⟨k_{neighbor of i}⟩ − 1)) of these, so λ_c ∼ 1/⟨k²⟩. This demonstrates the highly important result that the critical threshold may vanish in heterogeneous networks – for α ≤ 3 in the example case. Pastor-Satorras and Vespignani were the first to point out this fact for the Susceptible-Infected-Susceptible (SIS)
model [39]. The analysis is relatively easy to do using mean-field theory for ρ_k, the average activity of the SIS branching process on nodes of degree k. The central equation reads

∂_t ρ_k(t) = −ρ_k(t) + λ k [1 − ρ_k(t)] Θ_k(λ) .   (36)

Here Θ measures the probability that at least one of the neighbors of a vertex of degree k is infected. A MF treatment of Θ, excluding degree correlations, makes it possible to show that the SIS model may have a zero threshold, and also establishes other interesting properties such as ⟨ρ_k⟩. As is natural, the branching process concentrates on vertices with higher degrees, that is, ρ_k increases with k. The consequences of a zero epidemic threshold are plentiful and important [40]. The original application of the theory was to the spreading of computer viruses: the Internet is (on the Autonomous System level) a scale-free network [41]. So is the network of websites. The outcome even for subcritical epidemics is a long lifetime, and further complications ensue if the internal dynamics of the spreading does not follow Poissonian statistics in time (e. g. for the waiting time before vertex i sends a virus to a neighbor), but a power-law distribution [42]. Another highly topical issue is the behavior and control of human-based disease outbreaks. How to vaccinate against and isolate a dangerous virus epidemic depends on the structure of the network on which the spreading takes place, and on the detailed rules and dynamics. An intuitively easy idea is to concentrate on the "hubs", the most connected vertices of the network [43]. This can be an airport through which travellers acting as carriers move, or it can – as in the case of sexually transmitted viruses such as HIV – be the most active individuals. There are indeed claims that the network of sexual contacts is scale-free, with a small exponent α. Of interest here is the recent result that for λ > λ_c and α < 3 the spreading of branching processes deviates from usual expectations.
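The vanishing of the threshold can be made concrete with the standard uncorrelated mean-field estimate λ_c = ⟨k⟩/⟨k²⟩ (consistent with the λ_c ∼ 1/⟨k²⟩ scaling above; the code is an illustration of our own, with arbitrary degree ranges):

```python
def lambda_c(alpha, k_max, k_min=2):
    """Mean-field epidemic-threshold estimate <k>/<k^2> for a degree
    distribution P(k) proportional to k**(-alpha) on k_min..k_max."""
    ks = range(k_min, k_max + 1)
    w = [k ** (-alpha) for k in ks]
    z = sum(w)
    k1 = sum(k * wk for k, wk in zip(ks, w)) / z
    k2 = sum(k * k * wk for k, wk in zip(ks, w)) / z
    return k1 / k2

# For alpha = 2.5 < 3, <k^2> grows with the cutoff, so the threshold
# drifts towards zero as the network size (and hence k_max) increases:
for k_max in (10 ** 2, 10 ** 3, 10 ** 4):
    print(k_max, lambda_c(2.5, k_max))
```

For α > 3 the same computation converges to a finite threshold as k_max grows, which is the qualitative difference the text emphasizes.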
The outgrowth covers a finite fraction of the network in a short time (vanishing in the large-N limit), and the growth in time of the number of infected nodes becomes polynomial after an initial exponential phase [44]. Branching processes can also be applied to network structure. The "standard model" of growing networks is the Barabási–Albert model. In it, starting from a small seed graph, one grows an example network by so-called preferential attachment. That is, there are two mechanisms that operate: i) new vertices are added, and ii) old ones grow more links (their degree increases) by getting linked to the new ones. There is a whole zoo of various network growth models, but we next overview some ideas that directly connect to
Branching Processes, Figure 11 Two processes creating a growing or steady-state network: addition and merging of vertices
branching processes. These apply to cases in which vertices disappear by combining with other vertices, are generated from thin air, or split into "sub-vertices" such that old links from the original one are retained. One important mechanism, operative in cellular protein-interaction networks, is the duplication of vertices (a protein) with slight changes to the copy's connectivity compared to the parent vertex. Figure 11 illustrates an example where the structure in the steady state can be fruitfully described by similar mathematics as in other similar cases in networks. The basic rules are two-fold: i) a new vertex is added to the network, ii) two randomly chosen vertices merge. These kinds of mechanisms are reminiscent of aggregation processes, e. g. of sticky particles that form colloidal aggregates in fluids. The simple model leads to a degree distribution of the asymptotic form P(k) ∼ k^{−3/2}, reminiscent of the mean-field branching process result for the avalanche size distribution (where τ = 3/2, cf. Sect. "Branching Processes") [45,46]. The mathematical tool here is given by rate equations that describe the number of vertices with a certain degree k. These are similar to those that one can write for the size of an avalanche in other contexts. It is an interesting question what kind of generalizations one can find by considering cases where the edges have weights, and the elementary processes are made dependent in some way on those [47]. It is worth noting that the aggregation/branching-process description of network structure easily develops deep complexity. This can be achieved by choosing the nodes to be joined with an eye to the graph geometry, and/or splitting vertices with some particular rules. In the latter case, a natural starting point is to maintain all the links of the original vertex and to distribute them among the descendant vertices. Then, one has to choose whether to link those to each other or not – either choice influences the local correlations.
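A degree-based caricature of the rules of Fig. 11 is easy to run as a Monte Carlo (our own toy version: it tracks degrees only and ignores the actual edge rewiring, so it illustrates the bookkeeping of the rate equations rather than the full model of [45,46]):

```python
import random

rng = random.Random(3)

def add_and_merge(n_vertices=5000, steps=200_000):
    """Each step: i) add one vertex of degree 1, ii) merge two randomly
    chosen vertices (their degrees simply add). The vertex count stays
    constant while the total degree grows by one per step."""
    deg = [1] * n_vertices
    for _ in range(steps):
        deg.append(1)                          # i) addition
        i, j = rng.sample(range(len(deg)), 2)  # ii) merging
        deg[i] += deg[j]
        deg[j] = deg[-1]                       # O(1) removal of vertex j
        deg.pop()                              # by swap-with-last
    return deg

deg = add_and_merge()
print(len(deg), sum(deg), max(deg))
# The rate-equation analysis of such aggregation-with-injection
# processes gives the broad distribution P(k) ~ k**(-3/2) [45,46].
```

Even this stripped-down version develops a very broad degree distribution, with single vertices accumulating degrees far above the mean.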
The same is naturally true if one considers reactions of the type k + x′ → k″, which originate from joining two neighboring vertices with degrees k and x′. The likelihood of this reaction depends on the conditional probability of a vertex with degree k having a neighbor with degree x′. Such processes are dependent on structural properties of the networks, which themselves change in time; thus the theoretical understanding is difficult due to the correlations. As a final note, in some cases the merging of vertices produces a giant component in the network, a vertex which carries a finite fraction of the total mass (or of the links in the whole network). This is analogous to avalanching systems which are "weakly first-order", in other words exhibit statistics which have a power-law, scale-free part and then a separate, singular peak. Often, this would be a final giant avalanche.
Conclusions

In this short overview we have given some ideas of how to understand complexity via the tool of branching processes. The main theme has been that they are an excellent means of understanding "criticality" and "complexity" in many systems. We have concentrated on two particular applications, SOC and networks, to illustrate this. Many other important fields where BP-based ideas find use have been left out, from biology and the dynamics of species and molecules to geophysics and the spatial and temporal properties of, say, earthquakes. An example is the so-called ETAS model used for their modeling (see [48,49,50]).

Future Directions

The applications of branching processes to complex systems continue on various fronts. One can predict interesting developments in a number of cases. First, as indicated, the dynamics of BPs on networks are not well understood when it comes to the possible scenarios. A simple question is whether the usual language of absorbing-state phase transitions applies, and if not why, and what the effect of the heterogeneous geometry of complex networks is. In many cases the structure of a network is a dynamic entity, which can be described by generalized branching processes, and so far there has been relatively little work in this direction. The inclusion of spatial effects and temporal memory dynamics is another interesting and important future avenue. Essentially, one searches for more complicated variants of the usual BPs in order to model avalanching systems, or cases where one wants to compute the typical time to reach an absorbing state and the related distribution. Or the question concerns the supercritical state (λ > λ_c) and the spreading from a seed. As noted in the networks section this can be complicated by the presence of non-Poissonian temporal statistics. Another exciting future task is the description of non-Markovian phenomena, as when for instance the avalanche shape is non-symmetrical [51]. This indicates that there is an underlying mechanism which needs to be incorporated into the BP model.

Acknowledgments

We are grateful to our colleague Stefano Zapperi with whom we have collaborated on topics related to networks, avalanches, and branching processes. This work was supported by the Academy of Finland through the Center of Excellence program (M.J.A.) and EUMETSAT's GRAS Satellite Application Facility (K.B.L.).

Bibliography
Primary Literature
1. Harris TE (1989) The Theory of Branching Processes. Dover, New York
2. Kimmel M, Axelrod DE (2002) Branching Processes in Biology. Springer, New York
3. Athreya KB, Ney PE (2004) Branching Processes. Dover Publications, Inc., Mineola
4. Haccou P, Jagers P, Vatutin VA (2005) Branching Processes: Variation, Growth, and Extinction of Populations. Cambridge University Press, Cambridge
5. Manna SS (1991) J Phys A 24:L363. In this two-state model, the energy takes the two stable values, z_i = 0 (empty) and z_i = 1 (particle). When z_i ≥ z_c, with z_c = 2, the site relaxes by distributing two particles to two randomly chosen neighbors
6. Dickman R, Alava MJ, Munoz MA, Peltola J, Vespignani A, Zapperi S (2001) Phys Rev E 64:056104
7. Bak P, Tang C, Wiesenfeld K (1987) Phys Rev Lett 59:381; (1988) Phys Rev A 38:364
8. Tebaldi C, De Menech M, Stella AL (1999) Phys Rev Lett 83:3952
9. Stella AL, De Menech M (2001) Physica A 295:1001
10. Vespignani A, Dickman R, Munoz MA, Zapperi S (2000) Phys Rev E 62:4564
11. Dickman R, Munoz MA, Vespignani A, Zapperi S (2000) Braz J Phys 30:27
12. Alava M (2003) Self-Organized Criticality as a Phase Transition. In: Korutcheva E, Cuerno R (eds) Advances in Condensed Matter and Statistical Physics. Nova Publishers (2004), p 45; arXiv:cond-mat/0307688
13. Lubeck S (2004) Int J Mod Phys B 18:3977
14. Jensen HJ (1998) Self-Organized Criticality. Cambridge University Press, Cambridge
15. Alava MJ, Nukala PKNN, Zapperi S (2006) Statistical models of fracture. Adv Phys 55:349–476
16. Albert R, Barabási AL (2002) Statistical mechanics of complex networks. Rev Mod Phys 74:47
17. Dorogovtsev SN, Mendes JFF (2003) Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford University Press, Oxford; (2002) Adv Phys 51:1079; (2004) arXiv:cond-mat/0404593
18. Bonachela JA, Chate H, Dornic I, Munoz MA (2007) Phys Rev Lett 98:115702
19. Feller W (1971) An Introduction to Probability Theory and its Applications, vol 2, 2nd edn. Wiley, New York
20. Asmussen S, Hering H (1983) Branching Processes. Birkhäuser, Boston
21. Weiss GH (1994) Aspects and Applications of the Random Walk. North-Holland, Amsterdam
22. Tang C, Bak P (1988) J Stat Phys 51:797
23. Dhar D, Majumdar SN (1990) J Phys A 23:4333
24. Janowsky SA, Laberge CA (1993) J Phys A 26:L973
25. Flyvbjerg H, Sneppen K, Bak P (1993) Phys Rev Lett 71:4087; de Boer J, Derrida B, Flyvbjerg H, Jackson AD, Wettig T (1994) Phys Rev Lett 73:906
26. Alstrøm P (1988) Phys Rev A 38:4905
27. Christensen K, Olami Z (1993) Phys Rev E 48:3361
28. García-Pelayo R (1994) Phys Rev E 49:4903
29. Zapperi S, Lauritsen KB, Stanley HE (1995) Phys Rev Lett 75:4071
30. Manna SS, Kiss LB, Kertész J (1990) J Stat Phys 61:923
31. Tadić B, Nowak U, Usadel KD, Ramaswamy R, Padlewski S (1992) Phys Rev A 45:8536
32. Tadić B, Ramaswamy R (1996) Phys Rev E 54:3157
33. Vespignani A, Zapperi S, Pietronero L (1995) Phys Rev E 51:1711
34. Lauritsen KB, Zapperi S, Stanley HE (1996) Phys Rev E 54:2483
35. Newman MEJ (2003) The structure and function of complex networks. SIAM Rev 45:167
36. Yook SH, Jeong H, Barabasi AL, Tu Y (2001) Phys Rev Lett 86:5835
37. Barrat A, Barthelemy M, Pastor-Satorras R, Vespignani A (2004) Proc Natl Acad Sci USA 101:3747
38. Barrat A, Barthelemy M, Vespignani A (2004) Phys Rev Lett 92:228701
39. Pastor-Satorras R, Vespignani A (2001) Phys Rev Lett 86:3200
40. Dorogovtsev SN, Goltsev AV, Mendes JFF (2007) arXiv:cond-mat/0750.0110
41. Pastor-Satorras R, Vespignani A (2004) Evolution and Structure of the Internet: A Statistical Physics Approach. Cambridge University Press, Cambridge
42. Vazquez A, Balazs R, Andras L, Barabasi AL (2007) Phys Rev Lett 98:158702
43. Colizza V, Barrat A, Barthelemy M, Vespignani A (2006) PNAS 103:2015
44. Vazquez A (2006) Phys Rev Lett 96:038702
45. Kim BJ, Trusina A, Minnhagen P, Sneppen K (2005) Eur Phys J B 43:369
46. Alava MJ, Dorogovtsev SN (2005) Phys Rev E 71:036107
47. Hui Z, Zi-You G, Gang Y, Wen-Xu W (2006) Chin Phys Lett 23:275
48. Ogata Y (1988) J Am Stat Assoc 83:9
49. Saichev A, Helmstetter A, Sornette D (2005) Pure Appl Geophys 162:1113
50. Lippidello E, Godano C, de Arcangelis L (2007) Phys Rev Lett 98:098501
51. Zapperi S, Castellano C, Colaiori F, Durin G (2005) Nat Phys 1:46
Books and Reviews
Albert R, Barabási AL (2002) Statistical mechanics of complex networks. Rev Mod Phys 74:47
Asmussen S, Hering H (1983) Branching Processes. Birkhäuser, Boston
Athreya KB, Ney PE (2004) Branching Processes. Dover Publications, Inc., Mineola
Feller W (1971) An Introduction to Probability Theory and its Applications, vol 2, 2nd edn. Wiley, New York
Haccou P, Jagers P, Vatutin VA (2005) Branching Processes: Variation, Growth, and Extinction of Populations. Cambridge University Press, Cambridge
Harris TE (1989) The Theory of Branching Processes. Dover, New York
Jensen HJ (1998) Self-Organized Criticality. Cambridge University Press, Cambridge
Kimmel M, Axelrod DE (2002) Branching Processes in Biology. Springer, New York
Newman MEJ (2003) The structure and function of complex networks. SIAM Rev 45:167
Weiss GH (1994) Aspects and Applications of the Random Walk. North-Holland, Amsterdam
Cellular Automata as Models of Parallel Computation
THOMAS WORSCH
Lehrstuhl Informatik für Ingenieure und Naturwissenschaftler, Universität Karlsruhe, Karlsruhe, Germany

Article Outline

Glossary
Definition of the Subject
Introduction
Time and Space Complexity
Measuring and Controlling the Activities
Communication in CA
Future Directions
Bibliography

Glossary

Cellular automaton The classical fine-grained parallel model introduced by John von Neumann.
Hyperbolic cellular automaton A cellular automaton resulting from a tessellation of the hyperbolic plane.
Parallel Turing machine A generalization of Turing's classical model where several control units work cooperatively on the same tape (or set of tapes).
Time complexity Number of steps needed for computing a result. Usually a function t : N⁺ → N⁺, t(n) being the maximum ("worst case") over all inputs of size n.
Space complexity Number of cells needed for computing a result. Usually a function s : N⁺ → N⁺, s(n) being the maximum over all inputs of size n.
State change complexity Number of proper state changes of cells during a computation. Usually a function sc : N⁺ → N⁺, sc(n) being the maximum over all inputs of size n.
Processor complexity Maximum number of control units of a parallel Turing machine which are simultaneously active during a computation. Usually a function sc : N⁺ → N⁺, sc(n) being the maximum over all inputs of size n.
N⁺ The set {1, 2, 3, …} of positive natural numbers.
Z The set {…, −3, −2, −1, 0, 1, 2, 3, …} of integers.
Q^G The set of all (total) functions from a set G to a set Q.

Definition of the Subject

This article will explore the properties of cellular automata (CA) as a parallel model.
The Main Theme

We will first look at the standard model of CA and compare it with Turing machines as the standard sequential model, mainly from a computational complexity point of view. From there we will proceed in two directions: by removing computational power and by adding computational power in different ways, in order to gain insight into the importance of some ingredients of the definition of CA.

What is Left Out

There are topics which we will not cover although they would have fit under the title. One such topic is parallel algorithms for CA. There are algorithmic problems which make sense only for parallel models. Probably the most famous for CA is the so-called Firing Squad Synchronization Problem. This is the topic of Umeo's article (Firing Squad Synchronization Problem in Cellular Automata), which can also be found in this encyclopedia. Another such topic in this area is the leader election problem. For CA it has received increased attention in recent years. See the paper by Stratmann and Worsch [29] and the references therein for more details. And we do want to mention the most exciting (in our opinion) CA algorithm: Tougne has designed a CA which, starting from a single point, after t steps has generated the discretized circle of radius t, for all t; see [5] for this gem.
There are also models which generalize standard CA by making the cells more powerful. Kutrib has introduced push-down cellular automata [14]. As the name indicates, in this model each cell does not have a finite memory but can make use of a potentially unbounded stack of symbols. The area of nondeterministic CA is also not covered here. For results concerning formal language recognition with these devices refer to Cellular Automata and Language Theory. All these topics are, unfortunately, beyond the scope of this article.
Structure of the Paper

The core of this article consists of four sections:
Introduction: The main point is the standard definition of Euclidean deterministic synchronous cellular automata. Furthermore, some general aspects of parallel models and typical questions and problems are discussed.
Time and space complexity: After defining the standard computational complexity measures, we compare CA with different resource bounds. The comparison of CA
with the Turing machine (TM) gives basic insights into their computational power.
Measuring and controlling activities: There are two approaches to measuring the "amount of parallelism" in CA. One is an additional complexity measure defined directly for CA, the other works via the definition of so-called parallel Turing machines. Both are discussed.
Communication: Here we have a look at "variants" of CA with communication structures other than the one-dimensional line. We sketch the proofs that some of these variants are in the second machine class.

Introduction

In this section we will first formalize the classical model of cellular automata, basically introduced by von Neumann [21]. Afterwards we will recap some general facts about parallel models.

Definition of Cellular Automata

There are several equivalent formalizations of CA, and of course one chooses the one most appropriate for the topics to be investigated. Our point of view will be that each CA consists of a regular arrangement of basic processing elements working in parallel while exchanging data. Below, for each of the words regular, basic, processing, parallel and exchanging, we first give the standard definition for clarification. Then we briefly point out possible alternatives which will be discussed in more detail in later sections.

Underlying Grid

A cellular automaton (CA) consists of a set G of cells, where each cell has at least one neighbor with which it can exchange data. Informally speaking, one usually assumes a "regular" arrangement of cells and, in particular, identically shaped neighborhoods. For a d-dimensional CA, d ∈ N⁺, one can think of G = Z^d. Neighbors are specified by a finite set N of coordinate differences called the neighborhood. The cell i ∈ G has as its neighbors the cells i + n for all n ∈ N. Usually one assumes that 0 ∈ N. (Here we write 0 for a vector of d zeros.)
As long as one is not specifically interested in the precise role N is playing, one may assume some standard neighborhood: the von Neumann neighborhood of radius r is N^(r) = {(k_1, …, k_d) | Σ_j |k_j| ≤ r} and the Moore neighborhood of radius r is M^(r) = {(k_1, …, k_d) | max_j |k_j| ≤ r}. The choices of G and N determine the structure of what in a real parallel computer would be called the "communication network". We will usually consider the case G = Z^d and assume that the neighborhood is N = N^(1).

Discussion

The structure of connections between cells is sometimes defined using the concept of Cayley graphs. Refer to the article by Ceccherini-Silberstein (Cellular Automata and Groups), also in this encyclopedia, for details.
Another approach is via regular tessellations. For example the 2-dimensional Euclidean space can be tiled with copies of a square. These can be considered as cells, and cells sharing an edge are neighbors. Similarly one can tile, e. g., the hyperbolic plane with copies of a regular k-gon. This will be considered to some extent in Sect. "Communication in CA". A more thorough exposition can be found in the article by Margenstern (Cellular Automata in Hyperbolic Spaces), also in this encyclopedia. CA resulting, for example, from tessellations of the 2-dimensional Euclidean plane with triangles or hexagons are considered in the article by Bays (Cellular Automata in Triangular, Pentagonal and Hexagonal Tessellations), also in this encyclopedia.

Global and Local Configurations

The basic processing capabilities of each cell are those of a finite automaton. The set of possible states of each cell, denoted by Q, is finite. As inputs to be processed, each cell gets the states of all the cells in its neighborhood. We will write Q^G for the set of all functions from G to Q. Thus each c ∈ Q^G describes a possible global state of the whole CA. We will call these c (global) configurations. On the other hand, functions ℓ : N → Q are called local configurations. We say that in a configuration c cell i observes the local configuration c_{i+N} : N → Q where c_{i+N}(n) = c(i + n). A cell gets its currently observed local configuration as input. It remains to be defined how these are processed.

Dynamics

The dynamics of a CA are defined by specifying the local dynamics of a single cell and how cells "operate in parallel" (if at all).
In the first sections we will consider the classical case: A local transition function is a function f : Q^N → Q prescribing for each local configuration ℓ ∈ Q^N the next state f(ℓ) of a cell which currently observes ℓ in its neighborhood. In particular, this means that we are considering deterministic behavior of cells. Furthermore, we will first concentrate on CA where all cells work synchronously: the possible transitions from one configuration to the next one in one
global step of the CA can be described by a function F : Q^G → Q^G requiring that all cells make one state transition: ∀i ∈ G : F(c)(i) = f(c_{i+N}). For alternative definitions of the dynamic behavior of CA see Sect. "Measuring and Controlling the Activities".

Discussion

Basically, the above definition of CA is the standard one going back to von Neumann [21]; he used G = Z² and N = {(−1, 0), (1, 0), (0, −1), (0, 1)} for his construction. But for all of the aspects just defined there are other possibilities, some of which will be discussed in later sections.

Finite Computations on CA

In this article we are interested in using CA as devices for computing, given some finite input, in a finite number of steps a finite output.

Inputs

As the prototypical examples of problems to be solved by CA and other models we will consider the recognition of formal languages. This has the advantages that the inputs have a simple structure and, more importantly, the output is only one bit (accept or reject) and can be formalized easily. A detailed discussion of CA as formal language recognizers can be found in the article by Kutrib (Cellular Automata and Language Theory).
The input alphabet will be denoted by A. We assume that A ⊆ Q. In addition there has to be a special state q ∈ Q which is called a quiescent state because it has the property that for the quiescent local configuration ℓ_q : N → Q : n ↦ q the local transition function must satisfy f(ℓ_q) = q.
In the literature two input modes are usually considered.
Parallel input mode: For an input w = x_1 … x_n ∈ A^n the initial configuration c_w is defined as

  c_w(i) = x_j  if i = (j, 0, …, 0)
  c_w(i) = q    otherwise.
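The global map F and the parallel input mode can be made concrete in a short sketch (Python; the "erosion" rule used in the demonstration is our own example, not from the text):

```python
def ca_step(c, f, N, q):
    """One synchronous global step F(c)(i) = f(c_{i+N}).
    c: dict mapping cell -> state (cells not listed are in the quiescent
    state q); f: local rule taking the tuple of neighborhood states;
    N: tuple of neighborhood offsets (0 should be in N)."""
    # only cells that can see a non-quiescent cell may change state
    active = {i - n for i in c for n in N}
    new = {}
    for i in active:
        s = f(tuple(c.get(i + n, q) for n in N))
        if s != q:
            new[i] = s
    return new

def initial_config(w):
    """Parallel input mode: cell j holds x_j for j = 1..n, all other
    cells are quiescent."""
    return {j + 1: x for j, x in enumerate(w)}

# demo: an "erosion" rule where a cell stays '1' only if its whole
# neighborhood is '1'; a run of 1s shrinks from both ends
N = (-1, 0, 1)
q = '.'
f = lambda loc: '1' if all(s == '1' for s in loc) else '.'
c = ca_step(initial_config('111'), f, N, q)
```

Iterating `ca_step` until the configuration no longer changes corresponds to the CA reaching a stable configuration, the halting notion defined below.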
Sequential input mode: In this case, all cells are in state q in the initial configuration. But cell (0, …, 0) is a designated input cell and acts differently from the others. It works according to a function g : Q^N × (A ∪ {q}) → Q. During the first n steps the input cell receives input symbol x_j in step j; after the last input symbol it always gets q. CA using this type of input are often called iterative arrays (IA).
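The mechanics of the sequential input mode can be illustrated with a toy iterative array (a sketch; the shift-right rule is our own choice, made only to show how the designated input cell feeds symbols into the array):

```python
def ia_run(w, q='.'):
    """Toy iterative array: all cells start quiescent; cell 0 receives
    input symbol x_j in step j (rule g), while every other cell simply
    copies its left neighbor (rule f).  After n steps the word is stored
    in cells 0..n-1, in reverse order (x_n at cell 0)."""
    n = len(w)
    cells = [q] * n
    for j in range(n):
        # one synchronous step: shift right, cell 0 takes the next symbol
        cells = [w[j]] + cells[:-1]
    return cells
```

A real construction (e. g. the one sketched below, where an IA first stores the input and then simulates a CA) has to take the resulting order of the stored symbols into account.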
Unless otherwise noted we will always assume parallel input mode. Conceptually the difference between the two input modes has the same consequences as for TM. If input is provided sequentially, it is meaningful to have a look at computations which "use" fewer than n cells (see the definition of space complexity later on). Technically, some results hold only for the parallel input mode but not for the sequential one, or vice versa. This is the case, for example, when one looks at devices with small time bounds like n or n + √n steps. But as soon as one considers more generally Θ(n) or more steps and a space complexity of at least Θ(n), both CA and IA can simulate each other in linear time: an IA can first read the complete input word, storing it in successive cells, and then start simulating a CA, and a CA can shift the whole word to the cell holding the first input symbol and have it act as the designated input cell of an IA.

Outputs

Concerning output, one usually defines that a CA has finished its work whenever it has reached a stable configuration c, i. e., F(c) = c. In such a case we will also say that the CA halts (although formally one can continue to apply F). An input word w ∈ A⁺ is accepted iff the cell (1, 0, …, 0), which got the first input symbol, is in an accepting state from a designated finite subset F_+ ⊆ Q of states. We write L(C) for the set of all words w ∈ A⁺ which are accepted by a CA C.
For the sake of simplicity we will assume that all deterministic machines under consideration halt for all inputs. E. g., the CA as defined above always reaches a stable configuration. The sequence of all configurations from the initial one for some input w to the stable one is called the computation for input w.

Discussion

For 1-dimensional CA the definition of parallel input is the obvious one. For higher-dimensional CA, say G = Z², one could also think of more compact forms of input for one-dimensional words, e. g., inscribing the input symbols row by row into a square with side length ⌈√n⌉. But this requires extra care. Depending on the formal language to be accepted, the special way in which the symbols are input might provide additional information which is useful for the language recognition task at hand.
Since a CA performs work on an infinite number of bits in each step, it would also be possible to consider inputs and outputs of infinite length, e. g., as representations of all real numbers in the interval [0, 1]. There is much
less literature about this aspect; see for example chapter 11 of [8]. It is not too surprising that this area is also related to the view of CA as dynamical systems (instead of computers); see the contributions by Formenti (Chaotic Behavior of Cellular Automata) and Kůrka (Topological Dynamics of Cellular Automata).

Example: Recognition of Palindromes

As an example that will also be useful later, consider the formal language L_pal of palindromes of odd length:

  L_pal = { v x v^R | v ∈ A* ∧ x ∈ A } .

(Here v^R is the mirror image of v.) For example, if A contains all Latin letters, saippuakauppias belongs to L_pal (the Finnish word for a soap dealer).
It is known that each TM with only one tape and only one head on it (see Subsect. "Turing Machines" for a quick introduction) needs time Ω(n²) for the recognition of at least some inputs of length n belonging to L_pal [10]. We will sketch a CA recognizing L_pal in time Θ(n).
As the set of states for a single cell we use Q = A ∪ {␣} ∪ (Q_l × Q_r × Q_v × Q_lr), basically subdividing each cell into 4 "registers", each containing a "substate". The substates from Q_l = {<a, <b, ␣} and Q_r = {a>, b>, ␣} are used to shift input symbols to the left and to the right, respectively. In the third register a substate from Q_v = {+, -} indicates the results of comparisons. In the fourth register substates from Q_lr = {>, <, <+>, <->, ␣} are used to realize "signals" > and < which identify the middle cell and distribute the relevant overall comparison result to all cells. As accepting states one chooses those whose last component is <+>: F_+ = Q_l × Q_r × Q_v × {<+>}. There is a total of 3 + 3·3·2·5 = 93 states, and for a complete definition of the local transition function one would have to specify f(x, y, z) for 93³ = 804357 triples of states. We will not do that, but we will sketch some important parts.
In the first step the registers are initialized. For all x, y, z ∈ A:

  ℓ(−1)  ℓ(0)  ℓ(1)   f(ℓ)
   ␣      y     z     (<y, y>, +, >)
   x      y     z     (<y, y>, +, ␣)
   x      y     ␣     (<y, y>, +, <)
   ␣      y     ␣     (<y, y>, +, <+>)

In all later steps, if

  ℓ(−1) = (<x_l, x_r>, v_l, d_l),  ℓ(0) = (<y_l, y_r>, v_m, d_m),  ℓ(1) = (<z_l, z_r>, v_r, d_r),

then f(ℓ) = (<z_l, x_r>, v'_m, d'_m). Here, of course, the new value of the third register is computed as

  v'_m = +  if v_m = + ∧ z_l = x_r
  v'_m = -  otherwise.

We do not describe the computation of d'_m in detail. Figure 1 shows the computation for the input babbbab, which is a palindrome. Horizontal double lines separate configurations at subsequent time steps. Registers in state ␣ are simply left empty. As can be seen, there is a triangle in the space-time diagram, consisting of the n input cells at time t = 0 and shrinking at both ends by one cell in each subsequent step, where "a lot of activity" can happen due to the shifting of the input symbols in both directions.
Cellular Automata as Models of Parallel Computation, Figure 1 Recognition of a palindrome; the last configuration is stable and the cell which initially stored the first input symbol is in an accepting state
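The essence of the construction, namely two copies of the input travelling in opposite directions and being compared where they cross, can be condensed into a small simulation (a simplified sketch: it reads the verdict off the middle cell directly instead of locating the middle with the > and < signals and broadcasting the result):

```python
def ca_palindrome(w):
    """Linear-time palindrome check in the spirit of the CA above:
    every cell holds a left-shifting and a right-shifting copy of its
    input symbol; the middle cell compares the two streams passing
    through it.  Locating the middle and distributing the verdict,
    which the full 93-state construction also does, is elided here."""
    n = len(w)
    assert n % 2 == 1, "L_pal contains only words of odd length"
    blank = None
    left = list(w)    # register shifting one cell to the left per step
    right = list(w)   # register shifting one cell to the right per step
    m = n // 2        # index of the middle cell
    ok = True
    for _ in range(m):
        left = left[1:] + [blank]
        right = [blank] + right[:-1]
        # after t steps the middle cell sees w[m+t] and w[m-t]
        if left[m] is not None and right[m] is not None:
            ok = ok and (left[m] == right[m])
    return ok
```

After m = (n−1)/2 synchronous steps all comparisons w[m+t] = w[m−t] have been performed, matching the Θ(n) time bound claimed in the text.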
Clearly a two-head TM can also recognize L_pal in linear time, by first moving one head to the last symbol and then synchronously shifting both heads towards each other, comparing the symbols read. Informally speaking, in this case the ability of multi-head TM to transport a small amount of information over a long distance in one step can be "compensated" by CA by shifting a large amount of information over a short distance. We will see in Theorem 3 that this observation can be generalized.

Complexity Measures: Time and Space

For one-dimensional CA it is straightforward to define their time and space complexity. We will consider only worst-case complexity. Remember that we assume that all CA halt for all inputs (reach a stable configuration).
For w ∈ A⁺ let time′(w) denote the smallest number of steps after which the CA reaches a stable configuration when started from the initial configuration for input w. Then time : N⁺ → N⁺ : n ↦ max{time′(w) | w ∈ A^n} is called the time complexity of the CA.
Similarly, let space′(w) denote the total number of cells which are not quiescent in at least one configuration occurring during the computation for input w. Then space : N⁺ → N⁺ : n ↦ max{space′(w) | w ∈ A^n} is called the space complexity of the CA.
If we want to mention a specific CA C, we indicate it as an index, e. g., time_C. If s and t are functions N⁺ → N⁺, we write CA-SPC(s)-TIME(t) for the set of formal languages which can be accepted by some CA C with space_C ≤ s and time_C ≤ t, and analogously CA-SPC(s) and CA-TIME(t) if only one complexity measure is bounded. Thus we only look at upper bounds. For a whole set T of functions, we will use the abbreviation

  CA-TIME(T) = ⋃_{t ∈ T} CA-TIME(t) .

Typical examples will be T = O(n) or T = Pol(n), where in general Pol(f) = ⋃_{k ∈ N⁺} O(f^k). Resource bounded complexity classes for other computational models will be noted similarly. If we want to make the dimension of the CA explicit, we write Z^d-CA-…; if the prefix Z^d is missing, d = 1 is to be assumed.
Throughout this article n will always denote the length of input words. Thus a time complexity of Pol(n) simply means polynomial time, and similarly for space, so that TM-TIME(Pol(n)) = P and TM-SPC(Pol(n)) = PSPACE.

Discussion

For higher-dimensional CA the definition of space complexity requires more consideration. One possibility is to count the number of cells used during the computation. A different, but sometimes more convenient, approach is to count the number of cells in the smallest hyper-rectangle comprising all used cells.

Turing Machines

For reference, and because we will consider a parallel variant, we set forth some definitions of Turing machines. In general we allow Turing machines with k work tapes and h heads on each of them. Each square carries a symbol from the tape alphabet B which includes the blank symbol ␣. The control unit (CU) is a finite automaton with set of states S. The possible actions of a deterministic TM are described by a function f : S × B^{kh} → S × B^{kh} × D^{kh}, where D = {−1, 0, +1} is used for indicating the direction of movement of a head. If the machine reaches a situation in which f(s, b_1, …, b_{kh}) = (s, b_1, …, b_{kh}, 0, …, 0) for the current state s and the currently scanned symbols b_1, …, b_{kh}, we say that it halts. Initially a word w of length n over the input alphabet A ⊆ B is written on the first tape on squares 1, …, n; all other tape squares are empty, i. e., carry the ␣. An input is accepted if the CU halts in an accepting state from a designated subset F_+ ⊆ S. L(T) denotes the formal language of all words accepted by a TM T.
We write kT^h-TM-SPC(s)-TIME(t) for the class of all formal languages which can be recognized by a TM T with k work tapes and h heads on each of them which has a space complexity space_T ≤ s and time_T ≤ t. If k and/or h is missing, 1 is assumed instead. If the whole prefix kT^h is missing, T is assumed. If arbitrary k and h are allowed, we write T_*.

Sequential Versus Parallel Models

Today quite a number of different computational models are known which intuitively look as if they are parallel. Several years ago van Emde Boas [6] observed that many of these models have one property in common. The problems that can be solved in polynomial time on such a model P coincide with the problems that can be solved in polynomial space on Turing machines:

  P-TIME(Pol(n)) = TM-SPC(Pol(n)) = PSPACE .
Here we have chosen P as an abbreviation for "parallel" model. Models P satisfying this equality are by definition the members of the so-called second machine class. On the other hand, the first machine class is formed by all models S satisfying the relation

  S-SPC(s)-TIME(t) = TM-SPC(Θ(s))-TIME(Pol(t))
at least for some reasonable functions s and t. We deliberately avoid making this more precise. In general there is consensus on which models are in these machine classes. We do want to point out that the naming of the two machine classes does not mean that they are different or even disjoint. This is not known. For example, if P = PSPACE, it might be that the classes coincide. Furthermore there are models, e. g., Savitch's NLPRAM [25], which might be in neither machine class.
Another observation is that in order to possibly classify a machine model it obviously has to have something like "time complexity" and/or "space complexity". This may sound trivial, but we will see in Subsect. "Parallel Turing Machines" that, for example, for so-called parallel Turing machines with several work tapes it is in fact not.

Time and Space Complexity

Comparison of Resource Bounded One-Dimensional CA

It is clear that time and space complexity for CA are Blum measures [2] and hence infinite hierarchies of complexity classes exist. It follows from the more general Theorem 9 for parallel Turing machines that the following holds:

Theorem 1 Let s and t be two functions such that s is fully CA space constructable in time t and t is CA computable in space s and time t. Then:

  ⋃_{φ ∉ O(1)} CA-SPC(Θ(s/φ))-TIME(Θ(t/φ)) ⊊ CA-SPC(O(s))-TIME(O(t))
  CA-SPC(o(s))-TIME(o(t)) ⊊ CA-SPC(O(s))-TIME(O(t))
  CA-TIME(o(t)) ⊊ CA-TIME(O(t)) .

The second and third inclusions are simple corollaries of the first one. We do not go into the details of the definition of CA constructibility, but note that for hierarchy results for TM one sometimes needs analogous additional conditions. For details, interested readers are referred to [3,12,17].
We note that for CA the situation is better than for deterministic TM: there one needs f(n) log f(n) ∈ o(g(n)) in order to prove TM-TIME(f) ⊊ TM-TIME(g).

Open Problem 2 For the proper inclusions in Theorem 1 the construction used in [34] really needs to increase the space used in order to get the time hierarchy. It is an open problem whether there also exists a time hierarchy if the space complexity is fixed, e. g., as s(n) = n. It is even an open problem to prove or disprove that the inclusion

  CA-SPC(n)-TIME(n) ⊆ CA-SPC(n)-TIME(2^{O(n)})

is proper or not. We will come back to this topic in Subsect. "Parallel Turing Machines" on parallel Turing machines.

Comparison with Turing Machines

It is well-known that a TM with one tape and one head on that tape can be simulated by a one-dimensional CA. See for example the paper by Smith [27]. But even multi-tape TM can be simulated by a one-dimensional CA without any significant loss of time.

Theorem 3 For all space bounds s(n) ≥ n and all time bounds t(n) ≥ n the following holds for one-dimensional CA and TM with an arbitrary number of heads on their tapes:

  T_*-TM-SPC(s)-TIME(t) ⊆ CA-SPC(s)-TIME(O(t)) .

Sketch of the Simulation We first describe a simulation for 1T¹-TM. In this case the actions of the TM are of the form s, b → s′, b′, d where s, s′ ∈ S are old and new state, b, b′ ∈ B old and new tape symbol, and d ∈ {−1, 0, +1} the direction of head movement.
The simulating CA uses three substates in each cell, one for a TM state, one for a tape symbol, and an additional one for shifting tape symbols: Q = Q_S × Q_T × Q_M. We use Q_S = S ∪ {␣}, and a substate of ␣ means that the cell does not store a state. Similarly Q_T = B ∪ {<◦, ◦>}, and a substate of <◦ or ◦> means that there is no symbol stored but a "hole" to be filled with an adjacent symbol. Substates from Q_M = B × {<, >}, like <b and b>, are used for shifting symbols from one cell to the adjacent one to the left or right.
Instead of moving the state one cell to the right or left whenever the TM moves its head, the tape contents as stored in the CA are shifted in the opposite direction. Assume for example that the TM performs the following actions:

s0, d → s1, d′, +1
Cellular Automata as Models of Parallel Computation
s1, e → s2, e′, −1
s2, d′ → s3, d″, −1
s3, c → s4, c′, +1

Figure 2 shows how shifting the tape in direction d can be achieved by sending the current symbol in that direction and sending a "hole" ◦ in the opposite direction −d. It should be clear that the required state changes of each cell depend only on information available in its neighborhood.

Cellular Automata as Models of Parallel Computation, Figure 2  Shifting tape contents step by step

A consequence of this approach of incrementally shifting the tape contents is that it takes an arbitrarily large number of steps until all symbols have been shifted. On the other hand, after only two steps the cell simulating the TM control unit has information about the next symbol visited and can simulate the next TM step and initialize the next tape shift.

Clearly the same approach can be used if one wants to simulate a TM with several tapes, each having one head. For each additional tape the CA would use two additional registers analogous to the middle and bottom row used in Fig. 2 for one tape. Stoß [28] has proved that kT^h-TM (h heads on each tape) can be simulated by (kh)T-TM (only one head on each tape) in linear time. Hence there is nothing left to prove.

Discussion  As one can see in Fig. 2, in every second step one signal is sent to the left and one to the right. Thus, if the TM moves its head a lot and if the tape segment which has to be shifted is already long, many signals are traveling simultaneously. In other words, the CA transports "a large amount of information over a short distance in one step". Theorem 3 says that this ability is at least as powerful as the ability of multi-head TM to transport "a small amount of information over a long distance in one step".

Open Problem 4  The question remains whether some kind of converse also holds, i.e., whether in Theorem 3 an = sign would be correct instead of the ⊆, or whether CA are more powerful, i.e., a ⊊ sign would be correct. This is not known. The best simulation of CA by TM that is known is the obvious one: states of neighboring cells are stored on adjacent tape squares. For the simulation of one CA step the TM basically makes one sweep across the complete tape segment containing the states of all non-quiescent cells, updating them one after the other. As a consequence one gets
Theorem 5  For all space bounds s(n) ≥ n and all time bounds t(n) ≥ n:

CA SPC(s) TIME(t) ⊆ TM SPC(s) TIME(O(s·t)) ⊆ TM SPC(s) TIME(O(t²)).

The construction proving the first inclusion needs only a one-head TM, and no possibility is known to take advantage of more heads. The second inclusion follows from the observation that in order to use an initially blank tape square, a TM must move one of its heads there, which requires time. Thus s ∈ O(t). Taking Theorems 3 and 5 together, one immediately gets:

Corollary 6  Cellular automata are in the first machine class.

It is not known whether CA are in the second machine class. In this regard they are "more like" sequential models. The reason for this is the fact that the number of active processing units grows only polynomially with the number of steps in a computation. In Sect. "Communication in CA" variations of the standard CA model will be considered where this is different.

Measuring and Controlling the Activities

Parallel Turing Machines

One possible way to make a parallel model from Turing machines is to allow several control units (CU), but with all of them working on the same tape (or tapes). This model can be traced back at least to a paper by Hemmerling [9], who called it systems of Turing automata. A few years later Wiedermann [32] coined the term parallel Turing machine (PTM).
We consider only the case where there is only one tape and each of the control units has only one head on that tape. As for sequential TM, we usually drop the prefix 1T1 for PTM, too. Readers interested in the case of PTM with multi-head CUs are referred to [33].

PTM with One-Head Control Units  The specification of a PTM includes a tape alphabet B with a blank symbol and a set S of possible states for each CU. A PTM starts with one CU on the first input symbol, like a sequential 1T1-TM. During the computation the number of control units may increase and decrease, but all CUs always work cooperatively on one common tape. The idea is to have the CUs act independently unless they are "close" to each other, retaining the idea of only local interactions, as in CA. A configuration of a PTM is a pair c = (p, b). The mapping b : Z → B describes the contents of the tape. Let 2^S denote the power set of S. The mapping p : Z → 2^S describes for each tape square i the set of states of the finite automata currently visiting it. In particular, this formalization means that it is not possible to distinguish two automata on the same square and in the same state: the idea is that because of this they will always behave identically and hence need not be distinguished. The mode of operation of a PTM is determined by the transition function f : 2^S × B → 2^{S×D} × B, where D is the set {−1, 0, +1} of possible movements of a control unit. In order to compute the successor configuration c′ = (p′, b′) of a configuration c = (p, b), f is computed simultaneously for all tape positions i ∈ Z. The arguments used are the set of states of the finite automata currently visiting square i and its tape symbol. Let (M′_i, b′_i) = f(p(i), b(i)). Then the new symbol on square i in configuration c′ is b′(i) = b′_i. The set of finite automata on square i is replaced by a new set of finite automata (described by M′_i ⊆ S × D), each of which changes the tape square according to the indicated direction of movement.
Therefore p′(i) = {q | (q, +1) ∈ M′_{i−1} ∨ (q, 0) ∈ M′_i ∨ (q, −1) ∈ M′_{i+1}}. Thus f induces a global transition function F mapping global configurations to global configurations. In order to make the model useful (and to come up to some intuitive expectations) it is required that CUs cannot arise "out of nothing" and that the symbol on a tape square can change only if it is visited by at least one CU. In other words, we require that ∀ b ∈ B : f(∅, b) = (∅, b). Observe that the number of finite automata on the tape may change during a computation. Automata may vanish, for example if f({s}, b) = (∅, b), and new automata may be generated, for example if f({s}, b) = ({(q, −1), (q′, 0)}, b).
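The global step just described can be sketched directly in code. The following is an illustrative sketch with our own naming (not from the article), representing p and b as dictionaries in which absent squares are empty or blank:

```python
def ptm_step(p, b, f, blank='_'):
    """One global PTM step: apply f on every used square, write the new
    symbols, then move each generated control unit in its direction d.
    p maps square -> frozenset of CU states, b maps square -> symbol;
    squares not present are empty/blank, and f(frozenset(), sym) must
    return (set(), sym) so that unused squares stay unchanged."""
    moves, new_b = {}, {}
    for i in set(p) | set(b):
        M_i, sym = f(p.get(i, frozenset()), b.get(i, blank))
        if sym != blank:
            new_b[i] = sym
        moves[i] = M_i
    new_p = {}
    for i, M_i in moves.items():
        for (q, d) in M_i:            # d in {-1, 0, +1}
            new_p.setdefault(i + d, set()).add(q)
    return {i: frozenset(s) for i, s in new_p.items()}, new_b
```

Note that, exactly as in the formalization above, two automata arriving at the same square in the same state automatically merge, because the states on a square form a set.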
For the recognition of formal languages we define the initial configuration c_w for an input word w ∈ A⁺ as the one in which w is written on the otherwise blank tape on squares 1, 2, …, |w|, and in which there is exactly one finite automaton in an initial state q₀ on square 1. A configuration (p, b) of a PTM is called accepting iff it is stable (i.e. F((p, b)) = (p, b)) and p(1) ∩ F₊ ≠ ∅. The language L(P) recognized by a PTM P is the set of input words for which it reaches an accepting configuration.

Complexity Measures for PTM  Time complexity of a PTM can be defined in the obvious way. For space complexity, one counts the total number of tape squares which are used in at least one configuration. Here we call a tape square i unused in a configuration c = (p, b) if p(i) = ∅ and b(i) is the blank symbol; otherwise it is used. What makes PTM interesting is the definition of their processor complexity. Let proc′(w) denote the maximum number of CUs which exist simultaneously in a configuration occurring during the computation for input w, and define proc : N₊ → N₊ : n ↦ max{proc′(w) | w ∈ Aⁿ}. For complexity classes we use the notation PTM SPC(s) TIME(t) PROC(p) etc. The processor complexity is one way to measure (an upper bound on) "how many activities" happen simultaneously. It should be clear that at the lower end one has the case of constant proc(n) = 1, which means that the PTM is in fact (equivalent to) a sequential TM. The other extreme is to have CUs "everywhere". In that case proc(n) ∈ Θ(space(n)), and one basically has a CA. In other words, processor complexity measures the amount of parallelism of a PTM.
Theorem 7  For all space bounds s and time bounds t:

PTM SPC(s) TIME(t) PROC(1) = TM SPC(s) TIME(t)
PTM SPC(s) TIME(t) PROC(s) = CA SPC(s) TIME(O(t)).

Under additional constructibility conditions it is even possible to get a generalization of Theorem 5:

Theorem 8  For all functions s(n) ≥ n, t(n) ≥ n, and h(n) ≥ 1, where h is fully PTM processor constructable in space s, time t, and with h processors:

CA SPC(O(s)) TIME(O(t)) ⊆ PTM SPC(O(s)) TIME(O(st/h)) PROC(O(h)).

Decreasing the processor complexity indeed leads to the expected slowdown.
Relations Between PTM Complexity Classes (part 1)  The interesting question now is whether different upper bounds on the processor complexity result in different computational power. In general that is not the case, as PTM with only one CU are TM and hence computationally universal. (As a side remark we note that therefore processor complexity cannot be a Blum measure. In fact that should be more or less clear since, e.g., deciding whether a second CU will ever be generated might require finding out whether the first CU, i.e., a TM, ever reaches a specific state.) In this first part we consider the case where two complexity measures are allowed to grow in order to get a hierarchy. Results which only need one growing measure are the topic of the second part. First of all it turns out that for fixed processor complexity between log n and s(n) there is a space/time hierarchy:

Theorem 9  Let s and t be two functions such that s is fully PTM space constructable in time t, and t is PTM computable in space s and time t, and let h ≥ log. Then:

⋃_{π ∉ O(1)} PTM SPC(Θ(s/π)) TIME(Θ(t/π)) PROC(O(h)) ⊊ PTM SPC(O(s)) TIME(O(t)) PROC(O(h)).

The proof of this theorem applies the usual idea of diagonalization. Technical details can be found in [34]. Instead of keeping processor complexity fixed and letting space complexity grow, one can also do the opposite. As for analogous results for TM, one needs the additional restriction to one fixed tape alphabet. One gets the following result, where the complexity classes carry additional information about the size of the tape alphabet.

Theorem 10  Let s, t and h be three functions such that s is fully PTM space constructable in time t and with h processors, and t and h are PTM computable in space s and time t and with h processors, such that in all cases the tape is not written. Let b ≥ 2 be the size of the tape alphabet. Then:

⋃_{π ∉ O(1)} PTM SPC(s) TIME(Θ(t/π)) PROC(h/π) ALPH(b) ⊊ PTM SPC(s) TIME(Θ(st)) PROC(Θ(h)) ALPH(b).

Again we do not go into the details of the constructibility definitions, which can be found in [34]. The important point here is that one can prove that increasing time
and processor complexity by a non-constant factor does increase the (language recognition) capabilities of PTM, even if the space complexity is fixed, provided that one does not allow any changes to the tape alphabet. In particular the theorem holds for the case space(n) = n.

It is now interesting to reconsider Open Problem 2. Let us assume that

CA SPC(n) TIME(n) = CA SPC(n) TIME(2^O(n)).

One may choose π(n) = log n, t(n) = 2^{n/log n} and h(n) = n in Theorem 10. Using that together with Theorem 7, the assumption would give rise to

PTM SPC(n) TIME(2^{n/log n}/log n) PROC(n/log n) ALPH(b)
⊊ PTM SPC(n) TIME(n·2^{n/log n}) PROC(n) ALPH(b)
= PTM SPC(n) TIME(n) PROC(n) ALPH(b).

If the polynomial time hierarchy for n-space bounded CA collapses, then there are languages which cannot be recognized by PTM in almost exponential time with n/log n processors but which can be recognized by PTM with n processors in linear time, if the tape alphabet is fixed.

Relations Between PTM Complexity Classes (part 2)  One can get rid of the fixed alphabet condition by using a combinatorial argument for a specific formal language (instead of diagonalization), and even have not only the space but also the processor complexity fixed and still get a time hierarchy. The price to pay is that the range of time bounds is more restricted than in Theorem 10. Consider the formal language

L_vv = { v c^{|v|} v | v ∈ {a, b}⁺ }.

It contains all words which can be divided into three segments of equal length such that the first and third are identical. Intuitively, whatever type of machine is used for recognition, it is unavoidable to "move" the complete information from one end to the other. L_vv shares this feature with L_pal. Using a counting argument inspired by Hennie's concept of crossing sequences [10] applied to L_pal, one can show:

Lemma 11 ([34])  If P is a PTM recognizing L_vv, then time_P² · proc_P ∈ Ω(n³/log² n).
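Membership in L_vv itself is trivial to decide sequentially; the difficulty captured by Lemma 11 lies in the resource bounds, not in the language. A quick sketch of a recognizer (naming is ours):

```python
def in_Lvv(w):
    """Decide membership in L_vv = { v c^|v| v : v in {a,b}^+ }."""
    n = len(w)
    if n == 0 or n % 3 != 0:
        return False
    k = n // 3
    v, middle, v2 = w[:k], w[k:2 * k], w[2 * k:]
    # first and third segment identical over {a,b}, middle all c's
    return set(v) <= {'a', 'b'} and middle == 'c' * k and v == v2
```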
On the other hand, one can construct a PTM recognizing L_vv with processor complexity n^a for sufficiently nice a:

Lemma 12 ([34])  For each a ∈ Q with 0 < a < 1: L_vv ∈ PTM SPC(n) TIME(Θ(n^{2−a})) PROC(Θ(n^a)).

Putting these lemmas together yields another hierarchy theorem:

Theorem 13  For rational numbers 0 < a < 1 and 0 < ε < 3/2 − a/2:

PTM SPC(n) TIME(Θ(n^{3/2−a/2−ε})) PROC(Θ(n^a)) ⊊ PTM SPC(n) TIME(Θ(n^{2−a})) PROC(Θ(n^a)).

Hence, for a close to 1, a "small" increase in time by some factor n^ε suffices to increase the recognition power of PTM while the processor complexity is fixed at n^a and the space complexity is fixed at n as well.

Open Problem 14  For the recognition of L_vv there is a gap between the lower bound of time_P² · proc_P ∈ Ω(n³/log² n) in Lemma 11 and the upper bound of time_P² · proc_P ∈ O(n^{4−a}) in Lemma 12. It is not known whether the upper or the lower bound or both can be improved. An even more difficult problem is to prove a similar result for the case a = 1, i.e., cellular automata, as mentioned in Open Problem 2.

State Change Complexity

In CMOS technology, what costs most of the energy is to make a proper state change, from zero to one or from one to zero. Motivated by this fact, Vollmar [30] introduced the state change complexity for CA. There are two variants based on the same idea: Given a halting CA computation for an input w and a cell i, one can count the number of time points τ, 1 ≤ τ ≤ time′(w), such that cell i is in different states at times τ − 1 and τ. Denote that number by change′(w, i). Define

maxchg′(w) = max_{i∈G} change′(w, i) and sumchg′(w) = Σ_{i∈G} change′(w, i)

and

maxchg(n) = max{maxchg′(w) | w ∈ Aⁿ}
sumchg(n) = max{sumchg′(w) | w ∈ Aⁿ}.

For the language L_vv, which already played a role in the previous subsection, one can show:

Lemma 15  Let f(n) be a non-decreasing function which is not in O(log n), i.e., lim_{n→∞} log n/f(n) = 0. Then any CA C recognizing L_vv makes a total of at least Ω(n²/f(n)) state changes in the segment containing the n input cells and n cells to the left and to the right of them. In particular, if time_C ∈ Θ(n), then sumchg_C ∈ Ω(n²/f(n)). Furthermore maxchg_C ∈ Ω(n/f(n)).
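For small examples the measures change′, maxchg′ and sumchg′ can simply be counted along a run. A sketch for a radius-1 CA, with ring boundary and naming as our simplifications:

```python
def state_changes(rule, config, steps):
    """Run a radius-1 CA on a ring for a given number of steps and
    count change'(w, i) for every cell; return (maxchg', sumchg')."""
    n = len(config)
    cells = list(config)
    changes = [0] * n
    for _ in range(steps):
        nxt = [rule(cells[i - 1], cells[i], cells[(i + 1) % n])
               for i in range(n)]
        # a cell contributes one state change whenever its state differs
        # from the previous step
        changes = [ch + (a != b) for ch, a, b in zip(changes, cells, nxt)]
        cells = nxt
    return max(changes), sum(changes)
```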
In the paper by Sanders et al. [24] a generalization of this lemma to d-dimensional CA is proved.

Open Problem 16  While the processor complexity of PTM measures how many activities happen simultaneously "across space", state change complexity measures how many activities happen over time. For both cases we have made use of the same formal language in proofs. That might be an indication that there are connections between the two complexity measures. But no non-trivial results are known until now.

Asynchronous CA

Until now we have only considered one global mode of operation: the so-called synchronous case, where in each global step of the CA all cells update their states synchronously. Several models have been considered where this requirement is relaxed. Generally speaking, asynchronous CA are characterized by the fact that in one global step of the CA some cells are active and do update their states (all according to the same local transition function) while others do nothing, i.e., remain in the same state as before. There are then different approaches to specifying restrictions on which cells may be active.

Asynchronous update mode.  The simplest possibility is to not quantify anything and to say that a configuration c′ is a legal successor of configuration c, denoted c ⊢ c′, iff for all i ∈ G one has c′(i) = c(i) or c′(i) = f(c_{i+N}).

Unordered sequential update mode.  In this special case it is required that there is at most one active cell in each global step, i.e., card({i | c′(i) ≠ c(i)}) ≤ 1.

Since CA with an asynchronous update mode are no longer deterministic from a global (configuration) point of view, it is not completely clear how to define, e.g., formal language recognition and time complexity. Of course one could follow the way it is done for nondeterministic TM. To the best of our knowledge this has not been considered for
asynchronous CA. (There are results for general nondeterministic CA; see for example  Cellular Automata and Language Theory.) It should be noted that Nakamura [20] has provided a very elegant construction for simulating a CA C_s with synchronous update mode on a CA C_a with one of the above asynchronous update modes. Each cell stores the "current" and the "previous" state of a C_s-cell before its last activation and a counter value T modulo 3 (Q_a = Q_s × Q_s × {0, 1, 2}). The local transition function f_a is defined in such a way that an activated cell does the following: T always indicates how often a cell has already been updated, modulo 3. If the counters of all neighbors have value T or T + 1, the current C_s-state of the cell is remembered as previous state and a new current state is computed according to f_s from the current and previous C_s-states of the neighbors; the selection between current and previous state depends on the counter value of that neighbor. In this case the counter is incremented. If the counter of at least one neighboring cell is at T − 1, the activated cell keeps its complete state as it is. Therefore, if one wants to gain something using asynchronous CA, their local transition functions have to be designed for that specific usage.

Recently, interest has increased considerably in CA where the "degree of (a-)synchrony" is quantified via probabilities. In these cases one considers CA with only a finite number of cells.

Probabilistic update mode.  Let 0 ≤ α ≤ 1 be a probability. In probabilistic update mode each legal global step c ⊢ c′ of the CA is assigned a probability by requiring that each cell i independently has probability α of updating its state.

Random sequential update mode.  This is the case when in each global step one of the cells in G is chosen uniformly at random and its state updated, while all others do not change their state. CA operating in this mode are called fully asynchronous by some authors.
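Nakamura's construction can be tried out directly. The sketch below is ours and simplifies two things: it works on a ring, and it stores full counters instead of counters modulo 3 (only counter differences of at most 1 ever occur between neighbors, which is why three values suffice in the real construction):

```python
import random

def nakamura_run(rule, config, steps, seed=0):
    """Simulate a synchronous radius-1 CA rule on an asynchronous CA in
    Nakamura's style: each cell keeps (current, previous, counter); an
    activated cell advances only if no neighbor lags behind, reading
    each neighbor's state 'as of' its own step number."""
    n = len(config)
    cur, prev, cnt = list(config), list(config), [0] * n
    rng = random.Random(seed)
    while min(cnt) < steps:
        i = rng.randrange(n)                  # activate one random cell
        T = cnt[i]
        left, right = (i - 1) % n, (i + 1) % n
        if T >= steps or cnt[left] < T or cnt[right] < T:
            continue                          # keep complete state as is
        # a neighbor with counter T still holds its step-T state in
        # 'current'; one with counter T + 1 holds it in 'previous'
        l = cur[left] if cnt[left] == T else prev[left]
        r = cur[right] if cnt[right] == T else prev[right]
        prev[i], cur[i], cnt[i] = cur[i], rule(l, cur[i], r), T + 1
    return cur
```

Whatever activation order the random choices produce, the result agrees with the synchronous evolution, which is exactly the point of the construction.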
These models can be considered special cases of what is usually called probabilistic or stochastic CA. For these CA the local transition function is no longer a map from Q^N to Q, but from Q^N to [0, 1]^Q. For each ℓ ∈ Q^N the value f(ℓ) is a probability distribution for the next state (satisfying Σ_{q∈Q} f(ℓ)(q) = 1). There are only very few papers about formal language recognition with probabilistic CA; see [18].
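A stochastic CA step is then just independent sampling per cell. A minimal sketch on a ring (names ours), with f returning a dict encoding the distribution f(ℓ):

```python
import random

def stochastic_step(f, config, rng):
    """One synchronous step of a stochastic CA on a ring: for every cell
    the local map f yields a probability distribution over Q for its
    radius-1 neighborhood, and each cell samples its next state
    independently."""
    n = len(config)
    nxt = []
    for i in range(n):
        dist = f((config[i - 1], config[i], config[(i + 1) % n]))
        states = list(dist)
        nxt.append(rng.choices(states, [dist[q] for q in states])[0])
    return nxt
```

A deterministic CA is recovered as the special case where every distribution puts all its mass on a single state.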
On the other hand, probabilistic update modes have received some attention recently; see for example [22] and the references therein. Development of this area is still at its beginning. Until now, mainly specific local rules have been investigated; for an exception see [7].

Communication in CA

Until now we have considered only one-dimensional Euclidean CA, where one bit of information can reach O(t) cells in t steps. In this section we will have a look at a few possibilities for changing the way cells communicate in a CA. First we have a quick look at CA where the underlying grid is Z^d. The topic of the second subsection is CA where the cells are connected to form a tree.

Different Dimensionality  In Z^d-CA with, e.g., von Neumann neighborhood of radius 1, a cell has the potential to influence O(t^d) cells in t steps. This is a polynomial number of cells. It comes as no surprise that Z^d CA SPC(s) TIME(t) ⊆ TM SPC(Pol(s)) TIME(Pol(t)), and hence Z^d-CA are in the first machine class. One might only wonder why for the TM a space bound of Θ(s) might not be sufficient. This is due to the fact that the set of cells actually used by the CA might have an "irregular" shape, so that the TM has to perform some bookkeeping or simulate a whole (hyper-)rectangle of cells encompassing all that are really used by the CA. Trivially, a d-dimensional CA can be simulated on a d′-dimensional CA, where d′ > d. The question is how much one loses when decreasing the dimensionality. The currently best known result in this direction is by Scheben [26]:

Theorem 17  It is possible to simulate a d′-dimensional CA with running time t on a d-dimensional CA, d < d′, with running time and space O(t^{2^{ld⌈d′/d⌉}}) (ld denoting the binary logarithm).
It should be noted that the above result is not directly about language recognition; the redistribution of input symbols needed for the simulation is not taken into account. Readers interested in that aspect are referred to [1].

Open Problem 18  Try to find simulations of lower-dimensional CA on higher-dimensional CA which somehow make use of the "higher connectivity" between cells. It is probably much too difficult or even impossible to hope for general
speedups. But efficient use of space (small hypercubes) for computations without losing time might be achievable.

Tree CA and hyperbolic CA

Starting from the root of a full binary tree one can reach an exponential number 2^t of nodes in t steps. If there are some computing capabilities attached to the nodes, there is at least the possibility that such a device might exhibit some kind of strong parallelism. One of the earliest papers in this respect is Wiedermann's article [31] (unfortunately only available in Slovak). The model introduced there would now be called a parallel Turing machine in which the tape is not a linear array of cells, but one in which the cells are connected in such a way as to form a tree. A proof is sketched showing that these devices can simulate PRAMs in linear time (assuming the so-called logarithmic cost model). PRAMs are in the second machine class. So, indeed, in some sense trees are powerful. Below we first quickly introduce a PSPACE-complete problem which is a useful tool for proving the power of computational models involving trees. A few examples of such models are considered afterwards.

Quantified Boolean Formula  The instances of the problem Quantified Boolean Formula (QBF, sometimes also called QSAT) have the structure

Q₁x₁ Q₂x₂ … Q_k x_k : F(x₁, …, x_k).

Here F(x₁, …, x_k) is a Boolean formula with variables x₁, …, x_k and connectives ∧, ∨ and ¬. Each Q_j is one of the quantifiers ∀ or ∃. The problem is to decide whether the formula is true under the obvious interpretation. This problem is known to be complete for PSPACE. All known TM, i.e., all deterministic sequential algorithms, for solving QBF require exponential time. Thus a proof that QBF can be solved by some model M in polynomial time (usually) implies that all problems in PSPACE can be solved by M in polynomial time.
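The semantics of such a formula can be spelled out as a brute-force evaluation over all 2^k valuations, combined quantifier by quantifier; this is exactly the shape of the levelwise tree evaluation used for tree CA further on. A small sketch with our own naming:

```python
from itertools import product

def eval_qbf(quantifiers, formula):
    """Evaluate Q1 x1 ... Qk xk : F(x1, ..., xk) by brute force:
    compute F on all 2^k valuations (one per 'evaluation cell'), then
    combine results pairwise level by level, innermost quantifier
    first, using AND for a universal and OR for an existential one."""
    k = len(quantifiers)
    results = [formula(*v) for v in product([False, True], repeat=k)]
    for q in reversed(quantifiers):          # Q_k is combined first
        comb = all if q == 'forall' else any
        results = [comb(results[j:j + 2]) for j in range(0, len(results), 2)]
    return results[0]
```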
Often this can be paired with the "opposite" result that problems which can be solved in polynomial time on M are in PSPACE, and hence M P = TM PSPACE. This, of course, is not the case only for models "with trees"; see [6] for many alternatives.

Tree CA  A tree CA (TCA for short) working on a full d-ary tree can be defined as follows: There is a set of states Q. For the root there is a local transition function f₀ : (A ∪ {␣}) × Q × Q^d → Q, which uses an input symbol (if available), the root cell's own state and those of the d child nodes to compute the next state of the node. And there are d local transition functions f_i : Q × Q × Q^d → Q, where 1 ≤ i ≤ d. The ith child of a node uses f_i to compute its new state depending on the state of its parent node, its own state and the states of its d child nodes. For language recognition, input is provided sequentially to the root node during the first n steps and a blank symbol afterwards. A word is accepted if the root node enters an accepting state from a designated subset F₊ ⊆ Q.

Mycielski and Niwiński [19] were the first to realize that sequential polynomial-time reductions can be carried out by TCA and that QBF can be recognized by tree CA as well: A formula to be checked, with k variables, is copied and distributed to 2^k "evaluation cells". The sequences of left/right choices on the paths to them determine a valuation of the variables with zeros and ones. Each evaluation cell uses the subtree below it to evaluate F(x₁, …, x_k) accordingly. The results are propagated up to the root. Each cell in level i below the root, 1 ≤ i ≤ k, above an evaluation cell combines the results using ∨ or ∧, depending on whether the ith quantifier of the formula was ∃ or ∀. On the other hand, it is routine work to prove that the result of a TCA running in polynomial time can be computed sequentially in polynomial space by a depth-first procedure. Hence one gets:

Theorem 19  TCA TIME(Pol(n)) = PSPACE.

Thus tree cellular automata are in the second machine class.

Hyperbolic CA  Two-dimensional CA as defined in Sect. "Introduction" can be considered as arising from the tessellation of the Euclidean plane with squares. Therefore, more generally, CA on a grid G = Z^d are sometimes called Euclidean CA. Analogously, some hyperbolic CA arise from tessellations of the hyperbolic plane with some regular polygon. They are covered in depth in a separate article ( Cellular Automata in Hyperbolic Spaces).
Here we just consider one special case: The two-dimensional hyperbolic plane can be tiled with copies of the regular 6-gon with six right angles. If one considers only one quarter of it and draws a graph with the tiles as nodes and links between those nodes which share a common tile edge, one gets the graph depicted in Fig. 3. Basically it is a tree with two types of nodes, black and white ones, and some "additional" edges depicted as dotted lines. The root is a white node. The first child of each node is black. All other children are white; a black node has 2 white children, a white node has 3 white children.
Cellular Automata as Models of Parallel Computation, Figure 3 The first levels of a tree of cells resulting from a tiling of the hyperbolic plane with 6-gons
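The exponential growth that gives hyperbolic CA their power can be read off the tree of Fig. 3 directly: by the rules above, a black node has 3 children (1 black, 2 white) and a white node has 4 (1 black, 3 white). A short sketch (ours) of the level sizes:

```python
def tree_levels(depth):
    """Numbers (black, white) of nodes on each level of the tree of
    cells: every node's first child is black; in addition a black node
    has 2 white children and a white node 3 white children."""
    levels = [(0, 1)]                      # level 0: the white root
    for _ in range(depth):
        b, w = levels[-1]
        levels.append((b + w, 2 * b + 3 * w))
    return levels
```

The level totals come out as 1, 4, 15, 56, …, so the number of cells reachable in t steps grows exponentially, in contrast to the polynomial bound O(t^d) in Z^d.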
For hyperbolic CA (HCA) one uses a formalism analogous to that described for tree CA. As one can see, HCA are basically trees with some additional edges. It is therefore not surprising that they can accept the languages from PSPACE in polynomial time. The converse inclusion is also proved similarly to the tree case. This gives:
Theorem 20  HCA TIME(Pol(n)) = PSPACE.

It is also interesting to have a look at the analogs of P, PSPACE, and so on for hyperbolic CA. Somewhat surprisingly, Iwamoto et al. [11] have shown:

Theorem 21  HCA TIME(Pol(n)) = HCA SPC(Pol(n)) = NHCA TIME(Pol(n)) = NHCA SPC(Pol(n)), where NHCA denotes nondeterministic hyperbolic CA. The analogous equalities hold for exponential time and space.

Outlook  There is yet another possibility for bringing trees into play: trees of configurations. The concept of alternation [4] can be carried over to cellular automata. Since there are several active computational units, the definitions are a little bit more involved, and it turns out that one has several possibilities which also result in models with slightly different properties. But in all cases one gets models from the second machine class. For results, readers are referred to [13] and [23]. On the other end there is some research on what happens if one restricts the possibilities for communication between neighboring cells: Instead of getting information about the complete states of the neighbors, in the extreme case only one bit can be exchanged. See for example [15].

Future Directions

At several points in this paper we have pointed out open problems which deserve further investigation. Here, we want to stress three areas which we consider particularly interesting in the area of "CA as a parallel model".

Proper Inclusions and Denser Hierarchies  It has been pointed out several times that inclusions of complexity classes are not known to be proper, or that gaps between resource bounds still need to be "large" in order to prove that the inclusion of the related classes is proper. For the foreseeable future this remains a wide area for further research. Most probably new techniques will have to be developed to make significant progress.

Activities  The motivation for considering state change complexity was the energy consumption of CMOS hardware. It is known that irreversible physical computational processes must consume energy. This seems not to be the case for reversible ones. Therefore reversible CA are also interesting in this respect. The definition of reversible CA and results for them are the topic of the article by Morita ( Reversible Cellular Automata), also in this encyclopedia. Surprisingly, all currently known simulations of irreversible CA on reversible ones (this is possible) exhibit a large state change complexity. This deserves further investigation. Also the examination of CA which are "reversible on the computational core" has been started only recently [16]. There are first surprising results; the impacts on computational complexity are unforeseeable.

Asynchronicity and Randomization  Randomization is an important topic in sequential computing. It is high time that this is also investigated in much
more depth for cellular automata. The same holds for cellular automata where not all cells are updating their states synchronously. These areas promise a wealth of new insights into the essence of fine-grained parallel systems.
Cellular Automata, Classification of
KLAUS SUTNER
Carnegie Mellon University, Pittsburgh, USA

Article Outline
Glossary
Definition of the Subject
Introduction
Reversibility and Surjectivity
Definability and Computability
Computational Equivalence
Conclusion
Bibliography

Glossary

Cellular automaton For our purposes, a (one-dimensional) cellular automaton (CA) is given by a local map ρ : Σ^w → Σ where Σ is the underlying alphabet of the automaton and w is its width. As a data structure, suitable as input to a decision algorithm, a CA can thus be specified by a simple lookup table. We abuse notation and write ρ(x) for the result of applying the global map of the CA to a configuration x ∈ Σ^ℤ.

Wolfram classes Wolfram proposed a heuristic classification of cellular automata based on observations of typical behaviors. The classification comprises four classes: evolution leads to trivial configurations; evolution leads to periodic configurations; evolution is chaotic; evolution leads to complicated, persistent structures.

Undecidability It was recognized by logicians and mathematicians in the first half of the 20th century that there is an abundance of well-defined problems that cannot be solved by means of an algorithm, a mechanical procedure that is guaranteed to terminate after finitely many steps and produce the appropriate answer. The best known example of an undecidable problem is Turing's Halting Problem: there is no algorithm to determine whether a given Turing machine halts when run on an empty tape.

Semi-decidability A problem is said to be semi-decidable or computably enumerable if it admits an algorithm that returns "yes" after finitely many steps if this is indeed the correct answer. Otherwise the algorithm never terminates. The Halting Problem is the standard example of a semi-decidable problem. A problem is decidable if, and only if, the problem itself and its negation are semi-decidable.
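To make the glossary entry concrete, here is a minimal sketch of a CA as a lookup table. The names and the cyclic boundary are our own choices; the ring of cells stands in for the bi-infinite configuration x ∈ Σ^ℤ.

```python
# A one-dimensional CA given by a lookup table: local map rho from
# length-w windows to symbols, applied at every cell (cyclic boundary).

SIGMA = (0, 1)   # binary alphabet
W = 3            # width: nearest-neighbor rules, i.e. elementary CA

def make_rule(number):
    # rho as a dict from length-3 windows to a symbol (elementary CA
    # numbering: bit (a*4 + b*2 + c) of `number` is the output for abc)
    return {(a, b, c): (number >> (a*4 + b*2 + c)) & 1
            for a in SIGMA for b in SIGMA for c in SIGMA}

def global_map(rho, x):
    # apply the local map to every window of the (cyclic) configuration
    n = len(x)
    return tuple(rho[(x[(i-1) % n], x[i], x[(i+1) % n])] for i in range(n))

rho = make_rule(110)
x = (0, 0, 0, 1, 0, 0, 0, 0)
print(global_map(rho, x))  # (0, 0, 1, 1, 0, 0, 0, 0)
```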
Universality A computational device is universal if it is capable of simulating any other computational device. The existence of universal computers was another central insight of the early days of computability theory and is closely related to undecidability.

Reversibility A discrete dynamical system is reversible if the evolution of the system incurs no loss of information: the state at time t can be recovered from the state at time t + 1. For CAs this means that the global map is injective.

Surjectivity The global map of a CA is surjective if every configuration appears as the image of another. By contrast, a configuration that fails to have a predecessor is often referred to as a Garden-of-Eden.

Finite configurations One often considers CA with a special quiescent state: the homogeneous configuration where all cells are in the quiescent state is required to be a fixed point under the global map. Infinite configurations where all but finitely many cells are in the quiescent state are often called finite configurations. This is somewhat of a misnomer; we prefer to speak about configurations with finite support.

Definition of the Subject

Cellular automata display a large variety of behaviors. This was recognized clearly when extensive simulations of cellular automata, and in particular one-dimensional CA, became computationally feasible around 1980. Surprisingly, even when one considers only elementary CA, which are constrained to a binary alphabet and local maps involving only nearest neighbors, complicated behaviors are observed in some cases. In fact, it appears that most behaviors observed in automata with more states and larger neighborhoods already have qualitative analogues in the realm of elementary CA. Careful empirical studies led Wolfram to suggest a phenomenological classification of CA based on the long-term evolution of configurations, see [68,71] and Sect. "Introduction".
While Wolfram's four classes clearly capture some of the behavior of CA, it turns out that any attempt at formalizing this taxonomy meets with considerable difficulties. Even apparently simple questions about the behavior of CA turn out to be algorithmically undecidable, and it is highly challenging to provide a detailed mathematical analysis of these systems.

Introduction

In the early 1980's Wolfram published a collection of 20 open problems in the theory of CA, see [69]. The first problem on his list is "What overall classification of cellular automata behavior can be given?" As Wolfram points out, experimental mathematics provides a first answer to this problem: one performs a large number of explicit simulations and observes the patterns associated with the long term evolution of a configuration, see [67,71]. Wolfram proposed a classification that is based on extensive simulations, in particular of one-dimensional cellular automata where the evolution of a configuration can be visualized naturally as a two-dimensional image. The classification involves four classes that can be described as follows:
W1: Evolution leads to homogeneous fixed points.
W2: Evolution leads to periodic configurations.
W3: Evolution leads to chaotic, aperiodic patterns.
W4: Evolution produces persistent, complex patterns of localized structures.
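The classification is phenomenological, so the quickest way to get a feel for it is to print short space-time diagrams of a few rules. The sketch below uses a common but informal assignment of elementary rules to classes (32 for W1, 108 for W2, 30 for W3, 110 for W4); the helper names and ring size are our own.

```python
# Print short space-time diagrams of representative elementary CA rules.
import random

def step(rule, cells):
    # one application of the global map on a ring
    n = len(cells)
    return [(rule >> (cells[(i-1) % n]*4 + cells[i]*2 + cells[(i+1) % n])) & 1
            for i in range(n)]

random.seed(7)
init = [random.randint(0, 1) for _ in range(64)]
for rule in (32, 108, 30, 110):  # informal picks for W1..W4
    cells, lines = init[:], []
    for _ in range(16):
        lines.append(''.join('#' if c else '.' for c in cells))
        cells = step(rule, cells)
    print(f'rule {rule}:', *lines, sep='\n')
```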
Thus, Wolfram's first three classes follow closely concepts from continuous dynamics: fixed point attractors, periodic attractors and strange attractors, respectively. They correspond roughly to systems with zero temporal and spatial entropy, zero temporal entropy but positive spatial entropy, and positive temporal and spatial entropy, respectively. W4 is more difficult to associate with a continuous analogue except to say that transients are typically very long. To understand this class it is preferable to consider CA as models of massively parallel computation rather than as particular discrete dynamical systems. It was conjectured by Wolfram that W4 automata are capable of performing complicated computations and may often be computationally universal. Four examples of elementary CA that are typical of the four classes are shown in Fig. 1. Li and Packard [32,33] proposed a slightly modified version of this hierarchy by refining the low classes and in particular Wolfram's W2. Much like Wolfram's classification, the Li–Packard classification is concerned with the asymptotic behavior of the automaton, the structure and behavior of the limiting configurations. Here is one version of the Li–Packard classification, see [33].

LP1: Evolution leads to homogeneous fixed points.
LP2: Evolution leads to non-homogeneous fixed points, perhaps up to a shift.
LP3: Evolution leads to ultimately periodic configurations. Regions with periodic behavior are separated by domain walls, possibly up to a shift.
LP4: Configurations produce locally chaotic behavior. Regions with chaotic behavior are separated by domain walls, possibly up to a shift.
LP5: Evolution leads to chaotic patterns that are spatially unbounded.
LP6: Evolution is complex. Transients are long and lead to complicated space-time patterns which may be non-monotonic in their behavior.

By contrast, a classification closer to traditional dynamical systems theory was introduced by Kůrka, see [27,28]. The classification rests on the notions of equicontinuity, sensitivity to initial conditions and expansivity. Suppose x is a point in some metric space and f a map on that space. Then f is equicontinuous at x if

∀ ε > 0 ∃ δ > 0 ∀ y ∈ B_δ(x), n ∈ ℕ : d(fⁿ(x), fⁿ(y)) < ε

where d(·,·) denotes the metric. Thus, all points in a sufficiently small neighborhood of x remain close to the iterates of x for the whole orbit. Global equicontinuity is a fairly strong condition; it implies that the limit set of the automaton is reached after finitely many steps. The map is sensitive (to initial conditions) if

∃ ε > 0 ∀ x ∀ δ > 0 ∃ y ∈ B_δ(x), n ∈ ℕ : d(fⁿ(x), fⁿ(y)) ≥ ε .

Lastly, the map is positively expansive if

∃ ε > 0 ∀ x ≠ y ∃ n ∈ ℕ : d(fⁿ(x), fⁿ(y)) ≥ ε .

Kůrka's classification then takes the following form.

K1: All points are equicontinuous under the global map.
K2: Some but not all points are equicontinuous under the global map.
K3: The global map is sensitive but not positively expansive.
K4: The global map is positively expansive.

This type of classification is perfectly suited to the analysis of uncountable spaces such as the Cantor space {0,1}^ℕ or the full shift space Σ^ℤ which carry a natural metric structure. For the most part we will not pursue the analysis of CA by topological and measure theoretic means here and refer to Topological Dynamics of Cellular Automata in this volume for a discussion of these methods. See Sect. "Definability and Computability" for the connections between topology and computability.
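The metric underlying these definitions is, in one common choice, d(x, y) = 2^(−k) where k is the least |i| at which x and y disagree; two configurations are close when they agree on a large window around the origin. A small sketch (the dict representation and window cutoff are our own simplifications):

```python
# Cantor-style metric on the shift space, approximated on a finite window.

def cantor_distance(x, y, radius):
    # x, y: dicts from cell index to state, defaulting to 0 elsewhere
    for k in range(radius + 1):
        for i in (k, -k):
            if x.get(i, 0) != y.get(i, 0):
                return 2.0 ** (-k)
    return 0.0  # indistinguishable within the inspected window

print(cantor_distance({0: 1}, {0: 1, 3: 1}, 8))  # 0.125, first mismatch at |i| = 3
```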
Given the apparent complexity of observable CA behavior one might suspect that it is difficult to pinpoint the location of an arbitrary given CA in any particular classification scheme with any precision. This is in contrast to simple parameterizations of the space of CA rules such as Langton's λ parameter that are inherently easy to
Cellular Automata, Classification of, Figure 1 Typical examples of the behavior described by Wolfram’s classes among elementary cellular automata
compute. Briefly, the λ value of a local map is the fraction of local configurations that map to a non-zero value, see [29,33]. Small values result in short transients leading to fixed points or simple periodic configurations. As λ increases the transients grow longer and the orbits become more and more complex until, at last, the dynamics become chaotic. Informally, sweeping the λ value from 0 to 1 will produce CA in W1, then W2, then W4 and lastly in W3. The last transition appears to be associated with a threshold phenomenon. It is unclear what the connection between Langton's λ-value and computational properties of a CA is, see [37,46]. Other numerical measures that appear to be loosely connected to classifications are the mean field parameters of Gutowitz [20,21] and the Z-parameter of Wuensche [72], see also [44]. It seems doubtful that a structured taxonomy along the lines of Wolfram or Li–Packard can be derived from a simple numerical measure such as the λ value alone, or even from a combination of several such values. However, they may be useful as empirical evidence for membership in a particular class. Classification also becomes significantly easier when one restricts one's attention to a limited class of CA such as additive CA, see Additive Cellular Automata. In this
context, additive means that the local rule of the automaton has the form ρ(x⃗) = Σ_i c_i x_i where the coefficients as well as the states are modular numbers. A number of properties starting with injectivity and surjectivity as well as topological properties such as equicontinuity and sensitivity can be expressed in terms of simple arithmetic conditions on the rule coefficients. For example, equicontinuity is equivalent to all prime divisors of the modulus m dividing all coefficients c_i, i > 1, see [35] and the references therein. It is also noteworthy that in the linear case methods tend to carry over to arbitrary dimensions; in general there is a significant step in complexity from dimension one to dimension two. No claim is made that the given classifications are complete; in fact, one should think of them as prototypes rather than definitive taxonomies. For example, one might add the class of nilpotent CA at the bottom. A CA is nilpotent if all configurations evolve to a particular fixed point after finitely many steps. Equivalently, by compactness, there is a bound n such that all configurations evolve to the fixed point in no more than n steps. Likewise, we could add the class of intrinsically universal CA at the top. A CA is intrinsically universal if it is capable of simulating all other CA of the same dimension in some reasonable sense. For
a fairly natural notion of simulation see [45]. At any rate, considerable effort is made in the references to elaborate the characteristics of the various classes. For many concrete CA visual inspection of the orbits of a suitable sample of configurations readily suggests membership in one of the classes.

Reversibility and Surjectivity

A first tentative step towards the classification of a dynamical system is to determine its reversibility or lack thereof. Thus we are trying to determine whether the evolution of the system is associated with loss of information, or whether it is possible to reconstruct the state of the system at time t from its state at time t + 1. In terms of the global map of the system we have to decide injectivity. Closely related is the question whether the global map is surjective, i.e., whether there is no Garden-of-Eden: every configuration has a predecessor under the global map. As a consequence, the limit set of the automaton is the whole space. It was shown by Hedlund that for CA the two notions are connected: every reversible CA is also surjective, see [24], Reversible Cellular Automata. As a matter of fact, reversibility of the global map of a CA implies openness of the global map, and openness implies surjectivity. The converse implications are both false. By a well-known theorem of Hedlund [24] the global maps of CA are precisely the continuous maps that commute with the shift. It follows from basic topology that the inverse global map of a reversible CA is again the global map of a suitable CA. Hence, the predecessor configuration of a given configuration can be reconstructed by another suitably chosen CA. For results concerning reversibility on the limit set of the automaton see [61]. From the perspective of complexity the key result concerning reversible systems is the work by Lecerf [30] and Bennett [7].
They show that reversible Turing machines can compute any partial recursive function, modulo a minor technical problem: In a reversible Turing machine there is no loss of information; on the other hand even simple computable functions are clearly irreversible in the sense that, say, the sum of two natural numbers does not determine these numbers uniquely. To address this issue one has to adjust the notion of computability slightly in the context of reversible computation: given a partial recursive function f : ℕ → ℕ, the function f̂(x) = ⟨x, f(x)⟩ can be computed by a reversible Turing machine, where ⟨·,·⟩ is any effective pairing function. If f itself happens to be injective then there is no need for the coding device and f can be computed by a reversible Turing machine directly. For example, we can compute the product of two primes reversibly. Morita demonstrated that the same holds true for one-dimensional cellular automata [38,40,62], Tiling Problem and Undecidability in Cellular Automata: reversibility is no obstruction to computational universality. As a matter of fact, any irreversible cellular automaton can be simulated by a reversible one, at least on configurations with finite support. Thus one should expect reversible CA to exhibit fairly complicated behavior in general. For infinite, one-dimensional CA it was shown by Amoroso and Patt [2] that reversibility is decidable. Moreover, it is decidable if the global map is surjective. An efficient practical algorithm using concepts of automata theory can be found in [55], see also [10,14,23]. The fast algorithm is based on interpreting a one-dimensional CA as a deterministic transducer, see [6,48] for background. The underlying semi-automaton of the transducer is a de Bruijn automaton B whose states are words in Σ^(w-1) where Σ is the alphabet of the CA and w is its width. The transitions are given by ax →ᶜ xb where a, b, c ∈ Σ, x ∈ Σ^(w-2) and c = ρ(axb), ρ being the local map of the CA. Since B is strongly connected, the product automaton of B will contain a strongly connected component C that contains the diagonal D, an isomorphic copy of B. The global map of the CA is reversible if, and only if, C = D is the only non-trivial component. It was shown by Hedlund [24] that surjectivity of the global map is equivalent to local injectivity: the restriction of the map to configurations with finite support must be injective. The latter property holds if, and only if, C = D, and is thus easily decidable. Automata theory does not readily generalize to words of dimension higher than one. Indeed, reversibility and surjectivity in dimensions higher than one are undecidable, see [26] and Tiling Problem and Undecidability in Cellular Automata in this volume for the rather intricate argument needed to establish this fact.
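A surjectivity test along these automata-theoretic lines can be sketched in a few lines. Note this is not the pair-automaton algorithm described above but an equally standard variant: a CA is surjective iff every finite word has a preimage, which we check by a subset construction over the de Bruijn states, looking for a reachable empty set (all names below are ours; binary alphabet, width 3).

```python
# Subset-construction surjectivity test for elementary CA: follow required
# output symbols through sets of de Bruijn states; an empty reachable set
# certifies a word with no preimage (a Garden-of-Eden pattern).

def eca_local(rule):
    # local map of elementary CA `rule`: (a, b, c) -> bit
    return lambda a, b, c: (rule >> (a*4 + b*2 + c)) & 1

def is_surjective(rule):
    f = eca_local(rule)
    full = frozenset((a, b) for a in (0, 1) for b in (0, 1))  # words of length w-1
    seen, frontier = {full}, [full]
    while frontier:
        S = frontier.pop()
        for c in (0, 1):  # required output symbol
            # successors (b, x) of states (a, b) whose transition outputs c
            T = frozenset((b, x) for (a, b) in S for x in (0, 1)
                          if f(a, b, x) == c)
            if not T:
                return False  # some finite word has no preimage
            if T not in seen:
                seen.add(T)
                frontier.append(T)
    return True  # every word has a preimage; surjective by compactness

# Rule 90 (additive) is surjective; rule 0 maps everything to the zero configuration.
print(is_surjective(90), is_surjective(0))  # True False
```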
While the structure of reversible one-dimensional CA is well-understood, see Tiling Problem and Undecidability in Cellular Automata, [16], and while there is an efficient algorithm to check reversibility, few methods are known that allow for the construction of interesting reversible CA. There is a noteworthy trick due to Fredkin that exploits the reversibility of the Fibonacci equation X_{n+1} = X_n + X_{n-1}. When addition is interpreted as exclusive or, this can be used to construct a second-order CA from any given binary CA; the former can then be recoded as a first-order CA over a 4-letter alphabet. For example, for the open but irreversible elementary CA number 90 we obtain the CA shown in Fig. 2. Another interesting class of reversible one-dimensional CA, the so-called partitioned cellular automata (PCA), is due to Morita and Harao, see [38,39,40]. One
Cellular Automata, Classification of, Figure 2 A reversible automaton obtained by applying Fredkin’s construction to the irreversible elementary CA 77
can think of a PCA as a cellular automaton whose cells are divided into multiple tracks; specifically Morita uses an alphabet of the form Σ = Σ₁ × Σ₂ × Σ₃. The configurations of the automaton can be written as (X, Y, Z) where X ∈ Σ₁^ℤ, Y ∈ Σ₂^ℤ and Z ∈ Σ₃^ℤ. Now consider the shearing map σ defined by σ(X, Y, Z) = (RS(X), Y, LS(Z)) where RS and LS denote the right and left shift, respectively. Given any function f : Σ → Σ we can define a global map f ∘ σ, where f is assumed to be applied point-wise. Since the shearing map is bijective, the CA will be reversible if, and only if, the map f is bijective. It is relatively easy to construct bijections f that cause the CA to perform particular computational tasks, even when a direct construction appears to be entirely intractable.
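A toy version of this construction can be sketched directly: shear three binary tracks on a ring, then apply a cell-wise permutation. The particular permutation below is an arbitrary example of ours, not one of Morita's computationally meaningful bijections; the point is only that any bijective f yields a reversible global map.

```python
# Toy partitioned CA in the Morita-Harao style: shear (right shift, identity,
# left shift on a ring), then apply a state permutation f cell-wise.
from itertools import product
import random

triples = list(product((0, 1), repeat=3))
perm = dict(zip(triples, triples[1:] + triples[:1]))  # arbitrary bijection f
inv = {v: k for k, v in perm.items()}                 # its inverse

def step(cfg, f):
    n = len(cfg)
    # first track comes from the left neighbor, third from the right
    sheared = [(cfg[(i - 1) % n][0], cfg[i][1], cfg[(i + 1) % n][2])
               for i in range(n)]
    return [f[s] for s in sheared]

def back_step(cfg, f_inv):
    n = len(cfg)
    undone = [f_inv[s] for s in cfg]  # undo f, then undo the shear
    return [(undone[(i + 1) % n][0], undone[i][1], undone[(i - 1) % n][2])
            for i in range(n)]

random.seed(0)
cfg = [random.choice(triples) for _ in range(16)]
out = cfg
for _ in range(25):
    out = step(out, perm)
for _ in range(25):
    out = back_step(out, inv)
assert out == cfg  # 25 steps forward, 25 back: no information was lost
```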
Definability and Computability

Formalizing Wolfram's Classes

Wolfram's classification is an attempt to categorize the complexity of the CA by studying the patterns observed during the long-term evolution of all configurations. The first two classes are relatively easy to observe, but it is difficult to distinguish between the last two classes. In particular W4 is closely related to the kind of behavior that would be expected in connection with systems that are capable of performing complicated computations, including the ability to perform universal computation; a property that is notoriously difficult to check, see [52]. The focus on the full configuration space rather than a significant subset thereof corresponds to the worst-case approach well-known in complexity theory and is somewhat inferior to an average case analysis. Indeed, Baldwin and Shelah point out that a product construction can be used to design a CA whose behavior is an amalgamation of the behavior of two given CA, see [3,4]. By combining CA in different classes one obtains striking examples of the weakness of the worst-case approach. A natural example of this mixed type of behavior is elementary CA 184 which displays class II or class III behavior, depending on the initial configuration. Another basic example for this type of behavior is the well-studied elementary CA 30, see Sect. "Conclusion". Still, for many CA a worst-case classification seems to provide useful information about the structural properties of the automaton. The first attempt at formalizing Wolfram's classes was made by Culik and Yu who proposed the following hierarchy, given here in cumulative form, see [11]:

CY1: All configurations evolve to a fixed point.
CY2: All configurations evolve to a periodic configuration.
CY3: The orbits of all configurations are decidable.
CY4: No constraints.

The Culik–Yu classification employs two rather different methods. The first two classes can be defined by a simple formula in a suitable logic whereas the third (and the fourth in the disjoint version of the hierarchy) rely on notions of computability theory. As a general framework for both approaches we consider discrete dynamical systems, structures of the form A = ⟨C, →⟩ where C ⊆ Σ^ℤ is the space of configurations of the system and → is the "next configuration" relation on C. We will only consider the deterministic case where for each configuration x there exists precisely one configuration y such that x → y. Hence we are really dealing with algebras with one unary function, but iteration is slightly easier to deal with in the relational setting. The structures most important in this context are the ones arising from a CA.
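Although membership in CY1 and CY2 is undecidable in general (as discussed later in this article), the conditions can be probed empirically on finite rings, where every orbit eventually enters a cycle. A brute-force sketch of our own, evidence only, not a decision procedure:

```python
# Empirical probe of the first two Culik-Yu classes on a ring of size n:
# does every configuration evolve to a fixed point (CY1-like behavior)?
from itertools import product

def step(rule, cells):
    n = len(cells)
    return tuple((rule >> (cells[(i-1) % n]*4 + cells[i]*2 + cells[(i+1) % n])) & 1
                 for i in range(n))

def all_reach_fixed_point(rule, n):
    for cells in product((0, 1), repeat=n):
        seen = set()
        while cells not in seen:       # iterate until the orbit repeats
            seen.add(cells)
            cells = step(rule, cells)
        if step(rule, cells) != cells:  # orbit ends in a cycle of length > 1
            return False
    return True

# Rule 128 erodes blocks of 1s to a fixed point; rule 1 flips the
# all-zero and all-one configurations into a 2-cycle.
print(all_reach_fixed_point(128, 8), all_reach_fixed_point(1, 8))  # True False
```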
For any local map ρ we consider the structure A = ⟨C, →⟩ where the next configuration relation is determined by x → ρ(x). Using the standard language of first order logic we can readily express properties of the CA in terms of the system A. For example, the system is reversible, respectively surjective, if the following assertions are valid over A:

∀ x, y, z (x → z and y → z implies x = y) ;
∀ x ∃ y (y → x) .

As we have seen, both properties are easily decidable in the one-dimensional case. In fact, one can express the basic predicate x → y (as well as equality) in terms of finite state machines on infinite words. These machines are defined like ordinary finite state machines but the acceptance condition requires that certain states are reached infinitely and co-infinitely often, see [8,19]. The emptiness problem for these automata is easily decidable using graph theoretic algorithms. Since regular languages on infinite words are closed under union, complementation and projection, much like their finite counterparts, and all the corresponding operations on automata are effective, it follows that one can decide the validity of first order sentences over A such as the two examples above: the model-checking problem for these structures and first order logic is decidable, see [34]. For example, we can decide whether there is a configuration that has a certain number of predecessors. Alternatively, one can translate these sentences into monadic second order logic of one successor, and use well-known automata-based decision algorithms there directly, see [8]. Similar methods can be used to handle configurations with finite support, corresponding to weak monadic second order logic. Since the complexity of the decision procedure is non-elementary one should not expect to be able to handle complicated assertions. On the other hand, at least for weak monadic second order logic practical implementations of the decision method exist, see [17]. There is no hope of generalizing this approach as the undecidability of, say, reversibility in higher dimensions demonstrates. Write x →ᵗ y if x evolves to y in exactly t steps, x →⁺ y if x evolves to y in any positive number of steps and x →* y if x evolves to y in any number of steps. Note that →ᵗ is definable for each fixed t, but →* fails to be so definable in first order logic. This is in analogy to the undefinability of path existence problems in the first order theory of graphs, see [34].
Hence it is natural to extend our language so we can express iterations of the global map, either by adding transitive closures or by moving to some limited system of
higher order logic over A where →* is definable, see [8]. Arguably the most basic decision problem associated with a system A that requires iteration of the global map is the Reachability Problem: given two configurations x and y, does the evolution of x lead to y? A closely related but different question is the Confluence Problem: will two configurations x and y evolve to the same limit cycle? Confluence is an equivalence relation and allows for the decomposition of configuration space into limit cycles together with their basins of attraction. The Reachability and Confluence Problems amount to determining, given configurations x and y, whether
x →* y ;
∃ z (x →* z and y →* z) ;
respectively. As another example, the first two Culik–Yu classes can be defined like so:
∀ x ∃ z (x →* z and z → z) ;
∀ x ∃ z (x →* z and z →⁺ z) .

It is not difficult to give similar definitions for the lower Li–Packard classes if one extends the language by a function symbol denoting the shift operator. The third Culik–Yu class is somewhat more involved. By definition, a CA lies in the third class if it admits a global decision algorithm to determine whether a given configuration x evolves to another given configuration y in a finite number of steps. In other words, we are looking for automata where the Reachability Problem is algorithmically solvable. While one can agree that W4 roughly translates into undecidability and is thus properly situated in the hierarchy, it is unclear how chaotic patterns in W3 relate to decidability. No method is known to translate the apparent lack of tangible, persistent patterns in rules such as elementary CA 30 into decision algorithms for Reachability. There is another, somewhat more technical problem to overcome in formalizing classifications. Recall that the full configuration space is C = Σ^ℤ. Intuitively, given x ∈ C we can effectively determine the next configuration y = ρ(x). However, classical computability theory does not deal with infinitary objects such as arbitrary configurations, so a bit of care is needed here. The key insight is that we can determine arbitrary finite segments of ρ(x) using only finite segments of x (and, of course, the lookup table for the local map). There are several ways to model computability on Σ^ℤ based on this idea of finite approximations; we refer to [66] for a particularly appealing model based on so-called type-2 Turing machines; the reference also contains many pointers to the literature as well as a comparison between the different approaches. It is easy to see that for any CA the global map ρ as well as all its iterates ρᵗ are computable, the latter uniformly in t.
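On spatially periodic configurations, Reachability and Confluence are decidable by brute force, since a deterministic orbit on a finite ring must eventually repeat. A sketch of ours for elementary CA (the ring size is fixed by the input tuple):

```python
# Reachability and Confluence on a ring: iterate the global map until the
# orbit repeats; the limit cycle is then computed explicitly.

def step(rule, cells):
    n = len(cells)
    return tuple((rule >> (cells[(i-1) % n]*4 + cells[i]*2 + cells[(i+1) % n])) & 1
                 for i in range(n))

def reaches(rule, x, y):
    seen = set()
    while x not in seen:
        if x == y:
            return True
        seen.add(x)
        x = step(rule, x)
    return False  # orbit repeated without hitting y

def limit_cycle(rule, x):
    seen = set()
    while x not in seen:   # drive x onto its limit cycle
        seen.add(x)
        x = step(rule, x)
    cyc, y = {x}, step(rule, x)
    while y != x:          # collect the whole cycle
        cyc.add(y)
        y = step(rule, y)
    return frozenset(cyc)

def confluent(rule, x, y):
    return limit_cycle(rule, x) == limit_cycle(rule, y)

# Rule 0 sends everything to the zero configuration; rule 204 is the identity.
print(reaches(0, (1, 0, 1, 1), (0, 0, 0, 0)), confluent(204, (1, 0), (0, 1)))  # True False
```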
However, due to the finitary nature of all computations, equality is not decidable in type-2 computability: the unequality operator U₀(x, y) = 0 if x ≠ y, U₀(x, y) undefined otherwise, is computable and thus unequality is semi-decidable, but the stronger U₀(x, y) = 0 if x ≠ y, U₀(x, y) = 1 otherwise, is not computable. The last result is perhaps somewhat counterintuitive, but it is inevitable if we strictly adhere to the finite approximation principle. In order to avoid problems of this kind it has become customary to consider certain subspaces of the full configuration space, in particular C_fin, the collection of configurations with finite support, C_per, the collection of spatially periodic configurations and C_ap, the collection of almost periodic configurations of the form … u u u w v v v … where u, v and w are all finite words over the alphabet of the automaton. Thus, an almost periodic configuration differs from a configuration of the form ωu vω (u repeated indefinitely to the left, v to the right) in only finitely many places. Configurations with finite support correspond to the special case where u = v = 0, with 0 a special quiescent symbol, and spatially periodic configurations correspond to u = v, w = ε. The most general type of configuration that admits a finitary description is the class C_rec of recursive configurations, where the assignment of state to a cell is given by a computable function. It is clear that all these subspaces are closed under the application of a global map. Except for C_fin they are also closed under inverse maps in the following sense: given a configuration y in some subspace that has a predecessor x in C_all there already exists a predecessor in the same subspace, see [55,58]. This is obvious except in the case of recursive configurations. The reference also shows that the recursive predecessor cannot be computed effectively from the target configuration. Thus, for computational purposes the dynamics of the cellular automaton are best reflected in C_ap: it includes all configurations with finite support and we can effectively trace an orbit in both directions. It is not hard to see that C_ap is the least such class. Alas, it is standard procedure to avoid minor technical difficulties arising from the infinitely repeated spatial patterns and establish classifications over the subspace C_fin. There is arguably not much harm in this simplification since C_fin is a dense subspace of C_all and compactness can be used to lift properties from C_fin to the full configuration space. The Culik–Yu hierarchy is correspondingly defined over C_fin, the class of all configurations of finite support.
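Configurations in C_fin have a finite description on which the global map acts effectively: with a quiescent state 0, the support can grow by at most one cell on each side per step. A small sketch of ours, representing a configuration by its support:

```python
# Effective dynamics on configurations with finite support, assuming a
# quiescent state 0 (i.e. rho(0,0,0) = 0).

def step_finite(rho, support):
    # support: dict index -> nonzero state; returns the image's support
    if rho[(0, 0, 0)] != 0:
        raise ValueError('state 0 is not quiescent')
    if not support:
        return {}
    lo, hi = min(support) - 1, max(support) + 1
    out = {}
    for i in range(lo, hi + 1):
        v = rho[(support.get(i-1, 0), support.get(i, 0), support.get(i+1, 0))]
        if v:
            out[i] = v
    return out

rho = {(a, b, c): (90 >> (a*4 + b*2 + c)) & 1
       for a in (0, 1) for b in (0, 1) for c in (0, 1)}
print(step_finite(rho, {0: 1}))  # {-1: 1, 1: 1}: rule 90 doubles a single point
```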
In this setting, the first three classes of this hierarchy are undecidable and the fourth is undecidable in the disjunctive version: there is no algorithm to test whether a CA admits undecidable orbits. As it turns out, the CA classes are complete in their natural complexity classes within the arithmetical hierarchy [50,52]. Checking membership in the first two classes comes down to performing an infinite number of potentially unbounded searches and can be described logically by a Π₂ expression, a formula of type ∀ x ∃ y R(x, y) where R is a decidable predicate. Indeed, CY1 and CY2 are both Π₂-complete. Thus, deciding whether all configurations of a CA evolve to a fixed point is equivalent to the classical problem of determining whether a semi-decidable set is infinite. The third class is even less amenable to algorithmic attack; one can show that CY3 is Σ₃-complete, see [53]. Thus, deciding whether all orbits are decidable is as difficult as determining whether any given semi-decidable set is decidable. It is
Cellular Automata, Classification of
not difficult to adjust these undecidability results to similar classes such as the lower levels of the Li–Packard hierarchy, which takes into account spatial displacements of patterns.

Effective Dynamical Systems and Universality

The key property of CA that is responsible for all these undecidability results is the fact that CA are capable of performing arbitrary computations. This is unsurprising when one defines computability in terms of Turing machines, the devices introduced by Turing in the 1930s, see [47,63]. Unlike the Gödel–Herbrand approach using general recursive functions or Church's λ-calculus, Turing's devices are naturally closely related to discrete dynamical systems. For example, we can express an instantaneous description of a Turing machine as a finite sequence a_{-l} a_{-l+1} … a_{-1} p a_1 a_2 … a_r where the a_i are tape symbols and p is a state of the machine, with the understanding that the head is positioned at a_1 and that all unspecified tape cells contain the blank symbol. Needless to say, these Turing machine configurations can also be construed as finite-support configurations of a one-dimensional CA. It follows that a one-dimensional CA can be used to simulate an arbitrary Turing machine; hence CA are computationally universal: any computable function whatsoever can already be computed by a CA. Note, though, that the simulation is not entirely trivial. First, we have to rely on input/output conventions. For example, we may insist that objects in the input domain, typically tuples of natural numbers, are translated into a configuration of the CA by a primitive recursive coding function. Second, we need to adopt some convention that determines when the desired output has occurred: we follow the evolution of the input configuration until some "halting" condition applies. Again, this condition must be primitive recursively decidable, though there is considerable leeway as to how the end of a computation should be signaled by the CA.
For example, we could insist that a particular cell reaches a special state, that an arbitrary cell reaches a special state, that the configuration be a fixed point, and so forth. Lastly, if and when a halting configuration is reached, we apply a primitive recursive decoding function to obtain the desired output. Restricting the space to configurations that have finite support, that are spatially periodic, and so forth, produces an effective dynamical system: the configurations can be coded as integers in some natural way, and the next-configuration relation is primitive recursive in the sense that the corresponding relation on code numbers is primitive recursive. A classical example of an effective dynamical system is obtained by selecting the instantaneous descriptions of a Turing machine M as configurations, with the one-step relation of the Turing machine as the operation. Thus we obtain a system A_M whose orbits represent the computations of the Turing machine. Likewise, given the local map of a CA we obtain a system whose operation is the induced global map. While the full configuration space Call violates the effectiveness condition, any of the spaces Cper, Cfin, Cap and Crec will give rise to an effective dynamical system. Closure properties, as well as recent work on the universality of elementary CA 110, see Sect. "Conclusion", suggest that the class of almost periodic configurations, also known as backgrounds or wallpapers, see [9,58], is perhaps the most natural setting. Both Cfin and Cap provide a suitable setting for a CA that simulates a Turing machine: we can interpret A_M as a subspace of the system induced by a suitably constructed one-dimensional CA; the orbits of the subspace encode computations of the Turing machine. It follows from the undecidability of the Halting Problem for Turing machines that the Reachability Problem for these particular CA is undecidable. Note, though, that orbits in A_M may well be finite, so some care must be taken in setting up the simulation. For example, one can translate halting configurations into fixed points. Another problem is caused by the worst-case nature of our classification schemes: in Turing machines and their associated systems A_M, only the behavior on specially prepared initial configurations matters, whereas the behavior of a CA depends on all configurations. The behavior of a Turing machine on all instantaneous descriptions, rather than just the ones that can occur during a legitimate computation on some actual input, was first studied by Davis, see [12,13], and also Hooper [25].
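The one-step relation of the system A_M discussed above can be sketched concretely. The following illustrative code (the encoding and names are assumptions, not from the source) represents an instantaneous description a_{-l} … a_{-1} p a_1 … a_r as a pair of strings around the state, with the head scanning the first symbol of the right part:

```python
BLANK = '_'

def tm_step(delta, left, state, right):
    """One step of a Turing machine: delta maps (state, scanned symbol)
    to (new state, written symbol, direction 'L' or 'R')."""
    scanned = right[0] if right else BLANK
    state, written, move = delta[(state, scanned)]
    rest = right[1:] if right else ''
    if move == 'R':
        return left + written, state, rest
    else:  # move left: pop the last symbol of the left part under the head
        head = left[-1] if left else BLANK
        return left[:-1], state, head + written + rest

# A toy machine that erases 1s while moving right:
delta = {('q', '1'): ('q', '0', 'R')}
print(tm_step(delta, '', 'q', '11'))   # → ('0', 'q', '1')
```

Iterating `tm_step` traces exactly the orbit of A_M; undefined (state, symbol) pairs can serve as the halting condition.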
Call a Turing machine stable if it halts on any instantaneous description whatsoever. With some extra care one can then construct a CA that lies in the first Culik–Yu class, yet has the same computational power as the Turing machine. Davis showed that every total recursive function can already be computed by a stable Turing machine, so membership in CY1 is not an impediment to considerable computational power. The argument rests on a particular decomposition of recursive functions. Alternatively, one can directly manipulate Turing machines to obtain a similar result, see [49,53]. On the other hand, unstable Turing machines yield a natural and coding-free definition of universality: a Turing machine is Davis-universal if the set of all instantaneous descriptions on which the machine halts is Σ₁-complete. The mathematical theory of infinite CA is arguably more elegant than the actually observable finite case. As
a consequence, classifications are typically concerned with CA operating on infinite grids, so that even a configuration with finite support can carry arbitrarily much information. If we restrict our attention to the space of configurations on a finite grid, a more fine-grained analysis is required. For a finite grid of size n, the configuration space has the form C_n = [n] → Σ and is itself finite; hence any orbit is ultimately periodic and the Reachability Problem is trivially decidable. However, in practice there is little difference between the finite and infinite cases. First, computational complexity issues make it practically impossible to analyze even systems of modest size. The Reachability Problem for finite CA, while decidable, is PSPACE-complete even in the one-dimensional case. Computational hardness appears in many other places. For example, if we try to determine whether a given configuration on a finite grid is a Garden-of-Eden, the problem turns out to be NLOG-complete in dimension one and NP-complete in all higher dimensions, see [56]. Second, it stands to reason that the more interesting classification problem in the finite case takes the following parameterized form: given a local map together with boundary conditions, determine the behavior of the induced global maps on all finite grids. Under periodic boundary conditions this comes down to the study of Cper, and it seems that there is little difference between this and the fixed-boundary case. Since all orbits on a finite grid are ultimately periodic, one needs to apply a more fine-grained classification that takes into account transient lengths. It is undecidable whether all configurations on all finite grids evolve to a fixed point under a given local map, see [54]. Thus, there is no algorithm to determine whether
⟨C_n⟩ ⊨ ∀x ∃z (x → z and z → z)

for all grid sizes n. The transient lengths are trivially bounded by k^n, where k is the size of the alphabet of the automaton. It is undecidable whether the transient lengths grow according to some polynomial bound, even when the polynomial in question is constant. Restrictions of the configuration space are one way to obtain an effective dynamical system. Another is to interpret the approximation-based notion of computability on the full space in terms of topology. It is well known that computable maps Call → Call are continuous in the standard product topology. The clopen sets in this topology are the finite unions of cylinder sets, where a cylinder set is determined by the values of a configuration in finitely many places. By a celebrated result of Hedlund, the global maps of a CA on the full space are characterized by being continuous and shift-invariant. Perhaps somewhat counter-intuitively, the decidable subsets of Call
are quite weak: they consist precisely of the clopen sets. Now consider a partition of Call into finitely many clopen sets C_0, C_1, …, C_{n−1}. Thus, it is decidable which block of the partition a given point in the space belongs to. Moreover, Boolean operations on clopen sets, as well as application of the global map and the inverse global map, are all computable. The partition affords a natural projection π : Call → Σ_n, where Σ_n = {0, 1, …, n − 1} and π(x) = i iff x ∈ C_i. Hence the projection translates orbits in the full space Call into a class W of ω-words over Σ_n, the symbolic orbits of the system. The Cantor space Σ_n^Z together with the shift describes all logically possible orbits with respect to the given partition, and W describes the symbolic orbits that actually occur in the given CA. The shift operator corresponds to an application of the global map of the CA. The finite factors of W provide information about possible finite traces of an orbit when filtered through the given partition. Whole orbits, again filtered through the partition, can be described by ω-words. To tackle the classification of the CA in terms of W, it was suggested by Delvenne et al., see [15], to call the CA decidable if it is decidable whether W has nonempty intersection with a given ω-regular language. Alas, decidability in this sense is very difficult, its complexity being Σ¹₁-complete and thus outside of the arithmetical hierarchy. Likewise, it is suggested to call a CA universal if the corresponding decision problem for the cover of W, the collection of all finite factors, is Σ₁-complete, in analogy to Davis-universality.

Computational Equivalence

In recent work, Wolfram suggests a so-called Principle of Computational Equivalence, or PCE for short, see [71], p. 717.
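Returning for a moment to finite grids: whether every configuration in C_n evolves to a fixed point is undecidable uniformly in the local map, but for a fixed rule and fixed n it can be checked by brute force. A minimal sketch (illustrative, not from the source), for radius-1 rules with states 0..k−1 under periodic boundary conditions, which also reports the worst transient length (trivially at most k^n):

```python
from itertools import product

def fixed_point_analysis(rule, k, n):
    """Check whether every configuration of the cyclic grid of size n
    evolves to a fixed point under the radius-1 rule; also return the
    worst transient length observed (bounded by k**n)."""
    def step(x):
        return tuple(rule(x[i - 1], x[i], x[(i + 1) % n]) for i in range(n))
    all_fixed, worst = True, 0
    for x in product(range(k), repeat=n):
        seen = {}
        while x not in seen:
            seen[x] = len(seen)     # time step at which x first occurs
            x = step(x)
        worst = max(worst, seen[x])  # time the eventual cycle is entered
        if step(x) != x:             # the cycle is longer than a fixed point
            all_fixed = False
    return all_fixed, worst
```

For example, the OR rule `lambda a, b, c: int(a or b or c)` drives every configuration to a fixed point, while the shift rule `lambda a, b, c: a` does not; the exhaustive loop is exactly why even this decidable check is infeasible beyond modest n.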
PCE states that most computational processes come in only two flavors: they are either of a very simple kind and avoid undecidability, or they represent a universal computation and are therefore no less complicated than the Halting Problem. Thus, Wolfram proposes a zero-one law: almost all computational systems, and thus in particular all CA, are either as complicated as a universal Turing machine or are computationally simple. As evidence for PCE, Wolfram adduces a very large collection of simulations of various effective dynamical systems such as Turing machines, register machines, tag systems, rewrite systems, combinators, and cellular automata. It is pointed out in Chap. 3 of [71] that in all these classes of systems there are surprisingly small examples that exhibit exceedingly complicated behavior – and presumably are capable of universal computation. Thus it is conceivable that universality is a rather common property, a property that is
indeed shared by all systems that are not obviously simple. Of course, it is often very difficult to give a complete proof of the computational universality of a natural system, as opposed to a carefully constructed one, so it is not entirely clear how many of Wolfram's examples are in fact universal. As a case in point, consider the universality proof of Conway's Game of Life, or the argument for elementary CA 110. If Wolfram's PCE can be formally established in some form, it stands to reason that it will apply to all effective dynamical systems and in particular to CA. Hence, classifications of CA would be rather straightforward: at the top there would be the class of universal CA, directly preceded by a class similar to the third Culik–Yu class, plus a variety of subclasses along the lines of the lower Li–Packard classes. The corresponding problem in classical computability theory was first considered in the 1930s by Post and is now known as Post's Problem: is there a semi-decidable set that fails to be decidable, yet is not as complicated as the Halting Set? In terms of Turing degrees, the problem is to construct a semi-decidable set A such that ∅ <_T A <_T ∅′.

Future Directions

This short survey has only been able to hint at the vast wealth of emergent phenomena that arise in CA. Much work yet remains to be done, in classifying the different structures, identifying general laws governing their behavior, and determining the causal mechanisms that lead them to arise. For example, there are as yet no general techniques for determining whether a given domain is stable in a given CA; for characterizing the set of initial conditions that will eventually give rise to it; or for working out the particles that it supports. In CA of two or more dimensions, a large body of descriptive results is available, but these are more frequently anecdotal than systematic.
A significant barrier to progress has been the lack of good mathematical techniques for identifying, describing, and classifying domains. One promising development in this area is an information-theoretic filtering technique that can operate on configurations of any dimension [13].

Bibliography

Primary Literature
1. Boccara N, Nasser J, Roger M (1991) Particlelike structures and their interactions in spatio-temporal patterns generated by one-dimensional deterministic cellular automaton rules. Phys Rev A 44:866
2. Chate H, Manneville P (1992) Collective behaviors in spatially extended systems with local interactions and synchronous updating. Prog Theor Phys 87:1
3. Crutchfield JP, Hanson JE (1993) Turbulent pattern bases for cellular automata. Physica D 69:279
4. Eloranta K, Nummelin E (1992) The kink of cellular automaton rule 18 performs a random walk. J Stat Phys 69:1131
5. Fisch R, Gravner J, Griffeath D (1991) Threshold-range scaling of excitable cellular automata. Stat Comput 1:23–39
6. Gallas J, Grassberger P, Hermann H, Ueberholz P (1992) Noisy collective behavior in deterministic cellular automata. Physica A 180:19
7. Grassberger P (1984) Chaos and diffusion in deterministic cellular automata. Physica D 10:52
8. Griffeath D (2008) The primordial soup kitchen. http://psoup.math.wisc.edu/kitchen.html
9. Hanson JE, Crutchfield JP (1992) The attractor-basin portrait of a cellular automaton. J Stat Phys 66:1415
10. Hanson JE, Crutchfield JP (1997) Computational mechanics of cellular automata: An example. Physica D 103:169
11. Hopcroft JE, Ullman JD (1979) Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading
12. Lindgren K, Moore C, Nordahl M (1998) Complexity of two-dimensional patterns. J Stat Phys 91:909
13. Shalizi C, Haslinger R, Rouquier J, Klinkner K, Moore C (2006) Automatic filters for the detection of coherent structure in spatiotemporal systems. Phys Rev E 73:036104
14. Wojtowicz M (2008) Mirek's cellebration. http://www.mirekw.com/ca/
15. Wolfram S (1984) Universality and complexity in cellular automata. Physica D 10:1
Books and Reviews
Das R, Crutchfield JP, Mitchell M, Hanson JE (1995) Evolving Globally Synchronized Cellular Automata. In: Eshelman LJ (ed) Proceedings of the Sixth International Conference on Genetic Algorithms. Morgan Kaufmann, San Mateo
Gerhardt M, Schuster H, Tyson J (1990) A cellular automaton model of excitable media including curvature and dispersion. Science 247:1563
Gutowitz HA (1991) Transients, Cycles, and Complexity in Cellular Automata. Phys Rev A 44:R7881
Henze C, Tyson J (1996) Cellular automaton model of three-dimensional excitable media. J Chem Soc Faraday Trans 92:2883
Hordijk W, Shalizi C, Crutchfield J (2001) Upper bound on the products of particle interactions in cellular automata. Physica D 154:240
Iooss G, Helleman RH, Stora R (eds) (1983) Chaotic Behavior of Deterministic Systems. North-Holland, Amsterdam
Ito H (1988) Intriguing Properties of Global Structure in Some Classes of Finite Cellular Automata. Physica D 31:318
Jen E (1986) Global Properties of Cellular Automata. J Stat Phys 43:219
Kaneko K (1986) Attractors, Basin Structures and Information Processing in Cellular Automata. In: Wolfram S (ed) Theory and Applications of Cellular Automata. World Scientific, Singapore, pp 367
Langton C (1990) Computation at the Edge of Chaos: Phase transitions and emergent computation. Physica D 42:12
Lindgren K (1987) Correlations and Random Information in Cellular Automata. Complex Syst 1:529
Lindgren K, Nordahl M (1988) Complexity Measures and Cellular Automata. Complex Syst 2:409
Lindgren K, Nordahl M (1990) Universal Computation in Simple One-Dimensional Cellular Automata. Complex Syst 4:299
Mitchell M (1998) Computation in Cellular Automata: A Selected Review. In: Schuster H, Gramms T (eds) Nonstandard Computation. Wiley, New York
Cellular Automata, Emergent Phenomena in
Packard NH (1984) Complexity in Growing Patterns in Cellular Automata. In: Demongeot J, Goles E, Tchuente M (eds) Dynamical Behavior of Automata: Theory and Applications. Academic Press, New York
Packard NH (1985) Lattice Models for Solidification and Aggregation. Proceedings of the First International Symposium on Form, Tsukuba
Pivato M (2007) Defect Particle Kinematics in One-Dimensional Cellular Automata. Theor Comput Sci 377:205–228
Weimar J (1997) Cellular automata for reaction-diffusion systems. Parallel Comput 23:1699
Wolfram S (1984) Computation Theory of Cellular Automata. Comm Math Phys 96:15
Wolfram S (1986) Theory and Applications of Cellular Automata. World Scientific, Singapore
Wuensche A, Lesser MJ (1992) The Global Dynamics of Cellular Automata. Santa Fe Institute Studies in the Sciences of Complexity, Reference vol 1. Addison-Wesley, Redwood City
Cellular Automata and Groups
Cellular Automata and Groups

Tullio Ceccherini-Silberstein¹, Michel Coornaert²
¹ Dipartimento di Ingegneria, Università del Sannio, Benevento, Italy
² Institut de Recherche Mathématique Avancée, Université Louis Pasteur et CNRS, Strasbourg, France

Article Outline
Glossary
Definition of the Subject
Introduction
Cellular Automata
Cellular Automata with a Finite Alphabet
Linear Cellular Automata
Group Rings and Kaplansky Conjectures
Future Directions
Bibliography

Glossary

Groups A group is a set G endowed with a binary operation G × G ∋ (g, h) ↦ gh ∈ G, called the multiplication, that satisfies the following properties: (i) for all g, h and k in G, (gh)k = g(hk) (associativity); (ii) there exists an element 1_G ∈ G (necessarily unique) such that, for all g in G, 1_G g = g 1_G = g (existence of the identity element); (iii) for each g in G, there exists an element g⁻¹ ∈ G (necessarily unique) such that g g⁻¹ = g⁻¹ g = 1_G (existence of inverses). A group G is said to be Abelian (or commutative) if the operation is commutative, that is, for all g, h ∈ G one has gh = hg. A group F is called free if there is a subset S ⊆ F such that any element g of F can be uniquely written as a reduced word on S, i.e. in the form g = s₁^{α₁} s₂^{α₂} ⋯ s_n^{α_n}, where n ≥ 0, s_i ∈ S and α_i ∈ Z \ {0} for 1 ≤ i ≤ n, and such that s_i ≠ s_{i+1} for 1 ≤ i ≤ n − 1. Such a set S is called a free basis for F. The cardinality of S is an invariant of the group F and is called the rank of F. A group G is finitely generated if there exists a finite subset S ⊆ G such that every element g ∈ G can be expressed as a product of elements of S and their inverses, that is, g = s₁^{ε₁} s₂^{ε₂} ⋯ s_n^{ε_n}, where n ≥ 0, s_i ∈ S and ε_i = ±1 for 1 ≤ i ≤ n. The minimal n for which such an expression exists is called the word length of g with respect to S and is denoted by ℓ(g). The group G is a (discrete) metric space with the
distance function d : G × G → ℝ₊ defined by setting d(g, g′) = ℓ(g⁻¹g′) for all g, g′ ∈ G. The set S is called a finite generating subset for G, and one says that S is symmetric provided that s ∈ S implies s⁻¹ ∈ S. The Cayley graph of a finitely generated group G with respect to a symmetric finite generating subset S ⊆ G is the (undirected) graph Cay(G, S) with vertex set G in which two elements g, g′ ∈ G are joined by an edge if and only if g⁻¹g′ ∈ S. A group G is residually finite if the intersection of all subgroups of G of finite index is trivial. A group G is amenable if it admits a right-invariant mean, that is, a map μ : P(G) → [0, 1], where P(G) denotes the set of all subsets of G, satisfying the following conditions: (i) μ(G) = 1 (normalization); (ii) μ(A ∪ B) = μ(A) + μ(B) for all A, B ∈ P(G) such that A ∩ B = ∅ (finite additivity); (iii) μ(Ag) = μ(A) for all g ∈ G and A ∈ P(G) (right-invariance).

Rings A ring is a set R equipped with two binary operations R × R ∋ (a, b) ↦ a + b ∈ R and R × R ∋ (a, b) ↦ ab ∈ R, called the addition and the multiplication, respectively, such that the following properties are satisfied: (i) R, with the addition operation, is an Abelian group with identity element 0, called the zero element (the inverse of an element a ∈ R is denoted by −a); (ii) the multiplication is associative and admits an identity element 1, called the unit element; (iii) multiplication is distributive with respect to addition, that is, a(b + c) = ab + ac and (b + c)a = ba + ca for all a, b and c ∈ R. A ring R is commutative if ab = ba for all a, b ∈ R. A field is a commutative ring K ≠ {0} in which every non-zero element a ∈ K is invertible, that is, there exists a⁻¹ ∈ K such that aa⁻¹ = 1. In a ring R, a non-trivial element a is called a zero-divisor if there exists a non-zero element b ∈ R such that either ab = 0 or ba = 0. A ring R is directly finite if whenever ab = 1 then necessarily ba = 1, for all a, b ∈ R.
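The word metric and the growth of balls in a Cayley graph can be explored computationally. The sketch below (illustrative code, not from the source) computes ball sizes |B(0)|, …, |B(r)| by breadth-first search, here for the free group F₂ on {a, b}, whose balls grow exponentially (|B(n)| = 2·3ⁿ − 1):

```python
def ball_sizes(neighbors, origin, radius):
    """Sizes of the balls B(0), ..., B(radius) in a Cayley graph, via BFS;
    neighbors(g) must return the products gs for s in the generating set S."""
    seen, frontier, sizes = {origin}, [origin], [1]
    for _ in range(radius):
        nxt = []
        for g in frontier:
            for h in neighbors(g):
                if h not in seen:
                    seen.add(h)
                    nxt.append(h)
        frontier = nxt
        sizes.append(len(seen))
    return sizes

def f2_neighbors(word):
    """Right multiplication by a, a^-1, b, b^-1 in the free group F_2,
    keeping words reduced (a capital letter denotes an inverse)."""
    inverse = {'a': 'A', 'A': 'a', 'b': 'B', 'B': 'b'}
    return [word[:-1] if word and word[-1] == inverse[s] else word + s
            for s in 'aAbB']

print(ball_sizes(f2_neighbors, '', 3))   # → [1, 5, 17, 53]

# By contrast, Z^2 with the standard generators has polynomial growth:
z2 = lambda p: [(p[0] + 1, p[1]), (p[0] - 1, p[1]),
                (p[0], p[1] + 1), (p[0], p[1] - 1)]
print(ball_sizes(z2, (0, 0), 3))         # → [1, 5, 13, 25]
```

The exponential versus polynomial dichotomy visible here is exactly the growth distinction that appears in the discussion of amenability and the Garden of Eden theorem below.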
If the ring M_d(R) of d × d matrices with coefficients in R is directly finite for all d ≥ 1, one says that R is stably finite. Let R be a ring and let G be a group. Denote by R[G] the set of all formal sums Σ_{g∈G} α_g g, where α_g ∈ R and α_g = 0 except for finitely many elements g ∈ G. We define two binary operations on R[G], namely the addition, by setting

(Σ_{g∈G} α_g g) + (Σ_{h∈G} β_h h) = Σ_{g∈G} (α_g + β_g) g,

and the multiplication, by setting

(Σ_{g∈G} α_g g)(Σ_{h∈G} β_h h) = Σ_{g,h∈G} α_g β_h gh = Σ_{g,k∈G} α_g β_{g⁻¹k} k,

where the last equality is obtained by substituting k = gh.
Then, with these two operations, R[G] becomes a ring; it is called the group ring of G with coefficients in R.

Cellular automata Let G be a group, called the universe, and let A be a set, called the alphabet. A configuration is a map x : G → A. The set A^G of all configurations is equipped with the right action of G defined by A^G × G ∋ (x, g) ↦ x^g ∈ A^G, where x^g(g′) = x(gg′) for all g′ ∈ G. A cellular automaton over G with coefficients in A is a map τ : A^G → A^G satisfying the following condition: there exist a finite subset M ⊆ G and a map μ : A^M → A such that τ(x)(g) = μ(x^g|_M) for all x ∈ A^G, g ∈ G, where x^g|_M denotes the restriction of x^g to M. Such a set M is called a memory set and μ is called a local defining map for τ. If A = V is a vector space over a field K, then a cellular automaton τ : V^G → V^G, with memory set M ⊆ G and local defining map μ : V^M → V, is said to be linear provided that μ is linear. Two configurations x, x′ ∈ A^G are said to be almost equal if the set {g ∈ G : x(g) ≠ x′(g)} at which they differ is finite. A cellular automaton τ is called pre-injective if whenever τ(x) = τ(x′) for two almost equal configurations x, x′ ∈ A^G one necessarily has x = x′. A Garden of Eden configuration is a configuration x ∈ A^G \ τ(A^G). Clearly, GOE configurations exist if and only if τ is not surjective.

Definition of the Subject

A cellular automaton is a self-mapping of the set of configurations of a group defined from local and invariant rules. Cellular automata were first considered only on the n-dimensional lattice group Z^n and for configurations taking values in a finite alphabet set, but they may be formally defined on any group and for any alphabet. However, it is usually assumed that the alphabet set is endowed with some mathematical structure and that the local defining rules are related to this structure in some way. It turns out that general properties of cellular automata often reflect properties of the underlying group.
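Concretely, the multiplication in R[G] is a convolution of the coefficient families. A minimal sketch (illustrative representation, not from the source) for the group ring Z[Z], storing an element Σ α_g g as a dictionary g ↦ α_g:

```python
def group_ring_mul(a, b):
    """Product in R[G] for G = (Z, +): the coefficient of k in the product
    is the sum over g of a[g] * b[g^{-1} k], i.e. b[k - g] additively."""
    out = {}
    for g, ag in a.items():
        for h, bh in b.items():
            out[g + h] = out.get(g + h, 0) + ag * bh
    return {k: c for k, c in out.items() if c != 0}   # drop zero coefficients

# (1 + t)(1 - t) = 1 - t^2, writing t for the generator 1 of Z:
print(group_ring_mul({0: 1, 1: 1}, {0: 1, 1: -1}))   # → {0: 1, 2: -1}
```

For a non-Abelian G the same code works with group elements as keys and `g * h` in place of `g + h`; zero-divisors in R[G], the subject of Kaplansky's conjecture discussed below, are precisely nonzero inputs whose product collapses to the empty dictionary.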
As an example, the Garden of Eden theorem asserts that if the group is amenable and the alphabet is finite, then the surjectivity of a cellular automaton is equivalent to its pre-injectivity
(a weak form of injectivity). There is also a linear version of the Garden of Eden theorem for linear cellular automata with finite-dimensional vector spaces as alphabets. It is an amazing fact that famous conjectures of Kaplansky about the structure of group rings can be reformulated in terms of linear cellular automata.

Introduction

The goal of this paper is to survey results related to the Garden of Eden theorem and the surjunctivity problem for cellular automata. The notion of a cellular automaton goes back to John von Neumann [37] and Stan Ulam [34]. Although cellular automata were at first considered only in theoretical computer science, nowadays they play a prominent role also in physics and biology, where they serve as models for several phenomena (see Cellular Automata Modeling of Physical Systems, Chaotic Behavior of Cellular Automata), and in mathematics. In particular, cellular automata are studied in ergodic theory (Ergodic Theory of Cellular Automata) and in the theory of dynamical systems (Topological Dynamics of Cellular Automata), in functional and harmonic analysis, and in group theory. In the classical framework, the universe U is the lattice Z² of integer points in the Euclidean plane and the alphabet A is a finite set, typically A = {0, 1}. The set A^U = {x : U → A} is the configuration space, a map x : U → A is a configuration, and a point (n, m) ∈ U is called a cell. One is given a neighborhood M of the origin (0, 0) ∈ U, typically, for some r > 0, M = {(n, m) ∈ Z² : |n| + |m| ≤ r} (von Neumann r-ball) or M = {(n, m) ∈ Z² : |n|, |m| ≤ r} (Moore r-ball), and a local map μ : A^M → A. One then "extends" μ to the whole universe, obtaining a map τ : A^U → A^U, called a cellular automaton, by setting τ(x)(n, m) = μ((x(n + s, m + t))_{(s,t)∈M}). This way, the value τ(x)(n, m) ∈ A of the configuration x at the cell (n, m) ∈ U only depends on the values x(n + s, m + t) of x at its neighboring cells (n + s, m + t) ∈ (n, m) + M; in other words, τ is Z²-equivariant.
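The passage from the local map μ to the global map τ described above can be sketched in code (illustrative, not from the source), using Conway's Game of Life on the Moore 1-ball as the local map and representing a finite-support configuration by its set of live cells:

```python
MOORE = [(s, t) for s in (-1, 0, 1) for t in (-1, 0, 1)]

def life_local(values):
    """Game of Life as a local map mu: A^M -> A on the Moore 1-ball."""
    center = values[MOORE.index((0, 0))]
    live_neighbors = sum(values) - center
    return 1 if live_neighbors == 3 or (center == 1 and live_neighbors == 2) else 0

def global_map(mu, M):
    """Extend the local map mu to tau on configurations with finite support,
    stored as the set of cells in state 1 (quiescent state 0)."""
    def tau(live):
        # only cells whose neighborhood meets the support can change state
        candidates = {(n - s, m - t) for (n, m) in live for (s, t) in M}
        return {(n, m) for (n, m) in candidates
                if mu(tuple(int((n + s, m + t) in live) for (s, t) in M)) == 1}
    return tau

tau = global_map(life_local, MOORE)
blinker = {(0, -1), (0, 0), (0, 1)}
print(sorted(tau(blinker)))   # → [(-1, 0), (0, 0), (1, 0)]
```

Applying τ twice returns the blinker to its original position, illustrating an ultimately periodic orbit; since the rule is applied identically at every cell, τ commutes with translations, which is the Z²-equivariance noted above.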
M is called a memory set for τ and μ a local defining map. In 1963, E.F. Moore proved that if a cellular automaton τ : A^{Z²} → A^{Z²} is surjective, then it is also pre-injective, a weak form of injectivity. Shortly later, John Myhill proved the converse to Moore's theorem. The equivalence of surjectivity and pre-injectivity of cellular automata is referred to as the Garden of Eden theorem (briefly, GOE theorem), this biblical terminology being motivated by the fact that it gives necessary and sufficient conditions for the existence of configurations x that are not in the image of τ, i.e. x ∈ A^{Z²} \ τ(A^{Z²}), so that, thinking of (τ, A^{Z²}) as
a discrete dynamical system, with τ being the time, they can appear only as "initial configurations". It was immediately realized that the GOE theorem also holds in higher dimensions, namely for cellular automata with universe U = Z^d, the lattice of integer points in d-dimensional space. Then, Machì and Mignosi [27] gave the definition of a cellular automaton over a finitely generated group and extended the GOE theorem to the class of groups G having sub-exponential growth, that is, those for which the growth function γ_G(n), which counts the elements g ∈ G at "distance" at most n from the unit element 1_G of G, grows more slowly than the exponential; in formulæ, lim_{n→∞} γ_G(n)^{1/n} = 1. Finally, in 1999 Ceccherini-Silberstein, Machì and Scarabotti [9] extended the GOE theorem to the class of amenable groups. It is interesting to note that the notion of an amenable group was also introduced by von Neumann [36]. This class of groups contains all finite groups, all Abelian groups, and in fact all solvable groups, and all groups of sub-exponential growth, and it is closed under the operations of taking subgroups, quotients, directed limits and extensions. In [27] two examples are exhibited of cellular automata with universe the free group F₂ of rank two, the prototype of a non-amenable group, which are surjective but not pre-injective and, conversely, pre-injective but not surjective, thus providing an instance of the failure of the theorems of Moore and Myhill, and so of the GOE theorem. In [9] it is shown that these examples can be extended to the class of groups, thus necessarily non-amenable, containing the free group F₂. We do not know whether the GOE theorem holds only for amenable groups; the question remains open for groups which are non-amenable but have no free subgroups: by results of Olshanskii [30] and Adyan [1] it is known that this class is non-empty.
In 1999, Misha Gromov [20], using quite different terminology, reproved the GOE theorem for cellular automata whose universes are infinite amenable graphs with a dense pseudogroup of holonomies (in other words, graphs that are rich in symmetries). In addition, he considered not only cellular automata from the full configuration space into itself but also maps between subshifts X, Y of the configuration space. He used the notion of entropy of a subshift (a concept hidden in the papers [27] and [9]). In the mid-fifties, W. Gottschalk introduced the notion of surjunctivity of maps. A map f : X → Y is surjunctive if it is surjective or not injective. We say that a group G is surjunctive if all cellular automata τ : A^G → A^G with finite alphabet are surjunctive. Lawton [18] proved that residually finite groups are surjunctive. From the GOE theorem for amenable groups [9] one immediately deduces that amenable groups are surjunctive as well. Finally, Gromov [20] and, independently, Benjamin Weiss [38] proved that all sofic groups (the class of sofic groups contains all residually finite groups and all amenable groups) are surjunctive. It is not known whether or not all groups are surjunctive. In the literature there is a notion of a linear cellular automaton. This means that the alphabet is not only a finite set but also bears the structure of an Abelian group, and that the local defining map is a group homomorphism, that is, it preserves the group operation. These are also called additive cellular automata (see Additive Cellular Automata). In [5], motivated by [20], we introduced another notion of linearity for cellular automata. Given a group G and a vector space V over a (not necessarily finite) field K, the configuration space is V^G and a cellular automaton τ : V^G → V^G is linear if the local defining map μ : V^M → V is K-linear. The set LCA(V, G) of all linear cellular automata with alphabet V and universe G naturally bears the structure of a ring. The finiteness condition for the set A in the classical framework is now replaced by the finite dimensionality of V. Similarly, the notion of entropy for subshifts X ⊆ A^G is now replaced by that of mean dimension (a notion due to Gromov [20]). In [5] we proved the GOE theorem for linear cellular automata τ : V^G → V^G with alphabet a finite-dimensional vector space and with G an amenable group. Moreover, we proved a linear version of Gottschalk's surjunctivity theorem for residually finite groups. In the same paper we also established a connection with the theory of group rings. Given a group G and a field K, there is a one-to-one correspondence between the elements of the group ring K[G] and the cellular automata τ : K^G → K^G. This correspondence preserves the ring structures of K[G] and LCA(K, G). This led to a reformulation of a long-standing problem, raised by Irving Kaplansky [23], about the absence of zero-divisors in K[G] for G a torsion-free group, in terms of the pre-injectivity of all τ ∈ LCA(K, G).
In [6] we proved the linear version of the Gromov–Weiss surjunctivity theorem for sofic groups and established another application to the theory of group rings. We extended the correspondence above to a ring isomorphism between the ring Mat_d(K[G]) of d × d matrices with coefficients in the group ring K[G] and LCA(K^d, G). This led to a reformulation of another famous problem, raised by Irving Kaplansky [24], about the structure of group rings: a group ring K[G] is stably finite if and only if, for all d ≥ 1, all linear cellular automata τ : (K^d)^G → (K^d)^G are surjunctive. As a byproduct we obtained another proof of the fact that group rings over sofic groups are stably finite,
a result previously established by G. Elek and A. Szabó [11] using different methods. The paper is organized as follows. In Sect. "Cellular Automata" we present the general definition of a cellular automaton for any alphabet and any group, together with a few basic examples, namely Conway's Game of Life, the majority action and the discrete Laplacian. In the subsequent section we restrict our attention to cellular automata with a finite alphabet. We present the notions of Cayley graphs (for finitely generated groups), of amenable groups, and of entropy for G-invariant subsets of the configuration space. This leads to a description of the setting and the statement of the Garden of Eden theorem for amenable groups. We also give detailed expositions of a few examples showing that the amenability hypothesis of this theorem cannot, in general, be removed. We then present the notion of surjunctivity, that of sofic groups, and the surjunctivity theorem of Gromov and Weiss for sofic groups. In Sect. "Linear Cellular Automata" we introduce the notions of linear cellular automata and of mean dimension for G-invariant subspaces of V^G. We then discuss the linear analogue of the Garden of Eden theorem and, again, provide explicit examples showing that the assumptions of the theorem (amenability of the group and finite dimensionality of the underlying vector space) cannot, in general, be removed. We then present the linear analogue of the surjunctivity theorem of Gromov and Weiss for linear cellular automata over sofic groups. In Sect. "Group Rings and Kaplansky Conjectures" we give the definition of a group ring and present a representation of linear cellular automata as matrices with coefficients in the group ring. This leads to the reformulation of the two long-standing problems raised by Kaplansky about the structure of group rings. Finally, in Sect.
"Future Directions" we present a list of open problems with a description of more recent results related to the Garden of Eden theorem and to the surjunctivity problem.

Cellular Automata

The Configuration Space

Let G be a group, called the universe, and let A be a set, called the alphabet or the set of states. A configuration is a map x : G → A. The set A^G of all configurations is equipped with the right action of G defined by A^G × G ∋ (x, g) ↦ x^g ∈ A^G, where x^g(g′) = x(gg′) for all g′ ∈ G.

Cellular Automata

A cellular automaton over G with coefficients in A is a map τ : A^G → A^G satisfying the following condition: there exists a finite subset M ⊆ G and a map μ : A^M → A such that

τ(x)(g) = μ(x^g|_M)   (1)

for all x ∈ A^G and g ∈ G, where x^g|_M denotes the restriction of x^g to M. Such a set M is called a memory set and μ is called a local defining map for τ. It follows directly from the definition that every cellular automaton τ : A^G → A^G is G-equivariant, i.e., it satisfies

τ(x^g) = τ(x)^g   (2)
for all g ∈ G and x ∈ A^G. Note that if M is a memory set for τ, then any finite set M′ ⊆ G containing M is also a memory set for τ. The local defining map associated with such an M′ is the map μ′ : A^{M′} → A given by μ′ = μ ∘ π, where π : A^{M′} → A^M is the restriction map. However, there exists a unique memory set M₀ of minimal cardinality. This memory set M₀ is called the minimal memory set for τ. We denote by CA(G; A) the set of all cellular automata over G with alphabet A.

Examples

Example 1 (Conway's Game of Life [3]) The most famous example of a cellular automaton is the Game of Life of John Horton Conway. The set of states is A = {0, 1}. State 0 corresponds to absence of life while state 1 indicates life. Therefore passing from 0 to 1 can be interpreted as birth, while passing from 1 to 0 corresponds to death. The universe for Life is the group G = Z², that is, the free Abelian group of rank 2. The minimal memory set is M = {−1, 0, 1}² ⊆ Z². The set M is the Moore neighborhood of the origin in Z²: it consists of the origin (0, 0) and its eight neighbors ±(1, 0), ±(0, 1), ±(1, 1), ±(1, −1). The corresponding local defining map μ : A^M → A assigns 1 to a local pattern if and only if either exactly three of the eight neighbors are alive, or the center is alive and exactly two of its neighbors are; otherwise it assigns 0.

In cellular automata with memory, each cell is featured by the rounded weighted mean of its past states, m_i(T) = ω_i^(T)/Ω(T) with ω_i^(T) = Σ_{t=1}^{T} α^{T−t} σ_i^(t), assigned 1 for m > 0.5 and 0 for m < 0.5. If m is exactly 0.5, then the last state is assigned (s_i^(T) = σ_i^(T)). This memory mechanism is not accumulative in its demand of knowledge of past history: to calculate the memory charge ω_i^(T) it is not necessary to know the whole {σ_i^(t)} series, since it can be sequentially calculated as ω_i^(T) = α ω_i^(T−1) + σ_i^(T), with Ω(T) = (α^T − 1)/(α − 1). The choice of the memory factor α simulates the long-term or remnant memory effect: the limit case α = 1 corresponds to memory with equally weighted records (full memory, equivalent to the mode if k = 2), whereas lower values of α intensify the contribution of the most recent states and diminish the contribution of the past ones (short-term working memory). The choice α = 0 leads to the ahistoric model. In the most unbalanced scenario up to T, i.
e.: σ_i^(1) = ⋯ = σ_i^(T−1) ≠ σ_i^(T), it is:

m(0, 0, …, 0, 1) = (α − 1)/(α^T − 1) ,
m(1, 1, …, 1, 0) = (α^T − α)/(α^T − 1) = 1 − m(0, 0, …, 0, 1) ,

to be compared with 1/2.
Thus, memory is only operative if α is greater than a critical α_T that verifies:

α_T^T − 2α_T + 1 = 0 ,   (1)
in which case cells will be featured at T with state values different from the last one. Initial operative values are: α_3 = 0.61805, α_4 = 0.5437. When T → ∞, Eq. 1 becomes −2α + 1 = 0; thus, in the k = 2 scenario, α-memory is not effective if α ≤ 0.5.

A Worked Example: The Parity Rule

The so-called parity rule states: a cell becomes alive if its number of live neighbors is odd, and dead otherwise. Figure 2 shows the effect of memory on the parity rule starting from a single live cell in the Moore neighborhood. In accordance with the values of α_3 and α_4 given above: (i) The
pattern at T = 4 is the ahistoric one if α ≤ 0.6 and is altered when α ≥ 0.7, and (ii) the patterns at T = 5 for α = 0.54 and α = 0.55 differ. Memory levels that are not low tend to freeze the dynamics from the early time-steps, e.g. over 0.54 in Fig. 2. In the particular case of full memory, small oscillators of short range in time are frequently generated, such as the period-two oscillator that appears as early as T = 2 in Fig. 2. The group of evolution patterns shown in the [0.503, 0.54] interval of α variation of Fig. 2 is rather unexpected for the parity rule, because these patterns are too sophisticated for such a simple rule. On the contrary, the evolution patterns with very small memory, α = 0.501, resemble those of the ahistoric model in Fig. 2. But this similitude breaks down later on, as Fig. 3 reveals: from T = 19, the parity rule with minimal memory evolves producing patterns notably different from the ahistoric ones. These patterns tend to be framed in squares of size not over T × T, whereas in the ahistoric case the patterns tend to be framed in 2T × 2T square regions, so even minimal memory induces a very notable reduction of the affected cell area in the scenario of Fig. 2. The patterns of the featured cells tend not to be far from the actual ones, albeit examples of notable divergence can be traced in Fig. 2. In the particular case of the minimal memory scenario of Fig. 2, that of α = 0.501, memory has no effect up to T = 9, when the pattern of featured live cells reduces to the initial one; afterward both evolutions are fairly similar up to T = 18, but at this time step the two kinds of patterns differ notably, and from then on the evolution patterns in Fig. 3 notably diverge from the ahistoric ones. Giving consideration to previous states (historic memory) in two-dimensional CA tends to confine the disruption generated by a single live cell. As a rule, full memory tends to generate oscillators, and less historic information retained, i.e.
a smaller α value, implies an approach to the ahistoric model in a rather smooth form. But the transition achieved by decreasing the memory factor from α = 1.0 (full memory) to α = 0.5 (ahistoric model) is not always regular, and some kind of erratic effect of memory can be traced. The inertial (or conservative) effect of memory dramatically changes the dynamics of the semitotalistic LIFE rule. Thus, (i) the vividness that some small clusters exhibit in LIFE has not been detected in LIFE with memory. In particular, the glider in LIFE does not glide with
Cellular Automata with Memory, Figure 2 The 2D parity rule with memory up to T = 15
Cellular Automata with Memory, Figure 3 The 2D parity rule with α = 0.501 memory starting from a single site live cell up to T = 55
memory, but stabilizes very close to its initial position as the tub, (ii) as the size of a configuration increases, live clusters often tend to persist with a higher number of live cells in LIFE with memory than in the ahistoric formulation, and (iii) a single mutant appearing in a stable agar can lead to its destruction in the ahistoric model, whereas with memory its effect tends to be restricted to its proximity [26].

One-Dimensional CA

Elementary rules are one-dimensional, two-state rules operating on nearest neighbors. Following Wolfram's notation, these rules are characterized by a sequence of binary values (β) associated with each of the eight possible triplets
(σ_{i−1}^(T), σ_i^(T), σ_{i+1}^(T)):

111  110  101  100  011  010  001  000
β1   β2   β3   β4   β5   β6   β7   β8
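In Wolfram's numbering this β-list encodes the rule number R = Σ_{i=1}^{8} β_i 2^{8−i}; the coding and the legal-rule constraints can be sketched as follows (illustrative Python, not the article's code):

```python
# Sketch: Wolfram rule numbering R = sum_i beta_i * 2**(8 - i), beta listed
# in the order of the triplets 111, 110, ..., 000, plus the legality
# constraints (reflection symmetry beta5 = beta2, beta7 = beta4; quiescence beta8 = 0).
from itertools import product

def rule_number(betas):                   # betas = (b1, ..., b8)
    return sum(b * 2 ** (8 - i) for i, b in enumerate(betas, start=1))

legal = [rule_number(b) for b in product((0, 1), repeat=8)
         if b[4] == b[1] and b[6] == b[3] and b[7] == 0]
assert len(legal) == 32                   # the 32 legal rules
assert 90 in legal and 150 in legal and 254 in legal
```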
The rules are conveniently specified by their rule number R = Σ_{i=1}^{8} β_i 2^{8−i}. Legal rules are reflection symmetric (β5 = β2, β7 = β4) and quiescent (β8 = 0), restrictions that leave 32 possible legal rules. Figure 4 shows the spatio-temporal patterns of legal rules affected by memory when starting from a single live cell [17]. Patterns are shown up to T = 63, with the memory factor varying from 0.6 to 1.0 by 0.1 intervals, and adopting also values close to the limit of its effectivity: 0.5. As a rule, the transition from the α = 1.0 (fully historic) to the ahistoric scenario is fairly gradual, so that the patterns become more expanded as less historic memory is retained (smaller α). Rules 50, 122, 178, 250, 94, 222, and 254 are paradigmatic of this smooth evolution. Rules 222 and
254 are not included in Fig. 4 as they evolve like rule 94 but with the inside of the patterns full of active cells. Rules 126 and 182 also present a gradual evolution, although at high levels of memory their patterns hardly resemble the ahistoric ones. Examples without a smooth effect of memory are also present in Fig. 4: (i) rule 150 is sharply restrained at α = 0.6, (ii) the important rule 54 becomes extinct in [0.8, 0.9], but not with full memory, (iii) the rules in the group {18, 90, 146, 218} become extinct from α = 0.501. Memory kills the evolution for these rules as early as T = 4 for α values over α_3 (thus over 0.6 in Fig. 4): after T = 3 all the cells, even the two outer cells alive at T = 3, are featured as dead. And (iv) rule 22 becomes extinct for α = 0.501, but not at 0.507, 0.6, or 0.7; it again becomes extinct at 0.8 and 0.9, and finally generates an oscillator with full memory. It has been argued that rules 18, 22, 122, 146 and 182 simulate Rule 90 in that their behavior coincides when restricted to certain spatial subsequences. Starting with a single site live cell, the coincidence fully applies in the historic model for rules 90, 18 and 146. Rule 22 shares with these rules the extinction for high α values, with the notable exception of no extinction in the fully historic model. Rules 122 and 182 diverge in their behavior: there is a gradual decrease in the width of the evolving patterns as α increases, but they do not reach extinction. Figure 5 shows the effect of memory on legal rules when starting at random: the values of sites are initially uncorrelated and chosen at random to be 0 (blank) or 1 (gray) with probability 0.5. Differences in patterns resulting from reversing the center site value are shown as black pixels. Patterns are shown up to T = 60, in a line of size 129 with periodic boundary conditions imposed on the edges. Only the nine legal rules which generate non-periodic pat-
Cellular Automata with Memory, Figure 4 Elementary, legal rules with memory from a single site live cell
terns in the ahistoric scenario are significantly affected by memory. The patterns with inverted triangles dominate the scene in the ahistoric patterns of Fig. 5, a common appearance that memory tends to eliminate. History has a dramatic effect on Rule 18. Even at the low value of α = 0.6, the appearance of its spatio-temporal pattern fully changes: a number of isolated periodic structures are generated, far from the distinctive inverted-triangle world of the ahistoric pattern. For α = 0.7 the live structures are fewer, anticipating the extinction found in [0.8, 0.9]. In the fully historic model, simple periodic patterns survive.
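The α-memory featuring driving all of these experiments, and the critical memory factors α_T, can be sketched numerically as follows (an illustration; the function names are ours, not the article's):

```python
# Sketch of the alpha-memory featuring: charge omega(T) = alpha*omega(T-1) + sigma(T),
# normalizer Omega(T) = (alpha**T - 1)/(alpha - 1); the featured state is
# round(omega/Omega), keeping the last state on a tie at 0.5.
def featured_state(history, alpha):
    omega = Omega = 0.0
    for sigma in history:                 # oldest state first
        omega = alpha * omega + sigma
        Omega = alpha * Omega + 1.0
    m = omega / Omega
    return history[-1] if m == 0.5 else int(m > 0.5)

def critical_alpha(T, iters=60):
    # root in (1/2, 1) of alpha**T - 2*alpha + 1 = 0 by bisection
    f = lambda a: a ** T - 2 * a + 1
    lo, hi = 0.5001, 0.9999               # f(lo) > 0 > f(hi)
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return (lo + hi) / 2

assert abs(critical_alpha(3) - 0.618) < 1e-3    # alpha_3
assert abs(critical_alpha(4) - 0.5437) < 1e-3   # alpha_4
# For the trait (1, 1, 0), memory overrides the last state only above alpha_3:
assert featured_state([1, 1, 0], 0.7) == 1
assert featured_state([1, 1, 0], 0.6) == 0
```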
Rule 146 is affected by memory in much the same way as Rule 18, because their binary codes differ only in their β1 value. The spatio-temporal patterns of rule 182 and its equivalent Rule 146 are reminiscent of each other, though those of Rule 182 look like photographic negatives of those of Rule 146. The effect of memory on rule 22 and on the complex rule 54 is similar. Their spatio-temporal patterns at α = 0.6 and α = 0.7 keep the essentials of the ahistoric ones, although the inverted triangles become enlarged and tend to be more sophisticated at their base. A notable discontinuity is found for both rules when ascending in the value of the memory factor: at α = 0.8 and α = 0.9 only a few simple structures survive. But unexpectedly, the patterns of the fully historic scenario differ markedly from the others, showing a high degree of synchronization. The four remaining chaotic legal rules (90, 122, 126 and 150) show a much smoother evolution from the ahistoric to the historic scenario: no pattern evolves either to
Cellular Automata with Memory, Figure 5 Elementary, legal rules with memory starting at random
full extinction or to the preservation of only a few isolated persistent propagating structures (solitons). Rules 122 and 126 evolve in a similar form, showing a high degree of synchronization in the fully historic model. As a rule, the effect of memory on the differences in patterns (DP) resulting from reversing the value of the initial center site is reminiscent of that on the spatio-temporal patterns, albeit this very much depends on the actual simulation run. In the case of rule 18, for example, damage is not present in the simulation of Fig. 5. The group of rules 90, 122, 126 and 150 shows a (let us say canonical) fairly gradual evolution from the ahistoric to the historic scenario, so that the DP appear more constrained as more historic memory is retained, with no extinction for any α
value. Figure 6 shows the evolution of the fraction ρ_T of sites with value 1, starting at random (ρ_0 = 0.5). The simulation is implemented for the same rules as in Fig. 4, but with a notably wider lattice: N = 500. A visual inspection of the plots in Fig. 6 ratifies the general features observed in the patterns of Fig. 5 regarding density. That also stands for damage spreading: as a rule, memory depletes the damaged region. In one-dimensional r = 2 CA, the value of a given site depends on the values of the nearest and next-nearest neighbors. Totalistic r = 2 rules with memory have the form: σ_i^(T+1) = φ(s_{i−2}^(T) + s_{i−1}^(T) + s_i^(T) + s_{i+1}^(T) + s_{i+2}^(T)). The effect of memory on these rules follows the way traced in the r = 1 context, albeit with a rich casuistry studied in [14].
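Putting the pieces together, one time-step of an elementary rule acting on featured states can be sketched as follows (an illustration; rule 90, the lattice width, and α = 0.5, at which k = 2 memory is inoperative, are this example's choices):

```python
# Sketch: evolve an elementary rule R on the *featured* states s_i (alpha-memory),
# appending each new sigma to its cell's history; periodic boundary conditions.
def featured(h, alpha):
    omega = Omega = 0.0
    for sigma in h:
        omega = alpha * omega + sigma
        Omega = alpha * Omega + 1.0
    m = omega / Omega
    return h[-1] if m == 0.5 else int(m > 0.5)

def step(histories, R, alpha):
    n = len(histories)
    s = [featured(h, alpha) for h in histories]
    for i in range(n):
        v = 4 * s[(i - 1) % n] + 2 * s[i] + s[(i + 1) % n]
        histories[i].append((R >> v) & 1)

# Rule 90 from a single live cell; alpha = 0.5 leaves memory inoperative (k = 2),
# so two steps reproduce the ahistoric evolution.
hist = [[0], [0], [0], [1], [0], [0], [0]]
step(hist, 90, 0.5)
step(hist, 90, 0.5)
assert [h[-1] for h in hist] == [0, 1, 0, 0, 0, 1, 0]
```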
Probabilistic CA

So far the CA considered are deterministic. In order to study perturbations of deterministic CA as well as transitional changes from one deterministic CA to another, it is natural to generalize the deterministic CA framework to the probabilistic scenario. In the elementary scenario, the β are replaced by probabilities p = P(σ_i^(T+1) = 1 | σ_{i−1}^(T), σ_i^(T), σ_{i+1}^(T)):

111  110  101  100  011  010  001  000
β1   β2   β3   β4   β5   β6   β7   β8      β ∈ {0, 1}
p1   p2   p3   p4   p5   p6   p7   p8      p ∈ [0, 1]
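A one-step sketch of such a PCA (illustrative Python; the deterministic check uses rule 90's β-list, and p is given in the table's 111..000 order):

```python
# Sketch: one probabilistic-CA step; state 1 is adopted with probability p[t]
# for triplet t, p listed in the order 111, 110, ..., 000 as in the table above.
import random

def pca_step(state, p, rng):
    n = len(state)
    return [1 if rng.random() < p[7 - (4 * state[(i - 1) % n]
                                       + 2 * state[i]
                                       + state[(i + 1) % n])] else 0
            for i in range(n)]

# With p restricted to {0, 1} the deterministic elementary rule is recovered:
p90 = [0, 1, 0, 1, 1, 0, 1, 0]            # beta_1 .. beta_8 of rule 90
assert pca_step([0, 0, 1, 0, 0], p90, random.Random(0)) == [0, 1, 0, 1, 0]
```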
As in the deterministic scenario, memory can be embedded in probabilistic CA (PCA) by featuring cells by a summary s_i of past states instead of by their last state σ_i^(T): p = P(σ_i^(T+1) = 1 | s_{i−1}^(T), s_i^(T), s_{i+1}^(T)). Again, memory is embedded into the characterization of cells but not in the construction of the stochastic transition rules, as done in the canonical approach to PCA. We have explored the effect of memory on three different subsets, (0, p2, 0, p4, p2, 0, p4, 0), (0, 0, p3, 1, 0, p6, 1, 0), and (p1, p2, p1, p2, p2, 0, p2, 0), in [9].

Other Memories

A number of average-like memory mechanisms can readily be proposed by using weights different from that implemented in the α-memory mechanism, δ(t) = α^{T−t}. Among the plausible choices of δ, we mention the weights δ(t) = t^c and δ(t) = c^t, c ∈ N, in which the larger the value of c, the more heavily is the recent past taken into account, and consequently the closer the scenario to the ahistoric model [19,21]. Both weights allow for integer-based
arithmetics (à la CA), comparing 2ω_i^(T) to Ω(T) to get the featuring states s (a clear computational advantage over the α-based model), and are accumulative with respect to the charge: ω_i^(T) = ω_i^(T−1) + T^c σ_i^(T), ω_i^(T) = ω_i^(T−1) + c^T σ_i^(T). Nevertheless, they share the same drawback: the powers explode at high values of t, even for c = 2. Limited trailing memory would keep memory of only the last τ states. This is implemented in the context of average memory as: ω_i^(T) = Σ_{t=t₀}^{T} δ(t) σ_i^(t), with t₀ = max(1, T − τ + 1). Limiting the trailing memory approaches the model to the ahistoric model (τ = 1). In the geometrically discounted method, such an effect is more appreciable when the value of α is high, whereas at low α values (already close to the ahistoric model when memory is not limited) the effect of limiting the trailing memory is not so important. In the k = 2 context, if τ = 3, provided that α > α_3 = 0.61805, the memory mechanism turns out to be that of selecting the mode of the last three states, s_i^(T) = mode(σ_i^(T−2), σ_i^(T−1), σ_i^(T)), i.e. the elementary rule 232. Figure 7 shows the effect of this kind of memory on legal rules. As is known, history has a dramatic effect on Rules 18, 90, 146 and 218, as their patterns die out as early as T = 4. The case of Rule 22 is particular: two branches are generated at T = 17 in the historic model; the patterns of the remaining rules in the historic model are much reminiscent of the ahistoric ones, but, let us say, compressed. Figure 7 also shows the effect of memory on some relevant quiescent asymmetric rules. Rule 2 shifts a single site live cell one space at every time-step in the ahistoric model, the pattern dying at T = 4 with memory. This evolution is common to all rules that just shift a single site cell without increasing the number of living cells at T = 2; this is the case of the important rules 184 and 226. The patterns generated by
Cellular Automata with Memory, Figure 6 Evolution of the density starting at random in elementary legal rules. Color code: blue → full memory, black → α = 0.8, red → ahistoric model
Cellular Automata with Memory, Figure 7 Legal (first row of patterns) and quiescent asymmetric elementary rules significantly affected by the mode of the three last states of memory
rules 6 and 14 are rectified by memory (in the sense that the lines in the spatio-temporal pattern acquire a gentler slope) in such a way that the total number of live cells in the historic and ahistoric spatio-temporal patterns is the same. Again, the historic patterns of the remaining rules in Fig. 7 seem, as a rule, like the ahistoric ones compressed [22].
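The mode memory used in these experiments is just the majority of each cell's last three states, i.e. elementary rule 232 applied to the cell's own history; a minimal sketch:

```python
# Sketch: mode memory over the last three states (k = 2) is the majority,
# i.e. elementary rule 232 applied to a cell's own history.
def mode3(h):
    a, b, c = h[-3:]                      # assumes at least three recorded states
    return 1 if a + b + c >= 2 else 0

assert mode3([1, 0, 1]) == 1
assert mode3([0, 0, 1]) == 0
```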
Cellular Automata with Memory, Figure 8 The Rule 150 with elementary rules up to R = 125 as memory
Cellular Automata with Memory, Figure 9 The parity rule with elementary rules as memory. Evolution from T = 4 to 15 in the von Neumann neighborhood starting from a single site live cell
Elementary rules (ER, denoted f) can in turn act as memory rules: s_i^(T) = f(σ_i^(T−2), σ_i^(T−1), σ_i^(T)). Figure 8 shows the effect of ER memories up to R = 125 on rule 150, starting from a single site live cell, up to T = 13. The effect of ER memories with R > 125 on rule 150, as well as on rule 90, is shown in [23]. In the latter case, complementary memory rules (rules whose rule numbers add up to 255) have the same effect on rule 90 (regardless of the role played by the three last states in f and of the initial configuration). In the ahistoric scenario, Rules 90 and 150 are linear (or additive): i.e., any initial pattern can be decomposed into the superposition of patterns from a single site seed. Each of these configurations can be evolved independently and the results superposed (modulo two) to obtain the final complete pattern. The additivity of rules 90 and 150 remains in the historic model with linear memory rules.
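This additivity is easy to check in the ahistoric case (a sketch; the lattice width and seed positions are arbitrary choices of this illustration):

```python
# Sketch: rules 90 and 150 are additive; evolving the mod-2 superposition (XOR)
# of two configurations equals the XOR of their separate evolutions.
def elem_step(s, R):
    n = len(s)
    return [(R >> (4 * s[(i - 1) % n] + 2 * s[i] + s[(i + 1) % n])) & 1
            for i in range(n)]

for R in (90, 150):
    x = [0, 1, 0, 0, 0, 0, 0, 0]
    y = [0, 0, 0, 0, 1, 0, 0, 0]
    xy = [a ^ b for a, b in zip(x, y)]
    assert elem_step(xy, R) == [a ^ b for a, b in zip(elem_step(x, R), elem_step(y, R))]
```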
Figure 9 shows the effect of elementary rules as memory on the 2D parity rule with von Neumann neighborhood, starting from a single site live cell. This figure shows the patterns from T = 4 on. The consideration of CA rules as memory induces a fairly unexplored explosion of new patterns.

CA with Three States

This section deals with CA with three possible values at each site (k = 3), noted {0, 1, 2}, so the rounding mechanism is implemented by comparing the unrounded weighted mean m to the hallmarks 0.5 and 1.5, assigning the last state in case of equality to any of these values. Thus,

s_i^(T) = 0 if m_i(T) < 0.5 ;
s_i^(T) = 1 if 0.5 < m_i(T) < 1.5 ;
s_i^(T) = 2 if m_i(T) > 1.5 ;

and

s_i^(T) = σ_i^(T) if m_i(T) = 0.5 or m_i(T) = 1.5 .
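The k-state rounding above can be sketched as follows (illustrative Python; the helper name is ours):

```python
# Sketch of the k-state rounding: compare the weighted mean m to the hallmarks
# 0.5, 1.5, ..., k - 1.5, keeping the last state on a tie at a hallmark.
def featured_k(history, alpha, k=3):
    omega = Omega = 0.0
    for sigma in history:
        omega = alpha * omega + sigma
        Omega = alpha * Omega + 1.0
    m = omega / Omega
    for s in range(k - 1):
        if m == s + 0.5:
            return history[-1]            # tie at a hallmark: keep the last state
        if m < s + 0.5:
            return s
    return k - 1

assert featured_k([0, 0, 2], 0.2) == 2    # alpha below 1/4: memory inoperative
assert featured_k([0, 0, 2], 0.9) == 1    # strong memory overrides the last state
```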
Cellular Automata with Memory, Figure 10 Parity k = 3 rules starting from a single σ = 1 seed. The red cells are at state 1, the blue ones at state 2
In the most unbalanced cell dynamics, historic memory takes effect after time step T only if α > α_T, with 3α_T^T − 4α_T + 1 = 0, which in the temporal limit becomes −4α + 1 = 0, i.e. α = 0.25. In general, in CA with k states (numbered from 0 to k − 1), the characteristic equation at T is (2k − 3)α_T^T − 2(k − 1)α_T + 1 = 0, which becomes −2(k − 1)α + 1 = 0 in the temporal limit. It is then concluded that memory does not affect the scenario if α ≤ α^(k) = 1/(2(k − 1)). We study first totalistic rules: σ_i^(T+1) = φ(σ_{i−1}^(T) + σ_i^(T) + σ_{i+1}^(T)), characterized by a sequence of ternary values (β_s) associated with each of the seven possible values of the sum s of the neighbors: (β6, β5, β4, β3, β2, β1, β0), with associated rule number R = Σ_{s=0}^{6} β_s 3^s ∈ [0, 2186]. Figure 10 shows the effect of memory on quiescent (β0 = 0) parity rules, i.e. rules with β1, β3 and β5 non-null and β2 = β4 = β6 = 0. Patterns are shown up to T = 26. The pattern for α = 0.3 is shown to test its proximity to the ahistoric one (recall that if α ≤ 0.25 memory takes no effect). Starting with a single site seed it can be concluded, regarding proper three-state rules such as those in Fig. 10, that: (i) as an overall rule the patterns become more expanded as less historic memory is retained (smaller α). This characteristic inhibition-of-growth effect of memory is traced on rules 300 and 543 in Fig. 10, (ii) the transition from the fully historic to the ahistoric scenario tends to be gradual in regard to the amplitude of the spatio-temporal patterns, although their composition can differ notably, even at close α values, (iii) in contrast
to the two-state scenario, memory fires the pattern of some three-state rules that die out in the ahistoric model, and no rule with memory dies out. Thus, the effect of memory on rules 276, 519, 303 and 546 is somewhat unexpected: they die out at α ≤ 0.3, but at α = 0.4 the pattern expands, the expansion being inhibited (in Fig. 10) only at α ≥ 0.8. This activation under memory of rules that die at T = 3 in the ahistoric model is unfeasible in the k = 2 scenario. The features of the evolving patterns starting from a single seed in Fig. 10 are qualitatively reflected when starting at random, as shown with rule 276 in Fig. 11, which is also activated (even at α = 0.3) when starting at random. The effect of average memory (α- and integer-based models, unlimited and limited trailing memory, even τ = 2) and that of the mode of the last three states has been studied in [21]. When working with more than two states, the tendency to bias the featured state toward the mean value 1 is an inherent consequence of averaging. That explains the red shift in the previous figures. This led us to focus in what follows on a fairer memory mechanism: the mode. Mode memory allows for manipulation of pure symbols, avoiding any computing/arithmetics. In excitable CA, the three states are featured: resting 0, excited 1 and refractory 2. State transitions from excited to refractory and from refractory to resting are unconditional; they take place independently of a cell's neighborhood state: σ_i^(T) = 1 → σ_i^(T+1) = 2, σ_i^(T) = 2 → σ_i^(T+1) = 0. In [15] the excitation rule adopts a Pavlovian phenomenon of defensive inhibition: when the strength of the stimulus applied exceeds a certain limit the
Cellular Automata with Memory, Figure 11 The k = 3, R = 276 rule starting at random
system 'shuts down'; this can be naively interpreted as an inbuilt protection against energy loss and exhaustion. To simulate the phenomenon of defensive inhibition we adopt interval excitation rules [2], and a resting cell becomes excited only if one or two of its neighbors are excited: σ_i^(T) = 0 → σ_i^(T+1) = 1 if #{j ∈ N_i : σ_j^(T) = 1} ∈ {1, 2} [3]. Figure 12 shows the effect of mode-of-the-last-three-time-steps memory on the defensive-inhibition CA rule with the Moore neighborhood, starting from a simple configuration. At T = 3 the outer excited cells of the actual pattern are featured not as excited but as resting cells (twice resting versus once excited), and the series of evolving patterns with memory diverges from the ahistoric evolution at T = 4, becoming less expanded. Again, memory tends to restrain the evolution. The effect of memory on the beehive rule, a totalistic two-dimensional CA rule with three states implemented in the hexagonal tessellation [57], has been explored in [13].
Cellular Automata with Memory, Figure 12 Effect of mode memory on the defensive inhibition CA rule
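The defensive-inhibition transition described above can be sketched as follows (an illustration; the grid representation, fixed quiescent boundary, and function names are this example's assumptions):

```python
# Sketch: one step of the defensive-inhibition excitable CA on an n x n grid:
# 0 = resting, 1 = excited, 2 = refractory; Moore neighborhood; cells outside
# the grid are treated as resting (quiescent boundary).
def excitable_step(grid, n):
    moore = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    new = {}
    for x in range(n):
        for y in range(n):
            s = grid.get((x, y), 0)
            if s == 1:
                new[(x, y)] = 2           # excited -> refractory, unconditionally
            elif s == 2:
                new[(x, y)] = 0           # refractory -> resting, unconditionally
            else:
                k = sum(1 for dx, dy in moore if grid.get((x + dx, y + dy), 0) == 1)
                new[(x, y)] = 1 if k in (1, 2) else 0   # interval excitation
    return new

g = excitable_step({(1, 1): 1}, 3)
assert g[(1, 1)] == 2                     # the excited center becomes refractory
assert g[(0, 0)] == 1                     # each neighbor sees one excited cell
```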
Reversible CA

The second-order-in-time implementation based on subtraction modulo the number of states (noted ⊖): σ_i^(T+1) = φ(σ_j^(T) : j ∈ N_i) ⊖ σ_i^(T−1), readily reverses as: σ_i^(T−1) = φ(σ_j^(T) : j ∈ N_i) ⊖ σ_i^(T+1). To preserve the reversible feature, memory has to be endowed only in the pivotal component of the rule transition, so: σ_i^(T−1) = φ(s_j^(T) : j ∈ N_i) ⊖ σ_i^(T+1). For reversing from T it is necessary to know not only σ_i^(T) and σ_i^(T+1) but also ω_i^(T), to be compared to Ω(T) in order to obtain:

s_i^(T) = 0 if 2ω_i^(T) < Ω(T)
s_i^(T) = σ_i^(T+1) if 2ω_i^(T) = Ω(T)
s_i^(T) = 1 if 2ω_i^(T) > Ω(T) .

Then, to progress in the reversing and obtain s_i^(T−1) = round(ω_i^(T−1)/Ω(T−1)), it is necessary to calculate ω_i^(T−1) = (ω_i^(T) − σ_i^(T))/α. But in order to avoid dividing by the memory factor (recall that operations with real numbers are not exact in computer arithmetic), it is preferable to work with Γ_i^(T−1) = ω_i^(T) − σ_i^(T) and to compare these values to Γ(T−1) = Σ_{t=1}^{T−1} α^{T−t}. This leads to:

s_i^(T−1) = 0 if 2Γ_i^(T−1) < Γ(T−1)
s_i^(T−1) = σ_i^(T) if 2Γ_i^(T−1) = Γ(T−1)
s_i^(T−1) = 1 if 2Γ_i^(T−1) > Γ(T−1) .
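This backward bookkeeping can be sketched as follows (illustrative Python; the symbol Γ for the progressively stripped charge, and all names, are this sketch's choices, and the one-step relation is applied repeatedly):

```python
# Sketch: recover the featured states s(T-1), s(T-2), ... from omega(T) without
# ever dividing by alpha, stripping one term per step:
#   Gamma(T-tau) = Gamma(T-tau+1) - alpha**(tau-1) * sigma(T-tau+1),
# and likewise for the normalizer.
def backward_featured(history, alpha):
    omega = Omega = 0.0
    for sigma in history:                 # forward pass: omega(T), Omega(T)
        omega = alpha * omega + sigma
        Omega = alpha * Omega + 1.0
    feats, gamma, norm = [], omega, Omega
    T = len(history)
    for tau in range(1, T):               # backward pass
        sigma_up = history[T - tau]       # sigma(T - tau + 1)
        gamma -= alpha ** (tau - 1) * sigma_up
        norm -= alpha ** (tau - 1)
        if 2 * gamma > norm:
            feats.append(1)
        elif 2 * gamma < norm:
            feats.append(0)
        else:
            feats.append(sigma_up)        # tie: the state one step ahead
    return feats                          # [s(T-1), s(T-2), ..., s(1)]

# Consistency check for the history (1, 1, 0) with alpha = 0.7:
assert backward_featured([1, 1, 0], 0.7) == [1, 1]
```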
Cellular Automata with Memory, Figure 13 The reversible parity rule with memory
In general: Γ_i^(T−τ) = Γ_i^(T−τ+1) − α^{τ−1} σ_i^(T−τ+1), Γ(T−τ) = Γ(T−τ+1) − α^{τ−1}. Figure 13 shows the effect of memory on the reversible parity rule starting from a single site live cell, i.e. the scenario of Figs. 2 and 3 with the reversible qualification. As expected, the simulations for α = 0.6 or below show the ahistoric pattern at T = 4, whereas memory leads to a different pattern from α = 0.7 on, and the patterns at T = 5 for α = 0.54 and α = 0.55 differ. Again, in the reversible formulation with memory, (i) the configuration of the patterns is notably altered, (ii) the speed of diffusion of the affected area is notably reduced, even by minimal memory (α = 0.501), and (iii) high levels of memory tend to freeze the dynamics from the early time-steps. We have studied the effect of memory in the reversible formulation of CA in many scenarios, e.g., totalistic k = r = 2 rules [7], or rules with three states [21]. Reversible systems are of interest since they preserve information and energy and allow unambiguous backtracking. They are studied in computer science in order to design computers which would consume less energy [51]. Reversibility is also an important issue in fundamental physics [31,41,52,53]. Gerard 't Hooft, in a speculative paper [34], suggests that a suitably defined deterministic, lo-
Cellular Automata with Memory, Figure 14 The parity rule with four inputs: effect of memory and random rewiring. Distance between two consecutive patterns in the ahistoric model (red) and memory models of α levels 0.6, 0.7, 0.8, 0.9 (dotted) and 1.0 (blue)
cal reversible CA might provide a viable formalism for constructing field theories on a Planck scale. Svozil [50] also asks for changes in the underlying assumptions of current field theories in order to make their discretization appear more CA-like. Applications of reversible CA with memory in cryptography are being scrutinized [30,42].

Heterogeneous CA

CA on networks have arbitrary connections but, as proper CA, a transition rule identical for all cells. This generalization of the CA paradigm addresses the intermediate class between CA and Boolean networks (BN, considered in the following section), in which rules may be different at each site. In networks two topological extremes exist, random and regular networks, which display totally opposite geometric properties. Random networks have low clustering coefficients and a short average path length between nodes, commonly known as the small-world property. On the other hand, regular graphs have a large average path length between nodes and high clustering coefficients. In an attempt to build a network with the characteristics observed in real networks, a large clustering coefficient and the small-world property, Watts and Strogatz (WS, [54]) proposed a model built by randomly rewiring a regular lattice. Thus, the WS model interpolates between regular and random networks, taking a single new parameter, the random rewiring degree, i.e. the probability that any node redirects a connection, randomly, to any other node. The WS model displays the high clustering coefficient common to regular lattices as well as the small-world property (which has been related to faster information flow). The long-range links introduced by the randomization procedure dramatically reduce the diameter of the network, even when very few links are rewired. Figure 14 shows the effect of memory and topology on the parity rule with four inputs in a lattice of size 65 × 65 with periodic boundary conditions, starting at random. As expected, memory depletes the Hamming distance between two consecutive patterns in relation to the ahistoric model, particularly when the degree of rewiring is high. With full memory, quasi-oscillators tend to appear. As a rule, the higher the curve the lower the memory factor α, but in the particular case of the regular lattice (and the lattice with 10% rewiring), the evolution of the distance in the full memory model turns out rather atypical, as it stays above some memory models with lower α parameters. Figure 15 shows the evolution of the damage spread when reversing the initial state of the 3 × 3 central cells in the initial scenario of Fig. 14. The fraction of cells with the state reversed is plotted in the regular and 10%-rewiring scenarios. The plots corresponding to higher
Cellular Automata with Memory
Cellular Automata with Memory, Figure 15 Damage up to T = 100 in the parity CA of Fig. 14
Cellular Automata with Memory, Figure 16 Relative Hamming distance between two consecutive patterns. Boolean network with totalistic, K = 4 rules in the scenario of Fig. 14
rates of rewiring are very similar to that of the 10% case in Fig. 15. Damage spreads fast as soon as rewiring is present, even to a small extent.

Boolean Networks

In Boolean networks (BN, [38]), in contrast to what happens in canonical CA, cells may have arbitrary connections and rules may differ at each site. Working with totalistic rules: $\sigma_i^{(T+1)} = \phi_i\bigl(\sum_{j \in N_i} s_j^{(T)}\bigr)$. The main features of the effect of memory in Fig. 14 are preserved in Fig. 16: (i) the ordering of the historic networks tends to be stronger with a high memory factor, (ii) with full memory, quasi-oscillators appear (it seems that full memory tends to induce oscillation), (iii) in the particular case of the regular graph (and, to a lesser extent, in the networks with low rewiring), the evolution of the full memory model turns out rather atypical, as it is maintained above that of some of the memory models with lower α parameters. The relative Hamming distance between the ahistoric patterns and those of historic rewiring tends to be fairly constant around 0.3, after a very short initial transition period. Figure 17 shows the evolution of the damage when reversing the initial state of the 3 × 3 central cells. As a rule, in every frame corresponding to increasing rates of random rewiring, the higher the curve the lower the memory factor α. The damage-vanishing effect induced by memory is apparent in the regular scenario of Fig. 17, but only full memory controls the damage spreading when the rewiring degree is not high; the dynamics at the remaining α levels tend to the damage propagation that characterizes the ahistoric model. Thus, with up to 10% of connections rewired, full memory notably controls the
Cellular Automata with Memory, Figure 17 Evolution of the damage when reversing the initial state of the 3 × 3 central cells in the scenario of Fig. 16
spreading, but this control capacity tends to disappear with a higher percentage of rewired connections. In fact, with rewiring of 50% or higher, not even full memory is very effective in altering the final rate of damage, which tends to reach a plateau around 30% regardless of scenario, a level notably coincident with the percolation threshold for site percolation in the simple cubic lattice and with the critical point of the nearest-neighbor Kauffman model on the square lattice [49]: 0.31.
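The two ingredients used in this section, a partially rewired ring network and a totalistic rule acting on α-weighted featured states, can be sketched in a few lines of Python. This is our illustration, not the authors' code; the function names, the ring substrate with two neighbors per side (K = 4) and the tie-breaking choice in the rounding are assumptions:

```python
import random

def rewired_ring(n, p, rng):
    """Ring of n cells with two neighbours per side (K = 4 inputs); each
    connection is redirected with probability p to a random cell, so p = 0
    gives the regular lattice and p = 1 an essentially random network."""
    nbrs = {i: {(i - 2) % n, (i - 1) % n, (i + 1) % n, (i + 2) % n}
            for i in range(n)}
    for i in range(n):
        for j in list(nbrs[i]):
            if rng.random() < p:
                k = rng.randrange(n)
                while k == i or k in nbrs[i]:
                    k = rng.randrange(n)
                nbrs[i].remove(j)
                nbrs[i].add(k)
    return nbrs

def featured(history, alpha):
    """α-weighted mean of a cell's past states, rounded to 0/1; a tie is
    resolved to the most recent state (an assumption of this sketch)."""
    T = len(history)
    num = sum(alpha ** (T - 1 - t) * s for t, s in enumerate(history))
    den = sum(alpha ** (T - 1 - t) for t in range(T))
    m = num / den
    return history[-1] if abs(m - 0.5) < 1e-9 else round(m)

def step(histories, nbrs, alpha):
    """Parity (XOR) rule with four inputs applied to the featured states;
    the new states are appended to each cell's history."""
    s = [featured(h, alpha) for h in histories]
    new = [sum(s[j] for j in nbrs[i]) % 2 for i in range(len(histories))]
    for h, v in zip(histories, new):
        h.append(v)
    return new
```

With p = 0 this reduces to the regular lattice case, and alpha close to 0 recovers the ahistoric parity dynamics.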
Structurally Dynamic CA

Structurally dynamic cellular automata (SDCA) were suggested by Ilachinski and Halpern [36]. The essential new feature of this model is that the connections between cells are allowed to change according to rules similar in nature to the state transition rules associated with conventional CA. This means that, given certain conditions specified by the link transition rules, links between cells may be created and destroyed; the neighborhood of each cell is dynamic, so state and link configurations of an SDCA are both dynamic and continually interacting. If cells are numbered 1 to N, their connectivity is specified by an N × N connectivity matrix in which $\lambda_{ij} = 1$ if cells i and j are connected, and 0 otherwise. So now: $N_i^{(T)} = \{j \,/\, \lambda_{ij}^{(T)} = 1\}$ and $\sigma_i^{(T+1)} = \phi\bigl(\sigma_j^{(T)} \,/\, j \in N_i^{(T)}\bigr)$. The geodesic distance between two cells i and j, $\delta_{ij}$, is defined as the number of links in the shortest path between i and j. Thus, i and j are direct neighbors if $\delta_{ij} = 1$, and are next-nearest neighbors if $\delta_{ij} = 2$, so $NN_i^{(T)} = \{j \,/\, \delta_{ij}^{(T)} = 2\}$. There are two types of link transition functions in an SDCA: couplers and decouplers; the former add new links, the latter remove links. The coupler and decoupler set determines the link transition rule: $\lambda_{ij}^{(T+1)} = \psi\bigl(\lambda_{ij}^{(T)}, \sigma_i^{(T)}, \sigma_j^{(T)}\bigr)$.

Instead of introducing the full formalism of the SDCA, we deal here with just one example, in which the decoupler rule removes all links connecting cells in which both values are zero ($\lambda_{ij}^{(T)} = 1 \to \lambda_{ij}^{(T+1)} = 0$ iff $\sigma_i^{(T)} + \sigma_j^{(T)} = 0$) and the coupler rule adds links between all next-nearest neighbor sites in which both values are one ($\lambda_{ij}^{(T)} = 0 \to \lambda_{ij}^{(T+1)} = 1$ iff $\sigma_i^{(T)} + \sigma_j^{(T)} = 2$ and $j \in NN_i^{(T)}$). The SDCA with these transition rules for connections, together with the parity rule for mass states, is implemented in Fig. 18, in which the initial Euclidean lattice with four neighbors (so that the generic cell has eight next-nearest neighbors) is seeded with a 3 × 3 block of ones. After the first iteration, most of the lattice structure has decayed as an effect of the decoupler rule, so that the active value cells and links are confined to a small region. After T = 6, the link and value structures become periodic, with a periodicity of two.

Memory can be embedded in links in a similar manner as in state values, so that the link between any two cells is featured by a mapping of its previous link values: $l_{ij}^{(T)} = l\bigl(\lambda_{ij}^{(1)}, \ldots, \lambda_{ij}^{(T)}\bigr)$. The distance between two cells in the historic model ($d_{ij}$) is defined in terms of l instead of λ values, so that i and j are direct neighbors if $d_{ij} = 1$ and next-nearest neighbors if $d_{ij} = 2$. Now: $N_i^{(T)} = \{j \,/\, d_{ij}^{(T)} = 1\}$ and $NN_i^{(T)} = \{j \,/\, d_{ij}^{(T)} = 2\}$. Generalizing the approach
Cellular Automata with Memory, Figure 18 The SDCA described in the text up to T = 6
Cellular Automata with Memory, Figure 19 The SD cellular automaton introduced in the text with weighted memory of factor α. Evolution from T = 4 up to T = 9, starting as in Fig. 18
to embedded memory applied to states, the unchanged transition rules (φ and ψ) operate on the featured link and cell state values: $\sigma_i^{(T+1)} = \phi\bigl(s_j^{(T)} \,/\, j \in N_i^{(T)}\bigr)$; $\lambda_{ij}^{(T+1)} = \psi\bigl(l_{ij}^{(T)}, s_i^{(T)}, s_j^{(T)}\bigr)$. Figure 19 shows the effect of α-memory on the cellular automaton introduced above, starting as in Fig. 18. The effect of memory on SDCA in the hexagonal and triangular tessellations is scrutinized in [11].

A plausible wiring dynamics when dealing with excitable CA is that in which the decoupler rule removes all links connecting cells in which both values are in the refractory state ($\lambda_{ij}^{(T)} = 1 \to \lambda_{ij}^{(T+1)} = 0$ iff $\sigma_i^{(T)} = \sigma_j^{(T)} = 2$) and the coupler rule adds links between all next-nearest neighbor sites in which both values are excited ($\lambda_{ij}^{(T)} = 0 \to \lambda_{ij}^{(T+1)} = 1$ iff $\sigma_i^{(T)} = \sigma_j^{(T)} = 1$ and $j \in NN_i^{(T)}$). In the SDCA in Fig. 20, the transition rule for cell states is that of the generalized defensive inhibition rule: a resting cell is excited if the ratio of excited neighbors connected to the cell to the total number of connected neighbors lies in the interval [1/8, 2/8]. The initial scenario of Fig. 20 is that of Fig. 12 with the wiring network revealed, that of a Euclidean lattice with eight neighbors, in which the generic cell has 16 next-nearest neighbors. No decoupling occurs at the first iteration in Fig. 20, but the excited cells generate new connections, most of them lost, together with some of the initial ones, at T = 3. The excited cells at T = 3 generate a crown of new connections at T = 4. Figure 21 shows the ahistoric and mode memory patterns at T = 20. The figure makes apparent the preserving effect of memory.

Fredkin's reversible construction is feasible in the SDCA scenario by extending the operation also to links: $\lambda_{ij}^{(T+1)} = \psi\bigl(\lambda_{ij}^{(T)}, \sigma_i^{(T)}, \sigma_j^{(T)}\bigr) \ominus \lambda_{ij}^{(T-1)}$, with ⊖ denoting subtraction mod 2. These automata may be endowed with memory as: $\sigma_i^{(T+1)} = \phi\bigl(s_j^{(T)} \,/\, j \in N_i^{(T)}\bigr) \ominus \sigma_i^{(T-1)}$, $\lambda_{ij}^{(T+1)} = \psi\bigl(l_{ij}^{(T)}, s_i^{(T)}, s_j^{(T)}\bigr) \ominus \lambda_{ij}^{(T-1)}$ [12]. The SDCA seems particularly appropriate for modeling human brain function, in which the relevant role of memory is apparent: updating links between cells imitates the variation of synaptic connections between the neurons represented by the cells. Models similar to SDCA have been adopted to build a dynamical network approach to quantum space-time physics [45,46]. Reversibility is an important issue at such a fundamental physics level. Technical applications of SDCA may also be traced [47].
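As a concrete illustration, the example SDCA above (parity rule for states, decoupler on quiescent pairs, coupler on excited next-nearest pairs) admits a compact matrix implementation. This sketch is ours, not the authors' code; the helper name and the use of a boolean connectivity matrix are assumptions:

```python
import numpy as np

def sdca_step(state, conn):
    """One synchronous step of the example SDCA in the text:
    - cell states follow the parity rule (sum of linked neighbours mod 2),
    - the decoupler removes links whose two end cells are both 0,
    - the coupler links next-nearest neighbours whose end cells are both 1.
    `state` is a 0/1 vector, `conn` a symmetric boolean connectivity matrix."""
    n = len(state)
    c = conn.astype(int)
    # next-nearest neighbours: a two-link path exists, no direct link, i != j
    nnn = (c @ c > 0) & ~conn & ~np.eye(n, dtype=bool)
    pair_sum = state[:, None] + state[None, :]
    new_state = (c @ state) % 2
    new_conn = (conn & ~(pair_sum == 0)) | (nnn & (pair_sum == 2))
    return new_state, new_conn
```

Because the decoupler and coupler conditions are symmetric in i and j, the updated connectivity matrix stays symmetric.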
Cellular Automata with Memory, Figure 20 The k = 3 SD cellular automaton described in the text, up to T = 4
Cellular Automata with Memory, Figure 21 The SD cellular automaton starting as in Fig. 20, at T = 20, with no memory (left) and mode memory in both cell states and links
Anyway, besides their potential applications, SDCA with memory have an aesthetic and mathematical interest on their own [1,35]. It seems plausible that further study of SDCA (and of Lattice Gas Automata with dynamical geometry [40]) with memory will turn out to be profitable.

Memory in Other Discrete Time Contexts

Continuous-Valued CA

The mechanism of implementation of memory adopted here, keeping the transition rule unaltered but applying it to a function of previous states, can be adopted in any spatialized dynamical system. Thus, historic memory can be embedded in:

Continuous-valued CA (or coupled map lattices, in which the state variable ranges in $\mathbb{R}$ and the transition rule φ is a continuous function [37]), just by considering m instead of σ in the application of the updating rule: $\sigma_i^{(T+1)} = \varphi\bigl(m_j^{(T)} \,/\, j \in N_i\bigr)$. An elementary CA of this kind with memory would be [20]: $\sigma_i^{(T+1)} = \tfrac{1}{3}\bigl(m_{i-1}^{(T)} + m_i^{(T)} + m_{i+1}^{(T)}\bigr)$.

Fuzzy CA, a sort of continuous CA with states ranging in the real interval [0,1]. An illustration of the effect of memory in fuzzy CA is given in [17]. The illustration operates on the elementary rule 90: $\sigma_i^{(T+1)} = \bigl(\sigma_{i-1}^{(T)} \wedge \neg\sigma_{i+1}^{(T)}\bigr) \vee \bigl(\neg\sigma_{i-1}^{(T)} \wedge \sigma_{i+1}^{(T)}\bigr)$, which after fuzzification ($a \vee b \to \min(1, a+b)$, $a \wedge b \to ab$, $\neg a \to 1-a$) yields $\sigma_i^{(T+1)} = \sigma_{i-1}^{(T)} + \sigma_{i+1}^{(T)} - 2\sigma_{i-1}^{(T)}\sigma_{i+1}^{(T)}$; thus, incorporating memory: $\sigma_i^{(T+1)} = m_{i-1}^{(T)} + m_{i+1}^{(T)} - 2m_{i-1}^{(T)}m_{i+1}^{(T)}$.

Quantum CA, such as, for example, the simple 1D quantum CA models introduced in [32]: $\psi_j^{(T+1)} = \frac{1}{N^{1/2}}\bigl(i\delta\,\psi_{j-1}^{(T)} + \psi_j^{(T)} + i\delta\,\psi_{j+1}^{(T)}\bigr)$,
which with memory would become [20]: $\psi_j^{(T+1)} = \frac{1}{N^{1/2}}\bigl(i\delta\,m_{j-1}^{(T)} + m_j^{(T)} + i\delta\,m_{j+1}^{(T)}\bigr)$.
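For instance, the fuzzy rule 90 with memory above can be coded directly. The following sketch (ours, not from the cited papers) keeps running α-weighted sums so that the full history need not be stored:

```python
def fuzzy_rule90_with_memory(config, alpha, steps):
    """Iterate fuzzy rule 90 with memory:
        s(i) <- m(i-1) + m(i+1) - 2 m(i-1) m(i+1),
    where m is the α-weighted mean of each cell's past states
    (periodic boundaries; alpha = 0 recovers the ahistoric rule)."""
    n = len(config)
    state = [float(s) for s in config]
    num = state[:]          # running weighted sums of past states
    den = 1.0               # running sum of the weights
    for _ in range(steps):
        m = [v / den for v in num]
        state = [m[i - 1] + m[(i + 1) % n] - 2 * m[i - 1] * m[(i + 1) % n]
                 for i in range(n)]
        num = [alpha * v + s for v, s in zip(num, state)]
        den = alpha * den + 1.0
    return state
```

On a binary configuration with alpha = 0 this is exactly the ahistoric rule 90 (XOR of the two lateral neighbors), since the fuzzified formula reduces to a + b − 2ab.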
Spatial Prisoner's Dilemma

The Prisoner's Dilemma (PD) is a game played by two players (A and B), who may choose either to cooperate (C, or 1) or to defect (D, or 0). Mutual cooperators each score the reward R; mutual defectors score the punishment P; D scores the temptation T against C, who scores S (the sucker's payoff) in such an encounter. Provided that T > R > P > S, mutual defection is the only equilibrium strategy pair. Thus, in a single round both players are penalized instead of both being rewarded, but cooperation may be rewarded in an iterated (or spatial) formulation. The game is simplified (while preserving its essentials) by setting P = S = 0. Choosing R = 1, the model has only one parameter: the temptation T = b. In the spatial version of the PD, each player occupies a site (i,j) in a 2D lattice. In each generation the payoff of a given individual ($p_{i,j}^{(T)}$) is the sum over all interactions with the eight nearest neighbors and with its own site. In the next generation, an individual cell is assigned the decision ($d_{i,j}^{(T)}$) that received the highest payoff among all the cells of its Moore neighborhood. In case of a tie, the cell retains its choice. The spatialized PD (SPD for short) has proved to be a promising tool to explain how cooperation can hold out against the ever-present threat of exploitation [43], a task that presents problems in the classic Darwinian struggle-for-survival framework.

When dealing with the SPD, memory can be embedded not only in choices but also in rewards. Thus, in the historic model dealt with here, at T: (i) the payoffs coming from previous rounds are accumulated ($\pi_{i,j}^{(T)}$), and (ii) players are featured by a summary of past decisions ($\delta_{i,j}^{(T)}$). Again, in each round or generation a given cell plays with each of the eight neighbors and itself, the decision δ in the cell of the neighborhood with the highest π being adopted. This approach to modeling memory has been rather neglected, the usual one being to design strategies that specify the choice for every possible outcome in the sequence of historic choices recalled [33,39]. Table 1 shows the initial scenario starting from a single defector: if 8b > 9, i.e. b > 1.125, the neighbors of the initial defector become defectors at T = 2. Nowak and May paid particular attention in their seminal papers to b = 1.85, a high but not excessive temptation value which leads to complex dynamics. After T = 2, defection can advance to a 5 × 5 square or be
Cellular Automata with Memory, Table 1 Choices at T = 1 and T = 2; accumulated payoffs after T = 1 and T = 2, starting from a single defector in the SPD. b > 9/8

d(1) = δ(1):
1 1 1 1 1
1 1 1 1 1
1 1 0 1 1
1 1 1 1 1
1 1 1 1 1

p(1) = π(1):
9 9  9  9 9
9 8  8  8 9
9 8 8b  8 9
9 8  8  8 9
9 9  9  9 9

d(2) = δ(2):
1 1 1 1 1 1 1
1 1 1 1 1 1 1
1 1 0 0 0 1 1
1 1 0 0 0 1 1
1 1 0 0 0 1 1
1 1 1 1 1 1 1
1 1 1 1 1 1 1

π(2) = αp(1) + p(2):
9α+9  9α+9   9α+9   9α+9   9α+9   9α+9  9α+9
9α+9  9α+8   9α+7   9α+6   9α+7   9α+8  9α+9
9α+9  9α+7  8α+5b  8α+3b  8α+5b   9α+7  9α+9
9α+9  9α+6  8α+3b   8bα   8α+3b   9α+6  9α+9
9α+9  9α+7  8α+5b  8α+3b  8α+5b   9α+7  9α+9
9α+9  9α+8   9α+7   9α+6   9α+7   9α+8  9α+9
9α+9  9α+9   9α+9   9α+9   9α+9   9α+9  9α+9
restrained as a 3 × 3 square, depending on the comparison of 8α + 5·1.85 (the maximum π value of the recent defectors) with 9α + 9 (the π value of the non-affected players). As 8α + 5·1.85 = 9α + 9 → α = 0.25, if α > 0.25 defection remains confined to a 3 × 3 square at T = 3. Here we see the paradigmatic effect of memory: it tends to avoid the spread of defection. If memory is limited to the last three iterations: $\pi_{i,j}^{(T)} = \alpha^2 p_{i,j}^{(T-2)} + \alpha p_{i,j}^{(T-1)} + p_{i,j}^{(T)}$, $m_{i,j}^{(T)} = \bigl(\alpha^2 d_{i,j}^{(T-2)} + \alpha d_{i,j}^{(T-1)} + d_{i,j}^{(T)}\bigr)/(\alpha^2 + \alpha + 1)$, $\delta_{i,j}^{(T)} = \mathrm{round}\bigl(m_{i,j}^{(T)}\bigr)$, with assignments at T = 2: $\pi_{i,j}^{(2)} = \alpha p_{i,j}^{(1)} + p_{i,j}^{(2)}$, $\delta_{i,j}^{(2)} = d_{i,j}^{(2)}$. Memory has a dramatic restrictive effect on the advance of defection, as shown in Fig. 22. This figure shows the frequency of cooperators (f) starting from a single defector and from a random configuration of defectors in a lattice of size 400 × 400 with periodic boundary conditions when b = 1.85. When starting from a single defector, f at time step T is computed as the frequency of cooperators within the square of size $(2(T-1)+1)^2$ centered on the initial D site. The ahistoric plot reveals the convergence of f to 0.318 (which seems to be the same value regardless of the initial conditions [43]). Starting from a single defector (a), the model with small memory (α = 0.1) seems to reach a similar f value, but sooner and in a smoother way. The plot corresponding to α = 0.2 still shows an early decay in f that leads it to about 0.6, but higher memory factor values lead f close to or over 0.9 very soon. Starting at random (b), the curves corresponding to
Cellular Automata with Memory, Figure 22 Frequency of cooperators (f) with memory of the last three iterations: a starting from a single defector, b starting at random (f(1) = 0.5). The red curves correspond to the ahistoric model, the blue ones to the full memory model, and the remaining curves to values of α from 0.1 to 0.9 in 0.1 intervals, in which, as a rule, the higher the α the higher the f for any given T
0.1 ≤ α ≤ 0.6 (thus with no memory of choices) mimic the ahistoric curve but with higher f; for α ≥ 0.7 (with memory of choices also operative) the frequency of cooperators grows monotonically to reach almost full cooperation: D persists as scattered, unconnected small oscillators (D-blinkers), as shown in Fig. 23. Similar results are found for any temptation value in the parameter region 1.8 < b < 2.0, in which spatial chaos is characteristic in the ahistoric model. It is concluded, then, that short-type memory supports cooperation. As a natural extension of the described binary model, the 0-1 assumption underlying the model can be relaxed by allowing for degrees of cooperation in a continuous-valued scenario. Denoting by x the degree of cooperation of player A and by y the degree of cooperation of player B, a consistent way to specify the payoff for values of x and y other than zero or one is simply to interpolate between the extreme payoffs of the binary case. Thus, the payoff that player A receives is:
$$G_A(x, y) = (x,\; 1-x) \begin{pmatrix} R & S \\ T & P \end{pmatrix} \begin{pmatrix} y \\ 1-y \end{pmatrix}.$$

In the continuous-valued historic formulation it is δ ≡ m, including $\delta_{i,j}^{(2)} = \bigl(\alpha d_{i,j}^{(1)} + d_{i,j}^{(2)}\bigr)/(\alpha + 1)$. Table 2 illustrates the initial scenario starting from a single (full) defector. Unlike in the binary model, in which the initial defector never becomes a cooperator, the initial defector cooperates with degree α/(1 + α) at T = 3: its neighbors which received the highest accumulated payoff (those in the corners, with $\pi^{(2)} = 8\alpha + 5b > 8b\alpha$) achieved this mean degree of cooperation after T = 2. Memory dramatically constrains the advance of defection in a smooth way, even for the low level α = 0.1. The effect appears much more homogeneous compared to the binary model, with no special case for high values of α, as memory of decisions is always operative in the continuous-valued model [24]. The effect of unlimited trailing memory on the SPD has been studied in [5,6,7,8,9,10].

Cellular Automata with Memory, Figure 23 Patterns at T = 200 starting at random in the scenario of Fig. 22b
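The bilinear payoff interpolation and the α-weighted summary of past choices can be sketched as follows (our illustration, not the authors' code; the default payoff values follow the simplified parametrization R = 1, P = S = 0, T = b = 1.85):

```python
def payoff_A(x, y, R=1.0, S=0.0, T=1.85, P=0.0):
    """G_A(x, y) = (x, 1-x) . [[R, S], [T, P]] . (y, 1-y)^T:
    bilinear interpolation of the binary PD payoffs for degrees of
    cooperation x (player A) and y (player B)."""
    return x * (R * y + S * (1 - y)) + (1 - x) * (T * y + P * (1 - y))

def summary_choice(choices, alpha):
    """α-weighted mean of a player's past degrees of cooperation
    (oldest first); e.g. a player that cooperated and then fully
    defected is featured by delta(2) = alpha / (1 + alpha)."""
    weights = [alpha ** t for t in range(len(choices) - 1, -1, -1)]
    return sum(w * d for w, d in zip(weights, choices)) / sum(weights)
```

At the corners (x, y) in {0, 1} the interpolation reproduces the binary payoffs exactly: payoff_A(1, 1) = R, payoff_A(0, 1) = T, payoff_A(1, 0) = S, payoff_A(0, 0) = P.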
Cellular Automata with Memory, Table 2 Weighted mean degrees of cooperation after T = 2 and degrees of cooperation at T = 3, starting with a single defector in the continuous-valued SPD with b = 1.85
δ(2):
1    1         1         1       1
1  α/(1+α)  α/(1+α)  α/(1+α)    1
1  α/(1+α)     0      α/(1+α)   1
1  α/(1+α)  α/(1+α)  α/(1+α)    1
1    1         1         1       1

d(3) (α < 0.25):
1    1         1         1         1         1      1
1  α/(1+α)  α/(1+α)  α/(1+α)  α/(1+α)  α/(1+α)  1
1  α/(1+α)  α/(1+α)  α/(1+α)  α/(1+α)  α/(1+α)  1
1  α/(1+α)  α/(1+α)  α/(1+α)  α/(1+α)  α/(1+α)  1
1  α/(1+α)  α/(1+α)  α/(1+α)  α/(1+α)  α/(1+α)  1
1  α/(1+α)  α/(1+α)  α/(1+α)  α/(1+α)  α/(1+α)  1
1    1         1         1         1         1      1

d(3) (α > 0.25):
1  1     1         1         1      1  1
1  1     1         1         1      1  1
1  1  α/(1+α)  α/(1+α)  α/(1+α)  1  1
1  1  α/(1+α)  α/(1+α)  α/(1+α)  1  1
1  1  α/(1+α)  α/(1+α)  α/(1+α)  1  1
1  1     1         1         1      1  1
1  1     1         1         1      1  1

Discrete-Time Dynamical Systems

Memory can be embedded in any model in which time plays a dynamical role. Thus, Markov chains $p'_{T+1} = p'_T M$ become, with memory, $p'_{T+1} = \delta'_T M$, with $\delta_T$ being a weighted mean of the probability distributions up to T: $\delta_T = m(p_1, \ldots, p_T)$. In such a scenario, even a minimal incorporation of memory notably alters the evolution of p [23]. Last but not least, conventional, non-spatialized, discrete dynamical systems become, with memory, $x_{T+1} = f(m_T)$, with $m_T$ being an average of past values. As an overall rule, memory leads the dynamics to a fixed point of the map f [4].

We will introduce an example of this in the context of the PD game, in which players follow the so-called Paulov strategy: a Paulov player cooperates if and only if both players opted for the same alternative in the previous move. The name Paulov stems from the fact that this strategy embodies an almost reflex-like response to the payoff: it repeats its former move if it was rewarded by T or R, but switches behavior if it was punished by receiving only P or S. By coding cooperation as 1 and defection as 0, this strategy can be formulated in terms of the choices x of player A (Paulov) and y of player B as: $x^{(T+1)} = 1 - |x^{(T)} - y^{(T)}|$. The Paulov strategy has proved to be very successful in its contests with other strategies [44]. Let us give a simple example of this: suppose that player B adopts an Anti-Paulov strategy (which cooperates to the extent that Paulov defects), with $y^{(T+1)} = 1 - |1 - x^{(T)} - y^{(T)}|$. Thus, in an iterated Paulov-Anti-Paulov (PAP) contest, with $T(x, y) = \bigl(1 - |x - y|,\; 1 - |1 - x - y|\bigr)$, it is $T(0,0) = T(1,1) = (1,0)$, $T(1,0) = (0,1)$, and $T(0,1) = (0,1)$, so that (0,1) turns out to be immutable. Therefore, in an iterated PAP contest, Paulov will always defect and Anti-Paulov will always cooperate.

Relaxing the 0-1 assumption in the standard formulation of the PAP contest, degrees of cooperation can be considered in a continuous-valued scenario. Now x and y denote the degrees of cooperation of players A and B respectively, with both x and y lying in [0,1]. In this scenario, not only is (0,1) a fixed point, but also $T(0.8, 0.6) = (0.8, 0.6)$. Computer implementation of the iterated PAP tournament turns out to be fully disruptive of the theoretical dynamics. The errors caused by the finite precision of the computer floating-point arithmetic (a common problem in dynamical systems working modulo 1) make the final fate of every point (0,1), with no exceptions: even the theoretically fixed point (0.8, 0.6) ends up as (0,1) in the computerized implementation.

A natural way to incorporate older choices in the strategies of decision is to feature players by a summary
Cellular Automata with Memory, Figure 24 Dynamics of the mean values of x (red) and y (blue) starting from any of the points of the 1 × 1 square
(m) of their own choices farther back in time. The PAP contest becomes in this way: $x^{(T+1)} = 1 - |m_x^{(T)} - m_y^{(T)}|$, $y^{(T+1)} = 1 - |1 - m_x^{(T)} - m_y^{(T)}|$. The simplest historic extension results from considering only the two last choices: $m\bigl(z^{(T-1)}, z^{(T)}\bigr) = \bigl(\alpha z^{(T-1)} + z^{(T)}\bigr)/(\alpha + 1)$ (z stands for both x and y) [10]. Figure 24 shows the dynamics of the mean values of x and y starting from any of the 101 × 101 lattice points of the 1 × 1 square with sides divided in 0.01 intervals. The dynamics in the ahistoric context are rather striking: immediately, at T = 2, both x and y increase from 0.5 up to approximately 0.66 (≃ 2/3), a value which remains stable up to approximately T = 100; but soon after, Paulov cooperation plummets, with the corresponding firing of the cooperation of Anti-Paulov: finite precision arithmetic leads every point to (0,1). With memory, Paulov not only keeps a permanent mean degree of cooperation, but it is higher than that of Anti-Paulov; memory tends to lead the overall dynamics to the (theoretically) ahistoric fixed point (0.8, 0.6).

Future Directions

Embedding memory in states (and in links if the wiring is also dynamic) broadens the spectrum of CA as a modeling tool in a fairly natural way that is easy to implement on a computer. It is likely that in some contexts a transition rule with memory could match the correct behavior of the CA system modeling a given complex system (physical, biological, social and so on). A major impediment in modeling with CA stems from the difficulty of harnessing the complex behavior of CA to exhibit a particular behavior or perform a particular function. Average memory in CA tends to inhibit complexity, an inhibition that can be modulated by varying the depth of memory, but memory not of average type opens a notable new perspective in CA. This could mean a potential advantage of CA with memory over standard CA as a tool for modeling. Anyway, besides their potential applications, CA with memory (CAM) have an aesthetic and mathematical interest on their own.
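As a closing illustration of the discrete-time systems above, the Paulov versus Anti-Paulov iteration with two-step memory can be sketched as follows (our code, not the authors'; alpha = 0 recovers the ahistoric contest):

```python
def pap_contest(x0, y0, alpha, steps):
    """Iterated Paulov (x) vs Anti-Paulov (y) contest in which each
    player is featured by the α-weighted mean of its two last choices,
    m(z', z) = (alpha*z' + z)/(alpha + 1)."""
    xprev, x = x0, x0
    yprev, y = y0, y0
    for _ in range(steps):
        mx = (alpha * xprev + x) / (alpha + 1)
        my = (alpha * yprev + y) / (alpha + 1)
        xprev, yprev = x, y
        x = 1 - abs(mx - my)
        y = 1 - abs(1 - mx - my)
    return x, y
```

The theoretically immutable point (0,1) stays exactly fixed under iteration, while the fixed point (0.8, 0.6) is only preserved up to floating-point round-off, which is precisely the disruption discussed above.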
Thus, it seems plausible that further study of CA with memory will turn out to be profitable, and perhaps, as a result of a further rigorous study of CAM, it will become possible to paraphrase T. Toffoli in presenting CAM as an alternative to (rather than an approximation of) integro-differential equations in modeling phenomena with memory.

Bibliography

Primary Literature

1. Adamatzky A (1994) Identification of Cellular Automata. Taylor and Francis
2. Adamatzky A (2001) Computing in Nonlinear Media and Automata Collectives. IoP Publishing, London
3. Adamatzky A, Holland O (1998) Phenomenology of excitation in 2-D cellular automata and swarm systems. Chaos Solit Fract 9:1233-1265
4. Aicardi F, Invernizzi S (1992) Memory effects in discrete dynamical systems. Int J Bifurc Chaos 2(4):815-830
5. Alonso-Sanz R (1999) The Historic Prisoner's Dilemma. Int J Bifurc Chaos 9(6):1197-1210
6. Alonso-Sanz R (2003) Reversible Cellular Automata with Memory. Phys D 175:1-30
7. Alonso-Sanz R (2004) One-dimensional, r = 2 cellular automata with memory. Int J Bifurc Chaos 14:3217-3248
8. Alonso-Sanz R (2004) One-dimensional, r = 2 cellular automata with memory. Int J Bifurc Chaos 14:3217-3248
9. Alonso-Sanz R (2005) Phase transitions in an elementary probabilistic cellular automaton with memory. Phys A 347:383-401; Alonso-Sanz R, Martin M (2004) Elementary Probabilistic Cellular Automata with Memory in Cells. In: Sloot PMA et al (eds) LNCS, vol 3305. Springer, Berlin, pp 11-20
10. Alonso-Sanz R (2005) The Paulov versus Anti-Paulov contest with memory. Int J Bifurc Chaos 15(10):3395-3407
11. Alonso-Sanz R (2006) A Structurally Dynamic Cellular Automaton with Memory in the Triangular Tessellation. Complex Syst 17(1):1-15; Alonso-Sanz R, Martin M (2006) A Structurally Dynamic Cellular Automaton with Memory in the Hexagonal Tessellation. In: El Yacoubi S, Chopard B, Bandini S (eds) LNCS, vol 4774. Springer, Berlin, pp 30-40
12. Alonso-Sanz R (2007) Reversible Structurally Dynamic Cellular Automata with Memory: a simple example. J Cell Autom 2:197-201
13. Alonso-Sanz R (2006) The Beehive Cellular Automaton with Memory. J Cell Autom 1:195-211
14. Alonso-Sanz R (2007) A Structurally Dynamic Cellular Automaton with Memory. Chaos Solit Fract 32:1285-1295
15. Alonso-Sanz R, Adamatzky A (2008) On memory and structural dynamism in excitable cellular automata with defensive inhibition. Int J Bifurc Chaos 18(2):527-539
16. Alonso-Sanz R, Cardenas JP (2007) On the effect of memory in Boolean networks with disordered dynamics: the K = 4 case. Int J Mod Phys C 18:1313-1327
17. Alonso-Sanz R, Martin M (2002) One-dimensional cellular automata with memory: patterns starting with a single site seed. Int J Bifurc Chaos 12:205-226
18. Alonso-Sanz R, Martin M (2002) Two-dimensional cellular automata with memory: patterns starting with a single site seed. Int J Mod Phys C 13:49-65
19. Alonso-Sanz R, Martin M (2003) Cellular automata with accumulative memory: legal rules starting from a single site seed. Int J Mod Phys C 14:695-719
20. Alonso-Sanz R, Martin M (2004) Elementary cellular automata with memory. Complex Syst 14:99-126
21. Alonso-Sanz R, Martin M (2004) Three-state one-dimensional cellular automata with memory. Chaos Solit Fract 21:809-834
22. Alonso-Sanz R, Martin M (2005) One-dimensional Cellular Automata with Memory in Cells of the Most Frequent Recent Value. Complex Syst 15:203-236
23. Alonso-Sanz R, Martin M (2006) Elementary Cellular Automata with Elementary Memory Rules in Cells: the case of linear rules. J Cell Autom 1:70-86
24. Alonso-Sanz R, Martin M (2006) Memory Boosts Cooperation. Int J Mod Phys C 17(6):841-852
25. Alonso-Sanz R, Martin MC, Martin M (2000) Discounting in the Historic Prisoner's Dilemma. Int J Bifurc Chaos 10(1):87-102
26. Alonso-Sanz R, Martin MC, Martin M (2001) Historic Life. Int J Bifurc Chaos 11(6):1665-1682
27. Alonso-Sanz R, Martin MC, Martin M (2001) The Effect of Memory in the Spatial Continuous-valued Prisoner's Dilemma. Int J Bifurc Chaos 11(8):2061-2083
28. Alonso-Sanz R, Martin MC, Martin M (2001) The Historic Strategist. Int J Bifurc Chaos 11(4):943-966
29. Alonso-Sanz R, Martin MC, Martin M (2001) The Historic-Stochastic Strategist. Int J Bifurc Chaos 11(7):2037-2050
30. Alvarez G, Hernandez A, Hernandez L, Martin A (2005) A secure scheme to share secret color images. Comput Phys Commun 173:9-16
31. Fredkin E (1990) Digital mechanics. An informal process based on reversible universal cellular automata. Physica D 45:254-270
32. Grössing G, Zeilinger A (1988) Structures in Quantum Cellular Automata. Physica B 15:366
33. Hauert C, Schuster HG (1997) Effects of increasing the number of players and memory steps in the iterated Prisoner's Dilemma, a numerical approach. Proc R Soc Lond B 264:513-519
34. Hooft G (1988) Equivalence Relations Between Deterministic and Quantum Mechanical Systems. J Stat Phys 53(1/2):323-344
35. Ilachinski A (2000) Cellular Automata. World Scientific, Singapore
36. Ilachinski A, Halpern P (1987) Structurally dynamic cellular automata. Complex Syst 1:503-527
37. Kaneko K (1986) Phenomenology and Characterization of Coupled Map Lattices. In: Dynamical Systems and Singular Phenomena. World Scientific, Singapore
38. Kauffman SA (1993) The Origins of Order: Self-Organization and Selection in Evolution. Oxford University Press, Oxford
39. Lindgren K, Nordahl MG (1994) Evolutionary dynamics of spatial games. Physica D 75:292-309
40. Love PJ, Boghosian BM, Meyer DA (2004) Lattice gas simulations of dynamical geometry in one dimension. Phil Trans R Soc Lond A 362:1667
41. Margolus N (1984) Physics-like Models of Computation. Physica D 10:81-95
42. Martin del Rey A, Pereira Mateus J, Rodriguez Sanchez G (2005) A secret sharing scheme based on cellular automata. Appl Math Comput 170(2):1356-1364
43. Nowak MA, May RM (1992) Evolutionary games and spatial chaos. Nature 359:826
44. Nowak MA, Sigmund K (1993) A strategy of win-stay, lose-shift that outperforms tit-for-tat in the Prisoner's Dilemma game. Nature 364:56-58
45. Requardt M (1998) Cellular networks as models for Planck-scale physics. J Phys A 31:7797; Requardt M (2006) The continuum limit of discrete geometries. arxiv.org/abs/math-ps/0507017
46. Requardt M (2006) Emergent Properties in Structurally Dynamic Disordered Cellular Networks. J Cell Autom 2:273
47. Ros H, Hempel H, Schimansky-Geier L (1994) Stochastic dynamics of catalytic CO oxidation on Pt(100). Physica A 206:421-440
48. Sanchez JR, Alonso-Sanz R (2004) Multifractal Properties of R90 Cellular Automaton with Memory. Int J Mod Phys C 15:1461
49. Stauffer D, Aharony A (1994) Introduction to Percolation Theory. CRC Press, London
50. Svozil K (1986) Are quantum fields cellular automata? Phys Lett A 119(41):153-156
51. Toffoli T, Margolus N (1987) Cellular Automata Machines. MIT Press, Cambridge MA
52. Toffoli T, Margolus N (1990) Invertible cellular automata: a review. Physica D 45:229-253
53. Vichniac G (1984) Simulating physics with cellular automata. Physica D 10:96-115
54. Watts DJ, Strogatz SH (1998) Collective dynamics of 'small-world' networks. Nature 393:440-442
55. Wolf-Gladrow DA (2000) Lattice-Gas Cellular Automata and Lattice Boltzmann Models. Springer, Berlin
56. Wolfram S (1984) Universality and complexity in cellular automata. Physica D 10:1-35
57. Wuensche A (2005) Glider dynamics in 3-value hexagonal cellular automata: the beehive rule. Int J Unconv Comput 1:375-398
58. Wuensche A, Lesser M (1992) The Global Dynamics of Cellular Automata. Addison-Wesley, Reading MA
Books and Reviews

Alonso-Sanz R (2008) Cellular Automata with Memory. Old City Publishing, Philadelphia (in press)
Cellular Automata Modeling of Physical Systems

BASTIEN CHOPARD
Computer Science Department, University of Geneva, Geneva, Switzerland

Article Outline

Glossary
Definition of the Subject
Introduction
Definition of a Cellular Automata
Limitations, Advantages, Drawbacks, and Extensions
Applications
Future Directions
Bibliography

Cellular automata offer a powerful modeling framework to describe and study physical systems composed of interacting components. The potential of this approach is demonstrated in the case of applications taken from various fields of physics, such as reaction-diffusion systems, pattern formation phenomena, fluid flows and road traffic models.

Glossary

BGK models Lattice Boltzmann models where the collision term $\Omega$ is expressed as a deviation from a given local equilibrium distribution $f^{(0)}$, namely $\Omega = (f^{(0)} - f)/\tau$, where f is the unknown particle distribution and $\tau$ a relaxation time (which is a parameter of the model). BGK stands for Bhatnagar, Gross and Krook, who first considered such a collision term, although not specifically in the context of lattice systems.

CA Abbreviation for cellular automata or cellular automaton.

Cell The elementary spatial component of a CA. The cell is characterized by a state whose value evolves in time according to the CA rule.

Cellular automaton System composed of adjacent cells or sites (usually organized as a regular lattice) which evolves in discrete time steps. Each cell is characterized by an internal state whose value belongs to a finite set. The updating of these states is made in parallel according to a local rule involving only a neighborhood of each cell.

Conservation law A property of a physical system in which some quantity (such as mass, momentum or energy) is locally conserved during the time evolution. These conservation laws should be included in the microdynamics of a CA model because they are essential ingredients governing the macroscopic behavior of any physical system.

Collision The process by which the particles of an LGA change their direction of motion.

Continuity equation An equation of the form $\partial_t \rho + \mathrm{div}\,\rho u = 0$ expressing the mass (or particle number) conservation law. The quantity $\rho$ is the local density of particles and u the local velocity field.

Critical phenomena The phenomena which occur in the vicinity of a continuous phase transition, and are characterized by a very long correlation length.

Diffusion A physical process described by the equation $\partial_t \rho = D \nabla^2 \rho$, where $\rho$ is the density of a diffusing substance. Microscopically, diffusion can be viewed as a random motion of particles.

DLA Abbreviation of Diffusion Limited Aggregation, a model of a physical growth process in which diffusing particles stick to an existing cluster when they hit it. Initially, the cluster is reduced to a single seed particle and grows as more and more particles arrive. A DLA cluster is a fractal object whose dimension is typically 1.72 if the experiment is conducted in a two-dimensional space.

Dynamical system A system of equations (differential equations or discretized equations) modeling the dynamical behavior of a physical system.

Equilibrium states States characterizing a closed system or a system in thermal equilibrium with a heat bath.

Ergodicity Property of a system or process for which the time averages of the observables converge, in a probabilistic sense, to their ensemble averages.

Exclusion principle A restriction imposed on LGA or CA models to limit the number of particles per site and/or lattice direction. This ensures that the dynamics can be described with a cellular automata rule with a given maximum number of bits. The consequence of this exclusion principle is that the equilibrium distribution of the particle numbers follows a Fermi-Dirac-like distribution in LGA dynamics.
FHP model Abbreviation for the Frisch, Hasslacher and Pomeau lattice gas model, which was the first serious candidate for simulating two-dimensional hydrodynamics on a hexagonal lattice.
Fractal Mathematical object usually having a geometrical representation and whose spatial dimension is not an integer. The relation between the size of the object and its "mass" does not obey that of usual geometrical objects. A DLA cluster is an example of a fractal.
Front The region where some physical process occurs. Usually the front includes the locations in space that
are first affected by the phenomena. For instance, in a reaction process between two spatially separated reactants, the front describes the region where the reaction takes place.
HPP model Abbreviation for the Hardy, de Pazzis and Pomeau model, the first two-dimensional LGA, aimed at modeling the behavior of particles colliding on a square lattice with mass and momentum conservation. The HPP model has several physical drawbacks that have been overcome with the FHP model.
Invariant A quantity which is conserved during the evolution of a dynamical system. Some invariants are imposed by the physical laws (mass, momentum, energy) and others result from the model used to describe physical situations (spurious, staggered invariants). Collisional invariants are constant vectors in the space where the Chapman–Enskog expansion is performed, associated with each quantity conserved by the collision term.
Ising model Hamiltonian model describing the ferromagnetic–paramagnetic transition. Each local classical spin variable s_i = ±1 interacts with its neighbors.
Isotropy The property of continuous systems to be invariant under any rotation of the spatial coordinate system. Physical quantities defined on a lattice and obtained by an averaging procedure may or may not be isotropic in the continuous limit; this depends on the type of lattice and the nature of the quantity. Second-order tensors are isotropic on a 2D square lattice, but fourth-order tensors need a hexagonal lattice.
Lattice The set of cells (or sites) making up the spatial area covered by a CA.
Lattice Boltzmann model A physical model defined on a lattice where the variables associated with each site represent an average number of particles or the probability of the presence of a particle with a given velocity. Lattice Boltzmann models can be derived from cellular automata dynamics by an averaging and factorization procedure, or be defined per se, independently of a specific realization.
Lattice gas A system defined on a lattice where particles are present and follow given dynamics. Lattice gas automata (LGA) are a particular class of such systems where the dynamics are performed in parallel over all the sites and can be decomposed into two stages: (i) propagation: the particles jump to a nearest-neighbor site, according to their direction of motion, and (ii) collision: the particles entering the same site at the same iteration interact so as to produce a new particle distribution. HPP and FHP are well-known LGA.
Lattice spacing The separation between two adjacent sites of a regular lattice. Throughout this book, it is denoted by the symbol Δr.
LB Abbreviation for Lattice Boltzmann.
LGA Abbreviation for Lattice Gas Automaton. See lattice gas for a definition.
Local equilibrium Situation in which a large system can be decomposed into subsystems, very small on a macroscopic scale but large on a microscopic scale, such that each subsystem can be assumed to be in thermal equilibrium. The local equilibrium distribution is the function which makes the collision term of a Boltzmann equation vanish.
Lookup table A table in which all possible outcomes of a cellular automata rule are pre-computed. The use of a lookup table yields a fast implementation of cellular automata dynamics since, however complicated a rule is, the evolution of any configuration of a site and its neighbors is directly obtained through a memory access. The size of a lookup table grows exponentially with the number of bits involved in the rule.
Margolus neighborhood A neighborhood made of two-by-two blocks of cells, typically in a two-dimensional square lattice. Each cell is updated according to the values of the other cells in the same block. A different rule may possibly be assigned dependent on whether the cell is at the upper left, upper right, lower left or lower right location. After each iteration, the lattice partition defining the Margolus blocks is shifted one cell right and one cell down so that at every other step, information can be exchanged across the lattice. Can be generalized to higher dimensions.
Microdynamics The Boolean equation governing the time evolution of an LGA model or a cellular automata system.
Moore neighborhood A neighborhood composed of the central cell and all eight nearest and next-nearest neighbors in a two-dimensional square lattice. Can be generalized to higher dimensions.
Multiparticle models Discrete dynamics modeling a physical system in which an arbitrary number of particles is allowed at each site. This is an extension of an LGA where no exclusion principle is imposed.
Navier–Stokes equation The equation describing the velocity field u in a fluid flow. For an incompressible fluid (∂_t ρ = 0), it reads

  ∂_t u + (u·∇)u = −(1/ρ)∇P + ν∇²u

where ρ is the density and P the pressure. The Navier–Stokes equation expresses the local momentum conservation in the fluid and, as opposed to the Euler equation, includes the dissipative effects with a viscosity term ν∇²u. Together with the continuity equation, this is the fundamental equation of fluid dynamics.
Neighborhood The set of all cells necessary to compute a cellular automaton rule. A neighborhood is usually composed of several adjacent cells organized in a simple geometrical structure. Moore, von Neumann and Margolus neighborhoods are typical examples.
Occupation numbers Boolean quantities indicating the presence or absence of a particle in a given physical state.
Open system A system communicating with the environment by exchange of energy or matter.
Parallel Refers to an action which is performed simultaneously at several places. A parallel updating rule corresponds to the updating of all cells at the same time, as if resulting from the computations of several independent processors.
Partitioning A technique consisting of dividing space into adjacent domains (through a partition) so that the evolution of each block is uniquely determined by the states of the elements within the block.
Phase transition Change of state obtained when varying a control parameter, such as the one occurring in the boiling or freezing of a liquid, or in the change between the ferromagnetic and paramagnetic states of a magnetic solid.
Propagation The process by which the particles of an LGA are moved to a nearest neighbor, according to the direction of their velocity vector v_i. In one time step Δt the particles travel from cell r to cell r + v_i Δt, where r + v_i Δt is the nearest neighbor in lattice direction i.
Random walk A series of uncorrelated steps of length unity describing a random path with zero average displacement but characteristic size proportional to the square root of the number of steps.
Reaction-diffusion systems Systems made of one or several species of particles which diffuse and react among themselves to produce some new species.
Scaling hypothesis A hypothesis concerning the analytical properties of the thermodynamic potentials and the correlation functions in a problem invariant under a change of scale.
Scaling law Relations among the critical exponents describing the power law behaviors of physical quantities in systems invariant under a change of scale.
Self-organized criticality Concept aimed at describing a class of dynamical systems which naturally drive
themselves to a state where interesting physics occurs at all scales.
Site Same as a cell, but preferred terminology in LGA and LB models.
Spatially extended systems Physical systems involving many spatial degrees of freedom which, usually, have rich dynamics and show complex behaviors. Coupled map lattices and cellular automata provide a way to model spatially extended systems.
Spin Internal degree of freedom associated with particles in order to describe their magnetic state. A widely used case is that of classical Ising spins: to each particle, one associates an "arrow" which is allowed to take only two different orientations, up or down.
Time step Interval of time separating two consecutive iterations in the evolution of a discrete time process, like a CA or an LB model. Throughout this work the time step is denoted by the symbol Δt.
Universality The phenomenon whereby many microscopically different systems exhibit a critical behavior with quantitatively identical properties such as the critical exponents.
Updating Operation consisting of assigning a new value to a set of variables, for instance those describing the states of a cellular automata system. The updating can be done in parallel and synchronously, as is the case in CA dynamics, or sequentially, one variable after another, as is usually the case for Monte Carlo dynamics. Parallel, asynchronous updating is less common but can be envisaged too. Sequential and parallel updating schemes may yield different results since the interdependencies between variables are treated differently.
Viscosity A property of a fluid indicating how much momentum "diffuses" through the fluid in an inhomogeneous flow pattern. Equivalently, it describes the stress occurring between two fluid layers moving with different velocities. A high viscosity means that the resulting drag force is large, and a low viscosity means that this force is weak.
Kinematic viscosity is usually denoted by ν and dynamic viscosity by η = ρν, where ρ is the fluid density.
von Neumann neighborhood On a two-dimensional square lattice, the neighborhood including a central cell and its nearest neighbors north, south, east and west.
Ziff model A simple model describing adsorption–dissociation–desorption on a catalytic surface. This model is based upon some of the known steps of the reaction A–B2 on a catalyst surface (for example CO–O2).
Cellular Automata Modeling of Physical Systems, Figure 1 The game of life automaton. Black dots represent living cells whereas dead cells are white. The figure shows the evolution of a random initial configuration and the formation of spatial structures, with possibly some emerging functionalities
Definition of the Subject
The computational science community has always faced the challenge of bringing efficient numerical tools to bear on problems of increasing difficulty. Nowadays, the investigation and understanding of so-called complex systems, and the simulation of all kinds of phenomena originating from the interaction of many components, are of central importance in many areas of science. Cellular automata turn out to be a very fruitful approach to many scientific problems, providing an efficient way to model and simulate specific phenomena for which more traditional computational techniques are hardly applicable. The goal of this article is to provide the reader with the foundations of this approach, as well as a selection of simple applications of cellular automata to the modeling of physical systems. We invite the reader to consult the web site http://cui.unige.ch/~chopard/CA/Animations/img-root.html in order to view short movies about several of the models discussed in this article.

Introduction
Cellular automata (hereafter termed CA) are an idealization of the physical world in which space and time are of a discrete nature. In addition to space and time, the physical quantities also take only a finite set of values. Since it was proposed by von Neumann in the late 1940s, the cellular automata approach has been applied to a large range of scientific problems (see for instance [4,10,16,35,42,51]). International conferences (e.g. ACRI) and dedicated journals (J. of Cellular Automata) also describe current developments. When von Neumann developed the concept of CA, his motivation was to extract the abstract (or algorithmic) mechanisms leading to the self-reproduction of biological organisms [6].
Following the suggestions of S. Ulam, von Neumann addressed this question in the framework of a fully discrete universe made up of simple cells. Each cell is characterized by an internal state, which typically consists of a finite number of information bits. Von Neumann suggested that this system of cells evolves, in discrete time steps, like simple automata which only know a simple recipe to compute their new internal state. The rule determining the evolution of this system is the same for all cells and is a function of the states of the cell itself and its neighbors. Similarly to what happens in any biological system, the activity of the cells takes place simultaneously: the same clock is assumed to drive the evolution of every cell, and the updating of their internal states occurs synchronously. Such a fully discrete dynamical system (cellular space), as invented by von Neumann, is now referred to as a cellular automaton. Among the early applications of CA, the game of life [19] is famous. In 1970, the mathematician John Conway proposed a simple model leading to complex behaviors. He imagined a two-dimensional square lattice, like a checkerboard, in which each cell can be either alive (state one) or dead (state zero). The updating rule of the game of life is as follows: a dead cell surrounded by exactly three living cells comes back to life; a living cell surrounded by fewer than two or more than three living neighbors dies of isolation or overcrowding. Here, the surrounding cells correspond to the neighborhood composed of the four nearest cells (north, south, east and west), plus the four second-nearest neighbors, along the diagonals. It turns out that the game of life automaton has an unexpectedly rich behavior. Complex structures emerge out of a primitive "soup" and evolve so as to develop skills that are absent from the elementary cells (see Fig. 1).
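The rule just described fits in a few lines of code. The following Python/NumPy sketch is one possible transcription; the periodic boundary condition and the lattice size are choices of this sketch, not part of Conway's original definition.

```python
import numpy as np

def life_step(grid):
    """One synchronous update of the game of life on a periodic
    square lattice. grid holds 0 (dead) or 1 (alive)."""
    # Sum the eight Moore neighbors by shifting the whole lattice.
    n = sum(np.roll(np.roll(grid, di, axis=0), dj, axis=1)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0))
    # Birth: a dead cell with exactly three living neighbors.
    # Survival: a living cell with two or three living neighbors.
    return ((n == 3) | ((grid == 1) & (n == 2))).astype(int)

# A "blinker": three cells in a row oscillate with period 2.
g = np.zeros((5, 5), dtype=int)
g[2, 1:4] = 1
g2 = life_step(life_step(g))
```

Applying life_step twice to the blinker returns the initial configuration, a small check that the synchronous update behaves as expected.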
The game of life is a cellular automaton capable of universal computation: it is always possible to find an initial configuration of the cellular space reproducing the behavior of any electronic gate and, thus, to mimic any computation process. Although this observation has little practical interest, it is very important from a theoretical point of view since it establishes the ability of CAs to be a nonrestrictive computational technique. A very important feature of CAs is that they provide simple models of complex systems. They exemplify the fact that a collective behavior can emerge out of the sum of many simply interacting components. Even if the basic and local interactions are perfectly known, it is possible that the global behavior obeys new laws that are not obviously extrapolated from the individual properties, as if the whole were more than the sum of all the parts. These properties make cellular automata a very interesting approach for modeling physical systems and in particular for simulating complex and nonequilibrium phenomena. The studies undertaken by S. Wolfram in the 1980s [50,51] clearly establish that a CA may exhibit many of the behaviors encountered in continuous systems, yet in a much simpler mathematical framework. A further step is to recognize that CAs do not only behave similarly to some dynamical processes; they can also represent an actual model of a given physical system, leading to macroscopic predictions that could be checked experimentally. This fact follows from statistical mechanics, which tells us that the macroscopic behavior of many systems is often only weakly related to the details of its microscopic reality. Only symmetries and conservation laws survive the change of observation level: it is well known that the flows of a fluid, a gas or even a granular medium are very similar at a macroscopic scale, in spite of their different microscopic natures.
When one is interested in the global or macroscopic properties of a system, it is therefore a clear advantage to invent a much simpler microscopic reality, which is more appropriate to the available numerical means of investigation. An interesting example is the FHP fluid model proposed by Frisch, Hasslacher and Pomeau in 1986 [18], which can be viewed as fully discrete molecular dynamics and yet behaves as predicted by the Navier–Stokes equation when the observation time and length scales are much larger than the lattice spacing and automaton time step. A cellular automata model can then be seen as an idealized universe with its own microscopic reality but, nevertheless, with the same macroscopic behavior as the real system.
The cellular automata paradigm nevertheless presents some weaknesses inherent to its discrete nature. In the early 1990s, Lattice Boltzmann (LB) models were proposed to remedy some of these problems, using real-valued states instead of Boolean variables. It turns out that LB models are indeed a very powerful approach which combines numerical efficiency with the advantage of having a model whose microscopic components are intuitive. LB fluids are more and more used to solve complex flows such as multi-component fluids or flows in complicated geometries. See for instance [10,40,41,49] for an introduction to LB models.

Definition of a Cellular Automaton
In order to give a definition of a cellular automaton, we first present a simple example, the so-called parity rule. Although it is very basic, the rule we discuss here exhibits a surprisingly rich behavior. It was proposed initially by Edward Fredkin in the 1970s [3] and is defined on a two-dimensional square lattice. Each site of the lattice is a cell which is labeled by its position r = (i, j), where i and j are the row and column indices. A function ψ(r, t) is associated with the lattice to describe the state of each cell r at iteration t. This quantity can be either 0 or 1. The cellular automata rule specifies how the states ψ(r, t + 1) are to be computed from the states at iteration t. We start from an initial condition at time t = 0 with a given configuration of the values ψ(r, t = 0) on the lattice. The state at time t = 1 will be obtained as follows: (1) Each site r computes the sum of the values ψ(r′, 0) on the four nearest-neighbor sites r′ at north, west, south, and east. The system is supposed to be periodic in both the i and j directions (like on a torus) so that this calculation is well defined for all sites. (2) If this sum is even, the new state ψ(r, t = 1) is 0 (white); otherwise, it is 1 (black). The same rule (steps 1 and 2) is then repeated to find the states at times t = 2, 3, 4, ….
From a mathematical point of view, this cellular automata parity rule can be expressed by the following relation:

  ψ(i, j; t + 1) = ψ(i + 1, j; t) ⊕ ψ(i − 1, j; t) ⊕ ψ(i, j + 1; t) ⊕ ψ(i, j − 1; t)   (1)
where the symbol ⊕ stands for the exclusive OR logical operation. It is also the sum modulo 2: 1 ⊕ 1 = 0 ⊕ 0 = 0 and 1 ⊕ 0 = 0 ⊕ 1 = 1.
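Equation (1) can be checked directly on a computer. The following NumPy sketch applies the parity rule with periodic boundaries; the lattice size and the single-seed initial condition are arbitrary choices of this illustration.

```python
import numpy as np

def parity_step(a):
    """One step of the parity rule, Eq. (1): each cell becomes the
    XOR (sum modulo 2) of its four von Neumann neighbors, with
    periodic boundaries as assumed in the text."""
    s = (np.roll(a, 1, axis=0) + np.roll(a, -1, axis=0)
         + np.roll(a, 1, axis=1) + np.roll(a, -1, axis=1))
    return s % 2

a = np.zeros((16, 16), dtype=int)
a[8, 8] = 1          # a single seed cell
a = parity_step(a)   # the seed is replaced by its four neighbors
```

After one iteration, the single seed disappears and its four von Neumann neighbors become 1, which is exactly what Eq. (1) predicts.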
Cellular Automata Modeling of Physical Systems, Figure 2 The ⊕ rule (or parity rule) on a 256 × 256 periodic lattice. a Initial configuration; b and c configurations after t_b = 93 and t_c = 110 iterations, respectively
When this rule is iterated, very nice geometric patterns are observed, as shown in Fig. 2. This property of generating complex patterns starting from a simple rule is generic of many cellular automata rules. Here, complexity results from some spatial organization which builds up as the rule is iterated. The various contributions of successive iterations combine together in a specific way, and the spatial patterns that are observed reflect how the terms are combined algebraically. A computer implementation of this CA can be given. In Fig. 3 we propose, as an illustration, a Matlab program (the reader can also consider Octave, which is a free alternative to Matlab). On the basis of this example we now give a definition of a cellular automaton. Formally, a cellular automaton is a tuple (A, Ψ, R, N) where

(i) A is a regular lattice of cells covering a portion of a d-dimensional space.
(ii) Ψ(r, t) = {ψ^(1)(r, t), ψ^(2)(r, t), …, ψ^(m)(r, t)} is a set of m Boolean variables attached to each site r of the lattice, giving the local state of the cells at time t.
(iii) R is a set of rules, R = {R^(1), R^(2), …, R^(m)}, which specifies the time evolution of the states Ψ(r, t) in the following way:

  ψ^(j)(r, t + Δt) = R^(j)(Ψ(r, t), Ψ(r + v₁, t), Ψ(r + v₂, t), …, Ψ(r + v_q, t))   (2)
where r + v_k designate the cells belonging to the neighborhood N of cell r. In the above definition, the rule R is identical for all sites and is applied simultaneously to each of them, leading to synchronous dynamics. As the number of configurations of the neighborhood is finite, it is common to precompute all the values of R in a lookup table. Otherwise, an algebraic expression can be used and evaluated at each iteration, for each cell, as in Eq. (1).
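To make the lookup-table idea concrete, here is a minimal Python sketch for a one-dimensional, two-state rule with a three-cell neighborhood; the one-dimensional setting and Wolfram's rule numbering are choices of this illustration, not taken from the text. The table has 2³ = 8 entries, one per neighborhood configuration, and the update is a pure memory access.

```python
import numpy as np

def make_table(rule_number):
    """Lookup table for a one-dimensional, two-state, three-cell
    neighborhood rule in Wolfram's numbering: entry k gives the new
    state for the neighborhood whose bits encode the integer k."""
    return [(rule_number >> k) & 1 for k in range(8)]

def step(cells, table):
    """Synchronous update via table lookup, periodic boundaries."""
    out = np.empty_like(cells)
    n = len(cells)
    for i in range(n):
        k = 4 * cells[(i - 1) % n] + 2 * cells[i] + cells[(i + 1) % n]
        out[i] = table[k]
    return out

table = make_table(90)     # rule 90: XOR of the two outer neighbors
cells = np.zeros(11, dtype=int)
cells[5] = 1               # a single seed cell
cells = step(cells, table)
```

For rule 90 the table reproduces the XOR of the two outer neighbors, so a single seed spawns two cells after one step; a two-dimensional rule would work the same way, only with a larger table.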
It is important to notice that the rule is homogeneous, that is, it cannot depend explicitly on the cell position r. However, spatial (or even temporal) inhomogeneities can be introduced anyway by having some ψ_j(r) systematically equal to 1 in some given locations of the lattice, to mark particular cells on which a different rule applies. Boundary cells are a typical example of spatial inhomogeneities. Similarly, it is easy to alternate between two rules by having a bit which is 1 at even time steps and 0 at odd time steps. The neighborhood N of each cell (i.e. the spatial region around each cell used to compute the next state) is usually made of its adjacent cells. It is often restricted to the nearest or next-to-nearest neighbors; otherwise the complexity of the rule becomes too large. For a two-dimensional cellular automaton, two neighborhoods are often considered: the von Neumann neighborhood, which consists of a central cell (the one which is to be updated) and its four geographical neighbors north, west, south, and east; and the Moore neighborhood, which contains, in addition, the second-nearest neighbors north-east, north-west, south-west, and south-east, that is, a total of nine cells. Another interesting neighborhood is the Margolus neighborhood, briefly described in the glossary. According to the above definition, a cellular automaton is deterministic: the rule R is some well-defined function and a given initial configuration will always evolve identically. However, it may be very convenient for some applications to have a certain degree of randomness in the rule. For instance, it may be desirable that a rule selects one outcome among several possible states, with a probability p. Cellular automata whose updating rule is driven by some external probabilities are called probabilistic cellular automata. On the other hand, those which strictly comply with the definition given above are referred to as deterministic cellular automata.
Probabilistic cellular automata are a very useful generalization because they offer a way to adjust the parameters of a rule in a continuous range of values, despite the discrete nature of the cellular automata world. This is very convenient when modeling physical systems in which, for instance, particles are annihilated or created at some given rate.

nx=128; ny=128;                      % size of the domain: 128x128
a=zeros(nx,ny);                      % the states are first initialized to 0
north=[2:nx,1];                      % vectors to access the neighbors,
south=[nx,1:nx-1];                   % corresponding to a cyclic permutation
east=[2:ny,1];                       % of 1:nx or 1:ny
west=[ny,1:ny-1];
a(nx/2-3:nx/2+2, ny/2-4:ny/2+3)=1;   % a central patch is initialized with 1's
for t=1:65                           % do 65 iterations
  pcolor(a)                          % build a graphical representation
  axis off; axis square; shading flat
  drawnow                            % display it
  somme=a(north,:) + a(south,:) + a(:,west) + a(:,east);
  a=mod(somme,2);
end

Cellular Automata Modeling of Physical Systems, Figure 3 An example of a Matlab program for the parity rule

Limitations, Advantages, Drawbacks, and Extensions
The interpretation of the cellular automata dynamics in terms of simple "microscopic" rules offers a very intuitive and powerful approach to model phenomena that are very difficult to include in more traditional approaches (such as differential equations). For instance, boundary conditions are often naturally implemented in a cellular automata model because they have a natural interpretation at this level of description (e.g. particles bouncing back on an obstacle). Numerically, an advantage of the CA approach is its simplicity and its suitability for computer architectures and parallel machines. In addition, working with Boolean quantities prevents numerical instabilities since an exact computation is made: there is no truncation or approximation in the dynamics itself. Finally, a CA model is an implementation of an N-body system where all correlations are taken into account, as well as all spontaneous fluctuations arising in a system made up of many particles. On the other hand, cellular automata models have several drawbacks related to their fully discrete nature. An important one is the statistical noise, requiring systematic averaging processes. Another one is the limited flexibility to adjust the parameters of a rule in order to describe a wider range of physical situations. The Lattice Boltzmann approach solves several of the above problems. On the other hand, it may be numerically unstable and, also, requires some hypotheses of molecular chaos which reduce some of the richness of the original CA dynamics [10]. Finally, we should remark that the CA approach is not a rigid framework but should allow for many extensions according to the problem at hand. The CA methodology is a philosophy of modeling where one seeks a description in terms of simple but essential mechanisms. Its richness and interest come from the microscopic content of its rule, for which there is, in general, a clear physical or intuitive interpretation of the dynamics directly at the level of the cell. Einstein's quote "Everything should be made as simple as possible, but not simpler" is a good illustration of the CA approach to the modeling of physical systems.

Applications
Many physical situations, like fluid flows, pattern formation, reaction-diffusion processes, nucleation-aggregation growth phenomena, phase transitions, population dynamics, or traffic processes are very well suited to the cellular automata approach because both space and time play a central role in the behavior of these systems. Below we describe several applications which illustrate the potential of the approach and which can be extended to address a wide class of scientific problems.

A Growth Model
A simple class of cellular automata rules consists of the so-called majority rules. The updating selects the new state of each cell so as to conform to the value currently held by the majority of the neighbors. Typically, in these majority rules, the state is either 0 or 1. A very interesting behavior is observed with the twisted majority rule proposed by G. Vichniac [44]: in two dimensions, each cell considers its Moore neighborhood (i.e. itself plus its eight nearest neighbors) and computes the sum of the cells having a value 1. This sum can be any value between 0 and 9. The new state s(t + 1) of each cell is then determined from this local sum, according to the following table:

  sum(t)     0 1 2 3 4 5 6 7 8 9
  s(t + 1)   0 0 0 0 1 0 1 1 1 1   (3)
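Table (3) can be applied directly by array indexing. The following Python sketch is one possible implementation of the twisted majority rule; NumPy, the periodic boundaries and the lattice size are choices of this sketch.

```python
import numpy as np

# New state as a function of the local sum (the cell itself plus its
# eight Moore neighbors); the entries for sums 4 and 5 are swapped
# with respect to the plain majority rule.
TABLE = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

def twisted_majority_step(grid):
    """One synchronous step of the twisted majority rule on a
    periodic lattice (periodicity is an assumption of this sketch)."""
    s = sum(np.roll(np.roll(grid, di, axis=0), dj, axis=1)
            for di in (-1, 0, 1) for dj in (-1, 0, 1))  # includes the cell itself
    return TABLE[s]

rng = np.random.default_rng(0)
grid = rng.integers(0, 2, size=(64, 64))   # random mixture of the two phases
for _ in range(10):
    grid = twisted_majority_step(grid)
```

A uniform phase (all 0s or all 1s) is left unchanged by the rule, while a random mixture progressively coarsens into domains of the two phases.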
As opposed to the plain majority rule, here the two middle entries of the table have been swapped. Therefore, when there is a slight majority of 1s around a cell, it turns to 0. Conversely, if there is a slight majority of 0s, the cell becomes 1. Surprisingly enough, this rule describes the motion of an interface between two phases, as illustrated in Fig. 4. Vichniac has observed that the normal velocity of the interface is proportional to its local curvature, as required by the Allen–Cahn [21] equation. Of course, due to its local nature, the rule cannot detect the curvature of the interface directly. However, as the rule is iterated, local information is propagated to the nearest neighbors and the radius of curvature emerges as a collective effect. This rule is particularly interesting when the initial configuration is a random mixture of the two phases, with equal concentration. Otherwise, some pathological behaviors may occur. For instance, an initial square of 1s surrounded by 0s will not evolve: 90-degree angles are not eroded and remain stable structures.

Ising-Like Dynamics
The Ising model is extensively used in physics. Its basic constituents are spins s_i which can be in one of two states: s_i ∈ {−1, 1}. These spins are organized on a regular lattice in d dimensions and coupled in the sense that each pair (s_i, s_j) of neighboring spins contributes an amount J s_i s_j to the energy of the system. Intuitively, the dynamics of such a system is that a spin flips (s_i → −s_i) if this is favorable in view of the energy of the local configuration. Vichniac [44], in the 1980s, proposed a CA rule, called Q2R, simulating the behavior of Ising spin dynamics. The model is as follows: we consider a two-dimensional square lattice such that each site holds a spin s_i which is either up (s_i = 1) or down (s_i = 0) (instead of ±1). The coupling between spins is assumed to come from the von Neumann neighborhood (i.e. north, west, south, and east neighbors).
Cellular Automata Modeling of Physical Systems, Figure 4 Evolution of the twisted majority rule. The inherent "surface tension" present in the rule tends to separate the red phase s = 1 from the blue phase s = 0. The snapshots a, b and c correspond to t = 0, t = 72, and t = 270 iterations, respectively. The other colors indicate how "capes" have been eroded and "bays" filled: light blue shows the blue regions that have been eroded during the last few iterations and yellow marks the red regions that have been filled
Cellular Automata Modeling of Physical Systems
In this simple model, the spins will flip (or not flip) during their discrete time evolution according to a local energy conservation principle. This means we are considering a system which cannot exchange energy with its surroundings. The model will be a microcanonical cellular automata simulation of Ising spin dynamics, without a temperature but with a critical energy. A spin $s_i$ can flip at time $t$ to become $1 - s_i$ at time $t + 1$ if and only if this move does not cause any energy change. Accordingly, spin $s_i$ will flip if the number of its neighbors with spin up is the same as the number of its neighbors with spin down. However, one has to remember that the motion of all spins is simultaneous in a cellular automaton. The decision to flip is based on the assumption that the neighbors are not changing. If they are allowed to flip too (because they obey the same rule), then energy may not be conserved.

A way to cure this problem is to split the updating into two phases and consider a partition of the lattice into odd and even sites (e.g. the white and black squares of a chessboard in 2D): first, one flips the spins located at odd positions, according to the configuration of the even spins. In the second phase, the even sublattice is updated according to the odd one. The spatial structure (defining the two sublattices) is obtained by adding an extra bit $b$ to each lattice site, whose value is 0 for the odd sublattice and 1 for the even sublattice. The flipping rule described earlier is then regulated by the value of $b$: it takes place only for those sites for which $b = 1$. Of course, the value of $b$ is also updated at each iteration according to $b(t + 1) = 1 - b(t)$, so that at the next iteration the other sublattice is considered. In two dimensions, the Q2R rule can then be expressed by the following expressions:
$$s_{ij}(t+1) = \begin{cases} 1 - s_{ij}(t) & \text{if } b_{ij} = 1 \text{ and } s_{i-1,j} + s_{i+1,j} + s_{i,j-1} + s_{i,j+1} = 2 \\ s_{ij}(t) & \text{otherwise} \end{cases} \qquad (4)$$
and
$$b_{ij}(t + 1) = 1 - b_{ij}(t)$$
where the indices $(i, j)$ label the Cartesian coordinates and $s_{ij}(t = 0)$ is either one or zero.

The question is now: how well does this cellular automaton rule describe an Ising model? Figure 5 shows a computer simulation of the Q2R rule, starting from an initial configuration with approximately 11% of spins $s_{ij} = 1$ (Fig. 5a). After a transient phase (Fig. 5b and c), the system reaches a stationary state where domains with "up" magnetization (white regions) are surrounded by domains of "down" magnetization (black regions).

In this dynamics, energy is exactly conserved because that is the way the rule is built. However, the number of spins down and up may vary. In the present experiment, the fraction of spins up increases from 11% in the initial state to about 40% in the stationary state. Since there is an excess of spins down in this system, there is a resulting macroscopic magnetization.

It is interesting to study this model with various initial fractions $\rho_s$ of spins up. When starting with a random initial condition, similar to that of Fig. 5a, it is observed that, for many values of $\rho_s$, the system evolves to a state where there is, on average, the same amount of spins down and up, that is, no macroscopic magnetization. However, if the initial configuration presents a sufficiently large excess of one kind of spin, then a macroscopic magnetization builds up as time goes on. This means there is a phase transition between a situation of zero magnetization and a situation of positive or negative magnetization.

It turns out that this transition occurs when the total energy $E$ of the system is low enough (a low energy means that most of the spins are aligned and that there is an excess of one species over the other), or more precisely when $E$ is smaller than a critical energy $E_c$. In that sense, the Q2R rule captures an important aspect of a real magnetic system, namely a non-zero magnetization at low energy (which can be related to a low temperature situation) and a transition to a nonmagnetic phase at high energy.

However, Q2R also exhibits unexpected behavior that is difficult to detect from a simple observation. There is a breaking of ergodicity: a given initial configuration of energy $E_0$ evolves without visiting completely the region of the phase space characterized by $E = E_0$. This is illustrated by the following simple 1D example, where a ring of four spins with periodic boundary conditions is considered:

t:    1001
t+1:  1100
t+2:  0110     (5)
t+3:  0011
t+4:  1001

After four iterations, the system cycles back to its original state. The configuration of this example has $E_0 = 0$. As we observed, it never evolves to 0111, which is also a configuration of zero energy. This nonergodicity means that not only is energy conserved during the evolution of the automaton, but also another quantity which partitions the energy surface into independent regions.
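The two-sublattice Q2R update is easy to verify numerically. Below is a minimal NumPy sketch (lattice size, number of steps, and the 11% initial density are arbitrary choices); it checks that the Ising energy is exactly conserved:

```python
import numpy as np

def q2r_sweep(s, parity):
    """Update one checkerboard sublattice of the Q2R rule: a spin flips
    iff exactly two of its four von Neumann neighbors are up, so the
    local Ising energy is unchanged. Periodic boundaries via np.roll."""
    up_neighbors = (np.roll(s, 1, 0) + np.roll(s, -1, 0)
                    + np.roll(s, 1, 1) + np.roll(s, -1, 1))
    ii, jj = np.indices(s.shape)
    active = ((ii + jj) % 2 == parity)      # sublattice with b = 1
    flip = active & (up_neighbors == 2)
    s[flip] = 1 - s[flip]

def ising_energy(s):
    """Ising energy (J = 1) of the 0/1 lattice, with sigma = 2s - 1;
    each bond is counted once."""
    sig = 2 * s.astype(int) - 1
    return -np.sum(sig * (np.roll(sig, 1, 0) + np.roll(sig, 1, 1)))

rng = np.random.default_rng(0)
s = (rng.random((64, 64)) < 0.11).astype(np.uint8)   # ~11% spins up
e0 = ising_energy(s)
for t in range(200):
    q2r_sweep(s, t % 2)
assert ising_energy(s) == e0   # microcanonical: energy exactly conserved
```

Because the neighbors of an updated sublattice all sit on the other sublattice, they are frozen during the half-step, which is precisely what makes the simultaneous update energy-conserving.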
Cellular Automata Modeling of Physical Systems, Figure 5 Evolution of a system of spins with the Q2R rule. Black represents the spins down $s_{ij} = 0$ and white the spins up $s_{ij} = 1$. The four images a, b, c, and d show the system at four different times $t_a = 0 < t_b < t_c < t_d$
Competition Models and Cell Differentiation

In Sect. "A Growth Model" we have discussed a majority rule in which the cells imitate their neighbors. In some sense, this corresponds to a cooperative behavior between the cells. A quite different situation can be obtained if the cells obey competitive dynamics. For instance, we may imagine that the cells compete for some resources at the expense of their nearest neighbors. A winner is a cell of state 1 and a loser a cell of state 0. No two winner cells can be neighbors and any loser cell must have at least one winner neighbor (otherwise nothing would have prevented it from also winning). It is interesting to note that this problem has a direct application in biology, to study cell differentiation. It
has been observed in the development of Drosophila that about 25% of the cells forming the embryo evolve to the state of neuroblast, while the remaining 75% do not. How can we explain this differentiation and the observed fraction, since at the beginning of the process all cells can be assumed equivalent? A possible mechanism [28] is that some competition takes place between the adjacent biological cells. In other words, each cell produces some substance S but the production rate is inhibited by the amount of S already present in the neighboring cells. Differentiation occurs when a cell reaches a level of S above a given threshold. The competition CA model we propose to describe this situation is the following. Because of the analogy with the biological system, we shall consider a hexagonal lattice,
Cellular Automata Modeling of Physical Systems, Figure 6 The hexagonal lattice used for the competition-inhibition CA rule. Black cells are cells of state 1 (winners) and white cells are cells of state 0 (losers). The two possible final states with a fully regular structure are illustrated, with winner densities 1/3 and 1/7, respectively
which is a reasonable approximation of the cell arrangement observed in the Drosophila embryo (see Fig. 6). We assume that the values of S can be 0 (inhibited) or 1 (active) in each lattice cell.

A cell in state $S = 0$ will grow (i.e. turn to $S = 1$) with probability $p_{\text{grow}}$ provided that all its neighbors are in state 0. Otherwise, it stays inhibited. A cell in state $S = 1$ will decay (i.e. turn to $S = 0$) with probability $p_{\text{decay}}$ if it is surrounded by at least one active cell. If the active cell is isolated (all the neighbors are in state 0) it remains in state 1.

The evolution stops (stationary process) when no $S = 1$ cell feels any more inhibition from its neighbors and when all $S = 0$ cells are inhibited by their neighborhood. Then, with our biological interpretation, cells with $S = 1$ are those which will differentiate.

What is the expected fraction of these $S = 1$ cells in the final configuration? Clearly, from Fig. 6, the maximum value is 1/3. According to the inhibition condition we imposed, this is the close-packed situation on the hexagonal lattice. On the other hand, the minimal value is 1/7, corresponding to a situation where the lattice is partitioned into blocks with one active cell surrounded by six inhibited cells. In practice we do not expect either of these two limits to occur spontaneously after the automaton evolution. On the contrary, we should observe clusters of close-packed active cells surrounded by defects, i.e. regions of low density of active cells. CA simulations show indeed that the final fraction $\rho_s$ of active cells is a mix of the two limiting situations of Fig. 6, $0.23 \le \rho_s \le 0.24$, almost irrespective of the values chosen for $p_{\text{grow}}$ and $p_{\text{decay}}$. This is in good agreement with the value expected from the biological observations made on the Drosophila embryo. Thus, cell differentiation can be explained by a geometrical competition without having to specify the inhibitory couplings between adjacent cells and the production rate (i.e. the values of $p_{\text{grow}}$ and $p_{\text{decay}}$): the result is quite robust against any possible choice.

Traffic Models

Cellular automata models for road traffic have received a great deal of interest during the past few years (see [13,32,33,36,37,47,48,52] for instance). One-dimensional models for single-lane car motion are quite simple and elegant. The road is represented as a line of cells, each of them being occupied or not by a vehicle. All cars travel in the same direction (say to the right). Their positions are updated synchronously. During the motion, each car can be at rest or jump to the nearest neighbor site, along the direction of motion. The rule is simply that a car moves only if its destination cell is empty. This means that the drivers do not know whether the car in front will move or will be blocked by another car. Therefore, the state of each cell $s_i$ is entirely determined by the occupancy of the cell itself and its two nearest neighbors $s_{i-1}$ and $s_{i+1}$. The motion rule can be summarized by the following table, where all eight possible configurations $(s_{i-1} s_i s_{i+1})_t \to (s_i)_{t+1}$ are given:

(111) (110) (101) (100) (011) (010) (001) (000)
  1     0     1     1     1     0     0     0        (7)
This cellular automaton rule turns out to be Wolfram rule 184 [50,52]. These simple dynamics capture an interesting feature of real car motion: traffic congestion. Suppose we have a low car density in the system, for instance something
like
$$\cdots 0010000010010000010 \cdots \qquad (8)$$
This is a free traffic regime in which all the cars are able to move. The average velocity $\langle v \rangle$, defined as the number of motions divided by the number of cars, is then
$$\langle v_f \rangle = 1 \qquad (9)$$
where the subscript $f$ indicates a free state. On the other hand, in a high density configuration such as
$$\cdots 110101110101101110 \cdots \qquad (10)$$
only six cars out of 12 will move and $\langle v \rangle = 1/2$. This is a partially jammed regime.

If the car positions were uncorrelated, the number of moving cars (i.e. the number of particle-hole pairs) would be given by $L \rho (1 - \rho)$, where $L$ is the system size. Since the number of cars is $\rho L$, the average velocity would be
$$\langle v_{\text{uncorrel}} \rangle = 1 - \rho \,. \qquad (11)$$
However, in this model, the car occupancy of adjacent sites is highly correlated and the vehicles cannot move until a hole has appeared in front of them. The car distribution tries to self-adjust to a situation where there is one spacing between consecutive cars. For densities less than one-half, this is easily realized and the system can organize to have one car every other site. Therefore, due to these correlations, Eq. (11) is wrong in the high density regime. In this case, since a car needs a hole to move to, we expect that the number of moving cars simply equals the number of empty cells [52]. Thus, the number of motions is $L(1 - \rho)$ and the average velocity in the jammed phase is
$$\langle v_j \rangle = \frac{1 - \rho}{\rho} \,. \qquad (12)$$
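Both regimes are easy to check by simulating the rule directly. A minimal NumPy sketch (road length, relaxation time, and the two densities are arbitrary choices):

```python
import numpy as np

def rule184_step(road):
    """One synchronous update of the CA traffic rule (Wolfram rule 184):
    a car (1) advances one cell to the right iff the cell ahead is empty."""
    can_move = (road == 1) & (np.roll(road, -1) == 0)   # periodic road
    new = road.copy()
    new[can_move] = 0                     # car leaves its cell...
    new[np.roll(can_move, 1)] = 1         # ...and occupies the next one
    return new, int(can_move.sum())

def avg_velocity(density, L=600, relax=1200, seed=1):
    rng = np.random.default_rng(seed)
    road = np.zeros(L, dtype=np.uint8)
    road[rng.choice(L, int(density * L), replace=False)] = 1
    for _ in range(relax):                # let correlations build up
        road, _ = rule184_step(road)
    road, n_moving = rule184_step(road)
    return n_moving / road.sum()

# free regime (rho <= 1/2): <v> = 1 ; jammed regime: <v> = (1 - rho)/rho
assert abs(avg_velocity(0.3) - 1.0) < 1e-9
assert abs(avg_velocity(0.7) - (1 - 0.7) / 0.7) < 1e-9
```

After the deterministic transient, the simulation reproduces the two limiting expressions exactly, which is the self-organization discussed above.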
Cellular Automata Modeling of Physical Systems, Figure 7 Traffic flow diagram for the simple CA traffic rule
From the above relations we can compute the so-called fundamental flow diagram, i.e. the relation between the flow of cars $\rho \langle v \rangle$ as a function of the car density $\rho$: for $\rho \le 1/2$, we use the free regime expression and $\rho \langle v \rangle = \rho$. For densities $\rho > 1/2$, we use the jammed expression and $\rho \langle v \rangle = 1 - \rho$. The resulting diagram is shown in Fig. 7. As in real traffic, we observe that the flow of cars reaches a maximum value before decreasing.

A richer version of the above CA traffic model is due to Nagel and Schreckenberg [33,47,48]. The cars may have several possible velocities $u = 0, 1, 2, \dots, u_{\max}$. Let $u_i$ be the velocity of car $i$ and $d_i$ the distance, along the road, separating cars $i$ and $i + 1$. The updating rule is:

- The cars accelerate when possible: $u_i \to u_i' = u_i + 1$, if $u_i < u_{\max}$.
- The cars slow down when required: $u_i' \to u_i'' = d_i - 1$, if $u_i' \ge d_i$.
- The cars have a random behavior: $u_i'' \to u_i''' = u_i'' - 1$, with probability $p_i$, if $u_i'' > 0$.
- Finally, the cars move $u_i'''$ sites ahead.
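A minimal sketch of one synchronous update implementing these four steps (the array layout, $u_{\max} = 5$, and the braking probability $p = 0.3$ are arbitrary choices):

```python
import numpy as np

def nasch_step(pos, vel, L, vmax=5, p=0.3, rng=None):
    """One parallel Nagel-Schreckenberg update on a periodic road.

    pos and vel hold the positions (in cyclic order) and velocities of
    the cars; d is the gap d_i to the car ahead, so a car may advance at
    most d_i - 1 cells and can never collide with its predecessor."""
    rng = np.random.default_rng() if rng is None else rng
    d = (np.roll(pos, -1) - pos) % L          # distance to the next car
    v = np.minimum(vel + 1, vmax)             # 1. accelerate
    v = np.minimum(v, d - 1)                  # 2. slow down when required
    brake = (rng.random(len(v)) < p) & (v > 0)
    v[brake] -= 1                             # 3. random braking
    return (pos + v) % L, v                   # 4. move

rng = np.random.default_rng(4)
L, ncars = 100, 20
pos = np.sort(rng.choice(L, ncars, replace=False))
vel = np.zeros(ncars, dtype=int)
for _ in range(200):
    pos, vel = nasch_step(pos, vel, L, rng=rng)
assert len(set(pos.tolist())) == ncars   # no two cars ever share a cell
```

The random braking in step 3 is what produces the velocity fluctuations and the "stop-and-go" waves mentioned below.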
Cellular Automata Modeling of Physical Systems, Figure 8 The four central cells represent a roundabout which is traveled counterclockwise. The gray levels indicate the different traffic lanes: white is a northbound lane, light gray an eastbound lane, gray a southbound lane and, finally, dark gray is a westbound lane. The dots labeled a, b, c, d, e, f, g, and h are cars which will move to the destination cell indicated by the arrows, as determined by some local decision rule. Cars without an arrow are forbidden to move
Cellular Automata Modeling of Physical Systems, Figure 9 Traffic configuration after 600 iterations, for a car density of 30%. Streets are white, buildings gray and the black pixels represent the cars. Situation a corresponds to the roundabout junctions, whereas image b mimics the presence of traffic lights. In the second case, queues are more likely to form and the global mobility is less than in the first case
Cellular Automata Modeling of Physical Systems, Figure 10 Average velocity versus average density for the cellular automata street network, for a a time-uncorrelated turning strategy and b a fixed driver decision. The different curves correspond to different distances L between successive road junctions. The dashed line is the analytical prediction (see [13]). Junction deadlock is likely to occur in b, resulting in a completely jammed state
This rule captures some important behaviors of real traffic on a highway: velocity fluctuations due to a nondeterministic behavior of the drivers, and "stop-and-go" waves observed in a high-density traffic regime. We refer the reader to recent literature for the new developments of this topic. See for instance [24,25]. Note that a street network can also be described using a CA. A possible approach is to model a road intersection as a roundabout. Cars in the roundabout have priority over those willing to enter. Figure 8 illustrates a simple four-way road junction. Traffic lights can also be modeled in a CA by stopping, during a given number of iterations, the car reaching a given cell. Figure 9 illustrates a CA simulation of a Manhattan-like city in which junctions are either controlled by a roundabout or by a traffic light. In both cases, the destination of a car reaching the junction is randomly chosen.
Figure 10 shows the fundamental flow diagram obtained with the CA model, for a Manhattan-like city governed by roundabouts separated by a distance L. CA modeling of urban traffic has been used for real situations by many authors (see for instance [11]) and some cities in Europe and the USA use the CA approach as a way to manage traffic. Note finally that crowd motion is also addressed within the CA framework. Recent results [7,29] show that the approach is quite promising to plan evacuation strategies and reproduce several of the motion patterns observed in real crowds.

A Simple Gas: The HPP Model

The HPP rule is a simple example of an important class of cellular automata models: lattice gas automata (LGA). The basic ingredients of such models are point particles that move on a lattice, according to appropriate rules, so as to mimic a fully discrete "molecular dynamics." The HPP lattice gas automaton is traditionally defined on a two-dimensional square lattice. Particles can move along the main directions of the lattice, as shown in Fig. 11. The model limits to 1 the number of particles entering a given site with a given direction of motion. This is the exclusion principle which is common in most LGA (LGA models without the exclusion principle are called multiparticle models [10]). With at most one particle per site and direction, four bits of information at each site are enough to describe the system during its evolution. For instance, if at iteration $t$ site $\mathbf{r}$ has the state $s(\mathbf{r}, t) = (1011)$, it means that three particles are entering the site along directions 1, 3, and 4, respectively.

The cellular automata rule describing the evolution of $s(\mathbf{r}, t)$ is often split into two steps: collision and motion (or propagation). The collision phase specifies how the particles entering the same site will interact and change their trajectories. During the propagation phase, the particles actually move to the nearest neighbor site they were traveling to.
This decomposition into two phases is a quite convenient way to partition the space so that the collision rule is purely local. Figure 12 illustrates the HPP rules. According to our Boolean representation of the particles at each site, the collision part for the two head-on collisions is expressed as
$$(1010) \to (0101) \,, \qquad (0101) \to (1010) \qquad (13)$$
all the other configurations being unchanged. During the propagation phase, the first bit of the state variable is shifted to the east neighbor cell, the second bit to the north and so on.
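The collision-plus-propagation cycle can be sketched as follows (lattice size, density, and the direction ordering E, N, W, S are arbitrary choices); the assertions check the conservation properties discussed below:

```python
import numpy as np

SHIFTS = [(0, 1), (-1, 0), (0, -1), (1, 0)]   # E, N, W, S as (row, col) moves

def hpp_step(n):
    """One HPP update on a periodic lattice: collision, then propagation.

    n has shape (4, H, W); n[i] == 1 means a particle entering the site
    with velocity i. A head-on pair occupying a site alone (E+W or N+S)
    is rotated by 90 degrees, which amounts to flipping all four bits."""
    e, no, w, s = (c.astype(bool) for c in n)
    head_on = (e & w & ~no & ~s) | (no & s & ~e & ~w)
    n = np.where(head_on, 1 - n, n)                      # collision
    return np.stack([np.roll(n[i], SHIFTS[i], axis=(0, 1))
                     for i in range(4)])                 # propagation

rng = np.random.default_rng(1)
n = (rng.random((4, 32, 32)) < 0.2).astype(np.uint8)
mass0 = int(n.sum())
px0 = int(n[0].sum()) - int(n[2].sum())   # x-momentum: E minus W particles
py0 = int(n[1].sum()) - int(n[3].sum())   # y-momentum: N minus S particles
for _ in range(100):
    n = hpp_step(n)
assert int(n.sum()) == mass0                     # particle number conserved
assert int(n[0].sum()) - int(n[2].sum()) == px0  # momentum conserved
assert int(n[1].sum()) - int(n[3].sum()) == py0
```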
Cellular Automata Modeling of Physical Systems, Figure 11 Example of a configuration of HPP particles
The aim of this rule is to reproduce some aspects of the real interactions between particles, namely that momentum and particle number are conserved during a collision. From Fig. 12, it is easily checked that these properties are obeyed: a pair of zero-momentum particles along a given direction is transformed into another pair of zero momentum along the perpendicular axis.

It is easy to express the HPP model in a mathematical form. For this purpose, the so-called occupation numbers $n_i(\mathbf{r}, t)$ are introduced for each lattice site $\mathbf{r}$ and each time step $t$. The index $i$ labels the lattice directions (or the possible velocities of the particles). In the HPP model, the lattice has four directions (North, West, South, and East) and $i$ runs from 1 to 4. By definition and due to the exclusion principle, the $n_i$'s are Boolean variables:
$$n_i(\mathbf{r}, t) = \begin{cases} 1 & \text{if a particle is entering site } \mathbf{r} \text{ at time } t \text{ along lattice direction } i \\ 0 & \text{otherwise.} \end{cases}$$
From this definition it is clear that, for HPP, the $n_i$'s are simply the components of the state $s$ introduced above:
$$s = (n_1, n_2, n_3, n_4) \,.$$
In an LGA model, the microdynamics can be naturally expressed in terms of the occupation numbers $n_i$ as
$$n_i(\mathbf{r} + \mathbf{v}_i \Delta_t, t + \Delta_t) = n_i(\mathbf{r}, t) + \Omega_i(n(\mathbf{r}, t)) \qquad (14)$$
Cellular Automata Modeling of Physical Systems, Figure 12 The HPP rule: a a single particle has a ballistic motion until it experiences a collision; b and c the two nontrivial collisions of the HPP model: two particles experiencing a head-on collision are deflected in the perpendicular direction. In the other situations, the motion is ballistic, that is, the particles are transparent to each other when they cross the same site

where $\mathbf{v}_i$ is a vector denoting the speed of the particle in the $i$th lattice direction. The function $\Omega$ is called the collision term and it describes the interaction of the particles which meet at the same time and same location. Note that another way to express Eq. (14) is through the so-called collision and propagation operators $C$ and $P$:
$$n(t + \Delta_t) = P C\, n(t) \qquad (15)$$
where $n(t)$ describes the set of values $n_i(\mathbf{r}, t)$ for all $i$ and $\mathbf{r}$. The quantities $C$ and $P$ act over the entire lattice. They are defined as
$$(Pn)_i(\mathbf{r}) = n_i(\mathbf{r} - \mathbf{v}_i \Delta_t) \,, \qquad (Cn)_i(\mathbf{r}) = n_i(\mathbf{r}) + \Omega_i \,.$$
More specifically, for the HPP model, it can be shown [10] that the collision and propagation phases can be expressed as
$$n_i(\mathbf{r} + \mathbf{v}_i \Delta_t, t + \Delta_t) = n_i - n_i n_{i+2} (1 - n_{i+1})(1 - n_{i+3}) + n_{i+1} n_{i+3} (1 - n_i)(1 - n_{i+2}) \qquad (16)$$
In this equation, the values $i + m$ are wrapped onto the values 1 to 4 and the right-hand term is computed at position $\mathbf{r}$ and time $t$.

The HPP rule captures another important ingredient of the microscopic nature of a real interaction: invariance under time reversal. Figure 12b and c show that, if at some given time the directions of motion of all particles are reversed, the system will just trace back its own history. Since the dynamics of a deterministic cellular automaton are exact, this fact allows us to demonstrate that such physical systems return to their original situation when all the particles reverse their velocity.

Figure 13 illustrates the time evolution of an HPP gas initially confined in the left compartment of a container. There is an aperture in the wall of the compartment and the gas particles will flow so as to fill the entire space available to them. In order to include a solid boundary in the system, the HPP rule is modified as follows: when a site is a wall (indicated by an extra bit), the particles no longer experience the HPP collision but bounce back to where they came from. Therefore, particles cannot escape a region delimited by such a reflecting boundary.

If the system of Fig. 13 is evolved, it reaches an equilibrium after a long enough time and no macroscopic trace of its initial state is visible any longer. However, no information has been lost during the process (no numerical dissipation) and the system has the memory of where it comes from. Reversing all the velocities and iterating the HPP rule makes all particles go back to the compartment in which they were initially located.

Reversing the particle velocities can be described by an operator $R$ which swaps occupation numbers with opposite velocities:
$$(Rn)_i = n_{i+2} \,.$$
The reversibility of HPP stems from the fact that the collision and propagation operators obey
$$PRP = R \,, \qquad CRC = R \,.$$
Thus
$$(PC)^n \, P R \, (PC)^n = P R$$
which shows that the system will return to its initial state (though with opposite velocities) if the first $n$ iterations are followed by a velocity change $\mathbf{v}_i \to -\mathbf{v}_i$, a propagation, and again $n$ iterations.
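This operator identity can be verified numerically: running $n$ steps, reversing the velocities, propagating once, and running $n$ more steps reproduces the reversed (once-propagated) initial configuration. A minimal sketch, with arbitrary lattice size, density, and direction ordering:

```python
import numpy as np

SHIFTS = [(0, 1), (-1, 0), (0, -1), (1, 0)]   # E, N, W, S

def collide(n):
    """C: rotate each isolated head-on pair, i.e. flip all four bits there."""
    e, no, w, s = (c.astype(bool) for c in n)
    head_on = (e & w & ~no & ~s) | (no & s & ~e & ~w)
    return np.where(head_on, 1 - n, n)

def propagate(n):
    """P: move every particle one cell along its velocity (periodic)."""
    return np.stack([np.roll(n[i], SHIFTS[i], axis=(0, 1))
                     for i in range(4)])

def reverse(n):
    """R: swap occupation numbers with opposite velocities (E<->W, N<->S)."""
    return n[[2, 3, 0, 1]]

rng = np.random.default_rng(2)
n0 = (rng.random((4, 16, 16)) < 0.3).astype(np.uint8)
n = n0.copy()
for _ in range(50):
    n = propagate(collide(n))     # n iterations of PC
n = propagate(reverse(n))         # velocity reversal plus one propagation
for _ in range(50):
    n = propagate(collide(n))     # n more iterations of PC
# the whole sequence collapses to a single reverse-and-propagate of n0
assert np.array_equal(n, propagate(reverse(n0)))
```

The final assertion holds bit for bit: nothing is lost to numerical error, which is the point of the discussion above.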
Cellular Automata Modeling of Physical Systems, Figure 13 Time evolution of an HPP gas. a From the initial state to equilibrium. b Illustration of time reversal invariance: in the rightmost image of a, the velocity of each particle is reversed and the particles naturally return to their initial position
This time-reversal behavior is only possible because the dynamics are perfectly exact and no numerical errors are present in the numerical scheme. If one introduces externally some errors (for instance, one can add an extra particle in the system) before the direction of motion of each particle is reversed, then reversibility is lost. Note that this property has inspired a new symmetric cryptography algorithm called Crystal [30], which exploits and develops the existing analogy between discrete physical models of particles and the standard diffusion–confusion paradigm of cryptography proposed by Shannon [39].

The FHP Model

The HPP rule is interesting because it illustrates the basic ingredients of LGA models. However, the capability of this rule to model a real gas of particles is poor, due to a lack of isotropy and spurious invariants. A remedy to this problem is to use a different lattice and a different collision model.
The FHP rule (proposed by Frisch, Hasslacher, and Pomeau [18] in 1986) was the first CA whose behavior was shown to reproduce, within some limits, a two-dimensional fluid. The FHP model is an abstraction, at a microscopic scale, of a fluid. It is expected to contain all the salient features of a real fluid. It is well known that the continuity and Navier–Stokes equations of hydrodynamics express the local conservation of mass and momentum in a fluid. The detailed nature of the microscopic interactions does not affect the form of these equations but only the values of the coefficients (such as the viscosity) appearing in them. Therefore, the basic ingredients one has to include in the microdynamics of the FHP model are the conservation of particles and momentum after each updating step. In addition, some symmetries are required so that, in the macroscopic limit, where time and space can be considered as continuous variables, the system is isotropic. As in the case of the HPP model, the microdynamics of FHP is given in terms of Boolean variables describing the occupation numbers at each site of the lattice and at
Cellular Automata Modeling of Physical Systems, Figure 14 The two-body collision in the FHP model. On the right part of the figure, the two possible outcomes of the collision are shown in dark and light gray, respectively. They both occur with probability one-half
Cellular Automata Modeling of Physical Systems, Figure 15 The three-body collision in the FHP model
each time step (i. e. the presence or the absence of a fluid particle). The FHP particles move in discrete time steps, with a velocity of constant modulus, pointing along one of the six directions of the lattice. Interactions take place among particles entering the same site at the same time and result in a new local distribution of particle velocities. In order to conserve the number of particles and the momentum during each interaction, only a few configurations lead to a nontrivial collision (i. e. a collision in which the directions of motion have changed). For instance, when exactly two particles enter the same site with opposite velocities, both of them are deflected by 60 degrees so that the output of the collision is still a zero momentum configuration with two particles. As shown in Fig. 14, the deflection can occur to the right or to the left, indifferently. For symmetry reasons, the two possibilities are chosen randomly, with equal probability. Another type of collision is considered: when exactly three particles collide with an angle of 120 degrees between each other, they bounce back (so that the momentum after collision is zero, as it was before collision). Figure 15 illustrates this rule. For the simplest case we are considering here, all interactions come from the two collision processes described above. For all other configurations (i. e. those which are not obtained by rotations of the situations given in Figs. 14
Cellular Automata Modeling of Physical Systems, Figure 16 Development of a sound wave in a FHP gas, due to a particle overconcentration in the middle of the system
and 15) no collision occurs and the particles go through as if they were transparent to each other.

Both two- and three-body collisions are necessary to avoid extra conservation laws. The two-particle collision removes a pair of particles with a zero total momentum and moves it to another lattice direction. Therefore, it conserves momentum along each line of the lattice. On the other hand, three-body interactions deflect particles by 180 degrees and cause the net momentum of each lattice line to change. However, three-body collisions conserve the number of particles within each lattice line.

The FHP model has played a central role in computational physics because it can be shown (see for instance [10]) that the density $\rho$, defined as the average number of particles at a given lattice site, and $\mathbf{u}$, the average velocity of these particles, obey the Navier–Stokes equation
$$\partial_t \mathbf{u} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\frac{1}{\rho} \nabla p + \nu \nabla^2 \mathbf{u} \qquad (17)$$
where $p = c_s^2 \rho$ is the scalar pressure, with $c_s$ the speed of sound, and $\nu$ is the kinematic viscosity. Note that here both $\nu$ and $c_s$ are quantities that emerge from the FHP dynamics. The speed of sound reflects the lattice topology whereas the viscosity reflects the details of the collision process. As an illustration, Fig. 16 shows the propagation of a density wave in a FHP model. Figure 17 shows the eddies that form when a FHP fluid flows against an obstacle.

More complex models can be built by adding new processes on top of a FHP fluid. For instance, Fig. 18 shows the result of a model of snow transport and deposition by wind. In addition to the wind flow, obtained from a FHP model, snow particles are traveling due to the combined effect of wind and gravity. Upon reaching the ground, they pile up (possibly after toppling) so as to form a new boundary condition for the wind. In Fig. 19 an extension of the model (using the lattice Boltzmann approach described in Sect. "Lattice Boltzmann Models") shows how a fence with ground clearance creates a snow deposit.
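Since a site carries only six occupation bits, the two- and three-body collision rules can be checked exhaustively over all 64 possible site states. A minimal sketch (the direction indexing $i = 0, \dots, 5$ is an assumption):

```python
import numpy as np

# unit vectors of the six FHP lattice directions (hexagonal lattice)
angles = np.arange(6) * np.pi / 3
V = np.stack([np.cos(angles), np.sin(angles)], axis=1)

def fhp_collide(state, rng):
    """FHP collision for one site state, a 6-tuple of occupation bits.

    Head-on pair (i, i+3): deflected by 60 degrees, left or right at
    random. Three particles at 120 degrees (all-even or all-odd
    directions): each bounces back. Everything else is unchanged."""
    occ = [i for i in range(6) if state[i]]
    if len(occ) == 2 and occ[1] - occ[0] == 3:            # head-on pair
        r = rng.choice([-1, 1])
        out = [0] * 6
        for i in occ:
            out[(i + r) % 6] = 1
        return tuple(out)
    if len(occ) == 3 and len({i % 2 for i in occ}) == 1:  # 3-body, 120 deg
        return tuple(state[(i + 3) % 6] for i in range(6))
    return state

rng = np.random.default_rng(3)
for code in range(64):                        # all possible site states
    state = tuple((code >> i) & 1 for i in range(6))
    out = fhp_collide(state, rng)
    p_in = V[np.array(state, dtype=bool)].sum(axis=0)
    p_out = V[np.array(out, dtype=bool)].sum(axis=0)
    assert sum(out) == sum(state)             # particle number conserved
    assert np.allclose(p_in, p_out)           # momentum conserved
```

Only the zero-momentum configurations are altered, so both conservation laws hold for every one of the 64 states, as required for the macroscopic limit above.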
Cellular Automata Modeling of Physical Systems, Figure 19 A lattice Boltzmann snow transport and deposition model. The three panels show the time evolution of the snow deposit (yellow) past a fence. Airborne snowflakes are shown as white dots

Cellular Automata Modeling of Physical Systems, Figure 17 Flow pattern from a simulation of a FHP model
Cellular Automata Modeling of Physical Systems, Figure 18 A CA snow transport and deposition model

Lattice Boltzmann Models

Lattice Boltzmann models (LBM) are an extension of the CA-fluid described in the previous section. The main conceptual difference is that in LBM the CA state is no longer made of Boolean numbers $n_i$ but of real-valued quantities $f_i$ for each lattice direction $i$. Instead of describing the presence or absence of a particle, the interpretation of $f_i$ is the density distribution function of particles traveling in lattice direction $i$.

As with LGA (see Eq. (16)), the dynamics of LBM can be expressed in terms of an equation for the $f_i$'s. The fluid quantities such as the density $\rho$ and velocity field $\mathbf{u}$ are obtained by summing the contributions from all lattice directions:
$$\rho = \sum_i f_i \,, \qquad \rho \mathbf{u} = \sum_i f_i \mathbf{v}_i$$
where $\mathbf{v}_i$ denotes the possible particle velocities in the lattice.

From a numerical point of view, the advantages of suppressing the Boolean constraint are several: less statistical noise, more numerical accuracy and, importantly, more flexibility to choose the lattice topology, the collision operator and boundary conditions. Thus, for many practical applications, the LBM approach is preferred to the LGA one.

The so-called BGK (or "single-time relaxation") LBM
$$f_i(\mathbf{r} + \Delta_t \mathbf{v}_i, t + \Delta_t) = f_i(\mathbf{r}, t) + \frac{1}{\tau}\left( f_i^{\text{eq}}(\rho, \mathbf{u}) - f_i(\mathbf{r}, t) \right) \qquad (18)$$
where $f^{\text{eq}}$ is a given function, has been used extensively in the literature to simulate complex flows. The method is now recognized as a serious competitor of the more traditional approach based on the computer solution of the Navier–Stokes partial differential equation. It is beyond the scope of this article to discuss the LBM approach in more detail. We refer the reader to several textbooks on this topic [8,10,40,41,49].

In addition to some rather technical aspects, one advantage of the LBM over the Navier–Stokes equation is its extended range of validity when the Knudsen number is not negligible (e.g. in microflows) [2]. Note finally that the LBM approach is also applicable to reaction-diffusion processes [1] or wave equations [10] by simply choosing an appropriate expression for $f_i^{\text{eq}}$.
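A minimal sketch of Eq. (18), using the common D2Q9 lattice and the standard low-Mach polynomial equilibrium (both are conventional choices not prescribed by the equation itself; $\tau = 0.8$ and the initial density wave are arbitrary):

```python
import numpy as np

# D2Q9 lattice: nine velocities and their weights
V = np.array([(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
              (1, 1), (-1, 1), (-1, -1), (1, -1)])
W = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

def f_eq(rho, u):
    """Standard second-order polynomial equilibrium f_i^eq(rho, u)."""
    vu = np.einsum('id,dxy->ixy', V, u)          # v_i . u at every node
    uu = np.einsum('dxy,dxy->xy', u, u)          # |u|^2
    return W[:, None, None] * rho * (1 + 3*vu + 4.5*vu**2 - 1.5*uu)

def bgk_step(f, tau=0.8):
    rho = f.sum(axis=0)
    u = np.einsum('id,ixy->dxy', V.astype(float), f) / rho
    f = f + (f_eq(rho, u) - f) / tau             # BGK collision, Eq. (18)
    return np.stack([np.roll(f[i], tuple(V[i]), axis=(0, 1))
                     for i in range(9)])         # propagation

# relax a small density wave initially at rest
x = np.arange(64)
rho0 = 1 + 0.05 * np.sin(2 * np.pi * x / 64)[:, None] * np.ones((64, 64))
f = f_eq(rho0, np.zeros((2, 64, 64)))
mass0 = f.sum()
for _ in range(100):
    f = bgk_step(f)
assert abs(f.sum() - mass0) < 1e-6   # BGK collision conserves total mass
```

Because the moments of this equilibrium reproduce $\rho$ and $\rho\mathbf{u}$ exactly, the relaxation step conserves mass and momentum by construction, mirroring the LGA collision term.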
Diffusion Processes

Diffusive phenomena and reaction processes play an important role in many areas of physics, chemistry, and biology and still constitute an active field of research. Systems in which reactive particles are brought into contact by a diffusion process and transform often give rise to very complex behaviors. Pattern formation [31,34] is a typical example of such a behavior.

CA provide an interesting framework to describe reaction-diffusion phenomena. The HPP rule we discussed in Sect. "A Simple Gas: The HPP Model" can be easily modified to produce many synchronous random walks. A random walk is well known to be the microscopic origin of a diffusion process. Thus, instead of experiencing a mass and momentum conserving collision, each particle now selects, at random, a new direction of motion among the possible values permitted by the lattice. Since several particles may enter the same site (up to four, on a two-dimensional square lattice), the random change of directions should be such that there are never two or more particles exiting a site in the same direction. This would otherwise violate the exclusion principle. The solution is to shuffle the directions of motion or, more precisely, to perform a random permutation of the velocity vectors, independently at each lattice site and each time step. Figure 20 illustrates this probabilistic evolution rule for a 2D square lattice.

Cellular Automata Modeling of Physical Systems, Figure 20 How the entering particles are deflected at a typical site, as a result of the diffusion rule. The four possible outcomes occur with respective probabilities $p_0$, $p_1$, $p_2$, and $p_3$. The figure shows four particles, but the mechanism is data-blind and any one of the arrows can be removed when fewer entering particles are present

It can be shown that the quantity $\rho$, defined as the average number of particles at site $\mathbf{r}$ and time $t$, obeys the diffusion equation [10]
$$\partial_t \rho + \operatorname{div}\left( -D \operatorname{grad} \rho \right) = 0$$
where $D$ is the diffusion constant, whose expression is
$$D = \frac{\Delta_r^2}{\Delta_t} \left[ \frac{1}{4(p_1 + p_2)} - \frac{1}{4} \right] = \frac{\Delta_r^2}{\Delta_t}\, \frac{p_0 + p_1}{4\left[ 1 - (p_0 + p_1) \right]} \qquad (19)$$
where δ_t and λ_r are the time step and the lattice spacing, respectively. For the one- and three-dimensional cases, a similar approach can be developed [10].

As an example of the use of the present random-walk cellular automata rule, we discuss an application to growth processes. In many cases, growth is governed by an aggregation mechanism: like particles stick to each other as they meet and, as a result, form a complicated pattern with a branching structure. A prototype model of aggregation is the so-called DLA model (diffusion-limited aggregation), introduced by Witten and Sander [46] in the early 1980s. Since its introduction, the DLA model has been investigated in great detail. However, diffusion-limited aggregation is a far-from-equilibrium process which cannot be described theoretically by first principles alone. Spatial fluctuations that are typical of DLA growth are difficult to take into account, and a numerical approach is necessary to complete the analysis.

DLA-like processes can be readily modeled by our diffusion cellular automata, provided that an appropriate rule is added to take into account the particle-particle aggregation. The first step is to introduce a rest particle to represent the particles of the aggregate. Therefore, in a two-dimensional system, a lattice site can be occupied by up to four diffusing particles, or by one "solid" particle. Figure 21 shows a two-dimensional DLA-like cluster grown by the cellular automata dynamics. At the beginning of the simulation, one or more rest particles are introduced in the system to act as aggregation seeds. The rest of the system is filled with particles with average concentration ρ. When a diffusing particle becomes a nearest neighbor to a rest particle, it stops and sticks to it by transforming into a rest particle. Since several particles can enter the same site, we may choose to aggregate all of them at once (i.e. a rest particle is actually composed of several moving particles), or to accept the aggregation only when a single particle is present. In addition to this question, the sticking condition is important. If any diffusing particle always sticks to the DLA cluster, the growth is very fast and can be influenced by the underlying lattice anisotropy. It is therefore more appropriate to stick with some probability p_s.
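The diffusion-plus-aggregation rule just described can be sketched in a few lines. The following is a minimal illustration, not the text's actual program: the lattice size, seed position, and sticking probability p_s = 1 are our choices (the particle density 0.06 per direction is taken from Fig. 21), and the velocity shuffle here is a uniform random permutation rather than the rotation probabilities p₀, …, p₃ of Fig. 20.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 64
# four direction channels per site (E, N, W, S), exclusion principle built in
occ = rng.random((4, L, L)) < 0.06
solid = np.zeros((L, L), bool)
solid[L // 2, L // 2] = True            # aggregation seed (rest particle)

def step(occ, solid):
    # 1) random permutation of the four velocity channels at every site
    #    (uniform shuffle; the text's rule assigns probabilities p0..p3)
    perm = np.argsort(rng.random((4, L, L)), axis=0)
    occ = np.take_along_axis(occ, perm, axis=0)
    # 2) propagation along the four lattice directions (periodic boundaries)
    occ[0] = np.roll(occ[0], 1, axis=1)     # east
    occ[1] = np.roll(occ[1], -1, axis=0)    # north
    occ[2] = np.roll(occ[2], -1, axis=1)    # west
    occ[3] = np.roll(occ[3], 1, axis=0)     # south
    # 3) aggregation: any diffusing particle next to a solid site sticks (p_s = 1),
    #    all particles at the sticking site are aggregated at once
    neigh = (np.roll(solid, 1, 0) | np.roll(solid, -1, 0)
             | np.roll(solid, 1, 1) | np.roll(solid, -1, 1))
    stick = occ.any(axis=0) & neigh & ~solid
    solid = solid | stick
    occ[:, stick] = False                   # sticking particles leave the gas
    return occ, solid

for _ in range(200):
    occ, solid = step(occ, solid)
```

Because the shuffle only permutes the occupation numbers of a site, no two particles ever exit in the same direction, which is exactly the exclusion constraint discussed above.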
Cellular Automata Modeling of Physical Systems, Figure 21
Two-dimensional cellular automata DLA-like cluster (black), obtained with p_s = 1, an aggregation threshold of 1 particle, and a density of diffusing particles of 0.06 per lattice direction. The gray dots represent the diffusing particles not yet aggregated. The fractal dimension is found to be d_f = 1.78

Reaction-Diffusion Processes

A reaction term can be added on top of the CA diffusion rule. For the sake of illustration, let us consider a process such as

  A + B →ᴷ C    (20)

where A, B, and C are different chemical species, all diffusing in the same solvent, and K is the reaction constant. To account for this reaction, one can consider the following mechanism: at the "microscopic" level of the discrete lattice dynamics, all three species are first governed by a diffusion rule. When an A and a B particle enter the same site at the same time, they disappear and form a C particle. Of course, there are several ways to select the events that will produce a C when more than one A or one B are simultaneously present at a given site. Also, when Cs already exist at this site, the exclusion principle may prevent the formation of new ones. A simple choice is to have A and B react only when they perform a head-on collision and when no Cs are present in the perpendicular directions. Figure 22 displays such a process. Other rules can be considered if we want to enhance the reaction (make it more likely) or to deal with more complex situations (2A + B → C, for instance). A parameter k can be introduced to tune the reaction rate K by controlling the probability of a reaction taking place. Using an appropriate mathematical tool [10], one can show that the idealized microscopic reaction-diffusion behavior implemented by the CA rule obeys the expected partial differential equation

  ∂_t ρ_A = D ∇² ρ_A − K ρ_A ρ_B    (21)

provided k is correctly chosen [10].

As an example of a CA reaction-diffusion model, we show in Fig. 23 the formation of the so-called Liesegang patterns [22]. Liesegang patterns are produced by precipitation and aggregation in the wake of a moving reaction front. Typically, they are observed in a test tube containing a gel in which a chemical species B (for example AgNO₃) reacts with another species A (for example HCl). At the beginning of the experiment, B is uniformly distributed in the gel with concentration b₀. The other species A, with concentration a₀, is allowed to diffuse into the tube from its open extremity. Provided that the concentration a₀ is larger than b₀, a reaction front propagates in the tube. As this A + B reaction goes on, formation of consecutive bands of precipitate (AgCl in our example) is observed in the tube, as shown in Fig. 23. Although this figure is from a computer simulation [12], it is very close to the picture of a real experiment. Figure 24 shows the same process but in a different geometry. Species A is added in the middle of a 2D gel and diffuses radially. Rings (a) or spirals (b) result from the interplay between the reaction front and the solidification.
Cellular Automata Modeling of Physical Systems, Figure 22 Automata implementation of the A + B → C reaction process. The reaction takes place with probability k. A Boolean quantity determines in which direction the C particle is moving after its creation
Cellular Automata Modeling of Physical Systems, Figure 23 Example of the formation of Liesegang bands in a cellular automata simulation. The red bands correspond to the precipitate which results from the A + B reaction front (in blue)
Cellular Automata Modeling of Physical Systems, Figure 24 Formation of Liesegang rings in a cellular automata simulation. The red spots correspond to the precipitate created by the A + B reaction front (in blue)
Excitable Media Excitable media are other examples of reaction processes where unexpected space-time patterns are created. As opposed to the reaction-diffusion models discussed above, diffusion is not considered here explicitly. It is assumed that reaction occurs between nearest neighbor cells, making the transport of species unnecessary. The main focus is on the description of chemical waves propagating in the system much faster and differently than any diffusion process would produce. An excitable medium is basically characterized by three states [5]: the resting state, the excited state, and the refractory state. The resting state is a stable state of the system. But a resting state can respond to a local perturbation and become excited. Then, the excited state evolves to a refractory state where it no longer influences its neighbors and, finally, returns to the resting state.
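The resting/excited/refractory cycle just described can be sketched directly, in the style of the Greenberg–Hastings rule discussed below. In this minimal illustration, n = 8 states and threshold k = 3 match Fig. 25, while the lattice size and the 5% initial excitation fraction are our choices.

```python
import numpy as np

rng = np.random.default_rng(1)
L, n, k = 100, 8, 3                               # n states, excitation threshold k
phi = np.where(rng.random((L, L)) < 0.05, 1, 0)   # 5% excited, rest resting (0)

def moore_excited(phi):
    """Count excited neighbors (states 1 .. n/2) in the Moore neighborhood."""
    excited = (phi >= 1) & (phi <= n // 2)
    count = np.zeros((L, L), int)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di or dj:
                count += np.roll(np.roll(excited, di, 0), dj, 1)
    return count

def step(phi):
    # excited or refractory sites advance through the clock of n states
    nxt = np.where(phi > 0, (phi + 1) % n, 0)
    # resting sites with at least k excited Moore neighbors become excited
    fire = (phi == 0) & (moore_excited(phi) >= k)
    return np.where(fire, 1, nxt)

for _ in range(50):
    phi = step(phi)
```

Wrapping states with `% n` is exactly the clock behavior: after the last refractory state n − 1, a site returns to the resting state 0.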
A generic behavior of excitable media is to produce chemical waves of various geometries [26,27]. Ring and spiral waves are typical patterns of excitation. Many chemical systems exhibit an excitable behavior. The Selkov model [38] and the Belousov–Zhabotinsky reaction are examples. Chemical waves play an important role in many biological processes (nervous systems, muscles) since they can mediate the transport of information from one place to another.

The Greenberg–Hastings model is an example of a cellular automata model of an excitable medium. This rule, and its generalization, have been extensively studied [17,20]. The implementation we propose here for the Greenberg–Hastings model is the following: the state φ(r, t) of site r at time t takes its value in the set {0, 1, 2, …, n − 1}. The state φ = 0 is the resting state. The states φ = 1, …, n/2 (n is assumed to be even) correspond to excited states. The rest, φ = n/2 + 1, …, n − 1, are the refractory states. The cellular automata evolution rule is the following:

1. If φ(r, t) is excited or refractory, then φ(r, t + 1) = φ(r, t) + 1 mod n.
2. If φ(r, t) = 0 (resting state) it remains so, unless there are at least k excited sites in the Moore neighborhood of site r. In this case φ(r, t + 1) = 1.

Cellular Automata Modeling of Physical Systems, Figure 25
Excitable medium: evolution of a configuration with 5% of excited states φ = 1, and 95% of resting states (black), for n = 8 and k = 3

The n states play the role of a clock: an excited state evolves through the sequence of all possible states until it returns to 0, which corresponds to a stable situation. The behavior of this rule is quite sensitive to the value of n and the excitation threshold k. Figure 25 shows the evolution of this automaton for a given set of the parameters n and k. The simulation is started with a uniform configuration of resting states, perturbed by some excited sites randomly distributed over the system. Note that if the concentration of perturbations is low enough, excitation dies out rapidly and the system returns to the rest state. Increasing the number of perturbed states leads to the formation of traveling waves, and self-sustained oscillations may appear in the form of ring or spiral waves.

The Greenberg–Hastings model has some similarity with the "tube-worms" rule proposed by Toffoli and Margolus [42]. This rule is intended to model the Belousov–Zhabotinsky reaction and is as follows. The state of each site is either 0 (refractory) or 1 (excited) and a local timer (whose value is 3, 2, 1, or 0) controls the refractory period. Each iteration of the rule can be expressed by the following sequence of operations: (i) where the timer is zero, the state is excited; (ii) the timer is decreased by 1 unless it is 0; (iii) a site becomes refractory whenever the timer is equal to 2; (iv) the timer is reset to 3 for the excited sites which have two, or more than four, excited sites in their Moore neighborhood.

Cellular Automata Modeling of Physical Systems, Figure 26
The tube-worms rule for an excitable medium

Figure 26 shows a simulation of this automaton, starting from a random initial configuration of the timers and the excited states. We observe the formation of spiral pairs of excitations. Note that this rule is very sensitive to small modifications (in particular to the order of operations (i) to (iv)).

Another rule which is also similar to the Greenberg–Hastings and Margolus–Toffoli tube-worm models is the so-called forest-fire model. This rule describes the propagation of a fire or, in a different context, may also be used to mimic contagion in the case of an epidemic. Here we describe the case of a forest-fire rule. The forest-fire rule is a probabilistic CA defined on a d-dimensional cubic lattice. Initially, each site is occupied by either a tree, a burning tree, or is empty. The state of the system is updated in parallel according to the following rule: (1) a burning tree becomes an empty site; (2) a green tree becomes a burning tree if at least one of its nearest neighbors is burning; (3) at an empty site, a tree grows with probability p; (4) a tree without a burning neighbor becomes a burning tree with probability f (so as to mimic the effect of lightning).

Cellular Automata Modeling of Physical Systems, Figure 27
The forest fire rule: green sites correspond to a grown tree, black pixels represent burned sites, and the yellow color indicates a burning tree. The snapshots given here represent three situations after a few hundred iterations. The parameters of the rule are p = 0.3 and f = 6 × 10⁻⁵

Figure 27 illustrates the behavior of this rule in a two-dimensional situation. Provided that the time scales of tree growth and burning down of forest clusters are well separated (i.e. in the limit f/p → 0), this model has self-organized critical states [15]. This means that in the steady state, several physical quantities characterizing the system have a power-law behavior.

Surface Reaction Models

Other important reaction models that can be described by a CA are surface-reaction models, where nonequilibrium phase transitions can be observed. Nonequilibrium phase transitions are an important topic in physics because no general theory is available to describe such systems and most of the known results are based on numerical simulations. The so-called Ziff model [54] gives an example of the reaction A–B₂ on a catalyst surface (for example CO–O₂). The system is out of equilibrium because it is an open system in which material continuously flows in and out.
However, after a while, it reaches a stationary state and, depending on some control parameters, may be in different phases. The basic steps are the following. A gas mixture with concentrations X_B₂ of B₂ and X_A of A sits above a surface on which it can be adsorbed. The surface is divided into elementary cells and each cell can adsorb one atom only. The B species can be adsorbed only in atomic form. A molecule B₂ dissociates into two B atoms only if two adjacent cells are empty. Otherwise, the B₂ molecule is rejected. If two nearest-neighbor cells are occupied by different species, they chemically react and the product of the reaction is desorbed. In the example of the CO–O₂ reaction, the desorbed product is a CO₂ molecule. This final desorption step is necessary for the product to be recovered and for the catalyst to be regenerated. However, the gas above the surface is assumed to be continually replenished by fresh material so that its composition remains constant during the whole evolution.

It is found by sequential numerical simulation [54] that a reactive steady state occurs only in a window defined by X₁ < X_A < X₂, where X₁ = 0.389 ± 0.005 and X₂ = 0.525 ± 0.001 (provided that X_B₂ = 1 − X_A). This situation is illustrated in Fig. 28, though for the corresponding cellular automata dynamics and X_B₂ ≠ 1 − X_A. Outside this window of parameters, the steady state is a "poisoned" catalyst of pure A (when X_A > X₂) or pure B (when X_A < X₁). For X₁ < X_A < X₂, the coverage fraction varies continuously with X_A, and one speaks of a continuous (or second-order) nonequilibrium phase transition.

Cellular Automata Modeling of Physical Systems, Figure 28
Typical microscopic configuration in the stationary state of the CA Ziff model, where there is coexistence of the two species over time. The simulation corresponds to the generalized model described by rules R1, R2, R3, and R4 below. The blue and green dots represent, respectively, the A and B particles, whereas the empty sites are black

At X_A = X₂, the coverage fraction varies discontinuously with X_A, and one speaks of a discontinuous (or first-order) nonequilibrium phase transition. Figure 30 displays this behavior. The asymmetry of behavior at X₁ and X₂ comes from the fact that A and B atoms have different adsorption rules: two vacant adjacent sites are necessary for B to stick on the surface, whereas one empty site is enough for A.
In a CA approach, the elementary cells of the catalyst are mapped onto the cells of the automaton. In order to model the different processes, each cell j can be in one of four different states, denoted |ψ_j⟩ = |0⟩, |A⟩, |B⟩, or |C⟩. The state |0⟩ corresponds to an empty cell, |A⟩ to a cell occupied by an atom A, and |B⟩ to a cell occupied by an atom B. The state |C⟩ is artificial and represents a precursor state describing the conditional occupation of the cell by an atom B. Conditional means that, during the next evolution step of the automaton, |C⟩ will become |B⟩ or |0⟩ depending on whether a nearest-neighbor cell is empty and ready to receive the second B atom of the molecule B₂. This conditional state is necessary to describe the dissociation of B₂ molecules on the surface.

The main difficulty when implementing the Ziff model with a fully synchronous updating scheme is to ensure that the correct stoichiometry is obeyed. Indeed, since all atoms take a decision at the same time, the same atom could well take part in a reaction with several different neighbors, unless some care is taken. The solution to this problem is to add a vector field to every site in the lattice [53], as shown in Fig. 29. A vector field is a collection of arrows, one at each lattice site, that can point in any of the four directions of the lattice. The directions of the arrows at each time step are assigned randomly. Thus, a two-site process is carried out only on those pairs of sites in which the arrows point toward each other (matching nearest-neighbor pairs (MNN)). This concept of reacting matching pairs is a general way to partition the parallel computation into local parts. In the present implementation, the following generalization of the dynamics is included: an empty site remains empty with some probability. One has then two control
Cellular Automata Modeling of Physical Systems, Figure 29 Illustration of rules R2 and R3. The arrows select which neighbor is considered for a reaction. Dark and white particles represent the A and B species, respectively. The shaded region corresponds to cells that are not relevant to the present discussion such as, for instance, cells occupied by the intermediate C species
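The matching nearest-neighbor (MNN) pairing can be sketched as follows. The point is only the pairing logic: because each site carries exactly one arrow, every site is matched with at most one partner, so synchronous two-site reactions never reuse an atom. The lattice size and the direction encoding are our choices.

```python
import numpy as np

rng = np.random.default_rng(3)
L = 8
# one arrow per site: 0 = east, 1 = north, 2 = west, 3 = south
arrow = rng.integers(0, 4, size=(L, L))

# displacement (row, col) for each direction; direction d+2 mod 4 is opposite to d
disp = {0: (0, 1), 1: (-1, 0), 2: (0, -1), 3: (1, 0)}

def matching_pairs(arrow):
    """Pairs of sites whose arrows point toward each other (periodic lattice)."""
    pairs = []
    for i in range(L):
        for j in range(L):
            di, dj = disp[arrow[i, j]]
            ni, nj = (i + di) % L, (j + dj) % L
            # the neighbor matches if its arrow points back (opposite direction);
            # the tuple comparison records each pair only once
            if arrow[ni, nj] == (arrow[i, j] + 2) % 4 and (ni, nj) > (i, j):
                pairs.append(((i, j), (ni, nj)))
    return pairs

pairs = matching_pairs(arrow)
# every site belongs to at most one pair: no atom reacts twice in one step
sites = [s for pr in pairs for s in pr]
assert len(sites) == len(set(sites))
```

A synchronous update would then apply the two-site reaction rules (e.g. A–B desorption) only to the sites listed in `pairs`, which is how the stoichiometry problem described above is avoided.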
R4: If |ψ_j⟩ = |C⟩, the conditional state resolves into |B⟩ or |0⟩, depending on whether a matching nearest-neighbor cell is available to receive the second B atom.

A DTDS ⟨X, F⟩ is said to be totally transitive if, for any n > 0, the system ⟨X, Fⁿ⟩ is transitive.

Proposition 1 ([27]) Any mixing DTDS is totally transitive.

At the beginning of the eighties, Auslander and Yorke introduced the following definition of chaos [5].

Definition 7 (AY-Chaos) A DTDS is AY-chaotic if it is transitive and it is sensitive to the initial conditions.

This definition involves two fundamental characteristics: the undecomposability of the system, due to transitivity, and the unpredictability of the dynamical evolution, due to sensitivity. We now introduce a notion which is often referred to as an element of regularity that a chaotic dynamical system must exhibit.

Definition 8 (DPO) A DTDS ⟨X, F⟩ has the denseness of periodic points (or, it is regular) if the set of its periodic points is dense in X.

The following is a standard result for compact DTDS.

Proposition 2 If a compact DTDS has DPO then it is surjective.

In his famous book [22], Devaney modified AY-chaos by adding the denseness of periodic points.

Definition 9 (D-Chaos) A DTDS is said to be D-chaotic if it is sensitive, transitive, and regular.

An interesting result states that sensitivity, despite its popular appeal, is redundant in the Devaney definition of chaos.
Stability

All the previous properties can be considered as components of a chaotic, and thus unstable, behavior of a DTDS. We now illustrate some properties concerning conditions of stability for a system.

Definition 10 (Equicontinuous Point) A state x ∈ X of a DTDS ⟨X, F⟩ is an equicontinuous point if for any ε > 0 there exists δ > 0 such that for all y ∈ X, d(y, x) < δ implies that ∀n ∈ ℕ, d(Fⁿ(y), Fⁿ(x)) < ε.

In other words, a point x is equicontinuous (or Lyapunov stable) if for any ε > 0 there exists a neighborhood of x whose states have orbits which stay close to the orbit of x, with distance less than ε. This is a condition of local stability for the system. Associated with this notion involving a single state, we have two notions of global stability based on the "size" of the set of equicontinuous points.

Definition 11 (Equicontinuity) A DTDS ⟨X, F⟩ is said to be equicontinuous if for any ε > 0 there exists δ > 0 such that for all x, y ∈ X, d(y, x) < δ implies that ∀n ∈ ℕ, d(Fⁿ(y), Fⁿ(x)) < ε.

Given a DTDS, let E be its set of equicontinuity points. Remark that if a DTDS is equicontinuous then the set E of all its equicontinuity points is the whole X. The converse is also true in the compact setting. Furthermore, if a system is sensitive then E = ∅. In general, the converse is not true [43].

Definition 12 (Almost Equicontinuity) A DTDS is almost equicontinuous if the set of its equicontinuous points E is residual (i.e., it can be obtained by an infinite intersection of dense open subsets).

It is obvious that equicontinuous systems are almost equicontinuous. In the sequel, almost equicontinuous systems which are not equicontinuous will be called strictly almost equicontinuous. An important result affirms that transitive systems on compact spaces are almost equicontinuous if and only if they are not sensitive [3].
Chaotic Behavior of Cellular Automata
Topological Entropy

Topological entropy is another interesting property which can be taken into account in order to study the degree of chaoticity of a system. It was introduced in [2] as an invariant of topological conjugacy. The notion of topological entropy is based on the complexity of the coverings of the system. Recall that an open covering of a topological space X is a family of open sets whose union is X. The join of two open coverings U and V is U ∨ V = {U ∩ V : U ∈ U, V ∈ V}. The inverse image of an open covering U by a map F : X → X is F⁻¹(U) = {F⁻¹(U) : U ∈ U}. On the basis of these notions, the entropy of a system ⟨X, F⟩ over an open covering U is defined as

  H(X, F, U) = lim_{n→∞} (1/n) log |U ∨ F⁻¹(U) ∨ ⋯ ∨ F⁻⁽ⁿ⁻¹⁾(U)|

where |U| is the cardinality of U.

Definition 13 (Topological Entropy) The topological entropy of a DTDS ⟨X, F⟩ is

  h(X, F) = sup{ H(X, F, U) : U is an open covering of X }    (1)

Topological entropy represents the exponential growth of the number of orbit segments which can be distinguished with a certain good, but finite, accuracy. In other words, it measures the uncertainty of the system evolutions when a partial knowledge of the initial state is given. There are close relationships between the entropy and the topological properties we have seen so far. For instance, we have the following.

Proposition 4 ([3,12,28]) In compact DTDS, transitivity and positive entropy imply sensitivity.

Cellular Automata

Consider the set of configurations C which consists of all functions from Z^D into A. The space C is usually equipped with the Tychonoff (or Cantor) metric d defined as

  ∀a, b ∈ C,  d(a, b) = 2⁻ⁿ,  with  n = min{ ‖v⃗‖∞ : a(v⃗) ≠ b(v⃗), v⃗ ∈ Z^D }

where ‖v⃗‖∞ denotes the maximum of the absolute values of the components of v⃗. The topology induced by d coincides with the product topology induced by the discrete topology on A. With this topology, C is a compact, perfect, and totally disconnected space.

Let N = {u⃗₁, …, u⃗ₛ} be an ordered set of vectors of Z^D and f : Aˢ → A be a function.

Definition 14 (CA) The D-dimensional CA based on the local rule f and the neighborhood frame N is the pair ⟨C, F⟩, where F : C → C is the global transition rule defined as follows:

  ∀c ∈ C, ∀v⃗ ∈ Z^D,  F(c)(v⃗) = f(c(v⃗ + u⃗₁), …, c(v⃗ + u⃗ₛ))    (2)

Note that the mapping F is (uniformly) continuous with respect to the Tychonoff metric. Hence, the pair ⟨C, F⟩ is a proper discrete time dynamical system.

Let Z_m = {0, 1, …, m − 1} be the group of the integers modulo m. Denote by C_m the configuration space C for the special case A = Z_m. When C_m is equipped with the natural extensions of the sum and the product operations, it turns out to be a linear space. Therefore, one can exploit the properties of linear spaces to simplify the proofs and the overall presentation.

A function f : Z_mˢ → Z_m is said to be linear if there exist λ₁, …, λₛ ∈ Z_m such that it can be expressed as

  ∀(x₁, …, xₛ) ∈ Z_mˢ,  f(x₁, …, xₛ) = [ Σᵢ₌₁ˢ λᵢ xᵢ ]_m

where [x]_m is the integer x taken modulo m.

Definition 15 (Linear CA) A D-dimensional linear CA is a CA ⟨C_m, F⟩ whose local rule f is linear. Note that for a linear CA, Eq. (2) becomes:

  ∀c ∈ C, ∀v⃗ ∈ Z^D,  F(c)(v⃗) = [ Σᵢ₌₁ˢ λᵢ c(v⃗ + u⃗ᵢ) ]_m
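As a concrete instance of Definition 15, here is a sketch of a one-dimensional linear CA acting on a cyclic configuration; m, the neighborhood, and the coefficients λᵢ are arbitrary choices of ours. The point it illustrates is that linearity of the local rule makes the global map additive over Z_m.

```python
import numpy as np

m = 4                                  # alphabet Z_m (illustrative choice)
offsets = [-1, 0, 1]                   # neighborhood frame N
lam = [1, 2, 1]                        # coefficients lambda_i in Z_m

def F(c):
    """Global rule of the linear CA on a cyclic configuration (Eq. (2))."""
    c = np.asarray(c)
    # F(c)(v) = [ sum_i lambda_i * c(v + u_i) ]_m
    out = sum(l * np.roll(c, -u) for l, u in zip(lam, offsets))
    return out % m

rng = np.random.default_rng(4)
a = rng.integers(0, m, 16)
b = rng.integers(0, m, 16)
# additivity of the global map, inherited from the linear local rule
assert np.array_equal(F((a + b) % m), (F(a) + F(b)) % m)
```

This additivity, F(a + b) = F(a) + F(b) in Z_m, is what makes linear CA amenable to exact algebraic analysis.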
The Case of Cellular Automata

In this section the results seen so far are specialized to the CA setting, focusing on dimension one. The following result allows a first classification of one-dimensional CA according to their degree of chaoticity.

Theorem 1 ([43]) A one-dimensional CA is sensitive if and only if it is not almost equicontinuous.

In other words, for CA the dichotomy between sensitivity and almost equicontinuity holds, and not only under the transitivity condition. As a consequence, the family of all one-dimensional CA can be partitioned into four classes [43]: (k1) equicontinuous CA; (k2) strictly almost equicontinuous CA;
(k3) sensitive CA; (k4) expansive CA.

This classification almost fits for higher dimensions. The problem is that there exist CA between the classes K2 and K3 (i.e., non-sensitive CA without any equicontinuity point). Even relaxing the K2 definition to "CA having some equicontinuity point", the gap persists (see, for instance, [57]). Unfortunately, much like most of the interesting properties of CA, the properties defining the above classification scheme are also affected by undecidability.

Theorem 2 ([23]) For each i = 1, 2, 3, there is no algorithm to decide if a one-dimensional CA belongs to the class Ki.

The following conjecture stresses the fact that nothing is known about the decidability of membership in K4.

Conjecture 1 (Folklore) Membership in class K4 is undecidable.

Remark that the above conjecture is clearly false for dimensions greater than 1, since there do not exist expansive CA in dimension strictly greater than 1 [53].

Proposition 5 ([7,11]) Expansive CA are strongly transitive and mixing.

In the CA setting, the notion of total transitivity reduces to simple transitivity. Moreover, there is a strict relation between transitive and sensitive CA.

Theorem 3 ([49]) If a CA is transitive then it is totally transitive.

Theorem 4 ([28]) Transitive CA are sensitive.

As we have already seen, sensitivity is undecidable. Hence, in view of the combinatorial complexity of transitive CA, the following conjectures seem plausible.

Conjecture 2 Transitivity is an undecidable property.

Conjecture 3 ([48]) Strongly transitive CA are (topologically) mixing.

Chaos and Combinatorial Properties

In this section, when referring to a one-dimensional CA, we assume that u₁ = min N and uₛ = max N (see also Sect. "Definitions"). Furthermore, we call elementary a one-dimensional CA with alphabet A = {0, 1} and N = {−1, 0, 1} (there exist 256 possible elementary CA, which can be enumerated according to their local rule [64]).
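The enumeration by rule number can be turned directly into a simulator: bit 4a₋₁ + 2a₀ + a₁ of the number gives the image of the neighborhood (a₋₁, a₀, a₁). A minimal sketch (the lattice size and the choice of rule 90 are ours):

```python
import numpy as np

def elementary_step(cells, number):
    """One synchronous step of the elementary CA with the given rule number."""
    left, right = np.roll(cells, 1), np.roll(cells, -1)
    idx = 4 * left + 2 * cells + right        # encode neighborhood (a-1, a0, a1)
    return (number >> idx) & 1                # look up the corresponding bit

cells = np.zeros(64, dtype=np.int64)
cells[32] = 1                                 # single seed
for _ in range(16):
    cells = elementary_step(cells, 90)        # rule 90: XOR of the two neighbors
```

Rule 90 is additive, so after 2ᵏ steps a single seed produces exactly two live cells, 2ᵏ sites to its left and right; iterating and plotting the rows gives the familiar Sierpinski pattern.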
In the CA setting, most of the chaos components are related to some properties of a combinatorial nature, like injectivity, surjectivity, and openness. First of all, remark that injectivity and surjectivity are dimension-sensitive properties, in the sense of the following.

Theorem 5 ([4,39]) Injectivity and surjectivity are decidable in dimension 1, while they are not decidable in dimension greater than 1.

A one-dimensional CA is said to be a right CA (resp., left CA) if u₁ > 0 (resp., uₛ < 0).

Theorem 6 ([1]) Any surjective and right (or left) CA is topologically mixing.

The previous result can be generalized to dimension greater than 1 in the following sense.

Theorem 7 If for a given surjective D-dimensional CA there exists a (D − 1)-dimensional hyperplane H (as a linear subspace of Z^D) such that:

1. all the neighborhood vectors stay on the same side of H, and
2. no vector lies on H,

then the CA is topologically mixing.

Proof Choose two configurations c and d and a natural number r. Let U and V be the two distinct open balls of radius 2⁻ʳ and of center c and d, respectively (in a metric space (X, d) the open ball of radius δ > 0 and center x ∈ X is the set B_δ(x) = {y ∈ X | d(y, x) < δ}). For any integer n ≥ 1, denote by N_n the neighborhood frame of the CA Fⁿ and by d_n ∈ F⁻ⁿ(d) any n-preimage of d. For any configuration e, the values Fⁿ(e)(x⃗) for x⃗ ∈ O, where O = {v⃗ | ‖v⃗‖∞ ≤ r}, depend only on the values e(x⃗) for x⃗ ∈ O + N_n. By the hypothesis, there exists an integer m > 0 such that for any n ≥ m the sets O and O + N_n are disjoint. Therefore, for any n ≥ m one can build a configuration e_n ∈ C such that e_n(x⃗) = c(x⃗) for x⃗ ∈ O, and e_n(x⃗) = d_n(x⃗) for x⃗ ∈ O + N_n. Then, for any n ≥ m, e_n ∈ U and Fⁿ(e_n) ∈ V, which establishes topological mixing. □

Injectivity prevents a CA from being strongly transitive, as stated in the following.

Theorem 8 ([11]) Any strongly transitive CA is not injective.

Recall that a CA of global rule F is open if F is an open function. Equivalently, in the one-dimensional case, every configuration has the same number of predecessors [33].

Theorem 9 ([55]) Openness is decidable in dimension one.

Remark that mixing CA are not necessarily open (consider, for instance, the elementary rule 106, see [44]). The
following conjecture is true when replacing strong transitivity by expansivity [43].

Conjecture 4 Strongly transitive CA are open.

Recall that the shift map σ : A^Z → A^Z is the one-dimensional linear CA defined by the neighborhood N = {+1} and by the coefficient λ₁ = 1. A configuration of a one-dimensional CA is called jointly periodic if it is periodic both for the CA and for the shift map (i.e., it is also spatially periodic). A CA is said to have the joint denseness of periodic orbits property (JDPO) if it admits a dense set of jointly periodic configurations. Obviously, JDPO is a stronger form of DPO.

Theorem 10 ([13]) Open CA have JDPO.

The common feeling is that (J)DPO is a property of a class wider than open CA. Indeed,

Conjecture 5 ([8]) Every surjective CA has (J)DPO.
As a consequence of Theorem 13, if all mixing CA have DPO then all transitive CA have DPO. Permutivity is another easy-to-check combinatorial property strictly related to chaotic behavior. Definition 16 (Permutive CA) A function f : As 7! A is permutive in the variable ai if for any (a1 ; : : : ; a i1 ; a iC1 ; : : : ; a s ) 2 As1 the function a 7! f (a1 ; : : : ; a i1 ; a; a iC1 ; : : : ; a s ) is a permutation. In the one-dimensional case, a function f that is permutive in the leftmost variable a1 (resp., rightmost as ), it is called leftmost (resp. rightmost) permutive. CA with either leftmost or rightmost permutive local rule share most of the chaos components. Theorem 14 ([18]) Any one-dimensional CA based on a leftmost (or, rightmost) permutive local rule with u1 < 0 (resp., us > 0) is topologically mixing.
If this conjecture were true then, as a consequence of Theorem 5 and Proposition 2, DPO would be decidable in dimension one (and undecidable in greater dimensions). Up to now, Conjecture 5 has been proved true for some restricted classes of one-dimensional surjective CA beside open CA.
The previous result can be generalized to any dimension in the following sense.
Theorem 11 ([8]) Almost equicontinuous surjective CA have JDPO.
1. f is permutive in the variable ai , and u i k2 D maxfkE uk2 j 2. the neighbor vector uE i is such that kE uE 2 Ng, and 3. all the coordinates of uEi have absolute value l, for some integer l > 0,
Consider for a while a CA whose alphabet is an algebraic group. A configuration c is said to be finite if there exists an integer k such that for any i with |i| > k, c(i) = 0, where 0 is the null element of the group. Denote by s(c) = Σ_i c(i) the sum of the values of a finite configuration. A one-dimensional CA F is called number-conserving if for any finite configuration c, s(c) = s(F(c)). Theorem 12 ([26]) Number-conserving surjective CA have DPO.
If a CA F is number-conserving, then for any h ∈ Z the CA σ^h ∘ F is number-conserving. As a consequence we have that Corollary 1 Number-conserving surjective CA have JDPO. Proof Let F be a number-conserving CA. Choose h ∈ Z in such a way that the CA σ^h ∘ F is a (number-conserving) right CA. By a result in [1], both the CA σ^h ∘ F and F have JDPO. In a recent work it is proved that the problem of solving Conjecture 5 can be reduced to the study of mixing CA. Theorem 13 ([1]) If all mixing CA have DPO then every surjective CA has JDPO.
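On finite configurations, number conservation can be probed directly: apply the global map on a zero-padded window and compare sums. The sketch below (our naming; it assumes the quiescent state 0 is mapped to 0) tests the "traffic" rule 184, a classical number-conserving CA. Sampling configurations gives a necessary condition only, not a proof.

```python
def apply_ca(rule, r, cells):
    """One step of a radius-r one-dimensional CA on a finite configuration
    embedded in a background of 0s; the output window is widened by r cells
    on each side so no nonzero value can escape in a single step."""
    padded = [0] * (2 * r) + list(cells) + [0] * (2 * r)
    return [rule(*padded[i:i + 2 * r + 1]) for i in range(len(padded) - 2 * r)]

def conserves_sum(rule, r, samples):
    """Necessary condition for number conservation: s(F(c)) == s(c) on a
    sample of finite configurations."""
    return all(sum(apply_ca(rule, r, c)) == sum(c) for c in samples)

# rule 184: a "car" (1) moves right iff the next cell is empty
rule184 = lambda l, c, r: 1 if (c == 0 and l == 1) or (c == 1 and r == 1) else 0
```

Indeed conserves_sum(rule184, 1, [[1, 0, 1, 1, 0], [1, 1, 1]]) holds, whereas a non-conserving rule such as lambda l, c, r: l | c | r already fails on [0, 1, 0].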
Theorem 15 Let f and N be the local rule and the neighborhood frame, respectively, of a given D-dimensional CA. If there exists i such that
then the given CA is topologically mixing. Proof Without loss of generality, assume that u_i = (l, …, l). Let U and V be two distinct open balls of equal radius 2^{−r}, where r is an arbitrary natural number. For any integer n ≥ 1, denote by N^n the neighborhood frame of the CA F^n, by f^n the corresponding local rule, and by N^n(x) the set {x + v | v ∈ N^n} for a given x ∈ Z^D. Note that f^n is permutive in the variable corresponding to the neighbor vector n·u_i ∈ N^n. Choose two configurations c ∈ U and d ∈ V. Let m be the smallest natural number such that ml > 2r. For any n ≥ m, we are going to build a configuration d_n ∈ U such that F^n(d_n) ∈ V. Set d_n(z) = c(z) for z ∈ O, where O = {v | ‖v‖_∞ ≤ r}. In this way d_n ∈ U. In order to obtain F^n(d_n) ∈ V, it is required that F^n(d_n)(x) = d(x) for each x ∈ O. We complete the configuration d_n by starting with x = y, where y = (−r, …, −r). Choose arbitrarily the values d_n(z) for z ∈ (N^n(y) \ O) \ {y + n·u_i} (note that O ⊆ N^n(y)). By permutivity of f^n, there exists a ∈ A such that if one sets d_n(y + n·u_i) = a, then
F^n(d_n)(y) = d(y). Let now x = y + e_1. Choose arbitrarily the values d_n(z) for z ∈ (N^n(x) \ N^n(y)) \ {x + n·u_i}. By the same argument as above, there exists a ∈ A such that if one sets d_n(x + n·u_i) = a, then F^n(d_n)(x) = d(x). Proceeding in this way one can complete d_n in order to obtain F^n(d_n) ∈ V. Theorem 16 ([16]) Any one-dimensional CA based on a leftmost (resp., rightmost) permutive local rule with u_1 < 0 (resp., u_s > 0) has (J)DPO. Theorem 17 ([54]) Any one-dimensional CA based on a local rule which is both leftmost and rightmost permutive, with u_1 < 0 and u_s > 0, is expansive. As a consequence of Proposition 5, we have the following result, for which we give a direct proof in order to clarify the result which follows immediately after. Proposition 6 Any one-dimensional CA based on a local rule which is both leftmost and rightmost permutive, with u_1 < 0 and u_s > 0, is strongly transitive. Proof Choose arbitrarily two configurations c, o ∈ A^Z and an integer k > 0. Let n > 0 be the first integer such that nr > k, where r = max{−u_1, u_s}. We are going to construct a configuration b ∈ A^Z such that d(b, c) < 2^{−k} and F^n(b) = o. Fix b(x) = c(x) for each x = nu_1, …, nu_s − 1. In this way d(b, c) < 2^{−k}. For each i ∈ N we are going to find suitable values b(nu_s + i) in order to obtain F^n(b)(i) = o(i). Let us start with i = 0. By the hypothesis, the local rule f^n of the CA F^n is permutive in the rightmost variable, corresponding to nu_s. Thus, there exists a value a_0 ∈ A such that, if one sets b(nu_s) = a_0, we obtain F^n(b)(0) = o(0). For the same reasons, there exists a value a_1 ∈ A such that, if one sets b(nu_s + 1) = a_1, we obtain F^n(b)(1) = o(1). Proceeding in this way one can complete the configuration b at any position nu_s + i. Finally, since f^n is also permutive in the leftmost variable, corresponding to nu_1, one can use the same technique to complete the configuration b at the positions nu_1 − 1, nu_1 − 2, …
, in such a way that for any integer i < 0, F^n(b)(i) = o(i). The previous result can be generalized as follows. Denote by e_1, e_2, …, e_D the canonical basis of R^D. Theorem 18 Let f and N be the local rule and the neighborhood frame, respectively, of a given D-dimensional CA. If there exists an integer l > 0 such that 1. f is permutive in all the 2^D variables corresponding to the neighbor vectors (±l, …, ±l), and 2. for each vector u ∈ N, we have ‖u‖_∞ ≤ l, then the CA F is strongly transitive.
Proof For the sake of simplicity, we only treat the case D = 2; for higher dimensions the idea of the proof is the same. Let u_2 = (l, l), u_3 = (−l, l), u_4 = (l, −l), u_5 = (−l, −l). Choose arbitrarily two configurations c, o ∈ A^{Z²} and an integer k > 0. Let n > 0 be the first integer such that nl > k. We are going to construct a configuration b ∈ A^{Z²} such that d(b, c) < 2^{−k} and F^n(b) = o. Fix b(x) = c(x) for each x ≠ n·u_2 with ‖x‖_∞ ≤ n. In this way d(b, c) < 2^{−k}. For each i ∈ Z we are going to find suitable values for the configuration b in order to obtain F^n(b)(i e_1) = o(i e_1). Let us start with i = 0. By the hypothesis, the local rule f^n of the CA F^n is permutive in the variable corresponding to n·u_2. Thus, there exists a value a_{(0,0)} ∈ A such that, if one sets b(n·u_2) = a_{(0,0)}, we obtain F^n(b)(0) = o(0). Now, choose arbitrary values of b in the positions (n + 1) e_1 + j e_2 for j = −n, …, n − 1. By the same reasons as above, there exists a value a_{(0,1)} ∈ A such that, if one sets b(n·u_2 + e_1) = a_{(0,1)}, we obtain F^n(b)(e_1) = o(e_1). Proceeding in this way, at each step i (i > 1), one can complete the configuration b at all the positions (n + i) e_1 + j e_2 for j = −n, …, n, obtaining F^n(b)(i e_1) = o(i e_1). In a similar way, by using the fact that the local rule f^n of the CA F^n is permutive in the variable corresponding to n·u_3, for any i < 0 one can complete the configuration b at all the positions (−n + i) e_1 + j e_2 for j = −n, …, n, obtaining F^n(b)(i e_1) = o(i e_1). Now, for each step j = 1, 2, …, choose arbitrarily the values of b in the positions i e_1 + (n + j) e_2 and i e_1 + (−n − j) e_2 with i = −n, …, n − 1.
The permutivity of f^n in the variables corresponding to n·u_2, n·u_3, n·u_5, and n·u_4 permits one to complete the configuration b in the positions (n + i) e_1 + (n + j) e_2 for all integers i ≥ 0, (−n + i) e_1 + (n + j) e_2 for all integers i < 0, (n + i) e_1 + (−n − j) e_2 for all integers i ≥ 0, and (−n + i) e_1 + (−n − j) e_2 for all integers i < 0, so that for each step j we obtain, for all i ∈ Z, F^n(b)(i e_1 + j e_2) = o(i e_1 + j e_2).

CA, Entropy and Decidability

In [34], it is shown that, in the case of CA, the definition of topological entropy can be restated in a simpler form than (1). The space-time diagram S(c) generated by a configuration c of a D-dimensional CA is the (D + 1)-dimensional infinite figure obtained by drawing in sequence the elements of the orbit of initial state c along the temporal axis. Formally, S(c) is a function from N × Z^D to A defined by S(c)(t, v) = F^t(c)(v) for all t ∈ N and v ∈ Z^D. For a given CA, fix a time t and a finite square region of side length k in the lattice. In this way, a finite (D + 1)-dimensional figure (hyper-rectangle) is identified in all space-time diagrams.
Transitivity a linear CA is topologically transitive if and only if gcd(m, λ_2, …, λ_s) = 1.
Mixing a linear CA is topologically mixing if and only if it is topologically transitive.
Strong transitivity a linear CA is strongly transitive if for each prime p dividing m there exist at least two coefficients λ_i, λ_j such that p ∤ λ_i and p ∤ λ_j.
Regularity (DPO) a linear CA has denseness of periodic orbits if it is surjective.
Chaotic Behavior of Cellular Automata, Figure 1 N(k, t) is the number of distinct blue blocks that can be obtained starting from any initial configuration (orange plane)
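These gcd-based characterizations are directly executable. Below is a small sketch (function and variable names are ours; we include the modulus m in each gcd, as in the standard statements of these results):

```python
from math import gcd
from functools import reduce

def prime_factors(n):
    """Set of prime divisors of n."""
    ps, p = set(), 2
    while p * p <= n:
        if n % p == 0:
            ps.add(p)
            while n % p == 0:
                n //= p
        p += 1
    if n > 1:
        ps.add(n)
    return ps

def linear_ca_properties(m, lam):
    """Decide the chaotic properties of a linear CA over Z_m with local-rule
    coefficients lam = [lam_1, ..., lam_s] (lam_1 attached to the null
    neighbor vector), following the formulas quoted in the text."""
    rest = lam[1:]
    sensitive = any(any(l % p != 0 for l in rest) for p in prime_factors(m))
    transitive = reduce(gcd, rest, m) == 1
    strongly_transitive = all(
        sum(1 for l in lam if l % p != 0) >= 2 for p in prime_factors(m))
    surjective = reduce(gcd, lam, m) == 1
    return {"sensitive": sensitive,
            "transitive": transitive,
            "mixing": transitive,            # mixing <=> transitivity here
            "strongly_transitive": strongly_transitive,
            "DPO": surjective}               # regularity <=> surjectivity
```

For instance, over Z_2 with coefficients [1, 1, 1] (an additive rule with λ_1 on the null vector) every listed property holds, while the identity CA (coefficients [1] alone) is surjective, hence has DPO, but is neither sensitive nor transitive.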
Let N(k, t) be the number of distinct finite hyper-rectangles obtained from all possible space-time diagrams of the CA (i.e., N(k, t) is the number of space-time diagrams which are distinct in this finite region). The topological entropy of any given CA can be expressed as

h(C, F) = lim_{k→∞} lim_{t→∞} (log N(k, t)) / t
Although this expression of the entropy is simpler than for a generic DTDS, the following result holds. Theorem 19 ([34]) The topological entropy of CA is uncomputable. Nevertheless, there exist some classes of CA for which it is computable [20,45]. Unfortunately, in most of these cases it is difficult to establish whether a given CA is a member of these classes.

Results for Linear CA: Everything Is Detectable

In the sequel, we assume that a linear CA on Z_m is based on a neighborhood frame N = (u_1, …, u_s) whose corresponding local-rule coefficients are λ_1, …, λ_s. Moreover, without loss of generality, we suppose u_1 = 0. In most formulas the coefficient λ_1 does not appear.
Decidability Results for Chaotic Properties

The next results state that all the chaotic properties introduced above are decidable. Moreover, one can use the formulas to build samples of cellular automata having the required properties.

Theorem 20 ([17,19,47]) Sensitivity a linear CA is sensitive to the initial conditions if and only if there exists a prime number p such that p | m and p ∤ gcd{λ_2, …, λ_s}.

Concerning positive expansivity, since in dimensions greater than one there are no positively expansive CA, the following theorem characterizes positive expansivity for linear CA in dimension one. For this situation we consider a local rule f with expression f(x_{−r}, …, x_r) = [Σ_{i=−r}^{r} a_i x_i] mod m.

Theorem 21 ([47]) A linear one-dimensional CA is positively expansive if and only if gcd{m, a_{−r}, …, a_{−1}} = 1 and gcd{m, a_1, …, a_r} = 1.

Decidability Results for Other Properties

The next result was stated incompletely in [47], since the case of non-sensitive CA without equicontinuity points was not treated, though such CA exist [57].

Theorem 22 Let F be a linear cellular automaton. Then the following properties are equivalent:
1. F is equicontinuous;
2. F has an equicontinuity point;
3. F is not sensitive;
4. for every prime p such that p | m, p divides gcd(λ_2, …, λ_s).

Proof 1) ⟹ 2) and 2) ⟹ 3) are obvious. 3) ⟹ 4) follows by negating the formula for sensitive CA in Theorem 20. Let us prove that 4) ⟹ 1). Suppose that F is a linear CA. We decompose F = G + H by separating the term in λ_1 from the others:

H(x)(v) = λ_1 x(v),   G(x)(v) = [ Σ_{i=2}^{s} λ_i x(v + u_i) ] mod
m. Let m = p_1^{α_1} ⋯ p_l^{α_l} be the decomposition of m into prime factors and let a = l · lcm{α_i}. Condition 4) gives that every p_k divides every λ_j with j ≥ 2, hence m divides any product of a factors Π_{i=1}^{a} λ_{k_i}. Let v be a vector such that for all i, u_i − v has nonnegative coordinates. Classically, local rules of linear CA are represented by D-variable polynomials (this representation, together with the representation of configurations by formal power series, simplifies the computation of images through the iterates of the CA [36]). Let X_1, …, X_D
be the variables. For y = (y_1, …, y_D) ∈ Z^D, we denote by X^y the monomial Π_{i=1}^{D} X_i^{y_i}. We consider the polynomial P associated with the composition of G and the translation by the vector v, P = Σ_{i=2}^{s} λ_i X^{u_i − v}. The coefficients of P^a are products of a factors λ_i, hence [P^a] mod m = 0. This means that the composition of G with the translation by v is nilpotent, and therefore that G is nilpotent. As F is the sum of λ_1 times the identity and a nilpotent CA, we conclude that F is equicontinuous.

The next theorem gives the formulas for some combinatorial properties.

Theorem 23 ([36]) Surjectivity a linear CA is surjective if and only if gcd(m, λ_1, …, λ_s) = 1. Injectivity a linear CA is injective if and only if for each prime p dividing m there exists a unique coefficient λ_i such that p does not divide λ_i.

Computation of Entropy for Linear Cellular Automata

Let us start by considering the one-dimensional case.

Theorem 24 Consider a one-dimensional linear CA, and let m = p_1^{k_1} ⋯ p_h^{k_h} be the prime factor decomposition of m. The topological entropy of the CA is

h(C, F) = Σ_{i=1}^{h} k_i (R_i − L_i) log(p_i)

where L_i = min P_i and R_i = max P_i, with P_i = {0} ∪ {j : gcd(a_j, p_i) = 1}. In [50] it is proved that in dimensions greater than one there are only two possible values for the topological entropy: zero or infinity.

Theorem 25 A D-dimensional linear CA ⟨C, F⟩ with D ≥ 2 is either sensitive with h(C, F) = ∞, or equicontinuous with h(C, F) = 0.

By combining Theorems 25 and 20, it is possible to establish whether a D-dimensional linear CA with D ≥ 2 has zero or infinite entropy.

Linear CA, Fractal Dimension and Chaos

In this section we review the relations between strong transitivity and fractal dimension in the special case of linear CA. The idea is that when a system is chaotic it produces evolutions which are complex even from a (topological) dimension point of view. Any linear CA F can be associated with its W-limit set, a subset of the (D + 1)-dimensional Euclidean space defined
as follows. Let t_n be a sequence of integers (we call them times) which tends to infinity. A subset S_F(t_n) of the (D + 1)-dimensional Euclidean space represents a space-time pattern up to time t_n − 1:

S_F(t_n) = { (t, i) : F^t(e_1)_i ≠ 0, t < t_n }.

A W-limit set for F is defined by lim_{n→∞} S_F(t_n)/t_n if the limit exists, where S_F(t_n)/t_n is the contraction of S_F(t_n) by the rate 1/t_n, i.e., S_F(t_n)/t_n contains the point (t/t_n, i/t_n) if and only if S_F(t_n) contains the point (t, i). The limit lim_{n→∞} S_F(t_n)/t_n exists when lim inf_{n→∞} S_F(t_n)/t_n and lim sup_{n→∞} S_F(t_n)/t_n coincide, where

lim inf_{n→∞} S_F(t_n)/t_n = { x ∈ R^{D+1} : ∀ j, ∃ x_j ∈ S_F(t_j)/t_j, x_j → x when j → ∞ }

and
lim sup_{n→∞} S_F(t_n)/t_n = { x ∈ R^{D+1} : ∃ {t_{n_j}}, ∀ j, ∃ x_{n_j} ∈ S_F(t_{n_j})/t_{n_j}, x_{n_j} → x when j → ∞ }
for a subsequence {t_{n_j}} of {t_n}. For the particular case of linear CA, the W-limit set always exists [30,31,32,56]. In the last ten years, the W-limit set of additive CA has been extensively studied [61,62,63]. It has been proved that for most additive CA it has interesting dimensional properties which completely characterize the set of quiescent configurations [56]. Here we link dimension properties of the W-limit set with chaotic properties. Correlating dimensional properties of invariant sets with dynamical properties has become over the years a fruitful source of new understanding [52]. Let X be a metric space. The Hausdorff dimension D_H of V ⊆ X is defined as:
D_H(V) = sup { h ∈ R | lim_{ε→0} inf Σ_i |U_i|^h = ∞ }
where the infimum is taken over all countable coverings {U_i} of V such that the diameter |U_i| of each U_i is less than ε (for more on the Hausdorff dimension, as well as other definitions of fractal dimension, see [24]). Given a CA F, we denote by D_H(F) the Hausdorff dimension of its W-limit set. Proposition 7 ([25,47]) Consider a linear CA F over Z_{p^k}, where p is a prime number. If 1 < D_H(F) < 2 then F is strongly transitive.
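Both Theorem 24 and the dimension range of Proposition 7 can be probed numerically on the elementary rule 90, the linear CA over Z_2 with coefficients a_{−1} = a_{+1} = 1. The sketch below (helper names are ours; a numerical illustration, not a proof) computes its entropy via Theorem 24 and estimates the fractal dimension of its rescaled space-time pattern, which is the Sierpiński gasket of dimension log 3 / log 2 ≈ 1.585, inside the interval (1, 2).

```python
from math import gcd, log

def linear_ca_entropy(m, coeffs):
    """Topological entropy of a 1-D linear CA over Z_m via Theorem 24.
    `coeffs` maps each neighborhood offset j to the coefficient a_j."""
    h, n, p, fact = 0.0, m, 2, {}
    while p * p <= n:                      # prime factorization of m
        while n % p == 0:
            fact[p] = fact.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        fact[n] = fact.get(n, 0) + 1
    for p, k in fact.items():
        P = {0} | {j for j, aj in coeffs.items() if gcd(aj, p) == 1}
        h += k * (max(P) - min(P)) * log(p)
    return h

def rule90_cells(steps):
    """Count the nonzero cells in the space-time pattern of rule 90 started
    from a single 1 -- a crude box-counting probe of the W-limit set."""
    pos, count = {0}, 0
    for _ in range(steps):
        count += len(pos)
        pos = {p - 1 for p in pos} ^ {p + 1 for p in pos}  # xor = sum mod 2
    return count

# rule 90: coefficients a_{-1} = a_{+1} = 1 over Z_2
print(linear_ca_entropy(2, {-1: 1, 1: 1}))   # 2*log(2): two bits per step
t = 2 ** 10
print(log(rule90_cells(t)) / log(t))         # log(3)/log(2), about 1.585
```

The dimension estimate is exact at powers of two because the first 2^k rows of Pascal's triangle mod 2 contain exactly 3^k nonzero entries.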
The converse relation is still an open problem. It would also be an interesting research direction to find similar notions and results for general CA. Conjecture 6 Consider a linear CA F over Z_{p^k}, where p is a prime number. If F is strongly transitive then 1 < D_H(F) < 2.

Future Directions

In this chapter we reviewed the chaotic behavior of cellular automata. It is clear from the results seen so far that there are close similarities between the chaotic behavior of dynamical systems on the real interval and that of CA. To complete the picture, it remains only to prove (or disprove) Conjecture 5. Due to its apparent difficulty, this problem promises to keep researchers occupied for some years yet. The study of the decidability of chaotic properties like expansivity, transitivity, mixing, etc. is another research direction which should be further addressed in the near future. It seems that new ideas are necessary, since the proof techniques used up to now have proved unsuccessful. The solution of these problems will be a source of new understanding and will certainly produce new results in connected fields. Finally, remark that most of the results on the chaotic behavior of CA concern dimension one. A lot of work remains to be done to verify what happens in higher dimensions.

Acknowledgments

This work has been supported by the Interlink/MIUR project "Cellular Automata: Topological Properties, Chaos and Associated Formal Languages", by the ANR Blanc Project "Sycomore" and by the PRIN/MIUR project "Formal Languages and Automata: Mathematical and Applicative Aspects".

Bibliography

Primary Literature
1. Acerbi L, Dennunzio A, Formenti E (2007) Shifting and lifting of cellular automata. In: Third Conference on Computability in Europe, CiE 2007, Siena, Italy, 18–23 June 2007. Lecture Notes in Computer Science, vol 4497. Springer, Berlin, pp 1–10
2. Adler R, Konheim A, McAndrew J (1965) Topological entropy. Trans Amer Math Soc 114:309–319
3.
Akin E, Auslander E, Berg K (1996) When is a transitive map chaotic? In: Bergelson V, March P, Rosenblatt J (eds) Convergence in Ergodic Theory and Probability. de Gruyter, Berlin, pp 25–40
4. Amoroso S, Patt YN (1972) Decision procedures for surjectivity and injectivity of parallel maps for tessellation structures. J Comp Syst Sci 6:448–464
5. Auslander J, Yorke JA (1980) Interval maps, factors of maps and chaos. Tohoku Math J 32:177–188
6. Banks J, Brooks J, Cairns G, Davis G, Stacey P (1992) On Devaney's definition of chaos. Am Math Mon 99:332–334
7. Blanchard F, Maass A (1997) Dynamical properties of expansive one-sided cellular automata. Israel J Math 99:149–174
8. Blanchard F, Tisseur P (2000) Some properties of cellular automata with equicontinuity points. Ann Inst Henri Poincaré, Probabilité et Statistiques 36:569–582
9. Blanchard F, Kůrka P, Maass A (1997) Topological and measure-theoretic properties of one-dimensional cellular automata. Physica D 103:86–99
10. Blanchard F, Formenti E, Kůrka P (1998) Cellular automata in the Cantor, Besicovitch and Weyl topological spaces. Complex Syst 11:107–123
11. Blanchard F, Cervelle J, Formenti E (2005) Some results about chaotic behavior of cellular automata. Theor Comp Sci 349:318–336
12. Blanchard F, Glasner E, Kolyada S, Maass A (2002) On Li–Yorke pairs. J Reine Angewandte Math 547:51–68
13. Boyle M, Kitchens B (1999) Periodic points for cellular automata. Indag Math 10:483–493
14. Boyle M, Maass A (2000) Expansive invertible one-sided cellular automata. J Math Soc Jpn 54(4):725–740
15. Cattaneo G, Formenti E, Margara L, Mazoyer J (1997) A shift-invariant metric on S^Z inducing a non-trivial topology. In: Mathematical Foundations of Computer Science 1997. Lecture Notes in Computer Science, vol 1295. Springer, Berlin, pp 179–188
16. Cattaneo G, Finelli M, Margara L (2000) Investigating topological chaos by elementary cellular automata dynamics. Theor Comp Sci 244:219–241
17. Cattaneo G, Formenti E, Manzini G, Margara L (2000) Ergodicity, transitivity, and regularity for linear cellular automata. Theor Comp Sci 233:147–164.
A preliminary version of this paper was presented at the Symposium on Theoretical Aspects of Computer Science (STACS'97). LNCS, vol 1200
18. Cattaneo G, Dennunzio A, Margara L (2002) Chaotic subshifts and related languages. Applications to one-dimensional cellular automata. Fundam Inform 52:39–80
19. Cattaneo G, Dennunzio A, Margara L (2004) Solution of some conjectures about topological properties of linear cellular automata. Theor Comp Sci 325:249–271
20. D'Amico M, Manzini G, Margara L (2003) On computing the entropy of cellular automata. Theor Comp Sci 290:1629–1646
21. Denker M, Grillenberger C, Sigmund K (1976) Ergodic Theory on Compact Spaces. Lecture Notes in Mathematics, vol 527. Springer, Berlin
22. Devaney RL (1989) An Introduction to Chaotic Dynamical Systems, 2nd edn. Addison-Wesley, Reading
23. Durand B, Formenti E, Varouchas G (2003) On undecidability of equicontinuity classification for cellular automata. Discrete Mathematics and Theoretical Computer Science, vol AB, pp 117–128
24. Edgar GA (1990) Measure, Topology and Fractal Geometry. Undergraduate Texts in Mathematics. Springer, New York
25. Formenti E (2003) On the sensitivity of additive cellular automata in Besicovitch topologies. Theor Comp Sci 301(1–3):341–354
26. Formenti E, Grange A (2003) Number conserving cellular automata II: dynamics. Theor Comp Sci 304(1–3):269–290
27. Furstenberg H (1967) Disjointness in ergodic theory, minimal sets, and a problem in diophantine approximation. Math Syst Theor (now Theor Comp Syst) 1(1):1–49
28. Glasner E, Weiss B (1993) Sensitive dependence on initial conditions. Nonlinearity 6:1067–1075
29. Guckenheimer J (1979) Sensitive dependence to initial conditions for one-dimensional maps. Commun Math Phys 70:133–160
30. Haeseler FV, Peitgen HO, Skordev G (1992) Linear cellular automata, substitutions, hierarchical iterated systems. In: Fractal Geometry and Computer Graphics. Springer, Berlin
31. Haeseler FV, Peitgen HO, Skordev G (1993) Multifractal decompositions of rescaled evolution sets of equivariant cellular automata: selected examples. Technical report, Institut für Dynamische Systeme, Universität Bremen
32. Haeseler FV, Peitgen HO, Skordev G (1995) Global analysis of self-similarity features of cellular automata: selected examples. Physica D 86:64–80
33. Hedlund GA (1969) Endomorphisms and automorphisms of the shift dynamical system. Math Syst Theor 3:320–375
34. Hurd LP, Kari J, Culik K (1992) The topological entropy of cellular automata is uncomputable. Ergod Theor Dyn Syst 12:255–265
35. Hurley M (1990) Ergodic aspects of cellular automata. Ergod Theor Dyn Syst 10:671–685
36. Ito M, Osato N, Nasu M (1983) Linear cellular automata over Z_m. J Comp Syst Sci 27:127–140
37. Assaf D IV, Gadbois S (1992) Definition of chaos. Am Math Mon 99:865
38. Kannan V, Nagar A (2002) Topological transitivity for discrete dynamical systems. In: Misra JC (ed) Applicable Mathematics in the Golden Age. Narosa, New Delhi
39. Kari J (1994) Reversibility and surjectivity problems of cellular automata. J Comp Syst Sci 48:149–182
40. Kari J (1994) Rice's theorem for the limit set of cellular automata. Theor Comp Sci 127(2):229–254
41. Knudsen C (1994) Chaos without nonperiodicity. Am Math Mon 101:563–565
42.
Kolyada S, Snoha L (1997) Some aspects of topological transitivity – a survey. Grazer Mathematische Berichte 334:3–35
43. Kůrka P (1997) Languages, equicontinuity and attractors in cellular automata. Ergod Theor Dyn Syst 17:417–433
44. Kůrka P (2004) Topological and Symbolic Dynamics, vol 11 of Cours Spécialisés. Société Mathématique de France, Paris
45. Di Lena P (2006) Decidable properties for regular cellular automata. In: Navarro G, Bertossi L, Kohayakawa Y (eds) Proceedings of the Fourth IFIP International Conference on Theoretical Computer Science, pp 185–196. Springer, Santiago de Chile
46. Li TY, Yorke JA (1975) Period three implies chaos. Am Math Mon 82:985–992
47. Manzini G, Margara L (1999) A complete and efficiently computable topological classification of D-dimensional linear cellular automata over Z_m. Theor Comp Sci 221(1–2):157–177
48. Margara L (1999) On some topological properties of linear cellular automata. In: Kutylowski M, Pacholski L, Wierzbicki T (eds) Mathematical Foundations of Computer Science 1999
(MFCS99). Lecture Notes in Computer Science, vol 1672. Springer, Berlin, pp 209–219
49. Moothathu TKS (2005) Homogeneity of surjective cellular automata. Discret Contin Dyn Syst 13:195–202
50. Morris G, Ward T (1998) Entropy bounds for endomorphisms commuting with Z^k actions. Israel J Math 106:1–12
51. Nasu M (1995) Textile Systems for Endomorphisms and Automorphisms of the Shift, vol 114 of Memoirs of the American Mathematical Society. American Mathematical Society, Providence
52. Pesin YK (1997) Dimension Theory in Dynamical Systems. Chicago Lectures in Mathematics. The University of Chicago Press, Chicago
53. Shereshevsky MA (1993) Expansiveness, entropy and polynomial growth for groups acting on subshifts by automorphisms. Indag Math 4:203–210
54. Shereshevsky MA, Afraimovich VS (1993) Bipermutative cellular automata are topologically conjugate to the one-sided Bernoulli shift. Random Comput Dynam 1(1):91–98
55. Sutner K (1999) Linear cellular automata and de Bruijn automata. In: Delorme M, Mazoyer J (eds) Cellular Automata, a Parallel Model, number 460 in Mathematics and Its Applications. Kluwer, Dordrecht
56. Takahashi S (1992) Self-similarity of linear cellular automata. J Comp Syst Sci 44:114–140
57. Theyssier G (2007) Personal communication
58. Vellekoop M, Berglund R (1994) On intervals, transitivity = chaos. Am Math Mon 101:353–355
59. Walters P (1982) An Introduction to Ergodic Theory. Springer, Berlin
60. Weiss B (1971) Topological transitivity and ergodic measures. Math Syst Theor 5:71–75
61. Willson S (1984) Growth rates and fractional dimensions in cellular automata. Physica D 10:69–74
62. Willson S (1987) Computing fractal dimensions for additive cellular automata. Physica D 24:190–206
63. Willson S (1987) The equality of fractional dimensions for certain cellular automata. Physica D 24:179–189
64. Wolfram S (1986) Theory and Applications of Cellular Automata. World Scientific, Singapore
Books and Reviews
Akin E (1993) The General Topology of Dynamical Systems. Graduate Studies in Mathematics, vol 1. Am Math Soc, Providence
Akin E, Kolyada S (2003) Li–Yorke sensitivity. Nonlinearity 16:1421–1433
Block LS, Coppel WA (1992) Dynamics in One Dimension. Springer, Berlin
Katok A, Hasselblatt B (1995) Introduction to the Modern Theory of Dynamical Systems. Cambridge University Press, Cambridge
Kitchens PB (1997) Symbolic Dynamics: One-Sided, Two-Sided and Countable State Markov Shifts. Universitext. Springer, Berlin
Kolyada SF (2004) Li–Yorke sensitivity and other concepts of chaos. Ukr Math J 56(8):1242–1257
Lind D, Marcus B (1995) An Introduction to Symbolic Dynamics and Coding. Cambridge University Press, Cambridge
Community Structure in Graphs SANTO FORTUNATO1, CLAUDIO CASTELLANO2 1 Complex Networks Lagrange Laboratory (CNLL), ISI Foundation, Torino, Italy 2 SMC, INFM-CNR and Dipartimento di Fisica, "Sapienza" Università di Roma, Roma, Italy

Article Outline

Glossary
Definition of the Subject
Introduction
Elements of Community Detection
Computer Science: Graph Partitioning
Social Science: Hierarchical and k-Means Clustering
New Methods
Testing Methods
The Mesoscopic Description of a Graph
Future Directions
Bibliography

Glossary

Graph A graph is a set of elements, called vertices or nodes, where pairs of vertices are connected by relational links, or edges. A graph can be considered the simplest representation of a complex system, where the vertices are the elementary units of the system and the edges represent their mutual interactions.
Community A community is a group of graph vertices that "belong together" according to some precisely defined and measurable criteria. Many definitions have been proposed. A common approach is to define a community as a group of vertices such that the density of edges between vertices of the group is higher than the average edge density in the graph. In the text the terms module and cluster are also used to refer to a community.
Partition A partition is a split of a graph into subsets, with each vertex assigned to only one of them. This last condition may be relaxed to include the case of overlapping communities, by imposing that each vertex is assigned to at least one subset.
Dendrogram A dendrogram, or hierarchical tree, is a branching diagram representing successive divisions of a graph into communities. Dendrograms are frequently used in social network analysis and computational biology, especially in biological taxonomy.
Scalability Scalability expresses the computational complexity of an algorithm. If the running time of a community detection algorithm, working on a graph with n vertices and m edges, is proportional to the product n^α m^β, one says that the algorithm scales as O(n^α m^β). Knowing the scalability allows one to estimate the range of applicability of an algorithm.

Definition of the Subject

Graph vertices are often organized into groups that seem to live fairly independently of the rest of the graph, with which they share but a few edges, whereas the relationships between group members are stronger, as shown by the large number of mutual connections. Such groups of vertices, or communities, can be considered as independent compartments of a graph. Detecting communities is of great importance in sociology, biology and computer science, disciplines where systems are often represented as graphs. The task is very hard, though, both conceptually, due to the ambiguity in the definition of community and in the discrimination of different partitions, and practically, because algorithms must find "good" partitions among an exponentially large number of them. Other complications are represented by the possible occurrence of hierarchies, i.e. communities nested inside larger communities, and by the existence of overlaps between communities, due to the presence of nodes belonging to several groups. All these aspects are dealt with in some detail, and many methods are described, from traditional approaches used in computer science and sociology to recent techniques developed mostly within statistical physics.

Introduction

The origin of graph theory dates back to Euler's solution [1] of the puzzle of Königsberg's bridges in 1736. Since then a lot has been learned about graphs and their mathematical properties [2]. In the 20th century they have also become extremely useful as representations of a wide variety of systems in different areas.
Biological, social, technological, and information networks can be studied as graphs, and graph analysis has become crucial to understand the features of these systems. For instance, social network analysis started in the 1930s and has become one of the most important topics in sociology [3,4]. In recent times, the computer revolution has provided scholars with a huge amount of data and computational resources to process and analyze these data. The size of real networks one can potentially handle has also grown considerably, reaching millions or even billions of vertices. The need to deal with such a large number of units has produced a deep change in the way graphs are approached [5,6,7,8,9].
Community Structure in Graphs, Figure 1 A simple graph with three communities, highlighted by the dashed circles
Real networks are not random graphs. The random graph, introduced by P. Erdős and A. Rényi [10], is the paradigm of a disordered graph: in it, the probability of having an edge between a pair of vertices is the same for all possible pairs. In a random graph the distribution of edges among the vertices is highly homogeneous. For instance, the distribution of the number of neighbors of a vertex, or degree, is binomial, so most vertices have equal or similar degree. In many real networks, by contrast, there are big inhomogeneities, revealing a high level of order and organization. The degree distribution is broad, with a tail that often follows a power law: therefore, many vertices with low degree coexist with some vertices with large degree. Furthermore, the distribution of edges is not only globally but also locally inhomogeneous, with high concentrations of edges within special groups of nodes and low concentrations between these groups. This feature of real networks is called community structure and is the topic of this chapter. In Fig. 1 a schematic example of a graph with community structure is shown. Communities are groups of vertices which probably share common properties and/or play similar roles within the graph. So communities may correspond to groups of pages of the World Wide Web dealing with related topics [11], to functional modules such as cycles and pathways in metabolic networks [12,13], to groups of related individuals in social networks [14,15], to compartments in food webs [16,17], and so on.
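The contrast drawn here can be simulated directly: in an Erdős–Rényi random graph G(n, p) each of the n − 1 possible edges at a vertex is present independently with probability p, so the degree of a vertex is Binomial(n − 1, p) and concentrates around its mean, with no heavy tail. A short sketch (function names are ours):

```python
import random

def er_degrees(n, p, seed=0):
    """Degree sequence of one sample of the Erdos-Renyi random graph G(n, p):
    every unordered pair of vertices is linked independently with probability p."""
    rng = random.Random(seed)
    deg = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                deg[i] += 1
                deg[j] += 1
    return deg

degs = er_degrees(1000, 0.01)
mean = sum(degs) / len(degs)      # close to (n - 1) * p, here about 10
spread = max(degs) - min(degs)    # narrow spread: no high-degree hubs
```

A power-law (broad) degree distribution, as found in many real networks, would instead show a maximum degree far above the mean.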
Community detection is important for other reasons, too. Identifying modules and their boundaries allows for a classification of vertices, according to their topological position in the modules. So, vertices with a central position in their clusters, i. e. sharing a large number of edges with the other group partners, may have an important function of control and stability within the group; vertices lying at the boundaries between modules play an important role of mediation and lead the relationships and exchanges between different communities. Such classification seems to be meaningful in social [18,19,20] and metabolic networks [12]. Finally, one can study the graph where vertices are the communities and edges are set between modules if there are connections between some of their vertices in the original graph and/or if the modules overlap. In this way one attains a coarse-grained description of the original graph, which unveils the relationships between modules. Recent studies indicate that networks of communities have a different degree distribution with respect to the full graphs [13]; however, the origin of their structures can be explained by the same mechanism [21]. The aim of community detection in graphs is to identify the modules only based on the topology. The problem has a long tradition and it has appeared in various forms in several disciplines. For instance, in parallel computing it is crucial to know what is the best way to allocate tasks to processors so as to minimize the communications between them and enable a rapid performance of the calculation. This can be accomplished by splitting the computer cluster into groups with roughly the same number of processors, such that the number of physical connections between processors of different groups is minimal. The mathematical formalization of this problem is called graph partitioning. The first algorithms for graph partitioning were proposed in the early 1970s. 
Clustering analysis is also an important aspect in the study of social networks. The most popular techniques are hierarchical clustering and k-means clustering, where vertices are joined into groups according to their mutual similarity. In a seminal paper, Girvan and Newman proposed a new algorithm, aiming at the identification of edges lying between communities and their successive removal, a procedure that after a few iterations leads to the isolation of modules [14]. The intercommunity edges are detected according to the values of a centrality measure, the edge betweenness, which expresses the importance of the role of the edges in processes where signals are transmitted across the graph following paths of minimal length. The paper triggered intense activity in the field, and many new methods have been proposed in recent years. In particular, physicists entered the game, bringing in their tools and techniques: spin models, optimization, percolation, random walks, synchronization, etc., became ingredients of new original algorithms. Earlier reviews of the topic can be found in [22,23]. Section “Elements of Community Detection” is about the basic elements of community detection, starting from the definition of community. The classical problem of graph partitioning and the methods for clustering analysis in sociology are presented in Sect. “Computer Science: Graph Partitioning” and “Social Science: Hierarchical and k-Means Clustering”, respectively. Section “New Methods” is devoted to a description of the new methods. In Sect. “Testing Methods” the problem of testing algorithms is discussed. Section “The Mesoscopic Description of a Graph” introduces the description of graphs at the level of communities. Finally, Sect. “Future Directions” highlights the perspectives of the field and sorts out promising research directions for the future. This chapter makes use of some basic concepts of graph theory, which can be found in any introductory textbook, like [2]. Some of them are briefly explained in the text.

Elements of Community Detection

The problem of community detection is, at first sight, intuitively clear. However, when one needs to formalize it in detail, things are not so well defined. In the intuitive concept some ambiguities are hidden, and there are often many equally legitimate ways of resolving them. Hence the term “community detection” actually indicates several rather different problems. First of all, there is no unique way of translating the intuitive idea of community into a precise prescription. Many possibilities exist, as discussed below. Some of these possible definitions allow for vertices to belong to more than one community. It is then possible to look for overlapping or nonoverlapping communities. Another ambiguity has to do with the concept of community structure.
It may be intended as a single partition of the graph or as a hierarchy of partitions, at different levels of coarse-graining. There is then a problem of comparison. Which one is the best partition (or the best hierarchy)? If one could, in principle, analyze all possible partitions of a graph, one would need a sensible way of measuring their “quality” to single out the partitions with the strongest community structure. It may even occur that one graph has no community structure and one should be able to realize it. Finding a good method for comparing partitions is not a trivial task and different choices are possible. Last but not least,
the number of possible partitions grows faster than exponentially with the graph size, so that, in practice, it is not possible to analyze them all. Therefore one has to devise smart methods to find ‘good’ partitions in a reasonable time. Again, a very hard problem. Before introducing the basic concepts and discussing the relevant questions, it is important to stress that the identification of topological clusters is possible only if the graphs are sparse, i. e. if the number of edges m is of the order of the number of nodes n of the graph. If m ≫ n, the distribution of edges among the nodes is too homogeneous for communities to make sense.

Definition of Community

The first and foremost problem is how to define precisely what a community is. The intuitive notion presented in the Introduction is related to the comparison of the number of edges joining vertices within a module (“intracommunity edges”) with the number of edges joining vertices of different modules (“intercommunity edges”). A module is characterized by a larger density of links “inside” than “outside”. This notion can however be formalized in many ways. Social network analysts have devised many definitions of subgroups with various degrees of internal cohesion among vertices [3,4]. Many other definitions have been introduced by computer scientists and physicists. In general, the definitions can be classified in three main categories. Local definitions. Here the attention is focused on the vertices of the subgraph under investigation and on its immediate neighborhood, disregarding the rest of the graph. These prescriptions come mostly from social network analysis and can be further subdivided into self-referring, when one considers the subgraph alone, and comparative, when the mutual cohesion of the vertices of the subgraph is compared with their cohesion with the external neighbors. Self-referring definitions identify classes of subgraphs like cliques, n-cliques, k-plexes, etc.
They are maximal subgraphs, which cannot be enlarged with the addition of new vertices and edges without losing the property which defines them. The concept of clique is very important and often recurring when one studies graphs. A clique is a maximal subgraph where each vertex is adjacent to all the others. In the literature it is common to call cliques also non-maximal subgraphs. Triangles are the simplest cliques, and are frequent in real networks. Larger cliques are rare, so they are not good models of communities. Besides, finding cliques is computationally very demanding: the Bron–Kerbosch method [24] runs in a time growing exponentially with the size of the graph. The definition of clique is very strict. A softer constraint is represented by the concept of n-clique, which is a maximal subgraph such that the distance of each pair of its vertices is not larger than n. A k-plex is a maximal subgraph such that each vertex is adjacent to all the others except at most k of them. In contrast, a k-core is a maximal subgraph where each vertex is adjacent to at least k vertices within the subgraph. Comparative definitions include that of LS set, or strong community, and that of weak community. An LS set is a subgraph where each node has more neighbors inside than outside the subgraph. Instead, in a weak community, the total degree of the nodes inside the community exceeds the external total degree, i. e. the number of links lying between the community and the rest of the graph. LS sets are also weak communities, but the inverse is not true, in general. The notion of weak community was introduced by Radicchi et al. [25]. Global definitions. Communities are structural units of the graph, so it is reasonable to think that their distinctive features can be recognized if one analyzes a subgraph with respect to the graph as a whole. Global definitions usually start from a null model, i. e. a graph which matches the original in some of its topological features, but which does not display community structure. After that, the linking properties of subgraphs of the initial graph are compared with those of the corresponding subgraphs in the null model. The simplest way to design a null model is to introduce randomness in the distribution of edges among the vertices. A random graph à la Erdős–Rényi, for instance, has no community structure, as any two vertices have the same probability to be adjacent, so there is no preferential linking involving special groups of vertices.
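The comparative definitions above (LS or strong community, weak community) translate directly into code. Below is a minimal Python sketch; the dict-of-sets graph representation and the toy graph are illustrative assumptions, not from the text:

```python
# Check the "strong" (LS-set) and "weak" community conditions for a
# candidate group of vertices C in a graph given as a dict mapping
# each vertex to its set of neighbors.

def is_strong_community(adj, C):
    # Every vertex must have more neighbors inside C than outside it.
    C = set(C)
    return all(len(adj[v] & C) > len(adj[v] - C) for v in C)

def is_weak_community(adj, C):
    # The total internal degree must exceed the total external degree.
    C = set(C)
    internal = sum(len(adj[v] & C) for v in C)
    external = sum(len(adj[v] - C) for v in C)
    return internal > external

# Toy graph: a triangle {0, 1, 2} weakly attached to vertex 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(is_strong_community(adj, {0, 1, 2}))  # True
print(is_weak_community(adj, {0, 1, 2}))    # True
```

Consistently with the text, every group satisfying the strong condition also satisfies the weak one, while the converse need not hold.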
The most popular null model is that proposed by Newman and Girvan and consists of a randomized version of the original graph, where edges are rewired at random, under the constraint that each vertex keeps its degree [26]. This null model is the basic concept behind the definition of modularity, a function which evaluates the goodness of partitions of a graph into modules (see Sect. “Evaluating Partitions: Quality Functions”). Here a subset of vertices is a community if the number of edges inside the subset exceeds the expected number of internal edges that the subset would have in the null model. A more general definition, where one counts small connected subgraphs (motifs), and not necessarily edges, can be found in [27]. A general class of null models, including that of modularity, has been designed by Reichardt and Bornholdt [28].
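The degree-preserving null model just described can be sampled by repeated double-edge swaps. A minimal sketch follows; the swap count and the toy edge list are illustrative assumptions:

```python
import random
from collections import Counter

def degree_preserving_rewire(edges, n_swaps, seed=0):
    """Randomize a graph by double-edge swaps: pick edges (a,b) and (c,d)
    and replace them with (a,d) and (c,b), rejecting swaps that would
    create self-loops or multi-edges. Every vertex keeps its degree."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = set(frozenset(e) for e in edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if len({a, b, c, d}) < 4:
            continue  # swap would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # swap would create a multi-edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

def degrees(edge_list):
    c = Counter()
    for a, b in edge_list:
        c[a] += 1
        c[b] += 1
    return c

orig = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
rewired = degree_preserving_rewire(orig, 100)
print(degrees(orig) == degrees(rewired))  # True: degrees are preserved
```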
Definitions based on vertex similarity. In this last category, communities are groups of vertices which are similar to each other. A quantitative criterion is chosen to evaluate the similarity between each pair of vertices, connected or not. The criterion may be local or global: for instance, one can estimate the distance between a pair of vertices. Similarities can also be extracted from the eigenvector components of special matrices, which are usually close in value for vertices belonging to the same community. Similarity measures are at the basis of the method of hierarchical clustering, to be discussed in Sect. “Social Science: Hierarchical and k-Means Clustering”. The main problem in this case is the need to introduce an additional criterion to select meaningful partitions. It is worth remarking that, in spite of the wide variety of definitions, in many detection algorithms communities are not defined at all, but are a byproduct of the procedure. This is the case of the divisive algorithms described in Sect. “Divisive Algorithms” and of the dynamic algorithms of Sect. “Dynamic Algorithms”.

Evaluating Partitions: Quality Functions

Strictly speaking, a partition of a graph in communities is a split of the graph into clusters, with each vertex assigned to only one cluster. The latter condition may be relaxed, as shown in Sect. “Overlapping Communities”. Whatever the definition of community is, there is usually a large number of possible partitions. It is then necessary to establish which partitions exhibit a real community structure. For that, one needs a quality function, i. e. a quantitative criterion to evaluate how good a partition is. The most popular quality function is the modularity of Newman and Girvan [26]. It can be written in several ways, as

Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(C_i, C_j) ,    (1)
where the sum runs over all pairs of vertices, A is the adjacency matrix, k_i the degree of vertex i and m the total number of edges of the graph. The element A_{ij} of the adjacency matrix is 1 if vertices i and j are connected, otherwise it is 0. The δ-function yields one if vertices i and j are in the same community, zero otherwise. Because of that, the only contributions to the sum come from vertex pairs belonging to the same cluster: by grouping them together, the sum over the vertex pairs can be rewritten as a sum over the modules

Q = \sum_{s=1}^{n_m} \left[ \frac{l_s}{m} - \left( \frac{d_s}{2m} \right)^2 \right] .    (2)
Here, n_m is the number of modules, l_s the total number of edges joining vertices of module s and d_s the sum of the degrees of the vertices of s. In Eq. (2), the first term of each summand is the fraction of edges of the graph inside the module, whereas the second term represents the expected fraction of edges that would be there if the graph were a random graph with the same degree for each vertex. In such a case, a vertex could be attached to any other vertex of the graph, and the probability of a connection between two vertices is proportional to the product of their degrees. So, for a vertex pair, the comparison between real and expected edges is expressed by the corresponding summand of Eq. (1). Equation (2) embeds an implicit definition of community: a subgraph is a module if the number of edges inside it is larger than the expected number in modularity's null model. If this is the case, the vertices of the subgraph are more tightly connected than expected. Basically, if each summand in Eq. (2) is non-negative, the corresponding subgraph is a module. Besides, the larger the difference between real and expected edges, the more “modular” the subgraph. So, large positive values of Q are expected to indicate good partitions. The modularity of the whole graph, taken as a single community, is zero, as the two terms of the only summand in this case are equal and opposite. Modularity is always smaller than one, and can be negative as well. For instance, the modularity of the partition in which each vertex is its own community is always negative. This is a nice feature of the measure, implying that, if there are no partitions with positive modularity, the graph has no community structure. On the contrary, the existence of partitions with large negative modularity values may hint at the existence of subgroups with very few internal edges and many edges lying between them (multipartite structure).
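Equation (2) translates directly into code. A minimal sketch computing Q module by module; the toy graph of two triangles joined by an edge is an illustrative assumption:

```python
def modularity(adj, partition):
    """Modularity Q of a partition, computed module by module as in
    Eq. (2): Q = sum_s [ l_s/m - (d_s/(2m))^2 ].
    adj: dict vertex -> set of neighbors; partition: list of vertex sets."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    Q = 0.0
    for module in partition:
        module = set(module)
        l_s = sum(len(adj[v] & module) for v in module) / 2  # internal edges
        d_s = sum(len(adj[v]) for v in module)               # total degree
        Q += l_s / m - (d_s / (2 * m)) ** 2
    return Q

# Two triangles joined by a single edge (vertices 0-2 and 3-5).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(modularity(adj, [{0, 1, 2}, {3, 4, 5}]))  # 5/14 ≈ 0.357
print(modularity(adj, [{0, 1, 2, 3, 4, 5}]))    # 0.0: whole graph, one community
```

As stated in the text, the partition into the two natural modules gets a clearly positive Q, while the whole graph as a single community scores exactly zero.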
Modularity has been employed as a quality function in many algorithms, like some of the divisive algorithms of Sect. “Divisive Algorithms”. In addition, modularity optimization is itself a popular method for community detection (see Sect. “Modularity Optimization”). Modularity also makes it possible to assess the stability of partitions [29] and to transform a graph into a smaller one while preserving its community structure [30]. However, there are some caveats on the use of the measure. The most important concerns the value of modularity for a partition. For which values can one say that there is a clear community structure in a graph? The question is tricky: if two graphs have the same type of modular structure, but different sizes, modularity will be larger for the larger graph. So, modularity values cannot be compared for different graphs. Moreover, one would expect that partitions of random graphs will have modularity values close
to zero, as no community structure is expected there. Instead, it has been shown that partitions of random graphs may attain fairly large modularity values, as the probability that the distribution of edges on the vertices is locally inhomogeneous in specific realizations is not negligible [31]. Finally, a recent analysis has proved that modularity increases if subgraphs smaller than a characteristic size are merged [32]. This fact represents a serious bias when one looks for communities via modularity optimization and is discussed in more detail in Sect. “Modularity Optimization”.

Hierarchies

Graph vertices can have various levels of organization. Modules can display an internal community structure, i. e. they can contain smaller modules, which can in turn include other modules, and so on. In this case one says that the graph is hierarchical (see Fig. 2). For a clear classification of the vertices and their roles inside a graph, it is important to find all modules of the graph as well as their hierarchy. A natural way to represent the hierarchical structure of a graph is to draw a dendrogram, like the one illustrated in Fig. 3. Here, partitions of a graph with twelve vertices are shown. At the bottom, each vertex is its own module. By moving upwards, groups of vertices are successively aggregated. Merges of communities are represented by horizontal lines. The uppermost level represents the whole graph as a single community. Cutting the diagram horizontally at some height, as shown in the figure (dashed
Community Structure in Graphs, Figure 2 Schematic example of a hierarchical graph. Sixteen modules with four vertices each are clearly organized in groups of four
Community Structure in Graphs, Figure 3 A dendrogram, or hierarchical tree. Horizontal cuts correspond to partitions of the graph in communities. Reprinted figure with permission from [26]
line), displays one level of organization of the graph vertices. The diagram is hierarchical by construction: each community belonging to a level is fully included in a community at a higher level. Dendrograms are regularly used in sociology and biology. The technique of hierarchical clustering, described in Sect. “Social Science: Hierarchical and k-Means Clustering”, lends itself naturally to this kind of representation.
Community Structure in Graphs, Figure 4 Overlapping communities. In the partition highlighted by the dashed contours, some vertices are shared between more groups
Overlapping Communities

As stated in Sect. “Evaluating Partitions: Quality Functions”, in a partition each vertex is generally attributed to only one module. However, vertices lying at the boundary between modules may be difficult to assign to one module or another, based on their connections with the other vertices. In this case, it makes sense to consider such intermediate vertices as belonging to several groups, which are then called overlapping communities (Fig. 4). Many real networks are characterized by a modular structure with sizeable overlaps between different clusters. In social networks, people usually belong to several communities, according to their personal life and interests: for instance, a person may have tight relationships both with the people of their working environment and with other individuals involved in common free time activities. Accounting for overlaps is also a way to better exploit the information that one can derive from topology. Ideally, one could estimate the degree of participation of a vertex in different communities, which corresponds to the likelihood that the vertex belongs to the various groups. Community detection algorithms, instead, often disagree in the classification of peripheral vertices of modules, because they are forced to put them in a single cluster, which may be the wrong one. The problem of community detection is so hard that very few algorithms consider the possibility of having overlapping communities. An interesting method has been recently proposed by G. Palla et al. [13] and is described in Sect. “Clique Percolation”. For standard algorithms, the problem of identifying overlapping vertices could be addressed by checking for the stability of partitions against slight variations in the structure of the graph, as described in [33].

Computer Science: Graph Partitioning

The problem of graph partitioning consists of dividing the vertices into g groups of predefined size, such that the number of edges lying between the groups is minimal. The number of edges running between modules is called cut size. Figure 5 presents the solution of the problem for a graph with fourteen vertices, for g = 2 and clusters of equal size. The specification of the number of modules of the partition is necessary. If one simply imposed a partition with the minimal cut size, and left the number of modules free, the solution would be trivial, corresponding to all vertices ending up in the same module, as this would yield a vanishing cut size. Graph partitioning is a fundamental issue in parallel computing, circuit partitioning and layout, and in the design of many serial algorithms, including techniques to
Community Structure in Graphs, Figure 5 Graph partitioning. The cut shows the partition in two groups of equal size
solve partial differential equations and sparse linear systems of equations. Most variants of the graph partitioning problem are NP-hard, i. e. it is unlikely that the solution can be computed in a time growing as a power of the graph size. There are however several algorithms that can do a good job, even if their solutions are not necessarily optimal [34]. Most algorithms perform a bisection of the graph, which is already a complex task. Partitions into more than two modules are usually attained by iterative bisectioning. The Kernighan–Lin algorithm [35] is one of the earliest methods proposed and is still frequently used, often in combination with other techniques. The authors were motivated by the problem of partitioning electronic circuits onto boards: the nodes contained in different boards need to be linked to each other with the least number of connections. The procedure is an optimization of a benefit function Q, which represents the difference between the number of edges inside the modules and the number of edges lying between them. The starting point is an initial partition of the graph in two clusters of the predefined size: such an initial partition can be random or suggested by some information on the graph structure. Then, subsets consisting of equal numbers of vertices are swapped between the two groups, so that Q has the maximal increase. To reduce the risk of being trapped in local maxima of Q, the procedure includes some swaps that decrease the function Q. After a series of swaps with positive and negative gains, the partition with the largest value of Q is selected and used as the starting point of a new series of iterations. The Kernighan–Lin algorithm is quite fast, scaling as O(n^2) in worst-case time, n being the number of vertices. The partitions found by the procedure are strongly dependent on the initial configuration, and other algorithms can do better. However, the method is used to improve on the partitions found through other techniques, by using them as starting configurations for the algorithm. Another popular technique is the spectral bisection method, which is based on the properties of the Laplacian matrix. The Laplacian matrix (or simply Laplacian) of a graph is obtained from the adjacency matrix A by placing on the diagonal the degrees of the vertices and by changing the signs of the other elements. The Laplacian has all non-negative eigenvalues and at least one zero eigenvalue, as the sum of the elements of each row and column of the matrix is zero. If a graph is divided into g connected components, the Laplacian has g degenerate eigenvectors with eigenvalue zero and can be written in block-diagonal form, i. e. the vertices can be ordered in such a way that the Laplacian displays g square blocks along the diagonal, with entries different from zero, whereas all other elements vanish. Each block is the Laplacian of the corresponding subgraph, so it has the trivial eigenvector with components (1, 1, 1, …, 1, 1). Therefore, there are g degenerate eigenvectors with equal non-vanishing components in correspondence of the vertices of a block, whereas all other components are zero. In this way, from the components of the eigenvectors one can identify the connected components of the graph. If the graph is connected, but consists of g subgraphs which are weakly linked to each other, the spectrum will have one zero eigenvalue and g − 1 eigenvalues which are close to zero. If the groups are two, the second lowest eigenvalue will be close to zero and the corresponding eigenvector, also called the Fiedler vector, can be used to identify the two clusters as shown below.
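The Laplacian just defined can be built directly from an adjacency structure. A minimal sketch, which also verifies the zero row-sum property mentioned above (the path graph is an illustrative assumption):

```python
def laplacian(adj):
    """Laplacian L = D - A: vertex degrees on the diagonal, -1 for each
    edge, 0 elsewhere. Every row (and column) sums to zero by construction."""
    nodes = sorted(adj)
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    L = [[0] * n for _ in range(n)]
    for v in nodes:
        L[index[v]][index[v]] = len(adj[v])
        for w in adj[v]:
            L[index[v]][index[w]] = -1
    return L

# Path graph 0-1-2.
adj = {0: {1}, 1: {0, 2}, 2: {1}}
L = laplacian(adj)
print(L)                                # [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
print(all(sum(row) == 0 for row in L))  # True
```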
Every partition of a graph with n vertices in two groups can be represented by an index vector s, whose component s_i is +1 if vertex i is in one group and −1 if it is in the other group. The cut size R of the partition of the graph in the two groups can be written as

R = \frac{1}{4} s^T L s ,    (3)

where L is the Laplacian matrix and s^T the transpose of vector s. Vector s can be written as s = \sum_i a_i v_i, where v_i, i = 1, …, n are the eigenvectors of the Laplacian. If s is properly normalized, then

R = \sum_i a_i^2 \lambda_i ,    (4)

where λ_i is the Laplacian eigenvalue corresponding to eigenvector v_i. It is worth remarking that the sum contains at most n − 1 terms, as the Laplacian has at least one zero eigenvalue. Minimizing R amounts to minimizing the sum on the right-hand side of Eq. (4). This task is still very hard. However, if the second lowest eigenvalue λ_2 is close enough to zero, a good approximation of the minimum can be attained by choosing s parallel to the Fiedler vector v_2: this would reduce the sum to λ_2, which is a small number. But the index vector cannot be perfectly parallel to v_2 by construction, because all its components are equal in modulus, whereas the components of v_2 are not. The best one can do is to match the signs of the components. So, one can set s_i = +1 (−1) if v_{2,i} > 0 (< 0). It may happen that the sizes of the two corresponding groups do not match the predefined sizes one wishes to have. In this case, if one aims at a split in n_1 and n_2 = n − n_1 vertices, the best strategy is to order the components of the Fiedler vector from the lowest to the largest values and to put in one group the vertices corresponding to the first n_1 components from the top or the bottom, and the remaining vertices in the second group. If there is a discrepancy between n_1 and the number of positive or negative components of v_2, this procedure yields two partitions: the better solution is the one that gives the smaller cut size. The spectral bisection method is quite fast. The first eigenvectors of the Laplacian can be computed by using the Lanczos method [36], which scales as m/(λ_3 − λ_2), where m is the number of edges of the graph. If the eigenvalues λ_2 and λ_3 are well separated, the running time of the algorithm is much shorter than the time required to calculate the complete set of eigenvectors, which scales as O(n^3). The method gives in general good partitions, which can be further improved by applying the Kernighan–Lin algorithm.
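The sign-matching step of spectral bisection can be sketched as follows. This version uses NumPy's dense eigendecomposition for simplicity (an illustrative assumption; for large sparse graphs one would use the Lanczos method, as discussed above), and the toy graph is two triangles joined by one edge:

```python
import numpy as np

def spectral_bisection(adj):
    """Split a connected graph in two by the signs of the Fiedler vector
    (the eigenvector of the second smallest Laplacian eigenvalue)."""
    nodes = sorted(adj)
    n = len(nodes)
    index = {v: i for i, v in enumerate(nodes)}
    A = np.zeros((n, n))
    for v in nodes:
        for w in adj[v]:
            A[index[v], index[w]] = 1.0
    L = np.diag(A.sum(axis=1)) - A          # Laplacian: degrees minus adjacency
    eigvals, eigvecs = np.linalg.eigh(L)    # eigh returns ascending eigenvalues
    fiedler = eigvecs[:, 1]                 # second smallest eigenvalue's vector
    group1 = {v for v in nodes if fiedler[index[v]] >= 0}
    group2 = set(nodes) - group1
    return group1, group2

# Two triangles joined by one edge: the bisection recovers them.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
g1, g2 = spectral_bisection(adj)
print(sorted(map(sorted, (g1, g2))))  # [[0, 1, 2], [3, 4, 5]]
```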
Other methods for graph partitioning include level-structure partitioning, the geometric algorithm, multilevel algorithms, etc. A good description of these algorithms can be found in [34]. Graph partitioning algorithms are not good for community detection, because it is necessary to provide as input both the number of groups and their size, about which in principle one knows nothing. Instead, one would like an algorithm capable of producing this information in its output. Besides, using iterative bisectioning to split the graph into more pieces is not a reliable procedure.

Social Science: Hierarchical and k-Means Clustering

In social network analysis, one partitions actors/vertices into clusters such that actors in the same cluster are more similar to each other than to actors of different clusters.
The two most used techniques to perform clustering analysis in sociology are hierarchical clustering and k-means clustering. The starting point of hierarchical clustering is the definition of a similarity measure between vertices. After a measure is chosen, one computes the similarity for each pair of vertices, no matter if they are connected or not. At the end of this process, one is left with a new n × n matrix X, the similarity matrix. Initially, there are n groups, each containing one of the vertices. At each step, the two most similar groups are merged; the procedure continues until all vertices are in the same group. There are different ways to define the similarity between groups out of the matrix X. In single linkage clustering, the similarity between two groups is the minimum element x_{ij}, with i in one group and j in the other. On the contrary, the maximum element x_{ij} for vertices of different groups is used in the procedure of complete linkage clustering. In average linkage clustering one has to compute the average of the x_{ij}. The procedure can be better illustrated by means of dendrograms, like the one in Fig. 3. One should note that hierarchical clustering does not deliver a single partition, but a set of partitions. There are many possible ways to define a similarity measure for the vertices based on the topology of the network. A possibility is to define a distance between vertices, like

x_{ij} = \sqrt{ \sum_{k \neq i,j} (A_{ik} - A_{jk})^2 } .    (5)

This is a dissimilarity measure, based on the concept of structural equivalence. Two vertices are structurally equivalent if they have the same neighbors, even if they are not adjacent themselves. If i and j are structurally equivalent, x_{ij} = 0. Vertices with large degree and different neighbors are considered very “far” from each other. Another measure related to structural equivalence is the Pearson correlation between columns or rows of the adjacency matrix,

x_{ij} = \frac{ \sum_k (A_{ik} - \mu_i)(A_{jk} - \mu_j) }{ n \sigma_i \sigma_j } ,    (6)

where the averages \mu_i = (\sum_j A_{ij})/n and the variances \sigma_i^2 = \sum_j (A_{ij} - \mu_i)^2 / n. An alternative measure is the number of edge- (or vertex-) independent paths between two vertices. Independent paths do not share any edge (vertex), and their number is related to the maximum flow that can be conveyed between the two vertices under the constraint that each edge can carry only one unit of flow (max-flow/min-cut theorem). Similarly, one could consider all paths running between two vertices. In this case, there is the problem that the total number of paths is infinite, but this can be avoided if one performs a weighted sum of the number of paths, where paths of length l are weighted by the factor α^l, with α < 1. So, the weights of long paths are exponentially suppressed and the sum converges. Hierarchical clustering has the advantage that it does not require preliminary knowledge of the number and size of the clusters. However, it does not provide a way to discriminate between the many partitions obtained by the procedure, and to choose the one (or ones) that better represent the community structure of the graph. Moreover, the results of the method depend on the specific similarity measure adopted. Finally, it does not correctly classify all vertices of a community, and in many cases some vertices are missed even if they have a central role in their clusters [22]. Another popular clustering technique in sociology is k-means clustering [37]. Here, the number of clusters is preassigned, say k. The vertices of the graph are embedded in a metric space, so that each vertex is a point and a distance measure is defined between pairs of points in the space. The distance is a measure of dissimilarity between vertices. The aim of the algorithm is to identify k points in this space, or centroids, so that each vertex is associated to one centroid and the sum of the distances of all vertices from their respective centroids is minimal. To achieve this, one starts from an initial distribution of centroids such that they are as far as possible from each other. In the first iteration, each vertex is assigned to the nearest centroid. Next, the centers of mass of the k clusters are estimated and become a new set of centroids, which allows for a new classification of the vertices, and so on.
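The structural-equivalence dissimilarity of Eq. (5) and the single linkage merging rule described above can be sketched together in a few lines; the toy graph (a star whose leaves are all structurally equivalent) is an illustrative assumption:

```python
from math import sqrt

def structural_distance(adj, nodes):
    """Dissimilarity matrix of Eq. (5): Euclidean distance between rows
    of the adjacency matrix, skipping the entries i and j themselves."""
    def a(i, j):
        return 1.0 if nodes[j] in adj[nodes[i]] else 0.0
    n = len(nodes)
    x = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                x[i][j] = sqrt(sum((a(i, k) - a(j, k)) ** 2
                                   for k in range(n) if k not in (i, j)))
    return x

def single_linkage(x, n):
    """Agglomerative clustering: repeatedly merge the two groups whose
    closest pair of members is at minimal distance, recording each merge."""
    groups = [{i} for i in range(n)]
    merges = []
    while len(groups) > 1:
        g1, g2 = min(((p, q) for p in groups for q in groups if p is not q),
                     key=lambda pair: min(x[i][j] for i in pair[0] for j in pair[1]))
        groups.remove(g1); groups.remove(g2)
        groups.append(g1 | g2)
        merges.append(sorted(g1 | g2))
    return merges

# Star graph: vertices 0, 1, 3, 4 all have the single neighbor 2.
adj = {0: {2}, 1: {2}, 2: {0, 1, 3, 4}, 3: {2}, 4: {2}}
nodes = [0, 1, 2, 3, 4]
x = structural_distance(adj, nodes)
print(x[0][1])  # 0.0: vertices 0 and 1 are structurally equivalent
merges = single_linkage(x, len(nodes))
```

The sequence of merges is exactly the information a dendrogram like the one in Fig. 3 displays: the equivalent leaves merge first (at distance zero), and the hub joins last.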
After a sufficient number of iterations, the positions of the centroids are stable, and the clusters do not change any more. The solution found is not necessarily optimal, as it strongly depends on the initial choice of the centroids. The result can be improved by performing more runs starting from different initial conditions. The limitation of k-means clustering is the same as that of the graph partitioning algorithms: the number of clusters must be specified at the beginning; the method is not able to derive it. In addition, the embedding in a metric space can be natural for some graphs, but rather artificial for others.

New Methods From the previous two sections it is clear that traditional approaches to deriving graph partitions have serious limits.
The most important problem is the need to provide the algorithms with information that one would like to derive from the algorithms themselves, such as the number of clusters and their size. Even when these inputs are not necessary, as in hierarchical clustering, there is the question of estimating the goodness of the partitions, so that one can pick the best one. For these reasons, there has been a major effort in recent years to devise algorithms capable of extracting complete information about the community structure of graphs. These methods can be grouped into different categories.

Divisive Algorithms A simple way to identify communities in a graph is to detect the edges that connect vertices of different communities and remove them, so that the clusters get disconnected from each other. This is the philosophy of divisive algorithms. The crucial point is to find a property of intercommunity edges that allows for their identification. Any divisive method delivers many partitions, which are by construction hierarchical, so that they can be represented with dendrograms.

Algorithm of Girvan and Newman. The most popular algorithm is that proposed by Girvan and Newman [14]. The method is also historically important, because it marked the beginning of a new era in the field of community detection. Here edges are selected according to the values of measures of edge centrality, which estimate the importance of edges according to some property or process running on the graph. The steps of the algorithm are:

1. Computation of the centrality of all edges;
2. Removal of the edge with largest centrality;
3. Recalculation of centralities on the running graph;
4. Iteration of the cycle from step 2.
Girvan and Newman focused on the concept of betweenness, a variable expressing the frequency of the participation of edges in a process. They considered three alternative definitions: edge betweenness, current-flow betweenness and random-walk betweenness. Edge betweenness is the number of shortest paths between all vertex pairs that run along the edge. It is an extension to edges of the concept of site betweenness, introduced by Freeman in 1977 [20]. It is intuitive that intercommunity edges have a large edge betweenness, because many shortest paths connecting vertices of different communities pass through them (Fig. 6). The betweenness of all edges of the graph can be calculated in a time that scales as O(mn), with techniques based on breadth-first search [26,38].
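One removal step of the algorithm can be sketched in pure Python. The breadth-first accumulation below follows the Brandes-style scheme for edge betweenness mentioned above; the function names and the adjacency-dictionary representation are illustrative choices, not part of the original method description.

```python
from collections import defaultdict, deque

def edge_betweenness(adj):
    """Shortest-path betweenness of every edge (Brandes-style accumulation).
    adj maps each vertex to the set of its neighbors."""
    bet = defaultdict(float)
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors
        dist = {s: 0}
        sigma = defaultdict(float); sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        q = deque([s])
        while q:
            v = q.popleft(); order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # back-propagate path dependencies onto the edges
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    # each pair (s, t) is counted twice, once from each endpoint
    return {e: b / 2.0 for e, b in bet.items()}

def girvan_newman_step(adj):
    """Steps 1-2: remove the highest-betweenness edge, return the new graph."""
    bet = edge_betweenness(adj)
    u, v = max(bet, key=bet.get)
    adj = {x: set(ys) for x, ys in adj.items()}
    adj[u].discard(v); adj[v].discard(u)
    return adj

def components(adj):
    """Connected components, i.e. the current clusters of the dendrogram."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, q = {s}, deque([s])
        seen.add(s)
        while q:
            v = q.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w); comp.add(w); q.append(w)
        comps.append(comp)
    return comps
```

On a toy graph of two triangles joined by a single bridge, the bridge has the largest betweenness, so the first iteration disconnects the two communities; iterating steps 2–4 then produces the full dendrogram.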
Community Structure in Graphs, Figure 6 Edge betweenness is highest for edges connecting communities. In the figure, the thick edge in the middle has a much higher betweenness than all other edges, because all shortest paths connecting vertices of the two communities run through it
Current-flow betweenness is defined by considering the graph as a resistor network, with edges having unit resistance. If a voltage difference is applied between any two vertices, each edge carries some amount of current, which can be calculated by solving Kirchhoff’s equations. The procedure is repeated for all possible vertex pairs: the current-flow betweenness of an edge is the average value of the current carried by the edge. Calculation of current-flow betweenness requires the inversion of an n × n matrix (once), followed by obtaining and averaging the current for all pairs of nodes. Each of these two tasks takes a time O(n³) for a sparse matrix. The random-walk betweenness of an edge says how frequently a random walker running on the graph goes across the edge. Recall that a random walker moving from a vertex follows each edge with equal probability. A pair of vertices s and t is chosen at random. The walker starts at s and keeps moving until it hits t, where it stops. One computes the probability that each edge was crossed by the walker, and averages over all possible choices for the vertices s and t. The complete calculation requires a time O(n³) on a sparse graph. It is possible to show that this measure is equivalent to current-flow betweenness [39]. Calculating edge betweenness is much faster than current-flow or random-walk betweenness (O(n²) versus O(n³) on sparse graphs). In addition, in practical applications the Girvan–Newman algorithm with edge betweenness gives better results than adopting the other centrality measures. Numerical studies show that the recalculation step 3 of the Girvan–Newman algorithm is essential to detect meaningful communities. This introduces an additional factor m in the running time of the algorithm: consequently, the edge betweenness version scales as O(m²n), or O(n³) on a sparse graph. Because of that, the algorithm is quite slow, and applicable to graphs with up to n ≈ 10 000 vertices with current computational resources.
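The current-flow calculation can be sketched as follows. This is a minimal illustration, assuming unit resistances and solving Kirchhoff's equations by dense Gaussian elimination on a small graph, rather than the matrix-inversion approach of the text; all function names are hypothetical.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting (dense, small systems only)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[p] = M[p], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def current_flow_betweenness(nodes, edges):
    """Average current carried by each edge over all source/sink pairs."""
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    # graph Laplacian with unit resistances
    L = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        i, j = idx[u], idx[v]
        L[i][i] += 1; L[j][j] += 1
        L[i][j] -= 1; L[j][i] -= 1
    bet = {e: 0.0 for e in edges}
    pairs = [(s, t) for i, s in enumerate(nodes) for t in nodes[i + 1:]]
    for s, t in pairs:
        # inject one unit of current at s, extract it at t
        b = [0.0] * n
        b[idx[s]] = 1.0; b[idx[t]] = -1.0
        # ground t (fix its potential to 0) to make the Laplacian non-singular
        g = idx[t]
        A = [[L[r][c] for c in range(n) if c != g] for r in range(n) if r != g]
        rhs = [b[r] for r in range(n) if r != g]
        sol = solve(A, rhs)
        volt = sol[:g] + [0.0] + sol[g:]
        for u, v in edges:
            bet[(u, v)] += abs(volt[idx[u]] - volt[idx[v]])  # I = dV / R, R = 1
    return {e: c / len(pairs) for e, c in bet.items()}
```

On the path a–b–c, for example, each edge carries unit current for two of the three vertex pairs and none for the third, so both edges get current-flow betweenness 2/3.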
In the original version of Girvan–Newman’s algorithm [14], the authors had to deal with the whole hierarchy of partitions, as they had no procedure to say which
partition is the best. In a subsequent refinement [26], they selected the partition with the largest value of modularity (see Sect. “Evaluating Partitions: Quality Functions”), a criterion that has been frequently used ever since. There have been countless applications of the Girvan–Newman method: the algorithm is now integrated in well-known libraries of network analysis programs.

Algorithm of Tyler et al. Tyler, Wilkinson and Huberman proposed a modification of the Girvan–Newman algorithm to improve the speed of the calculation [40]. The modification consists in calculating the contribution to edge betweenness from only a limited number of vertex pairs, chosen at random, deriving a sort of Monte Carlo estimate. The procedure introduces statistical errors in the values of the edge betweenness. As a consequence, the partitions are in general different for different choices of the sampling pairs of vertices. However, the authors showed that, by repeating the calculation many times, the method gives good results, with a substantial gain in computer time. In practical examples, only vertices lying at the boundary between communities may not be clearly classified, being assigned sometimes to one group, sometimes to another. The method has been applied to a network of people corresponding through email [40] and to networks of gene co-occurrences [41].

Algorithm of Fortunato et al. An alternative measure of centrality for edges is information centrality. It is based on the concept of efficiency [42], which estimates how easily information travels on a graph according to the length of shortest paths between vertices. The information centrality of an edge is the variation of the efficiency of the graph if the edge is removed. In the algorithm by Fortunato, Latora and Marchiori [43], edges are removed in order of decreasing information centrality. The method is analogous to that of Girvan and Newman, but slower, as it scales as O(n⁴) on a sparse graph.
On the other hand, it gives a better classification of vertices when communities are fuzzy, i. e. with a high degree of interconnectedness.

Algorithm of Radicchi et al. Because of the high density of edges within communities, it is easy to find loops in them, i. e. closed non-intersecting paths. On the contrary, edges lying between communities will hardly be part of short loops. Based on this intuitive idea, Radicchi et al. proposed a new measure, the edge clustering coefficient, such that low values of the measure are likely to correspond to intercommunity edges [25]. The edge clustering coefficient generalizes to edges the notion of clustering coefficient introduced by Watts and Strogatz for vertices [44]. The latter is the number of triangles including a vertex divided by the number of possible triangles that can be formed. The edge clustering coefficient is the number of loops of length g including the edge divided by the number of possible such loops. Usually, loops of length g = 3 or 4 are considered. At each iteration, the edge with the smallest clustering coefficient is removed, the measure is recalculated, and so on. The procedure stops when all clusters obtained are LS-sets or “weak” communities (see Sect. “Definition of Community”). Since the edge clustering coefficient is a local measure, involving at most an extended neighborhood of the edge, it can be calculated very quickly. The running time of the algorithm to completion is O(m⁴/n²), or O(n²) on a sparse graph, so it is much shorter than the running time of the Girvan–Newman method. On the other hand, the method may give poor results when the graph has few loops, as happens in several non-social networks. In this case the edge clustering coefficient is small and fairly similar for all edges, and the algorithm may fail to identify the bridges between communities.

Modularity Optimization If Newman–Girvan modularity Q (Sect. “Evaluating Partitions: Quality Functions”) is a good indicator of the quality of partitions, the partition corresponding to its maximum value on a given graph should be the best, or at least a very good one. This is the main motivation for modularity maximization, perhaps the most popular class of methods to detect communities in graphs. An exhaustive optimization of Q is impossible, due to the huge number of ways in which a graph can be partitioned, even when the graph is small. Besides, the true maximum is out of reach: it has recently been proved that modularity optimization is an NP-hard problem [45], so it is probably impossible to find the solution in a time growing polynomially with the size of the graph. However, several algorithms are currently able to find fairly good approximations of the modularity maximum in a reasonable time.

Greedy techniques.
The first algorithm devised to maximize modularity was a greedy method of Newman [46]. It is an agglomerative method, where groups of vertices are successively joined to form larger communities such that modularity increases after the merging. One starts from n clusters, each containing a single vertex. Edges are not initially present; they are added one by one during the procedure. However, modularity is always calculated from the full topology of the graph, since one wants to find its partitions. Adding the first edge to the set of disconnected vertices reduces the number of groups from n to n − 1, so it delivers a new partition of the graph. The edge is chosen such that this partition gives the maximum increase of modularity with respect to the previous configuration.
All other edges are added based on the same principle. If the insertion of an edge does not change the partition, i. e. the clusters are the same, modularity stays the same. The number of partitions found during the procedure is n, each with a different number of clusters, from n to 1. The largest value of modularity in this subset of partitions is the approximation of the modularity maximum given by the algorithm. The update of the modularity value at each iteration step can be performed in a time O(n + m), so the algorithm runs to completion in a time O((m + n)n), or O(n²) on a sparse graph, which is fast. In a later paper by Clauset et al. [47], it was shown that the calculation of modularity during the procedure can be performed much more quickly by use of max-heaps, special data structures created using a binary tree. With this refinement the algorithm scales as O(md log n), where d is the depth of the dendrogram describing the successive partitions found during the execution of the algorithm; d grows as log n for graphs with a strong hierarchical structure. For those graphs the running time of the method is then O(n log² n), which makes it possible to analyze the community structure of very large graphs, up to 10⁷ vertices. The greedy algorithm is currently the only algorithm that can be used to estimate the modularity maximum on such large graphs. On the other hand, the approximation it finds is not very good, as compared with other techniques. The accuracy of the algorithm can be considerably improved if one accounts for the size of the groups to be merged [48], or if the hierarchical agglomeration is started from some good intermediate configuration, rather than from the individual vertices [49].

Simulated annealing. Simulated annealing [50] is a probabilistic procedure for global optimization used in many different fields and problems.
It consists in performing an exploration of the space of possible states, looking for the global optimum of a function F, say its maximum. Transitions from one state to another occur with probability 1 if F increases after the change, and with probability exp(−βΔF) otherwise, where ΔF is the decrease of the function and β is an index of stochastic noise, a sort of inverse temperature, which increases after each iteration. The noise reduces the risk that the system gets trapped in local optima. At some stage, the system converges to a stable state, which can be an arbitrarily good approximation of the maximum of F, depending on how many states were explored and how slowly β is varied. Simulated annealing was first employed for modularity optimization by Guimerà et al. [31]. Its standard implementation combines two types of “moves”: local moves, where a single vertex is shifted from one cluster to another, taken at random; and global moves, consisting of merges and splits of
communities. In practical applications, one typically combines n² local moves with n global ones in one iteration. The method can potentially come very close to the true modularity maximum, but it is slow. Therefore, it can be used only for small graphs, with up to about 10⁴ vertices. Applications include studies of potential energy landscapes [51] and of metabolic networks [12].

Extremal optimization. Extremal optimization is a heuristic search procedure proposed by Boettcher and Percus [52], in order to achieve an accuracy comparable with simulated annealing, but with a substantial gain in computer time. It is based on the optimization of local variables, expressing the contribution of each unit of the system to the global function under study. This technique was used for modularity optimization by Duch and Arenas [53]. Modularity can indeed be written as a sum over the vertices: the local modularity of a vertex is the value of the corresponding term in this sum. A fitness measure for each vertex is obtained by dividing the local modularity of the vertex by its degree. One starts from a random partition of the graph into two groups. At each iteration, the vertex with the lowest fitness is shifted to the other cluster. The move changes the partition, so the local fitnesses need to be recalculated. The process continues until the global modularity Q cannot be improved any more by the procedure. At this stage, each cluster is considered as a graph on its own and the procedure is repeated, as long as Q increases for the partitions found. The algorithm finds an excellent approximation of the modularity maximum in a time O(n² log n), so it represents a good tradeoff between accuracy and speed.

Spectral optimization. Modularity can be optimized using the eigenvalues and eigenvectors of a special matrix, the modularity matrix B, whose elements are

B_ij = A_ij − k_i k_j / 2m ,   (7)
where the notation is the same used in Eq. (1). The method [54,55] is analogous to spectral bisection, described in Sect. “Computer Science: Graph Partitioning”. The difference is that here the Laplacian matrix is replaced by the modularity matrix. Between Q and B there is the same relation as between R and L in Eq. (3), so modularity can be written as a weighted sum of the eigenvalues of B, just like Eq. (4). Here one has to look for the eigenvector of B with largest eigenvalue, u1 , and group the vertices according to the signs of the components of u1 , just like in Sect. “Computer Science: Graph Partitioning”. The Kernighan–Lin algorithm can then be used to improve the result. The procedure is repeated for each of the clusters separately, and the number of communities increases as
long as modularity does. The advantage over spectral bisection is that it is not necessary to specify the size of the two groups, because it is determined by taking the partition with largest modularity. The drawback is similar to that of spectral bisection, i. e. the algorithm gives the best results for bisections, whereas it is less accurate when the number of communities is larger than two. The situation could be improved by using the other eigenvectors of the modularity matrix with positive eigenvalues. In addition, the eigenvectors with the most negative eigenvalues are important to detect a possible multipartite structure of the graph, as they give the most relevant contribution to the modularity minimum. The algorithm typically runs in a time O(n² log n) for a sparse graph, when one computes only the first eigenvector, so it is faster than extremal optimization, and slightly more accurate, especially for large graphs.

Finally, some general remarks on modularity optimization and its reliability. A large value of the modularity maximum does not necessarily mean that a graph has community structure. Random graphs can also have partitions with large modularity values, even though clusters are not explicitly built in [31,56]. Therefore, the modularity maximum of a graph reveals its community structure only if it is appreciably larger than the modularity maximum of random graphs of the same size [57]. In addition, one assumes that the modularity maximum delivers the “best” partition of the network into communities. However, this is not always true [32]. In the definition of modularity (Eq. (2)) the graph is compared with a randomized version of it that keeps the degrees of its vertices. If groups of vertices in the graph are more tightly connected than they would be in the randomized graph, modularity optimization would consider them as parts of the same module.
But if the groups have fewer than √m internal edges, the expected number of edges running between them in modularity’s null model is less than one, and a single interconnecting edge would cause the merging of the two groups in the optimal partition. This holds for every density of edges inside the groups, even in the limit case in which all vertices of each group are connected to each other, i. e. if the groups are cliques. In Fig. 7 a graph is made out of n_c identical cliques, with l vertices each, connected by single edges. It is intuitive to think that the modules of the best partition are the single cliques: instead, if n_c is larger than about l², modularity is higher for the partition in which pairs of consecutive cliques are parts of the same module (indicated by the dashed lines in the figure). The problem holds for a wide class of possible null models [58]. Attempts have been made to solve it within the modularity framework [59,60,61].
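The clique-ring argument can be checked with a few lines of arithmetic. The sketch below assumes the standard definition Q = Σ_c [m_c/m − (d_c/2m)²], where m_c and d_c are the internal edge count and total degree of module c, and evaluates it for the ring of n_c cliques of l vertices of Fig. 7; the function names are illustrative.

```python
def modularity(m, modules):
    """Q = sum over modules of [m_c/m - (d_c/2m)^2].
    modules is a list of (internal edges, total degree) pairs."""
    return sum(mc / m - (dc / (2 * m)) ** 2 for mc, dc in modules)

def ring_of_cliques(nc, l):
    """Q of two partitions of a ring of nc l-cliques joined by single edges:
    one module per clique, versus consecutive cliques merged in pairs."""
    m_clique = l * (l - 1) // 2            # internal edges of one clique
    m = nc * (m_clique + 1)                # total edges: cliques plus nc bridges
    # partition 1: every clique is a module
    # (degree = internal edge ends + the 2 bridge stubs of the clique)
    single = [(m_clique, 2 * m_clique + 2)] * nc
    # partition 2: consecutive cliques paired (one bridge becomes internal)
    pairs = [(2 * m_clique + 1, 2 * (2 * m_clique + 2))] * (nc // 2)
    return modularity(m, single), modularity(m, pairs)
```

A direct comparison shows Q_pairs > Q_single exactly when n_c > 2(l(l−1)/2 + 1), i.e. for n_c larger than about l²: with l = 5, a ring of 30 cliques prefers the paired partition, while a ring of 10 prefers the single cliques.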
Community Structure in Graphs, Figure 7 Resolution limit of modularity optimization. The natural community structure of the graph, represented by the individual cliques (circles), is not recognized by optimizing modularity, if the cliques are smaller than a scale depending on the size of the graph. Reprinted figure with permission from [32]
Modifications of the measure have also been suggested. Massen and Doye proposed a slight variation of modularity’s null model [51]: it is still a graph with the same degree sequence as the original, with edges rewired at random among the vertices, but one imposes the additional constraint that there can be neither multiple edges between a pair of vertices nor edges joining a vertex to itself (self-edges). Muff, Rao and Caflisch remarked that modularity’s null model implicitly assumes that each vertex could be attached to any other, whereas in real cases a cluster is usually connected to few other clusters [62]. Therefore, they proposed a local version of modularity, in which the expected number of edges within a module is calculated not with respect to the full graph, but considering just a portion of it, namely the subgraph including the module and its neighboring modules.

Spectral Algorithms As discussed above, spectral properties of graph matrices are frequently used to find partitions. Traditional methods are in general unable to predict the number and size of the clusters, which instead must be fed into the procedure. Recent algorithms, reviewed below, are more powerful.

Algorithm of Donetti and Muñoz. An elegant method based on the eigenvectors of the Laplacian matrix has been
Community Structure in Graphs, Figure 8 Spectral algorithm by Donetti and Muñoz. Vertex i is represented by the values of the ith components of Laplacian eigenvectors. In this example, the graph has an ad hoc division in four communities, indicated by different symbols. The communities are better separated in two dimensions (b) than in one (a). Reprinted figure with permission from [63]
devised by Donetti and Muñoz [63]. The idea is simple: the values of the eigenvector components are close for vertices in the same community, so one can use them as coordinates to represent vertices as points in a metric space. If one uses M eigenvectors, one can thus embed the vertices in an M-dimensional space. Communities appear as groups of points well separated from each other, as illustrated in Fig. 8. The separation is more visible the larger the number of dimensions/eigenvectors M. The space points are grouped into communities by hierarchical clustering (see Sect. “Social Science: Hierarchical and k-Means Clustering”). The final partition is the one with largest modularity. For the similarity measure between vertices, Donetti and Muñoz used both the Euclidean distance and the angle distance. The angle distance between two points is the angle between the vectors going from the origin of the M-dimensional space to either point. Applications show that the best results are obtained with complete-linkage clustering. The algorithm runs to completion in a time O(n³), which is not fast. Moreover, the number M of eigenvectors needed for a clean separation of the clusters is not known a priori.

Algorithm of Capocci et al. Similarly to Donetti and Muñoz, Capocci et al. used eigenvector components to identify communities [64]. In this case the eigenvectors are those of the normal matrix, which is derived from the adjacency matrix by dividing each row by the sum of its elements. The eigenvectors can be quickly calculated by
performing a constrained optimization of a suitable cost function. A similarity matrix is built by calculating the correlation between eigenvector components: the similarity between vertices i and j is the Pearson correlation coefficient between their corresponding eigenvector components, where the averages are taken over the set of eigenvectors used. The method can be extended to directed graphs. It is useful to estimate vertex similarities; however, it does not provide a well-defined partition of the graph.

Algorithm of Wu and Huberman. A fast algorithm by Wu and Huberman identifies communities based on the properties of resistor networks [65]. It is essentially a method for bisecting graphs, similar to spectral bisection, although partitions into an arbitrary number of communities can be obtained by iterative applications. The graph is transformed into a resistor network where each edge has unit resistance. A unit potential difference is set between two randomly chosen vertices. The idea is that, if there is a clear division of the graph into two communities, there will be a visible gap between the voltage values of vertices at the borders of the clusters. The voltages are calculated by solving Kirchhoff’s equations: an exact solution would be too time-consuming, but it is possible to find a reasonably good approximation in linear time for a sparse graph with a clear community structure, so the most time-consuming part of the algorithm is the sorting of the voltage values, which takes time O(n log n). Any vertex pair can be chosen to set the initial potential difference, so in principle the procedure should be repeated for all possible vertex pairs. The authors showed that this is not necessary, and that a limited number of sampling pairs is sufficient to get good results, so the algorithm scales as O(n log n) and is very fast.
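A minimal sketch of one bisection under these assumptions: unit resistances, a fixed source and sink, and voltages obtained by plain Gauss–Seidel relaxation of Kirchhoff's equations, rather than the faster approximate scheme of the authors. Function and variable names are illustrative.

```python
def voltage_bisection(adj, source, sink, sweeps=200):
    """Relax Kirchhoff's equations: at equilibrium every free vertex sits at
    the average voltage of its neighbors. Then split at the largest voltage gap.
    adj maps each vertex to the set of its neighbors."""
    volt = {v: 0.5 for v in adj}
    volt[source], volt[sink] = 1.0, 0.0    # unit potential difference
    for _ in range(sweeps):
        for v in adj:
            if v not in (source, sink):
                volt[v] = sum(volt[w] for w in adj[v]) / len(adj[v])
    # sort the voltages and cut at the largest gap between consecutive values
    ordered = sorted(adj, key=lambda v: volt[v])
    gaps = [volt[ordered[i + 1]] - volt[ordered[i]]
            for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return set(ordered[:cut]), set(ordered[cut:])
```

On two triangles joined by a bridge, with the source in one triangle and the sink in the other, the voltages cluster around two plateaus and the largest gap falls on the bridge, recovering the two communities.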
An interesting feature of the method is that it can quickly find the natural community of any vertex, without determining the complete partition of the graph. For that, one uses the vertex as the source and places the sink at an arbitrary vertex. The same feature is present in an older algorithm by Flake et al. [11], where one uses max-flow instead of current flow. Previous works have shown that the eigenvectors of the transfer matrix T can also be used to extract useful information on community structure [66,67]. The element T_ij of the transfer matrix is 1/k_j if i and j are neighbors, where k_j is the degree of j, and zero otherwise. The transfer matrix rules the process of diffusion on graphs.

Dynamic Algorithms This section describes methods employing processes running on the graph, focusing on spin-spin interactions, random walks and synchronization.
Q-state Potts model. The Potts model is among the most popular models in statistical mechanics [68]. It describes a system of spins that can be in q different states. The interaction is ferromagnetic, i. e. it favors spin alignment, so at zero temperature all spins are in the same state. If antiferromagnetic interactions are also present, the ground state of the system may not be the one where all spins are aligned, but a state where different spin values coexist in homogeneous clusters. If Potts spin variables are assigned to the vertices of a graph with community structure, and the interactions are between neighboring spins, it is likely that the topological clusters can be recovered from the like-valued spin clusters of the system, as there are many more interactions inside communities than outside. Based on this idea, inspired by an earlier paper by Blatt, Wiseman and Domany [69], Reichardt and Bornholdt proposed a method to detect communities that maps the graph onto a q-Potts model with nearest-neighbor interactions [70]. The Hamiltonian of the model, i. e. its energy, is the sum of two competing terms, one favoring spin alignment, the other antialignment. The relative weight of these two terms is expressed by a parameter γ, which is usually set to the value of the density of edges of the graph. The goal is to find the ground state of the system, i. e. to minimize the energy. This can be done with simulated annealing [50], starting from a configuration where spins are randomly assigned to the vertices and the number of states q is very high. The procedure is quite fast and the results do not depend on q. The method also makes it possible to identify vertices shared between communities, from the comparison of partitions corresponding to global and local energy minima. More recently, Reichardt and Bornholdt derived a general framework [28], in which detecting community structure is equivalent to finding the ground state of a q-Potts model spin glass [71].
Their previous method and modularity optimization are recovered as special cases. Overlapping communities can be discovered by comparing partitions with the same (minimal) energy, and hierarchical structure can be investigated by tuning a parameter acting on the density of edges of a reference graph without community structure.

Random walk. The use of random walks to find communities comes from the idea that a random walker spends a long time inside a community, due to the high density of internal edges and the consequently large number of paths that can be followed. Zhou used random walks to define a distance between pairs of vertices [72]: the distance d_ij between i and j is the average number of edges that a random walker has to cross to reach j starting from i. Close vertices are likely to belong to the same community. Zhou defines the “global attractor” of a vertex i to be the closest vertex to i
(smallest d_ij), whereas the “local attractor” of i is its closest neighbor. Two types of communities are defined, according to local or global attractors: a vertex i has to be put in the same community as its attractor and as all other vertices for which i is an attractor. Communities must be minimal subgraphs, i. e. they cannot include smaller subgraphs which are communities according to the chosen criterion. Applications to real and artificial networks show that the method can find meaningful partitions. In a subsequent paper [73], Zhou introduced a measure of dissimilarity between vertices based on the distance defined above. The measure resembles the definition of distance based on structural equivalence of Eq. (5), with the elements of the adjacency matrix replaced by the corresponding distances. Graph partitions are obtained with a divisive procedure that, starting from the graph as a single community, performs successive splits based on the criterion that vertices in the same cluster must be less dissimilar than a running threshold, which is decreased during the process. The hierarchy of partitions derived by the method is representative of actual community structures for several real and artificial graphs. In another work [74], Zhou and Lipowsky defined distances with biased random walkers, where the bias is due to the fact that walkers move preferentially towards vertices sharing a large number of neighbors with the starting vertex. A different distance measure between vertices based on random walks was introduced by Latapy and Pons [75]. The distance is calculated from the probabilities that the random walker moves from one vertex to another in a fixed number of steps. Vertices are then grouped into communities through hierarchical clustering. The method is quite fast, running to completion in a time O(n² log n) on a sparse graph.

Synchronization. Synchronization is another promising dynamic process to reveal communities in graphs.
If oscillators are placed at the vertices, with initial random phases and nearest-neighbor interactions, oscillators in the same community synchronize first, whereas full synchronization requires a longer time. Thus, if one follows the time evolution of the process, states with synchronized clusters of vertices can be quite stable and long-lived, and so are easily recognized. This was first shown by Arenas, Díaz-Guilera and Pérez-Vicente [76]. They used Kuramoto oscillators [77], which are coupled two-dimensional vectors, each endowed with a natural frequency of oscillation. If the interaction coupling exceeds a threshold, the dynamics leads to synchronization. Arenas et al. showed that the time evolution of the system reveals some intermediate time scales, corresponding to topological scales of the graph, i. e. to different levels of organization of the vertices. Hierarchical community structure can be revealed
in this way. Based on the same principle, Boccaletti et al. designed a community detection method based on synchronization [79]. The synchronization dynamics is a variation of Kuramoto's model, the opinion changing rate (OCR) model [80]. The evolution equations of the model are solved for decreasing values of a parameter that tunes the strength of the interaction coupling between neighboring vertices. In this way, different partitions are recovered: the partition with the largest value of modularity is chosen. The algorithm scales in a time O(mn), or O(n²) on sparse graphs, and gives good results on practical examples. However, synchronization-based algorithms may not be reliable when communities differ considerably in size. Clique Percolation In most of the approaches examined so far, communities have been characterized and discovered, directly or indirectly, by some global property of the graph, like betweenness, modularity, etc., or by some process that involves the graph as a whole, like random walks, synchronization, etc. But communities can also be interpreted as a form of local organization of the graph, so they could be defined from some property of the groups of vertices themselves, regardless of the rest of the graph. Moreover, very few of the algorithms presented so far are able to deal with the problem of overlapping communities (Sect. "Overlapping Communities"). A method that accounts both for the locality of the community definition and for the possibility of having overlapping communities is the Clique Percolation Method (CPM) by Palla et al. [13]. It is based on the concept that the internal edges of a community are likely to form cliques, due to their high density. On the other hand, it is unlikely that intercommunity edges form cliques: this idea was already used in the divisive method of Radicchi et al. (see Sect. "Divisive Algorithms"). Palla et al. define a k-clique as a complete graph with k vertices.
Notice that this definition is different from the definition of n-clique (see Sect. "Definition of Community") used in social science. If it were possible for a clique to move on a graph, in some way, it would probably get trapped inside its original community, as it could not cross the bottleneck formed by the intercommunity edges. Palla et al. introduced a number of concepts to implement this idea. Two k-cliques are adjacent if they share k − 1 vertices. The union of adjacent k-cliques is called a k-clique chain. Two k-cliques are connected if they are part of a k-clique chain. Finally, a k-clique community is the largest connected subgraph obtained by the union of a k-clique and of all k-cliques which are connected to it. Examples of k-clique communities are shown in Fig. 9.
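The chain of definitions above (k-cliques, adjacency via k − 1 shared vertices, connected unions) can be sketched directly, if very inefficiently, in code. This is an illustrative brute-force version, not the authors' implementation; the toy graph and the function name are assumptions:

```python
from itertools import combinations

def k_clique_communities(edges, k):
    """Brute-force sketch of the Clique Percolation Method: enumerate
    all k-cliques (exponential cost, fine for toy graphs only), link two
    k-cliques when they share k - 1 vertices, and return the vertex sets
    of the connected components of this clique-clique graph."""
    E = {frozenset(e) for e in edges}
    nodes = sorted({v for e in edges for v in e})
    cliques = [set(c) for c in combinations(nodes, k)
               if all(frozenset(p) in E for p in combinations(c, 2))]
    seen, communities = set(), []
    for s in range(len(cliques)):
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:                       # BFS over adjacent k-cliques
            i = stack.pop()
            if i in seen:
                continue
            seen.add(i)
            comp |= cliques[i]             # vertices of the community
            stack += [j for j in range(len(cliques)) if j not in seen
                      and len(cliques[i] & cliques[j]) == k - 1]
        communities.append(comp)
    return communities

# toy graph: two triangles sharing the single vertex 2
comms = k_clique_communities([(0, 1), (1, 2), (0, 2),
                              (2, 3), (3, 4), (2, 4)], k=3)
```

On this toy graph the two 3-clique communities {0, 1, 2} and {2, 3, 4} are found; they overlap at vertex 2, illustrating how the CPM handles overlapping communities.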
Community Structure in Graphs
Community Structure in Graphs, Figure 9 Clique Percolation Method. The example shows communities spanned by adjacent 3-cliques (triangles). Overlapping vertices are shown by the bigger dots. Reprinted figure with permission from [13]
One could say that a k-clique community is identified by making a k-clique "roll" over adjacent k-cliques, where rolling means rotating a k-clique about the k − 1 vertices it shares with any adjacent k-clique. By construction, k-clique communities can share vertices, so they can be overlapping. There may be vertices belonging to non-adjacent k-cliques, which could be reached by different paths and end up in different clusters. In order to find k-clique communities, one searches first for maximal cliques, a task that is known to require a running time that grows exponentially with the size of the graph. However, the authors found that, for the real networks they analyzed, the procedure is quite fast, allowing one to analyze graphs with up to 10^5 vertices in a reasonably short time. The actual scalability of the algorithm depends on many factors, and cannot be expressed in closed form. The algorithm has been extended to the analysis of weighted [81] and directed [82] graphs. It was recently used to study the evolution of community structure in social networks [83]. A dedicated software package, called CFinder, based on the CPM, has been designed by Palla and coworkers and is freely available. The CPM has the same limitation as the algorithm of Radicchi et al.: it assumes that the graph has a large number of cliques, so it may fail to give meaningful partitions for graphs with just a few cliques, like technological networks. Other Techniques This section describes some algorithms that do not fit in the previous categories, although some overlap is possible.
Markov Cluster Algorithm (MCL). This method, invented by van Dongen [84], simulates a peculiar process of flow diffusion in a graph. One starts from the stochastic matrix of the graph, which is obtained from the adjacency matrix by dividing each element A_ij by the degree of i. The element S_ij of the stochastic matrix gives the probability that a random walker, sitting at vertex i, moves to j. The sum of the elements of each row of S is one. Each iteration of the algorithm consists of two steps. In the first step, called expansion, the stochastic matrix of the graph is raised to an integer power p (usually p = 2). The entry M_ij of the resulting matrix gives the probability that a random walker, starting from vertex i, reaches j in p steps (diffusion flow). The second step, which has no physical counterpart, consists in raising each single entry of the matrix M to some power α, where α is now real-valued. This operation, called inflation, enhances the weights between pairs of vertices with large values of the diffusion flow, which are likely to be in the same community. Next, the elements of each row must be divided by their sum, such that the sum of the elements of the row equals one and a new stochastic matrix is recovered. After some iterations, the process delivers a stable matrix with some remarkable properties. Its elements are either zero or one, so it is a sort of adjacency matrix. Most importantly, the graph described by the matrix is disconnected, and its connected components are the communities of the original graph. The method is really simple to implement, which is the main reason for its success: as of now, the MCL is one of the most used clustering algorithms in bioinformatics. Due to the matrix multiplication of the expansion step, the algorithm should scale as O(n³), even if the graph is sparse, as the running matrix quickly becomes dense after a few steps of the algorithm.
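The expansion-inflation cycle can be sketched as follows. This is an illustrative toy version (dense matrices, row convention, self-loops added, parameter values chosen here), not van Dongen's optimized implementation:

```python
import numpy as np

def mcl(adj, p=2, alpha=2.0, iters=50):
    """Markov Cluster sketch: alternate expansion (p-th matrix power)
    and inflation (entrywise alpha-th power plus renormalisation) until
    the flow matrix stabilises; communities are the connected components
    of the non-zero pattern of the converged matrix."""
    A = adj + np.eye(len(adj))                # self-loops aid convergence
    M = A / A.sum(axis=1, keepdims=True)      # row-stochastic matrix
    for _ in range(iters):
        M = np.linalg.matrix_power(M, p)      # expansion: diffusion flow
        M = M ** alpha                        # inflation: boost strong flows
        M /= M.sum(axis=1, keepdims=True)     # back to row-stochastic
    support = (M > 1e-6) | (M > 1e-6).T       # symmetrised non-zero pattern
    # connected components of the support graph are the communities
    unseen, clusters = set(range(len(adj))), []
    while unseen:
        stack, comp = [unseen.pop()], set()
        while stack:
            i = stack.pop()
            comp.add(i)
            nbrs = set(np.flatnonzero(support[i])) & unseen
            unseen -= nbrs
            stack += nbrs
        clusters.append(comp)
    return clusters

# toy graph: two triangles joined by the single edge (2, 3)
adj = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]:
    adj[a, b] = adj[b, a] = 1
clusters = mcl(adj)
```

On this toy graph the converged flow matrix splits into two disconnected blocks, recovering the two triangles as communities.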
However, while computing the matrix multiplication, MCL keeps only a maximum number k of non-zero elements per row, where k is usually much smaller than n. So, the actual worst-case running time of the algorithm is O(nk²) on a sparse graph. A problem of the method is the fact that the final partition is sensitive to the parameter α used in the inflation step. Therefore several partitions can be obtained, and it is not clear which are the most meaningful or representative. Maximum likelihood. Newman and Leicht have recently proposed an algorithm based on traditional tools and techniques of statistical inference [85]. The method consists in deducing the group structure of the graph by checking which possible partition best "fits" the graph topology. The goodness of the fit is measured by the likelihood that the observed graph structure was generated by the particular set of relationships between vertices that defines a partition. The latter is described by two sets of model
parameters, expressing the size of the clusters and the connection preferences among the vertices, i. e. the probabilities that vertices of one cluster are linked to any vertex of the graph. The partition corresponding to the maximum likelihood is obtained by iterating a set of coupled equations for the variables, starting from a suitable set of initial conditions. Convergence is fast, so the algorithm can be applied to fairly large graphs, with up to about 10^6 vertices. A nice feature of the method is that it discovers more general types of vertex classes than communities. For instance, multipartite structure could be uncovered, or mixed patterns where multipartite subgraphs coexist with communities, etc. In this respect, it is more powerful than most methods of community detection, which are bound to focus only on proper communities, i. e. subgraphs with more internal than external edges. In addition, since partitions are defined by assigning probability values to the vertices, expressing the extent of their membership in a group, it is possible that some vertices are not clearly assigned to one group but to several groups, so the method is able to deal with overlapping communities. The main drawback of the algorithm is the fact that one needs to specify the number of groups at the beginning of the calculation, a number that is often unknown for real networks. It is possible to derive this information self-consistently by maximizing the probability that the data are reproduced by partitions with a given number of clusters. But this procedure involves some degree of approximation, and the results are often not good. L-shell method. This is an agglomerative method designed by Bagrow and Bollt [86]. The algorithm finds the community of any single vertex, although the authors also presented a more general procedure to identify the full community structure of the graph.
Communities are defined locally, based on a simple criterion involving the number of edges inside and outside a group of vertices. One starts from an origin vertex and keeps adding vertices lying on successive shells, where a shell is defined as a set of vertices at a fixed geodesic distance from the origin. The first shell includes the nearest neighbors of the origin, the second the next-to-nearest neighbors, and so on. At each iteration, one calculates the number of edges connecting vertices of the new layer to vertices inside and outside the running cluster. If the ratio of these two numbers ("emerging degree") exceeds some predefined threshold, the vertices of the new shell are added to the cluster; otherwise the process stops. Because of the local nature of the process, the algorithm is very fast and can identify communities quickly. By repeating the process starting from every vertex, one can derive a membership matrix M: the element M_ij is one if vertex j belongs to the community of vertex i, otherwise it is zero. The membership matrix can be rewritten by suitably permuting rows and columns based on their mutual distances. The distance between two rows (or columns) is defined as the number of entries whose elements differ. If the graph has a clear community structure, the membership matrix takes a block-diagonal form, where the blocks identify the communities. Unfortunately, the rearrangement of the matrix requires a time O(n³), so it is quite slow. In a different algorithm, local communities are discovered through greedy maximization of a local modularity measure [87]. Algorithm of Eckmann and Moses. This is another method where communities are defined based on a local criterion [88]. The idea is to use the clustering coefficient [44] of a vertex as a quantity to distinguish tightly connected groups of vertices. Many edges mean many loops inside a community, so the vertices of a community are likely to have a large clustering coefficient. The latter can be related to the average distance between pairs of neighbors of the vertex. The possible values of the distance are 1 (if neighbors are connected) or 2 (if they are not), so the average distance lies between 1 and 2. The more triangles there are in the subgraph, the shorter the average distance. Since each vertex is always at distance 1 from its neighbors, the fact that the average distance between its neighbors differs from 1 is reminiscent of what happens when one measures segments on a curved surface. Endowed with a metric, represented by the geodesic distance between vertices/points, and a curvature, the graph can be embedded in a geometric space. Communities appear as portions of the graph with a large curvature. The algorithm was applied to the graph representation of the World Wide Web, where vertices are Web pages and edges are the hyperlinks that take users from one page to another. The authors found that communities correspond to Web pages dealing with the same topic.
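Returning to the L-shell method of Bagrow and Bollt: the shell expansion can be sketched as below. The stopping criterion is one possible reading of the emerging-degree ratio described above (edges of the new shell into the running cluster versus edges leaving it); the threshold value and the toy graph are illustrative assumptions, not the authors' exact prescription.

```python
def l_shell_community(adj, origin, threshold=1.0):
    """L-shell sketch: absorb successive shells of vertices at growing
    geodesic distance from `origin` while the ratio of the new shell's
    edges into the running cluster over its edges leaving it exceeds
    `threshold`.  adj maps each vertex to its set of neighbours."""
    cluster = {origin}
    shell = set(adj[origin])               # first shell: nearest neighbours
    while shell:
        inside = sum(len(adj[v] & cluster) for v in shell)
        outside = sum(len(adj[v] - cluster - shell) for v in shell)
        if outside and inside / outside <= threshold:
            break                          # shell too loosely attached: stop
        cluster |= shell
        shell = set().union(*(adj[v] for v in shell)) - cluster
    return cluster

# toy graph: two 4-cliques joined by the single edge (3, 4)
adj = {v: set() for v in range(8)}
for a, b in ([(i, j) for i in range(4) for j in range(i + 1, 4)]
             + [(i, j) for i in range(4, 8) for j in range(i + 1, 8)]
             + [(3, 4)]):
    adj[a].add(b); adj[b].add(a)
community = l_shell_community(adj, 0)
```

Starting from vertex 0, the expansion absorbs the first shell {1, 2, 3} (three edges into the cluster, one leaving it) and then stops at vertex 4, returning the left clique {0, 1, 2, 3}.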
Algorithm of Sales–Pardo et al. This is an algorithm designed to detect hierarchical community structure (see Sect. "Hierarchies"), a realistic feature of many natural, social and technological networks, which most algorithms usually neglect. The authors [89] first introduce a similarity measure between pairs of vertices based on Newman–Girvan modularity: basically, the similarity between two vertices is the frequency with which they coexist in the same community in partitions corresponding to local optima of modularity. The latter are configurations for which modularity is stable, i. e. it cannot be increased by shifting a vertex from one cluster to another or by merging or splitting clusters. Next, the similarity matrix is put in block-diagonal form, by minimizing a cost function expressing the average distance of connected vertices from the diagonal. The blocks correspond to the communities and the recovered partition represents the largest-scale organization level. To determine levels at lower scales, one iterates the procedure for each subgraph identified at the previous level, which is considered as an independent graph. The method thus yields a hierarchy by construction, as communities at each level are nested within communities at higher levels. The algorithm is not fast, as both the search for local optima of modularity and the rearrangement of the similarity matrix are performed with simulated annealing, but it delivers good results for computer-generated networks, and meaningful partitions for some social, technological and biological networks. Algorithm by Rosvall and Bergstrom. The modular structure can be considered as a reduced description of a graph that approximates the whole information contained in its adjacency matrix. Based on this idea, Rosvall and Bergstrom [90] envisioned a communication process in which a partition of a network into communities represents a synthesis Y of the full structure that a signaler sends to a receiver, who tries to infer the original graph topology X from it. The best partition corresponds to the signal Y that contains the most information about X. This can be quantitatively assessed by the maximization of the mutual information I(X; Y) [91]. The method performs better than modularity optimization, especially when communities are of different size. The optimization of the mutual information is performed by simulated annealing, so the method is rather slow and can be applied to graphs with up to about 10^4 vertices. Testing Methods When a community detection algorithm is designed, it is necessary to test its performance, and compare it with other methods. Ideally, one would like to have graphs with known community structure and check whether the algorithm is able to find it, or how close it can come to it. In any case, one needs to compare partitions found by the
Community Structure in Graphs, Figure 10 Benchmark of Girvan and Newman. The three pictures correspond to z_in = 15 (a), z_in = 11 (b) and z_in = 8 (c). In (c) the four groups are basically invisible. Reprinted figure with permission from [12]
method with "real" partitions. How can different partitions of the same graph be compared? Danon et al. [92] used a measure borrowed from information theory, the normalized mutual information. One builds a confusion matrix N, whose element N_ij is the number of vertices of the real community i that are also in the detected community j. Since the partitions to be compared may have different numbers of clusters, N is usually not a square matrix. The similarity of two partitions A and B is given by the following expression

I(A, B) = −2 Σ_{i=1}^{c_A} Σ_{j=1}^{c_B} N_ij log(N_ij N / (N_i. N_.j)) / [Σ_{i=1}^{c_A} N_i. log(N_i. / N) + Σ_{j=1}^{c_B} N_.j log(N_.j / N)] ,   (8)

where c_A (c_B) is the number of communities in partition A (B), N_i. is the sum of the elements of N on row i and N_.j is the sum of the elements of N on column j. Another useful measure of similarity between partitions is the Jaccard index, which is regularly used in scientometric research. Given two partitions A and B, the Jaccard index is defined as

I_J(A, B) = n_11 / (n_11 + n_01 + n_10) ,   (9)
where n_11 is the number of pairs of vertices which are in the same community in both partitions and n_01 (n_10) denotes the number of pairs of vertices which are put in the same community in A (B) and in different communities in B (A). A nice presentation of criteria to compare partitions can be found in [93]. In the literature on community detection, algorithms have generally been tested on two types of graphs: computer-generated graphs and real networks. The most famous computer-generated benchmark is a class of graphs designed by Girvan and Newman [14]. Each graph consists of 128 vertices, arranged in four groups with 32 vertices each: 1–32, 33–64, 65–96 and 97–128. The average degree of each vertex is set to 16. The density of edges inside the groups is tuned by a parameter z_in, expressing the average number of edges shared by each vertex of a group with the other members (internal degree). Naturally, when z_in is close to 16, there is a clear community structure (see Fig. 10a), as most edges join vertices of the same community, whereas when z_in < 8 there are more edges connecting vertices of different communities and the graph looks fuzzy (see Fig. 10c). In this way, one can realize different degrees of mixing between the groups. In this case the test consists in calculating the similarity between the partitions determined by the method under study
and the natural partition of the graph in the four equal-sized groups. The similarity can be calculated by using the measure of Eq. (8), but in the literature a different quantity has typically been used, i. e. the fraction of correctly classified vertices. A vertex is correctly classified if it is in the same cluster with at least 16 of its "natural" partners. If the model partition has clusters given by the merging of two or more natural groups, all vertices of the cluster are considered incorrectly classified. The number of correctly classified vertices is then divided by the total size of the graph, to yield a number between 0 and 1. One usually builds many realizations of the graph for a particular value of z_in and computes the average fraction of correctly classified vertices, which is a measure of the sensitivity of the method. The procedure is then iterated for different values of z_in. Many different algorithms have been compared with each other according to the diagram where the fraction of correctly classified vertices is plotted against z_in. Most algorithms usually do a good job for large z_in and start to fail when z_in approaches 8. The recipe to label vertices as correctly or incorrectly classified is somewhat arbitrary, though, and measures like those of Eqs. (8) and (9) are probably more objective. There is also a subtle problem concerning the reliability of the test. Because of the randomness involved in the process of distributing edges among the vertices, it may well be that, in specific realizations of the graph, some vertices share more edges with members of another group than with members of their own. In this case, it is inappropriate to consider the initial partition in four groups as the real partition of the graph. Tests on real networks usually focus on a limited number of examples, for which one has precise information about the vertices and their properties.
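The benchmark and the Jaccard index of Eq. (9) can be sketched together. This is an illustrative version that draws edges independently with probabilities matching the prescribed average internal and external degrees (z_in/31 and (16 − z_in)/96); the function names are assumptions made here:

```python
import random
from itertools import combinations

def gn_benchmark(z_in, seed=0):
    """Girvan-Newman benchmark sketch: 128 vertices in four groups of 32;
    each vertex has on average z_in intra-group and 16 - z_in inter-group
    edges, drawn here as independent random edges."""
    rng = random.Random(seed)
    p_in, p_out = z_in / 31.0, (16.0 - z_in) / 96.0
    return [(u, v) for u in range(128) for v in range(u + 1, 128)
            if rng.random() < (p_in if u // 32 == v // 32 else p_out)]

def jaccard(part_a, part_b):
    """Jaccard index of Eq. (9) between two lists of community labels."""
    n11 = n01 = n10 = 0
    for u, v in combinations(range(len(part_a)), 2):
        sa, sb = part_a[u] == part_a[v], part_b[u] == part_b[v]
        n11 += sa and sb         # same community in both partitions
        n01 += sa and not sb     # together in A, separated in B
        n10 += sb and not sa     # together in B, separated in A
    return n11 / (n11 + n01 + n10)

edges = gn_benchmark(z_in=15)
planted = [v // 32 for v in range(128)]    # the four built-in groups
frac_internal = sum(u // 32 == v // 32 for u, v in edges) / len(edges)
```

With z_in = 15, the overwhelming majority of edges fall inside the four groups, so the planted partition is easy to recover; a detected partition would be scored against `planted` with `jaccard` (identical partitions score 1).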
The most popular real network with a known community structure is the social network of Zachary's karate club (see Fig. 11), which represents the personal relationships between members of a karate club at an American university. For two years, the sociologist Wayne Zachary observed the ties between members, both inside and outside the club [94]. At some point, a conflict arose between the club's administrator (vertex 1) and one of the teachers (vertex 33), which led to the split of the club into two smaller clubs, with some members staying with the administrator and the others following the instructor. Vertices of the two groups are highlighted by squares and circles in Fig. 11. The question is whether the actual social split could be predicted from the network topology. Several algorithms are indeed able to identify the two classes, apart from a few intermediate vertices, which may be misclassified (e. g. vertices 3, 10). Other methods are less successful: for instance, the maximum of Newman–Girvan
modularity corresponds to a split of the network into four groups [53,63]. It must be stressed, however, that the comparison of community structures detected by the various methods with the split of Zachary's karate club rests on a very strong assumption: that the split actually reproduced the separation of the social network into two communities. There is no real argument, beyond common wisdom, supporting this assumption. Two other networks have frequently been used to test community detection algorithms: the network of American college football teams derived by Girvan and Newman [14] and the social network of bottlenose dolphins constructed by Lusseau [95]. The same caveat applies to these networks: nothing guarantees that "reasonable" communities, defined on the basis of non-topological information, must coincide with those detected by methods based only on topology.
Community Structure in Graphs, Figure 11 Zachary's karate club network, an example of a graph with known community structure. Reprinted figure with permission from [26]
Community Structure in Graphs, Figure 12 Cumulative distribution of community sizes for the Amazon purchasing network. The partition is derived by greedy modularity optimization. Reprinted figure with permission from [47]
The Mesoscopic Description of a Graph Community detection algorithms have been applied to a huge variety of real systems, including social, biological and technological networks. The partitions found for each system are usually similar, as the algorithms, in spite of their specific implementations, are all inspired by closely related intuitive notions of community. What are the general properties of these partitions? The analysis of partitions and their properties delivers a mesoscopic description of the graph, where the communities, and not the vertices, are the elementary units of the topology. The term mesoscopic is used because the relevant scale here lies between the scale of the vertices and that of the full graph. A simple question is whether the communities of a graph are usually about the same size or whether the community sizes follow some special distribution. It turns out that the distribution of community sizes is skewed, with a tail that obeys a power law with exponents in the range between 1 and 3 [13,22,23,47]. So, there seems to be no characteristic size for a community: small communities usually coexist with large ones. As an example, Fig. 12 shows the cumulative distribution of community sizes for a recommendation network of the online vendor Amazon.com. Vertices are products, and there is a connection between items A and B if B was frequently purchased by buyers of A. We recall that the cumulative distribution is the integral of the probability distribution: if the cumulative distribution is a power law with exponent α, the probability distribution is also a power law, with exponent α + 1. If communities overlap, one can derive a network where the communities are the vertices and pairs of vertices are connected if their corresponding communities overlap [13]. Such networks seem to have some special properties. For instance, the degree distribution is a particular function, with an initial exponential decay followed by a slower power-law decay. A recent analysis has shown that such a distribution can be reproduced by assuming that the graph grows according to a simple preferential attachment mechanism, where communities with large degree have an enhanced chance to interact/overlap with new communities [21]. Finally, by knowing the community structure of a graph, it is possible to classify vertices according to their roles within their community, which may allow one to infer individual properties of the vertices. A nice classification has been proposed by Guimerà and Amaral [12,96]. The role of a vertex depends on the values of two indices, the z-score and the participation ratio, which determine the position of the vertex within its own module and with respect to the other modules. The z-score compares the internal degree of the vertex in its module with the average internal degree of the vertices in the module. The participation ratio says how the edges of the vertex are distributed among the modules. Based on these two indices, Guimerà and Amaral distinguish seven roles for a vertex. These roles seem to be correlated with the functions of vertices: in metabolic networks, for instance, vertices sharing many edges with vertices of other modules ("connectors") are often metabolites which are more conserved across species than other metabolites, i. e. they have an evolutionary advantage [12]. Future Directions
The problem of community detection is truly interdisciplinary. It involves scientists of different disciplines both in the design of algorithms and in their applications. The past years have witnessed huge progress and many novelties in this topic. Many methods have been developed, based on various principles. Their scalability has improved by at least one power in the graph size in just a couple of years. Currently, partitions in graphs with up to millions of vertices can be found. From this point of view, the limit is close, and future improvements in this sense are unlikely. Algorithms running in linear time are very quick, but their results are often not very good. The major breakthrough introduced by the new methods is the possibility of extracting graph partitions with no preliminary knowledge or input about the community structure of the graph. Most new algorithms do not need to know how many communities there are, which is a major drawback of computer science approaches: they derive this information from the graph topology itself. Similarly, algorithms of the new generation are able to select one or a few meaningful partitions, whereas social science approaches usually produce a whole hierarchy of partitions, among which they are unable to discriminate. Especially in the last two years, the quality of the output produced by some algorithms has considerably improved. Realistic aspects of community structure, like overlapping and hierarchical communities, are now often taken into account. The main question is: is there at present a good method to detect communities in graphs? The answer depends on what is meant by "good". Several algorithms give satisfactory results when they are tested as described in
Sect. "Testing Methods": in this respect, they can be considered good. However, if examined in more detail, some methods disclose serious limits and biases. For instance, the most popular method used nowadays, modularity optimization, is likely to give problems in the analysis of large graphs. Most algorithms are likely to fail in some limit; still, one can derive useful indications from them: from the comparison of partitions derived by different methods one could extract the cores of real communities. The ideal method is one that delivers meaningful partitions and handles overlapping communities and hierarchy, possibly in a short time. No such method exists yet. Finding a good method for community detection is a crucial endeavor in biology, sociology and computer science. In particular, biologists often rely on the application of clustering techniques to classify their data. Due to the bioinformatics revolution, gene regulatory networks, protein–protein interaction networks, metabolic networks, etc., are now much better known than they used to be and are finally susceptible to solid quantitative investigations. Uncovering their modular structure is an open challenge and a necessary step to discover properties of elementary biological constituents and to understand how biological systems work. Bibliography Primary Literature 1. Euler L (1736) Solutio problematis ad geometriam situs pertinentis. Commentarii Academiae Petropolitanae 8:128–140 2. Bollobás B (1998) Modern Graph Theory. Springer, New York 3. Wasserman S, Faust K (1994) Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge 4. Scott JP (2000) Social Network Analysis. Sage Publications Ltd, London 5. Albert R, Barabási AL (2002) Statistical mechanics of complex networks. Rev Mod Phys 74:47–97 6. Dorogovtsev SN, Mendes JFF (2003) Evolution of Networks: from biological nets to the Internet and WWW. Oxford University Press, Oxford 7.
Newman MEJ (2003) The structure and function of complex networks. SIAM Rev 45:167–256 8. Pastor-Satorras R, Vespignani A (2004) Evolution and structure of the Internet: A statistical physics approach. Cambridge University Press, Cambridge 9. Boccaletti S, Latora V, Moreno Y, Chavez M, Hwang DU (2006) Complex Networks: Structure and Dynamics. Phys Rep 424:175–308 10. Erdős P, Rényi A (1959) On Random Graphs. Publicationes Mathematicae Debrecen 6:290–297 11. Flake GW, Lawrence S, Lee Giles C, Coetzee FM (2002) Self-Organization and Identification of Web Communities. IEEE Comput 35(3):66–71 12. Guimerà R, Amaral LAN (2005) Functional cartography of complex metabolic networks. Nature 433:895–900
13. Palla G, Derényi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435:814–818 14. Girvan M, Newman MEJ (2002) Community structure in social and biological networks. Proc Nat Acad Sci USA 99(12):7821–7826 15. Lusseau D, Newman MEJ (2004) Identifying the role that animals play in their social networks. Proc R Soc Lond B 271:S477–S481 16. Pimm SL (1979) The structure of food webs. Theor Popul Biol 16:144–158 17. Krause AE, Frank KA, Mason DM, Ulanowicz RE, Taylor WW (2003) Compartments exposed in food-web structure. Nature 426:282–285 18. Granovetter M (1973) The Strength of Weak Ties. Am J Sociol 78:1360–1380 19. Burt RS (1976) Positions in Networks. Soc Force 55(1):93–122 20. Freeman LC (1977) A Set of Measures of Centrality Based on Betweenness. Sociometry 40(1):35–41 21. Pollner P, Palla G, Vicsek T (2006) Preferential attachment of communities: The same principle, but a higher level. Europhys Lett 73(3):478–484 22. Newman MEJ (2004) Detecting community structure in networks. Eur Phys J B 38:321–330 23. Danon L, Duch J, Arenas A, Díaz-Guilera A (2007) Community structure identification. In: Caldarelli G, Vespignani A (eds) Large Scale Structure and Dynamics of Complex Networks: From Information Technology to Finance and Natural Science. World Scientific, Singapore, pp 93–114 24. Bron C, Kerbosch J (1973) Finding all cliques of an undirected graph. Commun ACM 16:575–577 25. Radicchi F, Castellano C, Cecconi F, Loreto V, Parisi D (2004) Defining and identifying communities in networks. Proc Nat Acad Sci USA 101(9):2658–2663 26. Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69:026113 27. Arenas A, Fernández A, Fortunato S, Gómez S (2007) Motif-based communities in complex networks. arXiv:0710.0059 28. Reichardt J, Bornholdt S (2006) Statistical mechanics of community detection. Phys Rev E 74:016110 29.
Massen CP, Doye JPK (2006) Thermodynamics of community structure. arXiv:cond-mat/0610077 30. Arenas A, Duch J, Fernández A, Gómez S (2007) Size reduction of complex networks preserving modularity. New J Phys 9(6):176–180 31. Guimerà R, Sales-Pardo M, Amaral LAN (2004) Modularity from fluctuations in random graphs and complex networks. Phys Rev E 70:025101(R) 32. Fortunato S, Barthélemy M (2007) Resolution limit in community detection. Proc Nat Acad Sci USA 104(1):36–41 33. Gfeller D, Chappelier J-C, De Los Rios P (2005) Finding instabilities in the community structure of complex networks. Phys Rev E 72:056135 34. Pothen A (1997) Graph partitioning algorithms with applications to scientific computing. In: Keyes DE, Sameh A, Venkatakrishnan V (eds) Parallel Numerical Algorithms. Kluwer Academic Press, Boston, pp 323–368 35. Kernighan BW, Lin S (1970) An efficient heuristic procedure for partitioning graphs. Bell Syst Tech J 49:291–307 36. Golub GH, Van Loan CF (1989) Matrix computations. Johns Hopkins University Press, Baltimore
Community Structure in Graphs
37. JB MacQueen (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley. University of California Press, pp 281–297 38. Brandes U (2001) A faster algorithm for betweenness centrality. J Math Sociol 25(2):163–177 39. Newman MEJ (2005) A measure of betweenness centrality based on random walks. Soc Netw 27:39–54 40. Tyler JR, Wilkinson DM, Huberman BA (2003) Email as spectroscopy: automated discovery of community structure within organizations. In: Huysman M, Wenger E, Wulf V (eds) Proceeding of the First International Conference on Communities and Technologies. Kluwer Academic Press, Amsterdam 41. Wilkinson DM, Huberman BA (2004) A method for finding communities of related genes. Proc Nat Acad Sci USA 101(1): 5241–5248 42. Latora V, Marchiori M (2001) Efficient behavior of small-world networks. Phys Rev Lett 87:198701 43. Fortunato S, Latora V, Marchiori M (2004) A method to find community structures based on information centrality. Phys Rev E 70:056104 44. Watts D, Strogatz SH (1998) Collective dynamics of “smallworld” networks. Nature 393:440–442 45. Brandes U, Delling D, Gaertler M, Görke R, Hoefer M, Nikoloski Z, Wagner D (2007) On finding graph clusterings with maximum modularity. In: Proceedings of the 33rd International Workshop on Graph-Theoretical Concepts in Computer Science (WG’07). Springer, Berlin 46. Newman MEJ (2004) Fast algorithm for detecting community structure in networks. Phys Rev E 69:066133 47. Clauset A, Newman MEJ, Moore C (2004) Finding community structure in very large networks. Phys Rev E 70:066111 48. Danon L, Díaz-Guilera A, Arenas A (2006) The effect of size heterogeneity on community identification in complex networks. J Stat Mech Theory Exp 11:P11010 49. Pujol JM, Béjar J, Delgado J (2006) Clustering algorithm for determining community structure in large networks. Phys Rev E 74:016107 50. 
Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4598):671–680 51. Massen CP, Doye JPK (2005) Identifying communities within energy landscapes. Phys Rev E 71:046101 52. Boettcher S, Percus AG (2001) Optimization with extremal dynamics. Phys Rev Lett 86:5211–5214 53. Duch J, Arenas A (2005) Community detection in complex networks using extremal optimization. Phys Rev E 72:027104 54. Newman MEJ (2006) Modularity and community structure in networks. Proc Nat Acad Sci USA 103 (23):8577–8582 55. Newman MEJ (2006) Finding community structure in networks using the eigenvectors of matrices. Phys Rev E 74:036104 56. Reichardt J, Bornholdt S (2007) Partitioning and modularity of graphs with arbitrary degree distribution. Phys Rev E 76:015102(R) 57. Reichardt J, Bornholdt S (2006) When are networks truly modular? Phys D 224:20–26 58. Kumpula JM, Saramäki J, Kaski K, Kertész J (2007) Limited resolution in complex network community detection with Potts model approach. Eur Phys J B 56:41–45 59. Arenas A, Fernándes A, Gómez S (2007) Multiple resolution of the modular structure of complex networks. arXiv:physics/ 0703218
60. Ruan J, Zhang W (2007) Identifying network communities with high resolution. arXiv:0704.3759 61. Kumpula JM, Saramäki J, Kaski K, Kertész J (2007) Limited resolution and multiresolution methods in complex network community detection. In: Kertész J, Bornholdt S, Mantegna RN (eds) Noise and Stochastics in Complex Systems and Finance. Proc SPIE 6601:660116 62. Muff S, Rao F, Caflisch A (2005) Local modularity measure for network clusterizations. Phys Rev E 72:056107 63. Donetti L, Muñoz MA (2004) Detecting network communities: a new systematic and efficient algorithm. J Stat Mech Theory Exp P10012 64. Capocci A, Servedio VDP, Caldarelli G, Colaiori F (2004) Detecting communities in large networks. Phys A 352(2–4):669–676 65. Wu F, Huberman BA (2004) Finding communities in linear time: a physics approach. Eur Phys J B 38:331–338 66. Eriksen KA, Simonsen I, Maslov S, Sneppen K (2003) Modularity and extreme edges of the Internet. Phys Rev Lett 90(14):148701 67. Simonsen I, Eriksen KA, Maslov S, Sneppen K (2004) Diffusion on complex networks: a way to probe their large-scale topological structure. Physica A 336:163–173 68. Wu FY (1982) The Potts model. Rev Mod Phys 54:235–268 69. Blatt M, Wiseman S, Domany E (1996) Superparamagnetic clustering of data. Phys Rev Lett 76(18):3251–3254 70. Reichardt J, Bornholdt S (2004) Detecting fuzzy community structure in complex networks. Phys Rev Lett 93(21):218701 71. Mezard M, Parisi G, Virasoro M (1987) Spin glass theory and beyond. World Scientific Publishing Company, Singapore 72. Zhou H (2003) Network landscape from a Brownian particle’s perspective. Phys Rev E 67:041908 73. Zhou H (2003) Distance, dissimilarity index, and network community structure. Phys Rev E 67:061901 74. Zhou H, Lipowsky R (2004) Network Brownian motion: A new method to measure vertex-vertex proximity and to identify communities and subcommunities. Lect Notes Comput Sci 3038:1062–1069 75. 
Latapy M, Pons P 92005) Computing communities in large networks using random walks. Lect Notes Comput Sci 3733: 284–293 76. Arenas A, Díaz-Guilera A, Pérez-Vicente CJ (2006) Synchronization reveals topological scales in complex networks. Phys Rev Lett 96:114102 77. Kuramoto Y (1984) Chemical Oscillations, Waves and Turbulence. Springer, Berlin 78. Arenas A, Díaz-Guilera A (2007) Synchronization and modularity in complex networks. Eur Phys J ST 143:19–25 79. Boccaletti S, Ivanchenko M, Latora V, Pluchino A, Rapisarda A (2007) Detecting complex network modularity by dynamical clustering. Phys Rev E 76:045102(R) 80. Pluchino A, Latora V, Rapisarda A (2005) Changing opinions in a changing world: a new perspective in sociophysics. Int J Mod Phys C 16(4):505–522 81. Farkas I, Ábel D, Palla G, Vicsek T (2007) Weighted network modules. New J Phys 9:180 82. Palla G, Farkas IJ, Pollner P, Derényi I, Vicsek T (2007) Directed network modules. New J Phys 9:186 83. Palla G, Barabási A-L, Vicsek T (2007) Quantifying social groups evolution. Nature 446:664–667 84. van Dongen S (2000) Graph Clustering by Flow Simulation. Ph D thesis, University of Utrecht, The Netherlands
511
512
Community Structure in Graphs
85. Newman MEJ, Leicht E (2007) Mixture models and exploratory analysis in networks. Proc Nat Acad Sci USA 104(23):9564– 9569 86. Bagrow JP, Bollt EM (2005) Local method for detecting communities. Phys Rev E 72:046108 87. Clauset A (2005) Finding local community structure in networks. Phys Rev E 72:026132 88. Eckmann J-P, Moses E (2002) Curvature of co-links uncovers hidden thematic layers in the World Wide Web. Proc Nat Acad Sci USA 99(9):5825–5829 89. Sales-Pardo M, Guimerá R, Amaral LAN (2007) Extracting the hierarchical organization of complex systems. arXiv:0705. 1679 90. Rosvall M, Bergstrom CT (2007) An information-theoretic framework for resolving community structure in complex networks. Proc Nat Acad Sci USA 104(18):7327–7331 91. Shannon CE, Weaver V (1949) The Mathematical Theory of Communication. University of Illinois Press, Champaign 92. Danon L, Díaz-Guilera A, Duch J, Arenas A (2005) Comparing community structure identification. J Stat Mech Theory Exp P09008 93. Gustafsson M, Hörnquist M, Lombardi A (2006) Comparison
and validation of community structures in complex networks. Physica A 367:559–576 94. Zachary WW (1977) An information flow model for conflict and fission in small groups. J Anthr Res 33:452–473 95. Lusseau D (2003) The emergent properties of a dolphin social network. Proc R Soc Lond B 270(2):S186–188 96. Guimerá R, Amaral LAN (2005) Cartography of complex networks: modules and universal roles. J Stat Mech Theory Exp P02001
Books and Reviews Bollobás B (2001) Random Graphs. Cambridge University Press, Cambridge Chung FRK (1997) Spectral Graph Theory. CBMS Regional Conference Series in Mathematics 92. American Mathematical Society, Providence Dorogovtsev SN, Mendes JFF (2003) Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford University Press, Oxford Elsner U (1997) Graph Partitioning: a Survey. Technical Report 9727, Technische Universität Chemnitz, Chemnitz
Comparison of Discrete and Continuous Wavelet Transforms
PALLE E. T. JORGENSEN¹, MYUNG-SIN SONG²
¹ Department of Mathematics, The University of Iowa, Iowa City, USA
² Department of Mathematics and Statistics, Southern Illinois University, Edwardsville, USA

Article Outline
Glossary
Definition of the Subject
Introduction
The Discrete vs. Continuous Wavelet Algorithms
List of Names and Discoveries
History
Tools from Mathematics
A Transfer Operator
Future Directions
Literature
Acknowledgments
Bibliography

This glossary consists of a list of terms used inside the paper in mathematics, in probability, in engineering, and, on occasion, in physics. To clarify the seemingly confusing use of up to four different names for the same idea or concept, we have further added informal explanations spelling out the reasons behind the differences in current terminology from neighboring fields.
DISCLAIMER: This glossary has the structure of four areas. A number of terms are listed line by line, and each line is followed by an explanation. Some "terms" have up to four separate (yet commonly accepted) names.
Glossary

MATHEMATICS: function (measurable), PROBABILITY: random variable, ENGINEERING: signal, PHYSICS: state
Mathematically, functions may map between any two sets, say, from X to Y; but if X is a probability space (typically called Ω), it comes with a σ-algebra B of measurable sets, and a probability measure P. Elements E in B are called events, and P(E) the probability of E. Corresponding measurable functions with values in a vector space are called random variables, a terminology which suggests a stochastic viewpoint. The function values of a random variable may represent the outcomes of an experiment, for example "throwing of a die". Yet, function theory is widely used in engineering, where functions are typically thought of as signals. In this case, X may be the real line for time, or R^d. Engineers visualize functions as signals. A particular signal may have a stochastic component, and this feature simply introduces an extra stochastic variable into the "signal", for example noise. Turning to physics, in our present application, the physical functions will typically be in some L^2 space, and L^2-functions with unit norm represent quantum mechanical "states".

MATHEMATICS: sequence (incl. vector-valued), PROBABILITY: random walk, ENGINEERING: time series, PHYSICS: measurement
Mathematically, a sequence is a function defined on the integers Z or on subsets of Z, for example the natural numbers N. Hence, if time is discrete, this to the engineer represents a time series, such as a speech signal, or any measurement which depends on time. But we will also allow functions on lattices such as Z^d. In the case d = 2, we may be considering the grayscale numbers which represent exposure in a digital camera. In this case, the function (grayscale) is defined on a subset of Z^2, and is then simply a matrix. A random walk on Z^d is an assignment of a sequential and random motion as a function of time. The randomness presupposes assigned probabilities. But we will use the term "random walk" also in connection with random walks on combinatorial trees.

MATHEMATICS: nested subspaces, PROBABILITY: refinement, ENGINEERING: multiresolution, PHYSICS: scales of visual resolutions
While finite or infinite families of nested subspaces are ubiquitous in mathematics, and have been popular in Hilbert space theory for generations (at least since the 1930s), this idea was revived in a different guise in 1986 by Stéphane Mallat, then an engineering graduate student. In its adaptation to wavelets, the idea is now referred to as the multiresolution method.
What made the idea especially popular in the wavelet community was that it offered a skeleton on which various discrete algorithms in applied mathematics could be attached and turned into wavelet constructions in harmonic analysis. In fact what we now call multiresolutions have come to signify a crucial link between the world of discrete wavelet algorithms, which are popular in computational mathematics and in engineering (signal/image processing, data mining, etc.) on the one side, and on the other side continuous wavelet bases in function spaces, especially in L^2(R^d). Further, the multiresolution idea closely mimics how fractals are analyzed with the use of finite function systems. But in mathematics, or more precisely in operator theory, the underlying idea dates back to work of John von
Neumann, Norbert Wiener, and Herman Wold, where nested and closed subspaces in Hilbert space were used extensively in an axiomatic approach to stationary processes, especially for time series. Wold proved that any (stationary) time series can be decomposed into two different parts: The first (deterministic) part can be exactly described by a linear combination of its own past, while the second part is the opposite extreme; it is unitary, in the language of von Neumann. Von Neumann's version of the same theorem is a pillar in operator theory. It states that every isometry in a Hilbert space H is the unique sum of a shift isometry and a unitary operator, i.e., the initial Hilbert space H splits canonically as an orthogonal sum of two subspaces H_s and H_u in H, one which carries the shift operator, and the other, H_u, the unitary part. The shift isometry is defined from a nested scale of closed spaces V_n, such that the intersection of these spaces is H_u. Specifically,

\[ \cdots \subset V_{-1} \subset V_0 \subset V_1 \subset V_2 \subset \cdots \subset V_n \subset V_{n+1} \subset \cdots , \qquad \bigwedge_n V_n = H_u , \quad \text{and} \quad \bigvee_n V_n = H . \]
However, Stéphane Mallat was motivated instead by the notion of scales of resolutions in the sense of optics. This, in turn, is based on a certain "artificial-intelligence" approach to vision and optics, developed earlier by David Marr at MIT, an approach which imitates the mechanism of vision in the human eye. The connection from these developments in the 1980s back to von Neumann is this: Each of the closed subspaces V_n corresponds to a level of resolution in such a way that a larger subspace represents a finer resolution. Resolutions are relative, not absolute! In this view, the relative complement of the smaller (or coarser) subspace in the larger space then represents the visual detail which is added in passing from a blurred image to a finer one, i.e., to a finer visual resolution. This view became an instant hit in the wavelet community, as it offered a repository for the fundamental father and mother functions, also called the scaling function φ and the wavelet function ψ. Via a system of translation and scaling operators, these functions then generate nested subspaces, and we recover the scaling identities which initialize the appropriate algorithms. What results is now called the family of pyramid algorithms in wavelet analysis. The approach itself is called the multiresolution approach (MRA) to wavelets. And in the meantime various generalizations (GMRAs) have emerged.
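The scaling identities that initialize the pyramid algorithms can be checked directly in the simplest (Haar) case. The following is a minimal numerical sketch, not from the text: the function names `phi` and `psi` and the sample grid are illustrative, and the two-term identities below are specific to the Haar pair.

```python
# Haar case of the scaling identities: phi(x) = phi(2x) + phi(2x - 1)
# and psi(x) = phi(2x) - phi(2x - 1).  General wavelets replace the two
# unit coefficients by longer filter taps.

def phi(x):
    """Haar scaling (father) function: indicator of [0, 1)."""
    return 1.0 if 0.0 <= x < 1.0 else 0.0

def psi(x):
    """Haar wavelet (mother) function: +1 on [0, 1/2), -1 on [1/2, 1)."""
    if 0.0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1.0:
        return -1.0
    return 0.0

# Verify the identities on a dyadic sample grid (exact in floating point).
grid = [k / 16.0 for k in range(-8, 24)]
for x in grid:
    assert phi(x) == phi(2 * x) + phi(2 * x - 1)
    assert psi(x) == phi(2 * x) - phi(2 * x - 1)
```

The same two-coefficient pattern reappears below as the low-pass/high-pass filter pair of the discrete algorithm.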
In all of this, there was a second "accident" at play: As it turned out, pyramid algorithms in wavelet analysis now lend themselves via multiresolutions, or nested scales of closed subspaces, to an analysis based on frequency bands. Here we refer to bands of frequencies as they have already been used for a long time in signal processing. One reason for the success in varied disciplines of the same geometric idea is perhaps that it is closely modeled on how we historically have represented numbers in the positional number system. Analogies to the Euclidean algorithm seem especially compelling.

MATHEMATICS: operator, PROBABILITY: process, ENGINEERING: black box, PHYSICS: observable (if selfadjoint)
In linear algebra, students are familiar with the distinctions between (linear) transformations T (here called "operators") and matrices. For a fixed operator T : V → W, there is a variety of matrices, one for each choice of basis in V and in W. In many engineering applications, the transformations are not restricted to be linear, but instead represent some experiment ("black box", in Norbert Wiener's terminology), one with an input and an output, usually functions of time. The input could be an external voltage function, the black box an electric circuit, and the output the resulting voltage in the circuit. (The output is a solution to a differential equation.) This context is somewhat different from that of quantum mechanical (QM) operators T : V → V, where V is a Hilbert space. In QM, selfadjoint operators represent observables such as position Q and momentum P, or time and energy.

MATHEMATICS: Fourier dual pair, PROBABILITY: generating function, ENGINEERING: time/frequency, PHYSICS: P/Q
The following dual pairs, position Q/momentum P, and time/energy, may be computed with the use of Fourier series or Fourier transforms; and in this sense they are examples of Fourier dual pairs.
If for example time is discrete, then frequency may be represented by numbers in the interval [0, 2π); or in [0, 1) if we enter the number 2π into the Fourier exponential. Functions of the frequency are then periodic, so the two endpoints are identified. In the case of the interval [0, 1), 0 on the left is identified with 1 on the right. So a low frequency band is an interval centered at 0, while a high frequency band is an interval centered at 1/2. Let a function W on [0, 1) represent a probability assignment. Such functions W are thought of as "filters" in signal processing. We say that W is low-pass if it is 1 at 0, or if it is near 1 for frequencies near 0.
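As a concrete sketch of these conventions (Haar example; the names `W0`, `W1` and the grid are illustrative, not from the text): with frequency normalized to [0, 1), the squared modulus of the Haar low-pass filter is a probability assignment equal to 1 at 0, its high-pass partner equals 1 at 1/2, and together the two bands exhaust all frequencies.

```python
import cmath
import math

# Frequency theta runs over [0, 1); the factor 2*pi is absorbed into the
# Fourier exponential.  W0 is low-pass (W0(0) = 1), W1 high-pass
# (W1(1/2) = 1), and W0 + W1 = 1 identically.

def W0(theta):
    """|m0(theta)|^2 for the Haar low-pass filter m0 = (1 + e^{-2 pi i theta})/2."""
    return abs((1 + cmath.exp(-2j * math.pi * theta)) / 2) ** 2

def W1(theta):
    """|m1(theta)|^2 for the Haar high-pass filter m1 = (1 - e^{-2 pi i theta})/2."""
    return abs((1 - cmath.exp(-2j * math.pi * theta)) / 2) ** 2

assert abs(W0(0.0) - 1.0) < 1e-12     # low-pass: passes frequency 0
assert abs(W0(0.5)) < 1e-12           # ... and blocks frequency 1/2
assert abs(W1(0.5) - 1.0) < 1e-12     # high-pass: passes frequency 1/2
for k in range(16):                   # the two bands split all frequencies
    theta = k / 16.0
    assert abs(W0(theta) + W1(theta) - 1.0) < 1e-12
```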
Low-pass filters pass signals with low frequencies, and block the others. If instead some filter W is 1 at 1/2, or takes values near 1 for frequencies near 1/2, then we say that W is high-pass; it passes signals with high frequency.

MATHEMATICS: convolution, PROBABILITY: —, ENGINEERING: filter, PHYSICS: smearing
Pointwise multiplication of functions of frequencies corresponds in the Fourier dual time-domain to the operation of convolution (or of Cauchy product if the time-scale is discrete). The process of modifying a signal with a fixed convolution is called a linear filter in signal processing. The corresponding Fourier dual frequency function is then referred to as the "frequency response" or the "frequency response function". More generally, in the continuous case, since convolution tends to improve smoothness of functions, physicists call it "smearing."

MATHEMATICS: decomposition (e.g., Fourier coefficients in a Fourier expansion), PROBABILITY: —, ENGINEERING: analysis, PHYSICS: frequency components
Calculating the Fourier coefficients is "analysis," and adding up the pure frequencies (i.e., summing the Fourier series) is called "synthesis." But this view carries over more generally to engineering, where there are more operations involved on the two sides, e.g., breaking up a signal into its frequency bands, transforming further, and then adding up the "banded" functions in the end. If the signal out is the same as the signal in, we say that the analysis/synthesis yields perfect reconstruction.

MATHEMATICS: integrate (e.g., inverse Fourier transform), PROBABILITY: reconstruct, ENGINEERING: synthesis, PHYSICS: superposition
Here the terms related to "synthesis" refer to the second half of the kind of signal-processing design outlined in the previous paragraph.
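The convolution/multiplication duality in the convolution glossary entry can be seen numerically. A minimal sketch (the length-8 signal and the two-tap averaging filter are illustrative assumptions, not from the text): filtering a periodic discrete signal by circular convolution in the time domain agrees with pointwise multiplication by the filter's frequency response in the Fourier domain.

```python
import numpy as np

# Circular-convolution theorem: time-domain convolution with a filter h
# equals the inverse DFT of the pointwise product of the DFTs.

rng = np.random.default_rng(0)
x = rng.standard_normal(8)                     # "signal in"
h = np.array([0.5, 0.5, 0, 0, 0, 0, 0, 0])     # two-tap averaging filter, zero-padded

# Time domain: y[n] = sum_k h[k] * x[(n - k) mod 8]
y_time = np.array([sum(h[k] * x[(n - k) % 8] for k in range(8))
                   for n in range(8)])

# Frequency domain: multiply by the frequency response, then invert
y_freq = np.fft.ifft(np.fft.fft(h) * np.fft.fft(x)).real

assert np.allclose(y_time, y_freq)
```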
MATHEMATICS: subspace, PROBABILITY: —, ENGINEERING: resolution, PHYSICS: (signals in a) frequency band
For a space of functions (signals), the selection of certain frequencies serves as a way of selecting special signals. When the process of scaling is introduced into the optics of a digital camera, we note that a nested family of subspaces corresponds to a grading of visual resolutions.

MATHEMATICS: Cuntz relations, PROBABILITY: —, ENGINEERING: perfect reconstruction from subbands, PHYSICS: subband decomposition

\[ \sum_{i=0}^{N-1} S_i S_i^* = 1 , \quad \text{and} \quad S_i^* S_j = \delta_{i,j} 1 . \]
MATHEMATICS: inner product, PROBABILITY: correlation, ENGINEERING: transition probability, PHYSICS: probability of transition from one state to another
In many applications, a vector space with an inner product captures perfectly the geometric and probabilistic features of the situation. This can be axiomatized in the language of Hilbert space; and the inner product is the most crucial ingredient in the familiar axiom system for Hilbert space.

MATHEMATICS: f_out = T f_in, PROBABILITY: —, ENGINEERING: input/output, PHYSICS: transformation of states
Systems theory language for operators T : V → W. Then vectors in V are input, and those in the range of T are output.

MATHEMATICS: fractal, PROBABILITY: —, ENGINEERING: —, PHYSICS: —
Intuitively, think of a fractal as reflecting similarity of scales such as is seen in fern-like images that look "roughly" the same at small and at large scales. Fractals are produced from an infinite iteration of a finite set of maps, and this algorithm is perfectly suited to the kind of subdivision which is a cornerstone of the discrete wavelet algorithm. Self-similarity could refer alternately to space and to time. And further versatility is added, in that flexibility is allowed into the definition of "similar".

MATHEMATICS: —, PROBABILITY: —, ENGINEERING: data mining, PHYSICS: —
The problem of how to handle and make use of large volumes of data is a corollary of the digital revolution. As a result, the subject of data mining itself changes rapidly. Digitized information (data) is now easy to capture automatically and to store electronically. In science, commerce, and industry, data represent collected observations and information: In business, there are data on markets, competitors, and customers. In manufacturing, there are data for optimizing production opportunities, and for improving processes. A tremendous potential for data mining exists in medicine, genetics, and energy.
But raw data are not always directly usable, as is evident by inspection. A key to advances is our ability to extract information and knowledge from the data (hence "data mining"), and to understand the phenomena governing data sources. Data mining is now taught in a variety of forms in engineering departments, as well as in statistics and computer science departments.
One of the structures often hidden in data sets is some degree of scale. The goal is to detect and identify one or more natural global and local scales in the data. Once this is done, it is often possible to detect associated similarities of scale, much like the familiar scale-similarity from multidimensional wavelets, and from fractals. Indeed, various adaptations of wavelet-like algorithms have been shown to be useful. These algorithms themselves are useful in detecting scale-similarities, and are applicable to other types of pattern recognition. Hence, in this context, generalized multiresolutions offer another tool for discovering structures in large data sets, such as those stored in the resources of the Internet. Because of the sheer volume of data involved, a strictly manual analysis is out of the question. Instead, sophisticated query processors based on statistical and mathematical techniques are used in generating insights and extracting conclusions from data sets.

Multiresolutions
Haar's work in 1909–1910 had implicitly the key idea which got wavelet mathematics started on a roll 75 years later with Yves Meyer, Ingrid Daubechies, Stéphane Mallat, and others, namely the idea of a multiresolution. In that respect Haar was ahead of his time. See Figs. 1 and 2 for details.

\[ V_{-1} \subset V_0 \subset V_1 \; ; \qquad V_0 + W_0 = V_1 \]

Comparison of Discrete and Continuous Wavelet Transforms, Figure 1: Multiresolution. L^2(R^d)-version (continuous); φ ∈ V_0, ψ ∈ W_0
Comparison of Discrete and Continuous Wavelet Transforms, Figure 2: Multiresolution. l^2(Z)-version (discrete); φ ∈ V_0, ψ ∈ W_0

The word "multiresolution" suggests a connection to optics from physics. So that should have been a hint to mathematicians to take a closer look at trends in signal and image processing! Moreover, even staying within mathematics, it turns out that as a general notion this same idea of a "multiresolution" has long roots in mathematics, even in such modern and pure areas as operator theory and Hilbert-space geometry. Looking even closer at these interconnections, we can now recognize scales of subspaces (so-called multiresolutions) in classical algorithmic constructions of orthogonal bases in inner-product spaces, now taught in lots of mathematics courses under the name of the Gram–Schmidt algorithm. Indeed, a closer look at good old Gram–Schmidt reveals that it is a matrix algorithm; hence new mathematical tools involving non-commutativity!
If the signal to be analyzed is an image, then why not select a fixed but suitable resolution (or a subspace of signals corresponding to a selected resolution), and then do the computations there? The selection of a fixed "resolution" is dictated by practical concerns. That idea was key in turning computation of wavelet coefficients into iterated matrix algorithms. As the matrix operations get large, the computation is carried out in a variety of paths arising from big matrix products. The dichotomy, continuous vs. discrete, is quite familiar to engineers. The industrial engineers typically work with huge volumes of numbers.
Numbers! – So why wavelets? Well, what matters to the industrial engineer is not really the wavelets, but the fact that special wavelet functions serve as an efficient way to encode large data sets – I mean encode for computations. And the wavelet algorithms are computational. They work on numbers. Encoding numbers into pictures, images, or graphs of functions comes later, perhaps at the very end of the computation. But without the graphics, I doubt that we would understand any of this half as well as we do now. The same can be said for the many issues that relate to the crucial mathematical concept of self-similarity, as we know it from fractals, and more generally from recursive algorithms.

Definition of the Subject
In this paper we outline several points of view on the interplay between discrete and continuous wavelet transforms, stressing both pure and applied aspects. We outline some new links between the two transform technologies based on the theory of representations of generators and relations.
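The observation that Gram–Schmidt is at bottom a matrix algorithm can be made concrete. The sketch below (the function name `gram_schmidt` and the 4×3 test matrix are illustrative assumptions) orthonormalizes the columns of a matrix A and records the coefficients, which is exactly a QR factorization A = QR.

```python
import numpy as np

# Gram-Schmidt written as a matrix algorithm (modified variant):
# orthonormalizing the columns of A produces the factors of A = Q R,
# with Q having orthonormal columns and R upper triangular.

def gram_schmidt(A):
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):              # subtract projections onto earlier columns
            R[i, j] = Q[:, i] @ v
            v -= R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

A = np.array([[1., 1., 0.],
              [1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 1.]])
Q, R = gram_schmidt(A)
assert np.allclose(Q.T @ Q, np.eye(3))   # columns of Q are orthonormal
assert np.allclose(Q @ R, A)             # ... and A = Q R
```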
By this we mean a finite system of generators which are represented by operators in Hilbert space. We further outline how these representations yield sub-band filter banks for signal- and image-processing algorithms. The term "wavelet transform" (WT) means different things to different people: Pure and applied mathematicians typically give different answers to the question "What is the WT?" And engineers in turn have their own, quite different, preferred approach to WTs. Still there are two main trends in how WTs are used: the continuous WT on one side, and the discrete WT on the other. Here we offer
a user-friendly outline of both, but with a slant toward geometric methods from the theory of operators in Hilbert space. Our paper is organized as follows: For the benefit of diverse reader groups, we begin with the Glossary (Sect. "Glossary"). This is a substantial part of our account, and it reflects the multiplicity of ways the subject is used. The concept of multiresolutions, or multiresolution analysis (MRA), serves as a link between the discrete and continuous theory. In Sect. "List of Names and Discoveries", we summarize how different mathematicians and scientists have contributed to and shaped the subject over the years. The next two sections then offer a technical overview of both the discrete and the continuous WTs. This includes basic tools from Fourier analysis and from operators in Hilbert space. In Sect. "Tools from Mathematics" and Sect. "A Transfer Operator", we outline the connections between the separate parts of mathematics and their applications to WTs.

Introduction
While applied problems such as time series, signals and processing of digital images come from engineering and from the sciences, they have in the past two decades taken on a life of their own as an exciting new area of applied mathematics. While searches in Google on these keywords typically yield sites numbered in the millions, the diversity of applications is wide, and it seems reasonable here to narrow our focus to some of the approaches that are both more mathematical and more recent. For references, see for example [1,6,23,31]. In addition, our own interests (e.g., [20,21,27,28]) have colored the presentation below. Each of the two areas, the discrete side and the continuous theory, is huge as measured by recent journal publications. A leading theme in our article is the independent interest in a multitude of interconnections between the discrete algorithms and their uses in the more mathematical analysis of function spaces (continuous wavelet transforms).
We feel that the mathematics involved in the study and the applications of this interaction is of benefit to both mathematicians and engineers. See also [20]. An early paper [9] by Daubechies and Lagarias was especially influential in connecting the two worlds, discrete and continuous.
The Discrete vs. Continuous Wavelet Algorithms

The Discrete Wavelet Transform
If one stays with function spaces, it is then popular to pick the d-dimensional Lebesgue measure on R^d, d = 1, 2, ..., and pass to the Hilbert space L^2(R^d) of all square integrable functions on R^d, referring to d-dimensional Lebesgue measure. A wavelet basis refers to a family of basis functions for L^2(R^d) generated from a finite set of normalized functions ψ_i, the index i chosen from a fixed and finite index set I, and from two operations, one called scaling, and the other translation. The scaling is typically specified by a d by d matrix A over the integers Z such that all the eigenvalues in modulus are bigger than one, i.e., lie outside the closed unit disk in the complex plane. The d-lattice is denoted Z^d, and the translations will be by vectors selected from Z^d. We say that we have a wavelet basis if the triple-indexed family

\[ \psi_{i,j,k}(x) := |\det A|^{j/2} \, \psi_i(A^j x + k) \]

forms an orthonormal basis (ONB) for L^2(R^d) as i varies in I, j ∈ Z, and k ∈ Z^d. The word "orthonormal" for a family F of vectors in a Hilbert space H refers to the norm and the inner product in H: The vectors in an orthonormal family F are assumed to have norm one, and to be mutually orthogonal. If the family is also total (i.e., the vectors in F span a subspace which is dense in H), we say that F is an orthonormal basis (ONB).
While there are other popular wavelet bases, for example frame bases and dual bases (see e.g., [2,18] and the papers cited there), the ONBs are the most agreeable, at least from the mathematical point of view. That there are bases of this kind is not at all clear, and the subject of wavelets in this continuous context has gained much from its connections to the discrete world of signal- and image-processing. Here we shall outline some of these connections with an emphasis on the mathematical context. So we will be stressing the theory of Hilbert space, and bounded linear operators acting in Hilbert space H, both individual operators, and families of operators which form algebras.
As was noticed recently, the operators which specify particular subband algorithms from the discrete world of signal-processing turn out to satisfy relations that were found (or rediscovered independently) in the theory of operator algebras, and which go under the name of Cuntz algebras, denoted O_N if N is the number of bands. For additional details, see e.g., [21]. In symbols, the C*-algebra has generators (S_i)_{i=0}^{N-1}, and the relations are

\[ \sum_{i=0}^{N-1} S_i S_i^* = 1 \tag{1} \]

(where 1 is the identity element in O_N) and

\[ S_i^* S_j = \delta_{i,j} 1 . \tag{2} \]
Comparison of Discrete and Continuous Wavelet Transforms
In a representation on a Hilbert space, say H, the symbols S_i turn into bounded operators, also denoted S_i, and the identity element 1 turns into the identity operator I in H, i.e., the operator I: h → h for h ∈ H. In operator language, the two formulas (1) and (2) state that each S_i is an isometry in H, and that the respective ranges S_i H are mutually orthogonal, i.e., S_i H ⊥ S_j H for i ≠ j. Introducing the projections P_i = S_i S_i^*, we get P_i P_j = δ_{i,j} P_i and

    \sum_{i=0}^{N-1} P_i = I .
In the engineering literature this takes the form of programming diagrams: Fig. 3. If the process of Fig. 3 is repeated, we arrive at the discrete wavelet transform (Fig. 4), or, stated in the form of images (n = 5), Fig. 5. Selecting a resolution subspace V_0 = closure span{φ(· − k) | k ∈ Z}, we arrive at a wavelet subdivision {ψ_{j,k} | j ≥ 0, k ∈ Z}, where ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k), and the continuous expansion f = \sum_{j,k} ⟨ψ_{j,k} | f⟩ ψ_{j,k}, or the discrete analogue derived from the isometries S_0^k S_i, for i = 1, 2, ..., N − 1 and k = 0, 1, 2, ..., called the discrete wavelet transform.
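The iterated subdivision can be made concrete with the simplest (Haar) filter choice. The following sketch is illustrative only, with hypothetical helper names, not code from the article; the 1/√2 normalization makes each step an isometry:

```python
# One step of the Haar subband split: averages (low-pass) and differences
# (high-pass); the 1/sqrt(2) normalization makes each step an isometry.
import math

def haar_split(x):
    """Return (averages, details) of an even-length sequence x."""
    s = math.sqrt(2.0)
    avg = [(x[2 * i] + x[2 * i + 1]) / s for i in range(len(x) // 2)]
    det = [(x[2 * i] - x[2 * i + 1]) / s for i in range(len(x) // 2)]
    return avg, det

def haar_merge(avg, det):
    """Invert haar_split (perfect reconstruction)."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(avg, det):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def dwt(x, levels):
    """Iterate the split on the average band: the discrete wavelet transform."""
    details = []
    for _ in range(levels):
        x, d = haar_split(x)
        details.append(d)
    return x, details   # coarsest averages, plus one detail band per level
```

Reapplying `haar_split` to the average band only corresponds to the pyramid of Fig. 4; the successive detail bands are what the compositions S_0^k S_1 pick out.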
Comparison of Discrete and Continuous Wavelet Transforms, Figure 3 Perfect reconstruction in a subband filtering as used in signal- and image-processing
Comparison of Discrete and Continuous Wavelet Transforms, Figure 4 Binary decision tree for a sequential selection of filters.
Notational Convention In algorithms, the letter N is popular, and often used for counting more than one thing. In the present context of the discrete wavelet algorithm (DWA) or DWT, we count two things: "the number of times a picture is decomposed via subdivision," for which we have used n, and the other, related but different, number N, which is the number of subbands: N = 2 for the dyadic DWT, and N = 4 for the image DWT. The image-processing WT in our present context is the tensor product of the 1-D dyadic WT, so 2 × 2 = 4. Caution: not all DWAs arise as tensor products of N = 2 models. The wavelets coming from tensor products are called separable. When a particular image-processing scheme is used for generating continuous wavelets, it is not transparent whether we are looking at a separable or an inseparable wavelet!
Comparison of Discrete and Continuous Wavelet Transforms, Figure 5 The subdivided squares represent the use of the pyramid subdivision algorithm in image processing, as it is used on pixel squares. At each subdivision step the top left-hand square represents averages of nearby pixel numbers, averages taken with respect to the chosen low-pass filter, while the three directions, horizontal, vertical, and diagonal, represent detail differences, with the three represented by separate bands and filters. So in this model there are four bands, and they may be realized by a tensor product construction applied to dyadic filters in the separate x- and y-directions in the plane. For the discrete WT used in image processing, we use iteration of four isometries S_0, S_H, S_V, and S_D with mutually orthogonal ranges, satisfying the following sum rule S_0 S_0^* + S_H S_H^* + S_V S_V^* + S_D S_D^* = I, with I denoting the identity operator in an appropriate ℓ^2-space
The Continuous Wavelet Transform

Consider functions f on the real line R. We select the Hilbert space of functions to be L^2(R). To start a continuous WT, we must select a function ψ ∈ L^2(R) and r, s ∈ R such that the following family of functions

    ψ_{r,s}(x) := r^{-1/2} ψ((x − s)/r)

creates an over-complete basis for L^2(R). An over-complete family of vectors in a Hilbert space is often called a coherent decomposition; this terminology comes from quantum optics. What is needed for a continuous WT in the simplest case is the following representation, valid for all f ∈ L^2(R):

    f(x) = C_ψ^{-1} ∬_{R²} ⟨ψ_{r,s} | f⟩ ψ_{r,s}(x) dr ds / r²

where C_ψ := ∫_R |ψ̂(ω)|² dω/|ω| < ∞ and ⟨ψ_{r,s} | f⟩ = ∫_R ψ̄_{r,s}(y) f(y) dy. The refinements and implications of this are spelled out in tables in Sect. "Connections to Group Theory".

Comparison of Discrete and Continuous Wavelet Transforms, Figure 6 n = 2. The selection of filters (Fig. 4) is represented by the use of one of the operators S_i in Fig. 3. A planar version of this principle is illustrated in Fig. 6. For a more detailed discussion, see, e.g., [3]
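Numerically, a wavelet coefficient ⟨ψ_{r,s} | f⟩ can be approximated by a Riemann sum. A minimal sketch with the admissible (real) Mexican-hat wavelet; all function names here are illustrative assumptions, not the authors' code:

```python
# Continuous wavelet coefficient <psi_{r,s} | f> by direct numerical
# integration; psi is the (real) Mexican-hat wavelet, which is admissible.
import math

def psi(u):
    return (1.0 - u * u) * math.exp(-u * u / 2.0)

def psi_rs(x, r, s):
    # psi_{r,s}(x) = r^{-1/2} psi((x - s)/r)
    return r ** -0.5 * psi((x - s) / r)

def cwt_coeff(f, r, s, lo=-30.0, hi=30.0, dx=0.01):
    n = int((hi - lo) / dx)
    return dx * sum(psi_rs(lo + i * dx, r, s) * f(lo + i * dx) for i in range(n))
```

For f = ψ_{r₀,s₀} the coefficient at (r₀, s₀) is the squared norm ∫ψ² = (3/4)√π of this un-normalized Mexican hat, and it decays as (r, s) moves away; scanning over a grid of (r, s) gives the usual scalogram.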
Some Background on Hilbert Space

Wavelet theory is the art of finding a special kind of basis in Hilbert space. Let H be a Hilbert space over C and denote the inner product ⟨· | ·⟩. For us, it is assumed linear in the second variable. If H = L²(R), then

    ⟨f | g⟩ := ∫_R f̄(x) g(x) dx .
To clarify the distinction, it is helpful to look at the representations of the Cuntz relations by operators in Hilbert space. We are dealing with representations of the two distinct algebras O_2 and O_4: two frequency subbands vs. four subbands. Note that the Cuntz algebras O_2 and O_4 are given axiomatically, or purely symbolically. It is only when subband filters are chosen that we get representations. This also means that the choice of N is made initially, and the same N is used in different runs of the programs. In contrast, the number of times a picture is decomposed varies from one experiment to the next! Summary: N = 2 for the dyadic DWT: the operators in the representation are S_0, S_1, one average operator and one detail operator. The detail operator S_1 "counts" local detail variations. Image processing: then N = 4 is fixed as we run different images in the DWT: the operators are now S_0, S_H, S_V, S_D, one average operator and three detail operators for local detail variations in the three directions in the plane.
If H = ℓ²(Z), then

    ⟨ξ | η⟩ := \sum_{n∈Z} ξ̄_n η_n .
Let T = R/2πZ. If H = L²(T), then

    ⟨f | g⟩ := (1/2π) ∫ f̄(θ) g(θ) dθ .

Functions f ∈ L²(T) have Fourier series: setting e_n(θ) = e^{inθ},

    f̂(n) := ⟨e_n | f⟩ = (1/2π) ∫ e^{−inθ} f(θ) dθ ,

and

    ‖f‖²_{L²(T)} = \sum_{n∈Z} |f̂(n)|² .
Similarly, if f ∈ L²(R), then

    f̂(t) := ∫_R e^{−ixt} f(x) dx ,

and

    ‖f‖²_{L²(R)} = (1/2π) ∫_R |f̂(t)|² dt .

Let J be an index set. We shall only need to consider the case when J is countable. Let {ψ_α}_{α∈J} be a family of nonzero vectors in a Hilbert space H. We say it is an orthonormal basis (ONB) if

    ⟨ψ_α | ψ_β⟩ = δ_{α,β}  (Kronecker delta)                          (3)

and if

    \sum_{α∈J} |⟨ψ_α | f⟩|² = ‖f‖²  holds for all f ∈ H .             (4)

If only Eq. (4) is assumed, but not Eq. (3), we say that {ψ_α}_{α∈J} is a (normalized) tight frame. We say that it is a frame with frame constants 0 < A ≤ B < ∞ if

    A ‖f‖² ≤ \sum_{α∈J} |⟨ψ_α | f⟩|² ≤ B ‖f‖²  holds for all f ∈ H .

Introducing the rank-one operators Q_α := |ψ_α⟩⟨ψ_α| of Dirac's terminology, see [3], we see that {ψ_α}_{α∈J} is an ONB if and only if the Q_α's are projections and

    \sum_{α∈J} Q_α = I  (= the identity operator in H) .              (5)

It is a (normalized) tight frame if and only if Eq. (5) holds, but with no further restriction on the rank-one operators Q_α. It is a frame with frame constants A and B if the operator

    S := \sum_{α∈J} Q_α

satisfies A I ≤ S ≤ B I in the order of Hermitian operators. (We say that operators H_i = H_i^*, i = 1, 2, satisfy H_1 ≤ H_2 if ⟨f | H_1 f⟩ ≤ ⟨f | H_2 f⟩ holds for all f ∈ H.) If h, k are vectors in a Hilbert space H, then the operator A = |h⟩⟨k| is defined by the identity ⟨u | Av⟩ = ⟨u | h⟩⟨k | v⟩ for all u, v ∈ H. Wavelets in L²(R) are generated by simple operations on one or more functions ψ in L²(R); the operations come in pairs, say scaling and translation, or phase-modulation and translation. If N ∈ {2, 3, ...} we set

    ψ_{j,k}(x) := N^{j/2} ψ(N^j x − k)  for j, k ∈ Z .                (6)

Increasing the Dimension

In wavelet theory [7] there is a tradition for reserving φ for the father function and ψ for the mother function. A 1-level wavelet transform of an N × M image can be represented as

    f ↦ ( a¹  h¹
          v¹  d¹ )

where the subimages h¹, d¹, a¹, and v¹ each have the dimension N/2 by M/2, and

    a¹ = V_m¹ ⊗ V_n¹ : φ^A(x, y) = φ(x) φ(y) = \sum_i \sum_j h_i h_j φ(2x − i) φ(2y − j)
    h¹ = W_m¹ ⊗ V_n¹ : ψ^H(x, y) = ψ(x) φ(y) = \sum_i \sum_j g_i h_j φ(2x − i) φ(2y − j)
    v¹ = V_m¹ ⊗ W_n¹ : ψ^V(x, y) = φ(x) ψ(y) = \sum_i \sum_j h_i g_j φ(2x − i) φ(2y − j)    (7)
    d¹ = W_m¹ ⊗ W_n¹ : ψ^D(x, y) = ψ(x) ψ(y) = \sum_i \sum_j g_i g_j φ(2x − i) φ(2y − j)

where φ is the father function and ψ is the mother function in the sense of wavelets, the V spaces denote the average spaces, and the W spaces are the difference spaces from multiresolution analysis (MRA) [7]. In the formulas, we have the following two indexed number systems a := (h_i) and d := (g_i): a is for averages, and d is for local differences. They are really the input for the DWT. But they are also the key link between the two transforms, the discrete and the continuous. The link is made up of the following scaling identities:

    φ(x) = 2 \sum_{i∈Z} h_i φ(2x − i) ;
    ψ(x) = 2 \sum_{i∈Z} g_i φ(2x − i) ;

and (low-pass normalization) \sum_{i∈Z} h_i = 1. The scalars (h_i) may be real or complex; they may be finite or infinite in number. If there are four of them, it is called the "four tap," etc. The finite case is best for computations, since it corresponds to compactly supported functions. This means that the two functions φ and ψ will vanish outside some finite interval on the real line.
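A one-level separable transform of the kind in Eq. (7) can be sketched by splitting rows and then columns with the Haar pair. This is an illustration with hypothetical helper names (label conventions for h¹/v¹ vary in the literature), not the article's implementation:

```python
# One level of the separable (tensor-product) 2-D transform, Haar filters:
# rows are split into average/detail halves, then the columns of each half.
import math

S2 = math.sqrt(2.0)

def split_1d(x):
    a = [(x[2 * i] + x[2 * i + 1]) / S2 for i in range(len(x) // 2)]
    d = [(x[2 * i] - x[2 * i + 1]) / S2 for i in range(len(x) // 2)]
    return a, d

def transpose(m):
    return [list(col) for col in zip(*m)]

def split_cols(m):
    a_cols, d_cols = [], []
    for col in transpose(m):
        a, d = split_1d(col)
        a_cols.append(a)
        d_cols.append(d)
    return transpose(a_cols), transpose(d_cols)

def wavelet2d_level1(img):
    lo_rows, hi_rows = [], []
    for row in img:
        a, d = split_1d(row)
        lo_rows.append(a)
        hi_rows.append(d)
    a1, v1 = split_cols(lo_rows)   # average and "vertical" detail
    h1, d1 = split_cols(hi_rows)   # "horizontal" and diagonal detail
    return a1, h1, v1, d1
```

Each subimage has half the rows and half the columns, matching the N/2-by-M/2 statement above, and the four bands together preserve the energy of the input.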
The two number systems are further subjected to orthogonality relations, of which

    \sum_{i∈Z} h̄_i h_{i+2k} = (1/2) δ_{0,k}                          (8)
is the best known. The systems h and g are the low-pass and high-pass filter coefficients, respectively. In Eq. (7), a¹ denotes the first averaged image, which consists of average intensity values of the original image. Note that only the φ function, the V spaces, and the h coefficients are used here. Similarly, h¹ denotes the first detail image of horizontal components, which consists of intensity differences along the vertical axis of the original image. Note that the φ function is used on y and the ψ function on x, the W space for x values and the V space for y values, and both h and g coefficients are used accordingly. The data v¹ denote the first detail image of vertical components, which consists of intensity differences along the horizontal axis of the original image. Note that the φ function is used on x and the ψ function on y, the W space for y values and the V space for x values, and both h and g coefficients are used accordingly. Finally, d¹ denotes the first detail image of diagonal components, which consists of intensity differences along the diagonal axis of the original image; it could be noted that only the ψ function, the W spaces, and the g coefficients are used here. The original image is reconstructed from the decomposed image by taking the sum of the averaged image and the detail images and scaling by a scaling factor. See [28,33]. This decomposition is not limited to one step; it can be done again and again on the averaged image, depending on the size of the image. Once it stops at a certain level, quantization (see [26,32]) is done on the image. This quantization step may be lossy or lossless. Then lossless entropy encoding is done on the decomposed and quantized image.

The relevance of the system of identities Eq. (8) may be summarized as follows. Set

    m_0(z) := \sum_{k∈Z} h_k z^k  for all z ∈ T ;
    g_k := (−1)^k h̄_{1−k}  for all k ∈ Z ;
    m_1(z) := \sum_{k∈Z} g_k z^k ;  and
    (S_j f)(z) = √2 m_j(z) f(z²) ,  for j = 0, 1 ,  f ∈ L²(T) ,  z ∈ T .

Then the following conditions are equivalent:
(a) The system of Eq. (8) is satisfied.
Comparison of Discrete and Continuous Wavelet Transforms, Figure 7 Matrix representation of filter operations.
(b) The operators S_0 and S_1 satisfy the Cuntz relations.
(c) We have perfect reconstruction in the subband system of Fig. 4.

Note that the two operators S_0 and S_1 have equivalent matrix representations. Recalling that, by Parseval's formula, we have L²(T) ≅ ℓ²(Z), we may represent S_0 instead as an ∞ × ∞ matrix acting on column vectors x = (x_j)_{j∈Z}, and we get

    (S_0 x)_i = √2 \sum_{j∈Z} h_{i−2j} x_j ,

and for the adjoint operator F_0 := S_0^*, we get the matrix representation

    (F_0 x)_i = √2 \sum_{j∈Z} h̄_{j−2i} x_j ,

with the overbar signifying complex conjugation. There is computational significance to the two matrix representations: both the matrix for S_0 and that for F_0 := S_0^* are slanted. However, the slanting of one is the mirror image of the other, i.e., Fig. 7.

Significance of Slanting

The slanted matrix representations refer to the corresponding operators in L². In general, operators in Hilbert function spaces have many matrix representations, one for each orthonormal basis (ONB), but here we are concerned with the ONB consisting of the Fourier frequencies z^j, j ∈ Z. So in our matrix representations for the S operators and their adjoints, we will be acting on column vectors, each infinite column representing a vector in the sequence space ℓ². A vector in ℓ² is said to be of finite size if it has only a finite set of non-zero entries. It is the matrix F_0 that is effective for iterated matrix computation. Reason: when a column vector x of a fixed size, say 2s, is multiplied, or acted on, by F_0, the result is a vector y of half the size, i.e., of size s; so y = F_0 x. If we use F_0 and F_1 together on x, then we get two vectors, each of size s, the other one being z = F_1 x, and we can form the combined column vector by stacking y on top of z. In our application, y represents averages, while z represents local differences: hence the wavelet algorithm:

    [ y ; z ] = [ F_0 ; F_1 ] x ,   i.e.,   y = F_0 x ,  z = F_1 x .

Connections to Group Theory

The first line in the two tables below is the continuous wavelet transform. It comes from what in physics is called coherent vector decompositions. Both transforms apply to vectors in a Hilbert space H, and H may vary from case to case. Common to all transforms is vector input and output. If the input agrees with the output, we say that the combined process yields the identity operator 1: H → H, also written 1_H. So, for example, if (S_i)_{i=0}^{N−1} is a finite operator system, an input/output operator example may take the form

    \sum_{i=0}^{N−1} S_i S_i^* = 1_H .

The summary of, and variations on, the resolution of the identity operator 1 in L² or in ℓ², for ψ and ψ̃, where

    ψ_{r,s}(x) = r^{−1/2} ψ((x − s)/r) ,   C_ψ = ∫_R |ψ̂(ω)|² dω/|ω| < ∞ ,

and similarly for ψ̃ and C_{ψ,ψ̃} = ∫_R ψ̂̄(ω) ψ̃̂(ω) dω/|ω|, is given in Table 1. Then the assertions in Table 1 amount to the equations in Table 2. A function ψ satisfying the resolution identity is called a coherent vector in mathematical physics. The representation theory for the (ax + b)-group, i.e., the matrix group G = { (a b; 0 1) | a > 0, b ∈ R }, serves as its underpinning. The tables then illustrate how the {ψ_{j,k}} wavelet system arises from a discretization of the following unitary representation of G:

    (U_{(a b; 0 1)} f)(x) = a^{−1/2} f((x − b)/a)

acting on L²(R). This unitary representation also explains the discretization step in passing from the first line to the second in the tables. The functions {ψ_{j,k} | j, k ∈ Z} which make up a wavelet system result from the choice of a suitable coherent vector ψ ∈ L²(R), and then setting

    ψ_{j,k}(x) = (U_{(2^{−j} k2^{−j}; 0 1)} ψ)(x) = 2^{j/2} ψ(2^j x − k) .
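The identities discussed above, Eq. (8), the Cuntz relations, and perfect reconstruction, can all be exercised numerically. The sketch below is illustrative code, not the authors'; it uses the Σᵢ hᵢ = 1 normalization of this article, the four-tap Daubechies filter for Eq. (8), and the Haar pair for the operator identities:

```python
import math

# (a) Orthogonality relations, Eq. (8), for the four-tap Daubechies filter,
#     normalized so that sum(h) = 1 as in the text.
r3 = math.sqrt(3.0)
h4 = {0: (1 + r3) / 8, 1: (3 + r3) / 8, 2: (3 - r3) / 8, 3: (1 - r3) / 8}
g4 = {k: (-1) ** k * h4[1 - k] for k in range(-2, 2)}  # g_k = (-1)^k h_{1-k}

def corr(a, b, k):
    """sum_i conj(a_i) b_{i+2k} over the common support (real filters here)."""
    return sum(a[i] * b.get(i + 2 * k, 0.0) for i in a)

# (b), (c) Cuntz relations / perfect reconstruction for the Haar pair,
#     realized as slanted-matrix actions on finite vectors.
S2 = math.sqrt(2.0)
h = [0.5, 0.5]       # Haar low-pass
g = [0.5, -0.5]      # Haar high-pass via the same rule

def up(filt, x):     # S_j: size s -> size 2s (slanted matrix)
    y = [0.0] * (2 * len(x))
    for j, xj in enumerate(x):
        for k, c in enumerate(filt):
            y[2 * j + k] += S2 * c * xj
    return y

def down(filt, x):   # F_j = S_j^*: size 2s -> size s (mirror slanting)
    return [S2 * sum(c * x[2 * i + k] for k, c in enumerate(filt))
            for i in range(len(x) // 2)]
```

The final identity checked in use, S₀F₀ + S₁F₁ = I, is an instance of the resolution of the identity \sum_i S_i S_i^* = 1 appearing in Table 1.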
Comparison of Discrete and Continuous Wavelet Transforms, Table 1
Splitting of a signal into filtered components: average and a scale of detail. Several instances: continuous vs. discrete operator representations.

Continuous resolution:
    Overcomplete basis:  C_ψ^{−1} ∬_{R²} |ψ_{r,s}⟩⟨ψ_{r,s}| dr ds / r² = 1
    Dual bases:          C_{ψ,ψ̃}^{−1} ∬_{R²} |ψ_{r,s}⟩⟨ψ̃_{r,s}| dr ds / r² = 1

Discrete resolution (corresponding to r = 2^{−j}, s = k 2^{−j}):
    Overcomplete basis:  \sum_{j∈Z} \sum_{k∈Z} |ψ_{j,k}⟩⟨ψ_{j,k}| = 1
    Dual bases:          \sum_{j∈Z} \sum_{k∈Z} |ψ_{j,k}⟩⟨ψ̃_{j,k}| = 1

Sequence spaces:
    Isometries in ℓ²:    \sum_{i=0}^{N−1} S_i S_i^* = 1 , where S_0, ..., S_{N−1} are adjoints of the quadrature mirror filter operators F_i, i.e., S_i = F_i^*
    Dual operator system in ℓ²:  \sum_{i=0}^{N−1} S_i S̃_i^* = 1 , for a dual operator system S_0, ..., S_{N−1}, S̃_0, ..., S̃_{N−1}
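The correspondence r = 2^{−j}, s = k 2^{−j} in Table 1 can be checked directly against the (ax + b)-representation (U_{a,b} f)(x) = a^{−1/2} f((x − b)/a). A small numerical sanity check, illustrative only, with the Haar mother function as ψ:

```python
# The (ax+b)-group acting unitarily on L^2(R):
#   (U_{a,b} f)(x) = a^{-1/2} f((x - b)/a).
# Restricting to a = 2^{-j}, b = k 2^{-j} reproduces
#   psi_{j,k}(x) = 2^{j/2} psi(2^j x - k).
def U(a, b, f):
    return lambda x: a ** -0.5 * f((x - b) / a)

def psi(x):                       # Haar mother function
    return 1.0 if 0 <= x < 0.5 else (-1.0 if 0.5 <= x < 1 else 0.0)

def psi_jk(j, k):
    return lambda x: 2 ** (j / 2) * psi(2 ** j * x - k)
```

Evaluating U at the dyadic points reproduces ψ_{j,k} pointwise, and a Riemann-sum L² norm confirms unitarity within grid accuracy.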
Comparison of Discrete and Continuous Wavelet Transforms, Table 2
Application of the operator representation to specific signals, continuous and discrete.

    C_ψ^{−1} ∬_{R²} |⟨ψ_{r,s} | f⟩|² dr ds / r² = ‖f‖²_{L²}   ∀ f ∈ L²(R)
    C_{ψ,ψ̃}^{−1} ∬_{R²} ⟨f | ψ_{r,s}⟩ ⟨ψ̃_{r,s} | g⟩ dr ds / r² = ⟨f | g⟩   ∀ f, g ∈ L²(R)
    \sum_{j∈Z} \sum_{k∈Z} |⟨ψ_{j,k} | f⟩|² = ‖f‖²_{L²}   ∀ f ∈ L²(R)
    \sum_{j∈Z} \sum_{k∈Z} ⟨f | ψ_{j,k}⟩ ⟨ψ̃_{j,k} | g⟩ = ⟨f | g⟩   ∀ f, g ∈ L²(R)
    \sum_{i=0}^{N−1} ‖S_i^* c‖² = ‖c‖²   ∀ c ∈ ℓ²
    \sum_{i=0}^{N−1} ⟨S_i^* c | S̃_i^* d⟩ = ⟨c | d⟩   ∀ c, d ∈ ℓ²

Even though this representation lies at the historical origin of the subject of wavelets, the (ax + b)-group seems to be now largely forgotten in the next generation of the wavelet community. But Chaps. 1–3 of [7] still serve as a beautiful presentation of this (now much ignored) side of the subject. It also serves as a link to mathematical physics and to classical analysis.

List of Names and Discoveries

Many of the main discoveries summarized below are now lore.

1807 Jean Baptiste Joseph Fourier, mathematics, physics (heat conduction): expressing functions as sums of sine and cosine waves of frequencies in arithmetic progression (now called Fourier series).

1909 Alfred Haar, mathematics: discovered, while a student of David Hilbert, an orthonormal basis consisting of step functions, applicable both to functions on an interval and to functions on the whole real line. While it was not realized at the time, Haar's construction was a precursor of what is now known as the Mallat subdivision and multiresolution method, as well as the subdivision wavelet algorithms.

1946 Dennis Gabor (Nobel Prize), physics (optics, holography): discovered basis expansions for what might now be called time-frequency wavelets, as opposed to time-scale wavelets.

1948 Claude Elwood Shannon, mathematics, engineering (information theory): a rigorous formula used by the phone company for sampling speech signals; quantizing information, entropy; founder of what is now called the mathematical theory of communication.

1976 Claude Galand, Daniel Esteban, signal processing: discovered subband coding for digital transmission of speech signals over the telephone.

1981 Jean Morlet, petroleum engineer: suggested the term "ondelettes." J.M. decomposed reflected seismic signals into sums of "wavelets (Fr.: ondelettes) of constant shape," i.e., a decomposition of signals into wavelet shapes, selected from a library of such shapes (now called wavelet series). Received somewhat late recognition for his work. Due to contributions by A. Grossmann and Y. Meyer, Morlet's discoveries have now come to play a central role in the theory.

1985 Yves Meyer, mathematics, applications: mentor for A. Cohen, S. Mallat, and others of the wavelet pioneers, Y.M. discovered infinitely often differentiable wavelets.

1986 Stéphane Mallat, mathematics, signal and image processing: discovered what is now known as the subdivision and multiresolution method, as well as the subdivision wavelet algorithms. This allowed the effective use of operators in the Hilbert space L²(R), and the parallel computational use of recursive matrix algorithms.

1987 Ingrid Daubechies, mathematics, physics, and communications: discovered differentiable wavelets, with the number of derivatives roughly half the length of the support interval. Further found polynomial algorithms for their construction (with coauthor Jeff Lagarias; joint spectral radius formulas).

1989 Albert Cohen, mathematics (orthogonality relations), numerical analysis: discovered the use of wavelet filters in the analysis of wavelets, the so-called Cohen condition for orthogonality.

1991 Wayne Lawton, mathematics (the wavelet transfer operator): discovered the use of a transfer operator in the analysis of wavelets: orthogonality and smoothness.

1992 The FBI, using wavelet algorithms in digitizing and compressing fingerprints: C. Brislawn and his group at Los Alamos created the theory and the codes which allowed the compression of the enormous FBI fingerprint file, creating A/D, a new database of fingerprints.
2000 The International Standards Organization: a wavelet-based picture compression standard, called JPEG 2000, for digital encoding of images.

1994 David Donoho, statistics, mathematics: pioneered the use of wavelet bases and tools from statistics to "denoise" images and signals.

History

While wavelets as they have appeared in the mathematics literature for a long time (e.g., [7]), starting with Haar in 1909, involve function spaces, the connection to a host of discrete problems from engineering is more subtle. Moreover, the deeper connections between the discrete algorithms and the function spaces of mathematical analysis are of a more recent vintage; see, e.g., [31] and [21]. Here we begin with the function spaces. This part of wavelet theory refers to continuous wavelet transforms (details below). It dominated the wavelet literature in the 1980s, and is beautifully treated in the first four chapters of [7] and in [8]. The word "continuous" refers to the continuum of the real line R. Here we consider spaces of functions in one or more real dimensions, i.e., functions on the line R (signals), the plane R² (images), or in higher dimensions R^d, functions of d real variables.

Tools from Mathematics

In our presentation, we will rely on tools from at least three separate areas of mathematics, and we will outline how they interact to form a coherent theory, and how they come together to form a link between what is now called the discrete and the continuous wavelet transform. It is the discrete case that is popular with engineers [1,23,29,30], while the continuous case has come to play a central role in the part of mathematics referred to as harmonic analysis [8]. The three areas are operator algebras, dynamical systems, and basis constructions:

a. Operator algebras.
The theory of operator algebras in turn breaks up into two parts: one is the study of "the algebras themselves," as they emerge from the axioms of von Neumann (von Neumann algebras) and of Gelfand, Kadison, and Segal (C*-algebras). The other has a more applied slant: it involves "the representations" of the algebras. By this we refer to the following: the algebras will typically be specified by generators and by relations, and by a certain norm-completion, in any case by a system of axioms. This holds both for the norm-closed algebras, the so-called C*-algebras, and for the weakly closed algebras, the von Neumann algebras. In fact, there is a close connection between the two parts of
the theory: for example, representations of C*-algebras generate von Neumann algebras. To talk about representations of a fixed algebra, say A, we must specify a Hilbert space H and a homomorphism π from A into the algebra B(H) of all bounded operators on H. We require that π send the identity element in A into the identity operator acting on H, and that π(a*) = (π(a))*, where the last star refers to the adjoint operator. It was realized in the last ten years (see for example [3,21,22]) that wavelets, which are basis constructions in harmonic analysis, in signal/image analysis, and in computational mathematics, may be built up from representations of an especially important family of simple C*-algebras, the Cuntz algebras. The Cuntz algebras are denoted O_2, O_3, ..., including O_∞.

b. Dynamical systems. The Cuntz algebras O_N for N = 2, 3, ... are relevant to the kind of dynamical systems which are built on branching laws, the case of O_N representing N-fold branching. The reason for this is that if N is fixed, O_N includes in its definition an iterated subdivision, but within the context of Hilbert space. For more details, see, e.g., [12,13,14,15,16,17,22].

c. Analysis of bases in function spaces. The connection to basis constructions using wavelets is this: the context for wavelets is a Hilbert space H, where H may be L²(R^d), where d is a dimension, d = 1 for the line (signals), d = 2 for the plane (images), etc. The more successful bases in Hilbert space are the orthonormal bases (ONBs), but until the mid-1980s there were no ONBs in L²(R^d) which were entirely algorithmic and effective for computations. One reason for this is that the tools that had been used for 200 years since Fourier involved basis functions (Fourier wave functions) which were not localized. Moreover, these existing Fourier tools were not friendly to algorithmic computations.
A Transfer Operator

A popular tool for deciding if a candidate for a wavelet basis is in fact an ONB uses a certain transfer operator. Variants of this operator are used in diverse areas of applied mathematics. It is an operator which involves a weighted average over a finite set of possibilities, and hence it is natural for understanding random walk algorithms. As remarked in, for example, [12,20,21,22], it was also studied in physics, for example by David Ruelle, who used it to prove results on phase transitions for infinite spin systems in quantum statistical mechanics. In fact, the transfer operator has many
Comparison of Discrete and Continuous Wavelet Transforms, Figure 8 Julia set with c = 1. These images were generated in Mathematica by the authors for different c values for φ_c(z) = z² + c.
Comparison of Discrete and Continuous Wavelet Transforms, Figure 9 Julia set with c = 0.45 − 0.1428i. These images were generated in Mathematica by the authors for different c values for φ_c(z) = z² + c.
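Julia sets like those in the figures can be approximated by random backward iteration of the two inverse branches z ↦ ±√(z − c), a standard technique; this sketch is not the authors' Mathematica code:

```python
# Approximate the Julia set of phi_c(z) = z^2 + c by random backward
# iteration: repeatedly apply a randomly chosen inverse branch
# z -> +/- sqrt(z - c); after a burn-in, the iterates cluster on J_c.
import cmath
import random

def julia_points(c, n=2000, burn=50, seed=1):
    random.seed(seed)
    z = complex(1.0, 1.0)
    pts = []
    for i in range(n + burn):
        w = cmath.sqrt(z - c)
        z = w if random.random() < 0.5 else -w
        if i >= burn:
            pts.append(z)
    return pts
```

For c = 0 the points land on the unit circle (the Julia set of z ↦ z²); plotting the real and imaginary parts for other c values reproduces pictures like Figs. 8 and 9.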
incarnations (many of them known as Ruelle operators), all of them based on N-fold branching laws. In our wavelet application, the Ruelle operator averages the input over the N branch possibilities, and the weighting is assigned by a chosen scalar function W; the W-Ruelle operator is denoted R_W. In the wavelet setting there is in addition a low-pass filter function m_0, which in its frequency-response formulation is a function on the d-torus T^d = R^d/Z^d. Since the scaling matrix A has integer entries, A passes to the quotient R^d/Z^d, and the induced transformation r_A: T^d → T^d is an N-fold cover, where N = |det A|; i.e., for every x in T^d there are N distinct points y in T^d solving r_A(y) = x. In the wavelet case, the weight function is W = |m_0|². Then, with this choice of W, the ONB problem for a candidate wavelet basis in the Hilbert space L²(R^d) may, as it turns out, be decided by the dimension of a distinguished eigenspace for R_W, by the so-called Perron–Frobenius problem.

This has worked well for years for the wavelets which have an especially simple algorithm, the wavelets that are initialized by a single function, called the scaling function. These are called the multiresolution analysis (MRA) wavelets, or for short the MRA-wavelets. But there are instances, for example if a problem must be localized in frequency domain, when the MRA-wavelets do not suffice, and a construction will by necessity include more than one scaling function. And we are then back to trying to decide if the output from the discrete algorithm and the O_N representation is an ONB, or if it has some stability property which will serve the same purpose, in cases where asking for an ONB is not feasible.

Future Directions

The idea of a scientific analysis by subdividing a fixed picture or object into its finer parts is not unique to wavelets. It works best for structures with an inherent self-similarity; this self-similarity can arise from numerical scaling of distances. But there are more subtle non-linear self-similarities. The Julia sets in the complex plane are a case in point [4,5,10,11,24,25]. The simplest Julia sets come from a one-parameter family of quadratic polynomials φ_c(z) = z² + c, where z is a complex variable and c is a fixed parameter. The corresponding Julia sets J_c have a surprisingly rich structure. A simple way to understand them is the following: consider the two branches of the inverse, β_± : z ↦ ±√(z − c). Then J_c is the unique minimal non-empty compact subset of C which is invariant under {β_±}. (There are alternative ways of presenting J_c, but this one fits our purpose. The Julia set J of
a holomorphic function, in this case z ↦ z² + c, informally consists of those points whose long-time behavior under repeated iteration, or rather iteration of substitutions, can change drastically under arbitrarily small perturbations.) Here "long-time" refers to large n, where φ^{(n+1)}(z) = φ(φ^{(n)}(z)), n = 0, 1, ..., and φ^{(0)}(z) = z. It would be interesting to adapt and modify the Haar wavelet and the other wavelet algorithms to the Julia sets. The two papers [13,14] initiate such a development.

Literature

As evidenced by a simple Google check, the mathematical wavelet literature is gigantic in size, and the manifold applications spread over a vast number of engineering journals. While we cannot do justice to this vast literature, we instead offer a collection of the classics [19], edited recently by C. Heil et al.

Acknowledgments

We thank Professors Dorin Dutkay and Judy Packer for helpful discussions. Work supported in part by the U.S. National Science Foundation.

Bibliography
1. Aubert G, Kornprobst P (2006) Mathematical problems in image processing. Springer, New York
2. Baggett L, Jorgensen P, Merrill K, Packer J (2005) A non-MRA C^r frame wavelet with rapid decay. Acta Appl Math 1–3:251–270
3. Bratteli O, Jorgensen P (2002) Wavelets through a looking glass: the world of the spectrum. Birkhäuser, Boston
4. Braverman M (2006) Parabolic Julia sets are polynomial time computable. Nonlinearity 19(6):1383–1401
5. Braverman M, Yampolsky M (2006) Non-computable Julia sets. J Amer Math Soc 19(3):551–578 (electronic)
6. Bredies K, Lorenz DA, Maass P (2006) An optimal control problem in medical image processing. Springer, New York, pp 249–259
7. Daubechies I (1992) Ten lectures on wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics, vol 61. SIAM, Philadelphia
8. Daubechies I (1993) Wavelet transforms and orthonormal wavelet bases. Proc Sympos Appl Math, Amer Math Soc, Providence 47:1–33
9.
Daubechies I, Lagarias JC (1992) Two-scale difference equations. II. Local regularity, infinite products of matrices and fractals. SIAM J Math Anal 23(4):1031–1079
10. Devaney RL, Look DM (2006) A criterion for Sierpiński curve Julia sets. Topology Proc 30(1):163–179. Spring Topology and Dynamical Systems Conference
11. Devaney RL, Rocha MM, Siegmund S (2007) Rational maps with generalized Sierpiński gasket Julia sets. Topol Appl 154(1):11–27
12. Dutkay DE (2004) The spectrum of the wavelet Galerkin operator. Integral Equations Operator Theory 4:477–487
13. Dutkay DE, Jorgensen PET (2005) Wavelet constructions in non-linear dynamics. Electron Res Announc Amer Math Soc 11:21–33
14. Dutkay DE, Jorgensen PET (2006) Hilbert spaces built on a similarity and on dynamical renormalization. J Math Phys 47(5):20
15. Dutkay DE, Jorgensen PET (2006) Iterated function systems, Ruelle operators, and invariant projective measures. Math Comp 75(256):1931–1970
16. Dutkay DE, Jorgensen PET (2006) Wavelets on fractals. Rev Mat Iberoam 22(1):131–180
17. Dutkay DE, Roysland K (2007) The algebra of harmonic functions for a matrix-valued transfer operator. arXiv:math/0611539
18. Dutkay DE, Roysland K (2007) Covariant representations for matrix-valued transfer operators. arXiv:math/0701453
19. Heil C, Walnut DF (eds) (2006) Fundamental papers in wavelet theory. Princeton University Press, Princeton, NJ
20. Jorgensen PET (2003) Matrix factorizations, algorithms, wavelets. Notices Amer Math Soc 50(8):880–894
21. Jorgensen PET (2006) Analysis and probability: wavelets, signals, fractals. Grad Texts Math, vol 234. Springer, New York
22. Jorgensen PET (2006) Certain representations of the Cuntz relations, and a question on wavelets decompositions. In: Operator theory, operator algebras, and applications. Contemp Math 414:165–188. Amer Math Soc, Providence
23. Liu F (2006) Diffusion filtering in image processing based on wavelet transform. Sci China Ser F 49(4):494–503
24. Milnor J (2004) Pasting together Julia sets: a worked out example of mating. Exp Math 13(1):55–92
25. Petersen CL, Zakeri S (2004) On the Julia set of a typical quadratic polynomial with a Siegel disk. Ann Math (2) 159(1):1–52
26. Skodras A, Christopoulos C, Ebrahimi T (2001) The JPEG 2000 still image compression standard.
IEEE Signal Process Mag 18:36– 58 27. Song MS (2006) Wavelet image compression. Ph.D. thesis, University of Iowa 28. Song MS (2006) Wavelet image compression. In: Operator theory, operator algebras, and applications. Contemp. Math., vol 414. Amer. Math. Soc., Providence, RI, pp 41–73 29. Strang G (1997) Wavelets from filter banks. Springer, Singapore, pp 59–110 30. Strang G (2000) Signal processing for everyone. Computional mathematics driven by industrial problems (Martina F 1999), pp 365–412. Lect Notes Math, vol 1739. Springer, Berlin 31. Strang G, Nguyen T (1996) Wavelets and filter banks. WellesleyCambridge Press, Wellesley 32. Usevitch BE (Sept. 2001) A tutorial on modern lossy wavelet image compression: foundations of jpeg 2000. IEEE Signal Process Mag 18:22–35 33. Walker JS (1999) A primer on wavelets and their scientific applications. Chapman & Hall, CRC
Complex Gene Regulatory Networks – from Structure to Biological Observables: Cell Fate Determination
SUI HUANG, STUART A. KAUFFMAN
Institute for Biocomplexity and Informatics, Department of Biological Sciences, University of Calgary, Calgary, Canada

Article Outline
Glossary
Definition of the Subject
Introduction
Overview: Studies of Networks in Systems Biology
Network Architecture
Network Dynamics
Cell Fates, Cell Types: Terminology and Concepts
History of Explaining Cell Types
Boolean Networks as Model for Complex GRNs
Three Regimes of Behaviors for Boolean Networks
Experimental Evidence from Systems Biology
Future Directions and Questions
Bibliography

Glossary
Transcription factor  A protein that binds to the regulatory region of a target gene (its promoter or enhancer regions) and thereby controls its expression (transcription of the target gene into an mRNA, which is ultimately translated into the protein encoded by the target gene). Transcription factors account for the temporal and contextual specificity of gene expression; for instance, a developmentally regulated gene is expressed only during a particular phase in development and in particular tissues.
Gene regulatory network (GRN)  Transcription factors regulate the expression of other transcription factor genes as well as other 'non-regulatory' genes, which encode proteins such as metabolic enzymes or structural proteins. A regulatory relationship between two genes is thus formalized as "transcription factor A is the regulator of target gene B", or A → B. The entirety of such regulatory interactions forms a network, the gene regulatory network (GRN). Synonyms: genetic network, gene network, transcriptional network.
Gene network architecture and topology  The GRN can be represented as a "directed graph", which consists of a set of nodes (vertices) representing the genes, connected by arrows (directed links or edges)
representing the regulatory interactions, pointing from the regulator to the regulated target gene. (In contrast, in an undirected graph the links are simple lines without arrowheads; the protein-interaction network can be represented as an undirected graph.) The topology of a network is the structure of this graph: an abstract notation, without physicality, of all the potential regulatory interactions between the genes. Topology is usually used to denote the simple interactions captured by the directed graph. For defining the network dynamics, however, additional aspects of the interactions need to be specified, including the modalities or "sign" of an arrow (inhibitory vs. activating regulation), the 'transfer functions' (the relationship between the magnitude of an input and that of the output, i.e. the target gene's expression) and the logical function (notably, in Boolean networks, defining how multiple inputs are related to each other and are integrated to shape the output). In this article the term gene network architecture is used when all this additional information is implied to be included. Thus, the graph topology is a subset of the network architecture.
Gene network dynamics  The collective change of the expression levels of the genes in a network; essentially, the change over time of the network state S.
State space  Phase space; the abstract space that contains all possible states S of a dynamical system. For (autonomous) gene regulatory networks, each state S is specified by the configuration of the expression levels of each of the N genes of the network; thus a system state S is one point in the N-dimensional state space. As the system changes its state over time, S moves along trajectories in the state space.
Transcriptome  The gene expression pattern across the entire genome (or a large portion of it), measured at the level of mRNA. Used as a synonym of "gene expression profile".
The transcriptome can in a first approximation be considered a snapshot of the network state S of the GRN (see Gene network dynamics).
Cell type  A distinct, whole-cell phenotype characteristic of a mature cell specialized to exert an organ-specific function. Examples of cell types are: liver cell, red blood cell, skin fibroblast, heart muscle cell, fat cell, etc. Cell types are characterized by their distinct cell morphology and their gene expression pattern.
Cell fate  A potential developmental outcome of a (stem or progenitor) cell. A cell fate of a stem cell can be the development into a particular mature cell type.
Multipotency  The ability of a cell to generate multiple cell types; a hallmark of stem cells. Stem cells are said to be multipotent (see also under Stem cell).
Stem cell  A multipotent cell capable of "self-renewal" (division in which both daughter cells have the same degree of multipotency as the mother cell) and of giving rise to multiple cell types. There is a hierarchy of multipotency: a totipotent embryonic stem cell can generate all possible cell types of the body, including extra-embryonic tissues such as the placenta. A pluripotent embryonic stem cell can generate the tissues of the three germ layers, i.e., it can produce all cell types of the foetus and the adult. A multipotent (sensu strictiore) stem cell of a tissue (e.g., blood) can give rise to all cell types of that tissue (e.g., a hematopoietic stem cell can produce all the blood cells). A multipotent progenitor cell can give rise to more than one cell type within a tissue (e.g., the granulocyte-monocyte progenitor cell).
Cell lineage  The developmental trajectory of a multipotent cell towards one of multiple cell types, e.g., the "myeloid lineage" among blood cells, comprising the white blood cells granulocytes, monocytes, etc. Thus, a cell fate decision is a decision between the multiple lineages accessible to a stem or progenitor cell.
Differentiation  The process of cell fate decision in a stem or progenitor cell and the subsequent maturation into a mature cell type.

Definition of the Subject
Current studies of complex gene regulatory networks (GRNs), in which thousands of genes regulate each other's expression, have revealed interesting features of the network structure using graph theory methods. But how does a particular network architecture translate into biology? Since individual genes alter their expression level as a consequence of the network interactions, the genome-wide gene expression pattern (transcriptome), which manifests the dynamics of the entire network, changes as a whole in a highly constrained manner. The transcriptome in turn determines the cell phenotype.
Hence, the constraints on the global dynamics of the GRN map directly into the most elementary "biological observable": the existence of distinct cell types in the metazoan body and their development from pluripotent stem cells. This article first presents a historical overview of the various levels at which GRNs are studied, from network architecture analysis to network dynamics. An introduction is then given to continuous- and discrete-value models of GRNs, commonly used to understand the dynamics of small genetic circuits or of large genome-wide networks, respectively. This will allow us to explain how the intuitive metaphor of the "epigenetic landscape", a key idea that was proposed by Waddington in the 1940s to explain
the generation of discrete cell fates, formally arises from gene network dynamics. This central idea appears in its modern form, in formal and molecular terms, as the concept that cell types represent attractor states of GRNs, first proposed by Kauffman in 1969. This raises two fundamental questions currently addressed by experimental biologists in the era of "systems biology": (1) are cell types in the metazoan body indeed high-dimensional attractors of GRNs, and (2) is the dynamics of GRNs in the "critical regime", poised between order and chaos? A summary of recent experimental findings on these questions is given, and the broader implications of network concepts for the cell fate commitment of stem cells are also briefly discussed. The idea of the epigenetic landscape is key to our understanding of how genes and gene regulatory networks give rise to observable cell behavior, and thus to a formal and integrated view of molecular causation in biology.

Introduction
With the rise of "systems biology" over the past decade, molecular biology is moving away from the paradigm of linear genetic pathways, which have long served as linear chains of causation (e.g., Gene A → Gene B → Gene C → phenotype) in explaining cell behaviors. It has begun to embrace the idea of molecular networks as an integrated information processing system of the cell [93,138]. The departure from the gene-centered, mechanistic 'arrow-arrow' schemes that embody 'proximal causation' [189] towards an integrative view will also entail a change in our paradigm of what an "explanation" means in biology: how do we map the collective behavior of thousands of interacting genes, obtained from molecular dissection, to the "biological observable"?
The latter term, borrowed from the statistical physics idea of the "macroscopic observable", is most prosaically epitomized in whole-cell behavior in metazoan organisms: the capacity of a cell to adopt a large variety of easily recognizable, discretely distinct phenotypes, such as a liver cell vs. a red blood cell, or different functional states, such as the proliferative, quiescent or apoptotic state. All these morphologically and functionally distinct phenotypes are produced by the very same set of genes of the genome as a result of the joint action of the genes. This is achieved by the differential expression ("turning ON or OFF") of individual genes. Thus, in a first approximation, each cell phenotype is determined by a specific configuration of the status of expression of all individual genes across the genome. This genome-wide gene expression pattern or profile, the transcriptome, is the direct output of the GRN and maps almost uniquely into a cell phenotype (see overview in Fig. 1).

A network is the most elementary conceptualization of a complex system which is composed of interacting elements and whose behavior as an entity we wish to understand: it formalizes how a set of distinct elements (nodes) influence each other as predetermined by a fixed scheme of their interactions (links between the nodes) and the modality of the interactions (rules associated with each node). In this article we focus on the gene regulatory network (GRN), the network formed by the interactions through which genes regulate each other's expression (Fig. 1a), and we ask how these interactions control the global behavior of the network and thereby govern the development of cells into the thousands of cell types found in the metazoan body. Because in a network a node exerts influence onto others, we can further formalize networks as directed graphs, that is, the links are arrows pointing from one node to another (Fig. 1b). The information captured in an undirected or directed graph is the network topology (see Glossary). In addition, the arrows have a modality ("sign"), namely, activating or inhibiting their target gene. But since each node can receive several inputs ("upstream regulators"), it is more appropriate to combine the modality of interaction with the way the target gene integrates the various inputs to change its expression behavior (= output). Thus, each node can be assigned a function that maps all its inputs in a specific way to the output (Fig. 1, top). For instance, "promoter logics" [209], which may dictate that two stimulating inputs (transcription factors) act synergistically, or that one inhibitory input may, when present, override all other activating inputs, are one way to represent such an input-integrating function. Here we use network architecture as a term that encompasses both the network topology and the interaction modalities or functions.
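To make the notions of topology, sign and input-integrating function concrete, the following minimal sketch encodes a three-gene toy circuit. The gene names, the wiring and the "at least one active activator and no active inhibitor" rule are our own illustrative assumptions, not taken from any real GRN:

```python
# Directed, signed edges: (regulator, target, sign); +1 activating, -1 inhibiting.
edges = [
    ("A", "B", +1),   # A activates B
    ("B", "C", +1),   # B activates C
    ("C", "B", -1),   # C inhibits B (a negative feedback loop)
]

def inputs_of(gene):
    """All (regulator, sign) pairs feeding into `gene`."""
    return [(src, sign) for src, tgt, sign in edges if tgt == gene]

def update(gene, state):
    """One example of an input-integrating rule ('promoter logic'):
    the gene is ON iff at least one activator is ON and no inhibitor is ON."""
    regs = inputs_of(gene)
    if not regs:                       # no regulators: keep current value
        return state[gene]
    activators = [state[src] for src, sign in regs if sign > 0]
    inhibitors = [state[src] for src, sign in regs if sign < 0]
    return int(any(activators) and not any(inhibitors))

state = {"A": 1, "B": 0, "C": 0}                 # one network state S
state = {g: update(g, state) for g in state}     # one synchronous update step
print(state)                                     # -> {'A': 1, 'B': 1, 'C': 0}
```

Replacing `update` with a different per-gene rule is precisely what distinguishes two network architectures that share the same topology.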
The latter add to the topology information the ingredients that are necessary to describe the dynamics (behavior) of the network.

The Core Gene Regulatory Network (GRN) in Mammals
How complex is the effective GRN of higher, multicellular organisms, such as mammals? Virtually every cell type in mammals contains the 25,000 genes in the genome [44,158,178], which could potentially interact with each other. However, in a first approximation, we do not need to deal with all the 25,000 genes but only with those intrinsic regulators that have a direct influence on other genes. A subset of roughly 5–10% of the genes in the genome encode transcription factors (TF) [194], a class of DNA-binding proteins that regulate the expression of genes by binding to the promoter region of their 'target genes' (Fig. 1a). Another few hundred loci in the genome encode microRNAs (miRNA), which are transcribed to RNA but do not encode proteins. miRNAs regulate the expression of genes by interfering with the mRNA of their target genes based on sequence complementarity (for reviews see [41,89,141]). Thus, in this discussion we can assume a regulatory network of around 3000 regulator genes rather than the 25,000 genes of the genome. The genes that do not encode TFs are the "effector genes", encoding the work horses of the cell, including metabolic enzymes and structural proteins. In our approximation, we also assume that the effector genes do not directly control other genes (although they may have global effects, such as a change of pH, that may affect the expression of some genes). Further, we do not count genes encoding proteins of the signal transduction machinery, since they act to mediate external signals (such as hormones) that can be viewed as perturbations to the network and can in a first approximation be left out when discussing cell-intrinsic processes. Thus, the directed graph describing the genome-wide GRN has a "medusa" structure [112], with a core set of regulators (the medusa head) and a periphery of regulated genes (the arms) which in a first approximation do not feed back to the core.

The Core GRN as a Graph That Governs Cell Phenotype
The next question is: do the 3000 core regulatory genes form a connected graph or rather independent (detached) "modules"? The idea of modularity would have justified the classical paradigm of independent causative pathways and has in fact actively been promoted in an attempt to mitigate the discouragement in view of the unfathomable complexity of the genome [85].
While a systematic survey that would provide a precise number is still not available, we can, based on patchy knowledge from the study of individual TFs, safely assume that a substantial fraction of TFs control the expression of more than one other TF. Many of them also control entire batteries [50] of effector genes, while perhaps a third subset may be specialized in regulating solely the effector genes. In any case, the core regulatory network of 3000 nodes controls, directly or indirectly, the entire gene expression profile of 25,000 genes and hence the cell phenotype. Then, assuming that on average each TF controls at least two (typically more) other TFs [128,188], and considering statistical properties of randomly evolved networks (graphs) [36], we can safely assume that the core transcriptional network is a connected graph, or at least that its giant component (largest connected subgraph) covers the vast majority of its nodes. This appears to be the case in GRNs of simple organisms, for which more data are available for parts of the network, as in yeast [128], C. elegans [54] and sea urchin [50], or, for even more limited subnetworks, in mammalian systems [32,183], although studies focused on selected subnetworks may be subject to investigation bias. However, recent analyses of DNA binding by 'master transcription factors' in mammalian cells using systematic (hence in principle unbiased) chromatin immunoprecipitation techniques [177] show that they typically bind to hundreds if not thousands of target genes [48,70,105,155], strongly suggesting a global interconnectedness of the GRN. This, however, does not exclude the possibility that the genome-wide network may exhibit some "modularity" in terms of weakly connected modules that are locally densely connected [166].

Complex Gene Regulatory Networks – from Structure to Biological Observables: Cell Fate Determination, Figure 1
Overview: Concepts at different levels for understanding the emergence of biological behavior from genes and networks. a Elementary interaction: regulator gene X1 interacts with target (regulated) gene X2, simplified as one link of a graph and the corresponding standard 'cartoon' it represents. b Notation of gene regulatory network topology as a directed graph. c Schematic of a high-dimensional gene expression state space, showing that a network state S(t) maps onto one point (denoted as a cross) in the state space. The dashed curve/arrow represents a state space trajectory. d Spotted DNA microarray for measuring gene expression (mRNA) profiles, representing a network state S(t), and e the associated "GEDI map" visualization of the gene expression pattern. The map is obtained using the program GEDI [57,83]. This program places genes that behave similarly (with respect to their expression in a set of microarray measurements) onto the same pixel (= minicluster of genes) of the map. Similar miniclusters are arranged in nearby pixels in the two-dimensional picture of an n × m array of pixels. The assignment of genes to a pixel is achieved by a self-organizing map (SOM) algorithm. The color of each pixel represents the centroid gene expression level (mRNA abundance) of the genes in the minicluster. For a stack of GEDI maps, all genes are forced to be assigned to the same pixels in the different maps; hence the globally coherent patterns of the GEDI maps allow for a one-glance 'holistic' comparison of gene expression profiles of different conditions S(t) (tissue types, time points in a trajectory, etc.). f Scheme of branching development of a subset of four different blood cells with their distinct GEDI maps, starting from the multipotent CMP = common myeloid progenitor cell [150]. MEP = megakaryocyte-erythroid progenitor cell; GMP = granulocyte-monocyte progenitor cell

Structure of This Article
The goal of this article is to present, to both biologists and physicists, a set of basic concepts for understanding how the maps of thousands of interacting genes that systems biology researchers are currently assembling ultimately control cell phenotypes. In Sect. "Overview: Studies of Networks in Systems Biology" we present an overview for experimental biologists of the history of the analysis of networks in systems biology. In Sect. "Network Architecture" we briefly discuss core issues in studies of network topology, before explaining basic ideas of network dynamics in Sect. "Network Dynamics" based on two-gene circuits. The central concepts of multi-stability will
be explained to biologists, assuming a basic calculus background. Waddington's epigenetic landscape will also be addressed in this connection. In Sect. "Cell Fates, Cell Types: Terminology and Concepts" we discuss the actual "biological observable" that the gene regulatory network controls, by introducing to non-biologists central concepts of cell fate regulation. Section "History of Explaining Cell Types" offers a historical overview of various explanations for metazoan cell type diversity, including dynamical, network-based concepts as well as more 'reductionist' explanations that still prevail in current mainstream biology. Here the formal link between network concepts and Waddington's landscape metaphor will be presented. We then turn from small gene circuits to large, complex networks, and in Sect. "Boolean Networks as Model for Complex GRNs" we introduce the model of random Boolean networks which, despite their simplicity, have provided a useful conceptual framework and paved the way to learning what an integrative understanding of global network dynamics would look like, leading to the first central hypothesis: cell types may be high-dimensional attractors of the complex gene regulatory network. In Sect. "Three Regimes of Behaviors for Boolean Networks" the more fundamental dynamical properties of ordered, critical and chaotic behavior are discussed, leading to the second hypothesis: networks that control living cells may be in the critical regime. In Sect. "Experimental Evidence from Systems Biology" we summarize current experimental findings that lend initial support to these ideas, and in Sect. "Future Directions and Questions" we conclude with an outlook on how these general concepts may impact future biology.
Complex Gene Regulatory Networks – from Structure to Biological Observables: Cell Fate Determination, Table 1
Overview of typical directions and levels in the study of gene regulatory networks

A. Network Architecture
   A1. Determination of network architecture
       – Experimental
       – Theoretical (= network inference)
   A2. Analysis of network architecture: identification/characterization of "interesting" structural features
B. Network Dynamics
   For small networks:
   B1. Modeling: theoretical prediction of the behavior of a real circuit based on (partially known or assumed) architecture + experimental verification
   For complex networks (full architecture not known):
   B2. Theoretical: study of generic dynamics of simulated ensembles of networks of particular architecture classes
   B3. Experimental: measurement and analysis of high-dimensional state space trajectories of a real network
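The ensemble approach of level B2 can be sketched in a few lines for the random Boolean network model discussed in this article. In this toy illustration (N = 6, K = 2 and the random seed are arbitrary choices of ours), exhaustively iterating the synchronous dynamics over all 2^N states partitions the state space into basins that drain into a small number of attractors, the recurrent expression patterns proposed to correspond to cell types:

```python
import itertools
import random

random.seed(0)
N, K = 6, 2
# Wiring: each gene receives K randomly chosen inputs.
inputs = [random.sample(range(N), K) for _ in range(N)]
# Logic: each gene gets a random Boolean rule over its K inputs.
rules = [{bits: random.randint(0, 1)
          for bits in itertools.product((0, 1), repeat=K)}
         for _ in range(N)]

def step(state):
    """Synchronous update of the whole network state (a tuple of 0/1)."""
    return tuple(rules[i][tuple(state[j] for j in inputs[i])]
                 for i in range(N))

def attractor_of(state):
    """Follow the trajectory until a state repeats; return the cycle."""
    seen = {}
    while state not in seen:
        seen[state] = len(seen)
        state = step(state)
    start = seen[state]                      # index where the cycle begins
    ordered = sorted(seen, key=seen.get)
    return tuple(ordered[start:])

attractors = {frozenset(attractor_of(s))
              for s in itertools.product((0, 1), repeat=N)}
print(f"{len(attractors)} attractor(s) in a {N}-gene, K={K} network")
```

For realistic N an exhaustive sweep is impossible (2^N states), which is why the ensemble studies discussed later rely on sampling and on statistical properties of the dynamics.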
Overview: Studies of Networks in Systems Biology
Cellular networks of biological molecules currently studied by "systems biologists" encompass three large classes: metabolic networks, protein–protein interaction networks and gene regulatory (transcriptional) networks (GRNs). The formalization of these systems into networks is often taken for granted but is an important issue. It is noteworthy that metabolic network diagrams [193] represent physical networks in that there is an actual flow of matter or energy in the links; thus, the often used metaphors inspired by man-made transport or communication networks ('bottleneck', 'hubs', etc.) are more appropriate here than in the other two classes. Moreover, metabolic networks are subject to the constraint of mass preservation at each node, in compliance with Lavoisier's law of mass conservation in chemistry. In contrast to such flow networks, protein–protein interaction networks and GRNs are abstract notations of "influence networks" in which the nodes influence the behavior of other nodes directly or indirectly, as represented by the links of the graph. There is no actual flow of matter in the links (although of course information is exchanged) and there is no obvious physical law that constrains the architecture, thus allowing for much richer variation of the network structure. The links are hence abstract entities, representing potential interactions assembled from independent observations that may not coexist in a particular situation. Thus, in this case the network is rather a convenient graphical representation of a collection of potential interactions. Studies of such biomolecular influence networks can fundamentally be divided into two levels (Table 1): (A) network architecture and (B) network dynamics (system behavior).
The study of network architecture in molecular biology can further be divided into (A1) efforts to determine the actual graph structure that represents the specific interactions of the genes for a particular instance (a species) and (A2) the more recent analysis of this structure, e.g., for interesting topological features [2]. The determination of the graph structure of genome-wide GRNs in turn is achieved either (i) by direct experimental demonstration of the physical interactions, which has in the past decade greatly benefited from novel massively parallel technologies, such as chromatin IP, promoter one-hybrid assays or protein-binding DNA microarrays [35,54,177], or (ii) via theoretical inference based on observed correlations in gene expression behavior from genome-wide gene expression profiling experiments. Such correlations are the consequence of mutual dependencies of expression in the influence network.
The next section provides a concise overview of the study of network architecture, which due to limitations of the available data is mostly an exercise in studying graph topology, and briefly discusses associated problems.

Network Architecture
Network Inference
An emerging challenge is the determination of entire networks, including the connection graph (topology), the direction of interactions ("arrows") and, ideally, the interaction modality and logical function, using systematic inference from genome-wide patterns of gene expression (the "inverse problem"). Gene expression profiles (transcriptomes) represent a snapshot of the mRNA levels in cell or tissue samples. The arrival of DNA microarrays for efficiently measuring gene expression profiles for almost all the genes in the genome has stimulated the development of algorithms that address this daunting inverse problem, and the number of proposed algorithms has recently exploded. Most approaches concentrate on inference from the more readily available static gene expression profiles, although time-course microarray experiments, in which the time evolution of transcriptomes is monitored at close intervals, would be advantageous, especially for inferring the arrow direction of network links (directed graph) and the interaction functions. However, because of uncertainties and the paucity of experimental data, systematic network inference faces formidable technical and formal challenges, and most theoretical work has been developed and tested on simulated networks. A fundamental concern is that microarray-based expression levels reflect the average expression of a population of millions of cells that – as recently demonstrated – exhibit vastly diverse behaviors even when they are clonal. Thus, the actual quantity used for inference is not a direct but a convoluted manifestation of network regulation (this issue is discussed in Sect. "Are Cell Types Attractors?").
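The simplest instance of such inference, correlation-based "relevance networks", can be sketched as follows. The expression matrix, the gene names and the 0.9 cutoff are synthetic illustrations of ours; real inference must additionally contend with noise, indirect correlations and the population-averaging problem mentioned above:

```python
import math

# Rows = genes (hypothetical names), columns = expression across samples.
expr = {
    "G1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "G2": [2.1, 3.9, 6.2, 8.1, 9.8],   # tracks G1 -> candidate link
    "G3": [5.0, 1.0, 4.0, 2.0, 3.0],   # unrelated pattern
}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

genes = sorted(expr)
# Propose an undirected link wherever |correlation| exceeds a cutoff.
links = [(g1, g2)
         for i, g1 in enumerate(genes) for g2 in genes[i + 1:]
         if abs(pearson(expr[g1], expr[g2])) > 0.9]
print(links)
```

Note that such co-expression links are undirected and cannot by themselves distinguish direct regulation from shared upstream regulators, which is why the more elaborate methods surveyed in the references are needed.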
Moreover, while mRNA levels reflect relatively well the activity status of the corresponding gene promoter, i.e., revealing the regulated activity of a gene, they are poor indicators of its regulating activity, because of the loose relationship between the mRNA level and the effective activity of the transcription factor it encodes. Since the true network architecture (the 'gold standard') is not known, validation of the theoretical approaches remains unsatisfactory. Nevertheless, the recent availability of large numbers of gene expression profiles and the increasing (although not complete) coverage of gene regulation maps for single-cell organisms (notably E. coli and yeast) open the opportunity to directly study the mapping between
gene expression profile and network structure [39,42,61,88,135,190]. Here we refer to [2,131,139] for a survey of inference methods and instead briefly discuss the study of network topology before we move on to network dynamics.

Analyzing Network Structures
Once the network topology is known, even if the direction, modality and logics of the links are not specified to give the complete system architecture, it can be analyzed using graph theory tools for the presence of global or local structures (subgraphs) that are "interesting". A potentially interesting feature can be defined as one which cannot be explained by chance occurrence in some randomized version of the graph (the null model), i.e., which departs from what one would expect in a "random" network (see below). Most of the graph-theoretical studies have been stimulated by the protein–protein interaction networks, which have been available for some years and of which the best characterized is that of the yeast S. cerevisiae [29]. Such networks represent undirected graphs, since the links between the nodes (proteins) have been determined by the identification of physical protein interactions (heterodimer or higher-order complex formation). Here we provide only a cursory overview of this exploding field, while focusing on conceptual issues. A large array of structural network features, many of them inspired by the study of ecological and social networks [152,160,199], have been found in biomolecular networks. These features include global as well as local properties such as, to mention a few, the scale-free or broad-scale distribution of the connectivity k_i of the nodes i [3,10,20], the betweenness of node i [107,208], hierarchical organization [164], modularity [102,121,133,166,206], assortativity [140], and enrichment for specific local topology motifs [147] (for a review of this still expanding field see [2,21,29,152]).
The global property of a scale-free distribution of connectivity k_i, which attracted most attention early on and quickly entered the vocabulary of the biologist, means that the probability P(k_i) for an arbitrary node (gene) i in the network to have connectivity k_i has the form P(k_i) ∝ k_i^(−γ), where the characteristic constant γ is the power-law exponent, i.e. the slope of the line in a double-logarithmic plot of P(k_i) vs. k_i. This distribution implies that there is no characteristic scale, i.e., no stable average value of k: sampling larger numbers of nodes N will lead to larger "average" k values. In other words, there is an "unexpectedly" high fraction of nodes which are highly connected ("hubs") while the majority exhibits low connectivity. This property has attracted as much interest as it has stirred controversy, because of the connotation of "universality" of scale-freeness on the one hand and several methodological concerns on the other [21,24,73,76,175,181,193].

The Problem of Choosing the Null Model
In addition to well-articulated caveats due to incompleteness, bias and noise of the data [29,52], an important general methodological issue in the identification of structural features of interest, especially if conclusions on functionality and evolution are drawn, is the choice of the appropriate "null model", the "negative control" in the lingo of experimentalists. A totally random graph obviously is not an ideal null model, for against its background any bias due to obvious constraints imposed by the physical reality of the molecular network will appear non-random and hence be falsely labeled "interesting", even if one is not interested in the physical constraints but in evidence for functional organization [14,93]. For instance, the fact that gene duplication, a general mechanism of genome growth during evolution, promotes the generation of the network motif in which one regulator regulates two target genes needs to be considered before a "purposeful" enrichment of such a motif due to increased fitness is assumed [93]. Similarly, the rewiring of artificial regulatory networks through reshuffling of cis and trans regions, or the construction of networks based on promoter-sequence information content, reveals constraints that lead to a bias for particular structures in the absence of selection pressure [18,46]. The problem amounts to the practical question of which structural property of the network should be preserved when randomizing an observed graph to generate a null model. Arbitrarily constrained randomization based on the preservation of some a priori graph properties [18,140] thus may not suffice.
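One widely used partial answer to what should be preserved is the degree sequence: randomization by repeated "double edge swaps" scrambles the wiring while fixing every node's in- and out-degree, and features of the real graph (e.g. motif counts) are then compared against an ensemble of such randomizations. A minimal sketch, with a toy edge list and swap count of our own choosing:

```python
import random

def degrees(edges):
    """In- and out-degree of every node of a directed edge list."""
    out, inn = {}, {}
    for a, b in edges:
        out[a] = out.get(a, 0) + 1
        inn[b] = inn.get(b, 0) + 1
    return out, inn

def randomize(edges, n_swaps, seed=0):
    """Degree-preserving randomization: rewire (a,b),(c,d) -> (a,d),(c,b).
    Swaps that would create self-loops or duplicate edges are rejected."""
    rng = random.Random(seed)
    edges = list(edges)
    done = 0
    for _ in range(100 * n_swaps):        # cap attempts to guarantee exit
        if done == n_swaps:
            break
        (a, b), (c, d) = rng.sample(edges, 2)
        if a == d or c == b or (a, d) in edges or (c, b) in edges:
            continue
        edges[edges.index((a, b))] = (a, d)
        edges[edges.index((c, d))] = (c, b)
        done += 1
    return edges

real = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 0), (1, 3)]
null = randomize(real, n_swaps=20)
assert degrees(null) == degrees(real)     # degree sequence preserved exactly
```

As the text argues, fixing only the degree sequence is itself an a priori choice; any constraint not built into the randomization (duplication history, genomic mechanics) will resurface as an apparently "interesting" feature.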
However, the question of which feature to keep cannot be satisfactorily answered given the lack of detailed knowledge of all physical processes underlying the genesis and evolution of the network. The more one knows of the latter, the more appropriate constraints can be applied to the randomization in order to expose novel biological features that cannot be (trivially) explained by the neglected elementary construction constraints.

Evolution of Network Structure: Natural Selection or Natural Constraint?

This question goes beyond the technicality of choosing a null model, for it reaches into the realm of deeper questions of evolutionary biology: which features are inevitably linked to the very physical processes through which networks have grown in evolution, and which arose due to natural selection because they confer a survival advantage [15,77,78,93,203]? Often this question is taken for granted and an all-mighty selection process is assumed that can produce any structure as long as it contributes sufficiently to fitness. This, however, would require that during natural selection the random, mutation-driven reshuffling of nodes and connections has no constraints and that Darwinian evolution explores the entire space of possible architectures. Clearly this is not the case: physical constraints in the rearrangement of DNA (insertions, deletions, duplications, conversion, etc.) [16,170,187] as well as graph-theoretical considerations channel the possibilities for how one graph can transform into another [93,203]. For instance, growth of the network due to the increase of genome size (gene number) by gene duplication can, without the invisible hand of selection, give rise to the widely-found scale-free structure [23,186], although it remains to be seen whether this mechanism (and not some more fundamental statistical process) accounts for the ubiquitous (near) scale-free topology. The fact that the scale-free (or at least, broad-scale [10]) architecture has dynamical consequences (see Sect. "Architectural Features of Large Networks and Their Dynamics") raises the question whether properties such as robustness may be inherent to some structure that is "self-organized", rather than sculpted by the invisible hand of natural selection. Thus, one certainly cannot simply argue that the scale-free structure has evolved "because of some functional advantage". Instead, it should be remembered here that natural selection can benefit from spontaneous, self-organized structures [116]. This structuralist view [200], in which the physically most likely and not the biologically most functional is realized, needs to be considered when analyzing the anatomy of networks [93].
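The duplication mechanism invoked above can be made concrete with a toy simulation. The sketch below is an assumption-laden caricature (uniform choice of a template gene, a fixed link-retention probability p_keep), not the published duplication models of [23,186]; it merely illustrates that growth by duplication alone, without any selection, already produces a broad, hub-dominated degree distribution.

```python
import random

def grow_by_duplication(n_final, p_keep=0.4, seed=0):
    """Grow an undirected network by gene duplication (toy model).

    Each new node picks a random template node, inherits each of the
    template's links with probability p_keep (duplication followed by
    partial loss of interactions), and stays linked to the template.
    """
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0, 2}, 2: {1}}  # small seed graph
    for new in range(3, n_final):
        template = rng.randrange(new)
        inherited = {t for t in adj[template] if rng.random() < p_keep}
        inherited.add(template)
        adj[new] = set(inherited)
        for t in inherited:
            adj[t].add(new)
    return adj

adj = grow_by_duplication(2000, seed=1)
deg = sorted(len(nbrs) for nbrs in adj.values())
# A few heavily connected hubs coexist with a majority of sparsely
# connected nodes -- a broad degree distribution without selection.
```

Highly connected genes gain further links simply because they are more likely to be neighbors of a duplicated template, an implicit form of preferential attachment.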
In summary, the choice of the null model has to be made carefully and requires knowledge of the biochemical and physicochemical processes that underlie genome evolution. This methodological caveat has its counterpart in the identification of "interesting" nucleotide sequence motifs when analyzing genome sequences [167].

Gene-Specific Information: More Functional Analysis Based on Topology

Beyond pure graph theoretical analysis, there have been attempts to link the topology with functional biological significance. For instance, one question is how the global graph structure changes (such as the size of the giant component) when nodes or links are randomly or selectively (e. g., hubs) removed. It should be noted that the term
"robustness" used in such structural studies [5] refers to networks with the aforementioned connotation of a transport or communication function with flow in the links and thus differs fundamentally from robustness in a dynamical sense in the influence networks that we will discuss below. Another approach towards connecting network topology with biological functionality is to employ bioinformatics and consider the biological identity of the genes or proteins represented by the nodes. Then one can ask whether some graph-related node properties (e. g., degree of connectivity, betweenness/centrality, contribution to network entropy, etc.) are correlated with known biological properties of the genes, of which the most prosaic is the "essentiality" of the protein, derived from genetic studies [84,137,207,208]. To mention just a few of the earlier studies of this expanding field, it has been suggested that proteins with large connectivity ("hubs") appear to be enriched for "essential proteins" [104], that hubs evolve more slowly [68], and that they tend to be older in evolution [58] – in accordance with the model of preferential attachment that generates the scale-free distribution of the connectivity k_i. However, many of these findings have been contested for statistical or other reasons [27,28,67,106,107]. The conclusions of such functional bioinformatics analyses need to be re-examined when more reliable and complete data become available.

Network Dynamics

While the graph theoretical studies, including the ramifications into core questions of evolution outlined above, are interesting eo ipso, life is breathed into these networks only with the 'dynamics'. Understanding the system-level dynamics of networks is an essential step in our desire to map the static network diagrams, which are merely "anatomical observables", into the functional observable of cell behavior.
Dynamics is introduced by considering that a given gene (node) has a time-dependent expression value x_i(t), which in a first approximation represents the activity state of gene i (an active gene is expressed and post-translationally activated). Instead of seeing dynamics as a sequence of gene activations, which epitomizes the chain of causation of the gene-centric view (as outlined in Sect. "Introduction"), a goal in the study of complex systems is to understand the integrated, "emergent" behavior of systems as a holistic entity. We thus define a system (network) state S(t) as S(t) = [x_1(t), x_2(t), …, x_N(t)], which is determined by the expression values x_i of all the N genes of a network, which change over time. It is obvious that not all theoretically possible state configurations S can
be realized since the individual variables x_i cannot change independently. This global constraint on S imposed by the gene regulatory interactions ultimately determines the "system behavior", which is equivalent to whole-cell phenotype changes, such as switching between cell phenotypes. Thus, one key question is: given a particular network architecture, what is the dynamics of S, and does it reproduce typical cell behaviors, or even predict the particular behavior of a specific cell? Before plunging into complex GRNs, let us introduce basic concepts of modeling dynamics using small circuits (Fig. 2; B1 in Table 1).
Small Gene Circuits

Basic Formalism

The dynamics of the network can be written as a system of ordinary differential equations (ODEs) that describe the rate of change of x_i as a function of the state of all the x_j:

dx/dt = F(x) ,   (1)

where x is the state vector x(t) = [x_1, x_2, …, x_N] for a network of N genes, and F describes the interactions
Complex Gene Regulatory Networks – from Structure to Biological Observables: Cell Fate Determination, Figure 2 From gene circuit architecture to dynamics, exemplified on a bistable two-gene circuit. a Circuit architecture: two mutually inhibitory genes X_1 and X_2 which are expressed at constant rate and inactivated with first-order kinetics ("bistable toggle switch"). b Typical ODE system equations for the circuit in a (see [71,109]). c State space (x_1–x_2 phase plane). Each dot is an example initial state S_0 = [x_1(t = 0), x_2(t = 0)], with the emanating tick revealing the direction and extent that the state S_0 would travel in the next time unit Δt. Solid circles S_1 and S_2 denote stable fixed-points ("attractors"); the empty circle denotes the unstable fixed-point (saddle-node). d, e, f Various schematic representations of the probability P(S) (for a noisy circuit) to find the system in state S = [x_1, x_2]. The "elevation" (z-axis over the x_1–x_2 plane) is calculated as –ln(P); thus, the most probable = stable states are the lowest in the emerging landscape. g Waddington's metaphoric epigenetic landscape, in the 1957 version [198]
(including the interaction matrix), defining how the components influence each other. Concretely, for individual genes in small circuits, e. g. of two genes, N = 2:

dx_1/dt = f_1(x_1, x_2)
dx_2/dt = f_2(x_1, x_2) ,   (1a)
Here the functions f_i are part of the network "architecture" that determines how the inputs onto gene i jointly determine its dynamical behavior – as defined in Sect. "Introduction". An example of f is shown in Fig. 2b. Its form is further specified by system parameters that are constants for the time period of observation. The solution of these system equations is the movement of the state vector x in the N-dimensional gene expression state space spanned by the x_i:

x(t) = S(t) = [x_1, x_2, …, x_N] .   (2)
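To make Eqs. (1a) and (2) concrete, the bistable toggle switch of Fig. 2 can be integrated numerically. The Hill-type rate laws and parameter values below are illustrative assumptions in the spirit of Fig. 2b (see [71,109]), not the exact published equations:

```python
def simulate_toggle(x1, x2, a=3.0, n=2, dt=0.01, steps=5000):
    """Euler integration of a symmetric two-gene toggle switch:

        dx1/dt = a / (1 + x2**n) - x1
        dx2/dt = a / (1 + x1**n) - x2

    (mutual inhibition with first-order decay; bistable for a = 3, n = 2).
    """
    for _ in range(steps):
        dx1 = a / (1.0 + x2 ** n) - x1
        dx2 = a / (1.0 + x1 ** n) - x2
        x1, x2 = x1 + dt * dx1, x2 + dt * dx2
    return x1, x2

# Initial states on opposite sides of the separatrix (the diagonal x1 = x2)
# relax into the two different attractors S1 (x1 >> x2) and S2 (x1 << x2).
s1 = simulate_toggle(2.0, 0.5)
s2 = simulate_toggle(0.5, 2.0)
```

Two different initial conditions thus converge to two different stable gene expression patterns, the elementary behavior exploited throughout the remainder of this article.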
The Network State

For a larger number of genes, it is convenient to describe the dynamics of the network with the vector x, which roughly represents the system state S, Eq. (2), that biologists measure using DNA microarrays and is known as the "gene expression profile" (see Sect. "Definition of the Subject", Fig. 1). We will thus refer to S instead of the vector x in discussing biological system states. Since Eq. (1) is a first-order differential equation, S specifies a system state at time t, which for t = 0 represents an initial condition. In other words, unlike in macroscopic mechanical systems with inertia, in which the velocity dx_i/dt of the "particles" is also important to specify an initial state, in cells a photographic snapshot of all the positions of x_i, that is S(t) itself, specifies the system state. (This will become important for experimental monitoring of network dynamics.)

Role of Dynamical Models in Biology

Small networks, or more precisely, "circuits" of a handful of interacting proteins and genes, have long been the object of modeling efforts in cell biology that use ODEs of the type of (1a) to model the behavior of the circuit [109,192] (Table 1, B1). In contrast, the objects of interest in the study of complex system sciences are large, i. e., "complex" networks of thousands of nodes, such as the 3000-node core GRN mentioned earlier, and their study has largely been driven by experimental monitoring of S(t) using microarrays, since the function F that represents the architecture of the network is not known (Table 1, B3). What is the actual aim of modeling a biological network? In mathematical modeling of small gene circuits
or of signal transduction pathways, one typically predicts the temporal evolution (time course) of the concentrations of the modeled variables x_1(t) or x_2(t), and characterizes critical points, such as stable steady states or oscillations in the low-dimensional state space, e. g. in the x_1–x_2 phase plane, after solving the system equations, as exemplified in Figs. 2 and 3. Unknown parameter values have to be estimated, either based on previous reports or obtained by fitting the modeled x_i(t) to the observed time course. Successful prediction of behaviors also serves to validate the assumed circuit architecture. From the stability analysis of the resulting behavior [151], generalizations as to the dynamical robustness (low sensitivity to noise that causes x(t) to randomly fluctuate) and structural robustness (preservation of similar dynamical behavior even if the network architecture is slightly rewired by mutations) can be made [7,34]. Neither type of robustness of the system should be confused with the robustness sometimes encountered in the analysis of network topology, where preservation of some graph properties (such as global connectedness) upon deletion of nodes or links is examined ("error tolerance"), as discussed in Sect. "Network Architecture". Although in reality most genes and proteins are embedded in the larger, genome-wide network, small, idealized circuit models, which implicitly accept the absence of many unknown links as well as inputs from the nodes in the global network outside the considered circuit, often predict surprisingly well the observed kinetics of the variables of the circuit. This not only points to intrinsic structural robustness of the class of natural circuits, in that circuit architectures slightly different from that of the real system can generate the observed dynamics.
But it also suggests that, for reasons that are not yet well understood, local circuits are to some extent dynamically insulated from the larger network in which they are embedded, although they are topologically not detached. Such "functional modularity" may be a property of the particular architecture of the evolved complex GRNs. In fact, in models of complex networks (discussed in the next section), evidence for just such "functional modularity" has been found with respect to network dynamics [116]. In brief preview (see Sect. "Three Regimes of Behaviors for Boolean Networks"), work on random Boolean network models of GRNs has revealed three "regimes" of dynamical behaviors – ordered, critical, and chaotic – as described in Sect. "Ordered and Chaotic Regimes of Network Behavior". In the ordered and critical regimes, many or most genes become "frozen" into "active" or "inactive" states, leaving behind functionally isolated "islands" of genes which, despite being connected to other islands
Complex Gene Regulatory Networks – from Structure to Biological Observables: Cell Fate Determination, Figure 3 Expansion of the bistable two-gene circuit by self-stimulation – creating a central metastable attractor. a The bistable circuit of Fig. 2 and associated epigenetic landscape, now as contour graph. b If the two genes exert self-activation, then, for a large range of parameters and additive effect of the two inputs that each gene receives, the epigenetic landscape is altered such that the central unstable fixed-point in a becomes locally stable, S_3, giving rise to robust "tristable" dynamics [99]. c Two specific examples of observed gene circuits that represent the circuit of b. Such network motifs are typically found to regulate the branch points where multipotent progenitor cells make binary cell fate decisions [99,162]. The metastable central attractor S_3 can be modeled as representing the metastable bi-potent progenitor cell which is poised in a state of locally stable indeterminacy between the two prospective fate attractors S_1 and S_2
through the interactions of the network, are free to vary their activities without impacting the behaviors of other functionally isolated islands of genes.

Complex Networks

Despite a recent spate of publications in which the temporal change of expression of individual genes has been predicted based on small circuit diagrams, such predictions do not provide understanding of the integrated cell behavior, such as the change of cell phenotype, which may involve thousands of genes in the core GRN. Thus, analysis of
genome-wide network state is needed to understand the biological observable. In a first approximation, it is plausible that the state of a cell, such as the particular cell (pheno-)type in a multi-cellular organism, is defined by its genome-wide gene expression profile, or transcriptome T = [x_1, x_2, …, x_N] with N = 25,000. In fact, microarray analysis of various cells and tissues reveals globally distinct, tissue-specific patterns of gene expression that can easily be discerned, as shown in Fig. 1f. As mentioned above (Sect. "Definition of the Subject"), the gene expression profile across the genome is governed by the core regulatory network of transcription factors (TFs)
which enslave the rest of the genome. Thus, in our approximation the network state S of the core transcriptional network of 3000 or so genes essentially controls the entire (genome-wide) gene expression profile. For clarity of formalization, it is important to note that one genome in principle encodes exactly one fixed network, since the network connections are defined by the specific molecular interactions between the protein structure of TFs and the DNA sequence motif of the cis-regulatory promoter elements they recognize. Both are encoded by the genomic sequence. The often-encountered notion that "networks change during development" and that "every cell type has its own network" is in this strict formalism incorrect – the genes absent in one cell type must directly or indirectly have been repressed (and sometimes, continuously kept repressed) by other genes that are expressed. Thus, the genome (of one species) directly maps into one (time-invariant) network architecture, which in turn can generate a variety of states S(t). It is the state S that changes in time and is distinct in different cell types or in different functional states within one cell type. Only genetic mutations in the genome will "rewire" the network and change its architecture. If the network state S(t) of the core GRN directly maps into a state of a cell (the biological observable), then one question is: What is the nature of the integrated dynamics of S in complex, irregularly wired GRNs, and is it compatible with observable cell behavior? The study of the dynamics of complex networks (Table 1, B) may at first glance appear to be impeded by our almost complete ignorance about the architecture and interaction modalities between the genes. However, we cannot simply extrapolate from the mindset of studying small circuits with rate equations to the analysis of large networks.
This is not only impossible due to the lack of information about the detailed structure of the entire network – the function F in Eq. (1) is unknown – but it may also be numerically hard to do. Yet, despite our ignorance about the network architecture, much can be learned if we reset our focus plane to the larger picture of the network. In this regard, the study of the dynamics of complex networks can have two distinct goals (see Table 1). One line of research (Table 1, B2) overcomes the lack of information about the network architecture by taking an ensemble approach [112] and asks: What is the typical behavior of a statistical ensemble of networks that is characterized by an architectural feature (e. g. average connectivity, power-law exponent γ)? This computationally intense approach typically entails the use of discrete-valued gene networks (Sect. "Boolean Networks as Model for Complex GRNs"). As will be discussed below, such analysis has led to the definition of three broad classes of behaviors: chaotic, critical and ordered. The second approach (Table 1, B3) to the dynamics of complex GRNs is closer to experimental biology and exploits the availability of gene expression profile measurements using DNA microarray technology. Such measurements provide snapshots of the state of the network S(t) over N genes, covering almost the entire transcriptome, and thus reveal the direct output of the GRN as a distributed control system. Monitoring S(t) and its change in time during biological processes at the whole-cell level will reveal the constraints imposed on the dynamics of S(t) by network interactions and can, in addition to providing data for the inference problem (Sect. "Network Inference"), expose particular dynamical properties of the transcriptome that can be correlated with the biological observable.

Cell Fates, Cell Types: Terminology and Concepts

In order to appreciate the meaning of the network state S and how it maps to the biological observable, we will now present (to non-biologists) in more detail the most prosaic biological observable of gene network dynamics: cell fate determination during development.

Stem Cells and Progenitor Cells in Multi-cellular Organisms

A hallmark of multi-cellular organisms is the differentiation of the omnipotent zygote (fertilized egg) via totipotent and pluripotent embryonic stem cells and multipotent tissue stem cells into functionally distinct mature "cell types" of the adult body, such as red blood cells, skin cells, nerve cells, liver cells, etc. This is achieved through a branching (treelike) scheme of successive specialization into lineages. If the fertilized egg is represented by the main trunk of the "tree of development", then think of cells at the branching points of developmental paths as the stem cells.
One example of a multipotent stem cell is the hematopoietic stem cell (HSC), which is capable of differentiating into the entire palette of blood cells, such as red blood cells, platelets and the variety of white blood cells. The last branch points represent progenitor cells, which have a lesser developmental potential but still can choose between a few cell types (e. g., the common granulocyte-macrophage progenitor (GMP) cell, Fig. 1f). Finally, the outermost branches of the tree represent the mature, terminally differentiated cell types of the body. A cell that can branch into various lineages is said to be "multipotent". It is a "stem cell" when it has the potential to
self-renew by cell division, maintaining its differentiation potential, and to create the large family of cells of an entire tissue (e. g., the hematopoietic stem cell). Thus, progenitor cells, which proliferate but cannot indefinitely self-renew, are strictly not stem cells. The commitment to a particular cell phenotype (a next-generation branch of the tree) is also referred to as a "cell fate", since the cell at the proximal branching point is "fated" to commit to one of its prospective cell types.

Development and Differentiation

The diversification of the embryonic stem cell to yield the spectrum of the thousand or so cell types [94] in the body occurs in a process of successive branching events, at which multipotent cells commit to one fate and which appear to be binary in nature. Thus, multipotent cells make an either-or decision between typically two lineages – although more complex schemes have been proposed [65]. Moreover, it is generally assumed that during natural development there is only diversification of developmental paths but no confluence from different lineages, although recently exceptions to this rule have been reported for the hematopoietic system [1]. As cells develop towards the outer branches of the "tree of development", they become more and more specialized and progressively lose their competence to proliferate and diversify ("potency"). They also develop the phenotype features of a mature cell type; for instance, in the case of red blood cells, they adopt the flat, donut-like shape and synthesize haemoglobin. This process is called differentiation. Most cells then also lose their capacity to divide, that is, to enter into the proliferative state. Thus, mature, terminally differentiated cells are typically quiescent or "post-mitotic". The branching scheme of cell types imposes another fundamental property: cell types are discrete entities. They are distinct from each other, i.
e., they are well separated in the “phenotype space” and are stable [171]. There is thus no continuum of phenotype. As Waddington, a prominent embryologist of the last century, recognized in the 1940s: Cell types are “well-recognisable” entities and “intermediates are rare” [198]. In addition to the (quasi-) discreteness between branches of the same level, discreteness between the stages within one developmental path is also apparent: a multipotent stem cell at a branching point is in a discrete stage and can be identified based on molecular markers, isolated and cultured as such. Hence, a “stem cell” is not just a snapshot of an intermediate stage within a continuous process of development, but a discrete metastable entity.
The flow down the developmental paths, from a stem cell to the terminally differentiated state, is, despite the pauses at the various metastable stem and progenitor cell levels, essentially unidirectional and, with a few exceptions, irreversible. In some tissues, the mature cells, as is the case with liver, pancreas or endothelial cells, can upon injury revert to a phenotype similar to that of the last immature (progenitor) stage and resume proliferation to restore the lost cell population, upon which they return to the differentiated, quiescent state. Biologists often speak in a somewhat loose manner of cells "switching" their phenotype. This may refer to switching from a progenitor state to a terminally differentiated state (differentiation) or, within a progenitor state, from the quiescent to other functional states, such as the proliferative or the apoptotic state (apoptosis = programmed cell death). In any case, such intra-lineage switching between different functional states also represents discontinuous, quasi-discrete transitions of whole-cell behaviors. The balance between division, differentiation and death in the progenitor or stem cell compartment of a tissue thus consists of state transitions that entail all-or-none decisions. This balance is at the core of organismal development and tissue homeostasis. Now we can come back to the network formalism: if the network state S maps directly into the biological observable, what are the properties of the network architecture that confer its ability to produce the properties of the biological observable outlined in this section: discreteness and robustness of cell types, discontinuity of transitions, successive binary diversification and directionality of these processes? Addressing these questions is the long-term goal of a theory of the multicellular organism. In the following we describe the status of research toward this goal.
History of Explaining Cell Types

Waddington's Epigenetic Landscape and Bistable Genetic Circuits

One of the earliest conceptualizations of the existence of discreteness of cell types was the work of C. Waddington, who proposed "epigenetic regulation" in the 1940s, an idea that culminated in the famous figure of the "epigenetic landscape" (Fig. 2g). This metaphor, devoid of any formal basis, let alone relationship to gene regulation, captures the basic properties of discrete entities and the instability of intermediates. The term 'epigenetic' was coined by Waddington to describe distinct biological traits that arise from the interplay of the same set of genes and do not require the assumption of a distinct, "causal" gene to
be explained [196]. The 1957 version of the epigenetic landscape [198] (Fig. 2g) also implies that multipotent cells are destined to make either-or decisions between two prospective lineages, as embodied in the "watershed" regions (Fig. 2g). Almost at the same time as Waddington, in 1948 Max Delbrück proposed a generic concept of differentiation into two discrete states in a biochemical system that can be described by equations of the form (3), consisting of two mutually inhibiting metabolites, x_1(t) and x_2(t), that exhibits bistability [53] (for a detailed qualitative explanation, see Fig. 2). The dynamics of such a system is graphically represented in the two-dimensional state space spanned by x_1 and x_2 (Fig. 2c-f). Bistable dynamics implies that there are two steady states S_1 and S_2 that satisfy dx_1/dt = dx_2/dt = 0 and are stable fixed-points of the system. For the system equations of Fig. 2b, this behavior is observed for a large range of parameters. In a nutshell, the mutual inhibition renders the balanced steady state S_3 (x_1 = x_2) unstable, so that the system settles down in either the steady state S_1 (x_1 ≫ x_2) or S_2 (x_1 ≪ x_2) when kicked out of this unstable fixed point. These two stable steady states and their associated gene activity patterns are discretely separated in the x_1–x_2 state space and have been postulated to represent the differentiated state of cells, thus corresponding to Waddington's valleys. This was the first conceptualization of a cellular differentiated state as a stable fixed-point of a non-linear dynamical system. Soon after Monod and Jacob discovered the principle of gene regulation, they also proposed a circuit of the same architecture as Delbrück's [148], but consisting of two mutually suppressing genes instead of metabolites, to explain differentiation in bacteria as a bistable system.
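The instability of the balanced state S_3 can be checked directly: at x_1 = x_2 the Jacobian of the system has one positive eigenvalue, so S_3 is a saddle. A minimal sketch, assuming the same illustrative symmetric toggle-switch equations as in Fig. 2b (the Hill coefficient n and strength a are hypothetical parameter choices):

```python
def bisect(f, lo, hi, tol=1e-12):
    """Root of f on [lo, hi] by bisection (assumes a sign change)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

a, n = 3.0, 2  # hypothetical parameters of the symmetric toggle switch

# Balanced steady state S3 (x1 = x2 = x*): solve a / (1 + x**n) - x = 0
x_star = bisect(lambda x: a / (1.0 + x ** n) - x, 0.0, a)

# The Jacobian at S3 is [[-1, -c], [-c, -1]], where c is the magnitude
# of the cross-inhibition slope d/dx2 [a / (1 + x2**n)] at x*;
# its eigenvalues are -1 + c and -1 - c.
c = a * n * x_star ** (n - 1) / (1.0 + x_star ** n) ** 2
eigenvalues = (-1.0 + c, -1.0 - c)
# c > 1 for these parameters, so one eigenvalue is positive:
# the balanced state S3 is an unstable saddle, as argued in the text.
```

Because the positive eigenvalue points along the direction that breaks the x_1 = x_2 symmetry, any small perturbation sends the system towards one of the two stable states S_1 or S_2.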
Bistability, Tristability and Multistability

The central idea of bistability is that the very same system, in this case a gene circuit composed of two genes x_1 and x_2, can under some conditions produce two distinct stable states separated by an unstable stationary state, and hence can switch in an almost discontinuous manner from one state (S_1) to another (S_2) ("toggle switch") (Fig. 2). Bistability is a special (the simplest) case of multi-stability, an elementary (but not necessary) property of systems with non-linear interactions, such as those underlying the gene regulatory influences as detailed in Fig. 2b. Which stable state (in this case, S_1 or S_2) a network occupies depends on the initial condition, the position of S(t = 0) in the x_1–x_2 state space (Fig. 2c). The two stable steady states, S_1 and S_2, are called "attractors" since the system in such states will return to the characteristic gene expression pattern in response to small perturbations to the system by enforced change of the values of x_1 or x_2. Similarly, after an external influence that places the network at an unstable initial state S_0(t = 0) within the basin of attraction of attractor S_1 (gray area in Fig. 2c), and hence causes the network to settle down in the stable state S_1, the network will stay there even after the causative influence that initially put it in the state S_0 has disappeared. Thus, attractor states confer memory of the network state. In contrast, larger perturbations beyond a certain threshold will trigger a transition from one attractor to another, thus explaining the observed discontinuous "switch-like" transitions between two stable states in a continuous system. After Delbrück, Monod and Jacob, numerous theoretical [74] and, with the rise of systems biology, experimental works have further explored similar simple circuitries that produce bistable behavior (see refs. in [38]). Such circuits and the predicted switch-like behavior have been found in various biological systems, including in gene regulatory circuits in Escherichia coli [157], in mammalian cell differentiation [43,99,127,168] as well as in protein signal transduction modules [12,64,205]. Artificial gene regulatory circuits have been constructed using simple recombinant DNA technology to verify model predictions [22,71,123]. Circuit analyses have been expanded to cover more complex circuits in mammalian development. It appears that there is a common theme in the circuits that govern cell differentiation (Sect. "Cell Fates, Cell Types: Terminology and Concepts"): interconnected pairs of mutually regulating genes that are often also self-regulatory, as shown in Fig. 3b [43,99,168]. Such circuit diagrams may be crucial in controlling the binary diversification at developmental branch points (Fig. 1f) [99]. In the case where the two mutually inhibiting genes are also self-stimulatory, as summarized in Fig.
3b, an interesting modification of the bistable dynamics can be obtained. Assuming independent (additive) influence of the self-stimulatory and cross-inhibitory inputs, this circuit will convert the central unstable state (saddle) S_3 (Figs. 2, 3a) that separates the two stable steady states into a stable steady state, thus generating tristable behavior [99]. The third, central stable fixed point has, in symmetrical cases, the gene expression configuration S_3 [x_1 ≈ x_2] (Fig. 3b). The promiscuous expression of intermediate-low levels of x_1 and x_2 in the locally stable state S_3 has been associated with a stem or progenitor cell that can differentiate into the cells represented by the attractors S_1 (x_1 ≫ x_2) or S_2 (x_1 ≪ x_2). In fact, the common progenitor cells (Fig. 1f) have been shown to display "promiscuous" expression of genes thought to be specific for either of the
Complex Gene Regulatory Networks – from Structure to Biological Observables: Cell Fate Determination
lineages to which it will have to commit, so to speak providing a "preview" or "multi-lineage priming" of the gene expression of its prospective cell fates [47,59,92]. For instance, the common myeloid progenitor (CMP, Fig. 1), which can commit to either the erythroid or the myeloid lineage, expresses both the erythroid-specific transcription factor GATA1 (= x1) and the myeloid-specific transcription factor PU.1 (= x2) at intermediate levels (Fig. 3c). The metastable configuration with intermediate levels of both GATA1 and PU.1 generates a state of indeterminacy or "suspense" [146] awaiting the signal, be it a biological instruction or stochastic perturbation, that resolves it to become either one of the more stable states [GATA1 high, PU.1 low] or [GATA1 low, PU.1 high] when cells differentiate into the erythroid or myeloid lineage, respectively. Thus, multi-potency and indeterminacy of a progenitor cell can be defined purely dynamically and do not require a much-sought-after "stemness" gene (again, a concept that arose from the gene-centered view) [195]. The metastable state also captures the notion of a "higher potential energy" [79] that drives development and hence may account for the arrow of time in ontogenesis.

A Formalism for Waddington's Epigenetic Landscape Metaphor

Obviously, the valleys in Waddington's epigenetic landscape represent the stable steady states (attractors) while the hill-tops embody the unstable fixed points, as shown in Fig. 2. How can Waddington's metaphor formally be linked to gene network dynamics and the state space structure? For a one-dimensional system dx/dt = f(x), this is easily shown with the following analogy: An "energy" landscape represents the cumulative "work" performed/received when "walking" in the state space against/along the vector field of the "force" f(x) (Fig. 2c). Thus, the "potential energy" is obtained by integrating f(x) (the right-hand side of the system equation, Eq. (1)) over the state space variables x.
V(x) = − ∫ f(x) dx .  (3)

Here the state space dimension x is given the meaning of a physical space, as that pertaining to a landscape, and the integral V(x) is the sum of the "forces" experienced by the network states S(t) = x(t) over a path in x that drive S(t) to the stable states (Fig. 2c). The negative sign establishes the notion of a potential in that the system loses energy as it moves towards the stable steady states which are in the valleys ("lowest energy"). Higher (N > 1) dimensional systems (Eq. 1) are
in general non-integrable (unless there exists a continuously differentiable (potential) function V(x1, x2, …, xN) for which f1 dx1 + f2 dx2 + … + fN dxN is an exact total differential, so that −grad(V) = F(x) with F(x) = [f1(x), f2(x), …, fN(x)]^T). Thus, there is in general no potential function that could represent a "landscape"; however, the idea of some sort of "potential landscape" is widely (and loosely) used, especially in the case of two-dimensional systems where the third dimension can be depicted as a cartographic elevation V(x1, x2). An elevation function V(x1, x2) can be obtained in stochastic systems where x(t) is subjected to random fluctuations due to gene expression noise [87]. Then V(S) is related to the probability P(S) to find the system in state S = (x1, x2, …), e.g., V(S) = −ln(P(S)) [118,177]. It should however be kept in mind that the "quasi-potential" V is not a true (conservative) potential, since the vector field is not a conservative field. The above treatment of the dynamics of the gene regulatory circuit explains the valleys and hills of Waddington's landscape but still lacks the "directionality" of the overall flow on the landscape, depicted by Waddington as the slope from the back to the front of his landscape. (This arrow of time of development will briefly be discussed in the outlook Sect. "Future Directions and Questions").) In summary, the epigenetic landscape that Waddington proposed based on his careful observation of cell behavior and that he reshaped over decades [103,179,196,197,198] can now be given both a molecular biology correlate (the gene regulatory networks) and a formal framework (the probability landscape of network states S). The landscape idea lies at the heart of the connection between molecular network topology and biological observables.
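The one-dimensional case can be made concrete with a minimal numerical sketch. The force field f(x) = x − x³ below is purely illustrative (it is not a circuit from this article); it has two stable fixed points at x = ±1 separated by an unstable fixed point at x = 0, and the quasi-potential V(x) = −∫ f(x) dx = x⁴/4 − x²/2 shows the corresponding two valleys and one hilltop:

```python
import numpy as np

# Minimal 1-D sketch (illustrative, not a circuit from this article):
# a bistable "force" f(x) = x - x^3 with stable fixed points at x = -1
# and x = +1, separated by an unstable fixed point at x = 0.
def f(x):
    return x - x**3

# Quasi-potential V(x) = -integral of f(x) dx, here in closed form:
# V(x) = x^4/4 - x^2/2
def V(x):
    return x**4 / 4 - x**2 / 2

xs = np.linspace(-1.5, 1.5, 301)
vs = V(xs)

# The two valleys (attractors) sit at x = +/-1; the hilltop (the unstable
# state) at x = 0 separates their basins of attraction.
assert abs(abs(xs[np.argmin(vs)]) - 1.0) < 0.01
assert V(0.0) > V(1.0) and V(0.0) > V(-1.0)
```

As noted above, this construction does not carry over to generic N > 1 dimensional circuits, whose vector fields are usually not gradient fields.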
The Molecular Biology View: "Epigenetic" Marks of Chromatin Modification

Although it is intuitively plausible that the stable steady states of the circuits of Delbrück and of Monod and Jacob may represent the stable differentiated state, this explanation of a biological observable in terms of a dynamical system was not popular in the community of experimental molecular biologists and was soon sidelined as molecular biology, and with it the gene-centered view, came to dominate biology. The success in identifying novel genes (and their mutated alleles) and the often straightforward explanation of a phenotype offered by the mere discovery of a (mutant) gene triggered a hunt for such "explanatory genes" to account for biological observables. Gene circuits had to give way to the one-gene-one-trait concept, leading to an era in which a new gene-centered epistemological habit, sometimes referred to as "genetic determinism", prevailed [180]. Genetic determinism, a particular form of reductionism in biology, was to last for another fifty years after Delbrück's proposal of bistability. Only as the "low-hanging fruits" of simple genotype-phenotype relationships seemed to have all been picked, and genome-wide gene expression measurements became possible, was the path cleared for the rise of "systems biology" and the biocomplexity that we witness today. In genetic determinism, macroscopic observables are reduced in a qualitative manner to the existence of genes or regulatory interactions which at best "form" a linear chain of events. Such "pathways", most lucidly manifest in the arrow-arrow schemes of modern cell biology papers, served as a mechanistic explanation and satisfied the intellectual longing for causation. It is in light of this thinking that the molecular biologists' explanation of the phenomenon of cell type determination has to be seen. Absent a theory of cell fate diversification, and in view of the stunning stability of cell types, a dogma thus came into existence according to which the type identity of cells, once committed to, is irreversibly carved in stone [161]. Rare transdifferentiation events (switches of cell lineage) were regarded as curiosities. The observation that cell types express type-specific genes was explained by the presence of cell-type-specific transcription factors. For instance, red blood cells express haemoglobin because of the presence of GATA1 – a lineage-specific transcription factor that not only promotes commitment to the erythroid lineage (as discussed above, Sect. "Bistability, Tristability and Multistability") but also controls haemoglobin expression [149].
Conversely, the absence of gene activity, for instance the non-expression of liver-specific genes in erythrocytes, was explained by the silencing of the not-needed genes by covalent DNA methylation and histone modifications (methylation, acetylation, etc.) [13,65,69,117,122], which modify chromatin structure and thereby control the access of the transcription machinery to the regulatory sites on the DNA. But who controls the controller? Chromatin modifications [117,122] are thought to confer discrete alterations of gene expression state that are stable and essentially irreversible. This idea of permanent marks on DNA represents the conceptual cousin of mutations (but without altering the DNA sequence) and was thus in line with the spirit of genetic determinism. Accordingly, they were readily adopted as an explanation of the apparently irreversible cell-type-specific gene inactivation and were given the attribute "epigenetic" to contrast them with the genetic changes that involve alteration of DNA sequences. But the enzymes responsible for covalent DNA and chromatin modification are not gene-locus specific, leaving open how the cell-type-specific gene expression pattern is orchestrated. It is important to mention here a disparity in the current usage of the term "epigenetics" [103]. In modern molecular biology, "epigenetics" is almost exclusively used to refer to DNA methylation and covalent histone modifications; this meaning is taken for granted even among authors who comment on the very usage of this term [13,26,103], and who are unaware that memory effects can arise purely dynamically, without a distinct material substrate, as discussed in Sect. "Bistability, Tristability and Multistability". In contrast, biological physicists use "epigenetic" to describe precisely those phenomena, such as multi-stability (Sect. "Bistability, Tristability and Multistability"), that are found in non-linear systems – a usage that comes closer to Waddington's original metaphor for illustrating the discreteness and stability of cell types [197,198]. The use of Waddington's "epigenetic landscape" has recently seen a revival [75,165] in the modern literature in the context of chromatin modifications but remains loosely metaphoric and without a formal basis.

Rethinking the Histone Code

The idea of methylation and histone modifications as a second code beyond the DNA sequence (the "histone code") that cells use to "freeze" their type-specific gene expression pattern relied on the belief that these covalent modifications act like permanent marks (only erased in the germline when an oocyte is formed). This picture is now beginning to change. First, recent biochemical analyses suggest that the notion of a static, irreversible "histone code" is oversimplified, casting doubt on the view that histone modification is the molecular substrate of "epigenetic memory" [122,126,144,191].
With the accumulating characterization of chromatin-modifying enzymes, notably those controlling histone lysine (de)methylation [122,126,144,191], it is increasingly recognized that the covalent "epigenetic" modifications are bidirectional (reversible) and highly dynamic. Second, cell fate plasticity, most lucidly evident in the long-known but rarely observed transdifferentiation events, or in the artificial reprogramming of cells into embryonic stem cells either by nuclear transfer-mediated cloning [91] or by genetic manipulation [143,156,184,201], confirms that the "epigenetic" marks are reversible – given that the appropriate biochemical context is provided. If what was thought of as permanent molecular marks is actually dynamic and reversible – what then maintains lineage-specific gene expression patterns in an inheritable fashion across cell divisions?
In addition, as mentioned above, there is another, more fundamental question from a conceptual point of view: chromatin-based marking of gene expression status is a generic, not locus-specific, process – the same enzymes can apply or remove the covalent marks on virtually any gene in the genome. They are dumb. So what smart system orchestrates the DNA methylation and histone modification machinery at the tens of thousands of loci in the genome so that the appropriate set of genes is (in)activated to generate the cell-type-specific patterns of gene expression? A system-level view avoids the conundrum caused by the mechanistic, proximal explanation [189] of the gene-centered view. A complex systems approach led to the idea that the genome-wide network of transcriptional regulation can under some conditions spontaneously orchestrate, thanks to self-organization, the establishment of lineage-specific gene expression profiles [114], as will be discussed below. In fact, the picture of chromatin modification as primum movens that operates "upstream" of the transcription factors (TFs), controlling their access to the regulatory elements in promoter regions, must be revised in light of a series of new observations. Evidence is accumulating that the controller itself is controlled – namely, by the very TFs it is thought to control: TFs may actually take the initiative by recruiting the generic chromatin-modifying enzymes to their target loci [51,80,125,144,145,182,185]. It is even possible that a mutual, cooperative dynamical interdependence between TFs and chromatin-modifying enzymes may establish locus-specific, switch-like behavior that commands "chromatin status" changes [56,132]. In fact, an equivalent of the indeterminacy state where both opposing lineage-specific TFs balance each other (Sect. "Bistability, Tristability and Multistability", Fig.
3b) is found at the level of chromatin modification, in that some promoters exhibit “bivalent” histone modification in which activating and suppressing histone methylations coexist [25]. Such states in fact are associated with TFs expressed at low level – in agreement with the central attractor S3 of the tristable model (Fig. 3b). Thus, chromatin modification is at least in part “downstream” of TFs and may thus act to add additional stability to the dynamical states that arise from the network of transcriptional regulation. If correct, such a relationship will allow us to pass primary responsibility for establishing the observable gene expression patterns back to transcription factors. With its genome-wide regulatory connections the GRN is predestined for the task of coordination and distributed information processing. But under what conditions can a complex, apparently randomly wired network of 3000 regulators create, with
such stunning reliability and accuracy, the thousands of distinct, stable, gene expression profiles that are associated with a meaningful biological state, such as a cell type? Studies of Boolean networks as toy models in the past 40 years have provided important insights.
Boolean Networks as Model for Complex GRNs

General Comment on Simplifying Models

The small-circuit models discussed in Sect. "Small Gene Circuits" represent arbitrarily cut-out fragments of the genome-wide regulatory network. A cell phenotype, however, is determined by the gene expression profile over thousands of genes. How can we study the entire network of 25,000 genes, or at least the core GRN with 3000 transcription factors, even if most of the details of the interactions remain elusive? The use of random Boolean networks as a generic network model, independent of the specific GRN architecture of a particular species, was proposed by Kauffman in 1969 – around the time at which the very idea of small gene regulatory circuits was presented by Monod and Jacob [148]. In random Boolean networks the dynamics is implemented by assuming discrete-valued genes that are either ON (expressing the encoded protein) or OFF (silenced). The interaction function (Sect. "Introduction", and F in Sect. "Small Gene Circuits") that determines how the ON/OFF status of the multiple inputs of a target gene maps into its behavior (output status) is a logical (Boolean) function B, and the network topology is randomly wired, with the exception of some deliberately fixed features. Thus, the work on random Boolean networks allowed the study of the generic behavior of large networks of thousands of genes even before molecular biology could deliver the actual connections and interaction logics of the GRNs of living systems (for review, see [8]). The lack of detailed knowledge about specific genes and their interaction functions, and the formidable computational cost of modeling genome-wide networks, have warranted a coarse-graining epitomized by the Boolean network approach. But more specifically, in the broader picture of system-wide dynamics, the discretization of gene expression levels is also justified because (i) the above-discussed steep sigmoidal shape of the "transfer functions" (Fig.
2b) that describe the influence of one gene on another's expression rate can be approximated by a step function, and/or (ii) the local dynamics produced by such small gene circuit modules is in fact characterized by discontinuous transitions between discrete states, as shown in Sect. "Network Dynamics".
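The first justification can be illustrated with a toy calculation. The Hill-type transfer function below (the threshold K and Hill coefficient n are illustrative choices, not values from this article) becomes practically indistinguishable from the ON/OFF step function of Boolean models once it is steep enough:

```python
# Sketch: a steep sigmoidal "transfer function" (Hill form; threshold K
# and Hill coefficient n are illustrative choices) approaches the ON/OFF
# step function assumed in Boolean network models.
def hill(x, K=1.0, n=2):
    return x**n / (K**n + x**n)

def step(x, K=1.0):
    return 1.0 if x > K else 0.0

# away from the threshold, a steep Hill curve is already "Boolean"
for x in (0.5, 2.0):
    assert abs(hill(x, n=50) - step(x)) < 1e-6
```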
In addition, the Boolean network approach offers several advantages over the exhaustive, maximally detailed models favored by engineers, who seek to understand a particular instance of a system rather than the typical properties of a class of systems. The simplification opens a new vista onto fundamental principles that may have been obscured by the details since, as philosophers and physicists have repeatedly articulated, there is no understanding without simplification, and "less can be more" [11,30,159]. An important practical advantage of the simplification in the Boolean network approach is the possibility of studying statistical ensembles of tens of thousands of network instances, i.e., entire classes of network architectures, and of addressing the question of how a particular architecture type maps into a particular type of dynamic behavior. Some of the results obtained are in fact valid for both discrete and continuous behavior of the network nodes [17].

Model Formalism for Boolean Networks

In the Kauffman model of random Boolean networks gene activity values are binary (1 = ON, 0 = OFF) [111,115,116] and time is also discretized. Thus, a Boolean network is a generalized form of cellular automaton, but without the aspect of physical space and the particular neighborhood relations. Then, in analogy to continuous systems, a network of N elements i (i = 1, 2, …, N) defines a network state S at any given discrete time step t: S(t) = [x1(t), x2(t), …, xN(t)], where xi is the activity status that now only takes the values 1 or 0. The principles are summarized in Fig. 4. The state space that represents the entire dynamics of the network is finite and contains 2^N states. However, again, not all states are equally likely to be realized and observed, since genes do not behave independently but influence each other's expression status.
Regulation of gene i by its incoming network connections is modeled by the Boolean function Bi that is associated with each gene i and maps the configuration of the activity status (1 or 0) of its input genes (the upstream regulators of gene i) into the new value of xi for the next time point. Thus, the argument of the Boolean function Bi is the input vector Ri(t) = [x1(t), x2(t), …, xki(t)], where ki is the number of inputs that gene i receives. At each time step, the value of each gene is updated: xi(t + 1) = Bi[Ri(t)]. The logical function Bi can be formulated as a "truth table", which is convenient for large k's (Fig. 4a). In the widely studied case where each gene has exactly k = 2 inputs, the Boolean function can be one of the set of 2^(2^k) = 16 classical Boolean operators, such as AND, OR, NOTIF, etc. [109,116]. Figure 4b shows the example of the well-studied lactose operon and how its regulatory characteristics can be captured as an AND Boolean function of the two inputs. In the simplest model, all genes are updated in every time step. This synchrony is artificial, for it assumes a central clock in the cell, which is not likely to exist, although some gating of processes by oscillations in the redox potential has been reported [120]. The idealization of synchrony, however, facilitates the study of large Boolean networks, and many of the results that have been found with synchronous Boolean networks carry over to networks with asynchrony of updating, which has been implemented in various ways [40,72,81,119]. For synchronous networks the entire network state S can also be viewed as being updated by the updating function U: S(t + 1) = U[S(t)], where U summarizes all the N Boolean functions Bi. This facilitates some treatments and is represented in the state transition table, which lists all possible network states in one column and their successors in a second column. This leads to a higher-level, directed graph that represents the entire dynamics of a network, as illustrated below.

Dynamics in Boolean Networks

The state transition table captures the entire dynamical behavior of the network and can conveniently be depicted as a state transition map, a directed graph in which a node represents one of the 2^N possible states S(t) (a box with a string of 1s and 0s in Fig. 4a, representing the gene expression pattern). Such diagrams are particularly illustrative only for N up to about 10, since they display all possible states S of the finite state space [204]. The states are connected by arrows which represent individual state transitions and collectively depict the trajectories in the state space (Fig. 4a, right panel). Since the Boolean functions are deterministic, a state S(t) unambiguously transitions into one successor state S(t + 1). In contrast, a state can have multiple predecessors, since two or more different states can be updated into the same state.
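These definitions translate almost directly into code. The sketch below builds the full state transition map of a small, hypothetical 3-gene synchronous Boolean network (the wiring and Boolean functions are invented for illustration, not taken from Fig. 4) and extracts its attractors by following each trajectory until a state repeats:

```python
from itertools import product

# Hypothetical 3-gene synchronous Boolean network (illustrative wiring,
# not the article's Fig. 4 example). A state S = (xA, xB, xC).
rules = {
    0: lambda s: s[1] and not s[2],   # A(t+1) := B AND NOT C
    1: lambda s: s[0] or s[2],        # B(t+1) := A OR C
    2: lambda s: not s[0],            # C(t+1) := NOT A
}

def update(state):
    # synchronous update: every gene reads the same time-t state
    return tuple(int(rules[i](state)) for i in range(len(state)))

# state transition map over the full state space of 2^N = 8 states
transition = {s: update(s) for s in product((0, 1), repeat=3)}

def canon(cycle):
    # rotate a cycle so comparisons ignore the entry point
    i = cycle.index(min(cycle))
    return cycle[i:] + cycle[:i]

def attractor(state):
    # follow the trajectory until a state repeats; the repeating
    # segment is the attractor (a fixed point or a limit cycle)
    seen = []
    while state not in seen:
        seen.append(state)
        state = transition[state]
    return canon(tuple(seen[seen.index(state):]))

attractors = {attractor(s) for s in transition}
# this toy network has two point attractors and one period-2 limit cycle
print(sorted(attractors, key=len))
```

Because the update is deterministic, every trajectory in this enumeration converges, never branches, and ends in one of a small number of attractors, exactly as described above.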
Hence, trajectories can converge but not diverge, i. e., there is no branching of trajectories emanating from one state. This property of “losing information about the ancestry” is essential to the robustness of the dynamics of networks. In updating the network states over time, S(t) can be represented as a walk on the directed graph of the state transition map. Because of the finiteness of the state space in discrete networks, S(t) will eventually encounter an already visited state, and because of the absence of divergence, the set of successive states will be the same as in the previous encounter. Thus, no matter what the initial state is, the network will eventually settle down in either a set of
Complex Gene Regulatory Networks – from Structure to Biological Observables: Cell Fate Determination, Figure 4 Principles of Boolean network models of GRNs, exemplified on an N = 4 gene network. a From architecture to dynamics. The four genes i: A, B, C, D interact as indicated by the directed graph (top left panel), and each of the genes is assigned a Boolean function Bi as indicated, with the corresponding "truth table" shown below. A network state is a string of N = 4 binary variables, thus there are 2^N = 2^4 = 16 possible network states S. They collectively establish the entire state space and can be arranged in a state transition map according to the state transitions imposed by the Boolean functions (right panel). The attractor states are colored gray. In the example, there are three point attractors and one limit cycle attractor (of period T = 2). The dotted lines in the state space denote the attractor boundaries. b Example of capturing the regulation at the promoter of the lactose operon as an "AND" Boolean function. Note that there are many ways to define what constitutes an input – in the case shown, the allosteric ligands cAMP ("activator") and a β-galactoside ("inducer"), such as allolactose, give rise to a two-input Boolean function 'AND'
cycling states (which form a limit cycle) or in a single stable state that updates onto itself. Accordingly, these states to which all the trajectories are "attracted" are the attractors of the network. They are equivalent to stable oscillators or stable fixed points, respectively, in the continuous description of gene circuits (Sect. "Bistability, Tristability and Multistability"). In other words, because of the regulatory interactions between the genes, the system cannot freely choose any gene expression configuration S. Again,
most network states S in the state space are thus unstable and transition to other states to comply with the Boolean rules until the network finds an attractor. And as with the small, continuous systems (Sect. “Bistability, Tristability and Multistability”) the set of states S that “drain” to an attractor state constitute its basin of attraction. However, unlike continuous systems, Boolean network dynamics do not produce unstable steady-states that can represent the indeterminacy of undecided cell states
that correspond to stem cells about to make an either-or decision between two lineages (see Sect. "Bistability, Tristability and Multistability"). Instead, basins of attraction are "disjoint" areas of state space.

Attractors as Cell Types

Kauffman proposed that the high-dimensional attractor states represent cell types in metazoan organisms – thus expanding the early notion of steady states in small circuits to networks of thousands of genes [111,116] (Sect. "Bistability, Tristability and Multistability"). This provides a natural explanation for the stability of the genome-wide expression profile that is characteristic of and defines a cell type, as well as for the stable coordination of genome-wide gene expression oscillations in proliferating cells that undergo the cell division cycle [202]. The correspondence of attractors in large networks with the cell-type-specific transcriptome is a central hypothesis that links the theoretical treatment of the dynamics of complex networks with experimental cell biology.

Use of Boolean Networks to Model Real Network Dynamics

Owing to their simple structure, Boolean networks have been applied in place of differential equations to model real-world networks for which a rudimentary picture of the topology, with only few details about the interaction functions, is known. In this approach, individual Boolean functions are assigned to the network nodes either based on best guesses, informed by qualitative descriptions from the experimental literature, or randomly. This approach has yielded surprisingly adequate recapitulations of biological behaviors, indicating that the topology itself accounts for a great deal of the dynamical constraints [4,49,60,62,95,130]. Such studies have also been used to evaluate the dynamical regime (discussed in the next section) of real biological networks [19].
Three Regimes of Behaviors for Boolean Networks

The simplicity and tractability of the Boolean network formalism has stimulated a broad stream of investigations that has led to important insights into the fundamental properties of the generic dynamics that large networks can generate [116], even before progress in genomics could offer a first glimpse of the actual architecture of a real gene regulatory network [112]. Using the ensemble approach, the architectural parameters that influence the global, long-term behavior of complex networks with N up to 100,000 have been determined [114,116]. As mentioned in the introduction, based on the latest count of the genome size and the idea that
the core transcriptional network essentially governs the global dynamics of gene expression profiles, networks of N ≈ 3000 would have sufficed. Nevertheless, a striking result from the ensemble studies was that for a broad class of architectures, even a complex, irregular (randomly wired) network can produce 'ordered' dynamics with stable patterns of gene expression, thus potentially delivering the "biological observable". In general, the global behavior of ensembles of Boolean networks can be divided into three broad regimes [116]: an ordered and a chaotic regime, and a regime that represents behavior at their common border, the critical regime.

Ordered and Chaotic Regimes of Network Behavior

In networks in the ordered regime, two randomly picked initial states S1(t = 0) and S2(t = 0) that are close to one another in state space, as measured by the Hamming distance H [H(t) = |S1(t) − S2(t)|, the number of genes whose activities differ in the two states S1(t) and S2(t)], will exhibit trajectories that on average quickly converge (that is, the Hamming distance between the two trajectories will on average decrease with time) [55]. The two trajectories will settle down in one fixed-point attractor or a limit cycle attractor with a period T that is small compared to N, and thus produce very stable system behaviors. Such networks in general have a small number of attractors, which typically have small periods T and drain large basins of attraction [116]. Numerical analyses of large ensembles suggest that the average period length scales with √N. The state transition map, as shown for N = 4 in Fig. 4a, will show that trajectories converge onto attractor states from many different directions and are in general rather short [Maliackal, unpublished], so that attractor basins appear compact, with high rotational symmetry and hence "bushy".
It is important to stress here that if a cell type is an attractor, then different cell types are different attractors, and, in the absence of the unstable steady states present in continuous dynamical systems (see Sect. “Bistability, Tristability and Multistability”) differentiation consists in perturbations that move a system state from one attractor into the basin of another attractor from which it flows to the new attractor state that encodes the gene expression profile of the new cell type. Examination of the bushy basins in the ordered regime makes it clear, as do numerical investigations and experiments (Sect. “Are Cell Types Attractors?”), that multiple pathways can lead from one attractor to another attractor – a property that meets resistance in the community of pathway-centered biologists.
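This picture of differentiation as a perturbation-driven jump between basins can be sketched concretely. The code below uses a hypothetical 3-gene synchronous Boolean network (the wiring is invented for illustration), flips single genes of an attractor state, and reports which attractor the network then relaxes into: some flips are absorbed and the network returns, while others switch it into a different attractor, i.e., a different "cell type":

```python
# Sketch of "differentiation as perturbation" on a hypothetical 3-gene
# synchronous Boolean network (illustrative wiring, not a real GRN).
rules = {
    0: lambda s: s[1] and not s[2],   # A := B AND NOT C
    1: lambda s: s[0] or s[2],        # B := A OR C
    2: lambda s: not s[0],            # C := NOT A
}

def update(state):
    return tuple(int(rules[i](state)) for i in range(len(state)))

def settle(state):
    # relax to an attractor, returned in a rotation-independent form
    seen = []
    while state not in seen:
        seen.append(state)
        state = update(state)
    cycle = tuple(seen[seen.index(state):])
    i = cycle.index(min(cycle))
    return cycle[i:] + cycle[:i]

start = (0, 1, 1)                 # a point attractor of this toy network
assert settle(start) == (start,)  # unperturbed, the state is stable

for gene in range(3):
    flipped = list(start)
    flipped[gene] ^= 1            # perturbation: flip one gene ON<->OFF
    print(gene, settle(tuple(flipped)))
# flipping gene B (index 1) is absorbed: the network returns to the same
# attractor; flipping A or C pushes it into a different (period-2) attractor
```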
Complex Gene Regulatory Networks – from Structure to Biological Observables: Cell Fate Determination
In contrast, in networks in the chaotic regime, two randomly placed initial states S1(t = 0) and S2(t = 0) that are initially close to one another (in terms of Hamming distance) will generate trajectories that on average diverge and either end with high likelihood in two different attractors, or may appear to "wander" aimlessly in state space. This happens because the attractor is a limit cycle attractor with a very long period T – on the scale of 2^N – so that in the worst case a trajectory may visit most if not all 2^N possible network configurations S. For a small network of just N = 200, this is a limit cycle on the order of 2^100 ≈ 10^30 time steps in length. As a point of comparison, the universe is some 10^17 seconds old. Given the hyperastronomic size of this number, this "limit cycle" will appear as an aperiodic and endless stream of uncorrelated state transitions, as if the system were on a "permanent transient". Thus, networks in the chaotic regime are not stable, their trajectories tend to diverge (at least initially), and their behavior is sensitive to the initial state. In the state transition map, the small attractors typically receive trajectories with long transients that arrive from a few state space directions; hence, in contrast to the bushy attractors of the ordered regime, the basins have long thin branches and appear "tree-like". The definition of "chaos" for discrete networks given here is distinct from that of (deterministic) chaos in continuous systems, where the time evolution of infinitesimally close initial states can be monitored and shown to diverge. Nevertheless, the degree of chaos in discrete networks, as qualitatively outlined above, is well defined and can be quantified based on the slope of the curve in the so-called Derrida plot, which assesses how a large number of random pairs of initial states evolves in one time step [55].
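A minimal Derrida-style measurement can be sketched as follows. The code generates random Boolean networks with K inputs per gene (unbiased random truth tables; N, K, and the trial counts are illustrative parameters), perturbs one gene of a random state, and records the average Hamming distance after one synchronous step. In the annealed approximation this one-step spread is roughly K/2, so K = 1 networks behave ordered, K = 2 sits near criticality, and K = 4 is chaotic:

```python
import random

def random_net(N, K, rng):
    # each gene: K random inputs plus an unbiased random truth table
    inputs = [[rng.randrange(N) for _ in range(K)] for _ in range(N)]
    tables = [[rng.randrange(2) for _ in range(2 ** K)] for _ in range(N)]
    return inputs, tables

def step(state, net):
    # one synchronous update of all N genes
    inputs, tables = net
    out = []
    for i in range(len(state)):
        idx = 0
        for j in inputs[i]:
            idx = (idx << 1) | state[j]
        out.append(tables[i][idx])
    return out

def derrida_one_bit(N, K, trials, rng):
    # average Hamming distance after one step, starting from pairs of
    # states that differ in exactly one gene
    total = 0
    for _ in range(trials):
        net = random_net(N, K, rng)
        s1 = [rng.randrange(2) for _ in range(N)]
        s2 = list(s1)
        s2[rng.randrange(N)] ^= 1
        t1, t2 = step(s1, net), step(s2, net)
        total += sum(a != b for a, b in zip(t1, t2))
    return total / trials

rng = random.Random(0)
d1 = derrida_one_bit(50, 1, 400, rng)   # ordered: perturbation shrinks
d2 = derrida_one_bit(50, 2, 400, rng)   # critical: perturbation persists
d4 = derrida_one_bit(50, 4, 400, rng)   # chaotic: perturbation spreads
print(d1, d2, d4)
assert d1 < d2 < d4
```

The full Derrida plot repeats this measurement over a range of initial Hamming distances; the slope at the origin then classifies the regime.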
More recently, it was shown that it is possible to determine the behavior class from the architecture, without simulating the state transitions and determining the Derrida plot, simply by calculating the expected average sensitivity from all the N Boolean functions Bi [173]. In addition, a novel distance measure that uses normalized compression distances (NCD) – which captures the complexity in the difference between two states S(t) better than the Hamming distance used in the Derrida plot – has been proposed to determine the regime of networks [153].

Critical Networks: Life at the Edge of Chaos?

Critical networks are those which exhibit a dynamical behavior just at the edge between that of the ordered and the chaotic regime, and they have been postulated to be optimally poised to govern the dynamics of living cells [114,116]. Ordered behavior would be too rigid, in that most
perturbations of a stable attractor would be absorbed and the network would return to the same attractor state, minimizing the possibility for a network to change its internal equilibrium state in response to external signals (which are modeled as flipping individual genes from ON to OFF or vice versa). Chaotic behavior, on the other hand, would be too sensitive to such perturbations because trajectories diverge – the network would wander off in state space and fail to exhibit robust behavior. Critical networks may represent the optimal mix between stability and responsiveness, conferring both robustness to random perturbations (noise) and adaptability in response to specific signals. Critical networks have several remarkable features. First, consider the plausible need of cells to make the maximum number of reliable discriminations, and to act on them in the presence of noise with maximum reliability. 'Deep' in the ordered regime, convergence of trajectories in state space is high; hence, as explained above (Sect. "Dynamics in Boolean Networks"), information is constantly discarded. In this "lossy" regime, information about past discriminations is easily lost. Conversely, in the chaotic regime, even with only a small amount of noise the system will diverge and hence cannot respond reliably to external signals (perturbations). It seems plausible that the optimal capacity to categorize and act reliably is found in critical networks, or in networks slightly in the ordered regime. Second, it has recently been shown that a measure of the correlation of pairs of genes with respect to their changing activities, called "mutual information" (MI), is maximized for critical networks [154]. The MI measures the mutual dependence of two variables (vectors), such as two genes based on their expression in a set of states S(t). Consider two genes in a synchronous Boolean network, A and B.
The mutual information between x_A and x_B is defined as MI(A, B) = H(x_A) + H(x_B) − H(x_A, x_B), where H(x) is the entropy of the variable x and H(x, y) is the joint entropy of x and y. The mutual information is 0 if either gene A or B is unchanging, or if A and B change in time in an uncorrelated way. It is greater than 0 and bounded by 1.0 if A and B fluctuate in a correlated way. Thus, critical networks maximize the correlated changing behavior of variables in the genetic network. This new result strongly suggests, at least in the ensemble of random Boolean networks, that critical networks can coordinate the most complex organized behavior. Third, the "basin entropy" of a Boolean network, which characterizes the way the state space is partitioned into the disjoint basins of attraction of various sizes (Fig. 4a), also exposes a particular property of critical networks [124]. If the size or "weight", W_i, of a basin of attraction i is the fraction of all the 2^N states that flow to that attractor, then the basin entropy is defined as −Σ_i W_i log(W_i). The remarkable result is that only critical networks have the property that this basin entropy continues to increase as the size of the network increases. By contrast, ordered and chaotic networks have basin entropies that first increase, but then stop increasing with network size [124]. If one thinks of basins of attraction and attractors not only as cell fates, or cell types, but as distinct specific cellular programs encoded by the network, then only critical networks appear to have the capacity to expand the diversity of what "a cell can do" with increasing network size. Again, this strongly suggests that critical networks can carry out the most complex coordinated behaviors. Thus, GRNs may have evolved under natural selection (or otherwise – see Sect. "Evolution of Network Structure: Natural Selection or Natural Constraint?") to be critical. Finally, it is noteworthy that while Boolean networks were invented to model GRNs, the variables can equally be interpreted as any kind of two-valued states of components in a cell, so that the Boolean network becomes a causal network over events in a cell, including the GRN as a subset of such events. This suggests that not only the GRN but the entire network of processes in cells, including signal transduction and metabolic processes – that is, information, mass and energy flow – may be optimally coordinated if the network is critical. If life is poised in the critical regime, then two questions follow: which architectures produce ordered, critical and chaotic behaviors, and are living cells in fact in the critical regime?

Architectural Features of Large Networks and Their Dynamics

As outlined in Sect. "Network Architecture", the recent availability of data on gene regulation in real networks, although far from complete, has triggered the study of complex, irregular network topologies as static graphs.
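The mutual-information and basin-entropy measures discussed above are easy to compute; the following is a minimal sketch (function names are our own) for binary time series and for a list of basin weights:

```python
from collections import Counter
from math import log2

def entropy(series):
    """Shannon entropy (in bits) of a discrete series of samples."""
    counts = Counter(series)
    n = len(series)
    return -sum(c / n * log2(c / n) for c in counts.values())

def mutual_information(xa, xb):
    """MI(A, B) = H(x_A) + H(x_B) - H(x_A, x_B) over paired samples."""
    return entropy(xa) + entropy(xb) - entropy(list(zip(xa, xb)))

def basin_entropy(weights):
    """-sum_i W_i log W_i over basin weights W_i
    (each W_i the fraction of the 2^N states in basin i)."""
    return -sum(w * log2(w) for w in weights if w > 0)

# Perfectly correlated binary series carry 1 bit of mutual information:
a = [0, 1, 0, 1, 0, 1, 0, 1]
print(mutual_information(a, a))   # 1.0
# Four equally weighted basins give 2 bits of basin entropy:
print(basin_entropy([0.25] * 4))  # 2.0
```

An unchanging gene (a constant series) yields MI = 0 against any partner, matching the statement above.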
This line of investigation is now beginning to merge with the study of the dynamics of generic Boolean networks. Below we summarize some of the interesting architectural features and their significance for global dynamics in terms of the three regimes. First, studies of generic dynamics in ensembles of Boolean networks have established the following major structure–dynamics relationships: (1) The average input connectivity per node, k. Initial studies on Boolean networks by Kauffman assumed a homogeneous distribution of inputs k. It was found that k = 2 networks are in the ordered/critical regime (given other parameters, see below) [116]. Above
a critical k_c value (which depends on other parameters, see below) networks behave chaotically. Analyses of continuous, linearized models also suggest that, in general, sparsity of connections is more likely to promote ordered dynamics – or stability [142]. (2) The distribution of the connectivity (degree) over the individual network genes. As mentioned in Sect. "Network Architecture", the global topology of many complex molecular networks appears to have a connectivity distribution that approximates a power law. For directed graphs such as the GRN, one needs to distinguish between the distributions of the input and output connectivities. Analyses of the dynamics of random Boolean networks with either scale-free input [66] or output connectivity distribution suggest that this property favors the ordered regime for a given value of the parameter p ("internal homogeneity", see below) [6]. Specifically, if the power-law exponent γ of the scale-free distribution is greater than 2.5, the corresponding Boolean network is ordered, regardless of the value of the parameter p (see below). For values of p approaching 1.0, γ > 2.0 suffices to assure ordered dynamics. (3) The nature of the Boolean functions is an important aspect of the network architecture that also influences the global dynamics. In the early studies, Kauffman characterized Boolean functions with respect to two important features [116]: (a) "Internal homogeneity p". The parameter p (0.5 ≤ p ≤ 1) is the proportion of either 1s or 0s in the output column of the truth table of the Boolean function (Fig. 4). Thus, a function with p = 0.5 has equal numbers of 1s and 0s for the outputs over all input configurations. Boolean functions with p-values close to 1 or 0 are said to exhibit high internal homogeneity. (b) "Canalizing function".
A Boolean function B_i of target gene i is said to be canalizing if at least one of its inputs has one value (either 1 or 0) that determines the output of gene i (1 or 0), independently of the values of the other components of the input vector. If both values of an input j determine the output of gene i [a "fully canalizing" function, e.g., if x_j(t) = 1 (or 0, respectively), then x_i(t + 1) = 1 (or 0, respectively)], then the other inputs have no influence on the output at all, and the "effective input connectivity" of gene i, k_i^eff, is smaller than the "nominal" k_i. For instance, among Boolean functions with k = 2, only two of the 16 possible functions, XOR and XNOR, are not canalizing. Four functions are "fully canalizing", i.e., are effectively k = 1 functions (TRANSFER, COMPLEMENT). From ensemble studies it was found that both a high internal homogeneity p and a high proportion of canalizing functions contribute to ordered behavior [116].

Do Real Networks Have Architectural Features That Suggest They Avoid Chaos?

Only scant data are available for transcriptional networks, and they must be interpreted with due caution, since new data may, given sampling bias and artefacts, especially for nearly scale-free distributions, affect present statistics. In any case, existing data indicate that the average input connectivity of GRNs is in fact rather low, and far from the k ≈ N that would lead to chaotic behavior. Specifically, analyses of available (partial) transcriptional networks suggest that the input degrees approximate an exponential distribution, while the output degree distribution seems to be scale-free, although the number of nodes is rather small to reliably identify a power-law distribution [54,82]. GRN data from E. coli, for which the most complete, hand-curated maps of genome-wide gene regulation exist [169], and from partial gene interaction networks from yeast, obtained mostly by chromatin immunoprecipitation/microarray (ChIP-chip) [128], indicate that the average input connectivity (which is exponentially distributed) is below 4 [82,134,188]. A recent work on the worm C. elegans using the yeast one-hybrid system on a limited set of 72 promoters found an average of 4 DNA-binding proteins per promoter [54]. Thus, while such analyses await correction as coverage increases, real GRNs of microbes and lower metazoans are clearly sparse, and hence more likely to be in or near the ordered regime. In contrast to the input, the output connectivity appears to be power-law distributed for yeast and bacteria, and perhaps also for C. elegans, if the low coverage of the data available so far can be trusted [54,82]. The paucity of data points precludes reliable estimates of the power-law exponents.
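The canalizing-function test described above can be stated as a short program; this sketch (our own function names) enumerates all 16 k = 2 truth tables and confirms that only XOR and XNOR fail the test:

```python
from itertools import product

def is_canalizing(table, k):
    """table maps each k-bit input tuple to 0/1. The function is canalizing
    if some input j has a value v that fixes the output regardless of the
    other inputs."""
    for j in range(k):
        for v in (0, 1):
            outs = {table[x] for x in product((0, 1), repeat=k) if x[j] == v}
            if len(outs) == 1:
                return True
    return False

# Enumerate all 16 Boolean functions of k = 2 inputs; each function is
# encoded as its output column over inputs (0,0), (0,1), (1,0), (1,1).
non_canalizing = []
for bits in product((0, 1), repeat=4):
    table = dict(zip(product((0, 1), repeat=2), bits))
    if not is_canalizing(table, 2):
        non_canalizing.append(bits)

print(len(non_canalizing))  # 2 -- the columns 0110 (XOR) and 1001 (XNOR)
```

Note that by this definition constant functions are (trivially) canalizing, which is why exactly 14 of the 16 functions pass.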
However, the scale-free property, if confirmed for the entire GRN, may well also contribute to avoiding the chaotic regime and promoting the ordered or critical regime [6]. Nevertheless, it should be recalled that the deeper meaning of scale-freeness per se and its genesis (whether or not it arose by natural selection for functionality) are not clear – it may be an inevitable manifestation of fundamental statistics rather than a product of natural selection (Sect. "Evolution of Network Structure: Natural Selection or Natural Constraint?"). As for the use of Boolean functions, analysis of a set of 150 experimentally verified regulatory mechanisms of
well-studied promoters revealed an enrichment for canalizing functions – again in accordance with the architectural criteria associated with ordered dynamics [86]. Similarly, when canalizing functions were randomly imposed onto the published yeast protein–protein interaction network topology to create an architecture whose dynamics was then simulated, ordered behavior was observed [113]. In this connection it is interesting to mention microRNAs (miRNAs), a recently discovered class of non-coding RNAs which act by inhibiting gene expression at the posttranscriptional level through sequence complementarity to mRNA [41,89]. Hence, they suppress a gene independently of the TF constellation at the promoter. Thus, miRNAs epitomize the simplest and most powerful molecular realization of a canalizing function. Their existence may therefore shift the network behavior from the chaotic towards the critical or ordered regime. Interestingly, RNA-based gene regulation is believed to have appeared before the metazoan radiation 1000 million years ago, and microRNAs are thought to have evolved in ancestors of the Bilateria [141]. In fact, many miRNAs play key roles in cell fate determination during tissue specification in vertebrate development [41,172], and a composite feedback loop circuit involving both microRNAs and TFs has been described in neutrophil differentiation [63]. The shift of network behavior from the chaotic towards the critical or ordered regime may indeed enable the coordination of complex gene expression patterns during the ontogeny of multicellular systems, which requires maximal information processing capacity to ensure the coexistence of stability and diversity. There are, as mentioned in Sect. "Analyzing Network Structures", many more global and local topological features of biomolecular networks that appear to be interesting in the sense defined in Sect. "Network Architecture" (enriched above some null model graph).
It would be interesting to test how they contribute to producing chaotic, critical or ordered behavior. The impact of most of these topological features on the global dynamics, notably on the three regimes of behavior, remains unknown, since most functional interpretations of network motifs have focused on local dynamics [9,136]. The limited quality and availability of experimental data on GRN architectures opens at the moment only a minimal window into the dynamical regimes of biological networks. The improvement in coverage and quality of real GRN maps, to be expected in the coming years, is mostly driven by a reductionist agenda whose intrinsic aim is to exhaustively enumerate all the "pathways" that may serve "predictive modeling" of gene behaviors, as detailed in Sect. "Small Gene Circuits". However,
with the concepts introduced here, a framework now exists that warrants a deeper, genome-wide analysis of the relationship between structure and biological function. Such analysis should also address the fundamental dualism between inevitable self-organization (due to intrinsic constraints from physical laws) and natural selection (of random mutants for functional advantages) [46,93,203], asking whether criticality is self-organized (see Sect. "Evolution of Network Structure: Natural Selection or Natural Constraint?").

Experimental Evidence from Systems Biology

The experimental validation of the central concepts that were erected based on theoretical analysis of network dynamics, notably the ensemble approach of Boolean networks, amounts to addressing the following two questions: 1. Are cell types attractors? 2. Is the dynamical behavior of the genomic GRN in the critical regime? As experimental systems biology begins to reach beyond the systematic characterization of genes and their interactions, it has already been possible to design experiments to obtain evidence bearing on these questions.

Are Cell Types Attractors?

Obviously, the observable gene expression profiles S*, as shown in Fig. 1e, are stationary (steady) states and characteristic of a cell type. But "steady state" (dx/dt = 0 in Eq. (1)) does not necessarily imply a stable (self-stabilizing) attractor state that attracts nearby states. In the absence of knowledge of the architecture of the underlying GRN (we do not know the function F in Eq. (1)) we cannot perform the standard formal analysis, such as linear stability analysis around S*, to determine whether a stationary state is stable or not; however, the use of microarray-based expression profiling, not to identify individual genes as in the gene-centered view, but in a novel integrated manner (Table 1, B3) for the analysis of S(t), provides a way to address the question.
The qualitative properties of an attractor offer a handle, in that a high-dimensional attractor state S*, be it a fixed point, a small limit cycle or even a strange attractor in state space, requires that the volume of the states around it contracts onto the attractor, i.e., div(F) < 0 [90].

Convergence of Trajectories

Thus, one consequence is that trajectories emanating from states around (and near) S* converge towards it from most (ideally, all) directions and in all dimensions of the state space. It is technically challenging to sample multiple high-dimensional initial
states near the attractor state and demonstrate that they all converge towards S* over time. However, the fact that promyeloid precursor HL60 cells (a leukemic cell line) can be triggered to differentiate into mature neutrophil-like cells by an array of distinct chemical agents can be exploited [45]. This historical observation itself, without analysis of S(t) by gene expression profiling, already suggests that the neutrophil state is an attractor, because it reflects robustness and means that the detailed history of how the state is reached does not matter. The advent of microarray-based gene expression profiling technologies has recently opened up the possibility to show that (at least two) distinct trajectories converge towards the gene expression profile of the neutrophil state S_Neutr, as expected if S_Neutr is an attractor state [98]. Thus, HL60 cells were treated with either of two differentiation-inducing reagents, all-trans retinoic acid (ATRA) and dimethyl sulfoxide (DMSO), two chemically unrelated compounds, and the changes of the transcriptome in response to either treatment were measured at multiple time intervals to monitor the two trajectories, S_ATRA(t) and S_DMSO(t), respectively (see Fig. 5a for details). In fact, the two trajectories first diverged to an extent that the two state vectors lost most of their correlation (i.e., the two gene expression profiles S_ATRA(24 h) and S_DMSO(24 h) were maximally different at t = 24 hours after stimulation). But subsequently they converged to very similar S(t) values as the cells reached the differentiated neutrophil state under both treatments (Fig. 5a). The convergence was not complete, but quite dramatic relative to the maximally divergent state at 24 h, and was contributed by around 2800 of the 3800 genes monitored (for details see [98]).
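The notion of trajectories converging onto an attractor from different directions can be illustrated with a deliberately tiny, hypothetical 3-gene Boolean network (our own toy wiring, not the HL60 regulatory network), every state of which flows to a single fixed point:

```python
from itertools import product

def step(s):
    """Toy 3-gene synchronous Boolean network (hypothetical wiring):
    x1' = x2 AND x3, x2' = x1 OR x3, x3' = 1 (constitutively ON)."""
    x1, x2, x3 = s
    return (x2 & x3, x1 | x3, 1)

def trajectory(s, steps=5):
    """States visited over a few synchronous updates, starting from s."""
    out = [s]
    for _ in range(steps):
        s = step(s)
        out.append(s)
    return out

# All 8 initial states flow to the same fixed-point attractor (1, 1, 1),
# the discrete analogue of trajectories converging from all directions:
for s0 in product((0, 1), repeat=3):
    assert trajectory(s0)[-1] == (1, 1, 1)
```

In a real GRN the state space has 2^N ≈ 2^3800 such starting points, which is why the experiment can only sample a few trajectories.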
Thus, it appears that at least the artificial, drug-induced differentiated neutrophil state is so stable that it can apparently orchestrate the expression of thousands of genes to produce the appropriate cell-type-defining expression pattern S* from quite distinct perturbed cellular states. Although only two trajectories have been measured rather than an entire state space volume, the convergence with respect to 2800 state space dimensions is strongly indicative of a high-dimensional attractor state.

Relaxation After Small Perturbations and the Problem of Cell Heterogeneity

A second way to expose the qualitative properties of a high-dimensional attractor S* is to perform a weak perturbation of S* into a state S′ near the edge of (but within) the basin of attraction (where S′ should differ from S* with respect to as many genes x_i as possible), and to observe the return of the network to S*. This more intuitive property of an attractor state was
Complex Gene Regulatory Networks – from Structure to Biological Observables: Cell Fate Determination, Figure 5 Experimental evidence that a cell fate (differentiated state, cell type) is a high-dimensional stable attractor in gene expression state space. Evidence is based on convergence (a) from different directions of high-dimensional trajectories to the gene expression pattern of the differentiated state, or on the relaxation of states placed near the border of the basin of attraction back to the "center" of the attractor state (b). a Gene expression profile dynamics of HL60 cells treated with ATRA (all-trans retinoic acid) or DMSO (dimethyl sulfoxide) at 0 h to differentiate them into neutrophil-like cells. Selected gene expression profile snapshots along the trajectories are shown as GEDI maps (as in Fig. 1), schematically placed along the state space trajectories. The GEDI maps show the convergence of ~2800 genes towards very similar patterns at 168 h [98]. b The heterogeneity of a population of clonal (genetically identical) cells is exploited to demonstrate relaxation towards the attractor state. The histogram (top, left panel) from flow cytometric measurement shows the inherent heterogeneity (spread) of cells with respect to expression of the stem cell marker Sca-1 in EML cells, which cannot be attributed solely to measurement or gene expression noise but reflects metastable cell individuality [37]. Two spontaneous "outlier" subfractions, expressing either low (L) or high (H) levels of Sca-1, were sorted using FACS (fluorescence-activated cell sorting) and cultured independently. They represent "small perturbations" of the attractor state. Each sorted subpopulation restores the parental distribution over a period of 5–9 days (right panel, schematic). Gene expression profiling (shown as GEDI maps) reveals that the H and L subpopulations are distinct with respect not only to Sca-1 expression levels but also to those of multiple other genes.
Thus, the two outlier cell fractions are at distinct states S_H and S_L in the high-dimensional state space, of which Sca-1 is only one distinguishing dimension. The restoration of the parental distribution of Sca-1 is accompanied by the two distinct gene expression profiles S_H and S_L of the spontaneously perturbed cells becoming similar to each other (and to that of the parental population), indicating a high-dimensional attractor state
difficult to measure because of a phenomenon often neglected by theorists and experimentalists alike: cell population heterogeneity due to "gene expression noise". Microarray measurements, as is the case with many biochemical analysis methods, require the material of millions of cells; thus the measured S(t) is actually a population average. This can be problematic because the population is heterogeneous [33]: the expression level of gene i can typically differ by 100-fold or more between two individual cells within a clonal (genetically identical) population [37]. Thus, virtually all genes i exhibit a broad (log-normal) histogram (Fig. 5b, inset) when the expression level x_i of an individual gene is measured at the single-cell level across a population [38,101,129]. Such cell-to-cell variability is often explained by "gene expression noise" caused by random temporal fluctuations due to the low copy numbers of specific molecules in the cell [108]. However, there may be other, possibly deterministic sources that generate metastable variant cells [33,176]. Such non-genetic, enduring individuality of cells translates into the picture of a cell population forming a cloud of points in state space around the attractor S*, in which each cell represents a single point. Application of a low-dose perturbation intended to allow the system to relax back to the attractor state will then be "interpreted" by individual cells differently: those positioned at the border of the cloud may be kicked out of the basin of attraction and move to another attractor, thus masking the trajectory of relaxation. In fact, single-cell resolution measurements of the response to low-dose stimulation in cell populations confirmed this picture of heterogeneity, in that a differentiation inducer given at low dose to trigger a partial response (weak perturbation) produced a bimodal distribution of the expression of differentiation markers: some cells differentiated, others did not [38].
However, the spontaneous heterogeneity itself, consisting of transient but persistent variant cells within a population [37], allows us to demonstrate the relaxation to the attractor when single-cell-level manipulation and analysis are performed: physical isolation of the population fractions at the two opposite edges of the cloud (basin of attraction), based on one single state space dimension x_k, can substitute for the weak perturbation that places cells at the border of a basin (see Fig. 5b for details). Indeed, such "outlier" population fractions exhibited not only distinct levels of x_k expression but also globally distinct gene expression profiles (despite being members of the same clonal population). Cells of both outlier fractions eventually "flowed back" to populate the state space region around the attractor state and restored the original distribution (shape of the cloud) [37]. The
time scale of this relaxation (>5 days) was similar to that needed for the HL60 cells to converge to the attractor of the differentiated neutrophils. This result is summarized in Fig. 5b. Again, the spontaneous regeneration of the distinct gene expression profile of a macroscopically observable cell phenotype, involving thousands of genes, supports the notion of a high-dimensional attractor in gene expression space that maps onto a distinct, observable cell type.

Are Gene Regulatory Networks in Living Cells Critical?

The second central question we ask in this article is whether GRNs produce a global dynamics that is in the critical regime (Sect. "Three Regimes of Behaviors for Boolean Networks"). This question is not as straightforward to address experimentally. First, the notions of order, chaos and criticality are defined in the models as properties of network ensembles. Second, it is cumbersome to sample a large number of pairs of initial states and monitor their high-dimensional trajectories to determine whether they on average converge, diverge or stay "parallel". As mentioned earlier (Sect. "Dynamics in Boolean Networks"), one approach for obtaining a first glimpse of whether real networks are critical was recently proposed by Aldana and coworkers ([19] and references therein): they "imposed" a dynamics onto real biological networks, for which only the topology but not the interaction functions are known, by treating them as Boolean networks, whereby the Boolean functions were guessed or randomly assigned according to some rules. Such studies suggest that these networks, given their assumed topologies, are in the critical regime. To more directly characterize the observed dynamics of natural systems in terms of the three regimes, several indirect approaches have been taken.
These strategies are based on novel measurable quantities computed from several schemes of microarray experiments that are now available but were not originally generated with the intention of answering this question. By first determining in simulated networks whether a given quantity is associated with criticality in silico, inference is then made as to which regime the behavior of the observed system resembles. Three such pieces of evidence provide a first hint that living cells may indeed be in the critical regime: (i)
Gene expression profile changes during HeLa cell cycle progression were compared with the detailed temporal structure of the updating of network states in Boolean networks. The discretized real gene expression data of cells progressing through the cell cycle were
compared with those of simulated state cycles of random Boolean networks in the three regimes in terms of the Lempel–Ziv complexity of the time series. This led to the conclusion that the dynamical behavior of thousands of genes was most consistent with either ordered or critical behavior, but not with chaotic behavior [174]. (ii) A striking property predicted from analysis of simulated critical Boolean networks is that if a randomly selected gene is deleted, the number of other genes that change their expression as a consequence of that single-gene deletion ("avalanche size") is measured, and such single-gene deletion experiments are repeated many times, then the avalanche sizes will exhibit a power-law distribution with a slope of γ = 1.5. This specific behavior is only seen in critical networks. An analysis of just such data for over 200 single-deletion mutants in yeast [163], from the experiments reported by Hughes et al. [100], was recently performed. It was found that the distribution of the avalanche sizes not only approached a power law, but that the slope was also 1.5 [163]. This result was insensitive to altering the criterion for defining a "change in gene activity" from two-fold to five-fold in calculating the avalanche size. (iii) A more direct determination of the regime of network dynamics was recently reported, in which an analysis analogous to the Derrida plot (see Sect. "Ordered and Chaotic Regimes of Network Behavior") was performed [153]. Macrophage gene expression profiles were measured at various time points in response to various perturbations (stimulation of Toll-like receptors with distinct agents), offering a way to measure the time evolution of a large number of similar initial states. Here, instead of the Hamming distance, the normalized compression distance (NCD, as mentioned in Sect. "Ordered and Chaotic Regimes of Network Behavior") was used to circumvent the problems associated with Derrida curves.
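The NCD is simple to compute with any standard compressor; a minimal sketch (our own function names, using zlib as the compressor, which is one common choice):

```python
import os
import zlib

def c(b):
    """Compressed size of a byte string (zlib, maximum compression)."""
    return len(zlib.compress(b, 9))

def ncd(x, y):
    """Normalized compression distance between two byte strings:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    Roughly 0 for near-identical inputs, roughly 1 for unrelated ones."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

# Two identical "network states" versus an unrelated random state:
a = bytes(i % 2 for i in range(1000))  # periodic state vector
b = bytes(i % 2 for i in range(1000))  # identical copy
r = os.urandom(1000)                   # unrelated random state
print(ncd(a, b) < ncd(a, r))  # True
```

Unlike the Hamming distance, the NCD also registers shared structure between states, which is what makes it useful for the Derrida-style analysis described above.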
The results were consistent with critical dynamics, in that on average, for many pairs of initial states, their distance at time t and at time t + Δt was equal; thus neither trajectory convergence nor divergence took place.

Future Directions and Questions

The ideas of a state space representing the dynamics of the network, and of its particular structure that stems from the constraints imposed by the regulatory interactions and can be epitomized as an 'epigenetic landscape', serve as a conceptual bridge linking network architecture with the observable biological behavior of a cell. In this framework, the attractors (valleys) are disjoint regions in the state space landscape and represent cell fates and cell types. One profound and bold hypothesis is that for networks to have a landscape with attractors that optimally convey cell lineage identity and robustness of their gene expression profile, yet allow enough flexibility for cell phenotype switches during development, the networks must be poised at the boundary between the chaotic and the ordered regime. Such criticality may be a universal property of networks that have maximal information processing capacity. But what can we do with the conceptual framework presented here? And what are the next questions to ask in the near future? Clearly, new functional genomics analysis techniques will soon advance the experimental elucidation of the architecture of the GRNs of various species, including those of metazoan organisms. This will drastically expand the opportunities for theoretical analysis of the architecture of networks and finally afford a much closer look at the dynamics without resorting to simulated network ensembles. However, beyond network analysis in terms of mathematical formalisms, the concepts of integrated network dynamics presented here should also pave the way for a formal rather than descriptive understanding of tissue homeostasis and development, as well as of diseases such as cancer. One corollary of the idea that cell types are attractors is that cancer cells are also attractors – attractors lurking somewhere in state space (near embryonic or stem cell attractors) that are normally avoided by the physiological developmental trajectories. They become accessible in pathological conditions and trap cells in them, preventing terminal differentiation (this idea is discussed in detail in [33,96]).
Such biological interpretation of the concepts presented here will require that these concepts be expanded to embrace the following aspects of dynamical biological systems that are currently not well understood but can already be framed:

Attractor Transitions and Development

If cell fates are attractors, then development is a flow in state space whose trajectories represent the developmental trajectories. But how do cells, e.g., a progenitor cell committing to a particular cell fate, move from one attractor to another? Currently, two models are envisioned for the "flow between attractors" that constitutes the developmental paths in the generation of multicellular organisms:
Complex Gene Regulatory Networks – from Structure to Biological Observables: Cell Fate Determination
(i) In one model, stimulated by studies of Boolean networks, the cell "jumps" from one attractor to another in response to a distinct perturbation (e. g., a developmental signal), which is represented as the imposed alteration of the expression status of a set of genes ("bit-flipping" in binary Boolean networks). This corresponds to the displacement of the network state S(t) by an external influence to a new position in the basin of another attractor. (ii) The second model posits that the landscape itself changes, and has its roots in the classical modeling of nonlinear dynamical systems: for instance, the attractor of the progenitor cell (valley) may be converted into an unstable steady-state point (hill top), at which point the network (cell) will spontaneously be attracted to either of the two attractors on each side of the newly formed hill. This is exemplified in a model for fate decision in bipotent progenitor cells in hematopoiesis [99]. In this model, a change of the landscape structure ensues from a change in the network architecture caused by the slow alteration of the value of a system parameter that controls the interaction strength in the system equations (see Sect. "Small Gene Circuits"). Thus, the external signal that triggers the differentiation exerts its effect by affecting the system parameters. Increasing the decay rates of x1 and x2 in the example of Fig. 3b will lead to the disappearance of the central attractor and convert the tristable landscape of Fig. 3b to the bistable one of Fig. 3a [99]. Such qualitative changes that occur during the slow increase or decrease of a system parameter are referred to as bifurcations. Note that this second model takes a narrower view, assuming that the network under study is not the global, universal and fixed network of the entire genome, as discussed in Sect. "Complex Networks", but rather a subnetwork that can change its architecture based on the presence or absence of gene products of genes that are outside of the subnetwork.
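The landscape change in model (ii) can be sketched numerically. The code below is a minimal illustration in the spirit of the circuits in Sect. "Small Gene Circuits" – a symmetric pair of genes, each activating itself and inhibiting the other – and not the specific model of [99]; all parameter values (Hill coefficient, threshold, decay rate) are illustrative. Counting the distinct endpoints reached from a grid of initial states shows the tristable landscape (three attractors) collapsing to a bistable one (two attractors) as the decay rate k is raised:

```python
import itertools

S, N = 0.5, 4  # Hill threshold and coefficient (illustrative values)
H = S ** N

def rhs(x1, x2, k):
    """Two genes, each self-activating and mutually inhibiting; k = decay rate."""
    dx1 = x1**N / (H + x1**N) + H / (H + x2**N) - k * x1
    dx2 = x2**N / (H + x2**N) + H / (H + x1**N) - k * x2
    return dx1, dx2

def attractors(k, grid=9, t_end=100.0, dt=0.05):
    """Integrate from a grid of initial states (forward Euler) and collect
    the distinct stable endpoints reached."""
    found = []
    for i, j in itertools.product(range(grid), repeat=2):
        x1 = 0.05 + 2.5 * i / (grid - 1)
        x2 = 0.07 + 2.5 * j / (grid - 1)  # small offset keeps starts off the x1 = x2 diagonal
        for _ in range(int(t_end / dt)):
            d1, d2 = rhs(x1, x2, k)
            x1, x2 = x1 + dt * d1, x2 + dt * d2
        if not any(abs(x1 - a) + abs(x2 - b) < 0.05 for a, b in found):
            found.append((round(x1, 3), round(x2, 3)))
    return found

print(len(attractors(k=1.0)))  # 3: x1-high, x2-high, and a central attractor
print(len(attractors(k=1.5)))  # 2: the central attractor has vanished
```

At k = 1.0 the symmetric fixed point (x1 = x2 = 1/k) is stable and coexists with the two asymmetric attractors; raising k to 1.5 destabilizes it, so cells started near the "progenitor" state are forced into one of the two differentiated attractors – the bifurcation described in the text.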
Noisy Systems

In a bistable switching system, it is immediately obvious that random fluctuations in xi due to "gene expression noise" may trigger a transition between the attractors and thus explain stochastic phenotype transitions, as has recently been shown for several micro-organismal systems and discussed for mammalian cell differentiation [87,97,101,108]. The observed heterogeneity of cells in a nominally identical cell population, caused either by "gene expression noise" or by other diversifying processes, such as the random partitioning of molecules at cell divisions, implies that cell states cannot be viewed as deterministic points and trajectories in the state space, but rather as moving "clouds" – held together by the attractors. More recently, on the basis of such bistable switches, it has even been proposed and shown in synthetic networks that environmental bias in fluctuation magnitude may explicitly control the switch to the physiologically favorable attractor state – because gene expression noise may be higher (relative to the deterministically controlled expression levels) when cells are in the attractor that dictates a gene expression pattern that is incompatible with a given environment [110]. Such biased gene expression noise may thus drive the flow between the attractors and hence (on average) guide the system through attractor transitions to produce various cell fates in a manner that is commensurate with the development of the tissue. On the other hand, it may also explain the local stochasticity of cell fate decisions observed for many stem and progenitor cells.

Beyond Cell Autonomous Processes

To understand development we must, of course, widen our view beyond the cell-autonomous dynamics of GRNs. Some of the genes expressed in particular states S(t) encode secreted proteins that affect the gene expression, and hence the state S(t), of neighboring cells. Such inter-cell communication establishes a network at a higher level, with its own constrained dynamics, which needs to be incorporated into models of developmental trajectories in gene expression state space.
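The noise-driven attractor transitions discussed under "Noisy Systems" can likewise be sketched with a Langevin-type simulation of a bistable two-gene switch. Again this is an illustrative toy circuit, not a published model; the noise amplitudes, decay rate and time horizon are arbitrary choices. With weak noise the state merely jitters inside one attractor; with stronger noise it can cross the barrier and flip to the other attractor:

```python
import math, random

S, N, K = 0.5, 4, 1.5  # decay rate K = 1.5 puts this toy circuit in a bistable regime
H = S ** N

def drift(x1, x2):
    """Deterministic part: self-activation plus mutual inhibition."""
    d1 = x1**N / (H + x1**N) + H / (H + x2**N) - K * x1
    d2 = x2**N / (H + x2**N) + H / (H + x1**N) - K * x2
    return d1, d2

def count_flips(sigma, t_end=2000.0, dt=0.01, seed=2):
    """Euler–Maruyama integration with additive noise of amplitude sigma;
    counts crossings between the x1-dominant and x2-dominant attractor."""
    rng = random.Random(seed)
    x1, x2 = 1.3, 0.02  # start inside the x1-high attractor
    side, flips = 1, 0
    root_dt = math.sqrt(dt)
    for _ in range(int(t_end / dt)):
        d1, d2 = drift(x1, x2)
        x1 = max(0.0, x1 + dt * d1 + sigma * root_dt * rng.gauss(0, 1))
        x2 = max(0.0, x2 + dt * d2 + sigma * root_dt * rng.gauss(0, 1))
        new_side = 1 if x1 > x2 else -1
        if new_side != side:
            side, flips = new_side, flips + 1
    return flips

print(count_flips(sigma=0.05))  # weak noise: no attractor transitions
print(count_flips(sigma=0.4))   # strong noise: stochastic transitions occur
```

Making sigma state-dependent – larger in the environmentally "wrong" attractor – would bias the flips toward the favorable state, which is the essence of the attractor-selection mechanism of [110].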
Directionality

The concepts of cell fates and cell types as attractors, as well as any mechanism that explains attractor transitions during development and differentiation, do not explain the overall directionality of development, or the "arrow of time" in ontogenesis: Why is development essentially a one-way process, i. e., irreversible in time, given that the underlying regulatory events, the switching ON or OFF of genes, are fully reversible? Where does the overall slope (from back to front) in Waddington's epigenetic landscape (Fig. 2g) come from? One idea is that "gene expression noise" may play the role that thermal noise (heat) plays in thermodynamics in explaining the irreversibility of processes. Alternatively, the network could be specifically wired, perhaps through natural selection, so that attractor transitions are strongly biased in one direction.
Evolution

As discussed several times in this article, when networks are studied in the light of evolution, a central question arises: how do mutations, which essentially rewire the network by altering the nature of regulatory interactions, give rise to the particular architecture of the GRN that we find today? What are the relative roles, in shaping particular network architectures, of (i) constraints due to physical (graph-theoretical) laws and self-organization vs. (ii) natural selection for functionality? If selection plays a major role, can it select for such features as the global landscape structure, or even for criticality? Or can the latter even "self-organize" without adaptive selection [31]? Such questions go far beyond the current analysis of Darwinian mechanisms of robustness and evolvability of networks. They are at the heart of the quest in biocomplexity research for fundamental principles of life, of which the process of natural selection itself is just a subset. Regulatory networks offer an accessible and formalizable object of study with which to begin to ask these questions.

Bibliography

Primary Literature

1. Adolfsson J, Mansson R, Buza-Vidas N, Hultquist A, Liuba K, Jensen CT, Bryder D, Yang L, Borge OJ, Thoren LA, Anderson K, Sitnicka E, Sasaki Y, Sigvardsson M, Jacobsen SE (2005) Identification of Flt3+ lympho-myeloid stem cells lacking erythro-megakaryocytic potential: a revised road map for adult blood lineage commitment. Cell 121:295–306 2. Aittokallio T, Schwikowski B (2006) Graph-based methods for analysing networks in cell biology. Brief Bioinform 7:243–55 3. Albert R (2005) Scale-free networks in cell biology. J Cell Sci 118:4947–57 4. Albert R, Othmer HG (2003) The topology of the regulatory interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster. J Theor Biol 223:1–18 5. Albert R, Jeong H, Barabasi AL (2000) Error and attack tolerance of complex networks. Nature 406:378–82 6.
Aldana M, Cluzel P (2003) A natural class of robust networks. Proc Natl Acad Sci USA 100:8710–4 7. Aldana M, Balleza E, Kauffman S, Resendiz O (2007) Robustness and evolvability in genetic regulatory networks. J Theor Biol 245:433–48 8. Aldana M, Coppersmith S, Kadanoff LP (2003) Boolean dynamics with random couplings. In: Kaplan E, Marsden JE, Sreenivasan KR (eds) Perspectives and problems in nonlinear science. A celebratory volume in honor of Lawrence Sirovich. Springer, New York 9. Alon U (2003) Biological networks: the tinkerer as an engineer. Science 301:1866–7 10. Amaral LA, Scala A, Barthelemy M, Stanley HE (2000) Classes of small-world networks. Proc Natl Acad Sci USA 97:11149–52 11. Anderson PW (1972) More is different. Science 177:393–396
12. Angeli D, Ferrell JE Jr., Sontag ED (2004) Detection of multistability, bifurcations, and hysteresis in a large class of biological positive-feedback systems. Proc Natl Acad Sci USA 101: 1822–1827 13. Arney KL, Fisher AG (2004) Epigenetic aspects of differentiation. J Cell Sci 117:4355–63 14. Artzy-Randrup Y, Fleishman SJ, Ben-Tal N, Stone L (2004) Comment on Network motifs: simple building blocks of complex networks and Superfamilies of evolved and designed networks. Science 305:1107; author reply 1107 15. Autumn K, Ryan MJ, Wake DB (2002) Integrating historical and mechanistic biology enhances the study of adaptation. Q Rev Biol 77:383–408 16. Babu MM, Luscombe NM, Aravind L, Gerstein M, Teichmann SA (2004) Structure and evolution of transcriptional regulatory networks. Curr Opin Struct Biol 14:283–91 17. Bagley RJ, Glass L (1996) Counting and classifying attractors in high dimensional dynamical systems. J Theor Biol 183:269–84 18. Balcan D, Kabakcioglu A, Mungan M, Erzan A (2007) The information coded in the yeast response elements accounts for most of the topological properties of its transcriptional regulation network. PLoS ONE 2:e501 19. Balleza E, Alvarez-Buylla ER, Chaos A, Kauffman A, Shmulevich I, Aldana M (2008) Critical dynamics in genetic regulatory networks: examples from four kingdoms. PLoS One 3:e2456 20. Barabasi AL, Albert R (1999) Emergence of scaling in random networks. Science 286:509–12 21. Barabasi AL, Oltvai ZN (2004) Network biology: understanding the cell’s functional organization. Nat Rev Genet 5: 101–113 22. Becskei A, Seraphin B, Serrano L (2001) Positive feedback in eukaryotic gene networks: cell differentiation by graded to binary response conversion. EMBO J 20:2528–2535 23. Berg J, Lassig M, Wagner A (2004) Structure and evolution of protein interaction networks: a statistical model for link dynamics and gene duplications. BMC Evol Biol 4:51 24. 
Bergmann S, Ihmels J, Barkai N (2004) Similarities and differences in genome-wide expression data of six organisms. PLoS Biol 2:E9 25. Bernstein BE, Mikkelsen TS, Xie X, Kamal M, Huebert DJ, Cuff J, Fry B, Meissner A, Wernig M, Plath K, Jaenisch R, Wagschal A, Feil R, Schreiber SL, Lander ES (2006) A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell 125:315–26 26. Bird A (2007) Perceptions of epigenetics. Nature 447:396–8 27. Bloom JD, Adami C (2003) Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein–protein interactions data sets. BMC Evol Biol 3:21 28. Bloom JD, Adami C (2004) Evolutionary rate depends on number of protein–protein interactions independently of gene expression level: response. BMC Evol Biol 4:14 29. Bork P, Jensen LJ, von Mering C, Ramani AK, Lee I, Marcotte EM (2004) Protein interaction networks from yeast to human. Curr Opin Struct Biol 14:292–9 30. Bornholdt S (2005) Systems biology. Less is more in modeling large genetic networks. Science 310:449–51
31. Bornholdt S, Rohlf T (2000) Topological evolution of dynamical networks: global criticality from local dynamics. Phys Rev Lett 84:6114–7 32. Boyer LA, Lee TI, Cole MF, Johnstone SE, Levine SS, Zucker JP, Guenther MG, Kumar RM, Murray HL, Jenner RG, Gifford DK, Melton DA, Jaenisch R, Young RA (2005) Core transcriptional regulatory circuitry in human embryonic stem cells. Cell 122:947–56 33. Brock A, Chang H, Huang SH Non-genetic cell heterogeneity and mutation-less tumor progression. Manuscript submitted 34. Brown KS, Hill CC, Calero GA, Myers CR, Lee KH, Sethna JP, Cerione RA (2004) The statistical mechanics of complex signaling networks: nerve growth factor signaling. Phys Biol 1: 184–195 35. Bulyk ML (2006) DNA microarray technologies for measuring protein-DNA interactions. Curr Opin Biotechnol 17:422–30 36. Callaway DS, Hopcroft JE, Kleinberg JM, Newman ME, Strogatz SH (2001) Are randomly grown graphs really random? Phys Rev E Stat Nonlin Soft Matter Phys 64:041902 37. Chang HH, Hemberg M, Barahona M, Ingber DE, Huang S (2008) Transcriptome-wide noise controls lineage choice in mammalian progenitor cells. Nature 453:544–547 38. Chang HH, Oh PY, Ingber DE, Huang S (2006) Multistable and multistep dynamics in neutrophil differentiation. BMC Cell Biol 7:11 39. Chang WC, Li CW, Chen BS (2005) Quantitative inference of dynamic regulatory pathways via microarray data. BMC Bioinformatics 6:44 40. Chaves M, Sontag ED, Albert R (2006) Methods of robustness analysis for Boolean models of gene control networks. Syst Biol (Stevenage) 153:154–67 41. Chen K, Rajewsky N (2007) The evolution of gene regulation by transcription factors and microRNAs. Nat Rev Genet 8: 93–103 42. Chen KC, Wang TY, Tseng HH, Huang CY, Kao CY (2005) A stochastic differential equation model for quantifying transcriptional regulatory network in Saccharomyces cerevisiae. Bioinformatics 21:2883–90 43. 
Chickarmane V, Troein C, Nuber UA, Sauro HM, Peterson C (2006) Transcriptional dynamics of the embryonic stem cell switch. PLoS Comput Biol 2:e123 44. Claverie JM (2001) Gene number. What if there are only 30,000 human genes? Science 291:1255–7 45. Collins SJ (1987) The HL-60 promyelocytic leukemia cell line: proliferation, differentiation, and cellular oncogene expression. Blood 70:1233–1244 46. Cordero OX, Hogeweg P (2006) Feed-forward loop circuits as a side effect of genome evolution. Mol Biol Evol 23:1931–6 47. Cross MA, Enver T (1997) The lineage commitment of haemopoietic progenitor cells. Curr Opin Genet Dev 7: 609–613 48. Dang CV, O’Donnell KA, Zeller KI, Nguyen T, Osthus RC, Li F (2006) The c-Myc target gene network. Semin Cancer Biol 16:253–64 49. Davidich MI, Bornholdt S (2008) Boolean network model predicts cell cycle sequence of fission yeast. PLoS ONE 3:e1672 50. Davidson EH, Erwin DH (2006) Gene regulatory networks and the evolution of animal body plans. Science 311:796–800 51. de la Serna IL, Ohkawa Y, Berkes CA, Bergstrom DA, Dacwag CS, Tapscott SJ, Imbalzano AN (2005) MyoD targets chromatin
remodeling complexes to the myogenin locus prior to forming a stable DNA-bound complex. Mol Cell Biol 25:3997–4009 52. Deane CM, Salwinski L, Xenarios I, Eisenberg D (2002) Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics 1:349–56 53. Delbrück M (1949) Discussion. In: Unités biologiques douées de continuité génétique. Colloques Internationaux du Centre National de la Recherche Scientifique. CNRS, Paris 54. Deplancke B, Mukhopadhyay A, Ao W, Elewa AM, Grove CA, Martinez NJ, Sequerra R, Doucette-Stamm L, Reece-Hoyes JS, Hope IA, Tissenbaum HA, Mango SE, Walhout AJ (2006) A gene-centered C. elegans protein-DNA interaction network. Cell 125:1193–205 55. Derrida B, Pomeau Y (1986) Random networks of automata: a simple annealed approximation. Europhys Lett 1:45–49 56. Dodd IB, Micheelsen MA, Sneppen K, Thon G (2007) Theoretical analysis of epigenetic cell memory by nucleosome modification. Cell 129:813–22 57. Eichler GS, Huang S, Ingber DE (2003) Gene Expression Dynamics Inspector (GEDI): for integrative analysis of expression profiles. Bioinformatics 19:2321–2322 58. Eisenberg E, Levanon EY (2003) Preferential attachment in the protein network evolution. Phys Rev Lett 91:138701 59. Enver T, Heyworth CM, Dexter TM (1998) Do stem cells play dice? Blood 92:348–51; discussion 352 60. Espinosa-Soto C, Padilla-Longoria P, Alvarez-Buylla ER (2004) A gene regulatory network model for cell-fate determination during Arabidopsis thaliana flower development that is robust and recovers experimental gene expression profiles. Plant Cell 16:2923–39 61. Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS (2007) Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol 5:e8 62. Faure A, Naldi A, Chaouiya C, Thieffry D (2006) Dynamical analysis of a generic Boolean model for the control of the mammalian cell cycle. Bioinformatics 22:e124–31 63. Fazi F, Rosa A, Fatica A, Gelmetti V, De Marchis ML, Nervi C, Bozzoni I (2005) A minicircuitry comprised of microRNA-223 and transcription factors NFI-A and C/EBPalpha regulates human granulopoiesis. Cell 123:819–31 64. Ferrell JE Jr., Machleder EM (1998) The biochemical basis of an all-or-none cell fate switch in Xenopus oocytes. Science 280:895–8 65. Fisher AG (2002) Cellular identity and lineage choice. Nat Rev Immunol 2:977–82 66. Fox JJ, Hill CC (2001) From topology to dynamics in biochemical networks. Chaos 11:809–815 67. Fraser HB, Hirsh AE (2004) Evolutionary rate depends on number of protein–protein interactions independently of gene expression level. BMC Evol Biol 4:13 68. Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW (2002) Evolutionary rate in the protein interaction network. Science 296:750–2 69. Fuks F (2005) DNA methylation and histone modifications: teaming up to silence genes. Curr Opin Genet Dev 15:490–495 70. Gao H, Falt S, Sandelin A, Gustafsson JA, Dahlman-Wright K (2007) Genome-wide identification of estrogen receptor α binding sites in mouse liver. Mol Endocrinol 22:10–22
71. Gardner TS, Cantor CR, Collins JJ (2000) Construction of a genetic toggle switch in Escherichia coli. Nature 403:339–342 72. Gershenson C (2002) Classification of random Boolean networks. In: Standish RK, Bedau MA, Abbass HA (eds) Artificial life, vol 8. MIT Press, Cambridge, pp 1–8 73. Gisiger T (2001) Scale invariance in biology: coincidence or footprint of a universal mechanism? Biol Rev Camb Philos Soc 76:161–209 74. Glass L, Kauffman SA (1972) Co-operative components, spatial localization and oscillatory cellular dynamics. J Theor Biol 34:219–37 75. Goldberg AD, Allis CD, Bernstein E (2007) Epigenetics: a landscape takes shape. Cell 128:635–8 76. Goldstein ML, Morris SA, Yen GG (2004) Problems with fitting to the power-law distribution. Eur Phys J B 41:255–258 77. Goodwin BC, Webster GC (1999) Rethinking the origin of species by natural selection. Riv Biol 92:464–7 78. Gould SJ, Lewontin RC (1979) The spandrels of San Marco and the Panglossian paradigm: a critique of the adaptationist programme. Proc R Soc Lond B Biol Sci 205:581–98 79. Graf T (2002) Differentiation plasticity of hematopoietic cells. Blood 99:3089–101 80. Grass JA, Boyer ME, Pal S, Wu J, Weiss MJ, Bresnick EH (2003) GATA-1-dependent transcriptional repression of GATA-2 via disruption of positive autoregulation and domain-wide chromatin remodeling. Proc Natl Acad Sci USA 100:8811–6 81. Greil F, Drossel B, Sattler J (2007) Critical Kauffman networks under deterministic asynchronous update. New J Phys 9:373 82. Guelzim N, Bottani S, Bourgine P, Kepes F (2002) Topological and causal structure of the yeast transcriptional regulatory network. Nat Genet 31:60–3 83. Guo Y, Eichler GS, Feng Y, Ingber DE, Huang S (2006) Towards a holistic, yet gene-centered analysis of gene expression profiles: a case study of human lung cancers. J Biomed Biotechnol 2006:69141 84. Hahn MW, Kern AD (2005) Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks. 
Mol Biol Evol 22:803–6 85. Hartwell LH, Hopfield JJ, Leibler S, Murray AW (1999) From molecular to modular cell biology. Nature 402:C47–52 86. Harris SE, Sawhill BK, Wuensche A, Kauffman SA (2002) A model of transcriptional regulatory networks based on biases in the observed regulation rules. Complexity 7:23–40 87. Hasty J, Pradines J, Dolnik M, Collins JJ (2000) Noise-based switches and amplifiers for gene expression. Proc Natl Acad Sci USA 97:2075–80 88. Haverty PM, Hansen U, Weng Z (2004) Computational inference of transcriptional regulatory networks from expression profiling and transcription factor binding site identification. Nucleic Acids Res 32:179–88 89. He L, Hannon GJ (2004) MicroRNAs: small RNAs with a big role in gene regulation. Nat Rev Genet 5:522–31 90. Hilborn R (1994) Chaos and nonlinear dynamics: An introduction for scientists and engineers, 2 edn. Oxford University Press, New York 91. Hochedlinger K, Jaenisch R (2006) Nuclear reprogramming and pluripotency. Nature 441:1061–7 92. Hu M, Krause D, Greaves M, Sharkis S, Dexter M, Heyworth C, Enver T (1997) Multilineage gene expression precedes commitment in the hemopoietic system. Genes Dev 11:774–85
93. Huang S (2004) Back to the biology in systems biology: what can we learn from biomolecular networks. Brief Funct Genomics Proteomics 2:279–297 94. Huang S (2007) Cell fates as attractors – stability and flexibility of cellular phenotype. In: Endothelial biomedicine, 1st edn, Cambridge University Press, New York, pp 1761–1779 95. Huang S, Ingber DE (2000) Shape-dependent control of cell growth, differentiation, and apoptosis: switching between attractors in cell regulatory networks. Exp Cell Res 261:91–103 96. Huang S, Ingber DE (2006) A non-genetic basis for cancer progression and metastasis: self-organizing attractors in cell regulatory networks. Breast Dis 26:27–54 97. Huang S, Wikswo J (2006) Dimensions of systems biology. Rev Physiol Biochem Pharmacol 157:81–104 98. Huang S, Eichler G, Bar-Yam Y, Ingber DE (2005) Cell fates as high-dimensional attractor states of a complex gene regulatory network. Phys Rev Lett 94:128701 99. Huang S, Guo YP, May G, Enver T (2007) Bifurcation dynamics of cell fate decision in bipotent progenitor cells. Dev Biol 305:695–713 100. Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, Kidd MJ, King AM, Meyer MR, Slade D, Lum PY, Stepaniants SB, Shoemaker DD, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend SH (2000) Functional discovery via a compendium of expression profiles. Cell 102:109–26 101. Hume DA (2000) Probability in transcriptional regulation and its implications for leukocyte differentiation and inducible gene expression. Blood 96:2323–8 102. Ihmels J, Bergmann S, Barkai N (2004) Defining transcription modules using large-scale gene expression data. Bioinformatics 20:1993–2003 103. Jablonka E, Lamb MJ (2002) The changing concept of epigenetics. Ann N Y Acad Sci 981:82–96 104. Jeong H, Mason SP, Barabasi AL, Oltvai ZN (2001) Lethality and centrality in protein networks. Nature 411:41–42 105. 
Johnson DS, Mortazavi A, Myers RM, Wold B (2007) Genome-wide mapping of in vivo protein-DNA interactions. Science 316:1497–502 106. Jordan IK, Wolf YI, Koonin EV (2003) No simple dependence between protein evolution rate and the number of protein–protein interactions: only the most prolific interactors tend to evolve slowly. BMC Evol Biol 3:1 107. Joy MP, Brock A, Ingber DE, Huang S (2005) High-betweenness proteins in the yeast protein interaction network. J Biomed Biotechnol 2005:96–103 108. Kaern M, Elston TC, Blake WJ, Collins JJ (2005) Stochasticity in gene expression: from theories to phenotypes. Nat Rev Genet 6:451–64 109. Kaplan D, Glass L (1995) Understanding Nonlinear Dynamics, 1st edn. Springer, New York 110. Kashiwagi K, Urabe I, Kaneko K, Yomo T (2006) Adaptive response of a gene network to environmental changes by fitness-induced attractor selection. PLoS One 1:e49 111. Kauffman S (1969) Homeostasis and differentiation in random genetic control networks. Nature 224:177–8 112. Kauffman S (2004) A proposal for using the ensemble approach to understand genetic regulatory networks. J Theor Biol 230:581–90
113. Kauffman S, Peterson C, Samuelsson B, Troein C (2003) Random Boolean network models and the yeast transcriptional network. Proc Natl Acad Sci USA 100:14796–9 114. Kauffman SA (1969) Metabolic stability and epigenesis in randomly constructed genetic nets. J Theor Biol 22:437–467 115. Kauffman SA (1991) Antichaos and adaptation. Sci Am 265:78–84 116. Kauffman SA (1993) The origins of order. Oxford University Press, New York 117. Khorasanizadeh S (2004) The nucleosome: from genomic organization to genomic regulation. Cell 116:259–72 118. Kim KY, Wang J (2007) Potential energy landscape and robustness of a gene regulatory network: toggle switch. PLoS Comput Biol 3:e60 119. Klemm K, Bornholdt S (2005) Stable and unstable attractors in Boolean networks. Phys Rev E Stat Nonlin Soft Matter Phys 72:055101 120. Klevecz RR, Bolen J, Forrest G, Murray DB (2004) A genomewide oscillation in transcription gates DNA replication and cell cycle. Proc Natl Acad Sci USA 101:1200–5 121. Kloster M, Tang C, Wingreen NS (2005) Finding regulatory modules through large-scale gene-expression data analysis. Bioinformatics 21:1172–9 122. Kouzarides T (2007) Chromatin modifications and their function. Cell 128:693–705 123. Kramer BP, Fussenegger M (2005) Hysteresis in a synthetic mammalian gene network. Proc Natl Acad Sci USA 102: 9517–9522 124. Krawitz P, Shmulevich I (2007) Basin entropy in Boolean network ensembles. Phys Rev Lett 98:158701 125. Krysinska H, Hoogenkamp M, Ingram R, Wilson N, Tagoh H, Laslo P, Singh H, Bonifer C (2007) A two-step, PU.1-dependent mechanism for developmentally regulated chromatin remodeling and transcription of the c-fms gene. Mol Cell Biol 27:878–87 126. Kubicek S, Jenuwein T (2004) A crack in histone lysine methylation. Cell 119:903–6 127. Laslo P, Spooner CJ, Warmflash A, Lancki DW, Lee HJ, Sciammas R, Gantner BN, Dinner AR, Singh H (2006) Multilineage transcriptional priming and determination of alternate hematopoietic cell fates. Cell 126:755–66 128. 
Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne JB, Volkert TL, Fraenkel E, Gifford DK, Young RA (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298:799–804 129. Levsky JM, Singer RH (2003) Gene expression and the myth of the average cell. Trends Cell Biol 13:4–6 130. Li F, Long T, Lu Y, Ouyang Q, Tang C (2004) The yeast cell-cycle network is robustly designed. Proc Natl Acad Sci USA 101:4781–6 131. Li H, Xuan J, Wang Y, Zhan M (2008) Inferring regulatory networks. Front Biosci 13:263–75 132. Lim HN, van Oudenaarden A (2007) A multistep epigenetic switch enables the stable inheritance of DNA methylation states. Nat Genet 39:269–75 133. Luo F, Yang Y, Chen CF, Chang R, Zhou J, Scheuermann RH (2007) Modular organization of protein interaction networks. Bioinformatics 23:207–14
134. Luscombe NM, Babu MM, Yu H, Snyder M, Teichmann SA, Gerstein M (2004) Genomic analysis of regulatory network dynamics reveals large topological changes. Nature 431:308–12 135. MacCarthy T, Pomiankowski A, Seymour R (2005) Using large-scale perturbations in gene network reconstruction. BMC Bioinformatics 6:11 136. Mangan S, Alon U (2003) Structure and function of the feed-forward loop network motif. Proc Natl Acad Sci USA 100:11980–5 137. Manke T, Demetrius L, Vingron M (2006) An entropic characterization of protein interaction networks and cellular robustness. J R Soc Interface 3:843–50 138. Marcotte EM (2001) The path not taken. Nat Biotechnol 19:626–627 139. Margolin AA, Califano A (2007) Theory and limitations of genetic network inference from microarray data. Ann N Y Acad Sci 1115:51–72 140. Maslov S, Sneppen K (2002) Specificity and stability in topology of protein networks. Science 296:910–3 141. Mattick JS (2007) A new paradigm for developmental biology. J Exp Biol 210:1526–47 142. May RM (1972) Will a large complex system be stable? Nature 238:413–414 143. Meissner A, Wernig M, Jaenisch R (2007) Direct reprogramming of genetically unmodified fibroblasts into pluripotent stem cells. Nat Biotechnol 25:1177–1181 144. Mellor J (2006) Dynamic nucleosomes and gene transcription. Trends Genet 22:320–9 145. Metzger E, Wissmann M, Schule R (2006) Histone demethylation and androgen-dependent transcription. Curr Opin Genet Dev 16:513–7 146. Mikkers H, Frisen J (2005) Deconstructing stemness. EMBO J 24:2715–9 147. Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network motifs: simple building blocks of complex networks. Science 298:824–7 148. Monod J, Jacob F (1961) Teleonomic mechanisms in cellular metabolism, growth, and differentiation. Cold Spring Harb Symp Quant Biol 26:389–401 149. Morceau F, Schnekenburger M, Dicato M, Diederich M (2004) GATA-1: friends, brothers, and coworkers. Ann N Y Acad Sci 1030:537–54 150.
Morrison SJ, Uchida N, Weissman IL (1995) The biology of hematopoietic stem cells. Annu Rev Cell Dev Biol 11:35–71 151. Murray JD (1989) Mathematical biology, 2nd edn (1993). Springer, Berlin 152. Newman MEJ (2003) The structure and function of complex networks. SIAM Review 45:167–256 153. Nykter M, Price ND, Aldana M, Ramsey SA, Kauffman SA, Hood L, Yli-Harja O, Shmulevich I (2008) Gene expression dynamics in the macrophage exhibit criticality. Proc Natl Acad Sci USA 105:1897–900 154. Nykter M, Price ND, Larjo A, Aho T, Kauffman SA, Yli-Harja O, Shmulevich I (2008) Critical networks exhibit maximal information diversity in structure-dynamics relationships. Phys Rev Lett 100:058702 155. Odom DT, Zizlsperger N, Gordon DB, Bell GW, Rinaldi NJ, Murray HL, Volkert TL, Schreiber J, Rolfe PA, Gifford DK, Fraenkel E, Bell GI, Young RA (2004) Control of pancreas and liver gene expression by HNF transcription factors. Science 303:1378–81
156. Okita K, Ichisaka T, Yamanaka S (2007) Generation of germline-competent induced pluripotent stem cells. Nature 448:313–7 157. Ozbudak EM, Thattai M, Lim HN, Shraiman BI, Van Oudenaarden A (2004) Multistability in the lactose utilization network of Escherichia coli. Nature 427:737–740 158. Pennisi E (2003) Human genome. A low number wins the GeneSweep Pool. Science 300:1484 159. Picht P (1969) Mut zur Utopie. Piper, München 160. Proulx SR, Promislow DE, Phillips PC (2005) Network thinking in ecology and evolution. Trends Ecol Evol 20:345–53 161. Raff M (2003) Adult stem cell plasticity: fact or artifact? Annu Rev Cell Dev Biol 19:1–22 162. Ralston A, Rossant J (2005) Genetic regulation of stem cell origins in the mouse embryo. Clin Genet 68:106–12 163. Ramo P, Kesseli J, Yli-Harja O (2006) Perturbation avalanches and criticality in gene regulatory networks. J Theor Biol 242:164–70 164. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL (2002) Hierarchical organization of modularity in metabolic networks. Science 297:1551–5 165. Reik W, Dean W (2002) Back to the beginning. Nature 420:127 166. Resendis-Antonio O, Freyre-Gonzalez JA, Menchaca-Mendez R, Gutierrez-Rios RM, Martinez-Antonio A, Avila-Sanchez C, Collado-Vides J (2005) Modular analysis of the transcriptional regulatory network of E. coli. Trends Genet 21:16–20 167. Robins H, Krasnitz M, Barak H, Levine AJ (2005) A relative-entropy algorithm for genomic fingerprinting captures host-phage similarities. J Bacteriol 187:8370–4 168. Roeder I, Glauche I (2006) Towards an understanding of lineage specification in hematopoietic stem cells: a mathematical model for the interaction of transcription factors GATA-1 and PU.1. J Theor Biol 241:852–65 169. Salgado H, Santos-Zavaleta A, Gama-Castro S, Peralta-Gil M, Penaloza-Spinola MI, Martinez-Antonio A, Karp PD, Collado-Vides J (2006) The comprehensive updated regulatory network of Escherichia coli K-12. BMC Bioinformatics 7:5 170.
Samonte RV, Eichler EE (2002) Segmental duplications and the evolution of the primate genome. Nat Rev Genet 3:65–72 171. Sandberg R, Ernberg I (2005) Assessment of tumor characteristic gene expression in cell lines using a tissue similarity index (TSI). Proc Natl Acad Sci USA 102:2052–7 172. Shivdasani RA (2006) MicroRNAs: regulators of gene expression and cell differentiation. Blood 108:3646–53 173. Shmulevich I, Kauffman SA (2004) Activities and sensitivities in boolean network models. Phys Rev Lett 93:048701 174. Shmulevich I, Kauffman SA, Aldana M (2005) Eukaryotic cells are dynamically ordered or critical but not chaotic. Proc Natl Acad Sci USA 102:13439–44 175. Siegal ML, Promislow DE, Bergman A (2007) Functional and evolutionary inference in gene networks: does topology matter? Genetica 129:83–103 176. Smith MC, Sumner ER, Avery SV (2007) Glutathione and Gts1p drive beneficial variability in the cadmium resistances of individual yeast cells. Mol Microbiol 66:699–712 177. Southall TD, Brand AH (2007) Chromatin profiling in model organisms. Brief Funct Genomic Proteomic 6:133–40 178. Southan C (2004) Has the yo-yo stopped? An assessment of human protein-coding gene number. Proteomics 4:1712–26
179. Stern CD (2000) Conrad H. Waddington’s contributions to avian and mammalian development, 1930–1940. Int J Dev Biol 44:15–22 180. Strohman R (1994) Epigenesis: the missing beat in biotechnology? Biotechnology (N Y) 12:156–64 181. Stumpf MP, Wiuf C, May RM (2005) Subnets of scale-free networks are not scale-free: sampling properties of networks. Proc Natl Acad Sci USA 102:4221–4 182. Suzuki M, Yamada T, Kihara-Negishi F, Sakurai T, Hara E, Tenen DG, Hozumi N, Oikawa T (2006) Site-specific DNA methylation by a complex of PU.1 and Dnmt3a/b. Oncogene 25:2477–88 183. Swiers G, Patient R, Loose M (2006) Genetic regulatory networks programming hematopoietic stem cells and erythroid lineage specification. Dev Biol 294:525–40 184. Takahashi K, Yamanaka S (2006) Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 126:663–76 185. Tapscott SJ (2005) The circuitry of a master switch: Myod and the regulation of skeletal muscle gene transcription. Development 132:2685–95 186. Taylor JS, Raes J (2004) Duplication and divergence: The evolution of new genes and old ideas. Annu Rev Genet 38: 615–643 187. Teichmann SA, Babu MM (2004) Gene regulatory network growth by duplication. Nat Genet 36:492–6 188. Thieffry D, Huerta AM, Perez-Rueda E, Collado-Vides J (1998) From specific gene regulation to genomic networks: a global analysis of transcriptional regulation in Escherichia coli. Bioessays 20:433–40 189. Tinbergen N (1952) Derived activities; their causation, biological significance, origin, and emancipation during evolution. Q Rev Biol 27:1–32 190. Toh H, Horimoto K (2002) Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling. Bioinformatics 18:287–97 191. Trojer P, Reinberg D (2006) Histone lysine demethylases and their impact on epigenetics. Cell 125:213–7 192. 
Tyson JJ, Chen KC, Novak B (2003) Sniffers, buzzers, toggles and blinkers: dynamics of regulatory and signaling pathways in the cell. Curr Opin Cell Biol 15:221–231 193. van Helden J, Wernisch L, Gilbert D, Wodak SJ (2002) Graph-based analysis of metabolic networks. Ernst Schering Res Found Workshop:245–74 194. van Nimwegen E (2003) Scaling laws in the functional content of genomes. Trends Genet 19:479–84 195. Vogel G (2003) Stem cells. ‘Stemness’ genes still elusive. Science 302:371 196. Waddington CH (1940) Organisers and genes. Cambridge University Press, Cambridge 197. Waddington CH (1956) Principles of embryology. Allen and Unwin Ltd, London 198. Waddington CH (1957) The strategy of the genes. Allen and Unwin, London 199. Watts DJ (2004) The “new” science of networks. Ann Rev Sociol 20:243–270 200. Webster G, Goodwin BC (1999) A structuralist approach to morphology. Riv Biol 92:495–8 201. Wernig M, Meissner A, Foreman R, Brambrink T, Ku M, Hochedlinger K, Bernstein BE, Jaenisch R (2007) In vitro reprogramming of fibroblasts into a pluripotent ES-cell-like state. Nature 448:318–24
202. Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, Botstein D (2002) Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol Biol Cell 13:1977–2000 203. Wilkins AS (2007) Colloquium Papers: Between “design” and “bricolage”: Genetic networks, levels of selection, and adaptive evolution. Proc Natl Acad Sci USA 104 Suppl 1:8590–6 204. Wuensche A (1998) Genomic regulation modeled as a network with basins of attraction. Pac Symp Biocomput:89–102 205. Xiong W, Ferrell JE Jr. (2003) A positive-feedback-based bistable ‘memory module’ that governs a cell fate decision. Nature 426:460–465 206. Xu X, Wang L, Ding D (2004) Learning module networks from genome-wide location and expression data. FEBS Lett 578:297–304 207. Yu H, Greenbaum D, Xin Lu H, Zhu X, Gerstein M (2004) Genomic analysis of essentiality within protein networks. Trends Genet 20:227–31 208. Yu H, Kim PM, Sprecher E, Trifonov V, Gerstein M (2007) The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics. PLoS Comput Biol 3:e59 209. Yuh CH, Bolouri H, Davidson EH (2001) Cis-regulatory logic in the endo16 gene: switching from a specification to a differentiation mode of control. Development 128:617–29
Books and Reviews Huang S (2004) Back to the biology in systems biology: what can we learn from biomolecular networks. Brief Funct Genomics Proteomics 2:279–297 Huang S (2007) Cell fates as attractors – stability and flexibility of cellular phenotype. In: Endothelial biomedicine, 1st edn. Cambridge University Press, New York, pp 1761–1779 Huang S, Ingber DE (2006) A non-genetic basis for cancer progression and metastasis: self-organizing attractors in cell regulatory networks. Breast Dis 26:27–54 Kaneko K (2006) Life: An introduction to complex systems biology, 1st edn. Springer, Berlin Kauffman SA (1991) Antichaos and adaptation. Sci Am 265:78–84 Kauffman SA (1993) The origins of order. Oxford University Press, New York Kauffman SA (1996) At home in the universe: the search for the laws of self-organization and complexity. Oxford University Press, New York Laurent M, Kellershohn N (1999) Multistability: a major means of differentiation and evolution in biological systems. Trends Biochem Sci 24:418–422 Wilkins AS (2007) Colloquium papers: Between “design” and “bricolage”: Genetic networks, levels of selection, and adaptive evolution. Proc Natl Acad Sci USA 104 Suppl 1:8590–6
Complexity in Systems Level Biology and Genetics: Statistical Perspectives

DAVID A. STEPHENS
Department of Mathematics and Statistics, McGill University, Montreal, Canada

Article Outline

Glossary
Definition of the Subject
Introduction
Mathematical Representations of the Organizational Hierarchy
Transcriptomics and Functional Genomics
Metabolomics
Future Directions
Bibliography

Glossary

Systems biology The holistic study of biological structure, function and organization.

Probabilistic graphical model A probabilistic model defining the relationships between variables in a model by means of a graph, used to represent the relationships in a biological network or pathway.

MCMC Markov chain Monte Carlo – a computational method for approximating high-dimensional integrals, using Markov chains to sample from probability distributions; commonly used in Bayesian inference.

Microarray A high-throughput experimental platform for collecting functional gene expression and other genomic data.

Cluster analysis A statistical method for discovering subgroups in data.

Metabolomics The study of the metabolic content of tissues.

Definition of the Subject

This chapter identifies the challenges posed to biologists, geneticists and other scientists by advances in technology that have made the observation and study of biological systems increasingly possible. High-throughput platforms have made routine the collection of vast amounts of structural and functional data, have provided insights into the working cell, and have helped to explain the role of genetics in common diseases. Associated with the improvements in technology is the need for statistical procedures that extract the biological information from the available data in a coherent fashion and, perhaps more importantly,
can quantify the certainty with which conclusions can be made. This chapter outlines a biological hierarchy of structures, functions and interactions that can now be observed, and details the statistical procedures that are necessary for analyzing the resulting data. The chapter has four main sections. The first section details the historical connection between statistics and the analysis of biological and genetic data, and summarizes fundamental concepts in biology and genetics. The second section outlines specific mathematical and statistical methods that are useful in the modeling of data arising in bioinformatics. In sections three and four, two particular issues are discussed in detail: functional genomics via microarray analysis, and metabolomics. Section five identifies some future directions for biological research in which statisticians will play a vital role.

Introduction

The observation of biological systems, their processes and interactions, is one of the most important activities in modern science. It has the capacity to provide direct insight into fundamental aspects of biology, genetics and evolution, and indirectly will inform many aspects of public health. Recent advances in technology – high-throughput measurement platforms, imaging – have brought a new era of increasingly precise methods of investigation. In parallel to this, there is an increasingly important focus on statistical methods that allow the information gathered to be processed and synthesized. This chapter outlines key statistical techniques that allow the information gathered to be used in an optimal fashion.
Although its origin is dated rather earlier, the term Systems Biology (see, for example, [1,2,3]) has, since 2000, been used to describe the study of the operation of biological systems using tools from mathematics, statistics and computer science, supplanting computational biology and bioinformatics as an all-encompassing term for quantitative investigation in molecular biology. Most biological systems are hugely complex, involving chemical and mechanical processes operating at different scales. It is important, therefore, that information gathered is processed coherently, according to self-consistent rules and practices, in the presence of the uncertainty induced by imperfect observation of the underlying system. The most natural framework for coherent processing of information is that of probabilistic modeling.

Statistical Versus Mathematical Modeling

There is a great tradition of mathematical and probabilistic modeling of biology and genetics; see [4] for a thor-
ough review. The mathematization of biology, evolution and heredity began at the end of the nineteenth century, and continued for the first half of the twentieth century, by far pre-dating the era of molecular biology and genetics that culminated at the turn of the last millennium with the human genome project. Consequently, the mathematical models of, say, evolutionary processes that were developed by Yule [5] and Fisher and Wright [6,7,8], and classical models of heredity, could only be experimentally verified and developed many years after their conception. It could also be convincingly argued that, through the work of F. Galton, K. Pearson and R. A. Fisher, modern statistics has its foundation in biology and genetics. In parallel to the statistical and stochastic formulation of models for biological systems, there has been a more recent focus on the construction of deterministic models to describe observed biological phenomena. Such models fall under the broad description Mathematical Biology, and have their roots in applied mathematics and dynamical systems; see, for example, [8,9] for a comprehensive treatment. The distinction between stochastic and deterministic models is important to make, as the objectives and tools used often differ considerably. This chapter will restrict attention to stochastic models, and the processing of observed data, and thus is perhaps more closely tied to the immediate interests of the scientist, although some of the models utilized will be inspired by mathematical models of the phenomena being observed.

Fundamental Concepts in Biology and Genetics

To facilitate the discussion of statistical methods applied to systems biology, it is necessary to introduce fundamental concepts from molecular biology and genetics; see the classic text [10] for full details. Attention is restricted to eukaryotic organisms, whose cells contain a nucleus in which coding information is encapsulated.
The cell nucleus is a complex architecture containing several nuclear domains [11] whose organization is not completely understood, but the fundamental activity that occurs within the nucleus is the production and distribution of proteins. Deoxyribonucleic acid (DNA) is a long string of nucleotides that encodes biological information, and that is copied or transcribed into ribonucleic acid (RNA), which in turn enables the formation of proteins. Specific segments of the DNA, genes, encode the proteins, although non-coding regions of DNA – for example, promoter regions, transcription factor binding sites – also have important roles. Genetic variation at the nucleotide level, even involving a single nucleotide, can
disrupt cellular activity. In humans and most other complex organisms, DNA is arranged into chromosomes, which are duplicated in the process of mitosis. The entire content of the DNA of an organism is termed the genome. Proteins are macromolecules formed by the translation of RNA, comprising amino acids arranged (in primary structure) in a linear fashion, comprising domains with different roles, and physically configured in three-dimensional space. Proteins are responsible for all biological activities that take place in the cell, although proteins may have different roles in different tissues at different times, due to the regulation of transcription. Proteins interact with each other in different ways in different contexts, in interaction networks that may be dynamically organized. Genes are also regarded as having indirect interactions through gene regulatory networks. Genetic variation amongst individuals in a population is due to mutation and selection, which can be regarded as stochastic mechanisms. Genetic information in the form of DNA passes from parent to offspring, which propagates genetic variation. Individuals in a population are typically related in evolutionary history. Similarly, proteins can also be thought to be related through evolutionary history. Genetic disorders are the result of genetic variation, but the nature of the genetic variation can be large- or small-scale; at the smallest scale, variation in single nucleotides (single nucleotide polymorphisms (SNPs)) can contribute to the variation in observed traits.
Broadly, attention is focused on the study of the structure and function of DNA, genes and proteins, and the nature of their interactions. It is useful, if simplistic, to view biological activities in terms of an organizational hierarchy of inter-related chemical reactions at the DNA, protein, nucleus, network and cellular levels. A holistic view of mathematical modeling and statistical inference requires the experimenter to model simultaneously the actions and interactions of all the component features, whilst recognizing that the component features cannot be observed directly, and can only be studied through separate experiments on often widely different platforms. It is the role of the bioinformatician or systems biologist to synthesize the data available from separate experiments in an optimal fashion.

Mathematical Representations of the Organizational Hierarchy

A mathematical representation of a biological system is required that recognizes, first, the complexity of the sys-
tem, secondly, its potentially temporally changing nature, and thirdly the inherent uncertainties that are present. It is the last feature that necessitates the use of probabilistic or stochastic modeling. An aphorism commonly ascribed to D.V. Lindley states that “Probability is the language of uncertainty”; probability provides a coherent framework for processing information in the presence of imperfect knowledge, and through the paradigm of Bayesian theory [12] provides the mathematical template for statistical inference and prediction. In the modeling of complex systems, three sorts of uncertainty are typically present:

Uncertainty of Structure: Imperfect knowledge of the connections between the interacting components is typically present. For example, in a gene regulatory network, it may be possible via the measurement of gene co-expression to establish which genes interact within the network, but it may not be apparent precisely how the organization of regulation operates, that is, which genes regulate the expression of other genes.

Uncertainty concerning Model Components: In any mathematical or probabilistic model of a biological system, there are model components (differential equations, probability distributions, parameter settings) that must be chosen to facilitate implementation of the model. These components reflect, but are not determined by, structural considerations.

Uncertainty of Observation: Any experimental procedure carries with it uncertainty induced by the measurement of the underlying system, which is typically subject to random measurement error, or noise. For example, many biological assays rely on imaging technology, and on the extraction of the signal level of a fluorescent probe, as a representation of the amount of biological material present. In microarray studies (see Sect.
“Microarrays”), comparative hybridization of messenger RNA (mRNA) to a medium is a technique for measuring gene expression that is noisy due to several factors (imaging noise, variation in hybridization) not attributable to a biological cause.

The framework to be built must handle these types of uncertainty, and permit inference about structure and model components.

Models Derived from Differential Equations

A deterministic model reflecting the dynamic relationships often present in biological systems may be based on a system of ordinary differential equations (ODEs)

ẋ(t) = g(x(t))   (1)

where x(t) = (x_1(t), ..., x_d(t))^T represents the levels of the d quantities being observed, ẋ(t) represents the time derivative, and g is some potentially non-linear system of equations that may be suggested by biological prior knowledge or prior experimentation. The model in Eq. (1) is a classical “Mathematical Biology” model that has been successful in representing forms of organization in many biological systems (see, for example, [13] for general applications). Suppressed in the notation is a dependence on system parameters θ, a k-dimensional vector that may be presumed fixed and “tuned” to replicate observed behavior, or estimated from observed data. When data representing a partial observation of the system are available, inferences about θ can be made, and models defined by ODE systems are of growing interest to statisticians; see, for example, [15,16,17]. Equation (1) can be readily extended to a stochastic differential equation (SDE) system

dx(t) = g(x(t)) dt + dz(t)   (2)
where z(t) is some stochastic process that renders the solution to Eq. (2) a stochastic process (see, for example, [18] for a comprehensive recent summary of modeling approaches and inference procedures, and a specific application in [19]). The final term dz(t) represents the infinitesimal stochastic increment in z(t). Such models, although particularly useful for modeling activity at the molecular level, often rely on simplifying assumptions (linearity of g, Gaussianity of z) and on the fact that the relationship structure captured by g is known. Inference for the parameters of the system can be made, but in general requires advanced computational methods (Monte Carlo (MC) and Markov chain Monte Carlo (MCMC)).

Probabilistic Graphical Models

A simple and often directly implementable approach is based on a probabilistic graphical model, comprising a graph G = (N, E) described by a series of nodes N and edges E, and a collection of random variables X = (X_1, ..., X_d)^T placed at the nodes, all of which may be dynamically changing. See, for example, [20] for a recent summary, [14] for mathematical details, and [22] for a biological application. The objective of constructing such a model is to identify the joint probability structure of X given the graph G, f_X(x | θ, G), which possibly is parametrized by parameters θ. In many applications, X is not directly observed, but is instead inferred from observed data, Y, arising as noisy observations derived from X. Again, a k-dimensional parameter vector φ helps to characterize the
stochastic dependence of Y on X by parametrizing the conditional probability density f_{Y|X}(y | x, φ). The joint probability model encapsulating the probabilistic structure of the model is

f_{X,Y}(x, y | θ, φ, G) = f_X(x | θ, G) f_{Y|X}(y | x, φ)   (3)

The objectives of inference are to learn about G (the uncertain structural component) and the parameters (θ, φ) (the uncertain model and observation components). The graph structure G is described by N and E. In holistic models, G represents the interconnections between interacting modules (genomic modules, transcription modules, regulatory modules, proteomic modules, metabolic modules, etc.) and also the interconnections within modules in the form of subgraphs. The nodes N (and hence X) represent influential variables in the model structure, and the edges E represent dependencies. The edge connecting two nodes, if present, may be directed or undirected according to the nature of the influence; a directed edge indicates the direction of causation, an undirected edge indicates a dependence. Causality is a concept distinct from dependence (association, co-variation or correlation), and represents the influence of one node on one or more other nodes (see, for example, [23] for a recent discussion of the distinction with examples, and [24,25] for early influential papers discussing how functional dependence may be learned from real data). A simple causal relationship between three variables X_1, X_2, X_3 can be represented by the chain

X_1 → X_2 → X_3

which encodes a conditional independence relationship between X_1 and X_3 given X_2, and a factorization of the joint distribution

p(x_1, x_2, x_3) = p(x_1) p(x_2 | x_1) p(x_3 | x_2) .

Similarly, the equivalent graphs

X_1 ← X_2 ← X_3   and   X_1 ← X_2 → X_3

encode the same conditional independence relationship, whereas graphs encoding conditional independence of X_2 and X_3 given X_1, such as the fork

X_2 ← X_1 → X_3

factorize as

p(x_1, x_2, x_3) = p(x_1) p(x_2 | x_1) p(x_3 | x_1) ,   (4)

and the collider

X_1 → X_3 ← X_2

encodes

p(x_1, x_2, x_3) = p(x_1) p(x_2) p(x_3 | x_1, x_2) ,

under which X_1 and X_2 are marginally independent. Such simple model assumptions are the building blocks for the construction of highly complex graphical representations of biological systems. There is an important difference between analysis based purely on simultaneous observation of all components of the system, which can typically only yield inference on dependencies (say, covariances measured in the joint probability model p(x); see, for example, [26,27,28]), and analysis based on interventions (genomic knock-out experiments, chemical or biological challenges, transcriptional/translational perturbation such as RNA interference (RNAi)) that may yield information on causal links; see, for example, [29,30].

Bayesian Statistical Inference

Given a statistical model for observed data such as Eq. (3), inference for the parameters (θ, φ) and the graph structure G is required. The optimal coherent framework is that of Bayesian statistical inference (see, for example, [31]), which requires computation of the posterior distribution for the unknown (or unobservable) quantities, given by

π(θ, φ, G | x, y) ∝ f_{X,Y}(x, y | θ, φ, G) p(θ, φ, G) = L(θ, φ, G | x, y) p(θ, φ, G)   (5)
a probability distribution from which parameter estimates with associated uncertainties, and predictions from the model, can be computed. The terms L(θ, φ, G | x, y) and p(θ, φ, G) are termed the likelihood and the prior probability distribution respectively. The likelihood reflects the observed data, and the prior distribution encapsulates biological prior knowledge about the system under study. If the graph structure is known in advance, the prior distribution for that component can be set to be degenerate. If, as in many cases of probabilistic graphical models, the x are unobserved, then the posterior distribution incorporates them also,

π(θ, φ, G, x | y) ∝ f_{Y|X}(y | x, φ) f_X(x | θ, G) p(θ, φ, G)   (6)
yielding a latent or state-space model, otherwise interpreted as a missing data model. The likelihood and prior can often be formulated in a hierarchical fashion to reflect believed causal or conditional independence structures. If a graph G is separable into two subgraphs G_1, G_2 conditional on a connecting node, similar to the graph in Eq. (4), then the probability model also factorizes in a similar fashion; for example, X_1 might represent the amount of expressed mRNA of a gene that regulates two separate functional modules, and X_2 and X_3 might be the levels of expression of collections of related proteins. The hierarchical specification also extends to parameters in probability models; a standard formulation of a Bayesian hierarchical model involves the specification of conditional independence structures at multiple levels within a graph. The following three-level hierarchical model relates data Y = (Y_1, ..., Y_p)^T at level 1, to a population of parameters θ = (θ_1, ..., θ_p)^T at level 2, and to hyperparameters ψ at level 3:

Level 3:   ψ
Level 2:   θ_1, θ_2, ..., θ_p
Level 1:   Y_1, Y_2, ..., Y_p

yielding the factorization of the Bayesian full joint distribution as

f_{Y,θ,ψ}(y, θ, ψ) = p(ψ) {∏_{i=1}^{p} p(θ_i | ψ)} {∏_{i=1}^{p} p(Y_i | θ_i)} .   (7)

Bayesian Computation

The posterior distribution is, potentially, a high-dimensional multivariate function on a complicated parameter space. The proportionality constant in Eq. (5) takes the form

f_{X,Y}(x, y) = ∫ f_{X,Y}(x, y | θ, φ, G) p(θ, φ, G) dθ dφ dG   (8)

and in Eq. (6) takes the form

f_Y(y) = ∫ f_{X,Y}(x, y | θ, φ, G) p(θ, φ, G) dθ dφ dG dx   (9)

and is termed the marginal likelihood or prior predictive distribution for the observable quantities x and y. In formal Bayesian theory, it is the representation of the distribution of the observable quantities through the paradigm of exchangeability that justifies the decomposition in Eq. (8) into likelihood and prior, and justifies, via asymptotic arguments, the use of the posterior distribution for inference (see Chaps. 1–4 in [13] for full details). It is evident from these equations that exact computation of the posterior distribution necessitates high-dimensional integration, and in many cases this cannot be carried out analytically.

Numerical Integration Approaches

Classical numerical integration methods, or analytic approximation methods, are suitable only in low dimensions. Stochastic numerical integration, for example Monte Carlo integration, approximates expectations by using empirical averages of functionals of samples obtained from the target distribution; for a probability distribution π(x), the approximation of E_π[g(X)],

E_π[g(X)] = ∫ g(x) π(x) dx < ∞ ,

is achieved by randomly sampling x_1, ..., x_N (N large) from π(·), and using the estimate

Ê_π[g(X)] = (1/N) Σ_{i=1}^{N} g(x_i) .

An adaptation of the Monte Carlo method can be used if the functions g and π are not “similar” (in the sense that g is large in magnitude where π is not, and vice versa); importance sampling uses the representation

E_π[g(X)] = ∫ g(x) π(x) dx = ∫ [g(x) π(x) / p(x)] p(x) dx

for some pdf p(·) having common support with π, and constructs an estimate from a sample x_1, ..., x_N from p(·) of the form

Ê_π[g(X)] = (1/N) Σ_{i=1}^{N} g(x_i) π(x_i) / p(x_i) .
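As a quick numerical sketch of the two estimators just described, consider an assumed toy example: g(x) = x² under a standard normal target π (so E_π[g(X)] = 1), with an assumed importance density p = N(0, 2²); the function, target and proposal here are illustrative choices, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200_000
g = lambda x: x ** 2                # E_pi[g(X)] = 1 when pi is N(0, 1)

# Plain Monte Carlo: sample x_1, ..., x_N directly from pi = N(0, 1)
x = rng.normal(size=N)
mc_est = g(x).mean()

# Importance sampling: sample from p = N(0, 2^2) and reweight by pi(x)/p(x)
y = rng.normal(scale=2.0, size=N)
pi_dens = np.exp(-0.5 * y ** 2) / np.sqrt(2 * np.pi)
p_dens = np.exp(-0.5 * (y / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))
is_est = np.mean(g(y) * pi_dens / p_dens)
```

Both `mc_est` and `is_est` converge to the true expectation 1 as N grows; the importance-sampling version only requires the ability to evaluate π, not to sample from it.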
Under standard regularity conditions, the corresponding estimators converge to the required expectation. Further extensions are also useful:
Sequential Monte Carlo: Sequential Monte Carlo (SMC) is an adaptive procedure that constructs a sequence of improving importance sampling distributions. SMC is a technique that is especially useful for inference problems where data are collected sequentially in time, but it is also used in standard Monte Carlo problems (see [32]).

Quasi Monte Carlo: Quasi Monte Carlo (QMC) utilizes uniform but non-random samples to approximate the required expectations. It can be shown that QMC can produce estimators with lower variance than standard Monte Carlo.

Markov Chain Monte Carlo

Markov chain Monte Carlo (MCMC) is a stochastic Monte Carlo method for sampling from a high-dimensional probability distribution π(x), and for using the samples to approximate expectations with respect to that distribution. An ergodic, discrete-time Markov chain is defined on the support of π in such a way that the stationary distribution of the chain exists and is equal to π. Dependent samples from π are obtained by collecting realized values of the chain after it has reached its stationary phase, and these are then used as the basis of a Monte Carlo strategy. The most common MCMC algorithm is the Metropolis–Hastings algorithm, which proceeds as follows. If the state of the d-dimensional chain {X_t} at iteration t is given by X_t = u, then a candidate state v is generated from the conditional density q(u, v) = q(v | u), and accepted as the new state of the chain (that is, X_{t+1} = v) with probability α(u, v) given by

α(u, v) = min{ 1, [π(v) q(v, u)] / [π(u) q(u, v)] } ;

otherwise the chain remains at u (that is, X_{t+1} = u). A common MCMC approach involves using a Gibbs sampler strategy that performs iterative sampling, with updating from the collection of full conditional distributions

π(x_j | x_{(j)}) = π(x_j | x_1, ..., x_{j-1}, x_{j+1}, ..., x_d) = π(x_1, ..., x_d) / π(x_1, ..., x_{j-1}, x_{j+1}, ..., x_d) ,   j = 1, ..., d ,
rather than updating the components of x simultaneously. There is a vast literature on MCMC theory and applications; see [33,34] for comprehensive treatments. MCMC re-focuses inferential interest from computing posterior analytic functional forms to producing posterior samples. It is an extremely flexible framework for computational inference that carries with it certain well-documented problems, most important amongst them being
the assessment of convergence. It is not always straightforward to assess when the Markov chain has reached its stationary phase, so certain monitoring steps are usually carried out.

Bayesian Modeling: Examples

Three models that are especially useful in the modeling of systems biology data are regression models, mixture models, and state-space models. Brief details of each type of model follow.

Regression Models

Linear regression models relate an observed response variable Y to a collection of predictor variables X_1, X_2, ..., X_d via the model for the ith response

Y_i = β_0 + Σ_{j=1}^{d} β_j X_{ij} + ε_i = X_i^T β + ε_i

say, or in vector form, for Y = (Y_1, ..., Y_n)^T,

Y = Xβ + ε

where β = (β_0, β_1, ..., β_d)^T is a vector of real-valued parameters, and ε is a vector random variable with zero mean and variance-covariance matrix Σ. The objective of the analysis is to make inference about β, to understand the influence of the predictors on the response, and to perform prediction for Y. The linear regression model (or General Linear Model) is extremely flexible: the design matrix X can be formed from arbitrary, possibly non-linear, basis functions of the predictor variables. By introducing a covariance structure into Σ, it is possible to allow for dependence amongst the components of Y, which allows for the possibility of modeling repeated-measures, longitudinal or time-series data that might arise from multiple observations of the same experimental units. An extension that is often also useful is to random effects or mixed models that take into account any repeated-measures aspect of the recorded data. If the data on an individual (person, sample, gene, etc.) are Y_i = (Y_{i1}, ..., Y_{id})^T, then

Y_i = Xβ + Z U_i + ε_i   (10)
where Z is a d × p constant design matrix, and U_i is a p × 1 vector of random effects specific to individual i. Typically the random effect vectors are assumed to be drawn from a common population. Similar formulations can be used to construct semi-parametric models that are useful for flexible modeling in regression.
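As a minimal illustration of fitting the fixed-effects model Y = Xβ + ε, the following sketch simulates data and recovers β by least squares; the design, true coefficients and noise level are assumed purely for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 500, 3

# Design matrix with an intercept column plus d simulated predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])

# Simulated responses Y = X beta + eps, with i.i.d. Gaussian noise
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Least-squares estimate of beta (also the posterior mode under a flat
# prior with Gaussian errors)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With n = 500 observations and small noise, `beta_hat` lands very close to `beta_true`; a mixed model as in Eq. (10) would add individual-specific random-effect terms on top of this fixed-effects fit.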
Complexity in Systems Level Biology and Genetics: Statistical Perspectives
Mixture Models
A mixture model presumes that the probability distribution of the variable Y can be written

$$ f_{Y|\theta}(y \mid \theta) = \sum_{k=1}^{K} \omega_k f_k(y \mid \theta_k) \qquad (11) $$

where f_1, f_2, ..., f_K are distinct component densities indexed by parameters \theta_1, ..., \theta_K, and for all k, 0 < \omega_k < 1, with

$$ \sum_{k=1}^{K} \omega_k = 1. $$

The model can be interpreted as one that specifies that, with probability \omega_k, Y is drawn from density f_k, for k = 1, ..., K. Hence the model is suitable for modeling in cluster analysis problems. This model can be extended to an infinite mixture model, which has close links with Bayesian non-parametric modeling. A simple infinite mixture/Bayesian non-parametric model is the mixture of Dirichlet processes (MDP) model [35,36]: for parameter \alpha > 0 and distribution function F_0, an MDP model can be specified using the following hierarchical specification: for a sample of size n, we have

$$ Y_i \mid \theta_i \sim f_{Y|\theta}(y \mid \theta_i), \qquad i = 1, \ldots, n $$
$$ \theta_1, \ldots, \theta_n \sim DP(\alpha, F_0) $$

where DP(\alpha, F_0) denotes a Dirichlet process. The DP(\alpha, F_0) model may be sampled to produce \theta_1, \theta_2, \ldots, \theta_n using the Polya-Urn scheme

$$ \theta_1 \sim F_0 $$
$$ \theta_k \mid \theta_1, \ldots, \theta_{k-1} \sim \frac{\alpha}{\alpha + k - 1} F_0 + \frac{1}{\alpha + k - 1} \sum_{j=1}^{k-1} \delta_{\theta_j} $$

where \delta_x is a point mass at x. For \theta_k, conditional on \theta_1, \ldots, \theta_{k-1}, the Polya-Urn scheme either samples \theta_k from F_0 (with probability \alpha/(\alpha + k - 1)), or samples \theta_k = \theta_j for some j = 1, \ldots, k - 1 (with probability 1/(\alpha + k - 1)). This model therefore induces clustering amongst the \theta values, and hence has a structure similar to the finite mixture model – the distinct values of \theta_1, \ldots, \theta_n are identified as the cluster "centers" that index the component densities in the mixture model in Eq. (11). The degree of clustering is determined by \alpha; high values of \alpha encourage large numbers of clusters. The MDP model is a flexible model for statistical inference, and is used in a wide range of applications such as density estimation, cluster analysis, functional data analysis and survival analysis. The component densities can be univariate or multivariate, and the model itself can be used to represent the variability in observed data or as a prior density. Inference for such models is typically carried out using MCMC or SMC methods [32,33]. For applications in bioinformatics and functional genomics, see [37,38].

State-Space Models
A state-space model is specified through a pair of equations that relate a collection of states, X_t, to observations Y_t, and that represent a system and how that system develops over time. For example, the relationship could be modeled as

$$ Y_t = f(X_t, U_t) $$
$$ X_{t+1} = g(X_t, V_t) $$

where f and g are vector-valued functions, and (U_t, V_t) are random error terms. A linear state-space model takes the form

$$ Y_t = A_t X_t + c_t + U_t $$
$$ X_{t+1} = B_t X_t + d_t + V_t $$

for deterministic matrices A_t and B_t and vectors c_t and d_t. The X_t represent the values of unobserved states, and the second equation represents the evolution of these states through time (see [39]). State-space models can be used as models for scalar, vector and matrix-valued quantities. One application is the evolution of a covariance structure, for example, representing dependencies in a biological network. If the network is dynamically changing through time, a model similar to those above is required, but where X_t is a square, positive-definite matrix. For such a network, therefore, a probabilistic model for positive-definite matrices can be constructed from the Wishart/Inverse Wishart distributions [40]. For example, we may have, for t = 1, 2, ...,

$$ Y_t \sim \text{Normal}(0, X_t) $$
$$ X_{t+1} \sim \text{InverseWishart}(\nu_t, X_t) $$

where the degrees of freedom parameter \nu_t is chosen to induce desirable properties (stationarity, constant expectation etc.) in the sequence of X_t matrices.

Transcriptomics and Functional Genomics
A key objective in the study of biological organization is to understand the mechanisms of the transcription of genomic DNA into mRNA that initiates the production of proteins and hence lies at the center of the functioning of the nuclear engine. In a cell in a particular tissue at a particular time, the nucleus contains the entire mRNA profile (transcriptome) which, if it could be measured, would
provide direct insight into the functioning of the cell. If this profile could be measured in a dynamic fashion, then the patterns of gene regulation for one, several or many genes could be studied. Broadly, if a gene is “active” at any time point, it is producing mRNA transcripts, sometimes at a high rate, sometimes at a lower rate, and understanding the relationships between patterns of up- and downregulation lies at the heart of uncovering pathways, or networks of interacting genes. Transcriptomics is the study of the entirety of recorded transcripts for a given genome in a given condition. Functional genomics, broadly, is the study of gene function via measured expression levels and how it relates to genome structure and protein expression. Microarrays A common biological problem is to detect differential expression levels of a gene in two or more tissue or cell types, as any differences may contribute to the understanding of the cellular organization (pathways, regulatory networks), or may provide a mechanism for discrimination between future unlabeled samples. An important tool for the analysis of these aspects of gene function is the microarray, a medium onto which DNA fragments (or probes) are placed or etched. Test sample mRNA fragments are tagged with a fluorescent marker, and then allowed to bond or hybridize with the matching DNA probes specific to that nucleotide sequence, according to the usual biochemical bonding process. The microarray thus produces a measurement of the mRNA content of the test sample for each of the large number of DNA sequences bound to the microarray as probes. Microarrays typically now contain tens of thousands of probes for simultaneous investigation of gene expression in whole chromosomes, or even whole genomes for simple organisms. The hybridization experiments are carried out under strict protocols, and every effort is made to regularize the production procedures, from the preparation stage through to imaging. 
Typically, replicate experiments are carried out. Microarray experiments have made the study of gene expression routine; instantaneous measurements of mRNA levels for large numbers of different genes can be obtained for different tissue or cell types in a matter of hours. The most important requirements of a statistical analysis of gene expression data are therefore twofold: the analysis should be readily implementable for large data sets (large numbers of genes, and/or large numbers of samples), and it should give representative, robust and reliable results over a wide range of experiments. Since their initial use as experimental platforms, microarrays have become increasingly sophisticated, allowing measurement of different important functional aspects. Arrays containing whole genomes of organisms can be used for investigation of function, copy-number variation, SNP variation, deletion/insertion sites and other forms of DNA sequence variation (see [41] for a recent summary). High-throughput technologies similar in form to printed arrays are now at the center of transcriptome investigation in several different organisms, and are also widely used for genome-wide investigation of common diseases in humans [42,43]. The statistical analysis of such data represents a major computational challenge. Details of first- and second-generation microarrays are given in the list below.

First Generation Microarray Studies
From the mid 1990s, comparative hybridization experiments using microarrays or gene-chips began to be widely used for the investigation of gene expression. The two principal types of array used were cDNA arrays and oligonucleotide arrays:

cDNA microarrays: In cDNA microarray competitive hybridization experiments, the mRNA levels of genes in a target sample are compared to the mRNA levels in a control sample by attaching fluorescent tags (usually red and green respectively for the two samples) and measuring the relative fluorescence in the two channels. Thus, in a test sample (containing equal amounts of target and control material), differential expression relative to the control is expressed either in terms of up-regulation or down-regulation of the genes in the target sample. Any genes that are up-regulated in the target compared to the control, and hence that have larger amounts of the relevant mRNA, will fluoresce as predominantly red, and any that are down-regulated will fluoresce green. Absence of differences in regulation will give equal amounts of red and green, giving a yellow fluor. Relative expression is measured on the log scale

$$ y = \log \frac{x_{\text{TARGET}}}{x_{\text{CONTROL}}} = \log \frac{x_R}{x_G} \qquad (12) $$

where x_R and x_G are the fluorescence levels in the RED and GREEN channels respectively.

Oligonucleotide arrays: The basic concept of oligonucleotide arrays is that the array is produced to interrogate specific target mRNAs or genes by means of a number of oligo probes, usually of length no longer than 25 bases; typically 10-15 probes are used to hybridize to a specific mRNA, with each oligo probe designed to target a specific segment of the mRNA
sequence. Hybridization occurs between oligos and test DNA in the usual way. The novel aspect of the oligonucleotide array is the means by which the absolute level of the target mRNA is determined: each perfect match (PM) probe is paired with a mismatch (MM) probe that is identical to the perfect match probe except for the nucleotide in the center of the probe, for which a mismatch nucleotide is substituted, as indicated in the diagram below.

PM : ATGTATACTATT A TGCCTAGAGTAC
MM : ATGTATACTATT C TGCCTAGAGTAC

The logic is that the target mRNA, which has been fluorescently tagged, will bind perfectly to the PM oligo, and not bind at all to the MM oligo, and hence the absolute amount of the target mRNA present can be obtained as the difference x_PM − x_MM, where x_PM and x_MM are the measurements for the PM and MM oligos respectively.

Second Generation Microarrays
In the current decade, the number of array platforms has increased greatly. The principle of hybridization of transcripts to probes on a printed array is often still the fundamental biological component, but the design of the new arrays is often radically different. Some of the new types of array are described below (see [44] for a summary).

ChIP-chip: ChIP-chip (chromatin immunoprecipitation chip) arrays are tiling arrays, with genomic probes systematically covering whole genomes or chromosomes, that are used to relate protein expression to DNA sequence by mapping the binding sites of transcription factors and other DNA-binding proteins. See [45] for an application and details of statistical issues.

ArrayCGH: Array comparative genome hybridization (ArrayCGH) is another form of tiling array that is used to detect copy number variation (the variation in the numbers of repeated DNA segments) in subgroups of individuals, with the aim of detecting important variations related to common diseases. See [46,47].
SAGE: Serial Analysis of Gene Expression (SAGE) is a platform for monitoring the patterns of expression of many thousands of transcripts in one sample, which relies on the sequencing of short cDNA tags that correspond to a sequence near one end of every transcript in a tissue sample. See [48,49,50]. Single Molecule Arrays: Single Molecule Arrays rely on the binding of single mRNA transcripts to
the spots on the array surface, and thus allow for extremely precise measurement of transcript levels: see [51]. Similar technology is used for precise protein measurement and antibody detection. See [52].

Statistical Analysis of Microarray Data
In a microarray experiment, the experimenter has access to expression/expression profile data, possibly for a number of replicate experiments, for each of a (usually large) number of genes. Conventional statistical analysis techniques and principles (hypothesis testing, significance testing, estimation, simulation methods/Monte Carlo procedures) are used in the analysis of microarray data. The principal biological objectives of a typical microarray analysis are:

Detection of differential expression: up- or down-regulation of genes in particular experimental contexts, or in particular tissue samples, or cell lines at a given time instant.

Understanding of temporal aspects of gene regulation: the representation and modeling of patterns of changes in gene regulation over time.

Discovery of gene clusters: the partitioning of large sets of genes into smaller sets that have common patterns of regulation.

Inference for gene networks/biological pathways: the analysis of co-regulation of genes, and inference about the biological processes involving many genes concurrently.

There are typically several key issues and models that arise in the analysis of microarray data; such methods are described in detail in [53,54,55,56]. For a Bayesian modeling perspective, see [57].

Array normalization: Arrays are often imaged under slightly different experimental conditions, and therefore the data are often very different even from replicate to replicate. This is a systematic experimental effect, and therefore needs to be adjusted for in the analysis of differential expression; otherwise a misdiagnosis of differential expression may be made purely because of this systematic experimental effect.
Measurement error: The reported (relative) gene expression levels are in fact only proxies for the true levels of gene expression in the sample. This requires a further level of variability to be incorporated into the model.

Random effects modeling: It may be necessary to use mixed regression models, where gene-specific random-effects terms are incorporated into the model.
Multivariate analysis: The covariability of response measurements, in time course experiments, or between PM and MM measurements in an oligonucleotide array experiment, is best handled using multivariate modeling.

Testing: One- and two-sample hypothesis testing techniques, based on parametric and non-parametric testing procedures, can be used in the assessment of the presence of differential expression. For detecting more complex (patterns of) differential expression, in more general structured models, the tools of analysis of variance (ANOVA) can be used to identify the chief sources of variability.

Multiple testing/False discovery: In microarray analysis, a classical statistical analysis using significance testing needs to take into account the fact that a very large number of tests are carried out. Hence significance levels of tests must be chosen to maintain a required family-wise error rate, and to control the false discovery rate.

Classification: The genetic information contained in a gene expression profile derived from microarray experiments for, say, an individual tissue or tumor type may be sufficient to enable the construction of a classification rule that will enable subsequent classification of new tissue or tumor samples.

Cluster analysis: Discovery of subsets of sets of genes that have common patterns of regulation can be achieved using the statistical techniques of cluster analysis (see Sect. "Clustering").

Computer-intensive inference: For many testing and estimation procedures needed for microarray data analysis, simulation-based methods (bootstrap estimation, Monte Carlo and permutation tests, MCMC) are often necessary, especially when complex Bayesian models are used.

Data compression/feature extraction: The methods of principal components analysis and extended linear modeling via basis functions can be used to extract the most pertinent features of the large microarray data sets.
Experimental design: Statistical experimental design can assist in determining the number of replicates, the number of samples, the choice of time points at which the array data are collected and many other aspects of microarray experiments. In addition, power and sample size assessments can inform the experimenter as to the statistical worth of the microarray experiments that have been carried out.
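The multiple testing issue above can be illustrated with a minimal sketch of the Benjamini-Hochberg false discovery rate procedure (the p-values are hypothetical, and this is one standard FDR-controlling method rather than the specific procedure of any reference cited here):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean array marking the hypotheses rejected
    at false discovery rate q (Benjamini-Hochberg step-up rule)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest rank k with p_(k) <= (k/m) * q ...
    below = ranked <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        # ... and reject the k smallest p-values.
        reject[order[: k + 1]] = True
    return reject

# Hypothetical p-values from per-gene tests of differential expression.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.74]
reject = benjamini_hochberg(pvals, q=0.05)
print(reject)
```

Here only the two smallest p-values survive the step-up comparison, even though five of them fall below the unadjusted 0.05 level.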
Typically, data derived from both types of microarrays are highly corrupted by noise and artefacts. The statistical analysis of such data is therefore quite a challenging process. In many cases, the replicate experiments are very variable. The other main difficulty that arises in the statistical analysis of microarray data is the dimensionality: a vast number of gene expression measurements are available, usually on only a relatively small number of individual observations or samples, and thus it is hard to establish any general distributional models for the expression of a single gene.

Clustering
Cluster analysis is an unsupervised statistical procedure that aims to establish the presence of identifiable subgroups (or clusters) in the data, so that objects belonging to the same cluster resemble each other more closely than objects in different clusters; see [58,59] for comprehensive summaries. In two or three dimensions, clusters can be visualized by plotting the raw data. With more than three dimensions, or in the case of dissimilarity data (see below), analytical assistance is needed. Broadly, clustering algorithms fall into two categories:

Partitioning Algorithms: A partitioning algorithm divides the data set into K clusters, and the algorithm is typically run for a range of values of K. Partitioning methods are based on specifying an initial number of groups, and iteratively reallocating observations between groups until some equilibrium is attained. The most famous algorithm is the K-means algorithm, in which the observations are iteratively classified as belonging to one of K groups, with group membership determined by calculating the centroid for each group (the multidimensional version of the mean) and assigning each observation to the group with the closest centroid. The K-means algorithm alternates between calculating the centroids based on the current group memberships, and reassigning observations to groups based on the new centroids.
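The K-means alternation just described can be sketched in a few lines (toy two-dimensional data; a fixed number of iterations is used instead of a formal convergence test, and the starting centroids are supplied by hand for this hypothetical example):

```python
import numpy as np

def kmeans(x, centroids, iters=20):
    """Minimal K-means: alternate nearest-centroid assignment and
    centroid (mean) updates for a fixed number of iterations."""
    centroids = np.asarray(centroids, dtype=float).copy()
    for _ in range(iters):
        # Distances from every observation to every centroid.
        d = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its current members.
        for j in range(len(centroids)):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(axis=0)
    return labels, centroids

# Toy data: two well-separated groups in the plane; one starting
# centroid is taken from each group for the sketch.
rng = np.random.default_rng(1)
a = rng.normal(0.0, 0.3, size=(20, 2))
b = rng.normal(5.0, 0.3, size=(20, 2))
x = np.vstack([a, b])
labels, centroids = kmeans(x, centroids=[a[0], b[0]])
```

In practice the initialization would be randomized and repeated, since K-means only finds a local optimum of the within-group sum of squares.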
A more robust method uses medoids rather than centroids (that is, medians rather than means in each dimension); more generally, any distance-based allocation algorithm could be used.

Hierarchical Algorithms: A hierarchical algorithm yields an entire hierarchy of clusterings for the given data set. Agglomerative methods start with each object in the data set in its own cluster, and then successively merge clusters until only one large cluster remains. Divisive methods start by considering the whole data set as
one cluster, and then split up clusters until each object is separated. Hierarchical algorithms are discussed in detail in Sect. "Hierarchical Clustering".

Data sets for clustering of N observations can take the form either of an N × p data matrix, where rows contain the different observations and columns contain the different variables, or of an N × N dissimilarity matrix, whose (i, j)th element is d_ij, the distance or dissimilarity between observations i and j, which obeys the usual properties of a metric. Typical distance measures between two data points i and j with measurement vectors x_i and x_j are the L_1 and L_2 (Euclidean) distances, the grid-based Manhattan distance for discrete variables, or the Hamming distance for binary variables. For ordinal (ordered categorical) or nominal (label) data, other dissimilarities can be defined.

Hierarchical Clustering
Agglomerative hierarchical clustering initially places each of the N items in its own cluster. At the first level, two objects are clustered together, the pair being selected as the one that is optimal with respect to an objective function, leaving N − 1 clusters, one with two members and the remaining N − 2 each with one. At the next level, the optimal configuration of N − 2 clusters is found by joining two of the existing clusters. This process continues until a single cluster remains containing all N items. At each level of the hierarchy, the merger chosen is the one that leads to the smallest increase in the objective function. Classical versions of the hierarchical agglomeration algorithm are typically used with average, single or complete linkage methods, depending on the nature of the merging mechanism. Such criteria are inherently heuristic, and more formal model-based criteria can also be used. Model-based clustering is based on the assumption that the data are generated by a mixture of underlying probability distributions.
Specifically, it is assumed that the population of interest consists of K different sub-populations, and that the density of an observation from the kth sub-population is f_k(y | \theta_k) for some unknown vector of parameters \theta_k. Model-based clustering is described in more detail in Sect. "Model-Based Hierarchical Clustering". The principal display plot for a clustering analysis is the dendrogram, which plots all of the individual data objects linked by means of a binary "tree". The dendrogram represents the structure inferred from a hierarchical clustering procedure, which can be used to partition the data into subgroups as required if it is cut at a certain "height" up the tree structure. As with many of the aspects of the clustering procedures described above, it is more of a heuristic graphical representation rather than a formal
inferential summary. However, the dendrogram is readily interpretable, and favored by biologists.

Model-Based Hierarchical Clustering
Another approach to hierarchical clustering is model-based clustering (see for example [60,61]), which is based on the assumption that the data are generated by a mixture of K underlying probability distributions as in Eq. (11). Given the data matrix y = (y_1, ..., y_N)^T, let \gamma = (\gamma_1, ..., \gamma_N) denote the cluster labels, where \gamma_i = k if the ith data point comes from the kth sub-population. In the classification procedure, the maximum likelihood procedure is used to choose the parameters in the model. Commonly, the assumption is made that the data in the different sub-populations follow multivariate normal distributions, with mean \mu_k and covariance matrix \Sigma_k for cluster k, so that

$$ f_{Y|\theta}(y \mid \theta) = \sum_{k=1}^{K} \omega_k f_k(y \mid \mu_k, \Sigma_k) = \sum_{k=1}^{K} \omega_k \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left\{ -\frac{1}{2} (y - \mu_k)^T \Sigma_k^{-1} (y - \mu_k) \right\} $$
where Pr[\gamma_i = k] = \omega_k. If \Sigma_k = \sigma^2 I_p is a p × p matrix, then maximizing the likelihood is the same as minimizing the sum of within-group sums of squares, and corresponds to the case of hyper-spherical clusters with the same variance. Other forms of \Sigma_k yield clustering methods that are appropriate in different situations. The key to specifying this is the singular value or eigendecomposition of \Sigma_k, given by eigenvalues \lambda_1, ..., \lambda_p and eigenvectors v_1, ..., v_p, as in Principal Components Analysis [62]. The eigenvectors of \Sigma_k specify the orientation of the kth cluster, the largest eigenvalue \lambda_1 specifies its variance or size, and the ratios of the other eigenvalues to the largest one specify its shape. Further, if \Sigma_k = \sigma_k^2 I_p, the criterion corresponds to hyper-spherical clusters of different sizes, and by fixing the eigenvalue ratios \alpha_j = \lambda_j / \lambda_1 for j = 2, 3, ..., p across clusters, other cluster shapes are encouraged.

Model-Based Analysis of Gene Expression Profiles
The clustering problem for vector-valued observations can be formulated using models used to represent the gene expression patterns via the extended linear model, that is, a linear model in non-linear basis functions; see, for example, [63,64] for details.
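The multivariate normal mixture of the previous subsection can be sketched numerically (hypothetical two-component, two-dimensional parameters): the mixture density and the posterior cluster probabilities Pr[\gamma = k | y] follow directly from Bayes' rule.

```python
import numpy as np

def mvn_density(y, mu, sigma):
    """Multivariate normal density at y."""
    d = len(mu)
    diff = y - mu
    quad = diff @ np.linalg.solve(sigma, diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

# Hypothetical two-component mixture in d = 2 dimensions.
weights = np.array([0.3, 0.7])
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
sigmas = [np.eye(2), 0.5 * np.eye(2)]

y = np.array([3.8, 4.1])
components = np.array([w * mvn_density(y, m, s)
                       for w, m, s in zip(weights, mus, sigmas)])
mixture_density = components.sum()
# Posterior probability that y came from each cluster (Bayes' rule).
posterior = components / mixture_density
print(mixture_density, posterior)
```

For this observation, which lies close to the second component mean, essentially all posterior mass falls on cluster 2; such posterior probabilities are the quantities a model-based clustering allocates on.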
Generically, the aim of the statistical model is to capture the behavior of the gene expression ratio y_t as a function of time t. The basis of the modeling strategy would be to use models that capture the characteristic behavior of expression profiles likely to be observed under different forms of regulation. A regression framework and model can be adopted. Suppose that Y_t is modeled using the linear model

$$ Y_t = X_t \beta + \varepsilon_t $$

where X_t is (in general) a 1 × p vector of specified functions of t, and \beta is a p × 1 parameter vector. In vector representation, the gene expression profile over times t_1, ..., t_T can be written Y = (Y_1, ..., Y_T)^T, with

$$ Y = X\beta + \varepsilon. \qquad (13) $$

The precise form of the design matrix X will be specified to model the time-variation in signal. Typically the random error terms {\varepsilon_t} are taken to be independent and identically distributed Normal random variables with variance \sigma^2, implying that the conditional distribution of the responses Y is multivariate normal,

$$ Y \mid X, \beta, \sigma^2 \sim N(X\beta, \sigma^2 I_T) \qquad (14) $$

where now X is T × p and I_T is the T × T identity matrix. In order to characterize the underlying gene expression profile, the parameter vector \beta must be estimated. For this model, the maximum likelihood/ordinary least squares estimates of \beta and \sigma^2 are

$$ \hat\beta_{ML} = (X^T X)^{-1} X^T y, \qquad \hat\sigma^2 = \frac{(y - \hat y)^T (y - \hat y)}{T - p} $$

with fitted values \hat y = X \hat\beta_{ML} = X (X^T X)^{-1} X^T y.

Bayesian Analysis in Model-Based Clustering
In a Bayesian analysis of the model in (13), a joint prior distribution \pi(\beta, \sigma^2) is specified for (\beta, \sigma^2), and a posterior distribution conditional on the observed data is computed for the parameters. The calculation proceeds using Eq. (5) (essentially with G fixed):

$$ \pi(\beta, \sigma^2 \mid y, X) = \frac{L(y; X, \beta, \sigma^2)\, \pi(\beta, \sigma^2)}{\int L(y; X, \beta, \sigma^2)\, \pi(\beta, \sigma^2)\, d\beta\, d\sigma^2} $$

where L(y; X, \beta, \sigma^2) is the likelihood function. In the linear model context, a conjugate prior specification is used, where

$$ \beta \mid \sigma^2 \sim \text{Normal}(v, \sigma^2 V), \qquad \sigma^2 \sim \text{IGamma}\!\left(\frac{\alpha}{2}, \frac{\gamma}{2}\right) \qquad (15) $$

(v is p × 1, V is p × p positive-definite and symmetric, all other parameters are scalars) and IGamma denotes the inverse Gamma distribution. Using this prior, standard Bayesian calculations show that, conditional on the data,

$$ \beta \mid y, \sigma^2 \sim \text{Normal}(v^*, \sigma^2 V^*), \qquad \sigma^2 \mid y \sim \text{IGamma}\!\left(\frac{T + \alpha}{2}, \frac{c^* + \gamma}{2}\right) \qquad (16) $$

where

$$ V^* = (X^T X + V^{-1})^{-1} $$
$$ v^* = (X^T X + V^{-1})^{-1} (X^T y + V^{-1} v) \qquad (17) $$
$$ c^* = y^T y + v^T V^{-1} v - (X^T y + V^{-1} v)^T (X^T X + V^{-1})^{-1} (X^T y + V^{-1} v). $$

In regression modeling, it is usual to consider a centered parametrization for \beta so that v = 0, giving

$$ v^* = (X^T X + V^{-1})^{-1} X^T y $$
$$ c^* = y^T y - y^T X (X^T X + V^{-1})^{-1} X^T y = y^T \left( I_T - X (X^T X + V^{-1})^{-1} X^T \right) y. \qquad (18) $$

A critical quantity in a Bayesian clustering procedure is the marginal likelihood, as in Eq. (8), for the data in light of the model:

$$ f_Y(y) = \int f_{Y \mid \beta, \sigma^2}(y \mid \beta, \sigma^2)\, \pi(\beta \mid \sigma^2)\, \pi(\sigma^2)\, d\beta\, d\sigma^2. $$

Combining the terms above gives

$$ f_Y(y) = \frac{1}{\pi^{T/2}} \frac{\Gamma\!\left(\frac{T+\alpha}{2}\right)}{\Gamma\!\left(\frac{\alpha}{2}\right)} \gamma^{\alpha/2} \frac{|V^*|^{1/2}}{|V|^{1/2}} \frac{1}{\{c^* + \gamma\}^{(T+\alpha)/2}}. \qquad (19) $$
This expression is the marginal likelihood for a single gene expression profile. For a collection of profiles belonging to a single cluster, y_1, ..., y_N, Eq. (19) can again be evaluated and used as the basis of a dissimilarity measure as an input into a hierarchical clustering procedure; the marginal likelihood in Eq. (19) can easily be re-expressed for clustered data. The hierarchical clustering method outlined in [64] proceeds by agglomeration of clusters from N to 1, with the two clusters whose merger leads to the greatest increase in marginal likelihood score being merged at each stage of
the hierarchy. This method works for profiles of arbitrary length, potentially with different observation time points; however, it is computationally most efficient when the time points are the same for each profile. The design matrix X is typically expressed via non-linear basis functions, for example truncated polynomial splines, Fourier bases or wavelets. For T large, it is usually necessary to use a projection through a lower number of bases; for example, for a single profile, X becomes T × p and \beta becomes p × 1, for T > p. Using different designs, many flexible models for the expression profiles can be fitted. In some cases, the linear mixed effect formulation in Eq. (10) can be used to construct the spline-based models; in such models, some of the \beta parameters are themselves assumed to be random effects (see [65]). For example, in harmonic regression, regression in the Fourier bases is carried out. Consider the extended linear model

$$ Y_t = \sum_{j=0}^{p} \beta_j g_j(t) + \varepsilon_t $$

where g_0(t) = 1 and

$$ g_j(t) = \begin{cases} \cos(\omega_j t) & j \text{ odd} \\ \sin(\omega_j t) & j \text{ even} \end{cases} $$

where p is an even number, p = 2k say, and \omega_j, j = 1, 2, ..., k are constants with \omega_1 < \omega_2 < ... < \omega_k. For fixed t, \cos(\omega_j t) and \sin(\omega_j t) are also fixed, and this model is a linear model in the parameters \beta = (\beta_0, \beta_1, ..., \beta_p)^T. This model can be readily fitted to time-course expression profiles. Figure 1 shows a fit of the model with k = 2 to a cluster of profiles extracted, using the method described in [64], from the malaria protozoa Plasmodium falciparum data set described in [66]. One major advantage of the Bayesian inferential approach is that any biological prior knowledge that is available can be incorporated in a coherent fashion. For example, the data in Figure 1 illustrate periodic behavior related to the cyclical nature of cellular organization, and thus the choice of the Fourier bases is a natural one.

Complexity in Systems Level Biology and Genetics: Statistical Perspectives, Figure 1
Cluster of gene expression profiles obtained using Bayesian hierarchical model-based clustering: data from the intraerythrocytic developmental cycle of the protozoan Plasmodium falciparum. Clustering achieved using the harmonic regression model with k = 2. The solid red line is the posterior mean for this cluster, the dotted red lines are pointwise 95% credible intervals for the cluster mean profile, and the dotted blue lines are pointwise 95% credible intervals for the observations.

Choosing the Number of Clusters: Bayesian Information Criterion
A hierarchical clustering procedure gives the sequence by which the clusters are merged (in agglomerative clustering) or split (in divisive clustering) according to the model or distance measure used, but does not give
an indication of the number of clusters that are present in the data (under the model specification). This is obviously an important consideration. One advantage of the model-based approach to clustering is that it allows the use of statistical model assessment procedures to assist in the choice of the number of clusters. A common method is to use approximate Bayes factors to compare models of different orders (i.e. models with different numbers of clusters), giving a systematic means of selecting the parametrization of the model, the clustering method, and also the number of clusters (see [67]). The Bayes factor is the posterior odds for one model against the other assuming neither is favored a priori. A reliable approximation to twice the log Bayes factor is given by the Bayesian Information Criterion (BIC), which, for model M fitted to n data points, is given by

$$ \text{BIC}_M = -2 \log L_M(\hat\theta) + d_M \log n $$

where L_M(\hat\theta) is the maximized likelihood of the data for the model M (so that the BIC approximates the Bayesian marginal likelihood, as in Eq. (19)), and d_M is the number of parameters estimated in the model. The number of clusters is not considered a parameter for the purposes of computing the BIC. The smaller (more negative) the value of the BIC, the stronger the evidence for the model.
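A numerical sketch of the conjugate computations in Eqs. (17)-(19), using a k = 1 harmonic design for a hypothetical periodic profile (all data and hyperparameters below are invented for illustration; the centered parametrization v = 0 of Eq. (18) is used):

```python
import numpy as np
from math import lgamma

def log_marginal_likelihood(y, X, V, alpha, gamma):
    """Log of the marginal likelihood in Eq. (19) for the conjugate prior
    beta | sigma^2 ~ Normal(0, sigma^2 V), sigma^2 ~ IGamma(alpha/2, gamma/2)."""
    T = len(y)
    V_star = np.linalg.inv(X.T @ X + np.linalg.inv(V))   # Eq. (17)
    c_star = y @ y - y @ X @ V_star @ X.T @ y            # Eq. (18), v = 0
    return (-0.5 * T * np.log(np.pi)
            + lgamma((T + alpha) / 2) - lgamma(alpha / 2)
            + 0.5 * alpha * np.log(gamma)
            + 0.5 * (np.linalg.slogdet(V_star)[1] - np.linalg.slogdet(V)[1])
            - 0.5 * (T + alpha) * np.log(c_star + gamma))

# Hypothetical expression profile of length T = 10 with one harmonic.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 10)
omega = 2.0 * np.pi
X = np.column_stack([np.ones_like(t), np.cos(omega * t), np.sin(omega * t)])
y = 1.0 + 2.0 * np.cos(omega * t) + rng.normal(scale=0.2, size=10)

lm_harmonic = log_marginal_likelihood(y, X, V=10.0 * np.eye(3), alpha=2.0, gamma=2.0)
lm_constant = log_marginal_likelihood(y, X[:, :1], V=10.0 * np.eye(1), alpha=2.0, gamma=2.0)
print(lm_harmonic, lm_constant)
```

Comparing such log marginal likelihoods across candidate designs (or candidate cluster merges) is the kind of score comparison that drives the marginal-likelihood clustering and the BIC approximation above; here the harmonic model should be strongly preferred for the periodic profile.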
Classification via Model–Based Clustering Any clustering procedure can be used as the first step in the construction of classification rules. Suppose that it, on the basis of an appropriate decision procedure, it is known that there are C clusters, and that a set of existing expression profiles y1 ; : : : ; y N have been allocated in turn to the clusters. Let z1 ; : : : ; z N be the cluster allocation labels for the profiles. Now, suppose further that the C clusters can be decomposed further into two subsets of sizes C0 and C1 , where the subsets represent perhaps clusters having some common, known biological function or genomic origin. For example, in a cDNA microarray, it might be known that the clones are distinguishable in terms of the organism from which they were derived. A new objective could be to allocate a novel gene and expression profile to one of the subsets, and one of the clusters within that subset. Let y i jk , for i D 0; 1, j D 1; 2; : : : ; C i , k D 1; 2; : : : ; N i j denote the kth profile in cluster j in subset i. Let y denote a new profile to be classified, and be the binary classification-to-subset, and z the classification-to-cluster variable for y . Then, by Bayes Rule, for i D 1; 2, P D ijy ; y; z / p y j D i; y; z P D ijy; z :
(20)
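Equation (20) reduces classification to two quantities available from the clustering output: a prior from the subset occupancies, and a likelihood formed as a mixture of the fitted cluster densities within each subset. A minimal one-dimensional sketch, writing δ for the binary classification-to-subset variable (univariate Gaussian cluster densities are an assumed simplification, and the means, variances, and subset memberships are hypothetical):

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify_to_subset(y_new, clusters):
    """clusters: list of dicts with keys 'subset' (0 or 1), 'mean', 'var', and 'n'
    (n = number of training profiles allocated to that cluster).
    Returns the posterior [P(delta=0 | y*), P(delta=1 | y*)] via Eq. (20)."""
    n_total = sum(c["n"] for c in clusters)
    post = [0.0, 0.0]
    for i in (0, 1):
        subset = [c for c in clusters if c["subset"] == i]
        n_subset = sum(c["n"] for c in subset)
        prior = n_subset / n_total                      # P(delta = i | y, z)
        lik = sum((c["n"] / n_subset) * normal_pdf(y_new, c["mean"], c["var"])
                  for c in subset)                      # p(y* | delta = i, y, z)
        post[i] = prior * lik
    norm = sum(post)
    return [p / norm for p in post]

# Hypothetical clustering output: three clusters, two in subset 0, one in subset 1.
clusters = [
    {"subset": 0, "mean": -2.0, "var": 1.0, "n": 40},
    {"subset": 0, "mean": 0.0, "var": 1.0, "n": 30},
    {"subset": 1, "mean": 3.0, "var": 1.0, "n": 30},
]
posterior = classify_to_subset(2.8, clusters)  # a profile near subset 1's cluster
```

A new profile near the subset-1 cluster mean receives almost all of its posterior mass on δ = 1, as Eq. (20) dictates.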
The two terms in Eq. (20) can be determined on the basis of the clustering output.

Metabolomics

The term metabolome refers to the total metabolite content of an organic sample (tissue, blood, urine, etc.) obtained from a living organism; it represents the products of a higher level of biological interaction than that which occurs within the cell. Metabolomics and metabonomics are the fields of biomedical investigation that combine the application of nuclear magnetic resonance (NMR) spectroscopy with multivariate statistical analysis in studies of the composition of such samples. Metabonomics is often used in reference to the static chemical content of the sample, whereas metabolomics refers to the dynamic evolution of the metabolome. Both involve the measurement of the metabolic response to interventions (see for example [68]), and applications of metabolomics include several in public health and medicine [69,70].

Statistical Methods for Spectral Data

The two principal spectroscopic measurement platforms, NMR and mass spectrometry (MS), yield alternative representations of the metabolic spectrum. They produce spectra (or profiles) that consist of several thousands of individual measurements at different resonances or masses.

Complexity in Systems Level Biology and Genetics: Statistical Perspectives, Figure 2 A normalized rat urine spectrum. The abscissa is parts per million, the ordinate is intensity after standardization

There are several phases of processing of such data: pre-processing using smoothing, alignment and de-noising, followed by peak separation, registration and signal extraction. For an extensive discussion, see [62]. An NMR spectrum consists of measurements of the intensity or frequency of different biochemical compounds (metabolites), represented by a set of resonances dependent upon the chemical structure, and can be regarded as a linear combination of peaks (nominally of various widths) that correspond to singletons or multiple peaks according to the neighboring chemical environment. A typical spectrum extracted from rat urine is depicted in Fig. 2; see [71]. Two dominant sharp peaks are visible. Features of the spectra that require specific statistical modeling include multiple peaks for a single compound, variation in peak shape, and chemical shifts induced by variation in experimental pH. Signals from different metabolites can be highly overlapped and subject to peak position variation, due primarily to pH variations in the samples, and there are many small-scale features (see Fig. 3). Statistical methods of pre-processing NMR spectra which address the problems outlined above, using, for example, dynamic time warping to achieve alignment of resonance peaks across replicate spectra as a form of spectral registration, form part of the necessary holistic Bayesian framework.
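Dynamic time warping, mentioned above as a tool for aligning resonance peaks across replicate spectra, can be sketched with the standard dynamic-programming recursion (pure Python; the toy "spectra" below are hypothetical stand-ins, not real NMR data):

```python
def dtw_distance(a, b):
    """Classic DTW: D[i][j] = |a[i-1] - b[j-1]| + min(insertion, deletion, match)."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

# A peak at position 3 versus the same peak shifted to position 5:
ref     = [0, 0, 0, 10, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 0, 10, 0, 0]
```

Because the warping path is allowed to stretch the zero regions, `dtw_distance(ref, shifted)` vanishes even though a point-by-point comparison sees two large mismatches; this tolerance to peak-position shifts is exactly what makes DTW attractive for spectra whose peaks move with pH.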
Complexity in Systems Level Biology and Genetics: Statistical Perspectives, Figure 3 Magnified portion of the spectrum showing small scale features
Classical statistical methods for metabolic spectra include the following:

Principal Components Analysis (PCA) and Regression: a linear data-projection method for dimension reduction, feature extraction, and classification of samples in an unsupervised fashion, that is, without reference to labeled cases.

Partial Least Squares (PLS): a projection method related to PCA, but implemented in a supervised setting for sample discrimination.

Clustering: clusters of spectra, or of peaks within spectra, can be discovered using techniques similar to those described in Sect. “Clustering”.

Neural Networks: flexible non-linear regression models constructed from simple mathematical functions, learned from the observation of cases, that are well suited to classification. The formulation of a neural network involves three levels of interlinked variables: outputs, inputs, and hidden variables, the last interpreted as a collection of unobserved random variables that form the hidden link between inputs and outputs.

Bayesian Approaches

The Bayesian framework is a natural one for incorporating genuine biological prior knowledge into the signal reconstruction, and typically useful prior information (about fluid composition, peak location, peak multiplicity) is available. In addition, a hierarchical Bayesian model structure naturally allows construction of plausible models for the spectra across experiments or individuals.
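Of the classical methods listed above, PCA is the easiest to make concrete without any linear-algebra library: center the data, then obtain the leading principal component by power iteration on the sample covariance matrix. A minimal sketch (pure Python; the two-dimensional data are hypothetical stand-ins for spectral features):

```python
import math
import random

def first_principal_component(data, iters=200):
    """Leading eigenvector of the sample covariance matrix, via power iteration."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    X = [[row[j] - means[j] for j in range(p)] for row in data]
    # sample covariance matrix C = X^T X / (n - 1)
    C = [[sum(X[k][i] * X[k][j] for k in range(n)) / (n - 1) for j in range(p)]
         for i in range(p)]
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical two-dimensional "spectral features", strongly correlated:
random.seed(1)
data = [(t + random.gauss(0, 0.1), t + random.gauss(0, 0.1))
        for t in [random.gauss(0, 1) for _ in range(500)]]
pc1 = first_principal_component(data)
```

With the two coordinates sharing a common signal, the recovered component lies close (up to sign) to (1/√2, 1/√2), the direction of maximal variance; projecting onto it performs the dimension reduction described above.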
Complexity in Systems Level Biology and Genetics: Statistical Perspectives, Figure 4 Wavelet reconstruction of a region of the spectrum; results are shown under the “Least Asymmetric” wavelet with four vanishing moments, using hard thresholding (HT)
Flexible Bayesian Models: The NMR spectrum can be represented as a noisy signal derived from some underlying and biologically important mechanism. Basis-function approaches (specifically, wavelets) have been widely used to represent non-stationary, time-varying signals [65,71,72,73]. The sparse representation of the NMR spectrum in terms of wavelet coefficients makes them an excellent tool for data compression, yet these coefficients can still easily be transformed back to the spectral domain to give a natural interpretation in terms of the underlying metabolites. Figure 4 depicts the reconstruction of the rat urine spectrum in the region between 2.5 and 2.8 ppm using wavelet methods; see [71].

Bayesian Time Series Models for Complex Non-stationary Signals: See for example [74]. The duality between semi-parametric modeling of functions and latent time series models allows the underlying NMR spectrum to be viewed not as a set of pointwise evaluations of a function, but rather as a (time-ordered) series of correlated observations with some identifiable latent structure. Time series models, computed using dynamic calculation (filtering), provide a method for representing the NMR spectra parsimoniously.

Bayesian Mixture Models: A reasonable generative model for the spectra is one that constructs them from a large number of symmetric peaks of varying size, corresponding to the contributions of different biochemical compounds. This can be approximated using a finite mixture model in which the number, magnitudes, and locations of the spectral contributions are unknown. Much recent research has focused on computational strategies for Bayesian mixtures; in particular, Markov chain Monte Carlo (MCMC) and Sequential Monte Carlo (SMC) methods have proved vital. The reconstruction of NMR spectra is a considerably more challenging problem than those for which mixture modeling is conventionally used, as many more individual components are required. Flexible semi-parametric mixture models have been utilized in [75,76], whilst fully non-parametric mixture models similar to those described in Sect. “Mixture Models” can also be used [73]. A major advantage of using the fully Bayesian framework is that, once again, all relevant information (the spectral data itself, knowledge of the measurement processes for different experimental platforms, the mechanisms via which multiple peaks and shifts are introduced) can be integrated in a coherent fashion. In addition, prior knowledge about the chemical composition of the samples can be incorporated via a prior distribution constructed by inspection of the profiles of training samples. At a higher level of synthesis, the Bayesian paradigm offers a method for integrating metabolomic data with other functional or structural data, such as gene expression or protein expression data. Finally, the metabolic content of tissue changes temporally, so dynamic modeling of the spectra could also be attempted.

Future Directions

Biological data relating to the structure and function of genes, proteins and other biological substances are now available from a wide variety of platforms. Researchers are beginning to develop methods for the coherent combination of data from different experimental processes, to obtain a complete picture of biological cause and effect.
For example, the effective combination of gene expression and metabonomic data will be of tremendous utility. A principal challenge is therefore the fusion of expression data derived from different experimental platforms, and the linking of such data with available sequence and ontological information. Such fusion will be critical to the future of statistical analysis of large-scale systems biology and bioinformatics data sets. In terms of the public health impact of systems biology and statistical genomics, perhaps the most prominent application is the study of common diseases through high-throughput genotyping of single nucleotide polymorphisms (SNPs). In genome-wide association studies, SNP locations that
correlate with disease status or quantitative trait value are sought. In such studies, the key statistical step involves the selection of informative predictors (SNPs or genomic loci) from a large collection of candidates. Many such genome-wide studies have been completed or are ongoing (see [42,43,77,78]). Such studies represent huge challenges for statisticians and mathematical modelers, not only because the data contain many subtle structures, but also because the amount of information is much greater than that available in typical statistical analyses. Another major challenge to the quantitative analysis of biological data comes in the form of image analysis and extraction. Many high-throughput technologies rely on the extraction of information from images, either in static form or dynamically from a series of images. For example, it is now possible to track the expression level of mRNA transcripts in real time [79,80,81], and to observe mRNA transcripts moving from transcription sites to translation sites (see for example [82]). Imaging techniques can also offer insights into aspects of the dynamic organization of nuclear function, by studying the positioning of nuclear compartments and how those compartments reposition themselves relative to each other through time. The challenges for the statistician are to develop real-time methods for tracking and quantifying the nature and content of such images; tools from spatial modeling and time series analysis will be required. Finally, flow cytometry can measure characteristics of millions of cells simultaneously, and is a technology that holds much promise for insights into biological organization and public health. However, the associated quantitative measurement and analysis methods are still in the early stages of development (see [83,84]).

Bibliography

1. Kitano H (ed) (2001) Foundations of Systems Biology. MIT Press, Cambridge 2. Kitano H (2002) Computational systems biology.
Nature 420(6912):206–210 3. Alon U (2006) An Introduction to Systems Biology. Chapman and Hall, Boca Raton 4. Edwards AWF (2000) Foundations of mathematical genetics, 2nd edn. Cambridge University Press, Cambridge 5. Yule GU (1924) A mathematical theory of evolution, based on the conclusions of Dr. J.C. Willis. Philos Trans R Soc Lond Ser B 213:21–87 6. Fisher RA (1922) On the dominance ratio. Proc R Soc Edinburgh 42:321–341 7. Fisher RA (1930) The genetical theory of natural selection. Clarendon Press, Oxford 8. Wright S (1931) Evolution in Mendelian populations. Genetics 16:97–159
9. Murray JD (2002) Mathematical Biology: I. An Introduction. Springer, New York 10. Murray JD (2003) Mathematical Biology: II. Spatial Models and Biomedical Applications. Springer, New York 11. Lewin B (2007) Genes, 9th edn. Jones & Bartlett Publishers, Boston 12. Spector DL (2001) Nuclear domains. J Cell Sci 114(16):2891–2893 13. Bernardo JM, Smith AFM (1994) Bayesian Theory. Wiley, New York 14. Haefner JW (ed) (2005) Modeling Biological Systems: Principles and Applications, 2nd edn. Springer, New York 15. Ramsay JO, Hooker G, Campbell D, Cao J (2007) Parameter estimation for differential equations: a generalized smoothing approach. J R Stat Soc Ser B (Methodology) 69(5):741–796 16. Donnet S, Samson A (2007) Estimation of parameters in incomplete data models defined by dynamical systems. J Stat Plan Inference 137(9):2815–2831 17. Rogers S, Khanin R, Girolami M (2007) Bayesian model-based inference of transcription factor activity. BMC Bioinformatics 8(Suppl 2) doi:10.1186/1471-2105-8-S2-S2 18. Wilkinson DJ (2006) Stochastic Modelling for Systems Biology. Chapman & Hall (CRC), Boca Raton 19. Heron EA, Finkenstädt B, Rand DA (2007) Bayesian inference for dynamic transcriptional regulation: the hes1 system as a case study. Bioinformatics 23(19):2596–2603 20. Airoldi EM (2007) Getting started in probabilistic graphical models. PLoS Comput Biol 3(12):e252 21. Husmeier D, Dybowski R, Roberts S (eds) (2005) Probabilistic Modelling in Bioinformatics and Medical Informatics. Springer, New York 22. Friedman N (2004) Inferring cellular networks using probabilistic graphical models. Science 303:799–805 23. Opgen-Rhein R, Strimmer K (2007) From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data. BMC Syst Biol 1:37 24.
Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS (2000) Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc Natl Acad Sci USA 97:12182–12186 25. Friedman N, Linial M, Nachman I, Pe’er D (2000) Using Bayesian networks to analyze expression data. J Comput Biol 7:601–620 26. West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA Jr, Marks JR, Nevins JR (2001) Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci USA 98:11462–11467 27. Dobra A, Hans C, Jones B, Nevins J, Yao G, West M (2004) Sparse graphical models for exploring gene expression data. J Multivar Anal 90:196–212 28. Jones B, Carvalho C, Dobra A, Hans C, Carter C, West M (2005) Experiments in stochastic computation for high dimensional graphical models. Stat Sci 20:388–400 29. Markowetz F, Bloch J, Spang R (2005) Non-transcriptional pathway features reconstructed from secondary effects of RNA interference. Bioinformatics 21:4026–4032 30. Eaton D, Murphy KP (2007) Exact Bayesian structure learning from uncertain interventions. Artificial Intelligence & Statistics 2:107–114 31. Robert CP (2007) The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Texts in Statistics, 2nd edn. Springer, New York
32. Doucet A, de Freitas N, Gordon NJ (eds) (2001) Sequential Monte Carlo Methods in Practice, Statistics for Engineering and Information Science. Springer, New York 33. Robert CP, Casella G (2005) Monte Carlo Statistical Methods. Texts in Statistics, 2nd edn. Springer, New York 34. Gamerman D, Lopes HF (2006) Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Texts in Statistical Science, 2nd edn. Chapman and Hall (CRC), Boca Raton 35. Antoniak CE (1974) Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann Stat 2:1152–1174 36. Escobar MD, West M (1995) Bayesian density estimation and inference using mixtures. J Am Stat Assoc 90(430):577–588 37. Dahl DB (2006) Model-based clustering for expression data via a Dirichlet process mixture model. In: Do KA, Müller P, Vannucci M (eds) Bayesian Inference for Gene Expression and Proteomics. Cambridge University Press, Cambridge, Chap 10 38. Kim S, Tadesse MG, Vannucci M (2006) Variable selection in clustering via Dirichlet process mixture models. Biometrika 93(4):877–893 39. West M, Harrison J (1999) Bayesian Forecasting and Dynamic Models, 2nd edn. Springer, New York 40. Philipov A, Glickman ME (2006) Multivariate stochastic volatility via Wishart processes. J Bus Econ Stat 24(3):313–328 41. Gresham D, Dunham MJ, Botstein D (2008) Comparing whole genomes using DNA microarrays. Nat Rev Genet 9:291–302 42. The Wellcome Trust Case Control Consortium (2007) Association scan of 14,500 nonsynonymous SNPs in four diseases identifies autoimmunity variants. Nat Genet 39:1329–1337 43. The Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:661–678 44. Liu XS (2007) Getting started in tiling microarray analysis. PLoS Comput Biol 3(10):1842–1844 45. Johnson WE, Li W, Meyer CA, Gottardo R, Carroll JS, Brown M, Liu XS (2006) Model-based analysis of tiling-arrays for ChIP-chip. Proc Natl Acad Sci USA 103(33):12457–12462 46. Freeman JL et al (2006) Copy number variation: New insights in genome diversity. Genome Res 16:949–961 47. Urban AE et al (2006) High-resolution mapping of DNA copy alterations in human chromosome 22 using high-density tiling oligonucleotide arrays. Proc Natl Acad Sci USA 103(12):4534–4539 48. Saha S et al (2002) Using the transcriptome to annotate the genome. Nat Biotech 20:508–512 49. Shadeo A et al (2007) Comprehensive serial analysis of gene expression of the cervical transcriptome. BMC Genomics 8:142 50. Robinson SJ, Guenther JD, Lewis CT, Links MG, Parkin IA (2007) Reaping the benefits of SAGE. Methods Mol Biol 406:365–386 51. Lu C, Tej SS, Luo S, Haudenschild CD, Meyers BC, Green PJ (2005) Elucidation of the Small RNA Component of the Transcriptome. Science 309(5740):1567–1569 52. Weiner H, Glökler J, Hultschig C, Büssow K, Walter G (2006) Protein, antibody and small molecule microarrays. In: Müller UR, Nicolau DV (eds) Microarray Technology and Its Applications. Biological and Medical Physics. Biomedical Engineering. Springer, Berlin, pp 279–295 53. Speed TP (ed) (2003) Statistical Analysis of Gene Expression Microarray Data. Chapman & Hall/CRC, Boca Raton
54. Parmigiani G, Garrett ES, Irizarry RA, Zeger SL (eds) (2003) The Analysis of Gene Expression Data. Statistics for Biology and Health. Springer, New York 55. Wit E, McClure J (2004) Statistics for Microarrays: Design, Analysis and Inference. Wiley, New York 56. Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S (eds) (2005) Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Statistics for Biology and Health. Springer, New York 57. Do KA, Müller P, Vannucci M (2006) Bayesian Inference for Gene Expression and Proteomics. Cambridge University Press, Cambridge 58. Everitt BS, Landau S, Leese M (2001) Cluster Analysis, 4th edn. Hodder Arnold, London 59. Kaufman L, Rousseeuw PJ (2005) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics, 2nd edn. Wiley, New York 60. Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL (2001) Model-based clustering and data transformation for gene expression data. Bioinformatics 17:977–987 61. McLachlan GJ, Bean RW, Peel D (2002) A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18:413–422 62. De Iorio M, Ebbels TMD, Stephens DA (2007) Statistical techniques in metabolic profiling. In: Balding DJ, Bishop M, Cannings C (eds) Handbook of Statistical Genetics, 3rd edn. Wiley, Chichester, Chap 11 63. Heard NA, Holmes CC, Stephens DA, Hand DJ, Dimopoulos G (2005) Bayesian coclustering of Anopheles gene expression time series: Study of immune defense response to multiple experimental challenges. Proc Natl Acad Sci USA 102(47):16939–16944 64. Heard NA, Holmes CC, Stephens DA (2006) A Quantitative Study of Gene Regulation Involved in the Immune Response of Anopheline Mosquitoes: An Application of Bayesian Hierarchical Clustering of Curves. J Am Stat Assoc 101(473):18–29 65. Morris JS, Brown PJ, Baggerly KA, Coombes KR (2006) Analysis of mass spectrometry data using Bayesian wavelet-based functional mixed models.
In: Do KA, Müller P, Vannucci M (eds) Bayesian Inference for Gene Expression and Proteomics. Cambridge University Press, Cambridge, pp 269–292 66. Bozdech Z, Llinás M, Pulliam BL, Wong ED, Zhu J, DeRisi JL (2003) The transcriptome of the intraerythrocytic developmental cycle of Plasmodium falciparum. PLoS Biol 1(1):E5 67. Kass RE, Raftery AE (1995) Bayes factors. J Am Stat Assoc 90:773–795 68. Nicholson JK, Connelly J, Lindon JC, Holmes E (2002) Metabonomics: a platform for studying drug toxicity and gene function. Nat Rev Drug Discov 1:153–161 69. Lindon JC, Nicholson JK, Holmes E, Antti H, Bollard ME, Keun H, Beckonert O, Ebbels TM, Reily MD, Robertson D (2003) Contemporary issues in toxicology: the role of metabonomics in toxicology and its evaluation by the COMET project. Toxicol Appl Pharmacol 187:137 70. Brindle JT, Antti H, Holmes E, Tranter G, Nicholson JK, Bethell HWL, Clarke S, Schofield SM, McKilligin E, Mosedale DE, Grainger DJ (2002) Rapid and noninvasive diagnosis of the presence and severity of coronary heart disease using 1H-NMR-based metabonomics. Nat Med 8:143 71. Yen TJ, Ebbels TMD, De Iorio M, Stephens DA, Richardson S (2008) Analysing real urine spectra with wavelet methods. (in preparation) 72. Brown PJ, Fearn T, Vannucci M (2001) Bayesian wavelet regression on curves with applications to a spectroscopic calibration problem. J Am Stat Assoc 96:398–408 73. Clyde MA, House LL, Wolpert RL (2006) Nonparametric models for proteomic peak identification and quantification. In: Do KA, Müller P, Vannucci M (eds) Bayesian Inference for Gene Expression and Proteomics. Cambridge University Press, Cambridge, pp 293–308 74. West M, Prado R, Krystal A (1999) Evaluation and comparison of EEG traces: Latent structure in non-stationary time series. J Am Stat Assoc 94:1083–1095 75. Ghosh S, Grant DF, Dey DK, Hill DW (2008) A semiparametric modeling approach for the development of metabonomic profile and bio-marker discovery. BMC Bioinformatics 9:38 76. Ghosh S, Dey DK (2008) A unified modeling framework for metabonomic profile development and covariate selection for acute trauma subjects. Stat Med 27(29):3776–3788 77. Duerr RH et al (2006) A Genome-Wide Association Study Identifies IL23R as an Inflammatory Bowel Disease Gene. Science 314(5804):1461–1463 78. Sladek R et al (2007) A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature 445:881–885 79. Longo D, Hasty J (2006) Imaging gene expression: tiny signals make a big noise. Nat Chem Biol 2:181–182 80. Longo D, Hasty J (2006) Dynamics of single-cell gene expression. Mol Syst Biol 2:64 81. Wells AL, Condeelis JS, Singer RH, Zenklusen D (2007) Imaging real-time gene expression in living systems with single-transcript resolution: Image analysis of single mRNA transcripts. CSH Protocols, Cold Spring Harbor 82. Rodriguez AJ, Condeelis JS, Singer RH, Dictenberg JB (2007) Imaging mRNA movement from transcription sites to translation sites. Semin Cell Dev Biol 18(2):202–208 83. Lizard G (2007) Flow cytometry analyses and bioinformatics: Interest in new softwares to optimize novel technologies and to favor the emergence of innovative concepts in cell research. Cytom A 71A:646–647 84. Lo K, Brinkman RR, Gottardo R (2008) Automated gating of flow cytometry data via robust model-based clustering. Cytom Part A 73A(4):321–332
Complex Networks and Graph Theory

GEOFFREY CANRIGHT
Telenor R&I, Fornebu, Norway

Article Outline

Glossary
Definition of the Subject
Introduction
Graphs, Networks, and Complex Networks
Structure of Networks
Dynamical Network Structures
Dynamical Processes on Networks
Graph Visualization
Future Directions
Bibliography

Glossary

Directed/Undirected graph A set of vertices connected by directed or undirected edges. A directed edge is one-way (A → B), while an undirected edge is two-way or symmetric: A ↔ B.

Network For our purposes, a network is defined identically to a graph: it is an abstract object composed of vertices (nodes) joined by (directed or undirected) edges (links). Hence we will use the terms ‘graph’ and ‘network’ interchangeably.

Graph topology The list of nodes i and edges (i, j) or (i → j) defines the topology of the graph.

Graph structure There is no single agreed definition for what constitutes the “structure” of a graph. To the contrary: this question has been the object of a great deal of research—research which is still ongoing.

Node degree distribution One crude measure of a graph’s structure. If n_k is the number of nodes having degree k in a graph with N nodes, then the set of n_k is the node degree distribution—which is also often expressed in terms of the frequencies p_k = n_k/N.

Small-worlds graph A “small-worlds graph” has two properties: it has short path lengths (as is typical of random graphs)—so that the “world” of the network is truly “small”, in that every node is within a few (or not too many) hops of every other; and secondly, it has (like real social networks, and unlike random graphs) a significant degree of clustering—meaning that two neighbors of a node have a higher-than-random probability of also being linked to one another.

Graph visualization The problem of displaying a graph’s topology (or part of it) in a 2D image, so as to give
the viewer insight into the structure of the graph. We see that this is a hard problem, as it involves both the unsolved problem of what we mean by the structure of the graph, and also the combined technological/psychological problem of conveying useful information about a (possibly large) graph via a 2D (or quasi-3D) layout. Clearly, the notion of a good graph visualization is dependent on the use to which the visualization is to be put—in other words, on the information which is to be conveyed.

Section Here, a ‘bookkeeping’ definition. This article introduces the reader to all of the other articles in the Section of the Encyclopedia which is titled “Complex Networks and Graph Theory”. Therefore, whenever the word ‘Section’ (with a large ‘S’) is used in this ‘roadmap’ article, the word refers to that Section of the Encyclopedia. To avoid confusion, the various subdivisions of this roadmap article will be called ‘parts’.

Definition of the Subject

The basic network concept is very simple: objects, connected by relationships. Because of this simplicity, the concept turns up almost everywhere one looks. The study of networks (or equivalently, graphs), both theoretically and empirically, has taken off over the last ten years, and shows no sign of slowing down. The field is highly interdisciplinary, having important applications to the Internet and the World Wide Web, to social networks, to epidemiology, to biology, and in many other areas. This introductory article serves as a reader’s guide to the 13 articles in the Section of the Encyclopedia which is titled “Complex Networks and Graph Theory”. These articles will be discussed in the context of three broad themes: network structure; dynamics of network structure; and dynamical processes running over networks.
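The node degree distribution defined in the glossary is the simplest of these structural measures to compute from an edge list; a minimal sketch for an undirected graph (the star-graph edge list is a hypothetical toy example):

```python
from collections import Counter

def degree_distribution(edges):
    """Return {k: p_k}, the fraction of nodes having degree k, for an
    undirected graph given as an edge list. Isolated nodes (degree 0)
    do not appear in an edge list and so are not counted here."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    counts = Counter(degree.values())   # n_k: number of nodes of degree k
    return {k: n_k / n for k, n_k in counts.items()}

# A toy star graph on 5 nodes: hub 0 linked to each of 1..4.
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
pk = degree_distribution(edges)   # hub has degree 4, the four leaves degree 1
```

For the star graph this gives p_4 = 1/5 and p_1 = 4/5, matching the p_k = n_k/N definition above.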
Introduction

In the past ten years or so, the study of graphs has exploded, leaving forever the peaceful sanctum of pure mathematics to become a fundamental concept in a vigorous, ill-defined, interdisciplinary, and important field of study. The most common descriptive term for this field is the “study of complex networks”. This term is distinguished from the older, more mathematically-bound term “graph theory” in two ways. First – and perhaps most important – this new field is not just theoretical; to the contrary, the curious researcher finds that there is an enormous variety of empirically obtained graphs/networks (we use the terms interchangeably here) available as datasets on the Web. That is, one studies real, measured graphs;
and not surprisingly, this empirical connection gives the endeavor much more of an applied flavor also. The second distinction is perhaps not well motivated by the words used; but the “study of complex networks” typically means studying networks whose structure deviates in important ways from the “classical random graphs” of Erdős and Rényi [10,11]. We note that these two points are related: as one turned to studying real networks [18], one found that they were not described by classical random graphs [24,25]; hence one was forced to look at other kinds of structures. This article is a reader’s guide to the other articles in the Section entitled “Complex Networks and Graph Theory”. The inclusion of both terms was very deliberate: graph theory gives the mathematical foundation for the more messy endeavor called “complex networks”; and the two fields have a strong and fruitful interaction. In this article I will describe the 13 other articles of this Section. These 13 articles amply document the interdisciplinary nature of this exciting field: we find represented mathematics, biology, the Web and the Internet, software, epidemiology, and social networks (with the latter field also having its own Section in the Encyclopedia). For this reason, I will not define the parts of this article by field of study, but rather by general themes which run through essentially all studies of networks (at least in principle, and often in fact). In each part of this article I will point out those articles which represent that part’s theme to a significant degree. Hence these themes are meant to tie together all of the articles into a simple framework. In the next part, “Graphs, Networks, and Complex Networks”, I will concisely present the basic terminology which is in use. Then in the part entitled “Structure of Networks” I discuss the knotty question of the structure of a graph; this problem is the first theme, and it is very much unfinished business.
In part “Dynamical Network Structures” we look at network topologies (and structures) which are dynamic rather than static. This is clearly an important theme, since (i) real empirical networks are necessarily dynamic (on some – often short – time scale), and (ii) the study of how networks grow and evolve can be highly useful for understanding how they have the structure we observe today. Then in part “Dynamical Processes on Networks” we look at the very large question of dynamical processes which take place over networks. Examples which illustrate the importance of this topic are epidemic spreading over social and computer networks, and the activation/inhibition processes going on over gene (or neural) networks. Clearly the progress of such dynamical processes may be strongly dependent on the underlying network structure; hence we see that all of these themes
(parts “Structure of Networks”–“Dynamical Processes on Networks”) are tightly related. In part “Graph Visualization” we look briefly at an important but somewhat ‘orthogonal’ theme, namely, the problem of graph visualization: given a graph’s topology, how can one best present this information visually to a human viewer? Finally, part “Future Directions” offers a very brief, modest, and personal take on the very large question of “future directions” in research on complex networks.

Graphs, Networks, and Complex Networks

Graphs

One of the earliest applications of graph theory may be found in Euler’s solution, in 1736, of the problem called ‘The seven bridges of Königsberg’. Euler considered the problem of finding a continuous path crossing each of the seven bridges of Königsberg exactly once, and solved the problem by representing each connected piece of land as a vertex (node) of a graph, and each bridge as an undirected (two-way) link. (For a nice presentation of this problem, see http://mathforum.org/isaac/problems/bridges2.html.) This little problem is an excellent example of the power of mathematics to extract understanding via abstraction: the lay person may stare at the bridges, islands etc., and try various ideas—but reducing the entire problem to an abstract graph, composed only of nodes and links, aids the application of pure reason, leading to a final and utterly convincing solution. To study graphs is to study discrete objects and the relationships between them. Hence graph theory may be regarded as a branch of combinatorics. Erdős and Rényi [10,11] founded the study of “classical random graphs”. These graphs are specified by the node number N (which typically is assumed to grow large), and by links laid down at random between these nodes, subject to various constraints. One such constraint is that every node have exactly k links—giving a k-regular random graph.
A more relaxed constraint is simply to specify m links in total (so that the average node degree is ⟨k⟩ = 2m/N). Relaxing the constraint further, one includes every possible link, out of the set of N(N − 1)/2 possible links, independently with probability p. This gives the average node degree (now averaged over many graphs) as ⟨k⟩ = 2⟨m⟩/N = p(N − 1). While all of these types of classical random graphs are similar “in spirit”, those with the fewest constraints are those for which it is easiest to prove things.
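To make the classical construction concrete, here is a minimal sketch (ours, not from the article) of the least-constrained model: include each possible link with probability p, then check that the realized average degree matches p(N − 1):

```python
import random

def gnp_random_graph(n, p, seed=42):
    """Classical (Erdos-Renyi) random graph: each of the n(n-1)/2
    possible links is included independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

n, p = 2000, 0.01
edges = gnp_random_graph(n, p)
avg_degree = 2 * len(edges) / n   # <k> = 2m/N
expected = p * (n - 1)            # <k> = p(N-1), averaged over graphs
```

For N = 2000 and p = 0.01 the two numbers agree closely, which is one illustration of why the least-constrained ensemble is the easiest one to reason about.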
Complex Networks and Graph Theory
We are fortunate to have in our Section the article Random Graphs, A Whirlwind Tour of by Fan Chung. In this article, the reader is given a good introduction to classical random graphs, along with a thorough presentation of the more modern theory of random graphs with a new, and more realistic, type of constraint—namely that they should, on average, have a given node degree distribution. This new theory makes the study of random graphs extremely relevant for today’s empirically-anchored research: many empirical graphs are characterized by their node degree distribution, and many of these in fact have a power-law degree distribution, such that the number n_k of nodes having degree k varies with k as a power law: n_k ∝ k^(−β). This work is useful precisely because a random graph with a given node degree distribution is the “most typical” graph of the set of graphs with that degree distribution. Hence, statements about such random graphs are statements about typical graphs with the same degree distribution—unless and until we know more about the empirical graphs. Networks As noted earlier in this introduction, we consider the terms ‘network’ and ‘graph’ to be interchangeable. Nevertheless there is a bit more to be said about the term. The ‘network’ concept motivates and infuses a vigorous and lively research activity that has more or less exploded since (roughly) the work of Watts and Strogatz [24,25]. Much of their work was motivated by the ‘small worlds problem’. The latter dates back to the work of Milgram [18] (and even earlier). Milgram posed the question: how far is it from Kansas (or Nebraska) to Boston, Massachusetts—when the distance is measured in ‘hops’ between people who know one another? Modern language would rephrase this question as follows: is the US acquaintanceship network a ‘small world’?
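In graph terms, Milgram was asking about shortest-path lengths measured in hops, and such distances are exactly what breadth-first search computes. A minimal sketch (the toy ‘acquaintance chain’ and function name are ours, for illustration):

```python
from collections import deque

def hop_distance(adj, source):
    """Breadth-first search: number of 'hops' from source to every
    reachable node in an undirected graph {node: [neighbours]}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# a toy acquaintance chain: person 0 reaches person 5 via four intermediaries
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
dist = hop_distance(chain, 0)
```

On a real acquaintanceship network one would average such distances over many source–target pairs to decide whether the network is a ‘small world’.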
Milgram’s answer was ‘yes’: after disregarding the chains (of letters—that was the mechanism for a ‘hop’) that never reached the target, the average path length was roughly 5–6 hops—a ‘small world’. The explosion of interest in the last 10 years is well documented in the set of references in “Books and Reviews” (below). We also include in this part an introductory discussion of directed graphs. Directed graphs have directed links: it no longer suffices to say “i and j are linked”, because there is a directionality in the linking: (i → j) or (j → i) (or both). Some of the mathematical background for understanding directed graphs is provided in the article in this Section by Berman and Shaked-Monderer ( Non-negative Matrices and Digraphs). One quickly finds, upon coming in contact with directed graphs, that one is in a rather different world—the mathematics has changed, the structures are different, and one’s intuition often fails. Directed graphs are however here to stay, and well worth study. We cite two classic examples to demonstrate the extreme relevance of directed graphs. First, there is early and pioneering work by Stuart Kauffman [14,15] on genetic regulatory networks. These form directed graphs, because the links express the fact that gene G1 regulates gene G2 (G1 → G2)—a relationship which is by no means symmetric. Understanding gene regulation and expression is a fundamental problem in biology—and we see that the problem may be usefully expressed as one of understanding dynamics on a directed graph. The article in this Section by Huang and Kauffman ( Complex Gene Regulatory Networks – from Structure to Biological Observables: Cell Fate Determination) brings us up to date on this exciting problem. A better-known directed graph, which plays a role for many of us in our daily lives, is the Web graph [1,2,8]. We navigate among Web pages using hyperlinks—one-way pointers taking us (e. g.) from page A to page B (A → B). The utility of the Web as a source of information—and as a platform for interaction—is enormous. Hence the Web is well worth intense study, both intellectually and practically. Hyperlinks are useful, not only for navigation, but also to aid in ranking Web pages. This utility was clearly pointed out in two early papers on Web link analysis [7,17]. “Web link analysis” may be viewed most simply as a process which takes (all or part of) the Web graph as input, and which gives ‘importance scores’ for each Web page as output. The PageRank approach to link analysis of Brin and Page [7] has almost certainly played a significant role in the meteoric rise of the Web search company Google.
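The core of the PageRank idea can be sketched in a few lines of power iteration; this is a simplified illustration (toy graph, function name, and damping value are ours), not Google’s production algorithm:

```python
def pagerank(adj, d=0.85, iters=100):
    """Power-iteration PageRank on a directed graph {page: [out-links]}.
    Rank leaking from dangling pages (no out-links) is spread uniformly."""
    n = len(adj)
    rank = {u: 1.0 / n for u in adj}
    for _ in range(iters):
        leaked = sum(rank[u] for u in adj if not adj[u])
        new = {u: (1 - d) / n + d * leaked / n for u in adj}
        for u in adj:
            for v in adj[u]:
                # each page splits its rank evenly over its out-links
                new[v] += d * rank[u] / len(adj[u])
        rank = new
    return rank

# toy Web: A and C both point at B; B points back at A
rank = pagerank({'A': ['B'], 'B': ['A'], 'C': ['B']})
```

On this toy graph page B, which collects two incoming recommendations, ends up ranked above A, which in turn ranks above the unreferenced page C.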
Hence there is both practical and commercial value in understanding the Web graph. We have two complementary papers in this Section which treat this important area: that by Adamic on World Wide Web, Graph Structure, and that by Bjelland et al. on Link Analysis and Web Search. Complex Networks Now we come to the last substantial word in the title of this Section: ‘complex’. This is a word that (as yet) lacks a precise scientific meaning. There are in fact many competing definitions; in other words, there is no generally agreed precise definition. For a good overview of work on this thorny problem we refer the reader to [4].
Fortunately, for the purposes of this Section, we need only a very simple definition: ‘complex networks’ are those which are not well modeled by classical random graphs (described above, and in the article by Fan Chung). The use of the term ‘complex networks’ is very widespread—while examples which give a clear definition are less common. We note that the definition given here is also cited by Dorogovtsev in his article in this Section ( Growth Models for Networks). Structure of Networks We have noted already that there is no single, simple answer to the question “what is the structure of this graph?” With this caveat in mind, we offer the reader a guide to the articles in this Section which address the structure of networks. Undirected Graphs We have already noted the article by Fan Chung ( Random Graphs, A Whirlwind Tour of), giving an up-to-date overview of the properties of random graphs with a given node degree distribution—with the well-studied case being power-law graphs. She cites a number of experimental results which indicate that the measured exponents β (taken from the power-law degree distributions) fall in one range (2 < β < 3) for social and technological networks, and another, rather distinct range (β < 2) for biological networks. This may be regarded as a (quantitative) structural difference between these two types of network; Chung offers an explanation based on qualitatively distinct growth mechanisms (see the next part of this article). Fortunato and Castellano ( Community Structure in Graphs) offer an excellent overview of another broad approach to network structure. Here the idea is to understand structure in terms of substructure (and sub-substructure, etc). That is: here is a network. Can we identify subgraphs—possibly overlapping, possibly disjoint—of this network that in some sense “belong together”? In other words: how can one identify the community structure of a graph?
The review of Fortunato and Castellano is on the one hand very thorough—and on the other hand makes clear that there is no one agreed answer to this question. There are indeed almost as many answers as there are theoretical approaches; and this problem has received a lot of attention. Fortunato and Castellano note that there is currently a “favorite” approach, defined by finding subgraphs with high modularity [21]. Roughly speaking, a subgraph with high modularity has a higher density of internal links than that found for the same subgraph in a randomized ‘null model’ for the same graph.
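The modularity computation itself is short; a minimal sketch (the two-triangle toy graph is ours): Q compares each community’s internal-link fraction with the expectation (d_c/2m)^2 under a degree-preserving null model.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman-Girvan modularity of a node -> community assignment:
    Q = sum_c (e_c/m - (d_c/2m)^2), where e_c counts links inside
    community c and d_c is the summed degree of its nodes."""
    m = len(edges)
    internal = defaultdict(int)
    degree = defaultdict(int)
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2
               for c in degree)

# two triangles joined by a single bridge link
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
Q_split = modularity(edges, {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'})
Q_lump = modularity(edges, {i: 'a' for i in range(6)})
```

On this toy graph the natural two-community split scores Q ≈ 0.357, while lumping all six nodes into one community scores exactly 0, as it must.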
Fortunato and Castellano give a careful discussion of the strengths and weaknesses of this approach to community detection, as well as of many others. Since the work of Watts and Strogatz [24,25], the definition of a ‘small-world graph’ has included two criteria. First, one must have the short average path length which Milgram found in his experiments, and which gives rise to the term ‘small worlds’. However, one also insists that a ‘true’ small-world graph should locally resemble a real social network, in that there is a significant degree of clustering. Roughly speaking, this means that the probability that two of my acquaintances know each other is higher than random. More mathematically, to say that a graph G has high clustering means that the incidence of closed triangles of links in G is higher than expected in a randomized ‘null model’ for G. A closed triangle is a small, simply defined subgraph, and in studying clustering one studies the statistics of the occurrence of this small subgraph. A more general term for this type of small subgraph is a motif [19]. Much as with clustering and triangles, one defines a set of motifs, and then generates a significance profile for any given network, comparing the frequency of each motif in the profile to that of a corresponding random graph. Valverde and Solé ( Motifs in Graphs) offer a stimulating overview of the study of motifs in networks of many types—both directed and undirected. They point out a remarkable consistency of motif significance profiles across networks with very different origins—for example, a software graph and the gene network of a bacterium—and argue that this consistency is best understood in terms of historical accident, rather than in terms of functionality. The article by Liljeros ( Human Sexual Networks) looks at empirical human sexual networks. He addresses the evidence for and against the notion that such networks are power-law (also known as “scale free”). 
This question is important for understanding epidemic spreading over networks—especially in the light of the results of Pastor-Satorras and Vespignani [23], which showed that epidemic spreading on power-law networks is more difficult to stop than was predicted by earlier models using the well-mixed (all-to-all) approximation. Liljeros examines carefully what is known about the structure of human sexual networks, noting the great difficulty inherent in gathering extensive and/or reliable data. The article by He, Siganos, and Faloutsos ( Internet Topology) looks at a very different empirical network, namely the physical Internet, composed (at the lowest level) of routers and cables. The naïve newcomer might assume that the Internet, being an engineered system, is fully mapped out, and hence its topology should be readily
“understood”. The article by He et al. presents the reality of the Internet: it is largely self-organized (especially at the higher level of organization, the ‘Autonomous System’ or AS level); it is far from trivial to experimentally map out this network—even at the AS level; and there is not even agreement on whether or not the AS-graph is a power-law graph—which is after all a rather crude measure of structure of the network. He et al. describe recent work which offers a neat resolution of the conflicting and ambiguous data on this question. They then go on to describe more imaginative models for the structure of the Internet, going beyond simply the degree distribution, and having names like ‘Jellyfish’ and ‘Medusa’. Directed Graphs A generic directed graph is immediately distinguished from its undirected counterparts in that a natural unit of substructure is obvious, and virtually always present. That is the ‘strongly connected component’ or SCC (termed ‘class’ by Berman and Shaked-Monderer). That is, even when a directed graph is connected, there are node pairs which cannot reach one another by following directed paths. An SCC C is then a maximal set of nodes satisfying the constraint that every node in C is reachable, via a directed path, from every other node in C. The SCCs then form disjoint sets (equivalence classes), and every node is in one SCC. In short, the very notion of ‘reachability’ is more problematic in directed graphs: all nodes in the same SCC can reach one another, but otherwise, all bets are off! Lada Adamic ( World Wide Web, Graph Structure) gives a good overview of what is known empirically about the structure of the Web graph—that abstract and yet very real object in which a node is a Web page, and a link is a (one-way) hyperlink. 
The Web graph is highly dynamic, in at least two ways: pages have a finite lifetime, with new ones appearing while old ones disappear; also, many Web pages are dynamic, in that they generate new content when accessed—and can in principle represent an infinite amount of content. Also, of course, the Web is huge. Lada Adamic presents the problems associated with ‘crawling’ the Web to map out its topology. The reader will perhaps not be surprised to learn that the Web graph obeys a power law—both for the indegree distribution and for the outdegree distribution. Adamic discusses several other measures for the structure of the Web graph, including its gross SCC structure (the ‘bow tie’), its diameter, reciprocity, clustering, and motifs. The problem of path lengths and diameter is less straightforward for directed graphs, since the unreachability problem produces infinite or undefined path length for many pairs.
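The SCC decomposition underlying such ‘bow tie’ pictures can be computed in linear time; here is a sketch of Kosaraju’s two-pass algorithm (the toy graph is ours, for illustration):

```python
def strongly_connected_components(adj):
    """Kosaraju's algorithm for a directed graph given as
    {node: [successors]}; every node must appear as a key."""
    order, seen = [], set()

    def dfs(u, graph, out):
        stack = [(u, iter(graph.get(u, [])))]
        seen.add(u)
        while stack:
            node, it = stack[-1]
            advanced = False
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(graph.get(v, []))))
                    advanced = True
                    break
            if not advanced:
                stack.pop()
                out.append(node)   # record finishing order

    for u in adj:                  # pass 1: finish times on the graph
        if u not in seen:
            dfs(u, adj, order)
    radj = {u: [] for u in adj}    # pass 2: reverse every link ...
    for u in adj:
        for v in adj[u]:
            radj[v].append(u)
    seen.clear()
    comps = []
    for u in reversed(order):      # ... and peel off one SCC at a time
        if u not in seen:
            comp = []
            dfs(u, radj, comp)
            comps.append(frozenset(comp))
    return comps

# 0 -> 1 -> 2 -> 0 forms one SCC; 3 <-> 4 is another, reachable from 2
adj = {0: [1], 1: [2], 2: [0, 3], 3: [4], 4: [3]}
comps = strongly_connected_components(adj)
```

Note that node 3 is reachable from node 0 but not vice versa, which is exactly why the two sets end up in different components.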
A nice bonus in the article by Adamic is the discussion of some popular and well studied subgraphs of the Web graph—query connection graphs, Weblogs, and Wikipedia. Query connection graphs are subgraphs built from a hit list, and can give useful information about the query itself, and the likelihood that the hit list will be satisfactory. Weblogs and Wikipedia are well known to most Web-literate people; and it is fascinating to see these daily phenomena subjected to careful scientific analysis. The article by Bjelland et al. ( Link Analysis and Web Search) has perhaps the closest ties to the mathematical presentation of Berman and Shaked-Monderer ( Non-negative Matrices and Digraphs). This is because Web link analysis tends to focus on the principal eigenvector of the graph’s adjacency matrix (or of some modification of the adjacency matrix), while Berman and Shaked-Monderer discuss in some detail the spectral properties of this matrix, and give some results about the principal eigenvector. The latter yields importance or authority scores for each page. These scores are the principal output of Web link analysis; and in fact Berman and Shaked-Monderer cite PageRank as an outstanding example of an application of the theory [22]. Bjelland et al. explain the logic leading to the choice of the principal eigenvector, as a way of ‘harvesting’ information from the huge set of ‘collective recommendations’ that the hyperlinks constitute. They also present the principal approaches to link analysis (the ‘big three’), and place them in a simple logical framework which is completed by the arrival of a new, fourth approach. In addition, Bjelland et al. discuss a number of technical issues related to Web link analysis—which is, after all, a practical and commercial field of endeavor, as well as an object for research and understanding. Jennifer Dunne ( Food Webs) presents a rather special type of directed graph taken from biology: the food web.
She offers a very concise definition of this concept in her glossary, which we reproduce here: “the network of feeding interactions among diverse co-occurring species in a particular habitat”. We note that most feeding interactions are one-way: bird B eats insect I, but insect I does not eat bird B. However, food webs are rather special among empirical directed graphs, in that they have a lower incidence of loops than that found in typical directed graphs. Early work indicated (or even assumed) that a food web is loop-free; but this has been shown not to be strictly true. (A simple, but real example of a loop is cannibalism: A eats A; but much longer loops also exist.) The field faces (as many do) a real problem in getting good empirical data, and empirically obtained food webs tend to be small. For example, Table 1 of Dunne presents data for 16 empirical food webs, ranging in size from 25 nodes to 172.
Our second biological application of directed graphs is presented by Huang and Kauffman ( Complex Gene Regulatory Networks – from Structure to Biological Observables: Cell Fate Determination). This article presents gene regulatory networks (GRN). The directed link (G1 → G2) in a GRN expresses the fact that gene G1 regulates (by inhibition or activation) the expression of gene G2, via intermediate proteins (typically transcription factors). The links are thus inherently one way, although reciprocal links (G1 ↔ G2) do occur. The article of Huang and Kauffman is very comprehensive: the structure of a GRN is only their starting point, as they seek to understand and model the dynamical development process, in which the set of on/off states of the cell genome moves towards an attractor (a steady or quasi-steady state), which represents a mature and stable cell type. Returning to structure (of GRNs), we again face a severe data-extraction problem, which is compounded by the fact that the ‘interaction modality’ (activation or inhibition, plus or minus) of each link must also be known before one can study dynamical development of gene expression over the GRN. In short: one needs to know more than just the presence and direction of the links; one needs their type. Huang and Kauffman give a thorough discussion of these problems, and argue for studies using an “ensemble approach”. This is much like the random graph approach, in that it takes a set of statistical constraints which are empirically determined, and then studies (often via simulation) a randomly generated ensemble of graphs which satisfy the statistical constraints. In short, the ensemble approach takes the structure as determined by these constraints (node degree distribution, etc), and then studies random graphs with this structure. Note that both the graph topology and the interaction modalities of the links are randomized in this ensemble approach.
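The random-graph side of such an ensemble approach can be sketched with the standard configuration model: draw a degree sequence satisfying the empirical constraint (here a power law), then pair up link ‘stubs’ uniformly at random. Function names and parameter values are ours, for illustration:

```python
import random

def powerlaw_degree_sequence(n, beta, kmin=1, kmax=100, seed=1):
    """Draw n node degrees with P(k) proportional to k^(-beta)."""
    rng = random.Random(seed)
    ks = list(range(kmin, kmax + 1))
    weights = [k ** (-beta) for k in ks]
    degrees = rng.choices(ks, weights=weights, k=n)
    if sum(degrees) % 2:           # stub-matching needs an even total
        degrees[0] += 1
    return degrees

def configuration_model(degrees, seed=2):
    """Random graph with the given degree sequence: list each node once
    per stub, shuffle, and pair consecutive stubs (occasional self-loops
    and multi-links are tolerated, as in the standard model)."""
    rng = random.Random(seed)
    stubs = [u for u, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    return list(zip(stubs[::2], stubs[1::2]))

degrees = powerlaw_degree_sequence(5000, beta=2.5)
edges = configuration_model(degrees)
```

Every graph generated this way realizes the requested degrees exactly, which is what makes the ensemble’s statistics interpretable as “typical graphs with this degree distribution”.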
Dynamical Network Structures We have already had a lot to say about the structure of networks—and yet we have left out one important dimension, namely time. Empirical networks change over time, so that most measurements that map out such a network are taking a snapshot. Now we will look explicitly at studies addressing the dynamical evolution of networks. One classic study is the paper by Barabási and Albert [5]. Here the preferential-attachment (or “rich get richer”) model was introduced as an explanation for the ubiquitous power-law degree distributions. In other words, a growth (developmental) model was used to explain properties of snapshots of “mature” networks. In the preferential-attachment model, new nodes which join
a network link to existing nodes with a biased probability distribution—so that the probability of linking to an existing node of degree k is proportional to k. This simple model indeed gives (after long time) a power-law distribution; and the ideas in [5] stimulated a great interest in various growth models for networks. We are fortunate to have, in this Section of the Encyclopedia, the article by Sergey Dorogovtsev entitled Growth Models for Networks. This article is of course very short compared to the volume [9]—but it offers a good overview of the field, and a thorough updating of that volume, covering a broad range of questions and ideas, including the simple linear preferential attachment model, and numerous variations of it. Also, a distinctly different class of growth models, termed “optimization based models”, is presented, and compared to the class of models involving forms of preferential attachment. In optimization based models, new nodes place links so as to optimize some function of the resulting network properties. We also mention, in the context of growth models, the article by Fan Chung ( Random Graphs, A Whirlwind Tour of). As noted above, she has pointed out a tendency for biological networks to have significantly smaller exponents than technological networks; and she includes a good discussion of both preferential attachment (which tends to give the larger, technological, exponents) and duplication models. The latter involve new nodes “copying” or duplicating (probabilistically) the links of existing nodes. Chung observes that such duplication mechanisms do exist in biology, and also shows that they can give exponents in the appropriate range. 
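A minimal sketch of linear preferential attachment (ours, not from the cited papers); the stub-list trick makes the degree-biased choice cheap, since a node listed once per unit of degree is drawn with probability proportional to its degree:

```python
import random

def preferential_attachment(n, m=2, seed=7):
    """Growth with linear preferential attachment: each new node adds
    m links, choosing existing targets in proportion to their degree."""
    rng = random.Random(seed)
    stubs = list(range(m))          # seed nodes start with weight 1 each
    edges = []
    for new in range(m, n):
        chosen = set()
        while len(chosen) < m:      # m distinct, degree-biased targets
            chosen.add(rng.choice(stubs))
        for t in chosen:
            edges.append((new, t))
            stubs.extend([new, t])  # both endpoints gain a unit of degree
    return edges

edges = preferential_attachment(1000, m=2)
```

Even at 1000 nodes the “rich get richer” effect is visible: the oldest nodes accumulate degrees far above the network average of about 4.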
Huang and Kauffman ( Complex Gene Regulatory Networks – from Structure to Biological Observables: Cell Fate Determination) also offer a limited discussion of evolution of the genome—not so much as a topic in itself, but because (again) understanding genome evolution can help in understanding—and even determining—the structure of today’s genome. They mention both duplication and preferential attachment. Also, it is quite interesting to note the parallels between their discussion and that of Valverde and Solé ( Motifs in Graphs)—in that Huang and Kauffman also argue that many observed structures may be due, not to selection pressure or enhanced functionality, but simply to physical constraints on the evolutionary process, and/or to historical accident. The duplication mechanism in fact turns up again in the article by Valverde and Solé, who argue that this growth mechanism can largely account for the high frequency of occurrence of some motifs. Finally, we note that growth is not the only dynamical process operative in networks. Just as the mature body is constantly shedding and regenerating cells, many mature
networks are subject to constant small topology changes. One interesting class of non-growth dynamics is the attack. That is: how robust is a given network against a systematic attack which deletes nodes and/or links? A classic study in this direction is [3], which studied the attack tolerance of scale-free networks. There are many reasons to believe that such networks are very well connected, and this study added more: attacking random nodes had little effect on the functionality of the network (as crudely measured by the network diameter, and by the size of the largest surviving component). Thus such networks may be termed “robust”—but only with regard to this kind of “uninformed” attack. The same study showed that a “smart” attack, removing the nodes in descending order of node degree (i. e., highest degree first), caused the network to break down much more rapidly—thus highlighting once again the crucial role played by the ‘hubs’ in power-law networks. Jennifer Dunne ( Food Webs) reports some studies of the robustness of food webs to attack—where here the ‘attack’ is extinction of a node (species). The dynamics is slightly different from that in the previous paragraph, however, because of the phenomenon of secondary extinction: removal of one species can cause another (or several others) to go extinct as well, if they are dependent on the first for their food supply. Also, food webs are not well modeled by power-law degree distributions. Nevertheless, the cited studies indicate that removing high-degree species again causes considerably more damage (as measured by secondary extinctions and web fragmentation) than removing low-degree species. Dunne also reports studies seeking to simulate the effects of “ecologically plausible” extinction scenarios; here we find that the studied food webs are in fact very robust to such extinction scenarios—a result which (perhaps) confirms our prejudice, that today’s ecosystems are here because they are robust. 
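The attack experiments described above are easy to reproduce in miniature. The sketch below (toy hub-and-spoke graph ours) measures the surviving giant component after deleting a random leaf versus deleting the highest-degree node:

```python
from collections import Counter

def largest_component(nodes, edges):
    """Size of the largest connected component, via union-find."""
    parent = {u: u for u in nodes}
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u
    for u, v in edges:
        parent[find(u)] = find(v)
    return max(Counter(find(u) for u in nodes).values()) if nodes else 0

def attack(nodes, edges, removed):
    """Delete a node set (with its links); return the surviving giant size."""
    left = [u for u in nodes if u not in removed]
    kept = [(u, v) for u, v in edges if u not in removed and v not in removed]
    return largest_component(left, kept)

# hub-and-spoke 'network': node 0 linked to ten leaves
star_nodes = list(range(11))
star_edges = [(0, leaf) for leaf in range(1, 11)]
```

Removing one leaf barely dents the giant component; removing the hub shatters the network into isolated nodes, the same qualitative contrast the cited study found for power-law networks.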
Dynamical Processes on Networks Our next theme is also about dynamics—not of the network topology, however, but over it. That is, it is often of interest to study processes which occur on the nodes (changing their state), and which transmit something (information) over the links, so that a change in one node’s state induces a change in other nodes’ states. We can render these ideas less abstract by considering the concrete example of epidemic spreading. The nodes can be individuals (or computers, or mobile phones), and the network is then a network of social contacts (or a computer or phone network). The elementary point that the disease is spread via contact is readily captured by the network model. A classic study in this regard, which strongly underscored the value of network models, was that (mentioned earlier) of Pastor-Satorras and Vespignani [23]. Here it was shown that classical threshold thinking about epidemic spreading fails when the network is scale-free: the effective threshold is zero. This study, in yet another way, revealed that such networks are extremely well connected—and it stimulated a great deal of thought about prevention strategies. We have already mentioned the article by Liljeros ( Human Sexual Networks) in this Section, with its evidence for power-law, or nearly power-law, degree distributions in human sexual networks. Liljeros offers a sober and careful discussion of the implications of the theoretical result just cited, for understanding and preventing the spread of sexually transmitted diseases on finite networks. One consequence of a power-law node degree distribution—that a few individuals will have an extremely high number of contacts—may seem at first glance implausible or even impossible; and yet, as Liljeros points out, careful empirical studies have tended to support this prediction. The human sexual network is of course not static; people change partners, and in fact people with many partners tend to change more often. Hence, the dynamics of the network topology must be considered, including how it affects the dynamics of the epidemic spreading process going on over the topology. A term which captures this interplay is “concurrence”—which tells, not how many partners you have, but how many partners you have “simultaneously”—i. e., within a time window which allows the passing of the disease from one contact to another. Considering concurrence leads to another type of graph, termed a ‘line graph’—a structure in which contact relationships become nodes, which are linked when they are concurrent.
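A minimal discrete-time sketch of contact-driven spreading (a bare SI model on a toy contact chain; function name and parameters are ours, for illustration):

```python
import random

def si_spread(adj, patient_zero, p_transmit, steps, seed=3):
    """Discrete-time SI model: each step, every infected node infects
    each susceptible neighbour independently with probability p_transmit."""
    rng = random.Random(seed)
    infected = {patient_zero}
    for _ in range(steps):
        new = {v for u in infected for v in adj[u]
               if v not in infected and rng.random() < p_transmit}
        if not new:
            break                 # spreading has stopped
        infected |= new
    return infected

# a chain of contacts, 0 - 1 - 2 - 3 - 4 (undirected adjacency)
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
```

Running the same process on a power-law contact network, rather than a chain, is what reveals the vanishing epidemic threshold discussed above: the hubs keep transmission alive at arbitrarily small p_transmit.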
The reader will not be surprised to hear that epidemic spreading over directed graphs is qualitatively different from the same process over undirected graphs. The topic has real practical interest, because—as pointed out by Kephart and White [16] and by Newman, Forrest and Balthrop [20]—the effective network over which computer viruses propagate is typically directed. We invite the interested reader to consult these sources—and to note the striking similarities between the picture of the ‘email address graph’ in [20] and the gross structure of the Web graph [8] (and Figure 1 of Adamic). Another classic study of dynamical processes on networks is that by Duncan Watts [24,25] on synchronization phenomena over networks. Our Section includes a thorough and up-to-date survey of this problem, primarily from the theoretical side, by Chen et al. ( Synchronization Phenomena on Networks). The field is large, as there exists a wide variety of dynamical processes on nodes, and types of inter-node coupling, which may (or may not) lead to synchronization over the network. Chen et al. focus on three main themes which neatly summarize the field. First, there is the synchronization process itself, and the theory which allows one to predict when synchronization will occur. Next comes the question of how the network structure affects the tendency to synchronization. The third theme is a logical follow-up to the second: if we know something about how structure affects synchronization, can we not find design methods which enhance the tendency to synchronize? This latter theme includes a number of ingenious methods for ‘rewiring’ the network in order to enhance its synchronizability. For example: it is found that a very strong community structure (recall the article by Castellano and Fortunato) will inhibit synchronization; and so one has studied networks called ‘entangled networks’, which are systematically rewired so as to have essentially no community structure, and so are optimally synchronizable. Dunne ( Food Webs) discusses dynamics of species number over food webs. That is, the links telling “who eats who” also mediate transfers of biomass—a simple dynamical process. And yet—as we know from the dynamics of the simple two-species Lotka–Volterra equations [13]—simple nonlinear rules can give rise to complex behavior. Dunne reports that the behavior does not get simpler as one studies larger networks; one finds the same types of asymptotic behavior—equilibrium, limit cycle, and chaotic dynamics. It is a nontrivial task to study nonlinear dynamical models over tens or hundreds of nodes, and at the same time anchor the theory and simulation in reality.
One approach to doing this has been to insist that the resulting model satisfy certain stability criteria (hence conforming to reality!); these criteria are framed in terms of ‘species persistence’ (not too many extinctions) and/or ‘population stability’ (limited fluctuations in species mass for all species). A dynamical process also plays a central role in the discussion of Huang and Kauffman ( Complex Gene Regulatory Networks – from Structure to Biological Observables: Cell Fate Determination). In a simplified but highly nontrivial model, the state of the gene network is modeled by a vector S(t), which takes binary values (0 or 1, i. e., ‘off’ or ‘on’) at each node in the network, at each time t. Also, the interactions between genes are modeled by Boolean truth tables—i. e., a given gene’s (binary) output state is some Boolean function of all of its input states. The resulting ‘Boolean network model’ is at once very simple (discrete time, discrete binary states), and yet very complex, in the sense that it is impossible to predict the behavior of such models without simulating them. Huang and Kauffman describe the three regimes of dynamical behavior found: an ‘ordered’ regime, a ‘chaotic’ regime, and an intermediate ‘critical’ regime. While the ordered regime gives stable behavior, Huang and Kauffman argue that biology favors the critical regime. One gets the flavor of their argument if one recalls that the same genome must be able to converge to many different cell types—so its dynamics cannot be too stable—and yet those same cell types must be ‘stable enough’. The discussion of the latter point includes a remarkable recent experimental study which graphically shows the return of two cell populations to the same ‘preferred’ stable attractor, after two distinct perturbations, via two distinct paths in state space. Valverde and Solé ( Motifs in Graphs) also discuss dynamical processes, in terms of motifs. They give a clear picture for the simple test case of the three-node motif called ‘Feed Forward Loop’ or FFL. Here again—as for the GRNs of Huang and Kauffman—the modality (activation/inhibition) of the three links must be considered, and the resulting possibilities are either ‘coherent’ (non-conflicting) or ‘incoherent’. It is interesting to note that the FFL motif has been studied both via simulation, and experimentally, in real gene networks. Valverde and Solé also make contact with the article of Chen et al. ( Synchronization Phenomena on Networks) by briefly discussing the connection between network synchronizability and the distribution of motif types. Graph Visualization The problem of graph visualization is a marvelous mixture of art and science. To produce a good graph visualization is to translate the science—what is known factually and analytically about the graph—into a 2D (or quasi-3D) image that the human brain can appreciate. Here the word ‘appreciate’ can include both understanding and the experience of beauty.
Both of these aspects are present in great quantities in the many figures of this Section. We invite the reader interested in visualization to visit and compare the following figures: Figure 2 in Fan Chung's article; Figures 1 and 3 in Dunne; Figures 2 and 5 in Liljeros; Figure 17 in Chen et al.; and Figures 1, 2, and 6b in Valverde and Solé. Graph visualization is included in the articles of this Section because it is a vital tool in the study of networks, and also an unfinished research challenge. The article by Vladimir Batagelj (Complex Networks, Visualization of) offers a fine overview of this large field. This article in fact has a very broad scope, touching upon many visualization problems (molecules, Google maps) which may be called "information visualization". The connection to networks (no pun intended) is that information is readily understood in terms of connections between units—i.e., as a network. We note that one natural way of 'understanding' a graph is in terms of its communities; and since community substructure offers a way (perhaps even hierarchical) of 'coarse-graining' the network, methods for defining communities often lead naturally to methods for graph visualization. One example of this may be found in Figure 10 of Fortunato and Castellano (Community Structure in Graphs); for other examples, see [12] and [6]. Batagelj describes a wide variety of graph visualization methods; but a careful reading of his article shows that many methods which are useful for large graphs depend on finding and exploiting some kind of community structure. The terms used for approaches in this vein include 'multilevel algorithms', 'clustering', 'block modeling', 'coarse graining', 'partitions', and 'hierarchies'. The lessons learned from Fortunato and Castellano are brought home again in the article by Batagelj: there is no 'magic bullet' that gives a universally satisfying answer to the problem of visualizing large networks. To quote Batagelj in regard to an early graph visualization: "Nice as a piece of art, but with an important message: there are big problems with visualization of dense and/or large networks." A more recent example is a 2008 visualization of the Internet: here there is a 'natural' unit of coarse graining, namely the autonomous system or AS level (recall the article by He et al. on Internet Topology), so that the network of almost 5 million nodes reduces to 'only' about 18 000 ASes. Yet the resulting visualization (Figure 4 of Batagelj) clearly reveals that 18 000 nodes is still 'large' relative to human visual processing capacity. For other beautiful and mystifying examples of this same point, I recommend Figures 5 and 7 in the same article.
The difficulty of visualizing large networks is perhaps most succinctly captured in the outline of Batagelj’s stimulating article. Besides the normal introductory parts, we find only two others: ‘Attempts’ and ‘Perspectives’. Yet the article is by no means discouraging—it is rather fascinating and inspiring, and I encourage the reader to read and enjoy it.
Future Directions

The many articles related to "Complex Networks and Graph Theory" have each offered their own view of 'Future Directions' for the corresponding field of study. Therefore it would be both redundant and presumptuous for me to attempt the same task for all of these fields. Instead I will offer a very short and entirely personal assessment of the 'future' for the broad field of complex networks and graph theory. I view the field somewhat as a living organism: it is highly dynamic, and still growing vigorously. New ideas continue to pop up – many of which could not be covered in this Section, simply due to practical limitations. Also, there is a fairly free flow of ideas across disciplines. The reader has perhaps already gotten a feeling for this cross-boundary flow, by seeing the same basic ideas crop up in many articles, on problems coming from distinctly different traditional disciplines. In short, I feel that the interdisciplinarity of this field is real, it is vigorous and healthy, and it is exciting. Another aspect contributing to the excellent health of the field is its strong connection to empiricism. Network studies thrive on access to real, empirically obtained graphs – we have seen this over and over again in the discussion of the articles in this Section. From the Web graph with its tens of billions of nodes, to food webs with perhaps a hundred nodes, the science of networks is stimulated, challenged, and enriched by a steady influx of new data. Finally, the study of complex networks is eminently practical. The path to direct application can be very short. Again we cite the Web graph and Google's PageRank algorithm as an example of this. The study of gene networks is somewhat farther from immediate application; but the possible benefits from a real understanding of cell and organism development can be enormous. The same holds for the problem of epidemic spreading. These examples are picked out only to illustrate the point; none of the articles, or the topics they represent, is far removed from practical application. In short: the field is exciting, vigorous, and interdisciplinary, and offers great practical benefits to society. The study of graphs and networks shows no signs of becoming moribund. Hence I will hazard a guess about the future: that the field will continue to grow and inspire excitement for many years to come. It is my hope that many readers of this Section will be infected by this excitement, and will choose to join in the fun.

Bibliography
Primary Literature

1. Adamic LA (1999) The Small World Web. In: Proc 3rd European Conf Research and Advanced Technology for Digital Libraries, ECDL, London, pp 443–452
2. Albert R, Jeong H, Barabási AL (1999) Diameter of the World-Wide Web. Nature 401:130–131
3. Albert R, Jeong H, Barabási AL (2000) Error and attack tolerance of complex networks. Nature 406:378–382
4. Badii R, Politi A (1997) Complexity: Hierarchical Structures and Scaling in Physics. Cambridge Nonlinear Science Series, vol 6. Cambridge University Press, Cambridge
5. Barabási AL, Albert R (1999) Emergence of Scaling in Random Networks. Science 286:509–512
6. Bjelland J, Canright G, Engø-Monsen K, Remple VP (2008) Topographic Spreading Analysis of an Empirical Sex Workers' Network. In: Ganguly N, Mukherjee A, Deutsch A (eds) Dynamics on and of Complex Networks. Birkhäuser, Basel; also at http://delis.upb.de/paper/DELIS-TR-0634.pdf
7. Brin S, Page L (1998) The anatomy of a large-scale hypertextual Web search engine. In: Proceedings of the seventh international conference on World Wide Web, pp 107–117
8. Broder A, Kumar R, Maghoul F, Raghavan P, Rajagopalan S, Stata R, Tomkins A, Wiener J (2000) Graph structure in the web. Comput Netw 33:309–320
9. Dorogovtsev SN, Mendes JFF (2003) Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford University Press, Oxford
10. Erdős P (1947) Some remarks on the theory of graphs. Bull Amer Math Soc 53:292–294
11. Erdős P, Rényi A (1959) On random graphs, I. Publ Math Debrecen 6:290–297
12. Girvan M, Newman MEJ (2002) Community structure in social and biological networks. Proc Natl Acad Sci USA 99:7821–7826
13. Kaplan D, Glass L (1995) Understanding Nonlinear Dynamics. Springer, New York
14. Kauffman S (1969) Homeostasis and differentiation in random genetic control networks. Nature 224:177–178
15. Kauffman SA (1969) Metabolic stability and epigenesis in randomly constructed genetic nets. J Theor Biol 22:437–467
16. Kephart JO, White SR (1991) Directed-graph epidemiological models of computer viruses. In: Proceedings of the 1991 IEEE Computer Society Symposium on Research in Security and Privacy, pp 343–359
17. Kleinberg JM (1999) Authoritative sources in a hyperlinked environment. J ACM 46(5):604–632
18. Milgram S (1967) The Small World Problem. Psychol Today 2:60–67
19. Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network Motifs: Simple Building Blocks of Complex Networks. Science 298:824–827
20. Newman MEJ, Forrest S, Balthrop J (2002) Email networks and the spread of computer viruses. Phys Rev E 66:035101
21. Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69:026113
22. Page L, Brin S, Motwani R, Winograd T (1998) The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University, Stanford
23. Pastor-Satorras R, Vespignani A (2001) Epidemic spreading in scale-free networks. Phys Rev Lett 86:3200–3203
24. Watts DJ (1999) Small Worlds: The Dynamics of Networks between Order and Randomness. Princeton Studies in Complexity. Princeton University Press, Princeton
25. Watts D, Strogatz S (1998) Collective dynamics of 'small-world' networks. Nature 393:440–442
Books and Reviews

Albert R, Barabási AL (2002) Statistical Mechanics of Complex Networks. Rev Mod Phys 74:47–97
Bornholdt S, Schuster HG (2003) Handbook of Graphs and Networks: From the Genome to the Internet. Wiley-VCH, Berlin
Caldarelli G, Vespignani A (2007) Large Scale Structure and Dynamics of Complex Networks: From Information Technology to Finance and Natural Science. Cambridge University Press, Cambridge
Chung F, Lu L (2006) Complex Graphs and Networks. CBMS Regional Conference Series in Mathematics, vol 107. AMS, Providence
da F Costa L, Oliveira ON Jr, Travieso G, Rodrigues FA, Villas Boas PR, Antiqueira L, Viana MP, da Rocha LEC: Analyzing and Modeling Real-World Phenomena with Complex Networks: A Survey of Applications. Working paper: http://arxiv.org/abs/0711.3199
Kauffman SA (1993) The Origins of Order. Oxford University Press, Oxford
Newman MEJ (2003) The structure and function of complex networks. SIAM Rev 45:167–256
Newman M, Barabási A, Watts DJ (2006) The Structure and Dynamics of Networks. Princeton Studies in Complexity. Princeton University Press, Princeton
Complex Networks, Visualization of
VLADIMIR BATAGELJ
University of Ljubljana, Ljubljana, Slovenia

Article Outline

Glossary
Definition of the Subject
Introduction
Attempts
Perspectives
Bibliography

Glossary

For basic notions on graphs and networks, see the articles by Wouter de Nooy: Social Network Analysis, Graph Theoretical Approaches to and by Vladimir Batagelj: Social Network Analysis, Large-Scale in the Social Networks Section. For complementary information on graph drawing in social network analysis, see the article by Linton Freeman: Social Network Visualization, Methods of.

k-core A set of vertices in a graph is a k-core if each vertex from the set has an internal (restricted to the set) degree of at least k and the set is maximal – no such vertex can be added to it.

Network A network consists of vertices linked by lines, together with additional data about the vertices and/or lines. A network is large if it has at least some hundreds of vertices. Large networks can be stored in computer memory.

Partition A partition of a set is a family of its nonempty subsets such that each element of the set belongs to exactly one of the subsets. The subsets are also called classes or groups.

Spring embedder Another name for the energy-minimization graph drawing method. The vertices are treated as particles with a repulsive force between them, and the lines as springs that attract or repel the vertices when they are too far apart or too close, respectively. The algorithm determines an embedding of the vertices in two- or three-dimensional space that minimizes the 'energy' of the system.

Definition of the Subject

The earliest pictures containing graphs were magic figures, connections between different concepts (for example the Sephirot in Jewish Kabbalah), game boards (nine men's morris, pachisi, patolli, go, xiangqi, and others), and road maps
(for example Roman roads in Tabula Peutingeriana), and genealogical trees of important families [33]. The notion of the graph was introduced by Euler. In the eighteenth and nineteenth centuries, graphs were used mainly for solving recreational problems (the Knight's tour, Eulerian and Hamiltonian problems, map coloring). At the end of the nineteenth century, some applications of graphs to real-life problems appeared (electric circuits, Kirchhoff; molecular graphs, Kekulé). In the twentieth century, graph theory evolved into its own field of discrete mathematics, with applications to transportation networks (road and railway systems, metro lines, bus lines), project diagrams, flowcharts of computer programs, electronic circuits, molecular graphs, etc. In social science the use of graphs was introduced by Jacob Moreno around 1930 as a basis of his sociometric approach. In his book Who Shall Survive? [36], a relatively large network, Sociometric geography of community – map III (435 individuals, 4350 lines), is presented. Linton Freeman wrote a detailed account of the development of social network analysis [20] and of the visualization of social networks [19]. The networks studied in social network analysis until the 1990s were mostly small – some tens of vertices.

Introduction

Through the 1980s, the development of information technology (IT) laid the groundwork for the emerging field of computer graphics. During this time, the first algorithms for graph drawing appeared:

Trees: Wetherell and Shannon [45].
Acyclic graphs: Sugiyama [42].
Energy minimization methods (spring embedders) for general graphs: Eades [17], Kamada and Kawai [30], Fruchterman and Reingold [21].

In energy minimization methods, vertices are considered as particles with a repulsive force between them, and lines as springs that attract or repel the vertices when they are too far apart or too close, respectively. The algorithms provide a means of determining an embedding of vertices in two- or three-dimensional space that minimizes the 'energy' of the system. As early as 1963, William Tutte proposed an algorithm for drawing planar graphs [43], and Donald E. Knuth put forth an algorithm for drawing flowcharts [31]. A well-known example of an early graph visualization was produced by Alden Klovdahl using his program View_Net – see Fig. 1. As nice a piece of art as it was, it held an important message: there are big problems with the visualization of dense, large graphs.
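The energy-minimization scheme described above is simple enough to sketch directly. The following Python fragment is an illustration of a Fruchterman–Reingold-style spring embedder, not code from any of the systems discussed in this article; all names and parameter values are chosen here for exposition only.

```python
import math
import random

def spring_layout(nodes, edges, iterations=200, width=1.0, seed=0):
    """Fruchterman-Reingold style layout: every vertex pair repels,
    every edge acts as a spring that attracts its endpoints."""
    rnd = random.Random(seed)
    pos = {v: [rnd.random(), rnd.random()] for v in nodes}
    k = width / math.sqrt(len(nodes))   # ideal edge length
    t = width / 10                      # "temperature" caps movement per step
    for _ in range(iterations):
        disp = {v: [0.0, 0.0] for v in nodes}
        for v in nodes:                 # repulsive force ~ k^2 / d
            for u in nodes:
                if u == v:
                    continue
                dx = pos[v][0] - pos[u][0]
                dy = pos[v][1] - pos[u][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[v][0] += dx / d * f
                disp[v][1] += dy / d * f
        for u, v in edges:              # attractive force ~ d^2 / k
            dx = pos[v][0] - pos[u][0]
            dy = pos[v][1] - pos[u][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k
            disp[v][0] -= dx / d * f
            disp[v][1] -= dy / d * f
            disp[u][0] += dx / d * f
            disp[u][1] += dy / d * f
        for v in nodes:                 # move each vertex, capped by t
            d = math.hypot(disp[v][0], disp[v][1]) or 1e-9
            pos[v][0] += disp[v][0] / d * min(d, t)
            pos[v][1] += disp[v][1] / d * min(d, t)
        t *= 0.95                       # cool down

    return pos

# Lay out a 4-cycle; the vertices settle into a roughly square shape.
pos = spring_layout([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3), (3, 0)])
```

The quadratic all-pairs repulsion loop is exactly what makes the naive method infeasible for large graphs, which motivates the multilevel and spatial-approximation speed-ups discussed later in this article.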
Complex Networks, Visualization of, Figure 1 Klovdahl: Social links in Canberra, Australia
These developments led to the emergence of a new field: graph drawing. In 1992 a group of computer scientists and mathematicians (Giuseppe Di Battista, Peter Eades, Michael Kaufmann, Pierre Rosenstiehl, Kozo Sugiyama, Roberto Tamassia, Ioannis Tollis, and others) started the conference International Symposium on Graph Drawing which takes place each year. The proceedings of the conference are published in the Lecture Notes in Computer Science series by Springer [24]. To stimulate new approaches to graph drawing, a graph drawing contest accompanies each conference. Many papers on graph drawing are published in the Journal of Graph Algorithms and Applications [79]. Most of the efforts of the graph drawing community were spent on problems of drawing special types of graphs (trees, acyclic graphs, planar graphs) or using special styles (straight lines, orthogonal, grid-based, circular, hierarchical) and deriving bounds on required space (area) for selected types of drawing. In the 1990s, further development of IT (GUI, multimedia, WWW) made large graph analysis a reality. For example, see the studies of large organic molecules (PDB [51]), the Internet (Caida [52]), and genealogies (White and Jorion [46], FamilySearch [60]). In chemistry, several tools for dynamic three-dimensional visualization and inspection of molecules were developed (Kinemage [48], Rasmol [87], MDL Chime [82]). One of the earliest systems for large networks was SemNet (see Fairchild, Poltrock, Furnas [18]), used to explore knowledge bases represented as directed graphs. In 1991 Tom Sawyer Software [91] was founded – the premier provider of high performance graph visualization, layout, and analysis systems that enable the user to see and interpret complex information to make better decisions.
Complex Networks, Visualization of, Figure 2 Network of traceroute paths for 29 June 1999
In 1993 at AT&T, the development of the GraphViz tools for graph visualization began (dot, neato, dotty, tcldot, libgraph) [69]. Becker, Eick and Wilks developed the SeeXYZ family of network visualization programs [8]. In 1996 Vladimir Batagelj and Andrej Mrvar started the development of Pajek – a program for large network analysis and visualization [5]. In 1997 at La Sapienza, Rome, the development of GDToolkit [63] started as an extension of LEDA (Library of Efficient Data types and Algorithms) to provide implementations of several classical graph-drawing algorithms. The new version, GDT 4.0 (2007), produced in collaboration with the University of Perugia, is LEDA independent. Graham Wills, a principal investigator at Bell Labs (1992–2001), built the NicheWorks system for the visual analysis of very large weighted network graphs (up to a million vertices). In the summer of 1998 Bill Cheswick and Hal Burch started work on the Internet Mapping Project at Bell Labs [53]. Its goal was to acquire and save Internet topological data over a long period of time. This data has been used in the study of routing problems and changes, distributed denial of service (DDoS) attacks, and graph theory. In the fall of 2000 Cheswick and Burch moved to a spin-off from Lucent/Bell Labs named Lumeta Corporation. Bill Cheswick is now back at AT&T Labs. Figure 2 shows a network obtained from traceroute paths for 29 June 1999 with nearly 100 000 vertices. In the years 1997–2004, Martin Dodge maintained his Cybergeography Research web pages [58]. The results were published in the book The Atlas of Cyberspace [14]. A newer, very rich site on information visualization is Visual complexity [95], where many interesting ideas on
Complex Networks, Visualization of, Figure 3 FAS: The scientific field of Austria
network visualizations can be found. These examples, and many others, can also be accessed from the Infovis 1100+ examples of information visualization site [77]. An interesting collection of graph/network visualizations can also be found on the CDs of Gerhard Dirmoser. The collection also contains many artistic examples and other pictures not produced by computers. In 1997 Harald Katzmair founded FAS research in Vienna, Austria [61], a company providing network analysis services. FAS emphasizes the importance of nice-looking final products (pictures) for customers, using graphical tools to enhance the visual quality of results obtained from network analysis tools. In Fig. 3, a network of Austrian research projects is presented. A similar company, Aguidel [49], was founded in France by Andrei Mogoutov, author of the program Réseau-Lu. Every year since 2002, at the INSNA Sunbelt conference [78], the Viszards group has held a special session in which they present their solutions – analyses and visualizations of selected networks or types of networks. Most of the selected networks are large (KEDS, Internet Movie Data Base, Wikipedia, Web of Science).

Attempts

The new millennium has seen several attempts to develop programs for drawing large graphs and networks. Most
of the following descriptions are taken verbatim from the programs’ web pages. Stephen Kobourov, with his collaborators from the University of Arizona, developed two graph drawing systems: GRIP (2000) [22,70] and Graphael (2003) [66]. GRIP – Graph dRawing with Intelligent Placement was designed for drawing large graphs and uses a multi-dimensional force-directed method together with fast energy function minimization. It employs a simple recursive coarsening scheme – rather than being placed at random, vertices are placed intelligently, several at a time, at locations close to their final positions. The Cooperative Association for Internet Data Analysis (CAIDA) [52], co-founded in 1998 by kc claffy, is an independent research group dedicated to investigating both the practical and theoretical aspects of the Internet to promote the engineering and maintenance of a robust, scalable, global Internet infrastructure. They have been focusing primarily on understanding how the Internet is evolving, and on developing a state-of-the-art infrastructure for data measurement that can be shared with the entire research community. Figure 4 represents a macroscopic snapshot of the Internet for two weeks: 1–17 January 2008. The graph reflects 4 853 991 observed IPv4 addresses and 5 682 419 IP links. The network is aggregated into a topology of
Complex Networks, Visualization of, Figure 4 CAIDA: AS core 2008
Autonomous Systems (ASes). The abstracted graph consists of 17 791 ASes (vertices) and 50 333 peering sessions (lines). Walrus is a tool for interactively visualizing large directed graphs in three-dimensional space. It is best suited to visualizing moderately sized graphs (a few hundred thousand vertices) that are nearly trees. Walrus uses three-dimensional hyperbolic geometry to display graphs under a fisheye-like magnifier. By bringing different parts of a graph to the magnified central region, the user can examine every part of the graph in detail. Walrus was developed by Young Hyun at CAIDA, based on research by Tamara Munzner. Figure 5 presents two examples of visualizations produced with Walrus. Some promising algorithms for drawing large graphs have been proposed by Ulrik Brandes, Tim Dwyer, Emden Gansner, Stefan Hachul, David Harel, Michael Jünger, Yehuda Koren, Andreas Noack, Stephen North, Christian Pich, and Chris Walshaw [11,16,23,25,26,32,39,44]. They are based either on a multilevel energy-minimization approach or on an algebraic or spectral approach that reduces to some application of eigenvectors. The multilevel approach speeds up the algorithms. Multilevel algorithms are based on two phases: a coarsening phase, in which a sequence of coarse graphs with decreasing sizes is computed, and a refinement phase,
in which successively finer drawings of graphs are computed, using the drawings of the next coarser graphs and a variant of a suitable force-directed single-level algorithm [25]. The fastest algorithms combine the multilevel approach with fast approximation of the long-range repulsive forces using nested data structures such as the quadtree or kd-tree. Katy Börner from Indiana University, with her collaborators, has produced several visualizations of scientometric networks such as Backbone of Science [9] and Wikipedia [73]. They use different visual cues to produce information-rich visualizations – see Fig. 6. She also commissioned the Map of Science, based on data (800 000 published papers) from Thomson ISI and produced by Kevin Boyack, Richard Klavans and Bradford Paley [9]. Yifan Hu from the AT&T Labs Information Visualization Group developed a multilevel graph drawing algorithm for the visualization of large graphs [29]. The algorithm was first implemented in Mathematica in 2004 and released in 2005. For demonstration he applied it to the University of Florida Sparse Matrix Collection [56], which contains over 1500 square matrices. The results are available in the Gallery of Large Graphs [74]. The largest graph (vanHeukelum/cage15) has 5 154 859 vertices and 47 022 346 edges. In Fig. 7, selected pictures from the gallery are presented.
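The coarsening phase of a multilevel algorithm can be illustrated with a simple matching-based scheme: adjacent vertices are greedily paired, and each pair is merged into one coarse vertex. The sketch below is illustrative only (the names are invented here); a real implementation would also accumulate edge weights and be followed by the refinement phase.

```python
def coarsen(nodes, edges):
    """One coarsening step: greedily match adjacent vertices and
    merge each matched pair into a single coarse vertex."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    merged = {}        # fine vertex -> coarse vertex id
    matched = set()
    cid = 0
    for v in nodes:
        if v in matched:
            continue
        # pick any still-unmatched neighbor as the merge partner
        partner = next((u for u in adj[v] if u not in matched), None)
        merged[v] = cid
        matched.add(v)
        if partner is not None:
            merged[partner] = cid
            matched.add(partner)
        cid += 1
    coarse_nodes = list(range(cid))
    coarse_edges = {(min(merged[u], merged[v]), max(merged[u], merged[v]))
                    for u, v in edges if merged[u] != merged[v]}
    return coarse_nodes, sorted(coarse_edges), merged

# A path 0-1-2-3-4-5 coarsens to half the vertices: [0, 1, 2]
# with edges [(0, 1), (1, 2)].
cnodes, cedges, mapping = coarsen(list(range(6)),
                                  [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)])
```

Applying this step repeatedly yields the sequence of ever-smaller graphs on which the force-directed layout is first solved cheaply and then refined back up.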
Complex Networks, Visualization of, Figure 5 Walrus
Complex Networks, Visualization of, Figure 6 Katy Börner: Text analysis
Complex Networks, Visualization of, Figure 7 Examples from the Gallery of Large Graphs
From the examples given, we can see that in some cases graph drawing algorithms can reveal symmetries in a given graph, and also 'structure' ((sub)trees, clusters, planarity, etc.). Challenges remain in devising ways to represent graphs with dense parts. A better approach for dense parts is to display them using a matrix representation. This representation was used in 1999 by Vladimir Batagelj, Andrej Mrvar and Matjaž Zaveršnik in their partitioning approach to visualization of large graphs [7], and it is the basis of systems such as Matrix Zoom by James Abello and Frank van Ham, 2004 [1,2], and MatrixExplorer by Nathalie Henry and Jean-Daniel Fekete, 2006 [27]. A matrix representation is determined by an ordering of vertices. Several algorithms exist that can produce such orderings; a comparative study of them was published by Chris Mueller [37,84]. In Fig. 8 three orderings of the same matrix are presented. Most ordering algorithms were originally designed for applications in numerical, rather than data, analysis. The orderings can also be determined using clustering or blockmodeling methods [15].

Complex Networks, Visualization of, Figure 8 Matrix representations

An important type of network is the temporal network, where the presence of vertices and lines changes through time. Visualization of such networks requires special approaches (Sonia [88], SVGanim [90], TecFlow [75]). An interesting approach to the visualization of temporal networks was developed by Ulrik Brandes and his group [12].

Perspectives

In this section, we present a collection of ideas on how to approach the visualization of large networks. These ideas are only partially implemented in different visualization solutions. While the technical problems of graph drawing strive for a single 'best' picture, network analysis is also a part of data analysis. Its goal is to gain insight not only into the structure and characteristics of a given network, but also into how this structure influences processes going on over the network. We usually need several pictures to present the obtained results. Small graphs can be presented in their totality and in detail within a single view. In a comprehensive view of a large graph, details become lost; conversely, a detailed view can encompass only a part of a large graph. The literature on graph drawing is dominated by the 'sheet of paper' paradigm – the solutions and techniques are mainly based on the assumption that the final result is a static picture on a sheet of paper. In this model, to present a large data set we need a large 'sheet of paper' – but this has a limit. Figure 9 presents a visualization of a symmetrized subnetwork of 5952 words and 18 008 associations from the Edinburgh Associative Thesaurus [59], prepared by Vladimir Batagelj on a 3 m × 5 m 'sheet of paper' for the Ars Electronica, Linz 2004 Language of networks exhibition.
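The effect of the vertex ordering on a matrix representation, mentioned above, can be made concrete with a toy measure: how far the nonzero entries lie from the diagonal. The sketch below is purely illustrative (the function names are invented here, and this is not one of the published ordering algorithms); it compares a scrambled and a natural ordering of the same path graph.

```python
def adjacency_matrix(nodes, edges, order):
    """Render the 0/1 adjacency matrix under a given vertex ordering."""
    index = {v: i for i, v in enumerate(order)}
    n = len(nodes)
    m = [[0] * n for _ in range(n)]
    for u, v in edges:
        m[index[u]][index[v]] = m[index[v]][index[u]] = 1
    return m

def bandwidth(m):
    """Maximum distance of a nonzero entry from the diagonal."""
    return max((abs(i - j) for i, row in enumerate(m)
                for j, x in enumerate(row) if x), default=0)

# A path graph labelled in scrambled order: following the path
# pulls all nonzeros onto a narrow diagonal band.
nodes = ['c', 'a', 'd', 'b', 'e']
edges = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e')]
scrambled = bandwidth(adjacency_matrix(nodes, edges, nodes))
ordered = bandwidth(adjacency_matrix(nodes, edges, ['a', 'b', 'c', 'd', 'e']))
print(scrambled, ordered)   # 3 1
```

A good ordering concentrates the nonzeros into visible blocks or bands, which is exactly what makes the matrix pictures of Fig. 8 readable.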
The main tool for dealing with large objects is abstraction. In graphs, abstraction is usually realized using a hierarchy of partitions. By shrinking selected classes of a partition we obtain a smaller, reduced graph. The main operations related to abstraction are:

Cut-out: Display only selected parts (classes of a partition) of a graph;
Context: Display details of selected parts (classes) of a graph, and display the rest of the graph in some reduced form;
Model: Display the reduced graph with respect to a given partition;
Hierarchy: Display the tree representing the nesting of graph partitions.

In larger, denser networks there is often too much information to be presented at once. A possible answer is an interactive layout on a computer screen where the user controls what (s)he wants to see. The computer screen is a medium which offers many new possibilities: parallel views (global and local); brushing and linking; zooming and panning; temporary elements (additional information about the selected elements, labels, legends, markers, etc.); highlighted selections; and others. These features can and should be maximally leveraged to support data analytic tasks; or, repeating Shneiderman's mantra: overview first, zoom and filter, then details on demand (extended with: relate, history and extract) [40]. When interactively inspecting very large graphs, a serious problem appears: how does one avoid the "lost within the forest" effect? There are several solutions that can help the user maintain orientation:

Restart option: Returns the user to the starting point;
Additional orientation elements: Introduces elements that can be switched on and off;
Multiview: Presents at least two views (windows):
Complex Networks, Visualization of, Figure 9 Big picture, V. Batagelj, AE’04
– Map view: Shows an overall global view which contains the current position and allows 'long' moves (jumps). For very large graphs, a map view can be combined with zooming or fish-eye views.
– Local view: Displays a selected portion of the graph.

Additional support can be achieved by implementing trace, backtrack, and replay mechanisms and guided tours. An interactive dynamic visualization of a graph on the computer screen need not be displayed in its totality. Inspecting a visualization, the user can select which parts and elements will be displayed, and in what way. See, for example, TouchGraph [92]. Closely related to the multiview concept are the associated concepts of glasses, lenses and zooming. Glasses affect the entire window, while lenses affect only a selected region or selected elements. By selecting different glasses, we can obtain different views of the same data, supporting different visualization aims. For example, in Fig. 10 four different glasses (ball-and-stick, space-fill, backbone, ribbons) were applied in the program Rasmol to the molecule 1atn.pdb (deoxyribonuclease I complex with actin). Another example of glasses is presented in Fig. 11. The two pictures were produced by James Moody [35]. The graph pictured was obtained by applying spring embedders. It represents the friendships among students in a school. The glasses are the coloring of its vertices by different partitions: an age partition (left picture) and a race partition (right picture). This gives us an explanation of the four groups in the resulting graph picture, characterized by younger/older and white/black students. Figure 12 shows a part of the big picture presented in Fig. 9. The glasses in this case are based on ordering the edges by increasing value and drawing them in that order – stronger edges cover the weaker ones. The picture emphasizes the strongest substructures; the remaining elements form a background. There are many kinds of glasses in the representation of graphs, for example: fish-eye views, matrix representation, conventions of the application field (genealogies, molecules, electric circuits, SBGN), displaying vertices only, selecting the type of labels (long/short name, value), displaying only the important vertices and/or lines, or sizing vertices by core number or betweenness. An example of a lens is presented in Fig. 13 – contributions of companies to various presidential candidates, from Follow the Oil Money by Greg Michalec and Skye Bender-deMoll [62]. When a vertex is selected, information about that vertex is displayed. Another possible use of a lens would be to temporarily enhance the display of the neighbors of a selected vertex [94] or to display their labels.
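The 'Model' operation described earlier, shrinking each class of a partition to a single vertex of a reduced graph, can be sketched in a few lines. This is an illustration only (the function name and the sample data are invented here), ignoring line values and loops that a full implementation would track.

```python
def shrink(edges, partition):
    """Model operation: replace each class of the partition by one
    coarse vertex; keep a single line between connected classes."""
    quotient = set()
    for u, v in edges:
        cu, cv = partition[u], partition[v]
        if cu != cv:                       # drop lines inside a class
            quotient.add((min(cu, cv), max(cu, cv)))
    return sorted(quotient)

# Two triangles joined by a bridge, each triangle forming one class:
# the reduced graph is a single line between the two classes.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
partition = {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}
reduced = shrink(edges, partition)
print(reduced)   # [('A', 'B')]
```

Applied recursively over a hierarchy of partitions, this is the basic mechanism behind the cut-out, context, and model displays listed above.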
Complex Networks, Visualization of, Figure 10 Glasses: Rasmol displays – BallStick, SpaceFill, Backbone, Ribbons
Complex Networks, Visualization of, Figure 11 Glasses: Display of properties – school
Complex Networks, Visualization of, Figure 12 Part of the big picture
Complex Networks, Visualization of, Figure 13 Lenses: Temporary info about the selected vertex
The “shaking” option used in Pajek to visually identify all vertices from a selected cluster is also a kind of lens; so are the matrix representations of selected clusters in NodeTrix [72]. Additional enhancement of a presentation can be achieved by the use of support elements such as labels, grids, legends, and various forms of help facilities.
An important concept connected with zooming is the level of detail, or LOD – subobjects are displayed differently depending on the zoom depth. A nice example of a combination of these techniques is the Google Maps service [65] – see Fig. 14. It combines zooming, glasses (Map, Satellite, Terrain), navigation (left, right, up, down) and lenses (info about points). The maps
Complex Networks, Visualization of, Figure 14 Zoom, glasses, lenses, navigation: Google Maps
Complex Networks, Visualization of, Figure 15 Zoom, glasses, lenses, navigation: Grokker
at different zoom levels provide information at different levels of detail and in different forms. A similar approach could be used for inspection of a large graph or network by examining selected hierarchical clusterings of its vertices. To produce higher-level ‘maps,’ different methods can be used: k-core representation [4], density contours [83], generalized blockmodeling [15], clustering [71] (Fig. 15), preserving only important vertices and lines, etc. In visualizing ‘maps,’ new graphical elements (many of them still to be invented) can be used (see [13,80], p. 223 in [15]) to preserve or to indicate information about structures at lower levels. The k-core representation [4,81] is based on the k-core decomposition of a network [6,7] and was developed by Alessandro Vespignani and his collaborators [54]. Figure 16 shows a portion of the web at the .fr domain with
1 million pages. Each node represents a web page and each edge is a hyperlink between two pages.

Complex Networks, Visualization of, Figure 16 k-core structure of a portion of the web at the .fr domain

Density contours were introduced by James Moody in 2006. First, a spring-embedder layout of a (valued) network is determined. Next, vertices and lines are removed and replaced by density contours. Figure 17 shows this process applied to the case of a social science co-citation network. The left side shows the network layout and the right bottom part presents the corresponding density contours.

The basic steps in graph/network visualization are:

graph/network → analysis → layouts → viewer → pictures.

Development of different tools can be based on this scheme, depending on the kind of users (simple, advanced) and the tasks they address (reporting, learning, monitoring, exploration, analysis). In some cases, a simple viewer will be sufficient (for example an SVG viewer, an X3D viewer, or a special graph layout viewer); in others a complete network analysis system is needed (such as Geomi [3,64], ILOG [76], Pajek [50], Tulip [93], yFiles [96]). To visualize a network, layouts are obtained by augmenting network data with results of analysis and users’ decisions. In Pajek’s input format, there are several layout elements from Pajek’s predecessors (see Pajek’s manual, pp 69–73 in [50]). As in typesetting

text + formatting = formatted text

so in network visualization

network + layout = picture.

It would be useful to define a common layout format (an extension of GraphML [68]?) so that independent viewer modules can be developed and combined with different layout algorithms. Some useful ideas can be found in the nViZn (“envision”) system [89]. To specify layouts we can borrow from typesetting the notion of style.

Bibliography

Primary Literature
1. Abello J, van Ham F (2004) Matrix zoom: A visual interface to semi-external graphs. IEEE Symposium on Information Visualization, October 10–12 2004, Austin, Texas, USA, pp 183–190
2. Abello J, van Ham F, Krishnan N (2006) ASK-GraphView: A large scale graph visualization system. IEEE Trans Vis Comput Graph 12(5):669–676
3. Ahmed A, Dwyer T, Forster M, Fu X, Ho J, Hong S, Koschützki D, Murray C, Nikolov N, Taib R, Tarassov A, Xu K (2006) GEOMI: GEOmetry for maximum insight. In: Healy P, Eades P (eds) Proc 13th Int Symp Graph Drawing (GD 2005). Lecture Notes in Computer Science, vol 3843. Springer, Berlin, pp 468–479
4. Alvarez-Hamelin JI, Dall'Asta L, Barrat A, Vespignani A (2005) Large scale networks fingerprinting and visualization using the k-core decomposition. In: Advances in Neural Information Processing Systems 18, NIPS 2005, December 5–8, 2005, Vancouver, British Columbia, Canada
5. Batagelj V, Mrvar A (2003) Pajek – analysis and visualization of large networks. In: Jünger M, Mutzel P (eds) Graph drawing software. Springer, Berlin, pp 77–103
6. Batagelj V, Zaveršnik M (2002) Generalized cores. arXiv cs.DS/0202039
7. Batagelj V, Mrvar A, Zaveršnik M (1999) Partitioning approach to visualization of large graphs. In: Kratochvíl J (ed) Lecture Notes in Computer Science, vol 1731. Springer, Berlin, pp 90–97
8. Becker RA, Eick SG, Wilks AR (1995) Visualizing network data. IEEE Trans Vis Comput Graph 1(1):16–28
9. Boyack KW, Klavans R, Börner K (2005) Mapping the backbone of science. Scientometrics 64(3):351–374
10. Boyack KW, Klavans R, Paley WB (2006) Map of science. Nature 444:985
11. Brandes U, Pich C (2007) Eigensolver methods for progressive multidimensional scaling of large data. In: Proc 14th Int Symp Graph Drawing (GD '06). Lecture Notes in Computer Science, vol 4372. Springer, Berlin, pp 42–53
12. Brandes U, Fleischer D, Lerner J (2006) Summarizing dynamic bipolar conflict structures. IEEE Trans Vis Comput Graph (special issue on Visual Analytics) 12(6):1486–1499
13. Dickerson M, Eppstein D, Goodrich MT, Meng J (2005) Confluent drawings: Visualizing non-planar diagrams in a planar way. J Graph Algorithms Appl (special issue for GD'03) 9(1):31–52
14. Dodge M, Kitchin R (2001) The atlas of cyberspace. Pearson Education, Addison Wesley, New York
15. Doreian P, Batagelj V, Ferligoj A (2005) Generalized blockmodeling. Cambridge University Press, Cambridge
16. Dwyer T, Koren Y (2005) DIG-COLA: Directed graph layout through constrained energy minimization. INFOVIS 2005:9
17. Eades P (1984) A heuristic for graph drawing. Congressus Numerantium 42:149–160
Complex Networks, Visualization of, Figure 17 Density structure
18. Fairchild KM, Poltrock SE, Furnas GW (1988) SemNet: Three-dimensional representations of large knowledge bases. In: Guindon R (ed) Cognitive science and its applications for human-computer interaction. Lawrence Erlbaum, Hillsdale, pp 201–233
19. Freeman LC (2000) Visualizing social networks. J Soc Struct 1(1). http://www.cmu.edu/joss/content/articles/volume1/Freeman/
20. Freeman LC (2004) The development of social network analysis: A study in the sociology of science. Empirical Press, Vancouver
21. Fruchterman T, Reingold E (1991) Graph drawing by force directed placement. Softw Pract Exp 21(11):1129–1164
22. Gajer P, Kobourov S (2001) GRIP: Graph drawing with intelligent placement. In: Graph Drawing 2000. Lecture Notes in Computer Science, vol 1984, pp 222–228
23. Gansner ER, Koren Y, North SC (2005) Topological fisheye views for visualizing large graphs. IEEE Trans Vis Comput Graph 11(4):457–468
24. Graph Drawing. Lecture Notes in Computer Science, vol 894 (1994), 1027 (1995), 1190 (1996), 1353 (1997), 1547 (1998), 1731 (1999), 1984 (2000), 2265 (2001), 2528 (2002), 2912 (2003), 3383 (2004), 3843 (2005), 4372 (2006), 4875 (2007). Springer, Berlin
25. Hachul S, Jünger M (2007) Large-graph layout algorithms at work: An experimental study. J Graph Algorithms Appl 11(2):345–369
26. Harel D, Koren Y (2004) Graph drawing by high-dimensional embedding. J Graph Algorithms Appl 8(2):195–214
27. Henry N, Fekete J-D (2006) MatrixExplorer: A dual-representation system to explore social networks. IEEE Trans Vis Comput Graph 12(5):677–684
28. Herman I, Melançon G, Marshall MS (2000) Graph visualization and navigation in information visualization: A survey. IEEE Trans Vis Comput Graph 6(1):24–43
29. Hu YF (2005) Efficient and high quality force-directed graph drawing. Math J 10:37–71
30. Kamada T, Kawai S (1988) An algorithm for drawing general undirected graphs. Inf Process Lett 31:7–15
31. Knuth DE (1963) Computer-drawn flowcharts. Commun ACM 6(9):555–563
32. Koren Y (2003) On spectral graph drawing. COCOON 2003:496–508
33. Kruja E, Marks J, Blair A, Waters R (2001) A short note on the history of graph drawing. In: Proc Graph Drawing 2001. Lecture Notes in Computer Science, vol 2265. Springer, Berlin, pp 272–286
34. Lamping J, Rao R, Pirolli P (1995) A focus+context technique based on hyperbolic geometry for visualizing large hierarchies. CHI 95:401–408
35. Moody J (2001) Race, school integration, and friendship segregation in America. Am J Soc 107(3):679–716
36. Moreno JL (1953) Who shall survive? Beacon, New York
37. Mueller C, Martin B, Lumsdaine A (2007) A comparison of vertex ordering algorithms for large graph visualization. APVIS 2007, pp 141–148
38. Munzner T (1997) H3: Laying out large directed graphs in 3D hyperbolic space. In: Proc 1997 IEEE Symposium on Information Visualization, 20–21 October 1997, Phoenix, AZ, pp 2–10
39. Noack A (2007) Energy models for graph clustering. J Graph Algorithms Appl 11(2):453–480
40. Shneiderman B (1996) The eyes have it: A task by data type taxonomy for information visualization. In: IEEE Conference on Visual Languages (VL'96). IEEE CS Press, Boulder
41. Shneiderman B, Aris A (2006) Network visualization by semantic substrates. IEEE Trans Vis Comput Graph 12(5):733–740
42. Sugiyama K, Tagawa S, Toda M (1981) Methods for visual understanding of hierarchical systems. IEEE Trans Syst Man Cybern 11(2):109–125
43. Tutte WT (1963) How to draw a graph. Proc London Math Soc s3-13(1):743–767
44. Walshaw C (2003) A multilevel algorithm for force-directed graph-drawing. J Graph Algorithms Appl 7(3):253–285
45. Wetherell C, Shannon A (1979) Tidy drawing of trees. IEEE Trans Softw Eng 5:514–520
46. White DR, Jorion P (1992) Representing and computing kinship: A new approach. Curr Anthropol 33(4):454–463
47. Wills GJ (1999) NicheWorks – interactive visualization of very large graphs. J Comput Graph Stat 8(2):190–212
Web Resources

48. 3D Macromolecule analysis and Kinemage home page: http://kinemage.biochem.duke.edu/. Accessed March 2008
49. Aguidel: http://www.aguidel.com/en/. Accessed March 2008
50. Batagelj V, Mrvar A (1996) Pajek – program for analysis and visualization of large networks: http://pajek.imfm.si. Accessed March 2008. Data sets: http://vlado.fmf.uni-lj.si/pub/networks/data/. Accessed March 2008
51. Brookhaven Protein Data Bank: http://www.rcsb.org/pdb/. Accessed March 2008
52. Caida: http://www.caida.org/home/. Accessed March 2008. Walrus gallery: http://www.caida.org/tools/visualization/walrus/gallery1/. Accessed March 2008
53. Cheswick B: Internet mapping project – map gallery: http://www.cheswick.com/ches/map/gallery/. Accessed March 2008
54. Complex Networks Collaboratory: http://cxnets.googlepages.com/. Accessed March 2008
55. Cruz I, Tamassia R (1994) Tutorial on graph drawing. http://graphdrawing.org/literature/gd-constraints.pdf. Accessed March 2008
56. Davis T: University of Florida Sparse Matrix Collection: http://www.cise.ufl.edu/research/sparse/matrices. Accessed March 2008
57. Di Battista G, Eades P, Tamassia R, Tollis IG (1994) Algorithms for drawing graphs: An annotated bibliography. Comput Geom: Theory Appl 4:235–282. http://graphdrawing.org/literature/gdbiblio.pdf. Accessed March 2008
58. Dodge M: Cyber-Geography Research: http://personalpages.manchester.ac.uk/staff/m.dodge/cybergeography/. Accessed March 2008
59. Edinburgh Associative Thesaurus (EAT): http://www.eat.rl.ac.uk/. Accessed March 2008
60. FamilySearch: http://www.familysearch.org/. Accessed March 2008
61. FASresearch, Vienna, Austria: http://www.fas.at/. Accessed March 2008
62. Follow the Oil Money: http://oilmoney.priceofoil.org/. Accessed March 2008
63. GDToolkit – Graph Drawing Toolkit: http://www.dia.uniroma3.it/~gdt/gdt4/index.php. Accessed March 2008
64. GEOMI (Geometry for Maximum Insight): http://www.cs.usyd.edu.au/~visual/valacon/geomi/. Accessed March 2008
65. Google Maps: http://maps.google.com/. Accessed March 2008
66. Graphael: http://graphael.cs.arizona.edu/. Accessed March 2008
67. Graphdrawing home page: http://graphdrawing.org/. Accessed March 2008
68. GraphML File Format: http://graphml.graphdrawing.org/. Accessed March 2008
69. Graphviz: http://graphviz.org/. Accessed March 2008
70. GRIP: http://www.cs.arizona.edu/~kobourov/GRIP/. Accessed March 2008
71. Grokker – Enterprise Search Management and Content Integration: http://www.grokker.com/. Accessed March 2008
72. Henry N, Fekete J-D, Mcguffin M (2007) NodeTrix: Hybrid representation for analyzing social networks: https://hal.inria.fr/inria-00144496. Accessed March 2008
73. Herr BW, Holloway T, Börner K (2007) Emergent mosaic of wikipedian activity: http://www.scimaps.org/dev/big_thumb.php?map_id=158. Accessed March 2008
74. Hu YF: Gallery of Large Graphs: http://www.research.att.com/~yifanhu/GALLERY/GRAPHS/index1.html. Accessed March 2008
75. iCKN: TeCFlow – a temporal communication flow visualizer for social network analysis: http://www.ickn.org/. Accessed March 2008
76. ILOG Diagrams: http://www.ilog.com/. Accessed March 2008
77. Infovis – 1100+ examples of information visualization: http://www.infovis.info/index.php?cmd=search&words=graph&mode=normal. Accessed March 2008
78. INSNA – International Network for Social Network Analysis: http://www.insna.org/. Accessed March 2008
79. Journal of Graph Algorithms and Applications: http://jgaa.info/. Accessed March 2008
80. KartOO visual meta search engine: http://www.kartoo.com/. Accessed March 2008
81. LaNet-vi – Large Network visualization tool: http://xavier.informatics.indiana.edu/lanet-vi/. Accessed March 2008
82. MDL Chime: http://www.mdli.com/. Accessed March 2008
83. Moody J (2007) The network structure of sociological production II: http://www.soc.duke.edu/~jmoody77/presentations/soc_Struc_II.ppt. Accessed March 2008
84. Mueller C: Matrix visualizations: http://www.osl.iu.edu/~chemuell/data/ordering/sparse.html. Accessed March 2008
85. OLIVE, On-line library of information visualization environments: http://otal.umd.edu/Olive/. Accessed March 2008
86. Pad++: Zoomable user interfaces: Portal filtering and ‘magic lenses’: http://www.cs.umd.edu/projects/hcil/pad++/tour/lenses.html. Accessed March 2008
87. RasMol Home Page: http://www.umass.edu/microbio/rasmol/index2.htm. Accessed March 2008
88. Sonia – Social Network Image Animator: http://www.stanford.edu/group/sonia/. Accessed March 2008
89. SPSS nViZn: http://www.spss.com/research/wilkinson/nViZn/nvizn.html. Accessed March 2008
90. SVGanim: http://vlado.fmf.uni-lj.si/pub/networks/pajek/SVGanim. Accessed March 2008
91. Tom Sawyer Software: http://www.tomsawyer.com/home/index.php. Accessed March 2008
92. TouchGraph: http://www.touchgraph.com/. Accessed March 2008
93. Tulip: http://www.labri.fr/perso/auber/projects/tulip/. Accessed March 2008
94. Viégas FB, Wattenberg M (2007) Many Eyes: http://services.alphaworks.ibm.com/manyeyes/page/Network_Diagram.html. Accessed March 2008
95. Visual complexity: http://www.visualcomplexity.com/vc/. Accessed March 2008
96. yWorks/yFiles: http://www.yworks.com/en/products_yfiles_about.htm. Accessed March 2008
Books and Reviews

Bertin J (1967) Sémiologie graphique. Les diagrammes, les réseaux, les cartes. Mouton/Gauthier-Villars, Paris/La Haye
Brandes U, Erlebach T (eds) (2005) Network analysis: Methodological foundations. LNCS. Springer, Berlin
Carrington PJ, Scott J, Wasserman S (eds) (2005) Models and methods in social network analysis. Cambridge University Press, Cambridge
de Nooy W, Mrvar A, Batagelj V (2005) Exploratory social network analysis with Pajek. Cambridge University Press, Cambridge
di Battista G, Eades P, Tamassia R, Tollis IG (1999) Graph drawing: Algorithms for the visualization of graphs. Prentice Hall, Englewood Cliffs
Jünger M, Mutzel P (eds) (2003) Graph drawing software. Springer, Berlin
Kaufmann M, Wagner D (2001) Drawing graphs, methods and models. Springer, Berlin
Tufte ER (1983) The visual display of quantitative information. Graphics Press, Cheshire
Wasserman S, Faust K (1994) Social network analysis: Methods and applications. Cambridge University Press, Cambridge
Wilkinson L (2000) The grammar of graphics. Statistics and Computing. Springer, Berlin
Computer Graphics and Games, Agent Based Modeling in

BRIAN MAC NAMEE
School of Computing, Dublin Institute of Technology, Dublin, Ireland

Article Outline

Glossary
Definition of the Subject
Introduction
Agent-Based Modelling in Computer Graphics
Agent-Based Modelling in CGI for Movies
Agent-Based Modelling in Games
Future Directions
Bibliography

Glossary

Computer generated imagery (CGI) The use of computer generated images for special effects purposes in film production.
Intelligent agent A hardware or (more usually) software-based computer system that enjoys the properties of autonomy, social ability, reactivity and pro-activeness.
Non-player character (NPC) A computer controlled character in a computer game – as opposed to a player controlled character.
Virtual character A computer generated character that populates a virtual world.
Virtual world A computer generated world in which places, objects and people are represented as graphical (typically three-dimensional) models.

Definition of the Subject

As the graphics technology used to create virtual worlds has improved in recent years, more and more importance has been placed on the behavior of virtual characters in applications such as games, movies and simulations set in these virtual worlds. The behavior of these virtual characters should be believable in order to create the illusion that virtual worlds are populated with living characters. This has led to the application of agent-based modeling to the control of virtual characters. The advantages of using agent-based modeling techniques include removing the requirement to hand-control all agents in a virtual environment, and allowing agents in games to respond to unexpected actions by players or users.
Introduction

Advances in computer graphics technology in recent years have allowed the creation of realistic and believable virtual worlds. However, as such virtual worlds have been developed for applications spanning games, education and movies, it has become apparent that in order to achieve real believability, virtual worlds must be populated with life-like virtual characters. This is where the application of agent-based modeling has found a niche in the areas of computer graphics and, in a huge way, computer games. Agent-based modeling is a perfect solution to the problem of controlling the behaviors of the virtual characters that populate a virtual world. In fact, because virtual characters are embodied and autonomous, these applications require an even stronger notion of agency than many other areas in which agent-based modeling is employed.

Before proceeding any further, and because there are so many competing alternatives, it is worth explicitly stating the definition of an intelligent agent that will inform the remainder of this article. Taken from [83], an intelligent agent is defined as “. . . a hardware or (more usually) software-based computer system that enjoys the following properties:

autonomy: agents operate without the direct intervention of humans or others, and have some kind of control over their actions and internal state;

social ability: agents interact with other agents (and possibly humans) via some kind of agent-communication language;

reactivity: agents perceive their environment, (which may be the physical world, a user via a graphical user interface, a collection of other agents, the INTERNET, or perhaps all of these combined), and respond in a timely fashion to changes that occur in it;

pro-activeness: agents do not simply act in response to their environment, they are able to exhibit goal-directed behavior by taking the initiative.”

Virtual characters implemented using agent-based modeling techniques satisfy all of these properties.
The characters that populate virtual worlds should be fully autonomous and drive their own behaviors (albeit sometimes following the orders of a director or player). Virtual characters should be able to interact believably with other characters and human participants. This property is particularly strong in the case of virtual characters used in games which by their nature are particularly interactive. It is also imperative that virtual characters appear to perceive their environments and react to events that occur in that environment, especially the actions of other
Computer Graphics and Games, Agent Based Modeling in, Figure 1 The three rules used by Reynolds’ original Boids system to simulate flocking behaviors
characters or human participants. Finally, virtual characters should be pro-active in their behaviors and not always require prompting from a human participant in order to take action.

The remainder of this article will proceed as follows. Firstly, a broad overview of the use of agent-based modeling in computer graphics will be given, focusing in particular on the genesis of the field. Following on from this, the focus will switch to the use of agent-based modeling techniques in two particular application areas: computer generated imagery (CGI) for movies, and computer games. CGI has been used to astounding effect in movies for decades, and in recent times has become heavily reliant on agent-based modeling techniques in order to generate CGI scenes containing large numbers of computer generated extras. Computer games developers have also been using agent-based modeling techniques effectively for some time now for the control of non-player characters (NPCs) in games. There is a particularly fine match between the requirements of computer games and agent-based modeling due to the high levels of interactivity required. Finally, the article will conclude with some suggestions for the future directions in which agent-based modeling technology in computer graphics and games is expected to move.

Agent-Based Modelling in Computer Graphics

The serious use of agent-based modeling in computer graphics first arose in the creation of autonomous groups and crowds – for example, crowds of people in a town square or hotel foyer, or flocks of birds in an outdoor scene. While initially this work was driven by visually unappealing simulation applications such as fire safety testing for buildings [75], focus soon turned to the creation of visually realistic and believable crowds for applications such as movies, games and architectural walk-throughs.
Computer graphics researchers realized that creating scenes featuring large virtual crowds by hand (a task that was becoming important for the applications already mentioned) was laborious and time-consuming, and that agent-based modeling techniques could remove some of the animator’s burden. Rather than requiring that animators hand-craft all of the movements of a crowd, agent-based systems could be created in which each character in a crowd (or flock, or swarm) could drive its own behavior. In this way the behavior of a crowd would emerge from the individual actions of the members of that crowd.

Two of the earliest, and seminal, examples of such systems are Craig Reynolds’ Boids system [64] and Tu & Terzopoulos’ animations of virtual fish [76]. The Boids system simulates the flocking behaviors exhibited in nature by schools of fish, or flocks of birds. The system was first presented at the prestigious SIGGRAPH conference (www.siggraph.org) in 1987 and was accompanied by the short movie “Stanley and Stella in: Breaking the Ice”. Taking influence from the area of artificial life (or aLife) [52], Reynolds postulated that the individual members of a flock would not be capable of complex reasoning, and so flocking behavior must emerge from simple decisions made by individual flock members. This notion of emergent behavior is one of the key characteristics of aLife systems. In the original Boids system, each virtual agent (represented as a simple particle and known as a boid) used just three rules to control its movement. These were separation, alignment and cohesion, and are illustrated in Fig. 1. Based on just these three simple rules, extremely realistic flocking behaviors emerged. This freed animators from the laborious task of hand-scripting the behavior of each creature within the flock and perfectly demonstrates the advantage offered by agent-based modeling techniques for this kind of application.
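The three rules can be captured in a short two-dimensional sketch. All weights, radii and starting data below are invented for illustration; Reynolds' original formulation differs in detail (it uses clamped steering forces, among other refinements):

```python
import math

def step(boids, radius=5.0, sep_dist=1.0):
    """Advance a flock one tick. Each boid is ((x, y), (vx, vy))."""
    new = []
    for i, (pos, vel) in enumerate(boids):
        neighbors = [(p, v) for j, (p, v) in enumerate(boids)
                     if j != i and math.dist(p, pos) < radius]
        ax = ay = 0.0
        if neighbors:
            # Cohesion: steer toward the neighbors' centre of mass.
            cx = sum(p[0] for p, _ in neighbors) / len(neighbors)
            cy = sum(p[1] for p, _ in neighbors) / len(neighbors)
            ax += 0.01 * (cx - pos[0]); ay += 0.01 * (cy - pos[1])
            # Alignment: steer toward the neighbors' average velocity.
            avx = sum(v[0] for _, v in neighbors) / len(neighbors)
            avy = sum(v[1] for _, v in neighbors) / len(neighbors)
            ax += 0.05 * (avx - vel[0]); ay += 0.05 * (avy - vel[1])
            # Separation: steer away from neighbors that are too close.
            for p, _ in neighbors:
                d = math.dist(p, pos)
                if 0 < d < sep_dist:
                    ax += 0.1 * (pos[0] - p[0]) / d
                    ay += 0.1 * (pos[1] - p[1]) / d
        vel = (vel[0] + ax, vel[1] + ay)
        new.append(((pos[0] + vel[0], pos[1] + vel[1]), vel))
    return new

flock = [((0.0, 0.0), (1.0, 0.0)), ((2.0, 1.0), (1.0, 0.1)),
         ((1.0, 3.0), (0.9, -0.1))]
for _ in range(100):   # flocking emerges from repeated local updates
    flock = step(flock)
```

Note that no boid has a global view of the flock: each rule uses only local neighbors, which is exactly why the collective motion is emergent rather than scripted.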
The system created by Tu and Terzopoulos took a more complex approach in that they created complex
models of biological fish. Their models took into account fish physiology, with a complex model of fish muscular structure, along with a perceptual model of fish vision. Using these they created sophisticated simulations in which properties such as schooling and predator avoidance were displayed. The advantage of this approach was that it was possible to create unique, unscripted, realistic simulations without the intervention of human animators. Terzopoulos has since gone on to apply similar techniques to the control of virtual humans [68].

Moving from animals to crowds of virtual humans, the Virtual Reality Lab at the Ecole Polytechnique Fédérale de Lausanne in Switzerland (vrlab.epfl.ch), led by Daniel Thalmann, has been at the forefront of this work for many years. The group currently has a highly evolved system, ViCrowd, for the animation of virtual crowds [62], which it models as a hierarchy moving from individuals to groups to crowds. This hierarchy is used to avoid some of the complications which arise from trying to model large crowds in real time – one of the key goals of ViCrowd. Each of the levels in the ViCrowd hierarchy can be modeled as an agent, and this is done based on beliefs, desires and intentions. The beliefs of an agent represent the information that the agent possesses about the world, including information about places, objects and other agents. An agent’s desires represent the motivations of the agent regarding objectives it would like to achieve. Finally, the intentions of an agent represent the actions that an agent has chosen to pursue. The belief-desire-intention (BDI) model of agency was proposed by Rao and Georgeff [61] and has been used in many other application areas of agent-based modeling. ViCrowd has been used in ambitious applications including the simulation of a virtual city comprised of, amongst other things, a train station, a park and a theater [22].
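The belief-desire-intention decomposition described above can be sketched in a few lines. Everything here (the class name, the precondition-based deliberation rule, the toy desires) is invented for illustration and is far simpler than Rao and Georgeff's formal model or ViCrowd itself:

```python
from dataclasses import dataclass, field

@dataclass
class BDIAgent:
    beliefs: dict = field(default_factory=dict)    # facts about the world
    desires: list = field(default_factory=list)    # (goal, precondition) pairs
    intentions: list = field(default_factory=list) # goals committed to

    def perceive(self, percept):
        """Update beliefs from a new observation of the world."""
        self.beliefs.update(percept)

    def deliberate(self):
        """Commit to every desire whose precondition is believed to hold."""
        self.intentions = [goal for goal, precond in self.desires
                           if self.beliefs.get(precond)]

    def act(self):
        """Turn current intentions into actions."""
        return [f"do:{goal}" for goal in self.intentions]

agent = BDIAgent(desires=[("buy_ticket", "at_station"), ("rest", "in_park")])
agent.perceive({"at_station": True, "in_park": False})
agent.deliberate()
print(agent.act())  # ['do:buy_ticket']
```

The separation of the three stores is the point of the model: perception changes only beliefs, deliberation turns desires into intentions, and only intentions drive action.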
In all of these environments the system was capable of driving the believable behaviors of large groups of characters in real-time.

It should be apparent from the examples given thus far that the use of agent-based modeling techniques to control virtual characters gives rise to a range of unique requirements when compared to the use of agent-based modeling in other application areas. The key to understanding these is to realize that the goal in designing agents for the control of virtual characters is typically not to design the most efficient or effective agent, but rather to design the most interesting or believable character. Outside of very practical applications such as evacuation simulations, when creating virtual characters, designers are concerned with maintaining what Disney, experts in this field, refer to as the illusion of life [36].
This refers to the fact that the user of a system must believe that virtual characters are living, breathing creatures with goals, beliefs, desires, and, essentially, lives of their own. Thus, it is not so important for a virtual human to always choose the most efficient or cost effective option available to it, but rather to always choose reasonable actions and respond realistically to the success or failure of these actions. With this in mind, and following a similar discussion given in [32], some of the foremost researchers in virtual character research have the following to say about the requirements of agents as virtual characters.

Loyall writes [46] that “Believable agents are personality-rich autonomous agents with the powerful properties of characters from the arts.” Coming from a dramatic background, it is not surprising that Loyall’s requirements reflect this. Agents should have strong personality and be capable of showing emotion and engaging in meaningful social relationships.

According to Blumberg [11], “. . . an autonomous animated creature is an animated object capable of goal-directed and time-varying behavior”. The work of Blumberg and his group is very much concerned with virtual creatures, rather than humans in particular, and his requirements reflect this. Creatures must appear to make choices which improve their situation and display sophisticated and individualistic movements.

Hayes-Roth and Doyle focus on the differences between “animate characters” and traditional agents [27]. With this in mind they indicate that agents’ behaviors must be “variable rather than reliable”, “idiosyncratic instead of predictable”, “appropriate rather than correct”, “effective instead of complete”, “interesting rather than efficient”, and “distinctively individual as opposed to optimal”.
Perlin and Goldberg [59] concern themselves with building believable characters “that respond to users and to each other in real-time, with consistent personalities, properly changing moods and without mechanical repetition, while always maintaining an author’s goals and intentions”. Finally, in characterizing believable agents, Bates [7] is quite forgiving, requiring “only that they not be clearly stupid or unreal”. Such broad, shallow agents must “exhibit some signs of internal goals, reactivity, emotion, natural language ability, and knowledge of agents . . . as well as of the . . . micro-world”.

Considering these definitions, Isbister and Doyle [32] identify the fact that the consistent themes which run through all of the requirements given above match the general goals of agency – virtual humans must display autonomy, reactivity, goal driven behavior and social ability – and again support the use of agent-based modeling to drive the behavior of virtual characters.

The Spectrum of Agents

The differences between the systems mentioned in the previous discussion are captured particularly well on the spectrum of agents presented by Aylett and Luck [5]. This positions agent systems on a spectrum based on their capabilities, and serves as a useful tool in differentiating between the various systems available. One end of this spectrum focuses on physical agents, which are mainly concerned with the simulation of believable physical behavior (including sophisticated physiological models of muscle and skeleton systems) and of sensory systems. Interesting work at this end of the spectrum includes Terzopoulos’ highly realistic simulation of fish [76] and his virtual stuntman project [21], which creates virtual actors capable of realistically synthesizing a broad repertoire of lifelike motor skills. Cognitive agents inhabit the other end of the agent spectrum and are mainly concerned with issues such as reasoning, decision making, planning and learning. Systems at this end of the spectrum include Funge’s cognitive modeling approach [26], which uses the situation calculus to control the behavior of virtual characters, and Nareyek’s work on planning agents for simulation [55], both of which will be described later in this article. While the systems mentioned so far sit comfortably at either end of the agent spectrum, many of the most effective inhabit the middle ground. Amongst these are c4 [13], used to great effect to simulate a virtual sheep dog with the ability to learn new behaviors, Improv [59], which augments sophisticated physical human animation with scripted behaviors, and the ViCrowd system [62], which sits on top of a realistic virtual human animation system and uses planning to control agents’ behavior.
Virtual Fidelity

The fact that so many different agent-based modeling systems for the control of virtual humans exist gives rise to an obvious question: why? The answer lies in the notion of virtual fidelity, as described by Badler [6]. Virtual fidelity refers to the fact that virtual reality systems need only remain true to actual reality in so much as this is required by, and improves, the system. The point is illustrated extremely effectively in [47]. The article explains that when game designers create the environments in which games are set, the scale of these environments is not kept true to reality. Rather, to ease players' movement in these worlds, areas are designed to a much larger scale, compared to character sizes, than in the real world. However, game players do not notice this digression from reality, and in fact respond negatively to environments that are designed to be more true to life, finding them cramped. This is a perfect example of how, although designers stay true to reality for many aspects of environment design, the particular blend of virtual fidelity required by an application can dictate that certain real-world restrictions be ignored in virtual worlds. With regard to virtual characters, virtual fidelity dictates that the set of capabilities which these characters should display is determined by the application which they are to inhabit. So, the requirements of an agent-based modeling system for CGI in movies would be very different to those of an agent-based modeling system for controlling the behaviors of game characters.

Agent-Based Modelling in CGI for Movies

With the success of agent-based modeling techniques in graphics firmly established, there was something of a search for application areas to which they could be applied. Fortunately, the success of agent-based modeling techniques in computer graphics was paralleled by an increase in the use of CGI in the movie industry, which offered the perfect opportunity. In many cases CGI techniques were being used to replace traditional methods for creating expensive, or difficult to film, scenes. In particular, scenes involving large numbers of people or animals were deemed no longer financially viable when filmed in the real world. Creating these scenes using CGI by painstakingly hand-animating each character within a scene was, again, not financially viable. The solution that agent-based modeling offers is to make each character within a scene an intelligent agent that drives its own behavior. In this way, as long as the initial situation is set up correctly, scenes will play out without the intervention of animators.
The facts that animating for movies does not need to be performed in real-time, and is in no way interactive (there are no human users involved in the scene), make agent-based modeling a particularly fine match for this application area. Craig Reynolds' Boids system [64], which simulates the flocking behaviors exhibited in nature by schools of fish or flocks of birds and was discussed previously, is one of the seminal examples of agent-based modeling techniques being used in movie CGI. Reynolds' approach was first used for CGI in the 1992 film "Batman Returns" [14] to simulate colonies of bats. Reynolds' technologies have been used in "The Lion King" [4] and "From Dusk 'Till Dawn" [65]
amongst other films. Reynolds' approach was so successful, in fact, that he was awarded an Academy Award for his work in 1998. Similar techniques to those utilized in the Boids system have been used in many other films to animate such diverse characters as ants, people and stampeding wildebeest. Two productions released in the same year, "Antz" [17] by Dreamworks and "A Bug's Life" [44] by Pixar, took great steps forward in using CGI effects to animate large crowds. For "Antz", systems were developed which allowed animators to easily create scenes containing large numbers of virtual characters, modeling each as an intelligent agent capable of obstacle avoidance, flocking and other behaviors. Similarly, the creators of "A Bug's Life" created tools which allowed animators to easily combine pre-defined motions (known as alibis) into behaviors which could be applied to individual agents in scenes composed of hundreds of virtual characters. However, the largest jump in the use of agent-based modeling in movie CGI was made in the recent Lord of the Rings trilogy [33,34,35]. In these films the bar was raised markedly in terms of both the sophistication of the virtual characters displayed and the sheer number of characters populating each scene. To achieve the special effects shots required by the makers of these films, the Massive software system was developed by Massive Software (www.massivesoftware.com). This system [2,39] uses agent-based modeling techniques, again inspired by aLife, to create virtual extras that control their own behaviors. It was put to particularly good use in the large-scale battle sequences that feature in all three of the Lord of the Rings films. Some of the sequences in the final film of the trilogy, The Return of the King, contain over 200,000 digital characters.
In order to create a large battle scene using the Massive software, each virtual extra is represented as an intelligent agent, making its own decisions about which actions it will perform based on its perceptions of the world around it. Agent control is achieved through the use of fuzzy logic based controllers, in which the state of an agent's brain is represented as a series of motivations together with knowledge it has about the world – such as the state of the terrain it finds itself on, what kinds of other agents are around it and what these other agents are doing. This knowledge about the world is perceived through simple simulated visual, auditory and tactile senses. Based on the information they perceive, agents decide on a best course of action. Designing the brains of these agents is made easier than it might seem at first by the fact that agents are developed for short sequences, and so for a small range of possible tasks. So, for example, separate agent models would be used for a fighting scene and a celebration scene. In order to create a large crowd scene using Massive, animators initially set up an environment, populating it with an appropriate cast of virtual characters in which the brain of each character is a slight variation (based on physical and personality attributes) of a small number of archetypes. The scene will then play itself out with each character making its own decisions, so there is no need for any hand animation of virtual characters. However, directors can view the created scenes and, by tweaking the parameters of the brains of the virtual characters, have a scene play out in the exact way that they require. Since being used to such impressive effect in the Lord of the Rings trilogy (the developers of the Massive system were awarded an Academy Award for their work), the Massive software system has been used in numerous other films such as "I, Robot" [60], "The Chronicles of Narnia: The Lion, the Witch and the Wardrobe" [1] and "Ratatouille" [10], along with numerous television commercials and music videos. While the achievements of using agent-based modeling for movie CGI are extremely impressive, it is worth noting that none of these systems run in real-time. Rather, scenes are rendered by banks of high-powered computers, a process that can take hours for even relatively simple scenes. For example, the famous prologue battle sequence in "The Lord of the Rings: The Fellowship of the Ring" took a week to render. When agent-based modeling is applied to the real-time world of computer games, things are very different.

Agent-Based Modelling in Games

Even more so than in movies, agent-based modeling techniques have been used to drive the behaviors of virtual characters in computer games.
As games have become graphically more realistic (and in recent years they have become extremely so), game players have come to expect that games are set in hugely realistic and believable virtual worlds. This is particularly evident in the widespread use of realistic physics modeling, which is now commonplace in games [67]. In games that make strong use of physics modeling, objects in the game world topple over when pushed, float realistically when dropped in water and generally respond as one would expect them to. Players expect the same to be true of the virtual characters that populate virtual game worlds. This can best be achieved by modeling virtual characters as embodied virtual agents. However, there are a number of constraints which have a major
influence on the use of agent-based modeling techniques in games. The first of these constraints stems from the fact that modern games are so highly interactive. Players expect to be able to interact with all of the characters they encounter within a game world. These interactions can be as simple as having something to shoot at or someone to race against, or can involve much more sophisticated exchanges in which a player is expected to converse with a virtual character to find out specific information, or to cooperate with a virtual character in order to accomplish some task that is key to the plot of a game. Interactivity raises a massive challenge for practitioners as there is very little restriction in terms of what the player might do. Virtual characters should respond in a believable way at all times, regardless of how bizarre and unexpected the actions of the player might be. The second challenge comes from the fact that the vast majority of video games must run in real time. This means that computational complexity must be kept to a reasonable level, as there are only a finite number of processor cycles available for AI processing. This problem is magnified by the fact that an enormous amount of CPU power is usually dedicated to graphics processing. Due to this real-time constraint, some of the techniques used in games are rudimentary when compared to the techniques that can be used for controlling virtual characters in films. Finally, modern games resemble films in that their creators go to great lengths to include intricate storylines and to control the building of tension in much the way that film script writers do. This means that games are tested heavily in order to ensure that the game proceeds smoothly and that the level of difficulty is finely tuned so as to always hold the interest of a player. In fact, this testing of games has become something of a science in itself [77].
Using autonomous agents gives game characters the ability to do things that are unexpected by the game designers and so upset their well-laid plans. This can often be a barrier to the use of sophisticated techniques such as learning. Unfortunately, there is also a barrier to the discussion of agent-based modeling techniques used in commercial games. Because of the very competitive nature of the games industry, game development houses often consider the details of how their games work to be valuable trade secrets to be kept well guarded. This can make it difficult to uncover the details of how particularly interesting features of a game are implemented. While this situation is improving – more commercial game developers are speaking at games conferences about how their games are developed, and the release of development kits for the creation of game modifications (or mods) allows researchers to plumb the depths of game code – it is still often impossible to find out the implementation details of very new games.

Game Genres

Before discussing the use of agent-based modeling in games any further, it is worth making a short clarification of the kinds of computer games that this article refers to. When discussing modern computer games, or video games, this article does not refer to computer implementations of traditional games such as chess, backgammon or card games such as solitaire. Although these games are of considerable research interest (chess in particular has been the subject of extremely successful research [23]), they are typically not approached using agent-based modeling techniques. Typically, artificial intelligence approaches to games such as these rely largely on sophisticated searching techniques which allow the computer player to search through a multitude of possible future situations dictated by the moves it will make and the moves it expects its opponent to make in response. Based on this search, and some clever heuristics that indicate what constitutes a good game position for the computer player, the best sequence of moves can be chosen. This searching technique relies on the fact that there are usually a relatively small number of moves that a player can make at any one time in a game. However, the fact that the ancient Chinese game of Go has not, to date, been mastered by computer players [80] illustrates the restrictions of such techniques. The common thread linking together the kinds of games that this article focuses on is that they all contain computer controlled virtual characters that possess a strong notion of agency. Efforts are often made to separate the many different kinds of modern video games that are the focus of this article into a small set of descriptive genres.
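As an aside, the game-tree search used for traditional board games, as described above, can be sketched as a depth-limited minimax with a heuristic evaluation. The toy game below (a Nim variant: players take 1–3 sticks in turn, and taking the last stick wins) is chosen purely so the example is self-contained; it is not drawn from any of the systems cited here.

```python
# Illustrative depth-limited minimax search with a heuristic
# evaluation, in the style used for classic board games.
# Toy game: take 1-3 sticks per turn; taking the last stick wins.

def moves(sticks):
    """Legal moves: take 1, 2 or 3 sticks (never more than remain)."""
    return [n for n in (1, 2, 3) if n <= sticks]

def minimax(sticks, maximizing, depth):
    """Score a position from the maximizing player's perspective."""
    if sticks == 0:
        # The previous player took the last stick and won, so the
        # side to move now has lost.
        return -1 if maximizing else 1
    if depth == 0:
        return 0  # heuristic cut-off: unresolved positions scored as neutral
    scores = [minimax(sticks - m, not maximizing, depth - 1)
              for m in moves(sticks)]
    return max(scores) if maximizing else min(scores)

def best_move(sticks, depth=10):
    """Choose the move with the best minimax score for the side to move."""
    return max(moves(sticks), key=lambda m: minimax(sticks - m, False, depth - 1))
```

In this game positions that are multiples of four are losing for the side to move, so `best_move(5)` takes one stick, leaving the opponent on four. Real chess engines add alpha-beta pruning and far richer evaluation functions to the same skeleton.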
Unfortunately, much as in music, film and literature, no categorization can hope to perfectly capture the nuances of all of the available titles. However, a brief mention of some of the more important game genres is worthwhile (a more detailed description of game genres, and the artificial intelligence requirements of each, is given in [41]). The most popular game genre is without doubt the action game, in which the player must defeat waves of demented foes, typically (for increasingly bizarre motivations) bent upon global destruction. Illustrative examples of the genre include Half-Life 2 (www.half-life2.com) and the Halo series (www.halo3.com). A screenshot of the upcoming action game Rogue Warrior (www.bethsoft.com) is shown in Fig. 2.
Computer Graphics and Games, Agent Based Modeling in, Figure 2 A screenshot of the upcoming action game Rogue Warrior from Bethesda Softworks (image courtesy of Bethesda Softworks)
Strategy games allow players to control large armies in battle with other people or computer controlled opponents. Players do not have direct control over their armies, but rather issue orders which are carried out by agent-based artificial soldiers. Well regarded examples of the genre include the Age of Empires (www.ageofempires.com) and Command & Conquer (www.commandandconquer.com) series. Role playing games (such as the Elder Scrolls (www.elderscrolls.com) series) place game players in expansive virtual worlds across which they must embark on fantastical quests, which typically involve a mixture of solving puzzles, fighting opponents and interacting with non-player characters in order to gain information. Figure 3 shows a screenshot of the aforementioned role-playing game The Elder Scrolls IV: Oblivion. Almost every sport imaginable has at this stage been turned into a computer-based sports game. The challenge in developing these games is creating computer controlled opponents and team mates that play the games at a level suitable to the human player. Interesting examples include FIFA Soccer 08 (www.fifa08.ea.com) and Forza Motorsport 2 (www.forzamotorsport.net). Finally, many people expected that the rise of massively multi-player online games (MMOGs), in which hundreds of human players can play together in an online world, would sound the death knell for the use of virtual non-player characters in games. Examples of MMOGs include World of Warcraft (www.worldofwarcraft.com) and Battlefield 2142 (www.battlefield.ea.com). However, this has not turned out to be the case, as there are still large
numbers of single player games being produced, and even MMOGs need computer controlled characters for roles that players do not wish to play. Of course there are many games that simply do not fit into any of these categorizations but that are still relevant for a discussion of the use of agent-based techniques – for example The Sims (www.thesims.ea.com) and the Microsoft Flight Simulator series (www.microsoft.com/games/flightsimulatorx). However, the categorization still serves to introduce those unfamiliar with the subject to the kinds of games up for discussion.

Implementing Agent-Based Modelling Techniques in Games

One of the earliest examples of the use of agent-based modeling techniques in video games was their application to path planning. The ability of non-player characters (NPCs) to manoeuvre around a game world is one of the most basic competencies required in games. While in very early games it was sufficient to have NPCs move along pre-scripted paths, this soon became unacceptable, and games programmers began to turn to AI techniques which might be applied to solve some of the problems that were arising. The A* path planning algorithm [74] was the first example of such a technique to find widespread use in the games industry. Using the A* algorithm, NPCs can be given the ability to find their own way around an environment. This was put to particularly fine effect early on in real-time strategy games, where the units controlled by players are semi-autonomous and are given orders rather
than directly controlled.

Computer Graphics and Games, Agent Based Modeling in, Figure 3 A screenshot from Bethesda Softworks' role playing game The Elder Scrolls IV: Oblivion (image courtesy of Bethesda Softworks)

In order to use the A* algorithm, a game world must be divided into a series of cells, each of which is given a rating in terms of the effort that must be expended to cross it. The A* algorithm then performs a search across these cells in order to find the shortest path that will take a game agent from a start position to a goal. Since the algorithm became widely understood amongst the game development community, many interesting additions have been made to the basic A* algorithm. It was not long before three-dimensional versions of the algorithm became commonly used [71]. The basic notion of storing the energy required to cross a cell within a game world has also been extended to augment cells with a wide range of other useful information (such as the level of danger in crossing a cell) that can be used in the search process [63]. The next advance in the kind of techniques being used to achieve agent-based modeling in games was the finite state machine (FSM) [30]. An FSM is a simple system in which a finite number of states are connected in a directed graph by transitions between these states. When used for the control of NPCs, the nodes of an FSM indicate the possible actions within a game world that an agent can perform. Transitions indicate how changes in the state of the game world, or in the character's own attributes (such as health, tiredness etc.), can move the agent from one state to another.
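Returning briefly to path planning: the weighted-cell A* search described above can be sketched as follows. The cost map, 4-connected movement and Manhattan-distance heuristic are illustrative assumptions; production path planners layer much more information onto each cell.

```python
# A sketch of A* path planning over a grid of weighted cells.
# cost maps (x, y) -> traversal cost; impassable cells are simply absent.
import heapq

def a_star(cost, start, goal):
    def h(c):
        # Manhattan distance: admissible on a 4-connected grid
        # as long as every cell cost is at least 1.
        return abs(c[0] - goal[0]) + abs(c[1] - goal[1])

    frontier = [(h(start), start)]   # priority queue ordered by f = g + h
    came_from = {start: None}
    g = {start: 0}                   # cheapest known cost to reach each cell
    while frontier:
        _, current = heapq.heappop(frontier)
        if current == goal:
            path = []                # walk back through came_from
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        x, y = current
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt not in cost:
                continue             # impassable or off the map
            new_g = g[current] + cost[nxt]
            if nxt not in g or new_g < g[nxt]:
                g[nxt] = new_g
                came_from[nxt] = current
                heapq.heappush(frontier, (new_g + h(nxt), nxt))
    return None                      # no path exists
```

Given a 3x3 grid of unit-cost cells with the centre cell removed, `a_star(grid, (0, 0), (2, 2))` returns a five-cell path around the obstacle; augmenting the stored costs (danger, cover and so on, as in [63]) changes only the contents of the cost map, not the search itself.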
Computer Graphics and Games, Agent Based Modeling in, Figure 4 A simple finite state machine for a soldier NPC in an action game
Figure 4 shows a sample FSM for the control of an NPC in a typical action game. In this example the behaviors of the character are determined by just four states – CHASE, ATTACK, FLEE and EXPLORE. Each of these states provides an action that the agent should take. For example, when in the EXPLORE state the character should wander randomly around the world, while in the FLEE state the character should determine a direction to move in that will take it away from its current enemy, and move in that direction. The links between the states show how the behaviors of the character should move between the various available states. So, for example, if while in the ATTACK state the agent's health measure becomes low, it will move to the FLEE state and run away from its enemy. FSMs are widely used because they are simple, well understood and extremely efficient, both in terms of the processing cycles required and of memory usage. There have also been a number of highly successful augmentations to the basic state machine model to make it more effective, such as the introduction of layers of parallel state machines [3], the use of fuzzy logic in finite state machines [19] and the implementation of cooperative group behaviors through state machines [72]. The action game Halo 2 is recognized as having a particularly good implementation of state machine based NPC control [79]. At any time an agent could be in any one of the four states Idle, Guard/Patrol, Attack/Defend and Retreat. Within each of these states a set of rules was used to select from a small set of appropriate actions for that state – for example, a number of different ways to attack the player. The decisions made by NPCs were influenced by a number of character attributes including strength, speed and cowardliness. Transitions between states were triggered by perceptions made through characters' simulated senses of vision and hearing, and by internal attributes such as health. The system also allowed for group behaviors, enabling NPCs to hold conversations and cooperate to drive vehicles. However, FSMs are not without their drawbacks. When designing FSMs, developers must envisage every possible situation that might confront an NPC over the course of a game.
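A minimal sketch of the four-state machine of Fig. 4 might look as follows. The transition conditions (enemy visibility, attack range and a health threshold of 25) are invented for illustration and are not taken from any shipped game.

```python
# Transition function for the four-state NPC machine of Fig. 4.
# Each game tick the NPC re-evaluates its perceptions and may move
# to a new state; the state itself names the action to perform.

def next_state(state, enemy_visible, enemy_in_range, health):
    """Return the state the NPC should move to given its perceptions."""
    if state == "EXPLORE":
        # Wandering: switch to pursuit as soon as an enemy is spotted.
        return "CHASE" if enemy_visible else "EXPLORE"
    if state == "CHASE":
        if not enemy_visible:
            return "EXPLORE"          # lost sight of the enemy
        return "ATTACK" if enemy_in_range else "CHASE"
    if state == "ATTACK":
        if health < 25:
            return "FLEE"             # low health: run away from the enemy
        return "ATTACK" if enemy_in_range else "CHASE"
    if state == "FLEE":
        # Keep running until the enemy is out of sight, then resume exploring.
        return "EXPLORE" if not enemy_visible else "FLEE"
    raise ValueError("unknown state: " + state)
```

Even this toy version shows why the approach scales poorly: every new state multiplies the number of transition conditions a designer must enumerate by hand.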
While this is quite possible for many games, if NPCs are required to move between many different situations this task can become overwhelming. Similarly, as more and more states are added to an FSM, designing the links between these states can become a mammoth undertaking. Rule based systems are defined in [31] as being ". . . comprised of a database of associated rules. Rules are conditional program statements with consequent actions that are performed if the specified conditions are satisfied". Rule based systems have been applied extensively to control NPCs in games [16], in particular for the control of NPCs in role-playing games. NPCs' behaviors are scripted using a set of rules which typically indicate how an NPC should respond to particular events within the game world. Borrowed from [82], the listing below shows a snippet of the rules used to control a warrior character in the RPG Baldur's Gate (www.bioware.com).

IF
    // If my nearest enemy is not within 3
    !Range(NearestEnemyOf(Myself),3)
    // and is within 8
    Range(NearestEnemyOf(Myself),8)
THEN
    // 40% of the time
    RESPONSE #40
        // Equip my best melee weapon
        EquipMostDamagingMelee()
        // and attack my nearest enemy, checking every 60
        // ticks to make sure he is still the nearest
        AttackReevaluate(NearestEnemyOf(Myself),60)
    // 60% of the time
    RESPONSE #60
        // Equip a ranged weapon
        EquipRanged()
        // and attack my nearest enemy, checking every 30
        // ticks to make sure he is still the nearest
        AttackReevaluate(NearestEnemyOf(Myself),30)

The implementation of an NPC using a rule-based system would consist of a large set of such rules, a small subset of which would fire based on the conditions in the world at any given time. Rule based systems are favored by game developers as they are relatively simple to use and can be exhaustively tested. They also have the advantage that rule sets can be written using simple proprietary scripting languages [9], rather than full programming languages, making them easy to implement. Development companies have even gone so far as to make these scripting languages available to the general public, enabling players to author their own rule sets. Rule based systems, however, are not without their drawbacks. Authoring extensive rule sets is not a trivial task, and they are usually restricted to simple situations. Also, rule based systems can be restrictive in that they do not allow sophisticated interplay between NPCs' motivations, and they require that rule set authors foresee every situation that an NPC might find itself in. Some of the disadvantages of simple rule based systems can be alleviated by using more sophisticated inference engines. One example uses Dempster-Shafer theory [43], which allows rules to be evaluated by combining multiple sources of (often incomplete) evidence to determine actions.
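To make the Dempster-Shafer approach concrete, the sketch below implements Dempster's rule of combination for two mass functions over a tiny frame of hypotheses – a guard deciding whether a noise was made by the player or by a rat. The frame and the mass values are invented for illustration and are not taken from any cited system.

```python
# Dempster's rule of combination: two evidence sources each assign
# belief mass to *sets* of hypotheses; combination multiplies masses,
# keeps the mass where the sets intersect, and renormalizes away the
# conflicting (empty-intersection) mass.
from itertools import product

def combine(m1, m2):
    """Combine two mass functions (dicts: frozenset -> mass).

    Assumes the sources are not in total conflict (conflict < 1)."""
    combined = {}
    conflict = 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb   # mass assigned to contradictory evidence
    return {s: m / (1.0 - conflict) for s, m in combined.items()}

# Hypotheses about a noise an NPC guard has just heard
PLAYER, RAT = frozenset({"player"}), frozenset({"rat"})
EITHER = PLAYER | RAT

hearing = {PLAYER: 0.5, EITHER: 0.5}             # suggestive but unsure
vision = {PLAYER: 0.2, RAT: 0.3, EITHER: 0.5}    # a glimpse, mostly inconclusive

belief = combine(hearing, vision)
```

The combined belief concentrates on the player hypothesis while retaining some mass on the uncommitted set, which is exactly the kind of graded conclusion a rule engine can act on when complete knowledge is not available.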
This goes some way towards supporting the use of rule based systems in situations where complete knowledge is not available. ALife techniques have also been applied extensively to the control of game NPCs, as much as a philosophy as any
particular technique. The outstanding example of this is The Sims (thesims.ea.com), a surprise hit of 2000 which has gone on to become the best selling PC game of all time. Created by games guru Will Wright, The Sims puts the player in control of the lives of a virtual family in their virtual home. Inspired by aLife, the characters in the game have a set of motivations, such as hunger, fatigue and boredom, and seek out items within the game world that can satisfy these desires. Virtual characters also develop sophisticated social relationships with each other based on common interest, attraction and the amount of time spent together. The original system in The Sims has been improved upon in the sequel The Sims 2 and a series of expansion packs. Some of the more interesting work in developing techniques for the control of game characters (particularly in action games) has focused on developing interesting sensing and memory models for game characters. Players expect that, when playing action games, computer controlled opponents should suffer from the same problems that players do when perceiving the world. So, for example, computer controlled characters should not be able to see through walls or from one floor to the next. Equally, though, players expect computer controlled characters to be capable of perceiving events that occur in the world, and so NPCs should respond appropriately to sound events or on seeing the player. One particularly fine example of a sensing model was in the game Thief: The Dark Project, where players are required to sneak around an environment without alerting guards to their presence [45]. The developers produced a relatively sophisticated sensing model for non-player characters which captured visual effects such as not being able to see the player while they were in shadows, and moved some way towards modeling acoustics so that non-player characters could respond reasonably to sound events.
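A toy version of this kind of sensing model might gate detection on distance, field of view and light level. All of the thresholds and the scalar light model below are invented for illustration; the actual game's model is considerably more sophisticated.

```python
# Toy NPC vision check in the spirit of the sensing model described
# above: the guard "sees" the player only if the player is in range,
# inside the guard's field of view, and not hidden in shadow.
import math

def can_see(guard_pos, guard_facing_deg, player_pos, light_level,
            max_range=15.0, fov_deg=120.0, min_light=0.3):
    dx = player_pos[0] - guard_pos[0]
    dy = player_pos[1] - guard_pos[1]
    if math.hypot(dx, dy) > max_range:
        return False                       # too far away to notice
    if light_level < min_light:
        return False                       # player is hidden in shadow
    # Signed angle between the guard's facing and the direction to the
    # player, wrapped into [-180, 180) degrees.
    angle_to_player = math.degrees(math.atan2(dy, dx))
    diff = (angle_to_player - guard_facing_deg + 180) % 360 - 180
    return abs(diff) <= fov_deg / 2        # inside the vision cone
```

A fuller model would also cast a ray against level geometry so walls block the check, and would accumulate partial sightings over time rather than returning a hard yes/no.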
2004's Fable (fable.lionhead.com) took the idea of adding memory to a game to new heights. In this adventure game the player took on the role of a hero growing from boyhood to manhood, and every action the player took had an impact on how the game world's population would react to him or her, as characters would remember those actions the next time they met the player. This notion of long-term consequences added an extra layer of believability to the game-playing experience.

Serious Games & Academia

It will probably have become apparent to most readers of the previous section that much of the work done in implementing agent-based techniques for the control of NPCs
in commercial games is relatively simplistic when compared to the application of these techniques in other areas of more academic focus, such as robotics [54]. The reasons for this have been discussed already, and briefly relate to the lack of available processing resources and the requirements of commercial quality control. However, a large amount of very interesting work is taking place in the application of agent-based technologies in academic research, and in particular in the field of serious games. This section will begin by introducing the area of serious games and then go on to discuss interesting academic projects looking at agent-based technologies in games. The term serious games [53] refers to games designed to do more than just entertain. Rather, serious games, while having many features in common with conventional games, serve further purposes such as teaching, training and marketing. Although games have been used for ends apart from entertainment, in particular education, for a long time, the modern serious games movement is set apart by the level of sophistication of the games it creates. The current generation of serious games is comparable with mainstream games in terms of quality of production and sophistication of design. Serious games offer particularly interesting opportunities for the use of agent-based modeling techniques: they often do not have to live up to the rigorous testing of commercial games, they can require specialized hardware rather than being restricted to commercial gaming hardware, and often, by the nature of their application domains, they require more in-depth interactions between players and NPCs. The modern serious games movement can be said to have begun with the release of America's Army (www.americasarmy.com) in 2002 [57].
Inspired by the realism of commercial games such as the Rainbow 6 series (www.rainbow6.com), the United States military developed America's Army and released it free of charge in order to give potential recruits a flavor of army life. The game was hugely successful and is still being used today as both a recruitment tool and an internal army training tool. Spurred on by the success of America's Army, the serious games movement began to grow, particularly within academia. A number of conferences sprang up, and notably the Serious Games Summit became a part of the influential Game Developers Conference (www.gdconf.com) in 2004. Some other notable offerings in the serious games field include Food Force (www.food-force.com) [18], a game developed by the United Nations World Food Programme in order to promote awareness of the issues surrounding emergency food aid; Hazmat Hotzone [15], a game developed by the Entertainment Technology Centre at Carnegie Mellon University to train fire-fighters to deal with chemical and hazardous materials emergencies; Yourself!Fitness (www.yourselffitness.com) [53], an interactive virtual personal trainer developed for modern games consoles; and Serious Gordon (www.seriousgames.ie) [50], a game developed to aid in teaching food safety in kitchens. A screenshot of Serious Gordon is shown in Fig. 5.

Computer Graphics and Games, Agent Based Modeling in, Figure 5 A screenshot of Serious Gordon, a serious game developed to aid in the teaching of food safety in kitchens

Over the past decade, interest in academic research directly focused on artificial intelligence, and in particular on agent-based modelling techniques and their application to games (as opposed to the general virtual character/computer graphics work discussed previously), has grown dramatically. One of the first major academic research projects in the area of game-AI was led by John Laird at the University of Michigan in the United States. The SOAR architecture was developed in the early nineteen-eighties in an attempt to "develop and apply a unified theory of human and artificial intelligence" [66]. SOAR is essentially a rule based inference system which takes the current state of a problem and matches this to production rules which lead to actions. After initial applications to the kind of simple puzzle worlds which characterized early AI research [42], the SOAR architecture was applied to the task of controlling computer generated forces [37]. This work led to an obvious transfer to the new research area of game-AI [40]. Initially the work of Laird's group focused on applying the SOAR architecture to the task of controlling NPC opponents in the action game Quake (www.idsoftware.com) [40]. This proved quite successful, leading to opponents which could successfully play against human players, and even begin to plan based on anticipation of what the player was about to do. More recently Laird's group have focused on the development of a game which requires more involved interactions between the player and the NPCs. Named Haunt 2, this game casts the player in the role of a ghost that must attempt to influence the actions of a group of computer controlled characters inhabiting the ghost's haunted house [51]. The main issue that arises with the use of the SOAR architecture is that it is enormously resource hungry, with the NPC controllers running on a separate machine to the actual game. At Trinity College Dublin in Ireland, the author of this article worked on an intelligent agent architecture, the Proactive Persistent Agent (PPA) architecture, for the control of background characters (or support characters) in character-centric games (games that focus on character interactions rather than action, e.g. role-playing
Computer Graphics and Games, Agent Based Modeling in, Figure 6 Screenshots of the PPA system simulating parts of a college
games) [48,49]. The key contributions of this work were that it made possible the creation of NPCs capable of behaving believably in a wide range of situations, and that it allowed for the creation of game environments which appeared to have an existence beyond their interactions with players. Agent behaviors in this work were based on models of personality, emotion and relationships to other characters, together with behavioral models that changed according to the current role of an agent. This system was used to develop a stand-alone game and as part of a simulation of areas within Trinity College. A screenshot of this second application is shown in Fig. 6.
At Northwestern University in Chicago, the Interactive Entertainment group has also applied approaches from more traditional research areas to the problems facing game-AI. Ian Horswill has led a team attempting to use architectures traditionally associated with robotics for the control of NPCs. In [29], Horswill and Zubek consider how well the behavior-based architectures often used in robotics match the requirements of NPC control architectures. The group has demonstrated some of its ideas in a test-bed environment built on top of the game Half-Life [38]. The group also looks at issues around character interaction [85] and the many psychological issues associated with creating virtual characters, asking how we can create virtual game agents that display all of the foibles that make us relate to characters in human stories [28].
Within the same research group, a team led by Ken Forbus has extended research previously undertaken in conjunction with the military [24] and applied it to the problem of terrain analysis in computer strategy games [25]. Their goal is to create strategic opponents capable of performing sophisticated reasoning about the terrain in a game world and of using this knowledge to identify complex features such as ambush points. This kind of high-level reasoning would allow AI opponents to play a much more realistic game, and even to surprise human players from time to time, something that is sorely missing from current strategy games.
As well as this work, which has spring-boarded from existing applications, a number of projects began expressly to tackle problems in game-AI. Two which particularly stand out are the Excalibur Project, led by Alexander Nareyek [55], and the work of John Funge [26]. Both of these projects have attempted to apply sophisticated planning techniques to the control of game characters. Nareyek uses constraint-based planning to allow game agents to reason about their world. By using techniques such as local search, Nareyek has attempted to allow these sophisticated agents to perform resource-intensive planning
within the constraints of a typical computer game environment. Following on from this work, the term anytime agent was coined to describe the process by which agents actively refine their original plans based on changing world conditions. In [56], Nareyek describes the directions in which he intends to take this work in the future. Funge uses the situation calculus to allow agents to reason about their world. Similarly to Nareyek, he has addressed the problems of a dynamic, ever-changing world, plan refinement and incomplete information. Funge's work uses an extension of the situation calculus which allows the expression of uncertainty. Since completing this work, Funge has gone on to become one of the founders of AiLive (www.ailive.net), a middleware company specializing in AI for games. While the approaches of both of these projects have shown promise within the constrained environments to which they have been applied during research (and work on them continues), it remains to be seen whether such techniques can be successfully applied to a commercial game environment and all of the resource constraints that such an environment entails.
One of the most interesting recent examples of agent-based work in the field of serious games is that undertaken by Barry Silverman and his group at the University of Pennsylvania in the United States [69,70]. Silverman models the protagonists in military simulations for use in training programmes, and has taken a very interesting approach in that his agent models are based on established cognitive science and behavioral science research. While Silverman admits that many of the models described in the cognitive science and behavioral science literature are not quantified well enough to be implemented directly, he has adapted a number of well-respected models for his purposes.
Silverman's work is an excellent example of the capabilities that can be explored in a serious games setting rather than a commercial game setting, and as such it merits an in-depth discussion. A high-level schematic diagram of Silverman's approach, showing the agent architecture used by his system, PMFserv, is given in Fig. 7. The first important component of the PMFserv system is the biology module, which controls biological needs using a metaphor based on the flow of water through a system. Biological concepts such as hunger and fatigue are simulated using a series of reservoirs, tanks and valves that model the way in which resources are consumed by the system. This biological model is used in part to model stress, which has an important impact on the way in which agents make decisions. To model the way in which agent performance changes under pressure, Silverman uses performance moderator functions (PMFs). An example of one
of the earliest PMFs used is the Yerkes–Dodson "inverted-U" curve [84], which illustrates that as mental arousal increases, performance initially improves, peaks, and then trails off again. In PMFserv a range of PMFs is used to model the way in which behavior should change depending on stress levels and biological conditions.
The second important module of PMFserv attempts to model how personality, culture and emotion affect the behavior of an agent. In keeping with the rest of the system, PMFserv uses models inspired by cognitive science to capture emotions; in this case the well-known OCC model [58], which has been used in agent-based applications before [8]. The OCC model provides for 11 pairs of opposite emotions, such as pride and shame, and hope and fear. The emotional state of an agent with regard to past, current and future actions heavily influences the decisions that the agent makes. The second portion of the Personality, Culture, Emotion module uses a value tree to capture the values of an agent. These values are divided into a Preference Tree, which captures long-term desired states for the world; a Standards Tree, which relates to the actions that an agent believes it can or cannot follow in order to achieve these desired states; and a Goal Tree, which captures short-term goals.
PMFserv also models the relationships between agents (Social Model, Relations, Trust in Fig. 7). The relationship of one agent to another is modeled in terms of three axes. The first is the degree to which the other agent is thought of as a human rather than an inanimate object (in the Somali scenario described below, for example, locals tend to view American soldiers as objects rather than people). The second axis is the cognitive grouping (ally, foe, etc.) to which the other agent belongs, and whether this is also a group to which the first agent has an affinity. Finally, the valence, or strength, of the relationship is stored. Relationships continually change based on actions that occur within the game world.
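The inverted-U shape of the Yerkes–Dodson performance moderator function mentioned earlier in this section can be illustrated with a small sketch. The Gaussian form and all constants below are illustrative assumptions, not the actual functions used in PMFserv:

```python
import math

def yerkes_dodson_pmf(arousal, optimum=0.5, width=0.25):
    """Illustrative inverted-U performance moderator function.

    Performance rises with arousal up to an optimum, then trails off
    again, following the Yerkes-Dodson curve. The Gaussian shape and
    the constants are hypothetical, not taken from PMFserv.
    """
    return math.exp(-((arousal - optimum) ** 2) / (2 * width ** 2))

# Performance peaks at the optimum arousal level...
assert yerkes_dodson_pmf(0.5) == 1.0
# ...and is lower on either side of it.
assert yerkes_dodson_pmf(0.1) < yerkes_dodson_pmf(0.5)
assert yerkes_dodson_pmf(0.9) < yerkes_dodson_pmf(0.5)
```

In a system like PMFserv, the value returned by such a function would be used to scale an agent's effectiveness or decision quality at a given stress level.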
Like the other modules of the system this model is also based on psychological research [58]. The final important module of the PMFserv architecture is the Cognitive module which is used to decide on particular actions that agents will undertake. This module uses inputs from all of the other modules to make these decisions and so the behavior of PMFserv agents is driven by their stress levels, relationships to other agents and objects within the game world, personality, culture and emotions. The details of the PMFserv cognitive process are beyond the scope of this article, so it will suffice to say that action selection is based on a calculation of the utility of a particular action to an agent, with this calculation modified by the factors listed above.
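At a high level, utility-based action selection of the kind just described can be sketched as a raw utility scaled by moderator functions (for stress, emotion, relationships and so on). The exact PMFserv calculation is not given here, so everything in this sketch is an illustrative assumption:

```python
def select_action(actions, base_utility, moderators):
    """Pick the action with the highest moderated utility.

    `base_utility` maps an action to a raw utility; each function in
    `moderators` (e.g. a stress or emotion factor) scales that utility.
    This mirrors only the high-level description of PMFserv's cognitive
    module; the structure and names here are hypothetical.
    """
    def moderated(action):
        u = base_utility(action)
        for m in moderators:
            u *= m(action)
        return u
    return max(actions, key=moderated)

# Toy usage: fleeing wins once a fear moderator discounts fighting.
utilities = {"fight": 0.9, "flee": 0.6}
fear = lambda a: 0.4 if a == "fight" else 1.0
choice = select_action(["fight", "flee"], utilities.get, [fear])
assert choice == "flee"  # 0.9 * 0.4 = 0.36 < 0.6
```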
Computer Graphics and Games, Agent Based Modeling in, Figure 7 A schematic diagram of the main components of the PMFserv system (with kind permission of Barry Silverman)
The most highly developed example using the PMFserv model is a simulation of the 1993 event in Mogadishu, Somalia, in which a United States military Black Hawk helicopter crashed, as made famous by the book and film "Black Hawk Down" [12]. In this example, which was developed as a military training aid as part of a larger project looking at agent implementations within such systems [78,81], the player took on the role of a US Army ranger on a mission to secure the helicopter wreck in a modification (or "mod") of the game Unreal Tournament (www.unreal.com). A screenshot of this simulation is shown in Fig. 8. The PMFserv system was used to control the behaviors of characters within the game world, such as Somali militia and Somali civilians. These characters were imbued with physical attributes, a value system and relationships with other characters and objects within the game environment. The sophistication of PMFserv was apparent in many of the behaviors of the simulation's NPCs. One particularly good example was the fact that Somali women would offer themselves as human shields for militia fighters. This behavior was never directly programmed into the agents' make-up, but rather emerged as a result of their values and their assessment of their situation. PMFserv remains one of the most sophisticated current agent implementations and shows what is possible when the shackles of commercial game constraints are thrown off.
Future Directions
There is no doubt that, with the increase in the amount of work being focused on the use of agent-based modeling in computer graphics and games, there will be major developments in the near future. This final section will attempt to predict what some of these might be. The main development that can be expected in all of the areas discussed in this article is an increase in the depth of simulation. The primary driver of this increase in depth will be the development of more sophisticated agent models, which can be used to drive ever more sophisticated agent behavior. The PMFserv system described earlier is one example of the kinds of deeper systems currently being developed. In general computer graphics applications this will allow for the creation of more interesting simulations, including previously prohibitive features such as automatic, realistic facial expressions and other physical expressions of agents' internal states. This would be particularly useful in CGI for movies, in which, although agent-based modeling techniques are commonly used for crowd scenes and background characters, main characters are still animated almost entirely by hand. In the area of computer games it can be expected that many of the techniques being used in movie CGI will filter over to real-time game applications as the process-
Computer Graphics and Games, Agent Based Modeling in, Figure 8 A screenshot of the PMFserv system being used to simulate the Black Hawk Down scenario (with kind permission of Barry Silverman)
ing power of game hardware increases; this is a pattern that has been evident for a number of years. In terms of the depth that might be added to the control of game characters, one feature that has been conspicuous mainly by its absence in modern games is genuine learning by game agents. 2000's Black & White and its sequel Black & White 2 (www.lionhead.com) featured some learning by one of the game's main characters, which the player could teach in a reinforcement manner [20]. While this was particularly successful in the game, such techniques have not been more widely applied. One interesting academic project in this area is the NERO project (www.nerogame.org), which allows a player to train an evolving army of soldiers and have them battle the armies of other players [73]. It is expected that these kinds of capabilities will become more and more common in commercial games.
One new feature of the field of virtual character control in games is the emergence of specialized middleware. Middleware has had a massive impact in other areas of game development, including character modeling (for example Maya, available from www.autodesk.com) and physics modeling (for example Havok, available from www.havok.com). AI-focused middleware for games is now becoming more common, with notable offerings including AI-Implant (www.ai-implant.com) and Kynogon (www.kynogon.com), which perform path finding and state-machine-based control of characters. It is expected that more sophisticated techniques will over time find their way into such software. To conclude, the great hope for the future is that more and more sophisticated agent-based modeling techniques from other application areas and other branches of AI will find their way into the control of virtual characters.
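Path-finding middleware of the kind just mentioned is conventionally built on A* search over a navigation graph. The following minimal grid-based sketch is illustrative only, and is not taken from any of the middleware products named above:

```python
import heapq

def astar(grid, start, goal):
    """A* on a 4-connected grid of 0 (free) / 1 (blocked) cells.
    Returns the list of cells from start to goal, or None if no path exists.
    """
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
    open_set = [(h(start), 0, start, [start])]  # (f, g, cell, path so far)
    seen = set()
    while open_set:
        _, g, cell, path = heapq.heappop(open_set)
        if cell == goal:
            return path
        if cell in seen:
            continue
        seen.add(cell)
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                heapq.heappush(open_set,
                               (g + 1 + h((nr, nc)), g + 1, (nr, nc),
                                path + [(nr, nc)]))
    return None

# The path must detour around the blocked middle column.
grid = [[0, 1, 0],
        [0, 1, 0],
        [0, 0, 0]]
path = astar(grid, (0, 0), (0, 2))
assert path[0] == (0, 0) and path[-1] == (0, 2) and len(path) == 7
```

Production middleware layers considerable machinery on top of this core (navigation meshes, hierarchical search, dynamic obstacles), but the underlying search is the same.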
Bibliography

Primary Literature
1. Adamson A (Director) (2005) The Chronicles of Narnia: The Lion, the Witch and the Wardrobe. Motion Picture. http://adisney.go.com/disneypictures/narnia/lb_main.html
2. Aitken M, Butler G, Lemmon D, Saindon E, Peters D, Williams G (2004) The Lord of the Rings: the visual effects that brought middle earth to the screen. International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), Course Notes
3. Alexander T (2003) Parallel-State Machines for Believable Characters. In: Massively Multiplayer Game Development. Charles River Media
4. Allers R, Minkoff R (Directors) (1994) The Lion King. Motion
Picture. http://disney.go.com/disneyvideos/animatedfilms/lionking/
5. Aylett R, Luck M (2000) Applying Artificial Intelligence to Virtual Reality: Intelligent Virtual Environments. Appl Artif Intell 14(1):3–32
6. Badler N, Bindiganavale R, Bourne J, Allbeck J, Shi J, Palmer M (1999) Real Time Virtual Humans. In: Proceedings of the International Conference on Digital Media Futures
7. Bates J (1992) The Nature of Characters in Interactive Worlds and the Oz Project. Technical Report CMU-CS-92-200. School of Computer Science, Carnegie Mellon University
8. Bates J (1992) Virtual reality, art, and entertainment. Presence: J Teleoper Virtual Environ 1(1):133–138
9. Berger L (2002) Scripting: Overview and Code Generation. In: Rabin S (ed) AI Game Programming Wisdom. Charles River Media
10. Bird B, Pinkava J (Directors) (2007) Ratatouille. Motion Picture. http://disney.go.com/disneyvideos/animatedfilms/ratatouille/
11. Blumberg B (1996) Old Tricks, New Dogs: Ethology and Interactive Creatures. PhD Thesis, Media Lab, Massachusetts Institute of Technology
12. Bowden M (2000) Black Hawk Down. Corgi Adult
13. Burke R, Isla D, Downie M, Ivanov Y, Blumberg B (2002) Creature Smarts: The Art and Architecture of a Virtual Brain. In: Proceedings of Game-On 2002: the 3rd International Conference on Intelligent Games and Simulation, pp 89–93
14. Burton T (Director) (1992) Batman Returns. Motion Picture. http://www.warnervideo.com/batmanmoviesondvd/
15. Carless S (2005) Postcard From SGS 2005: Hazmat: Hotzone – First-Person First Responder Gaming. Retrieved October 2007, from Gamasutra: www.gamasutra.com/features/20051102/carless_01b.shtml
16. Christian M (2002) A Simple Inference Engine for a Rule Based Architecture. In: Rabin S (ed) AI Game Programming Wisdom. Charles River Media
17. Darnell E, Johnson T (Directors) (1998) Antz. Motion Picture. http://www.dreamworksanimation.com/
18. DeMaria R (2005) Postcard from the Serious Games Summit: How the United Nations Fights Hunger with Food Force. Retrieved October 2007, from Gamasutra: www.gamasutra.com/features/20051104/demaria_01.shtml
19. Dybsand E (2001) A Generic Fuzzy State Machine in C++. In: Rabin S (ed) Game Programming Gems 2. Charles River Media
20. Evans R (2002) Varieties of Learning. In: Rabin S (ed) AI Game Programming Wisdom. Charles River Media
21. Faloutsos P, van de Panne M, Terzopoulos D (2001) The Virtual Stuntman: Dynamic Characters with a Repertoire of Autonomous Motor Skills. Comput Graph 25(6):933–953
22. Farenc N, Musse S, Schweiss E, Kallmann M, Aune O, Boulic R et al (2000) A Paradigm for Controlling Virtual Humans in Urban Environment Simulations. Appl Artif Intell J Special Issue Intell Virtual Environ 14(1):69–91
23. Feng-Hsiung H (2002) Behind Deep Blue: Building the Computer that Defeated the World Chess Champion. Princeton University Press
24. Forbus K, Nielsen P, Faltings B (1991) Qualitative Spatial Reasoning: The CLOCK Project. Artif Intell 51:1–3
25. Forbus K, Mahoney J, Dill K (2001) How Qualitative Spatial Reasoning Can Improve Strategy Game AIs. In: Proceedings of the AAAI Spring Symposium on AI and Interactive Entertainment
26. Funge J (1999) AI for Games and Animation: A Cognitive Modeling Approach. A.K. Peters
27. Hayes-Roth B, Doyle P (1998) Animate Characters. Auton Agents Multi-Agent Syst 1(2):195–230
28. Horswill I (2007) Psychopathology, narrative, and cognitive architecture (or: why NPCs should be just as screwed up as we are). In: Proceedings of AAAI Fall Symposium on Intelligent Narrative Technologies
29. Horswill I, Zubek R (1999) Robot Architectures for Believable Game Agents. In: Proceedings of the 1999 AAAI Spring Symposium on Artificial Intelligence and Computer Games
30. Houlette R, Fu D (2003) The Ultimate Guide to FSMs in Games. In: Rabin S (ed) AI Game Programming Wisdom 2. Charles River Media
31. IGDA (2003) Working Group on Rule-Based Systems Report. International Games Development Association
32. Isbister K, Doyle P (2002) Design and Evaluation of Embodied Conversational Agents: A Proposed Taxonomy. In: Proceedings of the AAMAS02 Workshop on Embodied Conversational Agents: Lets Specify and Compare Them! Bologna, Italy
33. Jackson P (Director) (2001) The Lord of the Rings: The Fellowship of the Ring. Motion Picture. http://www.lordoftherings.net/
34. Jackson P (Director) (2002) The Lord of the Rings: The Two Towers. Motion Picture. http://www.lordoftherings.net/
35. Jackson P (Director) (2003) The Lord of the Rings: The Return of the King. Motion Picture. http://www.lordoftherings.net/
36. Johnston O, Thomas F (1995) The Illusion of Life: Disney Animation. Disney Editions
37. Jones R, Laird J, Neilsen P, Coulter K, Kenny P, Koss F (1999) Automated Intelligent Pilots for Combat Flight Simulation. AI Mag 20(1):27–42
38. Khoo A, Zubek R (2002) Applying Inexpensive AI Techniques to Computer Games. IEEE Intell Syst Spec Issue Interact Entertain 17(4):48–53
39. Koeppel D (2002) Massive Attack. http://www.popsci.com/popsci/science/d726359b9fa84010vgnvcm1000004eecbccdrcrd.html. Accessed Oct 2007
40. Laird J (2000) An Exploration into Computer Games and Computer Generated Forces. The 8th Conference on Computer Generated Forces and Behavior Representation
41. Laird J, van Lent M (2000) Human-Level AI's Killer Application: Interactive Computer Games. In: Proceedings of the 17th National Conference on Artificial Intelligence
42. Laird J, Rosenbloom P, Newell A (1984) Towards Chunking as a General Learning Mechanism. The 1984 National Conference on Artificial Intelligence (AAAI), pp 188–192
43. Laramée F (2002) A Rule Based Architecture Using Dempster-Schafer Theory. In: Rabin S (ed) AI Game Programming Wisdom. Charles River Media
44. Lasseter J, Stanton A (Directors) (1998) A Bug's Life. Motion Picture. http://www.pixar.com/featurefilms/abl/
45. Leonard T (2003) Building an AI Sensory System: Examining the Design of Thief: The Dark Project. In: Proceedings of the 2003 Game Developers' Conference, San Jose
46. Loyall B (1997) Believable Agents: Building Interactive Personalities. PhD Thesis, Carnegie Mellon University
47. Määta A (2002) Realistic Level Design for Max Payne. In: Proceedings of the 2002 Game Developer's Conference, GDC 2002
48. Mac Namee B, Cunningham P (2003) Creating Socially Interactive Non Player Characters: The µ-SIC System. Int J Intell Games Simul 2(1)
49. Mac Namee B, Dobbyn S, Cunningham P, O'Sullivan C (2003) Simulating Virtual Humans Across Diverse Situations. In: Proceedings of Intelligent Virtual Agents '03, pp 159–163
50. Mac Namee B, Rooney P, Lindstrom P, Ritchie A, Boylan F, Burke G (2006) Serious Gordon: Using Serious Games to Teach Food Safety in the Kitchen. The 9th International Conference on Computer Games: AI, Animation, Mobile, Educational & Serious Games CGAMES06, Dublin
51. Magerko B, Laird JE, Assanie M, Kerfoot A, Stokes D (2004) AI Characters and Directors for Interactive Computer Games. The 2004 Innovative Applications of Artificial Intelligence Conference. AAAI Press, San Jose
52. Thalmann MN, Thalmann D (1994) Artificial Life and Virtual Reality. Wiley
53. Michael D, Chen S (2005) Serious Games: Games That Educate, Train, and Inform. Course Technology PTR
54. Muller J (1996) The Design of Intelligent Agents: A Layered Approach. Springer
55. Nareyek A (2001) Constraint Based Agents. Springer
56. Nareyek A (2007) Game AI is Dead. Long Live Game AI! IEEE Intell Syst 22(1):9–11
57. Nieborg D (2004) America's Army: More Than a Game. Bridging the Gap: Transforming Knowledge into Action through Gaming and Simulation. Proceedings of the 35th Conference of the International Simulation and Gaming Association (ISAGA), Munich
58. Ortony A, Clore GL, Collins A (1988) The cognitive structure of emotions. Cambridge University Press, Cambridge
59. Perlin K, Goldberg A (1996) Improv: A System for Scripting Interactive Actors in Virtual Worlds. In: Proceedings of the ACM Computer Graphics Annual Conference, pp 205–216
60. Proyas A (Director) (2004) I, Robot. Motion Picture. http://www.irobotmovie.com
61. Rao AS, Georgeff MP (1991) Modeling rational agents within a BDI-architecture. In: Proceedings of Knowledge Representation and Reasoning (KR&R-91). Morgan Kaufmann, pp 473–484
62. Musse RS, Thalmann D (2001) A Behavioral Model for Real Time Simulation of Virtual Human Crowds. IEEE Trans Vis Comput Graph 7(2):152–164
63. Reed C, Geisler B (2003) Jumping, Climbing, and Tactical Reasoning: How to Get More Out of a Navigation System. In: Rabin S (ed) AI Game Programming Wisdom 2. Charles River Media
64. Reynolds C (1987) Flocks, Herds and Schools: A Distributed Behavioral Model. Comput Graph 21(4):25–34
65. Rodriguez R (Director) (1996) From Dusk 'Till Dawn. Motion Picture
66. Rosenbloom P, Laird J, Newell A (1993) The SOAR Papers: Readings on Integrated Intelligence. MIT Press
67. Sánchez-Crespo D (2006) GDC: Physical Gameplay in Half-Life 2. Retrieved October 2007, from gamasutra.com: http://www.gamasutra.com/features/20060329/sanchez_01.shtml
68. Shao W, Terzopoulos D (2005) Autonomous Pedestrians. In: Proceedings of SIGGRAPH/EG Symposium on Computer Animation, SCA'05, pp 19–28
69. Silverman BG, Bharathy G, O'Brien K, Cornwell J (2006) Human Behavior Models for Agents in Simulators and Games: Part II: Gamebot Engineering with PMFserv. Presence Teleoper Virtual Worlds 15(2):163–185
70. Silverman BG, Johns M, Cornwell J, O'Brien K (2006) Human Behavior Models for Agents in Simulators and Games: Part I: Enabling Science with PMFserv. Presence Teleoper Virtual Environ 15(2):139–162
71. Smith P (2002) Polygon Soup for the Programmer's Soul: 3D Path Finding. In: Proceedings of the Game Developer's Conference 2002, GDC2002
72. Snavely P (2002) Agent Cooperation in FSMs for Baseball. In: Rabin S (ed) AI Game Programming Wisdom. Charles River Media
73. Stanley KO, Bryant BD, Karpov I, Miikkulainen R (2006) Real-Time Evolution of Neural Networks in the NERO Video Game. In: Proceedings of the Twenty-First National Conference on Artificial Intelligence, AAAI-2006. AAAI Press, pp 1671–1674
74. Stout B (1996) Smart Moves: Intelligent Path-Finding. Game Dev Mag Oct
75. Takahashi TS (1992) Behavior Simulation by Network Model. Memoirs of Kougakuin University 73, pp 213–220
76. Terzopoulos D, Tu X, Grzeszczuk R (1994) Artificial Fishes with Autonomous Locomotion, Perception, Behavior and Learning, in a Physical World. In: Proceedings of the Artificial Life IV Workshop. MIT Press
77. Thompson C (2007) Halo 3: How Microsoft Labs Invented a New Science of Play. Retrieved October 2007, from wired.com: http://www.wired.com/gaming/virtualworlds/magazine/15-09/ff_halo
78. Toth J, Graham N, van Lent M (2003) Leveraging gaming in DOD modelling and simulation: Integrating performance and behavior moderator functions into a general cognitive architecture of playing and non-playing characters. Twelfth Conference on Behavior Representation in Modeling and Simulation (BRIMS, formerly CGF), Scottsdale, Arizona
79. Valdes R (2004) In the Mind of the Enemy: The Artificial Intelligence of Halo 2. Retrieved October 2007, from HowStuffWorks.com: http://entertainment.howstuffworks.com/halo2-ai.htm
80. van der Werf E, Uiterwijk J, van den Herik J (2002) Programming a Computer to Play and Solve Ponnuki-Go. In: Proceedings of Game-On 2002: The 3rd International Conference on Intelligent Games and Simulation, pp 173–177
81. van Lent M, McAlinden R, Brobst P (2004) Enhancing the behavioral fidelity of synthetic entities with human behavior models. Thirteenth Conference on Behavior Representation in Modeling and Simulation (BRIMS)
82. Woodcock S (2000) AI Roundtable Moderator's Report. In: Proceedings of the Game Developer's Conference 2000 (GDC2000)
83. Wooldridge M, Jennings N (1995) Intelligent Agents: Theory and Practice. Knowl Eng Rev 10(2):115–152
84. Yerkes RW, Dodson JD (1908) The relation of strength of stimulus to rapidity of habit formation. J Comp Neurol Psychol 18:459–482
85. Zubek R, Horswill I (2005) Hierarchical Parallel Markov Models of Interaction. In: Proceedings of the Artificial Intelligence and Interactive Digital Entertainment Conference, AIIDE 2005
Books and Reviews
DeLoura M (ed) (2000) Game Programming Gems. Charles River Media
DeLoura M (ed) (2001) Game Programming Gems 2. Charles River Media
Dickheiser M (ed) (2006) Game Programming Gems 6. Charles River Media
Kirmse A (ed) (2004) Game Programming Gems 4. Charles River Media
Pallister K (ed) (2005) Game Programming Gems 5. Charles River Media
Rabin S (ed) (2002) AI Game Programming Wisdom. Charles River Media
Rabin S (ed) (2003) AI Game Programming Wisdom 2. Charles River Media
Rabin S (ed) (2006) AI Game Programming Wisdom 3. Charles River Media
Russell S, Norvig P (2002) Artificial Intelligence: A Modern Approach. Prentice Hall
Treglia D (ed) (2002) Game Programming Gems 3. Charles River Media
Computing in Geometrical Constrained Excitable Chemical Systems
JERZY GORECKI1,2, JOANNA NATALIA GORECKA3
1 Institute of Physical Chemistry, Polish Academy of Science, Warsaw, Poland
2 Faculty of Mathematics and Natural Sciences, Cardinal Stefan Wyszynski University, Warsaw, Poland
3 Institute of Physics, Polish Academy of Science, Warsaw, Poland
Article Outline
Glossary
Definition of the Subject
Introduction
Logic Gates, Coincidence Detectors and Signal Filters
Chemical Sensors Built with Structured Excitable Media
The Ring Memory and Its Applications
Artificial Chemical Neurons with Excitable Medium
Perspectives and Conclusions
Acknowledgments
Bibliography

Glossary
Some of the terms used in our article are the same as in the article of Adamatzky in this volume, and they are explained in the glossary of Reaction-Diffusion Computing.
Activator – A substance that increases the rate of reaction.
Excitability – Here we call a dynamical system excitable if it has a single stable state (the rest state) with the following properties: if the rest state is slightly perturbed, then the perturbation uniformly decreases as the system evolves back towards it. However, if the perturbation is sufficiently strong, it may grow by orders of magnitude before the system approaches the rest state. The increase in the variables characterizing the system (usually rapid compared with the time necessary to reach the rest state) is called an excitation. A forest is a classical example of an excitable medium, and a wildfire that burns through it is an excitation. A dynamical system is non-excitable if applied perturbations do not grow and finally decay.
Excitability level – The measure of how strong a perturbation must be to excite the system. For example, for the Ru-catalyzed Belousov–Zhabotinsky reaction, increasing illumination makes the medium less excitable. The decrease in the excitability level can be observed in reduced amplitudes of spikes and in decreased velocities of autowaves.
Firing number – The ratio between the number of generated excitations and the number of applied external stimulations. In most cases we define the firing number as the ratio between the number of output spikes and the number of arriving pulses.
Inhibitor – A substance that decreases the rate of reaction or even prevents it.
Medium – In this article we consider a chemical medium, fully characterized by the local concentrations of reagents and by external conditions such as temperature or illumination level. The time evolution of the concentrations is governed by a set of reaction-diffusion equations, where the reaction term is an algebraic function of the variables characterizing the system and the non-local coupling is described by the diffusion operator. We are mainly concerned with a two-dimensional medium (i.e., a membrane with a solution of reagents, as used in experiments), but the presented ideas can also be applied to one-dimensional or three-dimensional media.
Refractory period – A period of time during which an excitable system is incapable of repeating its response to an applied, strong perturbation. After the refractory period the excitable medium is again ready to produce an excitation in answer to a stimulus.
Spike, autowave – In a spatially distributed excitable medium, a local excitation can spread as a pulse of excitation. Usually a propagating pulse of excitation converges to a stationary shape characteristic of the medium, which does not depend on the initialization and which propagates with a constant velocity – thus it is called an autowave.
Subexcitability – A system is called subexcitable if the amplitude and size of an initiated pulse of excitation decrease in time. However, if the decay time is comparable with the characteristic time of the system, defined as the ratio between the system size and the pulse velocity, then pulses in a subexcitable system travel a sufficiently long distance to carry information.
Subexcitable media can be used to control the amplitude of excitations. Subexcitability is usually related to the system dynamics, but it may also appear as the result of geometrical constraints. For example, a narrow excitable channel surrounded by a non-excitable medium may behave as a subexcitable system, because a propagating pulse dies out due to the diffusion of the activator into the neighborhood.

Definition of the Subject
It has been shown in the article of Adamatzky, Reaction-Diffusion Computing, that an excitable system can
be used as an information processing medium. In such a medium, information is coded in pulses of excitation; the presence of a single excitation or of a group of excitations forms a message. The information processing discussed by Adamatzky is based on a homogeneous excitable medium and the interaction between pulses in such a medium. Here we focus our attention on a quite specific type of excitable medium that has an intentionally introduced structure of regions characterized by different excitability levels. As the simplest case we consider a medium composed of excitable regions, where autowaves can propagate, and non-excitable ones, where excitations rapidly die. Using these two types of medium one can, for example, construct signal channels: stripes of excitable medium where pulses can propagate, surrounded by non-excitable areas thick enough to cancel potential interactions with pulses propagating in neighboring channels. Therefore, the propagation of a pulse along a selected line in an excitable system can be realized in two ways. In a homogeneous excitable medium, it can be done by continuous control of pulse propagation and local feedback with activating and inhibiting factors (see Reaction-Diffusion Computing). In a structured excitable medium, the same result can be achieved by creating a proper pattern of excitable and non-excitable regions. The first method gives more flexibility; the second is simply easier and does not require a permanent watch. As we show in this article, a number of devices that perform simple information processing operations, including the basic logic functions, can easily be constructed with a structured excitable medium. Combining these devices as building blocks, we can perform complex signal processing operations. Such an approach seems similar to the development of electronic computing, where early computers were built of simple integrated circuits.
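The qualitative behavior described above, with pulses propagating along excitable channels and dying in non-excitable regions, can be illustrated by the classic Greenberg–Hastings cellular-automaton caricature of an excitable medium. This is a sketch for intuition only, not the reaction-diffusion model discussed in this article:

```python
# States: 0 = rest, 1 = excited, 2 = refractory, None = non-excitable.
def step(channel):
    """One Greenberg-Hastings update of a 1-D excitable channel:
    excited -> refractory, refractory -> rest, and a resting cell
    becomes excited if an adjacent cell is excited."""
    new = list(channel)
    for i, s in enumerate(channel):
        if s == 1:
            new[i] = 2
        elif s == 2:
            new[i] = 0
        elif s == 0:
            if any(channel[j] == 1 for j in (i - 1, i + 1)
                   if 0 <= j < len(channel)):
                new[i] = 1
    return new

# A pulse initiated at the left end travels along the excitable channel
# and dies at the non-excitable cell, so the rightmost cell never fires.
channel = [1, 0, 0, 0, None, 0]
for _ in range(4):
    channel = step(channel)
assert channel == [0, 0, 0, 2, None, 0]
```

The non-excitable cell plays the role of the non-excitable regions described in the text: it blocks the pulse, confining signals to the excitable stripes that act as channels.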
The research on information processing with structured excitable media has been motivated by a few important problems. First, we would like to investigate how the properties of a medium can be efficiently used to construct devices performing given functions, and which tasks are the most suitable for chemical computing. There is also the question of generic designs valid for any excitable medium versus specific ones that use unique features of a particular system (for example, a one-dimensional generator of excitation pulses that can be built with an excitable surface reaction [23]). In information processing with a structured excitable medium, the geometry of the medium is as important as its dynamics, and it seems interesting to know what types of structures are related to specific functions. In this article we present a number of such structures characteristic of particular information processing operations.
Another important motivation for this research comes from biology. Even the simplest biological organisms can process information and make decisions important for their lives without CPUs, clocks or sequences of commands, as in the standard von Neumann computer architecture [15]. In biological organisms, even at the very basic cellular level, excitable chemical reactions are responsible for information processing. The cell body, considered as an information processing medium, is highly structured. We believe that analogs of the geometrical structures used for certain types of information processing operations in structured excitable media will be recognized in biological systems, so that we will better understand their role in living organisms. At a higher level, the analogies between information processing with chemical media and signal processing in the brain seem even closer, because the excitable dynamics of calcium in neural tissue is responsible for signal propagation in the nervous system [30]. Excitable chemical channels that transmit signals between processing elements look similar to dendrites and axons. As we show in this article, the biological neuron has its chemical analog, and this allows for the construction of artificial neural networks using chemical processes. Bearing in mind that neural networks are less vulnerable to random errors than classical algorithms, one can go back from biology to man-made computing and adapt these concepts to a fast excitable medium, for example specially prepared semiconductors (Unconventional Computing, Novel Hardware for). The article is organized in the following way. In the next section we discuss the basic properties of a structured chemical medium that seem useful for information processing. Next we consider binary information coded in propagating pulses of concentration and demonstrate how logic gates can be built.
In the following section we show that a structured excitable medium can acquire information about the distances and directions of incoming stimuli. Next we present a simple realization of a read-write memory cell, discuss its applications in chemical counting devices, and show its importance for programming with pulses of excitation. In the following section we present a chemical realization of artificial neurons that perform multi-argument operations on sets of input pulses. Finally, we discuss the perspectives of the field, in particular more efficient methods of information coding and some ideas of self-organization that can produce structured media capable of information processing.

Introduction

Excitability is a widespread behavior of far-from-equilibrium systems [35,39] observed, for example, in chemical reactions (the Belousov–Zhabotinsky (BZ) reaction [45], CO oxidation on Pt [37], combustion of gases [26]) as well as in many other physical (laser action) and biochemical (signaling in neural systems, contraction of cardiovascular tissues) processes [51]. All excitable systems share a common property: they have a stable stationary state (the rest state) in which they reside when not perturbed. A small perturbation of the rest state results only in a small-amplitude linear response of the system that uniformly decays in time. However, if a perturbation is sufficiently large, the system can evolve far away from the rest state before finally returning to it. This response is strongly nonlinear and is accompanied by a large excursion of the variables through phase space, which corresponds to an excitation peak (a spike). The system is refractory after a spike, which means that a certain recovery time must pass before another excitation can take place. Excitability is closely related to relaxation oscillations; the two phenomena differ by one bifurcation only [56]. The properties of excitable systems have an important impact on their ability to process information. If an excitable medium is spatially distributed, then an excitation at one point of the medium (usually seen as an area with a high concentration of a certain reagent) may introduce a perturbation large enough to excite the neighboring points as a result of diffusion or energy transport. Therefore, an excitation can propagate in space in the form of a pulse. Unlike mechanical waves, which dissipate the initial energy and finally decay, traveling spikes use the energy of the medium to propagate and dissipate it. In a typical excitable medium, after a sufficiently long time an excitation pulse converges to a stationary shape independent of the initial condition, which justifies calling it an autowave.
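The all-or-none threshold behavior described above can be illustrated with the FitzHugh–Nagumo equations, which reappear later in this article. The sketch below (plain Python, forward-Euler integration; the parameter values a = 0.7, b = 0.8, ε = 0.08 are a standard excitable choice, not taken from the original text) compares the response to a small and a large perturbation of the rest state:

```python
def fhn_peak(kick, t_max=60.0, dt=0.01):
    """Return the maximal activator value u reached after kicking the
    FitzHugh-Nagumo system out of its rest state by `kick`."""
    a, b, eps = 0.7, 0.8, 0.08      # standard excitable parameters (assumption)
    u, v = -1.1994, -0.6243         # rest state, found numerically
    u += kick                       # instantaneous perturbation of the activator
    peak = u
    for _ in range(int(t_max / dt)):
        du = u - u**3 / 3 - v       # fast activator dynamics
        dv = eps * (u + a - b * v)  # slow recovery (inhibitor) variable
        u += dt * du
        v += dt * dv
        peak = max(peak, u)
    return peak

# a sub-threshold kick decays; a supra-threshold one triggers a full spike
print(fhn_peak(0.25), fhn_peak(1.2))
```

A kick of 0.25 decays back to rest (the peak never exceeds the perturbed value), while a kick of 1.2 sends u on a large excursion to u ≈ 2 before the system returns to the rest state, followed by a refractory phase while v relaxes.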
Undamped propagation of signals is especially important if the distances between the emitting and receiving devices are large. The medium's energy comes from the nonequilibrium conditions at which the system is kept. In the case of a batch reactor, the energy of the initial composition of reagents allows for the propagation of pulses even for days [41], but a typical time of an experiment in such conditions is less than an hour. In a continuously fed reactor, pulses can run as long as the reactants are delivered [44]. If the refractory period of the medium is sufficiently long, then the region behind a pulse cannot be excited again for a long time. As a consequence, colliding pulses annihilate. This type of behavior is quite common in excitable systems. Another important feature is the dispersion relation for a train of excitations. Typically, the first pulse is the fastest and the subsequent ones are slower, which is related to the fact that the medium behind the first
pulse has not relaxed completely [63]. For example, this phenomenon is responsible for the stabilization of pulse positions in a train rotating on a ring-shaped excitable area. However, in some systems [43] an anomalous dispersion relation is observed, and there are selected stable distances between subsequent spikes. Excitable systems characterized by anomalous dispersion can play an important role in information processing, because packages of pulses are stable and thus the information coded in such packages can propagate without dispersion. The mathematical description of excitable chemical media is based on differential equations of reaction-diffusion type, sometimes supplemented by additional equations that describe the evolution of other important properties of the medium, for example the orientation of the surface in the case of CO oxidation on a Pt surface [12,37]. Numerical simulations of pulse propagation in an excitable medium presented in many papers [48,49] use the FitzHugh–Nagumo model describing the time evolution of electrical potentials in nerve membranes [17,18,54]. The models for systems with the Belousov–Zhabotinsky (BZ) reaction, for example the Rovinsky–Zhabotinsky model [61,62] for the ferroin-catalyzed BZ reaction with an immobilized catalyst, can be derived from "realistic" reaction schemes [16] via different techniques of variable reduction. The Ru-catalyzed, photosensitive BZ reaction has become standard in experiments with structured excitable media because the level of excitation can be easily controlled by illumination; see for example [7,27,33]. Light catalyzes the production of bromide, which inhibits the reaction, so non-illuminated regions are excitable and strongly illuminated ones are not. The pattern of excitable (dark) and non-excitable (transparent) fields is simply projected onto a membrane filled with the reagents. For example, the labyrinth shown in Fig.
1 has been obtained by illuminating a membrane through a proper mask. The presence of the membrane is important because it stops convection in the solution and reduces the speed and size of spikes, so the studied systems can be smaller. Other methods of forming excitable channels, based on immobilizing the catalyst by imprinting it on a membrane [68] or attaching it by lithography [21,71,72], have also been used, but they seem to be more difficult. Numerical simulations of the Ru-catalyzed BZ reaction can be performed with different variants of the Oregonator model [9,19,20,38]. For example, the three-variable model uses the following equations:

\varepsilon_1 \frac{\partial u}{\partial t} = u(1-u) - w(u-q) + D_u \nabla^2 u \qquad (1)

\frac{\partial v}{\partial t} = u - v \qquad (2)
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 1 Pulses of excitation propagating in a labyrinth observed in an experiment with the Ru-catalyzed BZ reaction. The excitable areas are dark, the non-excitable ones light. The source of a train of pulses (the tip of a silver wire) is placed at the point A
\varepsilon_2 \frac{\partial w}{\partial t} = \phi + f v - w(u+q) + D_w \nabla^2 w \qquad (3)
where u, v and w denote the dimensionless concentrations of the reagents HBrO2, Ru(4,4'-dm-bpy)3^{3+} and Br^-, respectively. In the considered system of equations, u is the activator and v is the inhibitor. The set of Oregonator equations given above reduces to the two-variable model cited in the article of Adamatzky (Reaction-Diffusion Computing) if the processes responsible for bromide production are very fast compared to the other reactions (ε₂ ≪ ε₁ ≪ 1). In such cases, the local value of w can be calculated by assuming that it corresponds to the stationary solution of the third equation. If such a w is substituted into the first equation, one obtains the two-variable Oregonator model. In the equations given above, the units of space and time are dimensionless and have been chosen to scale the reaction rates. Here we have also neglected the diffusion of the ruthenium catalytic complex, because its diffusion coefficient is usually much smaller than those of the other reagents. The reaction-diffusion equations describing the time evolution of the system can be solved with standard numerical techniques [57]. The parameter φ represents the rate of bromide production caused by illumination and is proportional to
the applied light intensity. Therefore, by adjusting the local illumination (or choosing the proper φ as a function of the space variables in simulations) we create regions with the required level of excitability, such as excitable stripes insulated by a non-excitable neighborhood. Of course, the reagents can freely diffuse between regions characterized by different illuminations. The structure of the equations describing the system's evolution in time and space gives the name "reaction-diffusion computing" [4] to information processing with an excitable medium. Simulations play an important role in tests of potential information processing devices because, unlike in experiments, the conditions and parameters of the studied systems can be easily adjusted with the required precision and kept fixed indefinitely. Most of the information processing devices discussed below were first tested in simulations and then verified experimentally. One of the problems is related to the short time of a typical experiment in batch conditions (a membrane filled with reagents), which does not exceed one hour. In such systems the period in which the conditions can be regarded as stable is usually shorter than 30 minutes, and within this time an experimentalist should prepare the medium, introduce the required illumination and perform observations. The problem can be solved with a continuously fed reactor [44], but the experimental setup is more complex. On the other hand, simulations often indicate that the range of parameters in which a given phenomenon appears is very narrow. Fortunately, experiments seem to be more robust than simulations, and the expected effects can be observed despite the inevitable randomness in reagent preparation.
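The variable reduction described above can be written out explicitly. Treating Eq. (3) as fast (ε₂ ≪ ε₁ ≪ 1) and neglecting the diffusion of w on this fast time scale, the stationary condition gives

\varepsilon_2 \frac{\partial w}{\partial t} \approx 0 \quad\Longrightarrow\quad w_{ss} = \frac{\phi + f v}{u + q} ,

and substituting w_{ss} into Eq. (1) yields the two-variable photosensitive Oregonator model

\varepsilon_1 \frac{\partial u}{\partial t} = u(1-u) - (\phi + f v)\,\frac{u-q}{u+q} + D_u \nabla^2 u , \qquad \frac{\partial v}{\partial t} = u - v .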
Given the relatively short time of experiments in batch conditions and the low stability of the medium, we believe that applications of a liquid chemical excitable medium such as the Ru-catalyzed BZ reaction as an information processor (wetware) are rather academic, oriented mainly toward the verification of ideas. Practical applications of reaction-diffusion computers will probably be based on another type of medium, such as structured semiconductors (Unconventional Computing, Novel Hardware for). In an excitable reaction-diffusion medium, spikes propagate along the minimum-time path. Historically, one of the first applications of a structured chemical medium to information processing was the solution of the problem of finding the shortest path in a labyrinth [67]. The idea is illustrated in Fig. 1. The labyrinth is built of excitable channels (dark) separated by a non-excitable medium (light) that does not allow for interactions between pulses propagating in different channels. Let us assume that we are interested in the distance between the
bottom right corner (point A) and the upper left corner (point B) of the labyrinth shown. The algorithm that scans all possible paths between these points and selects the shortest one is automatically executed if the paths in the labyrinth are built of an excitable chemical medium. To see how it works, let us excite the medium at point A. The excitation spreads through the labyrinth, separates at the junctions, and spikes enter all possible paths. During the time evolution, pulses of excitation can collide and annihilate, but the one that propagates along the shortest path always has unexcited medium in front of it. Knowing the time difference between the moment the pulse is initiated at point A and the moment it arrives at point B, and assuming that the speed of a pulse is constant, we can estimate the length of the shortest path linking both points. The algorithm described above is called the "prairie fire" algorithm, and it is executed automatically by the excitable medium. It finds the shortest path in a highly parallel manner, scanning all possible routes at the same time. It is quite remarkable that the time required for finding the shortest path does not depend on the complexity of the labyrinth structure, but only on the distance between the considered points. Although the estimation of the minimum distance separating two points in a labyrinth is relatively easy (under the assumption that corners do not significantly change the pulse speed), it is more difficult to tell which path is the shortest. To do so, one can trace the pulse that arrived first and plot a line tangential to its velocity. This idea was discussed in [6] in the context of finding the length of the shortest path in a nonhomogeneous chemical medium with obstacles. The method, although relatively easy, requires the presence of an external observer who follows the propagation of pulses.
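In algorithmic terms, the prairie-fire search is breadth-first search executed in parallel by the medium: every excited cell ignites its excitable neighbors one time step later, and the first front to reach the target has traveled the shortest path. A minimal grid sketch (plain Python; the maze layout and the cell coding are our own illustration, not the labyrinth of Fig. 1):

```python
from collections import deque

def prairie_fire(grid, start, goal):
    """Breadth-first 'wavefront' on a grid: '.' is excitable, '#' non-excitable.
    Returns the length of the shortest path from start to goal (in steps),
    or None if the goal is unreachable."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    front = deque([start])
    while front:
        r, c = front.popleft()
        if (r, c) == goal:
            return dist[(r, c)]          # first arrival = shortest path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == '.' and (nr, nc) not in dist):
                dist[(nr, nc)] = dist[(r, c)] + 1   # excited one step later
                front.append((nr, nc))
    return None

maze = ["....",
        ".##.",
        "...."]
print(prairie_fire(maze, (0, 0), (2, 3)))   # -> 5
```

As in the chemical medium, the number of wavefront steps until the goal is reached depends only on the distance, not on how many dead ends the maze contains.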
An alternative technique of extracting the shortest path, based on image processing, was demonstrated in [59]. One can also locate the shortest path connecting two points in a labyrinth in a purely chemical way, using the coincidence of excitations generated at the endpoints of the path. Such a method, described in [29], allows one to find the midpoint of a trajectory and thus locate the shortest path point by point. The approximately constant speed of excitation pulses in a reaction-diffusion medium allows one to solve some geometrically oriented problems. For example, one can "measure" the number π by comparing the time of pulse propagation around a non-excitable circle of radius d with the time of propagation around a square with a side of the same length [69]. Similarly, the constant speed of propagating pulses can be used to obtain a given fraction of an angle. For example, trisection of an angle can be done if the angle arms are linked with two arc-shaped excitable channels, as shown in Fig. 2.
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 2 The idea of angle trisection. The angle arms are linked with red arc-shaped excitable channels with the radius ratio r : R = 1 : 3. The channels have been excited at the same time at the right arm of the angle. At the moment when the excitation pulse (a yellow dot) on the smaller arc reaches the second arm, the excitation on the other arc marks a point on a line trisecting the angle (the dashed line)
The ratio of the channel radii should be equal to 3. Both channels are excited at the same time at points on the same arm. The position of the excitation on the larger arc, at the moment when the excitation propagating on the shorter arc reaches the other arm, belongs to a line that trisects the angle. The examples given above are based on two properties of a structured excitable medium: the fact that the non-excitability of the neighborhood can restrict the motion of an excitation pulse to a channel, and the fact that the speed of propagation depends on the excitability level of the medium but not on the channel shape. This is true when the channels are wide and their curvature is small. There are also other properties of an excitable medium useful for information processing. A typical shape of a pulse propagating in a stripe of excitable medium is illustrated in Fig. 3. The pulse moves from left to right, as the arrow indicates. The profiles in Fig. 3b,c show cross sections of the concentrations of the activator u(x, y) and the inhibitor v(x, y) along the horizontal axis of the stripe at a selected moment of time. The peak of the inhibitor follows the activator maximum and is responsible for the refractory character of the region behind the pulse. Figure 3d illustrates the profile of the activator along a line perpendicular to the stripe axis. The concentration of the activator reaches its maximum at the stripe center and rapidly decreases at the boundary between the excitable and non-excitable areas. Therefore, the width of the excitable channel can be used as a parameter that controls the maximum concentration of the activator in a propagating pulse.
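Both constant-speed constructions, the π measurement and the angle trisection, reduce to elementary arithmetic on path lengths. A quick check (plain Python; the speed and the numerical values of d, r and θ are arbitrary choices for illustration):

```python
import math

c = 1.0                     # pulse speed, arbitrary units
d = 2.5                     # radius of the circular track = side of the square (arbitrary)

# travel time around a circle of radius d versus a square of side d
t_circle = 2 * math.pi * d / c
t_square = 4 * d / c
pi_estimate = 2 * t_circle / t_square      # (2*pi*d)/(4*d) * 2 = pi
print(pi_estimate)

# trisection: two arcs of radii r and R = 3r, excited simultaneously;
# when the pulse on the small arc has swept the full angle theta,
# the pulse on the large arc has swept theta/3
theta = math.radians(75.0)
r, R = 1.0, 3.0
t_arrival = r * theta / c                  # inner pulse reaches the far arm
swept_outer = c * t_arrival / R            # angle covered on the outer arc by then
print(math.degrees(swept_outer))           # -> 25.0
```

The second computation makes the mechanism of Fig. 2 explicit: equal arc lengths traversed at equal speed correspond to angles in the inverse ratio of the radii, so a radius ratio 1 : 3 yields exactly one third of the angle.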
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 3 The shape of the excitation pulse in a stripe of excitable medium. The arrow indicates the direction of propagation. Calculations were done for the Oregonator model (Eqs. (1)–(3)) with the following parameter values: f = 1.12, q = 0.002, ε₁ = 0.08, ε₂ = 0.00097, φ_excitable = 0.007, φ_non-excitable = 0.075. a The position of a spike on the stripe, b, c concentration of the activator and inhibitor along the x-axis, d concentration of the activator along the y-axis
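The parameter values quoted in the caption can be probed directly in a zero-dimensional (well-stirred, diffusion-free) version of the two-variable reduction of Eqs. (1)–(3). The sketch below (plain Python, forward Euler; the perturbation size and the integration times are our own choices) shows that a small activator perturbation triggers a full spike at the "excitable" illumination φ = 0.007 but dies out at the "non-excitable" value φ = 0.075:

```python
def bz_peak(phi, kick=0.05, dt=1e-4):
    """Maximal activator concentration u reached after perturbing the rest
    state of the two-variable photosensitive Oregonator (no diffusion)."""
    f, q, eps = 1.12, 0.002, 0.08        # values from the Fig. 3 caption
    u = v = 0.004                        # rough guess, relaxed to rest below

    def step(u, v):
        du = (u * (1 - u) - (f * v + phi) * (u - q) / (u + q)) / eps
        dv = u - v
        return u + dt * du, v + dt * dv

    for _ in range(int(5.0 / dt)):       # settle to the rest state
        u, v = step(u, v)
    u = kick                             # perturb the activator
    peak = u
    for _ in range(int(10.0 / dt)):
        u, v = step(u, v)
        peak = max(peak, u)
    return peak

print(bz_peak(0.007), bz_peak(0.075))    # spike vs. decay
```

At φ = 0.007 the perturbation lies above the excitation threshold and u shoots up close to 1 before the inhibitor terminates the spike; at φ = 0.075 the extra bromide production raises the threshold above the perturbation and u decays monotonically, which is the zero-dimensional analog of a non-excitable (strongly illuminated) region.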
Figure 4 shows profiles of the activator concentration in an excitable channel with a triangular shape. The two curves in Fig. 4b illustrate the profile of u along lines 1 and 2, respectively, measured at the time when the concentration of u on a given line reaches its maximum. It can be seen that the maximum concentration of the activator decreases as the pulse propagates towards the tip of the triangle. This effect can be used to build a chemical signal diode. Let us consider two pieces of excitable medium, one triangular and the other rectangular, separated by a non-excitable gap as shown in Fig. 4a. It is expected that the perturbation of the rectangular area by a pulse propagating towards the tip of the triangular one is much smaller than the perturbation of the triangular area by a pulse propagating towards the end of the rectangular channel. Using this effect, a chemical signal diode that transmits pulses in one direction only can be constructed just by selecting the right width of the non-excitable medium separating the two pieces of excitable medium: a triangular one and a rectangular one [5,40]. The idea of the chemical signal diode presented in Fig. 4a was, in some sense, generalized by Davydov et al. [46], who considered pulses of excitation propagating on a two-dimensional surface in three-dimensional space. It has been shown that the propagation of spikes on surfaces with rapidly changing curvature can be unidirectional. For example, such an effect occurs when an excitation propagates on the surface of a tube with a variable diameter. In such a case, a spike moving from a segment with a small diameter towards a larger one is stopped.
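The asymmetry can be captured by a deliberately crude toy model (all numbers below are illustrative assumptions, not fitted to the BZ kinetics): the pulse amplitude reaching the gap is taken as 1 from the flat end of the rectangular channel but reduced at the triangle tip, and the perturbation decays exponentially across the non-excitable gap. A gap width then exists for which only one direction transmits:

```python
import math

A_RECT = 1.0     # amplitude arriving from the flat end of the rectangle (assumed)
A_TIP  = 0.4     # reduced amplitude at the triangle tip, cf. Fig. 4b (assumed)
DECAY  = 1.0     # decay length of the perturbation in the non-excitable gap (assumed)
THRESH = 0.5     # excitation threshold of the medium behind the gap (assumed)

def transmits(amplitude, gap):
    """A pulse crosses the gap if its exponentially attenuated
    perturbation still exceeds the excitation threshold."""
    return amplitude * math.exp(-gap / DECAY) >= THRESH

gap = 0.5        # gap width chosen between the two critical widths
forward  = transmits(A_RECT, gap)   # rectangle -> triangle: passes
backward = transmits(A_TIP, gap)    # triangle tip -> rectangle: blocked
print(forward, backward)            # -> True False
```

The design freedom is exactly the one described in the text: any gap width between the two critical values (here, between the widths at which each amplitude drops to the threshold) yields diode behavior.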
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 4 The shape of the activator concentration in an excitable medium of triangular shape. a The structure of excitable (dark) and non-excitable (light) regions, b concentration of the activator along the lines marked in a. The profiles correspond to the times when the activator reaches its maximum on a given line
For both of the chemical signal diodes mentioned above, the excitability of the medium is a non-trivial function of the space variables. However, a signal diode can also be constructed when the excitability of the medium changes in one direction only, so that in a properly selected coordinate system it is a function of a single space variable. For example, diode behavior in a system where the excitability level is a triangular function of a single space variable has been confirmed in numerical simulations based on the Oregonator model of the photosensitive, Ru-catalyzed BZ reaction (Eqs. (1)–(3)) [76]. If the catalyst is immobilized, then pulses that enter the region of inhomogeneous illumination from the strongly illuminated side are not transmitted, whereas pulses propagating in the other direction can pass through (see Fig. 5a). A similar diode-like behavior resulting from a triangular profile of medium excitability can be expected for the oxidation of CO on a Pt surface. Calculations demonstrate that a triangular profile of temperature (in this case reduced with respect to that characterizing the excitable medium) allows for the unidirectional transmission of spikes characterized by a high surface oxygen concentration [23]. However, the realization of a chemical signal diode can be simplified. If the properties of the excitable channels on both sides of the diode are the same, then the diode can be constructed with just two stripes of non-excitable medium characterized by different excitabilities, as illustrated in Fig. 5b. If the symmetry is broken at the level of the input channels, then the construction of a signal diode can be simpler yet, reducing to a single narrow non-excitable gap with a much lower excitability than that of the neighboring channels (cf. Fig. 5c). In
both cases numerical simulations based on the Oregonator model have shown that the diode works. The predictions of the simulations have been qualitatively confirmed by experiments [25]. The different realizations of a signal diode show that even for very simple signal processing devices the corresponding geometrical structure of excitable and non-excitable regions is not unique. Alternative constructions of chemical information processing devices seem important because they tell us about the minimal conditions necessary to build a device that performs a given function. In this respect the diode built with a single non-excitable region looks interesting, because such a situation may occur at the cellular level, where the conditions inside a cell differ from those outside. The diode behavior in the geometry shown in Fig. 5c indicates that the unidirectional propagation of spikes can be forced by a channel in a membrane that is transparent to the molecules or ions responsible for signal transmission. Wave-number-dependent transmission through a non-excitable barrier is another feature of a chemical excitable medium important for information processing [48]. Let us consider two excitable areas separated by a stripe of non-excitable medium and a pulse of excitation propagating in one of those areas. The perturbation introduced behind the stripe by an arriving pulse depends on the direction of its propagation. If the pulse wavevector is parallel to the stripe, then the perturbation is smaller than when the pulse arrives perpendicularly [48]. Therefore, the width of the stripe can be selected such that pulses propagating perpendicularly to the stripe can cross it, whereas a pulse propagating along the stripe
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 5 The excitability as a function of a space variable in different one-dimensional realizations of a signal diode with the BZ reaction inhibited by light: a a triangular profile of illumination, b the illumination profile in a diode composed of two non-excitable barriers, c the illumination profile in a single-barrier diode with nonsymmetrical excitable inputs. The upper graphs in b and c illustrate the illumination of the membrane used in experiments [25]
does not excite the area on the other side. This feature is frequently used to arrange the geometry of excitable channels such that pulses arriving from one channel do not excite another. Non-excitable barriers in a structured medium can play a more complex role than described above. The problem of barrier crossing by a periodic train of pulses can be seen as excitation via a periodic perturbation of the medium. It has been studied in detail in [13]. The response of the medium is quite characteristic: the firing number (the ratio of the number of transmitted pulses to the number of arriving ones) forms a devil's-staircase-like function of the perturbation strength. In the case of barrier crossing, the strength of the excitation generated behind a barrier by an arriving pulse depends on the character of the non-excitable medium, the barrier width, and the frequency of the incoming signal (usually, due to incomplete relaxation of the medium, the amplitude of spikes decreases with frequency). A typical complex frequency transformation after barrier crossing is illustrated in Fig. 6. Experimental and numerical studies of the firing number of a transmitted signal have been published [10,64,72,74]. It is interesting that the shape of the regions characterized by the same firing number in the space of two parameters, barrier width and signal frequency, is not generic and depends on the type of excitable medium. Figure 7 compares the firing numbers obtained for the FitzHugh–Nagumo and Rovinsky–Zhabotinsky models. In the first case, trains of pulses with small periods can cross wider barriers than low-frequency trains; for the second model the dependence is reversed.
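The simplest caricature of this frequency transformation treats the medium behind the barrier as a unit with a fixed recovery time that grows with the barrier width: a pulse arriving with period t_p is transmitted only if at least the recovery time has elapsed since the last transmitted pulse. This toy model is our own illustration; it reproduces only the coarsest plateaus 1, 1/2, 1/3, ..., not the full devil's staircase with lockings such as 6/7:

```python
def firing_number(t_p, recovery, n_pulses=3000):
    """Fraction of arriving pulses (period t_p) transmitted across a barrier
    whose far side needs `recovery` time units after each transmitted spike."""
    transmitted = 0
    last_fire = None
    for k in range(n_pulses):
        t = k * t_p                      # arrival time of the k-th pulse
        if last_fire is None or t - last_fire >= recovery:
            transmitted += 1             # the far side is excited
            last_fire = t
    return transmitted / n_pulses

print(firing_number(1.0, 0.5))   # thin barrier: every pulse crosses -> 1.0
print(firing_number(1.0, 1.5))   # every second pulse crosses -> 0.5
print(firing_number(1.0, 2.5))   # every third pulse crosses  -> ~0.333
```

The intermediate rational plateaus observed in the experiments require the graded, incomplete recovery of the real medium, which this hard-refractory caricature deliberately omits.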
Logic Gates, Coincidence Detectors and Signal Filters

The simplest application of excitable media in information processing is based on the assumption that the logical FALSE and TRUE values are represented by the rest state and by the presence of an excitation pulse at a given point of the system, respectively. Within such an interpretation, a pulse represents a bit of information propagating in space. When the system remains in its rest state, no information is recorded or processed, which seems plausible from a biological point of view. Information coded in excitation pulses is processed in regions of space where pulses interact (via collision and subsequent annihilation, or via a transient local change in the properties of the medium). In this section we demonstrate that the geometry of excitable channels and non-excitable gaps can be tortured (the authors are grateful to Prof. S. Stepney for this expression) to the level at which the system starts to perform the simplest logic operations on pulses. Binary chemical logic gates can be used as building blocks for devices performing more complex signal processing operations. Information processing with structured excitable media is "unconventional" because it is performed without a clock that sequences the operations, as in the standard von Neumann computer architecture. On the other hand, in the signal processing devices described below the proper timing of signals is important, and this is achieved by selecting the right lengths and geometry of the channels. In some cases, for example for the
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 6 Frequency transformation on a barrier – results for the FitzHugh–Nagumo model [64]. a Comparison of an arriving train of pulses (1) and the transmitted signal (2). b A typical dependence of the firing number on the barrier width. The plateaus are labeled with the corresponding values of the firing number (1, 6/7, 4/5, 3/4, 2/3, 1/2 and 0). At the points labeled a–e the following values have been observed: a – 35/36; b – 14/15; c – 10/11; d – 8/9 and e – 5/6
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 7 The firing number as a function of the barrier width d and the time interval between consecutive pulses (t_p). Labels in the white areas give the firing number; the gray color marks parameter values where more complicated transformations of frequency occur. a Results calculated for the FitzHugh–Nagumo model, b for the Rovinsky–Zhabotinsky model. Both space and time are in dimensionless units
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 8 The distribution of excitable channels (dark) that form the OR gate. b and c illustrate the time evolution of a single pulse arriving from inputs I1 and I2, respectively
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 9 The distribution of excitable channels (dark) that form the AND gate. b and c illustrate the response of the gate to a single pulse arriving from input I1 and to a pair of pulses, respectively
operations on trains of pulses, the presence of a reference signal, which plays a role similar to that of a clock, would significantly help to process information [70]. Historically, logic gates were the first devices realized with a structured chemical medium [1,2,14,65,73,75]. The arrangement of channels that executes the logical sum (OR) operation is illustrated in Fig. 8 [48]. The gate is composed of three excitable stripes (marked gray) surrounded by a non-excitable medium. The gaps separating the stripes have been selected such that a pulse arriving perpendicularly to a gap can excite the area on the other side, whereas a pulse propagating parallel to the gap does not generate a perturbation sufficient to excite the medium behind it. If there is no input pulse, the OR gate remains in the rest state and does not produce any output. A pulse in either of the input channels I1 and I2 can cross the gap separating these channels from the output O, and an output spike appears. The excitation of the output channel generated by a pulse from one of the input channels propagates parallel to the gap separating the other input channel, so it does not interfere with the other input, as seen in Fig. 8b and c. The frequency at which the described OR gate operates is limited by the refractory period of the output medium. If signals arrive from both input channels but the time difference between the pulses is smaller than the refractory time, then only the first pulse produces an output spike. The gate that returns the logical product (AND) of the input signals is illustrated in Fig. 9. The width of the gap separating the output channel O from the input one (I1, I2) is selected such that a single excitation propagating in the input channel does not excite the output (Fig. 9b). However,
if two counterpropagating pulses meet, then the resulting perturbation is sufficiently strong to generate an excitation in the output channel (Fig. 9c). Therefore, the output signal appears only when both sides of the input channel have been excited and the pulses of excitation have collided in front of the output channel. The width of the output channel defines the maximum time difference between input pulses that are treated as simultaneous. Therefore, the structure shown in Fig. 9 can also be used as a detector of time coincidence between pulses. The design of the negation gate is shown in Fig. 10. The gaps separating the input channels, the source channel and the output channel can be crossed by a pulse that propagates perpendicularly, but they are impenetrable for pulses propagating parallel to the gaps. The NOT gate should deliver an output signal if the input is in the rest state. Therefore, it has to contain a source of excitation pulses (marked S). If the input is in the rest state then pulses from the source propagate unperturbed and enter
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 10 The distribution of excitable channels (dark) that form the NOT gate
Computing in Geometrical Constrained Excitable Chemical Systems
the output channel. If there is an excitation pulse in the input channel then it enters the channel linking the source with the output and annihilates with one of the pulses generated by the source. As a result, no output pulse appears. At first glance the described NOT gate works fine, but if a single pulse is used to code information, then the input pulse has to arrive at the right time to block the source. Therefore, additional synchronization of the source is required. If information is coded in trains of pulses then the frequency of the source should match the one used for coding. The structure of excitable channels for the exclusive OR (XOR) gate is illustrated in Fig. 11c [34]. Two input channels bring signals to the central area C. The output channels are linked together via diodes (Fig. 4) that stop possible backward propagation. As in the previous cases, only pulses perpendicular to the gaps can pass through the non-excitable gaps between the central area and both the input and output channels. The shape of the central area has been designed such that an excitation generated by a single input pulse propagates parallel to one of the outputs and perpendicular to the other. As a result, one of the output channels is excited (see Fig. 11a). However, if pulses from both input channels arrive at the same time then the wave vector of the excitation in the central part is always parallel to the boundaries (Fig. 11b). Therefore, no output signal appears. Of course, there is no output signal if neither of the inputs is excited. It is worth noting that for some geometries of the XOR gate the diodes in the output channels are not necessary, because the backward propagation does not produce a pulse with a wave vector perpendicular to a gap, as seen in the fourth frame of Fig. 11a.
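The behavior of these gates can be reduced to pulse arrival times. The sketch below models the OR gate's refractory limit, the AND gate's coincidence window, and the XOR gate's exclusion of simultaneous pulses; all time constants are illustrative assumptions, not values taken from the experiments described here.

```python
# Abstract timing model of the excitable-medium gates: pulses are
# represented by their arrival times (arbitrary units), and the gate
# geometries are reduced to two assumed timing parameters.

REFRACTORY = 5.0    # assumed refractory period of the output channel
COINCIDENCE = 1.0   # assumed window within which pulses "collide"

def or_gate(inputs):
    """OR: any input pulse crosses the gap, but pulses closer together
    than the refractory period produce a single output spike."""
    out = []
    for t in sorted(inputs):
        if not out or t - out[-1] >= REFRACTORY:
            out.append(t)
    return out

def and_gate(in1, in2):
    """AND: an output spike appears only when pulses from both inputs
    collide in front of the output channel (time-coincidence detection)."""
    return [t1 for t1 in in1 for t2 in in2 if abs(t1 - t2) <= COINCIDENCE]

def xor_gate(in1, in2):
    """XOR: a lone pulse excites the output; coincident pulses do not."""
    coincident = and_gate(in1, in2)
    lone = [t for t in in1 + in2
            if all(abs(t - c) > COINCIDENCE for c in coincident)]
    return sorted(lone)

print(or_gate([0.0, 2.0, 10.0]))          # second pulse falls in the refractory tail
print(and_gate([0.0, 10.0], [0.5, 20.0])) # only the first pair coincides
print(xor_gate([0.0, 10.0], [0.2]))       # only the lone pulse passes
```

The same functions applied to long pulse trains reproduce the train-processing behavior discussed below for signals composed of many spikes.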
Another interesting example of behavior resulting from the interaction of pulses has been observed in a cross-shaped structure built of excitable regions separated by gaps penetrable for perpendicular pulses [65,66], shown in Fig. 12. The response of the cross-shaped junction to a pair of pulses arriving from two perpendicular directions has been studied as a function of the time difference between the pulses. Of course, if the time difference is large, the pulses propagate independently along their channels. If the time difference is small, the cross-junction acts like the AND gate and the output excitation appears in one of the corner areas. However, for a certain time difference the first arriving pulse is able to redirect the other and force it to follow. The effect is related to the incomplete relaxation of the central area of the junction at the moment when the second pulse arrives. Pulse redirection seems to be an interesting effect from the point of view of programming with excitation pulses, but in practice it requires high precision in selecting the right time difference.
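The three regimes of the junction can be summarized as a function of the time difference between the two pulses; the threshold values below are illustrative assumptions standing in for the coincidence window and the relaxation time of the central area.

```python
# Qualitative model of the cross-shaped junction's response to two
# perpendicular pulses separated by a time difference dt.

def junction_response(dt, coincidence=1.0, refractory=5.0):
    dt = abs(dt)
    if dt <= coincidence:
        return "corner excitation (AND-like)"  # pulses effectively collide
    if dt < refractory:
        return "redirection"                   # second pulse follows the first
    return "independent propagation"           # channels do not interact

for dt in (0.5, 3.0, 8.0):
    print(dt, junction_response(dt))
```

The narrow "redirection" band between the two thresholds is what makes the effect hard to exploit in practice.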
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 11 c the distribution of excitable channels (dark) that form the XOR gate. a and b illustrate the response of the gate to a single pulse arriving from input I1 and to a pair of pulses, respectively
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 12 The distribution of excitable and non-excitable regions in a cross-shaped junction. Here the excitable regions are gray and the non-excitable black. Consecutive figures illustrate an interesting type of time evolution caused by the interaction of pulses. The two central figures are enlarged in order to show how incomplete relaxation influences the shape of the next pulse
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 13 The distribution of excitable channels in devices that compare trains of pulses. The device shown in a applies the XOR operation to a pair of signals and makes a signal composed of spikes that do not coincide with the other signal. b generates a signal composed of spikes that arrive through I1 and do not coincide with excitation pulses coming from I2
The logic gates described above can also be applied to transform signals composed of many spikes. For example, two trains of pulses can be added together if they pass through an OR gate. The AND gate creates a signal composed of the coinciding pulses of both trains. It is also easy to use a structured excitable medium to generate a signal that does not contain coinciding pulses [50]. The structure of the corresponding device is illustrated in Fig. 13. All non-excitable gaps are penetrable for perpendicular pulses. If a pulse of excitation arrives from one of the input channels then it excites segment A and propagates towards the other end of it. If there is no spike in the other input channel then this excitation propagates unperturbed and activates the output channel. However, if a pulse from the other channel arrives, it annihilates with the original pulse and no output signal is generated. The structure illustrated in Fig. 13a can easily be transformed into a device that compares two trains of pulses such that the resulting signal is composed of the spikes of the first signal that do not coincide with the pulses of the second train. Such a device can be constructed simply by omitting one of the output channels. Figure 13b illustrates the geometry of the device that produces an output signal composed of the pulses arriving at input I1 that have no corresponding spikes in the signal arriving from I2. The coincidence detector can be used as a frequency filter that transmits periodic signals within a certain frequency interval [22,50]. The idea of such a filter is shown
in Fig. 14. The device tests whether the time between subsequent spikes of the train remains in the assumed range. As illustrated, the signal arriving from the input I separates and enters the segment E through the diodes D1 and D2. The segments E and F form a coincidence detector (or an AND gate, cf. Fig. 9). An excitation of the output appears when an excitation coming to E via the segment C and D2 is in coincidence with the subsequent spike of the train that arrived directly via D1. The time shift for which coincidences are tested is determined by the difference in the lengths of the two paths. The time resolution depends on the width of the F channel (here l2 − l1). For periodic signals the presented structure works as a frequency filter and transmits signals within the frequency range f± = v/(Δr ± w/2), where Δr is the difference of the distances traveled by spikes, calculated to the point in E placed above the center of the F channel (Δr ≈ 2 l_e), w is the width of the output channel, and v is the velocity of a spike in the medium. Typical characteristics of the filter are illustrated in Fig. 14b. The points represent the results of numerical simulations and the line shows the filter characteristics calculated from the equation given above. It can be noticed that at the ends of the transmitted frequency band a change in the output signal frequency is observed. This unwelcome effect can easily be avoided if a sequence of two
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 14 a The distribution of excitable and non-excitable regions in a band filter. b Typical characteristics of the band filter [22]. The black dots mark periods for which the signal is not transmitted, the empty ones indicate full transmission, the empty diamonds mark the periods of the arriving signal for which every second spike is transmitted
identical filters is used. A filter tuned to a certain frequency also passes its harmonics, because the output signal can be generated by the coincidence of every second, third, etc. pulse in the train.
Chemical Sensors Built with Structured Excitable Media
Even the simplest organisms, without specialized nervous systems or brains, are able to search for the optimum living conditions and sources of food. We should not be astonished by this fact, because even chemical systems can receive and process information arriving from their neighborhood. In this section we describe how simple structures built of excitable media can be used for direction and distance sensing. In order to simplify the conversion between an external stimulus and the information processed by a sensor, we assume that the environment is represented by the same excitable medium as the sensor itself, so a stimulus can directly enter the sensor and be processed. One possible strategy of sensing is based on a one-to-one relationship between the measured variable and the activated sensor channel [53]. An example of a sensor of that type is illustrated in Fig. 15. It is constructed with a highly excitable black ring surrounded by a number
of coincidence detectors, denoted X1, X2 and X3. The excitation of a detector appears if a pair of pulses collide on the ring in front of it. Let us consider a homogeneous excitable environment and a circular pulse of excitation with its source at the point S1. The high excitability of the ring means that excitations on the ring propagate faster than those in the surrounding environment. At a certain moment the pulse arrives at the point P1, which lies on the line connecting S1 and the center of the ring (see Fig. 15b). The arriving pulse creates an excitation on the ring originating from the point P1. This excitation splits into a pair of pulses rotating in opposite directions and, after propagating around the ring, they collide at the point symmetric to P1 with respect to the center of the ring, O. The point of collision can be located by an array of coincidence detectors, and thus we obtain information on the wave vector of the arriving pulse. In this method the resolution depends on the number of detectors used, because each of them corresponds to a certain range of the measured wave vectors. The fact that a pulse has appeared in a given output channel implies that no other channel of the sensor gets excited. It is interesting that the output information is reversed in space compared with the input: the left sensor channels are excited when an excitation arrives from the right and vice versa.
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 15 The distribution of excitable areas (gray and black) and non-excitable regions (white) in the ring-shaped direction detector. Triangles marked X1, X2 and X3 show the output channels. b and c illustrate the time evolution of pulses generated from sources at different locations. The excited output channel depends on the direction of the source (copied from [53] with the permission of the authors)
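The geometry of the ring detector reduces to a simple rule: two counter-rotating pulses started at the entry point meet at the diametrically opposite point. The sketch below expresses this, with the detector count and angular placement as illustrative assumptions.

```python
# Geometric sketch of the ring-shaped direction detector. A pulse enters
# the fast ring at the point closest to the source, splits into two
# counter-rotating pulses, and these collide at the antipodal point,
# where a coincidence detector fires.

def collision_angle(entry_deg):
    """The two counter-rotating pulses travel at the same speed, so they
    meet at the point diametrically opposite the entry point."""
    return (entry_deg + 180.0) % 360.0

def fired_detector(source_deg, n_detectors=3):
    """Index of the coincidence detector nearest the collision point;
    each detector covers a sector of 360/n_detectors degrees."""
    sector = 360.0 / n_detectors
    return round(collision_angle(source_deg) / sector) % n_detectors

# The spatial reversal noted above: a source at 180 deg fires the detector
# at 0 deg, and a source at 60 deg fires the detector centered at 240 deg.
print(fired_detector(180.0))
print(fired_detector(60.0))
```

The angular resolution is simply 360/n_detectors degrees, which is why the resolution of this sensor grows only with the number of detectors.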
The geometry of the direction sensor that sends information as an excitation pulse in one of its detector channels can be simplified further. Two such realizations are illustrated in Fig. 16. The excitable areas that form the sensor are marked black, the excitable surrounding medium where stimuli propagate is gray, and the non-excitable areas are white. Let us consider an excitation generated in the medium M by a source S. It is obvious that if the source is located directly above the sensor then pulses of excitation are generated at the ends of the sensor channel D at the same time; they finally annihilate above the central coincidence channel and generate an output pulse in it. If the source is located at a certain angle with respect to the vertical line then the annihilation point is shifted off the center of D. We have performed a series of numerical simulations to find the relation between the position of the source and the position of the annihilation point in D. In Fig. 16c and 16d the position of the annihilation point is plotted as a function of the angle between the source position and the vertical line. Although the constructions of the sensors seem very similar, the working range of angles is larger for the sensor shown in Fig. 16b, whereas the sensor shown in Fig. 16a offers slightly better resolution. Using information from two detectors of direction, the position of the source of excitation can easily be located, because it lies at the intersection of the lines representing the detected directions [77]. Another strategy of sensing is based on the observation that the frequencies of pulses excited in a set of sensor channels by an external periodic perturbation contain information on the location of the stimulus. Within this strategy there is no direct relationship between the number of
sensor channels and the sensor resolution and, as we show below, a sensor with a relatively small number of channels can quite precisely estimate the distance separating it from the excitation source. As we have mentioned in the Introduction, the frequency of a chemical signal can change after propagating through a barrier made of non-excitable medium. For a given frequency of arriving spikes, the frequency of excitations behind the barrier depends on the barrier width and on the angle between the normal to the barrier and the wave vector of the arriving pulses. This effect can be used for sensing. The geometrical arrangement of excitable and non-excitable areas in a distance sensor is shown in Fig. 17. The excitable signal channels (in Fig. 17a they are numbered 1–5) are wide enough to ensure stable propagation of spikes. They are separated from one another by parallel non-excitable gaps that do not allow for interference between pulses propagating in neighboring channels. The sensor channels are separated from the excitable medium M by the non-excitable sensor gap G. The width of this gap is very important. If the gap is too wide then no excitation of the medium M can generate a pulse in the sensor channels. If the gap is too narrow then any excitation in front of the sensor can pass G and create a spike in each sensor channel, so the signals sent out by the sensor channels are identical. However, there is a range of gap widths such that the firing number depends on the wave vector characterizing a pulse at the gap in front of the channel. If the source S is close to the array of sensor channels, then the wave vectors characterizing excitations in front of the various channels are significantly different. Thus the frequencies of excitations in the various channels should differ too. On the other hand, if
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 16 The geometry of excitable (black) and diffusive (white) areas in two realizations of simplified direction sensors. The gray color marks the excitable medium around the sensor; S marks the position of the excitation source. c and d show the position of the annihilation point in the channel D as a function of the angle θ between the source position and the vertical line. The half-length of D is used as the scale unit
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 17 The distance sensor based on frequency transformation. a The distribution of excitable channels, b the firing numbers in different channels as a function of distance; black, green, blue, red and violet curves correspond to signals in channels 1–5, respectively
the source of excitations is far away from the gap G then the wave vectors in front of the different channels should be almost identical and the frequencies of excitations should be the same. Therefore, the system illustrated in Fig. 17a can sense the distance separating it from the source of excitations. If this distance is small then the firing numbers in neighboring sensor channels are different, and these differences decrease as the source of excitations moves away. A typical distance dependence of the firing numbers observed in the different channels is illustrated in Fig. 17b. This result has been obtained in numerical simulations based on the Oregonator model. The range of sensed distances depends on the number of sensor channels. A similar sensor, but with 4 sensor channels, was studied in [28]. Comparing the results, we observe that the presence of the 5th channel significantly improves the range of distances over which the sensor operates. On the other hand, the additional channel has almost no effect on the sensor resolution at short distances. The sensor accuracy seems to be a complex function of the width of the sensor gap and the properties of the channels. The firing number of a single channel as a function of the distance between the sensor and the source has a devil's-staircase-like form, with long intervals, corresponding to simple fractions, where the function is constant, as illustrated in Fig. 5b. In some ranges of distances the steps in the firing numbers of different channels can coincide, so the resolution in those ranges is poor. In other regions the firing numbers change rapidly and even small changes in distance can easily be detected. The signal transformation on a barrier depends on the frequency of the incoming pulses (cf. Fig. 6), so a distance sensor that works well for one frequency may not work for another. In order to function properly for different stimuli, the sensor has to be adapted to the conditions in which it operates.
In practice such adaptation can be realized by comparing the frequencies of the signals in the detector channels with the frequency in a control channel. For example, the control channel can be separated from the medium M by a barrier so narrow that every excitation of the medium generates a spike. The comparison between the frequency of excitations in the control channel and in the sensor channels can be used to adjust the sensor gap G. If the frequency of excitations in the sensor channels is the same as in the control channel then the width of the gap should be increased or its excitability level decreased. On the other hand, if the frequency in the sensor channels is much smaller than in the control channel (or zero) then the gap should be made narrower or more excitable. Such an adaptation mechanism allows one to adjust the distance detector to any frequency of arriving excitations.
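The adaptation rule just described amounts to a small feedback step driven by comparing the sensor-channel and control-channel frequencies. The sketch below writes it out; the thresholds and the step size are illustrative assumptions.

```python
def firing_number(transmitted, arrived):
    """Firing number: the fraction of excitations arriving at the gap
    that cross it and fire the sensor channel."""
    return transmitted / arrived if arrived else 0.0

def adapt_gap(gap_width, f_sensor, f_control, step=0.01):
    """One adaptation step: if the sensor channel fires as often as the
    control channel, the gap transmits everything and should widen; if it
    fires much more rarely (or not at all), the gap should narrow. In the
    intermediate regime the firing number carries distance information,
    so the gap is left unchanged."""
    if f_sensor >= f_control:
        return gap_width + step      # gap too narrow (or too excitable)
    if f_sensor < 0.5 * f_control:
        return gap_width - step      # gap too wide (or not excitable enough)
    return gap_width                 # useful intermediate regime: keep it

print(adapt_gap(0.30, f_sensor=1.0, f_control=1.0))  # widen
print(adapt_gap(0.30, f_sensor=0.0, f_control=1.0))  # narrow
print(adapt_gap(0.30, f_sensor=0.7, f_control=1.0))  # keep
```

Iterating this step drives the gap into the regime where the firing number is a non-trivial function of the wave vector, for whatever stimulus frequency happens to arrive.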
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 18 Two snapshots from the experimental realization of the distance sensor with four channels. In the upper figure the source (1 mm thick silver wire) is placed 2 mm away from the sensor; in the bottom one the source is 12 mm away. The firing numbers are given next to the corresponding channels
The fact that the distance detector described above actually works has been confirmed in experiments with a photosensitive Ru-catalyzed BZ reaction. Typical snapshots from two experiments, performed with the source placed 2 and 12 mm away from the sensor, are shown in Fig. 18. The firing numbers observed in the different sensor channels qualitatively confirm the predictions of the numerical simulations. If the source of excitations is close to the sensor gap, then the differences between the firing numbers observed in neighboring channels are large. On the other hand, when the source of excitations is far away from the sensor, the frequencies in the different channels become similar. The range of distances at which the sensor works is measured in centimeters, so it is of the same order as the sensor size.
The Ring Memory and Its Applications
The devices discussed in the previous section can be classified as instant machines [58], capable of performing just the task they have been designed for. A memory where information coded in excitation pulses can be written in, kept, read out and, if necessary, erased significantly increases the information processing potential of structured excitable media. Moreover, due to the fact that the state of the memory can be changed by a spike, the memory allows for programming with excitation pulses. One possible realization of a chemical memory is based on the observation that
a pulse of excitation can rotate on a ring-shaped excitable area as long as the reactants are supplied and the products removed [41,52,55]. Therefore, a ring with a number of spikes rotating on it can be regarded as a loaded memory cell. Such a memory can be erased by counterpropagating pulses. The idea of a memory with loading pulses rotating in one direction and erasing pulses in the other has been discussed in [49]. If the ring is big then it can be used to memorize a large amount of information, because it has many states corresponding to different numbers of rotating pulses. However, in such cases, loading the memory with subsequent pulses may not be reliable, because the input can be blocked by the refractory tail left by one of the already rotating pulses. The same effect can block the erasing pulses. Therefore, a memory capable of storing just a single bit seems to be more reliable, and we consider it in this section. Such a memory has two states: if there is a rotating pulse, the ring is in the logical TRUE state (we call it loaded); if there is no pulse, the state of the memory corresponds to the logical FALSE and we call such a memory erased. Let us consider the memory illustrated in Fig. 19. The black areas define the memory ring, the output channel O, and the loading channel ML. The memory ring is formed by two L-shaped excitable areas. The areas are separated by gaps and, as we show below, with a special choice of the gaps the symmetry of the ring is broken and unidirectional rotation is ensured. The Z-shaped excitable area composed of gray segments inside the ring forms the erasing channel. The widths of all the non-excitable gaps separating excitable areas are selected such that a pulse of excitation propagating perpendicularly to a gap excites the active area on the other side of the gap, but the gap is impenetrable for pulses propagating parallel to it. The memory cell can be loaded by a spike arriving from the ML channel.
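The single-bit behavior described above amounts to a two-state machine. The sketch below abstracts the ring as such a machine; the channel names follow Fig. 19, and the point is that the geometry makes an erase operation safe on an empty ring.

```python
class RingMemory:
    """One-bit memory cell: a counterclockwise pulse rotating on the ring
    is the logical TRUE state. The ring geometry guarantees the
    transitions below: clockwise (erasing) pulses die on an empty ring
    and annihilate the rotating pulse on a loaded one."""

    def __init__(self):
        self.loaded = False      # no rotating pulse: logical FALSE

    def load(self):              # spike arriving from the ML channel
        self.loaded = True

    def erase(self):             # excitation of the Z-shaped channel;
        self.loaded = False      # has no effect on an empty ring

    def output(self):            # loaded ring periodically excites O
        return self.loaded

m = RingMemory()
m.erase()                        # erasing an empty ring leaves it empty
print(m.output())
m.load()
print(m.output())
m.erase()
print(m.output())
```

The same object, read through `output()`, is what the switchable channel and the pulse counter discussed below build upon.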
Such a spike crosses the gap and generates an excitation
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 19 The distribution of excitable channels (dark) that form a memory cell. The cell is composed of the loading channel ML, the memory ring, the erasing channel ME (marked gray) and the output channel O
on the ring rotating counterclockwise. The information that the memory is loaded is periodically sent out as a series of spikes through the output channel O. The rotating pulse does not affect the erasing channel because it always propagates parallel to it. The erasing excitation is generated in the center of the Z-shaped area and splits into two erasing pulses. These pulses can cross the gaps separating the erasing channel from the memory ring and create a pair of pulses rotating clockwise. A spike that propagates clockwise on the memory ring is not stable, because it is not able to cross any of the gaps, and it dies. It also does not produce any output signal. Therefore, if the memory has not been loaded, an erasing excitation does not load it. On the other hand, if the memory is loaded, then the clockwise rotating pulses resulting from the excitation of the erasing channel annihilate with the loading pulse and the memory is erased. Two places where erasing pulses can enter the memory ring are used to ensure that at least one of them is fully relaxed, so one of the erasing pulses can always enter the ring. In order to verify that such a memory works, we have performed a number of simulations using the Rovinsky–Zhabotinsky model of the BZ reaction, and we have also performed experiments. We considered a loaded memory, and at a random time an excitation was generated in the middle of the erasing channel. In all cases, such an excitation erased the memory ring. A typical experimental result is illustrated in Fig. 20. Here the channels are 1.5 mm thick and the gaps are 0.1 mm wide. Figure 20a shows the loaded memory and the initiated pair of pulses in the erasing channel. In Fig. 20b one of the pulses from the erasing channel enters the memory ring. The memory ring in front of the left part of the erasing channel is still in the refractory state, so the left erasing pulse has not produced an excitation on the ring. Finally in Fig.
20c we observe the annihilation of the loading pulse with one of the erasing pulses. Thus, the state of the memory changed from loaded to unloaded. The experiment was repeated a few times and the results were in qualitative agreement with the simulations: the loaded memory cell kept its information for a few minutes, and it was erased after every excitation of the erasing channel.
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 20 Three snapshots from an experiment with memory erasing. The memory ring is formed by two L-shaped excitable channels; the Z-shaped erasing channel is inside the ring
Figure 21 illustrates two simple, yet interesting, applications of a memory cell. Figure 21a shows a switchable unidirectional channel that can be opened or closed depending on the state of the memory [29]. The channel is constructed of three excitable segments A, B and C separated by the signal diodes D1 and D2 (cf. Fig. 4). The middle segment B is also linked with the output of the memory ring M. Here the erasing channels of M are placed outside the memory ring, but their function is exactly the same as in the memory illustrated in Fig. 19. The idea of the switchable channel is similar to the construction of the NOT gate (Fig. 10). If the memory is not loaded then spike propagation from input I to output O is unperturbed. However, if the memory is loaded then pulses of excitation periodically enter segment B and annihilate with the transmitted signal. As a result, the channel is either open or blocked depending on the state of the memory. The excitations generated by the memory ring do not spread outside
the B segment. On one end their propagation is stopped by the diode D1, on the other by the geometry of the junction between the B channel and the memory output. The state of the memory is controlled by excitation pulses coming from the loading and erasing channels, so switchable channels can be used in devices that are programmable with excitation pulses. Figure 21b illustrates a self-erasing memory cell that changes its state from loaded to erased after a certain time. In such a memory cell, the output channel is connected with the erasing one. When the memory is loaded, the output signal appears. After a time determined by the length of the connecting channels, an output pulse returns as an erasing pulse and switches the memory to the unloaded state. This behavior is an example of a simple feedback process common in self-regulating information processing systems. Using memory cells, signal diodes, and coincidence detectors, one can construct devices which perform more complex signal processing operations. As an example, we present a simple chemical realization of a device that counts arriving spikes and returns their number in any chosen positional representation [27]. Such a counter can be assembled from single digit counters. The construction of a single digit counter depends on the representation used. Here, as an example, we consider the positional representation with base 3. The geometry of a single digit counter is schematically shown in Fig. 22. Its main elements are two memory cells M1 and M2 and two coincidence detectors C1 and C2. At the beginning let us assume that neither of the memory cells is loaded. When the first pulse arrives through the input channel I0, it splits at all the junctions and excitations enter the segments B0, B1 and B2. The pulse that has propagated through B0 loads the memory cell M1. The pulses that have propagated through B1 and B2 die at the bottom diodes of segments C1 and C2, respectively.
Thus, the first input pulse loads the memory M1 and does not change the state of M2. When M1 is loaded, pulses of excitation are periodically sent to the segments B0 and C1 via the bottom channel. Now let us consider what happens when the second pulse arrives. It does not pass through B0, because it annihilates with the pulses arriving from the memory M1. The excitations generated by the second pulse can enter B1 and B2. The excitation that propagated through B2 dies at the bottom diode of the segment C2. The pulse that has propagated through B1 enters C1, annihilates with a pulse from the memory M1 and activates the coincidence detector. The output pulse from the coincidence detector loads the memory M2. Therefore, after the second input pulse both memories M1 and M2 are loaded. When the third pulse arrives, the segments B0 and B1 are blocked by spikes sent
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 21 Two simple applications of a memory cell. a A switchable unidirectional channel that stops or transmits signals depending on the state of memory. b A self-erasing memory cell that changes its state after a certain time. SEC marks the connection between the memory output and the erasing channel
from the memory rings. The generated excitation can enter channel B2, and its collision with a pulse coming from the memory cell M2 activates the output channel of C2. The output signal is directed to the counter responsible for the digit at the next position (I1), and it is also used to erase all the memory cells. Thus after the third pulse both memory cells M1 and M2 are erased. The counter shown in Fig. 22 returns a digit in a representation with base 3: here 0 is represented by (M1, M2) = (0, 0), 1 by (1, 0), 2 by (1, 1), and the next pulse changes the state of the memory cells back to (M1, M2) = (0, 0). Of course, using n − 1 memory cells in a single digit counter we can represent the digits of a system with base n. A cascade of single digit counters (see Fig. 22b) gives a positional representation of the number of arriving pulses.
Artificial Chemical Neurons with Excitable Medium
In this section we discuss a simple realization of an artificial neuron with a structured excitable medium and show how neuron networks can be used in programmable devices. We consider a chemical analog of the McCulloch–Pitts neuron, i.e., a device that produces an output signal if the combined activation exceeds a critical value [32]. The geometry of a chemical neuron is inspired by the structure of a biological neuron [30]. One of its realizations with an excitable medium is illustrated in Fig. 23. Another geometry of a neuron has been discussed in [24]. In an artificial chemical neuron, as in real neurons, dendrites (input channels 1–4) transmit weak signals which are added together through the processes of spatial and temporal integration inside the cell body (part C). If the aggregate excitation is larger than the threshold value, the cell body gets excited. This excitation is transmitted as an output signal
down the axon (the output channel); the amplitude of the output signal does not depend on the value of the integrated inputs but only on the properties of the medium that forms the output channel. In Fig. 23a the axon is not shown; we assume that it is formed by an excitable channel located perpendicularly above the cell body. In the construction discussed in [24] both dendrites and the axon were on a single plane. We have studied the neuron using numerical simulations based on the Oregonator model, and the reaction-diffusion equations have been solved on a square grid. The square shape of the neuron body and the input channels shown in Fig. 23a allows for a precise definition of the boundary between the excitable and non-excitable parts. The idea behind the chemical neuron is similar to that of the AND gate: perturbations introduced by multiple inputs combine and generate a stronger excitation of the cell body than that resulting from a single input pulse. Therefore, it is intuitively clear that if we are able to adjust the amplitudes of excitations coming from individual input channels, then the output channel becomes excited only when the required number of excitations arrive from the inputs. In the studied neuron, the amplitudes of spikes in the input channels have been adjusted through the sub-excitability of these channels, which can be controlled by the channel width or by the illumination of the surrounding non-excitable medium. In our simulations we considered different values of φp for the non-excitable areas, whereas φa = 0.007 characterizing the excitable regions has been fixed for the dendrites and the neuron body. Simulation results shown in Fig. 23b indicate that the properties of chemical neurons are very sensitive to changes in φp. The thick line marks the values of φp for which an output signal appears. For the parameters used, when φp is slightly
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 22 The counter of excitation pulses that arrive at the input I0. a shows the geometry of excitable channels (black) in a single digit counter for the positional representation with the base 3. b is a schematic illustration of the cascade of single digit counters that provides a positional representation. The feedback signals from E1, E2 and E3 channels erase the memory of the single digit counters
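The single-digit counter and cascade described in the text can be mimicked in ordinary software. The Python sketch below is only an illustration (the class names and the reduction of chemical memory cells to boolean flags are our own assumptions, not part of the original work): each digit holds n − 1 memory cells that are set in sequence, and the n-th pulse erases them and emits a carry to the next digit.

```python
class SingleDigitCounter:
    """A base-n digit built from n - 1 excitable memory cells.

    After k pulses the first k cells are set; the n-th pulse erases
    all cells and emits a carry pulse to the next digit. (Assumption:
    chemical memory states are reduced to boolean flags.)
    """

    def __init__(self, base):
        self.base = base
        self.cells = [False] * (base - 1)  # M1 ... M(n-1)

    @property
    def digit(self):
        return sum(self.cells)  # number of excited memory cells

    def pulse(self):
        """Feed one excitation pulse; return True if a carry is emitted."""
        for i, excited in enumerate(self.cells):
            if not excited:
                self.cells[i] = True
                return False
        # All cells set: this pulse erases the memory and carries over.
        self.cells = [False] * (self.base - 1)
        return True


class CascadeCounter:
    """A chain of single-digit counters giving a positional representation."""

    def __init__(self, base, ndigits):
        self.digits = [SingleDigitCounter(base) for _ in range(ndigits)]

    def pulse(self):
        for d in self.digits:      # the input I0 feeds the lowest digit
            if not d.pulse():      # propagate only while carries occur
                break

    def value(self):
        return [d.digit for d in reversed(self.digits)]


counter = CascadeCounter(base=3, ndigits=3)
for _ in range(7):
    counter.pulse()
print(counter.value())  # 7 pulses in base 3 -> [0, 2, 1]
```

With base=3 each digit uses two memory cells (M1, M2), reproducing the states (0, 0), (1, 0), (1, 1) described in the text.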
Computing in Geometrical Constrained Excitable Chemical Systems, Figure 23 Artificial chemical neuron. a The geometry of excitable and non-excitable areas; b the response of the neuron to different types of excitations as a function of the illumination of non-excitable regions. The numbers given on the left list the excited channels. The values of φp for which the neuron body gets excited are marked by a thick line
below 0.047377, any single excitation produces an output signal. On the other hand, if φp > 0.047397, then even the combined excitation of all inputs is not sufficient to excite the neuron. Between these two limiting values we observe all the other thresholds, i.e., an output excitation resulting from two or three combined inputs. Therefore, by applying the proper illumination level, the structure shown in Fig. 23a can work as a four-input McCulloch–Pitts neuron with the required threshold. A similarly high sensitivity of the neuron properties to φa is also observed. This means that the neuron properties can be controlled with tiny changes in the system parameters. The chemical neuron illustrated in Fig. 23a can be used to program signal processing with pulses of excitation. If
we set the illumination such that any two input pulses produce an output, and use one of the inputs to control the device, then when the control pulse is present the device performs the OR operation on the other inputs. If there is no control pulse, it computes the disjunction of the conjunctions of all pairs of inputs, i.e., it signals whether at least two input channels were excited. The geometry of a network constructed with neurons can be easily controlled if the switchable channels illustrated in Fig. 21b are used to establish or cut connections between processing elements. The state of the memory that controls the channel may depend on the state of the network through a feedback mechanism, which allows for network training. In the chemical programming described above, pulses of concentration of the same reagent
are used to store and process information, and to program the medium. Therefore, the output signal may be directly used to change the network geometry. External programming factors like illumination or a temperature field [3] are difficult to apply in three-dimensional structures. Pulse-based programming seems easier if the proper geometry of switchable channels and feedbacks is introduced. At first glance it seems that a practical realization of a network built of chemical neurons may be difficult. However, the geometry of the considered neuron looks quite similar to the structures of phases formed in a multicomponent system. For example, the diamond structure in an oil-water-surfactant system, which spontaneously appears at certain thermodynamic conditions [11], has the form of centers linked with their four nearest neighbors. If the reactants responsible for excitability are soluble in water but not in oil, then the water-rich phase forms the structure of excitable channels and processing elements simply as the result of thermodynamic conditions. Within a certain range of parameters, such a structure is thermodynamically stable. This means that the network has an auto-repair ability and robustness against unexpected destruction. The sub-excitability of a channel is related to its diameter, so the required value can be obtained by selecting the right composition of the mixture and the conditions at which the phase transition occurs. Moreover, the structure is three-dimensional, which allows for a higher density of processing elements than that obtained with classical two-dimensional techniques, for example lithography.

Perspectives and Conclusions

Perspectives

In this article we have described a number of simple devices constructed with structured excitable chemical media which process information coded in excitation pulses. All the considered systems process information in an unconventional (non-von Neumann) way; i.
e., without an external clock or synchronizing signal that controls the sequence of operations. On the other hand, in many cases the right timing of the performed operations is hidden in the geometrical distribution and sizes of the excitable regions. The described devices can be used as building blocks for more complex systems that process signals formed of excitation pulses. Some of the discussed devices can be controlled with spikes; therefore, there is room for programming and learning. However, further development of applications of structured excitable media for information processing along this line seems to follow the evolution of classical electronic computing. In our opinion it would be
more interesting to step away from this path and learn more about the potential offered by excitable media. It would be interesting to study new types of excitable media suitable for information processing. Electrical analogs of reaction-diffusion systems (see Unconventional Computing, Novel Hardware for) seem promising, because they are more robust than the wetware based on liquid BZ media. In such media, spikes propagate much faster, and the spatial scale can be much reduced compared with typical chemical systems. They seem to be promising candidates for the hardware in geometrically oriented problems and direct image processing [4]. However, two interesting properties of chemical reaction-diffusion systems are lost in their electrical analogs. First, chemical information processing systems, unlike electronic ones, integrate two functions: chemical reactivity and the ability to process information. Bearing in mind the excitable oxidation of CO on a Pt surface and the potential application of this medium for information processing [23], we can think of catalysts that are able to monitor their activity and report it. Second, the media described in Unconventional Computing, Novel Hardware for are two-dimensional. Most of the systems discussed in this article can be realized in three dimensions. The use of two-dimensional media in chemical experiments is mainly related to the significant difficulties in observing three-dimensional chemical excitations [42]. The potential application of phase transitions for channel structure generation, described in the previous section, should increase interest in information processing with three-dimensional media. Studies on more effective methods of information coding are important for the further development of reaction-diffusion computing.
The simplest translation of chemistry into the language of information science, based on the equivalence between the presence of a spike and the TRUE logic value, is certainly not the most efficient. It can be expected that information interpreted within multivalued logic systems is more suitable for transformations with the use of structured media [47]. As an example, let us consider three-valued logic encoded by excitation pulses in the following way: the lack of a pulse represents logical FALSE (F), two pulses separated by a time δt correspond to logical TRUE (T), and one pulse is interpreted as "nonsense" (?). Let us assume that two signals coded as described above are directed to the inputs of the AND gate illustrated in Fig. 9 and that they are fully synchronized at the input. The gap between the input and output channels is adjusted such that the activator diffusing from a single traveling pulse is not sufficient to generate a new pulse in the output channel, but the excitation resulting from the
collision of facing pulses at the junction point exceeds the threshold and an output pulse is generated. Moreover, let us assume that the system is described by dynamics for which signals are transformed as illustrated in Fig. 7a (for example, FitzHugh–Nagumo dynamics) and that the time difference δt between spikes is selected such that the second spike can pass the gap. The output signal appears when spikes from both input channels are in coincidence or when the second spike from the same input arrives. As a result, the device performs the following function within 3-valued logic [47]:

      |  T    F    ?
    --+--------------
    T |  T    ?    ?
    F |  ?    F    F
    ? |  ?    F    ?
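This three-valued operation can be tabulated directly in software. The short Python sketch below is our illustration only (the symbol names and the dictionary encoding are assumptions, and the entries follow one consistent reading of the table given in the text):

```python
# Three-valued signals: F = no pulse, "?" = one pulse ("nonsense"),
# T = two pulses separated by a time dt (encoding from the text).
F, T, Q = "F", "T", "?"   # Q stands for the "nonsense" value

# Lookup table for the gate's action on two synchronized inputs.
GATE = {
    (T, T): T, (T, F): Q, (T, Q): Q,
    (F, T): Q, (F, F): F, (F, Q): F,
    (Q, T): Q, (Q, F): F, (Q, Q): Q,
}

def gate(a, b):
    """Output of the excitable-medium gate for inputs a and b."""
    return GATE[(a, b)]

# The physical collision is symmetric in its inputs, and so is the table.
assert all(gate(a, b) == gate(b, a) for a in (F, T, Q) for b in (F, T, Q))
```

The symmetry check reflects the fact that the junction cannot distinguish which channel a colliding pulse came from.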
The same operation, if performed on a classical computer, would require a procedure that measures the time between spikes. The structured excitable medium performs it naturally, provided that the time δt is matched to the properties of the medium and the geometry of the gap. Of course, similar devices that process variables of n-valued logic can also be constructed with structured excitable media.

Conclusions

In this article we have presented a number of examples that should convince the reader that structured excitable media can be used for information processing. Future research will verify whether this branch of computer science is fruitful. It would be important to find new algorithms that can be efficiently executed using a structured excitable medium. However, the ultimate test for the usefulness of these ideas should come from biology. It is commonly accepted that excitable behavior is responsible for information processing and coding in living organisms [8,31,36,60]. We believe that studies on chemical information processing will help us to better understand these problems. And, although at the moment computing with a homogeneous excitable medium seems to offer more applications than computing with a structured one, we believe the proportions will be reversed in the future. After all, our brains are not made of a single piece of homogeneous excitable medium.

Acknowledgments

The research on information processing with structured excitable media has been supported by the Polish State Committee for Scientific Research project 1 P03B 035 27.
Bibliography

Primary Literature
1. Adamatzky A, De Lacy Costello B (2002) Experimental logical gates in a reaction-diffusion medium: The XOR gate and beyond. Phys Rev E 66:046112
2. Adamatzky A (2004) Collision-based computing in Belousov–Zhabotinsky medium. Chaos Soliton Fractal 21(5):1259–1264
3. Adamatzky A (2005) Programming Reaction-Diffusion Processors. In: Banatre J-P, Fradet P, Giavitto J-L, Michel O (eds) LNCS, vol 3566. Springer, pp 47–55
4. Adamatzky A, De Lacy Costello B, Asai T (2005) Reaction-Diffusion Computers. Elsevier, UK
5. Agladze K, Aliev RR, Yamaguchi T, Yoshikawa K (1996) Chemical diode. J Phys Chem 100:13895–13897
6. Agladze K, Magome N, Aliev R, Yamaguchi T, Yoshikawa K (1997) Finding the optimal path with the aid of chemical wave. Phys D 106:247–254
7. Agladze K, Tóth Á, Ichino T, Yoshikawa K (2000) Propagation of Chemical Pulses at the Boundary of Excitable and Inhibitory Fields. J Phys Chem A 104:6677–6680
8. Agmon-Snir H, Carr CE, Rinzel J (1998) The role of dendrites in auditory coincidence detection. Nature 393:268–272
9. Amemiya T, Ohmori T, Yamaguchi T (2000) An Oregonator-class model for photoinduced behavior in the Ru(bpy)3(2+)-catalyzed Belousov–Zhabotinsky reaction. J Phys Chem A 104:336–344
10. Armstrong GR, Taylor AF, Scott SK, Gaspar V (2004) Modelling wave propagation across a series of gaps. Phys Chem Chem Phys 6:4677–4681
11. Babin V, Ciach A (2003) Response of the bicontinuous cubic D phase in amphiphilic systems to compression or expansion. J Chem Phys 119:6217–6231
12. Bertram M, Mikhailov AS (2003) Pattern formation on the edge of chaos: Mathematical modeling of CO oxidation on a Pt(110) surface under global delayed feedback. Phys Rev E 67:036207
13. Dolnik M, Finkeova I, Schreiber I, Marek M (1989) Dynamics of forced excitable and oscillatory chemical-reaction systems. J Phys Chem 93:2764–2774; Finkeova I, Dolnik M, Hrudka B, Marek M (1990) Excitable chemical reaction systems in a continuous stirred tank reactor. J Phys Chem 94:4110–4115; Dolnik M, Marek M (1991) Phase excitation curves in the model of forced excitable reaction system. J Phys Chem 95:7267–7272; Dolnik M, Marek M, Epstein IR (1992) Resonances in periodically forced excitable systems. J Phys Chem 96:3218–3224
14. Epstein IR, Showalter K (1996) Nonlinear Chemical Dynamics: Oscillations, Patterns, and Chaos. J Phys Chem 100:13132–13147
15. Feynman RP, Allen RW, Heywould T (2000) Feynman Lectures on Computation. Perseus Books, New York
16. Field RJ, Körös E, Noyes RM (1972) Oscillations in chemical systems. II. Thorough analysis of temporal oscillation in the bromate-cerium-malonic acid system. J Am Chem Soc 94:8649–8664
17. FitzHugh R (1960) Thresholds and plateaus in the Hodgkin-Huxley nerve equations. J Gen Physiol 43:867–896
18. FitzHugh R (1961) Impulses and physiological states in theoretical models of nerve membrane. Biophys J 1:445–466
19. Field RJ, Noyes RM (1974) Oscillations in chemical systems. IV.
Limit cycle behavior in a model of a real chemical reaction. J Chem Phys 60:1877–1884
20. Gaspar V, Bazsa G, Beck MT (1983) The influence of visible light on the Belousov–Zhabotinskii oscillating reactions applying different catalysts. Z Phys Chem (Leipzig) 264:43–48
21. Ginn BT, Steinbock B, Kahveci M, Steinbock O (2004) Microfluidic Systems for the Belousov–Zhabotinsky Reaction. J Phys Chem A 108:1325–1332
22. Gorecka J, Gorecki J (2003) T-shaped coincidence detector as a band filter of chemical signal frequency. Phys Rev E 67:067203
23. Gorecka J, Gorecki J (2005) On one dimensional chemical diode and frequency generator constructed with an excitable surface reaction. Phys Chem Chem Phys 7:2915–2920
24. Gorecka J, Gorecki J (2006) Multiargument logical operations performed with excitable chemical medium. J Chem Phys 124:084101 (1–5)
25. Gorecka J, Gorecki J, Igarashi Y (2007) One dimensional chemical signal diode constructed with two nonexcitable barriers. J Phys Chem A 111:885–889
26. Gorecki J, Kawczynski AL (1996) Molecular dynamics simulations of a thermochemical system in bistable and excitable regimes. J Phys Chem 100:19371–19379
27. Gorecki J, Yoshikawa K, Igarashi Y (2003) On chemical reactors that can count. J Phys Chem A 107:1664–1669
28. Gorecki J, Gorecka JN, Yoshikawa K, Igarashi Y, Nagahara H (2005) Sensing the distance to a source of periodic oscillations in a nonlinear chemical medium with the output information coded in frequency of excitation pulses. Phys Rev E 72:046201 (1–7)
29. Gorecki J, Gorecka JN (2006) Information processing with chemical excitations – from instant machines to an artificial chemical brain. Int J Unconv Comput 2:321–336
30. Haken H (2002) Brain Dynamics. In: Springer Series in Synergetics. Springer, Berlin
31. Häusser M, Spruston N, Stuart GJ (2000) Diversity and Dynamics of Dendritic Signaling. Science 290:739–744
32. Hertz J, Krogh A, Palmer RG (1991) Introduction to the theory of neural computation. Addison-Wesley, Redwood City
33. Ichino T, Igarashi Y, Motoike IN, Yoshikawa K (2003) Different operations on a single circuit: Field computation on an Excitable Chemical System. J Chem Phys 118:8185–8190
34. Igarashi Y, Gorecki J, Gorecka JN (2006) Chemical information processing devices constructed using a nonlinear medium with controlled excitability. Lect Note Comput Science 4135:130–138
35. Kapral R, Showalter K (1995) Chemical Waves and Patterns. Kluwer, Dordrecht
36. Kindzelskii AL, Petty HR (2003) Intracellular calcium waves accompany neutrophil polarization, formylmethionylleucylphenylalanine stimulation, and phagocytosis: a high speed microscopy study. J Immunol 170:64–72
37. Krischer K, Eiswirth M, Ertl GJ (1992) Oscillatory CO oxidation on Pt(110): modelling of temporal self-organization. J Chem Phys 96:9161–9172
38. Krug HJ, Pohlmann L, Kuhnert L (1990) Analysis of the modified complete Oregonator accounting for oxygen sensitivity and photosensitivity of Belousov–Zhabotinskii systems. J Phys Chem 94:4862–4866
39. Kuramoto Y (1984) Chemical oscillations, waves, and turbulence. Springer, Berlin
40. Kusumi T, Yamaguchi T, Aliev RR, Amemiya T, Ohmori T, Hashimoto H, Yoshikawa K (1997) Numerical study on time delay for chemical wave transmission via an inactive gap. Chem Phys Lett 271:355–360
41. Lázár A, Noszticzius Z, Försterling H-D, Nagy-Ungvárai Z (1995) Chemical pulses in modified membranes I. Developing the technique. Physica D 84:112–119; Volford A, Simon PL, Farkas H, Noszticzius Z (1999) Rotating chemical waves: theory and experiments. Physica A 274:30–49
42. Luengviriya C, Storb U, Hauser MJB, Müller SC (2006) An elegant method to study an isolated spiral wave in a thin layer of a batch Belousov–Zhabotinsky reaction under oxygen-free conditions. Phys Chem Chem Phys 8:1425–1429
43. Manz N, Müller SC, Steinbock O (2000) Anomalous dispersion of chemical waves in a homogeneously catalyzed reaction system. J Phys Chem A 104:5895–5897; Steinbock O (2002) Excitable Front Geometry in Reaction-Diffusion Systems with Anomalous Dispersion. Phys Rev Lett 88:228302
44. Maselko J, Reckley JS, Showalter K (1989) Regular and irregular spatial patterns in an immobilized-catalyst Belousov–Zhabotinsky reaction. J Phys Chem 93:2774–2780
45. Mikhailov AS, Showalter K (2006) Control of waves, patterns and turbulence in chemical systems. Phys Rep 425:79–194
46. Morozov VG, Davydov NV, Davydov VA (1999) Propagation of Curved Activation Fronts in Anisotropic Excitable Media. J Biol Phys 25:87–100
47. Motoike IN, Adamatzky A (2005) Three-valued logic gates in reaction–diffusion excitable media. Chaos, Solitons & Fractals 24:107–114
48. Motoike I, Yoshikawa K (1999) Information Operations with an Excitable Field. Phys Rev E 59:5354–5360
49. Motoike IN, Yoshikawa K, Iguchi Y, Nakata S (2001) Real–Time Memory on an Excitable Field. Phys Rev E 63:036220 (1–4)
50. Motoike IN, Yoshikawa K (2003) Information operations with multiple pulses on an excitable field. Chaos, Solitons & Fractals 17:455–461
51. Murray JD (1989) Mathematical Biology. Springer, Berlin
52. Nagai Y, Gonzalez H, Shrier A, Glass L (2000) Paroxysmal Starting and Stopping of Circulating Pulses in Excitable Media. Phys Rev Lett 84:4248–4251
53. Nagahara H, Ichino T, Yoshikawa K (2004) Direction detector on an excitable field: Field computation with coincidence detection. Phys Rev E 70:036221 (1–5)
54. Nagumo J, Arimoto S, Yoshizawa S (1962) An Active Pulse Transmission Line Simulating Nerve Axon. Proc IRE 50:2061–2070
55. Noszticzius Z, Horsthemke W, McCormick WD, Swinney HL, Tam WY (1987) Sustained chemical pulses in an annular gel reactor: a chemical pinwheel. Nature 329:619–620
56. Plaza F, Velarde MG, Arecchi FT, Boccaletti S, Ciofini M, Meucci R (1997) Excitability following an avalanche-collapse process. Europhys Lett 38:85–90
57. Press WH, Teukolsky SA, Vetterling WT, Flannery BP (2007) Numerical Recipes: The Art of Scientific Computing, 3rd edn. Available via http://www.nr.com
58. Rambidi NG, Maximychev AV (1997) Towards a Biomolecular Computer. Information Processing Capabilities of Biomolecular Nonlinear Dynamic Media. BioSyst 41:195–211
59. Rambidi NG, Yakovenchuk D (1999) Finding paths in a labyrinth based on reaction-diffusion media. BioSyst 51:67–72; Rambidi NG, Yakovenchuk D (2001) Chemical reaction-diffusion implementation of finding the shortest paths in a labyrinth. Phys Rev E 63:026607
60. Rambidi NG (2005) Biologically Inspired Information Processing Technologies: Reaction-Diffusion Paradigm. Int J Unconv Comp 1:101–121
61. Rovinsky AB, Zhabotinsky AM (1984) Mechanism and mathematical model of the oscillating bromate-ferroin-bromomalonic acid reaction. J Phys Chem 88:6081–6084
62. Rovinsky AB (1986) Spiral waves in a model of the ferroin catalyzed Belousov–Zhabotinskii reaction. J Phys Chem 90:217–219
63. Sielewiesiuk J, Gorecki J (2002) Chemical Waves in an Excitable Medium: Their Features and Possible Applications in Information Processing. In: Klonowski W (ed) Attractors, Signals and Synergetics. 1st European Interdisciplinary School on Nonlinear Dynamics for System and Signal Analysis Euroattractor 2000, Warsaw, 6–15 June 2000. Pabst, Lengerich, pp 448–460
64. Sielewiesiuk J, Gorecki J (2002) On complex transformations of chemical signals passing through a passive barrier. Phys Rev E 66:016212; Sielewiesiuk J, Gorecki J (2002) Passive barrier as a transformer of chemical signal frequency. J Phys Chem A 106:4068–4076
65. Sielewiesiuk J, Gorecki J (2001) Chemical impulses in the perpendicular junction of two channels. Acta Phys Pol B 32:1589–1603
66. Sielewiesiuk J, Gorecki J (2001) Logical functions of a cross junction of excitable chemical media. J Phys Chem A 105:8189–8195
67. Steinbock O, Toth A, Showalter K (1995) Navigating complex labyrinths – optimal paths from chemical waves. Science 267:868–871
68. Steinbock O, Kettunen P, Showalter K (1995) Anisotropy and Spiral Organizing Centers in Patterned Excitable Media. Science 269:1857–1860
69. Steinbock O, Kettunen P (1996) Chemical clocks on the basis of rotating pulses. Measuring irrational numbers from period ratios. Chem Phys Lett 251:305–308
70. Steinbock O, Kettunen P, Showalter K (1996) Chemical Pulse Logic Gates. J Phys Chem 100:18970–18975
71. Suzuki K, Yoshinobu T, Iwasaki H (1999) Anisotropic waves propagating on two-dimensional arrays of Belousov–Zhabotinsky oscillators. Jpn J Appl Phys 38:L345–L348
72. Suzuki K, Yoshinobu T, Iwasaki H (2000) Unidirectional propagation of chemical waves through microgaps between zones with different excitability. J Phys Chem A 104:6602–6608
73. Suzuki R (1967) Mathematical analysis and application of iron-wire neuron model. IEEE Trans Biomed Eng 14:114–124
74. Taylor AF, Armstrong GR, Goodchild N, Scott SK (2003) Propagation of chemical waves across inexcitable gaps. Phys Chem Chem Phys 5:3928–3932
75. Toth A, Showalter K (1995) Logic gates in excitable media. J Chem Phys 103:2058–2066
76. Toth A, Horvath D, Yoshikawa K (2001) Unidirectional wave propagation in one spatial dimension. Chem Phys Lett 345:471–474
77. Yoshikawa K, Nagahara H, Ichino T, Gorecki J, Gorecka JN, Igarashi Y (2009) On Chemical Methods of Direction and Distance Sensing. Int J Unconv Comput 5:53–65
Books and Reviews
Hjelmfelt A, Ross J (1994) Pattern recognition, chaos, and multiplicity in neural networks of excitable systems. Proc Natl Acad Sci USA 91:63–67
Nakata S (2003) Chemical Analysis Based on Nonlinearity. Nova, New York
Rambidi NG (1998) Neural network devices based on reaction–diffusion media: an approach to artificial retina. Supramol Sci 5:765–767
Storb U, Müller SC (2004) Scroll waves. In: Scott A (ed) Encyclopedia of Nonlinear Sciences. Routledge, Taylor and Francis Group, New York, pp 825–827
Computing with Solitons
DARREN RAND1, KEN STEIGLITZ2
1 Lincoln Laboratory, Massachusetts Institute of Technology, Lexington, USA
2 Computer Science Department, Princeton University, Princeton, USA

Article Outline
Glossary
Definition of the Subject
Introduction
Manakov Solitons
Manakov Soliton Computing
Multistable Soliton Collision Cycles
Experiments
Future Directions
Bibliography

Glossary
Integrable This term is generally used in more than one way and in different contexts. For the purposes of this article, a partial differential equation or system of partial differential equations is integrable if it can be solved explicitly to yield solitons (qv).
Manakov system A system of two cubic Schrödinger equations where the self- and cross-phase modulation terms have equal weight.
Nonlinear Schrödinger equation A partial differential equation that has the same form as the Schrödinger equation of quantum mechanics, with a term nonlinear in the dependent variable, and, for the purposes of this article, interpreted classically.
Self- and cross-phase modulation Any terms in a nonlinear Schrödinger equation that involve nonlinear functions of the dependent variable of the equation, or nonlinear functions of a dependent variable of another (coupled) equation, respectively.
Solitary wave A solitary wave is a wave characterized by undistorted propagation. Solitary waves do not in general maintain their shape under perturbations or collisions.
Soliton A soliton is a solitary wave which is also robust under perturbations and collisions.
Turing equivalent Capable of simulating any Turing machine, and hence by Turing's Thesis capable of performing any computation that can be carried out by a sequence of effective instructions on a finite amount of data. A machine that is Turing equivalent is therefore as powerful as any digital computer. Sometimes a device that is Turing equivalent is called "universal."

Definition of the Subject
Solitons are localized, shape-preserving waves characterized by robust collisions. First observed as water waves by John Scott Russell [29] in the Union Canal near Edinburgh and subsequently recreated in the laboratory, solitons arise in a variety of physical systems, as both temporal pulses which counteract dispersion and spatial beams which counteract diffraction. Solitons with two components, vector solitons, are computationally universal due to their remarkable collision properties. In this article, we describe in detail the characteristics of Manakov solitons, a specific type of vector soliton, and their applications in computing.

Introduction
In this section, we review the basic principles of soliton theory and spotlight relevant experimental results. Interestingly, the phenomena of soliton propagation and collision occur in many physical systems despite the diversity of mechanisms that bring about their existence. For this reason, the discussion in this article will treat temporal and spatial solitons interchangeably, unless otherwise noted.

Scalar Solitons
A pulse in optical fiber undergoes dispersion, or temporal spreading, during propagation. This effect arises because the refractive index of the silica glass is not constant, but is rather a function of frequency. The pulse can be decomposed into a range of frequencies: the shorter the pulse, the broader its spectral width. The frequency dependence of the refractive index causes the different frequency components of the pulse to propagate at different velocities, giving rise to dispersion. As a result, the pulse develops a chirp, meaning that the individual frequency components are not evenly distributed throughout the pulse. There are two types of dispersion: normal and anomalous. If the longer wavelengths travel faster, the medium is said to have normal dispersion.
If the opposite is true, the medium has anomalous dispersion. The response of a dielectric such as optical fiber is nonlinear. Most of the nonlinear effects in fiber originate from nonlinear refraction, where the refractive index n depends on the intensity of the propagating field according to the relation

    n = n0 + n2 |E|^2 ,    (1)
where n0 is the linear part of the refractive index, |E|^2 is the optical intensity, and n2 is the coefficient of the nonlinear contribution to the refractive index. Because the material responds almost instantaneously, on the order of femtoseconds, and because the phase shift is proportional to n, each component of an intense optical pulse sees a phase shift proportional to its intensity. Since the frequency shift δω = −∂φ/∂t, where φ is the accumulated phase, the leading edge of the pulse is red-shifted (δω < 0), while the trailing edge is blue-shifted (δω > 0), an effect known as self-phase modulation (SPM). As a result, if the medium exhibits normal dispersion, the pulse is broadened; for anomalous dispersion, the pulse is compressed. Under the proper conditions, this pulse compression can exactly cancel the linear, dispersion-induced broadening, resulting in distortionless soliton propagation. For more details, see the book by Agrawal [3]. The idealized mathematical model for this pulse propagation is the nonlinear Schrödinger equation (NLSE):

    i (∂u/∂z) ± (1/2)(∂^2u/∂x^2) + |u|^2 u = 0 ,    (2)
where u(z, x) is the complex-valued field envelope, z is a normalized propagation distance, and x is normalized time in a frame propagating with the group velocity of the pulse. The second and third terms describe dispersion and the intensity-dependent Kerr nonlinearity, respectively. The coefficient of the dispersion term is positive for anomalous dispersion and negative for normal dispersion. Equation (2), known as the scalar NLSE, is integrable; that is, it can be solved analytically, and collisions between solitons are 'elastic,' in that no change in amplitude or velocity occurs as a result of a collision. Zakharov and Shabat [38] first solved this equation analytically using the inverse scattering method. It describes, for example, the propagation of picosecond or longer pulses in lossless optical fiber. Two solitons at different wavelengths will collide in an optical fiber due to dispersion-induced velocity differences. A schematic of such a collision is depicted in Fig. 1. The scalar soliton collision is characterized by two phenomena, a position shift and a phase shift, both of which can be understood in the same intuitive way. During a collision there will be a local increase in intensity, causing a local increase in the fiber's refractive index, according to Eq. (1). As a result, both the soliton velocity and phase will be affected during the collision. From an all-optical signal processing perspective, the phase and position shifts in a soliton collision are not useful. This is because these effects are independent of any soliton properties that are changed by collision; that is, the
Computing with Solitons, Figure 1 Schematic of a scalar soliton collision, in which amplitude and velocities are unchanged. The two soliton collision effects are a position shift (depicted through the translational shift in the soliton path) and phase shift (not pictured)
result of one collision will not affect the result of subsequent collisions. Scalar solitons are therefore not useful for complex logic or computing, which depend on multiple, cascaded interactions. Despite this setback, it was later discovered that a system similar to the scalar NLSE, the Manakov system [19], is integrable as well and possesses very rich collisional properties [26]. Manakov solitons are a specific instance of two-component vector solitons, and it has been shown that collisions of Manakov solitons are capable of transferring information via changes in a complex-valued polarization state [16].

Vector Solitons

When several field components, distinguished by polarization and/or frequency, propagate in a nonlinear medium, the nonlinear interaction between them must be considered as well. This interaction between field components results in intensity-dependent nonlinear coupling terms analogous to the self-phase modulation term in the scalar case. Such a situation gives rise to a set of coupled nonlinear Schrödinger equations, and may allow for the propagation of vector solitons. For the case of two components propagating in an idealized medium with no higher-order effects and only intensity-dependent nonlinear coupling, the equations become:

i ∂u1/∂z + ∂²u1/∂x² + 2μ(|u1|² + α|u2|²) u1 = 0 ,
i ∂u2/∂z + ∂²u2/∂x² + 2μ(|u2|² + α|u1|²) u2 = 0 ,  (3)
where u1(z, x) and u2(z, x) are the complex-valued pulse envelopes for each component, μ is a nonlinearity parameter, and α describes the ratio between self- and cross-phase modulation contributions to the overall nonlinearity. Only for the special case of α = 1 are Eqs. (3) integrable. First solved using the method of inverse scattering by Manakov [19], Eqs. (3) admit solutions known as Manakov solitons. For nonintegrable cases (α ≠ 1), some analytical solitary-wave solutions are known for specific cases, although in general a numerical approach is required [36]. The specific case of α = 2/3, for example, corresponds to linearly birefringent polarization-maintaining fiber, and will be considered in more detail in Sect. "Experiments".

Due to their multicomponent structure, vector solitons have far richer collision dynamics than their scalar, one-component counterparts. Recall that scalar collisions are characterized by phase and position shifts only. Vector soliton collisions also exhibit these effects, with the added feature of possible intensity redistributions between the component fields [19,26]. This process is shown schematically in Fig. 2. In the collision, two conservation relations are satisfied: (i) the energy in each soliton is conserved and (ii) the energy in each component is conserved. It can be seen that when the amplitude of one component in a soliton increases as a result of the collision, the other component decreases, with the opposite exchange in the second soliton. The experimental observation of this effect will be discussed in Sect. "Experiments". In addition to fundamental interest in such solitons, collisions of vector solitons make possible unique applications, including collision-based logic and universal computation [16,27,34,35], as discussed in Sect. "Manakov Soliton Computing".

Computing with Solitons, Figure 2
Schematic of a vector soliton collision, which exhibits a position shift and phase shift (not pictured), similar to the scalar soliton collision (cf. Fig. 1). Vector soliton collisions also display an energy redistribution among the component fields, shown here as two orthogonal polarizations. Arrows indicate direction of energy redistribution

Manakov Solitons

As mentioned in Sect. "Introduction", computation is possible using vector solitons because of an energy redistribution that occurs in a collision. In this section, we provide the mathematical background of Manakov soliton theory, in order to understand soliton computing and a remarkable way to achieve bistability using soliton collisions, as described in Sects. "Manakov Soliton Computing" and "Multistable Soliton Collision Cycles", respectively. The Manakov system consists of two coupled NLSEs [19]:

i ∂q1/∂z + ∂²q1/∂x² + 2μ(|q1|² + |q2|²) q1 = 0 ,
i ∂q2/∂z + ∂²q2/∂x² + 2μ(|q1|² + |q2|²) q2 = 0 ,  (4)
where q1(x, z) and q2(x, z) are two interacting optical components, μ is a positive parameter representing the strength of the nonlinearity, and x and z are normalized space and propagation distance, respectively. As mentioned in Sect. "Vector Solitons", the Manakov system is the special case of Eqs. (3) with α = 1. The two components can be thought of as components in two polarizations or, as in the case of a photorefractive crystal, as two uncorrelated beams [11]. Manakov first solved Eqs. (4) by the method of inverse scattering [19]. The system admits single-soliton, two-component solutions that can be characterized by a complex number k ≡ kR + i kI, where kR represents the energy of the soliton and kI the velocity, all in normalized units. The additional soliton parameter is the complex-valued polarization state ρ ≡ q1/q2, defined as the (z- and x-independent) ratio between the q1 and q2 components. Figure 3 shows the schematic for a general two-soliton collision, with initial parameters ρ1, k1 and ρL, k2, corresponding to the right-moving and left-moving solitons, respectively. The values of k1 and k2 remain constant during the collision, but in general the polarization state changes. Let ρ1 and ρL denote the respective soliton states before impact, and suppose the collision transforms ρ1 into ρR, and ρL into ρ2. It turns out that the state change undergone
by each colliding soliton takes on the very simple form of a linear fractional transformation (also called a bilinear or Möbius transformation). Explicitly, the state of the emerging left-moving soliton is given by [16]

ρ2 = { [(1 − g)/ρ1* + ρ1] ρL + g ρ1/ρ1* } / { g ρL + (1 − g) ρ1 + 1/ρ1* } ,  (5)

where, with the asterisk denoting complex conjugation,

g ≡ (k1 + k1*)/(k2 + k1*) .  (6)

The state of the right-moving soliton is obtained similarly, and is

ρR = { [(1 − h*)/ρL* + ρL] ρ1 + h* ρL/ρL* } / { h* ρ1 + (1 − h*) ρL + 1/ρL* } ,  (7)

where

h ≡ (k2 + k2*)/(k1 + k2*) .  (8)

We assume here, without loss of generality, that k1R, k2R > 0. Several properties of the linear fractional transformations in Eqs. (5) and (7) are derived in [16], including the characterization of inverse operators, fixed points, and implicit forms. In particular, when viewed as an operator, every soliton has an inverse, which will undo the effect of the operator on the state. Note that this requires that the inverse operator have the same k parameter as the original, a condition that will hold in our application of computing in the next section.

These state transformations were first used by Jakubowski et al. [16] to describe logical operations such as NOT. Later, Steiglitz [34] established that arbitrary computation is possible through time gating of Manakov (1+1)-dimensional spatial solitons. We will describe this in Sect. "Manakov Soliton Computing". There exist several candidates for the physical realization of Manakov solitons, including photorefractive crystals [4,5,9,11,30], semiconductor waveguides [17], quadratic media [33], and optical fiber [23,28]. In Sect. "Experiments", we discuss in detail an experiment with vector solitons in linearly birefringent optical fiber.

Computing with Solitons, Figure 3
Schematic of a general two-soliton collision. Each soliton is characterized by a complex-valued polarization state ρ and complex parameter k. Reprinted with permission from [34]. Copyright by the American Physical Society
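Eqs. (5)-(8) are straightforward to evaluate numerically. The following is a minimal Python sketch (the function names and sample values are ours, not from the text); the fractions are multiplied through by ρ1* (respectively ρL*) so that the state 0, used later as the fixed actuator state, requires no special case.

```python
# Minimal sketch of the Manakov collision state transformations, Eqs. (5)-(8).
# A soliton carries a complex polarization state rho = q1/q2 and a complex
# parameter k = kR + i*kI (kR: energy, kI: velocity).

def L(rho1, rhoL, k1, k2):
    """State of the emerging left-moving soliton (Eq. 5), rearranged
    so that rho1 = 0 needs no special case."""
    g = (k1 + k1.conjugate()) / (k2 + k1.conjugate())          # Eq. (6)
    num = ((1 - g) + abs(rho1) ** 2) * rhoL + g * rho1
    den = g * rho1.conjugate() * rhoL + (1 - g) * abs(rho1) ** 2 + 1
    return num / den

def R(rho1, rhoL, k1, k2):
    """State of the emerging right-moving soliton (Eq. 7), rearranged
    so that rhoL = 0 needs no special case."""
    h = (k2 + k2.conjugate()) / (k1 + k2.conjugate())          # Eq. (8)
    hs = h.conjugate()
    num = ((1 - hs) + abs(rhoL) ** 2) * rho1 + hs * rhoL
    den = hs * rhoL.conjugate() * rho1 + (1 - hs) * abs(rhoL) ** 2 + 1
    return num / den

# Illustrative collision: right-mover (rho1, k1) meets left-mover (rhoL, k2).
rho1, rhoL = 0.3 + 0.2j, -0.5 + 1.0j
k1, k2 = 4 + 1j, 4 - 1j          # kR = 4, velocities +1 and -1 (our choice)
print(L(rho1, rhoL, k1, k2), R(rho1, rhoL, k1, k2))
```

One consequence worth noting: g and h involve k only through k1 + k1*, k2 + k2*, k2 + k1*, and k1 + k2*, so adding the same imaginary constant (a common velocity offset) to both k parameters leaves the transformations unchanged.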
Manakov Soliton Computing

We described in the previous section how collisions of Manakov solitons can be described by transformations of a complex-valued state, the ratio between the two Manakov components. We show in this section that general computation is possible if we use (1+1)-dimensional spatial solitons that are governed by the Manakov equations and if we are allowed to time-gate the beams input to the medium. The result is a dynamic computer without spatially fixed gates or wires, unlike most present-day conceptions of a computer, which involve integrated circuits in which information travels between logical elements fixed spatially through fabrication on a silicon wafer. We can call such a scheme "nonlithographic", in the sense that there is no architecture imprinted on the medium.

The requirements for computation include cascadability, fanout, and Boolean completeness. The first, cascadability, requires that the output of one device can serve as input to another. Since any useful computation consists of many stages of logic, this condition is essential. The second, fanout, refers to the ability of a logic gate to drive at least two similar gates. Finally, Boolean completeness makes it possible to perform arbitrary computation.

We should emphasize that although the model we use is meant to reflect known physical phenomena, at least in the limit of ideal behavior, the result is a mathematical one. Practical considerations of size and speed are not considered here, nor are questions of error propagation. In this sense the program of this article is analogous to that of Fredkin and Toffoli [13] for ideal billiard balls, and of Shor [31] for quantum mechanics. There are, however, several candidates for the physical instantiation of the basic ideas in this paper, as noted in the previous section.

Computing with Solitons, Figure 4
The general physical arrangement considered in this paper. Time-gated beams of spatial Manakov solitons enter at the top of the medium, and their collisions result in state changes that reflect computation. Each solid arrow represents a beam segment in a particular state. Reprinted with permission from [34]. Copyright by the American Physical Society

Computing with Solitons, Figure 5
Colliding spatial solitons. Reprinted with permission from [34]. Copyright by the American Physical Society

Although we are describing computation embedded in a homogeneous medium, and not interconnected gates in the usual sense of the word, we will nevertheless use the term gates to describe prearranged sequences of soliton collisions that effect logical operations. We will in fact adapt other computer terms to our purpose, such as wiring to represent the means of moving information from one place to another, and memory to store it in certain ways for future use. We will proceed in the construction of what amounts to a complete computer in the following stages: first we will describe a basic gate that can be used for FANOUT; then we will show how the same basic configuration can be used for NOT and, finally, NAND; then we will describe ways to use time gating of the input beams to interconnect signals. The NAND gate, FANOUT, and interconnect are sufficient to implement any computer, and we conclude with a layout scheme for a general-purpose, and hence Turing-equivalent, computer. The general picture of the physical arrangement is shown in Fig. 4.

Figure 5 shows the usual picture of colliding solitons, which can work interchangeably for the case of temporal
Computing with Solitons, Figure 6 Convenient representation of colliding spatial solitons. Reprinted with permission from [34]. Copyright by the American Physical Society
or spatial solitons. It is convenient for visualization purposes to turn the picture and adjust the scale so the axes are horizontal and vertical, as in Fig. 6. We will use binary logic, with two distinguished, distinct complex numbers representing TRUE and FALSE, called 1 and 0, respectively. In fact, it turns out to be possible to use the complex numbers 1 and 0 for these two state values, and we will do that throughout this paper, but this is a convenience and not at all a necessity. We will thus use the complex polarization states 1 and 0 and logical 1 and 0 interchangeably.

FANOUT

We construct the FANOUT gate by starting with a COPY gate, implemented with collisions between three down-moving, vertical solitons and one left-moving horizontal soliton. Figure 7 shows the arrangement. The soliton state labeled in will carry a logical value, and so be in one of the two states 0 or 1. The left-moving soliton labeled actuator will be in the fixed state 0, as will be the case throughout this paper. The plan is to adjust the (so far) arbitrary states z and y so that out = in, justifying the name COPY. It is reasonable to expect that this might be possible, because there are four degrees of freedom in the two complex numbers z and y, and two complex equations to satisfy: that out be 1 and 0 when in is 1 and 0, respectively. Values that satisfy these four equations in four unknowns were obtained numerically. We will call them z_c and y_c. It is not always possible to solve these equations; Ablowitz et al. [1] showed that a unique solution is guaranteed to exist in certain parameter regimes. However, explicit solutions have been found for all the cases used in this section, and are given in Table 1.

Computing with Solitons, Figure 7
COPY gate. Reprinted with permission from [34]. Copyright by the American Physical Society

To be more specific about the design problem, write Eq. (5) as the left-moving product ρ2 = L(ρ1, ρL), and similarly write Eq. (7) as ρR = R(ρ1, ρL). The successive left-moving products in Fig. 7 are L(in, 0) and L(y, L(in, 0)). The out state is then R(z, L(y, L(in, 0))). The stipulation that 0 maps to 0 and 1 maps to 1 is expressed by the following two simultaneous complex equations in two complex unknowns:

R(z, L(y, L(0, 0))) = 0 ,
R(z, L(y, L(1, 0))) = 1 .

It is possible to solve for z as a function of y and then eliminate z from the equations, yielding one complex equation in the one complex unknown y. This is then solved numerically by grid search and successive refinement. There is no need for efficiency here, since we will require solutions in only a small number of cases.
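The search just described can be sketched in a few lines. The code below is a hypothetical reconstruction, not the authors' program: the k values are arbitrary illustrative choices (the text does not specify the k used for Table 1), and the refinement is a simple shrinking grid search. For each trial y, z is obtained exactly by inverting the Möbius map R(·, w), once for target 0 and once for target 1; a COPY solution exists where the two resulting z values coincide.

```python
# Hypothetical sketch of the numerical COPY-gate design:
# find (z, y) with out = R(z, L(y, L(in, 0))) = in for in = 0, 1.

def L(rho1, rhoL, k1, k2):
    g = (k1 + k1.conjugate()) / (k2 + k1.conjugate())
    return (((1 - g) + abs(rho1)**2) * rhoL + g * rho1) / \
           (g * rho1.conjugate() * rhoL + (1 - g) * abs(rho1)**2 + 1)

def R(rho1, rhoL, k1, k2):
    h = (k2 + k2.conjugate()) / (k1 + k2.conjugate())
    hs = h.conjugate()
    return (((1 - hs) + abs(rhoL)**2) * rho1 + hs * rhoL) / \
           (hs * rhoL.conjugate() * rho1 + (1 - hs) * abs(rhoL)**2 + 1)

def z_for_target(w, t, k1, k2):
    """Solve R(z, w) = t for z by writing R(z, w) = (A z + B)/(C z + D)."""
    hs = ((k2 + k2.conjugate()) / (k1 + k2.conjugate())).conjugate()
    A, B = (1 - hs) + abs(w)**2, hs * w
    C, D = hs * w.conjugate(), (1 - hs) * abs(w)**2 + 1
    return (D * t - B) / (A - C * t)

def mismatch(y, k1, k2):
    """|z(target 0) - z(target 1)|; zero at a COPY solution."""
    try:
        w0 = L(y, L(0j, 0j, k1, k2), k1, k2)       # left-moving product, in = 0
        w1 = L(y, L(1 + 0j, 0j, k1, k2), k1, k2)   # left-moving product, in = 1
        return abs(z_for_target(w0, 0j, k1, k2) - z_for_target(w1, 1 + 0j, k1, k2))
    except ZeroDivisionError:                       # y landed on a pole
        return float("inf")

def grid_search(k1, k2, center=0j, half=2.0, n=21, rounds=12):
    best_y, best_m = center, float("inf")
    for _ in range(rounds):
        for i in range(n):
            for j in range(n):
                y = (center.real - half + 2 * half * i / (n - 1)) + \
                    1j * (center.imag - half + 2 * half * j / (n - 1))
                m = mismatch(y, k1, k2)
                if m < best_m:
                    best_y, best_m = y, m
        center, half = best_y, half / 2   # shrink the box around the best y
    return best_y, best_m

k1, k2 = 4 + 1j, 4 - 1j   # illustrative only; may lie outside the regime of Table 1
y, m = grid_search(k1, k2)
z = z_for_target(L(y, L(0j, 0j, k1, k2), k1, k2), 0j, k1, k2)
print("y =", y, " z =", z, " residual =", m)
```

With these illustrative k values the search may or may not close the residual completely; per Ablowitz et al. [1], existence of a solution depends on the parameter regime.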
Computing with Solitons, Figure 8 FANOUT gate. Reprinted with permission from [34]. Copyright by the American Physical Society
To make a FANOUT gate, we need to recover the input, which we can do using a collision with a soliton in the state which is the inverse of 0, namely 1 [16]. Figure 8 shows the complete FANOUT gate. Notice that we indicate collisions with a dot at the intersection of paths, and require that the continuation of the inverse soliton not intersect the continuation of z that it meets. We indicate that by a broken line, and postpone the explanation of how this "wire crossing" is accomplished. It is immaterial whether the continuation of the inverse operator hits the continuation of y, because it is not used later. We call such solitons garbage solitons.

NOT and ONE Gates

In the same way we designed the complex pair of states (z_c, y_c) to produce a COPY and FANOUT gate, we can find a pair (z_n, y_n) to get a NOT gate, mapping 0 to 1 and 1 to 0, and a pair (z_1, y_1) to get a ONE gate, mapping both 0 and 1 to 1. These (z, y) values are given in Table 1.

Computing with Solitons, Table 1
Parameters for gates when soliton speeds are 1

gate     z                            y
COPY     0.24896731 − 0.62158212 i    2.28774210 + 0.01318152 i
NOT      0.17620885 + 0.38170630 i    0.07888703 − 1.26450654 i
ONE      0.45501471 − 1.37634227 i    1.43987094 + 0.64061349 i
Z-CONV   0.31838068 − 0.43078735 i    0.04232340 + 2.17536612 i
Y-CONV   1.37286955 + 0.88495501 i    0.58835758 − 0.18026939 i

We should point out that the ONE gate in itself, considered as a one-input, one-output gate, is not invertible, and could never be achieved by using the continuation of one particular soliton through one, or even many, collisions. This is because such transformations are always nonsingular linear fractional transformations, which are invertible [16]. The transformation of state from the input to the continuation of z is, however, much more complicated and provides the flexibility we need to get the ONE gate. It turns out that this ONE gate will give us a row in the truth table of a NAND, and is critical for realizing general logic.

Computing with Solitons, Figure 9
A NAND gate, using converter gates to couple copies of one of its inputs to its z and y parameters. Reprinted with permission from [34]. Copyright by the American Physical Society

Output/Input Converters, Two-Input Gates, and NAND

To perform logic of any generality we must of course be able to use the output of one operation as the input to another. To do this we need to convert logic (0/1) values to some predetermined z and y values, the choice depending on the type of gate we want. This results in a two-input, one-output gate. As an important example, here's how a NAND gate can be constructed. We design a z-converter that converts 0/1 values to appropriate values of z, using the basic three-collision arrangement shown in Fig. 7. For a NAND gate, we map 0 to z_1, the z value for the ONE gate, and map 1 to z_n, the z value for the NOT gate. Similarly, we construct a y-converter that maps 0 to y_1 and 1 to y_n. These z-
and y-converters are used on the fanout of one of the inputs, and the resulting two-input gate is shown in Fig. 9. Of course these z- and y-converters require z and y values themselves, which are again determined by numerical search (see Table 1). The net effect is that when the left input is 0, the other input is mapped by a ONE gate, and when it is 1 the other input is mapped by a NOT gate. The only way the output can be 0 is if both inputs are 1, thus showing that this is a NAND gate. Another way of looking at this construction is that the 2×2 truth table of (left input)×(right input) has as its 0 row a ONE gate of the columns (1 1), and as its 1 row a NOT gate of the columns (1 0). The importance of the NAND gate is that it is universal [20]. That is, it can be used with interconnects and fanouts to construct any other logical function. Thus we have shown that with the ability to "wire" we can implement any logic using the Manakov model. We note that other choices of input converters result in direct realizations of other gates. Using input converters that convert 0 and 1 to (z_c, y_c) and (z_n, y_n), respectively, results in a truth table with first row (0 1) and second row (1 0), an XOR gate. Converting 0 and 1 to (z_c, y_c) and (z_1, y_1), respectively, results in an OR gate, and so on.

Time Gating

We next take up the question of interconnecting the gates described above, and begin by showing how the continuation of the input in the COPY gate can be restored without affecting the other signals. In other words, we show how a simple "wire crossing" can be accomplished in this case. For spatial solitons, the key flexibility in the model is provided by assuming that input beams can be time-gated, that is, turned on and off. When a beam is thus gated, a finite segment of light is created that travels through the medium. We can think of these finite segments as finite light pulses, and we will call them simply pulses in the remainder of this article.
Figure 10a shows the basic three-collision gate implemented with pulses. Assuming that the actuator and data pulses are appropriately timed, the actuator pulse hits all three data pulses, as indicated in the projection below the space-space diagram. The problem is that if we want a later actuator pulse to hit the rightmost data pulse (to invert the state, for example, as in the FANOUT gate), it will also hit the remaining two data pulses because of the way they must be spaced for the earlier three collisions.

We can overcome this difficulty by sending the actuator pulse from the left instead of the right. Timed appropriately early, it can be made to miss the first two data pulses and hit the third, as shown in Fig. 10b. It is easy to check that if the velocity of the right-moving actuator solitons is algebraically above that of the data solitons by the same amount that the velocity of the data solitons is algebraically above that of the left-moving actuator solitons, the same state transformations will result. For example, if we choose the velocities of the data and left-moving actuator solitons to be +1 and −1, we should choose the velocity of the right-moving actuator solitons to be +3. This is really a consequence of the fact that the g and h parameters of Eqs. (6) and (8) in the linear fractional transformation depend only on the difference in the velocities of the colliding solitons.

Computing with Solitons, Figure 10
a When entered from the right and properly timed, the actuator pulse hits all three data pulses, as indicated in the projection at the bottom; b When entered from the left and properly timed, the actuator pulse misses two data pulses and hits only the rightmost data pulse, as indicated in the projection at the bottom. Reprinted with permission from [34]. Copyright by the American Physical Society
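This velocity bookkeeping is easy to confirm directly from Eqs. (6) and (8). In the short check below (the value kR = 4 is illustrative, not from the text), velocities are carried in the imaginary parts of k: the pair (+1, −1) for a data/left-actuator collision and the pair (+3, +1) for a right-actuator/data collision give identical g and h, since both collisions have velocity difference 2.

```python
# g and h of Eqs. (6) and (8) depend only on the velocity difference of
# the colliding solitons; velocities live in the imaginary part of k.

def g(k1, k2):   # Eq. (6): k1 is the right-mover, k2 the left-mover
    return (k1 + k1.conjugate()) / (k2 + k1.conjugate())

def h(k1, k2):   # Eq. (8)
    return (k2 + k2.conjugate()) / (k1 + k2.conjugate())

kR = 4.0         # illustrative soliton energy
# data soliton (velocity +1) hit by an actuator entering from the right (-1):
g_a, h_a = g(kR + 1j, kR - 1j), h(kR + 1j, kR - 1j)
# the same data soliton hit by an actuator entering from the left (+3),
# which is now the faster, right-moving partner:
g_b, h_b = g(kR + 3j, kR + 1j), h(kR + 3j, kR + 1j)
print(g_a, g_b)  # identical: both collisions have velocity difference 2
```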
Computing with Solitons, Figure 11 The frame of this figure is moving down with the data pulses on the left. A data pulse in memory is operated on with a three-collision gate actuated from the left, and the result deposited to the upper right. Reprinted with permission from [34]. Copyright by the American Physical Society
Wiring

Having shown that we can perform FANOUT and NAND, it remains only to show that we can "wire" gates so that any outputs can be fed to any inputs. The basic method for doing this is illustrated in Fig. 11. We think of data as stored in the down-moving pulses in a column, which we can think of as "memory". The observer moves with this frame, so the data appears stationary. Pulses that are horizontal in the three-collision gates shown in previous figures will then appear to the observer to move upward at inclined angles. It is important to notice that these upward diagonally moving pulses are evanescent in our picture (and hence their paths are shown dashed in the figure). That is, once they are used, they do not remain in the picture with a moving frame and hence cannot interfere with later computations. However, all vertically moving pulses remain stationary in this picture. Once a diagonal trajectory is used for a three-collision gate, reusing it will in general corrupt the states of all the stationary pulses along that diagonal. However, the original data pulse (gate input) can be restored with a pulse in the state inverse to the actuator, either along the same diagonal as the actuator, provided we allow enough time for the result (the gate output, a stationary z pulse) to be used, or along the other diagonal.
Computing with Solitons, Figure 12 A data pulse is copied to the upper right, this copy is copied to the upper left, and the result put at the top of memory. The original data pulse can then be restored with an inverse pulse and copied to the left in the same way. Reprinted with permission from [34]. Copyright by the American Physical Society
Suppose we want to start with a given data pulse in the memory column and create two copies above it in the memory column. Figure 12 shows a data pulse at the lower left being copied to the upper right with a three-collision COPY gate, initiated with an actuator pulse from the left. This copy is then copied again to the upper left, back to a waiting z pulse in the memory column. After the first copy is used, an inverse pulse can be used along the lower-left-to-upper-right diagonal to restore the original data pulse. The restored data pulse can then be copied to the left in the same way, to a height above the first copy, say, and thus two copies can be created and deposited in memory above the original.

A Second Speed and Final FANOUT and NAND

There is one problem still remaining with a true FANOUT: when an original data pulse in memory is used in a COPY operation for FANOUT, two diagonals are available, one from the lower left to the upper right, and the other from the lower right to the upper left. Thus, two copies can be made, as was just illustrated. However, when a data pulse is deposited in the memory column as a result of a logic operation, the logical operation itself uses at least one diagonal, which leaves at most one free. This makes a FANOUT of the output of a gate impossible with the current scheme. A simple solution to this problem is to introduce another speed, using velocities ±0.5, say, in addition to ±1. This effectively provides four rather than two directions in which a pulse can be operated on, and allows true FANOUT and general interconnections. Figure 13 shows such a FANOUT; the data pulse at the lower left is copied to a position above it using one speed, and to another position, above that, using another. Finally, a complete NAND gate is shown in Fig. 14.
The gate can be thought of as composed of the following steps: input 2 is copied to the upper left, and that copy is transformed by a z-converter to the upper right, placing the z pulse for the NAND gate at the top of the figure; after the copy of input 2 is used, input 2 is restored with an inverse pulse to the upper left; input 2 is then transformed to the upper right by a y-converter; input 1 is copied to the upper right, to a position collinear with the z- and y-converted versions of the other input; and a final actuator pulse converts the z pulse at the top to the output of the NAND gate.

Computing with Solitons, Figure 13
The introduction of a second speed makes true FANOUT possible. For simplicity, in this and the next figure, data and operator pulses are indicated by solid dots, and the y operator pulses are not shown. The paths of actuator pulses are indicated by dashed lines. Reprinted with permission from [34]. Copyright by the American Physical Society

Note that the output of the NAND has used two diagonals, which again shows why a second speed is needed if we are to use the NAND output as an input to subsequent logical operations. The y operator pulses, middle components in the three-collision COPY and converter gates, are not shown in the figure, but room can always be made for them to avoid accidental collisions by adding only a constant amount of space.

Universality

It should be clear now that any sequence of three-collision gates can be implemented in this way, copying data out of the memory column to the upper left or right, and performing NAND operations on any two at a time in the way shown in the previous section. The computation can proceed in a breadth-first manner, with the results of each successive stage being stored above the earlier results. Each additional gate can add only a constant amount of height
and width to the medium, so the total area required is no more than proportional to the square of the number of gates. The "program" consists of down-moving y and z operator pulses, entering at the top with the down-moving data, and actuator pulses that enter from the left or right at two different speeds. In the frame moving with the data, the data and operator pulses are stationary and new results are deposited at the top of the memory column. In the laboratory frame the data pulses leave the medium downward, and new results appear in the medium at positions above the old data, at the positions of newly entering z pulses.

Computing with Solitons, Figure 14
Implementation of a NAND gate. A second speed will be necessary to use the output. Reprinted with permission from [34]. Copyright by the American Physical Society

Discussion

We have shown that in principle any computation can be performed by shining time-gated lasers into a completely homogeneous nonlinear optical medium. This result should be viewed as mathematical, and whether the physics of vector soliton collisions can lead to practical computational devices is a subject for future study. With regard to the economy of the model, the question of whether time gating is necessary, or even whether two speeds are necessary, is open. We note that the result described here differs from the universality results for the ideal billiard ball model [13], the Game of Life [7], and Lattice Gasses [32], for example, in that no internal mirrors or structures of any kind are used inside the medium. To the author's knowledge, whether internal structure is necessary in these other cases is open. Finally, we remark that the model used is reversible and dissipationless. The fact that some of the gate operations realized are not in themselves reversible is not a contradiction, since extra "garbage" solitons [13] are produced that save enough state to run the computation backwards.

Multistable Soliton Collision Cycles
Bistable and multistable optical systems, besides being of some theoretical interest, are of practical importance in offering a natural "flip-flop" for noise-immune storage and logic. We show in this section that simple cycles of collisions of solitons governed by the Manakov equations can have more than one distinct stable set of polarization states, and therefore these distinct equilibria can, in theory, be used to store and process information. The multistability occurs in the polarization states of the beams; the solitons themselves do not change shape and remain the usual sech-shaped solutions of the Manakov equations. This phenomenon depends only on simple soliton collisions in a completely homogeneous medium. The basic configuration considered requires only that the beams form a closed cycle, and can thus be realized in any nonlinear optical medium that supports spatial Manakov solitons. The possibility of using multistable systems of beam collisions broadens the possibilities for practical application of the surprisingly strong interactions that Manakov solitons can exhibit, a phenomenon originally described in [26]. We show here by example that a cycle of three collisions can have two distinct foci surrounded by basins of attraction, and that a cycle of four collisions can have three.

The Basic Three-Cycle and Computational Experiments

Figure 15 shows the simplest example of the basic scheme, a cycle of three beams, entering in states A, B, and C, with intermediate beams a, b, and c. For convenience, we will refer to the beams themselves, as well as their states, as A, B, C, etc. Suppose we start with beam C initially turned off, so that A = a. Beam a then hits B, thereby transforming it to state b. If beam C is then turned on, it will hit A, closing the cycle. Beam a is then changed, changing b, etc., and the cycle of state changes propagates clockwise. The question we ask is whether this cycle converges, and if so, whether it will converge, for any particular choice of complex parameters, to exactly zero, one, two, or more foci.

Computing with Solitons, Figure 15
The basic cycle of three collisions. Reprinted with permission from [35]. Copyright by the American Physical Society

We answer the question with numerical simulations of this cycle. A typical computational experiment was designed by fixing the input beams A, B, C, and the parameters k1 and k2, and then choosing points a randomly and independently with real and imaginary coordinates uniformly distributed in squares of a given size in the complex plane. The cycle described above was then carried out until convergence in the complex numbers a, b, and c was obtained to within 10⁻¹² in norm. Distinct foci of convergence were stored, and the initial starting points a were categorized by which focus they converged to, thus generating the usual picture of basins of attraction for the parameter a. Typically this was done for 50,000 random initial values of a, effectively filling in the square, for a variety of parameter choices A, B, and C. The following results were observed:
Computing with Solitons, Figure 16
The two foci and their corresponding basins of attraction in the first example, which uses a cycle of three collisions. The states of the input beams are A = 0.8 − 0.13i, B = 0.4 − 0.13i, C = 0.5 + 1.6i; and k = 4 ± i. Reprinted with permission from [35]. Copyright by the American Physical Society
In cases with one or two clear foci, convergence was obtained in every trial, almost always within one or two hundred iterations. Each experiment yielded exactly one or two foci. The bistable cases (two foci) are somewhat less common than the cases with a unique focus, and are characterized by values of kR between about 3 and 5 when the velocity difference was fixed at 2. Figure 16 shows a bistable example, with the two foci and their corresponding basins of attraction. The parameter k is fixed in this and all subsequent examples at 4 ± i for the right- and left-moving beams of any given collision, respectively. The second example, shown in Fig. 17, shows that the basins are not always simply connected; a sizable island that maps to the upper focus appears within the basin of the lower focus. A Tristable Example Using a Four-Cycle Collision cycles of length four seem to exhibit more complex behavior than those of length three, although it is
Computing with Solitons, Figure 17 A second example using a cycle of three collisions, showing that the basins need not be simply connected. The states of the input beams are A = 0.7 − i0.3, B = 1.1 − i0.5, C = 0.4 + i0.81; and k = 4 ± i. Reprinted with permission from [35]. Copyright by the American Physical Society
Computing with Solitons, Figure 18 A case with three stable foci, for a collision cycle of length four. The states of the input beams are A = 0.39 − i0.45, B = 0.22 − i0.25, C = 0.0 + i0.25, D = 0.51 + i0.48; and k = 4 ± i. Reprinted with permission from [35]. Copyright by the American Physical Society
difficult to draw any definite conclusions because the parameter spaces are too large to be explored exhaustively, and there is at present no theory to predict such highly nonlinear behavior. If one real degree of freedom is varied as a control parameter, we can move from bistable to tristable solutions, with a regime between in which one basin of attraction disintegrates into many small separated fragments. Clearly, this model is complex enough to exhibit many of the well-known features of nonlinear systems. Fortunately, it is not difficult to find choices of parameters that result in very well behaved multistable solutions. For example, Fig. 18 shows such a tristable case. The smallest distance from a focus to a neighboring basin is on the order of 25% of the interfocus distance, indicating that these equilibria will be stable under reasonable noise perturbations. Discussion The general phenomenon discussed in this section raises many questions, both of a theoretical and practical nature. The fact that there are simple polarization-multistable cycles of collisions in a Manakov system suggests that similar situations occur in other vector systems, such as photorefractive crystals or birefringent fiber. Any vector system with the possibility of a closed cycle of soliton collisions
becomes a candidate for multistability, and there is at this point really no compelling reason to restrict attention to the Manakov case, except for the fact that the explicit state-change relations make numerical study much easier. The simplified picture we used of information traveling clockwise after we begin with a given beam a gives us stable polarization states when it converges, plus an idea of the size of their basins of attraction. It is remarkable that in all cases in our computational experience, except for borderline transitional cases in going from two to three foci in a four-cycle, this circular process converges consistently and quickly. But understanding the actual dynamics and convergence characteristics in a real material requires careful physical modeling. This modeling will depend on the nature of the medium used to approximate the Manakov system, and is left for future work. The implementation of a practical way to switch from one stable state to another is likewise critically dependent on the dynamics of soliton formation and perturbation in the particular material at hand, and must be studied with reference to a particular physical realization. We remark also that no iron-clad conclusions can be drawn from computational experiments about the numbers of foci in any particular case, or the number possible for a given size cycle—despite the fact that we regularly used 50,000 random starting points. On the other hand, the clear cases that have been found, such as those used as examples, are very characteristic of universal behavior in other nonlinear iterated maps, and are sufficient to establish that bi- and tristability, and perhaps higher-mode multistability, are genuine mathematical characteristics, and possibly also physically realizable. This strongly suggests experimental exploration. We restricted discussion in this section to the simplest possible structure of a single closed cycle, with three or four collisions.
The stable solutions of more complicated configurations are the subject of continuing study. A general theory that predicts this behavior is lacking, and it seems at this point unlikely to be forthcoming. This forces us to rely on numerical studies, from which, as we point out above, only certain kinds of conclusions can be drawn. We are fortunate, however, in being able to find cases that look familiar and which are potentially useful, like the bistable three-cycles with well separated foci and simply connected basins of attraction. It is not clear, however, just what algorithms might be used to find equilibria in collision topologies with more than one cycle. It is also intriguing to speculate about how collision configurations with particular characteristics can be designed, how they can be made to interact, and how they might be controlled by pulsed beams. There is
promise that when the ramifications of complexes of vector soliton collisions are more fully understood they might be useful for real computation in certain situations. Application to Noise-Immune Soliton Computing Any physical instantiation of a computing technology must be designed to be immune from the effects of noise buildup from logic stage to logic stage. In the familiar computers of today, built with solid-state transistors, the noise immunity is provided by physical state restoration, so that voltage levels representing logical “0” and “1” are restored by bistable circuit mechanisms at successive logic stages. This is state restoration at the physical level. For another example, proposed schemes for quantum computing would be impractical without some means of protecting information stored in qubits from inevitable corruption by the rest of the world. The most common method proposed for accomplishing this is error correction at the software level, state restoration at the logical level. In the collision-based scheme for computing with Manakov solitons described in Sect. “Manakov Soliton Computing”, there is no protection against buildup of error from stage to stage, and some sort of logical state restoration would be necessary in a practical realization. The bistable collision cycles of Manakov solitons described in this section, however, offer a natural computational building block for soliton computation with physical state restoration. This idea is explored in [27]. Figure 19 illustrates the approach with a schematic diagram of a NAND gate, implemented with bistable cycles to represent bits. The input bits are stored in the collision cycles (1) and (2), which have output beams that can be made to collide with
Computing with Solitons, Figure 19 Schematic of NAND gate using bistable collision cycles. Reprinted with permission from [27]. Copyright by Old City Publishing
input beam A of cycle (3), which represents the output bit of the gate. These inputs to the gate, shown as dashed lines, change the state of beam A of the ordinarily bistable cycle (3) so that it becomes monostable. The state of cycle (3) is then steered to a known state. When the input beams are turned off, cycle (3) returns to its normal bistable condition, but with a known input state. Its state then evolves to one of two bits, and the whole system of three collision cycles can be engineered so that the final state of cycle (3) is the NAND of the two bits represented by input cycles (1) and (2). (See [27] for details.) A computer based on such bistable collision cycles is closer in spirit to present-day ordinary transistor-based computers, with a natural noise-immunity and staterestoration based on physical bistability. As mentioned in the previous subsection, however, the basic bistable cycle phenomenon awaits laboratory verification, and much remains to be learned about the dynamics, and eventual speed and reliability of such systems. Experiments The computation schemes described in the previous sections obviously rely on the correct mathematical modeling of the physics proposed for realization. We next describe experiments that verify some of the required soliton phenomenology in optical fibers. Specifically, we highlight the experimental observation of temporal vector soliton propagation and collision in a birefringent optical fiber [28]. This is both the first demonstration of temporal vector solitons with two mutually-incoherent component fields, and of vector soliton collisions in a Kerr nonlinear medium. Temporal soliton pulses in optical fiber were first predicted by Hasegawa and Tappert [14], followed by the first experimental observation by Mollenauer et al. [24]. 
In subsequent work, Menyuk accounted for the birefringence in polarization maintaining fiber (PMF) and predicted that vector solitons, in which two orthogonally polarized components trap each other, are stable under the proper operating conditions [21,22]. For birefringent fibers, self-trapping of two orthogonally polarized pulses can occur when XPM-induced nonlinearity compensates the birefringence-induced group velocity difference, causing the pulse in the fiber’s fast axis to slow down and the pulse in the slow axis to speed up. The first demonstration of temporal soliton trapping was performed in the subpicosecond regime [15], in which additional ultrashort pulse effects such as Raman scattering are present. In particular, this effect results in a red-shift that is linearly proportional to the propagation distance, as observed in a later temporal
soliton trapping experiment [25]. Recently, soliton trapping in the picosecond regime was observed with equal amplitude pulses [18]. However, vector soliton propagation could not be shown, because the pulses propagated for less than 1.5 dispersion lengths. In other work, phase-locked vector solitons in a weakly birefringent fiber laser cavity with nonlinear coherent coupling between components were observed [12]. The theoretical model for linearly birefringent fiber is the following coupled nonlinear Schrödinger equation (CNLSE):

$$ i\left(\frac{\partial A_x}{\partial z} + \beta_{1x}\frac{\partial A_x}{\partial t}\right) - \frac{\beta_2}{2}\frac{\partial^2 A_x}{\partial t^2} + \gamma\left(|A_x|^2 + \frac{2}{3}|A_y|^2\right)A_x = 0\,, $$

$$ i\left(\frac{\partial A_y}{\partial z} + \beta_{1y}\frac{\partial A_y}{\partial t}\right) - \frac{\beta_2}{2}\frac{\partial^2 A_y}{\partial t^2} + \gamma\left(|A_y|^2 + \frac{2}{3}|A_x|^2\right)A_y = 0\,, \qquad (9) $$
where t is the local time of the pulse, z is propagation distance along the fiber, and Ax,y is the slowly varying pulse envelope for each polarization component. The parameter β1x,y is the group velocity associated with each fiber axis, and β2 represents the group velocity dispersion, assumed equal for both polarizations. In addition, we neglect higher order dispersion and assume a lossless medium with an instantaneous electronic response, valid for picosecond pulses propagating in optical fiber. The last two terms of Eqs. (9) account for the nonlinearity due to SPM and XPM, respectively. In linearly birefringent optical fiber, a ratio of 2/3 exists between these two terms. When this ratio equals unity, the CNLSE becomes the integrable Manakov system of Eqs. (4). On the other hand, solutions of Eqs. (9) are, strictly speaking, solitary waves, not solitons. However, it was found in [36] that the family of symmetric, single-humped (fundamental or first-order) solutions, to which the current investigation in this section belongs, are all stable. Higher-order solitons, characterized by multiple humps, are unstable. Furthermore, it was shown in [37] that collisions of solitary waves in Eqs. (9) can be described by application of perturbation theory to the integrable Manakov equations, indicating the similarities between the characteristics of these two systems. Experimental Setup and Design The experimental setup is shown in Fig. 20. We synchronized two actively mode-locked erbium-doped fiber lasers (EDFLs)—EDFL1 at 1.25 GHz repetition rate, and EDFL2 at 5 GHz. EDFL2 was modulated to match the lower
Computing with Solitons, Figure 20 Experimental setup. EDFL: Erbium-doped fiber laser; EDFA: Erbium-doped fiber amplifier; MOD: modulator; TDL: tunable delay line; PLC: polarization loop controller; 2:1: fiber coupler; LP: linear polarizer; λ/2: half-wave plate; HB-PMF and LB-PMF: high and low birefringence polarization maintaining fiber; PBS: polarization beam splitter; OSA: optical spectrum analyzer. Reprinted with permission from [28]. Copyright by the American Physical Society
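Equations of this CNLSE type are commonly integrated with the split-step Fourier method. The sketch below does this in the integrable Manakov limit (unit SPM/XPM ratio) in normalized soliton units, checking that the two-component sech soliton propagates without changing shape; the grid, step sizes, and normalization are our own choices for illustration, not the parameters of the experiment.

```python
import numpy as np

# Normalized Manakov system: i u_z + (1/2) u_tt + (|u|^2 + |v|^2) u = 0
# (and likewise for v). The vector soliton (cos θ, sin θ) sech(t) e^{iz/2}
# is an exact solution, so a split-step integrator should preserve it.

N, T = 1024, 40.0                         # grid points, time window
t = np.linspace(-T / 2, T / 2, N, endpoint=False)
dt = t[1] - t[0]
w = 2 * np.pi * np.fft.fftfreq(N, d=dt)   # angular frequency grid

theta = np.deg2rad(37.0)                  # polarization angle (example value)
u = np.cos(theta) / np.cosh(t) + 0j       # one polarization component
v = np.sin(theta) / np.cosh(t) + 0j       # the orthogonal component

dz, z_end = 0.01, 5.0
half_lin = np.exp(-0.25j * w**2 * dz)     # half-step of i u_z + (1/2) u_tt = 0

for _ in range(int(z_end / dz)):
    # linear (dispersion) half step in Fourier space
    u = np.fft.ifft(half_lin * np.fft.fft(u))
    v = np.fft.ifft(half_lin * np.fft.fft(v))
    # full nonlinear step: common SPM+XPM phase rotation (Manakov ratio 1)
    phase = np.exp(1j * (np.abs(u)**2 + np.abs(v)**2) * dz)
    u *= phase
    v *= phase
    # second linear half step (Strang splitting)
    u = np.fft.ifft(half_lin * np.fft.fft(u))
    v = np.fft.ifft(half_lin * np.fft.fft(v))

peak = np.sqrt(np.abs(u)**2 + np.abs(v)**2).max()
print(f"peak amplitude after z = {z_end}: {peak:.4f} (initial 1.0)")
```

Modeling the actual birefringent-fiber experiment would additionally require the 2/3 XPM ratio and the group-velocity terms of Eqs. (9).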
repetition rate of EDFL1. Each pulse train, consisting of 2 ps pulses, was amplified in an erbium-doped fiber amplifier (EDFA) and combined in a fiber coupler. To align polarizations, a polarization loop controller (PLC) was used in one arm, and a tunable delay line (TDL) was employed to temporally align the pulses for collision. Once combined, both pulse trains passed through a linear polarizer (LP) and a half-wave plate to control the input polarization to the PMF. Approximately 2 m of high birefringence (HB) PMF preceded the specially designed 500 m of low birefringence (LB) PMF used to propagate vector solitons. Although this short length of HB-PMF will introduce some pulse splitting (on the order of 2–3 ps), the birefringent axes of the HB- and LB-PMF were swapped in order to counteract this effect. At the output, each component of the vector soliton was then split at a polarization beam splitter, followed by an optical spectrum analyzer (OSA) for measurement. The design of the LB-PMF required careful control over three characteristic length scales: the (polarization) beat length Lb, dispersion length Ld, and nonlinear length Lnl. A beat length Lb = λ/Δn = 50 cm was chosen at a wavelength of 1550 nm, where Δn is the fiber birefringence. According to the approximate stability criterion of [8], this choice allows stable propagation of picosecond vector solitons. By avoiding the subpicosecond regime, ultrashort pulse effects such as intrapulse Raman scattering will not be present. The dispersion D = 2πcβ2/λ² = 16 ps/(km nm) and Ld = T0²/|β2| ≈ 70 m, where T0 = TFWHM/1.763 is a characteristic pulse width related to the
full width at half maximum (FWHM) pulse width. Since Ld ≫ Lb, degenerate four-wave mixing due to coherent coupling between the two polarization components can be neglected [23]. Furthermore, the total propagation distance is greater than 7 dispersion lengths. Polarization instability, in which the fast axis component is unstable, occurs when Lnl = (γP)⁻¹ is of the same order of magnitude or smaller than Lb, as observed in [6]. The nonlinearity parameter γ = 2πn2/(λAeff) = 1.3 (km W)⁻¹, with Kerr nonlinearity coefficient n2 = 2.6 × 10⁻²⁰ m²/W and measured effective mode area Aeff = 83 μm². In the LB-PMF, the fundamental vector soliton power P ≈ 14 W, thus Lnl = 55 m ≫ Lb, mitigating the effect of polarization instability.
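The length-scale design can be checked against the numbers quoted in this Experiments section; the following arithmetic sketch uses the stated values (the value of c and the unit conversions are ours, and small discrepancies reflect rounding in the quoted figures):

```python
import math

c = 2.998e8        # speed of light [m/s]
lam = 1550e-9      # operating wavelength [m]

# Values quoted in the text
L_b = 0.50                 # beat length [m]
D = 16e-6                  # dispersion, 16 ps/(km nm), in SI [s/m^2]
T_fwhm = 2e-12             # pulse width [s]
n2 = 2.6e-20               # Kerr nonlinearity coefficient [m^2/W]
A_eff = 83e-12             # effective mode area [m^2]
P = 14.0                   # fundamental vector soliton power [W]
dbeta1 = 10.3e-12 / 1e3    # group-velocity difference, 10.3 ps/km [s/m]

beta2 = D * lam**2 / (2 * math.pi * c)      # |beta_2| [s^2/m]
T0 = T_fwhm / 1.763                         # characteristic pulse width [s]
L_d = T0**2 / beta2                         # dispersion length [m]
gamma = 2 * math.pi * n2 / (lam * A_eff)    # nonlinearity [1/(W m)]
L_nl = 1 / (gamma * P)                      # nonlinear length [m]
dlam_shift = dbeta1 / D                     # component wavelength shift [m]
L_coll = 2 * T_fwhm / (D * 3e-9)            # collision length for 3 nm detuning

print(f"L_d    ≈ {L_d:.0f} m      (text: ≈ 70 m)")
print(f"gamma  ≈ {gamma * 1e3:.2f} /(W km) (text: 1.3)")
print(f"L_nl   ≈ {L_nl:.0f} m      (text: 55 m)")
print(f"shift  ≈ {dlam_shift * 1e9:.2f} nm   (text: 0.64 nm)")
print(f"L_coll ≈ {L_coll:.1f} m; fiber = {500 / L_coll:.0f} collision lengths")
```

The output confirms the internal consistency of the quoted design: the 500 m fiber spans roughly 7 dispersion lengths and 6 collision lengths, and Lnl exceeds Lb by two orders of magnitude.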
simulations, but are not as dominant (cf. Fig. 21d and e). We attribute this to the input pulse, which is calibrated for the θ = 45° case, because the power threshold for vector soliton formation in this case is largest due to the 2/3 factor between SPM and XPM nonlinear terms in the CNLSE. As the input is rotated towards unequal components, there will be extra power in the input pulse, which will radiate in the form of dispersive waves as the vector soliton forms. Due to the nature of this system, these dispersive waves can be nonlinearly trapped, giving rise to the satellite features in the optical spectra. This effect is not as prevalent in the simulations because the threshold was numerically determined at each input angle θ. Vector Soliton Collision
Vector Soliton Propagation We first studied propagation of vector solitons using both lasers independently. The wavelength shift for each component is shown in Fig. 21a as a function of the input polarization angle θ, controlled through the half-wave plate. Due to the anomalous dispersion of the fiber at this wavelength, the component in the slow (fast) axis will shift to shorter (longer) wavelengths to compensate the birefringence. The total amount of wavelength shift between components, λx − λy = Δβ1/D = 0.64 nm, where Δβ1 = |β1x − β1y| = 10.3 ps/km is the birefringence-induced group velocity difference and dispersion D = 2πcβ2/λ² = 16 ps/(km nm). As θ approaches 0° (90°), the vector soliton approaches the scalar soliton limit, and the fast (slow) axis does not shift in wavelength, as expected. At θ = 45°, a symmetric shift results. For unequal amplitude solitons, the smaller component shifts more in wavelength than the larger component, because the former experiences more XPM. Numerical simulations of Eqs. (9), given by the dashed lines of Fig. 21a, agree very well with the experimental results. Also shown in Fig. 21 are two cases, θ = 45° and 37°, as well as the numerical prediction. The experimental spectra show some oscillatory features at 5 GHz, which are a modulation of the EDFL2 repetition rate on the optical spectrum. A sample input pulse spectrum from EDFL1 is shown in the inset of Fig. 21, which shows no modulation due to the limited resolution of the OSA. Vector solitons from both lasers produced similar results. In this and all subsequent plots in this section, the slow and fast axis components are depicted by solid and dashed lines, respectively. As the two component amplitudes become more unequal, satellite peaks become more pronounced in the smaller component. These features are also present in the
To prepare the experiment for a collision, we operated both lasers simultaneously, detuned in wavelength to allow for dispersion-induced walkoff, and adjusted the delay line in such a way that the collision occurred halfway down the fiber. We define a collision length Lcoll = 2TFWHM/(D Δλ), where Δλ is the wavelength separation between the two vector solitons. For our setup, Δλ = 3 nm, and Lcoll = 83.3 m. An asymptotic theory of soliton collisions, in which a full collision takes place, requires at least 5 collision lengths. The total fiber length in this experiment is equal to 6 collision lengths, long enough to ensure sufficient separation of solitons before and after collision. In this way, results of our experiments can be compared to the asymptotic theory, even though full numerical simulations will be shown for comparison. To quantify our results, we introduce a quantity R ≡ tan²θ, defined as the amplitude ratio between the slow and fast components. Recall that in Sect. “Manakov Solitons”, we introduced the Manakov equations (Eqs. (4)), and described collision-induced transformations of the polarization state of the soliton, which come about due to the asymptotic analysis of the soliton collision. The polarization state is the ratio between the two components Ax/Ay = cot θ exp(iΔφ), and is therefore a function of the polarization angle θ and the relative phase Δφ between the two components. In the context of the experiments described in this section, these state transformations (Eqs. (5) and (7)) predict that the resulting energy exchange will be a function of amplitude ratios R1,2, wavelength separation Δλ, and the relative phases Δφ1,2 between the two components of each soliton, where soliton 1 (2) is the shorter (longer) wavelength soliton. A word of caution is in order at this point. An interesting consequence of the 2/3 ratio between SPM and XPM, which sets the birefringent fiber model apart from
Computing with Solitons, Figure 21 Arbitrary-amplitude vector soliton propagation. a Wavelength shift vs. angle to fast axis θ, numerical curves given by dashed lines; b and d experimental results for θ = 45° and 37° with EDFL2, respectively. Inset: input spectrum for EDFL1; c and e corresponding numerical simulations of θ = 45° and 37°, respectively. The slow and fast axis components are depicted by solid and dashed lines, respectively. Reprinted with permission from [28]. Copyright by the American Physical Society
the Manakov model, is the relative phase between the two components. For the Manakov soliton, each component ‘feels’ the same amount of total nonlinearity, because the strengths of both SPM and XPM are equal. Therefore, regardless of the polarization angle, the amount of total nonlinear phase shift for each component is the same (even though the contributions of SPM and XPM phase shifts
are in general not equal). As a result, the relative phase between the two components stays constant during propagation, as does the polarization state. This is not the case for vector solitons in birefringent fiber. For the case of equal amplitudes, each component does experience the same amount of nonlinear phase shift, and therefore the polarization state is constant as a function of propagation
Computing with Solitons, Figure 22 Demonstration of phase-dependent energy-exchanging collisions. a–c Short HB-PMF; d–f long HB-PMF; a,d experiment, without collision; b,e experiment, with collision; c,f simulated collision result with c Δφ2 = 90° and f Δφ2 = 50°. Values of slow-fast amplitude ratio R are given above each soliton. The slow and fast axis components are depicted by solid and dashed lines, respectively. Reprinted with permission from [28]. Copyright by the American Physical Society
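The propagation dependence of the relative phase for unequal amplitudes follows directly from the SPM and XPM terms of Eqs. (9). Treating the component magnitudes as fixed (a schematic argument that ignores pulse reshaping), each component accumulates nonlinear phase at its own rate:

```latex
% Nonlinear phase accumulation rates implied by the SPM and XPM terms
% of Eqs. (9), with the pulse shapes held fixed (schematic):
\frac{d\phi_x}{dz} = \gamma\left(|A_x|^2 + \tfrac{2}{3}|A_y|^2\right),
\qquad
\frac{d\phi_y}{dz} = \gamma\left(|A_y|^2 + \tfrac{2}{3}|A_x|^2\right),
% so the relative phase drifts at a rate set by the amplitude imbalance:
\frac{d(\Delta\phi)}{dz} = \frac{\gamma}{3}\left(|A_x|^2 - |A_y|^2\right).
```

The drift vanishes only for equal amplitudes, and in the Manakov limit (XPM coefficient 1 instead of 2/3) it vanishes for every polarization angle, which is exactly the contrast between the two models drawn in this discussion.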
distance. However, for arbitrary (unequal) amplitudes, the total phase shift for each component will be different. Consequently, the relative phase will change linearly as a function of propagation distance, and the polarization state will not be constant. As a result, the collision-induced change in polarization state, while being a function of the amplitude ratios R1,2 and wavelength separation Δλ, will also depend upon the collision position due to the propagation dependence of the relative phase Δφ1,2(z). To bypass this complication, we ensure that all collisions occur at the same spatial point in the fiber. Because only one half-wave plate is used in our experiment (see Fig. 20), it was not possible to prepare each vector soliton individually with an arbitrary R. In addition,
due to the wavelength dependence of the half-wave plate, it was not possible to adjust Δλ without affecting R. First, we investigated the phase dependence of the collision. This was done by changing the length of the HB-PMF entering the LB-PMF, while keeping R and Δλ constant. As a result, we could change Δφ1,2 due to the birefringence of the HB-PMF. Approximately 0.5 m of HB-PMF was added to ensure that the total amount of temporal pulse splitting did not affect the vector soliton formation. The results are shown in Fig. 22, where Fig. 22a–c and d–f correspond to the short and long HB-PMFs, respectively. Figure 22a and d show the two vector solitons, which propagate independently when no collision occurs; as expected, the two results are similar because the OSA
Computing with Solitons, Figure 23 Additional energy-exchanging collisions. a,d Experiment, without collision; b,e experiment, with collision; c,f simulated collision result, using Δφ2 = 90° inferred from the experiment of Fig. 22. Values of slow-fast amplitude ratio R are given above each soliton. The slow and fast axis components are depicted by solid and dashed lines, respectively. Reprinted with permission from [28]. Copyright by the American Physical Society
measurement does not depend on Δφ1,2. The result of the collision is depicted in Fig. 22b and e, along with the corresponding simulation results in Fig. 22c and f. In both of these collisions, an energy exchange between components occurs, and two important relations are satisfied: the total energy in each soliton and in each component is conserved. It can be seen that when one component in a soliton increases as a result of the collision, the other component decreases, with the opposite exchange in the second soliton. The difference between these two collisions is dramatic, in that the energy redistributes in opposite directions. For the simulations, idealized sech pulses for each component were used as initial conditions, and propagation was modeled without accounting for losses. The experimental amplitude ratio was used, and (without loss of generality [16,19,26]) Δφ2 was
varied while Δφ1 = 0. Best fits gave Δφ2 = 90° (Fig. 22c) and 50° (Fig. 22f). Despite the model approximations, experimental and numerical results all agree to within 15%. In the second set of results (Fig. 23), we changed R while keeping all other parameters constant. More specifically, we used the short HB-PMF, with initial phase difference Δφ2 = 90°, and changed the amplitude ratio. In agreement with theoretical predictions, the same direction of energy exchange is observed as in Fig. 22a–c. Spatial Soliton Collisions We mention here analogous experiments with spatial solitons in photorefractive media by Anastassiou et al. In [4], it is shown that energy is transferred in a collision of vector spatial solitons in a way consistent with the predictions
for the Manakov system (although the medium is a saturable one, and only approximates the Kerr nonlinearity). The experiment in [5] goes one step farther, showing that one soliton can be used as an intermediary to transfer energy from a second soliton to a third. We thus are now at a point where the ability of both temporal and spatial vector solitons to process information for computation has been demonstrated.
properties of vector soliton collisions extend perfectly to the semi-discrete case: that is, to the case where space is discretized, but time remains continuous. This models, for example, propagation in an array of coupled nonlinear waveguides [10]. The work suggests alternative physical implementations for soliton switching or computing, and also hints that the phenomenon of soliton information processing is a very general one.
Future Directions This article discussed computing with solitons, and attempted to address the subject from basic physical principles to applications. Although the nonlinearity of fibers is very weak, the ultralow loss and tight modal confinement make them technologically attractive. By no means, however, are they the only potential material for soliton-based information processing. Others include photorefractive crystals, semiconductor waveguides, quadratic media, and Bose–Einstein condensates, while future materials research may provide new candidate systems. From a computing perspective, scalar soliton collisions are insufficient. Although measurable phase and position shifts do occur, these phenomena cannot be cascaded to affect future soliton collisions and therefore cannot transfer information from one collision to the next. Meaningful computation using soliton collisions requires a new degree of freedom; that is, a new component. Collisions of vector solitons display interesting energy-exchanging effects between components, which can be exploited for arbitrary computation and bistability. The vector soliton experiments of Sect. “Experiments” were proof-of-principle ones. The first follow-up experiments with temporal vector solitons in birefringent fiber can be directed towards a full characterization of the collision process. This can be done fairly simply using the experimental setup of Fig. 20 updated in such a way as to allow independent control of two vector soliton inputs. This would involve separate polarizers and half-waveplates, followed by a polarization preserving fiber coupler. Cascaded collisions of temporal solitons also await experimental study. As demonstrated in photorefractive crystals with a saturable nonlinearity [5], one can show that information can be passed from one collision to the next. Beyond a first demonstration of two collisions is the prospect of setting up a multi-collision feedback cycle. Discussed in Sect. 
“Multistable Soliton Collision Cycles”, these collision cycles can be bistable and lead to interesting applications in computation. Furthermore, the recent work of Ablowitz et al. [2] shows theoretically that the useful energy-redistribution
Bibliography 1. Ablowitz MJ, Prinari B, Trubatch AD (2004) Soliton interactions in the vector nls equation. Inverse Problems 20(4):1217–1237 2. Ablowitz MJ, Prinari B, Trubatch AD (2006) Discrete vector solitons: Composite solitons, Yang–Baxter maps and computation. Studies in Appl Math 116:97–133 3. Agrawal GP (2001) Nonlinear Fiber Optics, 3rd edn. Academic Press, San Diego 4. Anastassiou C, Segev M, Steiglitz K, Giordmaine JA, Mitchell M, Shih MF, Lan S, Martin J (1999) Energy-exchange interactions between colliding vector solitons. Phys Rev Lett 83(12):2332– 2335 5. Anastassiou C, Fleischer JW, Carmon T, Segev M, Steiglitz K (2001) Information transfer via cascaded collisions of vector solitons. Opt Lett 26(19):1498–1500 6. Barad Y, Silberberg Y (1997) Phys Rev Lett 78:3290 7. Berlekamp ER, Conway JH, Guy RK (1982) Winning ways for your mathematical plays, Vol. 2. Academic Press Inc. [Harcourt Brace Jovanovich Publishers], London 8. Cao XD, McKinstrie CJ (1993) J Opt Soc Am B 10:1202 9. Chen ZG, Segev M, Coskun TH, Christodoulides DN (1996) Observation of incoherently coupled photorefractive spatial soliton pairs. Opt Lett 21(18):1436–1438 10. Christodoulides DN, Joseph RI (1988) Discrete self-focusing in nonlinear arrays of coupled waveguides. Opt Lett 13:794–796 11. Christodoulides DN, Singh SR, Carvalho MI, Segev M (1996) Incoherently coupled soliton pairs in biased photorefractive crystals. Appl Phys Lett 68(13):1763–1765 12. Cundiff ST, Collings BC, Akhmediev NN, Soto-Crespo JM, Bergman K, Knox WH (1999) Phys Rev Lett 82:3988 13. Fredkin E, Toffoli T (1982) Conservative logic. Int J Theor Phys 21(3/4):219–253 14. Hasegawa A, Tappert F (1973) Transmission of stationary nonlinear optical pulses in dispersive dielectric fibers 1: Anomalous dispersion. Appl Phys Lett 23(3):142–144 15. Islam MN, Poole CD, Gordon JP (1989) Opt Lett 14:1011 16. 
Jakubowski MH, Steiglitz K, Squier R (1998) State transformations of colliding optical solitons and possible application to computation in bulk media. Phys Rev E 58(5):6752–6758 17. Kang JU, Stegeman GI, Aitchison JS, Akhmediev N (1996) Observation of Manakov spatial solitons in AlGaAs planar waveguides. Phys Rev Lett 76(20):3699–3702 18. Korolev AE, Nazarov VN, Nolan DA, Truesdale CM (2005) Opt Lett 14:132 19. Manakov SV (1973) On the theory of two-dimensional stationary self-focusing of electromagnetic waves. Zh Eksp Teor Fiz 65(2):505–516, [Sov. Phys. JETP 38, 248 (1974)]
20. Mano MM (1972) Computer Logic Design. Prentice-Hall, Englewood Cliffs
21. Menyuk CR (1987) Opt Lett 12:614
22. Menyuk CR (1988) J Opt Soc Am B 5:392
23. Menyuk CR (1989) Pulse propagation in an elliptically birefringent Kerr medium. IEEE J Quant Elect 25(12):2674–2682
24. Mollenauer LF, Stolen RH, Gordon JP (1980) Experimental observation of picosecond pulse narrowing and solitons in optical fibers. Phys Rev Lett 45(13):1095–1098
25. Nishizawa N, Goto T (2002) Opt Express 10:1151–1160
26. Radhakrishnan R, Lakshmanan M, Hietarinta J (1997) Inelastic collision and switching of coupled bright solitons in optical fibers. Phys Rev E 56(2):2213–2216
27. Rand D, Steiglitz K, Prucnal PR (2005) Signal standardization in collision-based soliton computing. Int J of Unconv Comp 1:31–45
28. Rand D, Glesk I, Brès CS, Nolan DA, Chen X, Koh J, Fleischer JW, Steiglitz K, Prucnal PR (2007) Observation of temporal vector soliton propagation and collision in birefringent fiber. Phys Rev Lett 98(5):053902
29. Russell JS (1844) Report on waves. In: Report of the 14th meeting of the British Association for the Advancement of Science, Taylor and Francis, London, pp 331–390
30. Shih MF, Segev M (1996) Incoherent collisions between two-dimensional bright steady-state photorefractive spatial screening solitons. Opt Lett 21(19):1538–1540
31. Shor PW (1994) Algorithms for quantum computation: Discrete logarithms and factoring. In: 35th Annual Symposium on Foundations of Computer Science, IEEE Press, Piscataway, pp 124–134
32. Squier RK, Steiglitz K (1993) 2-d FHP lattice gasses are computation universal. Complex Systems 7:297–307
33. Steblina VV, Buryak AV, Sammut RA, Zhou DY, Segev M, Prucnal P (2000) Stable self-guided propagation of two optical harmonics coupled by a microwave or a terahertz wave. J Opt Soc Am B 17(12):2026–2031
34. Steiglitz K (2000) Time-gated Manakov spatial solitons are computationally universal. Phys Rev E 63(1):016608
35. Steiglitz K (2001) Multistable collision cycles of Manakov spatial solitons. Phys Rev E 63(4):046607
36. Yang J (1997) Physica D 108:92–112
37. Yang J (1999) Multisoliton perturbation theory for the Manakov equations and its applications to nonlinear optics. Phys Rev E 59(2):2393–2405
38. Zakharov VE, Shabat AB (1971) Exact theory of two-dimensional self-focusing and one-dimensional self-modulation of waves in nonlinear media. Zh Eksp Teor Fiz 61(1):118–134, [Sov. Phys. JETP 34, 62 (1972)]
Cooperative Games
ROBERTO SERRANO 1,2
1 Department of Economics, Brown University, Providence, USA
2 IMDEA-Social Sciences, Madrid, Spain
Article Outline

Glossary
Definition of the Subject
Introduction
Cooperative Games
The Core
The Shapley Value
Future Directions
Bibliography

Glossary

Game theory Discipline that studies strategic situations.
Cooperative game Strategic situation involving coalitions, whose formation assumes the existence of binding agreements among players.
Characteristic or coalitional function The most usual way to represent a cooperative game.
Solution concept Mapping that assigns predictions to each game.
Core Solution concept that assigns the set of payoffs that cannot be improved upon by any coalition.
Shapley value Solution concept that assigns the average of marginal contributions to coalitions.

Definition of the Subject

Cooperative game theory It is one of the two counterparts of game theory. It studies the interactions among coalitions of players. Its main question is this: Given the sets of feasible payoffs for each coalition, what payoff will be awarded to each player? One can take a positive or normative approach to answering this question, and different solution concepts in the theory lean towards one or the other.
Core It is a solution concept that assigns to each cooperative game the set of payoffs that no coalition can improve upon or block. In a context in which there is unfettered coalitional interaction, the core arises as a good positive answer to the question posed in cooperative game theory. In other words, if a payoff does not belong to the core, one should not expect to see it as the prediction of the theory if there is full cooperation.
Shapley value It is a solution that prescribes a single payoff for each player, which is the average of all marginal contributions of that player to each coalition he or she is a member of. It is usually viewed as a good normative answer to the question posed in cooperative game theory. That is, those who contribute more to the groups that include them should be paid more.
Although there were some earlier contributions, the official date of birth of game theory is usually taken to be 1944, year of publication of the first edition of the Theory of Games and Economic Behavior, by John von Neumann and Oskar Morgenstern [42]. The core was first proposed by Francis Ysidro Edgeworth in 1881 [13], and later reinvented and defined in game theoretic terms in [14]. The Shapley value was proposed by Lloyd Shapley in his 1953 PhD dissertation [37]. Both the core and the Shapley value have been applied widely, to shed light on problems in different disciplines, including economics and political science.

Introduction

Game theory is the study of games, also called strategic situations. These are decision problems with multiple decision makers, whose decisions impact one another. It is divided into two branches: non-cooperative game theory and cooperative game theory. The actors in non-cooperative game theory are individual players, who may reach agreements only if they are self-enforcing. The non-cooperative approach provides a rich language and develops useful tools to analyze games. One clear advantage of the approach is that it is able to model how specific details of the interaction among individual players may impact the final outcome. One limitation, however, is that its predictions may be highly sensitive to those details. For this reason it is worth also analyzing more abstract approaches that attempt to obtain conclusions that are independent of such details. The cooperative approach is one such attempt, and it is the subject of this article. The actors in cooperative game theory are coalitions, that is, groups of players. For the most part, two facts, that coalitions can form and that each coalition has a feasible set of payoffs available to its members, are taken as given. Given the coalitions and their sets of feasible payoffs as primitives, the question tackled is the identification of final payoffs awarded to each player.
That is, given a collection of feasible sets of payoffs, one for each coalition, can one predict or recommend a payoff (or set of payoffs) to be awarded to each player? Such predictions or recommendations are embodied in different solution concepts.
Indeed, one can take several approaches to answering the question just posed. From a positive or descriptive point of view, one may want to get a prediction of the likely outcome of the interaction among the players, and hence, the resulting payoff should be understood as the natural consequence of the forces at work in the system. Alternatively, one can take a normative or prescriptive approach, set up a number of normative goals, typically embodied in axioms, and try to derive their logical implications. Although authors sometimes disagree on the classification of the different solution concepts according to these two criteria – as we shall see, the understanding of each solution concept is enhanced if one can view it from very distinct approaches –, in this article we shall exemplify the positive approach with the core and the normative approach with the Shapley value. While this may oversimplify the issues, it should be helpful to a reader new to the subject.

The rest of the article is organized as follows. Sect. “Cooperative Games” introduces the basic model of a cooperative game, and discusses its assumptions as well as the notion of solution concepts. Sect. “The Core” is devoted to the core, and Sect. “The Shapley Value” to the Shapley value. In each case, some of the main results for each of the two are described, and examples are provided. Sect. “Future Directions” discusses some directions for future research.

Cooperative Games

Representations of Games. The Characteristic Function

Let us begin by presenting the different ways to describe a game. The first two are the usual ways employed in non-cooperative game theory. The most informative way to describe a game is called its extensive form. It consists of a game tree, specifying the timing of moves for each player and the information available to each of them at the time of making a move. At the end of each path of moves, a final outcome is reached and a payoff vector is specified.
For each player, one can define a strategy, i.e., a complete contingent plan of action to play the game. That is, a strategy is a function that specifies a feasible move each time a player is called upon to make a move in the game. One can abstract from details of the interaction (such as timing of moves and information available at each move), and focus on the concept of strategies. That is, one can list down the set of strategies available to each player, and arrive at the strategic or normal form of the game. For two players, for example, the normal form is represented in a bimatrix table. One player controls the rows, and the other the columns. Each cell of the bimatrix is occupied with an ordered pair, specifying the payoff to each player if each of them chooses the strategy corresponding to that cell.

One can further abstract from the notion of strategies, which will lead to the characteristic function form of representing a game. From the strategic form, one makes assumptions about the strategies used by the complement of a coalition of players to determine the feasible payoffs for the coalition (see, for example, the derivations in [7,42]). This is the representation most often used in cooperative game theory.

Thus, here are the primitives of the basic model in cooperative game theory. Let N = {1, …, n} be a finite set of players. Each non-empty subset of N is called a coalition. The set N is referred to as the grand coalition. For each coalition S, we shall specify a set V(S) ⊆ ℝ^|S| containing |S|-dimensional payoff vectors that are feasible for coalition S. This is called the characteristic function, and the pair (N, V) is called a cooperative game. Note how a reduced-form approach is taken because one does not explain what strategic choices are behind each of the payoff vectors in V(S). In addition, in this formulation, it is implicitly assumed that the actions taken by the complement coalition (those players in N ∖ S) cannot prevent S from achieving each of the payoff vectors in V(S). There are more general models in which these sorts of externalities across coalitions are considered, but we shall ignore them in this article.

Assumptions on the Characteristic Function

Some of the most common technical assumptions made on the characteristic function are the following:

(1) For each S ⊆ N, V(S) is closed. Denote by ∂V(S) the boundary of V(S). Hence, ∂V(S) ⊆ V(S).
(2) For each S ⊆ N, V(S) is comprehensive, i.e., for each x ∈ V(S), {x} − ℝ^|S|_+ ⊆ V(S).
(3) For each x ∈ ℝ^|S|, the set ∂V(S) ∩ ({x} + ℝ^|S|_+) is bounded.
(4) For each S ⊆ N, there exists a continuously differentiable representation of V(S), i.e., a continuously differentiable function g_S : ℝ^|S| → ℝ such that V(S) = {x ∈ ℝ^|S| : g_S(x) ≤ 0}.
(5) For each S ⊆ N, V(S) is non-leveled, i.e., for every x ∈ ∂V(S), the gradient of g_S at x is positive in all its coordinates.
With the assumptions made, ∂V(S) is its Pareto frontier, i.e., the set of vectors x_S ∈ V(S) such that there does not exist y_S ∈ V(S) satisfying y_i ≥ x_i for all i ∈ S with at least one strict inequality.

Other assumptions usually made relate the possibilities available to different coalitions. Among them, a very important one is balancedness, which we shall define next. A collection T of coalitions is balanced if there exists a set of weights w(S) ∈ [0, 1] for each S ∈ T such that, for every i ∈ N, Σ_{S ∈ T : S ∋ i} w(S) = 1. One can think of these weights as the fraction of time that each player devotes to each coalition he is a member of, with a given coalition representing the same fraction of time for each player. The game (N, V) is balanced if x_N ∈ V(N) whenever x_S ∈ V(S) for every S in a balanced collection T. That is, the grand coalition can always implement any “time-sharing arrangement” that the different subcoalitions may come up with.

The characteristic function defined so far is often referred to as a non-transferable utility (NTU) game. A particular case is the transferable utility (TU) game case, in which for each coalition S ⊆ N there exists a real number v(S) such that

V(S) = {x ∈ ℝ^|S| : Σ_{i ∈ S} x_i ≤ v(S)}.
Abusing notation slightly, we shall denote a TU game by (N, v). In the TU case there is an underlying numéraire – money – that can transfer utility or payoff at a one-to-one rate from one player to any other. Technically, the theory of NTU games is far more complex: it uses convex analysis and fixed point theorems, whereas the TU theory is based on linear inequalities and combinatorics.

Solution Concepts

Given a characteristic function, i.e., a collection of sets V(S), one for each S, the theory formulates its predictions on the basis of different solution concepts. We shall concentrate on the case in which the grand coalition forms, that is, cooperation is totally successful. Of course, solution concepts can be adapted to take care of the case in which this does not happen. A solution is a mapping that assigns a set of payoff vectors in V(N) to each characteristic function game (N, V). Thus, a solution in general prescribes a set, which can be empty, or a singleton (when it assigns a unique payoff vector as a function of the fundamentals of the problem). The leading set-valued cooperative solution concept is the core, while one of the most used single-valued ones is the Shapley value for TU games.
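Before turning to specific solutions, it may help to see how a TU characteristic function can be represented concretely. The sketch below (function and variable names are illustrative, not from the article) tabulates v as a dictionary keyed by coalitions encoded as frozensets, using a three-player simple majority game as the example:

```python
from itertools import combinations

def tu_game(players, worth):
    """Tabulate a TU characteristic function: frozenset(S) -> v(S).

    `worth` maps a coalition (a frozenset of players) to its worth;
    the empty coalition is assigned 0 by convention.
    """
    v = {frozenset(): 0.0}
    for size in range(1, len(players) + 1):
        for S in combinations(players, size):
            v[frozenset(S)] = worth(frozenset(S))
    return v

# Simple majority game with N = {1, 2, 3}: any coalition with at
# least two members can secure the whole unit of surplus.
v = tu_game((1, 2, 3), lambda S: 1.0 if len(S) >= 2 else 0.0)
print(v[frozenset({1})], v[frozenset({1, 2})])  # 0.0 1.0
```

Storing coalitions as frozensets makes lookups order-independent, which keeps the later solution-concept computations straightforward.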
There are several criteria to evaluate the reasonableness or appeal of a cooperative solution. As outlined above, in a normative approach, one can propose axioms, abstract principles that one would like the solution to satisfy, and the next step is to pursue their logical consequences. Historically, this was the first argument to justify the Shapley value. Alternatively, one could start by defending a solution on the basis of its definition alone. In the case of the core, this will be especially natural: in a context in which players can freely get together in groups, the prediction should be payoff vectors that cannot be improved upon by any coalition. One can further enhance one’s positive understanding of the solution concept by proposing games in extensive form or in normal form played non-cooperatively by players whose self-enforcing agreements lead to a given solution. This is simply to provide non-cooperative foundations or non-cooperative implementation to the cooperative solution in question, and it is an important research agenda initiated by John Nash in [25], referred to as the Nash program (see [34] for a recent survey). Today, there are interesting results of these different kinds for many solution concepts, which include axiomatic characterizations and non-cooperative foundations. Thus, one can evaluate the appeal of the axioms and the non-cooperative procedures behind each solution to defend a more normative or positive interpretation in each case.

The Core

The idea of agreements that are immune to coalitional deviations was first introduced to economic theory by Edgeworth in [13], which defined the set of coalitionally stable allocations of an economy under the name “final settlements.” Edgeworth envisioned this concept as an alternative to competitive equilibrium [43], of central importance in economic theory, and was also the first to investigate the connections between the two concepts.
Edgeworth’s notion, which today we refer to as the core, was rediscovered and introduced to game theory in [14]. The origins of the core were not axiomatic. Rather, its simple and appealing definition appropriately describes stable outcomes in a context of unfettered coalitional interaction. The core of the game (N, V) is the set of payoff vectors

C(N, V) = {x ∈ V(N) : there is no S ⊆ N with x_S ∈ V(S) ∖ ∂V(S)}.

In words, it is the set of feasible payoff vectors for the grand coalition that no coalition can upset. If such a coalition S exists, we shall say that S can improve upon or block x, and x is deemed unstable. That is, in a context where any coalition can get together, when S has a blocking move, coalition S will form and abandon the grand coalition and
its payoff x_S in order to get to a better payoff for each of the members of the coalition, a plan that is feasible for them.
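For a TU game this definition specializes to a simple test: coalition S blocks x exactly when v(S) > Σ_{i∈S} x_i. A brute-force checker for small games (helper names are mine, not the article’s):

```python
from itertools import combinations

def find_blocking_coalition(v, x, tol=1e-9):
    """Return a coalition S with v(S) > sum of x_i over i in S, or None.

    v maps frozenset coalitions to worths; x maps players to payoffs.
    """
    players = sorted(x)
    for size in range(1, len(players) + 1):
        for S in combinations(players, size):
            if v[frozenset(S)] > sum(x[i] for i in S) + tol:
                return frozenset(S)
    return None

# Three-player simple majority game: v(S) = 1 iff |S| >= 2.
v = {frozenset(S): (1.0 if len(S) >= 2 else 0.0)
     for size in (1, 2, 3) for S in combinations((1, 2, 3), size)}
# The equal split is blocked: each pair is paid 2/3 but can secure 1.
print(find_blocking_coalition(v, {1: 1/3, 2: 1/3, 3: 1/3}))  # frozenset({1, 2})
```

In this majority game no allocation survives: summing the three pair constraints x_i + x_j ≥ 1 would force a total payoff of at least 3/2 > 1 = v(N), so the core is empty, as discussed below.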
Non-Emptiness

The core can prescribe the empty set in some games. A game with an empty core is to be understood as a situation of strong instability, as any payoffs proposed to the grand coalition are vulnerable to coalitional blocking.

Example Consider the following simple majority 3-player TU game, in which the votes of at least two players make a coalition winning. That is, we represent the situation by the following characteristic function: v(S) = 1 for any S containing at least two members, and v({i}) = 0 for all i ∈ N. Clearly, C(N, v) = ∅. Any feasible payoff agreement proposed to the grand coalition will be blocked by at least one coalition.

An important sufficient condition for the non-emptiness of the core of NTU games is balancedness, as shown in [32]:

Theorem 1 (Scarf [32]) Let the game (N, V) be balanced. Then C(N, V) ≠ ∅.

For the TU case, balancedness is not only sufficient, but it becomes also necessary for the non-emptiness of the core:

Theorem 2 (Bondareva [9]; Shapley [39]) Let (N, v) be a TU game. Then, (N, v) is balanced if and only if C(N, v) ≠ ∅.

The Connections with Competitive Equilibrium

In economics, the institution of markets and the notion of prices are essential to the understanding of the allocation of goods and the distribution of wealth among individuals. For simplicity in the presentation, we shall concentrate on exchange economies, and disregard production aspects. That is, we shall assume that the goods in question have already been produced in some fixed amounts, and now they are to be allocated to individuals to satisfy their consumption needs. An exchange economy is a system in which each agent i in the set N has a consumption set Z_i ⊆ ℝ^l_+ of commodity bundles, as well as a preference relation over Z_i and an initial endowment ω_i ∈ Z_i of the commodities. A feasible allocation of goods in the economy is a list of bundles (z_i)_{i∈N} such that z_i ∈ Z_i and Σ_{i∈N} z_i ≤ Σ_{i∈N} ω_i. An allocation is competitive if it is supported by a competitive equilibrium. A competitive equilibrium is a price-allocation pair (p, (z_i)_{i∈N}), where p ∈ ℝ^l ∖ {0}, such that for every i ∈ N, z_i is top-ranked for agent i among all bundles z satisfying pz ≤ pω_i, and Σ_{i∈N} z_i = Σ_{i∈N} ω_i.

In words, this is what the concept expresses. First, at the equilibrium prices, each agent demands z_i, i.e., wishes to purchase this bundle among the set of affordable bundles, the budget set. And second, these demands are such that all markets clear, i.e., total demand equals total supply. Note how the notion of a competitive equilibrium relies on the principle of private ownership (each individual owns his or her endowment, which allows him or her to access markets and purchase things). Moreover, each agent is a price-taker in all markets. That is, no single individual can affect the market prices with his or her actions; prices are fixed parameters in each individual’s consumption decision. The usual justification for the price-taking assumption is that each individual is “very small” with respect to the size of the economy, and hence has no market power. One difficulty with the competitive equilibrium concept is that it does not explain where prices come from. There is no single agent in the model responsible for coming up with them. Walras in [43] told the story of an auctioneer calling out prices until demand and supply coincide, but in many real-world markets there is no auctioneer. More generally, economists attribute the equilibrium prices to the workings of the forces of demand and supply, but this appears to be simply repeating the definition. So, is there a different way one can explain competitive equilibrium prices? As it turns out, there is a very robust result that answers this question. We refer to it as the equivalence principle (see, e.g., [6]), by which, under certain regularity conditions, the predictions provided by different game-theoretic solution concepts, when applied to an economy with a large enough set of agents, tend to converge to the set of competitive equilibrium allocations. One of the first results in this tradition was provided by Edgeworth in 1881 for the core.

Note how the core of the economy can be defined in the space of allocations, using the same definition as above. Namely, a feasible allocation is in the core if it cannot be blocked by any coalition of agents when making use of the coalition’s endowments. Edgeworth’s result was generalized later by Debreu and Scarf in [11] for the case in which an exchange economy is replicated an arbitrary number of times (Anderson studies in [1] the more general case of arbitrary sequences of economies, not necessarily replicas). An informal statement of the Debreu–Scarf theorem follows:
Theorem 3 (Debreu and Scarf [11]) Consider an exchange economy. Then:
(i) The set of competitive equilibrium allocations is contained in the core.
(ii) For each non-competitive core allocation of the original economy, there exists a sufficiently large replica of the economy for which the replica of the allocation is blocked.

The first part states a very appealing property of competitive allocations, i.e., their coalitional stability. The second part, known as the core convergence theorem, states that the core “shrinks” to the set of competitive allocations as the economy grows large. In [3], Aumann models the economy as an atomless measure space, and demonstrates the following core equivalence theorem:

Theorem 4 (Aumann [3]) Let the economy consist of an atomless continuum of agents. Then, the core coincides with the set of competitive allocations.

For readers who wish to pursue the topic further, [2] provides a recent survey.

Axiomatic Characterizations

The axiomatic foundations of the core were provided much later than the concept was proposed. These characterizations are all inspired by Peleg’s work. They include [26,27], and [36] – the latter paper also provides an axiomatization of competitive allocations in which core convergence insights are exploited. In all these characterizations, the key axiom is that of consistency, also referred to as the reduced game property. Consistency means that the outcomes prescribed by a solution should be “invariant” to the number of players in the game. More formally, let (N, V) be a game, and let σ be a solution. Let x ∈ σ(N, V). Then, the solution σ is consistent if for every S ⊆ N, x_S ∈ σ(S, V_{x,S}), where (S, V_{x,S}) is the reduced game for S given payoffs x, defined as follows. The feasible set for S in this reduced game is the projection of V(N) at x_{N∖S}, i.e., what remains after paying those outside of S:

V_{x,S}(S) = {y_S : (y_S, x_{N∖S}) ∈ V(N)}.

However, the feasible set of T ⊂ S, T ≠ S, allows T to make deals with any coalition outside of S, provided that those services are paid at the rate prescribed by x_{N∖S}:

V_{x,S}(T) = {y_T : (y_T, x_Q) ∈ V(T ∪ Q) for some Q ⊆ N ∖ S}.
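For TU games this construction specializes to what is known as the Davis–Maschler reduced game, v_{x,S}(T) = max_{Q ⊆ N∖S} [v(T ∪ Q) − Σ_{j∈Q} x_j] for T ⊊ S, with v_{x,S}(S) = v(N) − Σ_{j∈N∖S} x_j. A sketch (all helper names are mine) that checks consistency of the core on a small example:

```python
from itertools import chain, combinations

def subsets(xs):
    """All subsets of xs, including the empty one."""
    xs = list(xs)
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

def dm_reduced_game(v, players, S, x):
    """Davis-Maschler reduced TU game on coalition S at payoff vector x."""
    S = frozenset(S)
    outside = [j for j in players if j not in S]
    vr = {}
    for r in range(1, len(S) + 1):
        for T in combinations(sorted(S), r):
            T = frozenset(T)
            if T == S:
                vr[T] = v[frozenset(players)] - sum(x[j] for j in outside)
            else:
                vr[T] = max(v[T | frozenset(Q)] - sum(x[j] for j in Q)
                            for Q in subsets(outside))
    return vr

def in_core(v, x, tol=1e-9):
    """TU core test: x(N) = v(N) and v(S) <= x(S) for every coalition S."""
    players = sorted(x)
    ok = abs(sum(x.values()) - v[frozenset(players)]) <= tol
    return ok and all(v[frozenset(S)] <= sum(x[i] for i in S) + tol
                      for r in range(1, len(players) + 1)
                      for S in combinations(players, r))

# Example: v(N) = 1, every pair worth 1/2, singletons worth 0.
players = (1, 2, 3)
v = {frozenset(S): w for S, w in [
    ((1,), 0.0), ((2,), 0.0), ((3,), 0.0),
    ((1, 2), 0.5), ((1, 3), 0.5), ((2, 3), 0.5), ((1, 2, 3), 1.0)]}
x = {1: 1/3, 2: 1/3, 3: 1/3}          # a core payoff of (N, v)
vr = dm_reduced_game(v, players, (1, 2), x)
print(in_core(vr, {1: 1/3, 2: 1/3}))  # True: the restriction stays in the core
```

Here the reduced game on S = {1, 2} has worths v_{x,S}({1, 2}) = 1 − 1/3 = 2/3 and v_{x,S}({i}) = 1/2 − 1/3 = 1/6, and the restriction (1/3, 1/3) indeed remains in its core.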
It can be shown that the core satisfies consistency with respect to this reduced game. Moreover, consistency is the central axiom in the characterization of the core, which, depending on the version one looks at, uses a host of other axioms; see [26,27,36].

Non-cooperative Implementation

To obtain a non-cooperative implementation of the core, the procedure must embody some feature of anonymity, since the core is usually a large set and it contains payoffs where different players are treated very differently. For instance, if the procedure always had a fixed set of moves, typically the prediction would favor the first mover, making it impossible to obtain an implementation of the entire set of payoffs. The model in [30] builds in this anonymity by assuming that negotiations take place in continuous time, so that anyone can speak at the beginning of the game, and at any point in time, instead of having a fixed order. The player that gets to speak first makes a proposal consisting of naming a coalition that contains him and a feasible payoff for that coalition. Next, the players in that coalition get to respond. If they all accept the proposal, the coalition leaves and the game continues among the other players. Otherwise, a new proposal may come from any player in N. It is shown that, if the TU game has a non-empty core (as well as any of its subgames), a class of stationary self-enforcing predictions of this procedure coincides with the core. If a core payoff is proposed to the grand coalition, there are no incentives for individual players to reject it. Conversely, a non-core payoff cannot be sustained because any player in a blocking coalition has an incentive to make a proposal to that coalition, who will accept it (knowing that the alternative, given stationarity, would be to go back to the non-core status quo).
[24] offers a discrete-time version of the mechanism: in this work, the anonymity required is imposed on the solution concept, by looking at the order-independent equilibria of the procedure. The model in [33] sets up a market to implement the core. The anonymity of the procedure stems from the random choice of broker. The broker announces a vector (x_1, …, x_n), where the components add up to v(N). One can interpret x_i as the price for the productive asset held by player i. Following an arbitrary order, the remaining players either accept or reject these prices. If player i accepts, he sells his asset to the broker for the price x_i and leaves the game. Those who reject get to buy from the broker, at the called-out prices, the portfolio of assets of their choice if the broker still has them. If a player rejects, but does not get to buy the portfolio of assets he would like because someone else took them before, he can always leave the market with his own asset. The broker’s payoff is the worth of the final portfolio of assets that he holds, plus the net monetary transfers that he has received. It is shown in [33] that the prices announced by the broker will always be his top-ranked vectors in the core. If the TU game is such that gains from cooperation increase with the size of coalitions, a beautiful theorem of Shapley in [41] is used to prove that the set of all equilibrium payoffs of this procedure will coincide with the core. Core payoffs are here understood as those price vectors where all arbitrage opportunities in the market have been wiped out. Also, procedures in [35] implement the core, but do not rely on the TU assumption, and they use a procedure in which the order of moves can be endogenously changed by players. Finally, yet another way to build anonymity in the procedure is by allowing the proposal to be made by brokers outside of the set N, as done in [28].

An Application

Consider majority games within a parliament. Suppose there are 100 seats, and decisions are made by simple majority so that 51 votes are required to pass a piece of legislation. In the first specification, suppose there is a very large party – player 1 – who has 90 seats. There are five small parties, with 2 seats each. Given the simple majority rules, this problem can be represented by the following TU characteristic function: v(S) = 1 if S contains player 1, and v(S) = 0 otherwise. The interpretation is that each winning coalition can get the entire surplus – pass the desired proposal. Here, a coalition is winning if and only if player 1 is in it. For this problem, the core is a singleton: the entire unit of surplus is allocated to player 1, who has all the power. Any split of the unit surplus of the grand coalition (v(N) = 1) that gives some positive fraction of surplus to any of the small parties can be blocked by the coalition of player 1 alone.
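This first parliament game is small enough to verify mechanically (the helper below is mine): only the allocation giving everything to player 1 survives blocking.

```python
from itertools import combinations

def blocked(v, x, tol=1e-9):
    """True if some coalition S has v(S) > sum of x_i over i in S."""
    players = sorted(x)
    return any(v(S) > sum(x[i] for i in S) + tol
               for r in range(1, len(players) + 1)
               for S in combinations(players, r))

# A coalition wins (worth 1) iff it contains player 1, the 90-seat party.
v = lambda S: 1.0 if 1 in S else 0.0

core_point = {1: 1.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0}
print(blocked(v, core_point))  # False: the unique core allocation
# Give any small party a positive share and player 1 blocks on his own:
print(blocked(v, {1: 0.9, 2: 0.1, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0}))  # True
```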
Consider now a second problem, in which player 1, who continues to be the large party, has 35 seats, and each of the other five parties has 13 seats. Now, the characteristic function is as follows: v(S) = 1 if and only if S either contains player 1 and two small parties, or it contains four of the small parties; v(S) = 0 otherwise. It is easy to see that now the core is empty: any split of the unit surplus will be blocked by at least one coalition. For example, the entire unit going to player 1 is blocked by the coalition of all five small parties, which can award 0.2 to each of them. But this arrangement, in which each small party gets 0.2 and player 1 nothing, is blocked as well, because player 1 can bribe two of the small parties (say, players 2 and 3) and promise them 1/3 each, keeping the other third for itself, and so on. The emptiness of the core is a way to describe the fragility of any agreement, due to the inherent instability of this coalition formation game.

The Shapley Value

Now consider a transferable utility or TU game in characteristic function form. The number v(S) is referred to as the worth of S, and it expresses S’s initial position (e.g., the maximum total amount of surplus in numéraire – money, or power – that S initially has at its disposal).

Axiomatics

Shapley in [37] is interested in solving in a fair and unique way the problem of distribution of surplus among the players, when taking into account the worth of each coalition. To do this, he restricts attention to single-valued solutions and resorts to the axiomatic method. He proposes the following axioms on a single-valued solution:

(i) Efficiency: The payoffs must add up to v(N), which means that all the grand coalition surplus is allocated.
(ii) Symmetry: If two players are substitutes because they contribute the same to each coalition, the solution should treat them equally.
(iii) Additivity: The solution to the sum of two TU games must be the sum of what it awards to each of the two games.
(iv) Dummy player: If a player contributes nothing to every coalition, the solution should pay him nothing.

(To be precise, the name of the first axiom should be different. In an economic sense, the statement does imply efficiency in superadditive games, i.e., when for every pair of disjoint coalitions S and T, v(S) + v(T) ≤ v(S ∪ T). In the absence of superadditivity, though, forming the grand coalition is not necessarily efficient, because a higher aggregate payoff can be obtained from a different coalition structure.) The surprising result in [37] is this:

Theorem 5 (Shapley [37]) There is a unique single-valued solution to TU games satisfying efficiency, symmetry, additivity and dummy.
It is what today we call the Shapley value, the function that assigns to each player i the payoff

Sh_i(N, v) = Σ_{S ⊆ N : i ∈ S} [(|S| − 1)!(|N| − |S|)!/|N|!] · [v(S) − v(S ∖ {i})].
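This payoff can also be computed by brute force, averaging each player’s marginal contribution over all |N|! orders of the players, which is equivalent to the formula. A small sketch (function names are mine):

```python
from itertools import combinations, permutations

def shapley_value(players, v):
    """Average each player's marginal contribution over all player orders."""
    phi = {i: 0.0 for i in players}
    count = 0
    for order in permutations(players):
        count += 1
        seen = set()
        for i in order:
            before = v[frozenset(seen)] if seen else 0.0
            seen.add(i)
            phi[i] += v[frozenset(seen)] - before
    return {i: phi[i] / count for i in players}

# Simple majority game: by symmetry each player receives 1/3.
v = {frozenset(S): (1.0 if len(S) >= 2 else 0.0)
     for r in (1, 2, 3) for S in combinations((1, 2, 3), r)}
print(shapley_value((1, 2, 3), v))
```

Efficiency and symmetry are visible in the output; the dummy axiom can be checked the same way on a game in which some player contributes nothing to every coalition.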
That is, the Shapley value awards to each player the average of his marginal contributions to each coalition. In taking this average, all orders of the players are considered to
be equally likely. Let us assume, also without loss of generality, that v({i}) = 0 for each player i. What is especially surprising in Shapley’s result is that nothing in the axioms (with the possible exception of the dummy axiom) hints at the idea of marginal contributions, so marginality in general is the outcome of all the axioms, including additivity or linearity. Among the axioms utilized by Shapley, additivity is the one with the lowest normative content: it is simply a mathematical property to justify simplicity in the computation of the solution. Young in [45] provides a beautiful counterpart to Shapley’s theorem. He drops additivity (as well as the dummy player axiom), and instead, uses an axiom of marginality. Marginality means that the solution should pay the same to a player in two games if his or her marginal contributions to coalitions are the same in both games. Marginality is an idea with a strong tradition in economic theory. Young’s result is “dual” to Shapley’s, in the sense that marginality is assumed and additivity derived as the result:

Theorem 6 (Young [45]) There exists a unique single-valued solution to TU games satisfying efficiency, symmetry and marginality. It is the Shapley value.

Apart from these two, [19] provides further axiomatizations of the Shapley value using the idea of potential and the concept of consistency, as described in the previous section. There is no single way to extend the Shapley value to the class of NTU games. There are three main extensions that have been proposed: the Shapley λ-transfer value [40], the Harsanyi value [16], and the Maschler–Owen consistent value [23]. They were axiomatized in [5,10,17], respectively.

The Connections with Competitive Equilibrium

As was the case for the core, there is a value equivalence theorem. The result holds for the TU domain (see [4,8,38]). It can be shown that the Shapley value payoffs can be supported by competitive prices.
Furthermore, in large enough economies, the set of competitive payoffs "shrinks" to approximate the Shapley value. However, the result cannot be easily extended to the NTU domain. While it holds for the λ-transfer value, it need not obtain for the other extensions. For further details, the interested reader is referred to [18] and the references therein.
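The characterization of the Shapley value as the average of marginal contributions over all arrival orders lends itself to a direct computation. Below is a minimal brute-force sketch (the function name and the illustrative glove game are our own; the enumeration is exponential in n and only meant for small games):

```python
from itertools import permutations

def shapley_value(n, v):
    """Shapley value of an n-player TU game, computed from the definition:
    average each player's marginal contribution v(S U {i}) - v(S) over all
    n! arrival orders. v maps a frozenset of players to a real number."""
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        coalition = frozenset()
        for i in order:
            phi[i] += v(coalition | {i}) - v(coalition)
            coalition = coalition | {i}
    return [p / len(orders) for p in phi]

# Illustration: a 3-player glove game. Player 0 owns a left glove,
# players 1 and 2 each own a right glove; a matched pair is worth 1.
v = lambda S: 1.0 if 0 in S and (1 in S or 2 in S) else 0.0
print(shapley_value(3, v))  # approximately [2/3, 1/6, 1/6]
```

Note that the efficiency axiom is visible in the output: the three payoffs sum to v(N) = 1.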
Non-cooperative Implementation

Reference [15] was the first to propose a procedure that provided some non-cooperative foundations of the Shapley value. Later, other authors have provided alternative procedures and techniques to the same end, including [20,21,29,44]. We shall concentrate on the description of the procedure proposed by Hart and Mas-Colell in [20]. Generalizing an idea found in [22], which studies the case of δ = 0 (see below), Hart and Mas-Colell propose the following non-cooperative procedure. With equal probability, each player i ∈ N is chosen to publicly make a feasible proposal to the others: a vector (x1, ..., xn) such that the sum of its components cannot exceed v(N). The other players respond to it in sequence, following a prespecified order. If all accept, the proposal is implemented; otherwise, a random device is triggered. With probability δ, where 0 ≤ δ < 1, the same game continues being played among the same n players (and thus a new proposer will again be chosen at random among them), but with probability 1 − δ, the proposer leaves the game. He is paid 0 and his resources are removed, so that in the next period, proposals to the remaining n − 1 players cannot add up to more than v(N \ {i}). A new proposer is then chosen at random from the set N \ {i}, and so on.

As shown in [20], there exists a unique stationary self-enforcing prediction of this procedure, and it coincides with the Shapley value payoffs for any value of δ. (Stationarity means that strategies cannot be history dependent.) As δ → 1, the Shapley value payoffs are obtained not only in expectation, but independently of who the proposer is. One way to understand this result, as done in [20], is to check that the rules of the procedure and stationary behavior in it are in agreement with Shapley's axioms: the equilibrium relies on immediate acceptance of proposals, stationary strategies treat substitute players similarly, the equations describing the equilibrium have an additive structure, and dummy players must receive 0 because no resources are destroyed if they are asked to leave. It is also worth stressing the important role in the procedure of players' marginal contributions to coalitions: following a rejection, a proposer incurs the risk of being thrown out and the others of losing his resources, which suggests a "price" for them. In [21], the authors study the conditions under which stationarity can be removed to obtain the result. Also, [29] uses a variant of the Hart and Mas-Colell procedure, replacing the random choice of proposers with a bidding stage in which players bid for the right to make proposals.
An Application

Consider again the class of majority problems in a parliament consisting of 100 seats. As we shall see, the Shapley value is a good way to understand the power that each party has in the legislature. Let us begin by considering again the problem in which player 1 has 90 seats, while each of the five small parties has 2 seats. It is easy to see that the Shapley value, like the core in this case, awards the entire unit of surplus to player 1: effectively, each of the small parties is a dummy player, and hence the Shapley value awards zero to each of them.

Consider a second problem, in which player 1 is a big party with 35 seats, and there are 5 small parties with 13 seats each. The Shapley value awards 1/3 to the large party and, by symmetry, 2/15 to each of the small parties. To see this, we need to determine when the marginal contributions of player 1 to a coalition are positive. Recall that there are 6! possible orders of players. If player 1 arrives first or second in the room in which the coalition is forming, his marginal contribution is zero: the coalition was losing before he arrived and continues to be a losing coalition after his arrival. Similarly, his marginal contribution is also zero if he arrives fifth or sixth at the coalition; in this case, the coalition is already winning before he arrives, so he adds nothing to it. Thus, only when he arrives third or fourth, which happens a third of the time, does he change the nature of the coalition from losing to winning. This explains his Shapley value share of 1/3. In this game, the Shapley value payoffs roughly correspond to the proportion of seats that each party has.

Next, consider a third problem in which there are two large parties, while the other four parties are very small. For example, let each of the large parties (players 1 and 2) have 48 seats, while each of the four small parties has only one seat. Now the Shapley value payoffs are 0.3 to each of the two large parties and 0.1 to each of the small ones. To see this, note that the marginal contribution of a small party is positive only when he comes fourth in line and exactly one of the three preceding parties in the coalition is a large party, i.e., in 72 of the 5! orders of the other parties in which he is fourth. That is, (72/5!) × (1/6) = 1/10. In this case, the competition between the large parties for the votes of the small parties increases the power of the latter quite significantly with respect to the proportion of seats that each of them holds.

Finally, consider a fourth problem with two large parties (players 1 and 2) with 46 seats each, one mid-size party (player 3) with 5 seats, and three small parties, each with one seat. First, note that each of the three small parties has become a dummy player: no winning coalition to which he belongs becomes losing if he leaves it, and so players 4, 5 and 6 are paid zero by the Shapley value. Next, note that, despite the substantial difference of seats between each large party and the mid-size party, they are identical in terms of marginal contributions to a winning coalition. Indeed, for i = 1, 2, 3, player i's marginal contribution to a coalition is positive only if he arrives second, third, fourth or fifth, and exactly one of the preceding players in the coalition is one of the other non-dummy players. Note how the Shapley value captures nicely the changes in the allocation of power due to each different political scenario. In this case, the fierce competition between the large parties for the votes of player 3, the swing party needed to form a majority, explains the equal share of power among the three.
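The four parliament computations above can be verified mechanically. The following sketch (our own; the function name is an assumption) uses exact fractions and the 51-seat quota out of 100 seats, counting for each party the fraction of arrival orders in which it is pivotal:

```python
from fractions import Fraction
from itertools import permutations

def voting_power(weights, quota):
    """Shapley value of a weighted majority game: a coalition wins iff its
    seats reach the quota; a party's power is the fraction of arrival
    orders in which it turns a losing coalition into a winning one."""
    n = len(weights)
    phi = [Fraction(0)] * n
    orders = list(permutations(range(n)))
    for order in orders:
        seats = 0
        for i in order:
            if seats < quota <= seats + weights[i]:
                phi[i] += 1  # party i is pivotal in this arrival order
            seats += weights[i]
    return [p / len(orders) for p in phi]

print(voting_power([90, 2, 2, 2, 2, 2], 51))     # powers 1, 0, 0, 0, 0, 0
print(voting_power([35, 13, 13, 13, 13, 13], 51))  # powers 1/3, 2/15, ..., 2/15
print(voting_power([48, 48, 1, 1, 1, 1], 51))    # powers 3/10, 3/10, 1/10, ..., 1/10
print(voting_power([46, 46, 5, 1, 1, 1], 51))    # powers 1/3, 1/3, 1/3, 0, 0, 0
```

With 6 parties there are only 6! = 720 orders, so brute force is instantaneous; for large parliaments one would instead sample random orders.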
Future Directions

This article has been a first approach to cooperative game theory, and has emphasized two of its most important solution concepts. The literature on these topics is vast, and the interested reader is encouraged to consult the general references listed below. For the future, one should expect to see progress of the theory into areas that have been less explored, including games with asymmetric information and games with coalitional externalities. In both cases, the characteristic function model must be enriched to handle the added complexities.

Relevant to this encyclopedia are issues of complexity. The complexity of cooperative solution concepts has been studied (see, for instance, [12]). In terms of computational complexity, the Shapley value seems to be easy to compute, while the core is harder, although some classes of games have been identified in which this task is also simple. Finally, one should stress the importance of novel and fruitful applications of the theory to shed new light on concrete problems. In the case of the core, for example, the insights of core stability in matching markets have been successfully applied by Alvin Roth and his collaborators to the design of matching markets in the "real world" (e.g., the job market for medical interns and hospitals, the allocation of organs from donors to patients, and so on); see [31].

Bibliography

Primary Literature

1. Anderson RM (1978) An elementary core equivalence theorem. Econometrica 46:1483–1487
2. Anderson RM (2008) Core convergence. In: Durlauf S, Blume L (eds) The New Palgrave Dictionary of Economics, 2nd edn. Macmillan, London
3. Aumann RJ (1964) Markets with a continuum of traders. Econometrica 32:39–50
4. Aumann RJ (1975) Values of markets with a continuum of traders. Econometrica 43:611–646
5. Aumann RJ (1985) An axiomatization of the non-transferable utility value. Econometrica 53:599–612
6. Aumann RJ (1987) Game theory. In: Eatwell J, Milgate M, Newman P (eds) The New Palgrave Dictionary of Economics. Norton, New York
7. Aumann RJ, Peleg B (1960) Von Neumann–Morgenstern solutions to cooperative games without side payments. Bull Am Math Soc 66:173–179
8. Aumann RJ, Shapley LS (1974) Values of Non-Atomic Games. Princeton University Press, Princeton
9. Bondareva ON (1963) Some applications of linear programming methods to the theory of cooperative games (in Russian). Problemy Kibernetiki 10:119–139
10. de Clippel G, Peters H, Zank H (2004) Axiomatizing the Harsanyi solution, the symmetric egalitarian solution and the consistent solution for NTU-games. Int J Game Theory 33:145–158
11. Debreu G, Scarf H (1963) A limit theorem on the core of an economy. Int Econ Rev 4:235–246
12. Deng X, Papadimitriou CH (1994) On the complexity of cooperative solution concepts. Math Oper Res 19:257–266
13. Edgeworth FY (1881) Mathematical Psychics. Kegan Paul, London. Reprinted in: Newman P (ed) (2003) F. Y. Edgeworth's Mathematical Psychics and Further Papers on Political Economy. Oxford University Press, Oxford
14. Gillies DB (1959) Solutions to general non-zero-sum games. In: Tucker AW, Luce RD (eds) Contributions to the Theory of Games IV. Princeton University Press, Princeton, pp 47–85
15. Gul F (1989) Bargaining foundations of Shapley value. Econometrica 57:81–95
16. Harsanyi JC (1963) A simplified bargaining model for the n-person cooperative game. Int Econ Rev 4:194–220
17. Hart S (1985) An axiomatization of Harsanyi's non-transferable utility solution. Econometrica 53:1295–1314
18. Hart S (2008) Shapley value. In: Durlauf S, Blume L (eds) The New Palgrave Dictionary of Economics, 2nd edn. Macmillan, London
19. Hart S, Mas-Colell A (1989) Potential, value and consistency. Econometrica 57:589–614
20. Hart S, Mas-Colell A (1996) Bargaining and value. Econometrica 64:357–380
21. Krishna V, Serrano R (1995) Perfect equilibria of a model of n-person non-cooperative bargaining. Int J Game Theory 24:259–272
22. Mas-Colell A (1988) Algunos comentarios sobre la teoria cooperativa de los juegos. Cuadernos Economicos 40:143–161
23. Maschler M, Owen G (1992) The consistent Shapley value for games without side payments. In: Selten R (ed) Rational Interaction: Essays in Honor of John Harsanyi. Springer, New York
24. Moldovanu B, Winter E (1995) Order independent equilibria. Games Econ Behav 9:21–34
25. Nash JF (1953) Two-person cooperative games. Econometrica 21:128–140
26. Peleg B (1985) An axiomatization of the core of cooperative games without side payments. J Math Econ 14:203–214
27. Peleg B (1986) On the reduced game property and its converse. Int J Game Theory 15:187–200
28. Pérez-Castrillo D (1994) Cooperative outcomes through noncooperative games. Games Econ Behav 7:428–440
29. Pérez-Castrillo D, Wettstein D (2001) Bidding for the surplus: a non-cooperative approach to the Shapley value. J Econ Theory 100:274–294
30. Perry M, Reny P (1994) A non-cooperative view of coalition formation and the core. Econometrica 62:795–817
31. Roth AE (2002) The economist as engineer: game theory, experimentation and computation as tools for design economics. Econometrica 70:1341–1378
32. Scarf H (1967) The core of an N person game. Econometrica 38:50–69
33. Serrano R (1995) A market to implement the core. J Econ Theory 67:285–294
34. Serrano R (2005) Fifty years of the Nash program, 1953–2003. Investigaciones Económicas 29:219–258
35. Serrano R, Vohra R (1997) Non-cooperative implementation of the core. Soc Choice Welf 14:513–525
36. Serrano R, Volij O (1998) Axiomatizations of neoclassical concepts for economies. J Math Econ 30:87–108
37. Shapley LS (1953) A value for n-person games. In: Tucker AW, Luce RD (eds) Contributions to the Theory of Games II. Princeton University Press, Princeton, pp 307–317
38. Shapley LS (1964) Values of large games VII: a general exchange economy with money. Research Memorandum 4248-PR. RAND Corporation, Santa Monica
39. Shapley LS (1967) On balanced sets and cores. Nav Res Logist Q 14:453–460
40. Shapley LS (1969) Utility comparison and the theory of games. In: La Décision: Agrégation et Dynamique des Ordres de Préférence. CNRS, Paris
41. Shapley LS (1971) Cores of convex games. Int J Game Theory 1:11–26
42. von Neumann J, Morgenstern O (1944) Theory of Games and Economic Behavior. Princeton University Press, Princeton
43. Walras L (1874) Elements of Pure Economics, or the Theory of Social Wealth. English edition: Jaffé W (ed). Reprinted in 1984 by Orion Editions, Philadelphia
44. Winter E (1994) The demand commitment bargaining and snowballing of cooperation. Econ Theory 4:255–273
45. Young HP (1985) Monotonic solutions of cooperative games. Int J Game Theory 14:65–72
Books and Reviews

Myerson RB (1991) Game Theory: An Analysis of Conflict. Harvard University Press, Cambridge
Osborne MJ, Rubinstein A (1994) A Course in Game Theory. MIT Press, Cambridge
Peleg B, Sudhölter P (2003) Introduction to the Theory of Cooperative Games. Kluwer, Amsterdam; 2nd edn. Springer, Berlin
Roth AE, Sotomayor M (1990) Two-Sided Matching: A Study in Game-Theoretic Modeling and Analysis. Cambridge University Press, Cambridge
Cooperative Games (Von Neumann–Morgenstern Stable Sets)

JUN WAKO 1, SHIGEO MUTO 2
1 Department of Economics, Gakushuin University, Tokyo, Japan
2 Graduate School of Decision Science and Technology, Tokyo Institute of Technology, Tokyo, Japan

Article Outline

Glossary
Definition of the Subject
Introduction
Stable Sets in Abstract Games
Stable Set and Core
Stable Sets in Characteristic Function Form Games
Applications of Stable Sets in Abstract and Characteristic Function Form Games
Stable Sets and Farsighted Stable Sets in Strategic Form Games
Applications of Farsighted Stable Sets in Strategic Form Games
Future Directions
Bibliography

Glossary

Characteristic function form game A characteristic function form game consists of a set of players and a characteristic function that gives each group of players, called a coalition, a value or a set of payoff vectors that they can gain by themselves. It is a typical representation of cooperative games. For characteristic function form games, several solution concepts have been defined, such as the von Neumann–Morgenstern stable set, core, bargaining set, kernel, nucleolus, and Shapley value.

Abstract game An abstract game consists of a set of outcomes and a binary relation, called domination, on the outcomes. Von Neumann and Morgenstern presented this game form for general applications of stable sets.

Strategic form game A strategic form game consists of a player set, each player's strategy set, and each player's payoff function. It is usually used to represent non-cooperative games.

Imputation An imputation is a payoff vector in a characteristic function form game that satisfies group rationality and individual rationality. The former means that the players divide the amount that the grand coalition of all players can gain, and the latter says that each player is assigned at least the amount that he/she can gain by him/herself.

Domination Domination is a binary relation defined on the set of imputations, outcomes, or strategy combinations, depending on the form of a given game. In characteristic function form games, an imputation is said to dominate another imputation if there is a coalition of players such that they can realize their payoffs in the former by themselves, and each of them is better off than in the latter. A domination relation given a priori in an abstract game can also be interpreted in the same way. In strategic form games, domination is defined on the basis of commonly beneficial changes of strategies by coalitions.

Internal stability A set of imputations (outcomes, strategy combinations) satisfies internal stability if there is no domination between any two imputations in the set.

External stability A set of imputations (outcomes, strategy combinations) satisfies external stability if any imputation outside the set is dominated by some imputation inside the set.

von Neumann–Morgenstern stable set A set of imputations (outcomes, strategy combinations) is a von Neumann–Morgenstern stable set if it satisfies both internal and external stability.

Farsighted stable set A farsighted stable set is a more sophisticated stable set concept, defined mainly for strategic form games. Given two strategy combinations x and y, we say that x indirectly dominates y if there exist a sequence of coalitions S1, ..., Sp and a sequence of strategy combinations y = x0, x1, ..., xp = x such that each coalition Sj can induce strategy combination xj by a joint move from xj−1, and all members of Sj end up with better payoffs at xp = x compared with their payoffs at xj−1. A farsighted stable set is a set of strategy combinations that is stable both internally and externally with respect to indirect domination. Farsighted stable sets can also be defined in abstract games and characteristic function form games.
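Indirect domination as just defined can be checked mechanically on small strategic form games by searching over chains of coalitional moves. The sketch below is our own illustration (function names and the prisoner's-dilemma payoffs are assumptions, not part of the original text); it shows the cooperative profile indirectly dominating every other profile of a standard prisoner's dilemma, even though the only chain from (C, D) to (C, C) runs through a "punishment" move:

```python
from itertools import product

def indirectly_dominates(x, y, players, strategies, payoff):
    """x indirectly dominates y: some chain of coalitional moves leads from
    y to x, where each moving coalition compares its payoff at the FINAL
    profile x with its payoff at the profile it deviates from."""
    if x == y:
        return False
    profiles = list(product(*[strategies[i] for i in players]))
    seen, stack = {y}, [y]
    while stack:
        z = stack.pop()
        for w in profiles:
            movers = [i for i in players if z[i] != w[i]]
            # the coalition of movers deviates only if every mover is
            # strictly better off at the final profile x than at z
            if movers and all(payoff(i, x) > payoff(i, z) for i in movers):
                if w == x:
                    return True
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return False

# Prisoner's dilemma with hypothetical payoffs: (C,C) indirectly dominates
# every other profile, so the singleton {(C,C)} is a farsighted stable set.
P = {('C','C'): (3, 3), ('C','D'): (0, 4), ('D','C'): (4, 0), ('D','D'): (1, 1)}
players, strategies = (0, 1), {0: ('C', 'D'), 1: ('C', 'D')}
payoff = lambda i, s: P[s][i]
cc = ('C', 'C')
print(all(indirectly_dominates(cc, y, players, strategies, payoff)
          for y in P if y != cc))  # True
```

For example, (C, C) reaches (C, D) only indirectly: the exploited player first deviates to (D, D), anticipating the final payoff 3 rather than the interim 1, after which both players jointly move to (C, C).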
Definition of the Subject

The von Neumann–Morgenstern stable set (hereafter, stable set) is the first solution concept in cooperative game theory, defined by J. von Neumann and O. Morgenstern. Though it was originally defined for cooperative games in characteristic function form, von Neumann and Morgenstern gave a more general definition of a stable set in abstract games. Later, J. Greenberg and M. Chwe cleared a way to apply the stable set concept to the analysis of
non-cooperative games in strategic and extensive forms. Though the general existence of stable sets in characteristic function form games was disproved by a 10-person game presented by W.F. Lucas, stable sets exist in many important games. In voting games, for example, stable sets exist, and they indicate in detail which coalitions can be formed. The core, on the other hand, can be empty in voting games, though it is one of the best-known solution concepts in cooperative game theory. The analysis of stable sets is not necessarily straightforward, since it can reveal a variety of possibilities. However, stable sets give us deep insights into players' behavior in economic, political and social situations, such as coalition formation among players.

Introduction

For studies of economic or social situations where players can behave cooperatively, the stable set was defined by von Neumann and Morgenstern [31] as a solution concept for characteristic function form cooperative games. They also defined the stable set in abstract games so that the concept can be applied to more general games, including non-cooperative situations. Greenberg [9] and Chwe [3] cleared a way to apply the stable set concept to the analysis of non-cooperative games in strategic and extensive forms. The stable set is a set of outcomes satisfying two stability conditions: internal and external stability. Internal stability means that between any two outcomes in the set, there is no group of players such that all of its members prefer one to the other and can realize the preferred outcome. External stability means that for any outcome outside the set, there is a group of players such that all of its members have a commonly preferred outcome in the set and can realize it. Though general existence was disproved by Lucas [18] and Lucas and Rabie [21], the stable set has revealed much interesting behavior of players in economic, political, and social systems.
Von Neumann and Morgenstern (also Greenberg) assumed only a single move by a group of players. Harsanyi [13] first pointed out that stable sets in characteristic function form games may fail to cover “farsighted” behavior of players. Harsanyi’s work inspired Chwe’s [3] contribution to the formal study of foresight in social environments. Chwe paid attention to a possible chain of moves such that a move of a group of players will bring about a new move of another group of players, which will further cause a third group of players to move, and so on. Then the group of players moving first should take into account a sequence of moves that may follow, and evaluate their profits obtained at the end. By incorporating such a se-
quence of moves, Chwe [3] defined a more sophisticated stable set, which we call a farsighted stable set in what follows. Recent work by Suzuki and Muto [51,52] showed that the farsighted stable set provides more reasonable outcomes than the original (myopic) stable set in important classes of strategic form games.

The rest of the chapter is organized as follows. Section "Stable Sets in Abstract Games" presents the definition of a stable set in abstract games. Section "Stable Set and Core" shows basic relations between the two solution concepts of stable set and core. Section "Stable Sets in Characteristic Function Form Games" gives the definition of a stable set in characteristic function form games. Section "Applications of Stable Sets in Abstract and Characteristic Function Form Games" first discusses general properties of stable sets in characteristic function form games, and then presents applications of stable sets in abstract and characteristic function form games to political and economic systems. Section "Stable Sets and Farsighted Stable Sets in Strategic Form Games" gives the definitions of a stable set and a farsighted stable set in strategic form games. Section "Applications of Farsighted Stable Sets in Strategic Form Games" discusses properties of farsighted stable sets and their applications to social and economic situations. Section "Future Directions" concludes with remarks, and a bibliography follows.

Stable Sets in Abstract Games

An abstract game is a pair (W, ≻) of a set of outcomes W and an irreflexive binary relation ≻ on W, where irreflexivity means that x ≻ x holds for no element x ∈ W. The relation ≻ is interpreted as follows: if x ≻ y holds, then there must exist a set of players such that they can induce x from y by themselves and all of them are better off at x. A subset K of W is called a stable set of the abstract game (W, ≻) if the following two conditions are satisfied:

1. Internal stability: For any two elements x, y ∈ K, x ≻ y never holds.
2. External stability: For any element z ∉ K, there exists x ∈ K such that x ≻ z.

We explain in more detail what the internal and external stability conditions imply. Suppose players have a common understanding that each outcome inside a stable set is "stable" and that each outcome outside the set is "unstable". Here "stability" means that no group of players has an incentive to deviate from the outcome, and "instability" means that there is at least one group of players that has an incentive to deviate from
it. Then the internal and external stability conditions guarantee that this common understanding is never disproved, and thus continues to prevail. Indeed, suppose the set is both internally and externally stable, and pick any outcome in the set. By internal stability, no group of players can become better off by deviating from it and inducing another outcome inside the set. Thus no group of players reaches an agreement to deviate, and each outcome inside the set remains stable. Deviating players might become better off by inducing an outcome outside the set; but outcomes outside the set are commonly considered to be unstable, so deviating players can never expect such an outcome to last. Next pick any outcome outside the set. By external stability, there exists at least one group of players who can become better off by deviating from it and inducing an outcome inside the set. The induced outcome is considered to be stable since it is in the set. Hence that group of players will deviate, and each outcome outside the set remains unstable.

Stable Set and Core

Another widely known solution concept is the core. For a given abstract game G = (W, ≻), the core of G is the subset C = {x ∈ W | there is no y ∈ W with y ≻ x}. From the definition, the core satisfies internal stability. Moreover, the core C of G is contained in any stable set of G, if one exists. To see this, suppose that C ⊄ K for a stable set K and that C is non-empty, i.e., C ≠ ∅. (If C = ∅, then clearly C ⊆ K.) Pick any element x ∈ C \ K. Since x ∉ K, by external stability there exists y ∈ K with y ≻ x, which contradicts x ∈ C. When the core of a game satisfies external stability, it has very strong stability and is called the stable core. The stable core is then the unique stable set of the game.
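For finite abstract games, both concepts can be computed by direct enumeration. A small sketch (our own; the three-outcome cyclic game is a toy example) also illustrates that an abstract game may have an empty core and no stable set at all:

```python
from itertools import chain, combinations

def core(W, dom):
    """Core of an abstract game (W, dom): the undominated outcomes."""
    return {x for x in W if not any(dom(y, x) for y in W)}

def is_stable_set(K, W, dom):
    """Von Neumann-Morgenstern stability: internal and external."""
    internal = not any(dom(x, y) for x in K for y in K)
    external = all(any(dom(x, z) for x in K) for z in W if z not in K)
    return internal and external

# A cyclic game: a beats b, b beats c, c beats a (a Condorcet-style cycle).
W = ['a', 'b', 'c']
beats = {('a', 'b'), ('b', 'c'), ('c', 'a')}
dom = lambda x, y: (x, y) in beats

print(core(W, dom))  # empty: every outcome is dominated by another
subsets = chain.from_iterable(combinations(W, r) for r in range(len(W) + 1))
print(any(is_stable_set(set(K), W, dom) for K in subsets))  # False: no stable set
```

In the cycle, every nonempty candidate set either contains a dominating pair (failing internal stability) or misses an outcome it cannot dominate (failing external stability).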
Stable Sets in Characteristic Function Form Games

An n-person game in characteristic function form with transferable utility is a pair (N, v) of a player set N = {1, 2, ..., n} and a characteristic function v on the set 2^N of all subsets of N such that v(∅) = 0. Each subset of N is called a coalition. The game (N, v) is often called a TU-game. A characteristic function form game without transferable utility is called an NTU-game: its characteristic function gives each coalition a set of payoff vectors to players. For NTU-games and their stable sets, refer to Aumann and Peleg [1] and Peleg [35]. Hereafter in this section, we deal only with TU characteristic function form games, and refer to them simply as characteristic function form games.
Let (N, v) be a characteristic function form game. The characteristic function v assigns a real number v(S) to each coalition S ⊆ N. The value v(S) indicates the worth that coalition S can achieve by itself. An n-dimensional vector x = (x1, x2, ..., xn) is called a payoff vector. A payoff vector x is called an imputation if the following two conditions are satisfied:

1. Group rationality: x1 + x2 + ... + xn = v(N),
2. Individual rationality: xi ≥ v({i}) for each i ∈ N.

The first condition says that all players cooperate and share the worth v(N) that they can produce together. The second condition says that each player must receive at least the amount that he/she can gain alone. Let A be the set of all imputations.

Let x, y be any imputations and S be any coalition. We say that x dominates y via S, written x dom_S y, if the following two conditions are satisfied:

1. Coalitional rationality: xi > yi for each i ∈ S,
2. Effectivity: Σ_{i∈S} xi ≤ v(S).

The first condition says that every member of coalition S strictly prefers x to y. The second condition says that coalition S can guarantee the payoff xi to each member i ∈ S by themselves. We say that x dominates y (written x dom y) if there exists at least one coalition S such that x dom_S y. It should be noted that the pair (A, dom) is an abstract game as defined in Sect. "Stable Sets in Abstract Games": "dom" is easily seen to be an irreflexive binary relation on A. A stable set and the core of the game (N, v) are defined to be a stable set and the core of the associated abstract game (A, dom), respectively.

Since von Neumann and Morgenstern defined the stable set, its general existence had been one of the most important open problems in game theory. The problem was eventually solved in the negative: Lucas [18] found the following 10-person characteristic function form game in which no stable set exists.
A game with no stable set: Consider the following 10-person game:

N = {1, 2, ..., 10},
v(N) = 5,
v({1,3,5,7,9}) = 4,
v({3,5,7,9}) = v({1,5,7,9}) = v({1,3,7,9}) = 3,
v({1,2}) = v({3,4}) = v({5,6}) = v({7,8}) = v({9,10}) = 1,
v({3,5,7}) = v({1,5,7}) = v({1,3,7}) = v({3,5,9}) = v({1,5,9}) = v({1,3,9}) = 2,
v({1,4,7,9}) = v({3,6,7,9}) = v({5,2,7,9}) = 2, and
v(S) = 0 for all other S ⊆ N.

Though this game has no stable set, it has a nonempty core. A game with no stable set and an empty core was later found by Lucas and Rabie [21].

We remark on a class of games in which a stable core exists. As mentioned before, if a stable set exists, it always contains the core; this of course also holds in characteristic function form games. Furthermore, among characteristic function form games there is an interesting class, called convex games, in which the core satisfies external stability, i.e., the core is a stable core. A characteristic function form game (N, v) is a convex game if for any S, T ⊆ N with S ⊆ T and any i ∉ T, v(S ∪ {i}) − v(S) ≤ v(T ∪ {i}) − v(T); that is, the bigger the coalition a player joins, the larger the player's contribution becomes. In convex games, the core is large and satisfies external stability. For details, refer to Shapley [45].

Though general existence fails, the stable set provides very useful insights into many economic, political, and social issues. In the following, we present some stable set analyses applied to such issues.

Applications of Stable Sets in Abstract and Characteristic Function Form Games

Symmetric Voting Games

This section deals with applications of stable sets to voting situations. Let us start with a simple example.

Example 1 Suppose there is a committee consisting of three players 1, 2 and 3. Each player has one vote, and decisions are made by simple majority rule; that is, at least two votes are needed to pass a bill. Before analyzing players' behavior, we first formulate the situation as a characteristic function form game. Let the player set be N = {1, 2, 3}. Since a coalition of a simple majority of players can pass any bill, we give value 1 to such coalitions. Other coalitions can pass no bill, so we give them value 0.
Hence the characteristic function is given by

v(S) = 1 if |S| ≥ 2, and v(S) = 0 if |S| ≤ 1,

where |S| denotes the number of players in coalition S. The set of imputations is given by

A = {x = (x1, x2, x3) | x1 + x2 + x3 = 1; x1, x2, x3 ≥ 0}.

One stable set of this game is the set K consisting of the three imputations (1/2, 1/2, 0), (1/2, 0, 1/2), (0, 1/2, 1/2). A brief proof is the following. Since each of the three imputations contains only the numbers 1/2 and 0, internal stability is trivial. To show external stability, take any imputation x = (x1, x2, x3) outside K. Suppose first x1 < 1/2. Since x ∉ K, at least one of x2 and x3 is less than 1/2; assume x2 < 1/2. Then (1/2, 1/2, 0) dominates x via coalition {1, 2}. Next suppose x1 = 1/2. Since x ∉ K, 0 < x2, x3 < 1/2. Thus (0, 1/2, 1/2) dominates x via coalition {2, 3}. Finally suppose x1 > 1/2. Then x2, x3 < 1/2, and again (0, 1/2, 1/2) dominates x via coalition {2, 3}. This completes the proof of external stability.

This three-point stable set indicates that a two-person coalition is formed, and that the players in the coalition share equally the outcome obtained by passing a bill. The game has another three types of stable sets. First, any set K1_c = {x ∈ A | x1 = c} with 0 ≤ c < 1/2 is a stable set. The internal stability of each K1_c is trivial. To show external stability, take any imputation x = (x1, x2, x3) ∉ K1_c. Suppose x1 > c. Define y by y1 = c, y2 = x2 + (x1 − c)/2, y3 = x3 + (x1 − c)/2. Then y ∈ K1_c and y dom_{2,3} x. Next suppose x1 < c. Notice that at least one of x2 and x3 is less than 1 − c, since c < 1/2. Suppose without loss of generality x2 < 1 − c. Since c < 1/2, we have (c, 1 − c, 0) ∈ K1_c and (c, 1 − c, 0) dom_{1,2} x. Thus external stability holds. This stable set indicates that player 1 gets a fixed amount c, while players 2 and 3 negotiate over how to allocate the remaining 1 − c. Similarly, any set K2_c = {x ∈ A | x2 = c} or K3_c = {x ∈ A | x3 = c} with 0 ≤ c < 1/2 is a stable set. The three-person game of Example 1 has no other stable sets; see von Neumann and Morgenstern [31]. The former stable set is called a symmetric (or objective) stable set, while the latter types are called discriminatory stable sets.

As a generalization of the above result, symmetric stable sets are found in general n-person simple majority voting games. An n-person characteristic function form game (N, v) with N = {1, 2, ..., n} is called a simple majority voting game if

v(S) = 1 if |S| > n/2, and v(S) = 0 if |S| ≤ n/2.

A coalition S with v(S) = 1, i.e., with |S| > n/2, is called a winning coalition. A winning coalition including no smaller winning coalition is called a minimal winning coalition. In a simple majority voting game, a minimal winning coalition is a coalition of (n + 1)/2 players if n is odd, or (n + 2)/2 players if n is even. The following theorem holds; see Bott [2] for the proof.

Theorem 1 Let (N, v) be a simple majority voting game. Then the following hold.
Cooperative Games (Von Neumann–Morgenstern Stable Sets)
(1) If n is odd, then the set

  K = ⟨(2/(n + 1), ..., 2/(n + 1), 0, ..., 0)⟩,

in which the component 2/(n + 1) appears (n + 1)/2 times and 0 appears (n − 1)/2 times, is a stable set, where the symbol ⟨x⟩ denotes the set of all imputations obtained from x through permutations of its components.

(2) If n is even, the set

  K = ⟨{x ∈ A | x1 = ... = x_{n/2} ≥ x_{n/2+1} = ... = x_n}⟩,

in which each of the two blocks of components has n/2 members, is a stable set, where

  A = {x = (x1, ..., x_n) | Σ_{i=1}^{n} x_i = 1; x1, ..., x_n ≥ 0}

and ⟨Y⟩ = ∪_{x∈Y} ⟨x⟩.
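For n = 3, the set in (1) of Theorem 1 is exactly the three-point stable set of Example 1. Its internal and external stability can be checked mechanically; the sketch below (Python is our choice, and all names are ours) verifies external stability on a finite grid of imputations with denominator 10 rather than on the whole of A:

```python
from fractions import Fraction as F
from itertools import combinations

N = [0, 1, 2]                      # three players
def v(S):                          # simple majority: winning iff |S| >= 2
    return 1 if len(S) >= 2 else 0

def dominates(x, y):
    """x dominates y via some coalition S: every i in S strictly prefers x,
    and S can enforce its share, i.e. the sum of x_i over S is at most v(S)."""
    for r in (1, 2, 3):
        for S in combinations(N, r):
            if all(x[i] > y[i] for i in S) and sum(x[i] for i in S) <= v(S):
                return True
    return False

half = F(1, 2)
K = [(half, half, F(0)), (half, F(0), half), (F(0), half, half)]

# Internal stability: no imputation in K dominates another one in K.
assert not any(dominates(x, y) for x in K for y in K if x != y)

# External stability, tested on a grid: every imputation outside K
# is dominated by some element of K.
grid = [(F(a, 10), F(b, 10), F(10 - a - b, 10))
        for a in range(11) for b in range(11 - a)]
outside = [z for z in grid if z not in K]
assert all(any(dominates(x, z) for x in K) for z in outside)
print("K is internally stable and externally stable on the grid")
```

Exact rational arithmetic (`fractions.Fraction`) avoids false positives from floating-point comparisons in the strict-preference test.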
It should be noted from (1) of Theorem 1 that when the number of players is odd, a minimal winning coalition is formed, and the members of the coalition share equally the total profit. On the other hand, when the number of players is even, (2) of Theorem 1 shows that every player may gain a positive profit. This implies that the grand coalition of all players is formed. In negotiating over how to share the profit, two coalitions, each with n/2 players, are formed, and profits are shared equally within each coalition. Since at least n/2 + 1 players are necessary to win when n is even, an n/2-player coalition is the smallest coalition that can prevent its complement from winning. Such a coalition is called a minimal blocking coalition. When n is odd, an (n + 1)/2-player minimal winning coalition is also a minimal blocking coalition.

General Voting Games

In this section, we present properties of stable sets and cores in general (not necessarily symmetric) voting games. A characteristic function form game (N, v) is called a simple game if v(S) = 1 or 0 for each nonempty coalition S ⊆ N. A coalition S with v(S) = 1 (resp. v(S) = 0) is a winning coalition (resp. losing coalition). A simple game is called a voting game if it satisfies (1) v(N) = 1, (2) if S ⊆ T, then v(S) ≤ v(T), and (3) if S is winning, then N \ S is losing. The first condition implies that the grand coalition N is always winning. The second condition says that a superset of a winning coalition is also winning. The third condition says that there are no two disjoint winning coalitions. It is easily shown that the simple majority voting game studied in the previous section satisfies these conditions. A player has a veto if he/she belongs to every winning coalition. As for cores of voting games, the following theorem holds.

Theorem 2 Let (N, v) be a voting game. Then the core of (N, v) is nonempty if and only if there exists a player with a veto.

Thus the core is not a useful tool for analyzing voting situations with no veto player. In simple majority voting games, no player has a veto, and thus the core is empty. The following theorem shows that stable sets always exist.

Theorem 3 Let (N, v) be a voting game. Let S be a minimal winning coalition and define a set K by

  K = {x ∈ A | Σ_{i∈S} x_i = 1; x_i = 0 for all i ∉ S}.
Then K is a stable set.

Thus in voting games, a minimal winning coalition is always formed, and its members gain all the profit. For the proofs of these theorems, see Owen [34]. Further results on stable sets in voting games are found in Bott [2], Griesmer [12], Heijmanns [16], Lucas et al. [20], Muto [26,28], Owen [32], Rosenmüller [36], Shapley [43,44].

Production Market Games

Let us start with a simple example.

Example 2 There are four players, each holding one unit of a raw material. Two units of the raw material are necessary for producing one unit of an indivisible commodity. One unit of the commodity is sold at p dollars. The situation is formulated as the following characteristic function form game. The player set is N = {1, 2, 3, 4}. Since two units of the raw material are necessary to produce one unit of the commodity, the characteristic function v is given by

  v(S) = 2p if |S| = 4;  v(S) = p if |S| = 3, 2;  v(S) = 0 if |S| = 1, 0.
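This characteristic function has a compact closed form: every two units of raw material yield one unit of output, so a coalition of s players is worth p·⌊s/2⌋. The sketch below (the function name and the price p = 10 are illustrative choices of ours) checks that the closed form reproduces the values listed above:

```python
def v(size, p, k=2):
    """Worth of a coalition of `size` players when k units of raw
    material are needed per unit of output sold at p dollars."""
    return p * (size // k)

p = 10
# |S| = 0, 1 -> 0;  |S| = 2, 3 -> p;  |S| = 4 -> 2p, as in Example 2.
assert [v(s, p) for s in range(5)] == [0, 0, p, p, 2 * p]
```

The same one-liner, with general k, covers Hart's extension of the model discussed below.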
The set of imputations is

  A = {x = (x1, x2, x3, x4) | x1 + x2 + x3 + x4 = 2p; x1, x2, x3, x4 ≥ 0}.

The following set K is one of the stable sets of the game:

  K = ⟨{x = (x1, x2, x3, x4) ∈ A | x1 = x2 = x3 ≥ x4}⟩.
To show internal stability, take two imputations x = (x1, x2, x3, x4) with x1 = x2 = x3 ≥ x4 and y = (y1, y2, y3, y4) in K. Suppose x dominates y. Since x1 = x2 = x3 ≥ p/2 ≥ x4, the domination must hold via a coalition {i, 4} with i = 1, 2, 3. Then we have the contradiction 2p = Σ_{i=1}^{4} x_i > Σ_{i=1}^{4} y_i = 2p, since y ∈ K implies that the largest three elements of y are equal. To show external stability, take z = (z1, z2, z3, z4) ∉ K. Suppose z1 ≥ z2 ≥ z3 ≥ z4. Then z1 > z3. Define y = (y1, y2, y3, y4) by

  y_i = z3 + (z1 + z2 − 2z3)/4 for i = 1, 2, 3;  y4 = z4 + (z1 + z2 − 2z3)/4.

Then y ∈ K and y dom_{3,4} z, since y3 > z3, y4 > z4 and y3 + y4 ≤ p = v({3, 4}). This stable set shows that in negotiating over how to share the profit of 2p dollars, three players form a coalition and share equally the gain obtained through collaboration. At least two players are necessary to produce the commodity. Thus a three-player coalition is the smallest coalition that can prevent its complement from producing the commodity, i.e., a minimal blocking coalition. We would claim that in the market a minimal blocking coalition is formed and that profits are shared equally within the coalition.

An extension of the model was given by Hart [14] and Muto [27]. Hart considered the following production market with n players, each holding one unit of a raw material. To produce one unit of an indivisible commodity, k units of raw material are necessary. The associated production market game is defined by the player set N = {1, 2, ..., n} and the characteristic function v(S) = ⌊|S|/k⌋·p, the proceeds of the ⌊|S|/k⌋ units that coalition S can produce.

... x1 > y1 and x2 > y2 hold. Thus we have the contradiction p = Σ_{i=1}^{3} x_i > Σ_{i=1}^{3} y_i = p. The domination via {1, 3} leads to the same contradiction. To show external stability, take any imputation z = (z1, z2, z3) ∉ K. Then z2 ≠ z3. Without loss of generality, let z2 < z3. Define y = (y1, y2, y3) by

  y1 = z1 + (z3 − z2)/3;  y2 = z2 + (z3 − z2)/3;  y3 = z2 + (z3 − z2)/3.

Then y ∈ K and y dom_{1,2} z, since y1 > z1, y2 > z2 and y1 + y2 ≤ v({1, 2}). This stable set shows that in negotiating over how to share the profit of p dollars, players 2 and 3 form a coalition against player 1 and share equally the gain obtained through collaboration. There exist other stable sets in which players 2 and 3 collaborate but do not share the profit equally.
More precisely, the following set

  K = {x = (x1, x2, x3) ∈ A | x2 and x3 move in the same direction}

is a stable set, where "move in the same direction" means that if x2 increases then x3 increases, and if x2 decreases then x3 decreases. A generalization of the results above is given by the following theorem due to Shapley [42]. Shapley's original theorem is more complicated and holds in more general markets.

Theorem 6 Suppose there are m players, 1, ..., m, each holding one unit of raw material P, and n players, m + 1, ..., m + n, each holding one unit of raw material Q. To produce one unit of an indivisible commodity, one unit of each of the raw materials P and Q is necessary. One unit of the commodity is sold at p dollars. In this market, the following set K is a stable set:

  K = {x = (x1, x2, ..., x_{m+n}) ∈ A | x1 = ... = x_m; x_{m+1} = ... = x_{m+n}},
where

  A = {x = (x1, ..., x_m, x_{m+1}, ..., x_{m+n}) | Σ_{i=1}^{m+n} x_i = p·min(m, n); x1, ..., x_{m+n} ≥ 0}
is the set of imputations of this game. This theorem shows that players holding the same raw material form a coalition and share equally the profit gained through collaboration. For further results on stable sets in production market games, refer to Hart [14], Muto [27], Owen [34]. Refer also to Lucas [19], Owen [33], Shapley [41] for further general studies on stable sets.

Assignment Games

The following two sections deal with markets in which indivisible commodities are traded between sellers and buyers, or bartered among agents. The first market is the assignment market originally introduced by Shapley and Shubik [47]. An assignment market consists of a set of n (≥ 1) buyers B = {1, ..., n} and a set of n sellers F = {1', ..., n'}. Each seller k' ∈ F is endowed with one indivisible commodity to sell, which is called object k'. Thus F also denotes the set of n objects in the market. The objects are differentiated. Each buyer i ∈ B wants to buy at most one of the objects, and places a nonnegative monetary valuation u_{ik'} (≥ 0) on each object k' ∈ F. The matrix U = (u_{ik'})_{(i,k')∈B×F} is called the valuation matrix. The sellers place no valuation on any object. An assignment market is denoted by M(B, F, U). We remark that an assignment market with |B| ≠ |F| can be transformed into a market with |B| = |F| by adding dummy buyers (resp. sellers) and, correspondingly, zero rows (resp. columns) to the valuation matrix U. For each coalition S ⊆ B ∪ F with S ∩ B ≠ ∅ and S ∩ F ≠ ∅, we define the assignment problem P(S) as follows:

  P(S):  m(S) = max_x Σ_{(i,k')∈(S∩B)×(S∩F)} u_{ik'} x_{ik'}
  s.t.  Σ_{k'∈S∩F} x_{ik'} ≤ 1 for all i ∈ S ∩ B;
        Σ_{i∈S∩B} x_{ik'} ≤ 1 for all k' ∈ S ∩ F;
        x_{ik'} ≥ 0 for all (i, k') ∈ (S ∩ B) × (S ∩ F).

The assignment problem P(S) has at least one optimal integer solution (see Simonnard [49]), which gives an optimal matching between sellers and buyers in S that yields the highest possible surplus in S. Without loss of generality, we assume that the rows and columns of the valuation matrix U are arranged so that the diagonal assignment x with x_{ii'} = 1, i = 1, ..., n, is one of the optimal solutions to P(B ∪ F). For a given assignment market M(B, F, U), we define the associated assignment game G to be the characteristic function form game (B ∪ F, v). The player set of G is B ∪ F. The characteristic function v is defined as follows: v(S) = m(S) for each S ⊆ B ∪ F with S ∩ B ≠ ∅ and S ∩ F ≠ ∅. Coalitions consisting only of sellers or only of buyers cannot produce any surplus from trade; thus v(S) = 0 for each S with S ⊆ B, S ⊆ F, or S = ∅. The imputation set of G is

  A = {(w, p) ∈ R^B × R^F | Σ_{i∈B} w_i + Σ_{k'∈F} p_{k'} = m(B ∪ F); w, p ≥ 0}.

... and u_i(x^p) > u_i(x^{j−1}) for each i ∈ S_j. We sometimes say that x indirectly dominates y starting with S_1 (denoted x ≫_{S_1} y) to specify the set of players which deviates first from y. We remark here that in the definition of indirect domination we implicitly assume that joint moves by groups of players are neither once-and-for-all nor binding, i.e., some players in a deviating group may later make another move with players within or even outside the group. It should be noted that the indirect domination defined above is borrowed from Chwe [3]. Though Harsanyi [13] first proposed the notion of indirect domination, his definition was given in characteristic function form games. When p = 1 in the definition of indirect domination, we simply say that x directly dominates y, which is denoted by x ≻_d y. When we want to specify a deviating coalition, we say that x directly dominates y via coalition S, denoted by x ≻_{dS} y. This direct domination in strategic form games was defined by Greenberg [9]. Let the pairs (X, ≫) and (X, ≻_d) be the abstract games associated with game G. A farsighted stable set of G is defined to be a stable set of the abstract game (X, ≫) with indirect domination.
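Returning to the assignment problem P(S) defined earlier: since an integral optimum always exists, the value m(S) for a small square market can be computed by brute force over all matchings. In the sketch below the 3×3 valuation matrix is purely illustrative, and the function name is ours:

```python
from itertools import permutations

def m_value(U):
    """Optimal matching value m(B ∪ F) of an assignment market with
    square valuation matrix U, by brute force over all matchings."""
    n = len(U)
    return max(sum(U[i][s[i]] for i in range(n)) for s in permutations(range(n)))

# Hypothetical valuations: U[i][k] = buyer i's valuation of object k'.
U = [[5, 8, 2],
     [7, 9, 6],
     [2, 3, 0]]
print(m_value(U))   # best matching: buyer 0 -> object 1', 1 -> 2', 2 -> 0'
```

Brute force is O(n!) and only a check on the LP formulation; for larger markets the same value is obtained with a polynomial-time assignment algorithm.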
A stable set of the abstract game (X, ≻_d) with direct domination is simply called a stable set of G.

Applications of Farsighted Stable Sets in Strategic Form Games

The existence of farsighted stable sets in strategic form games remains an open problem. Nevertheless, it has turned out through applications that farsighted stable sets give much sharper insights into players' behavior in economic, political, and social situations than (myopic) stable sets with direct domination. In what follows, we present analyses of farsighted stable sets in the prisoner's dilemma and in two types of duopoly market games in strategic form.

Prisoner's Dilemma

To make the discussion as clear as possible, we focus on the particular version of the prisoner's dilemma shown below. Similar results hold in general prisoner's dilemma games.

Prisoner's Dilemma:
                         Player 2
                     Cooperate   Defect
Player 1  Cooperate    4, 4       0, 5
          Defect       5, 0       1, 1
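From this payoff matrix, the direct dominations pictured in Fig. 1, and the nonexistence of a stable set under direct domination, can be verified by exhaustive search. The sketch below is ours (the encoding of states and coalitions is an assumption, not the article's notation):

```python
from itertools import combinations, product

payoff = {('C', 'C'): (4, 4), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}
states = list(payoff)

def direct_dominations():
    """All pairs (x, y) with x d_S y: x is reached from y by some coalition S
    changing only its own strategies, and every member of S strictly gains."""
    doms = set()
    for y, x in product(states, repeat=2):
        if x == y:
            continue
        S = [i for i in (0, 1) if x[i] != y[i]]   # the deviating coalition
        if all(payoff[x][i] > payoff[y][i] for i in S):
            doms.add((x, y))
    return doms

doms = direct_dominations()

def is_stable(K):
    internal = not any((x, y) in doms for x in K for y in K)
    external = all(any((x, z) in doms for x in K) for z in states if z not in K)
    return internal and external

stable_sets = [set(K) for r in range(1, 5) for K in combinations(states, r)
               if is_stable(set(K))]
print(stable_sets)   # empty list: no stable set under direct domination
```

The search confirms the dominations of Fig. 1, e.g. (Defect, Defect) directly dominates (Cooperate, Defect) via player 1, and (Cooperate, Cooperate) directly dominates (Defect, Defect) via the grand coalition, and that none of the 15 nonempty subsets of states is both internally and externally stable.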
We first present a farsighted stable set derived when the two players use only pure strategies. In this case, the set of strategy combinations is X = {(Cooperate, Cooperate), (Cooperate, Defect), (Defect, Cooperate), (Defect, Defect)}, where in each combination the former (resp. the latter) entry is player 1's (resp. 2's) strategy. The direct domination relation of this game is summarized in Fig. 1. From Fig. 1, no stable set (with direct domination) exists in the prisoner's dilemma. However, since (Cooperate, Cooperate) ≫ (Cooperate, Defect), (Defect, Cooperate), and there is no other indirect domination beyond those induced by the direct dominations of Fig. 1, the singleton {(Cooperate, Cooperate)} is the unique farsighted stable set with respect to ≫. Hence if the two players are farsighted and make a joint but not binding move, the farsighted stable set succeeds in showing that cooperation of the players results in the unique stable outcome. We now study stable outcomes in the mixed extension of the prisoner's dilemma, i.e., the prisoner's dilemma with mixed strategies. Let X1 = X2 = [0, 1] be the sets of mixed strategies of players 1 and 2, respectively, and let t1 ∈ X1 (resp. t2 ∈ X2) denote the probability that
Cooperative Games (Von Neumann–Morgenstern Stable Sets), Figure 1
[Figure: the direct domination relation among the pure strategy combinations. An arrow from a dominated combination to a dominating one, labeled by the deviating coalition, denotes a direct domination. For example, "(Cooperate, Cooperate) →_2 (Cooperate, Defect)" means (Cooperate, Defect) ≻_{d{2}} (Cooperate, Cooperate), and the arrow from (Defect, Defect) to (Cooperate, Cooperate) labeled 12 means (Cooperate, Cooperate) ≻_{d{1,2}} (Defect, Defect).]
player 1 (resp. 2) plays "Cooperate". It is easily seen that the minimax payoffs to players 1 and 2 are both 1 in this game. We call a strategy combination that gives both players at least (resp. more than) their minimax payoffs an individually rational (resp. a strictly individually rational) strategy combination. We then have the following theorem.

Theorem 10 Let

  T = {(t1, t2) | 1/4 < t1 ≤ 1, t2 = 1} ∪ {(t1, t2) | t1 = 1, 1/4 < t2 ≤ 1},

and define the singleton K1(t1, t2) = {(t1, t2)} for each (t1, t2) ∈ T. Let K2 = {(0, 0), (1, 1/4)} and K3 = {(0, 0), (1/4, 1)}. Then the sets K2, K3, and any K1(t1, t2) with (t1, t2) ∈ T are farsighted stable sets of the mixed extension of the prisoner's dilemma. There are no other types of farsighted stable sets in the mixed extension of the prisoner's dilemma.

This theorem shows that if the two players are farsighted and make a joint but not binding move in the prisoner's dilemma, then essentially a single Pareto efficient and strictly individually rational strategy combination results as a stable outcome, i.e., K1(t1, t2). We have, however, the two exceptional cases K2 and K3, in which (Defect, Defect) can be stable together with one Pareto efficient point at which one player gains the same payoff as in (Defect, Defect).

n-Person Prisoner's Dilemma

We consider an n-person prisoner's dilemma. Let N = {1, ..., n} be the player set. Each player i has two strategies: C (Cooperate) and D (Defect). Let X_i = {C, D} for each i ∈ N. Hereafter we refer to a strategy combination as a state. The set of states is X = Π_{i∈N} X_i. For each coalition S ⊆ N, let X_S = Π_{i∈S} X_i and X_{−S} = Π_{i∈S^c} X_i, where S^c denotes the complement of S with respect to N. Let x_S and x_{−S} denote generic elements of X_S and X_{−S}, respectively. Player i's payoff depends not only on his/her own strategy but also on the number of other cooperators. Player i's payoff function u_i : X → R is given by u_i(x) = f_i(x_i, h), where x ∈ X, x_i ∈ X_i (player i's choice in x), and h is the number of players other than i playing C. We call the strategic form game thus defined an n-person prisoner's dilemma game. To simplify the discussion, we assume that all players are homogeneous and each player has an identical payoff function. That is, the f_i are identical, and are simply written as f unless confusion arises. We assume the following properties of the function f.
Assumption 1
(1) f(D, h) > f(C, h) for all h = 0, 1, ..., n − 1.
(2) f(C, n − 1) > f(D, 0).
(3) f(C, h) and f(D, h) are strictly increasing in h.

Property (1) means that every player prefers playing D to playing C regardless of which strategies the other players play. Property (2) means that if all players play C, then each of them gains a payoff higher than the one in (D, ..., D). Property (3) means that if the number of cooperators increases, every player becomes better off regardless of which strategy he/she plays. It follows from Property (1) that (D, ..., D) is the unique Nash equilibrium of the game. Here, for x, y ∈ X, we say that y is Pareto superior to x if u_i(y) ≥ u_i(x) for all i ∈ N and u_i(y) > u_i(x) for some i ∈ N. A state x ∈ X is said to be Pareto efficient if there is no y ∈ X that is Pareto superior to x. By Property (2), (C, ..., C) is Pareto superior to (D, ..., D). Together with Property (3), (C, ..., C) is Pareto efficient. Given a state x, we say that x is individually rational if u_i(x) ≥ min_{y_{−i}∈X_{−i}} max_{y_i∈X_i} u_i(y) for all i ∈ N. If the strict inequality holds, we say that x is strictly individually rational. It follows from (1) and (3) of Assumption 1 that min_{y_{−i}∈X_{−i}} max_{y_i∈X_i} u_i(y) = f(D, 0). The following theorem shows that any state that is strictly individually rational and Pareto efficient is itself a singleton farsighted stable set. That is, any strictly individually rational and Pareto efficient outcome is stable if the players are farsighted. Refer to Suzuki and Muto [51,52] for the details.

Theorem 11 In the n-person prisoner's dilemma game, if x is a strictly individually rational and Pareto efficient state, then {x} is a farsighted stable set.

Duopoly Market Games

We consider two types of duopoly: Cournot quantity-setting duopoly and Bertrand price-setting duopoly. To simplify the discussion, we consider a simple duopoly model in which the firms' cost functions and the market demand function are both linear.
Similar results, however, hold in more general duopoly models. There are two firms, 1 and 2, each producing a homogeneous good with the same marginal cost c > 0. No fixed cost is assumed. (1) Cournot duopoly: The firms' strategic variables are their production levels. Let x1 and x2 be the production levels of firms 1 and 2, respectively. The market price p(x1, x2)
for x1 and x2 is given by

  p(x1, x2) = max(a − (x1 + x2), 0),

where a > c. We restrict the domain of production of both firms to 0 ≤ x_i ≤ a − c, i = 1, 2. This is reasonable since a firm would not overproduce so as to make a nonpositive profit. When x1 and x2 are produced, firm i's profit is given by

  π_i(x1, x2) = (p(x1, x2) − c)·x_i.

Thus Cournot duopoly is formulated as the following strategic form game: G_C = (N, {X_i}_{i=1,2}, {π_i}_{i=1,2}), where the player set is N = {1, 2}, each player's strategy set is the closed interval X1 = X2 = [0, a − c], and the payoff functions are π_i, i = 1, 2. Let X = X1 × X2. The joint profit of the two firms is maximized when x1 + x2 = (a − c)/2.

(2) Bertrand duopoly: The firms' strategic variables are their price levels. Let D(p) = max(a − p, 0) be the market demand at price p. Then the total profit at p is

  Π(p) = (p − c)·D(p).

We restrict the domain of the price level p of both firms to c ≤ p ≤ a. This assumption is also reasonable since a firm would avoid a negative profit. The total profit Π(p) is maximized at p = (a + c)/2, which is called the monopoly price. Let p1 and p2 be the prices of firms 1 and 2, respectively. We assume that if the firms' prices are equal, then they share the total profit equally; otherwise all sales go to the lower-pricing firm of the two. Thus firm i's profit is given by

  π_i(p1, p2) = Π(p_i) if p_i < p_j;  Π(p_i)/2 if p_i = p_j;  0 if p_i > p_j,  for i, j = 1, 2, i ≠ j.

Hence Bertrand duopoly is formulated as the strategic form game G_B = (N, {Y_i}_{i=1,2}, {π_i}_{i=1,2}), where N = {1, 2}, Y1 = Y2 = [c, a], and π_i (i = 1, 2) is firm i's payoff function. Let Y = Y1 × Y2.
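With this linear specification, both payoff functions fit in a few lines. The sketch below (the parameter values a = 10 and c = 2 are illustrative assumptions) numerically confirms the joint-profit-maximizing total output (a − c)/2 and the monopoly price (a + c)/2 stated above:

```python
a, c = 10.0, 2.0   # illustrative demand intercept and marginal cost

def price(x1, x2):               # Cournot inverse demand p(x1, x2)
    return max(a - (x1 + x2), 0.0)

def cournot_profit(i, x1, x2):   # firm i's Cournot profit, i in {1, 2}
    return (price(x1, x2) - c) * (x1 if i == 1 else x2)

def bertrand_profit(i, p1, p2):  # lower price takes the market; ties split equally
    pi, pj = (p1, p2) if i == 1 else (p2, p1)
    total = (pi - c) * max(a - pi, 0.0)
    return total if pi < pj else total / 2 if pi == pj else 0.0

# Joint Cournot profit depends only on total output; maximize over a grid.
totals = [k * (a - c) / 1000 for k in range(1001)]
best_total = max(totals, key=lambda q: cournot_profit(1, q, 0.0))
assert abs(best_total - (a - c) / 2) < 1e-9

# Total Bertrand profit (2 * one firm's profit at a common price)
# is maximized at the monopoly price (a + c)/2.
prices = [c + k * (a - c) / 1000 for k in range(1001)]
best_p = max(prices, key=lambda p: 2 * bertrand_profit(1, p, p))
assert abs(best_p - (a + c) / 2) < 1e-9
```

The grid search is only a numerical check; both maximizers also follow directly from the first-order conditions of the concave quadratics (a − q − c)q and (p − c)(a − p).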
It is well known that the Nash equilibrium is uniquely determined in either market: x1 = x2 = (a − c)/3 in the Cournot market, and p1 = p2 = c in the Bertrand market. The following theorem holds for farsighted stable sets in Cournot duopoly.

Theorem 12 Let (x1, x2) ∈ X be any strategy pair with x1 + x2 = (a − c)/2. Then the singleton {(x1, x2)} is a farsighted stable set. Furthermore, every farsighted stable set is of the form {(x1, x2)} with x1 + x2 = (a − c)/2 and x1, x2 ≥ 0.

As mentioned before, any strategy pair (x1, x2) with x1 + x2 = (a − c)/2 and x1, x2 ≥ 0 maximizes the two firms' joint profit. This suggests that von Neumann–Morgenstern stability together with the firms' farsighted behavior produces joint profit maximization even if the firms' collaboration is not binding. As for Bertrand duopoly, we have the following theorem, which claims that the pair of monopoly prices is itself a farsighted stable set, and that no other farsighted stable set exists. Therefore von Neumann–Morgenstern stability together with the firms' farsighted behavior attains efficiency (from the standpoint of the firms) also in Bertrand duopoly. Refer to Suzuki and Muto [53] for the details.

Theorem 13 Let p = (p1, p2) be the pair of monopoly prices, i.e., p1 = p2 = (a + c)/2. Then the singleton {p} is the unique farsighted stable set.

For studies of stable sets with direct domination in duopoly market games, refer to Muto and Okada [29,30]. Properties of stable sets and of Harsanyi's original farsighted stable sets in pure exchange economies are investigated by Greenberg et al. [10]. For further studies on stable sets and farsighted stable sets in strategic form games, refer to Kaneko [17], Mariotti [24], Xue [55,56], Diamantoudi and Xue [4].

Future Directions

In this paper, we have reviewed applications of von Neumann–Morgenstern stable sets in abstract games, characteristic function form games, and strategic form games to economic, political and social systems.
Stable sets give us insights into coalition formation among players in the systems in question. Farsighted stable sets, especially applied to some economic systems, show that players’ farsighted behavior leads to Pareto efficient outcomes even though their collaboration is not binding. The stable set analysis is also applicable to games with infinitely many players. Those analyses show us new approaches to large economic and social systems with infinitely many players. For the details, refer to Hart [15],
Einy et al. [7], Einy and Shitovitz [6], Greenberg et al. [11], Shitovitz and Weber [48], and Rosenmüller and Shitovitz [37]. There is also a study on the linkage between common knowledge of Bayesian rationality and the attainment of stable sets in generalized abstract games; refer to Luo [22,23] for the details. Analyses of social systems applying the concepts of farsighted stable sets, as well as stable sets, should further advance theoretical studies of games in which players inherently engage in both cooperative and non-cooperative behavior. Those studies will in turn have an impact on developments in economics, politics, sociology, and many applied social sciences.
Bibliography

Primary Literature

1. Aumann R, Peleg B (1960) Von Neumann–Morgenstern solutions to cooperative games without side payments. Bull Am Math Soc 66:173–179
2. Bott R (1953) Symmetric solutions to majority games. In: Kuhn HW, Tucker AW (eds) Contributions to the theory of games, vol II. Annals of Mathematics Studies, vol 28. Princeton University Press, Princeton, pp 319–323
3. Chwe MS-Y (1994) Farsighted coalitional stability. J Econ Theory 63:299–325
4. Diamantoudi E, Xue L (2003) Farsighted stability in hedonic games. Soc Choice Welf 21:39–61
5. Ehlers L (2007) Von Neumann–Morgenstern stable sets in matching problems. J Econ Theory 134:537–547
6. Einy E, Shitovitz B (1996) Convex games and stable sets. Games Econ Behav 16:192–201
7. Einy E, Holzman R, Monderer D, Shitovitz B (1996) Core and stable sets of large games arising in economics. J Econ Theory 68:200–211
8. Gale D, Shapley LS (1962) College admissions and the stability of marriage. Am Math Mon 69:9–15
9. Greenberg J (1990) The theory of social situations: an alternative game theoretic approach. Cambridge University Press, Cambridge
10. Greenberg J, Luo X, Oladi R, Shitovitz B (2002) (Sophisticated) stable sets in exchange economies. Games Econ Behav 39:54–70
11. Greenberg J, Monderer D, Shitovitz B (1996) Multistage situations. Econometrica 64:1415–1437
12. Griesmer JH (1959) Extreme games with three values. In: Tucker AW, Luce RD (eds) Contributions to the theory of games, vol IV. Annals of Mathematics Studies, vol 40. Princeton University Press, Princeton, pp 189–212
13. Harsanyi J (1974) An equilibrium-point interpretation of stable sets and a proposed alternative definition. Manag Sci 20:1472–1495
14. Hart S (1973) Symmetric solutions of some production economies. Int J Game Theory 2:53–62
15. Hart S (1974) Formation of cartels in large markets. J Econ Theory 7:453–466
16. Heijmans J (1991) Discriminatory von Neumann–Morgenstern solutions. Games Econ Behav 3:438–452
17. Kaneko M (1987) The conventionally stable sets in noncooperative games with limited observations I: Definition and introductory argument. Math Soc Sci 13:93–128
18. Lucas WF (1968) A game with no solution. Bull Am Math Soc 74:237–239
19. Lucas WF (1990) Developments in stable set theory. In: Ichiishi T et al (eds) Game Theory and Applications. Academic Press, New York, pp 300–316
20. Lucas WF, Michaelis K, Muto S, Rabie M (1982) A new family of finite solutions. Int J Game Theory 11:117–127
21. Lucas WF, Rabie M (1982) Games with no solutions and empty cores. Math Oper Res 7:491–500
22. Luo X (2001) General systems and φ-stable sets: a formal analysis of socioeconomic environments. J Math Econ 36:95–109
23. Luo X (2006) On the foundation of stability. Academia Sinica, Mimeo, available at http://www.sinica.edu.tw/~xluo/pa14.pdf
24. Mariotti M (1997) A model of agreements in strategic form games. J Econ Theory 74:196–217
25. Moulin H (1995) Cooperative Microeconomics: A Game-Theoretic Introduction. Princeton University Press, Princeton
26. Muto S (1979) Symmetric solutions for symmetric constant-sum extreme games with four values. Int J Game Theory 8:115–123
27. Muto S (1982) On Hart production games. Math Oper Res 7:319–333
28. Muto S (1982) Symmetric solutions for (n,k) games. Int J Game Theory 11:195–201
29. Muto S, Okada D (1996) Von Neumann–Morgenstern stable sets in a price-setting duopoly. Econ Econ 81:1–14
30. Muto S, Okada D (1998) Von Neumann–Morgenstern stable sets in Cournot competition. Econ Econ 85:37–57
31. von Neumann J, Morgenstern O (1953) Theory of Games and Economic Behavior, 3rd ed. Princeton University Press, Princeton
32. Owen G (1965) A class of discriminatory solutions to simple n-person games. Duke Math J 32:545–553
33. Owen G (1968) n-Person games with only 1, n−1, and n-person coalitions. Proc Am Math Soc 19:1258–1261
34. Owen G (1995) Game theory, 3rd ed. Academic Press, New York
35. Peleg B (1986) A proof that the core of an ordinal convex game is a von Neumann–Morgenstern solution. Math Soc Sci 11:83–87
36. Rosenmüller J (1977) Extreme games and their solutions. Lecture Notes in Economics and Mathematical Systems, vol 145. Springer, Berlin
37. Rosenmüller J, Shitovitz B (2000) A characterization of vNM-stable sets for linear production games. Int J Game Theory 29:39–61
38. Roth A, Postlewaite A (1977) Weak versus strong domination in a market with indivisible goods. J Math Econ 4:131–137
39. Roth A, Sotomayor M (1990) Two-Sided Matching: A Study in Game-Theoretic Modeling and Analysis. Cambridge University Press, Cambridge
40. Quint T, Wako J (2004) On houseswapping, the strict core, segmentation, and linear programming. Math Oper Res 29:861–877
41. Shapley LS (1953) Quota solutions of n-person games. In: Kuhn HW, Tucker AW (eds) Contributions to the theory of games, vol II. Annals of Mathematics Studies, vol 28. Princeton University Press, Princeton, pp 343–359
42. Shapley LS (1959) The solutions of a symmetric market game. In: Tucker AW, Luce RD (eds) Contributions to the theory of games, vol IV. Annals of Mathematics Studies, vol 40. Princeton University Press, Princeton, pp 145–162
43. Shapley LS (1962) Simple games: An outline of the descriptive theory. Behav Sci 7:59–66
44. Shapley LS (1964) Solutions of compound simple games. In: Tucker AW et al (eds) Advances in Game Theory. Annals of Mathematics Studies, vol 52. Princeton University Press, Princeton, pp 267–305
45. Shapley LS (1971) Cores of convex games. Int J Game Theory 1:11–26
46. Shapley LS, Scarf H (1974) On cores and indivisibilities. J Math Econ 1:23–37
47. Shapley LS, Shubik M (1972) The assignment game I: The core. Int J Game Theory 1:111–130
48. Shitovitz B, Weber S (1997) The graph of the Lindahl correspondence as the unique von Neumann–Morgenstern abstract stable set. J Math Econ 27:375–387
49. Simonnard M (1966) Linear programming. Prentice-Hall, New Jersey
50. Solymosi T, Raghavan TES (2001) Assignment games with stable core. Int J Game Theory 30:177–185
51. Suzuki A, Muto S (2000) Farsighted stability in prisoner's dilemma. J Oper Res Soc Japan 43:249–265
52. Suzuki A, Muto S (2005) Farsighted stability in n-person prisoner's dilemma. Int J Game Theory 33:431–445
53. Suzuki A, Muto S (2006) Farsighted behavior leads to efficiency in duopoly markets. In: Haurie A et al (eds) Advances in Dynamic Games. Birkhäuser, Boston, pp 379–395
54. Wako J (1991) Some properties of weak domination in an exchange market with indivisible goods. Jpn Econ Rev 42:303–314
55. Xue L (1997) Nonemptiness of the largest consistent set. J Econ Theory 73:453–459
56. Xue L (1998) Coalitional stability under perfect foresight. Econ Theory 11:603–627
Books and Reviews

Lucas WF (1992) Von Neumann–Morgenstern stable sets. In: Aumann RJ, Hart S (eds) Handbook of Game Theory with Economic Applications, vol 1. North-Holland, Amsterdam, pp 543–590
Shubik M (1982) Game theory in the social sciences: Concepts and solutions. MIT Press, Boston
Cooperative Multi-hierarchical Query Answering Systems

Zbigniew W. Ras 1,2, Agnieszka Dardzinska 3
1 Department of Computer Science, University of North Carolina, Charlotte, USA
2 Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
3 Department of Computer Science, Białystok Technical University, Białystok, Poland

Article Outline

Glossary
Definition of the Subject
Introduction
Multi-hierarchical Decision System
Cooperative Query Answering
Future Directions
Bibliography

Glossary

Autonomous information system An autonomous information system is an information system existing as an independent entity.
Intelligent query answering Intelligent query answering is an enhancement of query answering into a sort of intelligent system (capable of being adapted or molded). Such systems should be able to interpret incorrectly posed questions and compose an answer not necessarily reflecting precisely what is directly referred to by the question, but rather reflecting what the intermediary understands to be the intention linked with the question.
Knowledge base A knowledge base is a collection of rules defined as expressions written in predicate calculus. These rules have the form of associations between conjuncts of values of attributes.
Ontology An ontology is an explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest, together with the relationships holding among them. Systems that share the same ontology are able to communicate about the domain of discourse without necessarily operating on a globally shared theory. A system commits to an ontology if its observable actions are consistent with the definitions in the ontology.
Semantics The meaning of expressions written in some language, as opposed to their syntax, which describes how symbols may be combined independently of their meaning.
Definition of the Subject
One way to make a Query Answering System (QAS) intelligent is to assume a hierarchical structure of its attributes. Such systems have been investigated by Cuppens and Demolombe [3], Gal and Minker [4], and Gaasterland et al. [6], and they are called cooperative. Queries submitted to them are built, in a classical way, from values of attributes describing objects in an information system S and from the two-argument functors "and" and "or". Instead of "or" we use the symbol "+", and instead of "and" we use the symbol "*". Assume that QAS is associated with an information system S. If a query q submitted to QAS fails, then any attribute value listed in q can be generalized, and the number of objects supporting q in S may increase. In cooperative systems, these generalizations are controlled either by users [4] or by methods based on knowledge discovery [12]. Conceptually, a similar approach has been proposed by Lin [11]. He defines a neighborhood of an attribute value, which we can interpret as its generalization (its parent in the corresponding hierarchical attribute structure). When a query fails, the query answering system tries to replace values in the query by new values from their corresponding neighborhoods. QAS for S can also collaborate and exchange knowledge with other information systems. In all such cases it is called intelligent. In [14,15] the query answering strategy was based on a guided process of knowledge (rule) extraction and knowledge exchange among systems. Knowledge extracted from information systems collaborating with S was used to construct new attributes in S and/or to impute null or hidden attribute values in S. In this way we not only enlarge the set of queries which QAS can answer successfully but also increase the overall number of retrieved objects and their confidence. Some attributes in S can be distinguished. We usually call them decision attributes.
Their values represent concepts which can be defined in terms of the remaining attributes in S, called classification attributes. Query languages for such information systems are built only from values of decision attributes and from the two-argument functors "+" and "*" [16]. The semantics of queries is defined in terms of the semantics of the values of classification attributes. The precision and recall of QAS are strictly dependent on the support and confidence of the classifiers used to define queries.

Introduction
Responses by QAS to submitted queries do not always contain the information desired, and although they may be logically correct, they can sometimes be misleading. Research in the area of intelligent query answering rectifies these problems. The classical approach is based on a cooperative method called relaxation, which expands an information system and the queries related to it [3,4]. The relaxation method expands the scope of a query by relaxing the constraints implicit in the query. This allows QAS to return answers related to the original query, as well as the literal answers, which may be of interest to the user. This paper concentrates on multi-hierarchical decision systems, defined as information systems with several distinguished hierarchical attributes called decision attributes, whose values are used to build queries. We give a theoretical framework for modeling such systems and their corresponding query languages. The standard interpretation and the classifier-based interpretation of queries are introduced and used to model the quality (precision, recall) of QAS.
Multi-hierarchical Decision System
In this section we introduce the notion of a multi-hierarchical decision system S and the query language built from atomic expressions containing only values of the decision attributes in S. The classifier-based semantics and the standard semantics of queries in S are proposed. The set of objects X in S is the interpretation domain for both semantics. The standard semantics identifies all objects in X which should be retrieved by a query; the classifier-based semantics gives weighted sets of objects which are retrieved by queries. The notions of precision and recall of QAS in the proposed setting are introduced. We use only rule-based classifiers to define the classifier-based semantics: by improving the confidence and support of the classifiers, we improve the precision and recall of QAS.

Definition 1  By a multi-hierarchical decision system we mean a triple $S = (X, A \cup D, V)$, where $X$ is a nonempty finite set of objects, $D = \{d[i] : 1 \le i \le k\}$ is a set of hierarchical decision attributes, $A$ is a nonempty finite set of classification attributes, and $V = \bigcup \{V_a : a \in A \cup D\}$ is the set of their values. We assume that:
- $V_a$, $V_b$ are disjoint for any $a, b \in A \cup D$ such that $a \ne b$,
- $a : X \to V_a$ is a partial function for every $a \in A \cup D$.

Definition 2  By a set of decision queries (d-queries) for $S$ we mean a least set $T_D$ such that:
- $0, 1 \in T_D$,
- if $w \in \bigcup \{V_a : a \in D\}$, then $w, \sim w \in T_D$,
- if $t_1, t_2 \in T_D$, then $(t_1 + t_2), (t_1 * t_2) \in T_D$.

Definition 3  A decision query $t$ is called simple if $t = t_1 * t_2 * \dots * t_n$ and
$$(\forall j \in \{1, 2, \dots, n\}) \; \Big[\, t_j \in \bigcup \{V_a : a \in D\} \;\lor\; \big(t_j = \sim w \wedge w \in \bigcup \{V_a : a \in D\}\big) \,\Big].$$

Definition 4  By a set of classification terms (c-terms) for $S$ we mean a least set $T_C$ such that:
- $0, 1 \in T_C$,
- if $w \in \bigcup \{V_a : a \in A\}$, then $w, \sim w \in T_C$,
- if $t_1, t_2 \in T_C$, then $(t_1 + t_2), (t_1 * t_2) \in T_C$.

Definition 5  A classification term $t$ is called simple if $t = t_1 * t_2 * \dots * t_n$ and
$$(\forall j \in \{1, 2, \dots, n\}) \; \Big[\, t_j \in \bigcup \{V_a : a \in A\} \;\lor\; \big(t_j = \sim w \wedge w \in \bigcup \{V_a : a \in A\}\big) \,\Big].$$
Definition 6  By a classification rule we mean any expression of the form $[t_1 \to t_2]$, where $t_1$ is a simple classification term and $t_2$ is a simple decision query.

Definition 7  The semantics $M_S$ of c-terms in $S = (X, A \cup D, V)$ is defined in a standard way as follows:
- $M_S(0) = \emptyset$, $M_S(1) = X$,
- $M_S(w) = \{x \in X : a(x) = w\}$ for any $w \in V_a$, $a \in A$,
- $M_S(\sim w) = \{x \in X : (\exists v \in V_a)[a(x) = v \wedge v \ne w]\}$ for any $w \in V_a$, $a \in A$,
- if $t_1, t_2$ are terms, then $M_S(t_1 + t_2) = M_S(t_1) \cup M_S(t_2)$ and $M_S(t_1 * t_2) = M_S(t_1) \cap M_S(t_2)$.

Now we introduce the notation used for values of decision attributes. Assume that the term $d[i]$ also denotes the first granularity level of the hierarchical decision attribute $d[i]$. The set $\{d[i,1], d[i,2], d[i,3], \dots\}$ represents the values of attribute $d[i]$ at its second granularity level. The set $\{d[i,1,1], d[i,1,2], \dots, d[i,1,n_i]\}$ represents the values of the attribute at its third granularity level, right below the node $d[i,1]$; we assume here that the value $d[i,1]$ can be refined to any value from $\{d[i,1,1], d[i,1,2], \dots, d[i,1,n_i]\}$, if necessary. Similarly, the set $\{d[i,3,1,3,1], d[i,3,1,3,2], d[i,3,1,3,3], d[i,3,1,3,4]\}$ represents the values of the attribute at the granularity level right below the node $d[i,3,1,3]$, i.e., values finer than $d[i,3,1,3]$.

Now let us assume that a rule-based classifier, denoted by RC, is used to extract rules describing simple decision queries in S. The definition of the semantics $M_S$ of c-terms is RC-independent, whereas the definition of the semantics $M_S$ of d-queries is RC-dependent.

Definition 8  The classifier-based semantics $M_S$ of d-queries in $S = (X, A \cup D, V)$ is defined as follows: if $t$ is a simple d-query in $S$ and $\{r_j = [t_j \to t] : j \in J_t\}$ is the set of all rules defining $t$ which are extracted from $S$ by the classifier RC, then
$$M_S(t) = \{(x, p_x) : (\exists j \in J_t)\, x \in M_S(t_j)\}, \quad \text{where } p_x = \frac{\sum \{\mathrm{conf}(j) \cdot \mathrm{sup}(j) : x \in M_S(t_j),\ j \in J_t\}}{\sum \{\mathrm{sup}(j) : x \in M_S(t_j),\ j \in J_t\}}$$
and $\mathrm{conf}(j)$, $\mathrm{sup}(j)$ denote the confidence and the support of the rule $[t_j \to t]$, correspondingly.

Definition 9  An attribute value $d[j_1, j_2, \dots, j_n]$ in $S = (X, A \cup D, V)$ is dependent on $d[i_1, i_2, \dots, i_k]$ in $S$ if one of the following conditions holds:
1) $n \le k$ and $(\forall m \le n)\,[i_m = j_m]$, or
2) $n > k$ and $(\forall m \le k)\,[i_m = j_m]$.
Otherwise, $d[j_1, j_2, \dots, j_n]$ is called independent from $d[i_1, i_2, \dots, i_k]$ in $S$.

Example 1  The attribute value $d[2,3,1,2]$ is dependent on the attribute value $d[2,3,1,2,5,3]$. Also, $d[2,3,1,2,5,3,2,4]$ is dependent on $d[2,3,1,2,5,3]$.

Definition 10  Let $S = (X, A \cup \{d[1], d[2], \dots, d[k]\}, V)$, $w \in V_{d[i]}$, and let $IV_{d[i]}$ be the set of all attribute values in $V_{d[i]}$ which are independent from $w$. The standard semantics $N_S$ of d-queries in $S$ is defined as follows:
- $N_S(0) = \emptyset$, $N_S(1) = X$,
- if $w \in V_{d[i]}$, then $N_S(w) = \{x \in X : d[i](x) = w\}$, for any $1 \le i \le k$,
- if $w \in V_{d[i]}$, then $N_S(\sim w) = \{x \in X : (\exists v \in IV_{d[i]})\,[d[i](x) = v]\}$, for any $1 \le i \le k$,
- if $t_1, t_2$ are terms, then $N_S(t_1 + t_2) = N_S(t_1) \cup N_S(t_2)$ and $N_S(t_1 * t_2) = N_S(t_1) \cap N_S(t_2)$.

Definition 11  Let $S = (X, A \cup D, V)$, let $t$ be a d-query in $S$, let $N_S(t)$ be its meaning under the standard semantics, and let $M_S(t)$ be its meaning under the classifier-based semantics. Assume that $N_S(t) = X_1 \cup Y_1$, where $X_1 = \{x_i : i \in I_1\}$ and $Y_1 = \{y_i : i \in I_2\}$. Assume also that $M_S(t) = \{(x_i, p_i) : i \in I_1\} \cup \{(z_i, q_i) : i \in I_3\}$ and $\{y_i : i \in I_2\} \cap \{z_i : i \in I_3\} = \emptyset$.

By the precision of the classifier-based semantics $M_S$ on a d-query $t$ we mean
$$\mathrm{Prec}(M_S, t) = \frac{\sum \{p_i : i \in I_1\} + \sum \{(1 - q_i) : i \in I_3\}}{\mathrm{card}(I_1) + \mathrm{card}(I_3)} \, .$$

By the recall of the classifier-based semantics $M_S$ on a d-query $t$ we mean
$$\mathrm{Rec}(M_S, t) = \frac{\sum \{p_i : i \in I_1\}}{\mathrm{card}(I_1) + \mathrm{card}(I_2)} \, .$$

Example 2  Assume that $N_S(t) = \{x_1, x_2, x_3, x_4\}$ and $M_S(t) = \{(x_1, p_1), (x_2, p_2), (x_5, p_5), (x_6, p_6)\}$. Then:
$$\mathrm{Prec}(M_S, t) = [p_1 + p_2 + (1 - p_5) + (1 - p_6)]/4 \, , \qquad \mathrm{Rec}(M_S, t) = [p_1 + p_2]/4 \, .$$

Cooperative Query Answering
There are cases when a classical Query Answering System fails to return any answer to a d-query q although a satisfactory answer can still be found. For instance, assume that in a multi-hierarchical decision system $S = (X, A \cup D, V)$, where $D = \{d[1], d[2], \dots, d[k]\}$, there is no single object whose description matches the query q. If a distance measure between objects in S is defined, then by generalizing q we may identify objects in S whose descriptions are closest to the description of q. This problem is similar to the one arising when the granularity of an attribute value used in a query q is finer than the granularity of the corresponding attribute used in S: by replacing such attribute values in q by the more general values used in S, we may retrieve objects from S which satisfy q.

Definition 12  The distance $\delta_S$ between two attribute values $d[j_1, j_2, \dots, j_n]$ and $d[i_1, i_2, \dots, i_m]$ in $S = (X, A \cup D, V)$, where $j_1 = i_1$ and $p \ge 1$, is defined as follows:
1) if $[j_1, j_2, \dots, j_p] = [i_1, i_2, \dots, i_p]$ and $j_{p+1} \ne i_{p+1}$, then $\delta_S(d[j_1, \dots, j_n], d[i_1, \dots, i_m]) = 1/2^{p-1}$;
2) if $n \le m$ and $[j_1, j_2, \dots, j_n] = [i_1, i_2, \dots, i_n]$, then $\delta_S(d[j_1, \dots, j_n], d[i_1, \dots, i_m]) = 1/2^n$.
The second condition in the above definition represents the average case between the best and the worst case.

Example 3  Following the above definition of the distance measure, we get:
1. $\delta_S(d[2,3,2,4], d[2,3,2,5,1]) = 1/4$,
2. $\delta_S(d[2,3,2,4], d[2,3,2]) = 1/8$.

Let us assume that $q = q(a[3,1,3,2], b[1], c[2])$ is a d-query submitted to S.
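Definition 12 translates directly into code. A minimal sketch, under the assumption (ours, for illustration) that attribute values are encoded as tuples of indices, e.g. $d[2,3,2,4]$ as `(2, 3, 2, 4)`:

```python
def delta(v, w):
    """Distance of Definition 12 between two values of the same hierarchical
    attribute, encoded as index tuples sharing their first index."""
    if v[0] != w[0]:
        raise ValueError("values must belong to the same root node")
    p = 0                                  # length of the common prefix
    for a, b in zip(v, w):
        if a != b:
            break
        p += 1
    if p == min(len(v), len(w)):           # condition 2: one value refines the other
        return 1 / 2 ** p
    return 1 / 2 ** (p - 1)                # condition 1: paths diverge after p nodes

# The two cases of Example 3:
print(delta((2, 3, 2, 4), (2, 3, 2, 5, 1)))   # 0.25  = 1/4
print(delta((2, 3, 2, 4), (2, 3, 2)))         # 0.125 = 1/8
```

The deeper the common prefix, the smaller the distance, which matches the intuition that values sharing a long path in the hierarchy are semantically close.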
The notation $q(a[3,1,3,2], b[1], c[2])$ means that q is built from the atomic attribute values $a[3,1,3,2]$, $b[1]$, $c[2]$ in S. Additionally, we assume that the attribute a is not only hierarchical but also ordered. This means that the difference between the values $a[3,1,3,2]$ and $a[3,1,3,3]$ is smaller than the difference between $a[3,1,3,2]$ and $a[3,1,3,4]$; also, the difference between any two elements in $\{a[3,1,3,1], a[3,1,3,2], a[3,1,3,3], a[3,1,3,4]\}$ is smaller than the difference between $a[3,1,3]$ and $a[3,1,2]$.

Now we outline a possible strategy which QAS can follow to solve q. Clearly, the best solution for answering q is to identify objects in S which precisely match the d-query submitted by the user. If this fails, we try to identify objects which match the d-query $q(a[3,1,3], b[1], c[2])$. If we succeed, then we try the d-queries $q(a[3,1,3,1], b[1], c[2])$ and $q(a[3,1,3,3], b[1], c[2])$; if these fail, then we should succeed with $q(a[3,1,3,4], b[1], c[2])$. If we fail with $q(a[3,1,3], b[1], c[2])$, then we try $q(a[3,1], b[1], c[2])$, and so on.

To present this cooperative strategy more precisely, we use an example and start with a very simple dataset. Namely, we assume that S has four decision attributes, which belong to the set $\{a, b, c, d\}$, and that S contains only the four objects listed in Table 1.

Cooperative Multi-hierarchical Query Answering Systems, Table 1  Multi-hierarchical decision system S

X   | a        | b        | c        | d        | e    | f    | g   | ...
x1  | a[1]     | b[2]     | c[1,1]   | d[3]     | e[1] | f[1] | ... | ...
x2  | a[1,1]   | b[2,1]   | c[1,1,1] | d[3,1,2] | e[2] | f[1] | ... | ...
x3  | a[1,1,1] | b[2,2,1] | c[2,2]   | d[1]     | e[2] | f[1] | ... | ...
x4  | a[2]     | b[2,2]   | c[1,1]   | d[1,1]   | e[1] | f[2] | ... | ...

Now assume that the d-query $q = a[1,2] * b[2] * c[1,1] * d[3,1,1]$ is submitted to the multi-hierarchical decision system S (see Table 1). Clearly, q fails in S. Jointly with q, a threshold value for minimum support can be supplied as part of the d-query; this threshold gives the minimal number of objects that need to be returned as an answer to q. When QAS fails to answer q, the objects nearest to q have to be identified. The algorithm for finding these objects is based on the following steps. If QAS fails to identify a sufficient number of objects satisfying q in S, the generalization process starts: we can generalize either attribute a or attribute d. Since the query value $d[3,1,1]$ is at a finer granularity level than $a[1,2]$, we first generalize $d[3,1,1]$, getting a new query $q_1 = a[1,2] * b[2] * c[1,1] * d[3,1]$. But $q_1$ still fails in S. Next we generalize $a[1,2]$, getting a new query $q_2 = a[1] * b[2] * c[1,1] * d[3,1]$. Objects $x_1$, $x_2$ are the only objects in S which support $q_2$. Clearly,
$$\delta_S[q, x_1] = \delta_S\big[(a[1,2], b[2], c[1,1], d[3,1,1]),\ (a[1], b[2], c[1,1], d[3])\big] = 1/4 + 0 + 0 + 1/4 = 1/2 \, ,$$
$$\delta_S[q, x_2] = \delta_S\big[(a[1,2], b[2], c[1,1], d[3,1,1]),\ (a[1,1], b[2,1], c[1,1,1], d[3,1,2])\big] = 1/4 + 1/4 + 1/8 + 1/8 = 3/4 \, .$$
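The stepwise generalization can be sketched as a short loop. Everything below is a hypothetical encoding for illustration: attribute values are index tuples, and the matching rule assumed here requires an object's value to refine (extend) the query value. Under this strict reading only x2 supports the relaxed query $a[1]*b[2]*c[1,1]*d[3,1]$; the looser reading used in the text, which also accepts an object value coarser than but consistent with the query value, admits x1 as well.

```python
# Objects of Table 1 (decision attributes a-d only), values as index tuples.
S = {
    "x1": {"a": (1,),      "b": (2,),      "c": (1, 1),    "d": (3,)},
    "x2": {"a": (1, 1),    "b": (2, 1),    "c": (1, 1, 1), "d": (3, 1, 2)},
    "x3": {"a": (1, 1, 1), "b": (2, 2, 1), "c": (2, 2),    "d": (1,)},
    "x4": {"a": (2,),      "b": (2, 2),    "c": (1, 1),    "d": (1, 1)},
}

def supports(obj_val, q_val):
    # Assumed matching rule: the object's value must refine the query value.
    return obj_val[:len(q_val)] == q_val

def relax(query, min_support=1):
    """Repeatedly generalize the finest query value (replace it by its parent
    in the hierarchy) until at least min_support objects support the query."""
    query = dict(query)
    while True:
        hits = [x for x, desc in S.items()
                if all(supports(desc[attr], v) for attr, v in query.items())]
        if len(hits) >= min_support:
            return hits, query
        attr = max(query, key=lambda a: len(query[a]))   # deepest value first
        if len(query[attr]) == 1:
            return hits, query                           # nothing left to generalize
        query[attr] = query[attr][:-1]

hits, relaxed = relax({"a": (1, 2), "b": (2,), "c": (1, 1), "d": (3, 1, 1)})
print(hits, relaxed)   # ['x2'] {'a': (1,), 'b': (2,), 'c': (1, 1), 'd': (3, 1)}
```

Note that the loop reproduces the order of generalization from the text: d[3,1,1] is relaxed to d[3,1] first, then a[1,2] to a[1].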
If the user is only interested in one object satisfying the query q, then we need to identify which object in fx1 ; x2 g has a distance closer to q.
Since $\delta_S[q, x_1] < \delta_S[q, x_2]$, $x_1$ is the winning object. Note that the cooperative strategy only identifies the objects to be returned by QAS to the user; the confidence assigned to these objects depends on the classifier RC.

Future Directions
We have introduced the notions of system-based and user-based semantics of queries. User-based semantics are associated with the indexing of objects by a user, which is time consuming and unrealistic for very large sets of data. System-based semantics are associated with automatic indexing of objects in X, which strictly depends on the support and confidence of the classifiers; the precision and recall of the query answering system depend on them as well. The quality of classifiers can be improved by a proper enlargement of the set X and of the set of features describing its objects, so that real-life objects from the same semantic domain as X are differentiated in a better way; an example is given in [16]. The quality of a query answering system can also be improved by its cooperativeness: both the precision and recall of QAS increase if no-answer queries are replaced by generalized queries which QAS answers at a higher granularity level than the initial level of the queries submitted by users. Assuming that the system is distributed, the quality of QAS for a multi-hierarchical decision system S can also be improved through collaboration among sites [14,15]. The key concept of an intelligent QAS based on collaboration among sites is to generate global knowledge through knowledge sharing: each site develops knowledge independently, and this knowledge is used jointly to produce global knowledge. Assume that two sites S1 and S2 accept the same ontology of their attributes and share their knowledge in order to solve a user query successfully. Assume also that one of the attributes at site S1 is confidential. The confidential data in S1 can be hidden by replacing them with null values.
However, users at S1 may treat them as missing data and reconstruct them using the knowledge extracted from S2 [10]. The vulnerability illustrated in this example shows that security-aware data management is an essential component of any intelligent QAS, needed to ensure data confidentiality.

Bibliography
1. Chmielewski MR, Grzymala-Busse JW, Peterson NW (1993) The rule induction system LERS – a version for personal computers. Found Comput Decis Sci 18(3-4):181–212
2. Chu W, Yang H, Chiang K, Minock M, Chow G, Larson C (1996) CoBase: A scalable and extensible cooperative information system. J Intell Inf Syst 6(2/3):223–259
3. Cuppens F, Demolombe R (1988) Cooperative answering: a methodology to provide intelligent access to databases. Proceedings of the Second International Conference on Expert Database Systems, pp 333–353
4. Gal A, Minker J (1988) Informative and cooperative answers in databases using integrity constraints. Natural Language Understanding and Logic Programming, North Holland, pp 277–300
5. Gaasterland T (1997) Cooperative answering through controlled query relaxation. IEEE Expert 12(5):48–59
6. Gaasterland T, Godfrey P, Minker J (1992) Relaxation as a platform for cooperative answering. J Intell Inf Syst 1(3):293–321
7. Giannotti F, Manco G (2002) Integrating data mining with intelligent query answering. Logics in Artificial Intelligence. Lecture Notes in Computer Science, vol 2424. Springer, Berlin, pp 517–520
8. Godfrey P (1993) Minimization in cooperative response to failing database queries. Int J Coop Inf Syst 6(2):95–149
9. Guarino N (1998) Formal ontology in information systems. IOS Press, Amsterdam
10. Im S, Ras ZW (2007) Protection of sensitive data based on reducts in a distributed knowledge discovery system. Proceedings of the International Conference on Multimedia and Ubiquitous Engineering (MUE 2007), Seoul. IEEE Computer Society, pp 762–766
11. Lin TY (1989) Neighborhood systems and approximation in relational databases and knowledge bases. Proceedings of the Fourth International Symposium on Methodologies for Intelligent Systems. Poster Session Program, Oak Ridge National Laboratory, ORNL/DSRD-24, pp 75–86
12. Muslea I (2004) Machine learning for online query relaxation. Proceedings of KDD-2004, Seattle. ACM, pp 246–255
13. Pawlak Z (1981) Information systems – theoretical foundations. Inf Syst J 6:205–218
14. Ras ZW, Dardzinska A (2004) Ontology-based distributed autonomous knowledge systems. Inf Syst Int J 29(1):47–58
15. Ras ZW, Dardzinska A (2006) Solving failing queries through cooperation and collaboration. Special Issue on Web Resources Access, World Wide Web J 9(2):173–186
16. Ras ZW, Zhang X, Lewis R (2007) MIRAI: Multi-hierarchical, FS-tree based music information retrieval system. In: Kryszkiewicz M et al (eds) Proceedings of RSEISP 2007. LNAI, vol 4585. Springer, Berlin, pp 80–89
Correlated Equilibria and Communication in Games
FRANÇOISE FORGES
Ceremade, Université Paris-Dauphine, Paris, France
Article Outline
Glossary
Definition of the Subject
Introduction
Correlated Equilibrium: Definition and Basic Properties
Correlated Equilibrium and Communication
Correlated Equilibrium in Bayesian Games
Related Topics and Future Directions
Acknowledgment
Bibliography

Glossary
Bayesian game  An interactive decision problem consisting of a set of n players, a set of types for every player, a probability distribution which accounts for the players' beliefs over each others' types, a set of actions for every player, and a von Neumann–Morgenstern utility function defined over n-tuples of types and actions for every player.
Nash equilibrium  In an n-person strategic form game, a strategy n-tuple from which unilateral deviations are not profitable.
von Neumann–Morgenstern utility function  A utility function which reflects the individual's preferences over lotteries. Such a utility function is defined over outcomes and can be extended to any lottery by taking expectations.
Pure strategy (or simply strategy)  A mapping which, in an interactive decision problem, associates an action with the information of a player whenever this player can make a choice.
Sequential equilibrium  A refinement of the Nash equilibrium for n-person multistage interactive decision problems, which can be loosely defined as a strategy n-tuple together with beliefs over past information for every player, such that every player maximizes his expected utility given his beliefs and the others' strategies, with the additional condition that the beliefs satisfy (possibly sophisticated) Bayes updating given the strategies.
Strategic (or normal) form game  An interactive decision problem consisting of a set of n players, a set of strategies for every player, and a (typically von Neumann–Morgenstern) utility function defined over n-tuples of strategies for every player.
Utility function  A real-valued mapping over a set of outcomes which reflects the preferences of an individual by associating a utility level (a "payoff") with every outcome.

Definition of the Subject
The correlated equilibrium is a game-theoretic solution concept. It was proposed by Aumann [1,2] in order to capture the strategic correlation opportunities that the players face when they take into account the extraneous environment in which they interact. The notion is illustrated in Sect. "Introduction". A formal definition is given in Sect. "Correlated Equilibrium: Definition and Basic Properties". The correlated equilibrium also appears as the appropriate solution concept if preplay communication is allowed between the players. As shown in Sect. "Correlated Equilibrium and Communication", this property can be given several precise statements according to the constraints imposed on the players' communication, which can go from plain conversation to the exchange of messages through noisy channels. Originally designed for static games with complete information, the correlated equilibrium applies to any strategic form game. It is geometrically and computationally more tractable than the better-known Nash equilibrium. The solution concept has been extended to dynamic games, possibly with incomplete information. As an illustration, we define in detail the communication equilibrium for Bayesian games in Sect. "Correlated Equilibrium in Bayesian Games".

Introduction
Example  Consider the two-person game known as "chicken", in which each player i can take a "pacific" action (denoted pi) or an "aggressive" action (denoted ai):

        p2        a2
p1    (8, 8)    (3, 10)
a1    (10, 3)   (0, 0)
The interpretation is that player 1 and player 2 simultaneously choose an action and then get a payoff determined by the pair of chosen actions according to the previous matrix. If both players are pacific, they both get 8. If both are aggressive, they both get 0. If one player is aggressive and the other is pacific, the aggressive player gets 10 and the pacific one gets 3. This game has two pure Nash equilibria, (p1, a2) and (a1, p2), and one mixed Nash equilibrium in which both players choose the pacific action with probability 3/5, resulting in the expected payoff 6 for both players.

A possible justification for the latter solution is that the players make their choices as a function of independent extraneous random signals. The assumption of independence is strong; indeed, there may be no way to prevent the players' signals from being correlated. Consider a random signal which has no effect on the players' payoffs and takes three possible values: low, medium or high, each occurring with probability 1/3. Assume that, before the beginning of the game, player 1 distinguishes whether the signal is high or not, while player 2 distinguishes whether the signal is low or not. The relevant interactive decision problem is then the extended game in which the players can base their actions on the private information they get on the random signal, while the payoffs only depend on the players' actions. In this game, suppose that player 1 chooses the aggressive action when the signal is high and the pacific action otherwise; similarly, suppose that player 2 chooses the aggressive action when the signal is low and the pacific action otherwise. We show that these strategies form an equilibrium in the extended game. Given player 2's strategy, assume that player 1 observes a high signal. Player 1 deduces that the signal cannot be low, so that player 2 chooses the pacific action; hence player 1's best response is to play aggressively. Assume now that player 1 is informed that the signal is not high; he deduces that with probability 1/2 the signal is medium (i.e., not low), so that player 2 plays pacific, and with probability 1/2 the signal is low, so that player 2 plays aggressive.
The expected payoff of player 1 is 5.5 if he plays pacific and 5 if he plays aggressive; hence, the pacific action is a best response. The equilibrium conditions for player 2 are symmetric. To sum up, the strategies based on the players’ private information form a Nash equilibrium in the extended game in which an extraneous signal is first selected. We shall say that these strategies form a “correlated equilibrium”. The corresponding probability distribution over the players’ actions is
        p2     a2
p1     1/3    1/3
a1     1/3     0          (1)
and the expected payoff of every player is 7. This probability distribution can be used directly to make private recommendations to the players before the beginning of the game (see the canonical representation below).
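The incentive constraints behind distribution (1) can be checked mechanically. A small sketch (the encoding of actions as 0 = pacific, 1 = aggressive is ours): for each player, each recommended action, and each possible deviation, obeying the recommendation must yield at least the deviation payoff in expectation.

```python
from itertools import product

# Chicken: U[profile] = (payoff of player 1, payoff of player 2).
U = {(0, 0): (8, 8), (0, 1): (3, 10), (1, 0): (10, 3), (1, 1): (0, 0)}

def is_correlated_equilibrium(q, tol=1e-9):
    """Check the linear incentive inequalities for a distribution q over
    action profiles: no player gains by deviating from any recommendation."""
    for i in (0, 1):                                   # player index
        for rec, dev in product((0, 1), repeat=2):     # recommendation vs deviation
            gain = 0.0
            for prof, p in q.items():
                if prof[i] == rec:
                    alt = list(prof)
                    alt[i] = dev
                    gain += p * (U[tuple(alt)][i] - U[prof][i])
            if gain > tol:                             # profitable deviation found
                return False
    return True

q = {(0, 0): 1/3, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 0}   # distribution (1)
print(is_correlated_equilibrium(q))                       # True
```

By contrast, the uniform distribution over the four profiles fails the test: a player receiving the aggressive recommendation would then profit from disobeying it.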
Correlated Equilibrium: Definition and Basic Properties

Definition
A game in strategic form $G = (N, (\Sigma^i)_{i \in N}, (u^i)_{i \in N})$ consists of a set of players $N$ together with, for every player $i \in N$, a set of strategies (for instance, a set of actions) $\Sigma^i$ and a (von Neumann–Morgenstern) utility function $u^i : \Sigma \to \mathbb{R}$, where $\Sigma = \prod_{j \in N} \Sigma^j$ is the set of all strategy profiles. We assume that the sets $N$ and $\Sigma^i$, $i \in N$, are finite.

A correlation device $d = (\Omega, q, (P^i)_{i \in N})$ is described by a finite set of signals $\Omega$, a probability distribution $q$ over $\Omega$ and a partition $P^i$ of $\Omega$ for every player $i \in N$. Since $\Omega$ is finite, the probability distribution $q$ is just a real vector $q = (q(\omega))_{\omega \in \Omega}$ such that $q(\omega) \ge 0$ and $\sum_{\omega \in \Omega} q(\omega) = 1$. From $G$ and $d$, we define the extended game $G_d$ as follows:
- $\omega$ is chosen in $\Omega$ according to $q$,
- every player $i$ is informed of the element $P^i(\omega)$ of $P^i$ which contains $\omega$,
- $G$ is played: every player $i$ chooses a strategy $\sigma^i$ in $\Sigma^i$ and gets the utility $u^i(\sigma)$, $\sigma = (\sigma^j)_{j \in N}$.

A (pure) strategy for player $i$ in $G_d$ is a mapping $\alpha^i : \Omega \to \Sigma^i$ which is $P^i$-measurable, namely such that $\alpha^i(\omega') = \alpha^i(\omega)$ if $\omega' \in P^i(\omega)$. The interpretation is that, in $G_d$, every player $i$ chooses his strategy $\sigma^i$ as a function of his private information on the random signal $\omega$, which is selected before the beginning of $G$. According to Aumann [1], a correlated equilibrium of $G$ is a pair $(d, \alpha)$, which consists of a correlation device $d = (\Omega, q, (P^i)_{i \in N})$ and a Nash equilibrium $\alpha = (\alpha^i)_{i \in N}$ of $G_d$. The equilibrium conditions of every player $i$, conditionally on his private information, can be written as
$$\sum_{\omega' \in P^i(\omega)} q(\omega' \mid P^i(\omega))\, u^i(\alpha(\omega')) \;\ge\; \sum_{\omega' \in P^i(\omega)} q(\omega' \mid P^i(\omega))\, u^i(\tau^i, \alpha^{-i}(\omega')) \, , \quad \forall i \in N,\ \forall \tau^i \in \Sigma^i,\ \forall \omega \in \Omega : q(\omega) > 0 \, , \qquad (2)$$
where $\alpha^{-i} = (\alpha^j)_{j \ne i}$.

A mixed Nash equilibrium $\mu = (\mu^i)_{i \in N}$ of $G$ can be viewed as a correlated equilibrium of $G$. By definition, every $\mu^i$ is a probability distribution over $\Sigma^i$, the finite set of pure strategies of player $i$. Let us consider the correlation device $d = (\Omega, q, (P^i)_{i \in N})$ in which $\Omega = \Sigma = \prod_{j \in N} \Sigma^j$, $q$ is the product probability distribution induced by the mixed strategies (i.e., $q((\sigma^j)_{j \in N}) = \prod_{j \in N} \mu^j(\sigma^j)$), and for each $i$, $P^i$ is the partition of $\Omega$ generated by $\Sigma^i$ (i.e., for $\omega, \omega' \in \Omega$, $\omega' \in P^i(\omega) \Leftrightarrow \omega'^i = \omega^i$). Let $\alpha^i : \Sigma \to \Sigma^i$ be the projection over $\Sigma^i$ (i.e., $\alpha^i(\sigma) = \sigma^i$). The correlation device $d$ and the strategies $\alpha^i$ defined in this way form a correlated equilibrium. As we shall see below, this correlated equilibrium is "canonical".
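This remark can be illustrated numerically on "chicken": the mixed equilibrium (pacific with probability 3/5 for both players) induces a product distribution over action profiles which satisfies the incentive constraints with equality. A sketch, reusing an encoding of ours (0 = pacific, 1 = aggressive):

```python
from itertools import product

# Chicken payoffs: U[profile] = (payoff of player 1, payoff of player 2).
U = {(0, 0): (8, 8), (0, 1): (3, 10), (1, 0): (10, 3), (1, 1): (0, 0)}
mix = (3/5, 2/5)                       # mixed Nash equilibrium of chicken
q = {(s1, s2): mix[s1] * mix[s2] for s1, s2 in product((0, 1), repeat=2)}

def deviation_gain(q, i, rec, dev):
    """Expected gain of player i from playing dev whenever rec is recommended."""
    gain = 0.0
    for prof, p in q.items():
        if prof[i] == rec:
            alt = list(prof)
            alt[i] = dev
            gain += p * (U[tuple(alt)][i] - U[prof][i])
    return gain

# At a mixed equilibrium each player is indifferent: every deviation gain is 0.
gains = [deviation_gain(q, i, r, d)
         for i in (0, 1) for r in (0, 1) for d in (0, 1)]
print(max(abs(g) for g in gains))      # 0 up to floating-point rounding
```

That every deviation gain vanishes reflects the indifference property of mixed equilibria; a genuinely correlated distribution such as (1) can instead satisfy the constraints with strict inequality.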
Canonical Representation
A canonical correlated equilibrium of $G$ is a correlated equilibrium in which $\Omega = \Sigma = \prod_{j \in N} \Sigma^j$ while, for every player $i$, the partition $P^i$ of $\Sigma$ is generated by $\Sigma^i$ and $\alpha^i : \Sigma \to \Sigma^i$ is the projection over $\Sigma^i$. A canonical correlated equilibrium is thus fully specified by a probability distribution $q$ over $\Sigma$. A natural interpretation is that a mediator selects $\sigma = (\sigma^j)_{j \in N}$ according to $q$ and privately recommends $\sigma^i$ to player $i$, for every $i \in N$. The players are not forced to obey the mediator, but $\sigma$ is selected in such a way that player $i$ cannot benefit from deviating unilaterally from the recommendation $\sigma^i$, i.e., $\tau^i = \sigma^i$ maximizes the conditional expectation of player $i$'s payoff $u^i(\tau^i, \sigma^{-i})$ given the recommendation $\sigma^i$. A probability distribution $q$ over $\Sigma$ thus defines a canonical correlated equilibrium if and only if it satisfies the following linear inequalities:
$$\sum_{\sigma^{-i} \in \Sigma^{-i}} q(\sigma^{-i} \mid \sigma^i)\, u^i(\sigma^i, \sigma^{-i}) \;\ge\; \sum_{\sigma^{-i} \in \Sigma^{-i}} q(\sigma^{-i} \mid \sigma^i)\, u^i(\tau^i, \sigma^{-i}) \, , \quad \forall i \in N,\ \forall \tau^i \in \Sigma^i,\ \forall \sigma^i \in \Sigma^i : q(\sigma^i) > 0 \, ,$$
or, equivalently,
$$\sum_{\sigma^{-i} \in \Sigma^{-i}} q(\sigma^i, \sigma^{-i})\, u^i(\sigma^i, \sigma^{-i}) \;\ge\; \sum_{\sigma^{-i} \in \Sigma^{-i}} q(\sigma^i, \sigma^{-i})\, u^i(\tau^i, \sigma^{-i}) \, , \quad \forall i \in N,\ \forall \sigma^i, \tau^i \in \Sigma^i \, . \qquad (3)$$

The equilibrium conditions can also be formulated ex ante:
$$\sum_{\sigma \in \Sigma} q(\sigma)\, u^i(\sigma) \;\ge\; \sum_{\sigma \in \Sigma} q(\sigma)\, u^i(\alpha^i(\sigma^i), \sigma^{-i}) \, , \quad \forall i \in N,\ \forall \alpha^i : \Sigma^i \to \Sigma^i \, .$$

The following result is an analog of the "revelation principle" in mechanism design (see, e.g., Myerson [41]): let $(\alpha, d)$ be a correlated equilibrium associated with an arbitrary correlation device $d = (\Omega, q, (P^i)_{i \in N})$. The corresponding "correlated equilibrium distribution", namely the probability distribution induced over $\Sigma$ by $q$ and $\alpha$, defines a canonical correlated equilibrium. For instance, in the introduction, (1) describes a canonical correlated equilibrium.

Duality and Existence
From the linearity of (3), duality theory can be used to study the properties of correlated equilibria, in particular to prove their existence without relying on Nash's [45] theorem and its fixed-point argument (recall that every mixed Nash equilibrium is a correlated equilibrium). Hart and Schmeidler [30] establish the existence of a correlated equilibrium by constructing an auxiliary two-person zero-sum game and applying the minimax theorem. Nau and McCardle [47] derive another elementary proof of existence from an extension of the "no arbitrage opportunities" axiom that underlies subjective probability theory. They introduce jointly coherent strategy profiles, which do not expose the players as a group to arbitrage from an outside observer, and they show that a strategy profile is jointly coherent if and only if it occurs with positive probability in some correlated equilibrium. From a technical point of view, both proofs turn out to be similar. Myerson [44] makes further use of the linear structure of correlated equilibria by introducing dual reduction, a technique to replace a finite game with a game with fewer strategies, in such a way that any correlated equilibrium of the reduced game induces a correlated equilibrium of the original game.

Geometric Properties
As (3) is a system of linear inequalities, the set of all correlated equilibrium distributions is a convex polytope. Nau et al. [49] show that if it has "full" dimension (namely, dimension $|\Sigma| - 1$), then all Nash equilibria lie on its relative boundary. Viossat [60] characterizes in addition the class of games whose correlated equilibrium polytope contains a Nash equilibrium in its relative interior; interestingly, this class of games includes two-person zero-sum games but is not defined by "strict competition" properties. In two-person games, all extreme Nash equilibria are also extreme correlated equilibria [13,25]; this result does not hold with more than two players. Finally, Viossat [59] proves that having a unique correlated equilibrium is a robust property, in the sense that the set of n-person games with a unique correlated equilibrium is open. The same is not true for the Nash equilibrium (unless n = 2).
Correlated Equilibria and Communication in Games
Complexity

From (3), correlated equilibria can be computed by linear programming methods. Gilboa and Zemel [24] show more precisely that the complexity of standard computational problems is "NP-hard" for the Nash equilibrium and polynomial for the correlated equilibrium. Examples of such problems are: "Does the game G have a Nash (resp., correlated) equilibrium which yields a payoff greater than r to every player (for some given number r)?" and "Does the game G have a unique Nash (resp., correlated) equilibrium?". Papadimitriou [50] develops a polynomial-time algorithm for finding correlated equilibria, which is based on a variant of the existence proof of Hart and Schmeidler [30].

Foundations

By re-interpreting the previous canonical representation, Aumann [2] proposes a decision theoretic foundation for the correlated equilibrium in games with complete information, in which Σ_i, for i ∈ N, stands merely for a set of actions of player i. Let Ω be the space of all states of the world; an element ω of Ω thus specifies all the parameters which may be relevant to the players' choices. In particular, the action profile in the underlying game G is part of the state of the world. A partition P_i describes player i's information on Ω. In addition, every player i has a prior belief, i.e., a probability distribution, q_i over Ω. Formally, the framework is similar to the one above, except that the players possibly hold different beliefs over Ω. Let α_i(ω) denote player i's action at ω; a natural assumption is that player i knows the action he chooses, namely that α_i is P_i-measurable. According to Aumann [2], player i is Bayes-rational at ω if his action α_i(ω) maximizes his expected payoff (with respect to q_i) given his information P_i(ω). Note that this is a separate rationality condition for every player, not an equilibrium condition.
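This framework can be made concrete in a small sketch, again assuming the standard "chicken" payoffs (the game is introduced outside this excerpt). The states of the world are taken to be the action profiles of Aumann's classical correlation device, each player's partition reveals only his own action, and Bayes-rationality is checked state by state:

```python
from fractions import Fraction as F

# Assumed "chicken" payoffs: U[(action1, action2)] = (u1, u2).
U = {('p','p'): (8,8), ('p','a'): (3,10), ('a','p'): (10,3), ('a','a'): (0,0)}

# States of the world carry the action profile itself; the device puts
# weight 1/3 on every profile except ('a','a') (common prior q).
states = [('p','p'), ('p','a'), ('a','p')]
prior = {w: F(1, 3) for w in states}

def action(i, w):        # alpha_i: player i's action at state w
    return w[i]

def cell(i, w):
    """P_i(w): player i observes only his own action."""
    return [v for v in states if v[i] == w[i]]

def bayes_rational(i, w):
    """Does alpha_i(w) maximize i's conditional expected payoff on P_i(w)?"""
    c = cell(i, w)
    total = sum(prior[v] for v in c)
    def exp_pay(a):
        return sum(prior[v] / total *
                   U[(a, v[1]) if i == 0 else (v[0], a)][i] for v in c)
    return exp_pay(action(i, w)) == max(exp_pay(a) for a in ('p', 'a'))

print(all(bayes_rational(i, w) for i in (0, 1) for w in states))  # True
```

Since every player is Bayes-rational at every state under the common prior, the induced distribution of action profiles, uniform over the three states, is a correlated equilibrium distribution; with the assumed payoffs it gives each player (8 + 3 + 10)/3 = 7.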
Aumann [2] proves the following result: Under the common prior assumption (namely, q_i = q for all i ∈ N), if every player is Bayes-rational at every state of the world, the distribution of the corresponding action profile α is a correlated equilibrium distribution. The key to this decision theoretic foundation of the correlated equilibrium is that, under the common prior assumption, Bayesian rationality amounts to (2). If the common prior assumption is relaxed, the previous result still holds, with subjective prior probability distributions, for the subjective correlated equilibrium, which was also introduced by Aumann [1]. The latter solution concept is defined in the same way as above, by considering a device (Ω, (q_i)_{i∈N}, (P_i)_{i∈N}), with a probability distribution q_i for every player i, and by writing (2) in terms
of q_i instead of q. Brandenburger and Dekel [10] show that (a refinement of) the subjective correlated equilibrium is equivalent to (correlated) rationalizability, another well-established solution concept which captures players' minimal rationality. Rationalizable strategies reflect that the players commonly know that each of them makes an optimal choice given some belief. Nau and McCardle [48] reconcile objective and subjective correlated equilibrium by proposing the no arbitrage principle as a unified approach to individual and interactive decision problems. They argue that the objective correlated equilibrium concept applies to a game that is revealed by the players' choices, while the subjective correlated equilibrium concept applies to the "true game"; both lead to the same set of jointly coherent outcomes.

Correlated Equilibrium and Communication

As seen in the previous section, correlated equilibria can be achieved in practice with the help of a mediator and emerge in a Bayesian framework embedding the game in a full description of the world. Both approaches require extending the game by taking into account information which is not generated by the players themselves. Can the players reach a correlated equilibrium without relying on any extraneous correlation device, by just communicating with each other before the beginning of the game? Consider the game of "chicken" presented in the introduction. The probability distribution

          p_2     a_2
  p_1      0      1/2
  a_1     1/2      0         (4)
describes a correlated equilibrium, which amounts to choosing one of the two pure Nash equilibria, with equal probability. Both players get an expected payoff of 6.5. Can they safely achieve this probability distribution if no mediator tosses a fair coin for them? The answer is positive, as shown by Aumann et al. [3]. Assume that before playing "chicken", the players independently toss a coin and simultaneously reveal to each other whether heads or tails obtains. Player 1 tells player 2 "h_1" or "t_1" and, at the same time, player 2 tells player 1 "h_2" or "t_2". If both players use a fair coin, correctly reveal the result of the toss and play (p_1, a_2) if both coins fell on the same side (i.e., if (h_1, h_2) or (t_1, t_2) is announced) and (a_1, p_2) otherwise (i.e., if (h_1, t_2) or (t_1, h_2) is announced), they get the same effect as a mediator using (4). Furthermore, none of them can gain by unilaterally deviating from the described strategies, even at the randomizing stage: the two relevant outcomes, [(h_1, h_2) or (t_1, t_2)] and [(h_1, t_2) or (t_1, h_2)], happen with probability 1/2 provided that one of the players reveals the toss of a fair coin. This procedure is known as a "jointly controlled lottery".

An important feature of the previous example is that, in the correlated equilibrium described by (4), the players know each other's recommendation. Hence, they can easily reproduce (4) by exchanging messages that they have selected independently. In the correlated equilibrium described by the probability distribution (1), the private character of recommendations is crucial to guarantee that (p_1, p_2) be played with positive probability. Hence one cannot hope that a simple procedure of direct preplay communication suffices to generate (1). However, the fact that direct communication is necessarily public is typical of two-person games. Given the game G = (N, (Σ_i)_{i∈N}, (u_i)_{i∈N}), let us define a (bounded) "cheap talk" extension ext(G) of G as a game in which T stages of costless, unmediated preplay communication are allowed before G is played. More precisely, let M_t^i be a finite set of messages for player i, i ∈ N, at stage t, t = 1, 2, …, T; at every stage t of ext(G), every player i selects a message in M_t^i; these choices are made simultaneously before being revealed to a subset of players at the end of stage t. The rules of ext(G) thus determine a set of "senders" for every stage t (those players i for whom M_t^i contains more than one message) and a set of "receivers" for every stage t. The players perfectly recall their past messages. After the communication phase, they choose their strategies (e.g., their actions) as in G; they are also rewarded as in G, independently of the preplay phase, which is thus "cheap". Communication has an indirect effect on the final outcome in G, since the players make their decisions as a function of the messages that they have exchanged.
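Both the jointly controlled lottery and the inequalities (3) can be checked mechanically. The sketch below assumes, for concreteness, the payoffs commonly used for "chicken" (the game itself is introduced outside this excerpt): it computes the outcome distribution induced by the coin protocol and verifies that it is the correlated equilibrium (4).

```python
from fractions import Fraction as F
from itertools import product

# Assumed "chicken" payoffs: U[(action1, action2)] = (u1, u2).
U = {('p','p'): (8,8), ('p','a'): (3,10), ('a','p'): (10,3), ('a','a'): (0,0)}

def jcl_distribution(q1, q2):
    """Outcome distribution of the jointly controlled lottery: (p1, a2) is
    played if the two announced coins match, (a1, p2) otherwise."""
    same = q1 * q2 + (1 - q1) * (1 - q2)      # P(both heads or both tails)
    return {('p', 'a'): same, ('a', 'p'): 1 - same}

def is_correlated_eq(q):
    """Check the linear inequalities (3) for a distribution q over cells."""
    for i in (0, 1):                          # player index
        for s, s2 in product('pa', 'pa'):     # recommendation s, deviation s2
            gain = 0
            for t in 'pa':                    # the other player's action
                cell = (s, t) if i == 0 else (t, s)
                dev = (s2, t) if i == 0 else (t, s2)
                gain += q.get(cell, 0) * (U[dev][i] - U[cell][i])
            if gain > 0:
                return False
    return True

# One fair coin suffices: the opponent's bias cannot tilt the lottery.
d = jcl_distribution(F(1, 2), F(5, 6))
print(d[('p', 'a')], is_correlated_eq(d))     # 1/2 True
```

With the assumed payoffs, each player's expected payoff under (4) is (3 + 10)/2 = 6.5, and the distribution stays fair as soon as at least one player's coin is fair, which is exactly the robustness property of the jointly controlled lottery noted above.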
Specific additional assumptions are often made on ext(G), as we will see below. Let us fix a cheap talk extension ext(G) of G and a Nash equilibrium of ext(G). As a consequence of the previous definitions, the distribution induced by this Nash equilibrium over Σ defines a correlated equilibrium of G (this can be proved in the same way as the canonical representation of correlated equilibria stated in Sect. "Correlated Equilibrium: Definition and Basic Properties"). The question raised in this section is whether the reverse holds. If the number of players is two, the Nash equilibrium distributions of cheap talk extensions of G form a subset of the correlated equilibrium distributions: the convex hull of Nash equilibrium distributions. Indeed, the players both have the same information after any direct exchange of messages. Conversely, by performing repeated jointly controlled lotteries as in the example above, the players can achieve any convex combination (with rational weights) of
Nash equilibria of G as a Nash equilibrium of a cheap talk extension of G. The restriction to probability distributions whose components are rational numbers is only needed as far as we focus on bounded cheap talk extensions. Bárány [4] establishes that, if the number of players of G is at least four, every (rational) correlated equilibrium distribution of G can be realized as a Nash equilibrium of a cheap talk extension ext(G), provided that ext(G) allows the players to publicly check the record of communication under some circumstances. The equilibria of ext(G) constructed by Bárány involve a receiver getting the same message from two different senders; the message is nevertheless not public thanks to the assumption on the number of players. At every stage of ext(G), every player can ask for the revelation of all past messages, which are assumed to be recorded. Typically, a receiver can claim that the two senders' messages differ. In this case, the record of communication surely reveals that either one of the senders or the receiver himself has cheated; the deviator can be punished (at his minmax level in G) by the other players. The punishments in Bárány's [4] Nash equilibria of ext(G) need not be credible threats. Instead of using double senders in the communication protocols, Ben-Porath [5,6] proposes a procedure of random monitoring, which prescribes a given behavior to every player in such a way that unilateral deviations can be detected with probability arbitrarily close to 1. This procedure applies if there are at least three players, which yields an analog of Bárány's result already in this case. If the number of players is exactly three, Ben-Porath [6] needs to assume, as does Bárány [4], that public verification of the record of communication is possible in ext(G) (see Ben-Porath [7]).
However, Ben-Porath concentrates on (rational) correlated equilibrium distributions which allow for strict punishment on a Nash equilibrium of G; he constructs sequential equilibria which generate these distributions in ext(G), thus dispensing with incredible threats. At the price of raising the number of players to five or more, Gerardi [22] proves that every (rational) correlated equilibrium distribution of G can be realized as a sequential equilibrium of a cheap talk extension of G which does not require any message recording. For this, he builds protocols of communication in which the players base their decisions on majority rule, so that no punishment is necessary. We have concentrated on two extreme forms of communication: mediated communication, in which a mediator performs lotteries and sends private messages to the players, and cheap talk, in which the players just exchange messages. Many intermediate schemes of communication are obviously conceivable. For instance, Lehrer [36] introduces (possibly multistage) "mediated talk": the players send private messages to a mediator, but the latter can only make deterministic public announcements. Mediated talk captures real-life communication procedures, like elections, especially if it lasts only for a few stages. Lehrer and Sorin [37] establish that, whatever the number of players of G, every (rational) correlated equilibrium distribution of G can be realized as a Nash equilibrium of a single stage mediated talk extension of G. Ben-Porath [5] proposes a variant of cheap talk in which the players not only exchange verbal messages but also "hard" devices such as urns containing balls. This extension is particularly useful in two-person games to circumvent the equivalence between the equilibria achieved by cheap talk and the convex hull of Nash equilibria. More precisely, the result of Ben-Porath [5] stated above holds for two-person games if the players first check together the content of different urns, and then each player draws a ball from an urn that was chosen by the other player, so as to guarantee that one player only knows the outcome of a lottery while the other one only knows the probabilities of this lottery. The various extensions of the basic game G considered up to now, with or without a mediator, implicitly assume that the players are fully rational. In particular, they have unlimited computational abilities. By relaxing that assumption, Urbano and Vila [55] and Dodis et al. [12] build on earlier results from cryptography so as to implement any (rational) correlated equilibrium distribution through unmediated communication, including in two-person games. As the previous paragraphs illustrate, the players can modify their initial distribution of information by means of many different communication protocols. Gossner [26] proposes a general criterion to classify them: a protocol is "secure" if, under all circumstances, the players can neither mislead nor spy on each other.
For instance, given a cheap talk extension ext(G), a protocol P describes, for every player, a strategy in ext(G) and a way to interpret his information after the communication phase of ext(G). P induces a correlation device d(P) (in the sense of Sect. "Correlated Equilibrium: Definition and Basic Properties"). P is secure if, for every game G and every Nash equilibrium α of G_{d(P)}, the following procedure is a Nash equilibrium of ext(G): communicate according to the strategies described by P in order to generate d(P) and make the final choice, in G, according to α. Gossner [26] gives a tractable characterization of secure protocols.

Correlated Equilibrium in Bayesian Games

A Bayesian game Γ = (N, (T_i)_{i∈N}, p, (A_i)_{i∈N}, (v_i)_{i∈N}) consists of: a set of players N; for every player i ∈ N, a set of types T_i, a probability distribution p_i over T = ∏_{j∈N} T_j, a set of actions A_i and a (von Neumann–Morgenstern) utility function v_i : T × A → R, where A = ∏_{j∈N} A_j. For simplicity, we make the common prior assumption: p_i = p for every i ∈ N. All sets are assumed finite. The interpretation is that a virtual move of nature chooses t = (t_j)_{j∈N} according to p; player i is only informed of his own type t_i; the players then simultaneously choose an action. We will focus on two possible extensions of Aumann's [1] solution concept to Bayesian games: the strategic form correlated equilibrium and the communication equilibrium. Without loss of generality, the definitions below are given in "canonical form" (see Sect. "Correlated Equilibrium: Definition and Basic Properties").

Strategic Form Correlated Equilibrium

A (pure) strategy of player i in Γ is a mapping σ_i : T_i → A_i, i ∈ N. The strategic form of Γ is a game G(Γ), like the game G considered in Sect. "Correlated Equilibrium: Definition and Basic Properties", with sets of pure strategies Σ_i = A_i^{T_i} and utility functions u_i over Σ = ∏_{j∈N} Σ_j computed as expectations with respect to p: u_i(σ) = E[v_i(t, σ(t))], with σ(t) = (σ_j(t_j))_{j∈N}. A strategic form correlated equilibrium, or simply, a correlated equilibrium, of a Bayesian game Γ is a correlated equilibrium, in the sense of Sect. "Correlated Equilibrium: Definition and Basic Properties", of G(Γ). A canonical correlated equilibrium of Γ is thus described by a probability distribution Q over Σ, which selects an N-tuple of pure strategies (σ_i)_{i∈N}. This lottery can be thought of as being performed by a mediator who privately recommends σ_i to player i, i ∈ N, before the beginning of Γ, i.e., before (or in any case, independently of) the chance move choosing the N-tuple of types. The equilibrium conditions express that, once he knows his type t_i, player i cannot gain by unilaterally deviating from σ_i(t_i).
Communication Equilibrium

Myerson [41] transforms the Bayesian game Γ into a mechanism design problem by allowing the mediator to collect information from the players before making them recommendations. Following Forges [15] and Myerson [42], a canonical communication device for Γ consists of a system q of probability distributions q = (q(·|t))_{t∈T} over A. The interpretation is that a mediator invites every player i, i ∈ N, to report his type t_i, then selects an N-tuple of actions a according to q(·|t) and privately recommends a_i to player i. The system q defines a communication equilibrium if none of the players can gain by unilaterally lying about his type or by deviating from the recommended action, namely if

\sum_{t_{-i} \in T_{-i}} p(t_{-i} \mid t_i) \sum_{a \in A} q(a \mid t)\, v_i(t, a) \;\ge\; \sum_{t_{-i} \in T_{-i}} p(t_{-i} \mid t_i) \sum_{a \in A} q(a \mid s_i, t_{-i})\, v_i(t, \alpha_i(a_i), a_{-i}),
\qquad \forall i \in N,\ \forall t_i, s_i \in T_i,\ \forall \alpha_i : A_i \to A_i

Correlated Equilibrium, Communication Equilibrium and Cheap Talk

Every correlated equilibrium of the Bayesian game Γ induces a communication equilibrium of Γ, but the converse is not true, as the following example shows. Consider the two-person Bayesian game in which T_1 = {s_1, t_1}, T_2 = {t_2}, A_1 = {a_1, b_1}, A_2 = {a_2, b_2}, p(s_1) = p(t_1) = 1/2 and payoffs are described by

  s_1:        a_2         b_2
  a_1       (1, 1)    (−1, −1)
  b_1       (0, 0)      (0, 0)

  t_1:        a_2         b_2
  a_1       (0, 0)      (0, 0)
  b_1     (−1, −1)      (1, 1)
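The comparison between the two solution concepts can be carried out from the strategic form. The sketch below takes the payoff table as read above (the exact placement of the entries is itself a reconstruction of a garbled table, so treat it as an assumption), builds G(Γ), and compares the truthful communication equilibrium payoff with the best cell of the strategic form:

```python
from itertools import product

# Assumed reading of the payoff table: V[(type, a1-action, a2-action)].
V = {('s1','a1','a2'): (1, 1),   ('s1','a1','b2'): (-1, -1),
     ('s1','b1','a2'): (0, 0),   ('s1','b1','b2'): (0, 0),
     ('t1','a1','a2'): (0, 0),   ('t1','a1','b2'): (0, 0),
     ('t1','b1','a2'): (-1, -1), ('t1','b1','b2'): (1, 1)}

# Strategic form G(Gamma): player 1's pure strategies map his type to an
# action, player 2 (a single type) just picks one; p(s1) = p(t1) = 1/2.
u = {}
for f in product(('a1', 'b1'), repeat=2):      # (action if s1, action if t1)
    for c in ('a2', 'b2'):
        u[(f, c)] = tuple(0.5 * V[('s1', f[0], c)][i] +
                          0.5 * V[('t1', f[1], c)][i] for i in (0, 1))

# Truthful mediation, q(a1,a2|s1) = q(b1,b2|t1) = 1, pays 1 to each player,
# while no cell of G(Gamma) pays more than 1/2; since a correlated
# equilibrium payoff is a mixture over cells, it cannot exceed 1/2 either.
comm = 0.5 * V[('s1','a1','a2')][0] + 0.5 * V[('t1','b1','b2')][0]
print(comm, max(p[0] for p in u.values()))     # 1.0 0.5
```

This makes the gap quantitative: the communication equilibrium exploits player 1's report of his type, which no strategic form correlated equilibrium can replicate.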
In this game, the communication equilibrium q(a_1, a_2 | s_1) = q(b_1, b_2 | t_1) = 1 yields an expected payoff of 1 to both players. However, the maximal expected payoff of every player in a correlated equilibrium is 1/2. In order to see this, one can derive the strategic form of the game (in which player 1 has four strategies and player 2 has two strategies). Let us turn to the game in which player 1 can talk cheaply to player 2 just after having learned his type. In this new game, the following strategies form a Nash equilibrium: player 1 truthfully reveals his type to player 2 and plays a_1 if s_1, b_1 if t_1; player 2 chooses a_2 if s_1, b_2 if t_1. These strategies achieve the same expected payoffs as the communication equilibrium. As in Sect. "Correlated Equilibrium and Communication", one can define cheap talk extensions ext(Γ) of Γ. A wide definition of ext(Γ) involves an ex ante preplay phase, before the players learn their types, and an interim preplay phase, after the players learn their types but before they choose their actions. Every Nash equilibrium of ext(Γ) induces a communication equilibrium of Γ. In order to investigate the converse, namely whether cheap talk can simulate mediated communication in a Bayesian game, two approaches have been developed. The first one
(Forges [17], Gerardi [21,22], Vida [58]) proceeds in two steps, by reducing communication equilibria to correlated equilibria before applying the results obtained for strategic form games (see Sect. "Correlated Equilibrium and Communication"). The second approach (Ben-Porath [6], Krishna [33]) directly addresses the question in a Bayesian game. By developing a construction introduced for particular two-person games (Forges [14]), Forges [17] shows that every communication equilibrium outcome of a Bayesian game Γ with at least four players can be achieved as a correlated equilibrium outcome of a two-stage interim cheap talk extension ext_int(Γ) of Γ. No punishment is necessary in ext_int(Γ): at the second stage, every player gets a message from three senders and uses majority rule if the messages are not identical. Thanks to the underlying correlation device, each receiver is able to privately decode his message. Vida [58] extends Forges [17] to Bayesian games with three or even two players. In the proof, he constructs a correlated equilibrium of a long, but almost surely finite, interim cheap talk extension of Γ, whose length depends both on the signals selected by the correlation device and the messages exchanged by the players. No recording of messages is necessary to detect and punish a cheating player. If there are at least four players in Γ, once a communication equilibrium of Γ has been converted into a correlated equilibrium of ext_int(Γ), one can apply Bárány's [4] result to ext_int(Γ) in order to transform the correlated equilibrium into a Nash equilibrium of a further, ex ante, cheap talk preplay extension of Γ. Gerardi [21] modifies this ex ante preplay phase so as to postpone it to the interim stage. This result is especially useful if the initial move of nature in Γ is just a modelling convenience. Gerardi [22] also extends his result for at least five-person games with complete information (see Sect.
"Correlated Equilibrium and Communication") to any Bayesian game with full support (i.e., in which all type profiles have positive probability: p(t) > 0 for every t ∈ T) by proving that every (rational) communication equilibrium of Γ can be achieved as a sequential equilibrium of a cheap talk extension of Γ. Ben-Porath [6] establishes that if Γ is a three (or more) person game with full support, every (rational) communication equilibrium of Γ which strictly dominates a Nash equilibrium of Γ for every type t_i of every player i, i ∈ N, can be implemented as a Nash equilibrium of an interim cheap talk extension of Γ in which public verification of past record is possible (see also Ben-Porath [7]). Krishna [33] extends Ben-Porath's [5] result on two-person games (see Sect. "Correlated Equilibrium and Communication") to the incomplete information framework. The
other results mentioned at the end of Sect. "Correlated Equilibrium and Communication" have also been generalized to Bayesian games (see [26,37,56]).

Related Topics and Future Directions

In this brief article, we concentrated on two solution concepts: the strategic form correlated equilibrium, which is applicable to any game, and the communication equilibrium, which we defined for Bayesian games. Other extensions of Aumann's [1] solution concept have been proposed for Bayesian games, such as the agent normal form correlated equilibrium and the (possibly belief invariant) Bayesian solution (see Forges [18,19] for definitions and references). The Bayesian solution is intended to capture the players' rationality in games with incomplete information in the spirit of Aumann [2] (see Nau [46] and Forges [18]). Lehrer et al. [38] open a new perspective in the understanding of the Bayesian solution and other equilibrium concepts for Bayesian games by characterizing the classes of equivalent information structures with respect to each of them. Comparison of information structures, which goes back to Blackwell [8,9] for individual decision problems, was introduced by Gossner [27] in the context of games, both with complete and incomplete information. In the latter model, information structures basically describe how extraneous signals are selected as a function of the players' types; two information structures are equivalent with respect to an equilibrium concept if, in every game, they generate the same equilibrium distributions over outcomes. Correlated equilibria, communication equilibria and related solution concepts have been studied in many other classes of games, like multistage games (see, e.g., [15,42]), repeated games with incomplete information (see, e.g., [14,16]) and stochastic games (see, e.g., [53,54]).
The study of correlated equilibrium in repeated games with imperfect monitoring, initiated by Lehrer [34,35], proved to be particularly useful and is still ongoing. Lehrer [34] showed that if players are either fully informed of past actions or get no information ("standard-trivial" information structure), correlated equilibria are equivalent to Nash equilibria. In other words, all correlations can be generated internally, namely by the past histories, on which players have differential information. The schemes of internal correlation introduced to establish this result are widely applicable and inspired those of Lehrer [36] (see Sect. "Correlated Equilibrium and Communication"). In general repeated games with imperfect monitoring, Renault and Tomala [52] characterize communication equilibria, but the amount of correlation that the players can
achieve in a Nash equilibrium is still an open problem (see, e.g., [28,57] for recent advances). Throughout this article, we defined a correlated equilibrium as a Nash equilibrium of an extension of the game under consideration. The solution concept can be strengthened by imposing some refinement, i.e., further rationality conditions, on the Nash equilibrium in this definition (see, e.g., [11,43]). Refinements of communication equilibria have also been proposed (see, e.g., [22,23,42]). Some authors (see, e.g., [39,40,51]) have also developed notions of coalition proof correlated equilibria, which resist not only unilateral deviations, as in this article, but also multilateral ones. A recurrent difficulty is that, for many of these stronger solution concepts, a useful canonical representation (as derived in Sect. "Correlated Equilibrium: Definition and Basic Properties") is not available. Except for two or three references, we deliberately concentrated on the results published in the game theory and mathematical economics literature, although substantial achievements in computer science would also fit in this survey. Both streams of research pursue similar goals but rely on different formalisms and techniques. For instance, computer scientists often make use of cryptographic tools which are not familiar in game theory. Halpern [29] gives an idea of recent developments at the interface of computer science and game theory (see in particular the section "implementing mediators") and contains a number of references. Finally, the assumption of full rationality of the players can also be relaxed. Evolutionary game theory has developed models of learning in order to study the long term behavior of players with bounded rationality. Many possible dynamics are conceivable to represent more or less myopic attitudes with respect to optimization.
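A minimal sketch of such an adaptive dynamic is given below: a conditional-regret-based procedure in the spirit of Hart and Mas-Colell [31] (this is a simplified variant, not their exact algorithm, and it again assumes the standard "chicken" payoffs used in the earlier examples). Each player switches away from his last action with probability proportional to the average gain he would have had from that switch in the past; driving these internal regrets to zero pushes the empirical distribution of play toward the correlated equilibrium polytope.

```python
import random

# Assumed "chicken" payoffs: U[(action1, action2)] = (u1, u2).
U = {('p','p'): (8,8), ('p','a'): (3,10), ('a','p'): (10,3), ('a','a'): (0,0)}
ACTS = ('p', 'a')
MU = 20.0          # normalization constant, larger than any payoff gain

def play(T, seed=0):
    """Run T periods of conditional regret matching (sketch); return the
    joint empirical distribution of the action pairs that were played."""
    rng = random.Random(seed)
    # D[i][(a, b)]: cumulative gain player i would have obtained by playing
    # b instead of a in the periods in which he actually played a.
    D = [{(a, b): 0.0 for a in ACTS for b in ACTS} for _ in (0, 1)]
    last, counts = ['p', 'p'], {}
    for t in range(1, T + 1):
        profile = []
        for i in (0, 1):
            a, choice = last[i], last[i]
            r, acc = rng.random(), 0.0
            for b in ACTS:
                if b == a:
                    continue
                acc += max(D[i][(a, b)], 0.0) / (MU * t)  # switching prob.
                if r < acc:
                    choice = b
                    break
            profile.append(choice)
        cell = tuple(profile)
        counts[cell] = counts.get(cell, 0) + 1
        for i in (0, 1):                       # update conditional regrets
            a, opp = cell[i], cell[1 - i]
            for b in ACTS:
                here = (a, opp) if i == 0 else (opp, a)
                there = (b, opp) if i == 0 else (opp, b)
                D[i][(a, b)] += U[there][i] - U[here][i]
        last = list(cell)
    return {c: n / T for c, n in counts.items()}

# With vanishing internal regret, the empirical distribution approximately
# satisfies the correlated equilibrium inequalities (3).
q = play(50_000)
```

The final average regret D[i][(a, b)] / T equals exactly the left-hand side of the incentive constraint of (3) evaluated at the empirical distribution, so "no asymptotic regret" and "approximate correlated equilibrium" are the same statement.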
Under appropriate learning procedures, which express for instance that agents want to minimize the regret of their strategic choices, the empirical distribution of actions converges to the set of correlated equilibrium distributions (see, e.g., [20,31,32] for a survey). However, standard procedures, such as the "replicator dynamics", may even eliminate all the strategies which have positive probability in a correlated equilibrium (see [61]).

Acknowledgment

The author wishes to thank Elchanan Ben-Porath, Frédéric Koessler, R. Vijay Krishna, Ehud Lehrer, Bob Nau, Indra Ray, Jérôme Renault, Eilon Solan, Sylvain Sorin, Bernhard von Stengel, Tristan Tomala, Amparo Urbano, Yannick Viossat and, especially, Olivier Gossner and Péter Vida, for useful comments and suggestions.
Bibliography

Primary Literature

1. Aumann RJ (1974) Subjectivity and correlation in randomized strategies. J Math Econ 1:67–96
2. Aumann RJ (1987) Correlated equilibrium as an expression of Bayesian rationality. Econometrica 55:1–18
3. Aumann RJ, Maschler M, Stearns R (1968) Repeated games with incomplete information: an approach to the nonzero sum case. Reports to the US Arms Control and Disarmament Agency, ST-143, Chapter IV, pp 117–216 (reprinted in: Aumann RJ, Maschler M (1995) Repeated Games of Incomplete Information. M.I.T. Press, Cambridge)
4. Bárány I (1992) Fair distribution protocols or how players replace fortune. Math Oper Res 17:327–340
5. Ben-Porath E (1998) Correlation without mediation: expanding the set of equilibrium outcomes by cheap pre-play procedures. J Econ Theory 80:108–122
6. Ben-Porath E (2003) Cheap talk in games with incomplete information. J Econ Theory 108:45–71
7. Ben-Porath E (2006) A correction to "Cheap talk in games with incomplete information". Mimeo, Hebrew University of Jerusalem, Jerusalem
8. Blackwell D (1951) Comparison of experiments. In: Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Berkeley and Los Angeles, pp 93–102
9. Blackwell D (1953) Equivalent comparison of experiments. Ann Math Stat 24:265–272
10. Brandenburger A, Dekel E (1987) Rationalizability and correlated equilibria. Econometrica 55:1391–1402
11. Dhillon A, Mertens JF (1996) Perfect correlated equilibria. J Econ Theory 68:279–302
12. Dodis Y, Halevi S, Rabin T (2000) A cryptographic solution to a game theoretic problem. In: CRYPTO 2000: 20th International Cryptology Conference. Springer, Berlin, pp 112–130
13. Evangelista F, Raghavan TES (1996) A note on correlated equilibrium. Int J Game Theory 25:35–41
14. Forges F (1985) Correlated equilibria in a class of repeated games with incomplete information. Int J Game Theory 14:129–150
15. Forges F (1986) An approach to communication equilibrium. Econometrica 54:1375–1385
16. Forges F (1988) Communication equilibria in repeated games with incomplete information. Math Oper Res 13:191–231
17. Forges F (1990) Universal mechanisms. Econometrica 58:1341–1364
18. Forges F (1993) Five legitimate definitions of correlated equilibrium in games with incomplete information. Theor Decis 35:277–310
19. Forges F (2006) Correlated equilibrium in games with incomplete information revisited. Theor Decis 61:329–344
20. Foster D, Vohra R (1997) Calibrated learning and correlated equilibrium. Games Econ Behav 21:40–55
21. Gerardi D (2000) Interim pre-play communication. Mimeo, Yale University, New Haven
22. Gerardi D (2004) Unmediated communication in games with complete and incomplete information. J Econ Theory 114:104–131
23. Gerardi D, Myerson R (2007) Sequential equilibria in Bayesian games with communication. Games Econ Behav 60:104–134
24. Gilboa I, Zemel E (1989) Nash and correlated equilibria: some complexity considerations. Games Econ Behav 1:80–93
25. Gomez-Canovas S, Hansen P, Jaumard B (1999) Nash equilibria from the correlated equilibria viewpoint. Int Game Theor Rev 1:33–44
26. Gossner O (1998) Secure protocols or how communication generates correlation. J Econ Theory 83:69–89
27. Gossner O (2000) Comparison of information structures. Games Econ Behav 30:44–63
28. Gossner O, Tomala T (2007) Secret correlation in repeated games with signals. Math Oper Res 32:413–424
29. Halpern JY (2007) Computer science and game theory. In: Durlauf SN, Blume LE (eds) The New Palgrave Dictionary of Economics, 2nd edn. Palgrave Macmillan. http://www.dictionaryofeconomics.com/article?id=pde2008_C000566. Accessed 24 May 2008
30. Hart S, Schmeidler D (1989) Existence of correlated equilibria. Math Oper Res 14:18–25
31. Hart S, Mas-Colell A (2000) A simple adaptive procedure leading to correlated equilibrium. Econometrica 68:1127–1150
32. Hart S (2005) Adaptive heuristics. Econometrica 73:1401–1430
33. Krishna RV (2007) Communication in games of incomplete information: two players. J Econ Theory 132:584–592
34. Lehrer E (1991) Internal correlation in repeated games. Int J Game Theory 19:431–456
35. Lehrer E (1992) Correlated equilibria in two-player repeated games with non-observable actions. Math Oper Res 17:175–199
36. Lehrer E (1996) Mediated talk. Int J Game Theory 25:177–188
37. Lehrer E, Sorin S (1997) One-shot public mediated talk. Games Econ Behav 20:131–148
38. Lehrer E, Rosenberg D, Shmaya E (2006) Signaling and mediation in Bayesian games. Mimeo, Tel Aviv University, Tel Aviv
39. Milgrom P, Roberts J (1996) Coalition-proofness and correlation with arbitrary communication possibilities. Games Econ Behav 17:113–128
40. Moreno D, Wooders J (1996) Coalition-proof equilibrium. Games Econ Behav 17:80–112
41. Myerson R (1982) Optimal coordination mechanisms in generalized principal-agent problems. J Math Econ 10:67–81
42. Myerson R (1986a) Multistage games with communication. Econometrica 54:323–358
43. Myerson R (1986b) Acceptable and predominant correlated equilibria. Int J Game Theory 15:133–154
44. Myerson R (1997) Dual reduction and elementary games. Games Econ Behav 21:183–202
45. Nash J (1951) Non-cooperative games. Ann Math 54:286–295
46. Nau RF (1992) Joint coherence in games with incomplete information. Manage Sci 38:374–387
47. Nau RF, McCardle KF (1990) Coherent behavior in noncooperative games. J Econ Theory 50(2):424–444
48. Nau RF, McCardle KF (1991) Arbitrage, rationality and equilibrium. Theor Decis 31:199–240
49. Nau RF, Gomez-Canovas S, Hansen P (2004) On the geometry of Nash equilibria and correlated equilibria. Int J Game Theory 32:443–453
703
704
Correlated Equilibria and Communication in Games
50. Papadimitriou CH (2005) Computing correlated equilibria in multiplayer games. Proceedings of the 37th ACM Symposium on Theory of Computing. STOC, Baltimore, pp 49–56 51. Ray I (1996) Coalition-proof correlated equilibrium: a definition. Game Econ Behav 17:56–79 52. Renault J, Tomala T (2004) Communication equilibrium payoffs in repeated games with imperfect monitoring. Game Econ Behav 49:313–344 53. Solan E (2001) Characterization of correlated equilibrium in stochastic games. Int J Game Theory 30:259–277 54. Solan E, Vieille N (2002) Correlated equilibrium in stochastic games. Game Econ Behav 38:362–399 55. Urbano A, Vila J (2002) Computational complexity and communication: coordination in two-player games. Econometrica 70:1893–1927 56. Urbano A, Vila J (2004a) Computationally restricted unmediated talk under incomplete information. J Econ Theory 23:283– 320 57. Urbano A, Vila J (2004b) Unmediated communication in repeated games with imperfect monitoring. Game Econ Behav 46:143–173 58. Vida P (2007) From communication equilibria to correlated equilibria. Mimeo, University of Vienna, Vienna 59. Viossat Y (2005) Is having a unique equilibrium robust? to appear in J Math Econ
60. Viossat Y (2006) The geometry of Nash equilibria and correlated equilibria and a generalization of zero-sum games. Mimeo, S-WoPEc working paper 641, Stockholm School of Economics, Stockholm 61. Viossat Y (2007) The replicator dynamics does not lead to correlated equilibria. Game Econ Behav 59:397–407
Books and Reviews Forges F (1994) Non-zero sum repeated games and information transmission. In: Megiddo N (ed) Essays in Game Theory in Honor of Michael Maschler. Springer, Berlin, pp 65–95 Mertens JF (1994) Correlated- and communication equilibria. In: Mertens JF, Sorin S (eds) Game Theoretic Methods in General Equilibrium Analysis. Kluwer, Dordrecht, pp 243–248 Myerson R (1985) Bayesian equilibrium and incentive compatibility. In: Hurwicz L, Schmeidler D, Sonnenschein H (eds) Social Goals and Social Organization. Cambridge University Press, Cambridge, pp 229–259 Myerson R (1994) Communication, correlated equilibria and incentive compatibility. In: Aumann R, Hart S (eds) Handbook of Game Theory, vol 2. Elsevier, Amsterdam, pp 827–847 Sorin S (1997) Communication, correlation and cooperation. In: Mas Colell A, Hart S (eds) Cooperation: Game Theoretic Approaches. Springer, Berlin, pp 198–218
Correlations in Complex Systems
RENAT M. YULMETYEV 1,2, PETER HÄNGGI 3
1 Department of Physics, Kazan State University, Kazan, Russia
2 Tatar State University of Pedagogical and Humanities Sciences, Kazan, Russia
3 University of Augsburg, Augsburg, Germany

Article Outline

Glossary
Definition of the Subject
Introduction
Correlation and Memory in Discrete Non-Markov Stochastic Processes
Correlation and Memory in Discrete Non-Markov Stochastic Processes Generated by Random Events
Information Measures of Memory in Complex Systems
Manifestation of Strong Memory in Complex Systems
Some Perspectives on the Studies of Memory in Complex Systems
Bibliography

Glossary

Correlation A correlation describes the degree of relationship between two or more variables. Correlations arise from the impact of random factors and can be characterized by the methods of probability theory.

Correlation function The correlation function (abbreviated CF) provides a quantitative measure for the compact description of wide classes of correlations in complex systems (CS). The correlation function of two variables in statistical mechanics provides a measure of the mutual order existing between them. It quantifies the way random variables at different positions are correlated. For example, in a spin system it is the thermal average of the scalar product of the spins at two lattice points over all possible orderings.

Memory effects in stochastic processes through correlations Memory effects (abbreviated ME) appear at a more detailed, hierarchical level of the statistical description of correlations. ME reflect the complicated or hidden character of the creation, propagation and decay of correlations. ME are produced by inherent interactions and statistical after-effects in CS. For statistical systems, ME are induced by the contracted description of the evolution of the dynamic variables of a CS.
Memory functions Memory functions describe the mutual interrelations between the rates of change of random variables on different levels of the statistical description. The role of memory has had its roots in the natural sciences since 1906, when the famous Russian mathematician Markov wrote his first paper on the theory of Markov random processes. The theory is based on the notion of the instant loss of memory from the prehistory (memoryless property) of random processes.

Information measures of statistical memory in complex systems From the physical point of view, the time scales of correlation and memory cannot be treated as arbitrary. Therefore, one can introduce statistical quantifiers for the quantitative comparison of these time scales. They are dimensionless and possess statistical spectra on the different levels of the statistical description.

Definition of the Subject

As commonly used in probability theory and statistics, a correlation (also called the correlation coefficient) measures the strength and the direction of a linear relationship between two random variables. In a more general sense, a correlation or co-relation reflects the deviation of two (or more) variables from mutual independence, although correlation does not imply causation. In this broad sense there are various quantifiers which measure the degree of correlation, suited to the nature of the data. Increasing attention has been paid recently to the study, by means of non-equilibrium statistical physics, of statistical memory effects in random processes that originate in nature. The role of memory has its roots in the natural sciences since 1906, when Markov wrote his first paper on the theory of Markov Random Processes (MRP) [1]. His theory is based on the notion of an instant loss of memory from the prehistory (memoryless property) of random processes.
In contrast, there is an abundance of physical phenomena and processes which can be characterized by statistical memory effects: kinetic and relaxation processes in gases [2] and plasma [3], condensed matter physics (liquids [4], solids [5], and superconductivity [6]), astrophysics [7], nuclear physics [8], quantum [9] and classical [9] physics, to name only a few. At present, we have a whole toolbox of statistical methods available which can be efficiently used for the analysis of the memory effects occurring in diverse physical systems. Typical such schemes are the Zwanzig–Mori kinetic equations [10,11], generalized master equations and corresponding statistical quantifiers [12,13,14,15,16,17,18], Lee's recurrence relation method [19,20,21,22,23], the generalized Langevin equation (GLE) [24,25,26,27,28,29], etc.
Here we shall demonstrate that the presence of statistical memory effects is of salient importance for the functioning of diverse natural complex systems. In particular, the presence of large memory time scales in the stochastic dynamics of discrete time series can characterize a catastrophic (or, for living systems, pathological) violation of the salutary dynamic states of a CS. As an example, we will demonstrate that the emergence of strong memory time scales in the chaotic behavior of complex systems (CS) is accompanied by the likely initiation and existence of catastrophes and crises (earthquakes, financial crises, cardiac and brain attacks, etc.) in many CS, and especially by the existence of pathological states (diseases and illness) in living systems.
Introduction

A common definition [30] of a correlation measure ρ(X, Y) between two random variables X and Y with mean values E(X) and E(Y), fluctuations δX = X − E(X) and δY = Y − E(Y), and dispersions σ_X² = E(δX²) = E(X²) − E(X)² and σ_Y² = E(δY²) = E(Y²) − E(Y)² is

    ρ(X, Y) = E(δX δY) / (σ_X σ_Y) ,

where E denotes the expected value. Therefore we can write

    ρ(X, Y) = [E(XY) − E(X) E(Y)] / ([E(X²) − E(X)²]^{1/2} [E(Y²) − E(Y)²]^{1/2}) .

Here a correlation can be defined only if both dispersions are finite and nonzero. Due to the Cauchy–Schwarz inequality, a correlation cannot exceed 1 in absolute value. Consequently, a correlation attains its maximum value 1 in the case of an increasing linear relationship, its minimum −1 in the case of a decreasing linear relationship, and some value in between in all other cases, indicating the degree of linear dependence between the variables. The closer the coefficient is to either −1 or 1, the stronger the correlation between the variables. If the variables are independent then the correlation equals 0, but the converse is not true, because the correlation coefficient detects only linear dependencies between two variables. Since the absolute value of the sample correlation must be less than or equal to 1, the simple formula above conveniently suggests a single-pass algorithm for calculating sample correlations. The square of the sample correlation coefficient, also known as the coefficient of determination, is the fraction of the variance in y_i that is accounted for by a linear fit of x_i to y_i. This is written

    R²_{xy} = 1 − σ²_{y|x} / σ²_y ,

where σ²_{y|x} denotes the squared error of a linear regression of y_i on x_i in the equation y = a + bx,

    σ²_{y|x} = (1/n) Σ_{i=1}^{n} (y_i − a − b x_i)² ,

and σ²_y denotes just the dispersion of y. Note that since the sample correlation coefficient is symmetric in x_i and y_i, we obtain the same value for a fit of y_i to x_i:

    R²_{xy} = 1 − σ²_{x|y} / σ²_x .

This equation also gives an intuitive idea of the correlation coefficient for random (vector) variables of higher dimension. Just as the above described sample correlation coefficient is the fraction of variance accounted for by the fit of a 1-dimensional linear submanifold to a set of 2-dimensional vectors (x_i, y_i), so we can define a correlation coefficient for a fit of an m-dimensional linear submanifold to a set of n-dimensional vectors. For example, if we fit a plane z = a + bx + cy to a set of data (x_i, y_i, z_i), then the correlation coefficient of z to x and y is

    R² = 1 − σ²_{z|xy} / σ²_z .
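These relationships can be checked numerically. The following is an illustrative plain-Python sketch (the data points are invented for the example): it computes the sample correlation coefficient and the coefficient of determination of a least-squares fit, and confirms that R²_{xy} equals the squared correlation.

```python
import math

def pearson_r(x, y):
    """Sample correlation: rho = E(dX dY) / (sigma_X sigma_Y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    vx = sum((xi - mx) ** 2 for xi in x) / n
    vy = sum((yi - my) ** 2 for yi in y) / n
    return cov / math.sqrt(vx * vy)

def r_squared(x, y):
    """R^2 = 1 - sigma^2_{y|x} / sigma^2_y for the least-squares line y = a + b x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    resid = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / n  # sigma^2_{y|x}
    vy = sum((yi - my) ** 2 for yi in y) / n                         # sigma^2_y
    return 1.0 - resid / vy

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
r = pearson_r(x, y)
print(abs(r ** 2 - r_squared(x, y)) < 1e-12)  # prints True: R^2 equals r^2
```

Both quantities agree to floating-point precision, which is the algebraic identity stated in the text.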
Correlation and Memory in Discrete Non-Markov Stochastic Processes

Here we present a non-Markov approach [31,32] for the study of long-time correlations in the chaotic long-time dynamics of CS. For example, let the variable x_i be defined as the R-R interval, i.e. the time distance between nearest so-called R peaks occurring in a human electrocardiogram (ECG). The generalization consists in taking the non-stationarity of stochastic processes into account, with further application to the analysis of heart-rate variability. We should bear in mind that one of the key points of the spectral approach to the analysis of stochastic processes is the use of the normalized time correlation function (TCF)

    a₀(t) = ⟨⟨A(T) A(T + t)⟩⟩ / ⟨A(T)²⟩ .   (1)
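For a discrete stationary record x_1, …, x_N the TCF of Eq. (1) can be estimated by replacing the ensemble average with a time average over the sample. The sketch below is illustrative (the AR(1) test signal and the function name are this example's own, not the article's):

```python
import random

def tcf(series, max_lag):
    """Normalized time correlation function a0(t) of a stationary record.
    The record is centered so that the fluctuation A has zero mean; a0(0) = 1."""
    n = len(series)
    mean = sum(series) / n
    a = [s - mean for s in series]        # fluctuations A_i = x_i - <x>
    var = sum(v * v for v in a) / n       # <A^2>
    out = []
    for t in range(max_lag + 1):
        c = sum(a[i] * a[i + t] for i in range(n - t)) / (n - t)  # <A(T) A(T+t)>
        out.append(c / var)
    return out

# toy example: an exponentially decorrelating AR(1) signal with coefficient 0.8
random.seed(1)
x, xs = 0.0, []
for _ in range(20000):
    x = 0.8 * x + random.gauss(0.0, 1.0)
    xs.append(x)
acf = tcf(xs, 5)
print(acf[0])  # exactly 1.0 by construction; acf[1] is close to 0.8
```

For such an AR(1) process the estimated a₀(t) decays roughly as 0.8^t, the memoryless exponential decay expected of a Markovian signal.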
Here the time T indicates the beginning of a time series, A(t) is a state vector of a complex system as defined below in Eq. (5) at time t, |A(t)| is the length of the vector A(t), and the double angular brackets indicate a scalar product of vectors together with an ensemble average. The ensemble averaging is, of course, needed in Eq. (1) when correlation and other characteristic functions are constructed. The average and the scalar product become equivalent when a vector is composed of elements from a discrete-time sampling, as done later. Here a continuous formalism is discussed for convenience; from Sect. "Correlation and Memory in Discrete Non-Markov Stochastic Processes" onward we shall consider only the case of discrete processes. The above-stated definition is valid only for stationary systems. In the non-stationary case Eq. (1) does not hold and should be changed. The concept of the TCF can be generalized to the case of a discrete non-stationary sequence of signals. For this purpose the standard definition of the correlation coefficient in probability theory for two random signals X and Y must be taken into account:

    ρ = ⟨⟨X Y⟩⟩ / (σ_X σ_Y) ,   σ_X = ⟨|X|⟩ ,   σ_Y = ⟨|Y|⟩ .   (2)
In Eq. (2) the multi-component vectors X, Y are determined by the fluctuations of the signals x and y, respectively; σ_X², σ_Y² represent the dispersions of the signals x and y, and the values |X|, |Y| represent the lengths of the vectors X, Y, correspondingly. Therefore the function

    a(T, t) = ⟨⟨A(T) A(T + t)⟩⟩ / (⟨|A(T)|⟩ ⟨|A(T + t)|⟩)   (3)

can serve as the generalization of the concept of the TCF (1) to non-stationary processes A(T + t). The non-stationary TCF (3) obeys the conditions of normalization and attenuation of correlation:

    a(T, 0) = 1 ,   lim_{t→∞} a(T, t) = 0 .

Let us note that in a real CS the second limit typically does not hold, due to the possible occurrence of nonergodicity (meaning that a time average does not equal its ensemble average). According to Eqs. (1) and (3), for the quantitative description of non-stationarity it is convenient to introduce a function of non-stationarity

    γ(T, t) = ⟨|A(T + t)|⟩ / ⟨|A(T)|⟩ = [σ²(T + t) / σ²(T)]^{1/2} .   (4)
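A sketch, under stated assumptions, of how Eqs. (3)–(4) might be estimated from a discrete record: the state vector A(T) is taken here, hypothetically, as the vector of fluctuations of the k samples starting at index T (the article develops its own discrete construction in later sections).

```python
import math

def window_vector(series, start, k):
    """Hypothetical state vector A(T): fluctuations of the k samples from index T."""
    w = series[start:start + k]
    m = sum(w) / k
    return [v - m for v in w]

def nonstationary_tcf(series, T, t, k):
    """a(T, t) = <A(T) . A(T+t)> / (|A(T)| |A(T+t)|), cf. Eq. (3)."""
    a0 = window_vector(series, T, k)
    a1 = window_vector(series, T + t, k)
    dot = sum(p * q for p, q in zip(a0, a1))
    return dot / (math.hypot(*a0) * math.hypot(*a1))

def gamma(series, T, t, k):
    """gamma(T, t) = |A(T+t)| / |A(T)| = (sigma^2(T+t) / sigma^2(T))**0.5, cf. Eq. (4)."""
    a0 = window_vector(series, T, k)
    a1 = window_vector(series, T + t, k)
    return math.hypot(*a1) / math.hypot(*a0)

# a quasi-stationary toy signal: gamma stays near 1, so Gamma = 1 - gamma is near 0
series = [math.sin(0.3 * i) for i in range(300)]
g = gamma(series, 0, 50, 100)
print(round(g, 3), round(1 - g, 3))  # gamma near 1, Gamma near 0
```

For a genuinely non-stationary signal (e.g. one whose local variance grows with time) γ deviates from 1 and the dynamic parameter Γ = 1 − γ becomes non-negligible.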
One can see that this function equals the ratio of the lengths of the vectors of the final and initial states. In the case of a stationary process the dispersion does not vary with time (or its variation is very weak). Therefore the relations

    σ(T + t) = σ(T) ,   γ(T, t) = 1   (5)

hold true for a stationary process. Due to condition (5) the function

    Γ(T, t) = 1 − γ(T, t)   (6)

is suitable as a dynamic parameter of non-stationarity. This dynamic parameter can serve as a quantitative measure of the non-stationarity of the process under investigation. According to Eqs. (4)–(6) it is reasonable to suggest the existence of three different elementary classes of non-stationarity, distinguished by the magnitude of |Γ(T, t)| = |1 − γ(T, t)|.

With ε, δ > 1 one deals with a situation with moderate memory strength, and the case where both ε and δ are of order unity typically constitutes a more regular and robust random process exhibiting strong memory features.

Manifestation of Strong Memory in Complex Systems

The fundamental role of strong and weak memory in the functioning of the human organism and in seismic phenomena can be illustrated by the examples examined next. We will consider some examples of time series for both living and seismic systems. It is necessary to note that a comprehensive analysis of the experimental data includes the calculation and presentation of the corresponding phase portraits in planes of the dynamic orthogonal variables, the autocorrelation
time functions, the memory time functions and their frequency power spectra, etc. However, we start out by calculating two statistical quantifiers characterizing two informational measures of memory: the parameters ε₁(ω) and δ₁(ω). Figures 1 and 3 present the results of experimental data on pathological states of human cardiovascular systems (CVS). Figure 2 depicts the analysis for the seismic observations. Figures 4 and 5 indicate the memory effects for patients with Parkinson's disease (PD), and the last two Figs. 6 and 7 demonstrate the key role of the strength of memory in the case of time series of patients suffering from photosensitive epilepsy, contrasted with signals taken from healthy subjects. All these cases convincingly display the crucial role of the statistical memory in the functioning of complex (living and seismic) systems.

A characteristic role of the statistical memory can be detected from Fig. 1 for typical representatives taken from four different CVS groups: (a) a healthy subject, (b) a patient with rhythm driver migration, (c) a patient after myocardial infarction (MI), (d) a patient after MI with subsequent sudden cardiac death (SSCD). All these data were obtained from the short time series of the dynamics of RR-intervals from the electric signals of human ECGs. It can be seen here that significant memory effects typically lead to long-time correlations in complex systems. For the healthy subject we observe weak memory effects and large values of the memory measure, ε₁(ω = 0) ≈ 25. Strong memory and long memory times (approximately 10 times longer) are observed for the three patient groups: with rhythm driver migration (RDM) (b), after MI (c), and after MI with SSCD (d).

Figure 2 depicts the strong memory effects present in seismic phenomena. At the transition from the steady state of the Earth ((a), (b) and (c)) to the state of a strong earthquake (EQ) ((d), (e) and (f)) a remarkable amplification of memory effects is highly visible. The term amplification refers to the appearance of strong memory and the prolongation of the memory correlation time in the seismic system. Recent studies show that discrete non-Markov stochastic processes and long-range memory effects play a crucial role in the behavior of seismic systems. An approach permitting us to obtain an algorithm for strong-EQ forecasting and to differentiate technogenic explosions from weak EQs can be developed thereupon.

Correlations in Complex Systems, Figure 2
Frequency spectra of the first three points of the first measure of memory (non-Markovity parameters) ε₁(ω), ε₂(ω), and ε₃(ω) for the seismic phenomena: a, b, c long before the strong earthquake (EQ), for the steady state of the Earth, and d, e, f during the strong EQ. Markov and quasi-Markov behavior of the seismic signals with manifestation of weak memory is observed only for ε₁ in the state before the strong EQ. All remaining cases b, c, d and e relate to non-Markov processes. Strong non-Markovity and strong memory are typical for case d (state during the strong EQ). In the behavior of ε₂(ω) and ε₃(ω) one can see a transition from quasi-Markovity (at low frequencies) to strong non-Markovity (at high frequencies). From Fig. 6 in [105]

Figure 3 demonstrates an intensification of memory effects by one order of magnitude at the transition from healthy people ((a), (b) and (c)) to a patient suffering from myocardial infarction. The figures were calculated from the long time series of the RR-interval dynamics of human ECGs. The zero-frequency values ε₁(ω = 0) are sharply reduced, by approximately one order of magnitude, for the patient as compared to healthy subjects.

Correlations in Complex Systems, Figure 3
The frequency dependence of the first three points of the non-Markovity parameter (NMP) for a healthy person (a), (b), (c) and a patient after myocardial infarction (MI) (d), (e), (f), from the time dynamics of RR-intervals of human ECGs for the case of long time series. In the spectrum of the first point of the NMP, ε₁(ω), there is an appreciable low-frequency (long-time) component, which corresponds to quasi-Markov processes. The spectra of NMP ε₂(ω) and NMP ε₃(ω) fully comply with non-Markov processes within the whole range of frequencies. From Fig. 6 in [106]

Figures 4 and 5 illustrate the behavior for patients with Parkinson's disease. Figure 4 shows the time recording of the pathological tremor velocity in the left index finger of a patient with Parkinson's disease (PD) for eight diverse pathological cases (with or without medication, with or without deep brain stimulation (DBS), for various DBS, medication and time conditions). Figure 5, arranged in accordance with these conditions, displays a wide variety of memory effects in the treatment of PD patients. Due to the large impact of memory effects this observation permits us to develop an algorithm for the exact diagnosis of Parkinson's disease and the calculation of a quantitative parameter of the quality of treatment. The physical role of strong memory and of long memory correlation times enables us to extract vital information about the states of various patients on the basis of the notions of correlation and memory times.

According to Figs. 6 and 7, specific information about the physiological mechanism of photosensitive epilepsy (PSE) was obtained from the analysis of the strong
memory effects via the registration of the neuromagnetic responses in recordings of the magnetoencephalogram (MEG) of the human brain cortex. Figure 6 presents the topographic dependence of the first level of the second memory measure, δ₁(ω = 0; n), for the healthy subjects of the whole group (upper line) vs. patients (lower line) for the red/blue combination of the light stimulus. This topographic dependence, depicted in Fig. 6, clearly demonstrates the existence of long-range time correlations. It is accompanied by a sharp increase of the role of the statistical memory effects in all the MEG sensors, with sensor numbers n = 1, 2, …, 61, of the patient with PSE in comparison with healthy people. A sizable difference between the healthy subjects and a subject with PSE occurs. To emphasize the role of strong memory one can continue studying the topographic dependence in terms of a novel informational measure, the index of memory, defined as (see Fig. 7)

    ν(n) = δ₁^healthy(0; n) / δ₁^patient(0; n) .   (62)

Correlations in Complex Systems, Figure 4
Pathological tremor velocity in the left index finger of the sixth patient with Parkinson's disease (PD). The registration of the Parkinsonian tremor velocity is carried out under the following conditions: a "OFF-OFF" condition (no treatment), b "ON-ON" condition (deep brain stimulation (DBS) by an electromagnetic stimulator, plus medication), c "ON-OFF" condition (DBS only), d "OFF-ON" condition (medication (L-Dopa) only), e–h the "15 OFF", "30 OFF", "45 OFF", "60 OFF" conditions – the patient's states 15 (30, 45, 60) minutes after the DBS is switched off, with no treatment. Note the scale of the pathological tremor amplitude (see the vertical scale). Such a representation of the time series allows us to note the increase or decrease of the pathological tremor. From Fig. 1 in [107]

This measure quantifies the detailed memory effects in the individual MEG sensors of the patient with PSE versus the healthy group. A sharp increase of the role of the memory effects in the stochastic behavior of the magnetic signals is clearly detected in sensor numbers n = 10, 46, 51, 53 and 59. The observed MEG sensor locations identify the regions of a protective mechanism against PSE in the human organism: the frontal (sensor 10), occipital (sensors 46, 51 and 53) and right parietal (sensor 59) regions. The early activity in these sensors may reflect a protective mechanism suppressing the cortical hyperactivity due to the chromatic flickering. We remark that some early steps towards understanding the normal and the various catastrophic states of complex systems have already been taken in many fields of
science such as cardiology, physiology, medicine, neurology, clinical neurophysiology, neuroscience, seismology and so forth. With the underlying systems showing fractal and complicated spatial structures, numerous studies applying linear and nonlinear time series analysis to various complex systems have been discussed by many authors. Specifically, the results obtained show evidence of significant nonlinear structure in the signals registered from control subjects, whereas nonlinearity was not detected for the patients and the catastrophic states. Moreover, the couplings between distant parts and regions were found to be stronger for the control subjects. These prior findings lead to the hypothesis that real normal complex systems are mostly equipped with significantly nonlinear subsystems, reflecting an inherent mechanism that guards against synchronous excitation by outside impacts or inside disturbances. Such nonlinear mechanisms are likely absent in the catastrophic or pathological states of complex systems.
Correlations in Complex Systems, Figure 5 The frequency dependence of the first point of the non-Markovity parameter ε₁(ν) for the pathological tremor velocity in a patient. As an example, the sixth patient with Parkinson's disease is chosen. The panels are arranged according to the arrangement of the initial time series. Characteristic low-frequency oscillations are observed in the frequency dependence (a, e–h), which get suppressed under medical influence (b–d). The non-Markovity parameter reflects the Markov and non-Markov components of the initial time signal. The value of the parameter at zero frequency, ε₁(0), reflects the total dynamics of the initial time signal. The maximal values of the parameter ε₁(0) correspond to small amplitudes of the pathological tremor velocity; the minimal values are characteristic of significant pathological tremor velocities. The comparative analysis of the frequency dependence ε₁(ν) allows us to estimate the efficiency of each method of treatment. From Fig. 5 in [107]
Correlations in Complex Systems, Figure 6 The topographic dependence of the first point of the second measure of memory, δ₁(ω = 0; n), for the healthy subjects on average over the whole group (upper line) vs. a patient (lower line) for the R/B combination of the light stimulus. One can note the singularly weak memory effects for the healthy average in the sensors with No. 5, 23, 14, 11 and 9
Correlations in Complex Systems, Figure 7 The topographic dependence of the memory index ν(n) for the whole group of healthy subjects on average vs. a patient for the R/B combination of the light stimulus. Strong memory in the patient vs. the healthy appears clearly in the sensors with No. 10, 5, 23, 40 and 53
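As an illustration of the index of memory in Eq. (62), the sketch below computes ν(n) for fabricated δ₁(0, n) values (the numbers are invented for illustration; the article's data cover 61 MEG sensors) and flags the sensors where ν(n) is large, which, assuming as for ε₁ that a small δ₁ signals strong memory, marks strong memory in the patient.

```python
# Hypothetical delta_1(omega = 0, n) values for 10 sensors (illustrative only).
delta_healthy = [4.0, 3.5, 5.1, 4.2, 6.0, 3.8, 4.4, 5.5, 4.9, 4.1]
delta_patient = [0.9, 3.2, 1.1, 3.9, 0.7, 3.6, 4.0, 1.0, 4.5, 0.8]

# Index of memory, Eq. (62): nu(n) = delta_1^healthy(0, n) / delta_1^patient(0, n)
nu = [h / p for h, p in zip(delta_healthy, delta_patient)]

# Sensors where memory in the patient is much stronger than in the healthy group
# (large nu means the patient's delta_1 is small, i.e. strong memory effects,
# under the stated assumption about the meaning of small delta_1).
strong = [n for n, v in enumerate(nu) if v > 3.0]
print(strong)  # -> [0, 2, 4, 7, 9]
```

A topographic plot of ν(n) over the sensor index, as in Fig. 7, is then simply this list of ratios drawn against n.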
From the physical point of view our results can be used as a toolbox for testing and identifying the presence or absence of various memory effects as they occur in complex systems. The set of our memory quantifiers is uniquely associated with the appearance of memory features in the chaotic behavior of the observed signals. The registration of the behavior of these indicators, as elucidated here, is of beneficial use for detecting catastrophic or pathological states in complex systems. There exist alternative quantifiers of a different nature as well, such as the Lyapunov exponents, the Kolmogorov–Sinai entropy, the correlation dimension, etc., which are widely used in nonlinear dynamics and relevant applications. In the present context, we have found that the employed memory measures are not only convenient for the analysis but are also ideally suited for the identification of anomalous behavior occurring in complex systems. The search for other quantifiers, and foremost for ways of optimizing such measures when applied to complex discrete-time dynamics, presents a real challenge, especially in attempts towards the identification and quantification of functioning in complex systems. This work presents initial steps towards an understanding of the basic foundation of anomalous processes in complex systems on the basis of a study of the underlying memory effects and, connected with this, the occurrence of long-lasting correlations.

Some Perspectives on the Studies of Memory in Complex Systems

Here we present a few outlooks on the fundamental role of statistical memory in complex systems. This involves the issue of studying cross-correlations. A statistical theory of the stochastic dynamics of cross-correlations can be created on the basis of the aforementioned projection-operator formalism in the linear space of random variables. As a result we obtain cross-correlation memory functions (MFs) revealing the statistical memory effects in complex systems. Some memory quantifiers will appear simultaneously which reflect cross-correlations between different parts of a CS. Cross-correlation MFs can be very useful for the analysis of weak and strong interactions, signifying interrelations between different groups of random variables in a CS. Besides that, cross-correlations can be important for the problem of phase synchronization, offering a unique way of studying synchronization phenomena in CS; this has special importance for studying aspects of brain and living-systems dynamics. Additional information about strong and weak memory effects can be extracted from the observation of correlations in CS on the scale of random events. Similar effects play a crucial role in the differentiation between stochastic phenomena within astrophysical systems, for example in galaxies, pulsars, quasars, microquasars, lacertides, black holes, etc. One of the most important areas of application of the developed approach is bispectral and polyspectral analysis for diverse CS. From the mathematical point of view a correct definition of the spectral properties in the functional space of random functions is quite important. A variety of MFs arises in the quantitative analysis of the fine details of memory effects in a nonlinear manner. The quantitative control of treatment quality in diverse areas of medicine and physiology may be one of the important biomedical applications of the manifestation of strong memory effects. These and other features of memory effects in CS call for an advanced development of brain studies on the basis of EEG and MEG data, of studies of the cardiovascular, locomotor
and respiratory human systems, and for the development of control systems for information flows in living systems. An example is the prediction of strong EQs and the clear differentiation between the occurrence of weak EQs and technogenic explosions. In conclusion, we hope that the interested reader becomes invigorated by this presentation of the correlation and memory analysis of inherent nonlinear system dynamics of varying complexity. The reader can find further details on how significant memory effects typically cause long-time correlations in complex systems by inspecting more closely some of the published items in [42–103]. The relationships between standard fractional and polyfractal processes and long-time correlations in complex systems are explained in detail in [39,40,44,45,46,49,53,54,60,62,64,76,79,83,84,94]. An example of using the Hurst exponent over time for testing the assertion that emerging markets are becoming more efficient can be found in [51]. While over 30 measures of complexity have been proposed in the research literature, one can single out [42,55,66,81,89,99] for their specific treatment of long-time correlation and memory effects. The papers [48,57] focus on long-range correlation processes that are nonlocal in time and hence show memory effects. The statistical characterization of non-stationarities in real-world time series is an important topic in many fields of research, and numerous methods of characterizing non-stationary time series have been offered in [59,65,84]. Long-range correlated time series have been widely used in [52,61,63,68,74] for the theoretical description of diverse phenomena. An example of the study of the anatomy of extreme events in a complex adaptive system can be found in [67]. Approaches for modeling long-time and long-range correlations in complex systems from time series are investigated and applied to different examples in [50,56,69,70,73,75,80,82,86,100,101,102].
Detecting scale invariance and its fundamental relationships with statistical structures is one of the most relevant problems among those addressed by correlation analysis [47,71,72,91]. Specific long-range correlations in complex systems are the object of active research due to their implications in the technology of materials and in several fields of scientific knowledge, with the use of quantified histograms [78], the decrease of chaos in heart failure [85], scaling properties of the fluctuations of ECG signals [87], transport properties in correlated systems [88], etc.
It is demonstrated in [43,92,93] how ubiquitous long-range correlations are in typical and exotic complex statistical systems, with applications to biology, medicine, economics, and to time-clustering properties [95,98]. Scale-dependent wavelet and spectral measures for assessing cardiac dysfunction have been used in [97]. In recent years the study of an increasing number of natural phenomena that appear to deviate from standard statistical distributions has kindled interest in alternative formulations of statistical mechanics [58,101]. Finally, papers [77,90] present examples of the deep and multiple interplay between discrete and continuous long-time correlations and memory in complex systems, and of the corresponding modeling of discrete time series on the basis of the physical Zwanzig–Mori kinetic equation for Hamiltonian statistical systems.
Bibliography
Primary Literature
1. Markov AA (1906) Two-dimensional Brownian motion and harmonic functions. Proc Phys Math Soc Kazan Imp Univ 15(4):135–178; in Russian
2. Chapman S, Couling TG (1958) The mathematical theory of nonuniform gases. Cambridge University Press, Cambridge
3. Albeverio S, Blanchard P, Steil L (1990) Stochastic processes and their applications in mathematics and physics. Kluwer, Dordrecht
4. Rice SA, Gray P (1965) The statistical mechanics of simple liquids. Interscience, New York
5. Kubo R, Toda M, Hashitsume N, Saito N (2003) Statistical physics II: Nonequilibrium statistical mechanics. In: Fulde P (ed) Springer Series in Solid-State Sciences, vol 31. Springer, Berlin, p 279
6. Ginzburg VL, Andryushin E (2004) Superconductivity. World Scientific, Singapore
7. Sachs I, Sen S, Sexton J (2006) Elements of statistical mechanics. Cambridge University Press, Cambridge
8. Fetter AL, Walecka JD (1971) Quantum theory of many-particle physics. Mc Graw-Hill, New York
9. Chandler D (1987) Introduction to modern statistical mechanics. Oxford University Press, Oxford
10.
Zwanzig R (2001) Nonequilibrium statistical mechanics. Cambridge University Press, Cambridge
11. Zwanzig R (1961) Memory effects in irreversible thermodynamics. Phys Rev 124:983–992
12. Mori H (1965) Transport, collective motion and Brownian motion. Prog Theor Phys 33:423–455; Mori H (1965) A continued fraction representation of the time correlation functions. Prog Theor Phys 34:399–416
13. Grabert H, Hänggi P, Talkner P (1980) Microdynamics and nonlinear stochastic processes of gross variables. J Stat Phys 22:537–552
14. Grabert H, Talkner P, Hänggi P (1977) Microdynamics and time-evolution of macroscopic non-Markovian systems. Z Physik B 26:389–395
Correlations in Complex Systems
15. Grabert H, Talkner P, Hänggi P, Thomas H (1978) Microdynamics and time-evolution of macroscopic non-Markovian systems II. Z Physik B 29:273–280
16. Hänggi P, Thomas H (1977) Time evolution, correlations and linear response of non-Markov processes. Z Physik B 26:85–92
17. Hänggi P, Talkner P (1983) Memory index of first-passage time: A simple measure of non-Markovian character. Phys Rev Lett 51:2242–2245
18. Hänggi P, Thomas H (1982) Stochastic processes: Time-evolution, symmetries and linear response. Phys Rep 88:207–319
19. Lee MH (1982) Orthogonalization process by recurrence relations. Phys Rev Lett 49:1072–1072; Lee MH (1983) Can the velocity autocorrelation function decay exponentially? Phys Rev Lett 51:1227–1230
20. Balucani U, Lee MH, Tognetti V (2003) Dynamic correlations. Phys Rep 373:409–492
21. Hong J, Lee MH (1985) Exact dynamically convergent calculations of the frequency-dependent density response function. Phys Rev Lett 55:2375–2378
22. Lee MH (2000) Heisenberg, Langevin, and current equations via the recurrence relations approach. Phys Rev E 61:3571–3578; Lee MH (2000) Generalized Langevin equation and recurrence relations. Phys Rev E 62:1769–1772
23. Lee MH (2001) Ergodic theory, infinite products, and long time behavior in Hermitian models. Phys Rev Lett 87(1–4):250601
24. Kubo R (1966) Fluctuation-dissipation theorem. Rep Progr Phys 29:255–284
25. Kawasaki K (1970) Kinetic equations and time correlation functions of critical fluctuations. Ann Phys 61:1–56
26. Michaels IA, Oppenheim I (1975) Long-time tails and Brownian motion. Physica A 81:221–240
27. Frank TD, Daffertshofer A, Peper CE, Beek PJ, Haken H (2001) H-theorem for a mean field model describing coupled oscillator systems under external forces. Physica D 150:219–236
28. Vogt M, Hernandez R (2005) An idealized model for nonequilibrium dynamics in molecular systems. J Chem Phys 123(1–8):144109
29.
Sen S (2006) Solving the Liouville equation for conservative systems: Continued fraction formalism and a simple application. Physica A 360:304–324
30. Prokhorov YV (1999) Probability and mathematical statistics (encyclopedia). Scien Publ Bolshaya Rossiyskaya Encyclopedia, Moscow
31. Yulmetyev R et al (2000) Stochastic dynamics of time correlation in complex systems with discrete time. Phys Rev E 62:6178–6194
32. Yulmetyev R et al (2002) Quantification of heart rate variability by discrete nonstationarity non-Markov stochastic processes. Phys Rev E 65(1–15):046107
33. Reed M, Samon B (1972) Methods of mathematical physics. Academic, New York
34. Graber H (1982) Projection operator technique in nonequilibrium statistical mechanics. In: Höhler G (ed) Springer tracts in modern physics, vol 95. Springer, Berlin
35. Yulmetyev RM (2001) Possibility between earthquake and explosion seismogram differentiation by discrete stochastic non-Markov processes and local Hurst exponent analysis. Phys Rev E 64(1–14):066132
36. Abe S, Suzuki N (2004) Aging and scaling of earthquake aftershocks. Physica A 332:533–538
37. Tirnakli U, Abe S (2004) Aging in coherent noise models and natural time. Phys Rev E 70(1–4):056120
38. Abe S, Sarlis NV, Skordas ES, Tanaka HK, Varotsos PA (2005) Origin of the usefulness of the natural-time representation of complex time series. Phys Rev Lett 94(1–4):170601
39. Stanley HE, Meakin P (1988) Multifractal phenomena in physics and chemistry. Nature 335:405–409
40. Ivanov PCh, Amaral LAN, Goldberger AL, Havlin S, Rosenblum MG, Struzik Z, Stanley HE (1999) Multifractality in human heartbeat dynamics. Nature 399:461–465
41. Mokshin AV, Yulmetyev R, Hänggi P (2005) Simple measure of memory for dynamical processes described by a generalized Langevin equation. Phys Rev Lett 95(1–4):200601
42. Allegrini P et al (2003) Compression and diffusion: A joint approach to detect complexity. Chaos Soliton Fractal 15:517–535
43. Amaral LAN et al (2001) Application of statistical physics methods and concepts to the study of science and technology systems. Scientometrics 51:9–36
44. Arneodo A et al (1996) Wavelet based fractal analysis of DNA sequences. Physica D 96:291–320
45. Ashkenazy Y et al (2003) Magnitude and sign scaling in power-law correlated time series. Physica A Stat Mech Appl 323:19–41
46. Ashkenazy Y et al (2003) Nonlinearity and multifractality of climate change in the past 420,000 years. Geophys Res Lett 30:2146
47. Azbel MY (1995) Universality in a DNA statistical structure. Phys Rev Lett 75:168–171
48. Baldassarri A et al (2006) Brownian forces in sheared granular matter. Phys Rev Lett 96:118002
49. Baleanu D et al (2006) Fractional Hamiltonian analysis of higher order derivatives systems. J Math Phys 47:103503
50. Blesic S et al (2003) Detecting long-range correlations in time series of neuronal discharges. Physica A 330:391–399
51. Cajueiro DO, Tabak BM (2004) The Hurst exponent over time: Testing the assertion that emerging markets are becoming more efficient. Physica A 336:521–537
52.
Brecht M et al (1998) Correlation analysis of corticotectal interactions in the cat visual system. J Neurophysiol 79:2394–2407
53. Brouersa F, Sotolongo-Costab O (2006) Generalized fractal kinetics in complex systems (application to biophysics and biotechnology). Physica A 368(1):165–175
54. Coleman P, Pietronero L (1992) The fractal structure of the universe. Phys Rep 213:311–389
55. Goldberger AL et al (2002) What is physiologic complexity and how does it change with aging and disease? Neurobiol Aging 23:23–26
56. Grau-Carles P (2000) Empirical evidence of long-range correlations in stock returns. Physica A 287:396–404
57. Grigolini P et al (2001) Asymmetric anomalous diffusion: An efficient way to detect memory in time series. Fractal Complex Geom Pattern Scaling Nat Soc 9:439–449
58. Ebeling W, Frommel C (1998) Entropy and predictability of information carriers. Biosystems 46:47–55
59. Fukuda K et al (2004) Heuristic segmentation of a nonstationary time series. Phys Rev E 69:021108
60. Hausdorff JM, Peng CK (1996) Multiscaled randomness: A possible source of 1/f noise in biology. Phys Rev E 54:2154–2157
61. Herzel H et al (1998) Interpreting correlations in biosequences. Physica A 249:449–459
62. Hoop B, Peng CK (2000) Fluctuations and fractal noise in biological membranes. J Membrane Biol 177:177–185
63. Hoop B et al (1998) Temporal correlation in phrenic neural activity. In: Hughson RL, Cunningham DA, Duffin J (eds) Advances in modelling and control of ventilation. Plenum Press, New York, pp 111–118
64. Ivanova K, Ausloos M (1999) Application of the detrended fluctuation analysis (DFA) method for describing cloud breaking. Physica A 274:349–354
65. Ignaccolo M et al (2004) Scaling in non-stationary time series. Physica A 336:595–637
66. Imponente G (2004) Complex dynamics of the biological rhythms: Gallbladder and heart cases. Physica A 338:277–281
67. Jefferiesa P et al (2003) Anatomy of extreme events in a complex adaptive system. Physica A 318:592–600
68. Karasik R et al (2002) Correlation differences in heartbeat fluctuations during rest and exercise. Phys Rev E 66:062902
69. Kulessa B et al (2003) Long-time autocorrelation function of ECG signal for healthy versus diseased human heart. Acta Phys Pol B 34:3–15
70. Kutner R, Switala F (2003) Possible origin of the non-linear long-term autocorrelations within the Gaussian regime. Physica A 330:177–188
71. Koscielny-Bunde E et al (1998) Indication of a universal persistence law governing atmospheric variability. Phys Rev Lett 81:729–732
72. Labini F (1998) Scale invariance of galaxy clustering. Phys Rep 293:61–226
73. Linkenkaer-Hansen K et al (2001) Long-range temporal correlations and scaling behavior in human brain oscillations. J Neurosci 21:1370–1377
74. Mercik S et al (2000) What can be learnt from the analysis of short time series of ion channel recordings. Physica A 276:376–390
75. Montanari A et al (1999) Estimating long-range dependence in the presence of periodicity: An empirical study. Math Comp Model 29:217–228
76. Mark N (2004) Time fractional Schrodinger equation. J Math Phys 45:3339–3352
77.
Niemann M et al (2008) Usage of the Mori–Zwanzig method in time series analysis. Phys Rev E 77:011117
78. Nigmatullin RR (2002) The quantified histograms: Detection of the hidden unsteadiness. Physica A 309:214–230
79. Nigmatullin RR (2006) Fractional kinetic equations and universal decoupling of a memory function in mesoscale region. Physica A 363:282–298
80. Ogurtsov MG (2004) New evidence for long-term persistence in the sun’s activity. Solar Phys 220:93–105
81. Pavlov AN, Dumsky DV (2003) Return times dynamics: Role of the Poincare section in numerical analysis. Chaos Soliton Fractal 18:795–801
82. Paulus MP (1997) Long-range interactions in sequences of human behavior. Phys Rev E 55:3249–3256
83. Peng C-K et al (1994) Mosaic organization of DNA nucleotides. Phys Rev E 49:1685–1689
84. Peng C-K et al (1995) Quantification of scaling exponents and crossover phenomena in nonstationary heartbeat time series. Chaos 5:82–87
85. Poon CS, Merrill CK (1997) Decrease of cardiac chaos in congestive heart failure. Nature 389:492–495
86. Rangarajan G, Ding MZ (2000) Integrated approach to the assessment of long range correlation in time series data. Phys Rev E 61:4991–5001
87. Robinson PA (2003) Interpretation of scaling properties of electroencephalographic fluctuations via spectral analysis and underlying physiology. Phys Rev E 67:032902
88. Rizzo F et al (2005) Transport properties in correlated systems: An analytical model. Phys Rev B 72:155113
89. Shen Y et al (2003) Dimensional complexity and spectral properties of the human sleep EEG. Clinic Neurophysiol 114:199–209
90. Schmitt D et al (2006) Analyzing memory effects of complex systems from time series. Phys Rev E 73:056204
91. Soen Y, Braun F (2000) Scale-invariant fluctuations at different levels of organization in developing heart cell networks. Phys Rev E 61:R2216–R2219
92. Stanley HE et al (1994) Statistical-mechanics in biology – how ubiquitous are long-range correlations. Physica A 205:214–253
93. Stanley HE (2000) Exotic statistical physics: Applications to biology, medicine, and economics. Physica A 285:1–17
94. Tarasov VE (2006) Fractional variations for dynamical systems: Hamilton and Lagrange approaches. J Phys A Math Gen 39:8409–8425
95. Telesca L et al (2003) Investigating the time-clustering properties in seismicity of Umbria-Marche region (central Italy). Chaos Soliton Fractal 18:203–217
96. Turcott RG, Teich MC (1996) Fractal character of the electrocardiogram: Distinguishing heart-failure and normal patients. Ann Biomed Engin 24:269–293
97. Thurner S et al (1998) Receiver-operating-characteristic analysis reveals superiority of scale-dependent wavelet and spectral measures for assessing cardiac dysfunction. Phys Rev Lett 81:5688–5691
98. Vandewalle N et al (1999) The moving averages demystified. Physica A 269:170–176
99.
Varela M et al (2003) Complexity analysis of the temperature curve: New information from body temperature. Eur J Appl Physiol 89:230–237
100. Varotsos PA et al (2002) Long-range correlations in the electric signals that precede rupture. Phys Rev E 66:011902
101. Watters PA (2000) Time-invariant long-range correlations in electroencephalogram dynamics. Int J Syst Sci 31:819–825
102. Wilson PS et al (2003) Long-memory analysis of time series with missing values. Phys Rev E 68:017103
103. Yulmetyev RM et al (2004) Dynamical Shannon entropy and information Tsallis entropy in complex systems. Physica A 341:649–676
104. Yulmetyev R, Hänggi P, Gafarov F (2000) Stochastic dynamics of time correlation in complex systems with discrete time. Phys Rev E 62:6178
105. Yulmetyev R, Gafarov F, Hänggi P, Nigmatullin R, Kayumov S (2001) Possibility between earthquake and explosion seismogram processes and local Hurst exponent analysis. Phys Rev E 64:066132
106. Yulmetyev R, Hänggi P, Gafarov F (2002) Quantification of heart rate variability by discrete nonstationary non-Markov stochastic processes. Phys Rev E 65:046107
107. Yulmetyev R, Demin SA, Panischev OY, Hänggi P, Timashev SF, Vstovsky GV (2006) Regular and stochastic behavior of Parkinsonian pathological tremor signals. Physica A 369:655
Books and Reviews
Badii R, Politi A (1999) Complexity: Hierarchical structures and scaling in physics. Oxford University Press, New York
Elze H-T (ed) (2004) Decoherence and entropy in complex systems. Selected lectures from DICE 2002. Lecture notes in physics, vol 633. Springer, Heidelberg
Kantz H, Schreiber T (2004) Nonlinear time series analysis. Cambridge University Press, Cambridge
Mallamace F, Stanley HE (2004) The physics of complex systems (new advances and perspectives). IOS Press, Amsterdam
Parisi G, Pietronero L, Virasoro M (1992) Physics of complex systems: Fractals, spin glasses and neural networks. Physica A 185(1–4):1–482
Sprott JC (2003) Chaos and time-series analysis. Oxford University Press, New York
Zwanzig R (2001) Nonequilibrium statistical physics. Oxford University Press, New York
Cost Sharing
MAURICE KOSTER
University of Amsterdam, Amsterdam, Netherlands
Article Outline
Glossary
Definition of the Subject
Introduction
Cooperative Cost Games
Non-cooperative Cost Games
Continuous Cost Sharing Models
Future Directions
Bibliography
Glossary
Core The core of a cooperative cost game ⟨N, c⟩ is the set of all coalitionally stable vectors of cost shares.
Cost function A cost function relates each level of output of a given production technology to the minimal number of units of input needed to generate it. It is a non-decreasing function c : X → R_+, where X is the (ordered) space of outputs.
Cost sharing problem A cost sharing problem is an ordered pair (q, c), where q ∈ R_+^N is a profile of individual demands of a fixed and finite group of agents N = {1, 2, …, n} and c is a cost function.
Game theory The branch of applied mathematics and economics that studies situations where players make decisions in an attempt to maximize their returns. The essential feature is that it provides a formal modeling approach to social situations in which decision makers interact.
Cost sharing rule A cost sharing rule is a mapping that assigns to each cost sharing problem under consideration a vector of non-negative cost shares.
Demand game Strategic game in which agents place their demands for output strategically.
Demand revelation game Strategic game in which agents announce their maximal contributions strategically.
Strategic game An ordered triple G = ⟨N, (A_i)_{i∈N}, (≿_i)_{i∈N}⟩, where N = {1, 2, …, n} is the set of players, A_i is the set of available actions for player i, and ≿_i is a preference relation over the set C of possible consequences of actions.
Definition of the Subject
Throughout we will use a fixed set of agents N = {1, 2, …, n}, where n is a given natural number. For subsets S, T of N, we write S ⊆ T if each element of S is contained in T; T\S denotes the set of agents in T except those in S. The power set of N is the set of all subsets of N; each coalition S ⊆ N will be identified with the element 1_S ∈ {0, 1}^N, the vector whose ith coordinate equals 1 precisely when i ∈ S. Fix a vector x ∈ R^N and S ⊆ N. The projection of x on R^S is denoted x_S, and x_{N\S} is sometimes more conveniently denoted x_{−S}. For any y ∈ R^S, (x_{−S}, y) stands for the vector z ∈ R^N such that z_i = x_i if i ∈ N\S and z_i = y_i if i ∈ S. We denote x(S) = Σ_{i∈S} x_i. The vector in R^S with all coordinates equal to zero is denoted 0_S. Other notation will be introduced when necessary.
This article focuses on different approaches in the literature through a discussion of a couple of basic and illustrative models, each involving a single facility for the production of a finite set M of outputs, commonly shared by a fixed set N := {1, 2, …, n} of agents. The feasible set of outputs for the technology is identified with a set X ⊆ R_+^M. It is assumed that the users of the technology may freely dispose over any desired quantity or level of the outputs; each agent i has some demand x_i ∈ X for output. Each profile of demands x ∈ X^N is associated with its cost c(x), i.e. the minimal amount of the idiosyncratic input commodity needed to fulfill the individual demands. This defines the cost function c : X^N → R_+ for the technology, comprising all the production externalities. A cost sharing problem is an ordered pair (x, c) of a demand profile x and a cost function c. The interpretation is that x is produced and the resulting cost c(x) has to be shared by the collective N. Numerous practical applications fit this general description of a cost sharing problem.
In mathematical terms a cost sharing problem is equivalent to a production sharing problem, in which output is shared based on the profile of inputs. However, although many concepts are just as meaningful as they are in the cost sharing context, results are not at all easily established using this mathematical duality. In this sense, consider [68] as a warning to the reader, showing that the strategic analysis of cost sharing solutions is quite different from that of surplus sharing solutions. This monograph will center on cost sharing problems. For further reference on production sharing see [51,67,68,91].
Introduction
In many practical situations managers or policy-makers deal with private or public enterprises with multiple users.
A production technology facilitates its users, causing externalities that have to be shared. Applications are numerous, ranging from environmental issues like pollution and fishing grounds, to sharing multipurpose reservoirs, road systems, communication networks, and the Internet. The essence in all these examples is that a manager cannot directly influence the behavior of the users, but only indirectly, by addressing the externalities through some decentralization device. By choosing the right instrument the manager may help to shape and control the nature of the resulting individual and aggregate behavior. This is what is usually understood as the mechanism design or implementation paradigm. The state-of-the-art literature shows for a couple of simple but illustrative cost sharing models that one cannot push these principles too far, as there is often a trade-off between the degree of distributive justice and economic efficiency. This is what makes choosing ‘the’ right solution an ambiguous task, certainly without a profound understanding of the basic allocation principles. First some examples will be discussed.
Example 1 The water-resource management problem of the Tennessee Valley Authority (TVA) in the 1930s is a classic in the cost-sharing literature. It concerns the construction of a dam in a river to create a reservoir, which can be used for different purposes like flood control, hydro-electric power, irrigation, and municipal supply. Each combination of purposes requires a certain dam height, and the accompanying construction costs have to be shared by the purposes. Typical for this type of problem is that up to a certain critical height there are economies of scale, as the marginal costs of extra height are decreasing. Afterwards, marginal costs increase due to technological constraints. The problem here is to allocate the construction costs of a specific dam among the relevant purposes.
Example 2 Another illustrative cost sharing problem dating back to the early days of the cost sharing literature [69,70] deals with landing fee schedules at airports, so-called airport problems. These were often established to cover the costs of building and maintaining the runways. The cost of a runway essentially depends on the size of the largest type of airplane that has to be accommodated – a long runway can be used by smaller types as well. Suppose there are m types of airplanes and that c_i is the cost of constructing a landing strip suitable for type i. Moreover, index the types from small to large so that 0 = c_0 < c_1 < c_2 < … < c_m. In the above terminology the technology can be described by X = {0, 1, 2, …, m}, and the cost function c : X^N → R_+ is defined by c(x) = c_k, where k = max{x_i | i ∈ N} is the maximal service level required in x. Suppose that in a given
year, N_k is the set of landings of type k airplanes; then the set of users of the runway is N = ∪_k N_k. The problem is now to apportion the full cost c(x) of the runway to the users in N, where x is the demand vector given by x_i = ℓ if i ∈ N_ℓ. Airport problems describe a wide range of cost sharing problems, ranging from sharing the maintenance cost of a ditch system for irrigation projects [1] to sharing the dredging costs in harbors [14].
Example 3 A joint project involves a number of activities for which the estimated durations and precedence relations are known. Delay in each of these components affects the period in which the project can be realized. A cost sharing problem arises when the joint costs due to the accumulated delay are shared among the individuals causing the delays. See [21].
Example 4 In many applications the production technology is given by a network G = (V, E) with node set V, a set of costly edges E ⊆ V × V, and a cost function c : E → R_+. The demands of the agents are now parts of the infrastructure, i.e. subsets of E. Examples include sharing the cost of infrastructure for the supply of energy and water, or transport systems. For example, the above airport problem can be modeled as such with V = {1, 2, …, m} ∪ {β}, E = {(β, 1), (1, 2), …, (m − 1, m)}. Graphically, the situation is depicted by the line graph in Fig. 1. Imagine that the runway starts at the special node β, and that the edges depict the different pieces of runway served to the players. An airplane of type k is situated at node k, and needs all edges towards β. The edge to the left of the kth node is called e_k = (k − 1, k), and the corresponding cost is c(e_k) = c_k − c_{k−1}. The demand of an airplane at node k is now described by the edges on the path from node k to β.
Example 5 In more general network design problems, a link facilitates a flow; for instance, in telecommunication it is data flowing through the network, in road systems it is traffic.
[49] discusses a model where a network planner allocates the fixed cost of a network based on the individual
Cost Sharing, Figure 1 Graphical representation of an airport problem
demands being flows. [77,124] discuss congested telecommunication networks, where the cost of a link depends on the size of the corresponding flow. These positive network externalities lead to a concentration of flow, and thus to hub-like networks. Economies of scale require cooperation of the users, and the problem now is to share the cost of these so-called hub-like networks.
Example 6 As an insurance against the uncertainty of the future net worths of its constituents, firms are often regulated to hold an amount of riskless investments, i.e. their risk capital. Given that the returns on normal investments are higher, the difference with the riskless investments is considered a cost. The sum of the risk capitals of each constituent is usually larger than the risk capital of the firm as a whole, and the allocation problem is to apportion this diversification effect observed in risk measurements of financial portfolios. See Denault [28].
Solving Cost Sharing Problems: Cost Sharing Rules
A vector of cost shares for the cost sharing problem (x, c) is an element y ∈ R^N with the property that Σ_{i∈N} y_i = c(x). This equality is also called the budget-balancing condition. The central issue addressed in the cost sharing literature is how to determine the appropriate y. The vast majority of the cost sharing literature is devoted to a mechanistic way of sharing joint costs: given a class of cost sharing problems P, a (simple) formula computes the vector of cost shares for each of its elements. This yields a cost sharing rule, a mapping from P to R^N that assigns to each P ∈ P its vector of cost shares. At this point it should be clear to the reader that many formulas will see to a split of joint cost, and heading for ‘the’ solution to cost sharing problems is therefore an ambiguous task. The least we want from a solution is that it is consistent with some basic principles of fairness or justice and, moreover, that it creates the right incentives.
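As a minimal sketch (not from the article), the budget-balancing condition can be illustrated with the proportional rule, which splits c(x) in proportion to the demands. The function names and the airport-style cost function below are illustrative assumptions.

```python
# Illustrative sketch: a cost sharing rule must return shares y with
# sum(y) == c(x), the budget-balancing condition.  The proportional
# rule splits total cost in proportion to the individual demands x_i.

def proportional_shares(x, cost):
    """Return budget-balanced cost shares for the demand profile x."""
    total = cost(x)
    weight = sum(x)
    return [total * xi / weight for xi in x]

# Airport-style cost function (assumed for illustration): only the
# largest demanded service level matters, with c_1=12, c_2=20, c_3=33.
c = lambda x: [0, 12, 20, 33][max(x)]

y = proportional_shares([1, 2, 3], c)
print(y)                             # [5.5, 11.0, 16.5]
print(sum(y) == c([1, 2, 3]))        # True: shares exhaust c(x) = 33
```

Any other rule discussed below (serial, Shapley-based, egalitarian) must satisfy the same balance condition; they differ only in how the total is apportioned.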
Clearly, the desirability of a solution varies with the context in which it is used, and so will the sense of appropriateness. Moreover, the different parties involved in the decision making process will typically hold different opinions; accountants, economists, production managers, regulators and others all look at the same institutional entity from different perspectives. The existing cost sharing literature is about exploring the boundaries of what can be thought of as desirable features of cost sharing rules. More important than the rules themselves are the properties that each of them is consistent with. Instead of building a theory on single instances of cost sharing problems, the cost sharing literature discusses structural invariance properties over classes of problems. Here the main distinction is made on the basis of the topological properties of the technology: whether the cost sharing problem allows for a discrete or a continuous formulation. For each type of model, divisible or indivisible goods, the state-of-the-art cost sharing literature has developed in two main directions, based on the way individual preferences over combinations of cost shares and (levels of) service are treated. On the one hand, there is a stream of research in which individual preferences are not explicitly modeled and demands are considered inelastic. Roughly, it accommodates the large and fast-growing axiomatic literature (see e.g. [88,129]) and the theory of cooperative cost games [105,130,136,145]. On the other hand, there is the literature on cost sharing models in which individual preferences are explicitly modeled and demands are elastic. The focus is on non-cooperative demand games in which the agents are assumed to choose their demands strategically; see e.g. [56,91,141]. As the interested reader will soon find out, in the literature there is no shortage of plausible cost sharing techniques. Instead of presenting a kind of summary, this article focuses on the most basic and most interesting ones, and in particular on their properties with respect to the strategic interplay of the agents.
Outline
The article is organized as follows. Section “Cooperative Cost Games” discusses cost sharing problems from the perspective of cooperative game theory. Basic concepts like the core, Shapley value, nucleolus and egalitarian solution are treated. Section “Non-cooperative Cost Games” introduces the basic concepts of non-cooperative game theory, including dominance relations, preferences, and Nash equilibrium. Demand games and demand revelation games are introduced for discrete technologies with concave cost functions. This part is concluded with two theorems, the strategic characterizations of the Shapley value and of the constrained egalitarian solution as cost sharing solutions. Section “Continuous Cost Sharing Models” introduces the continuous production model and consists of two parts. First the simple case of a production technology with homogeneous and perfectly divisible private goods is treated. Prevailing cost sharing rules like the proportional, serial, and Shapley–Shubik rules are shortly introduced. We then give a well-known characterization of additive cost sharing rules in terms of corresponding rationing methods, and discuss the related cooperative and strategic games. The second part is devoted to the heterogeneous output model and famous solutions like the Aumann–Shapley, Shapley–Shubik, and serial rules. We finalize with Sect. “Future Directions”, where some future directions of research are spelled out.
Cooperative Cost Games
A discussion of cost sharing solutions and incentives needs a proper framework wherein the incentives are formalized. In the seminal work of von Neumann and Morgenstern [140] the notion of a cooperative game was introduced to model the interaction between actors/players who coordinate their strategies in order to maximize joint profits. Shubik [122] was one of the first to apply this theory in the cost sharing context.
Cooperative Cost Game
A cooperative cost game among players in N is a function c : 2^N → R with the property that c(∅) = 0; for nonempty sets S ⊆ N the value c(S) is interpreted as the cost that would arise should the individuals in S work together and serve only their own purposes. The class of all cooperative cost games for N will be denoted by CG. Any general class P of cost sharing problems can be embedded in CG as follows. For the cost sharing problem (x, c) ∈ P among agents in N define the stand-alone cost game c_x ∈ CG by
c_x(S) := c(x_S, 0_{N\S}) if S ⊆ N, S ≠ ∅, and c_x(S) := 0 if S = ∅.   (1)
So c_x(S) can be interpreted as the cost of serving only the agents in S.
Example 7 The following numerical example will be frequently referred to. An airport is visited by three airplanes in the set N = {1, 2, 3}, which can be accommodated at costs c_1 = 12, c_2 = 20, and c_3 = 33, respectively. The situation is depicted in Fig. 2. The corresponding cost game c is determined by associating each coalition S of airplanes with the minimum cost of the runway needed to accommodate each of its members. The cost game c is then given by the table below. Slightly abusing notation, we write c(i) for c({i}), c(ij) for c({i, j}), and so forth.

S    | ∅ | 1  | 2  | 3  | 12 | 13 | 23 | 123
c(S) | 0 | 12 | 20 | 33 | 20 | 33 | 33 | 33
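The table of Example 7 can be reproduced mechanically: since a coalition only needs a runway long enough for its largest member, c(S) is the maximum of the members' stand-alone costs. A minimal sketch (function names are illustrative, not from the article):

```python
from itertools import combinations

# Example 7: runway costs per airplane type, c_1=12, c_2=20, c_3=33.
cost_of_type = {1: 12, 2: 20, 3: 33}

def airport_game(players, cost_of_type):
    """Build the cost game as a dict from frozenset coalitions to costs:
    a coalition pays for the runway of its largest (costliest) member."""
    game = {frozenset(): 0}
    for r in range(1, len(players) + 1):
        for S in combinations(players, r):
            game[frozenset(S)] = max(cost_of_type[i] for i in S)
    return game

c = airport_game([1, 2, 3], cost_of_type)
print(c[frozenset({1, 2})])     # 20, as in the table
print(c[frozenset({1, 2, 3})])  # 33
```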
Cost Sharing, Figure 2 Airport game
Cost Sharing, Figure 3 Minimum cost spanning tree problem
Note that, since we identified coalitions of players in N with elements in 2^N, we may write c to denote the cooperative cost game. By the binary nature of the demands, the cost function for the technology formally is a cooperative cost game. For x = (1, 0, 1) the corresponding cost game c_x is specified by

S       ∅   1    2   3    12   13   23   123
c_x(S)  0   12   0   33   12   33   33   33
Player 2 is a dummy player in this game: for all S ⊆ N \ {2} it holds that c_x(S) = c_x(S ∪ {2}).

Example 8 Consider the situation as depicted in Fig. 3 below, where three players, each situated at a different node, want to be connected to the special node β using the indicated costly links. In order to connect themselves to β a coalition S may use only the links with β and the direct links between its members, and then only if the costs are paid for. For instance, the minimum cost of connecting player 1 in the left node to β is 10, and the cost of connecting players 1 and 2 to β is 18 – the cost of the direct link from 2 and the indirect link between 1 and 2. Then the associated cost game is given by

S      ∅   1    2    3    12   13   23   123
c(S)   0   10   10   10   18   20   19   27
Notice that in this case the network technology exhibits positive externalities: the more players want to be connected, the lower the per capita cost. For those applications where the cost c(S) can be determined irrespective of the actions taken by its complement N \ S, the interpretation of c implies sub-additivity, i.e. the property that for all S, T ⊆ N with S ∩ T = ∅ it holds that c(S ∪ T) ≤ c(S) + c(T). This is for instance an essential feature of the technology underlying natural monopolies (see, e.g., [13,120]). Note that the cost games in Examples 7 and 8 are sub-additive. This is a general property of airport games as well as minimum cost spanning tree games.
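Sub-additivity is easily verified by brute force over all pairs of disjoint coalitions; a small sketch, applied to the airport game of Example 7:

```python
from itertools import combinations

def is_subadditive(game, players):
    """Check c(S ∪ T) <= c(S) + c(T) for all disjoint nonempty S, T."""
    subsets = [frozenset(S) for r in range(1, len(players) + 1)
               for S in combinations(players, r)]
    return all(game[S | T] <= game[S] + game[T]
               for S in subsets for T in subsets if not (S & T))

# The airport game of Example 7, written out coalition by coalition:
airport = {frozenset(): 0, frozenset({1}): 12, frozenset({2}): 20,
           frozenset({3}): 33, frozenset({1, 2}): 20, frozenset({1, 3}): 33,
           frozenset({2, 3}): 33, frozenset({1, 2, 3}): 33}
print(is_subadditive(airport, [1, 2, 3]))  # True
```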
Cost Sharing, Figure 4 Non-concave MCST game
Sometimes the benefits of cooperation are even stronger. A game is called concave (or sub-modular) if for all S, T ⊆ N we have

c(S ∪ T) + c(S ∩ T) ≤ c(S) + c(T) .   (2)

At first this seems a very abstract property, but one may show that it is equivalent to the following:

c(S ∪ {i}) − c(S) ≥ c(T ∪ {i}) − c(T)   (3)

for all coalitions S ⊆ T ⊆ N \ {i}. This means that the marginal cost of a player i with respect to larger coalitions is non-increasing, i.e. the technology exhibits positive externalities. Concave games are also frequently found in the network literature, see [63,75,93,121].
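The equivalent condition (3) suggests a direct brute-force test for concavity; a sketch, again using the airport game of Example 7:

```python
from itertools import combinations

def is_concave(game, players):
    """Concavity via Eq. (3): for every player i and coalitions
    S ⊆ T ⊆ N\\{i}, c(S ∪ {i}) - c(S) >= c(T ∪ {i}) - c(T)."""
    subsets = [frozenset(S) for r in range(len(players) + 1)
               for S in combinations(players, r)]
    return all(game[S | {i}] - game[S] >= game[T | {i}] - game[T]
               for i in players
               for T in subsets if i not in T
               for S in subsets if S <= T)

airport = {frozenset(): 0, frozenset({1}): 12, frozenset({2}): 20,
           frozenset({3}): 33, frozenset({1, 2}): 20, frozenset({1, 3}): 33,
           frozenset({2, 3}): 33, frozenset({1, 2, 3}): 33}
print(is_concave(airport, [1, 2, 3]))  # True: airport games are concave
```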
Incentives in Cooperative Cost Games

The objective in cooperative games is to share the profits or cost savings of cooperation. Similar to the general framework, a vector of cost shares for a cost game c ∈ CG is a vector x ∈ ℝ^N such that x(N) = c(N). The question is what cost share vectors make sense if (coalitions of) players have the possibility to opt out, thereby destroying cooperation on a larger scale. In order to ensure that individual players join, a proposed allocation x should at least be individually rational, so that x_i ≤ c(i) for all i ∈ N. In that case no player has a justified claim to reject x as a proposal, since going alone yields a higher cost. The set of all such elements is called the imputation set. If, in a similar fashion, x(S) ≤ c(S) for all S ⊆ N, then x is called stable; under proposal x no coalition S has a strong incentive to go alone, as it is not possible to redistribute the cost shares afterwards and make every defector better off. The core of a cost game c, notation core(c), consists of all stable vectors of cost shares for c. If cooperation on a voluntary basis by the grand coalition N is conceived as a desirable feature, then the core and certainly the imputation set impose reasonable conditions for reaching it. Nevertheless, the core of a game can be empty. Call a collection B of coalitions balanced if there is a vector of positive weights (λ_S)_{S∈B} such that for all i ∈ N

Σ_{S∈B : S∋i} λ_S = 1 .

A cost game c is balanced if for each balanced collection B of coalitions it holds that

Σ_{S∈B} λ_S c(S) ≥ c(N) .

It is the celebrated theorem below which characterizes all games with non-empty cores.

Theorem 1 (Bondareva–Shapley [20,117]) The cost game c is balanced if and only if the core of c is non-empty.

Concave cost games are balanced, see [119]. Concavity is not a necessary condition for non-emptiness of the core, since minimum cost spanning tree games are balanced as well.

Example 9 Although sub-additive, minimum cost spanning tree games are not always concave. Consider the following example due to [17], depicted in Fig. 4. The numbers next to the edges indicate the corresponding cost. We assume a complete graph and that the invisible edges cost 4. Note that in this game every three-player coalition is connected at cost 12, whereas c(34) = 16. Then c(1234) − c(234) = 16 − 12 = 4, whereas c(134) − c(34) = 12 − 16 = −4. So the marginal cost of player 1 is not non-increasing with respect to larger coalitions.

Example 10 Consider the two-player game c defined by c(12) = 10, c(1) = 3, c(2) = 8. Then core(c) = {(x, 10 − x) | 2 ≤ x ≤ 3}. Note that, opposed to the general case, for two-player games sub-additivity is equivalent to non-emptiness of the core.

Cooperative Solutions
A solution on a subclass A of CG is a mapping σ : A → ℝ^N that assigns to each c ∈ A a vector of cost shares σ(c); σ_i(c) stands for the charge to player i.

The Separable Cost Remaining Benefit Solution Common practice among civil engineers to allocate the costs of multipurpose reservoirs is the following solution. The separable cost for each player (read: purpose) i ∈ N is given by s_i = c(N) − c(N \ {i}), and the remaining benefit by r_i = c(i) − s_i. The separable cost remaining benefit solution charges each player i for the separable cost s_i, and the non-separable costs c(N) − Σ_{j∈N} s_j are then allocated in proportion to the remaining benefits r_i, leading to the formula

SCRB_i(c) = s_i + ( r_i / Σ_{j∈N} r_j ) [ c(N) − Σ_{j∈N} s_j ] .   (4)
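Formula (4) translates directly into a few lines of code; a sketch, reproducing the numbers of the two-player game of Example 10:

```python
def scrb(game, players):
    """Separable Cost Remaining Benefit solution, Eq. (4)."""
    N = frozenset(players)
    s = {i: game[N] - game[N - {i}] for i in players}      # separable costs
    r = {i: game[frozenset({i})] - s[i] for i in players}  # remaining benefits
    nonsep = game[N] - sum(s.values())                     # non-separable cost
    return {i: s[i] + r[i] / sum(r.values()) * nonsep for i in players}

# Example 10: c(1) = 3, c(2) = 8, c(12) = 10
g = {frozenset(): 0, frozenset({1}): 3, frozenset({2}): 8,
     frozenset({1, 2}): 10}
print(scrb(g, [1, 2]))  # {1: 2.5, 2: 7.5}
```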
In this formula it is assumed that c is at least sub-additive, to ensure that the r_i's are all positive. For the two-player game c in Example 10 the solution is given by SCRB(c) = (2 + ½(10 − 9), 7 + ½(10 − 9)) = (2½, 7½). In earlier days the solution was also known as 'the alternate cost avoided method' or 'the alternative justifiable expenditure method'. For references see [144].

Shapley Value One of the most popular and oldest solution concepts in the literature on cooperative games is due to Shapley [116], and named the Shapley value. Roughly, it measures the average marginal impact of players. Consider an ordering of the players π : N → N, so that π(i) indicates the ith player in the order. Let π̄(i) be the set of the first i players according to π; so π̄(1) = {π(1)}, π̄(2) = {π(1), π(2)}, etc. The marginal cost share vector m^π(c) ∈ ℝ^N is defined by m^π_{π(1)}(c) = c(π̄(1)) and, for i = 2, 3, …, n,

m^π_{π(i)}(c) = c(π̄(i)) − c(π̄(i − 1)) .   (5)

So according to m^π each player is charged with the increase in costs when joining the coalition of players before her. Then the Shapley value for c is defined as the average of all n! marginal vectors, i.e.

Φ(c) = (1/n!) Σ_π m^π(c) .   (6)
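The average in Eq. (6) can be computed by brute force over all n! orderings for small games; a sketch, evaluated on the airport game of Example 7:

```python
from itertools import permutations
from math import factorial

def shapley(game, players):
    """Shapley value: the average of the marginal vectors m^pi(c), Eq. (6)."""
    shares = {i: 0.0 for i in players}
    for order in permutations(players):
        T = frozenset()
        for i in order:  # charge i its marginal cost when joining T
            shares[i] += game[T | {i}] - game[T]
            T = T | {i}
    return {i: v / factorial(len(players)) for i, v in shares.items()}

airport = {frozenset(): 0, frozenset({1}): 12, frozenset({2}): 20,
           frozenset({3}): 33, frozenset({1, 2}): 20, frozenset({1, 3}): 33,
           frozenset({2, 3}): 33, frozenset({1, 2, 3}): 33}
print(shapley(airport, [1, 2, 3]))  # {1: 4.0, 2: 8.0, 3: 21.0}
```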
Example 11 Consider the airport game in Example 7. Then the marginal vectors are given by

π        (123)      (132)      (213)      (231)      (312)     (321)
m^π(c)   (12,8,13)  (12,0,21)  (0,20,13)  (0,20,13)  (0,0,33)  (0,0,33)
Hence the Shapley value of the corresponding game is Φ(c) = (4, 8, 21). Following [69,107], for airport games this allocation is easily interpreted as the allocation according to which each player pays an equal share of the cost of only those parts of the runway she uses. Then c(e_1) is shared by all three players, c(e_2) only by players 2 and 3, and, finally, c(e_3) is paid in full by player 3. This interpretation extends to the class of standard fixed tree games, where instead of the lattice structure of the runway there is the cost of a tree network to be shared, see [63].

If a cost game is concave then the Shapley value is in the core, since then each marginal vector specifies a core element, and in particular so does the Shapley value as a convex combination of these. Reconsider the minimum cost spanning tree game c in Example 9, a non-concave game with non-empty core and Φ(c) = (2⅔, 2⅔, 6⅔, 4). Note that this is not a stable cost allocation, since the coalition {2, 3} would profit by defecting: c(23) = 8 < 9⅓ = Φ_2(c) + Φ_3(c). [50] show that in general games Φ(c) ∈ core(c) precisely when c is average concave. Although not credited as a core-selector, the classic way to defend the Shapley value is by the following properties.

Symmetry Two players i, j are called symmetric in the cost game c if for all coalitions S not containing i, j it holds that c(S ∪ {i}) = c(S ∪ {j}). A solution is symmetric if symmetric players in a cost game c get the same cost shares. If the cost game does not provide any evidence to distinguish between two players, symmetry is the property endorsing equal cost shares.

Dummy A player i in a cost game c is a dummy if c(S ∪ {i}) = c(S) for all coalitions S. A solution σ satisfies dummy if σ_i(c) = 0 for all dummy players i in c. So when a player has no impact on costs whatsoever, she cannot be held responsible.

Additivity A solution σ is additive if for all cost games c_1, c_2 it holds that

σ(c_1) + σ(c_2) = σ(c_1 + c_2) .   (7)
For accounting reasons, in multipurpose projects it is a common procedure to subdivide the costs related to the different activities (players) into cost categories, like salaries, maintenance costs, marketing, et cetera. Each category ℓ is associated with a cost game c_ℓ, where c_ℓ(S) is the total of category-ℓ cost made for the different activities in S; then c(S) = Σ_ℓ c_ℓ(S) is the joint cost for S. Suppose a solution is applied to each of the cost categories separately; then under an additive solution the aggregate cost share of an activity is independent of the particular cross-section into categories.

Theorem 2 (Shapley [116]) Φ is the unique solution on CG which satisfies all three properties dummy, symmetry, and additivity.

Note that SCRB satisfies dummy and symmetry, but that it does not satisfy additivity. The Shapley value is credited with other virtues, like the following due to [144]. Consider the practical situation that several division managers simultaneously take steps to increase efficiency by decreasing joint costs, but one division manager establishes a greater relative improvement in the sense that its marginal contribution to the cost associated with all possible coalitions increases. Then it is more than reasonable that this division should not be penalized. In a broader context this envisions the idea that each player in the cost game should be credited with the merits of 'uniform' technological advances.

Strong Monotonicity A solution σ is strongly monotonic if for any two cost games c, c′ it holds for all i ∈ N that c(S ∪ {i}) − c(S) ≤ c′(S ∪ {i}) − c′(S) for all S ⊆ N \ {i} implies σ_i(c) ≤ σ_i(c′).

Anonymity is the classic property for solutions declaring independence of the solution with respect to the names of the actors in the cost sharing problem. See e.g. [3,91,106]. Formally, the definition is as follows. For a given permutation π : N → N and c ∈ CG define πc ∈ CG by (πc)(S) = c(π(S)) for all S ⊆ N.

Anonymity A solution σ is anonymous if for all permutations π of N, all i ∈ N, and all cost games c, σ_{π(i)}(πc) = σ_i(c).

Theorem 3 (Young [144]) The Shapley value is the unique anonymous and strongly monotonic solution.

[99] introduced the balanced contributions axiom for the model of non-transferable utility games, or games without side-payments, see [118]. Within the present context of CG, a solution σ satisfies the balanced contributions axiom if for any cost game c, any non-empty subset S ⊆ N, and {i, j} ⊆ S it holds that

σ_i(S, c) − σ_i(S \ {j}, c) = σ_j(S, c) − σ_j(S \ {i}, c) .   (8)

The underlying idea is the following. Suppose that players agree on using solution σ and that coalition S forms. Then σ_i(S, c) − σ_i(S \ {j}, c) is the amount player i gains or loses when S is already formed and player j resigns. The balanced contributions axiom states that the gains and/or losses caused by the other player's withdrawal from the coalition should be the same.

Theorem 4 (Myerson [99]) There is a unique solution on CG that satisfies the balanced contributions axiom, and that is Φ.

The balanced contributions property can be interpreted in a bargaining context as well.
In the game c and with solution σ, a player i can object against player j to the solution σ(c) when the cost share for j increases when i steps out of the cooperation, i.e. σ_j(N, c) < σ_j(N \ {i}, c). In turn, a counter objection by player j to this objection is an assertion that player i would suffer more when j ends the cooperation, i.e. σ_i(N \ {j}, c) − σ_i(N, c) ≥ σ_j(N \ {i}, c) − σ_j(N, c). The balanced contributions property is equivalent to the requirement that each objection is balanced by a counter objection. For an excellent overview of ideas developed in this spirit, see [74].

Another marginalistic approach is by [44]. Denote for c ∈ CG the game restricted to the players in S ⊆ N by (S, c). Given a function P : CG → ℝ which associates a real number P(N, c) to each cost game c with player set N, the marginal cost of a player i is defined to be D_i P(c) = P(N, c) − P(N \ {i}, c). Such a function P with P(∅, c) = 0 is called a potential if Σ_{i∈N} D_i P(N, c) = c(N).

Theorem 5 (Hart & Mas-Colell [44]) There exists a unique potential function P, and for every c ∈ CG the resulting payoff vector DP(N, c) coincides with Φ(c).

Egalitarian Solution The Shapley value is one of the first solution concepts proposed within the framework of cooperative cost games, but not the most trivial one. That would be to neglect all asymmetries between the players and split the total costs equally among them. But as one may expect, egalitarianism in this pure form will not lead to a stable allocation. Just consider the two-player game in Example 10, where pure egalitarianism would dictate the allocation (5, 5), which violates individual rationality for player 1. In order to avoid these problems we can of course propose to look for the most egalitarian allocation within the core (see [7,31]). In this line of thinking, what is needed in Example 10 is a minimal transfer of cost 2 to player 2, leading to the final allocation (3, 7) – the constrained egalitarian solution. Although in the former example it was clear what allocation to choose, in general we need a tool to evaluate allocations for their degree of egalitarianism. The earlier mentioned papers all suggest the use of the Lorenz order (see, e.g., [8]). More precisely, consider two vectors of cost shares x and x′ such that x(N) = x′(N).
Assume that these vectors are ordered in decreasing order, so that x_1 ≥ x_2 ≥ … ≥ x_n and x′_1 ≥ x′_2 ≥ … ≥ x′_n. Then x Lorenz-dominates x′ – read: x is more egalitarian than x′ – if for all k = 1, …, n − 1 it holds that

Σ_{i=1}^{k} x_i ≤ Σ_{i=1}^{k} x′_i ,   (9)

with at least one strict inequality. That is, x is better for those paying the most.
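The dominance test of Eq. (9) is straightforward to code; a sketch (the allocations used here are those discussed in Example 12 below):

```python
def lorenz_dominates(x, y):
    """Eq. (9): with shares ordered decreasingly and equal totals, every
    partial sum of x is <= that of y, with at least one strict inequality."""
    xs, ys = sorted(x, reverse=True), sorted(y, reverse=True)
    if len(xs) != len(ys) or sum(xs) != sum(ys):
        return False
    px = [sum(xs[:k]) for k in range(1, len(xs))]
    py = [sum(ys[:k]) for k in range(1, len(ys))]
    return all(a <= b for a, b in zip(px, py)) and px != py

print(lorenz_dominates((6, 5, 4), (7, 4, 4)))  # True
print(lorenz_dominates((6, 5, 4), (6, 6, 3)))  # True
print(lorenz_dominates((6, 6, 3), (7, 4, 4)))  # False
print(lorenz_dominates((7, 4, 4), (6, 6, 3)))  # False: only a partial order
```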
with at least one strict inequality. That is, x is better for those paying the most. Example 12 Consider the three allocations of cost 15 among three players x D (6; 5; 4); x 0 D (6; 6; 3), and x 00 D (7; 4; 4). Firstly, x Lorenz-dominates x 00 since x1 D
Cost Sharing
6 < 7 D x100 , and x1 C x2 D x10 C x20 . Secondly, x Lorenz-dominates x 0 since x1 D x10 ; x1 C x2 < x10 C x20 . Notice, however, that on the basis of only Eq. (9) we can not make any judgment what is the more egalitarian of the allocations x 0 and x 00 . Since x10 D 6 < 7 D x100 but x10 C x20 D 6 C 6 > 7 C 4 D x100 C x200 . The Lorenz-order is only a partial order. The constrained egalitarian solution is the set of Lorenzundominated allocations in the core of a game. Due to the partial nature of the Lorenz-order there may be more than one Lorenz-undominated elements in the core. And what if the core is empty? The constrained egalitarian solution is obviously not a straight-forward solution. The original idea of constrained egalitarianism as in [30], focuses on the Lorenz-core instead of the core. It is shown that there is at most one such allocation, that may exist even when the core of the underlying game is empty. For concave cost games c, the allocation is well defined and denoted by E (c). In particular this holds for airport games. Intriguingly, empirical studies [1,2] show there is a tradition in using the solution for this type of problems. For concave cost games c, there exists an algorithm to compute E (c). This method, due to [30], performs the following consecutive steps. First determine the maximal set S1 of players minimizing the per capita cost c(S)/jSj, where jSj is the size of the coalition S. Then each of these players in S1 pays c(S1 )/jS1 j. In the next step determine the maximal set S2 of players in NnS1 minimizing c2 (S)/jSj, where c2 is the cost game defined by c2 (S) D c(S1 [ S) c(S1 ). The players in S2 pay c2 (S2 )/jS2 j each. Continue in this way just as long as not everybody is allocated a cost share. Then in at most n steps this procedure results in an allocation of total cost, the constrained egalitarian solution. In short the algorithm is as follows Stage 0: Initialization, put S0 D ¿; x D 0 N , go to stage t D 1. 
Stage t: Determine the maximal set

S*_t ∈ arg min_{∅ ≠ S ⊆ N\S_{t−1}} [ c(S ∪ S_{t−1}) − c(S_{t−1}) ] / |S| .

Put S_t = S_{t−1} ∪ S*_t and, for i ∈ S*_t,

x*_i := [ c(S_t) − c(S_{t−1}) ] / |S*_t| .

If S_t = N we are finished; put E(c) = x*. Else repeat the stage with t := t + 1.

Cost Sharing, Figure 5 Standard fixed tree

For example, this algorithm can be used to calculate the constrained egalitarian solution for the airport game in
Example 7. In the first step we determine S_1 = {1, 2}, together with cost shares 10 for players 1 and 2. Player 3 is allocated the remaining cost in the next step; hence the corresponding final allocation is E(c) = (10, 10, 13).

Example 13 Consider the case where six players share the cost of the tree network depicted in Fig. 5 that connects them to β. The standard fixed tree game c for this network associates with each coalition of players the minimum cost of connecting each member to β, where it may use all the links of the tree. This type of game is known to be concave, so we can use the above algorithm to calculate E(c). In the first step we determine S_1 = {1, 3, 4}, and each player herein pays 8. Then in the second step the game remains where the edges e_1, e_3, e_4 connecting S_1 have been paid for. Then it easily follows that S_2 = {2, 5}, so that players 2 and 5 pay 9 each, leaving 10 as the cost share for player 6. Thus, we find E(c) = (8, 9, 8, 8, 9, 10).

Nucleolus Given a cost game c ∈ CG, the excess of a coalition S ⊆ N with respect to a vector x ∈ ℝ^N is defined as e(S, x) = x(S) − c(S); it measures the dissatisfaction of S under proposal x. Arrange the excesses of all coalitions S ≠ N, ∅ in decreasing order and call the resulting vector θ(x) ∈ ℝ^{2^n − 2}. A vector of cost shares x will be preferred to a vector y, notation x ≻ y, whenever θ(x) is smaller than θ(y) in the lexicographic order, i.e. there exists i* such that θ_i(x) = θ_i(y) for all i ≤ i* − 1 and θ_{i*}(x) < θ_{i*}(y). Schmeidler [115] showed that in the set of individually rational cost sharing vectors there is a unique element that is maximal with respect to ≻, which is called the nucleolus. This allocation, denoted ν(c), is based on the idea of egalitarianism that the largest complaints of coalitions should consistently be minimized. The concept gained much popularity as a core-selector, i.e. it is a one-point solution contained in the core whenever the core is non-empty. This contrasts with the constrained egalitarian solution, which might not be well defined, and the Shapley value, which may lie outside the core.

Example 14 Consider in Example 7 the excesses of the different coalitions with respect to the constrained egalitarian solution E(c) = (10, 10, 13) and the nucleolus ν(c) = (6, 7, 20):

S             1    2     3     12   13    23
e(S, E(c))   −2   −10   −20    0   −10   −10
e(S, ν(c))   −6   −13   −13   −7   −7    −6

Then the ordered excess vectors are

θ(10, 10, 13) = (0, −2, −10, −10, −10, −20) ,
θ(6, 7, 20) = (−6, −6, −7, −7, −13, −13) .

Note that indeed ν(c) ≻ E(c), since θ_1(6, 7, 20) = −6 < 0 = θ_1(10, 10, 13).

The nucleolus of standard fixed tree games may be calculated as a particular home-down allocation, as was pointed out by Maschler et al. [75]. For standard fixed tree games and minimum cost spanning tree games the special structure of the technology makes it possible to calculate the nucleolus in polynomial time, i.e. with a number of calculations bounded by a multiple of n² (see [39]). Sometimes one may even express the nucleolus through a nice formula; Legros [66] showed a class of cost sharing problems for which the nucleolus equals the SCRB solution. But in general calculations are hard and involve solving a linear program with a number of inequalities that is exponential in n. [124] suggests using the nucleolus on the cost game corresponding to hub games. Instead of the direct comparison of excesses as above, the literature also discusses weighted excesses, so as to model the asymmetries of justifiable complaints within coalitions. For instance the per capita nucleolus minimizes the maximal excesses divided by the number of players in the coalition (see [105]).
Cost Sharing, Figure 6 Induced cost sharing rules

Cost Sharing Rules Induced by Solutions Most of the above numerical examples deal with cost sharing problems which have a natural and intuitive representation as a cost game. On such domains of cost sharing problems there is basically no difference between cost sharing rules and solutions. It may seem that the cooperative solutions are restricted to this kind of situations. But recall that each cost sharing problem (x, c) is associated with its stand-alone cost game c_x ∈ CG, as in Eq. (1). Now let σ be a solution on a subclass A ⊆ CG, and let B be a class of cost sharing problems (x, c) for which c_x ∈ A. Then a cost sharing rule σ is defined on B through

σ(x, c) = σ(c_x) .   (10)

The general idea is illustrated in the diagram in Fig. 6. For example, since the Shapley value is defined on the class of all cost games, it defines a cost sharing rule Φ on the class of all cost sharing problems. The cost sharing rule E is defined on the general class of cost sharing problems with corresponding concave cost games. Cost sharing rules derived in this way – game-theoretical rules according to [130] – will be most useful below.

Non-cooperative Cost Games

Formulating the cost sharing problem through a cooperative cost game assumes inelastic demands of the players. It might well be that for some player the private merits of service do not outweigh the cost share calculated by the planner. She will try to block the payment when no service at no cost is a preferred outcome. Another aspect is that the technology may operate at a sub-optimal level if the benefits of delivered services are not taken into account. Below the focus is on a broader framework with elastic demands, which incorporates preferences of players defined over combinations of service levels and cost shares. The theory of non-cooperative games provides a proper framework in which we can discuss individual aspirations and efficiency of outcomes on a larger scale.
Strategic Demand Games At the heart of this non-cooperative theory is the notion of a strategic game, which models an interactive decision-making process among a group of players whose decisions may impact the consequences for others. Simultaneously, each player i independently chooses some available action
a_i, and the so realized action profile a = (a_1, a_2, …, a_n) is associated with some consequence f(a). Below we will have in mind demands or offered contributions as actions, and consequences are combinations of service levels with cost shares.

Preferences Over Consequences Denote by A the set of possible action profiles and by C the set of all consequences of action. Throughout we will assume that players have preferences over the different consequences of action. Moreover, such a preference relation can be expressed by a utility function u_i : C → ℝ such that for z, z′ ∈ C it holds that u_i(z) ≤ u_i(z′) if agent i weakly prefers z′ to z. Below the set of consequences for agent i ∈ N will consist of pairs (x, y), where x is the level of service and y a cost share, so that utilities are specified through multi-variable functions, (x, y) ↦ u_i(x, y).

Preferences Over Action Profiles In turn define for each agent i and all a ∈ A, U_i(a) = u_i(f(a)); then U_i assigns to each action profile the utility of its consequence. We will say that the action profile a′ is weakly preferred to a by agent i if U_i(a) ≤ U_i(a′); U_i is called agent i's utility function over action profiles.

Strategic Game and Nash Equilibrium A strategic game is an ordered triple G = ⟨N, (A_i)_{i∈N}, (U_i)_{i∈N}⟩, where N = {1, 2, …, n} is the set of players, A_i is the set of available actions for player i, and U_i is player i's utility function over action profiles. Rational players in a game will choose optimal actions in order to maximize utility. The most commonly used concept in game theory is that of Nash equilibrium, a profile of strategies from which unilateral deviation by a single player does not pay. It can be seen as a steady state of action in which players hold correct beliefs about the actions taken by others and act rationally.
An important assumption here is the level at which the players understand the game; usually it is taken as a starting point that players know the complete description of the game, including the action spaces and preferences of others.

Nash Equilibrium (Nash 1950) An action profile a* in a strategic game G = ⟨N, (A_i)_{i∈N}, (U_i)_{i∈N}⟩ is a Nash equilibrium if, for every player i, it holds that U_i(a*) ≥ U_i(a_i, a*_{−i}) for every a_i ∈ A_i.

The literature discusses several refinements of this equilibrium concept. One that will play a role in the games below is that of strong Nash equilibrium, due to Aumann [9]; it is a Nash equilibrium a* in a strategic game G such that for all S ⊆ N and all action profiles a_S there exists a player i ∈ S such that U_i(a_S, a*_{N\S}) ≤ U_i(a*). This means that a strong Nash equilibrium guarantees stability against coordinated deviations, since within the deviating coalition there is at least one agent who does not strictly improve.

Example 15 Consider the following two-player strategic game with N = {1, 2}, A_1 = {T, B} and A_2 = {L, M, R}. Let the utilities be as in the table below:

     L     M     R
T    5,4   2,1   3,2
B    4,3   5,2   2,5
Here player 1 chooses a row and player 2 a column. The numbers in the cells summarize the individual utilities corresponding to the action profiles; the first number is the utility of player 1, the second that of player 2. In this game there is a unique Nash equilibrium, which is the action profile (T, L).

Dominance in Strategic Games In the game G = ⟨N, (A_i)_{i∈N}, (U_i)_{i∈N}⟩, the action a_i ∈ A_i is weakly dominated by a′_i ∈ A_i if U_i(a_i, a_{−i}) ≤ U_i(a′_i, a_{−i}) for all a_{−i} ∈ A_{−i}, with strict inequality for some profile of actions a_{−i}. If strict inequality holds for all a_{−i}, then a_i is strictly dominated by a′_i. Rational players will not use strictly dominated strategies, and, as far as prediction of play is concerned, these may be eliminated from the set of possible actions. If we do this elimination step for each player, then we may reconsider whether some actions are dominated within the reduced set of action profiles. This step-by-step reduction of action sets is called the procedure of successive elimination of (strictly) dominated strategies. The set of all action profiles surviving this procedure is denoted by D∞.

Example 16 In Example 15 action M of player 2 is strictly dominated by L and by R. Player 1 has no dominated actions. Now eliminate M from the actions of player 2. Then the reduced game is

     L     R
T    5,4   3,2
B    4,3   2,5
Notice that action B of player 1 was not dominated in the original game, for the reason that B was the better of the two actions against M. But if M is never played, T is strictly better than B. Now eliminate B, yielding the reduced game

     L     R
T    5,4   3,2
In this game L dominates R; hence the only action profile surviving the successive elimination of strictly dominated strategies is (T, L).

A stronger notion than dominance is the following. Call an action a_i ∈ A_i overwhelmed by a′_i ∈ A_i if

max {U_i(a_i, a_{−i}) | a_{−i} ∈ A_{−i}} < min {U_i(a′_i, a_{−i}) | a_{−i} ∈ A_{−i}} .

Then O∞ is the set of all action profiles surviving the successive elimination of overwhelmed actions. This notion is due to [34,37]. In Example 15 the action M is overwhelmed by L, but not by R. Moreover, the remaining actions in O∞ are B, T, L, and R.
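The successive elimination procedure is easy to automate for two-player games in tabular form; a sketch reproducing Example 16:

```python
def iterated_elimination(U1, U2, A1, A2):
    """Successive elimination of strictly dominated actions in a
    two-player game; U1[a1][a2] and U2[a1][a2] are the utilities."""
    A1, A2 = list(A1), list(A2)
    changed = True
    while changed:
        changed = False
        for a in list(A1):  # is row a strictly dominated by some row b?
            if any(all(U1[b][y] > U1[a][y] for y in A2)
                   for b in A1 if b != a):
                A1.remove(a)
                changed = True
        for a in list(A2):  # is column a strictly dominated by some b?
            if any(all(U2[x][b] > U2[x][a] for x in A1)
                   for b in A2 if b != a):
                A2.remove(a)
                changed = True
    return A1, A2

# The game of Example 15:
U1 = {'T': {'L': 5, 'M': 2, 'R': 3}, 'B': {'L': 4, 'M': 5, 'R': 2}}
U2 = {'T': {'L': 4, 'M': 1, 'R': 2}, 'B': {'L': 3, 'M': 2, 'R': 5}}
print(iterated_elimination(U1, U2, ['T', 'B'], ['L', 'M', 'R']))
# (['T'], ['L'])
```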
Demand Games Strategic games in cost sharing problems arise when we assume that the users of the production technology choose their demands strategically and a cost sharing rule sees to an allocation of the corresponding service costs. The action spaces A_i are simply the demand spaces of the agents, and utilities are specified over combinations of the (level of) received service and the accompanying cost shares. Hence utilities are defined over consequences of action: u_i(q_i, x_i) denotes i's utility at receiving service level q_i and cost share x_i; u_i is increasing in the level of received service q_i and decreasing in the allocated cost x_i. Now assume a cost function c and a cost sharing rule ξ. Then, given a demand profile a = (a_1, a_2, …, a_n), the cost sharing rule determines a vector of cost shares ξ(a, c), and in return also the corresponding utilities over demands U_i(a) = u_i(a_i, ξ_i(a, c)). Observe that agents influence each other's utility via the cost component. The demand game for this situation is then the strategic game

G(ξ, c) = ⟨N, (A_i)_{i∈N}, (U_i)_{i∈N}⟩ .   (11)

Example 17 Consider the airport problem in Example 7. Each player now may request service (1) or not (0). Then the cost function is fully described by the demand of the largest player. That is, c(x) = 33 if player 3 requires service, c(x) = 20 for all profiles with x_2 = 1, x_3 = 0, c(x) = 12 if x = (1, 0, 0), and c(0, 0, 0) = 0. Define the cost sharing rule Φ by Φ(x, c) = Φ(c_x); that is, Φ calculates the Shapley value of the underlying cost game c_x as in Eq. (1). Assume that the players' preferences over ordered pairs of service level and cost share are fully described by

u_1(q_1, x_1) = 8q_1 − x_1 ,
u_2(q_2, x_2) = 6q_2 − x_2 ,
u_3(q_3, x_3) = 30q_3 − x_3 .

Here q_i takes the value 0 (no service) or 1 (service), and x_i stands for the allocated cost. So player 1 prefers to be served at unit cost to not being served at zero cost, u_1(0, 0) = 0 < 7 = u_1(1, 1). The infrastructure is seen as an excludable public good, so those with demand 0 do not get access to the technology. Each player now actively chooses to be served or not, so her action set is specified by A_i = {0, 1}. Recall the definition of c_x as in Eq. (1). Then, given a profile of actions a = (a_1, a_2, a_3) and cost shares Φ(a, c), the utilities of the players in terms of action profiles become U_i(a) = u_i(a_i, Φ_i(a, c)), so that

U_1(a) = 8a_1 − Φ_1(a, c) ,
U_2(a) = 6a_2 − Φ_2(a, c) ,
U_3(a) = 30a_3 − Φ_3(a, c) .
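The demand game G(Φ, c) of Example 17 is small enough to analyze exhaustively; a sketch that recomputes the Shapley cost shares for every binary demand profile and checks all unilateral deviations:

```python
from itertools import permutations, product
from math import factorial

runway = {1: 12, 2: 20, 3: 33}
alpha = {1: 8, 2: 6, 3: 30}  # u_i(q_i, x_i) = alpha_i * q_i - x_i

def cost(profile):
    """Cost of serving the players with demand 1 (the largest runway)."""
    served = [i for i, q in profile.items() if q == 1]
    return max((runway[i] for i in served), default=0)

def shapley_shares(profile):
    """Shapley value of the stand-alone game c_x of the demand profile."""
    players = sorted(profile)
    c_x = lambda S: cost({i: profile[i] if i in S else 0 for i in players})
    sh = {i: 0.0 for i in players}
    for order in permutations(players):
        T = set()
        for i in order:
            sh[i] += c_x(T | {i}) - c_x(T)
            T = T | {i}
    return {i: v / factorial(len(players)) for i, v in sh.items()}

def utilities(a):
    x = shapley_shares(a)
    return {i: alpha[i] * a[i] - x[i] for i in a}

def nash_equilibria():
    eq = []
    for q in product([0, 1], repeat=3):
        a = dict(zip([1, 2, 3], q))
        U = utilities(a)
        # a is a Nash equilibrium if no unilateral switch of demand pays
        if all(U[i] >= utilities({**a, i: 1 - a[i]})[i] for i in a):
            eq.append(q)
    return eq

print(nash_equilibria())  # [(0, 0, 0), (1, 0, 1)]
```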
Now that we have provided all details of the demand game G(Φ, c), let us look for (strong) Nash equilibria. Suppose that the action profile a* = (1, 0, 1) is played. Then the complete infrastructure is realized just for players 1 and 3, and the cost allocation is given by (6, 0, 27). The vector of individual utilities is then (2, 0, 3). Now if we consider unilateral deviations from a*, what happens to the individual utilities?

U_1(0, 0, 1) = 0 < 2 = U_1(1, 0, 1) ,
U_2(1, 1, 1) = 6 − Φ_2((1, 1, 1), c) = 6 − Φ_2(c) = 6 − 8 = −2 < 0 = U_2(1, 0, 1) ,
U_3(1, 0, 0) = 0 < 3 = U_3(1, 0, 1) .

This means that for no player unilateral deviation pays; a* is a Nash equilibrium. The first inequality shows as well why the action profile (0, 0, 1) is not. It is easy to see that the other Nash equilibrium of this game is the action profile (0, 0, 0): no player can afford the completion of the infrastructure just for herself. Notice, however, that this zero profile is not a strong Nash equilibrium, as players 1 and 3 may well do better by choosing service at the same time, ending up in (1, 0, 1). The latter profile is the unique strong Nash equilibrium of the game. Similar considerations in the demand game G(E, c) induced by the constrained egalitarian solution lead to the unique strong Nash equilibrium (0, 0, 0): nobody wants service.

With cost sharing rules as decentralization tools, the literature postulates Nash equilibria of the related demand game as the resulting behavioral mode. This is a delicate step because – as the example above shows – it is easy
Cost Sharing
to find games with many equilibria, which causes a selection problem. And what can we do if there are no equilibria? This is not the topic of this text; the interested reader is referred to any standard textbook on game theory, for instance [103,104,109]. If there is a unique equilibrium, then it is taken as the prediction of actual play.

Demand Revelation Games
For a social planner, one way to retrieve the level at which to operate the production facility is via a pre-specified demand game. Another way is to ask each of the agents for the maximal amount that she is willing to contribute in order to get service, and then, contingent on the reported amounts, install an adequate level of service together with a suitable allocation of costs. Whereas a demand game lets each player ensure a level of service, in a demand revelation game each player is able to ensure a maximum charge for service. The approach will be discussed under the assumption of a discrete production technology with binary demands, so that the cost function c for the technology is basically the characteristic function of a cooperative game. Moreover, assume that the utilities of the agents in N are quasi-linear and given by

u_i(q_i, x_i) = α_i q_i − x_i ,  (12)

where q_i ∈ {0, 1} denotes the service level, x_i stands for the cost share, and α_i is a non-negative real number. [93] discusses this framework and assumes that c is concave; [64,148] moreover take c as the joint cost function for the realization of several discrete public goods.

Demand Revelation Mechanisms
Formally, a revelation mechanism M assigns to each profile y of reported maximal contributions a set S(y) of agents receiving service and a vector x(y) of monetary compensations. Here we require that these monetary compensations are cost shares: given some cost sharing rule φ, the vector x(y) is given by φ(1_{S(y)}, c), where c is the relevant cost function. Note that by restricting ourselves to cost share vectors we implicitly assume non-positive monetary transfers. The budget balance condition is crucial here; otherwise mechanisms of a different nature must be considered as well, see [23,40,41]. Another way for a planner to determine a suitable service level is to demand pre-payments from the players and determine the service level on the basis of these labeled contributions, see [64,148]. Many mechanisms come to mind, but in order to avoid too much arbitrariness on the planner's side, the more sensible ones will grant the players some control over the outcomes. We postulate the following properties:

Voluntary Participation (VP) Each agent i can guarantee herself the welfare level u_i(0, 0) (no service, no payment) by reporting truthfully the maximal willingness to pay, which is α_i under Eq. (12).

Consumer Sovereignty (CS) For each agent i a report y_i exists so that she receives service, irrespective of the reports by others.

Now suppose that the planner receives the message y = α, known to her to be the profile of true player characteristics. Then for economic reasons she could choose to serve a coalition S of players that maximizes the net benefit at α, π(S, α) = α(S) − c(S). However, problems arise when some player i is supposed to pay more than α_i, so the planner should be more careful than that. She may want to choose a coalition S with maximal π(S, α) such that φ(1_S, c) ≤ α holds; such a set S is called efficient. But in general the planner cannot tell whether the players reported truthfully, so what should she do? One option is to apply the above procedure anyway, naively taking each reported profile for the true player characteristics. In other words, she will pick a coalition that solves the following optimization problem:

max_{S⊆N}  π(S, y) = y(S) − c(S)   s.t.  φ(1_S, c) ≤ y .  (13)
Denote such a set by S(φ, y). If this set is unique, then the demand revelation mechanism M(φ) selects S(φ, y) to be served at cost shares determined by x(y) = φ(1_{S(φ,y)}, c). This procedure is explained through some numerical examples.

Example 18 Consider the airport problem and utilities of players over service levels and cost shares as in Example 7. Moreover, assume the planner uses the Shapley cost sharing rule Φ as in Example 17 and that she receives the true profile of preferences from the players, α = (8, 6, 30). Calculate for each coalition S the net benefits at α:

S         ∅    1     2     3    12    13    23    N
π(S, α)   0   −4   −14    −3    −6     5     3   11

Not surprisingly, the net benefits are highest for the grand coalition. But if N were selected by the mechanism, the corresponding cost shares would be Φ(1_N, c) = (4, 8, 21), and player 2 is supposed to contribute more than she is willing to. The second highest net benefit is then generated by serving S = {1, 3}, with cost shares Φ(1_S, c) = (6, 0, 27). Hence {1, 3} is the solution to Eq. (13).
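The selection in Eq. (13) is easy to reproduce computationally. A minimal sketch (again assuming the stand-alone costs (12, 20, 33) of the airport game) enumerates all coalitions, keeps those whose Shapley shares are covered by the reports, and picks the one with the largest net benefit:

```python
from itertools import combinations, permutations

stand_alone = {1: 12, 2: 20, 3: 33}      # assumed airport game of Example 17
def cost(S):
    return max((stand_alone[i] for i in S), default=0)

def shapley(S, v):
    """Shapley value of the cost game v restricted to the players in S."""
    S = tuple(S)
    shares = {i: 0.0 for i in S}
    count = 0
    for order in permutations(S):
        coalition = []
        for i in order:
            before = v(coalition)
            coalition.append(i)
            shares[i] += v(coalition) - before
        count += 1
    return {i: shares[i] / count for i in S}

def select(y):
    """Best coalition under Eq. (13): max net benefit s.t. shares <= reports."""
    best, best_net = (), 0.0             # the empty coalition is always feasible
    for k in range(1, 4):
        for S in combinations((1, 2, 3), k):
            x = shapley(S, cost)
            if any(x[i] > y[i] for i in S):
                continue                 # some member would refuse to pay
            net = sum(y[i] for i in S) - cost(S)
            if net > best_net:
                best, best_net = S, net
    return best, best_net

print(select({1: 8, 2: 6, 3: 30}))    # truthful reports -> ((1, 3), 5)
print(select({1: 13, 2: 6, 3: 20}))   # misreport        -> ((1,), 1)
```

The second call previews the misreport discussed next: under y = (13, 6, 20) only the singleton {1} survives the cost-share constraint.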
What happens if some of the players misrepresent their preferences, for instance as in y = (13, 6, 20)? The planner determines the conceived net benefits

S         ∅    1     2     3    12    13    23    N
π(S, y)   0    1   −14   −13    −1     0    −7    6

Again, if the planner served the coalition with the highest net benefit, N, then player 2 would refuse to pay. The second highest net benefit corresponds to the singleton S = {1}, and this player will get service under M(Φ) since y₁ = 13 > 12 = c(1, 0, 0).

Example 19 Consider the same situation but now with E instead of Φ as cost sharing rule. Consider the following table, restricted to the coalitions with non-negative net benefits (no others will be selected):

S           ∅          13           23              N
π(S, α)     0           5            3             11
E(1_S, c)  (0, 0, 0)  (12, 0, 21)  (0, 33/2, 33/2)  (10, 10, 13)

Here only the empty coalition S = ∅ satisfies the requirement E(1_S, c) = (0, 0, 0) ≤ (8, 6, 30); hence according to the mechanism M(E) nobody will get service.

In general the optimization problem Eq. (13) need not have a unique solution, in which case the planner should further specify what she does. For concave cost functions, consider the following sequence of coalitions:

S¹ = N ,  S^t = { i ∈ S^{t−1} | y_i ≥ φ_i(1_{S^{t−1}}, c) } .

So, starting with the grand coalition N, at each consecutive step those players are removed whose maximal contributions are not consistent with the proposed cost shares, until the process settles down. The set of remaining players defines a solution to Eq. (13) and is taken to define S(φ, y).

Strategyproofness
The essence of a demand revelation mechanism M is that its rules are set up in such a way that the players have no incentive to lie about their true preferences. We now discuss the most common non-manipulability properties of revelation mechanisms in the literature. Fix two profiles α′, α ∈ R₊^N, where α corresponds to the true maximal willingness to pay. Let (q′, x′) and (q, x) be the allocations implemented by the mechanism M on receiving the messages α′ and α, respectively. The mechanism M is called strategy-proof if for all i ∈ N, α′_{N∖{i}} = α_{N∖{i}} implies u_i(q′_i, x′_i) ≤ u_i(q_i, x_i). So, given the situation that the other agents report truthfully, unilateral deviation by agent i from the true preference never produces better outcomes for her. Similarly, M is group strategy-proof if deviations by groups of agents do not pay for all deviators, i.e., for all T ⊆ N, α′_{N∖T} = α_{N∖T} implies u_i(q′_i, x′_i) ≤ u_i(q_i, x_i) for all i ∈ T. So, under a (group) strategy-proof mechanism there is no incentive to misrepresent the true preferences, and this gives a benevolent planner control over the outcome.

Cross-Monotonicity A cost sharing rule φ is called cross-monotonic if, for a concave cost function c, the cost share of an agent i does not increase when other agents demand more service. Formally, if x ≤ x̄ and x_i = x̄_i then φ_i(x̄, c) ≤ φ_i(x, c); each agent is granted a (fair) share of the positive externality due to an increase in demand by others.

Proposition 1 (Moulin & Shenker [93]) The only group strategy-proof mechanisms M(φ) satisfying VP and CS are those related to cross-monotonic cost sharing rules φ.

There are many different cross-monotonic cost sharing rules, and thus just as many mechanisms that are group strategy-proof. Examples include the mechanisms M(Φ) and M(E), because Φ and E are cross-monotonic. The nucleolus, however, is not cross-monotonic and therefore does not induce a strategy-proof mechanism.

Above we discussed two instruments a social planner may invoke to implement a desirable outcome without knowing the true preferences of the agents. Basically, the demand revelation games define a so-called direct mechanism: the announcement of the maximal price for service pins down the complete preferences of an agent, so in fact the planner decides upon the service level based on a complete profile of preferences. In case of a cross-monotonic cost sharing rule, truth-telling is a weakly dominant strategy under the induced mechanism; announcing the true maximal willingness to pay is optimal for the agent regardless of the actions of others.
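Cross-monotonicity of the Shapley rule on the concave airport game can be verified directly. The sketch below (still assuming stand-alone costs (12, 20, 33)) compares each agent's Shapley share across all nested pairs of served coalitions:

```python
from itertools import permutations

stand_alone = {1: 12, 2: 20, 3: 33}   # assumed airport game: a set pays its max
def cost(S):
    return max((stand_alone[i] for i in S), default=0)

def shapley(S):
    S = tuple(S)
    shares = {i: 0.0 for i in S}
    perms = list(permutations(S))
    for order in perms:
        coalition = []
        for i in order:
            before = cost(coalition)
            coalition.append(i)
            shares[i] += cost(coalition) - before
    return {i: shares[i] / len(perms) for i in S}

def subsets(items):
    out = [()]
    for x in items:
        out += [s + (x,) for s in out]
    return out

# Cross-monotonicity: enlarging the served set never raises a member's share.
cross_monotonic = all(
    shapley(T)[i] <= shapley(S)[i] + 1e-9
    for S in subsets((1, 2, 3)) if S
    for T in subsets((1, 2, 3)) if set(S) <= set(T)
    for i in S
)
print(cross_monotonic)   # -> True
```

For this concave game every check passes, in line with Proposition 1's premise for M(Φ).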
This is good news for the social planner, as the mechanism is self-organizing: the agents need not form any conjecture about the behavior of others in order to know what to do. In the literature [24] such a mechanism is called straightforward. The demand games define an indirect mechanism, as by reporting a demand the agents do no more than signal their preferences to the planner. Although in general there is a clear distinction between direct and indirect mechanisms, in the model presented in this section they are nevertheless strongly connected. Focus on a demand game G(φ, c); recall that in this game the agents simultaneously and independently decide upon requesting service or not, and costs are shared using the rule φ amongst those agents receiving service. Suppose that for each profile of utility functions as in Eq. (12) the resulting game G(φ, c) has a unique (strong) Nash equilibrium. Then this equilibrium can be taken to define a mechanism: the mechanism elicits u and chooses the unique equilibrium outcome of the reported demand game. This mechanism is equivalent to the demand revelation mechanism. Observe that indeed the strong equilibrium (1, 0, 1) in the game G(Φ, c) in Example 17 corresponds to the players chosen by M(Φ) under truthful reporting. And where no player is served in the strong equilibrium of G(E, c), none of the players is selected by M(E). It is a general result in implementation theory due to [24] that a direct mechanism constructed in this way is (group) strategy-proof provided the underlying space of preferences is rich. It is easily seen that the above sets of preferences meet the requirements. To stress the importance of such a structural property as richness, it is instructive to point at what is yet to come in Sect. "Uniqueness of Nash-Equilibria in P¹-Demand Games". There, the strategic analysis of the demand game induced by the proportional rule shows uniqueness of Nash equilibrium on the domain of preferences L provided costs are convex. However, this domain is not rich, and the direct mechanism defined in the same fashion by the Nash equilibrium selection is not strategy-proof.

Efficiency and Strategy-Proof Cost Sharing Mechanisms
Suppose cardinal utility for each agent, so that interpersonal comparison of utility is allowed. Proceeding on the net benefit of a coalition, we may define its value at α by

v(N, α) = max_{S⊆N} π(S, α) ,
(14)
where π(S, α) is the net benefit of S at α. A coalition S such that v(N, α) = π(S, α) is called efficient. It will be clear that a mechanism M(φ) defined through the optimization problem Eq. (13) will not always implement an efficient coalition of served players, due to the extra constraint on the cost shares. For instance, in Example 7 the value of the grand coalition at α = (8, 6, 30) is v(N, α) = α(N) − c(N) = 44 − 33 = 11. At the same profile the outcome implemented by the mechanism M(Φ) gives rise to a total surplus of 38 − 33 = 5 for the grand coalition, which is not optimal. The mechanism M(E) performs even worse, as it leads to the stand-alone surplus 0: nobody is served. This observation holds in far more general settings; moreover, it is a well-known result from implementation theory that, under non-constant marginal cost, any strategy-proof mechanism based on full coverage of total costs will not always implement efficient outcomes. For the constant marginal cost case see [67,71]. Then, if there is an unavoidable loss in using demand revelation mechanisms, can we still tell which mechanisms are more efficient? Is it a coincidence that in the above examples the Shapley value performs better than the egalitarian solution? The welfare loss due to M(φ) at a profile of true preferences α is given by

L(φ, α) = v(N, α) − { α(S(φ, α)) − c(S(φ, α)) } .  (15)

For instance, with α = (8, 6, 30) in the above examples we calculate L(Φ, α) = 11 − 5 = 6 and L(E, α) = 11 − 0 = 11. An overall measure of the quality of a cost sharing rule φ in terms of efficiency loss is defined by

ε(φ) = sup_α L(φ, α) .  (16)

Theorem 6 (Moulin & Shenker [93]) Among all mechanisms M(φ) derived from cross-monotonic cost sharing rules φ, the Shapley rule Φ has the unique smallest maximal efficiency loss: ε(Φ) < ε(φ) if φ ≠ Φ.

Notice that this makes a strong case for the Shapley value against the egalitarian solution. The story does not end here, however. [98] considers a model where the valuations of the agents for the good are independent random variables, drawn from a distribution function F satisfying the monotone hazard condition. This means that the function h(x) = f(x)/(1 − F(x)) is non-decreasing, where f is the density function of F. It is shown that the constrained egalitarian solution maximizes the probability that all members of any given coalition accept the cost shares imputed to them. Moreover, [98] characterized the solution in terms of efficiency. Suppose for the moment that the planner calculated a cost share vector x for the coalition S, and that its members are served conditional on acceptance of the proposed cost shares. The probability that all members of S accept the shares is P(x) = ∏_{i∈S} (1 − F(x_i)), and if we assume that the support of F is (0, m), the expected surplus from such an offer can be calculated as follows:

W(x) = P(x) [ Σ_{i∈S} ∫_{x_i}^{m} u_i / (1 − F(x_i)) dF(u_i) − c(S) ] .  (17)

The finding of [98] is that for log-concave f, i.e. x ↦ ln f(x) is concave [5], the mechanism based on the constrained egalitarian solution not only maximizes the probability that a coalition accepts the proposal, but maximizes its expected surplus as well. Formally, the result is the following.
Theorem 7 (Mutuswami [98]) If the valuations (u_i)_{i∈N} are independently drawn from a common distribution function F with log-concave and differentiable density function f, then W(E(1_S, c)) ≥ W(φ(1_S, c)) for all cross-monotonic solutions φ and all S ⊆ N.

Extension of the Model: Discrete Goods
Suppose the agents consume idiosyncratic goods produced in indivisible units. Then, given a profile of demands, the cost associated with the joint production must be shared by the users. This model generalizes the binary good model discussed so far, and it is a launch-pad to the continuous framework in the next section. In this discrete good setting [86] characterizes the cost sharing rules which induce strategy-proof social choice functions defined by the equilibria of the corresponding demand game. As it turns out, these rules are basically the sequential stand-alone rules, according to which costs are shared in an incremental fashion with respect to a fixed ordering of the agents. Such a rule charges the first agent her stand-alone cost, the second the stand-alone cost of the first two users minus the stand-alone cost of the first, et cetera. Here the word 'basically' refers to all discrete cost sharing problems other than those with binary goods. For binary goods the sufficient condition for strategyproofness is that the underlying cost sharing rule be cross-monotonic, which admits rules other than the sequential ones – like E and Φ.
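A sequential stand-alone rule is a one-pass incremental computation. A small sketch, using the airport costs of Example 17 as an illustrative assumption:

```python
def sequential_stand_alone(order, c):
    """Charge each agent the incremental stand-alone cost along a fixed ordering."""
    shares, coalition = {}, []
    for i in order:
        before = c(coalition)
        coalition.append(i)
        shares[i] = c(coalition) - before
    return shares

# Assumed airport costs: a coalition pays its largest member's stand-alone cost.
stand_alone = {1: 12, 2: 20, 3: 33}
cost = lambda S: max((stand_alone[i] for i in S), default=0)

print(sequential_stand_alone([1, 2, 3], cost))   # -> {1: 12, 2: 8, 3: 13}
print(sequential_stand_alone([3, 2, 1], cost))   # -> {3: 33, 2: 0, 1: 0}
```

Note how the outcome depends on the chosen ordering: whoever comes later only pays the extra cost her presence adds.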
Continuous Cost Sharing Models

Continuous Homogeneous Output Model, P¹
This model deals with production technologies for one single, perfectly divisible output commodity. Moreover, we restrict ourselves to private goods; many of the ideas below have been studied for public goods as well, see e.g. [33,72,82] for further references. The demand space of an individual is X = R₊. The technology is described by a non-decreasing cost function c : R₊ → R with c(0) = 0, i.e. there are no fixed costs. Given a profile of demands x ∈ R₊^N, the costs c(x(N)) have to be shared. Moreover, the space of cost functions is restricted to the absolutely continuous ones; examples include the differentiable and Lipschitz-continuous functions. Absolute continuity implies that aggregate production costs can be calculated as the total of marginal costs,

c(y) = ∫₀^y c′(t) dt .

Denote the set of all such cost functions by C¹ and the related cost sharing problems by P¹. Several cost sharing rules on P¹ have been proposed in the literature.

Average Cost Sharing Rule This is the most popular and oldest concept in the literature and advocates Aristotle's principle of proportionality; it is therefore also referred to as the proportional rule, φ^P:

φ^AV(x, c) = ( x / x(N) ) c(x(N)) if x ≠ 0_N ,  φ^AV(x, c) = 0 if x = 0_N .  (18)

Shapley–Shubik Rule Each cost sharing problem (x, c) ∈ P¹ is related to the stand-alone cost game c_x with c_x(S) = c(x(S)). The Shapley–Shubik rule is determined by applying the Shapley value to this game:

φ^SS(x, c) = Φ(c_x) .

Serial Rule This rule, due to Moulin and Shenker [91], determines cost shares by considering particular intermediate cost levels. More precisely, given (x, c) ∈ P¹ it first relabels the agents by increasing demands, so that x₁ ≤ x₂ ≤ ⋯ ≤ x_n. The intermediate production levels are

y¹ = nx₁ ,  y² = x₁ + (n−1)x₂ ,  … ,  y^k = Σ_{j=1}^{k−1} x_j + (n−k+1) x_k ,  … ,  y^n = x(N) .

These levels are chosen such that at each new level one more agent is fully served her demand: at y¹ each agent is handed out x₁, at y² agent 1 is given x₁ and the rest x₂, etc. With y⁰ = 0, the serial cost shares are given by

φ^SR_i(x, c) = Σ_{ℓ=1}^{i} ( c(y^ℓ) − c(y^{ℓ−1}) ) / (n − ℓ + 1) .
So according to φ^SR each agent pays a fair share of the incremental costs in each stage in which she receives new units.

Example 20 Consider the cost sharing problem (x, c) with x = (10, 20, 30) and c(y) = ½y². First calculate the intermediate production levels y⁰ = 0, y¹ = 30, y² = 50, and y³ = 60. Then the cost shares are calculated as follows:

φ^SR₁(x, c) = ( c(y¹) − c(y⁰) ) / 3 = 150 ,
φ^SR₂(x, c) = φ^SR₁(x, c) + ( c(y²) − c(y¹) ) / 2 = 150 + (1250 − 450)/2 = 550 ,
φ^SR₃(x, c) = φ^SR₂(x, c) + c(y³) − c(y²) = 550 + 550 = 1100 .

The serial rule has attracted much attention lately in the network literature, and found its way into fair queuing packet scheduling algorithms in routers [27].

Decreasing Serial Rule De Frutos [26] proposes serial cost shares where the demands of the agents are handled in decreasing order; the result is the decreasing serial rule. Consider a demand vector x ∈ R₊^N with x₁ ≤ x₂ ≤ ⋯ ≤ x_n. Define recursively the numbers ȳ^ℓ for ℓ = 1, 2, …, n by ȳ^ℓ = ℓx_ℓ + x_{ℓ+1} + ⋯ + x_n, and put ȳ^{n+1} = 0. Then the decreasing serial rule is defined by

φ^DSR_i(x, c) = Σ_{ℓ=i}^{n} ( c(ȳ^ℓ) − c(ȳ^{ℓ+1}) ) / ℓ .  (19)

Example 21 For the cost sharing problem in Example 20 calculate ȳ³ = 90, ȳ² = 70, ȳ¹ = 60; then

φ^DSR₃(x, c) = ( c(ȳ³) − c(ȳ⁴) ) / 3 = 4050/3 = 1350 ,
φ^DSR₂(x, c) = φ^DSR₃(x, c) + ( c(ȳ²) − c(ȳ³) ) / 2 = 1350 + (2450 − 4050)/2 = 550 ,
φ^DSR₁(x, c) = φ^DSR₂(x, c) + ( c(ȳ¹) − c(ȳ²) ) = 550 + (1800 − 2450) = −100 .

Notice that here the cost share of agent 1 is negative, due to the convexity of c! This may be considered an undesirable feature of the cost sharing rule. Not only are costs increasing in the level of demand; in case of a convex cost function each agent contributes to the negative externality. It seems fairly reasonable to demand a non-negative contribution in those cases, so that nobody profits from just being there. The mainstream cost sharing literature includes positivity of cost shares in the very definition of a cost sharing rule. Here we add it as a specific property:

Positivity φ is positive if φ(x, c) ≥ 0_N for all (x, c) in its domain.

All cost sharing rules discussed earlier have this property, except for the decreasing serial rule. The decreasing serial rule is far more intuitive in case of economies of scale, in the presence of a concave cost function: the larger agents are then credited with a lower price per unit of the output good. [47,57] propose variations on the serial rule that coincide with the increasing (decreasing) serial rule in case of a convex (concave) cost function, meeting the positivity requirement.

Marginal Pricing Rule A popular way of pricing the output of a production facility is marginal cost pricing: the price of the output good is set to cover the cost of producing one extra unit. It is frequently used in the domain of public services and utilities. However, a problem is that for concave cost functions the method leads to budget deficits. An adapted form of marginal cost pricing splits these deficits equally over the agents. The marginal pricing rule is defined by

φ^MP_i(x, c) = x_i c′(x(N)) + (1/n) ( c(x(N)) − x(N) c′(x(N)) ) .  (20)

Note that in case of convex cost functions agents can receive negative cost shares, just as with decreasing serial cost sharing.
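The three rules of this section can be sketched in a few lines each; the functions below reproduce the shares of Examples 20 and 21 for x = (10, 20, 30) and c(y) = ½y²:

```python
def serial_shares(x, c):
    """Increasing serial rule (Moulin-Shenker)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    xs = [x[i] for i in order]
    # y^k = x_1 + ... + x_{k-1} + (n-k+1) x_k, with y^0 = 0
    y = [0.0] + [sum(xs[:k]) + (n - k) * xs[k] for k in range(n)]
    shares, acc = [0.0] * n, 0.0
    for k in range(1, n + 1):
        acc += (c(y[k]) - c(y[k - 1])) / (n - k + 1)
        shares[order[k - 1]] = acc
    return shares

def decreasing_serial_shares(x, c):
    """Decreasing serial rule (de Frutos), Eq. (19)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    xs = [x[i] for i in order]
    # ybar^l = l x_l + x_{l+1} + ... + x_n, with ybar^{n+1} = 0
    y = [(k + 1) * xs[k] + sum(xs[k + 1:]) for k in range(n)] + [0.0]
    shares, acc = [0.0] * n, 0.0
    for k in range(n, 0, -1):
        acc += (c(y[k - 1]) - c(y[k])) / k
        shares[order[k - 1]] = acc
    return shares

def marginal_pricing_shares(x, c, c_prime):
    """Marginal pricing rule, Eq. (20): deficits/surpluses split equally."""
    n, total = len(x), sum(x)
    fix = (c(total) - total * c_prime(total)) / n
    return [xi * c_prime(total) + fix for xi in x]

c = lambda y: 0.5 * y * y
x = (10, 20, 30)
print(serial_shares(x, c))                        # -> [150.0, 550.0, 1100.0]
print(decreasing_serial_shares(x, c))             # -> [-100.0, 550.0, 1350.0]
print(marginal_pricing_shares(x, c, lambda y: y)) # -> [0.0, 600.0, 1200.0]
```

For this convex cost function both the decreasing serial rule and the marginal pricing rule hand agent 1 a share of at most zero, illustrating the failure of Positivity.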
Additive Cost Sharing and Rationing
The above cost sharing rules for homogeneous production models share the following properties:

Additivity φ(x, c₁ + c₂) = φ(x, c₁) + φ(x, c₂) for all relevant cost sharing problems. This property carries the same flavor as the homonymous property for cost games.

Constant Returns φ(x, c) = θx for linear cost functions c with c(y) = θy for all y. So if the agents do not cause any externality, the fixed marginal cost is taken as the price for the good.

It turns out that the class of all positive cost sharing rules with these properties can be characterized by solutions to rationing problems, which are among the most basic models of distributive justice. A rationing problem amongst the agents in N consists of a pair (x, t) ∈ R₊^N × R₊ such that x(N) ≥ t; t is the available amount of some (in)divisible good and x is the profile of demands. The inequality sees to the interpretation of rationing: not every agent may get all she wants. A rationing method r is a solution to rationing problems, assigning to each problem (x, t) a vector of shares r(x, t) ∈ R₊^N such that 0_N ≤ r(x, t) ≤ x. The latter restriction is a weak solidarity statement assuring that everybody's demand be rationed in case of shortages. For t ∈ R₊ define the special cost function c_t by c_t(y) = min{y, t}. The cone generated by these base
functions lies dense in the space of all absolutely continuous cost functions c: if we know the values φ(x, c_t), then basically we know φ(x, c). Denote by M the class of all cost sharing rules with the properties positivity, additivity, and constant returns.

Theorem 8 (Moulin & Shenker [88,92]) Consider the following mappings associating rationing methods with cost sharing rules and vice versa:

r ↦ φ :  φ(x, c) = ∫₀^{x(N)} c′(t) dr(x, t) ,
φ ↦ r :  r(x, t) = φ(x, c_t) .

These define an isomorphism between M and the space of all monotonic rationing methods.

So each monotonic rationing method relates to a cost sharing rule and vice versa. In this way φ^P is directly linked with the proportional rationing method, φ^SR to the uniform gains method, and φ^SS to the random priority method. Properties of rationing methods lead to properties of cost sharing rules and vice versa [61].

Cost Sharing, Figure 7 Intermediate production levels

Incentives in Cooperative Production

Stable Allocations, Stand-Alone Core
Suppose again, as in the framework of cooperative cost games, that (coalitions of) agents can decide to leave the cost sharing and organize their own production facility. Under the ability to replicate the technology, the question arises whether cost sharing rules induce stable cost share vectors.

Theorem 9 (Moulin [85]) For concave cost functions c, if φ is an anonymous and cross-monotonic cost sharing rule then φ(x, c) ∈ core(c_x).

Under increasing returns to scale, this implies that φ^P and φ^SR are core-selectors but φ^MP is not. [137] associates to each cost sharing problem (x, c) a pessimistic one, (x, c̄); here c̄(y) reflects the maximum of marginal costs on [0, x(N)] needed to produce y units,

c̄(y) = sup { ∫_T c′(t) dt | T ⊆ [0, x(N)], λ(T) = y }  if y ≤ x(N) ,  c̄(y) = c(y) otherwise .  (21)

Here λ denotes the Lebesgue measure.

Theorem 10 (Koster [60]) For any cost sharing problem (x, c), it holds that core(c_x) = { φ(x, c̄) | φ ∈ M }.

In particular this means that for φ ∈ M it holds that φ(x, c) ∈ core(c_x) whenever c is concave, since concavity implies c̄ = c. This result appeared earlier as a corollary to Theorem 8, see [88]. [47,57,61] exhibit non-linear cost sharing rules yielding core elements for concave cost functions as well, so additivity is only a sufficient condition in the above statement. For average cost sharing one can show more: φ^P(x, c) ∈ core(c_x) for all x precisely when the average cost c(y)/y is decreasing in y.
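Theorem 8's correspondence can be illustrated numerically. The following sketch, a discretized Riemann–Stieltjes sum over an ad hoc grid (an implementation choice, not part of the theorem), recovers the serial rule from uniform-gains rationing and the average rule from proportional rationing for the data of Example 20:

```python
def uniform_gains(x, t):
    """Uniform-gains rationing: r_i = min(x_i, lam), with the total equal to t."""
    lo, hi = 0.0, max(x)
    for _ in range(60):                   # bisection for the common cap lam
        lam = (lo + hi) / 2
        if sum(min(xi, lam) for xi in x) < t:
            lo = lam
        else:
            hi = lam
    return [min(xi, hi) for xi in x]

def proportional(x, t):
    total = sum(x)
    return [t * xi / total for xi in x]

def rule_from_rationing(r, x, c_prime, steps=4000):
    """phi(x, c) = integral of c'(t) dr(x, t) over [0, x(N)], as a Stieltjes sum."""
    T = sum(x)
    shares = [0.0] * len(x)
    prev = [0.0] * len(x)
    for s in range(1, steps + 1):
        cur = r(x, T * s / steps)
        mc = c_prime(T * (s - 0.5) / steps)    # marginal cost at the midpoint
        shares = [sh + mc * (b - a) for sh, a, b in zip(shares, prev, cur)]
        prev = cur
    return shares

x, c_prime = (10, 20, 30), (lambda t: t)       # c(y) = y^2 / 2 as in Example 20
serial = rule_from_rationing(uniform_gains, x, c_prime)   # close to (150, 550, 1100)
average = rule_from_rationing(proportional, x, c_prime)   # close to (300, 600, 900)
print([round(v, 2) for v in serial])
print([round(v, 2) for v in average])
```

Up to discretization error the first vector matches the serial shares of Example 20 and the second the average cost shares, as the isomorphism predicts.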
Strategic Manipulation Through Reallocation of Demands
In the cooperative production model there are other ways that agents may use to manipulate the final allocation. In particular, note that the serial procedure gives the larger demanders an advantage in case of positive externalities: as marginal costs decrease, the price per unit of output paid by the larger agents is lower than that paid by the smaller agents. In the other direction, larger demanders are punished if costs are convex. As the examples below show, this is just why the serial ideas are vulnerable to misrepresentation of demands: combining demands and redistributing the output afterwards can be beneficial.
Example 22 Consider the cost function c given by c(y) = min{5y, 60 + 2y}. Such cost functions are part of daily life, whenever one has to decide upon telephone or energy supply contracts: usually customers get to choose between a contract with high fixed cost and low variable cost, and another with low or no fixed cost and a high variable price. Now consider the two cost sharing problems (x, c) and (x′, c) with x = (10, 20, 30) and x′ = (0, 30, 30). The cost sharing problem (x′, c) arises from (x, c) if agent 2 places a demand on behalf of agent 1 – without letting agent 3 know. The corresponding average and serial cost shares are given by

φ^P(x, c) = (30, 60, 90) ,   φ^P(x′, c) = (0, 90, 90) ,
φ^SR(x, c) = (40, 60, 80) ,  φ^SR(x′, c) = (0, 90, 90) .

Notice that the total of average cost shares for agents 1 and 2 is the same in both cost sharing problems. But if the serial rule were used, these agents can profit by merging their demands: if agent 2 returns agent 1's demand and requires a payment from agent 1 between €30 and €40, then both agents profit from such merging of demand.

Example 23 Consider the five-agent cost sharing problems (x, c) and (x̄, c) with x = (1, 2, 3, 0, 0), x̄ = (1, 2, 1, 1, 1), and convex cost function c(y) = y². (x̄, c) arises out of (x, c) if agent 3 splits her demand over agents 4 and 5 as well. Then

φ^P(x, c) = (6, 12, 18, 0, 0) ,   φ^P(x̄, c) = (6, 12, 6, 6, 6) ,
φ^SR(x, c) = (3, 11, 22, 0, 0) ,  φ^SR(x̄, c) = (5, 16, 5, 5, 5) .

The aggregate of average cost shares for agents 3, 4, and 5 does not change. But notice that according to the serial cost shares there is a clear advantage for these agents: instead of paying 22 in the original problem, the total of their payments now equals 15. Agent 3 may offer agents 4 and 5 a transfer for their collaboration and still be better off. In general, in case of a convex cost function the serial rule is vulnerable to manipulation of demands through splitting.

Note that in the above cases the proportional cost sharing rule does prescribe the same aggregate cost shares: it is a non-manipulable rule, as reshuffling of demands will not lead to different aggregate cost shares. The rule does not discriminate between units; when a unit is produced is irrelevant. This is actually a very special feature, basically not satisfied by any other cost sharing rule.

Theorem 11 Assume that N contains at least three agents. The proportional cost sharing rule is the unique rule that charges nothing for a null demand and meets any one of the following properties:

Independence of merging and splitting,
No advantageous reallocation,
Irrelevance of reallocation.

The second property is even stronger than independence of merging and splitting: agents may redistribute their demands in any preferred way without changing the aggregate cost shares of the agents involved. The third property states that in such cases the cost shares of the other agents do not change. This makes proportional cost sharing compelling in situations where one is not capable of detecting the true demand characteristics of individuals.

Demand Games for P¹
Consider demand games G(φ, c) as in Eq. (11), Sect. "Demand Games", where now φ is a cost sharing rule on P¹. These games with uncountable strategy spaces are more complex than the demand games that we studied before. The set of consequences for the players is now C = R₊², combinations of levels of production and cost (see Sect. "Strategic Demand Games"). An individual i's preference relation is convex if for the corresponding utility function u_i and all pairs z, z′ ∈ C it holds that

u_i(z) = u_i(z′) ⟹ u_i(tz + (1−t)z′) ≥ u_i(z)  for all t ∈ [0, 1] .  (22)

This means that a weighted average of two consequences is weakly preferred to either of them, if these are equivalent. Such utility functions u_i are called quasi-concave. An example of convex preferences are those related to linear utility functions of type u_i(x, y) = αx − y. Moreover, strictly convex preferences are those with strict inequality in Eq. (22) for 0 < t < 1; the corresponding utility functions are strictly quasi-concave. Special classes of preferences are the following.

L: the class of all convex and continuous preferences with utility functions that are non-decreasing in the service component x, non-increasing in the cost component y, non-locally satiated, and decreasing on (x, c(x)) for x large enough. The latter restriction merely assures that agents will not place requests for unlimited amounts of the good.
Cost Sharing, Figure 8 Linear, convex preferences, u(x; y) D 2x y. The contours indicate indifference curves, i. e. sets of type f(x; y)ju(x; y) D kg, the k-level curve of u
L*: the class of bi-normal preferences in L. Basically, if such a preference is represented by a differentiable utility function u, then the slope dy/dx of the indifference contours is non-increasing in x and non-decreasing in y. For a concise definition see [141]. Examples include Cobb–Douglas utility functions and also those of type u_i(x, y) = α(x) − β(y), where α and β are concave and convex functions, respectively. A typical plot of level curves of such utility functions is in Fig. 9. Note that the approach differs from the standard literature, where agents have preferences over endowments; here costs are 'negative' endowments. In the latter interpretation, the condition can be read as saying that the marginal rate of substitution is non-positive: at equal utility, an increase in the level of output has to be compensated by a decrease in the level of the input good.

Nash-Equilibria of Demand Games in a Simple Case
Consider a production facility shared by three agents N = {1, 2, 3} with cost function c(y) = ½y². Assume that the agents have quasi-linear utilities in L, i.e. u_i(x_i, y_i) = α_i x_i − y_i for all pairs (x_i, y_i) ∈ R²_+. Below, the Nash-equilibrium of the serial and the proportional demand game is calculated in two special cases. This numerical example is based on [91].

Proportional Demand Game
Consider the corresponding proportional demand game, G(φ^P; c), with utility over actions given by

U_i^P(x) = α_i x_i − φ_i^P(x; c) = α_i x_i − ½ x_i x(N).   (23)
Cost Sharing, Figure 9: Strictly convex preferences, u(x, y) = √x − e^(0.5y). The straight line connecting any two points on the same contour lies in the lighter area, with higher utility. Fix the y-value; then an increase of x yields higher utility, whereas for fixed x an increase of y causes the utility to decrease.
In a Nash-equilibrium x* of G(φ^P; c), each player i plays a best response to x*_{−i}, the action profile of the other agents. That is, player i chooses x*_i ∈ argmax_t U_i^P(t, x*_{−i}). The first-order conditions then imply, for an interior solution,

α_i − ½x*(N) − ½x*_i = 0   (24)

for all i ∈ N. Hence x*(N) = ½(α₁ + α₂ + α₃) and x*_i = 2α_i − ½(α₁ + α₂ + α₃).

Serial Demand Game
Consider the same production facility and the demand game G(φ^SR; c) corresponding to the serial rule. The utilities over actions are given by

U_i^SR(x) = α_i x_i − φ_i^SR(x; c).   (25)

Now suppose x is a Nash equilibrium of this game, and assume without loss of generality that x₁ ≤ x₂ ≤ x₃. Then player 1, with the smallest equilibrium demand, maximizes the expression U₁^SR(t, x₂, x₃) = α₁t − c(3t)/3 = α₁t − (3/2)t² at x₁, from which we may conclude that α₁ = 3x₁. In addition, in equilibrium player 2 maximizes U₂^SR(x₁, t, x₃) = α₂t − (⅓c(3x₁) + ½(c(x₁ + 2t) − c(3x₁))) for t ≥ x₁, yielding α₂ = x₁ + 2x₂. Finally, the equilibrium condition for player 3 implies α₃ = x(N). It is then not hard to see that this actually constitutes the serial equilibrium.
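The first-order conditions of the proportional game can be checked numerically. The sketch below is an illustration, not part of the original text; the function names are mine. It computes the equilibrium for c(y) = y²/2 by Gauss–Seidel best-response iteration and compares it with the closed form x_i = 2α_i − ½(α₁ + α₂ + α₃):

```python
# Sketch: Nash equilibrium of the proportional demand game with c(y) = y^2/2
# and quasi-linear utilities U_i(x) = alpha_i*x_i - x_i*x(N)/2.

def best_response(alpha_i, others_sum):
    # Interior first-order condition alpha_i - t - others_sum/2 = 0,
    # truncated at 0 for boundary cases.
    return max(0.0, alpha_i - 0.5 * others_sum)

def proportional_equilibrium(alpha, sweeps=500):
    # Gauss-Seidel best-response iteration; it converges here because the
    # underlying linear system is symmetric positive definite.
    x = [0.0] * len(alpha)
    for _ in range(sweeps):
        for i, a in enumerate(alpha):
            x[i] = best_response(a, sum(x) - x[i])
    return x

alpha = (2.0, 2.5, 3.0)   # an interior case: all equilibrium demands positive
x_star = proportional_equilibrium(alpha)
closed_form = [2 * a - 0.5 * sum(alpha) for a in alpha]
```

The iteration reproduces the closed-form demands and the aggregate x*(N) = ½Σα_i.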
Comparison of Proportional and Serial Equilibria (I)
Now let us compare the serial and the proportional equilibrium in the following two cases:

(i) α₁ = α₂ = α₃ = α,
(ii) α₁ = α₂ = 2, α₃ = 4.
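These cases can be verified mechanically. The sketch below is illustrative (helper names are mine); it checks case (i) with α = 2, computing serial cost shares from the standard serial formula with intermediate production levels y^m = x₁ + … + x_m + (n − m)x_m:

```python
# Sketch: equilibrium payoffs in case (i) with alpha = 2 and c(y) = y^2/2.
def cost(y):
    return 0.5 * y * y

def serial_shares(x, c):
    # Agent with the m-th smallest demand pays
    # sum_{k<=m} (c(y^k) - c(y^(k-1))) / (n - k + 1).
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    shares = [0.0] * n
    acc, prev, cum = 0.0, 0.0, 0.0
    for m, i in enumerate(order):
        cum += x[i]
        level = cum + (n - m - 1) * x[i]
        acc += (c(level) - c(prev)) / (n - m)
        prev = level
        shares[i] = acc
    return shares

alpha = 2.0
x_p = [alpha / 2] * 3     # proportional equilibrium demands
x_s = [alpha / 3] * 3     # serial equilibrium demands
u_p = [alpha * xi - xi / sum(x_p) * cost(sum(x_p)) for xi in x_p]
u_s = [alpha * xi - si for xi, si in zip(x_s, serial_shares(x_s, cost))]
```

For α = 2 this gives payoffs α²/8 = 0.5 under the proportional rule and α²/6 = 2/3 under the serial rule, so every player is strictly better off in the serial equilibrium.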
Case (i): We get x* = (½α, ½α, ½α) in the proportional game and x = (⅓α, ⅓α, ⅓α) in the serial game. The resulting equilibrium payoff vectors are given by

U^P(x*) = (⅛α², ⅛α², ⅛α²) and U^SR(x) = (⅙α², ⅙α², ⅙α²).

Not only is the proportional outcome less efficient than its serial counterpart, it is also Pareto-inferior to the latter.

Case (ii): The proportional equilibrium is a boundary solution, x* = (0, 0, 4), with utility profile U^P(x*) = (0, 0, 8). The serial equilibrium strategies and utilities are given by

x = (⅔, ⅔, 8/3), U^SR(x) = (⅔, ⅔, 4).

Notice that the serial equilibrium is now less efficient, but it is not Pareto-dominated by the proportional utility distribution.

Uniqueness of Nash-Equilibria in P1-Demand Games
In the above demand games there is a unique Nash-equilibrium, which serves as a prediction of actual play. This need not hold in general. In the literature, strategic characterizations of cost sharing rules are discussed in terms of uniqueness of equilibrium in the induced demand games, relative to specific domains of preferences and cost functions. Below we discuss the major findings of [141]. These results concern a broader cost sharing model, with the notion of a cost function as a differentiable function R_+ → R_+. So in this paragraph such a cost function may decrease, and fixed cost need not be 0. This change in setup is not crucial to the overall exposition, since the characterizations below are easily interpreted within the context of P1.

Demand Monotonicity The mapping t ↦ φ_i((t, x_{−i}); c) is non-decreasing; φ is strictly demand monotonic if this mapping is increasing whenever c is increasing.

Smoothness The mapping x ↦ φ(x; c) is continuously differentiable for all continuously differentiable c ∈ C¹.

Recall that L* is the domain of all bi-normal preferences.

Theorem 12 (Watts [141]) Fix a differentiable cost function c and a demand monotonic and smooth cost sharing rule φ.
A cost sharing game G(φ; c) has a unique equilibrium whenever agents' preferences belong to L* only if, for all x = (x₁, …, x_n), every principal minor of the matrix W with rows w_i is non-negative for all i, where row w_i collects the first- and second-order partial derivatives ∂φ_i/∂x_j and ∂²φ_i/∂x_j∂x_i of the cost shares (see [141] for the precise expression).   (26)
The determinant of the Hessian matrix corresponding to the mapping x ↦ φ(x; c) is strictly positive. A sufficient condition for uniqueness of equilibrium is that the principal minors of the matrix W are strictly positive.

The impact of this theorem is that one can characterize the class of cost functions yielding unique equilibria if the domain of preferences is L*.

G(φ^SR; c), G(φ^DSR; c): A necessary condition for uniqueness of equilibrium is that c is strictly convex, i.e. c″ > 0. A sufficient condition is that c is increasing and strictly convex. Actually, [141] also shows that the conclusions for the serial rule do not change when L is used instead of L*. As will become clear below, the serial games have unique strategic properties.

G(φ^P; c): The necessary and sufficient conditions are those for the serial demand game, together with c′(y) > c(y)/y for all y ≠ 0. Notice that the latter property does not pose additional restrictions on cost functions within the framework of P1.

G(φ^SS; c): A necessary condition is c″ > 0. In general it is hard to establish uniqueness if more than 2 players are involved.

G(φ^MP; c): Even in 2-player games uniqueness is not guaranteed. For instance, for cost functions c(y) = y^α uniqueness is guaranteed only if 1 < α ≤ 3; for c(y) = y⁴ there are preference profiles in L* such that multiple equilibria exist.

Decreasing Returns to Scale
The above theorem basically shows that uniqueness of equilibrium in demand games related to P1 can be achieved for preferences in L* only if costs are convex, i.e. the technology exhibits decreasing returns to scale. The starting point in the literature for characterizing cost sharing rules in terms of their strategic properties is the seminal paper by Moulin and Shenker [91]. Their finding is that on L basically φ^SR is the only cost sharing rule whose corresponding demand game passes a unique-equilibrium test like that in Theorem 12.
Call a smooth and strictly demand monotonic cost sharing rule regular if it is anonymous, so that the name of an agent does not have an impact on her cost share.
Theorem 13 (Moulin & Shenker [91]) Let c be a strictly convex, continuously differentiable cost function, and let φ be a regular cost sharing rule. The following statements are equivalent:

– φ = φ^SR,
– for all profiles (u₁, u₂, …, u_n) of utilities in L, G(φ; c) has at most one Nash-equilibrium,
– for all profiles (u₁, u₂, …, u_n) of utilities in L, every Nash-equilibrium of G(φ; c) is also a strong equilibrium, i.e. no coalition can coordinate a deviation that improves the payoff of all its members.

This theorem makes a strong case for the serial cost sharing rule, especially when one realizes that the serial equilibrium is the unique element surviving successive elimination of strictly dominated strategies. This equilibrium may then naturally arise through evolutive or eductive behavior; it is a robust prediction of non-cooperative behavior. Recent experimental studies are in line with this theoretical support, see [22,108]. Proposition 1 in [141] shows how easy it is to construct preferences in L such that regular cost sharing rules other than φ^SR give rise to multiple equilibria in the corresponding demand game, even in two-agent cost sharing games.

Among the fairness concepts in the distributive literature, one of the most compelling is envy-freeness. An allocation passes the no-envy test if no player prefers the allocation of another player to her own. Formally, the definition is as follows.

No-Envy Test Let x be a demand profile and y a vector of cost shares. The allocation (x_i, y_i)_{i∈N} is envy-free if for all i, j ∈ N it holds that u_i(x_i, y_i) ≥ u_i(x_j, y_j).

It is easily seen that the allocations associated with the serial equilibria are all envy-free.

Increasing Returns to Scale
As Theorem 12 already shows, uniqueness of equilibrium in demand games for all utility profiles in L is in general inconsistent with concave cost functions.
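The no-envy test introduced above is straightforward to check mechanically for a concrete allocation. A sketch (illustrative; quasi-linear utilities u_i(x, y) = α_i x − y are assumed for the example, and the serial cost shares are taken from case (ii) with c(y) = y²/2):

```python
# Sketch: the no-envy test for an allocation ((x_i, y_i))_{i in N}.
def envy_free(allocation, utilities, tol=1e-9):
    # Envy-free: no agent i values agent j's bundle above her own.
    return all(u(xi, yi) >= u(xj, yj) - tol
               for u, (xi, yi) in zip(utilities, allocation)
               for (xj, yj) in allocation)

# Serial equilibrium of case (ii): demands (2/3, 2/3, 8/3) with serial
# cost shares (2/3, 2/3, 20/3) for c(y) = y^2/2.
alpha = (2.0, 2.0, 4.0)
utilities = [lambda x, y, a=a: a * x - y for a in alpha]
serial_alloc = [(2 / 3, 2 / 3), (2 / 3, 2 / 3), (8 / 3, 20 / 3)]
```

The serial equilibrium allocation passes the test, while a trivially unequal allocation (one agent served for free, another getting nothing) fails it.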
Theorem 14 (de Frutos [26]) Let c be a strictly concave, continuously differentiable cost function, and let φ be a regular cost sharing rule. The following statements are equivalent:

– φ = φ^DSR or φ = φ^SR,
– for all utility profiles u = (u_i)_{i∈N} in L, the induced demand game G(φ; c) has at most one Nash equilibrium,
– for all utility profiles u = (u_i)_{i∈N} in L, every Nash equilibrium of the game G(φ; c) is a strong Nash equilibrium as well.

Moreover, if the curvature of the indifference curves is bigger than that of the curve generated by the cost sharing rule, as in Fig. 10, then the second and third statements are equivalent with φ = φ^DSR.

Cost Sharing, Figure 10: Scenario (ii). The indifference curves of agent 1 together with the curve γ: t ↦ φ₁^SR((t, x_{−1}); c). The best response of player 1 against x_{−1} = (⅔, 8/3) is the value of x where the graph of γ is tangent to an indifference curve of u₁.

Theorem 15 (Moulin [85]) Assume agents have preferences in L*. The serial cost sharing rule is the unique continuous, cross-monotonic and anonymous cost sharing rule for which the Nash-equilibria of the corresponding demand games all pass the no-envy test.

Comparison of Serial and Proportional Equilibria (II)
Just as in the earlier analysis in Sect. "Efficiency and Strategy-Proof Cost Sharing Mechanisms", the performance of cost sharing rules can be measured by the related surpluses in the Nash-equilibria of the corresponding demand games. Assume in this section that the preferences of the agents are quasi-linear in cost shares, represented by functions U_i(x_i, y_i) = u_i(x_i) − y_i. Moreover, assume that u_i is non-decreasing and concave, with u_i(0) = 0. The surplus at the demand profile x and utility profile U is then the number Σ_{i∈N} u_i(x_i) − c(x(N)). Define the efficient surplus, or value, of N relative to c and U by

v(c; U) = sup_{x∈R^N_+} Σ_{i∈N} u_i(x_i) − c(x(N)).   (27)
Denote the set of Nash-equilibria of the demand game G(φ; c) with profile of preferences U by NE(φ; c; U). Given c, the guaranteed (relative) surplus of the cost sharing rule φ for N is defined by

γ(c; φ) = inf_{U, x∈NE(φ;c;U)} [Σ_{i∈N} u_i(x_i) − c(x(N))] / v(c; U),   (28)

where the infimum is taken over all utility profiles discussed above. This measure is also called the price of anarchy of the game, see [65]. Let C be the set of all convex increasing cost functions with lim_{y→∞} c(y)/y = ∞. Then [89] shows that for the serial and the proportional rule the guaranteed surplus is at least 1/n. But sometimes the distinction is eminent. Define the number

δ(y) = y c″(y) / (c′(y) − c′(0)),

which is a kind of elasticity. The theorem below shows that on certain domains of cost functions with bounded δ the serial rule prevails over the proportional rule. For large n the guaranteed surplus of φ^SR is of order 1/ln(n), that of φ^AV of order 1/n. More precisely, write K_n = 1 + 1/3 + ⋯ + 1/(2n − 1) ≈ 1 + (ln n)/2; then:

Theorem 16 (Moulin [89]) For any convex increasing cost function c with lim_{y→∞} c(y)/y = ∞ it holds that:

– If c′ is concave and inf{δ(y) | y ≥ 0} = p > 0, then
  γ(c; φ^SR) ≥ p/K_n, γ(c; φ^AV) ≤ 4/(n + 3).
– If c′ is convex and sup{δ(y) | y ≥ 0} = p < ∞, then
  γ(c; φ^SR) ≥ 1/((2p − 1)K_n), 4/(n + 3) ≥ γ(c; φ^AV) ≥ 4/((2p − 1)n).
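For a concrete feel of these bounds, take c(y) = y², for which c′(y) = 2y is both concave and convex, c′(0) = 0, and δ(y) = 1, so p = 1. The sketch below (illustrative, not part of the original text) tabulates the serial lower bound p/K_n against the proportional upper bound 4/(n + 3):

```python
# Sketch: K_n = 1 + 1/3 + ... + 1/(2n-1), and the two Theorem 16 bounds
# for c(y) = y^2, where delta(y) = y*c''(y)/(c'(y) - c'(0)) = 2y/2y = 1.
def K(n):
    return sum(1.0 / (2 * k - 1) for k in range(1, n + 1))

def serial_lower_bound(n, p=1.0):
    # guaranteed surplus of the serial rule is at least p/K_n
    return p / K(n)

def proportional_upper_bound(n):
    # guaranteed surplus of the average-cost rule is at most 4/(n+3)
    return 4.0 / (n + 3)
```

With p = 1 the serial bound decays like 1/ln(n) while the proportional bound decays like 1/n, so the gap between the two rules widens as the number of agents grows.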
A Word on Strategy-Proofness in P1
Recall the discussion on strategyproofness in Sect. "Strategyproofness". The serial demand game has a unique strong Nash equilibrium when costs are convex and preferences are drawn from L. Suppose the social planner aims at designing a mechanism implementing the outcomes associated with these equilibria. [91] show an efficient way to implement this serial choice function by an indirect mechanism, defined through a multi-stage game which mimics the way the serial Nash equilibria are calculated. It is easily seen that here each agent has a unique dominant strategy, in which demands result from optimization of the true preferences. This gives rise to a strategyproof mechanism.

Note that the same approach cannot be used for the proportional rule. The strategic properties of the proportional demand game are weaker than those of the serial demand game in several respects. First of all, it is not hard to find preference profiles in L leading to multiple equilibria. While uniqueness of equilibrium can be restored by restricting L to L*, the resulting equilibria are in general not strong (unlike their serial counterparts). In the proportional equilibria there is overproduction; see e.g. the example in Sect. "Proportional Demand Game", where a small uniform reduction of demands yields higher utility for all the players. Besides, a single-valued Nash equilibrium selection corresponds to a strategyproof mechanism only if the underlying domain of preferences is rich, and L* is not. Though richness is not a necessary condition, the proportional rule is not consistent with a strategyproof demand game.

Bayesian P1-Demand Games
Recall that at the basis of a strategic game lies the assumption that each player knows all the ingredients of the game. However, as [56] argues, production cost and output quality may vary unpredictably as a consequence of the technology and worker quality. Besides that, changes in the available resources and demands have unforeseen effects on individual preferences. On top of that, the players may have asymmetric information regarding the nature of the uncertainty. [56] studies the continuous homogeneous cost sharing problem within the context of a Bayesian demand game [43], where these uncertainties are taken into account. The qualifications of the serial rule in the stochastic model are roughly the same as in the deterministic framework.

Continuous Heterogeneous Output Model, P^n
The analysis of continuous cost sharing problems for multi-service facilities is far more complex than in the single-output model. The literature discusses two different models: one where each agent i demands a different good, and one where agents may require mixed bundles of goods. As the reader will notice, the modeling and analysis of solutions differ in abstraction and complexity. In order to concentrate on the main ideas, here we stick to the first model, where goods are identified with agents. This means that a demand profile is a vector x ∈ R^N_+, where x_i denotes the demand of agent i for good i. From now on we deal with technologies described by continuously differentiable cost functions c: R^N_+ → R_+, non-decreasing and with c(0_N) = 0; the class of all such functions is denoted by C^n.
Extensions of Cost Sharing Rules
The single-output model is connected to the multi-output model via the homogeneous cost sharing problems. Suppose that for c ∈ C^n there is a function c₀ ∈ C such that c(x) = c₀(x(N)) for all x. Such functions arise, for instance, if we distinguish between the production of blue and red cars: the color of a car does not affect the total production costs. Essentially, a homogeneous cost sharing problem (x; c) may be solved as if it were in P1. If φ is the compelling solution on P1, then any cost sharing rule on P^n should determine the same solution on the class of homogeneous problems therein. Formally, the cost sharing rule ρ on P^n extends φ on P1 if for all homogeneous cost sharing problems (x; c) it holds that ρ(x; c) = φ(x; c₀). In general, a cost sharing rule on P1 allows for a whole class of extensions. Below we focus on extensions of φ^SR, φ^P, φ^SS.

Measurement of Scale
Around the world, quantities of goods are measured by several standards. Length is expressed in inches or centimeters, volume in gallons or liters, weight in ounces or kilograms. Here measurement conversion involves no more than multiplication by a fixed scalar. When such linear scale conversions have no effect on final cost shares, a cost sharing rule is called scale invariant. It is an ordinal rule if this invariance extends to essentially all transformations of scale. Scale invariance captures the important idea that the relative cost shares should not change whether we are dividing 1 Euro or 1,000 Euros. Ordinality may be desirable, but for many purposes it is too strong as a basic requirement. Formally, a transformation of scale is a mapping f: R^N_+ → R^N_+ such that f(x) = (f₁(x₁), f₂(x₂), …, f_n(x_n)) for all x, where each coordinate mapping f_j is differentiable and strictly increasing.
Ordinality A cost sharing rule φ on P^n is ordinal if for all transformations of scale f and all cost sharing problems (x; c) ∈ P^n it holds that

φ(x; c) = φ(f(x); c ∘ f⁻¹).   (29)

Scale Invariance A cost sharing rule φ on P^n is scale invariant if Eq. (29) holds for all linear transformations f. Under a scale invariant cost sharing rule, final cost shares do not change when the units in which the goods are measured change.
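Scale invariance can be tested numerically for a given rule. The sketch below (illustrative, not part of the original text) checks Eq. (29) for the Aumann–Shapley rule introduced further on, whose shares are x_i ∫₀¹ ∂_i c(tx) dt, under the linear transform f(x) = (2x₁, x₂) with the example cost function c(y) = y₁² + y₂:

```python
# Sketch: checking Eq. (29) for a linear transform f(x) = (2*x1, x2):
# phi(x; c) should equal phi(f(x); c o f^{-1}) under a scale invariant rule.
def as_shares(x, grad_c, steps=2000):
    # Aumann-Shapley: phi_i = x_i * integral_0^1 d_i c(t*x) dt (midpoint rule).
    n = len(x)
    shares = [0.0] * n
    for k in range(steps):
        t = (k + 0.5) / steps
        g = grad_c([t * xi for xi in x])
        for i in range(n):
            shares[i] += x[i] * g[i] / steps
    return shares

grad_c = lambda y: [2 * y[0], 1.0]            # gradient of c(y) = y1^2 + y2
grad_c_rescaled = lambda y: [y[0] / 2, 1.0]   # gradient of c o f^{-1}

x = [1.5, 4.0]
before = as_shares(x, grad_c)
after = as_shares([2 * x[0], x[1]], grad_c_rescaled)
```

Rescaling the unit of good 1 leaves both cost shares unchanged, as Theorem 17's scale invariance of the Aumann–Shapley rule predicts.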
Path-Generated Cost Sharing Rules
Many cost sharing rules on P^n calculate the cost shares for (x; c) ∈ P^n as the total of marginal costs along some production path from 0 toward x. Here a path for x is a non-decreasing mapping γ^x: R_+ → R^N_+ such that γ^x(0) = 0 and there is a T ∈ R_+ with γ^x(T) = x. The cost sharing rule φ generated by the path collection Γ = {γ^x | x ∈ R^N_+} is defined by

φ_i(x; c) = ∫₀^∞ ∂_i c(γ^x(t)) (γ_i^x)′(t) dt.   (30)
Special path-generated cost sharing rules are the fixed-path cost sharing rules; a single path δ: R_+ → R^N_+ with the property that lim_{t→∞} δ_i(t) = ∞ defines the whole underlying family of paths. More precisely, the fixed-path cost sharing rule generated by δ is the path-generated rule for the family of paths {γ^x | x ∈ R^N_+} defined by γ^x(t) = δ(t) ∧ x, the vector with coordinates min{δ_i(t), x_i}. So the paths are no more than the projections of δ(t) on the cube [0, x]. Below we will see many examples of (combinations of) such fixed-path methods.

Aumann–Shapley Rule
The early characterizations by [15,79] of this rule set off a vast, growing literature on cost sharing models with variable demands. [16] suggested using the Aumann–Shapley rule to determine telephone billing rates in the context of sharing the cost of a telephone system. This extension of proportional cost sharing calculates marginal costs along the path γ^AS(t) = tx for t ∈ [0, 1]. Then

φ_i^AS(x; c) = x_i ∫₀¹ ∂_i c(tx) dt.   (31)
The Aumann–Shapley rule can be interpreted as the Shapley value of the non-atomic game in which each unit of the good is a player, see [11]. It is the uniform average over marginal costs along all increasing paths from 0_N to x. The following is a classic result in the cost sharing literature:

Theorem 17 (Mirman & Tauman [79], Billera & Heath [15]) There is only one additive, positive, and scale invariant cost sharing rule on P^n that extends the proportional rule, and this is φ^AS.

Example 24 If c is positively homogeneous, i.e. c(αy) = αc(y) for α ≥ 0 and all y ∈ R^N_+, then

φ_i^AS(x; c) = (∂c/∂x_i)(x),

i.e. φ^AS calculates the marginal cost of the ith good at the final production level x. The risk measures (cost functions) as in [28] are of this kind.
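Example 24 can be checked numerically: for a positively homogeneous c the marginal cost ∂_i c is homogeneous of degree 0, so the integrand in Eq. (31) is constant in t. A sketch (illustrative) for the positively homogeneous function c(y) = y₁y₂/(y₁ + y₂):

```python
# Sketch: for positively homogeneous c(y) = y1*y2/(y1+y2), the marginal cost
# d1c(y) = (y2/(y1+y2))^2 is homogeneous of degree 0, so the Aumann-Shapley
# integral collapses to x1 * d1c(x).
def d1c(y1, y2):
    return (y2 / (y1 + y2)) ** 2

def as_share_1(x1, x2, steps=1000):
    # phi_1^AS = x1 * integral_0^1 d1c(t*x) dt, midpoint rule (avoids t = 0,
    # where d1c is undefined).
    return x1 * sum(d1c(t * x1, t * x2)
                    for t in ((k + 0.5) / steps for k in range(steps))) / steps

x1, x2 = 1.0, 2.0
phi1 = as_share_1(x1, x2)   # should equal x1 * d1c(x1, x2) = x1*x2^2/(x1+x2)^2
```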
Friedman–Moulin Rule
This serial extension [36] calculates marginal costs along the diagonal path, i.e. γ^FM(t) = t1_N ∧ x:

φ_i^FM(x; c) = ∫₀^{x_i} ∂_i c(γ^FM(t)) dt.   (32)
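For the example used later in this section (c(t₁, t₂) = e^(2t₁+t₂) − 1 with x = (5, 10)), the Friedman–Moulin integral can be approximated numerically. A sketch (illustrative), compared against the closed form φ₁^FM = ⅔(e^15 − 1):

```python
import math

# Sketch: Friedman-Moulin share of agent 1 for c(t1,t2) = exp(2t1+t2) - 1
# and x = (5, 10): integrate d1c along the diagonal path (t, t) up to x1 = 5.
def fm_share_1(x1, d1c_on_diagonal, steps=200_000):
    # composite midpoint rule for integral_0^{x1} d1c(t, t) dt
    h = x1 / steps
    return sum(d1c_on_diagonal((k + 0.5) * h) for k in range(steps)) * h

d1c_diag = lambda t: 2.0 * math.exp(3.0 * t)   # d1c(t, t) = 2*e^{3t}
approx = fm_share_1(5.0, d1c_diag)
exact = 2.0 / 3.0 * (math.exp(15.0) - 1.0)
```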
The Friedman–Moulin rule is a demand monotonic fixed-path rule. As far as invariance with the choice of units is concerned, however, its performance is bad: it is not a scale invariant cost sharing rule.

Moulin–Shenker Rule
This fixed-path cost sharing rule was proposed as an ordinal serial extension by [125]. Suppose that the partial derivatives of c ∈ C^n are bounded away from 0, i.e. there is an a > 0 such that ∂_i c(x) > a for all x ∈ R^N_+. The Moulin–Shenker rule φ^MS is generated by the path γ^MS solving the system of ordinary differential equations

γ_i′(t) = Σ_{j∈N} ∂_j c(γ(t)) / ∂_i c(γ(t))  if γ_i(t) < x_i, and γ_i′(t) = 0 else.   (33)

The interpretation of this path is that at each moment the expenditure on producing extra units is the same for the different agents: if good 2 is twice as expensive as good 1, then the production device γ^MS will produce twice as much of good 1. The serial rule embraces the same idea — as long as an agent desires extra production, the corresponding incremental costs are equally split. This makes φ^MS a natural extension of the serial rule. Call t_i the completion time of production for agent i, i.e. γ_i^MS(t) < x_i if t < t_i and γ_i^MS(t_i) = x_i. Assume without loss of generality that these completion times are ordered such that 0 = t₀ ≤ t₁ ≤ t₂ ≤ ⋯ ≤ t_n; then the Moulin–Shenker rule is given by

φ_i^MS(x; c) = Σ_{ℓ=1}^{i} [c(γ^MS(t_ℓ)) − c(γ^MS(t_{ℓ−1}))] / (n − ℓ + 1).   (34)
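Once the path and the completion times are known, Eq. (34) is mechanical. A sketch (illustrative) for the example treated below in the text — c(t₁, t₂) = e^(2t₁+t₂) − 1, x = (5, 10), along the path γ(t) = (t, 2t), where both demands complete at t = 5:

```python
import math

# Sketch: Moulin-Shenker shares via the completion-time formula (34).
def ms_shares(cost, path, completion_times):
    # completion_times must be sorted; the agent finishing l-th pays
    # sum_{m<=l} (c(path(t_m)) - c(path(t_{m-1}))) / (n - m + 1).
    n = len(completion_times)
    shares, acc, prev = [], 0.0, 0.0
    for m, t in enumerate(completion_times):
        level_cost = cost(path(t))
        acc += (level_cost - prev) / (n - m)
        prev = level_cost
        shares.append(acc)
    return shares

c = lambda p: math.exp(2 * p[0] + p[1]) - 1.0
path = lambda t: (t, 2.0 * t)              # solves Eq. (33) here, up to speed
shares = ms_shares(c, path, [5.0, 5.0])    # both agents complete at t = 5
```

Since both coordinates of x are reached simultaneously, the two shares coincide at ½c(5, 10) = ½(e^20 − 1).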
Note that the path γ^MS varies with the cost function; this is the reason why φ^MS is a non-additive solution. Such solutions, though intuitive, are in general notoriously hard to analyze. There are two axiomatic characterizations of the Moulin–Shenker rule. The first, by [125], is in terms of the serial principle and the technical condition that the cost sharing rule be a partially differentiable function of the demands. The other characterization, by [60], is more in line with the ordinal character of this serial extension.

Continuity A cost sharing rule φ on P^n is continuous if q ↦ φ(q; c) is continuous on R^N_+ for all c.
Continuity is weaker than partial differentiability, as it only requires stability of the solution with respect to small changes in the demands.

Upperbound A cost sharing rule φ satisfies upperbound if for all (q; c) ∈ P^n and i ∈ N,

φ_i(q; c) ≤ max_{y∈[0,q]} ∂_i c(y).
An upperbound provides each agent with a conservative and rather pessimistic estimate of her cost share, based on the maximal value of the corresponding marginal cost toward the aggregate demand. Suppose that d is a demand profile smaller than q. A reduced cost sharing problem is defined by (q − d; c_d), where c_d is given by c_d(y) = c(y + d) − c(d). So c_d measures the incremental cost of production beyond the level d.

Self-Consistency A cost sharing rule φ is self-consistent if for all cost sharing problems (q; c) ∈ P^n with q_{N∖S} = 0_{N∖S} for some S ⊆ N, and d ≤ q such that φ_i(d; c) = φ_j(d; c) for all {i, j} ⊆ S, it holds that φ(q; c)_S = φ(d; c)_S + φ(q − d; c_d)_S.

So self-consistency expresses the idea that if cost shares of agents with non-zero demand differ, this is not due to the part of the problem for which they are charged equally, but due to the asymmetries in the related reduced problem. The property is reminiscent of the step-by-step negotiation property in the bargaining literature, see [54].

Theorem 18 (Koster [60]) There is only one continuous, self-consistent and scale invariant cost sharing rule satisfying upperbound, which is the Moulin–Shenker rule.

Shapley–Shubik Rule
For each demand profile x the stand-alone cost game c_x is defined as before. The Shapley–Shubik rule is then no more than the Shapley value of this game, i.e. φ^SS(x; c) = Φ(c_x). The Shapley–Shubik rule is ordinal.

A Numerical Example
Consider the cost sharing problem (x; c) with N = {1, 2}, x = (5, 10), and c ∈ C² given by c(t₁, t₂) = e^(2t₁+t₂) − 1 on [0, 10] × [0, 10]. The partial derivatives are ∂₁c(t₁, t₂) = 2e^(2t₁+t₂) = 2∂₂c(t₁, t₂) for all (t₁, t₂) ∈ R²_+. The Aumann–Shapley path is given by γ(t) = (5t, 10t) for t ∈ [0, 1], and

φ₁^AS(x; c) = ∫₀¹ ∂₁c(5t, 10t) · 5 dt = ∫₀¹ 2e^(20t) · 5 dt = ½(e^20 − 1),
φ₂^AS(x; c) = ∫₀¹ ∂₂c(5t, 10t) · 10 dt = ∫₀¹ e^(20t) · 10 dt = ½(e^20 − 1).
The Friedman–Moulin rule uses the path

γ^FM(t) = t1_N ∧ x = (t, t) if 0 ≤ t < 5, and (5, t) if 5 ≤ t ≤ 10,

and the corresponding cost shares are calculated as follows:

φ₁^FM(x; c) = ∫₀⁵ ∂₁c(t, t) dt = ∫₀⁵ 2e^(3t) dt = ⅔(e^15 − 1),
φ₂^FM(x; c) = ∫₀⁵ ∂₂c(t, t) dt + ∫₅^10 ∂₂c(5, t) dt = ⅓(e^15 − 1) + e^20 − e^15.

Note that both of the cost sharing rules discussed so far use one and the same path for all cost sharing problems with demand profile x. This is characteristic of additive cost sharing rules (see e.g. [35,42]).

Now turn to the Moulin–Shenker rule. Since ∂₁c = 2∂₂c everywhere on [0, 10] × [0, 10], according to the solution γ^MS of Eq. (33), until one of the demands is reached two units of good 2 are produced for each produced unit of good 1. In particular there is a parametrization of γ^MS such that γ(t) = (t, 2t) for 0 ≤ t ≤ 5. The corresponding cost shares are equal, since γ reaches both coordinates of x at the same time, so

φ₁^MS(x; c) = φ₂^MS(x; c) = ½c(γ(5)) = ½c(5, 10) = ½(e^20 − 1).

Cost Sharing, Figure 11: Paths for φ^MS, φ^AS, φ^FM.

Now suppose that the demands are summarized by x′ = (10, 10). In order to calculate φ^MS(x′; c), notice that there is a parametrization of the corresponding path γ^MS such that

γ(t) = (t, 2t) if 0 ≤ t ≤ 5, and (t, 10) for 5 < t ≤ 10.

Notice that this path extends just far enough to complete service for agent 1, so that, like before, agent 2 only contributes while t < 5. The cost shares are then given by

φ₂^MS(x′; c) = ½c(γ(5)) = ½c(5, 10) = ½(e^20 − 1),
φ₁^MS(x′; c) = φ₂^MS(x′; c) + c(γ(10)) − c(γ(5)) = e^30 − ½(e^20 + 1).

For x′ the cost sharing rules φ^AS and φ^FM use essentially the same symmetric path γ(t) = (t, t), so it is easily calculated that φ^AS(x′; c) = φ^FM(x′; c) = (⅔(e^30 − 1), ⅓(e^30 − 1)).

Cost Sharing, Figure 12: Paths for φ^MS, φ^AS, φ^FM.

Axiomatic Characterization of Fixed-Path Rules
Recall demand monotonicity as a weak incentive constraint for cost sharing rules. Despite the attention that the Aumann–Shapley rule has received, it fails to meet this standard. To see this, consider

c(y) = y₁y₂/(y₁ + y₂),

for which

φ₁^AS(x; c) = x₁x₂²/(x₁ + x₂)².
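This share can be evaluated directly; a sketch (illustrative) showing that raising one's own demand may lower one's own cost share:

```python
# Sketch: phi_1^AS(x; c) = x1*x2^2/(x1+x2)^2 for c(y) = y1*y2/(y1+y2)
# is decreasing in x1 once x1 > x2.
def phi1_as(x1, x2):
    return x1 * x2 ** 2 / (x1 + x2) ** 2

share_at_2 = phi1_as(2.0, 1.0)   # demand 2 -> share 2/9
share_at_3 = phi1_as(3.0, 1.0)   # demand 3 -> share 3/16 < 2/9
```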
This expression is not monotonic in x₁. One may show that the combination of properties in Theorem 17 is incompatible with demand monotonicity. Now what kind of rules are demand monotonic? The classification of all such rules is too complex. We will restrict our attention to the additive rules with the dummy property, which comprises the idea that a player pays nothing if her good is for free:
Dummy If ∂_i c(y) = 0 for all y, then φ_i(x; c) = 0 for all cost sharing problems (x; c) ∈ P^n.

Theorem 19 (Friedman [35]) A cost sharing rule satisfies dummy, additivity and demand monotonicity if and only if it is an (infinite) convex combination of rules generated by fixed paths which do not depend on the cost structure. A cost sharing rule satisfies dummy, additivity and scale invariance if and only if it is an (infinite) convex combination of rules generated by scale invariant fixed paths which do not depend on the cost structure.

This theorem has some important implications. The Friedman–Moulin rule is the unique serial extension with the properties additivity, dummy and demand monotonicity. As mentioned before, φ^FM is not scale invariant. The only cost sharing rules satisfying all four of the above properties are the random order values, i.e. convex combinations of marginal vectors of the stand-alone cost game [142]. φ^SS is the special element in this class of rules, giving equal weight to each marginal vector. Consider the following weak fairness property:

Equal Treatment Consider (x; c) ∈ P^n. If c_x(S ∪ {i}) = c_x(S ∪ {j}) for all i, j and S ⊆ N∖{i, j}, then φ_i(x; c) = φ_j(x; c).

Within the class of random order values, φ^SS is the unique cost sharing rule satisfying equal treatment.

Strategic Properties of Fixed-Path Rules
[34] shows that the fixed-path cost sharing rules essentially have the same strategic properties as the serial rule. The crucial result in this respect is the following.

Theorem 20 (Friedman [34]) Consider the demand game G(φ; c), where φ is a fixed-path cost sharing rule and c ∈ C^n has strictly increasing partial derivatives. Then the corresponding set O^∞ of action profiles surviving the successive elimination of overwhelmed actions consists of a unique element.

As a corollary one may prove that the action profile in O^∞ is actually the unique Nash-equilibrium of the game G(φ; c), and that it is a strong equilibrium as well. Moreover, [34] shows that this Nash-equilibrium can be reached through some learning dynamics. This means that the demand games induced by φ^FM and φ^MS have strong strategic properties. Notice that the above theorem is only one-way; there are other cost sharing rules, like the axial serial rule, with the same strategic characteristics.

Future Directions
So far, a couple of standard stylized models have been launched, providing a theoretical basis for defining and studying cost sharing principles at a basic level. The list of references below indicates that this field of research is in full swing, both in theoretical and applied directions. Although it is hard to guess where developments will lead, a couple of future directions can be highlighted.

Informational Issues
So far most of the literature is devoted to deterministic cost sharing problems, whereas the cost sharing problems we face in practice are shaped by uncertain events. Despite its relevance, stochastic modeling is quite exceptional in the literature, see [56,138]. The presented models assume knowledge of the costs for every contingent demand profile; certainly within the continuous framework this seems too much to ask. Retrieving the necessary information is not only hindered by technical constraints, but leads to new costs as well. [48] discusses data envelopment in cost sharing problems. A stochastic framework would be useful for studying such estimated cost sharing problems. Other work focusing on informational coherence in cost sharing problems is [126]. Related work is [4], discussing mixtures of discrete and continuous cost sharing problems.

Budget Balance
In this overview, the proposed mechanisms are based on cost sharing rules. Another stream in implementation theory, at the other extreme of the spectrum, deals with cost allocation rules with no restrictions on the budget. [89] compares the size of budget deficits relative to the overall efficiency of a mechanism.

Performance
Recall the performance indices measuring the welfare impact of different cost sharing rules. [90] focuses on the continuous homogeneous production situations, with cost functions of specific types. There is still a need for a more
general theory. In particular, such a theory could prove indispensable for analyzing the quality of cost sharing rules in a broader set-up: the heterogeneous and Bayesian cost sharing problems.
15. 16.
Non-linear Cost Sharing Rules Most of the axiomatic literature is devoted to the analysis of cost sharing rules as linear operators. The additivity property is usually motivated as an accounting convention, but it serves merely as a tool by which some mathematical representation theorems apply. Besides the practical motivation, it is void of any ethical content. As [88] underlines, there are hardly results on non-additive cost sharing rules – one of the reasons is that the mathematical analysis becomes notoriously hard. But – as a growing number of authors acknowledges – the usefulness of these mathematical techniques alone cannot justify the contribution of the property. I thank Hervé Moulin as a referee of this article, and his useful suggestions.
Bibliography
1. Aadland D, Kolpin V (1998) Shared irrigation cost: an empirical and axiomatical analysis. Math Soc Sci 35:203–218
2. Aadland D, Kolpin V (2004) Environmental determinants of cost sharing. J Econ Behav Organ 53:495–511
3. Albizuri MJ, Zarzuelo JM (2007) The dual serial cost-sharing rule. Math Soc Sci 53:150–163
4. Albizuri MJ, Santos JC, Zarzuelo JM (2003) On the serial cost sharing rule. Int J Game Theory 31:437–446
5. An M (1998) Logconcavity versus logconvexity, a complete characterization. J Econ Theory 80:350–369
6. Archer A, Feigenbaum J, Krishnamurthy A, Sami R, Shenker S (2004) Approximation and collusion in multicast cost sharing. Games Econ Behav 47:36–71
7. Arin J, Iñarra E (2001) Egalitarian solutions in the core. Int J Game Theory 30:187–193
8. Atkinson AB (1970) On the measurement of inequality. J Econ Theory 2:244–263
9. Aumann RJ (1959) Acceptable points in general cooperative n-person games. In: Kuhn HW, Tucker AW (eds) Contributions to the theory of games, vol IV. Princeton University Press, Princeton
10. Aumann RJ, Maschler M (1985) Game theoretic analysis of a bankruptcy problem from the Talmud. J Econ Theory 36:195–213
11. Aumann RJ, Shapley LS (1974) Values of non-atomic games. Princeton University Press, Princeton
12. Baumol W, Bradford D (1970) Optimal departure from marginal cost pricing. Am Econ Rev 60:265–283
13. Baumol W, Panzar J, Willig R (1988) Contestable markets and the theory of industry structure, 2nd edn. Harcourt Brace Jovanovich, San Diego
14. Bergantino A, Coppejans L (1997) A game theoretic approach to the allocation of joint costs in a maritime environment: A case study. Occasional Papers, vol 44. Department of Maritime Studies and International Transport, University of Wales, Cardiff
15. Billera LJ, Heath DC (1982) Allocation of shared costs: A set of axioms yielding a unique procedure. Math Oper Res 7:32–39
16. Billera LJ, Heath DC, Raanan J (1978) Internal telephone billing rates: A novel application of non-atomic game theory. Oper Res 26:956–965
17. Bird CG (1976) On cost allocation for a spanning tree: A game theoretic approach. Networks 6:335–350
18. Binmore K (2007) Playing for real: A text on game theory. Oxford University Press, Oxford
19. Bjorndal E, Hamers H, Koster M (2004) Cost allocation in a bank ATM network. Math Methods Oper Res 59:405–418
20. Bondareva ON (1963) Some applications of linear programming to the theory of cooperative games. Problemy Kybernetiki 10:119–139 (in Russian)
21. Brânzei R, Ferrari G, Fragnelli V, Tijs S (2002) Two approaches to the problem of sharing delay costs in joint projects. Ann Oper Res 109:359–374
22. Chen Y (2003) An experimental study of serial and average cost pricing mechanisms. J Publ Econ 87:2305–2335
23. Clarke EH (1971) Multipart pricing of public goods. Publ Choice 11:17–33
24. Dasgupta PS, Hammond PJ, Maskin ES (1979) The implementation of social choice rules: Some general results on incentive compatibility. Rev Econ Stud 46:185–216
25. Davis M, Maschler M (1965) The kernel of a cooperative game. Nav Res Logist Q 12:223–259
26. de Frutos MA (1998) Decreasing serial cost sharing under economies of scale. J Econ Theory 79:245–275
27. Demers A, Keshav S, Shenker S (1990) Analysis and simulation of a fair queueing algorithm. J Internetworking 1:3–26
28. Denault M (2001) Coherent allocation of risk capital. J Risk 4:7–21
29. Dewan S, Mendelson H (1990) User delay costs and internal pricing for a service facility. Manag Sci 36:1502–1517
30. Dutta B, Ray D (1989) A concept of egalitarianism under participation constraints. Econometrica 57:615–635
31. Dutta B, Ray D (1991) Constrained egalitarian allocations. Games Econ Behav 3:403–422
32. Flam SD, Jourani A (2003) Strategic behavior and partial cost sharing. Games Econ Behav 43:44–56
33. Fleurbaey M, Sprumont Y (2006) Sharing the cost of a public good without subsidies. Université de Montréal, Cahier
34. Friedman E (2002) Strategic properties of heterogeneous serial cost sharing. Math Soc Sci 44:145–154
35. Friedman E (2004) Paths and consistency in additive cost sharing. Int J Game Theory 32:501–518
36. Friedman E, Moulin H (1999) Three methods to share joint costs or surplus. J Econ Theory 87:275–312
37. Friedman E, Shenker S (1998) Learning and implementation on the Internet. Working paper 199821, Rutgers University, Piscataway
38. González-Rodríguez P, Herrero C (2004) Optimal sharing of surgical costs in the presence of queues. Math Methods Oper Res 59:435–446
39. Granot D, Huberman G (1984) On the core and nucleolus of minimum cost spanning tree games. Math Program 29:323–347
Cost Sharing
40. Green J, Laffont JJ (1977) Characterization of satisfactory mechanisms for the revelation of preferences for public goods. Econometrica 45:427–438 41. Groves T (1973) Incentives in teams. Econometrica 41:617–663 42. Haimanko O (2000) Partially symmetric values. Math Oper Res 25:573–590 43. Harsanyi J (1967) Games with incomplete information played by Bayesian players. Manag Sci 14:159–182 44. Hart S, Mas-Colell A (1989) Potential, value, and consistency. Econometrica 57:589–614 45. Haviv M (2001) The Aumann–Shapley price mechanism for allocating congestion costs. Oper Res Lett 29:211–215 46. Hougaard JL, Thorlund-Petersen L (2000) The stand-alone test and decreasing serial cost sharing. Econ Theory 16:355–362 47. Hougaard JL, Thorlund-Petersen L (2001) Mixed serial cost sharing. Math Soc Sci 41:51–68 48. Hougaard JL, Tind J (2007) Cost allocation and convex data envelopment. Mimeo, University of Copenhagen, Copenhagen 49. Henriet D, Moulin H (1996) Traffic-based cost allocation in a network. RAND J Econ 27:332–345 50. Iñarra E, Usategui JM (1993) The Shapley value and average convex games. Int J Game Theory 22:13–29 51. Israelsen D (1980) Collectives, communes, and incentives. J Comp Econ 4:99–124 52. Jackson MO (2001) A crash course in implementation theory. Soc Choice Welf 18:655–708 53. Joskow PL (1976) Contributions of the theory of marginal cost pricing. Bell J Econ 7:197–206 54. Kalai E (1977) Proportional solutions to bargaining situations: Interpersonal utility comparisons. Econometrica 45:1623–1630 55. Kaminski M (2000) 'Hydraulic' rationing. Math Soc Sci 40:131–155 56. Kolpin V, Wilbur D (2005) Bayesian serial cost sharing. Math Soc Sci 49:201–220 57. Koster M (2002) Concave and convex serial cost sharing. In: Borm P, Peters H (eds) Chapters in game theory. Kluwer, Dordrecht 58. Koster M (2005) Sharing variable returns of cooperation. CeNDEF Working Paper 05–06, University of Amsterdam, Amsterdam 59.
Koster M (2006) Heterogeneous cost sharing, the directional serial rule. Math Methods Oper Res 64:429–444 60. Koster M (2007) The Moulin–Shenker rule. Soc Choice Welf 29:271–293 61. Koster M (2007) Consistent cost sharing. mimeo, University of Amsterdam 62. Koster M, Tijs S, Borm P (1998) Serial cost sharing methods for multi-commodity situations. Math Soc Sci 36:229–242 63. Koster M, Molina E, Sprumont Y, Tijs ST (2002) Sharing the cost of a network: core and core allocations. Int J Game Theory 30:567–599 64. Koster M, Reijnierse H, Voorneveld M (2003) Voluntary contributions to multiple public projects. J Publ Econ Theory 5:25– 50 65. Koutsoupias E, Papadimitriou C (1999) Worst-case equilibria. In: 16th Annual Symposium on Theoretical Aspects of Computer Science, Trier. Springer, Berlin, pp 404–413
66. Legros P (1986) Allocating joint costs by means of the Nucleolus. Int J Game Theory 15:109–119 67. Leroux J (2004) Strategy-proofness and efficiency are incompatible in production economies. Econ Lett 85:335–340 68. Leroux J (2006) Profit sharing in unique Nash equilibrium: characterization in the two-agent case, Games and Economic Behavior 62:558–572 69. Littlechild SC, Owen G (1973) A simple expression for the Shapley value in a simple case. Manag Sci 20:370–372 70. Littlechild SC, Thompson GF (1977) Aircraft landing fees: A game theory approach. Bell J Econ 8:186–204 71. Maniquet F, Sprumont Y (1999) Efficient strategy-proof allocation functions in linear production economies. Econ Theory 14:583–595 72. Maniquet F, Sprumont Y (2004) Fair production and allocation of an excludable nonrival good. Econometrica 72:627– 640 73. Maschler M (1990) Consistency. In: Ichiishi T, Neyman A, Tauman Y (eds) Game theory and applications. Academic Press, New York, pp 183–186 74. Maschler M (1992) The bargaining set, kernel and nucleolus. In: Aumann RJ, Hart S (eds) Handbook of Game Theory with Economic Applications, vol I. North–Holland, Amsterdam 75. Maschler M, Reijnierse H, Potters J (1996) Monotonicity properties of the nucleolus of standard tree games. Report 9556, Department of Mathematics, University of Nijmegen, Nijmegen 76. Maskin E, Sjöström T (2002) Implementation theory. In: Arrow KJ, Sen AK, Suzumura K (eds) Handbook of Social Choice and Welfare, vol I. North–Holland, Amsterdam 77. Matsubayashi N, Umezawa M, Masuda Y, Nishino H (2005) Cost allocation problem arising in hub-spoke network systems. Europ J Oper Res 160:821–838 78. McLean RP, Pazgal A, Sharkey WW (2004) Potential, consistency, and cost allocation prices. Math Oper Res 29:602–623 79. Mirman L, Tauman Y (1982) Demand compatible equitable cost sharing prices. Math Oper Res 7:40–56 80. Monderer D, Shapley LS (1996) Potential games. Games Econ Behav 14:124–143 81. 
Moulin H (1987) Equal or proportional division of a surplus, and other methods. Int J Game Theory 16:161–186 82. Moulin H (1994) Serial cost-sharing of an excludable public good. Rev Econ Stud 61:305–325 83. Moulin H (1995) Cooperative microeconomics: A game-theoretic introduction. Prentice Hall, London 84. Moulin H (1995) On additive methods to share joint costs. Japan Econ Rev 46:303–332 85. Moulin H (1996) Cost sharing under increasing returns: A comparison of simple mechanisms. Games Econ Behav 13:225–251 86. Moulin H (1999) Incremental cost sharing: Characterization by coalition strategy-proofness. Soc Choice Welf 16:279–320 87. Moulin H (2000) Priority rules and other asymmetric rationing methods. Econometrica 68:643–684 88. Moulin H (2002) Axiomatic cost and surplus-sharing. In: Arrow KJ, Sen AK, Suzumura K (eds) Handbook of Social Choice and Welfare. Handbooks in Economics, vol 19. North-Holland, Amsterdam, pp 289–357 89. Moulin H (2006) Efficient cost sharing with cheap residual claimant. Mimeo, Rice University, Houston
90. Moulin H (2007) The price of anarchy of serial, average and incremental cost sharing. Mimeo, Rice University, Houston 91. Moulin H, Shenker S (1992) Serial cost sharing. Econometrica 60:1009–1037 92. Moulin H, Shenker S (1994) Average cost pricing versus serial cost sharing; an axiomatic comparison. J Econ Theory 64:178–201 93. Moulin H, Shenker S (2001) Strategy-proof sharing of submodular cost: budget balance versus efficiency. Econ Theory 18:511–533 94. Moulin H, Sprumont Y (2005) On demand responsiveness in additive cost sharing. J Econ Theory 125:1–35 95. Moulin H, Sprumont Y (2006) Responsibility and cross-subsidization in cost sharing. Games Econ Behav 55:152–188 96. Moulin H, Vohra R (2003) Characterization of additive cost sharing methods. Econ Lett 80:399–407 97. Moulin H, Watts A (1997) Two versions of the tragedy of the commons. Econ Des 2:399–421 98. Mutuswami S (2004) Strategy proof cost sharing of a binary good and the egalitarian solution. Math Soc Sci 48:271–280 99. Myerson RB (1980) Conference structures and fair allocation rules. Int J Game Theory 9:169–182 100. Myerson RB (1991) Game theory: Analysis of conflict. Harvard University Press, Cambridge 101. Nash JF (1950) Equilibrium points in n-person games. Proc Natl Acad Sci 36:48–49 102. O'Neill B (1982) A problem of rights arbitration from the Talmud. Math Soc Sci 2:345–371 103. Osborne MJ (2004) An introduction to game theory. Oxford University Press, New York 104. Osborne MJ, Rubinstein A (1994) A course in game theory. MIT Press, Cambridge 105. Peleg B, Sudhölter P (2004) Introduction to the theory of cooperative games. Series C: Theory and Decision Library. Kluwer, Amsterdam 106. Pérez-Castrillo D, Wettstein D (2006) An ordinal Shapley value for economic environments. J Econ Theory 127:296–308 107. Potters J, Sudhölter P (1999) Airport problems and consistent allocation rules. Math Soc Sci 38:83–102 108.
Razzolini L, Reksulak M, Dorsey R (2004) An experimental evaluation of the serial cost sharing rule. Working paper 0402, VCU School of Business, Dept. of Economics, Richmond 109. Ritzberger K (2002) Foundations of non-cooperative game theory. Oxford University Press, Oxford 110. Rosenthal RW (1973) A class of games possessing pure-strategy Nash equilibria. J Econ Theory 2:65–67 111. Roth AE (ed) (1988) The Shapley value, essays in honor of Lloyd S Shapley. Cambridge University Press, Cambridge, pp 307–319 112. Samet D, Tauman Y, Zang I (1984) An application of the Aumann–Shapley prices for cost allocation in transportation problems. Math Oper Res 9:25–42 113. Sánchez SF (1997) Balanced contributions axiom in the solution of cooperative games. Games Econ Behav 20:161–168 114. Sandsmark M (1999) Production games under uncertainty. Comput Econ 14:237–253 115. Schmeidler D (1969) The Nucleolus of a characteristic function game. SIAM J Appl Math 17:1163–1170 116. Shapley LS (1953) A value for n-person games. Ann Math Study, vol 28. Princeton University Press, Princeton, pp 307– 317
117. Shapley LS (1967) On balanced sets and cores. Nav Res Logist Q 14:453–460 118. Shapley LS (1969) Utility comparison and the theory of games. In: Guilbaud GT (ed) La decision: Aggregation et dynamique des ordres de preference. Editions du Centre National de la Recherche Scientifique. Paris pp 251–263 119. Shapley LS (1971) Cores of convex games. Int J Game Theory 1:1–26 120. Sharkey W (1982) Suggestions for a game–theoretic approach to public utility pricing and cost allocation. Bell J Econ 13:57–68 121. Sharkey W (1995) Network models in economics. In: Ball MO et al (eds) Network routing. Handbook in Operations Research and Management Science, vol 8. North–Holland, Amsterdam 122. Shubik M (1962) Incentives, decentralized control, the assignment of joint cost, and internal pricing. Manag Sci 8:325–343 123. Skorin-Kapov D (2001) On cost allocation in hub-like networks. Ann Oper Res 106:63–78 124. Skorin-Kapov D, Skorin–Kapov J (2005) Threshold based discounting network: The cost allocation provided by the nucleolus. Eur J Oper Res 166:154–159 125. Sprumont Y (1998) Ordinal cost sharing. J Econ Theory 81:126–162 126. Sprumont Y (2000) Coherent cost sharing. Games Econ Behav 33:126–144 127. Sprumont Y (2005) On the discrete version of the Aumann– Shapley cost-sharing method. Econometrica 73:1693–1712 128. Sprumont Y, Ambec S (2002) Sharing a river. J Econ Theory 107:453–462 129. Sprumont Y, Moulin H (2005) Fair allocation of production externalities: Recent results. CIREQ working paper 28-2005, Montreal 130. Sudhölter P (1998) Axiomatizations of game theoretical solutions for one-output cost sharing problems. Games Econ Behav 24:42–71 131. Suijs J, Borm P, Hamers H, Koster M, Quant M (2005) Communication and cooperation in public network situations. Ann Oper Res 137:117–140 132. Tauman Y (1988) The Aumann–Shapley prices: A survey. In: Roth A (ed) The shapley value. Cambridge University Press, Cambridge, pp 279–304 133. 
Thomas LC (1992) Dividing credit-card costs fairly. IMA J Math Appl Bus & Ind 4:19–33 134. Thomson W (1996) Consistent allocation rules. Mimeo, Economics Department, University of Rochester, Rochester 135. Thomson W (2001) On the axiomatic method and its recent applications to game theory and resource allocation. Soc Choice Welf 18:327–386 136. Tijs SH, Driessen TSH (1986) Game theory and cost allocation problems. Manag Sci 32:1015–1028 137. Tijs SH, Koster M (1998) General aggregation of demand and cost sharing methods. Ann Oper Res 84:137–164 138. Timmer J, Borm P, Tijs S (2003) On three Shapley-like solutions for cooperative games with random payoffs. Int J Game Theory 32:595–613 139. van de Nouweland A, Tijs SH (1995) Cores and related solution concepts for multi-choice games. Math Methods Oper Res 41:289–311 140. von Neumann J, Morgenstern O (1944) Theory of games and economic behavior. Princeton University Press, Princeton
141. Watts A (2002) Uniqueness of equilibrium in cost sharing games. J Math Econ 37:47–70 142. Weber RJ (1988) Probabilistic values for games. In: Roth AE (ed) The Shapley value. Cambridge University Press, Cambridge 143. Young HP (1985) Producer incentives in cost allocation. Econometrica 53:757–765 144. Young HP (1985) Monotonic solutions of cooperative games. Int J Game Theory 14:65–72
145. Young HP (1985) Cost allocation: Methods, principles, applications. North-Holland, Amsterdam 146. Young HP (1988) Distributive justice in taxation. J Econ Theory 44:321–335 147. Young HP (1994) Cost allocation. In: Aumann RJ, Hart S (eds) Handbook of Game Theory, vol II. Elsevier, Amsterdam, pp 1193–1235 148. Young HP (1998) Cost allocation, demand revelation, and core implementation. Math Soc Sci 36:213–229
Curvelets and Ridgelets

JALAL FADILI1, JEAN-LUC STARCK2
1 GREYC CNRS UMR 6072, ENSICAEN, Caen Cedex, France
2 Laboratoire Astrophysique des Interactions Multiéchelles UMR 7158, CEA/DSM-CNRS-Université Paris Diderot, SEDI-SAP, Gif-sur-Yvette Cedex, France
Article Outline
Glossary
Definition of the Subject
Introduction
Ridgelets
Curvelets
Stylized Applications
Future Directions
Bibliography

Glossary
WT1D The one-dimensional wavelet transform as defined in [53]. See also Numerical Issues When Using Wavelets.
WT2D The two-dimensional wavelet transform.
Discrete ridgelet transform (DRT) The discrete implementation of the continuous ridgelet transform.
Fast slant stack (FSS) An algebraically exact Radon transform of data on a Cartesian grid.
First generation discrete curvelet transform (DCTG1) The discrete curvelet transform constructed from the discrete ridgelet transform.
Second generation discrete curvelet transform (DCTG2) The discrete curvelet transform constructed from appropriate bandpass filtering in the Fourier domain.
Anisotropic elements By anisotropic, we mean basis elements with elongated effective support, i.e. length > width.
Parabolic scaling law A basis element obeys the parabolic scaling law if its effective support is such that width ≈ length².

Definition of the Subject
Despite the fact that wavelets have had a wide impact in image processing, they fail to efficiently represent objects with highly anisotropic elements such as lines or curvilinear structures (e.g. edges). The reason is that wavelets are non-geometrical and do not exploit the regularity of the edge curve.
The ridgelet and the curvelet [16,17] transforms were developed as an answer to the weakness of the separable wavelet transform in sparsely representing what appear to be simple building atoms in an image, that is, lines, curves and edges. Curvelets and ridgelets take the form of basis elements which exhibit high directional sensitivity and are highly anisotropic [9,18,32,68]. These recent geometric image representations are built upon ideas of multiscale analysis and geometry. They have had an important success in a wide range of image processing applications including denoising [42,64,68], deconvolution [38,74], contrast enhancement [73], texture analysis [2], detection [44], watermarking [78], component separation [70,71], inpainting [37,39] and blind source separation [6,7]. Curvelets have also proven useful in diverse fields beyond traditional image processing applications, for example seismic imaging [34,42,43], astronomical imaging [48,66,69], and scientific computing and the analysis of partial differential equations [13,14]. Another reason for the success of ridgelets and curvelets is the availability of fast transform algorithms in non-commercial software packages following the philosophy of reproducible research, see [4,75].

Introduction

Sparse Geometrical Image Representation
Multiscale methods have become very popular, especially with the development of wavelets in the last decade. Background texts on the wavelet transform include [23,53,72]. An overview of implementation and practical issues of the wavelet transform can also be found in Numerical Issues When Using Wavelets. Despite the success of the classical wavelet viewpoint, it was argued that traditional wavelets have some strong limitations that question their effectiveness in dimensions higher than one [16,17].
Wavelets rely on a dictionary of roughly isotropic elements occurring at all scales and locations, do not describe well highly anisotropic elements, and contain only a fixed number of directional elements, independent of scale. Following this reasoning, new constructions have been proposed such as the ridgelets [9,16] and the curvelets [17,18,32,68]. Ridgelets and curvelets are special members of the family of multiscale orientation-selective transforms, which has recently led to a flurry of research activity in the field of computational and applied harmonic analysis. Many other constructions belonging to this family have been investigated in the literature, and go by the name contourlets [27], directionlets [76], bandlets [49,62], grouplets [54], shearlets [47], dual-tree wavelets and wavelet packets [40,46], etc.
Curvelets and Ridgelets, Figure 1 A few ridgelet examples – the second to fourth graphs are obtained after simple geometric manipulations of the first ridgelet, namely rotation, rescaling, and shifting
Throughout this paper, the term ‘sparsity’ is used and intended in a weak sense. We are aware that practical images and signals may not be supported in a transform domain on a set of relatively small size (sparse set). Instead, they may only be compressible (nearly sparse) in some transform domain. Hence, with a slight abuse of terminology, we will say that a representation is sparse for an image within a certain class, if it provides a compact description of such an image.
Notations
We work throughout in two dimensions with spatial variable $x \in \mathbb{R}^2$ and a continuous frequency-domain variable $\xi$. Parentheses $(\cdot,\cdot)$ are used for continuous-domain function evaluations, and brackets $[\cdot,\cdot]$ for discrete-domain array indices. The hat notation $\hat{\ }$ will be used for the Fourier transform.

Ridgelets

The Continuous Ridgelet Transform
The two-dimensional continuous ridgelet transform in $\mathbb{R}^2$ can be defined as follows [10]. We pick a smooth univariate function $\psi : \mathbb{R} \to \mathbb{R}$ with sufficient decay, satisfying the admissibility condition
$$\int |\hat{\psi}(\xi)|^2 / |\xi|^2 \,\mathrm{d}\xi < \infty \,, \qquad (1)$$
which holds if, say, $\psi$ has a vanishing mean $\int \psi(t)\,\mathrm{d}t = 0$. We will suppose a special normalization of $\psi$ so that $\int_0^\infty |\hat{\psi}(\xi)|^2 \xi^{-2} \,\mathrm{d}\xi = 1$.

For each scale $a > 0$, each position $b \in \mathbb{R}$ and each orientation $\theta \in [0, 2\pi)$, we define the bivariate ridgelet $\psi_{a,b,\theta} : \mathbb{R}^2 \to \mathbb{R}$ by
$$\psi_{a,b,\theta}(x) = \psi_{a,b,\theta}(x_1, x_2) = a^{-1/2}\, \psi\big( (x_1 \cos\theta + x_2 \sin\theta - b)/a \big) \,. \qquad (2)$$

A ridgelet is constant along lines $x_1 \cos\theta + x_2 \sin\theta = \mathrm{const}$; transverse to these ridges it is a wavelet. Figure 1 depicts a few examples of ridgelets. The second to fourth panels are obtained after simple geometric manipulations of the ridgelet (left panel), namely rotation, rescaling, and shifting. Given an integrable bivariate function $f(x)$, we define its ridgelet coefficients by
$$\mathcal{R}_f(a,b,\theta) := \langle f, \psi_{a,b,\theta} \rangle = \int_{\mathbb{R}^2} f(x)\, \psi_{a,b,\theta}(x)\,\mathrm{d}x \,.$$
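To make the bivariate ridgelet of Eq. (2) concrete, the following sketch (our own illustration, not part of the original text) evaluates a ridgelet on a grid using a Mexican-hat wavelet, which has vanishing mean and hence satisfies the admissibility condition; the function names are ours:

```python
import numpy as np

def mexican_hat(t):
    # Second derivative of a Gaussian: it has zero mean, so it
    # satisfies the admissibility condition (1).
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def ridgelet(x1, x2, a=1.0, b=0.0, theta=0.0, psi=mexican_hat):
    # Bivariate ridgelet of Eq. (2): constant along lines
    # x1*cos(theta) + x2*sin(theta) = const, a wavelet transverse to them.
    t = (x1 * np.cos(theta) + x2 * np.sin(theta) - b) / a
    return a ** (-0.5) * psi(t)

# Evaluate on a grid; varying (a, b, theta) reproduces the rotation,
# rescaling and shifting manipulations illustrated in Fig. 1.
xx, yy = np.meshgrid(np.linspace(-4, 4, 128), np.linspace(-4, 4, 128))
r = ridgelet(xx, yy, a=0.5, b=1.0, theta=np.pi / 3)
```

Moving along the ridge direction $(-\sin\theta, \cos\theta)$ leaves the value unchanged, which is the defining property of a ridge function.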
We have the exact reconstruction formula
$$f(x) = \int_0^{2\pi} \int_{-\infty}^{+\infty} \int_0^{\infty} \mathcal{R}_f(a,b,\theta)\, \psi_{a,b,\theta}(x)\, \frac{\mathrm{d}a}{a^3}\, \mathrm{d}b\, \frac{\mathrm{d}\theta}{4\pi} \,, \qquad (3)$$
valid almost everywhere for functions which are both integrable and square integrable. This formula is stable, and one can prove a Parseval relation [16].

Ridgelet analysis may be constructed as wavelet analysis in the Radon domain. The rationale behind this is that the Radon transform translates singularities along lines into point singularities, for which the wavelet transform is known to provide a sparse representation. Recall that the Radon transform of an object $f$ is the collection of line integrals indexed by $(\theta, t) \in [0, 2\pi) \times \mathbb{R}$ given by
$$Rf(\theta, t) = \int_{\mathbb{R}^2} f(x_1, x_2)\, \delta(x_1 \cos\theta + x_2 \sin\theta - t)\, \mathrm{d}x_1\, \mathrm{d}x_2 \,, \qquad (4)$$
where $\delta$ is the Dirac distribution. Then the ridgelet transform is precisely the application of a one-dimensional wavelet transform to the slices of the Radon transform where the angular variable $\theta$ is constant and $t$ is varying. Thus, the basic strategy for calculating the continuous
ridgelet transform is first to compute the Radon transform $Rf(\theta, t)$ and second, to apply a one-dimensional wavelet transform to the slices $Rf(\theta, \cdot)$.
Several digital ridgelet transforms (DRTs) have been proposed, and we will describe three of them in this section, based on different implementations of the Radon transform.

The RectoPolar Ridgelet Transform
A fast implementation of the Radon transform can be proposed in the Fourier domain, based on the projection-slice theorem. First the 2D FFT of the given image is computed. Then the resulting function in the frequency domain is used to evaluate frequency values on a polar grid of rays passing through the origin and spread uniformly in angle. This conversion from a Cartesian to a polar grid can be obtained by interpolation, a process well known under the name of gridding in tomography. Given the polar grid samples, the number of rays corresponds to the number of projections, and the number of samples on each ray corresponds to the number of shifts per such angle. Applying a one-dimensional inverse Fourier transform to each ray, the Radon projections are obtained. The process described above is known to be inaccurate due to its sensitivity to the interpolation involved. This implies that for better accuracy, the first 2D FFT employed should be done with high redundancy. An alternative solution for the Fourier-based Radon transform exists, where the polar grid is replaced with
a pseudo-polar one. The geometry of this new grid is illustrated in Fig. 2. Concentric circles of linearly growing radius in the polar grid are replaced by concentric squares of linearly growing sides. The rays are spread uniformly not in angle but in slope. These two changes give a grid vaguely resembling the polar one, but for this grid a direct FFT can be implemented with no interpolation. When now applying a 1D FFT along the rays, we get a variant of the Radon transform, where the projection angles are not spaced uniformly. For the pseudo-polar FFT to be stable, it was shown that it should contain at least twice as many samples as the original image we started with. A by-product of this construction is the fact that the transform is organized as a 2D array with rows containing the projections as a function of the angle. Thus, processing the Radon transform along one axis is easily implemented. More details can be found in [68].

One-Dimensional Wavelet Transform
To complete the ridgelet transform, we must take a one-dimensional wavelet transform (WT1D) along the radial variable in Radon space. We now discuss the choice of the digital WT1D. Experience has shown that compactly supported wavelets can lead to many visual artifacts when used in conjunction with nonlinear processing, such as hard-thresholding of individual wavelet coefficients, particularly for decimated wavelet schemes used at critical sampling. Also, because of the lack of localization of such compactly supported wavelets in the frequency domain, fluctuations in coarse-scale wavelet coefficients can introduce fine-scale fluctuations. A frequency-domain approach must be taken, where the discrete Fourier transform is reconstructed from the inverse Radon transform. These considerations lead to the use of band-limited wavelets, whose support is compact in the Fourier domain rather than the time domain [28,29,68]. In [68], a specific overcomplete wavelet transform [67,72] has been used.
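A minimal numpy sketch of the Fourier-domain (projection-slice) Radon transform described above. This is our own illustrative approximation: it uses crude nearest-neighbour interpolation on a uniform-angle polar grid rather than the careful gridding or pseudo-polar sampling of a real implementation, and the function name `fourier_radon` is ours:

```python
import numpy as np

def fourier_radon(img, n_angles=16):
    """Approximate Radon transform via the projection-slice theorem:
    slice the 2D FFT along rays through the origin, then apply a 1D
    inverse FFT per ray. Nearest-neighbour interpolation is used here
    for simplicity, which is exactly the inaccuracy the pseudo-polar
    grid is designed to avoid."""
    n = img.shape[0]
    F = np.fft.fftshift(np.fft.fft2(img))
    c = n // 2
    radii = np.arange(-c, c)               # samples along each ray
    sinogram = np.empty((n_angles, radii.size), dtype=float)
    for k, theta in enumerate(np.linspace(0, np.pi, n_angles, endpoint=False)):
        # nearest grid point on the ray at angle theta
        ix = np.clip(np.round(c + radii * np.cos(theta)).astype(int), 0, n - 1)
        iy = np.clip(np.round(c + radii * np.sin(theta)).astype(int), 0, n - 1)
        ray = F[iy, ix]
        # 1D inverse FFT of a central slice gives one projection
        sinogram[k] = np.real(np.fft.ifft(np.fft.ifftshift(ray)))
    return sinogram

img = np.zeros((64, 64))
img[28:36, 28:36] = 1.0                    # centered square
sino = fourier_radon(img)
```

Since every ray passes through the DC sample, each projection integrates to the total mass of the image, as expected from Eq. (4).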
The wavelet transform algorithm is based on a scaling function $\phi$ such that $\hat{\phi}$ vanishes outside of the interval $[-\xi_c, \xi_c]$. We define the Fourier transform of the scaling function as a re-normalized $B_3$-spline
$$\hat{\phi}(\xi) = \frac{3}{2}\, B_3(4\xi) \,,$$
Curvelets and Ridgelets, Figure 2 Illustration of the pseudo-polar grid in the frequency domain for an n by n image (n = 8)
and $\hat{\psi}$ as the difference between two consecutive resolutions:
$$\hat{\psi}(2\xi) = \hat{\phi}(\xi) - \hat{\phi}(2\xi) \,.$$
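This scaling/wavelet pair can be checked numerically. The sketch below (our own helpers, assuming the standard centered cubic B-spline for $B_3$) verifies that the sub-bands telescope, so that the coarse band plus all wavelet bands co-add back to $\hat{\phi}$:

```python
import numpy as np

def b3(x):
    # Centered cubic B-spline, support [-2, 2] (assumed normalization)
    x = np.asarray(x, dtype=float)
    return (np.maximum(0, x + 2) ** 3 - 4 * np.maximum(0, x + 1) ** 3
            + 6 * np.maximum(0, x) ** 3 - 4 * np.maximum(0, x - 1) ** 3
            + np.maximum(0, x - 2) ** 3) / 6.0

def phi_hat(xi):
    # Fourier transform of the scaling function: re-normalized B3-spline
    return 1.5 * b3(4.0 * xi)

def psi_hat(xi):
    # Wavelet as the difference between two consecutive resolutions:
    # psi_hat(2*xi) = phi_hat(xi) - phi_hat(2*xi)
    return phi_hat(xi / 2.0) - phi_hat(xi)

xi = np.linspace(1e-3, 0.5, 1000)
J = 6
# Telescoping sum: coarse residual + all sub-bands = phi_hat(xi) exactly,
# which is why reconstruction amounts to simply co-adding coefficients.
total = phi_hat(2.0 ** J * xi) + sum(psi_hat(2.0 ** j * xi) for j in range(1, J + 1))
```

The 3/2 factor normalizes the scaling function so that $\hat{\phi}(0) = 1$.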
Because $\hat{\psi}$ is compactly supported, the sampling theorem shows that one can easily build a pyramid of $n + n/2 + \cdots + 1 = 2n$ elements, see [72] for details. This WT1D transform enjoys the following useful properties:
- The wavelet coefficients are directly calculated in Fourier space. In the context of the ridgelet transform, this allows avoiding the computation of the one-dimensional inverse Fourier transform along each radial line.
- Each sub-band is sampled above the Nyquist rate, hence avoiding aliasing, a phenomenon typically encountered by critically sampled orthogonal wavelet transforms [65].
- The reconstruction is trivial. The wavelet coefficients simply need to be co-added to reconstruct the input signal at any given point. In our application, this implies that the ridgelet coefficients simply need to be co-added to reconstruct Fourier coefficients.
This wavelet transform introduces an extra redundancy factor. However, we note that the goal in this implementation is not data compression or efficient coding. Rather, this implementation is useful to the practitioner whose focus is on data analysis, for which it is well known that over-completeness through (almost) translation invariance can provide substantial advantages.
Assembling all the above ingredients together gives the flowchart of the discrete ridgelet transform (DRT) depicted in Fig. 3. The DRT of an image of size $n \times n$ is an image of size $2n \times 2n$, introducing a redundancy factor equal to 4. We note that, because this transform is made of a chain of steps, each one of which is invertible, the whole transform is invertible, and so has the exact reconstruction property. For the same reason, the reconstruction is stable under perturbations of the coefficients. Last but not least, this discrete transform is computationally attractive: the algorithm presented here has low complexity since it runs in $O(n^2 \log n)$ flops for an $n \times n$ image.
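The chain "Radon transform, then 1D wavelet transform along $t$" can also be sketched end-to-end. The toy version below (our own illustration) substitutes a crude rotation-based Radon step and a single-level Haar transform for the Fourier-domain implementations described above, just to show why a line singularity becomes a few large coefficients:

```python
import numpy as np

def rotate_nn(img, theta):
    # Nearest-neighbour rotation about the image centre (illustration only)
    n = img.shape[0]
    c = (n - 1) / 2.0
    ys, xs = np.indices(img.shape)
    xr = np.cos(theta) * (xs - c) - np.sin(theta) * (ys - c) + c
    yr = np.sin(theta) * (xs - c) + np.cos(theta) * (ys - c) + c
    xi = np.clip(np.round(xr).astype(int), 0, n - 1)
    yi = np.clip(np.round(yr).astype(int), 0, n - 1)
    return img[yi, xi]

def haar_1d(v):
    # One level of an orthogonal Haar transform along the last axis
    s = (v[..., ::2] + v[..., 1::2]) / np.sqrt(2.0)
    d = (v[..., ::2] - v[..., 1::2]) / np.sqrt(2.0)
    return np.concatenate([s, d], axis=-1)

def ridgelet_coeffs(img, n_angles=8):
    # Radon step: rotate and sum columns; wavelet step: Haar along t
    rows = []
    for theta in np.linspace(0, np.pi, n_angles, endpoint=False):
        projection = rotate_nn(img, theta).sum(axis=0)
        rows.append(haar_1d(projection))
    return np.array(rows)

img = np.zeros((32, 32))
img[:, 16] = 1.0                 # a vertical line
coeffs = ridgelet_coeffs(img)
```

At the angle aligned with the line, the projection is a spike, so almost all the energy lands in a handful of coefficients; at other angles the mass is spread thin.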
The Orthonormal Finite Ridgelet Transform
The orthonormal finite ridgelet transform (OFRT) has been proposed [26] for image compression and filtering. This transform is based on the finite Radon transform [55] and a 1D orthogonal wavelet transform. It is non-redundant and reversible. It would be a great alternative to the previously described ridgelet transform were the OFRT not based on an awkward definition of a line. In fact,
Curvelets and Ridgelets, Figure 3 Discrete ridgelet transform flowchart. Each of the 2n radial lines in the Fourier domain is processed separately. The 1-D inverse FFT is calculated along each radial line followed by a 1-D nonorthogonal wavelet transform. In practice, the one-dimensional wavelet coefficients are directly calculated in the Fourier space
Curvelets and Ridgelets, Figure 4 The backprojection of a ridgelet coefficient by the FFT-based ridgelet transform (left), and by the OFRT (right)
a line in the OFRT is defined algebraically rather than geometrically, and so the points on a 'line' can be arbitrarily and randomly spread out in the spatial domain. Figure 4 shows the back-projection of a ridgelet coefficient by the FFT-based ridgelet transform (left) and by the OFRT (right). It is clear that the back-projection of the OFRT is nothing like a ridge function. Because of this specific definition of a line, thresholding the OFRT coefficients produces strong artifacts. Figure 5 shows part of the original image Boat, and its reconstruction after hard thresholding the OFRT of the noise-free Boat. The resulting image is not smoothed as
Curvelets and Ridgelets, Figure 5 Part of original noise-free Boat image (left), and reconstruction after hard thresholding its OFRT coefficients (right)
one would expect; rather, noise has been added to the noise-free image as part of the filtering! Finally, the OFRT presents another limitation: the image size must be a prime number. This last point is however not too restrictive, because we generally use a spatial partitioning when denoising the data, and a prime block size can be used. The OFRT is interesting from a conceptual point of view, but still requires work before it can be used for real applications such as denoising.

The Fast Slant Stack Ridgelet Transform
The Fast Slant Stack (FSS) [3] is a Radon transform of data on a Cartesian grid, which is algebraically exact and geometrically more accurate and faithful than the previously described methods. The back-projection of a point in Radon space is exactly a ridge function in the spatial domain (see Fig. 6). The transformation of an $n \times n$ image is a $2n \times 2n$ image. $n$ line integrals with angle in $[-\pi/4, \pi/4]$ are calculated from the image zero-padded on the $y$-axis, and $n$ line integrals with angle in $[\pi/4, 3\pi/4]$ are computed by zero-padding the image on the $x$-axis. For a given angle inside $[-\pi/4, \pi/4]$, $2n$ line integrals are calculated by first shearing the zero-padded image, and then integrating the pixel values along all horizontal lines (resp. vertical lines for angles in $[\pi/4, 3\pi/4]$). The shearing is performed one column at a time (resp. one line at a time) by using the 1D FFT. Figure 7 shows an example of the image shearing step with two different angles. A DRT based on the FSS transform has been proposed in [33]. The connection between the FSS and the Linogram has been investigated in [3]. An FSS algorithm is also proposed in [3], based on the 2D fast pseudo-polar Fourier transform which evaluates the 2D Fourier transform on a non-Cartesian (pseudo-polar) grid, operating in $O(n^2 \log n)$ flops. Figure 8 (left) exemplifies a ridgelet in the spatial domain obtained from the DRT based on the FSS implementation. Its Fourier transform is shown in Fig. 8 (right), superimposed on the DRT frequency tiling [33]. The Fourier transform of the discrete ridgelet lives in an angular wedge. More precisely, the Fourier transform of a discrete ridgelet at scale $j$ lives within a dyadic square of size $\sim 2^j$.

Local Ridgelet Transforms
The ridgelet transform is optimal for finding global lines of the size of the image. To detect line segments, a partitioning must be introduced [9]. The image can be decomposed into overlapping blocks of side-length $b$ pixels in such a way that the overlap between two vertically adjacent blocks is a rectangular array of size $b \times b/2$; overlap is used to avoid blocking artifacts. For an $n \times n$ image, we count $2n/b$ such blocks in each direction, so the redundancy factor grows by a factor of 4. The partitioning introduces redundancy, as a pixel belongs to 4 neighboring blocks. We present two competing strategies to perform the analysis and synthesis:
1. The block values are weighted by a spatial window $w$ (analysis) in such a way that the co-addition of all blocks reproduces exactly the original pixel value (synthesis).
2. The block values are those of the image pixel values (analysis) but are weighted when the image is reconstructed (synthesis).
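The FFT-based column shearing at the heart of the FSS, described earlier, can be sketched as follows. This is our own simplified illustration (hypothetical helper names), ignoring the zero-padding and the split between basically-horizontal and basically-vertical line families:

```python
import numpy as np

def fft_shift_1d(v, delta):
    # Circularly shift a sampled signal by a (possibly fractional)
    # amount `delta` using a Fourier-domain phase ramp.
    n = v.size
    k = np.fft.fftfreq(n)
    return np.real(np.fft.ifft(np.fft.fft(v) * np.exp(-2j * np.pi * k * delta)))

def shear(img, slope):
    # Shear: column x is shifted vertically by slope * (x - centre),
    # one column at a time, as in the Fast Slant Stack.
    n_rows, n_cols = img.shape
    c = n_cols // 2
    out = np.empty_like(img, dtype=float)
    for x in range(n_cols):
        out[:, x] = fft_shift_1d(img[:, x].astype(float), slope * (x - c))
    return out

img = np.zeros((16, 16))
img[8, :] = 1.0                  # horizontal line
sheared = shear(img, 0.5)        # becomes a line of slope 0.5
```

Because the shift is applied as a phase ramp, fractional shifts are handled exactly in the band-limited sense and each column's total mass is preserved, which is what makes the resulting line integrals algebraically exact.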
Curvelets and Ridgelets, Figure 6 Backprojection of a point at four different locations in the Radon space using the FSS algorithm
Curvelets and Ridgelets
Curvelets and Ridgelets, Figure 7 Slant Stack Transform of an image
Curvelets and Ridgelets, Figure 8 a Example of a ridgelet obtained by the Fast Slant Stack implementation. b Its FFT superimposed on the DRT frequency tiling
Experiments have shown that the second approach leads to better results, especially for restoration problems; see [68] for details. We calculate a pixel value f[i1, i2] from its four corresponding block values of half-size m = b/2, namely B1[k1, l1], B2[k2, l1], B3[k1, l2] and B4[k2, l2], with k1, l1 > b/2 and k2 = k1 − m, l2 = l1 − m, in the following way:

f1 = w(k2/m) B1[k1, l1] + w(1 − k2/m) B2[k2, l1]
f2 = w(k2/m) B3[k1, l2] + w(1 − k2/m) B4[k2, l2]
f[i1, i2] = w(l2/m) f1 + w(1 − l2/m) f2 ,   (5)
Curvelets and Ridgelets, Figure 9 Local ridgelet transform on a bandpass-filtered image. At fine scales, curved edges are almost straight lines
where w(x) = cos²(πx/2) is the window. Of course, one might select any other smooth, non-increasing function satisfying w(0) = 1, w(1) = 0, w′(0) = 0 and obeying the symmetry property w(x) + w(1 − x) = 1.

Sparse Representation by Ridgelets

The continuous ridgelet transform provides sparse representations of both smooth functions (in the Sobolev space W₂²) and of perfectly straight lines [11,31]. We have just seen that there are also various DRTs, i.e. expansions with a countable discrete collection of generating elements, which correspond either to frames or to orthobases. It has been shown for these schemes that the DRT achieves near-optimal M-term approximation, that is, the non-linear approximation of f using the M ridgelet coefficients largest in magnitude, for smooth images with discontinuities along straight lines [16,31]. In summary, ridgelets provide sparse representations of piecewise smooth images away from global straight edges.

Curvelets

The First Generation Curvelet Transform

In image processing, edges are curved rather than straight, and ridgelets are not able to represent such images efficiently. However, one can still deploy the ridgelet machinery in a localized way, at fine scales, where curved edges are almost straight lines (see Fig. 9). This is the idea underlying the first generation curvelets (termed here CurveletG1) [18].
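The window w(x) = cos²(πx/2) used for the overlapping-block synthesis and the complementary weighting of Eq. (5) are easy to check numerically. The snippet below is a 1D illustration (the 2D formula applies the same weights along both axes):

```python
import numpy as np

def w(x):
    # Smooth synthesis window: w(0) = 1, w(1) ~ 0, w'(0) = 0,
    # with the partition property w(x) + w(1 - x) = 1.
    return np.cos(np.pi * x / 2) ** 2

def blend(B1, B2, t):
    # 1D analog of Eq. (5): two overlapping block values are combined
    # with complementary weights, so their co-addition is exact.
    return w(t) * B1 + w(1 - t) * B2
```

Because w(t) + w(1 − t) = 1, blending two blocks that agree on their overlap returns the original pixel value exactly, which is the "exact co-addition" property required of the synthesis.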
First Generation Curvelets Construction

The CurveletG1 transform [18,32,68] opens the possibility of analyzing an image with different block sizes, but with a single transform. The idea is to first decompose the image into a set of wavelet bands, and then to analyze each band with a local ridgelet transform, as illustrated in Fig. 9. The block size can be changed at each scale level. Roughly speaking, different levels of the multiscale ridgelet pyramid are used to represent different sub-bands of a filter-bank output. At the same time, this sub-band decomposition imposes a relationship between the width and length of the important frame elements, so that they are anisotropic and obey approximately the parabolic scaling law width ≈ length². The First Generation Discrete Curvelet Transform (DCTG1) of a continuum function f(x) makes use of a dyadic sequence of scales, and a bank of filters with the property that the bandpass filter Δj is concentrated near the frequencies [2^{2j}, 2^{2j+2}], i.e.

Δj(f) = Ψ_{2j} * f ,   Ψ̂_{2j}(ν) = Ψ̂(2^{−2j} ν) .
In wavelet theory, one uses a decomposition into dyadic sub-bands [2^j, 2^{j+1}]. In contrast, the sub-bands used in the discrete curvelet transform of continuum functions have the nonstandard form [2^{2j}, 2^{2j+2}]. This nonstandard feature of the DCTG1 is well worth remembering: it is where the approximate parabolic scaling law comes into play. The DCTG1 decomposition is the sequence of the following steps:

Sub-band Decomposition. The object f is decomposed into sub-bands.
Require: Input n × n image f[i1, i2], type of DRT (see above).
1: Apply the à trous isotropic WT2D with J scales,
2: Set B1 = Bmin,
3: for j = 1, …, J do
4:   Partition the sub-band wj with a block size Bj and apply the DRT to each block,
5:   if j modulo 2 = 1 then
6:     B_{j+1} = 2 Bj,
7:   else
8:     B_{j+1} = Bj.
9:   end if
10: end for
Curvelets and Ridgelets, Algorithm 1 DCTG1
Smooth Partitioning. Each sub-band is smoothly windowed into “squares” of an appropriate scale (of side length ∼2^{−j}).
Ridgelet Analysis. Each square is analyzed via the DRT.
In this definition, the two dyadic sub-bands [2^{2j}, 2^{2j+1}] and [2^{2j+1}, 2^{2j+2}] are merged before applying the ridgelet transform.
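The block-size schedule of Algorithm 1 (the side length doubles at every other scale, keeping the parabolic scaling approximately valid) can be sketched directly; the default B_min = 16 matches the value used in the text:

```python
def block_sizes(J, b_min=16):
    # Side lengths B_1..B_J from Algorithm 1: starting at B_min,
    # the block size doubles after every odd scale index j.
    sizes, b = [], b_min
    for j in range(1, J + 1):
        sizes.append(b)
        if j % 2 == 1:
            b *= 2
    return sizes
```

For J = 5 this yields the sequence 16, 32, 32, 64, 64: each size is used for two consecutive scales before doubling.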
Digital Implementation

It seems that the isotropic “à trous” wavelet transform (Numerical Issues When Using Wavelets) [72] is especially well adapted to the needs of the digital curvelet transform. The algorithm decomposes an n × n image f[i1, i2] as a superposition of the form

f[i1, i2] = cJ[i1, i2] + Σ_{j=1}^{J} wj[i1, i2] ,
where cJ is a coarse or smooth version of the original image f and wj represents “the details of f” at scale 2^{−j}. Thus, the algorithm outputs J + 1 sub-band arrays of size n × n. A sketch of the DCTG1 implementation is given in Algorithm 1. The side length of the localizing windows is doubled at every other dyadic sub-band, hence maintaining the fundamental property of the curvelet transform which says that elements of length about 2^{−j/2} serve for the analysis and synthesis of the jth sub-band [2^j, 2^{j+1}]. Note also that the coarse description of the image cJ is left intact. In the results shown in this paper, we used the default value Bmin = 16 pixels in our implementation. Figure 10 gives an overview of the organization of the DCTG1 algorithm.
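A minimal numpy sketch of such an isotropic à trous decomposition follows. Periodic boundary handling and the B₃-spline kernel [1,4,6,4,1]/16 are common choices but simplifying assumptions here; this is not the exact implementation of [72]. The key property, exact reconstruction f = cJ + Σj wj, holds by construction since each band is a difference of two successive smoothings:

```python
import numpy as np

def atrous(f, J):
    # "A trous" decomposition: at scale j the image is smoothed with a
    # separable kernel whose taps are spaced 2^j apart ("with holes");
    # the detail band w_{j+1} is the difference of two smoothings.
    h = np.array([1, 4, 6, 4, 1]) / 16.0
    c = f.astype(float)
    bands = []
    for j in range(J):
        step = 2 ** j
        cs = c
        for axis in (0, 1):          # separable smoothing, periodic borders
            acc = np.zeros_like(cs)
            for t, hv in zip((-2, -1, 0, 1, 2), h):
                acc += hv * np.roll(cs, t * step, axis=axis)
            cs = acc
        bands.append(c - cs)         # detail band at this scale
        c = cs                       # coarser approximation
    return c, bands                  # f == c_J + sum(bands), exactly
```

The telescoping sum of the bands recovers c₀ − cJ, so adding back cJ reproduces the input to machine precision, mirroring the J + 1 sub-band arrays described above.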
Curvelets and Ridgelets, Figure 10 First Generation Discrete Curvelet Transform (DCTG1) flowchart. The figure illustrates the decomposition of the original image into sub-bands followed by the spatial partitioning of each sub-band. The ridgelet transform is then applied to each block
Curvelets and Ridgelets, Figure 11 A few first generation curvelets
This implementation of the DCTG1 is also redundant: the redundancy factor is equal to 16J + 1 whenever J scales are employed. The DCTG1 algorithm enjoys exact reconstruction and stability, as each step of the analysis (decomposition) algorithm is itself invertible. One can show that the computational complexity of the DCTG1 algorithm described here, based on the DRT of Fig. 3, is O(n²(log n)²) for an n × n image. Figure 11 shows a few curvelets at different scales, orientations and locations.

Sparse Representation by First Generation Curvelets

The CurveletG1 elements can form either a frame or a tight frame for L²(ℝ²) [17], depending on the WT2D used and the DRT implementation (rectopolar or FSS Radon transform). The frame elements are anisotropic by construction and become successively more anisotropic at progressively higher scales. These curvelets also exhibit directional sensitivity and display oscillatory components across the “ridge”. A central motivation leading to the curvelet construction was the problem of non-adaptively representing piecewise smooth (e.g. C²) images f which have a discontinuity along a C² curve. Such a model is the so-called cartoon model of (non-textured) images. With the CurveletG1 tight frame construction, it was shown in [17] that for such f, the M-term non-linear approximations fM of f obey, for each ε > 0,

‖f − fM‖² ≤ Cε M^{−2+ε} ,   M → +∞ .

The M-term approximations in the CurveletG1 are thus almost rate optimal, much better than M-term Fourier or wavelet approximations for such images; see [53].

The Second Generation Curvelet Transform

Despite these interesting properties, the CurveletG1 construction presents some drawbacks. First, the construction involves a complicated seven-index structure, among which we have parameters for scale, location and orientation. In addition, the parabolic scaling ratio width ≈ length² is not completely true (see Subsect. “First Generation Curvelets Construction”); in fact, CurveletG1 assumes a wide range of aspect ratios. These facts make mathematical and quantitative analysis especially delicate. Second, the spatial partitioning of the CurveletG1 transform uses overlapping windows to avoid blocking effects, which increases the redundancy of the DCTG1. The computational cost of the DCTG1 algorithm may also be a limitation for large-scale data, especially if the FSS-based DRT implementation is used. In contrast, the second generation curvelets (CurveletG2) [15,20] exhibit a much simpler and more natural indexing structure with three parameters: scale, orientation (angle) and location, hence simplifying mathematical analysis. The CurveletG2 transform also implements a tight frame expansion [20] and has a much lower redundancy. Unlike the DCTG1, the discrete CurveletG2 implementation does not use ridgelets, yielding a faster algorithm [15,20].

Second Generation Curvelets Construction
Continuous Coronization

The second generation curvelets are defined at scale 2^{−j}, orientation θl and position x_k^{(j,l)} = R_{θl}^{−1}(2^{−j} k1, 2^{−j/2} k2) by translation and rotation of a mother curvelet φj as
Curvelets and Ridgelets, Figure 12 a Continuous curvelet frequency tiling. The dark gray area represents a wedge obtained as the product of the radial window (annulus shown in gray) and the angular window (light gray). b The Cartesian grid in space associated with the construction in a, whose spacing also obeys the parabolic scaling by duality. c Discrete curvelet frequency tiling. The window û_{j,l} isolates the frequency content near a trapezoidal wedge such as the one shown in dark gray. d The wrapping transformation. The dashed line shows the same trapezoidal wedge as in c. The parallelogram contains this wedge and hence the support of the curvelet. After periodization, the wrapped Fourier samples can be collected in the rectangle centered at the origin
φ_{j,l,k}(x) = φj(R_{θl}(x − x_k^{(j,l)})) ,   (6)
where R_{θl} is the rotation by θl radians. The θl form the equispaced sequence of rotation angles θl = 2π · 2^{−⌊j/2⌋} · l, with integer l such that 0 ≤ θl < 2π (note that the number of orientations varies as 1/√scale). k = (k1, k2) ∈ ℤ² is the sequence of translation parameters. The waveform φj is defined by means of its Fourier transform φ̂j(ν), written in polar coordinates in the Fourier domain:
φ̂j(r, θ) = 2^{−3j/4} ŵ(2^{−j} r) v̂(2^{⌊j/2⌋} θ / (2π)) .   (7)
The support of φ̂j is a polar parabolic wedge defined by the supports of ŵ and v̂, the radial and angular windows (both smooth, nonnegative and real-valued), applied with scale-dependent window widths in each direction. ŵ and v̂ must also satisfy the partition of unity property [15]. See the frequency tiling in Fig. 12a.
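The angle sequence θl = 2π · 2^{−⌊j/2⌋} · l defined above can be enumerated directly; note how the number of orientations doubles only every other scale, consistent with the 1/√scale behavior:

```python
import math

def rotation_angles(j):
    # Equispaced rotation angles theta_l = 2*pi * 2^(-floor(j/2)) * l
    # with 0 <= theta_l < 2*pi; the count is 2^floor(j/2), so it
    # doubles only at every other scale.
    n = 2 ** (j // 2)
    return [2 * math.pi * l / n for l in range(n)]
```

For instance, scales j = 2 and j = 3 share the same two orientations, while j = 4 has four.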
In the continuous frequency domain, the CurveletG2 coefficients of data f(x) are defined as the inner product

c_{j,l,k} := ⟨f, φ_{j,l,k}⟩ = ∫_{ℝ²} f̂(ν) φ̂j(R_{θl} ν) e^{i x_k^{(j,l)} · ν} dν .   (8)

This construction implies a few properties: (i) the CurveletG2 defines a tight frame of L²(ℝ²); (ii) the effective length and width of these curvelets obey the parabolic scaling relation (2^{−j} = width) = (length = 2^{−j/2})²; (iii) the curvelets exhibit an oscillating behavior in the direction perpendicular to their orientation. Curvelets as just constructed are complex-valued. It is easy to obtain real-valued curvelets by working on the symmetrized version φ̂j(r, θ) + φ̂j(r, θ + π).

Discrete Coronization

The discrete transform takes as input data defined on a Cartesian grid and outputs a collection of coefficients. The continuous-space definition of the CurveletG2 uses coronae and rotations that are not especially adapted to Cartesian arrays. It is then convenient to replace these concepts by their Cartesian counterparts, that is, concentric squares (instead of concentric circles) and shears (instead of rotations); see Fig. 12c. The Cartesian equivalent of the radial window ŵj(ν) = ŵ(2^{−j} ν) would be a bandpass frequency-localized window, which can be derived from the difference of separable low-pass windows Hj(ν) = ĥ(2^{−j} ν1) ĥ(2^{−j} ν2) (h is a 1D low-pass filter):

ŵj(ν) = √(H²_{j+1}(ν) − H²_j(ν)) , ∀ j ≥ 0 , and ŵ0(ν) = ĥ(ν1) ĥ(ν2) .

Another possible choice is to select these windows inspired by the construction of Meyer wavelets [20,57]. See [15] for more details about the construction of the Cartesian ŵj's. Let us now examine the angular localization. Each Cartesian corona has four quadrants: East, North, West and South. Each quadrant is separated into 2^{⌊j/2⌋} orientations (wedges) with the same areas. Take for example the East quadrant (−π/4 ≤ θl < π/4). For the West quadrant, we would proceed by symmetry around the origin, and for the North and South quadrants by exchanging the roles of ν1 and ν2. Define the angular window for the lth direction as

v̂_{j,l}(ν) = v̂(2^{⌊j/2⌋} (ν2 − ν1 tan θl) / ν1) ,   (9)

with the sequence of equispaced slopes (and not angles) tan θl = 2^{−⌊j/2⌋} · l, for l = −2^{⌊j/2⌋}, …, 2^{⌊j/2⌋} − 1. We can now define the window which is the Cartesian analog of φ̂j above,

û_{j,l}(ν) = ŵj(ν) v̂_{j,l}(ν) = ŵj(ν) v̂_{j,0}(S_{θl} ν) ,   (10)

where S_{θl} is the shear matrix. From this definition, it can be seen that û_{j,l} is supported near the trapezoidal wedge {ν = (ν1, ν2) | 2^j ≤ ν1 ≤ 2^{j+1}, −2^{−j/2} ≤ ν2/ν1 − tan θl ≤ 2^{−j/2}}. The collection of û_{j,l}(ν) gives rise to the frequency tiling shown in Fig. 12c. From û_{j,l}(ν), the digital CurveletG2 construction suggests Cartesian curvelets that are translated and sheared versions of a mother Cartesian curvelet φ̂j^D(ν) = û_{j,0}(ν): φ^D_{j,l,k}(x) = 2^{3j/4} φj^D(S^T_{θl} x − m),
where m = (k1 · 2^{−j}, k2 · 2^{−j/2}).

Digital Implementation

The goal here is to find a digital implementation of the Second Generation Discrete Curvelet Transform (DCTG2), whose coefficients are now given by

c_{j,l,k} := ⟨f, φ^D_{j,l,k}⟩ = ∫_{ℝ²} f̂(ν) φ̂j^D(S^{−1}_{θl} ν) e^{i S^{−T}_{θl} m · ν} dν .   (11)
To evaluate this formula with discrete data, one may think of (i) using the 2D FFT to get f̂, (ii) forming the windowed frequency data f̂ û_{j,l}, and (iii) applying the inverse Fourier transform. But this necessitates evaluating the FFT on the sheared grid S^T_{θl} m, for which the classical FFT algorithm is not valid. Two implementations were then proposed [15], essentially differing in their way of handling the grid:

A tilted grid mostly aligned with the axes of û_{j,l}(ν), which leads to the Unequispaced FFT (USFFT)-based DCTG2. This implementation uses a nonstandard interpolation. Furthermore, the inverse transform uses conjugate gradient iterations to invert the interpolation step. This has the drawback of a higher computational burden compared to the wrapping-based implementation that we discuss hereafter. We will not elaborate further on the USFFT implementation as we never use it in practice. The interested reader may refer to [15] for further details and analysis.

A grid aligned with the input Cartesian grid, which leads to the wrapping-based DCTG2. The wrapping-based DCTG2 makes a simpler choice of the spatial grid to translate the curvelets. The curvelet coefficients are essentially the same as in (11), except that S^T_{θl} m is replaced by m with values on a rectangular grid. But again, a difficulty arises because the window û_{j,l} does not fit in a rectangle of size ∼2^j × 2^{j/2} to which an inverse
Require: Input n × n image f[i1, i2], coarsest decomposition scale, curvelets or wavelets at the finest scale.
1: Apply the 2D FFT and obtain Fourier samples f̂[i1, i2].
2: for each scale j and angle l do
3:   Form the product f̂[i1, i2] û_{j,l}[i1, i2].
4:   Wrap this product around the origin.
5:   Apply the inverse 2D FFT to the wrapped data to get discrete DCTG2 coefficients.
6: end for
Curvelets and Ridgelets, Algorithm 2 DCTG2 via wrapping
FFT could be applied. The wrapping trick consists in periodizing the windowed frequency data f̂ û_{j,l} and re-indexing the samples array by wrapping around a ∼2^j × 2^{j/2} rectangle centered at the origin; see Fig. 12d to get a gist of the wrapping idea. The wrapping-based DCTG2 algorithm is summarized in Algorithm 2.

The DCTG2 implementation can assign either wavelets or curvelets to the finest scale. In the CurveLab toolbox [75], the default choice is wavelets at the finest scale, but this can easily be modified directly in the code. Many technical details of the CurveletG2 construction are deliberately omitted here, for instance the low-pass coarse component, window overlapping, and the windows over junctions between quadrants. This paper is intended to give an overview of these recent multiscale transforms, and the genuinely interested reader may refer to the original papers of Candès, Donoho, Starck and co-workers for further details (see bibliography). The computational complexity of the wrapping-based DCTG2 analysis and reconstruction algorithms is that of the FFT, O(n² log n), and in practice the computation time is that of 6 to 10 2D FFTs [15]. This is a faster algorithm than the DCTG1, and this fast algorithm has helped make the curvelet transform attractive in many application fields (see Sect. “Stylized Applications” for some of them). The DCTG2, as implemented in the CurveLab toolbox [75], has reasonable redundancy, at most 7.8 (much higher in 3D) if curvelets are used at the finest scale. This redundancy can even be reduced down to 4 (and 8 in 3D) if we replace in this implementation the Meyer wavelet construction, which introduces a redundancy factor of 4, by an-
other wavelet pyramidal construction, similar to the one presented in Sect. “One-dimensional Wavelet Transform”, which has a redundancy less than 2 in any dimension. Our experiments have shown that this modification does not alter the results in denoising experiments. The DCTG2 redundancy is anyway much smaller than that of the DCTG1, which is 16J + 1. As stated earlier, the DCTG2 coefficients are complex-valued, but a real-valued DCTG2 with the same redundancy factor can easily be obtained by properly combining coefficients at orientations θl and θl + π. The DCTG2 can be extended to higher dimensions [21]. In the same vein as wavelets on the interval [53], the DCTG2 has recently been adapted to handle image boundaries by mirror extension instead of periodization [25]. The latter modification can have immediate implications in image processing applications where the contrast difference at opposite image boundaries may be an issue (see e.g. the denoising experiment discussion reported in Sect. “Stylized Applications”).

We would like to make a connection with other multiscale directional transforms directly linked to curvelets. The contourlet tight frame of Do and Vetterli [27] implements the CurveletG2 idea directly on a discrete grid using a perfect reconstruction filter bank procedure. In [51], the authors proposed a modification of the contourlets with a directional filter bank that provides a frequency partitioning close to that of the curvelets but with no redundancy. Durand [35] recently introduced families of non-adaptive directional wavelets with various frequency tilings, including that of curvelets. Such families are non-redundant and form orthonormal bases for L²(ℝ²), and they have an implementation derived from a single nonseparable filter bank structure with nonuniform sampling.
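The wrapping step at the heart of the DCTG2 (Algorithm 2, step 4) amounts to periodizing the windowed frequency samples and folding them, by modular re-indexing, into a rectangle of roughly 2^j × 2^{j/2} samples centered at the origin. A toy sketch follows; the dimensions and centering are simplified, and this is not the CurveLab code:

```python
import numpy as np

def wrap(data, rows, cols):
    # Wrapping trick: periodize the windowed frequency samples and
    # fold them into a rows-by-cols rectangle by re-indexing each
    # sample modulo the rectangle size.
    out = np.zeros((rows, cols), dtype=data.dtype)
    n1, n2 = data.shape
    for i in range(n1):
        for k in range(n2):
            out[i % rows, k % cols] += data[i, k]
    return out
```

Because the wedge support fits inside a parallelogram of the same area as the rectangle, the fold collects every sample without loss (the total "mass" is preserved), after which a standard inverse 2D FFT can be applied to the small rectangle.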
Sparse Representation by Second Generation Curvelets

It has been shown by Candès and Donoho [20] that, with the CurveletG2 tight frame construction, the M-term non-linear approximation error for C² images except at discontinuities along C² curves obeys

‖f − fM‖² ≤ C M^{−2} (log M)³ .

This is an asymptotically optimal convergence rate (up to the (log M)³ factor), and it holds uniformly over the C²-C² class of functions. This is a remarkable result since the CurveletG2 representation is non-adaptive. However, the simplicity due to the non-adaptivity of curvelets has a cost: curvelet approximations lose their near-optimal properties when the image is composed of edges which are not exactly C². Additionally, if the edges are C^α-regular with α > 2, then the curvelet convergence rate exponent
Curvelets and Ridgelets, Figure 13 An example of a second generation real curvelet. Left: the curvelet in the spatial domain. Right: its Fourier transform
remains 2. Other adaptive geometric representations such as bandlets are specifically designed to reach the optimal decay rate O(M^{−α}) [49,62].

Stylized Applications

Denoising

Elongated Feature Recovery. The ability of ridgelets to sparsely represent piecewise smooth images away from discontinuities along lines has an immediate implication for statistical estimation. Consider a piecewise smooth image f away from line singularities, embedded in additive white noise of standard deviation σ. The ridgelet-based thresholding estimator is nearly optimal for recovering such functions, with a mean-square error (MSE) decay rate almost as good as the minimax rate [12]. To illustrate these theoretical facts, we simulate a vertical band embedded in white Gaussian noise with large σ. Figure 14 (left) represents such a noisy image. The parameters are as follows: the pixel width of the band is 20 and the signal-to-noise ratio (SNR) is set to 0.1. Note that it is not possible to distinguish the band by eye. The undecimated wavelet transform is also incapable of detecting the presence of this object; roughly speaking, wavelet coefficients correspond to weighted averages over approximately isotropic neighborhoods (at different scales), and those wavelets clearly do not correlate well with the very elongated structure (pattern) of the object to be detected.

Curve Recovery. Consider now the problem of recovering a piecewise C² function f apart from a discontinuity along a C² edge. Again, a simple strategy based on thresh-
olding curvelet tight frame coefficients yields an estimator that achieves an MSE almost of the order O(σ^{4/3}) uniformly over the C²-C² class of functions [19]. This is the optimal rate of convergence, as the minimax rate for that class scales as σ^{4/3} [19]. Comparatively, wavelet thresholding methods only achieve an MSE of order O(σ) and no better. We also note that the statistical optimality of curvelet thresholding extends to a large class of ill-posed linear inverse problems [19]. In the experiment of Fig. 15, we added white Gaussian noise to “War and Peace”, a drawing by Picasso which contains many curved features. Figure 15 bottom left and right shows, respectively, the images restored by the undecimated wavelet transform and by the DCTG1. Curves are more sharply recovered with the DCTG1.

In a second experiment, we compared the denoising performance of several digital implementations of the curvelet transform, namely the DCTG1 with the rectopolar DRT, the DCTG1 with the FSS-based DRT, and the wrapping-based DCTG2. The results are shown in Fig. 16, where the original 512 × 512 Peppers image was corrupted by Gaussian white noise with σ = 20 (PSNR = 22 dB). Although the FSS-based DRT is more accurate than the rectopolar DRT, the denoising improvement of the former (PSNR = 31.31 dB) is only 0.18 dB over the latter (PSNR = 31.13 dB) on Peppers. The difference is almost indistinguishable by eye, but the computation time is 20 times higher for the DCTG1 with the FSS DRT. Consequently, there appears to be little benefit in using the FSS DRT in the DCTG1 for restoration applications. Denoising using the DCTG2 with the wrapping implementation gives a PSNR of 30.62 dB, which is 0.7 dB less than the DCTG1. But this is the price to pay for a lower re-
Curvelets and Ridgelets, Figure 14 Original image containing a vertical band embedded in white noise with relatively large amplitude (left). Denoised image using the undecimated wavelet transform (middle). Denoised image using the DRT based on the rectopolar Radon transform (right)
Curvelets and Ridgelets, Figure 15 The Picasso picture War and Peace (top left), the same image contaminated with a Gaussian white noise (top right). The restored images using the undecimated wavelet transform (bottom left) and the DCTG1 (bottom right)
Curvelets and Ridgelets, Figure 16 Comparison of denoising performance of several digital implementations of the curvelet transform. Top left: original image. Top right: noisy image, σ = 20. Bottom left: denoised with DCTG1 using the rectopolar DRT. Bottom middle: denoised with DCTG1 using the FSS DRT. Bottom right: denoised with DCTG2 using the wrapping implementation
dundancy and a much faster transform algorithm. Moreover, the DCTG2 exhibits some artifacts which look like “curvelet ghosts”. This is a consequence of the fact that the DCTG2 makes central use of the FFT, which has the side effect of treating the image boundaries by periodization.

Linear Inverse Problems

Many problems in image processing can be cast as inverting the linear degradation equation y = H f + ε, where f is the image to recover, y the observed image, and ε a white noise of variance σ² < +∞. The linear mapping H is generally ill-behaved, which entails ill-posedness of the inverse problem. Typical examples of linear inverse problems include image deconvolution, where H is the convolution operator, and image inpainting (recovery of missing data), where H is a binary mask. In the last few years, some authors have attacked the problem of solving linear inverse problems under the umbrella of sparse representations and variational formulations, e.g. for deconvolution [19,24,38,41,74] and in-
painting [37,39]. Typically, in this setting, the recovery of f is stated as an optimization problem with a sparsity-promoting regularization on the representation coefficients of f, e.g. its wavelet or curvelet coefficients. See [37,38,39,74] for more details. In the first row of Fig. 17, we depict an example of deconvolution of the Barbara image using the algorithm described in [38] with the DCTG2 curvelet transform. The original, degraded (blurred with an exponential kernel, plus noise) and restored images are shown on the left, middle and right, respectively. The second row gives an example of inpainting of the Claudia image using the DCTG2, with 50% of the pixels missing.

Contrast Enhancement

The curvelet transform has been successfully applied to image contrast enhancement by Starck et al. [73]. As the curvelet transform captures edges in an image efficiently, it is a good candidate for multiscale edge enhancement. The idea is to modify the curvelet coefficients of the in-
Curvelets and Ridgelets, Figure 17 Illustration of the use of curvelets (DCTG2 transform) when solving two typical linear inverse problems: deconvolution (first row), and inpainting (second row). First row: deconvolution of Barbara image, original (left), blurred and noisy (middle), restored (right). Second row: inpainting of Claudia image, original (left), masked image (middle), inpainted (right)
put image in order to enhance its edges. The curvelet coefficients are typically modified according to the function displayed in the left plot of Fig. 18. Basically, this plot says that the input coefficients are kept intact (or even shrunk) if they have either low values (e.g. below the noise level) or high values (the strongest edges). Intermediate curvelet coefficient values, which correspond to the faintest edges, are amplified. An example of curvelet-based image enhancement on the Saturn image is given in Fig. 18.

Morphological Component Separation

The idea of morphologically decomposing a signal or image into its building blocks is an important problem in signal and image processing. Successful separation of signal content plays a key role in the ability to effectively analyze it, enhance it, compress it, synthesize it, and more. Various approaches have been proposed to tackle this problem. Morphological Component Analysis (MCA) [70,71] is a method that allows us to decompose
a single signal into two or more layers, each layer containing only one kind of feature of the input signal or image. The separation can be achieved when each kind of feature is sparsely represented by a given transform in the dictionary of transforms. Furthermore, when a transform sparsely represents one part of the signal/image, it should yield a non-sparse representation of the other content types. For instance, lines and Gaussians in an image can be separated using the ridgelet transform and the wavelet transform. Locally oscillating textures can be separated from the piecewise smooth content using the local discrete cosine transform and the curvelet transform [70]. A full description of MCA is given in [70]. The first row of Fig. 19 illustrates a separation result when the input image contains only lines and isotropic Gaussians. Two transforms were amalgamated in the dictionary, namely the à trous WT2D and the DRT. The left, middle and right images in the first row of Fig. 19 represent, respectively, the original image, the component reconstructed from the à trous wavelet coefficients, and
Curvelets and Ridgelets, Figure 18 Curvelet contrast enhancement. Left: enhanced vs original curvelet coefficient. Middle: original Saturn image. Right: result of curvelet-based contrast enhancement
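The coefficient-modification curve of Fig. 18 (left) keeps small (sub-noise) and very large coefficients while amplifying the intermediate ones. The toy gain function below mimics that behavior; its functional form and the parameters noise, m and p are illustrative assumptions, not the exact enhancement function of Starck et al. [73]:

```python
import numpy as np

def enhance(coeffs, noise=1.0, m=10.0, p=0.5):
    # Toy contrast-enhancement gain (hypothetical parameters):
    # coefficients with |c| below the noise level or above m are
    # left intact; intermediate ones (faint edges) are amplified
    # by the gain (m/|c|)^p > 1.
    a = np.abs(coeffs)
    gain = np.ones_like(a)
    mid = (a > noise) & (a < m)
    gain[mid] = (m / a[mid]) ** p
    return coeffs * gain
```

Applying such a gain in the curvelet domain and inverting the transform boosts the faint, curved edges while leaving both the noise floor and the dominant structures essentially unchanged.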
Curvelets and Ridgelets, Figure 19 First row, from left to right: original image containing lines and Gaussians, separated Gaussian component (wavelets), separated line component (ridgelets). Second row, from left to right: original Barbara image, reconstructed local discrete cosine transform part (texture), and piecewise smooth part (curvelets)
the layer reconstructed from the ridgelet coefficients. The second row of Fig. 19 shows, respectively, the Barbara image, the reconstructed local cosine (texture) component and the reconstructed curvelet component. In the Barbara example, the dictionary contained the local discrete cosine and DCTG2 transforms.

Future Directions

In this paper, we gave an overview of two important geometric multiscale transforms, namely ridgelets and curvelets, and we illustrated their potential applicability to a wide range of image processing problems. Although these transforms are not adaptive, they are strikingly effective both theoretically and practically on piecewise smooth images away from contours. However, in image processing, the geometry of the image and its regularity are generally not known in advance. Therefore, to reach higher sparsity levels, it is necessary to find representations that can adapt themselves to the geometrical content of the image. For instance, geometric transforms such as wedgelets [30] or bandlets [49,50,62] make it possible to define an adapted multiscale geometry. These transforms perform a non-linear search for an optimal representation, offering geometrical adaptivity together with stable algorithms. Recently, Mallat [54] proposed a more biologically inspired procedure named the grouplet transform, which defines a multiscale association field by grouping together pairs of wavelet coefficients.

In imaging science and technology, there is a remarkable proliferation of new data types. Besides the traditional data arrays defined on uniformly sampled Cartesian grids with scalar-valued samples, many novel imaging modalities involve data arrays that are either (or both) acquired on specific “exotic” grids, as happens in astronomy, medicine and physics, or have samples taking values in a manifold. Examples of the former include data defined on spherical manifolds, as in astronomical imaging, or catadioptric optical imaging, where a sensor overlooks a paraboloidal mirror.
Examples of the latter include vector fields, such as the polarization data that arise in astronomy, rigid motions (a special Euclidean group), and the positive-definite matrices encountered in earth science or medical imaging. The challenge with such data is to find multiscale representations that are sufficiently flexible to apply to many data types, while being defined on the proper grid and respecting the manifold structure. An extension of wavelets, curvelets and ridgelets to scalar-valued data on the sphere was recently proposed in [66]. Construc-
tion of wavelets for scalar-valued data defined on graphs and some manifolds was proposed by [22]. The authors in [63][see references therein] describe multiscale representations for data observed on equispaced grids and taking values in manifolds such as: the sphere, the special orthogonal group, the positive definite matrices, and the Grassmannian manifolds. Nonetheless many challenging questions are still open in this field: extend the idea of multiscale geometrical representations such as curvelets or ridgelets to manifold-valued data, find multiscale geometrical representations which are sufficiently general for a wide class of grids, etc. We believe that these directions are one of the hottest topics in this field. Most of the transforms discussed in this paper can handle efficiently smooth or piecewise smooth functions. But sparsely representing textures remains an important open question, mainly because there is no consensus on how to define a texture. Although Julesz [45] stated simple axioms about the probabilistic characterization of textures. It has been known for some time now that some transforms can sometimes enjoy reasonably sparse expansions of certain textures; e. g. oscillatory textures in bases such as local discrete cosines [70], brushlets [56], Gabor [53], complex wavelets [46]. Gabor and wavelets are widely used in the image processing community for texture analysis. But little is known on the decay of Gabor and wavelet coefficients of “texture”. If one is interested in synthesis as well as analysis, the Gabor representation may be useless (at least in its strict definition). Restricting themselves to locally oscillating patters, Demanet and Ying have recently proposed a wavelet-packet construction named WaveAtoms [77]. They showed that WaveAtoms provide optimally sparse representation of warped oscillatory textures. Another line of active research in sparse multiscale transforms was initiated by the seminal work of Olshausen and Field [59]. 
Following in their footsteps, one can push the idea of adaptive sparse representation one step further and require that the dictionary not be fixed but rather optimized to sparsify a set of exemplar signals/images, i.e. patches. Such a learning problem corresponds to finding a sparse matrix factorization, and several algorithms have been proposed for this task in the literature; see [1] for a good overview. Explicit structural constraints such as translation invariance can also be enforced on the learned dictionary [5,58]. These learning-based sparse representations have shown a great improvement over fixed (and even adapted) transforms for a variety of image processing tasks such as denoising and compression [8,36,52], linear inverse problems (image decomposition and inpainting) [61], and texture synthesis [60].
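To make the sparse matrix factorization view of dictionary learning concrete, the following minimal NumPy sketch alternates a crude thresholding-based sparse coding step with a least-squares (MOD-style) dictionary update D = X A^T (A A^T + εI)^{-1}. It is only a simplified stand-in for K-SVD [1]; all function names and parameters are illustrative, not taken from any particular library.

```python
import numpy as np

def sparse_code(D, X, k):
    """Crude sparse coding: correlate patches with the atoms and keep
    only the k largest coefficients per patch (a stand-in for OMP)."""
    A = D.T @ X                              # (n_atoms, n_patches)
    for j in range(A.shape[1]):
        small = np.argsort(np.abs(A[:, j]))[:-k]
        A[small, j] = 0.0                    # zero all but k entries
    return A

def learn_dictionary(X, n_atoms, k=3, n_iter=20, seed=0):
    """Alternate sparse coding and a MOD-style dictionary update
    D = X A^T (A A^T + eps*I)^{-1}, renormalizing the atoms each pass.
    X holds one (vectorized) exemplar patch per column."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        A = sparse_code(D, X, k)
        G = A @ A.T + 1e-8 * np.eye(n_atoms)   # regularized Gram matrix
        D = X @ A.T @ np.linalg.inv(G)         # least-squares update
        D /= np.linalg.norm(D, axis=0) + 1e-12
    return D, sparse_code(D, X, k)
```

In a serious implementation the coding step would be a proper pursuit (e.g. OMP) and the update could be the atom-by-atom K-SVD refinement; the alternating structure, however, is the same.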
Curvelets and Ridgelets
Bibliography 1. Aharon M, Elad M, Bruckstein AM (2006) The K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans Signal Process 54(11):4311–4322 2. Arivazhagan S, Ganesan L, Kumar TS (2006) Texture classification using curvelet statistical and co-occurrence features. In: Proceedings of the 18th International Conference on Pattern Recognition (ICPR 2006), Hong Kong, vol 2, pp 938–941 3. Averbuch A, Coifman RR, Donoho DL, Israeli M, Waldén J (2001) Fast Slant Stack: A notion of Radon transform for data in a Cartesian grid which is rapidly computible, algebraically exact, geometrically faithful and invertible. SIAM J Sci Comput 4. BeamLab 200 (2003) http://www-stat.stanford.edu/beamlab/ 5. Blumensath T, Davies M (2006) Sparse and shift-invariant representations of music. IEEE Trans Speech Audio Process 14(1):50–57 6. Bobin J, Moudden Y, Starck JL, Elad M (2006) Morphological diversity and source separation. IEEE Signal Process Lett 13(7):409–412 7. Bobin J, Starck JL, Fadili J, Moudden Y (2007) Sparsity, morphological diversity and blind source separation. IEEE Trans Image Process 16(11):2662–2674 8. Bryt O, Elad M (2008) Compression of facial images using the K-SVD algorithm. J Vis Commun Image Represent 19(4):270–283 9. Candès EJ (1998) Ridgelets: theory and applications. PhD thesis, Stanford University 10. Candès EJ (1999) Harmonic analysis of neural networks. Appl Comput Harmon Anal 6:197–218 11. Candès EJ (1999) Ridgelets and the representation of mutilated Sobolev functions. SIAM J Math Anal 33:197–218 12. Candès EJ (1999) Ridgelets: Estimating with ridge functions. Ann Statist 31:1561–1599 13. Candès EJ, Demanet L (2002) Curvelets and Fourier integral operators. C R Acad Sci Paris, Serie I, 336:395–398 14. Candès EJ, Demanet L (2005) The curvelet representation of wave propagators is optimally sparse. Comm Pure Appl Math 58(11):1472–1528 15.
Candès EJ, Demanet L, Donoho DL, Ying L (2006) Fast discrete curvelet transforms. SIAM Multiscale Model Simul 5(3):861–899 16. Candès EJ, Donoho DL (1999) Ridgelets: the key to high dimensional intermittency? Philos Trans R Soc Lond A 357:2495–2509 17. Candès EJ, Donoho DL (1999) Curvelets – a surprisingly effective nonadaptive representation for objects with edges. In: Cohen A, Rabut C, Schumaker LL (eds) Curve and Surface Fitting: Saint-Malo. Vanderbilt University Press, Nashville 18. Candès EJ, Donoho DL (2000) Curvelets and curvilinear integrals. J Approx Theory 113:59–90 19. Candès EJ, Donoho DL (2000) Recovering edges in ill-posed inverse problems: Optimality of curvelet frames. Ann Stat 30:784–842 20. Candès EJ, Donoho DL (2002) New tight frames of curvelets and optimal representations of objects with piecewise-C2 singularities. Comm Pure Appl Math 57:219–266 21. Candès EJ, Ying L, Demanet L (2005) 3D discrete curvelet transform. In: Wavelets XI conf., San Diego. Proc SPIE, vol 5914, 591413; doi:10.1117/12.616205 22. Coifman RR, Maggioni M (2006) Diffusion wavelets. Appl Comput Harmon Anal 21:53–94
23. Daubechies I (1992) Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, Philadelphia 24. Daubechies I, Defrise M, De Mol C (2004) An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm Pure Appl Math 57:1413–1457 25. Demanet L, Ying L (2007) Curvelets and wave atoms for mirror-extended images. In: Wavelets XII conf., San Diego. Proc SPIE, vol 6701, 67010J; doi:10.1117/12.733257 26. Do MN, Vetterli M (2003) The finite ridgelet transform for image representation. IEEE Trans Image Process 12(1):16–28 27. Do MN, Vetterli M (2003) Contourlets. In: Stoeckler J, Welland GV (eds) Beyond Wavelets. Academic Press, San Diego 28. Donoho DL (1997) Fast ridgelet transforms in dimension 2. Stanford University, Stanford 29. Donoho DL (1998) Digital ridgelet transform via rectopolar coordinate transform. Technical Report, Stanford University 30. Donoho DL (1999) Wedgelets: nearly-minimax estimation of edges. Ann Stat 27:859–897 31. Donoho DL (2000) Orthonormal ridgelets and linear singularities. SIAM J Math Anal 31(5):1062–1099 32. Donoho DL, Duncan MR (2000) Digital curvelet transform: strategy, implementation and experiments. In: Szu HH, Vetterli M, Campbell W, Buss JR (eds) Proc. Aerosense, Wavelet Applications VII, vol 4056. SPIE, pp 12–29 33. Donoho DL, Flesia AG (2002) Digital ridgelet transform based on true ridge functions. In: Schmeidler J, Welland GV (eds) Beyond Wavelets. Academic Press, San Diego 34. Douma H, de Hoop MV (2007) Leading-order seismic imaging using curvelets. Geophys 72(6) 35. Durand S (2007) M-band filtering and nonredundant directional wavelets. Appl Comput Harmon Anal 22(1):124–139 36. Elad M, Aharon M (2006) Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans Image Process 15(12):3736–3745 37. Elad M, Starck JL, Donoho DL, Querre P (2006) Simultaneous cartoon and texture image inpainting using morphological component analysis (MCA).
Appl Comput Harmon Anal 19:340–358 38. Fadili MJ, Starck JL (2006) Sparse representation-based image deconvolution by iterative thresholding. In: Murtagh F, Starck JL (eds) Astronomical Data Analysis IV. Marseille 39. Fadili MJ, Starck JL, Murtagh F (2006) Inpainting and zooming using sparse representations. Comput J 40. Fernandes F, Wakin M, Baraniuk R (2004) Non-redundant, linear-phase, semi-orthogonal, directional complex wavelets. In: IEEE Conf. on Acoustics, Speech and Signal Processing, Montreal, Canada, vol 2, pp 953–956 41. Figueiredo M, Nowak R (2003) An EM algorithm for wavelet-based image restoration. IEEE Trans Image Process 12(8):906–916 42. Hennenfent G, Herrmann FJ (2006) Seismic denoising with nonuniformly sampled curvelets. IEEE Comput Sci Eng 8(3):16–25 43. Herrmann FJ, Moghaddam PP, Stolk CC (2008) Sparsity- and continuity-promoting seismic image recovery with curvelet frames. Appl Comput Harmon Anal 24(2):150–173 44. Jin J, Starck JL, Donoho DL, Aghanim N, Forni O (2005) Cosmological non-gaussian signatures detection: Comparison of statistical tests. EURASIP J 15:2470–2485 45. Julesz B (1962) Visual pattern discrimination. IRE Trans Inform Theory 8(2):84–92
46. Kingsbury NG (1998) The dual-tree complex wavelet transform: a new efficient tool for image restoration and enhancement. In: European Signal Processing Conference, Rhodes, Greece, pp 319–322 47. Labate D, Lim W-Q, Kutyniok G, Weiss G (2005) Sparse multidimensional representation using shearlets. In: Wavelets XI, vol 5914. SPIE, San Diego, pp 254–262 48. Lambert P, Pires S, Ballot J, García RA, Starck JL, Turck-Chièze S (2006) Curvelet analysis of asteroseismic data. I. Method description and application to simulated sun-like stars. Astron Astrophys 454:1021–1027 49. Le Pennec E, Mallat S (2000) Image processing with geometrical wavelets. In: International Conference on Image Processing, Thessaloniki, Greece 50. Le Pennec E, Mallat S (2005) Bandelet image approximation and compression. SIAM Multiscale Model Simul 4(3):992–1039 51. Lu Y, Do MN (2003) CRISP-contourlets: A critically sampled directional multiresolution image representation. In: Wavelets X, San Diego. Proc SPIE, vol 5207, pp 655–665 52. Mairal J, Elad M, Sapiro G (2008) Sparse representation for color image restoration. IEEE Trans Image Process 17(1):53–69 53. Mallat S (1998) A Wavelet Tour of Signal Processing. Academic Press, London 54. Mallat S (2008) Geometrical grouplets. Appl Comput Harmon Anal 55. Matus F, Flusser J (1993) Image representations via a finite Radon transform. IEEE Trans Pattern Anal Mach Intell 15(10):996–1006 56. Meyer FG, Coifman RR (1997) Brushlets: a tool for directional image analysis and image compression. Appl Comput Harmon Anal 4:147–187 57. Meyer Y (1993) Wavelets: Algorithms and Applications. SIAM, Philadelphia 58. Olshausen BA (2000) Sparse coding of time-varying natural images. In: Int. Conf. Independent Component Analysis and Blind Source Separation (ICA), Barcelona, pp 603–608 59. Olshausen BA, Field DJ (1996) Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature 381(6583):607–609 60.
Peyré G (2007) Non-negative sparse modeling of textures. In: SSVM. Lecture Notes in Computer Science. Springer, pp 628–639 61. Peyré G, Fadili MJ, Starck JL (2007) Learning adapted dictionaries for geometry and texture separation. In: Wavelets XII, San Diego. Proc SPIE, vol 6701, 67011T; doi:10.1117/12.731244 62. Peyré G, Mallat S (2007) A review of bandlet methods for geometrical image representation. Numer Algorithms 44(3):205–234 63. Rahman I, Drori I, Stodden VC, Donoho DL, Schröder P (2005) Multiscale representations for manifold-valued data. SIAM Multiscale Model Simul 4(4):1201–1232
64. Saevarsson B, Sveinsson J, Benediktsson J (2003) Speckle reduction of SAR images using adaptive curvelet domain. In: Proceedings of the IEEE International Conference on Geoscience and Remote Sensing Symposium, IGARSS ’03, vol 6, pp 4083–4085 65. Simoncelli EP, Freeman WT, Adelson EH, Heeger DJ (1992) Shiftable multi-scale transforms [or what’s wrong with orthonormal wavelets]. IEEE Trans Inf Theory 38(2):587–607 66. Starck JL, Abrial P, Moudden Y, Nguyen M (2006) Wavelets, ridgelets and curvelets on the sphere. Astron Astrophys 446:1191–1204 67. Starck JL, Bijaoui A, Lopez B, Perrier C (1994) Image reconstruction by the wavelet transform applied to aperture synthesis. Astron Astrophys 283:349–360 68. Starck JL, Candès EJ, Donoho DL (2002) The curvelet transform for image denoising. IEEE Trans Image Process 11(6):670–684 69. Starck JL, Candès EJ, Donoho DL (2003) Astronomical image representation by the curvelet transform. Astron Astrophys 398:785–800 70. Starck JL, Elad M, Donoho DL (2004) Redundant multiscale transforms and their application for morphological component analysis. In: Hawkes P (ed) Advances in Imaging and Electron Physics, vol 132. Academic Press, San Diego, London, pp 288–348 71. Starck JL, Elad M, Donoho DL (2005) Image decomposition via the combination of sparse representation and a variational approach. IEEE Trans Image Process 14(10):1570–1582 72. Starck JL, Murtagh F, Bijaoui A (1998) Image Processing and Data Analysis: The Multiscale Approach. Cambridge University Press, Cambridge 73. Starck JL, Murtagh F, Candès E, Donoho DL (2003) Gray and color image contrast enhancement by the curvelet transform. IEEE Trans Image Process 12(6):706–717 74. Starck JL, Nguyen MK, Murtagh F (2003) Wavelets and curvelets for image deconvolution: a combined approach. Signal Process 83(10):2279–2283 75. The Curvelab Toolbox (2005) http://www.curvelet.org 76.
Velisavljevic V, Beferull-Lozano B, Vetterli M, Dragotti PL (2006) Directionlets: Anisotropic multi-directional representation with separable filtering. IEEE Trans Image Process 15(7):1916–1933 77. Ying L, Demanet L (2007) Wave atoms and sparsity of oscillatory patterns. Appl Comput Harmon Anal 23(3):368– 387 78. Zhang Z, Huang W, Zhang J, Yu H, Lu Y (2006) Digital image watermark algorithm in the curvelet domain. In: Proceedings of the International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP’06), Pasadena. pp 105–108
Data and Dimensionality Reduction in Data Analysis and System Modeling
WITOLD PEDRYCZ1,2
1 Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada
2 Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Article Outline
Glossary
Definition of the Subject
Introduction
Information Granularity and Granular Computing
Data Reduction
Dimensionality Reduction
Co-joint Data and Dimensionality Reduction
Conclusions
Future Directions
Acknowledgments
Appendix: Particle Swarm Optimization (PSO)
Bibliography

Glossary
Reduction process A suite of activities leading to the reduction of available data and/or reduction of features.
Feature selection An algorithmic process in which a large set of features (attributes) is reduced by choosing a relatively small subset of them. The reduction is a combinatorial optimization task which is NP-complete; given this, it is quite often realized in a suboptimal way.
Feature transformation A process of transforming a highly dimensional feature space into a low-dimensional counterpart. These transformations are linear or nonlinear and can be guided by some optimization criterion. Commonly encountered methods utilize Principal Component Analysis (PCA), which is an example of a linear feature transformation.
Data reduction A way of converting a large data set into a representative subset. Typically, the data are grouped into clusters whose prototypes are representatives of the overall data set.
Curse of dimensionality A phenomenon of a rapid exponential increase of computing related to the dimensionality of the problem (say, the number of data or the number of features) which prevents us from achieving
an optimal solution. The curse of dimensionality leads to the construction of sub-optimal solutions.
Data mining A host of activities aimed at the discovery of easily interpretable and experimentally sound findings in huge data sets.
Biologically-inspired optimization An array of optimization techniques realizing searches in highly dimensional spaces, where the search itself is guided by a collection of mechanisms (operators) inspired by biological search processes. Genetic algorithms, evolutionary methods, particle swarm optimization, and ant colonies are examples of biologically-inspired search techniques.

Definition of the Subject
Data and dimensionality reduction are fundamental pursuits of data analysis and system modeling. With the rapid growth of the sizes of data sets and the diversity of the data themselves, the use of some reduction mechanisms becomes a necessity. Data reduction is concerned with reducing the size of data sets in terms of the number of data points. It helps reveal an underlying structure in the data by presenting a collection of the groups present in them. Given that the number of groups is very limited, clustering mechanisms become effective in terms of data reduction. Dimensionality reduction is aimed at reducing the number of attributes (features) of the data, which leads to a typically small subset of features or brings the data from a highly dimensional feature space to a new one of far lower dimensionality. A joint reduction process involves both data and feature reduction.

Introduction
In the information age, we are continuously flooded by enormous amounts of data that need to be stored, transmitted, processed and understood. We expect to make sense of data and find general relationships within them. On the basis of data, we anticipate constructing meaningful models (classifiers or predictors).
Faced with this ongoing quest, there is intensive research along this line, with a clear-cut objective: to design effective and computationally feasible algorithms that combat the curse of dimensionality associated with the data. Data mining has become one of the dominant developments in data analysis. The term intelligent data analysis denotes another notion which provides us with a general framework supporting thorough and user-oriented processes of data analysis, in which reduction processes play a pivotal role. Interestingly enough, the problem of dimensionality reduction and complexity management is by no means
Data and Dimensionality Reduction in Data Analysis and System Modeling, Figure 1 A general roadmap of reduction processes; note various ways of dealing with data and dimensionality along with a way of providing some evaluation mechanisms (performance indexes)
a new endeavor. It has been around for a number of decades, almost from the very inception of computer science. One may allude here to pattern recognition and data visualization as two areas in which we were faced with inherent high data dimensionality. This has led to a number of techniques which by now are regarded as classic and are used quite intensively. There have been a number of approaches deeply rooted in classic statistical analysis. The ideas of principal component analysis, Fisher analysis and the like are techniques of paramount relevance. What has changed quite profoundly over the decades is the magnitude of the problem itself, which has forced us to explore new ideas and optimization techniques involving advanced techniques of global search, including tabu search and biologically-inspired optimization mechanisms. In a nutshell, we can distinguish between reduction processes involving (a) data and (b) features (attributes). Data reduction is concerned with grouping data and revealing their structure in the form of clusters (groups). Clustering is regarded as one of the fundamental techniques within the domain of data reduction. Typically, we start with thousands of data points and arrive at 10–15 clusters. Feature or attribute reduction deals with (a) a transformation of the feature space into another feature space of far lower dimensionality or (b) a selection of a subset of features regarded as the most essential with respect to a certain predefined objective function. Considering the underlying techniques of feature transformation, we encounter a number of classic linear statistical techniques, such as, e.g., principal component analysis, or
more advanced nonlinear mapping mechanisms realized by, e.g., neural networks. The criteria used to assess the quality of the resulting (reduced) feature space give rise to two general categories, namely filters and wrappers. Using filters, we consider some criterion that pertains to the statistical characteristics of the selected attributes and evaluate the attributes in this respect. In contrast, when dealing with wrappers, we are concerned with the effectiveness of the features as a vehicle to carry out classification, so in essence there is a mechanism (e.g., a certain classifier) which evaluates the performance of the selected features with respect to their discriminating capabilities. In addition to feature and data reduction regarded as two separate processes, we may consider their combinations. A general roadmap of the reduction processes is outlined in Fig. 1. While the most essential components have already been described, we note that all reduction processes are guided by various criteria. The reduction activities are established in some formal frameworks of information granules. In this study, we will start with the idea of information granularity and information granules, Sect. "Information Granularity and Granular Computing", in which we demonstrate the fundamental role of this concept, which leads to sound foundations and helps establish the machinery of data and feature reduction. We then move to data reduction (Sect. "Data Reduction"), where we highlight the role of clustering and fuzzy clustering as a processing vehicle leading to the discovery of structural relationships within the data. We also cover several key measures
used in the evaluation of the reduction process, offering some thoughts on fuzzy quantization, which brings a quantitative characterization of the reconstruction process. In this setting, we show that the tandem of encoding and decoding is guided by well-grounded optimization criteria. Dimensionality reduction is covered in Sect. "Dimensionality Reduction". Here we start with linear and nonlinear transformations of the feature space, including standard methods such as the well-known Principal Component Analysis (PCA). In the sequel, we stress the way in which biologically inspired optimization leads to the formation of an optimal subset of features. Some mechanisms of co-joint data and dimensionality reduction are outlined in Sect. "Co-joint Data and Dimensionality Reduction", in which we discuss the role of biclustering in these reduction problems. In this paper, we adhere to standard notation. Individual data are treated as n-dimensional vectors of real numbers, say x_1, x_2, etc. We consider a collection of N data points which could be arranged in matrix form, with the data occupying consecutive rows of the matrix. Furthermore, we will use the terms attribute and feature interchangeably.

Information Granularity and Granular Computing
Information granules permeate numerous human endeavors [1,19,20,22,23,24,25,26]. No matter what problem is taken into consideration, we usually express it in a certain conceptual framework of basic entities, which we regard to be of relevance to the problem formulation and problem solving. This becomes a framework in which we formulate generic concepts adhering to some level of abstraction, carry out processing, and communicate the results to the external environment. Consider, for instance, image processing. In spite of the continuous progress in the area, a human being assumes a dominant and very much uncontested position when it comes to understanding and interpreting images.
Surely, we do not focus our attention on individual pixels and process them as such, but group them together into semantically meaningful constructs – familiar objects we deal with in everyday life. Such objects involve regions that consist of pixels or categories of pixels drawn together because of their proximity in the image, similar texture, color, etc. This remarkable and unchallenged ability of humans dwells on our effortless capacity to construct information granules, manipulate them and arrive at sound conclusions. As another example, consider a collection of time series. From our perspective, we can describe them in a semi-qualitative manner by pointing at specific regions of such signals. Specialists can effortlessly interpret ECG signals: they distinguish some segments of such signals and interpret their combinations. Experts can interpret temporal readings of sensors and assess the status of the monitored system. Again, in all these situations, the individual samples of the signals are not the focal point of the analysis and the ensuing signal interpretation. We always granulate all phenomena (no matter whether they are originally discrete or analog in their nature). Time is another important variable that is subjected to granulation. We use seconds, minutes, days, months, and years. Depending on which specific problem we have in mind and who the user is, the size of the information granules (time intervals) can vary quite dramatically. To high-level management, time intervals of quarters of a year or a few years could be meaningful temporal information granules on the basis of which one develops a predictive model. For those in charge of the everyday operation of a dispatching plant, minutes and hours could form a viable scale of time granulation. For the designer of high-speed integrated circuits and digital systems, the temporal information granules concern nanoseconds, microseconds, and perhaps even finer time units. Even such commonly encountered and simple examples are convincing enough to lead us to ascertain that (a) information granules are the key components of knowledge representation and processing, (b) the level of granularity of information granules (their size, to be more descriptive) becomes crucial to the problem description and an overall strategy of problem solving, and (c) there is no universal level of granularity of information; the size of granules is problem-oriented and user dependent. What has been said so far touches on a qualitative aspect of the problem. The challenge is to develop a computing framework within which all these representation and processing endeavors could be formally realized. The common platform emerging within this context comes under the name of Granular Computing.
In essence, it is an emerging paradigm of information processing. While we have already noticed a number of important conceptual and computational constructs built in the domains of system modeling, machine learning, image processing, pattern recognition, and data compression, in which various abstractions (and the ensuing information granules) came into existence, Granular Computing becomes innovative and intellectually proactive in several fundamental ways. It identifies the essential commonalities between surprisingly diversified problems and technologies, which could be cast into a unified framework usually referred to as a granular world. This is a fully operational processing entity that interacts with the external world (which could be another granular or numeric world) by collecting necessary granular information and returning the outcomes of the granular computing. With the emergence of this unified framework of granular processing, we get a better grasp of the role of interaction between various formalisms and can visualize the way in which they communicate. It brings together the existing formalisms of set theory (interval analysis) [8,10,15,21], fuzzy sets [6,26], rough sets [16,17,18], etc. under the same roof, making it clear that in spite of their visibly distinct underpinnings (and ensuing processing), they exhibit some fundamental commonalities. In this sense, Granular Computing establishes a stimulating environment of synergy between the individual approaches. By building upon the commonalities of the existing formal approaches, Granular Computing helps build heterogeneous and multifaceted models of processing of information granules by clearly recognizing the orthogonal nature of some of the existing and well-established frameworks (say, probability theory, coming with its probability density functions, and fuzzy sets, with their membership functions). Granular Computing fully acknowledges a notion of variable granularity, whose range could cover detailed numeric entities as well as very abstract and general information granules. It looks at the aspects of compatibility of such information granules and the ensuing communication mechanisms of the granular worlds. Interestingly, the inception of information granules is highly motivated: we do not form information granules without reason. Information granules arise as an evident realization of the fundamental paradigm of abstraction.
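To make the distinction between the set-theoretic and fuzzy formalisms of information granules concrete, the following minimal Python sketch (all names illustrative) contrasts a crisp interval granule with a triangular fuzzy granule for the notion "about half an hour", measured in minutes:

```python
def interval_granule(a, b):
    """Set-theoretic granule: the characteristic function of [a, b];
    membership is all-or-nothing."""
    return lambda x: 1.0 if a <= x <= b else 0.0

def triangular_granule(a, m, b):
    """Fuzzy granule: triangular membership function with modal value m
    and support (a, b); membership degrades gradually toward the edges."""
    def mu(x):
        if x <= a or x >= b:
            return 0.0
        return (x - a) / (m - a) if x <= m else (b - x) / (b - m)
    return mu

# "about half an hour", in minutes
crisp = interval_granule(25, 35)
fuzzy = triangular_granule(15, 30, 45)
crisp(28)   # 1.0  -- full membership
fuzzy(28)   # 0.8667 -- partial membership
fuzzy(40)   # 0.3333
```

The crisp granule draws a hard boundary at 25 and 35 minutes; the fuzzy one admits elements to a degree, which is exactly the property exploited by fuzzy clustering later in this article.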
Granular Computing forms a unified conceptual and computing platform. Yet, it directly benefits from the already existing and well-established concepts of information granules formed in the setting of set theory, fuzzy sets,
rough sets and others. The selection of a certain formalism from this list depends upon the problem at hand. While in dimensionality reduction set theory becomes more visible, in data reduction we might encounter both set theory and fuzzy sets.

Data Reduction
In data reduction, we transform large collections of data into a limited, quite small family of their representatives which capture the underlying structure (topology). Clustering has become one of the fundamental tools used in this setting. Depending upon the distribution of the data, we may anticipate here the use of set theory or fuzzy sets. As an illustration, consider the two-dimensional data portrayed in Fig. 2. The data in Fig. 2a exhibit a clearly delineated structure with three well-separated clusters. Each cluster can be formalized as a set of elements. The situation illustrated in Fig. 2b is very different: while some clusters are visible, they overlap to some extent and a number of points are "shared" by several clusters. In other words, some data may belong to more than a single cluster, hence the emergence of fuzzy sets as a formal setting in which we can formalize the resulting information granules. In what follows, we briefly review the essence of fuzzy clustering and then move on to the evaluation of the quality of the results.

Fuzzy C-Means as an Algorithmic Vehicle of Data Reduction Through Fuzzy Clusters
Fuzzy sets can be formed on the basis of numeric data through their clustering (grouping). The groups of data give rise to membership functions that convey a global, more abstract view of the available data. In this regard, Fuzzy C-Means (FCM) is one of the commonly used mechanisms of fuzzy clustering [2,3,20]. Let us review its formulation, develop the algorithm and highlight the main properties of the fuzzy clusters.
Data and Dimensionality Reduction in Data Analysis and System Modeling, Figure 2 Data clustering and underlying formalism of set theory suitable for handling well-delineated structures of data (a) and the use of fuzzy sets in capturing the essence of data with significant overlap (b)
Given a collection of n-dimensional data $\{x_k\}$, $k = 1, 2, \ldots, N$, the task of determining its structure – a collection of “c” clusters – is expressed as the minimization of the following objective function (performance index) Q, regarded as a sum of the squared distances between the data and their representatives (prototypes):

$$Q = \sum_{i=1}^{c}\sum_{k=1}^{N} u_{ik}^{m}\,\|x_k - v_i\|^2 . \qquad (1)$$

Here the $v_i$'s are the n-dimensional prototypes of the clusters, $i = 1, 2, \ldots, c$, and $U = [u_{ik}]$ stands for a partition matrix expressing the allocation of the data to the corresponding clusters; $u_{ik}$ is the membership degree of datum $x_k$ in the ith cluster. The distance between datum $x_k$ and prototype $v_i$ is denoted by $\|\cdot\|$. The fuzzification coefficient $m$ ($> 1.0$) expresses the impact of the membership grades on the individual clusters. A partition matrix satisfies two important and intuitively appealing properties:

$$\text{(a)}\;\; 0 < \sum_{k=1}^{N} u_{ik} < N, \;\; i = 1, 2, \ldots, c; \qquad \text{(b)}\;\; \sum_{i=1}^{c} u_{ik} = 1, \;\; k = 1, 2, \ldots, N . \qquad (2)$$

Let us denote by $\mathcal{U}$ the family of matrices satisfying (a)–(b). The first requirement states that each cluster has to be nonempty and different from the entire set. The second requirement states that, for each data point, the membership grades must sum to 1. The minimization of Q is completed with respect to $U \in \mathcal{U}$ and the prototypes $v_i$ of $V = \{v_1, v_2, \ldots, v_c\}$ of the clusters. More explicitly, we write it down as follows:

$$\min Q \quad \text{with respect to} \quad U \in \mathcal{U},\; v_1, v_2, \ldots, v_c \in \mathbb{R}^{n} . \qquad (3)$$

From the optimization standpoint, there are two individual optimization tasks to be carried out separately for the partition matrix and the prototypes. The first one concerns the minimization with respect to the constraints given the requirement of the form (2b), which holds for each data point $x_k$. The use of Lagrange multipliers converts the problem into its constraint-free version. The augmented objective function formulated for each data point, $k = 1, 2, \ldots, N$, reads as

$$V = \sum_{i=1}^{c} u_{ik}^{m} d_{ik}^{2} + \lambda\left(\sum_{i=1}^{c} u_{ik} - 1\right) \qquad (4)$$

where $d_{ik}^{2} = \|x_k - v_i\|^2$. It is now instructive to go over the optimization process. Proceeding with the necessary conditions for the minimum of V for $k = 1, 2, \ldots, N$, one has

$$\frac{\partial V}{\partial u_{st}} = 0, \qquad \frac{\partial V}{\partial \lambda} = 0 \qquad (5)$$

$s = 1, 2, \ldots, c$; $t = 1, 2, \ldots, N$. Now we calculate the derivative of V with respect to the elements of the partition matrix in the following way:

$$\frac{\partial V}{\partial u_{st}} = m\,u_{st}^{m-1} d_{st}^{2} + \lambda . \qquad (6)$$

Given this relationship, using (5) we calculate $u_{st}$ to be equal to

$$u_{st} = \left(-\frac{\lambda}{m}\right)^{\frac{1}{m-1}} d_{st}^{-\frac{2}{m-1}} . \qquad (7)$$

Given the normalization condition $\sum_{j=1}^{c} u_{jt} = 1$ and plugging (7) into it, one has

$$\left(-\frac{\lambda}{m}\right)^{\frac{1}{m-1}} \sum_{j=1}^{c} d_{jt}^{-\frac{2}{m-1}} = 1 \qquad (8)$$

so we compute

$$\left(-\frac{\lambda}{m}\right)^{\frac{1}{m-1}} = \frac{1}{\sum_{j=1}^{c} d_{jt}^{-\frac{2}{m-1}}} . \qquad (9)$$

Inserting this expression into (7), we obtain the successive entries of the partition matrix:

$$u_{st} = \frac{1}{\sum_{j=1}^{c} \left(\dfrac{d_{st}^{2}}{d_{jt}^{2}}\right)^{\frac{1}{m-1}}} . \qquad (10)$$

The optimization of the prototypes $v_i$ is carried out assuming the Euclidean distance between the data and the prototypes, that is, $\|x_k - v_i\|^2 = \sum_{j=1}^{n} (x_{kj} - v_{ij})^2$. The objective function now reads $Q = \sum_{i=1}^{c}\sum_{k=1}^{N} u_{ik}^{m} \sum_{j=1}^{n} (x_{kj} - v_{ij})^2$, and setting its gradient with respect to $v_i$, $\nabla_{v_i} Q$, equal to zero yields the system of linear equations

$$\sum_{k=1}^{N} u_{sk}^{m} (x_{kt} - v_{st}) = 0, \qquad s = 1, 2, \ldots, c;\; t = 1, 2, \ldots, n . \qquad (11)$$

Thus

$$v_{st} = \frac{\sum_{k=1}^{N} u_{sk}^{m} x_{kt}}{\sum_{k=1}^{N} u_{sk}^{m}} . \qquad (12)$$
Overall, the FCM clustering is completed through a sequence of iterations: we start from some random allocation of data (a randomly initialized partition matrix) and carry out updates by adjusting the values of the partition matrix and the prototypes. The iterative process continues until a certain termination criterion has been satisfied. Typically, the termination condition is quantified by looking at the changes in the membership values of successive partition matrices. Denote by $U(t)$ and $U(t+1)$ the two partition matrices produced in two consecutive iterations of the algorithm. If the distance $\|U(t+1) - U(t)\|$ is less than a small predefined threshold $\varepsilon$ (say, $\varepsilon = 10^{-5}$ or $10^{-6}$), then we terminate the algorithm. Typically, one considers the Chebyshev distance between the partition matrices, meaning that the termination criterion reads as follows:

$$\max_{i,k} |u_{ik}(t+1) - u_{ik}(t)| \le \varepsilon . \qquad (13)$$
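The iterative scheme built from the updates (10) and (12) with the termination rule (13) can be sketched in plain Python as follows (a minimal illustration, not part of the original text; the function and variable names are ours, and a small guard against zero distances is added):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def fcm(data, c, m=2.0, eps=1e-5, max_iter=200, seed=0):
    """Fuzzy C-Means: returns (partition matrix u[i][k], prototypes v[i])."""
    rng = random.Random(seed)
    N, n = len(data), len(data[0])
    # randomly initialized partition matrix, columns normalized per (2b)
    u = [[rng.random() for _ in range(N)] for _ in range(c)]
    for k in range(N):
        s = sum(u[i][k] for i in range(c))
        for i in range(c):
            u[i][k] /= s
    v = [list(data[k]) for k in range(c)]
    for _ in range(max_iter):
        # prototype update, Eq. (12): weighted mean with weights u_ik^m
        for i in range(c):
            w = [u[i][k] ** m for k in range(N)]
            sw = sum(w)
            v[i] = [sum(w[k] * data[k][j] for k in range(N)) / sw
                    for j in range(n)]
        # partition-matrix update, Eq. (10)
        new_u = [[0.0] * N for _ in range(c)]
        for k in range(N):
            d2 = [max(dist2(data[k], v[i]), 1e-12) for i in range(c)]
            for i in range(c):
                new_u[i][k] = 1.0 / sum((d2[i] / d2[j]) ** (1.0 / (m - 1))
                                        for j in range(c))
        # Chebyshev-distance termination, Eq. (13)
        delta = max(abs(new_u[i][k] - u[i][k])
                    for i in range(c) for k in range(N))
        u = new_u
        if delta < eps:
            break
    return u, v
```

For two well-separated groups of points, the membership grades approach 0 and 1 and the prototypes settle near the group centers, mirroring the set-like case of Fig. 2a.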
The key components of the FCM and a quantification of their impact on the form of the produced results are summarized in Table 1. The fuzzification coefficient exhibits a direct impact on the geometry of fuzzy sets generated by the algorithm. Typically, the value of “m” is assumed to be equal to 2.0. Lower values of m (that are closer to 1) yield membership functions that start resembling characteristic functions of sets; most of the membership values become localized around 1 or 0. The increase of the fuzzification
coefficient (m = 3, 4, etc.) produces “spiky” membership functions, with membership grades equal to 1 at the prototypes and a fast decline of the values when moving away from the prototypes. Several illustrative examples of the membership functions are included in Fig. 3. In addition to the varying shape of the membership functions, observe that the requirement put on the sum of the membership grades yields some rippling effect: the membership functions are not unimodal but may exhibit ripples whose intensity depends upon the distribution of the prototypes and the values of the fuzzification coefficient. The membership functions offer an interesting means of evaluating the extent to which a certain data point is shared between different clusters and, in this sense, becomes difficult to allocate to a single cluster (fuzzy set). Let us introduce the following index, which serves as a separation measure between the clusters:

$$\varphi(u_1, u_2, \ldots, u_c) = 1 - c^c \prod_{i=1}^{c} u_i \qquad (14)$$
where $u_1, u_2, \ldots, u_c$ are the membership degrees of some data point. If only one of the membership degrees, say $u_i$, equals 1 and the remaining ones are equal to zero, then the separation index attains its maximum of 1. At the other extreme, when the data point is shared by all clusters to the same degree, equal to 1/c, the value of the index drops down to zero; there is then no separation between the clusters as reported for this specific point. While the number of clusters is typically limited to a few information granules, we can easily proceed with successive refinements of fuzzy sets. This can be done by splitting the fuzzy clusters of the highest heterogeneity. Let us assume that we have already constructed “c” fuzzy clusters.
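The separation measure (14) is easy to compute directly; the sketch below (an illustration of ours, not from the original text) shows its two extreme cases:

```python
def separation(u):
    """Separation measure of Eq. (14): 1 - c^c * prod(u_i)."""
    c = len(u)
    p = 1.0
    for ui in u:
        p *= ui  # product of membership degrees of one data point
    return 1.0 - (c ** c) * p
```

A point fully committed to one cluster, e.g. `separation([1.0, 0.0, 0.0])`, gives 1 (full separation), while an evenly shared point, `separation([1/3, 1/3, 1/3])`, gives 0.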
Data and Dimensionality Reduction in Data Analysis and System Modeling, Table 1 The main features of the Fuzzy C-Means (FCM) clustering algorithm

Feature of the FCM algorithm – Representation and optimization aspects
Number of clusters (c) – Structure in the data set and the number of fuzzy sets estimated by the method; increasing the number of clusters produces lower values of the objective function; however, given the semantics of fuzzy sets, one should keep this number quite low (5–9 information granules)
Objective function Q – Develops the structure aimed at the minimization of Q; the iterative process supports the determination of a local minimum of Q
Distance function ||·|| – Reflects (or imposes) the geometry of the clusters one is looking for; an essential design parameter affecting the shape of the membership functions
Fuzzification coefficient (m) – Implies a certain shape of the membership functions present in the partition matrix; an essential design parameter. Low values of “m” (close to 1.0) induce characteristic functions; values higher than 2.0 yield spiky membership functions
Termination criterion – Distance between partition matrices in two successive iterations; the algorithm is terminated once the distance falls below some assumed positive threshold ε, that is, ||U(iter+1) − U(iter)|| < ε
Data and Dimensionality Reduction in Data Analysis and System Modeling, Figure 3 Examples of membership functions of fuzzy sets; the prototypes are equal to 1, 3.5 and 5 while the fuzzification coefficient assumes values of 1.2 (a), 2.0 (b) and 3.5 (c). The intensity of the rippling effect is affected by the values of “m” and increases with the higher values of “m”
Each of them can be characterized by the performance index

$$V_i = \sum_{k=1}^{N} u_{ik}^{m}\,\|x_k - v_i\|^2 \qquad (15)$$

$i = 1, 2, \ldots, c$. The higher the value of $V_i$, the more heterogeneous the ith cluster. The one with the highest value of $V_i$, that is, the one for which $i_0 = \arg\max_i V_i$, is refined by being split into two clusters. Denote the set of data associated with the $i_0$th cluster by $X(i_0)$:

$$X(i_0) = \{x_k \in X \mid u_{i_0 k} = \max_i u_{ik}\} . \qquad (16)$$
We cluster the elements in $X(i_0)$ by forming two clusters, which leads to two more specific (detailed) fuzzy sets. This gives rise to a hierarchical structure of the family of fuzzy sets, as illustrated in Fig. 4. The relevance of this construct in the setting of fuzzy sets is that it emphasizes the essence of forming a hierarchy of fuzzy sets rather than working with a single-level structure of a large number of components whose semantics could not be retained. The process of further refinements is realized in the same way by picking the cluster of the highest heterogeneity and splitting it into two consecutive clusters.

Evaluation of Performance of Data Reduction

There are two fundamental ways of evaluating the quality of data reduction achieved through clustering or fuzzy clustering. In the first category of methods, we assess the quality of the obtained clusters in terms of their compactness, low fuzziness (the entropy measure of fuzziness, in particular), structural separability, and the like. We also look for
Data and Dimensionality Reduction in Data Analysis and System Modeling, Figure 4 Successive refinements of fuzzy sets through fuzzy clustering applied to the clusters of the highest heterogeneity. The numbers indicate the order of the splits
a “plausible” number of clusters which can be descriptive of the structure of the overall data. The second category of methods stresses the aspects of data representation by its reduced counterpart. In this case, a reconstruction criterion plays a pivotal role. We elaborate on this method by looking at it as a useful vehicle of fuzzy quantization. Furthermore, we contrast fuzzy quantization with its set-based counterpart.

Exploring Fuzzy Quantization: Moving Beyond a Winner-Takes-All Scheme

The crux of the considerations here dwells upon the following observation. While the set-based codebook devel-
oped through clustering [5,7] leads to a decoding scheme which decodes the result using a single element of the codebook (in essence, a manifestation of the well-known winner-takes-all strategy), here we are interested in exploiting the nonzero membership degrees of several elements (fuzzy sets) of the codebook when representing the input datum. In other words, rather than using a single prototype as the sole representative of a collection of neighboring data, our conjecture is that involving several prototypes at different degrees of activation (weights) could be beneficial to the ensuing decoding. Along the line of abandoning the winner-takes-all principle, let us start with a collection of weights $u_i(x)$, $i = 1, 2, \ldots, c$. Adhering to the vector notation, we have $u(x) = [u_1(x)\; u_2(x)\; \ldots\; u_c(x)]^T$. The weights (membership degrees) express the extent to which the corresponding datum x, encoded in the language of the given prototypes (elements of the codebook), should involve each of them in the decoding (decompression) scheme. We require that these membership degrees are in-between 0 and 1 and sum up to 1. Thus, at the encoding end of the overall scheme, we represent each vector x by c − 1 values of $u_i(x)$. The decoding is then based upon a suitable aggregation of the degrees of membership and the prototypes. Denote this operation by $\hat{x} = D(u(x); v_1, v_2, \ldots, v_c)$, where $\hat{x}$ denotes the result of the decoding. On the other hand, the formation of the membership degrees (encoding) can be succinctly described in the form of the mapping $E(x; v_1, v_2, \ldots, v_c)$.
The overall development process is split into two fundamental phases, namely (a) local optimization activities confined to the individual datum $x_k$ that involve (i) encoding each $x_k$, leading to the vector of membership degrees $(u_i)$, and (ii) decoding realized in terms of the membership degrees; and (b) global optimization activities concerning the formation of the codebook, in which case we take all data into consideration, thus bringing the design to the global level. The overall process described above will be referred to as fuzzy vector quantization (FVQ) or fuzzy granulation-degranulation. The latter terminology emphasizes the role fuzzy sets play in this construct. In contrast, the winner-takes-all strategy is a cornerstone of vector quantization (VQ). To contrast the underlying computing in the VQ and FVQ schemes, we portray the essential computational facets in Fig. 5. We focus on the detailed formulas expressing the coding and decoding schemes. Let us emphasize that the cod-
Data and Dimensionality Reduction in Data Analysis and System Modeling, Figure 5 VQ and FVQ – a view contrasting the essence of the process and showing the key elements of the ensuing encoding and decoding
ing and decoding emerge as solutions to well-defined optimization problems. At this point, let us assume that the codebook, denoted here as $\{v_1, v_2, \ldots, v_c\}$, has already been formed and become available.

Encoding Mechanism

The encoding (representation) of the original datum x is done through the collection of the degrees of activation of the elements of the codebook. We require that the membership degrees are confined to the unit interval and sum up to 1. We determine their individual values by minimizing the following performance index

$$Q_1(x) = \sum_{i=1}^{c} u_i^{m}\,\|x - v_i\|^2 \qquad (17)$$

subject to the following constraints already stated above, that is,

$$u_i(x) \in [0, 1], \qquad \sum_{i=1}^{c} u_i(x) = 1 . \qquad (18)$$
The distance function is denoted by $\|\cdot\|^2$. The fuzzification coefficient ($m > 1$) appearing in the above expression is used to adjust the level of impact of the prototypes on the result of the encoding. The collection of “c” weights $\{u_i(x)\}$ is then used to encode the input datum x. These membership degrees, along with the corresponding prototypes, are afterwards used in the decoding scheme.
Data and Dimensionality Reduction in Data Analysis and System Modeling, Figure 6 Plots of u_1(x) (3D and contour plots) for selected values of the fuzzification coefficient: (a) m = 1.2, (b) m = 2.0, (c) m = 3.5
The minimization of (17) is straightforward and follows a standard way of transforming the problem into unconstrained optimization using Lagrange multipliers. Once solved, the resulting weights (membership degrees) read as

$$u_i(x) = \frac{1}{\sum_{j=1}^{c} \left(\dfrac{\|x - v_i\|}{\|x - v_j\|}\right)^{\frac{2}{m-1}}} . \qquad (19)$$
The fuzzification coefficient used in the formulation of the encoding problem plays an important role. To demonstrate this, let us show how the values of the coefficient impact the distribution of the generated membership degrees. The plots of these degrees in the two-dimensional case (n = 2) with c = 3 elements of the codebook are included in Fig. 6. The prototypes have been set to $v_1 = [3.0\; 1.5]^T$, $v_2 = [2.5\; 0.8]^T$, $v_3 = [1.0\; 3.7]^T$. Values of “m” close to 1 result in a visibly set-like character of the distribution of the membership degrees.

The Decoding Mechanism

The decoding is concerned with the “reconstruction” of x, denoted here by $\hat{x}$. It is based on some aggregation of the elements of the codebook and the associated membership grades u(x). The proposed way of forming $\hat{x}$ is accomplished through the minimization of the following expression:

$$Q_2(\hat{x}) = \sum_{i=1}^{c} u_i^{m}\,\|\hat{x} - v_i\|^2 . \qquad (20)$$
Given the Euclidean distance, the problem of unconstrained optimization leads to a straightforward solution expressed as a combination of the prototypes weighted by the membership degrees, that is,

$$\hat{x} = \frac{\sum_{i=1}^{c} u_i^{m} v_i}{\sum_{i=1}^{c} u_i^{m}} . \qquad (21)$$
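The encoding formula (19) and the decoding formula (21) can be sketched together as follows (a minimal illustration of ours, with a small guard against a datum coinciding exactly with a prototype):

```python
def encode(x, protos, m=2.0):
    """Membership degrees u_i(x) of Eq. (19)."""
    c = len(protos)
    # squared distances to all prototypes (guarded away from zero)
    d2 = [max(sum((xj - vj) ** 2 for xj, vj in zip(x, v)), 1e-12)
          for v in protos]
    return [1.0 / sum((d2[i] / d2[j]) ** (1.0 / (m - 1)) for j in range(c))
            for i in range(c)]

def decode(u, protos, m=2.0):
    """Reconstruction x_hat of Eq. (21): a weighted mix of ALL prototypes."""
    w = [ui ** m for ui in u]
    sw = sum(w)
    n = len(protos[0])
    return [sum(w[i] * protos[i][j] for i in range(len(protos))) / sw
            for j in range(n)]
```

In contrast to the winner-takes-all decoder, every prototype contributes to `decode`; a datum lying on a prototype is reconstructed essentially exactly, while intermediate points are blended.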
Interestingly, all prototypes contribute to the decoding process, and this way of computing stands in sharp contrast with the winner-takes-all decoding scheme encountered in VQ, where $\hat{x} = v_l$, with l denoting the index of the winning prototype identified during the decoding phase. Some other VQ decoding schemes are available in which several dominant elements of the codebook are involved. Nevertheless, because of their binary involvement in the decoding phase, the possible ways of aggregation to be realized by the decoder are quite limited. The quality of the reconstruction depends on a number of essential parameters, including the size of the codebook as well as the value of the fuzzification coefficient. The impact of these particular parameters is illustrated in Fig. 7, in which we quantify the distribution of the decoding error by showing values of the Hamming distance between x and $\hat{x}$. Evidently, and as could have been anticipated, all x's positioned in a close vicinity of the prototypes exhibit low values of the decoding error. Low values of “m” lead to significant jumps of the error, which result from the abrupt switching between the prototypes used in the decoding process. For a higher value of the fuzzification coefficient, say m = 2.0, the jumps are significantly reduced. Further reduction is achieved with larger values of “m”, see Fig. 7c. Obviously, in regions that are quite remote from the location of the prototypes, the error increases quite quickly. For instance, this effect is quite visible for values of x close to $[4.0\; 4.0]^T$.

Dimensionality Reduction

As indicated earlier, dimensionality reduction is concerned with a reduction of the set of features. One of the two fundamental directions taken here involves transformations of the original, highly dimensional feature vectors. The other group of methods focuses on the selection of an “optimal” subset of the original attributes.
Data and Dimensionality Reduction in Data Analysis and System Modeling, Figure 7 Plots of the decoding error expressed in the form of the Hamming distance between x and x̂ (3D and contour plots) for selected values of the fuzzification coefficient: (a) m = 1.2, (b) m = 2.0, (c) m = 3.5
Linear and Nonlinear Transformations of the Feature Space

Making use of such transformations, we position the original vectors in some other space of far lower dimensionality than the original one. There are a number of statistical methods devoted to the formation of optimal linear transformations. For instance, the commonly used Principal Component Analysis (PCA) [11] deals with a linear mapping of the original data such that the reduced data vectors y are expressed as y = Tx, where T denotes a transformation matrix and y is positioned in the reduced space of dimensionality “m”, where m ≪ n. The transformation matrix is constructed using the eigenvectors associated with the largest eigenvalues of the covariance matrix $\Sigma$ of the data, that is, $\Sigma = \sum_{k=1}^{N} (x_k - m)(x_k - m)^T$, where m is the mean vector of all data. The PCA method does not consider the labels of the data in case they are available. To accommodate class information, one may consider Fisher discriminant analysis [3]. Neural networks can serve as nonlinear transformations of the original feature space. More specifically, the dimensionality reduction is accomplished through a sandglass topology of the network, as illustrated in Fig. 8. Notice that the shortest hidden layer of the network produces the reduced version of the data. The optimization criterion deals with the minimal reconstruction error: for each input x, we strive for the output of the network (NN) to be made as close as possible to the input, that is, NN(x) ≈ x. The learning (adjustment) of the weights minimizes the error (objective function) of the form $\sum_{k=1}^{N} \|NN(x_k) - x_k\|^2$. Given the evidently nonlinear nature of neurocomputing, we often allude to these transformations as nonlinear component analysis (NLCA). Neural networks are also used as a visualization vehicle in which we display original data in some low-dimen-
Data and Dimensionality Reduction in Data Analysis and System Modeling, Figure 8 Neural network in nonlinear dimensionality reduction; the inner shortest layer produces a reduced version of the original data (y)
sional space. The typical architecture comes in the form of Kohonen self-organizing maps [13]. Data that are distant in the original space remain distant in the low-dimensional space. Likewise, if two data points are originally close to each other, this closeness is also present in the transformed feature space. While Kohonen maps realize a certain form of clustering, their functioning is quite distinct from the FCM algorithm described before. In essence, self-organizing maps bring a high involvement of humans into the final interpretation of results. For instance, the number of clusters is not predefined in advance, as encountered in the FCM; rather, its choice is left to the end user to determine on the basis of the structure displayed on the map.

The Development of Optimal Subsets of Features

The objective is to select a subset of attributes so that a certain predefined performance index (selection criterion) becomes optimized (either minimized or maximized). The choice of a subset of attributes is a combinatorial problem; as a matter of fact, it is NP-complete. The only optimal solution comes through exhaustive search. Evidently, such an enumeration of possibilities is feasible for prob-
lems of a very low dimensionality. In essence, the practicality of such an enumeration is out of the question. This has led to numerous techniques relying on some sort of heuristics to help reduce the size of the search space. While the heuristics could be effective, their usage could easily build some bias into the search process; subsequently, the results produced by such methods are suboptimal. In addition to a number of heuristics (which contribute to the effectiveness of the search process), there are more advanced optimization techniques, such as those available in the realm of Evolutionary Computing [14]: genetic algorithms, evolutionary strategies, and particle swarm optimization are examples of biologically inspired techniques aimed at structural optimization, using which we construct an optimal (or suboptimal) collection of attributes. Any potential solution is represented in the form of a chromosome that identifies the attributes contributing to the formation of the optimal space. We may encounter either binary or real-number chromosomes whose size is equal to the number of features. In the case of a binary chromosome, we choose the attributes identified by 1, see Fig. 9a. For the required “m” features, we choose the first ones encountered in the chromosome. For the real-number coding of the chromosome, we typically have far more alternatives to take advantage of. For instance, we could consider the “m” largest entries of the chromosome; the coordinates of these entries then form the reduced subset of features; refer to Fig. 9b.
Data and Dimensionality Reduction in Data Analysis and System Modeling, Figure 9 Examples of forming a subset of features for a given chromosome: (a) binary coding and (b) real-number coding (here we form a three-dimensional feature space by choosing the three dominant features)
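The two chromosome-decoding rules just described can be sketched as follows (an illustration of ours; the function names are hypothetical):

```python
def features_from_binary(chromosome, m):
    """First m attribute indices flagged with 1 in a binary chromosome."""
    return [i for i, g in enumerate(chromosome) if g == 1][:m]

def features_from_real(chromosome, m):
    """Indices of the m largest entries of a real-coded chromosome."""
    return sorted(range(len(chromosome)),
                  key=lambda i: chromosome[i], reverse=True)[:m]
```

For example, `features_from_binary([0, 1, 1, 0, 1], 2)` selects attributes 1 and 2, while `features_from_real([0.2, 0.9, 0.1, 0.7], 2)` selects attributes 1 and 3.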
Optimization Environment in Dimensionality Reduction: Wrappers and Filters

Attribute selection is guided by some criterion, and there are numerous ways of expressing how a collection of features is deemed “optimal”. A generally accepted taxonomy of the methods distinguishes between so-called wrappers and filters. Filters embrace criteria that directly express the quality of the subset of features in terms of their homogeneity, compactness, and the like. Wrappers, as the name implies, evaluate the quality of the subset of features after it has been used by some mechanism of classification or prediction. In other words, the assessment of the attributes is done once they have been “wrapped” in some construct; this means that the quality of the feature space is expressed indirectly. The essence of filters and wrappers is illustrated schematically in Fig. 10.

Co-joint Data and Dimensionality Reduction

So far, our investigations have been concerned with data or dimensionality reduction carried out individually. An inter-
Data and Dimensionality Reduction in Data Analysis and System Modeling, Figure 10 Filters and wrappers in the assessment of the process of dimensionality reduction
Data and Dimensionality Reduction in Data Analysis and System Modeling, Figure 11 An example heatmap of biclusters (c D 3)
esting question arises as to the possibility of carrying out concurrent reduction in both ways. One of the possible alternatives can be envisioned in the form of so-called biclustering (or co-clustering). The crux of this process is to carry out simultaneous clustering of data and features; the result is a collection of groups of data and groups of features. We start with basic notation. Let us denote by D_1, D_2, ..., D_c subsets of data. Similarly, by F_1, F_2, ..., F_c we denote subsets of features. We may regard D_i and F_i as subsets of the indexes of data and features, {1, 2, ..., N} and {1, 2, ..., n}, respectively. {D_i} and {F_i} form partitions, so that the fundamental requirements of disjointness and total space coverage are satisfied. By biclusters of the data set we mean a collection of pairs (D_i, F_i), i = 1, 2, ..., c. Graphically, biclusters are visualized in the form of so-called heatmaps, where each bicluster is displayed as a certain region in the data and feature space, Fig. 11. Here the data are represented as a two-dimensional array X. There have been a number of biclustering algorithms; the methods proposed by Hartigan [9] and by Cheng and Church [4] are just two examples. The objective function introduced by Hartigan is concerned with the overall variance of the “c” biclusters:

$$Q = \sum_{k=1}^{c} \sum_{i \in D_k} \sum_{j \in F_k} (x_{ij} - m_k)^2 \qquad (22)$$

where $m_k$ is the mean of the kth bicluster,

$$m_k = \frac{1}{|D_k|\,|F_k|} \sum_{i \in D_k} \sum_{j \in F_k} x_{ij} \qquad (23)$$

and $|\cdot|$ denotes the cardinality of the clusters in the data and feature spaces. The minimization of Q leads to the development of the biclusters.
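Hartigan's objective (22)–(23) can be evaluated for a candidate biclustering with a short sketch (an illustration of ours; the function name and data layout are assumptions):

```python
def hartigan_Q(X, biclusters):
    """Eq. (22): sum of within-bicluster squared deviations.

    X is a 2-D array (list of rows); biclusters is a list of pairs
    (D_k, F_k) of row-index and column-index collections."""
    Q = 0.0
    for D, F in biclusters:
        cells = [X[i][j] for i in D for j in F]
        m_k = sum(cells) / len(cells)        # bicluster mean, Eq. (23)
        Q += sum((x - m_k) ** 2 for x in cells)
    return Q
```

A perfectly uniform bicluster contributes zero to Q, so the score directly rewards homogeneous data-by-feature blocks.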
Data and Dimensionality Reduction in Data Analysis and System Modeling, Figure 12 Encoding of biclusters in terms of the chromosome with entries confined to the unit interval
The idea of Cheng and Church is to form biclusters in such a way that we minimize the mean squared residue of each bicluster. More specifically, for the kth bicluster denote the averages taken over the data in $D_k$ and the features in $F_k$ as follows:

$$\lambda_{kj} = \frac{1}{|D_k|} \sum_{i \in D_k} x_{ij}, \qquad \kappa_{ki} = \frac{1}{|F_k|} \sum_{j \in F_k} x_{ij} . \qquad (24)$$

The residue of $x_{ij}$ taken with respect to the kth bicluster is defined as $r_{ij} = x_{ij} - \lambda_{kj} - \kappa_{ki} + m_k$, and then

$$V_k = \sum_{i \in D_k} \sum_{j \in F_k} r_{ij}^{2} . \qquad (25)$$
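The squared-residue score of (24)–(25) for a single bicluster can be sketched as follows (an illustration of ours; the function name is hypothetical):

```python
def msr_V(X, D, F):
    """Squared-residue score V_k of Eq. (25) for one bicluster (D, F)."""
    m_k = sum(X[i][j] for i in D for j in F) / (len(D) * len(F))
    col = {j: sum(X[i][j] for i in D) / len(D) for j in F}  # column means
    row = {i: sum(X[i][j] for j in F) / len(F) for i in D}  # row means
    # residue r_ij = x_ij - column mean - row mean + bicluster mean
    return sum((X[i][j] - col[j] - row[i] + m_k) ** 2
               for i in D for j in F)
```

Note that a purely additive block (each entry a row effect plus a column effect) has zero residue, which is exactly the pattern this criterion rewards.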
The overall minimization of $V_1 + V_2 + \cdots + V_c$ gives rise to the formation of the biclusters. Biologically inspired optimization can be used here to construct biclusters. Again, as before, the formation of a suitable structure (content) of the chromosome becomes crucial to the effectiveness of the obtained structure. One possible encoding could be envisioned as follows; refer to Fig. 12. Note that the chromosome consists of two parts: the first portion encodes the data while the other deals with the features. Assuming that the chromosome has entries in-between 0 and 1 and that “c” biclusters are to be developed, we convert entries in the interval [0, 1/c) to identify elements belong-
ing to the first bicluster. Those entries that assume values in [1/c, 2/c) indicate elements (either data or features) belonging to the second bicluster, and so on.

Data and Dimensionality Reduction in Data Analysis and System Modeling, Figure 13 System modeling realized through a collection of local models developed for individual biclusters

Biclustering brings the interesting possibility of a simultaneous reduction of data and dimensionality. Each cluster of data is expressed in terms of its own (reduced) feature space. This phenomenon gives rise to a collection of models which are formed locally using a subset of data and a subset of features; each model is developed on the basis of a given bicluster, see Fig. 13. In this way, the complexity of modeling is substantially reduced: instead of global models (which are more difficult to construct), we develop a family of compact models involving far fewer attributes and dealing with less data.

Conclusions

Reduction problems are essential to a number of fundamental developments in data analysis and system modeling. These activities are central to data understanding and constitute a core of the generic activities of data mining and human-centric data interpretation. They are also indispensable in system modeling, where working with high-dimensional data makes the development of models highly inefficient and yields models of lower quality, particularly with respect to their generalization capabilities. The two facets of the reduction process involve data and features (attributes), and as such they tend to be highly complementary. When discussing individual techniques aimed at either data or dimensionality reduction, we indicated that there are interesting alternatives for dealing with both aspects jointly by making use of biclustering.

Future Directions

There are a number of directions in which the area can expand, either in terms of the reduction concept itself or the techniques being sought. Several of them might offer tangible benefits to future pursuits:
– The use of stratified approaches to data and dimensionality reduction, in which the reduction processes are carried out on a hierarchical basis, meaning that further reductions are contemplated based on results developed at the lower level of generality.
– More active involvement and integration of techniques for the visualization of results. While those are in place now, we may envision a more vital role for them in the future.
– A distributed mode of reduction activities, where we simultaneously encounter several sources of data and intend to process them at the same time.
– Exploitation of joint data and dimensionality reduction approaches, in which one tackles both the reduction of the number of data and the number of attributes. These two are quite intertwined, so their mutual handling could be advantageous. In particular, one could envision their role in the formation of a suite of simpler models.

Given the NP-complete nature of many reduction problems, the use of some heuristics is inevitable. In this way, techniques of biologically inspired optimization are anticipated to play a role, and it is very likely that this role will become more visible.

Acknowledgments

Support from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canada Research Chair (CRC) program is gratefully acknowledged.

Appendix: Particle Swarm Optimization (PSO)

In the studies on dimensionality and data reduction, we consider the use of Particle Swarm Optimization (PSO). PSO is an example of a biologically inspired, population-driven optimization. The algorithm was originally introduced by Kennedy and Eberhart [12], where the authors strongly emphasized an inspiration coming from the swarming behavior of animals as well as from human social behavior, cf. also [19]. In essence, a particle swarm is a population
of particles – possible solutions in the multidimensional search space. Each particle explores the search space and during this search adheres to some quite intuitively appealing guidelines navigating the search process: (a) it tries to follow its previous direction, and (b) it looks back at the best performance, both at the level of the individual particle and of the entire population. In this sense there is a collective search of the problem space, along with some component of memory incorporated as an integral part of the search mechanism. The performance of each particle during its movement is assessed by means of some performance index. A position of a particle in the search space is described by a vector z(t), where "t" denotes consecutive discrete time moments. The next position of the particle is governed by the following update expressions concerning the particle's position z(t + 1) and its speed v(t + 1):

z(t + 1) = z(t) + v(t + 1)   // update of the position of the particle
v(t + 1) = ξ v(t) + φ1 (p − z(t)) + φ2 (p_total − z(t))   // update of the speed of the particle

where p denotes the best position (the lowest performance index) reported so far for this particle, and p_total is the best position developed so far across the whole population. φ1 and φ2 are random numbers drawn from the uniform distribution U[0, 2] that help build a proper mix of the components of the speed; different random numbers affect the individual coordinates of the speed. The second expression, governing the change in the velocity of the particle, is particularly interesting, as it nicely captures the relationships between the particle and its history as well as the history of the overall population in terms of their performance reported so far. There are three components determining the updated speed of the particle. First, the current speed v(t) is scaled by the inertia weight ξ, smaller than 1, whose role is to damp any tendency toward a drastic change of the current speed.
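These two update rules can be sketched in Python as follows; the inertia value, swarm size, iteration budget, and the sphere objective used in the demonstration are our own illustrative assumptions, not part of the original formulation.

```python
import random

def pso(fitness, dim, n_particles=20, n_iter=100, inertia=0.7):
    """Minimal particle swarm sketch implementing
    z(t+1) = z(t) + v(t+1) and
    v(t+1) = inertia*v(t) + phi1*(p - z(t)) + phi2*(p_total - z(t))."""
    z = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p = [list(x) for x in z]                 # best position found per particle
    p_total = list(min(z, key=fitness))      # best position across the swarm
    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(dim):
                phi1 = random.uniform(0, 2)  # fresh draw for each coordinate
                phi2 = random.uniform(0, 2)
                v[i][j] = (inertia * v[i][j]
                           + phi1 * (p[i][j] - z[i][j])
                           + phi2 * (p_total[j] - z[i][j]))
                z[i][j] += v[i][j]
            if fitness(z[i]) < fitness(p[i]):      # update particle memory
                p[i] = list(z[i])
            if fitness(p[i]) < fitness(p_total):   # update swarm memory
                p_total = list(p[i])
    return p_total

# demonstration on the sphere function (an assumed test objective)
best = pso(lambda x: sum(c * c for c in x), dim=2)
```

Since p_total only ever improves, the returned position is at least as good as the best random starting point.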
Second, we relate to the memory of this particle by recalling the best position it has achieved so far. Third, there is some reliance on the best performance reported across the whole population (captured by the last component of the expression governing the speed adjustment).

Bibliography

Primary Literature

1. Bargiela A, Pedrycz W (2003) Granular Computing: An Introduction. Kluwer, Dordrecht
2. Bezdek JC (1992) On the relationship between neural networks, pattern recognition and intelligence. Int J Approx Reason 6(2):85–107
3. Duda RO, Hart PE, Stork DG (2001) Pattern Classification, 2nd edn. Wiley, New York
4. Cheng Y, Church GM (2000) Biclustering of expression data. In: Proc 8th Int Conf on Intelligent Systems for Molecular Biology, pp 93–103
5. Gersho A, Gray RM (1992) Vector Quantization and Signal Compression. Kluwer, Boston
6. Gottwald S (2005) Mathematical fuzzy logic as a tool for the treatment of vague information. Inf Sci 172(1–2):41–71
7. Gray RM (1984) Vector quantization. IEEE Acoust Speech Signal Process 1:4–29
8. Hansen E (1975) A generalized interval arithmetic. Lecture Notes in Computer Science, vol 29. Springer, Berlin, pp 7–18
9. Hartigan JA (1972) Direct clustering of a data matrix. J Am Stat Assoc 67:123–129
10. Jaulin L, Kieffer M, Didrit O, Walter E (2001) Applied Interval Analysis. Springer, London
11. Jolliffe IT (1986) Principal Component Analysis. Springer, New York
12. Kennedy J, Eberhart RC (1995) Particle swarm optimization. In: Proc IEEE Int Conf on Neural Networks, vol 4. IEEE Press, Piscataway, pp 1942–1948
13. Kohonen T (1989) Self Organization and Associative Memory, 3rd edn. Springer, Berlin
14. Mitchell M (1996) An Introduction to Genetic Algorithms. MIT Press, Cambridge
15. Moore R (1966) Interval Analysis. Prentice Hall, Englewood Cliffs
16. Pawlak Z (1982) Rough sets. Int J Comput Inform Sci 11:341–356
17. Pawlak Z (1991) Rough Sets. Theoretical Aspects of Reasoning About Data. Kluwer, Dordrecht
18. Pawlak Z, Skowron A (2007) Rough sets: some extensions. Inf Sci 177(1):28–40
19. Pedrycz W (ed) (2001) Granular Computing: An Emerging Paradigm. Physica, Heidelberg
20. Pedrycz W (2005) Knowledge-based Clustering. Wiley, Hoboken
21. Warmus M (1956) Calculus of approximations. Bull Acad Pol Sci 4(5):253–259
22. Zadeh LA (1979) Fuzzy sets and information granularity.
In: Gupta MM, Ragade RK, Yager RR (eds) Advances in Fuzzy Set Theory and Applications. North Holland, Amsterdam, pp 3–18
23. Zadeh LA (1996) Fuzzy logic = Computing with words. IEEE Trans Fuzzy Syst 4:103–111
24. Zadeh LA (1997) Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets Syst 90:111–117
25. Zadeh LA (1999) From computing with numbers to computing with words – from manipulation of measurements to manipulation of perceptions. IEEE Trans Circ Syst 45:105–119
26. Zadeh LA (2005) Toward a generalized theory of uncertainty (GTU) – an outline. Inf Sci 172:1–40
27. Zimmermann HJ (1996) Fuzzy Set Theory and Its Applications, 3rd edn. Kluwer, Norwell
Data and Dimensionality Reduction in Data Analysis and System Modeling
Books and Reviews

Baldi P, Hornik K (1989) Neural networks and principal component analysis: learning from examples without local minima. Neural Netw 2:53–58
Bortolan G, Pedrycz W (2002) Fuzzy descriptive models: an interactive framework of information granulation. IEEE Trans Fuzzy Syst 10(6):743–755
Busygin S, Prokopyev O, Pardalos PM (2008) Biclustering in data mining. Comput Oper Res 35(9):2964–2987
Fukunaga K (1990) Introduction to Statistical Pattern Recognition. Academic Press, Boston
Gifi A (1990) Nonlinear Multivariate Analysis. Wiley, Chichester
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1–2):273–324
Lauro C, Palumbo F (2000) Principal component analysis of interval data: a symbolic data analysis approach. Comput Stat 15:73–87
Manly BFJ (1986) Multivariate Statistical Methods: A Primer. Chapman and Hall, London
Mitra P, Murthy CA, Pal SK (2002) Unsupervised feature selection using feature similarity. IEEE Trans Pattern Anal Mach Intell 24(3):301–312
Monahan AH (2000) Nonlinear principal component analysis by neural networks: theory and applications to the Lorenz system. J Clim 13:821–835
Muni DP, Pal NR, Das J (2006) Genetic programming for simultaneous feature selection and classifier design. IEEE Trans Syst Man Cybern Part B 36(1):106–117
Pawlak Z (1991) Rough Sets – Theoretical Aspects of Reasoning about Data. Kluwer, Dordrecht
Pedrycz W, Vukovich G (2002) Feature analysis through information granulation and fuzzy sets. Pattern Recognit 35:825–834
Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290:2323–2326
Setiono R, Liu H (1997) Neural-network feature selector. IEEE Trans Neural Netw 8(3):654–662
Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290:2319–2323
Watada J, Yabuuchi Y (1997) Fuzzy principal component analysis and its application. Biomed Fuzzy Human Sci 3:83–92
Data-Mining and Knowledge Discovery: Case-Based Reasoning, Nearest Neighbor and Rough Sets

LECH POLKOWSKI 1,2
1 Polish–Japanese Institute of Information Technology, Warsaw, Poland
2 Department of Mathematics and Computer Science, University of Warmia and Mazury, Olsztyn, Poland

Article Outline

Glossary
Definition of the Subject
Introduction
Rough Set Theory: Extensions
Rough Set Theory: Applications
Nearest Neighbor Method
Case-Based Reasoning
Complexity Issues
Future Directions
Bibliography

Glossary

Knowledge This is a many-faceted notion, difficult to define, and it is used very frequently without any attempt at definition, as a notion that explains itself. One can follow J. M. Bocheński in claiming that the world is a system of states of things, related to one another by means of a network of relations; things, their features, relations among them, and states are reflected in knowledge: things in objects or notions, features and relations in notions (or concepts), states of things in sentences. Sentences constitute knowledge. Knowledge allows its possessor to classify new objects, model processes, make predictions, etc.

Reasoning Processes of reasoning include the effort by means of which sentences are created; various forms of reasoning depend on the chosen system of notions, the symbolic representation of notions, the forms of manipulating symbols, etc.

Knowledge representation This is a chosen symbolic system (language) by means of which notions are encoded and reasoning is formalized.

Boolean functions An n-ary Boolean function is a mapping f : {0, 1}^n → {0, 1} from the space of binary sequences of length n into the doubleton {0, 1}. An equivalent representation of the function f is as a formula of propositional calculus; a Boolean function
can thus be represented either in DNF or in CNF form. The former representation, ⋁_{i∈I} ⋀_{j∈J_i} l_{ji}, where l_{ji} is a literal, i.e., either a propositional variable or its negation, is instrumental in Boolean reasoning: when a problem is formulated as a Boolean function, its solutions are searched for as the prime implicants ⋀_{j∈J_i} l_{ji}. Applications are to be found in various reduct induction algorithms.

Information systems One of the languages for knowledge representation is the attribute–value language, in which notions representing things are described by means of attributes (features) and their values; information systems are pairs of the form (U, A) where U is a set of objects – representing things – and A is a set of attributes; each attribute a is modeled as a mapping a : U → V_a from the set of objects into the value set V_a. For an attribute a and its value v, the descriptor (a = v) is a formula interpreted in the set of objects U as [(a = v)] = {u ∈ U : a(u) = v}. Descriptor formulas form the smallest set containing all descriptors and closed under the sentential connectives ∨, ∧, ¬, ⇒. Meanings of complex formulas are defined recursively: [α ∨ β] = [α] ∪ [β], [α ∧ β] = [α] ∩ [β], [¬α] = U \ [α], [α ⇒ β] = [¬α ∨ β]. In the descriptor language, each object u ∈ U can be encoded over a set B of attributes as its information vector Inf_B(u) = {(a = a(u)) : a ∈ B}.

Indiscernibility The Leibnizian Principle of Identity of Indiscernibles affirms that two things are identical in case they are indiscernible, i.e., no available operator acting on both of them yields distinct values; in the context of information systems, indiscernibility relations are induced from sets of attributes: given a set B ⊆ A, the indiscernibility relation relative to B is defined as Ind(B) = {(u, u′) : a(u) = a(u′) for each a ∈ B}.
Objects u, u′ in the relation Ind(B) are said to be B-indiscernible and are regarded as identical with respect to the knowledge represented by the information system (U, B). The class [u]_B = {u′ : (u, u′) ∈ Ind(B)} collects all objects identical to u with respect to B.

Exact, inexact notion An exact notion is a set of objects in the considered universe which can be represented as the union of a collection of indiscernibility classes; otherwise, the set is inexact. In the latter case, there exists a boundary of the notion, consisting of objects which can with certainty be classified neither into the notion nor into its complement (Pawlak, Frege).

Decision systems A particular form of an information system; this is a triple (U, A, d) in which d, the decision, is an attribute not in A that expresses the evaluation of objects by an external oracle, an expert. Attributes in A are called conditional, in order to discern them from the decision d.

Classification task The problem of assigning to each element of a set of objects (the test sample) a class (a value of the decision) to which the given element should belong; it is effected on the basis of knowledge induced from a given collection of examples (the training sample). To perform this task, objects are usually mapped onto vectors in a multi-dimensional real vector space (the feature space).

Decision rule A formula in the descriptor language that expresses a particular relation among the conditional attributes in the attribute set A and the decision d, of the form ⋀_{a∈A}(a = v_a) ⇒ (d = v), with the semantics defined in (Glossary: "Indiscernibility"). The formula is true in case [⋀_{a∈A}(a = v_a)] = ⋂_{a∈A}[(a = v_a)] ⊆ [(d = v)]. Otherwise, the formula is partially true. An object o which matches the rule, i.e., a(o) = v_a for a ∈ A, can be classified to the class [(d = v)]; often a partial match based on a chosen distance measure has to be performed.

Distance functions (metrics) A metric on a set X is a non-negative valued function ρ : X × X → R, where R is the set of reals, which satisfies the conditions:
1. ρ(x, y) = 0 if and only if x = y.
2. ρ(x, y) = ρ(y, x).
3. ρ(x, y) ≤ ρ(x, z) + ρ(z, y) for each z in X (the triangle inequality).
When in 3. ρ(x, y) is bounded by max{ρ(x, z), ρ(z, y)} instead of by the sum of the two, one speaks of a non-archimedean metric.

Object closest to a set For a metric ρ on a set X and a subset Y of X, the distance from an object x in X to the set Y is defined as dist(x, Y) = inf{ρ(x, y) : y ∈ Y}; when Y is a finite set, the infimum inf is replaced with the minimum min.

A nearest neighbor For an object x0 ∈ X, this is an object n(x0) such that ρ(x0, n(x0)) = dist(x0, X \ {x0}); n(x0) may not be unique. In plain words, n(x0) is the object closest to x0 and distinct from it.
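The definitions of dist(x, Y) and n(x0) translate directly into code. In the sketch below, the Euclidean metric and the sample points are illustrative choices of ours, not prescribed by the text.

```python
import math

def rho(x, y):
    """One possible metric: the Euclidean distance on real vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dist_to_set(x, Y):
    """dist(x, Y) = min{rho(x, y) : y in Y} for a finite set Y."""
    return min(rho(x, y) for y in Y)

def nearest_neighbor(x0, X):
    """n(x0): an object closest to x0 among the objects distinct from it."""
    return min((y for y in X if y != x0), key=lambda y: rho(x0, y))

X = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
print(nearest_neighbor((0.0, 0.0), X))        # (1.0, 0.0)
print(dist_to_set((0.0, 0.0), [(3.0, 4.0)]))  # 5.0
```

Note that, as the glossary entry warns, `min` silently breaks ties, so the returned neighbor need not be unique.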
K-nearest neighbors For an object x0 ∈ X and a natural number K ≥ 1, this is a set n(x0, K) ⊆ X \ {x0} of cardinality K such that for each object y ∈ X \ [n(x0, K) ∪ {x0}] one has ρ(y, x0) ≥ ρ(z, x0) for each z ∈ n(x0, K). In plain words, the objects in n(x0, K) form a set of K objects that are closest to x0 among objects distinct from it.

Rough inclusion A ternary relation μ on a set U × U × [0, 1] which satisfies the conditions:
1. μ(x, x, 1).
2. μ(x, y, 1) is a binary partial order relation on the set U.
3. μ(x, y, 1) implies that for each object z in U: if μ(z, x, r) then μ(z, y, r).
4. μ(x, y, r) and s < r imply μ(x, y, s).
The formula μ(x, y, r) is read as "the object x is a part of the object y to a degree at least r".
The partial-containment idea encompasses the idea of an exact part, i.e., the mereological theory of concepts (Leśniewski, Leonard–Goodman).

Similarity An extension and relaxation of an equivalence relation of indiscernibility. Among similarity relations, we single out the class of tolerance relations τ (Poincaré, Zeeman), which are reflexive, i.e., τ(x, x), and symmetric, i.e., τ(x, y) implies τ(y, x). The basic distinction between equivalences and tolerances is that the former induce partitions of their universes into disjoint classes, whereas the latter induce coverings of their universes by their classes; for this reason they are more difficult to analyze. The symmetry condition may be over-restrictive; for instance, rough inclusions are usually not symmetric.

Case A case is, informally, a part of knowledge that describes a certain state of the world (the context), along with a query (problem) and its solution, and a description of the outcome, i.e., the state of the world after the solution is applied.

Retrieve A process by means of which a case similar in some sense to the one currently considered is recovered from the case base.

Reuse A process by means of which the retrieved case's solution is re-used in the current new problem.

Revise A process by means of which the retrieved solution is adapted in order to satisfactorily solve the current problem.

Retain A process by means of which the adapted solution is stored in the case base as the solution to the current problem.

Definition of the Subject

Basic ideas of Rough Set Theory were proposed by Zdzisław Pawlak [60,62] as a formal set of notions aimed at carrying out tasks of reasoning, in particular about the classification of objects, in conditions of uncertainty. Conditions of uncertainty are imposed by the incompleteness, imprecision, and ambiguity of knowledge.
Originally, the basic notion proposed was that of a knowledge base, understood as a collection R of equivalence relations on a universe of objects; each relation r ∈ R induces on the set U a partition ind_r into equivalence classes. Knowledge so encoded is meant to represent the classification ability. As objects for analysis and classification come most often in the form of data, the notion of an information system is commonly used in knowledge representation; the knowledge base in that case is defined as the collection of indiscernibility relations. Exact concepts are defined as unions of indiscernibility classes, whereas inexact concepts are only approximated from below (lower approximations) and from above (upper approximations) by exact ones. Each inexact concept is thus perceived as a pair of exact concepts between which it is sandwiched. In terms of indiscernibility it is possible to define dependencies among attributes: the best case, of functional dependence, comprises instances in which a set B of attributes depends on a set C of attributes, meaning that the values of attributes from B are functionally dependent on the values of attributes in C; a weaker form of dependence is partial dependence, in which there is a non-functional relation between the values of attributes in B and those of C. The most important case of dependency is the one between the conditional attributes in a decision system and the decision; these dependencies are expressed by means of decision rules, which altogether constitute a classifier. From the application point of view, one of the most important tasks is to induce from data a satisfactory classifier. A plethora of ideas has been put forth in this area. First, the idea of knowledge reduction appeared; here, the notion of a reduct in an information system comes foremost: a reduct B ⊆ A is a subset of A, minimal with respect to inclusion, such that the indiscernibility induced by B is the indiscernibility induced by A. Finding a reduct means that one can dispense with the attributes in A \ B and induce decision rules from the conditional attributes restricted to B. Further developments in the reduction area encompass the notions of a relative reduct, a local reduct, a dynamic reduct, an entropy reduct, and so on. Subsequent developments include discretization algorithms, similarity measures, and granulation of knowledge. Research on classifiers has led to the emergence of a number of systems for rule induction that have been applied in many tasks in areas of, e.
g., data classification, medical diagnosis, economic analysis, and signal classification; in some cases rough set classifiers have been coupled with fuzzy, wavelet, or neural reasoning systems to produce hybrid systems. Matching rules to new objects is done on the basis of the closeness of the object and the rule, often measured in terms of the fraction of descriptors common to the object description and to the rule; in many classifiers it is the closest rule that sets the value of the decision class for the object. The idea of value assignment to a new object by the closest already classified object is at the heart of the Nearest Neighbor Method. The idea goes back to Fix and Hodges [21,22], Skellam [88], and Clark and Evans [12]. The nearest neighbor technique in its simplest form consists of imposing a metric on the object set in question and assigning the class value c(x0) to a currently examined test object x0 by selecting the training object n(x0) closest to x0 with respect to the metric and letting c(x0) = c(n(x0)), the class already assigned to n(x0). This simplest variant is also named the 1-nearest neighbor method. A generalization consists of the choice of a natural number k and the selection from among the training objects of the set of the k objects closest to x0, n(x0) = {x1, …, xk}, eventually breaking ties randomly. Then the class value c(x0) is defined on the basis of the class values c(x1), …, c(xk) by majority voting; eventual ties are broken randomly. Nearest neighbor methods can be perceived as heuristics implementing the idea of the Parzen window: a region about x0 (the "window"), e.g., rectangular, in which the count of objects leads to an estimate of the probability density of the distribution of objects in space. The problem of choosing the "right" window is, in the case of nearest neighbor techniques, delegated to the training set itself, and the class value is assigned without any probability assessment. The nearest neighbor idea is a basis as well for prototype methods, in particular k-means and clustering methods, and is also applied in other approaches, e.g., rough set classifiers. Applications of this technique can be found in many areas like satellite image analysis, plant ecology, forestry, and cluster identification in medical diagnosis. Difficulties in extracting from data an explicit model on the basis of which one could make inferences, already observed with the nearest neighbor method, led to the emergence of the ideology of Case-Based Reasoning, traced back to Schank and Abelson's [84] discussion of methods for describing knowledge about situations. Closely related to reasoning by analogy, experiential learning, and philosophical and psychological theories of concepts claiming that real-world concepts are best classified by sets of representative cases (Wittgenstein [117]), Case-Based Reasoning emerged as a methodology for problem solving based on cases, i.
e., capsules of knowledge containing the description of the current state of the world at the moment the problem is posed (the situation), the solution to the problem, and the state of the world after the solution is implemented. Solved cases stored in the case base form a repository to which new cases are addressed. A new case is matched against the case base, and the cases most similar to the given case are retrieved; retrieval is usually based on a form of analogy reasoning, nearest neighbor selection, inductive procedures, or template retrieval. Once a similar case is retrieved, its solution is reused, or adapted, to provide a solution for the new case; various strategies are applied in the reuse process, from the simplest, null strategy, i.e., using the retrieved solution directly in the new problem, to derivational strategies that take the procedure that generated the retrieved solution and modify it to yield a new solution to
the current problem. Finally, the new completed case is retained (or stored) in the case base. Case-based reasoning has been applied in many systems in the area of expert systems, to mention CYRUS (Kolodner [37]), MEDIATOR (Simpson [87]), PERSUADER (Sycara [106]), and many others (Watson and Marir [114]). CBR systems have found usage in many areas of expertise like complex diagnosis, legal procedures, civil engineering, and manufacture planning.

Introduction

The three paradigms covered in this article have been arranged in the previous section in the order corresponding to the specificity of their scope: rough sets are a paradigm intended for reasoning under uncertainty from data, and they exploit the idea of the nearest neighbor in common with many other paradigms; rough set techniques make use of metrics and similarity relations in their methods, and consequently we discuss the nearest neighbor method as based on metrics, and finally case-based reasoning as based on similarity ideas. Introduced in Pawlak [60], rough set theory is based on ideas that go back to Gottlob Frege, Gottfried Wilhelm Leibniz, Jan Łukasiewicz, and Stanisław Leśniewski, to mention a few names of importance. Its characteristic is that it divides notions (concepts) into two classes: exact and inexact. Following the Fregean idea (Frege [23]), an inexact concept should possess a boundary, into which fall objects that can be classified with certainty neither to the concept nor to its complement. This boundary of a concept is constructed from indiscernibility relations induced by attributes (features) of objects (see Glossary: "Indiscernibility"). Given a set of objects U and attributes in a set A, each set B of attributes induces the B-indiscernibility relation ind(B) (see Glossary: "Indiscernibility"), according to the Leibnizian Principle of Identity of Indiscernibles, see [98]. Each class [u]_B = {v ∈ U : (u, v) ∈ ind(B)} of ind(B) is B-definable, i.
e., the decision problem whether v ∈ [u]_B is decidable: [u]_B is exact (see Glossary). Unions of classes [u]_B, for some u and B, are exact as well. Concepts X ⊆ U that are not of this form are inexact; they possess a non-empty boundary. To express the B-boundary of a concept X induced by the set B of attributes, approximations over B are introduced:

B̲X = ⋃ {[u]_B : [u]_B ⊆ X}   (the B-lower approximation)
B̄X = ⋃ {[u]_B : [u]_B ∩ X ≠ ∅}   (the B-upper approximation)
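Both approximations are unions of indiscernibility classes, so they are straightforward to compute. The following sketch derives the Ind(B)-classes of a hypothetical four-object information system and then both approximations; all names and data are illustrative.

```python
def ind_classes(system, B):
    """Partition the universe into Ind(B)-classes: objects u, u' fall into
    the same class iff a(u) = a(u') for every attribute a in B."""
    classes = {}
    for u, row in system.items():
        key = tuple(row[a] for a in B)   # plays the role of Inf_B(u)
        classes.setdefault(key, set()).add(u)
    return list(classes.values())

def approximations(system, B, X):
    """B-lower and B-upper approximation of a concept X (a set of objects)."""
    lower, upper = set(), set()
    for c in ind_classes(system, B):
        if c <= X:   # class contained in X: contributes to the lower approx
            lower |= c
        if c & X:    # class meeting X: contributes to the upper approx
            upper |= c
    return lower, upper

# hypothetical information system: objects -> {attribute: value}
U = {"u1": {"a": 0}, "u2": {"a": 0}, "u3": {"a": 1}, "u4": {"a": 2}}
low, up = approximations(U, ["a"], {"u1", "u3"})
print(sorted(low))        # ['u3']: only the class {u3} fits inside the concept
print(sorted(up - low))   # ['u1', 'u2']: the boundary of the concept
```

The concept {u1, u3} is inexact here because u1 and u2 are indiscernible over the attribute "a", so they end up in the boundary together.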
The difference Bd_B(X) = B̄X \ B̲X is the B-boundary of X; when non-empty, it witnesses that X is inexact over B. The information system (U, A) (see Glossary: "Information systems") represents knowledge about a certain aspect of the world. This knowledge can be reduced: a reduct B of the set A of attributes is a minimal subset of A with the property that ind(B) = ind(A). An algorithm for finding reducts based on Boolean reasoning was proposed in Skowron and Rauszer [90]; given input (U, A) with U = {u1, …, un}, it starts with the discernibility matrix

M_{U,A} = [c_{i,j} = {a ∈ A : a(u_i) ≠ a(u_j)}]_{1≤i,j≤n}

and builds the Boolean function

f_{U,A} = ⋀_{c_{i,j}≠∅, i<j} ⋁_{a∈c_{i,j}} ā ,

where ā is the Boolean variable assigned to the attribute a ∈ A. The function f_{U,A} is converted to its DNF form:

f_{U,A} ≡ ⋁_{j∈J} ⋀_{k∈K_j} ā_{j,k} .

Then the sets of the form R_j = {a_{j,k} : k ∈ K_j} for j ∈ J, corresponding to prime implicants (see Glossary: "Boolean functions"), are exactly the reducts of A. Choosing a reduct R and forming the reduced information system (U, R), one is assured that no information encoded in (U, A) has been lost. Moreover, as ind(R) ⊆ ind(B) for any subset B of the attribute set A, one can establish a functional dependence of B on R: since for each object u ∈ U, [u]_R ⊆ [u]_B, the assignment f_{R,B} : Inf_R(u) → Inf_B(u) (see Glossary: "Information systems") is functional, i.e., the values of attributes in R determine functionally, objectwise, the values of attributes in B. Thus, any reduct determines functionally the whole data system. A decision system (U, A, d) (see Glossary: "Decision systems") encodes information about the external classification d (by an oracle, expert, etc.). Methods based on rough sets aim at finding a description of the concept d in terms of the conditional attributes in A in the language of descriptors (see Glossary: "Information systems"). The simpler case is when the decision system is deterministic, i.e., ind(A) ⊆ ind(d). In this case the relation between A and d is functional, given by the unique assignment f_{A,d} or, in decision rule form (see Glossary: "Decision rule"), as the set of rules ⋀_{a∈A}(a = a(u)) ⇒ (d = d(u)). In place of A, any reduct R of A can be substituted, leading to shorter rules. In the contrary case, some
classes [u]_A are split into more than one decision class [v]_d, leading to ambiguity in classification. In that case, decision rules are divided into certain (or exact) and possible; to induce the certain rule set, the notion of a δ-reduct was proposed in Skowron and Rauszer [90]; it is called a relative reduct in Bazan et al. [7]. To define δ-reducts, first the generalized decision δ_B is defined: for u ∈ U, δ_B(u) = {v ∈ V_d : d(u′) = v and (u, u′) ∈ ind(B) for some u′ ∈ U}. A subset B of A is a δ-reduct to d when it is a minimal subset of A with respect to the property that δ_B = δ_A. δ-reducts can be obtained from the modified Skowron and Rauszer algorithm [90]: it suffices to modify the entries c_{i,j} of the discernibility matrix by letting c^d_{i,j} = {a ∈ A ∪ {d} : a(u_i) ≠ a(u_j)} and then setting c′_{i,j} = c^d_{i,j} \ {d} in case d(u_i) ≠ d(u_j), and c′_{i,j} = ∅ in case d(u_i) = d(u_j). The algorithm described above, input with the entries c′_{i,j} forming the matrix M^δ_{U,A}, outputs all δ-reducts to d, encoded as prime implicants of the associated Boolean function f^δ_{U,A}.

An Example of Reduct Finding and Decision Rule Induction

We conclude this first step into rough sets with a simple example of a decision system, its reducts, and its decision rules. Table 1 shows a simple decision system. Reducts of the information system (U, A = {a1, a2, a3, a4}) can be found from the discernibility matrix M_{U,A} in Table 2; by symmetry, the cells c_{i,j} = c_{j,i} with i > j are not filled. Each attribute a_i is encoded by the Boolean variable i. After reduction by means of the absorption laws of sentential calculus, (p ∨ q) ∧ p ⇔ p and (p ∧ q) ∨ p ⇔ p, the DNF form of f_{U,A} is 1 ∧ 2 ∧ 3 ∨ 1 ∧ 2 ∧ 4 ∨ 1 ∧ 3 ∧ 4. The reducts of A in the information system (U, A) are thus: {a1, a2, a3}, {a1, a2, a4}, {a1, a3, a4}. δ-reducts of the decision d in the decision system Simple can be found from the modified discernibility matrix M^δ_{U,A} in Table 3.
Data-Mining and Knowledge Discovery: Case-Based Reasoning, Nearest Neighbor and Rough Sets, Table 1 Decision system Simple

Obj.  a1  a2  a3  a4  d
u1    1   0   0   1   0
u2    0   1   0   0   1
u3    1   1   0   0   1
u4    1   0   0   1   1
u5    0   0   0   1   1
u6    1   1   1   1   0
Data-Mining and Knowledge Discovery: Case-Based Reasoning, Nearest Neighbor and Rough Sets, Table 2 Discernibility matrix M_{U,A} for reducts in (U, A)

Obj.  u1  u2        u3      u4        u5       u6
u1    ∅   {1,2,4}   {2,4}   ∅         {1}      {2,3}
u2        ∅         {1}     {1,2,3}   {2,4}    {1,3,4}
u3                  ∅       {2,4}     {2,4}    {3,4}
u4                          ∅         {1}      {2,3}
u5                                    ∅        {1,2,3}
u6                                             ∅
Data-Mining and Knowledge Discovery: Case-Based Reasoning, Nearest Neighbor and Rough Sets, Table 3 Discernibility matrix M^δ_{U,A} for δ-reducts in (U, A, d)

Obj.  u1  u2        u3      u4   u5    u6
u1    ∅   {1,2,4}   {2,4}   ∅    {1}   ∅
u2        ∅         ∅       ∅    ∅     {1,3,4}
u3                  ∅       ∅    ∅     {3,4}
u4                          ∅    ∅     {2,3}
u5                                ∅    {1,2,3}
u6                                     ∅
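The δ-reducts encoded in Table 3 can be checked mechanically: collect the non-empty cells c′_{i,j} and search for the minimal attribute subsets intersecting every cell (the prime implicants of f^δ_{U,A}). A brute-force sketch over the Simple system follows; the function names and the exhaustive search are our own illustrative choices, not the algorithm of [90].

```python
from itertools import combinations

# decision system Simple: object -> (attribute values a1..a4, decision d)
rows = {
    "u1": ([1, 0, 0, 1], 0), "u2": ([0, 1, 0, 0], 1), "u3": ([1, 1, 0, 0], 1),
    "u4": ([1, 0, 0, 1], 1), "u5": ([0, 0, 0, 1], 1), "u6": ([1, 1, 1, 1], 0),
}

def delta_cells(rows):
    """Non-empty entries c'_{i,j}: attributes discerning objects with distinct d."""
    cells = []
    objs = list(rows)
    for i, j in combinations(range(len(objs)), 2):
        (a, d1), (b, d2) = rows[objs[i]], rows[objs[j]]
        if d1 != d2:
            c = {k + 1 for k in range(4) if a[k] != b[k]}
            if c:                      # empty cells impose no constraint
                cells.append(c)
    return cells

def delta_reducts(rows):
    """Minimal subsets of {1..4} hitting every cell (the prime implicants)."""
    cells = delta_cells(rows)
    hitting = [set(s) for r in range(1, 5) for s in combinations(range(1, 5), r)
               if all(set(s) & c for c in cells)]
    return [s for s in hitting if not any(t < s for t in hitting)]

print(delta_reducts(rows))   # [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}]
```

The output matches the three δ-reducts read off the Boolean function in the text.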
From the Boolean function f^δ_{U,A} we read off the δ-reducts R1 = {a1, a2, a3}, R2 = {a1, a2, a4}, R3 = {a1, a3, a4}. Taking R1 as the reduct for inducing decision rules, we read off the following certain rules:

r1: (a1 = 0) ∧ (a2 = 1) ∧ (a3 = 0) ⇒ (d = 1) ;
r2: (a1 = 1) ∧ (a2 = 1) ∧ (a3 = 0) ⇒ (d = 1) ;
r3: (a1 = 0) ∧ (a2 = 0) ∧ (a3 = 0) ⇒ (d = 1) ;
r4: (a1 = 1) ∧ (a2 = 1) ∧ (a3 = 1) ⇒ (d = 0) ;

and two possible rules,

r5: (a1 = 1) ∧ (a2 = 0) ∧ (a3 = 0) ⇒ (d = 0) ;
r6: (a1 = 1) ∧ (a2 = 0) ∧ (a3 = 0) ⇒ (d = 1) ;

each with certainty factor 0.5, as the matching class {u1, u4} splits evenly between d = 0 and d = 1. Consider a new object v with Inf_A(v) = {(a1 = 0), (a2 = 1), (a3 = 1), (a4 = 0)}.

The Nearest Neighbor Approach

We define the metric on objects encoded by means of their information vectors (see Glossary: "Information systems") as the reduced Hamming metric on information vectors over the reduct R1; for an object x we let Desc(x) denote the set of descriptors that occur in the information vector Inf_{R1}(x) of x. The metric is
then ρ_{R1}(x, y) = 1 − |Desc(x) ∩ Desc(y)|/3; the nearest neighbors of v with respect to ρ_{R1} are u2 and u6, described by, respectively, the rules r1 and r4; however, these rules point to distinct decision classes, 1 and 0. Majority voting must be implemented by a random choice between the class values 0 and 1, with probability of error 0.5. We can still resort to the idea of the most similar case.

The Case-Based Reasoning Approach

This ambiguity may be resolved if we treat objects as cases defined over the whole set A, with solutions given by the values of the decision d; formally, a case is of the form (x, d(x)). Similarity among cases is defined by means of the Hamming distance reduced over the whole set A; thus, ρ_A(x, y) = 1 − |Desc(x) ∩ Desc(y)|/4. The most similar case to v is u2 (ρ_A(v, u2) = 1/4; ρ_A(v, u6) = 1/2), hence we adapt the solution d = 1 of the case u2 as the solution for v and we classify v as d = 1; the new case in our case base is then (v, 1).

Rough Set Theory: Extensions

Basic principles of the rough set approach (knowledge reduction, indiscernibility, functional and partial dependence, decision rules) are exposed in Sect. "Introduction". In this section we augment the discussion with more detailed and specific aspects, dividing the exposition into parts concerned with particular topics.

Reducts

The notion of a reduct, as a minimal (with respect to set inclusion) subset of the set of attributes A preserving a certain property of the whole attribute set, has undergone an evolution from the basic notions of a reduct and δ-reduct to more specialized variants. The problem of finding a reduct of minimal length is NP-hard [90]; therefore one may expect that no polynomial algorithm is available for computing reducts. Thus, an algorithm based on the discernibility matrix has been proposed, with stop rules that permit one to halt the algorithm and obtain a partial set of reducts [7]. In order to find stable reducts, robust to perturbations in data, the idea of a dynamic reduct was proposed in Bazan et al. [7]; for a decision system D = (U, A, d), a family F of subsystems of the form (U1, A, d) with U1 ⊆ U is derived (e.g., by random choice) and, given ε ∈ (0, 1), a generalized dynamic ε-reduct B of D is a set B which is also found as a reduct in at least a (1 − ε)-fraction of the subsystems in F. In order to precisely discriminate between certain and possible rules, the notion of a positive region, along with
In order to find stable reducts, robust to perturbations in the data, the idea of a dynamic reduct was proposed in Bazan et al. [7]: for a decision system D = (U, A, d), a family F of subsystems of the form (U1, A, d) with U1 ⊆ U is derived (e.g., by random choice), and, given ε ∈ (0, 1), a generalized dynamic ε-reduct B of D is a set B which is also found as a reduct in at least a (1 − ε)-fraction of the subsystems in F. In order to discriminate precisely between certain and possible rules, the notion of a positive region along with
the notion of a relative reduct was proposed and studied in Skowron and Rauszer [90]. The positive region pos_B(d) is the set {u ∈ U : [u]_B ⊆ [u]_d} = ⋃_{v∈V_d} B_*[(d = v)]; pos_B(d) is the greatest subset X of U such that (X, B, d) is deterministic, and it generates certain rules. Objects in U ∖ pos_B(d) are subject to ambiguity: given such a u and the collection v1, …, vk of values of the decision d on the class [u]_B, the decision rule describing u can be formulated as ⋀_{a∈B} (a = a(u)) ⇒ ⋁_{i=1,…,k} (d = vi); each of the rules ⋀_{a∈B} (a = a(u)) ⇒ (d = vi) is possible but not certain, as the decision takes the value vi only on a fraction of the objects in the class [u]_B. Relative reducts are minimal sets B of attributes with the property that pos_B(d) = pos_A(d); they, too, can be found by means of a discernibility matrix M_{U,A} [90]: c_{i,j} = c^d_{i,j} ∖ {d} in case either d(u_i) ≠ d(u_j) and u_i, u_j ∈ pos_A(d), or pos(u_i) ≠ pos(u_j), where pos is the characteristic function of pos_A(d); otherwise, c_{i,j} = ∅. For a relative reduct B, certain rules are induced from the deterministic system (pos_B(d), A, d), and possible rules are induced from the non-deterministic system (U ∖ pos_B(d), A, d). In the latter case, one can find ∂-reducts to d in this system and turn it into a deterministic one, (U ∖ pos_B(d), A, ∂), inducing certain rules of the form ⋀_{a∈B} (a = a(u)) ⇒ ⋁_{v∈∂(u)} (d = v). Localization of reducts was addressed by means of the notion of a local reduct in [7]: given an object u_i, a local reduct to the decision d is a minimal set B(u_i) such that, for each object u_j, if d(u_i) ≠ d(u_j) then a(u_i) ≠ a(u_j) for some a ∈ B(u_i). An algorithm for finding a covering of the set of objects by local reducts for each of the objects is given in [7]. A more general notion of a reduct, as a minimal set of attributes preserving a certain characteristic of the distribution of objects into decision classes, e.
g., the frequencies of objects, the entropy of the classification distribution, etc., is discussed in [94].

Decision Rules

Already defined as descriptor formulas of the form ⋀_{a∈B} (a = v_a) ⇒ (d = v) in the descriptor language of the decision system (U, A, d), decision rules provide a description of the decision d in terms of the conditional attributes in the set A. Forming a decision rule amounts to searching the pool of available, semantically non-vacuous descriptors for a combination that describes a chosen decision class as well as possible. The basic idea of inducing rules consists of considering a set B of attributes: the lower approximation pos_B(d) allows for rules which are certain, the
upper approximation ⋃_{v∈V_d} B^*[(d = v)] adds rules which are possible. We write a decision rule in the form φ/B, u ⇒ (d = v), where φ/B is the descriptor formula ⋀_{a∈B} (a = a(u)) over B. A method for inducing decision rules in the systematic way of Pawlak and Skowron [63] and Skowron [89] consists of finding the set of all ∂-reducts R = {R1, …, Rm} and defining, for each reduct Rj and each object u ∈ U, the rule φ/Rj, u ⇒ (d = d(u)). Rules obtained by this method are usually not minimal in the sense of the number of descriptors in the premise φ. A method for obtaining decision rules with a minimal number of descriptors [63,89] consists of reducing a given rule r: φ/B, u ⇒ (d = v) by finding a set Rr ⊆ B consisting of irreducible attributes of B only, in the sense that removing any a ∈ Rr causes the inequality [φ/Rr, u ⇒ (d = v)] ≠ [φ/(Rr ∖ {a}), u ⇒ (d = v)] to hold. In case B = A, the reduced rules φ/Rr, u ⇒ (d = v) are called optimal basic rules (with a minimal number of descriptors). The method for finding all irreducible subsets of the set A (Skowron [89]) considers another modification of the discernibility matrix: for each object u_k ∈ U, the entry c′_{i,j} of the matrix M^∂_{U,A} for ∂-reducts is modified into c^k_{i,j} = c′_{i,j} in case d(u_i) ≠ d(u_j) and i = k ∨ j = k; otherwise, c^k_{i,j} = ∅. The matrices M^k_{U,A} and the associated Boolean functions f^k_{U,A}, for all u_k ∈ U, allow for finding all irreducible subsets of the set A and, in consequence, all optimal basic rules (with a minimal number of descriptors). Decision rules are judged by their quality on the basis of the training set and by their quality in classifying new, as yet unseen objects, i.e., by their performance on the test set. Quality evaluation rests on some measures: for a rule r: φ ⇒ (d = v) and an object u ∈ U, one says that u matches r in case u ∈ [φ]. match(r) is the number of objects matching r.
The support supp(r) of r is the number of objects in [φ] ∩ [(d = v)]; the fraction cons(r) = supp(r)/match(r) is the consistency degree of r; cons(r) = 1 means that the rule is certain. The strength, strength(r), of the rule r is defined (Michalski et al. [49], Bazan [6], Grzymala-Busse and Ming Hu [29]) as the number of objects correctly classified by the rule in the training phase; the relative strength is defined as the fraction rel-strength(r) = supp(r)/|[(d = v)]|. The specificity of the rule r, spec(r), is the number of descriptors in the premise of the rule r [29]. In the testing phase, rules vie among themselves for object classification when they point to distinct decision classes; in such a case, negotiations among rules or their sets are necessary, and in these negotiations rules with better characteristics are privileged.
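The measures match, supp, cons, and spec can be computed directly from a decision table. A minimal sketch, on an assumed toy table, with a rule given as a (premise, decision value) pair:

```python
# Rule-quality measures: match(r), supp(r), cons(r) = supp/match, and
# spec(r) = number of descriptors in the premise. Table and rule are
# illustrative assumptions.

def matched(rule, table):
    premise, _ = rule
    return [u for u, row in table.items()
            if all(row[a] == v for a, v in premise.items())]

def measures(rule, table, d="d"):
    premise, value = rule
    m = matched(rule, table)
    supp = sum(1 for u in m if table[u][d] == value)
    cons = supp / len(m) if m else 0.0
    spec = len(premise)                  # descriptors in the premise
    return {"match": len(m), "supp": supp, "cons": cons, "spec": spec}

table = {
    "u1": {"a": 0, "b": 1, "d": 0},
    "u2": {"a": 0, "b": 0, "d": 0},
    "u3": {"a": 0, "b": 1, "d": 1},
}
m = measures(({"a": 0, "b": 1}, 0), table)
```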
For a given decision class c: d = v and an object u in the test set, the set Rule(c, u) of all rules matched by u and pointing to the decision v is characterized globally by Support(Rule(c, u)) = Σ_{r∈Rule(c,u)} strength(r) · spec(r). The class c for which Support(Rule(c, u)) is the largest wins the competition, and the object u is classified into the class c: d = v [29]. It may happen that no rule in the available set of rules is matched by the test object u, and partial matching is then necessary: for a rule r, the matching factor match-fact(r, u) is defined as the fraction of descriptors in the premise of r matched by u relative to the number spec(r) of descriptors in the premise. The rule set for which the partial support Part-Support(Rule(c, u)) = Σ_{r∈Rule(c,u)} match-fact(r, u) · strength(r) · spec(r) is the largest wins the competition and assigns the value of the decision to u [29]. In a similar way, notions based on relative strength can be defined for sets of rules and applied in negotiations among them [7].

Dependency

Decision rules are particular cases of dependencies among attributes or their sets; certain rules of the form φ/B ⇒ (d = v) establish a functional dependency of the decision d on the set B of conditional attributes. Functional dependence of the set B of attributes on the set C, C ↦ B, in an information system (U, A) means that Ind(C) ⊆ Ind(B). Minimal sets D ⊆ C of attributes such that D ↦ B can be found from a modified discernibility matrix M_{U,A}: letting ⟨B⟩ denote the global attribute representing B, ⟨B⟩(u) = ⟨b1(u), …, bm(u)⟩ where B = {b1, …, bm}, one sets, for objects u_i, u_j, c_{i,j} = {a ∈ C ∪ {⟨B⟩} : a(u_i) ≠ a(u_j)} and then c^B_{i,j} = c_{i,j} ∖ {⟨B⟩} in case ⟨B⟩ is in c_{i,j}; otherwise, c^B_{i,j} is empty.
The associated Boolean function f^B_{U,A} gives all minimal subsets of C on which B depends functionally; in particular, when B = {b}, one obtains in this way all subsets of the attribute set A on which b depends functionally (Skowron and Rauszer [90]). A number of contributions by Pawlak, Novotny, and Novotny and Pawlak are devoted to this topic in the abstract setting of semi-lattices: Novotny and Pawlak [56], Pawlak [61]. Partial dependence of the set B on the set C of attributes takes place where there is no functional dependence C ↦ B; in that case, some measures of the degree to which B depends on C were proposed in Novotny and Pawlak [55]: the degree can be defined, e.g., as the fraction γ_{B,C} = |pos_C(B)|/|U|, where the C-positive region of B is defined in analogy to the already discussed positive region for the decision, i.e., pos_C(B) = {u ∈ U : [u]_C ⊆ [u]_B}; then,
B depends on C partially to the degree γ_{B,C}: C ↦_{γ_{B,C}} B. The relation C ↦_r B of partial dependency is transitive in the sense: if C ↦_r B and D ↦_s C, then D ↦_{max{0, r+s−1}} B, where t(r, s) = max{0, r + s − 1} is the Łukasiewicz t-norm (or tensor product) [55]. A logical characterization of functional dependence between attribute sets was given as a fragment of intuitionistic logic in Rauszer [80].

Similarity

Analysis based on indiscernibility has allowed for extracting the most important notions of rough set theory; further progress has been obtained, inter alia, by passing from indiscernibility to more general similarity relations (see Glossary: "Similarity"). There have been various methods for introducing similarity relations. An attempt at introducing some degrees of comparison among objects with respect to particular concepts consisted of defining rough membership functions in Pawlak and Skowron [64]: for an object u, an attribute set B and a concept X ⊆ U, the value μ_{B,X}(u) of the rough membership function μ_{B,X} was defined as μ_{B,X}(u) = |[u]_B ∩ X|/|[u]_B|. Informally, one can say that objects u, v are (ε, B)-similar with respect to the concept X in case |μ_{B,X}(u) − μ_{B,X}(v)| < ε. This relation is reflexive and symmetric, i.e., it is a tolerance relation, see Poincaré [66] and Zeeman [125], for each ε, B, X. Tolerance relations were introduced into rough sets in Polkowski, Skowron, and Żytkow [77] and were also studied in Krawiec et al. [43] and Skowron and Stepaniuk [91]. It is possible to build a parallel theory of rough sets based on tolerance or similarity relations, in analogy to indiscernibility relations. μ_{B,X} characterizes partial containment of objects of U in concepts X; a further step consists of considering general relations of partial containment in the form of predicates "to be a part to a degree".
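The rough membership function above is a direct computation on indiscernibility classes. A minimal sketch on an assumed toy table, including the (ε, B)-similarity test:

```python
# mu_{B,X}(u) = |[u]_B ∩ X| / |[u]_B|, computed from a toy attribute-value
# table; two objects are (eps, B)-similar w.r.t. X when their membership
# values differ by less than eps. Data are illustrative assumptions.

def rough_membership(table, B, X, u):
    block = [w for w, row in table.items()
             if all(row[a] == table[u][a] for a in B)]   # the class [u]_B
    return sum(1 for w in block if w in X) / len(block)

table = {
    "u1": {"a": 0, "b": 1},
    "u2": {"a": 0, "b": 1},
    "u3": {"a": 1, "b": 0},
}
X = {"u1", "u3"}
mu1 = rough_membership(table, ["a", "b"], X, "u1")   # [u1]_B = {u1, u2}
mu3 = rough_membership(table, ["a", "b"], X, "u3")   # [u3]_B = {u3}
similar = abs(mu1 - mu3) < 0.75                      # (eps, B)-similar for eps = 0.75
```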
A general form of partial containment was proposed as an extension of the mereological theory of concepts due to Leśniewski [44]; mereology takes the predicate "to be a part of" as its primitive notion, requiring it to be irreflexive and transitive on the universe of objects U. The primitive notion of a (proper) part is relaxed to the notion of an ingredient (an improper part): ing = part ∪ =. The extension consists of considering the predicate "to be a part to a degree of", formally introduced as the generic notion of a rough inclusion in Polkowski and Skowron [74,75]: a ternary predicate μ with the semantic domain U × U × [0, 1], satisfying the requirements that (1) μ(u, v, 1) is a binary relation of ingredient on objects in U, i.e., μ(u, v, 1) sets an exact scheme of decomposition of objects in U into (possibly improper) parts; (2) μ(u, v, 1) implies informally that the object u is "inside" the object v, hence for any other object w, μ(w, u, r) implies μ(w, v, r); (3) μ(u, v, r) implies μ(u, v, s) for any s < r. Similarity of objects with respect to μ can be judged by the degree sim_μ(u, v) = max{arg max_r μ(u, v, r), arg max_s μ(v, u, s)}; sim_μ(u, v) = 1 means that u, v are identical with respect to parts; the smaller sim_μ(u, v), the less similar u, v are. Rough inclusions in an information system (U, A) can be induced in some distinct ways, as in Polkowski [67,68,69,71]. Archimedean t-norms, i.e., t-norms t(x, y) that are continuous and have no idempotents (values x with t(x, x) = x) except 0, 1, offer one way; it is well known from Mostert and Shields [50] and Faucett [18] that, up to isomorphism, there are two Archimedean t-norms: the Łukasiewicz t-norm L(x, y) = max{0, x + y − 1} and the product (Menger) t-norm P(x, y) = x · y. Archimedean t-norms admit the functional characterization shown in Ling [48]: t(x, y) = g(f(x) + f(y)), where the function f: [0, 1] → R is continuous decreasing with f(1) = 0, and g: R → [0, 1] is the pseudo-inverse to f, i.e., f ∘ g = id. The t-induced rough inclusion μ_t is defined [68] by μ_t(u, v, r) ⇔ g(|DIS(u, v)|/|A|) ≥ r, where DIS(u, v) = {a ∈ A : a(u) ≠ a(v)}. With the Łukasiewicz t-norm, f(x) = 1 − x = g(x) and IND(u, v) = A ∖ DIS(u, v), the formula becomes: μ_L(u, v, r) ⇔ |IND(u, v)|/|A| ≥ r; thus, in the case of the Łukasiewicz logic, μ_L becomes the similarity measure based on the Hamming distance between information vectors of objects reduced modulo |A|; from a probabilistic point of view it is based on the relative frequency of common descriptors in the information sets of u, v. This formula permeates data mining algorithms and methods, see Klösgen and Żytkow [36]. Another method exploits residua of t-norms [69]. For a continuous t-norm t(x, y), the residuum x ⇒_t y is defined as max{z : t(x, z) ≤ y}.
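The Łukasiewicz-induced rough inclusion μ_L has a one-line computational core: the fraction of attributes on which the information vectors agree must be at least r. A minimal sketch on assumed vectors:

```python
# mu_L(u, v, r) holds iff |IND(u, v)| / |A| >= r, with IND(u, v) the set of
# attributes on which u and v agree. Vectors are illustrative assumptions.

def mu_L(u, v, r):
    """u, v: equal-length information vectors; r: required inclusion degree."""
    assert len(u) == len(v)
    ind = sum(1 for a, b in zip(u, v) if a == b)   # |IND(u, v)|
    return ind / len(u) >= r

u, v = (1, 0, 1, 1), (1, 0, 0, 1)   # agree on 3 of 4 attributes
holds_at_075 = mu_L(u, v, 0.75)
holds_at_09 = mu_L(u, v, 0.9)
```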
Rough inclusions are defined in this case with respect to objects selected as standards, e.g., objects u representing desired (in the given domain) combinations of attribute values. Given a standard s, the rough inclusion μ^{IND,s}_t is defined by:

μ^{IND,s}_t(u, v, r) iff (|IND(u, s)|/|A|) ⇒_t (|IND(v, s)|/|A|) ≥ r.

In the case of the t-norm L, μ^{IND,s}_L(x, y, r) iff |IND(y, s)| ≥ |IND(x, s)| − (1 − r)·|A|. In the case of the t-norm P, μ^{IND,s}_P(x, y, 1) iff |IND(x, s)| ≤ |IND(y, s)|, and μ^{IND,s}_P(x, y, r) iff |IND(y, s)| ≥ r·|IND(x, s)|. Finally, in the case of t = min, we have μ^{IND,s}_min(x, y, r) iff |IND(x, s)| ≤ |IND(y, s)| or |IND(y, s)| ≥ r·|A|.
All these rough inclusions satisfy the transitivity property [68,71]: from μ(u, v, r) and μ(v, w, s) it follows that μ(u, w, t(r, s)), where μ is induced by the t-norm t. Each metric d on the set of objects U [see Glossary: "Distance functions (metrics)"] induces a rough inclusion: μ_d(u, v, r) if and only if d(u, v) ≤ 1 − r; conversely, any symmetric rough inclusion μ induces a metric: d_μ(u, v) ≤ r ⇔ μ(u, v, 1 − r). For instance, in the case of the t-norm L, the metric is d_L(u, v) = |DIS(u, v)|/|A|, i.e., it is the Hamming distance on information vectors of objects reduced modulo |A|. Let φ: [0, 1] → [0, 1] be an increasing mapping with φ(1) = 1; for a rough inclusion μ, we define a new predicate μ_φ by letting μ_φ(u, v, r) ⇔ μ(u, v, φ(r)).
μ_φ is a rough inclusion. It is transitive with respect to the t-norm T_φ(r, s) = φ^{−1}[t(φ(r), φ(s))] whenever μ is transitive with respect to the t-norm t. A particular case takes place when one replaces |DIS|/|A| with a sum Σ_{a∈DIS} w_a, where {w_a : a ∈ A} is a convex family of positive weights. Further, one can replace the weights w_a with values of rough inclusions μ_a defined on the value sets V_a. The corresponding formulas become:

μ_L(u, v, r) ⇔ 1 − Σ_{a∈DIS(u,v)} ε_a(a(u), a(v)) ≥ r,

μ_P(u, v, r) ⇔ exp[−Σ_{a∈DIS(u,v)} ε_a(a(u), a(v))] ≥ r,

where ε_a(u, v) = (arg max_r μ_a(a(u), a(v), r)) / (Σ_{b∈DIS(u,v)} arg max_r μ_b(b(u), b(v), r)).
This discussion shows that rough inclusions encompass essentially all forms of distance functions applied in data mining. Other methods of introducing similarity relations into the realm of information systems, besides rough inclusions, include methods based on templates and on quasi-distance functions in Nguyen S. H. [53]; a template is any conjunct of the form T: ⋀_{a∈B} (a ∈ W_a), where B is a set of attributes and W_a ⊆ V_a is a subset of the value set V_a of the attribute a. The semantics of templates is defined as with descriptor formulas (see Glossary: "Information systems"). Templates are judged by some parameters: length, i.e., the number of generalized descriptors (a ∈ W_a); support, i.e., the number of matching objects; approximated length, applength, i.e., Σ_{a∈B} 1/|W_a ∩ a(U)|. The quality of a template is given by a combination of some of the parameters, e.g., quality(T) = support(T) + applength(T), etc. Templates are used
in classification problems in a way analogous to decision rules. A quasi-metric (a similarity measure) [53] is a family δ = {δ_a : a ∈ A} of functions such that δ_a(u, u) = 0 and δ_a(u, v) = δ_a(v, u) for u, v ∈ U. By means of these functions, tolerance relations are built with the help of standard metric-forming operators like max or Σ: τ1(u, v) ⇔ max_a {δ_a(u, v)} ≤ ε and τ2(u, v) ⇔ Σ_a δ_a(u, v) ≤ ε, for a given threshold ε, are examples of such similarity relations. These similarity relations are applied in finding classifiers in [53].

Discretization

The important problem of treating continuous values of attributes has been resolved in rough sets with the help of the technique of discretization of attributes, common to many paradigms like decision trees, etc. For a decision system (U, A, d), a cut is a pair (a, c), where a ∈ A and c is a real number. The cut (a, c) induces the binary attribute b_{a,c} with b_{a,c}(u) = 1 if a(u) ≥ c, and b_{a,c}(u) = 0 otherwise. Given a finite sequence p^a: c^a_0 < c^a_1 < … < c^a_m of reals, the set V_a of values of a is split into disjoint intervals (−∞, c^a_0), [c^a_0, c^a_1), …, [c^a_m, +∞); the new attribute D^a, with D^a(u) = i when b_{a,c^a_{i+1}}(u) = 0 and b_{a,c^a_i}(u) = 1, is a discrete counterpart to the continuous attribute a. Given a collection P = {p^a : a ∈ A} (a cut system), the set D^P = {D^a : a ∈ A} of attributes transforms the system (U, A, d) into the discrete system (U, D^P, d), called the P-segmentation of the original system. The set P is consistent in case the generalized decision in both systems is identical, i.e., ∂_A = ∂_{D^P}; a consistent P is irreducible if no proper subset P′ ⊂ P is consistent; P is optimal if its cardinality is minimal among all consistent cut systems, see Nguyen Hung Son [51], Nguyen Hung Son and Skowron [52]. MD-heuristics based on maximal discernibility of objects, and decision trees for optimal sets of cuts, were proposed in [51,52]; see also [7].
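Mapping a real value to the index of its interval in a cut sequence is a standard bisection step. A minimal sketch, with illustrative cuts and values; note that indices here run from 0 for (−∞, c_0) upward, a shift by one relative to the text's numbering of D^a:

```python
# Discretization by a cut sequence c_0 < c_1 < ... < c_m: a value is mapped
# to the index of the interval (-inf, c_0), [c_0, c_1), ..., [c_m, +inf)
# containing it. Cuts and values are illustrative assumptions.
import bisect

def discretize(value, cuts):
    """Return the interval index of `value` w.r.t. the sorted cut sequence."""
    # bisect_right makes each interval left-closed: a value equal to a cut
    # falls into the interval that starts at that cut.
    return bisect.bisect_right(cuts, value)

cuts = [1.0, 2.5, 4.0]
codes = [discretize(x, cuts) for x in (0.3, 1.0, 3.1, 7.2)]
```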
Granulation of Knowledge

The issue of granulation of knowledge as a problem in its own right was posed by L.A. Zadeh [124]. Granulation can be regarded as a form of clustering, i.e., grouping objects into aggregates characterized by closeness of certain parameter values among objects within an aggregate and by greater differences in those values from aggregate to aggregate. The issue of granulation has been a subject of intensive study within the rough set community, e.g., in T.Y. Lin [47], Polkowski and Skowron [75,76], Skowron and Stepaniuk [92], and Y.Y. Yao [123].
The rough set context offers a natural venue for granulation: indiscernibility classes were recognized as elementary granules, whereas their unions serve as granules of knowledge; these granules and their direct generalizations to various similarity classes induced by binary relations were the subject of research by T.Y. Lin [46] and Y.Y. Yao [121]. Granulation of knowledge by means of rough inclusions was studied as well, in Polkowski and Skowron [74,75], Skowron and Stepaniuk [92], and Polkowski [67,68,69,71]. For an information system (U, A) and a rough inclusion μ on U, granulation with respect to the similarity induced by μ is formally performed (Polkowski, op. cit.) by exploiting the class operator Cls of mereology [44]. The class operator is applied to any non-vacuous property F of objects (i.e., a distributive entity) in the universe U and produces the object ClsF (i.e., the collective entity) representing the wholeness of F. The formal definition of Cls is: assuming a part relation in U and the associated ingredient relation, ClsF satisfies the conditions (1) if u ∈ F then u is an ingredient of ClsF; (2) if v is an ingredient of ClsF, then some ingredient w of v is also an ingredient of some t that is in F; in plain words, each ingredient of ClsF has an ingredient in common with an object in F. An example of a part relation is the proper subset relation on a family of sets; then the subset relation is the ingredient relation, and the class of a family F of sets is its union ⋃F. The merit of the class operator lies in the fact that it always projects hierarchies onto the collective entity plane containing objects. For an object u and a real number r ∈ [0, 1], we define the granule g_μ(u, r) about u of the radius r, relative to μ, as the class ClsF(u, r), where the property F(u, r) is satisfied by an object v if and only if μ(v, u, r) holds. It was shown [68] that, in the case of a transitive μ, v is an ingredient of the granule g_μ(u, r) if and only if μ(v, u, r).
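For a transitive μ, a granule can thus be written down as the set of objects included in u to degree at least r. A minimal sketch using a Łukasiewicz-style inclusion (fraction of agreeing attributes) over an assumed toy universe:

```python
# Granule g_mu(u, r): all objects v with mu(v, u, r); mu here is the
# fraction-of-agreeing-attributes inclusion. Universe is illustrative.

def mu(v, u, r):
    ind = sum(1 for a, b in zip(v, u) if a == b)
    return ind / len(u) >= r

def granule(universe, u, r):
    """Granule about u of radius r: objects included in u to degree >= r."""
    return {name for name, vec in universe.items() if mu(vec, universe[u], r)}

universe = {
    "u1": (1, 0, 1, 1),
    "u2": (1, 0, 0, 1),   # agrees with u1 on 3/4 attributes
    "u3": (0, 1, 0, 0),   # agrees with u1 on 0/4 attributes
}
g = granule(universe, "u1", 0.75)
```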
This fact allows for writing down the granule g_μ(u, r) as a distributive entity (a set, a list) of the objects v satisfying μ(v, u, r). Granules of the form g_μ(u, r) have the regular properties of a neighborhood system [69]. Granules generated from a rough inclusion can be used in defining a compressed form of a decision system, a granular decision system [69]: for a granulation radius r and a rough inclusion μ, we form the collection U^G_{r,μ} = {g_μ(u, r)}. We apply a strategy G to choose a covering Cov^G_{r,μ} of the universe U by granules from U^G_{r,μ}, and a strategy S in order to assign the value a*(g) of each attribute a ∈ A to each granule g ∈ Cov^G_{r,μ}: a*(g) = S({a(u) : u ∈ g}). The granular counterpart to the decision system (U, A, d) is a tuple (U^G_{r,μ}, G, S, {a* : a ∈ A}, d*). The heuristic principle H, that objects similar with respect to the conditional attributes in the set A should also reveal similar (i.e., close) decision values, and that therefore granular counterparts to decision systems should lead to classifiers satisfactorily close in quality to those induced from the original decision systems, a principle at the heart of all classification paradigms, can also be formulated in this context [69]. Experimental results bear out the hypothesis [73].

Rough Set Theory: Applications

Applications of rough set theory lie mainly in classification and pattern recognition under uncertainty; classification is performed on test sets by means of judiciously chosen sets of decision rules or generalized decision rules (see Subsects. "Decision Rules", "Similarity", and "Granulation of Knowledge" of Sect. "Rough Set Theory: Extensions") induced from training samples.

Classification

Classification methods can be divided, according to the adopted methodology, into classifiers based on reducts and decision rules, classifiers based on templates and similarity, classifiers based on descriptor search, classifiers based on granular descriptors, and hybrid classifiers. For a decision system (U, A, d), classifiers are sets of decision rules. Induction of rules has been a subject of research in rough set theory since its beginnings. In the most general terms, building a classifier consists of searching the pool of descriptors for conjuncts that describe decision classes sufficiently well. As distinguished in Stefanowski [99], there are three main kinds of classifiers searched for: minimal, i.e., consisting of the minimum possible number of rules describing decision classes in the universe; exhaustive, i.e., consisting of all possible rules; and satisfactory, i.e., containing rules tailored to a specific use.
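The construction of a granular counterpart to a decision system can be sketched end to end: cover the universe with granules, then vote attribute values within each granule. A minimal illustration under explicit assumptions: a sequential (greedy) covering strategy G, majority voting as the strategy S, and a toy universe of information vectors:

```python
# Granular decision system sketch: granules of radius r (fraction of
# agreeing attributes >= r) are chosen greedily to cover the universe,
# and each granule's attribute values are set by majority voting.
from collections import Counter

def granule(universe, u, r):
    vec = universe[u]
    return {w for w, wv in universe.items()
            if sum(a == b for a, b in zip(wv, vec)) / len(vec) >= r}

def granular_system(universe, r):
    """Greedy covering by granules; returns majority-voted value tuples."""
    uncovered, rows = set(universe), []
    for u in sorted(universe):            # sequential selection strategy G
        if u in uncovered:
            g = granule(universe, u, r)
            uncovered -= g
            vecs = [universe[w] for w in sorted(g)]
            # majority vote per attribute column (strategy S)
            rows.append(tuple(Counter(col).most_common(1)[0][0]
                              for col in zip(*vecs)))
    return rows

universe = {"u1": (1, 0, 1), "u2": (1, 0, 0), "u3": (0, 1, 0)}
rows = granular_system(universe, 0.6)
```

The resulting table has fewer rows than the original universe, which is the compression effect the text attributes to granular classifiers.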
Classifiers are evaluated globally with respect to their ability to classify objects properly, usually by the error, i.e., the ratio of the number of incorrectly classified objects to the number of test objects; the total accuracy, i.e., the ratio of the number of correctly classified cases to the number of recognized cases; and the total coverage, i.e., the ratio of the number of recognized test cases to the number of test cases. Minimum-size algorithms include the LEM2 algorithm due to Grzymala-Busse [28] and the covering algorithm in the RSES package [82]; exhaustive algorithms include, e.g., the LERS system due to Grzymala-Busse [27] and systems based on discernibility matrices and Boolean reasoning by Skowron [89], Bazan [6], and Bazan et al. [7], implemented in the RSES package [82]. Minimal consistent sets of rules were introduced in Skowron and Rauszer [90]; they were shown to coincide with rules induced on the basis of local reducts in Wróblewski [120]. Further developments include dynamic rules, approximate rules, and relevant rules as described in [6,7], as well as local rules [6,7], effective in implementations of algorithms based on minimal consistent sets of rules. Rough set-based classification algorithms, especially those implemented in the RSES system [82], were discussed extensively in [7]; [101] contains a discussion of rough set classifiers along with an attempt at an analysis of granulation in the process of knowledge discovery. In [6], a number of techniques were verified in experiments with real data, based on various strategies: discretization of attributes (codes: N-no discretization, S-standard discretization, D-cut selection by dynamic reducts, G-cut selection by generalized dynamic reducts); dynamic selection of attributes (codes: N-no selection, D-selection by dynamic reducts, G-selection based on generalized dynamic reducts); decision rule choice (codes: A-optimal decision rules, G-decision rules on the basis of approximate reducts computed by Johnson's algorithm, simulated annealing and Boltzmann machines, etc., N-without computing of decision rules); approximation of decision rules (codes: N-consistent decision rules, P-approximate rules obtained by descriptor dropping); negotiations among rules (codes: S-based on strength, M-based on maximal strength, R-based on global strength, D-based on stability). Any choice of a strategy in these particular areas yields a compound strategy, denoted with an alias formed by concatenating the symbols of the strategies chosen in the consecutive areas, e.g., NNAND, etc. We record here in Table 4 an excerpt from the comparison (Tables 8, 9, 10 in [6]) of the best of these strategies with results based on other paradigms in classification, for two data sets: Diabetes and Australian credit from the UCI Repository [109].
An adaptive method of classifier construction was proposed in [121]: reducts are determined by means of a genetic algorithm [7], and in turn the reducts induce subtables of data regarded as classifying agents; the choice of optimal ensembles of agents is done by a genetic algorithm. Classifiers constructed by means of similarity relations are based on templates matching a given object, or closest to it with respect to a certain distance function, or on coverings of the universe of objects by tolerance classes, with the decision value assigned on the basis of some of them [53];
Data-Mining and Knowledge Discovery: Case-Based Reasoning, Nearest Neighbor and Rough Sets, Table 4 A comparison of errors in classification by rough set and other paradigms

Paradigm          System/method      Diabetes        Austr. credit
Stat. Methods     Logdisc            0.223           0.141
Stat. Methods     SMART              0.232           0.158
Neural Nets       Backpropagation2   0.248           0.154
Neural Networks   RBF                0.243           0.145
Decision Trees    CART               0.255           0.145
Decision Trees    C4.5               0.270           0.155
Decision Trees    ITrule             0.245           0.137
Decision Rules    CN2                0.289           0.204
Rough Sets        NNANR              0.335           0.140
Rough Sets        DNANR              0.280           0.165
Rough Sets        Best result        0.255 (DNAPM)   0.130 (SNAPM)

Data-Mining and Knowledge Discovery: Case-Based Reasoning, Nearest Neighbor and Rough Sets, Table 5 Accuracy of classification by template and similarity methods

Paradigm     System/method             Diabetes   Austr. credit
Rough Sets   Simple.templ./Hamming     0.6156     0.8217
Rough Sets   Gen.templ./Hamming        0.742      0.855
Rough Sets   Simple.templ./Euclidean   0.6312     0.8753
Rough Sets   Gen.templ./Euclidean      0.7006     0.8753
Rough Sets   Match.tolerance           0.757      0.8747
Rough Sets   Clos.tolerance            0.743      0.8246
we include in Table 5 excerpts from the classification results in [53]. A combination of rough set methods with the k-nearest neighbor idea is a further refinement of classification based on similarity or analogy, in Wojna [119]. In this approach, training set objects are endowed with a metric, and test objects are classified by the vote of the k nearest training objects, for some k that is subject to optimization. A far-reaching step in classification based on similarity or analogy is granular classification in Polkowski and Artiemjew [73]; in this approach, a granular counterpart (see Subsect. "Granulation of Knowledge" of Sect. "Rough Set Theory: Extensions") (U^G_{r,μ}, G, S, {a* : a ∈ A}, d*) to a decision system (U, A, d) is constructed; the strategy S can be majority voting, and G either a random choice of a covering or a sequential process of selection. To the granular system, standard classifiers like LEM2 or the RSES classifiers have been applied. This approach involves a compression of both the universe of objects and the rule sets. Table 6 shows a comparison of the results with other rough set methods on the Australian credit data.
Data-Mining and Knowledge Discovery: Case-Based Reasoning, Nearest Neighbor and Rough Sets, Table 6 Best results for Australian credit by some rough set-based algorithms; in case *, reduction in object size is 49.9 percent and reduction in rule number is 54.6 percent; in case **, resp., 19.7, 18.2; in case ***, resp., 3.6, 1.9

Source                        Method                        Accuracy      Coverage
(Bazan, 1998)                 SNAPM(0.9)                    error=0.130   –
(Nguyen S.H., 2000)           simple.templates              0.929         0.623
(Nguyen S.H., 2000)           general.templates             0.886         0.905
(Nguyen S.H., 2000)           closest.simple.templates      0.821         1.0
(Nguyen S.H., 2000)           closest.gen.templates         0.855         1.0
(Nguyen S.H., 2000)           tolerance.simple.templ.       0.842         1.0
(Nguyen S.H., 2000)           tolerance.gen.templ.          0.875         1.0
(Wroblewski, 2004)            adaptive.classifier           0.863         –
(Polkowski Artiemjew, 2007)   granular*.r=0.642             0.8990        1.0
(Polkowski Artiemjew, 2007)   granular**.r=0.714            0.964         1.0
(Polkowski Artiemjew, 2007)   granular***.concept.r=0.785   0.9970        0.9995

Descriptor search is applied in the LEM2 algorithm [28], aimed at issuing a set of rules with a minimal number of descriptors. This system has been applied in medical diagnosis, particularly to the problem of missing values, i.e., data in which some attribute values are lacking. A number of software systems for inducing classifiers have been proposed based on rough set methodology, among them LERS by Grzymala-Busse [27]; TRANCE due to Kowalczyk [42]; RoughFamily by Słowiński and Stefanowski [95,96]; TAS by Suraj [102]; PRIMEROSE due to Tsumoto [108]; KDD-R by Ziarko [126]; RSES by Skowron et al. [82]; ROSETTA due to Komorowski, Skowron et al. [39]; RSDM by Fernandez-Baizan et al. [19]; GROBIAN due to Duentsch and Gediga [17]; RoughFuzzyLab by Swiniarski [104]; and MODLEM by Stefanowski [100]. Rough set techniques have been applied in many areas of data exploration, among them, in exemplary works: Processing of audio signals: Czyżewski et al. [14], Kostek [40]. Handwritten digit recognition: Nguyen Tuan Trung [54]. Pattern recognition: Skowron and Swiniarski [93]. Signal classification: Wojdyłło [118]. Image processing: Swiniarski and Skowron [105]. Synthesis and analysis of concurrent processes, Petri nets: Suraj [103]. Conflict analysis and game theory: Deja [15], Polkowski and Araszkiewicz [72]. Rough neural computation modeling: Polkowski [70]. Self-organizing maps: Pal, Dasgupta and Mitra [57].
Software engineering: Peters and Ramanna [65]. Multicriteria decision theory and operational research: Greco et al. [26]. Programming in logic: Vitoria [112]. The missing value problem: Grzymala-Busse [28]. Learning cognitive concepts: M. Semeniuk-Polkowska [85].
Data-Mining and Knowledge Discovery: Case-Based Reasoning, Nearest Neighbor and Rough Sets

Nearest Neighbor Method

The nearest neighbor method goes back to Fix and Hodges [21,22], Skellam [88] and Clark and Evans [12], who used the idea in statistical tests of significance on nearest neighbor statistics in order to discern between random patterns and clusters. The idea is to assign a class value to a tested object on the basis of the class value of the nearest, with respect to a chosen metric, already classified object. The method was generalized to k nearest neighbors in Patrick and Fischer [59]. The idea of a closest classified object permeates, as already mentioned, many rough set algorithms and heuristics; it is also employed by many other paradigms as a natural choice of strategy. The underlying assumption is that data are organized according to some form of continuity, and finding a proper metric or similarity measure secures the correctness of the principle that objects close in that metric, or satisfactorily similar, should have close class values. To reach for a formal justification of these heuristics, we recall the rudiments of statistical decision theory in its core fragment, Bayesian decision theory; see, e.g., Duda, Hart and Stork [16]. Assuming that objects in question are assigned to k decision classes c_1, ..., c_k with probabilities p(c_1), ..., p(c_k) (the priors), and the probability that an object with feature vector x will be observed in the class c_i is p(x|c_i) (the likelihood), the probability of the class c_i when the feature vector x has been observed (the posterior) is given by the Bayes formula of Bayes [5]: p(c_i|x) = p(x|c_i)·p(c_i)/p(x), where the evidence p(x) = Σ_{j=1}^{k} p(x|c_j)·p(c_j). Bayesian decision theory assigns to an object with feature vector x the class c_i with p(c_i|x) > p(c_j|x) for all j ≠ i. The Bayesian error in this case is p_B = 1 − p(c_i|x). This decision rule may be conveniently expressed by means of discriminants assigned to classes: to the class c_i the discriminant g_i(x) is assigned, which by the Bayes formula can be expressed as an increasing function of the product p(x|c_i)·p(c_i), e.g., g_i(x) = log p(x|c_i) + log p(c_i); an object with features x is classified to the c_i for which g_i(x) = max{g_j(x) : j = 1, 2, ..., k}. The most important case occurs when the conditional densities p(x|c_i) are distributed normally under N(μ_i, Σ_i); under the simplifying assumptions that prior probabilities are equal and features are independent, the Bayes decision rule becomes the following recipe: for an object with feature vector x, the class c_i is assigned for which the distance ||x − μ_i|| from x to the mean μ_i is the smallest among all distances ||x − μ_j|| [16]. In this way the nearest neighbor method can be formally introduced and justified. In many real cases, estimation of the densities p(x|c_i) is for obvious reasons difficult; nonparametric techniques were proposed to bypass this difficulty. The idea of a Parzen window in Parzen [58], see [16], consists of considering a sample S of objects in a d-space, their number potentially increasing to infinity, and a sequence R_1, R_2, ..., R_n, ... of d-hypercubes of edge lengths h_n, hence of volumes V_n = h_n^d, with R_n containing k_n objects from the sample; the estimate p_n(x) for the density induced from R_n is clearly p_n(x) = (k_n/|S|)/V_n.
Under the additional conditions that lim_{n→∞} V_n = 0 and lim_{n→∞} n·V_n = ∞, the sequence p_n(x) converges to the density p(x) wherever the latter is continuous [16]. The difficulty with this approach lies in the question of how to choose the regions R_i and their parameters, and here the idea of nearest neighbors returns: the modification rests on the idea that the sampling window should depend on the training sample itself and adapt to its structure. This can be implemented as follows: one centers the region R about the object with features x and resizes it until it absorbs k training objects (the k nearest neighbors); if k_i of them fall into the class c_i, then the estimate for the probability density is p(c_i|x) = k_i/k [16]. Letting k vary with n so that lim_{n→∞} k/n = 0 and lim_{n→∞} k = ∞ secures that the estimate sequence p_n(c_i|x) converges in probability to p(c_i|x) at continuity points of the latter [16]. The k-nearest neighbor method finds its justification through considerations like the one recalled above. The 1-nearest neighbor method is the simplest variant of k nearest neighbors; in the sample space, with the help of a selected metric, it builds neighborhoods of training objects, splitting the space into cells which together compose the Voronoi tessellation (Voronoi diagram) of the space, see [79]. The basic theorem by Cover and Hart [13] asserts that the error P in classification by the 1-nearest neighbor method is related to the error P_B of the Bayesian decision rule by the inequality P_B ≤ P ≤ P_B·(2 − (k/(k − 1))·P_B), where k is the number of decision classes. Although the result is valid under the assumption of an infinite sample, it can be regarded as an estimate of the limits of error of the 1-nearest neighbor method, as at most twice the error of the Bayes classifier, also in the case of finite but reasonably large samples. A discussion of analogous results for the k-nearest neighbor method can be found in Ripley [81]. Metrics [see Glossary: “Distance functions (metrics)”] used in nearest neighbor-based techniques can be of varied form; the basic distance function is the Euclidean metric in a d-space, E(x, y) = [Σ_{i=1}^{d} (x_i − y_i)^2]^{1/2}, and its generalization to the class of Minkowski metrics L_p(x, y) = [Σ_{i=1}^{d} |x_i − y_i|^p]^{1/p} for p ≥ 1, with the limiting cases L_1(x, y) = Σ_{i=1}^{d} |x_i − y_i| (the Manhattan metric) and L_∞(x, y) = max{|x_i − y_i| : i = 1, 2, ..., d}. These metrics can be modified by scaling factors (weights) applied to coordinates; e.g., L_1^w(x, y) = Σ_{i=1}^{d} w_i·|x_i − y_i| is the Manhattan metric modified by the non-negative weight vector w [119] and subject to adaptive training. Metrics like the above can be detrimental to the nearest neighbor method in the sense that nearest neighbors are not invariant with respect to transformations like translations, shifts and rotations.
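The metric family above and the 1-nearest neighbor rule can be sketched in a few lines of pure Python; the toy training sample and weights below are hypothetical, for illustration only:

```python
import math

def minkowski(x, y, p):
    """L_p distance; p=1 gives the Manhattan metric, p=2 the Euclidean one."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def chebyshev(x, y):
    """Limiting case L_inf: the maximal coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

def weighted_manhattan(x, y, w):
    """Manhattan metric scaled by a non-negative weight vector w."""
    return sum(wi * abs(a - b) for wi, a, b in zip(w, x, y))

def nn_classify(train, x, dist):
    """1-nearest neighbor rule: adopt the class of the closest training object."""
    return min(train, key=lambda obj: dist(obj[0], x))[1]

# toy training sample: (feature vector, class)
train = [((0.0, 0.0), "a"), ((1.0, 1.0), "b")]
print(nn_classify(train, (0.2, 0.1), lambda u, v: minkowski(u, v, 2)))  # -> a
```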
A remedy for this difficulty was proposed in the notion of the tangent distance by Simard, Le Cun and Denker [86]. The idea consists of replacing each training as well as each test object, represented as a vector x in the feature space R^k, with its invariance manifold, see Hastie, Tibshirani and Friedman [32], consisting of x along with all its images under allowable transformations: translation, scaling of axes, rotation, shear, line thickening. Instead of measuring distances among object-representing vectors x, y, one can find the shortest distances among the invariance manifolds induced by x and y; this task can be further simplified by finding for each x the tangent hyperplane at x to its invariance manifold and measuring the shortest distance between these tangents. For a vector x and the matrix T of basis tangent vectors at x to the invariance manifold, the equation of the tangent
hyperplane is H(x): y = x + Ta, where a ∈ R^k. The simpler, “one-sided” version of the tangent metric method assumes that tangent hyperplanes are produced for training objects x, whereas for a test object x′ no invariance manifold is defined; instead, the tangent hyperplane nearest to x′ is found and x′ is classified as the x that defined that nearest tangent hyperplane. In this case the distance to the tangent hyperplane is given by min_a E(x′, x + Ta) when the Euclidean metric is chosen as the underlying distance function. In the “two-sided” variant, the tangent hyperplane at x′ is found as well, and x′ is classified as the training object x for which the distance between the tangent hyperplanes H(x), H(x′) is the smallest among the distances from H(x′) to tangent hyperplanes at training objects. A simplified variant of the tangent distance method was proposed by Abu-Mostafa, see [32], consisting of producing all images of training and test object representations in the feature space, regarding them as, respectively, training and test objects, and using standard nearest neighbor techniques. Among problems related to metrics is also the dimensionality problem: in high-dimensional feature spaces, nearest neighbors can be located at large distances from the test point in question, thus violating the basic principle of window choice justifying the method. As pointed out in [32], the median of the radius R of the sphere about the origin containing the nearest neighbor among n objects uniformly distributed in the cube [−1/2, 1/2]^p equals [(1/V_p)·(1 − (1/2)^{1/n})]^{1/p}, where V_p·r^p is the volume of the sphere of radius r in p-space; it approaches the value 0.5 with increasing p for each n, i.e., a nearest neighbor is asymptotically located on the surface of the cube. A related result in Aggarwal et al.
[3] asserts that the expected value of the quotient k^{1/2}·(max{||x||_p, ||y||_p} − min{||x||_p, ||y||_p})/min{||x||_p, ||y||_p}, for vectors x, y uniformly distributed in the cube (0, 1)^k, is asymptotically (k → ∞) equal to C·(2p + 1)^{−1/2} for a constant C (these phenomena are collectively known as the dimensionality curse). To cope with this problem, a method called discriminant adaptive nearest neighbor was proposed by Hastie and Tibshirani [31]; the method consists of adaptive modification of the metric in neighborhoods of test vectors in order to ensure that the probabilities of decision classes do not vary much there; the direction in which the neighborhood is stretched is the direction of the least change in class probabilities. This idea, that the metric used in nearest neighbor finding should depend locally on the training set, was also used in constructions of several metrics in the realm of decision systems, i.e., objects described in the attribute-value language. For nominal values, the metric VDM (Value Difference Metric) of Stanfill and Waltz [97] takes into account the conditional probabilities P(d = v | a_i = v_i) of a decision value given the attribute value, estimated over the training set Trn, and on this basis constructs in the value set V_i of the attribute a_i a metric δ_i(v_i, v′_i) = Σ_{v ∈ V_d} |P(d = v | a_i = v_i) − P(d = v | a_i = v′_i)|. The global metric is obtained by combining the metrics δ_i for all attributes a_i ∈ A according to one of the many-dimensional metrics, e.g., the Minkowski metrics. This idea was also applied to numerical attributes in Wilson and Martinez [116] in the metrics IVDM (Interpolated VDM) and WVDM (Windowed VDM), see also [119]. A modification of the WVDM metric, based again on the idea of using probability densities in determining the window size, was proposed as the DBVDM metric [119]. Implementation of the k-nearest neighbor method requires satisfactorily fast methods for finding the k nearest neighbors; the need to accelerate search in the training space led to a number of indexing methods. The method due to Fukunaga and Narendra [24] splits the training sample into a hierarchy of clusters, with the idea that in a search for the nearest neighbor the clusters are examined in a top-down manner and clusters evaluated as unfit for the search are pruned from the hierarchy. An alternative is the bottom-up clustering scheme due to Ward [113]. Distances between test objects and clusters are evaluated as distances between test objects and cluster centers, which depend on the chosen strategy.
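The per-attribute VDM construction can be illustrated by estimating the conditional probabilities from a toy training sample; the attribute and decision values below are hypothetical, and the sketch assumes the plain Stanfill-Waltz form of the metric:

```python
from collections import Counter, defaultdict

def vdm_metric(train, attr):
    """Build the per-attribute VDM distance delta(v, v') =
    sum over decision values c of |P(d=c | a=v) - P(d=c | a=v')|,
    with conditional probabilities estimated from the training sample."""
    # train: list of (attribute-value dict, decision)
    counts = defaultdict(Counter)          # attribute value -> Counter of decisions
    for obj, d in train:
        counts[obj[attr]][d] += 1
    decisions = {d for _, d in train}
    def delta(v, vp):
        n_v, n_vp = sum(counts[v].values()), sum(counts[vp].values())
        return sum(abs(counts[v][c] / n_v - counts[vp][c] / n_vp)
                   for c in decisions)
    return delta

train = [({"color": "red"}, "yes"), ({"color": "red"}, "no"),
         ({"color": "blue"}, "no")]
delta = vdm_metric(train, "color")
# red: P(yes)=0.5, P(no)=0.5; blue: P(yes)=0, P(no)=1
print(delta("red", "blue"))  # -> 1.0
```

A global metric would then combine such per-attribute deltas, e.g., by a Minkowski formula over all attributes.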
For features that allow one to map objects onto vectors in a vector space, and for clusters in the form of hypercubes, a number of tree-based indexing methods exist, among them k-d trees due to Bentley [8] and quad trees in Finkel and Bentley [20]; for high-dimensional feature spaces, where the dimensionality curse plays a role, more specialized tree-based indexing methods were proposed: X-trees by Berchtold, Keim and Kriegel [9], SR-trees in Katayama and Satoh [34], and TV-trees by Lin, Jagadish and Faloutsos [45]. For general, also non-numerical, features, more general cluster regions are necessary, and a number of indexing methods were proposed, like BST due to Kalantari and McDonald [33], GHT by Uhlmann [110], GNAT in Brin [10], SS-trees due to White and Jain [115] and M-trees in Ciaccia, Patella and Zezula [11]. A selection of cluster centers for numerical features may be performed as mean values of the feature vectors in the cluster; in the general case it may be based on minimization of the variance of distances within the cluster [119]. The search for the k nearest neighbors of a query object is done in the tree in depth-first order from the root down.
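A minimal sketch in the spirit of Bentley's k-d trees [8] illustrates the depth-first search with pruning of subtrees that cannot contain a closer point; the toy points are hypothetical and no balancing optimizations are attempted:

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split points on alternating coordinates."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    """Depth-first search from the root down; prune a subtree when the
    splitting plane is farther away than the current best distance."""
    if node is None:
        return best
    d = math.dist(node["point"], query)
    if best is None or d < best[1]:
        best = (node["point"], d)
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    if abs(diff) < best[1]:           # the far side may still hold a closer point
        best = nearest(far, query, best)
    return best

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2))[0])  # -> (8, 1)
```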
Applications of the k-nearest neighbor method include satellite image recognition, protein sequence matching, spatial databases, information retrieval, stock market and weather forecasts; see Aha [4] and Veloso [111].

Case-Based Reasoning

Nearest neighbor techniques can be seen abstractly as tasks of classifying an entity e by recovering from a base of already classified entities the entity E nearest to e and adopting the class of E as the class of e. Regarded in this light, classification and, more generally, problem-solving tasks may be carried out by recovering from a base of already solved cases the case most similar to the case currently being solved and adopting its solution, possibly with some modifications, as the solution of the current problem. The distinction between this general idea and nearest neighbor techniques is that in the former, newly solved cases are expected to be used in the classification of subsequent cases. This is the underlying idea of Case-Based Reasoning. The Case-Based Reasoning methodology emerged in response to difficulties encountered by the Knowledge-Based Reasoning systems, like DENDRAL, MYCIN, or PROSPECTOR, that dominated the first period of growth of Artificial Intelligence. These systems relied on knowledge of the subject in the form of decision rules or models of entities; the problems were of varied nature: the complexity of knowledge extraction, the difficulty of managing large amounts of information, and difficulties with maintaining and refreshing knowledge, see [114]. In spite of the search for improvements and a plethora of ideas and concepts aimed at solving the mentioned difficulties, new ideas based on reasoning by analogy came forth.
Resorting to analogous cases eliminates the need for modeling knowledge, as knowledge here consists of cases along with their solutions and methods; implementation consists in identifying the features determining cases; large volumes of information can be managed; and learning amounts to acquiring new cases [114]. Case-Based Reasoning (CBR) can be traced back to a few sources of origin. On the philosophical level, L. Wittgenstein [117] is quoted in Aamodt and Plaza [2] for his idea of natural concepts and the claim that they be represented as collections of cases rather than of features. An interest in the cognitive aspects of learning, due partially to the contemporary growth of interest in theories of language inspired in large part by the work of N. Chomsky, and in the psychological aspects of learning exploiting the concept of situation, was reflected in the machine-learning area by the works of Schank and Abelson [84] and Schank [83], regarded in [114] as pioneering the CBR area. Schank and
Abelson's idea was to use scripts as the tool for describing memory of situation patterns. The role of memory in reasoning by cases was analyzed by Schank [83], leading to memory organization packets (MOPs), and by Porter as the case memory model. These studies led to the first CBR-based systems: CYRUS by Kolodner [37], MEDIATOR by Simpson [87], and PROTOS by Porter and Bareiss [78], among many others; see [114] for a more complete listing. The methodology of CBR systems was described in [2] as consisting of four distinct processes repeated in cycles: (1) RETRIEVE: matching the case currently being solved against the case base and fetching the cases most similar, according to the adopted similarity measure, to the current case; (2) REUSE: making the solutions of the most similar cases fit the current case; (3) REVISE: revising the fetched solution to adapt it to the current case, taking place after the REUSED solution has been proposed and evaluated and turns out to need revision; (4) RETAIN: retaining the REVISED solution in the case base, along with the case, as a new solved case that in turn can be used in solving future cases. Whereas the process of reasoning by cases can be adapted to model reasoning by nearest neighbors, the difference between the two is clear from Aamodt and Plaza's description of the CBR process: in CBR, the case base is incrementally enlarged and retained cases become its valid members, used in the process of solving new cases; in nearest neighbors, the sharp distinction between training and test objects is kept throughout the classification process. From an implementation point of view, the problem of representing cases is the first to comment upon. According to Kolodner [38], case representation should secure functionality and ease of information acquisition from the case.
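The four-process cycle can be sketched schematically; the toy Python loop below is a minimal illustration with hypothetical similarity, adaptation and evaluation functions, not a description of any particular CBR system:

```python
def solve(case_base, problem, similar, adapt, evaluate):
    """One pass through the CBR cycle of Aamodt and Plaza [2]."""
    # RETRIEVE: fetch the most similar stored case
    retrieved = max(case_base, key=lambda c: similar(c["problem"], problem))
    # REUSE: adapt the retrieved solution to the new problem
    proposed = adapt(retrieved, problem)
    # REVISE: evaluate and, if needed, repair the proposed solution
    solution = evaluate(problem, proposed)
    # RETAIN: store the new solved case so it can serve future queries
    case_base.append({"problem": problem, "solution": solution})
    return solution

# toy domain (hypothetical): problems and solutions are numbers
base = [{"problem": 2, "solution": 4}, {"problem": 10, "solution": 20}]
ans = solve(base, 3,
            similar=lambda p, q: -abs(p - q),                     # closer is more similar
            adapt=lambda c, q: c["solution"] * q / c["problem"],  # crude rescaling
            evaluate=lambda q, s: s)                              # accept as proposed
print(ans, len(base))  # -> 6.0 3
```

Note that the retained case (the appended dictionary) immediately becomes a valid member of the base, unlike the fixed training set of nearest neighbor classification.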
The structure of a case is a triple (problem, solution, outcome): the problem is a query along with the state of the world (the situation) at the moment the query is posed; the solution is the proposed change in the state of the world; and the outcome is the new state after the query is answered. The solution may come along with the method by which it was obtained, which is important when revision is necessary. In representing cases, many formalisms have been used, like frames/objects, rules, semantic nets, and predicate calculi/first-order logics. Retrieval of cases from the case base requires a form of case indexing; Kolodner [38] recommended hand-chosen indices as more human-oriented than automated methods, but many automated strategies for indexing were introduced and verified, among them (listed in [114]):
checklist-based indexing in Kolodner [38], which indexes cases by features and dimensions, like the MEDIATOR indexing that takes into account the types and functions of the objects being disputed and the relationships among disputants; difference-based indexing, which selects the features discerning best among cases, as in CYRUS by Kolodner [37]; methods based on inductive learning and rule induction, which extract the features used in indexing; similarity methods, which produce abstractions of cases sharing common sets of features and use the remaining features to discern among cases; and explanation-based methods, which select features by inspecting each case and deciding on the set of relevant features. The retrieval mechanism relies on indexing and memory organization to retrieve cases; the process of selecting the most similar case uses some measure of similarity and a strategy for finding the next matching cases in the case base. Among these strategies the nearest neighbor strategy can be found, based on chosen similarity measures for particular features; an exemplary similarity measure is the one adopted in the system ReMind in Kolodner [38]: sim(I, R) = Σ_{f ∈ features} w_f·sim_f(f(I), f(R)) / Σ_{f ∈ features} w_f, where w_f is a weight assigned to the feature f, sim_f is the chosen similarity measure on values of the feature f, I is the new case and R is the retrieved case. Other methods are based on inductive learning, template retrieval, etc.; see [114]. Memory organization should provide a trade-off between the wealth of information stored in the case base and the efficiency of retrieval; two basic case memory models emerged: the dynamic memory model by Schank [83] and Kolodner [37], and the category-exemplar model due to Porter and Bareiss [78].
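The ReMind-style aggregated similarity is a weighted average of feature-level similarities; a minimal sketch with hypothetical features, cases and feature-level measures:

```python
def aggregated_similarity(new_case, retrieved, weights, sims):
    """sim(I, R) = sum_f w_f * sim_f(f(I), f(R)) / sum_f w_f."""
    num = sum(w * sims[f](new_case[f], retrieved[f]) for f, w in weights.items())
    return num / sum(weights.values())

# hypothetical feature-level similarity measures on a toy pair of cases
sims = {"age": lambda a, b: 1 - abs(a - b) / 100,          # graded similarity
        "smoker": lambda a, b: 1.0 if a == b else 0.0}     # exact match
I = {"age": 40, "smoker": True}   # new case
R = {"age": 50, "smoker": True}   # retrieved case
print(aggregated_similarity(I, R, {"age": 1.0, "smoker": 2.0}, sims))  # -> 0.9666...
```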
On the basis of Schank's MOPs (Memory Organization Packets), Episodic Memory Organization Packets in Kolodner [37] and Generalized Episodes (GEs) in Koton [41] were proposed; a generalized episode consists of cases, norms and indices, where norms are features shared by all cases in the episode and indices are features discerning cases in the episode. The memory is organized as a network of generalized episodes, cases, index names and index values. New cases are dynamically incorporated into new episodes. The category-exemplar model divides cases (exemplars) into categories, according to case features, and indexes cases by case links from
categories to the cases in them; other indices are feature links, from features to categories or cases, and difference links, from categories to similar cases that differ to a small degree in features. Categories are related within a semantic network. Reuse and adaptation follow two main lines. One is structural adaptation, see [38], in which the solution of the retrieved case is directly adapted by the adaptation rules. The other is derivational adaptation in Simpson [87], in which the methods that produced the retrieved solution undergo adaptation in order to yield the solution for the new case. Various techniques have been proposed and applied in the adaptation process; see, e.g., [114]. Applications based on CBR up to the early 1990s are listed in [38]; to mention a few: JUDGE (Bain): a system for sentencing in murder, assault and manslaughter cases; KICS (Yang and Robertson): a system for deciding about building regulations; CASEY (Koton): a system for heart failure diagnosis; CADET (Sycara): a system for assisting in mechanical design; TOTLEC (Costas and Kashyap): a system for planning in the process of manufacturing design; PLEXUS (Alterman): a system for plan adaptation; CLAVIER (Hennessy and Hinkle): a system implemented at Lockheed to control and modify the autoclave processes in part manufacturing; and CaseLine (Magaldi): a system implemented at British Airways for maintenance and repair of the Boeing fleet.
Complexity Issues

Complexity of Computations in Information Systems

The basic computational problems: (DM) computing the discernibility matrix M(U, A) from an information system (U, A); (MLA) membership in the lower approximation; and (RD) rough definability of sets, are of polynomial complexity: (DM) and (RD) in time O(n^2), (MLA) in time O(n) [90]. The core, CORE(U, A), of an information system (U, A) is the set of all indispensable attributes, i.e., CORE(U, A) = {a ∈ A : ind(A) ≠ ind(A \ {a})}. As proved in [90], CORE(U, A) = {a ∈ A : c_{i,j} = {a} for some entry c_{i,j} of the discernibility matrix}. Thus, finding the core requires O(n^2) time [90]. The reduct membership problem, i.e., checking whether a given set B of attributes is a reduct, requires O(n^2) time [90]. The reduct set problem, i.e., finding the set of all reducts, is polynomially equivalent to the problem of converting a conjunctive form of a monotone Boolean function into its reduced disjunctive form [90]. The number of reducts
in an information system with n attributes can be exponential in n, with an upper bound of C(n, ⌊n/2⌋) (the binomial coefficient) reducts [89]. The problem of finding a reduct of minimal cardinality is NP-hard [90]; thus, heuristics for reduct finding are in the general case necessary; they are based on the Johnson algorithm, simulated annealing, etc. Genetic algorithm-based hybrid algorithms give short reducts in relatively short time [120].

Complexity of Template-Based Computations

Templates (see Subsect. “Similarity” of Sect. “Rough Set Theory. Extensions”) were used in rule induction based on similarity relations [53]. Decision problems related to templates are [53]:

TEMPLATE SUPPORT (TS)
Instance: an information system (U, A); natural numbers s, l.
Question: does there exist a template of length l and support s?

OPTIMAL TEMPLATE SUPPORT (OTS)
Input: an information system (U, A); a natural number l.
Output: a template T of length l and maximal support.
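As an illustration of the TS problem, the following brute-force check on a toy information system enumerates candidate templates (conjunctions of attribute-value descriptors); the enumeration is exponential in the number of attributes, in line with the hardness results, and the reading of “support s” as “support at least s” is an assumption of this sketch:

```python
from itertools import combinations

def support(template, universe):
    """Number of objects satisfying every descriptor (attribute, value)."""
    return sum(all(obj[a] == v for a, v in template) for obj in universe)

def exists_template(universe, attrs, length, s):
    """Brute-force answer to the TEMPLATE SUPPORT decision problem:
    is there a template of the given length with support >= s?"""
    values = {a: {obj[a] for obj in universe} for a in attrs}
    for subset in combinations(attrs, length):
        def assignments(rest):
            # every value assignment over the chosen attributes
            if not rest:
                yield ()
                return
            a, *tail = rest
            for v in values[a]:
                for t in assignments(tail):
                    yield ((a, v),) + t
        if any(support(t, universe) >= s for t in assignments(list(subset))):
            return True
    return False

U = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 0, "b": 1}]
print(exists_template(U, ["a", "b"], length=1, s=2))  # -> True (e.g. the template a=1)
```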
The problem TS is NP-complete and the problem OTS is NP-hard [53].

TEMPLATE QUALITY PROBLEM (TQP)
Instance: an information system (U, A); a natural number k.
Question: does there exist a template of quality greater than k?
In the case quality(T) = support(T) + length(T), the problem TQP can be solved in polynomial time; in the case quality(T) = support(T)·length(T), the problem TQP is conjectured to be NP-hard. In [53] some methods for extracting templates of satisfactory quality are discussed.

Complexity of Discretization

Discretization (see Subsect. “Discretization” of Sect. “Rough Set Theory. Extensions”) offers a number of decision problems. In the case of numeric attribute values, the decision problem is to check for an irreducible set of cuts.

IRREDUCIBLE SET OF CUTS (ISC)
Instance: a decision system (U, A, d) with |A| ≥ 2; a natural number k.
Question: does there exist an irreducible set of cuts of cardinality less than k?
ISC is NP-complete, and its optimization version, i.e., deciding whether there exists an optimal set of cuts, is NP-hard [53]; in the case |A| = 1, ISC is of complexity O(|U| log |U|) [53]. For non-numerical attribute values, the counterpart of a cut system is a partition of the value set. For a partition P_a of an attribute value set V_a into k sets, the rank r(P_a) is k; for a family P = {P_a : a ∈ A} of partitions of all value sets of attributes in a decision system (U, A, d), the rank of P is r(P) = Σ_{a ∈ A} r(P_a); the notion of a consistent partition P mimics that of a consistent cut system. The corresponding problem is:

SYMBOLIC VALUE PARTITION PROBLEM (SVPP)
Input: a decision system (U, A, d); a set B of attributes; a natural number k.
Output: a B-consistent partition of rank less than or equal to k.
The problem SVPP is NP-complete [53]. Problems of generating satisfactory cut systems require heuristics; the Maximal Discernibility (MD) heuristic [51] is also discussed in [7,53].

Complexity of Problems Related to k-Nearest Neighbors

In the case of n training objects in d-space, the search for the nearest neighbor (k = 1) requires O(dn^2) time and O(n) space. A parallel implementation [16] is O(1) in time and O(n) in space. This implementation checks, for each of the Voronoi regions induced from the training sample, whether the test object falls inside it, and thus would receive its class label, by checking for each of the d “faces” of the region whether the object is on the region's side of the face. The large complexity associated with the storage of n training objects called for eliminating some “redundant” training objects; the technique of editing in Hart [30] consists of editing away, i.e., eliminating from the training set, objects that are surrounded in the Voronoi diagram by objects in the same decision class; the complexity of the editing algorithm is O(d^3·n^{⌊d/2⌋}·log n). The complexity of problems related to Voronoi diagrams has been researched in many aspects; the Voronoi diagram of n points in 2-space can be constructed in time O(n log n) [79]; in d-space the complexity is Θ(n^{⌊d/2⌋}) (Klee [35]). Such also is the complexity of finding the nearest neighbor in the corresponding space. Other editing techniques, based on graphs, were proposed: Gabriel graphs in Gabriel and Sokal [25] and relative neighborhood graphs in Toussaint [107].
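The condensed nearest neighbor rule of Hart [30] reduces storage by keeping only a subset of the training sample that still classifies every training object correctly under the 1-NN rule; a minimal sketch on toy one-dimensional data (not the Voronoi-based editing variant discussed above):

```python
import math

def condense(train):
    """Hart's condensing heuristic: grow a store of prototypes by
    absorbing every training object the current store misclassifies."""
    store = [train[0]]
    changed = True
    while changed:
        changed = False
        for x, label in train:
            closest = min(store, key=lambda s: math.dist(s[0], x))
            if closest[1] != label:       # misclassified: absorb into the store
                store.append((x, label))
                changed = True
    return store

train = [((0.0,), "a"), ((0.1,), "a"), ((0.2,), "a"),
         ((1.0,), "b"), ((1.1,), "b")]
print(len(condense(train)))  # -> 2 prototypes suffice for this toy sample
```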
Future Directions

It seems reasonable to include the following among the problems that could occupy the attention of researchers, notwithstanding the paradigm applied: Compression of the knowledge encoded in the training set, in order to treat large volumes of data. Granulation of knowledge as a means of compression; this calls for efficient formal granulation models, based on similarity relations rather than on metrics, that would preserve large parts of the information encoded in the original data. Search for the most critical decision/classification rules among the bulk induced from data; this search should lead to a deeper insight into relations in data. The missing values problem, important especially for medical and biological data; its solution should also lead to new, deeper relations and dependencies among attributes. Working out methods for analyzing complex data, like molecular, genetic and medical data, that may include signals and images.
Bibliography

Primary Literature

1. Aamodt A (1991) A knowledge intensive approach to problem solving and sustained learning. Dissertation, University of Trondheim, Norway. University Microfilms PUB 92-08460
2. Aamodt A, Plaza E (1994) Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications 7:39–59
3. Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional space. In: Proceedings of the eighth international conference on database theory, London, pp 420–434
4. Aha DW (1998) The omnipresence of case-based reasoning in science and applications. Knowl-Based Syst 11:261–273
5. Bayes T (1763) An essay towards solving a problem in the doctrine of chances. Philos Trans R Soc (London) 53:370–418
6. Bazan JG (1998) A comparison of dynamic and non-dynamic rough set methods for extracting laws from decision tables. In: Polkowski L, Skowron A (eds) Rough sets in knowledge discovery, vol 1. Physica, Heidelberg, pp 321–365
7. Bazan JG et al (2000) Rough set algorithms in classification problems. In: Polkowski L, Tsumoto S, Lin TY (eds) Rough set methods and applications. New developments in knowledge discovery in information systems. Physica, Heidelberg, pp 49–88
8. Bentley JL (1975) Multidimensional binary search trees used for associative searching. Commun ACM 18:509–517
9. Berchtold S, Keim D, Kriegel HP (1996) The X-tree: an index structure for high dimensional data. In: Proceedings of the 22nd international conference on very large databases VLDB'96, Mumbai. Morgan Kaufmann, San Francisco, pp 29–36
10. Brin S (1995) Near neighbor search in large metric spaces. In: Proceedings of the 21st international conference on very large databases VLDB'95, Zurich. Morgan Kaufmann, San Francisco, pp 574–584
11. Ciaccia P, Patella M, Zezula P (1997) M-tree: an efficient access method for similarity search in metric spaces. In: Proceedings of the 23rd international conference on very large databases VLDB'97, Athens. Morgan Kaufmann, San Francisco, pp 426–435
12. Clark P, Evans F (1954) Distance to nearest neighbor as a measure of spatial relationships in populations. Ecology 35:445–453
13. Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory IT-13(1):21–27
14. Czyżewski A et al (2004) Musical phrase representation and recognition by means of neural networks and rough sets. In: Transactions on rough sets, vol 1. Lecture Notes in Computer Science, vol 3100. Springer, Berlin, pp 254–278
15. Deja R (2000) Conflict analysis. In: Polkowski L, Tsumoto S, Lin TY (eds) Rough set methods and applications. New developments in knowledge discovery in information systems. Physica, Heidelberg, pp 491–520
16. Duda RO, Hart PE, Stork DG (2001) Pattern classification. Wiley, New York
17. Düntsch I, Gediga G (1998) GROBIAN. In: Polkowski L, Skowron A (eds) Rough sets in knowledge discovery 2. Physica, Heidelberg, pp 555–557
18. Faucett WM (1955) Compact semigroups irreducibly connected between two idempotents. Proc Am Math Soc 6:741–747
19. Fernandez-Baizan MC et al (1998) RSDM: Rough sets data miner. A system to add data mining capabilities to RDBMS. In: Polkowski L, Skowron A (eds) Rough sets in knowledge discovery 2. Physica, Heidelberg, pp 558–561
20. Finkel R, Bentley J (1974) Quad trees: a data structure for retrieval and composite keys. Acta Inf 4:1–9
21. Fix E, Hodges JL Jr (1951) Discriminatory analysis: Nonparametric discrimination: Consistency properties. USAF Sch Aviat Med 4:261–279
22. Fix E, Hodges JL Jr (1952) Discriminatory analysis: Nonparametric discrimination: Small sample performance. USAF Sch Aviat Med 11:280–322
23. Frege G (1903) Grundlagen der Arithmetik II. Hermann Pohle, Jena
24. Fukunaga K, Narendra PM (1975) A branch and bound algorithm for computing k-nearest neighbors. IEEE Trans Comput 24:750–753
25. Gabriel KR, Sokal RR (1969) A new statistical approach to geographic variation analysis. Syst Zool 18:259–278
26. Greco S, Matarazzo B, Słowiński R (1999) On joint use of indiscernibility, similarity and dominance in rough approximation of decision classes. In: Proceedings of the 5th international conference of the decision sciences institute, Athens, Greece, pp 1380–1382
27. Grzymala-Busse JW (1992) LERS – a system for learning from examples based on rough sets. In: Słowiński R (ed) Intelligent decision support. Handbook of advances and applications of the rough sets theory. Kluwer, Dordrecht, pp 3–18
28. Grzymala-Busse JW (2004) Data with missing attribute values: Generalization of indiscernibility relation and rule induction. In: Transactions on rough sets, vol 1. Lecture Notes in Computer Science, vol 3100. Springer, Berlin, pp 78–95
29. Grzymala-Busse JW, Ming H (2000) A comparison of several approaches to missing attribute values in data mining. In: Lecture notes in AI, vol 2005. Springer, Berlin, pp 378–385
30. Hart PE (1968) The condensed nearest neighbor rule. IEEE Trans Inf Theory IT-14(3):515–516
31. Hastie T, Tibshirani R (1996) Discriminant adaptive nearest-neighbor classification. IEEE Pattern Recognit Mach Intell 18:607–616
32. Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning. Springer, New York
33. Kalantari I, McDonald G (1983) A data structure and an algorithm for the nearest point problem. IEEE Trans Softw Eng 9:631–634
34. Katayama N, Satoh S (1997) The SR-tree: an index structure for high dimensional nearest neighbor queries. In: Proceedings of the 1997 ACM SIGMOD international conference on management of data, Tucson, AZ, pp 369–380
35. Klee V (1980) On the complexity of d-dimensional Voronoi diagrams. Arch Math 34:75–80
36. Klösgen W, Żytkow J (eds) (2002) Handbook of data mining and knowledge discovery. Oxford University Press, Oxford
37. Kolodner JL (1983) Maintaining organization in a dynamic long-term memory. Cogn Sci 7:243–280
38. Kolodner JL (1993) Case-based reasoning. Morgan Kaufmann, San Mateo
39. Komorowski J, Skowron A et al (1998) The ROSETTA software system. In: Polkowski L, Skowron A (eds) Rough sets in knowledge discovery 2. Physica, Heidelberg, pp 572–575
40. Kostek B (2007) The domain of acoustics seen from the rough set perspective. In: Transactions on rough sets, vol VI. Lecture Notes in Computer Science, vol 4374. Springer, Berlin, pp 133–151
41. Koton P (1989) Using experience in learning and problem solving. PhD dissertation MIT/LCS/TR-441, MIT, Laboratory of Computer Science, Cambridge
42. Kowalczyk W (1998) TRANCE: A tool for rough data analysis, classification and clustering. In: Polkowski L, Skowron A (eds) Rough sets in knowledge discovery 2. Physica, Heidelberg, pp 566–568
43. Krawiec K et al (1998) Learning decision rules from similarity based rough approximations. In: Polkowski L, Skowron A (eds) Rough sets in knowledge discovery, vol 2. Physica, Heidelberg, pp 37–54
44. Le´sniewski S (1916) Podstawy Ogólnej Teoryi Mnogosci (On the Foundations of Set Theory), in Polish. The Polish Scientific Circle, Moscow; see also a later digest: (1982) Topoi 2:7–52 45. Lin KI, Jagadish HV, Faloustos C (1994) The TV-tree: an index structure for high dimensional data. VLDB J 3:517–542 46. Lin TY (1997) From rough sets and neighborhood systems to information granulation and computing with words. In: 5th European Congress on Intelligent Techniques and Soft Computing, 1997 Aachen, Verlagshaus Mainz, Aachen, pp 1602– 1606 47. Lin TY (2005) Granular computing: Examples, intuitions, and modeling. In: Proceedings of IEEE 2005 conference on granular computing GrC05, Beijing, China. IEEE Press, pp 40–44, IEEE Press, New York 48. Ling C-H (1965) Representation of associative functions. Publ Math Debrecen 12:189–212 49. Michalski RS et al (1986) The multi-purpose incremental learning system AQ15 and its testing to three medical domains. In: Proceedings of AAAI-86. Morgan Kaufmann, San Mateo, pp 1041–1045 50. Mostert PS, Shields AL (1957) On the structure of semigroups on a compact manifold with a boundary. Ann Math 65:117– 143 51. Nguyen SH (1997) Discretization of real valued attributes: Boolean reasoning approach. Ph D Dissertation, Warsaw University, Department of Mathematics, Computer Science and Mechanics 52. Nguyen SH, Skowron A (1995) Quantization of real valued attributes: Rough set and Boolean reasoning approach. In: Proceedings 2nd annual joint conference on information sciences, Wrightsville Beach, NC, pp 34–37 53. Nguyen SH (2000) Regularity analysis and its applications in Data Mining. In: Polkowski L, Tsumoto S, Lin TY (eds) Rough set methods and applications. New developments in knowledge discovery in information systems. Physica, Heidelberg, pp 289–378 54. Nguyen TT (2004) Handwritten digit recognition using adaptive classifier construction techniques. In: Pal SK, Polkowski L, Skowron A (eds) Rough – neural computing. 
Techniques for computing with words. Springer, Berlin, pp 573–586 55. Novotny M, Pawlak Z (1988) Partial dependency of attributes. Bull Pol Acad Ser Sci Math 36:453–458 56. Novotny M, Pawlak Z (1992) On a problem concerning dependence spaces. Fundam Inform 16:275–287 57. Pal SK, Dasgupta B, Mitra P (2004) Rough-SOM with fuzzy discretization. In: Pal SK, Polkowski L, Skowron A (eds) Rough – neural computing. Techniques for computing with words. Springer, Berlin, pp 351–372 58. Parzen E (1962) On estimation of a probability density function and mode. Ann Math Stat 33(3):128–152 59. Patrick EA, Fisher FP (1970) A generalized k-nearest neighbor rule. Inf Control 16(2):128–152 60. Pawlak Z (1982) Rough sets. Int J Comput Inf Sci 11:341–356 61. Pawlak Z (1985) On rough dependency of attributes in information systems. Bull Pol Acad Ser Sci Tech 33:551–559 62. Pawlak Z (1991) Rough sets: theoretical aspects of reasoning about data. Kluwer, Dordrecht 63. Pawlak Z, Skowron, A (1993) A rough set approach for decision rules generation. In: Proceedings of IJCAI’93 workshop W12. The management of uncertainty in AI. also: ICS Research Report 23/93 Warsaw University of Technology
807
808
Data-Mining and Knowledge Discovery: Case-Based Reasoning, Nearest Neighbor and Rough Sets
64. Pawlak Z, Skowron A (1994) Rough membership functions. In: Yaeger RR, Fedrizzi M, Kasprzyk J (eds) Advances in the Dempster–Schafer theory of evidence. Wiley, New York, pp 251–271 65. Peters J, Ramanna S (2004) Approximation space for software models. In: Transactions on rough sets, vol I. Lecture Notes in Computer Science, vol 3100. Springer, Berlin, pp 338–355 66. Poincaré H (1902) Science et hypothese and I’Hypothese.Flammarion, Paris 67. Polkowski L (2003) A rough set paradigm for unifying rough set theory and fuzzy set theory. In: Proceedings RSFDGrC03, Chongqing, China, 2003. Lecture Notes in AI, vol 2639. Springer, Berlin, pp 70–78; also: Fundam Inf 54:67–88 68. Polkowski L (2004) Toward rough set foundations. Mereological approach. In: Proceedings RSCTC04, Uppsala, Sweden. Lecture Notes in AI, vol 3066. Springer, Berlin, pp 8–25 69. Polkowski L (2005) Formal granular calculi based on rough inclusions. In: Proceedings of IEEE 2005 conference on granular computing GrC05, Beijing, China. IEEE Press, New York, pp 57–62 70. Polkowski L (2005) Rough-fuzzy-neurocomputing based on rough mereological calculus of granules. Int J Hybrid Intell Syst 2:91–108 71. Polkowski L (2006) A model of granular computing with applications. In: Proceedings of IEEE 2006 conference on granular computing GrC06, Atlanta, USA May 10-12. IEEE Press, New York, pp 9–16 72. Polkowski L, Araszkiewicz B (2002) A rough set approach to estimating the game value and the Shapley value from data. Fundam Inf 53(3/4):335–343 73. Polkowski L, Artiemjew P (2007) On granular rough computing: Factoring classifiers through granular structures. In: Proceedings RSEISP’07, Warsaw. Lecture Notes in AI, vol 4585, pp 280–290 74. Polkowski L, Skowron A (1994) Rough mereology. In: Proceedings of ISMIS’94. Lecture notes in AI, vol 869. Springer, Berlin, pp 85–94 75. Polkowski L, Skowron A (1997) Rough mereology: a new paradigm for approximate reasoning. Int J Approx Reason 15(4):333–365 76. 
Polkowski L, Skowron A (1999) Towards an adaptive calculus of granules. In: Zadeh LA, Kacprzyk J (eds) Computing with words in information/intelligent systems, vol 1. Physica, Heidelberg, pp 201–228 ˙ 77. Polkowski L, Skowron A, Zytkow J (1994) Tolerance based rough sets. In: Lin TY, Wildberger M (eds) Soft Computing: Rough sets, fuzzy logic, neural networks, uncertainty management, knowledge discovery. Simulation Councils Inc., San Diego, pp 55–58 78. Porter BW, Bareiss ER (1986) PROTOS: An experiment in knowledge acquisition for heuristic classification tasks. In: Proceedings of the first international meeting on advances in learning (IMAL), Les Arcs, France, pp 159–174 79. Preparata F, Shamos MI (1985) Computational geometry: an introduction. Springer, New York 80. Rauszer C (1985) An equivalence between indiscernibility relations in information systems and a fragment of intuitionistic logic. Bull Pol Acad Ser Sci Math 33:571–579 81. Ripley BD (1997) Pattern recognition and neural networks. Cambridge University Press, Cambridge 82. Skowron A et al (1994) A system for data analysis. http://logic. mimuw.edu.pl/~rses/
83. Schank RC (1982) Dynamic memory: A theory of reminding and learning in computers and people. Cambridge University Press, Cambridge 84. Schank RC, Abelson RP (1977) Scripts, plans, goals and understanding. Lawrence Erlbaum, Hillsdale 85. Semeniuk-Polkowska M (2007) On conjugate information systems: A proposition on how to learn concepts in humane sciences by means of rough set theory. In: Transactions on rough sets, vol VI. Lecture Notes in Computer Science, vol 4374. Springer, Berlin, pp 298–307 86. Simard P, Le Cun Y, Denker J (1993) Efficient pattern recognition using a new transformation distance. In: Hanson SJ, Cowan JD, Giles CL (eds) Advances in neural information processing systems, vol 5. Morgan Kaufmann, San Mateo, pp 50– 58 87. Simpson RL (1985) A A computer model of case-based reasoning in problem solving: An investigation in the domain of dispute mediation. Georgia Institute of Technology, Atlanta 88. Skellam JG (1952) Studies in statistical ecology, I, Spatial pattern. Biometrica 39:346–362 89. Skowron A (1993) Boolean reasoning for decision rules generation. In: Komorowski J, Ras Z (eds) Proceedings of ISMIS’93. Lecture Notes in AI, vol 689. Springer, Berlin, pp 295–305 90. Skowron A, Rauszer C (1992) The discernibility matrices and ´ R (ed) Intelligent functions in decision systems. In: Słowinski decision support. Handbook of applications and advances of the rough sets theory. Kluwer, Dordrecht, pp 311–362 91. Skowron A, Stepaniuk J (1996) Tolerance approximation spaces. Fundam Inf 27:245–253 92. Skowron A, Stepaniuk J (2001) Information granules: towards foundations of granular computing. Int J Intell Syst 16:57–85 93. Skowron A, Swiniarski RW (2004) Information granulation and pattern recognition. In: Pal SK, Polkowski L, Skowron A (eds), Rough – Neural Computing. Techniques for computing with words. Springer, Berlin, pp 599–636 94. Slezak D (2000) Various approaches to reasoning with frequency based decision reducts: a survey. 
In: Polkowski L, Tsumoto S, Lin TY (eds) Rough set methods and applications. New developments in knowledge discovery in information systems. Physica, Heidelberg, pp 235–288 ´ 95. Słowinski R, Stefanowski J (1992) “RoughDAS” and “RoughClass” software implementations of the rough set approach. ´ R (ed) Intelligent decision support: Handbook of In: Słowinski advances and applications of the rough sets theory. Kluwer, Dordrecht, pp 445–456 ´ 96. Słowinski R, Stefanowski J (1998) Rough family – software implementation of the rough set theory. In: Polkowski L, Skowron A (eds) Rough sets in knowledge discovery 2. Physica, Heidelberg, pp 580–586 97. Stanfill C, Waltz D (1986) Toward memory-based reasoning. Commun ACM 29:1213–1228 98. Mackie M (2006) Stanford encyclopedia of philosophy: Transworld identity http://plato.stanford.edu/entries/ identity-transworld Accessed 6 Sept 2008 99. Stefanowski J (1998) On rough set based approaches to induction of decision rules. In: Polkowski L, Skowron A (eds) Rough sets in knowledge discovery 1. Physica, Heidelberg, pp 500–529 100. Stefanowski J (2007) On combined classifiers, rule induction and rough sets. In: Transactions on rough sets, vol VI. Lec-
Data-Mining and Knowledge Discovery: Case-Based Reasoning, Nearest Neighbor and Rough Sets
101.
102.
103.
104.
105.
106.
107. 108.
109. 110. 111. 112.
113. 114.
115.
116. 117.
ture Notes in Computer Science, vol 4374. Springer, Berlin, pp 329–350 Stepaniuk J (2000) Knowledge discovery by application of rough set models. In: Polkowski L, Tsumoto S, Lin TY (eds) Rough set methods and applications. New developments in knowledge discovery in information systems. Physica, Heidelberg, pp 138–233 Suraj Z (1998) TAS: Tools for analysis and synthesis of concurrent processes using rough set methods. In: Polkowski L, Skowron A (eds) Rough sets in knowledge discovery, vol 2. Physica, Heidelberg, pp 587–590 Suraj Z (2000) Rough set methods for the synthesis and analysis of concurrent processes. In: Polkowski L, Tsumoto S, Lin TY (eds) Rough set methods and applications. New developments in knowledge discovery in information systems. Physica, Heidelberg, pp 379–490 Swiniarski RW (1998) RoughFuzzyLab: A system for data mining and rough and fuzzy sets based classification. In: Polkowski L, Skowron A (eds) Rough sets in knowledge discovery 2. Physica, Heidelberg, pp 591–593 Swiniarski RW, Skowron A (2004) Independent component analysis, principal component analysis and rough sets in face recognition. In: Transactions on rough sets, vol I. Lecture Notes in Computer Science, vol 3100. Springer, Berlin, pp 392–404 Sycara EP (1987) Resolving adversial conflicts: An approach to integrating case-based and analytic methods. Georgia Institute of Technology, Atlanta Toussaint GT (1980) The relative neighborhood graph of a finite planar set. Pattern Recognit 12(4):261–268 Tsumoto S (1998) PRIMEROSE. In: Polkowski L, Skowron A (eds) Rough sets in knowledge discovery 2. Physica, Heidelberg, pp 594–597 UCI Repository http://www.ics.uci.edu./mlearn/databases/ University of California, Irvine, Accessed 6 Sept 2008 Uhlmann J (1991) Satisfying general proximity/similarity queries with metric trees. Inf Process Lett 40:175–179 Veloso M (1994) Planning and learning by analogical reasoning. Springer, Berlin Vitoria A (2005) A framework for reasoning with rough sets. 
In: Transactions on rough sets, vol IV. Lecture Notes in Computer Science, vol 3100. Springer, Berlin, pp 178–276 Ward J (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58:236–244 Watson I, Marir F (1994) Case-based reasoning: A review http://www.ai-cbr.org/classroom/cbr-review.html Accessed 6 Sept 2008; see also: Watson I (1994). Knowl Eng Rev 9(4):327–354 White DA, Jain R (1996) Similarity indexing with the SS-tree. In: Proceedings of the twelve international conference on data engineering, New Orleans LA, pp 516–523 Wilson DR, Martinez TR (1997) Improved heterogeneous distance functions. J Artif Intell Res 6:1–34 Wittgenstein L (1953) Philosophical investigations. Blackwell, London
118. Wojdyłło P (2004) WaRS: A method for signal classification. In: Pal SK, Polkowski L, Skowron A (eds) Rough – neural computing. Techniques for computing with words. Springer, Berlin, pp 649–688 119. Wojna A (2005) Analogy-based reasoning in classifier construction. In: Transactions on rough sets, vol IV. Lecture Notes in Computer Science, vol 3700. Springer, Berlin, pp 277–374 120. Wróblewski J (1998) Covering with reducts – a fast algorithm for rule generation. In: Lecture notes in artificial intelligence, vol 1424. Springer, Berlin, pp 402–407 121. Wróblewski J (2004) Adaptive aspects of combining approximation spaces. In: Pal SK, Polkowski L, Skowron A (eds) Rough – neural computing. Techniques for computing with words. Springer, Berlin, pp 139–156 122. Yao YY (2000) Granular computing: Basic issues and possible solutions. In: Proceedings of the 5th Joint Conference on Information Sciences I. Assoc Intell Machinery, Atlantic NJ, pp 186–189 123. Yao YY (2005) Perspectives of granular computing. In: Proceedings of IEEE 2005 Conference on Granular Computing GrC05, Beijing, China. IEEE Press, New York, pp 85–90 124. Zadeh LA (1979) Fuzzy sets and information granularity. In: Gupta M, Ragade R, Yaeger RR (eds) Advances in fuzzy set theory and applications. North-Holland, Amsterdam, pp 3–18 125. Zeeman EC (1965) The topology of the brain and the visual perception. In: Fort MK (ed) Topology of 3-manifolds and selected topics. Prentice Hall, Englewood Cliffs, pp 240–256 126. Ziarko W (1998) KDD-R: Rough set-based data mining system. In: Polkowski L, Skowron A (eds) Rough sets in knowledge discovery 2. Physica, Heidelberg, pp 598–601
Books and Reviews Avis D, Bhattacharya BK (1983) Algorithms for computing d-dimensional Voronoi diagrams and their duals. In: Preparata FP (ed) Advances in computing research: Computational geometry. JAI Press, Greenwich, pp 159–180 ´ Bochenski JM (1954) Die Zeitgenössischen Denkmethoden. A. Francke, Bern Dasarathy BV (ed) (1991) Nearest neighbor (NN) norms: NN Pattern classification techniques. IEEE Computer Society, Washington Friedman J (1994) Flexible metric nearest-neighbor classification. Technical Report, Stanford University Polkowski L (2002) Rough sets. Mathematical foundations. Physica, Heidelberg Russell SJ, Norvig P (2003) Artificial intelligence. A modern approach, 2nd edn. Prentice Hall Pearson Education, Upper Saddle River Toussaint GT, Bhattacharya BV, Poulsen RS (1984) Application of voronoi diagrams to nonparametric decision rules. In: Proceedings of Computer Science and Statistics: The Sixteenth Symposium on the Interface. North Holland, Amsterdam, pp 97–108 Watson I (1997) Applying case-based reasoning. Techniques for enterprise systems. Morgan Kaufmann, Elsevier, Amsterdam
809
810
Data-Mining and Knowledge Discovery, Introduction to
PETER KOKOL
Department of Computer Science, University of Maribor, Maribor, Slovenia

Data mining and knowledge discovery is the practice of analyzing large amounts of data in order to extract meaningful patterns, rules and models from raw data and to make the discovered patterns understandable. Applications include medicine, politics, games, business, marketing, bioinformatics and many other areas of science and engineering. It is an area of research activity that stands at the intellectual intersection of statistics, computer science, machine learning and database management. It deals with very large datasets, tries to make fewer theoretical assumptions than is traditional in statistics, and typically focuses on problems of classification, prediction, description and profiling, clustering, and regression. In such domains, data mining often uses decision trees or neural networks as models and frequently fits them using some combination of techniques such as bagging, boosting/arcing, and racing. Data mining techniques include data visualization, neural network analysis, support vector machines, genetic and evolutionary algorithms, case-based reasoning, etc. Other activities in data mining focus on issues such as causation in large-scale systems, an effort that often involves elaborate statistical models and, quite frequently, Bayesian methodology and related computational techniques. Data mining also covers the comparison of the top-down approach to model building, which starts with an assumed mathematical model and solves its dynamical equations, with the bottom-up approach of time series analysis, which takes measured data as input and provides a mathematical model of the system as output. The difference between the two is termed measurement error.
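The bagging technique mentioned above can be sketched in a few lines: train several models on bootstrap resamples of the data and combine their predictions by majority vote. This is only a minimal illustration; the decision-stump base learner, the toy one-dimensional data, and all function names below are assumptions for the sketch, not anything prescribed by the articles themselves.

```python
import random

def majority(votes):
    # Most frequent label among the ensemble's votes.
    return max(set(votes), key=votes.count)

def train_stump(data):
    # data: list of (x, label) with scalar x and label in {0, 1}.
    # Exhaustively pick the threshold/direction with fewest training errors.
    best = None
    for thr, _ in data:
        for sign in (1, -1):
            errs = sum(1 for xi, yi in data
                       if (1 if sign * (xi - thr) > 0 else 0) != yi)
            if best is None or errs < best[0]:
                best = (errs, thr, sign)
    _, thr, sign = best
    return lambda x: 1 if sign * (x - thr) > 0 else 0

def bagging(data, n_models=11, seed=0):
    # Train each stump on a bootstrap resample; predict by majority vote.
    rng = random.Random(seed)
    models = [train_stump([rng.choice(data) for _ in data])
              for _ in range(n_models)]
    return lambda x: majority([m(x) for m in models])

# Toy separable data: label is 1 iff x > 5.
data = [(x, 1 if x > 5 else 0) for x in range(10)]
clf = bagging(data)
print(clf(8.5), clf(1.2))  # matches the true labels on this separable toy set
```

On such a trivially separable problem a single stump already suffices; bagging pays off when individual models are unstable, which is exactly the case the intro alludes to.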
Data mining also allows the characterization of chaotic dynamics, involving Lyapunov exponents, fractal dimension and Kolmogorov–Sinai entropy. Data mining comes in two main directions: directed and undirected. Directed data mining tries to categorize or explain some particular target field, while undirected data mining attempts to find patterns or similarities among groups of records without a specific goal.

Pedrycz (see Data and Dimensionality Reduction in Data Analysis and System Modeling) points out that data and dimensionality reduction are fundamental pursuits of data analysis and system modeling. With the rapid growth
of the size of data sets and the diversity of the data themselves, the use of some reduction mechanism becomes a necessity. Data reduction is concerned with reducing the size of data sets in terms of the number of data points. This helps reveal an underlying structure in the data by presenting a collection of the groups present in it. When the number of groups is kept very limited, clustering becomes an effective data-reduction mechanism. Dimensionality reduction aims at reducing the number of attributes (features) of the data, which either leads to a typically small subset of features or brings the data from a highly dimensional feature space to a new one of far lower dimensionality. A joint reduction process involves both data and feature reduction.

Brameier (see Data-Mining and Knowledge Discovery, Neural Networks in) describes neural networks or, more precisely, artificial neural networks as mathematical and computational models that are inspired by the way biological nervous systems process information. A neural network model consists of a large number of highly interconnected, simple processing nodes or units which operate in parallel and perform functions collectively, roughly similar to biological neural networks. Artificial neural networks, like their biological counterparts, are adaptive systems which learn by example. Learning works by adapting free model parameters, i.e., the signal strength of the connections and the signal flow, to external information that is presented to the network. In other terms, the information is stored in the weighted connections. If not stated explicitly, let the term "neural network" mean "artificial neural network" in the following. In more technical terms, neural networks are non-linear statistical data modeling tools. Neural networks are generally well suited for solving problem tasks that involve classification of (necessarily numeric) data vectors, pattern recognition and decision-making.
The power and usefulness of neural networks have been demonstrated in numerous application areas, like image processing, signal processing, biometric identification – including handwritten character, fingerprint, face, and speech recognition – robotic control, industrial engineering, and biomedicine. In many of these tasks neural networks outperform more traditional statistical or artificial intelligence techniques or may even achieve human-like performance. The most valuable characteristics of neural networks are adaptability and tolerance to noisy or incomplete data. Another important advantage is in solving problems that do not have an algorithmic solution or for which an algorithmic solution is too complex or time-consuming to be found. Brameier describes his chapter as providing a concise introduction to the two most popular neural network types
used in application, backpropagation neural networks and self-organizing maps. The former learn high-dimensional non-linear functions from given input–output associations for solving classification and approximation (regression) problems. The latter are primarily used for data clustering and visualization and for revealing relationships between clusters. These models are discussed in the context of alternative learning algorithms and neural network architectures.

Polkowski (see Data-Mining and Knowledge Discovery: Case-Based Reasoning, Nearest Neighbor and Rough Sets) describes rough set theory as a formal set of notions aimed at carrying out tasks of reasoning, in particular about the classification of objects, under conditions of uncertainty. Conditions of uncertainty are imposed by incompleteness, imprecision and ambiguity of knowledge. Applications of this technique can be found in many areas such as satellite image analysis, plant ecology, forestry, conflict analysis, game theory, and cluster identification in medical diagnosis. Originally, the basic notion proposed was that of a knowledge base, understood as a collection of equivalence relations on a universe of objects; each relation induces on the set a partition into equivalence classes. Knowledge so encoded is meant to represent the ability to classify. As objects for analysis and classification most often come in the form of data, the notion of an information system is commonly used in knowledge representation; the knowledge base in that case is defined as the collection of indiscernibility relations. Exact concepts are defined as unions of indiscernibility classes, whereas inexact concepts are only approximated from below (lower approximations) and from above (upper approximations) by exact ones. Each inexact concept is thus perceived as a pair of exact concepts between which it is sandwiched.
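The lower and upper approximations just described can be computed directly from an information table: group objects into indiscernibility classes (equal attribute values), then take the union of classes fully contained in the concept (lower) and of classes that merely intersect it (upper). The following sketch uses a hypothetical patient/symptom table; all names and data are illustrative, not drawn from the article.

```python
from collections import defaultdict

def indiscernibility_classes(objects, attributes):
    # Objects with identical values on the chosen attributes fall into
    # the same indiscernibility class.
    classes = defaultdict(set)
    for obj, values in objects.items():
        classes[tuple(values[a] for a in attributes)].add(obj)
    return list(classes.values())

def approximations(objects, attributes, concept):
    classes = indiscernibility_classes(objects, attributes)
    lower = set().union(*([c for c in classes if c <= concept] or [set()]))
    upper = set().union(*([c for c in classes if c & concept] or [set()]))
    return lower, upper

# Hypothetical information system: patients described by two symptoms.
objects = {
    "p1": {"fever": "yes", "cough": "yes"},
    "p2": {"fever": "yes", "cough": "yes"},
    "p3": {"fever": "yes", "cough": "no"},
    "p4": {"fever": "no",  "cough": "no"},
}
concept = {"p1", "p3"}  # e.g., the patients who actually have the flu
lower, upper = approximations(objects, ["fever", "cough"], concept)
# p3 is alone in its class, so it lands in the lower approximation;
# p1 is indiscernible from p2, so it appears only in the upper one.
```

The concept {p1, p3} is inexact here: it sits sandwiched between the exact sets {p3} and {p1, p2, p3}, exactly as the paragraph above describes.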
Povalej, Verlic and Stiglig (see Discovery Systems) point out that a simple query for "discovery system" on the World Wide Web returns many different types of discovery systems: from knowledge discovery systems in databases, internet-based knowledge discovery, service discovery systems and resource discovery systems to more specific ones such as drug discovery systems, gene discovery systems, discovery systems for personality profiling, and developmental discovery systems, among others. As this illustrates, a variety of discovery systems can be found in many different research areas, but the authors focus on knowledge discovery and knowledge discovery systems from the computer science perspective.

A decision tree in data mining or machine learning is a predictive model, i.e., a mapping from observations about an item to conclusions about its target value. More descriptive names for such tree models are classification tree (discrete outcome) and regression tree (continuous outcome). In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. Podgorelec and Zorman (see Decision Trees) point out that the term "decision tree" has been used for two different purposes: in decision analysis, as a decision support tool for modeling decisions and their possible consequences in order to select the best course of action under uncertainty; and in machine learning or data mining, as a predictive model, that is, a mapping from observations about an item to conclusions about its target value. Their article concentrates on the machine learning view.

Orlov, Sipper and Hauptman (see Genetic and Evolutionary Algorithms and Programming: General Introduction and Application to Game Playing) cover genetic and evolutionary algorithms, a family of search algorithms inspired by the process of (Darwinian) evolution in nature. Common to all family members is the notion of solving problems by evolving an initially random population of candidate solutions, through the application of operators such as crossover and mutation inspired by natural genetics and natural selection, such that in time "fitter" (i.e., better) solutions emerge. The field, whose origins can be traced back to the 1950s and 1960s, has come into its own over the past two decades, proving successful in solving multitudinous problems from highly diverse domains including (to mention but a few): optimization, automatic programming, electronic-circuit design, telecommunications, networks, finance, economics, image analysis, signal processing, music, and art.

Berkhin and Dhillon (see Knowledge Discovery: Clustering) discuss the situation, common in scientific and business applications, where the data do not fit a particular parametrized probability distribution. In other words, the data are complex.
Knowledge discovery starts with an exploration of this complexity in order to find inconsistencies, artifacts, errors, etc. in the data. Even after the data are cleaned, they usually remain extremely complex. Descriptive data mining deals with comprehending and reducing this complexity. Clustering is a premier methodology in descriptive unsupervised data mining. A cluster can represent an important subset of the data, such as a galaxy in astronomical data or a segment of customers in marketing applications. Clustering is important as a fundamental technique for reducing data complexity and finding data patterns in an unsupervised fashion. It is universally used as a first technology of choice in data exploration.

Džeroski, Panov and Zenko (see Machine Learning, Ensemble Methods in) cover ensemble methods, which are
machine learning methods that construct a set of predictive models and combine their outputs into a single prediction. The purpose of combining several models is to achieve better predictive performance, and it has been shown in a number of cases that ensembles can be more accurate than single models. While some work on ensemble methods was already done in the 1970s, it was not until the 1990s, with the introduction of methods such as bagging and boosting, that ensemble methods came into wide use. Today they represent a standard machine learning approach that has to be considered whenever good predictive accuracy is demanded.

Liu and Zhao (see Manipulating Data and Dimension Reduction Methods: Feature Selection) cover feature selection, the study of algorithms for reducing the dimensionality of data for various purposes. One of the most common purposes is to improve machine learning performance. Other purposes include simplifying data description, streamlining data collection, improving comprehensibility of the learned models, and helping gain insight through learning. The objective of feature selection is to remove irrelevant and/or redundant features and retain only relevant ones. Irrelevant features can be removed without affecting learning performance. Redundant features are a type of irrelevant feature; the distinction is that a redundant feature implies the co-presence of another feature: individually each is relevant, but the removal of either one will not affect learning performance. As a plethora of data is generated by every possible means, and as the costs of data storage and computer processing power decrease exponentially, data dimensionality increases on a scale beyond imagination, in cases ranging from transactional data to high-throughput data. In many fields such as medicine, health care, Web search, and bioinformatics, it is imperative to reduce high dimensionality so that efficient data processing and meaningful data analysis can be conducted in order to mine nuggets from high-dimensional, massive data.
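The irrelevant-versus-redundant distinction can be made concrete with a minimal correlation-based filter: drop features weakly associated with the target (irrelevant), then drop features that are nearly copies of a feature already kept (redundant). This greedy filter, its thresholds, and the toy data are assumptions for illustration; real feature-selection algorithms use many other relevance and redundancy criteria.

```python
def pearson(xs, ys):
    # Plain Pearson correlation coefficient for two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def select_features(features, target, min_relevance=0.5, max_redundancy=0.95):
    # Greedy filter: keep features correlated with the target, skipping any
    # feature that is nearly a copy of one already selected.
    selected = []
    for name, values in features.items():
        if abs(pearson(values, target)) < min_relevance:
            continue  # irrelevant: weak association with the target
        if any(abs(pearson(values, features[s])) > max_redundancy
               for s in selected):
            continue  # redundant: implied by an already-kept feature
        selected.append(name)
    return selected

# Toy data: f1 tracks the target, f2 duplicates f1 (redundant), f3 is noise.
target = [0, 0, 1, 1, 1, 0]
features = {
    "f1": [0.1, 0.2, 0.9, 0.8, 1.0, 0.1],
    "f2": [0.2, 0.4, 1.8, 1.6, 2.0, 0.2],  # exactly 2 * f1: redundant
    "f3": [0.5, 0.1, 0.4, 0.2, 0.3, 0.6],  # unrelated noise: irrelevant
}
print(select_features(features, target))  # keeps only "f1"
```

Note that the greedy scan depends on feature order: f2 would survive if it were examined before f1, which is why practical methods score candidate subsets rather than individual features.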
Data-Mining and Knowledge Discovery, Neural Networks in
MARKUS BRAMEIER
Bioinformatics Research Center, University of Aarhus, Århus, Denmark

Article Outline

Glossary
Definition of the Subject
Introduction
Neural Network Learning
Feedforward Neural Networks
Backpropagation
Other Learning Rules
Other Neural Network Architectures
Self-organizing Maps
Future Directions
Bibliography

Glossary

Artificial neural network An artificial neural network is a system composed of many simple, but highly interconnected, processing nodes or neurons which operate in parallel and collectively. It resembles biological nervous systems in two basic functions: (1) experiential knowledge is acquired through a learning process and can be retrieved again later; (2) the knowledge is stored in the strengths (weights) of the connections between the neurons.

Artificial neuron An artificial neuron receives a number of inputs, which may be either external inputs to the neural network or outputs of other neurons. Each input connection is assigned a weight, similar to the synaptic efficacy of a biological neuron. The weighted sum of inputs is compared against an activation level (threshold) to determine the activation value of the neuron.

Activation function The activation or transfer function transforms the weighted inputs of a neuron into an output signal. Activation functions often have a "squashing" effect. Common activation functions used in neural networks are: threshold, linear, sigmoid, hyperbolic, and Gaussian.

Learning rule The learning rule describes the way a neural network is trained, i.e., how its free parameters are changed to fit the network to the training data.

Feedforward network Feedforward neural networks are organized in one or more layers of processing units (neurons). In a feedforward neural network the signal
is allowed to flow one way only, i.e., from inputs to outputs. There are no feedback loops, i.e., the outputs of a layer do not affect its inputs.

Feedback networks In feedback or recurrent networks, signals may flow in both directions. Feedback networks are dynamic: they have a state that changes continuously until it reaches an equilibrium point.

Definition of the Subject

Neural networks (NNs) or, more precisely, artificial neural networks (ANNs) are mathematical and computational models that are inspired by the way biological nervous systems process information. A neural network model consists of a large number of highly interconnected, simple processing nodes or units which operate in parallel and perform functions collectively, roughly similar to biological neural networks. ANNs, like their biological counterparts, are adaptive systems that learn by example. Learning works by adapting free model parameters, i.e., the signal strength of the connections and the signal flow, to external information that is presented to the network. In other terms, the information is stored in the weighted connections. If not stated explicitly, let the term "neural network" mean "artificial neural network" in the following. In more technical terms, neural networks are non-linear statistical data modeling tools. Neural networks are generally well suited for solving problem tasks that involve classification, pattern recognition and decision making. The power and usefulness of neural networks have been demonstrated in numerous application areas, such as image processing, signal processing, biometric identification (including handwritten character, fingerprint, face, and speech recognition), robotic control, industrial engineering, and biomedicine. In many of these tasks neural networks outperform more traditional statistical or artificial intelligence techniques or may even achieve human-like performance.
The most valuable characteristics of neural networks are adaptability and tolerance to noisy or incomplete data. Another important advantage lies in solving problems that have no algorithmic solution or for which an algorithmic solution is too complex or time-consuming to be found. The first neural network model was developed by McCulloch and Pitts in the 1940s [32]. In 1958 Rosenblatt [41] described the first learning algorithm for a single neuron, the perceptron model. After Rumelhart et al. [46] invented the popular backpropagation learning algorithm for multi-layer networks in 1986, the field of neural networks gained enormous popularity in the 1990s.
Neural networks are regarded as a method of machine learning, the largest subfield of artificial intelligence (AI). Conventional AI mainly focuses on the development of expert systems and the design of intelligent agents. Today neural networks also belong to the more recent field of computational intelligence (CI), which further includes evolutionary algorithms (EAs) and fuzzy logic.
Artificial neural networks capture only a small part of the biological complexity by using a much smaller number of simpler neurons and connections. Nevertheless, artificial neural networks can perform remarkably complex tasks by applying a similar principle, i.e., the combination of simple and local processing units, each calculating a weighted sum of its inputs and sending out a signal if the sum exceeds a certain threshold.
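This combine-and-threshold principle can be sketched in a few lines of Python; the function name and the numeric values are purely illustrative:

```python
# A single artificial neuron: it computes a weighted sum of its inputs
# and fires (outputs 1) only if the sum exceeds its activation threshold.

def neuron_output(inputs, weights, threshold):
    """Return 1 if the weighted input sum exceeds the threshold, else 0."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > threshold else 0

# Two inputs with weights 0.6 and 0.4, and a threshold of 0.5:
print(neuron_output([1, 0], [0.6, 0.4], 0.5))  # 1: 0.6 > 0.5, the neuron fires
print(neuron_output([0, 1], [0.6, 0.4], 0.5))  # 0: 0.4 <= 0.5, it stays silent
```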
Introduction

Herein I provide a concise introduction to the two most popular neural network types used in application, backpropagation neural networks (BPNNs) and self-organizing maps (SOMs). The former learn high-dimensional non-linear functions from given input–output associations for solving classification and approximation (regression) problems. The latter are primarily used for data clustering and visualization and for revealing relationships between clusters. These models are discussed in context with alternative learning algorithms and neural network architectures.

Biological Motivation

Artificial neural networks are inspired by biological nervous systems, which are highly distributed and interconnected networks. The human brain is principally composed of a very large number of relatively simple neurons (approx. 100 billion), each of which is connected to several thousand other neurons, on average. A neuron is a specialized cell that consists of the cell body (the soma), multiple spine-like extensions (the dendrites) and a single nerve fiber (the axon). The axon connects to the dendrites of another neuron via a synapse. When a neuron is activated, it transmits an electrical impulse (activation potential) along its axon. At the synapse the electrical signal is transformed into a chemical signal, such that a certain number of neurotransmitters cross the synaptic gap to the postsynaptic neuron, where the chemical signal is converted back into an electrical signal to be transported along the dendrites. The dendrites receive signals from the axons of other neurons. One very important feature of neurons is that they respond with a delay: a neuron combines the strengths (energies) of all received input signals and sends out its own signal ("fires") only if the total signal strength exceeds a certain critical activation level. A synapse can be either excitatory or inhibitory.
Input signals from an excitatory synapse increase the activation level of the neuron, while inputs from an inhibitory synapse reduce it. The strength of the input signals critically depends on modulations at the synapses. The brain learns basically by adjusting the number and strength of the synaptic connections.
History and Overview

The history of artificial neural networks begins with a discrete mathematical model of a biological neural network developed by the pioneers McCulloch and Pitts in 1943 [32]. This model describes neurons as threshold logic units (TLUs) or binary decision units with multiple binary inputs and a single binary output. A neuron outputs 1 (is activated) if the sum of its unweighted inputs exceeds a certain specified threshold; otherwise it outputs 0. Each neuron can only represent simple logic functions like OR or AND, but any Boolean function can be realized by combinations of such neurons. In 1958 Rosenblatt [41] extended the McCulloch–Pitts model to the perceptron model. This network was based on a unit called the perceptron, which produces an output depending on the weighted linear combination of its inputs. The weights are adapted by the perceptron learning rule. Another single-layer neural network that is based on the McCulloch–Pitts neuron is the ADALINE (ADAptive LINear Element), which was invented in 1960 by Widrow and Hoff [55] and employs a least-mean-squares (LMS) learning rule. In 1969 Minsky and Papert [33] provided mathematical proofs that single-layer neural networks like the perceptron are incapable of representing functions which are linearly inseparable, including in particular the exclusive-or (XOR) function. This fundamental limitation caused research on neural networks to stagnate for many years, until it was found that a perceptron with more than one layer has far greater processing power. The backpropagation learning method was first described by Werbos in 1974 [53,54], and further developed for multi-layer neural networks by Rumelhart et al. in 1986 [46]. Backpropagation networks are by far the best known and most commonly used neural networks today. Recurrent auto-associative networks were first described independently by Anderson [2] and Kohonen [21] in 1977.
Invented in 1982, the Hopfield network [17] is a recurrent neural network in which all connections are symmetric. All neurons are both input and output neurons
and update their activation values asynchronously and independently of each other. For each new input the network converges dynamically to a new stable state. A Hopfield network may serve as an associative, i.e., content-addressable, memory. The Boltzmann machine by Ackley et al. [1] can be seen as an extension of the Hopfield network. It uses a stochastic instead of a deterministic update rule that simulates the physical principle of annealing. The Boltzmann machine is one of the first neural networks to demonstrate learning of an internal representation (hidden units). It was also in 1982 that Kohonen first published his self-organizing maps [22], a neural network model based on unsupervised learning and competitive learning. SOMs produce a low-dimensional representation (feature map) of high-dimensional input data while preserving their most important topological features. Significant progress was made in the 1990s in the field of neural networks, which attracted a great deal of attention both in research and in many application domains (see Sect. "Definition of the Subject"). Faster computers allowed more complex problems to be solved more efficiently. Hardware implementations of larger neural networks were realized on parallel computers or in neural network chips with multiple units working simultaneously.

Characteristics of Neural Networks

Interesting general properties of neural networks are that they (1) mimic the way the brain works, (2) are able to learn by experience, (3) make predictions without having to know the precise underlying model, (4) have a high fault tolerance, i.e., can still give the correct output for missing, noisy or partially correct inputs, and (5) can work with data they have never seen before, provided that the underlying distribution fits the training data.
Computation in neural networks is local and highly distributed throughout the network, such that each node operates by itself but tries to minimize the overall network (output) error in cooperation with other nodes. Working like a cluster of interconnected processing nodes, neural networks automatically distribute both the problem and the workload among the nodes in order to find and converge to a common solution. Indeed, parallel processing was one of the original motivations behind the development of artificial neural networks. Neural networks have the topology of a directed graph. There are only one-way connections between nodes, just like in biological nervous systems; a two-way relationship requires two one-way connections. Many different network architectures are used, often with hundreds or thousands of adjustable parameters. Neural networks are typically organized in layers, each consisting of a number of interconnected nodes (see Fig. 2 below). Data patterns are presented to the system via the input layer. This non-computing layer is connected to one or more hidden layers where the actual processing is done. The hidden layers then link to an output layer which combines the results of multiple processing units to produce the final response of the network. The network acts as a high-dimensional vector function, taking one vector as input and returning another vector as output. The modeled functions are general and complex enough to solve a large class of non-linear classification and estimation problems. No matter which problem domain a neural network is operating in, the input data always have to be encoded into numbers, which may be continuous or discrete. The high number of free parameters in a neural network and the high degree of collinearity between the neuron outputs render individual parameter settings (weight coefficients) meaningless and make the network for the most part uninterpretable. High-order interactions between neurons also do not permit simpler substructures in the model to be identified. Therefore, extracting the acquired knowledge from such black-box predictors to understand the underlying model is almost impossible.

Neural Network Learning

The human brain learns by practice and experience. The learned knowledge can change if more information is received. Another important element of learning is the ability to infer knowledge, i.e., to make assumptions based on what we know and to apply what we have learned in the past to similar problems and situations. One theory of the physiology of human learning by repetition is that repeated sequences of impulses strengthen connections between neurons and form memory paths.
To retrieve the learned information, nerve impulses follow these paths to the correct information. If we get out of practice, these paths may diminish over time and we forget what we have learned. The information stored in a neural network is contained in its free parameters. In general, the network architecture and connections are held constant and only the connection weights are variable during training. Once the numbers of (hidden) layers and units have been selected, the free parameters (weights) are set to fit the model or function represented by the network to the training data, following a certain training algorithm or learning rule.
A neural network learns, i.e., acquires knowledge, by adjusting the weights of connections between its neurons. This is also referred to as connectionist learning. Training occurs iteratively in multiple cycles during which the training examples are repeatedly presented to the network. During one epoch all data patterns pass through the network once. The three major learning paradigms are supervised learning, unsupervised learning, and reinforcement learning. Supervised learning, in general, means learning by an external teacher using global information. The problem the network is supposed to solve is defined through a given set of training examples. The learning algorithm searches the solution space $F$, the class of possible functions, for a function $f \in F$ that best matches this set of input–output associations $(\vec{x}, \vec{y})$. In other words, the mapping implied by the sample data has to be inferred. Training a neural network means determining a set of weights which minimizes its prediction error on the training set. The cost or error function $E : F \to \mathbb{R}$ measures the error between the desired output values $\vec{y}$ and the predicted network outputs $f(\vec{x})$ over all input vectors $\vec{x}$. That is, it calculates how far the current state $f$ of the network is from the optimal solution $f^*$ with $E(f^*) \le E(f)\ \forall f \in F$. A neural network cannot perfectly learn a mapping if the input data do not contain enough information to derive the desired outputs. It may also fail to converge if there is not enough data available (see also below). Unsupervised learning uses no external teacher and only local information. It is distinguished from supervised learning by the fact that there is no a priori output. In unsupervised learning we are given some input data $\vec{x}$, and the cost function to be minimized can be any function of $\vec{x}$ and the network output $f(\vec{x})$. Unsupervised learning incorporates self-organization, i.e., it organizes the input data by using only their inherent properties to reveal their emergent collective properties. A neural network learns offline if the learning phase and the operation (application) phase are separated. A neural network learns online if both happen at the same time. Usually, supervised learning is performed offline, whereas unsupervised learning is performed online. In reinforcement learning, neither inputs $\vec{x}$ nor outputs $\vec{y}$ are given explicitly, but are generated by the interactions of an agent within an environment. The agent performs an action $\vec{y}$ with costs $c$ according to an observation $\vec{x}$ made in the environment. The aim is to discover a policy or plan for selecting actions that minimizes some measure of the expected total costs.
Overtraining and Generalization

The overall motivation and most desirable property of neural networks is their ability to generalize to new unknown data, i.e., to classify patterns correctly on which they have not been trained. Minimizing the network error on the training examples only does not automatically minimize the real error of the unknown underlying function. This important problem is called overfitting or overtraining. A regular distribution of training examples over the input data space is important. Generalization is reasonable only as long as the data inputs remain inside the range for which the network was trained. If the training set only included vectors from a certain part of the data space, predictions on other parts are random and likely wrong. Overtraining may also occur when the iterative training algorithm is run for too long and when the network is too complex for the problem to solve or for the available quantity of data. A larger neural network with more weights models a more complex function and invariably achieves a lower training error, but is prone to overfitting. A network with fewer weights, on the other hand, may not be sufficiently powerful to model the underlying function. A simple heuristic, called early stopping, helps to ensure that the network will generalize well to examples not in the training set. One solution is to check progress during training against an independent data set, the validation set. As training progresses, the training error naturally decreases monotonically and, provided training is minimizing the true error function, the validation error decreases as well. However, if the validation error stops dropping or even starts to increase again, this is an indication that the network is starting to overfit the data. Then the optimization process has become stuck in a local minimum and training should be stopped. The weights that produced the minimum validation error are then used for the final model.
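The early-stopping heuristic described above can be sketched as follows; `train_one_epoch` and `validation_error` are hypothetical callables standing in for a real training step and error measure:

```python
# Early stopping: keep the weights from the epoch with the lowest
# validation error, and stop once the validation error has not improved
# for `patience` consecutive epochs (a sign of overtraining).

def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    best_error = float("inf")
    best_weights = None
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        weights = train_one_epoch()
        err = validation_error(weights)
        if err < best_error:
            best_error, best_weights = err, weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error stopped dropping
    return best_weights, best_error
```

The weights returned are those of the epoch with the minimum validation error, not the last epoch trained, matching the heuristic in the text.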
In this case of overtraining, the size of the network, i. e., the number of hidden units and/or hidden layers, may be decreased. Neural networks typically involve experimenting with a large number of different configurations, training each one a number of times while observing the validation error. A problem with repeated experimentation is that the validation set is actually part of the training process. One may just find a network by chance that happens to perform well on the validation set. It is therefore normal practice to reserve a third set of examples for testing the final model on this test set. In many cases a sufficient amount of data is not available, however. Then we have to get around this problem by
resampling techniques, like cross validation. In principle, multiple experiments are conducted, each using a different division of the available data into training and validation sets. This should remove any sampling bias. For small data sets, where splitting the data would leave too few observations for training, leave-one-out validation may be used to determine when to stop training or the optimal network size. Machine-learning techniques, like neural networks, require both positive and negative training examples for solving classification problems. Because they minimize an overall error, the proportion of positive and negative examples in the training set is critical. Ideally, this proportion should be close to the (usually unknown) real distribution in the data space. Otherwise, the network's decisions may be biased and more often wrong on unknown data.
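A minimal sketch of the k-fold splitting behind cross validation (the function name is illustrative, and real toolkits typically also shuffle the data first):

```python
# k-fold cross validation splits: each example serves in the validation
# set of exactly one fold, so every data point is validated on once.

def k_fold_splits(n_examples, k):
    """Yield (train_indices, validation_indices) pairs for k folds."""
    indices = list(range(n_examples))
    fold_size = n_examples // k
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n_examples
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation

# With n_examples == k this degenerates into leave-one-out validation,
# the small-data variant mentioned in the text.
```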
Feedforward Neural Networks

In feedforward neural networks the information is passed in only one direction (forward) from the inputs, through the hidden nodes (if any), to the output nodes. There are no connections backwards to neurons of preceding layers, i.e., there are no feedback loops or cycles in the network. In feedforward networks with a single layer of weights, the inputs are directly connected to the output units (single-layer neural networks). Multi-layer feedforward networks use additional intermediate layers of hidden units. Neural networks with two or more processing layers may have far greater processing power than networks with only one layer: single-layer neural networks are only capable of learning linearly separable patterns and functions.
The Perceptron

The simplest kind of feedforward neural network is the perceptron, which consists of a single pseudo input layer and one or more processing nodes in the output layer. All inputs are weighted and fed directly to the output neuron(s) (see Fig. 1). Each node calculates the sum of the products of weights and inputs. If this value is above some threshold (typically 0) the neuron takes the activated value 1; otherwise it outputs 0. Neurons with this kind of activation function are called threshold units.

Data-Mining and Knowledge Discovery, Neural Networks in, Figure 1
Principle structure of single-unit perceptron network

In the literature the term perceptron often refers to networks consisting of only one (output) neuron. More formally, the perceptron is a linear binary classifier that maps a binary n-dimensional input vector $\vec{x} \in \{0,1\}^n$ to a binary output value $f(\vec{w} \cdot \vec{x}) \in \{0,1\}$, calculated as

$$f(s) = \begin{cases} 1 & \text{if } s > T \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where $s = \vec{w} \cdot \vec{x} = \sum_{i=1}^{n} w_i x_i$ is the input sum and $\vec{w}$ is a vector of real-valued weights. The constant threshold $T$ does not depend on any input value. In addition to the network topology, the learning rule is an important component of neural networks. Perceptrons can be trained by a simple learning algorithm, called the delta rule or perceptron learning rule. This realizes a simple stochastic gradient descent where the weights of the network are adjusted depending on the error between the predicted outputs of the network and the example outputs. The delta rule changes the weight vector such that the output error is minimized. McClelland and Rumelhart [30] proved that a neural network using the delta rule can learn associations whenever the inputs are linearly independent. All neurons of a perceptron share the same structure and learning algorithm. Each weight $w_{ij}$, representing the influence of input $x_i$ on neuron $j$, is updated at time $t$ according to the rule:

$$w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij} \qquad (2)$$

$$\Delta w_{ij} = \alpha (o_i - y_i) x_{ij} \qquad (3)$$
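The threshold unit of Eq. (1) and the delta rule of Eqs. (2)–(3) can be sketched as follows; the AND function is an illustrative linearly separable task, and the learning rate, threshold, and epoch count are arbitrary choices:

```python
# Perceptron with the delta rule: after each example, each weight is
# nudged by alpha * (target - output) * input, shrinking the output error.

def perceptron_train(examples, n_inputs, alpha=0.1, epochs=50, threshold=0.5):
    weights = [0.0] * n_inputs
    for _ in range(epochs):
        for inputs, target in examples:
            s = sum(w * x for w, x in zip(weights, inputs))  # input sum
            output = 1 if s > threshold else 0               # Eq. (1)
            for i in range(n_inputs):
                weights[i] += alpha * (target - output) * inputs[i]
    return weights

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = perceptron_train(AND, n_inputs=2)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0.5 else 0
print([predict(x) for x, _ in AND])  # prints [0, 0, 0, 1]
```

Since AND is linearly separable, the perceptron convergence guarantee applies; on XOR the same loop would never settle, as discussed below.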
The network learns by updating the weight vector after each iteration (training example) by an amount proportional to the difference between the given output $o_i$ and the calculated output $y_i = f(s_i)$. The learning rate $\alpha$ is a constant with $0 < \alpha < 1$ and regulates the learning speed. The training data set is linearly separable in n-dimensional data space if its two classes of vectors $\vec{x}$ can be separated by an $(n-1)$-dimensional hyperplane. If the training examples are not linearly separable, the perceptron learning algorithm is not guaranteed to converge. Linear classifiers, like single-unit perceptrons, are only able
to learn, i.e., perfectly classify, linearly separable patterns, because they can only implement a simple decision surface (a single hyperplane) [33]. The same is true for single-layer neural networks with more than one output unit. This makes these linear neural networks unable to learn, for example, the XOR function [33]. Nevertheless, a problem that is thought to be highly complex may still be solved as well by a linear network as by a more powerful (non-linear) neural network.

Single-Layer Neural Networks

In general, the state of a neuron is represented by its activation value. An activation or transfer function $f$ calculates the activation value of a unit from the weighted sum $s$ of its inputs. In the case of the perceptron, $f$ is called a step or threshold function, with the activation value being 1 if the network sum is greater than a constant $T$, and 0 otherwise (see Eq. (1)). Another common form of non-linear activation function is the logistic or sigmoid function:

$$f(s) = \frac{1}{1 + e^{-s}} \qquad (4)$$
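A quick numerical illustration of the squashing behavior of Eq. (4):

```python
import math

# The logistic (sigmoid) activation squashes any real-valued input sum
# into the open interval (0, 1), with 0.5 at the origin.

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

print(sigmoid(0))    # prints 0.5
print(sigmoid(10))   # close to 1 for large positive sums
print(sigmoid(-10))  # close to 0 for large negative sums
```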
This enables a neural network to compute a continuous output between 0 and 1 instead of a step function. With this choice, a single-layer network is identical to a logistic regression model. If the activation function is linear, i.e., the identity, then this is just a multiple linear regression and the output is proportional to the total weighted sum $s$.

Multi-Layer Neural Networks

The limitation that non-linearly separable functions cannot be represented by a single-layer network with fixed weights can be overcome by adding more layers. A multi-layer network is a feedforward network with two or more layers of computational units, interconnected such that the neurons' outputs of one layer serve as inputs only to neurons of the directly subsequent layer (see Fig. 2). The input layer is not considered a real layer with processing neurons. The number of units in the input layer is determined by the problem, i.e., the dimension of the input data space. The number of output units also depends on the output encoding (see Subsect. "Application Issues"). By using hidden layers, the partitioning of the data space can be more effective. In principle, each hidden unit adds one hyperplane to divide the space and discriminate the solution. Only if the outputs of at least two neurons are combined in a third neuron is the XOR problem solvable.

Data-Mining and Knowledge Discovery, Neural Networks in, Figure 2
Principle structure of multi-layer feedforward neural network

Important issues in multi-layer NN design are, thus, the specification of the number of hidden layers and the number of units in these layers (see also Subsect. "Application Issues"). Both numbers determine the complexity of functions that can be modeled. There is no theoretical limitation on the number of hidden layers, but usually one or two are used. The universal approximation theorem for neural networks states that any continuous function that maps intervals of real numbers to an output interval of real numbers can be approximated arbitrarily closely by a multi-layer neural network with only one hidden layer and certain types of non-linear activation functions. This gives, however, no indication of how fast or how likely a solution is found. Networks with two hidden layers may work better for some problems. However, more than two hidden layers usually provide only marginal benefit compared to the significant increase in training time. Any multi-layer network with fixed weights and linear activation functions is equivalent to a single-layer (linear) network: in the case of a two-layer linear system, for instance, let all input vectors to the first layer form a matrix $X$, and let $W_1$ and $W_2$ be the weight matrices of the two processing layers. Then the output $Y_1 = W_1 X$ of the first layer is input to the second layer, which produces output $Y_2 = W_2 (W_1 X) = (W_2 W_1) X$. This is equivalent to a single-layer network with weight matrix $W = W_2 W_1$. Only a multi-layer network that is non-linear can provide more computational power. In many applications these networks use a sigmoid function as the non-linear activation function, at least for the hidden units. For the output layer, the sigmoid activation function is usually applied for classification problems, while a linear transfer function is applied for regression problems.
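The collapse of stacked linear layers claimed above can be checked numerically; the helper `matmul` and the matrix sizes are arbitrary illustrative choices:

```python
import random

# Verify that two linear layers W2(W1 X) behave exactly like the single
# layer W = W2 W1, so stacking linear layers adds no computational power.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

random.seed(0)
W1 = [[random.random() for _ in range(3)] for _ in range(4)]  # 4x3 layer 1
W2 = [[random.random() for _ in range(4)] for _ in range(2)]  # 2x4 layer 2
X  = [[random.random() for _ in range(5)] for _ in range(3)]  # 3x5 inputs

two_layer = matmul(W2, matmul(W1, X))   # Y2 = W2 (W1 X)
one_layer = matmul(matmul(W2, W1), X)   # Y  = (W2 W1) X
same = all(abs(a - b) < 1e-9
           for ra, rb in zip(two_layer, one_layer) for a, b in zip(ra, rb))
print(same)  # prints True
```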
Error Function and Error Surface

The error function derives the overall network error from the difference between the network's outputs $y_{ij}$ and the target outputs $o_{ij}$ over all examples $i$ and output units $j$. The most common error function is the sum squared error:

$$E = \frac{1}{2} \sum_i \sum_j (o_{ij} - y_{ij})^2 \qquad (5)$$

Neural network training performs a search within the space of solutions, i.e., all possible network configurations, towards a global minimum of the error surface. The global minimum is the best overall solution with the lowest possible error. A helpful concept for understanding NN training is the error surface: the $n$ weight parameters of the network model form the $n$ dimensions of the search space. For any possible state of the network, or configuration of weights, the error is plotted in the $(n+1)$th dimension. The objective of training is to find the lowest point on this n-dimensional surface. The error surface is seldom smooth. Indeed, for most problems, the surface is quite rugged with numerous hills and valleys which may cause the network search to run into a local minimum, i.e., a suboptimal solution. The speed of learning is the rate of convergence between the current solution and the global minimum. In a linear network with a sum-squared error function, the error surface is a multi-dimensional parabola, i.e., has only one minimum. In general, it is not possible to analytically determine where the global minimum of the error surface lies. Training is essentially an exploration of the error surface. Because of the probabilistic and often highly non-linear modeling by neural networks, we cannot be sure that the error could not be lower still, i.e., that the minimum we found is the absolute one. Since the shape of the error surface cannot be known a priori, neural network analysis requires a number of independent runs to determine the best solution. When different initial values for the weights are selected, different network models will be derived. From an initially random configuration of the network, i.e., a random point on the error surface, the training algorithm starts seeking the global minimum. Small random values are typically used to initialize the network weights. Although neural networks resulting from different initial weights may have very different parameter settings, their prediction errors usually do not vary dramatically. Training is stopped when a maximum number of epochs has expired or when the network error does not improve any further.

Backpropagation

The best known and most popular training algorithm for multi-layer networks is backpropagation, short for backwards error propagation and also referred to as the generalized delta rule [46]. The algorithm involves two phases:

Forward pass. During the first phase, the free parameters (weights) of the network are fixed. An example pattern is presented to the network and the input signals are propagated through the network layers to calculate the network output at the output unit(s).

Backward pass. During the second phase, the model parameters are adjusted. The error signals at the output units, i.e., the differences between calculated and expected outputs, are propagated back through the network layers. In doing so, the error at each processing unit is calculated and used to make adjustments to its connecting weights such that the overall error of the network is reduced by some small amount.

After iteratively repeating both phases for a sufficiently large number of training cycles (epochs) the network will converge to a state where its output error is small enough. The backpropagation rule involves the repeated use of the chain rule: the output error of a neuron can be ascribed partly to errors in the weights of its direct inputs and partly to errors in the outputs of higher-level (hidden) nodes [46]. Moreover, backpropagation learning may happen in two different modes. In sequential mode or online mode, weight adjustments are made example by example, i.e., each time an example pattern has been presented to the network. In batch mode or offline mode, adjustments are made epoch by epoch, i.e., only after all example patterns have been presented. Theoretically, the backpropagation algorithm performs gradient descent on the total error only if the weights are updated epoch-wise. There are empirical indications, however, that a pattern-wise update results in faster convergence. The training examples should be presented in random order. Then the precision of predictions will be more similar over all inputs. Backpropagation learning requires a differentiable activation function. Besides adding non-linearity to multi-layer networks, the sigmoid activation function (see Eq. (4)) is often used in backpropagation networks because it has a continuous derivative that can be calculated easily:

$$f'(s) = f(s)(1 - f(s)) \qquad (6)$$

We further assume that there is only one hidden layer in order to keep notations and equations clear. A generalization to networks with more than one hidden layer is straightforward.
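A small numerical sanity check of the derivative identity in Eq. (6), comparing it against a central finite difference:

```python
import math

# The derivative of the sigmoid equals f(s)(1 - f(s)); we compare the
# closed form against a finite-difference approximation at a few points.

def f(s):
    return 1.0 / (1.0 + math.exp(-s))

def f_prime(s):
    return f(s) * (1.0 - f(s))  # Eq. (6)

h = 1e-6
for s in (-2.0, 0.0, 1.5):
    finite_diff = (f(s + h) - f(s - h)) / (2 * h)
    print(abs(finite_diff - f_prime(s)) < 1e-8)  # prints True at each point
```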
The backpropagation rule is a generalization of the delta learning rule (see Eq. (3)) to multi-layer networks with non-linear activation functions. For an input vector $\vec{x}$, the output $y = f(s)$ is calculated at each output neuron of the network and compared with the desired target output $o$, resulting in an error $\delta$. Each weight is adjusted proportionally to its effect on the error. The weight of a connection between a unit $i$ and a unit $j$ is updated depending on the output of $i$ (as input to $j$) and the error signal at $j$:

$$\Delta w_{ij} = \alpha \delta_j y_i \qquad (7)$$

For an output node $j$ the error signal (error surface gradient) is given by:

$$\delta_j = (o_j - y_j) f'(s_j) = (o_j - y_j) y_j (1 - y_j) \qquad (8)$$

If the error is zero, no changes are made to the connection weight. The larger the absolute error, the more the responsible weight is changed, while the sign of the error determines the direction of change. For a hidden neuron $j$ the error signal is calculated recursively from the error signals of all directly connected output neurons $k$:

$$\delta_j = f'(s_j) \sum_k \delta_k w_{jk} = y_j (1 - y_j) \sum_k \delta_k w_{jk} \qquad (9)$$
The partial derivative of the error function with respect to the network weights can be calculated purely locally, such that each neuron needs information only from neurons directly connected to it. A theoretical foundation of the backpropagation algorithm can be found in [31]. The backpropagation algorithm performs a gradient descent by calculating the gradient vector of the error surface at the current search point. This vector points in the direction of steepest descent. Moving in this direction will decrease the error and will eventually find a new (local) minimum, provided that the step size is adapted appropriately. Small steps slow down learning, i.e., require a larger number of iterations. Large steps may converge faster, but may also overstep the solution or make the algorithm oscillate around a minimum without convergence of the weights. Therefore, the step size is made proportional to the slope $\delta$, i.e., is reduced when the search point approaches a minimum, and to the learning rate $\alpha$. The constant $\alpha$ allows one to control the size of the gradient descent step and is usually set to between 0.1 and 0.5. For practical purposes, it is recommended to choose the learning rate as large as possible without leading to oscillation.
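Equations (7)–(9) can be combined into a compact sketch of pattern-wise backpropagation on the XOR problem. The two-unit hidden layer, learning rate, random seed, and the appended bias inputs are illustrative assumptions (the equations above leave the bias implicit), so a given run may still end in a local minimum; the training error, however, should drop:

```python
import math
import random

# One-hidden-layer backpropagation with pattern-wise updates on XOR.
random.seed(1)
f = lambda s: 1.0 / (1.0 + math.exp(-s))          # sigmoid, Eq. (4)

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
alpha = 0.5                                        # learning rate
# weights[j][i]: weight from input i (last index is the bias) to unit j
w_hid = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(3)]

def forward(x):
    ext = list(x) + [1.0]                          # inputs plus bias
    hidden = [f(sum(w * v for w, v in zip(row, ext))) for row in w_hid]
    output = f(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, output

def total_error():
    return sum((o - forward(x)[1]) ** 2 for x, o in XOR)

error_before = total_error()
for epoch in range(5000):
    for x, o in XOR:
        hidden, y = forward(x)
        delta_out = (o - y) * y * (1 - y)                   # Eq. (8)
        deltas_hid = [h * (1 - h) * delta_out * w_out[j]    # Eq. (9)
                      for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden + [1.0]):
            w_out[j] += alpha * delta_out * h               # Eq. (7)
        for j, d in enumerate(deltas_hid):
            for i, v in enumerate(list(x) + [1.0]):
                w_hid[j][i] += alpha * d * v                # Eq. (7)

print(error_before, "->", total_error())  # the training error drops
```

Note that the hidden-layer error signals are computed with the old output weights before those weights are updated, as the recursive definition in Eq. (9) requires.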
Momentum One possibility to avoid oscillation and to achieve faster convergence is the addition of a momentum term proportional to the previous weight change:

Δw_ij(t+1) = α δ_j y_i + β Δw_ij(t) .   (10)
The algorithm increases the step size if it has taken several steps in the same direction. This gives it the ability to overcome obstacles in the error surface, e. g., to avoid and escape from local minima, and to move faster over larger plateaus. Finding the optimum learning rate α and momentum scale parameter β, i. e., the best trade-off between longer training time and instability, can be difficult and might require many experiments. Global or local adaptation techniques use, for instance, the partial derivative to automatically adapt the learning rate. Examples are the Delta-Bar-Delta rule [18] and the SuperSAB algorithm [51].
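A minimal sketch of the momentum update of Eq. (10), with an illustrative constant gradient and hypothetical parameter values: repeated steps in the same direction let the effective step size grow, which is how momentum speeds movement across plateaus:

```python
def momentum_update(alpha, delta_j, y_i, beta, prev_dw):
    # Eq. (10): Delta w_ij(t+1) = alpha * delta_j * y_i + beta * Delta w_ij(t)
    return alpha * delta_j * y_i + beta * prev_dw

# constant error signal: each step adds momentum from the previous step
dw = 0.0
steps = []
for _ in range(5):
    dw = momentum_update(0.1, 0.2, 1.0, 0.9, dw)
    steps.append(dw)
# the step sizes increase monotonically toward the limit alpha*delta*y / (1 - beta)
```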
Other Learning Rules The backpropagation learning algorithm is computationally efficient in that its time complexity is linear in the number of weight parameters. Its learning speed, measured in epochs, is comparatively low, however. This may result in long training times, especially for difficult and complex problems requiring larger networks or larger amounts of training data. Another major limitation is that backpropagation does not always converge. Still, it is a widely used algorithm and has its advantages: it is relatively easy to apply and to configure and provides a quick, though not absolutely perfect, solution. Because its error adjustment is usually made pattern-wise, it is hardly affected by training data containing a larger number of redundant examples. Standard backpropagation also generalizes as well on small data sets as more advanced algorithms do, e. g., if there is insufficient information available to find a more precise solution. There are many variations of the backpropagation algorithm, like resilient propagation (Rprop) [42], quick propagation (Quickprop) [13], conjugate gradient descent [6], Levenberg–Marquardt [16], and Delta-Bar-Delta [18], to mention the most popular. All these second-order algorithms are designed to deal with some of the limitations of the standard approach. Some work substantially faster in many problem domains, but require more control parameters than backpropagation, which makes them more difficult to use.
Data-Mining and Knowledge Discovery, Neural Networks in
Resilient Propagation Resilient propagation (Rprop), as proposed in [42,43], is a variant of standard backpropagation with very robust control parameters that are easy to adjust. The algorithm converges faster than the standard algorithm without being less accurate. The size of the weight step Δw_ij taken by standard backpropagation depends not only on the learning rate α, but also on the size of the partial derivative (see Eq. (7)). This may have an unpredictable influence during training that is difficult to control. Therefore, Rprop uses only the sign of the derivative to adjust the weights. It necessarily requires learning by epoch, i. e., all adjustments take place after each epoch only. One iteration of the Rprop algorithm involves two steps, the adjustment of the step size and the update of the weights. The amount of weight change is determined by the following update rule:

Δ_ij(t) = η⁺ Δ_ij(t−1)   if d_ij(t−1) · d_ij(t) > 0
Δ_ij(t) = η⁻ Δ_ij(t−1)   if d_ij(t−1) · d_ij(t) < 0
Δ_ij(t) = Δ_ij(t−1)      otherwise   (11)

with 0 < η⁻ < 1 < η⁺ and the derivative d_ij = ∂E/∂w_ij = −δ_j y_i. Every time the derivative term changes its sign, indicating that the last update (at time t−1) was too large and the algorithm has jumped over a local minimum, the update value (step size) Δ_ij(t−1) is decreased by the constant factor η⁻. The rule for updating the weights is straightforward:

w_ij(t+1) = w_ij(t) + Δw_ij(t)   (12)

Δw_ij(t) = −Δ_ij(t)   if d_ij(t) > 0
Δw_ij(t) = +Δ_ij(t)   if d_ij(t) < 0
Δw_ij(t) = 0          otherwise   (13)
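The two Rprop steps, step-size adaptation and weight update, can be sketched as follows. This is a simplified version: the full algorithm of [42,43] also includes weight-backtracking on sign changes, which is omitted here, and all numeric inputs are illustrative:

```python
def rprop_step(d_prev, d_curr, step_prev, eta_minus=0.5, eta_plus=1.2,
               step_min=1e-6, step_max=50.0):
    # Step-size adaptation: grow the step while the gradient sign is
    # unchanged, shrink it when the sign flips (a minimum was overstepped).
    if d_prev * d_curr > 0:
        step = min(step_prev * eta_plus, step_max)
    elif d_prev * d_curr < 0:
        step = max(step_prev * eta_minus, step_min)
    else:
        step = step_prev
    # Weight update: move against the sign of the current derivative.
    if d_curr > 0:
        dw = -step
    elif d_curr < 0:
        dw = step
    else:
        dw = 0.0
    return step, dw

step, dw = rprop_step(0.4, 0.3, 0.1)      # same sign: step grows by eta_plus
step2, dw2 = rprop_step(0.3, -0.2, step)  # sign flip: step shrinks by eta_minus
```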
One advantage of the Rprop algorithm, compared to, for example, Quickprop [13], is its small set of parameters that hardly requires adaptation. Standard values for the decrease factor η⁻ and the increase factor η⁺ are 0.5 and 1.2, respectively. To avoid too large or too small weight changes, the step size is bounded above by Δ_max and below by Δ_min, set by default to 50 and 10⁻⁶. The same initial value Δ₀ = 0.1 is recommended for all Δ_ij. The choice of the parameter settings is not critical; for most problems no other choice is needed to obtain an optimal or at least a nearly optimal solution. Application Issues The architecture of a neural network, i. e., the number of (hidden) neurons and layers, is an important decision. If
a neural network is highly redundant and overparameterized, it might adapt too much to the data. Thus, there is a trade-off between reducing bias (fitting the training data) and reducing variance (fitting unknown data). The most common procedure is to select a network structure that has more than enough parameters and neurons and then to avoid overfitting through the training algorithm alone (see Subsect. “Overtraining and Generalization”). There is no generally best network structure for a particular type of application. There are only general rules for selecting the network architecture: (1) The more complex the relationships between input and output data are, the higher the number of hidden units should be. (2) If the modeled process is separable into multiple stages, more than one hidden layer may be beneficial. (3) An upper bound for the total number of hidden units may be set by the number of data examples divided by the number of input and output units and multiplied by a scaling factor. A simpler rule is to start with one hidden layer and half as many hidden units as there are input and output units. One would expect that for a given data set there would be an optimal network size, lying between a minimum of one hidden neuron (high bias, low variance) and a very large number of neurons (low bias, high variance). While this is true for some data sets, in many cases increasing the number of hidden nodes continues to improve prediction accuracy, as long as cross-validation is used to stop training in time. For classification problems, the neural network assigns to each input case a class label or, more generally, estimates the probability of the case falling into each class. The various output classes of a problem are normally represented in neural networks using one of two techniques: binary encoding and one-out-of-n encoding. A binary encoding is only possible for two-class problems.
A single unit predicts class 1 if its output is above the acceptance threshold. If the output is below the rejection threshold, class 0 is predicted. Otherwise, the output class is undecided. If the network output is to be always defined, both threshold values must be equal (e. g. 0.5). In one-out-of-n encoding one unit is allocated for each class. A class is selected if the corresponding output is above the acceptance threshold and all the other outputs are below the rejection threshold. If this condition is not met, the class is undecided. Alternatively, instead of using a threshold, a winner-takes-all decision may be made such that the unit with the highest output gives the class. For regression problems, the objective is to estimate the value of a continuous output variable, given the input variables. Particularly important issues in regression are output scaling and interpolation. The most common NN
architectures produce outputs in a limited range. Scaling algorithms may be applied to the training data to ensure that the target outputs are in the same range. Constraining the network’s outputs limits its generalization performance. To overcome this, a linear activation function may be used for the output units. Then there is often no need for output scaling at all, since the units can in principle calculate any value. Other Neural Network Architectures This section summarizes some alternative NN architectures that are variants or extensions of multi-layer feedforward networks. Cascade Correlation Networks Cascade-correlation is a neural network architecture with variable size and topology [14]. The initial network has no hidden layer and grows during training by adding new hidden units one at a time. In doing so, a near-minimal network topology is built. In the cascade architecture the outputs from all existing hidden neurons in the network are fed into a new neuron. In addition, all neurons – including the output neurons – receive all input values. For each new hidden unit, the learning algorithm tries to maximize the correlation between this unit’s output and the overall network error using an ordinary learning algorithm, e. g., backpropagation. After that the input-side weights of the new neuron are frozen. Thus, it does not change anymore and becomes a permanent feature detector. Cascade correlation networks have several advantages over multi-layer perceptrons: (1) Training time is much shorter, not least because the network size is relatively small. (2) They require little or no adjustment of parameters, especially not in terms of the number of hidden neurons to use. (3) They are more robust, and training is less likely to become stuck in local minima. Recurrent Networks A network architecture with cycles is adopted by recurrent or feedback neural networks, such that outputs of some neurons are fed back as extra inputs.
Because past outputs are used to calculate future outputs, the network is said to “remember” its previous state. Recurrent networks are designed to process sequential information, like time series data. Processing depends on the state of the network at the last time step. Consequently, the response to the current input depends on previous inputs.
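The idea of feeding previous outputs back as extra inputs can be sketched as follows; the single-input, two-hidden-unit layout and all weight values are hypothetical:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def recurrent_step(x, context, w_in, w_ctx):
    # The hidden activation depends on the current input and on a copy
    # of the previous hidden outputs (the feedback, or "context", inputs).
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(w_row, x)) +
                      sum(wc * ci for wc, ci in zip(c_row, context)))
              for w_row, c_row in zip(w_in, w_ctx)]
    return hidden  # becomes the next step's context

w_in = [[0.5], [-0.5]]            # 1 input -> 2 hidden units (illustrative weights)
w_ctx = [[0.3, 0.1], [0.2, 0.4]]  # feedback weights from the 2 context units
context = [0.0, 0.0]
outputs = []
for x in ([1.0], [1.0], [1.0]):   # present the same input three times
    context = recurrent_step(x, context, w_in, w_ctx)
    outputs.append(context)
```

Feeding the same input repeatedly yields different hidden states, because the context changes from step to step; this is the "memory" of previous inputs described above.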
Two similar types of recurrent network are extensions of the multi-layer perceptron: Jordan networks [19] feed back all network outputs into the input layer; Elman networks [12] feed back from the hidden units. State or context units are added to the input layer for the feedback connections, which all have constant weight one. At each time step t, an input vector is propagated in a standard feedforward fashion, and then a learning rule (usually backpropagation) is applied. The extra units always maintain a copy of the previous outputs at time step t−1. Radial Basis Function Networks Radial basis function (RBF) networks [8,34,37,38] are another popular variant of two-layer feedforward neural networks, which use radial basis functions as activation functions. The idea behind radial basis functions is to approximate the unknown function f(x) by a weighted sum of non-linear basis functions φ, which are often Gaussian functions with a certain standard deviation σ:

f(x) = Σ_i w_i φ(‖x − c_i‖) .   (14)
The basis functions operate on the Euclidean distance between the n-dimensional input vector x and a center vector c_i. Once the center vectors c_i are fixed, the weight coefficients w_i are found by simple linear regression. The architecture of RBF networks is fixed to two processing layers. Each unit in the hidden layer represents a center vector and a basis function which realizes a non-linear transformation of the inputs. Each output unit calculates a weighted sum (linear combination) of the non-linear outputs from the hidden layer. Only the connections between the hidden layer and the output layer are weighted. The use of a linear output layer in RBF networks is motivated by Cover’s theorem on the separability of patterns. The theorem states that if the transformation from the data (input) space to the feature (hidden) space is non-linear and the dimensionality of the feature space is relatively high compared to that of the data space, then there is a high likelihood that a non-separable pattern classification task in the input space is transformed into a linearly separable one in the feature space. The center vectors are selected from the training data, either randomly or uniformly distributed in the input space. In principle, as many centers (and hidden units) may be used as there are data examples. Another method is to group the data in space using, for example, k-means clustering, and select center vectors close to the cluster centers.
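Fixing the centers and solving for the weights by linear regression, as described above, can be sketched in one dimension. With as many centers as data points the regression reduces to an exact linear system; all values here are illustrative:

```python
import math

def rbf(r, sigma=1.0):
    # Gaussian basis function phi applied to a distance r = ||x - c||
    return math.exp(-(r * r) / (2.0 * sigma * sigma))

def design_row(x, centers):
    return [rbf(abs(x - c)) for c in centers]

# Two centers and two 1-D training points: the weights solve a 2x2 system
centers = [0.0, 1.0]
xs, ys = [0.0, 1.0], [1.0, 3.0]
(a, b), (c, d) = design_row(xs[0], centers), design_row(xs[1], centers)
det = a * d - b * c
w = [(d * ys[0] - b * ys[1]) / det,   # Cramer's rule for the 2x2 system
     (a * ys[1] - c * ys[0]) / det]

def predict(x):
    # Eq. (14): f(x) = sum_i w_i * phi(||x - c_i||)
    return sum(wi * p for wi, p in zip(w, design_row(x, centers)))
```

By construction the resulting model interpolates the two training points exactly; with more data points than centers, an ordinary least-squares fit would take the place of the exact solve.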
Data-Mining and Knowledge Discovery, Neural Networks in
RBF learning can be considered a curve-fitting problem in high-dimensional space, i. e., it approximates a surface with the basis functions that best fits and interpolates the training data points. The basis functions are well-suited to online learning applications, like adaptive process control. Adapting the network to new data and changing data statistics only requires retraining by linear regression, which is fast. RBF networks are more local approximators than multi-layer perceptrons: new training data from one region of the input space have less effect on the learned model and its predictions in other regions. Self-organizing Maps A self-organizing map (SOM) or Kohonen map [22,23] applies an unsupervised and competitive learning scheme. That means that the class labels of the data vectors are unknown or not used for training and that each neuron improves through competition with other neurons. It is a non-deterministic machine-learning approach to data clustering that implements a mapping of the high-dimensional input data into a low-dimensional feature space. In doing so, SOMs filter and compress information while preserving the most relevant features of a data set. Complex, non-linear relationships and dependencies between data vectors and between clusters are revealed and transformed into simple geometric distances. Such an abstraction facilitates both visualization and interpretation of the clustering result. Typically, the units of a SOM network are arranged in a two-dimensional regular grid, the topological feature map, which defines a two-dimensional Euclidean distance between units. Each unit is assigned a center vector from the n-dimensional data space and represents a certain cluster. Algorithm 1 describes the basic principle behind SOM training. Starting with an initially random set of center vectors, the algorithm iteratively adjusts them to reflect the clustering of the training data.
In doing so, the two-dimensional order of the SOM units is imposed on the input vectors such that more similar clusters (center vectors) in the input space lie closer to each other on the two-dimensional grid structure than more different clusters. One can think of the topological map as being folded and distorted into the n-dimensional input space so as to preserve as much as possible of the original structure of the data.

Algorithm 1 (Self-organizing Map)
1. Initialize the n-dimensional center vector c_i ∈ Rⁿ of each cluster randomly.
2. For each data point p ∈ Rⁿ find the nearest center vector c_w (called the “winner”) in n-dimensional space according to a distance metric d.
3. Move c_w and all centers c_i within its local neighborhood closer to p:
   c_i(t+1) = c_i(t) + α(t) h_r(t) (p − c_i(t))
   with learning rate 0 < α < 1 and neighborhood function h depending on a neighborhood radius r.
4. α(t+1) = α(t) − Δα, where Δα = α₀/t_max;
   r(t+1) = r(t) − Δr, where Δr = r₀/t_max.
5. Repeat steps 2–4 for each epoch t = 1, …, t_max.
6. Assign each data point to the cluster with the nearest center vector.

Each iteration involves randomly selecting a data point p and moving the closest center vector a bit in the direction of p. Only the distance metric d (usually Euclidean) defined on the data space influences the selection of the closest cluster. The adjustment of centers is applied not just to the winning neuron, but to all neurons in its neighborhood. The neighborhood function is often Gaussian. A simple definition sets h = 1 if the Euclidean distance between the grid coordinates of the winning cluster w and a cluster i is below a radius r, i. e., ‖(x_w, y_w) − (x_i, y_i)‖ < r, and h = 0 otherwise. Both the learning rate α and the neighborhood radius r decrease monotonically over time. Initially quite large areas of the network are affected by the neighborhood update, leading to a rather rough topological order. As epochs pass, fewer neurons are altered with lower intensity, and finer distinctions are drawn within areas of the map. Unlike hierarchical clustering [52] and k-means clustering [29], which are both deterministic – apart from the randomized initialization in k-means clustering – and operate only locally, SOMs are less likely to become stuck in local minima and have a higher robustness and accuracy. Once the network has been trained to recognize structures in the data, it can be used as a visualization tool and for exploratory data analysis. The example map in Fig. 3 shows a clustering of time series of gene expression values [48]. Clearly higher similarities between neighboring clusters are revealed when comparing the mean vectors. If neurons in the feature map can be labeled, i. e., if a common meaning can be derived from the vectors in a cluster, the network becomes capable of classification.
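Steps 2 and 3 of Algorithm 1 can be sketched as a single training iteration. The grid layout, data point, and parameter values are illustrative, and the simple 0/1 neighborhood function from the text is used in place of a Gaussian:

```python
def som_step(point, centers, grid, alpha, radius):
    # Step 2: find the winning unit (nearest center in data space),
    # Step 3: move the winner and its grid neighbours toward the point.
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    w = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    for i, c in enumerate(centers):
        # simple neighborhood: h = 1 within radius on the grid, else 0
        h = 1.0 if dist(grid[w], grid[i]) < radius else 0.0
        centers[i] = [ci + alpha * h * (pi - ci) for ci, pi in zip(c, point)]
    return w

# four units on a 2x2 topological grid, 2-D data (illustrative values)
grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
centers = [[0.1, 0.1], [0.1, 0.9], [0.9, 0.1], [0.9, 0.9]]
winner = som_step([1.0, 1.0], centers, grid, alpha=0.5, radius=1.5)
```

In a full run, steps 4 and 5 of the algorithm would shrink `alpha` and `radius` each epoch, so that early iterations impose a rough global order and later ones make only fine local adjustments.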
If the winning neuron of an unknown input case has not been assigned a class label, labels of clusters in close or direct neighborhood may be considered. Ideally, higher similarities between neighboring data clusters are reflected in similar class labels. Alternatively, the network output is
Data-Mining and Knowledge Discovery, Neural Networks in, Figure 3: 6 × 6 SOM example clustering of gene expression data (time series over 24 time points) [48]. The mean expression vector is plotted for each cluster; cluster sizes indicate the number of vectors (genes) in each cluster.
undefined in this case. SOM classifiers also make use of the distance of the winning neuron from the input case. If this distance exceeds a certain maximum threshold, the SOM is regarded as undecided. In this way, a SOM can be used for detecting novel data classes. Future Directions To date, neural networks are widely accepted as an alternative to classical statistical methods and are frequently used in medicine [4,5,11,25,27,44,45], with many applications related to cancer research [10,26,35,36,49]. These comprise primarily diagnostic and prognostic (i. e. classification) tasks, but also image analysis and drug design. Cancer prediction is often based on clustering of gene expression data [15,50] or microRNA expression profiles [28], which may involve both self-organizing maps and multi-layer feedforward neural networks. Another broad application field of neural networks today is bioinformatics and, in particular, the analysis
and classification of gene and protein sequences [3,20,56]. A well-known successful example is protein (secondary) structure prediction from sequence [39,40]. Even though the NN technology is clearly established today, the current period is rather characterized by stagnation. This is partly because of a redirection of research to newer and often – but not generally – more powerful paradigms, like the popular support vector machines (SVMs) [9,47], or to more open and flexible methods, like genetic programming (GP) [7,24]. In many applications, for example in bioinformatics, SVMs have already replaced conventional neural networks as the state-of-the-art black-box classifier.

Bibliography

Primary Literature
1. Ackley DH, Hinton GE, Sejnowski TJ (1985) A learning algorithm for Boltzmann machines. Cogn Sci 9:147–169
2. Anderson JA et al (1977) Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychol Rev 84:413–451
3. Baldi P, Brunak S (2001) Bioinformatics: The machine learning approach. MIT Press, Cambridge
4. Baxt WG (1995) Applications of artificial neural networks to clinical medicine. Lancet 346:1135–1138
5. Begg R, Kamruzzaman J, Sarkar R (2006) Neural networks in healthcare: Potential and challenges. Idea Group Publishing, Hershey
6. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press, London
7. Brameier M, Banzhaf W (2007) Linear genetic programming. Springer, New York
8. Broomhead DS, Lowe D (1988) Multivariable functional interpolation and adaptive networks. Complex Syst 2:321–355
9. Cristianini N (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, London
10. Dybowski R (2000) Neural computation in medicine: Perspectives and prospects. In: Malmgren H, Borga M, Niklasson L (eds) Proceedings of the Conference on Artificial Neural Networks in Medicine and Biology (ANNIMAB). Springer, Berlin, pp 26–36
11. Dybowski R, Gant V (2001) Clinical applications of artificial neural networks. Cambridge University Press, London
12. Elman JL (1990) Finding structure in time. Cogn Sci 14:179–211
13. Fahlman SE (1989) Faster learning variations on backpropagation: An empirical study. In: Touretzky DS, Hinton GE, Sejnowski TJ (eds) Proceedings of the 1988 Connectionist Models Summer School. Morgan Kaufmann, San Mateo, pp 38–51
14. Fahlman SE, Lebiere C (1990) The cascade-correlation learning architecture. In: Touretzky DS (ed) Advances in Neural Information Processing Systems 2. Morgan Kaufmann, Los Altos
15. Golub TR et al (1999) Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286(5439):531–537
16. Hagan MT, Menhaj M (1994) Training feedforward networks with the Marquardt algorithm. IEEE Trans Neural Netw 5(6):989–993
17. Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad Sci 79(8):2554–2558
18. Jacobs RA (1988) Increased rates of convergence through learning rate adaptation. Neural Netw 1:295–307
19. Jordan MI (1986) Attractor dynamics and parallelism in a connectionist sequential machine. In: Proceedings of the Eighth Annual Conference of the Cognitive Science Society. Lawrence Erlbaum, Hillsdale, pp 531–546
20. Keedwell E, Narayanan A (2005) Intelligent bioinformatics: The application of artificial intelligence techniques to bioinformatics problems. Wiley, New York
21. Kohonen T (1977) Associative memory: A system-theoretical approach. Springer, Berlin
22. Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biol Cybern 43:59–69
23. Kohonen T (1995) Self-organizing maps. Springer, Berlin
24. Koza JR (1992) Genetic programming: On the programming of computers by means of natural selection. MIT Press, Cambridge
25. Lisboa PJG (2002) A review of evidence of health benefit from artificial neural networks in medical intervention. Neural Netw 15:11–39
26. Lisboa PJG, Taktak AFG (2006) The use of artificial neural networks in decision support in cancer: A systematic review. Neural Netw 19(4):408–415
27. Lisboa PJG, Ifeachor EC, Szczepaniak PS (2001) Artificial neural networks in biomedicine. Springer, Berlin
28. Lu J et al (2005) MicroRNA expression profiles classify human cancers. Nature 435:834–838
29. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol 1. University of California Press, Berkeley, pp 281–297
30. McClelland JL, Rumelhart DE (1986) Parallel distributed processing: Explorations in the microstructure of cognition. MIT Press, Cambridge
31. McClelland J, Rumelhart D (1988) Explorations in parallel distributed processing. MIT Press, Cambridge
32. McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 5:115–133
33. Minsky ML, Papert SA (1969/1988) Perceptrons. MIT Press, Cambridge
34. Moody J, Darken CJ (1989) Fast learning in networks of locally-tuned processing units. Neural Comput 1:281–294
35. Naguib RN, Sherbet GV (1997) Artificial neural networks in cancer research. Pathobiology 65(3):129–139
36. Naguib RNG, Sherbet GV (2001) Artificial neural networks in cancer diagnosis, prognosis, and patient management. CRC Press, Boca Raton
37. Poggio T, Girosi F (1990) Regularization algorithms for learning that are equivalent to multilayer networks. Science 247:978–982
38. Poggio T, Girosi F (1990) Networks for approximation and learning. Proc IEEE 78:1481–1497
39. Qian N, Sejnowski TJ (1988) Predicting the secondary structure of globular proteins using neural network models. J Mol Biol 202:865–884
40. Rost B (2001) Review: Protein secondary structure prediction continues to rise. J Struct Biol 134:204–218
41. Rosenblatt F (1958) The perceptron: A probabilistic model for information storage and organization in the brain. Psychol Rev 65(6):386–408
42. Riedmiller M, Braun H (1992) Rprop – a fast adaptive learning algorithm. In: Proceedings of the International Symposium on Computer and Information Science VII
43. Riedmiller M, Braun H (1993) A direct adaptive method for faster backpropagation learning: The Rprop algorithm. In: Proceedings of the IEEE International Conference on Neural Networks. IEEE Press, Piscataway, pp 586–591
44. Ripley BD, Ripley RM (2001) Neural networks as statistical methods in survival analysis. In: Dybowski R, Gant V (eds) Clinical applications of artificial neural networks. Cambridge University Press, London
45. Robert C et al (2004) Bibliometric overview of the utilization of artificial neural networks in medicine and biology. Scientometrics 59:117–130
46. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323:533–536
47. Schölkopf B, Smola AJ (2001) Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
48. Spellman PT et al (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9(12):3273–3297
49. Taktak AFG, Fisher AC (2007) Outcome prediction in cancer. Elsevier Science, London
50. Tamayo P et al (1999) Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc Natl Acad Sci 96(6):2907–2912
51. Tollenaere T (1990) SuperSAB: Fast adaptive backpropagation with good scaling properties. Neural Netw 3:561–573
52. Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58(301):236–244
53. Werbos PJ (1974) Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University
54. Werbos PJ (1994) The roots of backpropagation. Wiley, New York
55. Widrow B, Hoff ME (1960) Adaptive switching circuits. In: IRE WESCON Convention Record, vol 4. Institute of Radio Engineers (now IEEE), pp 96–104
56. Wu CH, McLarty JW (2000) Neural networks and genome informatics. Elsevier Science, Amsterdam
Books and Reviews
Abdi H (1994) A neural network primer. J Biol Syst 2:247–281
Bishop CM (2008) Pattern recognition and machine learning. Springer, Berlin
Fausett L (1994) Fundamentals of neural networks: Architectures, algorithms, and applications. Prentice Hall, New York
Freeman JA, Skapura DM (1991) Neural networks: Algorithms, applications, and programming techniques. Addison-Wesley, Reading
Gurney K (1997) An introduction to neural networks. Routledge, London
Hastie T, Tibshirani R, Friedman JH (2003) The elements of statistical learning. Springer, Berlin
Haykin S (1999) Neural networks: A comprehensive foundation. Prentice Hall, New York
Hertz J, Krogh A, Palmer R (1991) Introduction to the theory of neural computation. Addison-Wesley, Redwood City
Kröse B, van der Smagt P (1996) An introduction to neural networks. University of Amsterdam, Amsterdam
Masters T (1993) Practical neural network recipes in C++. Academic Press, San Diego
Masters T (1995) Advanced algorithms for neural networks: A C++ sourcebook. Wiley, New York
Parks R, Levine D, Long D (1998) Fundamentals of neural network modeling. MIT Press, Cambridge
Patterson D (1996) Artificial neural networks. Prentice Hall, New York
Peretto P (1992) An introduction to the modeling of neural networks. Cambridge University Press, London
Ripley BD (1996/2007) Pattern recognition and neural networks. Cambridge University Press, London
Smith M (1993) Neural networks for statistical modeling. Van Nostrand Reinhold, New York
Wasserman PD (1989) Neural computing: Theory and practice. Van Nostrand Reinhold, New York
Wasserman PD (1993) Advanced methods in neural computing. Van Nostrand Reinhold, New York
De Veaux RD, Ungar LH (1997) A brief introduction to neural networks. Technical report, Williams College, University of Pennsylvania
Hinton GE (1992) How neural networks learn from experience. Sci Am 267:144–151
Lippmann RP (1987) An introduction to computing with neural nets. IEEE ASSP Mag 4(2):4–22
Reilly D, Cooper L (1990) An overview of neural networks: Early models to real world systems. Neural Electron Netw 2:229–250
Decision Trees
VILI PODGORELEC, MILAN ZORMAN
University of Maribor, Maribor, Slovenia

Article Outline
Glossary
Definition of the Subject
Introduction
The Basics of Decision Trees
Induction of Decision Trees
Evaluation of Quality
Applications and Available Software
Future Directions
Bibliography

Glossary
Accuracy The most important quality measure of an induced decision tree classifier. The most general is the overall accuracy, defined as the percentage of correctly classified instances among all instances (correctly and incorrectly classified). The accuracy is usually measured both for the training set and the testing set.
Attribute A feature that describes an aspect of an object (both training and testing) used for a decision tree. An object is typically represented as a vector of attribute values. There are two types of attributes: continuous attributes, whose domain is numerical, and discrete attributes, whose domain is a set of predetermined values. There is one distinguished attribute called the decision class (a dependent attribute). The remaining attributes (the independent attributes) are used to determine the value of the decision class.
Attribute node Also called a test node. An internal node in the decision tree model that is used to determine the branch to follow from this node, based on the value of the corresponding attribute of an object being classified.
Classification A process of mapping instances (i. e. training or testing objects) represented by attribute-value vectors to decision classes. If the predicted decision class of an object is equal to the actual decision class of the object, then the classification of the object is accurate. The aim of classification methods is to classify objects with the highest possible accuracy.
Classifier A model built upon the training set and used for classification. The input to a classifier is an object (a vector of known values of the attributes) and the output of the classifier is the predicted decision class for this object.
Decision node A leaf in a decision tree model (also called a decision) containing one of the possible decision classes. It is used to determine the predicted decision class of an object being classified that arrives at the leaf on its path through the decision tree model.
Instance Also called an object (training or testing), represented by an attribute-value vector. Instances are used to describe the domain data.
Induction Inductive inference is the process of moving from concrete examples to general models, where the goal is to learn how to classify objects by analyzing a set of instances (already solved cases) whose classes are known. Instances are typically represented as attribute-value vectors. Learning input consists of a set of such vectors, each belonging to a known class, and the output consists of a mapping from attribute values to classes. This mapping should accurately classify both the given instances (a training set) and other unseen instances (a testing set).
Split selection A method used in the process of decision tree induction for selecting the most appropriate attribute and its splits in each attribute (test) node of the tree. Split selection is usually based on some impurity measure and is considered the most important aspect of decision tree learning.
Training object An object that is used for the induction of a decision tree. In a training object both the values of the attributes and the decision class are known. All the training objects together constitute a training set, which is the source of the “domain knowledge” that the decision tree will try to represent.
Testing object An object that is used for the evaluation of a decision tree. In a testing object the values of the attributes are known and the decision class is unknown to the decision tree. All the testing objects together constitute a testing set, which is used to test an induced decision tree – to evaluate its quality (regarding the classification accuracy).
Training set A prepared set of training objects.
Testing set A prepared set of testing objects.
Definition of the Subject The term decision trees (abbreviated, DT) has been used for two different purposes: in decision analysis as a decision support tool for modeling decisions and their possible consequences to select the best course of action in situations where one faces uncertainty, and in machine learning or data mining as a predictive model; that is, a mapping from observations about an item to conclusions
Decision Trees
about its target value. This article concentrates on the machine learning view of DT. More descriptive names for DT models are classification tree (discrete outcome) or regression tree (continuous outcome). In DT structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. The machine learning technique for inducing a DT classifier from data (training objects) is called decision tree learning, or decision trees. The main goal of classification (and regression) is to build a model that can be used for prediction [16]. In a classification problem, we are given a data set of training objects (a training set), each object having several attributes. There is one distinguished attribute called the decision class; it is a dependent attribute, whose value should be determined using the induced decision tree. The remaining attributes (the independent attributes; referred to simply as attributes in the following text) are used to determine the value of the decision class. Classification is thus a process of mapping instances (i.e. training or testing objects), represented by attribute-value vectors, to decision classes. The aim of DT learning is to induce a DT model that is able to accurately predict the decision class of an object based on the values of its attributes. The classification of an object is accurate if the predicted decision class of the object equals its actual decision class. The DT is induced using a training set (a set of objects where both the values of the attributes and the decision class are known) and the resulting DT is used to determine decision classes for unseen objects (where the values of the attributes are known but the decision class is unknown). A good DT should accurately classify both the given instances (the training set) and other unseen instances (a testing set). DT have a wide range of applications, but they are especially attractive in data mining [16].
Because of their intuitive representation, the resulting classification model is easy to understand, interpret and criticize [5]. DT can be constructed relatively fast compared to other methods [28]. And lastly, the accuracy of classification trees is comparable to that of other classification models [19,28]. Every major data mining tool includes some form of classification tree model construction component [17].

Introduction
In the early 1960s Feigenbaum and Simon presented EPAM (the Elementary Perceiver And Memorizer) [12], a psychological theory of learning and memory implemented as a computer program. EPAM was the first system to use decision trees (then called discrimination nets). It memorized pairs of nonsense syllables by using discrimination nets in which it could find images of syllables it had seen. It stored as its cue only enough information to disambiguate the syllable from others seen at the time the association was formed. Thus, old cues might be inadequate to retrieve data later, so the system could “forget”. Originally designed to simulate phenomena in verbal learning, it was later adapted to account for data on the psychology of expertise and concept formation.

Later in the 1960s Hunt et al. built further on this concept and introduced CLS (Concept Learning System) [23], which used heuristic lookahead to construct trees. CLS was a learning algorithm that learned concepts and used them to classify new cases. CLS was the precursor to decision trees; it led to Quinlan’s ID3 system. ID3 [38] added the idea of using information content to choose the attribute to split on; it also initially chooses a window (a subset) of the training examples and tries to learn a concept that correctly classifies all of them based on that window; if this fails, it increases the window size. Quinlan later constructed C4.5 [41], an industrial version of ID3. Utgoff’s ID5 [53] was an extension of ID3 that allowed many-valued classifications as well as incremental learning. From the early 1990s both the number of researchers and the number of applications of DT have grown tremendously.

Objective and Scope of the Article
In this article an overview of DT is presented, with the emphasis on the variety of different induction methods available today. Induction algorithms ranging from the traditional heuristic-based techniques to the most recent hybrids, such as evolutionary and neural network-based approaches, are described. The basic features, advantages and drawbacks of each method are presented.
For readers not yet familiar with the field of DT this article should be a good introduction to the topic, whereas for more experienced readers it should broaden their perspective and deepen their knowledge.

The Basics of Decision Trees

Problem Definition
DT are a typical representative of the symbolic machine learning approach, used for the classification of objects into decision classes, where an object is represented in the form of an attribute-value vector (attribute_1, attribute_2, ..., attribute_N, decision class). The attribute values describe the features of an object. The attributes are usually identified and selected by the creators of the dataset. The decision class is one special attribute whose value is known for the objects in the learning set and which will be predicted, based on the induced DT, for all further unseen objects. Normally the decision class is a feature that cannot be measured directly (for example some prediction about the future) or a feature whose measurement is unacceptably expensive, complex, or not known at all. Examples of attributes-decision class objects are: a patient’s examination results and the diagnosis, a raster of pixels and the recognized pattern, stock market data and a business decision, past and present weather conditions and the weather forecast. The decision class is always a discrete-valued attribute and is represented by a set of possible values (in the case of regression trees the decision class is a continuous-valued attribute). If the decision should be made for a continuous-valued attribute, the values should be discretized first (by appropriately transforming continuous intervals into corresponding discrete values).

Formal Definition
Let A_1, ..., A_n, C be random variables, where A_i has domain dom(A_i) and C has domain dom(C); we assume without loss of generality that dom(C) = {c_1, c_2, ..., c_j}. A DT classifier is a function

dt: dom(A_1) × ... × dom(A_n) → dom(C) .

Let P(A', C') be a probability distribution on dom(A_1) × ... × dom(A_n) × dom(C) and let t = <t.A_1, ..., t.A_n, t.C> be a record randomly drawn from P; i.e., t has probability P(A', C') that <t.A_1, ..., t.A_n> ∈ A' and t.C ∈ C'. We define the misclassification rate R_dt of classifier dt to be P(dt(<t.A_1, ..., t.A_n>) ≠ t.C). In terms of the informal introduction to this article, the training database D is a random sample from P, the A_i correspond to the attributes, and C is the decision class.

A DT is a directed, acyclic graph T in the form of a tree. Each node in the tree has zero or more outgoing edges.
If a node has no outgoing edges, it is called a decision node (a leaf node); otherwise it is called a test node (or an attribute node). Each decision node N is labeled with one of the possible decision classes c ∈ {c_1, ..., c_j}. Each test node is labeled with one attribute A_i ∈ {A_1, ..., A_n}, called the splitting attribute. Each splitting attribute A_i has a splitting function f_i associated with it. The splitting function f_i determines the outgoing edge from the test node, based on the value of attribute A_i of the object O in question. It is of the form A_i ∈ Y_i, where Y_i ⊆ dom(A_i); if the value of the attribute A_i of the object O is within Y_i, then the corresponding outgoing edge from the test node is chosen.

The problem of DT construction is the following. Given a data set D = {t_1, ..., t_d}, where the t_i are independent random samples from an unknown probability distribution P, find a decision tree classifier T such that the misclassification rate R_T(P) is minimal.

Inducing the Decision Trees
Inductive inference is the process of moving from concrete examples to general models, where the goal is to learn how to classify objects by analyzing a set of instances (already solved cases) whose decision classes are known. Instances are typically represented as attribute-value vectors. The learning input consists of a set of such vectors, each belonging to a known decision class, and the output consists of a mapping from attribute values to decision classes. This mapping should accurately classify both the given instances and other unseen instances.

A decision tree is a formalism for expressing such mappings [41] and consists of test nodes linked to two or more sub-trees, and leaves or decision nodes labeled with a decision class. A test node computes some outcome based on the attribute values of an instance, where each possible outcome is associated with one of the sub-trees. An instance is classified by starting at the root node of the tree. If this node is a test, the outcome for the instance is determined and the process continues using the appropriate sub-tree. When a leaf is eventually encountered, its label gives the predicted decision class of the instance.

Finding a solution with the help of DT starts by preparing a set of solved cases. The whole set is then divided into (1) a training set, which is used for the induction of a DT classifier, and (2) a testing set, which is used to check the accuracy of the obtained solution. First, all attributes defining each case are described (input data) and among them one attribute is selected that represents the decision class for the given problem (output data). For all input attributes specific value classes are defined.
If an attribute can take only one of a few discrete values, then each value gets its own class; if an attribute can take various numeric values, then some characteristic intervals must be defined, which represent different classes. Each attribute can represent one internal node in a generated DT, also called a test node or an attribute node (Fig. 1). Such a test node has exactly as many outgoing edges as it has different value classes. The leaves of a DT are decisions and represent the value classes of the decision attribute, i.e. the decision classes (Fig. 1). When a decision has to be made for an unsolved case, we start at the root node of the DT classifier and move along the attribute nodes, selecting the outgoing edges where the values of the appropriate attributes in the unsolved case match the attribute values in the DT classifier, until a leaf node is reached representing the decision
class. An example training database is shown in Table 1 and a sample DT is shown in Fig. 1.

Decision Trees, Figure 1 An example of (a part of) a decision tree

The DT classifier is very easy to interpret. From the tree shown in Fig. 1 we can deduce, for example, the following rules:
- If the chance of precipitation is less than 30% and the visibility is good and the temperature is in the range of 10–27 °C, then we should go for a trip;
- If the chance of precipitation is less than 30% and the visibility is good and the temperature is less than 10 °C or more than 27 °C, then we should stay at home;
- If the chance of precipitation is 30% or more and the wind is moderate or strong, then we should stay at home.

Decision Trees, Table 1 An example training set for object classification

#   Color   Edge    Dot  Shape (decision class)
1   green   dotted  no   triangle
2   green   dotted  yes  triangle
3   yellow  dotted  no   square
4   red     dotted  no   square
5   red     solid   no   square
6   green   solid   yes  triangle
7   green   solid   yes  square
8   yellow  dotted  no   triangle
9   yellow  solid   no   square
10  red     solid   no   square
11  green   solid   yes  square
12  yellow  dotted  yes  square
13  yellow  solid   no   square
14  red     dotted  yes  triangle

A DT can be built from a set of training objects with the “divide and conquer” principle. When all objects are of the same decision class (the value of the output attribute is the same), the tree consists of a single node, a leaf with the appropriate decision. Otherwise an attribute is selected and the set of objects is divided according to the splitting function of the selected attribute. The selected attribute builds an attribute (test) node in the growing DT classifier; for each outgoing edge from that node the inducing procedure is repeated upon the remaining objects of the corresponding division, until a leaf (a decision class) is encountered.

From a geometric point of view, a set of n attributes defines an n-dimensional space, where each data object represents a point. A division of data objects regarding an attribute’s value classes corresponds to the definition of decision planes in the same space. Those planes are hyper-planes orthogonal to the selected attribute: a DT divides the search space into hyper-rectangles, each of which represents one of the possible decision classes; of course several hyper-rectangles can also represent the same decision class.

Induction of Decision Trees
In 1986 Quinlan introduced an algorithm for inducing decision trees called ID3 [39,40]. In 1993 ID3 was upgraded with an improved algorithm, C4.5 [41], which is still regarded as the reference model for building a DT based on the traditional statistical approach. Both algorithms, ID3 and C4.5, use the statistical calculation of information gain from a single attribute to build a DT. In this manner the attribute that adds the most information about the decision over the training set is selected first, followed by the most informative of the remaining attributes, and so on. The method for constructing a DT, as paraphrased from Quinlan [41], pp. 17–18, is as follows. If there are j classes denoted {c_1, c_2, ..., c_j} and a training set D, then:
- If D contains one or more objects which all belong to a single class c_i, then the decision tree is a leaf identifying class c_i.
- If D contains no objects, the decision tree is a leaf determined from information other than D.
- If D contains objects that belong to a mixture of classes, then a test is chosen, based on a single attribute, that has one or more mutually exclusive outcomes {o_1, o_2, ..., o_n}. D is partitioned into subsets D_1, D_2, ..., D_n, where D_i contains all the objects in D that have outcome o_i of the chosen test. The same method is applied recursively to each subset of training objects.
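As an illustration only (not Quinlan's actual code), the three cases above can be sketched in Python; `choose_split` stands for any of the split selection measures described in the following sections, and all other names here are hypothetical:

```python
from collections import Counter

def induce_tree(objects, attributes, choose_split, default=None):
    """objects: list of (attribute_dict, decision_class) pairs."""
    classes = [c for _, c in objects]
    if not objects:                       # case 2: empty D -> leaf from external info
        return {"leaf": default}
    if len(set(classes)) == 1:            # case 1: single class -> leaf with that class
        return {"leaf": classes[0]}
    if not attributes:                    # no test left: majority-class leaf
        return {"leaf": Counter(classes).most_common(1)[0][0]}
    # case 3: mixed classes -> test on one attribute, partition D, recurse
    attr = choose_split(objects, attributes)
    majority = Counter(classes).most_common(1)[0][0]
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for outcome in {obj[attr] for obj, _ in objects}:
        subset = [(o, c) for o, c in objects if o[attr] == outcome]
        branches[outcome] = induce_tree(subset, remaining, choose_split, majority)
    return {"test": attr, "branches": branches}

def classify(tree, obj):
    # walk from the root through test nodes until a leaf is reached
    while "test" in tree:
        tree = tree["branches"][obj[tree["test"]]]
    return tree["leaf"]
```

Any impurity-based criterion (information gain, gain ratio, Gini, chi-square, J-measure) can be plugged in as `choose_split` without changing the recursion.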
Split Selection
The most important aspect of a traditional DT induction strategy is the way in which a set is split, i.e. how to select the attribute test that determines the distribution of training objects into subsets, upon which sub-trees are subsequently built. This process is called split selection. Its aim is to find an attribute and its associated splitting function for each test node in a DT. In the following text some of the most widely used split selection approaches are presented.

Let D be the learning set of objects described with attributes A_1, ..., A_n and a decision class C. Let n denote the number of training objects in D, n_i the number of training objects of decision class c_i, n_j the number of training objects with the jth value of the splitting attribute, and n_ij the number of training objects of decision class c_i and the jth value of the splitting attribute. The relative frequencies (probabilities) of training objects in D are as follows: p_ij = n_ij/n is the relative frequency of training objects of decision class c_i and the jth value of the splitting attribute within the training set D, p_i = n_i/n is the relative frequency of training objects of decision class c_i within the training set D, p_j = n_j/n is the relative frequency of training objects with the jth value of the splitting attribute within the training set D, and p_ij|j = n_ij/n_j is the relative frequency of training objects of decision class c_i among all the training objects with the jth value of the splitting attribute within the training set D.

Entropy, Information Gain, Information Gain Ratio
The majority of classical DT induction algorithms (such as ID3, C4.5, ...) are based on calculating entropy from information theory [45] to evaluate splits. In information theory, entropy is used to measure the unreliability of a message as the source of information; the more information a message contains, the lower the entropy. Two splitting criteria are implemented: the gain criterion and the gain ratio criterion.

The gain criterion [41] is developed in the following way. For any training set D, n_i is the number of training objects of decision class c_i within D. Consider the “message” that a randomly selected object belongs to decision class c_i. The “message” has probability p_i = n_i/n, where n is the total number of training objects in D. The information conveyed by the message (in bits) is given by

I = −log₂ p_i = −log₂(n_i/n) .
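The frequency definitions above can be written down directly; the following sketch computes them from a small illustrative 2×2 contingency table (the numbers are examples, not from the article):

```python
# n_ij: rows are decision classes c_i, columns are the jth value
# of the candidate splitting attribute.
n_ij = [[2, 1],   # class c1: counts for attribute values v1, v2
        [0, 3]]   # class c2
n_i = [sum(row) for row in n_ij]            # objects per class
n_j = [sum(col) for col in zip(*n_ij)]      # objects per attribute value
n = sum(n_i)                                # all training objects

p_ij = [[n_ij[i][j] / n for j in range(len(n_j))] for i in range(len(n_i))]
p_i = [x / n for x in n_i]                  # p_i = n_i / n
p_j = [x / n for x in n_j]                  # p_j = n_j / n
p_ij_given_j = [[n_ij[i][j] / n_j[j] for j in range(len(n_j))]
                for i in range(len(n_i))]   # p_ij|j = n_ij / n_j
```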
Entropy E of an attribute A of a training object w, with the possible attribute values a_1, ..., a_m and the probability distribution p(A(w) = a_j), is thus defined as

E_A = −Σ_j p_j log₂ p_j .

Let E_C be the entropy of the decision class distribution, E_A the entropy of the values of an attribute A, and E_CA the entropy of the combined decision class distribution and attribute values:

E_C = −Σ_i p_i log₂ p_i
E_A = −Σ_j p_j log₂ p_j
E_CA = −Σ_i Σ_j p_ij log₂ p_ij .

The expected entropy of the decision class distribution regarding the attribute A is thus defined as

E_C|A = E_CA − E_A .

This expected entropy E_C|A measures the reduction of entropy that is to be expected when A is selected
as the splitting attribute. The information gain is thus defined as

I_gain(A) = E_C − E_C|A .

In each test node the attribute A with the highest value of I_gain(A) is selected. The gain criterion [41] selects a test to maximize this information gain. The gain criterion has one significant disadvantage in that it is biased towards tests with many outcomes. The gain ratio criterion [41] was developed to avoid this bias; it is defined as

I_gain_ratio(A) = I_gain(A) / E_A .

If the split is near-trivial, the split information E_A will be small and this ratio will be unstable. Hence, the gain ratio criterion selects a test to maximize the gain ratio, subject to the constraint that the information gain is large.

Gini Index
The gain ratio criterion compares with CART’s impurity function approach [5], where impurity is a measure of the class mix of a subset and splits are chosen so that the decrease in impurity is maximized. This approach led to the development of the Gini index [5]. The impurity function approach considers the probability of misclassifying a new sample from the overall population, given that the sample was not part of the training sample T. This probability is called the misclassification rate and is estimated using either the resubstitution estimate (training set accuracy) or the test sample estimate (test set accuracy). The node assignment rule selects the class c_i that minimizes this misclassification rate. In addition, the Gini index promotes splits that minimize the overall size of the tree:

gini(A) = Σ_j p_j · Σ_i p_ij|j² − Σ_i p_i² .
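As a sketch (helper names are illustrative, not from the article), the gain, gain ratio, and Gini computations can be written over per-subset class counts. The sample numbers reproduce the geometrical-objects example used later in this article: 9 squares and 5 triangles, and a split on color whose subsets have class counts (2, 3), (4, 0) and (3, 2).

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent, subsets):
    n = sum(parent)
    expected = sum(sum(s) / n * entropy(s) for s in subsets)  # E_C|A
    return entropy(parent) - expected                         # E_C - E_C|A

def gain_ratio(parent, subsets):
    split_info = entropy([sum(s) for s in subsets])           # E_A
    return information_gain(parent, subsets) / split_info

def gini_gain(parent, subsets):
    # gini(A) = sum_j p_j * sum_i p_ij|j^2  -  sum_i p_i^2
    n = sum(parent)
    children = sum(sum(s) / n * sum((c / sum(s)) ** 2 for c in s)
                   for s in subsets)
    return children - sum((c / n) ** 2 for c in parent)

parent = [9, 5]                          # squares, triangles
color = [[2, 3], [4, 0], [3, 2]]
print(round(entropy(parent), 3))                   # prints 0.94
print(round(information_gain(parent, color), 3))   # prints 0.247
```

The 0.247 here matches the article's 0.246 up to the rounding of the intermediate entropies.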
Chi-Square
Various types of chi-square tests are often used in statistics for testing significance. The chi-square (goodness-of-fit) test compares the expected frequency e_ij with the actual frequency n_ij of training objects which belong to decision class c_i and have the jth value of the given attribute [55]. It is defined as

χ²(A) = Σ_i Σ_j (e_ij − n_ij)² / e_ij ,

where

e_ij = (n_j · n_i) / n .

A higher value of χ² means a clearer split. Consequently, the attribute with the highest χ² value is selected.
J-Measure
The J-measure heuristic was introduced by Smyth and Goodman [48] as an information-theoretic model for measuring the information content of a rule. The crossover entropy J_j is appropriate for selecting a single value of a given attribute A for constructing a rule; it is defined as

J_j(A) = p_j Σ_i p_ij|j log(p_ij|j / p_i) .

The generalization over all possible values of a given attribute A measures the purity of the attribute:

J(A) = Σ_j J_j(A) .
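The J-measure can be computed directly from the contingency counts; a minimal sketch (function name illustrative; the logarithm base only scales the measure, so natural log is used here):

```python
from math import log

def j_measure(n_ij):
    # n_ij: rows are decision classes c_i, columns are values of attribute A
    n_i = [sum(row) for row in n_ij]
    n_j = [sum(col) for col in zip(*n_ij)]
    n = sum(n_i)
    total = 0.0
    for j in range(len(n_j)):
        p_j = n_j[j] / n
        # J_j(A) = p_j * sum_i p_ij|j * log(p_ij|j / p_i); 0*log(0) := 0
        j_j = sum((n_ij[i][j] / n_j[j]) * log((n_ij[i][j] / n_j[j]) / (n_i[i] / n))
                  for i in range(len(n_i)) if n_ij[i][j] > 0)
        total += p_j * j_j
    return total
```

A perfectly mixed table scores 0 (the per-value class distribution equals the overall one), while a clean split scores highest.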
The J-measure is also used to reduce over-fitting in the process of pre-pruning (see the following sections).

DT Induction Example: Classifying Geometrical Objects
Let us have a number of geometrical objects: squares and triangles. Each geometrical object is described with three features: color describes the color of an object (either green, yellow or red), edge describes the line type of an object’s edge (either solid or dotted), and dot describes whether there is a dot in the object. See Table 1 for details. We want to induce a DT that will predict the shape of an unseen object based on the three features. The decision class will thus be shape, having two possible values: square or triangle. The three features represent the three attributes: color (with possible values green, yellow and red), edge (with possible values solid and dotted), and dot (with possible values yes and no). Table 1 represents the training set with 14 training objects; in Fig. 2 the training objects are also represented visually.

Decision Trees, Figure 2 Visual representation of training objects from Table 1

In the training set (Table 1, Fig. 2) there are 14 training objects: five triangles and nine squares. We can calculate the class distributions:

p(square) = 9/14 and p(triangle) = 5/14 .

The entropy of the decision class (shape) distribution is thus

E_shape = −Σ_i p_i log₂ p_i = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.940 .

The entropy can be reduced by splitting the whole training set. For this purpose any of the attributes (color, edge, or dot) can be used as a splitting attribute. With the information gain split selection approach, the entropy of each attribute is calculated first, then the information gain of each attribute is calculated based on these entropies, and finally the attribute with the highest information gain is selected. To demonstrate this, let us calculate the entropy for the attribute color. Using the different possible values of color (green, yellow, and red), the whole training set is split into three subsets (Fig. 3).

Decision Trees, Figure 3 Splitting the data set based on the attribute color

The entropies are as follows:

E_color=green = −(2/5) log₂(2/5) − (3/5) log₂(3/5) = 0.971
E_color=yellow = −(4/4) log₂(4/4) − (0/4) log₂(0/4) = 0.0
E_color=red = −(3/5) log₂(3/5) − (2/5) log₂(2/5) = 0.971 .

Consequently, the entropy of the attribute color based on the whole training set is

E_color = Σ_i p_i E_i = (5/14)·0.971 + (4/14)·0.0 + (5/14)·0.971 = 0.694 .

The information gain of the attribute color is thus

I_gain(color) = E_shape − E_shape|color = 0.940 − 0.694 = 0.246 .

Similarly, the information gain can be calculated for the remaining two attributes, edge and dot. The information gains for all three attributes are thus

I_gain(color) = 0.246
I_gain(edge) = 0.151
I_gain(dot) = 0.048 .

As we can see, the attribute color has the highest information gain and is therefore chosen as the splitting attribute at the root node of the DT classifier. The whole process of splitting is recursively repeated on all subsets. The resulting DT classifier is shown in Fig. 4.

Decision Trees, Figure 4 The resulting DT classifier for the classification of geometrical objects

Tests on Continuous Attributes
In the above example all attributes were discrete-valued. For a continuous attribute a split threshold needs to be determined whenever the attribute is to be used as a splitting attribute. The algorithm for finding appropriate thresholds for continuous attributes [5,33,41] is as follows. The training objects are sorted on the values of the attribute; denote them in order as {w_1, w_2, ..., w_k}. Any threshold value lying between w_i and w_{i+1} will have the same effect, so there are only k−1 possible splits, all of which are examined. Some approaches for building DT use discretization of continuous attributes. Two ways are used to select the discretized classes:
- equidistant intervals: the number of classes is selected first and then successive equidistant intervals are determined between the absolute lower and upper bounds; and
- percentiles: again the number of classes is selected first, and then successive intervals are determined based on the values of the appropriate attribute in the training set, so that all intervals contain the same number of training objects.
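The threshold enumeration and the two discretization schemes can be sketched as follows (helper names are illustrative, not from the article):

```python
def candidate_thresholds(values):
    # Sort the attribute values; any threshold between adjacent distinct
    # values w_i and w_{i+1} has the same effect, so midpoints suffice.
    w = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(w, w[1:])]

def equidistant_boundaries(values, k):
    # k equidistant intervals between the absolute lower and upper bounds
    lo, hi = min(values), max(values)
    return [lo + (hi - lo) * i / k for i in range(1, k)]

def percentile_boundaries(values, k):
    # k equal-frequency intervals: boundaries chosen so each interval
    # holds (roughly) the same number of training objects
    w = sorted(values)
    return [w[len(w) * i // k] for i in range(1, k)]
```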
Dynamic Discretization of Attributes
In the MtDeciT 2.0 tool the authors implemented an algorithm for finding subintervals [58], where the distribution of training objects is considered and more than two subintervals are possible. The approach is called dynamic discretization of continuous attributes, since the subintervals are determined dynamically during the process of building a DT. This technique first splits the interval into many subintervals, so that every training object’s value has its own subinterval. In the second step it merges together smaller adjacent subintervals that are labeled with the same outcome into larger subintervals. In each of the following steps three subintervals are merged together: two “strong” subintervals with one “weak” interval, where the “weak” interval lies between the two “strong” subintervals. Here strong and weak refer to the number of training objects in the subinterval (Fig. 5).

In comparison to the previous two approaches, dynamic discretization returns more natural subintervals, which result in better and smaller DT classifiers. In general we differentiate between two types of dynamic discretization: general dynamic discretization and nodal dynamic discretization. General dynamic discretization uses all available training objects for the definition of subintervals; it is therefore performed before the start of building a DT. All the subintervals of all attributes are memorized in order to be used later in the process of building the DT. Nodal dynamic discretization performs the definition of subintervals for all continuous attributes which are available in the current node of a DT. Only those training objects that arrive at the current node are used for setting the subintervals of the continuous attributes.
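A rough sketch of the first merging step of this idea (an illustration only, not the MtDeciT 2.0 implementation): start with one subinterval per training value, then merge adjacent subintervals whose training objects carry the same class label.

```python
from itertools import groupby

def merge_same_class(pairs):
    """pairs: (attribute_value, class_label) tuples, one per training object."""
    ordered = sorted(pairs)                      # one subinterval per value
    merged = []
    # groupby collapses runs of adjacent subintervals with the same label
    for label, group in groupby(ordered, key=lambda p: p[1]):
        values = [v for v, _ in group]
        merged.append((min(values), max(values), label))  # (low, high, class)
    return merged
```

A further pass, as described above, would absorb a "weak" subinterval lying between two "strong" same-labeled neighbors.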
Decision Trees, Figure 5 Dynamic discretization of a continuous attribute with values between 60 and 100. Shapes represent different attribute values
In a series of tests the authors showed that nodal dynamic discretization produces smaller DT with higher accuracy than DT built with general dynamic discretization [56,58]. Nodal dynamic discretization also outperforms classical discretization techniques in the majority of cases.

Oblique Partitioning of Search Space
The algorithms presented so far use univariate partitioning methods, which are attractive because they are straightforward to implement (only one feature is analyzed at a time) and the derived DT is relatively easy to understand. Besides univariate partitioning methods there are also some successful partitioning methods that do not partition the search space axis-parallel based on only one attribute at a time, but form oblique partition boundaries based on a combination of attributes (Fig. 6).

Decision Trees, Figure 6 Given axes that show the attribute values and shapes corresponding to class labels: (i) axis-parallel and (ii) oblique decision boundaries

Oblique partitioning provides a viable alternative to univariate methods. Unlike their univariate counterparts, oblique partitions are formed by combinations of attributes. The general form of an oblique partition is given by

Σ_{i=1}^{d} β_i x_i + β_{d+1} > 0 ,

where β_i represents the coefficient of the ith attribute. Because of their multivariate nature, oblique methods offer far more flexibility in partitioning the search space; this flexibility comes at the price of higher complexity, however. Consider that, given a data set containing n objects described with d attributes, there can be 2 Σ_{i=0}^{d} (n−1 choose i) oblique splits if n > d [50]; each split is a hyper-plane that divides the search space into two non-overlapping halves. For univariate splits the number of potential partitions is much lower, but still significant, n·d [29]. In short, finding the right oblique partition is a difficult task. Given the size of the search space, choosing the right search method is of critical importance in finding good partitions.

Perhaps the most comprehensive reference on this subject is [5] on classification and regression trees (CART). Globally, CART uses the same basic algorithm as Quinlan’s C4.5. At the decision node level, however, the algorithm becomes extremely complex. CART starts out with the best univariate split. It then iteratively searches for perturbations in attribute values (one attribute at a time) which maximize some goodness metric. At the end of the procedure, the best oblique and axis-parallel splits found are compared and the better of the two is selected. Although CART provides a powerful and efficient solution to a very difficult problem, it is not without its disadvantages. Because the algorithm is fully deterministic, it has no inherent mechanism for escaping from local optima. As a result, CART has a tendency to terminate its partition search at a given node too early.

The most fundamental disadvantage of CART (and of the traditional approach to inducing DT in general) is that the DT induction process can cause the metrics to produce misleading results. Because traditional DT induction algorithms choose what is locally optimal for each decision node, they inevitably ignore splits that score poorly alone but yield a better solution when used in combination. This problem is illustrated by Fig. 7. The solid lines indicate the splits found by CART. Although each split optimizes the impurity metric, the end product clearly does not reflect the best possible partitions (indicated by the dotted lines). However, when evaluated individually, the dotted lines register high impurities and are therefore not chosen. Given this, it is apparent that the sequential nature of DT induction can prevent the induction of trees that reflect the natural structure of the data.

Decision Trees, Figure 7 CART-generated splits (solid lines – 1, 2) minimize impurity at each decision node in a way that is not necessarily optimal regarding the natural structure of the data (denoted by dotted lines – 3, 4)

Pruning Decision Trees
Most real-world data contains at least some amount of noise and outliers. Different machine-learning approaches exhibit different levels of sensitivity to noise in data. DT fall into the category of methods sensitive to noise and outliers. To remedy this, additional methods must be applied to DT in order to reduce the complexity of the DT and increase its accuracy. In DT this approach is called pruning.

Though pruning can be applied at different stages, it follows the same basic idea, also called Ockham’s razor. William of Ockham (1285–1349), one of the most influential medieval philosophers, has been given credit for the rule that “one should not draw more conclusions about a certain matter than the minimum necessary, and that redundant conclusions should be removed or shaved off”. That rule was the basis for a later interpretation in the form of the following statement: “If two different solutions solve the same problem with the same accuracy, then the better of the two is the shorter solution.”
Decision Trees
Mapping this rule to DT pruning allows us to prune, or shave away, those branches of the DT whose removal does not decrease the classification accuracy. By doing so, we end up with a DT that is not only smaller but also has higher (or, in the worst case, at least the same) accuracy as the original DT. DT classifiers aim to refine the training sample T into subsets which contain only a single class. However, training samples may not be representative of the population they are intended to represent. In most cases, fitting a DT until all leaves contain data for a single class causes over-fitting. That is, the DT is designed to classify the training sample rather than the overall population, and accuracy on the overall population will be much lower than accuracy on the training sample. For this reason most DT induction algorithms (C4.5, CART, OC1) use pruning. They all grow trees to maximum size, where each leaf contains single-class data or no test offers any improvement on the mix of classes at that leaf, and then prune the tree to avoid over-fitting. Pruning occurs within C4.5 when the predicted error rate is reduced by replacing a branch with a leaf. CART and OC1 set aside a proportion of the training sample to prune the tree. The tree is trained on the remainder of the training sample and then pruned until the accuracy on the pruning sample cannot be further improved. In general we differentiate between two pruning approaches: prepruning, which takes place during DT induction, and postpruning, applied to an already-induced DT in order to reduce its complexity. We will describe three representative pruning methods – one prepruning and two postpruning examples.
Prepruning

Prepruning (also called a stopping criterion) is a very simple procedure performed during the first phase of induction. Early stopping of DT construction is based on a criterion that measures the percentage of the dominant class in the training subset at the node. If that percentage is higher than a preset threshold, the internal node is transformed into a leaf and marked with the dominant class label. Such early stopping of DT construction reduces the size and complexity of the DT and reduces the possibility of over-fitting, making the tree more general. However, there is also a danger of oversimplification when the preset threshold is too low. This problem is especially present in training sets where the frequencies of class labels are not balanced. An improper threshold can cause some special training objects, which are very important for solving the classification problem, to be discarded during the prepruning process.
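As a minimal sketch, the dominant-class stopping criterion might look like the following; the function name and the threshold value are illustrative assumptions, not taken from any of the cited systems:

```python
from collections import Counter

def should_stop(labels, threshold=0.95):
    """Prepruning check: stop splitting when the dominant class's share
    of the node's training subset reaches a preset threshold."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    if majority_count / len(labels) >= threshold:
        return majority_label   # transform the node into a leaf with this label
    return None                 # keep splitting

# Example: 19 of 20 training objects share one class -> node becomes a leaf
print(should_stop(["A"] * 19 + ["B"], threshold=0.9))  # -> A
print(should_stop(["A", "B"], threshold=0.9))          # -> None
```

With an unbalanced training set, lowering the threshold makes more nodes collapse into leaves, which is exactly the oversimplification risk described above.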
Postpruning

In contrast to prepruning, postpruning operates on an already constructed DT. Using simple frequencies or error estimation, postpruning approaches calculate whether substituting a subtree with a leaf (pruning or shaving off, in the sense of Ockham's razor) would increase the DT's accuracy, or at least reduce its size without negatively affecting classification accuracy. The two approaches described below are practically identical up to the point where the error in a node or subtree is estimated.

Static Error Estimate Pruning

Static error estimate pruning uses the frequencies of training objects' class labels in nodes to estimate the error caused by replacing an internal node with a leaf. If the error estimate in the node is lower than or equal to the error computed in the subtree, then we can prune the subtree in question. Let C be the set of training objects in node V, and m the number of all possible class labels. N is the number of training objects in C and O is the most frequent class label in C. Let n be the number of training objects in C that belong to class O. The static error estimate (also called the Laplace error estimate) in node V is:

E(V) = (N − n + m − 1) / (N + m) .
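The Laplace estimate E(V) = (N − n + m − 1) / (N + m) is straightforward to compute; the following sketch simply restates the formula (the function name is illustrative):

```python
def laplace_error(N, n, m):
    """Static (Laplace) error estimate in a node:
    N - number of training objects in the node,
    n - number of objects belonging to the most frequent class O,
    m - number of possible class labels."""
    return (N - n + m - 1) / (N + m)

# e.g. 10 objects in the node, 8 in the majority class, 2 possible classes:
print(laplace_error(10, 8, 2))  # -> 0.25
```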
The subtree error of V is computed using the following expression:

SubtreeError(V) = Σ_{i=1..NoOfChildNodes} P_i · Err(V_i) .
Since we do not have information about the class distribution, we use relative frequencies in place of the probabilities P_i. Err(V_i) is the error estimate of child node V_i. Now we can compute the new error estimate Err(V) in node V using the following expression:

Err(V) = min( E(V), SubtreeError(V) ) .

If E(V) is less than or equal to SubtreeError(V), then we can prune the subtree of V, replace it by a leaf, label it with class O, and assign the value of E(V) to Err(V). Otherwise we simply assign the value of SubtreeError(V) to Err(V). Static error estimate pruning is a recursive bottom-up approach: we start by calculating error estimates in the leaves and propagate errors, along with any pruning actions, toward the parent nodes until the root node is reached. In Fig. 8 we have an unpruned DT with two possible decision classes, O1 and O2. Inside each node are the frequencies of training objects according to their class label,
Decision Trees, Figure 8 Unpruned decision tree with static error estimates
Decision Trees, Figure 9 Pruned decision tree with static error estimates
written in brackets (first for O1 and second for O2). Beside each internal node (squares) there are two numbers: the top one is the static error estimate in the node and the bottom one is the error estimate in the subtree. The number below each leaf (ellipses) is its error estimate. We can see that the static error estimates are lower than the subtree errors for two nodes: Att3 and Att4. Therefore, we prune those two subtrees and obtain a smaller but more accurate DT (Fig. 9).

Reduced Error Pruning

Reduced error pruning implements a principle very similar to static error estimate pruning. The difference lies in the error estimation: reduced error pruning uses a pruning set to estimate the errors in nodes and leaves. The pruning set is very similar to the training set,
since it contains objects described with attributes and a class label. In the first stage we try to classify each pruning object. When a leaf is reached, we compare the DT's classification with the actual class label and record the correct/incorrect decision in the leaf. At the end, the number of errors and the number of correct classifications are recorded in each leaf. In a recursive bottom-up approach, we start by calculating error estimates in the leaves and propagate errors, along with any pruning actions, toward the parent nodes until the root node is reached. Let SubtreeError(V) be the sum of errors from all subtrees of node V:

SubtreeError(V) = NoOfErrorsInLeaf,                       if V ∈ Leaves ,
SubtreeError(V) = Σ_{i=1..NoOfChildNodes} Err(V_i),       otherwise .
Let C be the set of training objects in node V and O the most frequent class label in C. Let N be the number of pruning objects that reached node V and n the number of pruning objects in V that belong to class O. The error estimate in node V is defined by:

E(V) = N − n .

Now we can compare the error estimate in the node with the error estimate in the subtree and compute a new error estimate Err(V) in node V:

Err(V) = min( E(V), SubtreeError(V) ) .

If E(V) is less than or equal to SubtreeError(V), then we can prune the subtree of V, replace it by a leaf labeled with class O, and assign the value of E(V) to Err(V). Otherwise we simply assign the value of SubtreeError(V) to Err(V). The drawback of this pruning approach is that it is very hard to construct a pruning set capable of "visiting" each node in a DT at least once. If we cannot assure that, the pruning procedure will not be as efficient as other similar approaches.

Deficiencies of Classical Induction

The DT has been shown to be a powerful tool for decision support in different areas. The effectiveness and accuracy of DT classification have surprised many experts, and their greatest advantage is the simultaneous suggestion of a decision and a straightforward, intuitive explanation of how the decision was made. Nevertheless, the classical induction approach also has several deficiencies. One of the most obvious drawbacks of classical DT induction algorithms is poor processing of incomplete, noisy data. If some attribute value is missing, classical algorithms do not perform well on such an object. For example, in Quinlan's algorithms before C4.5, data objects with missing values were left out of the training set – this of course resulted in decreased quality of the obtained solutions (the training set size and the information about the problem's domain were reduced).
Algorithm C4.5 introduced a technique to overcome this problem, but it is still not very effective. In real-world problems, especially in medicine, missing data is very common – therefore, effective processing of such data is of vital importance. The next important drawback of classical induction methods is the fact that they can produce only one DT classifier for a problem (when the same training set is
used). In many real-world situations it would be of great benefit if several DT classifiers were available and a user could choose the most appropriate one for a single case. Just as training objects may lack some attribute value, so may a testing object – there can be a new case where some data is missing and it is not possible to obtain it (unavailability of some medical equipment, for example, or the invasiveness of a specific test for a patient). In such a case another DT classifier could be chosen that does not require the specific attribute test to make a decision. Let us mention one more disadvantage of classical induction methods, namely the varying importance of different errors. Among all possible decisions there are usually some that are more important than others. Therefore, a goal is to build a DT in such a way that the classification accuracy for those most important decisions is maximized. Once again, this problem is not solved adequately by classical DT induction methods.

Variant Methods

Although algorithms such as ID3, C4.5, and CART make up the foundation of traditional DT induction practice, there is always room for improvement in the accuracy, size, and generalization ability of the generated trees. As would be expected, many researchers have tried to build on the success of these techniques by developing better variations of them. Alternatively, various researchers have introduced quite different methods for the induction of DT, which try to overcome the deficiencies noted above. Most of them are based on so-called soft methods, like evolutionary techniques or neural networks; sometimes several methods are combined in a hybrid algorithm. A vast number of techniques have also been developed that improve only a part of the DT induction process.
Such techniques include evolutionary algorithms for optimizing split functions in attribute nodes, dynamic discretization of attributes, dynamic subset selection of training objects, etc.

Split Selection Using Random Search

Since random search techniques have proven extremely useful in finding solutions to non-deterministic polynomial-time complete (NP-complete) problems [30], they have naturally been applied to DT induction. Heath [20,21] developed a DT induction algorithm called SADT that uses a simulated annealing process to find oblique splits at each decision node. Simulated annealing is a variation of hill climbing which, at the beginning of the process, allows some random downhill moves to be made [42]. As a computational process,
simulated annealing is patterned after the physical process of annealing [25], in which metals are melted (at high temperatures) and then gradually cooled until some solid state is reached. Starting with an initial hyper-plane, SADT randomly perturbs the current solution and determines the goodness of the split by measuring the change in impurity (ΔE). If ΔE is negative (i.e. impurity decreases), the new hyper-plane becomes the current solution; otherwise, the new hyper-plane becomes the current split with probability e^(−ΔE/T), where T is the temperature of the system. Because simulated annealing mimics the cooling of metal, its initially high temperature falls with each perturbation. At the start of the process, the probability of replacing the current hyper-plane is nearly 1. As the temperature cools, it becomes increasingly unlikely that worse solutions are accepted. When processing a given data set, Heath typically grows hundreds of trees, performing several thousand perturbations per decision node. Thus, while SADT has been shown to find smaller trees than CART, it is very expensive from a computational standpoint [21]. A more elegant variation on Heath's approach is the OC1 system [29]. Like SADT, OC1 uses random search to find the best split at each decision node. The key difference is that OC1 rejects the brute-force approach of SADT, using random search only to improve on an existing solution. In particular, it first finds a good split using a CART-like deterministic search routine. OC1 then randomly perturbs this hyper-plane in order to decrease its impurity. This step is a way of escaping the local optima in which deterministic search techniques can be trapped. If the perturbation results in a better split, OC1 resumes the deterministic search on the new hyper-plane; if not, it re-perturbs the partition a user-selectable number of times. When the current solution can be improved no further, it is stored for later reference.
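The simulated-annealing acceptance rule described above can be sketched as follows; the function name is illustrative and this is not the actual SADT implementation:

```python
import math
import random

def accept_perturbation(delta_e, temperature):
    """SADT-style acceptance rule: always accept an impurity decrease
    (delta_e < 0); otherwise accept a worse hyper-plane with
    probability exp(-delta_e / temperature)."""
    if delta_e < 0:
        return True
    return random.random() < math.exp(-delta_e / temperature)

# Improvements are always accepted:
print(accept_perturbation(-0.5, 1.0))  # -> True
# At a very low temperature, a worse split is (almost) never accepted:
print(accept_perturbation(10.0, 0.001))  # -> False
```

As the temperature T falls over the course of the search, exp(−ΔE/T) shrinks toward 0, so downhill moves become increasingly rare, matching the behavior described in the text.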
This procedure is repeated a fixed number of times (using a different initial hyper-plane in each trial). When all trials have been completed, the best split found is incorporated into the decision node.

Incremental Decision Tree Induction

The DT induction algorithms discussed so far grow trees from a complete training set. For serial learning tasks, however, training instances may arrive in a stream over a given time period. In these situations, it may be necessary to continually update the tree in response to the newly acquired data. Rather than building a new DT classifier from scratch, the incremental DT induction approach revises the existing tree to be consistent with each new training instance. Utgoff implemented an incremental version of ID3, called
ID5R [53]. ID5R uses an E-Score criterion to estimate the amount of ambiguity in classifying instances that would result from placing a given attribute as a test in a decision node. Whenever a new training instance does not fit the existing tree, the tree is recursively restructured so that attributes with the lowest E-Scores are moved higher in the tree hierarchy. In general, Utgoff's algorithm yields smaller trees than methods like ID3, which batch-process all training data. Techniques similar to ID5R include an incremental version of CART [8]. Incremental DT induction techniques result in frequent tree restructuring when the amount of training data is small, with the tree structure maturing as the data pool becomes larger.

Decision Forests

Regardless of the DT induction method utilized, subtle differences in the composition of the training set can produce significant variance in classification accuracy. This problem is especially acute when cross-validating small data sets with high dimensionality [11]. Researchers have reduced these high levels of variance by using decision forests, composed of multiple trees (rather than just one). Each tree in a forest is unique because it is grown from a different subset of the same data set. For example, Quinlan's windowing technique [41] induces multiple trees, each from a randomly selected subset of the training data (i.e., a window). Another approach was devised by Ho [22], who based each tree on a unique feature subset. Once a forest exists, the results from each tree must be combined to classify a given data instance. Such committee-type schemes range from majority-rules voting [21,37] to different statistical methods [46].

A Combination of Decision Trees and Neural Networks

When DT and neural networks are compared, one can see that their advantages and drawbacks are almost complementary.
For instance, the knowledge representation of DT is easily understood by humans, which is not the case for neural networks; DT have trouble dealing with noise in training data, which again is not the case for neural networks; DT learn very fast while neural networks learn relatively slowly, etc. Therefore, the idea is to combine DT and neural networks in order to combine their advantages. In this manner, different hybrid approaches from both fields have been introduced [7,54,56]. Zorman, in his MtDecit 2.0 approach, first builds a DT that is then used to initialize a neural network [56,57]. Such a network is then trained using the same training objects as the DT. After that the neural network is converted back into a DT that is better than the original DT [2]. The
source DT classifier is converted to a disjunctive normal form – a set of normalized rules. The disjunctive normal form then serves as the source for determining the neural network's topology and weights. The neural network has two hidden layers; the number of neurons in each hidden layer depends on the rules in the disjunctive normal form. The number of neurons in the output layer depends on how many outcomes are possible in the training set. After the transformation, the neural network is trained using back-propagation. The mean square error of such a network converges toward 0 much faster than it would with randomly set weights. Finally, the trained neural network has to be converted into a final DT. The neural network is examined in order to determine the most important attributes that influence its outcomes. A list containing the most important attributes is then used to build the final DT, which in the majority of cases performs better than the source DT. The last conversion usually causes a loss of some knowledge contained in the neural network, but even so most of the knowledge is transferred into the final DT. If the approach is successful, then the final DT has better classification capabilities than the source DT.

Using Evolutionary Algorithms to Build Decision Trees

Evolutionary algorithms are generally used for very complex optimization tasks [18] for which no efficient heuristic methods exist. Construction of a DT is a complex task, but heuristic methods exist that usually work efficiently and reliably. Nevertheless, there are some reasons justifying an evolutionary approach. Because of the robustness of evolutionary techniques, they can also be used successfully on incomplete, noisy data (which often occurs in real-life data because of measurement errors, unavailability of proper instruments, risk to patients, etc.).
Because of the evolutionary principles used to evolve solutions, solutions can be found that would easily be overlooked otherwise. Further advantages are the possibility of optimizing the DT classifier's topology and the adaptation of split thresholds. There have been several attempts to build a DT with the use of evolutionary techniques [6,31,35]; one of the most recent, and also very successful in various applications, is genTrees, developed by Podgorelec and Kokol [35,36,37]. First an initial population of (about one hundred) semi-random DT is seeded. A random DT classifier is built by randomly choosing attributes and defining a split function. When a pre-selected number of test nodes has been placed into a growing tree, the tree is finalized with decision nodes, which are defined by choosing the most fit decision class in each node with regard to the training set. After the initialization, the population of decision trees evolves through
many generations of selection, crossover and mutation in order to optimize the fitness function that determines the quality of the evolved DT. According to the fitness function the best (most fit) trees have the lowest function values – the aim of the evolutionary process is to minimize the value of the local fitness function (LFF) for the best tree. With the combination of selection, which prioritizes better solutions, crossover, which works as a constructive operator toward local optima, and mutation, which works as a destructive operator in order to maintain the needed genetic diversity, the search for the solution tends to be directed toward the globally optimal solution. The globally optimal solution is the most appropriate DT regarding the specific needs (expressed in the form of the fitness function). As the evolution proceeds, higher-quality solutions are obtained with regard to the chosen fitness function. One step beyond Podgorelec's approach was introduced by Sprogar with his evolutionary vector decision trees – VEDEC [49]. In his approach the evolution of DT is similar to that of Podgorelec, but the functionality of DT is enhanced in that not only one possible decision class is predicted in the leaf; instead, several possible questions are answered with a vector of decisions. In this manner, for example, a possible treatment for a patient is suggested together with the diagnosis, or several diagnoses are suggested at the same time, which is not possible with ordinary DT. One of the most recent approaches to the evolutionary induction of DT-like methods is the AREX algorithm developed by Podgorelec and Kokol [34]. In this approach DT are extended by so-called decision programs that are evolved with the help of automatic programming. In this way a classical attribute test can be replaced by a simple computer program, which greatly improves flexibility, at the cost of computational resources, however.
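The general shape of such a minimized fitness function can be illustrated with a hypothetical example; the terms and weights below are purely illustrative assumptions and are not the actual LFF of genTrees or AREX:

```python
def local_fitness(num_errors, num_nodes, num_attributes,
                  w_error=10.0, w_size=1.0, w_attr=0.5):
    """Hypothetical local fitness function (LFF) for evolved DTs:
    lower is better, so the evolution minimizes a weighted sum of
    classification errors, tree size, and number of attributes used.
    The weights are illustrative only."""
    return (w_error * num_errors
            + w_size * num_nodes
            + w_attr * num_attributes)

# A tree with fewer errors scores better (lower) even though it is larger:
print(local_fitness(2, 15, 4))                            # -> 37.0
print(local_fitness(2, 15, 4) < local_fitness(5, 9, 3))   # -> True
```

Weighting errors most heavily reflects the text's point that accuracy on the important decisions dominates, while the size and attribute terms push the evolution toward smaller, simpler trees.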
The approach introduces a multi-level classification model based on the differences between objects. In this manner, "simple" objects are classified on the first level with simple rules, more "complex" objects are classified on the second level with more complex rules, etc.

Evolutionary Supported Decision Trees

Even though DT induction is not as complicated from the parameter-settings point of view as some other machine-learning approaches, getting the best out of the approach still requires some time and skill. Manual setting of parameters takes more time than the induction of the tree itself, and even so we have no guarantee that the parameter settings are optimal. With MtDeciT3.1Gen, Zorman presented [59] an implementation of evolutionary supported DT. This approach
did not change the induction process itself, since the basic principle remained the same. The change lies in the evolutionary support, which is in charge of searching the space of induction parameters for the best possible combination. By using genetic operators like selection, crossover and mutation, evolutionary approaches are often capable of finding optimal solutions even in the most complex of search spaces, or at least they offer significant benefits over other search and optimization techniques. Enhanced with the evolutionary option, DT do not offer more generalizing power, but we gain better coverage of the method's parameter search space and a significant decrease in time spent compared to manual parameter setting. Better exploitation of a method usually manifests in better results, and the goal is reached. The role of the evolutionary algorithm is to find a combination of parameter settings that enables the induction of DT with high overall accuracy and high class accuracies, using as few attributes as possible and remaining as small as possible.

Ensemble Methods

An extension of individual classifier models are ensemble methods, which use a set of induced classifiers and combine their outputs into a single classification. The purpose of combining several individual classifiers is to achieve better classification results. All kinds of individual classifiers can be used to construct an ensemble; however, DT are often used as one of the most appropriate and effective models. DT are known to be very sensitive to changes in a learning set. Consequently, even small changes in a learning set can result in a very different DT classifier, which is a basis for the construction of efficient ensembles. Two well-known representatives of ensemble methods are Random forests and Rotation forests, which use DT classifiers, although other classifiers (especially in the case of Rotation forests) could be used as well.
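The committee-type combination of tree outputs can be sketched as a simple majority vote; the stub "trees" below are hypothetical stand-ins for DT classifiers grown on different subsets of the data:

```python
from collections import Counter

def ensemble_classify(trees, x):
    """Combine the outputs of several classifiers by majority vote,
    as in decision forests. Each 'tree' is any callable x -> label."""
    votes = [tree(x) for tree in trees]
    label, _ = Counter(votes).most_common(1)[0]
    return label

# Three stub classifiers standing in for trees grown on different
# data subsets; two of the three vote for class "B".
trees = [lambda x: "A", lambda x: "B", lambda x: "B"]
print(ensemble_classify(trees, x=None))  # -> B
```

Because each member tree sees a different sample (or feature subset), their individual variances tend to cancel out in the vote, which is exactly how forests reduce the instability of single DTs.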
Dealing with Missing Values

Like most machine-learning approaches, the DT approach assumes that all data is available at the time of induction and usage of the DT. Real databases rarely fit this assumption, so dealing with incomplete data presents an important task. In this section some aspects of dealing with missing values of description attributes are given. The reasons for incomplete data vary: from cases where missing values could not be measured, to human errors when recording measurement results, to the usage of databases that were not gathered for the purpose of machine learning, etc.
Missing values can cause problems in three different stages of induction and usage of DT:

– split selection in the phase of DT induction,
– partition of the training set in a test node during the phase of DT induction, and
– selecting which edge to follow from a test node when classifying unseen cases during the phase of DT usage.

Though some similarities can be found between the approaches for dealing with missing data in the listed cases, some specific solutions exist that are applicable to only one of the possible situations. Let us first list the possible approaches for dealing with missing values during split selection in the phase of DT induction:

– Ignore – the training object is ignored during evaluation of the attribute with the missing value,
– Reduce – the evaluation (for instance entropy, gain or gain ratio) of the attribute with missing values is changed according to the percentage of training objects with missing values,
– Substitute with DT – a new DT with a reduced number of attributes is induced in order to predict the missing value for the training object in question; this approach is best suited for discrete attributes,
– Replace – the missing value is replaced with the mean (continuous attributes) or most common (discrete attributes) value, and
– New value – missing values for discrete attributes are substituted with a new value ("unknown"), which is then treated the same way as the other possible values of the attribute.
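The Replace strategy, for example, can be sketched as follows, assuming missing entries are marked with None (the function name is illustrative):

```python
from collections import Counter
from statistics import mean

def replace_missing(values, continuous, missing=None):
    """'Replace' strategy for missing attribute values: substitute the
    mean for continuous attributes and the most common value for
    discrete ones."""
    known = [v for v in values if v is not missing]
    if continuous:
        fill = mean(known)
    else:
        fill = Counter(known).most_common(1)[0][0]
    return [fill if v is missing else v for v in values]

print(replace_missing([1.0, None, 3.0], continuous=True))        # -> [1.0, 2.0, 3.0]
print(replace_missing(["x", "y", None, "x"], continuous=False))  # -> ['x', 'y', 'x', 'x']
```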
The next seven approaches are suited for partitioning the training set in a test node during the phase of DT induction:

– Ignore – the training object is ignored during partition of the test set,
– Substitute with DT – a new DT with a reduced number of attributes is induced in order to predict the missing value for the training object in question; according to the predicted value we can propagate the training object to one of the successor nodes; this approach is best suited for discrete attributes,
– Replace – the missing value is replaced with the mean (continuous attributes) or most common (discrete attributes) value,
– Probability – the training object with the missing value is propagated along one of the edges with probability proportional to the strength of the subsets of training objects,
– Fraction – a fraction of the training object with the missing value is propagated to each of the subsets; the size of the fraction is proportional to the strength of each subset of training objects,
– All – the training object with the missing value is propagated to all subsets coming from the current node, and
– New value – training objects with the new value ("unknown") are assigned to a new subset and propagated further, following a new edge marked with the previously mentioned new value.

The last case for dealing with missing attribute values is during the decision-making phase, when we are selecting which edge to follow from a test node when classifying unseen cases:

– Substitute with DT – a new DT with a reduced number of attributes is induced in order to predict the missing value for the object in question; according to the predicted value we can propagate the unseen object to one of the successor nodes; this approach is best suited for discrete attributes,
– Replace – the missing value is replaced with the mean (continuous attributes) or most common (discrete attributes) value,
– All – the unseen object with the missing value is propagated to all subsets coming from the current node; the final decision is subject to voting by all the leaf nodes we ended in,
– Stop – when facing a missing value, the decision-making process is stopped and the decision with the highest probability according to the training subset in the current node is returned, and
– New value – if there exists a special edge marked with the value "unknown", we follow that edge to the next node.

Evaluation of Quality

Evaluation of an induced DT is essential for establishing the quality of the learned model. For the evaluation it is essential to use a set of unseen instances – a testing set. A testing set is a set of testing objects which have not been used in the process of inducing the DT and can therefore be used to objectively evaluate its quality.
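Evaluation on a held-out testing set can be sketched as follows; the stub classifier and the function name are illustrative:

```python
def overall_accuracy(classifier, testing_set):
    """Evaluate a DT (or any classifier) on unseen testing objects:
    overall accuracy = correctly classified / all testing objects."""
    correct = sum(1 for x, label in testing_set if classifier(x) == label)
    return correct / len(testing_set)

# A stub classifier that always predicts "A", tested on 4 unseen objects:
testing_set = [(None, "A"), (None, "A"), (None, "B"), (None, "A")]
print(overall_accuracy(lambda x: "A", testing_set))  # -> 0.75
```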
Quality is a complex term that can include several measures of efficiency. For the evaluation of a DT's quality the most general measure is the overall accuracy, defined as the percentage of correctly classified objects among all objects (correctly classified and not correctly classified).
Accuracy can thus be calculated as:

ACC = T / (T + F)
where T stands for "true" cases (i.e. correctly classified objects) and F stands for "false" cases (i.e. not correctly classified objects). The above measure is used to determine the overall classification accuracy. In many cases, the accuracy of each specific decision class is even more important than the overall accuracy. The separate class accuracy of the ith decision class is calculated as:

ACC_k,i = T_i / (T_i + F_i)

and the average accuracy over all decision classes is calculated as:

ACC_k = (1/M) Σ_{i=1..M} T_i / (T_i + F_i)
where M represents the number of decision classes. There are many situations in real-world data where exactly two decision classes are possible (i.e. positive and negative cases). In such a case the common measures are sensitivity and specificity:

Sens = TP / (TP + FN) ,
Spec = TN / (TN + FP)
where TP stands for "true positives" (i.e. correctly classified positive cases), TN stands for "true negatives" (i.e. correctly classified negative cases), FP stands for "false positives" (i.e. negative cases incorrectly classified as positive), and FN stands for "false negatives" (i.e. positive cases incorrectly classified as negative). Besides accuracy, various other measures of efficiency can be used to determine the quality of a DT. Which measures are used depends greatly on the dataset used and the domain of the problem. They include complexity (i.e. the size and structure of an induced DT), generality (the difference in accuracy between training objects and testing objects), the number of attributes used, etc.

Applications and Available Software

DT are one of the most popular data-mining models because they are easy to induce, understand, and interpret. High popularity also means many available commercial and free applications. Let us mention a few. We have to start with Quinlan's C4.5 and C5.0/See5 (the MS Windows/XP/Vista version of C5.0), the classic DT tools. C5.0/See5 is commercial and available at http://www.rulequest.com/see5-info.html.
Another popular commercial tool is CART 5, based on the original classification and regression trees algorithm. It is implemented in an intuitive Windows-based environment and is commercially available at http://www.salford-systems.com/1112.php. The Weka (Waikato Environment for Knowledge Analysis) environment contains a collection of visualization tools and algorithms for data analysis and predictive modeling, among them some variants of DTs. Besides its graphical user interface for easy access, one of its great advantages is the possibility of extending the tool with one's own implementations of data-analysis algorithms in Java. A free version is available at http://www.cs.waikato.ac.nz/ml/weka/. OC1 (Oblique Classifier 1) is a decision tree induction system written in C and designed for applications where the instances have numeric attribute values. OC1 builds DT that contain linear combinations of one or more attributes at each internal node; these trees then partition the space of examples with both oblique and axis-parallel hyperplanes. It is available free at http://www.cs.jhu.edu/~salzberg/announce-oc1.html. Many other implementations exist; some are available as part of larger commercial packages like SPSS, Matlab, etc.

Future Directions

DT have reached the stage of being one of the fundamental classification tools used on a daily basis, both in academia and in industry. As the complexity of stored data grows and information systems become more and more sophisticated, the necessity for an efficient, reliable and interpretable intelligent method is growing rapidly. DT have been around for a long time and have matured and proved their quality. Although the present state of research regarding DT may suggest there is nothing left to explore, we can expect major developments in some lines of DT research that are still open. It is our belief that DT will have a great impact on the development of hybrid intelligent systems in the near future.
Bibliography

Primary Literature

1. Babic SH, Kokol P, Stiglic MM (2000) Fuzzy decision trees in the support of breastfeeding. In: Proceedings of the 13th IEEE Symposium on Computer-Based Medical Systems CBMS'2000, Houston, pp 7–11
2. Banerjee A (1994) Initializing neural networks using decision trees. In: Proceedings of the International Workshop on Computational Learning Theory and Natural Learning Systems, Cambridge, pp 3–15
3. Bonner G (2001) Decision making for health care professionals: use of decision trees within the community mental health setting. J Adv Nursing 35:349–356
4. Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
5. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, Belmont
6. Cantu-Paz E, Kamath C (2000) Using evolutionary algorithms to induce oblique decision trees. In: Proceedings of the Genetic and Evolutionary Computation Conference GECCO-2000, Las Vegas, pp 1053–1060
7. Craven MW, Shavlik JW (1996) Extracting tree-structured representations of trained networks. In: Advances in Neural Information Processing Systems, vol 8. MIT Press, Cambridge
8. Crawford S (1989) Extensions to the CART algorithm. Int J Man-Mach Stud 31(2):197–217
9. Cremilleux B, Robert C (1997) A theoretical framework for decision trees in uncertain domains: Application to medical data sets. In: Lecture Notes in Artificial Intelligence, vol 1211. Springer, London, pp 145–156
10. Dantchev N (1996) Therapeutic decision trees in psychiatry. Encephale-Revue Psychiatr Clinique Biol Therap 22(3):205–214
11. Dietterich TG, Kong EB (1995) Machine learning bias, statistical bias and statistical variance of decision tree algorithms. Mach Learn, Corvallis
12. Feigenbaum EA, Simon HA (1962) A theory of the serial position effect. Br J Psychol 53:307–320
13. Freund Y (1995) Boosting a weak learning algorithm by majority. Inf Comput 121:256–285
14. Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. In: Machine Learning: Proceedings of the Thirteenth International Conference. Morgan Kaufmann, San Francisco, pp 148–156
15. Gambhir SS (1999) Decision analysis in nuclear medicine. J Nucl Med 40(9):1570–1581
16. Gehrke J (2003) Decision trees. In: Ye N (ed) The Handbook of Data Mining. Lawrence Erlbaum, Mahwah
17. Goebel M, Gruenwald L (1999) A survey of data mining software tools. SIGKDD Explor 1(1):20–33
18. Goldberg DE (1989) Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading
19. Hand D (1997) Construction and assessment of classification rules. Wiley, Chichester
20. Heath D et al (1993) k-DT: A multi-tree learning method. In: Proceedings of the Second International Workshop on Multistrategy Learning, Harpers Ferry, pp 138–149
21. Heath D et al (1993) Learning oblique decision trees. In: Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence IJCAI-93, pp 1002–1007
22. Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8):832–844
23. Hunt EB, Marin J, Stone PT (1966) Experiments in Induction. Academic Press, New York, pp 45–69
24. Jones JK (2001) The role of data mining technology in the identification of signals of possible adverse drug reactions: Value and limitations. Curr Ther Res-Clin Exp 62(9):664–672
25. Kirkpatrick S et al (1983) Optimization by simulated annealing. Science 220(4598):671–680
26. Kokol P, Zorman M, Stiglic MM, Malcic I (1998) The limitations of decision trees and automatic learning in real world medical decision making. In: Proceedings of the 9th World Congress on Medical Informatics MEDINFO'98, 52, pp 529–533
27. Letourneau S, Jensen L (1998) Impact of a decision tree on chronic wound care. J Wound Ostomy Continence Nurs 25:240–247
28. Lim T-S, Loh W-Y, Shih Y-S (2000) A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Mach Learn 48:203–228
29. Murthy KVS (1997) On growing better decision trees from data. PhD dissertation, Johns Hopkins University, Baltimore
30. Neapolitan R, Naimipour K (1996) Foundations of Algorithms. DC Heath, Lexington
31. Nikolaev N, Slavov V (1998) Inductive genetic programming with decision trees. Intell Data Anal Int J 2(1):31–44
32. Ohno-Machado L, Lacson R, Massad E (2000) Decision trees and fuzzy logic: A comparison of models for the selection of measles vaccination strategies in Brazil. In: Proceedings of AMIA Symposium 2000, Los Angeles, pp 625–629
33. Paterson A, Niblett TB (1982) ACLS Manual. Intelligent Terminals, Edinburgh
34. Podgorelec V (2001) Intelligent systems design and knowledge discovery with automatic programming. PhD thesis, University of Maribor
35. Podgorelec V, Kokol P (1999) Induction of medical decision trees with genetic algorithms. In: Proceedings of the International ICSC Congress on Computational Intelligence Methods and Applications CIMA. Academic Press, Rochester
36. Podgorelec V, Kokol P (2001) Towards more optimal medical diagnosing with evolutionary algorithms. J Med Syst 25(3):195–219
37. Podgorelec V, Kokol P (2001) Evolutionary decision forests – decision making with multiple evolutionary constructed decision trees. In: Problems in Applied Mathematics and Computational Intelligence. WSES Press, pp 97–103
38. Quinlan JR (1979) Discovering rules by induction from large collections of examples. In: Michie D (ed) Expert Systems in the Micro Electronic Age. University Press, Edinburgh, pp 168–201
39. Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106
40. Quinlan JR (1987) Simplifying decision trees. Int J Man-Mach Stud 27:221–234
41. Quinlan JR (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco
42. Rich E, Knight K (1991) Artificial Intelligence, 2nd edn. McGraw-Hill, New York
43. Sanders GD, Hagerty CG, Sonnenberg FA, Hlatky MA, Owens DK (2000) Distributed decision support using a web-based interface: prevention of sudden cardiac death. Med Decis Making 19(2):157–166
44. Schapire RE (1990) The strength of weak learnability. Mach Learn 5:197–227
45. Shannon C, Weaver W (1949) The Mathematical Theory of Communication. University of Illinois Press, Champaign
46. Shlien S (1992) Multiple binary decision tree classifiers. Pattern Recognit Lett 23(7):757–763
47. Sims CJ, Meyn L, Caruana R, Rao RB, Mitchell T, Krohn M (2000) Predicting cesarean delivery with decision tree models. Am J Obstet Gynecol 183:1198–1206
48. Smyth P, Goodman RM (1991) Rule induction using information theory. In: Piatetsky-Shapiro G, Frawley WJ (eds) Knowledge Discovery in Databases. AAAI Press, Cambridge, pp 159–176
49. Sprogar M, Kokol P, Hleb S, Podgorelec V, Zorman M (2000) Vector decision trees. Intell Data Anal 4(3–4):305–321
50. Tou JT, Gonzalez RC (1974) Pattern Recognition Principles. Addison-Wesley, Reading
51. Tsien CL, Fraser HSF, Long WJ, Kennedy RL (1998) Using classification tree and logistic regression methods to diagnose myocardial infarction. In: Proceedings of the 9th World Congress on Medical Informatics MEDINFO'98, 52, pp 493–497
52. Tsien CL, Kohane IS, McIntosh N (2000) Multiple signal integration by decision tree induction to detect artifacts in the neonatal intensive care unit. Artif Intell Med 19(3):189–202
53. Utgoff PE (1989) Incremental induction of decision trees. Mach Learn 4(2):161–186
54. Utgoff PE (1989) Perceptron trees: a case study in hybrid concept representations. Connect Sci 1:377–391
55. White AP, Liu WZ (1994) Bias in information-based measures in decision tree induction. Mach Learn 15:321–329
56. Zorman M, Hleb S, Sprogar M (1999) Advanced tool for building decision trees MtDecit 2.0. In: Arabnia HR (ed) Proceedings of the International Conference on Artificial Intelligence ICAI'99, Las Vegas
57. Zorman M, Kokol P, Podgorelec V (2000) Medical decision making supported by hybrid decision trees. In: Proceedings of the ICSC Symposia on Intelligent Systems & Applications ISA'2000. ICSC Academic Press, Wollongong
58. Zorman M, Podgorelec V, Kokol P, Peterson M, Lane J (2000) Decision tree's induction strategies evaluated on a hard real world problem. In: Proceedings of the 13th IEEE Symposium on Computer-Based Medical Systems CBMS'2000, Houston, pp 19–24
59. Zorman M, Sigut JF, de la Rosa SJL, Alayón S, Kokol P, Verliè M (2006) Evolutionary built decision trees for supervised segmentation of follicular lymphoma images. In: Proceedings of the 9th IASTED International Conference on Intelligent Systems and Control, Honolulu, pp 182–187

Books and Reviews

Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, Belmont
Han J, Kamber M (2006) Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco
Hand D, Mannila H, Smyth P (2001) Principles of Data Mining. MIT Press, Cambridge
Kantardzic M (2003) Data Mining: Concepts, Models, Methods, and Algorithms. Wiley, San Francisco
Mitchell TM (1997) Machine Learning. McGraw-Hill, New York
Quinlan JR (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco
Ye N (ed) (2003) The Handbook of Data Mining. Lawrence Erlbaum, Mahwah
Dependency and Granularity in Data-Mining
Dependency and Granularity in Data-Mining
SHUSAKU TSUMOTO, SHOJI HIRANO
Department of Medical Informatics, Shimane University, School of Medicine, Enya-cho Izumo City, Shimane, Japan

Article Outline

Definition of the Subject
Introduction
Contingency Table from Rough Sets
Rank of Contingency Table (2 × 2)
Rank of Contingency Table (m × n)
Rank and Degree of Dependence
Degree of Granularity and Dependence
Conclusion
Acknowledgment
Bibliography

Definition of the Subject

The degree of granularity of a contingency table is closely related to that of dependence of contingency tables. We investigate these relations from the viewpoints of determinantal divisors and determinants. From the results on determinantal divisors, it seems that the divisors provide information on the degree of dependence between the matrix of all elements and its submatrices, and that an increase in the degree of granularity may lead to an increase in dependency. However, another approach shows that a constraint on the sample size of a contingency table is very strong, which leads to an evaluation formula in which an increase in the degree of granularity gives a decrease in dependency.

Introduction

Independence (dependence) is a very important concept in data mining, especially for feature selection. In rough sets [2], if two attribute-value pairs, say [c = 0] and [d = 0], are independent, their supporting sets, denoted by C and D, do not have an overlapping region (C ∩ D = ∅), which means that an attribute independent of a given target concept may not appear in the classification rule for the concept. This idea is also frequently used in other rule discovery methods: let us consider deterministic rules, described as if-then rules, which can be viewed as classic propositions (C → D). From the set-theoretical point of view, a set of examples supporting the conditional part of a deterministic rule, denoted by C, is a subset of a set whose examples belong to the consequence part, denoted by D. That is, the relation C ⊆ D holds and deterministic rules are supported only by positive examples in a dataset [4]. When such a subset relation is not satisfied, indeterministic rules can be defined as if-then rules with probabilistic information [6]. From the set-theoretical point of view, C is not a subset of, but closely overlaps with, D. That is, the relations C ∩ D ≠ ∅ and |C ∩ D|/|C| ≥ δ will hold in this case (the threshold δ is the degree of the closeness of the overlapping sets, which will be given by domain experts; for more information, please refer to Sect. “Rank of Contingency Table (2 × 2)”). Thus, probabilistic rules are supported by a large number of positive examples and a small number of negative examples. On the other hand, in a probabilistic context, independence of two attributes means that one attribute (a1) will not influence the occurrence of the other attribute (a2), which is formulated as p(a2 | a1) = p(a2). Although independence is a very important concept, it has not been fully and formally investigated as a relation between two attributes. Tsumoto introduces linear algebra into the formal analysis of a contingency table [5]. This gives the following interesting results: First, a contingency table can be viewed as a comparison between two attributes with respect to information granularity. Second, matrix algebra is a key tool for the analysis of this table: a contingency table can be viewed as a matrix, and several operations and ideas of matrix theory can be introduced into its analysis. Especially for the degree of independence, rank plays a very important role in extracting a probabilistic model from a given contingency table. This paper presents a further investigation into the degree of independence of a contingency matrix. Intuitively and empirically, when two attributes have many values, the dependence between these two attributes becomes low. However, from the results on determinantal divisors, it seems that the divisors provide information on the degree of dependence between the matrix of all elements and its submatrices, and that an increase in the degree of granularity may lead to an increase in dependency. The key to the resolution of this conflict is to consider constraints on the sample size. In this paper we show that a constraint on the sample size of a contingency table is very strong, which leads to an evaluation formula in which an increase in the degree of granularity gives a decrease in dependency. The paper is organized as follows: Sect. “Contingency Table from Rough Sets” gives preliminaries. Sect. “Rank of Contingency Table (2 × 2)” discusses the former results. Sect. “Rank of
Contingency Table (m × n)” gives the relations between rank and submatrices of a matrix. Finally, Sect. “Degree of Granularity and Dependence” concludes this paper.

Contingency Table from Rough Sets

Notations

In the subsequent sections, the following notation, introduced in [3], is adopted: Let U denote a nonempty, finite set called the universe and A a nonempty, finite set of attributes, that is, a : U → V_a for a ∈ A, where V_a is called the domain of a. Then a decision table is defined as an information system, A = (U, A ∪ {D}), where {D} is a set of given decision attributes. The atomic formulas over B ⊆ A ∪ {D} and V are expressions of the form [a = v], called descriptors over B, where a ∈ B and v ∈ V_a. The set F(B, V) of formulas over B is the least set containing all atomic formulas over B and closed with respect to disjunction, conjunction and negation. For each f ∈ F(B, V), f_A denotes the meaning of f in A, that is, the set of all objects in U with property f, defined inductively as follows:
1. If f is of the form [a = v], then f_A = {s ∈ U | a(s) = v}.
2. (f ∧ g)_A = f_A ∩ g_A; (f ∨ g)_A = f_A ∪ g_A; (¬f)_A = U − f_A.

Definition 1 Let R1 and R2 denote multinomial attributes in an attribute space A which have n and m values, respectively. A contingency table is a table of a set described by the following formulas: |[R1 = A_j]_A|, |[R2 = B_i]_A|, |[R1 = A_j ∧ R2 = B_i]_A|, |[R1 = A_1 ∨ R1 = A_2 ∨ … ∨ R1 = A_n]_A|, |[R2 = B_1 ∨ R2 = B_2 ∨ … ∨ R2 = B_m]_A| and |U| (i = 1, 2, 3, …, m and j = 1, 2, 3, …, n). This table is arranged into the form shown in Table 1, where |[R1 = A_j]_A| = Σ_{i=1}^{m} x_{ij} = x_{·j}, |[R2 = B_i]_A| = Σ_{j=1}^{n} x_{ij} = x_{i·}, |[R1 = A_j ∧ R2 = B_i]_A| = x_{ij}, and |U| = N = x_{··} (i = 1, 2, 3, …, m and j = 1, 2, 3, …, n).

Dependency and Granularity in Data-Mining, Table 1 Contingency table (m × n)

        A1    A2    …    An    Sum
B1      x11   x12   …    x1n   x1·
B2      x21   x22   …    x2n   x2·
…       …     …     …    …     …
Bm      xm1   xm2   …    xmn   xm·
Sum     x·1   x·2   …    x·n   x·· = |U| = N

Example 1 Let us consider the information table shown in Table 2. The relationship between b and e can be examined by using the corresponding contingency table. First, the frequencies of four elementary relations, called marginal distributions, are counted: [b = 0], [b = 1], [e = 0], and [e = 1]. Then, the frequencies of the four kinds of conjunction are counted: [b = 0] ∧ [e = 0], [b = 0] ∧ [e = 1], [b = 1] ∧ [e = 0], and [b = 1] ∧ [e = 1].

Dependency and Granularity in Data-Mining, Table 2 A small dataset

     a  b  c  d  e
1    1  0  0  0  1
2    0  0  1  1  1
3    0  1  2  2  0
4    1  1  1  2  1
5    0  0  2  1  0

Then, the following contingency table is obtained (Table 3). From this table, accuracy and coverage for [b = 0] → [e = 0] are obtained as 1/(1 + 2) = 1/3 and 1/(1 + 1) = 1/2.

Dependency and Granularity in Data-Mining, Table 3 Corresponding contingency table

        b=0  b=1  Sum
e=0     1    1    2
e=1     2    1    3
Sum     3    2    5
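The counts in Example 1 can be reproduced directly from the columns b and e of the small dataset; the snippet below is an illustrative sketch (variable names are ours, not the article's):

```python
# Recomputing Table 3 and the accuracy/coverage of [b=0] -> [e=0]
# from the columns b and e of Table 2.
from collections import Counter

b = [0, 0, 1, 1, 0]
e = [1, 1, 0, 1, 0]

table = Counter(zip(b, e))          # contingency table as a dict of cell counts
support = table[(0, 0)]             # |[b=0] AND [e=0]|
accuracy = support / b.count(0)     # |[b=0 and e=0]| / |[b=0]|  -> 1/3
coverage = support / e.count(0)     # |[b=0 and e=0]| / |[e=0]|  -> 1/2
print(accuracy, coverage)
```

The marginal sums of `table` reproduce the Sum row and column of Table 3.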
One of the important observations from granular computing is that a contingency table shows the relations between two attributes with respect to intersections of their supporting sets. For example, in Table 3, both b and e have two different partitions of the universe, and the table gives the relation between b and e with respect to the intersections of the supporting sets. It is easy to see that this idea can be extended to an n × n contingency table, which can be viewed as an n × n matrix. When two attributes have a different number of equivalence classes, the situation may be a little more complicated. But in this case, thanks to linear algebra, we only have to consider the attribute which has the smaller number of equivalence classes, and the surplus equivalence classes of the attribute with the larger number of equivalence classes can be projected onto the other partitions. In other words, an m × n matrix or contingency table includes a projection from one attribute to the other one.

Rank of Contingency Table (2 × 2)

Preliminaries

Definition 2 A corresponding matrix C_{T(a,b)} is defined as a matrix the elements of which are equal to the value of the
corresponding contingency table T(a,b) of two attributes a and b, except for the marginal values.

Definition 3 The rank of a table is defined as the rank of its corresponding matrix. The maximum value of the rank is equal to the size of the (square) matrix, denoted by r.

Example 2 Let the table given in Table 3 be defined as T(b,e). Then C_{T(b,e)} is:

    ( 1  1 )
    ( 2  1 )

Since the determinant det(C_{T(b,e)}) is not equal to 0, the rank of C_{T(b,e)} is equal to 2. This is the maximum value (r = 2), so b and e are statistically dependent.

Independence when the Table is 2 × 2

From the application of linear algebra, several results are obtained. (The proofs are omitted.) First, it is assumed that a contingency table is given with m = 2, n = 2 as in Table 1. Then the corresponding matrix C_{T(R1,R2)} is given as:

    ( x11  x12 )
    ( x21  x22 )

Proposition 1 The determinant det(C_{T(R1,R2)}) is equal to |x11 x22 − x12 x21|.

Proposition 2 The rank will be:

    rank = 2, if det(C_{T(R1,R2)}) ≠ 0;
    rank = 1, if det(C_{T(R1,R2)}) = 0.

If the rank of C_{T(b,e)} is equal to 1, then according to the theorems of linear algebra one row or column can be represented by the other. That is,

Proposition 3 Let r1 and r2 denote the rows of the corresponding matrix of a given 2 × 2 table, C_{T(b,e)}. That is,

    r1 = (x11, x12),  r2 = (x21, x22).

Then r1 can be represented by r2: r1 = k·r2, where k is given as:

    k = x11/x21 = x12/x22 = x1·/x2·.

From this proposition, the following theorem is obtained.

Theorem 1 If the rank of the corresponding matrix is 1, then the two attributes in a given contingency table are statistically independent. Thus,

    rank = 2: dependent;
    rank = 1: statistically independent.
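The rank computations used in this section and the next can be checked with a few lines of exact rational arithmetic. The helper below is an illustrative sketch (not from the paper); it performs Gaussian elimination over fractions, so no floating-point tolerance is needed:

```python
# Rank of a corresponding matrix by Gaussian elimination over exact
# fractions. Rank 1 signals statistical independence (proportional
# rows); full rank signals dependence; intermediate values signal
# contextual dependence (Theorem 2 in the next section).
from fractions import Fraction

def matrix_rank(rows):
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    n_rows, n_cols = len(m), len(m[0])
    for col in range(n_cols):
        # find a pivot row at or below the current rank
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(n_rows):
            if r != rank and m[r][col] != 0:
                f = m[r][col] / m[rank][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

print(matrix_rank([[1, 1], [2, 1]]))    # Table 3 matrix: 2 (dependent)
print(matrix_rank([[1, 2], [2, 4]]))    # proportional rows: 1 (independent)
```

The same helper applies unchanged to the non-square 3 × 2 matrix of Table 4 in the next section.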
Rank of Contingency Table (m × n)

In the case of a general square matrix, the results for the 2 × 2 contingency table can be extended. It is especially important to observe that conventional statistical independence is supported only when the rank of the corresponding matrix is equal to 1. Let us consider the contingency table of c and d in Table 2. The corresponding matrix of this table is:

    ( 1  0  0 )
    ( 0  1  1 )
    ( 0  1  1 )

whose determinant is equal to 0. It is clear that its rank is 2. It is interesting to see that if the case [d = 0] is removed, then the rank of the corresponding matrix is equal to 1 and two rows are equal. Thus, if the value space of d is restricted to {1, 2}, then c and d are statistically independent. This relation is called contextual independence [1], which is related to conditional independence. However, another type of weak independence is observed: let us consider the contingency table of a and c. The table is obtained as Table 4.

Dependency and Granularity in Data-Mining, Table 4 Contingency table for a and c

        a=0  a=1  Sum
c=0     0    1    1
c=1     1    1    2
c=2     2    0    2
Sum     3    2    5

Its corresponding matrix is:

    C_{T(a,c)} = ( 0  1 )
                 ( 1  1 )
                 ( 2  0 )

Since the corresponding matrix is not square, the determinant is not defined. But it is easy to see that the rank of this matrix is two. In this case, even removing any attribute-value pair from the table will not generate statistical independence. Finally, the relation between rank and independence in an n × n contingency table is obtained.

Theorem 2 Let the corresponding matrix of a given contingency table be a square n × n matrix. If the rank of the corresponding matrix is 1, then the two attributes in the given contingency table are statistically independent. If the rank of the corresponding matrix is n, then the two attributes are dependent. Otherwise, the two attributes are contextually dependent, which means that several conditional probabilities can be represented by a linear
combination of conditional probabilities. Thus,

    rank = n: dependent;
    1 < rank < n: contextually dependent;
    rank = 1: statistically independent.

A point x ∈ X is periodic, if F^n(x) = x for some n > 0. The least positive n with this property is called the period of x. The orbit of x is the set O(x) = {F^n(x) : n > 0}. A set Y ⊆ X is positively invariant, if F(Y) ⊆ Y, and strongly invariant if F(Y) = Y. A point x ∈ X is equicontinuous (x ∈ E_F) if the family of maps F^n is equicontinuous at x, i.e., x ∈ E_F iff

    (∀ε > 0)(∃δ > 0)(∀y ∈ B_δ(x))(∀n > 0)(d(F^n(y), F^n(x)) < ε).

The system (X, F) is almost equicontinuous if E_F ≠ ∅, and equicontinuous, if

    (∀ε > 0)(∃δ > 0)(∀x ∈ X)(∀y ∈ B_δ(x))(∀n > 0)(d(F^n(y), F^n(x)) < ε).

For an equicontinuous system E_F = X. Conversely, if E_F = X and if X is compact, then F is equicontinuous; this need not be true in the non-compact case. A system (X, F) is sensitive (to initial conditions), if

    (∃ε > 0)(∀x ∈ X)(∀δ > 0)(∃y ∈ B_δ(x))(∃n > 0)(d(F^n(y), F^n(x)) ≥ ε).

A sensitive system has no equicontinuous point. However, there exist systems with no equicontinuity points which are not sensitive. A system (X, F) is positively expansive, if

    (∃ε > 0)(∀x ≠ y ∈ X)(∃n ≥ 0)(d(F^n(x), F^n(y)) ≥ ε).

A positively expansive system on a perfect space is sensitive. A system (X, F) is (topologically) transitive, if for
Dynamics of Cellular Automata in Non-compact Spaces
any nonempty open sets U, V ⊆ X there exists n ≥ 0 such that F^n(U) ∩ V ≠ ∅. If X is perfect and the system has a dense orbit, then it is transitive. Conversely, if (X, F) is topologically transitive and X is compact, then (X, F) has a dense orbit. A system (X, F) is mixing, if for any nonempty open sets U, V ⊆ X there exists k > 0 such that for every n ≥ k we have F^n(U) ∩ V ≠ ∅. An ε-chain (from x_0 to x_n) is a sequence of points x_0, …, x_n ∈ X such that d(F(x_i), x_{i+1}) < ε for 0 ≤ i < n. A system (X, F) is chain-transitive, if for any ε > 0 and any x, y ∈ X there exists an ε-chain from x to y. A strongly invariant closed set Y ⊆ X is stable, if

    ∀ε > 0, ∃δ > 0, ∀x ∈ X: d(x, Y) < δ ⟹ ∀n > 0, d(F^n(x), Y) < ε.

A strongly invariant closed stable set Y ⊆ X is an attractor, if

    ∃δ > 0, ∀x ∈ X: d(x, Y) < δ ⟹ lim_{n→∞} d(F^n(x), Y) = 0.

A set W ⊆ X is inward, if F(W) ⊆ W°. In compact spaces, attractors are exactly the Ω-limits Ω_F(W) = ∩_{n>0} F^n(W) of inward sets.

Theorem 1 (Knudsen [10]) Let (X, F) be a dynamical system and Y ⊆ X a dense, F-invariant subset.

(1) (X, F) is sensitive iff (Y, F) is sensitive.
(2) (X, F) is transitive iff (Y, F) is transitive.

Recall that a space X is separable, if it has a countable dense set.

Theorem 2 (Blanchard, Formenti, and Kůrka [2]) Let (X, F) be a dynamical system on a non-separable space. If (X, F) is transitive, then it is sensitive.

Cellular Automata

For a finite alphabet A, denote by |A| the number of its elements, by A* := ∪_{n≥0} A^n the set of words over A, and by A+ := ∪_{n>0} A^n = A* \ {λ} the set of nonempty words (λ denotes the empty word). The length of a word u ∈ A^n is denoted by |u| := n. We say that u ∈ A* is a subword of v ∈ A* (u ⊑ v) if there exists k such that v_{k+i} = u_i for all i < |u|. We denote by u_{[i,j)} = u_i … u_{j−1} and u_{[i,j]} = u_i … u_j the subwords of u associated to intervals. We denote by A^Z the set of A-configurations, or doubly-infinite sequences of letters of A. For any u ∈ A+ we have a periodic configuration u^∞ ∈ A^Z defined by (u^∞)_{k|u|+i} = u_i for k ∈ Z and 0 ≤ i < |u|. The cylinder of a word u ∈ A* located at l ∈ Z is the set [u]_l = {x ∈ A^Z : x_{[l, l+|u|)} = u}. The cylinder set of a set of words U ⊆ A+ located at l ∈ Z is the set [U]_l = ∪_{u∈U} [u]_l. A subshift is a nonempty subset Σ ⊆ A^Z such that there exists a set D ⊆ A+ of forbidden words with Σ = Σ_D := {x ∈ A^Z : ∀u ⊑ x, u ∉ D}. A subshift Σ_D is of finite type (SFT), if D is finite. A subshift is uniquely determined by its language

    L(Σ) := ∪_{n≥0} L_n(Σ), where L_n(Σ) := {u ∈ A^n : ∃x ∈ Σ, u ⊑ x}.

A cellular automaton is a map F : A^Z → A^Z defined by F(x)_i = f(x_{[i−r, i+r]}), where r ≥ 0 is a radius and f : A^{2r+1} → A is a local rule. In particular the shift map σ : A^Z → A^Z is defined by σ(x)_i := x_{i+1}. A local rule extends to a map f : A* → A* by f(u)_i = f(u_{[i, i+2r]}), so that |f(u)| = max{|u| − 2r, 0}.

Definition 3 Let F : A^Z → A^Z be a CA.

(1) A word u ∈ A* is m-blocking, if |u| ≥ m and there exists an offset d ≤ |u| − m such that ∀x, y ∈ [u]_0, ∀n > 0, F^n(x)_{[d, d+m)} = F^n(y)_{[d, d+m)}.
(2) A set U ⊆ A+ is spreading, if [U] is F-invariant and there exists n > 0 such that F^n([U]) ⊆ σ^{−1}([U]) ∩ σ([U]).

The following results will be useful in the sequel.

Proposition 4 (Formenti, Kůrka [5]) Let F : A^Z → A^Z be a CA and let U ⊆ A+ be an invariant set. Then Ω_F([U]) is a subshift iff U is spreading.

Theorem 5 (Hedlund [6]) Let F : A^Z → A^Z be a CA with local rule f : A^{2r+1} → A. Then F is surjective iff f : A* → A* is surjective iff |f^{−1}(u)| = |A|^{2r} for each u ∈ A+.

Submeasures

A pseudometric on a set X is a map d : X × X → [0, ∞) which satisfies the following conditions:

1. d(x, y) = d(y, x),
2. d(x, z) ≤ d(x, y) + d(y, z).

If moreover d(x, y) > 0 for x ≠ y, then we say that d is a metric. There is a standard method to create pseudometrics from submeasures. A bounded submeasure (with bound M ∈ R+) is a map φ : P(Z) → [0, M] which satisfies the following conditions:

1. φ(∅) = 0,
2. φ(U) ≤ φ(U ∪ V) ≤ φ(U) + φ(V) for U, V ⊆ Z.

A bounded submeasure φ on Z defines a pseudometric d_φ : A^Z × A^Z → [0, ∞) by d_φ(x, y) := φ({i ∈ Z : x_i ≠ y_i}). The Cantor, Besicovitch and Weyl pseudometrics on A^Z are defined by the following submeasures:

    φ_C(U) := 2^{−min{|i| : i ∈ U}},
    φ_B(U) := limsup_{l→∞} |U ∩ [−l, l)| / (2l),
    φ_W(U) := limsup_{l→∞} sup_{k∈Z} |U ∩ [k, k+l)| / l.

The Cantor Space

The Cantor metric on A^Z is defined by d_C(x, y) = 2^{−k}, where k = min{|i| : x_i ≠ y_i},
so d_C(x, y) < 2^{−k} iff x_{[−k, k]} = y_{[−k, k]}. We denote by C_A = (A^Z, d_C) the metric space of two-sided configurations with metric d_C. The cylinders are clopen sets in C_A. All Cantor spaces (with different alphabets) are homeomorphic. The Cantor space is compact, totally disconnected and perfect, and conversely, every space with these properties is homeomorphic to a Cantor space. The literature about CA dynamics in Cantor spaces is huge; in this section, we just recall some results and definitions which will be used later.

Theorem 6 (Kůrka [11]) Let (C_A, F) be a CA with radius r.

(1) (C_A, F) is almost equicontinuous iff there exists an r-blocking word for F.
(2) (C_A, F) is equicontinuous iff all sufficiently long words are r-blocking.

Denote by E_F the set of equicontinuous points of F. The sets of equicontinuous directions and almost equicontinuous directions of a CA (C_A, F) (see Sablik [15]) are defined by

    E(F) = { p/q : p ∈ Z, q ∈ N+, E_{F^q σ^p} = A^Z },
    A(F) = { p/q : p ∈ Z, q ∈ N+, E_{F^q σ^p} ≠ ∅ }.

The Periodic Space

Definition 7 The periodic space P_A = {x ∈ A^Z : ∃n > 0, σ^n(x) = x} over an alphabet A consists of the shift-periodic configurations with the Cantor metric d_C.

All periodic spaces (with different alphabets) are homeomorphic. The periodic space is not compact, but it is totally disconnected and perfect. It is dense in C_A. If (C_A, F) is a CA, then F(P_A) ⊆ P_A. We denote by F_P : P_A → P_A the restriction of F to P_A, so (P_A, F_P) is a (non-compact) dynamical system. Every F_P-orbit is finite, so every point x ∈ P_A is F_P-eventually periodic.

Theorem 8 Let F be a CA over alphabet A.

(1) (C_A, F) is surjective iff (P_A, F_P) is surjective.
(2) (C_A, F) is equicontinuous iff (P_A, F_P) is equicontinuous.
(3) (C_A, F) is almost equicontinuous iff (P_A, F_P) is almost equicontinuous.
(4) (C_A, F) is sensitive iff (P_A, F_P) is sensitive.
(5) (C_A, F) is transitive iff (P_A, F_P) is transitive.

Proof (1a) Let F be surjective, let y ∈ P_A and σ^n(y) = y. There exists z ∈ F^{−1}(y) and integers i < j such that z_{[in−r, in+r)} = z_{[jn−r, jn+r)}. Then x = (z_{[in−r, jn−r)})^∞ ∈ P_A and F_P(x) = y, so F_P is surjective. (1b) Let F_P be surjective, and u ∈ A+. Then u^∞ has an F_P-preimage and therefore u has a preimage under the local rule. By the Hedlund theorem, (C_A, F) is surjective. (2a) Since P_A ⊆ C_A, the equicontinuity of F trivially implies the equicontinuity of F_P. (2b) Let F_P be equicontinuous. There exists m > r such that if x, y ∈ P_A and x_{[−m, m]} = y_{[−m, m]}, then F^n(x)_{[−r, r]} = F^n(y)_{[−r, r]} for all n ≥ 0. We claim that all words of length 2m + 1 are (2r + 1)-blocking with offset m − r. If not, then for some x, y ∈ A^Z with x_{[−m, m]} = y_{[−m, m]} there exists n > 0 such that F^n(x)_{[−r, r]} ≠ F^n(y)_{[−r, r]}. For the periodic configurations x′ = (x_{[−m−nr, m+nr]})^∞ and y′ = (y_{[−m−nr, m+nr]})^∞ we get F^n(x′)_{[−r, r]} ≠ F^n(y′)_{[−r, r]}, contradicting the assumption. By Theorem 6, F is C-equicontinuous. (3a) If (C_A, F) is almost equicontinuous, then there exists an r-blocking word u, and u^∞ ∈ P_A is an equicontinuous configuration for (P_A, F_P). (3b) The proof is analogous to (2b). (4) and (5) follow from Theorem 1 of Knudsen. □

The Toeplitz Space

Definition 9 Let A be an alphabet.
(1) The Besicovitch pseudometric on A^Z is defined by

    d_B(x, y) = limsup_{l→∞} |{j ∈ [−l, l) : x_j ≠ y_j}| / (2l).
(2) The Weyl pseudometric on AZ is defined by jf j 2 [k; k C l) : x j ¤ y j gj : dW (x; y) D lim sup max l k2Z l !1
Clearly $d_B(x,y) \le d_W(x,y)$, and
$$d_B(x,y) < \varepsilon \iff \exists l_0 \in \mathbb{N},\ \forall l \ge l_0 :\ |\{j \in [-l, l] : x_j \neq y_j\}| < (2l+1)\varepsilon \, ,$$
$$d_W(x,y) < \varepsilon \iff \exists l_0 \in \mathbb{N},\ \forall l \ge l_0,\ \forall k \in \mathbb{Z} :\ |\{j \in [k, k+l) : x_j \neq y_j\}| < l\varepsilon \, .$$
Both $d_B$ and $d_W$ are symmetric and satisfy the triangle inequality, but they are not metrics: distinct configurations $x, y \in A^{\mathbb{Z}}$ can have zero distance. We construct a set of regular quasi-periodic configurations, on which $d_B$ and $d_W$ coincide and are metrics.

Definition 10
(1) The period of $k \in \mathbb{Z}$ in $x \in A^{\mathbb{Z}}$ is $r_k(x) := \inf\{p > 0 : \forall n \in \mathbb{Z},\ x_{k+np} = x_k\}$. We set $r_k(x) = \infty$ if the defining set is empty.
(2) $x \in A^{\mathbb{Z}}$ is quasi-periodic if $r_k(x) < \infty$ for all $k \in \mathbb{Z}$.
(3) A periodic structure for a quasi-periodic configuration $x$ is a sequence of positive integers $p = (p_i)_{i \ge 0}$ such that each $p_i$ divides $p_{i+1}$ and the counts $q_i := |\{k \in [0, p_i) : r_k(x) \text{ does not divide } p_i\}|$ satisfy $\lim_{i \to \infty} q_i/p_i = 0$; a quasi-periodic configuration is regular if it admits a periodic structure.

Proposition 11 Let $x, y$ be regular quasi-periodic configurations.
(1) $d_W(x, y) = d_B(x, y)$.
(2) If $x \neq y$, then $d_B(x, y) > 0$.

Proof (1) We must show $d_W(x,y) \le d_B(x,y)$. Let $p^x$, $p^y$ be periodic structures for $x$ and $y$, and let $p_i = k_i^x p_i^x = k_i^y p_i^y$ be the least common multiple of $p_i^x$ and $p_i^y$. Then $p = (p_i)_i$ is a periodic structure for both $x$ and $y$. For each $i > 0$ and for each $k \in \mathbb{Z}$ we have
$$|\{j \in [k - p_i, k + p_i) : x_j \neq y_j\}| \le 2k_i^x q_i^x + 2k_i^y q_i^y + |\{j \in [-p_i, p_i) : x_j \neq y_j\}| \, ,$$
so
$$d_W(x,y) \le \lim_{i \to \infty} \max_{k \in \mathbb{Z}} \frac{|\{j \in [k-p_i, k+p_i) : x_j \neq y_j\}|}{2p_i} \le \lim_{i \to \infty} \left( \frac{2k_i^x q_i^x}{2k_i^x p_i^x} + \frac{2k_i^y q_i^y}{2k_i^y p_i^y} + \frac{|\{j \in [-p_i, p_i) : x_j \neq y_j\}|}{2p_i} \right) = d_B(x,y) \, .$$
(2) Since $x \neq y$, there exists $i$ such that for some $k \in [0, p_i)$ and for all $n \in \mathbb{Z}$ we have $x_{k+np_i} = x_k \neq y_k = y_{k+np_i}$. It follows that $d_B(x,y) \ge 1/p_i$.

Definition 12 The Toeplitz space $T_A$ over $A$ consists of all regular quasi-periodic configurations with the metric $d_B = d_W$.

Toeplitz sequences are constructed by filling in periodic parts successively. For an alphabet $A$ put $\widetilde{A} = A \cup \{\ast\}$.

Definition 13
(1) The $p$-skeleton $S_p(x) \in \widetilde{A}^{\mathbb{Z}}$ of $x \in A^{\mathbb{Z}}$ is defined by
$$S_p(x)_k = \begin{cases} x_k & \text{if } \forall n \in \mathbb{Z},\ x_{k+np} = x_k \\ \ast & \text{otherwise.} \end{cases}$$
(2) The sequence of gaps of $x \in \widetilde{A}^{\mathbb{Z}}$ is the unique increasing integer sequence $(t_i)$ enumerating the positions of the symbol $\ast$ in $x$.

Theorem 17 Let $F$ be a CA over alphabet $A$.
(1) $(C_A, F)$ is surjective iff $(T_A, F_T)$ is surjective.
(2) If $A(F) \neq \emptyset$, then $(T_A, F_T)$ is almost equicontinuous.
(3) If $E(F) \neq \emptyset$, then $(T_A, F_T)$ is equicontinuous.
(4) If $(T_A, F_T)$ is sensitive, then $(C_A, F)$ is sensitive.
(5) If $(C_A, F)$ is chain-transitive, then $(T_A, F_T)$ is chain-transitive.

Proof (2) Assume first that $F$ is almost equicontinuous, and let $u \in A^{2m+1}$ be a $(2r+1)$-blocking word with offset $m-r$. We show that $u^\infty$ is $T$-equicontinuous. For a given $\varepsilon > 0$ set $\delta = \varepsilon/(4m - 2r + 1)$. If $d_T(y, x) < \delta$, then there exists $l_0$ such that for all $l \ge l_0$, $|\{i \in [-l, l] : x_i \neq y_i\}| < (2l+1)\delta$. For $k(2m+1) \le j < (k+1)(2m+1)$, $F^n(y)_j$ can differ from $F^n(x)_j$ only if $y$ differs from $x$ in some $i \in [k(2m+1) - (m-r), (k+1)(2m+1) + (m-r))$. Thus a change $x_i \neq y_i$ can cause at most $2m + 1 + 2(m-r) = 4m - 2r + 1$ changes $F^n(y)_j \neq F^n(x)_j$. We get
$$|\{i \in [-l, l) : F^n(x)_i \neq F^n(y)_i\}| \le 2l\delta(4m - 2r + 1) \le 2l\varepsilon \, .$$
This shows that $F_T$ is almost equicontinuous. In the general case that $A(F) \neq \emptyset$, we get that $\sigma^p F_T^q$ is almost equicontinuous for some $p \in \mathbb{Z}$, $q \in \mathbb{N}^+$. Since $\sigma$ is a $T$-isometry, $F_T^q$ is almost equicontinuous and therefore $(T_A, F_T)$ is almost equicontinuous. (3) The proof is the same as in (2) with the only modification that all $u \in A^m$ are $(2r+1)$-blocking. (4) The proof of Proposition 8 from [2] works in this case too. (5) The proof of Proposition 12 of [3] works in this case also.
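The $p$-skeleton of Definition 13 is straightforward to compute on a finite sample of a configuration. The sketch below is illustrative only (within a finite window, $p$-periodicity can of course only be checked up to boundary effects):

```python
# p-skeleton of a finite sample x[0..n-1]: position k keeps its symbol iff x
# agrees with itself at every sampled position congruent to k mod p;
# otherwise the symbol is replaced by the hole symbol '*'.

def skeleton(x, p):
    n = len(x)
    out = []
    for k in range(n):
        column = {x[j] for j in range(k % p, n, p)}  # symbols at positions = k mod p
        out.append(x[k] if len(column) == 1 else '*')
    return ''.join(out)

# A sample whose even positions are already 2-periodic: the 2-skeleton fixes
# them, and the 4-skeleton additionally fixes residues 1 and 2 -- holes are
# filled in successively, as in the Toeplitz construction.
x = '0100010101000101'
print(skeleton(x, 2))   # 0*0*0*0*0*0*0*0*
print(skeleton(x, 4))   # 010*010*010*010*
```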
The Besicovitch Space

On $A^{\mathbb{Z}}$ we have an equivalence $x \sim_B y$ iff $d_B(x,y) = 0$. Denote by $B_A$ the set of equivalence classes of $\sim_B$ and by $\pi_B : A^{\mathbb{Z}} \to B_A$ the projection. The factor of $d_B$ is a metric on $B_A$. This is the Besicovitch space on alphabet $A$. Using prefix codes, it can be shown that every two Besicovitch spaces (with different alphabets) are homeomorphic. By Proposition 11, each equivalence class contains at most one quasi-periodic sequence.

Proposition 18 $T_A$ is dense in $B_A$.

The proof of Proposition 9 of [3] works also for regular quasi-periodic sequences.

Theorem 19 (Blanchard, Formenti and Kůrka [2]) The Besicovitch space is pathwise connected, infinite-dimensional, homogeneous and complete. It is neither separable nor locally compact.

Path-connectedness and infinite dimensionality are proved analogously to Proposition 15. To prove that $B_A$ is neither separable nor locally compact, Sturmian configurations have been used in [2]. The completeness of $B_A$ has been proved by Marcinkiewicz [14]. Every cellular automaton $F : A^{\mathbb{Z}} \to A^{\mathbb{Z}}$ is uniformly continuous with respect to $d_B$, so it preserves the equivalence $\sim_B$: if $d_B(x,y) = 0$, then $d_B(F(x), F(y)) = 0$. Thus a cellular automaton $F$ defines a uniformly continuous map $F_B : B_A \to B_A$.

Theorem 20 (Blanchard, Formenti and Kůrka [2]) Let $F$ be a CA on $A$.
(1) $(C_A, F)$ is surjective iff $(B_A, F_B)$ is surjective.
(2) If $A(F) \neq \emptyset$, then $(B_A, F_B)$ is almost equicontinuous.
(3) If $E(F) \neq \emptyset$, then $(B_A, F_B)$ is equicontinuous.
(4) If $(B_A, F_B)$ is sensitive, then $(C_A, F)$ is sensitive.
(5) No cellular automaton $(B_A, F_B)$ is positively expansive.
(6) If $(C_A, F)$ is chain-transitive, then $(B_A, F_B)$ is chain-transitive.

Theorem 21 (Blanchard, Cervelle and Formenti [3])
(1) No CA $(B_A, F_B)$ is transitive.
(2) A CA $(B_A, F_B)$ has either a unique fixed point and no other periodic point, or it has uncountably many periodic points.
(3) If a surjective CA has a blocking word, then the set of its $F_B$-periodic points is dense in $B_A$.

The Generic Space

For a configuration $x \in A^{\mathbb{Z}}$ and a word $v \in A^+$ set
$$\Phi_v^-(x) = \liminf_{n \to \infty} |\{i \in [-n, n) : x_{[i, i+|v|)} = v\}| / 2n \, ,$$
$$\Phi_v^+(x) = \limsup_{n \to \infty} |\{i \in [-n, n) : x_{[i, i+|v|)} = v\}| / 2n \, .$$
For every $v \in A^+$, $\Phi_v^-, \Phi_v^+ : A^{\mathbb{Z}} \to [0,1]$ are continuous in the Besicovitch topology. In fact we have
$$|\Phi_v^-(x) - \Phi_v^-(y)| \le d_B(x,y) \cdot |v| \, , \qquad |\Phi_v^+(x) - \Phi_v^+(y)| \le d_B(x,y) \cdot |v| \, .$$
Define the generic space (over the alphabet $A$) as
$$G_A = \{x \in A^{\mathbb{Z}} : \forall v \in A^+,\ \Phi_v^-(x) = \Phi_v^+(x)\} \, .$$
It is a closed subspace of $B_A$. For $v \in A^+$ denote by $\Phi_v : G_A \to [0,1]$ the common value of $\Phi_v^-$ and $\Phi_v^+$. Using prefix codes, one can show that all generic spaces (with different alphabets) are homeomorphic. The generic space contains all uniquely ergodic subshifts, in particular all Sturmian sequences and all regular Toeplitz sequences. Thus the proofs in Blanchard, Formenti and Kůrka [2] can be applied to the generic space too. In particular the generic space is homogeneous: if we regard the alphabet $A = \{0, \ldots, m-1\}$ as the group $\mathbb{Z}_m = \mathbb{Z}/m\mathbb{Z}$, then for every $x \in G_A$ there is an isometry $H_x : G_A \to G_A$ defined by $H_x(y) = x + y$. Moreover, $G_A$ is pathwise connected, infinite-dimensional and complete (as a closed subspace of the full Besicovitch space). It is neither separable nor locally compact. If $F : A^{\mathbb{Z}} \to A^{\mathbb{Z}}$ is a cellular automaton, then $F(G_A) \subseteq G_A$. Thus the restriction of $F_B$ to $G_A$ defines a dynamical system $(G_A, F_G)$. See also Pivato for a similar approach.

Theorem 22 Let $F : A^{\mathbb{Z}} \to A^{\mathbb{Z}}$ be a CA.
(1) $(C_A, F)$ is surjective iff $(G_A, F_G)$ is surjective.
(2) If $A(F) \neq \emptyset$, then $(G_A, F_G)$ is almost equicontinuous.
(3) If $E(F) \neq \emptyset$, then $(G_A, F_G)$ is equicontinuous.
(4) If $(G_A, F_G)$ is sensitive, then $(C_A, F)$ is sensitive.
(5) If $F$ is $C$-chain transitive, then $F$ is $G$-chain transitive.

The proofs are the same as the proofs of the corresponding properties in [2].
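The densities $\Phi_v$ can be estimated on finite windows. The sketch below is illustrative (names invented); for a periodic, hence uniquely ergodic and generic, configuration the window estimates stabilize at the exact densities:

```python
# Finite-window estimate of Phi_v(x): the frequency of occurrences of the
# word v in the sample x[0..n-1].  (Dividing by n rather than by the number
# of windows introduces a small boundary bias that vanishes as n grows.)

def density(x, v):
    n = len(x)
    return sum(1 for i in range(n - len(v) + 1) if x[i:i + len(v)] == v) / n

x = '01' * 500                  # sample of the 2-periodic configuration (01)^inf
print(density(x, '0'))          # 0.5
print(density(x, '01'))         # 0.5
print(density(x, '11'))         # 0.0
```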
The Space of Measures

By a measure we mean a Borel shift-invariant probability measure on the Cantor space $A^{\mathbb{Z}}$ (see Ergodic Theory of Cellular Automata). This is a countably additive function $\mu$ on the Borel sets of $A^{\mathbb{Z}}$ which assigns 1 to the full space and satisfies $\mu(U) = \mu(\sigma^{-1}(U))$.
A measure $\mu$ on $A^{\mathbb{Z}}$ is determined by its values on cylinders $\mu(u) := \mu([u]_n)$, which do not depend on $n \in \mathbb{Z}$. Thus a measure can be identified with a map $\mu : A^* \to [0,1]$ subject to the bilateral Kolmogorov compatibility conditions
$$\sum_{a \in A} \mu(ua) = \sum_{a \in A} \mu(au) = \mu(u) \, , \qquad \mu(\lambda) = 1 \, ,$$
where $\lambda$ is the empty word.
Define the distance of two measures
$$d_M(\mu, \nu) := \sum_{u \in A^+} |\mu(u) - \nu(u)| \cdot |A|^{-2|u|} \, .$$
This is a metric which yields the topology of weak convergence on the compact space $M_A := M_\sigma(A^{\mathbb{Z}})$ of shift-invariant Borel probability measures. A CA $F : A^{\mathbb{Z}} \to A^{\mathbb{Z}}$ with local rule $f$ determines a continuous and affine map $F_M : M_A \to M_A$ by
$$(F_M(\mu))(u) = \sum_{v \in f^{-1}(u)} \mu(v) \, .$$
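The push-forward formula for $F_M$ can be made concrete for elementary CA, where the local rule has diameter 3, so a word $u$ has preimages $v$ of length $|u|+2$. The sketch below is illustrative code (not from the cited works); it pushes the uniform Bernoulli measure, restricted to words of length 2, through the traffic rule ECA184. The preimage counts of the length-2 words turn out to be unbalanced (3, 5, 5, 3), so the image measure is no longer uniform — consistent with the fact, used later in Example 3, that the maximal attractor of ECA184 is a proper subshift.

```python
# Represent a shift-invariant measure restricted to words of length L+2 as a
# dict mu[v]; a CA local rule f pushes it forward by
# (F_M mu)(u) = sum over v in f^-1(u) of mu(v), with |v| = |u| + 2.

from itertools import product

def eca_local(rule):
    """Local rule of an elementary CA, acting on every length-3 window."""
    table = {(a, b, c): (rule >> (4 * a + 2 * b + c)) & 1
             for a, b, c in product((0, 1), repeat=3)}
    def f(v):
        return tuple(table[v[i], v[i + 1], v[i + 2]] for i in range(len(v) - 2))
    return f

def push_forward(mu, f, length):
    """(F_M mu)(u) for all words u of the given length."""
    nu = {}
    for v in product((0, 1), repeat=length + 2):
        u = f(v)
        nu[u] = nu.get(u, 0.0) + mu[v]
    return nu

# Uniform Bernoulli(1/2) measure on words of length 4, pushed through ECA184.
L = 2
mu = {v: 2 ** -(L + 2) for v in product((0, 1), repeat=L + 2)}
nu = push_forward(mu, eca_local(184), L)
print(sum(nu.values()))   # 1.0: total mass is preserved
print(nu[(0, 1)])         # 0.3125 = 5/16: no longer the uniform value 1/4
```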
Moreover $F$ and $\sigma F$ determine the same dynamical system on $M_A$: $F_M = (\sigma F)_M$. For $x \in G_A$ denote by $\Phi^x : A^* \to [0,1]$ the function $\Phi^x(v) = \Phi_v(x)$. For every $x \in G_A$, $\Phi^x$ is a shift-invariant Borel probability measure. The map $\Phi : G_A \to M_A$ is continuous with respect to the Besicovitch and weak topologies. In fact we have
$$d_M(\Phi^x, \Phi^y) \le \sum_{u \in A^+} d_B(x,y) \cdot |u| \cdot |A|^{-2|u|} = d_B(x,y) \sum_{n > 0} n\,|A|^{-n} = d_B(x,y) \cdot |A|/(|A|-1)^2 \, .$$
By a theorem of Kamae [9], $\Phi$ is surjective: every shift-invariant Borel probability measure has a generic point. It follows from the ergodic theorem that if $\mu$ is a $\sigma$-invariant measure, then $\mu(G_A) = 1$ and for every $v \in A^*$, the measure of $v$ is the integral of its density $\Phi_v$:
$$\mu(v) = \int \Phi_v(x) \, d\mu \, .$$
If $F$ is a CA, we have a commutative diagram $\Phi \circ F_G = F_M \circ \Phi$:

             F_G
    G_A ----------> G_A
     |               |
  Phi|               |Phi
     v               v
    M_A ----------> M_A
             F_M
Theorem 23 Let $F$ be a CA over $A$.
(1) $(C_A, F)$ is surjective iff $(M_A, F_M)$ is surjective.
(2) If $(G_A, F_G)$ has a dense set of periodic points, then $(M_A, F_M)$ has a dense set of periodic points.
(3) If $A(F) \neq \emptyset$, then $(M_A, F_M)$ is almost equicontinuous.
(4) If $E(F) \neq \emptyset$, then $(M_A, F_M)$ is equicontinuous.

Proof (1) See Kůrka [13] for a proof. (2) This holds since $(M_A, F_M)$ is a factor of $(G_A, F_G)$. (3) It suffices to prove the claim for the case that $F$ is almost equicontinuous. In this case there exists a blocking word $u \in A^+$, and the measure $\delta_u$ defined by
$$\delta_u(v) = \begin{cases} 1/|u| & \text{if } v \sqsubseteq u \\ 0 & \text{if } v \not\sqsubseteq u \end{cases}$$
is equicontinuous for $(M_A, F_M)$. (4) If $(C_A, F)$ is equicontinuous, then all sufficiently long words are blocking and there exists $d > 0$ such that for all $n > 0$ and for all $x, y \in A^{\mathbb{Z}}$ such that $x_{[-n-d, n+d]} = y_{[-n-d, n+d]}$ we have $F^k(x)_{[-n,n]} = F^k(y)_{[-n,n]}$ for all $k > 0$. Thus there are maps $g_k : A^* \to A^*$ such that $|g_k(u)| = \max\{|u| - 2d, 0\}$ and for every $x \in A^{\mathbb{Z}}$ we have $F^k(x)_{[-n,n]} = f^k(x_{[-n-kd, n+kd]}) = g_k(x_{[-n-d, n+d]})$, where $f$ is the local rule of $F$. We get
$$d_M\big(F_M^k(\mu), F_M^k(\nu)\big) = \sum_{n=1}^{\infty} \sum_{u \in A^n} \Big| \sum_{v \in f^{-k}(u)} (\mu(v) - \nu(v)) \Big| \, |A|^{-2n} = \sum_{n=1}^{\infty} \sum_{u \in A^n} \Big| \sum_{v \in g_k^{-1}(u)} (\mu(v) - \nu(v)) \Big| \, |A|^{-2n}$$
$$\le \sum_{n=1}^{\infty} \sum_{v \in A^{n+2d}} |\mu(v) - \nu(v)| \, |A|^{-2n} \le |A|^{4d} \, d_M(\mu, \nu) \, .$$
The Weyl Space

Define the following equivalence relation on $A^{\mathbb{Z}}$: $x \sim_W y$ iff $d_W(x,y) = 0$. Denote by $W_A$ the set of equivalence classes of $\sim_W$ and by $\pi_W : A^{\mathbb{Z}} \to W_A$ the projection. The factor of $d_W$ is a metric on $W_A$. This is the Weyl space on alphabet $A$. Using prefix codes, it can be shown that every two Weyl spaces (with different alphabets) are homeomorphic. The Toeplitz space is not dense in the Weyl space (see Blanchard, Cervelle and Formenti [3]).

Theorem 24 (Blanchard, Formenti and Kůrka [2]) The Weyl space is pathwise connected, infinite-dimensional and
homogeneous. It is neither separable nor locally compact. It is not complete.

Every cellular automaton $F : A^{\mathbb{Z}} \to A^{\mathbb{Z}}$ is continuous with respect to $d_W$, so it preserves the equivalence $\sim_W$: if $d_W(x,y) = 0$, then $d_W(F(x), F(y)) = 0$. Thus a cellular automaton $F$ defines a continuous map $F_W : W_A \to W_A$. The shift map $\sigma : W_A \to W_A$ is again an isometry, so in $W_A$ many topological properties are preserved if $F$ is composed with a power of the shift. This is true for example for equicontinuity, almost equicontinuity and sensitivity. If $\pi : W_A \to B_A$ is the (continuous) projection and $F$ a CA, then the following diagram commutes.

             F_W
    W_A ----------> W_A
     |               |
   pi|               |pi
     v               v
    B_A ----------> B_A
             F_B
Theorem 25 (Blanchard, Formenti and Kůrka [2]) Let $F$ be a CA on $A$.
(1) $(C_A, F)$ is surjective iff $(W_A, F_W)$ is surjective.
(2) If $A(F) \neq \emptyset$, then $(W_A, F_W)$ is almost equicontinuous.
(3) If $E(F) \neq \emptyset$, then $(W_A, F_W)$ is equicontinuous.
(4) If $(C_A, F)$ is chain-transitive, then $(W_A, F_W)$ is chain-transitive.
Theorem 26 (Blanchard, Cervelle and Formenti [3]) No CA $(W_A, F_W)$ is transitive.

Theorem 27 Let $\Sigma$ be a subshift attractor of finite type for $F$ (in the Cantor space). Then there exists $\delta > 0$ such that for every $x \in W_A$ satisfying $d_W(x, \Sigma) < \delta$, $F^n(x) \in \Sigma$ for some $n > 0$.

Thus a subshift attractor of finite type is a $W$-attractor. Example 2 shows that it need not be a $B$-attractor. Example 3 shows that the assertion need not hold if $\Sigma$ is not of finite type.

Proof Let $U \subseteq A^{\mathbb{Z}}$ be a $C$-clopen set such that $\Sigma = \Omega_F(U)$. Let $U$ be a union of cylinders of words of length $q$. Set $\widetilde{\Omega}(U) = \bigcap_{n \in \mathbb{Z}} \sigma^n(U)$. By a generalization of a theorem of Hurd [7] (see Topological Dynamics of Cellular Automata), there exists $m > 0$ such that $\Sigma = F^m(\widetilde{\Omega}(U))$. If $d_W(x, \Sigma) < 1/q$, then there exists $l > 0$ such that for every $k \in \mathbb{Z}$ there exists a nonnegative $j < l$ such that $\sigma^{k+j}(x) \in U$. It follows that there exists $n > 0$ such that $F^n(x) \in \widetilde{\Omega}(U)$ and therefore $F^{n+m}(x) \in \Sigma$.

Dynamics of Cellular Automata in Non-compact Spaces, Figure 1 The product ECA128

Examples

Example 1 The identity rule $\mathrm{Id}(x) = x$. $(B_A, \mathrm{Id}_B)$ and $(W_A, \mathrm{Id}_W)$ are chain-transitive (since both $B_A$ and $W_A$ are connected). However, $(C_A, \mathrm{Id})$ is not chain-transitive. Thus the converse of Theorem 20(6) and of Theorem 25(4) does not hold.

Example 2 The product rule ECA128: $F(x)_i = x_{i-1} x_i x_{i+1}$. $(C_A, F)$, $(B_A, F_B)$ and $(W_A, F_W)$ are almost equicontinuous, and the configuration $0^\infty$ is equicontinuous in all these versions. By Theorem 27, $\{0^\infty\}$ is a $W$-attractor. However, contrary to the mistaken Proposition 9 in [2], $\{0^\infty\}$ is not a $B$-attractor. For a given $0 < \varepsilon < 1$ define $x \in A^{\mathbb{Z}}$ by $x_i = 1$ iff $3^n(1-\varepsilon) < |i| \le 3^n$ for some $n \ge 0$. Then $d_B(x, 0^\infty) = \varepsilon$ but $x$ is a fixed point of $F_B$, since $d_B(F(x), x) = \lim_{n \to \infty} 2n/3^n = 0$ (see Fig. 1).

Example 3 The traffic rule ECA184: $F(x)_i = 1$ iff $x_{[i-1,i]} = 10$ or $x_{[i,i+1]} = 11$. No $\sigma^p F^q$ is $C$-almost equicontinuous, so $A(F) = \emptyset$. However, if $d_W(x, 0^\infty) < \delta$, then $d_W(F^n(x), 0^\infty) < \delta$ for every $n > 0$, since $F$ conserves the number of letters 1 in a configuration. Thus $0^\infty$ is a point of equicontinuity in $(T_A, F_T)$, $(B_A, F_B)$, and $(W_A, F_W)$. This shows that item (2) of Theorems 17, 20 and 25 cannot be converted. The maximal $C$-attractor $\Omega_F = \{x \in A^{\mathbb{Z}} : \forall n > 0,\ 1(10)^n 0 \not\sqsubseteq x\}$ is not an SFT. We show that it does not $W$-attract points from any of its neighborhoods. For a given even integer $q > 2$ define $x \in A^{\mathbb{Z}}$ by $x_i = 0$ if $i = qn + 1$ for some $n \ge 0$ (see Fig. 2, where $q = 8$).

Example 4 The sum rule ECA90: $F(x)_i = (x_{i-1} + x_{i+1}) \bmod 2$.
Dynamics of Cellular Automata in Non-compact Spaces, Figure 2 The traffic ECA184
Dynamics of Cellular Automata in Non-compact Spaces, Figure 3 The sum ECA90
Both $(B_A, F_B)$ and $(W_A, F_W)$ are sensitive (Cattaneo et al. [4]). For a given $n > 0$ define a configuration $z$ by $z_i = 1$ iff $i = k2^n$ for some $k \in \mathbb{Z}$. Then $F^{2^{n-1}}(z) = (01)^\infty$. For any $x \in A^{\mathbb{Z}}$, we have $d_W(x, x+z) = 2^{-n}$ but $d_W(F^{2^{n-1}}(x), F^{2^{n-1}}(x+z)) = 1/2$. The same argument works for $(B_A, F_B)$.
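Two properties invoked in Examples 3 and 4 — that ECA184 conserves the number of 1s, and that ECA90 is additive, so that $F(x \oplus z) = F(x) \oplus F(z)$ — can be checked numerically on cyclic configurations. A small illustrative sketch:

```python
# One step of an elementary CA on a cyclic configuration (list of 0/1 cells).
import random

def eca_step(rule, x):
    n = len(x)
    return [(rule >> (4 * x[i - 1] + 2 * x[i] + x[(i + 1) % n])) & 1
            for i in range(n)]

random.seed(0)
x = [random.randint(0, 1) for _ in range(64)]

# ECA184 conserves the number of 1s (cars) -- the property used in Example 3.
y = x
for _ in range(10):
    y = eca_step(184, y)
assert sum(y) == sum(x)

# ECA90 is additive: F(x XOR z) = F(x) XOR F(z) -- the property behind Example 4.
z = [random.randint(0, 1) for _ in range(64)]
xz = [a ^ b for a, b in zip(x, z)]
lhs = eca_step(90, xz)
rhs = [a ^ b for a, b in zip(eca_step(90, x), eca_step(90, z))]
assert lhs == rhs
print("checks passed")
```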
Example 5 The shift rule ECA170: $F(x)_i = x_{i+1}$. Since the system has fixed points $0^\infty$ and $1^\infty$, it has uncountably many periodic points by Theorem 21(2). However, the periodic points are not dense in $B_A$ ([3]).

Future Directions

One of the promising research directions is the connection between the generic space and the space of Borel probability measures, which is based on the factor map $\Phi$. In particular, Lyapunov functions based on particle weight functions (see Kůrka [12]) work both for the measure space $M_A$ and the generic space $G_A$. The potential of Lyapunov functions for the classification of attractors has not yet been fully explored. This holds also for the connections between attractors in different topologies. While the theory of attractors is well established in compact spaces, in non-compact spaces there are several possible approaches. Finally, the comparison of entropy properties of CA in different topologies may be revealing for the classification of CA. There is an even more general approach to different topologies for CA based on the concept of a submeasure on $\mathbb{Z}$. Since each submeasure defines a pseudometric, it would be interesting to know whether CA are continuous with respect to any of these pseudometrics, and whether some dynamical properties of CA can be derived from the properties of the defining submeasures.

Acknowledgments

We thank Marcus Pivato and François Blanchard for careful reading of the paper and many valuable suggestions. The research was partially supported by the Research Program Project “Sycomore” (ANR-05-BLAN-0374).
Bibliography

Primary Literature
1. Besicovitch AS (1954) Almost periodic functions. Dover, New York
2. Blanchard F, Formenti E, Kůrka P (1999) Cellular automata in the Cantor, Besicovitch and Weyl spaces. Complex Syst 11(2):107–123
3. Blanchard F, Cervelle J, Formenti E (2005) Some results about the chaotic behaviour of cellular automata. Theor Comput Sci 349(3):318–336
4. Cattaneo G, Formenti E, Margara L, Mazoyer J (1997) A shift-invariant metric on $S^{\mathbb{Z}}$ inducing a non-trivial topology. Lecture Notes in Computer Science, vol 1295. Springer, Berlin
5. Formenti E, Kůrka P (2007) Subshift attractors of cellular automata. Nonlinearity 20:105–117
6. Hedlund GA (1969) Endomorphisms and automorphisms of the shift dynamical system. Math Syst Theory 3:320–375
7. Hurd LP (1990) Recursive cellular automata invariant sets. Complex Syst 4:119–129
8. Iwanik A (1988) Weyl almost periodic points in topological dynamics. Colloquium Mathematicum 56:107–119
9. Kamae T (1973) Subsequences of normal sequences. Isr J Math 16(2):121–149
10. Knudsen C (1994) Chaos without nonperiodicity. Am Math Mon 101:563–565
11. Kůrka P (1997) Languages, equicontinuity and attractors in cellular automata. Ergod Theory Dyn Syst 17:417–433
12. Kůrka P (2003) Cellular automata with vanishing particles. Fundamenta Informaticae 58:1–19
13. Kůrka P (2005) On the measure attractor of a cellular automaton. Discret Continuous Dyn Syst 2005(suppl):524–535
14. Marcinkiewicz J (1939) Une remarque sur les espaces de A.S. Besicovitch. C R Acad Sci Paris 208:157–159
15. Sablik M (2006) Étude de l'action conjointe d'un automate cellulaire et du décalage: une approche topologique et ergodique. PhD thesis, Université de la Méditerranée
Books and Reviews
Besicovitch AS (1954) Almost periodic functions. Dover, New York
Kitchens BP (1998) Symbolic dynamics. Springer, Berlin
Kůrka P (2003) Topological and symbolic dynamics. Cours spécialisés, vol 11. Société Mathématique de France, Paris
Lind D, Marcus B (1995) An introduction to symbolic dynamics and coding. Cambridge University Press, Cambridge
Embodied and Situated Agents, Adaptive Behavior in
STEFANO NOLFI
Institute of Cognitive Sciences and Technologies, National Research Council (CNR), Rome, Italy

Article Outline

Glossary
Definition of the Subject
Introduction
Embodiment and Situatedness
Behavior and Cognition as Complex Adaptive Systems
Adaptive Methods
Evolutionary Robotics Methods
Discussion and Conclusion
Bibliography

Glossary

Phylogenesis Indicates the variations of the genetic characteristics of a population of artificial agents throughout generations.

Ontogenesis Indicates the variations which occur in the phenotypical characteristics of an artificial agent (i.e. in the characteristics of the control system or of the body of the agent) while it interacts with the environment.

Embodied agent Indicates an artificial system (simulated or physical) which has a body (characterized by physical properties such as shape, dimension, weight, etc.), actuators (e.g. motorized wheels, motorized articulated joints), and sensors (e.g. touch sensors or vision sensors). For a more restricted definition see the concluding section of the paper.

Situated agent Indicates an artificial system which is located in a physical environment (simulated or real) with which it interacts on the basis of the laws of physics. For a more restricted definition see the concluding section of the paper.

Morphological computation Indicates the ability of the body of an agent (with certain specific characteristics) to control its interaction with the environment so as to produce a given desired behavior.

Definition of the Subject

Adaptive behavior concerns the study of how organisms develop their behavioral and cognitive skills through a synthetic methodology which consists in designing artificial agents which are able to adapt to their environment autonomously. These studies are important both
from a modeling point of view (i.e. for making progress in our understanding of intelligence and adaptation in natural beings) and from an engineering point of view (i.e. for making progress in our ability to develop artefacts displaying effective behavioral and cognitive skills).

Introduction

Adaptive behavior research concerns the study of how organisms can develop behavioral and cognitive skills by adapting to the environment and to the task they have to fulfill autonomously (i.e. without human intervention). This goal is achieved through a synthetic methodology, i.e. through the synthesis of artificial creatures which: (i) have a body, (ii) are situated in an environment with which they interact, and (iii) have characteristics which vary during an adaptation process. In the rest of the paper we will use the term "agent" to indicate artificial creatures which possess the first two features described above and the term "adaptive agent" to indicate artificial creatures which also possess the third feature. The agents and the environment might be simulated or real. In the former case the characteristics of the agents' bodies, motor and sensory systems, the characteristics of the environment, and the rules that regulate the interactions between all the elements are simulated on a computer. In the latter case, the agents consist of physical entities (mobile robots) situated in a physical environment with which they interact on the basis of the physical laws. The adaptive process which regulates how the characteristics of the agents (and possibly of the environment) change might consist of a population-based evolutionary process and/or of a developmental/learning process. In the former case, the characteristics of the agents do not vary during their "lifetime" (i.e. during the time in which the agents interact with the environment) but phylogenetically, while individual agents "reproduce".
In the latter case, the characteristics of the agents vary ontogenetically, while they interact with the environment. The criteria which determine how variations are generated and/or whether or not variations are retained can be task-dependent and/or task-independent, i.e. they might be based on an evaluation of whether the variation increases or decreases the agents' ability to display a behavior which is adapted to the task/environment, or might be based on task-independent criteria (i.e. general criteria which do not reward directly the exhibition of the requested skill). The paper is organized as follows. In Sect. "Embodiment and Situatedness" we briefly introduce the notions of embodiment and situatedness and their implications. In Sect. "Behavior and Cognition as Complex Adaptive Systems" we claim that behavior and cognition in embodied and situated adaptive agents should be characterized as a complex adaptive system. In Sect. "Adaptive Methods" we briefly describe the methods which can be used to synthesize embodied and situated adaptive agents. Finally, in Sect. "Discussion and Conclusion", we draw our conclusions.

Embodiment and Situatedness

The notions of embodiment and situatedness have been introduced [8,9,12,34,48] to characterize systems (e.g. natural organisms and robots) which have a physical body and which are situated in a physical environment with which they interact. In this and in the following sections we will briefly discuss the general implications of these two fundamental properties. This analysis will be further extended in the concluding section, where we will argue for the necessity of distinguishing between a weak and a strong notion of embodiment and situatedness.

A first important implication of being embodied and situated is that these agents and their parts are characterized by their physical properties (e.g. weight, dimension, shape, elasticity etc.), are subject to the laws of physics (e.g. inertia, friction, gravity, energy consumption, deterioration etc.), and interact with the environment through the exchange of energy and physical material (e.g. forces, sound waves, light waves etc.). Their physical nature also implies that they are quantitative in state and time [49]. The fact that these agents are quantitative in state implies, for example, that the joints which connect the parts of a robotic arm can assume any possible position within a given range. The fact that these agents are quantitative in time implies, for example, that the effects of the application of a force to a joint depend on the time duration of its application.
A second important implication is that the information measured by the sensors is not only a function of the environment but also of the relative position of the agent in the environment. This implies that the motor actions performed by an agent, by modifying the agent/environmental relation or the environment, co-determine the agent's sensory experiences. A third important implication is that the information measured by the sensors about the external environment is egocentric (it depends on the current position and orientation of the agent in the environment), local (it only provides information related to the locally observable portion of the environment), incomplete (due to visual occlusion, for example), and subject to noise. Similar characteristics apply to the motor actions produced by the agent's effectors.
It is important to notice that these characteristics do not only represent constraints but also opportunities to be exploited. Indeed, as we will see in the next section, the exploitation of some of these characteristics might allow embodied and situated agents to solve their adaptive problems through solutions which are robust and parsimonious (i.e. minimal) with respect to the complexity of the agent's body and control system.

Behavior and Cognition as Complex Adaptive Systems

In embodied and situated agents, behavioral and cognitive skills are dynamical properties which unfold in time and which arise from the interaction between the agents' nervous system, body, and environment [3,11,19,29,31], and from the interaction between the dynamical processes occurring within the agents' control system, the agents' body, and the environment [4,15,45]. Moreover, behavioral and cognitive skills typically display a multi-level and multi-scale organization involving bottom-up and top-down influences between entities at different levels of organization. These properties imply that behavioral and cognitive skills in embodied and situated agents can properly be characterized as complex adaptive systems [29]. These aspects and the complex-system nature of behavior and cognition will be illustrated in more detail in the next subsections, also with the help of examples. The theoretical and practical implications of these aspects for developing artificial agents able to exhibit effective behavioral and cognitive skills will be discussed in the forthcoming sections.

Behavior and Cognition as Emergent Dynamical Properties

Behavior and cognition are dynamical properties which unfold in time and which emerge from high-frequency nonlinear interactions between the agent's control system, its body, and the external environment [11].
At any time step, the environment and the agent/environmental relation co-determine the sensory stimulation and the motor reaction of the agent which, in turn, co-determines how the environment and/or the agent/environmental relation vary. Sequences of these interactions, occurring at a fast time rate, lead to a dynamical process – behavior – which extends over a significantly larger time span than the individual interactions (Fig. 1). Since the interactions between the agent's control system, the agent's body, and the external environment are nonlinear (i.e. small variations in sensory states might lead to significantly different motor actions) and dynamical (i.e.
Embodied and Situated Agents, Adaptive Behavior in, Figure 2 A schematization of the passive walking machine developed by McGeer [24]. The machine includes two passive knee joints and a passive hip joint
Embodied and Situated Agents, Adaptive Behavior in, Figure 1 A schematic representation of the relation between the agent's control system, the agent's body, and the environment. The behavioral and cognitive skills displayed by the agent are the emergent result of the bi-directional interactions (represented with full arrows) between the three constituting elements – agent's control system, agent's body, and environment. The dotted arrows indicate that the three constituting elements might be dynamical systems on their own. In this case, the agent's behavioral and cognitive skills result from the dynamics originating from the agent/body/environmental interactions, but also from the combination of, and the interaction between, dynamical processes occurring within the agent's body, within the agent's control system, and within the environment (see Sect. "Embodiment and Situatedness")
small variations in the action performed at time t might significantly impact later interactions at time t + x), the relation between the rules that govern the interactions and the behavioral and cognitive skills originating from the interactions tends to be very indirect. Behavioral and cognitive skills thus emerge from the interactions between the three foundational elements and cannot be traced back to any of the three elements taken in isolation. Indeed, the behavior displayed by an embodied and situated agent can hardly be predicted or inferred by an external observer, even on the basis of a complete knowledge of the interacting elements and of the rules governing the interactions. A clear example of how a behavioral skill might emerge from the interaction between the agent's body and the environment is constituted by the passive walking machines developed in simulation by McGeer [24] – a two-dimensional bipedal machine able to walk down a four-degree
slope with no motors and no control system (Fig. 2). The walking behavior arises from the fact that the physical forces resulting from gravity and from the collision between the machine and the slope produce a movement of the robot and the fact that robot’s movements produce a variation of the agent-environmental relation which in turn produce a modification of the physical forces to which the machine will be subjected in the next time step. The sequence of by-directional effects between the robot’s body and the environment can lead to a stable dynamical process – the walking behavior. The type of behavior which arises from the robot/ environmental interaction depends from the characteristics of the environment, the physics law which regulate the interaction between the body and the environment, and the characteristics of the body. The first two factors can be considered as fixed but the third factor, the body structure, can be adapted to achieve a given function. Indeed, in the case of this biped robot, the author carefully selected the leg length, the leg mass, and the foot size to obtain the desired walking behavior. In more general term, this example shows how the role of regulating the interaction between the robot and the environment in the appropriate way can be played not only but the control system but also from the body itself providing that the characteristics of the body has been shaped so to favor the exhibition of the desired behavior. This property, i. e. the ability of the body to control its interaction with the environment, has been named with the term “morphological computation” [35]. For related work which demonstrate how effective walking machines can be obtained by integrating passive walking techniques with simple control mecha-
927
928
Embodied and Situated Agents, Adaptive Behavior in
nisms, see [6,13,50]. For related works which show the role of elastic material and elastic actuators for morphological computing see [23,40]. To illustrate of how behavioral and cognitive skills might emerge from agent’s body, agent’s control system, and environmental interactions we describe a simple experiment in which a small wheeled robot situated in an arena surrounded by walls has been evolved to find and to remain close to a cylindrical object. The Khepera robot [26] is provided with eight infrared sensors and two motors controlling the two corresponding wheels (Fig. 3). From the point of view of an external observer, solving this problem requires robots able to: (a) explore the environment until an obstacle is detected, (b) discriminate whether the obstacle detected is a wall or a cylindrical object, and (c) approach or avoid objects depending on the object type. Some of these behaviors (e. g. the wall-avoidance behavior) can be obtained through simple control mechanisms but others require non trivial control mechanisms. Indeed, a detailed analysis of the sensory patterns experienced by the robot indicated that the task of discriminating the two objects is far from trivial since the two classes of sensory patterns experienced by robots close to a wall and close to cylindrical objects largely overlap. The attempt to solve this problem through an evolutionary adaptive method (see Sect. “Behavior and Cognition as Complex Adaptive Systems”) in which the free parameters (i. e. the parameters which regulate the fine-
Embodied and Situated Agents, Adaptive Behavior in, Figure 3 Left: The agent situated in the environment. The agent is a Khepera robot [26]. The environment consists of an arena of 60 × 35 cm containing a cylindrical object placed at a randomly selected location. Right: Angular trajectories of an evolved robot close to a wall (top graph) and to a cylinder (bottom graph). The picture was obtained by placing the robot at a random position in the environment, leaving it free to move for 500 time steps each lasting 100 ms, and recording its relative movements with respect to the two types of objects for distances smaller than 45 mm. The x-axis and the y-axis indicate the relative angle (in degrees) and distance (in mm) between the robot and the corresponding object. For the sake of clarity, arrows are used to indicate the relative direction, but not the amplitude, of movements
grained interaction between the robot and the environment) are varied randomly, and in which variations are retained or discarded on the basis of an evaluation of the overall ability of the robot (i. e. on the basis of the time spent by the robot close to the cylindrical object), demonstrated how adaptive robots can find solutions which are robust and parsimonious in terms of control mechanisms [28]. Indeed, in all replications of this experiment, evolved robots solve the problem by moving forward, by avoiding walls, and by oscillating back and forth and left and right close to cylindrical objects (Fig. 3, right). All these behaviors result from sequences of interactions between the robot and the environment mediated by four types of simple control rules: turning left when the right infrared sensors are activated, turning right when the left infrared sensors are activated, moving back when the frontal infrared sensors are activated, and moving forward when the frontal infrared sensors are not activated. To understand how these simple control rules can produce the required behaviors and the required arbitration between behaviors, we should consider that the same motor responses produce different effects in different agent/environment situations. For example, the execution of a left-turning action close to a cylindrical object and the subsequent modification of the robot/object relative position produce a new sensory state which triggers a right-turning action. Then, the execution of the latter action and the subsequent modification of the robot/object relative position produce a new sensory state which triggers a left-turning action. The combination and alternation of these left- and right-turning actions over time produce an attractor in the agent/environment dynamics (Fig. 3, right, bottom graph) which allows the robot to remain close to the cylindrical object.
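As an illustration, the four control rules just described can be sketched as a simple sensor-to-motor mapping. This is a hypothetical sketch: the threshold and the speed values below are invented for illustration, and the actual evolved controller is a neural network whose parameters are not reported here.

```python
# Hypothetical sketch of the four control rules described in the text;
# threshold and wheel-speed values are illustrative, not evolved values.

def motor_command(ir_left, ir_front, ir_right, threshold=0.5):
    """Map infrared readings (each in [0, 1]) to (left, right) wheel speeds."""
    if ir_front > threshold:
        return (-1.0, -1.0)   # frontal sensors activated: move back
    if ir_right > threshold:
        return (-0.5, 1.0)    # right sensors activated: turn left
    if ir_left > threshold:
        return (1.0, -0.5)    # left sensors activated: turn right
    return (1.0, 1.0)         # no frontal activation: move forward
```

Close to a cylinder, the alternation of left- and right-turning commands produced by a mapping of this kind yields the back-and-forth oscillation described above.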
On the other hand, the execution of a left-turning behavior close to a wall and the subsequent modification of the robot/wall relative position produce a new sensory state which triggers the reiteration of the same motor action. The execution of a sequence of left-turning actions then leads to the avoidance of the object and to a modification of the robot/environment relation, which finally leads to the perception of a sensory state which triggers a move-forward behavior (Fig. 3, right, top graph). Before concluding the description of this experiment, it is important to notice that, although the rough classification of the robot's motor responses into four different types of actions is useful to describe qualitatively the strategy with which these robots solve the problem, the quantitative aspects which characterize the robot's motor reactions (e. g. how sharply a robot turns given a certain pattern of activation of the infrared sensors) are crucial for determin-
Embodied and Situated Agents, Adaptive Behavior in, Figure 4 Left: The e-puck robot developed at EPFL, Switzerland (http://www.e-puck.org/). Center: The environment, which has a size of 52 × 60 cm. The light produced by the light bulb located on the left side of the central corridor cannot be perceived from the other two corridors. Right: The motor trajectory produced by the robot during a complete lap of the environment
ing whether the robot will be able to solve the problem or not. Indeed, small differences in the robot's motor responses tend to cumulate over time and might prevent the robot from producing successful behavior (e. g. might prevent the robot from producing a behavioral attractor close to cylindrical objects). This experiment clearly exemplifies some important aspects which characterize all adaptive behavioral systems, i. e. systems which are embodied and situated and which have been designed or adapted so as to exploit the properties that emerge from the interaction between their control system, their body, and the external environment. In particular, it demonstrates how required behavioral and cognitive skills (i. e. object categorization skills) might emerge from the fine-grained interaction between the robot's control system, body, and the external environment without the need for dedicated control mechanisms. Moreover, it demonstrates how the relation between the control rules which mediate the interaction between the robot's body and the environment and the behavioral skills exhibited by the agent is rather indirect. This means, for example, that an external human observer can hardly predict the behaviors which will be produced by the robot before observing the robot interacting with the environment, even on the basis of a complete description of the characteristics of the body, of the control rules, and of the environment. Behavior and Cognition as Phenomena Originating from the Interaction Between Coupled Dynamical Processes Up to this point we have restricted our analysis to the dynamics originating from the interactions between the agent's control system, the agent's body, and the environment. However, the body of an agent, its control system, and the environment
might have their own dynamics (dotted arrows in Fig. 1). For the sake of clarity, we will refer to the dynamical processes occurring within the agent's control system, within the agent's body, or within the environment as internal dynamics, and to the dynamics originating from the agent/body/environment interaction as external dynamics. In cases in which the agent's body, the agent's control system, or the environment have their own dynamics, behavior should be characterized as a property emerging from the combination of several coupled dynamical processes. The existence of several concurrent dynamical processes represents an important opportunity to exploit emergent features. Indeed, behavioral and cognitive skills might emerge not only from the external dynamics, as we showed in the previous section, but also from the internal dynamical processes or from the interaction between different dynamical processes. As an example which illustrates how complex cognitive skills can emerge from the interaction between a simple agent/body/environment dynamics and a simple agent internal dynamics, consider the case of a wheeled robot placed in a maze environment (Fig. 4) which has been trained to: (a) produce a wall-following behavior which allows the robot to periodically visit and re-visit all environmental areas, (b) identify a target object constituted by a black disk which is placed at a randomly selected position of the environment for a limited time, and (c) recognize the location in which the target object was previously found every time the robot re-visits the corresponding location [15]. The robot has infrared sensors (which provide information about nearby obstacles), light sensors (which provide information about the light gradient generated by the light bulb placed in the central corridor), ground sensors (which detect the color of the ground), two motors
(which control the desired speed of the two corresponding wheels), and one additional output unit which should be turned on when the robot re-visits the environmental area in which the black disk was previously found. The robot's controller consists of a three-layer neural network which includes a layer of sensory neurons (which encode the state of the corresponding sensors), a layer of motor neurons (which encode the state of the actuators), and a layer of internal neurons which consist of leaky integrators operating at a tuneable time scale [3,15]. The free parameters of the robot's neural controller (i. e. the connection weights and the time constants of the internal neurons, which regulate the rate at which these neurons change their state over time) were adapted through an evolutionary technique [31]. By analyzing the evolved robots, the authors observed that they are able to generate a spatial representation of the environment and of their location in it while they are situated in the environment itself. Indeed, while the robot travels by performing different laps of the environment (see Fig. 4, right), the states of the two internal neurons converge on a periodic limit-cycle dynamics in which different states correspond to different locations of the robot in the environment (Fig. 5). As we mentioned above, the ability to generate this form of representation, which allows the robot to solve its adaptive problem, originates from the coupling between a simple internal dynamics and a simple robot/body/environment dynamics. The former dynamics is characterized by the fact that the state of the two internal neurons tends to move slowly toward different fixed-point attractors of the robot's internal dynamics, which correspond to the different types of sensory states exemplified in Fig. 5.
The latter dynamics originates from the fact that different types of sensory states last for different time durations and alternate in a given order while the robot moves in the environment. The interaction between these two dynamical processes leads to a transient dynamics of the agent's internal state which moves slowly toward the current fixed-point attractor without ever fully reaching it (thus preserving information about previously experienced sensory states, the time duration of these states, and the order in which they have been experienced). The coupling between the two dynamical processes originates from the fact that the free parameters which regulate the agent/environment dynamics (e. g. the trajectory and the speed with which the robot moves in the environment) and the agent's internal dynamics (e. g. the direction and the speed with which the internal neurons change their state) have been co-adapted and co-shaped during the adaptive process.
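The internal neurons described above are leaky integrators. A minimal sketch of a single such unit is given below; the input, time constant, and integration step are illustrative values, not the evolved parameters of the actual controller.

```python
# Sketch of one leaky-integrator neuron of the kind used for the
# internal units described in the text (illustrative values only).

def leaky_update(state, net_input, tau, dt=0.1):
    """One Euler integration step: the time constant tau sets how
    slowly the state drifts toward the fixed point defined by its
    current net input."""
    return state + (dt / tau) * (-state + net_input)

# With a large tau the state approaches its fixed point slowly, so it
# retains a transient trace of previously experienced sensory inputs.
s = 0.0
for _ in range(100):
    s = leaky_update(s, net_input=1.0, tau=5.0)
# s is now close to, but has not fully reached, the fixed point at 1.0
```

This slow drift is what lets the internal state encode not only the current sensory state but also the duration and order of the states experienced before it.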
Embodied and Situated Agents, Adaptive Behavior in, Figure 5 The state of the two internal neurons (i1 and i2) of the robot recorded for 330 s while the robot performs about 5 laps of the environment. The s, a, b, c, and d labels indicate the internal states corresponding to five different positions of the robot in the environment shown in Fig. 4. The other labels indicate the positions of the fixed-point attractors in the robot's internal dynamics corresponding to five types of sensory states experienced by the robot when it detects: a light on its frontal side (LF), a light on its rear side (LR), an obstacle on its frontal and right side (OFR), an obstacle on its right side (OR), and no obstacles and no lights (NO)
For related works which show how navigation and localization skills might emerge from the coupling between an agent's internal and external dynamics, see [45]. For works addressing other behavioral/cognitive capabilities, see [4] for what concerns categorization, [16,41] for what concerns selective attention, and [44] for what concerns language and compositionality.
Behavior and Cognition as Phenomena with a Multi-Level and Multi-Scale Organization Another fundamental feature that characterizes behavior is the fact that it is a multi-layer system with different levels of organization extending over different time scales [2,19]. More precisely, as exemplified in Fig. 6, the behavior of an agent or of a group of agents involves both lower- and higher-level behaviors which extend over shorter or longer time spans, respectively. Lower-level behaviors arise from a few agent/environment interactions and from short-term internal dynamical processes. Higher-level behaviors, instead, arise from the combination and interaction of lower-level behaviors and/or from long-term internal dynamical processes.
Embodied and Situated Agents, Adaptive Behavior in, Figure 6 A schematic representation of the multi-level and multi-scale organization of behavior. The behaviors represented in the inner circles are elementary behaviors which arise from fine-grained interactions between the control system, the body, and the environment, and which extend over limited time spans. The behaviors represented in the outer circles are higher-level behaviors which arise from the combination and interaction between lower-level behaviors and which extend over longer time spans. The arrows which go from higher-level behaviors toward lower levels indicate that the behaviors currently exhibited by the agents later affect the lower-level behaviors and/or the fine-grained interaction between the constituting elements (agent's control system, agent's body, and the environment)
The multi-level and multi-scale organization of agents' behavior plays important roles: it is one of the factors which allows agents to produce functionally useful behavior without necessarily developing dedicated control mechanisms [8,9,29]; it might favor the development of new behavioral and/or cognitive skills thanks to the recruitment of pre-existing capabilities [22]; and it allows agents to generalize their skills to new task/environmental conditions [29]. An exemplification of how the multi-level and multi-scale organization of behavior allows agents to generalize their skills to new environmental conditions is represented by the experiments carried out by Baldassarre et al. [2], in which the authors evolved the control system of a group of robots assembled into a linear structure (Fig. 7) for the ability to move in a coordinated manner and to display a coordinated light-approaching behavior. Each robot [27] consists of a mobile base (chassis) and a main body (turret) that can rotate with respect to the chassis along the vertical axis. The chassis has two drive mechanisms that control the two corresponding tracks and toothed wheels. The turret has one gripper, which allows robots to assemble together and to grasp objects, and a motor controlling the rotation of the turret with respect to the chassis. Robots are provided with a traction sensor, placed at the turret–chassis junction, that detects the intensity and the direction of the force that the turret exerts on the chassis (along the plane orthogonal to the vertical axis), and with light sensors. Given that the orientations of individual robots might vary and given that the target light might be out of sight, robots need to coordinate to choose a common direction of movement and to change their direction as soon as one or a few robots start to detect a light gradient. Evolved individuals show the ability to negotiate a common direction of movement and to approach
Embodied and Situated Agents, Adaptive Behavior in, Figure 7 Left: Four robots assembled into a linear structure. Right: A simulation of the robots shown in the left part of the figure
light targets as soon as a light gradient is detected. By testing evolved robots in different conditions the authors observed that they are able to generalize their skills to new conditions and also to spontaneously produce new behaviors which were not rewarded during the evolutionary process. More precisely, groups of assembled robots display a capacity to generalize their skills with respect to the number of robots which are assembled together and to the shape formed by the assembled robots. Moreover, when the evolved controllers are embodied in eight robots assembled so as to form a circular structure and situated in the maze environment shown in Fig. 8, the robots display an ability to collectively avoid obstacles, to rearrange their shape so as to pass through narrow passages, and to explore the environment. The ability to display all these behavioral skills allows the robots to reach the light target even in large maze environments, i. e. even in environmental conditions which are rather different from the conditions that they experienced during the training process (Fig. 8).
Embodied and Situated Agents, Adaptive Behavior in, Figure 8 The behavior produced by eight robots assembled into a circular structure in a maze environment including walls and cylindrical objects (represented with gray lines and circles). The robots start in the central portion of the maze and reach the light target located in the bottom-left side of the environment (represented with an empty circle) by exhibiting a combination of coordinated-movement, collective obstacle-avoidance, and collective light-approaching behaviors. The irregular lines, which represent the trajectories of the individual robots, show how the shape of the assembled robots changes during motion by adapting to the local structure of the environment
By analyzing the behavior displayed by the evolved robots tested in the maze environment, a complex multi-level organization can be observed. The simplest behaviors that can be identified are low-level individual behaviors which extend over short time spans:
1. A move-forward behavior, which consists of the individual's ability to move forward when the robot is coordinated with the rest of the team, is oriented toward the direction of the light gradient (if any), and does not collide with obstacles. This behavior results from the combination of: (a) a control rule which produces a move-forward action when the perceived traction has a low intensity and when the difference between the intensity of the light perceived on the left and the right side of the robot is low, and (b) the sensory effects of the execution of the selected move-forward action, mediated by the external environment, which does not produce a variation of the state of the sensors as long as the conditions that should be satisfied to produce this behavior hold. 2. A conformistic behavior, which consists of the individual's ability to conform its orientation with that of the rest of the team when the two orientations differ significantly. This behavior results from the combination of: (a) a control rule that makes the robot turn toward the direction of the traction when its intensity is significant, and (b) the sensory effects produced by the execution of this action, mediated by the external environment, which lead to a progressive reduction of the intensity of the traction until the orientation of the robot conforms with the orientation of the rest of the group. 3. A phototaxis behavior, which consists of the individual's ability to orient toward the direction of the light target. This behavior results from the combination of: (a) a control rule that makes the robot turn toward the direction in which the intensity of the light gradient is higher, and (b) the sensory effects produced by the execution of this action, mediated by the external environment, which lead to a progressive reduction of the difference in the light intensity detected on the two sides of the robot until the orientation of the robot conforms with the direction of the light gradient. 4. 
An obstacle-avoidance behavior, which consists of the individual's ability to change its direction of motion when the execution of a motor action produces a collision with an obstacle. This behavior results from the combination of: (a) the same control rule which leads to behavior #2 and which makes the robot turn toward the direction of the perceived traction (which in this case is caused by the collision with the obstacle, while in the case of behavior #2 it is caused by the forces exerted by the other assembled robots), and (b) the sensory effects produced by the execution of the turning action mediated by the
external environment, which make the robot turn until collisions no longer prevent the execution of the move-forward behavior. The combination and interaction of these four behaviors produce the following higher-level collective behaviors, which extend over a longer time span: 5. A coordinated-motion behavior, which consists of the ability of the robots to negotiate a common direction of movement and to keep moving along such a direction by compensating for further misalignments originating during motion. This behavior emerges from the combination and interaction of the conformistic behavior (which plays the main role when robots are misaligned) and the move-forward behavior (which plays the main role when robots are aligned). 6. A coordinated-light-approaching behavior, which consists of the ability of the robots to move in a coordinated manner toward a light target. This behavior emerges from the combination of the conformistic, move-forward, and phototaxis behaviors (the last of which is triggered when the robots detect a light gradient). The relative importance of the three control rules which lead to the three corresponding behaviors depends both on the strength of the corresponding triggering condition (i. e. the extent of the lack of traction forces, the intensity of traction forces, and the intensity of the light gradient, respectively) and on priority relations among behaviors (i. e. the fact that the conformistic behavior tends to play a stronger role than the phototaxis behavior). 7. A coordinated-obstacle-avoidance behavior, which consists of the ability of the robots to turn in a coordinated manner to avoid nearby obstacles. This behavior arises as the result of the combination of the obstacle-avoidance, conformistic, and move-forward behaviors. The combination and interaction of these behaviors lead to the following higher-level collective behaviors, which extend over still longer time spans: 8. 
A collective-exploration behavior, which consists of the ability of the robots to visit different areas of the environment when the light target cannot be detected. This behavior emerges from the combination of the coordinated-motion behavior and the coordinated-obstacle-avoidance behavior, which ensures that the assembled robots can move in the environment without getting stuck and without entering limit-cycle trajectories. 9. A shape-re-arrangement behavior, which consists of the ability of the assembled robots to dynamically adapt their shape to the current structure of the environment so as to pass through narrow passages, especially when the
passages to be negotiated are in the direction of the light gradient. This behavior emerges from the combination and interaction of the coordinated-motion and coordinated-light-approaching behaviors, mediated by the effects produced by relative differences in motion between robots resulting from the execution of different motor actions and/or from differences in collisions. The fact that the shape of the assembled robots adapts to the current environmental structure so as to facilitate the negotiation of narrow passages can be explained by considering that collisions produce a modification of the shape which affects in particular the relative positions of the colliding robots. The combination and interaction of all these behaviors lead to a still higher-level behavior: 10. A collective-navigation behavior, which consists of the ability of the assembled robots to navigate toward the light target by producing coordinated movements, exploring the environment, passing through narrow passages, and producing a coordinated-light-approaching behavior (Fig. 8). This analysis illustrates two important mechanisms which explain the remarkable generalization abilities of these robots. The first mechanism consists in the fact that the control rules which regulate the interaction between the agents and the environment so as to produce certain behavioral skills in certain environmental conditions will produce different but related behavioral skills in other environmental conditions. In particular, the control rules which generate behaviors #5 and #6, for which the robots were evolved in an environment without obstacles, also produce behavior #7 in an environment with obstacles. 
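The priority relations among the conformistic, phototaxis, and move-forward rules described in behaviors #5 and #6 above can be sketched as a simple arbitration scheme. The thresholds and signal ranges below are invented for illustration; the actual evolved controller blends these contributions continuously rather than switching between discrete rules.

```python
# Hedged sketch of the priority relations among the three individual
# control rules described in the text (illustrative values only).

def turn_signal(traction_dir, traction_mag, light_left, light_right):
    """Return a turning signal in [-1, 1] (negative = turn left)."""
    if traction_mag > 0.3:
        # Conformistic rule has priority: turn toward the traction
        # force exerted by the rest of the group (or by a collision).
        return max(-1.0, min(1.0, traction_dir))
    light_diff = light_right - light_left
    if abs(light_diff) > 0.1:
        # Phototaxis rule: turn toward the side with the stronger light.
        return max(-1.0, min(1.0, light_diff))
    return 0.0  # move-forward rule: no turning when aligned and no gradient
```

Because the conformistic contribution dominates whenever traction is significant, misalignments are corrected before the group resumes following the light gradient, which is what produces the coordinated-motion and coordinated-light-approaching behaviors.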
The second mechanism consists in the fact that the development of certain behaviors at a given level of organization, extending over a given time span, will automatically lead to the exhibition of related higher-level behaviors extending over longer time spans which originate from the interactions among the former behaviors (even if these higher-level behaviors have not been rewarded during the adaptation process). In particular, the combination and interaction of behaviors #5, #6, and #7 (which have been rewarded during the evolutionary process or which arise from the same control rules which lead to the generation of rewarded behaviors) automatically lead to the production of behaviors #8, #9, and #10 (which have not been rewarded). Obviously, there is no guarantee that the new behaviors obtained as a result of these generalization processes will play useful functions. However, the fact that these behaviors are related to the other functional behav-
ioral skills implies that the probability that these new behaviors will play useful functions is significant. In principle, these generalization mechanisms can also be exploited by agents during their adaptive process to generate behavioral skills which serve new functions and which emerge from the combination and interaction of pre-existing behavioral skills playing different functions. On the Top-Down Effect from Higher to Lower Levels of Organization In the previous sections we have discussed how the interactions between the agent's body, the agent's control system, and the environment lead to behavioral and cognitive skills and how such skills have a multi-level and multi-scale organization in which the interaction between lower-level skills leads to the emergence of higher-level skills. However, higher-level skills also affect lower-level skills, down to the fine-grained interaction between the constituting elements (agent's body, agent's control system, and environment). More precisely, the behaviors which originate from the interaction between the agent and the environment and from the interaction between lower-level behaviors later affect those lower-level behaviors and the interactions from which they originate. These bi-directional influences between different levels of organization can lead to circular causality [20], where high-level processes act as independent entities which constrain the lower-level processes from which they originate. One of the most important effects of this top-down influence consists in the fact that the behavior exhibited by an agent constrains the type of sensory patterns that the agent will experience later on (i. e. constrains the fine-grained agent/environment interactions which determine the behavior that will later be exhibited by the agent). 
Since the complexity of the problem faced by an agent depends on the sensory information experienced by the agent itself, these top-down influences can be exploited in order to turn hard problems into simple ones. One neat demonstration of this type of phenomenon is given by the experiments conducted by Marocco and Nolfi [32], in which a simulated finger robot with six degrees of freedom, provided with sensors of its joint positions and with rough touch sensors, is asked to discriminate between cubic and spherical objects varying in size. The problem is not trivial since, in general terms, the sensory patterns experienced by the robot do not provide clear regularities for discriminating between the two types of objects. However, the type of sensory states which are experienced by the agent also depends on the behavior previously exhib-
ited by the agent itself – agents exhibiting different behaviors might face simpler or harder problems. By evolving the robots in simulation for the ability to solve this problem and by analyzing the complexity of the problem faced by robots of successive generations, the authors observed that the evolved robots manage to solve their adaptive problem on the basis of simple control rules which allow the robot to approach the object and to move following the surface of the object from left to right, independently of the object's shape. The exhibition of this behavior in interaction with objects characterized by a smooth or irregular surface (in the case of spherical or cubic objects, respectively) ensures that the same control rules lead to two types of behaviors depending on the type of object. These behaviors consist in following the surface of the object and then moving away from it, in the case of spherical objects, and in following the surface of the object and getting stuck in a corner, in the case of cubic objects. The exhibition of these two behaviors allows the agent to experience rather different proprioceptive states as a consequence of having interacted with spherical or cubic objects, states which nicely encode the regularities that are necessary to differentiate the two types of objects. For other examples which show how adaptive agents can exploit the fact that behavioral and cognitive processes which arise from the interaction between lower-level behaviors or between the constituting elements later affect these lower-level processes, see [4,28,39]. Adaptive Methods In this section we briefly review the methods through which artificial embodied and situated agents can develop their skills autonomously while they interact at different levels of organization with the environment and possibly with other agents. These methods are inspired by the adaptive processes observed in nature: evolution, maturation, development, and learning. 
We will focus in particular on self-organized adaptive methodologies in which the role of the experimenter/designer is reduced to a minimum and in which the agents are free to develop their strategy to solve their adaptive problems among a large number of potential alternative solutions. This choice is motivated by the following considerations: (a) These methods allow agents to identify the behavioral and cognitive skills which should be possessed, combined, and integrated so as to solve the given problem. In other words, these methods can come up with effective ways of decomposing the overall required skill into a collection of simpler lower-level skills. Indeed,
as we showed in the previous section, evolutionary adaptive techniques can discover ways of decomposing the requested high-level skill into lower-level behavioral and cognitive skills, and so find solutions which are effective and parsimonious thanks to the exploitation of properties emerging from the interaction between lower-level processes and skills and thanks to the recruitment of previously developed skills for performing new functions. In other words, these methods release the designer from the burden of deciding how the overall skill should be divided into a set of simpler skills and how these skills should be integrated. More importantly, these methods can come up with solutions exploiting emergent properties which would be hard to design [17,31]. (b) These methods allow agents to identify how a given behavioral or cognitive skill can be produced, i. e. the appropriate fine-grained characteristics of the agent's body structure and of the control rules regulating the agent/environment interaction. As for the previous aspect, the advantage of using adaptive techniques lies not only in the fact that the experimenter is released from the burden of designing the fine-grained characteristics of the agents, but also in the fact that adaptation might prove more effective than human design due to the inability of an external observer to foresee the effects of the large number of non-linear interactions occurring at different levels of organization. (c) These methods allow agents to adapt to variations of the task, of the environment, and of the social conditions. Current approaches, in this respect, can be grouped into two families, which will be illustrated in the following subsections: Evolutionary Robotics methods and Developmental Robotics methods. 
Evolutionary Robotics Methods
Evolutionary Robotics [14,31] is a method which allows one to create embodied and situated agents able to adapt to their task/environment autonomously through an adaptive process inspired by natural evolution [18] and, possibly, through the combination of evolutionary, developmental, and learning processes. The basic idea goes as follows (Fig. 9). An initial population of different artificial genotypes, each encoding the control system (and possibly the morphology) of an agent, is randomly created. Each genotype is translated into a corresponding phenotype (i. e. a corresponding agent), which is then left free to act (move, look around, manipulate the environment, etc.) while its performance
Embodied and Situated Agents, Adaptive Behavior in, Figure 9
A schematic representation of the evolutionary process. The stripes with black and white squares represent individual genotypes. The rectangular boxes indicate the genomes of the population at a given generation. The small robots inside the square on the right of the figure represent a group of robots situated in an environment, which interact with the environment and with one another
(fitness) with respect to a given task is automatically evaluated. In cases in which this methodology is applied to collective behaviors, agents are evaluated in groups, which might be heterogeneous or homogeneous (i. e. composed of agents which do or do not differ, respectively, with respect to their genetic and phenotypic characteristics). The fittest individuals (those having the highest fitness) are allowed to reproduce by generating copies of their genotype with the addition of changes introduced by some genetic operators (e. g. mutations, exchange of genetic material). This process is repeated for a number of generations until an individual or a group of individuals is born which satisfies the performance level set by the user. The process that determines how a genotype (i. e. typically a string of binary values) is turned into a corresponding phenotype (i. e. a robot with a given morphology and control system) might consist of a simple one-to-one mapping or of a complex developmental process. In the former case, many of the characteristics of the phenotypical individual (e. g. the shape of the body, the number and position of the sensors and of the actuators, and in some cases the architecture of the neural controller) are pre-determined and fixed, and the genotype encodes a vector of
free parameters (e. g. the connection weights of the neural controller [31]). In the latter case, the genotype might encode a set of rules that determine how the body structure and the control system of the individual grow during an artificial developmental process. Through these types of indirect developmental mappings, most of the characteristics of the phenotypical robot can be encoded in the genotype and subjected to the evolutionary adaptive process [31,36]. Finally, in some cases the adaptation process might involve both an evolutionary process that regulates how the characteristics of the robots vary phylogenetically (i. e. throughout generations) and a developmental/learning process which regulates how the characteristics of the robots vary ontogenetically (i. e. during the phase in which the robots act in the environment) [30]. Evolutionary methods can be used to allow agents to develop the requested behavioral and cognitive skills from scratch (i. e. starting from agents which do not have any behavioral or cognitive capability) or in an incremental manner (i. e. starting from pre-evolved robots which already have some behavioral capability, for example the ability to solve a simplified version of the adaptive problem). The fitness function which determines whether an individual will reproduce or not might also include, in addition to a component that scores the performance of the agent on a given task, additional task-independent components. These additional components can lead to the development of behavioral skills which are not necessarily functional in themselves but which can favor the development of functional skills later on [37].
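The generational scheme described in this section (random initial population, automatic fitness evaluation, selective reproduction with mutation) can be sketched in a few lines. The sketch below is illustrative only: all names and parameter values are our assumptions, and `evaluate` is a stand-in for placing the agent in its environment and scoring its behavior.

```python
import random

GENOTYPE_LEN = 20
POP_SIZE = 100
ELITE = 20            # fittest individuals allowed to reproduce
MUTATION_STD = 0.1    # std of the Gaussian mutation operator

def evaluate(genotype):
    # Placeholder fitness: in a real experiment this would run the robot
    # in its environment and measure task performance. Here, the closer
    # the parameters are to zero, the higher the fitness.
    return -sum(g * g for g in genotype)

def mutate(genotype):
    # Reproduction = copying the genotype with small random changes.
    return [g + random.gauss(0.0, MUTATION_STD) for g in genotype]

def evolve(generations=50):
    # Randomly created initial population of genotypes.
    population = [[random.uniform(-1.0, 1.0) for _ in range(GENOTYPE_LEN)]
                  for _ in range(POP_SIZE)]
    for _ in range(generations):
        ranked = sorted(population, key=evaluate, reverse=True)
        elite = ranked[:ELITE]
        # Each new individual is a mutated copy of a fit parent.
        population = [mutate(random.choice(elite)) for _ in range(POP_SIZE)]
    return max(population, key=evaluate)

best = evolve()
```

In a real experiment the loop would stop when an individual (or group) reaches the performance level set by the user, rather than after a fixed number of generations.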
Evolutionary methods can be used either to develop low-level behavioral and cognitive skills previously identified by the designer/experimenter, which might later be combined and integrated in order to realize the high-level requested skill, or to develop the high-level requested skill directly. In the former case the adaptive process leads to the identification of the fine-grained features of the agent (e. g. number and type of sensors, body shape, architecture and connection weights of the neural controller) which, by interacting among themselves and with the environment, will produce the required skill. In the latter case, the adaptive process leads to the identification of the lower-level skills (at different levels of organization) which are necessary to produce the required high-level skill, of the way in which these lower-level skills should be combined and integrated, and (as in the former case) of the fine-grained features of the agent which, in interaction with the physical and social environment, will produce the required behavioral or cognitive skills.
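The simple one-to-one genotype-to-phenotype mapping discussed earlier can also be made concrete: a binary genotype is decoded into the free parameters (here, connection weights) of a fixed feed-forward controller. This is a minimal sketch under assumed sizes and ranges; none of the names or values come from the article.

```python
import math

BITS_PER_WEIGHT = 8
N_INPUT, N_OUTPUT = 4, 2           # e.g. 4 sensors, 2 motors (illustrative)
N_WEIGHTS = N_INPUT * N_OUTPUT

def decode(genotype, lo=-5.0, hi=5.0):
    """One-to-one mapping: a bit string becomes N_WEIGHTS real weights."""
    assert len(genotype) == N_WEIGHTS * BITS_PER_WEIGHT
    weights = []
    for i in range(N_WEIGHTS):
        bits = genotype[i * BITS_PER_WEIGHT:(i + 1) * BITS_PER_WEIGHT]
        value = int(bits, 2) / (2 ** BITS_PER_WEIGHT - 1)   # scaled to [0, 1]
        weights.append(lo + value * (hi - lo))
    return weights

def controller(weights, sensors):
    """One feed-forward step: motor activations from sensor readings."""
    out = []
    for j in range(N_OUTPUT):
        s = sum(weights[j * N_INPUT + k] * sensors[k] for k in range(N_INPUT))
        out.append(1.0 / (1.0 + math.exp(-s)))              # logistic unit
    return out

genotype = "01" * (N_WEIGHTS * BITS_PER_WEIGHT // 2)        # example bit string
motors = controller(decode(genotype), [0.5, 0.1, 0.0, 0.9])
```

In the indirect (developmental) mappings mentioned above, `decode` would instead run a growth process, so that body and controller structure are themselves products of the genotype.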
Developmental Robotics Methods
Developmental Robotics [1,10,21], also known as epigenetic robotics, is a method for developing embodied and situated agents that adapt to their task/environment autonomously through processes inspired by biological development and learning. Evolutionary and developmental robotics methods share the same fundamental assumptions but differ in the way in which they are realized and in the type of situations to which they are typically applied. Concerning the former aspect, unlike evolutionary robotics methods, which operate on 'long' phylogenetic time scales, developmental methods typically operate on 'short' ontogenetic time scales. Concerning the latter aspect, unlike evolutionary methods, which are usually employed to develop behavioral and cognitive skills from scratch, developmental methods are typically adopted to model the development of complex behavioral and cognitive skills from simpler pre-existing skills which represent prerequisites for the development of the required skills. At the present stage, developmental robotics does not consist of a well-defined methodology [1,21] but rather of a collection of approaches and methods, often addressing complementary aspects, which will hopefully be integrated into a single methodology in the future. Below we briefly summarize some of the most important methodological aspects of the developmental robotics approach.

The Incremental Nature of the Developmental Process
Development should be characterized as an incremental process in which pre-existing structures and behavioral skills constitute important prerequisites and constraints for the development of more complex structures and behavioral skills, and in which the complexity of the internal and external characteristics increases during development.
One crucial aspect of the developmental approach therefore consists in the identification of the initial characteristics and skills which should enable the bootstrapping of the developmental process: the layering of new skills on top of existing ones [10,25,38]. Another important aspect consists in shaping the developmental process so as to ensure that the progressive increase in the complexity of the task matches the current competency of the system, and so as to drive the developmental process toward the progressive acquisition of the skills which represent the prerequisites for further developments. The progressive increase in complexity might concern not only the complexity of the task or of the required skills but also the complexity of single components of the robot/environment interaction such as, for example, the number of degrees of freedom which are frozen and progressively freed [5].

The Social Nature of the Developmental Process
Development should involve social interaction with human subjects and with other developing robots. Social interactions (e. g. scaffolding, tutelage, mimicry, emulation, and imitation), in fact, play an important role not only in the development of social skills [7] but also as facilitators of the development of individual cognitive and behavioral skills [47]. Moreover, other types of social interactions (i. e. alignment processes or social games) might lead to the development of cognitive and/or behavioral skills which are generated by a collection of individuals and which could not be developed by a single individual robot [43].

Exploitation of the Interaction Between Concurrent Developmental Processes
Development should involve the exploitation of properties originating from the interaction and the integration of several co-occurring processes. Indeed, the co-development of different skills at the same time can favor the acquisition of the corresponding skills and of additional abilities arising from their combination and integration. For example, the development of an ability to anticipate the sensory consequences of one's own actions might facilitate the concurrent development of other skills such as categorical perception [46]. The development of an ability to pay attention to new situations (curiosity) and to look for new experiences after some time (boredom) might improve the learning of a given functional skill [33,42]. The co-development of behavioral and linguistic skills might favor the acquisition of the corresponding skills and the development of semantic combinatoriality [44].
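The curiosity/boredom idea just described can be illustrated with a toy sketch, loosely in the spirit of intrinsic-motivation systems [33,42]. Everything here (names, window size, progress measure) is our own illustrative assumption: the agent keeps a history of prediction errors per activity and prefers the activity whose error has been decreasing fastest, abandoning activities whose error has plateaued.

```python
def learning_progress(errors, window=5):
    """Recent decrease in prediction error: positive while improving."""
    if len(errors) < 2 * window:
        # Barely explored activities look maximally interesting (curiosity).
        return float("inf")
    older = sum(errors[-2 * window:-window]) / window
    recent = sum(errors[-window:]) / window
    return older - recent

def choose_activity(histories):
    # Pick the activity with the largest learning progress; plateaued
    # activities score ~0 and are effectively dropped (boredom).
    return max(histories, key=lambda a: learning_progress(histories[a]))
```

For instance, an activity whose error steadily falls will be chosen over one whose error has been flat for a long time.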
Discussion and Conclusion
In this paper we described how artificial agents which are embodied and situated can develop behavioral and cognitive skills autonomously while interacting with their physical and social environment. After introducing the notions of embodiment and situatedness, we illustrated how the behavioral and cognitive skills displayed by adaptive agents can properly be characterized as complex systems, with multi-level and multi-scale properties resulting from a large number of interactions at different levels of organization and involving both bottom-up processes (in which the interaction between elements at lower levels of organization leads to higher-level properties) and top-down processes (in
which properties at a certain level of organization later affect lower-level properties or processes). Finally, we briefly introduced the methods which can be used to synthesize adaptive embodied and situated agents. The complex-system nature of adaptive agents which are embodied and situated has important implications which constrain the organization of these systems and the dynamics of the adaptive process through which they develop their skills. Concerning the organization of these systems, it implies that agents' behavioral and/or cognitive skills (at any stage of the adaptive process) cannot be traced back to any one of the three foundational elements (i. e. the body of the agent, the control system of the agent, and the environment) in isolation, but should rather be characterized as properties which emerge from the interactions between these three elements and between the behavioral and cognitive properties emerging from those interactions at different levels of organization. Moreover, it implies that 'complex' behavioral or cognitive skills might emerge from the interaction between simple properties or processes. Concerning agents' adaptive process, it implies that the development of new 'complex' skills does not necessarily require the development of new 'complex' morphological features or new 'complex' control mechanisms. Indeed, new 'complex' skills might arise from the addition of new 'simple' features or new 'simple' control rules which, in interaction with the pre-existing features and processes, might produce the required new behavioral or cognitive skills. The study of adaptive behavior in artificial agents reviewed in this paper has important implications both from an engineering point of view (i. e. for progressing in our ability to develop effective machines) and from a modeling point of view (i. e. for understanding the characteristics of biological organisms).
In particular, from an engineering point of view, progress in our ability to develop adaptive embodied and situated agents can lead to the development of machines performing useful functions. From a modeling point of view, progress in our ability to model and analyze artificial adaptive agents can improve our understanding of the general mechanisms behind animal and human intelligence. For example, the comprehension of the complex-system nature of behavioral and cognitive skills illustrated in this paper can allow us to better define the notions of embodiment and situatedness, which represent two foundational concepts in the study of natural and artificial intelligence. Indeed, although possessing a body and being in a physical environment certainly represent prerequisites for considering an agent embodied and situated, a more useful definition of embodiment (or of true embodiment) can be given in terms of the extent to which a given agent exploits its body characteristics to solve its adaptive problem (i. e. the extent to which its body structure is adapted to the problem to be solved or, in other words, the extent to which its body performs morphological computation). Similarly, a more useful definition of situatedness (or true situatedness) can be given in terms of the extent to which an agent exploits its interaction with the physical and social environment, and the properties originating from this interaction, to solve its adaptive problem. For the sake of clarity we can refer to the former definitions of the terms (i. e. possessing a physical body and being situated in a physical environment) as embodiment and situatedness in a weak sense, and to the latter definitions as embodiment and situatedness in a strong sense.

Bibliography
1. Asada M, MacDorman K, Ishiguro H, Kuniyoshi Y (2001) Cognitive developmental robotics as a new paradigm for the design of humanoid robots. Robot Auton Syst 37:185–193
2. Baldassarre G, Parisi D, Nolfi S (2006) Distributed coordination of simulated robots based on self-organisation. Artif Life 12(3):289–311
3. Beer RD (1995) A dynamical systems perspective on agent-environment interaction. Artif Intell 72:173–215
4. Beer RD (2003) The dynamics of active categorical perception in an evolved model agent. Adapt Behav 11:209–243
5. Berthouze L, Lungarella M (2004) Motor skill acquisition under environmental perturbations: on the necessity of alternate freezing and freeing. Adapt Behav 12(1):47–63
6. Bongard JC, Paul C (2001) Making evolution an offer it can't refuse: Morphology and the extradimensional bypass. In: Keleman J, Sosik P (eds) Proceedings of the Sixth European Conference on Artificial Life. Lecture Notes in Artificial Intelligence, vol 2159. Springer, Berlin
7. Breazeal C (2003) Towards sociable robots. Robot Auton Syst 42(3–4):167–175
8. Brooks RA (1991) Intelligence without reason. In: Mylopoulos J, Reiter R (eds) Proceedings of the 12th International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San Mateo
9. Brooks RA (1991) Intelligence without reason. In: Proceedings of the 12th International Joint Conference on Artificial Intelligence, Sydney, Australia, pp 569–595
10. Brooks RA, Breazeal C, Irie R, Kemp C, Marjanovic M, Scassellati B, Williamson M (1998) Alternate essences of intelligence. In: Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), Madison, Wisconsin, pp 961–976
11. Chiel HJ, Beer RD (1997) The brain has a body: Adaptive behavior emerges from interactions of nervous system, body and environment. Trends Neurosci 20:553–557
12. Clark A (1997) Being there: Putting brain, body and world together again. MIT Press, Cambridge
13. Endo I, Yamasaki F, Maeno T, Kitano H (2002) A method for co-evolving morphology and walking patterns of biped humanoid robot. In: Proceedings of the IEEE Conference on Robotics and Automation, Washington, D.C.
14. Floreano D, Husband P, Nolfi S (2008) Evolutionary robotics. In: Siciliano B, Khatib O (eds) Handbook of Robotics. Springer, Berlin
15. Gigliotta O, Nolfi S (2008) On the coupling between agent internal and agent/environmental dynamics: Development of spatial representations in evolving autonomous robots. Adapt Behav 16:148–165
16. Goldenberg E, Garcowski J, Beer RD (2004) May we have your attention: Analysis of a selective attention task. In: Schaal S, Ijspeert A, Billard A, Vijayakumar S, Hallam J, Meyer J-A (eds) From Animals to Animats 8: Proceedings of the Eighth International Conference on the Simulation of Adaptive Behavior. MIT Press, Cambridge
17. Harvey I (2000) Robotics: Philosophy of mind using a screwdriver. In: Gomi T (ed) Evolutionary Robotics: From Intelligent Robots to Artificial Life, vol III. AAI Books, Ontario
18. Holland J (1975) Adaptation in natural and artificial systems. University of Michigan Press, Ann Arbor
19. Keijzer F (2001) Representation and behavior. MIT Press, London
20. Kelso JAS (1995) Dynamic patterns: The self-organization of brain and behaviour. MIT Press, Cambridge
21. Lungarella M, Metta G, Pfeifer R, Sandini G (2003) Developmental robotics: a survey. Connect Sci 15:151–190
22. Marocco D, Nolfi S (2007) Emergence of communication in embodied agents evolved for the ability to solve a collective navigation problem. Connect Sci 19(1):53–74
23. Massera G, Cangelosi A, Nolfi S (2007) Evolution of prehension ability in an anthropomorphic neurorobotic arm. Front Neurorobot 1(4):1–9
24. McGeer T (1990) Passive walking with knees. In: Proceedings of the IEEE Conference on Robotics and Automation, vol 2, pp 1640–1645
25. Metta G, Sandini G, Natale L, Panerai F (2001) Development and robotics.
In: Proceedings of the IEEE-RAS International Conference on Humanoid Robots, pp 33–42
26. Mondada F, Franzi E, Ienne P (1993) Mobile robot miniaturisation: A tool for investigation in control algorithms. In: Proceedings of the Third International Symposium on Experimental Robotics, Kyoto, Japan
27. Mondada F, Pettinaro G, Guigrard A, Kwee I, Floreano D, Denebourg J-L, Nolfi S, Gambardella LM, Dorigo M (2004) Swarm-bot: A new distributed robotic concept. Auton Robots 17(2–3):193–221
28. Nolfi S (2002) Power and limits of reactive agents. Neurocomputing 49:119–145
29. Nolfi S (2005) Behaviour as a complex adaptive system: On the role of self-organization in the development of individual and collective behaviour. Complexus 2(3–4):195–203
30. Nolfi S, Floreano D (1999) Learning and evolution. Auton Robots 1:89–113
31. Nolfi S, Floreano D (2000) Evolutionary Robotics: The Biology, Intelligence, and Technology of Self-Organizing Machines. MIT Press/Bradford Books, Cambridge
32. Nolfi S, Marocco D (2002) Active perception: A sensorimotor account of object categorization. In: Hallam B, Floreano D, Hallam J, Hayes G, Meyer J-A (eds) From Animals to Animats 7: Proceedings of the Seventh International Conference on Simulation of Adaptive Behavior. MIT Press, Cambridge, pp 266–271
33. Oudeyer P-Y, Kaplan F, Hafner V (2007) Intrinsic motivation systems for autonomous mental development. IEEE Trans Evol Comput 11(2):265–286
34. Pfeifer R, Bongard J (2007) How the body shapes the way we think. MIT Press, Cambridge
35. Pfeifer R, Iida F, Gómez G (2006) Morphological computation for adaptive behavior and cognition. In: International Congress Series, vol 1291, pp 22–29
36. Pollack JB, Lipson H, Funes P, Hornby G (2001) Three generations of coevolutionary robotics. Artif Life 7:215–223
37. Prokopenko M, Gerasimov V, Tanev I (2006) Evolving spatiotemporal coordination in a modular robotic system. In: Rocha LM, Yaeger LS, Bedau MA, Floreano D, Goldstone RL, Vespignani A (eds) Artificial Life X: Proceedings of the Tenth International Conference on the Simulation and Synthesis of Living Systems. MIT Press, Boston
38. Scassellati B (2001) Foundations for a Theory of Mind for a Humanoid Robot. PhD thesis, Department of Electrical Engineering and Computer Science, MIT, Boston
39. Scheier C, Pfeifer R, Kunyioshi Y (1998) Embedded neural networks: exploiting constraints. Neural Netw 11:1551–1596
40. Schmitz A, Gómez G, Iida F, Pfeifer R (2007) On the robustness of simple speed control for a quadruped robot. In: Proceedings of the International Conference on Morphological Computation, Venice, Italy
41. Slocum AC, Downey DC, Beer RD (2000) Further experiments in the evolution of minimally cognitive behavior: From perceiving affordances to selective attention. In: Meyer J, Berthoz A, Floreano D, Roitblat H, Wilson S (eds) From Animals to Animats 6: Proceedings of the Sixth International Conference on Simulation of Adaptive Behavior. MIT Press, Cambridge
42. Schmidhuber J (2006) Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts. Connect Sci 18(2):173–187
43. Steels L (2003) Evolving grounded communication for robots. Trends Cogn Sci 7(7):308–312
44. Sugita Y, Tani J (2005) Learning semantic combinatoriality from the interaction between linguistic and behavioral processes. Adapt Behav 13(1):33–52
45. Tani J, Fukumura N (1997) Self-organizing internal representation in learning of navigation: A physical experiment by the mobile robot Yamabico. Neural Netw 10(1):153–159
46. Tani J, Nolfi S (1999) Learning to perceive the world as articulated: An approach for hierarchical learning in sensory-motor systems. Neural Netw 12:1131–1141
47. Tani J, Nishimoto R, Namikawa J, Ito M (2008) Co-developmental learning between human and humanoid robot using a dynamic neural network model. IEEE Trans Syst Man Cybern B Cybern 38(1)
48. Varela FJ, Thompson E, Rosch E (1991) The Embodied Mind: Cognitive science and human experience. MIT Press, Cambridge
49. van Gelder TJ (1998) The dynamical hypothesis in cognitive science. Behav Brain Sci 21:615–628
50. Vaughan E, Di Paolo EA, Harvey I (2004) The evolution of control and adaptation in a 3D powered passive dynamic walker. In: Pollack J, Bedau M, Husband P, Ikegami T, Watson R (eds) Proceedings of the Ninth International Conference on the Simulation and Synthesis of Living Systems. MIT Press, Cambridge
Entropy
CONSTANTINO TSALLIS 1,2
1 Centro Brasileiro de Pesquisas Físicas, Rio de Janeiro, Brazil
2 Santa Fe Institute, Santa Fe, USA

Article Outline
Glossary
Definition of the Subject
Introduction
Some Basic Properties
Boltzmann–Gibbs Statistical Mechanics
On the Limitations of Boltzmann–Gibbs Entropy and Statistical Mechanics
The Nonadditive Entropy $S_q$
A Connection Between Entropy and Diffusion
Standard and q-Generalized Central Limit Theorems
Future Directions
Acknowledgments
Bibliography

Glossary
Absolute temperature Denoted $T$.
Clausius entropy Also called thermodynamic entropy. Denoted $S$.
Boltzmann–Gibbs entropy Basis of Boltzmann–Gibbs statistical mechanics. This entropy, denoted $S_{BG}$, is additive. Indeed, for two probabilistically independent subsystems $A$ and $B$, it satisfies $S_{BG}(A+B) = S_{BG}(A) + S_{BG}(B)$.
Nonadditive entropy It usually refers to the basis of nonextensive statistical mechanics. This entropy, denoted $S_q$, is nonadditive for $q \ne 1$. Indeed, for two probabilistically independent subsystems $A$ and $B$, it satisfies $S_q(A+B) \ne S_q(A) + S_q(B)$ ($q \ne 1$). For historical reasons, it is frequently (but inadequately) referred to as nonextensive entropy.
q-logarithmic and q-exponential functions Denoted $\ln_q x$ (with $\ln_1 x = \ln x$) and $e_q^x$ (with $e_1^x = e^x$), respectively.
Extensive system So called for historical reasons. A more appropriate name would be additive system. It is a system which, in one way or another, relies on or is connected to the (additive) Boltzmann–Gibbs entropy. Its basic dynamical and/or structural quantities are expected to be of the exponential form. In the sense of complexity, it may be considered a simple system.
Nonextensive system So called for historical reasons. A more appropriate name would be nonadditive system. It is a system which, in one way or another, relies on or is connected to a (nonadditive) entropy such as $S_q$ ($q \ne 1$). Its basic dynamical and/or structural quantities are expected to asymptotically be of the power-law form. In the sense of complexity, it may be considered a complex system.

Definition of the Subject
Thermodynamics and statistical mechanics are among the most important formalisms in contemporary physics. They have overwhelming and intertwined applications in science and technology. They essentially rely on two basic concepts, namely energy and entropy. The mathematical expression used for the first is well known to be nonuniversal; indeed, it depends on whether we are, say, in a classical, quantum, or relativistic regime. The second concept, and very specifically its connection with the microscopic world, has been considered for well over a century as essentially unique and universal as a physical concept. Although some mathematical generalizations of the entropy have been proposed during the last forty years, they have frequently been considered as mere practical expressions for disciplines such as cybernetics and control theory, with no particular physical interpretation. What we have witnessed during the last two decades is the growth, among physicists, of the belief that this is not necessarily so. In other words, the physical entropy would basically rely on the microscopic dynamical and structural properties of the system under study. For example, for systems microscopically evolving with strongly chaotic dynamics, the connection between the thermodynamical entropy and the thermostatistical entropy would be the one found in standard textbooks. But, for more complex systems (e. g., for weakly chaotic dynamics), it becomes either necessary, or convenient, or both, to extend the traditional connection.
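The additivity distinction drawn in the glossary above can be checked numerically. The sketch below is ours, not from the article: it uses the discrete Boltzmann–Gibbs form $S_{BG} = -\sum_i p_i \ln p_i$ and, for the nonadditive case, the standard entropy of nonextensive statistical mechanics, $S_q = (1 - \sum_i p_i^q)/(q-1)$, both with $k = 1$, and verifies that for two independent subsystems $S_{BG}$ adds while $S_q$ (here with $q = 2$) does not.

```python
import math

def s_bg(p):
    """Boltzmann-Gibbs entropy (k = 1)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def s_q(p, q):
    """Tsallis entropy S_q = (1 - sum p_i^q) / (q - 1), with k = 1."""
    return (1.0 - sum(x ** q for x in p)) / (q - 1.0)

# Two probabilistically independent subsystems A and B:
# joint probabilities are simple products.
a = [0.5, 0.5]
b = [0.2, 0.8]
joint = [pa * pb for pa in a for pb in b]

additive_gap = s_bg(joint) - (s_bg(a) + s_bg(b))            # ~0: additive
nonadditive_gap = s_q(joint, 2) - (s_q(a, 2) + s_q(b, 2))   # nonzero for q != 1
```

For independent subsystems the known composition rule is $S_q(A+B) = S_q(A) + S_q(B) + (1-q)\,S_q(A)\,S_q(B)$, so the nonadditive gap here equals $(1-q)\,S_q(A)\,S_q(B) = -0.5 \times 0.32 = -0.16$.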
The present article presents the ubiquitous concept of entropy, useful even for systems for which no energy can be defined at all, from a standpoint reflecting a nonuniversal conception of the connection between the thermodynamic and the thermostatistical entropies. Consequently, both the standard entropy and its recent generalizations, as well as the corresponding statistical mechanics, are here presented on an equal footing.

Introduction
The concept of entropy (from the Greek ἐν τρέπω, en trepo: at turn, at transformation) was first introduced in 1865 by the German physicist and mathematician Rudolf Julius Emanuel Clausius in order to mathematically complete the formalism of classical thermodynamics [55], one of the most important theoretical achievements of contemporary physics. The term was so coined to make a parallel to energy (from the Greek ἐνεργός, energos: at work), the other fundamental concept of thermodynamics. Clausius' connection was given by

$$ dS = \frac{\delta Q}{T} \;, \qquad (1) $$
where $\delta Q$ denotes an infinitesimal transfer of heat. In other words, $1/T$ acts as an integrating factor for $\delta Q$. In fact, it was only in 1909 that thermodynamics was finally given, by the Greek mathematician Constantin Caratheodory, a logically consistent axiomatic formulation. In 1872, some years after Clausius' proposal, the Austrian physicist Ludwig Eduard Boltzmann introduced a quantity, which he noted $H$, defined in terms of microscopic quantities:

$$ H \equiv \int f(v) \ln[f(v)] \, dv \;, \qquad (2) $$
where $f(v)\,dv$ is the number of molecules in the velocity-space interval $dv$. Using Newtonian mechanics, Boltzmann showed that, under some intuitive assumptions (Stoßzahlansatz or molecular chaos hypothesis) regarding the nature of molecular collisions, $H$ does not increase with time. Five years later, in 1877, he identified this quantity with Clausius' entropy through $-kH \equiv S$, where $k$ is a constant. In other words, he established that

$$ S = -k \int f(v) \ln[f(v)] \, dv \;, \qquad (3) $$

later on generalized into

$$ S = -k \iint f(q,p) \ln[f(q,p)] \, dq \, dp \;, \qquad (4) $$
where $(q,p)$ is called the $\mu$-space and constitutes the phase space (coordinate $q$ and momentum $p$) corresponding to one particle. Boltzmann's genius insight – the first ever mathematical connection of the macroscopic world with the microscopic one – was, for well over three decades, highly controversial, since it was based on the hypothesis of the existence of atoms. Only a few selected scientists, like the English chemist and physicist John Dalton, the Scottish physicist and mathematician James Clerk Maxwell, and the American physicist, chemist and mathematician Josiah Willard Gibbs, believed in the reality of atoms and molecules. A large part of the scientific establishment was, at the time, strongly against such an idea. The intricate evolution of Boltzmann's lifelong epistemological struggle, which ended tragically with his suicide in 1906, may be considered a neat illustration of Thomas Kuhn's paradigm shift, and the corresponding reaction of the scientific community, as described in The Structure of Scientific Revolutions. There are in fact two important formalisms in contemporary physics where the mathematical theory of probabilities enters as a central ingredient. These are statistical mechanics (with the concept of entropy as a functional of probability distributions) and quantum mechanics (with the physical interpretation of wave functions and measurements). In both cases, contrasting viewpoints and passionate debates have taken place over more than one century, and continue still today. This is no surprise after all. If it is undeniable that energy is a very deep and subtle concept, entropy is even more so. Indeed, energy concerns the world of (microscopic) possibilities, whereas entropy concerns the probabilities of those possibilities, a step further in epistemological difficulty. In his celebrated 1902 book Elementary Principles in Statistical Mechanics, Gibbs introduced the modern form of the entropy for classical systems, namely

$$ S = -k \int_\Gamma d\Gamma \, f(q,p) \ln[C f(q,p)] \;, \qquad (5) $$

where $\Gamma$ represents the full phase space of the system, thus containing all coordinates and all momenta of its elementary particles, and $C$ is introduced to take into account the finite size and the physical dimensions of the smallest admissible cell in $\Gamma$-space. The constant $k$ is known today to be universal, called the Boltzmann constant, and given by $k = 1.3806505(24) \times 10^{-23}$ Joule/Kelvin. The studies of the German physicist Max Planck along the Boltzmann and Gibbs lines after the appearance of quantum mechanical concepts eventually led to the expression

$$ S = k \ln W \;, \qquad (6) $$

which he coined the Boltzmann entropy. This expression is carved on the stone of Boltzmann's grave at the Central Cemetery of Vienna. The quantity $W$ is the total number of microstates of the system that are compatible with our macroscopic knowledge of it. It is obtained from Eq. (5) under the hypothesis of a uniform distribution, i. e., equal probabilities. The Hungarian–American mathematician and physicist Johann von Neumann extended the concept of the BG entropy in two steps – in 1927 and 1932, respectively – in order to also cover quantum systems. The following expression, frequently referred to as the von Neumann entropy,
resulted:

$$ S = -k \, \mathrm{Tr} \, \rho \ln \rho \;, \qquad (7) $$

$\rho$ being the density operator (with $\mathrm{Tr}\,\rho = 1$). Another important step was taken in 1948 by the American electrical engineer and mathematician Claude Elwood Shannon. Having in mind the theory of digital communications, he explored the properties of the discrete form

$$ S = -k \sum_{i=1}^{W} p_i \ln p_i \;, \qquad (8) $$
frequently referred to as the Shannon entropy (with $\sum_{i=1}^{W} p_i = 1$). This form can be recovered from Eq. (5) in the particular case for which the phase-space density is $f(q,p) = \sum_{i=1}^{W} p_i\, \delta(q - q_i)\, \delta(p - p_i)$. It can also be recovered from Eq. (7) when $\rho$ is diagonal. We may generically refer to Eqs. (5), (6), (7) and (8) as the BG entropy, denoted $S_{BG}$. It is a measure of the disorder of the system or, equivalently, of our degree of ignorance or lack of information about its state. To illustrate a variety of properties, the discrete form (8) is particularly convenient.

Some Basic Properties

Non-negativity It can be easily verified that, in all cases, $S_{BG} \ge 0$, the zero value corresponding to certainty, i.e., $p_i = 1$ for one of the $W$ possibilities, and zero for all the others. To be more precise, this holds whenever $S_{BG}$ is expressed either in the form (7) or in the form (8). This property of non-negativity may, however, no longer be true if it is expressed in the form (5). This violation is one of the mathematical manifestations of the fact that, at the microscopic level, the state of any physical system exhibits its quantum nature.

Expansibility Also $S_{BG}(p_1, p_2, \ldots, p_W, 0) = S_{BG}(p_1, p_2, \ldots, p_W)$, i.e., zero-probability events do not modify our information about the system.

Maximal value $S_{BG}$ is maximized at equal probabilities, i.e., for $p_i = 1/W\ \forall i$. Its value is that of Eq. (6). This corresponds to the Laplace principle of indifference, or principle of insufficient reason.

Concavity If we have two arbitrary probability distributions $\{p_i\}$ and $\{p'_i\}$ for the same set of $W$ possibilities, we can define the intermediate probability distribution $p''_i = \lambda p_i + (1-\lambda)\, p'_i$ $(0 < \lambda < 1)$. It straightforwardly follows that $S_{BG}(\{p''_i\}) \ge \lambda\, S_{BG}(\{p_i\}) + (1-\lambda)\, S_{BG}(\{p'_i\})$. This property is essential for thermodynamics, since it eventually leads to thermodynamic stability, i.e., to robustness with regard to energy fluctuations. It also leads to the tendency of the entropy to attain, as time evolves, its maximal value compatible with our macroscopic knowledge of the system, i.e., with the possibly known values of the macroscopic constraints.

Lesche stability or experimental robustness B. Lesche introduced in 1982 [107] an interesting property, which he called stability. It reflects the experimental robustness that a physical quantity is expected to exhibit. In other words, similar experiments should yield similar numerical results for the physical quantities. Let us consider two probability distributions $\{p_i\}$ and $\{p'_i\}$, assumed to be close, in the sense that $\sum_{i=1}^{W} |p_i - p'_i| < \delta$, $\delta > 0$ being a small number. An entropic functional $S(\{p_i\})$ is said to be stable, or experimentally robust, if, for any given $\epsilon > 0$, a $\delta > 0$ exists such that $|S(\{p_i\}) - S(\{p'_i\})| / S_{\max} < \epsilon$, where $S_{\max}$ is the maximal value that the functional can attain ($\ln W$ in the case of $S_{BG}$). This implies $\lim_{\delta \to 0} \lim_{W \to \infty} [S(\{p_i\}) - S(\{p'_i\})]/S_{\max} = 0$. As we shall soon see, this property is much stronger than it seems at first sight. Indeed, it provides a (necessary but not sufficient) criterion for classifying entropic functionals as physically admissible or not. It can be shown that $S_{BG}$ is Lesche-stable (or experimentally robust).

Entropy production If we start the (deterministic) time evolution of a generic classical system from an arbitrarily chosen point in its phase space, it typically follows a quite erratic trajectory which, in many cases, gradually visits the entire (or almost entire) phase space. By making partitions of this $\Gamma$-space, and counting the frequency of visits to the various cells (and related symbolic quantities), it is possible to define probability sets. Through them, we can calculate a sort of time evolution of $S_{BG}(t)$. If the system is chaotic (sometimes called strongly chaotic), i.e., if its sensitivity to the initial conditions increases exponentially with time, then $S_{BG}(t)$ increases linearly with $t$ in the appropriate asymptotic limits. This rate of increase of the entropy is called the Kolmogorov–Sinai entropy rate and, for a large class of systems, it coincides (Pesin identity, or Pesin theorem) with the sum of the positive Lyapunov exponents. These exponents characterize the exponential divergences, along various directions in $\Gamma$-space, of a small discrepancy in the initial condition of a trajectory. It turns out, however, that the Kolmogorov–Sinai entropy rate is, in general, quite inconvenient to compute for arbitrary nonlinear dynamical systems. In practice, another quantity is used instead [102], usually referred to as the entropy production per unit time, which we denote $K_{BG}$. Its definition is as
follows. We first make a partition of the $\Gamma$-space into $W$ cells ($i = 1, 2, \ldots, W$). In one of them, arbitrarily chosen, we randomly place $M$ initial conditions (i.e., an ensemble). As time evolves, the occupancy of the $W$ cells determines the set $\{M_i(t)\}$, with $\sum_{i=1}^{W} M_i(t) = M$. This set enables the definition of a probability set with $p_i(t) \equiv M_i(t)/M$, which in turn determines $S_{BG}(t)$. We then define the entropy production per unit time as follows:

$$K_{BG} \equiv \lim_{t \to \infty} \lim_{W \to \infty} \lim_{M \to \infty} \frac{S_{BG}(t)}{t} \,. \tag{9}$$
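As a concrete illustration of the construction just described, the following minimal sketch (not from the article; the choice of map, $W$, and $M$ is illustrative) estimates $K_{BG}$ for the fully chaotic logistic map $x_{t+1} = 1 - 2x_t^2$, whose Lyapunov exponent is $\lambda = \ln 2$:

```python
import numpy as np

def s_bg(p):
    """Discrete BG entropy (k = 1), Eq. (8)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
W, M = 10_000, 200_000                 # number of cells, ensemble size (illustrative)
cell = 2.0 / W                         # cell width of the partition of [-1, 1]
x = rng.uniform(0.3, 0.3 + cell, M)    # M initial conditions in one single cell

entropies = []
for t in range(1, 11):
    x = 1.0 - 2.0 * x * x              # strongly chaotic logistic map, lambda = ln 2
    idx = np.clip(((x + 1.0) / cell).astype(int), 0, W - 1)
    p = np.bincount(idx, minlength=W) / M
    entropies.append(s_bg(p))

# Slope of S_BG(t) in the intermediate (linear) regime estimates K_BG
slope = (entropies[9] - entropies[1]) / 8   # between t = 2 and t = 10
print(slope)                                # theory predicts K_BG = ln 2 ~ 0.693
```

At short times the spreading is still confined to a few cells, and at long times $S_{BG}(t)$ saturates near $\ln W$; the linear growth rate is read off in between, which is why the finite limits of Eq. (9) must be taken in the order indicated.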
To date, no theorem guarantees that this quantity coincides with the Kolmogorov–Sinai entropy rate. However, many numerical approaches to various chaotic systems strongly suggest so. The same turns out to occur with what is frequently referred to in the literature as a Pesin-like identity. For instance, if we have a one-dimensional dynamical system, its sensitivity to the initial conditions $\xi \equiv \lim_{\Delta x(0) \to 0} \Delta x(t)/\Delta x(0)$ is typically given by

$$\xi(t) = e^{\lambda t} \,, \tag{10}$$

where $\Delta x(t)$ is the discrepancy in the one-dimensional phase space of two trajectories initially differing by $\Delta x(0)$, and $\lambda$ is the Lyapunov exponent ($\lambda > 0$ corresponds to strong sensitivity to the initial conditions, i.e., strong chaos, and $\lambda < 0$ corresponds to strong insensitivity to the initial conditions). The so-called Pesin-like identity amounts, if $\lambda \ge 0$, to

$$K_{BG} = \lambda \,. \tag{11}$$

Additivity and extensivity If we consider a system $A + B$ constituted by two probabilistically independent subsystems $A$ and $B$, i.e., if $p^{A+B}_{ij} = p^{A}_i\, p^{B}_j$, we immediately obtain from Eq. (8) that

$$S_{BG}(A+B) = S_{BG}(A) + S_{BG}(B) \,. \tag{12}$$

In other words, the BG entropy is additive [130]. If our system is constituted by $N$ probabilistically independent identical subsystems (or elements), we clearly have $S_{BG}(N) \propto N$. It frequently happens, however, that the $N$ elements are not exactly independent but only asymptotically so in the $N \to \infty$ limit. This is the usual case of many-body Hamiltonian systems involving only short-range interactions, where the concept of short range will be focused on in detail later on. For such systems, $S_{BG}$ is only asymptotically additive, i.e.,

$$0 < \lim_{N \to \infty} \frac{S_{BG}(N)}{N} < \infty \,. \tag{13}$$

An entropy $S(\{p_i\})$ of a specific system is said to be extensive if it satisfies

$$0 < \lim_{N \to \infty} \frac{S(N)}{N} < \infty \,, \tag{14}$$

where no hypothesis at all is made about the possible independence, or weak or strong correlations, between the elements of the system whose entropy $S$ we are considering. Equation (13) amounts to saying that the additive entropy $S_{BG}$ is extensive for weakly correlated systems such as the already mentioned many-body short-range-interacting Hamiltonian ones. It is important to clearly realize that additivity and extensivity are independent properties. An additive entropy such as $S_{BG}$ is extensive for simple systems such as the ones just mentioned, but it turns out to be nonextensive for other, more complex, systems that will be focused on later. For many of these more complex systems, it is the nonadditive entropy $S_q$ (to be analyzed later on) which turns out to be extensive, for a nonstandard value of $q$ (i.e., $q \neq 1$).
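The additivity rule (12) and the extensivity condition (13) can be checked numerically; a minimal sketch (illustrative distributions, $k = 1$), using independent binary elements for which $W(N) = 2^N$:

```python
import math
import numpy as np

def s_bg(p):
    """Discrete BG entropy (k = 1), Eq. (8)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Additivity, Eq. (12): p_ij^{A+B} = p_i^A p_j^B for independent subsystems
pA = np.array([0.2, 0.5, 0.3])
pB = np.array([0.6, 0.4])
pAB = np.outer(pA, pB).ravel()
assert math.isclose(s_bg(pAB), s_bg(pA) + s_bg(pB))

# Extensivity, Eq. (13): N independent binary elements, W(N) = 2^N
for N in (1, 5, 10):
    S = s_bg(np.full(2**N, 1.0 / 2**N))
    print(N, S / N)        # S(N)/N stays at ln 2 ~ 0.693 for every N
```

The ratio $S_{BG}(N)/N$ is the constant $\ln 2$ here, so the limit in Eq. (13) is finite and nonzero, i.e., $S_{BG}$ is extensive for this weakly correlated (here, fully independent) case.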
Boltzmann–Gibbs Statistical Mechanics

Physical systems (classical, quantum, relativistic) can be theoretically described in very many ways: through microscopic, mesoscopic, or macroscopic equations, reflecting either stochastic or deterministic time evolutions, or even both types simultaneously. Those systems whose time evolution is completely determined by a well-defined Hamiltonian with appropriate boundary conditions and admissible initial conditions are the main subject of an important branch of contemporary physics, named statistical mechanics. This remarkable theory (or formalism, as it is sometimes called), which for large systems satisfactorily matches classical thermodynamics, was primarily introduced by Boltzmann and Gibbs. The physical system can be in all types of situations. Two paradigmatic such situations correspond to isolation, and to thermal contact with a large reservoir called a thermostat. Their stationary state ($t \to \infty$) is usually referred to as thermal equilibrium. Both situations have been formally considered by Gibbs within his mathematical formulation of statistical mechanics, and they respectively correspond to the so-called micro-canonical and canonical ensembles (other ensembles do exist, such as the grand-canonical ensemble, appropriate for those situations in which the total number of elements of the system is not fixed; this is however outside the scope of the present article). The stationary state of the micro-canonical ensemble is determined by $p_i = 1/W$ ($\forall i$, where $i$ runs over all possible microscopic states), which corresponds to the extremization of $S_{BG}$ with a single (and trivial) constraint, namely

$$\sum_{i=1}^{W} p_i = 1 \,. \tag{15}$$
To obtain the stationary state for the canonical ensemble, the thermostat being at temperature $T$, we must (typically) add one more constraint, namely

$$\sum_{i=1}^{W} p_i E_i = U \,, \tag{16}$$

where $\{E_i\}$ are the energies of all the possible states of the system (i.e., the eigenvalues of the Hamiltonian with the appropriate boundary conditions). The extremization of $S_{BG}$ with the two constraints above straightforwardly yields

$$p_i = \frac{e^{-\beta E_i}}{Z} \,, \tag{17}$$

$$Z \equiv \sum_{j=1}^{W} e^{-\beta E_j} \,, \tag{18}$$
with the partition function $Z$, and the Lagrange parameter $\beta = 1/kT$. This is the celebrated BG distribution for thermal equilibrium (also called the Boltzmann weight, or the Gibbs state), which has been at the basis of an enormous number of successes (in fluids, magnets, superconductors, superfluids, Bose–Einstein condensation, conductors, chemical reactions, percolation, among many other important situations). The connection with classical thermodynamics, and its Legendre-transform structure, occurs through relations such as

$$\frac{1}{T} = \frac{\partial S}{\partial U} \,, \tag{19}$$

$$F \equiv U - TS = -\frac{1}{\beta} \ln Z \,, \tag{20}$$

$$U = -\frac{\partial \ln Z}{\partial \beta} \,, \tag{21}$$

$$C \equiv T \frac{\partial S}{\partial T} = \frac{\partial U}{\partial T} = -T \frac{\partial^2 F}{\partial T^2} \,, \tag{22}$$

where $F$, $U$ and $C$ are the Helmholtz free energy, the internal energy, and the specific heat, respectively. BG statistical mechanics historically appeared as the first connection between the microscopic and the macroscopic descriptions of the world, and it constitutes one of the cornerstones of contemporary physics. The Establishment resisted heavily before accepting the validity and power of
Boltzmann's revolutionary ideas. In 1906 Boltzmann dramatically committed suicide, 34 years after he had first proposed the deep ideas that we are summarizing here. In the early 20th century, few people believed in Boltzmann's proposal (among those few, we must certainly mention Albert Einstein), and most physicists were simply unaware of the existence of Gibbs and of his profound contributions. It was only half a dozen years later that the emerging new generation of physicists recognized their respective genius (thanks in part to various clarifications produced by Paul Ehrenfest, and also to the experimental successes related to Brownian motion, the photoelectric effect, the specific heat of solids, and black-body radiation).

On the Limitations of Boltzmann–Gibbs Entropy and Statistical Mechanics

Historical Background

As with any other human intellectual construct, the applicability of the BG entropy, and of the statistical mechanics associated with it, naturally has restrictions. The understanding of present developments of both the concept of entropy and its corresponding statistical mechanics demands some knowledge of this historical background. Boltzmann was aware of the relevance of the range of the microscopic interactions between atoms and molecules. He wrote, in his 1896 Lectures on Gas Theory [41], the following words:

When the distance at which two gas molecules interact with each other noticeably is vanishingly small relative to the average distance between a molecule and its nearest neighbor—or, as one can also say, when the space occupied by the molecules (or their spheres of action) is negligible compared to the space filled by the gas—then the fraction of the path of each molecule during which it is affected by its interaction with other molecules is vanishingly small compared to the fraction that is rectilinear, or simply determined by external forces. [ . . . ] The gas is "ideal" in all these cases.

Gibbs was also aware.
In his 1902 book [88], he wrote: In treating of the canonical distribution, we shall always suppose the multiple integral in equation (92) [the partition function, as we call it nowadays] to have a finite value, as otherwise the coefficient of probability vanishes, and the law of distribution becomes illusory. This will exclude certain cases, but not such apparently, as will affect the value of our results with respect to their bearing on thermodynamics. It
will exclude, for instance, cases in which the system or parts of it can be distributed in unlimited space [ . . . ]. It also excludes many cases in which the energy can decrease without limit, as when the system contains material points which attract one another inversely as the squares of their distances. [ . . . ]. For the purposes of a general discussion, it is sufficient to call attention to the assumption implicitly involved in the formula (92).

The extensivity/additivity of $S_{BG}$ has been challenged, throughout the last century, by many physicists. Let us mention just a few. In his 1936 Thermodynamics [82], Enrico Fermi wrote:

The entropy of a system composed of several parts is very often equal to the sum of the entropies of all the parts. This is true if the energy of the system is the sum of the energies of all the parts and if the work performed by the system during a transformation is equal to the sum of the amounts of work performed by all the parts. Notice that these conditions are not quite obvious and that in some cases they may not be fulfilled. Thus, for example, in the case of a system composed of two homogeneous substances, it will be possible to express the energy as the sum of the energies of the two substances only if we can neglect the surface energy of the two substances where they are in contact. The surface energy can generally be neglected only if the two substances are not very finely subdivided; otherwise, it can play a considerable role.

Laszlo Tisza wrote, in his Generalized Thermodynamics [178]:

The situation is different for the additivity postulate Pa2, the validity of which cannot be inferred from general principles. We have to require that the interaction energy between thermodynamic systems be negligible. This assumption is closely related to the homogeneity postulate Pd1. From the molecular point of view, additivity and homogeneity can be expected to be reasonable approximations for systems containing many particles, provided that the intermolecular forces have a short range character.

Corroborating the above, virtually all textbooks of quantum mechanics contain the mechanical calculations corresponding to a particle in a square well, the harmonic oscillator, the rigid rotator, a spin 1/2 in the presence of a magnetic field, and the Hydrogen atom. In the textbooks of statistical mechanics we can find the thermostatistical calculations of all these systems . . . except for the Hydrogen atom! Why? Because the long-range electron-proton
interaction produces an energy spectrum which leads to a divergent partition function. This is but a neat illustration of the above alert by Gibbs.

A Remark on the Thermodynamics of Short- and Long-Range Interacting Systems

We consider here a simple $d$-dimensional classical fluid, constituted by $N$ point particles and governed by the Hamiltonian

$$\mathcal{H} = K + V = \sum_{i=1}^{N} \frac{p_i^2}{2m} + \sum_{i \neq j} V(r_{ij}) \,, \tag{23}$$

where the potential $V(r)$ has, if it is attractive at short distances, no singularity at the origin, or at most an integrable singularity, and whose asymptotic behavior at infinity is given by $V(r) \sim -B/r^{\alpha}$ with $B > 0$ and $\alpha \ge 0$. One such example is the $d = 3$ Lennard–Jones fluid, for which $V(r) = A/r^{12} - B/r^{6}$ ($A > 0$), i.e., repulsive at short distances and attractive at long distances. In this case $\alpha = 6$. Another example could be Newtonian gravitation with a phenomenological short-distance cutoff (i.e., $V(r) \to \infty$ for $r \le r_0$, with $r_0 > 0$). In this case, $\alpha = 1$. The full $\Gamma$-space of such a system has $2dN$ dimensions. The total potential energy is expected to scale (assuming a roughly homogeneous distribution of the particles) as

$$U_{pot}(N) \propto -B\, N \int_{1}^{\infty} dr\; r^{d-1}\, r^{-\alpha} \,, \tag{24}$$
where the integral starts appreciably contributing above a typical cutoff, here taken to be unity. This integral is finite [$= B/(\alpha - d)$] for $\alpha/d > 1$ (short-range interactions), and diverges for $0 \le \alpha/d \le 1$ (long-range interactions). In other words, the energy cannot be generically characterized by Eq. (24), and we must turn to a different and more powerful estimate. Given the finiteness of the size of the system, an appropriate one is, in all cases, given by

$$U_{pot}(N) \propto -B\, N \int_{1}^{N^{1/d}} dr\; r^{d-1}\, r^{-\alpha} = -\frac{B}{d}\, N\, N^{\star} \,, \tag{25}$$

where

$$N^{\star} \equiv \frac{N^{1-\alpha/d} - 1}{1 - \alpha/d} \longrightarrow
\begin{cases}
\dfrac{1}{\alpha/d - 1} & \text{if } \alpha/d > 1 \,; \\[2mm]
\ln N & \text{if } \alpha/d = 1 \,; \\[2mm]
\dfrac{N^{1-\alpha/d}}{1 - \alpha/d} & \text{if } 0 < \alpha/d < 1 \,,
\end{cases} \tag{26}$$

the arrow indicating the leading behavior for $N \to \infty$.
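A small numeric check of the three regimes in Eq. (26), using the closed form $N^{\star} = (N^{1-\alpha/d} - 1)/(1 - \alpha/d)$ that results from the integral in Eq. (25) (function and variable names are illustrative):

```python
import math

def n_star(N, ratio):
    """N* = (N**(1 - ratio) - 1)/(1 - ratio), ratio = alpha/d (-> ln N at ratio = 1)."""
    if math.isclose(ratio, 1.0):
        return math.log(N)
    return (N ** (1.0 - ratio) - 1.0) / (1.0 - ratio)

for ratio in (2.0, 1.0, 0.5):              # short-range, marginal, long-range
    print(ratio, [round(n_star(N, ratio), 3) for N in (10**2, 10**4, 10**6)])
# ratio > 1: N* saturates at 1/(ratio - 1), so U_pot/N stays finite
# ratio = 1: N* grows like ln N
# 0 < ratio < 1: N* grows like N**(1 - ratio)/(1 - ratio), so U_pot/N diverges
```

This makes visible why $\alpha/d > 1$ yields an extensive potential energy while $0 \le \alpha/d \le 1$ does not, and why $N^{\star}$ is the natural scaling factor for finite systems.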
Notice that $N^{\star} = \ln_{\alpha/d} N$, where the $q$-logarithm function $\ln_q x \equiv (x^{1-q} - 1)/(1 - q)$ $(x > 0;\ \ln_1 x = \ln x)$ will be shown to play an important role later on. Satisfactorily enough, Eqs. (26) recover the characterization of Eq. (24) in the limit $N \to \infty$, but they have the great advantage of providing, for finite $N$, a finite value. This fact will now be shown to make it possible to properly scale the macroscopic quantities in the thermodynamic limit ($N \to \infty$), for all values of $\alpha/d \ge 0$.

Let us address the thermodynamical consequences of the microscopic interactions being short- or long-ranged. To present a slightly more general illustration, we shall assume from now on that our homogeneous and isotropic classical fluid is made of magnetic particles. Its Gibbs free energy is then given by

$$G(N,T,p,H) = U(N,T,p,H) - T\, S(N,T,p,H) + p\, V(N,T,p,H) - H\, M(N,T,p,H) \,, \tag{27}$$

where $(T, p, H)$ correspond respectively to the temperature, pressure and external magnetic field, $V$ is the volume and $M$ the magnetization. If the interactions are short-ranged (i.e., if $\alpha/d > 1$), we can divide this equation by $N$ and then take the $N \to \infty$ limit. We obtain

$$g(T,p,H) = u(T,p,H) - T\, s(T,p,H) + p\, v(T,p,H) - H\, m(T,p,H) \,, \tag{28}$$

where $g(T,p,H) \equiv \lim_{N \to \infty} G(N,T,p,H)/N$, and analogously for the other variables of the equation. If the interactions were instead long-ranged (i.e., if $0 \le \alpha/d \le 1$), all these quantities would be divergent, hence thermodynamically meaningless. Consequently, the generically correct procedure, i.e., for all $\alpha/d \ge 0$, must conform to the following lines:

$$\lim_{N \to \infty} \frac{G(N,T,p,H)}{N N^{\star}} = \lim_{N \to \infty} \frac{U(N,T,p,H)}{N N^{\star}} - \lim_{N \to \infty} \frac{T}{N^{\star}} \frac{S(N,T,p,H)}{N} + \lim_{N \to \infty} \frac{p}{N^{\star}} \frac{V(N,T,p,H)}{N} - \lim_{N \to \infty} \frac{H}{N^{\star}} \frac{M(N,T,p,H)}{N} \,, \tag{29}$$

hence

$$g(T^{\star},p^{\star},H^{\star}) = u(T^{\star},p^{\star},H^{\star}) - T^{\star}\, s(T^{\star},p^{\star},H^{\star}) + p^{\star}\, v(T^{\star},p^{\star},H^{\star}) - H^{\star}\, m(T^{\star},p^{\star},H^{\star}) \,, \tag{30}$$

where the definitions of $T^{\star}$ and all the other variables are self-explanatory (e.g., $T^{\star} \equiv T/N^{\star}$). In other words, in order to have finite thermodynamic equations of state, we must in general express them in the $(T^{\star}, p^{\star}, H^{\star})$ variables. If $\alpha/d > 1$, this procedure recovers the usual equations of state, and the usual extensive ($G, U, S, V, M$) and intensive ($T, p, H$) thermodynamic variables. But, if $0 \le \alpha/d \le 1$, the situation is more complex, and we realize that three, instead of the traditional two, classes of thermodynamic variables emerge. We may call them extensive ($S, V, M, N$), pseudo-extensive ($G, U$) and pseudo-intensive ($T, p, H$) variables. All the energy-type thermodynamical variables ($G, F, U$) give rise to pseudo-extensive ones, whereas those which appear in the usual Legendre thermodynamical pairs give rise to pseudo-intensive ones ($T, p, H, \mu$) and extensive ones ($S, V, M, N$). See Figs. 1 and 2.

The possibly long-range interactions within Hamiltonian (23) refer to the dynamical variables themselves. There is another important class of Hamiltonians, in which the possibly long-range interactions refer to the coupling constants between localized dynamical variables. Such is, for instance, the case of the following classical Hamiltonian:

$$\mathcal{H} = K + V = \sum_{i=1}^{N} \frac{L_i^2}{2I} - \sum_{i \neq j} \frac{J_x\, s_i^x s_j^x + J_y\, s_i^y s_j^y + J_z\, s_i^z s_j^z}{r_{ij}^{\alpha}} \qquad (\alpha \ge 0) \,, \tag{31}$$

where $\{L_i\}$ are the angular momenta, $I$ the moment of inertia, $\{(s_i^x, s_i^y, s_i^z)\}$ are the components of classical rotators, $(J_x, J_y, J_z)$ are coupling constants, and $r_{ij}$ runs over all distances between sites $i$ and $j$ of a $d$-dimensional lattice. For example, for a simple hypercubic lattice
with unit crystalline parameter, we have $r_{ij} = 1, 2, 3, \ldots$ if $d = 1$, $r_{ij} = 1, \sqrt{2}, 2, \ldots$ if $d = 2$, $r_{ij} = 1, \sqrt{2}, \sqrt{3}, 2, \ldots$ if $d = 3$, and so on. For such a case, we have that

$$N^{\star} \equiv \sum_{i=2}^{N} r_{1i}^{-\alpha} \,, \tag{32}$$

which has in fact the same asymptotic behaviors as indicated in Eq. (26). In other words, here again $\alpha/d > 1$ corresponds to short-range interactions, and $0 \le \alpha/d \le 1$ corresponds to long-range ones. The correctness of the present generalized thermodynamical scalings has already been specifically checked in many physical systems, such as a ferrofluid-like model [97], Lennard–Jones-like fluids [90], magnetic systems [16,19,59,158], anomalous diffusion [66], and percolation [85,144]. Let us mention that, for the $\alpha = 0$ models (i.e., mean-field models), it is widespread in the literature to divide the potential term of the Hamiltonian by $N$ in order to make it extensive by force. Although mathematically admissible (see [19]), this is obviously very unsatisfactory in principle, since it implies a microscopic coupling constant which depends on $N$. What we have described here is the thermodynamically proper way of eliminating the mathematical difficulties emerging in the models in the presence of long-range interactions. Last but not least, we verify a point which is crucial for the developments below, namely that the entropy $S$ is expected to be extensive no matter what the range of the interactions is.

Entropy, Figure 1 For long-range interactions ($0 \le \alpha/d \le 1$) we have three classes of thermodynamic variables, namely the pseudo-intensive (scaling with $N^{\star}$), pseudo-extensive (scaling with $N N^{\star}$) and extensive (scaling with $N$) ones. For short-range interactions ($\alpha/d > 1$) the pseudo-intensive variables become intensive (independent of $N$), and the pseudo-extensive ones merge with the extensive ones, all now being extensive (scaling with $N$), thus recovering the traditional two textbook classes of thermodynamical variables.

Entropy, Figure 2 The so-called extensive systems ($\alpha/d > 1$ for the classical ones) typically involve absolutely convergent series, whereas the so-called nonextensive systems ($0 \le \alpha/d < 1$ for the classical ones) typically involve divergent series. The marginal systems ($\alpha/d = 1$ here) typically involve conditionally convergent series, which therefore depend on the boundary conditions, i.e., typically on the external shape of the system. Capacitors constitute a notorious example of the $\alpha/d = 1$ case. The model usually referred to in the literature as the Hamiltonian-Mean-Field (HMF) model lies on the $\alpha = 0$ axis ($\forall d > 0$). The model usually referred to as the $d$-dimensional $\alpha$-XY model [19] lies on the vertical axis at abscissa $d$ ($\forall \alpha \ge 0$).

The Nonadditive Entropy $S_q$

Introduction and Basic Properties
The possibility was introduced in 1988 [183] (see also [42,112,157,182]) of generalizing BG statistical mechanics on the basis of an entropy $S_q$ which generalizes $S_{BG}$. This entropy is defined as follows:

$$S_q \equiv k\, \frac{1 - \sum_{i=1}^{W} p_i^q}{q - 1} \qquad (q \in \mathbb{R};\ S_1 = S_{BG}) \,. \tag{33}$$

For equal probabilities, this entropy takes the form

$$S_q = k \ln_q W \qquad (S_1 = k \ln W) \,, \tag{34}$$
where the $q$-logarithmic function has already been defined.

Remark With the same or a different prefactor, this entropic form has been successively and independently introduced on many occasions during the last decades. J. Havrda and F. Charvat [92] were apparently the first to introduce this form, though with a different prefactor (adapted to binary variables), in the context of cybernetics and information theory. I. Vajda [207] further studied this form, quoting Havrda and Charvat. Z. Daroczy [74] rediscovered this form (he quotes neither Havrda–Charvat nor Vajda). J. Lindhard and V. Nielsen [108] rediscovered this form (they quote none of the predecessors) through the property of entropic composability. B.D. Sharma and D.P. Mittal [163] introduced a two-parameter form which reproduces both $S_q$ and the Renyi entropy [145] as particular cases. A. Wehrl [209] mentions the form of $S_q$ on p. 247, quotes Daroczy, but ignores Havrda–Charvat, Vajda, Lindhard–Nielsen, and Sharma–Mittal. I myself rediscovered this form in 1985 with the aim of generalizing Boltzmann–Gibbs statistical mechanics, but quoted none of the predecessors in the 1988 paper [183]. In fact, I came to know the whole story only quite a few years later, thanks to S.R.A. Salinas and R.N. Silver, who were the first to provide me with the corresponding information. Such rediscoveries can by no means be considered as particularly surprising. Indeed, this happens in science more frequently than usually realized. This point is lengthily and colorfully developed by S.M. Stigler [167]. On p. 284, a most interesting example is described, namely that of the celebrated
normal distribution. It was first introduced by Abraham De Moivre in 1733, then by Pierre Simon de Laplace in 1774, then by Robert Adrain in 1808, and finally by Carl Friedrich Gauss in 1809, nothing less than 76 years after its first publication! This distribution is universally called Gaussian because of the remarkable insights of Gauss concerning the theory of errors, applicable in all experimental sciences. A less glamorous illustration of the same phenomenon, but nevertheless interesting in the present context, is that of the Renyi entropy [145]. According to I. Csiszar [64], p. 73, the Renyi entropy had already been essentially introduced by Paul-Marcel Schutzenberger [161].
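Before listing the properties of $S_q$, here is a minimal numeric sketch of the definition (33)-(34) ($k = 1$; the sample distribution is illustrative), showing that $S_q \to S_{BG}$ as $q \to 1$ and that equal probabilities give $S_q = \ln_q W$:

```python
import math
import numpy as np

def entropy_q(p, q, k=1.0):
    """S_q = k (1 - sum_i p_i^q)/(q - 1), Eq. (33); S_1 = S_BG."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if math.isclose(q, 1.0):
        return -k * np.sum(p * np.log(p))
    return k * (1.0 - np.sum(p ** q)) / (q - 1.0)

def ln_q(x, q):
    """q-logarithm: ln_q x = (x**(1 - q) - 1)/(1 - q); ln_1 x = ln x."""
    return math.log(x) if math.isclose(q, 1.0) else (x ** (1.0 - q) - 1.0) / (1.0 - q)

p = np.array([0.5, 0.3, 0.2])
print(entropy_q(p, 1.0001), entropy_q(p, 1.0))     # q -> 1 recovers S_BG

W = 4                                               # equal probabilities: S_q = ln_q W, Eq. (34)
for q in (0.5, 1.0, 2.0):
    assert math.isclose(entropy_q(np.full(W, 1.0 / W), q), ln_q(W, q))
```

The equal-probability identity holds because $\sum_i p_i^q = W \cdot W^{-q} = W^{1-q}$, so Eq. (33) reduces algebraically to $k \ln_q W$.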
The entropy defined in Eq. (33) has the following main properties:

(i) $S_q$ is nonnegative ($\forall q$);

(ii) $S_q$ is expansible ($\forall q > 0$);

(iii) $S_q$ attains its maximal (minimal) value $k \ln_q W$ for $q > 0$ (for $q < 0$);

(iv) $S_q$ is concave (convex) for $q > 0$ (for $q < 0$);

(v) $S_q$ is Lesche-stable ($\forall q > 0$) [2];

(vi) $S_q$ yields a finite upper bound of the entropy production per unit time for a special value of $q$, whenever the sensitivity to the initial conditions exhibits an upper bound which asymptotically increases as a power of time. For example, many $D = 1$ nonlinear dynamical systems have a vanishing maximal Lyapunov exponent $\lambda_1$ and exhibit a sensitivity to the initial conditions which is (upper) bounded by

$$\xi = e_q^{\lambda_q t} \,, \tag{35}$$

with $\lambda_q > 0$, $q < 1$, the $q$-exponential function $e_q^x$ being the inverse of $\ln_q x$. More explicitly (see Fig. 3),

$$e_q^x \equiv \begin{cases} [1 + (1-q)x]^{\frac{1}{1-q}} & \text{if } 1 + (1-q)x > 0 \,; \\ 0 & \text{otherwise} \,. \end{cases} \tag{36}$$

Such systems have a finite entropy production per unit time, which satisfies a $q$-generalized Pesin-like identity, namely, for the construction described in Sect. "Introduction",

$$K_q \equiv \lim_{t \to \infty} \lim_{W \to \infty} \lim_{M \to \infty} \frac{S_q(t)}{t} = \lambda_q \,. \tag{37}$$

The situation is in fact sensibly richer than briefly described here. For further details, see [27,28,29,30,93,116,117,146,147,148,149,150,151,152].

(vii) $S_q$ is nonadditive for $q \neq 1$. Indeed, for independent subsystems $A$ and $B$, it can be straightforwardly proved that

$$\frac{S_q(A+B)}{k} = \frac{S_q(A)}{k} + \frac{S_q(B)}{k} + (1-q)\, \frac{S_q(A)}{k}\, \frac{S_q(B)}{k} \,, \tag{38}$$

or, equivalently,

$$S_q(A+B) = S_q(A) + S_q(B) + \frac{(1-q)}{k}\, S_q(A)\, S_q(B) \,, \tag{39}$$

which makes explicit that $(1-q) \to 0$ plays the same role as $k \to \infty$. Property (38), occasionally referred to in the literature as pseudo-additivity, can be called subadditivity (superadditivity) for $q > 1$ ($q < 1$).

(viii) $S_q = k\, D_q \sum_{i=1}^{W} p_i^x \big|_{x=1}$, where the 1909 Jackson differential operator is defined as follows:

$$D_q f(x) \equiv \frac{f(qx) - f(x)}{qx - x} \qquad (D_1 f(x) = df(x)/dx) \,. \tag{40}$$

(ix) A uniqueness theorem has been proved by Santos [159], which generalizes, for arbitrary $q$, that of Shannon [162]. Let us assume that an entropic form $S(\{p_i\})$ satisfies the following properties:

(a) $S(\{p_i\})$ is a continuous function of $\{p_i\}$; (41)

(b) $S(p_i = 1/W,\ \forall i)$ monotonically increases with the total number of possibilities $W$; (42)

(c) $\dfrac{S(A+B)}{k} = \dfrac{S(A)}{k} + \dfrac{S(B)}{k} + (1-q)\, \dfrac{S(A)}{k}\, \dfrac{S(B)}{k}$ if $p_{ij}^{A+B} = p_i^A\, p_j^B\ \forall (i,j)$, with $k > 0$; (43)

(d) $S(\{p_i\}) = S(p_L, p_M) + p_L^q\, S(\{p_i / p_L\}) + p_M^q\, S(\{p_i / p_M\})$, with $p_L \equiv \sum_{L\ \mathrm{terms}} p_i$, $p_M \equiv \sum_{M\ \mathrm{terms}} p_i$ ($L + M = W$), and $p_L + p_M = 1$. (44)

Then and only then [159] $S(\{p_i\}) = S_q(\{p_i\})$.

Entropy, Figure 3 The $q$-exponential and $q$-logarithm functions in typical representations: a Linear-linear representation of $e_q^x$; b Linear-linear representation of $e_q^{-x}$; c Log-log representation of $y(x) = e_q^{-a_q x}$, solution of $dy/dx = -a_q y^q$ with $y(0) = 1$; d Linear-linear representation of $S_q = \ln_q W$ (value of the entropy for equal probabilities)

(x) Another (equivalent) uniqueness theorem was
proved by Abe [1], which generalizes, for arbitrary $q$, that of Khinchin [100]. Let us assume that an entropic form $S(\{p_i\})$ satisfies the following properties:

(a) $S(\{p_i\})$ is a continuous function of $\{p_i\}$; (45)

(b) $S(p_i = 1/W,\ \forall i)$ monotonically increases with the total number of possibilities $W$; (46)

(c) $S(p_1, p_2, \ldots, p_W, 0) = S(p_1, p_2, \ldots, p_W)$; (47)

(d) $\dfrac{S(A+B)}{k} = \dfrac{S(A)}{k} + \dfrac{S(B|A)}{k} + (1-q)\, \dfrac{S(A)}{k}\, \dfrac{S(B|A)}{k}$, where $S(A+B) \equiv S(\{p_{ij}^{A+B}\})$ and the conditional entropy $S(B|A) \equiv \sum_{i=1}^{W_A} (p_i^A)^q\, S(\{p_{ij}^{A+B} / p_i^A\}) \Big/ \sum_{i=1}^{W_A} (p_i^A)^q$ ($k > 0$). (48)

Then and only then [1] $S(\{p_i\}) = S_q(\{p_i\})$.
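Properties (vi) and (vii) can be checked numerically. The sketch below (illustrative values; $k = 1$) verifies that the $e_q^x$ of Eq. (36) inverts $\ln_q x$, and the nonadditivity rule (39) for independent subsystems:

```python
import math
import numpy as np

def ln_q(x, q):
    """q-logarithm (inverse of the q-exponential)."""
    return math.log(x) if math.isclose(q, 1.0) else (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-exponential, Eq. (36): [1 + (1 - q) x]^(1/(1 - q)) if positive, else 0."""
    if math.isclose(q, 1.0):
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0 else 0.0

def s_q(p, q, k=1.0):
    """S_q of Eq. (33) for q != 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return k * (1.0 - np.sum(p ** q)) / (q - 1.0)

# exp_q inverts ln_q
assert math.isclose(exp_q(ln_q(3.0, 1.7), 1.7), 3.0)

# Nonadditivity, Eq. (39), for independent subsystems A and B
q = 0.8
pA, pB = np.array([0.7, 0.3]), np.array([0.4, 0.1, 0.5])
lhs = s_q(np.outer(pA, pB).ravel(), q)
rhs = s_q(pA, q) + s_q(pB, q) + (1.0 - q) * s_q(pA, q) * s_q(pB, q)
assert math.isclose(lhs, rhs)
print(lhs, s_q(pA, q) + s_q(pB, q))   # S_q(A+B) differs from the plain sum when q != 1
```

The identity (39) holds exactly here because, for a product distribution, $\sum_{ij} (p_i^A p_j^B)^q = \big(\sum_i (p_i^A)^q\big)\big(\sum_j (p_j^B)^q\big)$, which is the algebraic content of the pseudo-additivity property.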
Additivity Versus Extensivity of the Entropy It is of great importance to distinguish additivity from extensivity. An entropy S is additive [130] if its value for a system composed by two independent subsystems A and B satisfies S(A C B) D S(A) C S(B) (hence, for N independent equal subsystems or elements, we have S(N) D N S(1)). Therefore, SBG is additive, and S q (q ¤ 1) is nonadditive. A substantially different matter is whether a given entropy S is extensive for a given system. An entropy is extensive if and only if 0 < lim N!1 S(N)/N < 1. What matters for satisfactorily matching thermodynamics is extensivity not additivity. For systems whose elements are nearly independent (i. e., essentially weakly correlated), SBG is extensive and Sq is nonextensive. For systems whose elements are strongly correlated in a special manner, SBG is nonextensive, whereas Sq is extensive for a special value of q ¤ 1 (and nonextensive for all the others). Let us illustrate these facts for some simple examples of equal probabilities. If W(N) A N (A > 0, > 1, and N ! 1), the entropy which is extensive is SBG . Indeed, SBG (N) D k ln W(N) (ln )N / N (it is equally trivial to verify that S q (N) is nonextensive for any q ¤ 1). If W(N) BN (B > 0, > 0, and N ! 1), the entropy which is extensive is S1(1/ ) . Indeed, S1(1/ ) (N) k B1/ N / N (it is equally trivial to verify that SBG (N) / ln N, hence nonextensive). If W(N) C N (C > 0, > 1, ¤ 1, and N ! 1), then S q (N) is nonextensive for any value of q. Therefore, in such a complex case, one must in principle refer to some other kind of entropic functional in order to match the extensivity required by classical thermodynamics. Various nontrivial abstract mathematical models can be found in [113,160,186,198,199] for which S q (q ¤ 1) is extensive. Moreover, a physical realization is also available now [60,61] for a many-body quantum Hamiltonian, namely the ground state of the following one:
H = − Σ_{i=1}^{N−1} [ (1+γ) S^x_i S^x_{i+1} + (1−γ) S^y_i S^y_{i+1} ] / 2 − λ Σ_{i=1}^{N} S^z_i ,   (49)

q = ( √(9 + c²) − 3 ) / c ,   (50)

where λ is a transverse magnetic field, and (S^x_i, S^y_i, S^z_i) are Pauli matrices; for |γ| = 1 we have the Ising model, for 0 < |γ| < 1 the anisotropic XY model, and, for γ = 0, the isotropic XY model. The two former share the same symmetry and consequently belong to the same critical universality class (the Ising universality class, which corresponds to a so-called central charge c = 1/2), whereas the latter one belongs to a different universality class (the XX one, which corresponds to a central charge c = 1). At temperature T = 0 and N → ∞, this model exhibits a second-order phase transition as a function of λ. For the Ising model, the critical value is λ = 1, whereas, for the XX model, the entire line 0 ≤ λ ≤ 1 is critical. Since the system is at its ground state (assuming a vanishingly small magnetic field component in the x–y plane), it is a pure state (i.e., its density matrix ρ_N is such that Tr ρ²_N = 1, ∀N), hence the entropy S_q(N) (∀q > 0) is strictly zero. However, the situation is drastically different for any L-sized block of the infinite chain. Indeed, ρ_L ≡ Tr_{N−L} ρ_N is such that Tr ρ²_L < 1, i.e., it is a mixed state, hence it has a nonzero entropy. The block entropy S_q(L) ≡ lim_{N→∞} S_q(N; L) monotonically increases with L for all values of q. And it does so linearly for the value of q given by Eq. (50), where c is the central charge which emerges in quantum field theory [54]. In other words, 0 < lim_{L→∞} S_{(√(9+c²)−3)/c}(L)/L < ∞. Notice that q increases from zero to unity when c increases from zero to infinity; q = √37 − 6 ≃ 0.083 for c = 1/2 (Ising model), q = √10 − 3 ≃ 0.16 for c = 1 (isotropic XY model), q = 1/2 for c = 4 (dimension of space-time), and q = (√685 − 3)/26 ≃ 0.89 for c = 26, related to string theory [89]. The possible physical interpretation of the limit c → ∞ is still unknown, although it could correspond to some sort of mean field approach.

Nonextensive Statistical Mechanics

To generalize BG statistical mechanics for the canonical ensemble, we optimize S_q with constraint (15) and also

Σ_{i=1}^{W} P_i E_i = U_q ,   (51)

where

P_i ≡ p_i^q / Σ_{j=1}^{W} p_j^q   ( Σ_{i=1}^{W} P_i = 1 )   (52)

is the so-called escort distribution [33]. It follows that p_i = P_i^{1/q} / Σ_{j=1}^{W} P_j^{1/q}. There are various converging reasons for it being appropriate to impose the energy constraint with the {P_i} instead of with the original {p_i}. The full discussion of this delicate point is beyond the present scope; some of these intertwined reasons are explored in [184]. By imposing Eq. (51), we follow [193], which in turn reformulates the results presented in [71,183]. The passage from one to the other of the various existing formulations of the above optimization problem is discussed in detail in [83,193]. The entropy optimization yields, for the stationary state,

p_i = e_q^{−β_q (E_i − U_q)} / Z̄_q ,   (53)

with

β_q ≡ β / Σ_{j=1}^{W} p_j^q ,   (54)

and

Z̄_q ≡ Σ_{i=1}^{W} e_q^{−β_q (E_i − U_q)} ,   (55)

β being the Lagrange parameter associated with the constraint (51). Equation (53) makes explicit that the probability distribution is, for fixed β_q, invariant with regard to the arbitrary choice of the zero of energies. The stationary state (or (meta)equilibrium) distribution (53) can be rewritten as follows:

p_i = e_q^{−β′_q E_i} / Z′_q ,   (56)

with

Z′_q ≡ Σ_{j=1}^{W} e_q^{−β′_q E_j} ,   (57)

and

β′_q ≡ β_q / [1 + (1−q) β_q U_q] .   (58)

The form (56) is particularly convenient for many applications where comparison with experimental or computational data is involved. Also, it makes clear that p_i asymptotically decays like 1/E_i^{1/(q−1)} for q > 1, and has a cutoff for q < 1, instead of the exponential decay with E_i for q = 1.

The connection to thermodynamics is established in what follows. It can be proved that

∂S_q/∂U_q = 1/T ,   (59)

with T ≡ 1/(kβ). Also we prove, for the free energy,

F_q ≡ U_q − T S_q = −(1/β) ln_q Z_q ,   (60)

where

ln_q Z_q = ln_q Z̄_q − β U_q .   (61)

This relation takes into account the trivial fact that, in contrast with what is usually done in BG statistics, the energies {E_i} are here referred to U_q in (53). It can also be proved that

U_q = −(∂/∂β) ln_q Z_q ,   (62)

as well as relations such as

C_q ≡ T ∂S_q/∂T = ∂U_q/∂T = −T ∂²F_q/∂T² .   (63)

In fact the entire Legendre transformation structure of thermodynamics is q-invariant, which is both remarkable and welcome.

A Connection Between Entropy and Diffusion

We review here one of the main common aspects of entropy and diffusion. We shall present on an equal footing both the BG and the nonextensive cases [13,138,192,216]. Let us extremize the entropy

S_q = k ( 1 − ∫_{−∞}^{∞} d(x/σ) [σ p(x)]^q ) / (q − 1)   (64)

with the constraints

∫_{−∞}^{∞} dx p(x) = 1   (65)

and

⟨x²⟩_q ≡ ∫_{−∞}^{∞} dx x² [p(x)]^q / ∫_{−∞}^{∞} dx [p(x)]^q = σ² ,   (66)

σ > 0 being some fixed value having the same physical dimensions as the variable x. We straightforwardly obtain
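The q-deformed exponential and logarithm underlying Eqs. (53)–(58), and the escort distribution (52), are simple to implement. The following is a minimal illustrative sketch (our own code, not from the article; function names are ours):

```python
import math

def ln_q(x, q):
    """q-logarithm: ln_q x = (x**(1-q) - 1)/(1-q); reduces to ln x as q -> 1."""
    if q == 1:
        return math.log(x)
    return (x**(1 - q) - 1) / (1 - q)

def exp_q(x, q):
    """q-exponential: e_q^x = [1 + (1-q)x]^(1/(1-q)) when the bracket is
    positive, and 0 otherwise; reduces to exp(x) as q -> 1."""
    if q == 1:
        return math.exp(x)
    base = 1 + (1 - q)*x
    return base**(1/(1 - q)) if base > 0 else 0.0

def escort(p, q):
    """Escort distribution of Eq. (52): P_i = p_i^q / sum_j p_j^q."""
    w = [pi**q for pi in p]
    s = sum(w)
    return [wi/s for wi in w]

# ln_q and e_q are mutual inverses:
assert abs(ln_q(exp_q(0.7, 1.5), 1.5) - 0.7) < 1e-10

# Inverting the escort map via p_i = P_i^(1/q) / sum_j P_j^(1/q):
p, q = [0.5, 0.3, 0.2], 1.3
P = escort(p, q)
back = [Pi**(1/q) for Pi in P]
s = sum(back)
assert all(abs(a - b/s) < 1e-12 for a, b in zip(p, back))
```

The round-trip check mirrors the identity stated after Eq. (52).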
the following distribution:

p_q(x) =
  (1/σ) [(q−1)/(π(3−q))]^{1/2} [Γ(1/(q−1)) / Γ((3−q)/(2(q−1)))] [1 + ((q−1)/(3−q)) x²/σ²]^{−1/(q−1)}   if 1 < q < 3 ;
  (1/(σ√(2π))) e^{−x²/(2σ²)}   if q = 1 ;
  (1/σ) [(1−q)/(π(3−q))]^{1/2} [Γ((5−3q)/(2(1−q))) / Γ((2−q)/(1−q))] [1 − ((1−q)/(3−q)) x²/σ²]^{1/(1−q)}   for |x| < σ[(3−q)/(1−q)]^{1/2}, and zero otherwise, if q < 1 .   (67)

These distributions are frequently referred to as q-Gaussians. For q > 1, they asymptotically have a power-law tail (q ≥ 3 is not admissible because the norm (65) cannot be satisfied); for q < 1, they have a compact support. For q = 1, the celebrated Gaussian is recovered; for q = 2, the Cauchy–Lorentz distribution is recovered; finally, for q → −∞, the uniform distribution within the interval [−1, 1] is recovered. For q = (3+m)/(1+m), m being an integer (m = 1, 2, 3, …), we recover the Student's t-distributions with m degrees of freedom [79]. For q = (n−4)/(n−2), n being an integer (n = 3, 4, 5, …), we recover the so-called r-distributions with n degrees of freedom [79]. In other words, q-Gaussians are analytical extensions of Student's t- and r-distributions. In some communities they are also referred to as the Barenblatt form. For q < 5/3, they have a finite variance which monotonically increases for q varying from −∞ to 5/3; for 5/3 ≤ q < 3, the variance diverges.

Let us now make a connection of the above optimization problem with diffusion. We focus on the following quite general diffusion equation:

∂^δ p(x,t)/∂t^δ = ∂/∂x [ (∂U(x)/∂x) p(x,t) ] + D ∂^α [p(x,t)]^{2−q} / ∂|x|^α   (0 < δ ≤ 1; 0 < α ≤ 2; q < 3; t ≥ 0) ,   (68)

with a generic nonsingular potential U(x), and a generalized diffusion coefficient D which is positive (negative) for q < 2 (2 < q < 3). Several particular instances of this equation have been discussed in the literature (see [40,86,106,131,188] and references therein). For example, the stationary state for α = 2, ∀δ, and any confining potential (i.e., lim_{|x|→∞} U(x) = ∞) is given by [106]

p(x, ∞) = e_q^{−β̃ [U(x) − U(0)]} / Z̃ ,   (69)

Z̃ ≡ ∫_{−∞}^{∞} dx e_q^{−β̃ [U(x) − U(0)]} ,   (70)

1/β̃ ≡ k T̃ ∝ |D| ,   (71)
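The Γ-function prefactors in the q-Gaussians (67) can be checked numerically by integrating p_q on a grid. Below is a small sketch (our own code, not from the article; grid parameters are arbitrary), for σ = 1:

```python
import math

def p_q(x, q, sigma=1.0):
    """q-Gaussian of Eq. (67) (q < 3), normalized to unit area."""
    if q > 1:
        A = (1/sigma)*math.sqrt((q - 1)/(math.pi*(3 - q))) \
            * math.gamma(1/(q - 1))/math.gamma((3 - q)/(2*(q - 1)))
        return A*(1 + (q - 1)/(3 - q)*(x/sigma)**2)**(-1/(q - 1))
    if q == 1:
        return math.exp(-x**2/(2*sigma**2))/(sigma*math.sqrt(2*math.pi))
    if (x/sigma)**2 >= (3 - q)/(1 - q):   # compact support for q < 1
        return 0.0
    A = (1/sigma)*math.sqrt((1 - q)/(math.pi*(3 - q))) \
        * math.gamma((5 - 3*q)/(2*(1 - q)))/math.gamma((2 - q)/(1 - q))
    return A*(1 - (1 - q)/(3 - q)*(x/sigma)**2)**(1/(1 - q))

def total(q, L=50.0, n=100001):
    """Riemann-sum normalization check on [-L, L]."""
    h = 2*L/(n - 1)
    return h*sum(p_q(-L + i*h, q) for i in range(n))

for q in (0.5, 1.0, 1.5):   # compact support, Gaussian, power-law tail
    assert abs(total(q) - 1.0) < 1e-3
```

The three branches of Eq. (67) all integrate to unity within the discretization error.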
which precisely is the distribution obtained within nonextensive statistical mechanics through extremization of S_q. Also, the solution for α = 2, δ = 1, U(x) = −k₁x + (k₂/2)x² (∀k₁, and k₂ ≥ 0), and p(x, 0) = δ(x) is given by [188]

p_q(x,t) = e_q^{−β(t)[x − x_M(t)]²} / Z_q(t) ,   (72)

β(t)/β(0) = [Z_q(0)/Z_q(t)]² = [ (1 − K₂) e^{−t/τ} + K₂ ]^{−2/(3−q)} ,   (73)

K₂ ≡ 2(2−q) D β(0) [Z_q(0)]^{q−1} / k₂ ,   (74)

τ ≡ 1 / [k₂(3−q)] ,   (75)

x_M(t) = (k₁/k₂) + [x_M(0) − k₁/k₂] e^{−k₂ t} .   (76)
In the limit k₂ → 0, Eq. (73) becomes

Z_q(t) = { [Z_q(0)]^{3−q} + 2(2−q)(3−q) D β(0) [Z_q(0)]² t }^{1/(3−q)} ,   (77)

which, in the t → ∞ limit, yields

1/β(t) ∝ [Z_q(t)]² ∝ t^{2/(3−q)} .   (78)

In other words, ⟨x²⟩ scales like t^μ, with

μ = 2/(3−q) ;   (79)
hence, for q > 1 we have μ > 1 (i.e., superdiffusion; in particular, q = 2 yields μ = 2, i.e., ballistic diffusion), for q < 1 we have μ < 1 (i.e., subdiffusion; in particular, q → −∞ yields μ → 0, i.e., localization), and naturally, for q = 1, we obtain normal diffusion. Four systems are known for which results have been found that are consistent with prediction (79). These are the motion of Hydra viridissima [206], defect turbulence [73], simulation of
a silo drainage [22], and molecular dynamics of a many-body long-range-interacting classical system of rotators (α-XY model) [143]. For the first three, it has been found (q, μ) ≃ (3/2, 4/3). For the latter one, relation (79) has been verified for various situations corresponding to μ > 1. Finally, for the particular case δ = 1 and U(x) = 0, Eq. (68) becomes

∂p(x,t)/∂t = D ∂^α [p(x,t)]^{2−q} / ∂|x|^α   (0 < α ≤ 2; q < 3) .   (80)
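For α = 2, Eq. (80) is a porous-medium-type equation that is easy to integrate with explicit finite differences. The sketch below (our own illustration, with arbitrary grid parameters, not from the article) compares the spreading for q = 1 and q = 0, consistent with the anomalous-diffusion exponent μ = 2/(3−q) of Eq. (79):

```python
def variance_at(q, t_end=1.0, D=1.0, L=20.0, n=401, dt=5e-4):
    """Explicit finite differences for dp/dt = D d^2(p^(2-q))/dx^2
    (Eq. (80) with alpha = 2), from a narrow box mimicking p(x,0) = delta(x)."""
    dx = 2*L/(n - 1)
    x = [-L + i*dx for i in range(n)]
    p = [1.0 if abs(xi) < 0.5 else 0.0 for xi in x]
    mass = sum(p)*dx
    p = [pi/mass for pi in p]                 # normalize initial condition
    for _ in range(int(t_end/dt)):
        u = [pi**(2 - q) for pi in p]         # nonlinear flux variable p^(2-q)
        p = [pi + dt*D*(u[i-1] - 2*u[i] + u[i+1])/dx**2 if 0 < i < n - 1 else 0.0
             for i, pi in enumerate(p)]
    return sum(pi*xi*xi for pi, xi in zip(p, x))*dx

v_normal = variance_at(q=1.0)   # linear case: <x^2> = 2Dt, i.e. mu = 1
v_porous = variance_at(q=0.0)   # porous medium: <x^2> ~ t^(2/3), mu = 2/(3-q)
assert v_porous < v_normal              # subdiffusion spreads more slowly
assert abs(v_normal - 2.0) < 0.3        # close to 2Dt = 2 (plus initial width)
```

The time step respects the explicit-scheme stability bound for both runs; the q = 0 run visibly lags the linear one, as Eq. (79) predicts.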
The diffusion constant D just rescales time t. Only two parameters are therefore left, namely α and q. The linear case (i.e., q = 1) has two types of solutions: Gaussians for α = 2, and Lévy (or α-stable) distributions for 0 < α < 2. The case α = 2 corresponds to the Central Limit Theorem, where the N → ∞ attractor of the sums of N independent random variables with finite variance precisely is a Gaussian. The case 0 < α < 2 corresponds to the sometimes called Lévy–Gnedenko Central Limit Theorem, where the N → ∞ attractor of the sums of N independent random variables with infinite variance (and appropriate asymptotics) precisely is a Lévy distribution with index α. The nonlinear case (i.e., q ≠ 1) has solutions that are q-Gaussians for α = 2, and one might conjecture that, similarly, interesting solutions exist for 0 < α < 2. Furthermore, in analogy with the q = 1 case, one expects corresponding q-generalized Central Limit Theorems to exist [187]. This is precisely what we present in the next Section.

Standard and q-Generalized Central Limit Theorems

The q-Product

A generalization of the product, called the q-product, has recently been introduced (independently and virtually simultaneously) [43,125]. It is defined, for x ≥ 0 and y ≥ 0, as follows:

x ⊗_q y ≡ [x^{1−q} + y^{1−q} − 1]^{1/(1−q)}  if x^{1−q} + y^{1−q} > 1 ;  0 otherwise .   (81)

It has, among others, the following properties: it recovers the standard product as a particular instance, i.e.,

x ⊗₁ y = x y ;   (82)

it is commutative, i.e.,

x ⊗_q y = y ⊗_q x ;   (83)

it is additive under the q-logarithm, i.e.,

ln_q (x ⊗_q y) = ln_q x + ln_q y   (84)

(whereas we remind that ln_q (x y) = ln_q x + ln_q y + (1−q)(ln_q x)(ln_q y)); it has a (2−q)-duality/inverse property, i.e.,

1/(x ⊗_q y) = (1/x) ⊗_{2−q} (1/y) ;   (85)

it is associative, i.e.,

x ⊗_q (y ⊗_q z) = (x ⊗_q y) ⊗_q z = x ⊗_q y ⊗_q z = (x^{1−q} + y^{1−q} + z^{1−q} − 2)^{1/(1−q)} ;   (86)

it admits unity, i.e.,

x ⊗_q 1 = x ;   (87)

and, for q ≥ 1, also a zero, i.e.,

x ⊗_q 0 = 0   (q ≥ 1) .   (88)

The q-Fourier Transform

We shall introduce the q-Fourier transform of a quite generic function f(x) (x ∈ ℝ) as follows [140,189,202,203,204,205]:

F_q[f](ξ) ≡ ∫_{−∞}^{∞} dx e_q^{ixξ} ⊗_q f(x) = ∫_{−∞}^{∞} dx e_q^{ixξ[f(x)]^{q−1}} f(x) ,   (89)

where we have primarily focused on the case q ≥ 1. In contrast with the q = 1 case (standard Fourier transform), this integral transformation is nonlinear for q ≠ 1. It has a remarkable property, namely that the q-Fourier transform of a q-Gaussian is another q-Gaussian:

F_q[ N_q √β e_q^{−βx²} ](ξ) = e_{q₁}^{−β₁ξ²} ,   (90)

with

N_q =
  [(q−1)/π]^{1/2} Γ(1/(q−1)) / Γ((3−q)/(2(q−1)))   if 1 < q < 3 ;
  1/√π   if q = 1 ;
  [(1−q)/π]^{1/2} Γ((5−3q)/(2(1−q))) / Γ((2−q)/(1−q))   if q < 1 ,   (91)
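Properties (82)–(88) of the q-product (81) are straightforward to verify numerically; a minimal sketch (our own code, not from the article):

```python
import math

def ln_q(x, q):
    return math.log(x) if q == 1 else (x**(1 - q) - 1)/(1 - q)

def qprod(x, y, q):
    """q-product of Eq. (81), defined for x, y >= 0."""
    if q == 1:
        return x*y
    if q >= 1 and (x == 0 or y == 0):     # Eq. (88): zero exists for q >= 1
        return 0.0
    s = x**(1 - q) + y**(1 - q) - 1
    return s**(1/(1 - q)) if s > 0 else 0.0

q, x, y, z = 1.4, 2.0, 3.0, 1.5
assert abs(qprod(x, y, 1) - x*y) < 1e-12                               # (82)
assert abs(qprod(x, y, q) - qprod(y, x, q)) < 1e-12                    # (83)
assert abs(ln_q(qprod(x, y, q), q) - ln_q(x, q) - ln_q(y, q)) < 1e-10  # (84)
assert abs(1/qprod(x, y, q) - qprod(1/x, 1/y, 2 - q)) < 1e-10          # (85)
assert abs(qprod(x, qprod(y, z, q), q)
           - qprod(qprod(x, y, q), z, q)) < 1e-10                      # (86)
assert abs(qprod(x, 1, q) - x) < 1e-12                                 # (87)
assert qprod(x, 0.0, q) == 0.0                                         # (88)
```

The checks hold whenever the arguments keep x^{1−q} + y^{1−q} − 1 positive, as required by the definition (81).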
and

q₁ = z(q) ≡ (1+q)/(3−q) ,   (92)

β₁ = (3−q) N_q^{2(1−q)} / ( 8 β^{2−q} ) .   (93)

Equation (93) can be re-written as β^{√(2−q)} β₁^{1/√(2−q)} = [ (N_q^{2(1−q)} (3−q)) / 8 ]^{1/√(2−q)} ≡ K(q), which, for q = 1, recovers the well-known Heisenberg-uncertainty-principle-like relation β β₁ = 1/4. If we iterate n times the relation z(q) in Eq. (92), we obtain the following algebra:

q_n(q) = [2q + n(1−q)] / [2 + n(1−q)]   (n = 0, ±1, ±2, …) ,   (94)

which can be conveniently re-written as

2/(1 − q_n(q)) = 2/(1 − q) + n   (n = 0, ±1, ±2, …) .   (95)
(See Fig. 4.) We easily verify that q_n(1) = 1 (∀n), q_{±∞}(q) = 1 (∀q), as well as

1/q_{n+1} = 2 − q_{n−1} .   (96)
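Relations (94)–(96) can be checked exactly with rational arithmetic; a small sketch (our own code, not from the article):

```python
from fractions import Fraction

def q_n(q, n):
    """Eq. (94): q_n(q) = (2q + n(1-q)) / (2 + n(1-q))."""
    return (2*q + n*(1 - q)) / (2 + n*(1 - q))

q = Fraction(7, 5)        # an arbitrary q != 1, kept exact with rationals
for n in range(-4, 5):    # n = 5 would hit the pole 2 + n(1-q) = 0 for this q
    assert 2/(1 - q_n(q, n)) == 2/(1 - q) + n       # Eq. (95)
for n in range(-3, 4):
    assert 1/q_n(q, n + 1) == 2 - q_n(q, n - 1)     # Eq. (96)
assert q_n(q, 1) == (1 + q)/(3 - q)                 # n = 1 recovers z(q), Eq. (92)
```

Using Fraction makes the identities hold exactly, with no floating-point tolerance.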
This relation connects the so-called additive duality q → (2−q) and multiplicative duality q → 1/q, frequently emerging in all types of calculations in the literature. Moreover, we see from Eq. (95) that multiple values of q are expected to emerge in connection with diverse properties of nonextensive systems, i.e., of systems whose basic entropy is the nonadditive one S_q. Such is the case of the so-called q-triplet [185], observed for the first time in the magnetic field fluctuations of the solar wind, as revealed by the analysis of the data sent to NASA by the spacecraft Voyager 1 [48].

q-Independent Random Variables

Two random variables X [with density f_X(x)] and Y [with density f_Y(y)] having zero q-mean values (e.g., if f_X(x) and f_Y(y) are even functions) are said to be q-independent, with q₁ given by Eq. (92), if

F_q[X+Y](ξ) = F_q[X](ξ) ⊗_{q₁} F_q[Y](ξ) ,
(97)
i.e., if

∫_{−∞}^{∞} dz e_q^{izξ} ⊗_q f_{X+Y}(z) = [ ∫_{−∞}^{∞} dx e_q^{ixξ} ⊗_q f_X(x) ] ⊗_{(1+q)/(3−q)} [ ∫_{−∞}^{∞} dy e_q^{iyξ} ⊗_q f_Y(y) ] ,   (98)

with

f_{X+Y}(z) = ∫_{−∞}^{∞} dx ∫_{−∞}^{∞} dy h(x,y) δ(x + y − z) = ∫_{−∞}^{∞} dx h(x, z−x) = ∫_{−∞}^{∞} dy h(z−y, y) ,   (99)

where h(x,y) is the joint density.

Entropy, Figure 4 The q-dependence of q_n(q) ≡ q_{2,n}(q)

Clearly, q-independence means independence for q = 1 (i.e., h(x,y) = f_X(x) f_Y(y)), and implies a special correlation for q ≠ 1. Although the full understanding of this correlation is still in progress, q-independence appears to be consistent with scale-invariance.

q-Generalized Central Limit Theorems

It is beyond the scope of the present survey to provide the details of the complex proofs of the q-generalized central limit theorems. We shall restrict ourselves to the presentation of their structure. Let us start by introducing a notation which is important for what follows. A distribution is said to be a (q,α)-stable distribution L_{q,α}(x) if its q-Fourier transform ℒ_{q,α}(ξ) is of the form

ℒ_{q,α}(ξ) = a e_{q₁}^{−b|ξ|^α}   [a > 0; b > 0; 0 < α ≤ 2; q₁ = (q+1)/(3−q)] .   (100)
Consistently, L_{1,2} are Gaussians, L_{1,α} are Lévy distributions, and L_{q,2} are q-Gaussians. We are seeking the N → ∞ attractor associated with the sum of N identical and distinguishable random variables, each of them associated with one and the same arbitrary symmetric distribution f(x). The random variables are independent for q = 1, and correlated in a special manner for q ≠ 1. To obtain the N → ∞ invariant distribution, i.e., the attractor, the sum must be rescaled, i.e., divided by N^δ, where

δ = 1 / [α(2−q)] .   (101)

For (α, q) = (2, 1), we recover the traditional 1/√N rescaling of Brownian motion. At the present stage, the theorems have been established for q ≥ 1 and are summarized in Table 1. The case q < 1 is still open at the time at which these lines are being written. Two q < 1 cases have been preliminarily explored numerically in [124] and in [171]. The numerics seemed to indicate that the N → ∞ limits would be q-Gaussians for both models. However, it has been analytically shown [94] that this is not exactly so. The limiting distributions are amazingly close to q-Gaussians, but they are in fact different. Very recently, another simple scale-invariant model
has been introduced [153], whose attractor has been analytically shown to be a q-Gaussian. These q ≠ 1 theorems play for the nonadditive entropy S_q and nonextensive statistical mechanics the same grounding role that the well-known q = 1 theorems play for the additive entropy S_BG and BG statistical mechanics. In particular, interestingly enough, the ubiquity of Gaussians and of q-Gaussians in natural, artificial and social systems may be understood on equal footing.

Future Directions

The concept of entropy permeates virtually all quantitative sciences. The future directions could therefore be very varied. If we restrict ourselves, however, to the evidence presently available, the main lines along which evolution occurs are:

Networks

Many of the so-called scale-free networks, among others, systematically exhibit a degree distribution p(k) (probability of a node having k links) which is of the form

p(k) ∝ 1 / (k₀ + k)^γ   (γ > 0; k₀ > 0) ,   (102)
or, equivalently,
Entropy, Table 1 The attractors corresponding to the four basic cases, where the N variables being summed are q-independent (i.e., globally correlated) with q₁ = (1+q)/(3−q); σ_Q ≡ (∫_{−∞}^{∞} dx x² [f(x)]^Q)/(∫_{−∞}^{∞} dx [f(x)]^Q) with Q ≡ 2q − 1. The attractor for (q,α) = (1,2) is a Gaussian G(x) ≡ L_{1,2} (standard Central Limit Theorem); for q = 1 and 0 < α < 2, it is a Lévy distribution L_α ≡ L_{1,α} (the so-called Lévy–Gnedenko limit theorem); for α = 2 and q ≠ 1, it is a q-Gaussian G_q ≡ L_{q,2} (the q-Central Limit Theorem [203]); finally, for q ≠ 1 and 0 < α < 2, it is a generic (q,α)-stable distribution L_{q,α} [204,205]. See [140,189] for typical illustrations of the four types of attractors. The distribution L_α(x) remains, for 1 < α < 2, close to a Gaussian for |x| up to about x_c(1,α), where it makes a crossover to a power-law. The distribution G_q(x) remains, for q > 1, close to a Gaussian for |x| up to about x_c(q,2), where it makes a crossover to a power-law. The distribution L_{q,α}(x) remains, for q > 1 and α < 2, close to a Gaussian for |x| up to about x_c^{(1)}(q,α), where it makes a crossover to a power-law (intermediate regime), which lasts further up to about x_c^{(2)}(q,α), where it makes a second crossover to another power-law (distant regime).

σ_Q < ∞ (α = 2):
  q = 1 [independent]: G(x) [with same σ₁ as f(x)]
  q ≠ 1 (i.e., Q ≠ 1) [globally correlated]: G_q(x) = G_{(3q₁−1)/(1+q₁)}(x) [with same σ_Q as f(x)];
    G_q(x) ∼ G(x) if |x| ≪ x_c(q,2); G_q(x) ∼ C_{q,2}/|x|^{2/(q−1)} if |x| ≫ x_c(q,2), for q > 1, with lim_{q→1} x_c(q,2) = ∞

σ_Q → ∞ (0 < α < 2):
  q = 1 [independent]: L_α(x) [with same |x| → ∞ behavior as f(x)];
    L_α(x) ∼ G(x) if |x| ≪ x_c(1,α); L_α(x) ∼ C_{1,α}/|x|^{1+α} if |x| ≫ x_c(1,α), with lim_{α→2} x_c(1,α) = ∞
  q ≠ 1 (α < 2) [globally correlated]: L_{q,α}(x) [with same |x| → ∞ behavior as f(x)];
    (intermediate regime) L_{q,α} ∼ C_{q,α}/|x|^{[α(3−q)−2(1−q)]/[2(q−1)]} if x_c^{(1)}(q,α) ≪ |x| ≪ x_c^{(2)}(q,α);
    (distant regime) L_{q,α} ∼ C′_{q,α}/|x|^{(1+α)/[1+α(q−1)]} if |x| ≫ x_c^{(2)}(q,α), with lim_{α→2} x_c^{(1)}(q,α) = ∞
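The (q, α) = (1, 2) cell of Table 1 (independent variables with finite variance, Gaussian attractor) can be illustrated with the N^{−δ} = N^{−1/2} rescaling of Eq. (101). A minimal numeric sketch (our own code; seed and sample sizes are arbitrary):

```python
import random
import statistics

random.seed(42)
N, samples = 100, 20000
# Sums of N independent centered uniforms, rescaled by N**(1/2),
# i.e. delta = 1/(alpha*(2-q)) = 1/2 for (alpha, q) = (2, 1).
Z = [sum(random.uniform(-1, 1) for _ in range(N)) / N**0.5
     for _ in range(samples)]
m = statistics.fmean(Z)
v = statistics.pvariance(Z, m)
k = statistics.fmean([((z - m)/v**0.5)**4 for z in Z])
assert abs(m) < 0.02           # mean ~ 0
assert abs(v - 1/3) < 0.02     # variance of uniform(-1, 1) is 1/3
assert abs(k - 3.0) < 0.15     # fourth standardized moment ~ 3 (Gaussian)
```

The rescaled sums keep a fixed variance while their fourth moment approaches the Gaussian value, as the standard Central Limit Theorem dictates.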
Entropy, Figure 5 Snapshot of a nongrowing dynamic network with N D 256 nodes (see details in [172], by courtesy of the author)
p(k) ∝ e_q^{−k/κ}   (q ≥ 1; κ > 0) ,   (103)
with γ = 1/(q−1) and k₀ = κ/(q−1) (see Figs. 5 and 6). This is not surprising since, if we associate to each link an "energy" (or cost) and to each node half of the "energy" carried by its links (the other half being associated with the other nodes to which any specific node is linked), the distribution of energies optimizing S_q precisely coincides with the degree distribution. If, for any reason, we consider k as the modulus of a d-dimensional vector k, the optimization of the functional S_q[p(k)] may lead to p(k) ∝ k^δ e_q^{−k/κ}, where k^δ plays the role of a density of states, δ(d) being either zero (which reproduces Eq. (103)), positive, or negative. Several examples [12,39,76,91,165,172,173,212,213] already exist in the literature; in particular, the Barabasi–Albert universality class γ = 3 corresponds to q = 4/3. A deeper understanding of this connection might enable the systematic calculation of several meaningful properties of networks.

Nonlinear dynamical systems, self-organized criticality, and cellular automata

Various interesting phenomena emerge in both low- and high-dimensional weakly chaotic deterministic dynamical systems, either dissipative or conservative. Among these phenomena we have the sensitivity to the initial conditions and the entropy production, which have been briefly addressed in Eq. (37) and related papers. But there
Entropy, Figure 6 Nongrowing dynamic network: (a) Cumulative degree distribution for typical values of the number N of nodes; (b) Same data as in (a) in the convenient representation linear q-log versus linear, with Z_q(k) ≡ ln_q[P_q(>k)] ≡ ([P_q(>k)]^{1−q} − 1)/(1−q) (the optimal fitting with a q-exponential is obtained for the value of q which has the highest value of the linear correlation r, as indicated in the inset; here this is q_c = 1.84, which corresponds to the slope −1.19 in (a)). See details in [172,173]
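The equivalence between the two forms (102) and (103) of the degree distribution, with γ = 1/(q−1) and k₀ = κ/(q−1), follows directly from the definition of the q-exponential; a quick check (our own sketch, with arbitrary parameter values):

```python
def eq_exp(x, q):
    """q-exponential e_q^x = [1 + (1-q)x]^(1/(1-q)) (used here with q > 1, x <= 0)."""
    return (1 + (1 - q)*x)**(1/(1 - q))

q, kappa = 4/3, 2.0                      # Barabasi-Albert class: gamma = 3 <-> q = 4/3
gamma, k0 = 1/(q - 1), kappa/(q - 1)
for k in (0.0, 1.0, 5.0, 50.0):
    lhs = eq_exp(-k/kappa, q)
    rhs = k0**gamma / (k0 + k)**gamma    # 1/(k0+k)^gamma, normalized to 1 at k = 0
    assert abs(lhs - rhs) < 1e-10
```

Both forms agree point by point, so fitting either (102) or (103) to network data is a matter of convenience.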
is much more, such as relaxation, escape, glassy states, and distributions associated with the stationary state [14,15,31,46,62,67,68,77,103,111,122,123,154,170,174,176,177,179]. Also, recent numerical indications suggest the validity of a dynamical version of the q-generalized central limit theorem [175]. The study of the possible connections among all these properties is still in its infancy.

Long-range-interacting many-body Hamiltonians

A wide class of long-range-interacting N-body classical Hamiltonians exhibits collective states whose Lyapunov spectrum has a maximal value that vanishes in the N → ∞ limit. As such, they constitute natural
Entropy, Figure 7 Distribution of velocities for the HMF model at the quasi-stationary state (whose duration appears to diverge when N ! 1). The blue curves indicate a Gaussian, for comparison. See details in [137]
candidates for studying whether the concepts derived from the nonadditive entropy S_q are applicable. A variety of properties have been calculated, through molecular dynamics, for various systems, such as Lennard-Jones-like fluids, XY and Heisenberg ferromagnets, gravitational-like models, and others. One or more long-standing quasi-stationary states (infinitely long-standing in the limit N → ∞) are typically observed before the terminal entrance into thermal equilibrium. Properties such as the distribution of velocities and angles, correlation functions, Lyapunov spectrum, metastability, glassy states, aging, time-dependence of the temperature in isolated systems, energy whenever thermal contact with a large thermostat at a given temperature is allowed, diffusion, order parameter, and others, are typically focused on. An ongoing debate exists, also involving Vlasov-like equations, Lynden–Bell statistics, among others. The breakdown of ergodicity that emerges in various situations makes the whole discussion rich and complex. Current research activity in this area is illustrated in papers such as [21,26,45,53,56,57,63,104,119,121,126,127,132,133,134,135,136,142,169,200]. A quite remarkable molecular-dynamical result has been obtained for a paradigmatic long-range Hamiltonian: the distribution of time-averaged velocities differs appreciably from that found for the ensemble-averaged velocities, and has been shown to be numerically consistent with a q-Gaussian [137], as shown in Fig. 7. This result provides strong support to a conjecture made long ago: see Fig. 4 at p. 8 of [157].
Entropy, Figure 8 Quantum Monte Carlo simulations in [81]: a Velocity distribution (superimposed with a q-Gaussian); b Index q (superimposed with Lutz prediction [110], by courtesy of the authors)
such as fractional derivatives, nonlinearities, and space-dependent diffusion coefficients are being focused on, as well as their connections to entropic forms and associated generalized Langevin equations [20,23,24,70,128,168,214]. Quite recently, computational (see Fig. 8) and experimental (see Fig. 9) verifications of Lutz' 2003 prediction [110] have been exhibited [81], namely of the q-Gaussian form of the velocity distribution of cold atoms in dissipative optical lattices, with q = 1 + 44E_R/U₀ (E_R and U₀ being energy parameters of the optical lattice).

Entropy, Figure 9 Experiments in [81]: (a) Velocity distribution (superimposed with a q-Gaussian); (b) Index q as a function of the frequency; (c) Velocity distribution (superimposed with a q-Gaussian; the red curve is a Gaussian); (d) Tail of the velocity distribution (superimposed with the asymptotic power-law of a q-Gaussian). [By courtesy of the authors]

These experimental verifications are at variance with some of those exhibited previously [96], namely double-Gaussians. Although it is naturally possible that the experimental conditions were not exactly equivalent, this interesting question remains open at the present time. A hint might be hidden in the recent results [62] obtained for a quite different problem, namely the size distributions of avalanches; indeed, at a critical state, a q-Gaussian shape was obtained, whereas, at a noncritical state, a double-Gaussian was observed.
Quantum entanglement and quantum chaos The nonlocal nature of quantum physics implies phenomena that are somewhat analogous to those originated by classical long-range interactions. Consequently, a variety of studies are being developed in connection with the entropy Sq [3,36,58,59,60,61,155,156,195]. The same happens with some aspects of quantum chaos [11,180,210,211]. Astrophysics, geophysics, economics, linguistics, cognitive psychology, and other interdisciplinary applications Applications are available and presently searched in many areas of physics (plasmas, turbulence, nuclear collisions, elementary particles, manganites), but also
in interdisciplinary sciences such as astrophysics [38,47,48,49,78,84,87,101,109,129,196], geophysics [4,5,6,7,8,9,10,62,208], economics [25,50,51,52,80,139,141,197,215], linguistics [118], cognitive psychology [181], and others.

Global optimization, image and signal processing

Optimizing algorithms and related techniques for signal and image processing are currently being developed using the entropic concepts presented in this article [17,35,72,75,95,105,114,120,164,166,191].

Superstatistics and other generalizations

The methods discussed here have been generalized along a variety of lines. These include Beck–Cohen superstatistics [32,34,65,190], crossover statistics [194,196], and spectral statistics [201]. Also, a huge variety of entropies have been introduced which generalize the BG entropy in different manners, or even focus on other possibilities. Their number being nowadays over forty, we mention here just a few of them: see [18,44,69,98,99,115].
Acknowledgments

Among the very many colleagues to whom I am deeply grateful for profound and long-lasting comments over many years, it is a must to explicitly thank S. Abe, E.P. Borges, E.G.D. Cohen, E.M.F. Curado, M. Gell-Mann, R.S. Mendes, A. Plastino, A.R. Plastino, A.K. Rajagopal, A. Rapisarda and A. Robledo.
Bibliography 1. Abe S (2000) Axioms and uniqueness theorem for Tsallis entropy. Phys Lett A 271:74–79 2. Abe S (2002) Stability of Tsallis entropy and instabilities of Renyi and normalized Tsallis entropies: A basis for q-exponential distributions. Phys Rev E 66:046134 3. Abe S, Rajagopal AK (2001) Nonadditive conditional entropy and its significance for local realism. Physica A 289:157–164 4. Abe S, Suzuki N (2003) Law for the distance between successive earthquakes. J Geophys Res (Solid Earth) 108(B2):2113 5. Abe S, Suzuki N (2004) Scale-free network of earthquakes. Europhys Lett 65:581–586 6. Abe S, Suzuki N (2005) Scale-free statistics of time interval between successive earthquakes. Physica A 350:588–596 7. Abe S, Suzuki N (2006) Complex network of seismicity. Prog Theor Phys Suppl 162:138–146 8. Abe S, Suzuki N (2006) Complex-network description of seismicity. Nonlinear Process Geophys 13:145–150 9. Abe S, Sarlis NV, Skordas ES, Tanaka H, Varotsos PA (2005) Optimality of natural time representation of complex time series. Phys Rev Lett 94:170601 10. Abe S, Tirnakli U, Varotsos PA (2005) Complexity of seismicity and nonextensive statistics. Europhys News 36:206–208
11. Abul AY-M (2005) Nonextensive random matrix theory approach to mixed regular-chaotic dynamics. Phys Rev E 71:066207 12. Albert R, Barabasi AL (2000) Phys Rev Lett 85:5234–5237 13. Alemany PA, Zanette DH (1994) Fractal random walks from a variational formalism for Tsallis entropies. Phys Rev E 49:R956–R958 14. Ananos GFJ, Tsallis C (2004) Ensemble averages and nonextensivity at the edge of chaos of one-dimensional maps. Phys Rev Lett 93:020601 15. Ananos GFJ, Baldovin F, Tsallis C (2005) Anomalous sensitivity to initial conditions and entropy production in standard maps: Nonextensive approach. Euro Phys J B 46:409–417 16. Andrade RFS, Pinho STR (2005) Tsallis scaling and the longrange Ising chain: A transfer matrix approach. Phys Rev E 71:026126 17. Andricioaei I, Straub JE (1996) Generalized simulated annealing algorithms using Tsallis statistics: Application to conformational optimization of a tetrapeptide. Phys Rev E 53:R3055–R3058 18. Anteneodo C, Plastino AR (1999) Maximum entropy approach to stretched exponential probability distributions. J Phys A 32:1089–1097 19. Anteneodo C, Tsallis C (1998) Breakdown of exponential sensitivity to initial conditions: Role of the range of interactions. Phys Rev Lett 80:5313–5316 20. Anteneodo C, Tsallis C (2003) Multiplicative noise: A mechanism leading to nonextensive statistical mechanics. J Math Phys 44:5194–5203 21. Antoniazzi A, Fanelli D, Barre J, Chavanis P-H, Dauxois T, Ruffo S (2007) Maximum entropy principle explains quasi-stationary states in systems with long-range interactions: The example of the Hamiltonian mean-field model. Phys Rev E 75:011112 22. Arevalo R, Garcimartin A, Maza D (2007) A non-standard statistical approach to the silo discharge. Eur Phys J Special Topics 143:191–197 23. Assis PC Jr, da Silva LR, Lenzi EK, Malacarne LC, Mendes RS (2005) Nonlinear diffusion equation, Tsallis formalism and exact solutions. J Math Phys 46:123303 24. 
Assis PC Jr, da Silva PC, da Silva LR, Lenzi EK, Lenzi MK (2006) Nonlinear diffusion equation and nonlinear external force: Exact solution. J Math Phys 47:103302 25. Ausloos M, Ivanova K (2003) Dynamical model and nonextensive statistical mechanics of a market index on large time windows. Phys Rev E 68:046122 26. Baldovin F, Orlandini E (2006) Incomplete equilibrium in longrange interacting systems. Phys Rev Lett 97:100601 27. Baldovin F, Robledo A (2002) Sensitivity to initial conditions at bifurcations in one-dimensional nonlinear maps: Rigorous nonextensive solutions. Europhys Lett 60:518–524 28. Baldovin F, Robledo A (2002) Universal renormalizationgroup dynamics at the onset of chaos in logistic maps and nonextensive statistical mechanics. Phys Rev E 66:R045104 29. Baldovin F, Robledo A (2004) Nonextensive Pesin identity. Exact renormalization group analytical results for the dynamics at the edge of chaos of the logistic map. Phys Rev E 69:R045202 30. Baldovin F, Robledo A (2005) Parallels between the dynamics at the noise-perturbed onset of chaos in logistic maps and the dynamics of glass formation. Phys Rev E 72:066213
31. Baldovin F, Moyano LG, Majtey AP, Robledo A, Tsallis C (2004) Ubiquity of metastable-to-stable crossover in weakly chaotic dynamical systems. Physica A 340:205–218 32. Beck C, Cohen EGD (2003) Superstatistics. Physica A 322:267– 275 33. Beck C, Schlogl F (1993) Thermodynamics of Chaotic Systems. Cambridge University Press, Cambridge 34. Beck C, Cohen EGD, Rizzo S (2005) Atmospheric turbulence and superstatistics. Europhys News 36:189–191 35. Ben A Hamza (2006) Nonextensive information-theoretic measure for image edge detection. J Electron Imaging 15: 013011 36. Batle J, Plastino AR, Casas M, Plastino A (2004) Inclusion relations among separability criteria. J Phys A 37:895–907 37. Batle J, Casas M, Plastino AR, Plastino A (2005) Quantum entropies and entanglement. Intern J Quantum Inf 3:99–104 38. Bernui A, Tsallis C, Villela T (2007) Deviation from Gaussianity in the cosmic microwave background temperature fluctuations. Europhys Lett 78:19001 39. Boccaletti S, Latora V, Moreno Y, Chavez M, Hwang D-U (2006) Phys Rep 424:175–308 40. Bologna M, Tsallis C, Grigolini P (2000) Anomalous diffusion associated with nonlinear fractional derivative FokkerPlanck-like equation: Exact time-dependent solutions. Phys Rev E 62:2213–2218 41. Boltzmann L (1896) Vorlesungen über Gastheorie. Part II, ch I, paragraph 1. Leipzig, p 217; (1964) Lectures on Gas Theory (trans: Brush S). Univ. California Press, Berkeley 42. Boon JP, Tsallis C (eds) (2005) Nonextensive Statistical Mechanics: New Trends, New Perspectives. Europhysics News 36(6):185–231 43. Borges EP (2004) A possible deformed algebra and calculus inspired in nonextensive thermostatistics. Physica A 340:95– 101 44. Borges EP, Roditi I (1998) A family of non-extensive entropies. Phys Lett A 246:399–402 45. Borges EP, Tsallis C (2002) Negative specific heat in a Lennard–Jones-like gas with long-range interactions. Physica A 305:148–151 46. 
Borges EP, Tsallis C, Ananos GFJ, Oliveira PMC (2002) Nonequilibrium probabilistic dynamics at the logistic map edge of chaos. Phys Rev Lett 89:254103 47. Burlaga LF, Vinas AF (2004) Multiscale structure of the magnetic field and speed at 1 AU during the declining phase of solar cycle 23 described by a generalized Tsallis PDF. J Geophys Res Space – Phys 109:A12107 48. Burlaga LF, Vinas AF (2005) Triangle for the entropic index q of non-extensive statistical mechanics observed by Voyager 1 in the distant heliosphere. Physica A 356:375–384 49. Burlaga LF, Ness NF, Acuna MH (2006) Multiscale structure of magnetic fields in the heliosheath. J Geophys Res Space – Phys 111:A09112 50. Borland L (2002) A theory of non-gaussian option pricing. Quant Finance 2:415–431 51. Borland L (2002) Closed form option pricing formulas based on a non-Gaussian stock price model with statistical feedback. Phys Rev Lett 89:098701 52. Borland L, Bouchaud J-P (2004) A non-Gaussian option pricing model with skew. Quant Finance 4:499–514
53. Cabral BJC, Tsallis C (2002) Metastability and weak mixing in classical long-range many-rotator system. Phys Rev E 66:065101(R) 54. Calabrese P, Cardy J (2004) JSTAT – J Stat Mech Theory Exp P06002 55. Callen HB (1985) Thermodynamics and An Introduction to Thermostatistics, 2nd edn. Wiley, New York 56. Chavanis PH (2006) Lynden-Bell and Tsallis distributions for the HMF model. Euro Phys J B 53:487–501 57. Chavanis PH (2006) Quasi-stationary states and incomplete violent relaxation in systems with long-range interactions. Physica A 365:102–107 58. Canosa N, Rossignoli R (2005) General non-additive entropic forms and the inference of quantum density operstors. Physica A 348:121–130 59. Cannas SA, Tamarit FA (1996) Long-range interactions and nonextensivity in ferromagnetic spin models. Phys Rev B 54:R12661–R12664 60. Caruso F, Tsallis C (2007) Extensive nonadditive entropy in quantum spin chains. In: Abe S, Herrmann HJ, Quarati P, Rapisarda A, Tsallis C (eds) Complexity, Metastability and Nonextensivity. American Institute of Physics Conference Proceedings, vol 965. New York, pp 51–59 61. Caruso F, Tsallis C (2008) Nonadditive entropy reconciles the area law in quantum systems with classical thermodynamics. Phys Rev E 78:021101 62. Caruso F, Pluchino A, Latora V, Vinciguerra S, Rapisarda A (2007) Analysis of self-organized criticality in the Olami– Feder–Christensen model and in real earthquakes. Phys Rev E 75:055101(R) 63. Campa A, Giansanti A, Moroni D (2002) Metastable states in a class of long-range Hamiltonian systems. Physica A 305:137–143 64. Csiszar I (1978) Information measures: A critical survey. In: Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, and the European Meeting of Statisticians, 1974. Reidel, Dordrecht 65. Cohen EGD (2005) Boltzmann and Einstein: Statistics and dynamics – An unsolved problem. Boltzmann Award Lecture at Statphys-Bangalore-2004. Pramana 64:635–643 66. 
Condat CA, Rangel J, Lamberti PW (2002) Anomalous diffusion in the nonasymptotic regime. Phys Rev E 65:026138 67. Coraddu M, Meloni F, Mezzorani G, Tonelli R (2004) Weak insensitivity to initial conditions at the edge of chaos in the logistic map. Physica A 340:234–239 68. Costa UMS, Lyra ML, Plastino AR, Tsallis C (1997) Power-law sensitivity to initial conditions within a logistic-like family of maps: Fractality and nonextensivity. Phys Rev E 56:245–250 69. Curado EMF (1999) General aspects of the thermodynamical formalism. Braz J Phys 29:36–45 70. Curado EMF, Nobre FD (2003) Derivation of nonlinear FokkerPlanck equations by means of approximations to the master equation. Phys Rev E 67:021107 71. Curado EMF, Tsallis C (1991) Generalized statistical mechanics: connection with thermodynamics. Phys J A 24:L69-L72; [Corrigenda: 24:3187 (1991); 25:1019 (1992)] 72. Cvejic N, Canagarajah CN, Bull DR (2006) Image fusion metric based on mutual information and Tsallis entropy. Electron Lett 42:11
Entropy
73. Daniels KE, Beck C, Bodenschatz E (2004) Defect turbulence and generalized statistical mechanics. Physica D 193:208–217 74. Daroczy Z (1970) Information and Control 16:36 75. de Albuquerque MP, Esquef IA, Mello ARG, de Albuquerque MP (2004) Image thresholding using Tsallis entropy. Pattern Recognition Lett 25:1059–1065 76. de Meneses MDS, da Cunha SD, Soares DJB, da Silva LR (2006) In: Sakagami M, Suzuki N, Abe S (eds) Complexity and Nonextensivity: New Trends in Statistical Mechanics. Prog Theor Phys Suppl 162:131–137 77. de Moura FABF, Tirnakli U, Lyra ML (2000) Convergence to the critical attractor of dissipative maps: Log-periodic oscillations, fractality and nonextensivity. Phys Rev E 62:6361–6365 78. de Oliveira HP, Soares ID, Tonini EV (2004) Role of the nonextensive statistics in a three-degrees of freedom gravitational system. Phys Rev D 70:084012 79. de Souza AMC, Tsallis C (1997) Student’s t- and r-distributions: Unified derivation from an entropic variational principle. Physica A 236:52–57 80. de Souza J, Moyano LG, Queiros SMD (2006) On statistical properties of traded volume in financial markets. Euro Phys J B 50:165–168 81. Douglas P, Bergamini S, Renzoni F (2006) Tunable Tsallis distributions in dissipative optical lattices. Phys Rev Lett 96:110601 82. Fermi E (1936) Thermodynamics. Dover, New York, p 53 83. Ferri GL, Martinez S, Plastino A (2005) Equivalence of the four versions of Tsallis’ statistics. J Stat Mech P04009 84. Ferro F, Lavagno A, Quarati P (2004) Non-extensive resonant reaction rates in astrophysical plasmas. Euro Phys J A 21:529– 534 85. Fulco UL, da Silva LR, Nobre FD, Rego HHA, Lucena LS (2003) Effects of site dilution on the one-dimensional long-range bond-percolation problem. Phys Lett A 312:331–335 86. Frank TD (2005) Nonlinear Fokker–Planck Equations – Fundamentals and Applications. Springer, Berlin 87. Gervino G, Lavagno A, Quarati P (2005) CNO reaction rates and chemical abundance variations in dense stellar plasma. 
J Phys G 31:S1865–S1868 88. Gibbs JW (1902) Elementary Principles in Statistical Mechanics – Developed with Especial Reference to the Rational Foundation of Thermodynamics. C Scribner, New York; Yale University Press, New Haven, 1948; OX Bow Press, Woodbridge, Connecticut, 1981 89. Ginsparg P, Moore G (1993) Lectures on 2D Gravity and 2D String Theory. Cambridge University Press, Cambridge; hepth/9304011, p 65 90. Grigera JR (1996) Extensive and non-extensive thermodynamics. A molecular dyanamic test. Phys Lett A 217:47–51 91. Hasegawa H (2006) Nonextensive aspects of small-world networks. Physica A 365:383–401 92. Havrda J, Charvat F (1967) Kybernetika 3:30 93. Hernandez H-S, Robledo A (2006) Fluctuating dynamics at the quasiperiodic onset of chaos, Tsallis q-statistics and Mori’s qphase thermodynamics. Phys A 370:286–300 94. Hilhorst HJ, Schehr G (2007) A note on q-Gaussians and nonGaussians in statistical mechanics. J Stat Mech P06003 95. Jang S, Shin S, Pak Y (2003) Replica-exchange method using the generalized effective potential. Phys Rev Lett 91:058305
96. Jersblad J, Ellmann H, Stochkel K, Kastberg A, Sanchez L-P, Kaiser R (2004) Non-Gaussian velocity distributions in optical lattices. Phys Rev A 69:013410 97. Jund P, Kim SG, Tsallis C (1995) Crossover from extensive to nonextensive behavior driven by long-range interactions. Phys Rev B 52:50–53 98. Kaniadakis G (2001) Non linear kinetics underlying generalized statistics. Physica A 296:405–425 99. Kaniadakis G, Lissia M, Scarfone AM (2004) Deformed logarithms and entropies. Physica A 340:41–49 100. Khinchin AI (1953) Uspekhi Matem. Nauk 8:3 (Silverman RA, Friedman MD, trans. Math Found Inf Theory. Dover, New York) 101. Kronberger T, Leubner MP, van Kampen E (2006) Dark matter density profiles: A comparison of nonextensive statistics with N-body simulations. Astron Astrophys 453:21–25 102. Latora V, Baranger M (1999) Kolmogorov-Sinai entropy rate versus physical entropy. Phys Rev Lett 82:520–523 103. Latora V, Baranger M, Rapisarda A, Tsallis C (2000) The rate of entropy increase at the edge of chaos. Phys Lett A 273:97–103 104. Latora V, Rapisarda A, Tsallis C (2001) Non-Gaussian equilibrium in a long-range Hamiltonian system. Phys Rev E 64:056134 105. Lemes MR, Zacharias CR, Dal Pino A Jr (1997) Generalized simulated annealing: Application to silicon clusters. Phys Rev B 56:9279–9281 106. Lenzi EK, Anteneodo C, Borland L (2001) Escape time in anomalous diffusive media. Phys Rev E 63:051109 107. Lesche B (1982) Instabilities of Rényi entropies. J Stat Phys 27:419–422 108. Lindhard J, Nielsen V (1971) Studies in statistical mechanics. Det Kongelige Danske Videnskabernes Selskab Matematiskfysiske Meddelelser (Denmark) 38(9):1–42 109. Lissia M, Quarati P (2005) Nuclear astrophysical plasmas: Ion distributions and fusion rates. Europhys News 36:211–214 110. Lutz E (2003) Anomalous diffusion and Tsallis statistics in an optical lattice. Phys Rev A 67:051402(R) 111. Lyra ML, Tsallis C (1998) Nonextensivity and multifractality in low-dimensional dissipative systems. 
Phys Rev Lett 80:53–56 112. Mann GM, Tsallis C (eds) (2004) Nonextensive Entropy – Interdisciplinary Applications. Oxford University Press, New York 113. Marsh JA, Fuentes MA, Moyano LG, Tsallis C (2006) Influence of global correlations on central limit theorems and entropic extensivity. Physica A 372:183–202 114. Martin S, Morison G, Nailon W, Durrani T (2004) Fast and accurate image registration using Tsallis entropy and simultaneous perturbation stochastic approximation. Electron Lett 40(10):20040375 115. Masi M (2005) A step beyond Tsallis and Renyi entropies. Phys Lett A 338:217–224 116. Mayoral E, Robledo A (2004) Multifractality and nonextensivity at the edge of chaos of unimodal maps. Physica A 340:219–226 117. Mayoral E, Robledo A (2005) Tsallis’ q index and Mori’s q phase transitions at edge of chaos. Phys Rev E 72:026209 118. Montemurro MA (2001) Beyond the Zipf–Mandelbrot law in quantitative linguistics. Physica A 300:567–578 119. Montemurro MA, Tamarit F, Anteneodo C (2003) Aging in an infinite-range Hamiltonian system of coupled rotators. Phys Rev E 67:031106
961
962
Entropy
120. Moret MA, Pascutti PG, Bisch PM, Mundim MSP, Mundim KC (2006) Classical and quantum conformational analysis using Generalized Genetic Algorithm. Phys A 363:260–268 121. Moyano LG, Anteneodo C (2006) Diffusive anomalies in a long-range Hamiltonian system. Phys Rev E 74:021118 122. Moyano LG, Majtey AP, Tsallis C (2005) Weak chaos in large conservative system – Infinite-range coupled standard maps. In: Beck C, Benedek G, Rapisarda A, Tsallis C (eds) Complexity, Metastability and Nonextensivity. World Scientific, Singapore, pp 123–127 123. Moyano LG, Majtey AP, Tsallis C (2006) Weak chaos and metastability in a symplectic system of many long-range-coupled standard maps. Euro Phys J B 52:493–500 124. Moyano LG, Tsallis C, Gell-Mann M (2006) Numerical indications of a q-generalised central limit theorem. Europhys Lett 73:813–819 125. Nivanen L, Le Mehaute A, Wang QA (2003) Generalized algebra within a nonextensive statistics. Rep Math Phys 52:437– 444 126. Nobre FD, Tsallis C (2003) Classical infinite-range-interaction Heisenberg ferromagnetic model: Metastability and sensitivity to initial conditions. Phys Rev E 68:036115 127. Nobre FD, Tsallis C (2004) Metastable states of the classical inertial infinite-range-interaction Heisenberg ferromagnet: Role of initial conditions. Physica A 344:587–594 128. Nobre FD, Curado EMF, Rowlands G (2004) A procedure for obtaining general nonlinear Fokker-Planck equations. Physica A 334:109–118 129. Oliveira HP, Soares ID (2005) Dynamics of black hole formation: Evidence for nonextensivity. Phys Rev D 71:124034 130. Penrose O (1970) Foundations of Statistical Mechanics: A Deductive Treatment. Pergamon Press, Oxford, p 167 131. Plastino AR, Plastino A (1995) Non-extensive statistical mechanics and generalized Fokker-Planck equation. Physica A 222:347–354 132. Pluchino A, Rapisarda A (2006) Metastability in the Hamiltonian Mean Field model and Kuramoto model. Physica A 365:184–189 133. 
Pluchino A, Rapisarda A (2006) Glassy dynamics and nonextensive effects in the HMF model: the importance of initial conditions. In: Sakagami M, Suzuki N, Abe S (eds) Complexity and Nonextensivity: New Trends in Statistical Mechanics. Prog Theor Phys Suppl 162:18–28 134. Pluchino A, Latora V, Rapisarda A (2004) Glassy dynamics in the HMF model. Physica A 340:187–195 135. Pluchino A, Latora V, Rapisarda A (2004) Dynamical anomalies and the role of initial conditions in the HMF model. Physica A 338:60–67 136. Pluchino A, Rapisarda A, Latora V (2005) Metastability and anomalous behavior in the HMF model: Connections to nonextensive thermodynamics and glassy dynamics. In: Beck C, Benedek G, Rapisarda A, Tsallis C (eds) Complexity, Metastability and Nonextensivity. World Scientific, Singapore, pp 102–112 137. Pluchino A, Rapisarda A, Tsallis C (2007) Nonergodicity and central limit behavior in long-range Hamiltonians. Europhys Lett 80:26002 138. Prato D, Tsallis C (1999) Nonextensive foundation of Levy distributions. Phys Rev E 60:2398–2401
139. Queiros SMD (2005) On non-Gaussianity and dependence in financial in time series: A nonextensive approach. Quant Finance 5:475–487 140. Queiros SMD, Tsallis C (2007) Nonextensive statistical mechanics and central limit theorems II – Convolution of q-independent random variables. In: Abe S, Herrmann HJ, Quarati P, Rapisarda A, Tsallis C (eds) Complexity, Metastability and Nonextensivity. American Institute of Physics Conference Proceedings, vol 965. New York, pp 21–33 141. Queiros SMD, Moyano LG, de Souza J, Tsallis C (2007) A nonextensive approach to the dynamics of financial observables. Euro Phys J B 55:161–168 142. Rapisarda A, Pluchino A (2005) Nonextensive thermodynamics and glassy behavior. Europhys News 36:202–206; Erratum: 37:25 (2006) 143. Rapisarda A, Pluchino A (2005) Nonextensive thermodynamics and glassy behaviour in Hamiltonian systems. Europhys News 36:202–206; Erratum: 37:25 (2006) 144. Rego HHA, Lucena LS, da Silva LR, Tsallis C (1999) Crossover from extensive to nonextensive behavior driven by longrange d D 1 bond percolation. Phys A 266:42–48 145. Renyi A (1961) In: Proceedings of the Fourth Berkeley Symposium, 1:547 University California Press, Berkeley; Renyi A (1970) Probability theory. North-Holland, Amsterdam 146. Robledo A (2004) Aging at the edge of chaos: Glassy dynamics and nonextensive statistics. Physica A 342:104–111 147. Robledo A (2004) Universal glassy dynamics at noise-perturbed onset of chaos: A route to ergodicity breakdown. Phys Lett A 328:467–472 148. Robledo A (2004) Criticality in nonlinear one-dimensional maps: RG universal map and nonextensive entropy. Physica D 193:153–160 149. Robledo A (2005) Intermittency at critical transitions and aging dynamics at edge of chaos. Pramana-J Phys 64:947–956 150. Robledo A (2005) Critical attractors and q-statistics. Europhys News 36:214–218 151. Robledo A (2006) Crossover from critical to chaotic attractor dynamics in logistic and circle maps. 
In: Sakagami M, Suzuki N, Abe S (eds) Complexity and Nonextensivity: New Trends in Statistical Mechanics. Prog Theor Phys Suppl 162:10–17 152. Robledo A, Baldovin F, Mayoral E (2005) Two stories outside Boltzmann-Gibbs statistics: Mori’s q-phase transitions and glassy dynamics at the onset of chaos. In: Beck C, Benedek G, Rapisarda A, Tsallis C (eds) Complexity, Metastability and Nonextensivity. World Scientific, Singapore, p 43 153. Rodriguez A, Schwammle V, Tsallis C (2008) Strictly and asymptotically scale-invariant probabilistic models of N correlated binary random variables havin q-Gaussians as N -> infinity Limiting distributions. J Stat Mech P09006 154. Rohlf T, Tsallis C (2007) Long-range memory elementary 1D cellular automata: Dynamics and nonextensivity. Physica A 379:465–470 155. Rossignoli R, Canosa N (2003) Violation of majorization relations in entangled states and its detection by means of generalized entropic forms. Phys Rev A 67:042302 156. Rossignoli R, Canosa N (2004) Generalized disorder measure and the detection of quantum entanglement. Physica A 344:637–643 157. Salinas SRA, Tsallis C (eds) (1999) Nonextensive Statistical Mechanics and Thermodynamics. Braz J Phys 29(1)
Entropy
158. Sampaio LC, de Albuquerque MP, de Menezes FS (1997) Nonextensivity and Tsallis statistic in magnetic systems. Phys Rev B 55:5611–5614 159. Santos RJV (1997) Generalization of Shannon’ s theorem for Tsallis entropy. J Math Phys 38:4104–4107 160. Sato Y, Tsallis C (2006) In: Bountis T, Casati G, Procaccia I (eds) Complexity: An unifying direction in science. Int J Bif Chaos 16:1727–1738 161. Schutzenberger PM (1954) Contributions aux applications statistiques de la theorie de l’ information. Publ Inst Statist Univ Paris 3:3 162. Shannon CE (1948) A Mathematical Theory of Communication. Bell Syst Tech J 27:379–423; 27:623–656; (1949) The Mathematical Theory of Communication. University of Illinois Press, Urbana 163. Sharma BD, Mittal DP (1975) J Math Sci 10:28 164. Serra P, Stanton AF, Kais S, Bleil RE (1997) Comparison study of pivot methods for global optimization. J Chem Phys 106:7170–7177 165. Soares DJB, Tsallis C, Mariz AM, da Silva LR (2005) Preferential Attachment growth model and nonextensive statistical mechanics. Europhys Lett 70:70–76 166. Son WJ, Jang S, Pak Y, Shin S (2007) Folding simulations with novel conformational search method. J Chem Phys 126:104906 167. Stigler SM (1999) Statistics on the table – The history of statistical concepts and methods. Harvard University Press, Cambridge 168. Silva AT, Lenzi EK, Evangelista LR, Lenzi MK, da Silva LR (2007) Fractional nonlinear diffusion equation, solutions and anomalous diffusion. Phys A 375:65–71 169. Tamarit FA, Anteneodo C (2005) Relaxation and aging in longrange interacting systems. Europhys News 36:194–197 170. Tamarit FA, Cannas SA, Tsallis C (1998) Sensitivity to initial conditions and nonextensivity in biological evolution. Euro Phys J B 1:545–548 171. Thistleton W, Marsh JA, Nelson K, Tsallis C (2006) unpublished 172. Thurner S (2005) Europhys News 36:218–220 173. Thurner S, Tsallis C (2005) Nonextensiveaspects of self-organized scale-free gas-like networks. Europhys Lett 72:197–204 174. 
Tirnakli U, Ananos GFJ, Tsallis C (2001) Generalization of the Kolmogorov–Sinai entropy: Logistic – like and generalized cosine maps at the chaos threshold. Phys Lett A 289:51–58 175. Tirnakli U, Beck C, Tsallis C (2007) Central limit behavior of deterministic dynamical systems. Phys Rev E 75:040106(R) 176. Tirnakli U, Tsallis C, Lyra ML (1999) Circular-like maps: Sensitivity to the initial conditions, multifractality and nonextensivity. Euro Phys J B 11:309–315 177. Tirnakli U, Tsallis C (2006) Chaos thresholds of the z-logistic map: Connection between the relaxation and average sensitivity entropic indices. Phys Rev E 73:037201 178. Tisza L (1961) Generalized Thermodynamics. MIT Press, Cambridge, p 123 179. Tonelli R, Mezzorani G, Meloni F, Lissia M, Coraddu M (2006) Entropy production and Pesin-like identity at the onset of chaos. Prog Theor Phys 115:23–29 180. Toscano F, Vallejos RO, Tsallis C (2004) Random matrix ensembles from nonextensive entropy. Phys Rev E 69:066131 181. Tsallis AC, Tsallis C, Magalhaes ACN, Tamarit FA (2003) Human and computer learning: An experimental study. Complexus 1:181–189
182. Tsallis C Regularly updated bibliography at http://tsallis.cat. cbpf.br/biblio.htm 183. Tsallis C (1988) Possible generalization of Boltzmann–Gibbs statistics. J Stat Phys 52:479–487 184. Tsallis C (2004) What should a statistical mechanics satisfy to reflect nature? Physica D 193:3–34 185. Tsallis C (2004) Dynamical scenario for nonextensive statistical mechanics. Physica A 340:1–10 186. Tsallis C (2005) Is the entropy Sq extensive or nonextensive? In: Beck C, Benedek G, Rapisarda A, Tsallis C (eds) Complexity, Metastability and Nonextensivity. World Scientific, Singapore 187. Tsallis C (2005) Nonextensive statistical mechanics, anomalous diffusion and central limit theorems. Milan J Math 73:145–176 188. Tsallis C, Bukman DJ (1996) Anomalous diffusion in the presence of external forces: exact time-dependent solutions and their thermostatistical basis. Phys Rev E 54:R2197–R2200 189. Tsallis C, Queiros SMD (2007) Nonextensive statistical mechanics and central limit theorems I – Convolution of independent random variables and q-product. In: Abe S, Herrmann HJ, Quarati P, Rapisarda A, Tsallis C (eds) Complexity, Metastability and Nonextensivity. American Institute of Physics Conference Proceedings, vol 965. New York, pp 8–20 190. Tsallis C, Souza AMC (2003) Constructing a statistical mechanics for Beck-Cohen superstatistics. Phys Rev E 67:026106 191. Tsallis C, Stariolo DA (1996) Generalized simulated annealing. Phys A 233:395–406; A preliminary version appeared (in English) as Notas de Fisica/CBPF 026 (June 1994) 192. Tsallis C, Levy SVF, de Souza AMC, Maynard R (1995) Statistical-mechanical foundation of the ubiquity of Levy distributions in nature. Phys Rev Lett 75:3589–3593; Erratum: (1996) Phys Rev Lett 77:5442 193. Tsallis C, Mendes RS, Plastino AR (1998) The role of constraints within generalized nonextensive statistics. Physica A 261:534–554 194. Tsallis C, Bemski G, Mendes RS (1999) Is re-association in folded proteins a case of nonextensivity? 
Phys Lett A 257:93–98 195. Tsallis C, Lloyd S, Baranger M (2001) Peres criterion for separability through nonextensive entropy. Phys Rev A 63:042104 196. Tsallis C, Anjos JC, Borges EP (2003) Fluxes of cosmic rays: A delicately balanced stationary state. Phys Lett A 310:372– 376 197. Tsallis C, Anteneodo C, Borland L, Osorio R (2003) Nonextensive statistical mechanics and economics. Physica A 324:89– 100 198. Tsallis C, Mann GM, Sato Y (2005) Asymptotically scale-invariant occupancy of phase space makes the entropy Sq extensive. Proc Natl Acad Sci USA 102:15377–15382 199. Tsallis C, Mann GM, Sato Y (2005) Extensivity and entropy production. In: Boon JP, Tsallis C (eds) Nonextensive Statistical Mechanics: New Trends, New perspectives. Europhys News 36:186–189 200. Tsallis C, Rapisarda A, Pluchino A, Borges EP (2007) On the non-Boltzmannian nature of quasi-stationary states in longrange interacting systems. Physica A 381:143–147 201. Tsekouras GA, Tsallis C (2005) Generalized entropy arising from a distribution of q-indices. Phys Rev E 71:046144 202. Umarov S, Tsallis C (2007) Multivariate generalizations of the q–central limit theorem. cond-mat/0703533 203. Umarov S, Tsallis C, Steinberg S (2008) On a q-central limit the-
963
964
Entropy
204.
205.
206.
207. 208.
209.
orem consistent with nonextensive statistical mechanics. Milan J Math 76. doi:10.1007/s00032-008-0087-y Umarov S, Tsallis C, Gell-Mann M, Steinberg S (2008) Symmetric (q, ˛)-stable distributions. Part I: First representation. cond-mat/0606038v2 Umarov S, Tsallis C, Gell-Mann M, Steinberg S (2008) Symmetric (q, ˛)-stable distributions. Part II: Second representation. cond-mat/0606040v2 Upadhyaya A, Rieu J-P, Glazier JA, Sawada Y (2001) Anomalous diffusion and non-Gaussian velocity distribution of Hydra cells in cellular aggregates. Physica A 293:549–558 Vajda I (1968) Kybernetika 4:105 (in Czech) Varotsos PA, Sarlis NV, Tanaka HK, Skordas ES (2005) Some properties of the entropy in the natural time. Phys Rev E 71:032102 Wehrl A (1978) Rev Modern Phys 50:221
210. Weinstein YS, Lloyd S, Tsallis C (2002) Border between between regular and chaotic quantum dynamics. Phys Rev Lett 89:214101 211. Weinstein YS, Tsallis C, Lloyd S (2004) On the emergence of nonextensivity at the edge of quantum chaos. In: Elze H-T (ed) Decoherence and Entropy in Complex Systems. Lecture notes in physics, vol 633. Springer, Berlin, pp 385–397 212. White DR, Kejzar N, Tsallis C, Farmer JD, White S (2005) A generative model for feedback networks. Phys Rev E 73:016119 213. Wilk G, Wlodarczyk Z (2004) Acta Phys Pol B 35:871–879 214. Wu JL, Chen HJ (2007) Fluctuation in nonextensive reactiondiffusion systems. Phys Scripta 75:722–725 215. Yamano T (2004) Distribution of the Japanese posted land price and the generalized entropy. Euro Phys J B 38:665–669 216. Zanette DH, Alemany PA (1995) Thermodynamics of anomalous diffusion. Phys Rev Lett 75:366–369
Ergodic Theory of Cellular Automata
MARCUS PIVATO, Department of Mathematics, Trent University, Peterborough, Canada

Article Outline
Glossary
Definition of the Subject
Introduction
Invariant Measures for CA
Limit Measures and Other Asymptotics
Measurable Dynamics
Entropy
Future Directions and Open Problems
Acknowledgments
Bibliography

Glossary

Configuration space and the shift  Let $\mathbb{M}$ be a finitely generated group or monoid (usually abelian). Typically, $\mathbb{M} = \mathbb{N} := \{0, 1, 2, \ldots\}$ or $\mathbb{M} = \mathbb{Z} := \{\ldots, -1, 0, 1, 2, \ldots\}$, or $\mathbb{M} = \mathbb{N}^E$, $\mathbb{Z}^D$, or $\mathbb{Z}^D \times \mathbb{N}^E$ for some $D, E \in \mathbb{N}$. In some applications, $\mathbb{M}$ could be nonabelian (although usually amenable), but to avoid notational complexity we will generally assume $\mathbb{M}$ is abelian and additive, with operation '+'. Let $\mathcal{A}$ be a finite set of symbols (called an alphabet). Let $\mathcal{A}^{\mathbb{M}}$ denote the set of all functions $\mathbf{a} : \mathbb{M} \to \mathcal{A}$, which we regard as $\mathbb{M}$-indexed configurations of elements in $\mathcal{A}$. We write such a configuration as $\mathbf{a} = [a_m]_{m \in \mathbb{M}}$, where $a_m \in \mathcal{A}$ for all $m \in \mathbb{M}$, and refer to $\mathcal{A}^{\mathbb{M}}$ as configuration space. Treat $\mathcal{A}$ as a discrete topological space; then $\mathcal{A}$ is compact (because it is finite), so $\mathcal{A}^{\mathbb{M}}$ is compact in the Tychonoff product topology. In fact, $\mathcal{A}^{\mathbb{M}}$ is a Cantor space: it is compact, perfect, totally disconnected, and metrizable. For example, if $\mathbb{M} = \mathbb{Z}^D$, then the standard metric on $\mathcal{A}^{\mathbb{Z}^D}$ is defined by $d(\mathbf{a}, \mathbf{b}) = 2^{-\Delta(\mathbf{a}, \mathbf{b})}$, where $\Delta(\mathbf{a}, \mathbf{b}) := \min\{|z| \; ; \; a_z \neq b_z\}$. Any $v \in \mathbb{M}$ determines a continuous shift map $\sigma^v : \mathcal{A}^{\mathbb{M}} \to \mathcal{A}^{\mathbb{M}}$ defined by $\sigma^v(\mathbf{a})_m = a_{m+v}$ for all $\mathbf{a} \in \mathcal{A}^{\mathbb{M}}$ and $m \in \mathbb{M}$. The set $\{\sigma^v\}_{v \in \mathbb{M}}$ is then a continuous $\mathbb{M}$-action on $\mathcal{A}^{\mathbb{M}}$, which we denote simply by "$\sigma$". If $\mathbf{a} \in \mathcal{A}^{\mathbb{M}}$ and $\mathbb{U} \subseteq \mathbb{M}$, then we define $\mathbf{a}_{\mathbb{U}} \in \mathcal{A}^{\mathbb{U}}$ by $\mathbf{a}_{\mathbb{U}} := [a_u]_{u \in \mathbb{U}}$. If $m \in \mathbb{M}$, then strictly speaking, $\mathbf{a}_{m+\mathbb{U}} \in \mathcal{A}^{m+\mathbb{U}}$; however, it will often be convenient to 'abuse notation' and treat $\mathbf{a}_{m+\mathbb{U}}$ as an element of $\mathcal{A}^{\mathbb{U}}$ in the obvious way.
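The configuration space and shift action above can be sketched concretely for $\mathbb{M} = \mathbb{Z}$. The Python fragment below is purely illustrative (not from the article): it represents a finite fragment of a configuration $\mathbf{a} : \mathbb{Z} \to \mathcal{A}$ as a dict, and implements the shift $\sigma^v$, the restriction $\mathbf{a}_{\mathbb{U}}$, and the Cantor metric on the coordinates where both fragments are defined.

```python
# Illustrative sketch (not from the article): configurations a : Z -> A
# as dicts over a finite window of coordinates.

def shift(a, v):
    """Return sigma^v(a), where sigma^v(a)_m = a_{m+v}."""
    return {m - v: sym for m, sym in a.items()}

def restrict(a, U):
    """Return a_U := [a_u]_{u in U} as a dict indexed by U."""
    return {u: a[u] for u in U}

def dist(a, b):
    """Cantor metric d(a,b) = 2^{-Delta(a,b)}, where
    Delta(a,b) = min{|z| : a_z != b_z}, computed over the
    coordinates where both fragments are defined."""
    common = set(a) & set(b)
    diffs = [abs(z) for z in common if a[z] != b[z]]
    return 0.0 if not diffs else 2.0 ** (-min(diffs))
```

For example, `shift(a, 1)[0] == a[1]`: the shifted configuration reads off the original one coordinate to the right, exactly as $\sigma^v(\mathbf{a})_m = a_{m+v}$ prescribes.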
Cellular automata  Let $\mathbb{H} \subset \mathbb{M}$ be some finite subset, and let $\phi : \mathcal{A}^{\mathbb{H}} \to \mathcal{A}$ be a function (called a local rule). The cellular automaton (CA) determined by $\phi$ is the function $\Phi : \mathcal{A}^{\mathbb{M}} \to \mathcal{A}^{\mathbb{M}}$ defined by $\Phi(\mathbf{a})_m = \phi(\mathbf{a}_{m+\mathbb{H}})$ for all $\mathbf{a} \in \mathcal{A}^{\mathbb{M}}$ and $m \in \mathbb{M}$. Curtis, Hedlund and Lyndon showed that cellular automata are exactly the continuous transformations of $\mathcal{A}^{\mathbb{M}}$ which commute with all shifts (see Theorem 3.4 in [58]). We refer to $\mathbb{H}$ as the neighborhood of $\Phi$. For example, if $\mathbb{M} = \mathbb{Z}$, then typically $\mathbb{H} := [-\ell \ldots r] := \{-\ell, 1-\ell, \ldots, r-1, r\}$ for some left radius $\ell \geq 0$ and right radius $r \geq 0$. If $\ell \leq 0$, then $\phi$ can either define a CA on $\mathcal{A}^{\mathbb{N}}$ or define a one-sided CA on $\mathcal{A}^{\mathbb{Z}}$. If $\mathbb{M} = \mathbb{Z}^D$, then typically $\mathbb{H} \subseteq [-R \ldots R]^D$, for some radius $R \geq 0$. Normally we assume that $\ell$, $r$, and $R$ are chosen to be minimal. Several specific classes of CA will be important to us:

Linear CA  Let $(\mathcal{A}, +)$ be a finite abelian group (e.g. $\mathcal{A} = \mathbb{Z}_{/p}$, where $p \in \mathbb{N}$; usually $p$ is prime). Then $\Phi$ is a linear CA (LCA) if the local rule $\phi$ has the form
$$\phi(\mathbf{a}_{\mathbb{H}}) := \sum_{h \in \mathbb{H}} \varphi_h(a_h) , \qquad \forall \mathbf{a}_{\mathbb{H}} \in \mathcal{A}^{\mathbb{H}} , \qquad (1)$$
where $\varphi_h : \mathcal{A} \to \mathcal{A}$ is an endomorphism of $(\mathcal{A}, +)$, for each $h \in \mathbb{H}$. We say that $\Phi$ has scalar coefficients if, for each $h \in \mathbb{H}$, there is some scalar $c_h \in \mathbb{Z}$, so that $\varphi_h(a_h) := c_h \cdot a_h$; then $\phi(\mathbf{a}_{\mathbb{H}}) := \sum_{h \in \mathbb{H}} c_h a_h$. For example, if $\mathcal{A} = (\mathbb{Z}_{/p}, +)$, then all endomorphisms are scalar multiplications, so all LCA have scalar coefficients. If $c_h = 1$ for all $h \in \mathbb{H}$, then $\Phi$ has local rule $\phi(\mathbf{a}_{\mathbb{H}}) := \sum_{h \in \mathbb{H}} a_h$; in this case, $\Phi$ is called an additive cellular automaton; see Additive Cellular Automata.

Affine CA  If $(\mathcal{A}, +)$ is a finite abelian group, then an affine CA is one with a local rule $\phi(\mathbf{a}_{\mathbb{H}}) := c + \sum_{h \in \mathbb{H}} \varphi_h(a_h)$, where $c$ is some constant and where the $\varphi_h : \mathcal{A} \to \mathcal{A}$ are endomorphisms of $(\mathcal{A}, +)$. Thus, $\Phi$ is an LCA if $c = 0$.

Permutative CA  Suppose $\Phi : \mathcal{A}^{\mathbb{Z}} \to \mathcal{A}^{\mathbb{Z}}$ has local rule $\phi : \mathcal{A}^{[-\ell \ldots r]} \to \mathcal{A}$. Fix $\mathbf{b} = [b_{1-\ell}, \ldots, b_{r-1}, b_r] \in \mathcal{A}^{(-\ell \ldots r]}$. For any $a \in \mathcal{A}$, define $[a\,\mathbf{b}] := [a, b_{1-\ell}, \ldots, b_{r-1}, b_r] \in \mathcal{A}^{[-\ell \ldots r]}$. We then define the function $\phi_{\mathbf{b}} : \mathcal{A} \to \mathcal{A}$ by $\phi_{\mathbf{b}}(a) := \phi([a\,\mathbf{b}])$. We say that $\Phi$ is left-permutative if $\phi_{\mathbf{b}} : \mathcal{A} \to \mathcal{A}$ is a permutation (i.e. a bijection) for all $\mathbf{b} \in \mathcal{A}^{(-\ell \ldots r]}$. Likewise, given $\mathbf{b} = [b_{-\ell}, \ldots, b_{r-1}] \in \mathcal{A}^{[-\ell \ldots r)}$ and $c \in \mathcal{A}$, define $[\mathbf{b}\,c] := [b_{-\ell}, b_{1-\ell}, \ldots, b_{r-1}, c] \in \mathcal{A}^{[-\ell \ldots r]}$, and define $\phi^{\mathbf{b}} : \mathcal{A} \to \mathcal{A}$ by $\phi^{\mathbf{b}}(c) := \phi([\mathbf{b}\,c])$; then $\Phi$ is right-permutative if $\phi^{\mathbf{b}} : \mathcal{A} \to \mathcal{A}$ is a permutation for all $\mathbf{b} \in \mathcal{A}^{[-\ell \ldots r)}$. We say $\Phi$ is bipermutative if it is both left- and right-permutative. More generally, if $\mathbb{M}$ is any monoid, $\mathbb{H} \subset \mathbb{M}$ is any neighborhood, and $h \in \mathbb{H}$ is any fixed coordinate, then we define $h$-permutativity for a CA on $\mathcal{A}^{\mathbb{M}}$ in the obvious fashion. For example, suppose $(\mathcal{A}, +)$ is an abelian group and $\Phi$ is an affine CA on $\mathcal{A}^{\mathbb{Z}}$ with local rule $\phi(\mathbf{a}_{\mathbb{H}}) = c + \sum_{h=-\ell}^{r} \varphi_h(a_h)$. Then $\Phi$ is left-permutative iff $\varphi_{-\ell}$ is an automorphism, and right-permutative iff $\varphi_r$ is an automorphism. If $\mathcal{A} = \mathbb{Z}_{/p}$, and $p$ is prime, then every nontrivial endomorphism is an automorphism (because it is multiplication by a nonzero element of $\mathbb{Z}_{/p}$, which is a field), so in this case, every affine CA is permutative in every coordinate of its neighborhood (and in particular, bipermutative). If $\mathcal{A} \neq \mathbb{Z}_{/p}$, however, then not all affine CA are permutative. Permutative CA were introduced by Hedlund [58], §6, and are sometimes called permutive CA. Right-permutative CA on $\mathcal{A}^{\mathbb{N}}$ are also called toggle automata. For more information, see Sect. 7 of Topological Dynamics of Cellular Automata.

Subshifts  A subshift is a closed, $\sigma$-invariant subset $\mathbf{X} \subseteq \mathcal{A}^{\mathbb{M}}$. For any $\mathbb{U} \subseteq \mathbb{M}$, let $\mathbf{X}_{\mathbb{U}} := \{\mathbf{x}_{\mathbb{U}} \; ; \; \mathbf{x} \in \mathbf{X}\} \subseteq \mathcal{A}^{\mathbb{U}}$. We say $\mathbf{X}$ is a subshift of finite type (SFT) if there is some finite $\mathbb{U} \subset \mathbb{M}$ such that $\mathbf{X}$ is entirely described by $\mathbf{X}_{\mathbb{U}}$, in the sense that $\mathbf{X} = \{\mathbf{x} \in \mathcal{A}^{\mathbb{M}} \; ; \; \mathbf{x}_{\mathbb{U}+m} \in \mathbf{X}_{\mathbb{U}}, \; \forall m \in \mathbb{M}\}$. In particular, if $\mathbb{M} = \mathbb{Z}$, then a (two-sided) Markov subshift is an SFT $\mathbf{X} \subseteq \mathcal{A}^{\mathbb{Z}}$ determined by a set $\mathbf{X}_{\{0,1\}} \subseteq \mathcal{A}^{\{0,1\}}$ of admissible transitions; equivalently, $\mathbf{X}$ is the set of all bi-infinite directed paths in a digraph whose vertices are the elements of $\mathcal{A}$, with an edge $a \rightsquigarrow b$ iff $(a, b) \in \mathbf{X}_{\{0,1\}}$. If $\mathbb{M} = \mathbb{N}$, then a one-sided Markov subshift is a subshift of $\mathcal{A}^{\mathbb{N}}$ defined in the same way. If $D \geq 2$, then an SFT in $\mathcal{A}^{\mathbb{Z}^D}$ can be thought of as the set of admissible 'tilings' of $\mathbb{R}^D$ by Wang tiles corresponding to the elements of $\mathbf{X}_{\mathbb{U}}$. (Wang tiles are unit squares (or (hyper)cubes) with various 'notches' cut into their edges (or (hyper)faces) so that they can only be juxtaposed in certain ways.)
A subshift $\mathbf{X} \subseteq \mathcal{A}^{\mathbb{Z}^D}$ is strongly irreducible (or topologically mixing) if there is some $R \in \mathbb{N}$ such that, for any disjoint finite subsets $\mathbb{V}, \mathbb{U} \subset \mathbb{Z}^D$ separated by a distance of at least $R$, and for any $\mathbf{u} \in \mathbf{X}_{\mathbb{U}}$ and $\mathbf{v} \in \mathbf{X}_{\mathbb{V}}$, there is some $\mathbf{x} \in \mathbf{X}$ such that $\mathbf{x}_{\mathbb{U}} = \mathbf{u}$ and $\mathbf{x}_{\mathbb{V}} = \mathbf{v}$.

Measures  For any finite subset $\mathbb{U} \subset \mathbb{M}$, and any $\mathbf{b} \in \mathcal{A}^{\mathbb{U}}$, let $\langle \mathbf{b} \rangle := \{\mathbf{a} \in \mathcal{A}^{\mathbb{M}} \; ; \; \mathbf{a}_{\mathbb{U}} = \mathbf{b}\}$ be the cylinder set determined by $\mathbf{b}$. Let $\mathcal{B}$ be the sigma-algebra on $\mathcal{A}^{\mathbb{M}}$ generated by all cylinder sets. A (probability) measure $\mu$ on $\mathcal{A}^{\mathbb{M}}$ is a countably additive function $\mu : \mathcal{B} \to [0, 1]$ such that $\mu[\mathcal{A}^{\mathbb{M}}] = 1$. A measure on $\mathcal{A}^{\mathbb{M}}$ is entirely determined by its values on cylinder sets. We will be mainly concerned with the following classes of measures:

Bernoulli measure  Let $\beta_0$ be a probability measure on $\mathcal{A}$. The Bernoulli measure induced by $\beta_0$ is the measure $\beta$ on $\mathcal{A}^{\mathbb{M}}$ such that, for any finite subset $\mathbb{U} \subset \mathbb{M}$, and any $\mathbf{a} \in \mathcal{A}^{\mathbb{U}}$, $\beta[\langle \mathbf{a} \rangle] = \prod_{u \in \mathbb{U}} \beta_0(a_u)$.

Invariant measure  Let $\mu$ be a measure on $\mathcal{A}^{\mathbb{M}}$, and let $\Phi : \mathcal{A}^{\mathbb{M}} \to \mathcal{A}^{\mathbb{M}}$ be a cellular automaton. The measure $\Phi\mu$ is defined by $\Phi\mu(B) = \mu(\Phi^{-1}(B))$, for any $B \in \mathcal{B}$. We say that $\mu$ is $\Phi$-invariant (or that $\Phi$ is $\mu$-preserving) if $\Phi\mu = \mu$.

Uniform measure  Let $A := |\mathcal{A}|$. The uniform measure $\eta$ on $\mathcal{A}^{\mathbb{M}}$ is the Bernoulli measure such that, for any finite subset $\mathbb{U} \subset \mathbb{M}$, and any $\mathbf{b} \in \mathcal{A}^{\mathbb{U}}$, if $U := |\mathbb{U}|$, then $\eta[\langle \mathbf{b} \rangle] = 1/A^U$.

The support of a measure $\mu$ is the smallest closed subset $\mathbf{X} \subseteq \mathcal{A}^{\mathbb{M}}$ such that $\mu[\mathbf{X}] = 1$; we denote this by $\mathrm{supp}(\mu)$. We say $\mu$ has full support if $\mathrm{supp}(\mu) = \mathcal{A}^{\mathbb{M}}$; equivalently, $\mu[\mathbf{C}] > 0$ for every cylinder subset $\mathbf{C} \subseteq \mathcal{A}^{\mathbb{M}}$.

Notation  Let $\mathrm{CA}(\mathcal{A}^{\mathbb{M}})$ denote the set of all cellular automata on $\mathcal{A}^{\mathbb{M}}$. If $\mathbf{X} \subseteq \mathcal{A}^{\mathbb{M}}$, then let $\mathrm{CA}(\mathbf{X})$ be the subset of all $\Phi \in \mathrm{CA}(\mathcal{A}^{\mathbb{M}})$ such that $\Phi(\mathbf{X}) \subseteq \mathbf{X}$. Let $\mathrm{Meas}(\mathcal{A}^{\mathbb{M}})$ be the set of all probability measures on $\mathcal{A}^{\mathbb{M}}$, and let $\mathrm{Meas}(\mathcal{A}^{\mathbb{M}}; \Phi)$ be the subset of $\Phi$-invariant measures. If $\mathbf{X} \subseteq \mathcal{A}^{\mathbb{M}}$, then let $\mathrm{Meas}(\mathbf{X})$ be the set of probability measures $\mu$ with $\mathrm{supp}(\mu) \subseteq \mathbf{X}$, and define $\mathrm{Meas}(\mathbf{X}; \Phi)$ in the obvious way.

Font conventions  Upper-case calligraphic letters ($\mathcal{A}, \mathcal{B}, \mathcal{C}, \ldots$) denote finite alphabets or groups. Upper-case bold letters ($\mathbf{A}, \mathbf{B}, \mathbf{C}, \ldots$) denote subsets of $\mathcal{A}^{\mathbb{M}}$ (e.g. subshifts), lower-case bold letters ($\mathbf{a}, \mathbf{b}, \mathbf{c}, \ldots$) denote elements of $\mathcal{A}^{\mathbb{M}}$, and Roman letters ($a, b, c, \ldots$) are elements of $\mathcal{A}$ or ordinary numbers. Lower-case sans-serif letters ($\ldots, m, n, p$) are elements of $\mathbb{M}$, and upper-case hollow-font letters ($\mathbb{U}, \mathbb{V}, \mathbb{W}, \ldots$) are subsets of $\mathbb{M}$.
Upper-case Greek letters (˚; ; : : :) are functions on AM (e. g. CA, block maps), and lowercase Greek letters (; ; : : :) are other functions (e. g. local rules, measures.) Acronyms in square brackets (e. g. Topological Dynamics of Cellular Automata) indicate cross-references to related entries in the Encyclopedia; these are listed at the end of this article.
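The cylinder-set formulas above are easy to mechanize. A minimal sketch in Python (the dict-based encoding of a finite pattern b ∈ A^𝕌 as coordinate → symbol is our own convention, not the article's):

```python
from math import prod

def bernoulli_cylinder(beta0, block):
    """Mass of the cylinder set <block> under the Bernoulli measure
    induced by the single-site distribution beta0 (symbol -> probability).
    `block` maps a finite set of coordinates to symbols."""
    return prod(beta0[s] for s in block.values())

def uniform_cylinder(alphabet_size, block):
    """The uniform measure gives every cylinder on U = |block|
    coordinates the same mass 1 / A^U."""
    return alphabet_size ** (-len(block))

# Example: A = {0,1}, beta0(1) = 3/4, pattern fixing coordinates 0, 3, 7.
beta0 = {0: 0.25, 1: 0.75}
block = {0: 1, 3: 0, 7: 1}
```

Note that the Bernoulli mass depends only on which symbols appear in the pattern, never on which coordinates carry them, and the uniform measure is the special case β₀ ≡ 1/A.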
Ergodic Theory of Cellular Automata
Definition of the Subject

Loosely speaking, a cellular automaton (CA) is the 'discrete' analogue of a partial differential evolution equation: it is a spatially distributed, discrete-time, symbolic dynamical system governed by a local interaction rule which is invariant in space and time. In a CA, 'space' is discrete (usually the D-dimensional lattice ℤ^D) and the local statespace at each point in space is also discrete (a finite 'alphabet', usually denoted by A). A measure-preserving dynamical system (MPDS) is a dynamical system equipped with an invariant probability measure. Any MPDS can be represented as a stationary stochastic process (SSP) and vice versa; 'chaos' in the MPDS can be quantified via the information-theoretic 'entropy' of the corresponding SSP. An MPDS Φ on a statespace X also defines a unitary linear operator on the Hilbert space L²(X); the spectral properties of this operator encode information about the global periodic structure and long-term informational asymptotics of Φ. Ergodic theory is the study of MPDSs and SSPs, and lies at the interface between dynamics, probability theory, information theory, and unitary operator theory. Please refer to the Glossary for precise definitions of 'CA', 'MPDS', etc.

Introduction

The study of CA as symbolic dynamical systems began with Hedlund [58], and the study of CA as MPDSs began with Coven and Paul [24] and Willson [144]. (Further historical details will unfold below, where appropriate.) The ergodic theory of CA is important for several reasons: CA are topological dynamical systems (see Topological Dynamics of Cellular Automata, Chaotic Behavior of Cellular Automata). We can gain insight into the topological dynamics of a CA by identifying its invariant measures, and then studying the corresponding measurable dynamics.
CA are often proposed as stylized models of spatially distributed systems in statistical physics – for example, as microscale models of hydrodynamics, or of atomic lattices ( Cellular Automata Modeling of Physical Systems). In this context, the distinct invariant measures of a CA correspond to distinct ‘phases’ of the physical system ( Phase Transitions in Cellular Automata). CA can also act as information-processing systems ( Cellular Automata, Universality of, Cellular Automata as Models of Parallel Computation). Ergodic theory studies the ‘informational’ aspect of dynamical systems, so it is particularly suited to explicitly ‘informational’ dynamical systems like CA.
Article Roadmap

In Sect. "Invariant Measures for CA", we characterize the invariant measures for various classes of CA. Then, in Sect. "Limit Measures and Other Asymptotics", we investigate which measures are 'generic', in the sense that they arise as the attractors for some large class of initial conditions. In Sect. "Measurable Dynamics" we study the mixing and spectral properties of CA as measure-preserving dynamical systems. Finally, in Sect. "Entropy", we look at entropy. These sections are logically independent, and can be read in any order.

Invariant Measures for CA

The Uniform Measure vs. Surjective Cellular Automata

The uniform measure η plays a central role in the ergodic theory of cellular automata, because of the following result.

Theorem 1  Let M = ℤ^D × ℕ^E, let Φ ∈ CA(A^M), and let η be the uniform measure on A^M. Then (Φ preserves η) ⟺ (Φ is surjective).

Proof sketch  "⟹"  If Φ preserves η, then Φ must map supp(η) onto itself. But supp(η) = A^M; hence Φ is surjective.

"⟸"  The case D = 1 follows from a result of W.A. Blankenship and Oscar S. Rothaus, which first appeared as Theorem 5.4 in [58]. The Blankenship–Rothaus Theorem states that, if Φ ∈ CA(A^ℤ) is surjective and has neighborhood [−ℓ…r], then for any k ∈ ℕ and any a ∈ A^k, the Φ-preimage of the cylinder set ⟨a⟩ is a disjoint union of exactly A^{r+ℓ} cylinder sets of length k + r + ℓ; it follows that η[Φ^{−1}⟨a⟩] = A^{r+ℓ}/A^{k+r+ℓ} = A^{−k} = η⟨a⟩. This result was later reproved by Kleveland (see Theorem 5.1 in [74]). The special case A = {0, 1} also appeared as Theorem 2.4 in [131]. The case D ≥ 2 follows from the multidimensional version of the Blankenship–Rothaus Theorem, which was proved by Maruoka and Kimura (see Theorem 2 in [93]) (their proof assumes that D = 2 and that Φ has a 'quiescent' state, but neither hypothesis is essential). Alternately, "⟸" follows from recent, more general results of Meester, Burton, and Steif; see Example 9 below.

Example 2  Let M = ℤ or ℕ, and consider CA on A^M.
(a) Say that Φ is bounded-to-one if there is some B ∈ ℕ such that every a ∈ A^M has at most B preimages. Then (Φ is bounded-to-one) ⟺ (Φ is surjective).
(b) Any posexpansive CA on A^M is surjective (see Subsect. "Posexpansive and Permutative CA" below).
(c) Any left- or right-permutative CA on A^ℤ (or right-permutative CA on A^ℕ) is surjective. This includes, for example, most linear CA.
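The Blankenship–Rothaus counting in the proof of Theorem 1 can be checked by brute force for elementary (radius-1, binary) CA. A sketch, indexing rule bits in the standard Wolfram way; rule 90 (XOR of the two outer neighbors) is surjective and therefore balanced, while rule 232 (majority) is not:

```python
from itertools import product

def eca_image(w, rule):
    """Apply a radius-1 elementary CA local rule to a finite word;
    the image covers the interior, so it is 2 cells shorter."""
    table = [(rule >> i) & 1 for i in range(8)]
    return tuple(table[4 * w[i] + 2 * w[i + 1] + w[i + 2]]
                 for i in range(len(w) - 2))

def preimage_counts(rule, k):
    """For every word a of length k, count the words of length k+2
    mapping onto it.  For a surjective radius-1 CA the count is the
    constant A^{r+l} = 4 (the 'balance' property); any imbalance
    certifies non-surjectivity."""
    counts = {a: 0 for a in product((0, 1), repeat=k)}
    for w in product((0, 1), repeat=k + 2):
        counts[eca_image(w, rule)] += 1
    return counts
```

For rule 90 every length-3 word has exactly 4 preimages of length 5; for rule 232 the counts are visibly unbalanced (the all-zero word soaks up far more than its share).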
Hence, in any of these cases, Φ preserves the uniform measure η.

Proof  For (a), see Theorem 5.9 in [58], or Corollary 8.1.20, p. 271 in [81]. For (b), see Proposition 2.2 in [9], in the case A^ℕ; their argument also works for A^ℤ. Part (c) follows from (b), because any permutative CA is posexpansive (Proposition 11 below). There is also a simple direct proof for a right-permutative CA on A^ℕ: using right-permutativity, you can systematically construct a preimage of any desired image sequence, one entry at a time. See Theorem 6.6 in [58] for the proof in A^ℤ.

The surjectivity of a one-dimensional CA can be determined in finite time using certain combinatorial tests (see Topological Dynamics of Cellular Automata). However, for D ≥ 2, it is formally undecidable whether an arbitrary CA on A^{ℤ^D} is surjective (see Tiling Problem and Undecidability in Cellular Automata). This problem is sometimes referred to as the Garden of Eden problem, because an element of A^{ℤ^D} with no Φ-preimage is called a Garden of Eden (GOE) configuration for Φ (because it could only ever occur at the 'beginning of time'). However, it is known that a CA is surjective if and only if it is 'almost injective' in a certain sense, which we now specify.

Let (M, +) be any monoid, and let Φ ∈ CA(A^M) have neighborhood ℍ ⊆ M. If 𝔹 ⊆ M is any subset, then we define

  𝔹̄ := 𝔹 + ℍ = {b + h ; b ∈ 𝔹, h ∈ ℍ} ,  and
  ∂𝔹 := 𝔹̄ ∩ 𝔹^∁ .
If 𝔹 is finite, then so is 𝔹̄ (because ℍ is finite). If Φ has local rule φ : A^ℍ → A, then φ induces a function Φ_𝔹 : A^𝔹̄ → A^𝔹 in the obvious fashion. A 𝔹-bubble (or 𝔹-diamond) is a pair b, b′ ∈ A^𝔹̄ such that:

  b ≠ b′ ,  b_{∂𝔹} = b′_{∂𝔹} ,  and  Φ_𝔹(b) = Φ_𝔹(b′) .
Suppose a, a′ ∈ A^M are two configurations such that

  a_𝔹̄ = b ,  a′_𝔹̄ = b′ ,  and  a_{𝔹̄^∁} = a′_{𝔹̄^∁} .
Then it is easy to verify that Φ(a) = Φ(a′). We say that a and a′ form a mutually erasable pair (because Φ 'erases' the difference between a and a′). Figure 1 is a schematic representation of this structure in the case D = 1 (hence the term 'diamond'). If D = 2, then a and a′ are like two membranes which are glued together everywhere except for a 𝔹-shaped 'bubble'. We say that Φ is pre-injective if any (and thus, all) of the following three conditions hold: Φ admits no bubbles; Φ admits no mutually erasable pairs;
Ergodic Theory of Cellular Automata, Figure 1  A 'diamond' in A^ℤ
for any c ∈ A^M, if a, a′ ∈ Φ^{−1}{c} are distinct, then a and a′ must differ in infinitely many locations.

For example, any injective CA is pre-injective (because a mutually erasable pair for Φ gives two distinct Φ-preimages for some point). More to the point, however, if 𝔹 is finite, and Φ admits a 𝔹-bubble (b, b′), then we can embed N disjoint copies of 𝔹 into M, and thus, by making various choices between b and b′ on the different translates, we obtain a configuration with 2^N distinct Φ-preimages (where N is arbitrarily large). But if some configurations in A^M have such a large number of preimages, then other configurations in A^M must have very few preimages, or even none. This leads to the following result:

Theorem 3 (Garden of Eden)  Let M be a finitely generated amenable group (e.g. M = ℤ^D). Let Φ ∈ CA(A^M).
(a) Φ is surjective if and only if Φ is pre-injective.
(b) Let X ⊆ A^M be a strongly irreducible SFT such that Φ(X) ⊆ X. Then Φ(X) = X if and only if Φ|_X is pre-injective.

Proof  (a) The case M = ℤ² was originally proved by Moore [103] and Myhill [104]; see Cellular Automata and Groups. The case M = ℤ was implicit in Hedlund (Lemma 5.11, and Theorems 5.9 and 5.12 in [58]). The case when M is a finite-dimensional group was proved by Machì and Mignosi [92]. Finally, the general case was proved by Ceccherini-Silberstein, Machì, and Scarabotti (see Theorem 3 in [20]); see Cellular Automata and Groups.
(b) The case M = ℤ is Corollary 8.1.20 in [81] (actually, this holds for any sofic subshift); see also Fiorenzi [38]. The general case is Corollary 4.8 in [39].

Corollary 4 (Incompressibility)  Suppose M is a finitely generated amenable group and Φ ∈ CA(A^M). If Φ is injective, then Φ is surjective.

Remark 5  (a) A cellular network is a CA-like system defined on an infinite, locally finite digraph, with different local rules at different nodes.
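The bubble search behind Theorem 3 can be run by brute force for elementary CA. A sketch under our own conventions: for a radius-1 rule we compare words that agree on a 2-cell collar at each end (the collar plays the role of ∂𝔹, insulating the difference from the outside), so equal interior images force the two embedded configurations to have identical global images:

```python
from itertools import product

def eca_image(w, rule):
    """Apply a radius-1 elementary CA local rule to a finite word;
    the image covers the interior, so it is 2 cells shorter."""
    table = [(rule >> i) & 1 for i in range(8)]
    return tuple(table[4 * w[i] + 2 * w[i + 1] + w[i + 2]]
                 for i in range(len(w) - 2))

def find_erasable_pair(rule, width):
    """Brute-force search for a mutually erasable pair: two distinct
    words agreeing on a 2-cell collar at each end, with equal images.
    Finding one shows the CA is not pre-injective, hence, by the
    Garden of Eden theorem, not surjective."""
    words = list(product((0, 1), repeat=width))
    for x in words:
        for y in words:
            if (x != y and x[:2] == y[:2] and x[-2:] == y[-2:]
                    and eca_image(x, rule) == eca_image(y, rule)):
                return x, y
    return None
```

Rule 232 (majority) yields a pair already at width 5, e.g. 00000 vs. 00100, so it is not surjective; for the surjective rule 90 the search comes up empty (permutativity forces the interior to be determined cell by cell).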
By assuming a kind of ‘amenability’ for this digraph, and then imposing some weak global statistical symmetry conditions on the local rules, Gromov (see Theorem 8.F’ in [51]) has generalized
the GOE Theorem 3 to a large class of such cellular networks (which he calls 'endomorphisms of symbolic algebraic varieties'). See also [19].
(b) In the terminology suggested by Gottschalk [46], Incompressibility Corollary 4 says that the group M is surjunctive; Gottschalk claims that 'surjunctivity' was first proved for all residually finite groups by Lawton (unpublished); see Cellular Automata and Groups. For a recent direct proof (not using the GOE theorem), see Weiss (Theorem 1.6 in [143]). Weiss also defines sofic groups (a class containing both residually finite groups and amenable groups) and shows that Corollary 4 holds whenever M is a sofic group (Theorem 3.2 in [143]); see also Cellular Automata and Groups.
(c) If X ⊆ A^M is an SFT such that Φ(X) ⊆ X, then Corollary 4 holds as long as X is 'semi-strongly irreducible'; see Fiorenzi (Corollary 4.10 in [40]).

Invariance of Maxentropy Measures

If X ⊆ A^{ℤ^D} is any subshift with topological entropy h_top(X; σ), and μ ∈ Meas(X; σ) has measurable entropy h(μ; σ), then in general h(μ; σ) ≤ h_top(X; σ); we say μ is a measure of maximal entropy (or maxentropy measure) if h(μ; σ) = h_top(X; σ). (See Example 75(a) for definitions.) Every subshift admits one or more maxentropy measures. If D = 1 and X ⊆ A^ℤ is an irreducible subshift of finite type (SFT), then Parry (Theorem 10 in [110]) showed that X admits a unique maxentropy measure η_X (now called the Parry measure); see Theorem 8.10, p. 194 in [142] or Sect. 13.3, pp. 443–444 in [81]. Theorem 1 is then a special case of the following result:
Theorem 6 (Coven, Paul, Meester and Steif)  Let X ⊆ A^{ℤ^D} be an SFT having a unique maxentropy measure η_X, and let Φ ∈ CA(X). Then Φ preserves η_X if and only if Φ(X) = X.

Proof  The case D = 1 is Corollary 2.3 in [24]. The case D ≥ 2 follows from Theorem 2.5(iii) in [95], which states: if X and Y are SFTs, and Φ : X → Y is a factor mapping, and μ is a maxentropy measure on X, then Φμ is a maxentropy measure on Y.

For example, if X ⊆ A^ℤ is an irreducible SFT and η_X is its Parry measure, and Φ(X) = X, then Theorem 6 says Φ(η_X) = η_X, as observed by Coven and Paul (see Theorem 5.1 in [24]). Unfortunately, higher-dimensional SFTs do not, in general, have unique maxentropy measures. Burton and Steif [14] provided a plethora of examples of such nonuniqueness, but they also gave a sufficient condition for uniqueness of the maxentropy measure, which we now explain.
Let X ⊆ A^{ℤ^D} be an SFT and let 𝕌 ⊆ ℤ^D. For any x ∈ X, let x_𝕌 := [x_u]_{u∈𝕌} be its 'projection' to A^𝕌, and let X_𝕌 := {x_𝕌 ; x ∈ X} ⊆ A^𝕌. Let 𝕍 := 𝕌^∁ ⊆ ℤ^D. For any u ∈ A^𝕌 and v ∈ A^𝕍, let [uv] denote the element of A^{ℤ^D} such that [uv]_𝕌 = u and [uv]_𝕍 = v. Let

  X^(u) := {v ∈ A^𝕍 ; [uv] ∈ X}

be the set of all "X-admissible completions" of u (thus, X^(u) ≠ ∅ ⟺ u ∈ X_𝕌). If μ ∈ Meas(A^{ℤ^D}), and u ∈ A^𝕌, then let μ^(u) denote the conditional measure on A^𝕍 induced by u. If 𝕌 is finite, then μ^(u) is just the restriction of μ to the cylinder set ⟨u⟩. If 𝕌 is infinite, then the precise definition of μ^(u) involves a 'disintegration' of μ into 'fibre measures' (we will suppress the details). Let μ_𝕌 be the projection of μ onto A^𝕌. If supp(μ) ⊆ X, then supp(μ_𝕌) ⊆ X_𝕌, and for any u ∈ A^𝕌, supp(μ^(u)) ⊆ X^(u). We say that μ is a Burton–Steif measure on X if: (1) supp(μ) = X; and (2) for any 𝕌 ⊆ ℤ^D whose complement 𝕌^∁ is finite, and for μ_𝕌-almost any u ∈ X_𝕌, the measure μ^(u) is uniformly distributed on the (finite) set X^(u).

For example, if X = A^{ℤ^D}, then the only Burton–Steif measure is the uniform Bernoulli measure. If X ⊆ A^ℤ is an irreducible SFT, then the only Burton–Steif measure is the Parry measure. If r > 0 and 𝔹 := [−r…r]^D ⊆ ℤ^D, and X is an SFT determined by a set of admissible words X_𝔹 ⊆ A^𝔹, then it is easy to check that any Burton–Steif measure on X must be a Markov random field with interaction range r.
Theorem 7 (Burton and Steif)  Let X ⊆ A^{ℤ^D} be a subshift of finite type.
(a) Any maxentropy measure on X is a Burton–Steif measure.
(b) If X is strongly irreducible, then any Burton–Steif measure on X is a maxentropy measure for X.

Proof  (a) and (b) are Propositions 1.20 and 1.21 of [15], respectively. For a proof in the case when X is a symmetric nearest-neighbor subshift of finite type, see Propositions 1.19 and 4.1 of [14], respectively.

Any subshift admits at least one maxentropy measure, so any SFT admits at least one Burton–Steif measure. Theorems 6 and 7 together imply:

Corollary 8  If X ⊆ A^{ℤ^D} is an SFT which admits a unique Burton–Steif measure η_X, then η_X is the unique maxentropy measure for X. Thus, if Φ ∈ CA(A^M) and Φ(X) = X, then Φ(η_X) = η_X.
Example 9  If X = A^{ℤ^D}, then we get Theorem 1, because the unique Burton–Steif measure on A^{ℤ^D} is the uniform Bernoulli measure.
Remark  If X ⊆ A^M is a subshift admitting a unique maxentropy measure μ, and supp(μ) = X, then Weiss (see Theorem 4.2 in [143]) has observed that X automatically satisfies Incompressibility Corollary 4. In particular, this applies to any SFT having a unique Burton–Steif measure.

Periodic Invariant Measures

If P ∈ ℕ, then a sequence a ∈ A^ℤ is P-periodic if σ^P(a) = a. If A := |A|, then there are exactly A^P such sequences, and a measure μ on A^ℤ is called P-periodic if μ is supported entirely on these P-periodic sequences. More generally, if M is any monoid and P ⊆ M is any submonoid, then a configuration a ∈ A^M is P-periodic if σ^p(a) = a for all p ∈ P. (For example, if M = ℤ and P := Pℤ, then the P-periodic configurations are the P-periodic sequences.) Let A^{M/P} denote the set of P-periodic configurations. If P := |M/P|, then |A^{M/P}| = A^P. A measure μ is called P-periodic if supp(μ) ⊆ A^{M/P}.

Proposition 10  Let Φ ∈ CA(A^M). If P ⊆ M is any submonoid and |M/P| is finite, then there exists a P-periodic, Φ-invariant measure.

Proof sketch  If Φ ∈ CA(A^M), then Φ(A^{M/P}) ⊆ A^{M/P}. Thus, if μ is P-periodic, then Φ^t(μ) is P-periodic for all t ∈ ℕ. Thus, the Cesàro limit of the sequence {Φ^t(μ)}_{t=1}^∞ is P-periodic and Φ-invariant. This Cesàro limit exists because A^{M/P} is finite.

These periodic measures have finite (hence discrete) support, but by convex-combining them, it is easy to obtain (nonergodic) Φ-invariant measures with countable, dense support. When studying the invariant measures of CA, we usually regard these periodic measures (and their convex combinations) as somewhat trivial, and concentrate instead on invariant measures supported on aperiodic configurations.
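On a finite state space the Cesàro construction of Proposition 10 can be carried out exactly. A sketch for elementary CA rule 90 acting on ℤ-configurations of spatial period P (the choice of rule, P, and horizon T are ours):

```python
from itertools import product

def step_periodic(x, rule):
    """One step of a radius-1 elementary CA on a P-periodic
    configuration, represented by one period x (cyclic boundary)."""
    table = [(rule >> i) & 1 for i in range(8)]
    P = len(x)
    return tuple(table[4 * x[(i - 1) % P] + 2 * x[i] + x[(i + 1) % P]]
                 for i in range(P))

def cesaro_invariant(rule, P, T=2000):
    """Cesàro average (1/T) * sum_{t<T} Φ^t(μ0), starting from the
    uniform measure μ0 on the P-periodic configurations; its limit
    points are Φ-invariant measures supported on periodic points."""
    configs = list(product((0, 1), repeat=P))
    mu = {x: 1.0 / len(configs) for x in configs}
    avg = {x: 0.0 for x in configs}
    for _ in range(T):
        for x, m in mu.items():
            avg[x] += m / T
        nxt = {x: 0.0 for x in configs}
        for x, m in mu.items():
            nxt[step_periodic(x, rule)] += m
        mu = nxt
    return avg
```

Pushing the averaged measure forward through Φ changes it by at most 2/T in total variation, so for large T it is invariant up to a small telescoping error.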
Posexpansive and Permutative CA

Let 𝔹 ⊆ M be a finite subset, and let B := A^𝔹. If Φ ∈ CA(A^M), then we define a continuous function Φ_𝔹^ℕ : A^M → B^ℕ by

  Φ_𝔹^ℕ(a) := [a_𝔹, Φ(a)_𝔹, Φ²(a)_𝔹, Φ³(a)_𝔹, …] ∈ B^ℕ .   (2)

Clearly, Φ_𝔹^ℕ ∘ Φ = σ ∘ Φ_𝔹^ℕ. We say that Φ is 𝔹-posexpansive if Φ_𝔹^ℕ is injective. Equivalently, for any a, a′ ∈ A^M, if a ≠ a′, then there is some t ∈ ℕ such that Φ^t(a)_𝔹 ≠ Φ^t(a′)_𝔹. We say Φ is positively expansive (or posexpansive) if Φ is 𝔹-posexpansive for some finite 𝔹 (it is easy to see that this is equivalent to the usual definition of positive expansiveness for a topological dynamical system). For more information, see Topological Dynamics of Cellular Automata.

Thus, if X := Φ_𝔹^ℕ(A^M) ⊆ B^ℕ, then X is a compact, shift-invariant subset of B^ℕ, and Φ_𝔹^ℕ : A^M → X is an isomorphism from the system (A^M; Φ) to the one-sided subshift (X; σ), which is sometimes called the canonical factor or column shift of Φ. The easiest examples of posexpansive CA are one-dimensional, permutative automata.

Proposition 11
(a) Suppose Φ ∈ CA(A^ℕ) has neighborhood [r…R], where 0 ≤ r < R. Let 𝔹 := [0…R) and let B := A^𝔹. Then

  (Φ is right-permutative) ⟺ (Φ is 𝔹-posexpansive, and Φ_𝔹^ℕ(A^ℕ) = B^ℕ) .

(b) Suppose Φ ∈ CA(A^ℤ) has neighborhood [L…R], where L < 0 < R. Let 𝔹 := [L…R), and let B := A^𝔹. Then

  (Φ is bipermutative) ⟺ (Φ is 𝔹-posexpansive, and Φ_𝔹^ℕ(A^ℤ) = B^ℕ) .

Thus, one-sided, right-permutative CA and two-sided, bipermutative CA are both topologically conjugate to the one-sided full shift (B^ℕ; σ), where B is an alphabet with |A|^{R−L} symbols (setting L = 0 in the one-sided case).

Proof  Suppose a ∈ A^M (where M = ℕ or ℤ). Draw a picture of the spacetime diagram for Φ. For any t ∈ ℕ, and any b ∈ B^{[0…t)}, observe how (bi)permutativity allows you to reconstruct a unique a_{[tL…tR)} ∈ A^{[tL…tR)} such that b = (a_𝔹, Φ(a)_𝔹, Φ²(a)_𝔹, …, Φ^{t−1}(a)_𝔹). By letting t → ∞, we see that the function Φ_𝔹^ℕ is a bijection between A^M and B^ℕ.

Remark 12  (a) The idea of Proposition 11 is implicit in Theorem 6.7 in [58], but it was apparently first stated explicitly by Shereshevsky and Afraĭmovich (see Theorem 1 in [130]). It was later rediscovered by Kleveland (see Corollary 7.3 in [74]) and Fagnani and Margara (see Theorem 3.2 in [35]).
(b) Proposition 11(b) has been generalized to higher dimensions by Allouche and Skordev (see Proposition 1 in [3]), who showed that, if Φ ∈ CA(A^{ℤ^D}) is permutative in the 'corner' entries of its neighborhood, then Φ is conjugate to a full shift (K^ℕ; σ), where K is an uncountable, compact space.

Proposition 11 is quite indicative of the general case. Posexpansiveness occurs only in one-dimensional CA, in
which it takes a very specific form. To explain this, suppose (M, ·) is a group with finite generating set G ⊆ M. For any r > 0, let B(r) := {g₁g₂⋯g_r ; g₁, …, g_r ∈ G}. The dimension (or growth degree) of (M, ·) is defined as dim(M, ·) := lim sup_{r→∞} log|B(r)|/log(r); see [50] or [49]. It can be shown that this number is independent of the choice of generating set G, and is always an integer. For example, dim(ℤ^D, +) = D. If X ⊆ A^M is a subshift, then we define its topological entropy h_top(X) with respect to dim(M) in the obvious fashion (see Example 75(a)).

Theorem 13  Let Φ ∈ CA(A^M).
(a) If M = ℤ^D × ℕ^E with D + E ≥ 2, then Φ cannot be posexpansive.
(b) If M is any group with dim(M) ≥ 2, and X ⊆ A^M is any subshift with h_top(X) > 0, and Φ(X) ⊆ X, then the system (X; Φ) cannot be posexpansive.
(c) Suppose M = ℤ or ℕ, and Φ has neighborhood [L…R] ⊆ M. Let L̄ := max{0, −L}, R̄ := max{0, R} and 𝔹 := [−L̄…R̄). If Φ is posexpansive, then Φ is 𝔹-posexpansive.

Proof  (a) is Corollary 2 in [127]; see also Theorem 4.4 in [37]. Part (b) follows by applying Theorem 1.1 in [128] to the natural extension of (X; Φ). (c) The case M = ℤ is Proposition 7 in [75]. The case M = ℕ is Proposition 2.3 in [9].

Proposition 11 says bipermutative CA on A^ℤ are conjugate to full shifts. Using his formidable theory of textile systems, Nasu extended this to all posexpansive CA on A^ℤ.

Theorem 14 (Nasu)  Let Φ ∈ CA(A^ℤ) and let 𝔹 ⊆ ℤ. If Φ is 𝔹-posexpansive, then Φ_𝔹^ℕ(A^ℤ) ⊆ B^ℕ is a one-sided SFT which is conjugate to a one-sided full shift C^ℕ for some alphabet C with |C| ≥ 3.

Proof sketch  The fact that X := Φ_𝔹^ℕ(A^ℤ) is an SFT follows from Theorem 10 in [75] or Theorem 10.1 in [76]. Next, Theorem 3.12(1) on p. 49 of [105] asserts that, if Ψ is any surjective endomorphism of an irreducible, aperiodic SFT Y ⊆ A^ℤ, and (Y; Ψ) is itself conjugate to an SFT, then (Y; Ψ) is actually conjugate to a full shift (C^ℕ; σ) for some alphabet C with |C| ≥ 3. Let Y := A^ℤ and invoke Kůrka's result.
For a direct proof not involving textile systems, see Theorem 4.9 in [86].

Remark 15  (a) See Theorem 60(d) for an 'ergodic' version of Theorem 14. (b) In contrast to Proposition 11, Nasu's Theorem 14 does not say that Φ_𝔹^ℕ(A^ℤ) itself is a full shift – only that it is conjugate to one.
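The one-entry-at-a-time reconstruction in the proof of Proposition 11 can be made concrete. A finite-window sketch for the right-permutative rule φ(a₀, a₁) = a₀ XOR a₁ on A^ℕ with 𝔹 = {0}: right-permutativity gives row_t[i+1] = row_t[i] XOR row_{t+1}[i], so the initial configuration is recovered uniquely from the column over cell 0 (the truncation to finite words is our simplification):

```python
def xor_step(w):
    """One step of the one-sided rule φ(a0, a1) = a0 XOR a1 on a finite
    word; the output is one cell shorter (neighborhood {0, 1})."""
    return [w[i] ^ w[i + 1] for i in range(len(w) - 1)]

def column(a):
    """Truncated column sequence b_t = Φ^t(a)_0 for t < len(a)."""
    b, w = [], list(a)
    while w:
        b.append(w[0])
        w = xor_step(w)
    return b

def preimage_from_column(b):
    """Invert the column map: solve for the spacetime diagram one
    diagonal at a time, reading off row 0 as we go."""
    col = list(b)          # col[t] = Φ^t(a) at the current cell i
    a = [col[0]]
    while len(col) > 1:
        # right-permutativity: row_t[i+1] = row_t[i] XOR row_{t+1}[i]
        col = [col[t] ^ col[t + 1] for t in range(len(col) - 1)]
        a.append(col[0])
    return a
```

The round trip preimage_from_column(column(a)) == a witnesses, on a finite window, the bijection Φ_𝔹^ℕ : A^ℕ → B^ℕ of Proposition 11(a).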
If (X; μ; Ψ) is a measure-preserving dynamical system (MPDS) with sigma-algebra B, then a one-sided generator is a finite partition P ⊆ B such that ⋁_{t=0}^∞ Ψ^{−t}(P) = B (mod μ). If P has C elements, and C is a finite set with |C| = C, then P induces an essentially injective function p : X → C^ℕ such that p ∘ Ψ = σ ∘ p. Thus, if ν := p(μ), then (X; μ; Ψ) is measurably isomorphic to the (one-sided) stationary stochastic process (C^ℕ; ν; σ). If Ψ is invertible, then a (two-sided) generator is a finite partition P ⊆ B such that ⋁_{t=−∞}^∞ Ψ^{−t}(P) = B (mod μ). The Krieger Generator Theorem says every finite-entropy, invertible MPDS has a generator; indeed, if h(μ; Ψ) ≤ log₂(C), then (X; μ; Ψ) has a generator with C or fewer elements. If |C| = C, then once again, P induces a measurable isomorphism from (X; μ; Ψ) to a two-sided stationary stochastic process (C^ℤ; ν; σ), for some stationary measure ν on C^ℤ.

Corollary 16 (Universal Representation)  Let M = ℕ or ℤ, and let Φ ∈ CA(A^M) have neighborhood ℍ ⊆ M. Suppose that either: M = ℕ, Φ is right-permutative, and ℍ = [r…R] for some 0 ≤ r < R, and then let C := |A|^R; or M = ℤ, Φ is bipermutative, and ℍ = [L…R], and then let C := |A|^{L̄+R̄}, where L̄ := max{0, −L} and R̄ := max{0, R}; or M = ℤ and Φ is positively expansive, and h_top(A^M; Φ) = log₂(C) for some C ∈ ℕ.
(a) Let (X; μ; Ψ) be any MPDS with a one-sided generator having at most C elements. Then there exists ν ∈ Meas(A^M; Φ) such that the system (A^M; ν; Φ) is measurably isomorphic to (X; μ; Ψ).
(b) Let (X; μ; Ψ) be an invertible MPDS, with measurable entropy h(μ; Ψ) ≤ log₂(C). Then there exists ν ∈ Meas(A^M; Φ) such that the natural extension of the system (A^M; ν; Φ) is measurably isomorphic to (X; μ; Ψ).

Proof  Under each of the three hypotheses, Proposition 11 or Theorem 14 yields a topological conjugacy Ξ : (C^ℕ; σ) → (A^M; Φ), where C is a set of cardinality C.
(a) As discussed above, there is a measure λ on C^ℕ such that (C^ℕ; λ; σ) is measurably isomorphic to (X; μ; Ψ). Thus, ν := Ξ[λ] is a Φ-invariant measure on A^M, and (A^M; ν; Φ) is isomorphic to (C^ℕ; λ; σ) via Ξ.
(b) As discussed above, there is a measure λ on C^ℤ such that (C^ℤ; λ; σ) is measurably isomorphic to (X; μ; Ψ). Let λ_ℕ be the projection of λ to C^ℕ; then (C^ℕ; λ_ℕ; σ) is a one-sided stationary process. Thus, ν := Ξ[λ_ℕ] is a Φ-invariant measure on A^M, and (A^M; ν; Φ) is isomorphic to (C^ℕ; λ_ℕ; σ) via Ξ. Thus, the natural extension of (A^M; ν; Φ) is isomorphic to the natural extension
of (C^ℕ; λ_ℕ; σ), which is (C^ℤ; λ; σ), which is in turn isomorphic to (X; μ; Ψ).

Remark 17  The Universal Representation Corollary implies that studying the measurable dynamics of the CA Φ with respect to some arbitrary Φ-invariant measure will generally tell us nothing whatsoever about Φ. For these measurable dynamics to be meaningful, we must pick a measure on A^M which is somehow 'natural' for Φ. First, this measure should be shift-invariant (because one of the defining properties of CA is that they commute with the shift). Second, we should seek a measure which has maximal Φ-entropy or is distinguished in some other way. (In general, the measures given by the Universal Representation Corollary will neither be σ-invariant, nor have maximal entropy for Φ.)

If Φ_ℕ ∈ CA(A^ℕ), and Φ_ℤ ∈ CA(A^ℤ) is the CA obtained by applying the same local rule to all coordinates in ℤ, then Φ_ℤ can never be posexpansive: if 𝔹 = [−B…B], and a, a′ ∈ A^ℤ are two sequences which agree on [−B…∞) but differ somewhere in (−∞…−B), then Φ^t(a)_𝔹 = Φ^t(a′)_𝔹 for all t ∈ ℕ, because the local rule of Φ only propagates information to the left. Thus, in particular, the posexpansive CA on A^ℤ are completely unrelated to the posexpansive CA on A^ℕ. Nevertheless, posexpansive CA on A^ℕ behave quite similarly to those on A^ℤ.

Theorem 18  Let Φ ∈ CA(A^ℕ) have neighborhood [r…R], where 0 ≤ r < R, and let 𝔹 := [0…R). Suppose Φ is posexpansive. Then:
(a) X := Φ_𝔹^ℕ(A^ℕ) ⊆ B^ℕ is a topologically mixing SFT.
(b) The topological entropy of Φ is log₂(k) for some k ∈ ℕ.
(c) If η is the uniform measure on A^ℕ, then Φ_𝔹^ℕ(η) is the Parry measure on X. Thus, η is the maxentropy measure for Φ.

Proof  See Corollary 3.7 and Theorems 3.8 and 3.9 in [9], or Theorem 4.8(1,2,4) in [86].

Remark  (a) See Theorem 58 for an 'ergodic' version of Theorem 18. (b) The analog of Nasu's Theorem 14 (i.e. conjugacy to a full shift) is not true for posexpansive CA on A^ℕ. See [13] for a counterexample.
(c) If Φ : A^ℕ → A^ℕ is invertible, then we define the function Φ_𝔹^ℤ : A^ℕ → B^ℤ by extending the definition of Φ_𝔹^ℕ to negative times. We say that Φ is expansive if Φ_𝔹^ℤ is bijective for some finite 𝔹 ⊆ ℕ. Expansiveness is a much weaker condition than positive expansiveness. Nevertheless, the analog of Theorem 18(a) is true: if Φ : A^ℕ → A^ℕ is invertible and expansive, then Φ_𝔹^ℤ(A^ℕ) is conjugate to a (two-sided) subshift of finite type; see Theorem 1.3 in [106].
Measure Rigidity in Algebraic CA

Theorem 1 makes the uniform measure η a 'natural' invariant measure for a surjective CA Φ. However, Proposition 10 and Corollary 16 indicate that there are many other (unnatural) Φ-invariant measures as well. Thus, it is natural to seek conditions under which the uniform measure is the unique (or almost unique) measure which is Φ-invariant, shift-invariant, and perhaps 'nondegenerate' in some other sense – a phenomenon which is sometimes called measure rigidity. Measure rigidity is best understood when Φ is compatible with an underlying algebraic structure on A^M.

Let ⋆ : A^M × A^M → A^M be a binary operation ('multiplication') and let ⁻¹ : A^M → A^M be a unary operation ('inversion') such that (A^M, ⋆) is a group, and suppose both operations are continuous and commute with all M-shifts; then (A^M, ⋆) is called a group shift. For example, if (A, ·) is itself a finite group, and A^M is treated as a Cartesian product and endowed with componentwise multiplication, then (A^M, ·) is a group shift. However, not all group shifts arise in this manner; see [70,71,72,73,124].

If (A^M, ⋆) is a group shift, then a subgroup shift is a closed, shift-invariant subgroup G ⊆ A^M (i.e. G is both a subshift and a subgroup). If (G, ⋆) is a subgroup shift, then the Haar measure on G is the unique probability measure η_G on G which is invariant under translation by all elements of G. That is, if g ∈ G, and U ⊆ G is any measurable subset, and U ⋆ g := {u ⋆ g ; u ∈ U}, then η_G[U ⋆ g] = η_G[U]. In particular, if G = A^M, then η_G is just the uniform Bernoulli measure on A^M. The Haar measure is a maxentropy measure on G (see Subsect. "Invariance of Maxentropy Measures").

If (A^M, ⋆) is a group shift, and G ⊆ A^M is a subgroup shift, and Φ ∈ CA(A^M), then Φ is called an endomorphic (or algebraic) CA on G if Φ(G) ⊆ G and Φ : G → G is an endomorphism of (G, ⋆) as a topological group. Let ECA(G, ⋆) denote the set of endomorphic CA on G.
For example, suppose (A, +) is abelian, and let (G, ⋆) := (A^M, +) with the product group structure; then the endomorphic CA on A^M are exactly the linear CA. However, if (A, ·) is a nonabelian group, then endomorphic CA on (A^M, ·) are not the same as multiplicative CA. Even in this context, CA admit many nontrivial invariant measures. For example, it is easy to check the following:

Proposition 19  Let A^M be a group shift and let Φ ∈ ECA(A^M, ⋆). Let G ⊆ A^M be any Φ-invariant subgroup shift; then the Haar measure on G is Φ-invariant.
For example, if (A, +) is any nonsimple abelian group, and (A^M, +) has the product group structure, then A^M admits many nontrivial subgroup shifts; see [73]. If Φ is any linear CA on A^M with scalar coefficients, then every subgroup shift of A^M is Φ-invariant, so Proposition 19 yields many nontrivial Φ-invariant measures. To isolate the uniform measure η as a unique measure, we must impose further restrictions. The first nontrivial results in this direction were by Host, Maass, and Martínez [61]. Let h(Φ; μ) be the entropy of Φ relative to the measure μ (see Sect. "Entropy" for the definition).

Proposition 20  Let A := ℤ_{/p}, where p is prime. Let Φ ∈ CA(A^ℤ) be a linear CA with neighborhood {0, 1}, and let μ ∈ Meas(A^ℤ; Φ, σ). If μ is σ-ergodic, and h(Φ; μ) > 0, then μ is the Haar measure on A^ℤ.

Proof  See Theorem 12 in [61].
A similar idea is behind the next result, only with the roles of Φ and σ reversed. If μ is a measure on A^ℕ, and b ∈ A^{[1…∞)}, then we define the conditional measure μ^(b) on A by μ^(b)(a) := μ[x₀ = a | x_{[1…∞)} = b], where x is a μ-random sequence. For example, if μ is a Bernoulli measure, then μ^(b)(a) = μ[x₀ = a], independent of b; if μ is a Markov measure, then μ^(b)(a) = μ[x₀ = a | x₁ = b₁].

Proposition 21  Let (A, ·) be any finite (possibly nonabelian) group, and let Φ ∈ CA(A^ℕ) have multiplicative local rule φ : A^{{0,1}} → A defined by φ(a₀, a₁) := a₀ · a₁. Let μ ∈ Meas(A^ℕ; Φ, σ). If μ is Φ-ergodic, then there is some subgroup C ⊆ A such that, for every b ∈ A^{[1…∞)}, supp(μ^(b)) is a right coset of C, and μ^(b) is uniformly distributed on this coset.

Proof  See Theorem 3.1 in [113].
Example 22  Let Φ and μ be as in Proposition 21. Let η be the Haar measure on A^ℕ.
(a) μ has complete connections if supp(μ^(b)) = A for μ-almost all b ∈ A^{[1…∞)}. Thus, if μ has complete connections in Proposition 21, then μ = η.
(b1) Suppose h(μ; σ) > h₀ := max{log₂|C| ; C a proper subgroup of A}. Then μ = η.
(b2) In particular, suppose A = (ℤ_{/p}, +), where p is prime; then h₀ = 0. Thus, if Φ has local rule φ(a₀, a₁) := a₀ + a₁, and μ is any σ-invariant, Φ-ergodic measure with h(μ; σ) > 0, then μ = η. This is closely analogous to Proposition 20, but 'dual' to it, because the roles of Φ and σ are reversed in the ergodicity and entropy hypotheses.
(c) If C ⊆ A is a subgroup, and μ is the Haar measure on the subgroup shift C^ℕ ⊆ A^ℕ, then μ satisfies the conditions of Proposition 21. Other, less trivial possibilities also exist (see Examples 3.2(b,c) in [113]).

If μ is a measure on A^ℤ, and X, Y ⊆ A^ℤ, then we say X essentially equals Y and write X ≅ Y if μ[X △ Y] = 0. If n ∈ ℕ, then let

  I_n(μ) := {X ⊆ A^ℤ ; σ^n(X) ≅ X}

be the sigma-algebra of subsets of A^ℤ which are 'essentially' σ^n-invariant. Thus, μ is σ-ergodic if and only if I₁(μ) is trivial (i.e. contains only sets of measure zero or one). We say μ is totally σ-ergodic if I_n(μ) is trivial for all n ∈ ℕ.

Let (A^ℤ, ·) be any group shift. The identity element e of (A^ℤ, ·) is a constant sequence. Thus, if Φ ∈ ECA(A^ℤ, ·) is surjective, then ker(Φ) := {a ∈ A^ℤ ; Φ(a) = e} is a finite, shift-invariant subgroup of A^ℤ (i.e. a finite collection of σ-periodic sequences).

Proposition 23  Let (A^ℤ, ·) be a (possibly nonabelian) group shift, and let Φ ∈ ECA(A^ℤ, ·) be bipermutative, with neighborhood {0, 1}. Let μ ∈ Meas(A^ℤ; Φ, σ). Suppose that:
(IE) μ is totally ergodic for σ;
(H) h(Φ; μ) > 0; and
(K) ker(Φ) contains no nontrivial σ-invariant subgroups.
Then μ is the Haar measure on A^ℤ.

Proof  See Theorem 5.2 in [113].
Example 24  If A = ℤ_{/p} and (A^ℤ, +) is the product group, then Φ is a linear CA and condition (K) is automatically satisfied, so Proposition 23 becomes a special case of Proposition 20.

If Φ ∈ ECA(A^ℤ, ·), then we have an increasing sequence of finite, shift-invariant subgroups ker(Φ) ⊆ ker(Φ²) ⊆ ker(Φ³) ⊆ ⋯. If K(Φ) := ⋃_{n=1}^∞ ker(Φⁿ), then K(Φ) is a countable, shift-invariant subgroup of (A^ℤ, ·).

Theorem 25  Let (A^ℤ, +) be an abelian group shift, and let G ⊆ A^ℤ be a subgroup shift. Let Φ ∈ ECA(G, +) be bipermutative, and let μ ∈ Meas(G; Φ, σ). Suppose:
(I) I_{kP}(μ) = I₁(μ), where P is the lowest common multiple of the σ-periods of all elements in ker(Φ), and k ∈ ℕ is any common multiple of all prime factors of |A|.
(H) h(Φ; μ) > 0.
Furthermore, suppose that either:
(E1) μ is ergodic for the ℕ × ℤ action (Φ, σ);
(K1) every infinite, σ-invariant subgroup of K(Φ) ∩ G is dense in G; or:
(E2) μ is σ-ergodic;
Ergodic Theory of Cellular Automata
(K2) every infinite, (Φ, σ)-invariant subgroup of K(Φ) ∩ G is dense in G.
Then μ is the Haar measure on G.

Proof See Theorems 3.3 and 3.4 of [122], or Théorèmes V.4 and V.5 on p. 115 of [120].

In the special case when G is irreducible and has topological entropy log₂(p) (where p is prime), Sobottka has given a different and simpler proof, by using his theory of 'quasigroup shifts' to establish an isomorphism between Φ and a linear CA on ℤ/p, and then invoking Theorem 7. See Theorems 7.1 and 7.2 of [136], or Teoremas IV.3.1 and IV.3.2 on pp. 100–101 of [134].

Example 26 (a) Let A := ℤ/p, where p is prime. Let Φ ∈ CA(A^ℤ) be linear, and suppose that μ ∈ Meas(A^ℤ; Φ, σ) is (Φ, σ)-ergodic, h(Φ, μ) > 0, and I_{p(p−1)}(μ) = I₁(μ). Setting k = p and P = p − 1 in Theorem 25, we conclude that μ is the Haar measure on A^ℤ. For Φ with neighborhood {0, 1}, this result first appeared as Theorem 13 in [61].
(b) If (A^ℤ, ∗) is abelian, then Proposition 23 is a special case of Theorem 25 [hypothesis (IE) of the former implies hypotheses (I) and (E2) of the latter, while (K) implies (K2)]. Note, however, that Proposition 23 also applies to nonabelian groups.

An algebraic ℤ^D-action is an action of ℤ^D by automorphisms on a compact abelian group G. For example, if G ⊆ A^{ℤ^D} is an abelian subgroup shift, then σ is an algebraic ℤ^D-action. The invariant measures of algebraic ℤ^D-actions have been studied by Schmidt (see Sect. 29 in [124]), Silberger (see Sect. 7 in [132]), and Einsiedler [28,29].

If Φ ∈ CA(G), then a complete history for Φ is a sequence (g_t)_{t∈ℤ} ∈ G^ℤ such that Φ(g_t) = g_{t+1} for all t ∈ ℤ. Let Φ^ℤ(G) ⊆ G^ℤ ⊆ (A^{ℤ^D})^ℤ ≅ A^{ℤ^{D+1}} be the set of all complete histories for Φ; then Φ^ℤ(G) is a subshift of A^{ℤ^{D+1}}. If Φ ∈ ECA[G], then Φ^ℤ(G) is itself an abelian subgroup shift, and the shift action of ℤ^{D+1} on Φ^ℤ(G) is thus an algebraic ℤ^{D+1}-action. Any (Φ, σ)-invariant measure μ on G extends in the obvious way to a σ-invariant measure on Φ^ℤ(G).
Thus, any result about the invariant measures (or rigidity) of algebraic ℤ^{D+1}-actions can be translated immediately into a result about the invariant measures (or rigidity) of endomorphic cellular automata.

Proposition 27 Let G ⊆ A^{ℤ^D} be an abelian subgroup shift and let Φ ∈ ECA(G). Suppose μ ∈ Meas(G; Φ, σ) is (Φ, σ)-totally ergodic, and has entropy dimension d ∈ [1…D] (see Subsect. "Entropy Geometry and Expansive Subdynamics"). If the system (G, μ; Φ, σ) admits no factors whose d-dimensional measurable entropy is zero, then there is a Φ-invariant subgroup shift G′ ⊆ G and some element x ∈ G such that μ is the translated Haar measure on the 'affine' subset G′ + x.

Proof This follows from Corollary 2.3 in [29].
If we remove the requirement of 'no zero-entropy factors', and instead require G and Φ to satisfy certain technical algebraic conditions, then μ must be the Haar measure on G (see Theorem 1.2 in [29]). These strong hypotheses are probably necessary, because in general the system (G, σ, Φ) admits uncountably many distinct nontrivial invariant measures, even if (G, σ, Φ) is irreducible, meaning that G contains no proper, infinite, Φ-invariant subgroup shifts:

Proposition 28 Let G ⊆ A^{ℤ^D} be an abelian subgroup shift, let Φ ∈ ECA(G), and suppose (G, σ, Φ) is irreducible. For any s ∈ [0, 1), there exists a (Φ, σ)-ergodic measure μ ∈ Meas(G; Φ, σ) such that h(μ, Φⁿ ∘ σ^z) = s · h_top(G, Φⁿ ∘ σ^z) for every n ∈ ℕ and z ∈ ℤ^D.
Proof This follows from Corollary 1.4 in [28].
Let μ ∈ Meas(A^𝕄, σ) and let H ⊂ 𝕄 be a finite subset. We say that μ is H-mixing if, for any H-indexed collection {U_h}_{h∈H} of measurable subsets of A^𝕄,

  lim_{n→∞} μ[ ⋂_{h∈H} σ^{nh}(U_h) ] = ∏_{h∈H} μ[U_h] .
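For the uniform Bernoulli measure this identity can be checked exactly, because a cylinder set constrains only finitely many coordinates: once n is large enough that the shifted constraint sets are pairwise disjoint, both sides agree term by term. The sketch below (our illustration, with ad hoc names, for A = {0, 1} and 𝕄 = ℤ) computes both sides symbolically:

```python
def shift(cyl, m):
    """Shift a cylinder set, encoded as {coordinate: required symbol}, by m."""
    return {p + m: v for p, v in cyl.items()}

def measure(cyl):
    """Uniform Bernoulli measure on {0,1}^Z of a (consistent) cylinder set."""
    return 0.5 ** len(cyl)

def intersect(cyls):
    """Merge cylinder constraints; None signals the empty intersection."""
    merged = {}
    for cyl in cyls:
        for p, v in cyl.items():
            if merged.get(p, v) != v:
                return None
            merged[p] = v
    return merged

H = [0, 1, 3]                                  # finite subset of the lattice
U = {0: {0: 1}, 1: {0: 1, 1: 0}, 3: {0: 0}}    # one cylinder set per h in H

def lhs(n):
    c = intersect([shift(U[h], n * h) for h in H])
    return 0.0 if c is None else measure(c)

rhs = measure(U[0]) * measure(U[1]) * measure(U[3])

assert lhs(5) == rhs      # shifts are disjoint: exact equality, no limit needed
assert lhs(0) == 0.0      # overlapping shifts: contradictory constraints
```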
For example, if |H| = H, then any H-multiply σ-mixing measure (see Subsect. "Mixing and Ergodicity") is H-mixing.

Proposition 29 Let G ⊆ A^{ℤ^D} be an abelian subgroup shift and let Φ ∈ ECA(G) have neighborhood H (with |H| ≥ 2). Suppose (G, σ, Φ) is irreducible, and let μ ∈ Meas(A^{ℤ^D}; Φ, σ). Then μ is H-mixing if and only if μ is the Haar measure of G.
Proof This follows from [124], Corollary 29.5, p. 289 (note that Schmidt uses 'almost minimal' to mean 'irreducible'). A significant generalization of Proposition 29 appears in [115].

The Furstenberg Conjecture

Let 𝕋¹ = ℝ/ℤ be the circle group, which we identify with the interval [0, 1). Define the functions ×2, ×3 : 𝕋¹ → 𝕋¹ by ×2(t) = 2t (mod 1) and ×3(t) = 3t (mod 1). Clearly, these maps commute, and both preserve the Lebesgue measure on 𝕋¹. Furstenberg [44] speculated that the only nonatomic ×2- and ×3-invariant measure on 𝕋¹ was the Lebesgue measure. Rudolph [119] showed that, if μ is a (×2, ×3)-invariant measure and not Lebesgue, then the
systems (𝕋¹, μ; ×2) and (𝕋¹, μ; ×3) have zero entropy; this was later generalized in [60,69]. It is not known whether any nonatomic measures exist on 𝕋¹ which satisfy Rudolph's conditions; this is considered an outstanding problem in abstract ergodic theory.

To see the connection between Furstenberg's Conjecture and cellular automata, let A = {0, 1, 2, 3, 4, 5}, and define the surjection π : A^ℕ → 𝕋¹ by mapping each a ∈ A^ℕ to the element of [0, 1) having a as its base-6 expansion. That is:

  π(a₀, a₁, a₂, …) := Σ_{n=0}^∞ a_n / 6^{n+1} .

The map π is injective everywhere except on the countable set of sequences ending in [000…] or [555…] (on this set, π is 2-to-1). Furthermore, π defines a semiconjugacy from ×2 and ×3 into two CA on A^ℕ. Let H := {0, 1}, and define local maps λ₂, λ₃ : A^H → A as follows:

  λ₂(a₀, a₁) = [2a₀ + ⌊a₁/3⌋]₆   and   λ₃(a₀, a₁) = [3a₀ + ⌊a₁/2⌋]₆ ,

where [a]₆ is the least residue of a, mod 6. If Λ_p ∈ CA(A^ℕ) has local map λ_p (for p = 2, 3), then it is easy to check that Λ_p corresponds to multiplication by p in base-6 notation. In other words, π ∘ Λ_p = ×p ∘ π for p = 2, 3. If λ is the Lebesgue measure on 𝕋¹, then π(η) = λ, where η is the uniform Bernoulli measure on A^ℕ. Thus, η is Λ₂- and Λ₃-invariant, and Furstenberg's Conjecture asserts that η is the only nonatomic measure on A^ℕ which is both Λ₂- and Λ₃-invariant.

The shift map σ : A^ℕ → A^ℕ corresponds to multiplication by 6 in base-6 notation. Hence, Λ₂ ∘ Λ₃ = σ. From this it follows that a measure μ is (Λ₂, Λ₃)-invariant if and only if μ is (Λ₂, σ)-invariant, if and only if μ is (σ, Λ₃)-invariant. Thus, Furstenberg's Conjecture equivalently asserts that η is the only stationary, Λ₃-invariant nonatomic measure on A^ℕ, and Rudolph's result asserts that η is the only such nonatomic measure with nonzero entropy; this is analogous to the 'measure rigidity' results of Subsect. "Measure Rigidity in Algebraic CA". The existence of zero-entropy, (σ, Λ₃)-invariant, nonatomic measures remains an open question.

Remark 30 (a) There is nothing special about 2 and 3; the same results hold for any pair of distinct prime numbers. (b) Lyons [85] and Rudolph and Johnson [68] have also established that a wide variety of ×2-invariant probability measures on 𝕋¹ will weak* converge, under the iteration of ×3, to the Lebesgue measure (and vice versa). In
the terminology of Subsect. "Asymptotic Randomization by Linear Cellular Automata", these results immediately translate into equivalent statements about the 'asymptotic randomization' of initial probability measures on A^ℕ under the iteration of Λ₂ or Λ₃.

Domains, Defects, and Particles

Suppose Φ ∈ CA(A^ℤ), and there is a collection of Φ-invariant subshifts P₁, P₂, …, P_N ⊆ A^ℤ (called phases). Any sequence a can be expressed as a finite or infinite concatenation

  a = [… a₋₂ d₋₂ a₋₁ d₋₁ a₀ d₀ a₁ d₁ a₂ …] ,

where each domain a_k is a finite word (or half-infinite sequence) which is admissible to phase P_n for some n ∈ [1…N], and where each defect d_k is a (possibly empty) finite word (note that this decomposition may not be unique). Thus, Φ(a) = a′, where

  a′ = [… a′₋₂ d′₋₂ a′₋₁ d′₋₁ a′₀ d′₀ a′₁ d′₁ a′₂ …] ,

and, for every k ∈ ℤ, a′_k belongs to the same phase as a_k. We say that Φ has stable phases if, for any such a and a′ in A^ℤ, it is the case that, for all k ∈ ℤ, |d′_k| ≤ |d_k|. In other words, the defects do not grow over time. However, they may propagate sideways; for example, d′_k may be slightly to the right of d_k, if the domain a′_k is larger than a_k, while the domain a′_{k+1} is slightly smaller than a_{k+1}.

If a_k and a_{k+1} belong to different phases, then the defect d_k is sometimes called a domain boundary (or 'wall', or 'edge particle'). If a_k and a_{k+1} belong to the same phase, then the defect d_k is sometimes called a dislocation (or 'kink'). Often P_n = {p}, where p = [… ppp …] is a constant sequence, or each P_n consists of the σ-orbit of a single σ-periodic sequence. More generally, the phases P₁, …, P_N may be subshifts of finite type. In this case, most sequences in A^ℤ can be fairly easily and unambiguously decomposed into domains separated by defects. However, if the phases are more complex (e.g. sofic shifts), then the exact definition of a 'defect' is actually fairly complicated – see [114] for a rigorous discussion.
Example 31 Let A = {0, 1} and let H = {−1, 0, 1}. Elementary cellular automaton (ECA) #184 is the CA Φ : A^ℤ → A^ℤ with local rule φ : A^H → A given as follows: φ(a₋₁, a₀, a₁) = 1 if a₀ = a₁ = 1, or if a₋₁ = 1 and a₀ = 0. On the other hand, φ(a₋₁, a₀, a₁) = 0 if a₋₁ = a₀ = 0, or if a₀ = 1 and a₁ = 0. Heuristically, each '1' represents a 'car' moving cautiously to the right on a single-lane road. During each iteration, each car will
advance to the site in front of it, unless that site is already occupied, in which case the car will remain stationary. ECA #184 exhibits one stable phase P, given by the 2-periodic sequence [… 0101.0101 …] and its translate [… 1010.1010 …] (here the decimal point indicates the zeroth coordinate), and Φ acts on P like the shift. The phase P admits two dislocations of width 2. The dislocation d₀ = [00] moves uniformly to the right, while the dislocation d₁ = [11] moves uniformly to the left. In the traffic interpretation, P represents freely flowing traffic, d₀ represents a stretch of empty road, and d₁ represents a traffic jam.

Example 32 Let A := ℤ/N, and let H := [−1…1]. The one-dimensional, N-color cyclic cellular automaton (CCA_N) Φ : A^ℤ → A^ℤ has local rule φ : A^H → A defined:

  φ(a) := a₀ + 1  if there is some h ∈ H with a_h = a₀ + 1 ;
  φ(a) := a₀      otherwise

(here, addition is mod N). The CCA has phases P₀, P₁, …, P_{N−1}, where P_a = {[… aaa …]} for each a ∈ A. A domain boundary between P_a and P_{a−1} moves with constant velocity towards the P_{a−1} side. All other domain boundaries are stationary.

In a particle cellular automaton (PCA), A = {∅} ⊔ P, where P is a set of 'particle types' and ∅ represents a vacant site. Each particle p ∈ P is assigned some (constant) velocity vector v(p) ∈ (−H) (where H is the neighborhood of the automaton). Particles propagate with constant velocity through 𝕄 until two or more particles try to simultaneously enter the same site in the lattice, at which point the outcome is determined by a collision rule: a stylized 'chemical reaction equation'. For example, an equation "p₁ + p₂ ⇝ p₃" means that, if particle types p₁ and p₂ collide, they coalesce to produce a particle of type p₃. On the other hand, "p₁ + p₂ ⇝ ∅" means that the two particles annihilate on contact.
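The traffic dynamics of Example 31 are easy to simulate. The sketch below (our illustration; the function names are ad hoc) runs ECA #184 on a cyclic configuration containing one [00] defect and one [11] defect in the checkerboard phase; the empty-road defect drifts right while the jam drifts left, as described above:

```python
def eca184_step(cells):
    """One synchronous step of ECA #184 on a cyclic road of len(cells) sites."""
    n = len(cells)
    out = []
    for i in range(n):
        left, mid, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        # site i holds a car next step iff its car is blocked, or a car moves in
        out.append(1 if (mid == 1 and right == 1) or (left == 1 and mid == 0) else 0)
    return out

# checkerboard phase with one empty-road defect [00] and one jam defect [11]
road = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
history = [road]
for _ in range(4):
    road = eca184_step(road)
    history.append(road)

assert all(sum(r) == 6 for r in history)   # cars are conserved at every step
```

Printing the successive rows of `history` shows the [00] block stepping one site right, and the [11] block one site left, per iteration, until they meet.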
Formally, given a set of velocities and collision rules, the local rule φ : A^H → A is defined:

  φ(a) := p  if there is a unique h ∈ H and p ∈ P with a_h = p and v(p) = −h ;
  φ(a) := q  if {p ∈ P ; a_{−v(p)} = p} = {p₁, p₂, …, p_n}, and p₁ + ⋯ + p_n ⇝ q .
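A direct implementation of this rule (our illustrative sketch; the names are ours, and for simplicity it resolves only same-site arrivals, ignoring the 'crossing' collisions that a radius-1 local rule such as λ below must also handle) moves every particle by its velocity and then applies the collision table wherever two or more particles land together:

```python
EMPTY = 0

def pca_step(cells, velocity, collide):
    """One step of a 1-D particle CA on a cyclic lattice: every particle jumps
    by its velocity, then each site resolves simultaneous arrivals via the
    collision table (a map from a frozenset of particle types to an outcome)."""
    n = len(cells)
    arrivals = [[] for _ in range(n)]
    for z, p in enumerate(cells):
        if p != EMPTY:
            arrivals[(z + velocity[p]) % n].append(p)
    out = []
    for bucket in arrivals:
        if not bucket:
            out.append(EMPTY)
        elif len(bucket) == 1:
            out.append(bucket[0])
        else:
            out.append(collide[frozenset(bucket)])
    return out

# ballistic annihilation (Example 33 below): types +1 and -1 destroy each other
velocity = {+1: +1, -1: -1}
collide = {frozenset({+1, -1}): EMPTY}

state = [+1, EMPTY, -1, EMPTY, EMPTY]   # even separation: a same-site collision
assert pca_step(state, velocity, collide) == [EMPTY] * 5
```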
Example 33 The one-dimensional ballistic annihilation model (BAM) contains two particle types: P = {+1, −1}, with the following rules:

  v(+1) = +1 ,   v(−1) = −1 ,   and   (+1) + (−1) ⇝ ∅ .
(This CA is sometimes also called Just Gliders.) Thus, a_z = +1 if the cell z contains a particle moving to the right with velocity +1, whereas a_z = −1 if the cell z contains a particle moving left with velocity −1, and a_z = ∅ if cell z is vacant. Particles move with constant velocity until they collide with oncoming particles, at which point both particles are annihilated. If B := {+1, −1, ∅} and H = [−1…1] ⊂ ℤ, then we can represent the BAM using Λ ∈ CA(B^ℤ) with local rule λ : B^H → B defined:

  λ(b₋₁, b₀, b₁) := +1 if b₋₁ = +1 and b₀, b₁ ∈ {+1, ∅} ;
  λ(b₋₁, b₀, b₁) := −1 if b₁ = −1 and b₋₁, b₀ ∈ {−1, ∅} ;
  λ(b₋₁, b₀, b₁) := ∅  otherwise .

Particle CA can be seen as 'toy models' of particle physics or microscale chemistry. More interestingly, however, one-dimensional PCA often arise as factors of coalescent-domain CA, with the 'particles' tracking the motion of the defects.

Example 34 (a) Let A := {0, 1} and let Φ ∈ CA(A^ℤ) be ECA #184. Let B := {±1, 0} (writing 0 for the vacant state ∅), and let Λ ∈ CA(B^ℤ) be the BAM. Let G := {0, 1}, and let ψ : A^ℤ → B^ℤ be the block map with local rule ψ : A^G → B defined ψ(a₀, a₁) := 1 − a₀ − a₁; that is, ψ(a₀, a₁) = +1 if [a₀, a₁] = [0, 0] = d₀, ψ(a₀, a₁) = −1 if [a₀, a₁] = [1, 1] = d₁, and ψ(a₀, a₁) = 0 otherwise. Then ψ ∘ Φ = Λ ∘ ψ; in other words, the BAM is a factor of ECA #184, and tracks the motion of the dislocations.
(b) Again, let Λ ∈ CA(B^ℤ) be the BAM. Let A = ℤ/3, and let Φ ∈ CA(A^ℤ) be the 3-color CCA. Let G := {0, 1}, and let ψ : A^ℤ → B^ℤ be the block map with local rule ψ : A^G → B defined ψ(a₀, a₁) := (a₀ − a₁) mod 3. Then ψ ∘ Φ = Λ ∘ ψ; in other words, the BAM is a factor of CCA₃, and tracks the motion of the domain boundaries.

Thus, it is often possible to translate questions about coalescent-domain CA into questions about particle CA, which are generally easier to study. For example, the invariant measures of the BAM have been completely characterized.

Proposition 35 Let B = {±1, 0}, and let Λ : B^ℤ → B^ℤ be the BAM.
(a) The sets R := {0, +1}^ℤ and L := {0, −1}^ℤ are Λ-invariant, and Λ acts as a right-shift on R and as a left-shift on L.
(b) Let L⁺ := {0, −1}^ℕ and R⁻ := {0, +1}^{−ℕ}, and let

  X := { a ∈ B^ℤ ; ∃ z ∈ ℤ such that a_{(−∞…z]} ∈ R⁻ and a_{[z…∞)} ∈ L⁺ } .

Then X is Λ-invariant. For any x ∈ X, Λ acts as a right shift on x_{(−∞…z)}, and as a left-shift on x_{(z…∞)}. (The boundary point z executes some kind of random walk.)
(c) Any Λ-invariant measure on B^ℤ can be written in a unique way as a convex combination of four measures δ₀, ρ, λ, and ξ, where: δ₀ is the point mass on the 'vacuum' configuration [… 000 …]; ρ is any shift-invariant measure on R; λ is any shift-invariant measure on L; and ξ is a measure on X. Furthermore, there exist shift-invariant measures ρ⁻ and λ⁺ on R⁻ and L⁺, respectively, such that, for ξ-almost all x ∈ X, x_{(−∞…z]} is ρ⁻-distributed and x_{[z…∞)} is λ⁺-distributed.

Proof (a) and (b) are obvious; (c) is Theorem 1 in [8].

Remark 36 (a) Proposition 35(c) can be immediately translated into a complete characterization of the invariant measures of ECA #184, via the factor map ψ in Example 34(a); see [8], Theorem 3. Likewise, using the factor map in Example 34(b), we get a complete characterization of the invariant measures for CCA₃.
(b) Proposition 48 and Corollaries 49 and 50 describe the limit measures of the BAM, CCA₃, and ECA #184. Also, Blank [10] has characterized the invariant measures for a broad class of multilane, multi-speed traffic models (including ECA #184); see Remark 51(b).
(c) Kůrka [78] has defined, for any Φ ∈ CA(A^ℤ), a construction similar to the set X in Proposition 35(b). For any n ∈ ℕ and z ∈ ℤ, let S_{z,n} be the set of fixed points of Φⁿ ∘ σ^z; then S_{z,n} is a subshift of finite type, which Kůrka calls a signal subshift with velocity v = z/n. (For example, if Φ is the BAM, then R = S_{1,1} and L = S_{−1,1}.) Now, suppose that z₁/n₁ > z₂/n₂ > ⋯ > z_J/n_J.
The join of the signal subshifts S_{z₁,n₁}, S_{z₂,n₂}, …, S_{z_J,n_J} is the set S of all infinite sequences [a₁ a₂ … a_J], where, for all j ∈ [1…J], a_j is a (possibly empty) finite word or (half-)infinite sequence admissible to the subshift S_{z_j,n_j}. (For example, if S is the join of S_{1,1} = R and S_{−1,1} = L from Proposition 35(a), then S = L ∪ X ∪ R.) It follows that S ⊆ Φ(S) ⊆ Φ²(S) ⊆ ⋯. If we define Φ^∞(S) := ⋃_{t=0}^∞ Φ^t(S), then Φ^∞(S) ⊆ Φ^∞(A^ℤ), where Φ^∞(A^ℤ) := ⋂_{t=0}^∞ Φ^t(A^ℤ) is the omega limit set of Φ (see Proposition 5 in [78]). The support of any Φ-invariant measure must be contained in Φ^∞(A^ℤ), so invariant measures may be closely related to the joins of signal subshifts. See Topological Dynamics of Cellular Automata for more information.

In the case of the BAM, it is not hard to check that Φ^∞(S) = S = Φ^∞(A^ℤ); this suggests an alternate proof of Proposition 35(c). It would be interesting to know whether a conclusion analogous to Proposition 35(c) holds for other Φ ∈ CA(A^ℤ) such that Φ^∞(A^ℤ) is a join of signal subshifts.

Limit Measures and Other Asymptotics

Asymptotic Randomization by Linear Cellular Automata

The results of Subsect. "Measure Rigidity in Algebraic CA" suggest that the uniform Bernoulli measure η is the 'natural' measure for algebraic CA, because η is the unique invariant measure satisfying any one of several collections of reasonable criteria. In this section, we will see that η is 'natural' in quite another way: it is the unique limit measure for linear CA from a large set of initial conditions.

If {μₙ}_{n=1}^∞ is a sequence of measures on A^𝕄, then this sequence weak* converges to the measure μ_∞ ("wk*lim_{n→∞} μₙ = μ_∞") if, for all cylinder sets B ⊂ A^𝕄, lim_{n→∞} μₙ[B] = μ_∞[B]. Equivalently, for all continuous functions f : A^𝕄 → ℂ, we have

  lim_{n→∞} ∫_{A^𝕄} f dμₙ = ∫_{A^𝕄} f dμ_∞ .
The Cesàro average (or Cesàro limit) of {μₙ}_{n=1}^∞ is wk*lim_{N→∞} (1/N) Σ_{n=1}^N μₙ, if this limit exists. Let μ ∈ Meas(A^𝕄) and let Φ ∈ CA(A^𝕄). For any t ∈ ℕ, the measure Φ^t μ is defined by Φ^t μ(B) = μ(Φ^{−t}(B)), for any measurable subset B ⊆ A^𝕄. We say that Φ asymptotically randomizes μ if the Cesàro average of the sequence {Φ^t μ}_{t=1}^∞ is η. Equivalently, there is a subset J ⊆ ℕ of density 1, such that wk*lim_{J∋j→∞} Φ^j μ = η.
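For the nearest-neighbor XOR CA, the single-site marginal of Φ^t μ can be computed in closed form when μ is a Bernoulli(p) measure: Φ^t(a)₀ = Σ_k C(t,k) a_{2k−t} (mod 2) is a sum of independent bits, one for each odd binomial coefficient C(t,k), and by Lucas' theorem there are 2^(number of 1-bits of t) such coefficients. The sketch below (our illustration, not part of the cited results) shows the marginal approaching the uniform value ½ for 'typical' t, but not along the sparse subsequence t = 2^k; this is why the definition only demands convergence along a subset J of density 1:

```python
def odd_binomials(t):
    """Number of odd binomial coefficients C(t, k), k = 0..t: by Lucas'
    theorem, this equals 2 raised to the number of 1-bits of t."""
    return 2 ** bin(t).count('1')

def marginal_one(t, p):
    """P[ Phi^t(a)_0 = 1 ] for the XOR CA phi(a) = a_{-1} + a_{+1} (mod 2),
    where a is drawn from the Bernoulli(p) measure.  The site value is the
    parity of m independent Bernoulli(p) bits, with m = odd_binomials(t)."""
    m = odd_binomials(t)
    return (1 - (1 - 2 * p) ** m) / 2

p = 0.9
near = marginal_one(255, p)   # t = 2**8 - 1: all 256 coefficients are odd
far = marginal_one(256, p)    # t = 2**8: only the two end coefficients are odd
assert abs(near - 0.5) < 1e-20
assert abs(far - 0.5) > 0.05
```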
The uniform measure η is the measure of maximal entropy on A^𝕄. Thus, asymptotic randomization is a kind of 'Second Law of Thermodynamics' for CA.

Let (A, +) be a finite abelian group, and let Φ be a linear cellular automaton (LCA) on A^𝕄. Recall that Φ has scalar coefficients if there is some finite H ⊂ 𝕄, and integer coefficients {c_h}_{h∈H}, so that Φ has a local rule of the form

  φ(a_H) := Σ_{h∈H} c_h a_h .   (3)
An LCA Φ is proper if Φ has scalar coefficients as in Eq. (3), and if, furthermore, for any prime divisor p of |A|, there are at least two h, h′ ∈ H such that c_h ≢ 0 ≢ c_{h′} (mod p). For example, if A = ℤ/n for some n ∈ ℕ, then every LCA on A^𝕄 has scalar coefficients; in this case, Φ is proper if, for every prime p dividing n, at least two of these coefficients are coprime to p. In particular, if A = ℤ/p for some prime p, then Φ is proper as long as |H| ≥ 2. Let PLCA(A^𝕄) be the set of proper linear CA on A^𝕄. If μ ∈ Meas(A^𝕄), recall that μ has full support if μ[B] > 0 for every cylinder set B ⊆ A^𝕄.
Theorem 37 Let (A, +) be a finite abelian group, let 𝕄 := ℤ^D × ℕ^E for some D, E ≥ 0, and let Φ ∈ PLCA(A^𝕄). Let μ be any Bernoulli measure or Markov random field on A^𝕄 having full support. Then Φ asymptotically randomizes μ.

History Theorem 37 was first proved for simple one-dimensional LCA randomizing Bernoulli measures on A^ℤ, where A was a cyclic group. In the case A = ℤ/2, Theorem 37 was independently proved for the nearest-neighbor XOR CA (having local rule φ(a₋₁, a₀, a₁) = a₋₁ + a₁ mod 2) by Miyamoto [100] and Lind [82]. This result was then generalized to A = ℤ/p for any prime p by Cai and Luo [16]. Next, Maass and Martínez [87] extended the Miyamoto/Lind result to the binary Ledrappier CA (local rule φ(a₀, a₁) = a₀ + a₁ mod 2). Soon after, Ferrari et al. [36] considered the case when A was an abelian group of order p^k (p prime), and proved Theorem 37 for any Ledrappier CA (local rule φ(a₀, a₁) = c₀a₀ + c₁a₁, where c₀, c₁ ≢ 0 mod p) acting on any measure μ on A^ℤ having full support and 'rapidly decaying correlations' (see Part II(a) below). For example, this includes any Markov measure on A^ℤ with full support. Next, Pivato and Yassawi [116] generalized Theorem 37 to any PLCA acting on any fully supported N-step Markov chain on A^ℤ or any nontrivial Bernoulli measure on A^{ℤ^D × ℕ^E}, where A = ℤ/p^k (p prime). Finally, Pivato and Yassawi [117] proved Theorem 37 in full generality, as stated above.

The proofs of Theorem 37 and its variations all involve two parts:
Part I. A careful analysis of the local rule of Φ^t (for all t ∈ ℕ), showing that the neighborhood of Φ^t grows large as t → ∞ (and in some cases, contains large 'gaps').
Part II. A demonstration that the measure μ exhibits 'rapidly decaying correlations' between widely separated elements of 𝕄; hence, when these elements are combined using Φ^t, it is as if we are summing independent random variables.

Part I Any linear CA with scalar coefficients can be written as a 'Laurent polynomial of shifts'. That is, if Φ has local rule (3), then for any a ∈ A^𝕄,

  Φ(a) := Σ_{h∈H} c_h σ^h(a)   (where we add configurations componentwise) .

We indicate this by writing "Φ = F(σ)", where F ∈ ℤ[x₁^{±1}, x₂^{±1}, …, x_D^{±1}] is the D-variable Laurent polynomial defined:

  F(x₁, …, x_D) := Σ_{(h₁,…,h_D)∈H} c_h x₁^{h₁} x₂^{h₂} ⋯ x_D^{h_D} .

For example, if Φ is the nearest-neighbor XOR CA, then Φ = σ^{−1} + σ = F(σ), where F(x) = x^{−1} + x. If Φ is a Ledrappier CA, then Φ = c₀·Id + c₁·σ = F(σ), where F(x) = c₀ + c₁x. It is easy to verify that, if F and G are two such polynomials, and Φ = F(σ) while Ψ = G(σ), then Φ ∘ Ψ = (F·G)(σ), where F·G is the product of F and G in the polynomial ring ℤ[x₁^{±1}, x₂^{±1}, …, x_D^{±1}]. In particular, this means that Φ^t = F^t(σ) for all t ∈ ℕ. Thus, iterating an LCA is equivalent to computing the powers of a polynomial. If A = ℤ/p, then we can compute the coefficients of F^t modulo p. If p is prime, then this can be done using a result of Lucas [84], which provides a formula for the binomial coefficient C(a, b) in terms of the base-p expansions of a and b. For example, if p = 2, then Lucas' theorem says that Pascal's triangle, modulo 2, looks like a 'discrete Sierpinski triangle', made out of 0's and 1's. (This is why fragments of the Sierpinski triangle appear frequently in the spacetime diagrams of linear CA on A = ℤ/2, a phenomenon which has inspired much literature on 'fractals and automatic sequences in cellular automata'; see [2,4,5,6,7,52,53,54,55,56,57,94,138,139,140,145,146,147,148,149].) Thus, Lucas' Theorem, along with some combinatorial lemmas about the structure of base-p expansions, provides the machinery for Part I.

Part II There are two approaches to analyzing probability measures on A^𝕄: one using renewal theory, and the other using harmonic analysis.

II(a) Renewal theory This approach was developed by Maass, Martínez, and their collaborators. Loosely speaking, if μ ∈ Meas(A^ℤ, σ) has sufficiently large support and sufficiently rapid decay of correlations (e.g. a Markov chain), and a ∈ A^ℤ is a μ-random sequence, then we can treat a as if there is a sparse, randomly distributed set of 'renewal times', when the normal stochastic evolution of a is interrupted by independent, random 'errors'. By judicious use
of Part I described above, one can use this 'renewal process' to make it seem as though Φ^t is summing independent random variables. For example, if (A, +) is an abelian group of order p^k (where p is prime), and μ ∈ Meas(A^ℤ, σ) has complete connections (see Example 22(a)) and summable decay (which means that a certain sequence of coefficients, measuring long-range correlation, decays fast enough that its sum is finite), and Φ ∈ CA(A^ℤ) is a Ledrappier CA, then Ferrari et al. (see Theorem 1.3 in [36]) showed that Φ asymptotically randomizes μ. (For example, this applies to any N-step Markov chain with full support on A^ℤ.) Furthermore, if A = (ℤ/p)², and Φ ∈ CA(A^ℤ) has linear local rule

  φ( (x₀, y₀), (x₁, y₁) ) = (y₀, x₀ + y₁) ,

then Maass and Martínez [88] showed that Φ randomizes any Markov measure with full support on A^ℤ. Maass and Martínez again handled Part II using renewal theory. However, in this case, Part I involves some delicate analysis of the (noncommutative) algebra of the matrix-valued coefficients; unfortunately, their argument does not generalize to other LCA with noncommuting, matrix-valued coefficients. (However, Proposition 8 of [117] suggests a general strategy for dealing with such LCA.)
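The polynomial calculus of Part I is easy to reproduce computationally. The sketch below (our illustration, with ad hoc names) computes the local-rule coefficients of Φ^t for the Ledrappier-type CA Φ = Id + σ over ℤ/3 by raising F(x) = 1 + x to the t-th power, and checks them against Lucas' digit-by-digit formula for binomial coefficients mod p:

```python
from math import comb

def poly_mul(f, g, p):
    """Multiply two polynomials, given as coefficient lists, over Z/p."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] = (out[i + j] + a * b) % p
    return out

def lucas_binom(t, k, p):
    """C(t, k) mod p, for prime p, from the base-p digits (Lucas' theorem)."""
    r = 1
    while t or k:
        r = (r * comb(t % p, k % p)) % p
        t //= p
        k //= p
    return r

p, t = 3, 20
F = [1, 1]              # F(x) = 1 + x, i.e. the LCA  Phi = Id + sigma  over Z/3
Ft = [1]
for _ in range(t):
    Ft = poly_mul(Ft, F, p)   # coefficients of the local rule of Phi^t

assert Ft == [lucas_binom(t, k, p) for k in range(t + 1)]
```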
II(b) Harmonic analysis This approach to Part II was implicit in the early work of Lind [82] and Cai and Luo [16], but was developed in full generality by Pivato and Yassawi [116,117,118]. We regard A^𝕄 as a direct product of copies of the group (A, +), and endow it with the product group structure; then (A^𝕄, +) is a compact abelian topological group. A character on (A^𝕄, +) is a continuous group homomorphism χ : A^𝕄 → 𝕋, where 𝕋 := {c ∈ ℂ ; |c| = 1} is the unit circle group. If μ is a measure on A^𝕄, then the Fourier coefficients of μ are defined μ̂[χ] := ∫_{A^𝕄} χ dμ, for every character χ. If χ : A^𝕄 → 𝕋 is any character, then there is a unique finite subset K ⊂ 𝕄 (called the support of χ) and a unique collection of nontrivial characters χ_k : A → 𝕋 for all k ∈ K, such that

  χ(a) = ∏_{k∈K} χ_k(a_k) ,   ∀ a ∈ A^𝕄 .   (4)

We define rank[χ] := |K|. The measure μ is called harmonically mixing if, for all ε > 0, there is some R such that, for all characters χ, (rank[χ] ≥ R) ⟹ (|μ̂[χ]| < ε).

The set Hm(A^𝕄) of harmonically mixing measures on A^𝕄 is quite inclusive. For example, if μ is any (N-step) Markov chain with full support on A^ℤ, then μ ∈ Hm(A^ℤ) (see Propositions 8 and 10 in [116]), and if ν ∈ Meas(A^ℤ) is absolutely continuous with respect to this μ, then ν ∈ Hm(A^ℤ) also (see Corollary 9 in [116]). If A = ℤ/p (p prime), then any nontrivial Bernoulli measure on A^𝕄 is harmonically mixing (see Proposition 6 in [116]). Furthermore, if μ ∈ Meas(A^ℤ, σ) has complete connections and summable decay, then μ ∈ Hm(A^ℤ) (see Theorem 23 in [61]).

If M := Meas(A^𝕄; ℂ) is the set of all complex-valued measures on A^𝕄, then M is a Banach algebra (i.e. it is a vector space under the obvious definition of addition and scalar multiplication for measures, a Banach space under the total variation norm, and finally, since A^𝕄 is a topological group, M is a ring under convolution). Then Hm(A^𝕄) is an ideal in M, is closed under the total variation norm, and is dense in the weak* topology on M (see Propositions 4 and 7 in [116]). Finally, if μ is any Markov random field on A^𝕄 which is locally free (which roughly means that the boundary of any finite region does not totally determine the interior of that region), then μ ∈ Hm(A^𝕄) (see Theorem 1.3 in [118]). In particular, this implies:

Proposition 38 If (A, +) is any finite group, and μ ∈ Meas(A^𝕄) is any Markov random field with full support, then μ is harmonically mixing.

Proof This follows from Theorem 1.3 in [118]. It is also a special case of Theorem 15 in [117].

If χ is a character, and Φ is an LCA, then χ ∘ Φ^t is also a character, for any t ∈ ℕ (because it is a composition of two continuous group homomorphisms). We say Φ is diffusive if there is a subset J ⊆ ℕ of density 1, such that, for every character χ of A^𝕄,

  lim_{J∋j→∞} rank[χ ∘ Φ^j] = ∞ .

Proposition 39 Let (A, +) be any finite abelian group and let 𝕄 be any monoid. If μ is harmonically mixing and Φ is diffusive, then Φ asymptotically randomizes μ.

Proof See Theorem 12 in [117].

Proposition 40 Let (A, +) be any abelian group and let 𝕄 := ℤ^D × ℕ^E for some D, E ≥ 0. If Φ ∈ PLCA(A^𝕄), then Φ is diffusive.

Proof The proof uses Lucas' theorem, as described in Part I above. See Theorem 15 in [116] for the case A = ℤ/p with p prime. See Theorem 6 in [117] for the case when A is any cyclic group. That proof easily extends to any finite abelian group A: write A as a product of cyclic groups and decompose Φ into separate automata over these cyclic factors.

Proof of Theorem 37 Combine Propositions 38, 39, and 40.
Remark (a) Proposition 40 can be generalized: we do not need the coefficients of Φ to be integers, but merely to be a collection of automorphisms of A which commute with one another (so that Lucas' theorem from Part I is still applicable); see Theorem 9 in [117].
(b) For simplicity, we stated Theorem 37 for measures with full support; however, Proposition 39 actually applies to many Markov random fields without full support, because harmonic mixing only requires 'local freedom' (see Theorem 1.3 in [118]). For example, the support of a Markov chain μ on A^ℤ is a Markov subshift. If A = ℤ/p (p prime), then Proposition 39 yields asymptotic randomization of the Markov chain μ as long as the transition digraph of the underlying Markov subshift admits at least two distinct paths of length 2 between any pair of vertices in A. More generally, if 𝕄 = ℤ^D, then the support of any Markov random field on A^{ℤ^D} is an SFT, which we can regard as the set of all tilings of ℝ^D by a certain collection of Wang tiles. If A = ℤ/p (p prime), then Proposition 39 yields asymptotic randomization of the Markov random field μ as long as the underlying Wang tiling is flexible enough that any hole can always be filled in at least two ways; see Sect. 1 in [118].

Remark 41 (Generalizations and Extensions) (a) Pivato and Yassawi (see Thm 3.1 in [118]) proved a variation of Theorem 37 where diffusion (of Φ) is replaced with a slightly stronger condition called dispersion, so that harmonic mixing (of μ) can be replaced with a slightly weaker condition called dispersion mixing (DM). It is unknown whether all proper linear CA are dispersive, but a very large class are (including, for example, Φ = Id + σ). Any uniformly mixing measure with positive entropy is DM (see Theorem 5.2 in [118]); this includes, for example, any mixing quasimarkov measure (i.e. the image of a Markov measure under a block map; these are the natural measures supported on sofic shifts).
Quasimarkov measures are not, in general, harmonically mixing (see Sect. 2 in [118]), but this result shows they are still asymptotically randomized by most linear CA. D (b) Suppose G AZ is a -transitive subgroup shift (see Subsect. “Measure Rigidity in Algebraic CA” for definition), and let ˚ 2 PLCA(G). If G satisfies an algebraic condition called the follower lifting property (FLP) and is any Markov random field with supp() D G, then Maass, Martínez, Pivato, and Yassawi [89] have shown that ˚ asymptotically randomizes to a maxentropy measure on G. Furthermore, if D D 1, then this maxentropy measure is the Haar measure on G. In particular, if A is an abelian group of prime-power order, then any transitive Markov subgroup G AZ satisfies the FLP, so this re-
sult holds for any multistep Markov measure on G. See also [90] for the special case when ˚ 2 CA(AZ ) has local rule (x0 ; x1 ) D x0 C x1 . In the special case when ˚ has local rule (x0 ; x1 ) D c0 x0 C c1 x1 C a, the result has been extended to measures with complete connections and summable decay; see Teorema III.2.1, p. 71 in [134] or see Theorem 1 in [91]. (c) All the aforementioned results concern asymptotic randomization of initial measures with nonzero entropy. Is nonzero entropy either necessary or sufficient for asymptotic randomization? First let X N AZ be the set of N-periodic points (see Subsect. “Periodic Invariant Measures”) and suppose supp() X N . Then the Cesàro limit of f˚ t ()g t2N will also be a measure supported on X N , so 1 cannot be the uniform measure on AZ . Nor, in general, will 1 be the uniform measure on X N ; this follows from Jen’s (1988) exact characterization of the limit cycles of linear CA acting on X N . What if is a quasiperiodic measure, such as the unique -invariant measure on a Sturmian shift? There exist quasiperiodic measures on (Z/2 )Z which are not asymptotically randomized by the Ledrappier CA (see Sect. 15 in [112]). But it is unknown whether this extends to all quasiperiodic measures or all linear CA. There is also a measure on AZ which has zero -entropy, yet is still asymptotically randomized by ˚ (see Sect. 8 in [118]). Loosely speaking, is a Toeplitz measure with a very low density of ‘bit errors’. Thus, is ‘almost’ deterministic (so it has zero entropy), but by sufficiently increasing the density of ‘bit errors’, we can introduce just enough randomness to allow asymptotic randomization to occur. (d) Suppose (G ; ) is a nonabelian group and ˚ : n n G Z !G Z has multiplicative local rule (g) :D gh11 gh22 nJ gh J , for some fh1 ; : : : ; h J g Z (possibly not distinct) and n1 ; : : : ; n J 2 N. 
If G is nilpotent, then G can be decomposed into a tower of abelian group extensions; this induces a structural decomposition of Φ into a tower of skew products of 'relative' linear CA. This strategy was first suggested by Moore [102], and was developed by Pivato (see Theorem 21 in [111]), who proved a version of Theorem 37 in this setting.
(e) Suppose (Q, ⋆) is a quasigroup – that is, ⋆ is a binary operation such that for any q, r, s ∈ Q, (q ⋆ r = q ⋆ s) ⟺ (r = s) ⟺ (r ⋆ q = s ⋆ q). Any finite associative quasigroup has an identity, and any associative quasigroup with an identity is a group. However, there are also many nonassociative finite quasigroups. If we define a 'multiplicative' CA Φ : Q^Z → Q^Z with local rule φ : Q^{0,1} → Q given by φ(q0, q1) = q0 ⋆ q1, then it is easy to see that Φ is bipermutative if and only if (Q, ⋆) is
Ergodic Theory of Cellular Automata
a quasigroup. Thus, quasigroups seem to provide the natural algebraic framework for studying bipermutative CA; this was first proposed by Moore [101], and later explored by Host, Maass, and Martínez (see Sect. 3 in [61]), Pivato (see Sect. 2 in [113]), and Sobottka [134,135,136]. Note that Q^Z is a quasigroup under componentwise ⋆-multiplication. A quasigroup shift is a subshift X ⊆ Q^Z which is also a subquasigroup; it follows that Φ(X) ⊆ X. If X and Φ satisfy certain strong algebraic conditions, and μ ∈ Meas(X; σ) has complete connections and summable decay, then the sequence {Φ^t μ}_(t=1)^∞ Cesàro-converges to a maxentropy measure μ_∞ on X (thus, if X is irreducible, then μ_∞ is the Parry measure; see Subsect. "Invariance of Maxentropy Measures"). See Theorem 6.3(i) in [136], or Teorema IV.5.3, p. 107 in [134].

Hybrid Modes of Self-Organization

Most cellular automata do not asymptotically randomize; instead they seem to weak*-converge to limit measures concentrated on small (i.e. low-entropy) subsets of the statespace A^M – a phenomenon which can be interpreted as a form of 'self-organization'. Exact limit measures have been computed for a few CA. For example, let A = {0, 1, 2} and let Φ ∈ CA(A^(Z^D)) be the Greenberg–Hastings model (a simple model of an excitable medium). Durrett and Steif [26] showed that, if D ≥ 2 and μ is any Bernoulli measure on A^(Z^D), then μ_∞ := wk*lim_(t→∞) Φ^t μ exists; μ_∞-almost all points are 3-periodic for Φ, and although μ_∞ is not a Bernoulli measure, the system (A^(Z^D), μ_∞; σ) is measurably isomorphic to a Bernoulli system. In other cases, the limit measure cannot be exactly computed, but can still be estimated. For example, let A = {±1}, θ ∈ (0, 1), and R > 0, and let Φ ∈ CA(A^Z) be the (R, θ)-threshold voter CA (where each cell computes the fraction of its radius-R neighbors which disagree with its current sign, and negates its sign if this fraction is at least θ). Durrett and Steif [27] and Fisch and Gravner [43] have described the long-term behavior of Φ in the limit as R → ∞. If θ < 1/2, then every initial condition falls into a two-periodic orbit (and if θ < 1/4, then every cell simply alternates its sign). Let η be the uniform Bernoulli measure on A^Z; if 1/2 < θ, then for any finite subset B ⊂ Z, if R is large enough, then 'most' initial conditions (relative to η) converge to orbits that are fixed inside B. Indeed, there is a critical value θ_c ≈ 0.6469076 such that, if θ_c < θ, and R is large enough, then 'most' initial conditions (for η) are already fixed inside B; see also [137] for an analysis of behavior at the critical value. In still other cases, the limit measure is known to exist, but is still mysterious; this is true for the Cesàro limit measures of Coven CA, for example; see Theorem 1 in [87]. However, for most CA, it is difficult even to show that limit measures exist. Except for the linear CA of Subsect. "Asymptotic Randomization by Linear Cellular Automata", there is no large class of CA whose limit measures have been exactly characterized.
Often, it is much easier to study the dynamical asymptotics of CA at a purely topological level. If Φ ∈ CA(A^M), then A^M ⊇ Φ(A^M) ⊇ Φ^2(A^M) ⊇ ⋯. The limit set of Φ is the nonempty subshift Φ^∞(A^M) := ∩_(t=1)^∞ Φ^t(A^M). For any a ∈ A^M, the omega-limit set of a is the set ω(a, Φ) of all cluster points of the Φ-orbit {Φ^t(a)}_(t=1)^∞. A closed subset X ⊆ A^M is a (Conley) attractor if there exists a clopen subset U ⊇ X such that Φ(U) ⊆ U and X = ∩_(t=1)^∞ Φ^t(U). It follows that ω(u, Φ) ⊆ X for all u ∈ U. For example, Φ^∞(A^M) is an attractor (let U := A^M). The topological attractors of CA were analyzed by Hurley [63,65,66], who discovered severe constraints on the possible attractor structures a CA could exhibit; see Sect. 9 of Topological Dynamics of Cellular Automata and Cellular Automata, Classification of.
Within pure topological dynamics, attractors and (omega) limit sets are the natural formalizations of the heuristic notion of 'self-organization'. The corresponding formalization in pure ergodic theory is the weak* limit measure. However, both weak* limit measures and topological attractors fail to adequately describe the sort of self-organization exhibited by many CA. Thus, several 'hybrid' notions of self-organization have been developed, which combine topological and measurable criteria. These hybrid notions are more flexible and inclusive than purely topological notions. However, they do not require the explicit computation (or even the existence) of weak* limit measures, so in practice they are much easier to verify than purely ergodic notions.

Milnor–Hurley μ-attractors

If X ⊆ A^M is a closed subset, then for any a ∈ A^M, we define d(a, X) := inf_(x∈X) d(a, x).
If Φ ∈ CA(A^M), then the basin (or realm) of X is the set

  Basin(X) := { a ∈ A^M ; lim_(t→∞) d(Φ^t(a), X) = 0 } = { a ∈ A^M ; ω(a, Φ) ⊆ X } .

Suppose Φ(X) ⊆ X. If μ ∈ Meas(A^M), then X is a μ-attractor if μ[Basin(X)] > 0; we call X a lean μ-attractor if, in addition, μ[Basin(X)] > μ[Basin(Y)] for any proper closed subset Y ⊊ X. Finally, a μ-attractor X is minimal if μ[Basin(Y)] = 0 for any proper closed subset Y ⊊ X. For example, if X is a μ-attractor, and (X, Φ) is minimal as
a dynamical system, then X is a minimal μ-attractor. This concept was introduced by Milnor [96,97] in the context of smooth dynamical systems; its ramifications for CA were first explored by Hurley [64,65].
If μ ∈ Meas(A^(Z^D); σ), then μ is weakly σ-mixing if, for any measurable sets U, V ⊆ A^(Z^D), there is a subset J ⊆ Z^D of density 1 such that lim_(J∋j→∞) μ[σ^j(U) ∩ V] = μ[U]·μ[V] (see Subsect. "Mixing and Ergodicity"). For example, any Bernoulli measure is weakly mixing. A subshift X ⊆ A^(Z^D) is σ-minimal if X contains no proper nonempty subshifts. For example, if X is just the σ-orbit of some σ-periodic point, then X is σ-minimal.

Proposition 42 Let Φ ∈ CA(A^M), let μ ∈ Meas(A^M; σ), and let X be a μ-attractor.
(a) If μ is σ-ergodic, and X ⊆ A^M is a subshift, then μ[Basin(X)] = 1.
(b) If M is countable, and X is a σ-minimal subshift with μ[Basin(X)] = 1, then X is lean.
(c) Suppose M = Z^D and μ is weakly σ-mixing.
  (i) If X is a minimal μ-attractor, then X is a subshift, so μ[Basin(X)] = 1, and thus X is the only lean μ-attractor of Φ.
  (ii) If X is a Φ-periodic orbit which is also a lean μ-attractor, then X is minimal, μ[Basin(X)] = 1, and X contains only constant configurations.

Proof (a) If X is σ-invariant, then Basin(X) is also σ-invariant; hence μ[Basin(X)] = 1 because μ is σ-ergodic.
(b) Suppose Y ⊊ X were a proper closed subset with μ[Basin(Y)] = 1. For any m ∈ M, it is easy to check that Basin(σ^m[Y]) = σ^m[Basin(Y)]. Thus, if Ỹ := ∩_(m∈M) σ^m(Y), then Basin(Ỹ) = ∩_(m∈M) σ^m[Basin(Y)], so μ[Basin(Ỹ)] = 1 (because M is countable). Thus, Ỹ is nonempty, and is a subshift of X. But X is σ-minimal, so Ỹ = X, which means Y = X. Thus, X is a lean μ-attractor.
(c) In the case when μ is a Bernoulli measure, (c)[i] is Theorem B in [64] or Proposition 2.7 in [65], while (c)[ii] is Theorem A in [65]. Hurley's proofs easily extend to the case when μ is weakly σ-mixing. The only property we require of μ is this: for any nontrivial measurable sets U, V ⊆ A^(Z^D), and any z ∈ Z^D, there are some x, y ∈ Z^D with z = x − y, such that μ[σ^y(U) ∩ V] > 0 and μ[σ^x(U) ∩ V] > 0. This is clearly true if μ is weakly mixing (because if J ⊆ Z^D has density 1, then J ∩ (z + J) ≠ ∅ for any z ∈ Z^D).

Proof sketch for (c)[i] If X is a (minimal) μ-attractor, then so is σ^y(X), and Basin[σ^y(X)] = σ^y(Basin[X]). Thus, weak mixing yields x, y ∈ Z^D such that Basin[σ^x(X)] ∩ Basin[X] and Basin[σ^y(X)] ∩ Basin[X] are both nontrivial. But the
basins of distinct minimal μ-attractors must be disjoint; thus σ^x(X) = X = σ^y(X). But x − y = z, so this means σ^z(X) = X. This holds for all z ∈ Z^D, so X is a subshift, so (a) implies μ[Basin(X)] = 1.
Section 4 of [64] contains several examples showing that the minimal topological attractor of Φ can be different from its minimal μ-attractor. For example, a CA can have different minimal μ-attractors for different choices of μ. On the other hand, there is a CA possessing a minimal topological attractor but with no minimal μ-attractor for any Bernoulli measure μ.

Hilmy–Hurley Centers

Let a ∈ A^M. For any closed subset X ⊆ A^M, we define

  α_a[X] := liminf_(N→∞) (1/N) Σ_(t=1)^N 1_X(Φ^t(a)) .

(Thus, if μ is a Φ-ergodic measure on A^M, then Birkhoff's Ergodic Theorem asserts that α_a[X] = μ[X] for μ-almost all a ∈ A^M.) The center of a is the set

  Cent(a, Φ) := ∩ { closed subsets X ⊆ A^M ; α_a[X] = 1 } .

Thus, Cent(a, Φ) is the smallest closed subset such that α_a[Cent(a, Φ)] = 1. If X ⊆ A^M is closed, then the well of X is the set

  Well(X) := { a ∈ A^M ; Cent(a, Φ) ⊆ X } .

If μ ∈ Meas(A^M), then X is a μ-center if μ[Well(X)] > 0; we call X a lean μ-center if, in addition, μ[Well(X)] > μ[Well(Y)] for any proper closed subset Y ⊊ X. Finally, a μ-center X is minimal if μ[Well(Y)] = 0 for any proper closed subset Y ⊊ X. This concept was introduced by Hilmy [59] in the context of smooth dynamical systems; its ramifications for CA were first explored by Hurley [65].

Proposition 43 Let Φ ∈ CA(A^M), let μ ∈ Meas(A^M; σ), and let X be a μ-center.
(a) If μ is σ-ergodic, and X ⊆ A^M is a subshift, then μ[Well(X)] = 1.
(b) If M is countable, and X is a σ-minimal subshift with μ[Well(X)] = 1, then X is lean.
(c) Suppose M = Z^D and μ is weakly σ-mixing. If X is a minimal μ-center, then X is a subshift, X is the only lean μ-center, and μ[Well(X)] = 1.

Proof (a) and (b) are very similar to the proofs of Proposition 42(a,b).
(c) is proved for Bernoulli measures as Theorem B in [65]. The proof is quite similar to Proposition 42(c)[i], and again, we only need μ to be weakly mixing.
Section 4 of [65] contains several examples of minimal μ-centers which are not μ-attractors. In particular, the analogue of Proposition 42(c)[ii] is false for μ-centers.

Kůrka–Maass μ-limit Sets

If Φ ∈ CA(A^M) and μ ∈ Meas(A^M; σ), then Kůrka and Maass define the μ-limit set of Φ:

  Λ(Φ, μ) := ∩ { closed subsets X ⊆ A^M ; lim_(t→∞) Φ^t μ(X) = 1 } .

It suffices to take this intersection only over all cylinder sets X. By doing this, we see that Λ(Φ, μ) is a subshift of A^M, and is defined by the following property: for any finite B ⊂ M and any word b ∈ A^B, b is admissible to Λ(Φ, μ) if and only if liminf_(t→∞) Φ^t μ[b] > 0.

Proposition 44 Let Φ ∈ CA(A^M) and μ ∈ Meas(A^M; σ).
(a) If ν = wk*lim_(t→∞) Φ^t μ, then Λ(Φ, μ) = supp(ν).
Suppose M = Z.
(b) If Φ is surjective and has an equicontinuous point, and μ has full support on A^Z, then Λ(Φ, μ) = A^Z.
(c) If Φ is left- or right-permutative and μ is connected (see below), then Λ(Φ, μ) = A^Z.

Proof For (a), see Proposition 2 in [79]. For (b,c), see Theorems 2 and 3 in [77]; for earlier special cases of these results, see also Propositions 4 and 5 in [79].

Remark 45 (a) In Proposition 44(c), the measure μ is connected if there is some constant C > 0 such that, for any finite word b ∈ A*, and any a ∈ A, we have μ[b a] ≥ C·μ[b] and μ[a b] ≥ C·μ[b]. For example, any Bernoulli, Markov, or N-step Markov measure with full support is connected. Also, any measure with 'complete connections' (see Example 22(a)) is connected.
(b) Proposition 44(a) shows that μ-limit sets are closely related to the weak* limits of measures. Recall from Subsect. "Asymptotic Randomization by Linear Cellular Automata" that the uniform Bernoulli measure η is the weak* limit of a large class of initial measures under the action of linear CA. Presumably the same result should hold for a much larger class of permutative CA, but so far this is unproven, except in some special cases [see Remarks 41(d,e)]. Proposition 44(a,c) implies that the limit measure of a permutative CA (if it exists) must have full support – hence it can't be 'too far' from η.
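The admissibility criterion for the μ-limit set (a word b is admissible iff liminf_(t→∞) Φ^t μ[b] > 0) can be probed numerically. The following is a minimal Monte Carlo sketch, not taken from the article (the rule encoding and all names are ours): for ECA#184 started from the uniform Bernoulli measure on a ring, the empirical frequency of the word 11 decays toward 0, consistent with 11 being forbidden in the μ-limit set (cf. Corollary 50).

```python
import random

def step184(a):
    # ECA#184 ("traffic" rule): a 1 moves right iff the next cell is 0
    n = len(a)
    return [1 if (a[i] and a[(i + 1) % n]) or (a[(i - 1) % n] and not a[i]) else 0
            for i in range(n)]

def word_freq(a, w):
    # empirical frequency of the word w in the cyclic configuration a
    n, k = len(a), len(w)
    hits = sum(all(a[(i + j) % n] == w[j] for j in range(k)) for i in range(n))
    return hits / n

random.seed(1)
N, T, TRIALS = 600, 300, 10
f_start = f_end = 0.0
for _ in range(TRIALS):
    a = [random.randint(0, 1) for _ in range(N)]   # sample the uniform Bernoulli measure
    f_start += word_freq(a, (1, 1)) / TRIALS        # ~0.25 at t = 0
    for _ in range(T):
        a = step184(a)
    f_end += word_freq(a, (1, 1)) / TRIALS          # small after T steps

print(f_start, f_end)
```

On a finite ring the decay eventually stalls at a residue determined by the sampled car density, so this only illustrates, not proves, the infinite-lattice statement.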
Kůrka's Measure Attractors

Let M_inv := Meas(A^M; σ) have the weak* topology, and define Φ* : M_inv → M_inv by Φ*(μ) = μ ∘ Φ^(−1). Then Φ* is continuous, so we can treat (M_inv, Φ*) itself as a compact topological dynamical system. The "weak* limit measures" of Φ are simply the attracting fixed points of (M_inv, Φ*). However, even if the Φ*-orbit of a measure μ does not weak*-converge to a fixed point, we can still consider the omega-limit set of μ. In particular, the limit set Φ*^∞(M_inv) is the union of the omega-limit sets of all σ-invariant initial measures under Φ*. Kůrka defines the measure attractor of Φ:

  MeasAttr(Φ) := closure[ ∪ { supp(ν) ; ν ∈ Φ*^∞(M_inv) } ] ⊆ A^M .

A configuration a ∈ A^(Z^D) is densely recurrent if any word which occurs in a does so with nonzero frequency. Formally, for any finite B ⊂ Z^D,

  limsup_(N→∞) #{ z ∈ [−N…N]^D ; a_(B+z) = a_B } / (2N + 1)^D > 0 .

If X ⊆ A^(Z^D) is a subshift, then the densely recurrent subshift of X is the closure D of the set of all densely recurrent points in X. If μ ∈ M_inv(X), then the Birkhoff Ergodic Theorem implies that supp(μ) ⊆ D; see Proposition 8.8 in [1], p. 164. From this it follows that M_inv(X) = M_inv(D). On the other hand, D = closure[ ∪ { supp(μ) ; μ ∈ M_inv(D) } ]. In other words, densely recurrent subshifts are the only subshifts which are 'covered' by their own set of shift-invariant measures.

Proposition 46 Let Φ ∈ CA(A^(Z^D)). Let D be the densely recurrent subshift of Φ^∞(A^(Z^D)). Then D = MeasAttr(Φ), and Φ*^∞(M_inv) = Meas(D; σ).
Proof Case D = 1 is Proposition 13 in [78]. The same proof works for D ≥ 2.

Synthesis

The various hybrid modes of self-organization are related as follows:

Proposition 47 Let Φ ∈ CA(A^M).
(a) Let μ ∈ Meas(A^M; σ) and let X ⊆ A^M be any closed set.
[i] If X is a topological attractor and μ has full support, then X is a μ-attractor.
[ii] If X is a μ-attractor, then X is a μ-center.
[iii] Suppose M = Z^D, and that μ is weakly σ-mixing. Let Y be the intersection of all topological attractors of Φ. If Φ has a minimal μ-attractor X, then X ⊆ Y.
[iv] If μ is σ-ergodic, then Λ(Φ, μ) ⊆ ∩ { X ⊆ A^M ; X a subshift and μ-attractor } ⊆ Φ^∞(A^M).
[v] Thus, if μ is σ-ergodic and has full support, then

  Λ(Φ, μ) ⊆ ∩ { X ⊆ A^M ; X a subshift and a topological attractor } .

[vi] If X is a subshift, then (Λ(Φ, μ) ⊆ X) ⟺ (ω(Φ*, μ) ⊆ M_inv(X)).
(b) Let M = Z^D. Let B be the set of all Bernoulli measures on A^(Z^D), and for any β ∈ B, let X_β be the minimal β-attractor for Φ (if it exists). Then there is a comeager subset 𝒜 ⊆ A^(Z^D) such that ∪_(β∈B) X_β ⊆ ∩_(a∈𝒜) ω(a, Φ).
(c) MeasAttr(Φ) = closure[ ∪ { Λ(Φ, μ) ; μ ∈ M_inv(A^M) } ].
(d) If M = Z^D, then MeasAttr(Φ) ⊆ Φ^∞(A^(Z^D)).

Proof (a)[i]: If U is a clopen subset witnessing that X is a topological attractor (so ∩_(t=1)^∞ Φ^t(U) = X), then U ⊆ Basin(X); thus, 0 < μ[U] ≤ μ[Basin(X)], where the '<' holds because μ has full support.
(a)[iii] is Proposition 3.3 in [64]. (Again, Hurley states and proves this in the case when μ is a Bernoulli measure, but his proof only requires weak mixing.)
(a)[iv]: Let X be a subshift and a μ-attractor; we claim that Λ(Φ, μ) ⊆ X. Proposition 42(a) says μ[Basin(X)] = 1. Let B ⊂ M be any finite set. If b ∈ A^B \ X_B, then

  { a ∈ A^M ; ∃ T ∈ ℕ such that ∀ t ≥ T, Φ^t(a)_B ≠ b } ⊇ Basin(X) .

It follows that the left-hand set has μ-measure 1, which implies that lim_(t→∞) Φ^t μ⟨b⟩ = 0 – hence b is a forbidden word in Λ(Φ, μ). Thus, all the words forbidden in X are also forbidden in Λ(Φ, μ). Thus Λ(Φ, μ) ⊆ X. (The case M = Z of (a)[iv] appears as Prop. 1 in [79] and as Prop. II.27, p. 67 in [120]; see also Cor. II.30 in [120] for a slightly stronger result.)
(a)[v] follows from (a)[iv] and (a)[i].
(a)[vi] is Proposition 1 in [77] or Proposition 10 in [78]; the argument is fairly similar to (a)[iv]. (Kůrka assumes M = Z, but this is not necessary.)
(b) is Proposition 5.2 in [64].
(c) Let X ⊆ A^M be a subshift and let M_inv := M_inv(A^M). Then

  (MeasAttr(Φ) ⊆ X)
   ⟺ (supp(ν) ⊆ X, ∀ ν ∈ Φ*^∞(M_inv))
   ⟺ (ν ∈ M_inv(X), ∀ ν ∈ Φ*^∞(M_inv))
   ⟺ (Φ*^∞(M_inv) ⊆ M_inv(X))
   ⟺ (ω(Φ*, μ) ⊆ M_inv(X), ∀ μ ∈ M_inv)
   ⟺ (Λ(Φ, μ) ⊆ X, ∀ μ ∈ M_inv)   (*)
   ⟺ (∪ { Λ(Φ, μ) ; μ ∈ M_inv } ⊆ X) ,

where (*) is by (a)[vi]. It follows that MeasAttr(Φ) = closure[ ∪ { Λ(Φ, μ) ; μ ∈ M_inv(A^M) } ].
(d) follows immediately from Proposition 46.

Examples and Applications

The most natural examples of these hybrid modes of self-organization arise in the particle cellular automata (PCA) introduced in Subsect. "Domains, Defects, and Particles". The long-term dynamics of a PCA involves a steady reduction in particle density, as particles coalesce or annihilate one another in collisions. Thus, presumably, for almost any initial configuration a ∈ A^Z, the sequence {Φ^t(a)}_(t=1)^∞ should converge to the subshift Z of configurations containing no particles (or at least, no particles of certain types), as t → ∞. Unfortunately, this presumption is generally false if we interpret 'convergence' in the strict topological dynamical sense: occasional particles will continue to wander near the origin at arbitrarily large times in the future orbit of a (albeit with diminishing frequency), so ω(a, Φ) will not be contained in Z. However, the presumption becomes true if we instead employ one of the more flexible hybrid notions introduced above. For example, most initial probability measures should converge, under iteration of Φ*, to a measure concentrated on configurations with few or no particles; hence we expect that Λ(Φ, μ) ⊆ Z. As discussed in Subsect. "Domains, Defects, and Particles", a result about self-organization in a PCA can sometimes be translated into an analogous result about self-organization in an associated coalescent-domain CA.

Proposition 48 Let A = {0, ±1} and let Ψ ∈ CA(A^Z) be the Ballistic Annihilation Model (BAM) from Example 33. Let R := {0, 1}^Z and L := {0, −1}^Z.
(a) If μ ∈ Meas(A^Z; σ), then ν = wk*lim_(t→∞) Ψ^t(μ) exists, and ν has one of three forms: either ν ∈ Meas(R; σ), or ν ∈ Meas(L; σ), or ν = δ_0, the point mass on the sequence 0 = [… 000 …].
(b) Thus, the measure attractor of Ψ is R ∪ L (note that R ∩ L = {0}).
(c) In particular, if μ is a Bernoulli measure on A^Z with μ[+1] = μ[−1], then ν = δ_0.
(d) Let μ be a Bernoulli measure on A^Z.
[i] If μ[+1] > μ[−1], then R is a μ-attractor – i.e. μ[Basin(R)] > 0.
[ii] If μ[+1] < μ[−1], then L is a μ-attractor.
[iii] If μ[+1] = μ[−1], then {0} is not a μ-attractor, because μ[Basin{0}] = 0. However, Λ(Ψ, μ) = {0}.

Proof (a) is Theorem 6 of [8], and (b) follows from (a). (c) follows from Theorem 2 of [42]. (d)[i,ii] were first observed by Gilman (see Sect. 3, pp. 111–112 in [45]), and later by Kůrka and Maass (see Example 4 in [80]). (d)[iii] follows immediately from (c): the statement Λ(Ψ, μ) = {0} is equivalent to asserting that lim_(t→∞) Ψ^t μ[±1] = 0, which is a consequence of (c). Another proof of (d)[iii] is Proposition 11 in [80]; see also Example 3 in [79] or Prop. II.32, p. 70 in [120].

Corollary 49 Let A = Z/3, let Φ ∈ CA(A^Z) be the CCA3 (see Example 32), and let η be the uniform Bernoulli measure on A^Z. Then wk*lim_(t→∞) Φ^t(η) = (δ_0 + δ_1 + δ_2)/3, where δ_a is the point mass on the sequence [… aaa …] for each a ∈ A.

Proof Combine Proposition 48(c) with the factor map in Example 34(b). See Theorem 1 of [42] for details.

Corollary 50 Let A = {0, 1}, and let Φ ∈ CA(A^Z) be ECA#184 (see Example 31).
(a) MeasAttr(Φ) = R ∪ L, where R ⊂ A^Z is the set of sequences not containing [11], and L ⊂ A^Z is the set of sequences not containing [00].
(b) If η is the uniform Bernoulli measure on A^Z, then wk*lim_(t→∞) Φ^t(η) = (δ_0 + δ_1)/2, where δ_0 and δ_1 are the point masses on [… 010.101 …] and [… 101.010 …].
(c) Let μ be a Bernoulli measure on A^Z.
[i] If μ[0] > μ[1], then R is a μ-attractor – i.e. μ[Basin(R)] > 0.
[ii] If μ[0] < μ[1], then L is a μ-attractor.

Proof sketch Let Γ be the factor map from Example 34(a). To prove (a), apply Γ to Proposition 48(b); see Example 26, Sect. 9 in [78] for details. To prove (b), apply Γ to Proposition 48(c); see Proposition 12 in [80] for details. To prove (c), apply Γ to Proposition 48(d)[i,ii].

Remark 51 (a) The other parts of Proposition 48 can likewise be translated into equivalent statements about
the measure attractors and μ-attractors of CCA3 and ECA#184.
(b) Recall that ECA#184 is a model of single-lane traffic, where each car is either stopped or moving rightwards at unit speed. Blank [10] has extended Corollary 50(c) to a much broader class of CA models of multi-lane, multi-speed traffic. For any such model, let R ⊂ A^Z be the set of 'free-flowing' configurations where each car has enough space to move rightwards at its maximum possible speed. Let L ⊂ A^Z be the set of 'jammed' configurations where the cars are so tightly packed that the jammed clusters can propagate (leftwards) through the cars at maximum speed. If μ is any Bernoulli measure, then μ[Basin(R)] = 1 if the μ-average density of cars is less than 1/2, whereas μ[Basin(L)] = 1 if the density is greater than 1/2 (Theorems 1.2 and 1.3 in [10]). Thus, L ⊔ R is a (non-lean) μ-attractor, although not a topological attractor (Lemma 2.13 in [10]).

Example 52 A cyclic addition and ballistic annihilation model (CABAM) contains the same 'moving' particles ±1 as the BAM (Example 33), but also has one or more 'stationary' particle types. Let N ∈ ℕ with N ≥ 3, and let P = {1, 2, …, N−1} ⊂ Z/N, where we identify N−1 with −1, modulo N. It will be convenient to represent the 'vacant' state ∅ as 0; thus, A = Z/N. The particles 1 and −1 have velocities and collisions as in the BAM, namely:

  v(1) = 1 ,  v(−1) = −1 ,  and  1 + (−1) ⇝ ∅ .

We set v(p) = 0 for all p ∈ [2 … N−2], and employ the following collision rule:

  If p_(−1) + p_0 + p_1 ≡ q (mod N), then p_(−1) + p_0 + p_1 ⇝ q .  (5)

(Here, any one of p_(−1), p_0, p_1, or q could be 0, signifying vacancy.) For example, if N = 5 and a (rightward moving) type +1 particle strikes a (stationary) type 3 particle, then the +1 particle is annihilated and the 3 particle turns into a (stationary) 4 particle. If another +1 particle hits the 4 particle, then both are annihilated, leaving a vacancy (0).
Let B = Z/N, and let Ψ ∈ CA(B^Z) be the CABAM. Then the set of fixed points of Ψ is F = { f ∈ B^Z ; f_z ≠ ±1, ∀ z ∈ Z }. Note that, if b ∈ Basin[F] – that is, if ω(b, Ψ) ⊆ F – then in fact lim_(t→∞) Ψ^t(b) exists and is a Ψ-fixed point.

Proposition 53 Let B = Z/N, let Ψ ∈ CA(B^Z) be the CABAM, and let η be the uniform Bernoulli measure on B^Z. If N ≥ 5, then F is a 'global' η-attractor – that is, η[Basin(F)] = 1. However, if N ≤ 4, then η[Basin(F)] = 0.

Proof See Theorem 1 of [41].
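The dichotomy in Proposition 48(d) has a simple combinatorial explanation: in the BAM, a right-moving +1 eventually annihilates a −1 lying to its right, so the set of surviving particles can be computed by a stack reduction, exactly as in bracket matching. The following sketch is our own illustration (not code from the article) of this reduction: with a Bernoulli measure biased toward +1, essentially all survivors are +1, so mass condenses on R.

```python
import random

def bam_survivors(signs):
    # signs: sequence over {+1, 0, -1}.  A +1 (right-mover) and a -1
    # (left-mover) annihilate iff the -1 lies to the right of the +1,
    # so survivors are found by cancelling matched pairs, stack-style.
    surviving_minus = 0
    plus_stack = 0
    for s in signs:
        if s == 1:
            plus_stack += 1
        elif s == -1:
            if plus_stack > 0:
                plus_stack -= 1       # collides with some +1 to its left
            else:
                surviving_minus += 1  # escapes leftwards forever
    return plus_stack, surviving_minus

random.seed(0)
n = 10000
# Bernoulli measure with mu[+1] = 0.4 > mu[-1] = 0.3 (remaining mass vacant)
signs = random.choices([1, -1, 0], weights=[0.4, 0.3, 0.3], k=n)
plus, minus = bam_survivors(signs)
print(plus, minus)   # surviving right-movers vastly outnumber left-movers
```

Which +1 pairs with which −1 is irrelevant here; only the survivor counts matter, and they match the μ[+1] > μ[−1] case of Proposition 48(d)[i].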
Let A = Z/N and let Φ ∈ CA(A^Z) be the N-color CCA from Example 32. Then the set of fixed points of Φ is F = { f ∈ A^Z ; f_z − f_(z+1) ≠ ±1, ∀ z ∈ Z }. Note that, if a ∈ Basin[F], then in fact lim_(t→∞) Φ^t(a) exists and is a Φ-fixed point.

Corollary 54 Let A = Z/N, let Φ ∈ CA(A^Z) be the N-color CCA, and let η be the uniform Bernoulli measure on A^Z. If N ≥ 5, then F is a 'global' η-attractor – that is, η[Basin(F)] = 1. However, if N ≤ 4, then η[Basin(F)] = 0.

Proof sketch Let B = Z/N and let Ψ ∈ CA(B^Z) be the N-particle CABAM. Construct a factor map Γ : A^Z → B^Z with local rule γ(a_0, a_1) := (a_0 − a_1) mod N, similar to Example 34(b). Then Γ ∘ Φ = Ψ ∘ Γ, and the Ψ-particles track the Φ-domain boundaries. Now apply Γ to Proposition 53.

Example 55 Let A = {0, 1} and let H = {−1, 0, 1}. Elementary Cellular Automaton #18 is the one-dimensional CA with local rule φ : A^H → A given by φ[100] = 1 = φ[001], and φ(a) = 0 for all other a ∈ A^H. Empirically, ECA#18 has one stable phase: the odd sofic shift S (defined by a two-vertex A-labeled digraph). In other words, a sequence is admissible to S as long as any pair of consecutive ones are separated by an odd number of zeroes. Thus, a defect is any word of the form 1 0^(2m) 1 (where 0^(2m) represents 2m zeroes) for some m ∈ ℕ. Thus, defects can be arbitrarily large, they can grow and move arbitrarily quickly, and they can coalesce across arbitrarily large distances. Thus, it is impossible to construct a particle CA which tracks the motion of these defects. Nevertheless, in computer simulations, one can visually follow the moving defects through time, and they appear to perform random walks. Over time, the density of defects decreases as they randomly collide and annihilate. This was empirically observed by Grassberger [47,48] and Boccara et al. [11]. Lind (see Sect. 5 in [82]) conjectured that this gradual elimination of defects causes almost all initial conditions to converge, in some sense, to S under application of Φ.
Eloranta and Nummelin [34] proved that the defects of Φ individually perform random walks. However, the motions of neighboring defects are highly correlated. They are not independent random walks, so one cannot use standard results about stochastic interacting particle systems to conclude that the defect density converges to zero.
To solve problems like this, Kůrka [77] developed a theory of 'particle weight functions' for CA. Let A* be the set of all finite words in the alphabet A. A particle weight function is a bounded function p : A* → ℕ, so that, for any a ∈ A^Z, we interpret

  #_p(a) := Σ_(r=0)^∞ Σ_(z∈Z) p(a_[z…z+r])   and
  δ_p(a) := Σ_(r=0)^∞ lim_(N→∞) (1/2N) Σ_(z=−N)^N p(a_[z…z+r])

to be, respectively, the 'number of particles' and the 'density of particles' in the configuration a (clearly, if #_p(a) is finite, then δ_p(a) = 0). The function p can count the single-letter 'particles' of a PCA, or the short-length 'domain boundaries' found in ECA#184 and the CCA of Examples 31 and 32. However, p can also track the arbitrarily large defects of ECA#18. For example, define p18(1 0^(2m) 1) = 1 (for any m ∈ ℕ), and define p18(a) = 0 for all other a ∈ A*.
Let Z_p := { a ∈ A^Z ; #_p(a) = 0 } be the set of vacuum configurations. (For example, if p = p18 as above, then Z_p is just the odd sofic shift S.) If the iteration of a CA Φ decreases the number (or density) of particles, then one expects Z_p to be a limit set for Φ in some sense. Indeed, if μ ∈ M_inv := Meas(A^Z; σ), then we define ρ_p(μ) := ∫_(A^Z) δ_p dμ. If Φ is 'p-decreasing' in a certain sense, then ρ_p acts as a Lyapunov function for the dynamical system (M_inv, Φ*). Thus, with certain technical assumptions, we can show that, if μ ∈ M_inv is connected, then Λ(Φ, μ) ⊆ Z_p (see Theorem 8 in [77]). Furthermore, under certain conditions, MeasAttr(Φ) ⊆ Z_p (see Theorem 7 in [77]). Using this machinery, Kůrka proved:

Proposition 56 Let Φ : A^Z → A^Z be ECA#18, and let S ⊂ A^Z be the odd sofic shift. If μ ∈ Meas(A^Z; σ) is connected, then Λ(Φ, μ) ⊆ S.

Proof See Example 6.3 of [77].
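The weight function p18 is easy to evaluate on finite configurations, and iterating ECA#18 from random initial conditions exhibits the decay of defect density described above. A minimal sketch, our own (it assumes a periodic ring, so the count below is #_p per period, and defects are detected as even-length gaps between consecutive 1s):

```python
import random

def step18(a):
    # ECA#18: the new cell is 1 iff it is currently 0 and exactly one neighbour is 1
    n = len(a)
    return [1 if a[i] == 0 and a[(i - 1) % n] + a[(i + 1) % n] == 1 else 0
            for i in range(n)]

def count_defects(a):
    # p18 counts words 1 0^{2m} 1, i.e. even-length runs of zeroes between 1s
    n = len(a)
    ones = [i for i, x in enumerate(a) if x == 1]
    if len(ones) < 2:
        return 0
    gaps = [(ones[(k + 1) % len(ones)] - ones[k] - 1) % n for k in range(len(ones))]
    return sum(1 for g in gaps if g % 2 == 0)

random.seed(2)
n_cells, n_steps = 1000, 300
a = [random.randint(0, 1) for _ in range(n_cells)]
d0 = count_defects(a)          # roughly half of all gaps are defects at t = 0
for _ in range(n_steps):
    a = step18(a)
d1 = count_defects(a)          # far fewer after the defects random-walk and annihilate
print(d0, d1)
```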
Measurable Dynamics

If Φ ∈ CA(A^M) and μ ∈ Meas(A^M; Φ), then the triple (A^M, μ; Φ) is a measure-preserving dynamical system (MPDS), and thus amenable to the methods of classical ergodic theory.

Mixing and Ergodicity

If Φ ∈ CA(A^M), then the topological dynamical system (A^M, Φ) is topologically transitive (or topologically ergodic) if, for any open subsets U, V ⊆ A^M, there exists t ∈ ℕ such that U ∩ Φ^(−t)(V) ≠ ∅. Equivalently, there exists some a ∈ A^M whose orbit O(a) := {Φ^t(a)}_(t=0)^∞ is dense in A^M. If μ ∈ Meas(A^M; Φ), then the system (A^M, μ; Φ) is ergodic if, for any nontrivial measurable U, V ⊆ A^M, there exists some t ∈ ℕ such that
μ[U ∩ Φ^(−t)(V)] > 0. The system (A^M, μ; Φ) is totally ergodic if (A^M, μ; Φ^n) is ergodic for every n ∈ ℕ. The system (A^M, μ; Φ) is (strongly) mixing if, for any nontrivial measurable U, V ⊆ A^M,

  lim_(t→∞) μ[U ∩ Φ^(−t)(V)] = μ[U] · μ[V] .  (6)

The system (A^M, μ; Φ) is weakly mixing if the limit (6) holds as n → ∞ along an increasing subsequence {t_n}_(n=1)^∞ of density one – i.e. such that lim_(n→∞) t_n / n = 1. For any M ∈ ℕ, we say (A^M, μ; Φ) is M-mixing if, for any measurable U_0, U_1, …, U_M ⊆ A^M,

  lim μ[ ∩_(m=0)^M Φ^(−t_m)(U_m) ] = Π_(m=0)^M μ[U_m]  (7)

as |t_n − t_m| → ∞ for all n ≠ m ∈ [0…M].
(Thus, 'strong' mixing is 1-mixing.) We say (A^M, μ; Φ) is multimixing (or mixing of all orders) if (A^M, μ; Φ) is M-mixing for all M ∈ ℕ. We say (A^M, μ; Φ) is a Bernoulli endomorphism if its natural extension is measurably isomorphic to a system (B^Z, β; σ), where β ∈ Meas(B^Z; σ) is a Bernoulli measure. We say (A^M, μ; Φ) is a Kolmogorov endomorphism if its natural extension is a Kolmogorov (or "K") automorphism.

Theorem 57 Let Φ ∈ CA(A^M), let μ ∈ Meas(A^M; Φ), and let X = supp(μ). Then X is a compact, Φ-invariant set. Furthermore: (μ, Φ) is Bernoulli ⟹ (μ, Φ) is Kolmogorov ⟹ (μ, Φ) is multimixing ⟹ (μ, Φ) is mixing ⟹ (μ, Φ) is weakly mixing ⟹ (μ, Φ) is totally ergodic ⟹ (μ, Φ) is ergodic ⟹ the system (X, Φ) is topologically transitive ⟹ Φ : X → X is surjective.

Theorem 58 Let Φ ∈ CA(A^ℕ) be posexpansive (see Subsect. "Posexpansive and Permutative CA"). Then (A^ℕ, Φ) has topological entropy log2(k) for some k ∈ ℕ, Φ preserves the uniform measure η, and (A^ℕ, η; Φ) is a uniformly distributed Bernoulli endomorphism on an alphabet of cardinality k.

Proof Extend the argument of Theorem 18. See Corollary 3.10 in [9] or Theorem 4.8(5) in [86].

Example 59 Suppose Φ ∈ CA(A^ℕ) is right-permutative, with neighborhood [r…R], where 0 ≤ r < R. Then h_top(Φ) = log2(|A|^R), so Theorem 58 says that (A^ℕ, η; Φ) is a uniformly distributed Bernoulli endomorphism on the alphabet B := A^R. In this case, it is easy to see this directly. If Φ_B^ℕ : A^ℕ → B^ℕ is as in Eq. (2), then β := Φ_B^ℕ(η) is the uniform Bernoulli measure on B^ℕ, and Φ_B^ℕ is an isomorphism from (A^ℕ, η; Φ) to (B^ℕ, β; σ).
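The claim in Example 59 that a right-permutative Φ preserves the uniform measure η can be checked combinatorially on cylinder sets: the sliding-block map of a permutative rule is exactly |A|^R-to-1 on finite words. A small illustrative sketch (our own), using the rule φ(x0, x1) = x0 + x1 mod 2, so r = 0 and R = 1:

```python
from itertools import product
from collections import Counter

def local_rule(x0, x1):
    # right-permutative: for each fixed x0, the map x1 -> (x0 + x1) % 2 is a bijection
    return (x0 + x1) % 2

def block_map(word):
    # apply the local rule along a word with neighbourhood [0, 1]
    return tuple(local_rule(word[i], word[i + 1]) for i in range(len(word) - 1))

n, R = 8, 1
counts = Counter(block_map(w) for w in product((0, 1), repeat=n + R))
# Every output word of length n has exactly |A|^R = 2 preimage words of
# length n + R, so the uniform measure of every cylinder set is preserved.
preimage_counts = set(counts.values())
num_words = len(counts)
print(preimage_counts, num_words)
```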
Theorem 60 Let Φ ∈ CA(A^Z) have neighborhood [L…R]. Suppose that either (a) 0 ≤ L < R and Φ is right-permutative; or (b) L < R ≤ 0 and Φ is left-permutative; or (c) L < R and Φ is bipermutative; or (d) Φ is posexpansive. Then Φ preserves the uniform measure η, and (A^Z, η; Φ) is a Bernoulli endomorphism.

Proof For cases (a) and (b), see Theorem 2.2 in [125]. For case (c), see Theorem 2.7 in [125] or Corollary 7.3 in [74]. For (d), extend the argument of Theorem 14; see Theorem 4.9 in [86].

Remark Theorem 60(c) can be extended to some higher-dimensional permutative CA using Proposition 1 in [3]; see Remark 12(b).

Theorem 61 Let Φ ∈ CA(A^Z) have neighborhood [L…R]. Suppose that either (a) Φ is surjective and 0 < L ≤ R; or (b) Φ is surjective and L ≤ R < 0; or (c) Φ is right-permutative and R ≠ 0; or (d) Φ is left-permutative and L ≠ 0. Then Φ preserves η, and (A^Z, η; Φ) is a Kolmogorov endomorphism.

Proof Cases (a) and (b) are Theorem 2.4 in [125]. Cases (c) and (d) are from [129].

Corollary 62 Any CA satisfying the hypotheses of Theorem 61 is multimixing.

Proof This follows from Theorems 57 and 61. See also Theorem 3.2 in [131] for a direct proof that any CA satisfying hypotheses (a) or (b) is 1-mixing. See Theorem 6.6 in [74] for a proof that any CA satisfying hypotheses (c) or (d) is multimixing.

Let Φ ∈ CA(A^(Z^D)) have neighborhood H. An element x ∈ H is extremal if ⟨x, x⟩ > ⟨x, h⟩ for all h ∈ H \ {x}. We say Φ is extremally permutative if Φ is permutative in some extremal coordinate.
Theorem 63 Let Φ ∈ CA(A^(Z^D)) and let η be the uniform measure. If Φ is extremally permutative, then (A^(Z^D), η; Φ) is mixing.
Proof See Theorem A in [144] for the case D = 2 and A = Z/2. Willson described Φ as 'linear' in an extremal coordinate (which is equivalent to permutative when A = Z/2), and then concluded that Φ was 'ergodic'; however, he did this by explicitly showing that Φ was mixing. His proof technique easily generalizes to any extremally permutative CA on any alphabet, and any D ≥ 1.
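Permutativity, the hypothesis running through Theorems 60–63, is a finite check on the local rule: fix all other neighborhood coordinates and verify that the remaining coordinate acts bijectively. A sketch (helper names ours) confirming that ECA#90 is bipermutative while ECA#18 is not left-permutative:

```python
from itertools import product

def is_permutative(rule, nbhd_size, coord, alphabet=(0, 1)):
    # rule is permutative in `coord` iff, for every fixing of the other
    # coordinates, a -> rule(..., a, ...) is a bijection of the alphabet
    others = [i for i in range(nbhd_size) if i != coord]
    for rest in product(alphabet, repeat=len(others)):
        args = dict(zip(others, rest))
        images = set()
        for a in alphabet:
            args[coord] = a
            images.add(rule(*(args[i] for i in range(nbhd_size))))
        if len(images) != len(alphabet):
            return False
    return True

rule90 = lambda l, c, r: (l + r) % 2                        # ECA#90: linear
rule18 = lambda l, c, r: 1 if c == 0 and l + r == 1 else 0  # ECA#18

bipermutative_90 = is_permutative(rule90, 3, 0) and is_permutative(rule90, 3, 2)
left_perm_18 = is_permutative(rule18, 3, 0)   # fails: c = 1 forces output 0
print(bipermutative_90, left_perm_18)
```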
Theorem 64 Let A = Z/m. Let Φ ∈ CA(A^(Z^D)) have linear local rule φ(a_H) = Σ_(h∈H) c_h a_h, where c_h ∈ Z for all h ∈ H. Let η be the uniform measure on A^(Z^D). The following are equivalent:
(a) Φ preserves η and (A^(Z^D), η; Φ) is ergodic.
(b) (A^(Z^D), Φ) is topologically transitive.
(c) gcd{c_h}_(0≠h∈H) is coprime to m.
(d) For all prime divisors p of m, there is some nonzero h ∈ H such that c_h is not divisible by p.
Proof Theorem 3.2 in [18]; see also [17]. For a different proof in the case D = 2, see Theorem 6 in [123]. □

Spectral Properties

If μ ∈ Meas(A^M), then let L² = L²(A^M; μ) be the set of measurable functions f : A^M → ℂ such that ‖f‖₂ := (∫_{A^M} |f|² dμ)^{1/2} is finite. If Φ ∈ CA(A^M) and μ ∈ Meas(A^M; Φ), then Φ defines a unitary linear operator Φ* : L² → L² by Φ*(f) := f∘Φ for all f ∈ L². If f ∈ L², then f is an eigenfunction of Φ, with eigenvalue c ∈ ℂ, if Φ*(f) = c·f. By definition of Φ*, any eigenvalue must be an element of the unit circle 𝕋 := {c ∈ ℂ : |c| = 1}. Let S_Φ ⊆ 𝕋 be the set of all eigenvalues of Φ, and for any s ∈ S_Φ, let E_s(Φ) := {f ∈ L² : Φ* f = s·f} be the corresponding eigenspace. For example, if f is constant μ-almost everywhere, then f ∈ E₁(Φ). Let E(Φ) := ⊕_{s∈S_Φ} E_s(Φ). Note that S_Φ is a group. Indeed, if s₁, s₂ ∈ S_Φ, and f₁ ∈ E_{s₁} and f₂ ∈ E_{s₂}, then (f₁·f₂) ∈ E_{s₁·s₂} and (1/f₁) ∈ E_{1/s₁}. Thus, S_Φ is called the spectral group of Φ. If s ∈ S_Φ, then heuristically, an s-eigenfunction is an 'observable' of the dynamical system (A^M, μ, Φ) which exhibits quasiperiodically recurrent behavior. Thus, the spectral properties of Φ characterize the 'recurrent aspect' of its dynamics (or the lack thereof). For example:
(A^M, μ, Φ) is ergodic ⟺ E₁(Φ) contains only constant functions ⟺ dim[E_s(Φ)] = 1 for all s ∈ S_Φ.
(A^M, μ, Φ) is weakly mixing (see Subsect. “Mixing and Ergodicity”) ⟺ E(Φ) contains only constant functions ⟺ (A^M, μ, Φ) is ergodic and S_Φ = {1}.
We say (A^M, μ, Φ) has discrete spectrum if L² is spanned by E(Φ). In this case, (A^M, μ, Φ) is measurably isomorphic to an MPDS defined by translation on a compact abelian group (e.g. an irrational rotation of a torus, an odometer, etc.).
If μ ∈ Meas(A^M; σ), then there is a natural unitary M-action on L², where σ^m_*(f) := f∘σ^m. A character of M is a monoid homomorphism χ : M → 𝕋. The set M̂ of all characters is a group under pointwise multiplication, called the dual group of M. If f ∈ L² and χ ∈ M̂, then f is called a χ-eigenfunction of (A^M, μ, σ) if σ^m_*(f) = χ(m)·f for all m ∈ M; then χ is called an eigencharacter. The spectral group of (A^M, μ, σ) is then the subgroup S_σ ⊆ M̂ of all eigencharacters. For any χ ∈ S_σ, let E_χ(σ) be the corresponding eigenspace, and let E(σ) := ⊕_{χ∈S_σ} E_χ(σ).
(A^M, μ, σ) is ergodic ⟺ E₁(σ) contains only constant functions ⟺ dim[E_χ(σ)] = 1 for all χ ∈ S_σ.
(A^M, μ, σ) is weakly mixing ⟺ E(σ) contains only constant functions ⟺ (A^M, μ, σ) is ergodic and S_σ = {1}.
(A^M, μ, σ) has discrete spectrum if L² is spanned by E(σ). In this case, the system (A^M, μ, σ) is measurably isomorphic to an action of M by translations on a compact abelian group.

Example 65 Let M = ℤ; then any character χ : ℤ → 𝕋 has the form χ(n) = cⁿ for some c ∈ 𝕋, so a χ-eigenfunction is just a σ-eigenfunction with eigenvalue c. In this case, the aforementioned spectral properties for the ℤ-action by shifts are equivalent to the corresponding spectral properties of the CA Φ = σ¹. Bernoulli measures and irreducible Markov chains are weakly mixing. On the other hand, several important classes of symbolic dynamical systems have discrete spectrum, including Sturmian shifts, constant-length substitution shifts, and regular Toeplitz shifts; see Dynamics of Cellular Automata in Non-compact Spaces.

Proposition 66 Let Φ ∈ CA(A^M), and let μ ∈ Meas(A^M; Φ, σ) be σ-ergodic.
(a) E(σ) ⊆ E(Φ).
(b) If (A^M, μ, σ) has discrete spectrum, then so does (A^M, μ, Φ).
(c) Suppose μ is Φ-ergodic. If (A^M, μ, σ) is weakly mixing, then so is (A^M, μ, Φ).

Proof (a) Suppose χ ∈ M̂ and f ∈ E_χ. Then f∘Φ ∈ E_χ also, because for all m ∈ M, f∘Φ∘σ^m = f∘σ^m∘Φ = χ(m)·(f∘Φ). But if (A^M, μ, σ) is ergodic, then dim[E_χ(σ)] = 1; hence f∘Φ must be a scalar multiple of f. Thus, f is also an eigenfunction for Φ. (b) follows from (a). (c) By reversing the roles of Φ and σ in (a), we see that E(Φ) ⊆ E(σ). But if (A^M, μ, σ) is weakly mixing, then E(σ) = {constant functions}. Thus, (A^M, μ, Φ) is also weakly mixing. □

Example 67 (a) Let μ be any Bernoulli measure on A^M. If μ is Φ-invariant and Φ-ergodic, then (A^M, μ, Φ) is weakly mixing (because (A^M, μ, σ) is weakly mixing).
(b) Let P ∈ ℕ and suppose μ is a Φ-invariant measure supported on the set X_P of P-periodic sequences (see Proposition 10). Then (A^ℤ, μ, σ) has discrete spectrum (with rational eigenvalues). But X_P is finite, so the system (X_P, Φ) is also periodic; hence (A^ℤ, μ, Φ) also has discrete spectrum (with rational eigenvalues).
(c) Downarowicz [25] has constructed an example of a regular Toeplitz shift X ⊂ A^ℤ and Φ ∈ CA(A^ℤ) (not the shift) such that Φ(X) ⊆ X. Any regular Toeplitz shift is uniquely ergodic, and the unique shift-invariant measure μ has discrete spectrum; thus, (A^ℤ, μ, Φ) also has discrete spectrum.

Aside from Examples 67(b,c), the literature contains no examples of discrete-spectrum invariant measures for CA; this is an interesting area for future research.

Entropy

Let Φ ∈ CA(A^M). For any finite B ⊂ M, let 𝔅 := A^B, let Φ_𝔅^ℕ : A^M → 𝔅^ℕ be as in Eq. (2), and let X := Φ_𝔅^ℕ(A^M) ⊆ 𝔅^ℕ; then define

  H_top(B; Φ) := h_top(X) = lim_{T→∞} (1/T) · log₂(#X_{[0…T)}) .

If μ ∈ Meas(A^M; Φ), let ν := Φ_𝔅^ℕ(μ); then ν is a σ-invariant measure on 𝔅^ℕ. Define

  H_μ(B; Φ) := h_ν(σ) = lim_{T→∞} −(1/T) · Σ_{b ∈ 𝔅^{[0…T)}} ν[b] · log₂(ν[b]) .
The topological entropy of (A^M, Φ) and the measurable entropy of (A^M, Φ, μ) are then defined by

  h_top(Φ) := sup_{B ⊂ M finite} H_top(B; Φ)  and  h_μ(Φ) := sup_{B ⊂ M finite} H_μ(B; Φ) .
The famous Variational Principle states that h_top(Φ) = sup{h_μ(Φ) : μ ∈ Meas(A^M; Φ)}; see Topological Dynamics of Cellular Automata, Sect. 10. If M has more than one dimension (e.g. M = ℤ^D or ℕ^D for D ≥ 2), then most CA on A^M have infinite entropy. Thus, entropy is mainly of interest in the case M = ℤ or ℕ. Coven [23] was the first to compute the topological entropy of a CA; he showed that h_top(Φ) = 1 for a large class of left-permutative, one-sided CA on {0,1}^ℕ (which have since been called Coven CA). Later, Lind [83] showed how to construct CA whose topological entropy was any element of a countable dense subset of ℝ₊, consisting of logarithms of certain algebraic numbers.
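For a concrete one-dimensional CA, the quantity H_top(B; Φ) with B = {0} can be estimated directly from its definition by counting distinct length-T 'columns'. A finite-size sketch (not code from the literature), using the local rule φ(a₀, a₁) = a₀ + a₁ (mod 2):

```python
from itertools import product
from math import log2

def step(x):
    """One step of the CA with local rule (a0 + a1) mod 2, neighborhood {0,1}."""
    return tuple((x[i] + x[i + 1]) % 2 for i in range(len(x) - 1))

def column_entropy_estimate(T):
    """Estimate H_top({0}; Phi) = lim (1/T) log2 #X_[0..T): a width-T
    initial word determines the first T values of cell 0."""
    columns = set()
    for w in product((0, 1), repeat=T):
        x, col = w, []
        for _ in range(T):
            col.append(x[0])
            x = step(x)
        columns.add(tuple(col))
    return log2(len(columns)) / T
```

Here the word-to-column map is bijective (the rule is bipermutative), so the estimate equals 1 for every T, matching h_top(Φ) = 1.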
Theorems 14 and 18(b) above characterize the topological entropy of posexpansive CA. However, Hurd et al. [62] showed that there is no algorithm which can compute the topological entropy of an arbitrary CA; see Tiling Problem and Undecidability in Cellular Automata. Measurable entropy has also been computed for a few special classes of CA. For example, if Φ ∈ CA(A^ℤ) is bipermutative with neighborhood {0,1} and μ ∈ Meas(A^ℤ; Φ, σ) is σ-ergodic, then h_μ(Φ) = log₂(K) for some integer K ≤ |A| (see Thm 4.1 in [113]). If η is the uniform measure, and Φ is posexpansive, then Theorems 58 and 60 above characterize h_η(Φ). Also, if Φ satisfies the conditions of Theorem 61, then h_η(Φ) > 0, and furthermore, all factors of the MPDS (A^ℤ, η, Φ) also have positive entropy. However, unlike abstract dynamical systems, CA come with an explicit spatial 'geometry'. The most fruitful investigations of CA entropy are those which have interpreted entropy in terms of how information propagates through this geometry.

Lyapunov Exponents

Wolfram [150] suggested that the propagation speed of 'perturbations' in a one-dimensional CA Φ could transform 'spatial' entropy [i.e. h(σ)] into 'temporal' entropy [i.e. h(Φ)]. He compared this propagation speed to the 'Lyapunov exponent' of a smooth dynamical system: it determines the exponential rate of divergence between two initially close Φ-orbits (see pp. 172, 261 and 514 in [151]). Shereshevsky [126] formalized Wolfram's intuition and proved the conjectured entropy relationship; his results were later improved by Tisseur [141].

Let Φ ∈ CA(A^ℤ), let a ∈ A^ℤ, and let z ∈ ℤ. Define

  W⁺_z(a) := { w ∈ A^ℤ : w_{[z…∞)} = a_{[z…∞)} } , and
  W⁻_z(a) := { w ∈ A^ℤ : w_{(−∞…z]} = a_{(−∞…z]} } .

Thus, we obtain each w ∈ W⁺_z(a) (respectively W⁻_z(a)) by 'perturbing' a somewhere to the left (resp. right) of coordinate z. Next, for any t ∈ ℕ, define
  ẽ⁺_t(a) := min{ z ∈ ℕ : Φᵗ(W⁺_0(a)) ⊆ W⁺_z(Φᵗ[a]) } , and
  ẽ⁻_t(a) := min{ z ∈ ℕ : Φᵗ(W⁻_0(a)) ⊆ W⁻_{−z}(Φᵗ[a]) } .

Thus, ẽ±_t(a) measures the farthest distance which any perturbation of a at coordinate 0 could have propagated by time t. Next, define Λ±_t(a) := max_{z∈ℤ} ẽ±_t[σᶻ(a)].
Shereshevsky [126] defined the (maximum) Lyapunov exponents

  λ⁺(Φ, a) := lim_{t→∞} (1/t)·Λ⁺_t(a)  and  λ⁻(Φ, a) := lim_{t→∞} (1/t)·Λ⁻_t(a) ,

whenever these limits exist. Let G(Φ) := {g ∈ A^ℤ : λ⁺(Φ, g) and λ⁻(Φ, g) both exist}. The subset G(Φ) is 'generic' within A^ℤ in a very strong sense, and the Lyapunov exponents detect 'chaotic' topological dynamics.

Proposition 68 Let Φ ∈ CA(A^ℤ).
(a) Let μ ∈ Meas(A^ℤ; σ). Suppose that either: [i] μ is also Φ-invariant; or: [ii] μ is σ-ergodic and supp(μ) is a Φ-invariant subset. Then μ(G) = 1.
(b) The set G and the functions λ±(Φ, ·) are (Φ, σ)-invariant. Thus, if μ is either Φ-ergodic or σ-ergodic, then there exist constants λ±(Φ) ≥ 0 such that λ±(Φ, g) = λ±(Φ) for all g ∈ G.
(c) If Φ is posexpansive, then there is a constant c > 0 such that λ±(Φ, g) ≥ c for all g ∈ G.
(d) Let η be the uniform Bernoulli measure. If Φ is surjective, then h_top(Φ) ≤ (λ⁺(Φ) + λ⁻(Φ)) · log₂|A|.

Proof (a) follows from the fact that, for any a ∈ A^ℤ, the sequence [Λ±_t(a)]_{t∈ℕ} is subadditive in t. Condition [i] is Theorem 1 in [126], and follows from Kingman's subadditive ergodic theorem. Condition [ii] is Proposition 3.1 in [141]. (b) is clear by definition of λ±. (c) is Theorem 5.2 in [37]. (d) is Proposition 5.3 in [141]. □

For any Φ-ergodic μ ∈ Meas(A^ℤ; Φ, σ), Shereshevsky (see Theorem 2 in [126]) showed that h_μ(Φ) ≤ (λ⁺(Φ) + λ⁻(Φ)) · h_μ(σ). Tisseur later improved this estimate. For any T ∈ ℕ, let

  Ĩ⁺_T(a) := min{ z ∈ ℕ : ∀ t ∈ [1…T], Φᵗ(W⁺_z(a)) ⊆ W⁺_0(Φᵗ[a]) } , and
  Ĩ⁻_T(a) := min{ z ∈ ℕ : ∀ t ∈ [1…T], Φᵗ(W⁻_{−z}(a)) ⊆ W⁻_0(Φᵗ[a]) } .

Next, for any μ ∈ Meas(A^ℤ; σ), define Ĩ±_T(μ) := ∫_{A^ℤ} Ĩ±_T(a) dμ[a]. Tisseur then defined the average Lyapunov exponents: I±_μ(Φ) := lim inf_{T→∞} Ĩ±_T(μ)/T.
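A finite-time version of these exponents is easy to estimate empirically: flip one cell, run both orbits forward, and measure how far the difference front has traveled. A sketch under stated assumptions (binary alphabet, periodic ring wide enough that the two fronts cannot meet; all names are mine):

```python
import random

def step(x, rule, r):
    """One CA step on a periodic ring; `rule` sees 2r+1 neighbors."""
    n = len(x)
    return [rule(tuple(x[(i + k) % n] for k in range(-r, r + 1))) for i in range(n)]

def lyapunov_estimate(rule, r, t_max, n=64, trials=10, seed=1):
    """Flip cell 0 of a random configuration, run t_max steps, and record
    the farthest rightward differing cell (within the half-ring); the
    ratio worst/t_max is a finite-time analogue of the exponent."""
    rng = random.Random(seed)
    worst = 0
    for _ in range(trials):
        x = [rng.randint(0, 1) for _ in range(n)]
        y = x[:]; y[0] ^= 1
        for _ in range(t_max):
            x, y = step(x, rule, r), step(y, rule, r)
        diffs = [i for i in range(n // 2) if x[i] != y[i]]
        worst = max(worst, max(diffs, default=0))
    return worst / t_max

rule90 = lambda nb: nb[0] ^ nb[2]   # ECA 90: XOR of the two outer neighbors
```

For ECA 90 the perturbation front moves at maximal speed, so the estimate is 1.0; for a rule that ignores its neighbors it is 0.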
Theorem 69 Let Φ ∈ CA(A^ℤ) and let μ ∈ Meas(A^ℤ; σ).
(a) If supp(μ) is Φ-invariant, then I⁺_μ(Φ) ≤ λ⁺(Φ) and I⁻_μ(Φ) ≤ λ⁻(Φ), and one or both inequalities are sometimes strict.
(b) If μ is σ-ergodic and Φ-invariant, then h_μ(Φ) ≤ (I⁺_μ(Φ) + I⁻_μ(Φ)) · h_μ(σ), and this inequality is sometimes strict.
(c) If supp(μ) contains Φ-equicontinuous points, then I⁺_μ(Φ) = I⁻_μ(Φ) = h_μ(Φ) = 0.

Proof See [141]: (a) is Proposition 3.2 and Example 6.1; (b) is Theorem 5.1 and Example 6.2; and (c) is Proposition 5.2. □

Directional Entropy

Milnor [98,99] introduced directional entropy to capture the intuition that information in a CA propagates in particular directions with particular 'velocities', and that different CA 'mix' information in different ways. Classical entropy is unable to detect this informational anisotropy. For example, if A = {0,1} and Φ ∈ CA(A^ℤ) has local rule φ(a₀, a₁) = a₀ + a₁ (mod 2), then h_top(Φ) = 1 = h_top(σ), despite the fact that Φ vigorously 'mixes' information together and propagates any 'perturbation' outwards in an expanding cone, whereas σ merely shifts information to the left in a rigid and essentially trivial fashion.

If Φ ∈ CA(A^{ℤ^D}), then a complete history for Φ is a sequence (a_t)_{t∈ℤ} ∈ (A^{ℤ^D})^ℤ ≅ A^{ℤ^{D+1}} such that Φ(a_t) = a_{t+1} for all t ∈ ℤ. Let X^Hist := X^Hist(Φ) ⊆ A^{ℤ^{D+1}} be the subshift of all complete histories for Φ, and let σ be the ℤ^{D+1}-shift action on X^Hist; then (X^Hist, σ) is conjugate to the natural extension of the system (Y, Φ, σ), where Y := Φ^∞(A^{ℤ^D}) := ∩_{t=1}^∞ Φᵗ(A^{ℤ^D}) is the omega-limit set of Φ. If μ ∈ Meas(A^{ℤ^D}; Φ, σ), then supp(μ) ⊆ Y, and μ extends to a σ-invariant measure μ̃ on X^Hist in the obvious way.

Let v⃗ = (v₀, v₁, …, v_D) ∈ ℝ × ℝ^D ≅ ℝ^{D+1}. For any bounded open subset B ⊂ ℝ^{D+1} and T > 0, let B(T v⃗) := {b + t·v⃗ : b ∈ B and t ∈ [0, T]} be the 'sheared cylinder' in ℝ^{D+1} with cross-section B and length T·|v⃗| in the direction v⃗, and let 𝔹(T v⃗) := B(T v⃗) ∩ ℤ^{D+1}. Let X^Hist_{𝔹(T v⃗)} := { x_{𝔹(T v⃗)} : x ∈ X^Hist(Φ) }. We define

  H_top(Φ; B, v⃗) := lim sup_{T→∞} (1/T) · log₂[ #X^Hist_{𝔹(T v⃗)} ] , and
  H_μ(Φ; B, v⃗) := lim sup_{T→∞} −(1/T) · Σ_{x ∈ X^Hist_{𝔹(T v⃗)}} μ̃[x] · log₂(μ̃[x]) .

We then define the v⃗-directional topological entropy and v⃗-directional μ-entropy of Φ by

  h_top(Φ; v⃗) := sup_{B ⊂ ℝ^{D+1} open & bounded} H_top(Φ; B, v⃗) ,  (8)

  h_μ(Φ; v⃗) := sup_{B ⊂ ℝ^{D+1} open & bounded} H_μ(Φ; B, v⃗) .  (9)
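For rational directions v⃗ = (t, z), Proposition 70(b) below reduces these suprema to the classical entropy of the composite map Φᵗ∘σᶻ, which can be estimated by column counting. An illustrative finite-size sketch for the XOR CA φ(a₀, a₁) = a₀ + a₁ (mod 2) with a two-cell window (a toy estimator of my own, not code from the literature):

```python
from itertools import product
from math import log2

def composite_step(x, z):
    """One step of Phi o sigma^z for the XOR CA: apply the local rule,
    then shift left by z (the finite word shrinks by 1 + z cells)."""
    y = tuple((x[i] + x[i + 1]) % 2 for i in range(len(x) - 1))
    return y[z:]

def directional_entropy_estimate(z, T):
    """Count distinct T-step columns of the window {0,1} under Phi o sigma^z;
    log2(count)/T is a finite-size stand-in for the directional entropy."""
    width = (T - 1) * (1 + z) + 2
    cols = set()
    for w in product((0, 1), repeat=width):
        x, col = w, []
        for _ in range(T):
            col.append((x[0], x[1]))
            x = composite_step(x, z)
        cols.add(tuple(col))
    return log2(len(cols)) / T
```

The direction (1, 1) yields 2.0 (the dependence cone has width 2 per composite step), while the 'vertical' direction (1, 0) yields (T+1)/T, converging to h_top(Φ) = 1.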
Proposition 70 Let Φ ∈ CA(A^{ℤ^D}) and let μ ∈ Meas(A^{ℤ^D}; Φ, σ).
(a) Directional entropy is homogeneous. That is, for any v⃗ ∈ ℝ^{D+1} and r > 0, h_top(Φ; r·v⃗) = r·h_top(Φ; v⃗) and h_μ(Φ; r·v⃗) = r·h_μ(Φ; v⃗).
(b) If v⃗ = (t, z) ∈ ℤ × ℤ^D, then h_top(Φ; v⃗) = h_top(Φᵗ∘σᶻ) and h_μ(Φ; v⃗) = h_μ(Φᵗ∘σᶻ).
(c) There is an extension of the ℤ × ℤ^D-system (X^Hist, Φ̃, σ) to an ℝ × ℝ^D-system (X̃, Φ̃, σ̃) such that, for any v⃗ = (t, u⃗) ∈ ℝ × ℝ^D, we have h_top(Φ; v⃗) = h_top(Φ̃ᵗ∘σ̃^u⃗). For any μ ∈ Meas(A^{ℤ^D}; Φ, σ), there is an extension μ̃ ∈ Meas(X̃; Φ̃, σ̃) such that for any v⃗ = (t, u⃗) ∈ ℝ × ℝ^D we have h_μ(Φ; v⃗) = h_μ̃(Φ̃ᵗ∘σ̃^u⃗).
Proof (a,b) follow from the definition. (c) is Proposition 2.1 in [109]. □

Remark 71 Directional entropy can actually be defined for any continuous ℤ^{D+1}-action on a compact metric space, and in particular, for any subshift of A^{ℤ^{D+1}}. The directional entropy of a CA Φ is then just the directional entropy of the subshift X^Hist(Φ). Proposition 70 holds for any subshift.

Directional entropy is usually infinite for multidimensional CA (for the same reason that classical entropy is usually infinite). Thus, most of the analysis has been for one-dimensional CA. For example, Kitchens and Schmidt (see Sect. 1 in [72]) studied the directional topological entropy of one-dimensional linear CA, while Smillie (see Proposition 1.1 in [133]) computed the directional topological entropy for ECA#184. If Φ is linear, then the function v⃗ ↦ h_top(Φ; v⃗) is piecewise linear and convex, but if Φ is ECA#184, it is neither. If v⃗ has rational entries, then Proposition 70(a,b) shows that h(Φ; v⃗) is a rational multiple of the classical entropy of some composite CA, which can be computed through classical methods. However, if v⃗ is irrational, then h(Φ; v⃗) is quite difficult to compute using the formulae (8) and (9), and Proposition 70(c), while theoretically interesting, is not very computationally useful. Can we compute h(Φ; v⃗) as the limit of h(Φ; v⃗_k), where {v⃗_k}_{k=1}^∞ is a sequence of rational vectors tending to v⃗? In other words, is directional entropy continuous as a function of v⃗? What other properties does h(Φ; v⃗) have as a function of v⃗?

Theorem 72 Let Φ ∈ CA(A^ℤ) and let μ ∈ Meas(A^ℤ; Φ, σ).
(a) The function ℝ² ∋ v⃗ ↦ h_μ(Φ; v⃗) ∈ ℝ is continuous.
(b) Suppose there is some (t, z) ∈ ℕ × ℤ with t ≥ 1, such that Φᵗ∘σᶻ is posexpansive. Then the function ℝ² ∋ v⃗ ↦ h_top(Φ; v⃗) ∈ ℝ is convex, and thus, Lipschitz-continuous.
(c) However, there exist other Φ ∈ CA(A^ℤ) for which the function ℝ² ∋ v⃗ ↦ h_top(Φ; v⃗) ∈ ℝ is not continuous.
(d) Suppose Φ has neighborhood [ℓ…r] ⊆ ℤ. If v⃗ = (t, x) ∈ ℝ², then let z_ℓ := x − ℓt and z_r := x + rt. Let L := log₂|A|.
 [i] Suppose z_ℓ·z_r ≥ 0. Then h_μ(Φ; v⃗) ≤ max{|z_ℓ|, |z_r|}·L. Furthermore: If Φ is right-permutative, and |z_ℓ| ≤ |z_r|, then h_μ(Φ; v⃗) = |z_r|·L. If Φ is left-permutative, and |z_r| ≤ |z_ℓ|, then h_μ(Φ; v⃗) = |z_ℓ|·L.
 [ii] Suppose z_ℓ·z_r ≤ 0. Then h_μ(Φ; v⃗) ≤ |z_r − z_ℓ|·L. Furthermore, if Φ is bipermutative in this case, then h_μ(Φ; v⃗) = |z_r − z_ℓ|·L.

Proof (a) is Corollary 3.3 in [109], while (b) is Théorème III.11 and Corollaire III.12, pp. 79–80 in [120]. (c) is Proposition 1.2 in [133]. (d) summarizes the main results of [21]. See also Example 6.2 in [99] for an earlier analysis of permutative CA in the case r = −ℓ = 1; see also Example 6.4 in [12] and Sect. 1 in [72] for the special case when Φ is linear. □

Remark 73 (a) In fact, the conclusion of Theorem 72(b) holds as long as Φ has any posexpansive directions (even irrational ones). A posexpansive direction is analogous to an expansive subspace (see Subsect. “Entropy Geometry and Expansive Subdynamics”), and is part of Sablik's theory of 'directional dynamics' for one-dimensional CA; see Remark 84(b) below. Using this theory, Sablik has also shown that h_μ(Φ; v⃗) = 0 = h_top(Φ; v⃗) whenever v⃗ is an equicontinuous direction for Φ, whereas h_μ(Φ; v⃗) ≠ 0 ≠ h_top(Φ; v⃗) whenever v⃗ is a right- or left-posexpansive direction for Φ. See Sect. III.4.5–III.4.6, pp. 86–88 in [120].
(b) Courbage and Kamiński have defined a 'directional' version of the Lyapunov exponents introduced in Subsect. “Lyapunov Exponents”.
If Φ ∈ CA(A^ℤ), a ∈ A^ℤ and v⃗ = (t, z) ∈ ℕ × ℤ, then λ±_v⃗(Φ; a) := λ±(Φᵗ∘σᶻ; a), where λ± are defined as in Subsect. “Lyapunov Exponents”. If v⃗ ∈ ℝ² is irrational, then the definition of λ±_v⃗(Φ; a) is somewhat more subtle. For any Φ and a, the function ℝ² ∋ v⃗ ↦ λ±_v⃗(Φ; a) ∈ ℝ is homogeneous and continuous (see Lemma 2 and Proposition 3 in [22]). If μ ∈ Meas(A^ℤ; Φ, σ) is σ-ergodic, then
λ±_v⃗(Φ; ·) is constant μ-almost everywhere, and is related to h_μ(Φ; v⃗) through an inequality exactly analogous to Theorem 69(b); see Theorem 1 in [22].

Cone Entropy

For any v⃗ ∈ ℝ^{D+1}, any angle θ > 0, and any N > 0, we define

  K(N v⃗; θ) := { z ∈ ℤ^{D+1} : |z| ≤ N·|v⃗| and z·v⃗/(|z|·|v⃗|) ≥ cos(θ) } .

Geometrically, this is the set of all ℤ^{D+1}-lattice points in a cone of length N·|v⃗| which subtends an angle of 2θ around an axis parallel to v⃗, and which has its apex at the origin. If Φ ∈ CA(A^{ℤ^D}), then let X^Hist(N v⃗; θ) := { x_{K(N v⃗; θ)} : x ∈ X^Hist(Φ) }. If μ ∈ Meas(A^{ℤ^D}; Φ), and μ̃ is the extension of μ to X^Hist, then the cone entropy of (Φ, μ) in direction v⃗ is defined by

  h^cone_μ(Φ; v⃗) := lim_{θ→0} lim_{N→∞} −(1/N) · Σ_{x ∈ X^Hist(N v⃗; θ)} μ̃[x] · log₂(μ̃[x]) .
Park [107,108] attributes this concept to Doug Lind. Like directional entropy, cone entropy can be defined for any continuous ℤ^{D+1}-action, and is generally infinite for multidimensional CA. However, for one-dimensional CA, Park has proved:

Theorem 74 If Φ ∈ CA(A^ℤ), μ ∈ Meas(A^ℤ; Φ) and v⃗ ∈ ℝ², then h^cone_μ(Φ; v⃗) = h_μ(Φ; v⃗).

Proof See Theorem 1 in [108]. □
Entropy Geometry and Expansive Subdynamics

Directional entropy is the one-dimensional version of a multidimensional 'entropy density' function, which was introduced by Milnor [99] to address the fact that classical and directional entropy are generally infinite for multidimensional CA. Milnor's ideas were then extended by Boyle and Lind [12], using their theory of expansive subdynamics. Let X ⊆ A^{ℤ^{D+1}} be a subshift, and let μ ∈ Meas(X; σ). For any bounded B ⊂ ℝ^{D+1}, let 𝔹 := B ∩ ℤ^{D+1}, let X_𝔹 := X|_𝔹, and then define

  H_X(B) := log₂|X_𝔹|  and  H_μ(B) := −Σ_{x ∈ X_𝔹} μ[x] · log₂(μ[x]) .
The topological entropy dimension dim(X) is the smallest d ∈ [0…D+1] having some constant c > 0 such that, for any finite B ⊂ ℝ^{D+1}, H_X(B) ≤ c·diam[B]^d. The measurable entropy dimension dim(μ) is defined similarly, only with H_μ in place of H_X. Note that dim(μ) ≤ dim(X), because H_μ(B) ≤ H_X(B) for all B ⊂ ℝ^{D+1}. For any bounded B ⊂ ℝ^{D+1} and 'scale factor' s > 0, let sB := {sb : b ∈ B}. For any radius r > 0, let (sB)_r := {x ∈ ℝ^{D+1} : d(x, sB) ≤ r}. Define the d-dimensional topological entropy density of B by

  h^d_X(B) := sup_{r>0} lim sup_{s→∞} H_X[(sB)_r]/s^d .  (10)

Define the d-dimensional measurable entropy density h^d_μ(B) similarly, only using H_μ instead of H_X. Note that, for any d < dim(X) [respectively, d < dim(μ)], h^d_X(B) [resp. h^d_μ(B)] will be infinite, whereas for any d > dim(X) [resp. d > dim(μ)], h^d_X(B) [resp. h^d_μ(B)] will be zero; hence dim(X) [resp. dim(μ)] is the unique value of d for which the function h^d_X [resp. h^d_μ] defined in Eq. (10) could be nontrivial.
Example 75 (a) If d = D+1, and B is a unit cube centered at the origin, then h^{D+1}_X(B) (resp. h^{D+1}_μ(B)) is just the classical (D+1)-dimensional topological (resp. measurable) entropy of X (resp. μ) as a (D+1)-dimensional subshift (resp. random field).
(b) However, the most important case for Milnor [99] (and us) is when X = X^Hist(Φ) for some Φ ∈ CA(A^{ℤ^D}). In this case, dim(μ) ≤ dim(X) ≤ D < D+1. In particular, if d = 1, then for any v⃗ ∈ ℝ^{D+1}, if B := {r·v⃗ : r ∈ [0, 1]}, then h¹_X(B) = h_top(Φ; v⃗) and h¹_μ(B) = h_μ(Φ; v⃗) are the directional entropies of Subsect. “Directional Entropy”.

For any d ∈ [0…D+1], let Γ^d be the d-dimensional Hausdorff measure on ℝ^{D+1}, normalized such that, if P ⊂ ℝ^{D+1} is any d-plane (i.e. a d-dimensional linear subspace of ℝ^{D+1}), then Γ^d restricts to the d-dimensional Lebesgue measure on P.

Theorem 76 Let X ⊆ A^{ℤ^{D+1}} be a subshift, and let μ ∈ Meas(X; σ). Let d = dim(X) (or dim(μ)), and let h^d be h^d_X (or h^d_μ, respectively). Let B, C ⊂ ℝ^{D+1} be compact sets. Then:
(a) h^d(B) is well-defined and finite.
(b) If B ⊆ C then h^d(B) ≤ h^d(C).
(c) h^d(B ∪ C) ≤ h^d(B) + h^d(C).
(d) h^d(B + v⃗) = h^d(B) for any v⃗ ∈ ℝ^{D+1}.
(e) h^d(sB) = s^d·h^d(B) for any s > 0.
(f) There is some constant c such that h^d(B) ≤ c·Γ^d(B) for all compact B ⊂ ℝ^{D+1}.
(g) If d ∈ ℕ, then for any d-plane P ⊂ ℝ^{D+1}, there is some H^d(P) ≥ 0 such that h^d(B) = H^d(P)·Γ^d(B) for any compact subset B ⊂ P with Γ^d(∂B) = 0.
(h) There is a constant H̄^d_X < ∞ such that H^d_X(P) ≤ H̄^d_X for all d-planes P.
Ergodic Theory of Cellular Automata, Figure 2
Example 78(b)[ii]: A left-permutative CA Φ is quasi-invertible. In this picture, [ℓ…r] = [−1…2], and L is a line of slope −1/3. If x ∈ X and we know the entries of x in a neighborhood of L, then we can reconstruct the rest of x as shown. Entries above L are directly computed using the local rule of Φ. Entries below L are interpolated via left-permutativity. In both cases, the reconstruction occurs in consecutive diagonal lines, whose order is indicated by shading from darkest to lightest in the figure
Proof See Theorems 1 and 2 and Corollary 1 in [99], or see Theorems 6.2, 6.3, and 6.13 in [12]. □

Example 77 Let Φ ∈ CA(A^{ℤ^D}) and let X := X^Hist(Φ). If P := {0} × ℝ^D, then H^D(P) is the classical D-dimensional entropy of the omega limit set Y := Φ^∞(A^{ℤ^D}); heuristically, this measures the asymptotic level of 'spatial disorder' in Y. If P ⊂ ℝ^{D+1} is some other D-plane, then H^D(P) measures some combination of the 'spatial disorder' of Y with the dynamical entropy of Φ.
Let d ∈ [1…D+1], and let P ⊂ ℝ^{D+1} be a d-plane. For any r > 0, let P(r) := {z ∈ ℤ^{D+1} : d(z, P) < r}. We say P is expansive for X if there is some r > 0 such that, for any x, y ∈ X, (x_{P(r)} = y_{P(r)}) ⟺ (x = y). If P is spanned by d rational vectors, then P ∩ ℤ^{D+1} is a rank-d sublattice L ⊆ ℤ^{D+1}, and P is expansive if and only if the induced L-action on X is expansive. However, if P is 'irrational', then expansiveness is a more subtle concept; see Sect. 2 in [12] for more information. If Φ ∈ CA(A^{ℤ^D}) and X = X^Hist(Φ), then Φ is quasi-invertible if X admits an expansive D-plane P (this is a natural extension of Milnor's (1988; §7) definition in terms of 'causal cones'). Heuristically, if we regard ℤ^{D+1} as 'spacetime' (in the spirit of special relativity), then P can be seen as 'space', and any direction transversal to P can be interpreted as the flow of 'time'.
Example 78 (a) If Φ is invertible, then it is quasi-invertible, because {0} × ℝ^D is an expansive D-plane (recall that the zeroth coordinate is time).
(b) Let Φ ∈ CA(A^ℤ), so that X ⊆ A^{ℤ²}. Let Φ have neighborhood [ℓ…r], with ℓ ≤ 0 ≤ r, and let L ⊂ ℝ² be a line with slope S through the origin (Fig. 2).
 [i] If Φ is right-permutative, and 0 < S ≤ 1/(1−ℓ), then L is expansive for X.
 [ii] If Φ is left-permutative, and −1/(r+1) ≤ S < 0, then L is expansive for X.
 [iii] If Φ is bipermutative, and −1/(r+1) ≤ S < 0 or 0 < S ≤ 1/(1−ℓ), then L is expansive for X.
 [iv] If Φ is posexpansive (see Subsect. “Posexpansive and Permutative CA”), then the 'time' axis L = ℝ × {0} is expansive for X.
Hence, in any of these cases, Φ is quasi-invertible. (Presumably, something similar is true for multidimensional permutative CA.)

Proposition 79 Let Φ ∈ CA(A^{ℤ^D}), let X = X^Hist(Φ), let μ ∈ Meas(X; σ), and let H^d and H̄^d_X be as in Theorem 76(g,h).
(a) If H^D_X({0} × ℝ^D) = 0, then H̄^D_X = 0.
(b) Let d ∈ [1…D], and suppose that X admits an expansive d-plane. Then:
 [i] dim(X) ≤ d;
 [ii] There is a constant H̄^d < ∞ such that H^d(P) ≤ H̄^d for all d-planes P;
 [iii] If H^d(P) = 0 for some expansive d-plane P, then H̄^d = 0.
Proof (a) is Corollary 3 in [99], (b)[i] is Corollary 1.4 in [128], and (b)[ii] is Theorem 6.19(2) in [12]. (b)[iii]: See Theorem 6.3(4) in [12] for “H̄^d_X = 0”, and Theorem 6.19(1) in [12] for “H̄^d_μ = 0”. □
If d ∈ [1…D+1], then a d-frame in ℝ^{D+1} is a d-tuple F := (v⃗₁, …, v⃗_d), where v⃗₁, …, v⃗_d ∈ ℝ^{D+1} are linearly independent. Let Frame(D+1; d) be the set of all d-frames in ℝ^{D+1}; then Frame(D+1; d) is an open subset of ℝ^{D+1} × ⋯ × ℝ^{D+1} ≅ ℝ^{(D+1)·d}. Let

  Expans(X; d) := { F ∈ Frame(D+1; d) : span(F) is expansive for X } .

Then Expans(X; d) is an open subset of Frame(D+1; d), by Lemma 3.4 in [12]. A connected component of Expans(X; d) is called an expansive component for X. For any F ∈ Frame(D+1; d), let [F] be the d-dimensional parallelepiped spanned by F, and let h^d_X(F) := h^d_X([F]) = H^d_X(span(F))·Γ^d([F]), where the last equality is by Theorem 76(g). The next result is a partial extension of Theorem 72(b).

Proposition 80 Let X ⊆ A^{ℤ^{D+1}} be a subshift, suppose d := dim(X) ∈ ℕ, and let C ⊆ Expans(X; d) be an expansive component. Then the function h^d_X : C → ℝ is convex in each of its d distinct ℝ^{D+1}-valued arguments. Thus, h^d_X is Lipschitz-continuous on C.

Proof See Theorem 6.9(1,4) in [12]. □

For measurable entropy, we can say much more. Recall that a d-linear form is a function ω : ℝ^{(D+1)·d} → ℝ which is linear in each of its d distinct ℝ^{D+1}-valued arguments and antisymmetric.

Theorem 81 Let X ⊆ A^{ℤ^{D+1}} be a subshift and let μ ∈ Meas(X; σ). Suppose d := dim(μ) ∈ ℕ, and let C ⊆ Expans(X; d) be an expansive component for X. Then there is a d-linear form ω : ℝ^{(D+1)·d} → ℝ such that h^d_μ agrees with ω on C.

Proof Theorem 6.16 in [12]. □

If H̄^d_μ ≠ 0, then Theorem 81 means that there is an orthogonal (D+1−d)-frame W := (w⃗_{d+1}, …, w⃗_{D+1}) (transversal to all frames in C) such that, for any d-frame V := (v⃗₁, …, v⃗_d) ∈ C,

  h^d_μ(V) = det(v⃗₁, …, v⃗_d, w⃗_{d+1}, …, w⃗_{D+1}) .  (11)

Thus, the d-plane orthogonal to {w⃗_{d+1}, …, w⃗_{D+1}} is the d-plane which maximizes H^d_μ – this is the d-plane manifesting the most rapid decay of correlation with distance.
On the other hand, span(W) is the (D+1−d)-plane along which correlations decay the most slowly. Also, if V ∈ C, then Eq. (11) implies that C cannot contain any frame spanning span(V) with reversed orientation (e.g. an odd permutation of V), because entropy is nonnegative.

Example 82 Let Φ ∈ CA(A^{ℤ^D}) be quasi-invertible, and let P be an expansive D-plane for X := X^Hist(Φ) (see Example 78). The D-frames spanning P fall into two expansive components (related by orientation-reversal); let C be the union of these two components. Let μ ∈ Meas(A^{ℤ^D}; Φ, σ), and extend μ to a σ-invariant measure on X. In this case, Theorem 81 is equivalent to Theorem 4 in [99], which says there is a vector w⃗ ∈ ℝ^{D+1} such that, for any D-frame F = (v⃗₁, …, v⃗_D) ∈ C, h^D_μ(F) = |det(v⃗₁, …, v⃗_D, w⃗)|. Thus, H^D_μ(P) is maximized when P is the hyperplane orthogonal to w⃗. Heuristically, w⃗ points in the direction of minimum correlation decay (or maximum 'causality') – the direction which could most properly be called 'time' for the MPDS (Φ, μ).
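For D = d = 1, Example 82's form h(v⃗) = |det(v⃗, w⃗)| is just a 2×2 determinant: fixing the 'time-like' vector w⃗ determines the entire directional entropy function on the expansive component. A toy sketch (the vector w here is a hypothetical input, not derived from any particular CA):

```python
def entropy_form(w):
    """Given w = (w0, w1) as in Example 82 with D = 1, return the map
    v -> |det(v, w)| giving the directional entropy on the component."""
    def h(v):
        return abs(v[0] * w[1] - v[1] * w[0])
    return h
```

Homogeneity (Proposition 70(a)) is then immediate: h(r·v⃗) = r·h(v⃗) for r > 0.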
Theorem 81 yields the following generalization of the Variational Principle:

Theorem 83 Let X ⊆ A^{ℤ^{D+1}} be a subshift and suppose d := dim(X) ∈ ℕ.
(a) If F ∈ Expans(X; d), then there exists μ ∈ Meas(X; σ) such that h^d_X(F) = h^d_μ(F).
(b) Let C ⊆ Expans(X; d) be an expansive component for X. There exists some μ ∈ Meas(X; σ) such that h^d_X = h^d_μ on C if and only if h^d_X is a d-linear form on C.

Proof Proposition 6.24 and Theorem 6.25 in [12]. □
Remark 84 (a) If G A is an abelian subgroup shift and ˚ 2 ECA(G), then XHist (˚ ) is a subgroup shift of DC1 AZ , which can be viewed as an algebraic Z DC1-action (see discussion prior to Proposition 27). In this context, the expansive subspaces of XHist (˚ ) have been completely characterized by Einsiedler et al. (see Theorem 8.4 in [33]). Furthermore, certain dynamical properties (such as positive entropy, completely positive entropy, or Bernoullicity) are common amongst all elements of each expansive component of XHist (˚ ) (see Theorem 9.8 in [33]) (this sort of ‘commonality’ within expansive components was earlier emphasized by Boyle and Lind (see [12])). If XHist (˚ ) has entropy dimension 1 (e. g. ˚ is a one-dimensional linear CA), the structure of XHist (˚ ) has been thoroughly analyzed by Einsiedler and Lind [30]. Finally, if G1 and G2 are subgroup shifts, and ˚ k 2 ECA(G k ) and k 2 Meas(G k ; ˚; ) for k D 1; 2, with dim(1 ) D dim(2 ) D 1, then Einsiedler and Ward [32] have given conditions for the measure-pre-
serving systems (G1 ; 1 ; ˚1 ; ) and (G2 ; 2 ; ˚2 ; ) to be disjoint. (b) Boyle and Lind’s ‘expansive subdynamics’ concerns expansiveness along certain directions in the space-time diagram of a CA. Recently, M. Sablik has developed a theory of directional dynamics, which explores other topological dynamical properties (such as equicontinuity and sensitivity to initial conditions) along spatiotemporal directions in a CA; see [120], Chapitre II or [121]. Future Directions and Open Problems 1. We now have a fairly good understanding of the ergodic theory of linear and/or ‘abelian’ CA. The next step is to extend these results to CA with nonlinear and/or nonabelian algebraic structures. In particular: (a) Almost all the measure rigidity results of Subsect. “Measure Rigidity in Algebraic CA” are for endomorphic CA on abelian group shifts, except for Propositions 21 and 23. Can we extend these results to CA on nonabelian group shifts or other permutative CA? (b) Likewise, the asymptotic randomization results of Subsect. “Asymptotic Randomization by Linear Cellular Automata” are almost exclusively for linear CA with scalar coefficients, and for M D Z D N E . Can we extend these results to LCA with noncommuting, matrix-valued coefficients? (The problem is: if the coefficients do not commute, then the ‘polynomial representation’ and Lucas’ theorem become inapplicable.) Also, can we obtain similar results for multiplicative CA on nonabelian groups? (See Remark 41(d).) What about other permutative CA? (See Remark 41(e).) Finally, what if M is a nonabelian group? (For example, Lind and Schmidt (unpublished) [31] have recently investigated algebraic actions of the discrete Heisenberg group.) 2. Cellular automata are often seen as models of spatially distributed computation. Meaningful ‘computation’ could possibly occur when a CA interacts with a highly structured initial configuration (e. g. 
a substitution sequence), whereas such computation is probably impossible in the roiling cauldron of noise arising from a mixing, positive entropy measure (e. g. a Bernoulli measure or Markov random field). Yet almost all the results in this article concern the interaction of CA with such mixing, positive-entropy measures. We are starting to understand the topological dynamics of CA acting on non-mixing and/or zero-entropy symbolic dynamical systems, (e. g. substitution shifts, automatic shifts, regular Toeplitz shifts, and quasisturmian shifts);
see Dynamics of Cellular Automata in Non-compact Spaces. However, almost nothing is known about the interaction of CA with the natural invariant measures on these systems. In particular:
(a) The invariant measures discussed in Sect. “Invariant Measures for CA” all have nonzero entropy (see, however, Example 67(c)). Are there any nontrivial zero-entropy measures for interesting CA?
(b) The results of Subsect. “Asymptotic Randomization by Linear Cellular Automata” all concern the asymptotic randomization of initial measures with nonzero entropy, except for Remark 41(c). Are there similar results for zero-entropy measures?
(c) Zero-entropy systems often have an appealing combinatorial description via cutting-and-stacking constructions, Bratteli diagrams, or finite state machines. Likewise, CA admit a combinatorial description (via local rules). How do these combinatorial descriptions interact?
3. As we saw in Subsect. “Domains, Defects, and Particles”, and also in Propositions 48–56, emergent defect dynamics can be a powerful tool for analyzing the measurable dynamics of CA. Defects in one-dimensional CA generally act like 'particles', and their 'kinematics' is fairly well-understood. However, in higher dimensions, defects can be much more topologically complicated (e.g. they can look like curves or surfaces), and their evolution in time is totally mysterious. Can we develop a theory of multidimensional defect dynamics?
4. Almost all the results about mixing and ergodicity in Subsect. “Mixing and Ergodicity” are for one-dimensional (mostly permutative) CA and for the uniform measure on A^ℤ. Can similar results be obtained for other CA and/or measures on A^ℤ? What about CA in A^{ℤ^D} for D ≥ 2?
5. Let μ be a (Φ, σ)-invariant measure on A^M. Proposition 66 suggests an intriguing correspondence between certain spectral properties (namely, weak mixing and discrete spectrum) for the system (A^M, μ, σ) and those for the system (A^M, μ, Φ).
Does a similar correspondence hold for other spectral properties, such as continuous spectrum, Lebesgue spectral type, spectral multiplicity, rigidity, or mild mixing? DC1 Let X 2 AZ be a subshift admitting an expansive D-plane P R DC1 . As discussed in Subsect. “Entropy Geometry and Expansive Subdynamics”, if we regard Z DC1 as ‘spacetime’, then we can treat P as a ‘space’, and a transversal direction as ‘time’. Indeed, if P is spanned by rational vectors, then the Curtis–Hedlund– Lyndon theorem implies that X is isomorphic to the D history shift of some invertible ˚ 2 CA(AZ ) acting on
Ergodic Theory of Cellular Automata
some Φ-invariant subshift Y ⊆ A^(Z^D) (where we embed Z^D in P). If P is irrational, then this is not the case; however, X still seems very much like the history shift of a spatially distributed symbolic dynamical system, closely analogous to a CA, except with a continually fluctuating ‘spatial distribution’ of state information, and perhaps with occasional nonlocal interactions. For example, Proposition 79(b)[i] implies that dim(X) ≤ D, just as for a CA. How much of the theory of invertible CA can be generalized to such systems?
7. I will finish with the hardest problem of all. Cellular automata are tractable mainly because of their homogeneity: CA are embedded in a highly regular spatial geometry (i.e. a lattice or other Cayley digraph) with the same local rule everywhere. However, many of the most interesting spatially distributed symbolic dynamical systems are not nearly this homogeneous. For example:
– CA are often proposed as models of spatially distributed physical systems. Yet in many such systems (e.g. living tissues, quantum ‘foams’), the underlying geometry is not a flat Euclidean space, but a curved manifold. A good discrete model of such a manifold can be obtained through a Voronoi tessellation of sufficient density; a realistic symbolic dynamical model would be a CA-like system defined on the dual graph of this Voronoi tessellation.
– As mentioned in question #3, defects in multidimensional CA may have the geometry of curves, surfaces, or other embedded submanifolds (possibly with varying nonzero thickness). To model the evolution of such a defect, we could treat it as a CA-like object whose underlying geometry is an (evolving) manifold, and whose local rules (although partly determined by the local rule of the original CA) are spatially heterogeneous (because they are also influenced by incoming information from the ambient ‘nondefective’ space).
– The CA-like system arising in question #6 has a D-dimensional planar geometry, but the distribution of ‘cells’ within this plane (and, presumably, the local rules between them) is constantly fluctuating.
More generally, any topological dynamical system on a Cantor space can be represented as a cellular network: a CA-like system defined on an infinite digraph, with different local rules at different nodes. Gromov [51] has generalized the Garden of Eden Theorem (Theorem 3) to this setting (see Remark 5(a)). However, other than Gromov’s work, basically nothing is known about such systems.
Can we generalize any of the theory of cellular automata to cellular networks? Is it possible to develop a nontrivial ergodic theory for such systems?
Acknowledgments

I would like to thank François Blanchard, Mike Boyle, Maurice Courbage, Doug Lind, Petr Kůrka, Servet Martínez, Kyewon Koh Park, Mathieu Sablik, Jeffrey Steif, and Marcelo Sobottka, who read draft versions of this article and made many invaluable suggestions, corrections, and comments. (Any errors which remain are mine.) To Reem.
Bibliography 1. Akin E (1993) The general topology of dynamical systems, Graduate Studies in Mathematics, vol 1. American Mathematical Society, Providence 2. Allouche JP (1999) Cellular automata, finite automata, and number theory. In: Cellular automata (Saissac, 1996), Math. Appl., vol 460. Kluwer, Dordrecht, pp 321–330 3. Allouche JP, Skordev G (2003) Remarks on permutive cellular automata. J Comput Syst Sci 67(1):174–182 4. Allouche JP, von Haeseler F, Peitgen HO, Skordev G (1996) Linear cellular automata, finite automata and Pascal’s triangle. Discret Appl Math 66(1):1–22 5. Allouche JP, von Haeseler F, Peitgen HO, Petersen A, Skordev G (1997) Automaticity of double sequences generated by one-dimensional linear cellular automata. Theoret Comput Sci 188(1-2):195–209 6. Barbé A, von Haeseler F, Peitgen HO, Skordev G (1995) Coarse-graining invariant patterns of one-dimensional twostate linear cellular automata. Internat J Bifur Chaos Appl Sci Eng 5(6):1611–1631 7. Barbé A, von Haeseler F, Peitgen HO, Skordev G (2003) Rescaled evolution sets of linear cellular automata on a cylinder. Internat J Bifur Chaos Appl Sci Eng 13(4):815–842 8. Belitsky V, Ferrari PA (2005) Invariant measures and convergence properties for cellular automaton 184 and related processes. J Stat Phys 118(3-4):589–623 9. Blanchard F, Maass A (1997) Dynamical properties of expansive one-sided cellular automata. Israel J Math 99:149–174 10. Blank M (2003) Ergodic properties of a simple deterministic traffic flow model. J Stat Phys 111(3-4):903–930 11. Boccara N, Naser J, Roger M (1991) Particle-like structures and their interactions in spatiotemporal patterns generated by one-dimensional deterministic cellular automata. Phys Rev A 44(2):866–875 12. Boyle M, Lind D (1997) Expansive subdynamics. Trans Amer Math Soc 349(1):55–102 13. Boyle M, Fiebig D, Fiebig UR (1997) A dimension group for local homeomorphisms and endomorphisms of onesided shifts of finite type. J Reine Angew Math 487:27–59 14. 
Burton R, Steif JE (1994) Non-uniqueness of measures of maximal entropy for subshifts of finite type. Ergodic Theory Dynam Syst 14(2):213–235 15. Burton R, Steif JE (1995) New results on measures of maximal entropy. Israel J Math 89(1-3):275–300 16. Cai H, Luo X (1993) Laws of large numbers for a cellular automaton. Ann Probab 21(3):1413–1426 17. Cattaneo G, Formenti E, Manzini G, Margara L (1997) On ergodic linear cellular automata over Z m . In: STACS 97 (Lübeck).
Lecture Notes in Computer Science, vol 1200. Springer, Berlin, pp 427–438
18. Cattaneo G, Formenti E, Manzini G, Margara L (2000) Ergodicity, transitivity, and regularity for linear cellular automata over Z_m. Theoret Comput Sci 233(1-2):147–164
19. Ceccherini-Silberstein T, Fiorenzi F, Scarabotti F (2004) The Garden of Eden theorem for cellular automata and for symbolic dynamical systems. In: Random walks and geometry, de Gruyter, Berlin, pp 73–108
20. Ceccherini-Silberstein TG, Machì A, Scarabotti F (1999) Amenable groups and cellular automata. Ann Inst Fourier (Grenoble) 49(2):673–685
21. Courbage M, Kamiński B (2002) On the directional entropy of Z²-actions generated by cellular automata. Studia Math 153(3):285–295
22. Courbage M, Kamiński B (2006) Space-time directional Lyapunov exponents for cellular automata. J Stat Phys 124(6):1499–1509
23. Coven EM (1980) Topological entropy of block maps. Proc Amer Math Soc 78(4):590–594
24. Coven EM, Paul ME (1974) Endomorphisms of irreducible subshifts of finite type. Math Syst Theory 8(2):167–175
25. Downarowicz T (1997) The royal couple conceals their mutual relationship: a noncoalescent Toeplitz flow. Israel J Math 97:239–251
26. Durrett R, Steif JE (1991) Some rigorous results for the Greenberg–Hastings model. J Theoret Probab 4(4):669–690
27. Durrett R, Steif JE (1993) Fixation results for threshold voter systems. Ann Probab 21(1):232–247
28. Einsiedler M (2004) Invariant subsets and invariant measures for irreducible actions on zero-dimensional groups. Bull London Math Soc 36(3):321–331
29. Einsiedler M (2005) Isomorphism and measure rigidity for algebraic actions on zero-dimensional groups. Monatsh Math 144(1):39–69
30. Einsiedler M, Lind D (2004) Algebraic Z^d-actions of entropy rank one. Trans Amer Math Soc 356(5):1799–1831 (electronic)
31. Einsiedler M, Rindler H (2001) Algebraic actions of the discrete Heisenberg group and other non-abelian groups. Aequationes Math 62(1-2):117–135
32. Einsiedler M, Ward T (2005) Entropy geometry and disjointness for zero-dimensional algebraic actions. J Reine Angew Math 584:195–214
33. Einsiedler M, Lind D, Miles R, Ward T (2001) Expansive subdynamics for algebraic Z^d-actions. Ergodic Theory Dynam Syst 21(6):1695–1729
34. Eloranta K, Nummelin E (1992) The kink of cellular automaton Rule 18 performs a random walk. J Stat Phys 69(5-6):1131–1136
35. Fagnani F, Margara L (1998) Expansivity, permutivity, and chaos for cellular automata. Theory Comput Syst 31(6):663–677
36. Ferrari PA, Maass A, Martínez S, Ney P (2000) Cesàro mean distribution of group automata starting from measures with summable decay. Ergodic Theory Dynam Syst 20(6):1657–1670
37. Finelli M, Manzini G, Margara L (1998) Lyapunov exponents versus expansivity and sensitivity in cellular automata. J Complexity 14(2):210–233
38. Fiorenzi F (2000) The Garden of Eden theorem for sofic shifts. Pure Math Appl 11(3):471–484
39. Fiorenzi F (2003) Cellular automata and strongly irreducible shifts of finite type. Theoret Comput Sci 299(1-3):477–493 40. Fiorenzi F (2004) Semi-strongly irreducible shifts. Adv Appl Math 32(3):421–438 41. Fisch R (1990) The one-dimensional cyclic cellular automaton: a system with deterministic dynamics that emulates an interacting particle system with stochastic dynamics. J Theoret Probab 3(2):311–338 42. Fisch R (1992) Clustering in the one-dimensional three-color cyclic cellular automaton. Ann Probab 20(3):1528–1548 43. Fisch R, Gravner J (1995) One-dimensional deterministic Greenberg–Hastings models. Complex Systems 9(5):329–348 44. Furstenberg H (1967) Disjointness in ergodic theory, minimal sets, and a problem in Diophantine approximation. Math Systems Theory 1:1–49 45. Gilman RH (1987) Classes of linear automata. Ergodic Theory Dynam Syst 7(1):105–118 46. Gottschalk W (1973) Some general dynamical notions. In: Recent advances in topological dynamics (Proc Conf Topological Dynamics, Yale Univ, New Haven, 1972; in honor of Gustav Arnold Hedlund), Lecture Notes in Math, vol 318. Springer, Berlin, pp 120–125 47. Grassberger P (1984) Chaos and diffusion in deterministic cellular automata. Phys D 10(1-2):52–58, cellular automata (Los Alamos, 1983) 48. Grassberger P (1984) New mechanism for deterministic diffusion. Phys Rev A 28(6):3666–3667 49. Grigorchuk RI (1984) Degrees of growth of finitely generated groups and the theory of invariant means. Izv Akad Nauk SSSR Ser Mat 48(5):939–985 50. Gromov M (1981) Groups of polynomial growth and expanding maps. Inst Hautes Études Sci Publ Math 53:53–73 51. Gromov M (1999) Endomorphisms of symbolic algebraic varieties. J Eur Math Soc (JEMS) 1(2):109–197 52. von Haeseler F, Peitgen HO, Skordev G (1992) Pascal’s triangle, dynamical systems and attractors. Ergodic Theory Dynam Systems 12(3):479–486 53. von Haeseler F, Peitgen HO, Skordev G (1993) Cellular automata, matrix substitutions and fractals. 
Ann Math Artificial Intelligence 8(3-4):345–362, theorem proving and logic programming (1992)
54. von Haeseler F, Peitgen HO, Skordev G (1995) Global analysis of self-similarity features of cellular automata: selected examples. Phys D 86(1-2):64–80, chaos, order and patterns: aspects of nonlinearity – the “gran finale” (Como, 1993)
55. von Haeseler F, Peitgen HO, Skordev G (1995) Multifractal decompositions of rescaled evolution sets of equivariant cellular automata. Random Comput Dynam 3(1-2):93–119
56. von Haeseler F, Peitgen HO, Skordev G (2001) Self-similar structure of rescaled evolution sets of cellular automata. I. Internat J Bifur Chaos Appl Sci Eng 11(4):913–926
57. von Haeseler F, Peitgen HO, Skordev G (2001) Self-similar structure of rescaled evolution sets of cellular automata. II. Internat J Bifur Chaos Appl Sci Eng 11(4):927–941
58. Hedlund GA (1969) Endomorphisms and automorphisms of the shift dynamical system. Math Syst Theory 3:320–375
59. Hilmy H (1936) Sur les centres d’attraction minimaux des systèmes dynamiques. Compositio Mathematica 3:227–238
60. Host B (1995) Nombres normaux, entropie, translations. Israel J Math 91(1-3):419–428
61. Host B, Maass A, Martínez S (2003) Uniform Bernoulli measure
in dynamics of permutative cellular automata with algebraic local rules. Discret Contin Dyn Syst 9(6):1423–1446
62. Hurd LP, Kari J, Culik K (1992) The topological entropy of cellular automata is uncomputable. Ergodic Theory Dynam Syst 12(2):255–265
63. Hurley M (1990) Attractors in cellular automata. Ergodic Theory Dynam Syst 10(1):131–140
64. Hurley M (1990) Ergodic aspects of cellular automata. Ergodic Theory Dynam Syst 10(4):671–685
65. Hurley M (1991) Varieties of periodic attractor in cellular automata. Trans Amer Math Soc 326(2):701–726
66. Hurley M (1992) Attractors in restricted cellular automata. Proc Amer Math Soc 115(2):563–571
67. Jen E (1988) Linear cellular automata and recurring sequences in finite fields. Comm Math Phys 119(1):13–28
68. Johnson A, Rudolph DJ (1995) Convergence under ×q of ×p invariant measures on the circle. Adv Math 115(1):117–140
69. Johnson ASA (1992) Measures on the circle invariant under multiplication by a nonlacunary subsemigroup of the integers. Israel J Math 77(1-2):211–240
70. Kitchens B (2000) Dynamics of Z^d actions on Markov subgroups. In: Topics in symbolic dynamics and applications (Temuco, 1997), London Math Soc Lecture Note Ser, vol 279. Cambridge Univ Press, Cambridge, pp 89–122
71. Kitchens B, Schmidt K (1989) Automorphisms of compact groups. Ergodic Theory Dynam Syst 9(4):691–735
72. Kitchens B, Schmidt K (1992) Markov subgroups of (Z/2Z)^(Z²). In: Symbolic dynamics and its applications (New Haven, 1991), Contemp Math, vol 135. Amer Math Soc, Providence, pp 265–283
73. Kitchens BP (1987) Expansive dynamics on zero-dimensional groups. Ergodic Theory Dynam Syst 7(2):249–261
74. Kleveland R (1997) Mixing properties of one-dimensional cellular automata. Proc Amer Math Soc 125(6):1755–1766
75. Kůrka P (1997) Languages, equicontinuity and attractors in cellular automata. Ergodic Theory Dynam Syst 17(2):417–433
76. Kůrka P (2001) Topological dynamics of cellular automata. In: Codes, systems, and graphical models (Minneapolis, 1999), IMA Vol Math Appl, vol 123. Springer, New York, pp 447–485
77. Kůrka P (2003) Cellular automata with vanishing particles. Fund Inform 58(3-4):203–221
78. Kůrka P (2005) On the measure attractor of a cellular automaton. Discret Contin Dyn Syst (suppl.):524–535
79. Kůrka P, Maass A (2000) Limit sets of cellular automata associated to probability measures. J Stat Phys 100(5-6):1031–1047
80. Kůrka P, Maass A (2002) Stability of subshifts in cellular automata. Fund Inform 52(1-3):143–155, special issue on cellular automata
81. Lind D, Marcus B (1995) An introduction to symbolic dynamics and coding. Cambridge University Press, Cambridge
82. Lind DA (1984) Applications of ergodic theory and sofic systems to cellular automata. Phys D 10(1-2):36–44, cellular automata (Los Alamos, 1983)
83. Lind DA (1987) Entropies of automorphisms of a topological Markov shift. Proc Amer Math Soc 99(3):589–595
84. Lucas E (1878) Sur les congruences des nombres eulériens et les coefficients différentiels des fonctions trigonométriques suivant un module premier. Bull Soc Math France 6:49–54
85. Lyons R (1988) On measures simultaneously 2- and 3-invariant. Israel J Math 61(2):219–224 86. Maass A (1996) Some dynamical properties of one-dimensional cellular automata. In: Dynamics of complex interacting systems (Santiago, 1994), Nonlinear Phenom. Complex Systems, vol 2. Kluwer, Dordrecht, pp 35–80 87. Maass A, Martínez S (1998) On Cesàro limit distribution of a class of permutative cellular automata. J Stat Phys 90(1–2): 435–452 88. Maass A, Martínez S (1999) Time averages for some classes of expansive one-dimensional cellular automata. In: Cellular automata and complex systems (Santiago, 1996), Nonlinear Phenom. Complex Systems, vol 3. Kluwer, Dordrecht, pp 37– 54 89. Maass A, Martínez S, Pivato M, Yassawi R (2006) Asymptotic randomization of subgroup shifts by linear cellular automata. Ergodic Theory Dynam Syst 26(4):1203–1224 90. Maass A, Martínez S, Pivato M, Yassawi R (2006) Attractiveness of the Haar measure for the action of linear cellular automata in abelian topological Markov chains. In: Dynamics and Stochastics: Festschrift in honour of Michael Keane, vol 48 of, Lecture Notes Monograph Series of the IMS, vol 48. Institute for Mathematical Statistics, Beachwood, pp 100–108 91. Maass A, Martínez S, Sobottka M (2006) Limit measures for affine cellular automata on topological Markov subgroups. Nonlinearity 19(9):2137–2147, http://stacks.iop.org/ 0951-7715/19/2137 92. Machì A, Mignosi F (1993) Garden of Eden configurations for cellular automata on Cayley graphs of groups. SIAM J Discret Math 6(1):44–56 93. Maruoka A, Kimura M (1976) Condition for injectivity of global maps for tessellation automata. Inform Control 32(2):158–162 94. Mauldin RD, Skordev G (2000) Random linear cellular automata: fractals associated with random multiplication of polynomials. Japan J Math (NS) 26(2):381–406 95. Meester R, Steif JE (2001) Higher-dimensional subshifts of finite type, factor maps and measures of maximal entropy. Pacific J Math 200(2):497–510 96. 
Milnor J (1985) Correction and remarks: “On the concept of attractor”. Comm Math Phys 102(3):517–519 97. Milnor J (1985) On the concept of attractor. Comm Math Phys 99(2):177–195 98. Milnor J (1986) Directional entropies of cellular automatonmaps. In: Disordered systems and biological organization (Les Houches, 1985), NATO Adv Sci Inst Ser F Comput Syst Sci, vol 20. Springer, Berlin, pp 113–115 99. Milnor J (1988) On the entropy geometry of cellular automata. Complex Syst 2(3):357–385 100. Miyamoto M (1979) An equilibrium state for a one-dimensional life game. J Math Kyoto Univ 19(3):525–540 101. Moore C (1997) Quasilinear cellular automata. Phys D 103(1– 4):100–132, lattice dynamics (Paris, 1995) 102. Moore C (1998) Predicting nonlinear cellular automata quickly by decomposing them into linear ones. Phys D 111(1– 4):27–41 103. Moore EF (1963) Machine models of self reproduction. Proc Symp Appl Math 14:17–34 104. Myhill J (1963) The converse of Moore’s Garden-of-Eden theorem. Proc Amer Math Soc 14:685–686
105. Nasu M (1995) Textile systems for endomorphisms and automorphisms of the shift. Mem Amer Math Soc 114(546): viii+215 106. Nasu M (2002) The dynamics of expansive invertible onesided cellular automata. Trans Amer Math Soc 354(10):4067–4084 (electronic) 107. Park KK (1995) Continuity of directional entropy for a class of Z2 -actions. J Korean Math Soc 32(3):573–582 108. Park KK (1996) Entropy of a skew product with a Z2 -action. Pacific J Math 172(1):227–241 109. Park KK (1999) On directional entropy functions. Israel J Math 113:243–267 110. Parry W (1964) Intrinsic Markov chains. Trans Amer Math Soc 112:55–66 111. Pivato M (2003) Multiplicative cellular automata on nilpotent groups: structure, entropy, and asymptotics. J Stat Phys 110(1-2):247–267 112. Pivato M (2005) Cellular automata versus quasisturmian shifts. Ergodic Theory Dynam Syst 25(5):1583–1632 113. Pivato M (2005) Invariant measures for bipermutative cellular automata. Discret Contin Dyn Syst 12(4):723–736 114. Pivato M (2007) Spectral domain boundaries cellular automata. Fundamenta Informaticae 77(special issue), available at: http://arxiv.org/abs/math.DS/0507091 115. Pivato M (2008) Module shifts and measure rigidity in linear cellular automata. Ergodic Theory Dynam Syst (to appear) 116. Pivato M, Yassawi R (2002) Limit measures for affine cellular automata. Ergodic Theory Dynam Syst 22(4):1269–1287 117. Pivato M, Yassawi R (2004) Limit measures for affine cellular automata. II. Ergodic Theory Dynam Syst 24(6):1961–1980 118. Pivato M, Yassawi R (2006) Asymptotic randomization of sofic shifts by linear cellular automata. Ergodic Theory Dynam Syst 26(4):1177–1201 119. Rudolph DJ (1990) 2 and 3 invariant measures and entropy. Ergodic Theory Dynam Syst 10(2):395–406 120. Sablik M (2006) Étude de l’action conjointe d’un automate cellulaire et du décalage: Une approche topologique et ergodique. Ph D thesis, Université de la Méditerranée, Faculté des science de Luminy, Marseille 121. 
Sablik M (2008) Directional dynamics for cellular automata: A sensitivity to initial conditions approach. submitted to Theor Comput Sci 400(1–3):1–18 122. Sablik M (2008) Measure rigidity for algebraic bipermutative cellular automata. Ergodic Theory Dynam Syst 27(6):1965– 1990 123. Sato T (1997) Ergodicity of linear cellular automata over Zm . Inform Process Lett 61(3):169–172 124. Schmidt K (1995) Dynamical systems of algebraic origin, Progress in Mathematics, vol 128. Birkhäuser, Basel 125. Shereshevsky MA (1992) Ergodic properties of certain surjective cellular automata. Monatsh Math 114(3-4):305–316 126. Shereshevsky MA (1992) Lyapunov exponents for one-dimensional cellular automata. J Nonlinear Sci 2(1):1–8 127. Shereshevsky MA (1993) Expansiveness, entropy and polynomial growth for groups acting on subshifts by automorphisms. Indag Math (NS) 4(2):203–210 128. Shereshevsky MA (1996) On continuous actions commuting with actions of positive entropy. Colloq Math 70(2):265–269
129. Shereshevsky MA (1997) K-property of permutative cellular automata. Indag Math (NS) 8(3):411–416 130. Shereshevsky MA, Afra˘ımovich VS (1992/93) Bipermutative cellular automata are topologically conjugate to the onesided Bernoulli shift. Random Comput Dynam 1(1):91–98 131. Shirvani M, Rogers TD (1991) On ergodic one-dimensional cellular automata. Comm Math Phys 136(3):599–605 132. Silberger S (2005) Subshifts of the three dot system. Ergodic Theory Dynam Syst 25(5):1673–1687 133. Smillie J (1988) Properties of the directional entropy function for cellular automata. In: Dynamical systems (College Park, 1986–87), Lecture Notes in Math, vol 1342. Springer, Berlin, pp 689–705 134. Sobottka M (2005) Representación y aleatorización en sistemas dinámicos de tipo algebraico. Ph D thesis, Universidad de Chile, Facultad de ciencias físicas y matemáticas, Santiago 135. Sobottka M (2007) Topological quasi-group shifts. Discret Continuous Dyn Syst 17(1):77–93 136. Sobottka M (to appear 2007) Right-permutative cellular automata on topological Markov chains. Discret Continuous Dyn Syst. Available at http://arxiv.org/abs/math/0603326 137. Steif JE (1994) The threshold voter automaton at a critical point. Ann Probab 22(3):1121–1139 138. Takahashi S (1990) Cellular automata and multifractals: dimension spectra of linear cellular automata. Phys D 45(1– 3):36–48, cellular automata: theory and experiment (Los Alamos, NM, 1989) 139. Takahashi S (1992) Self-similarity of linear cellular automata. J Comput Syst Sci 44(1):114–140 140. Takahashi S (1993) Cellular automata, fractals and multifractals: space-time patterns and dimension spectra of linear cellular automata. In: Chaos in Australia (Sydney, 1990), World Sci Publishing, River Edge, pp 173–195 141. Tisseur P (2000) Cellular automata and Lyapunov exponents. Nonlinearity 13(5):1547–1560 142. Walters P (1982) An introduction to ergodic theory, Graduate Texts in Mathematics, vol 79. Springer, New York 143. 
Weiss B (2000) Sofic groups and dynamical systems. Sankhy¯a Ser A 62(3):350–359 144. Willson SJ (1975) On the ergodic theory of cellular automata. Math Syst Theory 9(2):132–141 145. Willson SJ (1984) Cellular automata can generate fractals. Discret Appl Math 8(1):91–99 146. Willson SJ (1984) Growth rates and fractional dimensions in cellular automata. Phys D 10(1-2):69–74, cellular automata (Los Alamos, 1983) 147. Willson SJ (1986) A use of cellular automata to obtain families of fractals. In: Chaotic dynamics and fractals (Atlanta, 1985), Notes Rep Math Sci Eng, vol 2. Academic Press, Orlando, pp 123–140 148. Willson SJ (1987) Computing fractal dimensions for additive cellular automata. Phys D 24(1-3):190–206 149. Willson SJ (1987) The equality of fractional dimensions for certain cellular automata. Phys D 24(1-3):179–189 150. Wolfram S (1985) Twenty problems in the theory of cellular automata. Physica Scripta 9:1–35 151. Wolfram S (1986) Theory and Applications of Cellular Automata. World Scientific, Singapore
Evolutionary Game Theory

William H. Sandholm
Department of Economics, University of Wisconsin, Madison, USA

Article Outline
Glossary
Definition of the Subject
Introduction
Normal Form Games
Static Notions of Evolutionary Stability
Population Games
Revision Protocols
Deterministic Dynamics
Stochastic Dynamics
Local Interaction
Applications
Future Directions
Acknowledgments
Bibliography
Glossary

Deterministic evolutionary dynamic A deterministic evolutionary dynamic is a rule for assigning population games to ordinary differential equations describing the evolution of behavior in the game. Deterministic evolutionary dynamics can be derived from revision protocols, which describe choices (in economic settings) or births and deaths (in biological settings) on an agent-by-agent basis.

Evolutionarily stable strategy (ESS) In a symmetric normal form game, an evolutionarily stable strategy is a (possibly mixed) strategy with the following property: a population in which all members play this strategy is resistant to invasion by a small group of mutants who play an alternative mixed strategy.

Normal form game A normal form game is a strategic interaction in which each of n players chooses a strategy and then receives a payoff that depends on all players’ choices of strategy. In a symmetric two-player normal form game, the two players choose from the same set of strategies, and payoffs depend only on a player’s own choice and his opponent’s choice, not on a player’s identity.

Population game A population game is a strategic interaction among one or more large populations of agents. Each agent’s payoff depends on his own choice of strategy and the distribution of others’ choices of strategies. One can generate a population game from a normal form game by introducing random matching; however, many population games of interest, including congestion games, do not take this form.

Replicator dynamic The replicator dynamic is a fundamental deterministic evolutionary dynamic for games. Under this dynamic, the percentage growth rate of the mass of agents using each strategy is proportional to the excess of the strategy’s payoff over the population’s average payoff. The replicator dynamic can be interpreted biologically as a model of natural selection, and economically as a model of imitation.

Revision protocol A revision protocol describes both the timing and the results of agents’ decisions about how to behave in a repeated strategic interaction. Revision protocols are used to derive both deterministic and stochastic evolutionary dynamics for games.

Stochastically stable state Game-theoretic models of stochastic evolution in games are often described by irreducible Markov processes. In these models, a population state is stochastically stable if it retains positive weight in the process’s stationary distribution as the level of noise in agents’ choices approaches zero, or as the population size approaches infinity.
Definition of the Subject

Evolutionary game theory studies the behavior of large populations of agents who repeatedly engage in strategic interactions. Changes in behavior in these populations are driven either by natural selection via differences in birth and death rates, or by the application of myopic decision rules by individual agents.

The birth of evolutionary game theory is marked by the publication of a series of papers by mathematical biologist John Maynard Smith [137,138,140]. Maynard Smith adapted the methods of traditional game theory [151,215], which were created to model the behavior of rational economic agents, to the context of biological natural selection. He proposed his notion of an evolutionarily stable strategy (ESS) as a way of explaining the existence of ritualized animal conflict.

Maynard Smith’s equilibrium concept was provided with an explicit dynamic foundation through a differential equation model introduced by Taylor and Jonker [205]. Schuster and Sigmund [189], following Dawkins [58], dubbed this model the replicator dynamic, and recognized the close links between this game-theoretic dynamic and dynamics studied much earlier in population ecology [132,214] and population genetics [73]. By the 1980s, evolutionary game theory was a well-developed and firmly established modeling framework in biology [106].
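The replicator dynamic attributed above to Taylor and Jonker can be illustrated with a short numerical sketch. The Python code below is illustrative only: the payoff matrix, initial state, step size, and horizon are assumptions chosen for the example, not taken from the text.

```python
import numpy as np

def replicator_step(x, A, dt=0.01):
    """One Euler step of the replicator dynamic: each strategy's share
    grows at a rate equal to its payoff minus the population average."""
    f = A @ x          # f[i]: payoff to pure strategy i against state x
    avg = x @ f        # population-average payoff
    return x + dt * x * (f - avg)

# Illustrative 2x2 coordination game (an assumption for this sketch).
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
x = np.array([0.6, 0.4])   # initial population state
for _ in range(2000):
    x = replicator_step(x, A)
# x remains a probability vector, and from this starting point the
# population share of strategy 0 grows toward 1.
```

Because the per-step change sums to zero, the state stays on the simplex up to floating-point error. Reading the growth rates as differences in reproductive success gives the natural-selection interpretation; reading them as imitation of successful opponents gives the economic one.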
Towards the end of this period, economists realized the value of the evolutionary approach to game theory in social science contexts, both as a method of providing foundations for the equilibrium concepts of traditional game theory, and as a tool for selecting among equilibria in games that admit more than one. Especially in its early stages, work by economists in evolutionary game theory hewed closely to the interpretation set out by biologists, with the notion of ESS and the replicator dynamic understood as modeling natural selection in populations of agents genetically programmed to behave in specific ways. But it soon became clear that models of essentially the same form could be used to study the behavior of populations of active decision makers [50,76,133,149,167,191]. Indeed, the two approaches sometimes lead to identical models: the replicator dynamic itself can be understood not only as a model of natural selection, but also as one of imitation of successful opponents [35,188,216].

While the majority of work in evolutionary game theory has been undertaken by biologists and economists, closely related models have been applied to questions in a variety of fields, including transportation science [143,150,173,175,177,197], computer science [72,173,177], and sociology [34,62,126,225,226]. Some paradigms from evolutionary game theory are close relatives of certain models from physics, and so have attracted the attention of workers in this field [141,201,202,203]. All told, evolutionary game theory provides a common ground for workers from a wide range of disciplines.

Introduction

This article offers a broad survey of the theory of evolution in games. Section “Normal Form Games” introduces normal form games, a simple and commonly studied model of strategic interaction. Section “Static Notions of Evolutionary Stability” presents the notion of an evolutionarily stable strategy, a static definition of stability proposed for this normal form context.
Section “Population Games” defines population games, a general model of strategic interaction in large populations. Section “Revision Protocols” offers the notion of a revision protocol, an individual-level description of behavior used to define the population-level processes of central concern. Most of the article concentrates on these population-level processes: Section “Deterministic Dynamics” considers deterministic differential equation models of game dynamics; Section “Stochastic Dynamics” studies stochastic models of evolution based on Markov processes; and Sect. “Local Interaction” presents deterministic and
stochastic models of local interaction. Section “Applications” records a range of applications of evolutionary game theory, and Sect. “Future Directions” suggests directions for future research. Finally, Sect. “Bibliography” offers an extensive list of primary references.

Normal Form Games

In this section, we introduce a very simple model of strategic interaction: the symmetric two-player normal form game. We then define some of the standard solution concepts used to analyze this model, and provide some examples of games and their equilibria. With this background in place, we turn in subsequent sections to evolutionary analysis of behavior in games.

In a symmetric two-player normal form game, each of the two players chooses a (pure) strategy from the finite set S, which we write generically as S = {1, …, n}. The game’s payoffs are described by the matrix A ∈ R^(n×n). Entry A_ij is the payoff a player obtains when he chooses strategy i and his opponent chooses strategy j; this payoff does not depend on whether the player in question is called player 1 or player 2.

The fundamental solution concept of noncooperative game theory is Nash equilibrium [151]. We say that the pure strategy i ∈ S is a symmetric Nash equilibrium of A if

A_ii ≥ A_ji for all j ∈ S. (1)
Thus, if his opponent chooses a symmetric Nash equilibrium strategy i, a player can do no better than to choose i himself. A stronger requirement on strategy i demands that it be superior to all other strategies regardless of the opponent’s choice: A i k > A jk
for all j; k 2 S:
(2)
When condition (2) holds, we say that strategy i is strictly dominant in A.

Example 1 The game below, with strategies C (“cooperate”) and D (“defect”), is an instance of a Prisoner’s Dilemma:

        C  D
   C (  2  0 )
   D (  3  1 )

(To interpret this game, note that A_CD = 0 is the payoff to cooperating when one’s opponent defects.) Since 1 > 0, defecting is a symmetric Nash equilibrium of this game. In fact, since 3 > 2 and 1 > 0, defecting is even a strictly dominant strategy. But since 2 > 1, both players are better off when both cooperate than when both defect.

In many instances, it is natural to allow players to choose mixed (or randomized) strategies. When a player chooses a mixed strategy from the simplex X = {x ∈ ℝⁿ₊ : Σ_{i∈S} x_i = 1}, his behavior is stochastic: he commits to playing pure strategy i ∈ S with probability x_i. When either player makes a randomized choice, we evaluate payoffs by taking expectations: a player choosing mixed strategy x against an opponent choosing mixed strategy y garners an expected payoff of

x′Ay = Σ_{i∈S} Σ_{j∈S} x_i A_ij y_j .   (3)
In biological contexts, payoffs are fitnesses, and represent levels of reproductive success relative to some baseline level; Eq. (3) reflects the idea that in a large population, expected reproductive success is what matters. In economic contexts, payoffs are utilities: a numerical representation of players’ preferences under which Eq. (3) captures players’ choices between uncertain outcomes [215].

The notion of Nash equilibrium extends easily to allow for mixed strategies. Mixed strategy x is a symmetric Nash equilibrium of A if

x′Ax ≥ y′Ax  for all y ∈ X.   (4)
In words, x is a symmetric Nash equilibrium if its expected payoff against itself is at least as high as the expected payoff obtainable by any other strategy y against x. Note that we can represent the pure strategy i ∈ S using the mixed strategy e_i ∈ X, the ith standard basis vector in ℝⁿ. If we do so, then definition (4) restricted to such strategies is equivalent to definition (1). We illustrate these ideas with a few examples.

Example 2 Consider the Stag Hunt game:

        H  S
   H (  h  h )
   S (  0  s )

Each player in the Stag Hunt game chooses between hunting hare (H) and hunting stag (S). A player who hunts hare always catches one, obtaining a payoff of h > 0. But hunting stag is only successful if both players do so, in which case each obtains a payoff of s > h. Hunting stag is potentially more profitable than hunting hare, but requires a coordinated effort.

In the Stag Hunt game, H and S (or, equivalently, e_H and e_S) are symmetric pure Nash equilibria. This game
also has a symmetric mixed Nash equilibrium, namely x* = (x*_H, x*_S) = ((s − h)/s, h/s). If a player’s opponent chooses this mixed strategy, the player’s expected payoff is h whether he chooses H, S, or any mixture between the two; in particular, x* is a best response against itself.

To distinguish between the two pure equilibria, we might focus on the one that is payoff dominant, in that it achieves the higher joint payoff. Alternatively, we can concentrate on the risk dominant equilibrium [89], which utilizes the strategy preferred by a player who thinks his opponent is equally likely to choose either option (that is, against an opponent playing mixed strategy (x_H, x_S) = (1/2, 1/2)). In the present case, since s > h, equilibrium S is payoff dominant. Which strategy is risk dominant depends on further information about payoffs. If s > 2h, then S is risk dominant. But if s < 2h, H is risk dominant: evidently, payoff dominance and risk dominance need not agree.

Example 3 In the Hawk–Dove game [139], the two players are animals contesting a resource of value v > 0. The players choose between two strategies: display (D) or escalate (E). If both display, the resource is split; if one escalates and the other displays, the escalator claims the entire resource; if both escalate, then each player is equally likely to claim the entire resource or to be injured, suffering a cost of c > v in the latter case. The payoff matrix for the Hawk–Dove game is therefore
        D        E
   D (  v/2      0         )
   E (  v        (v − c)/2 )
This game has no symmetric Nash equilibrium in pure strategies. It does, however, admit the symmetric mixed equilibrium x* = (x*_D, x*_E) = ((c − v)/c, v/c). (In fact, it can be shown that every symmetric normal form game admits at least one symmetric mixed Nash equilibrium [151].) In this example, our focus on symmetric behavior may seem odd: rather than randomizing symmetrically, it seems more natural for players to follow an asymmetric Nash equilibrium in which one player escalates and the other displays. But the symmetric equilibrium is the most relevant one for understanding natural selection in populations whose members are randomly matched in pairwise contests – see Sect. “Static Notions of Evolutionary Stability”.

Example 4 Consider the class of Rock–Paper–Scissors games:
        R    P    S
   R (  0   −l    w )
   P (  w    0   −l )
   S ( −l    w    0 )

Here w > 0 is the benefit of winning the match and l > 0 the cost of losing; ties are worth 0 to both players. We call this game good RPS if w > l, so that the benefit of winning the match exceeds the cost of losing, standard RPS if w = l, and bad RPS if w < l. Regardless of the values of w and l, the unique symmetric Nash equilibrium of this game, x* = (x*_R, x*_P, x*_S) = (1/3, 1/3, 1/3), requires uniform randomization over the three strategies.

Static Notions of Evolutionary Stability

In introducing game-theoretic ideas to the study of animal behavior, Maynard Smith advanced this fundamental principle: that the evolutionary success of (the genes underlying) a given behavioral trait can depend on the prevalences of all traits. It follows that natural selection among the traits can be modeled as random matching of animals to play normal form games [137,138,139,140]. Working in this vein, Maynard Smith offered a stability concept for populations of animals sharing a common behavioral trait – that of playing a particular mixed strategy in the game at hand. Maynard Smith’s concept of evolutionary stability, influenced by the work of Hamilton [87] on the evolution of sex ratios, defines such a population as stable if it is resistant to invasion by a small group of mutants carrying a different trait.

Suppose that a large population of animals is randomly matched to play the symmetric normal form game A. We call mixed strategy x ∈ X an evolutionarily stable strategy (ESS) if

x′A((1 − ε)x + εy) > y′A((1 − ε)x + εy)  for all ε ≤ ε̄(y) and y ≠ x.   (5)

To interpret condition (5), imagine that a population of animals programmed to play mixed strategy x is invaded by a group of mutants programmed to play the alternative mixed strategy y. Equation (5) requires that regardless of the choice of y, an incumbent’s expected payoff from a random match in the post-entry population exceeds that of a mutant so long as the size of the invading group is sufficiently small.

The definition of ESS above can also be expressed as a combination of two conditions:

x′Ax ≥ y′Ax  for all y ∈ X;   (4)

For all y ≠ x, [x′Ax = y′Ax] implies that [x′Ay > y′Ay].   (6)

Condition (4) is familiar: it requires that the incumbent strategy x be a best response to itself, and so is none other than our definition of symmetric Nash equilibrium. Condition (6) requires that if a mutant strategy y is an alternative best response against the incumbent strategy x, then the incumbent earns a higher payoff against the mutant than the mutant earns against itself.

A less demanding notion of stability can be obtained by allowing the incumbent and the mutant in condition (6) to perform equally well against the mutant:

For all y ∈ X, [x′Ax = y′Ax] implies that [x′Ay ≥ y′Ay].   (7)
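Conditions (4) and (6) can be probed numerically by sampling mutant strategies y from the simplex. The sketch below (plain Python; the sampling scheme, trial count, and tolerances are implementation choices, and a finite sample gives evidence rather than a proof) applies the test to the Rock–Paper–Scissors games of Example 4, whose unique symmetric equilibrium is x* = (1/3, 1/3, 1/3):

```python
import random

def payoff(x, A, y):
    """x'Ay for a 3x3 payoff matrix A."""
    return sum(x[i] * A[i][j] * y[j] for i in range(3) for j in range(3))

def rps_matrix(w, l):
    """Rock-Paper-Scissors payoffs: w = benefit of winning, l = cost of losing."""
    return [[0, -l, w], [w, 0, -l], [-l, w, 0]]

def looks_like_ess(x, A, n_trials=2000, tol=1e-9):
    """Sample mutants y and test conditions (4) and (6) against each one."""
    random.seed(0)
    for _ in range(n_trials):
        u = sorted([0.0, random.random(), random.random(), 1.0])
        y = [u[k + 1] - u[k] for k in range(3)]      # uniform draw from the simplex
        if payoff(y, A, x) > payoff(x, A, x) + tol:  # condition (4) violated
            return False
        if abs(payoff(y, A, x) - payoff(x, A, x)) <= tol:  # y is an alternative best
            if payoff(x, A, y) <= payoff(y, A, y) + tol:   # response, so (6) needs
                return False                               # x'Ay > y'Ay strictly
    return True

x_star = [1/3, 1/3, 1/3]
print(looks_like_ess(x_star, rps_matrix(2, 1)))  # good RPS (w > l)
print(looks_like_ess(x_star, rps_matrix(1, 2)))  # bad RPS (w < l)
```

The good-RPS check passes while the bad-RPS check fails at the first sampled mutant, consistent with Example 8’s classification.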
If x satisfies conditions (4) and (7), it is called a neutrally stable strategy (NSS) [139].

Let us apply these stability notions to the games introduced in the previous section. Since every ESS and NSS must be a Nash equilibrium, we need only consider whether the Nash equilibria of these games satisfy the additional stability conditions, (6) and (7).

Example 5 In the Prisoner’s Dilemma game (Example 1), the dominant strategy D is an ESS.

Example 6 In the Stag Hunt game (Example 2), each pure Nash equilibrium is an ESS. But the mixed equilibrium (x*_H, x*_S) = ((s − h)/s, h/s) is not an ESS: if mutants playing either pure strategy enter the population, they earn a higher payoff than the incumbents in the post-entry population.

Example 7 In the Hawk–Dove game (Example 3), the mixed equilibrium (x*_D, x*_E) = ((c − v)/c, v/c) is an ESS. Maynard Smith used this and other examples to explain the existence of ritualized fighting in animals. While an animal who escalates always obtains the resource when matched with an animal who merely displays, a population of escalators is unstable: it can be invaded by a group of mutants who display, or who merely escalate less often.

Example 8 In Rock–Paper–Scissors games (Example 4), whether the mixed equilibrium x* = (1/3, 1/3, 1/3) is evolutionarily stable depends on the relative payoffs to winning and losing a match. In good RPS (w > l), x* is an ESS; in standard RPS (w = l), x* is an NSS but not an ESS, while in bad RPS (w < l), x* is neither an ESS nor an NSS. The last case shows that neither evolutionarily nor neutrally stable strategies need exist in a given game.

The definition of an evolutionarily stable strategy has been extended to cover a wide range of strategic settings, and has been generalized in a variety of directions. Prominent
among these developments are set-valued versions of ESS: in rough terms, these concepts consider a set of mixed strategies Y ⊆ X to be stable if no population playing a strategy in the set can be invaded successfully by a population of mutants playing a strategy outside the set. [95] provides a thorough survey of the first 15 years of research on ESS and related notions of stability; key references on set-valued evolutionary solution concepts include [15,199,206].

Maynard Smith’s notion of ESS attempts to capture the dynamic process of natural selection using a static definition. The advantage of this approach is that his definition is often easy to check in applications. Still, more convincing models of natural selection should be explicitly dynamic models, building on techniques from the theories of dynamical systems and stochastic processes. Indeed, this thoroughgoing approach can help us understand whether and when the ESS concept captures the notion of robustness to invasion in a satisfactory way.

The remainder of this article concerns explicitly dynamic models of behavior. In addition to being dynamic rather than static, these models will differ from the one considered in this section in two other important ways as well. First, rather than looking at populations whose members all play a particular mixed strategy, the dynamic models consider populations in which different members play different pure strategies. Second, instead of maintaining a purely biological point of view, our dynamic models will be equally well-suited to studying behavior in animal and human populations.

Population Games

Population games provide a simple and general framework for studying strategic interactions in large populations whose members play pure strategies. The simplest population games are generated by random matching in normal form games, but the population game framework allows for interactions of a more intricate nature.
We focus here on games played by a single population (i.e., games in which all agents play equivalent roles). We suppose that there is a unit mass of agents, each of whom chooses a pure strategy from the set S = {1, …, n}. The aggregate behavior of these agents is described by a population state x ∈ X, with x_j representing the proportion of agents choosing pure strategy j. We identify a population game with a continuous vector-valued payoff function F : X → ℝⁿ. The scalar F_i(x) represents the payoff to strategy i when the population state is x.

Population state x is a Nash equilibrium of F if no agent can improve his payoff by unilaterally switching
strategies. More explicitly, x is a Nash equilibrium if

x_i > 0 implies that F_i(x) ≥ F_j(x) for all j ∈ S.   (8)
Example 9 Suppose that the unit mass of agents are randomly matched to play the symmetric normal form game A. At population state x, the (expected) payoff to strategy i is the linear function F_i(x) = Σ_{j∈S} A_ij x_j; the payoffs to all strategies can be expressed concisely as F(x) = Ax. It is easy to verify that x is a Nash equilibrium of the population game F if and only if x is a symmetric Nash equilibrium of the symmetric normal form game A.

While population games generated by random matching are especially simple, many games that arise in applications are not of this form. In the biology literature, games outside the random matching paradigm are known as playing the field models [139].

Example 10 Consider the following model of highway congestion [17,143,166,173]. A pair of towns, Home and Work, are connected by a network of links. To commute from Home to Work, an agent must choose a path i ∈ S connecting the two towns. The payoff the agent obtains is the negation of the delay on the path he takes. The delay on the path is the sum of the delays on its constituent links, while the delay on a link is a function of the number of agents who use that link. Population games embodying this description are known as congestion games.

To define a congestion game, let Φ be the collection of links in the highway network. Each strategy i ∈ S is a route from Home to Work, and so is identified with a set of links Φ_i ⊆ Φ. Each link λ is assigned a cost function c_λ : ℝ₊ → ℝ, whose argument is link λ’s utilization level u_λ:

u_λ(x) = Σ_{i∈S : λ∈Φ_i} x_i .

The payoff of choosing route i is the negation of the total delays on the links in this route:

F_i(x) = − Σ_{λ∈Φ_i} c_λ(u_λ(x)) .
Since driving on a link increases the delays experienced by other drivers on that link (i. e., since highway congestion involves negative externalities), cost functions in models of highway congestion are increasing; they are typically convex as well. Congestion games can also be used to model positive externalities, like the choice between different technological standards; in this case, the cost functions are decreasing in the utilization levels.
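The definitions above are easy to instantiate. The sketch below builds a small hypothetical congestion game — the three routes, the links, and the linear cost functions are invented for illustration, not taken from the text — and evaluates the payoff vector F(x) at a population state:

```python
# A toy congestion game. Routes are sets of links (Phi_i), and each link has
# an increasing cost function of its utilization level, as in Example 10.
routes = {0: {'a'}, 1: {'b'}, 2: {'a', 'c'}}                  # hypothetical network
cost = {'a': lambda u: 2 * u, 'b': lambda u: 1 + u, 'c': lambda u: u}

def utilization(x, link):
    """u_lambda(x): mass of agents whose chosen route uses this link."""
    return sum(x[i] for i, links in routes.items() if link in links)

def F(x):
    """Payoff to each route: the negation of its total link delay."""
    return [-sum(cost[link](utilization(x, link)) for link in routes[i])
            for i in routes]

x = [0.5, 0.3, 0.2]          # an arbitrary population state
print(F(x))                  # route 1 is currently the best response
```

Because route 2 shares link a with route 0, shifting mass onto either of those routes raises the delay on both — the negative externality described above.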
Revision Protocols

We now introduce foundations for our models of evolutionary dynamics. These foundations are built on the notion of a revision protocol, which describes both the timing and results of agents’ myopic decisions about how to continue playing the game at hand [24,35,96,175,217]. Revision protocols will be used to derive both the deterministic dynamics studied in Sect. “Deterministic Dynamics” and the stochastic dynamics studied in Sect. “Stochastic Dynamics”; similar ideas underlie the local interaction models introduced in Sect. “Local Interaction”.

Definition

Formally, a revision protocol is a map ρ : ℝⁿ × X → ℝ₊^{n×n} that takes payoff vectors π and population states x as arguments, and returns nonnegative matrices as outputs. For reasons to be made clear below, scalar ρ_ij(π, x) is called the conditional switch rate from strategy i to strategy j.

To move from this notion to an explicit model of evolution, let us consider a population consisting of N < ∞ members. (A number of the analyses to follow will consider the limit of the present model as the population size N approaches infinity – see Sects. “Mean Dynamics”, “Deterministic Approximation”, and “Stochastic Stability via Large Population Limits”.) In this case, the set of feasible social states is the finite set X^N = X ∩ (1/N)ℤⁿ = {x ∈ X : Nx ∈ ℤⁿ}, a grid embedded in the simplex X.

A revision protocol ρ, a population game F, and a population size N define a continuous-time evolutionary process – a Markov process {X_t^N} – on the finite state space X^N. A one-size-fits-all description of this process is as follows. Each agent in the society is equipped with a “stochastic alarm clock”. The times between rings of an agent’s clock are independent, each with a rate R exponential distribution. The ringing of a clock signals the arrival of a revision opportunity for the clock’s owner. If an agent playing strategy i ∈ S receives a revision opportunity, he switches to strategy j ≠ i with probability ρ_ij/R. If a switch occurs, the population state changes accordingly, from the old state x to a new state y that accounts for the agent’s change in strategy.

While this interpretation of the evolutionary process can be applied to any revision protocol, simpler interpretations are sometimes available for protocols with additional structure. The examples to follow illustrate this point.
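This description translates directly into a simulation. The sketch below runs the finite-population process for random matching in the Prisoner’s Dilemma of Example 1, using the simple imitative protocol ρ_ij(π, x) = x_j [π_j − π_i]₊ discussed in the examples below; the population size, random seed, and rate bound R are arbitrary illustrative choices. One simplification is used: since clock rings are i.i.d. across agents, the agent who revises at each ring can be drawn uniformly at random.

```python
import random

# Simulating the finite-population process {X_t^N} (a sketch). Agents are
# randomly matched in the Prisoner's Dilemma of Example 1 and revise via
# pairwise proportional imitation, rho_ij = x_j * max(pi_j - pi_i, 0).
random.seed(1)
A = [[2, 0], [3, 1]]                           # strategies: 0 = C, 1 = D
N = 100
strategies = [0] * (N // 2) + [1] * (N // 2)   # start at x = (1/2, 1/2)
R = 3.0                                        # upper bound on switch rates

def state(strats):
    x1 = sum(strats) / N
    return [1 - x1, x1]

for _ in range(10**6):
    x = state(strategies)
    if x[0] == 0.0:                  # all-D is absorbing: defectors never switch
        break
    pi = [sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
    agent = random.randrange(N)      # whose alarm clock rings
    i = strategies[agent]
    j = 1 - i
    rho_ij = x[j] * max(pi[j] - pi[i], 0.0)
    if random.random() < rho_ij / R:           # switch w.p. rho_ij / R
        strategies[agent] = j

print(state(strategies))             # cooperation is imitated away
```

Since defectors always earn more than cooperators here, imitation drives the state to the all-defect vertex.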
Examples

Imitation Protocols and Natural Selection Protocols In economic contexts, revision protocols of the form

ρ_ij(π, x) = x_j ρ̂_ij(π, x)   (9)

are called imitation protocols [35,96,216]. These protocols can be given a very simple interpretation: when an agent receives a revision opportunity, he chooses an opponent at random and observes her strategy. If our agent is playing strategy i and the opponent strategy j, the agent switches from i to j with probability proportional to ρ̂_ij. Notice that the value of the population share x_j is not something the agent need know; this term in (9) accounts for the agent’s observing a randomly chosen opponent.

Example 11 Suppose that after selecting an opponent, the agent imitates the opponent only if the opponent’s payoff is higher than his own, doing so in this case with probability proportional to the payoff difference:

ρ_ij(π, x) = x_j [π_j − π_i]₊ .

This protocol is known as pairwise proportional imitation [188].

Protocols of form (9) also appear in biological contexts [144,153,158], in which case we refer to them as natural selection protocols. The biological interpretation of (9) supposes that each agent is programmed to play a single pure strategy. An agent who receives a revision opportunity dies, and is replaced through asexual reproduction. The reproducing agent is a strategy j player with probability ρ_ij(π, x) = x_j ρ̂_ij(π, x), which is proportional both to the number of strategy j players and to some function of the prevalences and fitnesses of all strategies. Note that this interpretation requires the restriction

Σ_{j∈S} ρ_ij(π, x) ≤ 1 .
Example 12 Suppose that payoffs are always positive, and let

ρ_ij(π, x) = x_j π_j / Σ_{k∈S} x_k π_k .   (10)

Understood as a natural selection protocol, (10) says that the probability that the reproducing agent is a strategy j player is proportional to x_j π_j, the aggregate fitness of strategy j players. In economic contexts, we can interpret (10) as an imitative protocol based on repeated sampling. When an agent’s clock rings he chooses an opponent at random. If the opponent is playing strategy j, the agent imitates him with probability proportional to π_j. If the agent does not imitate this opponent, he draws a new opponent at random and repeats the procedure.
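The repeated-sampling interpretation can be verified by simulation. In the sketch below (the state x and payoff vector π are arbitrary positive values, and the acceptance constant C is an implementation device ensuring probabilities lie in [0, 1]), the empirical distribution of the imitated strategy approaches x_j π_j / Σ_k x_k π_k, as protocol (10) asserts:

```python
import random

# Repeated-sampling imitation: draw an opponent at random, imitate with
# probability proportional to his payoff, otherwise redraw and repeat.
random.seed(0)
x = [0.5, 0.3, 0.2]          # population state (arbitrary)
pi = [1.0, 2.0, 4.0]         # strictly positive payoffs (arbitrary)
C = max(pi)                  # scaling so acceptance probabilities are <= 1

def imitated_strategy():
    while True:
        j = random.choices(range(3), weights=x)[0]   # random opponent
        if random.random() < pi[j] / C:              # imitate w.p. prop. to pi_j
            return j

n = 100_000
counts = [0, 0, 0]
for _ in range(n):
    counts[imitated_strategy()] += 1

total = sum(x[k] * pi[k] for k in range(3))
target = [x[j] * pi[j] / total for j in range(3)]    # protocol (10)
empirical = [c / n for c in counts]
print([round(t, 3) for t in target])
print([round(e, 3) for e in empirical])              # close to the target
```

The acceptance-rejection loop is exactly why the agent need not know the population shares: the random draws supply them implicitly.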
Direct Evaluation Protocols In the previous examples, only strategies currently in use have any chance of being chosen by a revising agent (or of being the programmed strategy of the newborn agent). Under other protocols, agents’ choices are not mediated through the population’s current behavior, except indirectly via the effect of behavior on payoffs. These direct evaluation protocols require agents to directly evaluate the payoffs of the strategies they consider, rather than to indirectly evaluate them as under an imitative procedure.

Example 13 Suppose that choices are made according to the logit choice rule:

ρ_ij(π, x) = exp(η⁻¹ π_j) / Σ_{k∈S} exp(η⁻¹ π_k) .   (11)

The interpretation of this protocol is simple. Revision opportunities arrive at unit rate. When an opportunity is received by an i player, he switches to strategy j with probability ρ_ij(π, x), which is proportional to an exponential function of strategy j’s payoff. The parameter η > 0 is called the noise level. If η is large, choice probabilities under the logit rule are nearly uniform. But if η is near zero, choices are optimal with probability close to one, at least when the difference between the best and second best payoff is not too small.

Additional examples of revision protocols can be found in the next section, and one can construct new revision protocols by taking linear combinations of old ones; see [183] for further discussion.

Deterministic Dynamics

Although antecedents of this approach date back to the early work of Brown and von Neumann [45], the use of differential equations to model evolution in games took root with the introduction of the replicator dynamic by Taylor and Jonker [205], and remains a vibrant area of research; Hofbauer and Sigmund [108] and Sandholm [183] offer recent surveys. In this section, we derive a deterministic model of evolution: the mean dynamic generated by a revision protocol and a population game. We study this deterministic model from various angles, focusing in particular on local stability of rest points, global convergence to equilibrium, and nonconvergent limit behavior.

While the bulk of the literature on deterministic evolutionary dynamics is consistent with the approach we take here, we should mention that other specifications exist, including discrete time dynamics [5,59,131,218], and dynamics for games with continuous strategy sets [41,42,77,100,159,160] and for Bayesian population
games [62,70,179]. Also, deterministic dynamics for extensive form games introduce new conceptual issues; see [28,30,51,53,55] and the monograph of Cressman [54].

Mean Dynamics

As described earlier in Sect. “Definition”, a revision protocol ρ, a population game F, and a population size N define a Markov process {X_t^N} on the finite state space X^N. We now derive a deterministic process – the mean dynamic – that describes the expected motion of {X_t^N}. In Sect. “Deterministic Approximation”, we will describe formally the sense in which this deterministic process provides a very good approximation of the behavior of the stochastic process {X_t^N}, at least over finite time horizons and for large population sizes. But having noted this result, we will focus in this section on the deterministic process itself.

To compute the expected increment of {X_t^N} over the next dt time units, recall first that each of the N agents receives revision opportunities via a rate R exponential distribution, and so expects to receive R dt opportunities during the next dt time units. If the current state is x, the expected number of revision opportunities received by agents currently playing strategy i is approximately N x_i R dt. Since an i player who receives a revision opportunity switches to strategy j with probability ρ_ij/R, the expected number of such switches during the next dt time units is approximately N x_i ρ_ij dt. Therefore, the expected change in the number of agents choosing strategy i during the next dt time units is approximately

N ( Σ_{j∈S} x_j ρ_ji(F(x), x) − x_i Σ_{j∈S} ρ_ij(F(x), x) ) dt .   (12)

Dividing expression (12) by N and eliminating the time differential dt yields a differential equation for the rate of change in the proportion of agents choosing strategy i:

ẋ_i = Σ_{j∈S} x_j ρ_ji(F(x), x) − x_i Σ_{j∈S} ρ_ij(F(x), x) .   (M)
Equation (M) is the mean dynamic (or mean field) generated by revision protocol ρ in population game F. The first term in (M) captures the inflow of agents to strategy i from other strategies, while the second captures the outflow of agents to other strategies from strategy i.

Examples

We now describe some examples of mean dynamics, starting with ones generated by the revision protocols from Sect. “Examples”. To do so, we let

F̄(x) = Σ_{i∈S} x_i F_i(x)

denote the average payoff obtained by the members of the population, and define the excess payoff to strategy i,

F̂_i(x) = F_i(x) − F̄(x) ,

to be the difference between strategy i’s payoff and the population’s average payoff.

Example 14 In Example 11, we introduced the pairwise proportional imitation protocol ρ_ij(π, x) = x_j [π_j − π_i]₊. This protocol generates the mean dynamic

ẋ_i = x_i F̂_i(x) .   (13)

Equation (13) is the replicator dynamic [205], the best-known dynamic in evolutionary game theory. Under this dynamic, the percentage growth rate ẋ_i/x_i of each strategy currently in use is equal to that strategy’s current excess payoff; unused strategies always remain so. There are a variety of revision protocols other than pairwise proportional imitation that generate the replicator dynamic as their mean dynamics; see [35,96,108,217].

Example 15 In Example 12, we assumed that payoffs are always positive, and introduced the protocol ρ_ij ∝ x_j π_j, which we interpreted both as a model of biological natural selection and as a model of imitation with repeated sampling. The resulting mean dynamic,

ẋ_i = x_i F_i(x) / Σ_{k∈S} x_k F_k(x) − x_i = x_i F̂_i(x) / F̄(x) ,   (14)

is the Maynard Smith replicator dynamic [139]. This dynamic only differs from the standard replicator dynamic (13) by a change of speed, with motion under (14) being relatively fast when average payoffs are relatively low. (In multipopulation models, the two dynamics are less similar, and convergence under one does not imply convergence under the other – see [183,216].)

Example 16 In Example 13 we introduced the logit choice rule ρ_ij(π, x) ∝ exp(η⁻¹ π_j). The corresponding mean dynamic,

ẋ_i = exp(η⁻¹ F_i(x)) / Σ_{k∈S} exp(η⁻¹ F_k(x)) − x_i ,   (15)

is called the logit dynamic [82]. If we take the noise level η to zero, then the probability with which a revising agent chooses the best response approaches one whenever the best response is unique. At such points, the logit dynamic approaches the best response dynamic [84]:

ẋ ∈ B^F(x) − x ,   (16)

where

B^F(x) = argmax_{y∈X} y′F(x)

defines the (mixed) best response correspondence for game F. Note that unlike the other dynamics we consider here, (16) is defined not by an ordinary differential equation, but by a differential inclusion, a formulation proposed in [97].

Example 17 Consider the protocol

ρ_ij(π, x) = [π_j − Σ_{k∈S} x_k π_k]₊ .

When an agent’s clock rings, he chooses a strategy at random; if that strategy’s payoff is above average, the agent switches to it with probability proportional to its excess payoff. The resulting mean dynamic,

ẋ_i = [F̂_i(x)]₊ − x_i Σ_{k∈S} [F̂_k(x)]₊ ,

is called the Brown–von Neumann–Nash (BNN) dynamic [45]; see also [98,176,194,200,217].

Example 18 Consider the revision protocol

ρ_ij(π, x) = [π_j − π_i]₊ .

When an agent’s clock rings, he selects a strategy at random. If the new strategy’s payoff is higher than his current strategy’s payoff, he switches strategies with probability proportional to the difference between the two payoffs. The resulting mean dynamic,

ẋ_i = Σ_{j∈S} x_j [F_i(x) − F_j(x)]₊ − x_i Σ_{j∈S} [F_j(x) − F_i(x)]₊ ,   (17)

is called the Smith dynamic [197]; see also [178].

We summarize these examples of revision protocols and mean dynamics in Table 1.

Evolutionary Game Theory, Figure 1 Five basic deterministic dynamics in standard Rock–Paper–Scissors. Colors represent speeds: red is fastest, blue is slowest

Figure 1 presents phase diagrams for the five basic dynamics when the population is randomly matched to play standard Rock–Paper–Scissors (Example 4). In the phase diagrams, colors represent speed of motion: within each diagram, motion is fastest in the red regions and slowest in the blue ones.

The phase diagram of the replicator dynamic reveals closed orbits around the unique Nash equilibrium x* = (1/3, 1/3, 1/3). Since this dynamic is based on imitation (or on reproduction), each face and each vertex of the simplex X is an invariant set: a strategy initially absent from the population will never subsequently appear.
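Both features of the replicator phase diagram can be checked numerically. In standard RPS the time derivative of log(x_R x_P x_S) vanishes along replicator solutions (the column sums of A and the average payoff are both zero), so the product of the shares is constant on each closed orbit. The sketch below (the initial state, step size, and the classical fourth-order Runge–Kutta integrator are implementation choices, not from the text) integrates (13) and checks this invariant:

```python
# Integrating the replicator dynamic (13) for standard RPS (w = l = 1).
A = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]

def replicator(x):
    """xdot_i = x_i * (F_i(x) - average payoff)."""
    Fx = [sum(A[i][j] * x[j] for j in range(3)) for i in range(3)]
    avg = sum(x[i] * Fx[i] for i in range(3))
    return [x[i] * (Fx[i] - avg) for i in range(3)]

def rk4_step(x, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = replicator(x)
    k2 = replicator([x[i] + dt / 2 * k1[i] for i in range(3)])
    k3 = replicator([x[i] + dt / 2 * k2[i] for i in range(3)])
    k4 = replicator([x[i] + dt * k3[i] for i in range(3)])
    return [x[i] + dt / 6 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i])
            for i in range(3)]

x = [0.5, 0.3, 0.2]                  # interior initial state (arbitrary)
invariant = x[0] * x[1] * x[2]
for _ in range(1000):                # integrate to t = 10 with dt = 0.01
    x = rk4_step(x, 0.01)

drift = abs(x[0] * x[1] * x[2] - invariant)
print(drift < 1e-5, min(x) > 0)      # orbit stays closed and interior
```

That min(x) remains strictly positive reflects the invariance of the simplex’s faces: a strategy absent at the start would have ẋ_i = 0 forever.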
Evolutionary Game Theory, Table 1 Five basic deterministic dynamics

Revision protocol | Mean dynamic | Name and source
ρ_ij = x_j [π_j − π_i]₊ | ẋ_i = x_i F̂_i(x) | Replicator [205]
ρ_ij = exp(η⁻¹ π_j) / Σ_{k∈S} exp(η⁻¹ π_k) | ẋ_i = exp(η⁻¹ F_i(x)) / Σ_{k∈S} exp(η⁻¹ F_k(x)) − x_i | Logit [82]
ρ_ij = 1{j = argmax_{k∈S} π_k} | ẋ ∈ B^F(x) − x | Best response [84]
ρ_ij = [π_j − Σ_{k∈S} x_k π_k]₊ | ẋ_i = [F̂_i(x)]₊ − x_i Σ_{j∈S} [F̂_j(x)]₊ | BNN [45]
ρ_ij = [π_j − π_i]₊ | ẋ_i = Σ_{j∈S} x_j [F_i(x) − F_j(x)]₊ − x_i Σ_{j∈S} [F_j(x) − F_i(x)]₊ | Smith [197]
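The protocol–dynamic pairings in Table 1 can be verified mechanically: substituting a revision protocol into the mean dynamic (M) must reproduce the corresponding closed-form expression. A sketch for two rows of the table (standard RPS payoffs; the test state is an arbitrary interior point):

```python
# Verifying two rows of Table 1: plugging a revision protocol into the
# mean dynamic (M) reproduces the corresponding closed-form dynamic.
A = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]     # standard RPS payoffs
n = 3

def F(x):
    return [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]

def mean_dynamic(rho, x):
    """Equation (M): inflow to strategy i minus outflow from strategy i."""
    pi = F(x)
    return [sum(x[j] * rho(pi, x, j, i) for j in range(n))
            - x[i] * sum(rho(pi, x, i, j) for j in range(n))
            for i in range(n)]

def pairwise_imitation(pi, x, i, j):   # Example 11: yields the replicator dynamic
    return x[j] * max(pi[j] - pi[i], 0.0)

def smith_protocol(pi, x, i, j):       # Example 18: yields the Smith dynamic
    return max(pi[j] - pi[i], 0.0)

x = [0.5, 0.3, 0.2]
pi = F(x)
avg = sum(x[i] * pi[i] for i in range(n))
replicator_rhs = [x[i] * (pi[i] - avg) for i in range(n)]           # Eq. (13)
smith_rhs = [sum(x[j] * max(pi[i] - pi[j], 0.0) for j in range(n))  # Eq. (17)
             - x[i] * sum(max(pi[j] - pi[i], 0.0) for j in range(n))
             for i in range(n)]

print(all(abs(a - b) < 1e-12 for a, b in
          zip(mean_dynamic(pairwise_imitation, x), replicator_rhs)))  # True
print(all(abs(a - b) < 1e-12 for a, b in
          zip(mean_dynamic(smith_protocol, x), smith_rhs)))           # True
```

The same `mean_dynamic` function accepts any protocol of the form ρ(π, x, i, j), so the remaining rows of the table can be checked the same way.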
The other four dynamics pictured are based on direct evaluation, allowing agents to select strategies that are currently unused. In these cases, the Nash equilibrium is the sole rest point, and attracts solutions from all initial conditions. (In the case of the logit dynamic, the rest point happens to coincide with the Nash equilibrium only because of the symmetry of the game; see [101,104].) Under the logit and best response dynamics, solution trajectories quickly change direction and then accelerate when the best response to the population state changes; under the BNN and especially the Smith dynamic, solutions approach the Nash equilibrium in a less angular fashion.

Evolutionary Justification of Nash Equilibrium

One of the goals of evolutionary game theory is to justify the prediction of Nash equilibrium play. For this justification to be convincing, it must be based on a model that makes only mild assumptions about agents’ knowledge about one another’s behavior. This sentiment can be captured by introducing two desiderata for revision protocols:

(C) Continuity: ρ is Lipschitz continuous.
(SD) Scarcity of data: ρ_ij only depends on π_i, π_j, and x_j.
Continuity (C) asks that revision protocols depend continuously on their inputs, so that small changes in aggregate behavior do not lead to large changes in players’ responses. Scarcity of data (SD) demands that the conditional switch rate from strategy i to strategy j only depend on the payoffs of these two strategies, so that agents need only know those facts that are most germane to the decision at hand [183]. (The dependence of ρ_ij on x_j is included to allow for dynamics based on imitation.) Protocols that respect these two properties do not make unrealistic demands on the amount of information that agents in an evolutionary model possess.

Our two remaining desiderata impose restrictions on mean dynamics ẋ = V^F(x), linking the evolution of aggregate behavior to incentives in the underlying game.

(NS) Nash stationarity: V^F(x) = 0 if and only if x ∈ NE(F).
(PC) Positive correlation: V^F(x) ≠ 0 implies that V^F(x)′F(x) > 0.

Nash stationarity (NS) is a restriction on stationary states: it asks that the rest points of the mean dynamic be precisely the Nash equilibria of the game being played. Positive correlation (PC) is a restriction on disequilibrium adjustment: it requires that away from rest points, strategies’ growth rates be positively correlated with their payoffs. Condition (PC) is among the weakest of the many conditions linking growth rates of evolutionary dynamics and payoffs in the underlying game; for alternatives, see [76,110,149,162,170,173,200].

In Table 2, we report how the five basic dynamics fare under the four criteria above. For the purposes of justifying the Nash prediction, the most important row in the table is the last one, which reveals that the Smith dynamic satisfies all four desiderata at once: while the revision protocol for the Smith dynamic (see Example 18) requires only limited information on the part of the agents who employ it, this information is enough to ensure that rest points of the dynamic and Nash equilibria coincide.

In fact, the dynamics introduced above can be viewed as members of families of dynamics that are based on similar revision protocols and that have similar qualitative properties. For instance, the Smith dynamic is a member of the family of pairwise comparison dynamics [178], under which agents only switch to strategies that outperform
their current choice. For this reason, the exact functional forms of the previous examples are not essential to establishing the properties noted above. In interpreting these results, it is important to remember that Nash stationarity only concerns the rest points of a dynamic; it says nothing about whether a dynamic will converge to Nash equilibrium from an arbitrary initial state. The question of convergence is addressed in Sects. "Global Convergence" and "Nonconvergence". There we will see that in some classes of games, general guarantees of convergence can be obtained, but that there are some games in which no reasonable dynamic converges to equilibrium.

Local Stability

Before turning to the global behavior of evolutionary dynamics, we address the question of local stability. As we noted at the onset, an original motivation for introducing game dynamics was to provide an explicitly dynamic foundation for Maynard Smith's notion of ESS [205]. Some of the earliest papers on evolutionary game dynamics [105,224] established that being an ESS is a sufficient condition for asymptotic stability under the replicator dynamic, but that it is not a necessary condition. It is curious that this connection obtains despite the fact that ESS is a stability condition for a population whose members all play the same mixed strategy, while (the usual version of) the replicator dynamic looks at populations of agents choosing among different pure strategies. In fact, the implications of ESS for local stability are not limited to the replicator dynamic. Suppose that the symmetric normal form game A admits a symmetric Nash equilibrium that places positive probability on each strategy in S. One can show that this equilibrium is an ESS if and only if the payoff matrix A is negative definite with respect to the tangent space of the simplex:

$z' A z < 0$ for all $z \in TX = \bigl\{ \hat z \in \mathbb{R}^n : \sum_{i \in S} \hat z_i = 0 \bigr\}$. (18)

Evolutionary Game Theory, Table 2 Families of deterministic evolutionary dynamics and their properties; yes indicates that a weaker or alternate form of the property is satisfied

Dynamic        Family                   (C)   (SD)  (NS)  (PC)
Replicator     Imitation                yes   yes   no    yes
Best response  –                        no    yes   yes   yes
Logit          Perturbed best response  yes   yes   no    no
BNN            Excess payoff            yes   no    yes   yes
Smith          Pairwise comparison      yes   yes   yes   yes
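The last row of Table 2 can be checked numerically for a concrete game. The sketch below is an added illustration (the choice of Rock-Paper-Scissors and of test states is mine, not the original's): it implements the mean dynamic of the Smith (pairwise comparison) protocol and verifies that the Nash equilibrium is a rest point and that, at a disequilibrium state, the motion correlates positively with payoffs, as (NS) and (PC) require.

```python
import numpy as np

# Standard Rock-Paper-Scissors: win +1, lose -1 (a symmetric zero-sum game).
A = np.array([[ 0.0, -1.0,  1.0],
              [ 1.0,  0.0, -1.0],
              [-1.0,  1.0,  0.0]])

def smith(x, A):
    """Mean dynamic of the Smith (pairwise comparison) protocol:
    x_i' = sum_j x_j [F_i - F_j]_+  -  x_i sum_j [F_j - F_i]_+ ."""
    F = A @ x
    gain = np.maximum(F[:, None] - F[None, :], 0.0)  # gain[i, j] = [F_i - F_j]_+
    inflow = gain @ x                # agents switching into strategy i
    outflow = x * gain.sum(axis=0)   # agents abandoning strategy i
    return inflow - outflow

x_star = np.ones(3) / 3              # unique Nash equilibrium of RPS
x = np.array([0.5, 0.3, 0.2])        # an arbitrary disequilibrium state

print(smith(x_star, A))              # zero vector: a rest point, as (NS) requires
print(smith(x, A) @ (A @ x))         # positive, as (PC) requires
```

Note that `smith(x, A)` sums to zero at every state, so motion stays on the simplex; only the direction of that motion is constrained by (PC).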
Condition (18) and its generalizations imply local stability of equilibrium not only under the replicator dynamic, but also under a wide range of other evolutionary dynamics: see [52,98,99,102,111,179] for further details. The papers cited above use linearization and Lyapunov function arguments to establish local stability. An alternative approach to local stability analysis, via index theory, allows one to establish restrictions on the stability properties of all rest points at once – see [60].

Global Convergence

While analyses of local stability reveal whether a population will return to equilibrium after a small disturbance, they do not tell us whether the population will approach equilibrium from an arbitrary disequilibrium state. To establish such global convergence results, we must restrict attention to classes of games defined by certain interesting payoff structures. These structures appear in applications, lending strong support for the Nash prediction in the settings where they arise.

Potential Games

A potential game [17,106,143,166,173,181] is a game that admits a potential function: a scalar-valued function whose gradient describes the game's payoffs. In a full potential game $F: \mathbb{R}^n_+ \to \mathbb{R}^n$ (see [181]), all information about incentives is captured by the potential function $f: \mathbb{R}^n_+ \to \mathbb{R}$, in the sense that

$\nabla f(x) = F(x)$ for all $x \in \mathbb{R}^n_+$. (19)
If F is smooth, then it is a full potential game if and only if it satisfies full externality symmetry:

$\frac{\partial F_i}{\partial x_j}(x) = \frac{\partial F_j}{\partial x_i}(x)$ for all $i, j \in S$ and $x \in \mathbb{R}^n_+$. (20)
That is, the effect on the payoff to strategy i of adding new strategy j players always equals the effect on the payoff to strategy j of adding new strategy i players. Example 19 Suppose a single population is randomly matched to play the symmetric normal form game
$A \in \mathbb{R}^{n \times n}$, generating the population game $F(x) = Ax$. We say that A exhibits common interests if the two players in a match always receive the same payoff. This means that $A_{ij} = A_{ji}$ for all i and j, or, equivalently, that the matrix A is symmetric. Since $DF(x) = A$, this is precisely what we need for F to be a full potential game. The full potential function for F is $f(x) = \frac12 x' A x$, which is one-half of the average payoff function $\bar F(x) = \sum_{i \in S} x_i F_i(x) = x' A x$. The common interest assumption defines a fundamental model from population genetics; it reflects the shared fate of two genes that inhabit the same organism [73,106,107].

Example 20 In Example 10, we introduced congestion games, a basic model of network congestion. To see that these games are potential games, observe that an agent taking path $j \in S$ affects the payoffs of agents choosing path $i \in S$ through the marginal increases in congestion on the links $\lambda \in \Phi_i \cap \Phi_j$ that the two paths have in common. But since the marginal effect of an agent taking path i on the payoffs of agents choosing path j is identical, full externality symmetry (20) holds:

$\frac{\partial F_i}{\partial x_j}(x) = -\sum_{\lambda \in \Phi_i \cap \Phi_j} c'_\lambda(u_\lambda(x)) = \frac{\partial F_j}{\partial x_i}(x)$.
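The gradient relation (19) and symmetry condition (20) are easy to verify numerically. The following sketch is my illustration (the 3-by-3 symmetric matrix is an arbitrary choice): it checks by central finite differences that $f(x) = \frac12 x'Ax$ satisfies $\nabla f = F$ for the common-interest game $F(x) = Ax$, and that the cross partials of F agree as in (20).

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])        # symmetric: a common-interest game

F = lambda x: A @ x                    # population game payoffs F(x) = Ax
f = lambda x: 0.5 * x @ A @ x          # candidate full potential function

def grad(g, x, h=1e-6):
    """Central-difference gradient of a scalar function g at x."""
    out = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        out[i] = (g(x + e) - g(x - e)) / (2 * h)
    return out

x = np.array([0.2, 0.3, 0.5])
print(np.max(np.abs(grad(f, x) - F(x))))   # ~0: the gradient of f recovers F

# Full externality symmetry (20): dF_i/dx_j == dF_j/dx_i (here both equal A_ij).
dF = np.array([grad(lambda y, i=i: F(y)[i], x) for i in range(3)])
print(np.max(np.abs(dF - dF.T)))           # ~0
```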
In congestion games, the potential function takes the form

$f(x) = -\sum_{\lambda \in \Phi} \int_0^{u_\lambda(x)} c_\lambda(z)\, dz$,

and so is typically unrelated to aggregate payoffs,

$\bar F(x) = \sum_{i \in S} x_i F_i(x) = -\sum_{\lambda \in \Phi} u_\lambda(x)\, c_\lambda(u_\lambda(x))$.
However, potential is proportional to aggregate payoffs if the cost functions $c_\lambda$ are all monomials of the same degree [56,173]. Population state x is a Nash equilibrium of the potential game F if and only if it satisfies the Kuhn–Tucker first order conditions for maximizing the potential function f on the simplex X [17,173]. Furthermore, it is simple to verify that any dynamic $\dot x = V^F(x)$ satisfying positive correlation (PC) ascends the potential function:

$\frac{d}{dt} f(x_t) = \nabla f(x_t)' \dot x_t = F(x_t)' V^F(x_t) \geq 0$.
It then follows from classical results on Lyapunov functions that any dynamic satisfying positive correlation (PC) converges to a connected set of rest points. If the dynamic also satisfies Nash stationarity (NS), these sets consist entirely of Nash equilibria. Thus, in potential games, very mild conditions on agents' adjustment rules are sufficient to justify the prediction of Nash equilibrium play.

In the case of the replicator dynamic, one can say more. On the interior of the simplex X, the replicator dynamic for the potential game F is a gradient system for the potential function f (i. e., it always ascends f in the direction of maximum increase). However, this is only true after one introduces an appropriate Riemannian metric on X [123,192]. An equivalent statement of this result, due to [2], is that the replicator dynamic is the gradient system for f under the usual Euclidean metric if we stretch the state space X onto the radius-2 sphere. This stretching is accomplished using the Akin transformation $H_i(x) = 2\sqrt{x_i}$, which emphasizes changes in the use of rare strategies relative to changes in the use of common ones [2,4,185]. (There is also a dynamic that generates the gradient system for f on X under the usual metric: the so-called projection dynamic [130,150,185].)

Example 21 Consider evolution in 123 Coordination:
      1  2  3
1     1  0  0
2     0  2  0
3     0  0  3
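The ascent property from the previous paragraph can be checked by direct simulation. The following sketch is an added illustration (step size, horizon, and starting point are my choices): Euler steps of the replicator dynamic in 123 Coordination never decrease the game's potential $f(x) = \frac12 x'Ax$ with $A = \mathrm{diag}(1,2,3)$, and from this interior state the population reaches the pure equilibrium on strategy 3.

```python
import numpy as np

A = np.diag([1.0, 2.0, 3.0])          # the 123 Coordination game
f = lambda x: 0.5 * x @ A @ x         # its potential function

def replicator_step(x, h=0.01):
    F = A @ x
    return x + h * x * (F - x @ F)    # Euler step of x_i' = x_i (F_i(x) - x'F(x))

x = np.array([0.30, 0.35, 0.35])
path = [f(x)]
for _ in range(5000):                  # integrate to t = 50
    x = replicator_step(x)
    path.append(f(x))

print(path[0], path[-1], x)            # potential rises; x approaches e_3 = (0,0,1)
```

Because A is positive definite here, each Euler step increases f exactly, not just up to discretization error; in general the ascent in (PC) holds for the continuous-time dynamic.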
Figure 2a presents a phase diagram of the replicator dynamic on its natural state space X, drawn atop a contour plot of the potential function $f(x) = \frac12\bigl((x_1)^2 + 2(x_2)^2 + 3(x_3)^2\bigr)$. Evidently, all solution trajectories ascend this function and converge to one of the seven symmetric Nash equilibria, with trajectories from all but a measure zero set of initial conditions converging to one of the three pure equilibria. Figure 2b presents another phase diagram for the replicator dynamic, this time after the solution trajectories and the potential function have been transported to the surface of the radius-2 sphere using the Akin transformation. In this case, solutions cross the level sets of the potential function orthogonally, moving in the direction that increases potential most quickly.

Stable Games

A population game F is a stable game [102] if

$(y - x)'(F(y) - F(x)) \leq 0$ for all $x, y \in X$. (21)

If the inequality in (21) always holds strictly, then F is a strictly stable game.
Example 22 The symmetric normal form game A is symmetric zero-sum if A is skew-symmetric (i. e., if $A = -A'$), so that the payoffs of the matched players always sum to zero. (An example is provided by the standard Rock–Paper–Scissors game (Example 4).) Under this assumption, $z' A z = 0$ for all $z \in \mathbb{R}^n$; thus, the population game generated by random matching in A, $F(x) = Ax$, is a stable game that is not strictly stable.

Example 23 Suppose that A satisfies the interior ESS condition (18). Then (22) holds strictly, so $F(x) = Ax$ is a strictly stable game. Examples satisfying this condition include the Hawk–Dove game (Example 3) and any good Rock–Paper–Scissors game (Example 4).

Example 24 A war of attrition [33] is a symmetric normal form game in which strategies represent amounts of time committed to waiting for a scarce resource. If the two players choose times i and j > i, then the j player obtains the resource, worth v, while both players pay a cost of $c_i$: once the first player leaves, the other seizes the resource immediately. If both players choose time i, the resource is split, so payoffs are $\frac{v}{2} - c_i$ each. It can be shown that for any resource value $v \in \mathbb{R}$ and any increasing cost vector $c \in \mathbb{R}^n$, random matching in a war of attrition generates a stable game [102].
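Examples 22 and 23 can be verified numerically. In the sketch below (my illustration, not part of the original text), random matching in skew-symmetric RPS makes $(y-x)'(F(y)-F(x))$ vanish identically, while a good RPS variant (win 2, lose 1) makes it strictly negative, as in a strictly stable game.

```python
import numpy as np

rng = np.random.default_rng(0)

A_zero_sum = np.array([[ 0.,  1., -1.],     # standard RPS: skew-symmetric, A = -A'
                       [-1.,  0.,  1.],
                       [ 1., -1.,  0.]])
A_good = np.array([[ 0., -1.,  2.],         # good RPS: winning (2) beats losing (1)
                   [ 2.,  0., -1.],
                   [-1.,  2.,  0.]])

def gap(A, x, y):
    """(y - x)'(F(y) - F(x)) for F(x) = Ax; <= 0 characterizes stable games."""
    return (y - x) @ A @ (y - x)

for _ in range(200):
    x, y = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))
    assert abs(gap(A_zero_sum, x, y)) < 1e-12          # stable, never strictly
    assert gap(A_good, x, y) < 0 or np.allclose(x, y)  # strictly stable
print("stable game conditions verified on random simplex states")
```

The strict inequality for good RPS reflects that $A + A'$ is negative definite on the tangent space of the simplex, which is exactly the interior ESS condition (18).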
Evolutionary Game Theory, Figure 2 The replicator dynamic in 123 Coordination. Colors represent the value of the game’s potential function
If F is smooth, then F is a stable game if and only if it satisfies self-defeating externalities:

$z' DF(x) z \leq 0$ for all $z \in TX$ and $x \in X$, (22)

where DF(x) is the derivative of $F: X \to \mathbb{R}^n$ at x. This condition requires that the improvements in the payoffs of strategies to which revising agents are switching are always exceeded by the improvements in the payoffs of strategies which revising agents are abandoning.
The flavor of the self-defeating externalities condition (22) suggests that obedience of incentives will push the population toward some "central" equilibrium state. In fact, the set of Nash equilibria of a stable game is always convex, and in the case of strictly stable games, equilibrium is unique. Moreover, it can be shown that the replicator dynamic converges to Nash equilibrium from all interior initial conditions in any strictly stable game [4,105,224], and that the direct evaluation dynamics introduced above converge to Nash equilibrium from all initial conditions in all stable games, strictly stable or not [98,102,104,197]. In each case, the proof of convergence is based on the construction of a Lyapunov function that solutions of the relevant dynamic descend. The Lyapunov functions for the five basic dynamics are presented in Table 3. Interestingly, the convergence results for direct evaluation dynamics are not restricted to the dynamics listed in Table 3, but extend to other dynamics in the same families (cf. Table 2). But compared to the conditions for convergence in potential games, the conditions for convergence in stable games demand additional structure on the adjustment process [102].

Evolutionary Game Theory, Table 3 Lyapunov functions for five basic deterministic dynamics in stable games

Dynamic        Lyapunov function for stable games
Replicator     $H_{x^*}(x) = \sum_{i \in S(x^*)} x_i^* \log \frac{x_i^*}{x_i}$
Best response  $G(x) = \max_{i \in S} \hat F_i(x)$
Logit          $\tilde G(x) = \max_{y \in \operatorname{int}(X)} \bigl( y' \hat F(x) - \eta \sum_{i \in S} y_i \log y_i \bigr) + \eta \sum_{i \in S} x_i \log x_i$
BNN            $\Gamma(x) = \frac12 \sum_{i \in S} [\hat F_i(x)]_+^2$
Smith          $\Psi(x) = \frac12 \sum_{i \in S} \sum_{j \in S} x_i [\hat F_j(x) - \hat F_i(x)]_+^2$

Perturbed Best Response Dynamics in Supermodular Games

Supermodular games are defined by the property
that higher choices by one's opponents (with respect to the natural ordering on $S = \{1, \dots, n\}$) make one's own higher strategies look relatively more desirable. Let the matrix $\Sigma \in \mathbb{R}^{(n-1) \times n}$ satisfy $\Sigma_{ij} = 1$ if $j > i$ and $\Sigma_{ij} = 0$ otherwise, so that $\Sigma x \in \mathbb{R}^{n-1}$ is the "decumulative distribution function" corresponding to the "density function" x. The population game F is a supermodular game if it exhibits strategic complementarities:

If $\Sigma y \geq \Sigma x$, then $F_{i+1}(y) - F_i(y) \geq F_{i+1}(x) - F_i(x)$ for all $i < n$ and $x, y \in X$. (23)

If F is smooth, condition (23) is equivalent to

$\frac{\partial (F_{i+1} - F_i)}{\partial (e_{j+1} - e_j)}(x) \geq 0$ for all $i, j < n$ and $x \in X$. (24)
Example 25 Consider this model of search with positive externalities. A population of agents choose levels of search effort in $S = \{1, \dots, n\}$. The payoff to choosing effort i is $F_i(x) = m(i)\, b(a(x)) - c(i)$, where $a(x) = \sum_{k=1}^{n} k x_k$ is the aggregate search effort, b is some increasing benefit function, m is an increasing multiplier function, and c is an arbitrary cost function. Notice that the benefits from searching are increasing in both own search effort and in the aggregate search effort. It is easy to check that F is a supermodular game.

Complementarity condition (23) implies that the agents' best response correspondence is monotone in the stochastic dominance order, which in turn ensures the existence of minimal and maximal Nash equilibria [207]. One can take advantage of the monotonicity of best responses in studying evolutionary dynamics by appealing to the theory of monotone dynamical systems [196]. To do so, one needs to focus on dynamics that respect the monotonicity of best responses and that are also smooth, so that the theory of monotone dynamics can be applied. It turns out that the logit dynamic satisfies these criteria; so does any perturbed best response dynamic defined in terms of stochastic payoff perturbations. In supermodular games, these dynamics define cooperative differential equations; consequently, solutions of these dynamics from almost every initial condition converge to an approximate Nash equilibrium [104].

Imitation Dynamics in Dominance Solvable Games

Suppose that in the population game F, strategy i is strictly dominated by strategy j: $F_i(x) < F_j(x)$ for all $x \in X$. Consider the evolution of behavior under the replicator dynamic (13). Since for this dynamic we have

$\frac{d}{dt}\Bigl(\frac{x_i}{x_j}\Bigr) = \frac{\dot x_i x_j - \dot x_j x_i}{(x_j)^2} = \frac{x_i \hat F_i(x)\, x_j - x_j \hat F_j(x)\, x_i}{(x_j)^2} = \frac{x_i}{x_j}\bigl(\hat F_i(x) - \hat F_j(x)\bigr),$

solutions from every interior initial condition converge to the face of the simplex where the dominated strategy is unplayed [3]. It follows that the replicator dynamic converges in games with a strictly dominant strategy, and by iterating this argument, one can show that this dynamic converges to equilibrium in any game that can be solved by iterative deletion of strictly dominated strategies. In fact, this argument is not specific to the replicator dynamic, but can be shown to apply to a range of dynamics based on imitation [110,170]. Even in games which are not dominance solvable, arguments of a similar flavor can be used to restrict the long run behavior of imitative dynamics to better-reply closed sets [162]; see Sect. "Convergence to Equilibria and to Better-Reply Closed Sets" for a related discussion.
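The elimination argument is easy to see in simulation. In the sketch below (an added illustration; the game is an arbitrary one of my construction in which strategy 3 is strictly dominated by strategy 1 with margin 1), Euler iteration of the replicator dynamic drives the dominated strategy's share toward zero at the exponential rate the quotient computation predicts.

```python
import numpy as np

# Strategy 3's row lies strictly below strategy 1's row, so 3 is strictly dominated.
A = np.array([[3.0, 1.0, 2.0],
              [1.0, 2.0, 0.0],
              [2.0, 0.0, 1.0]])

def replicator_step(x, h=0.01):
    F = A @ x
    return x + h * x * (F - x @ F)

x = np.array([1/3, 1/3, 1/3])
for _ in range(2000):                 # integrate to t = 20
    x = replicator_step(x)

print(x)    # the share of the dominated strategy 3 is now negligible
```

Since $F_1 - F_3 \equiv 1$ here, the ratio $x_3/x_1$ decays like $e^{-t}$, so by t = 20 the dominated strategy is essentially extinct.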
While the analysis here has focused on imitative dynamics, it is natural to expect that elimination of dominated strategies will extend to any reasonable evolutionary dynamic. But we will see in Sect. "Survival of Dominated Strategies" that this is not the case: the elimination of dominated strategies that obtains under imitative dynamics is the exception, not the rule.

Nonconvergence

The previous section revealed that when certain global structural conditions on payoffs are satisfied, one can establish global convergence to equilibrium under various classes of evolutionary dynamics. Of course, if these conditions are not met, convergence cannot be guaranteed. In this section, we offer examples to illustrate some of the possibilities for nonconvergent limit behavior.

Conservative Properties of the Replicator Dynamic in Zero-Sum Games

In Sect. "Stable Games", we noted that in strictly stable games, the replicator dynamic converges to Nash equilibrium from all interior initial conditions. To prove this, one shows that interior solutions descend the function

$H_{x^*}(x) = \sum_{i \in S(x^*)} x_i^* \log \frac{x_i^*}{x_i}$,

until converging to its minimizer, the unique Nash equilibrium $x^*$. Now, random matching in a symmetric zero-sum game generates a population game that is stable, but not strictly stable (Example 22). In this case, for each interior Nash equilibrium $x^*$, the function $H_{x^*}$ is a constant of motion for the replicator dynamic: its value is fixed along every interior solution trajectory.

Example 26 Suppose that agents are randomly matched to play the symmetric zero-sum game A, given by

      1   2   3   4
1     0   1   0  -1
2    -1   0   1   0
3     0  -1   0   1
4     1   0  -1   0

The Nash equilibria of $F(x) = Ax$ are the points on the line segment NE connecting states $(\frac12, 0, \frac12, 0)$ and $(0, \frac12, 0, \frac12)$, a segment that passes through the barycenter $x^* = (\frac14, \frac14, \frac14, \frac14)$. Figure 3 shows solutions to the replicator dynamic that lie on the level set $H_{x^*}(x) = 0.58$. Evidently, each of these solutions forms a closed orbit.

Evolutionary Game Theory, Figure 3 Solutions of the replicator dynamic in a zero-sum game. The solutions pictured lie on the level set $H_{x^*}(x) = 0.58$

Although solution trajectories of the replicator dynamic do not converge in zero-sum games, it can be proved that the time average of each solution trajectory converges to Nash equilibrium [190]. The existence of a constant of motion is not the only conservative property enjoyed by replicator dynamics for symmetric zero-sum games: these dynamics are also volume preserving after an appropriate change of speed or change of measure [5,96].

Games with Nonconvergent Dynamics

The conservative properties described in the previous section have been established only for the replicator dynamic (and its distant relative, the projection dynamic [185]). Inspired by Shapley [193], many researchers have sought to construct games in which large classes of evolutionary dynamics fail to converge to equilibrium.

Example 27 Suppose that players are randomly matched to play the following symmetric normal form game [107,109]:
      1   2   3   4
1     0   0  -1   ε
2     ε   0   0  -1
3    -1   ε   0   0
4     0  -1   ε   0
When $\varepsilon = 0$, the payoff matrix $A^\varepsilon = A^0$ is symmetric, so $F^0$ is a potential game with potential function $f(x) = \frac12 x' A^0 x = -x_1 x_3 - x_2 x_4$. The function f attains its minimum of $-\frac14$ at states $v = (\frac12, 0, \frac12, 0)$ and $w = (0, \frac12, 0, \frac12)$, has a saddle point with value $-\frac18$ at the Nash equilibrium $x^* = (\frac14, \frac14, \frac14, \frac14)$, and attains its maximum of 0 along the closed path of Nash equilibria consisting of edges $e_1 e_2$, $e_2 e_3$, $e_3 e_4$, and $e_4 e_1$.
Let $\dot x = V^F(x)$ be an evolutionary dynamic that satisfies Nash stationarity (NS) and positive correlation (PC), and that is based on a revision protocol that is continuous (C). If we apply this dynamic to game $F^0$, then the foregoing discussion implies that all solutions to $\dot x = V^{F^0}(x)$ whose initial conditions $\xi$ satisfy $f(\xi) > -\frac18$ converge to $\Gamma$, the closed path of Nash equilibria. The Smith dynamic for $F^0$ is illustrated in Fig. 4a. Now consider the same dynamic for the game $F^\varepsilon$, where $\varepsilon > 0$. By continuity (C), the attractor $\Gamma$ of $V^{F^0}$ continues to an attractor $\Gamma^\varepsilon$ of $V^{F^\varepsilon}$ whose basin of attraction approximates that of $\Gamma$ under $\dot x = V^{F^0}(x)$ (Fig. 4b). But since the unique Nash equilibrium of $F^\varepsilon$ is the barycenter $x^*$, it follows that solutions from most initial conditions converge to an attractor far from any Nash equilibrium.

Other examples of games in which many dynamics fail to converge include monocyclic games [22,83,97,106], Mismatching Pennies [91,116], and the hypnodisk game [103]. These examples demonstrate that there is no evolutionary dynamic that converges to Nash equilibrium regardless of the game at hand. This suggests that in general, analyses of long run behavior should not restrict attention to equilibria alone.

Chaotic Dynamics

We have seen that deterministic evolutionary game dynamics can follow closed orbits and approach limit cycles. We now show that they can also behave chaotically.

Example 28 Consider evolution under the replicator dynamic when agents are randomly matched to play the symmetric normal form game below [13,195], whose lone interior Nash equilibrium is the barycenter $x^* = (\frac14, \frac14, \frac14, \frac14)$:

      1    2    3    4
1     0  -12    0   22
2    20    0    0  -10
3   -21   -4    0   35
4    10   -2    2    0
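The sketch below is an added illustration (the integration scheme and horizon are my choices, and the payoff signs are restored from internal consistency: all four strategies earn equal payoff at the barycenter). It checks that the barycenter is a rest point of the replicator dynamic for this matrix, and that a solution started nearby wanders far from it rather than settling down.

```python
import numpy as np

A = np.array([[  0., -12.,   0.,  22.],
              [ 20.,   0.,   0., -10.],
              [-21.,  -4.,   0.,  35.],
              [ 10.,  -2.,   2.,   0.]])
x_star = np.full(4, 0.25)
print(A @ x_star)            # all payoffs equal: the barycenter is a rest point

def step(x, h=0.001):
    F = A @ x
    return x + h * x * (F - x @ F)

x = np.array([0.24, 0.26, 0.25, 0.25])
max_dev = 0.0
for _ in range(100_000):     # integrate to t = 100
    x = step(x)
    max_dev = max(max_dev, np.max(np.abs(x - x_star)))

print(max_dev)               # the orbit leaves any small neighborhood of x*
```

A crude Euler scheme suffices to see the qualitative behavior, but it is only a sketch; reproducing Figure 5 faithfully would call for a higher-order integrator.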
Evolutionary Game Theory, Figure 4 Solutions of the Smith dynamic in a the potential game $F^0$; b the perturbed potential game $F^\varepsilon$, $\varepsilon = \frac{1}{10}$

Evolutionary Game Theory, Figure 5 Chaotic behavior under the replicator dynamic

Figure 5 presents a solution to the replicator dynamic for this game from initial condition $x_0 = (0.24, 0.26, 0.25, 0.25)$. This solution spirals clockwise about $x^*$. Near the rightmost point of each circuit, where the value of $x_3$ gets close to zero, solutions sometimes proceed along an "outside" path on which the value of $x_3$ surpasses 0.6. But they sometimes follow an "inside" path on which $x_3$ remains below 0.4, and at other times do something in between. Which of these alternatives occurs is difficult to predict from approximate information about the previous behavior of the system. While the game in Example 28 has a complicated payoff structure, in multipopulation contexts one can find chaotic evolutionary dynamics in very simple games [187].

Survival of Dominated Strategies

In Sect. "Imitation Dynamics in Dominance Solvable Games", we saw that dynamics based on imitation eliminate strictly dominated strategies along solutions from interior initial conditions. While this result seems unsurprising, it is actually extremely fragile: [25,103] prove that dynamics that satisfy continuity (C), Nash stationarity (NS), and positive correlation (PC), and that are not based exclusively on imitation, must fail to eliminate strictly dominated strategies in some games. Thus, evolutionary support for a basic rationality criterion is more tenuous than the results for imitative dynamics suggest.

Example 29 Figure 6a presents the Smith dynamic for "bad RPS with a twin":
      R   P   S   T
R     0  -2   1   1
P     1   0  -2  -2
S    -2   1   0   0
T    -2   1   0   0
The Nash equilibria of this game are the states on the line segment $NE = \{x \in X : x = (\frac13, \frac13, c, \frac13 - c)\}$, which is a repellor under the Smith dynamic. Under this dynamic, strategies gain players at rates that depend on their payoffs, but lose players at rates proportional to their current usage levels. It follows that when the dynamics are not at rest, the proportions of players choosing strategies 3 and 4 become equal, so that the dynamic approaches the plane $P = \{x \in X : x_3 = x_4\}$ on which the twins receive equal weight. Since the usual three-strategy version of bad RPS exhibits cycling, solutions on the plane P approach a closed orbit away from any Nash equilibrium.

Figure 6b presents the Smith dynamic in "bad RPS with a feeble twin",

      R     P     S    T
R     0    -2     1    1
P     1     0    -2   -2
S    -2     1     0    0
T   -2-ε   1-ε   -ε   -ε

with $\varepsilon = \frac{1}{10}$. Evidently, the attractor from Fig. 6a moves slightly to the left, reflecting the fact that the payoff to
Evolutionary Game Theory, Figure 6 The Smith dynamic in two games
Twin has gone down. But since the new attractor is in the interior of X, the strictly dominated strategy Twin is always played with probabilities bounded far away from zero.

Stochastic Dynamics

In Sect. "Revision Protocols" we defined the stochastic evolutionary process $\{X_t^N\}$ in terms of a simple model of myopic individual choice. We then turned to the study of deterministic dynamics, which we claimed could be used to approximate the stochastic process $\{X_t^N\}$ over finite time spans and for large population sizes. In this section, we turn our attention to the stochastic process $\{X_t^N\}$ itself. After offering a formal version of the deterministic approximation result, we investigate the long run behavior of $\{X_t^N\}$, focusing on the questions of convergence to equilibrium and selection among multiple stable equilibria.

Deterministic Approximation

In Sect. "Revision Protocols", we defined the Markovian evolutionary process $\{X_t^N\}$ from a revision protocol $\rho$, a population game F, and a finite population size N. In Sect. "Mean Dynamics", we argued that the expected motion of this process is captured by the mean dynamic

$\dot x_i = V_i^F(x) = \sum_{j \in S} x_j \rho_{ji}(F(x), x) - x_i \sum_{j \in S} \rho_{ij}(F(x), x)$. (M)
The basic link between the Markov process $\{X_t^N\}$ and its mean dynamic (M) is provided by Kurtz's Theorem [127], variations and extensions of which have been offered in a number of game-theoretic contexts [24,29,43,44,175,204]. Consider the sequence of Markov processes $\{\{X_t^N\}_{t \geq 0}\}_{N = N_0}^{\infty}$, supposing that the initial conditions $X_0^N$ converge to $x_0 \in X$. Let $\{x_t\}_{t \geq 0}$ be the solution to the mean dynamic (M) starting from $x_0$. Kurtz's Theorem tells us that for each finite time horizon $T < \infty$ and error bound $\varepsilon > 0$, we have

$\lim_{N \to \infty} P\Bigl( \sup_{t \in [0,T]} \bigl| X_t^N - x_t \bigr| < \varepsilon \Bigr) = 1$. (25)
Thus, when the population size N is large, nearly all sample paths of the Markov process $\{X_t^N\}$ stay within $\varepsilon$ of a solution of the mean dynamic (M) through time T. By choosing N large enough, we can ensure that with probability close to one, $X_t^N$ and $x_t$ differ by no more than $\varepsilon$ for all times t between 0 and T (Fig. 7). The intuition for this result comes from the law of large numbers. At each revision opportunity, the increment in the process $\{X_t^N\}$ is stochastic. Still, at most population states the expected number of revision opportunities that arrive during the brief time interval $I = [t, t + dt]$ is large – in particular, of order $N\,dt$. Since each opportunity leads to an increment of the state of size $\frac{1}{N}$, the size of the overall change in the state during time interval I is of order dt. Thus, during this interval there are a large number of revision opportunities, each following nearly the same transition probabilities, and hence having nearly the same expected increments. The law of large numbers therefore suggests that the change in $\{X_t^N\}$ during this interval should be almost completely determined by the expected motion of $\{X_t^N\}$, as described by the mean dynamic (M).
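A minimal simulation illustrates this approximation. The sketch below is mine, not the original's: the protocol is pairwise proportional imitation (a revising agent imitates a randomly sampled opponent with probability proportional to the payoff gain), whose mean dynamic is a speed-rescaled replicator dynamic, and the game is a Stag Hunt with assumed payoffs h = 2, s = 3.

```python
import numpy as np

rng = np.random.default_rng(1)
h_payoff, s_payoff = 2.0, 3.0

def payoffs(x_s):                       # Stag Hunt: F_H = h, F_S = s * x_S
    return np.array([h_payoff, s_payoff * x_s])

N, T, K = 20_000, 5.0, s_payoff         # population size, horizon, rate bound
n_s = int(0.8 * N)                      # initial number of stag hunters

# Finite-N process: N revision opportunities per unit of clock time.
for _ in range(int(N * T)):
    x_s = n_s / N
    F = payoffs(x_s)
    i = 1 if rng.random() < x_s else 0          # revising agent's strategy
    j = 1 if rng.random() < x_s else 0          # sampled opponent's strategy
    if rng.random() < max(F[j] - F[i], 0.0) / K:
        n_s += j - i                            # agent switches from i to j

# Mean dynamic (a rescaled replicator equation), solved by Euler steps.
x, dt = 0.8, 1e-3
for _ in range(int(T / dt)):
    F = payoffs(x)
    x += dt * x * (1 - x) * (F[1] - F[0]) / K

print(n_s / N, x)    # the two trajectories nearly agree for large N
```

With N this large the sample path at time T sits within a few percent of the deterministic solution, in line with (25); shrinking N makes the discrepancy visibly larger.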
Evolutionary Game Theory, Figure 7 Deterministic approximation of the Markov process $\{X_t^N\}$
Convergence to Equilibria and to Better-Reply Closed Sets

Stochastic models of evolution can also be used to address directly the question of convergence to equilibrium [61,78,117,118,125,143,172,219]. Suppose that a society of agents is randomly matched to play an (asymmetric) normal form game that is weakly acyclic in better replies: from each strategy profile, there exists a sequence of profitable unilateral deviations leading to a Nash equilibrium. If agents switch to strategies that do at least as well as their current one against the choices of random samples of opponents, then the society will eventually escape any better-response cycle, ultimately settling upon a Nash equilibrium. Importantly, many classes of normal form games are weakly acyclic in better replies: these include potential games, dominance solvable games, certain supermodular games, and certain aggregative games, in which each agent's payoffs only depend on opponents' behavior through a scalar aggregate statistic. Thus, in all of these cases, simple stochastic better-reply procedures are certain to lead to Nash equilibrium play. Outside these classes of games, one can narrow down the possibilities for long run behavior by looking at better-reply closed sets: that is, subsets of the set of strategy profiles that cannot be escaped without a player switching to an inferior strategy (cf. [16,162]). Stochastic better-reply procedures must lead to a cluster of population states corresponding to a better-reply closed set; once the society enters such a cluster, it never departs.
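For intuition, here is a toy better-reply process (entirely my sketch, with an arbitrary 2x2 pure coordination game, which is weakly acyclic in better replies): repeatedly letting a randomly chosen player switch to a strictly better reply reaches a Nash equilibrium, after which the process never moves again.

```python
import random

random.seed(0)

# Payoff bimatrix of a 2x2 pure coordination game: coordinating on action 1
# is worth 2 to each player, coordinating on action 0 is worth 1.
U = {(0, 0): (1, 1), (0, 1): (0, 0), (1, 0): (0, 0), (1, 1): (2, 2)}

def payoff(profile, player):
    return U[profile][player]

def is_nash(profile):
    for player in (0, 1):
        for a in (0, 1):
            dev = list(profile)
            dev[player] = a
            if payoff(tuple(dev), player) > payoff(profile, player):
                return False
    return True

profile = (0, 1)                      # start from a miscoordinated profile
for _ in range(100):
    if is_nash(profile):
        break
    player = random.randrange(2)
    current = payoff(profile, player)
    better = [a for a in (0, 1)
              if payoff(tuple(a if p == player else profile[p] for p in (0, 1)),
                        player) > current]
    if better:                        # a profitable unilateral deviation exists
        new = list(profile)
        new[player] = random.choice(better)
        profile = tuple(new)

print(profile, is_nash(profile))      # ends at one of the two coordination equilibria
```

In this game every non-Nash profile admits a strictly profitable deviation directly into an equilibrium, so the process stops after a single switch; in larger weakly acyclic games the walk may take longer but still terminates at a Nash equilibrium with probability one.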
Stochastic Stability and Equilibrium Selection

To this point, we used stochastic evolutionary dynamics to provide foundations for deterministic dynamics and to address the question of convergence to equilibrium. But stochastic evolutionary dynamics introduce an entirely new possibility: that of obtaining unique long-run predictions of play, even in games with multiple locally stable equilibria. This form of analysis, which we consider next, was pioneered by Foster and Young [74], Kandori, Mailath, and Rob [119], and Young [219], building on mathematical techniques due to Freidlin and Wentzell [75].

Stochastic Stability

To minimize notation, let us describe the evolution of behavior using a discrete-time Markov chain $\{X_k^{N,\varepsilon}\}_{k=0}^{\infty}$ on $X^N$, where the parameter $\varepsilon > 0$ represents the level of "noise" in agents' decision procedures. The noise ensures that the Markov chain is irreducible and aperiodic: any state in $X^N$ can be reached from any other, and there is positive probability that a period passes without a change in the state. Under these conditions, the Markov chain $\{X_k^{N,\varepsilon}\}$ admits a unique stationary distribution, $\mu^{N,\varepsilon}$, a measure on the state space $X^N$ that is invariant under the Markov chain:

$\sum_{x \in X^N} \mu^{N,\varepsilon}(x)\, P\bigl(X_{k+1}^{N,\varepsilon} = y \mid X_k^{N,\varepsilon} = x\bigr) = \mu^{N,\varepsilon}(y)$ for all $y \in X^N$.

The stationary distribution describes the long run behavior of the process $\{X_k^{N,\varepsilon}\}$ in two distinct ways. First, $\mu^{N,\varepsilon}$ is the limiting distribution of $\{X_k^{N,\varepsilon}\}$:

$\lim_{k \to \infty} P\bigl(X_k^{N,\varepsilon} = y \mid X_0^{N,\varepsilon} = x\bigr) = \mu^{N,\varepsilon}(y)$ for all $x, y \in X^N$.

Second, $\mu^{N,\varepsilon}$ almost surely describes the limiting empirical distribution of $\{X_k^{N,\varepsilon}\}$:

$P\Bigl( \lim_{K \to \infty} \frac{1}{K} \sum_{k=0}^{K-1} \mathbf{1}\{X_k^{N,\varepsilon} \in A\} = \mu^{N,\varepsilon}(A) \Bigr) = 1$ for any $A \subseteq X^N$.

Thus, if most of the mass in the stationary distribution $\mu^{N,\varepsilon}$ were placed on a single state, then this state would provide a unique prediction of long run behavior. With this motivation, consider a sequence of Markov chains $\{\{X_k^{N,\varepsilon}\}_{k=0}^{\infty}\}_{\varepsilon \in (0, \bar\varepsilon)}$ parametrized by noise levels $\varepsilon$ that approach zero. Population state $x \in X^N$ is said to be stochastically stable if it retains positive weight in the stationary distributions of these Markov chains as $\varepsilon$ becomes arbitrarily small:

$\lim_{\varepsilon \to 0} \mu^{N,\varepsilon}(x) > 0$.
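Stationary distributions can be computed exactly for small populations. The sketch below is my own simplified construction (one uniformly chosen agent revises per period, best responding with probability $1 - \varepsilon$ and randomizing with probability $\varepsilon$, in a Stag Hunt with h = 1; the Bernoulli-arrivals model discussed next allows simultaneous revisions, but the selection conclusion is the same). Solving $\mu P = \mu$ shows the mass collecting on the risk-dominant equilibrium: all-stag when s > 2h, all-hare when s < 2h.

```python
import numpy as np

def stationary_mass(s, h=1.0, N=20, eps=0.01):
    """Stationary distribution over k = number of stag hunters (0..N) for
    best response with mutations, one uniformly chosen reviser per period."""
    P = np.zeros((N + 1, N + 1))
    for k in range(N + 1):
        x_s = k / N
        br_stag = s * x_s > h                 # is stag the best response?
        q = (1 - eps) * br_stag + eps / 2     # prob a reviser ends up at stag
        up = (N - k) / N * q                  # a hare player switches to stag
        down = k / N * (1 - q)                # a stag player switches to hare
        P[k, min(k + 1, N)] += up
        P[k, max(k - 1, 0)] += down
        P[k, k] += 1 - up - down
    # Solve mu P = mu together with the normalization sum(mu) = 1.
    M = np.vstack([(P.T - np.eye(N + 1))[:-1], np.ones(N + 1)])
    b = np.zeros(N + 1)
    b[-1] = 1.0
    return np.linalg.solve(M, b)

mu_hi = stationary_mass(s=3.0)    # s > 2h: stag is risk dominant
mu_lo = stationary_mass(s=1.6)    # s < 2h: hare is risk dominant
print(mu_hi[15:].sum(), mu_lo[:6].sum())   # mass concentrates accordingly
```

Rerunning with smaller eps pushes the mass ever closer to a point mass on the selected pure state, which is exactly the stochastic stability statement above.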
When the stochastically stable state is unique, it offers a unique prediction of play that is relevant over sufficiently long time spans.

Bernoulli Arrivals and Mutations

Following the approach of many early contributors to the literature, let us consider a model of stochastic evolution based on Bernoulli arrivals of revision opportunities and best responses with mutations. The former assumption means that during each discrete time period, each agent has probability $\theta \in (0, 1]$ of receiving an opportunity to update his strategy. This assumption differs from the one we proposed in Sect. "Revision Protocols"; the key new implication is that all agents may receive revision opportunities simultaneously. (Models that assume this directly generate similar results.) The latter assumption posits that when an agent receives a revision opportunity, he plays a best response to the current strategy distribution with probability $1 - \varepsilon$, and chooses a strategy at random with probability $\varepsilon$.

Example 30 Suppose that a population of N agents is randomly matched to play the Stag Hunt game (Example 2):

      H  S
H     h  h
S     0  s
Since s > h > 0, hunting hare and hunting stag are both symmetric pure equilibria; the game also admits the sym ; x ) D ( sh ; h ). metric mixed equilibrium x D (x H S s s If more than fraction x H of the agents hunt hare, then hare is the unique best response, while if more than fraction x S of the agents hunt stag, then stag is the unique best response. Thus, under any deterministic dynamic that respects payoffs, the mixed equilibrium x divides the state space into two basins of attraction, one for each of the two pure equilibria. Now consider our stochastic evolutionary process. If the noise level " is small, this process typically behaves like a deterministic process, moving quickly toward one of the two pure states, e H D (1; 0) or e S D (0; 1), and remaining there for some time. But since the process is ergodic, it will eventually leave the pure state it reaches first, and in fact will switch from one pure state to the other infinitely often. To determine the stochastically stable state, we must compute and compare the “improbabilities” of these tran-
Evolutionary Game Theory
sitions. If the current state is e_H, a transition to e_S requires mutations to cause roughly Nx*_S agents to switch to the suboptimal strategy S, sending the population into the basin of attraction of e_S; the probability of this event is of order ε^{Nx*_S}. Similarly, to transit from e_S to e_H, mutations must cause roughly Nx*_H = N(1 − x*_S) agents to switch from S to H; the probability of this event is of order ε^{N(1−x*_S)}. Which of these rare events is more likely depends on whether x*_S is greater than or less than 1/2. If s > 2h, so that x*_S < 1/2, then ε^{Nx*_S} is much larger than ε^{N(1−x*_S)} when ε is small: escaping e_H is far easier than escaping e_S, so state e_S is stochastically stable (Fig. 8a). If instead s < 2h, so that x*_S > 1/2, then ε^{Nx*_S} < ε^{N(1−x*_S)}, so e_H is stochastically stable (Fig. 8b). These calculations show that risk dominance – being the optimal response against a uniformly randomizing opponent – drives stochastic stability in 2 × 2 games. In particular, when s < 2h, so that risk dominance and payoff dominance disagree, stochastic stability favors the former over the latter. This example illustrates how under Bernoulli arrivals and mutations, stochastic stability analysis is based on mutation counting: that is, on determining how many simultaneous mutations are required to move from each equilibrium into the basin of attraction of each other equilibrium. In games with more than two strategies, completing the argument becomes more complicated than in the example above: the analysis, typically based on the tree-analysis techniques of [75,219], requires one to account for the relative difficulties of transitions between all pairs of equilibria. [68] develops a streamlined method of computing the stochastically stable state based on radius-coradius calcu-
Evolutionary Game Theory, Figure 8 Equilibrium selection via mutation counting in Stag Hunt games
lations; while this approach is not always sufficiently fine to yield a complete analysis, in the cases where it works it can be considerably simpler to apply than the tree-analysis method. These techniques have been applied successfully to a variety of classes of games, including pure coordination games, supermodular games, games satisfying “bandwagon” properties, and games with equilibria that satisfy generalizations of risk dominance [68,120,121,134]. A closely related literature uses stochastic stability as a basis for evaluating traditional solution concepts for extensive form games [90,115,122,128,152,168,169]. A number of authors have shown that variations on the Bernoulli arrivals and mutations model can lead to different equilibrium selection results. For instance, [165,211] show that if choices are determined from the payoffs from a single round of matching (rather than from expected payoffs), the payoff dominant equilibrium rather than the risk dominant equilibrium is selected. If choices depend on strategies’ relative performances rather than their absolute performances, then long run behavior need not resemble a Nash equilibrium at all [26,161,171,198]. Finally, if the probability of mutation depends on the current population state, then any recurrent set of the unperturbed process (e. g., any pure equilibrium of a coordination game) can be selected in the long run if the mutation rates are specified in an appropriate way [27]. This last result suggests that mistake probabilities should be provided with an explicit foundation, a topic we take up in Sect. “Poisson Arrivals and Payoff Noise”. Another important criticism of the stochastic stability literature concerns the length of time needed for its predic-
tions to become relevant [31,67]. If the population size N is large and the mutation rate ε is small, then the probability that a transition between equilibria occurs during a given period is of order ε^{cN} for some constant c > 0, and hence minuscule; the waiting time between transitions is thus enormous. Indeed, if the mutation rate falls over time, or if the population size grows over time, then ergodicity may fail, abrogating equilibrium selection entirely [163,186]. These analyses suggest that except in applications with very long time horizons, the unique predictions generated by analyses of stochastic stability may be inappropriate, and that modelers would do better to focus on history-dependent predictions of the sort provided by deterministic models. At the same time, there are frameworks in which stochastic stability becomes relevant much more quickly. The most important of these are local interaction models, which we discuss in Sect. “Local Interaction”.

Poisson Arrivals and Payoff Noise

Combining the assumption of Bernoulli arrivals of revision opportunities with that of best responses with mutations creates a model in which the probabilities of transitions between equilibria are easy to compute: one can focus on events in which large numbers of agents switch to a suboptimal strategy at once, each doing so with the same probability. But the simplicity of this argument also highlights the potency of the assumptions behind it. An appealing alternative approach is to model stochastic evolution using Poisson arrivals of revision opportunities and payoff noise [29,31,38,39,63,135,145,209,210,222]. (One can achieve similar effects by looking at models defined in terms of stochastic differential equations; see [18,48,74,79,113].) By allowing revision opportunities to arrive in continuous time, as we did in Sect. “Revision Protocols”, we ensure that agents do not receive opportunities simultaneously, ruling out the simultaneous mass revisions that drive the Bernoulli arrival model.
(One can accomplish the same end using a discrete time model by assuming that one agent updates during each period; the resulting process is a random time change away from the Poisson arrivals model.) Under Poisson arrivals, transitions between equilibria occur gradually, as the population works its way out of basins of attraction one agent at a time. In this context, the mutation assumption becomes particularly potent, ensuring that the probabilities of suboptimal choices do not vary with their payoff consequences. Under the alternative assumption of payoff noise, one supposes that agents play best responses to payoffs that are subject to random perturbations drawn from a fixed multivariate distribution. In this case, suboptimal choices are much more likely near
basin boundaries, where the payoffs of second-best strategies are not much less than those of optimal ones, than they are at stable equilibria, where payoff differences are larger. Evidently, assuming Poisson arrivals and payoff noise means that stochastic stability cannot be assessed by way of mutation counting. To determine the unlikelihood of escaping from an equilibrium’s basin of attraction, one must account not only for the “width” of the basin of attraction (i. e., the number of suboptimal choices needed to escape it), but also for its “depth” (the unlikelihood of each of these choices). In two-strategy games this is not difficult to accomplish: in this case the evolutionary process is a birth-and-death chain, and its stationary distribution can be expressed using an explicit formula. Beyond this case, one can employ the Freidlin and Wentzell [75] machinery, although doing so tends to be computationally demanding. This computational burden is smaller in models that retain Poisson arrivals, but replace perturbed optimization with decision rules based on imitation and mutation [80]. Because agents imitate successful opponents, the population spends the vast majority of periods on the edges of the simplex, implying that the probabilities of transitions between vertices can be determined using birth-and-death chain methods [158]. As a consequence, one can reduce the problem of finding the stochastically stable state in an n-strategy coordination game to that of computing the limiting stationary distribution of an n-state Markov chain.

Stochastic Stability via Large Population Limits

The approach to stochastic stability followed thus far relies on small noise limits: that is, on evaluating the limit of the stationary distributions μ^{N,ε} as the noise level ε approaches zero.
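This small noise limit can be made concrete for the two-strategy Stag Hunt of Example 30. The sketch below is illustrative rather than drawn from the literature: it lets a single randomly chosen agent revise each period (an asynchronous variant that preserves the mutation-counting logic), uses best responses with mutations, and picks population size, payoffs, and ε arbitrarily. The stationary distribution of the resulting birth-and-death chain follows from detailed balance.

```python
import numpy as np

def stationary_dist(N, h, s, eps):
    # Best-response-with-mutations birth-and-death chain for the Stag Hunt.
    # State k = number of agents playing S (stag); basin boundary at x* = h/s.
    # N is chosen so that k/N never equals x* exactly (no indifference states).
    x_star = h / s
    pS = lambda x: (1 - eps) * (x > x_star) + eps / 2  # prob a reviser picks S
    pH = lambda x: (1 - eps) * (x < x_star) + eps / 2  # prob a reviser picks H
    log_pi = np.zeros(N + 1)
    for k in range(N):
        up = (N - k) / N * pS(k / N)           # an H player revises to S
        down = (k + 1) / N * pH((k + 1) / N)   # an S player revises to H
        # Detailed balance: pi(k+1)/pi(k) = up(k)/down(k+1).
        log_pi[k + 1] = log_pi[k] + np.log(up) - np.log(down)
    pi = np.exp(log_pi - log_pi.max())
    return pi / pi.sum()

N, h, eps = 19, 2.0, 0.01
print(np.argmax(stationary_dist(N, h, 5.0, eps)))  # s > 2h: mass at all-S
print(np.argmax(stationary_dist(N, h, 3.0, eps)))  # s < 2h: mass at all-H
```

With s > 2h the distribution concentrates on the all-stag state, and with s < 2h on the all-hare state, matching the risk dominance selection described in Example 30; shrinking ε makes the concentration arbitrarily sharp.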
Binmore and Samuelson [29] argue that in the contexts where evolutionary models are appropriate, the amount of noise in agents’ decisions is not negligible, so that taking the low noise limit may not be desirable. At the same time, evolutionary models are intended to describe behavior in large populations, suggesting an alternative approach: that of evaluating the limit of the stationary distributions μ^{N,ε} as the population size N grows large. In one respect, this approach complicates the analysis. When N is fixed and ε varies, each stationary distribution μ^{N,ε} is a measure on the fixed state space X^N = {x ∈ X : Nx ∈ Z^n}. But when ε is fixed and N varies, the state space X^N varies as well, and one must introduce notions of weak convergence of probability measures in order to define stochastic stability. But in other respects taking large population limits can make analysis simpler. We saw in Sect. “Deterministic Ap-
proximation” that by taking the large population limit, we can approximate the finite-horizon sample paths of the stochastic evolutionary process {X_t^{N,ε}} by solutions to the mean dynamic (M). Now we are concerned with infinite horizon behavior, but it is still reasonable to hope that the large population limit will again reduce some of our computations to calculus problems. As one might expect, this approach is easiest to follow in the two-strategy case, where for each fixed population size N, the evolutionary process {X_t^{N,ε}} is a birth-and-death chain. When one takes the large population limit, the formulas for waiting times and for the stationary distribution can be evaluated using integral approximations [24,29,39,222]. Indeed, the approximations so obtained take an appealingly simple form [182]. The analysis becomes more complicated beyond the two-strategy case, but certain models have proved amenable to analysis. For instance, [80] characterizes large population stochastic stability in models based on imitation and mutation. Imitation ensures that the population spends nearly all periods on the edges of the simplex X, and the large population limit makes evaluating the probabilities of transitions along these edges relatively simple. If one supposes that agents play best responses to noisy payoffs, then one must account directly for the behavior of the process {X_t^{N,ε}} in the interior of the simplex. One possibility is to combine the deterministic approximation results from Sect. “Deterministic Approximation” with techniques from the theory of stochastic approximation [20,21] to show that the large N limiting stationary distribution is concentrated on attractors of the mean dynamic. By combining this idea with convergence results for deterministic dynamics from Sect. “Global Convergence”, Ref. [104] shows that the limiting stationary distribution must be concentrated around equilibrium states in potential games, stable games, and supermodular games.
The results in [104] do not address the question of equilibrium selection. However, for the specific case of logit evolution in potential games, a complete characterization of the large population limit of the process {X_t^{N,ε}} has been obtained [23]. By combining deterministic approximation results, which describe the usual behavior of the process within basins of attraction, with a large deviations analysis, which characterizes the rare escapes from basins of attraction, one can obtain a precise asymptotic formula for the large N limiting stationary distribution. This formula accounts both for the typical procession of the process along solutions of the mean dynamic, and for the rare sojourns of the process against this deterministic flow.
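The flavor of large population concentration can be seen in a toy computation. The Stag Hunt is a potential game, and under logit choice with noise level η the two-strategy process is again a birth-and-death chain; as N grows with η fixed, the stationary distribution piles up around the rest point of the logit mean dynamic. The sketch below is a numerical caricature, not the analysis of [23]; the noise level, payoffs, and window width are illustrative assumptions.

```python
import numpy as np

def logit_stationary(N, h, s, eta):
    # Logit(eta) birth-and-death chain for the Stag Hunt; k = # of stag hunters.
    pS = lambda x: 1 / (1 + np.exp((h - s * x) / eta))  # logit prob of choosing S
    log_pi = np.zeros(N + 1)
    for k in range(N):
        up = (N - k) / N * pS(k / N)
        down = (k + 1) / N * (1 - pS((k + 1) / N))
        log_pi[k + 1] = log_pi[k] + np.log(up) - np.log(down)
    pi = np.exp(log_pi - log_pi.max())
    return pi / pi.sum()

h, s, eta = 2.0, 3.0, 1.0
x_hat = 0.5                  # rest point of the logit mean dynamic: x = pS(x)
for _ in range(100):
    x_hat = 1 / (1 + np.exp((h - s * x_hat) / eta))

for N in (50, 500):
    pi = logit_stationary(N, h, s, eta)
    near = np.abs(np.arange(N + 1) / N - x_hat) <= 0.1
    print(N, round(pi[near].sum(), 3))  # mass near the rest point grows with N
```

Because η is held fixed here, the mass accumulates around the perturbed (logit) equilibrium rather than an exact Nash equilibrium, which is precisely the sense in which the limiting stationary distribution concentrates on attractors of the mean dynamic.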
Local Interaction

All of the game dynamics considered so far have been based implicitly on the assumption of global interaction: each agent’s payoffs depend directly on all agents’ actions. In many contexts, one expects to the contrary that interactions will be local in nature: for instance, agents may live in fixed locations and interact only with neighbors. In addition to providing a natural fit for these applications, local interaction models respond to some of the criticisms of the stochastic stability literature. At the same time, once one moves beyond relatively simple cases, local interaction models become exceptionally complicated, and so lend themselves to methods of analysis very different from those considered thus far.

Stochastic Stability and Equilibrium Selection Revisited

In Sect. “Stochastic Stability and Equilibrium Selection”, we saw that the prediction of risk dominant equilibrium play provided by stochastic stability models is subverted by the waiting-time critique: namely, that the length of time required before this equilibrium is reached may be extremely long. Ellison [67,68] shows that if interactions are local, then selection of the risk dominant equilibrium persists, and waiting times are no longer an issue.

Example 31 In the simplest local interaction model, a population of N agents is located at N distinct positions around a circle. During each period of play, each agent plays the Stag Hunt game (Examples 2 and 30) with his two nearest neighbors, using the same action against both opponents. If we suppose that s ∈ (h, 2h), so that hunting hare is the risk dominant strategy, then by definition, an agent whose neighbors play different strategies finds it optimal to choose H himself. Now suppose that there are Bernoulli arrivals of revision opportunities, and that decisions are based on best responses and rare mutations. To move from the all-S state to the all-H state, it is enough that a single agent mutates from S to H.
This one mutation begins a chain reaction: the mutating agent’s neighbors respond optimally by switching to H themselves; they are followed in this by their own neighbors; and the contagion continues until all agents choose H. Since a single mutation is always enough to spur the transition from all S to all H, the expected wait before this transition is small, even when the population is large. In contrast, the transition back from all H to all S is extremely unlikely. Even if all but one of the agents simultaneously mutate to S, the contagion process described above will return the population to the all-H state. Thus, while the transition from all-S to all-H occurs quickly, the
reverse transition takes even longer than in the global interaction setting. The local interaction approach to equilibrium selection has been advanced in a variety of directions: by allowing agents to choose their locations [69], or to pay a cost to choose different strategies against different opponents [86], and by basing agents’ decisions on the attainment of aspiration levels [11], or on imitation of successful opponents [9,10]. A portion of this literature, initiated by Blume, develops connections between local interaction models in evolutionary game theory and models from statistical mechanics [36,37,38,124,141]. These models provide a point of departure for research on complex spatial dynamics in games, which we consider next.

Complex Spatial Dynamics

The local interaction models described above address the questions of convergence to equilibrium and selection among multiple equilibria. In the cases where convergence and selection results obtain, behavior in these models is relatively simple, as most periods are spent with most agents coordinating on a single strategy. A distinct branch of the literature on evolution and local interaction focuses on cases with complex dynamics, where instead of settling quickly into a homogeneous, static configuration, behavior remains in flux, with multiple strategies coexisting for long periods of time.

Example 32 Cooperating is a dominated strategy in the Prisoner’s Dilemma, and is not played in equilibrium in finitely repeated versions of this game. Nevertheless, a pair of Prisoner’s Dilemma tournaments conducted by Axelrod [14] were won by the strategy Tit-for-Tat, which cooperates against cooperative opponents and defects against defectors. Axelrod’s work spawned a vast literature aiming
to understand the persistence of individually irrational but socially beneficial behavior. To address this question, Nowak and May [153,154,155,156,157] consider a population of agents who are repeatedly matched to play the Prisoner’s Dilemma

         C   D
    C    1   0
    D    g   ε
where the greedy payoff g exceeds 1 and ε > 0 is small. The agents are positioned on a two-dimensional grid. During each period, each agent plays the Prisoner’s Dilemma with the eight agents in his (Moore) neighborhood. In the simplest version of the model, all agents simultaneously update their strategies at the end of each period. If an agent’s total payoff that period is as high as that of any of his neighbors, he continues to play the same strategy; otherwise, he switches to the strategy of the neighbor who obtained the highest payoff. Since defecting is a dominant strategy in the Prisoner’s Dilemma, one might expect the local interaction process to converge to a state at which all agents defect, as would be the case in nearly any model of global interaction. But while an agent is always better off defecting himself, he also is better off the more of his neighbors cooperate; and since evolution is based on imitation, cooperators tend to have more cooperators as neighbors than do defectors. In Figs. 9–11, we present snapshots of the local interaction process for choices of the greedy payoff g from each of three distinct parameter regions. If g > 5/3 (Fig. 9), the process quickly converges to a configuration containing a few rectangular islands of cooperators in a sea of defectors, with the exact configuration depending on the initial conditions. If instead g < 8/5 (Fig. 10), the process moves towards a configuration in which agents other than those in a “web” of defectors cooperate. But for g ∈ (8/5, 5/3) (Fig. 11), the sys-
Evolutionary Game Theory, Figure 9 Local interaction in a Prisoner’s Dilemma; greedy payoff g = 1.7. In Figs. 9–11, agents are arrayed on a 100 × 100 grid with periodic boundaries (i. e., a torus). Initial conditions are random with 75% cooperators and 25% defectors. Agents update simultaneously, imitating the neighbor who earned the highest payoff. Blue cells represent cooperators who also cooperated last period, green cells represent new cooperators; red cells represent defectors who also defected last period, yellow cells represent new defectors. (Figs. 9–11 created using VirtualLabs [92])
Evolutionary Game Theory, Figure 10 Local interaction in a Prisoner’s Dilemma; greedy payoff g = 1.55
Evolutionary Game Theory, Figure 11 Local interaction in a Prisoner’s Dilemma; greedy payoff g = 1.65
tem evolves in a complicated fashion, with clusters of cooperators and of defectors forming, expanding, disappearing, and reforming. But while the configuration of behavior never stabilizes, the proportion of cooperators appears to settle down to about 0.30. The specification of the dynamics considered above, based on simultaneous updating and certain imitation of the most successful neighbor, presents a relatively favorable environment for cooperative behavior. Nevertheless, under Poisson arrivals of revision opportunities, or probabilistic decision rules, or both, cooperation can persist for very long periods of time for values of g significantly larger than 1 [154,155]. The literature on complex spatial dynamics in evolutionary game models is large and rapidly growing, with the evolution of behavior in the spatial Prisoner’s Dilemma
being the single most-studied environment. While analyses are typically based on simulations, analytical results have been obtained in some relatively simple settings [71,94]. Recent work on complex spatial dynamics has considered games with three or more strategies, including Rock– Paper–Scissors games, as well as public good contribution games and Prisoner’s Dilemmas with voluntary participation. Introducing more than two strategies can lead to qualitatively novel dynamic phenomena, including largescale spatial cycles and traveling waves [93,202,203]. In addition to simulations, the analysis of complex spatial dynamics is often based on approximation techniques from non-equilibrium statistical physics, and much of the research on these dynamics has appeared in the physics literature. [201] offers a comprehensive survey of work on this topic.
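A minimal reimplementation of the updating rule in Example 32 conveys how such simulations are built: synchronous updating on a torus, with each agent imitating the highest-earning member of his Moore neighborhood. Ties with the agent's own payoff favor the current strategy, and ties among equally successful neighbors are resolved by scan order, an assumption the verbal description leaves open. Grid size, initial conditions, and payoffs mirror Figs. 9–11 but are otherwise illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(strat, g, eps_pd):
    """One synchronous Nowak-May update on a torus.
    strat: 2D int array, 1 = cooperate, 0 = defect."""
    shifts = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
              if (di, dj) != (0, 0)]
    # Total payoff of each site against its 8 Moore neighbors:
    # C vs C -> 1, C vs D -> 0, D vs C -> g, D vs D -> eps_pd.
    payoff = np.zeros_like(strat, dtype=float)
    for di, dj in shifts:
        nb = np.roll(np.roll(strat, di, 0), dj, 1)
        payoff += np.where(strat == 1, nb * 1.0,
                           np.where(nb == 1, g, eps_pd))
    # Imitate the neighbor with the highest payoff; keep own strategy on ties.
    best_pay, best_strat = payoff.copy(), strat.copy()
    for di, dj in shifts:
        nb_pay = np.roll(np.roll(payoff, di, 0), dj, 1)
        nb_strat = np.roll(np.roll(strat, di, 0), dj, 1)
        better = nb_pay > best_pay
        best_pay = np.where(better, nb_pay, best_pay)
        best_strat = np.where(better, nb_strat, best_strat)
    return best_strat

grid = (rng.random((100, 100)) < 0.75).astype(int)   # 75% cooperators
for _ in range(200):
    grid = step(grid, g=1.65, eps_pd=0.01)
print(grid.mean())   # fraction of cooperators after 200 periods
```

For g in the complex region of Fig. 11, runs of this kind keep clusters of both strategies in flux rather than converging to a static configuration, though the exact trajectory depends on the random initial state.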
Applications

Evolutionary game theory was created with biological applications squarely in mind. In the prehistory of the field, Fisher [73] and Hamilton [87] used game-theoretic ideas to understand the evolution of sex ratios. Maynard Smith [137,138,139,140] introduced his definition of ESS as a way of understanding ritualized animal conflicts. Since these early contributions, evolutionary game theory has been used to study a diverse array of biological questions, including mate choice, parental investment, parent-offspring conflict, social foraging, and predator-prey systems. For overviews of research on these and other topics in biology, see [65,88]. The early development of evolutionary game theory in economics was motivated primarily by theoretical concerns: the justification of traditional game-theoretic solution concepts, and the development of methods for equilibrium selection in games with multiple stable equilibria. More recently, evolutionary game theory has been applied to concrete economic environments, in some instances as a means of contending with equilibrium selection problems, and in others to obtain an explicitly dynamic model of the phenomena of interest. Of course, these applications are most successful when the behavioral assumptions that underlie the evolutionary approach are appropriate, and when the time horizon needed for the results to become relevant corresponds to the one germane to the application at hand. Topics in economic theory studied using the methods of evolutionary game theory range from behavior in markets [1,6,7,8,12,19,64,112,129,212], to bargaining and hold-up problems [32,46,57,66,164,208,220,221,222], to externality and implementation problems [47,49,136,174,177,180], to questions of public good provision and collective action [146,147,148].
The techniques described here are being applied with increasing frequency to problems of broader social science interest, including residential segregation [40,62,142,222,223,225,226] and cultural evolution [34,126], and to the study of behavior in transportation and computer networks [72,143,150,173,175,177,197]. A proliferating branch of research extends the approaches described in this article to address the evolution of structure and behavior in social networks; a number of recent books [85,114,213] offer detailed treatments of work in this domain.

Future Directions

Evolutionary game theory is a maturing field; many basic theoretical issues are well understood, but many difficult questions remain. It is tempting to say that stochastic and
local interaction models offer the more open terrain for further explorations. But while it is true that we know less about these models than about deterministic evolutionary dynamics, even our knowledge of the latter is limited: while dynamics on one- and two-dimensional state spaces, and for games satisfying a few interesting structural assumptions, are well understood, the dynamics of behavior in the vast majority of many-strategy games are not. The prospects for further applications of the tools of evolutionary game theory are brighter still. In economics, and in other social sciences, the analysis of mathematical models has too often been synonymous with the computation and evaluation of equilibrium behavior. The questions of whether and how equilibrium will come to be are often ignored, and the possibility of long-term disequilibrium behavior left unmentioned. For settings in which its assumptions are tenable, evolutionary game theory offers a host of techniques for modeling the dynamics of economic behavior. The exploitation of the possibilities for a deeper understanding of human social interactions has hardly begun.

Acknowledgments

The figures in Sects. “Deterministic Dynamics” and “Local Interaction” were created using Dynamo [184] and VirtualLabs [92], respectively. I am grateful to Caltech for its hospitality as I completed this article, and I gratefully acknowledge financial support under NSF Grant SES-0617753.

Bibliography

1. Agastya M (2004) Stochastic stability in a double auction. Games Econ Behav 48:203–222 2. Akin E (1979) The geometry of population genetics. Springer, Berlin 3. Akin E (1980) Domination or equilibrium. Math Biosci 50:239–250 4. Akin E (1990) The differential geometry of population genetics and evolutionary games. In: Lessard S (ed) Mathematical and statistical developments of evolutionary theory. Kluwer, Dordrecht, pp 1–93 5. Akin E, Losert V (1984) Evolutionary dynamics of zero-sum games. J Math Biol 20:231–258 6.
Alós-Ferrer C (2005) The evolutionary stability of perfectly competitive behavior. Econ Theory 26:497–516 7. Alós-Ferrer C, Ania AB, Schenk-Hoppé KR (2000) An evolutionary model of Bertrand oligopoly. Games Econ Behav 33:1–19 8. Alós-Ferrer C, Kirchsteiger G, Walzl M (2006) On the evolution of market institutions: The platform design paradox. Unpublished manuscript, University of Konstanz 9. Alós-Ferrer C, Weidenholzer S (2006) Contagion and efficiency. J Econ Theory forthcoming, University of Konstanz and University of Vienna
10. Alós-Ferrer C, Weidenholzer S (2006) Imitation, local interactions, and efficiency. Econ Lett 93:163–168 11. Anderlini L, Ianni A (1996) Path dependence and learning from neighbors. Games Econ Behav 13:141–177 12. Ania AB, Tröger T, Wambach A (2002) An evolutionary analysis of insurance markets with adverse selection. Games Econ Behav 40:153–184 13. Arneodo A, Coullet P, Tresser C (1980) Occurrence of strange attractors in three-dimensional Volterra equations. Phys Lett 79A:259–263 14. Axelrod R (1984) The evolution of cooperation. Basic Books, New York 15. Balkenborg D, Schlag KH (2001) Evolutionarily stable sets. Int J Game Theory 29:571–595 16. Basu K, Weibull JW (1991) Strategy sets closed under rational behavior. Econ Lett 36:141–146 17. Beckmann M, McGuire CB, Winsten CB (1956) Studies in the economics of transportation. Yale University Press, New Haven 18. Beggs AW (2002) Stochastic evolution with slow learning. Econ Theory 19:379–405 19. Ben-Shoham A, Serrano R, Volij O (2004) The evolution of exchange. J Econ Theory 114:310–328 20. Benaïm M (1998) Recursive algorithms, urn processes, and the chaining number of chain recurrent sets. Ergod Theory Dyn Syst 18:53–87 21. Benaïm M, Hirsch MW (1999) On stochastic approximation algorithms with constant step size whose average is cooperative. Ann Appl Probab 30:850–869 22. Benaïm M, Hofbauer J, Hopkins E (2006) Learning in games with unstable equilibria. Unpublished manuscript, Université de Neuchâtel, University of Vienna and University of Edinburgh 23. Benaïm M, Sandholm WH (2007) Logit evolution in potential games: Reversibility, rates of convergence, large deviations, and equilibrium selection. Unpublished manuscript, Université de Neuchâtel and University of Wisconsin 24. Benaïm M, Weibull JW (2003) Deterministic approximation of stochastic evolution in games. Econometrica 71:873–903 25. Berger U, Hofbauer J (2006) Irrational behavior in the Brown-von Neumann-Nash dynamics. Games Econ Behav 56:1–6 26.
Bergin J, Bernhardt D (2004) Comparative learning dynamics. Int Econ Rev 45:431–465 27. Bergin J, Lipman BL (1996) Evolution with state-dependent mutations. Econometrica 64:943–956 28. Binmore K, Gale J, Samuelson L (1995) Learning to be imperfect: The ultimatum game. Games Econ Behav 8:56–90 29. Binmore K, Samuelson L (1997) Muddling through: Noisy equilibrium selection. J Econ Theory 74:235–265 30. Binmore K, Samuelson L (1999) Evolutionary drift and equilibrium selection. Rev Econ Stud 66:363–393 31. Binmore K, Samuelson L, Vaughan R (1995) Musical chairs: Modeling noisy evolution. Games Econ Behav 11:1–35 32. Binmore K, Samuelson L, Peyton Young H (2003) Equilibrium selection in bargaining models. Games Econ Behav 45:296– 328 33. Bishop DT, Cannings C (1978) A generalised war of attrition. J Theor Biol 70:85–124 34. Bisin A, Verdier T (2001) The economics of cultural transmission and the dynamics of preferences. J Econ Theory 97:298– 319
35. Björnerstedt J, Weibull JW (1996) Nash equilibrium and evolution by imitation. In: Arrow KJ et al. (eds) The Rational Foundations of Economic Behavior. St. Martin’s Press, New York, pp 155–181 36. Blume LE (1993) The statistical mechanics of strategic interaction. Games Econ Behav 5:387–424 37. Blume LE (1995) The statistical mechanics of best response strategy revision. Games Econ Behav 11:111–145 38. Blume LE (1997) Population games. In: Arthur WB, Durlauf SN, Lane DA (eds) The economy as an evolving complex system II. Addison–Wesley, Reading, pp 425–460 39. Blume LE (2003) How noise matters. Games Econ Behav 44:251–271 40. Bøg M (2006) Is segregation robust? Unpublished manuscript, Stockholm School of Economics 41. Bomze IM (1990) Dynamical aspects of evolutionary stability. Monatshefte Mathematik 110:189–206 42. Bomze IM (1991) Cross entropy minimization in uninvadable states of complex populations. J Math Biol 30:73–87 43. Börgers T, Sarin R (1997) Learning through reinforcement and the replicator dynamics. J Econ Theory 77:1–14 44. Boylan RT (1995) Continuous approximation of dynamical systems with randomly matched individuals. J Econ Theory 66:615–625 45. Brown GW, von Neumann J (1950) Solutions of games by differential equations. In: Kuhn HW, Tucker AW (eds) Contributions to the theory of games I, volume 24 of Annals of Mathematics Studies. Princeton University Press, Princeton, pp 73–79 46. Burke MA, Peyton Young H (2001) Competition and custom in economic contracts: A case study of Illinois agriculture. Am Econ Rev 91:559–573 47. Cabrales A (1999) Adaptive dynamics and the implementation problem with complete information. J Econ Theory 86:159–184 48. Cabrales A (2000) Stochastic replicator dynamics. Int Econ Rev 41:451–481 49. Cabrales A, Ponti G (2000) Implementation, elimination of weakly dominated strategies and evolutionary dynamics. Rev Econ Dyn 3:247–282 50.
Crawford VP (1991) An “evolutionary” interpretation of Van Huyck, Battalio, and Beil’s experimental results on coordination. Games Econ Behav 3:25–59 51. Cressman R (1996) Evolutionary stability in the finitely repeated prisoner’s dilemma game. J Econ Theory 68:234–248 52. Cressman R (1997) Local stability of smooth selection dynamics for normal form games. Math Soc Sci 34:1–19 53. Cressman R (2000) Subgame monotonicity in extensive form evolutionary games. Games Econ Behav 32:183–205 54. Cressman R (2003) Evolutionary dynamics and extensive form games. MIT Press, Cambridge 55. Cressman R, Schlag KH (1998) On the dynamic (in)stability of backwards induction. J Econ Theory 83:260–285 56. Dafermos S, Sparrow FT (1969) The traffic assignment problem for a general network. J Res Nat Bureau Stand B 73:91–118 57. Dawid H, Bentley MacLeod W (2008) Hold-up and the evolution of investment and bargaining norms. Games Econ Behav 62:26–52 58. Dawkins R (1976) The selfish gene. Oxford University Press, Oxford
59. Dekel E, Scotchmer S (1992) On the evolution of optimizing behavior. J Econ Theory 57:392–407 60. Demichelis S, Ritzberger K (2003) From evolutionary to strategic stability. J Econ Theory 113:51–75 61. Dindoš M, Mezzetti C (2006) Better-reply dynamics and global convergence to Nash equilibrium in aggregative games. Games Econ Behav 54:261–292 62. Dokumacı E, Sandholm WH (2007) Schelling redux: An evolutionary model of residential segregation. Unpublished manuscript, University of Wisconsin 63. Dokumacı E, Sandholm WH (2007) Stochastic evolution with perturbed payoffs and rapid play. Unpublished manuscript, University of Wisconsin 64. Droste E, Hommes C, Tuinstra J (2002) Endogenous fluctuations under evolutionary pressure in Cournot competition. Games Econ Behav 40:232–269 65. Dugatkin LA, Reeve HK (eds) (1998) Game theory and animal behavior. Oxford University Press, Oxford 66. Ellingsen T, Robles J (2002) Does evolution solve the hold-up problem? Games Econ Behav 39:28–53 67. Ellison G (1993) Learning, local interaction, and coordination. Econometrica 61:1047–1071 68. Ellison G (2000) Basins of attraction, long run equilibria, and the speed of step-by-step evolution. Rev Econ Stud 67:17–45 69. Ely JC (2002) Local conventions. Adv Econ Theory 2:1(30) 70. Ely JC, Sandholm WH (2005) Evolution in Bayesian games I: Theory. Games Econ Behav 53:83–109 71. Eshel I, Samuelson L, Shaked A (1998) Altruists, egoists, and hooligans in a local interaction model. Am Econ Rev 88:157–179 72. Fischer S, Vöcking B (2006) On the evolution of selfish routing. Unpublished manuscript, RWTH Aachen 73. Fisher RA (1930) The genetical theory of natural selection. Clarendon Press, Oxford 74. Foster DP, Peyton Young H (1990) Stochastic evolutionary game dynamics. Theor Popul Biol 38:219–232; corrigendum 51:77–78 (1997) 75. Freidlin MI, Wentzell AD (1998) Random perturbations of dynamical systems, 2nd edn. Springer, New York 76. Friedman D (1991) Evolutionary games in economics.
Econometrica 59:637–666 77. Friedman D, Yellin J (1997) Evolving landscapes for population games. Unpublished manuscript, UC Santa Cruz 78. Friedman JW, Mezzetti C (2001) Learning in games by random sampling. J Econ Theory 98:55–84 79. Fudenberg D, Harris C (1992) Evolutionary dynamics with aggregate shocks. J Econ Theory 57:420–441 80. Fudenberg D, Imhof LA (2006) Imitation processes with small mutations. J Econ Theory 131:251–262 81. Fudenberg D, Imhof LA (2008) Monotone imitation dynamics in large populations. J Econ Theory 140:229–245 82. Fudenberg D, Levine DK (1998) Theory of learning in games. MIT Press, Cambridge 83. Gaunersdorfer A, Hofbauer J (1995) Fictitious play, shapley polygons, and the replicator equation. Games Econ Behav 11:279–303 84. Gilboa I, Matsui A (1991) Social stability and equilibrium. Econometrica 59:859–867 85. Goyal S (2007) Connections: An introduction to the economics of networks. Princeton University Press, Princeton
86. Goyal S, Janssen MCW (1997) Non-exclusive conventions and social coordination. J Econ Theory 77:34–57 87. Hamilton WD (1967) Extraordinary sex ratios. Science 156:477–488 88. Hammerstein P, Selten R (1994) Game theory and evolutionary biology. In: Aumann RJ, Hart S (eds) Handbook of Game Theory. vol 2, chap 28, Elsevier, Amsterdam, pp 929–993 89. Harsanyi JC, Selten R (1988) A General Theory of equilibrium selection in games. MIT Press, Cambridge 90. Hart S (2002) Evolutionary dynamics and backward induction. Games Econ Behav 41:227–264 91. Hart S, Mas-Colell A (2003) Uncoupled dynamics do not lead to Nash equilibrium. Am Econ Rev 93:1830–1836 92. Hauert C (2007) Virtual Labs in evolutionary game theory. Software http://www.univie.ac.at/virtuallabs. Accessed 31 Dec 2007 93. Hauert C, De Monte S, Hofbauer J, Sigmund K (2002) Volunteering as Red Queen mechanism for cooperation in public goods games. Science 296:1129–1132 94. Herz AVM (1994) Collective phenomena in spatially extended evolutionary games. J Theor Biol 169:65–87 95. Hines WGS (1987) Evolutionary stable strategies: A review of basic theory. Theor Popul Biol 31:195–272 96. Hofbauer J (1995) Imitation dynamics for games. Unpublished manuscript, University of Vienna 97. Hofbauer J (1995) Stability for the best response dynamics. Unpublished manuscript, University of Vienna 98. Hofbauer J (2000) From Nash and Brown to Maynard Smith: Equilibria, dynamics and ESS. Selection 1:81–88 99. Hofbauer J, Hopkins E (2005) Learning in perturbed asymmetric games. Games Econ Behav 52:133–152 100. Hofbauer J, Oechssler J, Riedel F (2005) Brown-von NeumannNash dynamics: The continuous strategy case. Unpublished manuscript, University of Vienna 101. Hofbauer J, Sandholm WH (2002) On the global convergence of stochastic fictitious play. Econometrica 70:2265–2294 102. Hofbauer J, Sandholm WH (2006) Stable games. Unpublished manuscript, University of Vienna and University of Wisconsin 103. 
Hofbauer J, Sandholm WH (2006) Survival of dominated strategies under evolutionary dynamics. Unpublished manuscript, University of Vienna and University of Wisconsin 104. Hofbauer J, Sandholm WH (2007) Evolution in games with randomly disturbed payoffs. J Econ Theory 132:47–69 105. Hofbauer J, Schuster P, Sigmund K (1979) A note on evolutionarily stable strategies and game dynamics. J Theor Biol 81:609–612 106. Hofbauer J, Sigmund K (1988) Theory of evolution and dynamical systems. Cambridge University Press, Cambridge 107. Hofbauer J, Sigmund K (1998) Evolutionary games and population dynamics. Cambridge University Press, Cambridge 108. Hofbauer J, Sigmund K (2003) Evolutionary game dynamics. Bull Am Math Soc (New Series) 40:479–519 109. Hofbauer J, Swinkels JM (1996) A universal Shapley example. Unpublished manuscript, University of Vienna and Northwestern University 110. Hofbauer J, Weibull JW (1996) Evolutionary selection against dominated strategies. J Econ Theory 71:558–573 111. Hopkins E (1999) A note on best response dynamics. Games Econ Behav 29:138–150
Evolutionary Game Theory
112. Hopkins E, Seymour RM (2002) The stability of price dispersion under seller and consumer learning. Int Econ Rev 43:1157–1190 113. Imhof LA (2005) The long-run behavior of the stochastic replicator dynamics. Ann Appl Probab 15:1019–1045 114. Jackson MO Social and economic networks. Princeton University Press, Princeton, forthcoming 115. Jacobsen HJ, Jensen M, Sloth B (2001) Evolutionary learning in signalling games. Games Econ Behav 34:34–63 116. Jordan JS (1993) Three problems in learning mixed-strategy Nash equilibria. Games Econ Behav 5:368–386 117. Josephson J (2008) Stochastic better reply dynamics in finite games. Econ Theory, 35:381–389 118. Josephson J, Matros A (2004) Stochastic imitation in finite games. Games Econ Behav 49:244–259 119. Kandori M, Mailath GJ, Rob R (1993) Learning, mutation, and long run equilibria in games. Econometrica 61:29–56 120. Kandori M, Rob R (1995) Evolution of equilibria in the long run: A general theory and applications. J Econ Theory 65:383– 414 121. Kandori M, Rob R (1998) Bandwagon effects and long run technology choice. Games Econ Behav 22:84–120 122. Kim Y-G, Sobel J (1995) An evolutionary approach to pre-play communication. Econometrica 63:1181–1193 123. Kimura M (1958) On the change of population fitness by natural selection. Heredity 12:145–167 124. Kosfeld M (2002) Stochastic strategy adjustment in coordination games. Econ Theory 20:321–339 125. Kukushkin NS (2004) Best response dynamics in finite games with additive aggregation. Games Econ Behav 48:94–110 126. Kuran T, Sandholm WH (2008) Cultural integration and its discontents. Rev Economic Stud 75:201–228 127. Kurtz TG (1970) Solutions of ordinary differential equations as limits of pure jump Markov processes. J Appl Probab 7:49–58 128. Kuzmics C (2004) Stochastic evolutionary stability in extensive form games of perfect information. Games Econ Behav 48:321–336 129. Lahkar R (2007) The dynamic instability of dispersed price equilibria. 
Unpublished manuscript, University College London 130. Lahkar R, Sandholm WH The projection dynamic and the geometry of population games. Games Econ Behav, forthcoming 131. Losert V, Akin E (1983) Dynamics of games and genes: Discrete versus continuous time. J Math Biol 17:241–251 132. Lotka AJ (1920) Undamped oscillation derived from the law of mass action. J Am Chem Soc 42:1595–1598 133. Mailath GJ (1992) Introduction: Symposium on evolutionary game theory. J Econ Theory 57:259–277 134. Maruta T (1997) On the relationship between risk-dominance and stochastic stability. Games Econ Behav 19:221–234 135. Maruta T (2002) Binary games with state dependent stochastic choice. J Econ Theory 103:351–376 136. Mathevet L (2007) Supermodular Bayesian implementation: Learning and incentive design. Unpublished manuscript, Caltech 137. Maynard Smith J (1972) Game theory and the evolution of fighting. In: Maynard Smith J On Evolution. Edinburgh University Press, Edinburgh, pp 8–28 138. Maynard Smith J (1974) The theory of games and the evolution of animal conflicts. J Theor Biol 47:209–221
139. Maynard Smith J (1982) Evolution and the theory of games. Cambridge University Press, Cambridge 140. Maynard Smith J, Price GR (1973) The logic of animal conflict. Nature 246:15–18 141. Miekisz ˛ J (2004) Statistical mechanics of spatial evolutionary games. J Phys A 37:9891–9906 142. Möbius MM (2000) The formation of ghettos as a local interaction phenomenon. Unpublished manuscript, MIT 143. Monderer D, Shapley LS (1996) Potential games. Games Econ Behav 14:124–143 144. Moran PAP (1962) The statistical processes of evolutionary theory. Clarendon Press, Oxford 145. Myatt DP, Wallace CC (2003) A multinomial probit model of stochastic evolution. J Econ Theory 113:286–301 146. Myatt DP, Wallace CC (2007) An evolutionary justification for thresholds in collective-action problems. Unpublished manuscript, Oxford University 147. Myatt DP, Wallace CC (2008) An evolutionary analysis of the volunteer’s dilemma. Games Econ Behav 62:67–76 148. Myatt DP, Wallace CC (2008) When does one bad apple spoil the barrel? An evolutionary analysis of collective action. Rev Econ Stud 75:499–527 149. Nachbar JH (1990) “Evolutionary” selection dynamics in games: Convergence and limit properties. Int J Game Theory 19:59–89 150. Nagurney A, Zhang D (1997) Projected dynamical systems in the formulation, stability analysis and computation of fixed demand traffic network equilibria. Transp Sci 31:147–158 151. Nash JF (1951) Non-cooperative games. Ann Math 54:287– 295 152. Nöldeke G, Samuelson L (1993) An evolutionary analysis of backward and forward induction. Games Econ Behav 5:425– 454 153. Nowak MA (2006) Evolutionary dynamics: Exploring the equations of life. Belknap/Harvard, Cambridge 154. Nowak MA, Bonhoeffer S, May RM (1994) More spatial games. Int J Bifurc Chaos 4:33–56 155. Nowak MA, Bonhoeffer S, May RM (1994) Spatial games and the maintenance of cooperation. Proc Nat Acad Sci 91:4877– 4881 156. Nowak MA, May RM (1992) Evolutionary games and spatial chaos. Nature 359:826–829 157. 
Nowak MA, May RM (1993) The spatial dilemmas of evolution. Int J Bifurc Chaos 3:35–78 158. Nowak MA, Sasaki A, Taylor C, Fudenberg D (2004) Emergence of cooperation and evolutionary stability in finite populations. Nature 428:646–650 159. Oechssler J, Riedel F (2001) Evolutionary dynamics on infinite strategy spaces. Econ Theory 17:141–162 160. Oechssler J, Riedel F (2002) On the dynamic foundation of evolutionary stability in continuous models. J Econ Theory 107:141–162 161. Rhode P, Stegeman M (1996) A comment on “learning, mutation, and long run equilibria in games”. Econometrica 64:443– 449 162. Ritzberger K, Weibull JW (1995) Evolutionary selection in normal form games. Econometrica 63:1371–1399 163. Robles J (1998) Evolution with changing mutation rates. J Econ Theory 79:207–223 164. Robles J (2008) Evolution, bargaining and time preferences. Econ Theory 35:19–36
1027
1028
Evolutionary Game Theory
165. Robson A, Vega-Redondo F (1996) Efficient equilibrium selection in evolutionary games with random matching. J Econ Theory 70:65–92 166. Rosenthal RW (1973) A class of games possessing pure strategy Nash equilibria. Int J Game Theory 2:65–67 167. Samuelson L (1988) Evolutionary foundations of solution concepts for finite, two-player, normal-form games. In: Vardi MY (ed) Proc. of the Second Conference on Theoretical Aspects of Reasoning About Knowledge (Pacific Grove, CA, 1988), Morgan Kaufmann Publishers, Los Altos, pp 211–225 168. Samuelson L (1994) Stochastic stability in games with alternative best replies. J Econ Theory 64:35–65 169. Samuelson L (1997) Evolutionary games and equilibrium selection. MIT Press, Cambridge 170. Samuelson L, Zhang J (1992) Evolutionary stability in asymmetric games. J Econ Theory 57:363–391 171. Sandholm WH (1998) Simple and clever decision rules in a model of evolution. Econ Lett 61:165–170 172. Sandholm WH (2001) Almost global convergence to p-dominant equilibrium. Int J Game Theory 30:107–116 173. Sandholm WH (2001) Potential games with continuous player sets. J Econ Theory 97:81–108 174. Sandholm WH (2002) Evolutionary implementation and congestion pricing. Rev Econ Stud 69:81–108 175. Sandholm WH (2003) Evolution and equilibrium under inexact information. Games Econ Behav 44:343–378 176. Sandholm WH (2005) Excess payoff dynamics and other wellbehaved evolutionary dynamics. J Econ Theory 124:149–170 177. Sandholm WH (2005) Negative externalities and evolutionary implementation. Rev Econ Stud 72:885–915 178. Sandholm WH (2006) Pairwise comparison dynamics. Unpublished manuscript, University of Wisconsin 179. Sandholm WH (2007) Evolution in Bayesian games II: Stability of purified equilibria. J Econ Theory 136:641–667 180. Sandholm WH (2007) Pigouvian pricing and stochastic evolutionary implementation. J Econ Theory 132:367–382 181. Sandholm WH (2007) Large population potential games. 
Unpublished manuscript, University of Wisconsin 182. Sandholm WH (2007) Simple formulas for stationary distributions and stochastically stable states. Games Econ Behav 59:154–162 183. Sandholm WH Population games and evolutionary dynamics. MIT Press, Cambridge, forthcoming 184. Sandholm WH, Dokumacı E (2007) Dynamo: Phase diagrams for evolutionary dynamics. Software http://www.ssc.wisc. edu/~whs/dynamo 185. Sandholm WH, Dokumacı E, Lahkar R The projection dynamic and the replicator dynamic. Games Econ Behav, forthcoming 186. Sandholm WH, Pauzner A (1998) Evolution, population growth, and history dependence. Games Econ Behav 22:84– 120 187. Sato Y, Akiyama E, Doyne Farmer J (2002) Chaos in learning a simple two-person game. Proc Nat Acad Sci 99:4748–4751 188. Schlag KH (1998) Why imitate, and if so, how? A boundedly rational approach to multi-armed bandits. J Econ Theory 78:130–156 189. Schuster P, Sigmund K (1983) Replicator dynamics. J Theor Biol 100:533–538 190. Schuster P, Sigmund K, Hofbauer J, Wolff R (1981) Selfregu-
191. 192. 193.
194. 195. 196.
197.
198.
199. 200. 201. 202. 203.
204. 205. 206. 207. 208. 209. 210. 211. 212. 213. 214. 215.
lation of behaviour in animal societies I: Symmetric contests. Biol Cybern 40:1–8 Selten R (1991) Evolution, learning, and economic behavior. Games Econ Behav 3:3–24 Shahshahani S (1979) A new mathematical framework for the study of linkage and selection. Mem Am Math Soc 211 Shapley LS (1964) Some topics in two person games. In: Dresher M, Shapley LS, Tucker AW (eds) Advances in game theory. vol 52 of Annals of Mathematics Studies. Princeton University Press, Princeton, pp 1–28 Skyrms B (1990) The Dynamics of Rational Deliberation. Harvard University Press, Cambridge Skyrms B (1992) Chaos in game dynamics. J Log Lang Inf 1:111–130 Smith HL (1995) Monotone Dynamical Systems: An introduction to the theory of competitive and cooperative systems. American Mathematical Society, Providence, RI Smith MJ (1984) The stability of a dynamic model of traffic assignment –an application of a method of Lyapunov. Transp Sci 18:245–252 Stegeman M, Rhode P (2004) Stochastic Darwinian equilibria in small and large populations. Games Econ Behav 49:171– 214 Swinkels JM (1992) Evolutionary stability with equilibrium entrants. J Econ Theory 57:306–332 Swinkels JM (1993) Adjustment dynamics and rational play in games. Games Econ Behav 5:455–484 Szabó G, Fáth G (2007) Evolutionary games on graphs. Phys Rep 446:97–216 Szabó G, Hauert C (2002) Phase transitions and volunteering in spatial public goods games. Phys Rev Lett 89:11801(4) Tainaka K-I (2001) Physics and ecology of rock-paper-scissors game. In: Marsland TA, Frank I (eds) Computers and games, Second International Conference (Hamamatsu 2000), vol 2063 in Lecture Notes in Computer Science. Springer, Berlin, pp 384–395 Tanabe Y (2006) The propagation of chaos for interacting individuals in a large population. Math Soc Sci 51:125–152 Taylor PD, Jonker L (1978) Evolutionarily stable strategies and game dynamics. Math Biosci 40:145–156 Thomas B (1985) On evolutionarily stable sets. 
J Math Biol 22:105–115 Topkis D (1979) Equilibrium points in nonzero-sum n-person submodular games. SIAM J Control Optim 17:773–787 Tröger T (2002) Why sunk costs matter for bargaining outcomes: An evolutionary approach. J Econ Theory 102:28–53 Ui T (1998) Robustness of stochastic stability. Unpublished manuscript, Bank of Japan van Damme E, Weibull JW (2002) Evolution in games with endogenous mistake probabilities. J Econ Theory 106:296–315 Vega-Redondo F (1996) Evolution, games, and economic behaviour. Oxford University Press, Oxford Vega-Redondo F (1997) The evolution of Walrasian behavior. Econometrica 65:375–384 Vega-Redondo F (2007) Complex social networks. Cambridge University Press, Cambridge Volterra V (1931) Lecons sur la Theorie Mathematique de la Lutte pour la Vie. Gauthier–Villars, Paris von Neumann J, Morgenstern O (1944) Theory of games and economic behavior. Prentice–Hall, Princeton
Evolutionary Game Theory
216. Weibull JW (1995) Evolutionary game theory. MIT Press, Cambridge 217. Weibull JW (1996) The mass action interpretation. Excerpt from “The work of John Nash in game theory: Nobel Seminar, December 8, 1994”. J Econ Theory 69:165–171 218. Weissing FJ (1991) Evolutionary stability and dynamic stability in a class of evolutionary normal form games. In: Selten R (ed) Game Equilibrium Models I. Springer, Berlin, pp 29–97 219. Peyton Young H (1993) The evolution of conventions. Econometrica 61:57–84 220. Peyton Young H (1993) An evolutionary model of bargaining. J Econ Theory 59:145–168 221. Peyton Young H (1998) Conventional contracts. Review Econ Stud 65:773–792
222. Peyton Young H (1998) Individual strategy and social structure. Princeton University Press, Princeton 223. Peyton Young H (2001) The dynamics of conformity. In: Durlauf SN, Peyton Young H (eds) Social dynamics. Brookings Institution Press/MIT Press, Washington/Cambridge, pp 133– 153 224. Zeeman EC (1980) Population dynamics from game theory. In: Nitecki Z, Robinson C (eds) Global theory of dynamical systems (Evanston, 1979). number 819 in Lecture Notes in Mathematics. Springer, Berlin, pp 472–497 225. Zhang J (2004) A dynamic model of residential segregation. J Math Sociol 28:147–170 226. Zhang J (2004) Residential segregation in an all-integrationist world. J Econ Behav Organ 24:533–550
Evolution in Materio

SIMON HARDING¹, JULIAN F. MILLER²
¹ Department of Computer Science, Memorial University, St. John's, Canada
² Department of Electronics, University of York, Heslington, UK

Article Outline
Glossary
Definition of the Subject
Introduction
Evolutionary Algorithms
Evolution in Materio: Historical Background
Evolution in Materio: Defining Suitable Materials
Evolution in Materio Is Verified with Liquid Crystal
Evolution in Materio Using Liquid Crystal: Implementational Details
The Computational Power of Materials
Future Directions
Bibliography

Glossary
Evolutionary algorithm  A computer algorithm loosely inspired by Darwinian evolution.
Generate-and-test  The process of generating a potential solution to a computational problem and testing it to see how good a solution it is. The idea behind it is that no human ingenuity is employed to make good solutions more likely.
Genotype  A string of information that encodes a potential solution instance of a problem and allows its suitability to be assessed.
Evolution in materio  The method of applying computer controlled evolution to manipulate or configure a physical system.
Liquid crystal  Substances that have properties between those of a liquid and a crystal.

Definition of the Subject
Evolution in materio refers to the use of computers running search algorithms, called evolutionary algorithms, to find the values of variables that should be applied to material systems so that they carry out useful computation. Examples of such variables might be the location and magnitude of voltages that need to be applied to a particular physical system. Evolution in materio is a methodology for programming materials that utilizes physical effects that the human programmer need not be aware of.
It is a general methodology for obtaining analogue computation that is specific to the desired problem domain. Although a form of this methodology was hinted at in the work of Gordon Pask in the 1950s, it was not convincingly demonstrated until 1996 by Adrian Thompson, who showed that physical properties of a digital chip could be exploited by computer controlled evolution. This article describes the first demonstration that such a method can be used to obtain specific analogue computation in a non-silicon-based physical material (liquid crystal). The work is important for a number of reasons. Firstly, it proposes a general method for building analogue computational devices. Secondly, it explains how previously unknown physical effects may be utilized to carry out computations. Thirdly, it presents a method that can be used to discover useful physical effects that can form the basis of future computational devices.

Introduction

Physical Computation
Classical computation is founded on a mathematical model of computation based on an abstract (but physically inspired) machine called a Turing machine [1]. A Turing machine is a machine that can write or erase symbols on a possibly infinite one-dimensional tape. Its actions are determined by a table of instructions that determine what the machine will write on the tape and how it will move (one square left or right), given its state (stored in a state register) and the symbol on the tape. Turing showed that the calculations that could be performed on such a machine accord with the notion of computation in mathematics. The Turing machine is an abstraction (partly because it uses a possibly infinite tape), and to this day it is still not understood what limitations or extensions to the computational power of Turing's model might be possible using real physical processes.
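As an illustrative sketch (not from the original article), the instruction-table model described above can be written as a small simulator; the one-state machine below, which inverts a binary string, is invented purely for illustration:

```python
# Minimal Turing machine simulator: a sketch of the model described above.
# The instruction table maps (state, symbol) -> (write, move, next_state).
def run_turing_machine(table, tape, state, halt_state, max_steps=1000):
    tape = dict(enumerate(tape))  # sparse tape; unwritten cells read as blank '_'
    head = 0
    for _ in range(max_steps):
        if state == halt_state:
            break
        write, move, state = table[(state, tape.get(head, "_"))]
        tape[head] = write
        head += 1 if move == "R" else -1
    # Return the written portion of the tape, trimming blank cells.
    return "".join(tape[i] for i in sorted(tape)).strip("_")

# Hypothetical example: a one-state machine that flips every bit of its
# input and halts when it reaches a blank cell.
flip = {
    ("s", "0"): ("1", "R", "s"),
    ("s", "1"): ("0", "R", "s"),
    ("s", "_"): ("_", "R", "halt"),
}
print(run_turing_machine(flip, "1011", "s", "halt"))  # prints "0100"
```

The table-driven structure makes the abstraction concrete: the machine's entire "program" is the finite instruction table, while the tape supplies unbounded storage.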
Von Neumann and others at the Institute for Advanced Study at Princeton devised a design for a computer based on the ideas of Turing that has formed the foundation of modern computers. Modern computers are digital in operation. Although they are made of physical devices (i. e. transistors), computations are made on the basis of whether a voltage is above or below some threshold. Prior to the invention of digital computers there had been a variety of analogue computing machines. Some of these were purely mechanical (e. g. an abacus, a slide-rule, Charles Babbage's difference engine, Vannevar Bush's Differential Analyzer), but later computing machines were built using operational amplifiers [2]. There are many aspects of computation that were deliberately ignored by Turing in his model of computation.
For instance, speed, programmability, parallelism, openness and adaptivity are not considered. The speed at which an operation can be performed is clearly an important issue, since it would be of little use to have a machine that can calculate any computable function but takes an arbitrarily large amount of time to do so. Programmability is another issue of great importance. Writing programs directly in the form of instruction tables that could be used with a device based on a Turing machine is extremely tedious. This is why many high-level computer languages have been devised. The general issue of how to subdivide a computer program into a number of parallel executing processes so that the intended computation is carried out as quickly as possible is still unsolved. Openness refers to systems that can interact with an external environment during their operation. Openness is exhibited strongly in biological systems, where new resources can be added or removed either by an external agency or by the actions taken by the system itself. Adaptivity refers to the ability of systems to change their characteristics in response to an environment. In addition to these aspects, the extent to which the underlying physics affects both the abstract notion of computation and its tractability has been brought to prominence through the discovery of quantum computation, where Deutsch pointed out that Turing machines implicitly use assumptions based on physics [3]. He also showed that through 'quantum parallelism' certain computations could be performed much more quickly than on classical computers. Other forms of physical computation that have recently been explored are reaction-diffusion systems [4], DNA computing [5,6] and synthetic biology [7].
In the UK a number of Grand Challenges in computing research have been proposed [8]; in particular, 'Journeys in Non-Classical Computation' [9,10] seeks to explore, unify and generalize many diverse non-classical computational paradigms to produce a mature science of computation. Toffoli argued that 'Nothing Makes Sense in Computing Except in the Light of Evolution' [11]. He argues firstly that a necessary but not sufficient condition for a computation to have taken place is that a novel function is produced from a fixed and finite repertoire of components (i. e. logic gates, protein molecules). He suggests that a sufficient condition requires intention. That is to say, we cannot argue that computation has taken place unless a system has arisen for a higher purpose (this is why he insists on intention as being a prerequisite for computation). Otherwise, almost everything is carrying out some form of computation (which is not a helpful point of view). Thus a Turing machine does not carry out computations unless
it has been programmed to do so, and since natural evolution constructs organisms that have an increased chance of survival (the higher 'purpose') we can regard them as carrying out computations. It is in this sense that Toffoli points to the fundamental role of evolution in the definition of a computation, as it has provided animals with the ability to have intention. This brings us to one of the fundamental questions in computation: how can we program a physical system to perform a particular computation? The dominant method used to answer this question has been to construct logic gates and from these build a von Neumann machine (i. e. a digital computer). The mechanism that has been used to devise a computer program to carry out a particular computation is the familiar top-down design process, where ultimately the computation is represented using Boolean operations. According to Conrad, this process leads us to pay "The Price of Programmability" [12], whereby in conventional programming and design we proceed by excluding many of the processes that might lead us to solve the problem at hand. Natural evolution does not do this. It is noteworthy that natural evolution has constructed systems of extraordinary sophistication, complexity and computational power. We argue that it is not possible to construct computational systems of such power using a conventional methodology, and that complex software systems that directly utilize physical effects will require some form of search process akin to natural evolution, together with a way of manipulating the properties of materials. We suggest that some form of evolution ought to be an appropriate methodology for arriving at physical systems that compute. In this article we discuss work that has adopted this methodology. We call it evolution in materio.
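The overall methodology can be summarized as a sketch. Everything specific here is an illustrative assumption, not taken from the article: the electrode count, the voltage range, and especially the software stand-in for the material (`MATERIAL_WEIGHTS`); in a real experiment the fitness measurement would drive actual hardware rather than a simulated response.

```python
import random

NUM_ELECTRODES = 8

# Software stand-in for the physical material: a fixed random mapping from
# applied voltages to a measured response. In real evolution in materio this
# would be replaced by hardware that applies the voltages and records the
# material's behavior.
MATERIAL_WEIGHTS = [random.uniform(-1.0, 1.0) for _ in range(NUM_ELECTRODES)]

def random_genotype():
    # One voltage per configuration electrode, in an assumed -5 V to +5 V range.
    return [random.uniform(-5.0, 5.0) for _ in range(NUM_ELECTRODES)]

def measure_fitness(genotype, target=3.0):
    # Apply the configuration, stimulate the material, and score how close
    # its response is to the desired output (higher fitness is better).
    response = sum(v * w for v, w in zip(genotype, MATERIAL_WEIGHTS))
    return -abs(response - target)

def evolve(generations=100, pop_size=10, sigma=0.5):
    # A simple (1 + lambda) evolutionary loop, one of many possible choices:
    # keep the best configuration and fill the rest of the population with
    # mutated copies of it.
    population = [random_genotype() for _ in range(pop_size)]
    for _ in range(generations):
        best = max(population, key=measure_fitness)
        population = [best] + [
            [v + random.gauss(0.0, sigma) for v in best]
            for _ in range(pop_size - 1)
        ]
    return max(population, key=measure_fitness)

best_config = evolve()
```

The key point the sketch captures is that the search algorithm never needs a model of the material's internal physics; it only needs to apply configurations and measure how good the resulting behavior is.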
Evolutionary Algorithms

Firstly, we propose that to overcome the limitations of a top-down design process, we should use a less constrained design technique, one more akin to a process of generate-and-test. However, a guided search method is also required, one that spends more time in areas of the search space that confer favorable traits for computation. One such approach is the use of evolutionary algorithms. These algorithms are inspired by the Darwinian concepts of survival of the fittest and the genetic inheritance of information. Using a computer, a population of randomly generated solutions is systematically tested, selected and modified until a solution has been found [13,14,15]. As in nature, a genetic algorithm optimizes a population of individuals by selecting the ones that are best suited to solving a problem and allowing their genetic make-up
to propagate into future generations. It is typically guided only by the evolutionary process and often contains very limited domain-specific knowledge. Although these algorithms are bio-inspired, it is important that any analogies drawn with nature are considered only as analogies. Their lack of specialization for a problem makes genetic algorithms ideal search techniques when little is known about a problem. As long as a suitable representation is chosen, along with a fitness function that allows for ease of movement around a search space, a GA can search vast problem spaces rapidly. Another feature of their behavior is that, provided the genetic representation chosen is sufficiently expressive, the algorithm can explore potential solutions that are unconventional. A human designer normally has a set of predefined rules and strategies that they adopt to solve a problem. These preconceptions may prevent trying a new method, and may prevent the designer from using a better solution. A genetic algorithm does not necessarily require such domain knowledge. Evolutionary algorithms have been shown to be competitive with or to surpass human-designed solutions in a number of different areas. The largest conference on evolutionary computation, GECCO, has an annual session on evolutionary approaches that have produced human-competitive scientific and technological results. Moreover, the increasing computational power of computers makes such results ever more likely. Many different versions of genetic algorithms exist. Variations in representations and genetic operators change the performance characteristics of the algorithm, and, depending on the problem, people employ a variety of modifications of the basic algorithm. However, all the algorithms follow a similar basic set of steps.
Firstly, the numbers or physical variables that are required to define a potential solution have to be identified and encoded into a data representation that can be manipulated inside a computer program. This is referred to as the encoding step. The representation chosen is of crucial importance, as it is possible to inadvertently choose an overly constrained representation that limits the portions of the space of potential solutions that will be considered by the evolutionary algorithm. Generally the encoded information is referred to as a genotype, and genotypes are sometimes divided into a number of separate strings called chromosomes. Each entry in the chromosome string is an allele, and one or more of these make up a gene. The second step is to create inside the computer a number of independently generated genotypes whose alleles have been chosen with uniform probability from the allowed set of values. This collection of genotypes is called a population.
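A minimal sketch of the encoding and population-initialization steps just described; the binary representation, chromosome length and population size are illustrative choices, not taken from the article:

```python
import random

def random_genotype(length):
    # A genotype in its simplest form: a single binary chromosome,
    # each allele chosen with uniform probability from {0, 1}.
    return [random.randint(0, 1) for _ in range(length)]

def initial_population(pop_size, genotype_length):
    # A population of independently generated genotypes.
    return [random_genotype(genotype_length) for _ in range(pop_size)]

population = initial_population(pop_size=20, genotype_length=16)
```

For integer or floating-point representations, only `random_genotype` would change; the surrounding machinery stays the same, which is part of what makes the approach so general.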
In its most basic form, an individual genotype is a single chromosome made of 1s and 0s. However, it is also common to use integer and floating-point numbers if they are more appropriate for the task at hand. Combinations of different representations can also be used within the same chromosome, and that is the approach used in the work described in this article. Whatever representation is used, it should be able to adequately describe the individual and provide a mechanism by which its characteristics can be transferred to future generations without loss of information. Each of these individuals is then decoded into its phenotype, the outward, physical manifestation of the individual, and tested to see how well the candidate solution solves the problem at hand. This is usually returned as a number that is referred to as the fitness of the genotype. Typically it is this phase of a genetic algorithm that is the most time consuming. The next stage is to select which genetic information will proceed to the next generation. In nature the fitness function and selection are essentially the same – individuals that are better suited to the environment survive to reproduce and pass on their genes. In the genetic algorithm a procedure is applied to determine which information gets to proceed. Genetic algorithms are often generational: all of the old population is removed before moving to the next generation, whereas in nature this process is much less algorithmic. However, to increase the continuity of information between generations, some versions of the algorithm use elitism, where the fittest individuals are always selected for promotion to the next generation. This ensures that good solutions are not lost from the population, but it may have the side effect of causing the genetic information in the population to converge too quickly, so that the search stagnates on a sub-optimal solution. To generate the next population, a procedure analogous to sexual reproduction occurs.
For example, two individuals will be selected and will then have their genetic information combined to produce the genotype for the offspring. This process is called recombination or crossover. The genotype is split into sections at randomly selected points called crossover points. A "simple" GA has only one of these points; however, it is possible to perform this operation at multiple points. Sections of the two chromosomes are then put together to form a new individual. This individual shares some of the characteristics of both parents. There are many different ways to choose which members of the population to breed with each other; the aim in general is to try to ensure that fit individuals get to reproduce with other fit individuals. Individuals can be selected with a probability proportional to their relative fitness, or selected through some form of tournament, which may choose two or more chromosomes at random from the population and select the fittest. In natural recombination, errors occur when the DNA is split and combined. Errors in the DNA of a cell can also occur at any time under the influence of a mutagen, such as radiation, a virus or a toxic chemical. The genetic algorithm also has mutations: a number of alleles are selected at random and modified in some way. In a binary GA the bit may be flipped; in a real-numbered GA a random value may be added to or subtracted from the previous allele. Although GAs often have both mutation and crossover, it is possible to use mutation alone. A mutation-only approach has in some cases been demonstrated to work, and crossover is often seen as a macro-mutation operator, effectively changing large sections of a chromosome. After the previous operations have been carried out, the new individuals in the population are retested and their new fitness scores calculated. Eventually this process leads to an increase in the average fitness of the population, and so the population moves closer toward a solution. This cycle of test, select and reproduce is continued until a solution is found (or some other termination condition is reached), at which point the algorithm stops. The performance of a genetic algorithm is normally measured in terms of the number of evaluations required to find a solution of a given quality.

Evolution in Materio: Historical Background

It is arguable that 'evolution in materio' began in 1958 with the work of Gordon Pask, who carried out experiments to grow neural structures using electrochemical assemblages [16,17,18,19]. Gordon Pask's goal was to create a device sensitive to either sound or magnetic fields that could perform some form of signal processing – a kind of ear.
He realized he needed a system that was rich in structural possibilities, and chose to use a metal-salt solution. Using electric currents, wires can be made to self-assemble in an acidic aqueous metal-salt solution (e.g. ferrous sulphate). Changing the electric currents can alter the structure of these wires and their positions: the behavior of the system can be modified through external influence. Pask used an array of electrodes suspended in a dish containing the metal-salt solution, and by applying current (either transient or slowly changing) was able to grow iron wires that responded differently to two different frequencies of sound, 50 Hz and 100 Hz.
Evolution in Materio, Figure 1 Pask’s experimental set up for growing dendritic wires in ferrous sulphate solution [17]
Pask had developed a system whereby he could manually train the wire formation in such a way that no complete specification had to be given – a complete paradigm shift from previous engineering techniques, which would have dictated the position and behavior of every component in the system. His training technique relied on making changes to a set of resistors, updating the values with given probabilities – in effect a test, randomly modify, test cycle. We would today recognize this algorithm as a form of evolutionary, hill-climbing strategy, with the test stage as the fitness evaluation. In 1996 Adrian Thompson started what we might call the modern era of evolution in materio. He was investigating whether it was possible to build working electronic circuits using unconstrained evolution (effectively, generate-and-test) on a reconfigurable silicon chip called a Field Programmable Gate Array (FPGA). Carrying out evolution by defining configurations of actual hardware components is known as intrinsic evolution. This is quite possible using FPGAs, which are devices with a two-dimensional array of logic functions that a configuration bit string defines and connects together. Thompson had set himself the task of evolving a digital circuit that could discriminate between an applied 1 kHz or 10 kHz signal [20,21]. He found that computer-controlled evolution of the configuring bit strings could solve this problem relatively easily. However, when he analyzed the successful circuits he found to his surprise that they worked by utilizing subtle electrical properties of the silicon. Despite painstaking analysis and simulation work he was unable to explain how, or what property was being utilized. This lack of knowledge of how the system works prevents humans from designing systems that are intended to exploit these subtle and complex physical characteristics. However, it does not prevent exploitation through artificial evolution.
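Intrinsic evolution of the kind Thompson performed is, at its core, a configure-and-test loop over hardware configuration bit strings. The sketch below shows the idea as a simple generate-and-test hill climber; `apply_configuration` and `measure_response` are hypothetical placeholders for the real hardware interface (here simulated in software), not part of any published API.

```python
import random

def intrinsic_evolution(apply_configuration, measure_response, target,
                        bits=64, evaluations=2000):
    """Generate-and-test hill climbing over configuration bit strings,
    in the spirit of Thompson's unconstrained intrinsic evolution."""
    best = [random.randint(0, 1) for _ in range(bits)]
    apply_configuration(best)                 # program the device
    best_err = abs(measure_response() - target)
    for _ in range(evaluations):
        # Mutate roughly one bit of the best-so-far configuration.
        cand = [1 - b if random.random() < 1.0 / bits else b for b in best]
        apply_configuration(cand)
        err = abs(measure_response() - target)
        if err <= best_err:                   # keep equal-or-better candidates
            best, best_err = cand, err
    return best, best_err

# Stand-in "hardware": the response is just the fraction of bits set.
_state = {}
apply_cfg = lambda cfg: _state.update(cfg=cfg)
measure = lambda: sum(_state["cfg"]) / len(_state["cfg"])
best, err = intrinsic_evolution(apply_cfg, measure, target=1.0)
```

The crucial point is that the loop never needs a model of the device: only the configuration bits and the measured behavior are visible to the algorithm, which is exactly why evolved solutions may exploit physical effects the designer cannot explain.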
Since then a number of researchers have demonstrated the viability of intrinsic evolution in silicon devices [21,22,23,24,25,26,27]. The term evolution in materio was first coined by Miller and Downing [28]. They argued that the lesson to be drawn from the work of [21] is that evolution may be used to exploit the properties of a wider range of materials than silicon. In summary, evolution in materio can be described as: Exploitation, using an unconstrained evolutionary algorithm, of the non-linear properties of a malleable or programmable material to perform a desired function by altering its physical or electrical configuration. Evolution in materio is a subset of a research field known as evolvable hardware. It aims to exploit properties of physical systems with far fewer preconditions and constraints than is usual, and it deliberately tries to avoid paying Conrad's 'Price of Programmability'. However, to get access to physically rich systems, we may have to discard devices designed with human programming in mind. Such devices are often based on abstract idealizations of processes occurring in the physical world. For example, FPGAs are considered digital, but they are fundamentally analogue devices that have been constrained to behave in certain, human-understandable ways. This means that intrinsically complex physical processes are carefully manipulated to represent extremely simple effects (e.g. a rapid switch from one voltage level to another). Unconstrained evolution, as demonstrated by Thompson, allows the analogue properties of such devices to be effectively utilized. We would expect physically rich systems to exhibit non-linear properties – they will be complex systems – because physical systems generally have huge numbers of parts interacting in complex ways. Humans arguably have difficulty working with complex systems, and the use of evolution potentially enables us to overcome these limitations when dealing with such systems.
When systems are abstracted, the relationship to the physical world becomes more distant. This is highly convenient for human designers, who do not wish to understand, or work with, hidden or subtle properties of materials. Exploitation through evolution reduces the need for abstraction, as it appears evolution is capable of discovering and utilizing any physical effects it can find. The aim of this new methodology in computation is to evolve special-purpose computational processors. By directly exploiting physical systems and processes, one should be able to build extremely fast and efficient computational devices. It is our view that computer-controlled evolution is a universal methodology for doing this. Of course, von Neumann machines (i.e. digital computers) are individually universal, and this is precisely what confers their great utility in modern technology; however, this universality comes at a price. They ignore the rich computational possibilities of materials and try to create operations that are close to a mathematical abstraction. Evolution in materio is a universal methodology for producing specific, highly tuned computational devices. It is important not to underestimate the real practical difficulties associated with using an unconstrained design process. Firstly, the evolved behavior of the material may be extremely sensitive to the specific properties of the material sample, so each piece would require individual training. Thompson originally experienced this difficulty; however, in later work he showed that it was possible to evolve the configuration of FPGAs so that they produced reliable behavior in a variety of environmental conditions [29]. Secondly, the evolutionary algorithm may utilize physical aspects of any part of the training set-up. Both of these difficulties have already been experienced [21,23]. A third problem can be thought of as "the wiring problem": how to supply huge amounts of configuration data to a tiny sample. This problem is a fundamental one. It suggests that if we wish to exploit the full physical richness of materials we might have to allow the material to grow its own wires and be self-wiring. This has profound implications for intrinsic evolution: artificial hardware evolution requires complete reconfigurability, which implies that one would have to be able to "wipe clean" the evolved wiring and start again with a new artificial genotype. This might be possible by using nanoparticles that assemble into nanowires. These considerations bring us to an important issue in evolution in materio.
Namely, the problem of choosing suitable materials that can be exploited by computer-controlled evolution.

Evolution in Materio: Defining Suitable Materials

The obvious characteristic required of a candidate material is the ability to reconfigure it in some way. Liquid crystal, clay, salt solutions, etc. can be readily configured either electrically or mechanically; their physical state can be adjusted, and readjusted, by applying a signal or force. In contrast (excluding its electrical properties), the physical properties of an FPGA remain unchanged during configuration. It is also desirable to be able to bulk configure the system. It would be infeasible to configure every molecule in the material, so the material should support being reconfigured over large areas using a small amount of configuration data.
The material needs to perform some form of transformation (or computation) on incident signals that we apply. To do this, the material will have to interfere with the incident signal and modify it. We will need to be able to observe this modification in order to extract the result of the computation. To perform a nontrivial computation, the material should be capable of performing complex operations upon the signal. Such capabilities would be maximized if the system exhibited nonlinear behavior when interacting with input signals. In summary, we can say that for a material to be useful for evolution in materio it should have the following properties:
- It modifies incident signals in observable ways.
- The components of the system (i.e. the molecules within the material) interact with each other locally such that non-linear effects occur at either the local or global level.
- It is possible to configure the state of the material locally.
- It is possible to observe the state of the material, either as a whole or in one or more locations.
For practical reasons we can add that the material should be reconfigurable, and that changes in state should be temporary or reversible. Miller and Downing [28] identified a number of physical systems that have some, if not all, of these desirable properties. They identified liquid crystal as the most promising in this regard, as it is digitally writable, reconfigurable and works at a molecular level. Most interestingly, it is an example of mesoscopic organization. Some have argued that it is within such systems that emergent, organized behavior can occur [30]. Liquid crystals also exhibit the phenomenon of self-assembly. They form a class of substances that are being designed and developed in a field of chemistry called Supramolecular Chemistry [31]. This is a new and exciting branch of chemistry that can be characterized as 'the designed chemistry of the intermolecular bond'.
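The criteria listed above can be restated as a hypothetical programming interface for an evolvable material. The class and method names below are illustrative inventions, not from the article; the toy implementation merely mimics a saturating (and therefore non-linear) response so the interface can be exercised in software.

```python
import math
from abc import ABC, abstractmethod

class EvolvableMaterial(ABC):
    """The material criteria, phrased as an interface (names are illustrative)."""

    @abstractmethod
    def configure(self, region, value):
        """Locally (and reversibly) set the state of the material."""

    @abstractmethod
    def apply_signal(self, signal):
        """Drive the material with an incident signal."""

    @abstractmethod
    def observe(self, region=None):
        """Read the (possibly non-linearly transformed) response."""

    @abstractmethod
    def reset(self):
        """Undo all configuration changes before the next evaluation."""

class ToyMaterial(EvolvableMaterial):
    """A software stand-in: a saturating (hence non-linear) mixer."""
    def __init__(self):
        self.state, self._signal = {}, 0.0
    def configure(self, region, value):
        self.state[region] = value
    def apply_signal(self, signal):
        self._signal = signal
    def observe(self, region=None):
        # Non-linear interaction between configuration and signal.
        return math.tanh(self._signal + sum(self.state.values()))
    def reset(self):
        self.state.clear()
```

An evolutionary algorithm written against such an interface would be agnostic to whether the "material" behind it is liquid crystal, a conducting polymer, or a simulation.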
Supramolecular chemicals are in a permanent process of being assembled and disassembled. It is interesting to consider that conceptually liquid crystals appear to sit on the 'edge of chaos' [32], in that they are fluids (chaotic) that can, under certain circumstances, be ordered.

Liquid Crystal

Liquid crystal (LC) is commonly defined as a substance that can exist in a mesomorphic state [33,34]. Mesomorphic states have a degree of molecular order that lies between that of a solid crystal (long-range positional and orientational order) and a liquid, gas or amorphous solid (no long-range order). In LC there is long-range orientational order but no long-range positional order. LC tends to be transparent in the visible and near infrared and quite absorptive in the UV. There are three distinct types of LC: lyotropic, polymeric and thermotropic. Lyotropic LC is obtained when an appropriate amount of material is dissolved in a solvent. Most commonly it is formed from water and amphiphilic molecules: molecules with a hydrophobic part (water insoluble) and a hydrophilic part (strongly interacting with water). Polymeric LC is basically a polymer version of the aromatic LC discussed here. It is characterized by high viscosity and includes vinyls and Kevlar. Thermotropic LC (TLC) is the most common form and is widely used. TLCs exhibit various liquid crystalline phases as a function of temperature. They can be depicted as rod-like molecules that interact with each other in distinctive ordered structures. TLC exists in three main forms: nematic, cholesteric and smectic. In nematic LC the molecules are positioned randomly but share a common alignment axis. Cholesteric LC (or chiral nematic) is like nematic LC but has a chiral orientation. In smectic LC there is typically a layered, positionally disordered structure, and there are three types: A, B and C. In type A the molecules are oriented in alignment with the natural physical axes (i.e. normal to the glass container), whereas in type C the common molecular axis of orientation is at an angle to the container. LC molecules are typically dipolar; thus the organization of the molecular dipoles gives another order of symmetry to the LC. Normally the dipoles are randomly oriented. However, in some forms the natural molecular dipoles align with one another, giving rise to ferroelectric and ferrielectric forms. There is a vast range of different types of liquid crystal, and LCs of different types can be mixed.
LC can be doped (as in Dye-Doped LC) to alter its light absorption characteristics. Dye-Doped LC film has been made that is optically addressable and can undergo very large changes in refractive index [35]. There are Polymer-Dispersed Liquid Crystals, which can have tailored, electrically controlled light-refractive properties. Another interesting form of LC being actively investigated is Discotic LC. These have the form of disordered stacks (one-dimensional fluids) of disc-shaped molecules on a two-dimensional lattice. Although discotic LC is an electrical insulator, it can be made to conduct by doping with oxidants [36]. The oxidants are incorporated into the fluid hydrocarbon chain matrix (between the disks). LC is widely known for its use in electronic displays; however, there are in fact many non-display applications too. There are many applications of LC (especially ferroelectric LC) to electrically controlled light modulation: phase modulation, optical correlation, optical interconnects and switches, wavelength filters, and optical neural networks. In the latter case a ferroelectric LC is used to encode the weights in a neural network [37].

Conducting and Electroactive Polymers

Conducting polymer composites have been made that rapidly change their microwave reflection coefficient when an electric field is applied. When the field is removed, the composite reverts to its original state. Experiments have shown that the composite can change from one state to the other in the order of 100 ms [38]. Some polymers also exhibit electrochromism: these substances change their reflectance when a voltage is applied, which can be reversed by a change in voltage polarity [39]. Electroactive polymers [40] are polymers that change their volume with the application of an electric field. They are particularly interesting as voltage-controlled artificial muscle. Organic semiconductors also look promising, especially when some damage is introduced. Further details of the electronic properties of polymers and organic crystals can be found in [41].

Voltage Controlled Colloids

Colloids are suspensions of sub-micron-sized particles in a liquid. The phase behavior of colloids is not fully understood. Simple colloids can self-assemble into crystals, while multi-component suspensions can exhibit a rich variety of crystalline structures. There are also electrorheological fluids: suspensions of extremely fine nonconducting particles in an electrically insulating fluid. The viscosity of these fluids can be changed reversibly, by large factors, in response to an applied electric field, in times of the order of milliseconds [42]. Colloids can also be made in which the particles are charged, making them easily manipulable by suitably applied electric fields.
Even if the particles are not charged, they may be moved through the action of applied fields using a phenomenon known as dielectrophoresis: the motion of polarized but electrically uncharged particles in nonuniform electric fields [43]. In work that echoes the methods of Pask nearly four decades earlier, dielectrophoresis has been used to grow tiny gold wires through a process of self-assembly [44].

Langmuir–Blodgett Films

Langmuir–Blodgett films are molecular monolayers of organic material that can be transferred to a solid substrate [45]. They usually consist of hydrophilic heads and hydrophobic tails attached to the substrate. Multiple monolayers can be built, and films can be built with very accurate and regular thicknesses. By arranging an electrode layer above the film it seems feasible that the local electronic properties of the layers could be altered. These look like feasible systems whose properties might be exploitable through computer-controlled evolution of the applied voltages.

Kirchhoff–Lukasiewicz Machines

Evolution in Materio, Figure 2 Kirchhoff–Lukasiewicz Machine

Work by Mills [46,47] also demonstrates the use of materials in computation. He has designed an 'Extended Analog Computer' (EAC) that is a physical implementation of a Kirchhoff–Lukasiewicz Machine (KLM) [46]. The machines are composed of logical function units connected to a conductive medium, typically a conductive polymer sheet. The logical units implement Lukasiewicz logic, a type of multi-valued logic [47]. Figure 2 shows how the Lukasiewicz Logic Arrays (LLA) are connected to the conductive polymer. The LLAs bridge areas of the sheet together. The logic units measure the current at one point, perform a transformation and then apply a current source to the other end of the bridge. Computation is performed by applying current sinks and sources to the conductive polymer and reading the output from the LLAs. Different computations can be performed, determined by the location of the applied signals in the conducting sheet and the configuration of the LLAs. Hence, computation is performed by an interaction of the physics described by Kirchhoff's laws and the Lukasiewicz logic units. Together they form a physical device that can solve certain kinds of partial differential equations. Using this form of analogue computation, a large
number of these equations can be solved in nanoseconds – much faster than on a conventional computer. The speed of computation depends on the materials used and how they are interfaced to digital computers, but it is expected that silicon implementations will be capable of finding tens of millions of solutions to the equations per second. Examples of computation so far implemented in this system include robot control, control of a cyclotron beam [48], models of biological systems (including neural networks) [49] and radiosity-based image rendering. One of the most interesting features of these devices is the programming method. It is very difficult to understand the actual processes used by the system to perform computation, and until recently most of the reconfiguration has been done manually. This is difficult, as the system is not amenable to traditional software development approaches. However, evolutionary algorithms can be used to automatically define the parameters of the LLAs and the placement of current sinks and sources. By defining a suitable fitness function, the configuration of the EAC can be evolved, which removes the need for human interaction and for knowledge of the underlying system. Although such KLMs are clearly using the physical properties of a material to perform computation, the physical state of the material is not reconfigured (i.e., programmed); only the currents in the sheet are changed.

Evolution in Materio Is Verified with Liquid Crystal

Harding [50] has verified Miller's intuition about the suitability of liquid crystal as an evolvable material by demonstrating that it is relatively easy to configure liquid crystal to perform various forms of computation. In 2004, Harding constructed an analogue processor that utilizes the physical properties of liquid crystal for computation. He evolved the configuration of the liquid crystal to discriminate between square waves of many different frequencies.
This demonstrated, for the first time, that the principle of using computer-controlled evolution was a viable and powerful technique for using non-silicon materials for computation. The analogue processor consists of a passive liquid crystal display mounted on a reconfigurable circuit, known as an evolvable motherboard. The motherboard allows signals and configuration voltages to be routed to physical locations in the liquid crystal. Harding has shown that many different devices can be evolved in liquid crystal, including:
- Tone discriminator. A device was evolved in liquid crystal that could differentiate many different frequencies of square wave. The results were competitive with, if not superior to, those evolved in an FPGA.
- Logic gates. A variety of two-input logic gates were evolved, showing that liquid crystal can behave in a digital fashion. This indicates that liquid crystal is capable of universal computation.
- Robot controller. An obstacle avoidance system for a simple exploratory robot was evolved. The results were highly competitive, with solutions taking fewer evaluations to find compared to other work on evolved robot controllers.
One of the surprising findings in this work is that it turns out to be relatively easy to evolve the configuration of liquid crystal to solve tasks: only 40 generations of a modest population of configurations are required to evolve a very good frequency discriminator, compared to the thousands of generations required to evolve a similar circuit on an FPGA. This work has shown that evolving such devices in liquid crystal is easier than when using conventional components such as FPGAs, and it is a clear demonstration that evolutionary design can produce solutions that are beyond the scope of human design.

Evolution in Materio Using Liquid Crystal: Implementational Details

An evolvable motherboard (EM) [23] is a circuit that can be used to investigate intrinsic evolution. The EM is a reconfigurable circuit that rewires a circuit under computer control. Previous EMs have been used to evolve circuits containing electronic components [23,51]; however, they can also be used to evolve in materio by replacing the standard components with a candidate material. An EM is connected to an Evolvatron. This is essentially a PC that is used to control the evolutionary processes. The Evolvatron also has digital and analog I/O, and can be used to provide test signals and record the response of the material under evolution.
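A fitness function for a tone discriminator of the kind described could, for example, reward configurations whose measured output separates the two applied frequencies as widely as possible. The sketch below is a hypothetical illustration only; the article does not specify Harding's actual fitness function, and `apply_wave`/`measure_output` stand in for the Evolvatron's analog I/O.

```python
def discriminator_fitness(measure_output, apply_wave,
                          f_low=1000.0, f_high=10000.0, trials=10):
    """Score a configuration by how widely its mean output separates
    the two applied square-wave frequencies (larger = fitter)."""
    lows, highs = [], []
    for _ in range(trials):
        apply_wave(f_low)
        lows.append(measure_output())
        apply_wave(f_high)
        highs.append(measure_output())
    mean = lambda xs: sum(xs) / len(xs)
    return abs(mean(highs) - mean(lows))

# Demo with a stand-in "material" that already separates the two bands.
_f = {}
apply_wave = lambda f: _f.update(f=f)
measure_output = lambda: 1.0 if _f["f"] > 5000.0 else 0.0
score = discriminator_fitness(measure_output, apply_wave)
```

Averaging over several trials also gives the fitness function a crude sensitivity to the stability and repeatability issues discussed later in this article.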
The Liquid Crystal Evolvable Motherboard (LCEM) is a circuit that uses four cross-switch matrix devices to dynamically configure the circuits connecting to the liquid crystal. The switches are used to wire the 64 connections on the LCD to one of 8 external connections. The external connections are: input voltages, grounding, signals and connections to measurement devices. Each of the external connectors can be wired to any of the connections to the LCD. The external connections of the LCEM are connected to the Evolvatron's analogue inputs and outputs. One connection was assigned for the incident signal, one for measurement and the others for fixed voltages.

Evolution in Materio, Figure 3 Equipment configuration

Evolution in Materio, Figure 4 The LCEM

The value of the fixed voltages is determined by the evolutionary algorithm, but is constant throughout each evaluation. In these experiments the liquid crystal glass sandwich was removed from the display controller on which it was originally mounted, and placed on the LCEM. The display has a large number of connections (in excess of 200); however, because of PCB manufacturing constraints we are limited in the size of connection we can make, and hence in the number of connections. The LCD is therefore roughly positioned over the pads on the PCB, with many of the PCB pads touching more than one of the connectors on the LCD. This means that we are applying configuration voltages to several areas of LC at the same time. Unfortunately, neither the internal structure nor the electrical characteristics of the LCD are known. This raises
the possibility that a configuration may be applied that would damage the device. The wires inside the LCD are made of an extremely thin material that could easily be burnt out if too much current flows through them. To guard against this, each connection to the LCD is made through a 4.7 kOhm resistor, in order to provide protection against short circuits and to help limit the current in the LCD. The current supplied to the LCD is limited to 100 mA. The software controlling the evolution is also responsible for avoiding configurations that may endanger the device (such as short circuits). It is important to note that, other than the control circuitry for the switch arrays, there are no other active components on the motherboard – only analog switches, smoothing capacitors, resistors and the LCD are present.

Evolution in Materio, Figure 5 Schematic of LCEM
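As a sanity check on the protection scheme, Ohm's law gives the worst-case current through one 4.7 kOhm series resistor. The 5 V supply value below is an assumption for illustration only; the article does not state the configuration voltage used.

```python
R = 4.7e3   # protection resistor, ohms
V = 5.0     # assumed configuration voltage, volts (not stated in the article)
I = V / R   # Ohm's law: worst-case short-circuit current per connection
# Roughly 1 mA per connection, comfortably below the 100 mA LCD limit.
```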
Stability and Repeatability Issues

When the liquid crystal display is observed while solving a problem, it can be seen that some regions of the display go dark, indicating that the local molecular direction has changed. This means that the configuration of the liquid crystal is changing while signals are being applied. To draw an analogy with circuit design, it is as if the incident signals were changing component values or the circuit topology, which would have an effect on the behavior of the system. This is likely to be detrimental to the measured performance of the circuit. When a solution is evolved, the fitness function automatically measures its stability over the period of the evaluation. Changes made by the incident signals can be considered part of the genotype-phenotype mapping. Solutions that cannot cope with their initial configurations being altered will achieve a low score. However, the fitness function cannot measure the behavior beyond the end of the evaluation time. Therein lies the difficulty: in evolution in materio, long-term stability cannot be guaranteed. Another issue concerns repeatability. When a configuration is applied to the liquid crystal, the molecules are unlikely to go back to exactly where they were when this configuration was tried previously. Assuming that there is a strong correlation between genotype and phenotype, it is likely that evolution will cope with this extra noise. However, if evolved devices are to be useful, one needs to be sure that previously evolved devices will function in the same way as they did when originally evolved. In [27] it is noted that the behavior of circuits evolved intrinsically can be influenced by previous configurations; therefore their behavior (and hence fitness) depends not only on the currently evaluated individual's configuration but on those that came before. It is worth noting that this is precisely what happens in natural evolution.
For example, in a circuit, capacitors may still hold charge from a previously tested circuit. This charge would then affect the circuit's operation; however, if the circuit were tested again with no stored charge, a different behavior would be expected and a different fitness score would be obtained. Not only does this affect the ability to evolve circuits, it would also mean that some circuits are not valid: without the influence of the previously evaluated circuits, the current solution may not function as expected. It is expected that such problems will have analogies in evolution in materio. The configurations are likely to be highly sensitive to initial conditions (i.e. conditions introduced by previous configurations).

Dealing with Environmental Issues

A major problem when working with intrinsic evolution is
separating out the computation allegedly being carried out by the target device from that actually done by the material being used. For example, whilst trying to evolve an oscillator, Bird and Layzell discovered that evolution was using part of the circuit as a radio antenna, picking up emissions from the environment [22]. Layzell also found that evolved circuits were sensitive to whether or not a soldering iron was plugged in (not even switched on) in another part of the room [23]! An evolved device is not useful if it is highly sensitive to its environment in unpredictable ways, and it will not always be clear what environmental effects the system is using. It would be unfortunate to evolve a device for use in a spacecraft, only to find that it fails to work once out of range of a local radio tower! To minimize these risks, we will need to check the operation of evolved systems under different conditions. We will need to test the behavior of a device using a different set-up in a different location. It will be important to know whether a particular configuration only works with one particular sample of a given material.

The Computational Power of Materials

In [52], Lloyd argued that the theoretical computing power of a kilogram of material is far more than is possible with a kilogram of traditional computer. He notes that computers are subject to the laws of physics, and that these laws place limits on the maximum speed at which they can operate and the amount of information they can process. Lloyd shows that if we were able to exploit a material fully, we would get an enormous increase in computing power. For example, with 1 kg of matter we should be able to perform roughly 5 × 10^50 operations per second and store 10^31 bits. Amazingly, contemporary quantum computers do operate near these theoretical limits [52]. A small amount of material also contains a large number of components (regardless of whether we consider the molecular or atomic scale).
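The figure of roughly 5 × 10^50 operations per second follows from the Margolus–Levitin theorem, which bounds a system of average energy E to at most 2E/(πℏ) elementary operations per second; taking E as the rest-mass energy of 1 kg of matter reproduces Lloyd's number:

```python
import math

hbar = 1.0546e-34      # reduced Planck constant, J*s
c = 2.998e8            # speed of light, m/s
E = 1.0 * c ** 2       # rest-mass energy of 1 kg of matter, ~9.0e16 J
ops_per_second = 2 * E / (math.pi * hbar)   # ~5.4e50 operations per second
```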
This leads to some interesting thoughts. If we can exploit materials at this level, we would be able to do a vast amount of computation in a small volume. A small size also hints at low power consumption, as less energy has to be spent to perform an operation. Many components also provide a mechanism for reliability through redundancy. A particularly interesting observation, especially when considered in terms of non-von-Neumann computation, is the massive parallelism we may be able to achieve. The reason that systems such as quantum, DNA and chemical computation can operate so quickly is that many operations are performed at the same time. A programmable material might be capable of performing vast numbers of tasks simultaneously, and therefore provide a computational advantage. In commercial terms, small is often synonymous with low cost. It may be possible to construct devices using cheaply available materials. Reliability may not be an issue, as the systems could be evolved to be massively fault tolerant using their intrinsic redundancy. Evolution is capable of producing novel designs. Koza has already rediscovered circuits that infringe on recent patents, and his genetic programming method has ‘invented’ brand new circuit designs [53]. Evolving in materio could produce many novel designs, and indeed, given the infancy of programmable materials, all designs may be unique and hence patentable.

Future Directions

The work described here concerning liquid crystal computational devices is at an early stage. We have merely demonstrated that it is possible to evolve configurations of voltages that allow a material to perform desired computations. Any application that ensues from this work is unlikely to be a replacement for a simple electronic circuit. We can design and build those very successfully. What we have difficulty with is building complex, fault-tolerant systems for performing complex computation. It appears that nature managed to do this. It used a simple, repetitive process of test and modify, and it did this in a universe of unimaginable physical complexity. If nature can exploit the physical properties of a material and its surroundings through evolution, then so should we. There are many important issues that remain to be addressed. Although we have made some suggestions about materials worthy of investigation, it is at present unclear which materials are most suitable. An experimental platform needs to be constructed that allows many materials to be tried and investigated. The use of microelectrode arrays in a small volume container would allow this.
This would also have the virtue of allowing internal signals in the materials to be inspected and potentially understood. We need materials that are rapidly configurable. They must not be fragile and sensitive to minute changes in physical setup. They must be capable of maintaining themselves in a stable configuration. The materials should be complex and allow us to carry out difficult computations more easily than by conventional means. One would like materials that can be packaged into small volumes. It should be relatively easy to interface with the materials. So far, material systems have been configured by applying a constant configuration pattern; however, this may not be appropriate for all systems. It may be necessary to put the physical system under some form of responsive control, in order to program and then keep the behavior stable. We may or may not know if a particular material can be used to perform some form of computation. However, we can treat our material as a “black box” and, using evolution as a search technique, automatically discover what, if any, computations our black box can perform. The first step is to build an interface that will allow us to communicate with a material. Then we will use evolution to find a configuration we can apply using this platform, and then attempt to find a mapping from a given problem to an input suitable for that material, and a mapping from the material's response to an output. If this is done correctly, we might be able to tell automatically whether a material can perform computation, and then classify the computation. When we evolve in materio, using mappings evolved in software, how can we tell when the material is giving us any real benefit? The lesson of evolution in materio has been that the evolved systems can be very difficult to analyze, and the principal obstacle to the analysis is the problem of separating out the computational role that each component plays in the evolved system. These issues are by no means just a problem for evolution in materio. They may be an inherent part of complex evolved systems. Certainly the understanding of biological systems is providing immense challenges to scientists. The single most important aspect that suggests that evolution in materio has a future is that natural evolution has produced immensely sophisticated material computational systems. It would seem foolish to ignore this and merely try to construct computational devices that operate according to one paradigm of computation (i.e., Turing). Oddly enough, it is precisely the sophistication of the latter that allows us to attempt the former.

Bibliography

Primary Literature

1.
Turing AM (1936) On computable numbers, with an application to the Entscheidungsproblem. Proc Lond Math Soc 42(2):230–265 2. Bissell C (2004) A great disappearing act: the electronic analogue computer. In: IEEE Conference on the History of Electronics, 28–30 June 3. Deutsch D (1985) Quantum theory, the Church–Turing principle and the universal quantum computer. Proc Royal Soc Lond A 400:97–117 4. Adamatzky A, Costello BDL, Asai T (2005) Reaction-Diffusion Computers. Elsevier, Amsterdam 5. Adleman LM (1994) Molecular computation of solutions to combinatorial problems. Science 266(11):1021–1024 6. Amos M (2005) Theoretical and Experimental DNA Computation. Springer, Berlin
7. Weiss R, Basu S, Hooshangi S, Kalmbach A, Karig D, Mehreja R, Netravali I (2003) Genetic circuit building blocks for cellular computation, communications, and signal processing. Nat Comput 2(1):47–84 8. UK Computing Research Committee (2005) Grand challenges in computer research. http://www.ukcrc.org.uk/grand_challenges/ 9. Stepney S, Braunstein SL, Clark JA, Tyrrell A, Adamatzky A, Smith RE, Addis T, Johnson C, Timmis J, Welch P, Milner R, Partridge D (2005) Journeys in non-classical computation I: A grand challenge for computing research. Int J Parallel Emerg Distrib Syst 20(1):5–19 10. Stepney S, Braunstein S, Clark J, Tyrrell A, Adamatzky A, Smith R, Addis T, Johnson C, Timmis J, Welch P, Milner R, Partridge D (2006) Journeys in non-classical computation II: Initial journeys and waypoints. Int J Parallel Emerg Distrib Syst 21(2):97–125 11. Toffoli T (2005) Nothing makes sense in computing except in the light of evolution. Int J Unconv Comput 1(1):3–29 12. Conrad M (1988) The price of programmability. In: The Universal Turing Machine, pp 285–307 13. Goldberg D (1989) Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, Massachusetts 14. Holland J (1992) Adaptation in Natural and Artificial Systems, 2nd edn. MIT Press, Cambridge 15. Mitchell M (1996) An Introduction to Genetic Algorithms. MIT Press, Cambridge, MA, USA 16. Pask G (1958) Physical analogues to the growth of a concept. In: Mechanization of Thought Processes. Symposium 10, National Physical Laboratory, pp 765–794 17. Pask G (1959) The natural history of networks. In: Proceedings of International Tracts in Computer Science and Technology and their Application, vol 2, pp 232–263 18. Cariani P (1993) To evolve an ear: epistemological implications of Gordon Pask's electrochemical devices. Syst Res 3:19–33 19. Pickering A (2002) Cybernetics and the mangle: Ashby, Beer and Pask. Soc Stud Sci 32:413–437 20.
Thompson A, Harvey I, Husbands P (1996) Unconstrained evolution and hard consequences. In: Sanchez E, Tomassini M (eds) Towards Evolvable Hardware: The evolutionary engineering approach. LNCS, vol 1062. Springer, Berlin, pp 136–165 21. Thompson A (1996) An evolved circuit, intrinsic in silicon, entwined with physics. ICES, pp 390–405 22. Bird J, Layzell P (2002) The evolved radio and its implications for modelling the evolution of novel sensors. In: Proceedings of Congress on Evolutionary Computation, pp 1836–1841 23. Layzell P (1998) A new research tool for intrinsic hardware evolution. In: Proceedings of the Second International Conference on Evolvable Systems: From Biology to Hardware. LNCS, vol 1478. Springer, Berlin, pp 47–56 24. Linden DS, Altshuler EE (2001) A system for evolving antennas in-situ. In: 3rd NASA/DoD Workshop on Evolvable Hardware. IEEE Computer Society, pp 249–255 25. Linden DS, Altshuler EE (1999) Evolving wire antennas using genetic algorithms: A review. In: 1st NASA/DoD Workshop on Evolvable Hardware. IEEE Computer Society, pp 225–232 26. Stoica A, Zebulum RS, Guo X, Keymeulen D, Ferguson MI, Duong V (2003) Silicon validation of evolution-designed circuits. In: Proceedings of the NASA/DoD Conference on Evolvable Hardware, pp 21–25
27. Stoica A, Zebulum RS, Keymeulen D (2000) Mixtrinsic evolution. In: Proceedings of the Third International Conference on Evolvable Systems: From Biology to Hardware (ICES 2000). LNCS, vol 1801. Springer, Berlin, pp 208–217 28. Miller JF, Downing K (2002) Evolution in materio: Looking beyond the silicon box. In: Proceedings of NASA/DoD Evolvable Hardware Workshop, pp 167–176 29. Thompson A (1998) On the automatic design of robust electronics through artificial evolution. In: Sipper M, Mange D, Pérez-Uribe A (eds) Evolvable Systems: From Biology to Hardware, vol 1478. Springer, New York, pp 13–24 30. Laughlin RB, Pines D, Schmalian J, Stojkovic BP, Wolynes P (2000) The middle way. Proc Natl Acad Sci 97(1):32–37 31. Lindoy LF, Atkinson IM (2000) Self-assembly in Supramolecular Systems. Royal Society of Chemistry 32. Langton C (1991) Computation at the edge of chaos: Phase transitions and emergent computation. In: Emergent Computation. MIT Press, pp 12–37 33. Demus D, Goodby JW, Gray GW, Spiess HW, Vill V (eds) (1998) Handbook of Liquid Crystals, vol 4. Wiley-VCH, ISBN 3-527-29502-X, 2180 pp 34. Khoo IC (1995) Liquid Crystals: physical properties and nonlinear optical phenomena. Wiley 35. Khoo IC, Slussarenko S, Guenther BD, Shih MY, Chen P, Wood WV (1998) Optically induced space-charge fields, dc voltage, and extraordinarily large nonlinearity in dye-doped nematic liquid crystals. Opt Lett 23(4):253–255 36. Chandrasekhar S (1998) Columnar, discotic nematic and lamellar liquid crystals: Their structure and physical properties. In: Handbook of Liquid Crystals, vol 2B. Wiley-VCH, pp 749–780 37. Crossland WA, Wilkinson TD (1998) Nondisplay applications of liquid crystals. In: Handbook of Liquid Crystals, vol 1. Wiley-VCH, pp 763–822 38. Wright PV, Chambers B, Barnes A, Lees K, Despotakis A (2000) Progress in smart microwave materials and structures. Smart Mater Struct 9:272–279 39. Mortimer RJ (1997) Electrochromic materials.
Chem Soc Rev 26:147–156 40. Bar-Cohen Y (2001) Electroactive Polymer (EAP) Actuators as Artificial Muscles – Reality, Potential and Challenges. SPIE Press 41. Pope M, Swenberg CE (1999) Electronic Processes of Organic Crystals and Polymers. Oxford University Press, Oxford 42. Hao T (2005) Electrorheological Fluids: The Non-aqueous Suspensions. Elsevier Science 43. Khusid B, Acrivos A (1996) Effects of interparticle electric interactions on dielectrophoresis in colloidal suspensions. Phys Rev E 54(5):5428–5435 44. Hermanson KD, Lumsdon SO, Williams JP, Kaler EW, Velev OD (2001) Science 294:1082–1086 45. Petty MC (1996) Langmuir–Blodgett Films: An Introduction. Cambridge University Press, Cambridge 46. Mills JW (1995) Polymer processors. Technical Report TR580, Department of Computer Science, University of Indiana 47. Mills JW, Beavers MG, Daffinger CA (1989) Lukasiewicz logic arrays. Technical Report TR296, Department of Computer Science, University of Indiana 48. Mills JW (1995) Programmable VLSI extended analog computer for cyclotron beam control. Technical Report TR441, Department of Computer Science, University of Indiana
49. Mills JW (1995) The continuous retina: Image processing with a single sensor artificial neural field network. Technical Report TR443, Department of Computer Science, University of Indiana 50. Harding S, Miller JF (2004) Evolution in materio: A tone discriminator in liquid crystal. In: Proceedings of the Congress on Evolutionary Computation 2004 (CEC 2004), vol 2, pp 1800–1807 51. Crooks J (2002) Evolvable analogue hardware. MEng project report, The University of York 52. Lloyd S (2000) Ultimate physical limits to computation. Nature 406:1047–1054 53. Koza JR (1999) Human-competitive machine intelligence by means of genetic algorithms. In: Booker L, Forrest S, Mitchell M, Riolo R (eds) Festschrift in honor of John H Holland. Center for the Study of Complex Systems, Ann Arbor, pp 15–22
Books and Reviews Analog Computer Museum and History Center: Analog Computer Reading List. http://dcoward.best.vwh.net/analog/readlist.htm Bringsjord S (2001) In Computation, Parallel is Nothing, Physical Everything. Minds and Machines 11(1)
Feynman RP (2000) Feynman Lectures on Computation. Perseus Books Group Fifer S (1961) Analogue computation: theory, techniques, and applications. McGraw-Hill, New York Greenwood GW, Tyrrell AM (2006) Introduction to Evolvable Hardware: A Practical Guide for Designing Self-Adaptive Systems. Wiley-IEEE Press Hey AJG (ed) (2002) Feynman and Computation. Westview Press Penrose R (1989) The Emperor's New Mind, Concerning Computers, Minds, and the Laws of Physics. Oxford University Press, Oxford Piccinini G: The Physical Church–Turing Thesis: Modest or Bold? http://www.umsl.edu/~piccininig/CTModestorBold5.htm Raichman N, Ben-Jacob N, Segev R (2003) Evolvable Hardware: Genetic Search in a Physical Realm. Phys A 326:265–285 Sekanina L (2004) Evolvable Components: From Theory to Hardware Implementations, 1st edn. Springer, Heidelberg Siegelmann HT (1999) Neural Networks and Analog Computation, Beyond the Turing Limits. Birkhauser, Boston Sienko T, Adamatzky A, Rambidi N, Conrad M (2003) Molecular Computing. MIT Press Thompson A (1999) Hardware Evolution: Automatic Design of Electronic Circuits in Reconfigurable Hardware by Artificial Evolution, 1st edn. Springer, Heidelberg
Evolving Cellular Automata
MARTIN CENEK, MELANIE MITCHELL
Computer Science Department, Portland State University, Portland, USA

Article Outline

Glossary
Definition of the Subject
Introduction
Cellular Automata
Computation in CAs
Evolving Cellular Automata with Genetic Algorithms
Previous Work on Evolving CAs
Coevolution
Other Applications
Future Directions
Acknowledgments
Bibliography

Glossary

Cellular automaton (CA) Discrete-space and discrete-time spatially extended lattice of cells connected in a regular pattern. Each cell stores its state and a state-transition function. At each time step, each cell applies the transition function to update its state based on its local neighborhood of cell states. The update of the system is performed in synchronous steps, i.e., all cells update simultaneously.

Cellular programming A variation of genetic algorithms designed to simultaneously evolve state-transition rules and local neighborhood connection topologies for non-homogeneous cellular automata.

Coevolution An extension to the genetic algorithm in which candidate solutions and their “environment” (typically test cases) are evolved simultaneously.

Density classification A computational task for binary CAs: the desired behavior for the CA is to iterate to an all-1s configuration if the initial configuration has a majority of cells in state 1, and to an all-0s configuration otherwise.

Genetic algorithm (GA) A stochastic search method inspired by the Darwinian model of evolution. A population of candidate solutions is evolved by reproduction with variation, followed by selection, for a number of generations.

Genetic programming A variation of genetic algorithms that evolves genetic trees.

Genetic tree Tree-like representation of a transition function, used by the genetic programming algorithm.

Lookup table (LUT) Fixed-length table representation of a transition function.

Neighborhood Pattern of connectivity specifying to which other cells each cell is connected.

Non-homogeneous cellular automaton A CA in which each cell can have its own distinct transition function and local neighborhood connection pattern.

Ordering A computational task for one-dimensional binary CAs with fixed boundaries: the desired behavior is for the CA to iterate to a final configuration in which all initial 0 states migrate to the left-hand side of the lattice and all initial 1 states migrate to the right-hand side.

Particle Periodic, temporally coherent boundary between two regular domains in a set of successive CA configurations. Particles can be interpreted as carrying information about the neighboring domains. Collisions between particles can be interpreted as the processing of information, with the resulting information carried by new particles formed by the collision.

Regular domain Region defined by a set of successive CA configurations that can be described by a simple regular language.

Synchronization A computational task for binary CAs: the desired behavior for the CA is to iterate to a temporal oscillation between two configurations: all cells in state 1 and all cells in state 0.

Transition function Maps a local neighborhood of cell states to an update state for the center cell of that neighborhood.

Definition of the Subject

Evolving cellular automata refers to the application of evolutionary computation methods to evolve cellular automata transition rules. This has been used as one approach to automatically “programming” cellular automata to perform desired computations, and as an approach to model the evolution of collective behavior in complex systems.
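To make the density classification task concrete, the “local majority voting” lookup table discussed later in this article can be written down directly. A minimal sketch in Python (the function name is my own, not from the article):

```python
def majority_lut(r):
    """LUT for local majority voting on a binary 1-D CA: the output bit is 1
    iff the (2r+1)-cell neighborhood configuration contains a majority of 1s."""
    n = 2 * r + 1
    return [1 if bin(idx).count("1") > n // 2 else 0 for idx in range(2 ** n)]

lut = majority_lut(3)                  # radius 3, as in the experiments below
assert len(lut) == 2 ** 7              # one output bit per neighborhood
assert lut[0] == 0 and lut[-1] == 1    # all-0s and all-1s neighborhoods
```

All-0s and all-1s lattices are fixed points of this rule; as discussed later in the article, however, the rule fails to classify most mixed initial configurations.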
Introduction In recent years, the theory and application of cellular automata (CAs) has experienced a renaissance, due to advances in the related fields of reconfigurable hardware, sensor networks, and molecular-scale computing systems. In particular, architectures similar to CAs can be used to construct physical devices such as field configurable
gate arrays for electronics, networks of robots for environmental sensing, and nano-devices embedded in interconnect fabric for fault-tolerant nanoscale computing. Such devices consist of networks of simple components that communicate locally without centralized control. Two major areas of research on such networks are (1) programming – how to construct and configure the locally connected components such that they will collectively perform a desired task; and (2) computation theory – what types of tasks are such networks able to perform efficiently, and how does the configuration of components affect the computational capability of these networks? This article describes research into one particular automatic programming method: the use of genetic algorithms (GAs) to evolve cellular automata to perform desired tasks. We survey some of the leading approaches to evolving CAs with GAs, and discuss some of the open problems in this area.

Cellular Automata

A cellular automaton (CA) is a spatially extended lattice of locally connected simple processors (cells). CAs can be used both to model physical systems and to perform parallel distributed computations. In a CA, each cell maintains a discrete state and a transition function that maps the cell's current state to its next state. This function is often represented as a lookup table (LUT). The LUT stores all possible configurations of a cell's local neighborhood, which consists of its own current state and the states of its neighboring cells. Change of state is performed in discrete time steps: the entire lattice is updated synchronously. There are many possible definitions of a neighborhood, but here we will define a neighborhood as the cell to be updated together with the cells within distance r of it.
The number of entries in the LUT will be s^N, where s is the number of possible states and N is the total number of cells in the neighborhood: (2r + 1)^d for a square-shaped neighborhood in a d-dimensional lattice, also known as a Moore neighborhood. CAs are typically given periodic boundary conditions, which treat the lattice as a torus. To transform a cell's state, the values of the cell's state and those of its neighbors are encoded as a lookup index into the LUT, which stores a value representing the cell's new state (Fig. 1, left) [8,16,59]. For the scope of this article, we will focus on homogeneous binary CAs, meaning that all cells in the CA have the same LUT and each cell has one of two possible states, s ∈ {0, 1}. Figure 1 shows the mechanism of updates in a homogeneous one-dimensional two-state CA with neighborhood radius r = 1.
Evolving Cellular Automata, Figure 1 Left top: A one-dimensional neighborhood of three cells (radius 1): Center cell, West neighbor, and East neighbor. Left middle: A sample lookup table in which all possible neighborhood configurations are listed, along with the update state for the center cell in each neighborhood. Left bottom: Mechanism of update in a one-dimensional binary CA of length 13: t0 is the initial configuration at time 0 and t1 is the configuration at the next time step. Right: The sequence of synchronous updates starting at the initial state t0 and ending at state t9
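The update mechanism of Figure 1 can be sketched in a few lines of Python (my own naming, with periodic boundary conditions; the neighborhood is read west-to-east as a binary index into the LUT):

```python
def neighborhood_index(config, i, r):
    """Encode the neighborhood of cell i, read west to east, as a LUT index."""
    idx = 0
    for j in range(-r, r + 1):
        idx = (idx << 1) | config[(i + j) % len(config)]  # periodic boundaries
    return idx

def step(config, lut, r=1):
    """One synchronous update: all cells read their neighborhoods first,
    then every cell switches to its LUT output simultaneously."""
    return [lut[neighborhood_index(config, i, r)] for i in range(len(config))]

# Example LUT for r = 1: output 1 iff any cell in the neighborhood is 1.
lut = [0] + [1] * 7
print(step([0, 0, 1, 0, 0], lut))   # -> [0, 1, 1, 1, 0]: the 1 spreads

# The encoding matches the article's later example (Fig. 3): 11010 -> 26.
assert neighborhood_index([1, 1, 0, 1, 0], 2, r=2) == 26
```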
CAs were invented in the 1940s by Stanislaw Ulam and John von Neumann. Ulam used CAs as a mathematical abstraction to study the growth of crystals, and von Neumann used them as an abstraction of a physical system with the concepts of a cell, state and transition function in order to study the logic of self-reproducing systems [8,11,55]. Von Neumann’s seminal work on CAs had great significance. Science after the industrial revolution was primarily concerned with energy, force and motion, but the concept of CAs shifted the focus to information processing, organization, programming, and most importantly, control [8]. The universal computational ability of CAs was realized early on, but harnessing this power continues to intrigue scientists [8,11,32,55]. Computation in CAs In the early 1970s John Conway published a description of his deceptively simple Game of Life CA [18]. Conway proved that the Game of Life, like von Neumann’s self-reproducing automaton, has the power of a universal Turing machine: any program that can be run on a Turing machine can be simulated by the Game of Life with the appropriate initial configuration of states. This initial configuration (IC) encodes both the input and the program to be run on that input. It is interesting that so simple a CA as the Game of Life (as well as even simpler CAs – see chapter 11 in [60]) has the power of a universal computer. However, the actual application of CAs as universal
Evolving Cellular Automata, Figure 2 Two space-time diagrams illustrating the behavior of the “naïve” local majority voting rule, with lattice size N = 149, neighborhood radius r = 3, and number of time steps M = 149. Left: initial configuration has a majority of 0s. Right: initial configuration has a majority of 1s. Individual cells are colored black for state 1 and white for state 0. (Reprinted from [37] with permission of the author.)
computers is, in general, impractical due to the difficulty of encoding a given program and input as an IC, as well as very long simulation times. An alternative use of CAs as computers is to design a CA to perform a particular computational task. In this case, the initial configuration is the input to the program, the transition function corresponds to the program performing the specific task, and some set of final configurations is interpreted as the output of the computation. The intermediate configurations comprise the actual computation being done. Examples of tasks for which CAs have been designed include location management in mobile computing networks [50], classification of initial configuration densities [38], pseudo-random number generation [51], multiagent synchronization [47], image processing [26], simulation of growth patterns of material microstructures [5], chemical reactions [35], and pedestrian dynamics [45]. The problem of designing a CA to perform a task requires defining a cell's local neighborhood and boundary conditions, and constructing a transition function for cells that produces the desired input-output mapping. Given a CA's states, neighborhood radius, boundary conditions, and initial configuration, it is the LUT values that must be set by the “programmer” so that the computation will be performed correctly over all inputs. In order to study the application of genetic algorithms to designing CAs, substantial experimentation has been done using the density classification (or majority classification) task. Here, “density” refers to the fraction of 1s in the initial configuration. In this task, a binary-state CA must iterate to an all-1s configuration if the initial configuration has a majority of cells in state 1, and iterate to an all-0s configuration otherwise. The maximum time allowed for completing this computation is a function of the lattice size. One “naïve” solution for designing the LUT for this task would be local majority voting: set the output bit to 1 for all neighborhood configurations with a majority of 1s, and to 0 otherwise. Figure 2 gives two space-time diagrams illustrating the behavior of this LUT in a one-dimensional binary CA with N = 149 and r = 3, where N denotes the number of cells in the lattice and r is the neighborhood radius. Each diagram shows an initial configuration of 149 cells (horizontal) iterating over 149 time steps (vertical, down the page). The left-hand diagram has an initial configuration with a majority of 0 (white) cells, and the right-hand diagram has an initial configuration with a majority of 1 (black) cells. In neither case does the CA produce the “correct” global behavior: an all-0s configuration for the left diagram and an all-1s configuration for the right diagram. This illustrates the general fact that human intuition often fails when trying to capture emergent collective behavior by manipulating the individual bits of the lookup table that reflect the settings of the local neighborhood.

Evolving Cellular Automata with Genetic Algorithms

Genetic Algorithms (GAs) are a group of stochastic search algorithms, inspired by the Darwinian model of evolution, that have proved successful for solving various difficult problems [3,4,36]. A GA works as follows: (1) A population of individuals (“chromosomes”) representing candidate solutions to a given problem is initially generated at random. (2) The fitness of each individual is calculated as a function of its quality as a solution. (3) The fittest individuals are then selected to be the parents of a new generation of candidate solutions. Offspring are created from parents via copying, random mutation, and crossover. Once a new generation of individuals is created, the process returns to step two. This entire process is iterated for some number of generations, and the result is (hopefully) one or more highly fit individuals that are good solutions to the given problem. GAs have been used by a number of groups to evolve LUTs for binary CAs [2,10,14,15,34,43,47,51]. The individuals in the GA population are LUTs, typically encoded as binary strings. Figure 3 shows the mechanism for encoding a particular neighborhood configuration as an index into the LUT. For example, the decimal value of the neighborhood 11010 is 26, so the update value for that neighborhood's center cell is retrieved from position 26 in the LUT; here, the cell's value is updated to 1. The fitness of a LUT is a measure of how well the corresponding CA performs a given task after a fixed number of time steps, starting from a number of test initial configurations. For example, given the density classification task, the fitness of a LUT is calculated by running the corresponding CA on some number k of random initial configurations, and returning the fraction of those k on which the CA produces the correct final configuration (all 1s for initial configurations with majority 1s, all 0s otherwise). The set of random test ICs is typically regenerated at each generation. For LUTs represented as bit strings, crossover is applied to two parents by randomly selecting a crossover point, so that each child inherits one segment of bits from each parent. Next, each child is subject to mutation, in which each bit of the genome is complemented with a very low probability. An example of the reproduction process is illustrated in Fig. 4 for a lookup
Evolving Cellular Automata, Figure 4 Reproduction applied to Parent1 and Parent2, producing Child1 and Child2. The one-point crossover is performed at a randomly selected crossover point (bit 3), and a mutation is performed on bits 2 and 5 in Child1 and Child2, respectively
table representation with r = 1. Here, one of the two children is chosen for survival at random and placed in an offspring population. This process is repeated until the offspring population is filled. Before a new evolutionary cycle begins, the newly created population of offspring replaces the previous population of parents.

Previous Work on Evolving CAs

Von Neumann's self-reproducing automaton was the first construction showing that CAs can perform universal computation [55], meaning that CAs are capable, in principle, of performing any desired computation. However, in general it was unknown how to effectively “program” CAs to perform computations, or what information-processing dynamics CAs could best use to accomplish a task. In the 1980s and 1990s, a number of researchers attempted to determine how the generic dynamical behavior of a CA might be related to its ability to perform computations [18,19,21,33,58]. In particular, Langton defined a parameter on CA LUTs, λ, that he claimed correlated with
Evolving Cellular Automata, Figure 3 Lookup table encoding for a 1D CA with neighborhood radius r = 2. All permutations of neighborhood values are encoded as an offset into the LUT. The LUT bit represents a new value for the center cell of the neighborhood. The binary string (LUT) encodes an individual's chromosome used by evolution
Evolving Cellular Automata, Figure 5 Analysis of a GA-evolved CA for the density classification task. Left: The original space-time diagram containing particle strategies in a CA evolved by the GA. The regions of regular domains are all white, all black, or have a checkerboard pattern. Right: Space-time diagram after regular domains are filtered out. (Reprinted from [37] with permission of the author.)
computational ability. In Langton's work, λ is a function of the state-update values in the LUT; for binary CAs, λ is defined as the fraction of 1s in the state-update values.

Computation at the Edge of Chaos

Packard [40] was the first to use a genetic algorithm to evolve CA LUTs in order to test the hypothesis that LUTs with a critical value of λ will have maximal computational capability. Langton had shown that generic CA behavior seemed to undergo a sequence of phase transitions – from simple to “complex” to chaotic – as λ was varied. Both Langton and Packard believed that the “complex” region was necessary for non-trivial computation in CAs; thus the phrase “computation at the edge of chaos” was coined [33,40]. Packard's experiments indicated that CAs evolved by GAs to perform the density classification task indeed tended to exhibit critical λ values. However, this conclusion was not replicated in later work [38]. Correlations between λ (or other statistics of LUTs) and computational capability in CAs have been hinted at in further work, but have not been definitively established. A major problem is the difficulty of quantifying “computational capability” in CAs beyond the general (and not very practical) capability of universal computation.

Computation via CA “Particles”

While Mitchell, Hraber, and Crutchfield were not able to replicate Packard's results on λ, they were able to show that genetic algorithms can indeed evolve CAs to perform computations [38]. Using earlier work by Hanson and Crutchfield on characterizing computation in CAs [20,21],
Das, Mitchell, and Crutchfield gave an information-processing interpretation of the dynamics exhibited by the evolved CAs in terms of regular domains and particles [21]. This work was extended by Das, Crutchfield, Mitchell, and Hanson [14] and by Hordijk, Crutchfield, and Mitchell [24]. In particular, these groups showed that when regular domains – patterns described by simple regular languages – are filtered out of CA space-time behavior, the boundaries between these domains become prominent and can be interpreted as information-carrying "particles". These particles can characterize non-trivial computation carried out by CAs [15,21]. The information-carrying role of particles becomes clear when applied to CAs evolved by the GA for the density classification task. Figure 5, left, shows typical behavior of the best CAs evolved by the GA. The CA contains three regular domains: all white (0*), all black (1*), and checkerboard ((01)*). Figure 5, right, shows the particles remaining after the regular domains are filtered out. Each particle has an origin and a velocity, and carries information about the neighboring regions [37]. Hordijk et al. [24] showed that a small set of particles and their interactions can explain the computational behavior (i.e., the fitness) of the evolved cellular automata. Crutchfield et al. [13] describe how the analysis of evolved CAs in terms of particles can also explain how the GA evolved CAs with high fitness. Land and Belew [31] proved that no two-state homogeneous CA can perform the density classification task perfectly. However, the maximum possible performance for CAs on this task is not known. The density classification task remains a popular benchmark for studying the evolution of CAs with GAs,
since the task requires collective behavior: the decision about the global density of the IC is based on information only from each local neighborhood. Das et al. [14] also used GAs to evolve CAs to perform a global synchronization task, which requires that, starting from any initial configuration, all cells of the CA synchronize their states (to all 1s or all 0s) and in the next time step all cells change state to the opposite value. Again, this behavior requires global coordination based on local communication. Das et al. showed that an analysis in terms of particles and their interactions was also possible for this task.

Genetic Programming

Andre et al. [2] applied genetic programming (GP), a variation of GAs, to the density classification task. GP methodology also uses a population of evolving candidate solutions, and the principles of reproduction and survival are the same for both GP and GAs. The main difference between the two methods is the encoding of individuals in the population. Unlike the binary strings used in GAs, individuals in a GP population have tree structures, made up of function and terminal nodes. The function nodes (internal nodes) are operators from a pre-defined function set, and the terminal nodes (leaves) represent operands from a terminal set. The fitness value is obtained by evaluating the tree on a set of test initial configurations. The crossover operator is applied to two parents by swapping randomly selected sub-trees, and the mutation operator is applied to a single node by creating a new node or by changing its value (Fig. 6) [29,30]. The GP algorithm evolved CAs whose performance is slightly higher than that of the best CAs evolved by a traditional GA. Unlike traditional GAs, which use crossover and mutation to evolve fixed-length genome solutions, GP trees evolve to different sizes and shapes, and sub-trees can be substituted out and added to the function set as automatically defined functions.
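The fitness evaluation used for the density classification task – by both the GA and GP experiments discussed here – can be sketched as follows. The lattice size, time budget, and the simple local-majority rule are illustrative choices only, not the cited experimental settings; a candidate is scored by the fraction of random ICs it drives to the correct all-1s or all-0s configuration.

```python
# Hedged sketch of density-classification fitness evaluation (r = 3 binary CA).
import random

R = 3
N = 59           # odd lattice size, so density exactly 1/2 cannot occur
STEPS = 2 * N    # illustrative time budget

def majority_lut():
    """LUT for the local-majority rule: output 1 iff the 7-bit neighborhood
    contains four or more 1s. Illustrative only; it performs poorly."""
    return [1 if bin(p).count("1") > R else 0 for p in range(2 ** (2 * R + 1))]

def run(config, lut, steps):
    """Iterate the CA on a circular lattice."""
    n = len(config)
    for _ in range(steps):
        config = [lut[int("".join(str(config[(i + j) % n])
                                  for j in range(-R, R + 1)), 2)]
                  for i in range(n)]
    return config

def performance(lut, num_ics=100, rng=random):
    """Fraction of random ICs classified correctly (settled to all-0s/all-1s)."""
    correct = 0
    for _ in range(num_ics):
        ic = [rng.randint(0, 1) for _ in range(N)]
        target = 1 if sum(ic) > N // 2 else 0
        if all(c == target for c in run(ic, lut, STEPS)):
            correct += 1
    return correct / num_ics
```

The same scoring applies whether the candidate is a GA bit string or a GP tree compiled down to a lookup table.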
According to Andre et al., this allows GP to better explore the "regularities, symmetries, homogeneities, and modularities of the problem domain" [2]. The best CAs evolved by GP revealed more complex particles and particle interactions than the CAs found by the EvCA group [13,24]. It is unclear whether the improved results were due to the GP representation or to the increased population sizes and computation time used by Andre et al.

Parallel Cellular Machines

The field of evolving CAs has grown in several directions. One important area is evolving non-homogeneous cellular automata [22,47,48,54]. Each cell of a non-homogeneous CA contains two independently evolving chromosomes. One represents the LUT for the cell (different cells can have different LUTs), and the second represents the neighborhood connections for the cell. Both the LUTs and the cells' connectivity can be evolved at the same time. Since a task is performed by a collection of cells with different LUTs, there is no single best-performing individual; the fitness is a measure of the collective behavior of the cells' LUTs and their neighborhood assignments [46,48]. One of many tasks studied by Sipper was the global ordering task [47]. Here, the CA has fixed rather than periodic boundaries, so the "left" and "right" ends of the CA lattice are defined. Ordering any given IC pattern means placing all 0s on the left, followed by all 1s on the right; the initial density of the IC must be preserved in the final configuration. Sipper designed a cellular programming algorithm to co-evolve multiple LUTs and their neighborhood topologies. Cellular programming carries out the same steps as the conventional GA (initialization, evaluation, reproduction, replacement), but each cell reproduces only with its local neighbors. The LUT and connectivity chromosomes from the locally connected sites are the only potential parents for the reproduction and replacement of a cell's LUT and connectivity table, respectively. The cells' limited connectivity results in a genetically diverse population: if the current population has a cell with a high-fitness LUT, that LUT will not be directly inherited by a given cell unless the two are connected. The connectivity chromosome causes spatial isolation that allows evolution to explore multiple CA rules as parts of a collective solution [47,48]. Sipper exhaustively tested all homogeneous CAs with r = 1 on the ordering task, and found that the best-performing rule (rule 232) correctly ordered 71% of 1000 randomly generated ICs.
The cellular programming algorithm evolved a non-homogeneous CA that outperformed the best homogeneous CA. The evolutionary search identified multiple rules that the non-homogeneous CA used as the components in the final solution. The rules composing the collective CA solution were classified as state preserving or repairing the incorrect ordering of the neighborhood bits. The untested hypothesis is that the cellular programming algorithm can discover multiple important rules (partial traits) that compose more complex collective behavior.
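The ordering-task evaluation described above can be sketched for rule 232, the elementary (r = 1) local-majority rule. How the edge cells of the fixed-boundary lattice read their missing neighbor is an implementation choice; the boundary values assumed below (0 on the left, 1 on the right) are one plausible convention, not necessarily Sipper's.

```python
# Sketch of the global ordering task with a fixed-boundary elementary CA.
def rule_lut(rule_number):
    """Wolfram-style LUT for an elementary (r = 1) CA rule."""
    return [(rule_number >> p) & 1 for p in range(8)]

def step_fixed(config, lut, left=0, right=1):
    """One synchronous update; out-of-lattice neighbors are fixed values."""
    padded = [left] + list(config) + [right]
    return [lut[(padded[i] << 2) | (padded[i + 1] << 1) | padded[i + 2]]
            for i in range(len(config))]

def ordered_target(config):
    """The correct final state: all 0s on the left, all 1s on the right,
    with the IC's density preserved."""
    ones = sum(config)
    return [0] * (len(config) - ones) + [1] * ones

def solves_ordering(ic, lut, steps=200):
    """Does the CA settle to the correctly ordered configuration?"""
    config = list(ic)
    for _ in range(steps):
        config = step_fixed(config, lut)
    return config == ordered_target(ic)
```

Scoring such a predicate over many random ICs gives the kind of performance figure (71% for rule 232) quoted above, though the exact number depends on the boundary convention and IC distribution.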
Coevolution

Coevolution is an extension of the GA, introduced by Hillis [23] and inspired by host-parasite coevolution in nature.
Evolving Cellular Automata, Figure 6 An example of the encoding of individuals in a GP population, similar to the one used in [2]. The function set here consists of the logical operators {and, or, not, nand, nor, xor}. The terminal set represents the states of cells in a CA neighborhood, here {Center, East, West, EastOfEast, WestOfWest, EastOfEastOfEast, WestOfWestOfWest}. The figure shows the reproduction of Parent1 and Parent2 by crossover with subsequent mutation to produce Child1 and Child2
The main idea is that randomly generated test cases will not continually challenge evolving candidate solutions. Coevolution solves this problem by evolving two populations – candidate solutions and test cases – also referred to as hosts and parasites. The hosts obtain high fitness by performing well on many of the parasites, whereas the parasites obtain high fitness by being difficult for the hosts. Simultaneously coevolving both populations engages hosts and parasites in a mutual competition to achieve increasingly better results [7,17,56]. Successful applications of coevolutionary learning include discovery of minimal sorting networks, training artificial neural networks for robotics, function induction from data, and evolving game strategies [9,23,41,44,56,57]. Coevolution also improved upon GA results on evolving CA rules for density classification [28].
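The host-parasite fitness scheme described above can be sketched generically. The predicate `classifies_correctly` is a hypothetical stand-in: in the CA setting it would run the host's LUT on the parasite IC and compare the final configuration with the IC's true density class. The replace-the-weaker-half generation step is likewise a simplified illustration, not a specific published algorithm.

```python
# Minimal sketch of host-parasite coevolution fitness and one generation step.
import random

def host_fitness(host, parasites, classifies_correctly):
    """Fraction of the parasite population the host handles correctly."""
    return sum(classifies_correctly(host, p) for p in parasites) / len(parasites)

def parasite_fitness(parasite, hosts, classifies_correctly):
    """Number of hosts that fail on this parasite."""
    return sum(not classifies_correctly(h, parasite) for h in hosts)

def coevolve_step(hosts, parasites, classify, mutate, rng=random):
    """One generation: rank each population by its own fitness and replace
    the weaker half with mutated copies of the stronger half."""
    hosts = sorted(hosts, key=lambda h: host_fitness(h, parasites, classify),
                   reverse=True)
    parasites = sorted(parasites,
                       key=lambda p: parasite_fitness(p, hosts, classify),
                       reverse=True)
    h_half = len(hosts) // 2
    p_half = len(parasites) // 2
    hosts = hosts[:h_half] + [mutate(h, rng) for h in hosts[:h_half]]
    parasites = parasites[:p_half] + [mutate(p, rng) for p in parasites[:p_half]]
    return hosts, parasites
```

Spatial variants embed both populations in a grid and restrict competition and reproduction to local neighborhoods, which is the modification reported to improve results.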
In the context of evolving CAs, the LUT candidate solutions are hosts, and the ICs are parasites. The fitness of a host is the fraction of ICs from the parasite population that it classifies correctly. The fitness of a parasite is a function of the number of hosts that fail to classify it correctly. Pagie et al. and Mitchell et al., among others, have found that embedding the host and parasite populations in a spatial grid, where hosts and parasites compete and evolve locally, significantly improves the performance of coevolution on evolving CAs [39,41,42,57].

Other Applications

The examples described in the previous sections illustrate the power and versatility of genetic algorithms used to evolve desired collective behavior in CAs. The following are some
additional examples of applications of CAs evolved by GAs. CAs are most commonly used for modeling physical systems. CAs evolved by GAs have modeled multi-phase fluid flow in porous material [61]. A 3D CA represented a pore model, and the GA evolved the permeability characteristics of the model to match the fluid flow pattern collected from sample data. Another example is the modeling of physical properties of material microstructures [5]. An alternative definition of CAs (effector automata) represented a 2D cross-section of a material. The rule table specified the next location of the neighborhood's center cell. The results show that the GA evolved rules that reconstructed microstructures in the sample superalloy. Network theory and topology studies for distributed sensor networks rely on connectivity and communication among their components. Evolving CAs for location management in mobile computing networks is an application in this field [50]. The cells in the mobile networks are mapped to CA cells, where each cell is either a reporting or a non-reporting cell. Subrata and Zomaya's study used three network datasets that assigned unique communication costs to each cell. A GA evolved rules that designate each cell as reporting or not while minimizing the communication costs in the network. The results show that the GA found optimal or near-optimal rules to determine which cells in a network are reporting. Sipper also hinted at applying his cellular programming algorithm to non-homogeneous CAs with non-standard topology to evolve network topology assignments [47]. Chopra and Bender applied GAs to evolve CAs to predict protein secondary structure [10]. The 1D CA with r = 5 represents interactions among local fragments of a protein chain. A GA evolved the weights for each of the neighboring fragments that determine the shape of the secondary protein structure.
The algorithm achieved superior results in comparison with some other protein-secondary-structure prediction algorithms. Built-In Self-Test (BIST) is a test method widely used in the design and production of hardware components. A combination of a selfish gene algorithm (a GA variant) and CAs was used to program the BIST architecture [12]. The individual CA cells correspond to the circuitry's input terminals, and the transition function serves as a test pattern generator. The GA identified CA rules that produce input test sequences that detect circuitry faults. The results achieved are comparable with previously proposed GA-based methods but with lower overhead. Computer vision is a fast-growing research area where CAs have been used for low-level image processing. The cellular programming algorithm has evolved non-homogeneous CAs to perform image thinning, finding and enhancing an object's rectangular boundaries, image shrinking, and edge detection [47].

Future Directions

Initial work on evolving two-dimensional CAs with GAs was done by Sipper [47] and by Jiménez-Morales, Crutchfield, and Mitchell [27]. An extension of domain-particle analysis to 2D CAs is needed in order to analyze the information processing of CAs and to identify the epochs of innovation in evolutionary learning. Spatially extended coevolution was successfully used to evolve high-performance CAs for density classification. Parallel cellular machines also used spatial embedding of their components and found better-performing CAs than the homogeneous CAs evolved by a traditional GA. The hypothesis is that spatially extended search techniques succeed more often than non-spatial techniques because spatial embedding enforces greater genetic diversity and, in the case of coevolution, more effective competition between hosts and parasites. This hypothesis deserves more detailed investigation. Additional important research topics include the study of error resiliency and of the effect of noise on both the information processing in CAs and the evolution of CAs. How successful is evolutionary learning in a noisy environment? What is the impact of failing CA components on information processing and evolutionary adaptation? Similarly, to make CAs more realistic as models of physical systems, evolving CAs with asynchronous cell updates is an important topic for future research. A number of groups have shown that CAs and similar decentralized spatially extended systems using asynchronous updates can have very different behavior from those using synchronous updates (e.g., [1,6,25,49,53]). An additional topic for future research is the effect of connectivity network structure on the behavior and computational capability of CAs. Some work along these lines has been done by Teuscher [52].
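The synchronous/asynchronous distinction raised above can be made concrete for an elementary CA. Under synchronous updating every cell reads the old configuration; under the random-order asynchronous scheme sketched below (one common variant among several), cells are updated in place one at a time, so later cells already see their neighbors' new states.

```python
# Synchronous vs. random-order asynchronous updating of an elementary CA.
import random

def sync_step(config, lut):
    """All cells read the old configuration (circular lattice, r = 1)."""
    n = len(config)
    return [lut[(config[(i - 1) % n] << 2) | (config[i] << 1)
                | config[(i + 1) % n]] for i in range(n)]

def async_step(config, lut, rng=random):
    """Cells updated in place, one at a time, in random order."""
    config = list(config)
    n = len(config)
    order = list(range(n))
    rng.shuffle(order)
    for i in order:   # each update immediately visible to later updates
        config[i] = lut[(config[(i - 1) % n] << 2) | (config[i] << 1)
                        | config[(i + 1) % n]]
    return config
```

Running both steppers from the same IC with the same rule typically produces diverging trajectories, which is the behavioral difference the cited studies examine.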
Acknowledgments

This work has been funded by the Center on Functional Engineered Nano Architectonics (FENA), through the Focus Center Research Program of the Semiconductor Industry Association.

Bibliography

1. Alba E, Giacobini M, Tomassini M, Romero S (2002) Comparing synchronous and asynchronous cellular genetic algorithms. In: Guervos MJJ et al (eds) Parallel problem solving from nature, PPSN VII, Seventh International Conference. Springer, Berlin, pp 601–610
2. Andre D, Bennett FH III, Koza JR (1996) Evolution of intricate long-distance communication signals in cellular automata using genetic programming. In: Artificial life V: Proceedings of the fifth international workshop on the synthesis and simulation of living systems. MIT Press, Cambridge
3. Ashlock D (2006) Evolutionary computation for modeling and optimization. Springer, New York
4. Back T (1996) Evolutionary algorithms in theory and practice. Oxford University Press, New York
5. Basanta D, Bentley PJ, Miodownik MA, Holm EA (2004) Evolving cellular automata to grow microstructures. In: Genetic programming: 6th European Conference, EuroGP 2003, Essex, UK, April 14–16, 2003, Proceedings. Springer, Berlin, pp 77–130
6. Bersini H, Detours V (2002) Asynchrony induces stability in cellular automata based models. In: Proceedings of the IVth conference on artificial life. MIT Press, Cambridge, pp 382–387
7. Bucci A, Pollack JB (2002) Order-theoretic analysis of coevolution problems: Coevolutionary statics. In: GECCO 2002 Workshop on Understanding Coevolution: Theory and Analysis of Coevolutionary Algorithms, vol 1. Morgan Kaufmann, San Francisco, pp 229–235
8. Burks A (1970) Essays on cellular automata. University of Illinois Press, Urbana
9. Cartlidge J, Bullock S (2004) Combating coevolutionary disengagement by reducing parasite virulence. Evol Comput 12(2):193–222
10. Chopra P, Bender A (2006) Evolved cellular automata for protein secondary structure prediction imitate the determinants for folding observed in nature. In Silico Biol 7(0007):87–93
11. Codd EF (1968) Cellular automata. ACM Monograph series, New York
12. Corno F, Reorda MS, Squillero G (2000) Exploiting the selfish gene algorithm for evolving cellular automata. IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN'00) 06:6577
13. Crutchfield JP, Mitchell M, Das R (2003) The evolutionary design of collective computation in cellular automata. In: Crutchfield JP, Schuster PK (eds) Evolutionary dynamics – Exploring the interplay of selection, neutrality, accident, and function. Oxford University Press, New York, pp 361–411
14. Das R, Crutchfield JP, Mitchell M, Hanson JE (1995) Evolving globally synchronized cellular automata. In: Eshelman L (ed) Proceedings of the sixth international conference on genetic algorithms. Morgan Kaufmann, San Francisco, pp 336–343
15. Das R, Mitchell M, Crutchfield JP (1994) A genetic algorithm discovers particle-based computation in cellular automata. In: Davidor Y, Schwefel HP, Männer R (eds) Parallel problem solving from nature – III. Springer, Berlin, pp 344–353
16. Farmer JD, Toffoli T, Wolfram S (1984) Cellular automata: Proceedings of an interdisciplinary workshop. Elsevier Science, Los Alamos
17. Funes P, Sklar E, Juille H, Pollack J (1998) Animal-animat coevolution: Using the animal population as fitness function. In: Pfeiffer R, Blumberg B, Wilson JA, Meyer S (eds) From animals to animats 5: Proceedings of the fifth international conference on simulation of adaptive behavior. MIT Press, Cambridge, pp 525–533
18. Gardner M (1970) Mathematical games: The fantastic combinations of John Conway's new solitaire game "Life". Sci Am 223:120–123
19. Grassberger P (1983) Chaos and diffusion in deterministic cellular automata. Physica D 10(1–2):52–58
20. Hanson JE (1993) Computational mechanics of cellular automata. PhD thesis, University of California at Berkeley
21. Hanson JE, Crutchfield JP (1992) The attractor-basin portrait of a cellular automaton. J Stat Phys 66:1415–1462
22. Hartman H, Vichniac GY (1986) Inhomogeneous cellular automata (INCA). In: Bienenstock E, Fogelman F, Weisbuch G (eds) Disordered systems and biological organization, vol F20. Springer, Berlin, pp 53–57
23. Hillis WD (1990) Co-evolving parasites improve simulated evolution as an optimization procedure. Physica D 42:228–234
24. Hordijk W, Crutchfield JP, Mitchell M (1996) Embedded-particle computation in evolved cellular automata. In: Toffoli T, Biafore M, Leão J (eds) Physics and computation 1996. New England Complex Systems Institute, Cambridge, pp 153–158
25. Huberman BA, Glance NS (1993) Evolutionary games and computer simulations. Proc Natl Acad Sci 90:7716–7718
26. Ikebe M, Amemiya Y (2001) VMoS cellular-automaton circuit for picture processing. In: Miki T (ed) Brainware: Bio-inspired architectures and its hardware implementation. FLSI Soft Computing, vol 6, chapter 6. World Scientific, Singapore, pp 135–162
27. Jiménez-Morales F, Crutchfield JP, Mitchell M (2001) Evolving two-dimensional cellular automata to perform density classification: A report on work in progress. Parallel Comput 27(5):571–585
28. Juillé H, Pollack JB (1998) Coevolutionary learning: A case study. In: Proceedings of the fifteenth international conference on machine learning (ICML-98). Morgan Kaufmann, San Francisco, pp 24–26
29. Koza JR (1992) Genetic programming: On the programming of computers by means of natural selection. MIT Press, Cambridge
30. Koza JR (1994) Genetic programming II: Automatic discovery of reusable programs. MIT Press, Cambridge
31. Land M, Belew RK (1995) No perfect two-state cellular automata for density classification exists. Phys Rev Lett 74(25):5148–5150
32. Langton C (1986) Studying artificial life with cellular automata. Physica D 10D:120
33. Langton C (1990) Computation at the edge of chaos: Phase transitions and emergent computation. Physica D 42:12–37
34. Lohn JD, Reggia JA (1997) Automatic discovery of self-replicating structures in cellular automata. IEEE Trans Evol Comput 1(3):165–178
35. Madore BF, Freedman WL (1983) Computer simulations of the Belousov–Zhabotinsky reaction. Science 222:615–616
36. Mitchell M (1996) An introduction to genetic algorithms. MIT Press, Cambridge
37. Mitchell M (1998) Computation in cellular automata: A selected review. In: Gramss T, Bornholdt S, Gross M, Mitchell M, Pellizzari T (eds) Nonstandard computation. VCH, Weinheim, pp 95–140
38. Mitchell M, Hraber PT, Crutchfield JP (1993) Revisiting the edge of chaos: Evolving cellular automata to perform computations. Complex Syst 7:89–130
39. Mitchell M, Thomure MD, Williams NL (2006) The role of space in the success of coevolutionary learning. In: Rocha LM, Yaeger LS, Bedau MA, Floreano D, Goldstone RL, Vespignani A (eds) Artificial life X: Proceedings of the tenth international conference on the simulation and synthesis of living systems. MIT Press, Cambridge, pp 118–124
40. Packard NH (1988) Adaptation toward the edge of chaos. In: Kelso JAS, Mandell AJ, Shlesinger M (eds) Dynamic patterns in complex systems. World Scientific, Singapore, pp 293–301
41. Pagie L, Hogeweg P (1997) Evolutionary consequences of coevolving targets. Evol Comput 5(4):401–418
42. Pagie L, Mitchell M (2002) A comparison of evolutionary and coevolutionary search. Int J Comput Intell Appl 2(1):53–69
43. Reynaga R, Amthauer E (2003) Two-dimensional cellular automata of radius one for density classification task ρ = 1/2. Pattern Recogn Lett 24(15):2849–2856
44. Rosin C, Belew R (1997) New methods for competitive coevolution. Evol Comput 5(1):1–29
45. Schadschneider A (2001) Cellular automaton approach to pedestrian dynamics – theory. In: Pedestrian and evacuation dynamics. Springer, Berlin, pp 75–86
46. Sipper M (1994) Non-uniform cellular automata: Evolution in rule space and formation of complex structures. In: Brooks RA, Maes P (eds) Artificial life IV. MIT Press, Cambridge, pp 394–399
47. Sipper M (1997) Evolution of parallel cellular machines: The cellular programming approach. Springer, Heidelberg
48. Sipper M, Ruppin E (1997) Co-evolving architectures for cellular machines. Physica D 99:428–441
49. Sipper M, Tomassini M, Capcarrere M (1997) Evolving asynchronous and scalable non-uniform cellular automata. In: Proceedings of the international conference on artificial neural networks and genetic algorithms (ICANNGA97). Springer, Vienna, pp 382–387
50. Subrata R, Zomaya AY (2003) Evolving cellular automata for location management in mobile computing networks. IEEE Trans Parallel Distrib Syst 14(1):13–26
51. Tan SK, Guan SU (2007) Evolving cellular automata to generate nonlinear sequences with desirable properties. Appl Soft Comput 7(3):1131–1134
52. Teuscher C (2006) On irregular interconnect fabrics for self-assembled nanoscale electronics. In: Tyrrell AM, Haddow PC, Torresen J (eds) 2nd IEEE international workshop on defect and fault tolerant nanoscale architectures, NANOARCH'06. Lecture Notes in Computer Science, vol 2602. ACM Press, New York, pp 60–67
53. Teuscher C, Capcarrere MS (2003) On fireflies, cellular systems, and evolware. In: Tyrrell AM, Haddow PC, Torresen J (eds) Evolvable systems: From biology to hardware. Proceedings of the 5th international conference, ICES2003. Lecture Notes in Computer Science, vol 2602. Springer, Berlin, pp 1–12
54. Vichniac GY, Tamayo P, Hartman H (1986) Annealed and quenched inhomogeneous cellular automata. J Stat Phys 45:875–883
55. von Neumann J (1966) Theory of self-reproducing automata. University of Illinois Press, Champaign
56. Wiegand PR, Sarma J (2004) Spatial embedding and loss of gradient in cooperative coevolutionary algorithms. Parallel Probl Solving Nat 1:912–921
57. Williams N, Mitchell M (2005) Investigating the success of spatial coevolution. In: Proceedings of the 2005 conference on genetic and evolutionary computation, Washington DC, pp 523–530
58. Wolfram S (1984) Universality and complexity in cellular automata. Physica D 10D:1
59. Wolfram S (1986) Theory and application of cellular automata. World Scientific Publishing, Singapore
60. Wolfram S (2002) A new kind of science. Wolfram Media, Champaign
61. Yu T, Lee S (2002) Evolving cellular automata to model fluid flow in porous media. In: 2002 NASA/DoD conference on evolvable hardware (EH '02). IEEE Computer Society, Los Alamitos, p 210
Evolving Fuzzy Systems
PLAMEN ANGELOV
Intelligent Systems Research Laboratory, Digital Signal Processing Research Group, Communication Systems Department, Lancaster University, Lancaster, UK

Article Outline

Glossary
Definition of the Subject
Introduction
Evolving Clustering
Evolving TS Fuzzy Systems
Evolving Fuzzy Classifiers
Evolving Fuzzy Controllers
Application Case Studies
Future Directions
Acknowledgments
Bibliography

Glossary

Evolving system In the context of this article the term 'evolving' is used in the sense of the self-development of a system (in terms of both its structure and parameters) based on the stream of data coming to the system on-line and in real-time from the environment and the system itself. The system is assumed to be mathematically described by a set of fuzzy rules of the form:

Rule_i: IF (Input_1 is close to prototype_1^i) AND ... AND (Input_n is close to prototype_n^i) THEN (Output_i = Inputs^T ConseqParams)   (1)
In this sense, this definition strictly follows the meaning of the English word "evolving" as described in [34], p. 294, namely "unfolding; developing; being developed, naturally and gradually". Contrast this with the definition of "evolutionary" in the same source, which is "development of more complicated forms of life (plants, animals) from earlier and simpler forms". The terms evolutionary or genetic are also associated with such phenomena (respectively, operators that mimic them) as chromosome crossover, mutation, selection and reproduction, parents and offspring [32]. Evolving (fuzzy and neuro-fuzzy) systems do not deal with such phenomena. They rather consider a gradual development of the underlying (fuzzy or neuro-fuzzy) system structure.
Fuzzy system structure The structure of a fuzzy (or neuro-fuzzy) system is constituted by a set of fuzzy rules (1). Each fuzzy rule is composed of an antecedent (IF) part and a consequent (THEN) part. They are linguistically expressed. The antecedent part consists of a number of fuzzy sets that are linked by fuzzy logic aggregators such as conjunction, disjunction or, more rarely, negation [43]. In the above example, a conjunction (logical AND) is used. It can be mathematically described by so-called t-norms or t-conorms between membership functions. The most popular membership functions are Gaussian, triangular, and trapezoidal [73]. The consequent part of the fuzzy rules in the so-called Takagi–Sugeno (TS) form is represented by mathematical functions (usually linear). The structure of the TS fuzzy system can also be represented as a neural network with a specific (five-layer) composition (Fig. 1). Therefore, these systems are also called neuro-fuzzy (NF). The number of fuzzy rules and inputs (which in the case of classification problems are also called features or attributes) is also a part of the structure. The first layer consists of neurons corresponding to the membership functions of a particular fuzzy set. This layer takes the inputs, x, and gives as output the degree, μ, to which these fuzzy descriptors are satisfied. The second layer represents the antecedent parts of the fuzzy rules; it takes as inputs the membership function values and gives as output the firing level of the ith rule. The third layer of the network takes as inputs the firing levels of the respective rules and gives as output the normalized firing levels, computed as the "center of gravity" [43] of the firing levels. As an alternative, one can use the "winner takes all" operator. This operator is usually used in classification, while the "center of gravity" is preferred for time-series prediction and general system modeling and control. The fourth layer aggregates the antecedent part and the consequent part that represents the local sub-systems (singletons or hyperplanes). Finally, the last (5th) layer forms the total output of the NF system. It performs a weighted summation of the local sub-systems.

Evolving Fuzzy Systems, Figure 1 Structure of the (neuro-fuzzy) system of TS type

Fuzzy system parameters Parameters of the NF system of TS type include the center, c, and spread, σ, of the Gaussians, or the parameters of the triangular (or trapezoidal) membership functions. An example of a Gaussian-type membership function is:

μ = e^(−(1/2)(d/r)²)   (2)

where d denotes the distance between a data sample (point in the data space) and a prototype/cluster center (focal point of a fuzzy set); r is the radius of the cluster (spread of the membership function). Note that the distance can be represented in Euclidean (the most typical example), Mahalanobis [33], cosine, etc. forms. These parameters are associated with the antecedent part of the system. Consequent-part parameters are coefficients of the (usually) linear functions, singleton coefficients, or coefficients of more complex functions (e.g. exponential) if such are used.
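A single TS inference pass using the glossary's ingredients – Gaussian memberships (Eq. (2)), rule firing levels combined here by product (one common t-norm among those mentioned), normalization, and linear consequents – can be sketched as follows. The two-rule, one-input system at the bottom is an invented toy example, not taken from the text.

```python
# Sketch of Takagi-Sugeno inference with Gaussian antecedent memberships.
import math

def membership(d, r):
    """Gaussian membership, Eq. (2): mu = exp(-(1/2) * (d/r)**2)."""
    return math.exp(-0.5 * (d / r) ** 2)

def ts_output(x, rules):
    """rules: list of (centers, radii, a) with a = [a0, a1, ..., an]."""
    firing, outputs = [], []
    for centers, radii, a in rules:
        tau = 1.0
        for xi, ci, ri in zip(x, centers, radii):
            tau *= membership(abs(xi - ci), ri)   # product t-norm
        firing.append(tau)
        # linear consequent: y_i = a0 + a1*x1 + ... + an*xn
        outputs.append(a[0] + sum(ai * xi for ai, xi in zip(a[1:], x)))
    # normalized firing levels -> weighted sum of local linear sub-models
    total = sum(firing)
    return sum(f * y for f, y in zip(firing, outputs)) / total

# Toy two-rule, one-input system (invented parameters):
rules = [([0.0], [1.0], [0.0, 1.0]),
         ([4.0], [1.0], [4.0, 0.0])]
```

Near each rule's prototype the output is dominated by that rule's local linear model, which is the "fuzzily blended linear systems" behavior described for TS systems.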
y_i = a_i0 + a_i1 x_1 + … + a_in x_n   (3)

where a denotes the parameters of the consequent part; x denotes the inputs (features); i is the index of the ith fuzzy rule; n is the number (dimensionality) of the inputs (features).

Potential Potential is a mathematical measure of the data density. It is calculated at a data point, z, and represents numerically the accumulated proximity (density) of the data surrounding this data point. It resembles the probability distribution used in so-called Parzen windows [33] and is described in [26,72] by a Gaussian-like function:

P(z) = e^(−σ²/(2r²))   (4)

where z = [x; y] denotes the joint (input/output) vector and σ_k² = (1/(k−1)) Σ_{i=1..k−1} d²(z_k, z_i) is the variance of the data in terms of the cluster center. In [3,9] the Cauchy function is used, which has the same properties as the Gaussian but is suitable for recursive calculations:

P(z) = 1/(1 + σ²)   (5)

Age of a cluster or fuzzy rule The age of an (evolving) cluster is defined in terms of the accumulated time of appearance of the samples that form the cluster and support the corresponding fuzzy rule:

A^i = k − (1/S_k^i) Σ_{l=1..S_k^i} k_l   (6)

where k denotes the current time instant; S_k^i denotes the support of the cluster, that is, the number of data samples (points) that are in the zone of influence of the cluster (formed by its radius); and k_l is the time instant at which the lth of these samples arrived. The support is obtained by simple counting of data samples (points) at the moment of their arrival (when they are first read), each assigned to the nearest cluster [10]. The values of A vary from 0 to k, and the derivative of A with respect to time is always less than or equal to 1 [17]. An "old" cluster (fuzzy rule) is one that has not been updated recently. A "young" cluster (fuzzy rule) is one that contains predominantly new or recent samples. The (first and second) derivatives of the age are very informative and useful for the detection of data "shift" and "drift" [17].

Definition of the Subject

Evolving Fuzzy Systems (EFS) are a class of Fuzzy Rule-based (FRB) and Neuro-Fuzzy (NF) systems that have both their parameters and their underlying structure self-adapting, self-developing, self-learning from the data in on-line mode and, possibly, in real-time. The concept was conceived at the beginning of this century [2,5]. Parallel investigations have led to similar developments in neural networks (NN) [41,42]. EFS have the significant advantage, compared to evolving NN, of being linguistically tractable and transparent. EFS have been instrumental in the emergence of new branches of evolving clustering algorithms [3], evolving classifiers [16,51], evolving time-series predictors [9,47], evolving fuzzy controllers [4], evolving fault detectors [30], etc. Over the last years EFS have demonstrated a wide range of applications, spanning from robotics [76] and defense [24] to biomedical [70] and industrial process [29] data processing in real-time, new generations of self-calibrating, self-adapting sensors [52,53], speech [37] and image processing [56], etc. EFS have the potential to revolutionize such areas as autonomous systems [66], intelligent sensors [45], and early cancer detection and diagnosis; they are instrumental in raising the so-called machine intelligence quotient [74] by developing systems that self-adapt in real-time to the dynamically changing environment and to internal changes that occur in the system itself (e.g. wearing, contamination, performance degradation, faults, etc.). Although the terms intelligent and artificial intelligence have been used often during the last several decades, the technical systems that claim to have such features are in reality far from true intelligence. One of the main reasons is that true intelligence is evolving; it is not fixed. EFS are the first mathematical constructs that combine the approximate reasoning typical for humans, represented by the fuzzy inference, with a dynamically evolving structure and respective formal, mathematically sound learning mechanisms to implement it.

Introduction

Fuzzy Sets and Fuzzy Logic were introduced by Lotfi Zadeh in 1965 in his seminal paper [71].
During the last decade of the previous century there was a marked increase in applications of fuzzy logic-based systems, mainly due to the introduction of fuzzy logic controllers (FLC) by Ebrahim Mamdani in 1975 [54], the introduction of the fuzzily blended linear-systems construct called Takagi–Sugeno (TS) fuzzy systems in 1985 [65], and the theoretical proof that FRB systems are universal approximators (that is, any arbitrary non-linear function in the [0, 1] range can be asymptotically approximated by a FRB system [68]). Historically, FRB systems were at first designed entirely or predominantly on the basis of human expert knowledge [54,71]. This offers advantages and was, at that time, a novel technique for incorporating uncertain, subjective information, preferences, experience, and intuition, which are difficult or impossible to be described
otherwise. However, it poses enormous difficulties for the design and routine use of these systems, especially in real industrial environments and in on-line and real-time modes. TS fuzzy systems made possible the development of efficient algorithms for their design not only in off-line, but also in on-line mode [14]. This is facilitated by their dual nature: they combine a fuzzy linguistic premise (antecedent) part with a functional (usually linear) consequent part [65]. With the invention of the concept of EFS [2,5] the design problem was completely automated and data-driven. This means that EFS self-develop their model (the system structure) and adapt their parameters "from scratch", on the fly, using experimental data and efficient recursive learning mechanisms. Human expert knowledge is not compulsory, not limiting, and not essential (especially when it is difficult to obtain in real-time). This does not mean that such knowledge is prohibited or impossible to use. On the contrary, the concept of EFS makes it possible to use such knowledge in the initialization stage, or even during the learning process itself, but this is optional, not essential. Examples of EFS are intelligent sensors for oil refineries [52,53], autonomous self-localization algorithms used by mobile robots [75,76], smart agents for machine health monitoring and prognosis in the car industry [30], smart systems for automatic classification of images in the CD production process [51], etc. This is a new and promising area of research, and new applications in different branches of industry are emerging.

Evolving Clustering

Data clustering, and fuzzy clustering in particular, are methods for grouping data based on their similarity, their density in the data space, and their proximity.
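Before turning to the clustering procedures themselves, the per-cluster "support" and "age" bookkeeping defined at the start of this article can be maintained incrementally. The sketch below is an illustration only; the age formula used here (age = k minus the mean arrival instant of the cluster's samples) is an assumption, chosen to be consistent with the stated properties (A varies from 0 to k and its time derivative never exceeds 1):

```python
class ClusterStats:
    """Incremental per-cluster bookkeeping (a sketch; the exact age
    definition is an assumption: age = k - mean arrival instant)."""

    def __init__(self, k):
        self.support = 1       # S: number of samples assigned so far
        self.arrival_sum = k   # running sum of arrival instants

    def assign(self, k):
        """A new sample arrives at instant k and is assigned here."""
        self.support += 1
        self.arrival_sum += k

    def age(self, k):
        """Age at instant k; grows at a rate of at most 1 per step."""
        return k - self.arrival_sum / self.support
```

A cluster that keeps receiving samples has its mean arrival instant pulled toward the present, so its age grows more slowly than k; a cluster that stops receiving samples ages at exactly rate 1, which is the "old cluster" situation described above.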
Partitioning of the data into clusters can be done off-line (using a batch set of data, performing iterative computations over this set, and minimizing a certain criterion/cost function) or on-line, incrementally. Examples of incremental clustering approaches are the self-organizing maps (SOM) conceived by Teuvo Kohonen in the early 1980s [44], the adaptive resonance theory (ART) conceived by Stephen Grossberg in the same period [25], etc. Clustering is a type of unsupervised learning technique, where correct examples are not provided. Usually, the number of clusters is pre-specified: in SOM the number of nodes of the map is pre-defined; the number of neighbors, k, in the k-nearest neighbor method [33] is also supposed to be provided; the number of clusters, C, in the fuzzy c-means (FCM) clustering algorithm by Jim Bezdek should also be provided [22]. Usually these approaches rely on a threshold and are very sensitive
to the specific values of this threshold. Most of the existing approaches are also mean-based (i.e. they use the mean of all the data or the mean of groups of data). The problem is that the mean is a virtual (non-existing and possibly infeasible) point in the data space. In contrast, the evolving clustering method eClustering, conceived in the last decade [3], does not need the number of clusters, the threshold, or any other parameter to be pre-specified. It is parameter-free and clusters the data "from scratch" based on their density distribution alone. It is based on the recursive calculation of the potential (5). eClustering is prototype-based (some of the data points are used as prototypes of cluster centers). The procedure starts from scratch, assuming that the first available data point is the center of a cluster. This assumption is temporary; if a priori knowledge exists, the procedure can start with an initial set of cluster centers that will be further refined. The coordinates of the first cluster center are formed from the coordinates of the first data point (z_1). Its potential is set to the ideal value, P_1(z_1) = 1. Starting from the next data point, which is read in real-time, the following steps are performed for each new data point: calculate its potential, P_k(z_k); update the potential of the existing cluster centers (because their potential has been affected by the addition of the new data point); and compare the potential of the new data point with the potential of the previously existing centers. On the basis of this comparison and the membership of the existing clusters, one of the following actions is taken: (add a new cluster center based on the new data point) OR (remove the cluster that describes the new point well, which brings an increment to the potential, AND replace it with a cluster formed around the new point) OR (ignore, i.e. do not change the cluster structure). The process is illustrated in Fig. 2 for the data of NOx emissions from a car exhaust [13]. One can see that the clustering evolves: the number of clusters increases from two to three, and their position and radius change. Note that in this experiment only two inputs (features), normalized to the range [0, 1], are used, namely the engine output torque, x_1 (in N m), and the pressure in the second cylinder, x_2 (in Pa). For more details on eClustering, please consult the papers from the bibliography, especially [2,3,9,16].

Evolving Fuzzy Systems, Figure 2
The Evolving Clustering method applied to data concerning NOx emissions; a top plot – after 43 samples are read (after 43 s, because the sampling rate is 1 sample/second, i.e. 1 Hz); b bottom plot – after 124 samples are read (after 124 s)

Evolving TS Fuzzy Systems

TS fuzzy systems [as illustrated in Fig. 1 and described in a very general form in Eq. (1)] were first introduced in 1985 [65] in the form:

\Re_i: IF (x_1 is A_{i1}) AND ... AND (x_n is A_{in}) THEN (y_i = a_{i0} + a_{i1} x_1 + ... + a_{in} x_n)   (7)

where \Re_i denotes the ith fuzzy rule (i = 1, ..., R); R is the number of fuzzy rules; x = [x_1, x_2, ..., x_n]^T is the input vector; A_{ij} denotes the antecedent fuzzy sets, j \in \{1, ..., n\}; y_i is the output of the ith linear subsystem; and a_{il} are its parameters, l \in \{0, ..., n\}. The structure of the TS system (the number of fuzzy rules), the antecedent part of the rules, the number of inputs, etc., are supposed to be known and fixed. The data may be provided to the TS system in an off-line or an on-line manner.
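The add / replace / ignore logic of the eClustering procedure described in the previous subsection can be sketched in a few lines. This is a simplified illustration, not the algorithm of [3]: the recursive potential of Eq. (5) is replaced by a stand-in Cauchy-type density computed over stored data (the real method is recursive and one-pass), and the radius threshold is a hypothetical parameter:

```python
import numpy as np

def density(z, data):
    """Stand-in for the potential of Eq. (5): a Cauchy-type function
    of the mean squared distance from z to all data seen so far."""
    d2 = np.sum((np.asarray(data) - z) ** 2, axis=1)
    return 1.0 / (1.0 + d2.mean())

def eclustering_sketch(stream, radius=0.3):
    """One-pass evolving clustering sketch: the first point seeds the
    first cluster; later points may add a cluster, replace a nearby
    center, or leave the structure unchanged."""
    data = [stream[0]]
    centers = [stream[0]]
    for z in stream[1:]:
        data.append(z)
        p_new = density(z, data)
        p_centers = [density(c, data) for c in centers]
        if p_new > max(p_centers):
            # the new point describes the data better than every center
            dists = [float(np.linalg.norm(z - c)) for c in centers]
            j = int(np.argmin(dists))
            if dists[j] < radius:
                centers[j] = z      # replace the nearby center
            else:
                centers.append(z)   # form a new cluster
        # otherwise the cluster structure is left unchanged
    return centers
```

Fed a stream containing two well-separated groups of points, the sketch seeds one center from the first group and later spawns a second center once the second group dominates the density.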
Different data-driven techniques were developed to identify the parameters that are best in terms of certain (local or global) error-minimization criteria, e.g. using a (recursive) least-squares technique [65], using genetic algorithms [7,62], etc. The overall output is a weighted sum of the local outputs produced by the individual fuzzy rules:

y = \sum_{i=1}^{R} \lambda_i y_i   (8)

where the weights \lambda_i represent the normalized firing levels of the respective fuzzy rules and can be determined by:

\lambda_i = \frac{\prod_{j=1}^{n} \mu_{ij}(x_j)}{\sum_{l=1}^{R} \prod_{j=1}^{n} \mu_{lj}(x_j)}   (9)

where \mu_{ij}(x_j) denotes the membership of x_j in the fuzzy set A_{ij}. In vector form the above equations can be represented as:

y = \psi^T \theta   (10)

where \psi = [\lambda_1 x_e^T, \lambda_2 x_e^T, ..., \lambda_R x_e^T]^T is the vector of weighted extended inputs; x_e = [1, x^T]^T; \theta = [\pi_1^T, \pi_2^T, ..., \pi_R^T]^T is the vector of parameters; and

\pi_i = \begin{bmatrix} a_{01}^i & a_{02}^i & \cdots & a_{0m}^i \\ a_{11}^i & a_{12}^i & \cdots & a_{1m}^i \\ \cdots & \cdots & \cdots & \cdots \\ a_{n1}^i & a_{n2}^i & \cdots & a_{nm}^i \end{bmatrix}
are the parameters of the local linear sub-systems (m denotes the number of outputs; m = 1 in the single-output case). The assumption that the TS fuzzy system structure has to be known a priori was questioned for the first time in [2,5], and ultimately in [9], with the proposal of evolving TS (eTS) systems. In [15] a further extension of eTS systems was proposed, namely that they can have many outputs; in this way, the multi-input-multi-output eTS systems (MIMO-eTS) were introduced. eTS is a very flexible and powerful tool for time-series prediction, prognosis, modeling of non-stationary phenomena, intelligent sensors, etc. The algorithm for learning eTS from streaming data in real-time has two basic phases, both of which can be performed very quickly (in the one time step between the arrival of two data samples, the current one and the next one). The learning mechanism proposed in [9] is computationally very efficient because it is fully recursive. The two phases are: (a) partitioning of the data space and, based on this, forming and updating the fuzzy rule-base structure;
(b) learning the parameters of the consequent part of the fuzzy rules. Note that the partitioning of the data space serves a different purpose in eTS identification than it does in eClustering. In eTS there are outputs, and the aim is to find such a (perhaps overlapping) clustering of the joint input-output data space that fragments the input-output relationship into locally valid, simpler (possibly linear) dependences. In eClustering the aim is to cluster the input data space into distinctive regions. Other than that, the first phase of eTS model identification is the same as the procedure of the eClustering method described above. The second phase of the learning is parameter identification. It can be performed using a fuzzily weighted version [9] of the well-known recursive least squares (RLS) method [50]. One can perform either local (11) or global (12) identification by minimizing different cost functions [9]:

J_L = \sum_{i=1}^{R} (Y - X_e^T \pi_i)^T \Lambda_i (Y - X_e^T \pi_i)   (11)

J_G = (Y - \psi^T \theta)^T (Y - \psi^T \theta)   (12)

where Y denotes the vector of outputs, X_e the matrix of extended inputs, and \Lambda_i a diagonal matrix of the normalized firing levels \lambda_i of the ith rule.
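The per-rule update behind the local cost (11) can be sketched as a generic weighted RLS, with each step weighted by the rule's normalized firing level lambda; this is not the exact implementation of [9], and the initialization constant omega is an assumed convention:

```python
import numpy as np

class FuzzilyWeightedRLS:
    """Weighted RLS sketch for one rule's consequent parameters:
    each update is weighted by the rule's normalized firing level
    lambda (the 'fuzzily weighted' part). Omega*I initialization of
    the covariance matrix is an assumption."""

    def __init__(self, dim, omega=1000.0):
        self.P = omega * np.eye(dim)   # covariance-like matrix
        self.theta = np.zeros(dim)     # consequent parameters pi_i

    def update(self, xe, y, lam):
        xe = np.asarray(xe, dtype=float)
        Px = self.P @ xe
        gain = lam * Px / (1.0 + lam * (xe @ Px))
        self.theta = self.theta + gain * (y - xe @ self.theta)
        self.P = self.P - np.outer(gain, Px)
        return self.theta
```

Feeding noise-free samples of y = 2 + 3x with lam = 1 recovers theta close to [2, 3], which is the standard RLS behavior; smaller lam values simply slow the updates for rules that fire weakly, as the local cost (11) prescribes.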
In the first case (when the local cost function is used) the result will be a better local approximation of the overall non-linear function by the local linear sub-models; the pay-off is a poorer overall approximation. This is compensated, however, by a simpler and computationally more efficient procedure: with a locally valid cost function, covariance matrices of much smaller size can be used, and they require much less memory and much less time for the computations [9].

Evolving Fuzzy Classifiers

Classification is a problem that has been well studied, and a large number of conventional approaches exist to address it [33]. Most of them, however, are designed to operate in batch mode and do not change their structure on-line (they do not capture new patterns that may be present in the streaming data once the classifier is built). Off-line pre-trained classifiers may be good for certain scenarios, but they need to be redesigned or retrained if the circumstances change. There are also so-called incremental (or on-line) classifiers, which work on a "sample-by-sample" basis and require only the features of the current sample plus a small amount of aggregated information (a rule-base and a small number of variables needed for the recursive calculations). They do not require all the history of the data
stream (all previously seen data samples). Sometimes they are also called one-pass (each sample is processed only once and is then discarded from the memory). FRB systems have been successfully applied to a range of classification tasks including, but not limited to, decision making, fault detection, pattern recognition, and image processing [46]. FRB systems have become one of the alternative frameworks for classifier design, together with the more established Bayesian classifiers, decision trees [33], neural-network-based classifiers [57], and support vector machines (SVM) [67]. The task of the classifier is to map the set of features of the sample data onto the set of class labels. A particular advantage of FRB classifiers is that they are linguistic in form while also being proven universal approximators [68]. In the framework of the concept of evolving fuzzy systems, a family of evolving fuzzy classifiers, eClass, was proposed in [16,17,51]. The first type of evolving fuzzy classifier, eClass0, has the typical structure of a fuzzy classifier [46], which differs from structure (1) in the consequent part only:

\Re_i: IF (Feature_1 is close to prototype_{i1}) AND ... AND (Feature_n is close to prototype_{in}) THEN (Class Label_i)   (13)

The output of eClass0, in the same way as in typical fuzzy classifiers [46], provides the label of the class (0, 1, etc.) directly. In this sense, it is not a TS fuzzy system but is closer to the Mamdani-type fuzzy systems [54]. The main difference of eClass0 from the typical classifiers [46] is its ability to evolve, to expand the set of fuzzy rules, which enables it to capture new data patterns and to adapt to the possibly changing characteristics of the streaming data [16,17]. The inference in eClass0 is produced using the so-called "winner takes all" rule [33,46]:

Label = Label_{i^*},  i^* = \arg\max_{i=1,...,R} \prod_{j=1}^{n} \mu_{ij}(x_j)   (14)
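A minimal sketch of the eClass0 inference of Eqs. (13) and (14) follows; Gaussian "closeness" to the prototypes is an assumed membership shape, since the text does not fix one here:

```python
import numpy as np

def eclass0_predict(x, prototypes, spreads, labels):
    """Winner-takes-all inference (Eq. (14)): the rule whose prototype
    best matches the feature vector, in the sense of the product of
    per-feature memberships, supplies the class label directly."""
    x = np.asarray(x, dtype=float)
    mu = np.exp(-((x - prototypes) ** 2) / (2.0 * spreads ** 2))
    firing = mu.prod(axis=1)             # one firing level per rule
    return labels[int(np.argmax(firing))]

# two rules over two features: prototypes near the two class regions
prototypes = np.array([[0.0, 0.0], [1.0, 1.0]])
spreads = np.ones_like(prototypes)
labels = ["Class 0", "Class 1"]
```

For example, `eclass0_predict([0.1, 0.2], prototypes, spreads, labels)` returns "Class 0", since the first prototype fires far more strongly.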
It is much easier and faster to build and train eClass0 in real-time, but the classification results can be further improved significantly if the classifier structure is assumed to be of TS type. eClass is designed for on-line applications with an evolving (self-developing) FRB structure. The antecedents of the FRB are formed from the data stream around highly descriptive focal points (prototypes) per class in the input-output space. The feature vector, x, is augmented with the class label, L, to form a joint input-output vector z = [x^T, L]^T. The eClustering algorithm is
applied per class (not over all the data). In this way, information granules (primitive forms of knowledge) [36] are formed in real-time around descriptive data samples and are represented linguistically by fuzzy sets. This on-line algorithm works similarly to adaptive control [19] and estimation [39]: in the period between two samples, two phases are performed: (1) class prediction (classification); (2) classifier update or evolution. During the first phase the class label is not known and is being predicted; during the second phase it is known and is used as supervisory information to update the classifier (including the evolution of its structure as well as the update of its parameters). An alternative structure of the fuzzy classifier, eClass1, is based on the TS-type fuzzy system, which has a consequent part of functional type as described in (7). The architecture of eClass1 differs significantly from the architecture of eClass0 and from the typical FRB [46]: it performs a regression over the features. Bearing in mind that the classification surface in a data stream changes dynamically, the goal of the evolving fuzzy classifier eClass1 is to evolve a rule-base that takes these changes into account by adapting the parameters of the FRB (spreads, consequent parameters) as well as the focal points and the size of the rule-base. The output of each rule is a real value (not an integer as in the typical fuzzy classifiers), which, if normalized, represents the possibility that a data sample is of a certain class [16,17]:

\bar{y}_i = \frac{y_i}{\sum_{l=1}^{R} y_l}   (15)
The overall output of the classifier is then taken as a weighted average (not winner-takes-all as in typical fuzzy classifiers) of the normalized outputs of the fuzzy rules:

y = \sum_{i=1}^{R} \frac{\prod_{j=1}^{n} \mu_{ij}(x_j)}{\sum_{l=1}^{R} \prod_{j=1}^{n} \mu_{lj}(x_j)} \, y_i   (16)
This output is then used to discriminate between the classes. If the problem has two classes (A and B), then the target values are 0 for one of the two classes (e.g. Class A) and 1 for the other (Class B), or vice versa. To discriminate in this case one can simply use a threshold of 0.5: all outputs above 0.5 are classified as Class B, while all outputs below 0.5 are classified as Class A (or vice versa):

IF (y > 0.5) THEN (Class B) ELSE (Class A)   (17)
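The two-class decision of Eq. (17) amounts to a one-line threshold test; which class sits above the threshold is a convention (the "or vice versa" in the text):

```python
def two_class_label(y, threshold=0.5):
    """Threshold discrimination for a two-class eClass1 output;
    here outputs above the threshold are taken as Class B
    (the opposite convention is equally valid)."""
    return "B" if y > threshold else "A"
```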
Evolving Fuzzy Systems, Figure 3 Indirect learning-based control scheme
When the problem has more than two classes, one can apply MIMO eTS, where each of the K outputs corresponds to the possibility that a data sample belongs to a certain class (as discussed above). It is interesting to note that it is also possible to use MIMO eTS for a two-class problem. In order to do this, one needs vectors that represent the target outputs, for example y = [1, 0] for Class A and y = [0, 1] for Class B, or vice versa. In eClass1-MIMO the label is determined by the highest value of the discriminator, y_l:

Label = Label_{l^*},  l^* = \arg\max_{l=1,...,K} y_l   (18)
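The one-hot targets and the argmax labeling of Eq. (18) can be written compactly as:

```python
import numpy as np

# one-hot target vectors for a two-class problem, as in the text
targets = {"A": np.array([1.0, 0.0]), "B": np.array([0.0, 1.0])}

def mimo_label(y, labels):
    """Eq. (18): the class whose discriminator output is largest wins."""
    return labels[int(np.argmax(y))]
```

For instance, `mimo_label([0.8, 0.3], ["A", "B"])` returns "A".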
where K denotes the number of classes.

Evolving Fuzzy Controllers

Fuzzy logic controllers have been applied in a range of industrial processes [48,59] around the world, including in home appliances [21]. The structure of the controller, however, is often decided in an ad hoc manner [54,59], and the parameters are tuned off-line using various numerical techniques, such as genetic algorithms [27,32,63]. In reality, however, even if a validation test has been made beforehand, there is no guarantee that a controller designed in this way will perform satisfactorily if the object of control or its environment changes [4]. The reasons could be aging, wearing, a change in the mode of operation, the development of a fault, seasonal changes in the environment, etc. An effective mechanism for tackling such problems, known from classical control theory, is adaptation [19]. It is well developed for linear models and controllers [38], but not for the general (very often highly non-linear, complex, and uncertain) case [2]. Adaptive control theory assumes a linear model with a fixed structure and adapts the parameters only [19,38]. The concept of evolving fuzzy systems has been applied to the control problem in [2,4] in terms of self-developing the structure of a controller from experimental data in a data-driven manner, based on the indirect adaptive learning scheme proposed initially by Psaltis [60] and developed further using NN by Andersen [1]. The indirect learning (IL) control scheme is based on the approximation of the inverse dynamics of the plant. It is a model-free concept: it feeds back the integrated (or delayed one step back) output signal instead of feeding back the error between the plant output and the reference signal, as represented in Fig. 3. Figure 3 represents only the basic concept of the approach. It has two phases, and the switching between them can be represented by an imaginary switch knob. When the imaginary knob, K, is in position "1", the controller is used and we are in the phase "Control". When the imaginary knob is in position "2", the controller learns and self-develops and we are in the phase "Learning". During the supervisory learning phase, the true output signal y_{k+1} at the time instant k+1 is fed back and the knob is in position "2". The controller also receives a signal that is the delayed true output, y_k, and produces as output the value of the control signal, u_k. During the control phase (when the knob is in position "1") the input is determined entirely by the reference signal (ref), used as an alternative to the predicted next-step output, y_{k+1}. In this way, the controller, already trained in the previous learning phases, produces a control signal u_k that brings the output of the plant at the next time step, y_{k+1}, close to the reference signal (ref). The IL scheme was taken further in [2,4] by implementing the controller as an evolving FRB system of TS type. The original works realized the controller as a NN that was trained off-line, based on a batch set of training data with triplets of the form [y_k, y_{k+1}, u_k], k = 1, 2, ..., N, where N denotes the number of training data samples. Learning techniques for NN are, however, iterative, and therefore the training of the NN controller as described in [1,60] is performed off-line. Additionally, NN have an important disadvantage in comparison with FRB systems: they are not transparent. In [2,4] the basic IL control scheme is taken further by adding a disturbance and by using eTS to realize the controller. This scheme was implemented on temperature-control problems.

Evolving Fuzzy Systems, Figure 4
Fuzzy sets for different fractions of the crude that contribute to the different quality of the end product (naphtha in this case) can be extracted automatically in real-time from the data stream using eSensor

Application Case Studies

Self-Calibrating Intelligent Sensors for Process Industries

So-called intelligent or inferential sensors have been adopted by the process industries (chemical, petro-chemical, manufacturing) for several decades [31]. The main reason is that they provide accurate real-time estimates of parameters that are otherwise difficult to measure, or can replace (substitute) expensive measurements, such as gas emissions, biomass, melt index, etc. They use as inputs the available ("hard") sensors for easy-to-measure physical variables, such as temperatures, pressures, and flows
which are also cheaper. The main disadvantage of the currently existing inferential or "soft" sensors is that significant effort, based on batch sets of data, must be made to develop and maintain the mathematical models that support them (neural networks, statistical models, etc.). Any process change outside the conditions used for the off-line model development can lead to significant performance deterioration, which, in turn, requires maintenance and re-calibration. Evolving Fuzzy Systems offer an effective opportunity to develop "soft" sensors that are more flexible, self-calibrating, and thus more "intelligent" [45]. Several applications of EFS-based soft sensors have been reported, in particular for oil refineries [52,53] and propylene production [12]. An important advantage of the evolving sensors is that they extract human-interpretable knowledge in the form of linguistic fuzzy rules. For example, Fig. 4 illustrates the membership functions of the fuzzy sets (with values in the range [0, 1] on the vertical axis) that describe the density of the crude, d, in grams per liter (g/l) on the horizontal axis. The evolving fuzzy sensor (eSensor) implemented in the oil refinery at Santa Cruz, Tenerife, Spain predicts in real-time the temperature at which the heavy naphtha (hn) evaporates 95% liquid volume, T_hn in degrees Celsius, according to the ASTM D86-04b standard, based on real-time measurements of:

- the pressure of the tower, p, measured in kg/cm2 g;
- the amount of the product taken off, P, represented in %;
- the density of the crude, d, in g/l (illustrated in Fig. 4);
Evolving Fuzzy Systems, Figure 5 Fuzzy rule describing a Landmark (underpass at Lancaster University campus) that was discovered automatically by eClustering using video streaming data
- the temperature of the column overhead, T_co, in °C;
- the temperature of the naphtha extraction, T_ne, in °C.

An expert in the area of oil-refining processes can easily visually distinguish between the heavy crude and the light crude represented by the respective membership functions derived automatically from the data in real-time.

Adaptive Real-Time Classifiers (Image, Land-Mark, Robotic)

Presently the data that are to be processed in industry, in defense, and in other real-life applications are not only huge in volume, but very often come in the form of data streams [28]. This requires not only precise classifiers, but also dynamically evolvable classifiers. For example, in mobile robotics, an autonomous vehicle operating in a completely unknown environment produces a video stream that needs to be processed [75,76]. The Evolving Clustering method was used to automatically generate a fuzzy rule-base that describes the landmarks discovered without any prior learning, "on the fly", by a mobile robot Pioneer 3DX [58] exploring a completely unknown environment [76]. The mobile robot uses its on-board pan-tilt-zoom camera to produce a video stream. The frames were grabbed and processed by the eClustering algorithm based on a 3-dimensional color vector (R, G, B). Fuzzy rules of the form shown in Fig. 5 were extracted from the data automatically. Note that the landmarks that were identified automatically represented real objects on the route of the mobile robot, such as the underpass of Lancaster University from the example. They were identified as distinctive from the surrounding background by eClustering. The fuzzy rule-base (Fig. 5) evolved "from scratch" based on the video information and the data distribution only.
Predictive Models (Air-Conditioning, Financial Time-Series, Benchmark Data Sets)

There are different techniques that can be used for predictive models, such as ARMAX models [50], neural networks [34], and non-evolving (fixed-structure) fuzzy rule-based models [65,73]. Evolving fuzzy systems, however, additionally offer the capability of a predictor that evolves, following the dynamic changes of the data by gradually adapting not only its parameters but also its structure. In [6] the problem of predicting the characteristic temperature difference across a coil in a heat exchanger of an air-conditioning unit installed in a real building in Iowa, USA is considered. The evolving fuzzy rule-based model develops its structure and parameters based on data for the flow rate entering the coil, the moisture content of the air entering the coil, the temperature of the chilled water, and the control signal to the valve, as illustrated in Fig. 6. The model proved to work satisfactorily in all seasonal conditions due to its ability to adapt to the changes in the environment (different seasons), as demonstrated in Fig. 6c. When pre-trained (on 400 samples) and then fixed in both structure and parameters, the performance deteriorated unacceptably in changing seasonal conditions, as seen in Fig. 6a. A partial re-training improved the results significantly (Fig. 6b), but these were still less valid after the season changed (at around sample 912) and the model structure evolution had stopped (at sample 1000). When the model structure evolution continued uninterrupted, the result was a satisfactory performance in all seasons, as seen in Fig. 6c.

Evolving Fuzzy Systems, Figure 6
A predictive model of the characteristic temperature difference across a coil in a heat exchanger of an air-conditioning unit installed in a real building in Iowa, USA. a An off-line pre-trained model used in two different seasons (summer and spring); b Evolving FRB model trained and used during the summer (up to sample 1000) and then having its structure fixed during the spring season; c Evolving FRB model left to evolve (self-develop) during the whole period of usage (both spring and summer seasons)

Fault Detection and Prognostics

Evolving clustering and eTS fuzzy systems were applied at Ford Motor Company to machine health monitoring and prognosis in [30]. The ability of the evolving clustering method to form new clusters that represent different operating modes of a machine was exploited, and different types of faults (incipient or drastic) were automatically identified based on the difference in the cluster formation. A prediction of the direction of movement of the cluster centers was used for the prediction of possible faults and of the end-of-life of the machine.

Speech Signal Reconstruction

The eTS fuzzy system is used in [37] for error concealment in next-generation Voice over Internet Protocol (VoIP) communication receivers. It is used in combination with parametric speech coders of the analysis-by-synthesis type. eTS MIMO [15] is used to predict the missing values of the linear spectral pairs (LSP), which allows one to reconstruct the packets lost in transmission. The eTS fuzzy model used ten inputs (current LSP parameter values) and ten outputs (LSP values predicted one step, i.e. 20 milliseconds, ahead). This research was a joint work between Lancaster University and Nokia-UK and aimed at the development of next-generation intelligent decoders at the receiver that will be able to conceal lost packets with a size of 80 to 160 ms without significant deterioration of the quality of service (QoS) in VoIP transmission.

Future Directions
The area of evolving fuzzy systems is in its infancy, and yet it has already demonstrated remarkable success in addressing some of the most vibrant issues of the development, application, and implementation of truly intelligent systems in a wide variety of branches of industry and real-life problems [11]. It opens the door to future developments in the areas of autonomous systems and of early cancer diagnosis and prognosis of its progression, even to the identification of structural changes in biological cells that correspond to the evolution of the disease. In the area of intelligent self-maintaining sensors the process industry can benefit from more flexible and smarter solutions. The problems that are yet to be addressed, and that can mark the future development of this vibrant area, are: (1) collaboration between two or more evolving fuzzy system-based intelligent systems (autonomous robots, intelligent sensors, etc.); (2) further flexibility of the systems in terms of real-time self-analysis, optimal feature and input selection, rule-aggregation mechanism adaptation, etc.; (3) even more flexible system structure architectures, such as hierarchical and decentralized ones; (4) more robust learning algorithms that take care of missing data, different sampling intervals, etc. From a broader perspective, the future developments of this discipline will influence, and are closely related to, similar developments in the areas of communication networks (self-adaptive networks [64]), self-validating soft sensors [61], autonomous aerial, ground-based, and underwater vehicles [20,40,49], etc. The area is closely related to developments in the areas of neural networks [34,48], so-called autonomous mental development [23] and cognitive psychology [55], and the mining of data streams [28]. One
can also expect more hardware implementations (the first hardware implementation of eClustering was reported in 2005 [8]). From the point of view of mathematical fundamentals and learning, the area is also closely related to adaptive filter theory [69], and the recent developments in particle filters [18] will certainly influence the future, more efficient techniques that will be developed in this emerging and highly promising branch of research.

Acknowledgments

The author would like to thank Mr. Xiaowei Zhou for his assistance in producing the illustrative material, Dr. Jose Macias Hernandez for kindly providing real data from the oil refinery CEPSA, Santa Cruz, Tenerife, Spain, Dr. Richard Buswell, Loughborough University, and ASHRAE (RP-1020) for the real air-conditioning data, and Dr. Edwin Lughofer from Johannes Kepler University of Linz, Austria for providing real data from car engines.

Bibliography

Primary Literature

1. Andersen HC, Teng FC, Tsoi AC (1994) Single Net Indirect Learning Architecture. IEEE Trans Neural Netw 5:1003–1005
2. Angelov P (2002) Evolving Rule-based Models: A Tool for Design of Flexible Adaptive Systems. Springer, Heidelberg
3. Angelov P (2004) An Approach for Fuzzy Rule-base Adaptation using On-line Clustering. Int J Approx Reason 35(3):275–289
4. Angelov PP (2004) A Fuzzy Controller with Evolving Structure. Inf Sci 161:21–35
5. Angelov P, Buswell R (2001) Evolving Rule-based Models: A Tool for Intelligent Adaptation. In: Proc of the Joint 9th IFSA World Congress and 20th NAFIPS Intern Conf, Vancouver, 25–28 July 2001. IEEE Press, USA, pp 1062–1066
6. Angelov P, Buswell R (2002) Identification of Evolving Rule-based Models. IEEE Trans Fuzzy Syst 10(5):667–677
7. Angelov P, Buswell R (2003) Automatic Generation of Fuzzy Rule-based Models from Data by Genetic Algorithms. Inf Sci 150(1/2):17–31
8. Angelov P, Everett M (2005) EvoMap: On-Chip Implementation of Intelligent Information Modelling using EVOlving MAPping. Lancaster University, Lancaster, pp 1–15
9. Angelov P, Filev D (2004) An approach to on-line identification of evolving Takagi–Sugeno models. IEEE Trans Syst Man Cybern Part B Cybern 34(1):484–498
10. Angelov P, Filev D (2005) Simpl_eTS: A Simplified Method for Learning Evolving Takagi–Sugeno Fuzzy Models. In: Proc of the 2005 IEEE Intern Conf on Fuzzy Systems, FUZZ-IEEE 2005, Reno, 2005, pp 1068–1073
11. Angelov P, Filev D, Kasabov N, Cordon O (eds) (2006) Evolving Fuzzy Systems. Proc of the 2nd Int Symposium on Evolving Fuzzy Systems, Ambleside, 7–9 Sept 2006, pp 1–350
12. Angelov P, Kordon A, Zhou X (2008) Adaptive Inferential Sensors based on Evolving Fuzzy Models: An Industrial Case Study. IEEE Trans Fuzzy Syst (under review)
13. Angelov P, Lughofer E, Klement PE (2005) Two Approaches for Data-Driven Design of Evolving Fuzzy Systems: eTS and FLEXFIS. In: Proc of the 2005 North American Fuzzy Information Processing Society (NAFIPS) Annual Conference, Ann Arbor, June 2005, pp 31–35
14. Angelov P, Victor J, Dourado A, Filev D (2004) On-line evolution of Takagi–Sugeno Fuzzy Models. In: Proc of the 2nd IFAC Workshop on Advanced Fuzzy and Neural Control, Oulu, 16–17 Sept 2004, pp 67–72
15. Angelov P, Xydeas C, Filev D (2004) On-line Identification of MIMO Evolving Takagi–Sugeno Fuzzy Models. In: Proc of the Intern Joint Conf on Neural Networks and Intern Conf on Fuzzy Systems, IJCNN-FUZZ-IEEE, Budapest, 25–29 July 2004, pp 55–60
16. Angelov P, Zhou X, Klawonn F (2007) Evolving Fuzzy Rule-based Classifiers. In: Proc of the First 2007 IEEE International Conference on Computational Intelligence Applications for Signal and Image Processing, part of the IEEE Symposium Series on Computational Intelligence, SSCI-2007, Honolulu, 1–5 April 2007, pp 220–225
17. Angelov P, Zhou X, Lughofer E, Filev D (2007) Architectures of Evolving Fuzzy Rule-based Classifiers. In: Proc of the 2007 IEEE International Conference on Systems, Man, and Cybernetics, Montreal, 7–10 Oct 2007, pp 2050–2055
18. Arulampalam MS, Maskell S, Gordon N (2002) A Tutorial on Particle Filters for On-line Nonlinear Non-Gaussian Bayesian Tracking. IEEE Trans Signal Process 50(2):174–188
19. Astroem KJ, Wittenmark B (1994) Adaptive Control. Prentice Hall, Upper Saddle River
20. Azimi-Sadjadi MR, Yao D, Jamshidi AA, Dobeck GJ (2002) Underwater Target Classification in Changing Environments Using an Adaptive Feature Mapping. IEEE Trans Neural Netw 13(5):1099–1111
21. Badami VV, Chbat NW (1998) Home appliances get smart. IEEE Spectrum 35(8):36–43
22. Bezdek J (1974) Cluster Validity with Fuzzy Sets. J Cybern 3(3):58–71
23. Bonarini A, Lazaric A, Restelli M, Vitali P (2006) Self-Development Framework for Reinforcement Learning Agents.
In: Proc of the 5th Intern. Conf. on Development and Learning, ICDL06 New Delhi 24. Carline D, Angelov PP, Clifford R (2005) Agile Collaborative Autonomous Agents for Robust Underwater Classification Scenarios. In: the Proceedings of the Underwater Defense Technology Conference, Amsterdam, June 2005 25. Carpenter GA, Grossberg S (2003) Adaptive Resonance Theory. In: Arbib MA (ed) The Handbook of Brain Theory and Neural Networks, 2nd edn. MIT Press, Cambridge, pp 87–90 26. Chiu SL (1994) Fuzzy model identification based on cluster estimation. J Intell Fuzzy Syst 2:267–278 27. Cordon O, Gomide F, Herrera F, Hoffmann F, Magdalena L (2004) Ten years of genetic fuzzy systems: Current framework and new trends. Fuzzy Sets Syst 141(1):5–31 28. Domingos P, Hulten G (2001) Catching up with the data: Research issues in mining data streams. In: Proc of the Workshop on Research Issues in Data Mining and Knowledge Discovery, Santa Barbara 29. Filev D, Larson T, Ma L (2000) Intelligent Control for Automotive Manufacturing – Rule-based Guided Adaptation. In: Proc of the IEEE Conference on Industrial Electronics, IECON-2000, Nagoya Oct 2000, pp 283–288
30. Filev D, Tseng F (2006) Novelty detection-based Machine Health Prognostics. In: Proc of the 2006 Int Symposium on Evolving Fuzzy Systems. IEEE Press, USA, pp 193–199 31. Fortuna L, Graziani S, Rizzo A, Xibilia MG (2007) Soft Sensors for Monitoring and Control of Industrial Processes. Springer, London 32. Goldberg DE (1989) Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading 33. Hastie T, Tibshirani R, Friedman J (2001) The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, Heidelberg 34. Hornby AS (1974) Oxford Advanced Learner's Dictionary. Oxford University Press, Oxford 35. Huang G-B, Saratchandran P, Sundarajan N (2005) A generalized growing and pruning RBF (GGAP-RBF) neural network for function approximation. IEEE Trans Neural Netw 16(1):57–67 36. Ishibuchi H, Nakashima T, Nii M (2004) Classification and Modeling with Linguistic Granules: Advanced Information Processing. Springer, Berlin 37. Jones E, Angelov P, Xydeas C (2006) Recovery of LSP Coefficients in VoIP Systems using Evolving Takagi–Sugeno Fuzzy MIMO Models. In: Proc of the 2006 Intern. Symposium on Evolving Fuzzy Systems, Ambleside, 7–9 Sept 2006, pp 208–214 38. Kailath T, Sayed AH, Hassibi B (2000) Linear Estimation. Prentice Hall, Upper Saddle River 39. Kalman RE (1960) A New Approach to linear filtering and prediction problem. Transactions of the American Society of Mechanical Engineering, ASME, Ser. D. J Basic Eng 8:34–45 40. Kanakakis V, Valavanis KP, Tsourveloudis NC (2004) Fuzzy-Logic Based Navigation of Underwater Vehicles. J Intell Robotic Syst 40:45–88 41. Kasabov N (2001) Evolving fuzzy neural networks for on-line supervised/unsupervised, knowledge-based learning. IEEE Trans. on Systems, Man and Cybernetics – part B. Cybernetics 31:902–918 42. Kasabov N, Song Q (2002) DENFIS: Dynamic Evolving Neural-Fuzzy Inference System and Its Application for Time-Series Prediction. IEEE Trans Fuzzy Syst 10(2):144–154 43.
Klir G, Folger T (1988) Fuzzy Sets, Uncertainty and Information. Prentice Hall, Englewood Cliffs 44. Kohonen T (1995) Self-Organizing Maps. Series in Inf Sci, vol 30. Springer, Heidelberg 45. Kordon A (2006) Inferential Sensors as Potential Application Area of Intelligent Evolving Systems. 2006 International Symposium on Evolving Fuzzy Systems, Ambleside, 7–9 September 2006, key note presentation 46. Kuncheva L (2000) Fuzzy Classifiers. Physica, Heidelberg 47. Leng G, McGuinty TM, Prasad G (2005) An approach for online extraction of fuzzy rules using a self-organizing fuzzy neural network. Fuzzy Sets Syst 150(2):211–243 48. Lin F-J, Lin C-H, Shen P-H (2001) Self-constructing fuzzy neural network speed controller for permanent-magnet synchronous motor drives. IEEE Trans Fuzzy Syst 9(5):751–759 49. Liu PX, Meng MQ-X (2004) On-line Data-Driven Fuzzy Clustering with Applications to Real-time Robotic Tracking. IEEE Trans Fuzzy Syst 12(4):516–523 50. Ljung L (1987) System Identification: Theory for the User. Prentice-Hall, New Jersey
51. Lughofer E, Angelov P, Zhou X (2007) Evolving Single- and Multi-Model Fuzzy Classifiers with FLEXFIS-Class. In: Proc of the 2007 IEEE International Conference on Fuzzy Systems, London, 23–26 July 2007, pp 363–368 52. Macias J, Angelov P, Zhou X (2006) Predicting quality of the crude oil distillation using evolving Takagi–Sugeno fuzzy models. In: Proc of the 2006 International Symposium on Evolving Fuzzy Systems, Ambleside, 7–9 Sept 2006, pp 201–207 53. Macias-Hernandez JJ, Angelov P, Zhou X (2007) Soft Sensor for Predicting Crude Oil Distillation Side Streams using Takagi Sugeno Evolving Fuzzy Models. In: Proc of the 2007 IEEE Int Conf on Syst, Man, and Cybernetics, Montreal, 7–10 Oct. 2007, pp 3305–3310 54. Mamdani EH, Assilian S (1975) An Experiment in Linguistic Synthesis with a Fuzzy Logic Controller. Int J Man-Mach Stud 7:1–13 55. Massaro DW (1991) Integration versus Interactive Activation: The Joint Influence of Stimulus and Context in Perception. Cogn Psychol 23:558–614 56. Memon MA, Angelov P, Ahmed H (2006) An Approach to Real-Time Color-based Object Tracking. In: Proc 2006 International Symposium on Evolving Fuzzy Systems, Ambleside, 7–9 Sept 2006, pp 81–87 57. Nauck D, Kruse R (1997) A Neuro-fuzzy method to learn fuzzy classification rules from data. Fuzzy Sets Syst 89:277–288 58. Pioneer-3DX (2004) User Guide. ActiveMedia Robotics, Amherst 59. Procyk TJ, Mamdani EH (1979) A linguistic self-organizing process controller. Automatica 15:15–30 60. Psaltis D, Sideris A, Yamamura AA (1988) A Multilayered Neural Network Controller. IEEE Trans Control Syst Manag 8:17–21 61. Qin SJ, Yue H, Dunia R (1997) Self-validating inferential sensors with application to air emission monitoring. Ind Eng Chem Res 36:1675–1685 62. Setnes M, Roubos H (2000) Ga-fuzzy modeling and classification: complexity and performance. IEEE Trans Fuzzy Syst 8(5):509–522 63. 
Shimojima K, Fukuda T, Hashegawa Y (1995) Self-Tuning Modeling with Adaptive Membership Function, Rules, and Hierarchical Structure based on Genetic Algorithm. Fuzzy Sets Syst 71:295–309 64. Sifalakis M, Hutchison D (2004) From Active Networks to Cognitive Networks. In: Proc of the ICRC Dagstuhl Seminar 04411 on Service Management and Self-Organization in IP-based Networks. Waden, October 2004 65. Takagi T, Sugeno M (1985) Fuzzy identification of systems and its application to modeling and control. IEEE Trans Syst Man Cybern B – Cybern 15:116–132 66. Valavanis K (2006) Unmanned Vehicle Navigation and Control: A Fuzzy Logic Perspective. In: Proc of the 2006 International Symposium on Evolving Fuzzy Systems. Ambleside, 7– 9 Sept. 2006, pp 200–207 67. Vapnik VN (1998) The Statistical Learning Theory. Springer, Berlin 68. Wang L-X (1992) Fuzzy Systems are Universal Approximators. In: Proc of the First IEEE International Conference on Fuzzy Systems, FUZZ-IEEE – 1992, San Diego, pp 1163–1170 69. Widrow B, Stearns S (1985) Adaptive Signal Processing. Prentice Hall, Englewood Cliffs 70. Xydeas C, Angelov P, Chiao S, Reoullas M (2006) Advances in EEG Signals Classification via Dependant HMM models and
Evolving Fuzzy Classifiers. Int J Comput Biol Medicine, special issue on Intell Technol Bio-Inform Medicine 36(10):1064–1083 71. Yager R (2006) Learning Methods for Intelligent Evolving Systems. In: Proc 2006 International Symposium on Evolving Fuzzy Systems, Ambleside, 7–9 Sept 2006, pp 3–7 72. Yager RR, Filev DP (1993) Learning of Fuzzy Rules by Mountain Clustering. In: Proc of the SPIE Conf. on Application of Fuzzy Logic Technology, Boston, pp 246–254 73. Yager RR, Filev DP (1994) Essentials of Fuzzy Modeling and Control. Wiley, New York 74. Zadeh LA (1993) Soft Computing. Introductory Lecture for the 1st European Congress on Fuzzy and Intelligent Technologies EUFIT'93, Aachen, pp vi–vii 75. Zhou X-W, Angelov P (2006) Real-Time joint Landmark Recognition and Classifier Generation by an Evolving Fuzzy System. In: Proc of the 2006 IEEE World Congress on Computational Intelligence, WCCI-2006, Vancouver, 16–21 July 2006, pp 6314–6321 76. Zhou X, Angelov P (2007) An approach to autonomous self-localization of a mobile robot in completely unknown environment using evolving fuzzy rule-based classifier. In: Proc of the First 2007 IEEE Int Symposium on Computational Intelligence Applications for Defense and Security – a part of the IEEE Symposium Series on Computational Intelligence, SSCI-2007, Honolulu, 1–5 April 2007, pp 131–138
Books and Reviews

Angelov P, Xydeas C (2006) Fuzzy Systems Design: Direct and Indirect Approaches. Int J Soft Comput, special issue on New Trends in Fuzzy Modeling part I: Novel Approaches 10(9):836–849 Angelov P, Zhou X (2006) Evolving fuzzy systems from data streams in real-time. Proc 2006 International Symposium on Evolving Fuzzy Systems, Ambleside, 7–9 Sept 2006, pp 29–35 Bentley PJ (2000) Evolving Fuzzy Detectives: An Investigation into the Evolution of Fuzzy Rules. In: Suzuki, Roy, Ovaska, Furuhashi, Dote (eds) Soft Computing in Industrial Applications. Springer, London Chan Z, Kasabov N (2004) Evolutionary computation for on-line and off-line parameter tuning of evolving fuzzy neural networks. Int J Comput Intell Appl 4(3):309–319 Fayyad UM, Piatetsky-Shapiro G, Smyth P (1996) From Data Mining to Knowledge Discovery: An Overview, Advances in Knowledge Discovery and Data Mining. MIT Press, Boston Fritzke B (1994) Growing cell structures – a self-organizing network for unsupervised and supervised learning. Neural Netw 7(9):1441–1460 Futschik M, Reeve A, Kasabov N (2003) Evolving connectionist systems for knowledge discovery from gene expression data of cancer tissue. Artif Intell Med 28:165–189
Hopner F, Klawonn F (2000) Obtaining interpretable fuzzy models from fuzzy clustering and fuzzy regression. In: Proc of the 4th Intern Conf on Knowledge-based Intelligent Engineering Systems (KES), Brighton, pp 162–165 Huang G-B, Saratchandran P, Sundarajan N (2005) A generalized growing and pruning RBF (GGAP-RBF) neural network for function approximation. IEEE Trans Neural Netw 16(1): 57–67 Huang L, Song Q, Kasabov N (2005) Evolving Connectionist Systems Based Role Allocation of Robots for Soccer Playing. In: Proc of the Joint 2005 International Symposium on Intelligent Control and 13th Mediterranean Conference on Control and Automation, ISIC – MED – 2005, Limassol, 27–29 June 2005 Jang JSR (1993) ANFIS: Adaptive Network-based Fuzzy Inference Systems. IEEE Trans. on Syst Man Cybernt B – Cybern 23(3):665–685 Juang C-F, Lin X-T (1999) A recurrent self-organizing neural fuzzy inference network. IEEE Trans Neural Netw 10:828–845 Kasabov N (2006) Adaptation and Interaction in Dynamical Systems: Modelling and Rule Discovery Through Evolving Connectionist Systems. Appl Soft Comput 6(3):307–322 Kasabov N (2006) Evolving connectionist systems: Brain-, gene-, and, quantum inspired computational intelligence. Springer, London Kasabov N, Chan Z, Song Q, Greer D (2005) Evolving neuro-fuzzy systems with evolutionary parameter self-optimisation. In: Do Adaptive Smart Systems exist? Series Study in Fuzziness, vol 173. Physica, Heidelberg Kim K, Baek J, Kim E, Park M (2005) TSK Fuzzy model based on-line identification. In: Proc of the 11th International Fuzzy Systems Association, IFSA World Congress, Beijing, pp 1435–1439 Klinkenberg R, Joachims T (2000) Detection concept drift with support vector machines. Proc of the 7th International Conference on Machine Learning (ICML). Morgan Kaufman, Stanford University, pp 487–494 Marin-Blazquez JG, Shen Q (2002) From approximative to descriptive fuzzy classifiers. 
IEEE Trans Fuzzy Syst 10(4):484–497 Marshall MR, Song Q, Ma TM, MacDonell S, Kasabov N (2005) Evolving Connectionist System versus Algebraic Formulae for Prediction of Renal Function from Serum Creatinine. Kidney Int 6:1944–1954 Ozawa S, Pang S, Kasabov N (2004) A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier. Lecture Notes in Artificial Intelligence LNAI, vol 3157. Springer, Berlin, pp 231–240 Pang S, Ozawa S, Kasabov N (2005) Incremental Linear Discriminant Analysis for Classification of Data Streams. IEEE Trans Syst Man Cybern B – Cybern 35(5):905–914 Plat J (1991) A resource allocation network for function interpolation. Neural Comput 3(2):213–225 Widmer G, Kubat M (1996) Learning in the presence of concept drift and hidden contexts. Mach Learn 23(1):69–101
Extreme Value Statistics

MARIO NICODEMI
Department of Physics, University of Warwick, Coventry, UK

Article Outline

Glossary
Definition of the Subject
Introduction
The Extreme Value Distributions
The Generalized Extreme Value Distribution
Domains of Attraction and Examples
Future Directions
Bibliography

Glossary

Random variable  When a coin is tossed, two random outcomes are permitted: head or tail. These outcomes can be mapped to numbers in a process which defines a 'random variable': for instance, 'head' and 'tail' could be mapped respectively to +1 and −1. More generally, any function mapping the outcomes of a random process to real numbers is called a random variable [15]. More technically, a random variable is any function from a probability space to some measurable space, i.e., the space of admitted values of the variable, e.g., the real numbers with the Borel σ-algebra. The amount of rainfall in a day or the daily price variation of a stock are two more examples. It is worth stressing that, formally, the outcome of a given random experiment is not a random variable: the random variable is the function describing all the possible outcomes as numbers. Finally, two random variables are said to be independent when the outcome of either of them has no influence on the other.

Probability distribution  The probability of either outcome, 'head' or 'tail', in tossing a coin is 50%. Similarly, a discrete random variable, X, with values {x_1, x_2, …} has an associated discrete probability distribution of occurrence {p_1, p_2, …}. More generally, for a random variable on the real numbers, X, the corresponding probability distribution [15] is the function returning the probability of finding a value of X within a given interval [x_1, x_2] (where x_1 and x_2 are real numbers): Pr[x_1 ≤ X ≤ x_2]. In particular, the random variable, X, is fully characterized by its cumulative distribution function, F(x), which is F(x) = Pr[X < x] for any x in R.
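A toy sketch of these definitions (the die-valued random variable is our own illustrative choice, not an example from the article):

```python
from fractions import Fraction

# A random variable maps the outcomes of a random experiment to numbers;
# here X is the face value of a fair six-sided die.
probabilities = {face: Fraction(1, 6) for face in range(1, 7)}

def F(x):
    """Cumulative distribution function, F(x) = Pr[X < x] as in the glossary."""
    return sum(p for value, p in probabilities.items() if value < x)

print(F(1))     # 0: no face lies below 1
print(F(3.5))   # 1/2: faces 1, 2, 3 lie below 3.5
print(F(7))     # 1: every face lies below 7
```

Exact rational arithmetic via `Fraction` keeps the probabilities free of rounding, which makes the defining properties of F easy to check directly.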
The probability distribution density, f(x), can often be defined as the derivative of F(x): f(x) = dF(x)/dx.
The probability distribution of two independent random variables, X and Y, is the product of the distributions, F_X and F_Y, of X and Y: F(x, y) ≡ Pr[X < x, Y < y] = Pr[X < x] Pr[Y < y] = F_X(x) F_Y(y).

Expected value  The expected value [15] of a random variable is its average outcome over many independent experiments. Consider, for instance, a discrete random variable, X, with values in the set {x_1, x_2, …} and the corresponding probabilities {p_1, p_2, …}. In probability theory, the expected, or average, value of X (denoted E(X)) is just the sum E(X) = Σ_i x_i p_i. For instance, if you have an asset which can give two returns {x_1, x_2} with probabilities {p_1, p_2}, its expected return is x_1 p_1 + x_2 p_2. If the random variable is defined on the real numbers and F(x) is its probability distribution function, the expected value of X is E(X) = ∫ X dF. As for some F(x) the above integral may not exist, the 'expected value' of a random variable is not always defined.

Variance and moments  The variance [15] of a probability distribution is a measure of the average deviation from the mean of the related random variable. In probability theory, the variance is usually defined as the mean squared deviation, E((X − E(X))²), i.e., the expected value of (X − E(X))². The square root of the variance is named the standard deviation and is a more sensible measure of the fluctuations of X around E(X). Like E(X), for some distributions the variance may not exist. In general, the expected value of the kth power of X, E(X^k), is called the kth moment of the distribution.

The central limit theorem  The central limit theorem [15] is a very important result in probability theory stating that the sum of N independent identically distributed random variables, with finite average and variance, has a Gaussian probability distribution in the limit N → ∞, irrespective of the underlying distributions of the random variables.
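The asset example above can be computed directly; a minimal sketch (the return and probability values are invented for illustration):

```python
# Expected value and variance of a discrete random variable, following the
# glossary definitions E(X) = sum_i x_i p_i and Var(X) = E((X - E(X))^2).
returns = [0.10, -0.05]   # two possible returns x1, x2 (illustrative values)
probs = [0.6, 0.4]        # their probabilities p1, p2

mean = sum(x * p for x, p in zip(returns, probs))
variance = sum((x - mean) ** 2 * p for x, p in zip(returns, probs))
std_dev = variance ** 0.5

print(mean)      # 0.6*0.10 + 0.4*(-0.05) = 0.04
print(variance)  # 0.6*(0.06)^2 + 0.4*(-0.09)^2 = 0.0054
```

The standard deviation (here about 0.073) is on the same scale as the returns themselves, which is why it is the more natural measure of fluctuations around E(X).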
The domain of attraction of the Gaussian as a limit distribution is thus very large, which can explain why the Gaussian is so frequently encountered. The theorem is, in practice, very useful since many real random processes have a finite average and variance and are approximately independent.

Definition of the Subject

Extreme value theory is concerned with the statistical properties of the extreme events related to a random variable (see Fig. 1), and the understanding and applications of
their probability distributions. The methods and the practical use of such a theory have been developed in the last 60 years, though many complex real-life problems have only recently been tackled. Many disciplines use the tools of extreme value theory, including meteorology, hydrology, ocean wave modeling, and finance, to name just a few. For example, in economics, extreme value theory is currently used by actuaries to evaluate and price insurance against the probability of rare but financially catastrophic events. Another application is the estimation of Value at Risk. In hydrology, the theory is applied by environmental risk agencies to calculate, for example, the height of sea-walls to prevent flooding. Similarly, extreme value theory is also used to set strength boundaries in engineering materials, as well as for material fatigue and reliability in buildings (e.g., bridges, oil rigs), and for estimating pollution levels. This paper aims to give a simple, self-contained introduction to the motivations and basic ideas behind the development of extreme value theory, and briefly covers a few more technical topics such as extreme r order statistics and the generalization of extreme value distribution theory. We refer to textbooks on probability theory, such as [15] (or, for simplicity, to the Glossary), for the definition of the basic notions of probability used here. The extensive research on extreme value statistics is reviewed in excellent books published over the past years, e.g., [6,11,12,13,14,17,18,21,24]; we provide here only an overview of the basic concepts and tools. To make this paper as self-contained as possible, the Glossary gives a beginner's introduction to all the elementary notions of probability theory encountered in the following sections.

Extreme Value Statistics, Figure 1: We show three samples, 1 to 3, each with N realizations of a random variable X_i (i ∈ {1, …, N}). Extreme value theory is concerned with the statistical properties of occurrence of the extreme values in those samples, such as the maxima (circled points).

Introduction
Extreme events, exceeding the typical expected value of a random variable, can have substantial relevance to problems arising in disciplines as diverse as science, engineering and economics. Extreme value theory is a sub-field of applied statistics, developed early on [13,14,17,18] by mathematicians such as Fisher, Tippett, Gnedenko, and, in particular, Emil Julius Gumbel, dealing precisely with the problems related to extreme events (see Fig. 1). One of its key points is the so-called 'three types theorem', relating the properties of the probability distribution of the underlying stochastic variable to its extreme value distributions, i.e., the limiting distributions of the extreme (minimum or maximum) value of a large collection of random observations. Interestingly, for a comparatively large class of random variables, the theory points out that only a few species of limit extreme value distributions are found. In some respects, the 'three types theorem' can be considered the analog of the well-known central limit theorem applying to ordinary sums, or averages, of random variables. From a practical point of view it is just as important, since it opens a way to estimate the asymptotic distribution of extreme values without any a priori knowledge of, or guesses about, the parent distribution. In this way, we have a solid ground to estimate the parameters of the limit distributions along with their confidence intervals, an issue crucial, for instance, to proper risk assessment. In finance, for example, market regulators and financial institutions face the important and complex task of estimating and managing risk. Assessing the probability of rare and extreme events is, of course, a crucial issue, and reliable measures of risk are needed to minimize undesirable effects on portfolios from large fluctuations in market conditions, e.g., exchange rates or prices of assets.
Similar issues about risk and reliability are faced in insurance and banking, which are deeply concerned with unusually large
fluctuations. Extreme value theory provides the solid theoretical foundation needed for the statistical modeling of such events and the proper computation of risk and related confidence intervals. The study of natural and environmental hazards is also strongly interested in extreme events; for instance, reported hydrology and meteorology applications of extreme event theory concern flood frequency analysis, estimation of precipitation probabilities, and extreme tide levels. Predictions of events such as strong heat waves, rainfall, and the occurrence of huge sea waves are deeply grounded in such a theory as well. Analogous problems are found in telecommunications and transport systems, such as traffic data analysis, Internet traffic, and queuing modeling; problems from material science and pollutant hazards to health add examples from a very long list of related phenomena. It is impossible to summarize here the huge, often technical, literature on all these topics, and we refer to the general books cited in the bibliography. To give an idea of the variety of applications of the theory, we mention only a few more examples from yet another class of disciplines, the physical sciences. In physics, for instance, the equilibrium low-temperature properties of disordered systems are characterized by the statistics of extremely low-energy states. Several problems in this class, including the Random Energy Model and models for decaying Burgers turbulence, have been connected to extreme value distributions [3]. In GaAs films, extreme values in Gaussian 1/f correlations of voltage fluctuations were shown to follow one of the limit distributions of extreme value theory, the Gumbel asymptote [1]. Hierarchically correlated random variables representing the energies of directed polymers [9] and the maximal heights of growing self-affine surfaces [20] exhibit extreme value statistics as well.
The Fisher–Tippett–Gumbel asymptote is involved in the distribution of extreme height fluctuations for Edwards–Wilkinson relaxation of fluctuating interfaces on small-world-coupled interacting systems [16]. A connection was also established between the energy level density of a gas of non-interacting bosons and the distribution laws of extreme value statistics [7]. In the case of systems with correlated variables, the application of extreme value theory is far from trivial. A theorem states that the statistics of maxima of stationary Gaussian sequences, with suitable correlations, asymptotically converges to a Gumbel distribution [2]. Evidence supporting similar results was derived from numerical simulations and analysis of long-term correlated exponentially distributed signals [10]. In general, however, the scenario is nontrivial. For instance, in physics a variant of the Gumbel distribution was observed in turbulence [4] and
derived in the two-dimensional XY model [5], which are systems where correlations play an important role. Similarly, correlated extreme value statistics were discovered in the Sneppen depinning model [8]. In models for fluctuating correlated interfaces, such as the Edwards–Wilkinson and Kardar–Parisi–Zhang equations, an exact solution for the distribution of maximal heights was recently derived, and it turns out to be an Airy function [19]. After the above brief picture of the field, we illustrate next the general properties of extreme value distributions of independent random variables.

The Extreme Value Distributions

Extreme value distributions are the limit distributions of the extremes (either maxima or minima) of a set of random variables (see Fig. 1). For definiteness, we will deal here with maxima, as minima can be seen as 'maxima' of a set of variables with opposite signs. Consider a set {X_1, X_2, …, X_N} of N independent identically distributed random variables, X_i, with a cumulative distribution function F(x) ≡ Pr{X_i ≤ x}. The maximum of the set, Y_N = Max{X_1, X_2, …, X_N}, has a distribution function, H_N(x), which is simply related to F, since by definition of Y_N we have:

H_N(x) ≡ Pr{Y_N ≤ x} = Pr{X_1 ≤ x, X_2 ≤ x, …, X_N ≤ x}
= Pr{X_1 ≤ x} Pr{X_2 ≤ x} ⋯ Pr{X_N ≤ x} = F^N(x) .   (1)

In the limit of large samples, N → ∞, it is possible to show that, under some general hypotheses on F described below, we can find a suitable sequence of scaling constants a_N and b_N such that the scaled variable y_N = (Y_N − b_N)/a_N has a non-degenerate probability distribution function H(y). Specifically, as N → ∞, the distribution Pr{y_N ≤ y} has a non-trivial, well-defined limit H(y):

Pr{y_N ≤ y} = Pr{(Y_N − b_N)/a_N ≤ y} = Pr{Y_N ≤ a_N y + b_N}
= F^N(a_N y + b_N) → H(y)   for N → ∞ .   (2)

For a given underlying distribution, F, identifying the precise sequence of scaling constants a_N and b_N required in Eq. (2) is a non-trivial technical problem in the mathematics of extreme values [18], which we briefly discuss in the next sections. Such an issue is overshadowed by the simplicity of the result of the 'three types theorem',
which states that there are only three types (apart from a scaling transformation of the variable) of limiting distribution H(y) (see Fig. 2):

I) Gumbel type:

H(y) = exp[−exp(−y)]   (3)

with −∞ < y < ∞;

II) Fréchet type:

H(y) = exp[−y^(−α)]   (4)

where α is a fixed exponent and 0 < y < ∞ (with H(y) = 0 for y < 0);

III) Weibull type:

H(y) = exp[−(−y)^α]   (5)

where α is a fixed exponent and −∞ < y < 0 (with H(y) = 1 for y > 0).

Extreme Value Statistics, Figure 2: As an example, we plot the Gumbel density distribution, h(x) = dH(x)/dx, from Eq. (3), and the Fréchet and Weibull density distributions from Eqs. (4) and (5), for α = 2.

The Generalized Extreme Value Distribution

In the extreme value statistics literature, the three types of limiting distributions, Gumbel, Fréchet and Weibull, are often represented as a single family including all of them, the so-called generalized extreme value distribution:

H(y; μ, σ, ξ) = exp{ −[1 + ξ (y − μ)/σ]^(−1/ξ) }   (6)

with support in the interval where 1 + ξ (y − μ)/σ > 0, as otherwise H is either zero or one. Of the three parameters ξ, μ, σ of Eq. (6), ξ is called the 'shape' parameter, and it is especially important as it selects the specific type of asymptote:

I) The case ξ = 0 corresponds to the Gumbel asymptote, since it is easy to show that

lim_{ξ→0} H(y; μ, σ, ξ) = exp{ −exp[−(y − μ)/σ] } ;   (7)

II) Similarly, the case ξ > 0 corresponds to the Fréchet asymptote of Eq. (4), where the exponent is α = 1/ξ;

III) And, finally, the case ξ < 0 corresponds to the Weibull asymptote of Eq. (5), where the exponent is α = −1/ξ.

The parameters μ and σ of Eq. (6) are called the 'location' and 'scale' parameters, since they are related to the moments of the generalized extreme value distribution of Eq. (6). Figure 3 shows the effects of changes in μ and σ on the form of H(y) from Eq. (6) in the Gumbel case, ξ = 0. It is possible to show [18] that the kth moment is finite only if ξ < 1/k. The mean, which exists only if ξ < 1, can be expressed in the following general form:

E(y) = μ + (σ/ξ) [Γ(1 − ξ) − 1] ,   (8)

where Γ(x) is the Gamma function. In the Gumbel limit, ξ → 0, the above result simplifies to E(y) = μ + γσ, where γ = 0.577… is the Euler constant. Analogously, the variance, which exists for ξ < 1/2, can be written as:

E{ [y − E(y)]² } = (σ²/ξ²) [Γ(1 − 2ξ) − Γ²(1 − ξ)] ,   (9)

which in the ξ → 0 limit becomes E{ [y − E(y)]² } = σ²π²/6.
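The convergence of Eq. (2) and the Gumbel limit of Eq. (7) can be checked numerically. For an exponential parent, F(x) = 1 − e^(−x), the standard textbook choice of scaling constants (not specific to this article) is a_N = 1 and b_N = ln N, which gives F^N(y + ln N) = (1 − e^(−y)/N)^N → exp(−e^(−y)). A sketch:

```python
import math

def gumbel(y):
    """Gumbel law, Eq. (3): H(y) = exp(-exp(-y))."""
    return math.exp(-math.exp(-y))

def gev(y, mu=0.0, sigma=1.0, xi=0.1):
    """Generalized extreme value distribution, Eq. (6)."""
    t = 1.0 + xi * (y - mu) / sigma
    if t <= 0.0:
        # Outside the support, H is zero (xi > 0) or one (xi < 0).
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

# Convergence of Eq. (2): exponential parent F(x) = 1 - exp(-x),
# scaling constants a_N = 1, b_N = ln N.
F = lambda x: 1.0 - math.exp(-x)
y = 1.0
for N in (10, 1000, 100000):
    print(N, F(y + math.log(N)) ** N)   # approaches exp(-exp(-1)) ≈ 0.6922

# The Gumbel limit of Eq. (7): a GEV with very small shape xi is
# indistinguishable from the Gumbel law.
print(gev(y, xi=1e-8), gumbel(y))
```

The same comparison for a shifted value of y, or for the other sign of ξ, shows the Fréchet and Weibull branches of Eq. (6) in the same few lines.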
The r Largest Order Statistics

The results on distributions of the maximum (or minimum) discussed above can be extended to the set of the r-th largest values of an ensemble. Consider a set {X_1, X_2, …, X_N} of N identically distributed random variables which, for simplicity of notation, are arranged in order of magnitude: X_1 < X_2 < ⋯ < X_N. As before, F(x) ≡ Pr{X_i ≤ x} is their common cumulative distribution function. The statistics of X_N and X_1 are the distributions of, respectively, the maximum and the minimum seen before. Similarly, X_r (with r ∈ {1, …, N}) is called the r (largest) order statistic.
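The distribution of the r order statistic, Eq. (10), can be checked by direct simulation; a sketch, assuming a uniform parent on [0, 1] (our choice for illustration):

```python
import random
from math import comb

def H_r(x, N, r, F):
    """Exact distribution of the r-th order statistic, Eq. (10):
    Pr{X_r <= x} = sum_{i=r}^{N} C(N, i) F(x)^i (1 - F(x))^(N - i)."""
    return sum(comb(N, i) * F(x) ** i * (1 - F(x)) ** (N - i)
               for i in range(r, N + 1))

# Uniform parent on [0, 1]: F(x) = x.
F = lambda x: x
N, r, x = 5, 3, 0.5

exact = H_r(x, N, r, F)   # median of 5 uniforms at 0.5: exactly 1/2 by symmetry

# Monte Carlo check: sort each sample and look at the r-th smallest value.
random.seed(0)
trials = 100_000
hits = sum(sorted(random.random() for _ in range(N))[r - 1] <= x
           for _ in range(trials))
print(exact, hits / trials)   # the two values agree to Monte Carlo accuracy
```

The binomial sum counts the ways at least r of the N variables can fall below x, which is exactly the event {X_r ≤ x} for the sorted sample.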
Extreme Value Statistics
In order to describe a broader panorama of the available results in extreme value theory, we give here some details on the more general case of the limit probability distribution density of the vector of the first r largest values, (y_1, y_2, ..., y_r) = ((X_N − b_N)/a_N, (X_{N−1} − b_N)/a_N, ..., (X_{N−r+1} − b_N)/a_N). Such a limit distribution density can be shown to be [18]

h(y_1, ..., y_r) = (1/σ^r) exp{ −[1 + ξ(y_r − μ)/σ]^(−1/ξ) − (1/ξ + 1) Σ_{i=1}^{r} ln[1 + ξ(y_i − μ)/σ] } .   (11)
Most of the other results of the previous sections can be generalized to the r-th order statistics, as shown for instance in [18].

Domains of Attraction and Examples
Extreme Value Statistics, Figure 3
This figure shows the effects of the μ and σ parameters on the appearance of the generalized extreme value density distribution, h(x) = dH(x)/dx, from Eq. (6), in the case ξ = 0, i.e., the Gumbel type. In the upper panel, h(x) is plotted for σ = 1 and μ = 0, 1, 2. In the lower panel, h(x) is plotted for μ = 0 and σ = 1/4, 1/2, 1
The r-th order statistic has a distribution function, H_r(x), simply related to F. The event {X_r ≤ x} occurs when at least r of the N variables do not exceed x, so that

H_r(x) ≡ Pr{X_r ≤ x} = Σ_{i=r}^{N} [N!/((N − i)! i!)] F^i(x) [1 − F(x)]^{N−i} .   (10)
The theory of the generalized extreme value distribution can be extended to the r-th order statistic. Actually, in the limit N → ∞, if a suitable sequence of scaling constants, a_N and b_N, can be found such that the scaled maximum variable y_N = (X_N − b_N)/a_N has the limit distribution function H(y) given in Eq. (6), then the r-th order statistic has a limit distribution which can be easily expressed in terms of H(y) [18].
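As a numerical check on Eq. (10), the sketch below (an illustration, not from the article; the exponential parent distribution and the choices N = 10, r = 7, x = 1.5 are arbitrary) compares the binomial sum with a direct Monte Carlo estimate:

```python
import math
import random

def H_r(x, N, r, F):
    # Eq. (10): probability that at least r of the N i.i.d. draws are <= x
    p = F(x)
    return sum(math.comb(N, i) * p**i * (1 - p)**(N - i) for i in range(r, N + 1))

F_exp = lambda x: 1.0 - math.exp(-x)   # unit exponential parent distribution
N, r, x = 10, 7, 1.5

random.seed(0)
trials = 200_000
hits = 0
for _ in range(trials):
    sample = sorted(random.expovariate(1.0) for _ in range(N))
    if sample[r - 1] <= x:             # sample[r-1] is the r-th smallest value
        hits += 1

analytic, simulated = H_r(x, N, r, F_exp), hits / trials
print(round(analytic, 4), round(simulated, 4))
```

The two estimates should agree to within Monte Carlo error of a few parts in a thousand.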
The problem of finding the domains of attraction of the classes of limiting distributions is a complex, partially still open, topic in extreme value theory [18]. Even in the case of independent identically distributed random variables, understanding which asymptote a given distribution F converges to, and which is the sequence of scaling constants a_N and b_N, can be a non-trivial task. As the extreme events of a random variable are characterized by the tail of F, a simple approximate approach to guess the domain of attraction F falls into is to consider its behavior for large x. We summarize below a few well-known examples of broad validity which can guide practical applications of the theory to the case where the random variables are independent and identically distributed.

Example I)
Many common distributions, F(x), have exponential tails in x, very important examples being the Gaussian and the exponential distributions. In this case, their extreme value statistic is the Gumbel asymptote. A more formal condition for F to belong to the domain of attraction of the Gumbel limiting distribution was established by von Mises. Take a distribution function F and denote by x_max the largest value in its support, i.e., the point where F(x_max) = 1 (x_max can also be infinite). Consider the density f(x) = dF(x)/dx and the rate at which F(x) approaches 1 as x → x_max: if, when x → x_max,

d/dx [ (1 − F(x)) / f(x) ] → 0 ,   (12)

then Pr{y_N ≤ y} tends to the Gumbel asymptote given in Eq. (3).
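The von Mises condition of Eq. (12) can be probed numerically. In the sketch below (illustrative; the probe points are arbitrary choices), the quantity (1 − F)/f is identically 1 for the unit exponential, so its derivative vanishes, while for the Gaussian the derivative decays toward zero as x grows:

```python
import math

def phi(x):
    # standard Gaussian density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def g_gauss(x):
    # (1 - F(x)) / f(x) for the Gaussian; erfc keeps the tail accurate
    return 0.5 * math.erfc(x / math.sqrt(2)) / phi(x)

def g_exp(x):
    # (1 - F(x)) / f(x) = e^(-x) / e^(-x) = 1 for the unit exponential
    return 1.0

def dg(g, x, h=1e-5):
    # central-difference estimate of d/dx [(1 - F)/f]
    return (g(x + h) - g(x - h)) / (2 * h)

for x in (2.0, 5.0, 8.0):
    print(x, dg(g_gauss, x))   # magnitude shrinks as x grows
print(dg(g_exp, 3.0))          # exactly zero
```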
The above criterion can be rephrased in a more colloquial way: the Gumbel type is the limiting distribution when 1 − F(x) decays faster than any power law for x → x_max. Beyond the Gaussian and exponential distributions, the lognormal, the Gamma, the Weibull, the Benktander type I and II, and many more common distributions, with x_max either finite or infinite, belong to this class.

Example II) Distributions such as the Pareto, Cauchy, Student, and Burr have the Fréchet asymptote. More generally, when x_max is infinite and F has a power-law tail for x → ∞,

1 − F(x) ≃ x^(−α) ,   (13)

with an exponent α > 0, then the domain of attraction of the extreme value statistics is the Fréchet type given in Eq. (4), with precisely the same exponent α as in Eq. (13).

Example III) Finally, when x_max is finite and F has a power-law behavior for x → x_max,

1 − F(x) ≃ (x_max − x)^α ,   (14)

with an exponent α > 0, then the domain of attraction of the extreme value statistics is the Weibull type of Eq. (5), with the same exponent α as in Eq. (14). The Uniform and Beta distributions, for instance, have the Weibull asymptote.
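These domains of attraction can be illustrated by simulation. The sketch below (an illustration; the parent distributions, the block size N = 200, the scaling constants, and the probe points are choices made here, not taken from the article) compares the empirical distribution of scaled block maxima with the corresponding limit laws of Eqs. (3)-(5):

```python
import bisect
import math
import random

random.seed(1)

def scaled_maxima(draw, N, scale, trials=20_000):
    # scaled block maxima y = (max - b_N)/a_N over many blocks of size N
    return sorted(scale(max(draw() for _ in range(N))) for _ in range(trials))

def ecdf(values, y):
    # fraction of scaled maxima at or below y
    return bisect.bisect_right(values, y) / len(values)

N = 200
# exponential parent -> Gumbel limit, with a_N = 1, b_N = ln N
exp_max = scaled_maxima(lambda: random.expovariate(1.0), N, lambda m: m - math.log(N))
# Pareto parent (F = 1 - x^(-2), x >= 1) -> Frechet limit with alpha = 2, a_N = sqrt(N)
par_max = scaled_maxima(lambda: random.random() ** (-0.5), N, lambda m: m / math.sqrt(N))
# uniform(0,1) parent -> Weibull limit with alpha = 1, via y = N (max - 1) < 0
uni_max = scaled_maxima(random.random, N, lambda m: N * (m - 1.0))

gumbel  = lambda y: math.exp(-math.exp(-y))   # Eq. (3)
frechet = lambda y: math.exp(-y ** -2.0)      # Eq. (4), alpha = 2
weibull = lambda y: math.exp(y)               # Eq. (5), alpha = 1, for y < 0

print(ecdf(exp_max, 1.0), gumbel(1.0))
print(ecdf(par_max, 1.0), frechet(1.0))
print(ecdf(uni_max, -1.0), weibull(-1.0))
```

For these three parents the convergence in N is fast, so empirical and limiting values should differ only at the level of Monte Carlo noise.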
Extremes of Correlated Random Variables

When the underlying random variables are not independent, as in many cases of practical relevance ranging from Meteorology to Finance, the problem of identifying the form, or even establishing the existence, of the limiting distribution is, in general, open. The existing broad technical literature on the topic [18] shows that the three types, Gumbel, Fréchet and Weibull, summarized in the generalized extreme value distribution of Eq. (6), often arise as well. For instance, a recent theorem has shown that in the case of stationary Gaussian sequences with suitable correlations the distribution of maxima asymptotically follows the Gumbel type [2]. Analysis of numerical simulations of long-term correlated exponentially distributed signals has given evidence supporting similar conclusions [10]. Sometimes, when considering ‘time’ series of N correlated variables, the approximate rule of thumb that N must be much bigger than the ‘correlation length’ of the sequence is used as a guide to decide whether Eq. (6) is likely to be the right asymptote.
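As a rough numerical illustration of the Gaussian result cited above (the AR(1) model, its coefficient, the block size, and the probe points are all arbitrary choices made here, not taken from the article), one can compare block maxima of a correlated stationary Gaussian sequence with those of an i.i.d. Gaussian sample of the same length; with rapidly decaying correlations the two distributions nearly coincide, consistent with both sharing the Gumbel asymptote:

```python
import bisect
import math
import random

random.seed(2)

def ar1_series(n, phi):
    # stationary Gaussian AR(1) sequence with unit marginal variance
    x = random.gauss(0.0, 1.0)
    s = math.sqrt(1.0 - phi * phi)
    out = []
    for _ in range(n):
        out.append(x)
        x = phi * x + s * random.gauss(0.0, 1.0)
    return out

N, trials, phi = 500, 8000, 0.5
corr_max = sorted(max(ar1_series(N, phi)) for _ in range(trials))
iid_max = sorted(max(random.gauss(0.0, 1.0) for _ in range(N)) for _ in range(trials))

def ecdf(values, y):
    # fraction of block maxima at or below y
    return bisect.bisect_right(values, y) / len(values)

# With short-range correlations the maxima distribution stays close to the
# i.i.d. one, in line with both sharing the Gumbel asymptote.
for y in (2.8, 3.1, 3.4):
    print(y, ecdf(corr_max, y), ecdf(iid_max, y))
```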
Some of the examples mentioned in the Introduction can help, however, in delineating the strong limits of the above approximate criteria and the lack of a general picture. For instance, in the XY model for magnetic systems used in Statistical Physics, in the Kosterlitz–Thouless low-temperature phase the magnetization has a distribution which is a generalized Gumbel [5], but not the one in Eq. (3), a result expected to hold in a broad class of systems. In models for fluctuating interfaces developing correlations, described by Edwards–Wilkinson and Kardar–Parisi–Zhang like equations, it has been derived that the exact distribution of maximal heights is given by an Airy function [19]. These examples show the variety of situations which can arise in practical cases and indicate that the theorems derived for independent identically distributed variables must be applied with caution.

Future Directions

In the sections above, we reviewed at an introductory level the mathematics of extreme value theory, with a special focus on the ‘three types theorem’ on the limiting distributions. We also discussed their domains of attraction and many examples of random extreme events. We have not covered, instead, other important, though more technical and still evolving, topics such as the theoretical approach to the problem of ‘exceedances over thresholds’, and the methodology for estimating the parameters of extreme distributions from real sample data, such as maximum likelihood and Bayesian methods. These are covered, for instance, in the general references listed in the bibliography. Indeed, there are a number of excellent textbooks on these topics, ranging from the original book by E.J. Gumbel [17] to more recent volumes illustrating extreme value theory in detail in the formal framework of the theory of probability [11,13,14,18]. Volumes more focused on applications to Finance and Insurance are, e.g., [6,11,12,21], while applications to climate, hydrology and meteorology research are found in [6,12,21,24]. Finally, there are a number of more technical review papers on the topic, including [10,22,23].

Bibliography
1. Antal T, Droz M, Györgyi G, Rácz Z (2001) Phys Rev Lett 87:240601
2. Berman SM (1964) Ann Math Stat 35:502
3. Bouchaud J-P, Mézard M (1997) J Phys A 30:7997
4. Bramwell ST, Holdsworth PCW, Pinton J-F (1998) Nature (London) 396:552
5. Bramwell ST, Christensen K, Fortin J-Y, Holdsworth PCW, Jensen HJ, Lise S, Lopez JM, Nicodemi M, Pinton J-F, Sellitto M (2000) Phys Rev Lett 84:3744
6. Bunde A, Kropp J, Schellnhuber H-J (eds) (2002) The science of disasters: climate disruptions, heart attacks, and market crashes. Springer, Berlin
7. Comtet A, Leboeuf P, Majumdar SN (2007) Phys Rev Lett 98:070404
8. Dahlstedt K, Jensen HJ (2001) J Phys A 34:11193
9. Dean DS, Majumdar SN (2001) Phys Rev E 64:046121
10. Eichner JF, Kantelhardt JW, Bunde A, Havlin S (2006) Phys Rev E 73:016130
11. Embrechts P, Klüppelberg C, Mikosch T, Karatzas I, Yor M (eds) (1997) Modelling extremal events. Springer, Berlin
12. Finkenstadt B, Rootzen H (2004) Extreme values in finance, telecommunications, and the environment. Chapman and Hall/CRC Press, London
13. Galambos J (1978) The asymptotic theory of extreme order statistics. Wiley, New York
14. Galambos J, Lechner J, Simiu E (eds) (1994) Extreme value theory and applications. Kluwer, Dordrecht
15. Gnedenko BV (1998) Theory of probability. CRC, Boca Raton, FL
16. Guclu H, Korniss G (2004) Phys Rev E 69:065104(R)
17. Gumbel EJ (1958) Statistics of extremes. Columbia University Press, New York
18. Leadbetter MR, Lindgren G, Rootzen H (1983) Extremes and related properties of random sequences and processes. Springer, New York
19. Majumdar SN, Comtet A (2004) Phys Rev Lett 92:225501; (2005) J Stat Phys 119:777
20. Raychaudhuri S, Cranston M, Przybyla C, Shapir Y (2001) Phys Rev Lett 87:136101
21. Reiss RD, Thomas M (2001) Statistical analysis of extreme values: with applications to insurance, finance, hydrology, and other fields. Birkhäuser, Basel
22. Smith RL (2003) Statistics of extremes, with applications in environment, insurance and finance, chap 1. In: Finkenstadt B, Rootzen H (eds) Extreme values in finance, telecommunications, and the environment. Chapman and Hall/CRC Press, London
23. Smith RL, Tawn JA, Yuen HK (1990) Statistics of multivariate extremes. Int Stat Rev 58:47
24. von Storch H, Zwiers FW (2001) Statistical analysis in climate research. Cambridge University Press, Cambridge
Fair Division*
STEVEN J. BRAMS
Department of Politics, New York University, New York, USA

*Adapted from Barry R. Weingast and Donald Wittman (eds) Oxford Handbook of Political Economy (Oxford University Press, 2006) by permission of Oxford University Press.

Article Outline
Glossary
Definition of the Subject
Introduction
Single Heterogeneous Good
Several Divisible Goods
Indivisible Goods
Conclusions
Future Directions
Bibliography

Glossary
Efficiency  An allocation is efficient if there is no other allocation that is better for one player and at least as good for all the other players.
Envy-freeness  An allocation is envy-free if each player thinks it receives at least a tied-for-largest portion and so does not envy the portion of any other player.
Equitability  An allocation is equitable if each player values the portion that it receives the same as every other player values its portion.

Definition of the Subject
Cutting a cake, dividing up the property in an estate, determining the borders in an international dispute – such allocation problems are ubiquitous. Fair division treats all these problems and many more through a rigorous analysis of procedures for allocating goods, or deciding who wins on what issues, in a dispute.

Introduction
The literature on fair division has burgeoned in recent years, with five academic books [1,13,23,28,32] and one popular book [15] providing overviews. In this review, I will give a brief survey of three different literatures: (i) the division of a single heterogeneous good (e.g., a cake with different flavors or toppings); (ii) the division, in whole or part, of several divisible goods; and (iii) the allocation of several indivisible goods. In each case, I assume the differ
ent people, called players, may have different preferences for the items being divided. For (i) and (ii), I will describe and illustrate procedures for dividing divisible goods fairly, based on different criteria of fairness. For (iii), I will discuss problems that arise in allocating indivisible goods, illustrating trade-offs that must be made when different criteria of fairness cannot all be satisfied simultaneously.

Single Heterogeneous Good
The metaphor I use for a single heterogeneous good is a cake, with different flavors or toppings, that cannot be cut into pieces that have exactly the same composition. Unlike a sponge or layer cake, different players may like different pieces – even if they have the same physical size – because they are not homogeneous.

Some of the cake-cutting procedures that have been proposed are discrete, whereby players make cuts with a knife – usually in a sequence of steps – but the knife is not allowed to move continuously over the cake. Moving-knife procedures, on the other hand, permit such continuous movement and allow players to call “stop” at any point at which they want to make a cut or mark.

There are now about a dozen procedures for dividing a cake among three players, and two procedures for dividing a cake among four players, such that each player is assured of getting a most-valued or tied-for-most-valued piece, and there is an upper bound on the number of cuts that must be made [16]. When a cake is so divided, no player will envy another player, resulting in an envy-free division.

In the literature on cake-cutting, two assumptions are commonly made:

1. The goal of each player is to maximize the minimum-size piece (maximin piece) that he or she can guarantee for himself or herself, regardless of what the other players do. To be sure, a player might do better by not following such a maximin strategy; this will depend on the strategy choices of the other players.
However, all players are assumed to be risk-averse: They never choose strategies that might yield them more valued pieces if they entail the possibility of giving them less than their maximin pieces. 2. The preferences of the players over the cake are continuous. Consider a procedure in which a knife moves across a cake from left to right and, at any moment, the piece of the cake to the left of the knife is A and the piece to the right is B. The continuity assumption enables one to use the intermediate-value theorem to say the following: If, for some position of the knife, a player views piece A as being more valued than piece B, and
for some other position he or she views piece B as being more valued than piece A, then there must be some intermediate position such that the player values the two pieces exactly the same. Only two 3-person procedures [2,30], and no 4-person procedure, make an envy-free division with the minimal number of cuts (n − 1 cuts if there are n players). A cake so cut ensures that each player gets a single connected piece, which is especially desirable in certain applications (e.g., land division). For two players, the well-known procedure of “I cut the cake, you choose a piece,” or “cut-and-choose,” leads to an envy-free division if the players choose maximin strategies. The cutter divides the cake 50-50 in terms of his or her preferences. (Physically, the two pieces may be of different size, but the cutter values them the same.) The chooser takes the piece he or she values more and leaves the other piece for the cutter (or chooses randomly if the two pieces are tied in his or her view). Clearly, these strategies ensure that each player gets at least half the cake, as he or she values it, proving that the division is envy-free. But this procedure does not satisfy certain other desirable properties [7,22]. For example, if the cake is, say, half vanilla, which the cutter values at 75 percent, and half chocolate, which the chooser values at 75 percent, a “pure” vanilla-chocolate division would be better for the cutter than the divide-and-choose division, which gives him or her exactly 50 percent of the value of the cake. The moving-knife equivalent of “I cut, you choose” is for a knife to move continuously across the cake, say from left to right. Assume that the cake is cut when one player calls “stop.” If each of the players calls “stop” when he or she perceives the knife to be at a 50-50 point, then the first player to call “stop” will produce an envy-free division if he or she gets the left piece and the other player gets the right piece.
(If both players call “stop” at the same time, the pieces can be randomly assigned to the two players.) To be sure, if the player who would truthfully call “stop” first knows the other player’s preference and delays calling “stop” until just before the knife would reach the other player’s 50-50 point, the first player can obtain a greater-than-50-percent share on the left. However, the possession of such information by the cutter is not generally assumed in justifying cut-and-choose, though it does not undermine an envy-free division.

Surprisingly, going from two players making one cut to three players making two cuts cannot be done by a discrete procedure if the division is to be envy-free ([28], pp. 28–29; additional information on the minimum number of cuts required to give envy-freeness is given in [19] and [29]). The 3-person discrete procedure that uses the fewest cuts is one discovered independently by John L. Selfridge and John H. Conway about 1960; it is described in, among other places, Brams and Taylor (1996) and Robertson and Webb (1998), and requires up to five cuts. Although there is no discrete 4-person envy-free procedure that uses a bounded number of cuts, Brams, Taylor, and Zwicker (1997) and Barbanel and Brams (2004) give moving-knife procedures that require up to 11 and 5 cuts, respectively. The Brams-Taylor-Zwicker (1997) procedure is arguably simpler because it requires fewer simultaneously moving knives. Peterson and Su (2002) give a 4-person envy-free moving-knife procedure for chore division, whereby each player thinks he or she receives the least undesirable chores, that requires up to 16 cuts.

To illustrate these ideas, I describe next the Barbanel–Brams [2] 3-person, 2-cut envy-free procedure, which is based on the idea of squeezing a piece by moving two knives simultaneously. The Barbanel–Brams [2] 4-person, 5-cut envy-free procedure also uses this idea, but it is considerably more complicated and will not be described here. The latter procedure, however, is not as complex as Brams and Taylor’s [12] general n-person discrete procedure. Their procedure illustrates the price one must pay for an envy-free procedure that works for all n, because it places no upper bound on the number of cuts required to produce an envy-free division; this is also true of other n-person envy-free procedures [25,27]. While the number of cuts needed depends on the players’ preferences over the cake, it is worth noting that Su’s [31] approximate envy-free procedure uses the minimal number of cuts at a cost of only small departures from envy-freeness. (See [10,20,26] for other approaches, based on bidding, to the housemates problem discussed in [31]; on approximate solutions to envy-freeness, see [33]; for recent results on pie-cutting, in which radial cuts are made from the center of a pie to divide it into wedge-shaped pieces, see [3,8].)

I next describe the Barbanel–Brams 3-person, 2-cut envy-free procedure, called the squeezing procedure [2]. I refer to players by number – player 1, player 2, and so on – calling even-numbered players “he” and odd-numbered players “she.” Although cuts are made by two knives in the end, initially one player makes “marks,” or virtual cuts, on the line segment defining the cake; these marks may subsequently be changed by another player before the real cuts are made.

Squeezing procedure. A referee moves a knife from left to right across a cake. The players are instructed to call “stop” when the knife reaches the 1/3 point for each. Let the first player to call “stop” be player 1. (If two or three
players call “stop” at the same time, randomly choose one.) Have player 1 place a mark at the point where she calls “stop” (the right boundary of piece A in the diagram below), and a second mark to the right that bisects the remainder of the cake (the right boundary of piece B below). Thereby player 1 indicates the two points that, for her, trisect the cake into pieces A, B, and C, which will be assigned after possible modifications:

      A       B       C
/–––––|–––––|–––––/
      1       1

(The two marks, labeled 1, are made by player 1.) Because neither player 2 nor player 3 called “stop” before player 1 did, each of players 2 and 3 thinks that piece A is at most 1/3. They are then asked whether they prefer piece B or piece C. There are three cases to consider:

1. If players 2 and 3 each prefer a different piece – one player prefers piece B and the other piece C – we are done: Players 1, 2, and 3 can each be assigned a piece that they consider to be at least tied for largest.

2. Assume players 2 and 3 both prefer piece B. A referee places a knife at the right boundary of B and moves it to the left. At the same time, player 1 places a knife at the left boundary of B and moves it to the right in such a way that the value of the cake traversed on the left (by player 1’s knife) and on the right (by the referee’s knife) are equal for player 1. Thereby pieces A and C increase equally in player 1’s eyes. At some point, piece B will be diminished sufficiently to a new piece, labeled B′ – in either player 2’s or player 3’s eyes – to tie with either piece A′ or C′, the enlarged A and C pieces. Assume player 2 is the first, or tied for the first, to call “stop” when this happens; then give player 3 piece B′, which she still thinks is the most-valued or tied-for-most-valued piece. Give player 2 the piece he thinks ties for the most value with piece B′ (say, piece A′), and give player 1 the remaining piece (piece C′), which she thinks ties for the most value with the other enlarged piece (A′).
Clearly, each player will think that he or she received at least a tied-for-most-valued piece.

3. Assume players 2 and 3 both prefer piece C. A referee places a knife at the right boundary of B and moves it to the right. Meanwhile, player 1 places a knife at the left boundary of B and moves it to the right in such a way as to maintain the equality, in her view, of pieces A and B as they increase. At some point, piece C will be diminished sufficiently to C′ – in either player 2’s or player 3’s eyes – to tie with either piece A′ or B′, the enlarged A and B pieces. Assume player 2 is the first, or tied for the first, to call “stop” when this happens; then
give player 3 piece C′, which she still thinks is the most-valued or tied-for-most-valued piece. Give player 2 the piece he thinks ties for the most value with piece C′ (say, piece A′), and give player 1 the remaining piece (piece B′), which she thinks ties for the most value with the other enlarged piece (A′). Clearly, each player will think that he or she received at least a tied-for-most-valued piece.

Note that who moves a knife or knives varies, depending on what stage is reached in the procedure. In the beginning, I assume a referee moves a single knife, and the first player to call “stop” (player 1) then trisects the cake. But at the next stage of the procedure, in cases (2) and (3), it is a referee and player 1 that move two knives simultaneously, “squeezing” what players 2 and 3 consider to be the most-valued piece until it eventually ties, for one of them, with one of the two other pieces.

Several Divisible Goods
Most disputes – divorce, labor-management, merger-acquisition, and international – involve only two parties, but they frequently involve several homogeneous goods that must be divided, or several issues that must be resolved (see Note 3). As an example of the latter, consider an executive negotiating an employment contract with a company. The issues before them are (1) bonus on signing, (2) salary, (3) stock options, (4) title and responsibilities, (5) performance incentives, and (6) severance pay [14]. The procedure I describe next, called adjusted winner (AW), is a 2-player procedure that has been applied to disputes ranging from interpersonal to international [15] (see Note 4). It works as follows. Two parties in a dispute, after perhaps long and arduous bargaining, reach agreement on (i) what issues need to be settled and (ii) what winning and losing means for each side on each issue. For example, if the executive wins on the bonus, it will presumably be some amount that the company considers too high but, nonetheless, is willing to pay.
On the other hand, if the executive loses on the bonus, the reverse will hold.

Note 3: Dividing several homogeneous goods is very different from cake-cutting. Cake-cutting is most applicable to a problem like land division, in which hills, dales, ponds, and trees form an incongruous mix, making it impossible to give all of one thing (e.g., trees) to one player. By contrast, in property division it is possible to give all of one good to one player. Under certain conditions, 2-player cake division and the procedure to be discussed next (adjusted winner) are equivalent [22].

Note 4: A website for AW can be found at http://www.nyu.edu/projects/adjustedwinner. Procedures applicable to more than two players are discussed in [13,15,23,32].
Thus, instead of trying to negotiate a specific compromise on the bonus, the company and the executive negotiate upper and lower bounds, the lower one favoring the company and the upper one favoring the executive. The same holds true on other issues being decided, including non-monetary ones like title and responsibilities. Under AW, each side will always win on some issues. Moreover, the procedure guarantees that both the company and the executive will get at least 50% of what they desire, and often considerably more. To implement AW, each side secretly distributes 100 points across the issues in the dispute according to the importance it attaches to winning on each. For example, suppose that the company and the executive distribute their points as follows, illustrating that the company cares more about the bonus than the executive (it would be a bad precedent for it to go too high), whereas the reverse is true for severance pay (the executive wants to have a cushion in the event of being fired):
Issues                          Company   Executive
1. Bonus                            10          5
2. Salary                           35         40
3. Stock Options                    15         20
4. Title and Responsibilities       15         10
5. Performance Incentives           15          5
6. Severance Pay                    10         20
Total                              100        100
The larger of the two figures on each issue shows the side that wins initially on that issue by placing more points on it. Notice that whereas the company wins a total of 10 + 15 + 15 = 40 of its points, the executive wins a whopping 40 + 20 + 20 = 80 of its points. This outcome is obviously unfair to the company. Hence, a so-called equitability adjustment is necessary to equalize the points of the two sides. This adjustment transfers points from the initial winner (the executive) to the initial loser (the company). The key to the success of AW – in terms of a mathematical guarantee that no win-win potential is lost – is to make the transfer in a certain order (for a proof, see [13], pp. 85–94). That is, of the issues initially won by the executive, look for the one on which the two sides are in closest agreement, as measured by the quotient of the winner’s points to the loser’s points. Because the winner-to-loser quotient on the issue of salary is 40/35 ≈ 1.14, and this is smaller than on any other issue on which the executive wins (the next-smallest quotient is 20/15 ≈ 1.33 on stock options), some of this issue must be transferred to the company.
But how much? The point totals of the company and the executive will be equal when the company’s winning points on issues 1, 4, and 5, plus x percent of its points on salary (left side of the equation below), equal the executive’s winning points on issues 2, 3, and 6, minus x percent of its points on salary (right side of the equation):

40 + 35x = 80 − 40x
75x = 40 .

Solving for x gives x = 8/15 ≈ 0.533. This means that the company will win about 53 percent on salary, and the executive will lose about 53 percent (i.e., win about 47 percent), which is almost a 50-50 compromise between the low and high figures they negotiated earlier, only slightly favoring the company. This compromise ensures that both the company and the executive end up with exactly the same total number of points after the equitability adjustment:

40 + 35(0.533) = 80 − 40(0.533) ≈ 58.7 .

On all other issues, either the company or the executive gets its way completely (and its winning points), as it should, since it valued these issues more than the other side. Thus, AW is essentially a winner-take-all procedure, except on the one issue on which the two sides are closest and which, therefore, is the one subject to the equitability adjustment. On this issue a split will be necessary, which will be easier if the issue is a quantitative one, like salary, than a more qualitative one, like title and responsibilities (see Note 5). Still, it should be possible to reach a compromise on an issue like title and responsibilities that reflects the percentages the relative winner and relative loser receive (53% and 47% on salary in the example). This is certainly easier than trying to reach a compromise on each and every issue, which is also less efficient than resolving them all at once according to AW (see Note 6). In the example, each side ends up with, in toto, almost 59% of what it desires, which will surely foster greater satisfaction than would a 50-50 split down the middle on each issue.
In fact, assuming the two sides are truthful, there is no better split for both, which makes the AW settlement efficient. In addition, it is equitable, because each side gets exactly the same amount above 50%, with this figure increasing the greater the differences in the two sides’ valuations of the issues.

Note 5: AW may require the transfer of more than one issue, but at most one issue must be divided in the end.

Note 6: A procedure called proportional allocation (PA) awards issues to the players in proportion to the points they allocate to them. While inefficient, PA is less vulnerable to strategic manipulation than AW, with which it can be combined ([13], pp. 75–80).

In effect, AW makes optimal trade-offs by
awarding issues to the side that most values them, except as modified by the equitability adjustment that ensures that both sides do equally well (in their own subjective terms, which may not be monetary). On the other hand, if the two sides have unequal claims or entitlements – as specified, for example, in a contract – AW can be modified to give each side shares of the total proportional to its specified claims.

Can AW be manipulated to benefit one side? It turns out that exploitation of the procedure by one side is practically impossible unless that side knows exactly how the other side will allocate its points. In the absence of such information, attempts at manipulation can backfire miserably, with the manipulator ending up with less than the minimum 50 points its honesty guarantees it [13,15].

While AW offers a compelling resolution to a multi-issue dispute, it requires careful thought to delineate what the issues being divided are, and tough bargaining to determine what winning and losing means on each. More specifically, because the procedure is an additive point scheme, the issues need to be made as independent as possible, so that winning or losing on one does not substantially affect how much one wins or loses on others. To the degree that this is not the case, it becomes less meaningful to use the point totals to indicate how well each side does. The half dozen issues identified in the executive-compensation example overlap to an extent and hence may not be viewed as independent (after all, might not the bonus be considered part of salary?). On the other hand, they might be reasonably thought of as different parts of a compensation package, over which the disputants have different preferences that they express with points. In such a situation, losing on the issues you care less about than the other side will be tolerable if it is balanced by winning on the issues you care more about.
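The equitability adjustment described above can be sketched in code. The implementation below is an illustration of the procedure as described in the text (the issue labels and the exact-fraction bookkeeping are my own choices), applied to the executive-compensation example:

```python
from fractions import Fraction as F

def adjusted_winner(pts_a, pts_b):
    """Sketch of adjusted winner for two players; pts_* map issue -> points
    (100 each). Returns both final totals, equal whenever a split occurs.
    Tied issues are ignored for simplicity (none occur in the example)."""
    won_a = {i for i in pts_a if pts_a[i] > pts_b[i]}
    won_b = {i for i in pts_a if pts_b[i] > pts_a[i]}
    tot_a = F(sum(pts_a[i] for i in won_a))
    tot_b = F(sum(pts_b[i] for i in won_b))
    if tot_a > tot_b:  # relabel so that player a is the initial loser
        pts_a, pts_b = pts_b, pts_a
        won_b, tot_a, tot_b = won_a, tot_b, tot_a
    # transfer issues from the initial winner, smallest winner/loser quotient first
    for i in sorted(won_b, key=lambda i: F(pts_b[i], pts_a[i])):
        if tot_a >= tot_b:
            break
        # fraction x solves tot_a + x*pts_a[i] == tot_b - x*pts_b[i] (capped at 1)
        x = min(F(1), F(tot_b - tot_a, pts_a[i] + pts_b[i]))
        tot_a += x * pts_a[i]
        tot_b -= x * pts_b[i]
    return tot_a, tot_b

company = {"bonus": 10, "salary": 35, "options": 15,
           "title": 15, "incentives": 15, "severance": 10}
executive = {"bonus": 5, "salary": 40, "options": 20,
             "title": 10, "incentives": 5, "severance": 20}
tot_c, tot_e = adjusted_winner(company, executive)
print(tot_c, tot_e, float(tot_c))  # both totals equal 176/3, about 58.7
```

With the figures above, the one split issue is salary and both sides finish at 176/3 ≈ 58.7 points, matching the calculation in the text.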
Indivisible Goods
The challenge of dividing up indivisible goods, such as a car, a boat, or a house in a divorce, is daunting, though sometimes such goods can be shared (usually at different times). The main criteria I invoke are efficiency (there is no other division better for everybody, or better for some players and not worse for the others) and envy-freeness (each player likes its allocation at least as much as those that the other players receive, so it does not envy anybody else). But because efficiency, by itself, is not a criterion of fairness (an efficient allocation could be one in which one player gets everything and the others nothing), I also consider other criteria of fairness besides envy-freeness, including Rawlsian and utilitarian measures of welfare (to be defined).
I present two paradoxes, from a longer list of eight in [4],7 that highlight difficulties in creating "fair shares" for everybody. But they by no means render the task impossible. Rather, they show how dependent fair division is on the fairness criteria one deems important and the trade-offs one considers acceptable. Put another way, achieving fairness requires some consensus on the ground rules (i. e., criteria), and some delicacy in applying them (to facilitate trade-offs when the criteria conflict).
I make five assumptions. First, players rank indivisible items but do not attach cardinal utilities to them. Second, players cannot compensate each other with side payments (e. g., money) – the division is only of the indivisible items. Third, players cannot randomize among different allocations, which is a way that has been proposed for "smoothing out" inequalities caused by the indivisibility of items. Fourth, all players have positive values for every item. Fifth, a player prefers one set S of items to a different set T if (i) S has as many items as T, and (ii) for every item t in T and not in S, there is a distinct item s in S and not in T that the player prefers to t. For example, if a player ranks four items in order of decreasing preference, 1 2 3 4, I assume that it prefers the set {1,2} to {2,3}, because {1} is preferred to {3}; and the set {1,3} to {2,4}, because {1} is preferred to {2} and {3} is preferred to {4}; whereas the comparison between sets {1,4} and {2,3} could go either way.
Paradox 1. A unique envy-free division may be inefficient.
Suppose there is a set of three players, {A, B, C}, who must divide a set of six indivisible items, {1, 2, 3, 4, 5, 6}. Assume the players rank the items from best to worst as follows:
A: 1 2 3 4 5 6
B: 4 3 2 1 5 6
C: 5 1 2 6 3 4
The unique envy-free allocation to (A, B, C) is ({1,3}, {2,4}, {5,6}), or for simplicity (13, 24, 56), whereby A and B get their best and 3rd-best items, and C gets its best and 4th-best items.
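The set-comparison rule in the fifth assumption can be made concrete in a few lines of code. This sketch (plain Python; the function name is mine, not the author's) reproduces the three comparisons just given. For two distinct equal-sized sets, sorting both by rank and comparing position by position is equivalent to the matching condition stated in the assumption.

```python
def prefers(ranking, S, T):
    """Assumption five in code: with items listed best-first in `ranking`,
    a player is guaranteed to like the set S at least as much as the
    equal-sized set T when, after sorting both by rank, every item of S
    is ranked at least as high as the corresponding item of T.  For
    distinct sets this yields the strict preference of assumption five."""
    rank = {item: i for i, item in enumerate(ranking)}
    s = sorted(rank[x] for x in S)
    t = sorted(rank[x] for x in T)
    return len(S) == len(T) and all(a <= b for a, b in zip(s, t))

ranking = [1, 2, 3, 4, 5, 6]
print(prefers(ranking, {1, 2}, {2, 3}))   # True: {1} is preferred to {3}
print(prefers(ranking, {1, 3}, {2, 4}))   # True: 1 beats 2 and 3 beats 4
print(prefers(ranking, {1, 4}, {2, 3}) or
      prefers(ranking, {2, 3}, {1, 4}))   # False: could go either way
```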
Clearly, A prefers its allocation to that of B (which comprises A's 2nd-best and 4th-best items) and to that of C (which comprises A's two worst items). Likewise, B and C prefer their allocations to those of the other two players. Consequently, the division (13, 24, 56) is envy-free: all players prefer their allocations to those of the other two players, so no player is envious of any other.
7 For a more systematic treatment of conflicts in fairness criteria and the trade-offs that are possible, see [5,6,9,11,18,21].
Compare this division with (12, 34, 56), whereby A and B receive their two best items, and C receives, as before, its best and 4th-best items. This division Pareto-dominates (13, 24, 56), because two of the three players (A and B) prefer the former allocation, whereas both allocations give player C the same two items (56). It is easy to see that (12, 34, 56) is Pareto-optimal or efficient: no player can do better with some other division without some other player or players doing worse, or at least not better. This is apparent from the fact that the only way A or B, which get their two best items, can do better is to receive an additional item from one of the two other players, but this will necessarily hurt the player who then receives fewer than its present two items. Whereas C can do better without receiving a third item if it receives item 1 or item 2 in place of item 6, this substitution would necessarily hurt A, which will do worse if it receives item 6 in place of item 1 or 2.
The problem with efficient allocation (12, 34, 56) is that it is not assuredly envy-free. In particular, C will envy A's allocation of 12 (C's 2nd-best and 3rd-best items) if it prefers these two items to its present allocation of 56 (C's best and 4th-best items). In the absence of information about C's preferences for subsets of items, therefore, we cannot say that efficient allocation (12, 34, 56) is envy-free.8
But the real bite of this paradox stems from the fact that not only is inefficient division (13, 24, 56) envy-free, but it is uniquely so – there is no other division, including an efficient one, that guarantees envy-freeness. To show this in the example, note first that an envy-free division must give each player its best item; if not, then a player might prefer a division, like envy-free division (13, 24, 56) or efficient division (12, 34, 56), that does give each player its best item, rendering the division that does not do so envy-possible or envy-ensuring. Second, even if each player receives its best item, this item cannot be the only one it receives, because then the player might envy any player that receives two or more items, whatever those items are.
8 Recall that an envy-free division of indivisible items is one in which, no matter how the players value subsets of items consistent with their rankings, no player prefers any other player's allocation to its own. If a division is not envy-free, it is envy-possible if a player's allocation may make it envious of another player, depending on how it values subsets of items, as illustrated for player C by division (12, 34, 56). It is envy-ensuring if it causes envy, independent of how the players value subsets of items. In effect, a division that is envy-possible has the potential to cause envy. By comparison, an envy-ensuring division always causes envy, and an envy-free division never causes envy.
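This classification can also be checked exhaustively. The sketch below (plain Python; the helper names are mine) enumerates every division that gives each of A, B, C two items and tests envy-freeness with the set-comparison rule of assumption five, confirming that (13, 24, 56) is the only envy-free division among them.

```python
from itertools import combinations

rankings = {            # best to worst, from the paradox-1 example
    "A": [1, 2, 3, 4, 5, 6],
    "B": [4, 3, 2, 1, 5, 6],
    "C": [5, 1, 2, 6, 3, 4],
}

def surely_likes_own(ranking, S, T):
    # True when, for every valuation consistent with the ranking, the
    # player likes its own set S at least as much as the equal-sized set T:
    # sorted by rank, each item of S must beat the corresponding item of T.
    rank = {item: i for i, item in enumerate(ranking)}
    return all(a <= b for a, b in
               zip(sorted(rank[x] for x in S), sorted(rank[x] for x in T)))

items = {1, 2, 3, 4, 5, 6}
envy_free = []
for a_items in combinations(items, 2):
    rest = items - set(a_items)
    for b_items in combinations(rest, 2):
        division = {"A": set(a_items), "B": set(b_items),
                    "C": rest - set(b_items)}
        if all(surely_likes_own(rankings[p], division[p], division[q])
               for p in division for q in division if p != q):
            envy_free.append(division)

print(envy_free)   # only ({1,3}, {2,4}, {5,6}) survives
```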
By this reasoning, then, the only possible envy-free divisions in the example are those in which each player receives two items, including its top choice. It is easy to check that no efficient division is envy-free. Similarly, one can check that no inefficient division, except (13, 24, 56), is envy-free, making this division uniquely envy-free.
Paradox 2. Neither the Rawlsian maximin criterion nor the utilitarian Borda-score criterion may choose a unique efficient and envy-free division.
Unlike the example illustrating paradox 1, efficiency and envy-freeness are compatible in the following example:
A: 1 2 3 4 5 6
B: 5 6 2 1 4 3
C: 3 6 5 4 1 2
There are three efficient divisions in which (A, B, C) each get two items: (i) (12, 56, 34); (ii) (12, 45, 36); (iii) (14, 25, 36). Only (iii) is envy-free: whereas C might prefer B's 56 allocation in (i), and B might prefer A's 12 allocation in (ii), no player prefers another player's allocation in (iii).
Now consider the following Rawlsian maximin criterion to distinguish among the efficient divisions: choose a division that maximizes the minimum rank of items that players receive, making a worst-off player as well off as possible.9 Because (ii) gives a 5th-best item to B, whereas (i) and (iii) give players, at worst, a 4th-best item, the latter two divisions satisfy the Rawlsian maximin criterion. Between these two, (i), which is envy-possible, is arguably better than (iii), which is envy-free: (i) gives the two players that do not get a 4th-best item their two best items, whereas (iii) does not give B its two best items.10
Now consider what a modified Borda count would give the players under each of the three efficient divisions. Awarding 6 points for obtaining a best item, 5 points for obtaining a 2nd-best item, . . . , 1 point for obtaining a worst item in the example, (ii) and (iii) give the players a total of 30 points, whereas (i) gives the players a total of 31 points.11 This criterion, which I call the utilitarian Borda-score criterion, gives the nod to division (i); the Borda scores provide a measure of the overall utility or welfare of the players. Thus, neither the Rawlsian maximin criterion nor the utilitarian Borda-score criterion guarantees the selection of the unique efficient and envy-free division (iii).
9 This is somewhat different from Rawls's (1971) proposal to maximize the utility of the player with minimum utility, so it might be considered a modified Rawlsian criterion. I introduce a rough measure of utility next with a modified Borda count.
10 This might be considered a second-order application of the maximin criterion: if, for two divisions, the worst item any player receives has the same rank, consider the player that receives a next-worst item in each, and choose the division in which this item is ranked higher. This is an example of a lexicographic decision rule, whereby alternatives are ordered on the basis of a most important criterion; if that is not determinative, a next-most important criterion is invoked, and so on, to narrow down the set of feasible alternatives.
11 The standard scoring rules for the Borda count in this 6-item example would give 5 points to a best item, 4 points to a 2nd-best item, . . . , 0 points to a worst item. I depart slightly from this standard scoring rule to ensure that each player obtains some positive value for all items, including its worst choice, as assumed earlier.
Conclusions
The squeezing procedure I illustrated for dividing up a cake among three players ensures efficiency and envy-freeness, but it does not satisfy equitability. Whereas adjusted winner satisfies efficiency, envy-freeness, and equitability for two players dividing up several divisible goods, all these properties cannot be guaranteed if there are more than two players. Finally, the two paradoxes relating to the fair division of indivisible goods, which are independent of the procedure used, illustrate new difficulties – that no division may satisfy either maximin or utilitarian notions of welfare and, at the same time, be efficient and envy-free.
Future Directions
Patently, fair division is a hard problem, whatever the things being divided are. While some conflicts are ineradicable, as the paradoxes demonstrate, the trade-offs that best resolve these conflicts are by no means evident. Understanding these may help to ameliorate, if not solve, practical problems of fair division, ranging from the splitting of the marital property in a divorce to determining who gets what in an international dispute.
Bibliography
1. Barbanel JB (2005) The Geometry of Efficient Fair Division. Cambridge University Press, New York
2. Barbanel JB, Brams SJ (2004) Cake Division with Minimal Cuts: Envy-Free Procedures for 3 Persons, 4 Persons, and Beyond. Math Soc Sci 48(3):251–269
3. Barbanel JB, Brams SJ (2007) Cutting a Pie Is Not a Piece of Cake. Am Math Month (forthcoming)
4. Brams SJ, Edelman PH, Fishburn PC (2001) Paradoxes of Fair Division. J Philos 98(6):300–314
5. Brams SJ, Edelman PH, Fishburn PC (2004) Fair Division of Indivisible Items. Theory Decis 55(2):147–180
6. Brams SJ, Fishburn PC (2000) Fair Division of Indivisible Items Between Two People with Identical Preferences: Envy-Freeness, Pareto-Optimality, and Equity. Soc Choice Welf 17(2):247–267
7. Brams SJ, Jones MA, Klamler C (2006) Better Ways to Cut a Cake. Not AMS 35(11):1314–1321
8. Brams SJ, Jones MA, Klamler C (2007) Proportional Pie Cutting. Int J Game Theory 36(3–4):353–367
9. Brams SJ, Kaplan TR (2004) Dividing the Indivisible: Procedures for Allocating Cabinet Ministries in a Parliamentary System. J Theor Politics 16(2):143–173
10. Brams SJ, Kilgour MD (2001) Competitive Fair Division. J Political Econ 109(2):418–443
11. Brams SJ, King DR (2004) Efficient Fair Division: Help the Worst Off or Avoid Envy? Ration Soc 17(4):387–421
12. Brams SJ, Taylor AD (1995) An Envy-Free Cake Division Protocol. Am Math Month 102(1):9–18
13. Brams SJ, Taylor AD (1996) Fair Division: From Cake-Cutting to Dispute Resolution. Cambridge University Press, New York
14. Brams SJ, Taylor AD (1999a) Calculating Consensus. Corp Couns 9(16):47–50
15. Brams SJ, Taylor AD (1999b) The Win-Win Solution: Guaranteeing Fair Shares to Everybody. W.W. Norton, New York
16. Brams SJ, Taylor AD, Zwicker WS (1995) Old and New Moving-Knife Schemes. Math Intell 17(4):30–35
17. Brams SJ, Taylor AD, Zwicker WS (1997) A Moving-Knife Solution to the Four-Person Envy-Free Cake Division Problem. Proc Am Math Soc 125(2):547–554
18. Edelman PH, Fishburn PC (2001) Fair Division of Indivisible Items Among People with Similar Preferences. Math Soc Sci 41(3):327–347
19. Even S, Paz A (1984) A Note on Cake Cutting. Discret Appl Math 7(3):285–296
20. Haake CJ, Raith MG, Su FE (2002) Bidding for Envy-Freeness: A Procedural Approach to n-Player Fair Division Problems. Soc Choice Welf 19(4):723–749
21. Herreiner D, Puppe C (2002) A Simple Procedure for Finding Equitable Allocations of Indivisible Goods. Soc Choice Welf 19(2):415–430
22. Jones MA (2002) Equitable, Envy-Free, and Efficient Cake Cutting for Two People and Its Application to Divisible Goods. Math Mag 75(4):275–283
23. Moulin HJ (2003) Fair Division and Collective Welfare. MIT Press, Cambridge
24. Peterson E, Su FE (2000) Four-Person Envy-Free Chore Division. Math Mag 75(2):117–122
25. Pikhurko O (2000) On Envy-Free Cake Division. Am Math Month 107(8):736–738
26. Potthoff RF (2002) Use of Linear Programming to Find an Envy-Free Solution Closest to the Brams–Kilgour Gap Solution for the Housemates Problem. Group Decis Negot 11(5):405–414
27. Robertson JM, Webb WA (1997) Near Exact and Envy-Free Cake Division. Ars Comb 45:97–108
28. Robertson J, Webb W (1998) Cake-Cutting Algorithms: Be Fair If You Can. AK Peters, Natick
29. Shishido H, Zeng DZ (1999) Mark-Choose-Cut Algorithms for Fair and Strongly Fair Division. Group Decis Negot 8(2):125–137
30. Stromquist W (1980) How to Cut a Cake Fairly. Am Math Month 87(8):640–644
31. Su FE (1999) Rental Harmony: Sperner's Lemma in Fair Division. Am Math Month 106:922–934
32. Young HP (1994) Equity in Theory and Practice. Princeton University Press, Princeton
33. Zeng DZ (2000) Approximate Envy-Free Procedures. In: Game Practice: Contributions from Applied Game Theory. Kluwer Academic Publishers, Dordrecht, pp 259–271
Field Theoretic Methods
UWE CLAUS TÄUBER
Department of Physics, Center for Stochastic Processes in Science and Engineering, Virginia Polytechnic Institute and State University, Blacksburg, USA
Article Outline
Glossary
Definition of the Subject
Introduction
Correlation Functions and Field Theory
Discrete Stochastic Interacting Particle Systems
Stochastic Differential Equations
Future Directions
Acknowledgments
Bibliography
Glossary
Absorbing state: State from which, once reached, an interacting many-particle system cannot depart, not even through the aid of stochastic fluctuations.
Correlation function: Quantitative measure of the correlation of random variables; usually set to vanish for statistically independent variables.
Critical dimension: Borderline dimension dc above which mean-field theory yields reliable results, while for d < dc fluctuations crucially affect the system's large-scale behavior.
External noise: Stochastic forcing of a macroscopic system induced by random external perturbations, such as thermal noise from a coupling to a heat bath.
Field theory: A representation of physical processes through continuous variables, typically governed by an exponential probability distribution.
Generating function: Laplace transform of the probability distribution; all moments and correlation functions follow through appropriate partial derivatives.
Internal noise: Random fluctuations in a stochastic macroscopic system originating from its internal kinetics.
Langevin equation: Stochastic differential equation describing time evolution that is subject to fast random forcing.
Master equation: Evolution equation for a configurational probability obtained by balancing gain and loss terms through transitions into and away from each state.
Mean-field approximation: Approximative analytical approach to an interacting system with many degrees of freedom wherein spatial and temporal fluctuations as well as correlations between the constituents are neglected.
Order parameter: A macroscopic density corresponding to an extensive variable that captures the symmetry and thereby characterizes the ordered state of a thermodynamic phase in thermal equilibrium. Nonequilibrium generalizations typically address appropriate stationary values in the long-time limit.
Perturbation expansion: Systematic approximation scheme for an interacting and/or nonlinear system that involves a formal expansion about an exactly solvable simplification, by means of a power series with respect to a small coupling.
Definition of the Subject
Traditionally, complex macroscopic systems are often described in terms of ordinary differential equations for the temporal evolution of the relevant (usually collective) variables. Some natural examples are particle or population densities, chemical reactant concentrations, and magnetization or polarization densities; others involve more abstract concepts such as an apt measure of activity, etc. Complex behavior often entails (diffusive) spreading, front propagation, and spontaneous or induced pattern formation. In order to capture these intriguing phenomena, a more detailed level of description is required, namely the inclusion of spatial degrees of freedom, whereupon the above quantities all become local density fields. Stochasticity, i. e., randomly occurring propagation, interactions, or reactions, frequently represents another important feature of complex systems. Such stochastic processes generate internal noise that may crucially affect even long-time and large-scale properties.
In addition, other system variables, provided they fluctuate on time scales that are fast compared to the characteristic evolution times for the relevant quantities of interest, can be (approximately) accounted for within a Langevin description in the form of external additive or multiplicative noise.
A quantitative mathematical analysis of complex spatio-temporal structures, and more generally of cooperative behavior in stochastic interacting systems with many degrees of freedom, typically relies on the study of appropriate correlation functions. Field-theoretic, i. e., spatially continuous, representations both for random processes defined through a master equation and for Langevin-type stochastic differential equations have been developed since the 1970s. They provide a general framework for the computation of correlation functions, utilizing powerful tools that were originally developed in quantum many-body as well as quantum and statistical field theory. These methods allow us to construct systematic approximation schemes, e. g., perturbative expansions with respect to some parameter (presumed small) that measures the strength of fluctuations. They also form the basis of more sophisticated renormalization group methods, which represent an especially potent device to investigate scale-invariant phenomena.
Introduction
Stochastic Complex Systems
Complex systems consist of many interacting components. As a consequence of either these interactions and/or the kinetics governing the system's temporal evolution, correlations between the constituents emerge that may induce cooperative phenomena such as (quasi-)periodic oscillations, the formation of spatio-temporal patterns, and phase transitions between different macroscopic states. These are characterized in terms of some appropriate collective variables, often termed order parameters, which describe the large-scale and long-time system properties. The time evolution of complex systems typically entails random components: either the kinetics itself follows stochastic rules (certain processes occur with given probabilities per unit time), or we project our ignorance of various fast microscopic degrees of freedom (or our lack of interest in their detailed dynamics) into their treatment as stochastic noise.
An exact mathematical analysis of nonlinear stochastic systems with many interacting degrees of freedom is usually not feasible. One therefore has to resort either to computer simulations of corresponding stochastic cellular automata, or to approximative treatments. A first step, which is widely used and often provides useful qualitative insights, consists of ignoring spatial and temporal fluctuations, and just studying equations of motion for ensemble-averaged order parameters. In order to arrive at closed equations, additional simplifications tend to be necessary, namely the factorization of correlations into powers of the mean order parameter densities. Such approximations are called mean-field theories; familiar examples are rate equations for chemical reaction kinetics and Landau–Ginzburg theory for phase transitions in thermal equilibrium. Yet in some situations mean-field approximations are insufficient to obtain a satisfactory quantitative description (see, e. g., the recent work collected in [1,2]).
Let us consider an illuminating example.
Example: Lotka–Volterra Model
In the 1920s, Lotka and Volterra independently formulated a mathematical model to describe emerging periodic oscillations, respectively in coupled autocatalytic chemical reactions and in the Adriatic fish population (see, e. g., [3]). We shall formulate the model in the language of population dynamics, and treat it as a stochastic system with two species A (the 'predators') and B (the 'prey'), subject to the following reactions: predator death A → ∅, with rate μ; prey proliferation B → B + B, with rate σ; predation interaction A + B → A + A, with rate λ. Obviously, for λ = 0 the two populations decouple; while the predators face extinction, the prey population will explode. The average predator and prey population densities a(t) and b(t) are governed by the linear differential equations ȧ(t) = −μ a(t) and ḃ(t) = σ b(t), whose solutions are exponentials. Interesting competition arises as a consequence of the nonlinear process governed by the rate λ. In an exact representation of the system's temporal evolution, we would now need to know the probability of finding an A–B pair at time t. Moreover, in a spatial Lotka–Volterra model, defined, say, on a d-dimensional lattice on which the individual particles can move via nearest-neighbor hopping, the predation reaction should occur only if both predators and prey occupy the same or adjacent sites. The evolution equations for the mean densities a(t) and b(t) would then have to be respectively amended by the terms ±λ ⟨a(x, t) b(x, t)⟩. Here a(x, t) and b(x, t) represent local concentrations, the brackets denote the ensemble average, and ⟨a(x, t) b(x, t)⟩ represents the A–B cross correlations. In the rate equation approximation, it is assumed that the local densities are uncorrelated, whereupon ⟨a(x, t) b(x, t)⟩ factorizes to ⟨a(x, t)⟩⟨b(x, t)⟩ = a(t) b(t). This yields the famous deterministic Lotka–Volterra equations

\dot{a}(t) = \lambda\, a(t)\, b(t) - \mu\, a(t)\,, \qquad
\dot{b}(t) = \sigma\, b(t) - \lambda\, a(t)\, b(t)\,.   (1)
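Equations (1) are straightforward to integrate numerically. The following sketch (plain Python with a hand-rolled fourth-order Runge–Kutta step; the rate values are illustrative, not taken from the text) tracks the constant of motion K(t) = λ[a(t) + b(t)] − σ ln a(t) − μ ln b(t) and confirms that the mean-field orbits are closed cycles.

```python
import math

# Illustrative rates (not from the text): predation, prey branching, predator death
lam, sig, mu = 1.0, 0.5, 0.5

def deriv(a, b):
    # Right-hand side of the deterministic Lotka-Volterra equations (1)
    return (lam * a * b - mu * a, sig * b - lam * a * b)

def rk4_step(a, b, dt):
    # One classical fourth-order Runge-Kutta step
    k1 = deriv(a, b)
    k2 = deriv(a + 0.5 * dt * k1[0], b + 0.5 * dt * k1[1])
    k3 = deriv(a + 0.5 * dt * k2[0], b + 0.5 * dt * k2[1])
    k4 = deriv(a + dt * k3[0], b + dt * k3[1])
    return (a + dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6,
            b + dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6)

def K(a, b):
    # The Lyapunov-type constant of motion of equations (1)
    return lam * (a + b) - sig * math.log(a) - mu * math.log(b)

a, b = 1.5, 0.8          # arbitrary positive initial densities
K0 = K(a, b)
for _ in range(20000):   # integrate up to t = 20
    a, b = rk4_step(a, b, 1e-3)

print(abs(K(a, b) - K0))  # stays close to zero: the orbit is a closed cycle
```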
Within this mean-field approximation, the quantity K(t) = λ [a(t) + b(t)] − σ ln a(t) − μ ln b(t) (essentially the system's Lyapunov function) is a constant of motion, K̇(t) = 0. This results in regular nonlinear population oscillations, whose frequency and amplitude are fully determined by the initial conditions, a rather unrealistic feature. Moreover, Eqs. (1) are known to be unstable with respect to various model modifications (as discussed in [3]). In contrast with the rate equation predictions, the original stochastic spatial Lotka–Volterra system displays much richer behavior (a recent overview is presented
Field Theoretic Methods, Figure 1: Snapshots of the time evolution (left to right) of activity fronts emerging in a stochastic Lotka–Volterra model simulated on a 512 × 512 lattice, with periodic boundary conditions and site occupation numbers restricted to 0 or 1. For the chosen reaction rates (σ = 4.0, μ = 0.1, and λ = 2.2), the system is in the species coexistence phase, and the corresponding mean-field fixed point is a focus. The red, blue, and black dots respectively represent predators A, prey B, and empty sites ∅. Reproduced with permission from [4]
Field Theoretic Methods, Figure 2: Static correlation functions (a) CAA(x) (note the logarithmic scale) and (b) CAB(x), measured in simulations on a 1024 × 1024 lattice without any restrictions on the site occupations. The reaction rates were σ = 0.1 and μ = 0.1, and λ was varied from 0.5 (blue triangles, upside down) and 0.75 (green triangles) to 1.0 (red squares). Reproduced with permission from [5]
in [4]): The predator–prey coexistence phase is governed, for sufficiently large values of the predation rate, by an incessant sequence of 'pursuit and evasion' wave fronts that form quite complex dynamical patterns, as depicted in Fig. 1, which shows snapshots taken in a two-dimensional lattice Monte Carlo simulation where each site could at most be occupied by a single particle. In finite systems, these correlated structures induce erratic population oscillations whose features are independent of the initial configuration. Moreover, if locally the prey 'carrying capacity' is limited (corresponding to restricting the maximum site occupation number per lattice site), there appears an extinction threshold for the predator population that separates the active coexistence regime, through a continuous phase transition, from a state wherein at long times t → ∞ only prey survive. With respect to the predator population, this represents an absorbing state: once all A particles have vanished, they cannot be produced by the stochastic kinetics.
A quantitative characterization of the emerging spatial structures utilizes equal-time correlation functions such as C_{AA}(x − x′; t) = ⟨a(x, t) a(x′, t)⟩ − a(t)² and C_{AB}(x − x′; t) = ⟨a(x, t) b(x′, t)⟩ − a(t) b(t), computed at some large time t in the (quasi-)stationary state. These are shown in Fig. 2 as measured in computer simulations for a stochastic Lotka–Volterra model (but here no restrictions on the site occupation numbers of the A or B particles were implemented). The A–A (and B–B) correlations obviously decay essentially exponentially with distance x, C_{AA}(x) ∝ C_{BB}(x) ∝ e^{−|x|/ξ}, with roughly equal correlation lengths ξ for the predators and prey. The cross-correlation function C_{AB}(x) displays a maximum at six lattice spacings; these positive correlations indicate the spatial extent of the emerging activity fronts (prey followed
by the predators). At closer distance, the A and B particles become anti-correlated (C_{AB}(x) < 0 for |x| < 3): prey would not survive close encounters with the predators. In a similar manner, one can address temporal correlations. These appear prominently in the space-time plot of Fig. 3, obtained for a Monte Carlo run on a one-dimensional lattice (no site occupation restrictions), indicating localized population explosion and extinction events.
Field Theoretic Methods, Figure 3: Space-time plot (space horizontal, with periodic boundary conditions; time vertical, proceeding downward) showing the temporal evolution of a one-dimensional stochastic Lotka–Volterra model on 512 lattice sites, but without any restrictions on the site occupation numbers (red: predators, blue: prey, magenta: sites occupied by both species; rates σ = 0.1, μ = 0.1, λ = 0.1). Reproduced with permission from [5]
Correlation Functions and Field Theory
The above example demonstrates that stochastic fluctuations and correlations induced by the dynamical interactions may lead to important features that are not adequately described by mean-field approaches. We thus require tools that allow us to systematically account for fluctuations in the mathematical description of stochastic complex systems and to evaluate characteristic correlations. Such a toolbox is provided through field theory representations that are conducive to the identification of underlying symmetries and have proven useful starting points for the construction of various approximation schemes. These methods were originally devised and elaborated in the theory of (quantum and classical) many-particle systems and quantum fields ([6,7,8,9,10,11,12,13] represent a sample of recent textbooks).
Generating Functions
The basic structure of these field theories rests in a (normalized) exponential probability distribution P[S_i] for the N relevant variables S_i, i = 1, …, N: \int \prod_{i=1}^{N} dS_i\, P[S_i] = 1, where the integration extends over the allowed range of values for the S_i; i. e.,

P[S_i] = Z^{-1} \exp(-A[S_i])\,, \qquad
Z = \int \prod_{i=1}^{N} dS_i\, \exp(-A[S_i])\,.   (2)

In canonical equilibrium statistical mechanics, A[S_i] = H[S_i]/k_B T is essentially the Hamiltonian, and the normalization is the partition function Z. In Euclidean quantum field theory, the action A[S_i] is given by the Lagrangian. All observables O should be functions of the basic degrees of freedom S_i; their ensemble average thus becomes

\langle O[S_i] \rangle = \int \prod_{i=1}^{N} dS_i\, O[S_i]\, P[S_i]
= Z^{-1} \int \prod_{i=1}^{N} dS_i\, O[S_i]\, \exp(-A[S_i])\,.   (3)

If we are interested in n-point correlations, i. e., expectation values of the products of the variables S_i, it is useful to define a generating function

W[j_i] = \Big\langle \exp\Big( \sum_{i=1}^{N} j_i S_i \Big) \Big\rangle\,,   (4)

with W[j_i = 0] = 1. Notice that W[j_i] formally is just the Laplace transform of the probability distribution P[S_i]. The correlation functions can now be obtained via partial derivatives of W[j_i] with respect to the sources j_i:

\langle S_{i_1} \cdots S_{i_n} \rangle = \frac{\partial}{\partial j_{i_1}} \cdots \frac{\partial}{\partial j_{i_n}}\, W[j_i] \Big|_{j_i = 0}\,.   (5)

Connected correlation functions or cumulants can be found by similar partial derivatives of the logarithm of the generating function:

\langle S_{i_1} \cdots S_{i_n} \rangle_c = \frac{\partial}{\partial j_{i_1}} \cdots \frac{\partial}{\partial j_{i_n}}\, \ln W[j_i] \Big|_{j_i = 0}\,;   (6)

e. g., ⟨S_i⟩_c = ⟨S_i⟩, and ⟨S_i S_j⟩_c = ⟨S_i S_j⟩ − ⟨S_i⟩⟨S_j⟩ = ⟨(S_i − ⟨S_i⟩)(S_j − ⟨S_j⟩)⟩.
Perturbation Expansion
For a Gaussian action, i. e., a quadratic form A_0[S_i] = \frac{1}{2} \sum_{ij} S_i A_{ij} S_j (for simplicity we assume real variables S_i), one may readily compute the corresponding generating function W_0[j_i]. After diagonalizing the symmetric
N × N matrix A_{ij}, completing the squares, and evaluating the ensuing Gaussian integrals, one obtains

Z_0 = \frac{(2\pi)^{N/2}}{\sqrt{\det A}}\,, \qquad
W_0[j_i] = \exp\Big( \frac{1}{2} \sum_{i,j=1}^{N} j_i\, (A^{-1})_{ij}\, j_j \Big)\,, \qquad
\langle S_i S_j \rangle_0 = (A^{-1})_{ij}\,.   (7)

Thus, the two-point correlation functions in the Gaussian ensemble are given by the elements of the inverse harmonic coupling matrix. An important special property of the Gaussian ensemble is that all n-point functions with odd n vanish, whereas those with even n factorize into sums of all possible permutations of products of two-point functions (A^{-1})_{ij} that can be constructed by pairing up the variables S_i (Wick's theorem). For example, the four-point function reads \langle S_i S_j S_k S_l \rangle_0 = (A^{-1})_{ij} (A^{-1})_{kl} + (A^{-1})_{ik} (A^{-1})_{jl} + (A^{-1})_{il} (A^{-1})_{jk}.
Let us now consider a general action, isolate the Gaussian contribution, and label the remainder as the nonlinear, anharmonic, or interacting part, A[S_i] = A_0[S_i] + A_{int}[S_i]. We then observe that

Z = Z_0 \big\langle \exp(-A_{int}[S_i]) \big\rangle_0\,, \qquad
\langle O[S_i] \rangle = \frac{\big\langle O[S_i]\, \exp(-A_{int}[S_i]) \big\rangle_0}{\big\langle \exp(-A_{int}[S_i]) \big\rangle_0}\,,   (8)

where the index 0 indicates that the expectation values are computed in the Gaussian ensemble. The nonlinear terms in Eq. (8) may now be treated perturbatively by expanding the exponentials in the numerator and denominator with respect to the interacting part A_{int}[S_i]:

\langle O[S_i] \rangle = \frac{\Big\langle O[S_i] \sum_{\ell=0}^{\infty} \frac{1}{\ell!} \big( -A_{int}[S_i] \big)^{\ell} \Big\rangle_0}{\Big\langle \sum_{\ell=0}^{\infty} \frac{1}{\ell!} \big( -A_{int}[S_i] \big)^{\ell} \Big\rangle_0}\,.   (9)

If the interaction terms are polynomial in the variables S_i, Wick's theorem reduces the calculation of n-point functions to a summation of products of Gaussian two-point functions. Since the number of contributing terms grows factorially with the order ℓ of the perturbation expansion, graphical representations in terms of Feynman diagrams become very useful for the classification and evaluation of the different contributions to the perturbation series. Basically, they consist of lines representing the Gaussian two-point functions ('propagators') that are connected to vertices that stem from the (polynomial) interaction terms; for details, see, e. g., [6,7,8,9,10,11,12,13].
Continuum Limit and Functional Integrals
Discrete spatial degrees of freedom are already contained in the above formal description: for example, on a d-dimensional lattice with N^d sites, the index i for the fields S_i merely needs to entail the site labels, and the total number of degrees of freedom is just N = N^d times the number of independent relevant quantities. Upon discretizing time, these prescriptions can be extended, effectively in an additional dimension, to systems with temporal evolution. We may at last take the continuum limit by letting N → ∞, while the lattice constant and elementary time step tend to zero in such a manner that macroscopic dynamical features are preserved. Formally, this replaces sums over lattice sites and time steps with spatial and temporal integrations; the action A[S_i] becomes a functional of the fields S_i(x, t); partial derivatives turn into functional derivatives; and functional integrations \prod_{i=1}^{N} \int dS_i \to \int \mathcal{D}[S_i] are to be inserted in the previous expressions. For example, Eqs. (3), (4) and (6) become

\langle O[S_i] \rangle = \frac{1}{Z} \int \mathcal{D}[S_i]\, O[S_i]\, \exp(-A[S_i])\,,   (10)

W[j_i] = \Big\langle \exp\Big( \int d^d x \int dt \sum_i j_i(x, t)\, S_i(x, t) \Big) \Big\rangle\,,   (11)

\Big\langle \prod_{j=1}^{n} S_{i_j}(x_j, t_j) \Big\rangle_c = \prod_{j=1}^{n} \frac{\delta}{\delta j_{i_j}(x_j, t_j)}\, \ln W[j_i] \Big|_{j_i = 0}\,.   (12)

Thus we have arrived at a continuum field theory. Nevertheless, we may follow the procedures outlined above; specifically, the perturbation expansion expressions (8) and (9) still hold, yet with arguments S_i(x, t) that are now fields depending on continuous space-time parameters. More than thirty years ago, Janssen and De Dominicis independently derived a mapping of the stochastic kinetics defined through nonlinear Langevin equations onto a field theory action ([14,15]; reviewed in [16]). Almost simultaneously, Doi constructed a Fock space representation and therefrom a stochastic field theory for classical interacting particle systems from the master equation describing the corresponding stochastic processes [17,18]. His approach was further developed by several authors into a powerful method for the study of internal noise and correlation effects in reaction-diffusion systems ([19,20,21,22,23]; for recent reviews, see [24,25]). We shall see below that the field-theoretic representations of both classical master and Langevin equations require two independent fields
Field Theoretic Methods
for each stochastic variable. Otherwise, the computation of correlation functions and the construction of perturbative expansions fundamentally works precisely as sketched above. But the underlying causal temporal structure induces important specific features such as the absence of ‘vacuum diagrams’ (closed response loops): the denominator in Eq. (2) is simply Z D 1. (For unified and more detailed descriptions of both versions of dynamic stochastic field theories, see [26,27].)
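The Gaussian-ensemble identities (7), Wick's theorem, and the perturbation expansion (9) are easy to check numerically. The following sketch (Python with NumPy; an illustration added here, not part of the original article, with arbitrary example parameters) samples a small Gaussian ensemble and then compares a second-order truncation of Eq. (9) against direct numerical integration for a single-variable quartic "interaction" A_R = u S^4:

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# --- Gaussian ensemble: Eq. (7) and Wick's theorem, by direct sampling ---
N = 4
M = rng.normal(size=(N, N))
A = M @ M.T + N * np.eye(N)      # a symmetric, positive definite coupling matrix
Ainv = np.linalg.inv(A)
S = rng.multivariate_normal(np.zeros(N), Ainv, size=400_000)

C2 = S.T @ S / len(S)            # <S_i S_j>_0  ->  (A^{-1})_{ij}
assert np.allclose(C2, Ainv, atol=0.01)

C4 = np.mean(S[:, 0] * S[:, 1] * S[:, 2] * S[:, 3])
wick = Ainv[0, 1]*Ainv[2, 3] + Ainv[0, 2]*Ainv[1, 3] + Ainv[0, 3]*Ainv[1, 2]
assert abs(C4 - wick) < 0.01     # four-point function factorizes pairwise

# --- perturbation expansion (9), single variable: A = S^2/2 + u*S^4 ---
u = 0.002
x = np.linspace(-8.0, 8.0, 200_001)
w = np.exp(-x**2 / 2 - u * x**4)
exact = float((x**2 * w).sum() / w.sum())   # numerically "exact" <S^2>

def moment0(m):                  # Gaussian moments <S^m>_0 = (m-1)!! for even m
    return float(np.prod(np.arange(m - 1, 0, -2))) if m else 1.0

num = sum((-u)**l / math.factorial(l) * moment0(4*l + 2) for l in range(3))
den = sum((-u)**l / math.factorial(l) * moment0(4*l) for l in range(3))
assert abs(num / den - exact) < 1e-3   # 2nd-order truncation of Eq. (9)
```

The factorial growth in the number of Wick contractions at higher orders is precisely what makes the diagrammatic bookkeeping described above indispensable beyond the lowest orders.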
Discrete Stochastic Interacting Particle Systems

We first outline the mapping of stochastic interacting particle dynamics, as defined through a master equation, onto a field theory action [17,18,19,20,21,22,23]. Let us denote the configurational probability for a stochastically evolving system to be in state α at time t by P(α; t). Given the transition rates W_{α→β}(t) from states α to β, a master equation essentially balances the transitions into and out of each state:

  ∂P(α; t)/∂t = ∑_{β ≠ α} [ W_{β→α}(t) P(β; t) - W_{α→β}(t) P(α; t) ] .   (13)

The dynamics of many complex systems can be cast into the language of 'chemical' reactions, wherein certain particle species (upon encounter, say) transform into different species with fixed (time-independent) reaction rates. The 'particles' considered here could be atoms or molecules in chemistry, but also individuals in population dynamics (as in our example in Sect. "Example: Lotka–Volterra Model"), or appropriate effective degrees of freedom governing the system's kinetics, such as domain walls in magnets, etc. To be specific, we envision our particles to propagate via unbiased random walks (diffusion) on a d-dimensional hypercubic lattice, with the reactions occurring according to prescribed rules when particles meet on a lattice site. This stochastic interacting particle system is then at any time fully characterized by the numbers of particles n_A, n_B, … of each species A, B, … located on each lattice site. The following describes the construction of an associated field theory action. As important examples, we briefly discuss annihilation reactions and absorbing state phase transitions.

Master Equation and Fock Space Representation

The formal procedures are best explained by means of a simple example; thus consider the irreversible binary annihilation process A + A → A, happening with rate λ. In terms of the occupation numbers n_i of the lattice sites i, we can construct the master equation associated with these on-site reactions as follows. The annihilation process locally changes the occupation numbers by one; the transition rate from a state with n_i particles at site i to n_i - 1 particles is W_{n_i → n_i - 1} = λ n_i (n_i - 1), whence

  ∂P(n_i; t)/∂t = λ (n_i + 1) n_i P(n_i + 1; t) - λ n_i (n_i - 1) P(n_i; t)   (14)

represents the master equation for this reaction at site i. As an initial condition, we can for example choose a Poisson distribution P(n_i) = n̄_0^{n_i} e^{-n̄_0} / n_i! with mean initial particle density n̄_0. In order to capture the complete stochastic dynamics, we just need to add similar contributions describing other processes, and finally sum over all lattice sites i.

Since the reactions all change the site occupation numbers by integer values, a Fock space representation (borrowed from quantum mechanics) turns out particularly useful. To this end, we introduce the harmonic oscillator or bosonic ladder operator algebra [a_i, a_j] = 0 = [a_i†, a_j†], [a_i, a_j†] = δ_ij, from which we construct the particle number eigenstates |n_i⟩, namely a_i |n_i⟩ = n_i |n_i - 1⟩, a_i† |n_i⟩ = |n_i + 1⟩, a_i† a_i |n_i⟩ = n_i |n_i⟩. (Notice that a different normalization than in ordinary quantum mechanics has been employed here.) A general state with n_i particles on sites i is obtained from the 'vacuum' configuration |0⟩, defined via a_i |0⟩ = 0, through the product |{n_i}⟩ = ∏_i a_i†^{n_i} |0⟩. To implement the stochastic kinetics, we introduce a formal state vector as a linear combination of all possible states, weighted by the time-dependent configurational probability:

  |Φ(t)⟩ = ∑_{{n_i}} P({n_i}; t) |{n_i}⟩ .   (15)

Simple manipulations then transform the linear time evolution according to the master equation into an 'imaginary-time' Schrödinger equation

  ∂|Φ(t)⟩/∂t = -H |Φ(t)⟩ ,   |Φ(t)⟩ = e^{-Ht} |Φ(0)⟩ ,   (16)

governed by a stochastic quasi-Hamiltonian (rather, the Liouville time evolution operator). For on-site reaction processes, H_reac = ∑_i H_i(a_i†, a_i) is a sum of local contributions; e.g., for the binary annihilation reaction, H_i(a_i†, a_i) = -λ (1 - a_i†) a_i† a_i². It is a straightforward exercise to construct the corresponding expressions within
this formalism for the generalization kA → ℓA,

  H_i(a_i†, a_i) = -λ ( a_i†^ℓ - a_i†^k ) a_i^k ,   (17)

and for nearest-neighbor hopping with rate D between adjacent sites ⟨ij⟩,

  H_diff = D ∑_{⟨ij⟩} ( a_i† - a_j† ) ( a_i - a_j ) .   (18)

The two contributions for each process may be interpreted as follows: the first term in Eq. (17) corresponds to the actual process, and describes how many particles are annihilated and (re-)created in each reaction. The second term encodes the 'order' of each reaction, i.e., the number operator a_i† a_i appears to the kth power, but in the normal-ordered form a_i†^k a_i^k, for a kth-order process.

These procedures are readily adjusted for reactions involving multiple particle species. We merely need to specify the occupation numbers on each site for every species and correspondingly introduce additional ladder operators b_i, c_i, … for each new species, with [a_i, b_i†] = 0 = [a_i, c_i†], etc. For example, consider the reversible reaction kA + ℓB ⇌ mC with forward rate λ and backward rate σ; the associated reaction quasi-Hamiltonian reads

  H_reac = -∑_i ( c_i†^m - a_i†^k b_i†^ℓ ) ( λ a_i^k b_i^ℓ - σ c_i^m ) .   (19)

Similarly, for the Lotka–Volterra model of Sect. "Example: Lotka–Volterra Model", one finds

  H_reac = ∑_i [ μ ( a_i† - 1 ) a_i + σ b_i† ( 1 - b_i† ) b_i + λ ( b_i† - a_i† ) a_i† a_i b_i ] .   (20)

Note that all the above quasi-Hamiltonians are non-Hermitean operators, which naturally reflects the creation and destruction of particles.

Our goal is to compute averages and correlation functions with respect to the configurational probability P({n_i}; t). Returning to a single-species system (again, the generalization to many particle species is obvious), this is accomplished with the aid of the projection state ⟨P| = ⟨0| ∏_i e^{a_i}, for which ⟨P|0⟩ = 1 and ⟨P| a_i† = ⟨P|, since [e^{a_i}, a_j†] = e^{a_i} δ_ij. For the desired statistical averages of observables (which must all be expressible as functions of the occupation numbers {n_i}), one obtains

  ⟨O(t)⟩ = ∑_{{n_i}} O({n_i}) P({n_i}; t) = ⟨P| O({a_i† a_i}) |Φ(t)⟩ .   (21)

For example, as a consequence of probability conservation, 1 = ⟨P|Φ(t)⟩ = ⟨P| e^{-Ht} |Φ(0)⟩. Thus necessarily ⟨P| H = 0; upon commuting e^{∑_i a_i} with H, the creation operators are shifted, a_i† → 1 + a_i†, whence this condition is fulfilled provided H_i(a_i† → 1, a_i) = 0, which is indeed satisfied by our above explicit expressions (17) and (18). Through this prescription, we may replace a_i† a_i → a_i in all averages; e.g., the particle density becomes a(t) = ⟨a_i(t)⟩.

In the bosonic operator representation above, we have assumed that no restrictions apply to the particle occupation numbers n_i on each site. If n_i < 2s + 1, one may instead employ a representation in terms of spin s operators. For example, particle exclusion systems with n_i = 0 or 1 can thus be mapped onto non-Hermitean spin 1/2 'quantum' systems (for recent overviews, see [28,29]). Specifically in one dimension, such representations in terms of integrable spin chains have been very fruitful. An alternative approach uses the bosonic theory, but incorporates the site occupation restrictions through exponentials in the number operators, e^{-a_i† a_i} [30].

Continuum Limit and Field Theory

As a next step, we follow an established route in quantum many-particle theory [8] and proceed towards a field theory representation through constructing the path integral equivalent of the 'Schrödinger' dynamics (16), based on coherent states, which are right eigenstates of the annihilation operator, a_i |φ_i⟩ = φ_i |φ_i⟩, with complex eigenvalues φ_i. Explicitly, |φ_i⟩ = exp( -(1/2)|φ_i|² + φ_i a_i† ) |0⟩, and these coherent states satisfy the overlap formula ⟨φ_j|φ_i⟩ = exp( -(1/2)|φ_j|² - (1/2)|φ_i|² + φ_j^* φ_i ) and the (over-)completeness relation ∫ ∏_i (d²φ_i / π) |{φ_i}⟩⟨{φ_i}| = 1. Upon splitting the temporal evolution (16) into infinitesimal increments, standard procedures (elaborated in detail in [25]) eventually yield an expression for the configurational average

  ⟨O(t)⟩ ∝ ∫ ∏_i dφ_i dφ_i^* O({φ_i}) e^{-A[φ_i^*, φ_i; t]} ,   (22)

which is of the form (3), with the action

  A[φ_i^*, φ_i; t_f] = ∑_i [ -φ_i(t_f) + ∫_0^{t_f} dt ( φ_i^* ∂φ_i/∂t + H_i(φ_i^*, φ_i) ) - n̄_0 φ_i^*(0) ] ,   (23)
where the first term originates from the projection state, and the last one stems from the initial Poisson distribution. Through this procedure, in the original quasi-Hamiltonian the creation and annihilation operators a_i† and a_i are simply replaced with the complex numbers φ_i^* and φ_i. Finally, we proceed to the continuum limit, φ_i(t) → ψ(x,t), φ_i^*(t) → ψ̂(x,t). The 'bulk' part of the action then becomes

  A[ψ̂, ψ] = ∫ d^dx ∫ dt [ ψ̂ ( ∂/∂t - D∇² ) ψ + H_reac(ψ̂, ψ) ] ,   (24)

where the discrete hopping contribution (18) has naturally turned into a continuum diffusion term. We have thus arrived at a microscopic field theory for stochastic reaction–diffusion processes, without invoking any assumptions on the form or correlations of the internal reaction noise. Note that we require two independent fields ψ̂ and ψ to capture the stochastic dynamics. Actions of the type (24) may serve as a basis for further systematic coarse-graining, for constructing a perturbation expansion as outlined in Sect. "Perturbation Expansion", and perhaps for a subsequent renormalization group analysis [25,26,27]. We remark that it is often useful to perform a shift in the field ψ̂ about the mean-field solution, ψ̂(x,t) = 1 + ψ̃(x,t). Occasionally, the resulting field theory action allows the derivation of an equivalent Langevin dynamics; see Sect. "Stochastic Differential Equations" below.

Annihilation Processes

Let us consider our simple single-species example kA → ℓA. The reaction part of the corresponding field theory action reads

  H_reac(ψ̂, ψ) = -λ ( ψ̂^ℓ - ψ̂^k ) ψ^k ,   (25)

see Eq. (17). It is instructive to study the classical field equations, namely δA/δψ = 0, which is always solved by ψ̂ = 1, reflecting probability conservation, and δA/δψ̂ = 0, which, upon inserting ψ̂ = 1, yields

  ∂ψ(x,t)/∂t = D∇²ψ(x,t) - (k - ℓ) λ ψ(x,t)^k ,   (26)

i.e., the mean-field rate equation for the local particle density ψ(x,t), supplemented with a diffusion term. For k = 1, the particle density grows (k < ℓ) or decays (k > ℓ) exponentially. The solution of the rate equation for k > 1,

  a(t) = ⟨ψ(x,t)⟩ = [ a(0)^{1-k} + (k - ℓ)(k - 1) λ t ]^{-1/(k-1)} ,

implies a divergence within a finite time for k < ℓ, and an algebraic decay ∼ (λt)^{-1/(k-1)} for k > ℓ.

The full field theory action, which was derived from the master equation defining the stochastic process itself, provides a means of systematically including fluctuations in the mathematical treatment. Through a dimensional analysis, we can determine the (upper) critical dimension below which fluctuations become sufficiently strong to alter these power laws. Introducing an inverse length scale κ, [x] ∼ κ^{-1}, and applying diffusive temporal scaling, [Dt] ∼ κ^{-2}, with [ψ̂(x,t)] ∼ κ^0 and [ψ(x,t)] ∼ κ^d in d spatial dimensions, the reaction rate in terms of the diffusivity scales according to [λ/D] ∼ κ^{2-(k-1)d}. In large dimensions, the kinetics is reaction-limited, and at least qualitatively correctly described by the mean-field rate equation. In low dimensions, the dynamics becomes diffusion-limited, and the annihilation reactions generate depletion zones and spatial particle anti-correlations that slow down the density decay. The nonlinear coupling λ/D becomes dimensionless at the borderline critical dimension d_c(k) = 2/(k - 1) that separates these two distinct regimes. Thus in physical dimensions, intrinsic stochastic fluctuations are relevant only for pair and triplet annihilation reactions. By means of a renormalization group analysis (for details, see [25]), one finds for k = 2 and d < d_c(2) = 2: a(t) ∼ (Dt)^{-d/2} [21,22], as confirmed by exact solutions in one dimension. Precisely at the critical dimension, the mean-field decay laws acquire logarithmic corrections, namely a(t) ∼ (Dt)^{-1} ln(Dt) for k = 2 at d_c(2) = 2, and a(t) ∼ [(Dt)^{-1} ln(Dt)]^{1/2} for k = 3 at d_c(3) = 1. Annihilation reactions between different species (e.g., A + B → ∅) may introduce additional correlation effects, such as particle segregation and the confinement of active dynamics to narrow reaction zones [23]; a recent overview can be found in [25].
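The operator construction above can be checked explicitly on a truncated single-site Fock space. The sketch below (a numerical illustration added here, not part of the original text, with arbitrary parameter values) verifies that the quasi-Hamiltonian (17) for k = 2, ℓ = 1 reproduces the master-equation generator (14), that ⟨P|H = 0 (probability conservation), and that the exact single-site mean occupation ends up well above the mean-field decay of Eq. (26), illustrating that fluctuations matter:

```python
import math
import numpy as np

lam, n_max = 1.0, 40
dim = n_max + 1
n = np.arange(dim)

# Ladder operators in the normalization of the text: a|n> = n|n-1>, a†|n> = |n+1>
a = np.diag(n[1:], k=1)                 # matrix element (n-1, n) = n
adag = np.diag(np.ones(dim - 1), k=-1)  # matrix element (n+1, n) = 1

# Quasi-Hamiltonian (17) for k = 2, l = 1 (binary annihilation A + A -> A)
H = -lam * (adag - adag @ adag) @ (a @ a)

# (i) -H equals the master-equation generator of Eq. (14)
W = np.zeros((dim, dim))
W[n, n] = -lam * n * (n - 1)
W[n[:-1], n[1:]] = lam * n[1:] * n[:-1]
assert np.allclose(-H, W)

# (ii) probability conservation: the projection state <P| = <0|e^a has
# <P|n> = 1 for every n, so <P|H = 0
assert np.allclose(np.ones(dim) @ H, 0.0)

# (iii) evolve a Poisson initial state and compare <n>(t) with the
# mean-field solution a(t) = [a(0)^{-1} + lam*t]^{-1} for k = 2, l = 1
n0 = 3.0
P = np.exp(-n0) * n0**n / np.array([math.factorial(int(k)) for k in n])
dt, T = 2e-4, 2.0
for _ in range(int(T / dt)):
    P = P + dt * (W @ P)            # forward Euler on dP/dt = W P
assert abs(P.sum() - 1.0) < 1e-9    # total probability is conserved
mean_n = float(n @ P)
mf = 1.0 / (1.0 / n0 + lam * T)     # mean-field prediction
# on a single site the reaction stalls near one surviving particle,
# whereas the mean-field density keeps decaying
assert 0.5 < mean_n < n0 and mf < mean_n
```

The discrepancy in step (iii) is the zero-dimensional caricature of the fluctuation effects that, below d_c(k), invalidate the mean-field power laws.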
Active to Absorbing State Phase Transitions

Competition between particle production and decay processes leads to even richer scenarios, and can induce genuine nonequilibrium transitions that separate 'active' phases (wherein the particle densities remain nonzero in the long-time limit) from 'inactive' stationary states (where the concentrations ultimately vanish). A special but abundant case is that of absorbing states, where, owing to the absence of any agents, stochastic fluctuations cease entirely, and no particles can be regenerated [31,32]. These occur in a variety of systems in nature ([33,34] contain extensive discussions of various model systems); examples are chemical reactions involving an inert state ∅, wherefrom no reactants A are released anymore, or stochastic
population dynamics models, combining diffusive migration of a species A with asexual reproduction A → 2A (with rate σ), spontaneous death A → ∅ (at rate μ), and lethal competition 2A → A (with rate λ). In the inactive state, where no population members A are left, clearly all processes terminate. Similar effective dynamics may be used to model certain nonequilibrium physical systems, such as the domain wall kinetics in Ising chains with competing Glauber and Kawasaki dynamics. Here, spin flips ↑↑↓↓ → ↑↑↑↓ and ↑↑↓↑ → ↑↑↑↑ may be viewed as domain wall (A) hopping and pair annihilation 2A → ∅, whereas spin exchange ↑↑↓↓ → ↑↓↑↓ represents a branching process A → 3A. Notice that the para- and ferromagnetic phases respectively map onto the active and inactive 'particle' states. The ferromagnetic state becomes absorbing if the spin flip rates are taken at zero temperature.

The reaction quasi-Hamiltonian corresponding to the stochastic dynamics of the aforementioned population dynamics model reads

  H_reac(ψ̂, ψ) = (1 - ψ̂) ( σ ψ̂ - μ - λ ψ̂ ψ ) ψ .   (27)

The associated rate equation is the Fisher–Kolmogorov equation (see Murray 2002 [3]),

  ȧ(t) = (σ - μ) a(t) - λ a(t)² ,   (28)

which yields both inactive and active phases: for σ < μ we have a(t → ∞) → 0, whereas for σ > μ the density eventually saturates at a_s = (σ - μ)/λ. The explicit time-dependent solution

  a(t) = a(0) a_s / { a(0) + [a_s - a(0)] e^{-(σ-μ)t} }

shows that both stationary states are approached exponentially in time. They are separated by a continuous nonequilibrium phase transition at σ = μ, where the temporal decay becomes algebraic, a(t) = a(0)/[1 + a(0)λt] → 1/(λt) as t → ∞, independent of the initial density a(0). As in second-order equilibrium phase transitions, however, critical fluctuations are expected to invalidate the mean-field power laws in low dimensions d < d_c.

If we now shift the field ψ̂ about its stationary value 1 and rescale according to ψ̂(x,t) = 1 + √(λ/σ) S̃(x,t) and ψ(x,t) = √(σ/λ) S(x,t), the (bulk) action becomes

  A[S̃, S] = ∫ d^dx ∫ dt [ S̃ ( ∂/∂t + D (r - ∇²) ) S - u ( S̃² S - S̃ S² ) + λ S̃² S² ] ,   (29)

where D r = μ - σ. Thus, the three-point vertices have been scaled to identical coupling strengths u = √(σλ), which in fact represents the effective coupling of the perturbation expansion. Its scaling dimension is [u] ∼ κ^{2-d/2}, whence we infer the upper critical dimension d_c = 4. The four-point vertex ∝ λ, with [λ] ∼ κ^{2-d}, is then found to be irrelevant in the renormalization group sense, and can be dropped for the computation of universal, asymptotic scaling properties. The action (29) with λ = 0 is known as Reggeon field theory [35]; it satisfies a characteristic symmetry, namely invariance under so-called rapidity inversion S(x,t) ↔ -S̃(x,-t). Remarkably, it has moreover been established that the field theory action (29) describes the scaling properties of critical directed percolation clusters [36,37,38]. The fluctuation-corrected universal power laws governing the vicinity of the phase transition can be extracted by renormalization group methods (reviewed for directed percolation in [39]). Table 1 compares the analytic results obtained in an ε expansion about the upper critical dimension (ε = 4 - d) with the critical exponent values measured in Monte Carlo computer simulations [33,34].

According to a conjecture originally formulated by Janssen and Grassberger, any continuous nonequilibrium phase transition from an active to an absorbing state in a system governed by Markovian stochastic dynamics that is decoupled from any other slow variable, and in the absence of special additional symmetries or quenched randomness, should in fact fall in the directed percolation universality class [38,40]. This statement has indeed been confirmed in a large variety of model systems (many examples are listed in [33,34]). It even pertains to multi-species generalizations [41], and applies for instance to the predator extinction threshold in the stochastic Lotka–Volterra model with restricted site occupation numbers mentioned in Sect. "Example: Lotka–Volterra Model" [4].

Field Theoretic Methods, Table 1
Comparison of the values for the critical exponents of the directed percolation universality class measured in Monte Carlo simulations with the analytic renormalization group results within the ε = 4 - d expansion: ξ denotes the correlation length, t_c the characteristic relaxation time, a_s the saturation density in the active state, and a_c(t) the critical density decay law (τ denotes the distance from the critical point).

| Scaling exponent         | d = 1       | d = 2      | d = 4 - ε               |
| ξ ∼ |τ|^{-ν}             | ν ≈ 1.100   | ν ≈ 0.735  | ν = 1/2 + ε/16 + O(ε²)  |
| t_c ∼ ξ^z ∼ |τ|^{-zν}    | z ≈ 1.576   | z ≈ 1.73   | z = 2 - ε/12 + O(ε²)    |
| a_s ∼ |τ|^β              | β ≈ 0.2765  | β ≈ 0.584  | β = 1 - ε/6 + O(ε²)     |
| a_c(t) ∼ t^{-α}          | α ≈ 0.160   | α ≈ 0.46   | α = 1 - ε/4 + O(ε²)     |

Stochastic Differential Equations

This section explains how dynamics governed by Langevin-type stochastic differential equations can be represented through a field-theoretic formalism [14,15,16]. Such a description is especially useful to capture the effects of external noise on the temporal evolution of the relevant quantities under consideration, which encompasses the case of thermal noise induced by the coupling to a heat bath in thermal equilibrium at temperature T. The underlying assumption in this approach is that there exists a natural separation of time scales between the slow variables S_i and all other degrees of freedom ζ_i, which in comparison fluctuate rapidly and are therefore summarily gathered in zero-mean noise terms, here assumed to be uncorrelated in space and time:

  ⟨ζ_i(x,t)⟩ = 0 ,
  ⟨ ζ_i(x,t) ζ_j(x',t') ⟩ = 2 L_ij[S_i] δ(x - x') δ(t - t') .   (30)
Here, the noise correlator 2 L_ij[S_i] may be a function of the slow system variables S_i, and may also contain operators such as spatial derivatives. A general set of coupled Langevin-type stochastic differential equations then takes the form

  ∂S_i(t)/∂t = F_i[S_i] + ζ_i ,   (31)

where we may decompose the 'systematic forces' into reversible terms of microscopic origin and relaxational contributions that are induced by the noise and drive the system towards its stationary state (see below), i.e., F_i[S_i] = F_i^rev[S_i] + F_i^rel[S_i]. Both ingredients may contain nonlinear terms as well as mode couplings between different variables. Again, we first introduce the abstract formalism, and then proceed to discuss relaxation to thermal equilibrium as well as some examples of nonequilibrium Langevin dynamics.

Field Theory Representation of Langevin Equations

The shortest and most general route towards a field theory representation of the Langevin dynamics (31) with noise correlations (30) starts with one of the most elaborate ways to expand unity, namely through a product of functional delta functions (for the sake of compact notation, we immediately employ a functional integration language, but in the end all the path integrals are defined through appropriate discretizations in space and time):

  1 = ∫ ∏_i D[S_i] ∏_{(x,t)} δ( ∂S_i(x,t)/∂t - F_i[S_i](x,t) - ζ_i(x,t) )
    = ∫ ∏_i D[iS̃_i] D[S_i] exp( -∫ d^dx ∫ dt ∑_i S̃_i [ ∂S_i/∂t - F_i[S_i] - ζ_i ] ) .   (32)

In the second line we have used the Fourier representation of the (functional) delta distribution by means of the purely imaginary auxiliary variables S̃_i (also called Martin–Siggia–Rose response fields [42]). Next we require the explicit form of the noise probability distribution that generates the correlations (30); for simplicity, we may employ the Gaussian

  W[ζ_i] ∝ exp( -(1/4) ∫ d^dx ∫ dt ∑_{i,j} ζ_i (L^{-1})_ij ζ_j ) .   (33)

Inserting the identity (32) and the probability distribution (33) into the desired stochastic noise average of any observable O[S_i], we arrive at

  ⟨O[S_i]⟩ ∝ ∫ ∏_i D[iS̃_i] D[S_i] O[S_i] exp( -∫ d^dx ∫ dt ∑_i S̃_i [ ∂S_i/∂t - F_i[S_i] ] )
    × ∫ ∏_i D[ζ_i] exp( -∫ d^dx ∫ dt ∑_i [ (1/4) ζ_i ∑_j (L^{-1})_ij ζ_j - S̃_i ζ_i ] ) .   (34)

Subsequently evaluating the Gaussian integrals over the noise ζ_i yields at last

  ⟨O[S_i]⟩ = ∫ ∏_i D[S_i] O[S_i] P[S_i] ,   P[S_i] ∝ ∫ ∏_i D[iS̃_i] e^{-A[S̃_i, S_i]} ,   (35)

with the statistical weight governed by the Janssen–De Dominicis 'response' functional [14,15]

  A[S̃_i, S_i] = ∫ d^dx ∫_0^{t_f} dt ∑_i { S̃_i [ ∂S_i/∂t - F_i[S] ] - S̃_i ∑_j L_ij S̃_j } .   (36)

It should be noted that in the above manipulations we have omitted the functional determinant stemming from the variable change {ζ_i} → {S_i}. This step can be justified through applying a forward (Itô) discretization (for technical details, see [16,27,43]). Normalization implies ∫ ∏_i D[iS̃_i] D[S_i] e^{-A[S̃_i, S_i]} = 1. The first term in the action (36) encodes the temporal evolution according to the systematic terms in the Langevin Equations (31), whereas the second term specifies the noise correlations (30). Since the auxiliary fields appear only quadratically, they could be eliminated via completing the squares and Gaussian integrations. This results in the equivalent Onsager–Machlup functional, which however contains squares of the nonlinear terms and the inverse of the noise correlator operator; the form (36) is therefore usually more convenient for practical purposes. The Janssen–De Dominicis functional (36) takes the form of a (d+1)-dimensional statistical field theory with again two independent sets of fields S_i and S̃_i. It may serve as a starting point for systematic approximation schemes, including perturbative expansions and subsequent renormalization group treatments. Causality is properly incorporated in this formalism, which has important technical implications [16,27,43].
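As a minimal numerical illustration of the Langevin dynamics (31) with the noise normalization (30) (added here; the linear force F[S] = -γS and all parameter values are arbitrary choices, not taken from the article), an Euler–Maruyama integration reproduces the stationary Gaussian distribution with variance L/γ:

```python
import numpy as np

# Euler-Maruyama integration of the Langevin equation (31) for one variable,
# dS/dt = -gamma*S + zeta,  <zeta(t) zeta(t')> = 2*L*delta(t - t')  (Eq. 30);
# over a time step dt the discretized noise increment has variance 2*L*dt.
rng = np.random.default_rng(1)
gamma, L = 1.0, 0.5
dt, n_steps, n_walkers = 1e-3, 10_000, 10_000

S = np.zeros(n_walkers)
for _ in range(n_steps):
    S += dt * (-gamma * S) + rng.normal(0.0, np.sqrt(2 * L * dt), n_walkers)

# stationary (t >> 1/gamma) distribution: Gaussian with variance L/gamma
assert abs(S.mean()) < 0.05
assert abs(S.var() - L / gamma) < 0.03
```

The factor 2 in the correlator (30) is exactly what fixes the stationary variance to L/γ; the same bookkeeping reappears as the coefficient of the S̃² noise term in the response functional (36).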
Thermal Equilibrium and Relaxational Critical Dynamics

Consider the dynamics of a system that, following some external perturbation, relaxes towards thermal equilibrium, governed by the canonical Boltzmann distribution at fixed temperature T,

  P_eq[S_i] = exp( -H[S_i] / k_B T ) / Z(T) .   (37)

The relaxational term in the Langevin Equation (31) can then be specified as

  F_i^rel[S_i] = -Γ_i δH[S_i]/δS_i ,   (38)

with Onsager coefficients Γ_i; for nonconserved fields, Γ_i is a positive relaxation rate. On the other hand, if the variable S_i is a conserved quantity (such as the energy density), there is an associated continuity equation ∂S_i/∂t + ∇·J_i = 0, with a conserved current that is typically given by a gradient of the field S_i: J_i = -D_i ∇S_i + …; as a consequence, the fluctuations of the fields S_i will relax diffusively with diffusivity D_i, and Γ_i = -D_i ∇² becomes a spatial Laplacian.

In order for P(t) → P_eq as t → ∞, the stochastic Langevin dynamics needs to satisfy two conditions, which can be inferred from the associated Fokker–Planck equation [27,44]. First, the reversible probability current is required to be divergence-free in the space spanned by the fields S_i:

  ∫ d^dx ∑_i δ/δS_i(x) { F_i^rev[S_i] e^{-H[S_i]/k_B T} } = 0 .   (39)

This condition severely constrains the reversible force terms. For example, for a system whose microscopic time evolution is determined through the Poisson brackets Q_ij(x,x') = {S_i(x), S_j(x')} = -Q_ji(x',x) (to be replaced by commutators in quantum mechanics), one finds for the reversible mode-coupling terms [44]

  F_i^rev[S_i](x) = -∫ d^dx' ∑_j [ Q_ij(x,x') δH[S_i]/δS_j(x') - k_B T δQ_ij(x,x')/δS_j(x') ] .   (40)

Second, the noise correlator in Eq. (30) must be related to the Onsager relaxation coefficients through the Einstein relation

  L_ij = k_B T Γ_i δ_ij .   (41)

To provide a specific example, we focus on the case of purely relaxational dynamics (i.e., reversible force terms are absent entirely), with the (mesoscopic) Hamiltonian given by the Ginzburg–Landau–Wilson free energy that describes second-order phase transitions in thermal equilibrium for an N-component order parameter S_i, i = 1, …, N [6,7,8,9,10,11,12,13]:

  H[S_i] = ∫ d^dx ∑_{i=1}^N [ (r/2) S_i(x)² + (1/2) [∇S_i(x)]² + (u/4!) S_i(x)² ∑_{j=1}^N S_j(x)² ] ,   (42)

where the control parameter r ∝ T - T_c changes sign at the critical temperature T_c, and the positive constant u governs the strength of the nonlinearity. If we assume that the order parameter itself is not conserved under the dynamics, the associated response functional reads

  A[S̃_i, S_i] = ∫ d^dx ∫ dt ∑_i { S̃_i [ ∂S_i/∂t + Γ_i δH[S_i]/δS_i ] - k_B T Γ_i S̃_i² } .   (43)

This case is frequently referred to as model A critical dynamics [45]. For a diffusively relaxing conserved field, termed model B in the classification of [45], one has instead

  A[S̃_i, S_i] = ∫ d^dx ∫ dt ∑_i { S̃_i [ ∂S_i/∂t - D_i ∇² δH[S_i]/δS_i ] + k_B T D_i S̃_i ∇² S̃_i } .   (44)

Consider now external fields h_i that are thermodynamically conjugate to the mesoscopic variables S_i, i.e., H(h_i) = H(h_i = 0) - ∫ d^dx ∑_i h_i(x) S_i(x). For the simple relaxational models (43) and (44), we may thus immediately relate the dynamic susceptibility to two-point correlation functions that involve the auxiliary fields S̃_i [43], namely

  χ_ij(x - x', t - t') = δ⟨S_i(x,t)⟩/δh_j(x',t') |_{h_i = 0} = k_B T Γ_i ⟨ S_i(x,t) S̃_j(x',t') ⟩   (45)

for nonconserved fields, while for model B dynamics

  χ_ij(x - x', t - t') = -k_B T D_i ⟨ S_i(x,t) ∇'² S̃_j(x',t') ⟩ .   (46)

Finally, in thermal equilibrium the dynamic response and correlation functions are related through the fluctuation–dissipation theorem [43],

  χ_ij(x - x', t - t') = Θ(t - t') ∂/∂t' ⟨ S_i(x,t) S_j(x',t') ⟩ .   (47)
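For a single nonconserved mode with Γ = 1 and H = r S²/2, the content of Eqs. (45) and (47) can be verified directly by simulation. The sketch below (an added illustration with arbitrary parameters, not part of the original article) measures the equilibrium correlation function C(τ) = (k_B T/r) e^{-rτ} and the response propagator R(τ) = ⟨S S̃⟩ = e^{-rτ} (obtained as the relaxation of an imposed impulse), and checks that Θ(τ) ∂C/∂t' = k_B T R(τ), both sides being k_B T e^{-rτ} here:

```python
import numpy as np

# Model A dynamics for a single nonconserved mode (Gamma = 1, H = r*S^2/2):
# dS/dt = -r*S + h(t) + zeta,  <zeta(t) zeta(t')> = 2*kB_T*delta(t - t').
rng = np.random.default_rng(2)
r, kB_T, dt = 1.0, 0.8, 1e-3
n_steps, n_walkers = 1000, 40_000
tau = n_steps * dt                          # measurement time lag, tau = 1

def evolve(S):
    for _ in range(n_steps):
        S = S + dt * (-r * S) + rng.normal(0.0, np.sqrt(2*kB_T*dt), S.shape)
    return S

S0 = rng.normal(0.0, np.sqrt(kB_T / r), n_walkers)   # equilibrium ensemble

# correlation function C(tau) = <S(tau) S(0)> = (kB_T/r) * exp(-r*tau)
C = float(np.mean(evolve(S0.copy()) * S0))
assert abs(C - (kB_T / r) * np.exp(-r * tau)) < 0.02

# response propagator R(tau) = <S S~>: relaxation of an impulse in the force
kick = 1.0
R = float(np.mean(evolve(S0.copy() + kick))) / kick
assert abs(R - np.exp(-r * tau)) < 0.02

# fluctuation-dissipation theorem: d C / dt' = r*C equals kB_T * R
assert abs(kB_T * R - r * C) < 0.05
```

The same check fails for genuinely nonequilibrium dynamics (such as the driven systems below), where no fluctuation-dissipation relation of the form (47) holds.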
Driven Diffusive Systems and Interface Growth We close this section by listing a few intriguing examples for Langevin sytems that describe genuine out-of-equilibrium dynamics. First, consider a driven diffusive lattice gas (an overview is provided in [46]), namely a particle system with conserved total density with biased diffusion in a specified (‘k’) direction. The coarse-grained Langevin equation for the scalar density fluctuations thus becomes
spatially anisotropic [47,48], @S(x; t) @t Dg 2 D D r? C crk2 S(x; t)C rk S(x; t)2 C%(x; t); 2 (48) and similarly for the conserved noise with h%i D 0, ˛ ˝ 2 C c˜rk2 ı(x x 0 )ı(t t 0 ) : %(x; t)%(x 0 ; t 0 ) D 2D r? (49) Notice that the drive term / g breaks both the system’s spatial reflection symmetry as well as the Ising symmetry S ! S. In one dimension, Eq. (48) coincides with the noisy Burgers equation [49], and since in this case (only) the condition (39) is satisfied, effectively represents a system with equilibrium dynamics. The corresponding Janssen–De Dominicis response functional reads " Z Z @S d 2 e e d x dt S A S; S D C crk2 S D r? @t # Dg 2 2 e 2 C D r? C c˜rk S rk S : (50) 2 It describes a ‘massless’ theory, hence we expect the system to generically display scale-invariant features, without the need to tune to a special point in parameter space. The large-scale scaling properties can be analyzed by means of the dynamic renormalization group [47,48]. Another famous example for generic scale invariance emerging in a nonequilibrium system is curvature-driven interface growth, as captured by the Kardar–Parisi–Zhang equation [50] @S(x; t) Dg D Dr 2 S(x; t) C [rS(x; t)]2 C %(x; t) ; (51) @t 2 with again h%i D 0 and the noise correlations ˝ ˛ %(x; t)%(x 0 ; t 0 ) D 2Dı(x x 0 )ı(t t 0 ) :
(52)
(For more details and intriguing variants, see e. g. [51, 52,53].) The associated field theory action Z A e S; S D dd x " #
Z Dg @S 2 2 2 S Dr S [rS] De dt e S (53) @t 2 encodes surprisingly rich behavior including a kinetic roughening transition separating two distinct scaling regimes in dimensions d > 2 [51,52,53].
Future Directions

The rich phenomenology of many complex systems is only inadequately captured within widely used mean-field approximations, wherein both statistical fluctuations and correlations induced by the subunits' interactions or the system's kinetics are neglected. Modern computational techniques, empowered by recent vast improvements in data storage and processor clock frequencies, as well as by the development of clever algorithms, are clearly invaluable in the theoretical study of model systems displaying the hallmark features of complexity. Yet in order to gain a deeper understanding and to maintain control over the typically rather large parameter space, numerical investigations need to be supplemented by analytical approaches. The field-theoretic methods described in this article represent a powerful set of tools to systematically include fluctuations and correlations in the mathematical description of complex stochastic dynamical systems composed of many interacting degrees of freedom. They have already been very fruitful in studying the intriguing physics of highly correlated and strongly fluctuating many-particle systems. Aside from many important quantitative results, they have provided the basis for our fundamental understanding of the emergence of universal macroscopic features.

At the time of writing, the transfer of field-theoretic methods to problems in chemistry, biology, and other fields such as sociology has certainly been initiated, but is still limited to rather few and isolated case studies. This is understandable, since becoming acquainted with the intricate technicalities of the field theory formalism requires considerable effort. Also, whereas it is straightforward to write down the actions corresponding to stochastic processes defined via microscopic classical discrete master equations or mesoscopic Langevin equations, it is usually not that easy to properly extract the desired information about large-scale structures and long-time asymptotics.
Yet if successful, one tends to gain insights that are not accessible by any other means. I therefore anticipate that the now well-developed methods of quantum and statistical field theory, with their extensions to stochastic dynamics, will find ample successful applications in many different areas of complexity science. Naturally, further approximation schemes and other methods tailored to the questions at hand will have to be developed, and novel concepts be devised. I look forward to learning about and hopefully also participating in these exciting future developments.
Acknowledgments
The author would like to acknowledge financial support through the US National Science Foundation grant NSF
DMR-0308548. This article is dedicated to the victims of the terrible events at Virginia Tech on April 16, 2007.
Bibliography
1. Lindenberg K, Oshanin G, Tachiya M (eds) (2007) J Phys: Condens Matter 19(6): Special issue containing articles on Chemical kinetics beyond the textbook: fluctuations, many-particle effects and anomalous dynamics; see: http://www.iop.org/EJ/toc/0953-8984/19/6
2. Alber M, Frey E, Goldstein R (eds) (2007) J Stat Phys 128(1/2): Special issue on Statistical physics in biology; see: http://springerlink.com/content/j4q1ln243968/
3. Murray JD (2002) Mathematical biology, vols I, II, 3rd edn. Springer, New York
4. Mobilia M, Georgiev IT, Täuber UC (2007) Phase transitions and spatio-temporal fluctuations in stochastic lattice Lotka–Volterra models. J Stat Phys 128:447–483; several movies with Monte Carlo simulation animations can be accessed at http://www.phys.vt.edu/~tauber/PredatorPrey/movies/
5. Washenberger MJ, Mobilia M, Täuber UC (2007) Influence of local carrying capacity restrictions on stochastic predator-prey models. J Phys: Condens Matter 19:065139, 1–14
6. Ramond P (1981) Field theory – a modern primer. Benjamin/Cummings, Reading
7. Amit DJ (1984) Field theory, the renormalization group, and critical phenomena. World Scientific, Singapore
8. Negele JW, Orland H (1988) Quantum many-particle systems. Addison-Wesley, Redwood City
9. Parisi G (1988) Statistical field theory. Addison-Wesley, Redwood City
10. Itzykson C, Drouffe JM (1989) Statistical field theory. Cambridge University Press, Cambridge
11. Le Bellac M (1991) Quantum and statistical field theory. Oxford University Press, Oxford
12. Zinn-Justin J (1993) Quantum field theory and critical phenomena. Clarendon Press, Oxford
13. Cardy J (1996) Scaling and renormalization in statistical physics. Cambridge University Press, Cambridge
14. Janssen HK (1976) On a Lagrangean for classical field dynamics and renormalization group calculations of dynamical critical properties. Z Phys B 23:377–380
15. De Dominicis C (1976) Techniques de renormalisation de la théorie des champs et dynamique des phénomènes critiques. J Physique (France) Colloq 37:C247–C253
16. Janssen HK (1979) Field-theoretic methods applied to critical dynamics. In: Enz CP (ed) Dynamical critical phenomena and related topics. Lecture Notes in Physics, vol 104. Springer, Heidelberg, pp 26–47
17. Doi M (1976) Second quantization representation for classical many-particle systems. J Phys A: Math Gen 9:1465–1477
18. Doi M (1976) Stochastic theory of diffusion-controlled reactions. J Phys A: Math Gen 9:1479–1495
19. Grassberger P, Scheunert M (1980) Fock-space methods for identical classical objects. Fortschr Phys 28:547–578
20. Peliti L (1985) Path integral approach to birth-death processes on a lattice. J Phys (Paris) 46:1469–1482
21. Peliti L (1986) Renormalisation of fluctuation effects in the A + A → A reaction. J Phys A: Math Gen 19:L365–L367
22. Lee BP (1994) Renormalization group calculation for the reaction kA → ∅. J Phys A: Math Gen 27:2633–2652
23. Lee BP, Cardy J (1995) Renormalization group study of the A + B → ∅ diffusion-limited reaction. J Stat Phys 80:971–1007
24. Mattis DC, Glasser ML (1998) The uses of quantum field theory in diffusion-limited reactions. Rev Mod Phys 70:979–1002
25. Täuber UC, Howard MJ, Vollmayr-Lee BP (2005) Applications of field-theoretic renormalization group methods to reaction-diffusion problems. J Phys A: Math Gen 38:R79–R131
26. Täuber UC (2007) Field theory approaches to nonequilibrium dynamics. In: Henkel M, Pleimling M, Sanctuary R (eds) Ageing and the glass transition. Lecture Notes in Physics, vol 716. Springer, Berlin, pp 295–348
27. Täuber UC, Critical dynamics: a field theory approach to equilibrium and nonequilibrium scaling behavior. To be published by Cambridge University Press, Cambridge; for completed chapters, see: http://www.phys.vt.edu/~tauber/utaeuber.html
28. Schütz GM (2000) Exactly solvable models for many-body systems far from equilibrium. In: Domb C, Lebowitz JL (eds) Phase transitions and critical phenomena, vol 19. Academic Press, London
29. Stinchcombe R (2001) Stochastic nonequilibrium systems. Adv Phys 50:431–496
30. Van Wijland F (2001) Field theory for reaction-diffusion processes with hard-core particles. Phys Rev E 63:022101, 1–4
31. Chopard B, Droz M (1998) Cellular automaton modeling of physical systems. Cambridge University Press, Cambridge
32. Marro J, Dickman R (1999) Nonequilibrium phase transitions in lattice models. Cambridge University Press, Cambridge
33. Hinrichsen H (2000) Nonequilibrium critical phenomena and phase transitions into absorbing states. Adv Phys 49:815–958
34. Ódor G (2004) Phase transition universality classes of classical, nonequilibrium systems. Rev Mod Phys 76:663–724
35. Moshe M (1978) Recent developments in Reggeon field theory. Phys Rep 37:255–345
36. Obukhov SP (1980) The problem of directed percolation. Physica A 101:145–155
37. Cardy JL, Sugar RL (1980) Directed percolation and Reggeon field theory. J Phys A: Math Gen 13:L423–L427
38. Janssen HK (1981) On the nonequilibrium phase transition in reaction-diffusion systems with an absorbing stationary state. Z Phys B 42:151–154
39. Janssen HK, Täuber UC (2005) The field theory approach to percolation processes. Ann Phys (NY) 315:147–192
40. Grassberger P (1982) On phase transitions in Schlögl's second model. Z Phys B 47:365–374
41. Janssen HK (2001) Directed percolation with colors and flavors. J Stat Phys 103:801–839
42. Martin PC, Siggia ED, Rose HA (1973) Statistical dynamics of classical systems. Phys Rev A 8:423–437
43. Bausch R, Janssen HK, Wagner H (1976) Renormalized field theory of critical dynamics. Z Phys B 24:113–127
44. Chaikin PM, Lubensky TC (1995) Principles of condensed matter physics. Cambridge University Press, Cambridge
45. Hohenberg PC, Halperin BI (1977) Theory of dynamic critical phenomena. Rev Mod Phys 49:435–479
46. Schmittmann B, Zia RKP (1995) Statistical mechanics of driven diffusive systems. In: Domb C, Lebowitz JL (eds) Phase transitions and critical phenomena, vol 17. Academic Press, London
47. Janssen HK, Schmittmann B (1986) Field theory of long time behaviour in driven diffusive systems. Z Phys B 63:517–520
48. Leung KT, Cardy JL (1986) Field theory of critical behavior in a driven diffusive system. J Stat Phys 44:567–588
49. Forster D, Nelson DR, Stephen MJ (1977) Large-distance and long-time properties of a randomly stirred fluid. Phys Rev A 16:732–749
50. Kardar M, Parisi G, Zhang YC (1986) Dynamic scaling of growing interfaces. Phys Rev Lett 56:889–892
51. Barabási AL, Stanley HE (1995) Fractal concepts in surface growth. Cambridge University Press, Cambridge
52. Halpin-Healy T, Zhang YC (1995) Kinetic roughening phenomena, stochastic growth, directed polymers and all that. Phys Rep 254:215–414
53. Krug J (1997) Origins of scale invariance in growth processes. Adv Phys 46:139–282
Firing Squad Synchronization Problem in Cellular Automata
HIROSHI UMEO
University of Osaka Electro-Communication, Osaka, Japan

Article Outline
Glossary
Definition of the Subject
Introduction
Firing Squad Synchronization Problem
Variants of the Firing Squad Synchronization Problem
Firing Squad Synchronization Problem on Two-dimensional Arrays
Summary and Future Directions
Bibliography

Glossary
Cellular automaton A cellular automaton is a discrete computational model studied in mathematics, computer science, economics, biology, physics, chemistry, etc. It consists of a regular array of cells, each of which is a finite state automaton. The array can have any finite number of dimensions. Time (steps) is also discrete, and the state of a cell at time t (t ≥ 1) is a function of the states of a finite number of cells (called its neighborhood) at time t − 1. Each cell has the same rule set for updating its next state, based on the states in its neighborhood. At every step the rules are applied to the whole array synchronously, yielding a new configuration.
Time-space diagram A time-space diagram is frequently used to represent signal propagation in one-dimensional cellular space. Usually, time is drawn on the vertical axis and space on the horizontal axis. The trajectories of propagating signals are represented in this diagram by sloping lines; the slope of a line represents the propagation speed of the signal. Time-space diagrams that show the position of individual signals in time and in space are very useful for understanding cellular algorithms, signal propagation, and signal crossings in the cellular space.
Definition of the Subject
The firing squad synchronization problem (FSSP for short) is formalized in terms of the model of cellular automata. Figure 1 shows a finite one-dimensional cellular array consisting of n cells, denoted by Ci, where 1 ≤ i ≤ n.
Firing Squad Synchronization Problem in Cellular Automata, Figure 1 One-dimensional cellular automaton
All cells (except the end cells) are identical finite state automata. The array operates in lock-step mode such that the next state of each cell (except the end cells) is determined by both its own present state and the present states of its right and left neighbors. All cells (soldiers), except the left end cell, are initially in the quiescent state at time t = 0 and have the property that the next state of a quiescent cell with quiescent neighbors is the quiescent state. At time t = 0 the left end cell (general) is in the fire-when-ready state, which is an initiation signal to the array. The firing squad synchronization problem is stated as follows: Given an array of n identical cellular automata, including a general on the left end which is activated at time t = 0, one wants to give a description (state set and next-state transition function) of the automata so that, at some future time, all of the cells will simultaneously and, for the first time, enter a special firing state. The set of states and the next-state transition function must be independent of n. Without loss of generality, it is assumed that n ≥ 2. The tricky part of the problem is that the same kind of soldiers, having a fixed number of states, must be synchronized regardless of the length n of the array. The problem itself is interesting as a mathematical puzzle and a good example of a recursive divide-and-conquer strategy operating in parallel. It has been referred to as achieving macro-synchronization given micro-synchronization systems, and as realizing global synchronization using only local information exchange [10].
Introduction
Cellular automata are considered to be a good model of complex systems in which an infinite one-dimensional array of finite state machines (cells) updates itself in a synchronous manner according to a uniform local rule.
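As an illustrative sketch, the lock-step update model just described can be simulated with a rule table mapping (left neighbor, own state, right neighbor) triples to next states. The states, function names, and the tiny rule table below are hypothetical inventions for demonstration, not any published FSSP solution:

```python
# Minimal sketch of the lock-step cellular-automaton model described above.
# The states and the toy rule table are illustrative, not a real FSSP solution.

BOUNDARY = "X"  # boundary state seen by the two end cells

def step(config, rules):
    """One synchronous update: the next state of each cell is a function of
    (left neighbor, own state, right neighbor); missing triples keep the state."""
    padded = [BOUNDARY] + list(config) + [BOUNDARY]
    return [rules.get((padded[i - 1], padded[i], padded[i + 1]), padded[i])
            for i in range(1, len(padded) - 1)]

def run_until_fire(config, rules, fire="F", max_steps=10_000):
    """Iterate until every cell enters the firing state simultaneously."""
    history = [list(config)]
    for t in range(1, max_steps + 1):
        config = step(config, rules)
        history.append(config)
        if all(s == fire for s in config):
            return t, history
    raise RuntimeError("no simultaneous firing within max_steps")

# Toy rule table for a fixed n = 2 (a real FSSP solution must work for all n):
rules = {
    ("X", "G", "Q"): "A", ("G", "Q", "X"): "A",
    ("X", "A", "A"): "F", ("A", "A", "X"): "F",
}
steps, history = run_until_fire(["G", "Q"], rules)
print(steps, history[-1])  # fires at step 2 = 2n - 2 for n = 2
```

A genuine solution such as Mazoyer's or Waksman's would simply supply a much larger `rules` dictionary to the same harness; the surveyed computer examinations of the published rule sets amount to runs of exactly this kind for 2 ≤ n ≤ 10,000.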
This article comprehensively studies a synchronization problem that asks for a finite-state protocol for synchronizing a large scale of cellular automata. Synchronization of a general network is a computing primitive of parallel and distributed computation. Synchronization in cellular automata has been known as the firing squad synchronization problem since it was originally proposed by J. Myhill in Moore [29] to synchronize all parts of self-reproducing cellular automata. The problem has been studied extensively for more than 40 years [1–80]. The present article first examines the state transition rule sets for the famous firing squad synchronization algorithms that give a finite-state protocol for synchronizing large-scale cellular automata, focusing on the fundamental synchronization algorithms operating in optimum steps on one-dimensional cellular arrays. The algorithms discussed herein are Goto's first algorithm [12], the eight-state Balzer algorithm [1], the seven-state Gerken algorithm [9], the six-state Mazoyer algorithm [25], the 16-state Waksman algorithm [74], and a number of revised versions thereof. In addition, the article presents a survey of current optimum-time synchronization algorithms and compares their transition rule sets with respect to the number of internal states of each finite state automaton, the number of transition rules realizing the synchronization, and the number of state changes on the array. It also presents a survey and comparison of the quantitative and qualitative aspects of the optimum-time synchronization algorithms developed thus far for one-dimensional cellular arrays. It then provides several variants of the firing squad synchronization problem, including fault-tolerant synchronization protocols, one-bit communication protocols, non-optimum-time algorithms, and partial solutions. Finally, a survey of two-dimensional firing squad synchronization algorithms is presented. Several new results and viewpoints are also given.
Firing Squad Synchronization Problem
A Brief History of the Development of Firing Squad Synchronization Algorithms
The problem known as the firing squad synchronization problem was devised in 1957 by J.
Myhill, and first appeared in print in a paper by E. F. Moore [29]. The problem has been widely circulated and has attracted much attention. It first arose in connection with the need to simultaneously turn on all parts of a self-reproducing machine. The problem was first solved by J. McCarthy and M. Minsky [28], who presented a non-optimum-time synchronization scheme that operates in 3n + O(1) steps for synchronizing n cells. In 1962, the first optimum-time, i.e. (2n − 2)-step, synchronization algorithm was presented by Goto [12], with each cell having several thousand states. Waksman [74]
presented a 16-state optimum-time synchronization algorithm. Afterward, Balzer [1] and Gerken [9] developed an eight-state algorithm and a seven-state algorithm, respectively, thus decreasing the number of states required for synchronization. In 1987, Mazoyer [25] developed a six-state synchronization algorithm which, at present, is the algorithm having the fewest states.
Firing Squad Synchronization Algorithm
Section "Firing Squad Synchronization Algorithm" briefly sketches the design scheme for the firing squad synchronization algorithm according to Waksman [74], in which the first transition rule set was presented. The following is quoted from Waksman [74]: The code book of the state transitions of machines is so arranged to cause the array to progressively divide itself into 2^k equal parts, where k is an integer and an increasing function of time. The end machines in each partition assume a special state so that when the last partition occurs, all the machines have for both neighbors machines at this state. This is made the only condition for any machine to assume terminal state. Figure 2 is a time-space diagram for Waksman's optimum-step firing squad synchronization algorithm. The general at time t = 0 emits an infinite number of signals which propagate at speed 1/(2^(k+1) − 1), where k is a positive integer. These signals meet a reflected signal at the half point, the quarter points, etc., marked in Fig. 2. It is noted that the cells at these marked points are synchronized. By increasing the number of synchronized cells exponentially, eventually all of the cells are synchronized.
Complexity Measures and Properties of Firing Squad Synchronization Algorithms
Time Complexity
Any solution to the firing squad synchronization problem can easily be shown to require (2n − 2) steps for synchronizing n cells, since signals on the array can propagate no faster than one cell per step, and the time from the general's instruction until the synchronization must be at least 2n − 2.
See Balzer [1], Goto [12] and Waksman [74] for a proof. The next two theorems give the optimum time complexity for synchronizing n cells on one-dimensional arrays.
Theorem 1 ([1,12,74]) Synchronization of n cells in less than (2n − 2) steps is impossible.
Theorem 2 ([1,12,74]) Synchronization of n cells in exactly (2n − 2) steps is possible.
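The geometry behind the time-space diagram of Fig. 2 can be checked with a little arithmetic. In this sketch (the function name and the continuous-trajectory idealization are my own; the actual algorithm realizes these trajectories with finite-state signals), the slow signal of speed 1/(2^(k+1) − 1) meets the unit-speed signal reflected off the right end at distance (n − 1)/2^k from the left end, i.e. at the half point, quarter point, and so on:

```python
from fractions import Fraction

def meeting_points(n, k_max=4):
    """Meeting points of the slow signals emitted by the general at t = 0 with
    the unit-speed signal reflected off the right end (cells at 0 .. n-1)."""
    points = []
    for k in range(1, k_max + 1):
        speed = Fraction(1, 2 ** (k + 1) - 1)     # 1/3, 1/7, 1/15, ...
        # slow signal: x = speed * t; reflected signal: x = 2(n - 1) - t
        t = Fraction(2 * (n - 1)) / (1 + speed)   # solve speed*t = 2(n-1) - t
        points.append(speed * t)                  # = (n - 1) / 2^k
    return points

# for n = 9 the positions are (n-1)/2, (n-1)/4, ...: 4, 2, 1, 1/2
points = meeting_points(9)
```

This reproduces the exponential doubling of synchronized cells: each new meeting point halves the remaining untreated interval.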
Firing Squad Synchronization Problem in Cellular Automata, Figure 2 Time-space diagram for Waksman's optimum-time firing squad synchronization algorithm
Number of States
Three distinct states are required in order to define any cellular automaton that can solve the firing squad synchronization problem: the quiescent state, the general state, and the firing state. The boundary state for C0 and Cn+1 is not generally counted as an internal state. Balzer [1] implemented a search strategy in order to show that no four-state optimum-time solution exists. Sanders [40] studied a similar problem on a parallel computer, showed that Balzer's backtrack heuristic was not correct, rendering the proof incomplete, and gave a proof based on a computer simulation for the non-existence of a four-state solution. Balzer [1] also showed that there exists no five-state optimum-time solution satisfying special conditions. It is noted that Balzer's special conditions do not hold for Mazoyer's six-state solution, which has the fewest states known at present. The question that remains is: "What is the minimum number of states for an optimum-time solution of the problem?" At present, that number is known to be five or six. Section "Partial Solutions" gives some four- and five-state partial solutions that can synchronize infinitely many array lengths, but not all.
Theorem 3 ([1,40]) There is no four-state CA that can synchronize n cells.
Berthiaume, Bittner, Perković, Settle and Simon [2] considered the state lower bound on ring-connected cellular automata. It is shown that there exists no three-state solution and no four-state symmetric solution for rings.
Theorem 4 ([2]) There is no four-state symmetric optimum-time solution for ring-connected cellular automata.
Number of Transition Rules
Any k-state transition table for the synchronization has at most (k − 1)k² entries, arranged in (k − 1) matrices of size k × k. The number of transition rules reflects the complexity of synchronization algorithms.
Transition Rule Sets for Optimum-Time Firing Squad Synchronization Algorithms
Section "Transition Rule Sets for Optimum-Time Firing Squad Synchronization Algorithms" implements most of the transition rule sets for the synchronization algorithms mentioned above on a computer and checks whether these rule sets yield successful firing configurations at exactly t = 2n − 2 steps for any n such that 2 ≤ n ≤ 10,000.
Waksman's 16-State Algorithm
Waksman [74] proposed a 16-state firing squad synchronization algorithm which, together with an unpublished algorithm by Goto [12], is regarded as the first optimum-time synchronization algorithm. Waksman presented the first set of transition rules, described in terms of a state transition table defined on the following set D of 16 states: D = {Q, T, P0, P1, B0, B1, R0, R1, A000, A001, A010, A011, A100, A101, A110, A111}, where Q is the quiescent state, T is the firing state, P0 and P1 are prefiring states, B0 and B1 are states for signals propagating at various speeds, R0 and R1 are trigger states which cause the B0 and B1 states to move in the left or right direction, and Aijk, i, j, k ∈ {0, 1}, are control states which generate the state R0 or R1 either with a unit delay or without any delay. The state P0 also acts as the initial general.
USN Transition Rule Set
Cellular automata researchers have reported that some errors are included in Waksman's transition table. A computer simulation made in Umeo, Sogabe and Nomura [64] reveals this to be true. They corrected some errors included in the original Waksman transition rule set. The correction procedures can
Firing Squad Synchronization Problem in Cellular Automata, Figure 3 USN transition table consisting of 202 rules that realize Waksman’s synchronization algorithm. The symbol represents the boundary state
be found in Umeo, Sogabe and Nomura [64]. This subsection gives a complete list of the transition rules which yield successful synchronizations for any n. Figure 3 is the complete list, which consists of 202 transition rules and is referred to as the USN transition rule set. The correction realizes a ninety-three percent reduction in the number of transition rules compared to Waksman's original list. The computer simulation based on the table of Fig. 3 gives the following observation. Figure 4 shows
snapshots of Waksman's 16-state optimum-time synchronization algorithm on 21 cells.
Observation 3.1 ([54,64]) The set of rules given in Fig. 3 is the smallest transition rule set for Waksman's optimum-time firing squad synchronization algorithm.
Balzer's Eight-State Algorithm
Balzer [1] constructed an eight-state, 182-rule synchronization algorithm and
Gerken’s Seven-State Algorithm Gerken [9] constructed a seven-state, 118-rule synchronization algorithm. In the computer examination, no errors were found, however, 13 rules were found to be redundant. Figure 6 gives a list of the transition rules for Gerken’s algorithm and snapshots for synchronization operations on 28 cells. The 13 redundant rules are marked by shaded squares in the table. The symbols “>”, “/”, “. . . ” and “#” represent the general, quiescent, firing and boundary states, respectively. The symbol “. . . ” is replaced by “F” in the configuration (right) at time t D 54. Mazoyer’s Six-State Algorithm Mazoyer [25] proposed a six-state, 120-rule synchronization algorithm, the structure of which differs greatly from the previous three algorithms discussed above. The computer examination revealed no errors and only one redundant rule. Figure 7 presents a list of transition rules for Mazoyer’s algorithm and snapshots of configurations on 28 cells. In the transition table, the letters “G”, “L”, “F” and “X” represent the general, quiescent, firing and boundary states, respectively.
Firing Squad Synchronization Problem in Cellular Automata, Figure 4 Snapshots of the Waksman’s 16-state optimum-time synchronization algorithm on 21 cells
the structure of which is completely identical to that of Waksman [74]. A computer examination made by Umeo, Hisaoka and Sogabe [54] revealed no errors, however, 17 rules were found to be redundant. Figure 5 gives a list of transition rules for Balzer’s algorithm and snapshots for synchronization operations on 28 cells. Those redundant rules are indicated by shaded squares. In the transition table, the symbols “M”, “L”, “F” and “X” represent the general, quiescent, firing and boundary states, respectively. Noguchi [34] also constructed an eight-state, 119-rule optimum-time synchronization algorithm.
Goto’s Algorithm The first synchronization algorithm presented by Goto [12] was not published as a journal paper. According to Prof. Goto, the original note Goto [12] is now unavailable, and the only existing material that treats the algorithm is Goto [13]. The Goto’s study presents one figure (Fig. 3.8 in Goto [13]) demonstrating how the algorithm works on 13 cells with a very short description in Japanese. Umeo [50] reconstructed the Goto’s algorithm based on this figure. Mazoyer [27] also reconstructed this algorithm again. The algorithm that Umeo [50] reconstructed is a non-recursive algorithm consisting of a marking phase and a 3n-step synchronization phase. In the first phase, by printing a special marker in the cellular space, the entire cellular space is divided into many smaller subspaces, the lengths of which increase exponentially with a common ratio of two, that is 2j , for any integer j such that 1 j blog2 nc 1. The marking is made from both the left and right ends. In the second phase, each subspace is synchronized using a well-known conventional 3n-step simple synchronization algorithm. A time-space diagram of the reconstructed algorithm is shown in Fig. 8. Gerken’s 155-State Algorithm Gerken [9] constructed two kinds of optimum-time synchronization algorithms. One seven-state algorithm has been discussed in the previous subsection, and the other is a 155-state algorithm having (n log n) state-change complexity. The transition table given in Gerken [9] is described in terms of two-layer construction with 32 states and 347 rules. An expansion of
Firing Squad Synchronization Problem in Cellular Automata, Figure 5 Transition table for Balzer's eight-state protocol (left) and snapshots of synchronization operations on 28 cells (right)
the transition table into a single-layer format yields a 155-state table consisting of 2371 rules. Figure 9 shows a configuration on 28 cells.
State-Change Complexity
Vollmar [73] introduced a state-change complexity in order to measure the efficiency of cellular algorithms and
showed that Ω(n log n) state changes are required for the synchronization of n cells in (2n − 2) steps.
Theorem 5 ([73]) Ω(n log n) state changes are necessary for synchronizing n cells in (2n − 2) steps.
Theorem 6 ([9,54]) Each of the optimum-time synchronization algorithms developed by Balzer [1], Gerken [9], Mazoyer [25] and Waksman [74] has O(n²) state-change complexity.
Firing Squad Synchronization Problem in Cellular Automata, Figure 6 Transition table for Gerken's seven-state protocol (left) and snapshots of synchronization operations on 28 cells (right)
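Vollmar's measure simply counts, over an entire run, how many cells change state from one step to the next. A small sketch (the function name is my own; `history` is a list of configurations such as those produced by any CA simulation):

```python
def state_changes(history):
    """Vollmar's state-change complexity for one run: the number of
    (cell, step) pairs at which a cell's state differs from the previous step."""
    return sum(a != b
               for prev, cur in zip(history, history[1:])
               for a, b in zip(prev, cur))

# e.g. a toy run of a two-cell array over two steps:
demo = [["Q", "Q"], ["A", "Q"], ["F", "F"]]
changes = state_changes(demo)  # Q->A, then A->F and Q->F: 3 changes
```

Applying such a counter to simulated runs of the algorithms above is how the O(n²) versus Θ(n log n) behaviors in Theorems 5–8 are observed empirically.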
Goto’s time-optimum synchronization algorithm. Umeo, Hisaoka and Sogabe [54] has shown that:
Theorem 7 ([9]) Gerken’s 155-state synchronization algorithm has a (n log n) state-change complexity.
Theorem 8 ([54]) Goto’s time-optimum synchronization algorithm as reconstructed by Umeo [50] has (n log n) state-change complexity.
It has been shown that any 3n-step thread-like synchronization algorithm has a (n log n) state-change complexity and such a 3n-step thread-like synchronization algorithm can be used for subspace synchronization in the
Figure 10 shows a comparison of the state-change complexities in several optimum-time synchronization algorithms.
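The marking phase of the reconstructed Goto algorithm discussed above divides the array into subspaces of lengths 2^j, 1 ≤ j ≤ ⌊log₂ n⌋ − 1. As a purely arithmetic sketch (the function name is my own; the real algorithm performs this division with finite-state signals, not arithmetic):

```python
def goto_subspace_lengths(n):
    """Subspace lengths 2^j for 1 <= j <= floor(log2 n) - 1, as used in the
    marking phase of Umeo's reconstruction of Goto's algorithm."""
    j_max = n.bit_length() - 2        # bit_length() - 1 == floor(log2 n)
    return [2 ** j for j in range(1, j_max + 1)]

lengths = goto_subspace_lengths(13)   # [2, 4] for the 13-cell example of Fig. 3.8
```

Each such subspace is then synchronized independently by the conventional 3n-step algorithm, which is what keeps the overall state-change complexity at Θ(n log n).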
Firing Squad Synchronization Problem in Cellular Automata, Figure 7 Transition table for Mazoyer's six-state protocol (left) and snapshots of configurations on 28 cells (right)
A Comparison of Quantitative Aspects of Optimum-Time Synchronization Algorithms Section “A Comparison of Quantitative Aspects of Optimum-Time Synchronization Algorithms” presents a table based on a quantitative comparison of optimum-time synchronization algorithms and their transition tables discussed above with respect to the number of internal states of each finite state automaton, the number of transition rules realizing the synchronization, and the number of state-changes on the array.
Firing Squad Synchronization Problem in Cellular Automata, Table 1 Quantitative comparison of transition rule sets for optimum-time firing squad synchronization algorithms. Figures in parentheses give the rule counts before the correction and reduction made in Umeo, Hisaoka and Sogabe [54]; for the last row they give the original two-layer construction before its expansion into a single-layer format.

Algorithm    | # of states    | # of transition rules | State-change complexity
Goto [12]    | many thousands | –                     | Θ(n log n)
Waksman [74] | 16             | 202 (3216)            | O(n²)
Balzer [1]   | 8              | 165 (182)             | O(n²)
Noguchi [34] | 8              | 119                   | O(n²)
Gerken [9]   | 7              | 105 (118)             | O(n²)
Mazoyer [25] | 6              | 119 (120)             | O(n²)
Gerken [9]   | 155 (32)       | 2371 (347)            | Θ(n log n)

Firing Squad Synchronization Problem in Cellular Automata, Table 2 A qualitative comparison of optimum-time firing squad synchronization algorithms

Algorithm    | One-/two-sided | Recursive/non-recursive | # of signals
Goto [12]    | –              | non-recursive           | finite
Waksman [74] | two-sided      | recursive               | infinite
Balzer [1]   | two-sided      | recursive               | infinite
Noguchi [34] | two-sided      | recursive               | infinite
Gerken [9]   | two-sided      | recursive               | infinite
Mazoyer [25] | one-sided      | recursive               | infinite
Gerken [9]   | two-sided      | recursive               | finite

One-Sided vs. Two-Sided Recursive Algorithms
Firing squad synchronization algorithms have been designed on the basis of a parallel divide-and-conquer strategy that calls itself recursively in parallel. The recursive calls are implemented by generating many Generals that work for synchronizing divided small areas of the cellular space. Initially, a General G0 located at the left end works for synchronizing the whole cellular space consisting of n cells. In Fig. 11 (left), G1 synchronizes the subspace between G1 and the right end of the array. The ith General Gi, i = 2, 3, …, works for synchronizing the cellular space between Gi−1 and Gi. Thus, all of the Generals generated by G0 are located at the left end of the divided cellular spaces to be synchronized. On the other hand, in Fig. 11 (right), the General G0 generates Generals Gi, i = 1, 2, 3, …. Each Gi synchronizes the divided space between Gi and Gi+1, and each Gi, i = 2, 3, …, performs the same operations as G0. Thus, in Fig. 11 (right) one finds Generals located at either end of the subspace for which they are responsible. If all of the recursive calls for the synchronization are issued by Generals located at one end (both ends) of the partitioned cellular spaces for which the General works, the synchronization algorithm is said to have the one-sided (two-sided) recursive property, respectively. A synchronization algorithm with the one-sided (two-sided) recursive property is referred to as a one-sided (two-sided) recursive synchronization algorithm. Figure 11 illustrates time-space diagrams for one-sided (Fig. 11 (left)) and two-sided (Fig. 11 (right)) recursive synchronization algorithms, both operating in optimum 2n − 2 steps. It is noted that the optimum-time synchronization algorithms developed by Balzer [1], Gerken [9], Noguchi [34] and Waksman [74] are two-sided, and the algorithm proposed by Mazoyer [25] is the only synchronization algorithm with the one-sided recursive property.
Observation 3.2 ([54]) The optimum-time synchronization algorithms developed by Balzer [1], Gerken [9], Noguchi [34] and Waksman [74] are two-sided. The algorithm proposed by Mazoyer [25] is one-sided. A more general design scheme for one-sided recursive optimum-time synchronization algorithms can be found in Mazoyer [24].
Recursive vs. Non-Recursive Algorithms
As shown in the previous section, the optimum-time synchronization algorithms developed by Balzer [1], Gerken [9], Mazoyer [25], Noguchi [34] and Waksman [74] are recursive. On the other hand, the overall structure of the reconstructed Goto algorithm is non-recursive; however, the divided subspaces are synchronized using a recursive 3n + O(1)-step synchronization algorithm.
Number of Signals
Waksman [74] devised an efficient way to cause a general cell to generate infinitely many signals at propagating speeds of 1/1, 1/3, 1/7, …, 1/(2^k − 1), where k is any natural number. These signals play an important role in dividing the array into two, four, eight, …, equal parts synchronously. The same set of signals is used in Balzer [1]. Gerken [9]
Firing Squad Synchronization Problem in Cellular Automata
Firing Squad Synchronization Problem in Cellular Automata, Figure 8 Time-space diagram for Goto’s algorithm as reconstructed by Umeo [50]
had a similar idea in the construction of his seven-state algorithm. Thus infinite set of signals with different propagation speed is used in the first three algorithms. On the other hand, finite sets of signals with propagating speed f1/5; 1/2; 1/1g and f1/3; 1/2; 3/5; 1/1g are made use of in Gerken’s 155-state algorithm and the reconstructed Goto’s algorithm, respectively. A Comparison of Qualitative Aspects of Optimum-Time Synchronization Algorithms Section “A Comparison of Qualitative Aspects of Optimum-Time Synchronization Algorithms” presents a table
Firing Squad Synchronization Problem in Cellular Automata, Figure 9 Snapshots of the Gerken’s 155-state algorithm on 28 cells
based on a qualitative comparison of optimum-time synchronization algorithms with respect to one-/two-sided recursive properties and the number of signals used for simultaneous space divisions.

Variants of the Firing Squad Synchronization Problem
Generalized Firing Squad Synchronization Problem
Section “Generalized Firing Squad Synchronization Problem” considers a generalized firing squad synchronization problem which allows the general to be located anywhere on the array. It has been shown to be impossible to synchronize any array of length n in fewer than n − 2 + max(k, n − k + 1) steps, where the general is located on Ck. Moore and Langdon [30], Szwerinski [45] and Varshavsky, Marakhovsky and Peschansky [69] developed generalized optimum-time synchronization algorithms with 17, 10 and 10 internal states, respectively, that can synchronize any array of length n in exactly n − 2 + max(k, n − k + 1) steps. Recently, Settle and Simon [43] and Umeo, Hisaoka, Michisaka, Nishioka and Maeda [56] have proposed 9-state generalized synchronization algorithms operating in optimum step. Figure 12 shows snapshots of synchronization configurations based on the rule set of Varshavsky, Marakhovsky and Peschansky [69].

Firing Squad Synchronization Problem in Cellular Automata, Figure 10
A comparison of state-change complexities in optimum-time synchronization algorithms

Firing Squad Synchronization Problem in Cellular Automata, Figure 11
One-sided recursive synchronization scheme (left) and two-sided recursive synchronization scheme (right)
Theorem 9 ([30,43,45,56,69]) There exists a cellular automaton that can synchronize any one-dimensional array of length n in optimum n − 2 + max(k, n − k + 1) steps, where the general is located on the kth cell from the left end.

Non-Optimum-Time 3n-Step Synchronization Algorithms
The non-optimum-time 3n-step algorithm is a simple and straightforward one that exploits a parallel divide-and-conquer strategy based on an efficient use of 1/1- and 1/3-speed signals. Minsky and MacCarthy [28] gave an idea for designing a 3n-step synchronization algorithm, and Fischer [8] implemented it, yielding a 15-state realization. Yunès [76] developed two seven-state synchronization algorithms, thus decreasing the number of internal states of each cellular automaton. This section presents a new symmetric six-state 3n-step firing squad synchronization algorithm developed in Umeo, Maeda and Hongyo [61]. The number six is the smallest known at present in the class of 3n-step synchronization algorithms. Figure 13 shows the 6-state transition table and snapshots for synchronization on 14 cells. In the transition table, the symbols “P”, “Q”, “F” and “*” represent the general, quiescent, firing and boundary states, respectively. Yunès [79] also developed a symmetric 6-state 3n-step solution.

Theorem 10 ([61,79]) There exists a symmetric 6-state cellular automaton that can synchronize any n cells in 3n + O(log n) steps.

A non-trivial, new symmetric six-state 3n-step generalized firing squad synchronization algorithm is also presented in Umeo, Maeda and Hongyo [61]. Figure 14 gives a list of transition rules for the 6-state generalized synchronization algorithm and snapshots of configurations on 15 cells. The symbol “M” is the general state.

Theorem 11 ([61]) There exists a symmetric 6-state cellular automaton that can solve the generalized firing squad synchronization problem in max(k, n − k + 1) + 2n + O(log n) steps.
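The middle-finding step of this divide-and-conquer strategy can be sketched as a toy simulation (not a transition table; 0-indexed cells and the function name are ours): the 1/1-speed signal reflects off the right end and meets the 1/3-speed signal near the center, where a new general is created and the two halves are treated recursively.

```python
def meet_middle(n):
    """Simulate the two signals of the 3n-step strategy on cells 0..n-1.

    A speed-1 signal leaves cell 0 at t = 0 and bounces off cell n-1;
    a speed-1/3 signal (one cell every three steps) leaves cell 0 too.
    Returns (time, cell) at which the reflected fast signal meets the
    slow one -- the center, where the array is split in two.
    """
    fast, slow, direction, t = 0, 0, 1, 0
    while True:
        t += 1
        fast += direction      # fast signal: one cell per step
        if fast == n - 1:
            direction = -1     # reflect at the right end
        if t % 3 == 0:
            slow += 1          # slow signal: one cell per 3 steps
        if direction == -1 and fast <= slow:
            return t, slow
```

For odd n the signals meet at time 3(n − 1)/2 on the exact middle cell, e.g. meet_middle(7) returns (9, 3); recursing on both halves gives the overall ≈ 3n firing time.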
Firing Squad Synchronization Problem in Cellular Automata, Figure 12
Snapshots of the three Russians' 10-state generalized optimum-time synchronization algorithm on 22 cells
In addition, state-change complexity has been studied for 3n-step firing squad synchronization algorithms. It has been shown that the six-state algorithms presented above have O(n^2) state-change complexity, whereas the thread-like 3n-step algorithms developed so far have O(n log n) state-change complexity. The following table presents a quantitative comparison of the 3n-step synchronization algorithms developed so far.
Firing Squad Synchronization Problem in Cellular Automata, Figure 13 Transition table for symmetric six-state protocol (left) and snapshots for synchronization algorithm on 14 cells
Delayed Firing Squad Synchronization Algorithm
This section introduces a freezing-thawing technique that yields a delayed synchronization algorithm for one-dimensional arrays. The technique is very useful in the design of time-efficient synchronization algorithms for one- and two-dimensional arrays in Umeo [52], Yunès [77] and Umeo and Uchino [65]. A similar technique was used by Romani [37] in tree synchronization. The technique is stated in the following theorem.

Firing Squad Synchronization Problem in Cellular Automata, Figure 14
Transition table for generalized symmetric six-state protocol (left) and snapshots for synchronization algorithm on 15 cells with a General on C5

Theorem 12 ([52]) Let t0, t1, t2 and Δt be any integers such that t0 ≥ 0, t0 ≤ t1 ≤ t0 + n − 1, t1 ≤ t2 and Δt = t2 − t1. Assume that a usual optimum-time synchronization operation is started at time t = t0 by generating a special signal at the left end of a one-dimensional array, and that the right end cell of the array receives further special signals from outside at times t = t1 and t = t2, respectively. Then, there exists a one-dimensional cellular automaton that can synchronize the array of length n at time t = t0 + 2n − 2 + Δt.

The array operates as follows:
1. Start an optimum-time firing squad synchronization algorithm at time t = t0 at the left end of the array. A 1/1-speed signal is propagated towards the right to wake up cells in the quiescent state; this signal is referred to as the wake-up signal. A freezing signal is given from outside at time t = t1 at the right end of the array. The signal is propagated in the left direction at maximum speed, that is, one cell per step, and freezes the configuration progressively. Any cell that receives the freezing signal from its right neighbor stops its state-change and transmits the freezing signal to its left neighbor. A frozen cell keeps its state until a thawing signal arrives.
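Why the delay is exactly Δt for every cell: the freezing signal and the later thawing signal both move leftward at speed 1/1, so each cell is frozen for the same duration. A minimal bookkeeping sketch (0-indexed cells; the function name is ours):

```python
def frozen_intervals(n, t1, t2):
    """Per-cell (freeze arrival, thaw arrival, frozen duration) on cells
    0..n-1, when the freezing signal leaves the right end (cell n-1) at
    time t1 and the thawing signal follows at time t2 >= t1, both moving
    left at one cell per step."""
    result = []
    for i in range(n):
        freeze = t1 + (n - 1 - i)  # leftward 1/1-speed propagation
        thaw = t2 + (n - 1 - i)
        result.append((freeze, thaw, thaw - freeze))
    return result
```

Every cell is frozen for exactly t2 − t1 steps, which is why the synchronized firing is shifted by Δt as a whole.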
2. A special signal supplied from outside at time t = t2 is used as a thawing signal that thaws the frozen configuration. The thawing signal forces a frozen cell to resume its state-change procedures immediately. See Fig. 15 (left). The signal is also transmitted toward the left end at speed 1/1.
The reader can see how those three signals work. The entire configuration can be frozen for Δt steps, and the synchronization on the array is delayed by Δt steps. It is easily seen that the freezing signal can be replaced by the reflected signal of the wake-up signal, which is generated at the right end cell at time t = t0 + n − 1. See Fig. 15. The scheme is referred to as the freezing-thawing technique.

Firing Squad Synchronization Problem in Cellular Automata, Table 3
A comparison of 3n-step firing squad synchronization algorithms

Algorithm | # states | # rules | Time complexity | State-change complexity | General's position | Type | Notes | Ref.
Minsky and MacCarthy | 13 | – | 3n + n log n + c | O(n log n) | left | thread | 0 ≤ n < ∞ | [28]
Fischer | 15 | – | 3n − 4 | O(n log n) | left | thread | – | [8]
Yunès | 7 | 105 | 3n ± 2n log n + c | O(n log n) | left | thread | 0 ≤ n < ∞ | [76]
Yunès | 7 | 107 | 3n ± 2n log n + c | O(n log n) | left | thread | 0 ≤ n < ∞ | [76]
Settle and Simon | 6 | 134 | 3n + 1 | O(n^2) | right | plane | – | [43]
Settle and Simon | 7 | 127 | 2n − 2 + k | O(n^2) | arbitrary | plane | – | [43]
Umeo et al. | 6 | 78 | 3n + O(log n) | O(n^2) | left | plane | – | [61]
Umeo et al. | 6 | 115 | max(k, n − k + 1) + 2n + O(log n) | O(n^2) | arbitrary | plane | – | [61]
Umeo and Yanagihara | 5 | 67 | 3n − 3 | O(n^2) | left/right | plane | n = 2^k, k = 1, 2, … | [67]
Yunès | 6 | 105 | 3n + ⌈log n⌉ − 3 | O(n log n) | left | thread | – | [79]

Fault-Tolerant Firing Squad Synchronization Problem
Consider a one-dimensional array of cells, some of which are defective. At time t = 0, the left end cell C1 is in the fire-when-ready state, which is the initialization signal for the array. The fault-tolerant firing squad synchronization problem for cellular automata with defective cells is to determine a description of cells that ensures all intact cells enter the fire state at exactly the same time and for the first time. The fault-tolerant firing squad synchronization problem has been studied in Kutrib and Vollmar [22,23], Umeo [52] and Yunès [77].
Cellular Automata with Defective Cells
Intact and Defective Cells: Each cell has its own self-diagnosis circuit that diagnoses itself before operation. Consecutive defective (intact) cells are referred to as a defective (intact) segment. Any cell, defective or intact, can detect whether its neighbor cells are defective. Cellular arrays are assumed to have an intact segment at both their left and right ends, and no new defects occur in any cell during operation.
Signal Propagation in a Defective Segment: It is assumed that any cell in a defective segment can only transmit a signal to its right or left neighbor, depending on the direction in which the signal entered the segment. The speed of a signal in any defective segment is fixed at 1/1, that is, one cell per step. In defective segments, both the information carried by a signal and the direction in which it propagates are preserved without modification. Thus, any defective segment acts as two one-way pipelines, each able to transfer one state at 1/1 speed in its direction. Note that, from the standpoint of state transitions of a usual CA, each cell in a defective segment can still change its internal state in a specific manner. The array consists of p defective segments and (p + 1) intact segments, denoted by Dj and Ii, respectively, where p is any positive integer. Let ni and mj be the numbers of cells in the ith intact and jth defective segments, where 1 ≤ i ≤ p + 1 and 1 ≤ j ≤ p, and let n be the total number of cells of the array: n = (n1 + m1) + (n2 + m2) + … + (np + mp) + np+1.
Fault-Tolerant Firing Squad Synchronization Algorithms
Umeo [52] studied synchronization algorithms for arrays in which there are locally more intact cells than defective ones, i.e., ni ≥ mi for any i such that 1 ≤ i ≤ p. First, consider the case p = 1, where the array has one defective segment and n1 ≥ m1. Figure 16 illustrates a simple synchronization scheme. The fault-tolerant synchronization algorithm for one defective segment is stated as follows:

Firing Squad Synchronization Problem in Cellular Automata, Figure 15
Time-space diagram for delayed firing squad synchronization scheme based on the freezing-thawing technique (left) and a delayed (for Δt = 5 steps) configuration in Balzer's optimum-time firing squad synchronization algorithm on n = 11 cells (right)

Theorem 13 ([52]) Let M be any cellular array of length n with one defective and two intact segments such that n1 ≥ m1, where n1 and m1 denote the number of cells on the first intact and the defective segment, respectively. Then, M is synchronizable in optimum 2n − 2 steps.

The synchronization scheme above can be generalized to arrays with two or more defective segments. Figure 17 shows the synchronization scheme for a cellular array with three defective segments. Details of the algorithm can be found in Umeo [52].

Theorem 14 ([52]) Let p be any positive integer and M
be any cellular array of length n with p defective segments, where ni ≥ mi and ni + mi ≥ p − i for any i such that 1 ≤ i ≤ p. Then, M is synchronizable in 2n − 2 + p steps.

Partial Solutions
The original firing squad synchronization problem is defined so as to synchronize all cells of a one-dimensional array. This section considers partial FSSP solutions that can synchronize an infinite number of array lengths, but not all. The first partial solution was given by Umeo and Yanagihara [67]. They proposed a five-state solution that can synchronize any one-dimensional cellular array of length n = 2^k in 3n − 3 steps, for any positive integer k. Figure 18 shows the five-state transition table consisting of 67 rules and its snapshots for n = 8 and 16. In the transition table, the symbols “R”, “Q”, “F” and “*” represent the general, quiescent, firing and boundary states, respectively.
Firing Squad Synchronization Problem in Cellular Automata, Figure 16 Time-space diagram for optimum-time firing squad synchronization algorithm with one defective segment
Firing Squad Synchronization Problem in Cellular Automata, Figure 17 Time-space diagram for optimum-time firing squad synchronization algorithm with three defective segments
Theorem 15 ([67]) There exists a 5-state cellular automaton that can synchronize any array of length n = 2^k in 3n − 3 steps, where k is any positive integer.
Surprisingly, Yunès [80] and Umeo, Kamikawa and Yunès [58] proposed four-state synchronization protocols based on an algebraic property of Wolfram's two-state cellular automata.

Theorem 16 ([58,80]) There exists a 4-state cellular automaton that can synchronize any array of length n = 2^k in non-optimum 2n − 1 steps, where k is any positive integer.

Figure 19 shows the four-state transition table given by Yunès [80], which is based on Wolfram's Rule 60. It consists of 32 rules. Snapshots for n = 16 cells are also illustrated in Fig. 19. In the transition table, the symbols “G”, “Q”, “F” and “*” represent the general, quiescent, firing and boundary states, respectively. Figure 20 shows the four-state transition table given by Umeo, Kamikawa and Yunès [58], which is based on Wolfram's Rule 150. It has 32 transition rules, with the same state symbols; snapshots for n = 16 are given in the figure. Umeo, Kamikawa and Yunès [58] also proposed a different, but similar-looking, 4-state protocol based on Wolfram's Rule 150. Figure 21 shows that four-state transition table, consisting of 37 rules, and its snapshots for n = 9 and 17. This four-state protocol has some desirable properties: the algorithm operates in optimum step, its transition rule is symmetric, the state of a general can be either “G” or “A”, and its initial position can be at the left or right end.
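The flavor of the algebraic property can be glimpsed in Rule 60 itself (a sketch of the underlying idea, not Yunès's actual 4-state protocol; the function name is ours): evolving new[i] = old[i−1] XOR old[i] from a single 1 at the left end reproduces Pascal's triangle mod 2, and all n cells hold 1 simultaneously, for the first time, at step n − 1 precisely when n = 2^k.

```python
def rule60_all_ones_step(n):
    """Evolve Wolfram's Rule 60 (new[i] = old[i-1] XOR old[i], with a
    permanent 0 beyond the left boundary) from the configuration
    1 0 ... 0 on n cells.  Return the first step at which all cells
    are 1, or None if that does not happen within 4n steps."""
    cells = [1] + [0] * (n - 1)
    for t in range(1, 4 * n):
        cells = [(0 if i == 0 else cells[i - 1]) ^ cells[i] for i in range(n)]
        if all(cells):
            return t
    return None
```

Here rule60_all_ones_step(8) and rule60_all_ones_step(16) return 7 and 15 (= n − 1), while for n = 6 the first all-ones configuration only appears at step 7.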
Firing Squad Synchronization Problem in Cellular Automata, Figure 18 Transition table for the five-state protocol (left) and its snapshots of configurations on 8 and 16 cells (right)
Theorem 17 ([58]) There exists a symmetric 4-state cellular automaton that can synchronize any array of length n = 2^k + 1 in optimum 2n − 2 steps, where k is any positive integer.

Yunès [80] has given a state lower bound for partial solutions:

Theorem 18 ([80]) There is no 3-state partial solution.

Thus, the 4-state partial solutions given above are optimal with respect to the number of states among partial solutions.

Synchronization Algorithm for One-Bit Communication Cellular Automata
In the study of cellular automata, the amount of bit information exchanged at one step between neighboring cells has usually been assumed to be O(1) bits. An O(1)-bit communication CA is a conventional cellular automaton in which the number of communication bits exchanged at one step between neighboring cells is O(1); such inter-cell bit-information exchange is hidden behind the conventional automata-theoretic finite-state description. By contrast, the 1-bit inter-cell communication model is a cellular automaton in which inter-cell communication is restricted to 1-bit data, referred to as the 1-bit CA model (CA1bit). The number of internal states of a CA1bit is assumed to be finite in the usual sense. The next state of each cell is determined by the present state of that cell and two binary 1-bit inputs from its left and right neighbor cells. Thus, the CA1bit can be thought of as one of the weakest and simplest models in
Firing Squad Synchronization Problem in Cellular Automata, Figure 19 Transition table for the four-state protocol based on Wolfram’s Rule 60 (left) and its snapshots of configurations on 16 cells (right)
a variety of CAs. A precise definition of the CA1bit can be found in Umeo [51] and Umeo and Kamikawa [57]. Mazoyer [26] and Nishimura and Umeo [32] each designed an optimum-time synchronization algorithm on the CA1bit, based on Balzer's algorithm and on Waksman's algorithm, respectively. Figure 22 shows a configuration of the 1-bit synchronization algorithm on 15 cells that is based on the design of Nishimura and Umeo [32]. Each cell has 78 internal states and 208 transition rules. The small black triangles ▶ and ◀ indicate a 1-bit signal transfer in the right or left direction, respectively, between neighboring cells. A symbol in a cell shows an internal state of the cell.

Theorem 19 ([26,32]) There exists a CA1bit that can synchronize n cells in optimum 2n − 2 steps.

Firing Squad Synchronization Problem in Cellular Automata, Figure 20
Transition table for the four-state protocol based on Wolfram's Rule 150 (left) and its snapshots of configurations on 16 cells (right)

Synchronization Algorithms for Multi-Bit Communication Cellular Automata
Section “Synchronization Algorithms for Multi-Bit Communication Cellular Automata” studies a trade-off between internal states and communication bits in firing squad synchronization protocols for k-bit communication-restricted cellular automata (CAkbit) and proposes several time-optimum, state-efficient, bit-transfer-based synchronization protocols. It is shown that there exists a 1-state CA5bit that can synchronize any n cells in optimum 2n − 2 steps. The result is interesting, since it is known that there exists no 4-state synchronization algorithm on conventional O(1)-bit communication cellular automata. A bit-transfer complexity is also introduced to measure the efficiency of synchronization protocols. It is shown that Ω(n log n) bit transfers are a lower bound for synchronizing n cells in 2n − 2 steps, and that each of the optimum-time/non-optimum-time synchronization protocols presented has O(n^2) bit-transfer complexity. Most of the results presented here are from Umeo, Yanagihara and Kanazawa [68]. A computational relation between the conventional CA and the CAkbit is stated as follows:

Lemma 20 ([57]) Let N be any s-state conventional cellular automaton with time complexity T(n). Then, there exists a CA1bit which can simulate N in kT(n) steps, where k is a positive constant integer such that k = ⌈log2 s⌉.
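The k-fold slowdown in Lemma 20 comes from serializing a state number into bits. A sketch (the function name is ours): a cell ships its state to a neighbor over k = ⌈log2 s⌉ consecutive steps, one bit per step.

```python
import math

def serialize_state(state, s):
    """Bits sent over the 1-bit channel, most significant first, to
    transmit one of s states; takes k = ceil(log2 s) steps per state."""
    k = max(1, math.ceil(math.log2(s)))
    return [(state >> (k - 1 - i)) & 1 for i in range(k)]
```

For example, serialize_state(5, 8) yields [1, 0, 1]: simulating one step of an 8-state CA costs three steps of the CA1bit.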
Firing Squad Synchronization Problem in Cellular Automata, Figure 21
Transition table for the four-state protocol (left) and its snapshots of configurations on 9 and 17 cells (right)

Firing Squad Synchronization Problem in Cellular Automata, Table 4
A comparison of optimum-time/non-optimum-time firing squad synchronization protocols for multi-bit communication cellular automata

Synchronization protocol | Communication bits transferred | # states | # transition rules | Time complexity | One-/two-sided recursiveness
P1 | 1 | 54 | 207 | 2n − 1 | one-sided
P2 | 2 | 6 | 60 | 2n − 2 | one-sided
P3 | 3 | 4 | 76 | 2n − 2 | one-sided
P4 | 4 | 3 | 87 | 2n − 2 | one-sided
P4′ | 4 | 2 | 88 | 2n − 2 | one-sided
P5 | 5 | 1 | 114 | 2n − 2 | one-sided
Mazoyer [26] | 1 | 58 | – | 2n − 2 | two-sided
Mazoyer [26] | 12 | 3 | – | 2n − 2 | –
Nishimura and Umeo [32] | 1 | 78 | 208 | 2n − 2 | two-sided
Lemma 21 ([68]) Let N be any s-state conventional cellular automaton. Then, there exists an s-state CAkbit which can simulate N in real time, where k is a positive integer such that k = ⌈log2 s⌉.
The following theorems show a trade-off between internal states and communication bits, and present state-efficient synchronization protocols Pi, 1 ≤ i ≤ 5. In some sense, no internal state is necessary to synchronize the whole array, as is shown in Theorem 27. The protocol design is based on Mazoyer's 6-state algorithm given in Mazoyer [25].
Theorem 22 ([68]) There exists a 54-state CA1bit with protocol P1 that can synchronize any n cells in optimum 2n − 1 steps.
Theorem 25 ([68]) There exists a 3-state CA4bit with protocol P4 that can synchronize any n cells in optimum 2n − 2 steps.

Theorem 26 ([68]) There exists a 2-state CA4bit with protocol P4′ that can synchronize any n cells in optimum 2n − 2 steps.

Theorem 27 ([68]) There exists a 1-state CA5bit with protocol P5 that can synchronize any n cells in optimum 2n − 2 steps.

Figures 24 and 25 illustrate snapshots of the 3-state (2n − 2)-step synchronization protocol P4 operating on a CA4bit and of the 1-state protocol P5 operating on a CA5bit, respectively. Let BT(n) be the total number of bit transfers needed for synchronizing n cells on a CAkbit. Using a technique similar to that developed by Vollmar [73], a lower bound on the bit-transfer complexity for synchronizing n cells on a CAkbit can be established: BT(n) = Ω(n log n). In addition, it is shown that each of the synchronization protocols Pi, 1 ≤ i ≤ 5, and P4′ presented above has O(n^2) bit-transfer complexity.

Theorem 28 ([68]) Ω(n log n) bit transfers are a lower bound for synchronizing n cells on a CAkbit in 2n − 2 steps.

Theorem 29 ([68]) Each of the optimum-time/non-optimum-time synchronization protocols Pi, 1 ≤ i ≤ 5, and P4′ has O(n^2) bit-transfer complexity.

Firing Squad Synchronization Problem in Cellular Automata, Figure 22
A configuration of the optimum-time synchronization algorithm with 1-bit inter-cell communication on 15 cells

Firing Squad Synchronization Problem on Two-Dimensional Arrays
Theorem 23 ([68]) There exists a 6-state CA2bit with protocol P2 that can synchronize any n cells in optimum 2n − 2 steps.

Theorem 24 ([68]) There exists a 4-state CA3bit with protocol P3 that can synchronize any n cells in optimum 2n − 2 steps.

Figure 23 illustrates snapshots of the 6-state (2n − 2)-step synchronization protocol P2 operating on a CA2bit (left) and of the 4-state protocol P3 operating on a CA3bit with 24 cells (right).
Figure 26 shows a finite two-dimensional (2-D) cellular array consisting of m × n cells. The array operates in lockstep mode, in such a way that the next state of each cell (except border cells) is determined by both its own present state and the present states of its north, south, east and west neighbors. All cells (soldiers), except the north-west corner cell (general), are initially in the quiescent state at time t = 0, with the property that the next state of a quiescent cell with quiescent neighbors is the quiescent state again. At time t = 0, the north-west corner cell C11 is in the fire-when-ready state, which is the initiation signal for the synchronization. The firing squad synchronization problem is to determine a description (state set and next-state function) of cells that ensures all cells enter the fire state at exactly the same time and for the first time. The set of states and the transition function must be independent of m and n.
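The lockstep update just described can be sketched with a generic harness (a toy, not an actual synchronization rule; the names and the boundary symbol are ours):

```python
def step2d(grid, rule, boundary='*'):
    """One lockstep update of an m x n cellular array.  Each cell's next
    state is rule(center, north, south, east, west) applied to the present
    states; positions outside the array supply the boundary symbol."""
    m, n = len(grid), len(grid[0])

    def at(i, j):
        return grid[i][j] if 0 <= i < m and 0 <= j < n else boundary

    return [[rule(grid[i][j], at(i - 1, j), at(i + 1, j), at(i, j + 1), at(i, j - 1))
             for j in range(n)]
            for i in range(m)]
```

With a toy rule that turns a quiescent cell 'Q' into 'G' whenever a von Neumann neighbor is 'G', the general state spreads one cell per step from the north-west corner.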
Firing Squad Synchronization Problem in Cellular Automata, Figure 23
Snapshots for the 6-state protocol P2 operating on a CA2bit with 24 cells (left) and for the 4-state protocol P3 operating on a CA3bit with 24 cells (right)
Several synchronization algorithms on 2-D arrays have been proposed by Beyer [3], Grasselli [14], Shinahr [44], Szwerinski [45], Umeo, Maeda and Fujiwara [59], and Umeo, Hisaoka and Akiguchi [53]. Most synchronization algorithms for 2-D arrays are based on mapping schemes which map synchronized configurations for 1-D arrays onto 2-D arrays. Section “Firing Squad Synchronization Problem on Two-Dimensional Arrays” presents several such mapping schemes that yield time-efficient 2-D synchronization algorithms.

Orthogonal Mapping: A Simple Linear-Time Algorithm
In this section, a very simple synchronization algorithm is
provided for 2-D arrays. The overall structure of the algorithm is as follows:
1. First, synchronize the first column of cells using a usual optimum-step 1-D algorithm with a general at one end, requiring 2m − 2 steps.
2. Then, start the row synchronization operation on each row simultaneously. An additional 2n − 2 steps are required for the row synchronization.
In total, the time complexity is 2(m + n) − 4 steps. The implementation is referred to as the orthogonal mapping. It is shown that s + 2 states suffice to implement the algorithm above, where s is the number of internal states of the 1-D base algorithm.

Firing Squad Synchronization Problem in Cellular Automata, Figure 24
Snapshots for the 3-state (2n − 2)-step synchronization protocol P4 operating on a CA4bit with 24 cells

Figure 27 shows snapshots of the 8-state synchronization algorithm running on a rectangular array of size 4 × 6.

Theorem 30 There exists an (s + 2)-state protocol for synchronizing any m × n rectangular array in 2(m + n) − 4 steps, where s is the number of states of any optimum-time 1-D synchronization protocol.

L-Shaped Mapping: Shinahr's Optimum-Time Algorithm
The first optimum-time 2-D synchronization algorithm was developed by Shinahr [44] and Beyer [3]. The rectangular array of size m × n is regarded as min(m, n) rotated
Firing Squad Synchronization Problem in Cellular Automata, Figure 25
Snapshots for the 1-state (2n − 2)-step synchronization protocol P5 operating on a CA5bit with 18 cells
(90° in clockwise direction) L-shaped 1-D arrays, which are synchronized independently using the generalized firing squad synchronization algorithm. The configuration of the generalized synchronization on a 1-D array can be mapped onto the 2-D array. See Fig. 28. Thus, an m × n array synchronization problem is reduced to min(m, n) independent 1-D generalized synchronization problems: P(m, m + n − 1), P(m − 1, m + n − 3), …, P(1, n − m + 1) in the case m ≤ n, and P(m, m + n − 1), P(m − 1, m + n − 3), …, P(m − n + 1, m − n + 1) in the case m > n, where P(k, ℓ) denotes the 1-D generalized synchronization problem for ℓ cells with a general on the kth cell from the left end. Beyer [3] and Shinahr [44] have shown that the optimum time complexity for synchronizing any m × n array is m + n + max(m, n) − 3 steps. Shinahr [44] has also given a 28-state implementation.

Theorem 31 ([3,44]) There exists an optimum-time 28-state protocol for synchronizing any m × n rectangular array in m + n + max(m, n) − 3 steps.

Firing Squad Synchronization Problem in Cellular Automata, Figure 26
A two-dimensional cellular automaton

Diagonal Mapping I: Six-State Linear-Time Algorithm
The proposal is a simple and state-efficient mapping scheme that enables one to embed any 1-D firing squad synchronization algorithm with a general at one end onto two-dimensional arrays without introducing additional states. Consider a 2-D array of size m × n, where m, n ≥ 2. First, divide the mn cells of the array into m + n − 1 groups gk, 1 ≤ k ≤ m + n − 1, defined as follows:

gk = {Ci,j | (i − 1) + (j − 1) = k − 1} , i.e.,
g1 = {C1,1} , g2 = {C1,2, C2,1} , g3 = {C1,3, C2,2, C3,1} , … , gm+n−1 = {Cm,n} .
Figure 29 shows the division of the two-dimensional array of size m × n into m + n − 1 groups. Let M be any 1-D CA that synchronizes ℓ cells in T(ℓ) steps. Assume that M has m + n − 1 cells, denoted by Ci, where 1 ≤ i ≤ m + n − 1. Then, consider the one-to-one correspondence between the ith group gi and the ith cell Ci of M such that gi ↔ Ci, where 1 ≤ i ≤ m + n − 1. One can construct a 2-D CA N such that all cells in gi simulate the ith cell Ci in real time, and N synchronizes any m × n array at time t = T(m + n − 1) if and only if M synchronizes 1-D arrays of length m + n − 1 at time t = T(m + n − 1). It is noted that the set of internal states of the N so constructed is the same as that of M. Thus, an m × n 2-D array synchronization problem is reduced to one 1-D synchronization problem with the general at the left end. The algorithm obtained is slightly slower than the optimum ones, but the number of internal states is considerably smaller. Figure 30 shows snapshots of the proposed 6-state linear-time firing squad synchronization algorithm on rectangular arrays. For the details of the construction of the transition rule set, see Umeo, Maeda, Hisaoka and Teraoka [60].

Theorem 32 ([60]) Let A be any s-state firing squad synchronization algorithm operating in T(ℓ) steps on a 1-D array of ℓ cells. Then, there exists a 2-D s-state cellular automaton that can synchronize any m × n rectangular array in T(m + n − 1) steps.
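The grouping gk can be sketched directly (1-indexed cells; the function name is ours):

```python
def diagonal_groups(m, n):
    """Diagonal mapping: cell C(i,j) of an m x n array joins group g_k with
    k = i + j - 1, so group g_k simulates cell C_k of a 1-D array of
    length m + n - 1."""
    groups = {}
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            groups.setdefault(i + j - 1, []).append((i, j))
    return groups
```

For a 3 × 4 array this produces 6 groups, from g1 = {C(1,1)} up to g6 = {C(3,4)}; every cell in gk mirrors cell Ck of the 1-D base algorithm in real time.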
Firing Squad Synchronization Problem in Cellular Automata, Figure 27
Snapshots of the synchronization process on a 4 × 6 array
Firing Squad Synchronization Problem in Cellular Automata, Figure 28 An optimum-time synchronization scheme for rectangular arrays
Theorem 33 ([60]) There exists a 6-state 2-D CA that can synchronize any m × n rectangular array in 2(m + n) − 4 steps.

Theorem 34 ([60]) There exists a 6-state 2-D CA that can synchronize any m × n rectangular array containing isolated rectangular holes in 2(m + n) − 4 steps.

Theorem 35 ([60]) There exists a 6-state firing squad synchronization algorithm that can synchronize any 3-D m × n × ℓ solid array in 2(m + n + ℓ) − 6 steps.

Diagonal Mapping II: Twelve-State Time-Optimum Algorithm
The second diagonal mapping scheme in this section enables one to embed a special class of 1-D generalized synchronization algorithms onto two-dimensional arrays without introducing additional states. An m × n 2-D array synchronization problem is reduced to one 1-D generalized synchronization problem: P(m, m + n − 1). Divide the mn cells into m + n − 1 groups gk defined as follows,
Firing Squad Synchronization Problem in Cellular Automata, Figure 29 A correspondence between 1-D and 2-D arrays
where k is any integer such that −(m − 1) ≤ k ≤ n − 1:

gk = {Ci,j | j − i = k} , −(m − 1) ≤ k ≤ n − 1 .
Figure 31 shows the correspondence between the 1-D and 2-D arrays.
Property A: Let S_i^t denote the state of Ci at step t. A generalized firing algorithm is said to have property A if any state S_i^t appearing in the area A can be computed from its left and right neighbor states S_{i−1}^{t−1} and S_{i+1}^{t−1}, but never depends on its own previous state S_i^{t−1}. Figure 32 shows the area A in the time-space diagram for the generalized optimum-step firing squad synchronization algorithm. Any one-dimensional generalized firing squad synchronization algorithm with property A can easily be embedded onto two-dimensional arrays without introducing additional states.

Theorem 36 ([53]) Let M be any s-state generalized synchronization algorithm with property A operating in
Firing Squad Synchronization Problem in Cellular Automata, Figure 30 Snapshots of the proposed 6-state linear-time firing squad synchronization algorithm on rectangular arrays
T(k, ℓ) steps on 1-D ℓ cells with a general on the kth cell from the left end. Then, based on M, one can construct a 2-D s-state cellular automaton that can synchronize any m × n rectangular array in T(m, m + n − 1) steps.

It has been shown in Umeo, Hisaoka and Akiguchi [53] that there exists a 12-state implementation of the generalized optimum-time synchronization algorithm having property A. One thus gets a 12-state optimum-time synchronization algorithm for rectangular arrays. Figure 33 shows snapshots of the proposed 12-state optimum-time firing squad synchronization algorithm operating on a 7 × 9 array.

Theorem 37 ([53]) There exists a 12-state firing squad synchronization algorithm that can synchronize any m × n rectangular array in optimum m + n + max(m, n) − 3 steps.

Rotated L-Shaped Mapping: Time-Optimum Algorithm

In this section we present an optimum-time synchronization algorithm based on a rotated L-shaped mapping. The synchronization scheme is quite different from previous designs; it uses the freezing-thawing technique. Without loss of generality, it is assumed that m ≤ n. A rectangular array of size m × n is regarded as m rotated (90° in counterclockwise direction) L-shaped 1-D arrays. Each L-shaped array is denoted by L_i, 1 ≤ i ≤ m; see Fig. 34. Each L_i consists of three segments of length i, n − m, and i, respectively. Each segment can be synchronized by the freezing-thawing technique. Synchronization operations for L_i, 1 ≤ i ≤ m, are as follows. Figure 35 shows a time-space diagram for synchronizing L_i. The wake-up signals for the three segments of L_i are generated at time t = m + 2(m − i) − 1, 3m − i − 2, and n + 2(m − i) − 1, respectively. Synchronization operations on each segment are delayed for t_{ij}, 1 ≤ j ≤ 3.

Fluctuations, Importance of: Complexity in the View of Stochastic Processes

Article Outline

Glossary
Definition of the Subject
Introduction
Stochastic Processes
Stochastic Time Series Analysis
Applications: Processes in Time
Applications: Processes in Scale
Future Directions
Further Reading
Acknowledgment
Bibliography

Glossary

Complexity in time Complex structures may be characterized by non-regular time behavior of a describing variable q ∈ R^d. The challenge is then to understand or to model the time dependence of q(t), which may be achieved by a differential equation (d/dt) q(t) = … or by the discrete dynamics q(t + τ) = f(q(t), …) fixing the evolution in the future. Of special interest are not only nonlinear equations leading to chaotic dynamics but also those which include general noise terms.

Complexity in space Complex structures may be characterized by their spatial disorder. The disorder on a selected scale l may be measured at the location x by some scale-dependent quantity q(l, x), like wavelets, increments and so on. The challenge is to understand or to model the features of the disorder variable q(l, x) on different scales l. If the moments of q show power-law behavior ⟨q(l, x)ⁿ⟩ ∝ l^{ξ(n)}, the complex structures are called fractals. Well-known examples of spatially complex structures are turbulence and financial market data. In the first case the complexity of velocity fluctuations over different distances l is investigated; in the second case the complexity of price changes over different time steps (time scales) is of interest.

Fokker–Planck equation The evolution of a stochastic variable from x' at time t' to x at a later time t > t' is described in a statistical manner by the conditional probability distribution p(x, t | x', t'). The conditional probability is subject to a Fokker–Planck equation, also known as the second Kolmogorov equation, if

∂/∂t p(x, t | x', t') = − Σ_{i=1}^d ∂/∂x_i [ D_i^(1)(x, t) p(x, t | x', t') ]
  + (1/2) Σ_{i,j=1}^d ∂²/(∂x_i ∂x_j) [ D_ij^(2)(x, t) p(x, t | x', t') ]

holds. Here D^(1) and D^(2) are the drift vector and the diffusion matrix, respectively.

Kramers–Moyal coefficients Knowing for a stochastic process the conditional probability distribution p(x(t), t | x', t') for all t and t', the Kramers–Moyal coefficients can be estimated as nth-order moments of the conditional probability distribution. In this way the drift and diffusion coefficients of the Fokker–Planck equation can also be obtained from the empirically measured conditional probability distributions.

Langevin equation The time evolution of a variable x(t) is described by a Langevin equation if it holds that

d/dt x(t) = D^(1)(x, t) + √(D^(2)(x, t)) Γ(t) .

Using Itô's interpretation, the deterministic part of the differential equation is equal to the drift term, and the noise amplitude is equal to the square root of the diffusion term of a corresponding Fokker–Planck equation. Note that for vanishing noise a purely deterministic dynamics is included in this description.

Stochastic process in scale For the description of complex systems with spatial or scale disorder, usually a measure of disorder on different scales q(l, x) is used. A stochastic process in scale is then a description of the l-evolution of q(l, x) by means of stochastic equations. As a special case the single event q(l, x) follows a Langevin equation, whereas the probability p(q(l)) follows a Fokker–Planck equation.

Definition of the Subject

Measurements of time signals of complex systems of the inanimate and the animate world, like turbulent fluid motions, traffic flow or human brain activity, yield fluctuating time series. In recent years, methods have been devised which allow for a detailed analysis of such data. In
particular, methods for parameter-free estimation of the underlying stochastic equations have been proposed. The present article gives an overview of the achievements obtained so far for analyzing stochastic data and describes results obtained for a variety of complex systems, ranging from electrical nonlinear circuits and fluid turbulence to traffic flow and financial market data. The systems will be divided into two classes, namely systems with complexity in time and systems with complexity in scale.

Introduction

The central theme of the present article is exhibited in Fig. 1. Given a fluctuating, sequentially measured set of experimental data, one can pose the question whether it is possible to determine underlying trends and to assess the characteristics of the fluctuations generating the experimental traces. This question becomes especially important for nonlinear systems, which can only partly be analyzed by the evaluation of the power spectra obtained from a Fourier representation of the data. In recent years it has become evident that for a wide class of stochastic processes the posed question can be answered in the affirmative. A common method has been developed which can deal with stochastic processes (Langevin processes, Lévy processes) in time as well as in scale. In the first case, one faces the analysis of temporal disorder, whereas in the second case one considers scale disorder, which is an inherent feature of turbulent fluid motion and, quite interestingly, can also be detected in financial time series. This scale disorder is often linked to fractal scaling behavior and can be treated by a stochastic ansatz in a more general way. In the present article we shall give an overview of the developed methods for analyzing stochastic data in time and scale.
Furthermore, we list complex systems, ranging from electrical nonlinear circuits, fluid turbulence and finance to biological data like heart beat data or data of human tremor, for which a successful application of the data analysis method has been performed, and we focus on results obtained from some exemplary applications of the method to electronics, traffic flow, turbulence, and finance.

Complexity in Time

Complex systems are composed of a large number of subsystems behaving in a collective manner. In systems far from equilibrium, collectivity arises due to self-organization [1,2,3]. It results in the formation of temporal, spatial, spatio-temporal and functional structures.
Fluctuations, Importance of: Complexity in the View of Stochastic Processes, Figure 1 Stochastic time series, data generated numerically and measured in a Rayleigh–Bénard experiment: Is it possible to disentangle turbulent trends from chances?
The investigation of systems like lasers, hydrodynamic instabilities and chemical reactions has shown that self-organization can be described in terms of order parameters u_i(t) which obey a set of stochastic differential equations of the form

du_i = N_i[u_1, …, u_n] dt + Σ_j g_ij(u_1, …, u_n) dW_j ,   (1)

where the W_j are independent Wiener processes. Although the state vector q(t) of the complex system under consideration is high dimensional, its long time behavior is entirely governed by the dynamics of typically few order parameters:

q(t) = Q(u_1, …, u_n) .   (2)
This fact allows one to perform a macroscopic treatment of complex systems on the basis of the order parameter concept [1,2,3]. For hydrodynamic instabilities in the laminar flow regime, like Rayleigh–Bénard convection or the Taylor–Couette experiment, thermal fluctuations are usually small and can be neglected. However, in nonlinear optics and, especially, in biological systems the impact of noise has been shown to be of great importance. In principle, the order parameter equations (1) can be derived from basic equations characterizing the system under consideration close to instabilities [1,2]. However, if the basic equations are not available, as is the case e.g. for systems considered in biology or medicine, the order parameter concept yields a top-down approach to complexity [3].
In this top-down approach the analysis of experimental time series becomes a central issue. Methods of nonlinear time series analysis (cf. the monograph of Kantz and Schreiber [4]) have been widely applied to analyze complex systems. However, the developed methods aim at the understanding of deterministic systems and can only be successful if the stochastic forces are weak. Apparently, these methods have to be extended to include stochastic processes.

Complexity in Scale

In the case of self-similar structures complexity is commonly investigated by a local measure q(l, x) characterizing the structure on the scale l at the location x. Self-similarity means that in a certain range of l the processes

q(λl, x) ,   λ^ξ q(l, x)   (3)

should have the same statistics. More precisely, the probability distribution of the quantity q takes the form

f(q, l) = (1/l^ξ) F(q/l^ξ)   (4)

with a universal function F(Q). Furthermore, the moments exhibit scaling behavior:

⟨qⁿ(l)⟩ = ∫ dq qⁿ (1/l^ξ) F(q/l^ξ) = Qₙ l^{nξ} .   (5)

Such type of behavior has been termed fractal scaling behavior. There are many experimental examples of systems, like turbulent fields or surface roughness, just to mention two, showing that such a simple picture of a complex structure is only a rough, first approximation. In fact, especially for turbulence, where q(l, x) is taken as a velocity increment, it has been argued that multifractal behavior is more appropriate, where the nth order moments scale according to

⟨qⁿ(l)⟩ = Qₙ l^{ξ(n)} ,   (6)

where the scaling indices ξ(n) are nonlinear functions of the order n:

ξ(n) = n ξ₀ + n² ξ₁ + n³ ξ₂ + ⋯ .   (7)

Such a behavior can formally be obtained by the assumption that the probability distribution f(q, l) has the following form:

f(q, l) = ∫ dξ p(ξ, l) (1/l^ξ) F(q/l^ξ) .   (8)

This formula is based on the assumption that in a turbulent flow regions with different scaling indices exist, where p(ξ, l) gives a measure of the scaling indices at scale l. The major shortcoming of the fractal and multifractal approach to complexity in scale is the fact that it only addresses the statistics of the measure q(l, x) at a single scale l. In general one has to expect dependencies of the measures q(l, x) and q(l', x) at different scales. Thus the question, which we will address in the following, can be posed: are there methods which lead to a more comprehensive characterization of the scale disorder by the general joint statistics

f(q_N, l_N; q_{N−1}, l_{N−1}; …; q_1, l_1; q_0, l_0)   (9)

of the local measure q at multiple scales l_i?

Stochastic Data Analysis

Processes in time and scale can be analyzed in a similar way if we generalize the time process given in (1) to a stochastic process equation evolving in scale,

q(l + dl) = q(l) + N(q, l) dl + g(q, l) dW(l) ,   (10)

where dW belongs to a random process. The aim of data analysis is to disentangle deterministic dynamics and the impact of fluctuations. Loosely speaking, this amounts to detecting trends and chances in data series of complex systems. A complete analysis of experimental data, which is generated by the interplay of deterministic dynamics and dynamical noise, has to address the following issues:

Identification of the order parameters
Extracting the deterministic dynamics
Evaluating the properties of fluctuations

The outline of the present article is as follows. First, we shall summarize the description of stochastic processes, focusing mainly on Markovian processes. Second, we discuss the approach developed to analyze stochastic processes. The last parts are devoted to applications of the data analysis method to processes in time and processes in scale.

Stochastic Processes

In the following we consider the class of systems which are described by a multivariate state vector X(t) contained in a d-dimensional state space {x}. The evolution of the state vector X(t) is assumed to be governed by a deterministic part and by noise:

d/dt X(t) = N(X(t), t) + F(X(t), t) .   (11)
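As a numerical aside (not from the article), the moment scaling (5)-(6) discussed above can be illustrated on increments of a Brownian path, for which the exponents are known to be ξ(n) = n/2 (monofractal scaling); function names and parameter choices here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Brownian path: iid Gaussian steps, so the increment q(l) = x(t+l) - x(t)
# scales as l^(1/2) and the moments obey <|q(l)|^n> ~ l^(n/2)
x = np.cumsum(rng.normal(size=2**20))

def scaling_exponent(x, n, scales):
    """Estimate xi(n) from <|q(l)|^n> ~ l^xi(n) via a log-log least-squares fit."""
    moments = [np.mean(np.abs(x[l:] - x[:-l]) ** n) for l in scales]
    slope, _ = np.polyfit(np.log(scales), np.log(moments), 1)
    return slope

scales = [2**k for k in range(2, 10)]
print(scaling_exponent(x, 2, scales))   # close to n/2 = 1 for n = 2
```

A multifractal signal would instead give a fitted ξ(n) that bends away from the straight line n ξ₀, which is exactly the deviation Eq. (7) parameterizes.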
N denotes a nonlinear function depending on the stochastic variable X(t), and may additionally depend explicitly on time t. (Note that time t can also be considered as a general variable and replaced, for example, by a scale variable l as in (86).) Because the function N can be nonlinear, systems exhibiting chaotic time evolution in the deterministic case are also included in the class of stochastic processes (11). The second part, F(X(t), t), fluctuates on a fast time scale. We assume that the d components F_i can be represented in the form

F_i(X(t), t) = Σ_{j=1}^d g_ij(X(t), t) Γ_j(t) .   (12)

The quantities Γ_j(t) are considered to be random functions whose statistical characteristics are well-defined. It is evident that these properties significantly determine the dynamical behavior of the state vector X(t). Formally, our approach also includes purely deterministic processes by taking F = 0.
Discrete Time Evolution

It is convenient to consider the temporal evolution (11) of the state vector X(t) on a time scale τ which is large compared to the time scale of the fluctuations Γ_j(t). As we shall briefly indicate below, a stochastic process related to the evolution equation (11) can be modeled by stochastic evolution laws relating the state vectors X(t) at times t_i, t_{i+1} = t_i + τ, t_{i+2} = t_i + 2τ, … for small but finite values of τ. In the present article we shall deal with the class of proper Langevin processes and generalized Langevin processes, which are defined by the following discrete time evolutions.

a) Proper Langevin Equations: White Noise
The discrete time evolution of a proper Langevin process is given by

X(t_{i+1}) = X(t_i) + τ N(X(t_i), t_i) + g(X(t_i), t_i) √τ Γ(t_i) ,   (13)

where the stochastic increment Γ(t_i) is a fluctuating quantity characterized by a Gaussian distribution with zero mean, ⟨Γ(t_i)⟩ = 0:

h(Γ) = (1/√(2π)^d) e^{−Γ²/2} = (1/√(2π)^d) e^{−Σ_{α=1}^d Γ_α²/2} .   (14)

Furthermore, the increments are statistically independent for different times:

⟨Γ(t_i) Γ(t_j)⟩ = δ_ij .   (15)

b) Generalized Langevin Equations: Lévy Noise
A more general class is formed by the discrete time evolution laws [5,6],

X(t_{i+1}) = X(t_i) + τ N(X(t_i), t_i) + g(X(t_i), t_i) τ^{1/α} Γ_{α,β}(t_i) ,   (16)

where the increment Γ_{α,β}(t_i) is a fluctuating quantity distributed according to the Lévy stable law characterized by the Lévy parameters α, β [7]. As is well known, only the Fourier transform of this distribution can be given:

h_{α,β}(Γ) = (1/2π) ∫ dk Z(k; α, β) e^{−ikΓ} ,
Z(k; α, β) = e^{−σ^α |k|^α (1 − iβ sign(k) Φ)} ,   (17)
Φ = tan(πα/2) for α ≠ 1 ,   Φ = −(2/π) ln|k| for α = 1 .
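The discrete evolution (13) is exactly an Euler-Maruyama integration step. A minimal one-dimensional sketch, with an Ornstein-Uhlenbeck drift N(x) = −γx and constant noise amplitude chosen purely for illustration (γ, σ are our assumed values, not from the article):

```python
import numpy as np

def langevin_path(N_func, g_func, x0, tau, n_steps, rng):
    """Iterate Eq. (13): X_{i+1} = X_i + tau*N(X_i) + g(X_i)*sqrt(tau)*Gamma_i
    with independent Gamma_i ~ N(0, 1)."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    for i in range(n_steps):
        x[i + 1] = x[i] + tau * N_func(x[i]) + g_func(x[i]) * np.sqrt(tau) * rng.normal()
    return x

# Ornstein-Uhlenbeck example: N(x) = -gamma*x, g(x) = sigma (assumed values)
gamma, sigma = 1.0, 0.5
rng = np.random.default_rng(1)
path = langevin_path(lambda x: -gamma * x, lambda x: sigma, 1.0, 1e-3, 500_000, rng)

# The stationary variance of this process is sigma**2 / (2*gamma) = 0.125
print(path[100_000:].var())
```

The √τ factor in front of the Gaussian increment is the essential point of (13): it makes the variance of the noise contribution scale linearly in τ, as required for a diffusion process.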
The Gaussian distribution is contained in the class of Lévy stable distributions (α = 2, β = 0). Formally, α can be taken from the interval 0 < α ≤ 2. However, for applications it seems reasonable to choose 1 < α ≤ 2 in order that the first moment of the noise source exists. The consideration of this type of statistics for the noise variables is based on the central limit theorem, as discussed in a subsection below. The discrete Langevin (13) and generalized Langevin equations (16) have to be considered in the limit τ → 0. They are the basis of all further treatments. A central point is that if one assumes the noise sources to be independent of the variable X(t_i), the discrete time evolution equations define a Markov process whose generator, i.e. the conditional probability distribution or short time propagator, can be established on the basis of (13), (16). In the following we shall discuss how the discrete time processes can be related to the stochastic evolution equation (11).

Discrete Time Approximation of Stochastic Evolution Equations

In order to motivate the discrete time approximations (13), (16) we integrate the evolution law (11) over a finite but small time increment τ:

X(t + τ) = X(t) + ∫_t^{t+τ} dt' N(X(t'), t') + ∫_t^{t+τ} dt' g(X(t'), t') Γ(t')
  ≈ X(t) + τ N(X(t), t) + ∫_t^{t+τ} dt' g(X(t'), t') Γ(t') .   (18)

The time interval τ is chosen to be larger than the time scale of the fluctuations of Γ_j(t). The remaining integral involves the rapidly fluctuating quantities Γ_j(t) and is denoted as a stochastic integral [8,9,10,11]. If we assume the matrix g to be independent of time t and state vector X(t), we arrive at the integrals

dW(t; τ) = ∫_t^{t+τ} dt' Γ(t') .   (19)

These are the quantities for which a statistical characterization can be given. We shall pursue this problem in the next subsection. However, looking at (18) we encounter the difficulty that the integrals over the noise forces may involve functions of the state vector within the time interval (t, t + τ). The interpretation of such integrals for wildly fluctuating, stochastic quantities Γ(t) is difficult. The simplest way is to formulate an interpretation of these terms, leading to different interpretations of the stochastic evolution equation (11). We formulate the widely used definitions due to Itô and Stratonovich. In the Itô sense, the integral is interpreted as

∫_t^{t+τ} dt' g(X(t'), t') Γ(t') = g(X(t), t) ∫_t^{t+τ} dt' Γ(t') = g(X(t), t) dW(t; τ) .   (20)

The Stratonovich definition is

∫_t^{t+τ} dt' g(X(t'), t') Γ(t') = g( (X(t + τ) + X(t))/2 , t + τ/2 ) ∫_t^{t+τ} dt' Γ(t')
  = g( (X(t + τ) + X(t))/2 , t + τ/2 ) dW(t; τ) .   (21)

Since from experiments one obtains probability distributions of stochastic processes which are related to stochastic Langevin equations, we are free to choose a certain interpretation of the process. In the following we shall always adopt the Itô interpretation. In this case, the drift vector D^(1)(x, t) = N(x, t) coincides with the nonlinear vector field N(x, t).

Limit Theorems, Wiener Process, Lévy Process

In the following we shall discuss possibilities to characterize the stochastic integrals

dW(t; τ) = ∫_t^{t+τ} dt' Γ(t') .   (22)

Γ(t) is a rapidly fluctuating quantity of zero mean. In order to characterize the properties of this force one can resort to the limit theorems of statistical mechanics [7]. The central limit theorem states that if the quantities ξ_j, j = 1, …, n, are statistically independent variables of zero mean and variance σ², then the sum

Γ = (1/√n) Σ_{j=1}^n ξ_j   (23)

tends to a Gaussian random variable with variance σ² for large values of n. The limiting probability distribution h(Γ) is then a Gaussian distribution with variance σ²:

h(Γ) = (1/√(2πσ²)) e^{−Γ²/(2σ²)} .   (24)

As is well known, there is a generalization of the central limit theorem which applies to random variables whose second moment does not exist. It states that the distribution of the sum over identically distributed random variables ξ_j,

(1/n^{1/α}) Σ_{j=1}^n ξ_j = Γ_{α,β} ,   (25)

tends to a random variable Γ_{α,β} which is distributed according to the Lévy-stable distribution h_{α,β}(Γ). The Lévy stable distributions can only be given by their Fourier transforms, cf. Eq. (17).

In order to evaluate the integral (22) using the limit theorems, it is convenient to represent the stochastic force Γ(t) as a sum over N δ-kicks occurring at discrete times t_j:

Γ(t) = Σ_j ξ_j (Δt)^{1/α} δ(t − t_j) .   (26)

Thereby, Δt is the time difference between the occurrence of two kicks. Then we obtain

dW(t; τ) = Σ_{j: t_j ∈ [t, t+τ]} ξ_j (Δt)^{1/α} = (N Δt)^{1/α} (1/N^{1/α}) Σ_{j: t_j ∈ [t, t+τ]} ξ_j = τ^{1/α} Γ(t) .   (27)

An application of the central limit theorem shows that if the quantities ξ_j are identically distributed independent variables, the quantity

Γ = (1/τ^{1/α}) ∫_t^{t+τ} dt' Γ(t')   (28)

can be considered to be a random variable Γ(t) which in the limit N → ∞ tends to a stable random variable.
Thus, for the case α = 2, i.e. for the case where the second moments of the random kicks exist, the stochastic variable dW(t; τ) can be represented by the increments

dW(t; τ) = √τ Γ(t) ,   (29)

where Γ(t) is a Gaussian distributed random variable. For the more general case, dW(t; τ) is a stochastic variable according to

dW(t; τ) = τ^{1/α} Γ_{α,β}(t) ,   (30)

where the distribution of Γ_{α,β} is the Lévy distribution (17).

Statistical Description of Stochastic Processes

In the previous subsection we have discussed processes described by stochastic equations. In the present subsection we shall summarize the corresponding statistical description. Such a description is achieved by introducing suitable statistical averages, which we shall denote by the brackets ⟨…⟩. For stationary processes in time, one usually deals with time averages. For nonstationary processes, averages are defined as ensemble averages, i.e. averages over an ensemble of experimental (or numerical) realizations of the stochastic process (11). For processes in scale, the average is an ensemble average.

Probability Distributions

The set of stochastic evolution equations (11), or its finite time representations (13), (16), defines a Markov process. We consider the joint probability density

f(x_n, t_n; …; x_1, t_1; x_0, t_0) ,   (31)

which is related to the joint probability to find the system at times t_i in the volumes V_i in phase space. If we take times t_i which are separated by the small time increment τ = t_{i+1} − t_i, then the probability density can be related to the discrete time representation of the stochastic process (13), (16) according to

f(x_n, t_n; …; x_1, t_1; x_0, t_0) = ⟨δ(x_n − X(t_n)) ⋯ δ(x_0 − X(t_0))⟩ ,   (32)

where the brackets indicate the statistical average, which may be a time average (for stationary processes) or an ensemble average.

Markov Processes

An important subclass of stochastic processes are Markov processes. For these processes the joint probability distribution f(x_n, t_n; …; x_0, t_0) can be constructed from the knowledge of the conditional probability distributions

p(x_{i+1}, t_{i+1} | x_i, t_i) = f(x_{i+1}, t_{i+1}; x_i, t_i) / f(x_i, t_i)   (33)

according to

f(x_n, t_n; …; x_1, t_1; x_0, t_0) = p(x_n, t_n | x_{n−1}, t_{n−1}) ⋯ p(x_1, t_1 | x_0, t_0) f(x_0, t_0) .   (34)

Here the Markov property of a process for multiply conditioned probabilities,

p(x_i, t_i | x_{i−1}, t_{i−1}; …; x_0, t_0) = p(x_i, t_i | x_{i−1}, t_{i−1}) ,   (35)

is used. As a consequence, the knowledge of the transition probabilities together with the initial probability distribution f(x_0, t_0) suffices to define the N-times probability distribution. It is straightforward to prove the Chapman–Kolmogorov equation

p(x_j, t_j | x_i, t_i) = ∫ dx_k p(x_j, t_j | x_k, t_k) p(x_k, t_k | x_i, t_i) .   (36)
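For a finite-state Markov chain the Chapman-Kolmogorov equation reduces to a matrix identity: the two-step transition matrix equals the product of one-step matrices. A toy numerical check against simulated data (transition matrix chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 3-state chain; P[a, b] = p(next state = b | current state = a)
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Simulate a long trajectory via inverse-CDF sampling of each row
n = 300_000
u = rng.random(n)
cum = P.cumsum(axis=1)
states = np.empty(n, dtype=int)
states[0] = 0
for t in range(1, n):
    states[t] = cum[states[t - 1]].searchsorted(u[t], side="left")

# Empirical two-step propagator p(x_{t+2} | x_t) ...
counts = np.zeros((3, 3))
for a, b in zip(states[:-2], states[2:]):
    counts[a, b] += 1
P2_empirical = counts / counts.sum(axis=1, keepdims=True)

# ... versus the Chapman-Kolmogorov prediction: the matrix product P @ P
print(np.abs(P2_empirical - P @ P).max())   # small deviation
```

For continuous-state data the same check is performed with binned conditional histograms instead of matrices, which is the form used later in the article for testing Markovianity.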
This relation is valid for all times t_i < t_k < t_j. In the following we shall show that the transition probabilities p(x_{j+1}, t + τ | x_j, t) can be determined for small time differences τ. This defines the so-called short time propagators.

Short Time Propagator of Langevin Processes

It is straightforward to determine the short time propagator from the finite time approximation (13) of the Langevin equation. We shall denote these propagators by p(x_{j+1}, t + τ | x_j, t), in contrast to the finite time propagators (33), for which the time interval t_{i+1} − t_i is large compared to τ. We first consider the case of Gaussian noise. The variables Γ(t_i) are Gaussian random vectors with probability distribution

h[Γ] = (1/√(2π)^d) exp(−Γ²/2) .   (37)

The finite time interpretation of the Langevin equation can be rewritten in the form

Γ(t_i) = (1/τ^{1/2}) [g(X(t_i), t_i)]^{−1} [X(t_{i+1}) − X(t_i) − τ N(X(t_i))] .   (38)

This relation, in turn, defines the transition probability distribution

p(x_{i+1}, t_{i+1} | x_i, t_i) dx_{i+1} = h[Γ = Γ(t_i)] J(x_i, t_i) dx_{i+1} ,   (39)

where J is the determinant of the Jacobian

J_{αβ} = ∂Γ_α(t_i) / ∂x_{i+1,β} ,   (40)

and [g]^{−1} denotes the inverse of the matrix g (which is assumed to exist). For the following it will be convenient to define the so-called diffusion matrix

D^(2)(x_i, t_i) = g^T(x_i, t_i) g(x_i, t_i) .   (41)

We are now able to explicitly state the short time propagator of the process (13):

p(x_{i+1}, t_i + τ | x_i, t_i) = (1/√((2πτ)^d Det[D^(2)])) e^{−S(x_{i+1}, x_i, t_i; τ)} .   (42)

We have defined the quantity S(x_{i+1}, x_i, t_i; τ) according to

S(x_{i+1}, x_i, t_i; τ) = (τ/2) [ (x_{i+1} − x_i)/τ − D^(1)(x_i, t_i) ]^T [D^(2)(x_i, t_i)]^{−1} [ (x_{i+1} − x_i)/τ − D^(1)(x_i, t_i) ] .   (43)

As we see, the short time propagator, which yields the transition probability density from state x_i to state x_{i+1} in the finite but small time interval τ, is a normal distribution.

Short Time Propagator of Lévy Processes

It is now straightforward to determine the short time propagator for Lévy processes. We have to replace the Gaussian distribution by the (multivariate) Lévy distribution h_{α,β}(Γ). As a consequence, we obtain the conditional probability, i.e. the short time propagator, for Lévy processes:

p(x_{i+1}, t_i + τ | x_i, t_i) = (1/Det[g(x_i, t_i)]) h_{α,β}( (1/τ^{1/α}) [g(x(t_i), t_i)]^{−1} [x(t_{i+1}) − x(t_i) − τ N(x(t_i))] ) .   (44)

Joint Probability Distribution and Markovian Properties

Due to statistical independence of the random vectors Γ(t_i), Γ(t_j) for i ≠ j, we obtain the joint probability distribution as a product of the distributions h(Γ):

h(Γ_N, …, Γ_0) = h(Γ_N) h(Γ_{N−1}) ⋯ h(Γ_0) .   (45)

Furthermore, we observe that under the assumption that the random vector Γ(t_i) is independent of the variables X(t_j) for all j ≤ i, we can construct the N-time probability distribution

f(x_n, t_n; …; x_1, t_1; x_0, t_0) = p(x_n, t_n | x_{n−1}, t_{n−1}) ⋯ p(x_1, t_1 | x_0, t_0) f(x_0, t_0) .   (46)

However, this is the definition of a Markov chain. Thereby, the transition probabilities are the short time propagators, i.e. the representation (46) is valid in the short time limit τ → 0. The probability distribution (46) is the discrete approximation of the path integral representation of the stochastic process under consideration [8]. Let us summarize: The statistical description of the Langevin equation based on the n-times joint probability distribution leads to the representation in terms of the conditional probability distribution. This representation is the definition of a Markov process. Due to the assumptions on the statistics of the fluctuating forces, different processes arise. If the fluctuating forces are assumed to be Gaussian, the short time propagator is Gaussian and, as a consequence, solely defined by the drift vector and the diffusion matrix. If the fluctuating forces are assumed to be Lévy distributed, more complicated short time propagators arise.

Let us add the following remarks: a) The assumption of Gaussianity of the statistics is not necessary. One can consider fluctuating forces with non-Gaussian probability distributions. In this case the probability distributions have to be characterized by higher order moments or, more explicitly, by their cumulants. At this point we remind the reader that for non-Gaussian distributions infinitely many cumulants exist. b) The Markovian property, i.e. the fact that the propagator p(x_i, t_i | x_{i−1}, t_{i−1}) does not depend on states x_k at times t_k < t_{i−1}, is usually violated for physical systems due to the fact that the noise sources become correlated for small time differences τ < T_Mar. This point has already been emphasized in the famous work of A. Einstein on Brownian motion, which is one of the first works on stochastic processes [12]. This time scale is denoted as the Markov–Einstein scale. It seems to be a highly interesting quantity, especially for nonequilibrium systems like turbulence [13,14] and earthquake signals [15].

Finite Time Propagators

Up to now, we have considered the short time propagators p(x_i, t_i | x_{i−1}, t_{i−1}) for infinitesimal time differences
t_i − t_{i−1} = τ. However, one is interested in the conditional probability distributions for finite time intervals, p(x, t | x', t'), t ≥ t'.

Fokker–Planck Equation

The conditional probability distribution p(x, t | x', t'), t ≥ t', can be obtained from the solution of the Fokker–Planck equation (also known as the second Kolmogorov equation [16]):

∂/∂t p(x, t | x', t') = − Σ_{i=1}^d ∂/∂x_i [ D_i^(1)(x, t) p(x, t | x', t') ]
  + (1/2) Σ_{i,j=1}^d ∂²/(∂x_i ∂x_j) [ D_ij^(2)(x, t) p(x, t | x', t') ] .   (47)

D^(1) and D^(2) are drift vector and diffusion matrix. Under consideration of Itô's definition of stochastic integrals, the coefficients D^(1), D^(2) of the Fokker–Planck equation (47) and the functions N, g of the Langevin equation (11), (12) are related by

D_i^(1)(x, t) = N_i(x, t) ,   (48)
D_ij^(2)(x, t) = Σ_{l=1}^d g_il(x, t) g_jl(x, t) .   (49)

They are defined according to

D_i^(1)(x, t) = lim_{τ→0} (1/τ) ⟨X_i(t + τ) − x_i⟩|_{X(t)=x}
  = lim_{τ→0} (1/τ) ∫ dx' p(x', t + τ | x, t) (x'_i − x_i) ,   (50)

D_ij^(2)(x, t) = lim_{τ→0} (1/τ) ⟨(X_i(t + τ) − x_i)(X_j(t + τ) − x_j)⟩|_{X(t)=x}
  = lim_{τ→0} (1/τ) ∫ dx' p(x', t + τ | x, t) (x'_i − x_i)(x'_j − x_j) .   (51)

These expressions demonstrate that drift vector and diffusion matrix can be determined as the first and second moments of the conditional probability distributions p(x', t + τ | x, t) in the small time limit.

Fractional Fokker–Planck Equations

The finite time propagators or conditional probability distributions of stochastic processes containing Lévy noise lead to fractional diffusion equations. For a discussion of this topic we refer the reader to [5,6].
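Definitions (50) and (51) above suggest a direct estimation recipe: bin the data in x and compute conditional moments of the increments. A minimal sketch on synthetic Ornstein-Uhlenbeck data with D^(1)(x) = −x and D^(2) = 0.5 (assumed values, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic Ornstein-Uhlenbeck series with D1(x) = -x and D2(x) = 0.5
tau, n = 1e-2, 400_000
x = np.empty(n)
x[0] = 0.0
for i in range(n - 1):
    x[i + 1] = x[i] - x[i] * tau + np.sqrt(0.5 * tau) * rng.normal()

def kramers_moyal(x, tau, bins=20):
    """Estimate D1(x), D2(x) as conditional moments of the increments, Eqs. (50)-(51)."""
    dx = x[1:] - x[:-1]
    centers = np.linspace(-1.0, 1.0, bins)
    width = centers[1] - centers[0]
    D1 = np.full(bins, np.nan)
    D2 = np.full(bins, np.nan)
    for k, c in enumerate(centers):
        mask = np.abs(x[:-1] - c) < width / 2
        if mask.sum() > 100:                      # require enough events per bin
            D1[k] = dx[mask].mean() / tau         # first conditional moment, Eq. (50)
            D2[k] = (dx[mask] ** 2).mean() / tau  # second conditional moment, Eq. (51)
    return centers, D1, D2

centers, D1, D2 = kramers_moyal(x, tau)
valid = ~np.isnan(D1)
slope, _ = np.polyfit(centers[valid], D1[valid], 1)
print(slope)            # drift slope, close to -1
print(np.nanmean(D2))   # diffusion, close to 0.5
```

This is the parameter-free spirit of the method described in the article: no functional form of drift or diffusion is assumed; both are read off bin by bin from the conditional moments.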
Master Equation

The most general equation specifying a Markov process for a continuous state vector X(t) takes the form

∂/∂t p(x, t | x_0, t_0) = ∫ dx' w(x, x', t) p(x', t | x_0, t_0) − ∫ dx' w(x', x, t) p(x, t | x_0, t_0) ,   (52)

where the w denote transition probabilities.

Measurement Noise

We can now go a step ahead and include measurement noise. Due to measurement noise, the observed state vector, which we shall now denote by Y(t_i), is related to the stochastic variable X(t_i) by an additional noise term ζ(t_i):

Y(t_i) = X(t_i) + ζ(t_i) .   (53)

We assume that the stochastic variables ζ(t_i) have zero mean, are statistically independent and obey the probability density h(ζ). Then the probability distribution for the variable Y(t_i) is given by

g(y_n, t_n; …; y_0, t_0) = ∫ … ∫ dζ_n … dζ_0 f(y_n − ζ_n, t_n; …; y_0 − ζ_0, t_0) h(ζ_n) ⋯ h(ζ_0) .   (54)

The short time propagator has recently been determined for the Ornstein–Uhlenbeck process, a process with linear drift term and constant diffusion. The analysis of data sets spoilt by measurement noise is currently under investigation [17,18].

Stochastic Time Series Analysis

The ultimate goal of nonlinear time series analysis applied to deterministic systems is to extract the underlying nonlinear dynamical system directly from measured time series, in the form of a system of differential equations, cf. [4]. The role played by dynamic fluctuations has not been fully appreciated. Mostly, noise has been considered as a random variable additively superimposed on a trajectory generated by a deterministic dynamical system; noise has usually been considered as extrinsic or measurement noise. The problem of dynamical noise, i.e. fluctuations which interfere with the deterministic dynamical evolution, has not been addressed in full detail. The natural extension of nonlinear time series analysis to (continuous) Markov processes is the estimation of short time propagators from time series. During recent years, it has become evident that such an approach is
Fluctuations, Importance of: Complexity in the View of Stochastic Processes
feasible. In fact, noise may help in the estimation of the deterministic ingredients of the dynamics. Due to dynamical noise the system explores a larger part of phase space and thus measurements of signals yield considerably more information about the dynamics as compared to the purely deterministic case, where the trajectories fastly converge to attractors providing only limited information. The analysis of data set’s of stochastic systems exhibiting Markov properties has to be performed along the following lines:
- Disentangling Measurement and Dynamical Noise
- Evaluating Markovian Properties
- Determination of Short Time Propagators
- Reconstruction of Data
Since the methods for disentangling measurement and dynamical noise are currently under intense investigation, see Subsect. “Measurement Noise”, our focus is on the three remaining issues.
Evaluating Markovian Properties
In principle it is a difficult task to decide on Markovian properties by inspection of experimental data. The main point is that Markovian properties usually are violated for small time increments τ, as has already been pointed out above and in [12]. There are at least two reasons for this fact. First, the dynamical noise sources become correlated at small time differences. For Gaussian noise sources one usually observes an exponential decay of the correlations,

\langle \Gamma_i(t)\, \Gamma_j(t') \rangle = \delta_{ij}\, \frac{e^{-|t - t'|/T_{mar}}}{T_{mar}}    (55)

Markovian properties can only be expected to hold for time increments τ > T_mar. Second, measurement noise can spoil Markovian properties [19]. Thus, the estimation of the Markovian time scale T_mar is a necessary step for stochastic data analysis. Several methods have been proposed to test Markov properties.

Direct Evaluation of Markovian Properties
A direct way is to use the definition of a Markov process (35) and to consider the higher order conditional probability distributions

p(x_3, t_3 \mid x_2, t_2; x_1, t_1) = \frac{f(x_3, t_3; x_2, t_2; x_1, t_1)}{f(x_2, t_2; x_1, t_1)} = p(x_3, t_3 \mid x_2, t_2)    (56)

This procedure is feasible if large data sets are available. Due to the different conditioning, the two probabilities are typically based on different numbers of events. As an appropriate method to statistically establish the similarity of both sides of (56), the Wilcoxon test has been proposed; for details see [20,21]. In principle, higher order conditional probability distributions should be considered in a similar way. However, the validity of relation (56) is already a strong hint for Markovianity.

Evaluation of the Chapman–Kolmogorov Equation
An indirect way is to use the Chapman–Kolmogorov equation (36), whose validity is a necessary condition for Markovianity. The method is based on a comparison between the conditional pdf

p(x_k, t_k \mid x_i, t_i)    (57)

taken from experiment and the one calculated by the Chapman–Kolmogorov equation,

\tilde{p}(x_k, t_k \mid x_i, t_i) = \int dx_j\, p(x_k, t_k \mid x_j, t_j)\, p(x_j, t_j \mid x_i, t_i)    (58)

where t_j is an intermediate time, t_i < t_j < t_k. A refined method can be based on an iteration of the Chapman–Kolmogorov equation, i.e. on considering several intermediate times. If the Chapman–Kolmogorov equation is not fulfilled, the deviations are enhanced by each iteration.
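As a minimal numerical illustration of the direct test (56), the following sketch (not taken from the article; all parameters are illustrative) simulates an Ornstein–Uhlenbeck process, which is Markovian by construction, partitions its state space into three coarse bins, and compares the singly and doubly conditioned distributions. The residual deviation stems from finite sampling and the coarse partition.

```python
import random
from collections import defaultdict

random.seed(1)

# Simulate an Ornstein-Uhlenbeck process (Markovian by construction)
# with the Euler-Maruyama scheme; parameters are illustrative choices.
dt, n = 0.05, 200_000
x, xs = 0.0, []
for _ in range(n):
    x += -x * dt + (dt ** 0.5) * random.gauss(0.0, 1.0)
    xs.append(x)

def bin_of(v):
    # coarse partition of the state space into 3 bins
    return 0 if v < -0.3 else (1 if v < 0.3 else 2)

tau = 10  # time increment (in samples) used for the conditioning
single = defaultdict(lambda: defaultdict(int))   # counts for p(x3 | x2)
double = defaultdict(lambda: defaultdict(int))   # counts for p(x3 | x2, x1)
for i in range(n - 2 * tau):
    b1, b2, b3 = bin_of(xs[i]), bin_of(xs[i + tau]), bin_of(xs[i + 2 * tau])
    single[b2][b3] += 1
    double[(b2, b1)][b3] += 1

def normalize(counts):
    tot = sum(counts.values())
    return {k: v / tot for k, v in counts.items()}

# Markov check: p(x3 | x2, x1) should be (nearly) independent of x1;
# coarse bins and finite statistics leave a small residual deviation.
max_dev = 0.0
for (b2, b1), counts in double.items():
    p_cond = normalize(counts)
    p_single = normalize(single[b2])
    for b3 in range(3):
        max_dev = max(max_dev, abs(p_cond.get(b3, 0.0) - p_single.get(b3, 0.0)))

print(f"max deviation |p(x3|x2,x1) - p(x3|x2)| = {max_dev:.3f}")
```

For experimental data one would additionally scan the lag to estimate the Markovian time scale T_mar, and use a proper two-sample test (such as the Wilcoxon test mentioned above) instead of a raw maximum deviation.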
Direct Estimation of Stochastic Forces
Probably the most direct way is the determination of the stochastic forces from the data. If the drift vector field D^{(1)}(x,t) has been established, as discussed below, the fluctuating forces can be estimated according to

g(x(t_j), t_j)\, \Gamma(t_j) = \frac{x(t_j + \tau) - x(t_j) - D^{(1)}(x(t_j), t_j)\,\tau}{\sqrt{\tau}}    (59)
The correlations of this force can then be examined directly, see also [22] and Subsect. "Noisy Circuits".

Differentiating Between Stochastic Process and Noise Data
Looking at the joint statistics of increments extracted from given data, it could be shown that the nesting of increments and the resulting statistics can be used to differentiate between noise-like data sets and those resulting from stochastic time processes [24].
Estimating the Short Time Propagator
A crucial step in the stochastic analysis is the assessment of the short time propagator of the continuous Markov process. This gives access to the deterministic part of the dynamics as well as to the fluctuations.

Gaussian Noise
We shall first consider the case of Gaussian white noise. As we have already indicated, drift vector and diffusion matrix are defined as the first and second moments of the conditional probability distribution or short time propagator, Eq. (50). We shall now describe an operational approach which allows one to estimate drift vector and diffusion matrix from data and which has been successfully applied to a variety of stochastic processes. We shall discuss the case where averages are taken with respect to an ensemble of experimental realizations of the stochastic process under consideration, in order to include nonstationary processes. Replacing the ensemble averages by time averages for statistically stationary processes is straightforward. For illustration we show in Fig. 2 the estimated functions of the drift terms obtained from the analysis of the data of Fig. 1. The procedure is as follows:

- The data is represented in a d-dimensional phase space
- The phase space is partitioned into a set of finite, but small, d-dimensional volume elements
- For each bin (denoted by α), located at the point x_α of the partition, we consider the quantity

x(t_j + \tau) = x(t_j) + D^{(1)}(x(t_j), t_j)\,\tau + \sqrt{\tau}\, g(x(t_j), t_j)\,\Gamma(t_j)    (60)
Thereby, the considered points x(t_j) are taken from the bin located at x_α. Since we consider time-dependent processes, this has to be done for each time step t_j separately.

- Estimation of the drift vector: The drift vector assigned to the bin located at x_α is determined as the small-τ limit

D^{(1)}(x, t) = \lim_{\tau \to 0} \frac{1}{\tau} M^{(1)}(x, t, \tau)    (61)

of the conditional moment

M^{(1)}(x_\alpha, t_j, \tau) = \frac{1}{N_\alpha} \sum_{x(t_j) \in \alpha} [x(t_j + \tau) - x(t_j)]    (62)

The sum is over all N_α points contained in the bin α.
Fluctuations, Importance of: Complexity in the View of Stochastic Processes, Figure 2 Estimated drift terms from the numerical and experimental data of Fig. 1. In part a the exact function of the numerical model is also shown as a solid curve; in part b the line D^{(1)} = 0 is shown to visualize the multiple fixed points; after [23]
Proof: The drift vector assigned to the bin α located at x_α is approximated by the conditional expectation value

M^{(1)}(x_\alpha, t, \tau) = \frac{1}{N_\alpha} \sum_{x(t_j) \in \alpha} D^{(1)}(x(t_j), t_j)\,\tau + \frac{1}{N_\alpha} \sum_{x(t_j) \in \alpha} \sqrt{\tau}\, g(x(t_j), t_j)\,\Gamma(t_j)    (63)

Thereby, the sum is over all points x(t_j) located in the bin assigned to x_α. Assuming that D^{(1)}(x,t) and g(x,t) do not vary significantly over the bin, the second contribution drops out since

\frac{1}{N_\alpha} \sum_{x(t_j) \in \alpha} \Gamma(t_j) \to 0    (64)
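The bin-wise estimation of the conditional moments can be illustrated numerically. The following sketch (illustrative parameters, not from the article) applies the estimator of Eq. (62), together with the analogous second conditional moment defining the diffusion (cf. Eq. (50)), to a simulated one-dimensional Langevin process with drift D^{(1)}(x) = -x and constant diffusion D^{(2)} = g².

```python
import random
from collections import defaultdict

random.seed(2)

# Synthetic Langevin data dX = -X dt + g dW, i.e. drift D1(x) = -x and
# diffusion D2(x) = g^2; all parameters are illustrative.
g, dt, n = 0.5, 0.01, 400_000
x, xs = 0.0, []
for _ in range(n):
    x += -x * dt + g * (dt ** 0.5) * random.gauss(0.0, 1.0)
    xs.append(x)

# Partition the state space into small bins and collect the increments
# x(t + tau) - x(t) per bin, for the smallest available lag tau = dt.
width = 0.1
incr = defaultdict(list)
for i in range(n - 1):
    incr[round(xs[i] / width)].append(xs[i + 1] - xs[i])

est = {}
for b in (-2, 0, 2):                    # a few bins near the origin
    dx = incr[b]
    m1 = sum(dx) / len(dx)                          # first moment
    m2 = sum((d - m1) ** 2 for d in dx) / len(dx)   # second moment
    est[b] = (m1 / dt, m2 / dt)         # D1 and D2 estimates for this bin
    print(f"x = {b * width:+.1f}: D1 = {est[b][0]:+.3f} "
          f"(exact {-b * width:+.3f}), D2 = {est[b][1]:.3f} "
          f"(exact {g * g:.3f})")
```

A real analysis would repeat this for several lags τ and extrapolate τ → 0, as required by Eq. (61); here a single small lag already gives reasonable estimates.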
- Estimation of the diffusion matrix: The diffusion matrix can be estimated by the small-τ limit

D^{(2)}(x, t) = \lim_{\tau \to 0} \frac{1}{\tau} M^{(2)}(x, t, \tau)    (65)

of the conditional second moment

M^{(2)}(x_\alpha, t, \tau) = \frac{1}{N_\alpha} \sum_{x(t_j) \in \alpha} \left\{ [x(t_j + \tau) - x(t_j)] - D^{(1)}(x(t_j), t_j)\,\tau \right\}^2    (66)

Proof: We consider the quantity

M^{(2)}(x_\alpha, t, \tau) = \frac{\tau}{N_\alpha} \sum_{x(t_k) \in \alpha} \sum_{x(t_j) \in \alpha} g(x(t_k), t_k)\,\Gamma(t_k)\,\Gamma^T(t_j)\, g^T(x(t_j), t_j)    (67)

If the bin size is small compared to the scale on which the matrix g(x,t) varies significantly, we can replace g(x(t_k), t_k) by g(x_\alpha, t_k), such that

M^{(2)}(x_\alpha, t, \tau) = g(x_\alpha, t_k) \left[ \frac{\tau}{N_\alpha} \sum_{x(t_j) \in \alpha} \sum_{x(t_k) \in \alpha} \Gamma(t_k)\,\Gamma^T(t_j) \right] g^T(x_\alpha, t_k)    (68)

Using the assumption of statistical independence of the fluctuations,

\Gamma(t_k)\,\Gamma^T(t_j) = \delta_{kj}\, E

this yields

M^{(2)}(x_\alpha, t, \tau) = \tau\, g(x_\alpha, t_k)\, g^T(x_\alpha, t_k)    (69)

- Higher order cumulants: In a similar way one may estimate higher conditional moments M^{(n)}, which in the small-time limit converge to the so-called Kramers–Moyal coefficients. The estimation of these quantities allows one to answer the question whether the noise sources actually are Gaussian distributed random variables.

Technical Aspects
The above procedure for estimating drift vector and diffusion matrix explicitly shows the properties which limit the accuracy of the determined quantities. First of all, the bin size influences the results. The bin size should allow for a reasonable number of events, such that the sums converge; however, it should be reasonably small in order to allow for an accurate resolution of drift vector and diffusion matrix. Second, the data should allow for the estimation of the conditional moments in the limit τ → 0 [25,26]. Here, a finite Markov–Einstein coherence length may cause problems. Furthermore, measurement noise can spoil the possibility of performing this limit. From the investigation of the Fokker–Planck equation much is known about the τ-dependence of the conditional moments. This may be used for further improved estimations, as has been discussed in [27]. Furthermore, as we shall discuss below, extended estimation procedures have been devised which overcome the problems related to the small-τ limit.

Lévy Processes
A procedure to analyze Lévy processes along the same lines has been proposed in [28]. An important point here is the determination of the Lévy parameter α.

Selfconsistency
After determining the drift vector and the characteristics of the noise sources from data, it is straightforward to generate synthetic data sets by iterating the corresponding stochastic evolution equations. Subsequently, their statistical properties can be compared with the properties of the real-world data. This yields a self-consistent check of the obtained results.

Estimation of Drift and Diffusion from Sparsely Sampled Time Series
As we have discussed, the results of an analysis of data sets can be reconsidered self-consistently. This fact can be used to extend the procedure to data sets with an insufficient amount of data, or to sparsely sampled time series, for which the estimation of the conditional moments M^{(i)}(x, t, τ) and the subsequent limiting procedure τ → 0 cannot be performed accurately. In this case, one may proceed as follows. In a first step one obtains a zeroth-order approximation of the drift vector D^{(1)}(x) and the diffusion matrix D^{(2)}(x). Based on this estimate one makes, in a second step, a suitable ansatz for the drift vector and the diffusion matrix containing a set of free parameters σ,

D^{(1)}(x; \sigma), \quad D^{(2)}(x; \sigma)    (70)
defining a class of Langevin equations. Each Langevin equation defines a joint probability distribution

f(x_n, t_n; \ldots; x_1, t_1; \sigma)    (71)

where σ denotes the free parameters. This joint probability distribution can be compared with the experimental one, f(x_n, t_n; \ldots; x_1, t_1; exp). The best representative of the class of Langevin equations for the reconstruction of the experimental data is then obtained by minimizing a suitably defined distance between the two distributions:

\mathrm{Dist}\{ f(x_n, t_n; \ldots; x_1, t_1; \sigma),\; f(x_n, t_n; \ldots; x_1, t_1; exp) \} = \mathrm{Min}    (72)

A reasonable choice is the so-called Kullback–Leibler distance between the two distributions, defined as

K = \int dx_n \cdots dx_1\; f(x_n, t_n; \ldots; x_1, t_1; exp)\, \ln \frac{f(x_n, t_n; \ldots; x_1, t_1; exp)}{f(x_n, t_n; \ldots; x_1, t_1; \sigma)}    (73)
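A minimal sketch of this reconstruction by distance minimization, under simplifying assumptions: a one-parameter drift ansatz D^{(1)}(x; θ) = -θx with fixed diffusion is fitted by scanning θ and minimizing the Kullback–Leibler distance between binned one-point distributions (the article compares joint distributions; the one-point version keeps the sketch short). All names, parameters, and the grid are illustrative.

```python
import math
import random

def simulate(theta, n, seed):
    # Euler-Maruyama integration of dX = -theta * X dt + g dW
    rng = random.Random(seed)
    dt, g, x, out = 0.01, 0.5, 0.0, []
    for _ in range(n):
        x += -theta * x * dt + g * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def histogram(data, edges):
    # normalized histogram with a tiny floor to avoid log(0)
    counts = [1e-12] * (len(edges) - 1)
    lo, w = edges[0], edges[1] - edges[0]
    for v in data:
        j = math.floor((v - lo) / w)
        if 0 <= j < len(counts):
            counts[j] += 1
    tot = sum(counts)
    return [c / tot for c in counts]

def kullback_leibler(p_exp, p_mod):
    # binned version of the Kullback-Leibler distance, Eq. (73)
    return sum(p * math.log(p / q) for p, q in zip(p_exp, p_mod))

edges = [-1.5 + 0.25 * j for j in range(13)]
p_exp = histogram(simulate(theta=1.0, n=100_000, seed=10), edges)  # "experiment"

# scan the free parameter theta and keep the KL-minimizing model
best_kl, best_theta = min(
    (kullback_leibler(p_exp, histogram(simulate(t, 100_000, 11), edges)), t)
    for t in (0.5, 1.0, 2.0)
)
print(f"best theta = {best_theta}")
```

In practice one would replace the grid scan by a gradient-based optimizer such as the limited-memory BFGS scheme mentioned in the text.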
Recently, it has been shown how the iteration procedure can be obtained from maximum likelihood arguments. For more details, we refer the reader to [29,30]. A technical question concerns the determination of the minimum. In [31] the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm for constrained problems has been used for the solution of the optimization problem.

Applications: Processes in Time
The method outlined in the previous section has been used for revealing nonlinear deterministic behavior in a variety of problems ranging from physics, meteorology and biology to medicine. In most of these cases, alternative procedures with a strong emphasis on deterministic features have only been partly successful, due to their inappropriate treatment of dynamical fluctuations. The following list (with some exemplary citations) gives an overview of the investigated phenomena, which range from technical applications over many-particle systems to biological and geophysical systems:

- Chatter in cutting processes [32,33]
- Identification of bifurcations towards drifting solitary structures in a gas-discharge system [34,35,36,37,38]
- Electric circuits [17,39,40]
- Wind energy converters [41,42,43,44]
- Traffic flow [45]
- Inverse grading of granular flows [46]
- Heart rhythms [47,48,49,50]
- Tremor data [39]
- Epileptic brain dynamics [51]
- Meteorological data like El Niño [52,53,54]
- Earthquake prediction [15]

The main advantage of the stochastic data analysis method is its independence of modeling assumptions. It is purely data-driven and based on the mathematical features of Markov processes. As mentioned above, these properties can be verified and validated self-consistently. Before we proceed to consider some exemplary applications, we would like to add the following comment. The described analysis method cleans data from dynamical and measurement noise and provides the drift vector field, i.e. one obtains the underlying deterministic dynamical system. In turn, this system can be analyzed by the methods of nonlinear time series analysis: one can determine proper embeddings, Lyapunov exponents, dimensions, fixed points, stable and unstable limit cycles, etc. [55]. We want to point out that the determination of these quantities from uncleaned data is usually flawed by the presence of dynamical noise.

Synthetic Data: Potential Systems, Limit Cycles, Chaos
The above method of disentangling drift and dynamical noise has been tested on synthetically generated data. The investigated systems include the pitchfork bifurcation, the Van der Pol oscillator as an example of a nonlinear oscillator, as well as the noisy Lorenz equations as an example of a system exhibiting chaos [56,57]. Furthermore, it has been shown how one can analyze processes which additionally contain a time-periodic forcing [58]. This is of high interest for analyzing systems exhibiting the phenomenon of stochastic resonance. Quite recently, stochastic systems with time delay have been considered [59,60]. The results of these investigations may be summarized as follows: provided there is enough data and the data is well sampled, it is possible to extract the underlying deterministic dynamics and the strength of the fluctuations accurately.
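The pitchfork test case mentioned above can be reproduced in a few lines. The sketch below simulates the normal-form pitchfork system ẋ = x - x³ plus noise (an illustrative choice consistent with the systems named in the text, with made-up parameters) and checks that the bin-wise drift estimate recovers the bistable structure.

```python
import random
from collections import defaultdict

random.seed(4)

# Pitchfork-type Langevin system dX = (X - X^3) dt + g dW; the drift has
# stable fixed points at x = -1, +1 and an unstable one at x = 0.
g, dt, n = 0.5, 0.01, 300_000
x, xs = 0.0, []
for _ in range(n):
    x += (x - x ** 3) * dt + g * (dt ** 0.5) * random.gauss(0.0, 1.0)
    xs.append(x)

# bin-wise drift estimate, as in Eq. (62)
width = 0.2
sums, counts = defaultdict(float), defaultdict(int)
for i in range(n - 1):
    b = round(xs[i] / width)
    sums[b] += xs[i + 1] - xs[i]
    counts[b] += 1

drift = {round(b * width, 1): sums[b] / counts[b] / dt
         for b in sorted(sums) if counts[b] > 1000}

# the reconstructed drift should change sign at the three fixed points
for xc in (-1.2, -0.6, 0.6, 1.2):
    print(f"x = {xc:+.1f}: D1 = {drift[xc]:+.2f} (exact {xc - xc**3:+.2f})")
```

The sign pattern of the estimated drift (positive below the outer fixed points, negative above them) identifies the two stable states without any prior model assumption.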
Figure 3 summarizes what can be achieved for the example of the noisy Lorenz model. For a detailed discussion we refer the reader to [57].

Fluctuations, Importance of: Complexity in the View of Stochastic Processes, Figure 3 Time series of the stochastic Lorenz equation, from top to bottom: a original time series; b deterministic time series; c time series obtained from an integration of the reconstructed vector field; d reconstructed time series including noise. For details cf. [57]

Noisy Circuits
Next, we present the application of the method to data sets from experimental systems. As a first example, a chaotic electric circuit has been chosen. Its dynamics is formed by a damped oscillator with nonlinear energy support and additional dynamic noise terms. In this case, well-defined electric quantities are measured for which the dynamic equations are known. The measured time series are analyzed according to the numerical algorithm described above. Afterwards, the numerically determined results and the expected results according to the system's equations are compared. The dynamic equations of the electric circuit are given by the following set, whose deterministic part is known as the Shinriki oscillator [61]:

\dot X_1 = \left( \frac{1}{R_{NIC} C_1} - \frac{1}{R_1 C_1} \right) X_1 - \frac{f(X_1 - X_2)}{C_1} + \frac{1}{R_{NIC} C_1}\, \Gamma(t)    (74)
\phantom{\dot X_1} = g_1(X_1, X_2) + h_1 \Gamma(t)    (75)
\dot X_2 = \frac{f(X_1 - X_2)}{C_2} - \frac{X_3}{R_3 C_2} = g_2(X_1, X_2, X_3)    (76)
\dot X_3 = \frac{1}{L} (X_2 - X_3) = g_3(X_2, X_3)    (77)
X_1, X_2 and X_3 denote voltage terms, the R_i are resistor values, and L and C stand for inductance and capacitance values. The function f(X_1 - X_2) denotes the characteristic of the nonlinear element. The quantities X_i, characterizing the stochastic variables of the Shinriki oscillator with dynamical noise, were measured by means of a 12-bit A/D converter. Our analysis is based on the measurement of 100,000 data points [39]. The attractor of the noise-free dynamics is shown in Fig. 4.

Fluctuations, Importance of: Complexity in the View of Stochastic Processes, Figure 4 Trajectory of the Shinriki oscillator in phase space without noise; after [40]

The measured three-dimensional time series were analyzed as outlined above. The determined deterministic dynamics – expressed by the deterministic part of the evolution equations – corresponds to a vector field in the three-dimensional state space. For presentation of the results, cuts of lower dimension have been generated. Part a of Fig. 5 illustrates the vector field (g_1(x_1, x_2), g_2(x_1, x_2, x_3 = 0)) of the reconstructed deterministic parts affiliated with (75), (76). Furthermore, the one-dimensional curve g_1(x_1, x_2 = 0) is drawn in part b. In addition to the numerically determined results found by data analysis, the expected vector field and curve according to (75), (76) are shown for comparison. Good agreement can be recognized.
Fluctuations, Importance of: Complexity in the View of Stochastic Processes, Figure 5 Cuts of the function D^{(1)}(x) reconstructed from experimental data of the electric circuit, in comparison with the expected functions according to the known differential equations (75), (76). In part a the cut (g_1(X_1, X_2), g_2(X_1, X_2, X_3 = 0)) is shown as a two-dimensional vector field; thick arrows represent values determined by data analysis, thin arrows the theoretically expected values. In areas of the state space which the trajectory did not visit during the measurement, no estimated values for the functions are obtained. Part b shows the one-dimensional cut g_1(X_1, X_2 = 0); crosses represent values estimated numerically by data analysis, and the corresponding theoretical curve is shown as well; after [39]
Based on the reconstructed deterministic functions it is possible to reconstruct also the noisy part from the data set, see Subsect. "Direct Estimation of Stochastic Forces". This has been performed for the three-dimensionally embedded data, as well as for the case of two-dimensional embedding. From these reconstructed noise data, the autocorrelation was estimated. As shown in Fig. 6, correlated noise is obtained for a wrong embedding, indicating the violation of Markovian properties. In fact, such an approach can be used to verify the correct embedding of nonlinear noisy dynamical systems. We emphasize that, provided sufficient data is available, this check of correct embedding can also be performed locally in phase space, in order to find out where crossings of trajectories of the corresponding deterministic system take place. This procedure can be utilized to find the correct local embedding of the data.
Fluctuations, Importance of: Complexity in the View of Stochastic Processes, Figure 6 Autocorrelation function of the reconstructed dynamical noise: a for the correctly three-dimensionally embedded data, showing δ-correlated noise; b for the dynamics projected onto the two-dimensional phase space spanned by X_1(t) and X_3(t), showing finite-time correlations; after [17]
The electronic circuit of the Shinriki oscillator has also been investigated with two further perturbations. In [17] the reconstruction of the deterministic dynamics in the presence of additional measurement noise has been addressed. In [40] the Langevin noise has been replaced by a high-frequency periodic source, as shown in Fig. 7. Even for this case reasonable (correct) estimations of the deterministic part can be achieved.

Fluctuations, Importance of: Complexity in the View of Stochastic Processes, Figure 7 a Trajectory of the Shinriki oscillator in phase space with a sinusoidal force. b The corresponding trajectory in the phase space, compare Fig. 5b; after [40]

Many Particle Physics – Traffic Flow
Far from equilibrium, interacting many-particle systems exhibit collective macroscopic phenomena like pattern formation, which can be described by the concept of order parameters. In the following we shall exemplify, for the case of traffic flow, how complex behavior can be analyzed by means of the proposed method, leading to stochastic equations for the macroscopic description of the system. Traffic flow certainly is a collective phenomenon for which a huge amount of data is available. Measured quantities are the velocity v and the current q of cars passing a fixed line on the highway. Theoretical models of traffic flow are based on the so-called fundamental diagram, a type of material law for traffic flow relating current and velocity,

q = Q(v)    (78)

The special form of this relation has been much debated in recent years.
Fluctuations, Importance of: Complexity in the View of Stochastic Processes, Figure 8 Deterministic dynamics of the two-dimensional traffic states (q, v), given by the drift vector, for a the right lane of the freeway with vans and b all three lanes. Bold dots indicate stable fixed points, open dots saddle points; after [45]
It is tempting to describe the dynamics by the following set of stochastic difference equations:

v_{N+1} = G(v_N, q_N) + \xi_N
q_{N+1} = F(v_N, q_N) + \eta_N    (79)

Here, v_{N+1} and q_{N+1} are velocity and current of the (N+1)th car traversing the line, and ξ_N, η_N are noise terms with zero mean, which may depend on the variables v and q. The drift vector field D^{(1)} = [D^{(1)}_1(v, q), D^{(1)}_2(v, q)] has been determined in [45]. We point out that the meaning of the drift vector field does not depend on the assumption of ideal noise sources, as has been discussed above in Subsect. "Estimating the Short Time Propagator" and Subsect. "Noisy Circuits". The obtained drift vector field is depicted in Fig. 8.
The phase-space portrait shows interesting behavior. For the traffic data involving all three lanes there are three fixed points: two sinks separated by a saddle point. Furthermore, the arrows representing the drift vector field indicate the existence of an invariant manifold,

q = Q(v)    (80)

which has to be interpreted as the above-mentioned fundamental diagram of traffic flow. It is interesting to see the differences caused by separating the traffic dynamics into that of cars and that of vans. This has been roughly achieved by considering different lanes of the highway. In Figs. 8a and 9a the dynamics of the right lane, dominated by vans, is shown. It can clearly be seen that up to a speed of about 80 km/h a metastable plateau is present, corresponding to a quasi interaction-free dynamics of the vans. Up to now, the information contained in the noise terms has not led to clear new insights. For the discussion of further examples we refer the reader to the literature. In particular, we want to mention the analysis of the segregational dynamics of single particles of different sizes in avalanches [46], which can be treated along similar lines.

Applications: Processes in Scale
In the following we shall consider complex behavior in scale, as discussed in the introduction, by the method of stochastic processes. In order to statistically describe scale complexity in a comprehensive way, one has to study the joint probability distributions

f(q_N, l_N; q_{N-1}, l_{N-1}; \ldots; q_1, l_1; q_0, l_0)    (81)
of the local measure q at multiple scales l_i. To grasp relations across several scales (as given, for example, by "coherent" structures), all q_i tuples are taken at common locations x. In the case of statistical independence of q at different scales l_i, this joint probability density factorizes:

f(q_N, l_N; q_{N-1}, l_{N-1}; \ldots; q_0, l_0) = f(q_N, l_N) \cdots f(q_0, l_0)    (82)

In this case multifractal scaling (6) is a sufficient characterization of systems with scale complexity, provided the scaling property is given. If there is no factorization, one can use the joint probability distribution (81) to define a stochastic process in scale. Thus, one can identify the scale l with the time t and try to obtain a representation of the spatial disorder in the form of a stochastic process, similar to a stochastic process in time (see Subsect. "Statistical Description of Stochastic Processes" and Subsect. "Finite Time Propagators").

Fluctuations, Importance of: Complexity in the View of Stochastic Processes, Figure 9 Corresponding potentials for the deterministic dynamics given by the drift coefficients of a one-dimensional projection of the results in Fig. 8 [45]

For such problems our method has been used as an alternative description of multifractal behavior. The present method has the advantage of relating the random variables across different scales by introducing a conditional probability distribution or, in fact, a two-scale probability distribution. Scaling properties are no prerequisite. Provided that the Markovian property holds, which can be tested experimentally, a complete statistical characterization of the process across scales is obtained. The method has been used to characterize the complexity of data sets for the following systems (with some exemplary citations):

- Turbulent flows [20,62,63,64]
- Passive scalars in turbulent flows [21]
- Financial data [65,66,67,68]
- Surface roughness [69,70,71,72,73,74]
- Earthquakes [15]
- Cosmic background radiation [75]
The stochastic analysis of scale-dependent complex systems aims to achieve an n-scale characterization, from which the question arose whether it is possible to derive, from these stochastic processes, methods for generating synthetic time series. Based on the knowledge of the n-scale statistics, a way to estimate the n-point statistics has been proposed, which enables one to generate synthetic time series with the same n-scale statistics [76]. It is even possible to extend given data sets, an interesting subject for forecasting. Other approaches have been proposed in [70,77].

Turbulence and Financial Market Data
In the following we present results from investigations of fully developed turbulence and from data of the financial market. The complexity of turbulent fields still has not been understood in detail. Although the basic equations, namely the Navier–Stokes equations, have been known for more than 150 years, a general solution of these equations for high Reynolds numbers, i.e. for turbulence, is not known. Even with the use of powerful computers no rigorous solutions can be obtained. Thus, for a long time there has been the challenge to understand at least the complexity of an idealized turbulent situation, which is taken to be isotropic and homogeneous. The main problem is to formulate statistical laws based on the treatment of the deterministic evolution laws of fluid dynamics. A first approach is due to Kolmogorov [78], who formulated a phenomenological theory characterizing properties of turbulent fluid motions by statistical quantities. The central observable of Kolmogorov's theory is the so-called longitudinal velocity increment (which we label here with q) of a turbulent velocity field u(x, t), defined according to

q_x(l, t) = \frac{\mathbf{l}}{l} \cdot [\, \mathbf{u}(\mathbf{x} + \mathbf{l}, t) - \mathbf{u}(\mathbf{x}, t) \,]    (83)

A statistical description is given in terms of the probability distribution

f(q; l, t, x) = \langle \delta(q - q_x(l, t)) \rangle    (84)

For stationary, homogeneous and isotropic turbulence, this probability distribution is independent of the reference point x and the time t and, due to isotropy, depends only on the scale l = |l|. As a consequence, the central statistical quantity is the probability distribution f(q, l). Turbulent fields have been considered from the viewpoint of self-similarity, addressing the scaling behavior of the probability distribution f(q, l) and of its nth-order moments, the so-called structure functions ⟨q^n⟩. The multifractal scaling properties of the velocity increments, mentioned already in Eq. (7), are identical to the well-known intermittency problem¹, which manifests itself in the occurrence of heavy-tailed statistics, i.e. an unexpectedly high probability of extreme events. Kolmogorov and Oboukhov proposed the so-called intermittency correction [79]

\langle q(l, x)^n \rangle \sim l^{\zeta_n} \quad \text{with} \quad \zeta_n = \frac{n}{3} - \mu\, \frac{n(n-3)}{18} \quad (n > 2)    (85)

with 0.25 < μ < 0.5 (for further details see [80]). The form of ζ_n has been heavily debated during the last decades. For isotropic turbulence the central issue is to reveal the mechanism which leads to this anomalous statistics (see [80,81]).

¹ Here it should be noted that the term "intermittency" is used frequently in physics for different phenomena and may cause confusion. This turbulent intermittency is not equal to the intermittency of chaos. There are also different intermittency phenomena introduced for turbulence: the intermittency due to the nonlinear scaling, the intermittency of switches between turbulent and laminar flows in non-locally-isotropic fully developed turbulent flows, and the intermittency due to the statistics of small-scale turbulence, which we discuss here as heavy-tailed statistics.

A completely different point of view on the properties of turbulent fields is gained by interpreting the probability distribution as the distribution of a random process in the scale l [62,63]. It is tempting to hypothesize the existence of a stochastic process (see Subsect. "Stochastic Data Analysis"),

q(l + dl) = q(l) + N(q, l)\, dl + g(q, l)\, dW(l)    (86)

where dW(l) is the increment of a random process. This type of stochastic equation indicates how the velocity increment of a snapshot of the turbulent field changes as a function of the scale l. In this respect, the process q(l) can be considered a stochastic cascade process in the "time" l. This concept of complexity in scale can be carried over to other systems, like the roughness of surfaces or financial data. In the latter case the scale variable l is replaced by the time distance or time scale τ.

Anomalous Statistics
A direct consequence of multifractal scaling, related with the nonlinear behavior of the scaling exponents ζ_n, is the fact
Fluctuations, Importance of: Complexity in the View of Stochastic Processes, Figure 10
Comparison of the numerical solution of the Fokker–Planck equation (solid lines) for the pdfs f(q(x); l) with the pdfs obtained directly from the experimental data (bold symbols). The scales l are (from top to bottom): l = L0, 0.6L0, 0.35L0, 0.2L0 and 0.1L0. The distribution at the largest scale L0 was parametrized (dashed line) and used as initial condition for the Fokker–Planck equation (L0 is the correlation length of the turbulent velocity signal). The pdfs have been shifted in vertical direction for clarity of presentation and all pdfs have been normalized to their standard deviations; after [20]

that the shape of the probability distribution f(q; l) has to change as a function of scale. A self-similar behavior of the form (4) would lead to fractal scaling behavior, as outlined in Eq. (5). Using experimental data from a turbulent flow, this change of the shape of the pdf becomes obvious. In Fig. 10 we present f(q; l) for a data set measured on the center line of a free jet with a Reynolds number of 2.7 × 10^4, see [20]. Note that for large scales (l ≈ L0) the distributions become nearly Gaussian. As the scale is decreased, the probability densities become more and more heavy tailed.

Quite astonishingly, the anomalous statistical features of data from the financial market are similar to the just discussed intermittency of turbulence [82]. The following analysis is based on a data set Y(t), which consists of 1 472 241 quotes for US dollar-German Mark exchange rates from the years 1992 and 1993. Many of the features we will discuss here are also found in other financial data, for instance quotes of stocks, see [83]. A central issue is the understanding of the statistics of price changes over a certain time interval τ, which determines losses and gains. The changes of a time series of quotations Y(t) are commonly measured by returns r(τ; t) := Y(t + τ)/Y(t), by logarithmic returns, or by increments q(τ; t) := Y(t + τ) − Y(t) [84]. The moments of these quantities often exhibit power law behavior similar to the just discussed Kolmogorov scaling for turbulence, cf. [85,86,87]. For the probability distributions one additionally observes an increasing tendency towards heavy tailed probability distributions for small τ (see Fig. 11). This represents the high frequency dynamics of the financial market. The identification of the underlying process leading to these heavy tailed probability density functions of price changes is a prominent puzzle (see [86,87,88,89,90,91]), like it is for turbulence.

Stochastic Cascade Process for Scale

The occurrence of the heavy tailed probability distributions on small scales will be discussed as a consequence of a stochastic process evolving in scale, using the above mentioned methods. Guided by the finding that the statistics changes with scale, as shown in Figs. 10 and 11, we consider the evolution of the quantity q(l; x), or q(τ; t), with the scale variable l or τ, respectively. For a single fixed scale l we get the scale dependent disorder from the statistics of q(l; x). The complete stochastic information of the disorder on all length scales is given by the joint probability density function

f(q_1, ..., q_n) ,   (87)

where we set q_i = q(l_i; x). Without loss of generality we take l_i < l_{i+1}. This joint probability may be seen in analogy to joint probabilities of statistical mechanics (thermodynamics), describing in the most general way the occupation probabilities of the microstates of n particles, where q is a six-dimensional phase space vector (space and momentum). Next, the question is whether it is possible to simplify the joint probability by conditional probabilities:

f(q_1, ..., q_n) = p(q_1 | q_2, ..., q_n) · p(q_2 | q_3, ..., q_n) ··· p(q_{n−1} | q_n) · f(q_n) ,   (88)

where the multiple conditioned probabilities are given by

p(q_i | q_{i+1}, ..., q_n) = p(q_i | q_{i+1}) .   (89)
Fluctuations, Importance of: Complexity in the View of Stochastic Processes, Figure 11
Probability densities (pdf) f(q; τ) of the price changes q(τ; t) = Y(t + τ) − Y(t) for the time delays τ = 5120, 10240, 20480, 40960 s (from bottom to top). Symbols: results obtained from the analysis of middle prices of bid-ask quotes for the US dollar-German Mark exchange rates from October 1st, 1992 until September 30th, 1993. Full lines: results from a numerical iteration of the Fokker–Planck equation (95); the probability distribution for τ = 40960 s (dashed lines) was taken as the initial condition. The pdfs have been shifted in vertical direction for clarity of presentation and all pdfs have been normalized to their standard deviations; after [66]
Eq. (89) is nothing else than the condition for a Markov process evolving from state q_{i+1} to state q_i, i.e. from scale l_{i+1} to l_i, as introduced above, see Eq. (35). A "single particle approximation" would correspond to the representation

f(q_1, ..., q_n) = f(q_1) · f(q_2) ··· f(q_n) .   (90)
Fluctuations, Importance of: Complexity in the View of Stochastic Processes, Figure 12
Comparison of the numerical solution of the Fokker–Planck equation for the conditional pdf p(q, l | q0, l0), denoted in this figure as p(v, r | v0, r0), with the experimental data. a: Contour plots of p(v, r | v0, r0) for r0 = L and r = 0.6L, where L denotes the integral length. Dashed lines: numerical solution of (94); solid lines: experimental data. b and c: Cuts through p(v, r | v0, r0) for v0 = +1 and v0 = −1, respectively. Open symbols: experimental data; solid lines: numerical solution of the Fokker–Planck equation; after [20]

According to Eqs. (88) and (89), Eq. (90) holds if p(q_i | q_{i+1}) = f(q_i). Only in this case is the knowledge of f(q_i) sufficient to characterize the complexity of the whole system. Otherwise an analysis of the single scale probability distribution f(q_i) is an incomplete description of the complexity of the whole system. This is the deficiency of approaches that characterize complex structures by means of fractality or multiaffinity (cf. for multiaffinity [92], for turbulence [80,81], for the financial market [85]). The scaling analysis of moments, as indicated for turbulence in Eq. (85), provides complete knowledge of the joint n-scale probability density only if Eq. (90) is valid. These remarks underline the necessity to investigate the conditional probabilities, which can be done in a straightforward manner from given experimental or numerical data. For turbulence as well as for financial data we see that p(q_i | q_{i+1}) does not coincide with f(q_i), as shown for turbulence data in Fig. 12. If p(q_i | q_{i+1}) = f(q_i) held, no dependency on q_{i+1} would be detected.
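Whether p(q_i | q_{i+1}) deviates from f(q_i) can be checked directly on any data series. The sketch below is a deliberately minimal version of such a check (the function names are ours, and for brevity it conditions only on the sign of the larger-scale increment rather than on fine bins, as is done for the full conditional pdfs in Fig. 12):

```python
import numpy as np

def increments(y, l):
    """Increments q(l; x) = y(x + l) - y(x) at integer scale l."""
    return y[l:] - y[:-l]

def conditional_vs_marginal(y, l_small, l_large):
    """Condition the small-scale increment on the sign of the
    large-scale increment. If p(q_i | q_{i+1}) equaled f(q_i), both
    conditional means would agree with the marginal mean."""
    n = len(y) - l_large
    q_s = increments(y, l_small)[:n]   # smaller-scale increment
    q_l = increments(y, l_large)[:n]   # larger-scale increment
    mean_neg = q_s[q_l < 0].mean()
    mean_pos = q_s[q_l >= 0].mean()
    return mean_neg, mean_pos, q_s.mean()
```

Even for plain Brownian motion the nested increments are correlated, so the two conditional means split symmetrically around the marginal one; for turbulence or exchange-rate data the deviation of the full conditional pdfs from the single-scale pdf is what Fig. 12 quantifies.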
The next point of interest is whether the Markov properties are fulfilled. Therefore doubly conditioned probabilities were extracted from the data and compared to the singly conditioned ones. For financial data as well as for turbulence we found evidence that the Markov property is fulfilled if the step size is larger than a Markov–Einstein length [20,66]. For turbulence it could be shown that the Markov–Einstein length coincides with the Taylor length marking the small scale end of the inertial range [14]. (An extensive discussion of the analysis of financial and turbulent data can be found in [20,65,66,93].) Based on the fact that the multiply conditioned probabilities are equal to the singly conditioned probabilities, and taking this as a verification of Markovian properties, we can proceed according to Subsect. "Estimating the Short Time Propagator" and estimate from given data sets the stochastic equations underlying the cascade process. The evolution of the conditional probability density p(q, l | q0, l0), starting from a selected scale l0, follows

−l ∂/∂l p(q, l | q0, l0) = Σ_{k=1}^{∞} (1/k!) (−∂/∂q)^k [ D^(k)(q, l) p(q, l | q0, l0) ] .   (91)
(The minus sign on the left side is introduced because we consider processes running towards smaller scales l; furthermore, we multiply the stochastic equation by l, which leads to a new parametrization of the cascade by the variable ln(1/l), a simplification for a process whose moments show scaling law behavior.) This equation is known as the Kramers–Moyal expansion [8]. As outlined in Subsect. "Finite Time Propagators" and Subsect. "Estimating the Short Time Propagator", the Kramers–Moyal coefficients D^(k)(q, l) are now defined as the limit Δl → 0 of the conditional moments M^(k)(q, l, Δl):

D^(k)(q, l) = lim_{Δl → 0} M^(k)(q, l, Δl) ,   (92)

M^(k)(q, l, Δl) := (l/Δl) ∫_{−∞}^{+∞} (q̃ − q)^k p(q̃, l − Δl | q, l) dq̃ .   (93)
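In practice the conditional moments of Eq. (93) are estimated by binning the conditioning increment; a minimal sketch follows (function and parameter names are ours, and the limit Δl → 0 of Eq. (92) is taken by repeating the estimate for several small Δl and extrapolating, which is omitted here):

```python
import numpy as np

def increments(y, l):
    """Increments q(l; x) = y(x + l) - y(x) at integer scale l."""
    return y[l:] - y[:-l]

def conditional_moments(y, l, dl, k_max=4, n_bins=30):
    """Estimate M^(k)(q, l, dl) of Eq. (93) by binning the
    conditioning value q(l). Returns the bin centers and a dict
    {k: M^(k) evaluated on those bins}."""
    n = len(y) - l
    q_large = increments(y, l)[:n]        # increment at scale l
    q_small = increments(y, l - dl)[:n]   # increment at scale l - dl
    edges = np.linspace(q_large.min(), q_large.max(), n_bins + 1)
    centers = 0.5 * (edges[1:] + edges[:-1])
    idx = np.digitize(q_large, edges) - 1
    M = {k: np.full(n_bins, np.nan) for k in range(1, k_max + 1)}
    for b in range(n_bins):
        sel = idx == b
        if sel.sum() < 10:                # skip poorly populated bins
            continue
        diff = q_small[sel] - q_large[sel]   # (q~ - q) in Eq. (93)
        for k in M:
            M[k][b] = (l / dl) * np.mean(diff ** k)
    return centers, M
```

The drift and diffusion estimates D^(1), D^(2) then follow by extrapolating M^(1), M^(2) to Δl → 0 bin by bin.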
Thus, for the estimation of the D^(k) coefficients it is only necessary to estimate the conditional probabilities p(q̃, l − Δl | q, l). For a general stochastic process, all Kramers–Moyal coefficients are different from zero. According to Pawula's theorem, however, the Kramers–Moyal expansion stops after the second term, provided that the fourth order coefficient D^(4)(q, l) vanishes. In that case the Kramers–Moyal expansion reduces to a Fokker–Planck equation:

−l ∂/∂l p(q, l | q0, l0) = [ −∂/∂q D^(1)(q, l) + (1/2) ∂²/∂q² D^(2)(q, l) ] p(q, l | q0, l0) .   (94)
D^(1) is denoted as the drift term and D^(2) as the diffusion term, now for the cascade process. The probability density function f(q, l) has to obey the same equation:

−l ∂/∂l f(q, l) = [ −∂/∂q D^(1)(q, l) + (1/2) ∂²/∂q² D^(2)(q, l) ] f(q, l) .   (95)
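Equation (95) can be integrated numerically from the large-scale pdf down to small scales, which is how the model curves in Figs. 10 and 11 are produced in [20,66]. The sketch below uses a deliberately simple explicit scheme with assumed illustrative coefficients D^(1) = −γq and D^(2) = α + βq² (the values of γ, α, β below are not fitted ones):

```python
import numpy as np

GAMMA, ALPHA, BETA = 0.5, 0.02, 0.1   # illustrative, not fitted values

def evolve_pdf(q, f0, lam_max=1.0, dlam=1e-4):
    """Integrate Eq. (95), -l d_l f = [-d_q D1 + 0.5 d_q^2 D2] f,
    in lam = ln(l0/l) with explicit Euler steps (demonstration only;
    production work would use an implicit or spectral scheme)."""
    dq = q[1] - q[0]
    D1 = -GAMMA * q
    D2 = ALPHA + BETA * q**2
    f = f0.copy()
    for _ in range(int(lam_max / dlam)):
        drift = -np.gradient(D1 * f, dq)
        diffusion = np.gradient(np.gradient(0.5 * D2 * f, dq), dq)
        f = np.clip(f + dlam * (drift + diffusion), 0.0, None)
        f /= f.sum() * dq   # keep the pdf normalized
    return f
```

Starting from a Gaussian at the large scale, the quadratic part of D^(2) (multiplicative noise) fattens the tails as lam grows, reproducing the qualitative trend of Figs. 10 and 11.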
The novel point of our analysis here is that, knowing the evolution equation (91), the n-increment statistics f(q_1, ..., q_n) can be retrieved as well. Clearly, information such as the scaling behavior of the moments of q(l; x) can also be extracted from the knowledge of the process equations. Multiplying (91) by q^n and integrating by parts over q, an equation for the moments is obtained:

−l ∂/∂l ⟨q^n⟩ = Σ_{k=1}^{n} [ n! / (k! (n−k)!) ] ⟨ D^(k)(q, l) q^{n−k} ⟩ .   (96)
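As a worked consequence of Eq. (96), assume (purely for illustration) the power-law forms D^(1) = −γq and D^(2) = βq²; then only the terms k = 1, 2 contribute and the moments obey multifractal scaling:

```latex
% Illustrative coefficients (assumed, not fitted): D^{(1)} = -\gamma q,\; D^{(2)} = \beta q^2
-l\,\partial_l \langle q^n\rangle
  = n\,\langle D^{(1)} q^{\,n-1}\rangle
  + \frac{n(n-1)}{2}\,\langle D^{(2)} q^{\,n-2}\rangle
  = \Bigl(-n\gamma + \frac{n(n-1)}{2}\,\beta\Bigr)\langle q^n\rangle
\quad\Longrightarrow\quad
\langle q^n\rangle \propto l^{\,\xi_n},
\qquad \xi_n = n\gamma - \frac{n(n-1)}{2}\,\beta .
```

The exponents ξ_n are nonlinear in n, i.e. multifractal; an additive part in D^(2) destroys this closed scaling solution, which is exactly the point made below for the turbulence data.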
Scaling, i.e. multiaffinity as described in Eq. (7), is obtained if D^(k)(q, l) ∝ q^k, see [71,94]. We summarize: by the described procedure we were able to reconstruct stochastic processes in scale directly from given data. Knowing these processes, one can perform numerical solutions in order to obtain a self-consistent check of the procedure (see [20,66]). In Figs. 10, 11 and 12 the numerical solutions are shown by solid (dashed) curves. The heavy tailed structure of the single scale probabilities as well as the conditional probabilities are well described by this approach based on a Fokker–Planck equation. Further improvements can be achieved by the optimization procedures mentioned in Subsect. "Estimation of Drift and Diffusion from Sparsely Sampled Time Series".

New Insights

The fact that the complexity of financial market data as well as turbulent data can be expressed by a Markovian
process in scale has the consequence that the conditional probabilities involving only two different scales are sufficient for the general n-scale statistics. This indicates that three- or four-point correlations are sufficient for the formulation of the n-point statistics. The finding of a finite length scale above which the Markov properties are fulfilled [14] has led to a new interpretation of the Taylor length of turbulence, which so far had no specific physical meaning. For financial as well as for turbulent data it has been found that the diffusion term is quadratic in the state space of the scale resolved variable. With respect to the corresponding Langevin equation, the multiplicative nature of the noise term becomes evident; it is this multiplicative noise that causes heavy tailed probability densities and multifractal scaling. The scale dependency of the drift and diffusion terms corresponds to a non-stationary process in the scale variables τ and l, respectively. From this we conclude that Lévy statistics for one fixed scale, i.e. for the statistics of q(l; x) at fixed l, cannot be an adequate class for the statistical description of financial and turbulent data. Comparing the maxima of the distributions at small scales in Figs. 10 and 11, one finds a less sharp tip for the turbulence data. This finding is in accordance with a comparably larger additive contribution in the diffusion term D^(2) = a + b q² for the turbulence data. Knowing that D^(2) has an additive term and a quadratic q dependence, it is clear that for small q values, i.e. for the tip of the distribution, the additive term dominates. Taking this result in combination with the Langevin equation, we see that for small q values Gaussian noise is present, which leads to a Gaussian tip of the probability distribution, as found in Fig. 10. A further consequence of the additive term in D^(2) is that the structure functions as given by Eq. (96) are not independent and a general scaling solution does not exist.
This fact has been confirmed by optimizing the coefficients [31]. This nonscaling behavior of turbulence seems to be present also at higher Reynolds numbers; it has been found that the additive term becomes smaller but still remains relevant [95]. The cascade processes encoded in the functions D^(1) and D^(2) seem to depend on the Reynolds number, which might indicate that turbulence is less universal than commonly thought. As has been outlined above, it is straightforward to extend the analysis to higher dimensions. For the case of longitudinal and transversal increments, a symmetry between these two different directions of the complex velocity field has been found in the sense that the cascade process runs with different speeds; the factor is 3/2 [64,96], which again is not in accordance with the proposed multifractal scaling property of turbulence.
The investigation of the advection of a passive scalar in a turbulent flow along the lines of the present method has revealed the interesting result that the Markovian properties are fulfilled but that higher order Kramers–Moyal coefficients are non-negligible [21]. This indicates that for passive scalars non-Gaussian noise is present, which can be attributed to the existence of shock-like structures in the distribution of the passive scalar.

Future Directions

The description of complex systems on the basis of stochastic processes, which include nonlinear dynamics, seems to be a promising approach for the future. The challenge will be to extend the understanding to more complicated processes, such as Lévy processes, processes with non-white noise, or higher dimensional processes, to mention just a few. As has been shown in this contribution, for these cases it should be possible to derive general methods of data series analysis from precise mathematical results, too. Besides the further improvement of the method, we are convinced that there is still a wide range of further applications. Advanced sensor techniques enable scientists to collect huge data sets measured with high precision. Based on the stochastic approach we have presented here, it is no longer necessary to put much effort into noise reduction; on the contrary, the involved noise can help to derive a better characterization and thus a better understanding of the system considered. Thus there seem to be many applications in the inanimate and the animate world, ranging from technical applications over socio-economic systems to biomedical applications. An interesting feature will be the extraction of higher correlation aspects, like the question of the cause and effect chain, which may be unfolded by asymmetric determinism and noise terms reconstructed from data.

Further Reading

For further reading we suggest the publications [1,2,3,4,8,9,10,11].
Acknowledgment

The scientific results reported in this review have been worked out in close collaboration with many colleagues and students. We mention St. Barth, F. Böttcher, F. Ghasemi, I. Grabec, J. Gradisek, M. Haase, A. Kittel, D. Kleinhans, St. Lück, A. Nawroth, Chr. Renner, M. Siefert, and S. Siegert.
Bibliography

1. Haken H (1983) Synergetics, An Introduction. Springer, Berlin
2. Haken H (1987) Advanced Synergetics. Springer, Berlin
3. Haken H (2000) Information and Self-Organization: A Macroscopic Approach to Complex Systems. Springer, Berlin
4. Kantz H, Schreiber T (1997) Nonlinear Time Series Analysis. Cambridge University Press, Cambridge
5. Yanovsky VV, Chechkin AV, Schertzer D, Tur AV (2000) Physica A 282:13
6. Schertzer D, Larchevéque M, Duan J, Yanovsky VV, Lovejoy S (2001) J Math Phys 42:200
7. Gnedenko BV, Kolmogorov AN (1954) Limit distributions of sums of independent random variables. Addison-Wesley, Cambridge
8. Risken H (1989) The Fokker-Planck Equation. Springer, Berlin
9. Gardiner CW (1983) Handbook of Stochastic Methods. Springer, Berlin
10. van Kampen NG (1981) Stochastic processes in physics and chemistry. North-Holland Publishing Company, Amsterdam
11. Hänggi P, Thomas H (1982) Stochastic processes: time evolution, symmetries and linear response. Phys Rep 88:207
12. Einstein A (1905) Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen. Ann Phys 17:549
13. Friedrich R, Zeller J, Peinke J (1998) A Note on Three Point Statistics of Velocity Increments in Turbulence. Europhys Lett 41:153
14. Lück S, Renner Ch, Peinke J, Friedrich R (2006) The Markov–Einstein coherence length – a new meaning for the Taylor length in turbulence. Phys Lett A 359:335
15. Tabar MRR, Sahimi M, Ghasemi F, Kaviani K, Allamehzadeh M, Peinke J, Mokhtari M, Vesaghi M, Niry MD, Bahraminasab A, Tabatabai S, Fayazbakhsh S, Akbari M (2007) Short-Term Prediction of Medium- and Large-Size Earthquakes Based on Markov and Extended Self-Similarity Analysis of Seismic Data. In: Bhattacharyya P, Chakrabarti BK (eds) Modelling Critical and Catastrophic Phenomena in Geoscience. Lecture Notes in Physics, vol 705. Springer, Berlin, pp 281–301
16. Kolmogorov AN (1931) Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung. Math Ann 104:415
17. Siefert M, Kittel A, Friedrich R, Peinke J (2003) On a quantitative method to analyze dynamical and measurement noise. Europhys Lett 61:466
18. Böttcher F, Peinke J, Kleinhans D, Friedrich R, Lind PG, Haase M (2006) On the proper reconstruction of complex dynamical systems spoilt by strong measurement noise. Phys Rev Lett 97:090603
19. Kleinhans D, Friedrich R, Wächter M, Peinke J (2007) Markov properties under the influence of measurement noise. Phys Rev E 76:041109
20. Renner C, Peinke J, Friedrich R (2001) Experimental indications for Markov properties of small scale turbulence. J Fluid Mech 433:383
21. Tutkun M, Mydlarski L (2004) Markovian properties of passive scalar increments in grid-generated turbulence. New J Phys 6:49
22. Marcq P, Naert A (2001) A Langevin equation for turbulent velocity increments. Phys Fluids 13:2590
23. Langner M, Peinke J, Rauh A (2008) A Langevin analysis with application to a Rayleigh–Bénard convection experiment. Exp Fluids (submitted)
24. Wächter M, Kouzmitchev A, Peinke J (2004) Increment definitions for scale-dependent analysis of stochastic data. Phys Rev E 70:055103(R)
25. Ragwitz M, Kantz H (2001) Indispensable finite time corrections for Fokker-Planck equations from time series data. Phys Rev Lett 87:254501
26. Ragwitz M, Kantz H (2002) Comment on: Indispensable finite time corrections for Fokker-Planck equations from time series data – Reply. Phys Rev Lett 89:149402
27. Friedrich R, Renner C, Siefert M, Peinke J (2002) Comment on: Indispensable finite time corrections for Fokker-Planck equations from time series data. Phys Rev Lett 89:149401
28. Siegert S, Friedrich R (2001) Modeling nonlinear Lévy processes by data analysis. Phys Rev E 64:041107
29. Kleinhans D, Friedrich R, Nawroth AP, Peinke J (2005) An iterative procedure for the estimation of drift and diffusion coefficients of Langevin processes. Phys Lett A 346:42
30. Kleinhans D, Friedrich R (2007) Note on Maximum Likelihood estimation of drift and diffusion functions. Phys Lett A 368:194
31. Nawroth AP, Peinke J, Kleinhans D, Friedrich R (2007) Improved estimation of Fokker-Planck equations through optimisation. Phys Rev E 76:056102
32. Gradisek J, Grabec I, Siegert S, Friedrich R (2002) Stochastic dynamics of metal cutting: Bifurcation phenomena in turning. Mech Syst Signal Process 16(5):831
33. Gradisek J, Siegert S, Friedrich R, Grabec I (2002) Qualitative and quantitative analysis of stochastic processes based on measured data I: Theory and applications to synthetic data. J Sound Vib 252(3):545
34. Purwins HG, Amiranashvili S (2007) Selbstorganisierte Strukturen im Strom. Phys J 6(2):21
35. Bödeker HU, Röttger M, Liehr AW, Frank TD, Friedrich R, Purwins HG (2003) Noise-covered drift bifurcation of dissipative solitons in planar gas-discharge systems. Phys Rev E 67:056220
36. Purwins HG, Bödeker HU, Liehr AW (2005) In: Akhmediev N, Ankiewicz A (eds) Dissipative Solitons. Springer, Berlin
37. Bödeker HU, Liehr AW, Frank TD, Friedrich R, Purwins HG (2004) Measuring the interaction law of dissipative solitons. New J Phys 6:62
38. Liehr AW, Bödeker HU, Röttger M, Frank TD, Friedrich R, Purwins HG (2003) Drift bifurcation detection for dissipative solitons. New J Phys 5:89
39. Friedrich R, Siegert S, Peinke J, Lück S, Siefert M, Lindemann M, Raethjen J, Deuschl G, Pfister G (2000) Extracting model equations from experimental data. Phys Lett A 271:217
40. Siefert M, Peinke J (2004) Reconstruction of the Deterministic Dynamics of Stochastic Systems. Int J Bifurc Chaos 14:2005
41. Anahua E, Lange M, Böttcher F, Barth S, Peinke J (2004) Stochastic Analysis of the Power Output for a Wind Turbine. DEWEK 2004, Wilhelmshaven, 20–21 October 2004
42. Anahua E, Barth S, Peinke J (2006) Characterization of the wind turbine power performance curve by stochastic modeling. EWEC 2006, BL3.307, Athens, February 27–March 2
43. Anahua E, Barth S, Peinke J (2007) Characterisation of the power curve for wind turbines by stochastic modeling. In: Peinke J, Schaumann P, Barth S (eds) Wind Energy – Proceedings of the Euromech Colloquium. Springer, Berlin, pp 173–177
44. Anahua E, Barth S, Peinke J (2008) Markovian Power Curves for Wind Turbines. Wind Energy 11:219
45. Kriso S, Friedrich R, Peinke J, Wagner P (2002) Reconstruction of dynamical equations for traffic flow. Phys Lett A 299:287
46. Kern M, Buser O, Peinke J, Siefert M, Vulliet L (2005) Stochastic analysis of single particle segregational dynamics. Phys Lett A 336:428
47. Kuusela T (2004) Stochastic heart-rate model can reveal pathologic cardiac dynamics. Phys Rev E 69:031916
48. Ghasemi F, Peinke J, Reza Rahimi Tabar M, Muhammed S (2006) Statistical properties of the interbeat interval cascade in human subjects. Int J Mod Phys C 17:571
49. Ghasemi F, Sahimi M, Peinke J, Reza Rahimi Tabar M (2006) Analysis of Non-stationary Data for Heart-rate Fluctuations in Terms of Drift and Diffusion Coefficients. J Biol Phys 32:117
50. Tabar MRR, Ghasemi F, Peinke J, Friedrich R, Kaviani K, Taghavi F, Sadghi S, Bijani G, Sahimi M (2006) New computational approaches to analysis of interbeat intervals in human subjects. Comput Sci Eng 8:54
51. Prusseit J, Lehnertz K (2007) Stochastic Qualifiers of Epileptic Brain Dynamics. Phys Rev Lett 98:138103
52. Sura P, Gille ST (2003) Interpreting wind-driven Southern Ocean variability in a stochastic framework. J Marine Res 61:313
53. Sura P (2003) Stochastic Analysis of Southern and Pacific Ocean Sea Surface Winds. J Atmos Sci 60:654
54. Egger J, Jonsson T (2002) Dynamic models for islandic meteorological data sets. Tellus A 51(1):1
55. Letz T, Peinke J, Kittel A (2008) How to characterize chaotic time series distorted by interacting dynamical noise. Preprint
56. Siegert S, Friedrich R, Peinke J (1998) Analysis of data sets of stochastic systems. Phys Lett A 243:275–280
57. Gradisek J, Siegert S, Friedrich R, Grabec I (2000) Analysis of time series from stochastic processes. Phys Rev E 62:3146
58. Gradisek J, Friedrich R, Govekar E, Grabec I (2002) Analysis of data from periodically forced stochastic processes. Phys Lett A 294:234
59. Frank TD, Beek PJ, Friedrich R (2004) Identifying noise sources of time-delayed feedback systems. Phys Lett A 328:219
60. Patanarapeelert K, Frank TD, Friedrich R, Beek PJ, Tang IM (2006) A data analysis method for identifying deterministic components of stable and unstable time-delayed systems with colored noise. Phys Lett A 360:190
61. Shinriki M, Yamamoto M, Mori S (1981) Multimode Oscillations in a Modified Van-der-Pol Oscillator Containing a Positive Nonlinear Conductance. Proc IEEE 69:394
62. Friedrich R, Peinke J (1997) Statistical properties of a turbulent cascade. Physica D 102:147
63. Friedrich R, Peinke J (1997) Description of a turbulent cascade by a Fokker-Planck equation. Phys Rev Lett 78:863
64. Siefert M, Peinke J (2006) Joint multi-scale statistics of longitudinal and transversal increments in small-scale wake turbulence. J Turbul 7:1
65. Friedrich R, Peinke J, Renner C (2000) How to quantify deterministic and random influences on the statistics of the foreign exchange market. Phys Rev Lett 84:5224
66. Renner C, Peinke J, Friedrich R (2001) Markov properties of high frequency exchange rate data. Physica A 298:499–520
67. Ghasemi F, Sahimi M, Peinke J, Friedrich R, Reza Jafari G, Reza Rahimi Tabar M (2007) Analysis of Nonstationary Stochastic Processes with Application to the Fluctuations in the Oil Price. Phys Rev E (Rapid Commun) 75:060102
68. Farahpour F, Eskandari Z, Bahraminasab A, Jafari GR, Ghasemi F, Reza Rahimi Tabar M, Sahimi M (2007) An Effective Langevin Equation for the Stock Market Indices in Approach of Markov Length Scale. Physica A 385:601
69. Wächter M, Riess F, Kantz H, Peinke J (2003) Stochastic analysis of road surface roughness. Europhys Lett 64:579
70. Jafari GR, Fazeli SM, Ghasemi F, Vaez Allaei SM, Reza Rahimi Tabar M, Iraji zad A, Kavei G (2003) Stochastic Analysis and Regeneration of Rough Surfaces. Phys Rev Lett 91:226101
71. Friedrich R, Galla T, Naert A, Peinke J, Schimmel T (1998) Disordered Structures Analyzed by the Theory of Markov Processes. In: Parisi J, Müller S, Zimmermann W (eds) A Perspective Look at Nonlinear Media. Lecture Notes in Physics, vol 503. Springer, Berlin
72. Waechter M, Riess F, Schimmel T, Wendt U, Peinke J (2004) Stochastic analysis of different rough surfaces. Eur Phys J B 41:259
73. Sangpour P, Akhavan O, Moshfegh AZ, Jafari GR, Reza Rahimi Tabar M (2005) Controlling Surface Statistical Properties Using Bias Voltage: Atomic force microscopy and stochastic analysis. Phys Rev B 71:155423
74. Jafari GR, Reza Rahimi Tabar M, Iraji zad A, Kavei G (2007) Etched Glass Surfaces, Atomic Force Microscopy and Stochastic Analysis. J Phys A 375:239
75. Ghasemi F, Bahraminasab A, Sadegh Movahed M, Rahvar S, Sreenivasan KR, Reza Rahimi Tabar M (2006) Characteristic Angular Scales of Cosmic Microwave Background Radiation. J Stat Mech P11008
76. Nawroth AP, Peinke J (2006) Multiscale reconstruction of time series. Phys Lett A 360:234
77. Ghasemi F, Peinke J, Sahimi M, Reza Rahimi Tabar M (2005) Regeneration of Stochastic Processes: An Inverse Method. Eur Phys J B 47:411
78. Kolmogorov AN (1941) Dissipation of energy in locally isotropic turbulence. Dokl Akad Nauk SSSR 32:19
79. Kolmogorov AN (1962) A refinement of previous hypotheses concerning the local structure of turbulence in a viscous incompressible fluid at high Reynolds number. J Fluid Mech 13:82
80. Frisch U (1995) Turbulence. Cambridge University Press, Cambridge
81. Sreenivasan KR, Antonia RA (1997) The phenomenology of small-scale turbulence. Annu Rev Fluid Mech 29:435–472
82. Ghashghaie S, Breymann W, Peinke J, Talkner P, Dodge Y (1996) Turbulent Cascades in Foreign Exchange Markets. Nature 381:767–770
83. Nawroth AP, Peinke J (2006) Small scale behavior of financial data. Eur Phys J B 50:147
84. Karth M, Peinke J (2002) Stochastic modelling of fat-tailed probabilities of foreign exchange rates. Complexity 8:34
85. Bouchaud JP, Potters M, Meyer M (2000) Apparent multifractality in financial time series. Eur Phys J B 13:595–599
86. Bouchaud JP (2001) Power laws in economics and finance: some ideas from physics. Quant Finance 1:105–112
87. Mandelbrot BB (2001) Scaling in financial prices: I. Tails and dependence. II. Multifractals and the star equation. Quant Finance 1:113–130
88. Embrechts P, Klüppelberg C, Mikosch T (2003) Modelling extremal events. Springer, Berlin
89. Mantegna RN, Stanley HE (1995) Nature 376:46–49
90. McCauley J (2000) The Futility of Utility: how market dynamics marginalize Adam Smith. Physica A 285:506–538
91. Muzy JF, Sornette D, Delour J, Arneodo A (2001) Multifractal returns and hierarchical portfolio theory. Quant Finance 1:131–148
92. Vicsek T (1992) Fractal Growth Phenomena. World Scientific, Singapore
93. Renner C, Peinke J, Friedrich R (2000) Markov properties of high frequency exchange rate data. Int J Theor Appl Finance 3:415
94. Davoudi J, Reza Rahimi Tabar M (1999) Theoretical Model for Kramers-Moyal's description of Turbulence Cascade. Phys Rev Lett 82:1680
95. Renner C, Peinke J, Friedrich R, Chanal O, Chabaud B (2002) Universality of small scale turbulence. Phys Rev Lett 89:124502
96. Siefert M, Peinke J (2004) Different cascade speeds for longitudinal and transverse velocity increments of small-scale turbulence. Phys Rev E 70:015302(R)
Food Webs
JENNIFER A. DUNNE 1,2
1 Santa Fe Institute, Santa Fe, USA
2 Pacific Ecoinformatics and Computational Ecology Lab, Berkeley, USA

Article Outline

Glossary
Definition of the Subject
Introduction: Food Web Concepts and Data
Early Food Web Structure Research
Food Web Properties
Food Webs Compared to Other Networks
Models of Food Web Structure
Structural Robustness of Food Webs
Food Web Dynamics
Ecological Networks
Future Directions
Bibliography

Glossary

Connectance (C) The proportion of possible links in a food web that actually occur. There are many algorithms for calculating connectance. The simplest and most widely used algorithm, sometimes referred to as "directed connectance," is links per species squared (L/S²), where S² represents all possible directed feeding interactions among S species, and L is the total number of actual feeding links. Connectance ranges from 0.03 to 0.3 in food webs, with a mean of 0.10 to 0.15.

Consumer-resource interactions A generic way of referring to a wide variety of feeding interactions, such as predator-prey, herbivore-plant or parasite-host interactions. Similarly, "consumer" refers generically to anything that consumes or preys on something else, and "resource" refers to anything that is consumed or preyed upon. Many taxa are both consumers and resources within a particular food web.

Food web The network of feeding interactions among diverse co-occurring species in a particular habitat.

Trophic species (S) Defined within the context of a particular food web, a trophic species is comprised of a set of taxa that share the same set of consumers and resources. A particular trophic species is represented by a single node in the network, and that node is topologically distinct from all other nodes. "Trophic species" is a convention introduced to minimize bias due to uneven resolution in food web data and to focus analysis
and modeling on functionally distinct network components. S is used to denote the number of trophic species in a food web. The terms "trophic species," "species," and "taxa" will be used somewhat interchangeably throughout this article to refer to nodes in a food web. "Original species" will be used specifically to denote the taxa found in the original dataset, prior to trophic species aggregation.

Definition of the Subject

Food webs refer to the networks of feeding ("trophic") interactions among species that co-occur within particular habitats. Research on food webs is one of the few subdisciplines within ecology that seeks to quantify and analyze direct and indirect interactions among diverse species, rather than focusing on particular types of taxa. Food webs ideally represent whole communities, including plants, bacteria, fungi, invertebrates and vertebrates. Feeding links represent transfers of biomass and encompass a variety of trophic strategies including detritivory, herbivory, predation, cannibalism and parasitism. At the base of every food web are one or more types of autotrophs, organisms such as plants or chemoautotrophic bacteria, which produce complex organic compounds from an external energy source (e.g. light) and simple inorganic carbon molecules (e.g. CO2). Food webs also have a detrital component: non-living particulate organic material that comes from the body tissues of organisms. Feeding-mediated transfers of organic material, which ultimately trace back to autotrophs or detritus via food chains of varying lengths, provide the energy, organic carbon and nutrients necessary to fuel metabolism in all other organisms, referred to as heterotrophs. While food webs have been a topic of interest in ecology for many decades, some aspects of contemporary food web research fall within the scope of the broader cross-disciplinary research agenda focused on complex, "real-world" networks, both biotic and abiotic [2,83,101].
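The two quantities defined in the glossary, directed connectance and trophic-species aggregation, can be computed directly from a list of feeding links. A minimal sketch (the function names are ours, not from a standard library):

```python
def trophic_species(taxa, links):
    """Group taxa that share identical consumer AND resource sets;
    each group collapses into one trophic-species node."""
    groups = {}
    for t in taxa:
        resources = frozenset(r for r, c in links if c == t)  # what t eats
        consumers = frozenset(c for r, c in links if r == t)  # what eats t
        groups.setdefault((resources, consumers), set()).add(t)
    return list(groups.values())

def directed_connectance(n_species, links):
    """Directed connectance C = L / S^2 for L distinct directed links."""
    return len(set(links)) / n_species**2
```

For the chain grass is eaten by grasshoppers, grasshoppers by mice, mice by owls, with two grass taxa playing identical roles, the two grasses collapse into a single trophic species (S = 4), while directed connectance on the 5 original taxa is C = 4/25 = 0.16.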
Using the language of graph theory and the framework of network analysis, species are represented by vertices (nodes) and feeding links are represented by edges (links) between vertices. As with any other network, the structure and dynamics of food webs can be quantified, analyzed and modeled. Links in food webs are generally considered directed, since biomass flows from a resource species to a consumer species (A → B). However, trophic links are sometimes treated as undirected, since any given trophic interaction alters the population and biomass dynamics of both the consumer and resource species (A ↔ B). The types of questions explored in food web research range from "Do
food webs from different habitats display universal topological characteristics, and how does their structure compare to that of other types of networks?” to “What factors promote different aspects of stability of complex food webs and their components given internal dynamics and external perturbations?” Two fundamental measures used to characterize food webs are S, the number of species or nodes in a web, and C, connectance—the proportion of possible feeding links that are actually realized in a web (C = L/S², where L is the number of observed directed feeding links and S² is the number of possible directed feeding interactions among S taxa). This article focuses on research that falls at the intersection of food webs and complex networks, with an emphasis on network structure augmented by a brief discussion of dynamics. This is a subset of the wide variety of ecological research that has been conducted on feeding interactions and food webs. Refer to the “Books and Reviews” in the bibliography for more information about a broader range of research related to food webs.

Introduction: Food Web Concepts and Data

The concept of food chains (e.g., grass is eaten by grasshoppers, which are eaten by mice, which are eaten by owls; A → B → C → D) goes back at least several hundred years, as evidenced by two terrestrial and aquatic food chains briefly described by Carl Linnaeus in 1749 [42]. The earliest description of a food web may be the mostly detrital-based feeding interactions observed by Charles Darwin in 1832 on the island of St. Paul, which had only two bird species (Darwin 1839, as reported by Egerton [42]):

By the side of many of these [tern] nests a small flying-fish was placed; which, I suppose, had been brought by the male bird for its partner . . . quickly a large and active crab (Craspus), which inhabits the crevices of the rock, stole the fish from the side of the nest, as soon as we had disturbed the birds.
Not a single plant, not even a lichen, grows on this island; yet it is inhabited by several insects and spiders. The following list completes, I believe, the terrestrial fauna: a species of Feronia and an acarus, which must have come here as parasites on the birds; a small brown moth, belonging to a genus that feeds on feathers; a staphylinus (Quedius) and a woodlouse from beneath the dung; and lastly, numerous spiders, which I suppose prey on these small attendants on, and scavengers of the waterfowl. The earliest known diagrams of generalized food chains and food webs appeared in the late 1800s, and diagrams of
specific food webs began appearing in the early 1900s, for example the network of insect predators and parasites on cotton-feeding weevils (“the boll weevil complex,” [87]). By the late 1920s, diagrams and descriptions of terrestrial and marine food webs were becoming more common (e.g., Fig. 1 from [103], see also [48,104]). Charles Elton introduced the terms “food chain” and “food cycle” in his classic early textbook, Animal Ecology [43]. By the time Eugene Odum published a later classic textbook, Fundamentals of Ecology [84], the term “food web” was starting to replace “food cycle.” From the 1920s to the 1980s, dozens of system-specific food web diagrams and descriptions were published, as well as some webs that were more stylized (e.g., [60]) and that quantified link flows or species biomasses. In 1977, Joel Cohen published the first comparative studies of empirical food web network structure using up to 30 food webs collected from the literature [23,24]. To standardize the data, he transformed the diagrams and descriptions of webs in the literature into binary matrices with m rows and n columns [24]. Each column is headed by the number of one of the consumer taxa in a particular web, and each row is headed by the number of one of the resource taxa for that web. If w_ij represents the entry in the ith row and the jth column, it equals 1 if consumer j eats resource i, or 0 if j does not eat i. This matrix-based representation of data is still often used, particularly in a full S by S format (where S is the number of taxa in the web), but for larger datasets a compressed two- or three-column notation for observed links is more efficient (Fig. 2). By the mid-1980s, those 30 initial webs had expanded into a 113-web catalog [30], which included webs mostly culled from the literature, dating back to the 1923 Bear Island food web ([103], Fig. 1). However, it was apparent that there were many problems with the data.
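The matrix and compressed two-column formats described above are mechanically interconvertible. A minimal sketch, using a made-up 4-taxon web (the taxa numbers and links are hypothetical):

```python
def links_to_matrix(S, links):
    """Build a full S-by-S binary matrix from two-column link data.

    Convention as described above: rows are resources, columns are
    consumers, and w[i][j] = 1 if consumer j eats resource i
    (taxa numbered 1..S).
    """
    w = [[0] * S for _ in range(S)]
    for consumer, resource in links:
        w[resource - 1][consumer - 1] = 1
    return w

# Hypothetical web in two-column (consumer, resource) format;
# taxon 4 has a cannibalistic self-link.
two_col = [(3, 1), (3, 2), (4, 3), (4, 4)]
w = links_to_matrix(4, two_col)

L = sum(map(sum, w))   # number of directed links
C = L / 4 ** 2         # connectance, C = L / S^2
```

Here C = 4/16 = 0.25; real cumulative community webs typically fall in a much lower range of connectance.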
Most of the 113 food webs had very low diversity compared to the biodiversity known to be present in ecosystems, with a range of only 5 to 48 species in the original datasets and 3 to 48 trophic species. This low diversity was largely due to very uneven resolution and inclusion of taxa in most of these webs. The webs were put together in many different ways and for various purposes that did not include comparative, quantitative research. Many types of organisms were aggregated, underrepresented, or missing altogether, and in a few cases animal taxa had no food chains connecting them to basal species. In addition, cannibalistic links were purged when the webs were compiled into the 113-web catalog. To many ecologists, these food webs looked like little more than idiosyncratic cartoons of much richer and more complex species interactions found in natural systems, and they appeared to be an extremely un-
Food Webs, Figure 1 A diagram of a terrestrial Arctic food web, with a focus on nitrogen cycling, for Bear Island, published in 1923 [103]
Food Webs, Figure 2 Examples of formats for standardized notation of binary food web data. A hypothetical web with 6 taxa and 12 links is used. Numbers 1–6 correspond to the different taxa. a Partial matrix format: the 1s or 0s inside the matrix denote the presence or absence of a feeding link between a consumer (whose numbers 3–6 head columns) and a resource (whose numbers 1–6 head rows); b Full matrix format: similar to a, but all 6 taxa are listed at the heads of columns and rows; c Two-column format: a consumer’s number appears in the first column, and one of its resource’s numbers appears in the second column; d Three-column format: similar to c, but where there is a third number, the second and third numbers refer to a range of resource taxa. In this hypothetical web, taxa numbers 1 and 2 are basal taxa (i. e., taxa that do not feed on other taxa—autotrophs or detritus), and taxa numbers 3, 5, and 6 have cannibalistic links to themselves
sound foundation on which to build understanding and theory [86,92]. Another catalog of “small” webs emerged in the late 1980s, a set of 60 insect-dominated webs with 2 to 87 original species (mean = 22) and 2 to 54 trophic species (mean = 12) [102]. Unlike the 113-web catalog, these webs are
highly taxonomically resolved, mostly to the species level. However, they are still small due to their focus, in most cases, on insect interactions in ephemeral microhabitats such as phytotelmata (i. e., plant-held aquatic systems such as water in tree holes or pitcher plants) and singular detrital sources (e. g., dung paddies, rotting logs, animal car-
Food Webs, Figure 3 Food web of Little Rock Lake, Wisconsin [63]. 997 feeding links among 92 trophic species are shown. Image produced with FoodWeb3D, written by R.J. Williams, available at the Pacific Ecoinformatics and Computational Ecology Lab (www.foodwebs.org)
casses). Thus, while the 113-web catalog presented food webs for communities at fairly broad temporal and spatial scales but with low and uneven resolution, the 60-web catalog presented highly resolved but very small spatial and temporal slices of broader communities. These two very different catalogs were compiled into ECOWeB, the “Ecologists Co-Operative Web Bank,” a machine-readable database of food webs that was made available by Joel Cohen in 1989 [26]. The two catalogs, both separately and together as ECOWeB, were used in many studies of regularities in food web network structure, as discussed in the next Sect. “Early Food Web Structure Research”. A new level of detail, resolution and comprehensiveness in whole-community food web characterization was presented in two seminal papers in 1991. Gary Polis [92] published an enormous array of data for taxa found in the Coachella Valley desert (California). Over two decades, he collected taxonomic and trophic information on at least 174 vascular plant species, 138 vertebrate species, 55 spider species, thousands of insect species including parasitoids, and unknown numbers of microorganisms, acari, and nematodes. He did not compile a complete food web including all of that information, but instead reported a number of detailed subwebs (e.g., a soil web, a scorpion-focused web, a carnivore web, etc.), each of which was more diverse than most of the ECOWeB webs. On the basis of the subwebs and a simplified, aggregated 30-taxa web of the whole community, he concluded that “. . . most catalogued webs are oversimplified caricatures of actual communities
. . . [they are] grossly incomplete representations of communities in terms of both diversity and trophic connections.” At about the same time, Neo Martinez [63] published a detailed food web for Little Rock Lake (Wisconsin) that he compiled explicitly to test food web theory and patterns (see Sect. “Early Food Web Structure Research”). By piecing together diversity and trophic information from multiple investigators actively studying various types of taxa in the lake, he was able to put together a relatively complete and highly resolved food web of 182 taxa, most identified to the genus, species, or ontogenetic life-stage level, including fishes, copepods, cladocera, rotifers, diptera and other insects, mollusks, worms, porifera, algae, and cyanobacteria. In later publications, Martinez modified the original dataset slightly into one with 181 taxa. The 181-taxa web aggregates into a 92 trophic-species web, with nearly 1000 links among the taxa (Fig. 3). This dataset, and the accompanying analysis, set a new standard for food web empiricism and analysis. It still stands as the best whole-community food web compiled, in terms of even, detailed, comprehensive resolution. Since 2000, the use of the ECOWeB database for comparative analysis and modeling has mostly given way to a focus on a smaller set of more recently published food webs [10,37,39,99,110]. These webs, available through www.foodwebs.org or from individual researchers, are compiled for particular, broad-scale habitats such as St. Mark’s Estuary [22], Little Rock Lake [63], the island of St. Martin [46], and the Northeast U.S. Marine Shelf [61].
Most of the food webs used in contemporary comparative research are still problematic—while they generally are more diverse and/or evenly resolved than the earlier webs, most could still be resolved more highly and evenly. Among several issues, organisms such as parasites are usually left out (but see [51,59,67,74]), microorganisms are either missing or highly aggregated, and there is still a tendency to resolve vertebrates more highly than lower-level organisms. An important part of future food web research is the compilation of more inclusive, evenly resolved, and well-defined datasets. Meanwhile, the careful selection and justification of datasets to analyze is an important part of current research that all too often is ignored. How exactly are food web data collected? In general, the approach is to compile as complete a species list as possible for a site, and then to determine the diets of each species present at that site. However, researchers have taken a number of different approaches to compiling food webs. In some cases, researchers base their food webs on observations they make themselves in the field. For example, ecologists in New Zealand have characterized the structure of stream food webs by taking samples from particular patches in the streams, identifying the species present in those samples, taking several individuals of each species present, and identifying their diets through gut-content analysis [106]. In other cases, researchers compile food web data by consulting with experts and conducting literature searches. For example, Martinez [63] compiled the Little Rock Lake (WI) food web by drawing on the expertise of more than a dozen biologists who were specialists on various types of taxa and who had been working at Little Rock Lake for many years.
Combinations of these two approaches can also come into play—for example, a researcher might compile a relatively complete species list through field-based observations and sampling, and then assign trophic habits to those taxa through a combination of observation, consulting with experts, and searching the literature and online databases. It is important to note that most of the webs used for comparative research can be considered “cumulative” webs. Contemporary food web data range from time- and space-averaged or “cumulative” (e. g., [63]) to more finely resolved in time (e. g., seasonal webs—[6]) and/or space (e. g., patch-scale webs—[106]; microhabitat webs—[94]). The generally implicit assumption underlying cumulative food web data is that the set of species in question co-exist within a habitat and individuals of those species have the opportunity over some span of time and space to interact directly. To the degree possible, such webs document who eats whom among all species within a macrohabitat, such as a lake or meadow, over multiple seasons or years,
including interactions that are low frequency or represent a small proportion of consumption. Such cumulative webs are used widely for comparative research to look at whether there are regularities in food web structure across habitats (see Sect. “Food Webs Compared to Other Networks” and Sect. “Models of Food Web Structure”). More narrowly defined webs at finer scales of time or space, or that utilize strict evidence standards (e.g., recording links only through gut content sampling), have been useful for characterizing how such constraints influence perceived structure within habitats [105,106], but are not used as much to look for cross-system regularities in trophic network structure.

Early Food Web Structure Research

The earliest comparative studies of food web structure were published by Joel Cohen in 1977. Using data from the first 30-web catalog, one study focused on the ratio of predators to prey in food webs [23], and the other investigated whether food webs could be represented by single-dimension interval graphs [24], a topic which continues to be of interest today (see Sect. “Food Webs Compared to Other Networks”). In both cases, he found regularities: (1) a ratio of prey to predators of 3/4 regardless of the size of the web, and (2) most of the webs are interval, such that all species in a food web can be placed in a fixed order on a line such that each predator’s set of prey forms a single contiguous segment of that line. The prey-predator ratio paper proved to be the first salvo in a quickly growing set of papers that suggested that a variety of food web properties were “scale-invariant.” In its strong sense, scale invariance means that certain properties have constant values as the size (S) of food webs changes. In its weak sense, scale invariance refers to properties not changing systematically with changing S.
Other scale-invariant patterns identified include constant proportions of top species (Top, species with no predators), intermediate species (Int, species with both predators and prey), and basal species (Bas, species with no prey), collectively called “species scaling laws” [12], and constant proportions of T-I, I-B, T-B, and I-I links between T, I, and B species, collectively called “link scaling laws” [27]. Other general properties of food webs were thought to include: food chains are short [31,43,50,89]; cycling/looping is rare (e.g., A → B → C → A; [28]); compartments, or subwebs with many internal links that have few links to other subwebs, are rare [91]; omnivory, or feeding at more than one trophic level, is uncommon [90]; and webs tend to be interval, with instances of intervality decreasing as S increases [24,29,116]. Most of these patterns were reported
for the 113-web catalog [31], and some of the regularities were also documented in a subset of the 60 insect-dominated webs [102]. Another related, prominent line of early comparative food web research was inspired by Bob May’s work from the early 1970s showing that simple, abstract communities of interacting species will tend to transition sharply from local stability to instability as the complexity of the system increases—in particular as the number of species (S), the connectance (C) or the average interaction strength (i) increases beyond critical values [69,70]. He formalized this as a criterion that ecological communities near equilibrium will tend to be stable if i(SC)^(1/2) < 1. This mathematical analysis flew in the face of the intuition of many ecologists (e.g., [44,50,62,84]) who felt that increased complexity (in terms of greater numbers of species and links between them) in ecosystems gives rise to stability. May’s criterion and the general question of how diversity is maintained in communities provided a framework within which to analyze some readily accessible empirical data, namely the numbers of links and species in food webs. Assuming that average interaction strength (i) is constant, May’s criterion suggests that communities can be stable given increasing diversity (S) as long as connectance (C) decreases. This can be empirically demonstrated using food web data in three similar ways, by showing that (1) C declines hyperbolically as S increases, so that the product SC remains constant, (2) the ratio of links to species (L/S), also referred to as link or linkage density, remains constant as S increases, or (3) L plotted as a function of S on a log-log graph, producing a power-law relation of the form L = αS^β, displays an exponent of β = 1 (the slope of the regression), indicating a linear relationship between L and S.
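Both May's stability criterion and the log-log regression used to estimate β can be illustrated numerically. The interaction strengths and the (S, L) pairs below are made up for the sketch; the second set is constructed to follow L = 2S exactly, so the fitted slope recovers the link-species scaling law value β = 1:

```python
import math

def may_stable(i, S, C):
    """May's criterion: near-equilibrium communities tend to be stable
    when i * (S * C)**0.5 < 1, with i the mean interaction strength."""
    return i * math.sqrt(S * C) < 1

# Holding i fixed, larger S demands smaller C for stability:
stable_small = may_stable(0.1, 30, 0.2)    # 0.1 * sqrt(6)   ~ 0.24 < 1
stable_large = may_stable(0.1, 500, 0.3)   # 0.1 * sqrt(150) ~ 1.22 >= 1

def powerlaw_exponent(webs):
    """Least-squares slope of log L on log S, i.e. beta in L = alpha*S**beta."""
    xs = [math.log(S) for S, L in webs]
    ys = [math.log(L) for S, L in webs]
    n = len(webs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Made-up (S, L) pairs following L = 2*S exactly, so beta = 1;
# data following L = alpha*S**2 would instead give beta = 2.
beta = powerlaw_exponent([(25, 50), (50, 100), (100, 200), (172, 344)])
```

Fitting the same regression to empirical catalogs is how the exponents discussed below (β between 1 and 2) were estimated.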
These relationships were demonstrated across food webs in a variety of studies (see detailed review in [36]), culminating with support from the 113-web catalog and the 60 insect-dominated web catalog. Cohen and colleagues identified the “link-species scaling law” of L/S ≈ 2 using the 113-web catalog (i.e., there are two links per species on average in any given food web, regardless of its size) [28,30], and SC was reported as “roughly independent of species number” in a subset of the 60 insect-dominated webs [102]. However, these early conclusions about patterns of food web structure began to crumble with the advent of improved data and new analysis methods that focused on the issues of species aggregation, sampling effort, and sampling consistency [36]. Even before there was access to improved data, Tom Schoener [93] set the stage for critiques of the conventional paradigm in his Ecological Society of America MacArthur Award lecture, in which he
explored the ramifications of a simple conceptual model based on notions of “generality” (what Schoener referred to as “generalization”) and “vulnerability.” He adopted the basic notion underlying the “link-species scaling law”: that how many different taxa something can eat is constrained, which results in the number of resource taxa per consumer taxon (generality) holding relatively steady with increasing S. However, he further hypothesized that the ability of resource taxa to defend against consumers is also constrained, such that the number of consumer taxa per resource taxon (vulnerability) should increase with increasing S. A major consequence of this conceptual model is that total links per species (L/S, which includes links to resources and consumers) and most other food web properties should display scale-dependence, not scale-invariance. A statistical reanalysis of a subset of the 113-web catalog supported this contention as well as the basic assumptions of his conceptual model about generality and vulnerability. Shortly thereafter, more comprehensive, detailed datasets, like the ones for Coachella Valley [92] and Little Rock Lake [63], began to appear in the literature. These and other new datasets provided direct empirical counterpoints to many of the prevailing notions about food webs: their connectance and links per species were much higher than expected from the “link-species scaling law,” food chains could be quite long, omnivory and cannibalism and looping could be quite frequent, etc. In addition, analyses such as the one by Martinez [63], in which he systematically aggregated the Little Rock Lake food web taxa and links to generate small webs that looked like the earlier data, demonstrated that “most published food web patterns appear to be artifacts of poorly resolved data.” Comparative studies incorporating newly available data further undermined the whole notion of “scale invariance” of most properties, particularly L/S (e.g., [65,66]).
For many researchers, the array of issues brought to light by the improved data and more sophisticated analyses was enough for them to turn their backs on structural food web research. A few hardy researchers sought to build new theory on top of the improved data. “Constant connectance” was suggested as an alternative hypothesis to constant L/S (the “link-species scaling law”), based on a comparative analysis of the relationship of L to S across a subset of available food webs including Little Rock Lake [64]. The mathematical difference between constant C and constant L/S can be simply stated using a log-log graph of links as a function of species (Fig. 4). If a power law exists of the form L = αS^β, in the case of the link-species scaling law β = 1, which means that L = αS and L/S = α, indicating constant L/S. In the case of constant connectance, β = 2 and thus L = αS²,
L/S² = α, indicating constant C (C = L/S²). Constant connectance means that L/S increases as a fixed proportion of S. One ecological interpretation of constant connectance is that consumers are likely to exploit an approximately constant fraction of available prey species, so as diversity increases, links per species increases [108]. Given the L = αS^β framework, β = 2 was reported for a set of 15 webs derived from an English pond [108], and β = 1.9 for a set of 50 Adirondack lakes [65], suggesting connectance may be constant across webs within a habitat or type of habitat. Across habitats, the picture is less clear. While β = 2 was reported for a small subset of the 5 most “credible” food webs then available from different habitats [64], several analyses of both the old ECOWeB data and the more reliable newer data suggest that the exponent lies somewhere between 1 and 2, suggesting that C declines non-linearly with S (Fig. 4, [27,30,36,64,79,93]). For example, Schoener’s reanalysis of the 113-web catalog suggested that β = 1.5, indicating that L^(2/3) is proportional to S. An analysis of 19 recent trophic-species food webs with S of 25 to 172 also reported β = 1.5, with much scatter in the data (Fig. 4). Recent work has provided a possible mechanistic basis for the observed constrained variation in C (~0.03 to 0.3 in cumulative community webs) as well as the scaling of C with S implied by β intermediate between 1 and 2 [10]. A simple diet breadth model based on optimal foraging theory predicts both of these patterns across food webs as an emergent consequence of individual foraging behavior of consumers. In particular, a contingency model of optimal foraging is used to predict mean diet breadth for S animal species in a food web, based on three parameters for an individual of species j: (1) net energy gain from consumption of an individual of species i, (2) the encounter rate of individuals of species i, and (3) the handling time spent attacking an individual of species i. This allows estimation of C for the animal portion of food webs, once data aggregation and cumulative sampling, well-known features of empirical datasets, are taken into account. The model does a good job of predicting values of C observed in empirical food webs and associated patterns of C across food webs.

Food Webs, Figure 4 The relationship of links to species for 19 trophic-species food webs from a variety of habitats (black circles). The solid line shows the log-log regression for the empirical data, the dashed line shows the prediction for constant connectance, and the dotted line shows the prediction for the link-species scaling law (reproduced from [36], Fig. 1)
Food Web Properties

Food webs have been characterized by a variety of properties or metrics, several of which have been mentioned previously (Sect. “Early Food Web Structure Research”). Many of these properties are quantifiable using just the basic network structure (“topology”) of feeding interactions. These types of topological properties have been used to evaluate simple models of food web structure (Sect. “Models of Food Web Structure”). Any number of properties can be calculated on a given network—ecologists tend to focus on properties that are meaningful within the context of ecological research, although other properties such as path length (Path) and clustering coefficient (Cl) have been borrowed from network research [109]. Examples of several types of food web network structure properties, with common abbreviations and definitions, follow.

Fundamental Properties: These properties characterize very simple, overall attributes of food web network structure.

S: number of nodes in a food web
L: number of links in a food web
L/S: links per species
C, or L/S²: connectance, or the proportion of possible links that are realized

Types of Taxa: These properties characterize what proportion or percentage of taxa within a food web fall into particular topologically defined roles.

Bas: percentage of basal taxa (taxa without resources)
Int: percentage of intermediate taxa (taxa with both consumers and resources)
Top: percentage of top taxa (taxa with no consumers)
Herb: percentage of herbivores plus detritivores (taxa that feed on autotrophs or detritus)
Can: percentage of cannibals (taxa that feed on their own taxa)
Omn: percentage of omnivores (taxa that feed on taxa at different trophic levels)
Loop: percentage of taxa that are in loops, food chains in which a taxon occurs twice (e.g., A → B → C → A)

Network Structure: These properties characterize other attributes of network structure, based on how links are distributed among taxa.

TL: trophic level averaged across taxa. Trophic level represents how many steps energy must take to get from an energy source to a taxon. Basal taxa have TL = 1, and obligate herbivores have TL = 2. TL can be calculated using many different algorithms that take into account multiple food chains that can connect higher-level organisms to basal taxa (Williams and Martinez 2004).
ChLen: mean food chain length, averaged over all species
ChSD: standard deviation of ChLen
ChNum: log number of food chains
LinkSD: normalized standard deviation of links (# links per taxon)
GenSD: normalized standard deviation of generality (# resources per taxon)
VulSD: normalized standard deviation of vulnerability (# consumers per taxon)
MaxSim: mean across taxa of the maximum trophic similarity of each taxon to other taxa
Ddiet: the level of diet discontinuity—the proportion of triplets of taxa with an irreducible gap in feeding links over the number of possible triplets [19]—a local estimate of intervality
Cl: clustering coefficient (probability that two taxa linked to the same taxon are linked)
Path: characteristic path length, the mean shortest set of links (where links are treated as undirected) between species pairs

The previous properties (most of which are described in [110] and [39]) each provide a single metric that characterizes some aspect of food web structure. There are other properties, such as degree distribution, which are not single-number properties.
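The "Types of Taxa" properties above follow directly from a link list. A toy sketch with made-up taxa (one common convention, assumed here, is to ignore cannibalistic self-links when classifying top and basal taxa):

```python
# Hypothetical directed feeding links, written (resource, consumer).
links = [("plant", "beetle"), ("plant", "deer"), ("beetle", "bird"),
         ("deer", "wolf"), ("bird", "wolf"), ("wolf", "wolf")]

taxa = {t for link in links for t in link}

# Self-links (cannibalism) excluded from the top/basal classification.
resources = {r for r, c in links if r != c}   # taxa eaten by another taxon
consumers = {c for r, c in links if r != c}   # taxa that eat another taxon

basal = taxa - consumers                      # Bas: no resources
top = taxa - resources                        # Top: no consumers
intermediate = taxa - basal - top             # Int: both
cannibals = {c for r, c in links if r == c}   # Can: self-links

pct = lambda s: 100 * len(s) / len(taxa)      # percentage of all taxa
```

For this toy web, Bas = 20%, Int = 60%, Top = 20%, and Can = 20%.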
“Degree” refers to the number of links that connect to a particular node, and the degree distribution of a network describes (in the format of a function or a graph) the total number of nodes in a network that have a given degree for each level of degree (Subsect. “Degree Distribution”). In food web analysis, LinkSD, GenSD, and VulSD characterize the variability of different aspects of degree distribution. Many food web structure properties are correlated with each other, and vary in predictable ways with S and/or C. This provides opportunities for topological modeling that are discussed below (Sect. “Models of Food Web Structure”).
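GenSD and VulSD summarize the spread of in-degrees (resources per taxon) and out-degrees (consumers per taxon). A minimal sketch with a made-up link list; the normalization by L/S assumed here is one common convention for making webs of different size and connectance comparable:

```python
import statistics

# Hypothetical directed links written (resource, consumer); taxa are 1..5.
links = [(1, 3), (2, 3), (1, 4), (3, 4), (4, 4), (3, 5)]
taxa = sorted({t for link in links for t in link})
S, L = len(taxa), len(links)

# Generality: number of resources per taxon (in-degree of consumers);
# vulnerability: number of consumers per taxon (out-degree of resources).
gen = {t: sum(1 for r, c in links if c == t) for t in taxa}
vul = {t: sum(1 for r, c in links if r == t) for t in taxa}

# Dividing by mean links per species (L/S) normalizes the distributions.
norm = L / S
GenSD = statistics.pstdev(gen[t] / norm for t in taxa)
VulSD = statistics.pstdev(vul[t] / norm for t in taxa)
```

Both generality and vulnerability sum to L across the web, so their normalized means are 1 and GenSD/VulSD measure relative spread around that mean.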
In addition to these types of metrics based on networks with unweighted links and nodes, it is possible to calculate a variety of metrics for food webs with nodes and/or links that are weighted by measures such as biomass, numerical abundance, frequency, interaction strength, or body size [11,33,67,81]. However, few food web datasets are “enriched” with such quantitative data, and it remains to be seen whether such approaches are primarily a tool for richer description of particular ecosystems or whether they can give rise to novel generalities, models and predictions. One potential generality was suggested by a study of interaction strengths in seven soil food webs, where interaction strength reflects the size of the effects of species on each other’s dynamics near equilibrium. Interaction strengths appear to be organized such that long loops contain many weak links, a pattern which enhances the stability of complex food webs [81].

Food Webs Compared to Other Networks

Small-World Properties

How does the structure of food webs compare to that of other kinds of networks? One common way that various networks have been compared is in terms of whether they are “small-world” networks. Small-world networks are characterized by two of the properties described previously, characteristic path length (Path) and clustering coefficient (Cl) [109]. Most real-world networks appear to have high clustering, like what is seen on some types of regular spatial lattices (such as a planar triangular lattice, where many of a node’s neighbors are neighbors of one another), but have short path lengths, like what is seen on “random graphs” (i.e., networks in which links are distributed randomly). Food webs do display short path lengths that are similar to what is seen in random webs (Table 1, [16,37,78,113]). On average, taxa are about two links from other taxa in a food web (“two degrees of separation”), and path length decreases with increasing connectance [113].
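The two small-world diagnostics can be computed from scratch for an undirected version of a web; empirical values are then compared against the same metrics on random graphs with matching S and L. A self-contained sketch on a made-up 6-node graph:

```python
from collections import deque
from itertools import combinations

# Hypothetical small web, treated as undirected for these metrics.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6), (3, 6)]
nbrs = {}
for a, b in edges:
    nbrs.setdefault(a, set()).add(b)
    nbrs.setdefault(b, set()).add(a)

def char_path_length(nbrs):
    """Mean shortest-path length over all connected node pairs, via BFS."""
    total = count = 0
    for src in nbrs:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in nbrs[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(d for n, d in dist.items() if n != src)
        count += len(dist) - 1
    return total / count

def clustering(nbrs):
    """Mean over nodes of (realized links among neighbors) / (possible).

    Nodes with fewer than two neighbors are skipped here; conventions
    differ (some analyses count them as zero).
    """
    cs = []
    for u, nb in nbrs.items():
        if len(nb) < 2:
            continue
        pairs = list(combinations(nb, 2))
        closed = sum(1 for a, b in pairs if b in nbrs[a])
        cs.append(closed / len(pairs))
    return sum(cs) / len(cs)
```

For the toy graph above the characteristic path length is 5/3 and the clustering coefficient is 13/36 ≈ 0.36; in practice these would be averaged over many random webs with the same S and C to form the Pathr and Clr baselines of Table 1.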
However, clustering tends to be quite low in many food webs, closer to the clustering expected on a random network (Table 1). This relatively low clustering in food webs appears consistent with their small size compared to most other kinds of networks studied, since the ratio of clustering in empirical versus comparable random networks increases linearly with the size of the network (Fig. 5).

Food Webs, Table 1  Topological properties of empirical and random food webs, listed in order of increasing connectance. Path refers to characteristic path length, and Cl refers to the clustering coefficient. Pathr and Clr refer to the mean Path and Cl for 100 random webs with the same S and C. Modified from [37]

Food Web              S    C (L/S^2)  L/S    Path  Pathr  Cl    Clr   Cl/Clr
Grassland             61   0.026      1.59   3.74  3.63   0.11  0.03  3.7
Scotch Broom          85   0.031      2.62   3.11  2.82   0.12  0.04  3.0
Ythan Estuary 1       124  0.038      4.67   2.34  2.39   0.15  0.04  3.8
Ythan Estuary 2       83   0.057      4.76   2.20  2.19   0.16  0.06  2.7
El Verde Rainforest   155  0.063      9.74   2.20  1.95   0.12  0.07  1.4
Canton Creek          102  0.067      6.83   2.27  2.01   0.02  0.07  0.3
Stony Stream          109  0.070      7.61   2.31  1.96   0.03  0.07  0.4
Chesapeake Bay        31   0.071      2.19   2.65  2.40   0.09  0.09  1.0
St. Marks Seagrass    48   0.096      4.60   2.04  1.94   0.14  0.11  1.3
St. Martin Island     42   0.116      4.88   1.88  1.85   0.14  0.13  1.1
Little Rock Lake      92   0.118      10.84  1.89  1.77   0.25  0.12  2.1
Lake Tahoe            172  0.131      22.59  1.81  1.74   0.14  0.13  1.1
Mirror Lake           172  0.146      25.13  1.76  1.72   0.14  0.15  0.9
Bridge Brook Lake     25   0.171      4.28   1.85  1.68   0.16  0.19  0.8
Coachella Valley      29   0.312      9.03   1.42  1.43   0.43  0.32  1.3
Skipwith Pond         25   0.315      7.88   1.33  1.41   0.33  0.33  1.0

Food Webs, Figure 5  Trends in clustering coefficient across networks. The ratio of clustering in empirical networks (Clempirical) to clustering in random networks with the same number of nodes and links (Clrandom) is shown as a function of the size of the network (number of nodes). Reproduced from [37], Fig. 1

Degree Distribution

In addition to small-world properties, many real-world networks appear to display power-law degree distributions [2]. Whereas regular graphs have the same number of links per node, and random graphs display a Poisson degree distribution, many empirical networks, both biotic and abiotic, display a highly skewed power-law (“scale-free”) degree distribution, in which most nodes have few links and a few nodes have many links. However, some empirical networks display less-skewed distributions such as exponential distributions [4]. Most empirical food webs display exponential or uniform degree distributions, not power-law distributions [16,37], and it has been suggested that normalized degree distributions in food webs follow universal functional forms [16], although there is quite a bit of scatter when a wide range of data are considered (Fig. 6, [37]). Variable degree distributions, like those seen in individual food webs, could result from simple mechanisms. For example, exponential and uniform food web degree distributions are generated by a model that combines (1) random immigration to local webs from a randomly linked regional set of taxa, and (2) random extinctions in the local webs [5]. The general lack of power-law degree distributions in food webs may result partly from the small size and large connectance of such networks, which limits the potential for highly skewed distributions. Many of the networks displaying power-law degree distributions are much larger and much more sparsely connected than food webs.
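The normalized cumulative distributions used in such overlay comparisons can be sketched in a few lines. The function name and the 0..S−1 species labeling below are illustrative assumptions:

```python
def cumulative_degree_distribution(links, S):
    """Cumulative degree distribution P(degree >= k), with degrees
    normalized by the mean links per species (illustrative sketch).
    Species are assumed to be labeled 0..S-1."""
    deg = [0] * S
    for a, b in links:
        deg[a] += 1
        if a != b:                      # a self-link adds only one degree
            deg[b] += 1
    mean = sum(deg) / S
    ks = sorted(set(deg))
    # Each point: (degree normalized by the mean, fraction of species >= that degree)
    return [(k / mean, sum(1 for d in deg if d >= k) / S) for k in ks]
```

Plotted on log-log axes, a power law would appear as a straight line, while the roughly exponential shapes reported for food webs curve downward.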
Food Webs, Figure 6 Log-log overlay plot of the cumulative distributions of links per species in 16 food webs. The link data are normalized by the average number of links/species in each web. If the distributions followed a power law, the data would tend to follow a straight line. Instead, they follow a roughly exponential shape. Reproduced from [37], Fig. 3
Other Properties

Assortative mixing, or the tendency of nodes with similar degree to be linked to each other, appears to be a pervasive phenomenon in a variety of social networks [82]. However, other kinds of networks, including technological and biological networks, tend to show disassortative mixing, where nodes with high degree tend to link to nodes with low degree. Biological networks, and particularly the two food webs examined, show strong disassortativity [82]. Some of this may relate to a finite-size effect in systems like food webs that have limits on how many links are recorded between pairs of nodes. However, in food webs it may also result from the stabilizing effects of having feeding specialists linked to feeding generalists, as has been suggested for plant-animal pollination and frugivory (fruit-eating) networks ([7], Sect. “Ecological Networks”).

Another aspect of structure that has been directly compared across several types of networks, including food webs, is the presence of “motifs,” defined as “recurring, significant patterns of interconnections” [77]. A variety of networks (transcriptional gene regulation, neuron connectivity, food webs, two types of electronic circuits, the World Wide Web) were scanned for all possible subgraphs that could be constructed out of 3 or 4 nodes (13 and 199 possible subgraphs, respectively). Subgraphs that appeared significantly more often in empirical webs than in their randomized counterparts (i.e., networks with the same number of nodes and links, and the same degree for each node, but with links otherwise randomly distributed) were identified. For the seven food webs examined, there were two “consensus motifs” shared by most of the webs: a three-node food chain, and a four-species diamond where a predator has two prey, which in turn prey on the same species (Fig. 7). The four-node motif was shared by two other types of networks (neuron connectivity, one type of electronic circuit), while no other type of network shared the three-node chain. In terms of significant motifs, the WWW and food web networks appear most dissimilar to other types of networks (and to each other).

Food Webs, Figure 7  The two 3- or 4-node network motifs found to occur significantly more often than expected in most of the seven food webs examined. There is one significant 3-node motif (out of 13 possible motifs), a food chain of the form A eats B eats C. There is one significant 4-node motif (out of 199 possible motifs), a trophic diamond (“bi-parallel”) of the form A eats B and C, which both eat D

Complex networks can also be decomposed into minimum spanning trees (MSTs). A MST is a simplified version of a network created by removing links so as to minimize the distance between nodes and some destination. For example, a food web can be turned into a MST by adding an “environment” node that all basal taxa link to, tracing the shortest food chain from each species to the environment node, and removing links that do not appear in the shortest chains. Given this algorithm, a MST removes links that occur in loops and retains a basic backbone that has a treelike structure. In a MST, the quantity Ai is defined as the number of nodes in the subtree rooted at node i, and can be regarded as the transportation rate through that node. Ci is defined as the integral of Ai (i.e., the sum of Ai over all nodes in the subtree rooted at node i) and can be regarded as the transportation cost at node i. These properties can be used to plot Ci versus Ai for each node in a network, or to plot whole-system C0 versus A0 across multiple networks, to identify whether scaling relationships of the form C(A) ∝ A^η are present, indicating self-similarity in the structure of the MST (see [18] for a review). In a food web MST, the most efficient configuration is a star, where every species links directly to the environment node, resulting in an exponent of 1, and the least efficient configuration is a single chain, where resources have to pass through each species in a line,
resulting in an exponent of 2. It has been suggested that food webs display a universal exponent of 1.13 [18,45], reflecting an invariant functional food web property relating to very efficient resource transportation within an ecosystem. However, analyses based on a larger set of webs (17 webs versus 7) suggest that exponents for Ci as a function of Ai range from 1.09 to 1.26 and are thus not universal, that the exponents are quite sensitive to small changes in food web structure, and that the observed range of exponent values would be similarly constrained in any network with only 3 levels, as is seen in food web MSTs [15].

Models of Food Web Structure

An important area of research on food webs has been whether their observed structure, which often appears quite complex, can emerge from simple rules or models. As with other kinds of “real-world” networks, models that assign links among nodes randomly, according to fixed probabilities, fail to reproduce the network structure of empirically observed food webs [24,28,110]. Instead, several models that combine stochastic elements with simple link assignment rules have been proposed to generate and predict the network structure of empirical food webs. The models share a basic formulation [110]. There are two empirically quantifiable parameters: (1) S, the number of trophic species in a food web, and (2) C, the connectance of a food web, defined as links per species squared, L/S^2. Thus, S specifies the number of nodes in a network, and C specifies the number of links in a network with S nodes. Each species is assigned a “niche value” ni drawn randomly and uniformly from the interval [0,1]. The models differ in the rules used to distribute links among species, described here in the order they were introduced in the literature. Cascade Model (Cohen and Newman [28]): Each species has a fixed probability P = 2CS/(S − 1) of consuming species with niche values less than its own.
This creates a food web with hierarchical feeding, since it does not allow feeding on taxa with the same niche value (cannibalism) or taxa with higher niche values (looping/cycling). This formulation [110] is a modified version of the original cascade model that allows L/S, equivalent to the CS term in the probability statement above, to vary as a tunable parameter rather than be fixed as a constant [28].

Niche Model (Williams and Martinez [110], Fig. 8): Each species consumes all species within a segment of the [0,1] interval whose size ri is calculated using the feeding range width algorithm described below. The center ci of the range ri is set at a random value drawn uniformly from the interval [ri/2, ni], or from [ri/2, 1 − ri/2] if ni > 1 − ri/2, which places ci equal to or lower than the niche value ni and keeps the ri segment within [0,1]. The ci rule relaxes the strict feeding hierarchy of the cascade model and allows for the possibility of cannibalism and looping. Also, the ri rule ensures that species feed on a contiguous range of species, necessarily creating interval graphs (i.e., species can be lined up along a single interval such that all of their resource species are located in contiguous segments along the interval). Feeding range width algorithm: ri = x·ni, where 0 < x < 1 is randomly drawn from the probability density function p(x) = β(1 − x)^(β−1) (the beta distribution), with β = 1/(2C) − 1 to obtain a C close to the desired C.

Food Webs, Figure 8  Graphical representation of the niche model: Species i feeds on 4 taxa, including itself and one with a higher niche value

Nested-Hierarchy Model (Cattin et al. [19]): Like the niche model, the number of prey items for each species is drawn randomly from a beta distribution that constrains C close to a target value. Once the number of prey items for each species is set, those links are assigned in a multistep process. First, a link is randomly assigned from species i to a species j with a lower ni. If j is fed on by other species, the next feeding links for i are selected randomly from the pool of prey species fed on by a set of consumer species defined as follows: they share at least one prey species, and at least one of them feeds on j. If more links need to be distributed, they are then randomly assigned to species without predators and with niche values < ni, and finally to those with niche values ≥ ni. These rules were chosen to relax the contiguity rule of the niche model and to allow for trophic habit overlap among taxa in a manner which the authors suggest evokes phylogenetic constraints.

Generalized Cascade Model (Stouffer et al. [99]): Species i feeds on species j if nj ≤ ni, with a probability drawn from the interval [0,1] using the beta or an exponential distribution. This model combines the beta distribution introduced in the niche model with the hierarchical, non-contiguous feeding of the cascade model.
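Under the formulation above, a minimal niche-model generator might look like the following sketch. It assumes the inverse-CDF draw x = 1 − (1 − u)^(1/β) for the beta density p(x) = β(1 − x)^(β−1); function names are hypothetical:

```python
import random

def niche_model(S, C, seed=None):
    """Sketch of the niche model: returns sorted niche values and a set of
    (consumer, resource) links. beta = 1/(2C) - 1 keeps the expected
    connectance close to the target C (illustrative implementation)."""
    rng = random.Random(seed)
    beta = 1.0 / (2.0 * C) - 1.0
    n = sorted(rng.random() for _ in range(S))          # niche values
    links = set()
    for i in range(S):
        x = 1.0 - (1.0 - rng.random()) ** (1.0 / beta)  # draw from Beta(1, beta)
        r = x * n[i]                                    # feeding range width
        lo, hi = r / 2.0, min(n[i], 1.0 - r / 2.0)      # allowed range centers
        c = rng.uniform(lo, hi)
        for j in range(S):
            # i eats every j whose niche value falls inside the range
            if c - r / 2.0 <= n[j] <= c + r / 2.0:
                links.add((i, j))
    return n, links
```

Because every diet is an interval of niche values, each consumer's prey occupy a contiguous block of the niche ordering, which is exactly the intervality property discussed below.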
These models have been evaluated with respect to their relative fit to empirical data in a variety of ways. In a series of six papers published from 1985 to 1990 with the common title “A stochastic theory of community food webs,” the cascade model was proposed as a means of explaining “the phenomenology of observed food web structure, using a minimum of hypotheses” [31]. This was not the first simple model proposed for generating food web structure [25,88,89,116], but it was the best-developed model. Cohen and colleagues also examined several model variations, most of which performed poorly. While the cascade model appeared to generate structures that qualitatively fit general patterns in the data from the 113-web catalog, subsequent statistical analyses suggested that the fit between the model and that early data was poor [93,96,97]. Once improved data began to emerge, it became clear that some of the basic assumptions built into the cascade model, such as no looping, minimal overlap, and no clustering of feeding habits across taxa, are violated by common features of multi-species interactions. The niche model was introduced in 2000, along with a new approach to analysis: numerical simulations to statistically compare the ability of the niche model, the cascade model, and one type of random network model to fit empirical food web data [110]. Because of stochastic variation in how species and links are distributed in any particular model web, analysis begins with the generation of hundreds to thousands of model webs with the same S and similar C as the empirical food web of interest. Model webs that fall within 3% of the target C are retained. Model-generated webs occasionally contain species with no links to other species, or species that are trophically identical. Either those webs are thrown out, or those species are eliminated and replaced, until every model web has no disconnected or identical species.
Also, each model web must contain at least one basal species. These requirements ensure that model webs can be sensibly compared to empirical trophic-species webs. Once a set of model webs is generated with the same S and C as an empirical web, model means and standard deviations are calculated for each food web property of interest, which can then be compared to empirical values. Raw error, the difference between the value of an empirical property and a model mean for that property, is normalized by dividing it by the standard deviation of the property’s simulated distribution. This approach allows assessment not only of whether a model over- or under-estimates empirical properties, as indicated by the raw error, but also of the degree to which a model’s mean deviates from the empirical value. Normalized errors within ±2
are considered to indicate a good fit between the model prediction and the empirical value. This approach has also been used to analyze network motifs [77] (Subsect. “Other Properties”). The initial niche model analyses examined seven more recent, diverse food webs (S = 24 to 92) and up to 12 network structure properties for each web [110]. The random model (links are distributed randomly among nodes) performs poorly, with an average normalized error (ANE) of 27.1 (SD = 202). The cascade model performs better, with an ANE of −3.0 (SD = 14.1). The niche model performs an order of magnitude better than the cascade model, with an ANE of 0.22 (SD = 1.8). Only the niche model falls within the ±2 ANE range considered to show a good fit to the data. Not surprisingly, there is variability in how well all three models fit different food webs and properties. For example, the niche model generally overestimates food-chain length. Specific mismatches are generally attributable either to limitations of the models or to biases in the data [110]. A separate test of the niche and cascade models with three marine food webs, a type of habitat not included in the original analysis, obtained similar results [39]. These analyses demonstrate that the structure of food webs is far from random and that simple link distribution rules can yield apparently complex network structure, similar to that observed in empirical data. In addition, the analyses suggest that food webs from a variety of habitats share a fundamentally similar network structure, and that the structure is scale-dependent in predictable ways with S and C. The nested-hierarchy model [19] and generalized cascade model [99], variants of the niche model, do not appear to improve on the niche model, and in fact may be worse at representing several aspects of empirical network structure.
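The normalized-error calculation used in these model evaluations reduces to a few lines; the sketch below (hypothetical names, sample standard deviation assumed) illustrates it:

```python
def normalized_error(empirical, model_values):
    """Normalized error: (empirical value - model mean) / model standard
    deviation over the simulated distribution. |NE| <= 2 is treated as a
    good fit in these analyses (illustrative sketch)."""
    m = len(model_values)
    mean = sum(model_values) / m
    var = sum((v - mean) ** 2 for v in model_values) / (m - 1)  # sample variance
    return (empirical - mean) / var ** 0.5
```

The raw error alone only shows the direction of the bias; dividing by the simulated standard deviation expresses how extreme the empirical value is relative to the model's own variability.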
Although the nested-hierarchy model breaks the intervality of the niche model and uses a complicated-sounding set of link distribution rules to try to mimic phylogenetic constraints on trophic structure, it “generates webs characterized by the same universal distributions of numbers of prey, predators, and links” as the niche model [99]. Both the niche and nested-hierarchy models have a beta distribution at their core. The beta distribution is reasonably approximated by an exponential distribution for C < 0.12 [99], and thus reproduces the exponential degree distributions observed in many empirical webs, particularly those with average or less-than-average C [37]. The generalized cascade model was proposed as a simplified model that would return the same distributions of taxa and links as the niche and nested-hierarchy models. It is defined using only two criteria: (1) taxa form a totally ordered set, which is fulfilled by the arrangement of taxa along a single “niche” interval or dimension, and (2) each species
has an exponentially decaying probability of preying on a given fraction of species with lower niche values [99]. Although the generalized cascade model does capture a central tendency of successful food web models, only some food web properties are derivable from link distributions (e.g., Top, Bas, Can, VulSD, GenSD, Clus). There are a variety of food web structure properties of interest that are not derivable from degree distributions (e.g., Loop, Omn, Herb, TL, food-chain statistics, intervality statistics). The accurate representation of these types of properties may depend on additional factors, for example the contiguous feeding ranges specified by the niche model but absent from the cascade, nested-hierarchy, and generalized cascade models. While it is known that empirical food webs are not interval, until recently it was not clear how non-interval they are. Intervality is a brittle property that is broken by a single gap in a single feeding range (i.e., a single missing link in a food web), and trying to arrange the species of a food web into their most interval ordering is a computationally challenging problem. A more robust measure of intervality in food webs has been developed, in conjunction with the use of simulated annealing to estimate the most interval ordering of empirical food webs [100]. This analysis suggests that complex food webs “do exhibit a strong bias toward contiguity of prey, that is, toward intervality” when compared to several alternative “null” models, including the generalized cascade model. Thus, the intervality assumption of the niche model, initially critiqued as a flaw of the model [19], helps to produce a better fit to empirical data than the non-interval alternate models.

Structural Robustness of Food Webs

A series of papers have examined the response of a variety of networks, including the Internet and WWW web pages [1] and metabolic and protein networks [52,53], to the simulated loss of nodes.
In each case, the networks, all of which display highly skewed power-law degree distributions, appear very sensitive to the targeted loss of highly-connected nodes but relatively robust to random loss of nodes. When highly-connected nodes are removed from scale-free networks, the average path length tends to increase rapidly, and the networks quickly fragment into isolated clusters. In essence, paths of information flow in highly skewed networks are easily disrupted by the loss of nodes that are directly connected to an unusually large number of other nodes. In contrast, random networks with much less skewed Poisson degree distributions display similar responses to targeted loss of highly-connected nodes versus random node loss [101].
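The targeted-versus-random comparison can be illustrated with a small sketch that tracks the size of the largest connected component as nodes are removed. The names and the degree-based attack order below are illustrative assumptions, not code from the cited studies:

```python
from collections import deque

def largest_component(nodes, links, removed):
    """Size of the largest connected component after removing a node set
    (links treated as undirected; illustrative sketch)."""
    adj = {n: set() for n in nodes if n not in removed}
    for a, b in links:
        if a in adj and b in adj and a != b:
            adj[a].add(b)
            adj[b].add(a)
    best, seen = 0, set()
    for start in adj:
        if start in seen:
            continue
        comp, q = {start}, deque([start])   # BFS over one component
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    q.append(v)
        seen |= comp
        best = max(best, len(comp))
    return best

def targeted_removal_order(nodes, links):
    """Nodes sorted from most- to least-connected (degree-based attack)."""
    deg = {n: 0 for n in nodes}
    for a, b in links:
        if a != b:
            deg[a] += 1
            deg[b] += 1
    return sorted(nodes, key=lambda n: -deg[n])
```

On a star-shaped web, deleting the hub shatters the network into isolated nodes, while deleting a peripheral node barely changes the largest component, which is the contrast between targeted attack and random failure described above.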
Within ecology, species deletions on small (S < 14) hypothetical food web networks as well as a subset of the 113-web catalog have been used to examine the reliability of network flow, or the probability that sources (producers) are connected to sinks (consumers) in food webs [54]. The structure of the empirical webs appears to conform to reliable flow patterns identified using the hypothetical webs, but that result is based on low diversity, poorly resolved data. The use of more highly resolved data with node knock-out algorithms to simulate the loss of species allows assessment of potential secondary extinctions in complex empirical food webs. Secondary extinctions result when the removal of taxa results in one or more consumers losing all of their resource taxa. Even though most food webs do not have power-law degree distributions, they show similar patterns of robustness to other networks: removal of highly-connected species results in much higher rates of secondary extinctions than random loss of species ([38,39,95], Fig. 9). In addition, loss of high-degree species results in more rapid fragmentation of the webs [95]. Protecting basal taxa from primary removal increases the robustness of the web (i. e., fewer secondary extinctions occur) ([38], Fig. 9). While removing species with few links generally results in few secondary extinctions, in a quarter of the food webs examined, removing low-degree species results in secondary extinctions comparable to or greater than what is seen with removal of high-degree species [38]. This tends to occur in webs with relatively high C. Beyond differential impacts of various sequences of species loss in food webs, food web ‘structural robustness’ can be defined as the fraction of primary species loss that induces some level of species loss (primary + secondary extinctions) for a particular trophic-species web. Analysis of R50 (i. 
e., what proportion of species have to be removed to achieve 50% total species loss) across multiple food webs shows that robustness increases approximately logarithmically with increasing connectance ([38,39], Figs. 9 and 10). In essence, from a topological perspective, food webs with more densely interconnected taxa are better protected from species loss, since greater species loss is required before consumers lose all of their resources. It is also potentially important from a conservation perspective to identify particular species whose loss would result in the greatest number of secondary extinctions. The loss of a particular highly-connected species may or may not result in secondary extinctions. One way to identify critical taxa is to reduce the topological structure of empirical food webs into linear pathways that define the essential chains of energy delivery in each web. A particular species can be said to “dominate” other species if it passes energy to them along a chain in the
Food Webs, Figure 9 Secondary extinctions resulting from primary species loss in 4 food webs ordered by increasing connectance (C). The y-axis shows the cumulative secondary extinctions as a fraction of initial S, and the x-axis shows the primary removals of species as a fraction of initial S. 95% error bars for the random removals fall within the size of the symbols and are not shown. For the most connected (circles), least connected (triangles), and random removal (plus symbols) sequences, the data series end at the black diagonal dashed line, where primary removals plus secondary extinctions equal S and the web disappears. For the most connected species removals with basal species preserved (black dots), the data points end when only basal species remain. The shorter red diagonal lines show the points at which 50% of species are lost through combined primary removals and secondary extinctions (“robustness” or R50)
Food Webs, Figure 10 The proportion of primary species removals required to induce a total loss (primary removals plus secondary extinctions) of 50% of the species in each of 16 food webs (“robustness”; see the shorter red line of Fig. 9 for visual representation) as a function of the connectance of each web. Logarithmic fits to the three data sets are shown, with a solid line for the most connected deletion order, a long dashed line for the most connected with basal species preserved deletion order, and a short dashed line for the random deletion order. The maximum possible y value is 0.50. The equations for the fits are: y = 0.162 ln(x) + 0.651 for most connected species removals, y = 0.148 ln(x) + 0.691 for most connected species removals with basal species preserved, and y = 0.067 ln(x) + 0.571 for random species removals. Reproduced from [38], Fig. 2
dominator tree. The higher the number of species that a particular species dominates, the greater the secondary extinctions that may result from its removal [3]. This approach has the advantage of going beyond assessment of direct interactions to include indirect interactions.

As in food webs, the order of pollinator loss has an effect on potential plant extinction patterns in plant-pollinator networks [75] (see Sect. “Ecological Networks”). Loss of plant diversity associated with targeted removal of highly-connected pollinators is not as extreme as comparable secondary extinctions in food webs, which may be due to pollinator redundancy and the nested topology of those networks.

While the order in which species go locally extinct clearly affects the potential for secondary extinctions in ecosystems, the focus on high-degree, random, or even dominator species does not provide insight into ecologically plausible species-loss scenarios, whether the focus is on human perturbations or natural dynamics. The issue of what realistic natural extinction sequences might look like has been explored using a set of pelagic-focused food webs for 50 Adirondack lakes [49] with up to 75 species [98]. The geographic nestedness of species composition across the lakes is used to derive an ecologically plausible extinction sequence scenario, with the most geographically restricted taxa the most likely to go extinct. This sequence is corroborated by the pH tolerances of the species. Species removal simulations show that the food webs are highly robust, in terms of secondary extinctions, to the “realistic” extinction order and highly sensitive to the reverse order. This suggests that nested geographical distribution patterns coupled with local food web interaction patterns appear to buffer the effects of likely species losses. This highlights important aspects of community organization that may help to minimize biodiversity loss in the face of a naturally changing environment. However, anthropogenic disturbances may disrupt the inherent buffering of how taxa are organized geographically and trophically, reducing the robustness of ecosystems.

Food Web Dynamics

Analysis of the topology of food webs has proven very useful for exploring basic patterns and generalities of “who eats whom” in ecosystems. This approach seeks to identify “the most universal, high-level, persistent elements of organization” [35] in trophic networks, and to leverage understanding of such organization for thinking about ecosystem robustness. However, food webs are inherently dynamical systems, since feeding interactions involve biomass flows among species whose “stocks” can be characterized by numbers of individuals and/or aggregate population biomass. All of these stocks and flows change through time in response to direct and indirect trophic and other types of interactions. Determining the interplay among network structure, network dynamics, and various aspects of stability such as persistence, robustness, and resilience in complex “real-world” networks is one of the great current challenges in network research [101]. It is particularly important in the study of ecosystems, since they face a variety of anthropogenic perturbations such as climate change, habitat loss, and invasions, and since humans depend on them for a variety of “ecosystem services” such as supply of clean water and pollination of crops [34]. Because it is nearly impossible to compile detailed, long-term empirical data for the dynamics of more than two interacting species, most research on species interaction dynamics relies on analytical or simulation modeling.
Most modeling studies of trophic dynamics have focused narrowly on predator-prey or parasite-host interactions. However, as the previous sections should make clear, in natural ecosystems such interaction dyads are embedded in diverse, complex networks, where many additional taxa and their direct and indirect interactions can play important roles for the stability of focal species as well as the stability or persistence of the broader community. Moving beyond the two-species population dynamics modeling paradigm, there is a strong tradition of research that looks at interactions among 3–8 species, exploring dynamics and simple variations in structure in slightly more complex systems (see reviews in [40,55]). However, these interaction modules still present a drastic simplification of
the diversity and structure of natural ecosystems. Other dynamical approaches have focused on higher-diversity model systems [69], but ignore network structure in order to conduct analytically tractable analyses. Researchers are increasingly integrating dynamics with complex food web structure in modeling studies that move beyond small modules. The Lotka–Volterra cascade model [20,21,32] was an early incarnation of this type of integration. As its name suggests, the Lotka–Volterra cascade model runs classic L-V dynamics, including a non-saturating linear functional response, on sets of species interactions structured according to the cascade model [28]. The cascade model was also used to generate the structural framework for a dynamical food web model with a linear functional response [58] used to study the effects of prey-switching on ecosystem stability. Improving on aspects of biological realism of both dynamics and structure, a bioenergetic dynamical model with nonlinear functional responses [119] was used in conjunction with empirically defined food web structure among 29 species to simulate the biomass dynamics of a marine fisheries food web [117,118]. This type of nonlinear bioenergetic dynamical modeling approach has been integrated with niche model network structure and used to study more complex networks [13,14,68,112]. A variety of types of dynamics are observed in these nonlinear models, including equilibrium, limit cycle, and chaotic dynamics, which may or may not be persistent over short or long time scales (Fig. 11). Other approaches model ecological and evolutionary dynamics to assemble species into networks, rather than imposing a particular structure on them. These models, which typically employ an enormous number of parameters, are evaluated as to whether they generate plausible persistence, diversity, and network structure (see the review by [72]).
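As a heavily simplified illustration of this modeling style (not a model from the cited studies), a two-species bioenergetic consumer-resource system with a saturating, Hill-type functional response can be stepped forward with Euler integration. Every parameter value below, including the assimilation factor, is an assumption chosen for demonstration:

```python
def bioenergetic_step(R, C, dt, r=1.0, K=1.0, x=0.5, y=6.0, R0=0.5, h=2):
    """One Euler step of a toy two-species bioenergetic consumer-resource
    model with a nonlinear (Hill-type) functional response. All parameter
    values are illustrative assumptions, not taken from the text."""
    F = R ** h / (R0 ** h + R ** h)          # saturating functional response
    dR = r * R * (1.0 - R / K) - x * y * C * F   # logistic growth minus predation
    dC = -x * C + x * y * C * F * 0.85       # 0.85: assumed assimilation efficiency
    return R + dt * dR, C + dt * dC
```

Depending on the parameters, iterating this map produces the kinds of trajectories shown in Fig. 11, from damped approach to equilibrium through sustained oscillations.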
All of these approaches are generally used to examine stability, characterized in a diversity of ways, in ecosystems with complex structure and dynamics [71,85]. While it is basically impossible to empirically validate models of integrated structure and dynamics for complex ecological networks, in some situations it is possible to draw interesting connections between models and data at more aggregated levels. This provides opportunities to move beyond the merely heuristic role that such models generally play. For example, nonlinear bioenergetic models of population dynamics parametrized by biological rates allometrically scaled to populations’ average body masses have been run on various types of model food web structures [14]. This approach has allowed the comparison of trends in two different measures of food web stability, and how they relate to consumer-resource body-size ratios and to initial network structure. One measure
Food Webs
of stability is the fraction of original species that display persistent dynamics, i. e., what fraction of species do not go extinct in the model when it is run over many time steps (“species persistence”). Another measure of stability is how variable the densities of all of the persistent species are (“population stability”)—greater variability across all the species indicates decreased stability in terms of population dynamics. Brose and colleagues [14] ran the model using different hypothetical consumer-resource body-size ratios that range from 10⁻² (consumers are 100 times smaller than their resources) to 10⁵ (consumers are 100,000 times larger than their resources) (Fig. 12). Species persistence increases dramatically with increasing body-size ratios, until inflection points are reached at which persistence shifts to high levels (>0.80) of persistence (Fig. 12a). However, population stability decreases with increasing body-size ratios until inflection points are reached that show the lowest stability, and then increases again beyond those points (Fig. 12b). In both cases, the inflection points correspond to empirically observed consumer-resource body-size ratios, both for webs parametrized to represent invertebrate-dominated webs, and for webs parametrized to represent ectotherm vertebrate-dominated webs. Thus, across thousands of observations, invertebrate consumer-resource body-size ratios are 10¹, and ectotherm vertebrate consumer-resource body-size ratios are 10², which correspond to the model’s inflection points for species persistence and population stability (Fig. 12). It is interesting to note that high species persistence is coupled to low population stability—i. e., an aspect of increased stability of the whole system (species persistence) is linked to an aspect of decreased stability of components of that system (population stability). It is also interesting to note that in this formulation, using initial cascade versus niche model structure had little impact on species persistence or population stability [14], although other formulations show increased persistence when dynamics are initiated with niche model versus other structures [68]. How structure influences dynamics, and vice versa, is an open question.

Food Webs, Figure 11
5 different types of population dynamics shown as time series of population density (from [40], Fig. 10.1). The types of dynamics shown include a stable equilibrium, damped oscillations, limit cycles, amplified oscillations, and chaotic oscillations

Ecological Networks
This article has focused on food webs, which generally concern classic predator-herbivore-primary producer feeding interactions. However, the basic concept of food webs can be extended to a broader framework of “ecological networks” that is more inclusive of different components of ecosystem biomass flow, and that takes into consideration different kinds of species interactions that are not classic “predator-prey” interactions. Three examples are mentioned here. First, parasites have typically been given short shrift in traditional food webs, although exceptions exist (e. g., [51,67,74]). This is changing as it becomes clear that parasites are ubiquitous, often have significant impacts on predator-prey dynamics, and may be the dominant trophic habit in most food webs, potentially altering our understanding of structure and dynamics [59]. The dynamical models described previously have been parametrized with more conventional, non-parasite consumers in mind. An interesting open question is how
Food Webs, Figure 12
a shows the fraction of species that display persistent dynamics as a function of consumer-resource body-size ratios for model food webs parametrized for invertebrates (gray line) and ectotherm vertebrates (black line). The inflection points for shifts to high-persistence dynamics are indicated by red arrows for both curves, and those inflection points correspond to empirically observed consumer-resource body-size ratios for invertebrate-dominated webs (10¹—consumers are on average 10 times larger than their resources) and ectotherm vertebrate-dominated webs (10²—consumers are on average 100 times larger than their resources). b shows results for population stability, the mean of how variable species population biomasses are in persistent webs. In this case, the inflection points for shifts to low population stability are indicated by red arrows, and those inflection points also correspond to the empirically observed body-size ratios for consumers and resources. Figure adapted from [14]
altering dynamical model parameters such as metabolic rate, functional response, and consumer-resource body size ratios to reflect parasite characteristics will affect our understanding of food web stability. Second, the role of detritus, or dead organic matter, in food webs has yet to be adequately resolved in either structural or dynamical approaches. Detritus has typically been included as one or several separate nodes in many binary-link and flow-weighted food webs. In some cases, it is treated as an additional “primary producer,” while in other cases both primary producers and detritivores connect to it. Researchers must think much more carefully about how to include detritus in all kinds of ecological studies [80], given that it plays a fundamental role in most ecosystems and has particular characteristics that differ from other food web nodes: it is non-living organic matter, all species contribute to detrital pools, it is a major resource for many species, and the forms it takes are extremely heterogeneous (e. g., suspended organic matter in water columns; fecal material; rotting trees; dead animal bodies; small bits of plants and molted cuticle, skin, and hair mixed in soil; etc.). Third, there are many interactions that species participate in that go beyond strictly trophic interactions. Plant-animal mutualistic networks, particularly pollination and seed dispersal or “frugivory” networks, have received the most attention thus far. They are characterized as “bipartite” (two-level) graphs, with links from animals to plants, but no links among plants or among animals [7,9,56,57,73,107]. While both pollination and seed dispersal do involve a trophic interaction, with animals gaining nutrition from plants during the interactions, unlike in classic predator-prey relationships a positive benefit is conferred upon both partners in the interaction. The evolutionary and ecological dynamics of such mutualistic relationships place unique constraints on the network structure of these interactions and the dynamical stability of such networks. For example, plant-animal mutualistic networks are highly nested and thus asymmetric, such that generalist plants and generalist animals tend to interact among themselves, but specialist species (whether plants or animals) also tend to interact with the most generalist species [7,107]. When simple dynamics are run on these types of “coevolutionary” bipartite networks, it appears that the asymmetric structure enhances long-term species coexistence and thus biodiversity maintenance [9].

Future Directions

Food web research of all kinds has expanded greatly over the last several years, and there are many opportunities for exciting new work at the intersection of ecology and network structure and dynamics. In terms of empiricism, there is still a paucity of detailed, evenly resolved community food webs in every habitat type. Current theory, models, and applications need to be tested against more diverse, more complete, and more highly quantified data. In addition, there are many types of datasets that could be compiled which would support novel research. For example, certain kinds of fossil assemblages may allow the compilation of detailed paleo food webs [41], which in turn could allow examination of questions about how and why food web structure does or does not change over deep time or in response to major extinction events. Another example is data illustrating the assembly of food webs in particular habitats over ecological time. In particular, areas undergoing rapid successional dynamics would be excellent candidates, such as an area covered by volcanic lava flows, a field exposed by a retreating glacier, a hillside denuded by an earth slide, or a forest burned in a large fire. This type of data would allow empirically-based research on the topological dynamics of food webs. Another empirical frontier is the integration of multiple kinds of ecological interaction data into networks with multiple kinds of links—for example, networks that combine mutualistic interactions such as pollination and antagonistic interactions such as predator-prey relationships. In addition, more spatially explicit food web data can be compiled across microhabitats or linked macrohabitats [8]. Most current food web data is effectively aspatial even though trophic interactions occur within a spatial context. More could also be done to collect food web data based on specific instances of trophic interactions. This was done for the insects that live inside the stems of grasses in British fields. The web includes multiple grass species, grass herbivores, their parasitoids, hyper-parasitoids, and hyper-hyper-parasitoids [67]. Dissection of over 160,000 grass stems allowed detailed quantification of the frequency with which the species (S = 77 insect plus 10 grass species) and different trophic interactions (L = 126) were observed. Better empiricism will support improved and novel analysis, modeling, and theory development and testing. For example, while food webs appear fundamentally different in some ways from other kinds of “real-world” networks (e. g., they don’t display power-law degree distributions), they also appear to share a common core network structure that is scale-dependent with species richness and connectance in predictable ways, as suggested by the success of the niche and related models. Some of the disparity with other kinds of networks, and the shared structure across food webs, may be explicable through finite-size effects or other methodological or empirical constraints or artifacts. However, aspects of these patterns may reflect attributes of ecosystems that relate to particular ecological, evolutionary, or thermodynamic mechanisms underlying how species are organized in complex bioenergetic networks of feeding interactions. Untangling artifacts from attributes [63] and determining potential mechanisms underlying robust phenomenological patterns (e. g., [10]) is an important area of ongoing and future research. As a part of this, there is much work to be done to continue
to integrate structure and dynamics of complex ecological networks. This is critical for gaining a more comprehensive understanding of the conditions that underlie and promote different aspects of stability, at different levels of organization, in response to external perturbations and to endogenous short- and long-term dynamics. As the empiricism, analysis and modeling of food web structure and dynamics improves, food web network research can play a more central and critical role in conservation and management [76]. It is increasingly apparent that an ecological network perspective, which encompasses direct and indirect effects among interacting taxa, is critical for understanding, predicting, and managing the impacts of species loss and invasion, habitat conversion, and climate change. Far too often, critical issues of ecosystem management have been decided based on extremely limited knowledge of one or a very few taxa. For example, this has been an ongoing problem in fisheries science. The narrow focus of most research driving fisheries management decisions has resulted in overly optimistic assessments of sustainable fishing levels. Coupled with climate stressors, over-fishing appears to be driving steep, rapid declines in diversity of common predator target species, and probably many other kinds of associated taxa [114]. Until we acknowledge that species of interest to humans are embedded within complex networks of interactions that can produce unexpected effects through the interplay of direct and indirect effects, we will continue to experience negative outcomes from our management decisions [118]. An important part of minimizing and mitigating human impacts on ecosystems also involves research that explicitly integrates human and natural dynamics.
Network research provides a natural framework for analyzing and modeling the complex ways in which humans interact with and impact the world’s ecosystems, whether through local foraging or large-scale commercial harvesting driven by global economic markets. These and other related research directions will depend on efficient management of increasingly dispersed and diversely formatted ecological and environmental data. Ecoinformatic tools—the technologies and practices for gathering, synthesizing, analyzing, visualizing, storing, retrieving and otherwise managing ecological knowledge and information—are playing an increasingly important role in the study of complex ecosystems, including food web research [47]. Indeed, ecology provides an excellent testbed for developing, implementing, and testing new information technologies in a biocomplexity research context (e. g., Semantic Prototypes in Research Ecoinformatics/SPiRE: spire.umbc.edu/us/; Science Environment for Ecological Knowledge/SEEK: seek.ecoinformatics.org;
Webs on the Web/WoW: www.foodwebs.org). Synergistic ties between ecology, physics, computer science and other disciplines will dramatically increase the efficacy of research that takes advantage of such interdisciplinary approaches, as is currently happening in food web and related research.
Bibliography

Primary Literature
1. Albert R, Jeong H, Barabási AL (2000) Error and attack tolerance of complex networks. Nature 406:378–382
2. Albert R, Barabási AL (2002) Statistical mechanics of complex networks. Rev Mod Phys 74:47–97
3. Allesina S, Bodini A (2004) Who dominates whom in the ecosystem? Energy flow bottlenecks and cascading extinctions. J Theor Biol 230:351–358
4. Amaral LAN, Scala A, Berthelemy M, Stanley HE (2000) Classes of small-world networks. Proc Natl Acad Sci USA 97:11149–11152
5. Arii K, Parrott L (2004) Emergence of non-random structure in local food webs generated from randomly structured regional webs. J Theor Biol 227:327–333
6. Baird D, Ulanowicz RE (1989) The seasonal dynamics of the Chesapeake Bay ecosystem. Ecol Monogr 59:329–364
7. Bascompte J, Jordano P, Melián CJ, Olesen JM (2003) The nested assembly of plant-animal mutualistic networks. Proc Natl Acad Sci USA 100:9383–9387
8. Bascompte J, Melián CJ, Sala E (2005) Interaction strength combinations and the overfishing of a marine food web. Proc Natl Acad Sci USA 102:5443–5447
9. Bascompte J, Jordano P, Olesen JM (2006) Asymmetric coevolutionary networks facilitate biodiversity maintenance. Science 312:431–433
10. Beckerman AP, Petchey OL, Warren PH (2006) Foraging biology predicts food web complexity. Proc Natl Acad Sci USA 103:13745–13749
11. Bersier L-F, Banašek-Richter C, Cattin M-F (2002) Quantitative descriptors of food web matrices. Ecology 83:2394–2407
12. Briand F, Cohen JE (1984) Community food webs have scale-invariant structure. Nature 398:330–334
13. Brose U, Williams RJ, Martinez ND (2003) Comment on “Foraging adaptation and the relationship between food-web complexity and stability”. Science 301:918b
14. Brose U, Williams RJ, Martinez ND (2006) Allometric scaling enhances stability in complex food webs. Ecol Lett 9:1228–1236
15. Camacho J, Arenas A (2005) Universal scaling in food-web structure? Nature 435:E3–E4
16. Camacho J, Guimerà R, Amaral LAN (2002) Robust patterns in food web structure. Phys Rev Lett 88:228102
17. Camacho J, Guimerà R, Amaral LAN (2002) Analytical solution of a model for complex food webs. Phys Rev E 65:030901
18. Cartoza CC, Garlaschelli D, Caldarelli G (2006) Graph theory and food webs. In: Pascual M, Dunne JA (eds) Ecological Networks: Linking Structure to Dynamics in Food Webs. Oxford University Press, New York, pp 93–117
19. Cattin M-F, Bersier L-F, Banašek-Richter C, Baltensperger M, Gabriel J-P (2004) Phylogenetic constraints and adaptation explain food-web structure. Nature 427:835–839
20. Chen X, Cohen JE (2001) Global stability, local stability and permanence in model food webs. J Theor Biol 212:223–235
21. Chen X, Cohen JE (2001) Transient dynamics and food web complexity in the Lotka–Volterra cascade model. Proc R Soc Lond B 268:869–877
22. Christian RR, Luczkovich JJ (1999) Organizing and understanding a winter’s seagrass foodweb network through effective trophic levels. Ecol Model 117:99–124
23. Cohen JE (1977) Ratio of prey to predators in community food webs. Nature 270:165–167
24. Cohen JE (1977) Food webs and the dimensionality of trophic niche space. Proc Natl Acad Sci USA 74:4533–4563
25. Cohen JE (1978) Food Webs and Niche Space. Princeton University Press, NJ
26. Cohen JE (1989) Ecologists Co-operative Web Bank (ECOWeB™). Version 1.0. Machine Readable Data Base of Food Webs. Rockefeller University, NY
27. Cohen JE, Briand F (1984) Trophic links of community food webs. Proc Natl Acad Sci USA 81:4105–4109
28. Cohen JE, Newman CM (1985) A stochastic theory of community food webs: I. Models and aggregated data. Proc R Soc Lond B 224:421–448
29. Cohen JE, Palka ZJ (1990) A stochastic theory of community food webs: V. Intervality and triangulation in the trophic niche overlap graph. Am Nat 135:435–463
30. Cohen JE, Briand F, Newman CM (1986) A stochastic theory of community food webs: III. Predicted and observed length of food chains. Proc R Soc Lond B 228:317–353
31. Cohen JE, Briand F, Newman CM (1990) Community Food Webs: Data and Theory. Springer, Berlin
32. Cohen JE, Luczak T, Newman CM, Zhou Z-M (1990) Stochastic structure and non-linear dynamics of food webs: qualitative stability in a Lotka–Volterra cascade model. Proc R Soc Lond B 240:607–627
33. Cohen JE, Jonsson T, Carpenter SR (2003) Ecological community description using the food web, species abundance, and body size. Proc Natl Acad Sci USA 100:1781–1786
34. Daily GC (ed) (1997) Nature’s services: Societal dependence on natural ecosystems. Island Press, Washington DC
35. Doyle J, Csete M (2007) Rules of engagement. Nature 446:860
36. Dunne JA (2006) The network structure of food webs. In: Pascual M, Dunne JA (eds) Ecological Networks: Linking Structure to Dynamics in Food Webs. Oxford University Press, New York, pp 27–86
37. Dunne JA, Williams RJ, Martinez ND (2002) Food web structure and network theory: the role of connectance and size. Proc Natl Acad Sci USA 99:12917–12922
38. Dunne JA, Williams RJ, Martinez ND (2002) Network structure and biodiversity loss in food webs: robustness increases with connectance. Ecol Lett 5:558–567
39. Dunne JA, Williams RJ, Martinez ND (2004) Network structure and robustness of marine food webs. Mar Ecol Prog Ser 273:291–302
40. Dunne JA, Brose U, Williams RJ, Martinez ND (2005) Modeling food-web dynamics: complexity-stability implications. In: Belgrano A, Scharler U, Dunne JA, Ulanowicz RE (eds) Aquatic Food Webs: An Ecosystem Approach. Oxford University Press, New York, pp 117–129
41. Dunne JA, Williams RJ, Martinez ND, Wood RA, Erwin DH (2008) Compilation and network analyses of Cambrian food webs. PLoS Biol 6:e102. doi:10.1371/journal.pbio.0060102
42. Egerton FN (2007) Understanding food chains and food webs, 1700–1970. Bull Ecol Soc Am 88(1):50–69
43. Elton CS (1927) Animal Ecology. Sidgwick and Jackson, London
44. Elton CS (1958) Ecology of Invasions by Animals and Plants. Chapman & Hall, London
45. Garlaschelli D, Caldarelli G, Pietronero L (2003) Universal scaling relations in food webs. Nature 423:165–168
46. Goldwasser L, Roughgarden JA (1993) Construction of a large Caribbean food web. Ecology 74:1216–1233
47. Green JL, Hastings A, Arzberger P, Ayala F, Cottingham KL, Cuddington K, Davis F, Dunne JA, Fortin M-J, Gerber L, Neubert M (2005) Complexity in ecology and conservation: mathematical, statistical, and computational challenges. Bioscience 55:501–510
48. Hardy AC (1924) The herring in relation to its animate environment. Part 1. The food and feeding habits of the herring with special reference to the East Coast of England. Fish Investig Ser II 7:1–53
49. Havens K (1992) Scale and structure in natural food webs. Science 257:1107–1109
50. Hutchinson GE (1959) Homage to Santa Rosalia, or why are there so many kinds of animals? Am Nat 93:145–159
51. Huxham M, Beany S, Raffaelli D (1996) Do parasites reduce the chances of triangulation in a real food web? Oikos 76:284–300
52. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási A-L (2000) The large-scale organization of metabolic networks. Nature 407:651–654
53. Jeong H, Mason SP, Barabási A-L, Oltvai ZN (2001) Lethality and centrality in protein networks. Nature 411:41
54. Jordán F, Molnár I (1999) Reliable flows and preferred patterns in food webs. Evol Ecol Res 1:591–609
55. Jordán F, Scheuring I (2004) Network ecology: topological constraints on ecosystem dynamics. Phys Life Rev 1:139–229
56. Jordano P (1987) Patterns of mutualistic interactions in pollination and seed dispersal: connectance, dependence asymmetries, and coevolution. Am Nat 129:657–677
57. Jordano P, Bascompte J, Olesen JM (2003) Invariant properties in co-evolutionary networks of plant-animal interactions. Ecol Lett 6:69–81
58. Kondoh M (2003) Foraging adaptation and the relationship between food-web complexity and stability. Science 299:1388–1391
59. Lafferty KD, Dobson AP, Kuris AM (2006) Parasites dominate food web links. Proc Natl Acad Sci USA 103:11211–11216
60. Lindeman RL (1942) The trophic-dynamic aspect of ecology. Ecology 23:399–418
61. Link J (2002) Does food web theory work for marine ecosystems? Mar Ecol Prog Ser 230:1–9
62. MacArthur RH (1955) Fluctuation of animal populations and a measure of community stability. Ecology 36:533–536
63. Martinez ND (1991) Artifacts or attributes? Effects of resolution on the Little Rock Lake food web. Ecol Monogr 61:367–392
64. Martinez ND (1992) Constant connectance in community food webs. Am Nat 139:1208–1218
65. Martinez ND (1993) Effect of scale on food web structure. Science 260:242–243
66. Martinez ND (1994) Scale-dependent constraints on food-web structure. Am Nat 144:935–953
67. Martinez ND, Hawkins BA, Dawah HA, Feifarek BP (1999) Effects of sampling effort on characterization of food-web structure. Ecology 80:1044–1055
68. Martinez ND, Williams RJ, Dunne JA (2006) Diversity, complexity, and persistence in large model ecosystems. In: Pascual M, Dunne JA (eds) Ecological Networks: Linking Structure to Dynamics in Food Webs. Oxford University Press, New York, pp 163–185
69. May RM (1972) Will a large complex system be stable? Nature 238:413–414
70. May RM (1973) Stability and Complexity in Model Ecosystems. Princeton University Press, Princeton. Reprinted in 2001 as a “Princeton Landmarks in Biology” edition
71. McCann KS (2000) The diversity-stability debate. Nature 405:228–233
72. McKane AJ, Drossel B (2006) Models of food-web evolution. In: Pascual M, Dunne JA (eds) Ecological Networks: Linking Structure to Dynamics in Food Webs. Oxford University Press, New York, pp 223–243
73. Memmott J (1999) The structure of a plant-pollinator network. Ecol Lett 2:276–280
74. Memmott J, Martinez ND, Cohen JE (2000) Predators, parasitoids and pathogens: species richness, trophic generality and body sizes in a natural food web. J Anim Ecol 69:1–15
75. Memmott J, Waser NM, Price MV (2004) Tolerance of pollination networks to species extinctions. Proc R Soc Lond B 271:2605–2611
76. Memmott J, Alonso D, Berlow EL, Dobson A, Dunne JA, Sole R, Weitz J (2006) Biodiversity loss and ecological network structure. In: Pascual M, Dunne JA (eds) Ecological Networks: Linking Structure to Dynamics in Food Webs. Oxford University Press, New York, pp 325–347
77. Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network motifs: simple building blocks of complex networks. Science 298:763–764
78. Montoya JM, Solé RV (2002) Small world patterns in food webs. J Theor Biol 214:405–412
79. Montoya JM, Solé RV (2003) Topological properties of food webs: from real data to community assembly models. Oikos 102:614–622
80. Moore JC, Berlow EL, Coleman DC, de Ruiter PC, Dong Q, Hastings A, Collin Johnson N, McCann KS, Melville K, Morin PJ, Nadelhoffer K, Rosemond AD, Post DM, Sabo JL, Scow KM, Vanni MJ, Wall DH (2004) Detritus, trophic dynamics and biodiversity. Ecol Lett 7:584–600
81. Neutel AM, Heesterbeek JAP, de Ruiter PC (2002) Stability in real food webs: weak links in long loops. Science 296:1120–1123
82. Newman MEJ (2002) Assortative mixing in networks. Phys Rev Lett 89:208701
83. Newman M, Barabási A-L, Watts DJ (eds) (2006) The Structure and Dynamics of Networks. Princeton University Press, Princeton
84. Odum E (1953) Fundamentals of Ecology. Saunders, Philadelphia
85. Pascual M, Dunne JA (eds) (2006) Ecological Networks: Linking Structure to Dynamics in Food Webs. Oxford University Press, New York
86. Paine RT (1988) Food webs: road maps of interactions or grist for theoretical development? Ecology 69:1648–1654
87. Pierce WD, Cushman RA, Hood CE (1912) The insect enemies of the cotton boll weevil. US Dept Agric Bull 100:9–99
88. Pimm SL (1982) Food Webs. Chapman and Hall, London. Reprinted in 2002 as a 2nd edition by University of Chicago Press
89. Pimm SL (1984) The complexity and stability of ecosystems. Nature 307:321–326
90. Pimm SL, Lawton JH (1978) On feeding on more than one trophic level. Nature 275:542–544
91. Pimm SL, Lawton JH (1980) Are food webs divided into compartments? J Anim Ecol 49:879–898
92. Polis GA (1991) Complex desert food webs: an empirical critique of food web theory. Am Nat 138:123–155
93. Schoener TW (1989) Food webs from the small to the large. Ecology 70:1559–1589
94. Schoenly K, Beaver R, Heumier T (1991) On the trophic relations of insects: a food web approach. Am Nat 137:597–638
95. Solé RV, Montoya JM (2001) Complexity and fragility in ecological networks. Proc R Soc Lond B 268:2039–2045
96. Solow AR (1996) On the goodness of fit of the cascade model. Ecology 77:1294–1297
97. Solow AR, Beet AR (1998) On lumping species in food webs. Ecology 79:2013–2018
98. Srinivasan UT, Dunne JA, Harte H, Martinez ND (2007) Response of complex food webs to realistic extinction sequences. Ecology 88:671–682
99. Stouffer DB, Camacho J, Guimerà R, Ng CA, Amaral LAN (2005) Quantitative patterns in the structure of model and empirical food webs. Ecology 86:1301–1311
100. Stouffer DB, Camacho J, Amaral LAN (2006) A robust measure of food web intervality. Proc Natl Acad Sci USA 103:19015–19020
101. Strogatz SH (2001) Exploring complex networks. Nature 410:268–275
102. Sugihara G, Schoenly K, Trombla A (1989) Scale invariance in food web properties. Science 245:48–52
103. Summerhayes VS, Elton CS (1923) Contributions to the ecology of Spitzbergen and Bear Island. J Ecol 11:214–286
104. Summerhayes VS, Elton CS (1928) Further contributions to the ecology of Spitzbergen and Bear Island. J Ecol 16:193–268
105. Thompson RM, Townsend CR (1999) The effect of seasonal variation on the community structure and food-web attributes of two streams: implications for food-web science. Oikos 87:75–88
106. Thompson RM, Townsend CR (2005) Food web topology varies with spatial scale in a patchy environment. Ecology 86:1916–1925
107. Vásquez DP, Aizen MA (2004) Asymmetric specialization: a pervasive feature of plant-pollinator interactions. Ecology 85:1251–1257
108. Warren PH (1989) Spatial and temporal variation in the structure of a freshwater food web. Oikos 55:299–311
109. Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’ networks. Nature 393:440–442
110. Williams RJ, Martinez ND (2000) Simple rules yield complex food webs. Nature 404:180–183
111. Williams RJ, Martinez ND (2004) Trophic levels in complex food webs: theory and data. Am Nat 163:458–468
112. Williams RJ, Martinez ND (2004) Diversity, complexity, and persistence in large model ecosystems. Santa Fe Institute Working Paper 04-07-022
113. Williams RJ, Berlow EL, Dunne JA, Barabási AL, Martinez ND (2002) Two degrees of separation in complex food webs. Proc Natl Acad Sci USA 99:12913–12916
114. Worm B, Sandow M, Oschlies A, Lotze HK, Myers RA (2005) Global patterns of predator diversity in the open oceans. Science 309:1365–1369
115. Yodzis P (1980) The connectance of real ecosystems. Nature 284:544–545
116. Yodzis P (1984) The structure of assembled communities II. J Theor Biol 107:115–126
117. Yodzis P (1998) Local trophodynamics and the interaction of marine mammals and fisheries in the Benguela ecosystem. J Anim Ecol 67:635–658
118. Yodzis P (2000) Diffuse effects in food webs. Ecology 81:261–266
119. Yodzis P, Innes S (1992) Body-size and consumer-resource dynamics. Am Nat 139:1151–1173
Books and Reviews
Belgrano A, Scharler U, Dunne JA, Ulanowicz RE (eds) (2005) Aquatic Food Webs: An Ecosystem Approach. Oxford University Press, Oxford
Berlow EL, Neutel A-M, Cohen JE, De Ruiter P, Ebenman B, Emmerson M, Fox JW, Jansen VAA, Jones JI, Kokkoris GD, Logofet DO, McKane AJ, Montoya J, Petchey OL (2004) Interaction strengths in food webs: issues and opportunities. J Anim Ecol 73:585–598
Borer ET, Anderson K, Blanchette CA, Broitman B, Cooper SD, Halpern BS (2002) Topological approaches to food web analyses: a few modifications may improve our insights. Oikos 99:397–401
Christensen V, Pauly D (1993) Trophic Models of Aquatic Ecosystems. ICLARM, Manila
Cohen JE, Beaver RA, Cousins SH, De Angelis DL, et al (1993) Improving food webs. Ecology 74:252–258
Cohen JE, Briand F, Newman CM (1990) Community Food Webs: Data and Theory. Springer, Berlin
DeAngelis DL, Post WM, Sugihara G (eds) (1983) Current Trends in Food Web Theory. ORNL-5983, Oak Ridge Natl Laboratory
Drossel B, McKane AJ (2003) Modelling food webs. In: Bornholdt S, Schuster HG (eds) Handbook of Graphs and Networks: From the Genome to the Internet. Wiley-VCH, Berlin
Hall SJ, Raffaelli DG (1993) Food webs: theory and reality. Adv Ecol Res 24:187–239
Lawton JH (1989) Food webs. In: Cherett JM (ed) Ecological Concepts. Blackwell Scientific, Oxford
Lawton JH, Warren PH (1988) Static and dynamic explanations for patterns in food webs. Trends Ecol Evol 3:242–245
Martinez ND (1995) Unifying ecological subdisciplines with ecosystem food webs. In: Jones CG, Lawton JH (eds) Linking Species and Ecosystems. Chapman and Hall, New York
Martinez ND, Dunne JA (1998) Time, space, and beyond: scale issues in food-web research. In: Peterson D, Parker VT (eds) Ecological Scale: Theory and Applications. Columbia University Press, New York
May RM (1983) The structure of food webs. Nature 301:566–568
May RM (2006) Network structure and the biology of populations. Trends Ecol Evol 21:394–399
Montoya JM, Pimm SL, Sole RV (2006) Ecological networks and their fragility. Nature 442:259–264
Moore J, de Ruiter P, Wolters V (eds) (2005) Dynamic Food Webs: Multispecies Assemblages, Ecosystem Development and Environmental Change. Academic Press, Elsevier, Amsterdam
Pimm SL, Lawton JH, Cohen JE (1991) Food web patterns and their consequences. Nature 350:669–674
Polis GA, Winemiller KO (eds) (1996) Food Webs: Integration of Patterns & Dynamics. Chapman and Hall
Polis GA, Power ME, Huxel GR (eds) (2003) Food Webs at the Landscape Level. University of Chicago Press
Post DM (2002) The long and short of food-chain length. Trends Ecol Evol 17:269–277
Strong DR (ed) (1988) Food web theory: a ladder for picking strawberries. Special Feature. Ecology 69:1647–1676
Warren PH (1994) Making connections in food webs. Trends Ecol Evol 9:136–141
Woodward G, Ebenman B, Emmerson M, Montoya JM, Olesen JM, Valido A, Warren PH (2005) Body size in ecological networks. Trends Ecol Evol 20:402–409
Fuzzy Logic
Lotfi A. Zadeh
Department of EECS, University of California, Berkeley, USA
Article Outline
Glossary
Definition of the Subject
Introduction
Conceptual Structure of Fuzzy Logic
The Basics of Fuzzy Set Theory
The Concept of Granulation
The Concepts of Precisiation and Cointensive Precisiation
The Concept of a Generalized Constraint
Principal Contributions of Fuzzy Logic
A Glimpse of What Lies Beyond Fuzzy Logic
Bibliography
Glossary
Cointension A qualitative measure of the proximity of meanings/input-output relations.
Extension principle A principle which relates to propagation of generalized constraints.
f-validity Fuzzy validity.
Fuzzy if-then rule A rule of the form: if X is A then Y is B. In general, A and B are fuzzy sets.
Fuzzy logic (FL) A precise logic of imprecision, uncertainty and approximate reasoning.
Fuzzy logic gambit Exploitation of tolerance for imprecision through deliberate m-imprecisiation followed by mm-precisiation.
Fuzzy set A class with a fuzzy boundary.
Generalized constraint A constraint of the form X isr R, where X is the constrained variable, R is the constraining relation and r is an indexical variable which defines the modality of the constraint, that is, its semantics. In general, generalized constraints have elasticity.
Generalized constraint language A language generated by combination and propagation of generalized constraints.
Graduation Association of a scale of degrees with a fuzzy set.
Granuland Result of granulation.
Granular variable A variable which takes granules as values.
Granulation Partitioning of an object/set into granules.
Granule A clump of attribute values drawn together by indistinguishability, equivalence, similarity, proximity or functionality.
Linguistic variable A granular variable with linguistic labels of granular values.
m-precision Precision of meaning.
mh-precisiand m-precisiand which is described in a natural language (human-oriented).
mm-precisiand m-precisiand which is described in a mathematical language (machine-oriented).
p-validity Provable validity.
Precisiand Result of precisiation.
Precisiend Object of precisiation.
v-precision Precision of value.
Definition of the Subject
Viewed in a historical perspective, fuzzy logic is closely related to its precursor – fuzzy set theory [70]. Conceptually, fuzzy logic has a much broader scope and a much higher level of generality than traditional logical systems, among them the classical bivalent logic, multivalued logics, modal logics, probabilistic logics, etc. The principal objective of fuzzy logic is formalization – and eventual mechanization – of two remarkable human capabilities. First, the capability to converse, communicate, reason and make rational decisions in an environment of imprecision, uncertainty, incompleteness of information, partiality of truth and partiality of possibility. And second, the capability to perform a wide variety of physical and mental tasks – such as driving a car in city traffic and summarizing a book – without any measurements and any computations.
A concept which has a position of centrality in fuzzy logic is that of a fuzzy set. Informally, a fuzzy set is a class with a fuzzy boundary, implying a gradual transition from membership to nonmembership. A fuzzy set is precisiated through graduation, that is, through association with a scale of grades of membership. Thus, membership in a fuzzy set is a matter of degree. Importantly, in fuzzy logic everything is or is allowed to be graduated, that is, be a matter of degree. Furthermore, in fuzzy logic everything is or is allowed to be granulated, with a granule being a clump of attribute-values drawn together by indistinguishability, equivalence, similarity, proximity or functionality. Graduation and granulation form the core of fuzzy logic. Graduated granulation is the basis for the concept of a linguistic variable – a variable whose values are words rather than numbers [73]. The concept of a linguistic variable is employed in almost all applications of fuzzy logic.
During much of its early history, fuzzy logic was an object of controversy stemming in part from the pejorative connotation of the term “fuzzy”. In reality, fuzzy logic is not fuzzy. Basically, fuzzy logic is a precise logic of imprecision and uncertainty. An important milestone in the evolution of fuzzy logic was the development of the concept of a linguistic variable and the machinery of fuzzy if-then rules [73,90]. Another important milestone was the conception of possibility theory [79]. Possibility theory and probability theory are complementary. A further important milestone was the development of the formalism of computing with words (CW) [93]. Computing with words opens the door to a wide-ranging enlargement of the role of natural languages in scientific theories.
In the following, fuzzy logic is viewed in a nontraditional perspective. In this perspective, the cornerstones of fuzzy logic are graduation, granulation, precisiation and the concept of a generalized constraint. The concept of a generalized constraint serves to precisiate the concept of granular information. Granular information is the basis for granular computing (GrC) [2,29,37,81,91,92]. In granular computing the objects of computation are granular variables, with a granular value of a granular variable representing imprecise and/or uncertain information about the value of the variable. In effect, granular computing is the computational facet of fuzzy logic. GrC and CW are closely related. In coming years, GrC and CW are likely to play increasingly important roles in the evolution of fuzzy logic and its applications.

Introduction
Science deals not with reality but with models of reality. In large measure, scientific progress is driven by a quest for better models of reality. In the real world, imprecision, uncertainty and complexity have a pervasive presence.
In this setting, construction of better models of reality requires a better understanding of how to deal effectively with imprecision, uncertainty and complexity. To a significant degree, development of fuzzy logic has been, and continues to be, motivated by this need.
In essence, logic is concerned with formalization of reasoning. Correspondingly, fuzzy logic is concerned with formalization of fuzzy reasoning, with the understanding that precise reasoning is a special case of fuzzy reasoning. Humans have many remarkable capabilities. Among them there are two that stand out in importance. First, the capability to converse, communicate, reason and make rational decisions in an environment of imprecision, uncertainty, incompleteness of information, partiality of truth and partiality of possibility. And second, the capability to perform a wide variety of physical and mental tasks – such as driving a car in heavy city traffic and summarizing a book – without any measurements and any computations. In large measure, fuzzy logic is aimed at formalization, and eventual mechanization, of these capabilities. In this perspective, fuzzy logic plays the role of a bridge from natural to machine intelligence.
There are many misconceptions about fuzzy logic. A common misconception is that fuzzy logic is fuzzy. In reality, fuzzy logic is not fuzzy. Fuzzy logic deals precisely with imprecision and uncertainty. In fuzzy logic, the objects of deduction are, or are allowed to be, fuzzy, but the rules governing deduction are precise. In summary, fuzzy logic is a precise system of reasoning, deduction and computation in which the objects of discourse and analysis are associated with information which is, or is allowed to be, imprecise, uncertain, incomplete, unreliable, partially true or partially possible.
For illustration, here are a few simple examples of reasoning in which the objects of reasoning are fuzzy. First, consider the familiar example of deduction in Aristotelian, bivalent logic.

all men are mortal
Socrates is a man
Socrates is mortal

In this example, there is no imprecision and no uncertainty. In an environment of imprecision and uncertainty, an analogous example – an example drawn from fuzzy logic – is

most Swedes are tall
Magnus is a Swede
it is likely that Magnus is tall

with the understanding that Magnus is an individual picked at random from a population of Swedes. To deduce the answer from the premises, it is necessary to precisiate the meaning of “most” and “tall,” with “likely” interpreted as a fuzzy probability which, as a fuzzy number, is equal to “most”.
This simple example points to a basic characteristic of fuzzy logic, namely, in fuzzy logic precisiation of meaning is a prerequisite to deduction. In the example under consideration, deduction is contingent on precisiation of “most”, “tall” and “likely”. The issue of precisiation has a position of centrality in fuzzy logic. In fuzzy logic, deduction is viewed as an instance of question-answering. Let I be an information set consisting
of a system of propositions p1, …, pn, I = S(p1, …, pn). Usually, I is a conjunction of p1, …, pn. Let q be a question. A question-answering schema may be represented as

I
q
ans(q/I)

where ans(q/I) denotes the answer to q given I. The following examples are instances of the deduction schema.

I: most Swedes are tall
   Magnus is a Swede
q: what is the probability that Magnus is tall?
ans(q/I) is likely, likely = most   (1)

I: most Swedes are tall
q: what fraction of Swedes are not tall?
ans(q/I) is (1 − most)   (2)

I: most Swedes are tall
q: what fraction of Swedes are short?
ans(q/I)   (3)

I: most Swedes are tall
q: what is the average height of Swedes?
ans(q/I)   (4)

I: a box contains balls of various sizes
   most are small
   there are many more small balls than large balls
q: what is the probability that a ball drawn at random is neither large nor small?
ans(q/I)   (5)

In these examples, rules of deduction in fuzzy logic must be employed to compute ans(q/I). For (1) and (2) deduction is simple. For (3)–(5) deduction requires the use of what is referred to as the extension principle [70,75]. This principle is discussed in Sect. “The Concept of a Generalized Constraint”.
A less simple example of deduction involves interpolation of an imprecisely specified function. Interpolation of imprecisely specified functions, or interpolative deduction for short, plays a pivotal role in many applications of fuzzy logic, especially in the realm of control. For simplicity, assume that f is a function from reals to reals, Y = f(X). Assume that what is known about f is a collection of input-output pairs of the form

*f = ((*a1, *b1), …, (*an, *bn)),

where *a is an abbreviation of “approximately a”. Such a collection is referred to as a fuzzy graph of f [74]. A fuzzy graph of f may be interpreted as a summary of f. In many applications, a fuzzy graph is described as a collection of fuzzy if-then rules of the form

if X is *ai then Y is *bi,   i = 1, …, n.

A very simple example of interpolative deduction is the following. Assume that *a1, *a2, *a3 are labeled small, medium and large, respectively. A fuzzy graph of f may be expressed as a calculus of fuzzy if-then rules:

*f: if X is small then Y is small
    if X is medium then Y is large
    if X is large then Y is small.

Let *a be a value of X. Viewing *f as an information set, I, interpolation of f may be expressed as a question-answering schema

I: *f
q: f(*a)
ans(q/(*f, *a))

Given a value of X, X = *a, the question is: What is f(*a)? The so-called Mamdani rule [39,74,78] provides an answer to this question. More generally, in an environment of imprecision and uncertainty, fuzzy if-then rules may be of the form

if X is *ai then usually (Y is *bi),   i = 1, …, n.

Rules of this form endow fuzzy logic with the capability to model complex causal dependencies, especially in the realms of economics, social systems, forecasting and medicine.
What is not widely recognized is that fuzzy logic is more than an addition to the methods of dealing with imprecision, uncertainty and complexity. In effect, fuzzy logic represents a paradigm shift. More specifically, it is traditional to associate scientific progress with progression from perceptions to numbers. What fuzzy logic adds to this tradition are four basic capabilities.

(a) Nontraditional. Progression from perceptions to precisiated words.
(b) Nontraditional. Progression from unprecisiated words to precisiated words.
(c) Countertraditional. Progression from numbers to precisiated words.
(d) Nontraditional. Computing with words (CW)/NL-computation.
These capabilities open the door to a wide-ranging enlargement of the role of natural languages in scientific theories. Our brief discussion of deduction in the context of fuzzy logic is intended to clarify the nature of problems which fuzzy logic is designed to address. The principal concepts and techniques which form the core of fuzzy logic are discussed in the following. Our discussion draws on the concepts and ideas introduced in [102].
Conceptual Structure of Fuzzy Logic
There are many logical systems, among them the classical, Aristotelian, bivalent logic, multivalued logics, modal logics, probabilistic logic, dynamic logic, etc. What differentiates fuzzy logic from such logical systems is that fuzzy logic is much more than a logical system.
The point of departure in fuzzy logic is the concept of a fuzzy set. Informally, a fuzzy set is a class with a fuzzy boundary, implying that, in general, transition from membership to nonmembership in a fuzzy set is gradual rather than abrupt. A set is a class with a crisp boundary (Fig. 1). A set, A, in a space U, U = {u}, is precisiated through association with a characteristic function which maps U into {0, 1}. More generally, a fuzzy set, A, is precisiated through graduation, that is, through association with A of a membership function, μA – a mapping from U to a grade of membership space, G, with μA(u) representing the grade of membership of u in A. In other words, membership in a fuzzy set is a matter of degree. A familiar example of graduation is the association of the Richter scale with the class of earthquakes.
A fuzzy set is basic if G is the unit interval. More generally, G may be a partially ordered set. L-fuzzy sets [23] fall into this category. A basic fuzzy set is of Type 1. A fuzzy set, A, is of Type 2 if μA(u) is a fuzzy set of Type 1. Recursively, a fuzzy set, A, is of Type n if μA(u) is a fuzzy set of Type n − 1, n = 2, 3, … [75]. Fuzzy sets of Type 2 have become an object of growing attention in the literature of fuzzy logic [40]. Unless stated to the contrary, a fuzzy set is assumed to be of Type 1 (basic).
Note. A clarification is in order. Consider a concatenation of two words, A and B, with A modifying B, e.g. A is an adjective and B is a noun. Usually, A plays the role of an s-modifier, that is, a modifier which specializes B in the sense that AB is a subset of B, as in convex set.
In some instances, however, A plays the role of a g-modifier, that is, a modifier which generalizes B. In this sense, fuzzy in fuzzy set, fuzzy logic and, more generally, in fuzzy B, is a g-modifier. Examples: fuzzy topology, fuzzy measure, fuzzy arithmetic, fuzzy stability, etc. Many misconceptions
Fuzzy Logic, Figure 1 The concepts of a set and a fuzzy set are derived from the concept of a class through precisiation. A fuzzy set has a fuzzy boundary. A fuzzy set is precisiated through graduation
about fuzzy logic are rooted in incorrect interpretation of fuzzy as an s-modifier.
What is important to note is that (a) μA(u) is name-based (extensional) if u is the name of an object in U, e.g., middle-aged(Vera); (b) μA(u) is attribute-based (intensional) if the grade of membership is a function of an attribute of u, e.g., age; and (c) μA(u) is perception-based if u is a perception of an object, e.g., Vera’s grade of membership in the class of middle-aged women, based on her appearance, is 0.8.
It should be observed that a class of objects, A, has two basic attributes: (a) the boundary of A; and (b) the cardinality (count) or, more generally, the measure of A (Fig. 1). In this perspective, fuzzy set theory is, in the main, boundary-oriented, while probability theory is, in the main, measure-oriented. Fuzzy logic is, in the main, both boundary- and measure-oriented. The concept of a membership function is the centerpiece of fuzzy set theory.
With the concept of a fuzzy set as the point of departure, different directions may be pursued, leading to various facets of fuzzy logic. More specifically, the following are the principal facets of fuzzy logic: the logical facet, FLl; the fuzzy-set-theoretic facet, FLs; the epistemic facet, FLe; and the relational facet, FLr (Fig. 2).
The logical facet of FL, FLl, is fuzzy logic in its narrow sense. FLl may be viewed as a generalization of multivalued logic. The agenda of FLl is similar in spirit to the agenda of classical logic [17,22,25,44,45].
The fuzzy-set-theoretic facet, FLs, is focused on fuzzy sets. The theory of fuzzy sets is central to fuzzy logic. Historically, the theory of fuzzy sets [70] preceded fuzzy
Fuzzy Logic, Figure 2 Principal facets of fuzzy logic (FL). The nucleus of fuzzy logic is the concept of a fuzzy set
Fuzzy Logic, Figure 3 The cornerstones of a nontraditional view of fuzzy logic
logic [77]. The theory of fuzzy sets may be viewed as an entry to generalizations of various branches of mathematics, among them fuzzy topology, fuzzy measure theory, fuzzy graph theory, fuzzy algebra and fuzzy differential equations. Note that fuzzy X is a fuzzy-set-theory-based or, more generally, fuzzy-logic-based generalization of X. The epistemic facet of FL, FLe, is concerned with knowledge representation, semantics of natural languages and information analysis. In FLe, a natural language is viewed as a system for describing perceptions. An important branch of FLe is possibility theory [15,79,83]. Another important branch of FLe is the computational theory of perceptions [93,94,95]. The relational facet, FLr, is focused on fuzzy relations and, more generally, on fuzzy dependencies. The concept of a linguistic variable – and the associated calculi of fuzzy if-then rules – play pivotal roles in almost all applications of fuzzy logic [1,3,6,8,12,13,18,26,28,32,36,40,48,52,65,66, 67,69].
Fuzzy Logic, Figure 4 Precisiation of middle-age through graduation
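To make the graduation of middle-age concrete, here is a minimal sketch (in Python) of a trapezoidal membership function. The age breakpoints are assumptions chosen for illustration, picked only so that an age of 43 receives the grade 0.8 used in the Vera example; they are not taken from Fig. 4.

```python
# Sketch: precisiation of "middle-aged" through graduation, i.e. association
# of a membership function with the class. The breakpoints (39-44-55-60) are
# assumed values for illustration, not taken from Fig. 4.

def mu_middle_aged(age):
    """Trapezoidal membership function for the fuzzy set middle-aged."""
    if age <= 39 or age >= 60:
        return 0.0
    if 44 <= age <= 55:
        return 1.0
    if age < 44:
        return (age - 39) / 5.0   # rising edge
    return (60 - age) / 5.0       # falling edge

print(mu_middle_aged(43))  # 0.8 (Vera's grade in the text)
print(mu_middle_aged(50))  # 1.0
```

The gradual rise and fall of the edges is precisely the “fuzzy boundary” of the class: membership is a matter of degree rather than an abrupt 0/1 transition.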
The cornerstones of fuzzy logic are the concepts of graduation, granulation, precisiation and generalized constraints (Fig. 3). These concepts are discussed in the following.

The Basics of Fuzzy Set Theory
The grade of membership, μA(u), of u in A may be interpreted in various ways. It is convenient to employ the proposition p: Vera is middle-aged, as an example, with middle-age represented as a fuzzy set shown in Fig. 4. Among the various interpretations of p are the following. Assume that q is the proposition: Vera is 43 years old; and r is the proposition: the grade of membership of Vera in the fuzzy set of middle-aged women is 0.8.
(a) The truth value of p given r is 0.8.
(b) The possibility of q given p and r is 0.8.
(c) The degree to which the concept of middle-age has to be stretched to apply to Vera is (1 − 0.8).
(d) 80% of voters in a voting group vote for p given q and r.
(e) The probability of p given q and r is 0.8.
Of these interpretations, (a)–(c) are most compatible with intuition.
If A and B are fuzzy sets in U, then their intersection (conjunction), A ∩ B, and union (disjunction), A ∪ B, are defined as

μA∩B(u) = μA(u) ∧ μB(u),   u ∈ U
μA∪B(u) = μA(u) ∨ μB(u),   u ∈ U

where ∧ = min and ∨ = max. More generally, conjunction and disjunction are defined through the concepts of t-norm and t-conorm, respectively [48].
When U is a finite set, U = {u1, …, un}, it is convenient to represent A as a union of fuzzy singletons, μA(ui)/ui, i = 1, …, n. Specifically,

A = μA(u1)/u1 + … + μA(un)/un

in which + denotes disjunction. More compactly,

A = Σi μA(ui)/ui,   i = 1, …, n.

When U is a continuum, A may be expressed as

A = ∫U μA(u)/u.
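On a finite universe, the singleton representation and the min/max definitions of intersection and union can be sketched directly; the universe and the membership grades below are invented for illustration.

```python
# Sketch of basic fuzzy set operations on a finite universe U.
# A fuzzy set is represented as a dict mapping each u in U to its grade
# mu_A(u) in [0, 1] -- the discrete analogue of A = sum_i mu_A(u_i)/u_i.
# The grades are illustrative values, not taken from the text.

A = {"u1": 0.2, "u2": 0.7, "u3": 1.0, "u4": 0.4}
B = {"u1": 0.5, "u2": 0.6, "u3": 0.3, "u4": 0.9}

def intersection(A, B):
    """mu_{A n B}(u) = min(mu_A(u), mu_B(u))."""
    return {u: min(A[u], B[u]) for u in A}

def union(A, B):
    """mu_{A u B}(u) = max(mu_A(u), mu_B(u))."""
    return {u: max(A[u], B[u]) for u in A}

print(intersection(A, B))  # {'u1': 0.2, 'u2': 0.6, 'u3': 0.3, 'u4': 0.4}
print(union(A, B))         # {'u1': 0.5, 'u2': 0.7, 'u3': 1.0, 'u4': 0.9}
```

Replacing `min`/`max` with another t-norm/t-conorm pair (e.g. product and probabilistic sum) changes only the two one-line definitions.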
A basic concept in fuzzy set theory is that of a level set [70], commonly referred to as an α-cut [48]. Specifically, if A is a fuzzy set in U, then an α-cut, Aα, is defined as (Fig. 5)

Aα = {u | μA(u) > α},
0 < α ≤ 1.

supGCL C(p*, p) > supSCL C(p*, p).
This obvious inequality has an important implication. Specifically, as a meaning representation language, fuzzy logic dominates bivalent logic. As a very simple example consider the proposition p: Speed limit is 65 mph. Realistically, what is the meaning of p? The inequality implies that employment of fuzzy logic for precisiation of p would lead to a precisiand whose cointension is at least as high – and generally significantly higher – than the cointension which is achievable through the use of bivalent logic.
More concretely, assume that A tells B that the speed limit is 65 mph, with the understanding that 65 mph should be interpreted as “approximately 65 mph”. B asks A to precisiate what is meant by “approximately 65 mph”, and stipulates that no imprecise numbers and no probabilities should be used in precisiation. With this restriction, A is not capable of formulating a realistic meaning of “approximately 65 mph”. Next, B allows A to use imprecise numbers but no probabilities. A is still unable to formulate a realistic definition. Next, B allows A to employ imprecise numbers and probabilities, but no imprecise probabilities. Still, A is unable to formulate a realistic definition. Finally, B allows A to use imprecise numbers and imprecise probabilities. This allows A to formulate a realistic definition of “approximately 65 mph”. This simple example is intended to demonstrate the need for the vocabulary of fuzzy logic to precisiate the meaning of terms and concepts which involve imprecise probabilities.
In addition to serving as a basis for precisiation of meaning, GCL serves another important function – the function of a deductive question-answering system [100]. In this role, what matters are the rules of deduction. In GCL, the rules of deduction coincide with the rules which govern constraint propagation and counterpropagation. Basically, these are the rules which govern generation of a generalized constraint from other generalized constraints [100,101].
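As a minimal sketch of the simplest kind of precisiand involved, “approximately 65 mph” can be modeled as a fuzzy number, here a triangular membership function. The ±10 mph spread is an assumed value chosen for illustration; a fully realistic definition would, as the passage notes, also involve imprecise probabilities.

```python
# Sketch: "approximately 65 mph" as a triangular fuzzy number centered at 65.
# The spread (+/- 10 mph) is an assumed value, not taken from the text.

def approx_65(v, center=65.0, spread=10.0):
    """Triangular membership: 1 at the center, falling to 0 at center +/- spread."""
    return max(0.0, 1.0 - abs(v - center) / spread)

print(approx_65(65))  # 1.0
print(approx_65(70))  # 0.5
print(approx_65(80))  # 0.0
```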
The principal rule of deduction in fuzzy logic is the so-called extension principle [70,75]. The extension principle can assume a variety of forms, depending on the generalized constraints to which it applies. A basic form which involves possibilistic constraints is the following. An analogous principle applies to probabilistic constraints.
Let X be a variable which takes values in U, and let f be a function from U to V. The point of departure is a possibilistic constraint on f(X) expressed as

f(X) is A,

where A is a fuzzy relation in V which is defined by its membership function μA(v), v ∈ V. Let g be a function from U to W. The possibilistic constraint on f(X) induces a possibilistic constraint on g(X) which may be expressed as

g(X) is B,

where B is a fuzzy relation. The question is: What is B? The extension principle reduces the problem of computation of B to the solution of a variational problem. Specifically,

f(X) is A
g(X) is B

where

μB(w) = supu μA(f(u))

subject to

w = g(u).

The structure of the solution is depicted in Fig. 21. Basically, the possibilistic constraint on f(X) counterpropagates to a possibilistic constraint on X. Then, the possibilistic constraint on X propagates to a possibilistic constraint on g(X).

Fuzzy Logic, Figure 21 Structure of the extension principle

There is a version of the extension principle – referred to as the fuzzy-graph extension principle – which plays an important role in control and systems analysis [74,78,90]. More specifically, let f be a function from reals to reals, Y = f(X). Let *f and *X be the granulands of f and X, respectively, with *f having the form of a fuzzy graph (Sect. “The Concept of Granulation”),

*f = A1 × Bj(1) + … + Am × Bj(m),
Fuzzy Logic, Figure 22 Fuzzy-graph extension principle. B = *f(*A)
where the Ai, i = 1, …, m, and the Bj, j = 1, …, n, are granules of X and Y, respectively; × denotes Cartesian product; and + denotes disjunction (Fig. 22). In this instance, the extension principle may be expressed as follows.

X is A
*f is (A1 × Bj(1) + … + Am × Bj(m))
Y is (m1 ∧ Bj(1) + … + mm ∧ Bj(m))

where the mi are matching coefficients, defined as [78]

mi = sup(A ∩ Ai),   i = 1, …, m,

and ∧ denotes conjunction (min). In the special case where X is a number, a, the possibilistic constraint on Y may be expressed as

Y is (μA1(a) ∧ Bj(1) + … + μAm(a) ∧ Bj(m)).

In this form, the extension principle plays a key role in the Mamdani–Assilian fuzzy logic controller [39].

Deduction
Assume that we are given an information set, I, which consists of a system of propositions (p1, …, pn). I will be referred to as the initial information set. The canonical problem of deductive question-answering is that of computing an answer to q, ans(q|I), given I [100,101]. The first step is to ask the question: What information is needed to answer q? Suppose that the needed information consists of the values of the variables X1, …, Xn.
Thus,

ans(q|I) = g(X1, …, Xn),

where g is a known function. Using GCL as a meaning precisiation language, one can express the initial information set as a generalized constraint on X1, …, Xn. In the special case of possibilistic constraints, the constraint on the Xi may be expressed as

f(X1, …, Xn) is A,

where A is a fuzzy relation. At this point, what we have is (a) a possibilistic constraint induced by the initial information set, and (b) an answer to q expressed as

ans(q|I) = g(X1, …, Xn),

with the understanding that the possibilistic constraint on f propagates to a possibilistic constraint on g. To compute the induced constraint on g what is needed is the extension principle of fuzzy logic [70,75].
As a simple illustration of deduction, it is convenient to use an example which was considered earlier.

Initial information set, p: most Swedes are tall
Question, q: what is the average height of Swedes?

What information is needed to compute the answer to q? Let P be a population of n Swedes, Swede1, …, Sweden. Let hi be the height of Swedei, i = 1, …, n. Knowing the hi, one can express the answer to q as

ans(q|p) = av(h) = (1/n)(h1 + … + hn).

Turning to the constraint induced by p, we note that the mm-precisiand of p may be expressed as the possibilistic constraint

p → (1/n) ΣCount(tall.Swedes) is most   (mm-precisiation)

where ΣCount(tall.Swedes) is the number of tall.Swedes in P, with the understanding that tall.Swedes is a fuzzy subset of P. Using this definition of ΣCount [86], one can write the expression for the constraint on the hi as

(1/n)(μtall(h1) + … + μtall(hn)) is most.

At this point, application of the extension principle leads to a solution which may be expressed as

μav(h)(v) = suph μmost((1/n)(μtall(h1) + … + μtall(hn))),   h = (h1, …, hn),

subject to

v = (1/n)(h1 + … + hn).
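The ΣCount constraint can be evaluated numerically for a given height profile: the degree to which “most Swedes are tall” holds is μmost applied to the relative sigma-count of tall. The piecewise-linear shapes of “tall” and “most” below are assumptions chosen for illustration, not Zadeh's definitions, and the height sample is invented.

```python
# Sketch: degree to which "most Swedes are tall" holds for a given height
# profile. The membership functions for "tall" and "most" are assumed
# piecewise-linear shapes chosen for illustration, not taken from the text.

def mu_tall(h_cm):
    """Grade of membership in 'tall': 0 below 170 cm, 1 above 185 cm."""
    return min(1.0, max(0.0, (h_cm - 170.0) / 15.0))

def mu_most(r):
    """Grade of membership of a proportion r in 'most': 0 below 0.5, 1 above 0.9."""
    return min(1.0, max(0.0, (r - 0.5) / 0.4))

def degree_most_tall(heights):
    """mu_most((1/n) * SigmaCount(tall)), the relative sigma-count of tall."""
    r = sum(mu_tall(h) for h in heights) / len(heights)
    return mu_most(r)

heights = [182, 178, 190, 174, 186]  # invented sample
print(round(degree_most_tall(heights), 3))  # 0.55
```

The full deduction of the average height goes one step further: the sup over all height profiles h consistent with this constraint, subject to v = (1/n)(h1 + … + hn), yields the fuzzy value μav(h)(v).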
In summary, the Generalized Constraint Language is, by construction, maximally expressive. Importantly, what this implies is that, in realistic settings, fuzzy logic, viewed as a modeling language, has a significantly higher level of power and generality than modeling languages based on standard constraints or, equivalently, on bivalent logic and bivalent-logic-based probability theory.

Principal Contributions of Fuzzy Logic
As was stated earlier, fuzzy logic is much more than an addition to existing methods for dealing with imprecision, uncertainty and complexity. In effect, fuzzy logic represents a paradigm shift. The structure of the shift is shown in Fig. 23. Contributions of fuzzy logic range from contributions to basic sciences to applications involving various types of systems and products. The principal contributions are summarized in the following.

Fuzzy Logic as the Basis for Generalization of Scientific Theories
One of the principal contributions of fuzzy logic to basic sciences relates to what is referred to as FL-generalization. By construction, fuzzy logic has a much more general conceptual structure than bivalent logic. A key element in the transition from bivalent logic to fuzzy logic is the generalization of the concept of a set to a fuzzy set. This generalization is the point of departure for FL-generalization.
Fuzzy Logic, Figure 23 Fuzzy logic as a paradigm shift
More specifically, FL-generalization of any theory, T, involves an addition to T of concepts drawn from fuzzy logic. In the limit, as more and more concepts which are drawn from fuzzy logic are added to T, the foundation of T is shifted from bivalent logic to fuzzy logic. By construction, FL-generalization results in an upgraded theory, T+, which is at least as rich and, in general, significantly richer than T.
As an illustration, consider probability theory, PT – a theory which is bivalent-logic-based. Among the basic concepts drawn from fuzzy logic which may be added to PT are the following [96].

set + fuzzy set
event + fuzzy event
relation + fuzzy relation
probability + fuzzy probability
random set + fuzzy random set
independence + fuzzy independence
stationarity + fuzzy stationarity
random variable + fuzzy random variable
etc.

As a theory, PT+ is much richer than PT. In particular, it provides a basis for construction of models which are much closer to reality than those that can be constructed through the use of PT. This applies, in particular, to computation with imprecise probabilities.
A number of scientific theories have already been FL-generalized to some degree, and many more are likely to be FL-generalized in coming years. Particularly worthy of note are the following FL-generalizations.

control → fuzzy control [12,18,20,69,72]
linear programming → fuzzy linear programming [19,103]
probability theory → fuzzy probability theory [96,98,101]
measure theory → fuzzy measure theory [14,62]
topology → fuzzy topology [38,68]
graph theory → fuzzy graph theory [34,41]
cluster analysis → fuzzy cluster analysis [7,27]
Prolog → fuzzy Prolog [21,42]
etc.

FL-generalization is a basis for an important rationale for the use of fuzzy logic. It is conceivable that eventually the foundations of many scientific theories may be shifted from bivalent logic to fuzzy logic.

Linguistic Variables and Fuzzy If-Then Rules
The most visible, the best understood and the most widely used contribution of fuzzy logic is the concept of a linguistic variable and the associated machinery of fuzzy if-then rules [90]. The machinery of linguistic variables and fuzzy if-then rules is unique to fuzzy logic. This machinery has played and is continuing to play a pivotal role in the conception and design of control systems and consumer products. However, its applicability is much broader. A key idea which underlies the machinery of linguistic variables and fuzzy if-then rules is centered on the use of information compression. In fuzzy logic, information compression is achieved through the use of graduated (fuzzy) granulation.

The Concepts of Precisiation and Cointension
The concepts of precisiation and cointension play pivotal roles in fuzzy logic [101]. In fuzzy logic, differentiation is made between two concepts of precision: precision of value, v-precision; and precision of meaning, m-precision. Furthermore, differentiation is made between precisiation of meaning which is (a) human-oriented, or mh-precisiation for short; and (b) machine-oriented, or mm-precisiation for short. It is understood that mm-precisiation is mathematically well defined. The object of precisiation, p, and the result of precisiation, p*, are referred to as precisiend and precisiand, respectively. Informally, cointension is defined as a measure of closeness of the meanings of p and p*.
Precisiation is cointensive if the meaning of p* is close to the meaning of p. One of the important features of fuzzy logic is its high power of cointensive precisiation. What this implies is that better models of reality can be achieved through the use of fuzzy logic.

Cointensive precisiation has an important implication for science. In large measure, science is bivalent-logic-based. In consequence, in science it is traditional to define concepts in a bivalent framework, with no degrees of truth allowed. The problem is that, in reality, many concepts in science are fuzzy, that is, are a matter of degree. For this reason, bivalent-logic-based definitions of scientific concepts are, in many cases, not cointensive. To formulate cointensive definitions of fuzzy concepts it is necessary to employ fuzzy logic. As was noted earlier, one of the principal contributions of fuzzy logic is its high power of cointensive precisiation. The significance of this capability is underscored by the fact that it has always been, and continues to be, a basic objective of science to precisiate and clarify what is imprecise and unclear.

Computing with Words (CW), NL-Computation and Precisiated Natural Language (PNL)

Much of human knowledge is expressed in natural language. Traditional theories of natural language are based on bivalent logic. The problem is that natural languages are intrinsically imprecise. Imprecision of natural languages is rooted in imprecision of perceptions. A natural language is basically a system for describing perceptions. Perceptions are intrinsically imprecise, reflecting the bounded ability of human sensory organs, and ultimately the brain, to resolve detail and store information. Imprecision of perceptions is passed on to natural languages. Bivalent logic is intolerant of imprecision, partiality of truth and partiality of possibility. For this reason, bivalent logic is intrinsically unsuited to serve as a foundation for theories of natural language.
As the logic of imprecision and approximate reasoning, fuzzy logic is a much better choice [71,80,84,85,86,88]. Computing with words (CW), NL-computation and precisiated natural language (PNL) are closely related formalisms [93,94,95,97]. In conventional modes of computation, the objects of computation are mathematical constructs. By contrast, in computing with words the objects of computation are propositions and predicates drawn from a natural language. A key idea which underlies computing with words involves representing the meaning of propositions and predicates as generalized constraints. Computing with words opens the door to a wide-ranging enlargement of the role of natural languages in scientific theories [30,31,36,58,59,61].
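The core mechanics of computing with words can be illustrated with a minimal sketch: words such as "about 20" are represented as fuzzy sets over a numerical universe, and computation is carried out on those sets (here, fuzzy addition via the extension principle). The triangular shapes, the spreads and the function names are illustrative assumptions, not part of the formal generalized-constraint machinery described above.

```python
# Minimal computing-with-words sketch (illustrative): words are modeled as
# triangular fuzzy sets over a discrete universe, and arithmetic on them
# uses the extension principle.

def triangular(a, b, c):
    """Membership function of a triangular fuzzy set with peak at b."""
    def mu(x):
        if a < x <= b:
            return (x - a) / (b - a)
        if b < x < c:
            return (c - x) / (c - b)
        return 1.0 if x == b else 0.0
    return mu

universe = range(0, 101)

# The words "about 20" and "about 30" as fuzzy sets (assumed spreads).
about_20 = {x: triangular(15, 20, 25)(x) for x in universe}
about_30 = {x: triangular(25, 30, 35)(x) for x in universe}

def fuzzy_add(A, B):
    """Extension principle: mu_C(z) = sup over x+y=z of min(mu_A(x), mu_B(y))."""
    C = {}
    for x, ma in A.items():
        for y, mb in B.items():
            z = x + y
            C[z] = max(C.get(z, 0.0), min(ma, mb))
    return C

total = fuzzy_add(about_20, about_30)
peak = max(total, key=total.get)
print(peak)  # the sum peaks at 50, i.e. the result is the word "about 50"
```

The output fuzzy set can then be mapped back to a word ("about 50"), which is the words-to-words round trip that distinguishes CW from conventional numerical computation.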
Computational Theory of Perceptions

Humans have a remarkable capability to perform a wide variety of physical and mental tasks without any measurements and any computations. In performing such tasks humans employ perceptions. To endow machines with this capability, what is needed is a formalism in which perceptions can play the role of objects of computation. The fuzzy-logic-based computational theory of perceptions (CTP) serves this purpose [94,95]. A key idea in this theory is that of computing not with perceptions per se, but with their descriptions in a natural language. Representing perceptions as propositions drawn from a natural language opens the door to application of computing with words to computation with perceptions. The computational theory of perceptions is of direct relevance to the achievement of human-level machine intelligence.

Possibility Theory

Possibility theory is a branch of fuzzy logic [15,79]. Possibility theory and probability theory are distinct theories. Possibility theory may be viewed as a formalization of the perception of possibility, whereas probability theory is rooted in the perception of likelihood. In large measure, possibility theory and probability theory are complementary rather than competitive. Possibility theory is of direct relevance to knowledge representation, semantics of natural languages, decision analysis and computation with imprecise probabilities.

Computation with Imprecise Probabilities

Most real-world probabilities are perceptions of likelihood. As such, real-world probabilities are intrinsically imprecise [75]. Until recently, the issue of imprecise probabilities was accorded little attention in the literature of probability theory. More recently, the problem of computation with imprecise probabilities has become an object of rapidly growing interest [63,99]. Typically, imprecise probabilities occur in an environment of imprecisely defined variables, functions, relations, events, etc.
Existing approaches to computation with imprecise probabilities do not address this reality. To address it, what is needed is fuzzy logic and, more particularly, computing with words and the computational theory of perceptions. A step in this direction was taken in the paper "Toward a perception-based theory of probabilistic reasoning with imprecise probabilities" [96], followed by the 2005 paper "Toward a generalized theory of uncertainty (GTU) – an outline" [98] and the 2006 paper "Generalized theory of uncertainty (GTU) – principal concepts and ideas" [101].

Fuzzy Logic as a Modeling Language

Science deals not with reality but with models of reality. More often than not, reality is fuzzy. For this reason, construction of realistic models of reality calls for the use of fuzzy logic rather than bivalent logic. Fuzzy logic is a logic of imprecision, uncertainty and approximate reasoning [82]. It is natural to employ fuzzy logic as a modeling language when the objects of modeling are not well defined [102]. But what is somewhat paradoxical is that in many of its practical applications fuzzy logic is used as a modeling language for systems which are precisely defined. The explanation is that, in general, precision carries a cost. In those cases in which there is a tolerance for imprecision, reduction in cost may be achieved through imprecisiation, e.g., data compression, information compression and summarization. The result of imprecisiation is an object of modeling which is not precisely defined. A fuzzy modeling language comes into play at this point. This is the key idea which underlies the fuzzy logic gambit. The fuzzy logic gambit is widely used in the design of consumer products – a realm in which cost is an important consideration.

A Glimpse of What Lies Beyond Fuzzy Logic

Fuzzy logic has a much higher level of generality than traditional logical systems – a generality which has the effect of greatly enhancing the problem-solving capability of fuzzy logic compared with that of bivalent logic. What lies beyond fuzzy logic? What is of relevance to this question is the so-called incompatibility principle [73]. Informally, this principle asserts that as the complexity of a system increases, a point is reached beyond which high precision and high relevance become incompatible. The concepts of mm-precisiation and cointension suggest an improved version of the principle: as the complexity of a system increases, a point is reached beyond which high cointension and mm-precision become incompatible.
What this means in plain words is that in the realm of complex systems – such as economic systems – it may be impossible to construct models which are both realistic and precise. As an illustration, consider the following problem. Assume that A asks a cab driver to take him to address B. There are two versions of this problem: (a) A asks the driver to take him to B the shortest way; and (b) A asks the driver to take him to B the fastest way. Based on his experience, the driver decides on the route to take to B. In the case of (a), a GPS system can suggest a route that is provably the shortest way, that is, it can come up with a provably valid (p-valid) solution. In the case of (b), the uncertainties involved preclude the possibility of constructing a model of the system which is cointensive and mm-precise, implying that a p-valid solution does not exist. The driver's solution, based on his experience, has what may be called fuzzy validity (f-validity). Thus, in the case of (b) no p-valid solution exists. What exists is an f-valid solution.

In fuzzy logic, mm-precisiation is a prerequisite to computation. A question which arises is: what can be done when cointensive mm-precisiation is infeasible? To deal with such problems what is needed is referred to as extended fuzzy logic (FL+). In this logic, mm-precisiation is optional rather than mandatory. Very briefly, what is admissible in FL+ is f-validity. Admissibility of f-validity opens the door to construction of concepts prefixed with f, e.g. f-theorem, f-proof, f-principle, f-triangle, f-continuity, f-stability, etc. An example is f-geometry. In f-geometry, figures are drawn by hand with a spray can. An example of an f-theorem in f-geometry is the f-version of the theorem: the medians of a triangle are concurrent. The f-version of this theorem reads: the f-medians of an f-triangle are f-concurrent. An f-theorem can be proved in two ways: (a) empirically, that is, by drawing triangles with a spray can and verifying that the medians intersect at an f-point; or (b) by constructing an f-analogue of its traditional proof. At this stage, the extended fuzzy logic is merely an idea, but it is an idea which has the potential for being a point of departure for construction of theories with important applications to the solution of real-world problems.

Bibliography

Primary Literature

1. Aliev RA, Fazlollahi B, Aliev RR, Guirimov BG (2006) Fuzzy time series prediction method based on fuzzy recurrent neural network. In: Neural information processing. Lecture notes in computer science, vol 4233. Springer, Berlin, pp 860–869
2. Bargiela A, Pedrycz W (2002) Granular computing: An introduction. Kluwer Academic Publishers, Boston
3. Bardossy A, Duckstein L (1995) Fuzzy rule-based modelling with application to geophysical, biological and engineering systems. CRC Press, New York
4. Bellman RE, Zadeh LA (1970) Decision-making in a fuzzy environment. Manag Sci B 17:141–164
5. Belohlavek R, Vychodil V (2006) Attribute implications in a fuzzy setting. In: Ganter B, Kwuida L (eds) ICFCA 2006. Lecture notes in artificial intelligence, vol 3874. Springer, Heidelberg, pp 45–60
6. Bezdek J, Pal S (eds) (1992) Fuzzy models for pattern recognition – methods that search for structures in data. IEEE Press, New York
7. Bezdek J, Keller JM, Krishnapuram R, Pal NR (1999) Fuzzy models and algorithms for pattern recognition and image processing. In: Zimmermann H (ed) Kluwer, Dordrecht
8. Bouchon-Meunier B, Yager RR, Zadeh LA (eds) (2000) Uncertainty in intelligent and information systems. In: Advances in fuzzy systems – applications and theory, vol 20. World Scientific, Singapore
9. Colubi A, Santos Domínguez-Menchero J, López-Díaz M, Ralescu DA (2001) On the formalization of fuzzy random variables. Inf Sci 133(1–2):3–6
10. Cresswell MJ (1973) Logic and languages. Methuen, London
11. Dempster AP (1967) Upper and lower probabilities induced by a multivalued mapping. Ann Math Stat 38:325–329
12. Driankov D, Hellendoorn H, Reinfrank M (1993) An introduction to fuzzy control. Springer, Berlin
13. Dubois D, Prade H (1980) Fuzzy sets and systems – theory and applications. Academic Press, New York
14. Dubois D, Prade H (1982) A class of fuzzy measures based on triangular norms. Int J General Syst 8:43–61
15. Dubois D, Prade H (1988) Possibility theory. Plenum Press, New York
16. Dubois D, Prade H (1994) Non-standard theories of uncertainty in knowledge representation and reasoning. Knowl Eng Rev 9(4):399–416
17. Esteva F, Godo L (2007) Towards the generalization of Mundici's gamma functor to IMTL algebras: the linearly ordered case. In: Algebraic and proof-theoretic aspects of non-classical logics, pp 127–137
18. Filev D, Yager RR (1994) Essentials of fuzzy modeling and control. Wiley-Interscience, New York
19. Gasimov RN, Yenilmez K (2002) Solving fuzzy linear programming problems with linear membership functions. Turk J Math 26:375–396
20. Gerla G (2001) Fuzzy control as a fuzzy deduction system. Fuzzy Sets Syst 121(3):409–425
21. Gerla G (2005) Fuzzy logic programming and fuzzy control. Studia Logica 79(2):231–254
22. Godo LL, Esteva F, García P, Agustí J (1991) A formal semantical approach to fuzzy logic. In: International Symposium on Multiple Valued Logic, ISMVL'91, pp 72–79
23. Goguen JA (1967) L-fuzzy sets. J Math Anal Appl 18:145–157
24. Goodman IR, Nguyen HT (1985) Uncertainty models for knowledge-based systems. North Holland, Amsterdam
25. Hajek P (1998) Metamathematics of fuzzy logic. Kluwer, Dordrecht
26. Hirota K, Sugeno M (eds) (1995) Industrial applications of fuzzy technology in the world. In: Advances in fuzzy systems – applications and theory, vol 2. World Scientific, Singapore
27. Höppner F, Klawonn F, Kruse R, Runkler T (1999) Fuzzy cluster analysis. Wiley, Chichester
28. Jamshidi M, Titli A, Zadeh LA, Boverie S (eds) (1997) Applications of fuzzy logic – towards high machine intelligence quotient systems. In: Environmental and intelligent manufacturing systems series, vol 9. Prentice Hall, Upper Saddle River
29. Jankowski A, Skowron A (2007) Toward rough-granular computing. In: Proceedings of the 11th International Conference on Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing (RSFDGrC'07), Toronto, Canada, pp 1–12
30. Kacprzyk J, Zadeh LA (eds) (1999) Computing with words in information/intelligent systems part 1: Foundations. Physica, Heidelberg, New York
31. Kacprzyk J, Zadeh LA (eds) (1999) Computing with words in information/intelligent systems part 2: Applications. Physica, Heidelberg, New York
32. Kandel A, Langholz G (eds) (1994) Fuzzy control systems. CRC Press, Boca Raton
33. Klir GJ (2006) Uncertainty and information: Foundations of generalized information theory. Wiley-Interscience, Hoboken
34. Kóczy LT (1992) Fuzzy graphs in the evaluation and optimization of networks. Fuzzy Sets Syst 46(3):307–319
35. Lambert K, Van Fraassen BC (1970) Meaning relations, possible objects and possible worlds. In: Philosophical problems in logic, pp 1–19
36. Lawry J, Shanahan JG, Ralescu AL (eds) (2003) Modelling with words – learning, fusion, and reasoning within a formal linguistic representation framework. Springer, Heidelberg
37. Lin TY (1997) Granular computing: From rough sets and neighborhood systems to information granulation and computing in words. In: European Congress on Intelligent Techniques and Soft Computing, September 8–12, pp 1602–1606
38. Liu Y, Luo M (1997) Fuzzy topology. In: Advances in fuzzy systems – applications and theory, vol 9. World Scientific, Singapore
39. Mamdani EH, Assilian S (1975) An experiment in linguistic synthesis with a fuzzy logic controller. Int J Man-Machine Stud 7:1–13
40. Mendel J (2001) Uncertain rule-based fuzzy logic systems – introduction and new directions. Prentice Hall, Upper Saddle River
41. Mordeson JN, Nair PS (2000) Fuzzy graphs and fuzzy hypergraphs. In: Studies in fuzziness and soft computing. Springer, Heidelberg
42. Mukaidono M, Shen Z, Ding L (1989) Fundamentals of fuzzy prolog. Int J Approx Reas 3(2):179–193
43. Nguyen HT (1993) On modeling of linguistic information using random sets. In: Fuzzy sets for intelligent systems. Morgan Kaufmann Publishers, San Mateo, pp 242–246
44. Novak V, Perfilieva I, Mockor J (1999) Mathematical principles of fuzzy logic. Kluwer, Boston/Dordrecht
45. Novak V (2006) Which logic is the real fuzzy logic? Fuzzy Sets Syst 157:635–641
46. Ogura Y, Li S, Kreinovich V (2002) Limit theorems and applications of set-valued and fuzzy set-valued random variables. Springer, Dordrecht
47. Orlov AI (1980) Problems of optimization and fuzzy variables. Znaniye, Moscow
48. Pedrycz W, Gomide F (2007) Fuzzy systems engineering: Toward human-centric computing. Wiley, Hoboken
49. Perfilieva I (2007) Fuzzy transforms: a challenge to conventional transforms. In: Hawkes PW (ed) Advances in images and electron physics, vol 147. Elsevier Academic Press, San Diego, pp 137–196
50. Puri ML, Ralescu DA (1993) Fuzzy random variables. In: Fuzzy sets for intelligent systems. Morgan Kaufmann Publishers, San Mateo, pp 265–271
51. Ralescu DA (1995) Cardinality, quantifiers and the aggregation of fuzzy criteria. Fuzzy Sets Syst 69:355–365
52. Ross TJ (2004) Fuzzy logic with engineering applications, 2nd edn. Wiley, Chichester
53. Rossi F, Codognet P (2003) Special issue on soft constraints. Constraints 8(1)
54. Rutkowska D (2002) Neuro-fuzzy architectures and hybrid learning. In: Studies in fuzziness and soft computing. Springer
55. Rutkowski L (2008) Computational intelligence. Springer, Polish Scientific Publishers PWN, Warsaw
56. Schum D (1994) Evidential foundations of probabilistic reasoning. Wiley, New York
57. Shafer G (1976) A mathematical theory of evidence. Princeton University Press, Princeton
58. Trillas E (2006) On the use of words and fuzzy sets. Inf Sci 176(11):1463–1487
59. Türksen IB (2007) Meta-linguistic axioms as a foundation for computing with words. Inf Sci 177(2):332–359
60. Wang PZ, Sanchez E (1982) Treating a fuzzy subset as a projectable random set. In: Gupta MM, Sanchez E (eds) Fuzzy information and decision processes. North Holland, Amsterdam, pp 213–220
61. Wang P (2001) Computing with words. In: Albus J, Meystel A, Zadeh LA (eds). Wiley, New York
62. Wang Z, Klir GJ (1992) Fuzzy measure theory. Springer, New York
63. Walley P (1991) Statistical reasoning with imprecise probabilities. Chapman & Hall, London
64. Wygralak M (2003) Cardinalities of fuzzy sets. In: Studies in fuzziness and soft computing. Springer, Berlin
65. Yager RR, Zadeh LA (eds) (1992) An introduction to fuzzy logic applications in intelligent systems. Kluwer Academic Publishers, Norwell
66. Yen J, Langari R, Zadeh LA (eds) (1995) Industrial applications of fuzzy logic and intelligent systems. IEEE, New York
67. Yen J, Langari R (1998) Fuzzy logic: Intelligence, control and information, 1st edn. Prentice Hall, New York
68. Ying M (1991) A new approach for fuzzy topology (I). Fuzzy Sets Syst 39(3):303–321
69. Ying H (2000) Fuzzy control and modeling – analytical foundations and applications. IEEE Press, New York
70. Zadeh LA (1965) Fuzzy sets. Inf Control 8:338–353
71. Zadeh LA (1972) A fuzzy-set-theoretic interpretation of linguistic hedges. J Cybern 2:4–34
72. Zadeh LA (1972) A rationale for fuzzy control. J Dyn Syst Meas Control G 94:3–4
73. Zadeh LA (1973) Outline of a new approach to the analysis of complex systems and decision processes. IEEE Trans Syst Man Cybern SMC-3:28–44
74. Zadeh LA (1974) On the analysis of large scale systems. In: Gottinger H (ed) Systems approaches and environment problems. Vandenhoeck and Ruprecht, Göttingen, pp 23–37
75. Zadeh LA (1975) The concept of a linguistic variable and its application to approximate reasoning. Part I: Inf Sci 8:199–249; Part II: Inf Sci 8:301–357; Part III: Inf Sci 9:43–80
76. Zadeh LA (1975) Calculus of fuzzy restrictions. In: Zadeh LA, Fu KS, Tanaka K, Shimura M (eds) Fuzzy sets and their applications to cognitive and decision processes. Academic Press, New York, pp 1–39
77. Zadeh LA (1975) Fuzzy logic and approximate reasoning. Synthese 30:407–428
78. Zadeh LA (1976) A fuzzy-algorithmic approach to the definition of complex or imprecise concepts. Int J Man-Machine Stud 8:249–291
79. Zadeh LA (1978) Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets Syst 1:3–28
80. Zadeh LA (1978) PRUF – a meaning representation language for natural languages. Int J Man-Machine Stud 10:395–460
81. Zadeh LA (1979) Fuzzy sets and information granularity. In: Gupta M, Ragade R, Yager R (eds) Advances in fuzzy set theory and applications. North-Holland Publishing Co., Amsterdam, pp 3–18
82. Zadeh LA (1979) A theory of approximate reasoning. In: Hayes J, Michie D, Mikulich LI (eds) Machine intelligence 9. Halstead Press, New York, pp 149–194
83. Zadeh LA (1981) Possibility theory and soft data analysis. In: Cobb L, Thrall RM (eds) Mathematical frontiers of the social and policy sciences. Westview Press, Boulder, pp 69–129
84. Zadeh LA (1982) Test-score semantics for natural languages and meaning representation via PRUF. In: Rieger B (ed) Empirical semantics. Brockmeyer, Bochum, pp 281–349
85. Zadeh LA (1983) Test-score semantics as a basis for a computational approach to the representation of meaning. In: Proceedings of the Tenth Annual Conference of the Association for Literary and Linguistic Computing. Oxford University Press
86. Zadeh LA (1983) A computational approach to fuzzy quantifiers in natural languages. Comput Math 9:149–184
87. Zadeh LA (1984) Precisiation of meaning via translation into PRUF. In: Vaina L, Hintikka J (eds) Cognitive constraints on communication. Reidel, Dordrecht, pp 373–402
88. Zadeh LA (1986) Test-score semantics as a basis for a computational approach to the representation of meaning. Lit Linguist Comput 1:24–35
89. Zadeh LA (1986) Outline of a computational approach to meaning and knowledge representation based on the concept of a generalized assignment statement. In: Thoma M, Wyner A (eds) Proceedings of the International Seminar on Artificial Intelligence and Man-Machine Systems. Springer, Heidelberg, pp 198–211
90. Zadeh LA (1996) Fuzzy logic and the calculi of fuzzy rules and fuzzy graphs. Multiple-Valued Logic 1:1–38
91. Zadeh LA (1997) Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets Syst 90:111–127
92. Zadeh LA (1998) Some reflections on soft computing, granular computing and their roles in the conception, design and utilization of information/intelligent systems. Soft Comput 2:23–25
93. Zadeh LA (1999) From computing with numbers to computing with words – from manipulation of measurements to manipulation of perceptions. IEEE Trans Circuits Syst 45:105–119
94. Zadeh LA (2000) Outline of a computational theory of perceptions based on computing with words. In: Sinha NK, Gupta MM, Zadeh LA (eds) Soft computing & intelligent systems: Theory and applications. Academic Press, London, pp 3–22
95. Zadeh LA (2001) A new direction in AI – toward a computational theory of perceptions. AI Magazine 22(1):73–84
96. Zadeh LA (2002) Toward a perception-based theory of probabilistic reasoning with imprecise probabilities. J Stat Plan Inference 105:233–264
97. Zadeh LA (2004) Precisiated natural language (PNL). AI Magazine 25(3):74–91
98. Zadeh LA (2005) Toward a generalized theory of uncertainty (GTU) – an outline. Inf Sci 172:1–40
99. Zadeh LA (2005) From imprecise to granular probabilities. Fuzzy Sets Syst 154:370–374
100. Zadeh LA (2006) From search engines to question answering systems – the problems of world knowledge, relevance, deduction and precisiation. In: Sanchez E (ed) Fuzzy logic and the semantic web, chapt 9. Elsevier, pp 163–210
101. Zadeh LA (2006) Generalized theory of uncertainty (GTU) – principal concepts and ideas. Comput Stat Data Anal 51:15–46
102. Zadeh LA (2008) Is there a need for fuzzy logic? Inf Sci 178(13):2751–2779
103. Zimmermann HJ (1978) Fuzzy programming and linear programming with several objective functions. Fuzzy Sets Syst 1:45–55
Books and Reviews

Aliev RA, Fazlollahi B, Aliev RR (2004) Soft computing and its applications in business and economics. In: Studies in fuzziness and soft computing. Springer, Berlin
Dubois D, Prade H (eds) (1996) Fuzzy information engineering: A guided tour of applications. Wiley, New York
Gupta MM, Sanchez E (1982) Fuzzy information and decision processes. North-Holland, Amsterdam
Hanss M (2005) Applied fuzzy arithmetic: An introduction with engineering applications. Springer, Berlin
Hirota K, Czogala E (1986) Probabilistic sets: Fuzzy and stochastic approach to decision, control and recognition processes, ISR. Verlag TÜV Rheinland, Köln
Jamshidi M, Titli A, Zadeh LA, Boverie S (1997) Applications of fuzzy logic: Towards high machine intelligence quotient systems. In: Environmental and intelligent manufacturing systems series. Prentice Hall, Upper Saddle River
Kacprzyk J, Fedrizzi M (1992) Fuzzy regression analysis. In: Studies in fuzziness, vol 29. Physica
Kosko B (1997) Fuzzy engineering. Prentice Hall, Upper Saddle River
Mastorakis NE (1999) Computational intelligence and applications. World Scientific Engineering Society
Pal SK, Polkowski L, Skowron A (2004) Rough-neural computing: Techniques for computing with words. Springer, Berlin
Ralescu AL (1994) Applied research in fuzzy technology. International series in intelligent technologies. Kluwer Academic Publishers, Boston
Reghis M, Roventa E (1998) Classical and fuzzy concepts in mathematical logic and applications. CRC Press, Boca Raton
Schneider M, Kandel A, Langholz G, Chew G (1996) Fuzzy expert system tools. Wiley, New York
Türksen IB (2005) Ontological and epistemological perspective of fuzzy set theory. Elsevier Science and Technology Books
Zadeh LA, Kacprzyk J (1992) Fuzzy logic for the management of uncertainty. Wiley
Zhong N, Skowron A, Ohsuga S (1999) New directions in rough sets, data mining, and granular-soft computing. In: Lecture notes in artificial intelligence. Springer, New York
Fuzzy Logic, Type-2 and Uncertainty

ROBERT I. JOHN1, JERRY M. MENDEL2
1 Centre for Computational Intelligence, School of Computing, De Montfort University, Leicester, United Kingdom
2 Signal and Image Processing Institute, Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, USA
Article Outline

Glossary
Definition of the Subject
Introduction
Type-2 Fuzzy Systems
Generalized Type-2 Fuzzy Systems
Interval Type-2 Fuzzy Sets and Systems
Future Directions
Bibliography

Glossary

Type-1 fuzzy sets Are the underlying component in fuzzy logic, where uncertainty is represented by a number between zero and one.
Type-2 fuzzy sets Are sets where the uncertainty is represented by a type-1 fuzzy set.
Interval type-2 fuzzy sets Are sets where the uncertainty is represented by a type-1 fuzzy set whose membership grades are unity.

Definition of the Subject

Type-2 fuzzy logic was first defined in 1975 by Zadeh and is an increasingly popular area for research and applications. The reason is that it appears to tackle the fundamental problem with type-1 fuzzy logic: its inability to handle the many uncertainties in real systems. Type-2 fuzzy systems are conceptually and mathematically more difficult to understand and implement, but the proven applications show that the effort is worth it, and type-2 fuzzy systems are at the forefront of fuzzy logic research and applications. These systems rely on the notion of a type-2 fuzzy set, where the membership grades are type-1 fuzzy sets.
Introduction

Fuzzy sets [1] have, over the past forty years, laid the basis for a successful method of modeling uncertainty, vagueness and imprecision in a way that no other technique has been able. The use of fuzzy sets in real computer systems is extensive, particularly in consumer products and control applications. Fuzzy logic (a logic based on fuzzy sets) is now a mainstream technique in everyday use across the world. The number of applications is large, and growing, in a variety of areas, for example, heat exchange, warm water pressure, aircraft flight control, robot control, car speed control, power systems, nuclear reactor control, fuzzy memory devices and the fuzzy computer, control of a cement kiln, focusing of a camcorder, climate control for buildings, shower control and mobile robots. The use of fuzzy logic is not limited to control. Successful applications, for example, have been reported in train scheduling, system modeling, computing, stock tracking on the Nikkei stock exchange and information retrieval.

Type-1 fuzzy sets represent uncertainty using a number in [0, 1], whereas type-2 fuzzy sets represent uncertainty by a function. This is discussed in more detail later in the article. Essentially, the more imprecise or vague the data, the greater the improvement type-2 fuzzy sets offer over type-1 fuzzy sets. Figure 1 shows the view taken here of the relationships between levels of imprecision, data and technique. As the level of imprecision increases, type-2 fuzzy logic provides a powerful paradigm for potentially tackling the problem. Problems that contain crisp, precise data do not, in reality, exist. However, some problems can be tackled effectively using mathematical techniques where the assumption is that the data is precise. Other problems (for example, in control) use imprecise terminology that can often be effectively modeled using type-1 fuzzy sets. Perceptions, it is argued here, are at a higher level of imprecision, and type-2 fuzzy sets can effectively model this imprecision.
Fuzzy Logic, Type-2 and Uncertainty, Figure 1 Relationships between imprecision, data and fuzzy technique
The reason for this lies in some of the problems associated with type-1 fuzzy logic systems. Although successful in the control domain, they have not delivered as well in systems that attempt to replicate human decision making. It is our view that this is because a type-1 fuzzy logic system (FLS) has some uncertainties which cannot be modeled properly by type-1 fuzzy logic. The sources of the uncertainties in type-1 FLSs are:

The meanings of the words that are used in the antecedents and consequents of rules can be uncertain (words mean different things to different people).
Consequents may have a histogram of values associated with them, especially when knowledge is extracted from a group of experts who do not all agree.
Measurements that activate a type-1 FLS may be noisy and therefore uncertain.
The data that are used to tune the parameters of a type-1 FLS may also be noisy.

The uncertainties described all essentially have to do with the uncertainty contained in a type-1 fuzzy set. A type-1 fuzzy set can be defined in the following way. Let X be a universal set defined in a specific problem, with a generic element denoted by x. A fuzzy set A in X is a set of ordered pairs:
A = {(x, μ_A(x)) | x ∈ X} ,

where μ_A : X → [0, 1] is called the membership function of A and μ_A(x) represents the degree of membership of the element x in A. The key points to draw from this definition of a fuzzy set are:

The members of a fuzzy set are members to some degree, known as a membership grade or degree of membership.
A fuzzy set is fully determined by the membership function.
The membership grade is the degree of belonging to the fuzzy set. The larger the number (in [0, 1]) the greater the degree of belonging.
The translation from x to μ_A(x) is known as fuzzification.
A fuzzy set is either continuous or discrete.

Graphical representation of membership functions is very useful. For example, the fuzzy set 'Tall' might be represented as shown in Fig. 2, where someone who is of height five feet has a membership grade of zero while someone who is of height seven feet is tall to degree one, with heights in between having membership grades between zero and one. The example shown is linear but, of course, it could be any function. Fuzzy sets offer a practical way of modeling what one might refer to as 'fuzziness'. The real world can be characterized by the fact that much of it is imprecise in one form or other. For a clear exposition (important to the notion of, and argument for, type-2 sets) two ideas of 'fuzziness' can be considered important – imprecision and vagueness (linguistic uncertainty).

Imprecision

As has already been discussed, in many physical systems measurements are never precise (a physical property can always be measured more accurately). There is imprecision inherent in measurement. Fuzzy numbers are one way of capturing this imprecision by having a fuzzy set represent a real number, where the numbers in an interval near to the number are in the fuzzy set to some degree. So, for example, the fuzzy number 'About 35' might look like the fuzzy set in Fig. 3, where the numbers closer to 35 have membership nearer unity than those that are further away from 35.
Fuzzy Logic, Type-2 and Uncertainty, Figure 2 The fuzzy set ‘Tall’
Fuzzy Logic, Type-2 and Uncertainty, Figure 3 The fuzzy number ‘About 35’
Vagueness or Linguistic Uncertainty
Another use of fuzzy sets is where words have been used to capture imprecise notions, loose concepts or perceptions. We use words in our everyday language where we, and the intended audience, know what we want to convey, but the words cannot be precisely defined. For example, when a bank is considering a loan application, somebody may be assessed as a good risk in terms of being able to repay the loan. Within the particular bank this notion of a good risk is well understood. It is not a black and white decision as to whether someone is a good risk or not: they are a good risk to some degree.
Type-2 Fuzzy Systems
A type-1 fuzzy system uses type-1 fuzzy sets in either the antecedent and/or the consequent of type-1 fuzzy if-then rules, and a type-2 fuzzy system deploys type-2 fuzzy sets in either the antecedent and/or the consequent of type-2 fuzzy rules. Fuzzy systems usually have the following features:
- The fuzzy sets, as defined by their membership functions. These fuzzy sets are the basis of a fuzzy system; they capture the underlying properties or knowledge in the system.
- The if-then rules that combine the fuzzy sets, in a rule set or knowledge base.
- The fuzzy composition of the rules. Any fuzzy system that has a set of if-then rules has to combine the rules.
- Optionally, the defuzzification of the solution fuzzy set. In many (most) fuzzy systems there is a requirement that the final output be a 'crisp' number. However, for certain fuzzy paradigms the output of the system is a fuzzy set, or its associated word. This solution set is 'defuzzified' to arrive at a number.
Type-1 fuzzy sets are, in fact, crisp and not at all fuzzy, and are two dimensional: a domain value x is simply represented by a number in [0, 1], the membership grade. The methods for combining type-1 fuzzy sets in rules are also precise in nature. Type-2 fuzzy sets, in contrast, are three dimensional.
The membership grade for a given value in a type-2 fuzzy set is a type-1 fuzzy set. A formal definition of a type-2 fuzzy set is given in the following: a type-2 fuzzy set, Ã, is characterized by a type-2 membership function μ_Ã(x, u), where x ∈ X and u ∈ J_x ⊆ [0, 1], and J_x is called the primary membership, i.e.

Ã = {((x, u), μ_Ã(x, u)) | ∀x ∈ X, ∀u ∈ J_x ⊆ [0, 1]} .  (1)
Fuzzy Logic, Type-2 and Uncertainty, Figure 4 A typical FOU of a type-2 set
A useful way of looking at a type-2 fuzzy set is by considering its Footprint of Uncertainty (FOU). This is a two-dimensional view of a type-2 fuzzy set; see Fig. 4 for a simple example. The shaded area represents the union of all the J_x. An effective way to compare type-1 and type-2 fuzzy sets is by use of a simple example. Suppose, for a particular application, we wish to describe the imprecise concept of 'tallness'. One approach would be to use a type-1 fuzzy set tall_1. Now suppose we are only considering three members of this set: Michael Jordan, Danny Devito and Robert John. For the type-1 fuzzy approach one might say that Michael Jordan is tall_1 to degree 0.95, Danny Devito to degree 0.4 and Robert John to degree 0.6. This can be written as

tall_1 = 0.95/Michael Jordan + 0.4/Danny Devito + 0.6/Robert John .

A type-2 fuzzy set (tall_2) that models the concept of 'tallness' could be

tall_2 = High_1/Michael Jordan + Low_1/Danny Devito + Medium_1/Robert John ,

where High_1, Low_1 and Medium_1 are type-1 fuzzy sets. Figure 5 shows what the sets High_1, Low_1 and Medium_1 might look like if represented graphically. As can be seen, the x axis takes values between 0 and 1, as does the y axis (μ). Type-1 sets have an x axis representing the domain, in this case the height of an individual. Type-2 sets employ type-1 sets as the membership grades. Therefore, these fuzzy sets of type-2 allow for the idea that a fuzzy approach does not necessarily have membership grades in [0, 1]: the degree of membership for a member is itself a type-1 fuzzy set. As can be seen from the simple example, there is an
inherent extra fuzziness offered by type-2 fuzzy sets over and above a type-1 approach. So, a type-2 fuzzy set could be called a fuzzy-fuzzy set. Real situations do not allow for precise numbers in [0, 1]. In a control application, for instance, can we say that a particular temperature, t, belongs to the type-1 fuzzy set hot_1 with a membership grade x precisely? No. Firstly, it is highly likely that the membership could just as well be x − 0.1, for example. Different experts would attach different membership grades and, indeed, the same expert might well give different values on different days! On top of this uncertainty there is always some uncertainty in the measurement of t. So, we have a situation where an uncertain measurement is matched precisely to another uncertain value! Type-2 fuzzy sets, on the other hand, for certain appropriate applications, allow for this uncertainty to be modeled by using not precise membership grades but imprecise type-1 fuzzy sets.

Fuzzy Logic, Type-2 and Uncertainty, Figure 5 The fuzzy sets High_1, Low_1 and Medium_1

So that type-2 sets can be used in a fuzzy system (in a similar manner to type-1 fuzzy sets), a method is required for computing the intersection (AND) and union (OR) of two type-2 sets. Suppose we have two type-2 fuzzy sets, Ã and B̃ in X, and μ_Ã(x) and μ_B̃(x) are two secondary membership functions of Ã and B̃ respectively, represented as:

μ_Ã(x) = f(u_1)/u_1 + f(u_2)/u_2 + ⋯ + f(u_n)/u_n = Σ_i f(u_i)/u_i ,

μ_B̃(x) = g(w_1)/w_1 + g(w_2)/w_2 + ⋯ + g(w_m)/w_m = Σ_j g(w_j)/w_j ,

where the functions f and g are membership functions of the fuzzy grades and {u_i, i = 1, 2, …, n}, {w_j, j = 1, 2, …, m} are the members of the fuzzy grades.

Intersection of Type-2 Fuzzy Sets
The intersection (∩) of two type-2 fuzzy sets (Ã, B̃), corresponding to Ã AND B̃, is given by

Ã ∩ B̃ ⟺ μ_{Ã∩B̃}(x) = μ_Ã(x) ⊓ μ_B̃(x) = Σ_{ij} (f(u_i) ∧ g(w_j)) / (u_i ∧ w_j) .

Union of Type-2 Fuzzy Sets
The union (∪) of two type-2 fuzzy sets (Ã, B̃), corresponding to Ã OR B̃, is given by

Ã ∪ B̃ ⟺ μ_{Ã∪B̃}(x) = μ_Ã(x) ⊔ μ_B̃(x) = Σ_{ij} (f(u_i) ∧ g(w_j)) / (u_i ∨ w_j) ,

where ⊔ denotes join and ⊓ denotes meet. Thus the join and meet allow us to combine type-2 fuzzy sets when we wish to 'AND' or 'OR' two type-2 fuzzy sets. Join and meet are the building blocks for type-2 fuzzy relations and type-2 fuzzy inferencing with type-2 if-then rules. Type-2 fuzzy if-then rules (type-2 rules) are similar to type-1 fuzzy if-then rules. An example type-2 if-then rule is given by

IF x is Ã and y is B̃ THEN z is C̃ .  (2)
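The join and meet above can be computed directly when the secondary membership functions are discrete. A minimal sketch using minimum as the t-norm; the sample grades are invented:

```python
def combine(f, g, primary_op):
    """Combine two secondary MFs given as {primary grade u: f(u)} dicts.
    Each pair (u_i, w_j) contributes secondary grade min(f(u_i), g(w_j))
    at the primary grade primary_op(u_i, w_j); collisions keep the max."""
    out = {}
    for u, fu in f.items():
        for w, gw in g.items():
            v = primary_op(u, w)
            out[v] = max(out.get(v, 0.0), min(fu, gw))
    return out

def join(f, g):   # for the union: primary grades combine with max
    return combine(f, g, max)

def meet(f, g):   # for the intersection: primary grades combine with min
    return combine(f, g, min)

f = {0.4: 0.7, 0.6: 1.0}   # secondary MF of A at some x
g = {0.5: 1.0, 0.7: 0.4}   # secondary MF of B at the same x
```

Here join(f, g) places its largest secondary grade at primary grade 0.6 = max(0.6, 0.5), mirroring the sum formula above.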
Obviously the rule could have a more complex antecedent connected by AND. Also, the consequent of the rule could be type-1 or, indeed, crisp. Type-2 output processing can be done in a number of ways through type-reduction (e.g. Centroid, Centre of Sums, Height, Modified Height and Center-of-Sets), which produces a type-1 fuzzy set, followed by defuzzification of that set.
Generalized Type-2 Fuzzy Systems
So far we have been discussing type-2 fuzzy systems where the secondary membership function can take any form; these are known as generalized type-2 fuzzy sets. Historically these have been difficult to work with because the complexity of the calculations is too high for real applications. Recent developments [2,3,4] mean that it is now possible to develop type-2 fuzzy systems where, for example, the secondary membership functions are triangular in shape. This is relatively new but offers an exciting opportunity to capture the uncertainty in real applications. We expect the interest in this area to grow considerably, but for
the purposes of this article we will concentrate on the detail of interval type-2 fuzzy systems, where the secondary membership function is always unity.
Interval Type-2 Fuzzy Sets and Systems
As of this date, interval type-2 fuzzy sets (IT2 FSs) and interval type-2 fuzzy logic systems (IT2 FLSs) are most widely used because it is easy to compute using them. IT2 FSs are also known as interval-valued FSs, for which there is a very extensive literature (e.g., [1], see the many references in this article, [5,6,16,28]). This section focuses on IT2 FSs and IT2 FLSs.
Interval Type-2 Fuzzy Sets
An IT2 FS Ã is characterized as (much of the background material in this sub-section is taken from [23]; see also [17]):

Ã = ∫_{x∈X} ∫_{u∈J_x⊆[0,1]} 1/(x, u) = ∫_{x∈X} [ ∫_{u∈J_x⊆[0,1]} 1/u ] / x ,  (3)

where x, the primary variable, has domain X; u ∈ U, the secondary variable, has domain J_x at each x ∈ X; J_x is called the primary membership of x and is defined below in (9); and the secondary grades of Ã all equal 1. Note that for continuous X and U, (3) means Ã: X → {[a, b] : 0 ≤ a ≤ b ≤ 1}. The bracketed term in (3) is called the secondary MF, or vertical slice, of Ã, and is denoted μ_Ã(x), i.e.

μ_Ã(x) = ∫_{u∈J_x⊆[0,1]} 1/u ,  (4)

so that Ã can be expressed in terms of its vertical slices as:

Ã = ∫_{x∈X} μ_Ã(x)/x .  (5)
Fuzzy Logic, Type-2 and Uncertainty, Figure 6 FOU (shaded), LMF (dashed), UMF (solid) and an embedded FS (wavy line) for IT2 FS Ã

Uncertainty about Ã is conveyed by the union of all the primary memberships, which is called the footprint of uncertainty (FOU) of Ã (see Fig. 6), i.e.

FOU(Ã) = ∪_{x∈X} J_x = {(x, u) : u ∈ J_x ⊆ [0, 1]} .  (6)

The upper membership function (UMF) and lower membership function (LMF) of Ã are two type-1 MFs that bound the FOU (Fig. 6). UMF(Ã) is associated with the upper bound of FOU(Ã) and is denoted μ̄_Ã(x), ∀x ∈ X, and LMF(Ã) is associated with the lower bound of FOU(Ã) and is denoted μ̲_Ã(x), ∀x ∈ X, i.e.

UMF(Ã) ≡ μ̄_Ã(x) ∀x ∈ X ,  (7)

LMF(Ã) ≡ μ̲_Ã(x) ∀x ∈ X .  (8)

Note that J_x is an interval set, i.e.

J_x = [μ̲_Ã(x), μ̄_Ã(x)] .  (9)

This set is discrete when U is discrete and continuous when U is continuous. Using (9), FOU(Ã) in (6) can also be expressed as

FOU(Ã) = ∪_{x∈X} [μ̲_Ã(x), μ̄_Ã(x)] .  (10)

A very compact way to describe an IT2 FS is [22]:

Ã = 1/FOU(Ã) ,  (11)

where this notation means that the secondary grade equals 1 for all elements of FOU(Ã).
For continuous universes of discourse X and U, an embedded IT2 FS Ã_e is

Ã_e = ∫_{x∈X} [1/u]/x ,  u ∈ J_x .  (12)

Note that (12) means Ã_e: X → {u : 0 ≤ u ≤ 1}. The set Ã_e is embedded in Ã such that at each x it only has one secondary variable (i.e., one primary membership whose secondary grade equals 1). Examples of Ã_e are 1/μ̄_Ã(x) and 1/μ̲_Ã(x), ∀x ∈ X. In this notation it is understood that the secondary grade equals 1 at all elements in μ̄_Ã(x) or μ̲_Ã(x).
For discrete universes of discourse X and U, in which x has been discretized into N values and at each of these values u has been discretized into M_i values, an embedded IT2
FS Ã_e has N elements, where Ã_e contains exactly one element from J_{x_1}, J_{x_2}, …, J_{x_N}, namely u_1, u_2, …, u_N, each with a secondary grade equal to 1, i.e.,

Ã_e = Σ_{i=1}^{N} [1/u_i]/x_i ,  (13)

where u_i ∈ J_{x_i}. Set Ã_e is embedded in Ã, and there are a total of n_A = Π_{i=1}^{N} M_i embedded T2 FSs.
Associated with each Ã_e is an embedded T1 FS A_e, where

A_e = ∫_{x∈X} u/x ,  u ∈ J_x .  (14)

The set A_e, which acts as the domain for Ã_e (i.e., Ã_e = 1/A_e), is the union of all the primary memberships of the set Ã_e in (12). Examples of A_e are μ̄_Ã(x) and μ̲_Ã(x), ∀x ∈ X. When the universes of discourse X and U are continuous, there is an uncountable number of embedded IT2 and T1 FSs in Ã. Because such sets are only used for theoretical purposes and are not used for computational purposes, this poses no problem. For discrete universes of discourse X and U, an embedded T1 FS A_e has N elements, one each from J_{x_1}, J_{x_2}, …, J_{x_N}, namely u_1, u_2, …, u_N, i.e.,

A_e = Σ_{i=1}^{N} u_i/x_i .  (15)
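The count n_A = Π M_i of embedded sets can be checked by enumeration for a small discrete example; the J_x values below are invented for illustration:

```python
from itertools import product

# Discrete primary memberships: at x_i, J_{x_i} holds M_i candidate grades.
J = {1.0: [0.2, 0.4], 2.0: [0.5, 0.7, 0.9], 3.0: [0.2, 0.4]}

xs = sorted(J)
# Each embedded T1 FS picks exactly one grade u_i from each J_{x_i}.
embedded = [dict(zip(xs, grades)) for grades in product(*(J[x] for x in xs))]

n_A = 1
for x in xs:
    n_A *= len(J[x])          # n_A = product of the M_i = 2 * 3 * 2
```

For continuous universes this enumeration is impossible (the collection is uncountable), which is why embedded sets are a theoretical tool rather than a computational one.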
Set A_e is the union of all the primary memberships of set Ã_e and there are a total of Π_{i=1}^{N} M_i embedded T1 FSs.

Theorem 1 (Representation Theorem (RT) [19] Specialized to an IT2 FS [22]) For an IT2 FS, for which X and U are discrete, Ã is the union of all of its embedded IT2 FSs. Equivalently, the domain of Ã is equal to the union of all of its embedded T1 FSs, so that Ã can be expressed as

Ã = Σ_{j=1}^{n_A} Ã_e^j = 1/FOU(Ã) = 1/Σ_{j=1}^{n_A} A_e^j = 1/Σ_{j=1}^{n_A} [Σ_{i=1}^{N} u_i^j/x_i] = 1/∪_{x∈X} {μ̲_Ã(x), …, μ̄_Ã(x)} ,  (16)

and if X is a continuous universe, then the infinite set {μ̲_Ã(x), …, μ̄_Ã(x)} is replaced by the interval set [μ̲_Ã(x), μ̄_Ã(x)].
This RT is arguably the most important result in IT2 FS theory because it can be used as the starting point for solving all problems involving IT2 FSs. It expresses an IT2 FS in terms of T1 FSs, so that all results for problems that use IT2 FSs can be solved using T1 FS mathematics [22]. Its use leads to the structure of the solution to a problem, after which efficient computational methods must be found to implement that structural solution. Table 1 summarizes set theoretic operations and uncertainty measures, all of which were computed using the RT. Additional results for similarity measures are in [25].
Interval Type-2 Fuzzy Logic Systems
An interval type-2 fuzzy logic system (IT2 FLS), which is a FLS that uses at least one IT2 FS, contains five components (fuzzifier, rules, inference engine, type-reducer and defuzzifier) that are inter-connected, as shown in Fig. 7 (the background material in this sub-section is taken from [23]). The IT2 FLS can be viewed as a mapping from inputs to outputs (the path in Fig. 7, from "Crisp Inputs" to "Crisp Outputs"), and this mapping can be expressed quantitatively as y = f(x). It is also known as an interval type-2 fuzzy logic controller (IT2 FLC) [7], interval type-2 fuzzy expert system, or interval type-2 fuzzy model. The inputs to the IT2 FLS prior to fuzzification may be certain (e.g., perfect measurements) or uncertain (e.g., noisy measurements); T1 or IT2 FSs can be used to model the latter measurements. The IT2 FLS works as follows: the crisp inputs are first fuzzified into either type-0 (known as singleton fuzzification), type-1 or IT2 FSs, which then activate the inference engine and the rule base to produce output IT2 FSs. These IT2 FSs are then processed by a type-reducer (which combines the output sets and then performs a centroid calculation), leading to an interval T1 FS called the type-reduced set. A defuzzifier then defuzzifies the type-reduced set to produce crisp outputs.
Rules are the heart of a FLS, and may be provided by experts or can be extracted from numerical data. In either case, rules can be expressed as a collection of IF-THEN statements. A multi-input multi-output (MIMO) rule base can be considered as a group of multi-input single-output (MISO) rule bases; hence, it is only necessary to concentrate on a MISO rule base. Consider an IT2 FLS having p inputs x_1 ∈ X_1, …, x_p ∈ X_p and one output y ∈ Y. We assume there are M rules, where the ith rule has the form
R^i: IF x_1 is F̃_1^i and … and x_p is F̃_p^i, THEN y is G̃^i ,  i = 1, …, M .  (17)
This rule represents a T2 relation between the input space, X_1 × ⋯ × X_p, and the output space, Y, of the IT2 FLS. Associated with the p antecedent IT2 FSs F̃_k^i are the IT2 MFs μ_{F̃_k^i}(x_k) (k = 1, …, p), and associated with the consequent IT2 FS G̃^i is its IT2 MF μ_{G̃^i}(y).
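The RT is also the justification for the centroid bounds listed in Table 1: the centroid of an IT2 FS is the interval of centroids of its embedded T1 FSs. For a tiny discrete FOU this can be brute-forced over the corner grades; an illustrative check with invented values, not the efficient KM algorithm referenced in the table:

```python
from itertools import product

x  = [1.0, 2.0, 3.0]
lo = [0.2, 0.5, 0.2]    # LMF samples (lower FOU bound)
hi = [0.4, 0.9, 0.4]    # UMF samples (upper FOU bound)

# The extreme centroids occur with every u_i at an FOU bound, so the
# 2^N corner assignments bracket the centroid interval [c_l, c_r].
centroids = [
    sum(xi * ui for xi, ui in zip(x, u)) / sum(u)
    for u in product(*zip(lo, hi))
]
c_l, c_r = min(centroids), max(centroids)
```

The KM algorithms reach the same endpoints in O(N log N) by locating the switch points L and R instead of enumerating 2^N corners.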
Fuzzy Logic, Type-2 and Uncertainty, Table 1 Results for IT2 FSs

Set Theoretic Operations [22]
- Union: Ã ∪ B̃ = 1/∪_{x∈X} [μ̲_Ã(x) ∨ μ̲_B̃(x), μ̄_Ã(x) ∨ μ̄_B̃(x)]
- Intersection: Ã ∩ B̃ = 1/∪_{x∈X} [μ̲_Ã(x) ∧ μ̲_B̃(x), μ̄_Ã(x) ∧ μ̄_B̃(x)]
- Complement: complement of Ã = 1/∪_{x∈X} [1 − μ̄_Ã(x), 1 − μ̲_Ã(x)]

Uncertainty Measures [9,21,24]
- Centroid: C_Ã = [c_l(Ã), c_r(Ã)], where
  c_l(Ã) = (Σ_{i=1}^{L} x_i μ̄_Ã(x_i) + Σ_{i=L+1}^{N} x_i μ̲_Ã(x_i)) / (Σ_{i=1}^{L} μ̄_Ã(x_i) + Σ_{i=L+1}^{N} μ̲_Ã(x_i)) ,
  c_r(Ã) = (Σ_{i=1}^{R} x_i μ̲_Ã(x_i) + Σ_{i=R+1}^{N} x_i μ̄_Ã(x_i)) / (Σ_{i=1}^{R} μ̲_Ã(x_i) + Σ_{i=R+1}^{N} μ̄_Ã(x_i)) ,
  with L and R computed using the KM Algorithms in Table 2.
- Cardinality: P_Ã = [p_l(Ã), p_r(Ã)] = [p(μ̲_Ã(x)), p(μ̄_Ã(x))], where p(B) = |X| Σ_{i=1}^{N} μ_B(x_i)/N
- Fuzziness: F_Ã = [f_1(Ã), f_2(Ã)] = [f_1(A_{e1}), f_2(A_{e2})], where f(A) = h Σ_{i=1}^{N} g(μ_A(x_i))

C_α = {x | μ_C(x) ≥ α} for α > 0 , and C_0 = cls{x | μ_C(x) > 0} for α = 0 ,  (21)
where cls{A} denotes the closure of the set {A}. Tanaka, Okuda and Asai [120] make the following remarkable observation:

sup_x μ_D(x) = sup_α [α ∧ max_{x∈C_α} μ_G(x)] .

For illustration, consider two fuzzy sets, C̃ and G̃, depicted in Fig. 2. In case (i), α = α_1, and C_{α_1} is the interval between the end-points of the α-cut, [C⁻(α_1), C⁺(α_1)]. The maximum of μ_G in this interval is shown in the example. In this case α_1 < max_{C_{α_1}} μ_G(x), so [α_1 ∧ max_{C_{α_1}} μ_G(x)] = α_1. In case (ii), α_2 > max_{C_{α_2}} μ_G(x), so [α_2 ∧ max_{C_{α_2}} μ_G(x)] = max_{C_{α_2}} μ_G(x). In case (iii), α* = max_{C_{α*}} μ_G(x). It should be apparent from Fig. 2 that α* = sup_x μ_D. In case (iii), α = α* is also sup_α [α ∧ max_{C_α} μ_G(x)]: for any α < α* we have case (i), where [α ∧ max_{C_α} μ_G(x)] = α < α*; and for any α > α* we have case (ii), where [α ∧ max_{C_α} μ_G(x)] = max_{C_α} μ_G(x) < α*. The formal proof, which follows the reasoning illustrated here pictorially, is omitted for brevity's sake; the interested reader is referred to Tanaka, Okuda, and Asai [120]. This result allows the following reformulation of the fuzzy mathematical problem:

Determine (α*, x*) such that α* ∧ μ_f(x*) = sup_α [α ∧ max_{x∈C_α} μ_f(x)] .  (22)

The researchers suggest an iterative algorithm which solves the problem. The algorithm cycles through a series of steps, each of which brings it closer to a solution of the relation α* = max_{C_{α*}} μ_f(X). When α* is determined to within a tolerable degree of uncertainty, it is used to find x* such that μ_f(x*) = max_{C_{α*}} μ_f(X). This cumbersome solution method is impractical for large-scale problems. It should be noted that Negoita [93] suggested an alternative, but similarly complex, algorithm for the flexible programming problem based on Tanaka's findings.

Zimmermann
Just two years after Tanaka, Okuda, and Asai [120] suggested the use of α-cuts to solve the fuzzy mathematical problem, Zimmermann published a linear programming equivalent to the α-cut formulation. Beginning with the crisp programming problem

min Z = cx ,  subject to: Ax ≤ b ,  x ≥ 0 ,  (23)

the decision maker introduces flexibility in the constraints and sets a target value for the objective function, Z:

cx ≲ Z ,  Ax ≲ b ,  x ≥ 0 .  (24)

A linear membership function μ_i is defined for each flexible right-hand side, including the goal Z, as

μ_i(Σ_j a_ij x_j) = 1 if Σ_j a_ij x_j ≤ b_i ;
μ_i(Σ_j a_ij x_j) = 1 − (Σ_j a_ij x_j − b_i)/d_i if Σ_j a_ij x_j ∈ (b_i, b_i + d_i) ;
μ_i(Σ_j a_ij x_j) = 0 if Σ_j a_ij x_j ≥ b_i + d_i ,  (25)

where d_i is the decision maker's maximum allowable violation of constraint (or goal) i. According to the convention set forth by Bellman and Zadeh, the fuzzy decision D is characterized by

μ_D = min_i μ_i(Σ_j a_ij x_j) ,  (26)

and the solution is the decision with the highest degree of membership. The problem of finding the solution is therefore

max_{x≥0} min_i μ_i(Σ_j a_ij x_j) .  (27)

In the membership functions μ_i from (25), Zimmermann substitutes b′_i for b_i/d_i and B′_i for Σ_j a_ij/d_i. He also drops the 1 (which does not change the solution to the problem) to obtain the simplification μ_i = b′_i − B′_i x. Equation (27) then becomes

max_{x≥0} min_i (b′_i − B′_i x) .  (28)

This is equivalent to the following linear program:

max α
subject to: α ≤ b′_i − B′_i x ∀i ,  x ≥ 0 ,  0 ≤ α ≤ 1 .  (29)

Equation (29) can easily be solved via the simplex or interior point methods.

Verdegay
The solutions examined so far for the fuzzy programming problem have been crisp solutions. Ralescu [99] first suggested that a fuzzy problem should have a fuzzy solution, and Verdegay [130] proposes a method for obtaining one. Verdegay considers a problem with fuzzy constraints,

max z = f(x) ,  subject to: x ∈ C̃ ,  (30)

where the set of constraints has a membership function μ_C̃, with α-cuts C̃_α. Verdegay defines x_α as the set of solutions that satisfy the constraints C̃_α. Then a fuzzy solution to the fuzzy linear programming problem is

max z = f(x) ,  x ∈ C̃_α ,  ∀α ∈ [0, 1] .  (31)

Verdegay proposes solving (31) parametrically for α ∈ [0, 1] to obtain a fuzzy solution X̃, with α-cut x_α, which yields the fuzzy objective value z̃, with α-cut z_α.
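Zimmermann's max-min problem (27) can be explored on a toy instance by direct search over a grid; a sketch with invented coefficients (a real application would solve the equivalent LP (29) with a simplex or interior point solver):

```python
# One fuzzy goal (2*x1 + x2 should reach about 5) and two soft
# constraints, each with tolerance d_i, using the linear memberships (25).
def ramp_down(v, b, d):      # 1 while satisfied, 0 beyond b + d
    return max(0.0, min(1.0, 1.0 - (v - b) / d))

def mu_goal(x):              # goal: 2*x1 + x2 should reach 5, tolerance 2
    return max(0.0, min(1.0, (2 * x[0] + x[1] - 3.0) / 2.0))

def mu_c1(x):                # soft constraint: x1 + 2*x2 <= 4, tolerance 2
    return ramp_down(x[0] + 2 * x[1], 4.0, 2.0)

def mu_c2(x):                # soft constraint: 3*x1 + x2 <= 6, tolerance 3
    return ramp_down(3 * x[0] + x[1], 6.0, 3.0)

# Bellman-Zadeh decision (26) and its max-min optimum (27), by grid search.
grid = [i * 0.1 for i in range(51)]          # x_j in [0, 5]
alpha_star, x_star = max(
    (min(mu_goal((x1, x2)), mu_c1((x1, x2)), mu_c2((x1, x2))), (x1, x2))
    for x1 in grid for x2 in grid
)
```

The grid search only approximates the LP optimum, but it makes the max-min trade-off between the goal and the soft constraints concrete.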
Fuzzy Optimization, Figure 3 Surprise Function
It should be noted that the (crisp) solution obtained via Zimmermann's method corresponds to Verdegay's solution in the following way: if the objective function is transformed into a goal with membership function μ_G, then Zimmermann's optimal solution, x*, is equal to Verdegay's optimal value x(α) for the value of α which satisfies

μ_G(z_α = cx_α) = α .

In other words, when a particular α-cut of the fuzzy solution, x_α, yields an objective value z_α = cᵀx_α whose membership level for goal G, μ_G(z_α), is equal to the same α, then that solution x_α corresponds to Zimmermann's optimal solution, x*.

Surprise
Recently, Neumaier suggested a way to model fuzzy right-hand side values with crisp inequality constraints (Ax ≲ b̃) based on the concept of surprise [83,94]. Neumaier defines a surprise function, s(x | E), which corresponds to the amount of surprise a variable x produces, given a statement E. The range of s is [0, ∞), with s = 0 corresponding to an entirely true statement E and s = ∞ corresponding to an entirely false statement E. The most plausible values are those values of x for which s(x | E) is minimal, so Neumaier proposes that the best compromise solution for an optimization problem with fuzzy goals and constraints can be found by minimizing the sum of the surprise functions of the goals and constraints. Each fuzzy constraint,

(Ax)_i ≲ b̃_i ,

is translated into a fuzzy equality constraint,

(Ax)_i = τ̃_i ,

where the membership function μ_i(τ) of τ̃_i is the possibility that τ ≲ b̃_i. These membership functions are subsequently translated into surprise functions by

s_i(τ) = (μ_i(τ)⁻¹ − 1)² ,

and the contributions of all constraints are added to give the total surprise

Σ_i s_i(τ_i) = Σ_i s_i((Ax)_i) .

Thus, the fuzzy optimization problem is

min z = Σ_i s_i((Ax)_i) ,  subject to: x ≥ 0 .  (32)

Figure 3 illustrates a surprise function associated with a corresponding fuzzy goal. For triangular and trapezoidal numbers, the surprise function is simple, smooth, and convex, leading to a tractable non-linear programming problem. The surprise approach is a fuzzy optimization method since the optimization is over sets of crisp values derived from fuzzy sets. In contrast to flexible programming, the constraints are not restricted such that all satisfy a minimal level. The salient feature is that surprise uses a dynamic penalty for falling outside distribution/membership values of one. Because the individual penalties are convex functions which become infinite as the values approach the end-points of the support, this method lends itself to convex programming solution techniques. It is noted that the surprise approach may be used to handle the soft constraints of flexible programming, since these soft constraints can be viewed as fuzzy numbers (trapezoidal fuzzy numbers when linear interpolation is used). However, if soft constraints are handled using surprise functions, the sum of the failures to meet the constraints is minimized rather than forcing each constraint
to meet a minimal feasibility level. One could add a hard constraint to the surprise approach to attain this minimal level of feasibility. The modeler might choose to translate soft constraints to surprise functions (with perhaps a fixed minimal feasibility level) because surprise is usually more computationally efficient than Zimmermann’s method of handling soft constraints (see [72]). Therefore, the surprise approach is quite flexible both in terms of semantics as well as in computational robustness.
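A surprise function is straightforward to write down from a membership function; a minimal sketch with a triangular b̃ whose break-points are invented for illustration:

```python
def mu_tri(tau, a, m, b):
    """Triangular membership: support (a, b), full membership at m."""
    if tau <= a or tau >= b:
        return 0.0
    return (tau - a) / (m - a) if tau <= m else (b - tau) / (b - m)

def surprise(tau, a=0.0, m=2.0, b=4.0):
    """s(tau) = (1/mu(tau) - 1)^2: zero where the statement is fully
    plausible, growing without bound toward the edges of the support."""
    grade = mu_tri(tau, a, m, b)
    return float("inf") if grade == 0.0 else (1.0 / grade - 1.0) ** 2
```

Summing such penalties over the constraints gives the convex objective of (32), which is why smooth convex solvers handle it well.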
Possibilistic Programming
Recall that fuzzy imprecision arises when elements of a set (for instance, a feasible set) are members of the set to varying degrees, which are defined by the membership function of the set. Possibilistic imprecision arises when elements of a set (say, again, a feasible set) are known to exist as either full members or non-members, but whether they are members or non-members is known with a varying degree of certainty, which is defined by the possibility distribution of the set. This possibilistic uncertainty arises from a lack of information. In this section, we examine possibilistic programming formulations.

Buckley
J.J. Buckley has suggested an algorithm for dealing with possibilistic cost coefficients, constraint coefficients, and right-hand sides. Consider the possibilistic linear program

min Z = ĉx ,  subject to: Âx ≲ b̂ ,  x ≥ 0 ,  (33)

where Â = [ã_ij] is an m × n matrix of trapezoidal possibilistic intervals â_ij = (a_ijα, a_ijβ, a_ijγ, a_ijδ), b̂ = (b̃_1, …, b̃_m)ᵀ is an m × 1 vector of trapezoidal fuzzy numbers b̂_i = (b_iα, b_iβ, b_iγ, b_iδ), and ĉ = (c̃_1, …, c̃_n) is a 1 × n vector of trapezoidal possibilistic intervals c̃_i = (c_iα, c_iβ, c_iγ, c_iδ). The possibilistic intervals are the possibility distributions associated with the variables, and place a restriction on the possible values the variables may assume [145]. For example, Poss[ã_ij = a] = μ_{a_ij}(a) is the possibility that ã_ij is equal to a. Stated another way, Poss[ã_ij = a] = x means that, given the current state of knowledge about the value of a_ij, we believe that x is the possibility that variable a_ij could take on value a. Because this is a possibilistic linear programming problem, the objective function will be governed by a possibility distribution, Poss[Z = z]. Let us consider this simple example as we follow Buckley's derivation of Poss[Z = z]:

min z = d̃x_1 + ẽx_2
subject to: f̃x_1 + g̃x_2 ≲ h̃ ,
ĩx_1 + j̃x_2 ≲ k̃ ,
x_1, x_2 ≥ 0 .  (34)

To derive the possibility function, Poss[Z = z], Buckley first specifies the possibility that x satisfies the ith constraint. Let

Π(â_i, b̂_i) = min(μ_{a_i1}(ã_i1), …, μ_{a_in}(ã_in), μ_{b_i}(b̃_i)) ,  (35)

which is the simultaneous distribution of ã_ij, 1 ≤ j ≤ n, and b̃_i. In our example (34) this corresponds to

Π(f̂, ĝ, ĥ) = min(μ_F(f), μ_G(g), μ_H(h)) .

Then the possibility that x ≥ 0 is feasible with respect to the ith constraint is:

Poss[x ∈ F_i] = sup_{a_i, b_i} (Π(a_i, b_i) | a_i x ≤ b_i) .

In our example (34) this corresponds to

Poss[x ∈ F_1] = sup_{f,g,h} (Π(f, g, h) | fx_1 + gx_2 ≤ h) ,
Poss[x ∈ F_2] = sup_{i,j,k} (Π(i, j, k) | ix_1 + jx_2 ≤ k) .  (36)

Now the possibility that x ≥ 0 is feasible with respect to all constraints is

Poss[x ∈ F] = min_{1≤i≤m} (Poss[x ∈ F_i]) .

In our example (34), Poss[x ∈ F] = min(Poss[x ∈ F_1], Poss[x ∈ F_2]).
Buckley next constructs Poss[Z = z | x], which is the conditional possibility that the objective function Z obtains a particular value z, given values for x. The joint distribution of the possibilistic cost coefficients c̃_j is

Π(c) = min(μ_{c_1}(c̃_1), …, μ_{c_n}(c̃_n)) .  (37)

In our example (34), Π(c) = min(μ_D(d̃), μ_E(ẽ)). Therefore

Poss[Z = z | x] = sup_c (Π(c) | cx = z) .  (38)
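Equation (38) can be evaluated numerically. For the example (34), with the symmetric triangular d̃ = (2, 2) and ẽ = (3, 2) used in the Fig. 4 illustration and the solution (x_1, x_2) = (1, 1), a grid search over d + e = 7 recovers Poss[Z = 7 | x] = 0.5; a sketch:

```python
def tri(center, spread):
    """Symmetric triangular possibility distribution."""
    return lambda v: max(0.0, 1.0 - abs(v - center) / spread)

mu_d, mu_e = tri(2.0, 2.0), tri(3.0, 2.0)   # d~ = (2, 2), e~ = (3, 2)
z, x1, x2 = 7.0, 1.0, 1.0

# Poss[Z = z | x]: best joint possibility over cost pairs with
# d*x1 + e*x2 = z, i.e. sup over d of min(mu_d(d), mu_e(e)).
poss = max(
    min(mu_d(d), mu_e((z - d * x1) / x2))
    for d in [i * 0.01 for i in range(501)]    # d in [0, 5]
)
```

The maximum occurs near d = 3, e = 4, matching the hand computation in the text.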
Fuzzy Optimization, Figure 4 Buckley Example

In our example (34), to obtain the possibility that the objective function will take on a particular value z, given the solution x (i.e., Poss[Z = z | x]), we consider all the values of d and e for which dx_1 + ex_2 = z. Of these, the pair with the highest simultaneous possibility values will yield an objective function value of z with the same possibility level. For example, consider the case illustrated in Fig. 4, where d̃ = (2, 2) and ẽ = (3, 2) are symmetric triangular fuzzy numbers, (x_1, x_2) = (1, 1) and z = 7. We consider values of d and e such that d·1 + e·1 = 7. We could select d = 2 and e = 5, with μ_D(2) = 1 and μ_E(5) = 0; the joint possibility of d = 2 and e = 5 is min(μ_D(2), μ_E(5)) = 0. The maximum joint possibility of d and e such that d·1 + e·1 = 7 occurs when d = 3 and e = 4. The joint possibility that d = 3 and e = 4, which is min(μ_D(3), μ_E(4)) = 0.5, we set equal to the possibility that z = 7 given that (x_1, x_2) = (1, 1). So Poss[z = 7 | (1, 1)] = 0.5.
A combination of (37) and (38) yields the possibility distribution of the objective function:

Poss[Z = z] = sup_{x≥0} [min(Poss[Z = z | x], Poss[x ∈ F])] .  (39)

In our example (34), to obtain the possibility that the objective function will take on a particular value z (that is, Poss[Z = z]), we consider Poss[Z = z | x], as described above, for all positive x's, and select the x which maximizes Poss[Z = z | x].
This definition of the possibility distribution motivates Buckley's solution method. Recall that because we are dealing with a possibilistic problem, the solution will be governed by a possibility distribution. Buckley's method depends upon a static α, chosen a priori. The decision maker defines an acceptable level of uncertainty in the objective outcome, 0 < α ≤ 1. For a given α, we define the left and right end-points of the α-cut of a fuzzy number x̃ as x⁻(α) and x⁺(α), respectively. Using these, Buckley defines a new objective function:

min Z(α) = c⁻(α)x ,  subject to: A⁺(α)x ≥ b⁻(α) .  (40)

Since both x and α are variables, this is a non-linear programming problem. When α is fixed in advance, however, it becomes linear. We can use either the simplex method or an interior point method to solve for a given a priori chosen value of α. If a maximal value for α is desired, the linear programming method must be applied iteratively. It should be noted that this linear program is constrained upon the best-case scenario: for a given α-level, each variable is multiplied by the largest possible coefficient a⁺_ij(α) and is required only to exceed the smallest possible right-hand side b⁻_i(α). We should interpret z(α) accordingly. If the solution to the linear program is implemented, the possibility that the objective function will attain the level z(α) is given by α. Stated another way, the best-case scenario is that the objective function attains a value of z(α), and the possibility of the best-case scenario occurring is α.

Tanaka, Asai, Ichihashi
In the mid 1980s, Tanaka and Asai [118] and Tanaka, Ichihashi, and Asai [121] proposed a technique for dealing with ambiguous coefficients and right-hand sides based upon a possibilistic definition of "greater than zero". The reader will note that this approach bears many similarities to the flexible programming proposed by Tanaka, Okuda, and Asai a decade earlier, which was discussed in Subsect. "Tanaka, Okuda, and Asai". Indeed, the 1984 publications refer to fuzzy variables. This approach has subsequently been classified [44,60] as possibilistic programming because the imprecision it represents stems from a lack of information about the values of the coefficients. Consider a programming problem with non-interactive possibilistic Â, b̂, and ĉ, whose possible values are defined by fuzzy matrix Ã, fuzzy vector b̃, and fuzzy vector c̃, respectively:

min z̃ = c̃x ,  subject to: Ãx ≲ b̃ ,  x ≥ 0 .  (41)

Tanaka, Asai, and Ichihashi transform the problem in several steps. First, the objective function is viewed as a goal. As in flexible programming, the goal becomes
a constraint, with the aspiration level for the objective function on the right-hand side of the inequality. Next, a new decision variable x′ is added. Finally, the b's are moved to the left-hand side, so that the possibilistic linear programming problem is

ã′_i x′ ≳ 0 ,  i = 1, 2, …, m ,  x′ ≥ 0 ,  (42)

where x′ = (1, xᵀ)ᵀ = (1, x_1, x_2, …, x_n)ᵀ and ã′_i = (b̃_i, ã_i1, …, ã_in). Note that all the parameters A, b, and c are now grouped together in the new constraint matrix Ã. Because the objective function(s) have become goals, the cost coefficients c are part of the constraint coefficient matrix Ã. Furthermore, because the right-hand side values have been moved to the left-hand side, the b's are also part of the constraint coefficient matrix. The right-hand sides corresponding to the former objective functions are the aspiration levels of the goals. Each constraint becomes
Fuzzy Optimization, Figure 5 h-Level
Ỹᵢ = ã′ᵢ x′, where ã′ᵢ = (−b̃ᵢ, ãᵢ₁, …, ãᵢₙ). "Ỹᵢ is almost positive", denoted by Ỹᵢ ≳ 0, is defined by

Ỹᵢ ≳ 0 ⇔ μ_Ỹᵢ(0) ≤ 1 − h.   (43)

The measure of the non-negativity of Ỹᵢ is h: the greater the value of h, the stronger the meaning of "almost positive" (see Fig. 5). Actually, h is 1 − α, where α is the level of membership used by Bellman and Zadeh. Tanaka and Asai [118] developed this theory for triangular fuzzy numbers, and Tanaka and Asai extended it to trapezoidal fuzzy numbers in [122]. Inuiguchi, Ichihashi, and Tanaka [44] generalized the approach for L-R fuzzy numbers. For the sake of simplicity, our discussion here will deal with trapezoidal fuzzy numbers, denoted x̃ = (γ, α, β, δ), where (γ, δ) is the support of x̃ and (α, β) is the core of x̃. Using (43), we can rewrite each constraint from (42) as

μ_Ỹᵢ(0) = 1 − αᵢᵀx′ / ((αᵢ − γᵢ)ᵀx′) ≤ 1 − h,  αᵢᵀx′ ≥ 0,   (44)

where x′ > 0 and αᵢ and γᵢ collect the core and support parameters of ã′ᵢ. Then (44) reduces to

(αᵢ − h(αᵢ − γᵢ))ᵀ x′ ≥ 0.

Since we wish to find the largest h that satisfies these conditions, the linear program becomes

max z = h
subject to: (αᵢ − h(αᵢ − γᵢ))ᵀ x′ ≥ 0,  ∀i   (45)
h ∈ [0, 1].
Since both x and h are variables, this is a non-linear programming problem. When h is fixed, it becomes linear. We can use the simplex method or an interior point algorithm to solve for a given value of h. If the decision maker wishes to maximize h, the linear programming method must be applied iteratively.

Fuzzy Max

Dubois and Prade [19] suggested that the concept of "fuzzy max" could be applied to constraints with fuzzy parameters. The "fuzzy max" concept was used to solve possibilistic linear programs with triangular possibilistic coefficients by Tanaka, Ichihashi, and Asai [121]. Ramik and Rimanek [101] applied the same technique to L-R fuzzy numbers. For consistency, we will discuss the fuzzy max technique with respect to trapezoidal numbers. The fuzzy max, illustrated in Fig. 6 where C̃ = max[Ã, B̃], is the extended maximum operator between real numbers, defined by the extension principle (8) as

μ_C̃(c) = max over {a, b : c = max(a, b)} of min[μ_Ã(a), μ_B̃(b)].

Using fuzzy max, we can define an inequality relation as

Ã ≥ B̃ ⇔ max(Ã, B̃) = Ã.   (46)
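The cut-wise reading of (46) for trapezoidal numbers is easy to check numerically. The sketch below (function and variable names are ours, not the article's) represents a trapezoid as (γ, α, β, δ) and tests endpoint dominance on a grid of α-cuts:

```python
import numpy as np

def alpha_cut(trap, a):
    """Alpha-cut [lo, hi] of a trapezoid (g, lo_core, hi_core, d):
    support (g, d), core (lo_core, hi_core)."""
    g, lo_core, hi_core, d = trap
    return (g + a * (lo_core - g), d - a * (d - hi_core))

def fuzzy_max_geq(A, B, alphas=np.linspace(0.0, 1.0, 101)):
    """A >= B in the sense of (46): max(A, B) = A, i.e. at every
    alpha-cut both endpoints of A dominate the endpoints of B."""
    return all(
        alpha_cut(A, a)[0] >= alpha_cut(B, a)[0]
        and alpha_cut(A, a)[1] >= alpha_cut(B, a)[1]
        for a in alphas
    )

A = (2.0, 3.0, 4.0, 5.0)  # support (2, 5), core (3, 4)
B = (0.0, 1.0, 2.0, 3.0)
print(fuzzy_max_geq(A, B), fuzzy_max_geq(B, A))
```

Restricting the α grid to a subinterval [1 − h, 1] relaxes this comparison, which is the idea behind the h-level relation introduced below.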
Fuzzy Optimization
Fuzzy Optimization, Figure 6 Illustration of Fuzzy Max
Fuzzy Optimization, Figure 7 Illustration of Ã ≥ B̃
Applying (46) to a fuzzy inequality constraint:

f(x, ã) ≥ g(x, b̃) ⇔ max(f(x, ã), g(x, b̃)) = f(x, ã).   (47)

Observe that the inequality relation defined by (46) yields only a partial ordering. That is, it is sometimes the case that neither Ã ≥ B̃ nor B̃ ≥ Ã holds. To improve this, Tanaka, Ichihashi, and Asai introduce a level h, corresponding to the decision maker's degree of optimism. They define an h-level set of Ã ≥ B̃ as

Ã ≥ₕ B̃ ⇔ max[Ã]_α ≥ max[B̃]_α and min[Ã]_α ≥ min[B̃]_α,  for all α ∈ [1 − h, 1].   (48)

This definition of Ã ≥ B̃ requires that two of Dubois' inequalities from Subsect. "Fuzzy Relations", (ii) and (iii), hold at the same time, and is illustrated in Fig. 7. Tanaka, Ichihashi, and Asai [118] suggest a similar treatment for a fuzzy objective function. A problem with the single objective function, Maximize z(x, c̃), becomes a multi-objective problem with objective functions:

maximize inf(z(x, c̃)_α),  α ∈ [0, 1]
maximize sup(z(x, c̃)_α),  α ∈ [0, 1].   (49)

Clearly, since α can assume an infinite number of values, (49) has an infinite number of objective functions. Since (49) is not tractable, Inuiguchi, Ichihashi, and Tanaka [44] suggest the following approximation using a finite set of αᵢ ∈ [0, 1]:

maximize min(z(x, c̃))_αᵢ,  i = 1, 2, …, p
maximize max(z(x, c̃))_αᵢ,  i = 1, 2, …, p.   (50)
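For a fixed feasible x, the 2p objectives in (50) are just the interval endpoints of z at each αᵢ. A minimal sketch, with illustrative trapezoidal costs of our own choosing:

```python
def cut(trap, a):
    """Alpha-cut [lo, hi] of a trapezoid (g, lo_core, hi_core, d)."""
    g, lo, hi, d = trap
    return (g + a * (lo - g), d - a * (d - hi))

def objective_cuts(c_traps, x, alphas):
    """For fixed x >= 0, the alpha-cut of z(x, c~) is the interval
    [sum_j c_j^-(alpha) x_j, sum_j c_j^+(alpha) x_j]; collecting these
    over a finite alpha set gives the 2p objectives of (50)."""
    out = []
    for a in alphas:
        lo = sum(cut(c, a)[0] * xj for c, xj in zip(c_traps, x))
        hi = sum(cut(c, a)[1] * xj for c, xj in zip(c_traps, x))
        out.append((lo, hi))
    return out

c = [(1.0, 2.0, 2.0, 3.0), (3.0, 4.0, 5.0, 6.0)]  # one triangular, one trapezoidal cost
print(objective_cuts(c, x=[1.0, 1.0], alphas=[0.0, 0.5, 1.0]))
```

As α grows the intervals nest, so the decision maker trades the width of the possible objective range against the plausibility level αᵢ.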
Jamison and Lodwick

Jamison and Lodwick [50,74] develop a method for dealing with possibilistic right-hand sides that is a possibilistic generalization of the recourse models in stochastic optimization. Violations of constraints are allowable, at a cost determined a priori by the decision maker. Jamison and Lodwick choose the utility (that is, valuation) of a given interval of possible values to be its expected average (a concept defined by Yager [139]). The expected average (or EA) of a possibilistic distribution ã is defined to be

EA(ã) = ½ ∫₀¹ (ã⁻(α) + ã⁺(α)) dα.   (51)

It should be noted that the expected average of a crisp value is the value itself, since ã⁻(α) = ã⁺(α) = a implies EA(a) = ½ ∫₀¹ (a + a) dα = a ∫₀¹ dα = a. Jamison and Lodwick start from the following possibilistic linear program:

max z = cᵀx
subject to: Ax ≤ b̂   (52)
x ≥ 0.

By subtracting a penalty term from the objective function, they transform (52) into the following possibilistic non-linear program:

max z = cᵀx + pᵀ max(0, Ax − b̂)   (53)
x ≥ 0.
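To make (53) concrete for one crisp realization of b̂, the sketch below charges component-wise violations of Ax ≤ b to the objective at unit costs |pᵢ|; the data are illustrative (not from the article), and a crude grid search stands in for an NLP solver:

```python
import numpy as np

c = np.array([3.0, 2.0])
A = np.array([[1.0, 1.0], [2.0, 0.5]])
b = np.array([4.0, 5.0])
p = np.array([-10.0, -10.0])  # p_i < 0: cost per unit violation

def z(x):
    viol = np.maximum(0.0, A @ x - b)  # component-wise violation of Ax <= b
    return c @ x + p @ viol            # penalty enters negatively since p < 0

grid = np.linspace(0.0, 5.0, 101)
best = max((z(np.array([x1, x2])), x1, x2) for x1 in grid for x2 in grid)
print(best)  # with these steep penalties the optimum is essentially the feasible vertex x = (2, 2)
```

With a large |pᵢ|, leaving the feasible region is never worthwhile, so the penalized problem recovers the constrained optimum; smaller |pᵢ| lets the decision maker buy constraint violations.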
The "max" in the penalty is taken component-wise, and each pᵢ < 0 is the cost per unit violation of the right-hand side of constraint i. The utility of the objective function, taken to be its expected average, is then optimized. The possibilistic programming problem becomes

max z = cᵀx + pᵀ EA(max(0, Ax − b̂))   (54)
x ∈ [0, U].

A closed-form objective function (for the purpose of differentiating when solving) is achieved in [72] by replacing max(0, Ax − b̂) with

( √((Ax − b̂)²) + (Ax − b̂) ) / 2.
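Both (51) and the smooth replacement of max(0, ·) are easy to verify numerically. A minimal sketch (names ours); cut_fn(α) returns the α-cut endpoints (a⁻(α), a⁺(α)):

```python
import numpy as np

def ea(cut_fn, n=2001):
    """Expected average (51): EA(a~) = (1/2) * integral over [0,1] of
    a^-(alpha) + a^+(alpha); a uniform-grid average approximates the integral."""
    alphas = np.linspace(0.0, 1.0, n)
    lo, hi = np.array([cut_fn(a) for a in alphas]).T
    return 0.5 * float((lo + hi).mean())

def smooth_max0(t):
    """The closed-form stand-in for max(0, t): (sqrt(t^2) + t) / 2."""
    return (np.sqrt(t * t) + t) / 2.0

print(ea(lambda a: (1.0 + a, 3.0 - a)))   # triangular number 1/2/3 has EA = 2
print(ea(lambda a: (4.0, 4.0)))           # a crisp value is its own expected average
print(smooth_max0(-3.0), smooth_max0(2.5))
```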
Jamison and Lodwick's method can be extended [50] to account for possibilistic values for A, b, c, and even the penalty coefficients p, with the following formulation:

EA(f̃(x)) = ½ ∫₀¹ { ĉ⁻(α)ᵀx + ĉ⁺(α)ᵀx
  − p̂⁺(α)ᵀ max(0, Â⁻(α)x − b̂⁺(α))
  − p̂⁻(α)ᵀ max(0, Â⁺(α)x − b̂⁻(α)) } dα.   (55)
This approach differs significantly from the others we've examined in several regards. First, many of the approaches we've seen have incorporated the objective function(s) as goals into the constraints in the Bellman and Zadeh tradition. Jamison and Lodwick, on the other hand, incorporate the constraints into the objective function. Bellman and Zadeh create a symmetry between constraints and objective, while Jamison and Lodwick temper the objective with the constraints. A second distinction of the expected average approach is the nature of the solution. The other formulations we have examined to this point have produced either (1) a crisp solution for a particular value of α (namely, the maximal value of α), or (2) a fuzzy/possibilistic solution which encompasses all possible α values. The Jamison and Lodwick approach provides a crisp solution via the expected average utility which encompasses all α values. This may be a desirable quality to the decision maker who wants to account for all possibility levels and still reach a crisp solution.

Luhandjula

Luhandjula's [84] formulation of the possibilistic mathematical program depends upon his concept of "more possible" values. He first defines a possibility distribution Π_X with respect to constraint F as

Π_X(u) = μ_F(u),

where μ_F(u) is the degree to which the constraint F is satisfied when u is the value assigned to the solution X. Then the set of more possible values for X, denoted by V_p(X), is given by

V_p(X) = Π_X⁻¹( max_u Π_X(u) ).

In other words, V_p(X) contains the elements of U which are most compatible with the restrictions defined by Π_X. It follows from intuition, and from Luhandjula's formal proof [84], that when Π_X is convex, V_p(X) is a real-valued interval, and when Π_X is strongly convex, V_p(X) is a single real number. Luhandjula considers the mathematical program

max z̃ = ĉx
subject to: Âᵢx ≤ b̂ᵢ   (56)
x ≥ 0.

By replacing the possibilistic numbers ĉ, Âᵢ, and b̂ᵢ with their more possible values, V_p(ĉ), V_p(Âᵢ), and V_p(b̂ᵢ), respectively, Luhandjula arrives at a deterministic equivalent to Eq. (56):

max z = kx
subject to: kⱼ ∈ V_p(ĉⱼ)
Σⱼ tᵢⱼ xⱼ ≤ sᵢ,  tᵢⱼ ∈ V_p(âᵢⱼ),  sᵢ ∈ V_p(b̂ᵢ)   (57)
x ≥ 0.

This formulation varies significantly from the other approaches considered thus far. The possibility of each possibilistic component is maximized individually. Other formulations have required that each possibilistic component c̃ⱼ, Ãᵢⱼ, and b̃ᵢ achieve the same possibility level defined by α. This formulation also has a distinct disadvantage over the others we've considered: to date there is no proposed computational method for determining the "more possible" values V_p, so there is no way to solve the deterministic MP.

Programming with Fuzzy and Possibilistic Components

Sometimes the values of an optimization problem's components are ambiguous and the decision-makers are vague
(or flexible) regarding feasibility requirements. This section explores a couple of approaches for dealing with such fuzzy/possibilistic problems. One type of mixed programming problem that arises (see [20,91]) is a mathematical program with possibilistic constraint coefficients âᵢⱼ whose possible values are defined by fuzzy numbers of the form ãᵢⱼ:

max cx
subject to: â′ᵢ x′ ⊆ b̃ᵢ   (58)
x′ = (1, xᵀ)ᵀ ≥ 0.

Zadeh [142] defines the set-inclusion relation M̃ ⊆ Ñ as μ_M̃(r) ≤ μ_Ñ(r) ∀r ∈ ℝ. Recall that Dubois [18] interprets the set-inclusive constraint ã′ᵢx′ ⊆ b̃ᵢ as a fuzzy extension of the crisp equality constraint. Mixed programming, however, interprets the set-inclusive constraint to mean that the region in which â′ᵢx′ can possibly occur is restricted to b̃ᵢ, a region which is tolerable to the decision maker. Therefore, the left side of (58) is possibilistic, and the right side is fuzzy. Negoita [91] defines the fuzzy right-hand side as follows:

b̃ᵢ = {r | r ≥ bᵢ}.   (59)

As a result, we can interpret â′ᵢx′ ⊆ b̃ᵢ as an extension of an inequality constraint. The set-inclusive constraint (58) is reduced to

a⁺ᵢ(α)x′ ≤ b⁺ᵢ(α)
a⁻ᵢ(α)x′ ≥ b⁻ᵢ(α)   (60)

for all α ∈ (0, 1]. If we abide by Negoita's definition of b̃ (59), b⁺ᵢ(α) = ∞ for all values of α, so we can drop the first constraint in (60). Nonetheless, we still have an infinitely (continuum) constrained program, with two constraints for each value of α ∈ (0, 1]. Inuiguchi observes [44] that if the left-hand sides of the membership functions for aᵢ₀, aᵢ₁, …, aᵢₙ, bᵢ are identical for all i, and the right-hand sides of the membership functions for aᵢ₀, aᵢ₁, …, aᵢₙ, bᵢ are identical for all i, the constraints are reduced to the finite set

aᵢ,α ≥ bᵢ,α
aᵢ,γ ≥ bᵢ,γ
aᵢ,β ≤ bᵢ,β   (61)
aᵢ,δ ≤ bᵢ,δ.

As per our usual notation, (γ, δ) is the support of the fuzzy number, and (α, β) is its core. In application, constraint formulation (61) has limited utility because of the
narrowly defined sets of membership functions it admits. For example, if aᵢ₀, aᵢ₁, …, aᵢₙ, bᵢ are defined by trapezoidal fuzzy numbers, they must all have the same spread, and therefore the same slope, on the right-hand side; and they must all have the same spread, and therefore the same slope, on the left-hand side if (61) is to be implemented. Recall that in this kind of mixed programming, the aᵢⱼ's are possibilistic, reflecting a lack of information about their values, and the bᵢ are fuzzy, reflecting the decision maker's degree of satisfaction with their possible values. It is possible that n possibilistic components and one fuzzy component will share identically-shaped distributions, but it is not something likely to happen with great frequency.

Delgado, Verdegay and Vila

Delgado, Verdegay, and Vila [11] propose the following formulation for dealing with ambiguity in the constraint coefficients and right-hand sides, as well as vagueness in the inequality relationship:

max cx
subject to: Âx ≲̃ b̂   (62)
x ≥ 0.

In addition to (62), membership functions μ_aᵢⱼ are defined for the possible values of each possibilistic element of Â, membership functions μ_bᵢ are defined for the possible values of each possibilistic element of b̂, and membership function μᵢ gives the degree to which the fuzzy constraint i is satisfied. Stated another way, μᵢ is the membership function of the fuzzy inequality. The uncertainty in the ãᵢⱼ and the b̃ᵢ is due to ambiguity concerning the actual values of the parameters, while the uncertainty in the relation ≲̃ is due to the decision maker's flexibility regarding the necessity of satisfying the constraints in full. Delgado, Verdegay, and Vila do not define the fuzzy inequality, but leave that specification to the decision maker. Any ranking index that preserves the ranking of fuzzy numbers when multiplied by a positive scalar is allowed. For instance, one could select any of Dubois' four inequalities from Subsect. "Fuzzy Relations". Once the ranking index is selected, the problem is solved parametrically, as in Verdegay's earlier work [130] (see Subsect. "Verdegay"). To illustrate this approach, let us choose Dubois' pessimistic inequality (i), which interprets Ã ≤ B̃ to mean ∀x ∈ A, ∀y ∈ B, x ≤ y. This is equivalent to a⁺ ≤ b⁻. Then (62) becomes

max cx
subject to: Σⱼ a⁺ᵢⱼ(α) xⱼ ≤ b⁻ᵢ(α)   (63)
x ≥ 0.
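Problem (63) is a family of crisp LPs indexed by α. For a one-variable instance with triangular ãᵢ and b̃ᵢ (data of our own choosing), the parametric optimum can be traced directly:

```python
def plus(tri, a):   # right endpoint a^+(alpha) of a triangular number (l, m, r)
    l, m, r = tri
    return r - a * (r - m)

def minus(tri, a):  # left endpoint b^-(alpha)
    l, m, r = tri
    return l + a * (m - l)

a_rows = [(1.0, 2.0, 3.0), (0.5, 1.0, 1.5)]  # a~_i, one per constraint
b_rows = [(4.0, 6.0, 8.0), (2.0, 3.0, 4.0)]  # b~_i

def x_star(alpha):
    """Pessimistic LP (63) with one variable and c > 0: each constraint
    a_i^+(alpha) x <= b_i^-(alpha) binds, so the optimum is the min ratio."""
    return min(minus(b, alpha) / plus(a, alpha)
               for a, b in zip(a_rows, b_rows))

for alpha in (0.0, 0.5, 1.0):
    print(alpha, x_star(alpha))
```

As α increases, the pessimistic coefficients a⁺(α) shrink and the right-hand sides b⁻(α) grow, so the feasible region and the optimal value expand monotonically; the decision maker reads off the trade-off between plausibility and payoff.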
Fuzzy/Robust Programming

The approaches covered so far in this section have the same α-cut define both the level of ambiguity in the coefficients and the level at which the decision-maker's requirements are satisfied. These α, however, mean very different things. The fuzzy α represents the level at which the decision-maker's requirements are satisfied. The possibilistic α, on the other hand, represents the likelihood that the parameters will take on values which will attain that level. The solution is interpreted to mean that for any value α ∈ (0, 1] there is a possibility α of obtaining a solution that satisfies the decision maker to degree α. Using the same α value for both the possibilistic and fuzzy components of the problem is convenient, but does not necessarily provide a meaningful model of reality. A recent model [126] based on Markowitz's mean-variance approach to portfolio optimization (see [89,117]) differentiates between the fuzzy α and the possibilistic α. Markowitz introduced the notion of an efficient combination: one which has the minimum risk for a return greater than or equal to a given level, or one which has the maximum return for a risk less than or equal to a given level. The decision maker can move among these efficient combinations, or along the efficient frontier, according to her/his degree of risk aversion. Similarly, in mixed possibilistic and fuzzy programming, one might wish to allow a trade-off between the potential reward of the outcome and the reliability of the outcome, with the weights of the two competing objectives determined by the decision maker's risk aversion. The desire is to obtain an objective function like the following:

max reward + (risk aversion) × (reliability).   (64)

The reward variable is the α-level associated with the fuzzy constraints and goal(s). It tells the decision maker how satisfactory the solution is. The reliability variable is the α-level associated with the possibilistic parameters.
It tells the decision maker how likely it is that the solution will actually be satisfactory. To avoid confusion, let us refer to the fuzzy constraint membership parameter as α and the possibilistic parameter membership level as β. In addition, let λ ∈ [0, 1] be an indicator of the decision maker's valuation of reward and risk-avoidance, with λ = 0 indicating that the decision maker cares exclusively about the reward, and λ = 1 indicating that only risk avoidance is important. Using this notation, the desired objective is

max (1 − λ)α + λβ.   (65)

Suppose we begin with the mixed problem:

max ĉᵀx
subject to: Âx ≲̃ b̃   (66)
x ≥ 0.

Incorporating fuzziness from soft constraints in the tradition of Zimmermann and incorporating a pessimistic view of possibility results in the following formulation:

max (1 − λ)α + λβ
subject to:
α ≤ g/d₀ + Σⱼ (uⱼ/d₀) xⱼ + Σⱼ ((uⱼ − wⱼ)/d₀) xⱼ β   (67)
α ≤ bᵢ/dᵢ − Σⱼ (vᵢⱼ/dᵢ) xⱼ − Σⱼ ((zᵢⱼ − vᵢⱼ)/dᵢ) xⱼ β
x ≥ 0,  α, β ∈ [0, 1].   (68)

The last terms in each of the constraints contain βx, so the system is non-linear. It can fairly easily be reformulated as an optimization program with a linear objective function and quadratic constraints, but the feasible set is non-convex, so finding a solution to this mathematical programming problem is very difficult.

Possibilistic, Interval, Cloud, and Probabilistic Optimization Utilizing IVPM

This section is taken from [79,80] and begins by defining what is meant by an IVPM (interval-valued probability measure). This generalization of a probability measure includes probability measures, possibility/necessity measures, intervals, and clouds (see [95]), which allows a mixture of uncertainty types within one constraint (in)equality. The previous mixed methods were restricted to a single type of uncertainty for any particular (in)equality and are unable to handle cases in which a mixture of fuzzy and possibilistic parameters occurs in the same constraint (in)equality. The IVPM set function may be thought of as a method for giving a partial representation of an unknown probability measure. Throughout, arithmetic operations involving set functions are in terms of interval arithmetic [90], and the set of all intervals contained in [0, 1] is denoted Int_[0,1] = { [a, b] | 0 ≤ a ≤ b ≤ 1 }. Moreover, S is used to denote the universal set, and a set of subsets of the universal set is denoted by 𝒜. In particular, 𝒜 is a set of subsets on which a structure has been imposed, as will be seen, and a generic set of the structure 𝒜 is denoted by A.

Definition 18 (Weichselberger [135]) Given a measurable space (S, 𝒜), an interval-valued function i_m : 𝒜 → Int_[0,1] is called an R-probability if:
(a) i_m(A) = [i⁻_m(A), i⁺_m(A)] ⊆ [0, 1] with i⁻_m(A) ≤ i⁺_m(A),
(b) ∃ a probability measure Pr on 𝒜 such that ∀A ∈ 𝒜, Pr(A) ∈ i_m(A).

By an R-probability field we mean the triple (S, 𝒜, i_m).

Definition 19 (Weichselberger [135]) Given an R-probability field R = (S, 𝒜, i_m), the set

M(R) = {Pr | Pr is a probability measure on 𝒜 such that ∀A ∈ 𝒜, Pr(A) ∈ i_m(A)}

is called the structure of R.

Definition 20 (Weichselberger [135]) An R-probability field R = (S, 𝒜, i_m) is called an F-probability field if ∀A ∈ 𝒜:
(a) i⁺_m(A) = sup{Pr(A) | Pr ∈ M(R)},
(b) i⁻_m(A) = inf{Pr(A) | Pr ∈ M(R)}.

It is interesting to note that, given a measurable space (S, 𝒜) and a set of probability measures P, defining i⁺_m(A) = sup{Pr(A) | Pr ∈ P} and i⁻_m(A) = inf{Pr(A) | Pr ∈ P} gives an F-probability, and that P is a subset of the structure. The following examples show how intervals, possibility distributions, clouds, and (of course) probability measures can define R-probability fields on B, the Borel sets on the real line.

Example 21 (An interval defines an F-probability field) Let I = [a, b] be a non-empty interval on the real line. On the Borel sets define

i⁺_m(A) = 1 if I ∩ A ≠ ∅, and 0 otherwise,
i⁻_m(A) = 1 if I ⊆ A, and 0 otherwise.

Then i_m(A) = [i⁻_m(A), i⁺_m(A)] defines an F-probability field R = (ℝ, B, i_m). To see this, simply let P be the set of all probability measures on B such that Pr(I) = 1. This example also illustrates that any set A, not just an interval I, can be used to define an F-probability field.
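Example 21 can be transcribed directly; for simplicity the sketch below evaluates i_m only on interval events A:

```python
def i_m(I, A):
    """Interval-valued probability induced by I = [a, b] (Example 21),
    evaluated on an interval event A = [lo, hi]."""
    (a, b), (lo, hi) = I, A
    upper = 1.0 if not (hi < a or lo > b) else 0.0  # 1 iff I and A intersect
    lower = 1.0 if (lo <= a and b <= hi) else 0.0   # 1 iff I is contained in A
    return (lower, upper)

I = (2.0, 4.0)
print(i_m(I, (0.0, 10.0)))  # A contains I: probability pinned to 1
print(i_m(I, (3.0, 5.0)))   # A overlaps I: anything in [0, 1] is possible
print(i_m(I, (6.0, 7.0)))   # A misses I: probability pinned to 0
```

Any probability measure that puts all its mass on I lies inside these bounds, which is exactly the structure of the F-probability field.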
Example 22 (A probability measure is an F-probability field) Let Pr be a probability measure over (S, 𝒜). Define

i_m(A) = [Pr(A), Pr(A)].

This definition of a probability as an IVPM is equivalent to having total knowledge of a probability distribution over S.

The concept of a cloud was introduced by Neumaier in [95] as follows:

Definition 23 A cloud over a set S is a mapping c such that:
1) ∀s ∈ S, c(s) = [n̲(s), p̄(s)] with 0 ≤ n̲(s) ≤ p̄(s) ≤ 1,
2) (0, 1) ⊆ ∪_{s∈S} c(s) ⊆ [0, 1].
In addition, a random variable X taking values in S is said to belong to cloud c (written X ∈ c) iff
3) ∀α ∈ [0, 1], Pr(n̲(X) ≥ α) ≤ 1 − α ≤ Pr(p̄(X) > α).

Property 3) above defines when a random variable belongs to a cloud. That any cloud contains a random variable X is proved in Section 5 of [96]. This is a significant result, as will be seen, since among other things it means that clouds can be used to define IVPMs. Clouds are closely related to possibility theory. It is shown in [51] that possibility distributions can be constructed which satisfy the following consistency definition.

Definition 24 ([51]) Let p : S → [0, 1] be a possibility distribution function with associated possibility measure Pos and necessity measure Nec. Then p is said to be consistent with random variable X if, for all measurable sets A, Nec(A) ≤ Pr(X ∈ A) ≤ Pos(A).

Possibility distributions constructed in a consistent manner are able to bound (unknown) probabilities of interest. The reason this is significant is twofold. First, possibility and necessity distributions are easier to construct, since the axioms they satisfy are more general. Second, the algebra on possibility and necessity pairs is much simpler, since it is a min/max algebra akin to the min/max algebra of interval arithmetic (see, for example, [76]). In particular, it avoids the convolutions which are requisite for probabilistic arithmetic.
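Definition 24's consistency can be checked numerically for a classical consistent pairing: the triangular possibility distribution on [1, 3] with mode 2 and the uniform probability on [1, 3]. The sketch below (ours, not from the article) samples interval events and verifies Nec(A) ≤ Pr(A) ≤ Pos(A):

```python
import random

def mu(s):
    """Triangular possibility distribution on [1, 3] with mode 2."""
    if 1.0 <= s <= 2.0:
        return s - 1.0
    if 2.0 < s <= 3.0:
        return 3.0 - s
    return 0.0

def pos(lo, hi):
    """Pos([lo, hi]) = sup of mu over the interval."""
    if lo <= 2.0 <= hi:
        return 1.0
    return max(mu(lo), mu(hi))  # mu is monotone on either side of the mode

def nec(lo, hi):
    """Nec(A) = 1 - Pos(complement of A)."""
    return 1.0 - max(pos(-10.0, lo), pos(hi, 10.0))

def pr(lo, hi):
    """Uniform probability measure on [1, 3]."""
    return max(0.0, min(hi, 3.0) - max(lo, 1.0)) / 2.0

random.seed(0)
ok = True
for _ in range(1000):
    lo = random.uniform(0.0, 4.0)
    hi = random.uniform(lo, 4.0)
    ok = ok and nec(lo, hi) <= pr(lo, hi) + 1e-9 <= pos(lo, hi) + 2e-9
print(ok)
```

Here [Nec(A), Pos(A)] encloses the true probability on every sampled event, which is precisely what makes the pair usable as an IVPM.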
The concept of a cloud can be stated in terms of certain pairs of consistent possibility distributions, as shown by the following proposition (which means that clouds may be considered as pairs of consistent possibilities, i.e., possibility and necessity pairs; see [51]).

Proposition 25 Let p̄, p̲ be a pair of regular possibility distribution functions over a set S such that ∀s ∈ S, p̄(s) + p̲(s) ≥ 1. Then the mapping c(s) = [n̲(s), p̄(s)], where n̲(s) = 1 − p̲(s) (i.e., the dual necessity distribution function), is a cloud. In addition, if X is a random variable taking values in S and the possibility measures associated with p̄, p̲ are consistent with X, then X belongs to cloud c. Conversely, every cloud defines such a pair of possibility distribution functions, and their associated possibility measures are consistent with every random variable belonging to c.

Proof See [52,78,79].
Example 26 (A cloud defines an R-probability field) Let c be a cloud over the real line. Let Pos₁, Nec₁, Pos₂, Nec₂ be the possibility measures and their dual necessity measures relating to p̄(s) and p̲(s) (where p̄ and p̲ are as in Proposition 25). Define

i_m(A) = [ max{Nec₁(A), Nec₂(A)}, min{Pos₁(A), Pos₂(A)} ].

Neumaier [96] proved that every cloud contains a random variable X. Since consistency requires that Pr(X ∈ A) ∈ i_m(A), the result that every cloud contains a random variable X shows consistency. Thus every cloud defines an R-probability field, because the inf and sup of the probabilities are bounded by the lower and upper bounds of i_m(A).

Example 27 (A possibility distribution defines an R-probability field) Let p : S → [0, 1] be a possibility distribution function, and let Pos be the associated possibility measure and Nec the dual necessity measure. Define i_m(A) = [Nec(A), Pos(A)]. Defining a second possibility distribution p̲(x) = 1 ∀x means that the pair p, p̲ defines a cloud for which i_m(A) defines the R-probability. Since a cloud defines an R-probability field, this possibility distribution in turn generates an R-probability.

Note that the above example means that every pair of possibility p(x) and necessity n(x) such that n(x) ≤ p(x) has an associated F-probability field. This is because such a pair defines a cloud, and in every cloud there exists a probability distribution. The F-probability field can then be constructed from an inf/sup over all such enclosed probabilities that are less than or equal to the bounding necessity/possibility distributions.

The application of these concepts to mixed fuzzy and possibilistic optimization is as follows. Suppose the optimization problem is to maximize f(x⃗, a⃗) subject to g(x⃗, b⃗) = 0 (where a⃗ and b⃗ are parameters). Assume a⃗ and b⃗ are vectors of independent uncertain parameters, each with an associated IVPM.
Assume the constraint may be violated at a cost p⃗ > 0, so that the problem becomes one to maximize

h(x⃗, a⃗, b⃗) = f(x⃗, a⃗) − p⃗ᵀ |g(x⃗, b⃗)|.
Given the independence assumption, form an IVPM i_{a⃗b⃗} for the product space for the joint distribution (see the example below and [52,78,79]) and calculate the interval-valued expected value (see [48,140]) with respect to this IVPM. The interval-valued expected value (there is a lower expected value and an upper expected value) is denoted

∫_ℝ h(x⃗, a⃗, b⃗) di_{a⃗b⃗}.   (69)

To optimize (69) requires an ordering of intervals, a valuation function denoted by v : Int_ℝ → ℝⁿ. One such ordering uses the midpoint of the interval, on the principle that in the absence of additional data the midpoint is the best estimate for the true value; for this (midpoint and width) function, v : Int_ℝ → ℝ². Thus, for I = [a, b], this particular valuation function is v(I) = ((a + b)/2, b − a). Next, a utility function of a vector in ℝⁿ is required, denoted u : ℝⁿ → ℝ. A utility function operating on the midpoint-and-width valuation is u : ℝ² → ℝ, and a particular utility is a weighted sum of the midpoint and width, u(c, d) = αc + βd. Using the valuation and utility functions, the optimization problem is:

max_x u( v( ∫_ℝ h(x, a, b) di_{ab} ) ).   (70)

Thus there are four steps to obtaining a real-valued optimization problem from an IVPM problem. The first step is to form the uncertainty-valued function h(x, a, b). The second step is to obtain the IVPM expected value ∫_ℝ h(x⃗, a⃗, b⃗) di_{a⃗b⃗}. The third step is to obtain the vector value v : Int_ℝ → ℝⁿ. The fourth step is to obtain the utility function value u : ℝⁿ → ℝ.

Example 28 Consider the problem

max f(x, a) = 8x₁ + 7x₂
subject to: g₁(x, b) = 3x₁ + [1, 3]x₂ − 4 = 0
g₂(x, b) = 2̃x₁ + 5x₂ − 1 = 0
x⃗ ∈ [0, 2],

where 2̃ = 1/2/3, that is, 2̃ is a triangular possibilistic number with support [1, 3] and modal value at 2. For p⃗ = (1, 1)ᵀ,

h(x, a, b) = 5x₁ − 2̃x₁ + [3, 5]x₂ − 6,
so that

∫_ℝ h(x, a, b) di_{ab} = 5x₁ + [ ∫₀¹ (α − 3) dα, −∫₀¹ (1 + α) dα ] x₁ + [3, 5]x₂
  = 5x₁ + [−5/2, −3/2] x₁ + [3, 5]x₂ − 5.

Since the constant −5 will not affect the optimization, it will be removed (then added at the end), so that

v( ∫_ℝ h(x, a, b) di_{ab} ) = v( [5/2, 7/2] x₁ + [3, 5] x₂ ) = (3, 1)x₁ + (4, 2)x₂.

Let u(y⃗) = Σᵢ₌₁ⁿ yᵢ, which for the context of this problem yields

max_x z = u( v( ∫_ℝ h(x, a, b) di_{ab} ) ) = max_{x⃗∈[0,2]} (4x₁ + 6x₂) − 5
= 20 − 5 = 15,  at x₁ = 2, x₂ = 2.

Example 29 (see [124]) Consider

max z = 2̂x₁ − 0̄x₂ + [3, 5]x₃
subject to: 4̂x₁ + [1, 5]x₂ − 2x₃ − [0, 2] = 0
6x₁ − 2̄x₂ + 9x₃ − 9 = 0
2x₁ − [1, 4]x₂ − 8̂x₃ + 5̄ = 0
0 ≤ x₁ ≤ 3,  1 ≤ x₂, x₃ ≤ 2.

Here 0̄, 2̄, and 5̄ are probability distributions. Note the mixture of three uncertainty types in the third constraint equation. Using the same approach as in the previous example, the optimal values are:

z = 3.9179,  x₁ = 0,  x₂ = 0.4355,  x₃ = 1.0121.
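Returning to Example 28: after the valuation v (midpoint, width) and the additive utility u(y⃗) = Σyᵢ, the problem collapses to maximizing 4x₁ + 6x₂ over the box [0, 2]², with the constant −5 added back at the end. A brute-force check (grid search in place of the trivial box-constrained LP):

```python
import numpy as np

coeffs = [(2.5, 3.5), (3.0, 5.0)]  # interval coefficients [5/2, 7/2] and [3, 5]

def u_of_v(interval_coeffs, x):
    """Apply v (midpoint, width) to each interval coefficient and then the
    additive utility u: each interval [lo, hi] contributes
    (midpoint + width) * x_j to the collapsed objective."""
    total = 0.0
    for (lo, hi), xj in zip(interval_coeffs, x):
        mid, width = (lo + hi) / 2.0, hi - lo
        total += (mid + width) * xj
    return total

grid = np.linspace(0.0, 2.0, 21)
best = max(u_of_v(coeffs, (x1, x2)) for x1 in grid for x2 in grid)
print(best - 5.0)  # z* = 15 at x1 = x2 = 2, matching the example
```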
In [124] it is shown that these mixed problems arising from linear programs remain linear programs. Thus, the complexity of mixed problems is equivalent to that of linear programming.

Future Directions

The applications of fuzzy optimization seem to be headed toward "industrial strength" problems: each year a greater number of applications appears. Given that greater attention is being paid to the semantics of fuzzy optimization, and as fuzzy optimization becomes more widely used in applications, associated algorithms that are more sophisticated, robust, and efficient will need to be developed to handle these more complex problems. It would be interesting to develop modeling languages like GAMS [6], MODLER [34], or AMPL [30] that support fuzzy data structures. From the theoretical side, the flexibility that fuzzy optimization has in working with uncertain data that is fuzzy, flexible, and/or possibilistic (or a mixture of these via IVPMs) means that fuzzy optimization is able to provide an ample approach to optimization under uncertainty. Further research into the development of more robust methods that use fuzzy Banach spaces would certainly provide a deeper theoretical foundation for fuzzy optimization. Fuzzy optimization methods that utilize fuzzy Banach spaces have the advantage that the problem remains fuzzy throughout, and only when one needs to make a decision or implement the solution does one map the solution to a real number (defuzzify). The methods that map fuzzy optimization problems to their real-number equivalents defuzzify first and then optimize; fuzzy optimization problems solved in fuzzy Banach spaces keep the solution fuzzy and defuzzify as a last step. Continued development of clear input and output semantics for fuzzy optimization will greatly aid its applicability and relevance. When fuzzy optimization is used in, for example, an assembly-line scheduling problem and one's solution is a fuzzy three, how does one convey this solution to the assembly-line manager? Lastly, continued research into handling dependencies in an efficient way would amplify the usefulness and applicability of fuzzy optimization.
Bibliography
1. Asai K, Tanaka H (1973) On the fuzzy mathematical programming. In: Identification and System Estimation: Proceedings of the 3rd IFAC Symposium, The Hague, 12–15 June 1973, pp 1050–1051
2. Aubin J-P, Frankowska H (1990) Set-Valued Analysis. Birkhäuser, Boston
3. Baudrit C, Dubois D, Fargier H (2005) Propagation of uncertainty involving imprecision and randomness. ISIPTA 2005, Pittsburgh, pp 31–40
4. Bector CR, Chandra S (2005) Fuzzy Mathematical Programming and Fuzzy Matrix Games. Springer, Berlin
5. Bellman RE, Zadeh LA (1970) Decision-Making in a Fuzzy Environment. Manag Sci B 17:141–164
6. Brooke A, Kendrick D, Meeraus A (2002) GAMS: A User's Guide. Scientific Press, San Francisco (the latest updates can be obtained from http://www.gams.com/)
7. Buckley JJ (1988) Possibility and necessity in optimization. Fuzzy Sets Syst 25(1):1–13
8. Buckley JJ (1988) Possibilistic linear programming with triangular fuzzy numbers. Fuzzy Sets Syst 26(1):135–138
9. Buckley JJ (1989) Solving possibilistic linear programming problems. Fuzzy Sets Syst 31(3):329–341
10. Buckley JJ (1989) A generalized extension principle. Fuzzy Sets Syst 33:241–242
11. Delgado M, Verdegay JL, Vila MA (1989) A general model for fuzzy linear programming. Fuzzy Sets Syst 29:21–29
12. Delgado M, Kacprzyk J, Verdegay J-L, Vila MA (eds) (1994) Fuzzy Optimization: Recent Advances. Physica, Heidelberg
13. Dempster AP (1967) Upper and lower probabilities induced by multivalued mapping. Ann Math Stat 38:325–339
14. Dempster MAH (1969) Distributions in interval and linear programming. In: Hansen ER (ed) Topics in Interval Analysis. Oxford Press, Oxford, pp 107–127
15. Diamond P (1991) Congruence classes of fuzzy sets form a Banach space. J Math Anal Appl 162:144–151
16. Diamond P, Kloeden P (1994) Metric Spaces of Fuzzy Sets. World Scientific, Singapore
17. Diamond P, Kloeden P (1994) Robust Kuhn–Tucker conditions and optimization under imprecision. In: Delgado M, Kacprzyk J, Verdegay J-L, Vila MA (eds) Fuzzy Optimization: Recent Advances. Physica, Heidelberg, pp 61–66
18. Dubois D (1987) Linear programming with fuzzy data. In: Bezdek JC (ed) Analysis of Fuzzy Information, vol III: Applications in Engineering and Science. CRC Press, Boca Raton, pp 241–263
19. Dubois D, Prade H (1980) Fuzzy Sets and Systems: Theory and Applications. Academic Press, New York
20. Dubois D, Prade H (1980) Systems of linear fuzzy constraints. Fuzzy Sets Syst 3:37–48
21. Dubois D, Prade H (1981) Additions of interactive fuzzy numbers. IEEE Trans Autom Control 26(4):926–936
22. Dubois D, Prade H (1983) Ranking fuzzy numbers in the setting of possibility theory. Inf Sci 30:183–224
23. Dubois D, Prade H (1986) New results about properties and semantics of fuzzy set-theoretical operators. In: Wang PP, Chang SK (eds) Fuzzy Sets. Plenum Press, New York, pp 59–75
24. Dubois D, Prade H (1987) Fuzzy numbers: An overview. Tech. Rep. no 219 (LSI, Univ. Paul Sabatier, Toulouse, France). In: Bezdek JC (ed) Analysis of Fuzzy Information, vol 1: Mathematics and Logic, chap 1. CRC Press, Boca Raton, pp 3–39
25. Dubois D, Prade H (1988) Possibility Theory: An Approach to Computerized Processing of Uncertainty. Plenum Press, New York
26. Dubois D, Prade H (2005) Fuzzy elements in a fuzzy set. Proceedings of the 11th International Fuzzy System Association World Congress, IFSA 2005, Beijing, July 2005, pp 55–60
27. Dubois D, Moral S, Prade H (1997) Semantics for possibility theory based on likelihoods. J Math Anal Appl 205:359–380
28. Ertuğrul I, Tuş A (2007) Interactive fuzzy linear programming and an application sample at a textile firm. Fuzzy Optim Decis Making 6:29–49
29. Fortin J, Dubois D, Fargier H (2008) Gradual numbers and their application to fuzzy interval analysis. IEEE Trans Fuzzy Syst 16(2):388–402
30. Fourer R, Gay DM, Kernighan BW (1993) AMPL: A Modeling Language for Mathematical Programming. Scientific Press, San Francisco, CA (the latest updates can be obtained from http://www.ampl.com/)
31. Fullér R, Keresztfalvi T (1990) On generalization of Nguyen's theorem. Fuzzy Sets Syst 41:371–374
32. Ghassan K (1982) New utilization of fuzzy optimization method. In: Gupta MM, Sanchez E (eds) Fuzzy Information and Decision Processes. North-Holland, Netherlands, pp 239–246
33. Gladish B, Parra M, Terol A, Uría M (2005) Management of surgical waiting lists through a possibilistic linear multiobjective programming problem. Appl Math Comput 167:477–495
34. Greenberg H (1992) MODLER: Modeling by Object-Driven Linear Elemental Relations. Ann Oper Res 38:239–280
35. Hanss M (2005) Applied Fuzzy Arithmetic. Springer, Berlin
36. Hsu H, Wang W (2001) Possibilistic programming in production planning of assemble-to-order environments. Fuzzy Sets Syst 119:59–70
37. Inuiguchi M (1992) Stochastic Programming Problems Versus Fuzzy Mathematical Programming Problems. Jpn J Fuzzy Theory Syst 4(1):97–109
38. Inuiguchi M (1997) Fuzzy linear programming: what, why and how? Tatra Mt Math Publ 13:123–167
39. Inuiguchi M (2007) Necessity measure optimization in linear programming problems with fuzzy polytopes. Fuzzy Sets Syst 158:1882–1891
40. Inuiguchi M (2007) On possibility/fuzzy optimization. In: Melin P, Castillo O, Aguilar LT, Kacprzyk J, Pedrycz W (eds) Foundations of Fuzzy Logic and Soft Computing: 12th International Fuzzy System Association World Congress, IFSA 2007, Cancun, June 2007, Proceedings. Springer, Berlin, pp 351–360
41. Inuiguchi M, Ramik J (2000) Possibilistic linear programming: A brief review of fuzzy mathematical programming and a comparison with stochastic programming in portfolio selection problem. Fuzzy Sets Syst 111:97–110
42. Inuiguchi M, Sakawa M (1997) An achievement rate approach to linear programming problems with an interval objective function. J Oper Res Soc 48:25–33
43. Inuiguchi M, Tanino T (2004) Fuzzy linear programming with interactive uncertain parameters.
Reliab Comput 10(5):512– 527 Inuiguchi M, Ichihashi H, Tanaka H (1990) Fuzzy programming: A survey of recent developments. In: Slowinski R, Teghem J (eds) Stochastic versus Fuzzy Approaches to Multiobjective Mathematical Programming Under Uncertainty. Kluwer, Netherlands, pp 45–68 Inuiguchi M, Ichihashi H, Kume Y (1992) Relationships Between Modality Constrained Programming Problems and Various Fuzzy Mathematical Programming Problems. Fuzzy Sets Syst 49:243–259 Inuiguchi M, Ichihashi H, Tanaka H (1992) Fuzzy Programming: A Survey of Recent Developments. In: Slowinski R, Teghem J (eds) Stochastic versus Fuzzy Approaches to Multiobjective Mathematical Programming under Uncertainty.Springer, Berlin, pp 45–68 Inuiguchi M, Sakawa M, Kume Y (1994) The usefulness of possibilistic programming in production planning problems. Int J Prod Econ 33:49–52 Jamison KD (1998) Modeling Uncertainty Using Probabilistic Based Possibility Theory with Applications to Optimization. Ph D Thesis, University of Colorado Denver, Department
Fuzzy Optimization
49.
50. 51.
52. 53.
54. 55.
56.
57.
58.
59. 60. 61.
62. 63. 64. 65. 66.
67. 68. 69. 70.
of Mathematical Sciences. http://www-math.cudenver.edu/ graduate/thesis/jamison.pdf Jamison KD (2000) Possibilities as cumulative subjective probabilities and a norm on the space of congruence classes of fuzzy numbers motivated by an expected utility functional. Fuzzy Sets Syst 111:331–339 Jamison KD, Lodwick WA (2001) Fuzzy linear programming using penalty method. Fuzzy Sets Syst 119:97–110 Jamison KD, Lodwick WA (2002) The construction of consistent possibility and necessity measures. Fuzzy Sets Syst 132(1):1–10 Jamison KD, Lodwick WA (2004) Interval-valued probability measures. UCD/CCM Report No. 213, March 2004 Joubert JW, Luhandjula MK, Ncube O, le Roux G, de Wet F (2007) An optimization model for the management of South African game ranch. Agric Syst 92:223–239 Kacprzyk J, Orlovski SA (eds) (1987) Optimization Models Using Fuzzy Sets and Possibility Theory. D Reidel, Dordrecht Kacprzyk J, Orlovski SA (1987) Fuzzy optimization and mathematical programming: A brief introduction and survey. In: Kacprzyk J, Orlovski SA (eds) Optimization Models Using Fuzzy Sets and Possibility Theory. D Reidel, Dordrecht, pp 50–72 Kasperski A, Zeilinski P (2007) Using gradual numbers for solving fuzzy-valued combinatorial optimization problems. In: Melin P, Castillo O, Aguilar LT, Kacprzyk J, Pedrycz W (eds) Foundations of Fuzzy Logic and Soft Computing: 12th International Fuzzy System Association World Congress, IFSA 2007, Cancun, June 2007, Proceedings. Springer, Berlin, pp 656–665 Kaufmann A, Gupta MM (1985) Introduction to Fuzzy Arithmetic – Theory and Applications. Van Nostrand Reinhold, New York Kaymak U, Sousa JM (2003) Weighting of Constraints in Fuzzy Optimization. Constraints 8:61–78 (also in the 2001 Proceedings of IEEE Fuzzy Systems Conference) Klir GJ, Yuan B (1995) Fuzzy Sets and Fuzzy Logic. Prentice Hall, Upper Saddle River Lai Y, Hwang C (1992) Fuzzy Mathematical Programming. 
Springer, Berlin Lin TY (2005) A function theoretic view of fuzzy sets: New extension principle. In: Filev D, Ying H (eds) Proceedings of NAFIPS05 Liu B (1999) Uncertainty Programming. Wiley, New York Liu B (2000) Dependent-chance programming in fuzzy environments. Fuzzy Sets Syst 109:97–106 Liu B (2001) Fuzzy random chance-constrained programming. IEEE Trans Fuzzy Syst 9:713–720 Liu B (2002) Theory and Practice of Uncertainty Programming. Physica, Heidelberg Liu B, Iwamura K (2001) Fuzzy programming with fuzzy decisions and fuzzy simulation-based genetic algorithm. Fuzzy Sets Syst 122:253–262 Lodwick WA (1990) Analysis of structure in fuzzy linear programs. Fuzzy Sets Syst 38:15–26 Lodwick WA (1990) A generalized convex stochastic dominance algorithm. IMA J Math Appl Bus Ind 2:225–246 Lodwick WA (1999) Constrained Interval Arithmetic. CCM Report 138, Feb. 1999. CCM, Denver Lodwick WA (ed) (2004) Special Issue on Linkages Between Interval Analysis and Fuzzy Set Theory. Reliable Comput 10
71. Lodwick WA (2007) Interval and fuzzy analysis: A unified approach. In: Advances in Imaging and Electronic Physics, vol 148. Elsevier, San Diego, pp 75–192 72. Lodwick WA, Bachman KA (2005) Solving large-scale fuzzy and possibilistic optimization problems: Theory, algorithms and applications. Fuzzy Optim Decis Making 4(4):257–278. (also UCD/CCM Report No. 216, June 2004) 73. Lodwick WA, Inuiguchi M (eds) (2007) Special Issue on Optimization Under Fuzzy and Possibilistic Uncertainty. Fuzzy Sets Syst 158:17; 1 Sept 2007 74. Lodwick WA, Jamison KD (1997) A computational method for fuzzy optimization. In: Bilal A, Madan G (eds) Uncertainty Analysis in Engineering and Sciences: Fuzzy Logic, Statistics, and Neural Network Approach, chap 19. Kluwer, Norwell 75. Lodwick WA, Jamison KD (eds) (2003) Special Issue: Interfaces Fuzzy Set Theory Interval Anal 135:1; April 1, 2003 76. Lodwick WA, Jamison KD (2003) Estimating and Validating the Cumulative Distribution of a Function of Random Variables: Toward the Development of Distribution Arithmetic. Reliab Comput 9:127–141 77. Lodwick WA, Jamison KD (2005) Theory and semantics for fuzzy and possibilistic optimization, Proceedings of the 11th International Fuzzy System Association World Congress. IFSA 2005, Beijing, July 2005 78. Lodwick WA, Jamison KD (2006) Interval-valued probability in the analysis of problems that contain a mixture of fuzzy, possibilistic and interval uncertainty. In: Demirli K, Akgunduz A (eds) 2006 Conference of the North American Fuzzy Information Processing Society, 3–6 June 2006, Montréal, Canada, paper 327137 79. Lodwick WA, Jamison KD (2008) Interval-Valued Probability in the Analysis of Problems Containing a Mixture of Fuzzy, Possibilistic, Probabilistic and Interval Uncertainty. Fuzzy Sets and Systems 2008 80. 
Lodwick WA, Jamison KD (2007) The use of interval-valued probability measures in optimization under uncertainty for problems containing a mixture of fuzzy, possibilistic, and interval uncertainty. In: Melin P, Castillo O, Aguilar LT, Kacprzyk J, Pedrycz W (eds) Foundations of Fuzzy Logic and Soft Computing: 12 th International Fuzzy System Association World Congress, IFSA 2007, Cancun, Mexico, June 2007, Proceedings. Springer, Berlin, pp 361–370 81. Lodwick WA, Jamison KD (2007) Theoretical and semantic distinctions of fuzzy, possibilistic, and mixed fuzzy/possibilistic optimization. Fuzzy Sets Syst 158(17):1861–1872 82. Lodwick WA, McCourt S, Newman F, Humpheries S (1999) Optimization Methods for Radiation Therapy Plans. In: Borgers C, Natterer F (eds) IMA Series in Applied Mathematics – Computational Radiology and Imaging: Therapy and Diagnosis. Springer, New York, pp 229–250 83. Lodwick WA, Neumaier A, Newman F (2001) Optimization under uncertainty: methods and applications in radiation therapy. Proc 10th IEEE Int Conf Fuzzy Syst 2001 3:1219–1222 84. Luhandjula MK (1986) On possibilistic linear programming. Fuzzy Sets Syst 18:15–30 85. Luhandjula MK (1989) Fuzzy optimization: an appraisal. Fuzzy Sets Syst 30:257–282 86. Luhandjula MK (2004) Optimisation under hybrid uncertainty. Fuzzy Sets Syst 146:187–203
1237
1238
Fuzzy Optimization
87. Luhandjula MK (2006) Fuzzy stochastic linear programming: Survey and future research directions. Eur J Oper Res 174:1353–1367 88. Luhandjula MK, Ichihashi H, Inuiguchi M (1992) Fuzzy and semi-infinite mathematical programming. Inf Sci 61:233–250 89. Markowitz H (1952) Portfolio selection. J Finance 7:77–91 90. Moore RE (1979) Methods and Applications of Interval Analysis. SIAM, Philadelphia 91. Negoita CV (1981) The current interest in fuzzy optimization. Fuzzy Sets Syst 6:261–269 92. Negoita CV, Ralescu DA (1975) Applications of Fuzzy Sets to Systems Analysis. Birkhäuser, Boston 93. Negoita CV, Sularia M (1976) On fuzzy mathematical programming and tolerances in planning. Econ Comput Cybern Stud Res 3(31):3–14 94. Neumaier A (2003) Fuzzy modeling in terms of surprise. Fuzzy Sets Syst 135(1):21–38 95. Neumaier A (2004) Clouds, fuzzy sets and probability intervals. Reliab Comput 10:249–272. Springer 96. Neumaier A (2005) Structure of clouds. (submitted – downloadable http://www.mat.univie.ac.at/~neum/papers.html) 97. Ogryczak W, Ruszczynski A (1999) From stochastic dominance to mean-risk models: Semideviations as risk measures. Eur J Oper Res 116:33–50 98. Nguyen HT (1978) A note on the extension principle for fuzzy sets. J Math Anal Appl 64:369–380 99. Ralescu D (1977) Inexact solutions for large-scale control problems. In: Proceedings of the 1st International Congress on Mathematics at the Service of Man, Barcelona 100. Ramik J (1986) Extension principle in fuzzy optimization. Fuzzy Sets Syst 19:29–35 101. Ramik J, Rimanek J (1985) Inequality relation between fuzzy numbers and its use in fuzzy optimization. Fuzzy Sets Syst 16:123–138 102. Ramik J, Vlach M (2002) Fuzzy mathematical programming: A unified approach based on fuzzy relations. Fuzzy Optim Decis Making 1:335–346 103. Ramik J, Vlach M (2002) Generalized Concavity in Fuzzy Optimization and Decision Analysis. Kluwer, Boston 104. 
Riverol C, Pilipovik MV (2007) Optimization of the pyrolysis of ethane using fuzzy programming. Chem Eng J 133:133–137 105. Riverol C, Pilipovik MV, Carosi C (2007) Assessing the water requirements in refineries using possibilistic programming. Chem Eng Process 45:533–537 106. Rommelfanger HJ (1994) Some problems of fuzzy optimization with T-norm based extended addition. In: Delgado M, Kacprzyk J, Verdegay J-L, Vila MA (eds) Fuzzy Optimization: Recent Advances. Physica, Heidelberg, pp 158–168 107. Rommelfanger HJ (1996) Fuzzy linear programming and applications. Eur J Oper Res 92:512–527 108. Rommelfanger HJ (2004) The advantages of fuzzy optimization models in practical use. Fuzzy Optim Decis Making 3:293–309 109. Rommelfanger HJ, Slowinski R (1998) Fuzzy linear programming with single or multiple objective functions. In: Slowinski R (ed) Fuzzy Sets in Decision Analysis, Operations Research and Statistics. The Handbooks of Fuzzy Sets. Kluwer, Netherlands, pp 179–213 110. Roubens M (1990) Inequality constraints between fuzzy numbers and their use in mathematical programming. In: Slowinski R, Teghem J (eds) Stochastic versus Fuzzy Approaches
111. 112. 113.
114. 115.
116. 117. 118. 119.
120. 121.
122.
123.
124.
125.
126.
127.
128.
129.
to Multiobjective Mathematical Programming Under Uncertainty. Kluwer, Netherlands, pp 321–330 Russell B (1924) Vagueness. Aust J Philos 1:84–92 Sahinidis N (2004) Optimization under uncertainty: State-ofthe-art and opportunities. Comput Chem Eng 28:971–983 Saito S, Ishii H (1998) Existence criteria for fuzzy optimization problems. In: Takahashi W, Tanaka T (eds) Proceedings of the International Conference on Nonlinear Analysis and Convex Analysis, Niigata, Japan, 28–31 July 1998. World Scientific Press, Singapore, pp 321–325 Shafer G (1976) Mathematical A Theory of Evidence. Princeton University Press, Princeton Shafer G (1987) Belief functions and possibility measures. In: James Bezdek C (ed) Analysis of Fuzzy Information. Mathematics and Logic, vol 1. CRC Press, Boca Raton, pp 51–84 Sousa JM, Kaymak U (2002) Fuzzy Decision Making in Modeling and Control. World Scientific Press, Singapore Steinbach M (2001) Markowitz revisited: Mean-variance models in financial portfolio analysis. SIAM Rev 43(1):31–85 Tanaka H, Asai K (1984) Fuzzy linear programming problems with fuzzy numbers. Fuzzy Sets Syst 13:1–10 Tanaka H, Okuda T, Asai K (1973) Fuzzy mathematical programming. Trans Soc Instrum Control Eng 9(5):607–613; (in Japanese) Tanaka H, Okuda T, Asai K (1974) On fuzzy mathematical programming. J Cybern 3(4):37–46 Tanaka H, Ichihasi H, Asai K (1984) A formulation of fuzzy linear programming problems based on comparison of fuzzy numbers. Control Cybern 13(3):185–194 Tanaka H, Ichihasi H, Asai K (1985) Fuzzy decisions in linear programming with trapezoidal fuzzy parameters. In: Kacprzyk J, Yager R (eds) Management Decision Support Systems Using Fuzzy Sets and Possibility Theory. Springer, Heidelberg, pp 146–159 Tang J, Wang D, Fung R (2001) Formulation of general possibilistic linear programming problems for complex industrial systems. 
Fuzzy Sets Syst 119:41–48 Thipwiwatpotjani P (2007) An algorithm for solving optimization problems with interval-valued probability measures. CCM Report, No. 259 (December 2007). CCM, Denver Untiedt E (2006) Fuzzy and possibilistic programming techniques in the radiation therapy problem: An implementationbased analysis. Masters Thesis, University of Colorado Denver, Department of Mathematical Sciences (July 5, 2006) Untiedt E (2007) A robust model for mixed fuzzy and possibilistic programming. Project report for Math 7593: Advanced Linear Programming. Spring, Denver Untiedt E (2007) Using gradual numbers to analyze nonmonotonic functions of fuzzy intervals. CCM Report No 258 (December 2007). CCM, Denver Untiedt E, Lodwick WA (2007) On selecting an algorithm for fuzzy optimization. In: Melin P, Castillo O, Aguilar LT, Kacprzyk J, Pedrycz W (eds) Foundations of Fuzzy Logic and Soft Computing: 12th International Fuzzy System Association World Congress, IFSA 2007, Cancun, Mexico, June 2007, Proceedings. Springer, Berlin, pp 371–380 Vasant PM, Barsoum NN, Bhattacharya A (2007) Possibilistic optimization in planning decision of construction industry. Int J Prod Econ (to appear)
Fuzzy Optimization
130. Verdegay JL (1982) Fuzzy mathematical programming. In: Gupta MM, Sanchez E (eds) Fuzzy Information and Decision Processes. North-Holland, Amsterdam, pp 231–237 131. Vila MA, Delgado M, Verdegay JL (1989) A general model for fuzzy linear programming. Fuzzy Sets Syst 30:21–29 132. Wang R, Liang T (2005) Applying possibilistic linear programming to aggregate production planning. Int J Prod Econ 98:328–341 133. Wang S, Zhu S (2002) On fuzzy portfolio selection problems. Fuzzy Optim Decis Making 1:361–377 134. Wang Z, Klir GJ (1992) Fuzzy Measure Theory. Plenum Press, New York 135. Weichselberger K (2000) The theory of interval-probability as a unifying concept for uncertainty. Int J Approx Reason 24:149–170 136. Werners B (1988) Aggregation models in mathematical programming. In: Mitra G (ed) Mathematical Models for Decision Support, NATO ASI Series vol F48. Springer, Berlin 137. Werners B (1995) Fuzzy linear programming – Algorithms and applications. Addendum to the Proceedings of ISUMANAPIPS’95, College Park, Maryland, 17–20 Sept 1995, pp A7A12 138. Whitmore GA, Findlay MC (1978) Stochastic Dominance: An Approach to DecisionDecision-Making Under Risk. Lexington Books, Lexington 139. Yager RR (1980) On choosing between fuzzy subsets. Kybernetes 9:151–154
140. Yager RR (1981) A procedure for ordering fuzzy subsets of the unit interval. Inf Sci 24:143–161 141. Yager RR (1986) A characterization of the extension principle. Fuzzy Sets Syst 18:205–217 142. Zadeh LA (1965) Fuzzy Sets. Inf Control 8:338–353 143. Zadeh LA (1968) Probability measures of fuzzy events. J Math Anal Appl 23:421–427 144. Zadeh LA (1975) The concept of a linguistic variable and its application to approximate reasoning. Inf Sci, Part I: 8:199– 249; Part II: 8:301–357; Part III: 9:43–80 145. Zadeh LA (1978) Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets Syst 1:3–28 146. Zeleny M (1994) Fuzziness, knowledge and optimization: New optimality concepts. In: Delgado M, Kacprzyk J, Verdegay J-L, M Vila A (eds) Fuzzy Optimization: Recent Advances. Physica, Heidelberg, pp 3–20 147. Zimmermann HJ (1974) Optimization in fuzzy environment. Paper presented at the International XXI TIMS and 46th ORSA Conference, Puerto Rico Meeting, San Juan, Porto Rico, Oct 1974 148. Zimmermann HJ (1976) Description and optimization of fuzzy systems. Int J General Syst 2(4):209–216 149. Zimmermann HJ (1978) Fuzzy programming and linear programming with several objective functions. Fuzzy Sets Syst 1:45–55 150. Zimmermann HJ (1983) Fuzzy mathematical programming. Comput Oper Res 10:291–298
1239
1240
Fuzzy Probability Theory
MICHAEL BEER
National University of Singapore, Kent Ridge, Singapore

Article Outline

Glossary
Definition of the Subject
Introduction
Mathematical Environment
Fuzzy Random Quantities
Fuzzy Probability
Representation of Fuzzy Random Quantities
Future Directions
Bibliography
Fuzzy Probability Theory, Figure 1: Normalized fuzzy set with α-level sets and support
Glossary

Fuzzy set and fuzzy vector  Let X represent a universal set and x be the elements of X. Then

  Ã = {(x, μ_A(x)) | x ∈ X} ,  μ_A(x) ≥ 0 ∀ x ∈ X   (1)

is referred to as a fuzzy set Ã on X. μ_A(x) is the membership function (characteristic function) of the fuzzy set Ã and represents the degree to which the elements x belong to Ã. If

  sup_{x ∈ X} [μ_A(x)] = 1 ,   (2)

the membership function and the fuzzy set Ã are called normalized; see Fig. 1. In the case of a limitation to the Euclidean space X = Rⁿ and normalized fuzzy sets, the fuzzy set Ã is also referred to as a fuzzy vector, denoted by x̃ with its membership function μ(x), or, in the one-dimensional case, as a fuzzy variable x̃ with μ(x).

α-Level set and support  The crisp sets

  A_{α_k} = {x ∈ X | μ_A(x) ≥ α_k}   (3)

extracted from the fuzzy set Ã for real numbers α_k ∈ (0, 1] are called α-level sets. These comply with the inclusion property

  A_{α_k} ⊆ A_{α_i}  ∀ α_i, α_k ∈ (0, 1] with α_i ≤ α_k .   (4)

The largest α-level set A_{α_k → +0} is called the support S(Ã); see Fig. 1.

σ-Algebra  A family M(X) of sets A_i on the universal set X is referred to as a σ-algebra S(X) on X if

  X ∈ S(X) ,   (5)

  A_i ∈ S(X) ⇒ A_i^C ∈ S(X) ,   (6)

and if, for every sequence of sets A_i,

  A_i ∈ S(X), i = 1, 2, …  ⇒  ⋃_{i=1}^{∞} A_i ∈ S(X) .   (7)
In this definition, A_i^C is the complementary set of A_i with respect to X, the family M(X) of sets A_i refers to subsets and systems of subsets of the power set P(X) on X, and the power set P(X) is the set of all subsets A_i of X.

Definition of the Subject

Fuzzy probability theory is an extension of probability theory for dealing with mixed probabilistic/non-probabilistic uncertainty. It provides a theoretical basis for modeling uncertainty that is only partly characterized by randomness and defies a purely probabilistic modeling due to a lack of trustworthiness or precision of the data or a lack of pertinent information. The fuzzy probabilistic model sits between the probabilistic model and non-probabilistic uncertainty models. The significance of fuzzy probability theory lies in treating the elements of a population not as crisp quantities but as set-valued quantities or granules in an uncertain fashion, which largely complies with reality in most everyday situations. Probabilistic and non-probabilistic uncertainty can thus be transferred adequately and separately to the results of a subsequent analysis. This enables best-case and worst-case estimates in terms of probability, taking account of variations within the inherent non-probabilistic uncertainty.

The development of fuzzy probability theory was initiated by H. Kwakernaak with the introduction of fuzzy random variables in [47] in 1978. Subsequent developments have been reported in different directions and from different perspectives, including differences in terminology. The usefulness of the theory has been underlined with various applications beyond mathematics and information science, in particular in engineering. The application fields are not limited and may be extended increasingly, for example, to medicine, biology, psychology, economics, financial sciences, social sciences, and even to law.

Introduction

Probably the most popular example of fuzzy probability is the evaluation of a survey on the subjective assessment of temperature. A group of test persons are asked, under equal conditions, to give a statement on the current temperature as realistically as possible. If the scatter of the statements is considered as random, the mean value of the statements provides a reasonable statistical point estimate for the actual temperature. The statements are, however, naturally given in an uncertain form. The test persons express their perception in a form such as "about 25°C", "possibly 27°C", "between 24°C and 26°C", or they even only come up with linguistic assessments such as warm, very warm, or pleasant. This uncertainty is non-probabilistic but has to be taken into account in the statistical evaluation. It is transferred to the estimated mean value, which is no longer obtained as a crisp number but as a value range or a set of values corresponding to the possibilities within the range of uncertainty of the statements. If the uncertain statements are initially quantified as fuzzy values, the mean value is obtained as a fuzzy value, too; and the probability of certain events is also computed as a fuzzy quantity, referred to as fuzzy probability.

This example is a typical materialization of the following general real-world situation. The numerical representation of a physical quantity with the aid of crisp numbers x ∈ R, or sets thereof, is frequently affected by uncertainty regarding the trustworthiness of measured, or otherwise specified, values.
The perceptions of physical quantities may appear, for example, as imprecise, diffuse, vague, dubious, or ambiguous. Underlying reasons for this phenomenon include the limited precision of any measurement (digital or analog), indirect measurements via auxiliary quantities in conjunction with a – more or less trustworthy – model to eventually determine the value wanted, measurements under weakly specified or arbitrarily changing boundary conditions, and the specification of values by experts in a linguistic manner. The type of the associated uncertainty is non-frequentative and improper for a subjective probabilistic modeling; hence, it is non-probabilistic. This uncertainty is unavoidable and may always be made evident by a respective choice of scale.
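The fuzzy mean value from the temperature-survey example above can be made concrete with a small sketch. The triangular quantifications of the statements are assumptions for illustration, not taken from the article: each statement is modeled as a triangular fuzzy number, its α-cuts are intervals, and averaging the α-cut intervals endpoint-wise (interval arithmetic) yields the fuzzy mean.

```python
# Sketch: fuzzy mean of imprecise survey statements via alpha-cuts.
# The triangular fuzzy numbers (l, m, r) are hypothetical quantifications
# of statements such as "about 25 C" or "between 24 C and 26 C".

def alpha_cut(tri, alpha):
    """Interval [lo, hi] of a triangular fuzzy number (l, m, r) at level alpha."""
    l, m, r = tri
    return (l + alpha * (m - l), r - alpha * (r - m))

def fuzzy_mean(samples, alphas=(0.0, 0.5, 1.0)):
    """Average the alpha-cut intervals endpoint-wise (interval arithmetic)."""
    n = len(samples)
    cuts = {}
    for a in alphas:
        los, his = zip(*(alpha_cut(t, a) for t in samples))
        cuts[a] = (sum(los) / n, sum(his) / n)
    return cuts

statements = [(24.0, 25.0, 26.0),   # "about 25 C"
              (25.0, 27.0, 29.0),   # "possibly 27 C"
              (24.0, 25.0, 26.0)]   # "between 24 C and 26 C"
mean_cuts = fuzzy_mean(statements)
# At alpha = 1 the fuzzy mean collapses to the crisp average of the modal
# values; at alpha = 0 it spans the average of the supports.
```

The nested intervals over the α-levels are exactly the "value range or set of values" described above: the statistical point estimate inherits the non-probabilistic imprecision of the statements.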
If a set of uncertain perceptions of a physical quantity is present in the form of a random sample, then the overall uncertainty possesses a mixed probabilistic/non-probabilistic character. Whilst the scatter of the realizations of the physical quantity possesses a probabilistic character (frequentative or subjective), each particular realization from the population may, additionally, exhibit non-probabilistic uncertainty. Consequently, a realistic modeling in those cases must involve both probabilistic and non-probabilistic uncertainty. This modeling, without distorting or ignoring information, is the mission of fuzzy probability theory. A purely probabilistic modeling would introduce unwarranted information in the form of a distribution function that cannot be justified and would thus diminish the trustworthiness of the probabilistic results.

Mathematical Environment

Fuzzy probability is part of the framework of generalized information theory [38] and is covered by the umbrella of granular computing [50,62]. It represents a special case of imprecise probabilities [15,78] with ties to concepts of random sets [52]. This is underlined by Walley's summary of the semantics of imprecise probabilities with the term indeterminacy, which arises from ignorance about facts, events, or dependencies. Within the class of mathematical models covered by the term imprecise probabilities, see [15,38,78], fuzzy probability theory has a relationship to concepts known as upper and lower probabilities [28], sets of probability measures [24], distribution envelopes [7], interval probabilities [81], and the p-box approach [23]. Also, similarities exist with respect to evidence theory (or Dempster–Shafer theory) [20,70] as a theory of infinitely monotone Choquet capacities [39,61]. The relationship to the latter is associated with the interpretation of the measures plausibility and belief, with the special cases of possibility and necessity, as upper and lower probabilities, respectively [8].
Fuzzy probability shares the common feature of all imprecise probability models: the uncertainty of an event is characterized by a set of possible measure values in terms of probability, or by bounds on probability. Its distinctive feature is that set-valued information, and hence the probability of associated events, is described with the aid of uncertain sets according to fuzzy set theory [83,86]. This represents a marriage between fuzzy methods and probabilistics, with fuzziness and randomness as special cases, which justifies the denotation fuzzy randomness. Fuzzy probability theory enables the consideration of a fuzzy set of possible probabilistic models over the range of imprecision of the knowledge about the underlying randomness. The associated fuzzy probabilities provide weighted bounds on probability, the weights of which are obtained as the membership values of the fuzzy sets.

Based on α-discretization [86] and the representation of fuzzy sets as sets of α-level sets, the relationship of fuzzy probability theory to various concepts of imprecise probabilities becomes obvious. For each α-level a common interval probability, crisp bounds on probability, or a classical set of probability measures, respectively, is obtained. The α-level sets of fuzzy events can be treated as random sets. Further, a relationship of these random sets to evidence theory can be constructed if a respective basic probability assignment is selected; see [21]. Consistency with evidence theory is obtained if the focal sets are specified as fuzzy elementary events and if the basic probability assignment follows a discrete uniform distribution over the fuzzy elementary events.

The model fuzzy randomness, with its two components of fuzzy methods and probabilistics, can utilize a distinction between aleatory and epistemic uncertainty with respect to the sources of uncertainty [29]. This is particularly important in view of practical applications. Irreducible uncertainty as a property of the system, associated with fluctuations/variability, may be summarized as aleatory uncertainty and described probabilistically; reducible uncertainty as a property of the analyst, or of the analyst's perception, associated with a lack of knowledge/precision, may be understood as epistemic uncertainty and described with fuzzy sets. The model fuzzy randomness then combines, without mixing, both components in the form of a fuzzy set of possible probabilistic models over some particular range of imprecision. This distinction is retained throughout any subsequent analysis and reflected in the results.
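Such weighted bounds can be illustrated with a minimal sketch. The model is an assumption for illustration (not an example from the article): a normal distribution whose mean is known only imprecisely as a triangular fuzzy number. For each α-level, the probability of an event is bracketed by evaluating it over the α-cut interval of the imprecise parameter.

```python
from statistics import NormalDist

# Sketch: fuzzy probability of the event X <= 20 for X ~ N(mu, 2), where
# the mean mu is only imprecisely known as the triangular fuzzy number
# (18, 19, 21). Each alpha-level yields crisp bounds on the probability;
# the stack of these intervals is the fuzzy probability.

def alpha_cut(tri, alpha):
    l, m, r = tri
    return (l + alpha * (m - l), r - alpha * (r - m))

def fuzzy_prob_le(threshold, mu_tri, sigma, alphas=(0.0, 0.5, 1.0)):
    result = {}
    for a in alphas:
        lo, hi = alpha_cut(mu_tri, a)
        # P(X <= t) decreases monotonically in mu, so the bounds are
        # attained at the endpoints of the alpha-cut of mu.
        p_hi = NormalDist(lo, sigma).cdf(threshold)
        p_lo = NormalDist(hi, sigma).cdf(threshold)
        result[a] = (p_lo, p_hi)
    return result

p = fuzzy_prob_le(20.0, (18.0, 19.0, 21.0), 2.0)
# At alpha = 1 the interval collapses to the single probability for mu = 19;
# at lower alpha-levels the bounds widen with the admitted imprecision.
```

In general the bounds require an optimization over the α-cut of the imprecise parameters; the endpoint evaluation used here is valid only because the probability is monotone in the single parameter mu.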
The development of fuzzy probability was initiated with the introduction of fuzzy random variables by Kwakernaak [47,48] in 1978/79. Subsequent milestones were set by Puri and Ralescu [63], Kruse and Meyer [46], Wang and Zhang [80], and Krätschmer [43]. The developments show differences in terminology, concepts, and in the associated consideration of measurability; and the investigations are ongoing [11,12,14,17,35,36,37,45,73]. Krätschmer [43] showed that the different concepts can be unified to a certain extent. Generally, it can be noted that α-discretization is utilized as a helpful instrument. An overview with specific comments on the different developments is provided in [43] and [57]. Investigations were pursued on independent and dependent fuzzy random variables, for which parameters were defined with particular focus on variance and covariance [22,32,40,41,60]. Fuzzy random processes
were examined to reveal properties of limit theorems and martingales associated with fuzzy randomness [44,65]; see also [71,79] and, for a survey, [49]. Particular interest was devoted to the strong law of large numbers [10,33,34]. Further, the differentiation and the integration of fuzzy random quantities were investigated in [51,64]. Driven by the motivation for the establishment of fuzzy probability theory, considerable effort was made in the modeling and statistical evaluation of imprecise data. Fundamental achievements were reported by Kruse and Meyer [46], Bandemer and Näther [2,5], and by Viertl [75]. Classical statistical methods were extended in order to take account of statistical fluctuations/variability and imprecision simultaneously, and the specific features associated with the imprecision of the data were investigated. Ongoing research is reported, for example, in [58,74] on evaluating measurements, in [51,66] for decision making, and in [16,42,59] for regression analysis. Methods for evaluating imprecise data with the aid of generalized histograms are discussed in [9,77]. Also, the application of resampling methods is pursued; bootstrap concepts are utilized for statistical estimation [31] and hypothesis testing [26] based on imprecise data. Another method for hypothesis testing is proposed in [27], which employs fuzzy parameters in order to describe a fuzzy transition between rejection and acceptance. Bayesian methods have also been extended by the inclusion of fuzzy variables to take account of imprecise data; see [75] for basic considerations. A contribution to Bayesian statistics with imprecise prior distributions is presented in [76]. This leads to imprecise posterior distributions and imprecise predictive distributions, and may be used to deduce imprecise confidence intervals. A combination of the Bayesian theorem with kriging based on imprecise data is described in [3].
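The basic effect of data imprecision on a classical statistic can be sketched with assumed interval-valued observations: the sample mean itself becomes an interval, and, because the mean is monotone in each observation, its exact bounds are obtained endpoint-wise.

```python
# Sketch (assumed data): descriptive statistics of imprecise observations.
# Each measurement is only known to lie within an interval [lo, hi]; the
# sample mean is then an interval as well.

data = [(2.0, 2.4), (1.8, 2.1), (2.2, 2.9), (2.0, 2.3)]  # one interval per obs

n = len(data)
# The mean is monotonically increasing in every observation, so evaluating
# it at the lower and upper endpoints gives exact bounds.
mean_lo = sum(lo for lo, _ in data) / n
mean_hi = sum(hi for _, hi in data) / n
```

For statistics that are not monotone in the observations (the sample variance, for instance), the exact bounds cannot be obtained endpoint-wise and require an optimization over the intervals; this is one of the specific features of imprecise-data statistics mentioned above.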
A Bayesian test of fuzzy hypotheses is discussed in [72], while in [67] the application of a fuzzy Bayesian method for decision making is presented. In view of practical applications, probability distribution functions are defined for fuzzy random quantities [54,75,77,85], despite some drawbacks [6]. These distribution functions can easily be formulated and used for further calculations, but they do not uniquely describe a fuzzy random quantity. This theoretical lack, however, generally has no effect in practical applications, so that stochastic simulations may be performed according to the distribution functions. Alternative simulation methods were proposed based on parametric [13] and non-parametric [6,55] descriptions of imprecision. The approach according to [55] enables a direct generation of fuzzy realizations based on a new concept for an incremental representation of fuzzy random quantities. This method is designed to
Fuzzy Probability Theory
simulate and predict fuzzy time series; it circumvents the problems of artificial uncertainty growth or bias of non-probabilistic uncertainty, which frequently accompany numerical simulations. This variety of theoretical developments provides reasonable margins for the formulation of fuzzy probability theory but does not allow the definition of a unique concept. Choices have to be made within the elaborated margins depending on the envisaged application and environment. For the subsequent sections these choices are made in view of a broad spectrum of possible applications, for example, in civil/mechanical engineering [54]. These choices concern the following three main issues. First, measurability has to be ensured according to a sound concept. According to [43], the concepts of measurable bounding functions [47,48], of measurable mappings of α-level sets [63], and of measurable fuzzy-valued mappings [17,36] are available; or the unifying concept proposed in [43] itself, which utilizes a special topology on the space of fuzzy realizations, may be selected. From a practical point of view the concept of measurable bounding functions is most reasonable due to its analogy to traditional probability theory. On this basis, a fuzzy random quantity can be regarded as a fuzzy set of traditional, crisp, random quantities, each one carrying a certain membership degree. Each of these crisp random quantities is then measurable in the traditional fashion, and their membership degrees can be transferred to the respective measure values. The set of the obtained measure values including their membership degrees then represents a fuzzy probability. Second, a concept for the integration of a fuzzy-valued function has to be selected from the different available approaches [86]. This is associated with the computation of the probability of a fuzzy event.
An evaluation in a mean sense weighted by the membership function of the fuzzy event is suggested in [84], which leads to a scalar value for the probability. This evaluation is associated with the interpretation that an event may occur partially. An approach for calculating the probability of a fuzzy event as a fuzzy set is proposed in [82]; the resulting fuzzy probability then represents a set of measure values with associated membership degrees. This complies with the interpretation that the occurrence of an event is binary, but it cannot be clearly indicated whether the event has occurred or not. The imprecision is associated with the observation rather than with the event. The latter approach corresponds to the practical situation in many cases, provides useful information in the form of the imprecision reflected in the measure values, and follows the selected concept of measurability. It is thus taken as a basis for further consideration.
Third, the meaning of the distance between fuzzy sets as realizations of a fuzzy random quantity must be defined, which is of particular importance for the definition of the variance and of further parameters of fuzzy random quantities. An approach that follows a set-theoretical point of view and leads to a crisp distance measure is presented in [40]. It is proposed to apply the Hausdorff metric to the α-level sets of a fuzzy quantity and to average the results over the membership scale. Consequently, parameters of a fuzzy random quantity which are associated with a distance between fuzzy realizations reflect the variability within the imprecision merely in an integrated form. For example, the variance of a fuzzy random variable is obtained as a crisp value. In contrast to this, the application of standard algorithms for operations on fuzzy sets, such as the extension principle [4,39,86], leads to fuzzy distances between fuzzy sets. Parameters, including variances, of fuzzy random quantities are then obtained as fuzzy sets of possible parameter values. This corresponds to the interpretation of fuzzy random quantities as fuzzy sets of traditional, crisp, random quantities. The latter approach is thus pursued further. These three selections basically comply with the definitions in [46] and [47]; see also [57]. Among all the possible choices, this set-up ensures the most plausible settlement of fuzzy probability theory within the framework of imprecise probabilities [15,38,78] with ties to evidence theory and random set approaches [21,30]. Fuzzy probability is obtained in the form of weighted plausible bounds on probability. Moreover, in view of practical applications, the treatment of fuzzy random quantities as fuzzy sets of traditional random quantities enables the utilization of established probabilistic methods as kernel solutions in the environment of a fuzzy analysis.
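The α-cut route to fuzzy distances described above can be illustrated with a short sketch (illustrative Python, not from the source; the function names and the restriction to triangular fuzzy numbers are assumptions made for brevity). For triangular fuzzy numbers, the extension principle applied to d(x, y) = |x − y| reduces to interval arithmetic on each α-level:

```python
def alpha_cut(tri, alpha):
    """α-cut [left, right] of a triangular fuzzy number (a, b, c)."""
    a, b, c = tri
    return (a + alpha * (b - a), c - alpha * (c - b))

def fuzzy_abs_distance(tri_x, tri_y, alphas=(0.0, 0.5, 1.0)):
    """Extension-principle distance |x - y| as α-cut intervals (sketch)."""
    cuts = {}
    for alpha in alphas:
        x1, x2 = alpha_cut(tri_x, alpha)
        y1, y2 = alpha_cut(tri_y, alpha)
        lo, hi = x1 - y2, x2 - y1          # range of x - y over the α-cuts
        if lo <= 0.0 <= hi:                # 0 attainable -> minimum distance 0
            cuts[alpha] = (0.0, max(-lo, hi))
        else:
            cuts[alpha] = (min(abs(lo), abs(hi)), max(abs(lo), abs(hi)))
    return cuts
```

The α-cuts are nested, so the result is again a fuzzy set of distance values; averaging the Hausdorff distances of the α-level sets over the membership scale instead would collapse it to the single crisp value of the approach in [40].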
For example, sophisticated methods of Monte Carlo simulation [25,68,69] may be combined with a generally applicable fuzzy analysis based on a global optimization approach using α-discretization [53]. If certain restricting conditions are satisfied, numerically efficient methods from interval mathematics [1] may be employed for the α-level mappings instead of a global optimization approach; see [56]. The selected concept, eventually, enables best-case and worst-case studies within the range of possible probabilistic models.

Fuzzy Random Quantities

With the above selections, the definitions from traditional probability theory can be adopted and extended to deal with imprecise outcomes from a random experiment. Let Ω be the space of random elementary events ω ∈ Ω
Fuzzy Probability Theory, Figure 2 Fuzzy random variable
and the universe on which the realizations are observed be the n-dimensional Euclidean space X = Rⁿ. Then, a membership scale μ is introduced perpendicular to the hyperplane Ω × X. This enables the specification of fuzzy sets on X for given elementary events ω from Ω without interaction between Ω and μ. That is, randomness induced by Ω and fuzziness described by μ – only in x-direction – are not mixed with one another. Let F(X) be the set of all fuzzy quantities on X = Rⁿ; that is, F(X) denotes the collection of all fuzzy sets Ã on X = Rⁿ, with Ã according to Eq. (1). Then, the imprecise result of the mapping

X̃ : Ω → F(X)   (8)

is referred to as fuzzy random quantity X̃. In contrast to real-valued random quantities, a fuzzy realization x̃(ω) ∈ F(X), or x̃(ω) ⊆ X, is now assigned to each elementary event ω ∈ Ω; see Fig. 2. These fuzzy realizations may be understood as a numerical representation of granules. Generally, a fuzzy random quantity can be discrete or continuous with respect to both fuzziness and randomness. The further consideration refers to the continuous case, from which the discrete case may be derived. Without a limitation in generality, the fuzzy realizations may be restricted to normalized fuzzy quantities, thus representing fuzzy vectors. Further restrictions can be defined in view of a convenient numerical treatment if the application field allows for it. This concerns, for example, a restriction to connected and compact α-level sets A_α = x_α of the fuzzy realizations, a restriction to convex fuzzy sets as fuzzy realizations (a fuzzy set Ã is convex if all its α-level sets A_α are convex sets), or a restriction to fuzzy realizations with only one element x_i carrying the membership μ(x_i) = 1 as in [47]. For the treatment of the fuzzy realizations, a respective algorithm for operations on fuzzy sets has to be selected. Following the literature and the above selection, the standard basis is employed with the min-operator as a special case of a t-norm and the associated max-operator as a special case of a t-conorm [18,19,86]. This leads to the min-max operator and the extension principle [4,39,86]. With the interpretation of a fuzzy random quantity as a fuzzy set of real-valued random quantities, according to the above selection, the following relationship to traditional probability theory is obtained. Let x_ji be a realization of a real-valued random quantity X_j and x̃_i be a fuzzy realization of a fuzzy random quantity X̃, with x_ji and x̃_i assigned to the same elementary event ω_i. If x_ji ∈ x̃_i, then x_ji is called contained in x̃_i. If, for all elementary events ω_i ∈ Ω, i = 1, 2, ..., the x_ji are contained in the x̃_i, the set of the x_ji, i = 1, 2, ..., then constitutes an original X_j of the fuzzy random quantity X̃; see Fig. 2. The original X_j is referred to as completely contained in X̃, X_j ∈ X̃. Each real-valued random quantity X that is completely contained in X̃ is an original X_j of X̃ and carries the membership degree

μ(X_j) = max{α | x_ji ∈ x_iα ∀i} .   (9)
That is, in the Ω-direction, each original X_j must be consistent with the fuzziness of X̃. Consequently, the fuzzy random quantity X̃ can be represented as the fuzzy set of all originals X_j contained in X̃,

X̃ = {(X_j, μ(X_j)) | x_ji ∈ x̃_i ∀i} .   (10)

Each fuzzy random quantity X̃ contains at least one real-valued random quantity X as an original X_j of X̃. Each fuzzy random quantity X̃ that possesses precisely one original is thus a real-valued random quantity X. That is, real-valued random quantities are a special case of fuzzy random quantities. This enables a simultaneous treatment of real-valued random quantities and fuzzy random quantities within the same theoretical environment and with the same numerical algorithms. Or, vice versa, it enables the utilization of theoretical results and established numerical algorithms from traditional probability theory within the framework of fuzzy probability theory. If α-discretization is applied to the fuzzy random quantity X̃, random α-level sets X_α are obtained,

X_α = {X = X_j | μ(X_j) ≥ α} .   (11)

Their realizations are α-level sets x_iα of the respective fuzzy realizations x̃_i of the fuzzy random quantity X̃. A fuzzy random quantity can thus, alternatively, be represented by the set of its α-level sets,

X̃ = {(X_α, μ(X_α)) | μ(X_α) = α ∀α ∈ (0, 1]} .   (12)

In the one-dimensional case and with the restriction to connected and compact α-level sets of the realizations, the random α-level sets X_α of the fuzzy random variable X̃ become closed random intervals [X_αl, X_αr].

Fuzzy Probability

Fuzzy probability is derived as a fuzzy set of probability measures for events the occurrence of which depends on the behavior of a fuzzy random quantity. These events are referred to as fuzzy random events with the following characteristics. Let X̃ be a fuzzy random quantity according to Eq. (8) with the realizations x̃ and S(X) be a σ-algebra of sets A_i defined on X. Then, the event

Ẽ_i : X̃ hits A_i   (13)

is referred to as fuzzy random event, which occurs if a fuzzy realization x̃ of the fuzzy random quantity X̃ hits the set A_i. The associated probability of occurrence of Ẽ_i is referred to as fuzzy probability P̃(A_i). It is obtained as the fuzzy set of the probabilities of occurrence of the events

E_ij : X_j ∈ A_i   (14)

associated with all originals X_j of the fuzzy random quantity X̃ with their membership values μ(X_j). Specifically,

P̃(A_i) = {(P(X_j ∈ A_i), μ(P(X_j ∈ A_i))) | X_j ∈ X̃, μ(P(X_j ∈ A_i)) = μ(X_j) ∀j} .   (15)

Each of the involved probabilities P(X_j ∈ A_i) is a traditional, real-valued probability associated with the traditional probability space [X, S, P] and complying with all established theorems and properties of traditional probability. For a formal closure of fuzzy probability theory, the membership scale μ is incorporated in the probability space to constitute the fuzzy probability space denoted by [X, S, P, μ] or [X, S, P̃]. The evaluation of the fuzzy random event Eq. (13) hinges on the question whether a fuzzy realization x̃_k hits the set A_i or not. Due to the fuzziness of the x̃, these events appear as fuzzy events Ẽ_ik : x̃_k hits A_i with the following three options for occurrence; see Fig. 3: The fuzzy realization x̃_k lies completely inside the set A_i, the event Ẽ_ik has occurred. The fuzzy realization x̃_k lies only partially inside A_i, the event Ẽ_ik may have occurred or not occurred. The fuzzy realization x̃_k lies completely outside the set A_i, the event Ẽ_ik has not occurred.

Fuzzy Probability Theory, Figure 3 Fuzzy event x̃_k hits A_i in the one-dimensional case

The fuzzy probability P̃(A_i) takes account of all three options within the range of fuzziness. The fuzzy random quantity X̃ is discretized into a set of random α-level sets X_α according to Eq. (11), and the events Ẽ_i and Ẽ_ik, respectively, are evaluated α-level by α-level. In this evaluation, the event E_ikα : x_kα hits A_i admits the following two "extreme" interpretations of occurrence: E_ikαl : "x_kα is contained in A_i : x_kα ⊆ A_i", and E_ikαr : "x_kα and A_i possess at least one element in common: x_kα ∩ A_i ≠ ∅". Consequently, the events E_ikαl are associated with the smallest probability

P_αl(A_i) = P(X_α ⊆ A_i) ,   (16)
Fuzzy Probability Theory, Figure 4 Events for determining P˛l (Ai ) and P˛r (Ai ) in the one-dimensional case
and the events E_ikαr correspond to the largest probability

P_αr(A_i) = P(X_α ∩ A_i ≠ ∅) .   (17)

The probabilities P_αl(A_i) and P_αr(A_i) are bounds on the probability P̃(A_i) on the respective α-level associated with the random α-level set X_α of the fuzzy random quantity X̃; see Fig. 4. As all elements of X_α are originals X_j of X̃, the probability that an X_j ∈ X_α hits A_i is bounded according to

P_αl(A_i) ≤ P(X_j ∈ A_i) ≤ P_αr(A_i)   ∀X_j ∈ X_α .   (18)

This enables a computation of the probability bounds directly from the real-valued probabilities P(X_j ∈ A_i) associated with the originals X_j,

P_αl(A_i) = min_{X_j ∈ X_α} P(X_j ∈ A_i) ,   (19)

P_αr(A_i) = max_{X_j ∈ X_α} P(X_j ∈ A_i) .   (20)

If the fuzzy random quantity X̃ represents a fuzzy set of continuous real-valued random quantities and if the membership functions of all fuzzy realizations x̃_i of X̃ are at least segmentally continuous, then the probability bounds P_αl(A_i) and P_αr(A_i) determine closed connected intervals [P_αl(A_i), P_αr(A_i)]. In this case, the fuzzy probability P̃(A_i) is obtained as a continuous and convex fuzzy set, which may be specified uniquely with the aid of α-discretization,

P̃(A_i) = {(P_α(A_i), μ(P_α(A_i))) | P_α(A_i) = [P_αl(A_i), P_αr(A_i)], μ(P_α(A_i)) = α ∀α ∈ (0, 1]} .   (21)
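Equations (19)–(21) translate directly into a small numerical sketch (illustrative Python, not part of the source; representing each original by a precomputed pair (P(X_j ∈ A_i), μ(X_j)) is an assumption made for brevity):

```python
def fuzzy_probability(originals, alphas=(0.5, 1.0)):
    """α-cuts of the fuzzy probability of an event, in the spirit of
    Eqs. (19)-(21).

    originals: list of (p_j, mu_j) pairs, where p_j = P(X_j in A_i) is the
    crisp probability of an original X_j and mu_j its membership degree.
    """
    cuts = {}
    for alpha in alphas:
        # the α-level set retains all originals with membership >= α, Eq. (11)
        ps = [p for p, mu in originals if mu >= alpha]
        # [P_al, P_ar] as min/max over the α-level set, Eqs. (19)-(20)
        cuts[alpha] = (min(ps), max(ps))
    return cuts
```

The returned dictionary of nested intervals is exactly the α-discretized fuzzy probability of Eq. (21); in practice the crisp probabilities p_j would come from an analytical model or a Monte Carlo kernel.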
The properties of the fuzzy probability P̃(A_i) result from the properties of the traditional probability measure in conjunction with fuzzy set theory. For example, a complementary relationship may be derived for P̃(A_i) as follows. The equivalence

(X_α ⊆ A_i) ⇔ (X_α ∩ A_iᶜ = ∅) ,   (22)

with A_iᶜ being the complementary set of A_i with respect to the universe X, leads to

P(X_α ⊆ A_i) = P(X_α ∩ A_iᶜ = ∅)   (23)

for each α-level. If the event X_α ∩ A_iᶜ = ∅ is expressed in terms of its complementary event X_α ∩ A_iᶜ ≠ ∅, Eq. (23) can be rewritten as

P(X_α ⊆ A_i) = 1 − P(X_α ∩ A_iᶜ ≠ ∅) .   (24)

This leads to the relationships

P_αl(A_i) = 1 − P_αr(A_iᶜ) ,   (25)

P_αr(A_i) = 1 − P_αl(A_iᶜ) ,   (26)

and

P̃(A_i) = 1 − P̃(A_iᶜ) .   (27)
In the special case that the set A_i contains only one element, A_i = x_i, the fuzzy probability P̃(A_i) changes to P̃(x_i). The event X_α ∩ A_i ≠ ∅ is then replaced by x_i ∈ X_α, and X_α ⊆ A_i becomes X_α = x_i. The probability P_αl(x_i) = P(X_α = x_i) may take values greater than zero only if a realization of X_α exists that possesses exactly one element X_α = t with t = x_i and if this element t represents a realization of a discrete original of X_α. Otherwise, P_αl(x_i) = 0, and the fuzziness of P̃(x_i) is exclusively specified by P_αr(x_i). In the one-dimensional case with

A_i = {x | x ∈ X, x_1 ≤ x ≤ x_2}   (28)

the fuzzy probability P̃(A_i) can be represented in a simplified manner. If the random α-level sets X_α are restricted to be closed random intervals [X_αl, X_αr], the associated fuzzy random variable X̃ can be completely described by means of the bounding real-valued random quantities X_αl and X_αr; see Fig. 4. X_αl and X_αr represent specific originals X_j of X̃. This enables the specification of the probability bounds for each α-level according to

P_αl(A_i) = max[0, P(X_αr ≤ x_2) − P(X_αl < x_1)] ,   (29)

and

P_αr(A_i) = P(X_αl ≤ x_2) − P(X_αr < x_1) .   (30)
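For the one-dimensional interval case, Eqs. (29) and (30) can be estimated from paired samples of the bounding variables X_αl and X_αr (a hedged empirical sketch in Python; the sampling-based estimator and its names are assumptions, not part of the source):

```python
def p_alpha_bounds(samples_l, samples_r, x1, x2):
    """Empirical probability bounds for A_i = [x1, x2], Eqs. (29)-(30).

    samples_l[i] <= samples_r[i] are paired draws of the bounding
    real-valued random variables X_al and X_ar of one α-level.
    """
    n = len(samples_l)
    cdf_leq = lambda s, x: sum(1 for v in s if v <= x) / n  # P(S <= x)
    cdf_lt = lambda s, x: sum(1 for v in s if v < x) / n    # P(S < x)
    # Eq. (29): smallest probability, interval realization inside [x1, x2]
    p_al = max(0.0, cdf_leq(samples_r, x2) - cdf_lt(samples_l, x1))
    # Eq. (30): largest probability, interval realization overlaps [x1, x2]
    p_ar = cdf_leq(samples_l, x2) - cdf_lt(samples_r, x1)
    return p_al, p_ar
```

For interval realizations [0,1], [1,2], [2,3], [3,4] and A_i = [1,3], the two middle intervals lie entirely inside A_i while all four touch it, so the sketch yields the bounds 0.5 and 1.0.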
From the above framework, the special case of real-valued random quantities X may be reobtained as a fuzzy random quantity X̃ that contains precisely one original X_j = X_1,

X_1 = (X_{j=1}, μ(X_{j=1}) = 1) ,   (31)

see Eq. (10). Then, all X_α contain only the sole original X_1, and both X_α ∩ A_i ≠ ∅ and X_α ⊆ A_i reduce to X_1 ∈ A_i. That is, the event X_1 hits A_i no longer provides options for interpretation reflected as fuzziness,

P_αl(A_i) = P_αr(A_i) = P(A_i) = P(X_1 ∈ A_i) .   (32)

In the above special one-dimensional case, Eqs. (29) and (30), the setting X = X_αl = X_αr leads to

P_αl(A_i) = P_αr(A_i) = P(x_1 ≤ X_1 ≤ x_2) .   (33)

Further properties and computation rules for fuzzy probabilities may be derived from traditional probability theory in conjunction with fuzzy set theory.

Representation of Fuzzy Random Quantities

The fuzzy probability P̃(A_i) may be computed for each arbitrary set A_i ∈ S(X). If – as a special family of sets S(X) – the Borel σ-algebra S0(Rⁿ) of the Rⁿ is selected, the concept of the probability distribution function may be applied to fuzzy random quantities. That is, the system S0(Rⁿ) of the open sets

A_i0 = {t = (t_1, ..., t_k, ..., t_n) | x = x_i, x, t ∈ Rⁿ, t_k < x_k, k = 1, ..., n}   (34)

on X = Rⁿ is considered; S0(Rⁿ) is a Boolean set algebra. The concept of fuzzy probability according to Sect. "Fuzzy Probability" applied to the sets from Eq. (34) leads to fuzzy probability distribution functions; see Fig. 5. The fuzzy probability distribution function F̃(x) of the fuzzy random quantity X̃ on X = Rⁿ is obtained as the set of the fuzzy probabilities P̃(A_i0) with A_i0 according to Eq. (34) for all x_i ∈ X,

F̃(x) = {P̃(A_i0) ∀x ∈ X} .   (35)
It is a fuzzy function. Bounds for the functional values F̃(x) are specified for each α-level depending on x = x_i in Eq. (34) and in compliance with Eqs. (19) and (20),

F_αl(x = (x_1, ..., x_n)) = 1 − max_{X_j ∈ X_α} P(X_j = t = (t_1, ..., t_n) | x, t ∈ X = Rⁿ, ∃ t_k ≥ x_k, 1 ≤ k ≤ n) ,   (36)

Fuzzy Probability Theory, Figure 5 Fuzzy probability density function f̃(x) and fuzzy probability distribution function F̃(x) of a continuous fuzzy random variable X̃
F_αr(x = (x_1, ..., x_n)) = max_{X_j ∈ X_α} P(X_j = t = (t_1, ..., t_n) | x, t ∈ X = Rⁿ, t_k < x_k, k = 1, ..., n) .   (37)

For the determination of F_αl(x) the relationship in Eq. (25) is used. If F_αl(x) and F_αr(x) form closed connected intervals [F_αl(x), F_αr(x)] – see Sect. "Fuzzy Probability" for the conditions – the functional values F̃(x) are determined based on Eq. (21),

F̃(x) = {(F_α(x), μ(F_α(x))) | F_α(x) = [F_αl(x), F_αr(x)], μ(F_α(x)) = α ∀α ∈ (0, 1]} .   (38)

In this case, the functional values of the fuzzy probability distribution function F̃(x) are continuous and convex fuzzy sets. In correspondence with Eq. (15), the fuzzy probability distribution function F̃(x) of X̃ represents the fuzzy set of the probability distribution functions F_j(x) of all originals X_j of X̃ with the membership values μ(F_j(x)),

F̃(x) = {(F_j(x), μ(F_j(x))) | X_j ∈ X̃, μ(F_j(x)) = μ(X_j) ∀j} .   (39)

Each original X_j determines precisely one trajectory F_j(x) within the bunch F̃(x) of weighted functions F_j(x) ∈ F̃(x). In the one-dimensional case and with the restriction to closed random intervals [X_αl, X_αr] for each α-level, the fuzzy probability distribution function F̃(x) is determined by

F_αl(x) = P(X_αr < x) ,   (40)

F_αr(x) = P(X_αl < x) .   (41)
In correspondence with traditional probability theory, fuzzy probability density functions f̃(t) or f̃(x) are defined in association with the F̃(x); see Fig. 5. The f̃(t) or f̃(x) are fuzzy functions which – in the continuous case with respect to randomness – are integrable for each original X_j of X̃ and satisfy the relationship

F_j(x) = ∫_{t_1 = −∞}^{t_1 = x_1} ... ∫_{t_k = −∞}^{t_k = x_k} ... ∫_{t_n = −∞}^{t_n = x_n} f_j(t) dt ,   (42)
with t = (t_1, ..., t_n) ∈ X. For each original X_j the integration of the associated trajectory f_j(x) ∈ f̃(x) leads to the respective trajectory F_j(x) ∈ F̃(x). For the description of a fuzzy random quantity X̃, parameters in the form of fuzzy quantities p̃_t(X̃) may be used. These fuzzy parameters may represent any type of parameters known from real-valued random quantities, such as moments, weighting factors for different distribution types in a compound distribution, or functional parameters of the distribution functions. The fuzzy parameter p̃_t(X̃) of the fuzzy random quantity X̃ is the fuzzy set of the parameter values p_t(X_j) of all originals X_j with the membership values μ(p_t(X_j)),

p̃_t(X̃) = {(p_t(X_j), μ(p_t(X_j))) | X_j ∈ X̃, μ(p_t(X_j)) = μ(X_j) ∀j} .   (43)

For each α-level, bounds are given for the fuzzy parameter p̃_t(X̃) by

p_t,αl(X̃) = min_{X_j ∈ X_α} p_t(X_j) ,   (44)

p_t,αr(X̃) = max_{X_j ∈ X_α} p_t(X_j) .   (45)

If the fuzzy random quantity X̃ represents a fuzzy set of continuous real-valued random quantities, if all fuzzy realizations x̃_i of X̃ are connected sets, and if the parameter p_t is defined on a continuous scale, then the fuzzy parameter p̃_t(X̃) is determined by its α-level sets

p_t,α(X̃) = [p_t,αl(X̃), p_t,αr(X̃)] ,   (46)

p̃_t(X̃) = {(p_t,α(X̃), μ(p_t,α(X̃))) | μ(p_t,α(X̃)) = α ∀α ∈ (0, 1]} ,   (47)

and represents a continuous and convex fuzzy set. If a fuzzy random quantity X̃ is described by more than one fuzzy parameter p̃_t(X̃), interactive dependencies are generally present between the different fuzzy parameters. If this interaction is neglected, a fuzzy random quantity X̃_hull is obtained, which covers the actual fuzzy random quantity X̃ completely. That is, for all realizations of X̃_hull and X̃ the following holds,

x̃_i,hull ⊇ x̃_i   ∀i .   (48)
Fuzzy parameters and fuzzy probability distribution functions do not enable a unique reproduction of fuzzy realizations based on the above description, but they are sufficient to compute fuzzy probabilities correctly for any events defined according to Eq. (13). The presented concept of fuzzy probability can be extended to fuzzy random functions and processes.

Future Directions

Fuzzy probability theory provides a powerful key to solving a broad variety of practical problems that defy an appropriate treatment with traditional probabilistics due
to imprecision of the information for model specification. Fuzzy probabilities reflect aleatory uncertainty and epistemic uncertainty of the underlying problem simultaneously and separately and provide extended information and decision aids. These features can be utilized in all application fields of traditional probability theory and beyond. Respective developments can be observed, primarily, in information science and, increasingly, in engineering. Potential for further extensive fruitful applications exists, for example, in psychology, economics, the financial sciences, medicine, biology, the social sciences, and even in law. In all cases, fuzzy probability theory is not considered a replacement for traditional probabilistics but a beneficial supplement for an appropriate model specification according to the available information in each particular case. The focus of further developments is seen on both theory and applications. Future theoretical developments may pursue a measure-theoretic clarification of the embedding of fuzzy probability theory in the framework of imprecise probabilities under the umbrella of generalized information theory. This is associated with the ambition to unify the variety of available fuzzy probabilistic concepts and to eventually formulate a consistent generalized fuzzy probability theory. Another important issue for future research is the mathematical description and treatment of dependencies within the fuzziness of fuzzy random quantities, such as non-probabilistic interaction between fuzzy realizations, between fuzzy parameters, and between fuzzy probabilities of certain events. In parallel to theoretical modeling, further effort is worthwhile towards a consistent concept for the statistical evaluation of imprecise data, including the analysis of probabilistic and non-probabilistic dependencies of the data. In view of applications, the further development of fuzzy probabilistic simulation methods is of central importance.
This concerns both theory and numerical algorithms for the direct generation of fuzzy random quantities – in a parametric and in a non-parametric fashion. Representations and computational procedures for fuzzy random quantities must be developed with a focus on high numerical efficiency to enable the solution of real-world problems. For a spread into practice it is further essential to elaborate options and potentials for the interpretation and evaluation of fuzzy probabilistic results such as fuzzy mean values or fuzzy failure probabilities. The most promising potentials for utilization are seen in worst-case investigations in terms of probability, in sensitivity analyses with respect to non-probabilistic uncertainty, and in decision making based on mixed probabilistic/non-probabilistic information.
In summary, fuzzy probability theory and its further developments significantly contribute to an improved uncertainty modeling according to reality. Bibliography Primary Literature 1. Alefeld G, Herzberger J (1983) Introduction to interval computations. Academic Press, New York 2. Bandemer H (1992) Modelling uncertain data. AkademieVerlag, Berlin 3. Bandemer H, Gebhardt A (2000) Bayesian fuzzy kriging. Fuzzy Sets Syst 112:405–418 4. Bandemer H, Gottwald S (1995) Fuzzy sets, fuzzy logic fuzzy methods with applications. Wiley, Chichester 5. Bandemer H, Näther W (1992) Fuzzy data analysis. Kluwer, Dordrecht 6. Beer M (2007) Model-free sampling. Struct Saf 29:49–65 7. Berleant D, Zhang J (2004) Representation and problem solving with distribution envelope determination (denv). Reliab Eng Syst Saf 85(1–3):153–168 8. Bernardini A, Tonon F (2004) Aggregation of evidence from random and fuzzy sets. Special Issue of ZAMM. Z Angew Math Mech 84(10–11):700–709 9. Bodjanova S (2000) A generalized histogram. Fuzzy Sets Syst 116:155–166 10. Colubi A, Domínguez-Menchero JS, López-Díaz M, Gil MA (1999) A generalized strong law of large numbers. Probab Theor Relat Fields 14:401–417 11. Colubi A, Domínguez-Menchero JS, López-Díaz M, Ralescu DA (2001) On the formalization of fuzzy random variables. Inf Sci 133:3–6 12. Colubi A, Domínguez-Menchero JS, López-Díaz M, Ralescu DA (2002) A de[0, 1]-representation of random upper semicontinuous functions. Proc Am Math Soc 130:3237–3242 13. Colubi A, Fernández-García C, Gil MA (2002) Simulation of random fuzzy variables: an empirical approach to statistical/probabilistic studies with fuzzy experimental data. IEEE Trans Fuzzy Syst 10:384–390 14. Couso I, Sanchez L (2008) Higher order models for fuzzy random variables. Fuzzy Sets Syst 159:237–258 15. de Cooman G (2002) The society for imprecise probability: theories and applications. http://www.sipta.org 16. Diamond P (1990) Least squares fitting of compact set-valued data. 
J Math Anal Appl 147:351–362 17. Diamond P, Kloeden PE (1994) Metric spaces of fuzzy sets: theory and applications. World Scientific, Singapore 18. Dubois D, Prade H (1980) Fuzzy sets and systems theory and applications. Academic Press, New York 19. Dubois D, Prade H (1985) A review of fuzzy set aggregation connectives. Inf Sci 36:85–121 20. Dubois D, Prade H (1986) Possibility theory. Plenum Press, New York 21. Fellin W, Lessmann H, Oberguggenberger M, Vieider R (eds) (2005) Analyzing uncertainty in civil engineering. Springer, Berlin 22. Feng Y, Hu L, Shu H (2001) The variance and covariance of fuzzy random variables and their applications. Fuzzy Sets Syst 120(3):487–497
23. Ferson S, Hajagos JG (2004) Arithmetic with uncertain numbers: rigorous and (often) best possible answers. Reliab Eng Syst Saf 85(1–3):135–152 24. Fetz T, Oberguggenberger M (2004) Propagation of uncertainty through multivariate functions in the framework of sets of probability measures. Reliab Eng Syst Saf 85(1–3):73–87 25. Ghanem RG, Spanos PD (1991) Stochastic finite elements: a spectral approach. Springer, New York; Revised edition 2003, Dover Publications, Mineola 26. González-Rodríguez G, Montenegro M, Colubi A, Ángeles Gil M (2006) Bootstrap techniques and fuzzy random variables: Synergy in hypothesis testing with fuzzy data. Fuzzy Sets Syst 157(19):2608–2613 27. Grzegorzewski P (2000) Testing statistical hypotheses with vague data. Fuzzy Sets Syst 112:501–510 28. Hall JW, Lawry J (2004) Generation, combination and extension of random set approximations to coherent lower and upper probabilities. Reliab Eng Syst Saf 85(1–3):89–101 29. Helton JC, Johnson JD, Oberkampf WL (2004) An exploration of alternative approaches to the representation of uncertainty in model predictions. Reliab Eng Syst Saf 85(1–3):39–71 30. Helton JC, Oberkampf WL (eds) (2004) Special issue on alternative representations of epistemic uncertainty. Reliab Eng Syst Saf 85(1–3):1–369 31. Hung W-L (2001) Bootstrap method for some estimators based on fuzzy data. Fuzzy Sets Syst 119:337–341 32. Hwang C-M, Yao J-S (1996) Independent fuzzy random variables and their application. Fuzzy Sets Syst 82:335–350 33. Jang L-C, Kwon J-S (1998) A uniform strong law of large numbers for partial sum processes of fuzzy random variables indexed by sets. Fuzzy Sets Syst 99:97–103 34. Joo SY, Kim YK (2001) Kolmogorovs strong law of large numbers for fuzzy random variables. Fuzzy Sets Syst 120:499–503 35. Kim YK (2002) Measurability for fuzzy valued functions. Fuzzy Sets Syst 129:105–109 36. Klement EP, Puri ML, Ralescu DA (1986) Limit theorems for fuzzy random variables. 
Fuzzy Sets Theory, Foundations of
JANUSZ KACPRZYK
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Article Outline
Glossary
Definition of the Subject
Introduction
Fuzzy Sets – Basic Definitions and Properties
Fuzzy Relations
Linguistic Variable, Fuzzy Conditional Statement, and Compositional Rule of Inference
The Extension Principle
Fuzzy Numbers
Fuzzy Events and Their Probabilities
Defuzzification of Fuzzy Sets
Fuzzy Logic – Basic Issues
Bellman and Zadeh's General Approach to Decision Making Under Fuzziness
Concluding Remarks
Bibliography
Glossary

Fuzzy set A mathematical tool that can formally characterize an imprecise concept. Whereas elements can either belong to a conventional set or not, elements can belong to a fuzzy set to some extent: from zero, which stands for full nonbelongingness, to one, which stands for full belongingness, through all intermediate values.

Fuzzy relation A mathematical tool that can formally characterize imprecisely specified relations between variables, notably those expressed in natural language, for instance: similar, much greater than, almost equal, etc.

Extension principle Makes it possible to extend relations, algorithms, etc. defined for variables that take on nonfuzzy (e.g. real) values to variables that take on fuzzy values.

Linguistic variable, fuzzy conditional statement, compositional rule of inference Make it possible to use variables which take on linguistic (instead of numeric) values, to represent relations between such variables by fuzzy conditional statements, and to use these in inference via the compositional rule of inference.

Fuzzy event and its probability Make it possible to formally define events which are imprecisely specified, like "high temperature", and to calculate their probabilities, for instance the probability of a "high temperature tomorrow".

Fuzzy logic Provides formal means for the representation of, and inference based on, imprecisely specified premises and rules of inference. It can be understood in different ways: basically as fuzzy logic in a narrow sense, being some type of multivalued logic, and as fuzzy logic in a broad sense, being a way to formalize inference based on imprecisely specified premises and rules of inference.

Definition of the Subject

We provide a brief exposition of basic elements of Zadeh's [95] fuzzy sets theory. We discuss basic properties, operations on fuzzy sets, fuzzy relations and their compositions, linguistic variables, the extension principle, fuzzy arithmetic, fuzzy events and their probabilities, fuzzy logic, fuzzy dynamic systems, etc. We also outline Bellman and Zadeh's [8] general approach to decision making in a fuzzy environment, which is a point of departure for virtually all fuzzy decision making, optimization, control, etc. models.

Introduction

This paper is meant to briefly expose a novice reader to basic elements of the theory of fuzzy sets and fuzzy systems, viewed for our purposes as an effective and efficient means and calculus to deal with imprecision in the definition of data, information and knowledge, and to provide tools and techniques for dealing with such imprecision. Our exposition will be as formal as necessary, but of a more intuitive and constructive character, so that fuzzy tools and techniques can be useful for the multidisciplinary audience of this encyclopedia. For readers requiring or interested in a deeper exposition of fuzzy sets and related concepts, we recommend many relevant references, mainly books. However, as the number of books and volumes on this topic and its applications in a variety of fields is huge, we recommend only some of them, mostly the better-known ones. For the newest literature entries the reader should consult the most recent catalogs of major scientific publishers who have books and edited volumes on fuzzy sets/logic and their applications. Our discussion will proceed in the pure fuzzy setting, and we will not discuss possibility theory (which is related to fuzzy sets theory). The reader interested in possibility theory is referred to, e.g., Dubois and Prade [29,30] or their article in this encyclopedia.
We will consecutively discuss the idea of a fuzzy set, basic properties of fuzzy sets, operations on fuzzy sets, some extensions of the basic concept of a fuzzy set, fuzzy relations and their compositions, linguistic variables, fuzzy conditional statements and the compositional rule of inference, the extension principle, fuzzy arithmetic, fuzzy events and their probabilities, fuzzy logic, fuzzy dynamic systems, etc. We also outline Bellman and Zadeh's [8] general approach to decision making in a fuzzy environment, which is a point of departure for virtually all fuzzy decision making, optimization, control, etc. models.

Fuzzy Sets – Basic Definitions and Properties

Fuzzy sets theory, introduced by Zadeh in 1965 [95], is a simple yet very powerful, effective and efficient means to represent and handle imprecise information (of the vagueness type) exemplified by tall buildings, large numbers, etc. We will present fuzzy sets theory as a calculus of imprecision, not as a new set theory in the mathematical sense.

The Idea of a Fuzzy Set

From our point of view, the main purpose of a (conventional) set in mathematics is to formally characterize some concept (or property). For instance, the concept of "integer numbers which are greater than or equal to three and less than or equal to ten" may be uniquely represented just by listing all integer numbers that satisfy this condition, that is, by the following set:

$$\{x \in I : 3 \le x \le 10\} = \{3, 4, 5, 6, 7, 8, 9, 10\}$$

where I is the set of integers. Notice that we first need to specify a universe of discourse (universe, universal set, referential, reference set, etc.) that contains all the elements relevant for the particular concept, e.g. the set of integers I in our example. A conventional set, say A, may be equated with its characteristic function, defined as

$$\varphi_A : X \to \{0, 1\} \tag{1}$$

which associates with each element x of a universe of discourse $X = \{x\}$ a number $\varphi_A(x) \in \{0, 1\}$ such that $\varphi_A(x) = 0$ means that $x \in X$ does not belong to the set A, and $\varphi_A(x) = 1$ means that x belongs to the set A. Therefore, for the set verbally defined as "integer numbers which are greater than or equal to three and less than or equal to ten", its equivalent set $A = \{3, 4, 5, 6, 7, 8, 9, 10\}$, listing all the respective integer numbers, may be represented by its characteristic function

$$\varphi_A(x) = \begin{cases} 1 & \text{for } x \in \{3, 4, 5, 6, 7, 8, 9, 10\} \\ 0 & \text{otherwise.} \end{cases}$$
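The characteristic function (1) of this crisp set can be written down directly; a minimal sketch in Python (the function name is ours, not from the article):

```python
def char_A(x):
    """Characteristic function (1) of A = {3, 4, ..., 10}:
    returns 1 if x belongs to A and 0 otherwise."""
    return 1 if 3 <= x <= 10 else 0

# Listing the members of A recovers {3, ..., 10}:
print([x for x in range(1, 13) if char_A(x) == 1])  # [3, 4, 5, 6, 7, 8, 9, 10]
```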
Notice that in a conventional set there is a clear-cut differentiation between elements belonging to the set and those not belonging to it, i.e. the transition from belongingness to nonbelongingness is clear-cut and abrupt. However, a serious difficulty arises when we try to formalize by means of a set the vague concepts which are commonly encountered in everyday discourse and widely used by humans, e.g. the statement "integer numbers which are more or less equal to six." Evidently, a (conventional) set cannot adequately characterize such an imprecise concept, because an abrupt and clear-cut differentiation between the elements belonging and not belonging to the set is artificial here. This has led Zadeh [95] to the idea of a fuzzy set, which is a class of objects with unsharp boundaries, i.e. in which the transition from belongingness to nonbelongingness is not abrupt; thus, elements of a fuzzy set may belong to it to partial degrees, from full belongingness to full nonbelongingness through all intermediate values. Notice that this is presumably the most natural and simple way to formally define imprecision of meaning. We should therefore start again with a universe of discourse (universe, universal set, referential, reference set, etc.) containing all elements relevant for the (imprecise) concept we wish to formally represent. Then, the characteristic function $\varphi_A : X \to \{0, 1\}$ is replaced by a membership function defined as

$$\mu_A : X \to [0, 1] \tag{2}$$

such that $\mu_A(x) \in [0, 1]$ is the degree to which an element $x \in X$ belongs to the fuzzy set A: from $\mu_A(x) = 0$ for full nonbelongingness to $\mu_A(x) = 1$ for full belongingness, through all intermediate values ($0 < \mu_A(x) < 1$). Consider now, as an example, the concept of integer numbers which are more or less six. The number x = 6 certainly belongs to this set, so that $\mu_A(6) = 1$; the numbers five and seven belong to it almost surely, so that $\mu_A(5)$ and $\mu_A(7)$ are very close to one; and the more a number differs from six, the lower its $\mu_A(\cdot)$. Finally, the numbers below one and above ten do not belong to this set at all, so that their $\mu_A(\cdot) = 0$. This may be sketched as in Fig. 1, though we should bear in mind that although in our example the membership function is evidently defined for the integer numbers (x's) only, it is depicted in a continuous form to be more illustrative. In practice the membership function is usually assumed to be piecewise linear, as shown in Fig. 2 (for the same fuzzy set as in Fig. 1, i.e. the fuzzy set of integer numbers which are more or less six). To specify the membership function we then need four numbers only: a, b, c, and d, e.g. a = 2, b = 5, c = 7, and d = 10 in Fig. 2.
Fuzzy Sets Theory, Foundations of, Figure 1 Membership function of a fuzzy set, integer numbers which are more or less six
Fuzzy Sets Theory, Foundations of, Figure 2 Membership function of a fuzzy set, integer numbers which are more or less six
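The piecewise-linear membership function of Fig. 2 can be sketched in code. The following Python function (our own naming, not from the article) implements a trapezoidal membership function with the four parameters a, b, c, d mentioned above:

```python
def trapezoidal_mu(x, a, b, c, d):
    """Piecewise-linear membership function: 0 outside (a, d),
    1 on [b, c], and linear on the two transition segments."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:                          # rising edge between a and b
        return (x - a) / (b - a)
    return (d - x) / (d - c)           # falling edge between c and d

# "integer numbers which are more or less six": a=2, b=5, c=7, d=10
mu = lambda x: trapezoidal_mu(x, 2, 5, 7, 10)
print(mu(6))  # 1.0 (six fully belongs)
print(mu(1))  # 0.0 (numbers below two do not belong at all)
print(mu(4))  # 2/3, a partial degree of belongingness
```

Note that with a = b and c = d this degenerates into the characteristic function of the crisp interval [b, c], which mirrors the remark that a conventional set is a special case of a fuzzy set.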
Notice that the particular form of a membership function is subjective, as opposed to the objective form of a characteristic function. This may be viewed as quite natural, since the underlying concepts are indeed subjective: e.g., the set of integer numbers more or less six depends on an individual opinion. Unfortunately, this inherent subjectivity of the membership function may lead to some problems in many formal models in which users would rather have a limit to the scope of subjectivity. We will comment on this issue later on. We will now formally define a fuzzy set in a form that is very often used. A fuzzy set A in a universe of discourse $X = \{x\}$, written A in X, is defined as a set of pairs

$$A = \{(\mu_A(x), x)\} \tag{3}$$

where $\mu_A : X \to [0, 1]$ is the membership function of A and $\mu_A(x) \in [0, 1]$ is the grade of membership (or membership grade) of an element $x \in X$ in the fuzzy set A. Needless to say, our definition of a fuzzy set (3) is clearly equivalent to the definition of the membership function (2), because a function may be represented by a set of argument–value pairs.
For our purposes, however, the definition (3) is more set-theoretic-like, which will often be more convenient. So, in this paper we will practically equate fuzzy sets with their membership functions, saying, e.g., "a fuzzy set $\mu_A(x)$", and also very often we will equate fuzzy sets with their labels, saying, e.g., "a fuzzy set large numbers", with the understanding that the label "large numbers" is equivalent to the fuzzy set mentioned, written A = "large numbers". However, we will use the notation $\mu_A(x)$ for the membership function of a fuzzy set A in X, and not the abbreviated notation A(x) used in some more technical papers, to be consistent with our more set-theoretic-like convention. For practical reasons, it is very often assumed (also in this paper) that all the universes of discourse are finite, e.g., $X = \{x_1, \ldots, x_n\}$. In such a case the pair $(\mu_A(x), x)$ will be denoted by $\mu_A(x)/x$, which is called a fuzzy singleton. Then, the fuzzy set A in X will be written as

$$A = \{(\mu_A(x), x)\} = \frac{\mu_A(x)}{x} = \frac{\mu_A(x_1)}{x_1} + \cdots + \frac{\mu_A(x_n)}{x_n} = \sum_{i=1}^{n} \frac{\mu_A(x_i)}{x_i} \tag{4}$$

where "+" and "$\sum$" are meant in the set-theoretic sense. By convention, the pairs $\mu_A(x)/x$ with $\mu_A(x) = 0$ are omitted here. A conventional (nonfuzzy) set may obviously be written in the fuzzy sets notation introduced above; for instance, the (nonfuzzy) set of integer numbers greater than or equal to three and less than or equal to ten may be written as

$$A = \frac{1}{3} + \frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{7} + \frac{1}{8} + \frac{1}{9} + \frac{1}{10}.$$

The family of all fuzzy sets defined in X is denoted by $\mathcal{A}$; it evidently also includes the empty fuzzy set to be defined by (9), i.e. $A = \emptyset$ such that $\mu_A(x) = 0$ for each $x \in X$, and the whole universe of discourse X, written as $X = 1/x_1 + \cdots + 1/x_n$. The concept of a fuzzy set as defined above has been the point of departure for the theory of fuzzy sets (or fuzzy sets theory), which will be briefly sketched below. We will again follow a more intuitive and less formal presentation, which is better suited for this encyclopedia.

Some Extensions of the Concept of Zadeh's Fuzzy Set

The concept of Zadeh's [95] fuzzy set introduced in the previous section is by far the simplest and most natural way to fuzzify the concept of a (conventional) set, and it clearly conveys what we mean by representing and handling imprecision. However, its underlying elements are the
most straightforward possible. This concerns above all the membership function. Therefore, it is quite natural that some extensions of this basic concept have been proposed. We will briefly mention some of them. First, it is quite easy to notice that though the definition of a fuzzy set by a membership function of the type $\mu_A : X \to [0, 1]$ is the simplest and most straightforward one, allowing for a gradual transition from belongingness to nonbelongingness, it can readily be extended. The same role is namely played by a generalized definition by a membership function of the type

$$\mu_A : X \to L \tag{5}$$

where L is some (partially) ordered set, e.g. a lattice. This obvious but powerful extension was introduced by Goguen [37] as an L-fuzzy set, where L stands for a lattice. Notice that by using a lattice as the set of values of the membership function we can accommodate situations in which we encounter elements of the universe of discourse that are not comparable. Another quite obvious extension, already mentioned but not developed by Zadeh [95,101], is the concept of a type 2 fuzzy set. The rationale behind this concept is obvious: one can easily imagine that the values of the grades of membership of the particular elements of a universe of discourse are fuzzy sets themselves. Further, these fuzzy sets may have grades of membership which are type 2 fuzzy sets, which leads to type 3 fuzzy sets, and one can continue, arriving at type n fuzzy sets. The next natural extension is that, instead of assuming that the degrees of membership are real numbers from the unit interval, one can go a step further and replace these real numbers from [0, 1] by intervals with endpoints belonging to the unit interval. This leads to interval-valued fuzzy sets, which are attributed to Dubois and Gorzałczany (cf. Klir and Yuan [53]). Notice that by using intervals as values of the degrees of membership we significantly increase our ability to represent imprecision. A more radical extension of the concept of Zadeh's fuzzy set is the so-called intuitionistic fuzzy set introduced by Atanassov [1,2]. An intuitionistic fuzzy set A' in a universe of discourse X is defined as

$$A' = \{\langle x, \mu_{A'}(x), \nu_{A'}(x) \rangle \mid x \in X\}$$

where the degree of membership is

$$\mu_{A'} : X \to [0, 1] \tag{6}$$

the degree of non-membership is

$$\nu_{A'} : X \to [0, 1]$$

and the following condition holds:

$$0 \le \mu_{A'}(x) + \nu_{A'}(x) \le 1, \quad \text{for each } x \in X.$$

Obviously, each (conventional) fuzzy set A in X corresponds to the following intuitionistic fuzzy set A' in X:

$$A' = \{\langle x, \mu_A(x), 1 - \mu_A(x) \rangle \mid x \in X\} \tag{7}$$

For each intuitionistic fuzzy set A' in X, we call

$$\pi_{A'}(x) = 1 - \mu_{A'}(x) - \nu_{A'}(x), \quad \text{for each } x \in X \tag{8}$$
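The intuitionistic-fuzzy notions above can be illustrated with a small numeric sketch; the membership/non-membership pairs below are invented for illustration, and the dictionary representation is our own choice:

```python
# An intuitionistic fuzzy set over X = {1, 2, 3}: for each x a pair
# (mu, nu) of membership and non-membership degrees, with mu + nu <= 1.
A_prime = {1: (0.6, 0.3), 2: (0.5, 0.5), 3: (1.0, 0.0)}

# Condition accompanying (6): 0 <= mu + nu <= 1 for each x.
assert all(0.0 <= mu + nu <= 1.0 for mu, nu in A_prime.values())

def hesitation(mu, nu):
    """Intuitionistic fuzzy index (hesitation margin), Eq. (8)."""
    return 1.0 - mu - nu

for x, (mu, nu) in A_prime.items():
    print(x, hesitation(mu, nu))

# A conventional fuzzy set corresponds to nu = 1 - mu, Eq. (7),
# so its hesitation margin vanishes everywhere.
A = {x: (mu, 1.0 - mu) for x, (mu, _) in A_prime.items()}
assert all(abs(hesitation(mu, nu)) < 1e-12 for mu, nu in A.values())
```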
the intuitionistic fuzzy index (or hesitation margin) of x in A'. The intuitionistic fuzzy index expresses a lack of knowledge of whether an element $x \in X$ belongs to the intuitionistic fuzzy set A' or not. Notice that the concept of an intuitionistic fuzzy set is a substantial departure from the concept of a (conventional) fuzzy set, as it allows the degrees of membership and non-membership not to sum up to one, contrary to what is the case in virtually all traditional set theories and their extensions. For more information, we refer the reader to Atanassov's [3] book. We will not use these extensions in this short introductory article, and the interested readers are referred to the source literature cited.

Basic Definitions and Properties Related to Fuzzy Sets

We will now provide a brief account of basic definitions and properties related to fuzzy sets, illustrating them with simple examples. A fuzzy set A is said to be empty, written $A = \emptyset$, if and only if

$$\mu_A(x) = 0, \quad \text{for each } x \in X \tag{9}$$

and since we omit the pairs 0/x, an empty fuzzy set is really void in the notation (4), as there are no singletons on the right-hand side. Two fuzzy sets A and B defined in the same universe of discourse X are said to be equal, written A = B, if and only if

$$\mu_A(x) = \mu_B(x), \quad \text{for each } x \in X \tag{10}$$

Example 1. Suppose that $X = \{1, 2, 3\}$ and

$$A = 0.1/1 + 0.5/2 + 1/3$$
$$B = 0.2/1 + 0.5/2 + 1/3$$
$$C = 0.1/1 + 0.5/2 + 1/3$$

then A = C, but A ≠ B and B ≠ C.
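Example 1 can be verified directly when finite fuzzy sets are stored as mappings from elements to membership grades (a minimal sketch; the dictionary representation is our own choice, not the article's notation):

```python
# Fuzzy sets over X = {1, 2, 3} written as {element: membership grade}
A = {1: 0.1, 2: 0.5, 3: 1.0}
B = {1: 0.2, 2: 0.5, 3: 1.0}
C = {1: 0.1, 2: 0.5, 3: 1.0}

def equal(F, G, X=(1, 2, 3)):
    """Classic equality (10): membership grades coincide for every x in X."""
    return all(F.get(x, 0.0) == G.get(x, 0.0) for x in X)

print(equal(A, C))  # True
print(equal(A, B))  # False
print(equal(B, C))  # False
```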
It is easy to see that this classic definition of the equality of two fuzzy sets by (10) is rigid and clear-cut, contradicting in a sense our intuitive feeling that the equality of fuzzy sets should be softer, not abrupt, i.e. should rather hold to some degree, from zero to one. We will show below some of the possible definitions of such an equality to a degree. Two fuzzy sets A and B defined in X are said to be equal to a degree $e(A, B) \in [0, 1]$, written $A =_e B$; the degree of equality $e(A, B)$ may be defined in many ways, exemplified by those given below (cf. Bandler and Kohout [6]). First, to simplify, we distinguish the following cases:

Case 1: $A = B$ in the sense of (10);
Case 2: $A \ne B$ in the sense of (10), and $T = \{x \in X : \mu_A(x) \ne \mu_B(x)\}$;
Case 3: $A \ne B$ in the sense of (10), and there exists an $x \in X$ such that $\mu_A(x) = 0$ and $\mu_B(x) \ne 0$, or $\mu_A(x) \ne 0$ and $\mu_B(x) = 0$;
Case 4: $A \ne B$ in the sense of (10), and there exists an $x \in X$ such that $\mu_A(x) = 0$ and $\mu_B(x) = 1$, or $\mu_A(x) = 1$ and $\mu_B(x) = 0$.

Now, the following degrees of equality of two fuzzy sets A and B may be defined:

$$e_1(A, B) =$$
D (x ) D max D (x) : x2X
The basic conceptual fuzzy decision making model can be used in many specific areas, notably in fuzzy optimization which will be covered elsewhere in this volume. The models of decision making under fuzziness developed above can also be extended to the case of multiple criteria, multiple decision makers, and multiple stage cases. We will present the last extension, to fuzzy multistage decision making (control) case which makes it possible to account for dynamics.
(93)
Example 19 Let X D f1; 2; 3; 4g, Y D f2; 3; : : : ; 10g, and y D 2x C 1. If now
D (x) D G 0 (x) ^ C (x) ;
The maximizing decision is defined as (92), i. e.
(95)
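Such a fuzzy decision can be computed numerically, taking the minimum for "∧". In the sketch below the membership grades of the goal and the constraint are invented for illustration (the article's Example 19 specifies its own):

```python
X = [1, 2, 3, 4]
f = lambda x: 2 * x + 1            # y = 2x + 1, so mu_G'(x) = mu_G(f(x))

# Illustrative membership grades (not taken from the article):
mu_G = {3: 0.2, 5: 0.6, 7: 1.0, 9: 0.8}   # fuzzy goal, defined on Y
mu_C = {1: 1.0, 2: 0.8, 3: 0.6, 4: 0.2}   # fuzzy constraint, defined on X

# Fuzzy decision: mu_D(x) = mu_G(f(x)) ∧ mu_C(x), with ∧ taken as min
mu_D = {x: min(mu_G[f(x)], mu_C[x]) for x in X}

# Maximizing decision: an x* attaining the maximum of mu_D
x_star = max(X, key=lambda x: mu_D[x])
print(mu_D)
print(x_star)
```

Here x = 2 and x = 3 tie at the grade 0.6, and `max` returns the first of them; any maximizing element is an equally good maximizing decision.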
Multistage Decision Making (Control) Under Fuzziness

In this case it is convenient to use control-related notation and terminology. In particular, decisions will be referred to as controls; the discrete time moments at which decisions are to be made, as control stages; and the input-output (or cause-effect) relationship, as a system under control. The essence of multistage control in a fuzzy environment may be portrayed as in Fig. 9.

Fuzzy Sets Theory, Foundations of, Figure 9 Essence of the multistage control in a fuzzy environment (under fuzziness)

First, suppose that the control space is $U = \{u\} = \{c_1, \ldots, c_m\}$ and the state space is $X = \{x\} = \{s_1, \ldots, s_n\}$. Initially we are in some initial state $x_0 \in X$. We apply a control $u_0 \in U$ subjected to a fuzzy constraint $\mu_{C^0}(u_0)$. We attain a state $x_1 \in X$ via a known cause-effect relationship (i.e. S); a fuzzy goal $\mu_{G^1}(x_1)$ is imposed on $x_1$. Next, we apply a control $u_1$ subjected to a fuzzy constraint $\mu_{C^1}(u_1)$, and attain a state $x_2$ on which a fuzzy goal $\mu_{G^2}(x_2)$ is imposed, etc. Suppose for simplicity that the system under control is deterministic and that its temporal evolution is governed by a state transition equation

$$f : X \times U \to X \tag{96}$$

such that

$$x_{t+1} = f(x_t, u_t), \quad t = 0, 1, \ldots \tag{97}$$

where $x_t, x_{t+1} \in X = \{s_1, \ldots, s_n\}$ are the states at control stages t and t + 1, respectively, and $u_t \in U = \{c_1, \ldots, c_m\}$ is the control at t. At each t, $u_t \in U$ is subjected to a fuzzy constraint $\mu_{C^t}(u_t)$, and on the attained state $x_{t+1} \in X$ a fuzzy goal $\mu_{G^{t+1}}(x_{t+1})$ is imposed; $t = 0, 1, \ldots$. The initial state $x_0 \in X$ is assumed to be known and given in advance. The termination time (planning, or control, horizon), i.e. the maximum number of control stages, is denoted by $N \in \{1, 2, \ldots\}$ and may be finite or infinite. The performance (goodness) of the multistage control process under fuzziness is evaluated by the fuzzy decision
$$\mu_D(u_0, \ldots, u_{N-1} \mid x_0) = \mu_{C^0}(u_0) \wedge \mu_{G^1}(x_1) \wedge \cdots \wedge \mu_{C^{N-1}}(u_{N-1}) \wedge \mu_{G^N}(x_N). \tag{98}$$

In most cases, however, a slightly simplified form of the fuzzy decision (98) is used: it is assumed that all the subsequent controls $u_0, u_1, \ldots, u_{N-1}$ are subjected to the fuzzy constraints $\mu_{C^0}(u_0), \mu_{C^1}(u_1), \ldots, \mu_{C^{N-1}}(u_{N-1})$, while the fuzzy goal is imposed on the final state $x_N$ only, $\mu_{G^N}(x_N)$. In such a case the fuzzy decision becomes

$$\mu_D(u_0, \ldots, u_{N-1} \mid x_0) = \mu_{C^0}(u_0) \wedge \cdots \wedge \mu_{C^{N-1}}(u_{N-1}) \wedge \mu_{G^N}(x_N). \tag{99}$$

The multistage control problem in a fuzzy environment is now formulated as finding an optimal sequence of controls $u_0^*, \ldots, u_{N-1}^*$, $u_t^* \in U$, $t = 0, 1, \ldots, N-1$, such that

$$\mu_D(u_0^*, \ldots, u_{N-1}^* \mid x_0) = \max_{u_0, \ldots, u_{N-1} \in U} \mu_D(u_0, \ldots, u_{N-1} \mid x_0). \tag{100}$$

Usually it is more convenient to express the solution, i.e. the controls to be applied, as a control policy $a_t : X \to U$ such that $u_t = a_t(x_t)$, $t = 0, 1, \ldots$; that is, the control to be applied at t is expressed as a function of the state at t. The above basic formulation of multistage control in a fuzzy environment may readily be extended with respect to the type of the termination time (fixed and specified, implicitly specified, fuzzy, and infinite) and the type of the system under control (deterministic, stochastic, and fuzzy). For a detailed analysis of the resulting problems and their solutions (by employing dynamic programming, branch-and-bound, genetic algorithms, etc.) we refer the reader to Kacprzyk's [42,44] books.

Concluding Remarks

We provided a brief survey of basic elements of Zadeh's [95] fuzzy sets theory, mainly of basic properties of fuzzy sets, operations on fuzzy sets, fuzzy relations and their compositions, linguistic variables, the extension principle, fuzzy arithmetic, fuzzy events and their probabilities, fuzzy logic, and Bellman and Zadeh's [8] general approach to decision making in a fuzzy environment. Various aspects of fuzzy sets theory will be expanded in other papers in this part of the volume.
Bibliography
1. Atanassov KT (1983) Intuitionistic fuzzy sets. VII ITKR session. Central Sci.-Techn. Library of Bulg. Acad. of Sci., Sofia, pp 1697/84 (in Bulgarian)
2. Atanassov KT (1986) Intuitionistic fuzzy sets. Fuzzy Sets Syst 20:87–96
3. Atanassov KT (1999) Intuitionistic fuzzy sets: Theory and applications. Springer, Heidelberg
4. Bandemer H, Gottwald S (1995) Fuzzy sets, fuzzy logic, fuzzy methods, with applications. Wiley, Chichester
5. Bandemer H, Näther W (1992) Fuzzy data analysis. Kluwer, Dordrecht
6. Bandler W, Kohout LJ (1980) Fuzzy power sets for fuzzy implication operators. Fuzzy Sets Syst 4:13–30
7. Bellman RE, Giertz M (1973) On the analytic formalism of the theory of fuzzy sets. Inform Sci 5:149–157
8. Bellman RE, Zadeh LA (1970) Decision making in a fuzzy environment. Manag Sci 17:141–164
9. Belohlávek R, Vychodil V (2005) Fuzzy equational logic. Springer, Heidelberg
10. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum, New York
11. Black M (1937) Vagueness: An exercise in logical analysis. Philos Sci 4:427–455
12. Black M (1963) Reasoning with loose concepts. Dialogue 2:1–12
13. Black M (1970) Margins of precision. Cornell University Press, Ithaca
14. Buckley JJ (2004) Fuzzy statistics. Springer, Heidelberg
15. Buckley JJ (2005) Fuzzy probabilities. New approach and applications, 2nd edn. Springer, Heidelberg
16. Buckley JJ (2005) Simulating fuzzy systems. Springer, Heidelberg
17. Buckley JJ (2006) Fuzzy probability and statistics. Springer, Heidelberg
18. Buckley JJ, Eslami E (2002) An introduction to fuzzy logic and fuzzy sets. Springer, Heidelberg
19. Calvo T, Mayor G, Mesiar R (2002) Aggregation operators. New trends and applications. Springer, Heidelberg
Fuzzy Sets Theory, Foundations of
20. Carlsson C, Fullér R (2002) Fuzzy reasoning in decision making and optimization. Springer, Heidelberg
21. Castillo O, Melin P (2008) Type-2 fuzzy logic: Theory and applications. Springer, Heidelberg
22. Cox E (1994) The fuzzy system handbook. A practitioner's guide to building, using, and maintaining fuzzy systems. Academic, New York
23. Cross V, Sudkamp T (2002) Similarity and compatibility in fuzzy set theory. Assessment and applications. Springer, Heidelberg
24. Delgado M, Kacprzyk J, Verdegay JL, Vila MA (eds) (1994) Fuzzy optimization: Recent advances. Physica, Heidelberg
25. Dompere KK (2004) Cost-benefit analysis and the theory of fuzzy decisions. Fuzzy value theory. Springer, Heidelberg
26. Dompere KK (2004) Cost-benefit analysis and the theory of fuzzy decisions. Identification and measurement theory. Springer, Heidelberg
27. Driankov D, Hellendorn H, Reinfrank M (1993) An introduction to fuzzy control. Springer, Berlin
28. Dubois D, Prade H (1980) Fuzzy sets and systems: Theory and applications. Academic, New York
29. Dubois D, Prade H (1985) Théorie des possibilités. Applications à la représentation des connaissances en informatique. Masson, Paris
30. Dubois D, Prade H (1988) Possibility theory: An approach to computerized processing of uncertainty. Plenum, New York
31. Dubois D, Prade H (1996) Fuzzy sets and systems (re-edition on CD-ROM of [28]). Academic, New York
32. Fullér R (2000) Introduction to neuro-fuzzy systems. Springer, Heidelberg
33. Gaines BR (1977) Foundations of fuzzy reasoning. Int J Man-Mach Stud 8:623–668
34. Gil Aluja J (2004) Fuzzy sets in the management of uncertainty. Springer, Heidelberg
35. Gil-Lafuente AM (2005) Fuzzy logic in financial analysis. Springer, Heidelberg
36. Glöckner I (2006) Fuzzy quantifiers. A computational theory. Springer, Heidelberg
37. Goguen JA (1967) L-fuzzy sets. J Math Anal Appl 18:145–174
38. Goguen JA (1969) The logic of inexact concepts. Synthese 19:325–373
39. Goodman IR, Nguyen HT (1985) Uncertainty models for knowledge-based systems. North-Holland, Amsterdam
40. Hájek P (1998) Metamathematics of fuzzy logic. Kluwer, Dordrecht
41. Hanss M (2005) Applied fuzzy arithmetic. An introduction with engineering applications. Springer, Heidelberg
42. Kacprzyk J (1983) Multistage decision making under fuzziness. Verlag TÜV Rheinland, Cologne
43. Kacprzyk J (1992) Fuzzy sets and fuzzy logic. In: Shapiro SC (ed) Encyclopedia of artificial intelligence, vol 1. Wiley, New York, pp 537–542
44. Kacprzyk J (1996) Multistage fuzzy control. Wiley, Chichester
45. Kacprzyk J, Fedrizzi M (eds) (1988) Combining fuzzy imprecision with probabilistic uncertainty in decision making. Springer, Berlin
46. Kacprzyk J, Orlovski SA (eds) (1987) Optimization models using fuzzy sets and possibility theory. Reidel, Dordrecht
47. Kandel A (1986) Fuzzy mathematical techniques with applications. Addison-Wesley, Reading
48. Kaufmann A, Gupta MM (1985) Introduction to fuzzy mathematics – theory and applications. Van Nostrand Reinhold, New York
49. Klement EP, Mesiar R, Pap E (2000) Triangular norms. Springer, Heidelberg
50. Klir GJ (1987) Where do we stand on measures of uncertainty, ambiguity, fuzziness, and the like? Fuzzy Sets Syst 24:141–160
51. Klir GJ, Folger TA (1988) Fuzzy sets, uncertainty and information. Prentice-Hall, Englewood Cliffs
52. Klir GJ, Wierman M (1999) Uncertainty-based information. Elements of generalized information theory, 2nd edn. Springer, Heidelberg
53. Klir GJ, Yuan B (1995) Fuzzy sets and fuzzy logic: Theory and application. Prentice-Hall, Englewood Cliffs
54. Kosko B (1992) Neural networks and fuzzy systems. Prentice-Hall, Englewood Cliffs
55. Kruse R, Meyer KD (1987) Statistics with vague data. Reidel, Dordrecht
56. Kruse R, Gebhard J, Klawonn F (1994) Foundations of fuzzy systems. Wiley, Chichester
57. Kuncheva LI (2000) Fuzzy classifier design. Springer, Heidelberg
58. Li Z (2006) Fuzzy chaotic systems modeling, control, and applications. Springer, Heidelberg
59. Liu B (2007) Uncertainty theory. Springer, Heidelberg
60. Ma Z (2006) Fuzzy database modeling of imprecise and uncertain engineering information. Springer, Heidelberg
61. Mamdani EH (1974) Application of fuzzy algorithms for the control of a simple dynamic plant. Proc IEE 121:1585–1588
62. Mareš M (1994) Computation over fuzzy quantities. CRC, Boca Raton
63. Mendel J (2000) Uncertain rule-based fuzzy logic systems: Introduction and new directions. Prentice Hall, New York
64. Mendel JM, John RIB (2002) Type-2 fuzzy sets made simple. IEEE Trans Fuzzy Syst 10(2):117–127
65. Mordeson JN, Nair PS (2001) Fuzzy mathematics. An introduction for engineers and scientists, 2nd edn. Springer, Heidelberg
66. Mukaidono M (2001) Fuzzy logic for beginners. World Scientific, Singapore
67. Negoiţa CV, Ralescu DA (1975) Application of fuzzy sets to system analysis. Birkhäuser/Halstead, Basel/New York
68. Nguyen HT, Walker EA (2005) A first course in fuzzy logic, 3rd edn. CRC, Boca Raton
69. Nguyen HT, Wu B (2006) Fundamentals of statistics with fuzzy data. Springer, Heidelberg
70. Novák V (1989) Fuzzy sets and their applications. Hilger, Bristol/Boston
71. Novák V, Perfilieva I, Močkoř J (1999) Mathematical principles of fuzzy logic. Kluwer, Boston
72. Peeva K, Kyosev Y (2005) Fuzzy relational calculus. World Scientific, Singapore
73. Pedrycz W (1993) Fuzzy control and fuzzy systems, 2nd edn. Research Studies/Wiley, Taunton/New York
74. Pedrycz W (1995) Fuzzy sets engineering. CRC, Boca Raton
75. Pedrycz W (ed) (1996) Fuzzy modelling: Paradigms and practice. Kluwer, Boston
76. Pedrycz W, Gomide F (1998) An introduction to fuzzy sets: Analysis and design. MIT Press, Cambridge
77. Petry FE (1996) Fuzzy databases. Principles and applications. Kluwer, Boston
78. Piegat A (2001) Fuzzy modeling and control. Springer, Heidelberg
79. Ruspini EH (1991) On the semantics of fuzzy logic. Int J Approx Reasoning 5:45–88
80. Rutkowska D (2002) Neuro-fuzzy architectures and hybrid learning. Springer, Heidelberg
81. Rutkowski L (2004) Flexible neuro-fuzzy systems. Structures, learning and performance evaluation. Kluwer, Dordrecht
82. Seising R (2007) The fuzzification of systems. The genesis of fuzzy set theory and its initial applications – developments up to the 1970s. Springer, Heidelberg
83. Smithson M (1989) Ignorance and uncertainty. Springer, Berlin
84. Sousa JMC, Kaymak U (2002) Fuzzy decision making in modelling and control. World Scientific, Singapore
85. Thole U, Zimmermann H-J, Zysno P (1979) On the suitability of minimum and product operators for the intersection of fuzzy sets. Fuzzy Sets Syst 2:167–180
86. Türkşen IB (1991) Measurement of membership functions and their acquisition. Fuzzy Sets Syst 40:5–38
87. Türkşen IB (2006) An ontological and epistemological perspective of fuzzy set theory. Elsevier, New York
88. Wang Z, Klir GJ (1992) Fuzzy measure theory. Kluwer, Boston
89. Wygralak M (1996) Vaguely defined objects. Representations, fuzzy sets and nonclassical cardinality theory. Kluwer, Dordrecht
90. Wygralak M (2003) Cardinalities of fuzzy sets. Springer, Heidelberg
91. Yager RR (1983) Quantifiers in the formulation of multiple objective decision functions. Inf Sci 31:107–139
92. Yager RR, Filev DP (1994) Essentials of fuzzy modeling and control. Wiley, New York
93. Yager RR, Kacprzyk J (eds) (1996) The ordered weighted averaging operators: Theory, methodology and applications. Kluwer, Boston
94. Yazici A, George R (1999) Fuzzy database modeling. Springer, Heidelberg
95. Zadeh LA (1965) Fuzzy sets. Inf Control 8:338–353
96. Zadeh LA (1968) Probability measures of fuzzy events. J Math Anal Appl 23:421–427
97. Zadeh LA (1973) Outline of a new approach to the analysis of complex systems and decision processes. IEEE Trans Syst Man Cybern SMC-3:28–44
98. Zadeh LA (1975) Fuzzy logic and approximate reasoning. Synthese 30:407–428
99. Zadeh LA (1975) The concept of a linguistic variable and its application to approximate reasoning. Inf Sci (Part I) 8:199–249, (Part II) 8:301–357, (Part III) 9:43–80
100. Zadeh LA (1978) Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets Syst 1:3–28
101. Zadeh LA (1983) A computational approach to fuzzy quantifiers in natural languages. Comput Math Appl 9:149–184
102. Zadeh LA (1985) Syllogistic reasoning in fuzzy logic and its application to usuality and reasoning with dispositions. IEEE Trans Syst Man Cybern SMC-15:754–763
103. Zadeh LA (1986) Fuzzy probabilities. Inf Process Manag 20:363–372
104. Zadeh LA, Kacprzyk J (eds) (1992) Fuzzy logic for the management of uncertainty. Wiley, New York
105. Zadeh LA, Kacprzyk J (eds) (1999) Computing with words in information/intelligent systems. 1 Foundations. Springer, Heidelberg
106. Zadeh LA, Kacprzyk J (eds) (1999) Computing with words in information/intelligent systems. 2 Applications. Springer, Heidelberg
107. Zang H, Liu D (2006) Fuzzy modeling and fuzzy control. Birkhäuser, New York
108. Zimmermann H-J (1976) Description and optimization of fuzzy systems. Int J Gen Syst 2:209–215
109. Zimmermann H-J (1987) Fuzzy sets, decision making, and expert systems. Kluwer, Dordrecht
110. Zimmermann H-J (1996) Fuzzy set theory and its applications, 3rd edn. Kluwer, Boston
111. Zimmermann H-J, Zysno P (1980) Latent connectives in human decision making. Fuzzy Sets Syst 4:37–51
Fuzzy System Models Evolution from Fuzzy Rulebases to Fuzzy Functions
İ. Burhan Türkşen
Head, Department of Industrial Engineering, TOBB-ETÜ (Economics and Technology University of the Union of Turkish Chambers and Commodity Exchanges), Ankara, Republic of Turkey
Article Outline
Glossary
Definition of the Subject
Introduction
Type 1 Fuzzy System Models of the Past
Future of Fuzzy System Models
Case Study Applications
Experimental Design
Conclusions and Future Directions
Bibliography

Glossary
Z-FRB Zadeh's linguistic fuzzy rule base.
TS-FR Takagi–Sugeno fuzzy rule base.
c the number of rules in the rule base.
nv the number of input variables in the system.
$X = (x_1, x_2, \ldots, x_{nv})$ the input vector.
$x_j$ the jth input (explanatory) variable, $j = 1, \ldots, nv$.
$A_{ji}$ the linguistic label associated with the jth input variable of the antecedent in the ith rule.
$B_i$ the consequent linguistic label of the ith rule.
$R_i$ the ith rule, with membership function $\mu_i(x_j): x_j \to [0,1]$.
$A_i$ the multidimensional type 1 fuzzy set representing the ith antecedent part of the rules, defined by the membership function $\mu_i(x): x \to [0,1]$.
$a_i = (a_{i,1}, \ldots, a_{i,nv})$ the regression coefficient vector associated with the ith rule.
$b_i$ the scalar associated with the ith rule in the regression equation.
SFF-LSE "Special Fuzzy Functions" that are generated by Least Squares Estimation.
SFF-SVM "Special Fuzzy Functions" estimated by Support Vector Machines.
$Y_i = \beta_{i0} + \beta_{i1}\mu_i + \beta_{i2}X$ the estimate of $y_i$ obtained with SFF-LSE.
y the dependent variable, assumed to be a linear function.
$\beta_{ij}$, $j = 0, 1, \ldots, nv$, the coefficients that indicate how a change in one of the independent variables affects the dependent variable.
$X = (x_{j,k} \mid j = 1, \ldots, nv;\ k = 1, \ldots, nd)$ the set of observations in a training data set.
m the level of fuzziness, $m = 1.1, \ldots, 2.5$.
c the number of clusters, $c = 1, \ldots, 10$.
J the objective function to be minimized.
$\|\cdot\|_A$ a norm that specifies a distance-based similarity between the data vector $x_k$ and a fuzzy cluster center.
$A = I$ the Euclidean norm.
$A = \mathrm{COV}^{-1}$ the Mahalanobis norm.
COV the covariance matrix.
$(m^*, c^*)$ the optimal pair.
$v_{X|Y,i} = (x_{1,i}, x_{2,i}, \ldots, x_{nv,i}, y_i)$ the cluster centers for $m = m^*$ and each cluster $i = 1, \ldots, c^*$.
$v_{X,i} = (x_{1,i}, x_{2,i}, \ldots, x_{nv,i})$ the cluster centers of the
"input space" for $m = m^*$ and $c = 1, \ldots, c^*$.
$\mu_{ik}(x_k)$ the normalized membership value of the kth data sample in the ith cluster, $i = 1, \ldots, c^*$.
$\Gamma_i = (\mu_{ik} \mid i = 1, \ldots, c^*;\ k = 1, \ldots, nd)$ the membership values of x in the ith cluster.
$X'_i, X''_i, X'''_i$ the potential augmented input matrices in SFF-LSE.
$f(\vec x_k) = \hat y_k = \langle \vec w, \vec x_k\rangle + b$ the linear Support Vector Regression (SVR) equation.
$l_\varepsilon = |y_k - f(\vec x_k)|_\varepsilon = \max\{0, |y - f(x)| - \varepsilon\}$ the $\varepsilon$-insensitive loss function.
$\vec w, b$ the weight vector and bias term.
$c > 0$ the tradeoff between the empirical error and the complexity term.
$\xi_k \ge 0$ and $\xi_k^* \ge 0$ the slack variables.
$\alpha_k$ and $\alpha_k^*$ the Lagrange multipliers.
$K\langle \vec x_{k'}, \vec x_k\rangle$ the kernel mapping of the input vectors.
$\hat y_{ik'} = \hat f(\vec x_{ik'}; \alpha_i, \alpha_i^*) = \sum_{k=1}^{nd} (\alpha_{ik} - \alpha_{ik}^*)\, K\langle \vec x_{ik'}, \vec x_{ik}\rangle + b_i$ the output value of the kth data sample in the ith cluster with SFF-SVM.
$\tilde A = \{(x, (u, f_x(u))) \mid x \in X,\ u \in J_x \subseteq [0,1]\}$ the type 2 fuzzy set $\tilde A$.
$f_x(u): J_x \to [0,1]$, $\forall u \in J_x \subseteq [0,1]$, $\forall x \in X$ the secondary membership function.
$f_x(u) = 1$, $\forall x \in X$, $\forall u \in J_x$, $J_x \subseteq [0,1]$ the interval valued type 2 membership function [27]. The domain of the primary membership is discrete and the secondary membership values are fixed to 1. Thus, the proposed method utilizes discrete interval valued type 2 fuzzy sets in order to represent the linguistic values assigned to each fuzzy variable in each rule. These fuzzy sets can be mathematically defined as follows:
$$\tilde A = \int_{x\in X}\Big[\sum_{u\in J_x} 1/u\Big]\Big/x \qquad \text{Discrete Interval Valued Type 2 Fuzzy Sets (DIVT2FS)},\ x \in X\,.$$

If $\xi_k > 0$, then $\xi_k^* = 0$ must also be true. In the same sense, there can never be two slack variables $\xi_k, \xi_k^* > 0$ which are both non-zero.

Special Fuzzy Functions with SVM (FF-SVM) Method

Just as one can use ordinary least squares to estimate the Special Fuzzy Functions when the relationship between the input variables and the output variable can be linearly defined in the original input space, one may also build support vector regression models to estimate the parameters of non-linear Special Fuzzy Functions. The augmented input matrix is determined from the FCM algorithm, so that there is one SVR in SFF-SVM for each cluster, just as in the SFF-LSE model. One may choose any membership transformation depending on the input dataset. Then one can apply the support vector regression (SVR) algorithm instead of LSE to
each augmented matrix, which comprises the originally selected input variables and the membership values and/or their transformations. The support vector machines' optimization algorithm is applied to the augmented matrix of each cluster (rule) $i$, $i = 1, \ldots, c^*$, to optimize the Lagrange multipliers, $\alpha_{ik}$ and $\alpha_{ik}^*$, and find the candidate support vectors, $k = 1, \ldots, nd$. Hence, using SFF-SVM, one finds Lagrange multipliers of each kth training data sample, one for each cluster i. Then the output value of the kth data sample in the ith cluster is estimated using Eq. (27) as follows:

$$\hat y_{ik'} = \hat f\big(\vec x_{ik'}; \alpha_i, \alpha_i^*\big) = \sum_{k=1}^{nd}\big(\alpha_{ik} - \alpha_{ik}^*\big)\, K\langle \vec x_{ik'}, \vec x_{ik}\rangle + b_i\,. \qquad (27)$$
Here $\hat y_{ik'}$ is the estimated output of the $k'$th vector in the ith cluster, estimated using the support vector regression function with the Lagrange multipliers of the ith cluster. The augmented kernel matrix denotes the kernel mapping of the augmented input matrix (as described in the SFF-LSE approach), where the membership values and their transformations are used as additional input variables. After the optimization algorithm finds the optimum Lagrange multipliers, one can estimate the output value of each data point in each cluster using Eq. (27). The inference structure of SFF-SVM is adapted from the Special Fuzzy Functions with least squares, where one can estimate a single output of a data point (see Eq. (15)) by taking the membership-value-weighted average of its output values calculated for each cluster using Eq. (27).
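In practice the per-cluster regressions would use an SVR solver; as a rough, self-contained sketch of the surrounding machinery (FCM memberships appended as an extra regressor, one regression per cluster, membership-weighted inference), the following substitutes plain least squares for SVR. The toy data, the fixed cluster centers, and the function names are all our assumptions:

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    """FCM membership of each sample in each cluster for fixed centers:
    mu_ik = 1 / sum_j (||x_k - v_i|| / ||x_k - v_j||)^(2/(m-1))."""
    d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2)
    d = np.maximum(d, 1e-12)                       # guard x_k == v_i
    ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)                 # shape (c, nd)

def fit_fuzzy_functions(X, y, centers, m=2.0):
    """One least-squares fit per cluster on the augmented matrix
    [1, mu_i, x], i.e. memberships used as an additional regressor."""
    mu = fcm_memberships(X, centers, m)
    return [np.linalg.lstsq(
                np.column_stack([np.ones(len(X)), mu[i], X]),
                y, rcond=None)[0]
            for i in range(len(centers))]

def predict(X, centers, betas, m=2.0):
    """Membership-weighted average of the per-cluster estimates."""
    mu = fcm_memberships(X, centers, m)
    preds = np.array([np.column_stack([np.ones(len(X)), mu[i], X]) @ b
                      for i, b in enumerate(betas)])
    return (mu * preds).sum(axis=0) / mu.sum(axis=0)

# Toy usage: data generated by y = 2x + 1, two assumed cluster centers.
X = np.linspace(0.0, 1.0, 9).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0
centers = np.array([[0.25], [0.75]])
betas = fit_fuzzy_functions(X, y, centers)
yhat = predict(X, centers, betas)
```

Since the target here is linear in x, each per-cluster regression can reproduce it exactly, so the weighted average recovers the data; the value of the membership regressor only shows up when the response differs across clusters.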
Future of Fuzzy System Models

In the future, fuzzy system models are expected to be structured by type 2 fuzzy sets. For this purpose, we next present basic definitions.

Basic Definitions

Definition 1 A type 2 fuzzy set $\tilde A$ on a universe of discourse X is a fuzzy set characterized by a fuzzy membership function $\mu_{\tilde A}(x)$, where $\mu_{\tilde A}(x)$ is a mapping

$$\mu_{\tilde A}(x): X \to [0,1]^{[0,1]}\,. \qquad (28)$$

Then the type 2 fuzzy set $\tilde A$ can be characterized as follows:

$$\tilde A = \{(x, (u, f_x(u))) \mid x \in X,\ u \in J_x \subseteq [0,1]\}\,, \qquad (29)$$

where u is defined as the primary membership value, $J_x \subseteq [0,1]$ is the domain of u, and $f_x(u)$ is the secondary membership function. An alternative definition of the type 2 fuzzy set $\tilde A$, used by Mendel [28] and inspired by Mizumoto and Tanaka [30], can be given as follows. Given that x ranges over a continuous universe of discourse, the same type 2 fuzzy set can be defined as

$$\tilde A = \int_{x\in X}\Big[\int_{u\in J_x} f_x(u)/u\Big]\Big/x\,, \qquad (30)$$

and, for the discrete case, as

$$\tilde A = \sum_{x\in X}\Big[\sum_{u\in J_x} f_x(u)/u\Big]\Big/x\,. \qquad (31)$$

The secondary membership function, $f_x(u)$, can be defined as follows:

Definition 2 The secondary membership function $f_x(u)$ is a function that maps membership values of the universe of discourse x onto the unit interval [0,1]. Thus, $f_x(u)$ can be characterized as

$$f_x(u): J_x \to [0,1]\,, \quad \forall u \in J_x \subseteq [0,1]\,, \quad \forall x \in X\,. \qquad (32)$$

With the secondary membership function defined as above, the membership function of the type 2 fuzzy set $\tilde A$, $\mu_{\tilde A}(x)$, can then be written for the continuous and discrete cases, respectively, as

$$\mu_{\tilde A}(x) = \int_{u\in J_x} f_x(u)/u\,, \quad \forall x \in X\,, \qquad (33)$$

$$\mu_{\tilde A}(x) = \sum_{u\in J_x} f_x(u)/u\,, \quad \forall x \in X\,. \qquad (34)$$

An Interval Valued Type 2 Fuzzy Set (IVT2FS), which is a special case of a type 2 fuzzy set, can be defined as follows:

Definition 3 Let $\tilde A$ be a linguistic label with a type-2 membership function on the universe of discourse of base variable x, $\mu_{\tilde A}(x): X \to f_x(u)/u$, $u \in J_x$, $J_x \subseteq [0,1]$. The following condition needs to be satisfied in order to consider $\mu_{\tilde A}(x)$ an interval valued type 2 membership function:

$$f_x(u) = 1\,, \quad \forall x \in X\,, \quad \forall u \in J_x\,, \quad J_x \subseteq [0,1]\,. \qquad (35)$$

Thus, the interval valued type 2 membership function is a mapping as shown below:

$$\mu_{\tilde A}(x): X \to 1/u\,, \quad u \in J_x\,, \quad J_x \subseteq [0,1]\,. \qquad (36)$$
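As a concrete illustration of the discrete representation in Eq. (31) and the interval-valued condition of Eq. (35), a discrete type 2 fuzzy set can be stored as a nested mapping from each element x to its primary domain $J_x$ with secondary grades. The linguistic labels and all the numbers below are invented for illustration:

```python
# A discrete type 2 fuzzy set stored as {x: {u: f_x(u)}}: for each x of
# the universe, a finite primary domain J_x with secondary grades,
# as in Eq. (31).  The values are made up for illustration.
t2fs = {
    "cold": {0.0: 1.0, 0.1: 1.0, 0.2: 1.0},
    "warm": {0.4: 0.6, 0.5: 1.0, 0.6: 0.6},
    "hot":  {0.9: 1.0, 1.0: 1.0},
}

def secondary_grade(fs, x, u):
    """f_x(u): the stored grade, or zero outside the discrete J_x."""
    return fs.get(x, {}).get(u, 0.0)

def is_interval_valued(fs):
    """Eq. (35): an interval valued type 2 set has f_x(u) = 1 for
    every x and every u in J_x."""
    return all(g == 1.0 for grades in fs.values() for g in grades.values())
```

Forcing every secondary grade to 1 turns this set into the interval-valued special case; only the footprint of the primary memberships then carries information, which is what makes IVT2FS inference computationally tractable.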
General Structure of Type 2 Fuzzy System Models

In a series of papers, Mendel, Karnik and Liang [22,23,24,25] extended traditional type 1 inference methods so that they can process type 2 fuzzy sets. These studies were explained thoroughly by Mendel in [28]. The classical Zadeh and Takagi–Sugeno type 1 models are modified as type 2 fuzzy rule bases (T2Z-FR and T2TS-FR), respectively, as follows:

$$\mathop{\text{ALSO}}_{i=1}^{c}\Big[\text{IF } \mathop{\text{AND}}_{j=1}^{NV}\big(x_j \in X_j \text{ isr } \tilde A_{ji}\big) \text{ THEN } y \in Y \text{ isr } \tilde B_i\Big]\,, \qquad (37)$$

$$\mathop{\text{ALSO}}_{i=1}^{c}\Big[\text{IF } \mathop{\text{AND}}_{j=1}^{NV}\big(x_j \in X_j \text{ isr } \tilde A_{ji}\big) \text{ THEN } y_i = a_i x^T + b_i\Big]\,. \qquad (38)$$
Mendel, Karnik and Liang [22,23,24,25] assumed that the antecedent variables are separable (i.e., non-interactive). After formulating the inference for a full type 2 fuzzy system model, Karnik et al. [22,23,24,25] simplified their proposed methods for the interval-valued case. In order to identify the structure of the IVT2FS, it was assumed that the membership functions are Gaussian. A clustering method was utilized to identify the mean parameters of the Gaussian functions; however, the clustering method was not specified. It was assumed that the standard error parameters of the Gaussian membership functions are exactly known. The number of rules was assigned as eight due to the nature of their application; however, the problem of finding the suitable number of rules was not discussed in the paper. Liang and Mendel [26] proposed another method to identify the structure of an IVT2FS. It was suggested to initialize the inference parameters and then use a steepest-descent (or other optimization) method to tune these parameters of the IVT2FS. Two different approaches to the initialization phase were suggested in [26]. The partially dependent approach utilizes a type 1 fuzzy system model to provide a baseline for the type 2 fuzzy system model design. The totally independent approach starts by assigning random values to initialize the inference parameters. Liang and Mendel [26] indicated that the main challenge in their proposed tuning method is to determine the active branches. Mendel [28] indicated that several structure identification methods, such as one-pass, least-squares, back-propagation (steepest descent), singular value-QR decomposition, and iterative design methods, can be utilized in order to find the most suitable inference parameters of the type 2
fuzzy system models. Mendel [28] provided an excellent summary of the pros and cons of each method. Several other researchers, such as Starczewski and Rutkowski [37], John and Czarnecki [20,21] and Chen and Kawase [10], worked on T2-FSM. Starczewski and Rutkowski [37] proposed a connectionist structure to implement interval valued type 2 fuzzy structure and inference. It was indicated that methods such as back propagation, recursive least squares or Kalman algorithm-based methods can be used to determine the inference parameters of the structure. John and Czarnecki [20,21] extended the ANFIS structure so that it can process type 2 fuzzy sets. All of the above methods assume non-interactivity between antecedent variables. Thus, the general steps of the inference can be listed as: fuzzification, aggregation of the antecedents, implication, aggregation of the consequents, type reduction, and defuzzification. With the general structure of type 2 fuzzy system models and inference techniques, we next propose discrete interval valued type 2 rule base structures.

Discrete Interval Valued Type 2 Fuzzy Sets (DIVT2FS)

In Discrete Interval Valued Type 2 Fuzzy Sets (DIVT2FS) [48] the domain of the primary membership is discrete and the secondary membership values are fixed to 1. Thus, the proposed method utilizes discrete interval valued type 2 fuzzy sets in order to represent the linguistic values assigned to each fuzzy variable in each rule. These fuzzy sets can be mathematically defined as follows:

$$\tilde A = \int_{x\in X}\Big[\sum_{u\in J_x} 1/u\Big]\Big/x\,, \qquad (39)$$

where $x \in X$.

Game Theory and Strategic Complexity

…for $\delta > 1/3$. One can check that more complicated deviations are also worse. The second part of the definition needs to be checked as well, so we need to ensure that a player cannot do as well in terms of payoff by moving to a less complex strategy, namely a one-state machine. A one-state machine that always plays C will get the worst possible payoff, since the other machine will keep playing D against it. A one-state machine that plays D will get a payoff of 4 in periods 2, 4, . . .
or a total payoff of $4\delta/(1-\delta^2)$, as against $3\delta/(1-\delta)$. The second is strictly greater for $\delta > 1/3$. This machine gives a payoff close to 3 per stage for $\delta$ close to 1. As $\delta \to 1$, the payoff of each player goes to 3, the cooperative outcome. The paper by Abreu and Rubinstein obtains a basic result on the characterization of payoffs obtained as NEC in the infinitely repeated Prisoners' Dilemma. Recall that the "Folk Theorem" for repeated games tells us that all outcome paths giving each player a payoff per stage strictly greater than that player's minmax payoff in the stage game can be sustained by Nash equilibrium strategies. Using endogenous complexity, one can obtain a refinement: now only payoffs on a so-called "cross" are sustainable as NEC. This result is obtained from two observations. First, in any NEC of a two-player game, the number of states in the players' machines must be equal. This follows from the following intuitive reasoning (we refer readers to the original paper for the proofs). Suppose we fix the machine used by one of the players (say Player 1), so that to the other player it becomes part of the "environment". For Player 2 to calculate a best response or an optimal strategy to Player 1's given machine, it is clearly not necessary to partition past histories more finely than the other player has done in obtaining her strategy; therefore the number of states in Player 2's machine need not (and therefore will not, if there are complexity costs) exceed the number in Player 1's machine in equilibrium. The same holds true in the other direction, so the number of states must be equal. (This does not hold for more than two players.) Another way of interpreting this result is that it restates the result from Markov decision processes on the existence of an optimal "stationary" policy (that is, one depending only on the states of the environment, which are here the same as the states of the other player's machine). See also Piccione [40].
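The two discounted sums are easy to check numerically; the sketch below (plain Python, function names ours) confirms that the conforming payoff $3\delta/(1-\delta)$ overtakes the one-state defector's $4\delta/(1-\delta^2)$ exactly at $\delta = 1/3$:

```python
# Discounted-sum comparison from the text.  A one-state machine that
# always plays D earns 4 in periods 2, 4, ..., for a total of
# 4d/(1 - d^2); conforming earns 3 per stage from period 2 on,
# for 3d/(1 - d).
def defect_payoff(d):
    return 4 * d / (1 - d ** 2)

def conform_payoff(d):
    return 3 * d / (1 - d)
```

Since $3\delta/(1-\delta) > 4\delta/(1-\delta^2)$ reduces to $3(1+\delta) > 4$, the crossing point $\delta = 1/3$ falls directly out of the algebra.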
Thus there is a one-to-one correspondence between the states of the two machines. (Since the number of states is finite and the game is infinitely repeated, each machine must visit at least one of its states infinitely often.) One can strengthen this further to establish a one-to-one correspondence between actions. Suppose Player 1's machine has $a_1^t = a_1^s$, where these denote the actions taken at two distinct periods and states by Player 1, with $a_2^t \neq a_2^s$ for Player 2. Since the states at t and s are distinct for Player 1 and the actions taken are the same, the transitions must be different following the two distinct states. But then Player 1 does not need two distinct states; he can drop one and condition the transition after, say, s on the different action used by Player 2. (Recall the transition is a function of the state and the opponent's action.) But then Player 1 would be able to obtain the same payoff with a less complex machine, so the original one could not have been a NEC machine. Therefore the actions played must be some combination of $(C,C)$ and $(D,D)$ (the correspondence is between the two Cs and the two Ds) or some combination of $(C,D)$ and $(D,C)$. (By combination, we mean combination over time; for example, $(C,C)$ is played, say, 10 times for every 3 plays of $(D,D)$.) In the payoff space, sustainable payoffs are either on the line joining (3,3) and (0,0) or on the line joining the payoffs on the other diagonal; hence the evocative name chosen to describe the result: the cross of the two diagonals. While this is certainly a selection of equilibrium outcomes, it does not go as far as we would wish. We would hope that some equilibrium selection argument might deliver the cooperative outcome (3,3) uniquely (even in the limit as $\delta \to 1$), instead of the actual result obtained. There is work that does this, but it uses evolutionary arguments for equilibrium selection (see Binmore and Samuelson [9]). An alternative learning argument for equilibrium selection is used by Maenner [32]. In his model, a player tries to infer what machine is being used by his opponent and chooses the simplest automaton consistent with the observed pattern of play as his model of his opponent. A player then chooses a best response to this inference. It turns out complexity is not sufficient to pin down an inference, and one must use optimistic or pessimistic rules to select among the simplest inferences. One of these gives only $(D,D)$ repeated, whilst the other reproduces the Abreu–Rubinstein NEC results. Piccione and Rubinstein [41] show that the NEC profile of two-player repeated extensive form games is unique if the stage game is one of perfect information. This unique equilibrium involves all players playing their one-shot myopic non-cooperative actions at every stage.
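The state-dropping argument above is essentially Moore-machine minimization: states with the same output whose transitions land in equivalent blocks can be merged. A minimal partition-refinement sketch (our own construction, not taken from the papers surveyed) shows a redundant three-state grim trigger collapsing to two states:

```python
def minimize_moore(states, out, delta, inputs=("C", "D")):
    """Size of the minimal Moore machine equivalent to (states, out, delta):
    partition refinement starting from the output partition, repeatedly
    splitting any block whose members transition into different blocks
    on some input."""
    block = {s: out[s] for s in states}
    while True:
        sig = {s: (block[s],) + tuple(block[delta[(s, a)]] for a in inputs)
               for s in states}
        relabel = {v: i for i, v in enumerate(sorted(set(sig.values())))}
        new = {s: relabel[sig[s]] for s in states}
        if all((new[s] == new[t]) == (block[s] == block[t])
               for s in states for t in states):
            return len(set(new.values()))   # partition is stable
        block = new

# A redundant grim trigger: two cooperative states q0, q1 that cycle
# between each other on C, plus an absorbing defection state q2.
states = (0, 1, 2)
out = {0: "C", 1: "C", 2: "D"}
delta = {(0, "C"): 1, (0, "D"): 2,
         (1, "C"): 0, (1, "D"): 2,
         (2, "C"): 2, (2, "D"): 2}
```

Here q0 and q1 play the same action and move to equivalent states on every input, so the minimization merges them: the machine's complexity, in the counting-states sense used above, is 2, not 3.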
This is a strong selection result and involves stage game strategies not being observable (only the path of play is) as well as the result on the equilibrium numbers of states being equal in the two players’ machines. In repeated games with more than two players or with more than two actions at each stage the multiplicity problem may be more acute than just not being able to select uniquely a “cooperative outcome”. In some such games complexity by itself may not have any bite and the Folk Theorem may survive even when the players care for the complexity of their strategies. (See Bloise [12] who shows robust examples of two-player repeated games with three actions at each stage such that every individually rational payoff can be sustained as a NEC if players are sufficiently patient.)
Exogenous Complexity

We now consider the different approach taken by Neyman [35,36], Ben-Porath [6,7], Zemel [50] and others. We shall confine ourselves to the papers by Neyman and Zemel on the Prisoners' Dilemma, without discussing the more general results these authors and others have obtained. Neyman's approach treats complexity as exogenous. Let Player i be restricted to use strategies/automata whose number of states does not exceed $m_i$. He also considers finitely repeated games, unlike the infinitely repeated games we have discussed up to this point. With the stage game being the Prisoners' Dilemma and the number of repetitions being T (for convenience, this includes the first time the game is played), we can write the game being considered as $G^T(m_1, m_2)$. Note that without the complexity restrictions, the finitely repeated Prisoners' Dilemma has a unique Nash equilibrium outcome path (and a unique subgame perfect equilibrium): $(D,D)$ in all stages. Thus sustaining cooperation in this setting is obtaining non-equilibrium behavior, though one that is frequently observed in real life. This approach therefore is an example of bounded rationality being used to explain observed behavior that is not predicted in equilibrium. If the complexity restrictions are severe, it turns out that $(C,C)$ in each period is an equilibrium. For this, we need $2 \le m_1, m_2 \le T - 1$. To see this, consider the grim trigger strategy mentioned earlier, representable as a two-state automaton, and let $T = 3$. Here $\lambda(q_1) = C$, $\lambda(q_2) = D$, $\mu(q_1, C) = q_1$, $\mu(q_1, D) = q_2$, and $\mu(q_2, C \text{ or } D) = q_2$. If each player uses this strategy, $(C,C)$ will be observed. Such a pair of strategies is clearly not a Nash equilibrium: given Player 1's strategy, Player 2 can do better by playing D in stage 3.
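The claim that no two-state machine can profitably deviate against grim trigger in $G^3(2,2)$ can be verified by exhaustive enumeration. The following sketch is our own illustration; the machine encoding is an assumption, and the stage payoffs 3, 4, 1, 0 are the ones used in the surrounding text. It checks that the best two-state reply earns 9 (three rounds of mutual cooperation), while a third state unlocks the stage-3 defection worth 10:

```python
from itertools import product

# Row player's stage payoffs: (own, opponent) -> payoff.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 4, ("D", "D"): 1}
ACTS = ("C", "D")

def run(m1, m2, T):
    """Play two Moore machines for T stages; return player 2's total
    payoff.  A machine is (start, out, delta), with delta keyed by
    (own_state, opponent_action)."""
    (s1, o1, d1), (s2, o2, d2) = m1, m2
    q1, q2, total = s1, s2, 0
    for _ in range(T):
        a1, a2 = o1[q1], o2[q2]
        total += PAYOFF[(a2, a1)]
        q1, q2 = d1[(q1, a2)], d2[(q2, a1)]
    return total

def all_machines(n):
    """Enumerate every n-state machine (states 0..n-1, start state 0)."""
    states = range(n)
    for outs in product(ACTS, repeat=n):
        for trans in product(states, repeat=2 * n):
            out = dict(zip(states, outs))
            delta = {(q, a): trans[2 * q + i]
                     for q in states for i, a in enumerate(ACTS)}
            yield (0, out, delta)

# Grim trigger as a two-state machine for player 1.
GRIM = (0, {0: "C", 1: "D"},
        {(0, "C"): 0, (0, "D"): 1, (1, "C"): 1, (1, "D"): 1})

best2 = max(run(GRIM, m, 3) for m in all_machines(2))
best3 = max(run(GRIM, m, 3) for m in all_machines(3))
```

The enumeration confirms the counting argument in the text: along the all-C history a two-state machine cannot produce the output sequence C, C, D, so the deviation that defects only in the last stage requires a third state.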
But if Player 2 defects in the second stage, by choosing a two-state machine where $\mu(q_1, C) = q_2$, he will gain 1 in the second stage and lose 3 in the third stage as compared to the machine listed above, so he is worse off. But defecting in stage 3 requires an automaton with three states: two states in which C is played and one in which D is played. The transitions in state $q_1$ will be similar but, if $q_2$ is the second cooperative state, the transition from $q_2$ to the defect state will take place no matter whether the other player plays C or D. However, automata with three states violate the constraint that the number of states be no more than 2, so the profitable deviation is out of reach. Whilst this is easy to see, it is not clear what happens when the complexity is high. Neyman shows the following result: for any integer k, there exists a $T_0$ such that for all $T \ge T_0$ and $T^{1/k} \le m_1, m_2 \le T^k$, there is a mixed strategy
equilibrium of G_T(m_1, m_2) in which the expected average payoff to each player is at least 3 − 1/k. The basic idea is that rather than playing (C, C) at each stage, players are required to play a complex sequence of C and D, and keeping track of this sequence uses up a sufficient number of states in the automaton, so that profitable deviations again hit the constraint on the number of states. But since D cannot be avoided on the equilibrium path, only something close to (C, C) each period can be obtained, rather than (C, C) all the time. Zemel's paper adds a clever twist to this argument by introducing communication. In his game, each player chooses two actions at each stage: either C or D as before, and a message to be communicated. The message does not directly affect payoffs as the choice of C or D does. The communication requirements are now made sufficiently stringent, and departure from them is considered a deviation, so that once again the states "left over" to count up to N are inadequate in number and (C, C) can once again be played in each stage/period. This is an interesting explanation of the rigid "scripts" that many have observed to be followed, for example, in negotiations. Neyman [36] surveys his own work and that of Ben Porath [6,7]. He also generalizes his earlier work on the finitely-repeated Prisoners' Dilemma to show how small the complexity bounds would have to be in order to obtain outcomes outside the set of (unconstrained) equilibrium payoffs in the finitely-repeated, normal-form game (just as (C, C) is not part of an unconstrained equilibrium outcome path in the Prisoners' Dilemma). Essentially, if the complexity permitted grows exponentially or faster with the number of repetitions, the equilibrium payoff sets of the constrained and the unconstrained games coincide. For sub-exponential growth, a version of the Folk theorem is proved for two-person games.
The first result says: For every game G in strategic form, with m_i being the bound on the complexity of i's strategy and T the number of times the game G is played, there exists a constant c such that if m_i ≥ exp(cT), then E(G_T) = E(G_T(m_1, m_2)), where E(·) is the set of equilibrium payoffs in the game concerned. The second result, which generalizes the Prisoners' Dilemma result already stated, considers a sequence of triples (m_1(n), m_2(n), T(n)) for a two-player strategic form game, with m_2 ≥ m_1, and shows that the lim inf of the set of equilibrium payoffs of the automata game as n → ∞ includes essentially the strictly individually rational payoffs of the stage game if m_1(n) → ∞ and (log m_1(n))/T(n) → 0 as n → ∞. Thus a version of the
Game Theory and Strategic Complexity
Folk theorem holds provided the complexity of the players' machines does not grow too fast with the number of repetitions. Complexity and Bargaining Complexity and the Unanimity Game The well-known alternating-offers bargaining model of Rubinstein has two players alternating in making proposals and responding to proposals. Each period or unit of time consists of one proposal and one response. If the response is "reject", the player who rejects makes the next proposal, but in the following period. Since there is discounting with discount factor δ per period, a rejection has a cost. The unanimity game we consider is a multiperson generalization of this bargaining game, with n players arranged in a fixed order, say 1, 2, 3, …, n. Player 1 makes a proposal on how to divide a pie of size unity among the n people; Players 2, 3, …, n respond sequentially, either accepting or rejecting. If everyone accepts, the game ends. If someone rejects, Player 2 now gets to make a proposal, but in the next period. The responses to Player 2's proposal are made sequentially by Players 3, 4, 5, …, n, 1. If Player i gets a share x_i in an eventual agreement at time t, his payoff is δ^(t−1) x_i. Avner Shaked had shown in 1986 that the unanimity game had the disturbing feature that all individually rational (that is, non-negative payoffs for each player) outcomes could be supported as subgame perfect equilibria. Thus the sharp result of Rubinstein [43], who found a unique subgame perfect equilibrium in the two-player game, stood in complete contrast with the multiplicity of subgame perfect equilibria in the multiplayer game. Shaked's proof involved complex changes in the expectations of the players if a deviation from the candidate equilibrium were to be observed. For example, in the three-player game with common discount factor δ, the three extreme points (1, 0, 0), (0, 1, 0), (0, 0, 1) sustain one another in the following way.
Suppose Player 1 is to propose (0, 1, 0), which is not a very sensible offer for him or her to propose, since it gives everything to the second player. If Player 1 deviates and proposes, say, ((1 − δ)/2, δ, (1 − δ)/2), then it might be reasoned that Player 2 would have no incentive to reject, because in any case he or she cannot get more than 1 in the following period (worth at most δ today), and Player 3 would surely prefer a positive payoff to 0. However, there is a counter-argument. In the subgame following Player 1's deviation, Player 3's expectations have been raised so that he (and everyone else, including Player 1) now expects the outcome to be (0, 0, 1), instead of the earlier expected outcome. For sufficiently
high discount factor, Player 3 would reject Player 1's insufficiently generous offer. Thus Player 1 would have no incentive to deviate. Player 1 is thus in a bind; if he offers Player 2 less than δ and offers Player 3 more in the deviation, the expectation that the outcome next period will be (0, 1, 0) remains unchanged, so now Player 2 rejects his offer. So no deviation is profitable, because each deviation generates an expectation of future outcomes, an expectation that is confirmed in equilibrium. (This is what equilibrium means.) Summarizing, (0, 1, 0) is sustained as follows: Player 1 offers (0, 1, 0), Player 2 accepts any offer of at least 1 and Player 3 any offer of at least 0. If one of them rejects Player 1's offer, the next player in order offers (0, 1, 0) and the others accept. If any proposer, say Player 1, deviates from the offer (0, 1, 0) to (x_1, x_2, x_3), the player with the lower of {x_2, x_3} rejects. Suppose it is Player i who rejects. In the following period, the offer made gives 1 to Player i and 0 to the others, and this is accepted. Various attempts were made to get around the continuum-of-equilibria problem in bargaining games with more than two players; most of them involved changing the game. (See [15,16] for a discussion of this literature.) An alternative to changing the game might be to introduce a cost for this additional complexity, in the belief that players who value simplicity will end up choosing simple, that is history-independent, strategies. This seems to be a promising approach because it is clear from Shaked's construction that the large number of equilibria results from the players choosing strategies that are history-dependent. In fact, if the strategies are restricted to those that are history-independent (also referred to as stationary or Markov), then it can be shown (see Herrero [27]) that the subgame perfect equilibrium is unique and induces equal division of the pie as δ → 1.
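The rejection step in Shaked's deviation argument above can be checked numerically. This is a sketch of the arithmetic only: after Player 1's deviating offer ((1 − δ)/2, δ, (1 − δ)/2), Player 3 expects the outcome (0, 0, 1) next period, worth δ to him today.

```python
# Player 3 rejects Player 1's deviating offer whenever the discounted value
# of receiving the whole pie next period exceeds the share offered now.
def player3_rejects(delta):
    offer_to_3 = (1 - delta) / 2      # Player 3's share in the deviating offer
    continuation_value = delta * 1.0  # full pie next period, discounted
    return continuation_value > offer_to_3

for delta in (0.2, 0.5, 0.9):
    print(delta, player3_rejects(delta))
```

Rejection occurs exactly when δ > (1 − δ)/2, i.e. δ > 1/3, which is what "for sufficiently high discount factor" means here.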
The two papers ([15,16]) in fact seek to address the issue of complex strategies with players having a preference for simplicity, just as in Abreu and Rubinstein. However, now we have a game of more than two players, and a single extensive-form game rather than a repeated game as in Abreu–Rubinstein. It was natural that the framework had to be broadened somewhat to take this into account. For each of the n players playing the unanimity game, we define a machine, or an implementation of the strategy, as follows. A stage of the game is defined to be n periods, such that if a stage were to be completed, each player would play each role at most once. A role could be proposer, or (n − 1)th responder, or (n − 2)th responder, and so on up to first responder (the last role would occur in the period before the player concerned had to make another proposal). An outcome of a stage is defined as
a sequence of offers and responses, for example e = (x, A, A, R, y, R, z, A, R, b, A, A, A) in a four-player game, where (x, y, z, b) are the proposals made in the four periods and A and R refer to accept and reject respectively. From the point of view of the first player to propose (for convenience, let us call him Player 1), he makes an offer x, which is accepted by Players 2 and 3 but rejected by Player 4. Now it is Player 2's turn to offer, but this offer, y, is rejected by the first responder, Player 3. Player 1 gets to play as second responder in the next period, where he rejects Player 3's proposal z. In the last period of this stage, a proposal b is made by Player 4 and everyone accepts (including Player 1 as first responder). Any partial history within a stage is denoted by s. For example, when Player 2 makes an offer, he does so after a partial history s = (x, A, A, R). Let the set of possible outcomes of a stage be denoted by E and the set of possible partial histories by S. Let Q_i denote the set of states used in the ith player's machine M_i. The output mapping is given by λ_i : S × Q_i → Δ, where Δ is the set of possible actions (that is, the set of possible proposals, plus accept or reject). The transition between states now takes place at the end of each stage, so the transition mapping is given as μ_i : E × Q_i → Q_i. As before, in the Abreu–Rubinstein setup, there is an initial state q_{initial,i} specified for each player. There is also a termination state F, which is supposed to indicate agreement. Once in the termination state, players play the null action and make transitions to this state. Note that our formulation of a strategy naturally uses a Mealy machine. The output mapping λ_i(·, ·) has two arguments: the state of the machine and the input s, which lists the outcomes of previous moves within the stage. The transitions take place at the end of the stage.
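The stage-based Mealy machine described above can be sketched in code. The class and the stationary "greedy" machine below are illustrative constructions, not from the original papers: outputs depend on the current state and the partial stage history s, while transitions fire only on the completed stage outcome e.

```python
# Minimal sketch of a stage-based Mealy machine for the unanimity game.
class Machine:
    def __init__(self, states, output, transition, initial):
        self.states = states
        self.output = output          # (partial history s, state q) -> action
        self.transition = transition  # (stage outcome e, state q) -> next state
        self.state = initial

    def act(self, partial_history):
        # called at every move within a stage; the state does not change here
        return self.output(tuple(partial_history), self.state)

    def end_stage(self, stage_outcome):
        # called once per stage, when the outcome e is complete
        self.state = self.transition(tuple(stage_outcome), self.state)

# A one-state (stationary) machine for a three-player game: as proposer
# (empty partial history) demand the whole pie; as responder, reject.
greedy = Machine(
    states={'q'},
    output=lambda s, q: (1.0, 0.0, 0.0) if not s else 'R',
    transition=lambda e, q: 'q',
    initial='q')

print(greedy.act([]))                        # proposal: (1.0, 0.0, 0.0)
print(greedy.act([(0.5, 0.25, 0.25), 'A']))  # response to a standing offer: 'R'
```

A pair (or profile) of such machines induces a strategy profile whose continuation play is the same at the start of every stage, which is the point of this formulation.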
The benefit of using this formulation is that the continuation game is the same at the beginning of each stage. In Chatterjee and Sabourian [16], we investigate the effects of modifying this formulation, including the effects of having a sub-machine to play each role. The different formulations can all implement the same strategies, but the complexities in terms of various measures could differ. We refer the reader to that paper for details, but emphasize that in the general unanimity game the results from the other formulations are similar to those from the one developed here, though they could differ in special cases, such as three-player games. We now consider a machine game, in which players first choose machines and then the machines play the unanimity game, in analogy with Abreu–Rubinstein. Using the same lexicographic utility, with complexity coming after bargaining payoffs, what do we find for Nash equilibria of the machine game?
As it turns out, the addition of complexity costs in this setting has some bite, but not much. In particular, any division of the pie can be sustained in some Nash equilibrium of the machine game. Perpetual disagreement can, in fact, be sustained by a stationary machine, that is, one that makes the same offers and responses each time, irrespective of past history. Nor can we prove, for general n-player games, that the equilibrium machines will be one-state. (A three-player counter-example exists in [16]; it does not appear possible to generate such an example in games lasting fewer than thirty periods.) For two-player games, the result that machines must be one-state in equilibrium can be shown neatly ([16]); this is another illustration that, in this particular area, there is a substantial increase in analytical difficulty in going from two to three players. One reason why complexity does not appear important here is that the definition of complexity used is too restrictive. Counting the number of states is fine, so long as we do not consider how complex a response might be for partial histories within a stage. The next attempt at a solution is based on this observation. We devise the following definition of complexity: given the machines and their states, if one machine makes the same response to different partial stage histories in different states and another machine makes different responses, then the second one is more complex (given that the machines are identical in all other respects). We refer to this notion as response complexity. (In [15] the concept of response complexity is in fact stated in terms of the underlying strategy rather than in terms of machines.) It captures the intuition that counting states is not enough; two machines could have the same number of states, for example because each generated the same number of distinct offers, but the complexity of responses in one machine could be much lower than that in the other.
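The response-complexity comparison can be sketched as follows; all names and the sample histories below are illustrative, not from the original papers.

```python
# A machine whose response varies across partial stage histories (holding the
# state fixed) is, other things equal, more complex than one whose response
# is uniform.
def response_varies(output, state, partial_histories):
    """True if the output mapping gives different responses across histories."""
    responses = {output(h, state) for h in partial_histories}
    return len(responses) > 1

offer = (0.4, 0.3, 0.3)
# two partial histories in which the player responds to the same standing offer
histories = [(offer,), (offer, 'A')]

simple_output = lambda h, q: 'A'                            # uniform response
complex_output = lambda h, q: 'A' if len(h) == 1 else 'R'   # history-dependent

print(response_varies(simple_output, 'q', histories))   # False: less complex
print(response_varies(complex_output, 'q', histories))  # True: more complex
```

Both machines here would count as equally complex under the state-counting measure alone; only the response criterion separates them.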
Note that this notion would only arise in extensive-form games. In normal-form games, counting states could be an adequate measure of complexity. Nor is this notion of complexity derivable from notions of transition complexity, due to Banks and Sundaram, for example, which also apply in normal-form games. The main result of Chatterjee and Sabourian [15] is that this new aspect of complexity enables us to limit the amount of delay that can occur in equilibrium and hence to infer that only one-state machines are equilibrium machines. The formal proofs, using two different approaches, are available in Chatterjee and Sabourian [15,16]. We mention the basic intuition behind these results here. Suppose, in the three-player game, there is an agreement in period 4 (that is, in the second stage). Why doesn't this agreement
take place in period 1 instead? It must be because, if the same offer and responses were seen in period 1, some player would reject the offer. But of course he or she never actually has to do so, because the required offer never happens in period 1. A strategy that accepts the offer in period 4 but rejects it off the equilibrium path in period 1 must, however, be more complex, by our definition, than one that always accepts it whenever it might happen, on or off the expected path. Repeated application of this argument by backwards induction gives the result. (The details are more complicated, but they are in the papers cited above.) Note that this uses the feature of the definition that two machines might have the same number of states and yet one could be simpler than the other. It is interesting, as mentioned earlier, that for two players one can obtain an analogous result without invoking the response-simplicity criterion, but from three players on this criterion is essential. The above result (equilibrium machines have one state each and there are no delays beyond the first stage) is still not enough to refine the set of equilibria to a single allocation. In order to do this, we consider machines that can make errors/trembles in output. As the error goes to zero, we are left with perfect equilibria of our game. With one-state machines, the only subgame perfect equilibria are the ones that give equal division of the pie as δ → 1. Thus a combination of two techniques, one essentially recognizing that players can make mistakes and the other that players prefer simpler strategies if the payoffs are the same as those given by a more complex strategy, resolves the problem of multiplicity of equilibria in the multiperson bargaining game. As we mentioned before, the introduction of errors ensures that the equilibrium strategies are credible at every history.
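For the one-state (stationary) benchmark, the equilibrium shares in the n-player unanimity game can be computed directly. This is the standard Rubinstein-style calculation, sketched here as an illustration under the usual immediate-agreement logic (each responder is offered the discounted value of his continuation), not the article's own derivation.

```python
# With stationary strategies, the player who will propose j periods from now
# must be offered delta^j times the proposer's share, and the shares sum to 1:
# proposer's share x_0 = (1 - delta) / (1 - delta^n), and x_j = delta^j * x_0.
def stationary_shares(n, delta):
    x0 = (1 - delta) / (1 - delta ** n)
    return [delta ** j * x0 for j in range(n)]

print(stationary_shares(3, 0.9))    # proposer's share is largest
print(stationary_shares(3, 0.999))  # close to equal division as delta -> 1
```

As δ → 1 every share approaches 1/n, which is the equal-division limit referred to above.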
We could also take the more direct (and easier) route of obtaining the uniqueness result with complexity costs by considering NEC strategies that are subgame perfect in the underlying game (PEC), as done in [15]. Then, since the history-independent subgame perfect equilibrium of the game is unique and any NEC automaton profile has one state, and hence is history-independent, it follows immediately that the PEC is unique and induces equal division as δ → 1. Complexity and Repeated Negotiations In addition to standard repeated games and standard bargaining games, multiplicity of equilibria often appears in dynamic repeated interactions in which a repeated game is superimposed on an alternating-offers bargaining game. For instance, consider two firms, in an ongoing vertical relationship, negotiating the terms of a merger. Such situations have been analyzed in several "negotiation models" by Busch and Wen [13], Fernandez and Glazer [18] and Haller and Holden [25]. These models can be interpreted as combining the features of both repeated and alternating-offers bargaining games. In each period, one of the two players first makes an offer on how to divide the total available periodic (flow) surplus; if the offer is accepted, the game ends with the players obtaining the corresponding payoffs in the current period and every period thereafter. If the offer is rejected, they play some normal-form game to determine their flow payoffs for that period, and then the game moves on to the next period, in which the same play continues with the players' bargaining roles reversed. One can think of the normal-form game played in the event of a rejection as a "threat game" in which a player takes actions that could punish the other player by reducing his total payoffs. If the bargaining stage did not exist, the game would be a standard repeated normal-form game. Even with bargaining and the prospect of permanent exit, the negotiation model still admits a large number of equilibria, like standard repeated games. Some of these equilibria involve delay in agreement (even perpetual disagreement) and inefficiency, while some are efficient. Lee and Sabourian [31] apply complexity considerations to this model. As in Abreu and Rubinstein [1] and others, the players choose among automata, and the equilibrium notions are those of NEC and PEC. One important difference, however, is that in this paper the authors do not assume the automata to be finite. Also, the paper introduces a new machine specification that formally distinguishes between the two roles, proposer and responder, played by each player in a given period. Complexity considerations select only efficient equilibria in the negotiation model when players are sufficiently patient.
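A back-of-envelope sketch of why delayed agreement is inefficient in such negotiation models; the flow numbers here are assumptions for illustration, not taken from the papers cited. Each period of disagreement replaces the full flow surplus (normalized to 1) with the threat-game flow payoffs, assumed to sum to 0.5.

```python
# Discounted total surplus when agreement is reached at a given period;
# before agreement the players receive the (smaller) threat-game flow sum.
def total_surplus(agreement_period, delta, threat_sum=0.5, horizon=200):
    disagree = sum(delta ** t * threat_sum for t in range(agreement_period))
    agree = sum(delta ** t * 1.0 for t in range(agreement_period, horizon))
    return disagree + agree

delta = 0.9
print(total_surplus(0, delta))  # immediate agreement
print(total_surplus(5, delta))  # five periods of costly disagreement
```

Each extra period of delay destroys δ^t (1 − threat_sum) of surplus, which is why the selection of efficient equilibria matters here.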
First, it is shown that if an agreement occurs in some finite period as a NEC outcome, then it must occur within the first two periods of the game. This is because, if a NEC induced an agreement beyond the first two periods, one of the players would be able to drop the last period's state of his machine without affecting the outcome of the game. Second, given sufficiently patient players, every PEC in the negotiation model that induces perpetual disagreement is at least long-run almost efficient; that is, the game must reach a finite date from which the continuation game is almost efficient. Thus, these results take the study of complexity in repeated games a step further than the previous literature, in which complexity or bargaining alone has produced only limited selection results. While, as we discussed above, many inefficient equilibria survive complexity refinement,
Lee and Sabourian [31] demonstrate that complexity and bargaining in tandem ensure efficiency in repeated interactions. Complexity considerations also allow Lee and Sabourian to highlight the role of transaction costs in the negotiation game. Transaction costs take the form of paying a cost to enter the bargaining stage of the negotiation game. In contrast to the efficiency result in the negotiation game with complexity costs, Lee and Sabourian show that introducing transaction costs into the negotiation game dramatically alters the selection result from efficiency to inefficiency. In particular, they show that, for any discount factor and any transaction cost, every PEC in the costly negotiation game induces perpetual disagreement if the stage-game normal form (played after any disagreement) has a unique Nash equilibrium. Complexity, Market Games and the Competitive Equilibrium There has been a long tradition in economics of trying to provide a theory of how a competitive market with many buyers and sellers operates. The concept of competitive (Walrasian) equilibrium (see Debreu [17]) is a simple description of such markets. In such an equilibrium each trader rationally chooses the amount he wants to trade, taking the prices as given, and the prices are set (or adjust) to ensure that the total amount demanded is equal to the total amount supplied. The important feature of this set-up is that agents assume they cannot influence (set) the prices, which is often justified by appealing to the idea that each individual agent is small relative to the market. There are conceptual as well as technical problems associated with such a justification. First, if no agent can influence the prices, then who sets them?
Second, even in a large but finite market, a change in the behavior of a single agent may affect the decisions of some others, which in turn might influence the behavior of yet other agents, and so on; thus the market as a whole may end up being affected by the decision of a single individual. Game-theoretic analyses of markets have tried to address these issues (e.g. see [21,47]). This has turned out to be a difficult task, because the strategic analysis of markets, in contrast to the simple and elegant model of competitive equilibrium, tends to be complex and intractable. In particular, dynamic market games have many equilibria, in which a variety of different kinds of behavior are sustained by threats and counter-threats. More than 60 years ago Hayek [26] noted that competitive markets are simple mechanisms in which economic agents need to know only their own endowments, preferences and technologies and the vector of prices at which trade takes place. In such environments, economic agents maximizing utility subject to constraints make efficient choices in equilibrium. Below we report some recent work which suggests that the converse might also be true: if rational agents have, at least at the margin, an aversion to complex behavior, then their maximizing behavior will result in simple behavioral rules and thereby in a perfectly competitive equilibrium (Gale and Sabourian [22]). Homogeneous Markets In a seminal paper, Rubinstein and Wolinsky [46], henceforth RW, considered a market for a single indivisible good in which a finite number of homogeneous buyers and homogeneous sellers are matched in pairs and bargain over the terms of trade. In their set-up, each seller has one unit of an indivisible good and each buyer wants to buy at most one unit of the good. Each seller's valuation of the good is 0 and each buyer's valuation is 1. Time is divided into discrete periods, and at each date buyers and sellers are matched randomly in pairs, with one member of the pair randomly chosen to be the proposer and the other the responder. In any such match the proposer offers a price p ∈ [0, 1] and the responder accepts or rejects the offer. If the offer is accepted, the two agents trade at the agreed price p and leave the market, with the seller receiving a payoff p and the buyer in the trade obtaining a payoff 1 − p. If the offer is rejected, the pair return to the market and the process continues. RW further assume that there is no discounting, to capture the idea that there is no friction (cost of waiting) in the market. Assuming that the number of buyers and sellers is not the same, RW showed that this dynamic matching and bargaining game has, in addition to a perfectly competitive outcome, a large set of other subgame perfect equilibrium outcomes, a result reminiscent of the Folk Theorem for repeated games.
To see the intuition for this, consider the case in which there is one seller s and many buyers. Since there are more buyers than sellers, the price of 1, at which the seller receives all the surplus, is the unique competitive equilibrium price; furthermore, since there are no frictions, p = 1 seems to be the most plausible price. RW's precise result, however, establishes that for any price p* ∈ [0, 1] and any buyer b* there is a subgame perfect equilibrium that results in s and b* trading at p*. The idea behind the result is to construct an equilibrium strategy profile such that buyer b* is identified as the intended recipient of the good at a price p*. This means that the strategies are such
that (i) when s meets b*, whichever of the two is chosen as the proposer offers the price p* and the responder accepts; (ii) when s is the proposer in a match with some buyer b ≠ b*, s offers the good at a price of p = 1 and b rejects; and (iii) when a buyer b ≠ b* is the proposer, he offers to buy the good at a price of p = 0 and s rejects. These strategies produce the required outcome. Furthermore, the equilibrium strategies make use of the following punishment strategies to deter deviations. If the seller s deviates by proposing to a buyer b a price p ≠ p*, b rejects this offer and the play continues with b becoming the intended recipient of the item at a price of zero. Thus, after b's rejection, the strategies are the same as those given earlier, with the price zero in place of p* and buyer b in place of buyer b*. Similarly, if a buyer b deviates by offering a price p ≠ p*, then the seller rejects, another buyer b′ ≠ b is chosen to be the intended recipient, and the price at which the unit is traded changes to 1. Further deviations from these punishment strategies can be treated in an exactly similar way. The strong impression left by RW is that indeterminacy of equilibrium is a robust feature of dynamic market games and that, in particular, there is no reason to expect the outcome to be perfectly competitive. However, the strategies required to support the family of equilibria in RW are quite complex. In particular, when a proposer deviates, the strategies are tailor-made so that the responder is rewarded for rejecting the deviating proposal. This requires coordinating on a large amount of information, so that at every information set the players know (and agree on) what constitutes a deviation. In fact, RW show that if the amount of information available to the agents is strictly limited, so that the agents do not recall the history of past play, then the only equilibrium outcome is the competitive one. This suggests that the competitive outcome may result if agents use simple strategies.
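On the equilibrium path, the strategy profile (i)-(iii) reduces to a simple trading rule; the sketch below is an illustration of that path (the buyer names and the match sequence are hypothetical, and the proposer's identity does not matter since in case (i) either proposer offers p*).

```python
# Equilibrium-path outcome of the RW strategy profile: every match with a
# buyer other than b_star ends in rejection (the proposer demands 1 or
# offers 0); the first match with b_star trades at p_star.
def rw_outcome(match_sequence, b_star, p_star):
    """match_sequence: buyers matched with the seller s, in order."""
    for buyer in match_sequence:
        if buyer == b_star:
            return buyer, p_star  # case (i): trade at p_star, accepted
        # cases (ii)/(iii): deliberately unacceptable offer, rejected
    return None                   # no trade along this (finite) path

print(rw_outcome(['b2', 'b1', 'b3'], 'b3', 0.25))
```

The punishment strategies described above are what make this path incentive-compatible; they are history-dependent and are exactly the complexity that S's refinement removes.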
Furthermore, the equilibrium strategies described in RW to support non-competitive outcomes are particularly unattractive because they require all players, including those buyers who do not end up trading, to follow complex non-stationary strategies in order to support a non-competitive outcome. But buyers who do not trade, and so receive zero payoff on the equilibrium path, could always obtain at least zero by following a less complex strategy than the ones specified in RW's construction. Thus, RW's construction of non-competitive equilibria is not robust if players prefer, at least at the margin, a simpler strategy to a more complex one. Following the above observation, Sabourian [47], henceforth S, addresses the role of complexity (simplicity) in sustaining a multiplicity of non-competitive equilibria in RW's model. The concept of complexity in S is similar
to that in Chatterjee and Sabourian [15]. It is defined by a partial ordering on the set of individual strategies (or automata) that, very informally, satisfies the following: if two strategies are otherwise identical except that in some role the second strategy uses more information than that available in the current period of bargaining while the first uses only the information available in the current period, then the second strategy is said to be more complex than the first. S also introduces complexity costs lexicographically into the RW game and shows that any PEC is history-independent and induces the competitive outcome, in the sense that all trades take place at the unique competitive price of 1. Informally, S's conclusion in the case of a single seller s and many buyers follows from the following steps. First, since trading at the competitive price of 1 is the worst outcome for a buyer and the best outcome for the seller, by appealing to complexity-type reasoning it can be shown that in any NEC a trader's response to a price offer of 1 is always history-independent: he either always rejects 1 or always accepts 1. For example, if in the case of a buyer this were not so, then, since accepting 1 is a worst possible outcome, he could economize on complexity and obtain at least the same payoff by adopting another strategy that is otherwise the same as the equilibrium strategy except that it always rejects 1. Second, in any non-competitive NEC in which s receives a payoff of less than 1, there cannot be an agreement at a price of 1 between s and a buyer at any history. For example, if at some history a buyer is offered p = 1 and he accepts, then by the first step the buyer should accept p = 1 whenever it is offered; but this is a contradiction, because it means that the seller can guarantee himself a payoff of one by waiting until he has a chance to make a proposal to this buyer.
Third, in any non-competitive PEC the continuation payoffs of all buyers are positive at every history. This follows immediately from the previous step, because if there is no trade at p = 1 at any history, each buyer can always obtain a positive payoff by offering the seller more than he can obtain in any subgame. Finally, because of the competition between the buyers (there is one seller and many buyers), in any subgame perfect equilibrium there must be a buyer with a zero continuation payoff after some history. To illustrate the basic intuition for this claim, let m be the worst continuation payoff for s over all histories, and suppose that there exists a subgame at which s is the proposer in a match with a buyer b and the continuation payoff of s at this subgame is m. Then if at this subgame s proposes m + ε (for some ε > 0), b must reject (otherwise s could get more than m). Since the total surplus
is 1, b must obtain at least 1 − m − ε in the continuation game in order to reject s's offer, and s gets at least m; this implies that the continuation payoff of every buyer b′ ≠ b after b's rejection is less than ε. The claim follows by making ε arbitrarily small (and by appealing to the finiteness of the set of buyers). But the last two claims contradict each other unless the equilibrium is competitive. This establishes the result for the case in which there is one seller and many buyers. The case of a market with more than one seller is established by induction on the number of sellers. The matching technology in the above model is random. RW also consider another market game in which the matching is endogenous: at each date each seller (the short side of the market) chooses his trading partner. Here, they show that non-competitive outcomes and multiplicity of equilibria survive even when the players discount the future. By strengthening the notion of complexity, S also shows that in the endogenous matching model of RW the competitive outcome is the only equilibrium outcome if complexity considerations are present. These results suggest that perfectly competitive behavior may result if agents have, at least at the margin, preferences for simple strategies. Unfortunately, both RW and S study too simple a market set-up: for example, it is assumed that the buyers are all identical, and similarly for the sellers, and each agent trades at most one unit of the good. Do the conclusions extend to richer models of trade? Heterogeneous Markets There are good reasons to think that it may be too difficult (or even impossible) to establish conclusions similar to those of S in a richer framework. For example, consider a heterogeneous market for a single indivisible good, in which buyers (and sellers) have a range of valuations of the good, each buyer wants at most one unit of the good and each seller has one unit of the good for sale. In this case the analysis of S will not suffice.
First, in the homogeneous market of RW, except for the special case where the number of buyers is equal to the number of sellers, the competitive equilibrium price is either 0 or 1 and all of the surplus goes to one side of the market. S’s selection result crucially uses this property of the competitive equilibrium. By contrast, in a heterogeneous market, in general there will be agents receiving positive payoffs on both sides of the market in a competitive equilibrium. Therefore, one cannot justify the competitive outcome simply by focusing on extreme outcomes in which there is no surplus for one party from trade. Second, in a homogeneous market individually rational trade is by definition efficient. This may not be the case in a heterogeneous market (an inefficient trade
between an inframarginal and an extramarginal agent can be individually rational). Third, in a homogeneous market, the set of competitive prices remains constant, independent of the set of agents remaining in the market. In a heterogeneous market this need not be so, and in some cases the new competitive interval may not even intersect the old one. The change in the competitive interval of prices as a result of trade exacerbates the problems associated with using an induction hypothesis, because here future prices may be conditioned on past trades even if prices are restricted to be competitive ones. Despite these difficulties associated with a market with a heterogeneous set of buyers and sellers, Gale and Sabourian [22], henceforth GS, show that the conclusions of S can be extended to the case of a heterogeneous market in which each agent trades at most one unit of the good. GS, however, focus on deterministic sequential matching models in which one pair of agents is matched at each date and the agents leave the market if they reach an agreement. In particular, they start by considering exogenous matching processes in which the identities of the proposer and responder at each date are an exogenous and deterministic function of the set of agents remaining in the market and the date. The main result of the paper is that a PEC is always competitive in such a heterogeneous market, thus supporting the view that a competitive equilibrium may arise in a finite market where complex behavior is costly. The notion of complexity in GS is similar to that in S [15]. However, in the GS set-up with heterogeneous buyers and sellers, the set of remaining agents changes depending on who has traded and left the market and who remains, and this affects the market conditions. (In the homogeneous case, only the number of remaining agents matters.) Therefore, the definition of complexity in GS is relative to a given set of remaining agents.
GS also discuss an alternative notion of complexity that is independent of the set of remaining agents; such a definition may be too strong and may result in an empty equilibrium set. To show their result, GS first establish two very useful restrictions on the strategies that form a NEC (similar to the no-delay result in Chatterjee and Sabourian [15]). First, they show that if along the equilibrium path a pair of agents k and ℓ trade at a price p, with k as the proposer and ℓ as the responder, then k and ℓ always trade at p, irrespective of the previous history, whenever the two agents are matched in the same way with the same remaining set of agents. To show this, consider first the case of the responder ℓ. It must be that at every history with the same remaining set of agents, ℓ always accepts p from k. Oth-
erwise, ℓ could economize on complexity by choosing another strategy that is otherwise identical to his equilibrium strategy except that it always accepts p from k, without sacrificing any payoff. Such a change of behavior is clearly simpler than sometimes accepting and sometimes rejecting the offer; moreover, either agent k proposes p and ℓ accepts, so the payoff to agent ℓ is the same as from the equilibrium strategy, or agent k does not offer p, in which case the change in the strategy is not observed and the play of the game is unaffected by the deviation. Furthermore, it must also be that at every history with the same remaining set of agents, agent k proposes p in any match with ℓ. Otherwise, k could economize on complexity by choosing another strategy that is otherwise identical to his equilibrium strategy except that it always proposes p to ℓ, without sacrificing any payoff on the equilibrium path. Such a change of behavior is clearly simpler; moreover, k's payoff is not affected because either agents k and ℓ are matched and k proposes p, which ℓ accepts by the previous argument, so the payoff to agent k is the same as from the equilibrium strategy, or agents k and ℓ are not matched with k as the proposer, in which case the change in the strategy is not observed and the play of the game is unaffected by the deviation. GS show a second restriction, again with the same remaining set of agents: in any NEC, for any pair of agents k and ℓ, player ℓ's response to k's offer (on or off the equilibrium path) is always the same. Otherwise, ℓ sometimes accepts an offer p by k and sometimes rejects it (with the same remaining set of agents); by the first restriction it must then be that if such an offer is made by k to ℓ on the equilibrium path, it is rejected.
But then ℓ could economize on complexity by always rejecting p from k, without sacrificing any payoff on the equilibrium path: such a change of behavior is clearly simpler, and ℓ's payoff is not affected because the new behavior agrees with the equilibrium strategy on the equilibrium path. By appealing to the above two properties of NEC and to the competitive nature of the market, GS establish, using a complicated induction argument, that every PEC induces a competitive outcome in which each trade occurs at the same competitive price. The matching model we have described so far is deterministic and exogenous. The selection result of GS, however, extends to richer deterministic matching models. In particular, GS also consider a semi-endogenous sequential matching model in which the choice of partners is endogenous but the identity of the proposer at any date is exogenous. Their result extends to this variation, with an endogenous choice of responders. A more radical departure would be to consider the case where at any date any agent can choose his partner and make a proposal. Such a totally endogenous model of trade generates new conceptual problems. In a recent working paper, Gale and Sabourian [24] consider a continuous-time version of such a matching model and show that complexity considerations allow one to select a competitive outcome in the case of totally endogenous matching. Since the selection result holds for all the different matching models, we can conclude that complexity considerations inducing a competitive outcome is a robust feature of deterministic matching and bargaining market games with heterogeneous agents. Random matching is commonly used in economic models because of its tractability. The basic framework of GS, however, does not extend to such a framework if either the buyers or the sellers are not identical, for two reasons. First, in any random framework there is in general more than one outcome path that can occur in equilibrium with positive probability; as a result, introducing complexity lexicographically may not be enough to induce agents to behave in a simple way (they would have to be complex enough to play optimally along all paths that occur with positive probability). Second, Gale and Sabourian [23] show that subgame perfect equilibria in Markov strategies are not necessarily perfectly competitive in the random matching model with heterogeneous agents. Since the definition of complexity in GS is such that Markov strategies are the least complex ones, it follows that with random matching the complexity definition used in GS is not sufficient to select a competitive outcome.
Complexity and Off-the-Equilibrium-Path Play

The concept of the PEC (or NEC) used in S, GS and elsewhere was defined so that, for each player, the strategy/automaton has minimal complexity amongst all strategies/automata that are best responses to the equilibrium strategies/automata of the others. Although these concepts treat complexity very mildly, it should be noted that there are other ways of introducing complexity into the equilibrium concept. One extension of the above set-up is to treat complexity as a (small) positive fixed cost of choosing a more complex strategy, and to define a Nash (subgame perfect) equilibrium with fixed positive complexity costs accordingly. All the selection results based on lexicographic complexity in the papers we discuss in this survey also hold for positive small complexity costs. This is not surprising, because with positive costs complexity has at least as much bite as in the lexicographic case; there is at least as much refinement of the equilibrium
concept with the former as with the latter. In particular, in the case of a NEC (or a PEC), in considering complexity, players ignore any consideration of payoffs off the equilibrium path, and the trade-off is between the equilibrium payoffs of two strategies and their complexity. As a result, these concepts put more weight on complexity costs than on being “prepared” for off-the-equilibrium-path moves. Therefore, although complexity costs are insignificant, they take priority over optimal behavior after deviations. (See [16] for a discussion.) A different approach would be to assume that complexity is a less significant criterion than the off-the-equilibrium payoffs. In the extreme case, one would require agents to choose minimally complex strategies among the set of strategies that are best responses on and off the equilibrium path (see Kalai and Neme [28]). An alternative way of illustrating the differences between the approaches is by introducing two kinds of vanishingly small perturbations into the underlying game. One perturbation is to impose a small but positive cost of choosing a more complex strategy. Another is to introduce a small but positive probability of making an error (an off-the-equilibrium-path move). Since a PEC requires each agent to choose a minimally complex strategy within the set of best responses, the limit points of Nash equilibria of the perturbed game correspond to the concept of PEC if we first let the probability of making an off-the-equilibrium-path move go to zero and then let the cost of choosing a more complex strategy go to zero (this is what Chatterjee and Sabourian [15] do).
On the other hand, in terms of the above limiting arguments, if we first let the cost of choosing a more complex strategy go to zero and then let the probability of making an off-the-equilibrium-path move go to zero, then any limit corresponds to the equilibrium definition in Kalai and Neme [28], where agents choose minimally complex strategies among the set of strategies that are best responses on and off the equilibrium path. Most of the results reported in this paper on refinement and endogenous complexity (for example, Abreu and Rubinstein [1], Chatterjee and Sabourian [15], Gale and Sabourian [22], and Lee and Sabourian [31]) hold only for the concept of NEC and its variations, and thus depend crucially on assuming that complexity costs are more important than off-the-equilibrium payoffs. This is because these results always appeal to an argument that involves economizing on complexity if the complexity is not used off the equilibrium path. Therefore, they may be a good predictor of what may happen only if complexity costs are more significant than the perturbations that induce off-the-equilibrium-path behavior. The one exception is the
selection result in S [47]. Here, although the result we have reported is stated for NEC and its variations, it turns out that the selection of the competitive equilibrium does not in fact depend on the relative importance of complexity costs and off-the-equilibrium-path payoffs. It remains true even for the case where the strategies are required to be least complex amongst those that are best responses at every information set. This is because in S's analysis complexity is only used to show that every agent's response to the price offer of 1 is always the same, irrespective of the past history of play. This conclusion holds irrespective of the relative importance of complexity costs and off-the-equilibrium payoffs, because trading at the price of 1 is the best outcome that any seller can achieve at any information set (including those off the equilibrium path) and the worst outcome for any buyer. Therefore, irrespective of the order, the strategy of sometimes accepting a price of 1 and sometimes rejecting it cannot be an equilibrium for a buyer (a similar argument applies for a seller), because the buyer can economize on complexity by always rejecting the offer without sacrificing any payoff on or off the equilibrium path (accepting p = 1 is the worst possible outcome for the buyer).

Discussion and Future Directions

The use of finite automata as a model of players in a game has been criticized as inadequate, especially because, as the number of states becomes lower, it becomes more and more difficult for the small automaton to do routine calculations, let alone the best-response calculations necessary for game-theoretic equilibria. Some of the papers we have explored address other aspects of complexity that arise from the concrete nature of the games under consideration. Alternative models of complexity have also been suggested, such as computational complexity and communication complexity.
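To make the automaton model concrete, here is a minimal sketch (our own illustration, not code from any of the surveyed papers) of a Moore-machine player in the repeated Prisoners' Dilemma, with complexity measured by counting states, in the spirit of Abreu and Rubinstein [1]; all names below are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Machine:
    """A Moore machine: output depends on the state, transitions on the
    opponent's last action ('C' or 'D')."""
    states: tuple
    start: str
    output: dict      # state -> own action
    transition: dict  # (state, opponent_action) -> next state

    @property
    def complexity(self):
        # The counting-states measure of strategic complexity.
        return len(self.states)

# Tit-for-tat needs two states: one that cooperates, one that defects.
TIT_FOR_TAT = Machine(
    states=("c", "d"), start="c",
    output={"c": "C", "d": "D"},
    transition={("c", "C"): "c", ("c", "D"): "d",
                ("d", "C"): "c", ("d", "D"): "d"},
)

# Unconditional defection needs only one state.
ALWAYS_D = Machine(
    states=("d",), start="d",
    output={"d": "D"},
    transition={("d", "C"): "d", ("d", "D"): "d"},
)

def play(m1, m2, rounds=5):
    """Run two machines against each other; return the action history."""
    s1, s2, history = m1.start, m2.start, []
    for _ in range(rounds):
        a1, a2 = m1.output[s1], m2.output[s2]
        history.append((a1, a2))
        s1 = m1.transition[(s1, a2)]
        s2 = m2.transition[(s2, a1)]
    return history
```

Tit-for-tat carries two states where unconditional defection carries one; it is precisely this kind of trade-off between extra states and extra payoff that the equilibrium notions discussed above exploit.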
While our work and the earlier work on which it builds focus on equilibrium, an alternative approach might ask whether simplicity evolves in some reasonable learning model. Maenner [32] has undertaken such an investigation with the infinitely repeated Prisoners' Dilemma (studied in the equilibrium context by Abreu and Rubinstein). Maenner provides an argument for “learning to be simple”. On the other hand, there are arguments for increasing complexity in competitive games [42]. It is an open question, therefore, whether simplicity could arise endogenously through learning, though it seems to be a feature of most human preferences and aesthetics (see [11]). The broader research program of explicitly considering complexity in economic settings might be a very fruit-
ful one. Auction mechanisms are designed with an eye towards how complex they are – simplicity is a desideratum. The complexity of contracting has given rise to a whole literature on incomplete contracts, where some models postulate a fixed cost per contingency described in the contract. All this is apart from the popular literature on complexity, which seeks to understand complex, adaptive systems from biology. The use of formal complexity measures such as those considered in this survey and in the research we describe might throw some light on whether incompleteness of contracts, or simplicity of mechanisms, is an assumption or a result (of explicitly considering the choice of a level of complexity).

Acknowledgments

We wish to thank an anonymous referee and Jihong Lee for valuable comments that improved the exposition of this chapter. We would also like to thank St. John's College, Cambridge and the Pennsylvania State University for funding Dr Chatterjee's stay in Cambridge at the time this chapter was written.

Bibliography
1. Abreu D, Rubinstein A (1988) The structure of Nash equilibria in repeated games with finite automata. Econometrica 56:1259–1282
2. Anderlini L (1990) Some notes on Church's thesis and the theory of games. Theory Decis 29:19–52
3. Anderlini L, Sabourian H (1995) Cooperation and effective computability. Econometrica 63:1337–1369
4. Aumann RJ (1981) Survey of repeated games. In: Essays in game theory and mathematical economics in honor of Oskar Morgenstern. Bibliographisches Institut, Mannheim/Vienna/Zurich, pp 11–42
5. Banks J, Sundaram R (1990) Repeated games, finite automata and complexity. Games Econ Behav 2:97–117
6. Ben Porath E (1986) Repeated games with bounded complexity. Mimeo, Stanford University
7. Ben Porath E (1993) Repeated games with finite automata. J Econ Theory 59:17–32
8. Binmore KG (1987) Modelling rational players I. Econ Philos 3:179–214
9.
Binmore KG, Samuelson L (1992) Evolutionary stability in repeated games played by finite automata. J Econ Theory 57:278–305
10. Binmore KG, Piccione M, Samuelson L (1998) Evolutionary stability in alternating-offers bargaining games. J Econ Theory 80:257–291
11. Birkhoff GD (1933) Aesthetic measure. Harvard University Press, Cambridge
12. Bloise G (1998) Strategic complexity and equilibrium in repeated games. Unpublished doctoral dissertation, University of Cambridge
13. Busch L-A, Wen Q (1995) Perfect equilibria in a negotiation model. Econometrica 63:545–565
14. Chatterjee K (2002) Complexity of strategies and multiplicity of Nash equilibria. Group Decis Negot 11:223–230
15. Chatterjee K, Sabourian H (2000) Multiperson bargaining and strategic complexity. Econometrica 68:1491–1509
16. Chatterjee K, Sabourian H (2000) N-person bargaining and strategic complexity. Mimeo, University of Cambridge and the Pennsylvania State University
17. Debreu G (1959) Theory of value. Yale University Press, New Haven/London
18. Fernandez R, Glazer J (1991) Striking for a bargain between two completely informed agents. Am Econ Rev 81:240–252
19. Fudenberg D, Maskin E (1990) Evolution and repeated games. Mimeo, Harvard/Princeton
20. Fudenberg D, Tirole J (1991) Game theory. MIT Press, Cambridge
21. Gale D (2000) Strategic foundations of general equilibrium: Dynamic matching and bargaining games. Cambridge University Press, Cambridge
22. Gale D, Sabourian H (2005) Complexity and competition. Econometrica 73:739–770
23. Gale D, Sabourian H (2006) Markov equilibria in dynamic matching and bargaining games. Games Econ Behav 54:336–352
24. Gale D, Sabourian H (2008) Complexity and competition II: endogenous matching. Mimeo, New York University/University of Cambridge
25. Haller H, Holden S (1990) A letter to the editor on wage bargaining. J Econ Theory 52:232–236
26. Hayek F (1945) The use of knowledge in society. Am Econ Rev 35:519–530
27. Herrero M (1985) A strategic theory of market institutions. Unpublished doctoral dissertation, London School of Economics
28. Kalai E, Neme A (1992) The strength of a little perfection. Int J Game Theory 20:335–355
29. Kalai E, Stanford W (1988) Finite rationality and interpersonal complexity in repeated games. Econometrica 56:397–410
30. Klemperer P (ed) (2000) The economic theory of auctions. Elgar, Northampton
31. Lee J, Sabourian H (2007) Coase theorem, complexity and transaction costs. J Econ Theory 135:214–235
32. Maenner E (2008) Adaptation and complexity in repeated games.
Games Econ Behav 63:166–187
33. Miller GA (1956) The magical number seven, plus or minus two: Some limits on our capacity to process information. Psychol Rev 63:81–97
34. Neme A, Quintas L (1995) Subgame perfect equilibrium of repeated games with implementation cost. J Econ Theory 66:599–608
35. Neyman A (1985) Bounded complexity justifies cooperation in the finitely-repeated Prisoners' Dilemma. Econ Lett 19:227–229
36. Neyman A (1997) Cooperation, repetition and automata. In: Hart S, Mas-Colell A (eds) Cooperation: Game-theoretic approaches. NATO ASI Series F, vol 155. Springer, Berlin, pp 233–255
37. Osborne M, Rubinstein A (1990) Bargaining and markets. Academic, New York
38. Osborne M, Rubinstein A (1994) A course in game theory. MIT Press, Cambridge
39. Papadimitriou CH (1992) On players with a bounded number of states. Games Econ Behav 4:122–131
40. Piccione M (1992) Finite automata equilibria with discounting. J Econ Theory 56:180–193
41. Piccione M, Rubinstein A (1993) Finite automata play a repeated extensive game. J Econ Theory 61:160–168
42. Robson A (2003) The evolution of rationality and the Red Queen. J Econ Theory 111:1–22
43. Rubinstein A (1982) Perfect equilibrium in a bargaining model. Econometrica 50:97–109
44. Rubinstein A (1986) Finite automata play the repeated Prisoners' Dilemma. J Econ Theory 39:83–96
45. Rubinstein A (1998) Modeling bounded rationality. MIT Press, Cambridge
46. Rubinstein A, Wolinsky A (1990) Decentralized trading, strate-
gic behaviour and the Walrasian outcome. Rev Econ Stud 57:63–78
47. Sabourian H (2003) Bargaining and markets: Complexity and the competitive outcome. J Econ Theory 116:189–228
48. Selten R (1965) Spieltheoretische Behandlung eines Oligopolmodells mit Nachfrageträgheit. Z gesamte Staatswiss 121:301–324
49. Shaked A (1986) A three-person unanimity game. In: The Los Angeles national meetings of the Institute of Management Sciences and the Operations Research Society of America. Mimeo, University of Bonn
50. Zemel E (1989) Small talk and cooperation: A note on bounded rationality. J Econ Theory 49:1–9
Genetic and Evolutionary Algorithms and Programming: General Introduction and Application to Game Playing
MICHAEL ORLOV, MOSHE SIPPER, AMI HAUPTMAN
Department of Computer Science, Ben-Gurion University, Beer-Sheva, Israel

Article Outline
Glossary
Definition of the Subject
Introduction
Evolutionary Algorithms
A Touch of Theory
Extensions of the Basic Methodology
Lethal Applications
Evolutionary Games
Future Directions
Bibliography

Glossary

Evolutionary algorithms/evolutionary computation A family of algorithms inspired by the workings of evolution by natural selection, whose basic structure is to:
1. produce an initial population of individuals, these latter being candidate solutions to the problem at hand
2. evaluate the fitness of each individual in accordance with the problem whose solution is sought
3. while termination condition not met do
   (a) select fitter individuals for reproduction
   (b) recombine (crossover) individuals
   (c) mutate individuals
   (d) evaluate fitness of modified individuals
   end while

Genome/chromosome An individual's makeup in the population of an evolutionary algorithm is known as a genome, or chromosome. It can take on many forms, including bit strings, real-valued vectors, character-based encodings, and computer programs. The representation issue – namely, defining an individual's genome (well) – is critical to the success of an evolutionary algorithm.

Fitness A measure of the quality of a candidate solution in the population. Also known as the fitness function. Defining this function well is critical to the success of an evolutionary algorithm.
Selection The operator by which an evolutionary algorithm selects (usually probabilistically) higher-fitness individuals to contribute genetic material to the next generation.

Crossover One of the two main genetic operators applied by an evolutionary algorithm, wherein two (or more) candidate solutions (parents) are combined in some pre-defined manner to form offspring.

Mutation One of the two main genetic operators applied by an evolutionary algorithm, wherein one candidate solution is randomly altered.

Definition of the Subject

Evolutionary algorithms are a family of search algorithms inspired by the process of (Darwinian) evolution in nature. Common to all the different family members is the notion of solving problems by evolving an initially random population of candidate solutions, through the application of operators inspired by natural genetics and natural selection, such that in time fitter (i. e., better) solutions emerge. The field, whose origins can be traced back to the 1950s and 1960s, has come into its own over the past two decades, proving successful in solving multitudinous problems from highly diverse domains including (to mention but a few): optimization, automatic programming, electronic-circuit design, telecommunications, networks, finance, economics, image analysis, signal processing, music, and art.

Introduction

The first approach to artificial intelligence, the field which encompasses evolutionary computation, is arguably due to Turing [31]. Turing asked the famous question: “Can machines think?” Evolutionary computation, as a subfield of AI, may be the most straightforward answer to such a question. In principle, it might be possible to evolve an algorithm possessing the functionality of the human brain (this has already happened at least once: in nature). In a sense, nature is greatly inventive. One often wonders how so many magnificent solutions to the problem of existence came to be.
From the intricate mechanisms of cellular biology to the sandy camouflage of flatfish; from the social behavior of ants to the diving speed of the peregrine falcon – nature has created versatile solutions, at varying levels, to the problem of survival. Many ingenious solutions were invented (and still are), without any obvious intelligence directly creating them. This is perhaps the main motivation behind evolutionary algorithms: creating the settings for a dynamic environment, in which solutions can be created and improved in the course of time, ad-
vancing in new directions, with minimal direct intervention. The gain to problem solving is obvious.

Evolutionary Algorithms

In the 1950s and the 1960s several researchers independently studied evolutionary systems with the idea that evolution could be used as an optimization tool for engineering problems. Central to all the different methodologies is the notion of solving problems by evolving an initially random population of candidate solutions, through the application of operators inspired by natural genetics and natural selection, such that in time fitter (i. e., better) solutions emerge [9,16,19,28]. This thriving field goes by the name of evolutionary algorithms or evolutionary computation, and today it encompasses two main branches – genetic algorithms [9] and genetic programming [19] – in addition to less prominent (though important) offshoots, such as evolutionary programming [10] and evolution strategies [26].

A genetic algorithm (GA) is an iterative procedure that consists of a population of individuals, each one represented by a finite string of symbols, known as the genome, encoding a possible solution in a given problem space. This space, referred to as the search space, comprises all possible solutions to the problem at hand. Generally speaking, the genetic algorithm is applied to spaces which are too large to be exhaustively searched. The symbol alphabet used is often binary, but may also be character-based, real-valued, or any other representation most suitable to the problem at hand.

The standard genetic algorithm proceeds as follows: an initial population of individuals is generated at random or heuristically. At every evolutionary step, known as a generation, the individuals in the current population are decoded and evaluated according to some predefined quality criterion, referred to as the fitness, or fitness function. To form a new population (the next generation), individuals are selected according to their fitness.
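As a minimal illustration, the generational loop just described might be sketched in Python as follows (this is our own sketch, not the book's Figure 1; the function name, parameter defaults, and the ones-count fitness are illustrative choices echoing the worked example later in this section):

```python
import random

def evolve(pop_size=20, genome_len=10, p_cross=0.7, p_mut=0.05,
           generations=100, seed=1):
    """One possible rendering of the standard GA: generate, evaluate,
    select, recombine, mutate, repeat."""
    rng = random.Random(seed)
    fitness = sum  # toy fitness: the number of ones in the bit string

    def select(pop):
        # Fitness-proportionate (roulette-wheel) selection; the +1 merely
        # guards against an all-zero population having zero total weight.
        return rng.choices(pop, weights=[fitness(g) + 1 for g in pop], k=1)[0]

    # Initial population generated at random.
    pop = [[rng.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        next_pop = []
        while len(next_pop) < pop_size:
            a, b = select(pop), select(pop)
            if rng.random() < p_cross:
                # One-point crossover: exchange substrings after a random point.
                cut = rng.randrange(1, genome_len)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):
                # Point mutation: flip each bit with probability p_mut.
                next_pop.append([bit ^ 1 if rng.random() < p_mut else bit
                                 for bit in child])
        pop = next_pop[:pop_size]
        if any(fitness(g) == genome_len for g in pop):
            break  # termination: an acceptable fitness level was attained
    return max(pop, key=fitness)
```

With these (illustrative) settings the loop typically drives the population toward the all-ones string within a few dozen generations.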
Many selection procedures are available, one of the simplest being fitness-proportionate selection, where individuals are selected with a probability proportional to their relative fitness. This ensures that the expected number of times an individual is chosen is approximately proportional to its relative performance in the population. Thus, high-fitness (good) individuals stand a better chance of reproducing, while low-fitness ones are more likely to disappear. Selection alone cannot introduce any new individuals into the population, i. e., it cannot find new points in the search space; these are generated by genetically-inspired operators, of which the most well known are crossover and
mutation. Crossover is performed with probability pcross (the crossover probability or crossover rate) between two selected individuals, called parents, by exchanging parts of their genomes (i. e., encodings) to form one or two new individuals, called offspring. In its simplest form, substrings are exchanged after a randomly selected crossover point. This operator tends to enable the evolutionary process to move toward promising regions of the search space. The mutation operator is introduced to prevent premature convergence to local optima by randomly sampling new points in the search space. It is carried out by flipping bits at random, with some (small) probability pmut. Genetic algorithms are stochastic iterative processes that are not guaranteed to converge. The termination condition may be specified as some fixed, maximal number of generations or as the attainment of an acceptable fitness level. Figure 1 presents the standard genetic algorithm in pseudo-code format.

Let us consider the following simple example, demonstrating the GA's workings. The population consists of four individuals, which are binary-encoded strings (genomes) of length 10. The fitness value equals the number of ones in the bit string, with pcross = 0.7 and pmut = 0.05. (More typical values of the population size and the genome length are in the range 50–1000.) Note that fitness computation in this case is extremely simple, since no complex decoding or evaluation is necessary. The initial (randomly generated) population might look as shown in Table 1. Using fitness-proportionate selection we must choose four individuals (two sets of parents), with probabilities proportional to their relative fitness values. In our example, suppose that the two parent pairs are {p2, p4} and {p1, p2} (note that individual p3 did not get selected, as our procedure is probabilistic). Once a pair of parents is se-
Genetic and Evolutionary Algorithms and Programming: General Introduction and Appl. to Game Playing, Figure 1 Pseudo-code of the standard genetic algorithm
Genetic and Evolutionary Algorithms and Programming: General Introduction and Appl. to Game Playing, Table 1 The initial population

Label  Genome      Fitness
p1     0000011011  4
p2     1110111101  8
p3     0010000010  2
p4     0011010000  3
lected, crossover is effected between them with probability pcross, resulting in two offspring. If no crossover is effected (with probability 1 − pcross), then the offspring are exact copies of each parent. Suppose, in our example, that crossover takes place between parents p2 and p4 at the (randomly chosen) third bit position:

111|0111101
001|1010000

This results in offspring p′1 = 1111010000 and p′2 = 0010111101. Suppose no crossover is performed between parents p1 and p2, forming offspring that are exact copies of p1 and p2. Our interim population (after crossover) is thus as depicted in Table 2. Next, each of these four individuals is subject to mutation with probability pmut per bit. For example, suppose offspring p′2 is mutated at the sixth bit position and offspring p′4 is mutated at the ninth bit position. Table 3 describes the resulting population. The resulting population is that of the next generation (i. e., p″i equals pi of the next generation). As can be seen,
Genetic and Evolutionary Algorithms and Programming: General Introduction and Appl. to Game Playing, Table 2 The interim population

Label  Genome      Fitness
p′1    1111010000  5
p′2    0010111101  6
p′3    0000011011  4
p′4    1110111101  8
Genetic and Evolutionary Algorithms and Programming: General Introduction and Appl. to Game Playing, Table 3 The resulting population

Label  Genome      Fitness
p″1    1111010000  5
p″2    0010101101  5
p″3    0000011011  4
p″4    1110111111  9
the transition from one generation to the next is through application of selection, crossover, and mutation. Moreover, note that the best individual's fitness has gone up from eight to nine, and that the average fitness (computed over all individuals in the population) has gone up from 4.25 to 5.75. Iterating this procedure, the GA will eventually find a perfect string, i. e., one with the maximal fitness value of ten. Another prominent branch of the evolutionary computation tree is that of genetic programming, introduced by Cramer [7], and transformed into a field in its own right in large part due to the efforts of Koza [19]. Basically, genetic programming (GP) is a GA (genetic algorithm) with individuals in the population being programs instead of bit strings. In GP we evolve a population of individual LISP expressions1, each comprising functions and terminals. The functions are usually arithmetic and logic operators that receive a number of arguments as input and compute a result as output; the terminals are zero-argument functions that serve both as constants and as sensors, the latter being a special type of function that queries the domain environment. The main mechanism behind GP is precisely that of a GA, namely, the repeated cycling through four operations applied to the entire population: evaluate-select-crossover-mutate. However, the evaluation of a single individual in GP is usually more complex than with a GA since it involves running a program. Moreover, crossover and mutation need to be made to work on trees (rather than simple bit strings), as shown in Fig. 2.

A Touch of Theory

Evolutionary computation is mostly an experimental field. However, over the years there have been some notable theoretical treatments of the field, gaining valuable insights into the properties of evolving populations.
Holland [17] introduced the notion of schemata, which are abstract properties of binary-encoded individuals, and analyzed the growth of different schemata when fitness-proportionate selection, point mutation, and one-point crossover are employed. Holland's approach has since been enhanced and more rigorous analysis performed; however, this has had few practical consequences for existing evolutionary techniques, since most of the successful methods are usually much more complex in many aspects. Moreover, the schema analysis suffers from an important approximation of infinite population size, while in reality schemata can vanish. Note that the No Free Lunch theorem states that ". . . for any [optimization] algorithm, any elevated performance over one class of problems is exactly paid for in performance over another class" [32].

1 Languages other than LISP have been used, although LISP is still by far the most popular within the genetic programming domain.

Genetic and Evolutionary Algorithms and Programming: General Introduction and Appl. to Game Playing, Figure 2 Genetic operators in genetic programming. LISP programs are depicted as trees. Crossover (top): Two sub-trees (marked in bold) are selected from the parents and swapped. Mutation (bottom): A sub-tree (marked in bold) is selected from the parent individual and removed. A new sub-tree is grown instead

Extensions of the Basic Methodology

We have reviewed the basic evolutionary computation methods. More advanced techniques are used to tackle complex problems, where an approach of a single population with homogeneous individuals does not suffice. One such advanced approach is coevolution [24]. Coevolution refers to the simultaneous evolution of two or more species with coupled fitness. Such coupled evolution favors the discovery of complex solutions whenever complex solutions are required. Simplistically speaking, one can say that coevolving species can either compete (e. g., to obtain exclusivity on a limited resource) or cooperate (e. g., to gain access to some hard-to-attain resource). In a competitive coevolutionary algorithm the fitness of an individual is based on direct competition with individuals of other species, which in turn evolve separately in their own populations. Increased fitness of one of the species
implies a diminution in the fitness of the other species. This evolutionary pressure tends to produce new strategies in the populations involved so as to maintain their chances of survival. This arms race ideally increases the capabilities of each species until they reach an optimum. Cooperative (also called symbiotic) coevolutionary algorithms involve a number of independently evolving species which together form complex structures, well suited to solve a problem. The fitness of an individual depends on its ability to collaborate with individuals from other species. In this way, the evolutionary pressure stemming from the difficulty of the problem favors the development of cooperative strategies and individuals. Single-population evolutionary algorithms often perform poorly – manifesting stagnation, convergence to local optima, and computational costliness – when confronted with problems presenting one or more of the following features: 1) the sought-after solution is complex, 2) the problem or its solution is clearly decomposable, 3) the genome encodes different types of values, 4) there are strong interdependencies among the components of the solution, and 5) the ordering of the components drastically affects fitness [24]. Cooperative coevolution effectively addresses these issues, consequently widening the range of applications of evolutionary computation.
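A cooperative setup can be illustrated on a toy problem of our own choosing (not from the text above): two species each evolve one coordinate of a point, and an individual's fitness is its joint score with the best individual of the other species, here the negated distance to the origin:

```python
import random

def evolve_step(population, partner_best, index):
    # An individual's fitness is its collaboration score with the partner
    # species' representative: minimize x^2 + y^2, so -(x^2 + y^2) is "better".
    def fit(v):
        x, y = (v, partner_best) if index == 0 else (partner_best, v)
        return -(x * x + y * y)
    graded = sorted(population, key=fit, reverse=True)
    survivors = graded[: len(population) // 2]               # truncation selection
    children = [v + random.gauss(0, 0.1) for v in survivors]  # Gaussian mutation
    return survivors + children, graded[0]

random.seed(0)
pops = [[random.uniform(-5, 5) for _ in range(10)] for _ in range(2)]
bests = [pops[0][0], pops[1][0]]
for _ in range(100):
    pops[0], bests[0] = evolve_step(pops[0], bests[1], 0)
    pops[1], bests[1] = evolve_step(pops[1], bests[0], 1)
# bests[0] and bests[1] now lie close to the joint optimum (0, 0)
```

The dynamic described in the text is visible here in miniature: each species' fitness landscape shifts as the other species' representative improves.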
Consider, for instance, the evolution of neural networks [33]. A neural network consists of simple units called neurons, each having several inputs and a single output. The inputs are assigned weights, and a weighted sum of the inputs exceeding a certain threshold causes the neuron to fire an output signal. Neurons are usually connected using a layered topology. When we approach the task of evolving a neural network possessing some desired property naively, we will probably think of some linearized representation of a neural network, encoding both the neuron locations in the network and their weights. However, evolving such a network with a simple evolutionary algorithm might prove quite a frustrating task, since much information is encoded in each individual, and it is not homogeneous, which presents us with the difficult target of evolving the individuals as single entities. On the other hand, this task can be dealt with more sagely by evolving two independently encoded populations of neurons and network topologies. Stanley and Miikkulainen [30] evaluate the fitness of an individual in one of the populations using the individuals of the other. In addition to the simplification of individuals in each population, the fitness is now dynamic, and an improvement in the evolution of topologies triggers a corresponding improvement in the population of neurons, and vice versa.

Lethal Applications

In this section we review a number of applications that – though possibly not killer (death being in the eye of the beholder. . . ) – are most certainly lethal. These come from a sub-domain of evolutionary algorithms which has been gaining momentum over the past few years: human-competitive machine intelligence. Koza et al. [20] recently affirmed that the field of evolutionary algorithms "now routinely delivers high-return human-competitive machine intelligence", meaning, according to [20]:

Human-competitive: Getting machines to produce human-like results, e.
g., a patentable invention, a result publishable in the scientific literature, or a game strategy that can hold its own against humans. High-return: Defined by Koza et al. as a high artificial-to-intelligence ratio (A/I), namely, the ratio of that which is delivered by the automated operation of the artificial method to the amount of intelligence that is supplied by the human applying the method to a particular system. Routine: The successful handling of new problems once the method has been jump-started.
Machine intelligence: To quote Arthur Samuel, getting "machines to exhibit behavior, which if done by humans, would be assumed to involve the use of intelligence." Indeed, as of 2004 the major annual event in the field of evolutionary algorithms – GECCO (Genetic and Evolutionary Computation Conference; see www.sigevo.org) – boasts a prestigious competition that awards prizes to human-competitive results. As noted at www.human-competitive.org: "Techniques of genetic and evolutionary computation are being increasingly applied to difficult real-world problems – often yielding results that are not merely interesting, but competitive with the work of creative and inventive humans." We now describe some winners from the HUMIES competition at www.human-competitive.org. Lohn et al. [22] won a Gold Medal in the 2004 competition for an evolved X-band antenna design and flight prototype to be deployed on NASA's Space Technology 5 (ST5) spacecraft:

The ST5 antenna was evolved to meet a challenging set of mission requirements, most notably the combination of wide beamwidth for a circularly-polarized wave and wide bandwidth. Two evolutionary algorithms were used: one used a genetic algorithm style representation that did not allow branching in the antenna arms; the second used a genetic programming style tree-structured representation that allowed branching in the antenna arms. The highest performance antennas from both algorithms were fabricated and tested, and both yielded very similar performance. Both antennas were comparable in performance to a hand-designed antenna produced by the antenna contractor for the mission, and so we consider them examples of human-competitive performance by evolutionary algorithms [22].

Preble et al. [25] won a gold medal in the 2005 competition for designing photonic crystal structures with large band gaps. Their result is "an improvement of 12.5% over the best human design using the same index contrast platform." Recently, Kilinç et al.
[18] were awarded the Gold Medal in the 2006 competition for designing oscillators using evolutionary algorithms, the oscillators possessing characteristics that surpass those of existing human-designed analogs.

Evolutionary Games

Evolutionary games is the application of evolutionary algorithms to the evolution of game-playing strategies
for various games, including chess, backgammon, and Robocode.

Motivation and Background

Ever since the dawn of artificial intelligence in the 1950s, games have been part and parcel of this lively field. In 1957, a year after the Dartmouth Conference that marked the official birth of AI, Alex Bernstein designed a program for the IBM 704 that played two amateur games of chess. In 1958, Allen Newell, J. C. Shaw, and Herbert Simon introduced a more sophisticated chess program (beaten in thirty-five moves by a ten-year-old beginner in its last official game, played in 1960). Arthur L. Samuel of IBM spent much of the fifties working on game-playing AI programs, and by 1961 he had a checkers program that could play rather decently. In 1961 and 1963 Donald Michie described a simple trial-and-error learning system for learning how to play Tic-Tac-Toe (or Noughts and Crosses) called MENACE (for Matchbox Educable Noughts and Crosses Engine). These are but examples of highly popular games that have been treated by AI researchers since the field's inception. Why study games? This question was answered by Susan L. Epstein, who wrote:

There are two principal reasons to continue to do research on games. . . . First, human fascination with game playing is long-standing and pervasive. Anthropologists have catalogued popular games in almost every culture. . . . Games intrigue us because they address important cognitive functions. . . . The second reason to continue game-playing research is that some difficult games remain to be won, games that people play very well but computers do not. These games clarify what our current approach lacks. They set challenges for us to meet, and they promise ample rewards [8].

Studying games may thus advance our knowledge in both cognition and artificial intelligence, and, last but not least, games possess a competitive angle which coincides with our human nature, thus motivating both researcher and student alike. Even more strongly, Laird and van Lent [21] proclaimed that,

. . . interactive computer games are the killer application for human-level AI. They are the application that will soon need human-level AI, and they can provide the environments for research on the right kinds of problems that lead to the type of the incremental and integrative research needed to achieve human-level AI [21].

Evolving Game-Playing Strategies

Recently, evolutionary algorithms have proven a powerful tool that can automatically design successful game-playing strategies for complex games [2,3,13,14,15,27,29].
1. Chess (endgames): Evolve a player able to play endgames [13,14,15,29]. While endgames typically contain but a few pieces, the problem of evaluation is still hard, as the pieces are usually free to move all over the board, resulting in complex game trees – both deep and with high branching factors. Indeed, in the chess lore much has been said and written about endgames.

2. Backgammon: Evolve a full-fledged player for the non-doubling-cube version of the game [2,3,29].

3. Robocode: A simulation-based game in which robotic tanks fight to destruction in a closed arena (robocode.alphaworks.ibm.com). The programmers implement their robots in the Java programming language, and can test their creations either by using a graphical environment in which battles are held, or by submitting them to a central web site where online tournaments regularly take place. Our goal here has been to evolve Robocode players able to rank high in the international league [27,29].

A strategy for a given player in a game is a way of specifying which choice the player is to make at every point in the game from the set of allowable choices at that point, given all the information that is available to the player at that point [19]. The problem of discovering a strategy for playing a game can be viewed as one of seeking a computer program. Depending on the game, the program might take as input the entire history of past moves or just the current state of the game. The desired program then produces the next move as output. For some games one might evolve a complete strategy that addresses every situation tackled. This proved to work well with Robocode, which is a dynamic game, with relatively few parameters and little need for past history. Another approach is to couple a current-state evaluator (e. g., a board evaluator) with a next-move generator.
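The evaluator-plus-generator coupling can be sketched generically in Python. Here `legal_moves`, `apply_move`, and `evaluate` are placeholders for game-specific code; in our setting, `evaluate` would be the evolved program:

```python
import random

def best_move(board, legal_moves, apply_move, evaluate):
    # Single-level search: generate every next board, score it with the
    # evaluator, and pick a highest-scoring move (ties broken stochastically).
    scored = [(evaluate(apply_move(board, m)), m) for m in legal_moves(board)]
    top = max(score for score, _ in scored)
    return random.choice([m for score, m in scored if score == top])
```

On a toy one-dimensional "game" whose state is an integer, whose moves are +1 and -1, and whose evaluator prefers states near 5, `best_move(3, ...)` picks the +1 move, since the board 4 scores higher than the board 2.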
One can go on to create a minimax tree, which consists of all possible moves, counter-moves, counter-counter-moves, and so on; for real-life games, such a tree's size quickly becomes prohibitive. The approach we used with backgammon and chess is to derive a very shallow, single-level tree, and evolve smart evaluation functions. Our artificial player is thus obtained by combining an evolved board
evaluator with a simple program that generates all next-move boards (such programs can easily be written for backgammon and chess). In what follows we describe the six items necessary in order to employ genetic programming: program architecture, set of terminals, set of functions, fitness measure, control parameters, and manner of designating result and terminating run.

Example: Chess

As our purpose is to create a schema-based program that analyzes single nodes thoroughly, in a way reminiscent of human thinking, we did not perform deep lookahead. We evolved individuals represented as LISP programs. Each such program receives a chess endgame position as input, and, according to its sensors (terminals) and functions, returns an evaluation of the board, in the form of a real number. Our chess endgame players consist of an evolved LISP program, together with a piece of software that generates all possible (legal) next-moves and feeds them to the program. The next-move with the highest score is selected (ties are broken stochastically). The player also identifies when the game is over (either by a draw or a win).

Program Architecture

As most chess players would agree, playing a winning position (e. g., one with material advantage) is very different from playing a losing position, or an even one. For this reason, each individual contains not one but three separate trees: an advantage tree, an even tree, and a disadvantage tree. These trees are used according to the current status of the board. The disadvantage tree is smaller, since achieving a stalemate and avoiding exchanges requires less complicated reasoning. Most terminals and functions were used for all trees. The structure of three trees per individual was preserved mainly for simplicity reasons. It is actually possible to coevolve three separate populations of trees, without binding them to form a single individual before the end of the experiment.
This would require a different experimental setting, and is one of our future-work ideas.

Terminals and Functions

While evaluating a position, an expert chess player considers various aspects of the board. Some are simple, while others require a deep understanding of the game. Chase and Simon found that experts recalled meaningful chess formations better than novices [6]. This led them to hypothesize that chess skill depends on a large knowledge base, indexed through thousands of familiar chess patterns.
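The three-tree architecture described under Program Architecture amounts to a dispatch on board status. In this Python sketch the trees are placeholder callables and `material_balance` is an assumed helper, not part of the original system's vocabulary:

```python
def evaluate_position(board, advantage_tree, even_tree, disadvantage_tree,
                      material_balance):
    # Choose which evolved tree scores the board, according to whether the
    # player is materially ahead, even, or behind.
    balance = material_balance(board)
    if balance > 0:
        return advantage_tree(board)
    if balance < 0:
        return disadvantage_tree(board)
    return even_tree(board)
```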
We assumed that complex aspects of the game board are comprised of simpler units, which require less game knowledge, and are to be combined in some way. Our chess programs use terminals, which represent those relatively simple aspects, and functions, which incorporate no game knowledge, but supply methods of combining those aspects. As we used strongly typed GP [23], all functions and terminals were assigned one or more of two data types: Float and Boolean. We also included a third data type, named Query, which could be used as any of the former two. We also used ephemeral random constants (ERCs).

The Terminal Set

We developed most of our terminals by consulting several high-ranking chess players.2 The terminal set examined various aspects of the chessboard, and may be divided into three groups:

Float values, created using the ERC mechanism. ERCs were chosen at random to be one of the following six values: ±{1/2, 1/3, 1/4} × MAX (MAX was empirically set to 1000), and the inverses of these numbers. This guaranteed that when a value was returned after some group of features had been identified, it was distinct enough to influence the outcome.

Simple terminals, which analyzed relatively simple aspects of the board, such as the number of possible moves for each king, and the number of attacked pieces for each player. These terminals were derived by breaking relatively complex aspects of the board into simpler notions. More complex terminals belonged to the next group (see below). For example, a player should capture his opponent's piece if it is not sufficiently protected, meaning that the number of attacking pieces the player controls is greater than the number of pieces protecting the opponent's piece, and the material value of the defending pieces is equal to or greater than the player's. Adjudicating these considerations is not simple, and therefore a terminal that performs this entire computational feat by itself belongs to the next group of complex terminals.
The simple terminals comprising this second group were derived by refining the logical resolution of the previous paragraphs' reasoning: Is an opponent's piece attacked? How many of the player's pieces are attacking that piece? How many pieces are protecting a given opponent's piece? What is the material value of pieces attacking and defending a given opponent's piece? All these questions were embodied as terminals within the second group. The ability to easily embody such reasoning within the GP setup, as functions and terminals, is a major asset of GP.

2 The highest-ranking player we consulted was Boris Gutkin, ELO 2400, International Master, and fully qualified chess teacher.
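The capture-adjudication reasoning above can be made concrete in a toy model that is entirely our own: attackers and defenders of a piece are represented simply as lists of material values, standing in for the board queries the real terminals would perform:

```python
def opp_piece_can_be_captured(attacker_values, defender_values):
    # A capture is deemed worthwhile when more pieces attack than defend,
    # and the defenders' combined material value is at least the attackers'
    # (so no more material is risked than can be gained), mirroring the
    # verbal rule in the text.
    return (len(attacker_values) > len(defender_values)
            and sum(defender_values) >= sum(attacker_values))
```

For instance, a pawn and a knight (values 1 and 3) attacking a piece defended only by a rook (value 5) satisfy both conditions, whereas a lone queen attacking a piece defended by two pawns satisfies neither.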
Other terminals were also derived in a similar manner. See Table 4 for a complete list of simple terminals. Note that some of the terminals are inverted – we would like terminals to always return positive (or true) values, since these values represent a favorable position. This is why we used, for example, a terminal evaluating the player’s king’s distance from the edges of the board (generally a favorable feature for endgames), while using a terminal evaluating the proximity of the opponent’s king to the edges (again, a positive feature). Complex terminals: these are terminals that check the same aspects of the board a human player would. Some prominent examples include: the terminal OppPieceCanBeCaptured considering the capture of a piece; checking if the current position is a draw, a mate, or a stalemate (especially important for non-even boards); checking if there is a mate in one or two moves (this is the most complex terminal); the material value of the position; comparing the material value of the position to the original board – this is important since it is easier to consider change than to evaluate the board in an absolute manner. See Table 5 for a full list of complex terminals. Since some of these terminals are hard to compute, and most appear more than once in the individual’s trees, we used a memoization scheme to save time [1]: After the first calculation of each terminal, the result is stored, so that
further calls to the same terminal (on the same board) do not repeat the calculation. Memoization greatly reduced the evolutionary run-time.

The Function Set

The function set used included the If function, and simple Boolean functions. Although our tree returns a real number, we omitted arithmetic functions, for several reasons. First, a large part of contemporary research in the field of machine learning and game theory (in particular for perfect-information games) revolves around inducing logical rules for learning games (for example, see [4,5,11]). Second, according to the players we consulted, while evaluating positions involves considering various aspects of the board, some more important than others, performing logical operations on these aspects seems natural, while performing mathematical operations does not. Third, we observed that numeric functions sometimes returned extremely large values, which interfered with subtle calculations. Therefore the scheme we used was a (carefully ordered) series of Boolean queries, each returning a fixed value (either an ERC or a numeric terminal, see below). See Table 6 for the complete list of functions.

Fitness Evaluation

As we used a competitive evaluation scheme, the fitness of an individual was determined by its
Genetic and Evolutionary Algorithms and Programming: General Introduction and Appl. to Game Playing, Table 4 Simple terminals for evolving chess endgame players. Opp: opponent, My: player

Terminal – Description
B=NotMyKingInCheck() – Is the player's king not being checked?
B=IsOppKingInCheck() – Is the opponent's king being checked?
F=MyKingDistEdges() – The player's king's distance from the edges of the board
F=OppKingProximityToEdges() – The opponent's king's proximity to the edges of the board
F=NumMyPiecesNotAttacked() – The number of the player's pieces that are not attacked
F=NumOppPiecesAttacked() – The number of the opponent's attacked pieces
F=ValueMyPiecesAttacking() – The material value of the player's pieces which are attacking
F=ValueOppPiecesAttacking() – The material value of the opponent's pieces which are attacking
B=IsMyQueenNotAttacked() – Is the player's queen not attacked?
B=IsOppQueenAttacked() – Is the opponent's queen attacked?
B=IsMyFork() – Is the player creating a fork?
B=IsOppNotFork() – Is the opponent not creating a fork?
F=NumMovesMyKing() – The number of legal moves for the player's king
F=NumNotMovesOppKing() – The number of illegal moves for the opponent's king
F=MyKingProxRook() – Proximity of my king and rook(s)
F=OppKingDistRook() – Distance between opponent's king and rook(s)
B=MyPiecesSameLine() – Are two or more of the player's pieces protecting each other?
B=OppPiecesNotSameLine() – Are two or more of the opponent's pieces protecting each other?
B=IsOppKingProtectingPiece() – Is the opponent's king protecting one of his pieces?
B=IsMyKingProtectingPiece() – Is the player's king protecting one of his pieces?
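The memoization scheme used for costly terminals can be sketched as a per-board cache. This generic Python decorator is our own illustration; it assumes the board is hashable (e. g., a FEN-like string):

```python
def memoize_terminal(terminal):
    # Cache each terminal's result per board, so that repeated calls within
    # one evaluation do not repeat the (expensive) computation.
    cache = {}
    def wrapped(board):
        if board not in cache:
            cache[board] = terminal(board)
        return cache[board]
    return wrapped

calls = []

@memoize_terminal
def expensive_terminal(board):
    calls.append(board)   # record the (simulated) expensive work
    return len(board)     # stand-in for a costly board analysis
```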
Genetic and Evolutionary Algorithms and Programming: General Introduction and Appl. to Game Playing, Table 5 Complex terminals for evolving chess endgame players. Opp: opponent, My: player. Some of these terminals perform lookahead, while others compare with the original board

Terminal – Description
F=EvaluateMaterial() – The material value of the board
B=IsMaterialIncrease() – Did the player capture a piece?
B=IsMate() – Is this a mate position?
B=IsMateInOne() – Can the opponent mate the player after this move?
B=OppPieceCanBeCaptured() – Is it possible to capture one of the opponent's pieces without retaliation?
B=MyPieceCannotBeCaptured() – Is it not possible to capture one of the player's pieces without retaliation?
B=IsOppKingStuck() – Do all legal moves for the opponent's king advance it closer to the edges?
B=IsMyKingNotStuck() – Is there a legal move for the player's king that advances it away from the edges?
B=IsOppKingBehindPiece() – Is the opponent's king two or more squares behind one of his pieces?
B=IsMyKingNotBehindPiece() – Is the player's king not two or more squares behind one of my pieces?
B=IsOppPiecePinned() – Is one or more of the opponent's pieces pinned?
B=IsMyPieceNotPinned() – Are all the player's pieces not pinned?
success against its peers. We used the random-two-ways method, in which each individual plays against a fixed number of randomly selected peers. Each of these encounters entailed a fixed number of games, each starting from a randomly generated position in which no piece was attacked. The score for each game was derived from the outcome of the game. Players that managed to mate their opponents received more points than those that achieved only a material advantage. Draws were rewarded by a low score, and losses entailed no points at all. The final fitness for each player was the sum of all points earned in the entire tournament for that generation.

Control Parameters and Run Termination

We used the standard reproduction, crossover, and mutation operators. The major parameters were: population size – 80, generation count – between 150 and 250, reproduction probability – 0.35, crossover probability – 0.5, and mutation probability – 0.15 (including ERC).
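The random-two-ways scheme can be sketched as follows. The point values and function names are illustrative, not those of [13]; note also that in this simplified sketch a player may occasionally be drawn as its own opponent:

```python
import random

POINTS = {"win": 3, "advantage": 2, "draw": 1, "loss": 0}  # illustrative scale

def tournament_fitness(population, play_game, opponents=5, games=3):
    # Each individual meets a few randomly selected peers; its fitness is
    # the sum of points earned over all its games in the generation.
    fitness = {i: 0 for i in range(len(population))}
    for i, player in enumerate(population):
        for j in random.sample(range(len(population)), opponents):
            for _ in range(games):
                outcome = play_game(player, population[j])  # -> key of POINTS
                fitness[i] += POINTS[outcome]
    return fitness
```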
Results

We pitted our top evolved chess-endgame players against two very strong external opponents: 1) a program we wrote ('Master') based upon consultation with several high-ranking chess players (the highest being Boris Gutkin, ELO 2400, International Master); 2) CRAFTY – a world-class chess program, which finished second in the 2004 World Computer Speed Chess Championship (www.cs.biu.ac.il/games/). Speed chess (blitz) involves a time limit per move, which we imposed both on CRAFTY and on our players. Not only did we thus seek to evolve good players, but ones that play well and fast. Results are shown in Table 7. As can be seen, GP-EndChess manages to hold its own, and even win, against these top players. For more details on GP-EndChess see [13,29]. Deeper analysis of the strategies developed [12] revealed several important shortcomings, most of which stemmed from the fact that they used deep knowledge and little search (typically, they developed only one level of the search tree). Simply increasing the search depth would not solve the problem, since the evolved programs examine
Genetic and Evolutionary Algorithms and Programming: General Introduction and Appl. to Game Playing, Table 6 Function set of GP chess player individual. B: Boolean, F: Float

Function – Description
F=If3(B1, F1, F2) – If B1 is non-zero, return F1, else return F2
B=Or2(B1, B2) – Return 1 if at least one of B1, B2 is non-zero, 0 otherwise
B=Or3(B1, B2, B3) – Return 1 if at least one of B1, B2, B3 is non-zero, 0 otherwise
B=And2(B1, B2) – Return 1 only if B1 and B2 are non-zero, 0 otherwise
B=And3(B1, B2, B3) – Return 1 only if B1, B2, and B3 are non-zero, 0 otherwise
B=Smaller(B1, B2) – Return 1 if B1 is smaller than B2, 0 otherwise
B=Not(B1) – Return 0 if B1 is non-zero, 1 otherwise
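Rendered in Python, with Booleans represented as numbers (zero meaning false, as in the table), Table 6's function set reads:

```python
# Table 6's function set as plain Python functions (Booleans as 0/1 values).
def If3(b1, f1, f2):   return f1 if b1 != 0 else f2
def Or2(b1, b2):       return 1 if b1 != 0 or b2 != 0 else 0
def Or3(b1, b2, b3):   return 1 if b1 != 0 or b2 != 0 or b3 != 0 else 0
def And2(b1, b2):      return 1 if b1 != 0 and b2 != 0 else 0
def And3(b1, b2, b3):  return 1 if b1 != 0 and b2 != 0 and b3 != 0 else 0
def Smaller(b1, b2):   return 1 if b1 < b2 else 0
def Not(b1):           return 0 if b1 != 0 else 1
```

An evolved tree is then just a nested composition of these calls over terminal values, e.g. `If3(And2(check, fork), 1000.0, 0.25)`.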
Genetic and Evolutionary Algorithms and Programming: General Introduction and Appl. to Game Playing, Table 7 Percent of wins, advantages, and draws for best GP-EndChess player in tournament against two top competitors

        %Wins  %Advs  %Draws
Master   6.00   2.00   68.00
CRAFTY   2.00   4.00   72.00
each board very thoroughly, and scanning many boards would increase time requirements prohibitively. And so we turned to evolution to find an optimal way to overcome this problem: how to add more search at the expense of less knowledgeable (and thus less time-consuming) node evaluators, while attaining better performance. In [15] we evolved the search algorithm itself, focusing on the Mate-In-N problem: find a key move such that even with the best possible counter-plays, the opponent cannot avoid being mated in (or before) move N. We showed that our evolved search algorithms successfully solve several instances of the Mate-In-N problem, for the hardest ones developing 47% fewer game-tree nodes than CRAFTY. Improvement is thus not over the basic alpha-beta algorithm, but over a world-class program using all standard enhancements [15]. Finally, in [14], we examined a strong evolved chess-endgame player, focusing on the player's emergent capabilities and tactics in the context of a chess match. Using a number of methods we analyzed the evolved player's building blocks and their effect on play level. We concluded that evolution has found combinations of building blocks that are far from trivial and cannot be explained through simple combination – thereby indicating the possible emergence of complex strategies.

Example: Robocode

Program Architecture

A Robocode player is written as an event-driven Java program. A main loop controls the tank activities, which can be interrupted on various occasions, called events. The program is limited to four lines of code, as we were aiming for the HaikuBot category, one of the divisions of the international league with a four-line code limit. The main loop contains one line of code that directs the robot to start turning the gun (and the mounted radar) to the right. This ensures that within the first gun cycle, an enemy tank will be spotted by the radar, triggering a ScannedRobotEvent.
Within the code for this event, three additional lines of code were added, each controlling a single actuator, and using a single numerical input that was supplied by a genetic programming-evolved sub-program.

Genetic and Evolutionary Algorithms and Programming: General Introduction and Appl. to Game Playing, Figure 3 Robocode player’s code layout (HaikuBot division)

The first line instructs the tank to move to a distance specified by the first evolved argument. The second line instructs the tank to turn to an azimuth specified by the second evolved argument. The third line instructs the gun (and radar) to turn to an azimuth specified by the third evolved argument (Fig. 3). Terminal and Function Sets We divided the terminals into three groups according to their functionality [27], as shown in Table 8: 1. Game-status indicators: A set of terminals that provide real-time information on the game status, such as last enemy azimuth, current tank position, and energy levels. 2. Numerical constants: Two terminals, one providing the constant 0, the other being an ERC (ephemeral random constant). This latter terminal is initialized to a random real numerical value in the range [−1, 1], and does not change during evolution. 3. Fire command: This special function is used to curtail one line of code by not implementing the fire actuator in a dedicated line.
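To make the representation concrete, the following is a hedged sketch (not the authors’ code) of how a GP individual built from the terminal and function sets of Table 8 can be evaluated as an expression tree to yield one numeric actuator input per ScannedRobotEvent. The tree, the game-state values, and the node encoding are all illustrative assumptions.

```python
import math

# Function set (subset of Table 8): name -> (arity, implementation).
# Div is the usual GP "protected division": zero denominator returns zero.
FUNCTIONS = {
    "Add": (2, lambda a, b: a + b),
    "Sub": (2, lambda a, b: a - b),
    "Mul": (2, lambda a, b: a * b),
    "Div": (2, lambda a, b: a / b if b != 0 else 0.0),
    "Abs": (1, abs),
    "Neg": (1, lambda a: -a),
    "Sin": (1, math.sin),
    "Cos": (1, math.cos),
    "IfGreater": (4, lambda a, b, c, d: c if a > b else d),
    "IfPositive": (3, lambda a, b, c: b if a > 0 else c),
}

def evaluate(node, state):
    """Recursively evaluate an expression tree.

    A node is a float (an ephemeral random constant), a terminal name
    looked up in the game-state dictionary, or a tuple
    (function_name, child_1, ..., child_k)."""
    if isinstance(node, float):
        return node                      # ERC, fixed at creation time
    if isinstance(node, str):
        return state[node]               # game-status indicator
    name, *children = node
    arity, fn = FUNCTIONS[name]
    assert len(children) == arity
    return fn(*(evaluate(c, state) for c in children))

# Purely illustrative tree (not an evolved individual): "if the enemy is
# farther than 100, output the enemy bearing, else half of our heading".
tree = ("IfGreater", "EnemyDistance()", 100.0,
        "EnemyBearing()", ("Mul", 0.5, "Heading()"))
state = {"EnemyDistance()": 250.0, "EnemyBearing()": 30.0, "Heading()": 90.0}
print(evaluate(tree, state))  # -> 30.0
```

Three such trees, evaluated on the current game state, supply the three numerical arguments used by the player’s actuator lines.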
Fitness Measure We explored two different modes of learning: using a fixed external opponent as teacher, and coevolution – letting the individuals play against each other; the former proved better. However, not one external opponent was used to measure performance but three, these adversaries downloaded from the HaikuBot league (robocode.yajags.com). The fitness value of an individual equals its average fractional score (over three battles).

Genetic and Evolutionary Algorithms and Programming: General Introduction and Appl. to Game Playing, Table 8 Robocode representation. a Terminal set. b Function set (F: Float)

a. Terminal set:
Energy() – Returns the remaining energy of the player
Heading() – Returns the current heading of the player
X() – Returns the current horizontal position of the player
Y() – Returns the current vertical position of the player
MaxX() – Returns the horizontal battlefield dimension
MaxY() – Returns the vertical battlefield dimension
EnemyBearing() – Returns the current enemy bearing, relative to the current player’s heading
EnemyDistance() – Returns the current distance to the enemy
EnemyVelocity() – Returns the current enemy’s velocity
EnemyHeading() – Returns the current enemy heading, relative to the current player’s heading
EnemyEnergy() – Returns the remaining energy of the enemy
Constant() – An ERC (Ephemeral Random Constant) in the range [−1, 1]
Random() – Returns a random real number in the range [−1, 1]
Zero() – Returns the constant 0

b. Function set:
Add(F, F) – Add two real numbers
Sub(F, F) – Subtract two real numbers
Mul(F, F) – Multiply two real numbers
Div(F, F) – Divide first argument by second, if denominator non-zero, otherwise return zero
Abs(F) – Absolute value
Neg(F) – Negative value
Sin(F) – Sine function
Cos(F) – Cosine function
ArcSin(F) – Arcsine function
ArcCos(F) – Arccosine function
IfGreater(F, F, F, F) – If first argument greater than second, return value of third argument, else return value of fourth argument
IfPositive(F, F, F) – If first argument is positive, return value of second argument, else return value of third argument
Fire(F) – If argument is positive, execute fire command with argument as firepower and return 1; otherwise, do nothing and return 0

Control Parameters and Run Termination The major evolutionary parameters [19] were: population size – 256, generation count – between 100 and 200, selection method – tournament, reproduction probability – 0, crossover probability – 0.95, and mutation probability – 0.05. An evolutionary run terminates when fitness is observed to level off. Since the game is highly nondeterministic, a lucky individual might attain a higher fitness value than better overall individuals. In order to obtain a more
accurate measure for the evolved players we let each of them do battle for 100 rounds against 12 different adversaries (one at a time). The results were used to extract the top player – to be submitted to the international league. Results We submitted our top player to the HaikuBot division of the international league. At its very first tournament it came in third, later climbing to first place of 28 (robocode.yajags.com/20050625/haiku-1v1.html). All other 27 programs, defeated by our evolved strategy, were written by humans. For more details on GP-Robocode see [27,29]. Backgammon: Major Results We pitted our top evolved backgammon players against Pubeval, a free, public-domain board evaluation function written by Tesauro. The program – which plays well – has become the de facto yardstick used by the growing community of backgammon-playing program developers. Our top evolved player was able to attain a win percentage of 62.4% in a tournament against Pubeval, about 10% higher (!) than the previous top method. Moreover, several evolved strategies were able to surpass the 60% mark, and most of them outdid all previous works. For more details on GP-Gammon see [2,3,29]. Future Directions Evolutionary computation is a fast-growing field. As shown above, difficult, real-world problems are being tackled on a daily basis, both in academia and in industry. In the future we expect major developments in the underlying theory. Partly spurred by this, we also expect major new application areas to succumb to evolutionary algorithms, and many more human-competitive results. Expecting such pivotal breakthroughs may perhaps seem a bit of an overreach, but one must always keep in mind Evolutionary Computation’s success in Nature. Bibliography 1. Abelson H, Sussman GJ, Sussman J (1996) Structure and Interpretation of Computer Programs, 2nd edn. MIT Press, Cambridge 2. Azaria Y, Sipper M (2005) GP-Gammon: Genetically programming backgammon players.
Genet Program Evolvable Mach 6(3):283–300. doi:10.1007/s10710-005-2990-0 3. Azaria Y, Sipper M (2005) GP-Gammon: Using genetic programming to evolve backgammon players. In: Keijzer M, Tettamanzi A, Collet P, van Hemert J, Tomassini M (eds) Proceedings of 8th European Conference on Genetic Programming (EuroGP2005). Lecture Notes in Computer Science, vol 3447. Springer, Heidelberg, pp 132–142. doi:10.1007/b107383
4. Bain M (1994) Learning logical exceptions in chess. Ph.D. thesis, University of Strathclyde, Glasgow, Scotland. citeseer.ist.psu.edu/bain94learning.html 5. Bonanno G (1989) The logic of rational play in games of perfect information. Papers 347, California Davis – Institute of Governmental Affairs. http://ideas.repec.org/p/fth/caldav/347.html 6. Charness N (1991) Expertise in chess: The balance between knowledge and search. In: Ericsson KA, Smith J (eds) Toward a general theory of Expertise: Prospects and limits. Cambridge University Press, Cambridge 7. Cramer NL (1985) A representation for the adaptive generation of simple sequential programs. In: Grefenstette JJ (ed) Proceedings of the 1st International Conference on Genetic Algorithms. Lawrence Erlbaum Associates, Mahwah, pp 183–187 8. Epstein SL (1999) Game playing: The next moves. In: Proceedings of the Sixteenth National Conference on Artificial Intelligence. AAAI Press, Menlo Park, pp 987–993 9. Fogel DB (2006) Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, 3rd edn. Wiley-IEEE Press, Hoboken 10. Fogel LJ, Owens AJ, Walsh MJ (1966) Artificial Intelligence Through Simulated Evolution. Wiley, New York 11. Fürnkranz J (1996) Machine learning in computer chess: The next generation. Int Comput Chess Assoc J 19(3):147–161. citeseer.ist.psu.edu/furnkranz96machine.html 12. Hauptman A, Sipper M (2005) Analyzing the intelligence of a genetically programmed chess player. In: Late Breaking Papers at the 2005 Genetic and Evolutionary Computation Conference, distributed on CD-ROM at GECCO-2005, Washington DC 13. Hauptman A, Sipper M (2005) GP-EndChess: Using genetic programming to evolve chess endgame players. In: Keijzer M, Tettamanzi A, Collet P, van Hemert J, Tomassini M (eds) Proceedings of 8th European Conference on Genetic Programming (EuroGP2005). Lecture Notes in Computer Science, vol 3447. Springer, Heidelberg, pp 120–131. doi:10.1007/b107383 14.
Hauptman A, Sipper M (2007) Emergence of complex strategies in the evolution of chess endgame players. Adv Complex Syst 10(1):35–59. doi:10.1142/s0219525907001082 15. Hauptman A, Sipper M (2007) Evolution of an efficient search algorithm for the mate-in-n problem in chess. In: Ebner M, O’Neill M, Ekárt A, Vanneschi L, Esparcia-Alcázar AI (eds) Proceedings of 10th European Conference on Genetic Programming (EuroGP2007). Lecture Notes in Computer Science, vol 4455. Springer, Heidelberg, pp 78–89. doi:10.1007/978-3-540-71605-1_8 16. Holland JH (1975) Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. University of Michigan Press, Ann Arbor (2nd edn. MIT Press, Cambridge, 1992) 17. Holland JH (1992) Adaptation in Natural and Artificial Systems, 2nd edn. MIT Press, Cambridge
18. Kilinç S, Jain V, Aggarwal V, Cam U (2006) Catalogue of variable frequency and single-resistance-controlled oscillators employing a single differential difference complementary current conveyor. Frequenz: J RF-Eng Telecommun 60(7–8):142–146 19. Koza JR (1992) Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge 20. Koza JR, Keane MA, Streeter MJ, Mydlowec W, Yu J, Lanza G (2003) Genetic Programming IV: Routine Human-Competitive Machine Intelligence. Kluwer, Norwell 21. Laird JE, van Lent M (2000) Human-level AI’s killer application: Interactive computer games. In: AAAI-00: Proceedings of the 17th National Conference on Artificial Intelligence. MIT Press, Cambridge, pp 1171–1178 22. Lohn JD, Hornby GS, Linden DS (2005) An evolved antenna for deployment on NASA’s Space Technology 5 mission. In: O’Reilly UM, Yu T, Riolo R, Worzel B (eds) Genetic Programming Theory and Practice II, Genetic Programming, vol 8, chap 18. Springer, pp 301–315. doi:10.1007/0-387-23254-0_18 23. Montana DJ (1995) Strongly typed genetic programming. Evol Comput 3(2):199–230. doi:10.1162/evco.1995.3.2.199 24. Peña-Reyes CA, Sipper M (2001) Fuzzy CoCo: A cooperative-coevolutionary approach to fuzzy modeling. IEEE Trans Fuzzy Syst 9(5):727–737. doi:10.1109/91.963759 25. Preble S, Lipson M, Lipson H (2005) Two-dimensional photonic crystals designed by evolutionary algorithms. Appl Phys Lett 86(6):061111. doi:10.1063/1.1862783 26. Schwefel HP (1995) Evolution and Optimum Seeking. Wiley, New York 27. Shichel Y, Ziserman E, Sipper M (2005) GP-Robocode: Using genetic programming to evolve robocode players. In: Keijzer M, Tettamanzi A, Collet P, van Hemert J, Tomassini M (eds) Genetic Programming: 8th European Conference, EuroGP 2005, Lausanne, Switzerland, March 30–April 1, 2005. Lecture Notes in Computer Science, vol 3447. Springer, Berlin, pp 143–154. doi:10.1007/b107383 28.
Sipper M (2002) Machine Nature: The Coming Age of Bio-Inspired Computing. McGraw-Hill, New York 29. Sipper M, Azaria Y, Hauptman A, Shichel Y (2007) Designing an evolutionary strategizing machine for game playing and beyond. IEEE Trans Syst, Man, Cybern, Part C: Appl Rev 37(4):583–593 30. Stanley KO, Miikkulainen R (2002) Evolving neural networks through augmenting topologies. Evol Comput 10(2):99–127. doi:10.1162/106365602320169811 31. Turing AM (1950) Computing machinery and intelligence. Mind 59(236):433–460. http://links.jstor.org/sici?sici=0026-4423(195010)2:59:2362.0.CO;2-5 32. Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE Trans Evol Comput 1(1):67–82. doi:10.1109/4235.585893 33. Yao X (1999) Evolving artificial neural networks. Proc IEEE 87(9):1423–1447. doi:10.1109/5.784219
Genetic-Fuzzy Data Mining Techniques TZUNG-PEI HONG1 , CHUN-HAO CHEN2 , VINCENT S. TSENG2 1 Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan 2 Department of Computer Science and Information Engineering, National Cheng–Kung University, Tainan, Taiwan Article Outline Glossary Definition of the Subject Introduction Data Mining Fuzzy Sets Fuzzy Data Mining Genetic Algorithms Genetic-Fuzzy Data Mining Techniques Future Directions Bibliography Glossary Data mining Data mining is the process of extracting desirable knowledge or interesting patterns from existing databases for specific purposes. The common techniques include mining association rules, mining sequential patterns, clustering, and classification, among others. Fuzzy set theory The fuzzy set theory was first proposed by Zadeh in 1965. It is primarily concerned with quantifying and reasoning using natural language in which words can have ambiguous meanings. It is widely used in a variety of fields because of its simplicity and similarity to human reasoning. Fuzzy data mining The concept of fuzzy sets can be used in data mining to handle quantitative or linguistic data. Basically, fuzzy data mining first uses membership functions to transform each quantitative value into a fuzzy set in linguistic terms and then uses a fuzzy mining process to find fuzzy association rules. Genetic algorithms Genetic Algorithms (GAs) were first proposed by Holland in 1975. They have become increasingly important for researchers in solving difficult problems since they could provide feasible solutions in a limited amount of time. Each possible solution is encoded as a chromosome (individual) in a population.
According to the principle of survival of the fittest, GAs generate the next population by several genetic operations such as crossover, mutation, and reproduction. Genetic-fuzzy data mining Genetic algorithms have been widely used for solving optimization problems. If the fuzzy mining problem can be converted into an optimization problem, then the GA techniques can easily be adopted to solve it. They are thus called genetic-fuzzy data-mining techniques. They are usually used to automatically mine both appropriate membership functions and fuzzy association rules from a set of transaction data. Definition of the Subject Data mining is the process of extracting desirable knowledge or interesting patterns from existing databases for specific purposes. Most conventional data-mining algorithms identify the relationships among transactions using binary values. However, transactions with quantitative values are commonly seen in real-world applications. Fuzzy data-mining algorithms are thus proposed for extracting interesting linguistic knowledge from transactions stored as quantitative values. They usually integrate fuzzy-set concepts and mining algorithms to find interesting fuzzy knowledge from a given transaction data set. Most of them mine fuzzy knowledge under the assumption that a set of membership functions [8,23,24,35,36,50] is known in advance for the problem to be solved. The given membership functions may, however, have a critical influence on the final mining results. Different membership functions may infer different knowledge. Automatically deriving an appropriate set of membership functions for a fuzzy mining problem is thus very important. There are at least two reasons for this. The first is that experts may be unable to define an appropriate set of membership functions, since doing so requires much time and money and experts are not always available. The second is that data and concepts change over time.
Some mechanisms are thus needed to automatically adapt the membership functions to such changes. The fuzzy mining problem can thus be extended to finding both appropriate membership functions and fuzzy association rules from a set of transaction data. Recently, genetic algorithms have been widely used for solving optimization problems. If the fuzzy mining problem can be converted into an optimization problem, then the GA techniques can easily be adopted to solve it. They are thus called genetic-fuzzy data-mining techniques. They are usually used to automatically mine both
appropriate membership functions and fuzzy association rules from a set of transaction data. Some existing approaches are introduced here. These techniques can dynamically adapt membership functions by genetic algorithms according to some criteria, use them to fuzzify the quantitative transactions, and find fuzzy association rules by fuzzy mining approaches. Introduction Most enterprises have databases that contain a wealth of potentially accessible information. The unlimited growth of data, however, inevitably leads to a situation in which accessing desired information from a database becomes difficult. Knowledge discovery in databases (KDD) has thus become a process of considerable interest in recent years, as the amounts of data in many databases have grown tremendously large. KDD means the application of nontrivial procedures for identifying effective, coherent, potentially useful, and previously unknown patterns in large databases [16]. The KDD process [16] is shown in Fig. 1.

Genetic-Fuzzy Data Mining Techniques, Figure 1 A KDD Process

In Fig. 1, data are first collected from a single or multiple sources. These data are then preprocessed, including methods such as sampling, feature selection or reduction, and data transformation, among others. After that, data-mining techniques are then used to find useful patterns, which are then interpreted and evaluated to form human knowledge. In particular, data mining plays a critical role in the KDD process. It involves applying specific algorithms for extracting patterns or rules from data sets in a particular representation. Because of its importance, many researchers in the database and machine learning fields are primarily interested in this topic because it offers opportunities to discover useful information and important relevant patterns in large databases, thus helping decision-makers easily analyze the data and make good decisions regarding the domains concerned. For example, there may exist some implicitly useful knowledge in a large database containing
millions of records of customers’ purchase orders over the recent years. This knowledge can be found by appropriate data-mining approaches. Questions such as “what are the most important trends in customers’ purchase behavior?” can thus be easily answered. Most of the mining approaches were proposed for binary transaction data. However, in real applications, quantitative data exists and should also be considered. Fuzzy set theory is being used more and more frequently in intelligent systems because of its simplicity and similarity to human reasoning [51]. The theory has been applied in fields such as manufacturing, engineering, diagnosis, economics, among others [45,52]. Several fuzzy learning algorithms for inducing rules from given sets of data have been designed and used to good effect within specific domains [7,22]. As to fuzzy data mining, many algorithms are also proposed [8,23,24,35,36,50]. Most of these fuzzy data-mining algorithms assume the membership functions are already known. In fuzzy mining problems, the given membership functions may, however, have a critical influence on the final mining results. Developing effective and efficient approaches to derive both the appropriate membership functions and fuzzy association rules automatically are thus worth being studied. Genetic algorithms are widely used for finding membership functions in different fuzzy applications. In this article, we discuss the genetic-fuzzy data-mining approaches which can mine both appropriate membership functions and fuzzy association rules [9,10,11,12,25,26,27,31,32,33]. The genetic-fuzzy mining problems can be divided into four kinds according to the types of fuzzy mining problems and the ways of processing items. The types of fuzzy mining problems include Single-minimum-Support Fuzzy Mining (SSFM) and Multiple-minimum-Support Fuzzy Mining (MSFM). 
The ways of processing items include processing all the items together (integrated approach) and processing them individually (divide-and-conquer approach). Each of them will be described in detail in the following sections. But first of all, the basic concepts of data mining will be described below.
Data Mining Data mining techniques have been used in different fields to discover interesting information from databases in recent years. Depending on the type of databases processed, mining approaches may be classified as working on transaction databases, temporal databases, relational databases, multimedia databases, and data streams, among others. On the other hand, depending on the classes of knowledge derived, mining approaches may be classified as finding association rules, classification rules, clustering rules, and sequential patterns, among others. Finding association rules in transaction databases is most commonly seen in data mining. It was initially applied to market-basket analysis to find relationships among purchased items. An association rule can be expressed in the form A → B, where A and B are sets of items, such that the presence of A in a transaction will imply the presence of B. Two measures, support and confidence, are evaluated to determine whether a rule should be kept. The support of a rule is the fraction of the transactions that contain all the items in A and B. The confidence of a rule is the conditional probability of the occurrences of items in A and B over the occurrences of items in A. The support and the confidence of an interesting rule must be larger than or equal to a user-specified minimum support and a minimum confidence, respectively. To achieve this purpose, Agrawal and his coworkers proposed several mining algorithms based on the concept of large item sets to find association rules in transaction data [1,2,3,4]. They divided the mining process into two phases. In the first phase, candidate item sets were generated and counted by scanning the transaction data. If the number of times an itemset appeared in the transactions was larger than a predefined threshold value (called minimum support), the itemset was considered a large itemset. Itemsets containing only one item were processed first.
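The two rule measures just defined can be sketched in a few lines; the toy transactions below are illustrative assumptions, not data from the article.

```python
# support(A ∪ B) = fraction of transactions containing all items of A and B;
# confidence(A -> B) = support(A ∪ B) / support(A).

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Conditional probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule {bread} -> {milk}: both items appear together in 2 of 4 transactions.
print(support({"bread", "milk"}))        # -> 0.5
print(confidence({"bread"}, {"milk"}))   # ≈ 0.667
```

A rule is kept only when both values reach the user-specified minimum support and minimum confidence.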
Large itemsets containing only single items were then combined to form candidate itemsets containing two items. This process was repeated until all large itemsets had been found. In the second phase, association rules were induced from the large itemsets found in the first phase. All possible association combinations for each large itemset were formed, and those with calculated confidence values larger than a predefined threshold (called minimum confidence) were output as association rules. In addition to the above approach, there are many other ones proposed for finding association rules. Most mining approaches focus on binary valued transaction data. Transaction data in real-world applications, however, usually consist of quantitative values. Many sophisticated data-mining approaches have thus been proposed to deal with various types of data [6,46,53]. This also presents a challenge to workers in this research field. In addition to proposing methods for mining association rules from transactions of binary values, Srikant et al. also proposed a method [46] for mining association rules from those with quantitative attributes. Their method first determines the number of partitions for each quantitative attribute, and then maps all possible values of each attribute into a set of consecutive integers. It then finds large itemsets whose support values are greater than the user-specified minimum-support levels. For example, the following is a quantitative association rule “If Age is [20, 29], then Number of Cars is [0, 1]” with a support value (60%) and a confidence value (66.6%). This means that if a person is between 20 and 29 years old, then with 66.6% confidence he/she has zero or one car. Of course, different partition approaches for discretizing the quantitative values may influence the final quantitative association rules. Some research has thus been devoted to discussing and solving this problem [6,53]. Recently, fuzzy sets have also been used in data mining to handle quantitative data due to their ability to deal with the interval boundary problem. The theory of fuzzy sets will be introduced below. Fuzzy Sets Fuzzy set theory was first proposed by Zadeh in 1965 [51]. It is primarily concerned with quantifying and reasoning using natural language in which words can have ambiguous meanings. It is widely used in a variety of fields because of its simplicity and similarity to human reasoning [13,45,52]. For example, the theory has been applied in fields such as manufacturing, engineering, diagnosis, economics, among others [19,29,37]. Fuzzy set theory can be thought of as an extension of traditional crisp sets, in which each element must either be in or not in a set.
Formally, the process by which individuals from a universal set X are determined to be either members or nonmembers of a crisp set can be defined by a characteristic or discrimination function [51]. For a given crisp set A, this function assigns a value $\mu_A(x)$ to every $x \in X$ such that

$\mu_A(x) = \begin{cases} 1 & \text{if and only if } x \in A \\ 0 & \text{if and only if } x \notin A \end{cases}$

The function thus maps elements of the universal set to the set containing 0 and 1. This kind of function can be generalized such that the values assigned to the elements of the universal set fall within specified ranges, referred
to as the membership grades of these elements in the set. Larger values denote higher degrees of set membership. Such a function is called a membership function, $\mu_A(x)$, by which a fuzzy set A is usually defined. This function is represented by

$\mu_A : X \to [0, 1]$ ,

where $[0, 1]$ denotes the interval of real numbers from 0 to 1, inclusive. The function can also be generalized to any real interval instead of $[0, 1]$. A special notation is often used in the literature to represent fuzzy sets. Assume that $x_1$ to $x_n$ are the elements in fuzzy set A, and $\mu_1$ to $\mu_n$ are, respectively, their grades of membership in A. A is then represented as follows:

$A = \mu_1/x_1 + \mu_2/x_2 + \cdots + \mu_n/x_n$ .

An $\alpha$-cut of a fuzzy set A is a crisp set $A_\alpha$ that contains all the elements in the universal set X with their membership grades in A greater than or equal to a specified value of $\alpha$. This definition can be written as

$A_\alpha = \{ x \in X \mid \mu_A(x) \ge \alpha \}$ .

The scalar cardinality of a fuzzy set A defined on a finite universal set X is the summation of the membership grades of all the elements of X in A. Thus,

$|A| = \sum_{x \in X} \mu_A(x)$ .
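These notions (membership grades, the α-cut, scalar cardinality) can be sketched on a small discrete universal set; the membership grades below are illustrative values chosen for this example, not taken from the text.

```python
# A fuzzy set A on a finite universal set X, stored as a mapping
# element -> membership grade mu_A(x) in [0, 1].
X = ["x1", "x2", "x3", "x4"]
A = {"x1": 0.2, "x2": 0.5, "x3": 0.9, "x4": 1.0}

def alpha_cut(fuzzy_set, alpha):
    """Crisp set of elements whose membership grade is >= alpha."""
    return {x for x, mu in fuzzy_set.items() if mu >= alpha}

def scalar_cardinality(fuzzy_set):
    """|A| = sum of the membership grades over X."""
    return sum(fuzzy_set.values())

print(sorted(alpha_cut(A, 0.5)))              # -> ['x2', 'x3', 'x4']
print(round(scalar_cardinality(A), 2))        # -> 2.6
```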
Three basic and commonly used operations on fuzzy sets are complementation, union and intersection, as proposed by Zadeh. They are described as follows.

1. The complementation of a fuzzy set A is denoted by $\neg A$, and the membership function of $\neg A$ is given by

$\mu_{\neg A}(x) = 1 - \mu_A(x) \quad \forall x \in X$ .

2. The intersection of two fuzzy sets A and B is denoted by $A \cap B$, and the membership function of $A \cap B$ is given by

$\mu_{A \cap B}(x) = \min\{\mu_A(x), \mu_B(x)\} \quad \forall x \in X$ .

3. The union of two fuzzy sets A and B is denoted by $A \cup B$, and the membership function of $A \cup B$ is given by

$\mu_{A \cup B}(x) = \max\{\mu_A(x), \mu_B(x)\} \quad \forall x \in X$ .

Note that there are other calculation formulas for the complementation, union and intersection, but the above are the most popular.

Fuzzy Data Mining As mentioned above, the fuzzy set theory is a natural way to process quantitative data. Several fuzzy learning algorithms for inducing rules from given sets of data have thus been designed and used to good effect with specific domains [7,22,41]. Fuzzy data mining approaches have also been developed to find knowledge with linguistic terms from quantitative transaction data. The knowledge obtained is expected to be easy to understand. A fuzzy association rule is shown in Fig. 2. In Fig. 2, instead of the quantitative intervals used in quantitative association rules, linguistic terms are used to represent the knowledge. As we can observe from the rule “If middle amount of bread is bought, then high amount of milk is bought”, bread and milk are items, and middle and high are linguistic terms. The rule means that if the quantity of the purchased item bread is middle, then there is a high possibility that the associated purchased item is milk with high quantity. Many approaches have been proposed for mining fuzzy association rules [8,23,24,35,36,50]. Most of the approaches set a single minimum support threshold for all the items or itemsets and identify the association relationships among transactions. In real applications, different items may have different criteria to judge their importance. Multiple minimum support thresholds are thus proposed for this purpose. We can thus divide the fuzzy data mining approaches into two types, namely Single-minimum-Support Fuzzy Mining (SSFM) [8,23,24,35,50] and Multiple-minimum-Support Fuzzy Mining (MSFM) problems [36]. In the SSFM problem, Chan and Au proposed an FAPACS algorithm to mine fuzzy association rules [8]. They first transformed quantitative attribute values into linguistic terms and then used the adjusted difference analysis to find interesting associations among attributes.
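To make the fuzzification step concrete, the sketch below transforms a quantitative purchase amount into a fuzzy set over linguistic terms; the triangular membership function shapes are illustrative assumptions, not those of any cited algorithm.

```python
# Fuzzifying a quantitative value into linguistic terms, the first step
# of the fuzzy mining process.  Membership functions are assumptions.

def triangular(a, b, c):
    """Membership function rising from a to b and falling from b to c."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu

TERMS = {   # linguistic terms for a purchased quantity
    "Low": triangular(0, 1, 6),
    "Middle": triangular(1, 6, 11),
    "High": triangular(6, 11, 16),
}

def fuzzify(quantity):
    """Map a quantitative value to a membership grade for each term."""
    return {term: mu(quantity) for term, mu in TERMS.items()}

# A transaction "5 units of bread" becomes a fuzzy set over the terms:
grades = fuzzify(5)
print({t: round(g, 2) for t, g in grades.items()})
# -> {'Low': 0.2, 'Middle': 0.8, 'High': 0.0}
```

The fuzzy support of an itemset such as (bread, Middle) and (milk, High) is then typically computed by taking, in each transaction, the minimum of the item grades and summing these minima over all transactions.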
Genetic-Fuzzy Data Mining Techniques, Figure 2 A fuzzy association rule
Genetic-Fuzzy Data Mining Techniques, Figure 3 The concept of fuzzy data mining for the SSFM problem
Kuok et al. proposed a mining approach for fuzzy association rules. Instead of the minimum supports and minimum confidences used in most mining approaches, significance factors and certainty factors were used to derive large itemsets and fuzzy association rules [35]. At nearly the same time, Hong et al. proposed a fuzzy mining algorithm to mine fuzzy rules from quantitative transaction data [23]. Basically, these fuzzy mining algorithms first used membership functions to transform each quantitative value into a fuzzy set in linguistic terms and then used a fuzzy mining process to find fuzzy association rules. Yue et al. then extended the above concept to find fuzzy association rules with weighted items from transaction data [50]. They adopted Kohonen self-organized mapping to derive fuzzy sets for numerical attributes. In general, the basic concept of fuzzy data mining for the SSFM problem is shown in Fig. 3. In Fig. 3, the process for fuzzy data mining first transforms quantitative transactions into a fuzzy representation according to the predefined membership functions. The transformed data are then processed to generate large itemsets. Finally, the generated large itemsets are used to derive fuzzy association rules. As to the MSFM problem, Lee et al. proposed a mining algorithm which used multiple minimum supports to mine fuzzy association rules [36]. They assumed that items
had different minimum supports and the maximum constraint was used. That is, the minimum support for an itemset was set as the maximum of the minimum supports of the items contained in the itemset. Under this constraint, the characteristic of level-by-level processing was kept, such that the original Apriori algorithm could easily be extended to finding large itemsets. In addition to the maximum constraint, other constraints such as the minimum constraint can also be used with a different rationale. In addition to the above fuzzy mining approaches, fuzzy data mining with taxonomy or fuzzy taxonomy has been developed. Fuzzy web mining is another application of it. Besides, fuzzy data mining is strongly related to fuzzy control, fuzzy clustering, and fuzzy learning. In most fuzzy mining approaches, the membership functions are usually predefined. Membership functions are, however, very crucial to the final mined results. Below, we will describe how the genetic algorithm can be combined with fuzzy data mining to make the entire process more complete. The concept of the genetic algorithm will first be briefly introduced in the next section. Genetic Algorithms Genetic Algorithms (GAs) [17,20] have become increasingly important for researchers in solving difficult problems since they could provide feasible solutions in a limited amount of time [21]. They were first proposed by Holland in 1975 [20] and have been successfully applied to the fields of optimization [17,39,40,43], machine learning [17,39], neural networks [40], fuzzy logic controllers [43], and so on. GAs are developed mainly based on the ideas and techniques from genetic and evolutionary theory [20]. According to the principle of survival of the fittest, they generate the next population by several operations, with each individual in the population representing a possible solution. There are three principal operations in a genetic algorithm.

Genetic-Fuzzy Data Mining Techniques, Figure 4 The entire GA process
1. The crossover operation: it generates offspring from two chosen individuals in the population by exchanging some bits between them. The offspring thus inherit some characteristics from each parent.
2. The mutation operation: it generates offspring by randomly changing one or several bits in an individual. The offspring may thus possess characteristics different from their parents'. Mutation keeps the search from being confined to a local region of the search space and increases the probability of finding global optima.
3. The selection operation: it chooses some offspring for survival according to predefined rules. This keeps the population size within a fixed constant and puts good offspring into the next generation with a high probability.

On applying genetic algorithms to a problem, the first step is to define a representation that describes the problem states. The most common representation is the bit string. An initial population of individuals, called chromosomes, is then generated, and the three genetic operations (crossover, mutation, and selection) are performed to produce the next generation. Each chromosome in the population is evaluated by a fitness function to determine its goodness. This procedure is repeated until a user-specified termination criterion is satisfied. The entire GA process is shown in Fig. 4.

In the previous section on fuzzy data mining, several approaches were introduced in which the membership functions were assumed to be known in advance. The given membership functions may, however, have a critical influence on the final mining results. Although many approaches for learning membership functions have been proposed [14,42,44,47,48], most of them are used for classification or control problems. Several strategies have been proposed for learning membership functions in classification or control problems by genetic algorithms. Below are some of them:
1. Learning membership functions first, then rules; 2. Learning rules first, then membership functions; 3. Simultaneously learning rules and membership functions; 4. Iteratively learning rules and membership functions.
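Independent of which strategy is chosen, the underlying GA cycle (initialization, fitness evaluation, selection, crossover, mutation) is the same. A minimal bit-string sketch follows; the fitness function ("OneMax", counting 1-bits) and all parameter values are illustrative only, not taken from any cited approach.

```python
import random

random.seed(0)

def evolve(fitness, length=10, pop_size=20, generations=50,
           crossover_rate=0.8, mutation_rate=0.05):
    """Minimal generational GA over bit strings using the three
    operations: selection (binary tournament), one-point crossover,
    and bit-flip mutation."""
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        def select():
            a, b = random.sample(pop, 2)
            return max(a, b, key=fitness)
        next_pop = []
        while len(next_pop) < pop_size:
            p1, p2 = select(), select()
            if random.random() < crossover_rate:
                cut = random.randrange(1, length)   # one-point crossover
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):
                next_pop.append([1 - g if random.random() < mutation_rate
                                 else g for g in child])
        pop = next_pop[:pop_size]
    return max(pop, key=fitness)

# Illustrative fitness: number of 1-bits ("OneMax").
best = evolve(fitness=sum)
```

In the genetic-fuzzy approaches below, the bit string instead encodes membership function parameters, and the fitness function involves the mined large itemsets.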
For fuzzy mining problems, much research has also been done combining genetic algorithms and fuzzy concepts to discover both suitable membership functions and useful fuzzy association rules from quantitative values. However, most of this work adopts the first strategy: the membership functions are first learned, and the fuzzy association rules are then derived based on the obtained membership functions. It is done in this way because the number of association rules is often large in mining problems and cannot easily be coded in a chromosome. In this article, we introduce several genetic-fuzzy data mining algorithms that can mine both appropriate membership functions and fuzzy association rules [9,10,11,12,25,26,27,31,32,33]. The genetic-fuzzy mining problems can be divided into four kinds according to the types of fuzzy mining problems and the ways of processing items. The types of fuzzy mining problems include Single-minimum-Support Fuzzy-Mining (SSFM) and Multiple-minimum-Support Fuzzy-Mining (MSFM), as mentioned above. The ways of processing items include processing all the items together (integrated approach) and processing them individually (divide-and-conquer approach). The integrated genetic-fuzzy approaches encode the membership functions of all items (or attributes) into one chromosome (also called an individual). Genetic algorithms are then used to derive a set of appropriate membership functions according to the designed fitness function, and the best set of membership functions is finally used to mine fuzzy association rules. On the other hand, the divide-and-conquer genetic-fuzzy approaches encode the membership functions of each item into its own chromosome; in other words, the chromosomes in a population are maintained for only one item. The membership functions can thus be found for one item after another, or at the same time via parallel processing.
In general, the chromosomes in the divide-and-conquer genetic-fuzzy approaches are much shorter than those in the integrated approaches since the former only focus on individual items. But there are more application limitations on the former than on the latter. This will be explained later. The four kinds of problems are thus the Integrated Genetic-Fuzzy problem for items with a Single Minimum Support (IGFSMS) [9,11,26,31,32,33], the Integrated Genetic-Fuzzy problem for items with Multiple Minimum Supports (IGFMMS) [12], the Divide-and-Conquer Genetic-Fuzzy problem for items with a Single Minimum Support (DGFSMS) [10,25,27] and the Divide-and-Conquer Genetic-Fuzzy problem for items with Multiple Minimum Supports (DGFMMS). The classification is shown in Table 1.
Genetic-Fuzzy Data Mining Techniques, Table 1 The four different genetic-fuzzy data mining problems
                             Single minimum support  Multiple minimum supports
Integrated approach          IGFSMS problem          IGFMMS problem
Divide-and-conquer approach  DGFSMS problem          DGFMMS problem
Each of the four kinds of genetic-fuzzy data mining problems will be introduced in the following sections.

The Integrated Genetic-Fuzzy Problem for Items with a Single Minimum Support (IGFSMS)

Many approaches have been published for solving the IGFSMS problem [9,11,26,31,32,33]. For example, Hong et al. proposed a genetic-fuzzy data-mining algorithm for extracting both association rules and membership functions from quantitative transactions [26]. They proposed a GA-based framework for searching for membership functions suitable for a given mining problem and then used the final best set of membership functions to mine fuzzy association rules. The proposed framework, shown in Fig. 5, consists of two phases, namely mining membership functions and mining fuzzy association rules. In the first phase, the framework maintains a population of sets of membership functions and uses the genetic algorithm to automatically derive the resulting one. It first transforms each set of membership functions into a fixed-length string. Each chromosome is then evaluated by the number of large 1-itemsets and the suitability of its membership functions. The fitness value of a chromosome C_q is defined as

f(C_q) = |L_1| / suitability(C_q),

where |L_1| is the number of large 1-itemsets obtained by using the set of membership functions in C_q. Using the number of large 1-itemsets achieves a trade-off between execution time and rule interestingness: a larger number of large 1-itemsets will usually result in a larger number of all itemsets with a higher probability, which usually implies more interesting association rules, while evaluation by 1-itemsets is faster than evaluation by all itemsets or by interesting association rules. Of course, the number of all itemsets or of interesting association rules can also be used in the fitness function. A discussion of different choices of fitness functions can be found in [9].
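A direct transcription of this fitness function is shown below; the inputs are hypothetical, since in the real framework |L_1| comes from a database scan with the decoded membership functions, and suitability is treated here as a penalty (smaller is better), consistent with its use in the denominator.

```python
def fitness(num_large_1_itemsets, suitability):
    """f(Cq) = |L1| / suitability(Cq): more large 1-itemsets is better,
    and a smaller suitability penalty (better-shaped membership
    functions) is better."""
    return num_large_1_itemsets / suitability

# Hypothetical chromosomes: (|L1| from a database scan, suitability value).
print(fitness(12, 4.0))   # same rule count, well-shaped functions
print(fitness(12, 6.0))   # same rule count, worse-shaped functions
```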
Genetic-Fuzzy Data Mining Techniques, Figure 5 A genetic-fuzzy framework for the IGFSMS problem
The suitability measure is used to reduce the occurrence of bad types of membership functions. The two bad types are shown in Fig. 6, where the first is too redundant and the second is too separate. Two factors, called the overlap factor and the coverage factor, are used to avoid these bad shapes: the overlap factor is designed to avoid the first bad case (too redundant), and the coverage factor the second (too separate). Each factor has its own formula for evaluating a chromosome. After fitness evaluation, the approach chooses appropriate chromosomes for mating, gradually creating good offspring membership function sets. The offspring membership function sets then undergo recursive evolution until a good set of membership functions has been obtained. In the second phase, the final best membership functions are gathered to mine fuzzy association rules. The fuzzy mining algorithm proposed in [24] is adopted to achieve this purpose.
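The article does not reproduce the two formulas, so the sketch below uses one plausible formulation purely for illustration: overlap between neighboring functions beyond a tolerated amount penalizes "too redundant" sets, and a domain that is not fully covered penalizes "too separate" ones. The triangular membership functions are hypothetical.

```python
def overlap_factor(mfs):
    """Illustrative penalty for 'too redundant' membership function
    sets: pairwise horizontal overlap of triangles (a, b, c) in excess
    of a tolerated amount."""
    penalty = 0.0
    for i in range(len(mfs)):
        for j in range(i + 1, len(mfs)):
            (a1, b1, c1), (a2, b2, c2) = sorted([mfs[i], mfs[j]])
            overlap = max(0.0, c1 - a2)        # overlap length on the x-axis
            allowed = min(c1 - b1, b2 - a2)    # tolerated overlap
            penalty += max(0.0, overlap - allowed)
    return penalty

def coverage_factor(mfs, domain):
    """Illustrative penalty for 'too separate' membership function
    sets: ratio of the attribute domain to the range the functions
    cover (1.0 when the whole domain is covered, larger otherwise)."""
    covered = max(c for _, _, c in mfs) - min(a for a, _, _ in mfs)
    return domain / covered

# A well-shaped hypothetical set: no excess overlap, full coverage.
good = [(0, 1, 6), (1, 6, 11), (6, 11, 12)]
print(overlap_factor(good), coverage_factor(good, domain=12))
```

A suitability value can then be formed by combining the two factors (e.g. summing them), so that badly shaped sets receive a larger penalty and, via the fitness formula above, a lower fitness.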
Genetic-Fuzzy Data Mining Techniques, Figure 6 The two bad types of membership functions

Genetic-Fuzzy Data Mining Techniques, Table 2 The coverage and the overlap factors of ten chromosomes

Chromosome  Suitability    Chromosome  Suitability
C1          (4, 0)         C6          (4.5, 0)
C2          (4.24, 0.5)    C7          (4.45, 0)
C3          (4.37, 0)      C8          (4.37, 0.53)
C4          (4.66, 0)      C9          (4.09, 8.33)
C5          (4.37, 0.33)   C10         (4.87, 0)

The calculation of large 1-itemsets, however, still takes a lot of time, especially when the database cannot be fed entirely into main memory. An enhanced approach, called the cluster-based fuzzy-genetic mining algorithm, was thus proposed [11] to speed up the evaluation process while keeping nearly the same solution quality as in [26]. That approach also maintains a population of sets of membership functions and uses the genetic algorithm to derive the best one. Before fitness evaluation, the chromosomes are first clustered: the k-means approach gathers similar chromosomes into groups, using the two factors, the overlap factor and the coverage factor, as clustering attributes. For example, the coverage and overlap factors of ten chromosomes are shown in Table 2, where the column "Suitability" represents the pair (coverage factor, overlap factor). The k-means clustering approach is then executed to divide the ten chromosomes into k clusters. In this example, assume the parameter k is set at 3. The three clusters found are shown in Table 3; their representative chromosomes are C5 (4.37, 0.33), C4 (4.66, 0) and C9 (4.09, 8.33). All the chromosomes in a cluster use the number of large 1-itemsets derived from the representative chromosome of their cluster, together with their own suitability values, to calculate their fitness. Since the number of database scans decreases, the evaluation cost is reduced: in this example only the three representative chromosomes C4, C5 and C9 require the number of large 1-itemsets to be calculated. The evaluation results are used to choose appropriate chromosomes for mating in the next generation. The offspring membership function sets then undergo recursive evolution until a good set of membership functions has been obtained. Finally, the derived membership functions are used to mine fuzzy association rules.

Genetic-Fuzzy Data Mining Techniques, Table 3 The three clusters found in the example

Cluster   Chromosomes           Representative chromosome
Cluster1  C1, C2, C5, C8        C5
Cluster2  C3, C4, C6, C7, C10   C4
Cluster3  C9                    C9

Kaya and Alhajj also proposed several genetic-fuzzy data mining approaches to derive membership functions and fuzzy association rules [31,32,33]. In [31], the proposed approach tries to derive membership functions that maximize profit within an interval of user-specified minimum support values, and then uses the derived membership functions to mine fuzzy association rules. The concept of their approaches is shown in Fig. 7. As shown in Fig. 7a, the approach first derives membership functions from the given quantitative transaction database by genetic algorithms; the final membership functions are then used to mine fuzzy association rules. Figure 7b shows the concept of maximizing the number of large itemsets over the given minimum support interval, which is used as the fitness function. Kaya and Alhajj also extended the approach to mine fuzzy weighted association rules [32]. Furthermore, since a fitness function is not always easily defined for GA applications, multiobjective genetic algorithms have also been developed [28,30]: more than one criterion is used in the evaluation, and a set of solutions, namely nondominated points (also called the Pareto-optimal surface), is derived and given to users instead of the single best solution obtained by ordinary genetic algorithms. Kaya and Alhajj thus proposed an approach based on multiobjective genetic algorithms to learn membership functions, which were then used to generate interesting fuzzy association rules [33]. Three objective functions, namely strongness, interestingness and comprehensibility, were used in their approach to find the Pareto-optimal surface. In addition to the above approaches for the IGFSMS problem, some others are still in progress.
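The cluster-based evaluation described above can be sketched with the values from Table 2; for illustration, a single nearest-center assignment step, seeded at the Table 3 representatives, stands in for the full k-means run.

```python
# (coverage factor, overlap factor) pairs from Table 2.
chromosomes = {
    "C1": (4.0, 0.0),   "C2": (4.24, 0.5),  "C3": (4.37, 0.0),
    "C4": (4.66, 0.0),  "C5": (4.37, 0.33), "C6": (4.5, 0.0),
    "C7": (4.45, 0.0),  "C8": (4.37, 0.53), "C9": (4.09, 8.33),
    "C10": (4.87, 0.0),
}
representatives = {"C5": (4.37, 0.33), "C4": (4.66, 0.0), "C9": (4.09, 8.33)}

def cluster(points, centers):
    """Assign each chromosome to its nearest representative
    (one k-means assignment step, squared Euclidean distance)."""
    groups = {rep: [] for rep in centers}
    for name, (x, y) in points.items():
        rep = min(centers, key=lambda r: (x - centers[r][0]) ** 2
                                         + (y - centers[r][1]) ** 2)
        groups[rep].append(name)
    return groups

groups = cluster(chromosomes, representatives)
# Only the 3 representatives scan the database for their large-1-itemset
# counts; the other 7 chromosomes reuse their cluster's count, combined
# with their own suitability, when computing fitness.
print(groups)
```

Running this reproduces the grouping of Table 3, with three database scans replacing ten.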
Genetic-Fuzzy Data Mining Techniques, Figure 7 The concept of Kaya and Alhajj's approaches

The Integrated Genetic-Fuzzy Problem for Items with Multiple Minimum Supports (IGFMMS)

As seen in the above subsection, much research focuses on integrated genetic-fuzzy approaches for items with a single minimum support. However, different items may have different criteria for judging their importance. For example, assume that among a set of items there are some which are expensive. They are seldom bought because of their high cost, and the support values of these items are therefore low. A manager may, however, still be interested in these products due to their high profits. In such cases, the above approaches may not be suitable. Chen et al. thus proposed another genetic-fuzzy data mining approach [12], an extension of the approach proposed in [26], to solve this problem. The approach combines clustering, fuzzy and genetic concepts to derive minimum support values and membership functions for items; the final minimum support values and membership functions are then used to mine fuzzy association rules. The genetic-fuzzy mining framework for the IGFMMS problem is shown in Fig. 8. The framework can be divided into two phases: the first phase searches for suitable minimum support values and membership functions of items, and the second phase uses the final best set of minimum support values and membership functions to mine fuzzy association rules. The framework maintains a population of sets of minimum support values and membership functions, and uses the genetic algorithm to automatically derive the resulting one. A genetic algorithm requires a population of feasible solutions to be initialized and updated during the evolution process. As mentioned above, each individual within the population is a set of minimum support values and
isosceles-triangular membership functions. Each membership function corresponds to a linguistic term of a certain item. In this approach, the initial set of chromosomes is generated based on initialization information derived by applying k-means clustering to the transactions. The frequencies and quantitative values of items in the transactions are the two main factors used to gather similar items into groups. The initialization information includes an appropriate number of linguistic terms, the range of possible minimum support values, and the membership functions of each item. All the items in the same cluster are considered to have similar characteristics and are assigned similar initialization values when the population is initialized. The approach then generates and encodes each set of minimum support values and membership functions into a fixed-length string according to the initialization information. In this approach, the minimum support values of items may differ, and it is hard to assign these values directly. As an alternative, they could be determined according to the required number of rules; it is, however, very time-consuming to obtain the rules for each chromosome. As mentioned above, a larger number of large 1-itemsets will usually result in a larger number of all itemsets with a higher probability, which usually implies more interesting association rules, and evaluation by 1-itemsets is faster than evaluation by all itemsets or interesting association rules. Using the number of large 1-itemsets can thus achieve a trade-off between execution time and rule interestingness [26]. A criterion should thus be specified to reflect the user's preference for the derived knowledge. In this approach, the required number of large 1-itemsets, RNL, is used for this purpose. It is the number of linguistic large 1-itemsets that a user wants to obtain for an item, defined as the number of linguistic terms of an item multiplied by a predefined percentage reflecting the user's preference; it measures the closeness between the number of derived large 1-itemsets and the required number. For example, assume there are three linguistic terms for an item and the predefined percentage p is set at 80%. The RNL value is then ⌊3 × 0.8⌋, which is 2. The fitness function is then composed of the suitability of the membership functions and the closeness to the RNL value. The minimum support values and membership functions can thus be derived by the GA and are then used to mine fuzzy association rules by a fuzzy mining approach for multiple minimum supports, such as the one in [36].

Genetic-Fuzzy Data Mining Techniques, Figure 8 A genetic-fuzzy framework for the IGFMMS problem

The Divide-and-Conquer Genetic-Fuzzy Problem for Items with a Single Minimum Support (DGFSMS)

The advantages of the integrated genetic-fuzzy approaches are that they are simple, easy to use, and place few constraints on the fitness functions; criteria other than the number of large 1-itemsets can also be used. However, if the number of items is large, the integrated approaches may need a lot of time to find a near-optimal solution because the chromosomes are very long. Recently, the divide-and-conquer strategy has been used in the evolutionary computation community to very good effect, and many algorithms based on it have been proposed for different applications [5,15,34,49]. When the number of large 1-itemsets is used in fitness evaluation, the divide-and-conquer strategy is a good choice since each item can then be processed individually. Hong et al. thus used a GA-based framework with the divide-and-conquer strategy to search for membership functions suitable for the mining problem [27]. The framework is shown in Fig. 9.

Genetic-Fuzzy Data Mining Techniques, Figure 9 A genetic-fuzzy framework for the DGFSMS problem

The proposed framework is divided into two phases: mining membership functions and mining fuzzy association rules. Assume the number of items is m. In the phase of mining membership functions, the framework maintains m populations of membership functions, with one population per item. Each chromosome in a population represents a possible set of membership functions for that item.
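The per-item structure just described might be sketched as follows. The chromosome encoding, the fitness function, and the evolution step are all placeholders (hypothetical), since the point here is only that each item's population evolves independently of the others.

```python
import random

random.seed(1)

# Hypothetical items; each chromosome encodes three triangular membership
# functions for ONE item as nine sorted real numbers (a, b, c per function).
ITEMS = ["bread", "milk", "battery"]

def random_chromosome():
    return sorted(random.uniform(0, 11) for _ in range(9))

def fitness(chrom):
    """Placeholder fitness; the real framework uses the fuzzy-supports of
    large 1-itemsets plus the suitability of the decoded functions."""
    return -sum(abs(a - b) for a, b in zip(chrom, chrom[1:]))

def evolve_one_item(pop, steps=10):
    """Trivial per-item evolution: Gaussian mutation + truncation
    selection, standing in for a full GA on this item's population."""
    for _ in range(steps):
        children = [sorted(g + random.gauss(0, 0.1) for g in c) for c in pop]
        pop = sorted(pop + children, key=fitness, reverse=True)[:len(pop)]
    return pop[0]

# One population per item; each evolves independently (or in parallel).
best = {item: evolve_one_item([random_chromosome() for _ in range(8)])
        for item in ITEMS}
print({item: len(chrom) for item, chrom in best.items()})
```

Because each chromosome covers only one item, it stays short (nine genes here) regardless of how many items the database contains.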
The chromosomes in the same population are of the same length. The fitness of each set of membership functions is evaluated by the fuzzy-supports of the linguistic terms in the large 1-itemsets and by the suitability of the derived membership functions. The offspring sets of membership functions undergo recursive evolution until a good set of membership functions has been obtained. Next, in the phase of mining fuzzy association rules, the sets of membership functions for all the items are gathered together and used to mine fuzzy association rules from the given quantitative database. An enhanced approach [10], which combines the clustering and divide-and-conquer techniques, was also proposed to speed up the evaluation process. The clustering idea is similar to that used for the IGFSMS problem [11], except that the center value of each membership function is also used as an attribute for clustering chromosomes. The clustering process is thus executed according to the coverage factors, the overlap factors, and the center values of the chromosomes. For example, assume each item has three membership functions; then five attributes in total, namely one coverage factor, one overlap factor and three center values, are used to form appropriate clusters. Note that the number of linguistic terms for each item is predefined in the mentioned approaches; it may also be adjusted automatically and dynamically [25].

The Divide-and-Conquer Genetic-Fuzzy Problem for Items with Multiple Minimum Supports (DGFMMS)

This problem may be thought of as the combination of the IGFMMS and DGFSMS problems. The framework for the DGFMMS problem can thus easily be designed from the previous frameworks for IGFMMS and DGFSMS. It is shown in Fig. 10.

Genetic-Fuzzy Data Mining Techniques, Figure 10 A genetic-fuzzy framework for the DGFMMS problem
The proposed framework in Fig. 10 is divided into two phases: mining minimum supports and membership functions, and mining fuzzy association rules. In the first phase, the clustering approach is first used to derive initialization information, which is then used to obtain better initial populations, as in IGFMMS. The framework then maintains m populations of minimum supports and membership functions, with one population per item. Next, in the phase of mining fuzzy association rules, the minimum support values and membership functions for all the items are gathered together and used to mine interesting fuzzy association rules from the given quantitative database.

Future Directions

In this article, we have introduced some genetic-fuzzy data mining techniques and their classification. The concept of fuzzy sets is used to handle quantitative transactions, and genetic algorithms are used to find appropriate membership functions. The genetic-fuzzy mining problems are divided into four kinds according to the types of fuzzy mining problems and the ways of processing items. The types of fuzzy mining problems include Single-minimum-Support Fuzzy-Mining (SSFM) and Multiple-minimum-Support Fuzzy-Mining (MSFM). The ways of processing items include processing all the items together (integrated approach) and processing them individually (divide-and-conquer approach). Each of the four kinds of problems has been described, with some approaches given. Data mining is especially important because the amount of data in the information era is extremely large; the topic will continue to grow, in a variety of forms. Some possible future research directions in genetic-fuzzy data mining are listed as follows.
1. Applying multiobjective genetic algorithms to genetic-fuzzy mining problems: mined knowledge (the number of large itemsets or of rules) and the suitability of membership functions are two important factors used in genetic-fuzzy data mining. Analyzing the relationship between the two factors is thus an interesting and important task. Multiobjective genetic algorithms can also be used to consider both factors at the same time.
2. Analyzing the effects of different shapes of membership functions and different genetic operators: different shapes of membership functions may yield different genetic-fuzzy mining results and may be evaluated in the future. Different genetic operators may also be tried to obtain better results than those used in the above approaches.
3. Enhancing the performance of the fuzzy-rule mining phase: the final goal of the genetic-fuzzy mining techniques introduced in this article is to mine appropriate fuzzy association rules. The phase of mining fuzzy association rules is, however, very time-consuming, so improving the process of mining interesting fuzzy rules is worth studying. Possible approaches include modifying existing approaches, combining them with other techniques, or defining new evaluation criteria.
4. Developing visual tools for genetic-fuzzy mining approaches: another interesting piece of future work is to develop visual tools for demonstrating genetic-fuzzy mining results, helping a decision maker easily understand or quickly obtain useful information. Such tools may show, for example, the derived membership functions and illustrate the interesting fuzzy association rules.
Bibliography 1. Agrawal R, Srikant R (1994) Fast algorithm for mining association rules. In: Proceedings of the international conference on very large data bases, pp 487–499 2. Agrawal R, Imielinksi T, Swami A (1993) Database mining: a performance perspective. Trans IEEE Knowl Data Eng 5(6):914–925 3. Agrawal R, Imielinksi T, Swami A (1993) Mining association rules between sets of items in large database. In: Proceedings of the conference ACMSIGMOD, Washington DC, USA 4. Agrawal R, Srikant R, Vu Q (1997) Mining association rules with item constraints. In: Proceedings of the third international conference on knowledge discovery in databases and data mining, Newport Beach, California, August 1997 5. Au WH, Chan KCC, Yao X (2003) A novel evolutionary data mining algorithm with applications to churn prediction. Trans IEEE Evol Comput 7(6):532–545 6. Aumann Y, Lindell Y (1999) A statistical theory for quantitative association rules. In: Proceedings of the ACMSIGKDD international conference on knowledge discovery and data mining, pp 261–270 7. Casillas J, Cordón O, del Jesus MJ, Herrera F (2005) Genetic tuning of fuzzy rule deep structures preserving interpretability and its interaction with fuzzy rule set reduction. Trans IEEE Fuzzy Syst 13(1):13–29 8. Chan CC, Au WH (1997) Mining fuzzy association rules. In: Proceedings of the conference on information and knowledge management, Las Vegas, pp 209–215 9. Chen CH, Hong TP, Vincent Tseng S (2007) A comparison of different fitness functions for extracting membership functions used in fuzzy data mining. In: Proceedings of the symposium IEEE on foundations of computational intelligence, pp 550– 555 10. Chen CH, Hong TP, Tseng VS (2007) A modified approach to speed up genetic-fuzzy data mining with divide-and-conquer strategy. In: Proceedings of the congress IEEE on evolutionary computation (CEC), pp 1–6
11. Chen CH, Tseng VS, Hong TP (2008) Cluster-based evaluation in fuzzy-genetic data mining. Trans IEEE Fuzzy Syst 16(1):249– 262 12. Chen CH, Hong TP, Tseng VS, Lee CS (2008) A genetic-fuzzy mining approach for items with multiple minimum supports. Accepted and to appear in Soft Computing (SCI) 13. Chen J, Mikulcic A, Kraft DH (2000) An integrated approach to information retrieval with fuzzy clustering and fuzzy inferencing. In: Pons O, Vila MA, Kacprzyk J (eds) Knowledge management in fuzzy databases. Physica, Heidelberg 14. Cordón O, Herrera F, Villar P (2001) Generating the knowledge base of a fuzzy rule-based system by the genetic learning of the data base. Trans IEEE Fuzzy Syst 9(4):667–674 15. Darwen PJ, Yao X (1997) Speciation as automatic categorical modularization. Trans IEEE Evol Comput 1(2):101–108 16. Frawley WJ, Piatetsky-Shapiro G, Matheus CJ (1991) Knowledge discovery in databases: an overview. In: Proceedings of the workshop AAAI on knowledge discovery in databases, pp 1–27 17. Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison Wesley, Boston 18. Grefenstette JJ (1986) Optimization of control parameters for genetic algorithms. Trans IEEE Syst Man Cybern 16(1):122–128 19. Heng PA, Wong TT, Rong Y, Chui YP, Xie YM, Leung KS, Leung PC (2006) Intelligent inferencing and haptic simulation for Chinese acupuncture learning and training. Trans IEEE Inf Technol Biomed 10(1):28–41 20. Holland JH (1975) Adaptation in natural and artificial systems. University of Michigan Press, Michigan 21. Homaifar A, Guan S, Liepins GE (1993) A new approach on the traveling salesman problem by genetic algorithms. In: Proceedings of the fifth international conference on genetic algorithms 22. Hong TP, Lee YC (2001) Mining coverage-based fuzzy rules by evolutional computation. In: Proceedings of the international IEEE conference on data mining, pp 218–224 23. 
Hong TP, Kuo CS, Chi SC (1999) Mining association rules from quantitative data. Intell Data Anal 3(5):363–376 24. Hong TP, Kuo CS, Chi SC (2001) Trade-off between time complexity and number of rules for fuzzy mining from quantitative data. Int J Uncertain Fuzziness Knowledge-Based Syst 9(5):587–604 25. Hong TP, Chen CH, Wu YL, Tseng VS (2004) Finding active membership functions in fuzzy data mining. In: Proceedings of the workshop on foundations of data mining in the fourth international IEEE conference on data mining 26. Hong TP, Chen CH, Wu YL, Lee YC (2006) AGA-based fuzzy mining approach to achieve a trade-off between number of rules and suitability of membership functions. Soft Comput 10(11):1091–1101 27. Hong TP, Chen CH, Wu YL, Lee YC (2008) Genetic-fuzzy data mining with divide-and-conquer strategy. Trans IEEE Evol Comput 12(2):252–265 28. Ishibuchi H, Yamamoto T (2004) Fuzzy rule selection by multiobjective genetic local search algorithms and rule evaluation measures in data mining. Fuzzy Sets Syst 141:59–88 29. Ishibuchi H, Yamamoto T (2005) Rule weight specification in fuzzy rule-based classification systems. Trans IEEE Fuzzy Syst 13(4):428–435
30. Jin Y (2006) Multi-objective machine learning. Springer, Berlin 31. Kaya M, Alhajj R (2003) A clustering algorithm with genetically optimized membership functions for fuzzy association rules mining. In: Proceedings of the international IEEE conference on fuzzy systems, pp 881–886 32. Kaya M, Alhajj R (2004) Genetic algorithms based optimization of membership functions for fuzzy weighted association rules mining. In: Proceedings of the international symposium on computers and communications, vol 1, pp 110–115 33. Kaya M, Alhajj R (2004) Integrating multi-objective genetic algorithms into clustering for fuzzy association rules mining. In: Proceedings of the fourth international IEEE conference on data mining, pp 431–434 34. Khare VR, Yao X, Sendhoff B, Jin Y, Wersing H (2005) Coevolutionary modular neural networks for automatic problem decomposition. In: Proceedings of the 2005 IEEE congress on evolutionary computation, vol 3, pp 2691–2698 35. Kuok C, Fu A, Wong M (1998) Mining fuzzy association rules in databases. SIGMOD Record 27(1):41–46 36. Lee YC, Hong TP, Lin WY (2004) Mining fuzzy association rules with multiple minimum supports using maximum constraints. In: Lecture notes in computer science, vol 3214. Springer, Heidelberg, pp 1283–1290 37. Liang H, Wu Z, Wu Q (2002) A fuzzy based supply chain management decision support system. In: Proceedings of the world congress on intelligent control and automation, vol 4, pp 2617–2621 38. Mamdani EH (1974) Applications of fuzzy algorithms for control of simple dynamic plants. Proc IEE 121(12):1585–1588 39. Michalewicz Z (1994) Genetic algorithms + data structures = evolution programs. Springer, New York 40. Mitchell M (1996) An introduction to genetic algorithms. MIT Press, Cambridge MA 41. Rasmani KA, Shen Q (2004) Modifying weighted fuzzy subsethood-based rule models with fuzzy quantifiers. In: Proceedings of the international IEEE conference on fuzzy systems, vol 3, pp 1679–1684 42.
Roubos H, Setnes M (2001) Compact and transparent fuzzy models and classifiers through iterative complexity reduction. Trans IEEE Fuzzy Syst 9(4):516–524 43. Sanchez E et al (1997) Genetic algorithms and fuzzy logic systems: soft computing perspectives (advances in fuzzy systems – applications and theory, vol 7). World-Scientific, River Edge 44. Setnes M, Roubos H (2000) GA-fuzzy modeling and classification: complexity and performance. Trans IEEE Fuzzy Syst 8(5):509–522 45. Siler W, James J (2004) Fuzzy expert systems and fuzzy reasoning. Wiley, New York 46. Srikant R, Agrawal R (1996) Mining quantitative association rules in large relational tables. In: Proceedings of the (1996) international ACMSIGMOD conference on management of data, Montreal, Canada, June 1996, pp 1–12 47. Wang CH, Hong TP, Tseng SS (1998) Integrating fuzzy knowledge by genetic algorithms. Trans IEEE Evol Comput 2(4):138– 149 48. Wang CH, Hong TP, Tseng SS (2000) Integrating membership functions and fuzzy rule sets from multiple knowledge sources. Fuzzy Sets Syst 112:141–154
1335
1336
Genetic-Fuzzy Data Mining Techniques
49. Yao X (2003) Adaptive divide-and-conquer using populations and ensembles. In: Proceedings of the (2003) international conference on machine learning and application, pp 13–20 50. Yue S, Tsang E, Yeung D, Shi D (2000) Mining fuzzy association rules with weighted items. In: Proceedings of the international IEEE conference on systems, man and cybernetics, pp 1906– 1911
51. Zadeh LA (1965) Fuzzy set. Inf Control 8(3):338–353 52. Zhang H, Liu D (2006) Fuzzy modeling and fuzzy control. Springer, New York 53. Zhang Z, Lu Y, Zhang B (1997) An effective partitioning-combining algorithm for discovering quantitative association rules. In: Proceedings of the Pacific-Asia conference on knowledge discovery and data mining, pp 261–270
Gliders in Cellular Automata
CARTER BAYS
Department of Computer Science and Engineering, University of South Carolina, Columbia, USA

Article Outline
Glossary
Definition of the Subject
Introduction
Other GL Rules in the Square Grid
Why Treat All Neighbors the Same?
Gliders in One Dimension
Two Dimensional Gliders in Non-Square Grids
Three and Four Dimensional Gliders
Future Directions
Bibliography

Glossary
Game of life  A particular cellular automaton (CA) discovered by John Conway in 1968.
Neighbor  A neighbor of cell x is typically a cell that is in close proximity to (frequently touching) cell x.
Oscillator  A periodic shape within a specific CA rule.
Glider  A translating oscillator that moves across the grid of a CA.
Generation  The discrete time unit which depicts the evolution of a CA.
Rule  Determines how each individual cell within a CA evolves.

Definition of the Subject
A cellular automaton is a structure comprising a grid with individual cells that can have two or more states; these cells evolve in discrete time units according to a rule, which usually involves the neighbors of each cell.

Introduction
Although cellular automata have origins dating from the 1950s, interest in the topic was given a boost during the 1980s by the research of Stephen Wolfram, which culminated in 2002 with his publication of the massive tome, "A New Kind of Science" [11]. And widespread popular interest was created when John Conway's "game of life" cellular automaton was initially revealed to the public in a 1970 Scientific American article [8]. The single feature of his game that probably caused this intense interest was
undoubtedly the discovery of "gliders" (translating oscillators). Not surprisingly, gliders are present in many other cellular automata rules; the purpose of this article is to examine some of these rules and their associated gliders. Cellular automata (CA) can be constructed in one, two, three or more dimensions and can best be explained by giving a two dimensional example. Start with an infinite grid of squares. Each individual square has eight touching neighbors; typically these neighbors are treated the same (a Moore neighborhood), whether they touch a candidate square on a side or at a corner. (An exception is one dimensional CA, where position usually plays a role.) We now fill in some of the squares; we shall say that these squares are alive. Discrete time units called generations evolve; at each generation we apply a rule to the current configuration in order to arrive at the configuration for the next generation; in our example we shall use the rule below. (a) If a live cell is touching two or three live cells (called neighbors), then it remains alive next generation; otherwise it dies. (b) If a non-living cell is touching exactly three live cells, it comes to life next generation. Figure 1 depicts the evolution of a simple configuration of filled-in (live) cells for the above rule. There are many notations for describing CA rules; these can differ depending upon the type of CA. For CA of more than one dimension, and in our present discussion, we shall utilize the following notation, which is standard for describing CA in two dimensions with Moore neighborhoods. Later we shall deal with one dimension. We write a rule as E1,E2,…/F1,F2,…, where the Ei ("environment") specify the numbers of live neighbors required to keep a living cell alive, and the Fi ("fertility") give the numbers required to bring a non-living cell to life. The Ei and Fi will be listed in ascending order; hence if i > j then Ei > Ej, etc.
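A rule in this environment/fertility notation can be applied directly by counting live Moore neighbors. The sketch below is not from the article; the set-of-coordinates grid representation and the function name are my own choices, with defaults giving Conway's 2,3/3:

```python
from collections import Counter

def step(live, env=(2, 3), fert=(3,)):
    """One generation of an environment/fertility rule on an unbounded
    square grid with Moore neighborhoods.  `live` is a set of (x, y)
    coordinates of live cells; env/fert default to Conway's 2,3/3."""
    # Count, for every cell, how many of its eight neighbors are alive.
    counts = Counter((x + dx, y + dy)
                     for (x, y) in live
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # Fertility governs birth of dead cells, environment governs survival.
    return {c for c, n in counts.items()
            if (c in live and n in env) or (c not in live and n in fert)}

# The three-cell "blinker" of Fig. 1 oscillates with period two:
blinker = {(0, 0), (1, 0), (2, 0)}
assert step(step(blinker)) == blinker
```

Cells not appearing in `counts` have zero live neighbors and can never be born or survive, so only the counted cells need be examined.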
Thus the rule for the CA given above is 2,3/3. This rule, discovered by John Horton Conway, was examined in several articles in Scientific American and elsewhere, beginning with the seminal article in 1970 [8]. It is popularly known as Conway's game of life. Of course it is not really a game in the usual sense, as the outcome is determined as soon as we pick a starting configuration. Note that the shape in Fig. 1 repeats, with a period of two. A repeating form such as this is called an oscillator. Stationary forms can be considered oscillators with a period of one. In Figs. 2 and 3 we show several oscillators that move across the grid as they change from generation to generation. Such forms are called translating oscillators, or more commonly, gliders. Conway's rule popularized the term; in fact a flurry of activity began during which a great many shapes were discovered and exploited. These shapes were named whimsically – "blinker" (Fig. 1), "boat", "beehive" and an unbelievable myriad of others. Most translating oscillators were given names other than the simple moniker glider – there were "lightweight spaceships", "puffer trains", etc. For this article, we shall call all translating oscillators gliders. Of course rule 2,3/3 is not the only CA rule (even though it is the most interesting). Configurations under some rules always die out, and other rules lead to explosive growth. (We say that rules with expansive growth are unstable.) We can easily find gliders for many unstable rules; for example Fig. 4 illustrates some simple constructs for rule 2/2. Note that it is practically impossible NOT to create gliders with this rule! Hence we shall only look at gliders for rules that stabilize (i.e. exhibit bounded growth) and eventually yield only zero or more oscillators. We call such rules GL (game of life) rules. Stability can be a rather murky concept, since there may be some carefully constructed forms within a GL rule that grow without bounds. Typically, such forms would never appear in random configurations. Hence, we shall informally define a GL rule as follows: all neighbors must be touching the candidate cell and all are treated the same (a Moore neighborhood); there must exist at least one translating oscillator (a glider); and random configurations must eventually stabilize. This definition is a bit simplistic; for a more formal definition of a GL rule refer to [5]. Conway's rule 2,3/3 is the original GL rule and is unquestionably the most famous CA rule known. A challenge put forth by Conway was to create a configuration that would generate an ever increasing quantity of live cells. This challenge was met by William Gosper in 1970 – back when computing time was expensive and computers were slow by today's standards. He devised a form that spit out a continuous stream of gliders – a "glider gun", so to speak. Interestingly, his gun configuration was displayed not as nice little squares, but as a rather primitive typewritten output (Fig. 5); this emphasizes the limited resources available in 1970 for seeking out such complex structures. Soon a cottage industry developed – all kinds of intricate initial configurations were discovered and exploited; such research continues to this day.

Gliders in Cellular Automata, Figure 1
Top: Each cell in a grid has eight neighbors. The cells containing n are neighbors of the cell containing the X. Any cell in the grid can be either dead or alive. Bottom: Here we have outlined a specific area of what is presumably a much larger grid. At the left we have installed an initial shape. Shaded cells are alive; all others are dead. The number within each cell gives the quantity of live neighbors for that cell. (Cells containing no numbers have zero live neighbors.) Depicted are three generations, starting with the configuration at generation one. Generations two then three show the result when we apply the following cellular automata rule: Live cells with exactly two or three live neighbors remain alive (otherwise they die); dead cells with exactly three live neighbors come to life (otherwise they remain dead). Let us now evaluate the transition from generation one to generation two. In our diagram, cell a is dead. Since it does not have exactly three live neighbors, it remains dead. Cell b is alive, but it needs exactly two or three live neighbors to remain alive; since it only has one, it dies. Cell c is dead; since it has exactly three live neighbors, it comes to life. And cell d has two live neighbors; hence it will remain alive. And so on. Notice that the form repeats every two generations. Such forms are called oscillators.

Other GL Rules in the Square Grid

The rule 2,4,5/3 is also a GL rule and sports the glider shown in Fig. 6. It has not been seriously investigated and will probably not reveal the vast array of interesting forms that exist under 2,3/3. Interestingly, 2,3/3,8 appears to be a GL rule which, not surprisingly, supports many of the constructs of 2,3/3. This ability to add terms of high neighbor counts onto known GL rules, obtaining other GL rules, seems to be easy to implement – particularly in higher dimensions or in grids with large neighbor counts such as the triangular grid, which has a neighbor count of 12.
Gliders in Cellular Automata, Figure 2
Here we see a few of the small gliders that exist for 2,3/3. The form at the top – the original glider – was discovered by John Conway in 1968. The remaining forms were found shortly thereafter. Soon after Conway discovered rule 2,3/3 he started to give his various shapes rather whimsical names. That practice continues to this day. Hence, the name glider was given only to the simple shape at the top; the other gliders illustrated were called (from top to bottom) lightweight spaceship, middleweight spaceship and heavyweight spaceship. The numbers give the generation; each of the gliders shown has a period of four. The exact movement of each is depicted by its shifting position in the various small enclosing grids
Gliders in Cellular Automata, Figure 3
The rule 2,3/3 is rich with oscillators – both stationary and translating (i.e. gliders). Here are but two of many hundreds of gliders that exist under this rule. The top form has a period of five and the bottom conglomeration, a period of four
Gliders in Cellular Automata, Figure 4 Gliders exist under a large number of rules, but almost all such rules are unstable. For example the rule 2/2 exhibits rapid unbounded growth, and almost any starting configuration will yield gliders; e. g. just two live cells will produce two gliders going off in opposite directions. But almost any small form will quickly grow without bounds. The form at the bottom left expands to the shape at the right after only 10 generations. The generation is given with each form
Gliders in Cellular Automata, Figure 6
There are a large number of interesting rules that can be written for the square grid, and rule 2,3/3 is undoubtedly the most fascinating – but it is not the only GL rule. Here we depict a glider that has been found for the rule 2,4,5/3. And since that rule stabilizes, it is a valid GL rule. Unfortunately it is not as interesting as 2,3/3 because its glider is not as likely to appear in random (and other) configurations – hence limiting the ability of 2,4,5/3 to produce interesting moving configurations. Note that the period is seven, indicated in parentheses.

Gliders in Cellular Automata, Figure 5
A fascinating challenge was proposed by Conway in 1970 – he offered $50 to the first person who could devise a form for 2,3/3 that would generate an infinite number of living cells. One such form could be a glider gun – a construct that would create an endless stream of gliders. The challenge was soon met by William Gosper, then a student at MIT. His glider gun is illustrated here. At the top, testifying to the primitive computational power of the time, is an early illustration of Gosper's gun. At the bottom we see the gun in action, sending out a new glider every thirty generations (here it has sent out two gliders). Since 1970 there have been numerous such guns that generate all kinds of forms – some gliders and some stationary oscillators. Naturally in the latter case the generator must translate across the grid, leaving its intended stationary debris behind
Why Treat All Neighbors the Same?

By allowing only Moore neighborhoods in two (and higher) dimensions we greatly restrict the number of rules that can be written. And certainly we could consider specialized neighborhoods – e.g. treat as neighbors only those cells that touch on sides, or touch only the left two corners and nowhere else, or touch anywhere, but state in our rule that two or more live neighbors of a subject cell must not touch each other, etc. But here we are only exploring gliders. Consider the following rule for finding the next generation. 1) A living cell dies. 2) A dead cell comes to life if and only if its left side touches a live cell.
If we start, say, with a single cell we will obtain a glider of one cell that moves to the right one cell each generation! Such rules are easy to construct, as are more complex glider-producing positional rules. So we shall not investigate them further. Yet as we shall see, neighbor position is an important consideration in one dimensional CA.

Gliders in One Dimension

One dimensional cellular automata differ from CA in higher dimensions in that the restrictive grid (essentially a single line of cells) limits the number of rules that can be applied. Hence, many 1D CA involve neighborhoods that extend beyond the immediate two touching neighbors of a cell whose next generation status we wish to evaluate. Or more than the two states (alive, dead) may be utilized. For our discussion about gliders, we shall only look at the simplest rules – those involving just the two adjacent neighbors and two states. Unlike 2D (and higher) dimensions, we usually consider the relative position of the neighbors when giving a rule. Since three cells (center, left, right) are involved in determining the state for the next generation of the central cell, we have 2^3 = 8 possible initial states, with each state leading to a particular outcome. And since each initial state causes a particular outcome (i.e. the cell in the middle lives or dies next generation), we thus have 2^8 = 256 possible rules. The behavior of these 256 rules has been extensively studied by Wolfram [11], who also introduced a very convenient shorthand that completely describes each rule (Fig. 7).
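Wolfram's numbering scheme can be applied mechanically: the rule number's eight binary digits give the next state for each of the eight (left, center, right) patterns. A brief sketch (my own formulation of that standard scheme; the function name is hypothetical):

```python
def eca_step(cells, rule):
    """One generation of an elementary (two-state, nearest-neighbor) 1D CA.
    `cells` is a list of 0/1 values; cells beyond the ends are treated as 0.
    Bit k of `rule` (0 <= rule <= 255) is the next state of a cell whose
    (left, center, right) pattern reads as the binary number k."""
    padded = [0] + cells + [0]
    return [(rule >> (4 * padded[i - 1] + 2 * padded[i] + padded[i + 1])) & 1
            for i in range(1, len(padded) - 1)]

row = [0, 0, 0, 1, 0, 0, 0]
row = eca_step(row, 110)   # under rule 110 a single live cell grows leftward
```

Iterating `eca_step` and printing each row reproduces the triangular space-time diagrams of Figs. 8 through 11, one generation per line.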
Gliders in Cellular Automata, Figure 7
The one dimensional rules six and 110 are depicted by the diagram shown. There are eight possible states involving a center cell and its two immediate neighbors. The next generation state for the center cell depends upon the current configuration; each possible current state is given. The rule is specified by the binary number depicted by the next generation states of the center cell. This notation is standard for the simplest 1D CA and was introduced by Wolfram (see [11]), who also converts the binary representation to its decimal equivalent. There are 256 possible rules, but most are not as interesting as rule 110. Rule six is one of many that generate nothing but gliders (see Fig. 8)
Gliders in Cellular Automata, Figure 8 Rule six (along with many others) creates nothing but gliders. At the upper left, we have several generations starting with a single live cell (top). (For 1D CA each successive generation moves vertically down one level on the page.) At the lower left is an enlargement of the first few generations. By following the diagram for rule six in Fig. 7, the reader can see exactly how this configuration evolves. At the top right, we start with a random configuration; at the lower right we have enlarged the small area directly under the large dot. Very quickly, all initial random configurations lead solely to gliders heading west
As we add to the complexity of defining 1D CA we greatly increase the number of possible rules. For example, just by having three states instead of two, we note that now, instead of 2^3 possible initial states, there are 3^3 (Fig. 12). This leads to 27 possible initial states, and we can now create 3^27 unique rules – more than six trillion! Wolfram observed that even with more complex 1D rules, the fundamental behavior for all rules is typified by the simplest rules [11]. Gliders in 1D CA are very common (Figs. 8 and 9) but true GL rules are not, because most gliders for stable rules exist against a uniform patterned background (Figs. 9 through 11) instead of a grid of non-living cells.

Two Dimensional Gliders in Non-Square Grids

Although most 2D CA research involves a square grid, the triangular tessellation has been investigated somewhat. Here we have 12 touching neighbors; as with the square grid, they are all treated equally (Fig. 13). The increased number of neighbors allows for the possibility of more GL rules (and hence several gliders). Figure 14 shows many of these gliders and their various GL rules. The GL rule 2,7,8/3 supports two rather unusual gliders (Figs. 15 and 16) and to date is the only known GL rule other than Conway's original 2,3/3 game of life that exhibits glider guns. Figure 17 shows starting configurations for two of these guns and Fig. 18 exhibits evolution of the two guns
Gliders in Cellular Automata, Figure 9 Evolution of rule 110 for the first 500 generations, given a random starting configuration. With 1D CA, we can depict a great many generations on a 2D display screen
Gliders in Cellular Automata, Figure 12
There are 27 possible configurations when we have three states instead of two. Each configuration would yield some specific outcome as in Fig. 7; thus there would be three possible outcomes for each state, and hence 3^27 distinct rules
Gliders in Cellular Automata, Figure 10 Rule 110 at generations 2000–2500. The structures that move vertically are stationary oscillators; slanted structures can be considered gliders. Unlike higher dimensions, where gliders move in an unobstructed grid with no other live cells in the immediate vicinity, many 1D gliders reside in an environment of oscillating cells (the background pattern). The black square outlines an area depicted in the next figure
Gliders in Cellular Automata, Figure 13 Each cell in the triangular grid has 12 touching neighbors. The subject central cells can have two orientations, E and O
Gliders in Cellular Automata, Figure 11 An area from the previous figure enlarged. One can carefully trace the evolution from one generation to the next. The background pattern repeats every seven generations
after 800 generations. Due to the extremely unusual behavior of the period-80 2,7,8/3 glider (Fig. 16), it is highly likely that other guns exist. The hexagonal grid supports the GL rule 3/2, along with GL rules 3,5/2, 3,5,6/2 and 3,6/2, which all behave in a manner very similar to 3/2. The glider for these four rules is shown in Fig. 19. It is possible that no other distinct hexagonal GL rules exist, because with only six touching neighbors, the set of interesting rules is quite limited. Moreover the fertility portion of the rule must start with two, and rules of the form /2,3 are unstable. Thus,
Gliders in Cellular Automata, Figure 14 Most of the known GL rules and their gliders are illustrated. The period for each is given in parentheses
Gliders in Cellular Automata, Figure 16
Here we depict the large 2,7,8/3 glider. Perhaps flamboyant would be a better description, for this glider spews out much debris as it moves along. It has a period of 80 and its exact motion can be traced by observing its position relative to the black dot. Note that the debris tossed behind does not interfere with the 81st generation, where the entire process repeats 12 cells to the right. By carefully positioning two of these gliders, one can (without too much effort) construct a situation where the debris from both gliders interacts in a manner that produces another glider. This was the method used to discover the two guns illustrated in Figs. 17 and 18
any other hexagonal GL rules must be of the form /2,4; /2,4,5; etc. (i.e. only seven other fertility combinations). A valid GL rule has also been found for at least one pentagonal grid (Fig. 19). Since there are several topologically unique pentagonal tessellations (see [10]), other pentagonal gliders will probably be found, especially when all the variants of the pentagonal grid are investigated.

Three and Four Dimensional Gliders
Gliders in Cellular Automata, Figure 15
The small 2,7,8/3 glider is shown. This glider also exists for the GL rule 2,7/3. The small horizontal dash is for positional reference
In 1987, the first GL rules in three dimensions were discovered [1,7]. The initially found gliders and their rules are depicted in Fig. 20. It turns out that the 2D rule 2,3/3 is in many ways contained in the 3D GL rule 5,6,7/6. (Note the similarity between the glider at the bottom of Fig. 20 and at the top of Fig. 2.) During the ensuing years, several other 3D gliders were found (Figs. 21 and 22). Most of these gliders were unveiled by employing random but symmetric small initial configurations. The large number of live cells in these 3D gliders implies that they are uncommon random occurrences in their respective GL rules; hence it is highly improbable that the plethora of interesting forms (e.g. glider guns) such as those for 2D rule 2,3/3 exist in three dimensions.
Gliders in Cellular Automata, Figure 19
GL rules are supported in pentagonal and hexagonal grids. The pentagonal grid (left) is called the Cairo Tiling, supposedly named after some paving tiles in that city. There are many different topologically distinct pentagonal grids; the Cairo Tiling is but one. At the right are gliders for the hexagonal rules 3/2 and 3/2,4,5. The 3/2 glider also works for 3,5/2, 3,5,6/2 and 3,6/2. All four of these rules are GL rules. The rule 3/2,4,5 is unfortunately disqualified (barely) as a GL rule because very large random blobs will grow without bounds. The periods of each glider are given in parentheses
Gliders in Cellular Automata, Figure 17
The GL rule 2,7,8/3 is of special interest in that it is the only known GL rule besides Conway's rule that supports glider guns – configurations that spew out an endless stream of gliders. In fact, there are probably several such configurations under that rule. Here we illustrate two guns; the top one generates period 18 (small) gliders and the bottom one creates period 80 (large) gliders. Unlike Gosper's 2,3/3 gun, these guns translate across the grid in the direction indicated. In keeping with the fanciful jargon for names, translating glider guns are also called "rakes".

Gliders in Cellular Automata, Figure 20
The first three dimensional GL rules were found in 1987; these are the original gliders that were discovered. The rule 5,6,7/6 is analogous to the 2D rule 2,3/3 (see [1]). Note the similarity between this glider and the one at the top of Fig. 2
Gliders in Cellular Automata, Figure 18 After 800 generations, the two guns from Fig. 17 will have produced the output shown. Motion is in the direction given by the arrows. The gun at the left yields period 18 gliders, one every 80 generations, and the gun at the right produces a period 80 glider every 160 generations
The 3D grid of dense packed spheres has also been investigated somewhat; here each sphere touches exactly 12 neighbors. What is pleasing about this configuration is that each neighbor is identical in the manner that it touches the subject cell, unlike the square and cubic grids, where some neighbors touch on their sides and others at their corners. The gliders for spherical rule 3/3 are shown in Fig. 23. This rule is a borderline GL rule, as random finite configurations appear to stabilize, but infinite ones apparently do not.

Future Directions

Gliders are an important by-product of many cellular automata rules. They have made possible the construction
Gliders in Cellular Automata, Figure 21
Several more 3D GL rules were discovered between 1990 and 1994. They are illustrated here. The 8/5 gliders were originally investigated under the rule 6,7,8/5
of extremely complicated forms – most notably within the universe of Conway's rule, 2,3/3. (Figs. 24 and 25 illustrate a remarkable example of this complexity.) Needless to say, many questions remain unanswered. Can a glider gun be constructed for some three dimensional rule? This would most likely be rule 5,6,7/6, which is the three dimensional analog of 2,3/3 [7], but so far no example has been found. The area of cellular automata research is more-or-less in its infancy – especially when we look beyond the square grid. Even higher dimensions have been given a glance; Fig. 26 shows just one of several gliders that are known to exist in four dimensions. Since each cell has 80 touching neighbors, it will come as no surprise that there are a large number of 4D GL rules. But there remains much work to be done in lower dimensions as well. Consider simple one dimensional cellular automata with four possible states. It will be a long time before all 10^38 possible rules have been investigated!
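That count follows from a simple formula: a 1D CA with k states and neighborhood radius r has k^(2r+1) possible neighborhood patterns, hence k^(k^(2r+1)) distinct rules. A quick sanity check (the function name is my own, for illustration):

```python
def num_rules(states, radius=1):
    """Number of distinct 1D CA rules for a given number of cell states
    and neighborhood radius: one choice of outcome per neighborhood pattern."""
    patterns = states ** (2 * radius + 1)
    return states ** patterns

assert num_rules(2) == 256          # Wolfram's elementary rules
assert num_rules(3) == 3 ** 27      # three states: more than six trillion
assert num_rules(4) > 10 ** 38      # four states, as noted above
```

For four states the exact count is 4^64 ≈ 3.4 × 10^38, which is why exhaustive investigation is out of reach.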
Gliders in Cellular Automata, Figure 22 By 2004, computational speed had greatly increased, so another effort was made to find 3D gliders under GL rules; these latest discoveries are illustrated here
Gliders in Cellular Automata, Figure 23 Some work has been done with the 3D grid of dense packed spheres. Two gliders have been discovered for the rule 3/3, which almost qualifies as a GL rule
Gliders in Cellular Automata, Figure 24
The discovery of the glider in 2,3/3, along with the development of several glider guns, has made possible the construction of many extremely complex forms. Here we see a Turing machine, developed in 2001 by Paul Rendell. Figure 25 enlarges a small portion of this structure
Bibliography 1. Bays C (1987) Candidates for the Game of Life in Three Dimensions. Complex Syst 1:373–400 2. Bays C (1987) Patterns for Simple Cellular Automata in a Universe of Dense Packed Spheres. Complex Syst 1:853–875 3. Bays C (1994) Cellular Automata in the Triangular Tessellation. Complex Syst 8:127–150 4. Bays C (1994) Further Notes on the Game of Three Dimensional Life. Complex Syst 8:67–73 5. Bays C (2005) A Note on the Game of Life in Hexagonal and Pentagonal Tessellations. Complex Syst 15:245–252
Gliders in Cellular Automata, Figure 25 We have enlarged a tiny portion at the upper left of the Turing machine shown in Fig. 24. One can see the complex interplay of gliders, glider guns, and various other stabilizing forms
Gliders in Cellular Automata, Figure 26
Some work (not much) has been done in four dimensions. Here is an example of a glider for the GL rule 11,12/12,13. Many more 4D gliders exist
6. Bays C (2007) The Discovery of Glider Guns in a Game of Life for the Triangular Tessellation. J Cell Autom 2(4):345–350 7. Dewdney AK (1987) The game Life acquires some successors in three dimensions. Sci Am 286:16–22 8. Gardner M (1970) The fantastic combinations of John Conway’s new solitaire game ‘Life’. Sci Am 223:120–123 9. Preston K Jr, Duff MJB (1984) Modern Cellular Automata. Plenum Press, New York 10. Sugimoto T, Ogawa T (2000) Tiling problem of convex pentagon[s]. Forma 15:75–79 11. Wolfram S (2002) A New Kind of Science. Wolfram Media, Champaign Il
Granular Computing and Data Mining for Ordered Data: The Dominance-Based Rough Set Approach
SALVATORE GRECO1, BENEDETTO MATARAZZO1, ROMAN SŁOWIŃSKI2,3
1 Faculty of Economics, University of Catania, Catania, Italy
2 Poznań University of Technology, Institute of Computing Science, Poznań, Poland
3 Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Article Outline
Glossary
Definition of the Subject
Introduction: Granular Computing and Ordered Data
Philosophical Basis of DRSA
Granular Computing
Dominance-Based Rough Set Approach
Fuzzy Set Extensions of the Dominance-Based Rough Set Approach
Variable-Consistency Dominance-Based Rough Set Approach (VC-DRSA)
Dominance-Based Rough Approximation of a Fuzzy Set
Monotonic Rough Approximation of a Fuzzy Set Versus Classical Rough Set
Dominance-Based Rough Set Approach to Case-Based Reasoning
An Algebraic Structure for Dominance-Based Rough Set Approach
Conclusions
Future Directions
Bibliography

Glossary
Case-based reasoning  Case-based reasoning is a paradigm in machine learning whose idea is that a new problem can be solved by noticing its similarity to a set of problems previously solved. Case-based reasoning regards the inference of some proper conclusions related to a new situation by the analysis of similar cases from a memory of previous cases. Very often similarity between two objects is expressed on a graded scale and this justifies application of fuzzy sets in this context. Fuzzy case-based reasoning is a popular approach in this domain.
Decision rule  A decision rule is a logical statement of the type "if..., then...", where the premise (condition
part) specifies values assumed by one or more condition attributes and the conclusion (decision part) specifies an overall judgment.
Dominance-based rough set approach (DRSA)  DRSA permits approximation of a set in universe U based on available ordinal information about objects of U. Also the decision rules induced within DRSA are based on ordinal properties of the elementary conditions in the premise and in the conclusion, such as "if property f_i1 is present in degree at least α_i1 and ... property f_ip is present in degree at least α_ip, then property f_iq is present in degree at least α_iq".
Fuzzy sets  Differently from ordinary sets, in which an object either belongs or does not belong to a given set, in a fuzzy set an object belongs to a set in some degree. Formally, in universe U a fuzzy set X is characterized by its membership function μX : U → [0,1], such that for any y ∈ U, y certainly does not belong to set X if μX(y) = 0, y certainly belongs to X if μX(y) = 1, and y belongs to X with a given degree of certainty represented by the value of μX(y) in all other cases.
Granular computing  Granular computing is a general computation theory for using granules such as subsets, classes, objects, clusters, and elements of a universe to build an efficient computational model for complex applications with huge amounts of data, information, and knowledge. Granulation of an object a leads to a collection of granules, with a granule being a clump of points (objects) drawn together by indiscernibility, similarity, proximity, or functionality. In human reasoning and concept formulation, the granules and the values of their attributes are fuzzy rather than crisp. In this perspective, fuzzy information granulation may be viewed as a mode of generalization, which can be applied to any concept, method, or theory.
Ordinal properties and monotonicity  Ordinal properties in description of objects are related to graduality of the presence or absence of a property.
In this context, it is meaningful to say that a property is more present in one object than in another object. It is important that the ordinal descriptions are handled properly, that is, without introducing any operation, such as sums, averages, or fuzzy operators like the t-norm or t-conorm of Łukasiewicz, that takes into account cardinal properties of data not present in the considered descriptions and would therefore give meaningless results. Monotonicity is strongly related to ordinal properties. It regards relationships between degrees of presence or absence of properties in the objects, like “the more present is property f_i, the more present is property f_j”, or “the more present is property f_i, the
more absent is property f_j”. The graded presence or absence of a property can be meaningfully represented using fuzzy sets. More precisely, the degree of presence of property f_i in object y ∈ U is the value given to y by the membership function of the set of objects having property f_i.
Rough set A rough set in universe U is an approximation of a set based on available information about objects of U. The rough approximation is composed of two ordinary sets called the lower and upper approximation. The lower approximation is a maximal subset of objects which, according to the available information, certainly belong to the approximated set, and the upper approximation is a minimal subset of objects which, according to the available information, possibly belong to the approximated set. The difference between the upper and lower approximation is called the boundary.
Definition of the Subject
This article describes the dominance-based rough set approach (DRSA) to granular computing and data mining. DRSA was first introduced as a generalization of the rough set approach for dealing with multicriteria decision analysis, where preference order is important. The ordering is also important, however, in many other problems of data analysis. Even when the ordering seems absent, the presence or absence of a property can be represented in ordinal terms, because if two properties are related, the presence, rather than the absence, of one property should make more (or less) probable the presence of the other property. This is even more apparent when the presence or absence of a property is graded or fuzzy, because in this case, the more credible the presence of a property, the more (or less) probable the presence of the other property. Since the presence of properties, possibly fuzzy, is the basis of any granulation, DRSA can be seen as a general basis for granular computing.
After presenting the main ideas of DRSA for granular computing and its philosophical basis, the article introduces the basic concepts of DRSA, followed by its extensions in a fuzzy context and in probabilistic terms. This prepares the ground for treating the rough approximation of a fuzzy set, which is the core of the subject. It is also explained why the classical rough set approach is a specific case of DRSA. The article continues with presentation of DRSA for case-based reasoning, where the main ideas of DRSA for granular computing are fruitfully applied. Finally, some basic formal properties of the whole approach are presented in terms of an algebra modeling the logic of DRSA.
Introduction: Granular Computing and Ordered Data Granular computing originated in the research of Lin [37, 38,39,40,41,42] and Zadeh [57,58,59,60] and gained considerable interest in the last decade. The basic components of granular computing are granules, such as subsets, classes, objects, clusters, and elements of a universe. Granulation of an object a leads to a collection of granules, with a granule being a clump of points (objects) drawn together by indiscernibility, similarity, proximity, or functionality. In human reasoning and concept formulation, the granules and the values of their attributes are fuzzy rather than crisp. In this perspective, fuzzy information granulation may be viewed as a mode of generalization, which can be applied to any concept, method, or theory. Moreover, the theory of fuzzy granulation provides a basis for computing with words, due to the observation that in a natural language, words play the role of labels of fuzzy granules. Since fuzzy granulation plays a central role in fuzzy logic and in its applications, and rough set theory can be considered as a crisp granulation of set theory, it is interesting to study the relationship between fuzzy sets and rough sets from this point of view. Moreover, noticing that fuzzy granulation that leads to fuzzy logic underlies all applications of granulation, hybridization of fuzzy sets and rough sets can lead to a more general theory of granulation with a potential of application to any domain of human investigation. This explains the interest in putting together rough sets and fuzzy sets. Recently, it has been shown that a proper way of handling graduality in rough set theory is to use the DRSA [29]. This implies that DRSA is also a proper way of handling granulation within rough set theory. Let us explain this point in detail. The rough set approach has been proposed to approximate some relationships existing between concepts. 
For example, in medical diagnosis the concept of “disease Y” can be represented in terms of such concepts as “low blood pressure” and “high temperature”, or “muscle pain” and “headache”. The classical rough approximation is based on a very coarse representation, that is, for each aspect characterizing a concept (“low blood pressure”, “high temperature”, “muscle pain”, etc.), only its presence or its absence is considered relevant. In this case, the rough approximation involves a very primitive idea of monotonicity related to a scale with only two values: “presence” and “absence”. Monotonicity gains importance when a finer representation of the concepts is considered. A representation is finer when, for each aspect characterizing a concept, not only its presence or its absence is taken into account, but
also the degree of its presence or absence is considered relevant. Graduality is typical for fuzzy set philosophy [56] and, therefore, a joint consideration of rough sets and fuzzy sets is worthwhile. In fact, rough sets and fuzzy sets capture the two basic complementary aspects of monotonicity: rough sets deal with relationships between different concepts, and fuzzy sets deal with expression of different dimensions in which the concepts are considered. For this reason, many approaches have been proposed to combine fuzzy sets with rough sets (see e. g. [3,5,45,49,51]). Our combination of rough sets and fuzzy sets presents some important advantages with respect to the other approaches, which are discussed below. The main preoccupation in almost all the studies combining rough sets with fuzzy sets was related to a fuzzy extension of Pawlak’s definition of lower and upper approximations using fuzzy connectives [10,34]. In fact, there is no rule for the choice of the “right” connective, so this choice is always arbitrary to some extent. Another drawback of fuzzy extensions of rough sets involving fuzzy connectives is that they are based on cardinal properties of membership degrees. In consequence, the result of these extensions is sensitive to order preserving transformations of membership degrees. For example, consider the t-conorm of Łukasiewicz as a fuzzy connective; it may be used in the definition of both fuzzy lower approximation (to build fuzzy implication) and fuzzy upper approximation (as a fuzzy counterpart of a union). The t-conorm of Łukasiewicz is defined as

T*(α, β) = min(α + β, 1),   α, β ∈ [0, 1].

T*(α, β) can be interpreted as follows. Given two fuzzy propositions p and q, putting v(p) = α and v(q) = β, T*(α, β) can be interpreted as v(p ∨ q), the truth value of the proposition p ∨ q. Let us consider the following values of arguments:

α = 0.5,   β = 0.3,   γ = 0.2,   δ = 0.1,

and their order preserving transformation:

α′ = 0.4,   β′ = 0.3,   γ′ = 0.2,   δ′ = 0.05.

The values of the t-conorm are in the two cases as follows:

T*(α, δ) = 0.6,   T*(α′, δ′) = 0.45,
T*(β, γ) = 0.5,   T*(β′, γ′) = 0.5.
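This computation can be replayed in a few lines of Python (an illustrative sketch; the variable names are ours, not part of the original formalism):

```python
# Lukasiewicz t-conorm: T*(a, b) = min(a + b, 1)
def t_conorm(a, b):
    return min(a + b, 1.0)

alpha, beta, gamma, delta = 0.5, 0.3, 0.2, 0.1        # original truth values
alpha2, beta2, gamma2, delta2 = 0.4, 0.3, 0.2, 0.05   # order-preserving transform

# Before the transformation, T*(alpha, delta) exceeds T*(beta, gamma) ...
assert t_conorm(alpha, delta) > t_conorm(beta, gamma)
# ... but after it, the order of the two results is reversed.
assert t_conorm(alpha2, delta2) < t_conorm(beta2, gamma2)
```

The assertions make the point of the example concrete: the t-conorm is not invariant under order preserving transformations of its arguments, so it exploits cardinal, not merely ordinal, information.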
One can see that the order of the results has changed after the order preserving transformation of the arguments. This means that the Łukasiewicz t-conorm takes into account not only the ordinal properties of the truth values, but also their cardinal properties. A natural question
arises: is it reasonable to expect truth values to carry cardinal content, rather than ordinal content only? Or, in other words, is it realistic to claim that a human is able to say in a meaningful way not only that (a) “proposition p is more credible than proposition q” but even something like (b) “proposition p is two times more credible than proposition q”? It is much safer to consider information of type (a), because information of type (b) is rather meaningless for a human. Since fuzzy generalization of rough set theory using DRSA takes into account only ordinal properties of fuzzy membership degrees, it is the proper way of generalizing rough set theory in a fuzzy direction. Moreover, the classical rough set approach [46,47] can be seen as a specific case of our general model. This is important for several reasons. In particular, this interpretation of DRSA gives an insight into fundamental properties of the classical rough set approach and permits its further generalization. Rough set theory [46,47] relies on the idea that some knowledge (data, information) is available about objects of a universe of discourse U. Thus, a subset of U is defined using the available knowledge about the objects and not on the basis of information about membership or non-membership of the objects in the subset. For example, knowledge about patients suffering from a certain disease may contain information about body temperature, blood pressure, etc. All patients described by the same information are indiscernible in view of the available knowledge, and form groups of similar objects. These groups are called elementary sets, and can be considered as basic granules of the available knowledge about patients. Elementary sets can be combined into compound concepts. For example, elementary sets of patients can be used to represent a set of patients suffering from a certain disease. Any union of elementary sets is called a crisp set, while other sets are referred to as rough sets.
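The notions of elementary sets and lower/upper approximations can be sketched in Python on a toy example (the patient names, attributes, and values are hypothetical):

```python
# Toy illustration of classical (indiscernibility-based) rough approximation.
patients = {
    "p1": ("high temp", "high bp"),
    "p2": ("high temp", "high bp"),
    "p3": ("high temp", "low bp"),
    "p4": ("high temp", "low bp"),
}
sick = {"p1", "p2", "p3"}  # the set to be approximated

# Elementary sets: groups of patients with identical descriptions.
granules = {}
for name, description in patients.items():
    granules.setdefault(description, set()).add(name)

# Lower approximation: union of granules fully contained in the set;
# upper approximation: union of granules intersecting the set.
lower = set().union(*(g for g in granules.values() if g <= sick))
upper = set().union(*(g for g in granules.values() if g & sick))
boundary = upper - lower  # boundary-line objects
```

Here p3 and p4 share the same description but only p3 is in the approximated set, so both end up in the boundary: the set is rough.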
Each rough set has boundary line objects, i. e. objects which, in view of the available knowledge, cannot be classified with certainty as members of the set or of its complement. Therefore, in the rough set approach, any set is associated with a pair of crisp sets, called the lower and the upper approximation. Intuitively, in view of the available information, the lower approximation consists of all objects which certainly belong to the set and the upper approximation contains all objects which possibly belong to the set. The difference between the upper and the lower approximation constitutes the boundary region of the rough set. Analogously, for a partition of
universe U into classes, one may consider rough approximation of the partition. It appeared to be particularly useful for analysis of classification problems, which are the most common decision problems. The rough set approach operates on an information table composed of a set U of objects described by a set Q of attributes. If in the set Q disjoint sets C and D of condition and decision attributes are distinguished, then the information table is called a decision table. It is often assumed, without loss of generality, that set D is a singleton {d}, and thus decision attribute d makes a partition of set U into decision classes corresponding to its values. Data collected in such a decision table correspond to a multiple attribute classification problem. The classical Indiscernibility-Based Rough Set Approach (IRSA) is naturally adapted to analysis of this type of decision problem, because the set of objects can be identified with examples of classification and it is possible to extract all the essential knowledge contained in the decision table using indiscernibility or similarity relations. However, as pointed out by the authors (see e. g. [17,20,26,54]), IRSA cannot extract all the essential knowledge contained in the decision table if background knowledge about monotonic relationships between evaluation of objects on condition attributes and their assignment to decision classes has to be taken into account. Such background knowledge is typical for data describing various phenomena, as well as for data describing multiple criteria decision problems (see e. g. [9]), e. g., “the larger the mass and the smaller the distance, the larger the gravity”, “the more a tomato is red, the more it is ripe” or “the better the school marks of a pupil, the better his overall classification”.
The monotonic relationships, typical for multiple criteria decision problems, follow from preferential ordering of value sets of attributes (scales of criteria), as well as preferential ordering of decision classes. In order to take into account the ordinal properties of the considered attributes and the monotonic relationships between condition and decision attributes, a number of methodological changes to the original rough set theory were necessary. The main change was the replacement of the indiscernibility relation with a dominance relation, which permits approximation of ordered sets. The dominance relation is a very natural and rational concept within multiple criteria decision analysis. The dominance-based rough set approach (DRSA) has been proposed and characterized by the authors (see e. g. [17,20,24,25,26,54]). Let us mention that ordered value sets of attributes and a kind of order dependency among attributes have also been considered by specialists of relational databases (see, e. g., [13]). There is, however, a striking difference between consideration of orders in database queries and consideration of orders in knowledge discovery. More precisely, assuming ordered domains of attributes, knowledge discovery tends to discover monotonic relationships between ordered attributes, e. g., if a student is at least medium in networks, and at least good in databases, then his overall evaluation is at least medium. On the other hand, assuming an order of attribute value sets and order dependency in databases, one can exploit this given information for a more efficient answer to a query, e. g., when the dates of bank checks and their numbers are ordered, there is an order dependency between these two attributes, because on day x a check cannot hold a number smaller than checks from day x − 1, which permits us to prune the search tree and make the search more efficient. Looking at DRSA from a granular computing perspective, one can observe that DRSA permits us to deal with ordered data by considering a specific type of information granules defined by means of dominance-based constraints having a syntax of the type: “x is at least R” or “x is at most R”, where R is a qualifier from a properly ordered scale. In evaluation space, such granules are dominance cones. In this sense, the contribution of DRSA consists of:
– extending the paradigm of granular computing to problems involving ordered data,
– specifying a proper syntax and modality of information granules (the dominance-based constraints, which should be adjoined to other modalities of information constraints, such as possibilistic, veristic, and probabilistic [60]),
– defining a methodology dealing properly with this type of information granules, resulting in a theory of computing with words and reasoning about data in the case of ordered data.
Let us observe that other modalities of information constraints, such as veristic, possibilistic, and probabilistic, also have to deal with ordered values (with qualifiers relative to grades of truth, possibility, and probability).
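A dominance-based granule of the form “x is at least R” can be sketched as a simple filter over an ordered qualifier scale (a toy illustration; the scale, objects, and evaluations are hypothetical):

```python
# Granules defined by dominance-based constraints over an ordered scale.
SCALE = ["bad", "medium", "good", "very good"]   # increasing order
RANK = {q: i for i, q in enumerate(SCALE)}

evaluation = {"s1": "bad", "s2": "medium", "s3": "good", "s4": "very good"}

def at_least(r):
    """Upward dominance cone: objects evaluated at least r."""
    return {x for x, q in evaluation.items() if RANK[q] >= RANK[r]}

def at_most(r):
    """Downward dominance cone: objects evaluated at most r."""
    return {x for x, q in evaluation.items() if RANK[q] <= RANK[r]}
```

Only the order of the qualifiers matters here; no arithmetic is performed on them, in line with the ordinal character of the approach.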
Therefore, granular computing with ordered data, and DRSA as a proper way of reasoning about ordered data, are very important for the future development of the whole domain of granular computing. Indeed, the DRSA approach proposed in [15,16] avoids the arbitrary choice of fuzzy connectives and meaningless operations on membership degrees. It exploits only the ordinal character of the membership degrees and proposes a methodology of fuzzy rough approximation that infers the most cautious conclusion from available imprecise information. In particular, any approximation of knowledge about concept Y using knowledge about concept X is based on positive or negative relationships between premises and conclusions, i. e.:
(i) “The more x is X, the more it is Y” (positive relationship),
(ii) “The more x is X, the less it is Y” (negative relationship).
The following simple relationships illustrate (i) and (ii): “The larger the market share of a company, the greater its profit” (positive relationship), and “The greater the debt of a company, the smaller its profit” (negative relationship). These relationships have the form of gradual decision rules [4]. Examples of such decision rules are: “if a car is speedy with credibility at least 0.8 and it has high fuel consumption with credibility at most 0.7, then it is a good car with credibility at least 0.9”, and “if a car is speedy with credibility at most 0.5 and it has high fuel consumption with credibility at least 0.8, then it is a good car with credibility at most 0.6”. It is worth noting that the syntax of gradual decision rules is based on monotonic relationships between degrees of credibility, which can also be found in dominance-based decision rules induced from preference-ordered data. This explains why one can build a fuzzy rough approximation using DRSA. Finally, the fuzzy rough approximation taking into account monotonic relationships can be applied to case-based reasoning [28]. In this perspective, it is interesting to consider monotonicity of the type “the more similar is y to x, the more credible is that y belongs to the same class as x”. Application of DRSA in this context leads to decision rules similar to the gradual decision rules: “the more object z is similar to a referent object x w.r.t. condition attribute s, the more z is similar to a referent object x w.r.t. decision attribute t”, or, equivalently, but more technically,

s(z, x) ≥ α  ⇒  t(z, x) ≥ α,
where functions s and t measure the credibility of similarity with respect to condition attribute and decision attribute, respectively. When there are multiple condition and decision attributes, functions s and t aggregate similarity with respect to these attributes. The decision rules induced on the basis of DRSA do not need the aggregation of the similarity with respect to different attributes into one comprehensive similarity. This is important because it permits us to avoid using aggregation operators (weighted average, min, etc.) which are always arbitrary to some extent. Moreover, the DRSA
decision rules permit us to consider different thresholds for degrees of credibility in the premise and in the conclusion. The article is organized as follows. The next section presents the philosophical basis of DRSA granular computing. Section “Dominance-Based Rough Set Approach” recalls the main steps of DRSA for multiple criteria classification, or, in general, for ordinal classification. Section “Fuzzy Set Extensions of the Dominance-Based Rough Set Approach” presents a review of fuzzy set extensions of DRSA based on fuzzy connectives. In Sect. “Variable-Consistency Dominance-Based Rough Set Approach (VC-DRSA)”, a “probabilistic” version of DRSA, the variable-consistency dominance-based rough set approach, is presented. Dominance-based rough approximation of a fuzzy set is presented in Sect. “Dominance-Based Rough Approximation of a Fuzzy Set”. The explanation that IRSA is a particular case of DRSA is presented in Sect. “Monotonic Rough Approximation of a Fuzzy Set Versus Classical Rough Set”. Section “Dominance-Based Rough Set Approach to Case-Based Reasoning” is devoted to DRSA for case-based reasoning. In Sect. “An Algebraic Structure for Dominance-Based Rough Set Approach” an algebra modeling the logic of DRSA is presented. Section “Conclusions” contains conclusions. In Sect. “Future Directions” some issues for further development are presented.
Philosophical Basis of DRSA Granular Computing
It is interesting to analyze the relationships between DRSA and granular computing from the point of view of the philosophical basis of rough set theory proposed by Pawlak. Since, according to Pawlak [48], rough set theory refers to some ideas of Gottlob Frege (vague concepts), Gottfried Leibniz (indiscernibility), George Boole (reasoning methods), Jan Łukasiewicz (multi-valued logic), and Thomas Bayes (inductive reasoning), it is meaningful to give an account of the DRSA generalization of rough sets, justifying it in reference to some of these main ideas recalled by Pawlak.
The identity of indiscernibles is a principle of analytic ontology first explicitly formulated by Gottfried Leibniz in his Discourse on Metaphysics, Sect. 9 [43]. Two objects x and y are said to be indiscernible if x and y have the same properties. The principle of identity of indiscernibles states that

if x and y are indiscernible, then x = y.  (II1)
This can be expressed also as: if x ≠ y, then x and y are discernible, i. e. there is at least one property that x has and y does not, or vice versa. The converse of the principle of
identity of indiscernibles is called indiscernibility of identicals and states that if x = y, then x and y are indiscernible, i. e. they have the same properties. This is equivalent to saying that if there is at least one property that x has and y does not, or vice versa, then x ≠ y. The conjunction of both principles is often referred to as “Leibniz’s law”. Rough set theory is based on a weaker interpretation of Leibniz’s law, having as objective the ability to classify objects falling under the same concept. This reinterpretation of Leibniz’s law is based on a reformulation of the principle of identity of indiscernibles as follows:

if x and y are indiscernible, then x and y belong to the same class.  (II2)
Let us observe that the word “class” in the previous sentence can be considered as synonymous with “granule”. Thus, from the point of view of granular computing, (II2) can be rewritten as

if x and y are indiscernible, then x and y belong to the same granule of classification.  (II2′)
Notice also that the principle of indiscernibility of identicals cannot be reformulated in analogous terms. In fact, such an analogous reformulation would amount to stating that if x and y belong to the same class, then x and y are indiscernible. This principle is too strict, however, because there can be two discernible objects x and y belonging to the same class. Thus, within rough set theory, the principle of indiscernibility of identicals should continue to hold in its original formulation (i. e. if x = y, then x and y are indiscernible). It is worthwhile to observe that the relaxation of the consequent of the implication from (II1) to (II2) implies an implicit relaxation also in the antecedent. In fact, one could say that two objects are identical if they have the same properties, provided one were able to take into account all conceivable properties. Human limitations make this impossible; therefore, one can imagine that (II2) can be properly reformulated as

if x and y are indiscernible taking into account a given set of properties, then x and y belong to the same class.  (II2″)
This weakening of the antecedent of the implication also means that the objects indiscernible with respect to a given set of properties can be seen as a granule, so that, finally, (II2) can be rewritten in terms of granulation as

if x and y belong to the same granule with respect to a given set of properties, then x and y belong to the same classification granule.  (II2‴)
For this reason, rough set theory needs a still weaker form of the principle of identity of indiscernibles. Such a principle can be formulated using the idea of vagueness due to Gottlob Frege. According to Frege, “the concept must have a sharp boundary – to the concept without a sharp boundary there would correspond an area that had not a sharp boundary-line all around”. Therefore, following this intuition, the principle of identity of indiscernibles can be further reformulated as

if x and y are indiscernible, then x and y should belong to the same class.  (II3)
In terms of granular computing, (II3) can be rewritten as

if x and y belong to the same granule with respect to a given set of properties, then x and y should belong to the same classification granule.  (II3′)

This reformulation of the principle of identity of indiscernibles implies that there is an inconsistency in the statement that x and y are indiscernible, and x and y belong to different classes. Thus, Leibniz’s principle of identity of indiscernibles and Frege’s intuition about vagueness ground the basic idea of the rough set concept proposed by Pawlak. The above reconstruction of the basic idea of Pawlak’s rough set should be completed, however, by referring to another basic idea. This is the idea of George Boole that a property is either satisfied or not satisfied. It is quite natural to weaken this principle by admitting that a property can be satisfied to some degree. This idea of graduality can be attributed to Jan Łukasiewicz and his proposal of many-valued logic where, in addition to the well-known truth values “true” and “false”, other truth values representing partial degrees of truth were present. Łukasiewicz’s idea of graduality has been reconsidered, generalized and fully exploited by Zadeh [56] within fuzzy set theory, where graduality concerns membership in a set. In this sense, any proposal of putting rough sets and fuzzy sets together can be seen as a reconstruction of the rough set concept, where Boole’s idea of binary logic is abandoned in favor of Łukasiewicz’s idea of many-valued logic, such that Leibniz’s principle of identity of indiscernibles and Frege’s intuition about vagueness are combined with the idea that a property is satisfied to some degree. Putting aside, for the moment, Frege’s intuition about vagueness, but taking into account the concept of graduality, the principle of identity of indiscernibles can be reformulated as follows:
if the grade of each property for x is greater than or equal to the grade for y, then x belongs to the considered class in a grade at least as high as y.  (II4)
Taking into account the paradigm of granular computing, (II4) can be rewritten as

if x belongs to the granules defined by the considered properties more than y, because the grade of each property for x is greater than or equal to the grade for y, then x belongs to the considered classification granule in a grade at least as high as y.  (II4′)

Considering the concept of graduality together with Frege’s intuition about vagueness, one can reformulate the principle of identity of indiscernibles as follows:

if the grade of each property for x is greater than or equal to the grade for y, then x should belong to the considered class in a grade at least as high as y.  (II5)
In terms of granular computing, (II5) can be rewritten as

if x belongs to the granules defined by the considered properties more than y, because the grade of each property for x is greater than or equal to the grade for y, then x should belong to the considered classification granule in a grade at least as high as y.  (II5′)

The formulation (II5′) of the principle of identity of indiscernibles is perfectly concordant with the rough set concept defined within the dominance-based rough set approach [20]. DRSA has been proposed by the authors to deal with ordinal properties of data related to preferences in decision problems [26,54]. The fundamental feature of DRSA is that it handles monotonicity of comprehensive evaluation of objects with respect to preferences relative to evaluation of these objects on particular attributes. For example, the more preferred is a car with respect to such attributes as maximum speed, acceleration, fuel consumption, and price, the better is its comprehensive evaluation. The type of monotonicity considered within DRSA is also meaningful for problems where relationships between different aspects of a phenomenon described by data are to be taken into account, even if preferences are not considered. Indeed, monotonicity concerns, in general, mutual trends existing between different variables, like distance and gravity in physics, or inflation rate and interest rate
in economics. Whenever a relationship between different aspects of a phenomenon is discovered, this relationship can be represented by a monotonicity with respect to some specific measures of the considered aspects. Formulation (II5) of the principle of identity of indiscernibles refers to this type of monotonic relationships. So, in general, monotonicity permits us to translate into a formal language a primitive intuition of relationships between different concepts of our knowledge, corresponding to the principle of identity of indiscernibles formulated as (II5′).
Dominance-Based Rough Set Approach
This section presents the main concepts of the dominance-based rough set approach (for a more complete presentation see, for example, [17,20,26,54]). Information about objects is represented in the form of an information table. The rows of the table are labeled by objects, whereas columns are labeled by attributes, and entries of the table are attribute values. Formally, an information system (table) is the 4-tuple S = ⟨U, Q, V, f⟩, where U is a finite set of objects, Q is a finite set of attributes, V = ⋃_{q∈Q} V_q, where V_q is the set of values of attribute q, and f : U × Q → V is a total function such that f(x, q) ∈ V_q for every q ∈ Q, x ∈ U, called an information function [47]. The set Q is, in general, divided into set C of condition attributes and set D of decision attributes. Condition attributes with value sets ordered according to decreasing or increasing preference are called criteria. For criterion q ∈ Q, ⪰_q is a weak preference relation on U such that x ⪰_q y means “x is at least as good as y with respect to criterion q”. It is supposed that ⪰_q is a complete preorder, i. e. a strongly complete and transitive binary relation, defined on U on the basis of evaluations f(·, q). Without loss of generality, the preference is supposed to increase with the value of f(·, q) for every criterion q ∈ C, such that for all x, y ∈ U, x ⪰_q y if and only if f(x, q) ≥ f(y, q).
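The weak preference relation induced by the information function can be sketched in Python on a toy table (the objects, criteria, and values are hypothetical):

```python
# Toy information table; preference increases with the value f(x, q).
f = {
    "x1": {"speed": 3, "price": 2},
    "x2": {"speed": 2, "price": 2},
    "x3": {"speed": 1, "price": 3},
}

def weakly_preferred(x, y, q):
    """x >=_q y: x is at least as good as y on criterion q."""
    return f[x][q] >= f[y][q]

# >=_q is a complete preorder: any two objects are comparable on q.
objects = list(f)
assert all(weakly_preferred(x, y, "speed") or weakly_preferred(y, x, "speed")
           for x in objects for y in objects)
```

Strong completeness holds because any two numeric evaluations are comparable; transitivity is inherited from the ordering of the values.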
Furthermore, it is supposed that the set of decision attributes D is a singleton {d}. Values of decision attribute d make a partition of U into a finite number of decision classes, Cl = {Cl_t, t = 1, …, n}, such that each x ∈ U belongs to one and only one class Cl_t ∈ Cl. It is supposed that the classes are preference-ordered, i. e. for all r, s ∈ {1, …, n} such that r > s, the objects from Cl_r are preferred to the objects from Cl_s. More formally, if ⪰ is a comprehensive weak preference relation on U, i. e. if for all x, y ∈ U, x ⪰ y means “x is at least as good as y”, it is supposed that [x ∈ Cl_r, y ∈ Cl_s, r > s] ⇒ [x ⪰ y and not y ⪰ x]. The above assumptions are typical for consideration of ordinal classification problems (also called multiple criteria sorting problems). The sets to be approximated are called upward union and downward union of classes, respectively:

Cl_t^≥ = ⋃_{s≥t} Cl_s ,    Cl_t^≤ = ⋃_{s≤t} Cl_s ,    t = 1, …, n.
The statement $x \in Cl_t^{\geq}$ means "x belongs to at least class $Cl_t$", while $x \in Cl_t^{\leq}$ means "x belongs to at most class $Cl_t$". Let us remark that $Cl_1^{\geq} = Cl_n^{\leq} = U$, $Cl_n^{\geq} = Cl_n$ and $Cl_1^{\leq} = Cl_1$. Furthermore, for $t = 2, \ldots, n$,

$$Cl_{t-1}^{\leq} = U - Cl_t^{\geq} \quad \text{and} \quad Cl_t^{\geq} = U - Cl_{t-1}^{\leq}.$$
The key idea of the rough set approach is the representation (approximation) of knowledge generated by decision attributes by "granules of knowledge" generated by condition attributes. In DRSA, where condition attributes are criteria and decision classes are preference-ordered, the represented knowledge is a collection of upward and downward unions of classes, and the "granules of knowledge" are sets of objects defined using a dominance relation. We say that x dominates y with respect to $P \subseteq C$ (shortly, x P-dominates y), denoted by $x D_P y$, if for every criterion $q \in P$, $\phi(x,q) \geq \phi(y,q)$. The relation of P-dominance is reflexive and transitive, that is, it is a partial preorder. Given a set of criteria $P \subseteq C$ and $x \in U$, the "granules of knowledge" used for approximation in DRSA are:

- the set of objects dominating x, called the P-dominating set, $D_P^+(x) = \{y \in U : y D_P x\}$;
- the set of objects dominated by x, called the P-dominated set, $D_P^-(x) = \{y \in U : x D_P y\}$.

Note that the "granules of knowledge" defined above have the form of upward (positive) and downward (negative) dominance cones in the evaluation space. Let us recall that the dominance principle (or Pareto principle) requires that an object x dominating an object y on all considered criteria (i.e. x having evaluations at least as good as y on all considered criteria) should also dominate y on the decision (i.e. x should be assigned to at least as good a decision class as y). This principle is the only objective principle that is widely agreed upon in multiple criteria comparisons of objects. Given $P \subseteq C$, the inclusion of an object $x \in U$ into the upward union of classes $Cl_t^{\geq}$, $t = 2, \ldots, n$, is inconsistent with the dominance principle if one of the following conditions holds:
- x belongs to class $Cl_t$ or better but is P-dominated by an object y belonging to a class worse than $Cl_t$, i.e. $x \in Cl_t^{\geq}$ but $D_P^+(x) \cap Cl_{t-1}^{\leq} \neq \emptyset$;
- x belongs to a class worse than $Cl_t$ but P-dominates an object y belonging to class $Cl_t$ or better, i.e. $x \notin Cl_t^{\geq}$ but $D_P^-(x) \cap Cl_t^{\geq} \neq \emptyset$.

If, given a set of criteria $P \subseteq C$, the inclusion of $x \in U$ in $Cl_t^{\geq}$, where $t = 2, \ldots, n$, is inconsistent with the dominance principle, then x belongs to $Cl_t^{\geq}$ with some ambiguity. Thus, x belongs to $Cl_t^{\geq}$ without any ambiguity with respect to $P \subseteq C$ if $x \in Cl_t^{\geq}$ and there is no inconsistency with the dominance principle. This means that all objects P-dominating x belong to $Cl_t^{\geq}$, i.e. $D_P^+(x) \subseteq Cl_t^{\geq}$. Furthermore, x possibly belongs to $Cl_t^{\geq}$ with respect to $P \subseteq C$ if one of the following conditions holds:

- according to decision attribute d, x belongs to $Cl_t^{\geq}$;
- according to decision attribute d, x does not belong to $Cl_t^{\geq}$, but it is inconsistent in the sense of the dominance principle with an object y belonging to $Cl_t^{\geq}$.

In terms of ambiguity, x possibly belongs to $Cl_t^{\geq}$ with respect to $P \subseteq C$ if x belongs to $Cl_t^{\geq}$ with or without ambiguity. Because of the reflexivity of the dominance relation $D_P$, the above conditions can be summarized as follows: x possibly belongs to class $Cl_t$ or better, with respect to $P \subseteq C$, if among the objects P-dominated by x there is an object y belonging to class $Cl_t$ or better, i.e. $D_P^-(x) \cap Cl_t^{\geq} \neq \emptyset$.
The P-lower approximation of $Cl_t^{\geq}$, denoted by $\underline{P}(Cl_t^{\geq})$, and the P-upper approximation of $Cl_t^{\geq}$, denoted by $\overline{P}(Cl_t^{\geq})$, are defined as follows ($t = 1, \ldots, n$):

$$\underline{P}(Cl_t^{\geq}) = \{x \in U : D_P^+(x) \subseteq Cl_t^{\geq}\},$$
$$\overline{P}(Cl_t^{\geq}) = \{x \in U : D_P^-(x) \cap Cl_t^{\geq} \neq \emptyset\}.$$

Analogously, one can define the P-lower approximation and the P-upper approximation of $Cl_t^{\leq}$ as follows ($t = 1, \ldots, n$):

$$\underline{P}(Cl_t^{\leq}) = \{x \in U : D_P^-(x) \subseteq Cl_t^{\leq}\},$$
$$\overline{P}(Cl_t^{\leq}) = \{x \in U : D_P^+(x) \cap Cl_t^{\leq} \neq \emptyset\}.$$

The P-lower and P-upper approximations so defined satisfy the following inclusion properties, for each $t \in \{1, \ldots, n\}$ and for all $P \subseteq C$:

$$\underline{P}(Cl_t^{\geq}) \subseteq Cl_t^{\geq} \subseteq \overline{P}(Cl_t^{\geq}), \qquad \underline{P}(Cl_t^{\leq}) \subseteq Cl_t^{\leq} \subseteq \overline{P}(Cl_t^{\leq}).$$
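The dominating sets and the approximations above can be computed directly from an information table; a minimal Python sketch on a hypothetical table (two criteria, three classes; all object names and values are invented for illustration):

```python
# Toy decision table: object -> ((evaluations on two criteria), class in 1..3).
# Objects 'c' and 'd' have identical evaluations but different classes,
# creating an inconsistency with the dominance principle.
U = {
    'a': ((3, 3), 3),
    'b': ((2, 3), 2),
    'c': ((2, 2), 3),
    'd': ((2, 2), 1),
    'e': ((1, 1), 1),
}

def dominates(x, y):
    """x P-dominates y: at least as good on every criterion."""
    return all(ex >= ey for ex, ey in zip(U[x][0], U[y][0]))

def d_plus(x):   # P-dominating set D_P^+(x)
    return {y for y in U if dominates(y, x)}

def d_minus(x):  # P-dominated set D_P^-(x)
    return {y for y in U if dominates(x, y)}

def upward(t):   # upward union Cl_t^>=
    return {x for x in U if U[x][1] >= t}

def lower_up(t):  # P-lower approximation of Cl_t^>=
    return {x for x in U if d_plus(x) <= upward(t)}

def upper_up(t):  # P-upper approximation of Cl_t^>=
    return {x for x in U if d_minus(x) & upward(t)}
```

Here the inconsistent pair makes object c drop out of the lower approximation of $Cl_2^{\geq}$ (it is P-dominated by d, which belongs to a worse class), while d enters the upper approximation; the inclusion property holds as stated.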
The P-lower and P-upper approximations of $Cl_t^{\geq}$ and $Cl_t^{\leq}$ have an important complementarity property, according to which

$$\underline{P}(Cl_t^{\geq}) = U - \overline{P}(Cl_{t-1}^{\leq}) \quad \text{and} \quad \overline{P}(Cl_t^{\geq}) = U - \underline{P}(Cl_{t-1}^{\leq}), \qquad t = 2, \ldots, n;$$
$$\underline{P}(Cl_t^{\leq}) = U - \overline{P}(Cl_{t+1}^{\geq}) \quad \text{and} \quad \overline{P}(Cl_t^{\leq}) = U - \underline{P}(Cl_{t+1}^{\geq}), \qquad t = 1, \ldots, n-1.$$

The P-boundaries of $Cl_t^{\geq}$ and $Cl_t^{\leq}$, denoted by $Bn_P(Cl_t^{\geq})$ and $Bn_P(Cl_t^{\leq})$ respectively, are defined as follows ($t = 1, \ldots, n$):

$$Bn_P(Cl_t^{\geq}) = \overline{P}(Cl_t^{\geq}) - \underline{P}(Cl_t^{\geq}), \qquad Bn_P(Cl_t^{\leq}) = \overline{P}(Cl_t^{\leq}) - \underline{P}(Cl_t^{\leq}).$$

Because of the complementarity property, $Bn_P(Cl_t^{\geq}) = Bn_P(Cl_{t-1}^{\leq})$ for $t = 2, \ldots, n$.

The dominance-based rough approximations of upward and downward unions of classes can serve to induce "if…, then…" decision rules. It is meaningful to consider the following five types of decision rules:

1) Certain D≥-decision rules: if $x_{q_1} \succeq_{q_1} r_{q_1}$ and $x_{q_2} \succeq_{q_2} r_{q_2}$ and … $x_{q_p} \succeq_{q_p} r_{q_p}$, then x certainly belongs to $Cl_t^{\geq}$, where, for each $w_q, z_q \in X_q$, "$w_q \succeq_q z_q$" means "$w_q$ is at least as good as $z_q$".
2) Possible D≥-decision rules: if $x_{q_1} \succeq_{q_1} r_{q_1}$ and $x_{q_2} \succeq_{q_2} r_{q_2}$ and … $x_{q_p} \succeq_{q_p} r_{q_p}$, then x possibly belongs to $Cl_t^{\geq}$.
3) Certain D≤-decision rules: if $x_{q_1} \preceq_{q_1} r_{q_1}$ and $x_{q_2} \preceq_{q_2} r_{q_2}$ and … $x_{q_p} \preceq_{q_p} r_{q_p}$, then x certainly belongs to $Cl_t^{\leq}$, where, for each $w_q, z_q \in X_q$, "$w_q \preceq_q z_q$" means "$w_q$ is at most as good as $z_q$".
4) Possible D≤-decision rules: if $x_{q_1} \preceq_{q_1} r_{q_1}$ and $x_{q_2} \preceq_{q_2} r_{q_2}$ and … $x_{q_p} \preceq_{q_p} r_{q_p}$, then x possibly belongs to $Cl_t^{\leq}$.
5) Approximate D≥≤-decision rules: if $x_{q_1} \succeq_{q_1} r_{q_1}$ and … $x_{q_k} \succeq_{q_k} r_{q_k}$ and $x_{q_{k+1}} \preceq_{q_{k+1}} r_{q_{k+1}}$ and … $x_{q_p} \preceq_{q_p} r_{q_p}$, then $x \in Cl_s \cup Cl_{s+1} \cup \cdots \cup Cl_t$, where $s < t$.

The rules of types 1) and 3) represent certain knowledge extracted from the decision table, while the rules of types 2) and 4) represent possible knowledge. Rules of type 5) represent doubtful knowledge.

Fuzzy Set Extensions of the Dominance-Based Rough Set Approach

The concept of dominance can be refined by introducing gradedness through the use of fuzzy sets. Here are basic
definitions of fuzzy connectives [10,34]. For each proposition p, one can consider its truth value v(p), ranging from v(p) = 0 (p is definitely false) to v(p) = 1 (p is definitely true); for all intermediate values, the greater v(p), the more credible is the truth of p. A negation is a non-increasing function $N : [0,1] \to [0,1]$ such that N(0) = 1 and N(1) = 0. Given a proposition p, N(v(p)) states the credibility of the negation of p. A t-norm T and a t-conorm $T^*$ are two functions $T : [0,1] \times [0,1] \to [0,1]$ and $T^* : [0,1] \times [0,1] \to [0,1]$ such that, given two propositions p and q, T(v(p), v(q)) represents the credibility of the conjunction of p and q, and $T^*(v(p), v(q))$ represents the credibility of the disjunction of p and q. A t-norm T and a t-conorm $T^*$ must satisfy the following properties:

$$T(\alpha,\beta) = T(\beta,\alpha) \quad \text{and} \quad T^*(\alpha,\beta) = T^*(\beta,\alpha), \quad \text{for all } \alpha,\beta \in [0,1];$$
$$T(\alpha,\beta) \leq T(\gamma,\delta) \quad \text{and} \quad T^*(\alpha,\beta) \leq T^*(\gamma,\delta), \quad \text{for all } \alpha,\beta,\gamma,\delta \in [0,1] \text{ such that } \alpha \leq \gamma \text{ and } \beta \leq \delta;$$
$$T(\alpha, T(\beta,\gamma)) = T(T(\alpha,\beta), \gamma) \quad \text{and} \quad T^*(\alpha, T^*(\beta,\gamma)) = T^*(T^*(\alpha,\beta), \gamma), \quad \text{for all } \alpha,\beta,\gamma \in [0,1];$$
$$T(1,\alpha) = \alpha \quad \text{and} \quad T^*(0,\alpha) = \alpha, \quad \text{for all } \alpha \in [0,1].$$

A negation is strict iff it is strictly decreasing and continuous. A negation N is involutive iff, for all $\alpha \in [0,1]$, N(N(α)) = α. A strong negation is an involutive strict negation. If N is a strong negation, then $(T, T^*, N)$ is a de Morgan triplet iff $N(T^*(\alpha,\beta)) = T(N(\alpha), N(\beta))$. A fuzzy implication is a function $I : [0,1] \times [0,1] \to [0,1]$ such that, given two propositions p and q, I(v(p), v(q)) represents the credibility of the implication of q by p. A fuzzy implication must satisfy the following properties (see [10]):

$$I(\alpha,\beta) \geq I(\gamma,\beta) \quad \text{for all } \alpha,\beta,\gamma \in [0,1] \text{ such that } \alpha \leq \gamma;$$
$$I(\alpha,\beta) \leq I(\alpha,\gamma) \quad \text{for all } \alpha,\beta,\gamma \in [0,1] \text{ such that } \beta \leq \gamma;$$
$$I(0,\alpha) = 1 \quad \text{and} \quad I(\alpha,1) = 1 \quad \text{for all } \alpha \in [0,1];$$
$$I(1,0) = 0.$$

An implication $I^{\to}_{N,T^*}$ is a $T^*$-implication if there is a t-conorm $T^*$ and a strong negation N such that $I^{\to}_{N,T^*}(\alpha,\beta) = T^*(N(\alpha), \beta)$. A fuzzy similarity relation on
the universe U is a fuzzy binary relation (i.e. a function $R : U \times U \to [0,1]$) that is reflexive ($R(x,x) = 1$ for all $x \in U$), symmetric ($R(x,y) = R(y,x)$ for all $x, y \in U$) and T-transitive (given a t-norm T, $T(R(x,y), R(y,z)) \leq R(x,z)$ for all $x, y, z \in U$).

Let $\succeq_q$ be a fuzzy weak preference relation on U with respect to criterion $q \in C$, i.e. $\succeq_q : U \times U \to [0,1]$, such that, for all $x, y \in U$, $\succeq_q(x,y)$ represents the credibility of the proposition "x is at least as good as y with respect to criterion q". Suppose that $\succeq_q$ is a fuzzy partial T-preorder, i.e. that it is reflexive ($\succeq_q(x,x) = 1$ for each $x \in U$) and T-transitive ($T(\succeq_q(x,y), \succeq_q(y,z)) \leq \succeq_q(x,z)$ for each $x, y, z \in U$) (see [10]). Using the fuzzy weak preference relations $\succeq_q$, $q \in C$, a fuzzy dominance relation on U (denoted $D_P(x,y)$) can be defined, for all $P \subseteq C$, as follows:

$$D_P(x,y) = T_{q \in P}\big(\succeq_q(x,y)\big).$$

Given $(x,y) \in U \times U$, $D_P(x,y)$ represents the credibility of the proposition "x is at least as good as y with respect to each criterion q from P". Since the fuzzy weak preference relations $\succeq_q$ are supposed to be partial T-preorders, the fuzzy dominance relation $D_P$ is also a partial T-preorder. Furthermore, let $Cl = \{Cl_t,\ t = 1, \ldots, n\}$ be a set of fuzzy classes in U such that, for each $x \in U$, $Cl_t(x)$ represents the membership of x in $Cl_t$. It is supposed, as before, that the classes of Cl are increasingly ordered, i.e. that for all $r, s \in \{1, \ldots, n\}$ such that $r > s$, the objects from $Cl_r$ have a better comprehensive evaluation than the objects from $Cl_s$.
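The fuzzy connectives and the fuzzy dominance relation can be sketched concretely. The following minimal Python example uses the minimum t-norm, maximum t-conorm and the standard strong negation N(α) = 1 − α; the objects and all credibility values are hypothetical:

```python
# Fuzzy connectives: minimum t-norm, maximum t-conorm, standard negation.
from functools import reduce

def t_norm(a, b):      # credibility of a conjunction
    return min(a, b)

def t_conorm(a, b):    # credibility of a disjunction
    return max(a, b)

def neg(a):            # standard strong negation
    return 1.0 - a

# Fuzzy weak preference phi_q(x, y): credibility that x is at least as
# good as y on criterion q (hypothetical values for objects x, y, z).
phi = {
    'q1': {('x', 'y'): 1.0, ('y', 'x'): 0.4, ('x', 'z'): 0.8, ('z', 'x'): 0.3,
           ('y', 'z'): 0.7, ('z', 'y'): 0.5},
    'q2': {('x', 'y'): 0.9, ('y', 'x'): 0.6, ('x', 'z'): 1.0, ('z', 'x'): 0.2,
           ('y', 'z'): 0.9, ('z', 'y'): 0.4},
}

def dominance(x, y, P=('q1', 'q2')):
    """D_P(x, y): credibility that x is at least as good as y on every q in P,
    aggregated with the t-norm."""
    if x == y:
        return 1.0                      # reflexivity of the weak preference
    return reduce(t_norm, (phi[q][(x, y)] for q in P))
```

With this choice of connectives, (min, max, 1 − α) forms a de Morgan triplet, so the negation of a disjunction equals the conjunction of the negations, as required above.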
On the basis of the membership functions of the fuzzy classes $Cl_t$, fuzzy membership functions of two other sets can be defined as follows:

1) the upward union fuzzy set $Cl_t^{\geq}$, whose membership function $Cl_t^{\geq}(x)$ represents the credibility of the proposition "x is at least as good as the objects in $Cl_t$":

$$Cl_t^{\geq}(x) = \begin{cases} 1 & \text{if } \exists s \in \{1, \ldots, n\}: Cl_s(x) > 0 \text{ and } s > t, \\ Cl_t(x) & \text{otherwise}; \end{cases}$$

2) the downward union fuzzy set $Cl_t^{\leq}$, whose membership function $Cl_t^{\leq}(x)$ represents the credibility of the proposition "x is at most as good as the objects in $Cl_t$":

$$Cl_t^{\leq}(x) = \begin{cases} 1 & \text{if } \exists s \in \{1, \ldots, n\}: Cl_s(x) > 0 \text{ and } s < t, \\ Cl_t(x) & \text{otherwise}. \end{cases}$$

The P-lower and the P-upper approximations of $Cl_t^{\geq}$ with respect to $P \subseteq C$ are fuzzy sets in U whose membership functions, denoted by $\underline{P}[Cl_t^{\geq}(x)]$ and $\overline{P}[Cl_t^{\geq}(x)]$ respectively, are defined as:

$$\underline{P}[Cl_t^{\geq}(x)] = T_{y \in U}\Big(T^*\big(N(D_P(y,x)), Cl_t^{\geq}(y)\big)\Big),$$
$$\overline{P}[Cl_t^{\geq}(x)] = T^*_{y \in U}\Big(T\big(D_P(x,y), Cl_t^{\geq}(y)\big)\Big).$$

$\underline{P}[Cl_t^{\geq}(x)]$ represents the credibility of the proposition "for all $y \in U$, y does not dominate x with respect to criteria from P or y belongs to $Cl_t^{\geq}$", while $\overline{P}[Cl_t^{\geq}(x)]$ represents the credibility of the proposition "there is at least one $y \in U$ dominated by x with respect to criteria from P which belongs to $Cl_t^{\geq}$".

The P-lower and P-upper approximations of $Cl_t^{\leq}$ with respect to $P \subseteq C$, denoted by $\underline{P}[Cl_t^{\leq}(x)]$ and $\overline{P}[Cl_t^{\leq}(x)]$ respectively, can be defined analogously:

$$\underline{P}[Cl_t^{\leq}(x)] = T_{y \in U}\Big(T^*\big(N(D_P(x,y)), Cl_t^{\leq}(y)\big)\Big),$$
$$\overline{P}[Cl_t^{\leq}(x)] = T^*_{y \in U}\Big(T\big(D_P(y,x), Cl_t^{\leq}(y)\big)\Big).$$

$\underline{P}[Cl_t^{\leq}(x)]$ represents the credibility of the proposition "for all $y \in U$, x does not dominate y with respect to criteria from P or y belongs to $Cl_t^{\leq}$", while $\overline{P}[Cl_t^{\leq}(x)]$ represents the credibility of the proposition "there is at least one $y \in U$ dominating x with respect to criteria from P which belongs to $Cl_t^{\leq}$".

Let us remark that, using the definition of the $T^*$-implication, it is possible to rewrite the definitions of $\underline{P}[Cl_t^{\geq}(x)]$, $\overline{P}[Cl_t^{\geq}(x)]$, $\underline{P}[Cl_t^{\leq}(x)]$ and $\overline{P}[Cl_t^{\leq}(x)]$ in the following way:

$$\underline{P}[Cl_t^{\geq}(x)] = T_{y \in U}\Big(I^{\to}_{T^*,N}\big(D_P(y,x), Cl_t^{\geq}(y)\big)\Big),$$
$$\overline{P}[Cl_t^{\geq}(x)] = N\Big(T_{y \in U}\Big(I^{\to}_{T^*,N}\big(D_P(x,y), N(Cl_t^{\geq}(y))\big)\Big)\Big),$$
$$\underline{P}[Cl_t^{\leq}(x)] = T_{y \in U}\Big(I^{\to}_{T^*,N}\big(D_P(x,y), Cl_t^{\leq}(y)\big)\Big),$$
$$\overline{P}[Cl_t^{\leq}(x)] = N\Big(T_{y \in U}\Big(I^{\to}_{T^*,N}\big(D_P(y,x), N(Cl_t^{\leq}(y))\big)\Big)\Big).$$

The following results can be proved:

1) for each $x \in U$ and for each $t \in \{1, \ldots, n\}$,

$$\underline{P}[Cl_t^{\geq}(x)] \leq Cl_t^{\geq}(x) \leq \overline{P}[Cl_t^{\geq}(x)], \qquad \underline{P}[Cl_t^{\leq}(x)] \leq Cl_t^{\leq}(x) \leq \overline{P}[Cl_t^{\leq}(x)];$$

2) if $(T, T^*, N)$ constitutes a de Morgan triplet and if $N[Cl_t^{\geq}(x)] = Cl_{t-1}^{\leq}(x)$ for each $x \in U$ and $t = 2, \ldots, n$, then

$$\underline{P}[Cl_t^{\geq}(x)] = N\big(\overline{P}[Cl_{t-1}^{\leq}(x)]\big), \quad \overline{P}[Cl_t^{\geq}(x)] = N\big(\underline{P}[Cl_{t-1}^{\leq}(x)]\big), \quad t = 2, \ldots, n,$$
$$\underline{P}[Cl_t^{\leq}(x)] = N\big(\overline{P}[Cl_{t+1}^{\geq}(x)]\big), \quad \overline{P}[Cl_t^{\leq}(x)] = N\big(\underline{P}[Cl_{t+1}^{\geq}(x)]\big), \quad t = 1, \ldots, n-1;$$
3) for all $P \subseteq R \subseteq C$, for all $x \in U$ and for each $t \in \{1, \ldots, n\}$,

$$\underline{P}[Cl_t^{\geq}(x)] \leq \underline{R}[Cl_t^{\geq}(x)], \qquad \overline{P}[Cl_t^{\geq}(x)] \geq \overline{R}[Cl_t^{\geq}(x)],$$
$$\underline{P}[Cl_t^{\leq}(x)] \leq \underline{R}[Cl_t^{\leq}(x)], \qquad \overline{P}[Cl_t^{\leq}(x)] \geq \overline{R}[Cl_t^{\leq}(x)].$$

Results 1) to 3) can be read as fuzzy counterparts of the following results well known within the classical rough set approach: 1) (inclusion property) says that $Cl_t^{\geq}$ and $Cl_t^{\leq}$ include their P-lower approximations and are included in their P-upper approximations; 2) (complementarity property) says that the P-lower (P-upper) approximation of $Cl_t^{\geq}$ is the complement of the P-upper (P-lower) approximation of its complementary set $Cl_{t-1}^{\leq}$ (an analogous property holds for $Cl_t^{\leq}$ and $Cl_{t+1}^{\geq}$); 3) (monotonicity with respect to sets of attributes) says that, enlarging the set of criteria, the membership to the lower approximation does not decrease and the membership to the upper approximation does not increase.

Greco, Inuiguchi, and Słowiński [14] proposed, moreover, the following fuzzy rough approximations based on dominance, which go in line with the fuzzy rough approximations by Dubois and Prade [3,5] concerning classical rough sets:

$$\underline{P}[Cl_t^{\geq}(x)] = \inf_{y \in U} I\big(D_P(y,x), Cl_t^{\geq}(y)\big), \qquad \overline{P}[Cl_t^{\geq}(x)] = \sup_{y \in U} T\big(D_P(x,y), Cl_t^{\geq}(y)\big),$$
$$\underline{P}[Cl_t^{\leq}(x)] = \inf_{y \in U} I\big(D_P(x,y), Cl_t^{\leq}(y)\big), \qquad \overline{P}[Cl_t^{\leq}(x)] = \sup_{y \in U} T\big(D_P(y,x), Cl_t^{\leq}(y)\big).$$

Using fuzzy rough approximations based on DRSA, one can induce decision rules having the same syntax as the decision rules obtained from crisp DRSA. In this case, however, each decision rule has a fuzzy credibility.

Variable-Consistency Dominance-Based Rough Set Approach (VC-DRSA)

The definitions of rough approximations introduced in Sect. "Dominance-Based Rough Set Approach" are based on a strict application of the dominance principle. However, when defining non-ambiguous objects, it is reasonable to accept a limited proportion of negative examples, particularly for large data tables. Such an extended version of DRSA is called the variable-consistency DRSA model (VC-DRSA) [31].
For any $P \subseteq C$, $x \in U$ belongs to $Cl_t^{\geq}$ without any ambiguity at consistency level $l \in (0,1]$ if $x \in Cl_t^{\geq}$ and at least $l \cdot 100\%$ of all objects $y \in U$ dominating x with respect to P also belong to $Cl_t^{\geq}$, i.e., for $t = 2, \ldots, n$,

$$\frac{|D_P^+(x) \cap Cl_t^{\geq}|}{|D_P^+(x)|} \geq l.$$

The level l is called the consistency level because it controls the degree of consistency with respect to objects qualified as belonging to $Cl_t^{\geq}$ without any ambiguity. In other words, if $l < 1$, then at most $(1-l) \cdot 100\%$ of all objects $y \in U$ dominating x with respect to P do not belong to $Cl_t^{\geq}$ and thus contradict the inclusion of x in $Cl_t^{\geq}$.

Analogously, for any $P \subseteq C$, $x \in U$ belongs to $Cl_t^{\leq}$ without any ambiguity at consistency level $l \in (0,1]$ if $x \in Cl_t^{\leq}$ and at least $l \cdot 100\%$ of all objects $y \in U$ dominated by x with respect to P also belong to $Cl_t^{\leq}$, i.e., for $t = 1, \ldots, n-1$,

$$\frac{|D_P^-(x) \cap Cl_t^{\leq}|}{|D_P^-(x)|} \geq l.$$

The concept of non-ambiguous objects at some consistency level l leads naturally to the corresponding definition of the P-lower approximations of the unions of classes $Cl_t^{\geq}$ and $Cl_t^{\leq}$, respectively:

$$\underline{P}^{\,l}(Cl_t^{\geq}) = \left\{ x \in Cl_t^{\geq} : \frac{|D_P^+(x) \cap Cl_t^{\geq}|}{|D_P^+(x)|} \geq l \right\}, \qquad t = 2, \ldots, n,$$
$$\underline{P}^{\,l}(Cl_t^{\leq}) = \left\{ x \in Cl_t^{\leq} : \frac{|D_P^-(x) \cap Cl_t^{\leq}|}{|D_P^-(x)|} \geq l \right\}, \qquad t = 1, \ldots, n-1.$$

Given $P \subseteq C$ and consistency level l, the corresponding P-upper approximations of $Cl_t^{\geq}$ and $Cl_t^{\leq}$, denoted by $\overline{P}^{\,l}(Cl_t^{\geq})$ and $\overline{P}^{\,l}(Cl_t^{\leq})$ respectively, can be defined as complements of $\underline{P}^{\,l}(Cl_{t-1}^{\leq})$ and $\underline{P}^{\,l}(Cl_{t+1}^{\geq})$ with respect to U:

$$\overline{P}^{\,l}(Cl_t^{\geq}) = U - \underline{P}^{\,l}(Cl_{t-1}^{\leq}), \qquad t = 2, \ldots, n,$$
$$\overline{P}^{\,l}(Cl_t^{\leq}) = U - \underline{P}^{\,l}(Cl_{t+1}^{\geq}), \qquad t = 1, \ldots, n-1.$$

$\overline{P}^{\,l}(Cl_t^{\geq})$ can be interpreted as the set of all objects belonging to $Cl_t^{\geq}$, possibly ambiguous at consistency level l. Analogously, $\overline{P}^{\,l}(Cl_t^{\leq})$ can be interpreted as the set of all objects belonging to $Cl_t^{\leq}$, possibly ambiguous at consistency level l. The P-boundaries (P-doubtful regions) of $Cl_t^{\geq}$ and $Cl_t^{\leq}$ at consistency level l are defined as:

$$Bn_P^l(Cl_t^{\geq}) = \overline{P}^{\,l}(Cl_t^{\geq}) - \underline{P}^{\,l}(Cl_t^{\geq}), \qquad t = 2, \ldots, n,$$
$$Bn_P^l(Cl_t^{\leq}) = \overline{P}^{\,l}(Cl_t^{\leq}) - \underline{P}^{\,l}(Cl_t^{\leq}), \qquad t = 1, \ldots, n-1.$$
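The effect of the consistency level can be seen on a small example. A minimal Python sketch (toy table with two criteria and two classes; all values are hypothetical):

```python
# Toy decision table: object -> ((evaluations on two criteria), class in {1, 2}).
U = {
    'a': ((3, 3), 2),
    'b': ((2, 3), 2),
    'c': ((2, 2), 2),
    'd': ((2, 2), 1),   # same evaluations as 'c' but a worse class
    'e': ((1, 1), 1),
}

def d_plus(x):
    """Objects P-dominating x: at least as good on every criterion."""
    return {y for y in U if all(e >= f for e, f in zip(U[y][0], U[x][0]))}

def upward(t):
    """Upward union Cl_t^>=."""
    return {x for x in U if U[x][1] >= t}

def vc_lower_up(t, l):
    """P-lower approximation of Cl_t^>= at consistency level l."""
    return {x for x in upward(t)
            if len(d_plus(x) & upward(t)) / len(d_plus(x)) >= l}
```

At l = 1 the definition coincides with the strict DRSA lower approximation and the inconsistent object c is excluded; lowering l to 0.75 tolerates the single negative example d among the four objects dominating c, so c is admitted.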
The variable-consistency model of the dominance-based rough set approach provides some degree of flexibility in assigning objects to lower and upper approximations of the unions of decision classes. The following properties can be easily proved: for $0 < l' < l \leq 1$,

$$\underline{P}^{\,l}(Cl_t^{\geq}) \subseteq \underline{P}^{\,l'}(Cl_t^{\geq}) \quad \text{and} \quad \overline{P}^{\,l}(Cl_t^{\geq}) \supseteq \overline{P}^{\,l'}(Cl_t^{\geq}), \qquad t = 2, \ldots, n;$$
$$\underline{P}^{\,l}(Cl_t^{\leq}) \subseteq \underline{P}^{\,l'}(Cl_t^{\leq}) \quad \text{and} \quad \overline{P}^{\,l}(Cl_t^{\leq}) \supseteq \overline{P}^{\,l'}(Cl_t^{\leq}), \qquad t = 1, \ldots, n-1.$$

The following two basic types of variable-consistency decision rules can be considered:

1. D≥-decision rules with the following syntax: "if $\phi(x,q_1) \geq r_{q_1}$ and $\phi(x,q_2) \geq r_{q_2}$ and … $\phi(x,q_p) \geq r_{q_p}$, then $x \in Cl_t^{\geq}$" with confidence α (i.e. in fraction α of the considered cases), where $P = \{q_1, \ldots, q_p\} \subseteq C$, $(r_{q_1}, \ldots, r_{q_p}) \in V_{q_1} \times V_{q_2} \times \cdots \times V_{q_p}$ and $t = 2, \ldots, n$;
2. D≤-decision rules with the following syntax: "if $\phi(x,q_1) \leq r_{q_1}$ and $\phi(x,q_2) \leq r_{q_2}$ and … $\phi(x,q_p) \leq r_{q_p}$, then $x \in Cl_t^{\leq}$" with confidence α, where $P = \{q_1, \ldots, q_p\} \subseteq C$, $(r_{q_1}, \ldots, r_{q_p}) \in V_{q_1} \times V_{q_2} \times \cdots \times V_{q_p}$ and $t = 1, \ldots, n-1$.

The variable-consistency model is inspired by the variable precision model proposed by Ziarko [61,62] within the classical indiscernibility-based rough set approach.

Dominance-Based Rough Approximation of a Fuzzy Set

This section shows how the dominance-based rough set approach can be used for the rough approximation of fuzzy sets. A fuzzy information base is the 3-tuple $B = \langle U, F, \varphi \rangle$, where U is a finite set of objects (universe), $F = \{f_1, f_2, \ldots, f_m\}$ is a finite set of properties, and $\varphi : U \times F \to [0,1]$ is a function such that $\varphi(x, f_h) \in [0,1]$ expresses the credibility that object x has property $f_h$. Each object x from U is described by a vector

$$Des_F(x) = [\varphi(x, f_1), \ldots, \varphi(x, f_m)],$$

called the description of x in terms of the degrees to which it has the properties from F; it represents the available information about x. Obviously, $x \in U$ can be described in terms of any non-empty subset $E \subseteq F$, and in this case

$$Des_E(x) = [\varphi(x, f_h) : f_h \in E].$$

For any $E \subseteq F$, the dominance relation $D_E$ can be defined as follows: for all $x, y \in U$, x dominates y with respect
to E (denoted $x D_E y$) if, for every $f_h \in E$,

$$\varphi(x, f_h) \geq \varphi(y, f_h).$$

Given $E \subseteq F$ and $x \in U$, let

$$D_E^+(x) = \{y \in U : y D_E x\}, \qquad D_E^-(x) = \{y \in U : x D_E y\}.$$

Let us consider a fuzzy set X in U, with its membership function $\mu_X : U \to [0,1]$. For each cutting level $\alpha \in [0,1]$ and for $* \in \{\geq, >\}$, the E-lower and the E-upper approximations of $X^{*\alpha} = \{y \in U : \mu_X(y) * \alpha\}$ with respect to $E \subseteq F$ (denoted $\underline{E}(X^{*\alpha})$ and $\overline{E}(X^{*\alpha})$, respectively) can be defined as:

$$\underline{E}(X^{*\alpha}) = \{x \in U : D_E^+(x) \subseteq X^{*\alpha}\}, \qquad \overline{E}(X^{*\alpha}) = \{x \in U : D_E^-(x) \cap X^{*\alpha} \neq \emptyset\}.$$

The rough approximations $\underline{E}(X^{*\alpha})$ and $\overline{E}(X^{*\alpha})$ can be expressed in terms of unions of the granules $D_E^+(x)$ as follows:

$$\underline{E}(X^{*\alpha}) = \bigcup_{x \in U} \{D_E^+(x) : D_E^+(x) \subseteq X^{*\alpha}\},$$
$$\overline{E}(X^{*\alpha}) = \bigcup_{x \in U} \{D_E^+(x) : D_E^-(x) \cap X^{*\alpha} \neq \emptyset\}.$$

Analogously, for each cutting level $\alpha \in [0,1]$ and for $* \in \{\leq, <\}$, the E-lower and E-upper approximations of $X^{*\alpha} = \{y \in U : \mu_X(y) * \alpha\}$ can be defined by exchanging the roles of the dominating sets $D_E^+(x)$ and the dominated sets $D_E^-(x)$ in the definitions above. Observe that the lower approximation of $X^{\geq\alpha}$ with respect to x contains all the objects $y \in U$ such that any object w, being similar to x at least as much as y is similar to x w.r.t. all the considered features $E \subseteq F$, also belongs to $X^{\geq\alpha}$. Thus, the data from the fuzzy pairwise information base B confirm that if w is similar to x not less than $y \in \underline{E}_{(x)}(X^{\geq\alpha})$ is similar to x w.r.t. all the considered features $E \subseteq F$, then w belongs to $X^{\geq\alpha}$. In other words, x is a reference object and
$y \in \underline{E}_{(x)}(X^{\geq\alpha})$ is a limit object which belongs "certainly" to the set X with credibility at least α; the limit is understood in the sense that all objects w that are similar to x w.r.t. the considered features at least as much as y is similar to x also belong to X with credibility at least α. Analogously, the upper approximation of $X^{\geq\alpha}$ with respect to x contains all objects $y \in U$ such that there is at least one object w, similar to x at most as much as y is similar to x w.r.t. all the considered features $E \subseteq F$, which belongs to $X^{\geq\alpha}$. Thus, the data from the fuzzy pairwise information base B confirm that if w is similar to x not less than $y \in \overline{E}_{(x)}(X^{\geq\alpha})$ is similar to x w.r.t. all the considered features $E \subseteq F$, then it is possible that w belongs to $X^{\geq\alpha}$. In other words, x is a reference object and $y \in \overline{E}_{(x)}(X^{\geq\alpha})$ is a limit object which belongs "possibly" to the set X with credibility at least α; the limit is understood in the sense that all objects $z \in U$ similar to x not less than y w.r.t. the considered features possibly belong to $X^{\geq\alpha}$. For each $x \in U$, $\alpha \in [0,1]$ and $* \in \{\geq, >\}$,

if $|N_1| > |N_2|$, for specificity, then any payoff vector in the core assigns $1 to each player of the scarce type; that is, players with a RH glove each receive 0 while players with a LH glove each receive $1.
Not all games have nonempty cores, as the following example illustrates.

Example 2 (A simple majority game with an empty core) Let $N = \{1, 2, 3\}$ and define the function v as follows:

$$v(S) = \begin{cases} 0 & \text{if } |S| = 1, \\ 1 & \text{otherwise}. \end{cases}$$

It is easy to see that the core of the game is empty. For if a payoff vector u were in the core, then it must hold that $u_i \geq 0$ for any $i \in N$ and $u_i + u_j \geq 1$ for any distinct $i, j \in N$; summing the three pair constraints gives $u_1 + u_2 + u_3 \geq 3/2$. Moreover, feasibility dictates that $u_1 + u_2 + u_3 \leq 1$. This is impossible; thus, the core is empty. Before leaving this example, let us ask whether it would be possible to subsidize the players by increasing the payoff to the total player set N and, by doing so, ensure that the core of the game with a subsidy is nonempty. We leave it to the reader to verify that if v(N) were increased to $3/2 (or more), the new game would have a nonempty core.

Let (N, v) be a game and let $i, j \in N$. Then players i and j are substitutes if, for all groups $S \subseteq N$ with $i, j \notin S$, it holds that

$$v(S \cup \{i\}) = v(S \cup \{j\}).$$

Let (N, v) be a game and let $u \in \mathbb{R}^N$ be a payoff vector for the game. If $u_i = u_j$ for all players i and j who are substitutes, then u has the equal treatment property. Note that if there is a partition of N into T subsets, say $N_1, \ldots, N_T$, where all players in each subset $N_t$ are substitutes for each other, then we can represent u by a vector $\bar{u} \in \mathbb{R}^T$ where, for each t, it holds that $\bar{u}_t = u_i$ for all $i \in N_t$.

Essential Superadditivity

We wish to treat games where the worth of a group of players is independent of the total player set in which it is embedded and where an option open to the members of a group is to partition themselves into smaller groups; that is, we treat games that are essentially superadditive. This is built into our definition of feasibility above, (1). An alternative approach, which would still allow us to treat situations where it is optimal for players to form groups smaller than the total player set, would be to assume that v is the "superadditive cover" of some other worth function $v'$. Given a not-necessarily-superadditive function $v'$, for each group S define v(S) by:

$$v(S) = \max \sum_k v'(S_k) \tag{4}$$
Market Games and Clubs
where the maximum is taken over all partitions $\{S_k\}$ of S; the function v is the superadditive cover of $v'$. Then the notion of feasibility requiring that a payoff vector u is feasible only if

$$u(N) \leq v(N) \tag{5}$$

gives a set of feasible payoff vectors equivalent to that of the game $(N, v')$ with the definition of feasibility given by (1).

The following Proposition may be well known and is easily proven. This result was already well understood in Gillies [27], and applications have appeared in a number of papers in the theoretical literature of game theory; see, for example (for ε = 0), Aumann and Dreze [6] and Kaneko and Wooders [33]. It is also well known in club theory and the theory of economies with many players and local public goods.

Proposition 1 Given $\varepsilon \geq 0$, let $(N, v')$ be a game. A payoff vector $u \in \mathbb{R}^N$ is in the weak, respectively uniform, ε-core of $(N, v')$ if and only if it is in the weak, respectively uniform, ε-core of the superadditive cover game $(N, v)$, where v is defined by (4).

A Market

In this section we introduce the definition, from Shapley and Shubik [60], of a market. Unlike Shapley and Shubik, however, we do not assume concavity of utility functions. A market is taken to be an economy where all participants have continuous utility functions over a finite set of commodities that are all linear in one commodity, thought of as an "idealized" money. Money can be consumed in any amount, possibly negative. For later convenience we will consider an economy where there is a finite set of types of participants and all participants of the same type have the same endowments and preferences.

Consider an economy with T+1 types of commodities. Denote the set of participants by

$$N = \{(t,q) : t = 1, \ldots, T \text{ and } q = 1, \ldots, n_t\}.$$

Assume that all participants of the same type t, i.e. (t,q) for $q = 1, \ldots, n_t$, have the same utility function, given by

$$\hat{u}_t(y, \xi) = u_t(y) + \xi,$$

where $y \in \mathbb{R}_+^T$ and $\xi \in \mathbb{R}$. Let $a^{tq} \in \mathbb{R}_+^T$ be the endowment of the (t,q)th participant of the first T commodities. The total endowment is given by $\sum_{(t,q) \in N} a^{tq}$.
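Returning to Example 2 above, both the empty-core argument and the effect of the $3/2 subsidy can be checked mechanically. A minimal Python sketch (the encoding of coalitions and the numerical tolerance are incidental choices):

```python
# Core check for the 3-player simple majority game of Example 2.
from itertools import combinations

N = [1, 2, 3]

def v(S):
    """Worth function of the simple majority game."""
    return 0.0 if len(S) <= 1 else 1.0

def in_core(u, grand_worth):
    """u is in the core: feasible, and no coalition (including N) can improve."""
    if sum(u.values()) > grand_worth + 1e-9:          # feasibility
        return False
    for r in range(1, len(N)):                        # proper coalitions
        for S in combinations(N, r):
            if sum(u[i] for i in S) < v(S) - 1e-9:
                return False
    return sum(u.values()) >= grand_worth - 1e-9      # N itself cannot improve

# Summing the three pair constraints u_i + u_j >= 1 forces u(N) >= 3/2,
# while feasibility requires u(N) <= v(N) = 1: the core is empty.
pair_bound = sum(v(S) for S in combinations(N, 2)) / 2    # = 3/2
core_empty = pair_bound > v(N)

# With the grand coalition's worth subsidized up to 3/2, the equal split
# (1/2, 1/2, 1/2) satisfies every coalition constraint.
equal_split = {i: 0.5 for i in N}
```

The equal split lies in the core of the subsidized game but not of the original one, matching the discussion in Example 2.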
For simplicity and without loss of generality, we can assume that no participant is endowed with any nonzero amount of the (T+1)th good, the "money" or medium of exchange. One might think of utilities as being measured in money. It is because of the transferability of money that utilities are called "transferable".

Remark 1 Instead of assuming that money can be consumed in negative amounts, one might assume that endowments of money are sufficiently large so that no equilibrium allocates any participant a negative amount of money. For further discussion of transferable utility see, for example, Bergstrom and Varian [9] or Kaneko and Wooders [34].

Given a group $S \subseteq N$, an S-allocation of commodities is a set

$$\Big\{ (y^{tq}, \xi^{tq}) \in \mathbb{R}_+^T \times \mathbb{R} : \sum_{(t,q) \in S} y^{tq} \leq \sum_{(t,q) \in S} a^{tq} \ \text{and} \ \sum_{(t,q) \in S} \xi^{tq} \leq 0 \Big\};$$

that is, an S-allocation is a redistribution of the commodities owned by the members of S among themselves, together with monetary transfers adding up to no more than zero. When S = N, an S-allocation is called simply an allocation.

With the price of the (T+1)th commodity set equal to 1, a competitive outcome is a price vector p in $\mathbb{R}^T$, listing prices for the first T commodities, and an allocation $\{(y^{tq}, \xi^{tq}) \in \mathbb{R}_+^T \times \mathbb{R} : (t,q) \in N\}$ for which

(a) $u_t(y^{tq}) - p \cdot (y^{tq} - a^{tq}) \geq u_t(\hat{y}) - p \cdot (\hat{y} - a^{tq})$ for all $\hat{y} \in \mathbb{R}_+^T$, $(t,q) \in N$;

(b) $\sum_{(t,q) \in N} y^{tq} = \sum_{(t,q) \in N} a^{tq}$;

(c) $\xi^{tq} = -p \cdot (y^{tq} - a^{tq})$ for all $(t,q) \in N$; and

(d) $\sum_{(t,q) \in N} \xi^{tq} = 0$.   (6)
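Conditions (a)-(d) can be illustrated with a tiny hypothetical market: one non-money good and two participants with concave quasi-linear utilities $u_t(y) = b_t\sqrt{y}$ (so a competitive outcome exists, per Shapley and Shubik's assumptions). All coefficients and endowments below are illustrative assumptions, not data from the article:

```python
# Hypothetical quasi-linear market: one non-money good, two participants.
from math import sqrt

b = {1: 2.0, 2: 4.0}                     # utility coefficients in u_t(y) = b_t*sqrt(y)
a = {1: 1.0, 2: 0.0}                     # endowments of the good

# Each demand solves max_y b*sqrt(y) - p*(y - a), giving y = (b / (2p))^2;
# market clearing y_1 + y_2 = a_1 + a_2 = 1 then gives p = sqrt(5).
p = sqrt(5.0)
y = {t: (b[t] / (2 * p)) ** 2 for t in b}
xi = {t: -p * (y[t] - a[t]) for t in b}  # money transfers, condition (c)

def payoff(t, q):
    """Competitive payoff u_t(q) - p*(q - a_t) from consuming quantity q."""
    return b[t] * sqrt(q) - p * (q - a[t])
```

At the price p = √5 the allocation (1/5, 4/5) clears the market (condition (b)), the transfers sum to zero (condition (d)), and neither participant can improve by choosing a different consumption (condition (a)).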
Given a competitive outcome with allocation $\{(y^{tq}, \xi^{tq}) \in \mathbb{R}_+^T \times \mathbb{R} : (t,q) \in N\}$ and price vector p, the competitive payoff to the (t,q)th participant is $u_t(y^{tq}) - p \cdot (y^{tq} - a^{tq})$. A competitive payoff vector is given by

$$\big(u_t(y^{tq}) - p \cdot (y^{tq} - a^{tq}) : (t,q) \in N\big).$$

In the following we will assume that, for each t, all participants of type t have the same endowment; that is, for each t, it holds that $a^{tq} = a^{tq'}$ for all $q, q' = 1, \ldots, n_t$. In this case, every competitive payoff has the equal treatment property:

$$u_t(y^{tq}) - p \cdot (y^{tq} - a^{tq}) = u_t(y^{tq'}) - p \cdot (y^{tq'} - a^{tq'})$$
for all $q, q'$ and for each t. It follows that a competitive payoff vector can be represented by a vector in $\mathbb{R}^T$ with one component for each player type.

It is easy to generate a game from the data of an economy. For each group of participants $S \subseteq N$, define

$$v(S) = \max \sum_{(t,q) \in S} \hat{u}_t(y^{tq}, \xi^{tq}),$$

where the maximum is taken over the set of S-allocations. Let (N, v) denote a game derived from a market. Under the assumption of concavity of the utility functions of the participants in an economy, Shapley and Shubik [60] show that a competitive outcome for the market exists and that the competitive payoff vectors are in the core of the game. (Since [22], such results have been obtained in substantially more general models of economies.)

Market-Game Equivalence

To facilitate exposition of the theory of games with many players and the equivalence of markets and games, we consider games derived from a common underlying structure and with a fixed number of types of players, where all players of the same type are substitutes for each other.
Pregames

Let T be a positive integer, to be interpreted as a number of player types. A profile $s = (s_1, \ldots, s_T) \in \mathbb{Z}_+^T$, where $\mathbb{Z}_+^T$ is the T-fold Cartesian product of the non-negative integers $\mathbb{Z}_+$, describes a group of players by the numbers of players of each type in the group. Given a profile s, define the norm or size of s by

$$\|s\| \stackrel{\text{def}}{=} \sum_t s_t,$$

simply the total number of players in a group described by s. A subprofile of a profile $n \in \mathbb{Z}_+^T$ is a profile s satisfying $s \leq n$. A partition of a profile s is a collection of subprofiles $\{s^k\}$ of s, not all necessarily distinct, satisfying

$$\sum_k s^k = s.$$

A partition of a profile is analogous to a partition of a set, except that all members of a partition of a set are distinct.

Let $\psi$ be a function from the set of profiles $\mathbb{Z}_+^T$ to $\mathbb{R}_+$ with $\psi(0) = 0$. The value $\psi(s)$ is interpreted as the total payoff a group of players with profile s can achieve from the collective activities of the group membership and is called the worth of the profile s. Given $\psi$, define a worth function $\Psi$, called the superadditive cover of $\psi$, by

$$\Psi(s) \stackrel{\text{def}}{=} \max \sum_k \psi(s^k),$$

where the maximum is taken over the set of all partitions $\{s^k\}$ of s. The function $\psi$ is said to be superadditive if the worth functions $\psi$ and $\Psi$ are equal.

We define a pregame as a pair $(T, \psi)$, where $\psi : \mathbb{Z}_+^T \to \mathbb{R}_+$. As we will now discuss, a pregame can be used to generate multiple games. To generate a game from a pregame, it is only required to specify a total player set N and the numbers of players of each of the T types in the set. Then the pregame can be used to assign a worth to every group of players contained in the total player set, thus creating a game.

A game determined by the pregame $(T, \psi)$, which we will typically call a game or a game with side payments, is a pair $[n; (T, \psi)]$ where n is a profile. A subgame of a game $[n; (T, \psi)]$ is a pair $[s; (T, \psi)]$ where s is a subprofile of n. With any game $[n; (T, \psi)]$ we can associate a game $(N, v)$ in the form introduced earlier, as follows. Let

$$N = \{(t,q) : t = 1, \ldots, T \text{ and } q = 1, \ldots, n_t\}$$

be a player set for the game. For each subset $S \subseteq N$, define the profile of S, denoted by $\mathrm{prof}(S) \in \mathbb{Z}_+^T$, by its components

$$\mathrm{prof}(S)_t \stackrel{\text{def}}{=} \big|S \cap \{(t', q) : t' = t \text{ and } q = 1, \ldots, n_t\}\big|,$$

and define

$$v(S) \stackrel{\text{def}}{=} \psi(\mathrm{prof}(S)).$$

Then the pair (N, v) satisfies the usual definition of a game with side payments. For any $S \subseteq N$, define

$$v^*(S) \stackrel{\text{def}}{=} \Psi(\mathrm{prof}(S)).$$

The game $(N, v^*)$ is the superadditive cover of (N, v). A payoff vector for a game (N, v) is a vector $u \in \mathbb{R}^N$. For each nonempty subset S of N, define

$$u(S) \stackrel{\text{def}}{=} \sum_{(t,q) \in S} u_{tq}.$$

A payoff vector u is feasible for S if

$$u(S) \leq v^*(S) = \Psi(\mathrm{prof}(S)).$$

If S = N, we simply say that the payoff vector u is feasible if

$$u(N) \leq v^*(N) = \Psi(\mathrm{prof}(N)).$$
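The superadditive cover $\Psi$ can be computed by recursing over partitions of a profile. A minimal Python sketch for a hypothetical two-type worth function $\psi$ in which only a single mixed pair is productive:

```python
# Superadditive cover of a worth function psi defined on profiles.
from functools import lru_cache
from itertools import product

def psi(s):
    """Hypothetical worth: a profile of exactly one player of each of the
    two types is worth 3; every other profile is worth 0 on its own."""
    return 3.0 if s == (1, 1) else 0.0

@lru_cache(maxsize=None)
def cover(s):
    """Psi(s): best total worth over all partitions of profile s, found by
    splitting off a first part and covering the remainder recursively."""
    best = psi(s)
    for sub in product(*(range(k + 1) for k in s)):
        if any(sub) and sub != s:
            rest = tuple(x - y for x, y in zip(s, sub))
            best = max(best, psi(sub) + cover(rest))
    return best
```

For example, the profile (2, 2) is worth 0 under ψ itself but 6 under the cover, since it can partition into two mixed pairs; ψ is therefore not superadditive.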
Note that our definition of feasibility is consistent with essential superadditivity; a group can realize at least as large a total payoff as it can achieve in any partition of the group, and one way to achieve this payoff is by partitioning into smaller groups. A payoff vector u satisfies the equal-treatment property if $u_{tq} = u_{tq'}$ for all $q, q' \in \{1, \ldots, n_t\}$ and for each $t = 1, \ldots, T$.

Let $[n; (T, \psi)]$ be a game and let $\beta$ be a collection of subprofiles of n. The collection $\beta$ is a balanced collection of subprofiles of n if there are positive real numbers $\lambda_s$ for $s \in \beta$ such that $\sum_{s \in \beta} \lambda_s s = n$. The numbers $\lambda_s$ are called balancing weights. Given a real number $\varepsilon \geq 0$, the game $[n; (T, \psi)]$ is ε-balanced if for every balanced collection $\beta$ of subprofiles of n it holds that

$$\psi(n) \geq \sum_{s \in \beta} \lambda_s \big(\psi(s) - \varepsilon \|s\|\big), \tag{7}$$

where the balancing weights for $\beta$ are given by $\lambda_s$ for $s \in \beta$. This definition extends that of Bondareva [13] and Shapley [56] to games with player types. Roughly, a game is (ε-)balanced if allowing "part-time" groups does not improve the total payoff (by more than ε per player). A game $[n; (T, \psi)]$ is totally balanced if every subgame $[s; (T, \psi)]$ is balanced.

The balanced cover game generated by a game $[n; (T, \psi)]$ is a game $[n; (T, \psi^b)]$ where

1. $\psi^b(s) = \psi(s)$ for all $s \neq n$ and
2. $\psi^b(n) \geq \psi(n)$ and $\psi^b(n)$ is as small as possible consistent with the nonemptiness of the core of $[n; (T, \psi^b)]$.

From the Bondareva–Shapley Theorem it follows that $\psi^b(n) = \psi(n)$ if and only if the game $[n; (T, \psi)]$ is balanced (ε-balanced, with ε = 0). For later convenience, the notion of the balanced cover of a pregame is introduced. Let $(T, \psi)$ be a pregame. For each profile s, define

$$\psi^b(s) \stackrel{\text{def}}{=} \max_{\beta} \sum_{g \in \beta} \lambda_g \psi(g), \tag{8}$$

where the maximum is taken over all balanced collections $\beta$ of subprofiles of s, with balancing weights $\lambda_g$ for $g \in \beta$.
Premarkets In this section, we introduce the concept of a premarket and re-state results from Shapley and Shubik [60] in the context of pregames and premarkets. Let L C 1 be a number of types of commodities and let fb u t (y; ) : t D 1; : : : ; Tg denote a finite number of functions, called utility functions, of the form b u t (y; ) D u t (y) C ; L and 2 R. (Such functions, in the literwhere y 2 RC ature of economics, are commonly called quasi-linear). L : t D 1; : : : ; Tg be interpreted as a set of Let fa t 2 RC endowments. We assume that u t (a t ) 0 for each t. For def t D 1; : : : ; T we define c t D (u t (); a t ) as a participant t type and let C D fc : t D 1; : : : ; Tg be the set of participant types. Observe that from the data given by C we can construct a market by specifying a set of participants N and a function from N to C assigning endowments and utility functions – types – to each participant in N. A premarket is a pair (T; C). Let (T; C) be a premarket and let s D (s1 ; : : : ; s T ) 2 T . We interpret s as representing a group of economic ZC participants with st participants having utility functions and endowments given by ct for t D 1; : : : ; T; for each t, that is, there are st participants in the group with type ct . Observe that the data of a premarket gives us sufficient data to generate a pregame. In particular, given a profile s D (s1 ; : : : ; s T ) listing numbers of participants of each of T types, define def
W(s) D max
X
s t u t (y t )
t L : t D where the maximum is taken over the set fy t 2 RC P P t t t 1; : : : ; T and t s t y D t a y g. Then the pair (T; W) is a pregame generated by the premarket. The following Theorem is an extension to premarkets or a restatement of a result due to Shapley and Shubik [60].
Theorem 1 Let (T; C) be a premarket derived from economic data in which all utility functions are concave. Then the pregame generated by the premarket is totally balanced.
g2ˇ
Direct Markets and Market-Game Equivalence where the maximum is taken over all balanced collections ˇ of subprofiles of s with weights g for g 2 ˇ. The pair (T; b ) is called the balanced cover pregame of (T; ). Since a partition of a profile is a balanced collection it is immediately clear that b (s) (s) for every profile s.
Shapley and Shubik [60] introduced the notion of a direct market derived from a totally balanced game. In the direct market, each player is endowed with one unit of a commodity (himself) and all players in the economy have the same utility function. In interpretation, we might think of
this as a labor market or as a market for productive factors (as in [50], for example) where each player owns one unit of a commodity. For games with player types, as in this essay, we take the player types of the game as the commodity types of a market and assign all players in the market the same utility function, derived from the worth function of the game.

Let (T, Ψ) be a pregame and let [n; (T, Ψ)] be a derived game. Let N = {(t, q) : t = 1, ..., T and q = 1, ..., n_t for each t} denote the set of players in the game, where all participants {(t', q) : q = 1, ..., n_t'} are of type t' for each t' = 1, ..., T. To construct the direct market generated by a derived game [n; (T, Ψ)], we take the commodity space as R_+^T and suppose that each participant in the market of type t is endowed with one unit of the tth commodity, and thus has endowment 1_t = (0, ..., 0, 1, 0, ..., 0) ∈ R_+^T, where the '1' is in the tth position. The total endowment of the economy is then given by Σ_t n_t 1_t = n.

For any vector y ∈ R_+^T define

u(y) = max Σ_{s≤n} γ_s Ψ(s),   (9)

the maximum running over all {γ_s ≥ 0 : s ∈ Z_+^T, s ≤ n} satisfying

Σ_{s≤n} γ_s s = y.   (10)
As noted by Shapley and Shubik [60], but for our types case, it can be verified that the function u is concave and one-homogeneous. This does not depend on the balancedness of the game [n; (T, Ψ)]. Indeed, one may think of u as "the balanced cover of [n; (T, Ψ)] extended to R_+^T". Note also that u is superadditive, independently of whether the pregame (T, Ψ) is superadditive. We leave it to the interested reader to verify that if Ψ were not necessarily superadditive and Ψ* is the superadditive cover of Ψ, then it holds that max Σ_{s≤n} γ_s Ψ(s) = max Σ_{s≤n} γ_s Ψ*(s). Taking the utility function u as the utility function of each player (t, q) ∈ N, where N is now interpreted as the set of participants in a market, we have generated a market, called the direct market and denoted by [n; u; (T, Ψ)], from the game [n; (T, Ψ)]. Again, the following extends a result of Shapley and Shubik [60] to pregames.

Theorem 2 Let [n; u; (T, Ψ)] denote the direct market generated by a game [n; (T, Ψ)] and let [n; (T, u)] denote the game derived from the direct market. Then, if [n; (T, Ψ)] is a totally balanced game, it holds that [n; (T, u)] and [n; (T, Ψ)] are identical.
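For the one-type case (T = 1), the linear program (9)-(10) has a simple closed form: all weight goes to the coalition size with the best per-capita worth, so u is linear, hence concave and one-homogeneous as noted above. A minimal sketch (illustrative names, assuming T = 1):

```python
def direct_market_utility(psi, n):
    # With one player type, (10) reads sum_k g_k * k = y, so the maximum
    # in (9) puts all weight on the best per-capita coalition size:
    # u(y) = y * max_{1 <= k <= n} psi(k) / k.
    best = max(psi(k) / k for k in range(1, n + 1))
    return lambda y: y * best

# a worth function with per-capita worth 1 - 1/k**2, increasing in k
psi = lambda k: k - 1 / k
u = direct_market_utility(psi, 10)
```

Here u(n) = Ψ(n): per-capita worth is maximized by the grand coalition, so this one-type game is totally balanced and, in line with Theorem 2, the direct market reproduces the game.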
Remark 2 If the game [n; (T, Ψ)] and every subgame [s; (T, Ψ)] has a nonempty core - that is, if the game is 'totally balanced' - then the game [n; (T, u)] generated by the direct market is the initially given game [n; (T, Ψ)]. If, however, the game [n; (T, Ψ)] is not totally balanced, then u(s) ≥ Ψ(s) for all profiles s ≤ n. But, whether or not [n; (T, Ψ)] is totally balanced, the game [n; (T, u)] is totally balanced and coincides with the totally balanced cover of [n; (T, Ψ)].

Remark 3 Another approach to the equivalence of markets and games is taken by Garratt and Qin [26], who define a class of direct lottery markets. While a player can participate in only one coalition, both ownership of coalitions and participation in coalitions are determined randomly. Each player is endowed with one unit of probability, his own participation. Players can trade their endowments at market prices. The core of the game is equivalent to the equilibrium of the direct market lottery.

Equivalence of Markets and Games with Many Players

The requirement of Shapley and Shubik [60] that utility functions be concave is restrictive. It rules out, for example, situations such as economies with indivisible commodities. It also rules out club economies; for a given club structure of the set of players - in the simplest case, a partition of the total player set into groups where collective activities only occur within these groups - it may be that utility functions are concave over the set of alternatives available within each club, but utility functions need not be concave over all possible club structures. This rules out many examples; we provide a simple one below. To obtain the result that with many players, games derived from pregames are market games, we need some further assumption on pregames.
If there are many substitutes for each player, then the simple condition that per capita payoffs are bounded - that is, given a pregame (T, Ψ), that there exists some constant K such that Ψ(s)/‖s‖ < K for all profiles s - suffices. If, however, there may be 'scarce types', that is, players of some type(s) become negligible in the population, then a stronger assumption of 'small group effectiveness' is required. We discuss these two conditions in the next section.

Small Group Effectiveness and Per Capita Boundedness

This section discusses conditions limiting gains to group size and their relationships. Per capita boundedness was introduced in Wooders [83] for NTU, as well as TU, games.
PCB  A pregame (T, Ψ) satisfies per capita boundedness (PCB) if

sup_{s∈Z_+^T} Ψ(s)/‖s‖ is finite,   (11)

or equivalently,

sup_{s∈Z_+^T} Ψ*(s)/‖s‖ is finite.

It is known that under the apparently mild conditions of PCB and essential superadditivity, games with many players of each of a finite number of player types and a fixed distribution of player types in general have nonempty approximate cores; see Wooders [81,83]. (Forms of these assumptions were subsequently also used in Shubik and Wooders [69,70], Kaneko and Wooders [35], and Wooders [89,91], among others.) Moreover, under the same conditions, approximate cores have the property that most players of the same type are treated approximately equally ([81,94]; see also Shubik and Wooders [69]). These results, however, require some assumption ruling out 'scarce types' of players, for example, situations where there are only a few players of some particular type and these players can have great effects on total feasible payoffs. Following are two examples. The first illustrates that PCB does not control limiting properties of the per capita payoff function when some player types are scarce.

Example 3 ([94]) Let T = 2 and let (T, Ψ) be the pregame given by

Ψ(s_1, s_2) = s_1 + s_2 when s_1 > 0, and Ψ(s_1, s_2) = 0 otherwise.

The function Ψ obviously satisfies PCB. But there is a problem in defining lim Ψ(s_1, s_2)/(s_1 + s_2) as s_1 + s_2 tends to infinity, since the limit depends on how it is approached. Consider the sequence (s_1^ν, s_2^ν) = (0, ν); then lim_ν Ψ(s_1^ν, s_2^ν)/(s_1^ν + s_2^ν) = 0. Now suppose in contrast that (s_1^ν, s_2^ν) = (1, ν); then lim_ν Ψ(s_1^ν, s_2^ν)/(s_1^ν + s_2^ν) = 1. This illustrates why, to obtain the result that games with many players are market games, either it must be required that there are no scarce types or some assumption limiting the effects of scarce types must be made. We return to this example in the next section.

The next example illustrates that, with only PCB, uniform approximate cores of games with many players derived from pregames may be empty.

Example 4 ([94]) Consider a pregame (T, Ψ) where T = {1, 2} and Ψ is the superadditive cover of the function ψ0 defined by

ψ0(s) = ‖s‖ if s_1 = 2, and ψ0(s) = 0 otherwise.

Thus, if a profile s = (s_1, s_2) has s_1 = 2, then the worth of the profile according to ψ0 is equal to the total number of players it represents, s_1 + s_2, while all other profiles s have worth zero. In the superadditive cover game the worth of a profile s is 0 if s_1 < 2 and otherwise is equal to s_2 plus the largest even number less than or equal to s_1.

Now consider the sequence of profiles (s^ν) where s_1^ν = 3 and s_2^ν = ν for all ν. Given ε > 0, for all sufficiently large player sets the uniform ε-core is empty. Take, for example, ε = 1/4. If the uniform ε-core were nonempty, it would have to contain an equal-treatment payoff vector.¹ For the purpose of demonstrating a contradiction, suppose that u = (u_1, u_2) represents an equal-treatment payoff vector in the uniform ε-core of [s^ν; (T, Ψ)]. The following inequalities must hold:

3u_1 + νu_2 ≤ ν + 2,   2u_1 + νu_2 ≥ ν + 2,   and   u_1 ≥ 3/4,

which is impossible. A payoff vector which assigns each player zero is, however, in the weak ε-core for any ε > 1/(ν + 3). But it is not very appealing, in situations such as this, to ignore a relatively small group of players (in this case, the players of type 1) who can have a large effect on per capita payoffs. This leads us to the next concept.

To treat the scarce types problem, Wooders [88,89,90] introduced the condition of small group effectiveness (SGE). SGE is appealing technically since it resolves the scarce types problem. It is also economically intuitive and appealing; the condition defines a class of economies that, when there are many players, generate competitive markets. Informally, SGE dictates that almost all gains to collective activities can be realized by relatively small groups of players. Thus, SGE is exactly the sort of assumption required to ensure that multiple, relatively small coalitions, firms, jurisdictions, or clubs, for example, are optimal or near-optimal in large economies.

¹ It is well known and easily demonstrated that the uniform ε-core of a TU game is nonempty if and only if it contains an equal-treatment payoff vector. This follows from the fact that the uniform ε-core is a convex set.
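Returning to Example 3, the path dependence of the per-capita limit is easy to check numerically. An illustrative sketch (names are hypothetical):

```python
def psi(s1, s2):
    # Example 3's worth function: worth requires a type-1 player present
    return s1 + s2 if s1 > 0 else 0

# per-capita worth along two sequences of growing profiles
no_type1 = [psi(0, nu) / nu for nu in (10, 100, 1000)]         # profiles (0, nu)
one_type1 = [psi(1, nu) / (1 + nu) for nu in (10, 100, 1000)]  # profiles (1, nu)
```

The first list is identically 0.0 and the second identically 1.0, so the per-capita limit depends on the path along which the population grows: precisely the scarce-types problem.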
A pregame (T, Ψ) satisfies small group effectiveness (SGE) if:

SGE: for each real number ε > 0, there is an integer η0(ε) such that for each profile s, for some partition {s^k} of s with ‖s^k‖ ≤ η0(ε) for each subprofile s^k, it holds that

Ψ(s) − Σ_k Ψ(s^k) ≤ ε‖s‖;   (12)

that is, given ε > 0 there is a group size η0(ε) such that the loss from restricting collective activities within groups to groups containing fewer than η0(ε) members is at most ε per capita [88].²

SGE also has the desirable feature that if there are no 'scarce types' (types of players that appear in vanishingly small proportions), then SGE and PCB are equivalent.

Theorem 3 ([91] With 'thickness', SGE = PCB)
(1) Let (T, Ψ) be a pregame satisfying SGE. Then the pregame satisfies PCB.
(2) Let (T, Ψ) be a pregame satisfying PCB. Given any positive real number ρ, construct a new pregame (T, Ψ_ρ) where the domain of Ψ_ρ is restricted to those profiles s for which, for each t = 1, ..., T, either s_t/‖s‖ ≥ ρ or s_t = 0. Then (T, Ψ_ρ) satisfies SGE on its domain.

It can also be shown that small groups are effective for the attainment of nearly all feasible outcomes, as in the above definition, if and only if small groups are effective for improvement: any payoff vector that can be significantly improved upon can be improved upon by a small group (see Proposition 3.8 in [89]).

Remark 4 Under a stronger condition of strict small group effectiveness, which dictates that η0(ε) in the definition of small group effectiveness can be chosen independently of ε, stronger results can be obtained than those presented in this section and the next. We refer to Winter and Wooders [80] for a treatment of this case.

Remark 5 (On the importance of taking into account scarce types) Recall the quotation from von Neumann and Morgenstern and the discussion following the quotation. The assumption of per capita boundedness has significant consequences but is quite innocuous: ruling out the possibility of average utilities becoming infinite as economies grow large does not seem restrictive. But with only per capita boundedness, even the formation of small coalitions can have significant impacts on aggregate outcomes.

² Exactly the same definition applies to situations with a compact metric space of player types; cf. Wooders [84,88].
With small group effectiveness, however, there is no problem of either large or small coalitions acting together; large coalitions cannot do significantly better than relatively small coalitions.

Roughly, the property of large games we next introduce is that relatively small groups of players make only 'asymptotically negligible' contributions to per-capita payoffs of large groups. A pregame (T, Ψ) satisfies asymptotic negligibility if, for any sequence of profiles {f^ν} where ‖f^ν‖ → ∞ as ν → ∞,

f^ν/‖f^ν‖ = f^ν'/‖f^ν'‖ for all ν and ν', and lim_{ν→∞} Ψ*(f^ν)/‖f^ν‖ exists,   (13)

then for any sequence of profiles {ℓ^ν} with

lim_{ν→∞} ‖ℓ^ν‖/‖f^ν‖ = 0,   (14)

it holds that

lim_{ν→∞} Ψ*(f^ν + ℓ^ν)/‖f^ν + ℓ^ν‖ exists, and equals lim_{ν→∞} Ψ*(f^ν)/‖f^ν‖.   (15)
Theorem 4 ([89,95]) A pregame (T, Ψ) satisfies SGE if and only if it satisfies PCB and asymptotic negligibility.

Intuitively, asymptotic negligibility ensures that vanishingly small percentages of players have vanishingly small effects on aggregate per-capita worths. It may seem paradoxical that SGE, which highlights the importance of relatively small groups, is equivalent to PCB plus asymptotic negligibility. To gain some intuition, however, think of a marriage model where only two-person marriages are allowed. Obviously two-person groups are (strictly) effective, but also, in large player sets, no two persons can have a substantial effect on aggregate per-capita payoffs.

Remark 6 Without some assumptions ensuring essential superadditivity, at least as incorporated into our definition of feasibility, nonemptiness of approximate cores of large games cannot be expected; superadditivity assumptions (or their close relative, essential superadditivity) are heavily relied upon in all papers on large games cited. In the context of economies, superadditivity is a sort of monotonicity of preferences or production functions assumption; that is, superadditivity of Ψ implies that for all s, s' ∈ Z_+^T, it holds that Ψ(s + s') ≥ Ψ(s) + Ψ(s'). Our assumption of small group effectiveness, SGE, admits nonmonotonicities. For example, suppose that 'two is company, three or more is a crowd', by supposing there is only
one commodity and by setting Ψ(2) = 2 and Ψ(n) = 0 for n ≠ 2. The reader can verify, however, that this example satisfies small group effectiveness, since Ψ*(n) = n if n is even and Ψ*(n) = n − 1 otherwise. Within the context of pregames, requiring the superadditive cover payoff to be approximately realizable by partitions of the total player set into relatively small groups is the weakest form of superadditivity required for the equivalence of games with many players and concave markets.
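The crowding example can be verified mechanically. This illustrative one-type computation builds the superadditive cover by splitting off a first group:

```python
from functools import lru_cache

def psi(n):
    # 'two is company, three or more is a crowd': only pairs produce worth
    return 2 if n == 2 else 0

@lru_cache(maxsize=None)
def cover(n):
    """Superadditive cover for one type: best total worth over all
    partitions of a group of n players."""
    if n == 0:
        return 0
    return max(psi(n), max(psi(k) + cover(n - k) for k in range(1, n + 1)))
```

Indeed cover(n) equals n for even n and n − 1 for odd n, and partitions into groups of size at most two attain it, so SGE holds here with η0(ε) = 2 for every ε.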
Derivation of Markets from Pregames Satisfying SGE

With SGE and PCB in hand, we can now derive a premarket from a pregame and relate these concepts. To construct a limiting direct premarket from a pregame, we first define an appropriate utility function. Let (T, Ψ) be a pregame satisfying SGE. For each vector x in R_+^T define

U(x) = ‖x‖ lim_{ν→∞} Ψ*(f^ν)/‖f^ν‖,   (16)

where the sequence {f^ν} satisfies

lim_{ν→∞} f^ν/‖f^ν‖ = x/‖x‖ and ‖f^ν‖ → ∞.   (17)

Theorem 5 ([84,91]) Assume the pregame (T, Ψ) satisfies small group effectiveness. Then for any x ∈ R_+^T the limit (16) exists. Moreover, U(·) is well-defined, concave and 1-homogeneous, and the convergence is uniform in the sense that, given ε > 0, there is an integer η such that for all profiles s with ‖s‖ ≥ η it holds that

|U(s/‖s‖) − Ψ*(s)/‖s‖| ≤ ε.

From Wooders [91] (Theorem 4), if arbitrarily small percentages of players of any type that appears in games generated by the pregame are ruled out, then the above result holds under per capita boundedness [91] (Theorem 6). As noted in the introduction to this paper, for the TU case, the concavity of the limiting utility function for the model of Wooders [83] was first noted by Aumann [5]. The concavity is shown to hold with a compact metric space of player types in Wooders [84] and is simplified to the finite types case in Wooders [91].

Theorem 5 follows from the facts that the function U is superadditive and 1-homogeneous on its domain. Since U is concave, it is continuous on the interior of its domain; this follows from PCB. Small group effectiveness ensures that the function U is continuous on its entire domain [91] (Lemma 2).

In interpretation, T denotes a number of types of players/commodities and U denotes a utility function on R_+^T. Observe that when U is restricted to profiles (in Z_+^T), the pair (T, U) is a pregame with the property that every game [n; (T, U)] has a nonempty core; thus, we will call (T, U) the premarket generated by the pregame (T, Ψ). That every game derived from (T, U) has a nonempty core is a consequence of the Shapley and Shubik [60] result that market games derived from markets with concave utility functions are totally balanced.

It is interesting to note that, as discussed in Wooders (Section 6 in [91]), if we restrict the number of commodities to equal the number of player types, then the utility function U is uniquely determined. (If one allowed more commodities then one would effectively have 'redundant assets'.) In contrast, for games and markets of fixed, finite size, as demonstrated in Shapley and Shubik [62], even if we restrict the number of commodities to equal the number of player types, given any nonempty, compact, convex subset of payoff vectors in the core, it is possible to construct utility functions so that this subset coincides with the set of competitive payoffs. Thus, in the Shapley and Shubik approach, equivalence of the core and the set of price-taking competitive outcomes for the direct market is only an artifact of the method used there of constructing utility functions from the data of a game, and is quite distinct from the equivalence of the core and the set of competitive payoff vectors as it is usually understood (that is, in the sense of Debreu and Scarf [22] and Aumann [4]). See also Kalai and Zemel [31,32], which characterize the core in multi-commodity flow games.

Theorem 6 ([91]) Let (T, Ψ) be a pregame satisfying small group effectiveness and let (T, U) denote the derived direct market pregame. Then (T, U) is a totally balanced market game. Moreover, U is one-homogeneous, that is, U(λx) = λU(x) for any non-negative real number λ.
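For T = 1, the construction in (16)-(17) can be approximated numerically. A minimal sketch (illustrative names; it assumes a superadditive one-type worth function, so Ψ* = Ψ, and takes f^ν = ν):

```python
def U_approx(psi, x, nu=10**6):
    # eq. (16) with one type: U(x) = x * lim psi(f^nu)/||f^nu||,
    # approximated here at a single large nu
    return x * psi(nu) / nu

# psi(k) = k - 1/k is superadditive with per-capita worth tending to 1,
# so the limiting utility is U(x) = x: concave and one-homogeneous,
# consistent with Theorem 5
```

For example, U_approx(lambda k: k - 1 / k, 3.0) is within 1e-6 of 3.0.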
Cores and Approximate Cores

The concept of the core clearly was important in the work of Shapley and Shubik [59,60,62] and is also important for the equivalence of games with many players and market games. Thus, we discuss the related results on nonemptiness of approximate cores and convergence of approximate cores to the core of the 'limit' - the game where all players have utility functions derived from a pregame and
large numbers of players. First, some terminology is required. A vector p is a subgradient at x of the concave function U if U(y) − U(x) ≤ p · (y − x) for all y. One might think of a subgradient as a bounding hyperplane. To avoid any confusion it might be helpful to note that, as Mas-Colell [46] remarks: "Strictly speaking, one should use the term subgradient for convex functions and supergradient for concave. But this is cumbersome" (pp. 29-30 in [46]).

For ease of notation, equal-treatment payoff vectors for a game [n; (T, Ψ)] will typically be represented as vectors in R^T. An equal-treatment payoff vector, or simply a payoff vector when the meaning is clear, is a point x in R^T. The tth component of x, x_t, is interpreted as the payoff to each player of type t. The feasibility of an equal-treatment payoff vector x ∈ R^T for the game [n; (T, Ψ)] can be expressed as:

Ψ(n) ≥ x · n.

Let [n; (T, Ψ)] be a game determined by a pregame (T, Ψ), let ε be a non-negative real number, and let x ∈ R^T be an (equal-treatment) payoff vector. Then x is in the equal-treatment ε-core of [n; (T, Ψ)], or simply "in the ε-core" when the meaning is clear, if x is feasible for [n; (T, Ψ)] and

Ψ(s) ≤ x · s + ε‖s‖ for all subprofiles s of n.

Thus, the equal-treatment ε-core is the set

C(n; ε) = {x ∈ R_+^T : Ψ(n) ≥ x · n and Ψ(s) ≤ x · s + ε‖s‖ for all subprofiles s of n}.   (18)

It is well known that the ε-core of a game with transferable utility is nonempty if and only if the equal-treatment ε-core is nonempty.

Continuing with the notation above, for any s ∈ R_+^T, let Π(s) denote the set of subgradients to the function U at the point s:

Π(s) = {π ∈ R^T : π · s = U(s) and π · s' ≥ U(s') for all s' ∈ R_+^T}.   (19)

The elements in Π(s) can be interpreted as equal-treatment core payoffs to a limiting game with the mass of players of type t given by s_t. The core payoff to a player is simply the value of the one unit of a commodity (himself and all his attributes, including endowments of resources) that he owns in the direct market generated by a game. Thus Π(·) is called the limiting core correspondence for the pregame (T, Ψ). Of course Π(·) is also the limiting core correspondence for the pregame (T, U).

Let Π̂(n) ⊂ R^T denote the equal-treatment core of the market game [n; (T, u)]:

Π̂(n) = {π ∈ R^T : π · n = u(n) and π · s ≥ u(s) for all s ∈ Z_+^T, s ≤ n}.   (20)

Given any player profile n and derived games [n; (T, Ψ)] and [n; (T, U)], it is interesting to observe the distinction between the equal-treatment core of the game [n; (T, U)], denoted by Π̂(n) and defined by (20), and the set Π(n) (that is, Π(x) with x = n). The definitions of Π(n) and Π̂(n) are the same except that the qualification "s ≤ n" in the definition of Π̂(n) does not appear in the definition of Π(n). Since Π(n) is the limiting core correspondence, it takes into account arbitrarily large coalitions. For this reason, for any x ∈ Π(n) and x̂ ∈ Π̂(n) it holds that x · n ≥ x̂ · n. A simple example may be informative.

Example 5 Let (T, Ψ) be a pregame where T = 1 and Ψ(n) = n − 1/n for each positive integer n, and let [n; (T, Ψ)] be a derived game. Then Π(n) = {1} while Π̂(n) = {1 − 1/n²}.

The following theorem extends a result due to Shapley and Shubik [62], stated here for games derived from pregames.

Theorem 7 ([62]) Let [n; (T, Ψ)] be a game derived from a pregame and let [n; u; (T, Ψ)] be the direct market generated by [n; (T, Ψ)]. Then the equal-treatment core Π̂(n) of the game [n; (T, u)] is nonempty and coincides with the set of competitive price vectors for the direct market [n; u; (T, Ψ)].

Remark 7 Let (T, Ψ) be a pregame satisfying PCB. In the development of the theory of large games as models of competitive economies, the following function on the space of profiles plays an important role:

lim_{r→∞} Ψ*(rf)/r;

see, for example, Wooders [81] and Shubik and Wooders [69]. For the purposes of comparison, we introduce another definition of a limiting utility function. For each vector x in R_+^T with rational components, let r(x) be the smallest integer such that r(x)x is a vector of integers. Therefore, for each rational vector x, we can define

Û(x) = lim_{ν→∞} Ψ*(ν r(x) x)/(ν r(x)).

Since Ψ* is superadditive and satisfies per capita boundedness, the above limit exists and Û(·) is well-defined.
Also, Û(x) has a continuous extension to any closed subset strictly in the interior of R_+^T. The function Û(x), however, may be discontinuous at the boundaries of R_+^T. For example, suppose that T = 2 and

Ψ(k, n) = k + n when k > 0, and Ψ(k, n) = 0 otherwise.

The function Ψ obviously satisfies PCB but does not satisfy SGE. To see the continuity problem, consider the sequences {x^ν} and {y^ν} of vectors in R_+^2 where x^ν = (1/ν, 1 − 1/ν) and y^ν = (0, 1). Then lim_{ν→∞} x^ν = (0, 1) and lim_{ν→∞} y^ν = (0, 1), but lim_{ν→∞} Û(x^ν) = 1 while lim_{ν→∞} Û(y^ν) = 0. SGE is precisely the condition required to avoid this sort of discontinuity, ensuring that the function U is continuous on the boundaries of R_+^T.

Before turning to the next section, let us provide some additional interpretation for Π̂(n). Suppose a game [n; (T, Ψ)] is one generated by an economy, as in Shapley and Shubik [59] or Owen [50], for example. Players of different types may have different endowments of private goods. An element π in Π̂(n) is an equal-treatment payoff vector in the core of the balanced cover game generated by [n; (T, Ψ)] and can be interpreted as listing prices for player types, where π_t is the price of a player of type t; this price is a price for the player himself, including his endowment of private goods.
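The price interpretation can be made concrete for the one-type pregame of Example 5. A minimal sketch (illustrative; it uses the closed form of the one-type direct market, where u is linear):

```python
n = 10
psi = lambda j: j - 1 / j          # Example 5's one-type worth function
best = max(psi(j) / j for j in range(1, n + 1))   # attained at j = n
pi_hat = best                      # candidate equal-treatment core price

# the price covers every coalition's worth and pays out exactly Psi(n)
assert all(pi_hat * s >= psi(s) - 1e-12 for s in range(1, n + 1))
assert abs(pi_hat * n - psi(n)) < 1e-12
```

Here pi_hat = 1 − 1/n² = 0.99, the element of Π̂(n) computed in Example 5, while the limiting price in Π(n) is 1; arbitrarily large coalitions push the per-capita value up to 1.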
Nonemptiness and Convergence of Approximate Cores of Large Games

The next proposition is an immediate consequence of the convergence of games to markets shown in Wooders [89,91] and can also be obtained as a consequence of Theorem 5 above.

Proposition 2 (Nonemptiness of approximate cores) Let (T, Ψ) be a pregame satisfying SGE. Let ε be a positive real number. Then there is an integer η1(ε) such that any game [n; (T, Ψ)] with ‖n‖ ≥ η1(ε) has a nonempty uniform ε-core. (Note that no assumption of superadditivity is required, but only because our definition of feasibility is equivalent to feasibility for superadditive covers.)

The following result was stated in Wooders [89]. For more recent results see Wooders [94].

Theorem 8 ([89] Uniform closeness of (equal-treatment) approximate cores to the core of the limit game) Let (T, Ψ) be a pregame satisfying SGE and let Π(·) be as defined above. Let δ > 0 and ρ > 0 be positive real numbers. Then there is a real number ε* with 0 < ε* and an integer η0(δ, ρ, ε*) with the following property: for each positive ε ∈ (0, ε*] and each game [f; (T, Ψ)] with ‖f‖ > η0(δ, ρ, ε*) and f_t/‖f‖ ≥ ρ for each t = 1, ..., T, if C(f; ε) is nonempty then both

dist[C(f; ε), Π(f)] < δ and dist[C(f; ε), Π̂(f)] < δ,

where 'dist' is the Hausdorff distance with respect to the sum norm on R^T.

Note that this result applies to games derived from diverse economies, including economies with indivisibilities, nonmonotonicities, local public goods, clubs, and so on. Theorem 8 motivates the question of whether approximate cores of games derived from pregames satisfying small group effectiveness treat most players of the same type nearly equally. The following result, from Wooders [81,89,93], answers this question.

Theorem 9 Let (T, Ψ) be a pregame satisfying SGE. Then given any real numbers γ > 0 and λ > 0 there is a positive real number ε* and an integer η such that for each ε ∈ [0, ε*] and for every profile n ∈ Z_+^T with ‖n‖ > η, if x ∈ R^N is in the uniform ε-core of the game [n; Ψ] with player set N = {(t, q) : t = 1, ..., T and, for each t, q = 1, ..., n_t}, then, for each t ∈ {1, ..., T} with

n_t/‖n‖ ≥ γ/2,

it holds that

|{q : |x_tq − z_t| > λ}| < λ n_t,

where, for each t = 1, ..., T,

z_t = (1/n_t) Σ_{q=1}^{n_t} x_tq,

the average payoff received by players of type t.

Shapley Values of Games with Many Players

Let (N, v) be a game. The Shapley value of a superadditive game is the payoff vector whose ith component is given by

SH(v, i) = (1/|N|) Σ_{J=0}^{|N|−1} [1 / C(|N|−1, J)] Σ_{S⊂N∖{i}, |S|=J} [v(S ∪ {i}) − v(S)],

where C(|N|−1, J) denotes the binomial coefficient.
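The displayed formula can be implemented directly for small player sets. An illustrative, exponential-time sketch:

```python
from itertools import combinations
from math import comb

def shapley_value(players, v):
    """SH(v, i): average over coalition sizes J of the mean marginal
    contribution v(S + {i}) - v(S) over coalitions S of size J not
    containing i, as in the displayed formula."""
    n = len(players)
    value = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for J in range(n):
            for S in combinations(others, J):
                total += (v(frozenset(S) | {i}) - v(frozenset(S))) / comb(n - 1, J)
        value[i] = total / n
    return value

# a symmetric three-player majority game: any two players earn 1 together
v = lambda S: 1.0 if len(S) >= 2 else 0.0
```

For this symmetric game, shapley_value([1, 2, 3], v) assigns each player 1/3, and the values sum to v({1, 2, 3}) = 1 (efficiency).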
To state the next theorem, we require one additional definition. Let (T, Ψ) be a pregame. The pregame satisfies
boundedness of marginal contributions (BMC) if there is a constant M such that

|Ψ(s + 1_t) − Ψ(s)| ≤ M

for all profiles s and for each t = 1, ..., T, where 1_t = (0, ..., 0, 1, 0, ..., 0) has a one in the tth place. Informally, this condition bounds marginal contributions while SGE bounds average contributions. That BMC implies SGE is shown in Wooders [89].

The following result restricts the main theorem of Wooders and Zame [96] to the case of a finite number of types of players.

Theorem 10 ([96]) Let (T, Ψ) be a superadditive pregame satisfying boundedness of marginal contributions. For each ε > 0 there is a number δ(ε) > 0 and an integer η(ε) with the following property: if [n; (T, Ψ)] is a game derived from the pregame for which n_t > η(ε) for each t, then the Shapley value of the game is in the (weak) ε-core.

Similar results hold within the context of private goods exchange economies (cf. Shapley [55], Shapley and Shubik [60], Champsaur [17], Mas-Colell [43], Cheng [18], and others). Some of these results are for economies without money, but all treat private goods exchange economies with divisible goods and concave, monotone utility functions. Moreover, they all treat either replicated sequences of economies or convergent sequences of economies. That games satisfying SGE are asymptotically equivalent to balanced market games clarifies the contribution of the above result. In the context of the prior results developed in this paper, the major shortcoming of the theorem is that it requires BMC. This author conjectures that the above result, or a close analogue, could be obtained with the milder condition of SGE, but this has not been demonstrated.

Economies with Clubs

By a club economy we mean an economy where participants in the economy form groups, called clubs, for the purposes of collective consumption and/or production with the group members. The groups may possibly overlap. A club structure of the participants in the economy is a covering of the set of players by clubs.
Provided that utility functions are quasi-linear, such an economy generates a game of the sort discussed in this essay. The worth of a group of players is the maximum total worth that the group can achieve by forming clubs. The most general model of clubs in the literature at this point is Allouch and Wooders [1]. Yet, if one were to assume that
utility functions were all quasi-linear and the set of possible types of participants were finite, the results of this paper would apply. In the simplest case, the utility of an individual depends on the club profile (the numbers of participants of each type) in his club. The total worth of a group of players is the maximum that it can achieve by splitting into clubs. The results presented in this section immediately apply. When there are many participants, club economies can be represented as markets, and the competitive payoff vectors for the market are approximated by equal-treatment payoff vectors in approximate cores. Approximate cores converge to equal-treatment and competitive equilibrium payoffs. A more general model making these points is treated in Shubik and Wooders [65]. For recent reviews of the literature, see Conley and Smith [19] and Kovalenkov and Wooders [38].3 Coalition production economies may also be viewed as club economies. We refer the reader to Böhm [12], Sondermann [73], Shubik and Wooders [70], and, for a more recent treatment and further references, Sun, Trockel and Yang [74]. Let us conclude this section with some historical notes. Club economies came to the attention of the economics profession with the publication of Buchanan [14]. The author pointed out that people care about the numbers of other people with whom they share facilities such as swimming pool clubs. Thus, there may be congestion, leading people to form multiple clubs. Interestingly, much of the recent literature on club economies with many participants and their competitive properties has roots in an older paper, Tiebout [77]. Tiebout conjectured that if public goods are ‘local’ – that is, subject to exclusion and possibly congestion – then large economies are ‘market-like’.
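The definition just used – the worth of a group is the best total it can achieve by splitting into clubs – can be computed by brute force for small groups. This is an illustrative sketch only; the two-type matching payoff `club_worth` below is a made-up example, not a model from the literature cited.

```python
from itertools import combinations

def partitions(players):
    # Enumerate all partitions of `players` into nonempty clubs.
    if not players:
        yield []
        return
    first, rest = players[0], players[1:]
    for r in range(len(rest) + 1):
        for others in combinations(rest, r):
            club = (first,) + others
            remaining = [p for p in rest if p not in others]
            for sub in partitions(remaining):
                yield [club] + sub

def group_worth(players, club_worth):
    # Worth of the group: maximum total club worth over all club structures.
    return max(sum(club_worth(c) for c in part) for part in partitions(players))

# Toy club payoff: a club is worth 1 exactly when it pairs one 'a'-type
# with one 'b'-type (types encoded in the first character of the name).
def club_worth(club):
    return 1 if sorted(name[0] for name in club) == ["a", "b"] else 0

print(group_worth(["a1", "a2", "b1"], club_worth))        # one pair can form
print(group_worth(["a1", "a2", "b1", "b2"], club_worth))  # two pairs can form
```

Because club worth here depends only on the club profile, the induced game is of exactly the type discussed above: superadditive and determined by the numbers of players of each type.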
A first paper treating club economies with many participants was Pauly [51], who showed that, when all players have the same preferred club size, the core of the economy is nonempty if and only if all participants in the economy can be partitioned into groups of the preferred size. Wooders [82] modeled a club economy as one with local public goods and demonstrated that, when individuals within a club (jurisdiction) are required to pay the same share of the costs of public good provision, outcomes in the core permit heterogeneous clubs if and only if all types of participants in the same club have the same demands for local public goods and for congestion. (3 Other approaches to economies with clubs/local public goods include Casella and Feinstein [15], Demange [23], Haimanko, Le Breton and Weber [28], and Konishi, Le Breton and Weber [37]. Recent research has treated clubs as networks.) Since
Market Games and Clubs
these early results, the literature on clubs has grown substantially. With a Continuum of Players Since Aumann [4] much work has been done on economies with a continuum of players. It is natural to ask whether the asymptotic equivalence of markets and games reported in this article holds in a continuum setting. Some such results have been obtained. First, let N = [0, 1] be the unit interval with Lebesgue measure and suppose there is a partition of N into a finite set of subsets N_1, …, N_T where, in interpretation, a point in N_t represents a player of type t. Let Ψ be given. Observe that Ψ determines a payoff for any finite group of players, depending on the numbers of players of each type. If we can aggregate partitions of the total player set into finite coalitions, then we have defined a game with a continuum of players and finite coalitions. For a partition of the continuum into finite groups to ‘make sense’ economically, it must preserve the relative scarcities given by the measure. This was done in Kaneko and Wooders [35]. To illustrate their idea of measurement-consistent partitions of the continuum into finite groups, think of a census form that requires each three-person household to label the players in the household #1, #2, or #3. When checking the consistency of its figures, the census taker would expect the number of people labeled #1 in three-person households to equal the numbers labeled #2 and #3. For consistency, the census taker may also check that the number of first persons in three-person households in a particular region equals the number of second persons and third persons in three-person households in that region. It is simple arithmetic. This consistency should also hold for k-person households for any k. Measurement consistency is the same idea with the word “number” replaced by “proportion” or “measure”.
One can immediately apply results reported above to the special case of TU games of Kaneko–Wooders [35] and conclude that games satisfying small group effectiveness and with a continuum of players have nonempty cores and that the payoff function for the game is one-homogeneous. (We note that there have been a number of papers investigating cores of games with a continuum of players that have come to the conclusion that non-emptiness of exact cores does not hold, even with balancedness assumptions; cf. Weber [78,79].) The results of Wooders [91] show that the continuum economy must be representable by one where all players have the same concave, continuous, one-homogeneous utility functions. Market games with a continuum of players and a finite set of types are also investigated in Azrieli and Lehrer [3], who confirm these conclusions. Other Related Concepts and Results In an unpublished 1972 paper due to Edward Zajac [97], which has motivated a large literature on ‘subsidy-free pricing’, cost sharing, and related concepts, the author writes: “A fundamental idea of equity in pricing is that ‘no consumer group should pay higher prices than it would pay by itself. . . ’. If a particular group is paying a higher price than it would pay if it were severed from the total consumer population, the group feels that it is subsidizing the total population and demands a price reduction”. The “dual” of the cost allocation problem is the problem of surplus sharing and subsidy-free pricing.4 Tauman [75] provides an excellent survey. Some recent works treating cost allocation and subsidy-free pricing include Moulin [47,48]. See also the recent notion of the “Walras core” in Qin, Shapley and Shimomura [52]. Another related area of research asks whether games with many players satisfy some notion of the Law of Demand of consumer theory (or the Law of Supply of producer theory). Since games with many players resemble market games, which have the property that an increase in the endowment of a commodity leads to a decrease in its price, such a result should be expected. Indeed, for games with many players a Law of Scarcity holds – if the number of players of a particular type is increased, then core payoffs to players of that type do not increase and may decrease. (This result was observed by Scotchmer and Wooders [54].) See Kovalenkov and Wooders [38,41] for the most recent version of such results and a discussion of the literature. Laws of scarcity in economies with clubs are examined in Cartwright, Conley and Wooders [16]. Some Remarks on Markets and More General Classes of Economies Forms of the equivalence of markets and games have been obtained for economies where individuals have concave utility functions that are not necessarily linear in money.
These include Billera [10], Billera and Bixby [11] and Mas-Colell [42]. A natural question is whether the results reported in this paper can be extended to nontransferable utility games and economies where individuals have utility functions that are not necessarily linear (4 See, for example, Moulin [47,48] for excellent discussions of these two problems.)
in money. So far the results obtained are not entirely satisfactory. Nonemptiness of approximate cores of games with many players, however, holds in substantial generality; see Kovalenkov and Wooders [40] and Wooders [95]. Conclusions and Future Directions The results of Shapley and Shubik [60], showing equivalence of structures rather than equivalence of outcomes of solution concepts in a fixed structure (as in [4], for example), are remarkable. So far, this line of research has been relatively little explored. The results for games with many players have also not been fully explored, except in the context of games, such as those derived from economies with clubs, with utility functions that are linear in money. Per capita boundedness seems to be about the mildest condition that one can impose on an economic structure and still have scarcity of per capita resources in economies with many participants. In economies with quasi-linear utilities (and here, I mean economies in a general sense, as in the glossary) satisfying per capita boundedness and where there are many substitutes for each type of participant, as the number of participants grows these economies resemble, or behave as if they were, market economies where individuals have continuous and monotonically increasing utility functions. Large groups cannot influence outcomes away from outcomes in the core (and outcomes of free competition), since large groups are not significantly more effective than many small groups (from the equivalence, when each player has many close substitutes, between per capita boundedness and small group effectiveness). But if there are not many substitutes for each participant then, as we have seen, per capita boundedness allows small groups of participants to have large effects, and free competition need not prevail (cores may be empty and price-taking equilibrium may not exist).
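The way relative scarcity pins down core payoffs (the Law of Scarcity discussed earlier) can be seen concretely in a toy glove market with side payments. The sketch below is an illustration with parameters assumed here, not a result from the papers cited; it searches a grid of equal-treatment payoffs (a per left-glove holder, b per right-glove holder) for the largest a consistent with the core.

```python
def glove_worth(nl, nr):
    # Worth of a coalition with nl left and nr right gloves: complete pairs.
    return min(nl, nr)

def max_core_payoff_left(L, R, grid=200):
    # Largest per-capita payoff to left-glove holders over equal-treatment
    # payoff vectors (a, b) in the core, found by grid search over a.
    total = glove_worth(L, R)
    best = None
    for i in range(grid + 1):
        a = i * total / grid
        b = (total - a * L) / R       # feasibility: a*L + b*R = total
        if b < -1e-9:
            continue
        in_core = all(a * l + b * r >= glove_worth(l, r) - 1e-9
                      for l in range(L + 1) for r in range(R + 1))
        if in_core:
            best = a
    return best

print(max_core_payoff_left(2, 3))  # left gloves scarce: each left holder gets 1
print(max_core_payoff_left(3, 2))  # left gloves abundant: each gets 0
```

Increasing the number of left-glove holders from two to three drives their core payoff from 1 down to 0: more players of a type cannot raise, and here strictly lowers, what that type receives in the core.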
The condition required to ensure free competition in economies with many participants, without assumptions of “thickness”, is precisely small group effectiveness. But the most complete results relating markets and games, outlined in this paper, deal with economies in which all participants have utility functions that are linear in money and in games with side payments, where the worth of a group can be divided in any way among the members of the group without any loss of total utility or worth. Nonemptiness of approximate cores of large games without side payments has been demonstrated; see Wooders [83,95] and Kovalenkov and Wooders [40]. Moreover, it has been shown that when side payments are ‘limited’
then approximate cores of games without side payments treat similar players similarly [39]. Results for specific economic structures, relating cores to price-taking equilibrium, can treat situations that are, in some respects, more general. A substantial body of literature shows that certain classes of club economies have nonempty cores and also investigates price-taking equilibrium in these situations. Fundamental results are provided by Gale and Shapley [25], Shapley and Shubik [61], Crawford and Kelso [21], and many more recent papers. We refer the reader to Roth and Sotomayor [53] and to Two-Sided Matching Models, by Ömer and Sotomayor, in this encyclopedia. A special feature of the models of these papers is that there are two sorts of players, or two sides to the market; examples are (1) men and women, (2) workers and firms, (3) interns and hospitals, and so on. Going beyond two-sided markets to clubs in general, however, one observes that the positive results on nonemptiness of cores and existence of price-taking equilibria hold only under restrictive conditions. A number of recent contributions, however, provide specific economic models for which, when there are many participants in the economy, then, as in exchange economies, price-taking equilibrium exists, cores are nonempty, and the set of outcomes of price-taking equilibrium is equivalent to the core (see, for example, [1,2,24,85,92]). Bibliography 1. Allouch N, Wooders M (2008) Price taking equilibrium in economies with multiple memberships in clubs and unbounded club sizes. J Econ Theor 140:246–278 2. Allouch N, Conley JP, Wooders M (2008) Anonymous price taking equilibrium in Tiebout economies with a continuum of agents: Existence and characterization. J Math Econ. doi:10.1016/j.jmateco.2008.06.003 3. Azrieli Y, Lehrer E (2007) Market games in large economies with a finite number of types. Econ Theor 31:327–342 4. Aumann RJ (1964) Markets with a continuum of traders.
Econometrica 32:39–50 5. Aumann RJ (1987) Game theory. In: Eatwell J, Milgate M, Newman P (eds) The New Palgrave: A Dictionary of Economics. Palgrave Macmillan, Basingstoke 6. Aumann RJ, Dreze J (1974) Cooperative games with coalition structures. Int J Game Theory 3:217–37 7. Aumann RJ, Shapley LS (1974) Values of Non-Atomic Games. Princeton University Press, Princeton 8. Bennett E, Wooders M (1979) Income distribution and firm formation. J Comp Econ 3:304–317. http://www.myrnawooders.com/ 9. Bergstrom T, Varian HR (1985) When do market games have transferable utility? J Econ Theor 35(2):222–233 10. Billera LJ (1974) On games without side payments arising from a general class of markets. J Math Econ 1(2):129–139
11. Billera LJ, Bixby RE (1974) Market representations of n-person games. Bull Am Math Soc 80(3):522–526 12. Böhm V (1974) The core of an economy with production. Rev Econ Stud 41:429–436 13. Bondareva O (1963) Some applications of linear programming to the theory of cooperative games. Problemy kibernetiki 10 (in Russian; see English translation in Selected Russian Papers in Game Theory 1959–1965, Princeton University Press, Princeton) 14. Buchanan J (1965) An economic theory of clubs. Economica 33:1–14 15. Casella A, Feinstein JS (2002) Public goods in trade: on the formation of markets and jurisdictions. Intern Econ Rev 43:437–462 16. Cartwright E, Conley J, Wooders M (2006) The Law of Demand in Tiebout Economies. In: Fischel WA (ed) The Tiebout Model at 50: Essays in Public Economics in honor of Wallace Oates. Lincoln Institute of Land Policy, Cambridge 17. Champsaur P (1975) Competition vs. cooperation. J Econ Theory 11:394–417 18. Cheng HC (1981) On dual regularity and value convergence theorems. J Math Econ 8:37–57 19. Conley J, Smith S (2005) Coalitions and clubs; Tiebout equilibrium in large economies. In: Demange G, Wooders M (eds) Group Formation in Economies; Networks, Clubs and Coalitions. Cambridge University Press, Cambridge 20. Conley JP, Wooders M (1995) Hedonic independence and taste-homogeneity of optimal jurisdictions in a Tiebout economy with crowding types. Ann D’Econ Stat 75/76:198–219 21. Crawford VP, Kelso AS (1982) Job matching, coalition formation, and gross substitutes. Econometrica 50:1483–1504 22. Debreu G, Scarf H (1963) A limit theorem on the core of an economy. Int Econ Rev 4:235–246 23. Demange G (1994) Intermediate preferences and stable coalition structures. J Math Econ 1994:45–48 24. Ellickson B, Grodal B, Scotchmer S, Zame W (1999) Clubs and the market. Econometrica 67:1185–1218 25. Gale D, Shapley LS (1962) College admissions and the stability of marriage. Am Math Mon 69:9–15 26.
Garratt R, Qin C-Z (1997) On a market for coalitions with indivisible agents and lotteries, J Econ Theor 77(1):81–101 27. Gillies DB (1953) Some theorems on n-person games. Ph.D Dissertation, Department of Mathematics. Princeton University, Princeton 28. Haimanko O, Le Breton M, Weber S (2004) Voluntary formation of communities for the provision of public projects. J Econ Theor 115:1–34 29. Hildenbrand W (1974) Core and Equilibria of a Large Economy. Princeton University Press, Princeton 30. Hurwicz L, Uzawa H (1977) Convexity of asymptotic average production possibility sets. In: Arrow KJ, Hurwicz L (eds) Studies in Resource Allocation Processes. Cambridge University Press, Cambridge 31. Kalai E, Zemel E (1982) Totally balanced games and games of flow. Math Oper Res 7:476–478 32. Kalai E, Zemel E (1982) Generalized network problems yielding totally balanced games. Oper Res 30:998–1008 33. Kaneko M, Wooders M (1982) Cores of partitioning games. Math Soc Sci 3:313–327 34. Kaneko M, Wooders M (2004) Utility theories in cooperative
games. In: Handbook of Utility Theory, vol 2, chap 19. Kluwer Academic Press, Dordrecht, pp 1065–1098 35. Kaneko M, Wooders M (1986) The core of a game with a continuum of players and finite coalitions; the model and some results. Math Soc Sci 12:105–137. http://www.myrnawooders.com/ 36. Kannai Y (1972) Continuity properties of the core of a market. Econometrica 38:791–815 37. Konishi H, Le Breton M, Weber S (1998) Equilibrium in a finite local public goods economy. J Econ Theory 79:224–244 38. Kovalenkov A, Wooders M (2005) A law of scarcity for games. Econ Theor 26:383–396 39. Kovalenkov A, Wooders M (2001) Epsilon cores of games with limited side payments: nonemptiness and equal treatment. Games Econ Behav 36(2):193–218 40. Kovalenkov A, Wooders M (2003) Approximate cores of games and economies with clubs. J Econ Theory 110:87–120 41. Kovalenkov A, Wooders M (2006) Comparative statics and laws of scarcity for games. In: Aliprantis CD, Matzkin RL, McFadden DL, Moore JC, Yannelis NC (eds) Rationality and Equilibrium: A Symposium in Honour of Marcel K. Richter. Studies in Economic Theory Series 26. Springer, Berlin, pp 141–169 42. Mas-Colell A (1975) A further result on the representation of games by markets. J Econ Theor 10(1):117–122 43. Mas-Colell A (1977) Indivisible commodities and general equilibrium theory. J Econ Theory 16(2):443–456 44. Mas-Colell A (1979) Competitive and value allocations of large exchange economies. J Econ Theor 14:307–310 45. Mas-Colell A (1980) Efficiency and decentralization in the pure theory of public goods. Q J Econ 94:625–641 46. Mas-Colell A (1985) The Theory of General Economic Equilibrium. Econometric Society Publication No. 9. Cambridge University Press, Cambridge 47. Moulin H (1988) Axioms of Cooperative Decision Making. Econometric Society Monograph No. 15. Cambridge University Press, Cambridge 48. Moulin H (1992) Axiomatic cost and surplus sharing. In: Arrow K, Sen AK, Suzumura K (eds) Handbook of Social Choice and Welfare, 1st edn, vol 1, chap 6. Elsevier, Amsterdam, pp 289–357 49. von Neumann J, Morgenstern O (1953) Theory of Games and Economic Behavior. Princeton University Press, Princeton 50. Owen G (1975) On the core of linear production games. Math Program 9:358–370 51. Pauly M (1970) Cores and clubs. Public Choice 9:53–65 52. Qin C-Z, Shapley LS, Shimomura K-I (2006) The Walras core of an economy and its limit theorem. J Math Econ 42(2):180–197 53. Roth A, Sotomayor M (1990) Two-sided Matching; A Study in Game-theoretic Modeling and Analysis. Cambridge University Press, Cambridge 54. Scotchmer S, Wooders M (1988) Monotonicity in games that exhaust gains to scale. IMSSS Technical Report No. 525, Stanford University 55. Shapley LS (1964) Values of large games – VII: A general exchange economy with money. Rand Memorandum RM-4248-PR 56. Shapley LS (1967) On balanced sets and cores. Nav Res Logist Q 9:45–48 57. Shapley LS (1952) Notes on the N-Person game III: Some variants of the von Neumann–Morgenstern definition of solution. Rand Corporation research memorandum RM-817
58. Shapley LS, Shubik M (1969) On the core of an economic system with externalities. Am Econ Rev 59:678–684 59. Shapley LS, Shubik M (1966) Quasi-cores in a monetary economy with nonconvex preferences. Econometrica 34:805–827 60. Shapley LS, Shubik M (1969) On market games. J Econ Theor 1:9–25 61. Shapley LS, Shubik M (1972) The Assignment Game I: The core. Int J Game Theor 1:11–30 62. Shapley LS, Shubik M (1975) Competitive outcomes in the cores of market games. Int J Game Theor 4:229–237 63. Shapley LS, Shubik M (1977) Trade using one commodity as a means of payment. J Political Econ 85:937–68 64. Shubik M (1959) Edgeworth market games. In: Luce RD, Tucker AW (eds) Contributions to the Theory of Games IV, Annals of Mathematical Studies 40. Princeton University Press, Princeton, pp 267–278 65. Shubik M, Wooders M (1982) Clubs, markets, and near-market games. In: Wooders M (ed) Topics in Game Theory and Mathematical Economics: Essays in Honor of Robert J Aumann. Fields Institute Communication Volume, American Mathematical Society, originally Near Markets and Market Games, Cowles Foundation Discussion Paper No. 657 66. Shubik M, Wooders M (1983) Approximate cores of replica games and economies: Part II Set-up costs and firm formation in coalition production economies. Math Soc Sci 6:285–306 67. Shubik M (1959) Edgeworth market games. In: Luce RD, Tucker AW (eds) Contributions to the Theory of Games IV, Annals of Mathematical Studies 40. Princeton University Press, Princeton, pp 267–278 68. Shubik M, Wooders M (1982) Near markets and market games. Cowles Foundation Discussion Paper No. 657, on-line at http://www.myrnawooders.com/ 69. Shubik M, Wooders M (1982) Clubs, markets, and near-market games. In: Wooders M (ed) Topics in Game Theory and Mathematical Economics: Essays in Honor of Robert J Aumann. Fields Institute Communication Volume, American Mathematical Society, originally Near Markets and Market Games, Cowles Foundation Discussion Paper No. 657 70.
Shubik M, Wooders M (1983) Approximate cores of replica games and economies: Part I Replica games, externalities, and approximate cores. Math Soc Sci 6:27–48 71. Shubik M, Wooders M (1983) Approximate cores of replica games and economies: Part II Set-up costs and firm formation in coalition production economies. Math Soc Sci 6:285–306 72. Shubik M, Wooders M (1986) Near-markets and market-games. Econ Stud Q 37:289–299 73. Sondermann D (1974) Economies of scale and equilibria in coalition production economies. J Econ Theor 8:259–291 74. Sun N, Trockel W, Yang Z (2008) Competitive outcomes and endogenous coalition formation in an n-person game. J Math Econ 44:853–860 75. Tauman Y (1987) The Aumann–Shapley prices: A survey. In: Roth A (ed) The Shapley Value: Essays in Honor of Lloyd S Shapley. Cambridge University Press, Cambridge 76. Tauman Y, Urbano A, Watanabe J (1997) A model of multiproduct price competition. J Econ Theor 77:377–401 77. Tiebout C (1956) A pure theory of local expenditures. J Political Econ 64:416–424 78. Weber S (1979) On ε-cores of balanced games. Int J Game Theor 8:241–250
79. Weber S (1981) Some results on the weak core of a non-side-payment game with infinitely many players. J Math Econ 8:101–111 80. Winter E, Wooders M (1990) On large games with bounded essential coalition sizes. University of Bonn Sonderforschungsbereich 303 Discussion Paper B-149, on-line at http://www.myrnawooders.com/; published in Int J Econ Theor (2008) 4:191–206 81. Wooders M (1977) Properties of quasi-cores and quasi-equilibria in coalition economies. SUNY-Stony Brook Department of Economics Working Paper No. 184, revised (1979) as A characterization of approximate equilibria and cores in a class of coalition economies. State University of New York Stony Brook Economics Department. http://www.myrnawooders.com/ 82. Wooders M (1978) Equilibria, the core, and jurisdiction structures in economies with a local public good. J Econ Theor 18:328–348 83. Wooders M (1983) The epsilon core of a large replica game. J Math Econ 11:277–300, on-line at http://www.myrnawooders.com/ 84. Wooders M (1988) Large games are market games 1. Large finite games. C.O.R.E. Discussion Paper No. 8842. http://www.myrnawooders.com/ 85. Wooders M (1989) A Tiebout Theorem. Math Soc Sci 18:33–55 86. Wooders M (1991) On large games and competitive markets 1: Theory. University of Bonn Sonderforschungsbereich 303 Discussion Paper No. B-195 (revised August 1992). http://www.myrnawooders.com/ 87. Wooders M (1991) The efficaciousness of small groups and the approximate core property in games without side payments. University of Bonn Sonderforschungsbereich 303 Discussion Paper No. B-179. http://www.myrnawooders.com/ 88. Wooders M (1992) Inessentiality of large groups and the approximate core property; An equivalence theorem. Econ Theor 2:129–147 89. Wooders M (1992) Large games and economies with effective small groups. University of Bonn Sonderforschungsbereich 303 Discussion Paper No. B-215 (revised). In: Game-Theoretic Methods in General Equilibrium Analysis,
Mertens J-F, Sorin S (eds), Kluwer, Dordrecht. http://www.myrnawooders.com/ 90. Wooders M (1993) The attribute core, core convergence, and small group effectiveness; The effects of property rights assignments on the attribute core. University of Toronto Working Paper No. 9304 91. Wooders M (1994) Equivalence of games and markets. Econometrica 62:1141–1160. http://www.myrnawooders.com/ 92. Wooders M (1997) Equivalence of Lindahl equilibria with participation prices and the core. Econ Theor 9:113–127 93. Wooders M (2007) Core convergence in market games and club economies. Rev Econ Design (to appear) 94. Wooders M (2008) Small group effectiveness, per capita boundedness and nonemptiness of approximate cores. J Math Econ 44:888–906 95. Wooders M (2008) Games with many players and abstract economies permitting differentiated commodities, clubs, and public goods (submitted) 96. Wooders M, Zame WR (1987) Large games; Fair and stable outcomes. J Econ Theor 42:59–93 97. Zajac E (1972) Some preliminary thoughts on subsidization. Presented at the Conference on Telecommunications Research, Washington
Mathematical Basis of Cellular Automata, Introduction to
Mathematical Basis of Cellular Automata, Introduction to ANDREW ADAMATZKY University of the West of England, Bristol, UK A cellular automaton is a discrete universe with discrete time, discrete space and discrete states. Cells of the universe are arranged into regular structures called lattices or arrays. Each cell takes one of a finite number of states and updates its state in discrete time, depending on the states of its neighbors; all cells update their states in parallel. Cellular automata are mathematical models of massively parallel computing; computational models of spatially extended non-linear physical, biological, chemical and social systems; and primary tools for studying large-scale complex systems. Cellular automata are ubiquitous; they are objects of theoretical study and also tools of applied modeling in science and engineering. Purely for ease of representation, one can roughly split articles in this section into three groups: cellular automata theory, cellular automata models of computation and cellular automata models of natural phenomena. Many topics, however, belong to several groups. We recommend that the reader begin with articles on the history and modern analysis of classifying cellular automata based on internal characteristics of their cell-state transition functions, the development of cellular automata configurations in space and time, and decidability questions for cellular automata (see Identification of Cellular Automata). Studies in dynamical behavior are essential to progress in cellular automata theory. They include topological dynamics, for example in relation to symbolic dynamics, surjectivity, and permutations (see Topological Dynamics of Cellular Automata); chaos, entropy and decidability of cellular automata behavior (see Chaotic Behavior of Cellular Automata); and insights into cellular automata as dynamical systems with invariant measures (see Ergodic Theory of Cellular Automata).
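The synchronous update just described can be made concrete with a minimal one-dimensional example on a ring of binary cells. Encoding the local rule as a Wolfram rule number (0–255) is a standard convention, used here purely for illustration:

```python
def eca_step(cells, rule):
    # One synchronous step of an elementary (binary, radius-1) cellular
    # automaton on a ring; `rule` is the Wolfram rule number.
    n = len(cells)
    table = [(rule >> i) & 1 for i in range(8)]
    return [table[(cells[(i - 1) % n] << 2) | (cells[i] << 1) | cells[(i + 1) % n]]
            for i in range(n)]

# Rule 90 maps each cell to the XOR of its two neighbors, so a single 1
# spreads into the familiar Sierpinski-like triangle.
row = [0, 0, 0, 1, 0, 0, 0]
for _ in range(3):
    print(row)
    row = eca_step(row, 90)
```

Every cell applies the same lookup table to its neighborhood, and all cells are rewritten at once: exactly the parallel, local, homogeneous update that defines a cellular automaton.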
Self-reproducing patterns and gliders are amongst the most remarkable features of cellular automata. Certain cellular automata can reproduce configurations of cell-states, for example, the von Neumann universal constructor, and thus can be used in designs of self-replicating hardware (see Self-Replication and Cellular Automata). Gliders are translating oscillators, or traveling patterns, of nonquiescent states, for example, gliders in Conway’s Game of Life. Gliders are particularly fascinating in two- and threedimensional spaces (see Gliders in Cellular Automata).
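A glider's defining property – returning to a translated copy of itself after a fixed number of steps – is easy to verify directly. The sketch below implements one step of Conway's Game of Life on a sparse set of live cells; the glider's coordinates and orientation are one standard choice:

```python
from collections import Counter

def life_step(live):
    # One step of Conway's Game of Life; `live` is a set of (x, y) cells.
    counts = Counter((x + dx, y + dy)
                     for (x, y) in live
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # Birth on exactly 3 live neighbors; survival on 2 or 3.
    return {c for c, n in counts.items() if n == 3 or (n == 2 and c in live)}

glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
g = glider
for _ in range(4):
    g = life_step(g)
print(g == {(x + 1, y + 1) for (x, y) in glider})  # True: period 4, shift (1, 1)
```

After four generations the five live cells reappear shifted by one cell diagonally, which is why gliders serve as signals in cellular-automaton constructions.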
Historically, an orthogonal lattice was the main substrate for cellular automata implementation. In the last decade this limitation was lifted, and nowadays one can find cellular automata on non-orthogonal lattices and tilings (see Cellular Automata in Triangular, Pentagonal and Hexagonal Tessellations), non-Euclidean geometries such as hyperbolic spaces (see Cellular Automata in Hyperbolic Spaces), and various topological spaces (see Dynamics of Cellular Automata in Non-compact Spaces). Typically, a cell neighborhood is fixed during cellular automaton development, and a cell updates its state depending on the current states of its neighbors. But even in this very basic setup, the space-time dynamics of cellular automata is incredibly complex, as can be observed from analysis of the simplest one-dimensional automata, where a transition rule applied to the sum of two states equals the sum of its actions on the two states separately (see Additive Cellular Automata). The automaton dynamics become much richer if we allow the topology of the cell neighborhood to be updated dynamically during automaton development (see Structurally Dynamic Cellular Automata) or allow a cell’s state to depend on the cell’s previous states (see Cellular Automata with Memory). Speaking of non-standard cell-transition rules, we must mention cellular automata with injective global functions, where every configuration has exactly one preceding configuration (see Reversible Cellular Automata), and cellular automata with quantum-bit cell-states and cell-transition functions inspired by principles of quantum mechanics (see Quantum Cellular Automata).
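The additivity just mentioned – the rule applied to the sum of two configurations equals the sum of the rule applied to each separately – can be checked directly for elementary rule 90, whose local rule is the XOR (sum mod 2) of the two outer neighbors. A small randomized check (the ring size and trial count are arbitrary choices):

```python
import random

def rule90_step(cells):
    # Rule 90 on a ring: each cell becomes the XOR of its two neighbors.
    n = len(cells)
    return [cells[(i - 1) % n] ^ cells[(i + 1) % n] for i in range(n)]

random.seed(0)
for _ in range(100):
    a = [random.randint(0, 1) for _ in range(16)]
    b = [random.randint(0, 1) for _ in range(16)]
    sum_then_step = rule90_step([x ^ y for x, y in zip(a, b)])
    step_then_sum = [x ^ y for x, y in zip(rule90_step(a), rule90_step(b))]
    assert sum_then_step == step_then_sum  # additivity over GF(2)
print("rule 90 is additive on all sampled configurations")
```

Additivity holds because XOR is linear over GF(2); this linearity is what makes the long-term behavior of additive automata amenable to exact algebraic analysis.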
The reader’s initial excursion into the theory of cellular automata themselves can conclude with reading about decision problems of cellular automata expressed in terms of filling the plane using tiles with colored edges (see Tiling Problem and Undecidability in Cellular Automata) and about algebraic properties of cellular automata transformations, such as group representation of the Garden of Eden theorem and matrix representation of cellular automata (see Cellular Automata and Groups). Firing squad synchronization is the oldest problem of cellular automaton computation: all cells of a one-dimensional cellular automaton are quiescent apart from one cell in the firing state; we wish to design minimal cell-state transition rules enabling all other cells to assume the firing state at the same time (see Firing Squad Synchronization Problem in Cellular Automata). Universality of cellular automata is another classical issue. Two kinds of universality are of most importance: computation universality, that is, an ability to compute any computable function or implement a functionally complete logical system, and intrinsic, or simulation, universality, that is, an ability to simulate any cellular automaton (see Cellular Automata, Universality of). Readers can familiarize themselves with the basics of space and time complexity of cellular-automata parallel computing in Cellular Automata as Models of Parallel Computation. This knowledge will then be extended by measures of complexity, parallels between cellular automata and dynamical systems, and Kolmogorov complexity of cellular automata (see Algorithmic Complexity and Cellular Automata), and by studies of cellular automata as acceptors of formal languages (see Cellular Automata and Language Theory). As demonstrated in Evolving Cellular Automata, cellular automata can be evolved to perform difficult computational tasks.
Cellular automata models of natural systems such as cell differentiation, road traffic, reaction-diffusion, and excitable media (see Cellular Automata Modeling of Physical Systems) are ideal candidates for studying all important phenomena of pattern growth (see Growth Phenomena in Cellular Automata); for studying the transformation of a system’s state from one phase to another (see Phase Transitions in Cellular Automata); and for studying the ability of a system to be attracted to states where the boundary between the system’s phases becomes indistinguishable (see Self-organized Criticality and Cellular Automata). Cellular automata models can be designed, in principle, by reconstructing cell-state transition rules from snapshots of the space-time dynamics of the system we wish to simulate (see Identification of Cellular Automata).
Mechanical Computing: The Computational Complexity of Physical Devices
JOHN H. REIF
Department of Computer Science, Duke University, Durham, USA
Article Outline
Glossary
Definition of the Subject
Introduction
The Computational Complexity of Motion Planning and Simulation of Mechanical Devices
Concrete Mechanical Computing Devices
Future Directions
Acknowledgments
Bibliography
Glossary
Mechanism A machine or part of a machine that performs a particular task.
Computation The use of a computer for calculation.
Computable Capable of being worked out by calculation, especially using a computer.
Simulation The term simulation will be used to denote both the modeling of a physical system by a computer and the modeling of the operation of a computer by a mechanical system; the difference will be clear from the context.
Definition of the Subject

Mechanical devices for computation appear to be largely displaced by the widespread use of microprocessor-based computers that are pervading almost all aspects of our lives. Nevertheless, mechanical devices for computation are of interest for at least three reasons:

(a) Historical: The use of mechanical devices for computation is of central importance in the historical study of technologies, with a history dating back thousands of years and with surprising applications even in relatively recent times.

(b) Technical & Practical: The use of mechanical devices for computation persists and has not yet been completely displaced by the widespread use of microprocessor-based computers. Mechanical computers have found applications in various emerging technologies at the micro-scale that combine mechanical functions with computational and control functions not feasible by purely electronic processing. Mechanical computers have also been demonstrated at the molecular scale, and may also provide unique capabilities at that scale. The physical designs for these modern micro- and molecular-scale mechanical computers may be based on the prior designs of the large-scale mechanical computers constructed in the past.

(c) Impact of Physical Assumptions on Complexity of Motion Planning, Design, and Simulation: The study of computation done by mechanical devices is also of central importance in providing lower bounds on the computational resources, such as time and/or space, required to simulate a mechanical system obeying given physical laws. In particular, the problem of simulating a mechanical system can be shown to be computationally hard if a hard computational problem can be simulated by the mechanical system. A similar approach can be used to provide lower bounds on the computational resources required to solve various motion planning tasks that arise in the field of robotics. Typically, a robotic motion planning task is specified by a geometric description of the robot (or collection of robots) to be moved, its initial and final positions, the obstacles it is to avoid, as well as a model for the type of feasible motion and the physical laws for the movement. Such a robotic motion-planning problem can be shown to be computationally hard if a hard computational problem can be simulated by the robotic motion-planning task.

Introduction

Abstract Computing Machine Models
To gauge the computational power of a family of mechanical computers, we will use a widely known abstract computational model known as the Turing machine, defined in this section.

The Turing Machine
The Turing machine model formulated by Alan Turing [1] was the first complete mathematical model of an abstract computing machine that possessed universal computing power. The machine model has (i) a finite state transition control for logical control of the machine processing, (ii) a tape with a sequence of storage cells containing symbolic values, and (iii) a tape scanner for reading and writing values to and from the tape cells, which could be made to move (left and right) along the tape cells.
A machine model is abstract if the description of the machine’s transition mechanism or memory mechanism does not specify the mechanical apparatus used to implement them in practice. Since Turing’s description did not include any specification of a mechanical mechanism for executing the finite state transitions, it cannot be viewed as a concrete mechanical computing machine, but is instead an abstract machine. Still, it is a valuable computational model, due to its simplicity and very widespread use in computational theory. A universal Turing machine simulates any other Turing machine; it takes as its input a pair consisting of a string providing a symbolic description of a Turing machine M and an input string x, and simulates M on input x. Because of its simplicity and elegance, the Turing machine has come to be the standard computing model used for most theoretical work in computer science. Informally, the Church–Turing hypothesis states that the Turing machine model can simulate a computation by any “reasonable” computational model (we will discuss some other reasonable computational models below).

Computational Problems
A computational problem is: given an input specified by a string over a finite alphabet, determine the Boolean answer: 1 if the answer is YES, and otherwise 0. For simplicity, we will generally restrict the input alphabet to be the binary alphabet {0,1}. The input size of a computational problem is the number of input symbols, which is the number of bits of the binary specification of the input. (Note: it is more common to make these definitions in terms of language acceptance. A language is a set of strings over a given finite alphabet of symbols. A computational problem can be identified with the language consisting of all strings over the input alphabet for which the answer is 1. For simplicity, we define each complexity class as the corresponding class of problems.)
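The three components (i)–(iii) of the Turing machine model can be illustrated with a minimal simulator. The particular machine below, its states, and its rules are illustrative inventions (a binary-increment machine), not part of Turing's original formulation.

```python
# Minimal deterministic Turing machine simulator: a finite-state control,
# a tape of symbol cells, and a scanner that reads/writes and moves.

def run_turing_machine(rules, tape, state="start", halt="halt", max_steps=10_000):
    """rules: (state, symbol) -> (new_state, new_symbol, move), move in {-1, +1}."""
    tape = dict(enumerate(tape))    # sparse tape; absent cells read as blank "_"
    head = 0
    for _ in range(max_steps):
        if state == halt:
            break
        symbol = tape.get(head, "_")
        state, tape[head], move = rules[(state, symbol)]
        head += move
    return "".join(tape[i] for i in sorted(tape)).strip("_")

# Illustrative machine: binary increment. Scan right to the blank past the
# number, then add 1 with carries propagating leftward.
INC = {
    ("start", "0"): ("start", "0", +1),
    ("start", "1"): ("start", "1", +1),
    ("start", "_"): ("carry", "_", -1),
    ("carry", "1"): ("carry", "0", -1),
    ("carry", "0"): ("halt",  "1", -1),
    ("carry", "_"): ("halt",  "1", -1),
}

print(run_turing_machine(INC, "1011"))   # 1011 + 1 = 1100
```

A universal Turing machine is, in this picture, simply a fixed `rules` table whose tape holds an encoding of another machine's table together with that machine's input.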
Recursively Computable Problems and Undecidable Problems
There is a large class of problems, known as recursively computable problems, that Turing machines can compute with finite computations, that is, always halting in finite time with the answer. There are certain problems that are not recursively computable; these are called undecidable problems. The Halting Problem is: given a Turing machine description and an input, output 1 if the Turing machine ever halts on that input, and otherwise output 0. Turing proved the Halting Problem is undecidable. His proof used a method known as diagonalization; it considered an enumeration of all Turing machines and inputs, and showed that a contradiction occurs when a universal Turing machine attempts to solve the Halting Problem for each Turing machine and each possible input.

Computational Complexity Classes
Computational complexity (see [2]) is the amount of computational resources required to solve a given computational problem. A complexity class is a family of problems, generally defined in terms of limitations on the resources of the computational model. The complexity classes of interest here will be associated with restrictions on the time (number of steps until the machine halts) and/or space (number of tape cells used in the computation) of Turing machines. There are a number of notable complexity classes:

P is the complexity class associated with efficient computations, and is formally defined to be the set of problems solved by Turing machine computations running in time polynomial in the input size (typically, the number of bits of the binary specification of the input).

NP is the complexity class associated with combinatorial optimization problems whose solutions, once found, can easily be checked for correctness. It is formally defined to be the set of problems solved by Turing machine computations using nondeterministic choice running in polynomial time.

PSPACE is the complexity class defined as the set of problems solved by Turing machines running in space polynomial in the input size.

EXPTIME is the complexity class defined as the set of problems solved by Turing machine computations running in time exponential in the input size.

NP and PSPACE are widely conjectured to contain problems that are not solvable in P, and it has been proved that EXPTIME contains problems that are not in P.

Polynomial Time Reductions
A polynomial time reduction from a problem Q′ to a problem Q is a polynomial time Turing machine computation that transforms any instance of the problem Q′ into an instance of the problem Q which has the answer YES if and only if the instance of Q′ has the answer YES. Informally, this implies that the problem Q can be used to efficiently solve the problem Q′. A problem Q is hard for a family F of problems if, for every problem Q′ in F, there is a polynomial time reduction from Q′ to Q. Informally, this implies that the problem Q can be used to efficiently solve any problem in F. A problem Q is complete for a family F of problems if Q is in F and is also hard for F.
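These definitions can be made concrete with a classic textbook reduction, from INDEPENDENT-SET to VERTEX-COVER: a graph on n vertices has an independent set of size k if and only if it has a vertex cover of size n − k (the complement of an independent set touches every edge). The tiny graph and brute-force checkers below are illustrative and serve only to validate the reduction on a small instance.

```python
# A polynomial time reduction: transform an INDEPENDENT-SET instance
# (n, edges, k) into an equivalent VERTEX-COVER instance (n, edges, n - k).
from itertools import combinations

def reduce_is_to_vc(n, edges, k):
    # The transformation itself is trivially computable in polynomial time.
    return n, edges, n - k

def has_vertex_cover(n, edges, size):
    # Exponential brute force -- used here only to check the reduction.
    return any(all(u in c or v in c for u, v in edges)
               for c in combinations(range(n), size))

def has_independent_set(n, edges, k):
    return any(all(not (u in s and v in s) for u, v in edges)
               for s in combinations(range(n), k))

edges = [(0, 1), (1, 2), (2, 3)]           # a path on 4 vertices
for k in range(5):
    assert has_independent_set(4, edges, k) == has_vertex_cover(*reduce_is_to_vc(4, edges, k))
print("reduction preserves YES/NO answers on this instance")
```

Because the transformation runs in polynomial time and preserves the YES/NO answer, any efficient algorithm for VERTEX-COVER would yield one for INDEPENDENT-SET, which is exactly the sense of "Q can be used to efficiently solve Q′" above.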
Hardness Proofs for Mechanical Problems
We will later consider various mechanical problems and characterize their computational power:

Undecidable mechanical problems. These were typically proven undecidable by a computable reduction from the halting problem for a universal Turing machine to an instance of the mechanical problem; this is equivalent to showing that the mechanical problem can be viewed as a computational machine that can simulate a universal Turing machine computation.

Mechanical problems that are hard for NP, PSPACE, or EXPTIME. Typically this was proved by a polynomial time reduction from the problems in the appropriate complexity class to an instance of the mechanical problem; again, this is equivalent to showing that the mechanical problem can be viewed as a computational machine that can simulate a Turing machine computation in the appropriate complexity class.

The simulation proofs in either case often provide insight into the intrinsic computational power of the mechanical problem or mechanical machine.

Other Abstract Computing Machine Models
There are a number of abstract computing models discussed in this chapter that are equivalent, or nearly equivalent, to conventional deterministic Turing machines.

Reversible Turing machines A computing device is (logically) reversible if each transition of its computation can be executed both in the forward direction and in the reverse direction, without loss of information. Landauer [3] showed that irreversible computations must generate heat in the computing process, and that reversible computations have the property that, if executed slowly enough, they can (in the limit) consume no energy in an adiabatic computation. A reversible Turing machine model allows the scan head to observe 3 consecutive tape symbols and to execute transitions both in the forward and in the reverse direction. Bennett [4] showed that any computing machine (e.g., an abstract machine such as a Turing machine) can be transformed to do only reversible computations, which implied that reversible computing devices are capable of universal computation. Bennett’s reversibility construction required extra space to store information to ensure reversibility, but this extra space can be reduced by increasing the time. Vitanyi [5] gave trade-offs between time and space in the resulting reversible machine. Lewis and Papadimitriou [95] showed that reversible Turing machines are equivalent in computational power to conventional Turing machines when the computations are bounded by polynomial time, and Crescenzi and Papadimitriou [6] proved a similar result when the computations are bounded by polynomial space. This implies that the definitions of the complexity classes P and PSPACE do not depend on the Turing machines being reversible or not. Reversible Turing machines are used in many of the computational complexity proofs to be mentioned involving simulations by mechanical computing machines.

Cellular automata These are sets of finite state machines that are typically connected together by a grid network. There are known efficient simulations of the Turing machine by cellular automata (e.g., see Wolfram [7] for some known universal simulations). A number of the particle-based mechanical machines to be described are known to simulate cellular automata.

Randomized Turing machines The machine can make random choices in its computation. While the use of randomized choice can be very useful in many efficient algorithms, there is evidence that randomization provides only limited additional computational power above conventional deterministic Turing machines. (In particular, there are a variety of pseudo-random number generation methods proposed for producing long pseudo-random sequences from short truly random seeds, which are widely conjectured to be indistinguishable from truly random sequences by polynomial time Turing machines.) A number of the mechanical machines to be described that use Brownian motion have natural sources of random numbers.

There are also a number of abstract computing machine models that appear to be more powerful than conventional deterministic Turing machines.

Real-valued Turing machines In these machines, introduced by Blum et al. [8], each storage cell or register can store any real value (which may be transcendental).
Operations are extended to allow infinite precision arithmetic operations on real numbers. To our knowledge, none of the analog computers that we will describe in this chapter have this power.

Quantum computers A quantum superposition is a linear superposition of basis states; it is defined by a vector of complex amplitudes whose squared absolute magnitudes sum to 1. In a quantum computer, the quantum superposition of basis states is transformed in each step by a unitary transformation (a linear mapping that is reversible and always preserves the sum of the squared absolute magnitudes of its inputs). The outputs of a quantum computation are read by observations that project the quantum superposition to classical values; a given state is observed with probability given by the squared magnitude of the amplitude of that state in the quantum superposition. Feynman [9] and Benioff [10] were the first to suggest the use of quantum mechanical principles for doing computation, and Deutsch [11] was the first to formulate an abstract model for quantum computing and show it was universal. Since then, there has been a large body of work in quantum computing (see Gruska [12] and Nielsen [13]) and quantum information theory (see Jaeger [14] and Reif [15]). Some of the particle-based methods for mechanical computing described below make use of quantum phenomena, but generally are not considered to have the full power of quantum computers.
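These definitions can be sketched for a single qubit; the choice of the Hadamard gate below is illustrative, picked because it is a familiar unitary that is its own inverse.

```python
# One-qubit sketch: a state is a vector of complex amplitudes whose squared
# magnitudes sum to 1; each step applies a unitary matrix; measurement
# probabilities are the squared magnitudes of the amplitudes.
import math

def apply(gate, state):
    return [sum(gate[r][c] * state[c] for c in range(len(state)))
            for r in range(len(gate))]

h = 1 / math.sqrt(2)
H = [[h, h], [h, -h]]            # Hadamard gate: unitary and self-inverse

state = [1 + 0j, 0 + 0j]         # basis state |0>
state = apply(H, state)          # equal superposition of |0> and |1>
probs = [abs(a) ** 2 for a in state]
print(probs)                     # each outcome observed with probability ~0.5

state = apply(H, state)          # unitarity means this is reversible: back to |0>
```

Applying the gate twice returns the original state exactly, illustrating the reversibility of unitary evolution noted above.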
The Computational Complexity of Motion Planning and Simulation of Mechanical Devices

Complexity of Motion Planning for Mechanical Devices with Articulated Joints
The first known computational complexity result involving mechanical motion or robotic motion planning was given in 1979 by Reif [16]. He considered a class of mechanical systems consisting of a finite set of connected polygons with articulated joints, which were required to be moved between two configurations in three-dimensional space while avoiding a finite set of fixed polygonal obstacles. To specify the movement problem (as well as the other movement problems described below, unless otherwise stated), the object to be moved, its initial and final positions, and the obstacles are all defined by linear inequalities with rational coefficients with a finite number of bits. He showed that this class of motion planning problems is hard for PSPACE. Since it is widely conjectured that PSPACE contains problems that are not solvable in polynomial time, this result provided the first evidence that these robotic motion planning problems are not solvable in time polynomial in n when the number of degrees of freedom grows with n. His proof involved simulating a reversible Turing machine with n tape cells by a mechanical device with n articulated polygonal arms that had to be maneuvered through a set of fixed polygonal obstacles resembling the channels in Swiss cheese. These obstacles were devised to force the mechanical device to simulate transitions of the reversible Turing machine to be simulated: the positions of the arms encoded the tape cell contents, and tape read/write operations were simulated by channels of the obstacles, which forced the arms to be reconfigured appropriately. This class of movement problems can be solved by reduction to the problem of finding a path in an O(n)-dimensional space avoiding a fixed set of polynomial obstacle surfaces, which can be solved by a PSPACE algorithm due to Canny [17]. Hence this class of movement problems is PSPACE-complete. (In the case where the object to be moved consists of only one rigid polygon, the problem is known as the piano mover’s problem and has a polynomial time solution by Schwartz and Sharir [18].)

Other PSPACE Completeness Results for Mechanical Devices
There were many subsequent PSPACE-completeness results for mechanical devices (two of which we mention below), which generally involved multiple degrees of freedom:

The Warehouseman’s Problem In 1984 Schwartz and Sharir [19] showed that moving a set of n disconnected polygons in two dimensions from an initial position to a final position among a finite set of fixed polygonal obstacles is PSPACE hard.

There are two classes of mechanical dynamic systems, the ballistic machines and the Brownian machines described below, that can be shown to provide simulations of polynomial-space Turing machine computations.

Ballistic Collision-Based Computing Machines and PSPACE
A ballistic computer (see Bennett [20,21]) is a conservative dynamical system that follows a mechanical trajectory isomorphic to the desired computation. It has the following properties: trajectories of distinct ballistic computers cannot be merged; all operations of a computation must be reversible; computations, when executed at constant velocity, require no consumption of energy; and computations must be executed without error, and need to be isolated from external noise and heat sources.

Collision-based computing [22] is computation by a set of particles, where each particle holds a finite state value, and state transformations are executed at the time of collisions between particles.
Since collisions between distinct pairs of particles can be simultaneous, the model allows for parallel computation. In some cases the particles can be configured to execute cellular automata computations [23]. Most proposed methods for collision-based computing are ballistic
computers as defined above. Examples of concrete physical systems for collision-based computing are:

The billiard ball computer Fredkin and Toffoli [24] considered a mechanical computing model, the billiard ball computer, consisting of spherical billiard balls with polygonal obstacles, where the billiard balls were assumed to have perfectly elastic collisions with no friction. They showed in 1982 that a billiard ball computer, with an unbounded number of billiard balls, could simulate a reversible computing machine model that used reversible Boolean logic gates known as Toffoli gates. When restricted to a finite set of n spherical billiard balls, their construction provides a simulation of a polynomial-space reversible Turing machine.

Particle-like waves in excitable media Certain classes of excitable media have discrete models that can exhibit particle-like waves which propagate through the media [25]. Using this phenomenon, Adamatzky [26] gave a simulation of a universal Turing machine. If restricted to n particle-waves, his simulation provides a simulation of a polynomial-space Turing machine.

Soliton computers A soliton is a wave packet that maintains a self-reinforcing shape as it travels at constant speed through a nonlinear dispersive medium. A soliton computer [27,28] makes use of optical solitons to hold state, and state transformations are made by colliding solitons.

Brownian Machines and PSPACE
In a mechanical system exhibiting fully Brownian motion, the parts move freely and independently, up to the constraints that link the parts together or the forces the parts exert on each other. In fully Brownian motion, the movement is entirely due to heat, and there is no other source of energy driving the movement of the system. An example of a mechanical system with fully Brownian motion is a set of particles exhibiting Brownian motion, as with electrostatic interaction.
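The reversible Toffoli gates simulated by the billiard ball computer can be sketched in Boolean form. The code below models only the gate's logic, not the billiard-ball mechanics, and the AND/NOT emulations shown are the standard textbook constructions.

```python
# Toffoli (controlled-controlled-NOT) gate: flips the target bit c exactly
# when both control bits a and b are 1. It is reversible -- a bijection on
# 3-bit states and its own inverse -- yet can emulate irreversible gates.
from itertools import product

def toffoli(a, b, c):
    return a, b, c ^ (a & b)

table = {bits: toffoli(*bits) for bits in product((0, 1), repeat=3)}
assert len(set(table.values())) == 8                              # bijection
assert all(toffoli(*toffoli(*bits)) == bits for bits in table)    # self-inverse

print(toffoli(1, 1, 0)[2])   # AND: fix target to 0, read it out -> a & b
print(toffoli(1, 1, 1)[2])   # NOT: fix both controls to 1 -> flips c
```

Because every output uniquely determines its input, a physical realization (such as colliding billiard balls) need not erase information, which is what permits, in principle, computation without dissipation.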
The rate of movement of a mechanical system with fully Brownian motion is determined entirely by the drift rate in the random walk of its configurations. In other mechanical systems, known as driven Brownian motion systems, the movement is only partly due to heat; in addition, there is a source of energy driving the movement of the system. Examples of driven Brownian motion systems are: Feynman’s ratchet and pawl [29], which is a mechanical ratchet system that has a driving force but can operate reversibly.
The polymerase enzyme, which uses ATP as fuel to drive its average movement forward, but can also operate reversibly. There is no energy consumed by fully Brownian motion devices, whereas driven Brownian motion devices require power that grows as a quadratic function of the drive rate at which operations are executed (see Bennett [21]). Bennett [20] provides two examples of Brownian computing machines:

An enzymatic machine This is a hypothetical biochemical device that simulates a Turing machine, using polymers to store symbolic values in a manner similar to Turing machine tapes, and using hypothetical enzymatic reactions to execute state transitions and read/write operations into the polymer memory. Shapiro [30] also describes a mechanical Turing machine whose transitions are executed by hypothetical enzymatic reactions.

A clockwork computer This is a mechanism with linked articulated joints, with a Swiss-cheese-like set of obstacles, which force the device to simulate a Turing machine. In the case where the mechanism of Bennett’s clockwork computer is restricted to have a linear number of parts, it can be used to provide a simulation of PSPACE similar to that of [16].

Hardness Results for Mechanical Devices with a Constant Number of Degrees of Freedom
There were also additional computational complexity hardness results for mechanical devices which involved only a constant number of degrees of freedom. These results exploited special properties of the mechanical systems to do the simulation.

Motion planning with moving obstacles Reif and Sharir [31] considered the problem of planning the motion of a rigid object (the robot) between two locations, while avoiding a set of obstacles, some of which are rotating. They showed this problem is PSPACE hard. This result was perhaps surprising, since the number of degrees of freedom of movement of the object to be moved was constant.
However, the simulation used the rotational movement of obstacles to force the robot to be moved only to positions that encoded all the tape cells of the simulated Turing machine M. The simulation of the Turing machine M was achieved by forcing the object between such locations (which encoded the entire n tape cell contents of M) at particular times, and by further forcing the object to move between these locations over time in a way that simulated state transitions of M.
NP Hardness Results for Path Problems in Two and Three Dimensions
Shortest path problems in fixed dimensions involve only a constant number of degrees of freedom. Nevertheless, there are a number of NP hardness results for such problems. These results also led to proofs that certain physical simulations (in particular, simulations of multi-body molecular and celestial systems) are NP hard, and therefore not likely to be efficiently computable with high precision.

Finding shortest paths in three dimensions Consider the problem of finding a shortest path of a point in three dimensions (where distance is measured in the Euclidean metric) avoiding fixed polyhedral obstacles whose coordinates are described by rational numbers with a finite number of bits. This shortest path problem can be solved in PSPACE [17], but the precise complexity of the problem is an open problem. Canny and Reif [32] were the first to provide a hardness complexity result for this problem; they showed the problem is NP hard. Their proof used novel techniques, called free path encoding, that used 2^n homotopy equivalence classes of shortest paths. Using these techniques, they constructed exponentially many shortest path classes (with distinct homotopy) in single-source multiple-destination problems involving O(n) polygonal obstacles. They used each of these paths to encode a possible configuration of a nondeterministic Turing machine with n binary storage cells. They also provided a technique for simulating each step of the Turing machine by the use of polygonal obstacles whose edges forced a permutation of these paths that encoded the modified configuration of the Turing machine. These encodings allowed them to prove that the single-source single-destination shortest path problem in three dimensions is NP-hard. Similar free path encoding techniques were used for a number of the other complexity hardness results for mechanical simulations described below.
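The counting at the heart of free path encoding can be sketched as follows. The labeling scheme here, one left/right choice per obstacle, is a deliberately simplified illustration of why exponentially many (2^n) homotopy classes arise, not the actual geometric construction of [32].

```python
# Free path encoding in miniature: with n obstacles between source and
# destination, a path class can be labeled by which side of each obstacle
# it passes, giving 2^n distinct homotopy classes -- enough to encode the
# configurations of n binary storage cells of a Turing machine.
from itertools import product

def path_classes(n_obstacles):
    """Enumerate all side-choice labels, one 'L'/'R' decision per obstacle."""
    return list(product("LR", repeat=n_obstacles))

classes = path_classes(4)
print(len(classes))      # 2**4 = 16 classes, e.g. ('L', 'R', 'L', 'L')
```

The exponential growth of this enumeration is exactly what lets a polynomial-size obstacle arrangement address exponentially many machine configurations.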
Kinodynamic planning Kinodynamic planning is the task of motion planning while subject to simultaneous kinematic and dynamic constraints. Algorithms for various classes of kinodynamic planning problems were first developed in [33]. Canny and Reif [32] also used free path encoding techniques to show that two-dimensional kinodynamic motion planning with bounded velocity is NP-hard.

Shortest curvature-constrained path planning in two dimensions We now consider curvature-constrained shortest path problems, which involve finding a shortest path for a point among polygonal obstacles, where there is an upper bound on the path curvature. A class of curvature-constrained shortest path problems in two dimensions was shown to be NP hard by Reif and Wang [34], who devised a set of obstacles that forced the shortest curvature-constrained path to simulate a given nondeterministic Turing machine.

PSPACE Hard Physical Simulation Problems
Ray tracing with rational placement and geometry Ray tracing is the problem of determining whether a light ray reaches some given final position, given an optical system and the position and direction of the initial light ray. This problem of determining the path of a light ray through an optical system was first formulated by Newton in his book Opticks. Ray tracing has been used for designing and analyzing optical systems. It is also used extensively in computer graphics to render scenes with complex curved objects under global illumination. Reif, Tygar, and Yoshida [35] considered the problem of ray tracing in various three-dimensional optical systems, where the optical devices consist either of reflective objects defined by quadratic equations or of refractive objects defined by linear equations, but in either case with coefficients restricted to be rational. They showed that these ray tracing problems are PSPACE hard. Their proof used free path encoding techniques for simulating a nondeterministic linear-space Turing machine, where the position of the ray as it enters a reflective or refractive optical object (such as a mirror or prism face) encodes the entire memory of the Turing machine to be simulated, and further steps of the Turing machine are simulated by optically inducing appropriate modifications in the position of the ray as it enters other reflective or refractive optical objects. This result implies that the apparently simple task of highly precise ray tracing through complex optical systems is not likely to be efficiently executed by a computer in polynomial time.
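The elementary geometric step underlying any such ray tracing, bouncing a ray off a reflective surface, follows the standard reflection formula d′ = d − 2(d·n)n for a unit surface normal n. The mirror orientation in the example is an arbitrary illustration.

```python
# Reflecting a ray's direction vector d off a surface with unit normal n:
# the component of d along n is reversed, the tangential component is kept.

def reflect(d, n):
    dot = sum(di * ni for di, ni in zip(d, n))
    return tuple(di - 2 * dot * ni for di, ni in zip(d, n))

# A ray heading down-right strikes a horizontal mirror (normal pointing up):
print(reflect((1.0, -1.0, 0.0), (0.0, 1.0, 0.0)))   # (1.0, 1.0, 0.0)
```

Iterating this step across a sequence of surfaces is exactly the computation whose long-horizon, high-precision version the hardness results concern: each bounce can be made to permute the exponentially many encoded ray positions.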
It is another example of the use of a physical system to do powerful computations.

Molecular and gravitational mechanical systems A quite surprising example of using physical systems to do computation is the work of Tate and Reif [36] on the complexity of n-body simulation, where they showed that certain n-body simulation problems are PSPACE hard, and therefore not likely to be efficiently computable with high precision. In particular, they considered multi-body systems in three dimensions with n particles and inverse polynomial force laws between each pair of particles (e.g., molecular systems with Coulombic force laws or celestial simulations with gravitational force laws). It is quite surprising that such systems can be configured to do computation. Their hardness proof made use of free path encoding techniques similar to those in the proof of PSPACE-hardness of ray tracing. A single particle, which we will call the memory-encoding particle, is distinguished. The position of the memory-encoding particle as it crosses a plane encodes the entire memory of the Turing machine to be simulated, and further steps of the Turing machine are simulated by inducing modifications in the trajectory of the memory-encoding particle. These modifications are made by the use of other particles whose trajectories induce force fields that essentially act like force-mirrors, causing reflection-like changes in the trajectory of the memory-encoding particle. Hence, highly precise n-body molecular simulation is not likely to be efficiently executed by a polynomial time computer.

A Provably Intractable Mechanical Simulation Problem: Compliant Motion Planning with Uncertainty in Control
Next, we consider compliant motion planning with uncertainty in control. Specifically, we consider a point in three dimensions which is commanded to move in a straight line, but whose actual motion may differ from the commanded motion, possibly involving sliding against obstacles. Given that the point initially lies in some start region, the problem is to find a sequence of commanded velocities that is guaranteed to move the point to the goal. This problem was shown by Canny and Reif [32] to be nondeterministic EXPTIME hard, making it the first provably intractable problem in robotics. Their proof used free path encoding techniques that exploited the uncertainty of position to encode an exponential number of memory bits in a Turing machine simulation.
Undecidable Mechanical Simulation Problems
Motion planning with friction Consider a class of mechanical systems whose parts consist of a finite number of rigid objects defined by linear or quadratic surface patches connected by frictional contact linkages between the surfaces. (Note: this class of mechanisms is similar to the Analytical Engine developed by Babbage, described in the next sections, except that there are smooth frictional surfaces rather than toothed gears.) Reif and Sun [37] proved that an arbitrary Turing machine could be simulated by a (universal) frictional mechanical system in this class consisting of a finite number of parts. The entire memory of a universal Turing machine was encoded in the rotational position of a rod. In each step, the mechanism used a construct similar to Babbage’s machine to execute a state transition. The key idea in their construction is to utilize frictional clamping to allow for setting arbitrarily high gear transmission ratios. This allowed the mechanism to make state transitions for an arbitrary number of steps. Simulation of a universal Turing machine implied that the movement problem is undecidable when there are frictional linkages. (A problem is undecidable if there is no Turing machine that solves the problem for all inputs in finite time.) It also implied that a mechanical computer could be constructed with only a constant number of parts that has the power of an unconstrained Turing machine.

Ray tracing with non-rational positioning Consider again the problem of ray tracing in three-dimensional optical systems, where the optical devices again may consist either of reflective objects defined by quadratic equations or of refractive objects defined by linear equations. Reif et al. [35] also proved that in the case where the coefficients of the defining equations are not restricted to be rational and include at least one irrational coefficient, the resulting ray tracing problem can simulate a universal Turing machine, and so is undecidable. This ray tracing problem for reflective objects is equivalent to the problem of tracing the trajectory of a single particle bouncing between quadratic surfaces, which is also undecidable by this same result of [35]. An independent result of Moore [38] also showed that the problem of tracing the trajectory of a single particle bouncing between quadratic surfaces is undecidable.

Dynamics and nonlinear mappings Moore [39], Ditto [40], and Munakata et al. [41] have also given universal Turing machine simulations of various dynamical systems with nonlinear mappings.
Concrete Mechanical Computing Devices Mechanical computers have a very extensive history; some surveys are given in Knott [42], Hartree [43], Engineering Research Associates [44], Chase [45], Martin [46], and Davis [47]. Norman [48] recently provided a unique overview of mechanical calculators and other historical computers, summarizing the contributions of notable manuscripts and publications on this topic. Mechanical Devices for Storage and Sums of Numbers Mechanical methods such as notches on stones and bones, or knots and piles of pebbles, have been used since the
Mechanical Computing: The Computational Complexity of Physical Devices
Neolithic period for storing and summing integer values. One example of such a device, the abacus, which may have been developed in Babylonia approximately 5000 years ago, makes use of beads sliding on cylindrical rods to facilitate addition and subtraction calculations. Analog Mechanical Computing Devices Computing devices are considered here to be analog (as opposed to digital) if they do not provide a method for restoring calculated values to discrete values, whereas digital devices do provide such restoration. (Note that both analog and digital computers use some kind of physical quantity to represent values that are stored and computed, so the use of physical encoding of computational values is not necessarily the distinguishing characteristic of analog computing.) Descriptions of early analog computers are given by Horsburgh [49], Turck [50], Svoboda [51], Hartree [43], Engineering Research Associates [44], and Soroka [52]. There are a wide variety of mechanical devices used for analog computing: Mechanical devices for astronomical and celestial calculation While we do not have sufficient space in this article to fully discuss this rich history, we note that various mechanisms for predicting lunar and solar eclipses using optical illumination of configurations of stones and monoliths (for example, Stonehenge) appear to date to the Neolithic period. Mechanical mechanisms for more precisely predicting lunar and solar eclipses may have been developed in the classical period of ancient history. The most impressive and sophisticated known example of an ancient gear-based mechanical device is the Antikythera Mechanism, which is thought to have been constructed by the Greeks approximately 2200 years ago. Recent research [53] provides evidence that it may have been used to predict celestial events such as lunar and solar eclipses by the analog calculation of arithmetic-progression cycles.
Like many other intellectual heritages, some elements of the design of such sophisticated gear-based mechanical devices may have been preserved by the Arabs after that period, and then transmitted to the Europeans in the Middle Ages. Planimeters There is a considerable history of mechanical devices that integrate curves. A planimeter is a mechanical device that integrates the area of the region enclosed by a two-dimensional closed curve, where the curve is presented as a function of the angle from some fixed interior point within the region. One of the first known planimeters was developed by J.A. Hermann in 1814 and improved (as the polar planimeter) by Hermann in 1856. This led to a wide variety of mechanical integrators known as wheel-and-disk integrators, whose input is the angular rotation of a rotating disk and whose output, provided by a tracking wheel, is the integral of a given function of that angle of rotation. More general mechanical integrators known as ball-and-disk integrators, whose input provides two degrees of freedom (the phase and amplitude of a complex function), were developed by James Thomson in 1886. There are also devices, such as the Integraph of Abdank Abakanoviez (1878) and C.V. Boys (1882), which integrate a one-variable real function of x presented as a curve y = f(x) on the Cartesian plane. Mechanical integrators were later widely used in WWI and WWII military analog computers for the solution of ballistics equations, artillery calculations and target tracking. Various other integrators are described in Morin [54]. Harmonic Analyzers A Harmonic Analyzer is a mechanical device which calculates the coefficients of the Fourier Transform of a complex function of time, such as a sound wave. Early harmonic analyzers were developed by Thomson [55] and Henrici [56] using multiple pulleys and spheres, known as ball-and-disk integrators. Harmonic Synthesizers A Harmonic Synthesizer is a mechanical device that interpolates a function, given the Fourier coefficients. Thomson (then known as Lord Kelvin) [57] developed the first known Harmonic Analyzer in 1886, which used an array of James Thomson's (his brother) ball-and-disk integrators. Kelvin's Harmonic Synthesizer made use of these Fourier coefficients to reverse this process and interpolate function values, by using a wire wrapped over the wheels of the array to form a weighted sum of their angular rotations.
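The analyzer/synthesizer pair described above can be sketched numerically. The following Python sketch illustrates the underlying mathematics rather than Kelvin's mechanism (the two-harmonic test signal is invented for the example): Fourier coefficients are computed by integration, as the ball-and-disk integrators do, and the function is then reconstructed as a weighted sum of harmonics, as the synthesizer's wire-and-wheel arrangement does.

```python
import math

def analyze(f, n_harmonics, period=2*math.pi, steps=1000):
    """Compute Fourier coefficients (a_k, b_k) by numerical
    integration of f against cosines and sines over one period."""
    dt = period / steps
    coeffs = []
    for k in range(1, n_harmonics + 1):
        a = sum(f(i*dt) * math.cos(2*math.pi*k*i*dt/period) for i in range(steps)) * 2*dt/period
        b = sum(f(i*dt) * math.sin(2*math.pi*k*i*dt/period) for i in range(steps)) * 2*dt/period
        coeffs.append((a, b))
    return coeffs

def synthesize(coeffs, t, period=2*math.pi):
    """Form the weighted sum of harmonics from the coefficients."""
    return sum(a*math.cos(2*math.pi*k*t/period) + b*math.sin(2*math.pi*k*t/period)
               for k, (a, b) in enumerate(coeffs, start=1))

# A "tide" built from two harmonics is analyzed and then re-predicted:
tide = lambda t: 1.5*math.sin(t) + 0.4*math.cos(3*t)
c = analyze(tide, 5)
```

Here `analyze` recovers the coefficients 1.5 and 0.4 of the test signal, and `synthesize(c, t)` reproduces `tide(t)` to within the accuracy of the numerical integration.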
Kelvin demonstrated the use of these analog devices to predict the tide heights of a port: first his Harmonic Analyzer calculated the amplitude and phase of the Fourier harmonics of solar and lunar tidal movements, and then his Harmonic Synthesizer formed their weighted sum to predict the tide heights over time. Many other Harmonic Analyzers were later developed, including one by Michelson and Stratton (1898) which performed Fourier analysis using an array of springs. Miller [58] gives a survey of these early Harmonic Analyzers. Fisher [59] made improvements to the tide predictor, and later Doodson and Légé increased the scale of this design to a 42-wheel version that was used up to the early 1960s. Analog equation solvers There are various mechanical devices for calculating the solution of sets of equations. Kelvin also developed one of the first known mechanical mechanisms for equation solving, involving the motion of pulleys and tilting plates that solved sets of simultaneous linear equations specified by the physical parameters of the ropes and plates. In the 1930s, John Wilbur increased the scale of Kelvin's design to solve nine simultaneous linear algebraic equations. Leonardo Torres Quevedo constructed various rotational mechanical devices for determining the real and complex roots of a polynomial. Svoboda [51] describes the state of the art in the 1940s of mechanical analog computing devices using linkages. Differential Analyzers A Differential Analyzer is a mechanical analog device using linkages for solving ordinary differential equations. Vannevar Bush [60] developed the first Differential Analyzer at MIT in 1931, which used a torque amplifier to link multiple mechanical integrators. Although it was considered a general-purpose mechanical analog computer, this device required a physical reconfiguration of the mechanical connections to specify a given problem to be solved. In subsequent Differential Analyzers, the reconfiguration of the mechanical connections was made automatic by resetting electronic relay connections. In addition to the military applications already mentioned above, analog mechanical computers incorporating differential analyzers have been widely used for flight simulations and for industrial control systems. Mechanical simulations of physical processes: Crystallization and packing There are a variety of macroscopic devices used for simulations of physical processes, which can be viewed as analog devices.
For example, a number of approaches have been used for mechanical simulations of crystallization and packing: – Simulation using solid macroscopic ellipsoidal bodies Simulations of kinetic crystallization processes have been made with collections of macroscopic solid ellipsoidal objects – typically a few millimeters in diameter – which model the molecules comprising the crystal. In these physical simulations, thermal energy is modeled by introducing vibrations; a low level of vibration is used to model freezing, and increasing the level of vibration models melting. In simple cases, the molecule of interest is a sphere, and ball bearings or similar objects are used for the molecular simulation. For example, to simulate the dense random packing of hard spheres within a crystalline solid, Bernal [61] and Finney [62] used up to 4000 ball bearings on a vibrating table. In addition, to model more general ellipsoidal molecules,
orzo pasta grains as well as M&M candies (Jerry Gollub at Princeton University) have been used. Also, Cheerios have been used to simulate the liquid state packing of benzene molecules. To model more complex systems, mixtures of balls of different sizes and/or composition have been used; for example, a model of ionic crystal formation has been made by use of a mixture of balls composed of different materials that acquired opposing electrostatic charges. – Simulations using bubble rafts [63,64] These are the structures that assemble among equal-sized bubbles floating on water. They typically form two-dimensional hexagonal arrays, and can be used for modeling the formation of close-packed crystals. Defects and dislocations can also be modeled [65]; for example, the deliberate introduction of defects in the bubble rafts has been used to simulate crystal dislocations, vacancies, and grain boundaries. Also, impurities in crystals (both interstitial and substitutional) have been simulated by introducing bubbles of other sizes. Reaction-diffusion chemical computers Adamatzky [66,67] described a class of analog computers where there is a chemical medium with multiple chemical species, where the concentrations of these chemical species vary spatially and the species diffuse and react in parallel. The memory values (as well as inputs and outputs) of the computer are encoded by the concentrations of these chemical species at a number of distinct locations (also known as micro-volumes). The computational operations are executed by chemical reactions whose reagents are these chemical species. Example computations [66,67] include: (i) Voronoi diagrams, that is, determining the boundaries of the regions closest to a set of points on the plane, (ii) the skeleton of a planar shape, and (iii) a wide variety of two-dimensional patterns periodic and aperiodic in time and space.
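The Voronoi computation can be illustrated with a discrete sketch: waves spreading at uniform speed from each seed point (a crude software stand-in for the diffusing chemical wavefronts; the grid size, seed coordinates and breadth-first model below are assumptions of this sketch, not a description of Adamatzky's chemistry) mark as boundary those cells where wavefronts from different seeds collide.

```python
from collections import deque

def voronoi_boundaries(width, height, seeds):
    """Spread a 'wavefront' outward from every seed at uniform speed
    (multi-source breadth-first search); record the cells where
    wavefronts originating from different seeds collide."""
    owner, frontier, boundary = {}, deque(), set()
    for label, cell in enumerate(seeds):
        owner[cell] = label
        frontier.append(cell)
    while frontier:
        x, y = frontier.popleft()
        for nbr in ((x+1, y), (x-1, y), (x, y+1), (x, y-1)):
            if 0 <= nbr[0] < width and 0 <= nbr[1] < height:
                if nbr not in owner:          # wave claims a fresh cell
                    owner[nbr] = owner[(x, y)]
                    frontier.append(nbr)
                elif owner[nbr] != owner[(x, y)]:
                    boundary.add(nbr)         # two waves meet: boundary
    return boundary

# Two seeds on a 20x20 grid: the boundary falls along the midline.
b = voronoi_boundaries(20, 20, [(2, 10), (17, 10)])
```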
Digital Mechanical Devices for Arithmetic Operations Recall that we have distinguished digital mechanical devices from the analog mechanical devices described above by their use of mechanical mechanisms for ensuring that the values stored and computed are discrete. Such discretization mechanisms include geometry and structure (e. g., the notches of Napier's bones described below), or the cogs and spokes of wheeled calculators. Surveys of the history of some of these digital mechanical calculators are given by Knott [42], Turck [50], Hartree [43], Engineering Research Associates [44], Chase [45], Martin [46], Davis [47], and Norman [48].
Leonardo da Vinci's mechanical device and mechanical counting devices This intriguing device, which involved a sequence of interacting wheels positioned on a rod that appear to provide a mechanism for digital carry operations, was illustrated in 1493 in Leonardo da Vinci's Codex Madrid [68]. A working model of its possible mechanics was constructed in 1968 by Joseph Mirabella. Its function and purpose is not decisively known, but it may have been intended for counting rotations (e. g., for measuring the distance traversed by a cart). There are a variety of apparently similar mechanical devices used for measuring the distances traversed by vehicles. Napier's Bones In 1614 John Napier [69] developed a mechanical device (known as Napier's Bones) which allowed multiplication and division (as well as square and cube roots) to be done by addition and subtraction operations. It consisted of rectilinear rods inscribed with digit tables, from which one-digit partial products could be read off and summed. In 1623, Wilhelm Schickard developed a six-digit mechanical calculator that combined the use of Napier's Bones, using columns of sliding rods, with the use of wheels to sum up the partial products for multiplication. Slide rules Edmund Gunter devised in 1620 a method for calculation that used a single log scale with dividers along a linear scale; this anticipated key elements of the first slide rule described by William Oughtred [70] in 1632. A very large variety of slide rules were later constructed. Pascaline: Pascal's wheeled calculator Blaise Pascal [71] developed a calculator in 1642 known as the Pascaline that could calculate all four arithmetic operations (addition, subtraction, multiplication, and division) on up to eight digits. A wide variety of mechanical devices were then developed that used revolving drums or wheels (cogwheels or pinwheels) to do various arithmetical calculations.
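The principle by which Napier's Bones reduce multiplication to addition, described above, can be sketched as follows. This is a sketch of the arithmetic only; the function name and digit-list representation are ours, not a description of the physical rods.

```python
def napier_multiply(n, digit):
    """Multiply n by a single digit as Napier's Bones do: take one
    partial product per rod, then sum along the diagonals with carries,
    so the only operations actually performed are additions."""
    rods = [int(d) * digit for d in str(n)]   # one partial product per rod
    out, carry = [], 0
    for p in reversed(rods):                  # rightmost rod first
        s = p + carry
        out.append(s % 10)
        carry = s // 10
    while carry:
        out.append(carry % 10)
        carry //= 10
    return int("".join(map(str, reversed(out))))

print(napier_multiply(46, 7))   # → 322
```

Multi-digit multipliers are handled the same way the physical rods were used: one single-digit pass per multiplier digit, with the shifted results summed.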
Stepped drum calculators Gottfried Wilhelm von Leibniz developed an improved calculator known as the Stepped Reckoner in 1671, which used a cylinder known as a stepped drum with nine teeth of different lengths that increase in equal amounts around the drum. The stepped drum mechanism allowed the use of a moving slide for specifying a number to be input to the machine, and made use of the revolving drums to do the arithmetic calculations. Charles Xavier Thomas de Colmar developed a widely used arithmetic mechanical calculator based on the stepped drum, known as the Arithmometer, in 1820. Other stepped drum cal-
culating devices included Otto Steiger's Millionaire calculator (1893) and Curt Herzstark's Curta (early 1940s). Pinwheel calculators Frank S. Baldwin and W.T. Odhner independently invented another class of calculators in the 1870s, known as pinwheel calculators; they used a pinwheel for specifying a number input to the machine and used revolving wheels to do the arithmetic calculations. Pinwheel calculators were widely used up to the 1950s, for example in William S. Burroughs' calculator/printer and the German Brunsviga. Digital Mechanical Devices for Mathematical Tables and Functions Babbage's Difference Engine Charles Babbage [72,73] invented a mechanical device in 1820 known as the Difference Engine for calculating the tables of an analytical function (such as the logarithm), which summed the change in values of the function when a small difference is made in the argument. For each table entry, the difference calculation required only a small number of simple arithmetic computations. The device made use of columns of cogwheels to store the digits of numerical values. Babbage planned to store 1000 variables, each with 50 digits, where each digit was stored by a unique cogwheel. It used cogwheels in registers for the required arithmetical calculations, and also made use of a rod-based control mechanism specialized for the control of these arithmetic calculations. The design and operation of the mechanisms of the device were described by a symbolic scheme developed by Babbage [74]. He also conceived of a printing mechanism for the device. In 1801, Joseph-Marie Jacquard invented an automatic loom that made use of punched cards for the specification of fabric patterns woven by his loom, and Charles Babbage proposed the use of similar punched cards for providing inputs to his machines. He demonstrated over a number of years certain key portions of the mechanics of the device but never completed a fully functional device.
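The method of finite differences that the Difference Engine mechanized can be sketched in a few lines: once the initial differences of a polynomial are set, every further table entry requires only additions. The code is a sketch of the arithmetic, not of Babbage's mechanism; the polynomial x² + x + 41 below is a standard demonstration example.

```python
def difference_engine(initial_differences, n_entries):
    """Tabulate a polynomial using only addition: each column holds
    one difference (as a column of cogwheels did), and each step adds
    each column into the one above it."""
    cols = list(initial_differences)      # [f(0), Δf(0), Δ²f(0), ...]
    table = []
    for _ in range(n_entries):
        table.append(cols[0])
        for i in range(len(cols) - 1):    # propagate the additions
            cols[i] += cols[i + 1]
    return table

# f(x) = x² + x + 41: f(0) = 41, Δf(0) = f(1) - f(0) = 2, Δ²f = 2 (constant)
print(difference_engine([41, 2, 2], 5))   # → [41, 43, 47, 53, 61]
```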
Other Difference Engines In 1909 Ludgate [75] independently designed, but did not construct, a mechanical computing machine similar to, but smaller in scale than, Babbage's Analytical Engine. In 1853 Pehr and Edvard Scheutz [76] constructed in Sweden a cog-wheel mechanical calculating device (similar to the Difference Engine originally conceived by Babbage) known as the Tabulating Machine, for computing and printing out tables of mathematical functions. This (and a later construction of Babbage's Difference Engine by Doron
Swade [77] of the London Science Museum) demonstrated the feasibility of Babbage's Difference Engine. Babbage's Analytical Engine Babbage further conceived (but did not attempt to construct) a mechanical computer known as the Analytical Engine to solve more general mathematical problems. Lovelace's extended description of Babbage's Analytical Engine [78] (a translation of "Sketch of the Analytical Engine" by L.F. Menabrea) describes, in addition to arithmetic operations, mechanisms for looping and memory addressing. However, the existing descriptions of Babbage's Analytical Engine appear to lack the ability to execute a full repertory of the logical and/or finite state transition operations required for general computation. Babbage's background was very strong in analytic mathematics, but he (and the architects of similar cog-wheel based mechanical computing devices of that era) seemed to have lacked knowledge of sequential logic and its Boolean logical basis, which was required for controlling the sequence of complex computations. This (and his propensity for changing designs prior to the completion of the machine construction) might have been the real reason for the lack of complete development of a universal mechanical digital computing device in the early 1800s. Subsequent electromechanical digital computing devices with cog-wheels Other electromechanical digital computing devices (see [44]) developed in the late 1940s and 1950s that contained cog-wheels included Howard Aiken's Mark 1 [79], constructed at Harvard University, and Konrad Zuse's Z series computers, constructed in Germany. Mechanical Devices for Timing, Sequencing and Logical Control We will use the term mechanical automata here to denote mechanical devices that exhibit autonomous control of their movements. These can require sophisticated mechanical mechanisms for timing, sequencing and logical control.
Mechanisms used for timing control Mechanical clocks and other mechanical devices for measuring time have a very long history, and include a very wide variety of designs, including the flow of liquids (e. g., water clocks) or sands (e. g., sand clocks), and more conventional pendulum-and-gear based clock mechanisms. A wide variety of mechanical automata and other control devices make use of mechanical timing mechanisms to control the order and duration of events automatically executed (for example, mechanical slot
machines dating up to the 1970s made use of such mechanical clock mechanisms to control the sequence of operations used for the payout of winnings). As a consequence, there is an interwoven history in the development of mechanical devices for measuring time and the development of devices for the control of mechanical automata. Logical control of computations A critical step in the history of computing machines was the development in the mid-1800s of Boolean logic by George Boole [80,81]. Boole's innovation was to assign values to logical propositions: 1 for true propositions and 0 for false propositions. He introduced the use of Boolean variables which are assigned these values, as well as the use of Boolean connectives ("and" and "or") for expressing symbolic Boolean logic formulas. Boole's symbolic logic is the basis for the logical control used in modern computers. Shannon [82] was the first to make use of Boole's symbolic logic to analyze relay circuits (these relays were used to control an analog computer, namely MIT's Differential Analyzer). The Jevons logic piano: A mechanical logical inference machine In 1870 William Stanley Jevons (who also significantly contributed to the development of symbolic logic) constructed a mechanical device [83,84] for the inference of logical propositions that used a piano keyboard for inputs. This mechanical inference machine is less widely known than it should be, since it may have had an impact on the subsequent development of logical control mechanisms for machines. Mechanical logical devices used to play games Mechanical computing devices have also been constructed for executing the logical operations for playing games. For example, in 1975, a group of MIT undergraduates including Danny Hillis and Brian Silverman constructed a computing machine made of Tinkertoys that plays a perfect game of tic-tac-toe.
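Boole's assignment of 1 and 0 to propositions, described above, can be made concrete with a small sketch in which the connectives become arithmetic on {0, 1}. The particular arithmetic encodings below are the standard ones, shown for illustration.

```python
# Boole's logic as arithmetic on {0, 1}: "and" is multiplication,
# inclusive "or" is x + y - xy, and "not" is 1 - x.
AND = lambda x, y: x * y
OR  = lambda x, y: x + y - x * y
NOT = lambda x: 1 - x

# De Morgan's law not(x and y) = (not x) or (not y) checks out on all inputs:
for x in (0, 1):
    for y in (0, 1):
        assert NOT(AND(x, y)) == OR(NOT(x), NOT(y))
```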
Mechanical Devices Used in Cryptography Mechanical cipher devices using cogwheels Mechanical computing devices that used cogwheels were also developed for a wide variety of other purposes beyond merely arithmetic. A wide variety of mechanical computing devices were developed for the encryption and decryption of secret messages. Some of these (most notably the family of German electromechanical cipher devices known as Enigma Machines [85] developed in the early 1920s for commercial use and refined in the late 1920s and 1930s for military use) made use of sets of cogwheels to permute the symbols of text message streams. Similar (but somewhat more advanced) elec-
tromechanical cipher devices were used by the USSR up to the 1970s. Electromechanical computing devices used in breaking ciphers In the late 1930s Marian Rejewski and the Polish Cipher Bureau, and subsequently a British team including Alan Turing, constructed electrical/mechanical computing devices (known as the bomba and the Bombe, respectively) which had an architecture similar to the abstract Turing machine described below, and which were used to decrypt ciphers made by the German Enigma cipher device mentioned above. Mechanical and Electro-Optical Devices for Integer Factorization Lehmer's number sieve computer In 1926 Derrick Lehmer [86] constructed a mechanical device called the "number sieve computer" for various number theoretic problems including the factorization of small integers and the solution of Diophantine equations. The device made use of multiple bicycle chains that rotated at distinct periods to discover solutions (such as integer factors) to these number theoretic problems. Shamir's TWINKLE Adi Shamir [87,88,89] proposed a design for an optical/electronic device known as TWINKLE for factoring integers, with the goal of breaking the RSA public key cryptosystem. This was unique among mechanical computing devices in that it used the time durations between optical pulses to encode possible solution values. In particular, LEDs were made to flash at certain intervals of time (where each LED is assigned a distinct period and delay) at a very high clock rate so as to execute a sieve-based integer factoring algorithm. Mechanical Computation at the Micro Scale: MEMS Computing Devices Mechanical computers can have advantages over electronic computation at certain scales; they already have widespread use at the microscale. MEMS (MicroElectro-Mechanical Systems) are manufactured by lithographic etching methods similar in nature to the processes by which microelectronics are manufactured, and have a similar microscale.
A wide variety of MEMS devices [90] have been constructed for sensors and actuators, including accelerometers used in automobile safety devices and disk readers, and many of these MEMS devices execute mechanical computation to do their task. Perhaps the MEMS device most similar in architecture to the mechanical calculators described above is the Recodable Locking Device [91], constructed in 1998 at Sandia Labs, which made use of microscopic gears that acted as a me-
chanical lock, and which was intended for mechanically locking strategic weapons. Future Directions Mechanical Self-Assembly Processes Most of the mechanical devices discussed in this chapter have been assumed to be constructed top-down; that is, they are designed and then assembled by other mechanisms, generally of larger scale. However, a future direction to consider is bottom-up processes for the assembly and control of devices. Self-assembly is a basic bottom-up process found in many natural processes and in particular in all living systems. Domino Tiling Problems The theoretical basis for self-assembly has its roots in Domino Tiling Problems (also known as Wang tilings) as defined by Wang [92] (also see the comprehensive text of Grunbaum et al. [93]). The input is a finite set of unit-size square tiles, each of whose sides is labeled with symbols over a finite alphabet. Additional restrictions may include the initial placement of a subset of these tiles, and the dimensions of the region where tiles must be placed. Assuming an arbitrarily large supply of each tile, the problem is to place the tiles, without rotation (a restriction that cannot apply to physical tiles), to completely fill the given region so that each pair of abutting tiles has identical symbols on their contacting sides. Turing-universal and NP complete self-assemblies Domino tiling problems over an infinite domain with only a constant number of tiles were first proved by [94] to be undecidable. Lewis and Papadimitriou [95] showed that the problem of tiling a given finite region is NP complete. Theoretical models of tiling self-assembly processes Domino tiling problems do not presume or require a specific process for tiling. Winfree [96] proposed kinetic models for self-assembly processes. The sides of the tiles are assumed to have some methodology for selective affinity, which we call pads. Pads function as programmable binding domains, which hold together the tiles.
Each pair of pads has a specified binding strength (a real number in the range [0,1], where 0 denotes no binding and 1 denotes perfect binding). The self-assembly process is initiated by a singleton tile (the seed tile) and proceeds by tiles binding together at their pads to form aggregates known as tiling assemblies. The preferential matching of tile pads facilitates the further assembly into tiling assemblies. Pad binding mechanisms The preferential matching of tile sides can be
provided by various methods: – magnetic attraction, e. g., pads with magnetic orientations (these can be constructed by curing ferrite materials, e. g., PDMS polymer/ferrite composites, in the presence of strong magnetic fields), and also pads with patterned strips of magnetic orientations, – capillary force, using hydrophobic/hydrophilic (capillary) effects at surface boundaries that generate lateral forces, – shape matching (also known as shape complementarity or conformational affinity), using the shape of the tile sides to hold them together. (Also see the discussion below of the use of molecular affinity for pad binding.) Materials for tiles There are a variety of distinct materials for tiles, at a variety of scales: Whitesides (see [97] and http://www-chem.harvard.edu/GeorgeWhitesides.html) has developed and tested multiple technologies for meso-scale self-assembly, using capillary forces, shape complementarity, and magnetic forces. Rothemund [98] gave some of the most complex known meso-scale tiling assemblies using polymer tiles on fluid boundaries with pads that use hydrophobic/hydrophilic forces. A materials science group at the U. of Wisconsin (http://mrsec.wisc.edu/edetc/selfassembly) has also tested meso-scale self-assembly using magnetic tiles. Meso-Scale Tile Assemblies Meso-scale tiling assemblies have tiles ranging in size from a few millimeters up to a few centimeters. They have been experimentally demonstrated by a number of methods, such as the placement of tiles on a liquid surface interface (e. g., at the interface between two liquids of distinct density or on the surface of an air/liquid interface), and the use of mechanical agitation with shakers to provide a heat source for the assembly kinetics (that is, a temperature setting is made by fixing the rate and intensity of shaker agitation).
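A minimal sketch of seeded tile assembly in the spirit of the kinetic models above can be written as follows. The tiny one-tile set, the pad labels, and the temperature-1 threshold are invented for illustration; computational assemblies typically rely on a temperature-2 threshold, where attachment requires the cooperation of two matching neighbours.

```python
# tile = {side: (pad_label, strength)}; sides are N, S, E, W.
TILES = [{"N": ("a", 1), "S": ("a", 1), "E": ("b", 1), "W": ("b", 1)}]
SEED  =  {"N": ("a", 2), "S": ("a", 2), "E": ("b", 2), "W": ("b", 2)}
TEMPERATURE = 1          # threshold strength needed for attachment

NEIGHBORS = {"N": (0, 1, "S"), "S": (0, -1, "N"),
             "E": (1, 0, "W"), "W": (-1, 0, "E")}

def binding_strength(tile, pos, placed):
    """Sum the strengths of this tile's pads that match the abutting
    pads of already-placed neighbours."""
    x, y = pos
    total = 0
    for side, (dx, dy, opp) in NEIGHBORS.items():
        nbr = placed.get((x + dx, y + dy))
        if nbr and nbr[opp][0] == tile[side][0]:
            total += min(tile[side][1], nbr[opp][1])
    return total

def assemble(max_tiles):
    """Grow an assembly outward from the seed tile: a tile attaches
    wherever its total binding strength reaches the temperature."""
    placed = {(0, 0): SEED}
    grew = True
    while grew and len(placed) < max_tiles:
        grew = False
        for (x, y) in list(placed):
            for pos in ((x+1, y), (x-1, y), (x, y+1), (x, y-1)):
                if pos in placed:
                    continue
                for t in TILES:
                    if binding_strength(t, pos, placed) >= TEMPERATURE:
                        placed[pos] = t
                        grew = True
                        break
                if len(placed) >= max_tiles:
                    return placed
    return placed
```

With this single tile type and threshold 1, the assembly simply grows outward from the seed; richer tile sets with cooperative binding are what make tiling assemblies Turing-universal.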
Applications of Meso-scale Assemblies There are a number of applications, including: – simulation of the thermodynamics and kinetics of molecular-scale self-assemblies, and – placement of a variety of microelectronic and MEMS parts. Mechanical Computation at the Molecular Scale: DNA Computing Devices Due to the difficulty of constructing electrical circuits at the molecular scale, alternative methods for computation, and in particular mechanical methods, may pro-
vide unique opportunities for computing at the molecular scale. In particular, the bottom-up self-assembly processes described above have unique applications at the molecular scale. Self-assembled DNA nanostructures Molecular-scale structures known as DNA nanostructures (see surveys by Seeman [99] and Reif [100]) can be made to self-assemble from individual synthetic strands of DNA. When the strands are added to a test tube with the appropriate buffer solution and the test tube is cooled, the strands self-assemble into DNA nanostructures. This self-assembly of DNA nanostructures can be viewed as a mechanical process, and in fact can be used to do computation. The first known example of a computation using DNA was by Adleman [101,102] in 1994; he used the self-assembly of DNA strands to solve a small instance of a combinatorial optimization problem known as the Hamiltonian path problem. DNA tiling assemblies The Wang tiling [92] paradigm for self-assembling structures was the basis for a scalable and programmable approach proposed by Winfree et al. [103] for doing molecular computation using DNA. First a number of distinct DNA nanostructures known as DNA tiles are self-assembled. End portions of the tiles, known as pads, are designed to allow the tiles to bind together in a programmable manner similar to Wang tilings, but in this case using the molecular affinity for pad binding due to the hydrogen bonding of complementary DNA bases. This programmable control of the binding together of DNA tiles provides a capability for doing computation at the molecular scale. When the temperature of the test tube containing these tiles is further lowered, the DNA tiles bind together to form complexly patterned tiling lattices that correspond to computations. Assembling patterned DNA tiling assemblies Programmed patterning at the molecular scale can be produced by the use of strands of DNA that encode the patterns; this was first done by Yan et al.
[104] in the form of barcode striped patterns, and more recently Rothemund [105] self-assembled complex 2D molecular patterns and shapes. Another method for the molecular patterning of DNA tiles is via computation done during the assembly. Computational DNA tiling assemblies The first experimental demonstrations of computation via the self-assembly of DNA tiles were in 2000 by Mao et al. [106] and Yan et al. [107], which provided a one-dimensional computation of a binary-carry computation (known as prefix-sum) associated with binary adders. Rothemund
et al. [108] in 2004 demonstrated a two-dimensional computational assembly of tiles displaying a pattern known as the Sierpinski triangle, which is the modulo-2 version of Pascal's triangle. Other autonomous DNA devices DNA nanostructures can also be made to execute sequences of movement, and a demonstration of an autonomous moving DNA robotic device that moved without outside mediation across a DNA nanostructure was given by Yin et al. [109]. The design of an autonomous DNA device that moves under programmed control is described in [110]. Surveys of DNA autonomous devices are given in [111] and [112].
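The pattern computed by the assembly of [108] is easy to reproduce in software: each tile displays the XOR of the two tiles above it, which generates Pascal's triangle modulo 2. The code below is a sketch of the rule, not of the DNA implementation.

```python
def sierpinski_rows(n):
    """Generate n rows in which each interior bit is the XOR of the
    two bits above it: Pascal's triangle modulo 2, i.e. the
    Sierpinski triangle pattern."""
    row, rows = [1], [[1]]
    for _ in range(n - 1):
        row = [1] + [row[i] ^ row[i + 1] for i in range(len(row) - 1)] + [1]
        rows.append(row)
    return rows

for r in sierpinski_rows(8):
    print("".join("#" if b else "." for b in r).center(16))
```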
Acknowledgments
We sincerely thank Charles Bennett for his numerous suggestions and very important improvements to this survey. This work has been supported by NSF grants CCF-0432038 and CCF-0523555.
Bibliography
1. Turing A (1937) On Computable Numbers, with an Application to the Entscheidungsproblem. Proc Lond Math Soc, Second Ser, London 42:230–265. Erratum in 43:544–546
2. Lewis HR, Papadimitriou CH (1997) Elements of the Theory of Computation, 2nd edn. Prentice Hall, Upper Saddle River
3. Landauer R (1961) Irreversibility and heat generation in the computing process. IBM J Res Dev 5:183
4. Bennett CH (1973) Logical reversibility of computation. IBM J Res Dev 17(6):525–532
5. Li M, Vitanyi P (1996) Reversibility and Adiabatic Computation: Trading Time and Space for Energy. Proc Roy Soc Lond, Series A 452:769–789. Preprint quant-ph/9703022
6. Crescenzi P, Papadimitriou CH (1995) Reversible simulation of space-bounded computations. Theor Comput Sci 143(1):159–165
7. Wolfram S (1984) Universality and complexity in cellular automata. Phys D 10:1–35
8. Blum L, Cucker F, Shub M, Smale S (1996) Complexity and Real Computation: A Manifesto. Int J Bifurc Chaos 6(1):3–26
9. Feynman RP (1982) Simulating physics with computers. Int J Theor Phys 21(6–7):467–488
10. Benioff P (1982) Quantum mechanical models of Turing machines that dissipate no energy. Phys Rev Lett 48:1581
11. Deutsch D (1985) Quantum theory, the Church–Turing principle and the universal quantum computer. Proc Roy Soc A 400:97–117
12. Gruska J (1999) Quantum Computing. McGraw–Hill, Maidenhead
13. Nielsen M, Chuang I (2000) Quantum Computation and Quantum Information. Cambridge University Press, Cambridge
14. Jaeger G (2006) Quantum Information: An Overview. Springer, Berlin
15. Reif JH (2007) Quantum Information Processing: Algorithms, Technologies and Challenges. In: Eshaghian-Wilner MM (ed) Nano-scale and Bio-inspired Computing. Wiley, Malden
16. Reif JH (1979) Complexity of the Mover's Problem and Generalizations. 20th Annual IEEE Symposium on Foundations of Computer Science, San Juan, Puerto Rico, October, pp 421–427; (1987) In: Schwartz J (ed) Planning, Geometry and Complexity of Robot Motion. Ablex Pub, Norwood, NJ, pp 267–281
17. Canny J (1988) Some algebraic and geometric computations in PSPACE. In: Cole R (ed) Proceedings of the 20th Annual ACM Symposium on the Theory of Computing. ACM Press, Chicago, IL, pp 460–467
18. Schwartz JT, Sharir M (1983) On the piano movers problem: I. The case of a two-dimensional rigid polygonal body moving amidst polygonal barriers. Comm Pure Appl Math 36:345–398
19. Hopcroft JE, Schwartz JT, Sharir M (1984) On the Complexity of Motion Planning for Multiple Independent Objects: PSPACE Hardness of the Warehouseman's Problem. Int J Robot Res 3(4):76–88
20. Bennett CH (1982) The thermodynamics of computation – a review. Int J Theor Phys 21(12):905–940. http://www.research.ibm.com/people/b/bennetc/bennettc1982666c3d53.pdf
21. Bennett CH (2003) Notes on Landauer's principle, reversible computation, and Maxwell's Demon. Stud History Philos Mod Phys 34:501–510. eprint physics/0210005: http://xxx.lanl.gov/abs/physics/0210005
22. Adamatzky A (ed) (2001) Collision-based computing. Springer, London
23. Squier R, Steiglitz K (1994) Programmable parallel arithmetic in cellular automata using a particle model. Complex Syst 8:311–323
24. Fredkin E, Toffoli T (1982) Conservative logic. Int J Theor Phys 21:219–253
25. Adamatzky AI (1996) On the particle-like waves in the discrete model of excitable medium. Neural Netw World 1:3–10
26. Adamatzky AI (1998) Universal dynamical computation in multidimensional excitable lattices. Int J Theor Phys 37:3069–3108
27. Jakubowski MH, Steiglitz K, Squier R (1998) State transformations of colliding optical solitons and possible application to computation in bulk media. Phys Rev E 58:6752–6758
28. Jakubowski MH, Steiglitz K, Squier R (2001) Computing with solitons: a review and prospectus. In: Collision-based computing. Springer, London, pp 277–297
29. Feynman RP (1963) Ratchet and Pawl. In: Feynman RP, Leighton RB, Sands M (eds) The Feynman Lectures on Physics, vol 1, chapter 46. Addison–Wesley, Reading
30. Shapiro E (1999) A Mechanical Turing Machine: Blueprint for a Biomolecular Computer. Fifth International Meeting on DNA-Based Computers at the Massachusetts Institute of Technology, Proc DNA Based Computers V. Cambridge, MA, pp 14–16
31. Reif JH, Sharir M (1994) Motion Planning in the Presence of Moving Obstacles. 26th Annual IEEE Symposium on Foundations of Computer Science, Portland, OR, October 1985, pp 144–154; J ACM (JACM) 41(4):764–790
32. Canny J, Reif JH (1987) New Lower Bound Techniques for Robot Motion Planning Problems. 28th Annual IEEE Symposium on Foundations of Computer Science, Los Angeles, CA, October, pp 49–60
33. Canny J, Donald B, Reif JH, Xavier P (1993) On the Complexity of Kinodynamic Planning. 29th Annual IEEE Symposium on Foundations of Computer Science, White Plains, NY, October (1988), pp 306–316; Kinodynamic Motion Planning. J ACM 40(5):1048–1066
34. Reif JH, Wang H (1998) The Complexity of the Two Dimensional Curvature-Constrained Shortest-Path Problem. Third International Workshop on Algorithmic Foundations of Robotics (WAFR98). Peters AK Ltd, Houston, pp 49–57
35. Reif JH, Tygar D, Yoshida A (1994) The Computability and Complexity of Optical Beam Tracing. 31st Annual IEEE Symposium on Foundations of Computer Science, St Louis, MO, October (1990), pp 106–114; The Computability and Complexity of Ray Tracing. Discrete Comput Geometry 11:265–287
36. Tate SR, Reif JH (1993) The Complexity of N-body Simulation. Proceedings of the 20th Annual Colloquium on Automata, Languages and Programming (ICALP'93), Lund, pp 162–176
37. Reif JH, Sun Z (2003) The Computational Power of Frictional Mechanical Systems. Third International Workshop on Algorithmic Foundations of Robotics (WAFR98). Peters AK Ltd, Houston, Texas, Mar 5–7 1998, pp 223–236; On Frictional Mechanical Systems and Their Computational Power. SIAM J Comput (SICOMP) 32(6):1449–1474
38. Moore C (1990) Undecidability and Unpredictability in Dynamical Systems. Phys Rev Lett 64:2354–2357
39. Moore C (1991) Generalized Shifts: Undecidability and Unpredictability in Dynamical Systems. Nonlinearity 4:199–230
40. Munakata T, Sinha S, Ditto WL (2002) Chaos Computing: Implementation of Fundamental Logical Gates by Chaotic Elements. IEEE Trans Circuits Syst-I Fundam Theory Appl 49(11):1629–1633
41. Sinha S, Ditto WL (1999) Computing with distributed chaos. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Top 60(1):363–377
42. Knott CG (ed) (1915) Napier tercentenary memorial volume. The Royal Society of Edinburgh, Longmans, Green, London
43. Hartree DR (1950) Calculating Instruments and Machines. Cambridge University Press, Cambridge
44. Engineering Research Associates Staff (1950) High-Speed Computing Devices. McGraw–Hill, New York
45. Chase GC (1980) History of Mechanical Computing Machinery. Ann Hist Comput 2(3):198–226
46. Martin E (1992) The Calculating Machines. The MIT Press, Cambridge, Massachusetts
47. Davis M (2000) The Universal Computer: The Road from Leibniz to Turing. Norton, New York
48. Norman JM (ed) (2002) The Origins of Cyberspace: From Gutenberg to the Internet: a sourcebook on the history of information technology. Norman Publishing, Novato
49. Horsburgh EM (ed) (1914) Modern Instruments and Methods of Calculation: a Handbook of the Napier Tercentenary Exhibition. G Bell and Sons, London; The Royal Society of Edinburgh, Edinburgh, p 223. Reprinted 1982
50. Turck JAV (1921) Origin of Modern Calculating Machines. The Western Society of Engineers, Chicago
51. Svoboda A (1948) Computing Mechanisms and Linkages. McGraw–Hill, Columbus
52. Soroka WA (1954) Analog Methods in Computation and Simulation. McGraw–Hill
53. Freeth T, Bitsakis Y, Moussas X, Seiradakis JH, Tselikas A, Mangou H, Zafeiropoulou M, Hadland R, Bate D, Ramsey A, Allen
M, Crawley A, Hockley P, Malzbender T, Gelb D, Ambrisco W, Edmunds MG (2006) Decoding the ancient Greek astronomical calculator known as the Antikythera Mechanism. Nature 444:587–591
54. de Morin H (1913) Les appareils d'intégration: intégrateurs simples et composés; planimètres, intégromètres, intégraphes et courbes intégrales; analyse harmonique et analyseurs. Gauthier-Villars, Paris, pp 162–171
55. Thomson W (later known as Lord Kelvin) (1878) Harmonic Analyzer. Proc Roy Soc Lond 27:371–373
56. Henrici (1894) Philos Mag 38:110
57. Lord Kelvin (1878) Harmonic analyzer and synthesizer. Proc Roy Soc 27:371
58. Miller D (1916) The Henrici harmonic analyzer and devices for extending and facilitating its use. J Franklin Inst 181:51–81; 182:285–322
59. Fisher EG (1911) Tide-predicting machine. Eng News 66:69–73
60. Bush V (1931) The differential analyzer: A new machine for solving differential equations. J Franklin Inst 212:447
61. Bernal JD (1964) The Structure of Liquids. Proc Roy Soc Lond Ser A 280:299
62. Finney JL (1970) Random Packings and the Structure of Simple Liquids. I. The Geometry of Random Close Packing. Proc Royal Soc London, Ser A, Math Phys Sci 319(1539):479–493
63. Bragg L, Nye JF (1947) A dynamical model of a crystal structure. Proc R Soc A 190:474–481
64. Bragg L, Lomer WM (1948) A dynamical model of a crystal structure II. Proc R Soc A 196:171–181
65. Corcoran SG, Colton RJ, Lilleodden ET, Gerberich WW (1997) Phys Rev B 190:474
66. Adamatzky A, De Lacy BC, Asai T (2005) Reaction-Diffusion Computers. Elsevier, Amsterdam
67. Adamatzky AI (1994) Constructing a discrete generalized Voronoi diagram in reaction-diffusion media. Neural Netw World 6:635–643
68. da Vinci L (1493) Codex Madrid I
69. Napier J (1614) Mirifici logarithmorum canonis descriptio (the description of the wonderful canon of logarithms). Hart, Edinburgh
70. Oughtred W (1632) Circles of Proportion and the Horizontal Instrument. William Forster, London
71. Pascal E (1645) Lettre dédicatoire à Monseigneur le Chancelier sur le sujet de la machine nouvellement inventée par le sieur BP pour faire toutes sortes d'opérations d'arithmétique par un mouvement réglé sans plume ni jetons, suivie d'un avis nécessaire à ceux qui auront curiosité de voir ladite machine et de s'en servir
72. Babbage C (1822) On Machinery for Calculating and Printing Mathematical Tables. Edinburgh Philos J VII:274–281
73. Babbage C (1822) Observations on the application of machinery to the computation of mathematical tables. Memoirs of the Astronomical Society 1:311–314
74. Babbage C (1826) On a Method of expressing by Signs the Action of Machinery. Philosophical Trans Royal Soc London 116(III):250–265
75. Ludgate P (1909) On a proposed analytical engine. Sci Proc Roy Dublin Soc 12:77–91
76. Lindgren M (1990) Glory and Failure: Difference Engines of Johann Müller, Charles Babbage and Georg and Edvard Scheutz. MIT Press, Cambridge
77. Swade D (1991) Charles Babbage and His Calculating Engines. Michigan State University Press, East Lansing
78. Lovelace A (1843) Sketch of the analytical engine invented by Charles Babbage. Translation of: Sketch of the Analytical Engine by Menabrea LF, with Ada's notes and extensive commentary. Esq Sci Mem 3:666–731
79. Cohen BI, Welch GW (1999) Makin' Numbers: Howard Aiken and the Computer. MIT Press, Cambridge
80. Boole G (1847) Mathematical Analysis of Logic. Pamphlet
81. Boole G (1854) An Investigation of the Laws of Thought, on Which are Founded the Mathematical Theories of Logic and Probabilities. Macmillan, Cambridge
82. Shannon C (1938) A Symbolic Analysis of Relay and Switching Circuits. Trans Am Inst Electr Eng 57:713–719
83. Jevons SW (1870) On the Mechanical Performance of Logical Inference. Philos Trans Roy Soc, Part II 160:497–518
84. Jevons SW (1873) The Principles of Science: A Treatise on Logic and Scientific Method. Macmillan, London
85. Hamer D, Sullivan G, Weierud F (1998) Enigma Variations: an Extended Family of Machines. Cryptologia 22(3):211–229
86. Lehmer DH (1928) The mechanical combination of linear forms. Am Math Mon 35:114–121
87. Shamir A (1999) Method and apparatus for factoring large numbers with optoelectronic devices. Patent 475920, awarded 08/05/2003
88. Shamir A (1999) Factoring large numbers with the TWINKLE device. Cryptographic Hardware and Embedded Systems (CHES) 1999. LNCS, vol 1717. Springer, New York, pp 2–12
89. Lenstra AK, Shamir A (2000) Analysis and optimization of the TWINKLE factoring device. Proc Eurocrypt 2000. LNCS, vol 1807. Springer, pp 35–52
90. Madou MJ (2002) Fundamentals of Microfabrication: The Science of Miniaturization, 2nd edn. CRC Press, Boca Raton
91. Plummer D, Dalton LJ, Peter F (1999) The recodable locking device. Commun ACM 42(7):83–87
92. Wang H (1963) Dominoes and the AEA Case of the Decision Problem. In: Fox J (ed) Mathematical Theory of Automata. Polytechnic Press, Brooklyn, pp 23–55
93. Grünbaum B, Shephard GC (1987) Tilings and Patterns. WH Freeman, Gordonsville. Chapter 11
94. Berger R (1966) The Undecidability of the Domino Problem. Mem Am Math Soc 66:1–72
95. Lewis HR, Papadimitriou CH (1981) Elements of the Theory of Computation. Prentice-Hall, Upper Saddle River, pp 296–300, 345–348
96. Winfree E (1998) Simulations of Computing by Self-Assembly. In: Proceedings of the Fourth Annual Meeting on DNA Based Computers, held at the University of Pennsylvania, pp 16–19
97. Xia Y, Whitesides GM (1998) Soft Lithography. Annu Rev Mater Sci 28:153–184
98. Rothemund PWK (2000) Using lateral capillary forces to compute by self-assembly. Proc Nat Acad Sci (USA) 97:984–989
99. Seeman NC (2004) Nanotechnology and the Double Helix. Sci Am 290(6):64–75
100. Reif JH, LaBean TH (2007) Nanostructures and Autonomous Devices Assembled from DNA. In: Eshaghian-Wilner MM (ed) Nano-scale and Bio-inspired Computing. Wiley, Malden
101. Adleman LM (1994) Molecular computation of solutions to combinatorial problems. Science 266(11):1021–1024
102. Adleman L (1998) Computing with DNA. Sci Am 279(2):34–41
103. Winfree E, Liu F, Wenzler LA, Seeman NC (1998) Design and Self-Assembly of Two-Dimensional DNA Crystals. Nature 394:539–544
104. Yan H, LaBean TH, Feng L, Reif JH (2003) Directed Nucleation Assembly of Barcode Patterned DNA Lattices. PNAS 100(14):8103–8108
105. Rothemund PWK (2006) Folding DNA to create nanoscale shapes and patterns. Nature 440:297–302
106. Mao C, LaBean TH, Reif JH, Seeman NC (2000) Logical Computation Using Algorithmic Self-Assembly of DNA Triple-Crossover Molecules. Nature 407:493–495
107. Yan H, Feng L, LaBean TH, Reif J (2003) DNA Nanotubes, Parallel Molecular Computations of Pairwise Exclusive-Or (XOR) Using DNA String Tile Self-Assembly. J Am Chem Soc (JACS) 125(47):14246–14247
108. Rothemund PWK, Papadakis N, Winfree E (2004) Algorithmic Self-Assembly of DNA Sierpinski Triangles. PLoS Biology 2(12):e424. doi:10.1371/journal.pbio.0020424
109. Yin P, Yan H, Daniel XG, Turberfield AJ, Reif JH (2004) A Unidirectional DNA Walker Moving Autonomously Along a Linear Track. Angew Chem [Int Ed] 43(37):4906–4911
110. Reif JH, Sahu S (2008) Autonomous Programmable DNA Nanorobotic Devices Using DNAzymes. In: Garzon M, Yan H (eds) 13th International Meeting on DNA Computing (DNA 13). Lecture Notes in Computer Science (LNCS). Springer, Berlin. To appear in special journal issue on Self-Assembly, Theoretical Computer Science (TCS)
111. Reif JH, LaBean TH (2007) Autonomous Programmable Biomolecular Devices Using Self-Assembled DNA Nanostructures. Comm ACM (CACM) 50(9):46–53. Extended version: http://www.cs.duke.edu/~reif/paper/AutonomousDNA/AutonomousDNA.pdf
112. Bath J, Turberfield AJ (2007) DNA nanomachines. Nat Nanotechnol 2:275–284
Mechanism Design
RON LAVI
The Technion – Israel Institute of Technology, Haifa, Israel

Article Outline
Glossary
Definition of the Subject
Introduction
Formal Model and Early Results
Quasi-Linear Utilities and the VCG Mechanism
The Importance of the Domain’s Dimensionality
Budget Balancedness and Bayesian Mechanism Design
Interdependent Valuations
Future Directions
Bibliography

Glossary
A social choice function  A function that determines a social choice according to players’ preferences over the different possible alternatives.
A mechanism  A game in incomplete information, in which player strategies are based on their private preferences. A mechanism implements a social choice function f if the equilibrium strategies yield an outcome that coincides with f.
Dominant strategies  An equilibrium concept where the strategy of each player maximizes her utility, no matter what strategies the other players choose.
Bayesian–Nash equilibrium  An equilibrium concept that requires the strategy of each player to maximize the expected utility of the player, where the expectation is taken over the types of the other players.
VCG mechanisms  A family of mechanisms that implement in dominant strategies the social choice function that maximizes the social welfare.

Definition of the Subject
Mechanism design is a sub-field of economics and game theory that studies the construction of social mechanisms in the presence of rational but selfish individuals (players/agents). The nature of the players dictates a basic contrast between the social planner, who aims to reach a socially desirable outcome, and the players, who care only about their own private utility. The underlying question is how to incentivize the players to cooperate, in order to reach the desirable social outcomes. A mechanism is
a game, in which each agent is required to choose one action among a set of possible actions. The social designer then chooses an outcome, based on the chosen actions. This outcome is typically a coupling of a physical outcome and a payment given to each individual. Mechanism design studies how to design the mechanism such that the equilibrium behavior of the players will lead to the socially desired goal. The theory of mechanism design has greatly influenced several sub-fields of micro-economics, for example auction theory and contract theory, and the 2007 Nobel prize in Economics was awarded to Leonid Hurwicz, Eric Maskin, and Roger Myerson “for having laid the foundations of mechanism design theory”.

Introduction
It will be useful to start with an example of a mechanism design setting, the well-known “public project” problem (Clarke [8]): a government is trying to decide on a certain public project (the common example is “building a bridge”). The project costs C dollars, and each player i will benefit from it to an amount of v_i dollars, where this number is known only to the player herself. The government desires to build the bridge if and only if ∑_i v_i > C. But how should this condition be checked? Clearly, every player has an interest in over-stating her own v_i if this report is not accompanied by any payment at all, and most probably agents will understate their values if asked to pay some amount proportional to the declared value. Clarke describes an elegant mechanism that solves this problem. His mechanism has the fantastic property that, from the point of view of every player, no matter what the other players declare, it is always in the best interest of the player to declare her true value. Thus, truthful reporting is a dominant-strategy equilibrium of the mechanism, and under this equilibrium, the government’s goal is fully achieved. A more formal treatment of this result is given in Sect. “Quasi-Linear Utilities and the VCG Mechanism” below.
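A simplified sketch of the pivot idea behind Clarke's mechanism may help (the function name is illustrative, and this stripped-down variant ignores how the cost C itself is financed): the project is built iff the declared values sum to more than C, and a player pays only when her report flips the decision, in which case she pays the externality she imposes on the others. Because neither the decision nor her payment depends on her report except through that flip, exaggerating or understating can never help her.

```python
def clarke_public_project(values, cost):
    """Pivot-style sketch: build iff declared values sum to more than
    the cost; each player pays the welfare the other players lose
    because her report is counted (her externality)."""
    total = sum(values)
    build = total > cost
    payments = []
    for v in values:
        others = total - v
        chosen = (others - cost) if build else 0   # others' welfare under the chosen decision
        best_without = max(0, others - cost)       # others' welfare if i were absent
        payments.append(best_without - chosen)
    return build, payments

build, pay = clarke_public_project([60, 30, 20], 100)
print(build, pay)  # -> True [50, 20, 10]
```

Note that the collected payments (80 here) need not cover the cost C; this budget issue is a known feature of pivot mechanisms and is discussed later in the entry.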
Clarke’s paper, published in the early 1970s, was part of a large body of work that started to investigate mechanism design questions. Most of the early works used two different assumptions about the structure of players’ utilities. Under the assumption that utilities are general, and that the influence of monetary transfers on the utility is not well predicted, the literature has produced mainly impossibilities, which are described in Sect. “Formal Model and Early Results”. The assumption that utilities are quasi-linear in money was successfully used to introduce positive and impressive results, as discussed in detail in Sects. “Quasi-Linear Utilities and the VCG Mechanism” and “The Importance of the Domain’s Dimensionality”. These mechanisms apply the solution concept of dominant-strategy equilibrium, which is a strong solution concept that may prevent several desirable properties from being achieved. To overcome its difficulties, the weaker concept of a Bayesian–Nash equilibrium is usually employed. This concept, and one main possibility result that it provides, are described in Sect. “Budget Balancedness and Bayesian Mechanism Design”. The last important model that this entry covers aims to capture settings where the players’ values are not fully observed by each player separately. Rather, each player receives a signal that gives a partial indication of her valuation. Mechanism design for such settings is discussed in Sect. “Interdependent Valuations”.

One of the most impressive applications of the general mechanism design literature is auction theory. An auction is a specific form of a mechanism, where the outcome is simply the specific allocation of the goods to the players, plus the prices they are required to pay. Vickrey [28] initiated the study of auctions in a mechanism design setting, and in fact perhaps the study of mechanisms itself. After the fundamental study of general mechanism design in the 1970s, in the 1980s the focus of the research community returned to this important application, and many models were studied. We note that there are several other entries in this book that are strongly related to the subject of “mechanism design”. In particular, the entry on Game Theory, Introduction to gives a broader background on the mathematical methods and tools that are used by mechanism designers, and the entry on Implementation Theory handles similar subjects to this entry from a different point of view.

Formal Model and Early Results
A social designer wishes to choose one possible outcome/alternative out of a set A of possible alternatives. There are n players, each of whom has her own preference order ≻_i over A.
This preference order is termed the player’s “type”. The set (domain) of all valid player preferences is denoted by V_i. The designer has a social choice function f : V_1 × ⋯ × V_n → A that specifies the desired alternative, given any profile of individual preferences over the alternatives. The problem is that these preferences are private information of each player: the social designer does not know them, and thus cannot simply invoke f in order to determine the social alternative. Players are assumed to be strategic, and therefore we are in a game-theoretic situation.
To implement the social choice function, the designer constructs a “game in incomplete information”, as follows. Each player is required to choose an action out of a set of possible actions A_i, and a target function g : A_1 × ⋯ × A_n → A specifies the chosen alternative, as a function of the players’ actions. A player’s choice of action may, of course, depend on her actual preference order. Furthermore, we assume an incomplete information setting, and therefore it cannot depend on any of the other players’ preferences. Thus, to play the game, player i chooses a strategy s_i : V_i → A_i. A strategy s_i(·) dominates another strategy s′_i(·) if, for every tuple of actions a_{-i} of the other players and for every preference ≻_i ∈ V_i, g(s_i(≻_i), a_{-i}) ≿_i g(a_i, a_{-i}) for any a_i ∈ A_i. In other words, no matter what the other players are doing, the player cannot improve her situation by using an action other than s_i(≻_i). A mechanism implements the social choice function f in dominant strategies if there exist dominant strategies s_1(·), …, s_n(·) such that f(≻_1, …, ≻_n) = g(s_1(≻_1), …, s_n(≻_n)) for any profile of preferences ≻_1, …, ≻_n. In other words, a mechanism implements the social choice function f if, given that players indeed play their equilibrium strategies (in this case the dominant-strategies equilibrium), the outcome of the mechanism coincides with f’s choice. The theory of mechanism design asks: given a specific problem domain (an alternative set and a domain of preferences) and a social choice function, how can we construct a mechanism that implements it (if at all)? As we shall see below, the literature uses a variety of “solution concepts” in addition to the concept of dominant-strategies equilibrium, and an impressive set of understandings has emerged.
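For finite type and action spaces, the dominance condition above can be checked by exhaustive enumeration. A small sketch (all names are illustrative, not from the text), using a two-candidate majority rule as the direct-revelation mechanism, with g = f and a player's utility equal to 1 exactly when her preferred candidate wins:

```python
from itertools import product

def majority(reports):
    # each report is a top candidate (0 or 1); ties are broken toward 0
    return 1 if sum(reports) * 2 > len(reports) else 0

def utility(true_type, outcome):
    return 1 if outcome == true_type else 0

def truth_is_dominant(f, types, n):
    """Brute-force the definition: for every player i, every true type,
    and every tuple of the others' reports, no misreport may yield an
    outcome that i strictly prefers to the truthful one."""
    for i, true_t in product(range(n), types):
        for others in product(types, repeat=n - 1):
            def outcome(report):
                return f(list(others[:i]) + [report] + list(others[i:]))
            honest = utility(true_t, outcome(true_t))
            if any(utility(true_t, outcome(lie)) > honest for lie in types):
                return False
    return True

print(truth_is_dominant(majority, [0, 1], 3))  # -> True
```

Replacing `majority` by its negation (electing the candidate the majority dislikes) makes the check fail, which matches the intuition that dominance is a property of the rule, not of the enumeration.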
The concept of implementing a function with a dominant-strategy mechanism seems at first too strong, as it requires each player to know exactly what action to take, regardless of the actions the others take. Indeed, as we will next describe in detail, if we do not make any further assumptions then this notion yields mainly impossibilities. Nevertheless, it is not completely empty, and it may be useful to start with a positive example, to illustrate the new notions defined above. Consider a voting scenario, where the society needs to choose one out of two candidates. Thus, the alternative set contains two alternatives (“candidate 1” and “candidate 2”), and each player either prefers 1 over 2, prefers 2 over 1, or is indifferent between the two. It turns out that the majority voting rule is dominant-strategy implementable, by the following mechanism: each player reports her top candidate, and the candidate that is preferred by the majority of the players is chosen. This mechanism is a “direct-revelation” mechanism, in the sense that the action space of each player is to report a preference, and g is exactly f. In a direct-revelation mechanism, the hope is that truthful reporting (i.e. s_i(≻_i) = ≻_i) is a dominant strategy. It is not hard to verify that in this two-candidate setting this is indeed the case, and hence the mechanism implements in dominant strategies the majority voting rule.

An elegant generalization for the case of a “single-peaked” domain is as follows. Assume that the alternatives are numbered as A = {a_1, …, a_n}, and the valid preferences of a player are single-peaked, in the sense that the preference order is completely determined by the choice of a peak alternative a_p. Given the peak, the preference between any two alternatives a_i, a_j is determined according to their distance from a_p, i.e. a_i ≻_i a_j if and only if |j − p| ≥ |i − p|. Now consider the social choice function f(p_1, …, p_n) = median(p_1, …, p_n), i.e. the chosen alternative is the median of all peak alternatives.

Theorem 1  Suppose that the domain of preferences is single-peaked. Then the median social choice function is implementable in dominant strategies.

Proof  Consider the direct-revelation mechanism in which each player reports a peak alternative, and the mechanism outputs the median of all peaks. Let us argue that reporting the true peak alternative is a dominant strategy. Suppose the other players reported p_{-i}, and that the true peak of player i is p_i. Let p_m be the median index. If p_i = p_m then clearly player i cannot gain by declaring a different peak. Thus, assume that p_i < p_m, and let us examine a false declaration p′_i of player i. If p′_i ≤ p_m then p_m remains the median, and the player did not gain. If p′_i > p_m then the new median is p′_m ≥ p_m, and since p_i < p_m, this is less preferred by i.
Thus, player i cannot gain by declaring a false peak alternative if the true peak alternative is smaller than or equal to the median alternative. A similar argument holds for the case of p_i > p_m. In a voting situation with two candidates, the median rule becomes the same as the majority rule, and the domain is indeed single-peaked. When we have three or more candidates, it is not hard to verify that the majority rule is different from the median rule. In addition, one can also check that the direct-revelation mechanism that uses the majority rule does not have truthfulness as a dominant strategy. Of course, many times one cannot order the candidates on a line, and any preference ordering over the candidates is plausible. What voting rules are implementable in such a setting? This question was asked by Gibbard [12] and Satterthwaite [27], who provided a beautiful and fundamental impossibility result. A domain of player preferences is
unrestricted if it contains all possible preference orderings. In our voting example, for instance, the domain is unrestricted if every ordering of the candidates is valid (in contrast to the case of a single-peaked domain). A social choice function is dictatorial if it always chooses the top alternative of a certain fixed player (the dictator).

Theorem 2 ([12,27])  Every social choice function over an unrestricted domain of preferences, with at least three alternatives in its range, that is implementable in dominant strategies must be dictatorial.

The proof of this theorem, and in fact of most other impossibility theorems in mechanism design, uses as a first step the powerful direct-revelation principle. Though the examples we have seen above use a direct-revelation mechanism, one can try to construct “complicated” mechanisms with “crazy” action spaces and outcome functions, and by this obtain dominant strategies. How should one reason about such a vast space of possible constructions? The revelation principle says that one cannot gain extra power by such complex constructions, since if there exists an implementation of a specific function then there exists a direct-revelation mechanism that implements it.

Theorem 3 (The Direct Revelation Principle)  Any implementable social choice function can also be implemented (using the same solution concept) by a direct-revelation mechanism.

Proof  Given a mechanism M that implements f, with dominant strategies s_i(·), we construct a direct-revelation mechanism M′ as follows: for any tuple of preferences ≻ = (≻_1, …, ≻_n), g′(≻) = g(s_1(≻_1), …, s_n(≻_n)). Since s_i(·) is a dominant strategy in M, we have that for any fixed ≻_{-i} ∈ V_{-i} and any ≻_i ∈ V_i, the action a_i = s_i(≻_i) is dominant when i’s type is ≻_i. Hence declaring any other type ≻̃_i, which will “produce” an action ã_i = s_i(≻̃_i), cannot increase i’s utility. Therefore, truthfully reporting ≻_i is a dominant strategy in M′. The proof uses the dominant-strategies solution concept, but any other equilibrium definition will also work, using the same argumentation.
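The construction in the proof is mechanical enough to write down directly (the helper name is illustrative): the direct-revelation mechanism simply pre-composes the indirect mechanism's outcome function g with the players' dominant strategies.

```python
def direct_revelation(g, strategies):
    """The proof's construction: from an indirect mechanism g and a
    profile of dominant strategies s_i, build g'(types) = g(s(types))."""
    def g_prime(types):
        return g([s(t) for s, t in zip(strategies, types)])
    return g_prime

# toy indirect mechanism: 0/1 ballots, majority wins (ties toward 0);
# here the dominant strategy is simply to vote for the preferred candidate
ballots_to_winner = lambda ballots: 1 if sum(ballots) * 2 > len(ballots) else 0
g_prime = direct_revelation(ballots_to_winner, [lambda t: t] * 3)
print(g_prime([1, 0, 1]))  # -> 1
```

In this toy case the strategies are the identity, so g′ = g; the interest of the construction is that it works verbatim for any indirect mechanism once its dominant strategies are known.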
Though technically very simple, the revelation principle is fundamental. It states that, when checking whether a certain function is implementable, it is enough to check the direct-revelation mechanism associated with it. If it turns out to be truthful, we may still want to implement it with an indirect mechanism that will seem more natural and “real”, but if the direct-revelation mechanism is not truthful, then there is no hope of implementing the function. The proof of the theorem of Gibbard and Satterthwaite relies on the revelation principle to focus on direct-revelation mechanisms, but this is just the beginning. The
next step is to show that any non-dictatorial function is non-implementable. The proof achieves this by an interesting reduction to Arrow’s theorem, from the field of social choice theory. This theory is concerned with the possibilities and impossibilities of social preference aggregations that exhibit desirable properties. A social welfare function F : V → R aggregates the individuals’ preferences into a single preference order over all alternatives, where R is the set of all possible preference orders over A. Arrow [2] describes a few desirable properties of a social welfare function, and shows that no social welfare function can satisfy them all:

Definition 1 (Arrow’s desirable properties)
1. A social welfare function satisfies “weak Pareto” if, whenever all individuals strictly prefer alternative a to alternative b, then in the social preference a is strictly preferred to b.
2. A social welfare function is “a dictatorship” if there exists an individual for which the social preference is always identical to his own preference.
3. A social welfare function F satisfies the “Independence of Irrelevant Alternatives” property (IIA) if, for any preference profiles R, R̃ and any a, b ∈ A,

a >_{F(R)} b and b >_{F(R̃)} a  ⇒  ∃i : a >_{R_i} b and b >_{R̃_i} a

(where a >_{R_i} b iff a is preferred over b in R_i). In other words, if the social preference between a and b was flipped when the individual preferences were changed from R to R̃, then it must be the case that some individual flipped his own preference between a and b.

Arrow’s impossibility theorem holds for the unrestricted domain of preferences, i.e. when all preference orders are possible:

Theorem 4 ([2])  Assume |A| ≥ 3. Any social welfare function over an unrestricted domain of preferences that satisfies both weak Pareto and Independence of Irrelevant Alternatives must be a dictatorship.
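IIA is the least intuitive of the three conditions, and a concrete violation helps. Borda counting satisfies weak Pareto and is clearly not a dictatorship, so by Theorem 4 it must violate IIA. The two-voter profiles below (constructed here for illustration; they are not from the text) exhibit this: voter 1 ranks a above b and voter 2 ranks b above a in both profiles, yet merely moving c flips the social comparison between a and b.

```python
def borda_scores(profile):
    """Borda counting: each ranking awards |A|-1 points to its top
    alternative, |A|-2 to the next, and so on; the social welfare
    function orders alternatives by total score."""
    scores = {}
    for ranking in profile:
        for points, alt in enumerate(reversed(ranking)):
            scores[alt] = scores.get(alt, 0) + points
    return scores

R  = [("a", "c", "b"), ("b", "a", "c")]   # a beats b socially (3 vs 2)
Rt = [("a", "b", "c"), ("b", "c", "a")]   # only c moved, yet b beats a (3 vs 2)
s, st = borda_scores(R), borda_scores(Rt)
print(s["a"] > s["b"], st["b"] > st["a"])  # -> True True
```

No voter flipped her a-versus-b preference between R and R̃, so the flip of the social a-versus-b order is exactly the failure of condition 3.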
Gibbard and Satterthwaite's proof reveals an interesting and important connection between the concept of implementation in dominant strategies and Arrow's condition of IIA. The proof shows how to construct, from a given implementable social choice function f, a social welfare function, F, that satisfies IIA and weak Pareto. In addition, F always places the alternative chosen by f as the most preferred alternative. By Arrow's theorem, the resulting social welfare function must be dictatorial. In turn, this implies that f is dictatorial. The construction of F from f is the straightforward one: the top alternative is f's choice for the original preferences, say a. Then a is lowered to be the least preferred alternative in all the preferences, and f's new choice is placed second, and so on. The interesting exercise is to show that the implementability of f implies that F satisfies Arrow's conditions. In fact, as the proof shows that any implementable social choice function f entails a social welfare function F that "extends" f and satisfies Arrow's conditions, it actually provides a strong argument for the reasonableness of Arrow's requirements: they are simply implied by the implementability requirement. In view of these strong impossibility results, it is natural to ask whether the entire concept of a mechanism can yield positive constructions. The answer is a big yes, under the "right" set of assumptions, as discussed in the next sections.
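The construction of F from f described above can be sketched directly (an illustrative snippet; the function names and the list representation of preference orders are our own):

```python
def social_welfare_from_choice(f, profile):
    """Build the social preference order F(profile) from a social choice
    function f, by the construction described above: take f's chosen
    alternative as the top, push it to the bottom of every individual
    preference, and repeat.  Illustrative sketch only."""
    profile = [list(p) for p in profile]   # copy; each p is best-to-worst
    order = []
    for _ in range(len(profile[0])):
        a = f(profile)                     # f's choice for the current profile
        order.append(a)
        for p in profile:                  # lower a to least-preferred everywhere
            p.remove(a)
            p.append(a)
    return order

# Sanity check with a dictatorial f (player 0's top choice): the
# constructed social order must equal player 0's own ranking.
dictator = lambda prof: prof[0][0]
print(social_welfare_from_choice(dictator, [["a", "b", "c"], ["c", "b", "a"]]))
# → ['a', 'b', 'c']
```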
Quasi-Linear Utilities and the VCG Mechanism

The model formalization of the previous section ignores the existence of money, or, more accurately, the fact that it has a more or less predictable effect on a player's utility. The quasi-linear utilities model takes this into account, and players are assumed to have a monetary value for each alternative. Formally, the type of a player is a valuation function v_i : A → ℝ that describes the monetary value that the player will obtain from each chosen alternative (as before, v_i is taken from a domain of valid valuations V_i, and V = V_1 × ⋯ × V_n). Note that the value of a player does not depend on the other players' values (this is termed the private value assumption). The mechanism designer can now additionally pay each player (or charge money from her), and the total utility of player i, if the chosen outcome is a and in addition she pays a price P_i, is v_i(a) − P_i. A direct mechanism for quasi-linear utilities includes an outcome function f : V → A (as before), as well as price functions p_i : V → ℝ for each player i (the definition of an indirect mechanism is the natural parallel of the definition of the previous section; the revelation principle holds for quasi-linear utilities as well, and we focus here on direct mechanisms). The implicit assumption is that a player aims to maximize her resulting utility, v_i(f(v_i, v_{-i})) − p_i(v_i, v_{-i}), and this leads us to the definition of a truthful mechanism, which parallels that of the previous section:

Definition 2 (Truthfulness, or Incentive Compatibility, or Strategy-Proofness) A direct revelation mechanism is "truthful" (or incentive-compatible, or strategy-proof) if the dominant strategy of each player is to reveal her true
type, i.e. if for every v_{-i} ∈ V_{-i} and every v_i, v'_i ∈ V_i,

v_i(f(v_i, v_{-i})) − p_i(v_i, v_{-i}) ≥ v_i(f(v'_i, v_{-i})) − p_i(v'_i, v_{-i}) .

Using this framework, we can return to the example from Sect. "Introduction" ("building a bridge"), and construct a truthful mechanism to solve it. Recall that, in this problem, a government is trying to decide on a certain public project, which costs C dollars. Each player, i, will benefit from it to an amount of v_i dollars, where this number is known only to the player herself. The government desires to build the bridge if and only if Σ_i v_i ≥ C. Clarke [8] designed the following mechanism. Each player reports a value, ṽ_i, and the bridge is built if and only if Σ_i ṽ_i ≥ C. If the bridge is not built, the price of each player is 0. If the bridge is built, then each player, i, pays the minimal value she could have declared to maintain the positive decision. More precisely, if Σ_{i′≠i} ṽ_{i′} ≥ C then she still pays zero, and otherwise she pays C − Σ_{i′≠i} ṽ_{i′}.

Theorem 5 Bidding the true value is a dominant strategy in the Clarke mechanism.

Proof Consider the truthful bid of player i, v_i, vs. another possible bid ṽ_i (fixing the bids of the other players to arbitrarily be ṽ_{-i}). If with v_i the project was rejected, then v_i < C − Σ_{i′≠i} ṽ_{i′}. In order to change the decision to an accept, the player would need to declare ṽ_i ≥ C − Σ_{i′≠i} ṽ_{i′}. In this case i's payment will be C − Σ_{i′≠i} ṽ_{i′}, which is larger than v_i, as observed above. Thus, i's resulting utility will be negative, hence bidding ṽ_i did not improve her utility. On the other hand, assume that with v_i the project is accepted. Then the player's utility from declaring v_i is non-negative, since her payment is at most v_i. Note that the price that the player pays in case of an accept does not depend on her bid. Thus, the only way to change i's utility (if at all) is to declare some ṽ_i that will cause the project to be rejected.
But in this case i's utility will be zero, hence she did not gain any benefit.

Subsequently, Groves [13] made the remarkable observation that Clarke's mechanism is in fact a special case of a much more general mechanism that solves the welfare maximization problem on any domain with private values and quasi-linear utilities. For a given set of player types v_1, …, v_n, the welfare obtained by an alternative a ∈ A is Σ_i v_i(a). A social choice function is termed a welfare maximizer if f(v) is an alternative with maximal welfare, i.e. f(v) ∈ argmax_{a∈A} { Σ_{i=1}^n v_i(a) }.

Definition 3 (VCG Mechanisms) Given a set of alternatives A, and a domain of players' types V = V_1 × ⋯ × V_n, a VCG mechanism is a direct revelation mechanism such that, for any v ∈ V,
1. f(v) ∈ argmax_{a∈A} { Σ_{i=1}^n v_i(a) }.
2. p_i(v) = −Σ_{j≠i} v_j(f(v)) + h_i(v_{-i}), where h_i : V_{-i} → ℝ is an arbitrary function.

Ignore for a moment the term h_i(v_{-i}) in the payment functions. Then the VCG mechanism has a very natural interpretation: it chooses an alternative with maximal welfare according to the reported types, and then, by making additional payments, it equates the utility of each player to that maximal welfare level.

Theorem 6 ([13]) Any VCG mechanism truthfully implements the welfare maximizing social choice function.

Proof We argue that s_i(v_i) = v_i is a dominant strategy for i. Fix any v_{-i} ∈ V_{-i} as the declarations (actions) of the other players and any v'_i ≠ v_i, and assume by contradiction that v_i(f(v_i, v_{-i})) − p_i(v_i, v_{-i}) < v_i(f(v'_i, v_{-i})) − p_i(v'_i, v_{-i}). Replacing p_i(·) with the specific VCG payment function and eliminating the term h_i(v_{-i}) from both sides, we get v_i(f(v_i, v_{-i})) + Σ_{j≠i} v_j(f(v_i, v_{-i})) < v_i(f(v'_i, v_{-i})) + Σ_{j≠i} v_j(f(v'_i, v_{-i})). Therefore, it must be that f(v_i, v_{-i}) ≠ f(v'_i, v_{-i}). Denote f(v_i, v_{-i}) = a and f(v'_i, v_{-i}) = b. The above inequality is now v_i(a) + Σ_{j≠i} v_j(a) < v_i(b) + Σ_{j≠i} v_j(b), or, equivalently, Σ_{i=1}^n v_i(a) < Σ_{i=1}^n v_i(b), a contradiction to the fact that f(v_i, v_{-i}) = a, since f(·) is a welfare maximizer.

Thus, we see that the welfare maximizing social choice function can always be implemented, no matter what the problem domain is, under the assumption of quasi-linear utilities. The VCG mechanism is named after Vickrey, whose seminal paper [28] on auction theory was the first to describe a special case of the above mechanism (the second-price auction; see the entry on auction theory for more details), after Clarke, who provided the second example, and after Groves himself, who finally pinned down the general idea.
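Vickrey's second-price auction mentioned above can be recovered directly from Definition 3: the alternatives are "player i wins", and h_i is taken as the Clarke pivot term (the maximal welfare achievable without i). The snippet below is an illustrative sketch with names of our own choosing:

```python
def vcg_single_item(bids):
    """Second-price (Vickrey) auction derived from the VCG formula:
    alternatives are 'player i wins'; v_i(alt j) = bids[i] if j == i else 0.
    Welfare maximization picks the highest bid; with the Clarke pivot term
    h_i(v_-i) = max welfare achievable without i, the winner pays the
    second-highest bid and everyone else pays zero."""
    winner = max(range(len(bids)), key=lambda i: bids[i])
    payments = []
    for i in range(len(bids)):
        welfare_without_i = max(b for j, b in enumerate(bids) if j != i)
        others_welfare_at_outcome = 0 if i == winner else bids[winner]
        payments.append(welfare_without_i - others_welfare_at_outcome)
    return winner, payments

print(vcg_single_item([5, 8, 3]))  # → (1, [0, 5, 0])
```

Player 1 wins with the highest bid 8 and pays 5, the highest competing bid, exactly p_i(v) = h_i(v_{-i}) − Σ_{j≠i} v_j(f(v)).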
Clarke's work can be viewed, in retrospect, as a suggestion for one specific form of the function h_i(v_{-i}), namely h_i(v_{-i}) = Σ_{j≠i} v_j(f(v_{-i})) (this is a slight abuse of notation, as f is defined for n players, but the intention is the straightforward one: f chooses an alternative with maximal welfare for the players other than i). This form of the h_i(·)'s gives the following property: if a player does not influence the social choice, her payment is zero, and, in general, a player pays the "monetary damage" to the other players (i.e. the welfare that the others lost) as a result of i's participation. Additionally, with Clarke's payments, a truthful player is guaranteed a non-negative utility, no matter what the others declare. This last property is termed "individual rationality".
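Specialized to the public project of Theorem 5, the Clarke payments can be computed in a few lines (a sketch; the function name and interface are ours):

```python
def clarke_public_project(C, bids):
    """Clarke's mechanism for the public project: build iff the reported
    values cover the cost C; each builder pays the minimal value she could
    have declared while keeping the decision positive (zero if the others'
    reports already cover C)."""
    build = sum(bids) >= C
    if not build:
        return False, [0] * len(bids)
    payments = []
    for i in range(len(bids)):
        others = sum(bids) - bids[i]
        payments.append(0 if others >= C else C - others)
    return True, payments

build, pay = clarke_public_project(100, [50, 40, 30])
print(build, pay)  # → True [30, 20, 10]
```

Note that the payments sum to 60 < C = 100: the budget-imbalance drawback discussed in the next section.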
The Importance of the Domain's Dimensionality

The impressive property of the VCG mechanism is its generality with respect to the domain of preferences: it can be used for any domain. On the other hand, VCG is restrictive in the sense that it can be used only to implement one specific goal, namely welfare maximization. Given the possibility that VCG presents, it is natural to ask if the assumption of quasi-linear utilities and private values allows the designer to implement many other goals. It turns out that the answer depends on the "dimensionality" of the domain, as is discussed in this section.

Single-Dimensional Domains

Consider first a domain of preferences for which the type v_i(·) can be completely described by a single number v_i, in the following way. For each player i, a subset of the alternatives are "losing" alternatives, and her value for all these alternatives is always 0. The other alternatives are "winning" alternatives, and her value for each "winning" alternative is the same, regardless of the specific alternative. Such a domain is "single-dimensional" in the sense that one single number completely describes the entire valuation vector. As before, this single number (the value for winning) is private to the player, and here this is the only private information of the player. The public project domain discussed above is an example of a single-dimensional domain: the losing alternative is the rejection of the project, and the winning alternative is the acceptance of the project. A major drawback of the VCG mechanism, in general, and with respect to the public project domain in particular, is the fact that the sum of payments is not balanced (a broader discussion of this is given in Sect. "Budget Balancedness and Bayesian Mechanism Design" below). In particular, the payments for the public project domain may not cover the entire cost of the project. Is there a different mechanism that always covers the entire cost?
The positive answer that we shall soon see crucially depends on the fact that the domain is single-dimensional, and this turns out to be true for many other problem domains as well. The following mechanism for the public project problem assumes that the designer can decide not only if the project will be built, but also which players will be allowed to use it. Thus, we now have many possible alternatives, corresponding to the different subsets of players that will be allowed to utilize the project. This is still a single-dimensional domain, as each player only cares about whether she is losing or winning, and so the alternatives, from the point of view of a specific player, can be divided into the two winning/losing subsets. The following cost-sharing mechanism was proposed by Moulin [20] in a general cost-sharing framework. The mechanism is a direct-revelation mechanism, where each player, i, first submits her winning value, v_i. The mechanism then continues in rounds, where in the first round all players are present, and in each round one or more players are declared losers and retire. Suppose that in a certain round x players remain. If all remaining players have v_i ≥ C/x, then they are declared winners, and each one pays C/x. Otherwise, all players with v_i < C/x are declared losers and "walk out", and the process repeats. If no players remain, then the project is rejected.

Clearly, the cost-sharing mechanism always recovers the cost of the project, if it is indeed accepted. But is it truthful? One can analyze it directly, to show that indeed the dominant strategy of each player is to declare her true winning value. Perhaps a better way is to understand a characterization of truthfulness for the general abstract setting of a single-dimensional domain. For simplicity, we will assume that we require mechanisms to be "normalized", i.e. that a losing player will pay exactly zero to the mechanism. Now, a mechanism is said to be "value-monotone" if a winner that increases her value always remains a winner. More formally, for all v_i ∈ V_i and v_{-i} ∈ V_{-i}, if i is a winner under the declaration (v_i, v_{-i}), then i is a winner under the declaration (v'_i, v_{-i}) for all v'_i ≥ v_i. Note that a value-monotone mechanism induces a "threshold value" function v*_i(v_{-i}) such that, for every v_{-i}, player i wins when declaring v_i > v*_i(v_{-i}), and loses when declaring v_i < v*_i(v_{-i}). Quite interestingly, this structure completely characterizes incentive compatibility in single-dimensional domains:

Theorem 7 A normalized direct-revelation mechanism for a single-dimensional domain is truthful if and only if it is value-monotone and the price of a winning player is v*_i(v_{-i}).
Proof The first observation is that the price of a winner cannot depend on her declaration v_i (only on the fact that she wins, and on the declarations of the other players). Otherwise, there are two possible bids v_i and v'_i such that i wins with both bids and pays p_i and p'_i, where p'_i < p_i. But then, if the true value of i is v_i, bidding v'_i instead of v_i will increase i's utility, contradicting truthfulness.

We now show that a truthful mechanism must be value-monotone. Assume by contradiction that a declaration of (v_i, v_{-i}) causes i to win, but a declaration of (v'_i, v_{-i}) causes i to lose, for some v'_i > v_i. Suppose that i pays p_i for winning (when the others declare v_{-i}). Since we assume a normalized mechanism, truthfulness implies that p_i ≤ v_i. But then, when the true type of a player is v'_i, her utility from declaring the truth will be zero (she loses), and she can increase her utility by declaring v_i, which will cause her to win and to pay p_i ≤ v_i < v'_i, a contradiction. Thus, a truthful mechanism must be value-monotone, and there exists a threshold value v*_i(v_{-i}).

To see that this determines p_i, let us first check the case of p_i < v*_i(v_{-i}). In this case, if the type of i is v_i with p_i < v_i < v*_i(v_{-i}), she will lose (by the definition of v*_i(v_{-i})), and by bidding some false, large enough v'_i she can win and get a positive utility of v_i − p_i. On the other hand, if p_i > v*_i(v_{-i}), then with a type v_i such that p_i > v_i > v*_i(v_{-i}), a player will have a negative utility of v_i − p_i from declaring the truth, and she can strictly increase it by losing, again a contradiction. Therefore, it must be that p_i = v*_i(v_{-i}).

To conclude, it only remains to show that a value-monotone mechanism with a price for a winner p_i = v*_i(v_{-i}) is indeed truthful. Suppose first that with the truthful declaration i wins. Then v_i > v*_i(v_{-i}) = p_i and i has a positive utility. If she changes the declaration and remains a winner, her price does not change, and if she becomes a loser her utility decreases to zero. Thus, a winner cannot increase her utility. Similarly, a loser can change her utility only by becoming a winner, i.e. by declaring some v'_i > v*_i(v_{-i}) > v_i, but since she will then pay v*_i(v_{-i}), her utility will become negative. Thus, a loser cannot increase her utility either, and the mechanism is therefore truthful.

This structure of truthful mechanisms is very powerful, and reduces the mechanism design problem to the algorithmic problem of designing monotone social choice functions. Another strong implication of this structure is the fact that the payments of a truthful mechanism are completely derived from the social choice rule.
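Both parts of this characterization can be exercised in code (an illustrative sketch, names our own): the round-based cost-sharing mechanism from above is value-monotone, and bisecting on a player's declared value recovers her threshold value, which by Theorem 7 must equal the price she is charged (here the equal share C/x):

```python
def moulin_winners(C, values):
    """Round-based cost-sharing: drop every player valuing the project
    below the current equal share C/x, until all survivors accept it."""
    remaining = list(range(len(values)))
    while remaining:
        share = C / len(remaining)
        losers = [i for i in remaining if values[i] < share]
        if not losers:
            return remaining, share    # winners, and the price each one pays
        remaining = [i for i in remaining if i not in losers]
    return [], None                    # nobody left: the project is rejected

def threshold(C, others, lo=0.0, hi=1e6, tol=1e-6):
    """Bisect for player 0's critical value v*: the smallest declaration
    with which she is still a winner, given the others' declarations.
    Assumes the threshold lies in [lo, hi]."""
    def wins(bid):
        winners, _ = moulin_winners(C, [bid] + list(others))
        return 0 in winners
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if wins(mid) else (mid, hi)
    return (lo + hi) / 2

winners, price = moulin_winners(90, [50, 50, 40])
print(winners, price)                     # → [0, 1, 2] 30.0
print(round(threshold(90, [50, 40]), 3))  # → 30.0, equal to the price charged
```

With declarations (50, 50, 40) and C = 90 all three win at share 30, and player 0's bisected threshold is exactly that share, as Theorem 7 requires.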
Consequently, if two mechanisms always choose the same set of winners and losers, then the revenues that they raise must also be equal. Myerson [21] was perhaps the first to observe this, in the context of auctions, and he named it the "revenue equivalence" theorem. As a result of this characterization, one can easily verify that the above-mentioned cost-sharing mechanism is indeed truthful: it is not hard to check that the two conditions of the theorem hold, and therefore its truthfulness follows. This is just one example of the usefulness of the characterization.

Multi-Dimensional Domains

In the more general case, when the domain is multi-dimensional, the simple characterization from above does
not fit, but it turns out that there exists a nice generalization. We describe two properties, cyclic monotonicity (Rochet [26]) and weak monotonicity (Bikhchandani et al. [7]), which achieve that. The exposition here also relies on [14]. It will be convenient to use the abstract social choice setting described above: there is a finite set A of alternatives, and each player has a type (valuation function) v_i : A → ℝ that assigns a real number to every possible alternative. v_i(a) should be interpreted as i's value for alternative a. The valuation function v_i(·) belongs to the domain V_i of all possible valuation functions. Our goal is to implement in dominant strategies the social choice function f : V_1 × ⋯ × V_n → A. As before, it is not hard to verify that the required price function of a player i may depend on her declaration only through the choice of the alternative, i.e. that it takes the form p_i : V_{-i} × A → ℝ.

In addition to the different scaling of volatility with N, other quantities also show different behaviors in the two phases. By examining the distributions of winning probabilities for a particular action after different history strings, R. Savit et al. [3] found that these distributions are completely different in the two phases. P(1|μ) is defined as the conditional probability that action "1" turns out to be the
Minority Games
Minority Games, Figure 3 The simulation results of the predictability H as a function of the control parameter α = 2^M/N for games with S = 2 strategies for each agent, averaged over 100 samples. Endogenous information and linear payoff are adopted in these simulations
Minority Games, Figure 2 The histogram of the probabilities P(1|μ) of the winning action being "1" given information μ (plotted as the decimal representations of the binary information strings), for games of N = 101 agents and S = 2, in (a) the symmetric phase with M = 5, i.e., α ≈ 0.316 < α_c, and (b) the asymmetric phase with M = 6, i.e., α ≈ 0.634 > α_c. Endogenous information and linear payoff are adopted. The histogram in (a) would have been even more uniform if step payoff were adopted, as shown in the original paper [3]
minority group after the history or information μ; the histogram of P(1|μ) is flat at 0.5 for all μ when α < α_c, as shown in Fig. 2a. For α > α_c, as shown in Fig. 2b, this histogram of P(1|μ) is not flat and uniform. This result is highly important, as it implies that below α_c there is no extractable information in the history string of length M, since the two actions have equal probability of winning (both are 0.5) for any history string. However, beyond the phase transition, when α > α_c, one can identify an unequal winning probability of the two actions by just looking at the past M winning actions of the game. Hence, we call the phase for α < α_c the unpredictable or the symmetric phase, as agents cannot predict the winning actions from
the past M-bit history (the winning probabilities are symmetric). On the contrary, the phase of α > α_c is called the predictable or the asymmetric phase, as there is a bias of winning actions given the past M-bit history string (the winning probabilities are asymmetric). Owing to the results in these histograms, a useful quantity can be defined to measure the "non-uniformity" of the winning probabilities, or the information content given by the past M-bit history string. We denote by H the predictability of the game, which is given by the following formula:

H = (1/P) Σ_{μ=1}^{P} ⟨A|μ⟩²    (8)
with P = 2^M again. H/N is plotted as a function of α in Fig. 3. In the symmetric phase, ⟨A|μ⟩ = 0 for all μ, as the actions "1" and "−1" are equally likely to appear after μ. Hence, H = 0 in the symmetric phase. In the asymmetric phase, ⟨A|μ⟩ ≠ 0 for all μ, as the actions "1" and "−1" are not equally likely after μ. Hence, H > 0 in the asymmetric phase. H begins to increase at α = α_c, as shown in Fig. 3. Analytic approaches have been developed that are based on the minimization of the predictability H [7,8,9,10]. In addition to the predictability H, the fraction of frozen agents also increases drastically before α_c and decreases afterward. Frozen agents are agents that always use the same strategy for making decisions. In contrast, fickle agents are those who keep switching strategies. In exogenous games, the phase transition, the scaling by α, and the properties of the two phases are preserved. The symmetric and asymmetric winning probabilities are
also preserved, with μ representing the random information given to the agents in the conditional probabilities P(1|μ) (not the actual past history) [30,32]. The numerical values of the volatility in the asymmetric phase deviate slightly from those in the endogenous games: because the winning probabilities are asymmetric, the probability of history appearance is non-uniform in endogenous games, while in exogenous games we assume a uniform appearance probability of all the random information being given to the agents.

Formation of Crowds, Anticrowds and Anti-Persistence in the Symmetric Phase

In the symmetric phase with small α, the amount of available information P is small when compared to the number of agents N. Agents are able to exploit the information well and they react like a crowd, which results in a large volatility. From the point of view of the strategy space, the number of independent strategies [4,5,6,31] (as discussed in RSS) is smaller than N in the symmetric phase. Many agents use identical strategies and react in the same or similar ways, forming crowds and anticrowds and giving large volatility. This is sometimes known as the herd effect in the minority game. On the other hand, in the asymmetric phase with large α, the available information P is too large and complex when compared to N. Agents are not able to exploit all the information and they react in a way similar to making random decisions, resulting in a volatility approaching the coin-toss limit. In this case, the number of independent strategies is greater than N and agents are unlikely to use the same strategies; they act independently and crowds are not formed. The phase transition occurs when α_c is roughly O(1), where N is roughly the same size as the available information P. In addition to the formation of crowds, anti-persistence of the winning actions exists in the symmetric phase.
For small α in the symmetric phase, consecutive occurrences of the same signal lead to opposite winning actions [3,11,33], which is known as anti-persistence of the Minority Game. For example, in the case of M = 2, if the history "01" leads to a winning choice of "0", then the next appearance of the history "01" will lead to a winning choice of "1". This results in periodic dynamics of the game for α < α_c, with a period of 2 × 2^M, in which every history appears exactly twice, with different winning actions for the first and second occurrences. This kind of anti-persistence disappears in the asymmetric phase. Instead, persistence is more likely [11], in which consecutive occurrences of the same signal tend to have a higher probability of giving the same winning action.
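This anti-persistence can be inspected with a minimal simulation of the basic game. The implementation below is our own illustrative sketch (linear payoff, endogenous information, random ±1 lookup-table strategies; the function name and parameters are not from the text): it records the sequence of winning actions after each history, from which the fraction of sign flips between consecutive visits can be computed.

```python
import random

def minority_game(N=101, M=2, S=2, T=1000, seed=1):
    """Minimal basic Minority Game: each agent holds S random lookup
    tables from the 2^M histories to actions ±1, plays its currently
    best-scoring one, and strategies are scored with the linear payoff."""
    rng = random.Random(seed)
    P = 2 ** M
    strat = [[[rng.choice((-1, 1)) for _ in range(P)] for _ in range(S)]
             for _ in range(N)]
    score = [[0] * S for _ in range(N)]
    mu = rng.randrange(P)             # current history, encoded as an integer
    wins = {m: [] for m in range(P)}  # winning action observed after each history
    for _ in range(T):
        A = 0
        for i in range(N):
            best = max(range(S), key=lambda s: score[i][s])
            A += strat[i][best][mu]
        win = -1 if A > 0 else 1      # minority side wins (N odd, so A != 0)
        wins[mu].append(win)
        for i in range(N):
            for s in range(S):
                score[i][s] -= strat[i][s][mu] * A   # linear payoff
        mu = ((mu << 1) | (win == 1)) % P            # append the winning bit
    return wins

wins = minority_game()
# Fraction of consecutive visits to the same history with a flipped outcome;
# deep in the symmetric phase anti-persistence pushes this toward 1.
flips = sum(a != b for w in wins.values() for a, b in zip(w, w[1:]))
pairs = sum(len(w) - 1 for w in wins.values())
print(flips / pairs)
```

Rerunning with larger M (into the asymmetric phase) should, per the discussion above, bring this flip fraction back down.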
Dependence on Temperature and Initial Conditions

In 1999, Cavagna et al. [34] introduced a probabilistic element, the temperature, into the decision-making process of agents, in a model known as the Thermal Minority Game (TMG). This stochasticity of temperature can also be implemented in the basic game. Instead of choosing the best strategy for sure, agents employ their strategy s with probability π_{i,s} given by

Prob{s_i(t) = s} = π_{i,s} = e^{Γ U_{i,s}(t)} / Σ_{s′} e^{Γ U_{i,s′}(t)}    (9)
where s_i(t) denotes the strategy being employed by agent i at time t. Γ is denoted by β in the original formulation, which corresponds to the inverse temperature (as in physical systems) of individual agents. It can also be interpreted as the learning rate of the system [10]. Roughly speaking, this is because the dynamics of the scores take a time of approximately 1/Γ to learn a difference in the cumulated payoffs of the strategies. For small Γ, the system takes a convergence time of order N/Γ to reach the steady state [35,36], which also reveals the physical meaning of Γ as the learning rate. In the asymmetric phase, the final state of the system, and hence the volatility, are independent of Γ [10]. In the symmetric phase, the final state of the system depends on Γ, and the volatility of the system increases with increasing Γ, provided that the system has reached the steady state [36]. This property of the game is in contrast to ordinary physical systems, where fluctuations increase with increasing temperature. In the Minority Game, fluctuations increase with increasing Γ, i.e., decreasing temperature, as Γ is implemented as an individual inverse temperature in choosing strategies. As a result, in addition to inverse temperature or learning rate, Γ can also be interpreted as a collective or global effective temperature of the whole system, since global fluctuations increase with Γ. In contrast to σ², the predictability H is independent of Γ in both the symmetric and asymmetric phases [10].

In addition to the dependence on Γ in the symmetric phase, the final state of the system depends on the initial conditions. For games with the same set of strategies among agents (identical quenched disorder), the final state of the system depends on the bias of the initial virtual scores (heterogeneous initial conditions of the annealed variables) of the strategies [10,33]. For the case of S = 2, the volatility of the system is smaller if larger differences are assigned to the initial virtual scores (i.e., initial bias) of the two strategies, given that the same Γ is implemented [9]. A system with a final state depending on the initial state of the annealed variables corresponds to the spin glass phase,
or the replica symmetry breaking (RSB) phenomenon in physical systems. Thus, the symmetric phase also corresponds to the behaviors of broken replica symmetry. On the other hand, in the asymmetric phase, the final state of the system, and hence the values of the volatility, are independent of the initial conditions. Thus, the asymmetric phase also corresponds to the replica symmetric (RS) phase in physical systems.

The Cases of S > 2

Finally, the behaviors of the game also depend on S, the number of strategies that each agent holds. As shown in Fig. 4, where the volatility is plotted against α = 2^M/N, the volatility of the system depends on S. Data collapse of the volatility for different values of N and M is still obtained by plotting the volatility against α, for each value of S. While the generic shape of the curves is preserved when S increases, the points of minimum volatility shift to the right, which suggests that the phase transition point is a function of S. It is also suggested in [31,37] that for the cases of S > 2, instead of α = 2^M/N, the important control parameter should be 2^{M+1}/SN. Since 2^{M+1} is the number of important strategies in the reduced strategy space and SN is the total number of strategies held by all agents, when 2^{M+1} < SN, some agents are using identical strategies and crowds and anticrowds are formed. On the other hand, when 2^{M+1} > SN, most agents are using independent strategies and crowds are not formed. Numerical solutions from the replica approach for different values
Minority Games, Figure 4 The simulation results of the volatility as a function of the control parameter α = 2^M/N for games with S = 2, 3, 4 strategies for each agent, averaged over 100 samples. Endogenous information and linear payoff are adopted in these simulations. Volatility generally increases with the number of strategies S per agent. Data collapse of the volatility is shown for different values of S
of S show the relation α_c(S) ≈ α_c(S = 2) + (S − 2)/2 to a high degree of accuracy [9].

Variants of the Minority Game

After the first publication of Arthur's bar problem and the Minority Game, many variants of the game were established and studied by the physics community and also by some economists. Some of these variant models were developed to further simplify the Minority Game or to include more features from the financial markets. Most of the modifications involve the use of different kinds of strategies and payoff functions, the presence of different kinds of agents, increased flexibility in the participation of agents, evolution of agents, replacement of poorly performing agents by new agents, and individual concerns for capital. In this section, some of these variants are briefly introduced, together with their physical significance for the development of the Minority Game. We leave their implications for the financial markets to Sect. "Minority Games and Financial Markets".

The Evolutionary Minority Game or the Genetic Model

In 1999, N.F. Johnson et al. [20] introduced the Evolutionary Minority Game, usually abbreviated as EMG and also called the Genetic Model in later literature. As the name of this model suggests, the evolution of agents is an important feature added to the game. In addition, the strategies employed by agents also undergo major modifications. Unlike the basic Minority Game, all agents in the EMG hold only one strategy (S = 1), and the strategy table is identical for everyone. For example, in the case of M = 3, all agents hold one strategy as in Table 1. Instead of having a column of fixed predictions, this column records the most recent past winning action or choice for the corresponding history. Thus, this strategy table is time dependent.
To make decisions, all agents are assigned a different probability p_i at the beginning, with 0 ≤ p_i ≤ 1, which is defined as the probability that agent i acts according to the strategy table, i.e., follows the recent winning action or the last outcome for that M-bit history. With probability 1 − p_i, agent i chooses the choice opposite to the past winning action for that history. This probability p_i (rather than the strategy table) plays the role of a strategy in making decisions for the agents and is called the "strategy" in the EMG or the "gene" value in the Genetic Model. Hence, the payoffs or scores are rewarded or penalized subject to p_i. To enhance the evolutionary property of the game, agents are allowed to modify their p_i if their score falls below a threshold denoted by d, where d < 0, which is sometimes known as the death score. The new p_i is drawn
with equal probability in the range (p_i − r/2, p_i + r/2) of width r, with either a periodic or a reflective boundary condition at p_i = 0 and p_i = 1. This corresponds to an evolution of the strategy (the probability p_i), or a mutation of the gene value p_i with mutation range r; as a result, the EMG is also known as the Genetic Model. It is found that in the ordinary EMG, with the winning rule being the minority group (A(t) < N/2), the memory M of the strategy table is not relevant in affecting the major features (including the steady-state distribution P(p_i)) of the system [38], but may be relevant with other winning rules (winning level other than N/2) [39]. Thus, the volatility is independent of M in the ordinary EMG, in contrast to the basic Minority Game. In this ordinary model, one point is added to or deducted from the score of the strategy p_i for winning or losing predictions. The possibility of having a non-unity prize-to-fine ratio R was introduced by Hod et al. [38,40]. For R > 1, agents are in a wealthy regime, since the gain is larger than the loss in one game. For R < 1, agents are in a tough regime. The system behaviors depend on R, with two thresholds R_c^(1) and R_c^(2), both less than and close to 1, with R_c^(1) > R_c^(2). In the regime where R > R_c^(1), after a sufficiently long time of evolution, the agents self-segregate into two opposing groups at p_i = 0, 1, and a "U"-shaped P(p) distribution is found, as shown in the case of R = 1 in Fig. 5. This implies that agents tend to behave in an extreme way in a rich economy. In the regime where R < R_c^(2), the agents cluster at p_i = 0.5 and an "inverse-U"-shaped distribution is found, as shown in
the case R = 0.97 in Fig. 5, implying that agents tend to be more cautious in a poor economy.

The Thermal Minority Game

In addition to the stochasticity in choosing strategies introduced by Eq. (9), discussed in Sect. "The Physical Properties of the Minority Game", the TMG contains several other modifications of the basic Minority Game [34]. In the original formulation, a strategy is a vector in the real space R^P, denoted by a_{i,s}, with |a_{i,s}| = √P. Thus, the strategy space is the surface of a P-dimensional hypersphere and the components of the strategies are continuous, in contrast to the discrete strategies of the basic Minority Game. Every agent draws S vector strategies before the game starts. The information processed by the strategies is a random vector η(t) of unit length in R^P. The response, or bid, of a strategy is no longer an integer; it is given by the inner product of the strategy and the information, i.e., a_{i,s} · η(t). Hence, the attendance A(t) is given by
A(t) = Σ_{i=1}^N a_{i,s_i(t)} · η(t)    (10)
with s_i(t) denoting the chosen strategy of agent i at time t. The cumulated payoff of a strategy is updated by

U_{i,s}(t+1) = U_{i,s}(t) − A(t) a_{i,s} · η(t)    (11)

The TMG can be considered a continuous formulation of the Minority Game, in which the game is no longer discrete and binary. Since the response of a strategy in the TMG is defined as an inner product, i.e., a sum over all P entries of the vectors, every component of a strategy contributes to the prediction at each round. This differs from the basic Minority Game, where at each round only one of the P predictions of a strategy is effective. Despite these differences and the continuous formulation, the TMG reproduces the same collective behaviors as the basic MG [34,35].

The Simple Minority Game Without Information

In this simplified Minority Game, no information is given to agents and thus they have no strategy tables. Agents choose between +1 and −1 according to the probabilities
Minority Games, Figure 5
The simulation results of the distribution P(p_i) in a game of N = 10001 agents with d = −4 and M = 3. The distribution is obtained after 120 000 time steps and averaged over 50 simulations. Self-segregation and clustering of agents at different values of R are shown
Prob{a_i(t) = ±1} = e^{±Γ U_i(t)} / (e^{Γ U_i(t)} + e^{−Γ U_i(t)})    (12)
and U_i(t) is updated by

U_i(t+1) = U_i(t) − A(t)/N    (13)

where U_i(t) can be considered as the virtual score for
agent i to make the decision "+1", and −U_i(t) correspondingly the virtual score for agent i to make the decision "−1". If U_i(t) > 0, the past experience of the agents shows that it is more successful to take action a_i(t) = +1, and vice versa. The learning rate Γ, or temperature, is implemented in this model, which gives a very simple analytic explanation of the system's dependence on Γ. For Γ < Γ_c, the volatility is found to be proportional to N, i.e., σ² ∝ N. For Γ > Γ_c, σ² ∝ N². Γ_c is found to depend on the initial conditions U_i(0); in addition, σ² decreases with increasing U_i(0). A similar dependence of the volatility on Γ and on the initial conditions, and of Γ_c on the initial conditions, is also found in the basic Minority Game.
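Equations (12) and (13) define the entire model, so it can be simulated in a few lines. The sketch below is a direct transcription; the parameter values (N, gamma, run length) are illustrative, not taken from the text.

```python
import math
import random

def simulate(N=101, gamma=0.1, steps=2000, seed=1):
    """Simulate the simple Minority Game without information.

    Each agent keeps one virtual score U_i for the action +1; the action
    +1 is chosen with the logit probability of Eq. (12), and all scores
    are updated with Eq. (13).
    """
    random.seed(seed)
    U = [0.0] * N
    attendances = []
    for _ in range(steps):
        # Eq. (12): Prob{a_i = +1} = e^{G U_i} / (e^{G U_i} + e^{-G U_i})
        acts = [1 if random.random() < 1.0 / (1.0 + math.exp(-2.0 * gamma * U[i]))
                else -1 for i in range(N)]
        A = sum(acts)
        attendances.append(A)
        # Eq. (13): U_i(t+1) = U_i(t) - A(t)/N, identical for every agent here
        U = [u - A / N for u in U]
    return attendances
```

The volatility σ² = Var[A(t)] measured on the output can then be inspected against N for different values of gamma.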
The Grand-Canonical Minority Game

The grand-canonical Minority Game (GCMG) refers to a subclass of Minority Games in which the number of agents actively participating in the market is variable. Instead of summing over all N agents, the collective action, or attendance, A(t) is effectively the sum of the actions of the active agents at time t. Agents can be active or inactive at any time, depending on their potential profitability in the market. The term "grand-canonical" originates from the grand-canonical ensemble in statistical mechanics, where the number of particles in the observed system is variable. In most formulations [23,26,28], when the highest virtual score of the strategies an agent holds is below some threshold εt (where ε is usually a positive constant referred to as the interest rate, and t is the number of rounds or time steps since the beginning of the game), the agent refrains from participating in that round of the game. This is equivalent to giving every agent an additional inactive strategy, with virtual score εt, through which the agent becomes inactive. Physically, it corresponds to gaining interest ε at each time step by keeping one's capital in cash, so agents participate in the market only if the gain from investing exceeds the interest rate. In some other formulations, the real score of the agents, instead of the virtual score of the strategies, is compared with the interest rate [24]. Winning probabilities of strategies within a certain time horizon are also considered [24,41]. Individual capital concerns can also be implemented to achieve the grand-canonical nature of the game [25,28], in which agents vary their investment size at each time step according to risk, gain potential or their limited capital. These grand-canonical modifications of the Minority Game are considered important and crucial in producing the stylized facts of financial markets in Minority Game models [24,25,26,27,28], while preserving the two-phase structure of the predictable and unpredictable phases. The stylized facts reproduced in the models include the fat-tail volatility and price-return distributions and volatility clustering when the systems are close to the critical state. Numerical tests and analytic attempts are carried out in the critical regime of these models. The models serve as a tool for physicists to understand how macroscopic features are produced from the microscopic dynamics of individuals, and they support the conjecture of self-organized criticality of the financial market, in which the market is always close to, or attracted to, the critical state.

Analytic Approaches to the Minority Game
There are several analytic approaches to solving the Minority Game. Most are based on the basic Minority Game with small modifications or simplifications. It is found that in the asymmetric phase, α > α_c, both equilibrium and dynamical approaches are likely to describe the same behavior of the system, and the equilibrium approach based on the minimization of H gives an analytic solution in this phase [7,8,9,10,12]. For the symmetric phase, α < α_c, fluctuations in the dynamics have to be considered and the solution depends on the initial conditions. The final state of the system is sensitive to initial conditions and to perturbations in the dynamics [10,12,14]. In this case, a solution is available in the limit Γ → 0, or asymptotic behaviors can be obtained in the limit α → 0. One of the early approaches to solving the Minority Game was the crowd-anticrowd theory, which provides a qualitative explanation of the dependence of the volatility on the brain size M [4,5,6]. Consider the reduced strategy space (RSS) with strategies R = 1, …, 2^M being mutually uncorrelated, and R̄ the anti-correlated strategy of R (i.e., R and R̄ always make opposite decisions); together these constitute the 2^{M+1} strategies in the RSS. We denote by n_R the number of agents using strategy R, by n_R̄ the number of agents using R̄, and by ⟨…⟩ time averaging. For uncorrelated strategies R and R', the time average ⟨Σ_{R≠R'} (n_R − n_R̄)(n_{R'} − n_{R̄'})⟩ = 0, and thus the volatility σ² can be expressed as
σ² = Σ_{R=1}^{2^M} ⟨(n_R − n_R̄)²⟩    (14)
which physically corresponds to the contribution to the global volatility from each crowd-anticrowd pair (R, R̄),
as R and R̄ always make opposite decisions. If we consider a uniform distribution of all strategy combinations among agents at the beginning of the game, n_R and n_R̄ can be determined from the ranking of the virtual scores of the strategies, since agents always use the best strategies they hold. In this case, the strategy with the highest virtual score is the most popular strategy, while its anti-correlated partner has the lowest virtual score and is the least popular. This happens for small M, where the number of strategies is small and a large number of agents use the best strategy while only a small number use its anti-correlated partner, leading to a large |n_R − n_R̄| and a large volatility. For large M, n_R is relatively small even for the best strategy, and n_R̄ may have a similar magnitude as n_R, leading to a small |n_R − n_R̄| and a small volatility. This qualitatively explains the behavior of the volatility with α in terms of the size of the crowd-anticrowd pairs. The presence of temperature is also considered in the extended crowd-anticrowd approach [6]. Beyond the crowd-anticrowd theory, a full analytic approach can be developed. To solve the Minority Game analytically, we employ some convenient notation changes which make the tools of statistical physics more applicable [7,8,9]. For the case S = 2, we label the first strategy of an agent "+1" and the second "−1", so that the best strategy of agent i at time t is expressed as s_i(t) = ±1. The real bid a_i(t) of agent i at time t can then be expressed as

a_i(t) = a_{i,s_i(t)}^{μ(t)} = ω_i^{μ(t)} + s_i(t) ξ_i^{μ(t)}    (15)
where ω_i^μ = (a_{i,+}^μ + a_{i,−}^μ)/2 and ξ_i^μ = (a_{i,+}^μ − a_{i,−}^μ)/2. The ω_i^μ and ξ_i^μ are quenched disorder, fixed at the beginning of the game; ω_i^μ, ξ_i^μ = 0, ±1 and ω_i^μ ξ_i^μ = 0 for all μ. The spin s_i(t) is the dynamic variable and appears explicitly in the action of the agents, corresponding to Ising spins in physical systems. Thus, the attendance can be expressed as a function of the spins s_i(t):

A(t) = Ω^{μ(t)} + Σ_{i=1}^N ξ_i^{μ(t)} s_i(t)    (16)
where Ω^μ = Σ_i ω_i^μ. Besides the spins s_i(t), the virtual scores of the strategies are also dynamic, and we denote the difference of the virtual scores of the two strategies of agent i by Y_i(t), given by

Y_i(t) = Γ (U_{i,+}(t) − U_{i,−}(t)) / 2    (17)

This Y_i(t) determines the relative probabilities of using the two strategies, with "inverse temperature" Γ, and is updated by

Y_i(t+1) = Y_i(t) − (Γ/N) ξ_i^{μ(t)} A(t)    (18)
which follows from the update of the individual virtual scores U_{i,+}(t) and U_{i,−}(t) in Eq. (6), with a factor of 1/N in the last term. Thus, the probabilities in Eq. (9) for using the strategies s_i(t) = ±1 at time t become

Prob{s_i(t) = ±1} = π_{i,±} = (1 ± tanh Y_i(t)) / 2    (19)
From this equation, we can calculate the time average of s_i(t) at equilibrium with the probabilities of Eq. (19), denoted by m_i:

m_i = ⟨s_i⟩ = ⟨tanh(Y_i)⟩    (20)
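The decomposition of Eqs. (15)–(16) is easy to verify numerically: with ω = (a₊ + a₋)/2 and ξ = (a₊ − a₋)/2, the bid ω + sξ reproduces a₊ for s = +1 and a₋ for s = −1, and summing the ξ-terms plus Ω recovers the attendance. A minimal check with hypothetical random strategy tables (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 20, 8

# hypothetical binary strategy tables a[i, s, mu] in {-1, +1}, S = 2;
# index s = 0 is the "+1" strategy, index s = 1 the "-1" strategy
a = rng.choice([-1, 1], size=(N, 2, P))
omega = (a[:, 0] + a[:, 1]) / 2        # omega_i^mu
xi    = (a[:, 0] - a[:, 1]) / 2        # xi_i^mu

mu = 3                                 # current history index
s  = rng.choice([-1, 1], size=N)       # spins: s_i = +1 selects the first strategy

bids = omega[:, mu] + s * xi[:, mu]            # individual bids, Eq. (15)
A = omega[:, mu].sum() + (xi[:, mu] * s).sum() # attendance, Eq. (16)
```

The assertions that `bids` coincides with the raw table entries and that A equals the summed bids hold for any draw of the disorder.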
The system becomes stationary with ⟨Y_i⟩ ≃ v_i t, corresponding to a stationary-state solution of the set of m_i. From Eq. (18), v_i can be expressed as

v_i = −⟨Ω ξ_i⟩_μ − Σ_{j=1}^N ⟨ξ_i ξ_j⟩_μ m_j    (21)
where ⟨…⟩_μ denotes the average over the P states μ. For v_i ≠ 0, ⟨Y_i⟩ diverges to ±∞ and gives m_i = ±1, corresponding to the frozen agents who always use the same strategy. For v_i = 0, ⟨Y_i⟩ remains finite even after a long time and |m_i| < 1, corresponding to the fickle agents who keep switching their active strategy even in the stationary state of the game. We can identify ⟨Ω ξ_i⟩_μ + Σ_{j≠i} ⟨ξ_i ξ_j⟩_μ m_j as an external field and ⟨ξ_i²⟩_μ as the self-interaction of agent i. For an agent to be frozen, the magnitude of the external field has to be greater than the self-interaction; for fickle agents to exist in the stationary state, the self-interaction term is crucial. We note that Eq. (21) for v_i and the corresponding conditions for frozen and fickle agents are equivalent to the minimization of the predictability H, with H written in the form

H = (1/P) Σ_{μ=1}^P [ Ω^μ + Σ_{i=1}^N ξ_i^μ m_i ]²    (22)
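Equation (22) can be evaluated directly from the quenched disorder. A minimal sketch, with a hypothetical random draw of strategies and an arbitrary vector of m_i:

```python
import numpy as np

def predictability(omega, xi, m):
    """H = (1/P) * sum_mu [Omega^mu + sum_i xi_i^mu m_i]^2, cf. Eq. (22).

    omega, xi : arrays of shape (N, P) from the decomposition of Eq. (15)
    m         : array of shape (N,), the average spins m_i in [-1, 1]
    """
    P = omega.shape[1]
    Omega = omega.sum(axis=0)          # Omega^mu = sum_i omega_i^mu
    field = Omega + m @ xi             # bracketed term for each state mu
    return (field ** 2).sum() / P

# hypothetical disorder drawn as in the decomposition of Eq. (15)
rng = np.random.default_rng(5)
a = rng.choice([-1, 1], size=(10, 2, 8))
omega, xi = (a[:, 0] + a[:, 1]) / 2, (a[:, 0] - a[:, 1]) / 2
H = predictability(omega, xi, rng.uniform(-1, 1, size=10))
```

H is non-negative by construction, and setting all ξ contributions to zero reduces it to the mean of (Ω^μ)², as Eq. (22) requires.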
Since the m_i are bounded in the range [−1, +1], H attains its minimum either at dH/dm_i = 0, giving ⟨Ω ξ_i⟩_μ + Σ_j ⟨ξ_i ξ_j⟩_μ m_j = 0 (fickle agents), or at the boundary of the range [−1, +1] of m_i, giving m_i = ±1 (frozen
agents). Thus, we can identify H as the Hamiltonian whose ground state, minimizing the Hamiltonian, is the stationary state of the system. From Eq. (16), the volatility of the system can be expressed as

σ² = H + Σ_{i=1}^N ⟨ξ_i²⟩_μ (1 − m_i²) + Σ_{i≠j} ⟨ξ_i ξ_j⟩_μ ⟨(tanh Y_i − m_i)(tanh Y_j − m_j)⟩    (23)
The last term involves the fluctuations around the average behavior of the agents and is related to the dynamics of the system. Identifying H as the Hamiltonian reduces the problem to the conventional physical problem of finding the ground state of a system by minimizing its Hamiltonian. Solving the problem involves averaging the quantity ln Z over the quenched disorder corresponding to the strategies a_{i,±}^μ (now represented by ω_i^μ and ξ_i^μ) given to the agents at the beginning of the game. We note that the system is fully connected, with every agent interacting with all other agents, and can be handled by the replica approach as in spin-glass models, under the assumption of replica symmetry. Denoting by φ the fraction of frozen agents, which can be expressed as a function of α, the critical point α_c = 0.3374… [8] is found as the solution of the equation

α = 1 − φ(α)    (24)
In addition, this approach allows us to see that systems with different initial conditions converge to the same unique solution, corresponding to a single minimum of H in the phase α > α_c (replica symmetry). The method gives a complete solution of the Minority Game for all Γ in the phase α > α_c, where macroscopic quantities such as σ² and H can be calculated analytically. For α < α_c, there are multiple minima with H = 0 and the system's final state is not unique (replica symmetry breaking) and depends on the initial state; in this case, the dynamics has to be considered. Breaking of replica symmetry is also found in the case with market impact [17]. We notice that in the long run the characteristic time of the dynamics is approximately proportional to N, over which all agents observe the performance of their strategies over all P states, with P = αN. This characteristic time is also inversely proportional to Γ, since the dynamics of the scores take a time of approximately 1/Γ to adapt to a change of scores, as discussed earlier. The real time t can then be rescaled as

τ = Γ t / N    (25)

in which one characteristic time step of the system corresponds to N/Γ real time steps t. This is the reason why systems with small Γ have a convergence time of order N/Γ. We can hence write a dynamical equation for Y_i in the rescaled time by introducing the variable y_i(τ) = Y_i(Nτ/Γ), which gives

dy_i/dτ = −⟨Ω ξ_i⟩_μ − Σ_{j=1}^N ⟨ξ_i ξ_j⟩_μ tanh(y_j) + ρ_i    (26)
where the first two terms on the right-hand side represent the average behavior of the agents, obtained from the average frequencies with which they play their strategies [10]. These two terms are deterministic. The last term ρ_i represents the noise, i.e., the fluctuations around the average behavior, whose properties are given by

⟨ρ_i(τ)⟩ = 0    (27)

⟨ρ_i(τ) ρ_j(τ')⟩ ≃ (Γ σ²/N) ⟨ξ_i ξ_j⟩_μ δ(τ − τ')    (28)
By writing the Fokker–Planck equation for the probability distribution P({y_i}, τ) [10], many physical implications can be obtained. We first note that the noise covariance in Eq. (28) is linear in Γ, revealing the role of Γ as the global temperature of the system. When Γ → 0, the noise covariance vanishes and the minimization of H gives a valid solution, even for α < α_c. It can also be deduced that in the asymmetric phase the last term in Eq. (23) vanishes, so that σ² is independent of Γ and of the initial conditions. In the symmetric phase this last term does not vanish, and σ² depends on both Γ and the initial conditions. An alternative approach for deriving dynamical equations is the generating functional approach [14,15,16], which monitors the dynamics through path integrals over time. The approach was first applied to the batch-update version of the Minority Game, in which agents update their virtual scores only after a batch of P time steps, with Γ → ∞ as in the basic Minority Game. The quenched disorder can be averaged out in the dynamical equations and, in the limit N → ∞, one obtains a representative "single-agent" dynamical equation for the variable y(t), where y(t) represents the difference of the virtual scores of the two strategies of this "single" agent after the t-th batch. The dynamics is stochastic but non-Markovian in nature, and can be extended to regions inaccessible to the replica method. This method again confirms the relation of Eq. (24) and gives the same value of α_c [14]. For α > α_c, the fraction of frozen agents is obtained analytically and the volatility is calculated to high accuracy. For α < α_c, in the limit α → 0, σ is shown to diverge as α^{−1/2} for
y(0) < y_c, and to vanish as α^{1/2} for y(0) > y_c, with y_c ≈ 0.242 [14]. This approach was later extended to the case of on-line updates (updating the virtual scores after every step) and to finite Γ [15,16].

Minority Games and Financial Markets

The basic Minority Game is a simple model describing the possible interactions of investors in financial markets. Despite its simplicity, some variants of the game show a certain predictive ability for real financial data [22,24,41]. Though Minority Games are simple, they set up a framework of agent-based models from which sophisticated trading models can be built, and implementation in real trading may be possible. Although such sophisticated models are usually used for private trading and may not be open to the public, Minority Games remain a useful tool for understanding the dynamics of financial markets. There are several fundamental differences between the basic game and real markets. The basic Minority Game is a negative-sum game, in which the sum of the gains of all agents is negative. Agents are not concerned with capital and cannot refrain from participation even if they find the game unprofitable. Whether the simple payoff function of Eq. (6) correctly represents the evaluation of strategies by real investors is questionable. We also note that the symmetric phase corresponds to a phase of informational efficiency, in which the game becomes unpredictable. Although the basic Minority Game produces rich collective behavior from simple interactions and dynamics, some details affect the behavior, and modifications have to be made in order to draw a more direct correspondence between Minority Games and real financial markets. Some variants of the Minority Game are modified to study a particular issue or aspect of real markets.
While the introduction of various financial features makes some variants of the game more realistic models of the market, it also complicates the models. Among the different aspects, one primary issue in drawing an analogy to trading is that price dynamics have to be introduced into the Minority Game. A common choice is to relate the attendance A(t) to the price p(t), so that the return r(t) in trading is given by

r(t) ≡ log[p(t+1)] − log[p(t)] = A(t) / λ    (29)

where λ is called the liquidity, which controls the sensitivity of the price to the attendance. With this or a similar price dynamics, the trading process can be defined in the game.
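Equation (29) turns an attendance series into a log-price series. A minimal sketch; `p0` and `lam` are arbitrary illustrative values:

```python
import math

def price_series(attendance, p0=100.0, lam=50.0):
    """Integrate Eq. (29): log p(t+1) - log p(t) = A(t) / lam.

    attendance : iterable of attendances A(t)
    p0         : initial price (assumed)
    lam        : the liquidity (assumed value)
    """
    log_p = math.log(p0)
    prices = [p0]
    for A in attendance:
        log_p += A / lam
        prices.append(math.exp(log_p))
    return prices
```

Working in log-price keeps the price strictly positive regardless of the sign of A(t), which is the usual reason for this choice of dynamics.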
After the introduction of price dynamics, the issue of the payoff function was also addressed. The mixed minority-majority game originally proposed by M. Marsili [18] is based on a simplified version of the Minority Game with no strategy table and no information (as discussed in Sect. "Variants of the Minority Game"). The payoff function in this model is based on the agents' expectations of the price change in the next step. For simplicity, consider the expectation E_i[A(t+1)|t] of agent i about the attendance A(t+1) in the next step, expressed as

E_i[A(t+1)|t] = −φ_i A(t)    (30)

For φ_i > 0, agents expect the attendance in the next step to be negatively correlated with that in the present step (i.e., the price fluctuates), revealing the minority nature of these agents; they are called fundamentalists or contrarians. For φ_i < 0, agents expect the attendance in the next step to be positively correlated with that in the present step (i.e., a price trend develops), revealing the majority nature of these agents; they are called trend followers. For both fundamentalists and trend followers, if they expect the price to go up in the next step, buying is considered profitable, and vice versa. Thus, the payoff δU_i(t) = U_i(t+1) − U_i(t) is proportional to the product of the current decision a_i(t) and the expected price change in the next step:

δU_i(t) ∝ a_i(t) E_i[A(t+1)|t] = −φ_i a_i(t) A(t)    (31)
with φ_i > 0 and φ_i < 0 corresponding to fundamentalists and trend followers, respectively. Hence, fundamentalists effectively play a minority game while trend followers play a majority game. The two kinds of agents interact in the same game, and it was found that the ratio of fundamentalists to trend followers is important in determining the behavior of the system. If more than half of the agents are fundamentalists, the fundamentalists prevail and the game is minority in nature, with ⟨A(t+1)A(t)⟩ < 0. On the other hand, if more than half of the agents are trend followers, the trend followers prevail and the game is majority in nature, with ⟨A(t+1)A(t)⟩ > 0. Thus, the behaviors of both minority and majority agents are self-sustained, depending on the relative populations of the agents. The $-game [19] shares some similarity with the majority nature of the trend followers in the mixed minority-majority game, but with the crucial difference that the payoff function uses the real attendance of the next step, not the agents' expectations. In the $-game, agents are again equipped with strategy tables, and the virtual score
of strategy s is updated according to

U_{i,s}(t+1) = U_{i,s}(t) + a_{i,s}(t−1) A(t)    (32)

Minority Games, Table 2
The payoff functions δU_{i,s}(t) of the Minority Game, the majority game and the $-game, with φ_i set to ±1 in the mixed minority-majority game

  Minority Game:  δU_{i,s}(t) = −a_{i,s}(t) A(t)
  Majority Game:  δU_{i,s}(t) =  a_{i,s}(t) A(t)
  $-game:         δU_{i,s}(t) =  a_{i,s}(t−1) A(t)
According to this payoff scheme, the present action a_{i,s}(t) only changes the payoff at the next step, t+1. Suppose an agent buys an asset; he gains by selling the asset in the next step if the price rises, and vice versa. This payoff scheme aims to model one-step speculation in realistic markets, although agents in the model are not restricted to acting oppositely in consecutive steps. Bubble-like behavior is found in the model, in which agents buy (sell) and push the price up (down), leading to positive evaluations of the buying (selling) actions, so that agents are more likely to buy (sell) again. This process continues and a persistent price trend is observed. Such a persistent trend is not observed in real markets, and it can be eliminated from the model if agents are concerned with their limited capital, risk or maximum holding of assets [19,21,22,28]. Table 2 summarizes the payoff functions of the Minority Game, the majority game and the $-game. Beyond the payoff functions, we can also consider the stationary state of the collective behavior of the system. The stationary state of the Minority Game is not a Nash equilibrium. There is an extremely large number of Nash equilibria in the Minority Game [17,18]; one example is (N+1)/2 agents always taking the action a_i(t) = +1 while (N−1)/2 agents always take a_i(t) = −1. In this case |A(t)| = 1 and no individual has an incentive to change his action by himself (the majority group changes if any of the losers moves). This state is not the stationary state of the game: the stationary state is described by the minimization of the predictability H, whereas Nash equilibria are states of minimum volatility σ². Physically, instead of competing with the other N − 1 players, each agent interacts with the total attendance A(t), which includes his own action. By subtracting their own actions from the attendance, their cumulated payoffs
Eq. (13) in the simplified minority game becomes

U_i(t+1) = U_i(t) − [A(t) − η a_i(t)] / N    (33)
where η denotes the market impact. The Minority Game corresponds to the case η = 0, in which a Nash equilibrium is not attained. In this simplified model, η > 0 brings heterogeneity to the behavior of the agents and a Nash equilibrium is attained [18]. In some variants of the Minority Game, the roles of the participants are studied. It has been suggested that a symbiotic relation exists between two kinds of traders [7,25,42], namely producers and speculators. Producers are agents who always participate and trade with only one strategy; they have a primary interest in trading in the market for business or other reasons. Speculators are agents who speculate and have no interest in the intrinsic values of the assets traded; they can refrain from participating in the market whenever they find it unprofitable. This model corresponds to one of the grand-canonical Minority Games. It was found that the gains of the producers are always negative, but their losses decrease with an increasing number of speculators, since speculators provide liquidity to the producers and make the market more unpredictable. On the other hand, the gain of the speculators generally increases with the number of producers, since producers inject information that makes the market more predictable for the speculators. As a result, producers and speculators are symbiotic. In addition to clarifying the roles of producers and speculators, the grand-canonical Minority Game plays a crucial role in understanding financial markets [24,26,28]. With the grand-canonical mechanism, agents can refrain from participating in the market when they find the game unprofitable and, while observing the market as non-trading outsiders, re-enter once they find it profitable again. The predictable and unpredictable phases are usually preserved in this class of models, and fat-tail distributions of the price return are found around the phase transition point.
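The refrain-or-participate mechanism of the grand-canonical game described above can be sketched as follows; `epsilon` and the list-based bookkeeping are illustrative choices, not the formulation of any particular reference:

```python
def active_agents(virtual_scores, t, epsilon=0.01):
    """Indices of agents who trade at round t under the grand-canonical rule.

    virtual_scores : list of per-agent lists of strategy virtual scores
    epsilon        : interest rate (assumed value); an agent participates
                     only if the best of its strategies has a virtual
                     score exceeding the accumulated interest epsilon * t
    """
    return [i for i, scores in enumerate(virtual_scores)
            if max(scores) > epsilon * t]

# three agents, two strategies each; at t = 100 the threshold is 1.0
participants = active_agents([[5.0, -1.0], [0.1, 0.2], [-3.0, -2.0]], t=100)
```

At t = 100 only the first agent clears the threshold εt = 1, so only that agent's bid would enter A(t); this is the source of the variable market participation.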
Near the critical point, these price-return distributions can be fitted by power laws, while outside the critical region the fluctuation distributions become Gaussian. In addition, volatility clustering, in which high volatility tends to cluster in time, is also found and can be fitted by power laws or other functional forms [25,26,28,43], suggesting that the dynamics of the agents are correlated in time. The power laws observed in the model coincide with the fat-tail price-return distributions and volatility clustering observed in real markets at high frequency [44]. While such Gaussian fluctuations are not found in real markets, these properties of the model provide important implications for, and conjectures about, financial markets being in a critical state. They also suggest the possibility of a self-organized critical system as an explanation of the behavior of financial markets. In addition to power laws, the rescaling of financial-market data of different frequencies [45] provides preliminary evidence of scale invariance in time, a property of critical systems. Although power laws in financial markets may have origins other than critical phenomena [46], these conjectures provide a potential perspective for understanding the dynamics and behavior of the markets. If financial markets do constitute a critical phenomenon, their behavior can be understood qualitatively from the underlying nature of interactions in similar systems within the same universality class; in that case, the microscopic details of the systems are not crucial to the generic behavior of systems in the same class. As financial markets are observed to operate close to informational efficiency, correspondingly, the grand-canonical Minority Games show stylized facts near the critical point of the phase transition to the unpredictable phase, i.e., the phase of informational efficiency. In some versions of the grand-canonical Minority Game, rare large fluctuations resulting from the sudden participation of a large number of speculators are also found, which draw an analogy to market crashes.

Future Directions

In view of the exciting physical picture brought by the grand-canonical Minority Games, more analytic work can be devoted to understanding the dynamics of the critical regime, around which the fat-tail distributions and volatility clustering are found. The formation of power laws and anomalous fluctuations may then be understood with analytic tools.
An analytic approach to the grand-canonical Minority Game would also provide more clues for proving or disproving the conjecture that the financial market is a self-organized critical phenomenon. On the other hand, simple modeling that reveals the dynamics of the financial market is still possible. The development of other simple models that draw a direct analogy to financial markets, together with their analytic solutions, is crucial for understanding how the markets work. Besides the grand-canonical games, some variants of the basic Minority Game are still simple and worth solving analytically. Although analytic solutions may not be available for complicated models that introduce ever more realistic aspects into the game, comprehensive modeling based on the inductive nature of agent-based models provides a new perspective for understanding financial markets.
Beyond modeling, effort may be put into implementing Minority-Game strategies in real trading. Although the strategies from the basic Minority Game and its variants may not accurately describe strategies for real trading, their simplicity leaves large freedom for expanding, modifying and tuning them toward profitability in real trading. Using Minority Games as a framework, a sophisticated real trading system can be built that includes a comprehensive picture of the trading mechanism. Predictive capacity may also be obtained from this kind of agent-based model, which goes beyond the standard economic assumptions of deductive agents and market efficiency. All these possible future directions show great potential compared with conventional statistical tools in financial analysis. Such developments may be used for private trading and may not be accessible through the public and academic literature, although some work [22,24,41] has already shown potential in this direction of application.
Bibliography

Primary Literature

1. Challet D, Zhang Y-C (1997) Emergence of cooperation and organization in an evolutionary game. Physica A 246:407
2. Arthur WB (1994) Inductive reasoning and bounded rationality: The El Farol problem. Am Econ Assoc Pap Proc 84:406–411
3. Savit R, Manuca R, Riolo R (1999) Adaptive competition, market efficiency, and phase transitions. Phys Rev Lett 82:2203–2206
4. Johnson NF, Hart M, Hui PM (1999) Crowd effects and volatility in a competitive market. Physica A 269:1
5. Hart ML, Jefferies P, Hui PM, Johnson NF (2001) Crowd-anticrowd theory of multi-agent market games. Eur Phys J B 20:547–550
6. Hart ML, Jefferies P, Johnson NF, Hui PM (2000) Generalized strategies in the minority game. Phys Rev E 63:017102
7. Challet D, Marsili M, Zhang Y-C (2000) Modeling market mechanisms with minority game. Physica A 276:284
8. Challet D, Marsili M, Zecchina R (2000) Statistical mechanics of heterogeneous agents: Minority games. Phys Rev Lett 84:1824–1827
9. Marsili M, Challet D, Zecchina R (2000) Exact solution of a modified El Farol's bar problem: Efficiency and the role of market impact. Physica A 280:522
10. Marsili M, Challet D (2001) Continuum time limit and stationary states of the minority game. Phys Rev E 64:056138
11. Challet D, Marsili M (1999) Symmetry breaking and phase transition in the minority game. Phys Rev E 60:R6271
12. Garrahan JP, Moro E, Sherrington D (2000) Continuous time dynamics of the thermal minority game. Phys Rev E 62:R9
13. Sherrington D, Moro E, Garrahan JP (2002) Statistical physics of induced correlation in a simple market. Physica A 311:527–535
14. Heimel JAF, Coolen ACC (2001) Generating functional analysis of the dynamics of the batch minority game with random external information. Phys Rev E 63:056121
15. Heimel JAF, Coolen ACC, Sherrington D (2001) Dynamics of the batch minority game with inhomogeneous decision noise. Phys Rev E 65:016126
16. Coolen ACC, Heimel JAF (2001) Dynamical solution of the online minority game. J Phys A Math Gen 34:10783
17. De Martino A, Marsili M (2001) Replica symmetry breaking in the minority game. J Phys A Math Gen 34:2525–2537
18. Marsili M (2001) Market mechanism and expectations in minority and majority games. Physica A 299:93–103
19. Andersen JV, Sornette D (2003) The $-game. Eur Phys J B 31:141
20. Johnson NF, Hui PM, Johnson R, Lo TS (1999) Self-organized segregation within an evolving population. Phys Rev Lett 82:3360–3363
21. Challet D (2008) Inter-pattern speculation: Beyond minority, majority and $-games. J Econ Dyn Control 32:85
22. Yeung CH, Wong KYM, Zhang Y-C (2008) Models of financial markets with extensive participation incentives. Phys Rev E 77:026107
23. Slanina F, Zhang Y-C (1999) Capital flow in a two-component dynamical system. Physica A 272:257–268
24. Jefferies P, Hart ML, Hui PM, Johnson NF (2001) From minority games to real world markets. Eur Phys J B 20:493–502
25. Challet D, Chessa A, Marsili M, Zhang Y-C (2000) From minority games to real markets. Quant Finance 1:168
26. Challet D, Marsili M (2003) Criticality and finite size effects in a simple realistic model of stock market. Phys Rev E 68:036132
27. Challet D, Marsili M, Zhang Y-C (2001) Stylized facts of financial markets in minority games. Physica A 299:228
28. Giardina I, Bouchaud J-P (2003) Bubbles, crashes and intermittency in agent based market models. Eur Phys J B 31:421
29. Cavagna A (1999) Irrelevance of memory in the minority game. Phys Rev E 59:R3783–R3786
30. Cavagna A (2000) Comment on adaptive competition, market efficiency, and phase transitions. Phys Rev Lett 84:1058
31. Challet D, Zhang Y-C (1998) On the minority game: Analytical and numerical studies. Physica A 256:514
32. Savit R (2000) Savit replies on comment on adaptive competition, market efficiency, and phase transitions. Phys Rev Lett 84:1059
33. Wong KYM, Lim SW, Gao Z (2005) Effects of diversity on multiagent systems: Minority games. Phys Rev E 71:066103
34. Cavagna A, Garrahan JP, Giardina I, Sherrington D (1999) A thermal model for adaptive competition in a market. Phys Rev Lett 83:4429–4432
35. Challet D, Marsili M, Zecchina R (2000) Comment on thermal model for adaptive competition in a market. Phys Rev Lett 85:5008
36. Cavagna A, Garrahan JP, Giardina I, Sherrington D (2000) Reply to comment on thermal model for adaptive competition in a market. Phys Rev Lett 85:5009
37. Zhang Y-C (1998) Modeling market mechanism with evolutionary games. Europhys News 29:51
38. Burgos E, Ceva H (2000) Self organization in a minority game: The role of memory and a probabilistic approach. Physica A 284:489
39. Kay R, Johnson NF (2004) Memory and self-induced shocks in an evolutionary population competing for limited resources. Phys Rev E 70:056101
40. Hod S, Nakar E (2002) Self-segregation versus clustering in the evolutionary minority game. Phys Rev Lett 88:238702
41. Lamper D, Howison SD, Johnson NF (2001) Predictability of large future changes in a competitive evolving population. Phys Rev Lett 88:017902
42. Zhang Y-C (1999) Towards a theory of marginally efficient markets. Physica A 269:30
43. Bouchaud J-P, Giardina I, Mezard M (2001) On a universal mechanism for long-range volatility correlations. Quant Finance 1:212–216
44. Liu Y, Gopikrishnan P, Cizeau P, Meyer M, Peng CK, Stanley HE (1999) Statistical properties of the volatility of price fluctuations. Phys Rev E 60:1390
45. Mantegna R, Stanley HE (1995) Scaling behavior in the dynamics of an economic index. Nature 376:46–49
46. Gabaix X, Gopikrishnan P, Plerou V, Stanley HE (2003) A theory of power-law distributions in financial market fluctuations. Nature 423:267
Books and Reviews
Challet D, Marsili M, Zhang Y-C (2005) Minority games. Oxford University Press, Oxford
Coolen ACC (2004) The mathematical theory of minority games. Oxford University Press, Oxford
Johnson NF, Jefferies P, Hui PM (2003) Financial market complexity. Oxford University Press, Oxford
Minority Game’s website: http://www.unifr.ch/econophysics/minority/
Mobile Agents

NIRANJAN SURI1, JAN VITEK2
1 Institute for Human and Machine Cognition, Pensacola, USA
2 Purdue University, West Lafayette, USA

Article Outline
Definition of the Subject
Introduction
Classification of Mobile Agent Capabilities
Theoretical Foundations of Mobility
Requirements Addressed by Mobile Agents
Components of a Mobile Agent System
Security
Survey of Mobile Agent Systems
Application Areas
Future Directions
Acknowledgments
Bibliography

Definition of the Subject
Mobile agents are programs that, with varying degrees of autonomy, can move between hosts across a network. Mobile agents combine the notions of mobile code, mobile computation, and mobile state. They are location-aware and can move to new network locations through explicit mobility operations. Mobile agents realize the notion of moving the computation to the data, as opposed to moving the data to the computation, which is an important paradigm for distributed computing. Mobile agents operate effectively in networks that are prone to disconnection, low bandwidth, or high latency.

Introduction
Advances in computer communications and computing power have changed the landscape of computing: computing devices ranging from the smallest embedded sensors to the largest servers are routinely interconnected and must interoperate. Their connections are often set up over untrusted and untrustworthy networks, with limited connectivity and dynamic topologies. The computational capacities of the devices, as well as the communications bandwidth between them, are in a state of constant change, and users expect computer systems to adapt dynamically to such changes. Systems should opportunistically take advantage of new resources while at the same time transparently compensating for failures of systems and communication links. Moreover, the characteristics of the applications
running on those devices are quite often dynamic, with new software added to the system at run time. While many of the characteristics of distributed systems have changed, the tools for developing distributed software have not evolved. The majority of distributed programming is still being done in languages and environments that were designed either for uni-processor hardware systems or for static software systems in which the locations and functionality of all clients and servers can be specified a priori. This entry will look at different approaches using mobile agents, a new paradigm that eases the task of developing modern distributed systems. In addition, the entry will look at programming languages and middleware designed to support mobile agents, as well as the security mechanisms required by those languages and infrastructures. Mobile agents are software agents with the additional capability to move between computers across a network connection. By movement, we mean that the running program that constitutes an agent moves from one system to another, taking with it both its code and its state information. The movement of agents may be user-directed or self-directed (i.e. autonomous). In the case of user-directed movement, agents are configured with an itinerary that dictates their movement. In the case of self-directed movement, agents may move in order to optimize their operation. Mobility may also be a combination of user- and self-directedness.

Classification of Mobile Agent Capabilities
Mobile agents encompass three basic capabilities: mobile code, mobile computation, and mobile state. These three capabilities are shown in Fig. 1 below. Each of the capabilities is an evolution of previously developed capabilities. The following subsections describe each capability.

Mobile Computation
Mobile computation involves moving a computation from one system to another.
This capability is an evolution of remote computation, which allows a system to exploit the computational resources of another system over a network connection. One of the original mechanisms for remote computation was Remote Procedure Call (RPC). Java Remote Method Invocation (RMI) is another example of remote computation as are servlets and stored procedures. The difference between mobile and remote computation is that mobile computation supports network disconnection. In a traditional remote computation model,
Mobile Agents, Figure 1 The Three Orthogonal Capabilities of Mobile Agents
the system requesting the service (the client) must remain connected to the system providing the service (the server) for the duration of the remote computation operation. Additionally, depending on the interface exposed by the server, an interaction can require an arbitrary number of messages between client and server. If network connectivity is lost, the remote computation will become an orphaned computation that will either be terminated or have its results discarded. A mobile computation, on the other hand, is an autonomous entity. Once the computation moves from the first system (which may nominally be called the client) to the second system (the server), the computation continues to execute on the server even if the client becomes disconnected. The agent returns to the client with the results of the computation when (and if) the connectivity is recovered.
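This contrast with remote computation can be illustrated with a small simulation. In the sketch below (illustrative only; the Server class and its methods are invented for this example and are not part of any real agent platform), a task submitted to a host keeps running after the client "disconnects", and its result is fetched when the client reconnects:

```python
import threading
import queue

class Server:
    """Toy host: keeps executing a submitted computation even if the
    submitting client goes away (a sketch, not a real agent platform)."""
    def __init__(self):
        self.results = {}
        self.tasks = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            task_id, fn, args = self.tasks.get()
            self.results[task_id] = fn(*args)  # runs regardless of the client
            self.tasks.task_done()

    def submit(self, task_id, fn, *args):
        self.tasks.put((task_id, fn, args))

    def fetch(self, task_id):
        return self.results.get(task_id)       # None until the task finishes

server = Server()
server.submit("sum-job", lambda xs: sum(xs), list(range(10)))
# ... the client may now disconnect; the server keeps computing ...
server.tasks.join()          # the client "reconnects" once the work is done
result = server.fetch("sum-job")
```

In a pure RPC model, the client disconnecting mid-call would orphan the computation; here the work item and its result live on the host, decoupled from the client's connection.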
Mobile Code
Mobile code is the ability to move code from one system to another. The code may be source code, which is compiled or interpreted, or binary code. Binary code may further be either machine-dependent or in some intermediate, machine-independent form. Mobile code is used in other contexts besides mobile agents. For example, system administrators use mobile code in order to remotely install or upgrade software on client systems. Similarly, a web browser uses mobile code to pull an applet or script to execute as part of a web page. Code may be mobile in two different ways: push and pull. In the push model, the system sending the code
originates the code transfer operation, whereas in the pull model, the system receiving the code originates the code transfer operation. An example of the pull model is a web browser downloading components such as applets or scripts. Remote installation is an example of the push model. Mobile agent systems use the push model of code mobility. Pull mobility is often considered to be more secure and trustworthy because the host receiving the code is the one that requested the code. Usually, the origin of the request lies in some action carried out by a user of the system, and hence pull mobility is superficially more secure. Push mobility, on the other hand, allows a system to send code to the receiving system at unexpected or unmonitored times. Hence, push mobility is less trustworthy from a user’s point of view. In practice, the overwhelming majority of security exploits encountered in distributed systems originates in careless user actions, such as executing attachments received via email. Mobile code allows systems to be extremely flexible. New capabilities can be downloaded to systems on the fly, thereby dynamically adding features or upgrading existing features. Moreover, if capabilities can be downloaded on demand, temporarily unused capabilities can also be discarded. Swapping capabilities on an as-needed basis allows systems to support small memory-constrained devices. Discarding capabilities after use can also help improve system security.

Mobile State
Mobile state is an evolution of state capture, which allows the execution state of a process to be captured. State capture has traditionally been used to checkpoint systems to protect against unexpected system failure. In the event of a failure, the execution of a process can be restarted from the last checkpointed state, thereby avoiding a complete restart from the beginning. Checkpointing is thus very useful for long-running processes.
Operating system research in the early 1980s investigated capturing entire process states, a variant of checkpointing, for load-balancing purposes, but that avenue of research proved to be a dead end owing to the coarse granularity of processes and to semantic problems arising from the impossibility of capturing operating-system resources such as open file descriptors. Mobile state allows the movement of the execution state of an agent to another system for continued execution. The key advantage provided by mobile state is that the execution of the agent does not need to restart after the agent moves to a new host. Instead, the execution continues at the very next instruction in the agent.
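The checkpointing idea behind state capture can be sketched in a few lines. The following Python fragment is a toy illustration (the file name and function are hypothetical): a job persists its progress after every step, so a rerun after a simulated crash resumes from the last checkpoint instead of restarting from the beginning.

```python
import os
import pickle
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "demo_job.ckpt")  # hypothetical path

def long_job(total, fail_at=None):
    """Sum 0..total-1, checkpointing after each step (illustrative only)."""
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            done, acc = pickle.load(f)          # resume from the last checkpoint
    else:
        done, acc = 0, 0
    for i in range(done, total):
        if i == fail_at:
            raise RuntimeError("simulated crash")
        acc += i
        with open(CKPT, "wb") as f:
            pickle.dump((i + 1, acc), f)        # persist progress after each step
    os.remove(CKPT)                             # job finished: discard checkpoint
    return acc

if os.path.exists(CKPT):
    os.remove(CKPT)                             # start from a clean slate
try:
    long_job(10, fail_at=5)                     # "crashes" halfway through
except RuntimeError:
    pass
result = long_job(10)                           # resumes at step 5, not step 0
```

Note what is captured here: only the data state (the loop index and accumulator), not the program counter. Capturing the full execution state, as mobile state requires, is exactly the harder problem the text describes.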
Not all mobile agent systems provide support for state mobility [15]. The term strong mobility is used to describe systems that can capture and move execution state with the agent. Operationally, strong mobility guarantees that all variables will have identical values and the program counter will be at the same position. Weakly mobile agent systems, on the other hand, usually support the capture of most of a program’s data, but restart the program from a predefined program point and thus require some programmer involvement at each migration. The advantage of strong mobility is that the result of migrating is well defined and easier to understand, but its disadvantage is that it is much more complex to implement efficiently. The languages and systems that support strong mobility are Telescript [57], D’Agents, NOMADS, and Ara, while weak mobility is supported by a large number of mobile agent frameworks [1,3,5,16,35,39,40,45]. Results by Bettini and De Nicola suggest that strong mobility can be translated into weak mobility without affecting the application semantics [2]. The result is partial, as it only works for single-threaded agents; further research is needed to translate multi-threaded strongly mobile agents. A recent advance in this area comes in the form of a modified Java-compatible VM. The IBM Jikes Research VM is designed to support pluggable Just-in-Time (JIT) compilers. Moreover, the VM is designed to allow recompilation of a method midstream, which requires that the state of the method be recoverable. The Mobile Jikes RVM [31] exploits this capability to provide migration of state while also providing good performance due to the Just-in-Time compiler. The most important advantage provided by strong mobility is the ability to support external asynchronous migration requests (also known as forced mobility). This allows entities other than the agent (such as other system components, an administrator, or the owner) to request that an agent be moved.
Forced mobility is useful for survivability, load-balancing, forced isolation, and replication for fault-tolerance.
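The weak-mobility discipline described above (data fields migrate, but execution always restarts at a predefined entry point) can be sketched as follows. This is an illustrative Python fragment that uses serialization as a stand-in for a network transfer; all class and method names are invented for the example.

```python
import pickle

class MigratingAgent:
    """Weakly mobile agent sketch: on arrival the host always calls run(),
    so the agent must record its own progress in a data field ('hop')."""
    def __init__(self, itinerary):
        self.itinerary = itinerary
        self.hop = 0
        self.log = []

    def run(self):
        # Predefined entry point: execution does NOT resume mid-loop; the
        # agent consults its migrated data to decide what to do next.
        if self.hop < len(self.itinerary):
            self.log.append("visited " + self.itinerary[self.hop])
            self.hop += 1
        return self.hop < len(self.itinerary)   # True -> request migration

def migrate(agent):
    """Stand-in for a network transfer: serialize, 'send', deserialize."""
    return pickle.loads(pickle.dumps(agent))

agent = MigratingAgent(["hostA", "hostB", "hostC"])
while agent.run():
    agent = migrate(agent)   # data moves; run() restarts at its entry point
```

Under strong mobility the agent could instead be captured in the middle of a loop and resume at the very next instruction on the destination host; the bookkeeping field `hop` is precisely the "programmer involvement at each migration" that weak mobility imposes.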
Classification of Mobile Systems
The classification of mobile systems by Picco and Vigna distinguishes three broad approaches to mobility and summarizes the previous description of mobility technologies:
1. Remote evaluation – Remote evaluation technologies provide means for an application to invoke services on another node by specifying the code, as well as the input data, necessary to invoke the service. The code and input data are sent to the remote node, and the remote node then executes the code and sends the output data back to the client.
2. Code on demand – This approach supports software components with dynamically loaded behavior. In this approach, code fragments are requested as they are needed, and dynamically compiled (if needed), verified, and linked into a running system.
3. Mobile agents – Mobile agents (the term ‘mobile agent’ is slightly misleading, as mobility is not restricted to agents but can be used with any software component or program; unfortunately, the literature does not differentiate between ‘mobile agents’ and ‘mobile programs’) strengthen code-on-demand with support for moving running computations. Rather than simply moving code (and possibly input data), mobile agents view a computation as a single entity and support the migration of a complete program to another node. This transfer is often seamless, so that the computation can proceed without disruption.

Remote Evaluation
Remote evaluation is doubtless the simplest way to achieve mobility. This approach is often used for system administration tasks in which small programs written in a scripting language are submitted to hosts on a secure network. Stamos coined the name [36] to describe a technique in which one computer sends another computer a request in the form of a program. The receiving computer executes the program in the request and returns the result to the sending computer. A number of papers investigated this approach in the early 1990s [34,37,38], but the only noteworthy infrastructure supporting this approach today is the SafeTCL scripting language [4,29]. The main drawback of scripting languages is that they are not suited to the development of large and reliable software systems, because they often lack the basic software engineering features (e.g. encapsulation and data hiding) needed in large systems. Furthermore, the remote evaluation paradigm is confined to classical client/server settings and does not support detached operations well.
Code on Demand
Code on demand is one of the main innovations of Sun Microsystems’ Java programming language. This approach allows applications to be delivered piecemeal. The Java execution environment, called a Java Virtual Machine (JVM), is able to find and load any missing components at run time. These components are dynamically linked into the running system, and components that are never needed for a particular application run are never sent across the network, conserving network bandwidth. While dynamic loading is not a novel concept,
the idea of allowing potentially untrusted content to be integrated into a running execution environment popularized the concept of ‘safe programming languages’. The subsection on security will look in more detail at the requirement for safety. The success of Java owes as much to the safety features that were installed to ensure security as to its dynamic nature. Microsoft’s .NET infrastructure, and in particular the Common Language Runtime (CLR), provides similar functionality, but that technology remains, as of this writing, untested and may not yet provide a level of security comparable to Java, though in the long run it is almost certainly going to play a major role. To summarize, the advantage of code-on-demand over remote evaluation is that a language such as Java is a general-purpose programming language with several mature implementations suited for building complex systems. The disadvantage of code-on-demand approaches is that software is not location-aware; in other words, the code running in a Java system cannot know where it is located, nor is it able to trigger its own migration. Standards such as Java Remote Method Invocation (RMI) extend the pure code-on-demand approach with the means to transfer data along with code under program control; they can thus be used as a basis to implement mobile agent systems but do not provide all of the functionality of a mobile agent system.

Mobile Agents
Mobile software agents improve on previous approaches by bundling code and data into computational entities that can control their own mobility. Mobile agent programs can thus control their own deployment, perform load balancing, and program distributed applications. Mobile-agent infrastructures can be implemented as extensions to code-on-demand systems [1,3,5,6,16,25,35,40,45], as extensions to remote evaluation systems, or as new programming languages [39,56].
The advantage of building on an existing language such as Java is that the existing technology can be leveraged, but this comes at the price of some conceptual complexity, since the system designer must deal with inherent limitations of Java. For example, Java does not provide adequate resource management and process isolation facilities to allow untrusted computations to execute on a trusted machine. Another advantage of mobile-agent systems is that the infrastructure, not the programmer, is in charge of migrating the state and code needed by the computation. Finally, since computations are first-class entities in mobile-agent systems, they can naturally be associated with authority, access rights, and resources. It is important to realize that, at a basic level, all of these approaches are equally powerful [30]. There is no
distributed application that can be implemented with one technology and not any other [2]. Just as we now recognize that high-level programming languages increase programmer productivity, high-level mobility abstractions increase productivity even further. The goal of mobile-agent research is to find programming abstractions that are well suited to the tasks at hand, and to provide well-engineered, efficient linguistic constructs to support these abstractions. In the long run, mobile-agent languages must be supplemented with automated tools for reasoning about programs and validating their properties (by static analysis, abstract interpretation, or model checking). The goal of research in this arena is not only to provide the verification technology but also to design languages that are amenable to verification. Just as it is much easier to verify Java code than assembly, it is easier to verify programs that use mobility explicitly than systems in which mobility is implicit. In the long run, we expect verification technologies to play an essential role in ensuring correctness and providing the kind of security guarantees expected in critical information systems.

Theoretical Foundations of Mobility
A theoretical model of a system allows formal reasoning about the system. Formal reasoning can be used to establish guarantees about the behavior of the system. Understanding the semantics of mobile computation is essential for reasoning about mobile agents. Reasoning about mobility can, in turn, yield guarantees about the correctness of mission-critical software. Researchers have explored a number of theoretical models based on process calculi with encouraging results. Two of the approaches that have been studied in this direction are Cardelli and Gordon’s ambient calculus [11] and Vitek and Castagna’s Seal calculus [50]. These models abstract both logical mobility (the mobility of programs) and physical mobility (the mobility of nodes, such as handheld devices).
The results obtained thus far include a number of type systems for controlling agent mobility [9,10,12], and a logic for stating properties of agent programs [10]. These formalizations have wider applicability as shown by an application of the ambient calculus as a query language for XML [8]. The research on foundations of mobility is actively continuing; in the long run theoretical results can be expected to feed back into languages and infrastructures. The development of language-level abstractions will simplify the task of writing mobile software agents.
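To give a flavor of these calculi, the three basic reduction rules of the ambient calculus can be stated as follows (this is the standard presentation, where $n[P]$ denotes an ambient named $n$ running process $P$, and $\mid$ is parallel composition):

```latex
\begin{align*}
n[\,\mathit{in}\ m.\,P \mid Q\,] \mid m[R] &\;\longrightarrow\; m[\,n[P \mid Q] \mid R\,] && \text{(enter)}\\
m[\,n[\,\mathit{out}\ m.\,P \mid Q\,] \mid R\,] &\;\longrightarrow\; n[P \mid Q] \mid m[R] && \text{(exit)}\\
\mathit{open}\ n.\,P \mid n[Q] &\;\longrightarrow\; P \mid Q && \text{(open)}
\end{align*}
```

An agent migrating between hosts is modeled as an ambient exercising an in- or out-capability, so properties of agent mobility become properties of these reductions, which is what makes the type systems and logics cited above possible.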
Requirements Addressed by Mobile Agents
A paradigm is as much defined by the features that are excluded as by the features that are included. We now present a list of the key required features, as well as some features that we explicitly do not expect mobile agent systems to support. Mobile agents, like most modern distributed systems, have to address five issues that require infrastructure support:
No global state – The physical size of the network and the number of hosts that can participate in a distributed computation must scale to global networks of millions of nodes (of which tens of thousands or more might be participating in a single distributed computation). At this size, there can be no assumption of shared state in the programming model, nor can there be algorithms that require synchronization, or that rely on up-to-date information.
Explicit localities – Fluctuations in bandwidth, latency, and reliability are so common that they cannot be hidden. The location of resources, and of the computation that accesses those resources, must be explicit in the programming model.
Restricted connectivity – Failures of machines and communication links can occur without warning or detection. Some of these failures may be temporary, as machines may be restarted and connections reestablished, but others may be permanent. Thus, at any time, a computation may communicate with only a subset of the network, or be fully disconnected.
Dynamic configuration – The network topology, both in the physical sense and in the logical sense of available services, changes over time. New hosts and communication links appear with no advance notice, while other hosts disappear and reappear under a new name and address. Applications should be able to adapt to these changes and dynamically reconfigure themselves. Without the ability to dynamically reconfigure their components, applications will have difficulty adapting to changes in locality and connectivity.
Security – Since security requirements vary from application to application, basic security mechanisms must be provided by the underlying infrastructure (type-safe programming language, checked array access, access control mechanisms) and must be extended with tools for automatic validation of security properties (model checking, program analysis, proof-carrying code).
These five requirements – absence of global state, explicit localities, restricted connectivity, dynamic configuration, and security – drive many of the design decisions behind current mobile agent systems. Mobile agent architectures support the above requirements. Mobile agents do not define any notion of shared or global state. Furthermore, mobile agents do not require that hosts be connected while the agents execute; in fact, mobile agents have been repeatedly advocated for disconnected operations (restricted connectivity). Finally, mobile agents, through the use of mobile code, support dynamic configuration. Security, the last issue, remains a challenge that must be addressed through infrastructural tools and services.

Components of a Mobile Agent System
Figure 2 below shows the components of a mobile agent system. This is not an exhaustive list, and not all of the components shown are essential to a mobile agent system.

Mobile Agents, Figure 2 Components of a Typical Mobile Agent System
Every mobile agent system includes an execution environment that is responsible for receiving and hosting mobile agents. Often, the execution environment runs in the background as a daemon or service so that it is always available to receive agents. The execution environment provides a layer of separation between the mobile agent and the host platform, which is important for security. Execution environments communicate with each other to provide mobility and messaging between agents. These protocols may be built on top of other network communication mechanisms provided by the infrastructure. Most mobile agent systems use an interpreted or partially compiled language for the mobile agent code. In such cases, the execution environment contains an interpreter (or virtual machine, in the case of Java). The remaining components in the execution environment are capabilities provided by the infrastructural layer. Not all of them are necessary for every mobile agent system.

Security
The main technical, and social, obstacle to approaches based on mobile software agents is security. Not only must researchers devise technical solutions, but users and organizations must also become confident enough in those solutions to permit foreign programs to migrate to and execute on their machines [14,48]. If the organization responsible for a database is not convinced of the quality of the security mechanisms, it will never allow mobile agents to visit the database. Although a mobile-agent application can still function (by having its agents access the database from across the network), the application will use more network bandwidth and suffer higher latencies. Thus the security and containment of untrusted mobile code, together with the objective analysis of proposed security solutions, is a critical research area.
When a host receives mobile code, it ideally should evaluate the security implications of executing that particular code, but at the least, it must determine the trustworthiness of the agent’s sender (and programmer). Failure to properly contain mobile code may result in serious damage to the host or in the leakage of restricted information. Such damage can be malicious (e. g. espionage or vandalism) or unintentional (programming failures or unexpected interactions with other system components). Other consequences of failing to contain mobile code include denial-of-service attacks and the infiltration of privileged networks through a downloaded Trojan horse or virus [22,28,48,52]. The symmetry of mobile-agent security concerns is remarkable as both the agent and the environment in which
it executes must be protected from each other. Through purposeful engineering on the part of its developer, an agent may seek to obtain restricted data from the host on which it is running, or to damage the host in some way. On the other hand, a host may seek to steal data from, or corrupt, the agents that migrate to it [33]. In the civilian environment, a (dishonest) company might gain an economic advantage over a competitor via a malicious agent or host, while in more critical environments such as the military, an adversary might gain a strategic or tactical advantage during an armed conflict. To relate the security issues to Java programming, we compare an agent to an applet, a Java program “embedded” inside a Web page and downloaded and executed on a user’s machine whenever that user browses the web page. The applet runs within an environment composed of several layers: the first layer is the Java Development Kit (JDK) and its class libraries, the second layer consists of the Java Virtual Machine, the third layer is the operating system, and the last layer is the host device itself. The distinction between these layers is important, since some layers may be easier to subvert than others. For instance, an application may trust a server that belongs to a known organization, but may not trust the libraries found on that server. In Java, this trust mismatch can occur if some of the classes against which an applet is linked have been downloaded from the network [18,54]. There are two different threats that must be considered when attempting to secure mobile-agent applications:
Exogenous threats – Attacks occurring outside of the mobile-agent system. For example, if a host is running both a mobile-agent system and a Web server, an adversary might attack the host via a “standard” Web server exploit, and gain access without ever attacking the mobile-agent system itself.
Endogenous threats – Threats specific to a mobile-agent system.
Horizontal hostility (malicious agents): Attacks between agents running on the same host, in which an agent tries to disrupt the execution of other colocated agents.
Vertical hostility (malicious agents and malicious hosts): Attacks against an agent by the execution environment, as well as attacks against the environment by an agent.
In the remainder, we consider only endogenous threats, as they are specific to mobile agents. There are two different viewpoints to take into account:
For a host, it is necessary to provide protection mechanisms so that agents cannot attack each other (horizontal protection) or the host itself (vertical protection).
For an agent, it may be necessary to protect it from attacks initiated by the host (hostile host) and by other agents (horizontal).
We now consider each issue in turn.

Malicious Agents
A number of techniques have been used in the past to place protection boundaries between so-called “untrusted code” moved to a host and the remainder of the software running on that host. Traditional operating systems use virtual memory to enforce protection between processes [13]. A process cannot read or write another process’s memory, and communication between processes requires traps to the kernel. By limiting the traps an untrusted process can invoke, it can be isolated to varying degrees from other processes on the host. However, there is usually little point in sending a computation to a host if the computation cannot interact with other computations there, load balancing being the only exception [27]. In the context of mobile agent systems, an attractive alternative to operating system protection mechanisms is to use language-based protection mechanisms [52]. The attraction of language-based protection is twofold: precision of protection and performance. Language-based mechanisms allow access rights to be placed with more precision than traditional virtual-memory systems, and the cost of crossing protection boundaries can often be reduced to zero, since checking is moved from runtime to the language compiler [17,23].

Safe Languages
The requirement for language-based security is, first and foremost, safe languages. A safe language is a language that enforces memory safety and type safety. In other words, a safe language does not permit arbitrary memory modifications, and carefully constrains how data of one type is transformed into data of another type.
The objective of a safe language is to permit reasoning about security properties of programs at the source level in a compositional manner. It should be possible to check the code with automatic tools to obtain the guarantee that the mobile agent is not malicious. For this to be true, it is necessary to produce assured implementations of safe languages, that is, implementations that do not contain hidden security vulnerabilities. Well-known examples of safe languages include Java and SafeTCL [18,29]. The Telescript agent language is an example of a safe language explicitly designed for secure mobile code [44]. Traditional languages, such as C, do not ensure memory or type safety, and thus it is much more difficult to obtain trust in agents written in those languages. Even in a safe language, there are many opportunities for security exploits. After many years, the research community is closer to producing assured implementations of the Java programming language, but much work remains. The survey by Hartel and Moreau lists several hundred papers on formalizing aspects of Java [20]. The difficulty in obtaining a clear specification of all aspects of Java underscores the need for research in semantics and formal techniques, without which there can be no hope of obtaining any assurance. Sandboxing Protection against vertical attacks is achieved by enforcing a separation between the user code and the system, a technique popularized by the well-known Java-sandbox security model; related approaches have been used in operating systems [18,19,24]. In this model, user code runs with restricted access rights within the same address space as the system code. Security relies on type safety, language access control mechanisms, and dynamic checks. Over the years, a number of faults were discovered and fixed in this model [17,23,54,55]. The sandbox model is a basis for building more powerful security architectures that are suited to agent systems [18]. Sandboxing alone does not provide protection against horizontal attacks. For this, it is necessary to extend the protection model to include protection domains, which constrain how one agent can interact with another. Protection domains can be constructed in a safe language by providing a separate namespace for each component. Fully disjoint namespaces are not desirable, as they result in disjoint applications [27]. Instead, if mobile agents must interact, some degree of sharing among namespaces is necessary.
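The idea of namespace-based protection domains can be illustrated with a toy sketch (written in Python for brevity; the systems discussed here are Java-based, and exec-style confinement is emphatically not a real sandbox). Each agent runs against a namespace containing only the services it has been explicitly granted, so sharing happens only through those grants:

```python
# Toy sketch of language-level protection domains (illustrative only --
# exec-based confinement is NOT a secure sandbox). Each agent runs in its
# own namespace; sharing happens only through explicitly granted services.

def make_domain(granted_services):
    """Create a restricted namespace for one agent."""
    return {
        "__builtins__": {"len": len},  # minimal builtins
        **granted_services,            # explicit grants
    }

def run_agent(code, domain):
    exec(code, domain)                 # agent sees only its own domain
    return domain

# A host service that two agents are allowed to share.
mailbox = []
def send(msg):
    mailbox.append(msg)

agent_a = "send('hello from A')"
agent_b = "send('hello from B'); received = len(mailbox_view)"

run_agent(agent_a, make_domain({"send": send}))
dom_b = run_agent(agent_b, make_domain({"send": send, "mailbox_view": mailbox}))

print(mailbox)            # both agents reached the shared service
print(dom_b["received"])  # 2 -- B was granted read access, A was not
```

Here agent B was granted read access to the shared mailbox while agent A was not: disjoint namespaces plus explicit grants give the kind of controlled sharing described above.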
Several research systems have tried to provide better isolation of Java applications [7,53], but these attempts achieved limited success due to the constraint of working above commercial Java Virtual Machines (which allow only certain kinds of extensions to Java’s basic security mechanisms). In the future, protection domains must be integrated into the Java Virtual Machine definition. Denial of service Mobile agents can mount denial of service attacks by using an excessive amount of CPU or memory. An environment for mobile agents therefore must provide support for tracking memory and CPU usage, as well as support for termination. Termination implies stopping all threads of an agent and reclaiming its memory. Current Java systems fail to protect the Java Virtual Machine against denial of service attacks,
since they support neither resource accounting nor full agent termination. Providing efficient accounting and termination support in a language-based system remains an open research problem. Beyond safe languages Safe languages must ensure that an agent's code obeys certain well-formedness rules. In the case of Java, this assurance is obtained by verifying the bytecode of incoming agents with a complex data-flow analysis algorithm [17] and by imposing some constraints on how programs may be linked [23]. A large body of research on proof-carrying code [28] seeks to broaden the set of agent languages to include traditional unsafe languages such as C. Proof-carrying code associates a security proof with each program. The host need only check that the proof matches the program to determine whether the program obeys the desired security properties. Checking a proof against a program is computationally much easier than analyzing the code directly to generate the proof, making proof-carrying code an attractive approach. This direction of research is encouraging, as it may allow the expression of complex security properties and verification of a program's compliance with those properties. Malicious Hosts In mobile-agent computing, an agent's owner must be able to trust that the agent is not subverted when visiting a series of servers, some of which may have been compromised and made capable of malicious action against the agent [33]. Malicious servers are a particularly difficult problem, since the server must have access to all of the agent's code in order to execute it. A small body of research has attempted to solve this problem. The solutions fall into the following categories:
1. code signing,
2. replication,
3. partial result authentication codes,
4. proof verification,
5. code obfuscation, and
6. encrypted functions.
We will assess each approach in the following paragraphs. Code Signing can be used to protect agents from malicious hosts by attaching digital signatures to the code of the mobile agent. Code signing has been used by Sun Microsystems and Microsoft to provide guarantees of authenticity for downloaded code. The technology can be used to ensure that a server has not altered the code
of an agent while in transit. Code signing does not protect the agent's data from being modified, nor does it prevent the server from accessing the information contained in the agent, but it does provide a basic level of assurance that is essential for some applications. Furthermore, in a network in which servers cannot be compromised and agents come from a single source, code signing may be the best solution to security. Replication was studied as a general method for securing mobile agent computations, marrying ideas from the fields of fault tolerance and cryptography [32,51]. The approach relies on the replication of agents and servers: the same agent computation is performed on several servers, and voting can then be used to move from one phase of a distributed computation to the next. While replication enjoys some pleasing theoretical properties, it is heavily restricted in practice. It supposes that computations are deterministic and that several servers with the same resources are available. Its connectivity assumptions are also not appropriate in scenarios involving unreliable networks. Partial Result Authentication Codes (PRACs) are very similar to message authentication codes (MACs). Instead of authenticating the origin of a message, however, they authenticate the correctness of an intermediate agent state or partial result. For example, if we consider the values of selected program variables at some point during execution, we can determine whether those values could possibly have arisen from normal program execution. If not, the program has been altered in some way. PRACs are computationally cheaper than digital signatures and have a slightly different security property (forward integrity): if an agent visits n servers and server m (m < n) is malicious, the results of servers 1 to m − 1 cannot be falsified. In some scenarios, mobile agents do not have intermediate results.
Nevertheless, this approach can be used to ensure that the results of a disconnected query are truthful [21]. Proof Verification is an approach in which a digitally signed trace of an agent's computation is returned along with the result. This trace can then be validated: a malicious host that altered the agent's results would produce a trace that does not correspond to a valid computation [49]. Although techniques for producing compact traces have been developed, the size and complexity of the trace remain an issue. Code Obfuscation aims at protecting a mobile agent's functioning by making it very hard to divert the agent's execution in a meaningful way. This is achieved by transforming the code so that automated reverse engineering cannot be applied, which prevents a malicious host from locating the places in the code that it would have to modify or execute in a non-conformant way. Variables can also be split all over the program in order to make simple read-outs impossible. Although there are many tools and commercial products that use code obfuscation, especially in the field of digital rights management, recent theoretical results point to the impossibility of achieving perfect obfuscation [26]. Secure Coprocessors involve building a trusted execution environment for agents within a secure coprocessor [58,59]. This approach is based on tamper-proof hardware and public key infrastructures. Some experimental systems have been designed, but not validated. Secure coprocessors have the potential to provide appropriate security for mobile-agent programs, but at the cost of upgrading to more expensive hardware. Secure coprocessors will be useful in some applications, but not in all (or even the majority). However, trusted coprocessors might be the only hard security anchor available today for securing mobile-agent applications. Encrypted Functions that can be executed in their encrypted form are a software-only cryptographic approach to the malicious host problem. If available, this would be the ideal way to protect any mobile agent and its payload. Computing with encrypted functions was demonstrated in [33], and another system was proposed in [46]. The conclusion is that, for special functions, it is possible to let a mobile agent protect itself from a malicious host without having to rely on trusted hardware or on-line help from remote agents. However, the solutions proposed seem impractical by today's standards, and no implementation has been reported so far.
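Of the approaches above, partial result authentication codes are simple enough to sketch concretely. The following is an illustrative forward-integrity construction (a sketch in Python, not the exact scheme from the literature): each server MACs its partial result with the current key, hashes the key forward, and discards the old key, so a server compromised at hop m cannot forge the results of hops 1 to m − 1:

```python
# Sketch of partial result authentication codes with forward integrity
# (illustrative construction; key names and messages are made up).
# Each server MACs its partial result with the current key, then hashes
# the key forward and discards the old one. A server compromised at hop m
# holds only k_m, so it cannot re-MAC (falsify) results of earlier hops.
import hashlib, hmac

def next_key(k):
    return hashlib.sha256(k).digest()          # one-way key update

def prac(key, result):
    return hmac.new(key, result, "sha256").digest()

# The agent visits three servers, each appending (result, PRAC) to its state.
k = b"owner-secret-k0"                          # known only to the owner
state = []
for result in [b"partial-1", b"partial-2", b"partial-3"]:
    state.append((result, prac(k, result)))
    k = next_key(k)                             # old key is discarded

# Back home, the owner re-derives the key chain and checks every hop.
k = b"owner-secret-k0"
for result, tag in state:
    assert hmac.compare_digest(tag, prac(k, result))
    k = next_key(k)
print("all partial results verified")
```

Because the key update is one-way, a malicious third server cannot recover the first server's key and therefore cannot produce a valid tag for a falsified first result.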
Survey of Mobile Agent Systems To summarize, we now review some of the popular mobile-agent systems and classify them in Table 1 according to the following characteristics: the type of mobility supported by the language or system, the language(s) in which agents are written, whether host mobility (mobile devices) is supported, and the quality, availability, and current level of support of the implementation. The quality-of-implementation column distinguishes products from research prototypes. The availability column indicates which systems can be freely used and which require a license. Finally, the level-of-support column indicates whether the project is still active and whether assistance is forthcoming.
The general conclusion that emerges is that weak mobility is by far the predominant approach. The only commercial strongly mobile system (Telescript) was discontinued several years ago, and the other two strongly mobile systems (D'Agents and NOMADS) are university research prototypes. Java is the most popular agent implementation language, due both to the popularity of the language and to its support for dynamic loading and advanced security features. Most available systems are research prototypes, but they have the advantage of being open source and thus can be used as a starting point for further development. A number of these projects have an active developer community, an important factor for the adoption of an infrastructure, although it is important to note that most open-source systems do not enjoy the same kind of support as commercial products. While Table 1 is not an exhaustive list of existing mobile-agent systems (over 100 such systems have been implemented in the last five years), it provides a good overview of the most influential systems. The main conclusion to draw is that Java is the common thread in most current mobile-agent implementations. Another, more recent technology is Microsoft's .NET with the Common Language Runtime. Like Java, .NET is based on a virtual machine and supports the dynamic loading of programs. The suitability of .NET for mobile-agent applications has not yet been completely evaluated. Application Areas Information retrieval was the original application area envisioned for mobile agents. In bandwidth-sensitive environments, moving the mobile agent to the data, where the agent can select a small desired subset, is more efficient than moving large amounts of data to the user. Mobile agents allow bandwidth conservation and latency reduction in many information-retrieval and management applications.
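The bandwidth argument can be made concrete with a small sketch (the names DataServer and evaluate are illustrative, not an actual API): instead of pulling a large table across the network, the client ships a small selection function and receives only the matching rows.

```python
# Toy sketch of moving the query to the data (all names illustrative).
# A remote data server is modelled here as a simple in-process object.
class DataServer:
    def __init__(self, rows):
        self.rows = rows
    def evaluate(self, query):      # remote-evaluation entry point
        return [r for r in self.rows if query(r)]

server = DataServer([{"id": i, "score": i * 7 % 10} for i in range(10000)])

# The "agent": a predicate shipped to the server; only the matching
# subset comes back instead of all 10000 rows.
matches = server.evaluate(lambda r: r["score"] == 3)
print(len(matches))   # 1000 of 10000 rows cross the "network"
```

The same shape underlies remote evaluation and mobile-agent queries: the code moves to the data, and only the (small) result moves back.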
A mobile agent that performs a multi-step query against multiple databases can be dispatched close to the location of the databases, avoiding the transmission of intermediate results across the network. Similarly, in an unreliable network environment, the same mobile agent can continue its query task even if the network goes down temporarily. But mobile agents can support or improve performance in many other application areas. Mobile agents can relocate themselves during execution, an essential property for reactive and adaptive systems that must respond to changing execution environments. As soon as a change in operating conditions is detected, a mobile-agent application can reconfigure itself to relocate its computations
away from a physical attack on the network or closer to a critical database after a drop in network bandwidth. Mobility also applies to the application's data itself, enabling new styles of proactive applications in which system-wide actions can be performed even before any clients are present. An example is the pre-caching of datasets at remote geographical locations to ensure instant access to essential information. Mobile agents, with their capability to move code, allow new capabilities to be pushed dynamically to platforms. This capability is useful when existing systems need to be retasked or used for purposes not originally intended when they were deployed. Migration of capabilities is also important to accommodate changing circumstances and environmental conditions.

Mobile agents also support disconnected operations. In a mobile system, a client can send agents to a server before disconnecting. The agents can then perform their task while the client is unreachable and communicate results back whenever connectivity is regained. Similarly, a server might send a component to a handheld or other portable device to further reduce connectivity requirements.

Even if the network is stable, mobility still allows bandwidth conservation and latency reduction. For example, if a client application needs to perform a complex multi-step query, it can send the query code to the network location of the databases, avoiding the transmission of intermediate results across the network. Although the database developers could add a new database operation that performed the complex query, it is unreasonable to expect that developers can predict and address every client need in advance. Mobile agents allow a client application to make efficient use of network resources even when the available services expose low-level, application-independent interfaces.

In all four cases – changing network conditions, disconnected operation, bandwidth conservation, and latency reduction – the common thread is dynamic deployment and reconfiguration. Traditional programming languages constrain designers to commit to a particular system structure at build time. The choice of whether a particular service is implemented on the client or server side must be made early and cannot be revisited if some of the initial assumptions about the application turn out to be invalid. Mobile agents, on the other hand, decouple system design from system deployment and turn control over deployment to the applications themselves, allowing much more flexible design patterns. An application can deploy its components to the most attractive network locations and redeploy those components when network conditions change, leading to more efficient use of available resources and faster completion times.

Mobile Agents, Table 1 Selected popular mobile agent systems and their key properties

Agent system | Mobility support      | Base language     | Host mobility | Quality of implementation | Availability | Level of support | Special features
Telescript   | Mobile agent (strong) | Telescript        | No            | Product                   | Discontinued | None             | First MA system
NOMADS       | Mobile agent (strong) | Java              | No            | Prototype                 | Free         | Medium           | Resource control
Java         | Code on demand        | Java              | No            | Product                   | Commercial   | High             |
SafeTCL      | Remote evaluation     | Tcl               | No            | Product                   | Open source  | High             |
D'Agents     | Mobile agent (strong) | Java, Tcl, Scheme | No            | Prototype                 | Open source  | Medium           | Multi-language
JavaSeal     | Mobile agent (weak)   | Java              | No            | Prototype                 | Open source  | Low              | Seal Calculus
Mole         | Mobile agent (weak)   | Java              | No            | Prototype                 | Open source  | Low              |
Aglets       | Mobile agent (weak)   | Java              | No            | Product                   | Open source  | Medium           |
Lime         | Mobile agent (weak)   | Java              | Yes           | Prototype                 | Open source  | Medium           | Coordination via tuple spaces
Messenger    | Mobile code (weak)    | M0                | No            | Prototype                 | Open source  | Low              |

Additional application areas are discussed below.

Distributed Sensor Grids Mobile agents can migrate to key locations in a network of autonomous sensors and then filter the collected sensor data to reduce bandwidth requirements and implement local management policies. Mobile agents are particularly useful when it is not possible to pre-install stationary agents on all of the sensors. In particular, the filtering algorithm, which can be of arbitrary complexity, may depend on the phenomenon being observed and change over time. For example, different filtering agents can be used to achieve different levels of accuracy. For high-end sensors, onboard processing is possible, and agents can be sent to the sensors themselves; in other cases, agents can be sent to routers or gateway machines within and at the edge of the sensor field. The agents can be relocated on the fly, allowing the sensor application to optimize the placement of its management and filtering code with respect to current network loads. By using automated monitoring algorithms, mobile agents can act as customized monitors that wait for phenomena of interest to be observed at distributed sensor locations and then notify human operators. Such a capability can significantly reduce the workload of those operators. In addition to acting as on-line filters, mobile agents can also help perform data mining on off-line sensor data. In particular, mobile agents can move to several sensors and efficiently correlate information across the multiple sensors in order to classify observed phenomena. Mobile agent technology can be particularly useful in situations where both the sensors and the client systems are computationally weak and constrained by battery power. This is a common occurrence in sensor networks where small sensors (such as unattended ground sensors) are tasked by users with small PDA or cell phone client devices. Mobile agents carrying specific filtering, fusion, or other algorithms can be pushed from the PDA or cell phone to opportunistically discovered nodes in the network fabric, thereby supporting the weak sensor and client platforms.
Given the adaptive capabilities of mobile agents, they can also relocate themselves to other intermediate nodes when network configurations change [42,47]. Unmanned Autonomous Systems Unmanned autonomous systems require varying degrees of remote tasking and monitoring but often operate in extreme circumstances with varying network connectivity. Mobile agents allow the dynamic configuration of the software deployed on unmanned autonomous systems. New capabilities, such as new algorithms, missions, and functions that were not anticipated when the system was originally designed and deployed, can be deployed dynamically while vehicles and sensors are in the field. This functionality is already used in outer-space exploration projects, where communication latencies and physical inaccessibility to devices require mobile-agent-like approaches. Fully embracing mobile-agent architectures will enable finer granularity and improve ease of use. Mobile agent technologies further allow the exchange of functions between platforms, providing a very efficient and localized software update mechanism. Certain kinds of autonomous vehicles, such as undersea vehicles, lose network connectivity while submerged. The capability of mobile agents to support disconnected operation is important under these circumstances. Space exploration vehicles such as the Mars rovers present a variation of the network disconnection problem: extremely long network latencies. Mobile agents, by moving themselves to the remote nodes, overcome latency problems. Finally, mobile agents can reduce the bandwidth requirements for relaying data from unmanned vehicles by applying the necessary filtering and fusion algorithms on the platform or close to it. High Availability Systems Some large-scale distributed systems must be able to evolve and cannot tolerate downtime. Mobile agents provide a technology for dynamically upgrading such distributed systems with new procedures without interrupting their operation. Furthermore, mobile agents can provide additional fault tolerance, since an agent can be duplicated at any point of its computation and stored on secondary storage or sent to another platform. Mobile-agent technology also provides an attractive way to implement distributed monitoring tools that enforce nonlocal security policies over large-scale networks. Such nonlocal monitoring can detect distributed denial-of-service (DDoS) and other large-scale attacks and help pinpoint the source of those attacks. Through the use of mobile agents, this monitoring functionality can be deployed dynamically in response to previous attacks. For example, if a network is under an extensive DDoS attack, the DDoS monitoring code can be dispatched on the fly to a larger number of machines.
Similarly, if a network comes under a previously unknown attack, mobile agents can be implemented and deployed to detect that particular attack on the fly. The ability to dynamically distribute existing and new monitoring code makes networks more robust to cyber-attack, including attacks that are encountered for the first time. Agile Computing Agile computing may be defined as opportunistically (or on user demand) discovering and taking advantage of
available resources in order to improve capability, performance, efficiency, fault tolerance, and survivability [43]. The term agile is used to highlight both the need to react quickly to changes in the environment and the ability to take advantage of transient resources that are only available for short periods of time. Agile computing is a promising approach to building information systems that need to operate in highly dynamic environments. Mobile agents are one approach to building agile computing systems. Mobility of code is important in order to be opportunistic: if an idle resource is found, the likelihood that the system already has the code for a particular service or capability is small, and mobile code can be used to dynamically push the necessary services to such systems. State migration is also important for achieving a high degree of agility: without it, it would be difficult to quickly and dynamically move computations to and away from systems as resource availability changes.
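The state-migration requirement can be sketched as weak mobility: only the agent's data state is captured and shipped, and execution restarts at a designated entry point on each host, with no call-stack capture, as in most Java-based systems. The following Python sketch is purely illustrative; the host names and state fields are hypothetical.

```python
# Sketch of weak mobility: capture the agent's data state, ship it, and
# resume at a fixed entry point on the new host (no call-stack capture).
# All names here are illustrative.
import pickle

def agent_entry(state):
    """Entry point re-invoked on every host the agent visits."""
    state["hops"].append(state["next_host"])
    state["total"] += state["work_done_here"]
    return state

def migrate(state, host_name):
    blob = pickle.dumps(state)      # capture data state only
    # ... network transfer to host_name would happen here ...
    arrived = pickle.loads(blob)    # receiving host reconstructs the agent
    arrived["next_host"] = host_name
    return agent_entry(arrived)

state = {"hops": [], "total": 0, "work_done_here": 5, "next_host": None}
state = migrate(state, "host-A")
state = migrate(state, "host-B")
print(state["hops"], state["total"])   # ['host-A', 'host-B'] 10
```

Strong mobility would additionally capture the running call stack so execution could resume mid-statement; as the survey above notes, most deployed systems settle for the weaker model sketched here.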
Future Directions

Active research in mobile agents has petered out, with most researchers having migrated to other areas. The ideas and technologies heralded by mobile agents were extensions of previous concepts that were integrated together; for example, code mobility extended remote invocation, and state mobility extended process migration. To some extent, the research area of mobile agents ended, but the ideas continue to exist, evolve, and be used under different labels. The field of mobile agents also suffered from an identity crisis caused by the use of the word agent. Researchers in the areas of Autonomous Agents and Multi-Agent Systems regarded mobile agents as primarily belonging to distributed systems and networking. On the other hand, the distributed-systems community did not like the connotations of the word "agent" and hence did not adopt the terminology of mobile agents, even though the concepts introduced are important and tend to be used anyway. Much of the early research in the area of mobile agents was published as part of the ECOOP Mobile Object Systems workshop series, the Mobile Agents conferences (temporarily renamed Agent Systems and Applications / Mobile Agents), and to a lesser extent the Autonomous Agents and Multi-Agent Systems conference. The last Mobile Agents conference was held in 2002. The Agents, Interactions, Mobility, and Systems (AIMS) track at the ACM Symposium on Applied Computing was the last venue for publishing mobile-agents-related papers. A collection of these papers, revised and extended, was published in a special issue entitled Mobile Software Agents in the journal Scalable Computing: Practice and Experience. A recent book on mobile agents [41] provides a review and survey of mobile agents and concepts and can serve as a reference or a textbook for a course on mobile agents. One key unsolved research question with mobile agents continues to be the protection of a mobile agent that is executing on a remote untrusted host. This remains a challenge with no promising solutions on the horizon.

Acknowledgments

Portions of this chapter were initially written for the Information Technology Assessment Consortium (ITAC) at the Institute for Human & Machine Cognition (IHMC) as part of a report entitled Software Agents for the Warfighter, sponsored by NASA and DARPA, and edited by Jeffrey Bradshaw at IHMC.

Bibliography

Primary Literature
1. Baumann J, Hohl F, Rothermel K, Straßer M (1998) Mole – concepts of a mobile agent system. WWW Journal, Special issue on Appl Tech Web Agents (to appear) 2. Bettini L, De Nicola R (2001) Translating strong mobility into weak mobility. Mobile Agents 3. Binder W (2001) Design and implementation of the J-SEAL2 mobile agent kernel. In: The 2001 Symposium on Applications and the Internet (SAINT-2001), San Diego, January 2001 4. Borenstein NE (1994) E-mail with a mind of its own: The SafeTcl language for enabled mail. In: Proceedings of IFIP International Conference, Barcelona, Spain, 1994 5. Bouchenak S (1999) Pickling threads state in the Java system. In: 3rd European Research Seminar on Advances in Distributed Systems (ERSADS’99), Madeira Island, Portugal, April 1999 6. Bryce C, Vitek J (1999) The javaseal mobile agent kernel. In: Milojevic D (ed) Proceedings of the 1st International Symposium on Agent Systems and Applications, Third International Symposium on Mobile Agents (ASAMA’99), Palm Springs, 9-13 May 1999. ACM Press, pp 176–189 7. Bryce C, Vitek J (2002) The JavaSeal mobile agent kernel. Auton Agents MultiAgent Syst 8. Cardelli L, Ghelli G (2001) A query language based on the ambient logic. Programming languages and systems. In: 10th European Symposium on Programming, ESOP 2001 9. Cardelli L, Gordon AD (1999) Types for mobile ambients. In: Proceedings of the 26th ACM Symposium on Principles of Programming Languages, 1999. pp 79–92 10. Cardelli L, Gordon AD (2000) Anytime, anywhere. Modal logics for mobile ambients. In: Proceedings of the 27th ACM Symposium on Principles of Programming Languages, 2000. pp 365– 377 11. Cardelli L, Gordon AD (2000) Mobile ambients. TCS special issue on coordination. D Le Métayer 12. Cardelli L, Ghelli G, Gordon AD (2000) Types for the ambient calculus. In: I&C special issue on TCS’2000
13. Colusa Software (1995) Omniware: a universal substrate for mobile code. White paper, Colusa Software. http://www. colusa.com 14. Farmer WM, Guttman JD, Swarup V (1996) Security for mobile agents: Issues and requirements. In: National Information Systems Security Conference. National Institute of Standards and Technology 15. Fuggetta A, Picco GP, Vigna G (1998) Understanding code mobility. IEEE Trans Softw Eng 16. Funfrocken S (1998) Transparent migration of Java-based mobile agents: Capturing and reestablishing the state of Java programs. In: Proceedings of the Second International Workshop on Mobile Agents, September 1998. Lecture Notes in Computer Science, no 1477. Springer, Stuttgart, pp 26–37 17. Goldberg A (1998) A specification of java loading and bytecode verification. In: Proceedings of the Fifth ACM Conference on Computer and Communications Security, 1998 18. Gong L, Mueller M, Prafullchandra H, Schemers R (1997) Going beyond the sandbox: An overview of the new security architecture in the Java Development Kit 1.2. In: Proceedings of the USENIX Symposium on Internet Technologies and Systems, Monterey, California, December 1997 19. Grimm R, Bershad BN (1999) Providing policy-neutral and transparent access control in extensible systems. In: Vitek J, Jensen C (eds) Secure internet programming: Security issues for distributed and mobile objects. Lecture Notes in Computer Science, vol 1603. Springer, Berlin, pp 117–146 20. Hartel PH, Moreau LAV (2001) Formalizing the safety of Java, the Java Virtual Machine and Java Card. ACM Comput Surv (to appear) 21. Hohl F (1997) Time limited blackbox security: protecting mobile agents from malicious hosts. Mobile Agent Security. Lecture Notes in Computer Science. Springer, Berlin 22. Jaeger T (1999) Access control in configurable systems. In: Vitek J, Jensen C (eds) Secure internet programming: Security issues for distributed and mobile objects. Lecture Notes in Computer Science, vol 1603. 
Springer, Berlin, pp 117–146 23. Jensen T, Le Metayer D, Thorn T (1998) Security and dynamic class loading in Java: A formalization. In: Proceedings of the 1998 IEEE International Conference on Computer Languages, May 1998, pp 4–15 24. Jones MB (1999) Interposition agents: Transparently interposing user code at the system interface. In: Vitek J, Jensen C (eds) Secure internet programming: Security issues for distributed and mobile objects. Lecture Notes in Computer Science, vol 1603. Springer, Berlin, pp 117–146 25. Lange D, Oshima M (1998) Programming and deploying Java mobile agents with Aglets. Addison Wesley 26. Loureiro S (2001) Mobile code protection. PhD thesis, Sophia Antipolis 27. Malkhi D, Reiter MK, Rubin AD (1998) Secure execution of Java applets using a remote playground. In: Proc of the 1998 IEEE Symp on Security and Privacy, Oakland, May 1998, pp 40–51 28. Necula GC, Lee P (1998) Safe untrusted agents using proof-carrying code. In: Vigna G (ed) Special issue on mobile agent security. Lecture Notes in Computer Science, vol 1419. Springer, pp 61–91 29. Ousterhout JK, Levy JY, Welch BB (1997) The Safe-Tcl security model. Technical report. Sun Microsystems Laboratories, Mountain View. Online at http://research.sun.com/techrep/1997/abstract-60.html
30. Puliafito A, Riccobene S, Scarpa M (1999) An analytical comparison of the client-server, remote evaluation, and mobile agent paradigms. In: Proc. ASA/MA'99, October 1999, pp 278–292 31. Quitadamo R, Cabri G, Leonardi L (2006) Enabling Java mobile computing on the IBM Jikes Research Virtual Machine. In: The International Conference on the Principles and Practice of Programming in Java 2006 (PPPJ 2006). ACM Press, Mannheim 32. Roth V (1999) Mutual protection of cooperating agents. In: Vitek J, Jensen C (eds) Secure internet programming: Security issues for distributed and mobile objects. Lecture Notes in Computer Science, vol 1603. Springer, Berlin, pp 117–146 33. Sander T, Tschudin CF (1998) Protecting mobile agents against malicious hosts. In: Vigna G (ed) Mobile agents and security. Lecture Notes in Computer Science, vol 1419. Springer, Berlin 34. Segal ME (1991) Extending dynamic program updating systems to support distributed systems that communicate via remote evaluation. In: Proc. International Workshop on Configurable Distributed Systems, 1991, pp 188–199 35. Sekiguchi T, Masuhara H, Yonezawa A (1999) A simple extension of Java language for controllable transparent migration and its portable implementation. In: Coordination Languages and Models. Lecture Notes in Computer Science, pp 211–226 36. Stamos JW (1986) Remote evaluation. PhD thesis, Massachusetts Institute of Technology. Technical report MIT/LCS/TR-354 37. Stamos JW, Gifford DK (1990) Implementing remote evaluation. IEEE Trans Softw Eng 16(7) 38. Stamos JW, Gifford DK (1990) Remote evaluation. ACM Trans Program Lang Syst 12(4):537–565 39. Suri N, Bradshaw JM, Breedy MR, Groth PT, Hill GA, Jeffers R, Mitrovich TR, Pouliot BR, Smith DS (2000) NOMADS: Toward an environment for strong and safe agent mobility. In: Proceedings of Autonomous Agents 2000, Barcelona, Spain. ACM Press, New York 40. Suri N, Bradshaw JM, Breedy MR, Groth PT, Hill GA, Jeffers R (2000) Strong mobility and fine-grained resource control in NOMADS. In: Agent systems, mobile agents, and applications. Lecture Notes in Computer Science, vol 1882. Springer, Berlin 41. Suri N et al (2003) Applying agile computing to support efficient and policy-controlled sensor information feeds in the army future combat systems environment. In: Proceedings of the U.S. Army 2003 Annual Collaborative Technology Symposium 42. Suri N, Bradshaw J, Carvalho M, Breedy M, Cowin T, Saavedra R, Kulkarni S (2003) Applying agile computing to support efficient and policy-controlled sensor information feeds in the army future combat systems environment. In: Proceedings of the Collaborative Technologies Alliance Conference (CTA 2003), College Park 43. Suri N, Bradshaw J, Carvalho M, Cowin T, Breedy M, Groth P, Saavedra R (2003) Agile computing: Bridging the gap between Grid computing and ad-hoc peer-to-peer resource sharing. In: Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid), 2003 44. Tardo J, Valenta L (1996) Mobile agent security and Telescript. In: Proceedings of IEEE COMPCON, February 1996 45. Truyen E, Robben B, Vanhaute B, Coninx T, Joosen W, Verbaeten P (2000) Portable support for transparent thread migration in Java. In: Proceedings of the Joint Symposium on Agent
Mobile Agents
46.
47. 48. 49.
50.
51.
52.
53.
54. 55.
56. 57. 58.
59.
Systems and Applications / Mobile Agents (ASA/MA), September 2000, pp 29–43 Tschudin C The messenger environment M0 – A condensed description. In: Vitek J, Tschudin C (eds) Mobile object systems. Lecture Notes in Computer Science, vol 1222. Springer, Berlin Tschudin C, Lundgren H, Gulbrandsen H (2000) Active routing for Ad Hoc Networks. IEEE Commun Mag Tschudin CF (1999) Mobile agent security. In: Klusch M (ed) Intelligent information agents. Springer Vigna J (1998) Cryptographic traces for Mobile Agents. In: Vigna G (ed) Mobile agent security. Lecture Notes in Computer Science, vol 1419. Springer, Berlin, pp 137–153 Vitek J, Castagna G (1999) Seal: A framework for secure mobile computations. In: Bal HE, Belkhouche B, Cardelli L (eds) Internet programming languages. Lecture Notes in Computer Science, vol 1686. Springer Vogler H, Moschgath M-L, Kunkelmann T (1997) An approach for Mobile Agent Security and fault tolerance using distributed transactions. Darmstadt Univ of Technology, ITO, Proc of ICPADS’97 (to appear) Volpano D, Smith G (1998) Language issues in mobile program security. In: Vigna G (ed) Mobile agent security. Lecture Notes in Computer Science, no 1419. Springer, pp 25–43 Von Eicken T, Chang C-C, Czajkowski G, Hawblitzel C (1999) J-Kernel: A capability-based operating system for Java. Lect Notes Comput Sci 1603:369–394 Wallach DS (1999) A New approach to mobile code security. Ph D Thesis, Princeton University Wallach DS, Balfanz D, Dean D, Felten EW (1997) Extensible security architectures for Java. Technical report 546–97, Department of Computer Science, Princeton University White JE (1997) Telescript. In: Cockayne WR, Zyda M (eds) Mobile agents. Manning Publ, Greenwich, pp 37–57 White JE (1997) Mobile agents. In: Bradshaw JM (ed) Software agents. AAAI/MIT Press, Cambridge, pp 437–472 Wilhelm UG, Staamann S, Buttyn L (1999) Introducing trusted third parties to the mobile agent paradigm. 
In: Vitek J, Jensen C (eds) Secure internet programming: Security issues for distributed and mobile objects. Lecture Notes in Computer Science, vol 1603. Springer, Berlin, pp 117–146 Yee B (1999) A sanctuarity for mobile agents. In: Vitek J, Jensen C (eds) Secure internet programming: Security issues for distributed and mobile objects. Lecture Notes in Computer Science, vol 1603. Springer, Berlin, pp 117–146
Books and Reviews Barak B et al (2001) On the (im)possibility of obfuscating. In: CRYPTO ‘01, Santa Barbara, 19–23 August. Lecture Notes in Computer Science, vol 2139. Springer, Berlin, pp 1–18. Braun P, Rossak W (2004) Mobile Agents: Basic concepts, mobility models, and the Tracy Toolkit. Morgan Kauffman Cardelli L, Ghelli G, Gordon AD (1999) Mobility types for mobile ambients. In: Wiedermann J, van Emde Boas P, Nielsen M (eds) Automata, languagese and programming. In: 26th International Colloquium, ICALP’99 Proceedings. Lecture Notes in Computer Science, vol 1644. Springer, Berlin Cardelli L, Ghelli G, Gordon AD (2000) Ambient groups and mobility types. In: van Leeuwen J, Watanabe O, Hagiya M, Mosses PD, Ito T (eds) Theoretical computer science Carriero N, Gelernter D (1989) Linda in context. Commun ACM 32(4):444–458 Carvalho M, Breedy M (2002) Supporting flexible data feeds in dynamic sensor grids through mobile agents. In: Proceedings of the 6th IEEE International Conference on Mobile Agents. Springer Gong L, Schemers R (1998) Signing, sealing, and guarding Java Objects. In: Vigna G (ed) Mobile agent security. Lecture Notes in Computer Science, vol 1420. Springer, Berlin, pp 206– 216 Gray RS (1996) Agent Tcl: A flexible and secure mobile-agent system. In: Proceedings of the 4th Annual Tcl/Tk Workshop (TCL 96), July 1996, pp 9–23 Gray R, Kotz D, Cybenko G, Rus D (1998) Security in a multiple-language mobile-agent system, In: Vigna G (ed) Lecture Notes in Computer Science: Mobile Agents and Security McGraw G, Felten EW (1997) Java security: Hostile applets, holes, and antidotes. Wiley Murphy AL, Picco GP, Roman G-C (2001) Lime: A middleware for physical and logical mobility. In: Proceedings of the 21 st International Conference on Distributed Computing Systems (ICDCS-21), May 2001 Picco GP, Murphy AL, Roman G-C (1999) Lime: Linda meets mobility. 
In: Garlan D (ed) Proceedings of the 21 st International Conference on Software Engineering, May 1999 Tschudin C (1994) An introduction to the M0 messenger system. Technical report 86 (Cahier du CUI), University of Geneva Volpano D (1996) Provably-secure programming languages for remote evaluation. ACM Comput Surv 28A(2):electronic
Molecular Automata
JOANNE MACDONALD1, DARKO STEFANOVIC2, MILAN STOJANOVIC1
1 National Chemical Bonding Center: Center for Molecular Cybernetics, Division of Experimental Therapeutics, Department of Medicine, Columbia University, New York, USA
2 National Chemical Bonding Center: Center for Molecular Cybernetics, Department of Computer Science, University of New Mexico, Albuquerque, USA
Article Outline
Glossary
Definition
Introduction
Molecular Automata as Language Recognizers
Molecular Automata as Transducers and Controllers
Future Directions
Bibliography

Glossary
Algorithmic self-assembly of DNA tiles: Spontaneous assembly of structures consisting of interlocking semirigid DNA molecules (the tiles). Different tiles, containing different exposed ends, can be synthesized such that their affinity to interlock is predefined. Particular sets of tiles may be constructed that result in the self-assembly of unique structures; each such set represents an algorithm for the assembly of the structure.
DNA Sierpinski triangle: An algorithmically self-assembled DNA structure with a pattern approximating the Sierpinski gasket, a fractal.
DNA computing: Generally, any use of the information-carrying capacity of DNA to achieve some computational or decision-making goal. The term is also used to refer specifically to the use of DNA, in the manner pioneered by Adleman, as a massively parallel computation substrate to solve certain combinatorial optimization problems.
DNA circuit: A system consisting of multiple DNA logic gates in which DNA is used as the signal carrier between the gates; the system performs a Boolean logic function that may be more complex than the functions achievable using single logic gates.
DNA binary counter: A device that uses DNA as a computational substrate to maintain its state, an integer binary numeral, and advances through successive states which represent successive integers. An algorithmic self-assembly implementation exists.
DNA logic gate: A molecular device using DNA as a computational substrate to perform a Boolean logic function.
Deoxyribozyme: An oligonucleotide synthesized such that its structure gives rise to enzymatic activity that can affect other oligonucleotides.
Deoxyribozyme-based automaton: A molecular automaton which uses deoxyribozyme-based logic gates to achieve its function.
Deoxyribozyme-based logic gate: An implementation of a DNA logic gate in which the enzymatic activity of a deoxyribozyme is controlled by the presence or absence of activating or inhibiting oligonucleotides.
Ligase: An enzyme, in particular a deoxyribozyme, that promotes the linking of two oligonucleotides into a single longer oligonucleotide.
Modular design of nucleic acid catalysts: A method for the design of nucleic acid catalysts, in particular deoxyribozymes, wherein their nucleotide sequences are chosen by stringing together motifs (such as recognition regions, stems, and loops) which have been established to perform a desired function. With care, the motifs will not interfere and will together achieve a complex combined function.
Molecular automaton: A device at the molecular scale which performs a predefined function; at the macroscopic level, this function is seen as sampling the environment for molecular stimuli and providing a molecular output in response.
Molecular finite state automaton: A device that performs a language recognition function using molecules as a computational substrate; the language symbols, the states, and the transitions between states are realized as molecules.
Molecular Mealy automaton: A device using molecules as a computational substrate to achieve a general-purpose sequential logic function, viz., a finite state transducer.
Molecular logic gate: A device using molecules as a computational substrate to perform a Boolean logic function.
Oligonucleotide: A single-stranded DNA molecule consisting of a relatively small number of nucleotides, usually no more than a few dozen.
Phosphodiesterase: An enzyme, in particular a deoxyribozyme, that promotes the cleavage of an oligonucleotide into two shorter oligonucleotides by catalyzing the hydrolysis of a phosphodiester bond.
Definition
Automata are devices that autonomously perform predetermined, often complex, functions. Historically, the term referred to mechanical devices that imitated human behaviors, especially those that exhibited intelligence. In the 20th century, electronics made possible the development of various complex automata, the prime example being the computer. For its part, computer software came to be organized as a hierarchy of automata operating within the computer automaton. A rich formal theory of automata arose in the fields of electronics and computer science to aid in the design and use of these now ubiquitous devices. Molecular automata employ molecular-scale phenomena of binding, dissociation, and catalytic action to achieve predetermined functions. Different chemical implementations are possible, including proteins and inorganic reactions, but the principal prototypes today use nucleic acid chemistry, and this article focuses on that modality. Nucleic acid automata may be able to interact directly with molecular signaling processes in living tissues. Thus, they provide a path towards autonomous diagnostic and therapeutic devices, and even towards engineered control of cell behavior.
Introduction
Humans appear to be naturally fascinated with the construction of devices with parts that move independently, contraptions that operate autonomously, and, in particular, machines with an appearance of intelligence. Each technological advance throughout the ages has coincided with the production of more and more sophisticated devices; indeed, some argue that our strong innate urge toward a mechanistic philosophy has led to the development of automata [16]. Using the tools at hand, ancient cultures created a variety of automata, from the animated figures of Rhodes to the singing figures of China. As with many human endeavors, the 18th century was the era of splendor and variety in automata, particularly among the French, and sophistication continued through the 19th: from courtly displays (mechanical) to digesting ducks [33] (biomechanical) to painting devices (mechanical, but emulating creative intelligence) [4,31]. Modern automata are almost uniquely defined by the advent of computers. Just as electronics permitted building more complex systems than mechanics did, software on programmable computers permitted vastly more complex systems still. An entire discipline of automata theory developed to characterize the computational abilities and resource requirements of different devices; all of theoretical computer science is an outgrowth of this effort. The formal notion of an automaton centers on a description of the configuration, or state, of the device, and of the transitions it makes from state to state. The states are typically discrete. The transitions are discrete as well, and they occur in response to external stimuli formalized as the consumption of discrete input symbols. Given a current state and a current input symbol, the transition rules of an automaton define what state the automaton may enter next. A history of external stimuli is captured by an input symbol string. The automaton may produce output symbols. This has been formalized in two different but equally expressive ways: either an output symbol is associated with a specific state, or it is associated with a specific transition between two states. In either case, the successive output symbols form an output string. Automata defined in this fashion are closely linked with formal language theory. The input and the output symbols are drawn from certain alphabets, and the input and the output strings belong to languages over these alphabets. Of particular interest are automata with a restricted form of output: a yes or a no output associated with each state. Such automata can be viewed as language recognizers under the following interpretation: if the state in which the automaton finds itself upon consuming an input string is a yes output state, that string is said to belong to the language, and otherwise not. More generally, an automaton defines a correspondence between strings in the input alphabet and strings in the output alphabet; thus it can be understood as a language-translating device, or transducer. Language recognizers and transducers play a central role in traditional accounts of computational complexity, as well as in practical daily use, such as in web query processing, under the name of parsers.
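The two equally expressive output conventions just described, output attached to states versus output attached to transitions, can be sketched in code. The automaton below is an illustrative toy of my own devising, not one from the article: its states track the parity of b's read so far, and each consumed symbol emits one output symbol reporting that parity.

```python
# Two equivalent ways to attach outputs to a finite state transducer.
# Toy example (not from the article): states track the parity of 'b's seen.

delta = {("even", "a"): "even", ("even", "b"): "odd",
         ("odd",  "a"): "odd",  ("odd",  "b"): "even"}

# Moore style: the output symbol is associated with the state entered.
moore_out = {"even": "0", "odd": "1"}

# Mealy style: the output symbol is associated with the transition taken.
mealy_out = {("even", "a"): "0", ("even", "b"): "1",
             ("odd",  "a"): "1", ("odd",  "b"): "0"}

def run_moore(word, state="even"):
    out = []
    for sym in word:
        state = delta[(state, sym)]   # take the transition...
        out.append(moore_out[state])  # ...then emit the new state's output
    return "".join(out)

def run_mealy(word, state="even"):
    out = []
    for sym in word:
        out.append(mealy_out[(state, sym)])  # emit the transition's output
        state = delta[(state, sym)]
    return "".join(out)

print(run_moore("baaba"))  # 11100
print(run_mealy("baaba"))  # 11100: the two conventions agree here
```

For this pairing of outputs the two machines translate every input string identically, illustrating why the two formalizations are equally expressive.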
If the output alphabet consists of signals for mechanical actions, we have the essence of robotic control. The number of symbols in an alphabet is generally taken to be finite. On the other hand, the number of states of an automaton may be finite or infinite. Finite state automata are readily implemented in electronics using a register (to hold an encoding of the state) and a combinational circuit (to compute the transitions). Indeed, only finite state automata can be physically implemented. Nevertheless, a beautiful theory exists for classes of automata with infinite state spaces; this theory has led to useful programming paradigms, and such automata can be simulated using finite resources as long as we are willing to accept that they may occasionally run out of these resources and fail. Examples include pushdown automata, in which the state includes the contents of a semi-infinite push-down stack
of symbols, and the Turing machine, in which the state includes the contents of an infinite tape of symbols. Even as electronic embodiments of automata permeate modern society, from mobile telephones to automated bank tellers to navigational satellites, they are not readily integrated into life processes: the environment of living tissues presents an obstacle to the deployment, powering, and operation of electronics. Biocompatible alternatives are needed. With recent advances in molecular biology, research is turning toward the taming of molecules for the development of automata on the molecular scale. Within this body of research several strands can be recognized. Firstly, researchers have adapted molecular processes for the explicit emulation of mathematical theory. Secondly, researchers have sought to enhance known cellular systems for alternative control of biological events. Additionally, however, researchers are beginning to engineer molecules for completely novel purposes unrelated to their original design. Regardless of the purpose, all of these automata fundamentally require similar mechanisms to operate autonomously within their respective environments. Molecular automata must have some ability to sense inputs (molecular or other) in their external environment, make a predetermined decision based on the sensed input, and then autonomously produce an output that affects their environment. Here we review recent advances in the engineering of molecular automata and outline fundamental principles for their design. Sect. “Molecular Automata as Language Recognizers” treats molecular automata created as embodiments of the mathematical concept of a language recognizer. Minimal formal background is given, and recent research exemplars are introduced, along with sufficient background in nucleic acid chemistry. We only treat exemplars that have been demonstrated in the laboratory. Sect.
“Molecular Automata as Transducers and Controllers” treats molecular automata that work as transducers, and is similarly organized. In Sect. “Future Directions”, we speculate on future uses and developments of molecular automata technology in science and medicine.

Molecular Automata as Language Recognizers

Preliminaries

Finite State Automata
A finite state automaton (sometimes just “finite automaton”) is a notional device that reads strings of symbols drawn from a finite alphabet and either accepts them or rejects them. Languages are sets of strings, and the language recognized by the automaton is defined to be the set of all strings which the automaton accepts. We now consider how the automaton makes its decision. The automaton operates in a series of discrete steps, and each step consumes one symbol of the input string. The automaton exists in one of finitely many abstract states, and each step of operation causes a transition between states. A transition function describes which next state the automaton may enter upon consuming a particular symbol in a particular state. One state is designated as the start state; this is the state of the automaton before any input is read. Upon starting, the automaton reads the symbols from the input string and makes the indicated transitions until the string has been exhausted. If the final state in which it finds itself belongs to the designated subset of accepting states, we say that it has accepted the input string. Formally, an alphabet Σ is a finite set of uninterpreted symbols. In examples, the sets {0, 1} and {a, b, c, …} are common; in computing, ASCII and Unicode are used. While it may seem plausible that the nucleotides {A, T, G, C} should be used directly as the alphabet in molecular implementations of automata, device design, such as the use of restriction enzymes with multi-nucleotide recognition sites, often dictates a less dense representation; for instance, each of a few unique short oligonucleotides may stand for a symbol of the alphabet. A string, or word, is a finite ordered sequence of symbols. The string of no symbols is written ε. Two strings s1 and s2 can be concatenated to form s1s2. Any set of strings is called a language. The notion of concatenation is extended to languages: L1L2 = {s1s2 | s1 ∈ L1 ∧ s2 ∈ L2}. A special operator, the Kleene star, finitely iterates concatenation: L* = {ε} ∪ L ∪ LL ∪ LLL ∪ … If we identify single-symbol strings with symbols, the set of all strings over an alphabet Σ is Σ*.
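The language operations just defined can be checked directly on small finite sets. The sketch below is illustrative only; since the full Kleene star of a nonempty language is infinite, it is truncated here to a bounded number of concatenation rounds.

```python
# Language concatenation: L1 L2 = { s1 s2 | s1 in L1 and s2 in L2 }.
def concat(L1, L2):
    return {s1 + s2 for s1 in L1 for s2 in L2}

# Kleene star, truncated: {epsilon} ∪ L ∪ LL ∪ ... up to n rounds.
def star(L, n):
    result, layer = {""}, {""}   # the empty string "" plays the role of epsilon
    for _ in range(n):
        layer = concat(layer, L)
        result |= layer
    return result

L = {"ab", "c"}
print(sorted(concat(L, L)))  # ['abab', 'abc', 'cab', 'cc']
print(sorted(star(L, 2)))    # ['', 'ab', 'abab', 'abc', 'c', 'cab', 'cc']
```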
Given a finite set of states Q, the transition function of an automaton is δ : Q × Σ → Q. The start state is S ∈ Q, and the accepting (sometimes called final) states are F ⊆ Q. To formalize the operation of the automaton, we define the configuration of the automaton to be its current state paired with the unread input. The initial configuration for input w is (S, w). The relation ⊢ describes one step of operation: (s, x) ⊢ (t, y) if (∃a ∈ Σ) x = ay ∧ δ(s, a) = t. The relation ⊢* is the reflexive, transitive closure of ⊢ and describes the complete operation. If (S, w) ⊢* (s, ε), then the automaton accepts w if and only if s ∈ F. Finite state automata as described are properly called deterministic. Nondeterministic finite state automata are
Molecular Automata, Figure 1 Example finite state automaton. The alphabet fa; bg, states fs0 ; s1 g, start state s0 , accepting states fs0 g, and transition function given by the table in a define a deterministic finite state automaton, graphically shown in b. This automaton accepts precisely the strings that contain an even number of b’s. In c, the actions of the automaton in accepting the string baaba are shown
an apparent generalization that permits transitions that read strings rather than single symbols from the input at each step of operation; they are described by a transition relation δ ⊆ Q × Σ* × Q. With a nondeterministic automaton, we say it has accepted the input string if there is an accepting state among all the states it may have reached while consuming the entire input string. In the conventional visual representation of a finite state automaton, the transition diagram (an example is shown in Fig. 1), the states are shown as vertices in a graph, and the transitions as directed edges labeled with symbols (or with strings for nondeterministic automata). The class of languages that can be recognized by a finite state automaton is known as the regular languages; the term itself derives from an alternative description of the same class using regular grammars (see below). Nondeterministic automata recognize the same class of languages but may require fewer states for the same language. The fact that nondeterministic automata are no more powerful than deterministic ones means that a number of variations that seem to be “in between” can be freely used. For instance, a deterministic automaton with a partial transition function δ, i.e., one in which the transition out of some particular state on some particular input symbol is not defined, is really a nondeterministic automaton. Rather than having an explicit “stuck” state with a self-loop for all input symbols, as would be required in a deterministic description, the automaton is stuck by virtue of not having any way out of a state. This is used in Benenson's automata description (Sect. “Benenson's Finite Automaton”).
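The automaton of Fig. 1 is a direct transcription of its transition table into a few lines of code; the sketch below also treats an undefined transition as a stuck (rejecting) run, as discussed above for partial transition functions.

```python
# Deterministic finite state automaton from Fig. 1: alphabet {a, b},
# states {s0, s1}, start state s0, accepting states {s0}.
# It accepts exactly the strings containing an even number of b's.

DELTA = {("s0", "a"): "s0", ("s0", "b"): "s1",
         ("s1", "a"): "s1", ("s1", "b"): "s0"}
START, ACCEPTING = "s0", {"s0"}

def accepts(word):
    state = START
    for symbol in word:
        key = (state, symbol)
        if key not in DELTA:   # undefined transition: the automaton is stuck
            return False
        state = DELTA[key]
    return state in ACCEPTING

print(accepts("baaba"))  # True: two b's, as traced in Fig. 1c
print(accepts("ba"))     # False: an odd number of b's
```

Note that the empty string ε is accepted as well, since zero is an even count of b's and the start state s0 is accepting.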
Finding functioning encodings for the symbols, states, and transitions in a molecular implementation is a challenging task; prototypes have therefore dealt with small numbers of symbols and states, and it is of interest to know how many different automata can be encoded in a particular prototype. Considering deterministic automata alone, given that the transition function δ maps Q × Σ to Q, the number of possible such functions is |Q|^(|Q|·|Σ|); there are |Q| choices for the initial state, and there are 2^|Q| choices for the accepting states (any subset of the set of states). Thus, there are |Q| · 2^|Q| · |Q|^(|Q|·|Σ|) different automata descriptions; for instance, with |Q| = 2 and |Σ| = 2 there are 2 · 4 · 16 = 128 of them. Of course, many of these describe automata that are identical up to a relabeling of the states; moreover, automata may be different and yet recognize the same language, so these calculations should not be taken too seriously.

More Powerful Automata
Pushdown automata are a natural extension of finite state automata; these devices include a notionally infinite pushdown stack. This structure allows the device to write an unbounded number of symbols to its scratch memory, but at any time only the most recently written symbol can be read back (reading it also erases it). These devices are capable of recognizing a larger class of languages, called the context-free languages. Turing machines add two pushdown stacks to a finite state automaton, or, equivalently, an infinite read/write tape. These devices are capable of recognizing any computably recognizable language. All plausible models of what can be computed that have been put forward have been shown equivalent to Turing machines; we say that they describe universal computation.

Wang Tiles and Self Assembly
One form of universal computation has been devised using a specific set of Wang tiles [47]. These are a mathematical model wherein square unit tiles are labeled with specific “glue” symbols on each edge.
Each tile is only allowed to associate with tiles that have matching symbols, and the tiles cannot be rotated or reflected. The computational interest of Wang tiles lies in the question of whether a given set can tile the plane. Any Turing machine can be translated into a set of Wang tiles; the Wang tiles can tile the plane if and only if the Turing machine never halts.

Prototypes

Molecular Automata as Language Recognizers
The encoding of finite automata states and transitions using DNA and restriction enzymes was first proposed by Rothemund [34]. A detailed design was also offered by Garzon [19]. Shapiro's group extended this idea to an autonomously operating cascade of cleavages of a double-stranded DNA directed by oligonucleotides [1,6,8]. In effect, this group demonstrated the first molecular equivalent of a finite automaton. What is now sometimes called the Shapiro–Benenson–Rothemund automaton consists of a mixture of two groups of DNA molecules (input and “software”, i.e., transition rules) and the FokI restriction enzyme (“hardware”). The automaton from [6] (Fig. 3) is described in Sect. “Benenson's Finite Automaton”.

Molecular Automata, Figure 2
a A four-arm junction and b its three-dimensional structure; c a DNA DX; and d a DNA TX. (From [28])

DNA has also been applied in the construction of Wang tiles and general self-assembly algorithms. Erik Winfree was the first to note that planar self-assembly of DNA molecules can be applied to this form of universal computation [48]. DNA molecules analogous to Wang tiles can be constructed from double-crossover (DX) molecules [18], which consist of two side-by-side DNA duplexes joined by two crossovers (Fig. 2). They have a rigid, stable body and open sticky ends for attachment to other DX molecules. The sticky ends of these DNA tiles may be labeled with certain sequences, analogous to the symbols labeling the sides of Wang tiles. This labeling allows the sticky ends to bind only to tile ends that have a complementary sequence of base pairs, which corresponds to the rule that restricts Wang tiles to associate only with tiles that have matching symbols. Triple-crossover (TX) molecules, consisting of three side-by-side DNA duplexes, are also good for this purpose [23]. It was shown that universal computation can be accomplished by self-assembling these DNA tiles [50]. An appealing interpretation is that a two-dimensional self-assembling structure can be used to represent (in one dimension) the time development of a one-dimensional (in the other dimension) cellular automaton. The first example of computing performed by DNA self-assembly [29], a four-bit cumulative XOR, is described in Sect. “DNA Self-Assembly”.
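The binding rule for tile sticky ends, that a single-stranded end hybridizes only with its Watson–Crick reverse complement, can be sketched as a simple check. The four-base sequences below are made-up placeholders for illustration, not the actual sticky ends used in [18] or [50].

```python
# Watson-Crick complementarity check for tile sticky ends (sketch).
# Two single-stranded ends can bind when one is the reverse complement
# of the other, mirroring the matching-symbol rule for Wang tile edges.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def can_bind(end1, end2):
    return end1 == reverse_complement(end2)

# Hypothetical sticky ends standing for one Wang-tile "glue" symbol:
print(can_bind("TGGC", "GCCA"))  # True: the two ends are complementary
print(can_bind("TGGC", "TGGC"))  # False: identical ends do not pair
```

In this analogy, each distinct sticky-end sequence plays the role of one edge symbol, and hybridization enforces the matching rule without any external control.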
This is followed in Sect. “Algorithmic DNA Self-Assembly” by a detailed description of algorithmic self-assembly of DNA tiles [5,35,36], used in the construction of a binary counter and a Sierpinski triangle.

Benenson's Finite Automaton
The automaton described by Benenson encodes finite automata states and transitions using DNA and restriction enzymes. The input (I) consists of repetitive sequences, with groups of five base pairs (separated by three base pairs in [6]) which denote input symbols (Fig. 3a) and, upon enzymatic cleavage, the state of the automaton as well. Two symbols in the input are a (TGGCT) and b (GCAGG). A cleavage that leaves the first four bases in an overhang is defined as a transition to the S0 state, while a cleavage that leaves the second four bases in an overhang is defined as a transition to the S1 state. Interactions of an input with transition rules occur as follows: The readable input starts with an overhang of four bases, and this overhang is recognized by a complementary overhang on a transition rule (software). All transition rules are complexed with FokI (this is called the software-hardware complex). FokI recognizes the “guide” sequence in the transition rules (GGATG) and then, upon interaction with the input, cleaves the input at a constant position, nicking the DNA helix 9 and 13 positions away from the guide sequence (Fig. 4). In this way, the transition rule directs the cleavage of the input. As this process generates a new overhang within the next symbol, we say that the cleavage moves the automaton to read the next symbol and, at the same time, defines the new state of the automaton. The structure of a transition rule, specifically the distance between its overhang and the guide sequence, determines whether the state of the automaton changes. This process is a cascade, because each new overhang can be recognized by another (or the same) transition rule. With two states and two symbols, there are a total of eight transition rules (Fig. 3b), defining all possible transitions upon reading a symbol.
(Keinan and colleagues also reported three-state, three-symbol automata [40].)
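Abstracting away the chemistry, the cleavage cascade behaves as a finite state automaton whose “program” is the set of transition rules supplied to the mixture. The sketch below simulates only that abstraction; the rule names T4 and T5 and their effects follow Fig. 4, while the DNA-level details (overhangs, cleavage positions) are not modeled.

```python
# Abstract simulation of the Benenson-style cascade: each transition rule
# maps (current state, current symbol) to a next state; a missing rule
# stalls the automaton, leaving the input incompletely degraded.

def run_cascade(rules, input_string, start="S0"):
    state = start
    for symbol in input_string:
        if (state, symbol) not in rules:
            return None               # the automaton stalls; no final answer
        state = rules[(state, symbol)]
    return state                      # the state read out at the terminator

# The T4/T5 "software" of Fig. 4: T4 is (S0, b) -> S1, T5 is (S1, a) -> S0.
rules = {("S0", "b"): "S1", ("S1", "a"): "S0"}

print(run_cascade(rules, "baba"))  # S0: the input baba is fully consumed
print(run_cascade(rules, "ab"))    # None: no rule for (S0, a), so it stalls
```

Supplying a different subset of the eight rules corresponds to loading a different program, and hence recognizing a different language, with the same enzymatic hardware.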
Molecular Automata, Figure 3
a The input molecule (I) is a double-stranded DNA with an overhang and repeating motifs encoding the symbols a (TGGCT), b (GCAGG), and, finally, a terminator t (GTCGG). The same sequences define the two states (S0 and S1) of the finite automaton, depending on where FokI cleaves. Thus, S0 in a is defined as a GGCG overhang and in b as a CAGG overhang, while S1 is defined in a as a TGGC overhang and in b as a GCAG overhang. States are similarly defined in t. b All possible transition rules in complex with FokI (software-hardware complexes) for a two-state, two-symbol automaton. Spacer sequences (squares with circles) between the guide sequence (GGATG) and the targeting overhangs define the change of states, because they serve to adjust the cleavage position within the next symbol. Overhangs define the initial state and the symbol that is read by the transition rules
We say that this automaton (or in biochemistry a cascade) is capable of consuming input strings over a two-letter alphabet fa; bg and transitioning between states depending on the input and in accordance with a preprogrammed control, defined through the presence of an arbitrary mixture of transition rule complexes (T1–T8). The mechanism of this cascade is best explained through an example, and we provide one in Fig. 4. For example, the transition function consisting of only transition rules T4 and T5 can move the automaton back and forth between states S0 and S1 endlessly, as long as it is moving along an input made of symbols a and b in alternation, baba , with the final result S0 . We invite the reader to try any other combination of transition rules in Fig. 3b on the same input. Some of them will stall the automaton. Thus, through selecting a set of transition rules, or software/hardware complexes (transition rules of the form: if you find the system in state Sm and read a symbol 2 fa; bg from the input, go to state Sn and move to the next input symbol), one could write a molecular “program” able to read and degrade an input DNA molecule, which encodes a defined set of symbols. Input molecules for which transitions always succeed will be degraded up
to a terminator symbol, which will read the final state of an automaton (“answer”). If there is no transition rule present in the system that matches the current state and the current symbol of the input at a certain stage of the degradation, the automaton stalls and the input molecule is not degraded to a terminator symbol. Thus, it can be said that this system indeed works as a finite-state automaton and thus recognizes the language abstractly specified by its set of transition rules. Whilst the laboratory demonstration was of finite state automata, one could envisage pushdown automata or Turing machines; indeed Shapiro has a patent on a molecular Turing machine [38]. DNA Self-Assembly The first DNA self-assembling automaton was a four-bit cumulative XOR [29]. The Boolean function XOR evaluates to 0 if its two inputs are equal, otherwise to 1. The “cumulative” (i. e., multi-argument) XOR takes Boolean input bits x1 ; : : : ; x n , and computes the Boolean outputs y1 ; : : : ; y n , where y1 D x1 , and for i > 1, y i D y i1 XOR x i . The effect of this is that yi is equal to the even or odd parity of the first i values of x. Eight types of TX molecules were used in the implementation: two cor-
Molecular Automata
Molecular Automata, Figure 4 The effect of two transition rules, T4 and T5, on the input I from Fig. 3a is shown. The overhang at I (CAGG) defines an initial state (S0) and a current input symbol, b. The input encodes the string baba. For clarity, only bases in overhangs, symbols, and the guide sequence are shown. (i) Complexation between T4 and I directs the cleavage of I to I-7: the overhang CAGG on I is complementary to the overhang GTCC on the T4 complex. Thus, the automaton is at the beginning of its “calculation” in state S0 and will “read” the next symbol, b, on the input. T4 complexes with the input and cleaves at the position 9/13 bases away, within the region defining the symbol a. Upon this cleavage the whole complex disintegrates, leaving the shortened input (I-7) with a new overhang in place, TGGC, representing the next state S1 and the current symbol a. Thus, the automaton performed, according to transition rule T4, an elementary step: it read the first input symbol, b, and transitioned from state S0 into state S1, moving along the input (tape) to the second symbol, a. (ii) The T5 complex recognizes the overhang at I-7 and cleaves the input again at the 9/13 position, which is now within the next symbol, b; this leaves a new CAGG (S0) overhang. Thus, the automaton transitioned from state S1 into state S0, having read the input symbol a. It also moved to the following input symbol, b, on the (ever shrinking) input, now at I-16. (iii) The automaton (actually, the T4 transition rule) recognizes and cleaves I-16 and moves to the next symbol, a, transitioning in this process to state S1 and producing I-23. (iv) In the last step, I-23 is cleaved by T5 to I-32, producing the state S0 in the terminator symbol
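Abstracting away the biochemistry, the cascade of Fig. 4 behaves as an ordinary finite-state machine. The following sketch (our illustration, not part of the original work) encodes only rules T4 and T5 and replays the input baba:

```python
# Transition rules as a lookup table: (current state, input symbol) -> next state.
# T4 and T5 are the two rules discussed in the text; state names follow Figs. 3 and 4.
rules = {("S0", "b"): "S1",   # T4: in S0, read b, go to S1
         ("S1", "a"): "S0"}   # T5: in S1, read a, go to S0

def run(rules, tape):
    state = "S0"              # initial state, defined by the input's overhang
    for symbol in tape:
        if (state, symbol) not in rules:
            return None       # no matching transition rule complex: the automaton stalls
        state = rules[(state, symbol)]
    return state              # final state, read off the terminator symbol

assert run(rules, "baba") == "S0"   # the alternating input ends in S0, as in the text
```

Any other input (e.g., one starting with a) returns None, modeling an input molecule that is never degraded to its terminator symbol.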
ner tiles, two input tiles, and four output tiles. The corner tiles connect a layer of input tiles to a layer of output tiles. The input tiles represent xi = 0 and xi = 1. There are two ways to get each of the two possible outputs of a bitwise XOR, and so four output tiles were used, leaving some interpretation to the human observer. One output tile represents the state with output bit yi = 0 and input bits xi = 0 and yi−1 = 0; another the state with output bit yi = 0 and input bits xi = 1 and yi−1 = 1. The remaining output tiles represent the two states with yi = 1. The actual computation of the XOR operation is accomplished by harnessing the way the output tiles connect to the input tiles. Each output tile (yi) attaches to a unique combination
of one input tile (xi) and one output tile (yi−1), and leaves one sticky end open that encodes its own value (yi) so that another output tile may attach to it. Thus, only the output tiles that represent the correct solution to the problem are able to attach to the input tiles. Algorithmic DNA Self-Assembly As another type of DNA-based paradigm capable of autonomous computing, we give an example of algorithmic self-assembly of DNA tiles, as first suggested and then implemented by Winfree. Algorithmic self-assembly is an extension of various DNA nanotechnologies developed over the years by Seeman’s group, and is based on the vision that one can encode in a set
Molecular Automata, Figure 5 A binary counter in the process of self-assembly. The seed tile starts off the assembly. The right side and bottom border tiles connect to each other with double bonds, while all the other tiles connect with single bonds. A tile needs two single bonds (or one double bond) to form a stable attachment to the structure; the marked attachment positions show where a tile can form a stable attachment
of tiles the growth of an aperiodic crystal. While aperiodic crystals seem on the surface irregular, the position of each component of such a crystal is actually precisely encoded by a program. The approach is based on a rigid set of DNA tiles, such as double or triple crossover tiles. Unlike standard (i.e., crossover-free) structures, such as three- and four-way junctions, these molecules are sufficiently rigid to define precise positions of other tiles interacting with them, which can lead to regular crystal growth. This has led to many impressive periodic 2D structures. Interactions between tiles occur through complementary overhangs. For example, each of the DAO-E tiles in Fig. 7 projects four overhangs (schematically represented as different shapes), each of which can be independently defined. We examine two computations that have been demonstrated by means of algorithmic self-assembly: a binary counter, and a Sierpinski triangle. The binary counter [5,36] uses seven different types of tiles: two types of tiles representing 1, two types representing 0, and three types for the creation of a border (corner, bottom, and side tiles). The counter (Fig. 5) works
by first setting up a tile border with the border tiles; it is convenient to think of the “side” border tiles as being on the right, as then the counter will display the digits in the customary order. Two border tiles bind together with a strong double bond, while all other tiles bind to each other and to border tiles with single bonds. Since a non-border tile must bind to two additional tiles in order to have two bonds, whereas two border tiles of the correct types form a double bond with each other, a stable border forms before any stable formation of non-border tiles is created. The bottom and side border tiles are designed such that the only tile that may bind in the border’s corner (to both a side and a bottom border tile) is a specific type of 1 tile. Only one of the 0 tiles may bind to both this 1 tile and the bottom of the border; this type of 0 tile may also bind to itself and the bottom of the border, and thus may fill out the left side of the first number in the counter with leading zeros. The only type of tile that may bind both above the 1 in the corner and to the right side of the border is the other type of 0 tile, and the only tile that may bind to the left of it is a 1 tile: we get the number 10, or two in bi-
Molecular Automata, Table 1 The first eight rows of Pascal's triangle

1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1
1 7 21 35 35 21 7 1
Molecular Automata, Table 2 Marks indicate the odd entries in the first eight rows of Pascal's triangle

Molecular Automata, Figure 6 A Sierpinski triangle as a fractal image
nary. The tile binding rules are such that this can continue similarly up the structure, building numbers that always increment by one. We now look at how an aperiodic crystal corresponding to the so-called Sierpinski triangle was constructed [35]. First we consider what this fascinating figure is in the abstract. The Sierpinski triangle (or gasket), properly speaking, is a fractal, obtained from a triangle by iteratively removing an inner triangle half its size ad infinitum; Fig. 6 is an approximation up to ten iterations. Any physical realization will provide only a finite number of iterations and can thus only be an approximation of the true fractal; some realizations have been fabricated in the past with interesting physical properties [20]. A peculiar connection exists with Pascal’s triangle (Table 1), the table of binomial coefficients C(n, k). Each entry in Pascal’s triangle is the sum of the two entries right above it, according to the equality C(n, k) = C(n−1, k−1) + C(n−1, k). If we turn an initial section consisting of 2^m rows of Pascal’s triangle into an image by mapping odd numbers to black and even numbers to white (Table 2), we get an approximation to the Sierpinski triangle; various generalizations are possible, see [21,32]. Thus, we could generate the Sierpinski triangle by first generating Pascal’s triangle, row by row, using the above formula. But instead of computing the entries of Pascal’s triangle, which are large numbers, we can compute modulo 2, i.e., only keep track of whether the numbers are odd or even. Supposing the numbers were written in binary, this is equivalent to keeping only the least significant bit. In that case, the addition operation degenerates into an XOR. This is the chosen method for DNA self-assembly of a Sierpinski triangle (Table 3).
Molecular Automata, Table 3 The computational schema for the construction of an approximation to a Sierpinski triangle: Pascal's triangle computed modulo 2

1
1 1
1 0 1
1 1 1 1
1 0 0 0 1
1 1 0 0 1 1
1 0 1 0 1 0 1
1 1 1 1 1 1 1 1
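The schema of Table 3 amounts to repeated XOR: each interior entry of a row is the XOR of the two entries directly above it. A short Python sketch (illustrative only, not part of the original work) generates such rows:

```python
def pascal_mod2(n_rows):
    # Pascal's triangle with every entry reduced modulo 2; since
    # (a + b) mod 2 equals a XOR b, each new interior entry is the XOR
    # of the two neighboring entries in the previous row.
    rows = [[1]]
    for _ in range(n_rows - 1):
        prev = rows[-1]
        rows.append([1] + [prev[k] ^ prev[k + 1] for k in range(len(prev) - 1)] + [1])
    return rows
```

The 1s (odd entries) of pascal_mod2(2**m) trace out an approximation to the Sierpinski triangle, exactly as in Table 2.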
The set of tiles that worked best for the synthesis of the Sierpinski triangle consists of six tiles, three R and three S tiles (Fig. 7). The R tiles interact with the S tiles, and vice versa. The R tiles in row i calculate (display through their overhangs) outputs to the S tiles in row i+1, based on inputs (interactions with complementary overhangs) from the S tiles in row i−1 (Fig. 7c). Thus, each row is composed of either R or S tiles. There are four tiles (S-00, S-11, R-00, R-11) whose encoded value is 0 (Fig. 7a). These are incorporated in the growing crystal between two ad-
Molecular Automata, Figure 7 The DAO-E Sierpinski set of tiles, their corresponding schematic representations and the mechanism of their assembly. Each tile is assigned a schematic rectangular representation. An overhang is represented by the shapes in each corner, with complements being assigned the same shapes, but in black and gray. In c we give an example of assembly
jacent tiles in row i−1 that have both the value 0 or both the value 1. There are two tiles (R-01 and S-01) with values of 1 (Fig. 7b), and they have additional loops that will show as bright regions in the atomic force microscopy (AFM) characterization (cf. Fig. 9). These tiles are incorporated between two tiles in row i−1 that display different values. Due to the C2-symmetry of the overhangs displayed by these two tiles, there is no need to define any further tiles. In fact, the tile R-11 is never used in this particular calculation, and is redundant. Assembly proceeds as follows: Winfree and colleagues first constructed a long input (scaffold strand) DNA, which contained binding sites for the S-00 tiles at regular distances. The input is necessary to ensure longer periods of growth (more than 70 individual rows were grown from the input). The input also contained a small num-
ber of randomly distributed sites to which the S-01 tile binds. This is an initiation site, around which aperiodic growth occurs. Thus, the first row, immediately on the input strand, contains mostly S-00 tiles, and here and there a single S-01 tile. The second row of R tiles assembles on these tiles, with R-00 between all S-00 tiles and R-01 on each side of S-01 (i.e., between S-01 and S-00 tiles). An example of this type of crystal growth is shown in Fig. 8, while Fig. 9 shows the actual AFM results. In principle, tiling of surfaces is Turing complete. While it is certain that many interesting aperiodic crystals can be encoded using DNA tiles, the main limitation for further progress at the moment is the high error rate (1–10%) of the addition of individual tiles. Explicit error correction using some form of redundancy, or “self-healing” tile sets to recover from errors, is called for [49].
Molecular Automata, Figure 8 A schematic example of the assembly of an aperiodic 2D crystal from tiles encoding the Sierpinski triangle
Molecular Automata, Figure 9 AFM picture (bottom) of a representative aperiodic crystal encoding the Sierpinski triangle in its structure (schematic, top). (From [35])
Molecular Automata as Transducers and Controllers Preliminaries Finite State Transducers A finite state transducer is a variation of the deterministic finite state automaton which, in addition to consuming one input symbol at each step of operation, produces as output a string over some alphabet. A special case is the Mealy automaton [30], which emits one output symbol at each step, so that the length of the output is the same as the length of the input; an example is shown in Fig. 10. Molecular Automata as Language Recognizers Shapiro’s group used the same type of cascades as in their finite automata to create a molecular transducer as a proof-of-concept for a diagnostic automaton [7]. This design is explained in Sect. “Therapeutic and Diagnostic Automata”. Using a conceptually different model, Stojanovic and Stefanovic created molecular transducers from deoxyribozyme-based logic gates (see Sect. “Deoxyribozyme-Based Logic Gates as Transducers”), which have been used as components for the construction of adders (Sect. “Adders and Other Elementary Arithmetic and Logic Functions”) and automata for games of strategy (Sects. “Automata for Games of Strategy” and “Improved Automata for Games of Strategy”). Deoxyribozyme-Based Logic Gates as Transducers Simple DNA logic-gate transducers have been constructed from deoxyribozymes by Stojanovic and Stefanovic [43]. Deoxyribozymes are enzymes made of DNA that catalyze DNA reactions, such as the cleavage of a DNA strand into two separate strands, or the ligation of two strands into one. These can be modified to include allosteric regulation sites, to which specific control molecules can bind and so affect the catalytic activity. There is a type of regulation site to which a control molecule must bind before the enzyme can complex with (i.e., bind to) the substrate; thus this control molecule pro-
Molecular Automata, Figure 10 Example Mealy automaton. The input alphabet, the states, and the transition function are as in Fig. 1. The output alphabet is {x, y, z}. Successive runs of a’s in the input are transduced alternately into runs of x’s and runs of y’s; b’s are transduced into z’s
motes catalytic activity. Another type of regulation site allows the control molecule to alter the conformation of the enzyme’s catalytic core, such that even if the substrate has bound to the enzyme, no cleavage occurs; thus this control molecule suppresses or inhibits catalytic activity. This allosterically regulated enzyme can be interpreted as a logic gate, the control molecules as inputs to the gate, and the cleavage products as the outputs. This basic logic gate corresponds to a conjunction such as a ∧ b ∧ ¬c, here assuming two promotory sites and one inhibitory site, and using a and b as signals encoded by the promotor input molecules and c as a signal encoded by the inhibitor input molecule. Deoxyribozyme logic gates are constructed via a modular design [41,43] that combines molecular beacon stem-loops with hammerhead-type deoxyribozymes (Fig. 11). A gate is active when its catalytic core is intact (not modified by an inhibitory input) and its substrate recognition region is free (owing to the promotive inputs), allowing the substrate to bind and be cleaved. Correct functioning of individual gates can be experimentally verified through fluorescent readouts (Fig. 21). The inputs are compatible with sensor molecules [42] that can detect cellular disease markers. Final products (outputs) can be tied to the release of small molecules. All products and inputs (i.e., external signals) must be sufficiently different to minimize the error rates of imperfect oligonucleotide matching, and they must not bond to one another. The gates use oligonucleotides as both inputs and outputs, so cascading gates is possible without external interfaces. Two gates are coupled in series if the product of an “upstream” gate specifically activates a “downstream” gate. A series connection of two gates, an upstream ligase and a downstream phosphodiesterase, has been experimentally validated [44].
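In purely Boolean terms, the gate behavior described above can be sketched as follows; the helper names are ours, and the snippet is an idealization of the logic, not a model of the chemistry:

```python
def gate_active(promoters, inhibitors):
    # An allosterically regulated deoxyribozyme cleaves its substrate only
    # when every promotory control molecule is bound and no inhibitory one is.
    return all(promoters) and not any(inhibitors)

def and_and_not(a, b, c):
    # The conjunction a AND b AND NOT c: two promotory sites (a, b)
    # and one inhibitory site (c).
    return gate_active([a, b], [c])

# Series coupling: the cleavage product of an "upstream" gate serves as a
# promotory input to a "downstream" gate.
upstream_product = gate_active([True], [])            # upstream YES gate fires
downstream_active = gate_active([upstream_product], [])
```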
Prototypes Therapeutic and Diagnostic Automata Using similar principles to their finite automata (Sect. “Benenson’s Finite Automaton”), Shapiro’s group created a molecular transducer to assess the presence or absence of a series of oligonucleotides, as a proof-of-concept for a diagnostic automaton [7]. The automaton analyzes up- and down-regulated mRNA molecules that are characteristic of some pathological states (Fig. 12a). The authors introduced several new concepts in their work, primarily in order to maximize the tightness of control over a diagnostic cascade. We present here a somewhat simplified explanation that focuses on the underlying principles and the ability of these finite automata to analyze oligonucleotide inputs.
Molecular Automata, Figure 11 Examples of deoxyribozyme-based logic gates. a A YES gate showing deoxyribozyme core and molecular beacon allosteric controlling region; this gate is active in the presence of one activating input. b A two-input AND gate produces outputs only if both inputs are present. c A three-input ANDANDNOT gate produces outputs only if the two activating inputs (A and B) are present, and the inhibiting input (C) is absent
Input molecules for diagnostic automata are similar to those in Fig. 3. Each “symbol” in the input now contains recognition sites (“diagnostic” symbols), which are, upon cleavage by FokI, recognized by transition rules. In turn, these transition rules are regulated by the expression of genes (i.e., mRNA) characteristic of certain types of cancer. If we are looking to diagnose an increase in a certain mRNA level, there are, potentially, two types of transition rules that can cleave each of the symbols after these are exposed as overhangs through the action of FokI: (1) YES→YES transitions, which lead to the next step of processing the input, because a new overhang, recognizable by the next set of transition rules, is exposed; and
(2) YES→NO transitions, in which FokI cleaves off the site that is recognized by the next set of transition rules, and thus these transitions cause the automaton to stall (i.e., no existing transition rules recognize the new overhang). The YES→YES transition is activated by the presence of mRNA, through the removal of an inactivating (blocking) oligonucleotide from its complex with one of the strands making up a transition rule (Fig. 12b). In contrast, the YES→NO transitions are deactivated by an mRNA, because the mRNA will compete for binding to one of the strands in the transition rule, thus effectively breaking up the transition rule. In case we want to diagnose a downreg-
Molecular Automata, Figure 12 Therapeutic automata: a A diagnostic procedure is based on an analysis of a set of genes (e.g., Genes 1 to 4) whose levels of expression are characteristic for a particular disease; these genes regulate the transition rules that process the input molecule (see below). The input (I) molecule, similar to Fig. 3, encodes the symbols that are cleaved by these transition rules. Symbols are recognized by two types of transition rules regulated by the mRNA molecule transcribed from a gene; one type leads to continuation of the cascade, while the other leads to no further processing. b The YES→YES transition rule for an upregulated gene 3 in this cascade (e.g., PIM1); in the absence of the gene this transition rule is deactivated by the inactivating strand, which binds more strongly to strand 2 of the rule, displacing strand 1. c The YES→NO transition rule Gene 3 expressed is formed from two software strands by displacement of a blocking (protector) strand. d The transition rule Gene 1 not expressed is degraded in the presence of mRNA for Gene 1, thus stalling the automaton
ulation of mRNA (genes 1 and 2 in Fig. 12a), the situation is reversed: the lack of the characteristic mRNA leaves the YES→YES transition active, while its presence would break this transition rule apart. In contrast, the presence of mRNA will activate the YES→NO transition by removing the inactivating oligonucleotide.
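Stripped of the strand-displacement details, the diagnostic cascade computes a conjunction over the marker genes: every symbol must be matched by an active YES→YES transition, or the automaton stalls. A simplified sketch (the function names, rule encoding, and example signature are ours, for illustration only):

```python
def diagnose(expressed, rules):
    # rules: one (gene, must_be_expressed) pair per diagnostic symbol.
    # An upregulation rule continues the cascade only if the gene's mRNA is
    # present; a downregulation rule continues only if it is absent.
    for gene, must_be_expressed in rules:
        if (gene in expressed) != must_be_expressed:
            return "stall"        # the YES->NO transition fires instead
    return "positive"             # every YES->YES transition succeeded

# Hypothetical disease signature: genes 3 and 4 up, genes 1 and 2 down.
signature = [("gene1", False), ("gene2", False),
             ("gene3", True), ("gene4", True)]
```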
A diagnostic automaton can be turned into a “therapeutic” automaton by modifying the events at the end of the cascade. Instead of releasing a terminator symbol (cf. Figs. 3 and 4), Shapiro and colleagues described the degradation of a stem-loop and the release of a linear oligonucleotide, which had the potential for a therapeutic (antisense) action. A further interesting detail was that a negative diagnosis could be coupled to a parallel cascade, which can release another oligonucleotide, complementary to and blocking the activity of the therapeutic oligonucleotide. The authors also described the independent and parallel action of two different automata in the same solution. From the perspective of this review, it is also important that Winfree proved that these automata are capable of general Boolean computing [39]. The “therapeutic automaton” is in many respects the most challenging ex vivo molecular system that has ever been constructed. Although there are many reasons to doubt eventual practical applications as a “doctor in a cell”, it will always remain an extremely impressive intellectual exercise, which will serve as an inspiration for different systems. Adders and Other Elementary Arithmetic and Logic Functions Another type of transducing automaton is a number adder, which accepts as input two numbers expressed as strings of bits (binary digits) and produces as output the string of bits corresponding to their sum. On
the surface these may not appear to have direct relevance to diagnostic and therapeutic automata; however, adders serve as extremely important computational components, and their development begins to hint at the power of automaton building through the multiple layering of logic gates. In electronics, binary adders commonly use as a building block the full adder, which is a device that takes as input three bits (normally two digits from two numbers, and a carry-in bit) and produces a sum bit and a carry bit. The full adder is commonly realized using two half-adders; a half-adder takes as input two bits and produces a sum bit and a carry bit. A half-adder, in turn, is realized using simple logic gates. A logic gate takes some number of input bits (often just one or two) and produces one output bit as a Boolean function of the inputs, such as a negation (NOT gate) or a conjunction (AND gate). Adders and logic gates are simple circuits with obvious applications. Numerous molecular devices with logic gate and adder functions (as well as mixed-mode devices, such as molecular-optical ones) have been introduced in recent years [10,11,12,13,14,15,17,22,24,51]. Logic gates built from deoxyribozymes were first described by Stojanovic and Stefanovic [43], and were subsequently used to build both a half-adder [45] and a full adder [24]. A half-adder was achieved by combining three two-input gates in solution, an AND gate for the carry bit, and an XOR, realized using two ANDNOT gates (gates of
Molecular Automata, Figure 13 Components of a molecular half-adder, which adds two binary inputs to produce a sum and a carry digit. Top: circuit diagram, truth table of calculations, and deoxyribozyme logic gates required for construction. Bottom: implementation of the deoxyribozyme-based half-adder. The sum output is seen as an increase in red channel (tetramethylrhodamine) fluorescence (0 + 1 = 1 and 1 + 0 = 1), and the carry output is seen as an increase in green channel (fluorescein) fluorescence (1 + 1 = 2)
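The gate arrangement behind Fig. 13 (an AND gate for the carry, plus an XOR assembled from two ANDNOT gates for the sum) can be written out as a Boolean sketch; this is our idealization of the truth table, not a model of the reaction kinetics:

```python
def half_adder(i1, i2):
    carry = i1 and i2                          # AND gate: fluorescein (green) channel
    s = (i1 and not i2) or (i2 and not i1)     # XOR from two ANDNOT gates:
                                               # tetramethylrhodamine (red) channel
    return int(s), int(carry)
```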
Molecular Automata, Figure 14 Components of a molecular full adder, which adds three binary inputs to produce a sum and a carry digit. Left: circuit diagram, truth table of calculations, and deoxyribozyme-based logic gates required for construction. Right: implementation of the deoxyribozyme-based full adder. The sum output is seen as an increase in red channel (tetramethylrhodamine) fluorescence, and the carry output is seen as an increase in green channel (fluorescein) fluorescence
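One Boolean realization consistent with the seven-gate description is a majority function for the carry and a disjunction of four three-input conjunctions for the sum; the sketch below is our illustration of that logic, not the published gate list:

```python
def full_adder(x, y, z):
    # carry: disjunction of three two-input AND gates (a majority function)
    carry = (x and y) or (y and z) or (x and z)
    # sum: four three-input gates, one ANDAND (all inputs present) and
    # three conjunctions with two negated inputs each
    s = ((x and y and z)
         or (x and not y and not z)
         or (not x and y and not z)
         or (not x and not y and z))
    return int(s), int(carry)
```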
the form x ∧ ¬y) for the sum bit. The two substrates used are fluorogenically marked, one with red tetramethylrhodamine (T) and the other with green fluorescein (F), similar to Fig. 21, and the activity of the device can be followed by tracking the fluorescence at two distinct wavelengths. The results, in the presence of Zn²⁺ ions, are shown in Fig. 13. When both inputs are present, only the green fluorescein channel (carry bit) shows a rise in fluorescence. When only input i1 is present or only input i2 is present, only the red tetramethylrhodamine channel (sum bit) rises. With no inputs, neither channel rises. Thus, the two bits of output can be reliably detected and are correctly computed. A molecular full adder was similarly constructed, requiring seven molecular logic gates in total: three AND gates for computation of the CARRY bit, and four 3-input gates for calculation of the SUM bit (Fig. 14). The requirement for a 3-input ANDAND gate as well as several ANDNOTANDNOT gates necessitated additional deoxyribozyme logic gate design. This was achieved by precomplexing
gates with a complementary oligonucleotide that would subsequently be removed by the addition of inputs (Fig. 15). The length of the inputs increased to 30 nucleotides, and they served a dual function: to activate controlling elements directly and to remove precomplexed oligonucleotides when necessary. The combination of stem-loop regulation in cis and controlling elements supplied in trans is reminiscent of some classical in vivo regulatory pathways. This simple demonstration points towards the construction of even more complex artificial networks, which could mimic natural systems through the use of multifunctional regulatory components. Automata for Games of Strategy The initial construction of networks capable of simple arithmetic operations suggested the possibility that circuits of highly regulated deoxyribozymes could support complex molecular decision trees. This was subsequently demonstrated by Stojanovic and Stefanovic in the construction of the first interactive molecular transducing automaton, designed to per-
Molecular Automata, Figure 15 Three-input ANDAND gate structure
fectly play the game of tic-tac-toe against a human opponent [46]. To understand how this was achieved, the structure of the game will be briefly examined. A sequential game is a game in which players take turns making decisions known as moves. A game of perfect information is a sequential game in which all the players are informed before every move of the complete state of the game. A strategy for a player in a game of perfect information is a plan that dictates what moves that player will make in every possible game state. A strategy tree is a (directed, acyclic) graph representation of a strategy (an example is shown in Fig. 16). The nodes of the graph represent reachable game states. The edges of the graph represent the opponent’s moves. The target node of an edge contains the strategy’s response to the move encoded on the edge. A leaf represents a final game state, and can usually be labelled either win, lose, or draw. Thus, a path from
the root of a strategy tree to one of its leaves represents a game. In a tree, there is only one path from the root of the tree to each node. This path defines a set of moves made by the players in the game. A player’s move set at any node is the set of moves made by that player up to that point in a game. For example, a strategy’s move set at any node is the set of moves dictated by the strategy along the path from the root to that node. A strategy is said to be feasible if, for every pair of nodes in the decision tree for which the opponent’s move sets are equal, one of the following two conditions holds: (1) the vertices encode the same decision (i. e., they dictate the same move), or (2) the strategy’s move sets are equal. A feasible strategy can be successfully converted into Boolean logic implemented using monotone logic gates, such as the deoxyribozyme logic gates. In the first implementation of the tic-tac-toe automaton, the following simplifying assumptions were made to
Molecular Automata, Figure 16 The chosen strategy (game tree) for the symmetry-pruned game of tic-tac-toe, drawn as the diagram of a Mealy automaton. Each state is labeled with the string of inputs seen on the path to it. Each edge is labeled a/b, where b is the output that is activated on input a
Molecular Automata, Figure 17 The tic-tac-toe game board with the field numbering convention
reduce the number and complexity of needed molecular species. The automaton moves first, and its first move is into the center (square 5, Fig. 17). To exploit symmetry, the first move of the human, which must be either a side move or a corner move, is restricted to be either square 1 (corner) or square 4 (side). (Any other response is a matter of rotating the game board.) The game tree in Fig. 16 represents the chosen strategy for the automaton. For example, if the human opponent moves into square 1 following the automaton’s opening move into square 5, the automaton responds by moving into square 4. If the human then moves into square 6, the automaton responds by moving into square 3. If the human then moves into square 7, the automaton responds by moving into square 2. Finally, if the human then moves into square 8, the automaton responds by moving into square 9, and the game ends in a draw. This strategy is feasible; therefore, following a conversion procedure, it is possible to reach a set of Boolean formulas that realize it, given in Table 4. (For a detailed analysis of feasibility conditions for the mapping of games of strategy to Boolean formulas, see [3].) Human interaction with the automaton is achieved by the sequential addition of nine input oligonucleotides (i1–
Molecular Automata, Figure 18 Sequences of the nine oligonucleotide inputs used to indicate human move positions in the MAYA automaton
i9). Each oligonucleotide is 15–17 nucleotides long and has a unique sequence of A, T, C, and G, with no more than four nucleotides in common between any two oligonucleotide sequences (Fig. 18). The input numbering directly matches the square numbering of the tic-tac-toe game board (Fig. 17). Hence, to signify a move into square 9, the human would choose input i9 for addition. The game is played in a standard laboratory plate, with nine wells representing the squares of the tic-tac-toe game board. The automaton is pre-prepared by the addition of fluorescent substrate and a specific set of deoxyribozyme-based logic gates in each well. Thus the automaton moves are predetermined by the inherent logic provided in each well, and in the chosen arrangement the automaton never loses because it plays according to a perfect strategy. The arrangement of deoxyribozyme logic gates corresponding to the above formulas is given in Fig. 19. This is the initial state of the nine wells of a well-plate in which the automaton is realized in the laboratory. The automaton was named MAYA since it uses a Molecular Array of YES and AND gates to determine responses to human moves.
Molecular Automata, Table 4 Boolean formulas resulting from the tic-tac-toe game tree. The inputs are designated ik, and the outputs are designated ok

o1 = i4
o2 = (i6 ∧ i7 ∧ ¬i2) ∨ (i7 ∧ i9 ∧ ¬i1) ∨ (i8 ∧ i9 ∧ ¬i1)
o3 = (i1 ∧ i6) ∨ (i4 ∧ i9)
o4 = i1
o5 = 1
o6 = (i1 ∧ i2 ∧ ¬i6) ∨ (i1 ∧ i3 ∧ ¬i6) ∨ (i1 ∧ i7 ∧ ¬i6) ∨ (i1 ∧ i8 ∧ ¬i6) ∨ (i1 ∧ i9 ∧ ¬i6)
o7 = (i2 ∧ i6 ∧ ¬i7) ∨ (i6 ∧ i8 ∧ ¬i7) ∨ (i6 ∧ i9 ∧ ¬i7) ∨ (i9 ∧ i2 ∧ ¬i1)
o8 = i9 ∧ i7 ∧ ¬i4
o9 = (i7 ∧ i8 ∧ ¬i4) ∨ (i4 ∧ i2 ∧ ¬i9) ∨ (i4 ∧ i3 ∧ ¬i9) ∨ (i4 ∧ i6 ∧ ¬i9) ∨ (i4 ∧ i7 ∧ ¬i9) ∨ (i4 ∧ i8 ∧ ¬i9)
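The formulas of Table 4 can be replayed directly. The sketch below (our illustration, not part of the original work) evaluates each well's formula against the set of human inputs added so far and reproduces the drawn game described in the text (human: 1, 6, 7, 8; automaton: 5, 4, 3, 2, 9):

```python
def active_wells(inputs):
    # inputs: set of input indices added so far; a well is active (fluorescent)
    # when its Boolean formula from Table 4 is satisfied
    i = lambda n: n in inputs
    formulas = {
        1: i(4),
        2: (i(6) and i(7) and not i(2)) or (i(7) and i(9) and not i(1))
           or (i(8) and i(9) and not i(1)),
        3: (i(1) and i(6)) or (i(4) and i(9)),
        4: i(1),
        5: True,   # constitutively active: the automaton's opening move
        6: any(i(1) and i(k) and not i(6) for k in (2, 3, 7, 8, 9)),
        7: (i(2) and i(6) and not i(7)) or (i(6) and i(8) and not i(7))
           or (i(6) and i(9) and not i(7)) or (i(9) and i(2) and not i(1)),
        8: i(9) and i(7) and not i(4),
        9: (i(7) and i(8) and not i(4))
           or any(i(4) and i(k) and not i(9) for k in (2, 3, 6, 7, 8)),
    }
    return {well for well, on in formulas.items() if on}

# Replay the drawn game from the text; None marks the state before any human input.
inputs, lit, automaton_moves = set(), set(), []
for human_move in (None, 1, 6, 7, 8):
    if human_move is not None:
        inputs.add(human_move)
    newly_lit = active_wells(inputs) - lit
    automaton_moves += sorted(newly_lit)
    lit |= newly_lit
```

Here automaton_moves comes out as [5, 4, 3, 2, 9], matching the game walk-through given earlier.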
Molecular Automata, Figure 19 Realizing a tic-tac-toe automaton using deoxyribozyme logic. The center well contains a constitutively active deoxyribozyme. Each of the eight remaining wells contains a number of deoxyribozyme logic gates as indicated. In the schematic, green numbers are indices to the inputs that appear as positive literals in conjunctions; red as negative
An example game played against MAYA is shown in Fig. 20. The play begins when Mg²⁺ ions, a required cofactor, are added to all nine wells, activating only the deoxyribozyme in well 5, i.e., prompting the automaton to play its first move into the center, since this well contains a single active deoxyribozyme, which begins cleaving its substrate to produce a fluorescent output. The human player detects this move by monitoring the wells for an increase in fluorescence emission using a fluorescent plate reader. After that, the game branches according to the human’s inputs. In response to the automaton’s first move, the human player may choose either well 1 or well 4 in this restricted game, and thus will add input oligonucleotide i1 or i4 to every well of the game board, so that each well of the automaton receives information on the position of the human move. This creates a chain reaction among the deoxyribozyme logic gates in each well, opening individual
stem-loops complementary to the chosen input. However, only one gate in one well of the board will become fully activated in response to this first human move: the YESi1 gate placed in well 4 (if the human added i1 to every well) or the YESi4 gate in well 1 (if the human added i4 to every well). The activated gate subsequently cleaves to produce fluorescent output, which is again detected by the human via fluorescence monitoring. Subsequent human moves are unrestricted, so the human may choose to play in any of the remaining wells where neither player has yet moved. However, at this point the automaton is poised to win the game, either through a diagonal three in a row (wells 1, 5, 9; if the human chose to play in well 4) or through a horizontal three in a row (wells 4, 5, 6; if the human chose to play in well 1). Assuming perfect play, the human will choose to block an automaton win by playing into well 9 (adding input i9 to
Molecular Automata, Figure 20 A game of tic-tac-toe that ends in a draw because both the automaton and its human opponent play perfectly. As the human adds input to indicate his moves, the automaton responds with its own move, activating precisely one well, which is shown enlarged. The newly activated gate is shown in light green. The bar chart shows the measured change in fluorescence in all the wells. Wells that are logically inactive (contain no active gates) have black bars, and wells that are logically active have green bars (the newly active well is light green)
each well) or well 6 (adding input i6 to each well). This would again cause a chain reaction with complementary stem-loops in every well, but again, only a single logic gate would become fully activated (either gate i4ANDi9 or i1ANDi6 in well 3, depending on the human's initial moves). The game thus continues by repeated human input addition and subsequent fluorescence monitoring until either the human makes an error leading to an automaton win, or there is a draw, as illustrated in Fig. 20.

Improved Automata for Games of Strategy

To probe further the complexity with which a molecular automaton could be rationally constructed, a second version of MAYA was built [27]. MAYA-II plays a non-restricted version of tic-tac-toe in which the automaton still moves first into the middle square, but the human player may choose to respond in any of the remaining squares. The new automaton was also designed to be more user-friendly than the original MAYA, allowing monitoring of both the automaton and human moves using a dual-color fluorescence output system, similar to the previously constructed half and full adders [24,45] (Fig. 21). The complexity of MAYA-II encompasses 76 possible games, of which 72 end in an automaton win and 4 end in a draw. The required arrangement of logic gates for a perfect automaton strategy was determined by computer modeling. However, given the gate limitations of the time, a strategy using only 9 inputs could not be determined by the program. Acceptable logic could be predicted by increasing the number of inputs to 32 different oligonucleotides. These inputs were coded inm, where n refers to the position of the human move (1–4, 6–9) and m corresponds to the move order (1–4). For instance, to signify a move into well 9 on the first move, the human would add input i91 to all wells.
Using this input coding, the final arrangement of logic gates for MAYA-II's chosen strategy used 128 deoxyribozyme-based logic gates (Fig. 22); 32 gates monitor and display the human player's moves, and 96 gates calculate the automaton's moves based on the human-added inputs. Essentially, successive automaton moves are constructed as a hierarchy of AND gates, with YES gates responding to the first human move. Some NOT loops are included to prevent secondary activation in already played wells, or are redundant and included to minimize cumulative non-digital behavior in side wells over several moves. In this way, MAYA-II is a step toward programmable and generalizable MAYAs that are trainable to play any game strategy. Coordination of such a large number of rationally designed molecular logic gates was an unprecedented molecular engineering challenge, taking several years to construct. While spurious binding of oligonucleotides was predicted to cause serious problems, this was not observed in building the system. Instead, the most challenging aspect was titrating individual gates to produce similar levels of fluorescent signal, since the individual oligonucleotide input sequence could affect the catalytic activity of the molecule. MAYA-II perfectly plays a general tic-tac-toe game by successfully signaling both human and automaton moves. An example of play against the automaton is shown in Fig. 23. It could be argued that, by integrating more than 100 molecular logic gates in a single system, MAYA-II represents the first "medium-scale integrated molecular circuit" in solution. This level of rationally designed complexity has important implications for the future of diagnostic and therapeutic molecular automata. Moreover, the increased complexity of MAYA-II enabled refinement of the deoxyribozyme logic gate model, allowing the development of design principles for optimizing digital gate behavior and the generation of a library of 32 known input sequences for future "plug and play" construction of complex automata. This library of known sequences is already being employed in the construction of other complex DNA automata [26].

Molecular Automata, Figure 21
Dual-color fluorescence output system utilized in MAYA-II. Automaton moves are determined using logic gates based on the deoxyribozyme E6 [9,24,45], which cleaves substrate ST (dual labeled with tetramethylrhodamine and black hole quencher 2) to produce product PT and an increase in tetramethylrhodamine fluorescence. Human moves are displayed using logic gates based on the deoxyribozyme 8.17 [24,37,45], which cleaves substrate SF (dual labeled with fluorescein and black hole quencher 1) to produce product PF and an increase in fluorescein fluorescence

Molecular Automata, Figure 22
Realizing the MAYA-II automaton using deoxyribozyme logic. The center well contains a constitutively active deoxyribozyme. Each of the eight remaining wells contains a number of deoxyribozyme logic gates as indicated; boxed gates monitor human moves and the rest determine the automaton's moves. In the schematic, green numbers are indices of the inputs that appear as positive literals in conjunctions; red of those that appear as negative literals

Future Directions

This review has mostly focused on DNA-based automata, an area of research that has its roots in Adleman's early computing experiments [2]. The early attempts to find applications for this type of DNA computing mostly focused on some kind of competition with silicon, for example through massively parallel computing. After more than a decade of intensive research, we can safely conclude that, without some amazing new discovery, such applications are very unlikely. Instead, it seems that the most likely applications will come from the simplest systems, in which groups of molecules will have some diagnostic or therapeutic role. Further, approaches such as Winfree's can lead to completely new thinking in materials science. But aside from practical considerations, experiments in which mixtures of molecules perform complex functions and execute programs are of great interest for basic science. Mixtures of molecules perform complex tasks in organisms, and some of the systems we described provide us with tools to understand how such complexity may have arisen in the first place.

Bibliography

Primary Literature

1. Adar R, Benenson Y, Linshiz G, Rosner A, Tishby N, Shapiro E (2004) Stochastic computing with biomolecular automata. Proc Natl Acad Sci USA (PNAS) 101(27):9960–9965
Molecular Automata, Figure 23 MAYA-II example game. A game played against MAYA-II ends in an automaton win, since the human opponent made an error in move 3. The human adds input to indicate their moves to every well, which causes a chain reaction to activate precisely one well for each fluorescent color. Human moves are displayed in the fluorescein (green) fluorescence channel, and automaton moves are displayed in the tetramethylrhodamine (red) fluorescence channel
2. Adleman LM (1994) Molecular computation of solutions to combinatorial problems. Science 266(5187):1021–1024
3. Andrews B (2005) Games, strategies, and boolean formula manipulation. Master's thesis, University of New Mexico
4. Bailly C (2003) Automata: The Golden Age, 1848–1914. Robert Hale, London
5. Barish RD, Rothemund PWK, Winfree E (2005) Two computational primitives for algorithmic self-assembly: Copying and counting. Nano Lett 5(12):2586–2592
6. Benenson Y, Adar R, Paz-Elizur T, Livneh Z, Shapiro E (2003) DNA molecule provides a computing machine with both data and fuel. Proc Natl Acad Sci USA (PNAS) 100(5):2191–2196
7. Benenson Y, Gil B, Ben-Dor U, Adar R, Shapiro E (2004) An autonomous molecular computer for logical control of gene expression. Nature 429:423–429
8. Benenson Y, Paz-Elizur T, Adar R, Keinan E, Livneh Z, Shapiro E (2001) Programmable and autonomous computing machine made of biomolecules. Nature 414:430–434
9. Breaker RR, Joyce GF (1995) A DNA enzyme with Mg2+-dependent RNA phosphoesterase activity. Chem Biol 2:655–660
10. Collier CP, Wong EW, Belohradský M, Raymo FM, Stoddart JF, Kuekes PJ, Williams RS, Heath JR (1999) Electronically configurable molecular-based logic gates. Science 285:391–394
11. Credi A, Balzani V, Langford SJ, Stoddart JF (1997) Logic operations at the molecular level. An XOR gate based on a molecular machine. J Am Chem Soc 119(11):2679–2681
12. de Silva AP, Dixon IM, Gunaratne HQN, Gunnlaugsson T, Maxwell PRS, Rice TE (1999) Integration of logic functions and sequential operation of gates at the molecular-scale. J Am Chem Soc 121(6):1393–1394
13. de Silva AP, Gunaratne HQN, McCoy CP (1993) A molecular photoionic AND gate based on fluorescent signalling. Nature 364:42–44
14. de Silva AP, Gunaratne HQN, McCoy CP (1997) Molecular photoionic AND logic gates with bright fluorescence and "off-on" digital action. J Am Chem Soc 119(33):7891–7892
15. de Silva AP, McClenaghan ND (2000) Proof-of-principle of molecular-scale arithmetic. J Am Chem Soc 122(16):3965–3966
16. de Solla Price DJ (1964) Automata and the origins of mechanism and mechanistic philosophy. Technol Cult 5(1):9–23
17. Ellenbogen JC, Love JC (2000) Architectures for molecular electronic computers: 1. Logic structures and an adder built from molecular electronic diodes. Proc IEEE 88(3):386–426
18. Fu TJ, Seeman NC (1993) DNA double-crossover molecules. Biochemistry 32:3211–3220
19. Garzon M, Gao Y, Rose JA, Murphy RC, Deaton RJ, Franceschetti DR, Stevens SE Jr (1998) In vitro implementation of finite-state machines. In: Proceedings 2nd International Workshop on Implementing Automata WIA'97. Lecture Notes in Computer Science, vol 1436. Springer, London, pp 56–74
20. Gordon JM, Goldman AM, Maps J, Costello D, Tiberio R, Whitehead B (1986) Superconducting-normal phase boundary of a fractal network in a magnetic field. Phys Rev Lett 56(21):2280–2283
21. Holter NS, Lakhtakia A, Varadan VK, Varadan VV, Messier R (1986) On a new class of planar fractals: the Pascal–Sierpinski gaskets. J Phys A: Math Gen 19:1753–1759
22. Huang Y, Duan X, Cui Y, Lauhon LJ, Kim KH, Lieber CM (2001) Logic gates and computation from assembled nanowire building blocks. Science 294(9):1313–1317
23. LaBean TH, Yan H, Kopatsch J, Liu F, Winfree E, Reif JH, Seeman NC (2000) Construction, analysis, ligation, and self-assembly of DNA triple crossover complexes. J Am Chem Soc 122:1848–1860
24. Lederman H, Macdonald J, Stefanovic D, Stojanovic MN (2006) Deoxyribozyme-based three-input logic gates and construction of a molecular full adder. Biochemistry 45(4):1194–1199
25. Lipton RJ, Baum EB (eds) (1996) DNA Based Computers, DIMACS Workshop 1995 (Princeton University, Princeton, NJ). Series in Discrete Mathematics and Theoretical Computer Science, vol 27. American Mathematical Society, Princeton
26. Macdonald J (2007) DNA-based calculators with 7-segment displays. In: The 13th International Meeting on DNA Computing, Memphis
27. Macdonald J, Li Y, Sutovic M, Lederman H, Pendri K, Lu W, Andrews BL, Stefanovic D, Stojanovic MN (2006) Medium scale integration of molecular logic gates in an automaton. Nano Lett 6(11):2598–2603
28. Mao C (2004) The emergence of complexity: Lessons from DNA. PLoS Biology 2(12):2036–2038
29. Mao C, LaBean TH, Reif JH, Seeman NC (2000) Logical computation using algorithmic self-assembly of DNA triple-crossover molecules. Nature 407:493–496; erratum, Nature 408 (2000), p 750
30. Mealy GH (1955) A method for synthesizing sequential circuits. Bell Syst Techn J 34:1045–1079
31. Peppé R (2002) Automata and Mechanical Toys. Crowood Press, Ramsbury
32. Pickover CA (1990) On the aesthetics of Sierpinski gaskets formed from large Pascal's triangles. Leonardo 23(4):411–417
33. Riskin J (2003) The defecating duck, or, the ambiguous origins of artificial life. Crit Inq 29(4):599–633
34. Rothemund PWK (1996) A DNA and restriction enzyme implementation of Turing machines. In: Lipton RJ, Baum EB (eds) DNA Based Computers. American Mathematical Society, Providence, pp 75–120
35. Rothemund PWK, Papadakis N, Winfree E (2004) Algorithmic self-assembly of DNA Sierpinski triangles. PLoS Biol 2(12):2041–2053
36. Rothemund PWK, Winfree E (2000) The program-size complexity of self-assembled squares. In: STOC'00: The 32nd Annual ACM Symposium on Theory of Computing. Association for Computing Machinery, Portland, pp 459–468
37. Santoro SW, Joyce GF (1997) A general purpose RNA-cleaving DNA enzyme. Proc Natl Acad Sci USA (PNAS) 94:4262–4266
38. Shapiro E, Karunaratne KSG (2001) Method and system of computing similar to a Turing machine. US Patent 6,266,569 B1
39. Soloveichik D, Winfree E (2005) The computational power of Benenson automata. Theoret Comput Sci 344:279–297
40. Soreni M, Yogev S, Kossoy E, Shoham Y, Keinan E (2005) Parallel biomolecular computation on surfaces with advanced finite automata. J Am Chem Soc 127(11):3935–3943
41. Stojanovic MN, de Prada P, Landry DW (2001) Catalytic molecular beacons. ChemBioChem 2(6):411–415
42. Stojanovic MN, Kolpashchikov D (2004) Modular aptameric sensors. J Am Chem Soc 126(30):9266–9270
43. Stojanovic MN, Mitchell TE, Stefanovic D (2002) Deoxyribozyme-based logic gates. J Am Chem Soc 124(14):3555–3561
44. Stojanovic MN, Semova S, Kolpashchikov D, Morgan C, Stefanovic D (2005) Deoxyribozyme-based ligase logic gates and their initial circuits. J Am Chem Soc 127(19):6914–6915
45. Stojanovic MN, Stefanovic D (2003) Deoxyribozyme-based half adder. J Am Chem Soc 125(22):6673–6676
46. Stojanovic MN, Stefanovic D (2003) A deoxyribozyme-based molecular automaton. Nature Biotechnol 21(9):1069–1074
47. Wang H (1963) Dominoes and the AEA case of the decision problem. In: Fox J (ed) Mathematical Theory of Automata. Polytechnic Press, New York, pp 23–55
48. Winfree E (1996) On the computational power of DNA annealing and ligation. In: Lipton RJ, Baum EB (eds) DNA Based Computers. American Mathematical Society, Providence, pp 199–221
49. Winfree E (2006) Self-healing tile sets. In: Chen J, Jonoska N, Rozenberg G (eds) Natural Computing. Springer, Berlin, pp 55–78
50. Winfree E, Yang X, Seeman NC (1999) Universal computation via self-assembly of DNA: Some theory and experiments. In: Landweber LF, Baum EB (eds) DNA Based Computers II, DIMACS Workshop 1996 (Princeton University, Princeton, NJ). Series in Discrete Mathematics and Theoretical Computer Science, vol 44. American Mathematical Society, Princeton, pp 191–213; Errata: http://www.dna.caltech.edu/Papers/self-assem.errata
51. Yurke B, Mills AP Jr, Cheng SL (1999) DNA implementation of addition in which the input strands are separate from the operator strands. Bio Systems 52(1–3):165–174
Books and Reviews

Chen J, Jonoska N, Rozenberg G (eds) (2006) Natural Computing. Springer, Berlin
Lewis HR, Papadimitriou CH (1981) Elements of the Theory of Computation. Prentice-Hall, Englewood Cliffs
Seeman NC (2002) It started with Watson and Crick, but it sure didn't end there: Pitfalls and possibilities beyond the classic double helix. Nat Comput Int J 1(1):53–84
Motifs in Graphs
Motifs in Graphs
SERGI VALVERDE1, RICARD V. SOLÉ2
1 Complex Systems Lab, Parc de Recerca Biomedica de Barcelona, Barcelona, Spain
2 Santa Fe Institute, Santa Fe, USA
Article Outline

Glossary
Definition of the Subject
Introduction
Levels of Network Complexity
Subgraph Census
Network Motifs
Dynamic Behavior of Network Motifs
Motifs as Fingerprints of Evolutionary Paths
Future Directions
Bibliography

Glossary

Graph (network) A set G of objects called points, nodes, or vertices connected by links called lines or edges.
Subgraph A subset of a graph G whose vertex and edge sets are subsets of those of G.
Modularity A network is called modular if it is subdivided into relatively autonomous, internally highly connected components.
Scale-free network A class of complex network showing high heterogeneity in the distribution of links among its nodes. Such a distribution decays as a power law.
Network motifs Specific subgraphs that occur in different parts of a network at frequencies much higher than those found in randomized networks.

Definition of the Subject

Several nested levels of organization can be defined for any complex system, and many of these levels can be described in terms of some type of network pattern. In this context, complex networks in both nature and technology have been shown to display an overabundance of certain characteristic, small subgraphs (so-called motifs) which appear to be characteristic of the class of network considered. These tiny modules offer a powerful way of classifying networks and are the fingerprints of the rules generating network complexity.
Introduction

Complex systems can be described, to a first approximation, by means of a network [1,3,5,19]. In such a network, the typical components of the system (atoms, proteins, species, computers, humans or neurons) are simply nodes with no further structure. They are linked to others by means of an edge. The presence of such a link implies that there is some type of causal relation. Such a relation can be the presence of a bond among atoms or electrostatic forces among proteins. It can also be a trophic link (who eats whom) or a wire. It can also be a more abstract association, such as friendship in a social network. The resulting web provides a glimpse of the topological complexity of the network. An example of one such web is illustrated in Fig. 1a, where we display part of the network of protein-protein interactions that is present inside human cells. Each protein is indicated here as a sphere, although in reality it has a complex structure (Fig. 1b). Each link in this graph means that the two linked proteins interact physically (Fig. 1b). At the small scale, the network is thus represented in terms of molecules, but as happens with all complex systems, cell behavior (largely resulting from the workings of proteins) cannot be reduced to the behavior of individual units. Interactions need to be taken into account. Actually, most proteins perform their function by linking to other proteins, forming complexes. When we look at the map of protein interactions on a global scale, a complex network emerges (Fig. 1a). These assemblies then are able to interact with DNA, propagate external signals or build transport machines able to carry vesicles through the cell. Similarly, other complex networks will be describable at different levels of detail. The ultimate goal of any such description is understanding how the system works and/or how it has emerged.
An important message of complex systems approaches is the notion that most complex systems cannot be reduced to their elementary components. In particular, attempts to reduce complex patterns to the properties of the underlying basic units have in most cases failed. There is quite a range of levels of analysis in any of those systems, and each one typically involves some sort of emergent property. Ideally, one would reduce the whole complexity to the basic components. Since such a dream is not possible, we might wonder whether some type of intermediate subsets of small size might be enough. Such a goal guided the proposal of searching for special types of subgraphs, composed of a very small number of elements, that could capture the key characteristics of complexity under the network perspective.

Motifs in Graphs, Figure 1
Complexity in biological systems can be seen at multiple scales. Proteins, for example, typically work within pathways and assemblies, and we can define a protein network or proteome as the graph of protein-protein interactions (a), where each ball corresponds to one protein. Here part of the human protein map is shown, and the colors indicate different modules, i.e. groups of proteins having more connections among themselves than with the rest of the web. Looking closely, connections mean physical interactions among proteins, as indicated in b. Different patterns of protein interactions can be described in terms of small subgraphs. By counting the frequency of each of these subgraphs, we obtain the so-called subgraph census, shown in c for our example (using sets of four elements)

The idea is not unreasonable, since it takes into account a subset of elements and their interactions. In other words, looking at small subgraphs (assuming their organization can be associated with a given functional trait) is a good starting point towards the analysis of higher-level complexity. To this goal, several researchers developed a number of tools and concepts to properly capture the statistical relevance of some specific types of subgraphs. Subgraph censuses [4,7,32] and later on network motifs [13,14] were introduced in order to provide a small-scale, better characterization of complex networks. Here we will summarize their basic properties, how they are characterized, and what type of relevant information they provide. The power of the approximation is illustrated with a case study on protein networks.

Levels of Network Complexity

Complex networks have been shown to exhibit a number of interesting properties. One of them is the presence of modularity. Roughly speaking, a modular system is formed by quasi-independent parts that appear integrated within themselves but also exhibit a certain degree of interdependency among them. Modularity is considered a prerequisite for the adaptation of complex organisms and their evolvability [31]. It is particularly obvious in cellular networks [21], where it can be detected at the topological level. These networks include the webs of interactions among proteins, genes, enzymes and metabolites or signaling molecules. They are thus associated with the processing of energy, matter and information within and between cells. In many cases, modular structures appear to be related to functional traits: a given set of closely related proteins, for example, might all be associated with cell division or communication. All these components interact in one way or another at different times and spatial locations. How many of them need to be considered in order to capture relevant information on functional traits?

In order to explore the problem of how to define different levels of complexity, let us first provide the basic ingredients that we need. Within the context of network theory [1,3,5,19], the given system is represented as a graph $\Omega = (V, E)$ composed of a set of N nodes (say, proteins) $V = \{v_i\}$ and a set of links $e_{ij} \in E$ indicating whether a connection exists between nodes $v_i$ and $v_j$. Each node in a graph, if connected to some other node, will have a number of links k. This is known as the node degree. Statistically, a very important description of the network structure is provided by the so-called degree distribution P(k), which measures the probability of a given node having k links. Some real networks are homogeneous, meaning that their degree distribution has a well-defined average value. In these networks, most (if not all) elements will have a degree that does not strongly deviate from the average. By contrast, the majority of complex networks display a rather different architecture. This type of heterogeneous network is characterized by a probability distribution which falls off as a power law with a cut-off, i.e.

$$P(k) \propto (k + k_0)^{-\gamma}\, e^{-k/k_c} \qquad (1)$$
Here $k_0$ is a constant and $2 < \gamma < 3$ denotes the scaling exponent (typically close to 2.5). The cut-off $k_c$ is a characteristic degree indicating the presence of a maximum number of links. The hubs tend to have important roles in technological but also in cellular systems [22], where the most connected nodes are often cancer-related genes and their failure typically involves some proliferative disorder. In this context, one possible attribute of an element that can give relevant information is its degree. However, elements having only two links can be important in connecting hubs, and thus degree alone is not enough. At its smallest scale, modules are defined by means of subgraphs involving three or four elements [34]. These subgraphs have received considerable attention in relation with the so-called network motifs [13,14]. Roughly speaking, motifs are patterns of interconnections occurring in complex networks at numbers that are significantly higher than those in randomized networks. The analysis of their statistical distribution reveals that each class of natural and artificial network seems to display common patterns of motif abundances. The statistical pattern is thus interpreted as functionally meaningful. Under this view, motif abundances, as well as modularity, would be a consequence of selection forces, perhaps reflecting optimization.

Subgraph Census

Here we define the basic concepts relating small subgraphs and their abundance within complex networks. Sociologists pioneered this type of network analysis by developing the n-subgraph census, the enumeration of all possible subgraphs with n nodes in the network. For example, Fig. 1c displays the 4-subgraph census. We can see that the number of subgraphs having a larger number of links (i.e. denser ones) decays rapidly (in an exponential fashion). The way this decay occurs, and the frequency distribution, is characteristic of a given type of network structure.
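As a concrete illustration of such a census, the following Python sketch (illustrative only, undirected case, 3-node subgraphs) enumerates all connected 3-node subgraphs of a small graph and classifies them by their number of internal edges: two edges give an open triad (path), three give a triangle.

```python
from itertools import combinations

def subgraph_census_3(nodes, edges):
    """Count connected 3-node subgraphs of an undirected graph,
    classified by their number of internal edges (2 = open triad,
    3 = triangle). A brute-force sketch, not an optimized algorithm."""
    edgeset = {frozenset(e) for e in edges}
    census = {2: 0, 3: 0}
    for trio in combinations(nodes, 3):
        m = sum(1 for pair in combinations(trio, 2)
                if frozenset(pair) in edgeset)
        # With 3 nodes, any 2 edges share a node, so >= 2 edges => connected.
        if m >= 2:
            census[m] += 1
    return census

# A square with one diagonal: two triangles and two open triads.
nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
print(subgraph_census_3(nodes, edges))   # {2: 2, 3: 2}
```

The same enumerate-and-classify idea extends to 4-node subgraphs (as in Fig. 1c), at a rapidly growing combinatorial cost.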
In this case, subgraphs are undirected (no arrows are taken into account), but directed links can also be considered, which implies a much larger combinatorics (Fig. 2). Within the social sciences, the 3-subgraph census (or triad census) enables us to quantify the degree of network transitivity [4,7,32]. In a transitive social network, whenever a node i has ties with nodes j and k, there is a link between j and k, and thus these nodes form a triangle (see Fig. 2). This suggests a simple transitivity test: counting the number of subgraph triangles in the network. However, any meaningful interpretation of subgraph counts requires a statistical significance test that signals important deviations from their expected value in the network (see next section).

Motifs in Graphs, Figure 2
Some examples of subgraphs that can be constructed using n = 3 or n = 4 nodes. At left we show all the subgraphs that can be constructed using non-directed links (i.e. no arrows). A small number of graphs are obtained, but when directedness is introduced, even a very small number of nodes can generate a large number of subgraphs, as shown in b, where almost all combinations are shown for n = 3

Assuming sparse graphs ($\langle K \rangle \ll N$), the probability of a given subgraph $\Omega_i$ can be estimated. Following Itzkovitz et al., we can see how this is calculated using the subgraph example displayed in Fig. 4. Each node has a degree sequence given by the indegree list $\{K_i\}$ and the outdegree list $\{R_i\}$ (with $i = 1, \ldots, N$). The lists would be completed by the so-called mutual edges $\{M_i\}$, i.e. cases where there is a pair of edges in both directions between two nodes. For each subgraph, another degree sequence is provided by two new lists, now $\{k_j\}$ and $\{r_j\}$ for the in- and outdegrees, respectively. For our example, we have: $\{K_j\} = \{2, 1, 1, 0\}$ and $\{R_j\} = \{0, 1, 1, 2\}$. The idea is to compute the different probabilities associated with each directed edge linking all pairs of nodes. For example, the probability of having a directed link from node 1 to node 2 (for $K_1 R_2 \ll N \langle K \rangle$) is approximately:

$$P(1 \to 2) = \frac{K_1 R_2}{N \langle K \rangle} \,, \qquad (2)$$

which can be interpreted as follows [9]: we perform $K_1$ attempts for the first node to connect to the target node, each with probability $R_2 / (N \langle K \rangle)$. Similarly, we would have:

$$P(1 \to 3) = \frac{(K_1 - 1) R_3}{N \langle K \rangle} \,, \qquad (3)$$

this being the approach used for all edges. The average number of appearances of $\Omega_i$ is finally computed by averaging. A general derivation has been done in [9], where the power-law distribution of connectivities is taken into account. Here, we assume real networks having a scale-free in-degree distribution $P_i(k)$ and an exponential out-degree distribution $P_o(k)$ (the following is also valid for a network with a scale-free out-degree distribution and an exponential in-degree distribution):

$$P_i(k) = \frac{\gamma_i - 1}{k_0^{\,1-\gamma_i}} (k + k_0)^{-\gamma_i} \,, \qquad (4)$$

where $k_0$ is the cut-off value for the distribution. The average number of appearances $\langle G \rangle$ of a given subgraph $\Omega_i$ scales with the subgraph size and the exponent of the in-degree distribution:

$$\langle G \rangle \sim N^{\,n - g + s - \gamma_i + 1} \,, \qquad (5)$$

where s is the maximum in-degree in the subgraph, and n and g are the number of nodes and links in the subgraph, respectively. This scaling is actually valid for $2 < \gamma_i < s + 1$.

There are limitations to the above mean-field approximation. The average number of subgraphs cannot measure the high diversity of subgraph types in scale-free networks. Regular networks often display a limited repertory of connectivity patterns and are more suitable for the mean-field analysis (see Fig. 3). The subgraph distribution provides a more adequate quantification of the impact of the degree distribution on the local network structure. There is empirical evidence that subgraph distributions in scale-free networks are skewed and without a characteristic scale. In this context, [35] proposed machine-learning techniques that learn the observed subgraph distribution rather than assuming a specific distribution type. This is of utmost importance since the choice of the model distribution strongly constrains which subgraphs are the most significant, thus introducing unwanted biases in the interpretation of the observed subgraph census.

Motifs in Graphs, Figure 3
When looking at different networks, we find that their internal structure in terms of small subgraphs differs considerably. Here four examples are shown. The first two are models, namely a a tree and b a square lattice. The other two are real networks: c a street network and d a protein interaction network. The right column displays the observed frequencies of subgraphs of size four. As we can appreciate, different subgraphs appear to be common in some nets but not in others. Heterogeneous networks, like the protein interaction network d, display a greater variety of subgraph types than the regular networks a–c because of the rich number of different node connectivity patterns
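A quick numerical consistency check of the Eq. (2)-style estimate is possible: summing the estimated edge probabilities over all ordered node pairs recovers the total number of directed edges, since the degree lists each sum to $N\langle K \rangle$. The sketch below is illustrative Python with made-up degree lists, not part of the original derivation.

```python
# Sanity check of the mean-field edge-probability estimate of Eq. (2):
#   P(u -> v) = K_u * R_v / (N * <K>)
# Summed over all ordered pairs, this recovers the number of directed
# edges, because sum(K) = sum(R) = N * <K>. Degree lists are made up.

K = [2, 1, 1, 0, 2]   # degree list K_u, as used in Eq. (2)
R = [0, 1, 1, 2, 2]   # complementary degree list R_v
N = len(K)
mean_K = sum(K) / N   # <K>

expected_edges = sum(K[u] * R[v]
                     for u in range(N) for v in range(N)) / (N * mean_K)
print(expected_edges)   # 6.0, equal to the total number of directed edges
```

For an actual subgraph count one would instead multiply the per-edge probabilities of Eqs. (2)–(3) over the subgraph's edges and sum over node tuples, which is what the averaging in the text refers to.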
Motifs in Graphs, Figure 4 The expected frequency of a given subgraph in a random network can be computed by looking at the specific set of types of links and their numbers. Here an example of a subgraph is shown, indicating the basic quantities that can be defined in order to predict its abundance (see text)
Network Motifs

Network motifs are subgraphs that appear much more often than expected by pure chance. Specifically, they occur with a high frequency compared with the expected frequency in an ensemble of randomized graphs with identical degree structure [13,14]. Random networks are generated from the real network by switching links while preserving specific network properties. These random networks keep the real in/out-degree sequence but do not have degree-degree correlations. Comparison between the real network and the random networks then signals that the overabundance of sub-structures cannot be a consequence of network heterogeneity alone. The subgraph $\Omega_i$ is a network motif when its abundance is statistically significant, as indicated by the Z-score [13]:

$$Z(\Omega_i) = \frac{N_{\mathrm{real}}(\Omega_i) - \langle N_{\mathrm{rand}}(\Omega_i) \rangle}{\sigma(N_{\mathrm{rand}}(\Omega_i))} \,. \qquad (6)$$
Here $N_{\mathrm{real}}(\Omega_i)$ is the number of times the subgraph appears in the network, whereas $\langle N_{\mathrm{rand}}(\Omega_i) \rangle$ and $\sigma(N_{\mathrm{rand}}(\Omega_i))$ are the mean and standard deviation of its appearances in the randomized ensemble, respectively. In order to be significant, it is required that $|Z(\Omega_i)| > 2$. When $Z(\Omega_i) > 2$ ($Z(\Omega_i) < -2$) the motif (anti-motif) is considered to be more (less) common than expected at random.
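The randomization-and-comparison procedure behind Eq. (6) can be sketched end to end in a few lines. This is only an illustrative implementation, not the algorithm of [13]: the FFL counter is brute force, and the number of swaps and the ensemble size are arbitrary choices.

```python
import random

def count_ffl(edges):
    """Count feed-forward loops: ordered triples (a, b, c) of distinct
    nodes with links a->b, b->c and the shortcut a->c."""
    eset = set(edges)
    return sum(1 for (a, b) in eset for (b2, c) in eset
               if b == b2 and a != c and (a, c) in eset)

def randomize(edges, swaps=1000, rng=random):
    """Degree-preserving randomization by repeated double-edge swaps:
    (a->b, c->d) becomes (a->d, c->b) when no self-loop or duplicate
    edge would be created; in- and out-degree sequences are conserved."""
    edges = list(edges)
    eset = set(edges)
    for _ in range(swaps):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        if a == d or c == b or (a, d) in eset or (c, b) in eset:
            continue
        eset -= {(a, b), (c, d)}
        eset |= {(a, d), (c, b)}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

def z_score(edges, ensemble=100, swaps=1000, seed=0):
    """Eq. (6): compare the real count against a randomized ensemble."""
    rng = random.Random(seed)
    n_real = count_ffl(edges)
    counts = [count_ffl(randomize(edges, swaps, rng)) for _ in range(ensemble)]
    mean = sum(counts) / len(counts)
    sd = (sum((c - mean) ** 2 for c in counts) / len(counts)) ** 0.5
    return 0.0 if sd == 0 else (n_real - mean) / sd  # degenerate ensemble -> 0

edges = {(1, 2), (2, 3), (1, 3), (3, 4), (4, 1)}
print(count_ffl(edges))  # -> 1 (the triangle 1 -> 2 -> 3 with shortcut 1 -> 3)
print(z_score(edges, ensemble=50, swaps=200))
```

In practice the ensemble must be large enough for the mean and standard deviation in Eq. (6) to stabilize; on very small or very constrained graphs the randomized counts may not vary at all.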
Motifs in Graphs
From a practical point of view, network motif detection is a computationally intensive process; for instance, current techniques cannot detect network motifs having more than eight nodes. Motif detection requires three hard computational subtasks: (1) counting the number of subgraph instances, (2) grouping topologically equivalent subgraphs into the same class, and (3) generating random networks and comparing observed and expected subgraph frequencies. Subgraph sampling can be used to accelerate the first task [11], but this technique is not accurate and requires a costly bias correction [33]. Much work has been done to efficiently group subgraphs having the same topology [17]. In addition, we can efficiently assess subgraph significance by comparing with theoretical estimates of the number of graphs with a given degree sequence; this avoids the expensive, explicit generation of random networks in the third task [33]. A handful of motifs appear to be shared by similar or related types of networks. This is the case of the Bi-parallel, the Bi-fan, and the feed-forward loop (FFL) and its close variants, which appear both in electronic circuits and in biological networks performing computations (i.e., transcriptional regulatory networks and neuronal networks). Such commonality might easily be interpreted in functional terms: similar subgraphs are abundant because they are selected to perform a given function or task. This suggests that we could classify complex networks into distinct functional families based on their sets of typical motifs. As shown below, however, the statistical patterns do not support this view. In this context, the significance profile (SP) enables a classification of different networks according to their motif abundances [14]. The significance profile is a vector of normalized Z-scores defined as follows:

$$\mathrm{SP}(\Omega_i) = \frac{Z(\Omega_i)}{\left( \sum_j Z(\Omega_j)^2 \right)^{1/2}} \,. \qquad (7)$$
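Equation (7) amounts to a simple vector normalization of the Z-scores; as a minimal sketch (the two-component input is a toy example, not a real triad census):

```python
import math

def significance_profile(z):
    """Normalized Z-score vector, Eq. (7): SP_i = Z_i / sqrt(sum_j Z_j^2)."""
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]

# In practice z would hold the Z-scores of, e.g., all 13 connected 3-node
# directed subgraphs; here a toy 2-component example:
print(significance_profile([3.0, 4.0]))  # -> [0.6, 0.8]
```

The resulting SP vector always has unit length, which is what makes profiles of networks of very different sizes directly comparable.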
This normalization emphasizes the relative importance of any given motif and enables a meaningful comparison across many different networks with variable numbers of nodes. Figure 5 shows the SP vectors for two different types of networks using 3-node subgraphs. Although this method was proposed to identify similar classes of networks, we must note that networks with similar SPs for 3-node subgraphs can have distinct SPs for 4-node subgraphs. This suggests that an effective network classification must use SPs of different subgraph sizes simultaneously. Still, it is unclear what set of subgraphs is sufficient to discriminate among all possible network classes. The limit on the maximum subgraph size handled by current motif detection algorithms may ultimately preclude the practical applicability of SP-based network classification.

Motifs in Graphs, Figure 5
Triad significance profile (TSP) of networks representing the logical software organization in several word-processor applications (above) and the direct transcription network in the bacterium E. coli (below). Notice the remarkable similarity of the TSPs of these functionally unrelated systems

In addition, the comparison of real systems suggests the SP classification is not always reliable: networks with different functions may have similar sets of motifs. Figure 5 shows that the triad significance profiles (TSP) of software networks [28] and of the transcription network of the bacterium E. coli [13] are similar, even though they are very different systems. We may overcome some of the above limitations by an adequate choice of the random network model used to compute the Z-score (see Eq. (6)). For example, we can assess the statistical abundance of subgraphs under different constraints besides the fixed degree sequence. The work in [8] showed that subgraph abundances in geometric networks, where the presence of links depends on the spatial proximity of nodes, cannot be explained by spatial embedding alone. On the other hand, the large-scale network organization (i.e., the degree distribution together with the hierarchical structure of the network) alone might explain the natural frequencies of common motifs in biological networks [29]. Finally, standard motif detection does not handle weighted networks, where links have an associated level of activity, traffic flow, or importance. The topological motif definition can be extended to take link weights into account: the Z-score is replaced by a so-called motif intensity score, thus providing a dynamical perspective on subgraph importance. [20] showed that weighted motif definitions may considerably modify the conclusions drawn from purely topological statistics.

Dynamic Behavior of Network Motifs

We can further investigate the biological significance of motifs by studying their dynamical properties. In this
context, one specific network motif, namely the feed-forward loop motif in transcriptional regulatory networks, has received much more attention than other systems. The FFL motif can be classified into two different kinds, coherent and incoherent, depending on the nature of the individual interactions. Regulation links can be positive (activation) or negative (repression). The FFL is coherent (incoherent) if the overall sign of the indirect path from the input to the output node is the same as (different from) the sign of the direct link. The numerical analysis of a specific instance of the coherent FFL motif (i.e., with only positive links and AND-like behavior of the joint regulation of the output) shows that this motif can filter out transient or fluctuating input signals [23]. This theoretical prediction was validated in experiments on motifs in real systems [15]. An extension of this work in [16] reported an exhaustive analysis of the eight possible types of coherent and incoherent FFL motifs in transcription networks. Specifically, they found that incoherent FFL motifs speed up the response time of the target gene expression following stimulus steps in one direction (e.g., off to on) but not in the other direction (on to off). This suggests the excess of FFL motifs might be explained in terms of selective pressures towards robust functioning in noisy environments. The above emphasizes the dynamics of motifs at the local scale, where motifs are basic building blocks of functional modules. However, the same network motif could perform many different functions depending on global system requirements. For example, robust functioning can be achieved with flexible components that switch their mode of operation to replace damaged or missing components [27]. In general, the mapping between function and structure is not unique, but we can expect network topology to strongly influence dynamics.
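The pulse-filtering behavior of the coherent AND-gate FFL can be reproduced with a minimal forward-Euler sketch. All rate constants, thresholds, and pulse lengths below are hypothetical choices, not the parameters of [23]:

```python
def simulate_ffl(pulse_len, t_max=6.0, dt=0.001, beta=1.0, alpha=1.0, K=0.5):
    """Coherent type-1 FFL, X -> Y, X -> Z, Y -> Z with AND logic at Z:
    Y is produced while the input X is on; Z is produced only while X is
    on AND Y has crossed its activation threshold K. Returns max Z level."""
    y = z = z_max = 0.0
    for i in range(int(t_max / dt)):
        t = i * dt
        x_on = t < pulse_len                          # square input pulse
        dy = (beta if x_on else 0.0) - alpha * y      # Y tracks X
        dz = (beta if (x_on and y > K) else 0.0) - alpha * z
        y += dy * dt
        z += dz * dt
        z_max = max(z_max, z)
    return z_max

# A short pulse never brings Y above K, so Z stays silent; a long pulse is
# transmitted after a delay of roughly ln(2)/alpha.
print(simulate_ffl(0.3), simulate_ffl(3.0))
```

With these (illustrative) parameters Y needs about ln 2 time units to cross K, so any input pulse shorter than that is filtered out, which is exactly the sign-sensitive delay described in the text.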
For example, synthetic models of brain networks suggest they are shaped so as to maximize the number and diversity of network motifs [24]. Local structural variability allows a large space of functional states. In this context, a related question is what network features facilitate the emergence of collective synchronization (the ability of the collective to coordinate and form a global, coherent pattern). Numerical studies suggest the pattern of motif connections determines its ability to synchronize, with denser network motifs more prone to synchronize than less connected motifs [18].

Motifs as Fingerprints of Evolutionary Paths

An overabundance of some specific subgraphs seems reasonable evidence for some special role. This of course assumes that the comparison is made against a randomized version of the same system. It also requires assuming that changes in network structure are directly related to selective pressures. Are the abundances of network motifs evidence for such selection pressures? When looking at the network organization, we will surely recognize some forms and shapes that can easily be interpreted as resulting from selective pressures. However, it has been argued that not all patterns that we observe resulting from evolutionary change are necessarily tied to selective forces [12]. Such patterns are what the evolutionary biologists Stephen Jay Gould and Richard Lewontin called spandrels [6]. The term spandrel was borrowed from architecture: a spandrel is the space between two arches or between an arch and a rectangular enclosure (an example is outlined with a dashed line in Fig. 6a). In evolutionary biology, a spandrel is a phenotypic characteristic that evolved as a side effect of a true adaptation. We can summarize the features of evolutionary spandrels as follows:

1. They are the byproduct of building rules,
2. They have intrinsic, well-defined, non-random features, and
3. Their structure reveals some of the underlying rules of the system's construction.

Looking at the picture in Fig. 6a, we can see that architectural spandrels are indeed well-defined, non-random structures arising as a side-consequence of a prior decision. Moreover, their geometric shape is fully constrained by the dominant arches. Let us now consider cellular networks as a case study, specifically the patterns of protein interactions inside cells. These networks are certainly functional structures (as discussed in the Introduction) and are the result of evolution. If we look at the abundance of given motifs, we find that some of them (such as the FFL, Fig. 6c) are more common than expected by chance (for example, the FFL gives the "bump" in both the top and bottom of Fig. 5).
On the other hand, looking at the placement of this motif in the network, we can see that these subgraphs appear clustered together. Such clustered patterns are actually characteristic of all biological motifs in cellular webs. What is the explanation for this structure? Let us forget for a moment about functionality and just consider the basic rules that make these networks grow. It is known that the genome evolves by duplication and diversification. Duplication refers to the accidental appearance of a redundant copy of a given gene, and diversification to the further changes taking place in the wiring. By using a duplication model we are actually capturing an essential trait of biological evolution: it takes place through tinkering [10,25].
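The duplication-divergence growth rules detailed later in this section can be sketched as a small simulation. The graph here is undirected for simplicity; the values δ = 0.7 and α = 0.1 follow the parameters quoted for Fig. 8, while the seed, the starting graph, and the stopping size are illustrative choices:

```python
import random

def dd_network(n_final, delta=0.7, alpha=0.1, seed=42):
    """Grow a protein-network-like graph by duplication and divergence.
    Start from two linked nodes; at each step duplicate a random node,
    delete one link of each redundant pair with probability delta, and
    link the twins with probability alpha."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}
    while len(adj) < n_final:
        u = rng.choice(list(adj))
        v = len(adj)                       # new (duplicated) node
        adj[v] = set(adj[u])               # inherit all of u's links
        for w in adj[v]:
            adj[w].add(v)
        for w in list(adj[v]):             # divergence: prune redundant pairs
            if rng.random() < delta:
                lost = rng.choice([u, v])  # drop one link of the pair
                adj[lost].discard(w)
                adj[w].discard(lost)
        if rng.random() < alpha:           # occasionally link the twins
            adj[u].add(v)
            adj[v].add(u)
    return adj

net = dd_network(200)
degrees = sorted((len(nbrs) for nbrs in net.values()), reverse=True)
print(len(net), degrees[:5])  # a few hubs, many poorly connected nodes
```

Running the model repeatedly and taking a subgraph census of the results is the kind of in silico comparison against real protein networks invoked later in the discussion of Fig. 8.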
Motifs in Graphs, Figure 6
Architectural and evolved spandrels. The left figure shows the ceiling of King's College Chapel, in Cambridge. A spandrel is indicated in a, surrounded by a yellow dashed line, and appears profusely decorated. The network shown in b is the gene regulatory map of E. coli. Here nodes are genes and arrows point from a regulatory to a regulated gene. In c a specific motif, the so-called feed-forward loop (FFL), is shown. The locations of this subgraph in the whole network are indicated in b by means of lighter arrows. Most FFLs appear together, forming larger, localized structures, which are the result of duplication events (see text)
In this context, one important source of divergence between engineering (technology) and evolution is that the engineer works according to a preconceived plan (he foresees the product of his efforts), and that in order to build a new system a completely new design and new units can be used, without the need to resort to previous designs. Jacob also notes that the engineer will tend to approach the highest level of perfection (measured in some way) compatible with the available technology. The main point is that natural selection does not work as an engineer but as a tinkerer, who does not know exactly what he is going to produce and is limited by the constraints present at all levels of biological organization as well as by historical circumstances. As shown below, tinkering might help explain how special subgraphs come to appear more commonly than expected by chance, as a result of the extensive reuse of available parts. Changes in wiring are associated with the emergence of novelty and new functionalities. Below we consider such a simple approximation based on single-gene events. One can use the simplest duplication-divergence (DD) models of protein network evolution [25,30], which involve the following set of rules, applied repeatedly until N nodes are present. Assuming that we have a graph of size n, we iterate the following rules:

1. Duplication: choose a node at random and duplicate it, thus generating a new node.
2. Link deletion: the new node shares a set of neighboring nodes with its predecessor. For each pair of redundant links, we choose one of them and delete it with some probability δ. This rule removes redundant relations among proteins.

3. Link addition: a link is added between the original and the duplicated node with probability α. This probability is small and allows new functionalities to emerge by linking the twin proteins.

Tinkering by means of duplication and rewiring helps in understanding how some motifs might be so common and why they appear together. In Fig. 7 we show some examples of how tinkering explains why some subgraphs are expected to be more common. We can easily see that some specific arrangements (such as the graph highlighted in green) are easily obtained by a single duplication event. The architecture of cellular networks is not based on geometry: topology replaces geometry, and substructures need to be understood as subgraphs. Using the previous set of rules, starting from a very small graph, it is very easy to obtain a network similar to the one in Fig. 1a and with exactly the same subgraph census, as displayed in Fig. 8. Although the model does not consider possible functional roles, the resulting network is very close to the observed one, thus indicating that the topological patterns are a byproduct of the growth rules. From the previous definition, motif abundances need to be understood as the spandrels of network biocomplexity [26].

Motifs in Graphs, Figure 7
Duplication rules may explain natural motif frequencies. Here, starting from a simple two-node system, we generate different subgraphs by means of duplication and rewiring rules

Motifs in Graphs, Figure 8
Subgraph census for the simple model of duplication and rewiring. Here we used δ = 0.7 and α = 0.1. In a the observed frequencies of subgraphs are shown, to be compared with the ones found in human protein networks (inset, b)

Why spandrels? They satisfy the previous list: (a) their abundance is matched by in silico models lacking real functionality, and is thus a byproduct of the network building rules; (b) they exhibit highly non-random features at several scales, which are particularly obvious when looking at the way in which motifs form clusters (Fig. 6b and c). The aggregates strongly indicate that the duplication-rewiring processes which generate the whole structure are also responsible for the motifs' presence and specific regularities. They result from the work of the tinkerer, and their frequency and distribution within a given web have no adaptive meaning per se. The observation of a common minimal subgraph is certainly interesting, but not relevant in evolutionary terms, except as integrated within a larger, functionally relevant set of interacting units [26]. The picture provided by network biology is a multiscale one [2] and allows one to shift from the geometry of architectural patterns to the no less fascinating universe of topological patterns. Evolutionary paths proceed through extensive evolutionary tinkering, which largely influences the architectural patterns to be found. The overall topology of cellular maps includes a plethora of non-random features, some of them not functionally selected. Like the spandrels built by the architect, evolution plays the role of a tinkerer, unpurposely leading to network spandrels. They include functional units, but their real relevance is likely to reside in their combinatorial possibilities rather than in intrinsic, individual properties.

Future Directions
The presence of subgraphs with a disproportionate number of appearances opens up the possibility of detecting special properties of complex networks. They offer a good way of classifying graphs into given groups and of determining the evolutionary paths that lead to such non-random abundances. The previous examples illustrate just one aspect of the possible spectrum of motif analysis, dealing with topology. But networks are weighted, the nature of the nodes is not uniform, and in some cases we might have dynamical information about how things change. Although a purely topological perspective might not capture all the desired traits, an appropriate choice of the information contained in a subgraph might help in defining real functional units. Future work should test how different ways of defining motifs might overcome some of the drawbacks of using pure topology. Moreover, it remains open whether motifs represent basic building blocks of complex networks or are instead another level of description in a nested hierarchy of complexity.

Bibliography

Primary Literature

1. Albert R, Barabási AL (2002) Statistical mechanics of complex networks. Rev Mod Phys 74:47–97
2. Bornholdt S (2005) Less is more in modeling large genetic networks. Science 310:449–450
3. Bornholdt S, Schuster HG (eds) (2002) Handbook of Graphs and Networks. Wiley, Berlin
4. Davis JA, Leinhardt S (1968) The structure of positive interpersonal relations in small groups. Annual Meeting of the American Sociological Association, Boston
5. Dorogovtsev SN, Mendes JFF (2003) Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford University Press, New York
6. Gould SJ, Lewontin RC (1979) The spandrels of San Marco and the Panglossian paradigm: a critique of the adaptationist programme. Proc Roy Soc London B 205:581–598
7. Holland PW, Leinhardt S (1970) A method for detecting structure in sociometric data. Am J Soc 70:492–513
8. Itzkovitz S, Alon U (2005) Subgraphs and network motifs in geometric networks. Phys Rev E 71:026117
9. Itzkovitz S, Milo R, Kashtan N, Ziv G, Alon U (2003) Subgraphs in random networks. Phys Rev E 68:026127
10. Jacob F (1977) Evolution and tinkering. Science 196:1161–1166
11. Kashtan N, Itzkovitz S, Milo R, Alon U (2004) Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11):1746–1758
12. Mazurie A, Bottani S, Vergassola M (2005) An evolutionary and functional assessment of regulatory network motifs. Genome Biol 6:R35
13. Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network motifs: simple building blocks of complex networks. Science 298:824–827
14. Milo R, Itzkovitz S, Kashtan N, Levitt R, Shen-Orr S, Ayzenshtat I, Sheffer M, Alon U (2004) Superfamilies of evolved and designed networks. Science 303:1538–1542
15. Mangan S, Alon U (2003) Structure and function of the feed-forward loop network motif. Proc Natl Acad Sci USA 100(21):11980–11985
16. Mangan S, Zaslaver A, Alon U (2003) The coherent feedforward loop serves as a sign-sensitive delay element in transcription networks. J Mol Biol 334:197–204
17. McKay BD (1981) Practical graph isomorphism. Congressus Numerantium 30:45–87
18. Moreno-Vega Y, Vázquez-Prada M, Pacheco A (2004) Fitness for synchronization of network motifs. Phys A 343:279–287
19. Newman MEJ (2003) The structure and function of complex networks. SIAM Rev 45:167–256
20. Onnela J-P, Saramäki J, Kertész J, Kaski K (2005) Intensity and coherence of motifs in weighted complex networks. Phys Rev E 71:065103(R)
21. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabási A-L (2002) Hierarchical organization of modularity in metabolic networks. Science 297:1551–1555
22. Rodriguez-Caso C, Medina MA, Solé RV (2005) Topology, tinkering and evolution of the human transcription factor network. FEBS J 272:6423–6434
23. Shen-Orr S, Milo R, Mangan S, Alon U (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31:64–68
24. Sporns O, Kötter R (2004) Motifs in brain networks. PLoS Biol 2(11):e369
25. Solé RV, Ferrer i Cancho R, Montoya JM, Valverde S (2002) Selection, tinkering, and emergence in complex networks. Complexity 8:20–33
26. Solé RV, Valverde S (2006) Are network motifs the spandrels of cellular complexity? Trends Ecol Evol 21:419–422
27. Tononi G, Sporns O, Edelman GM (1999) Measures of degeneracy and redundancy in biological networks. Proc Natl Acad Sci USA 96:3257–3262
28. Valverde S, Solé RV (2005) Network motifs in computational graphs: a case study in software architecture. Phys Rev E 72(2):026107
29. Vázquez A, Dobrin R, Sergi D, Eckmann JP, Oltvai ZN, Barabási A-L (2004) The topological relationship between the large-scale attributes and local interaction patterns of complex networks. Proc Natl Acad Sci USA 101:17940–17945
30. Vázquez A, Flammini A, Maritan A, Vespignani A (2003) Modeling of protein interaction networks. Complexus 1:38–44
31. Wagner G, Pavlicev M, Cheverud JM (2007) The road to modularity. Nat Rev Genet 8:921–931
32. Wasserman S, Faust K (1994) Social Network Analysis. Cambridge University Press, Cambridge
33. Wernicke S (2006) Efficient detection of network motifs. IEEE/ACM Trans Comput Biol Bioinform 3(4):347–359
34. Wolf DM, Arkin AP (2003) Motifs, modules and games in bacteria. Curr Opin Microbiol 6(2):125–134
35. Ziv E, Koytcheff R, Middendorf M, Wiggins C (2005) Systematic identification of statistically significant network measures. Phys Rev E 71:016110
Books and Reviews

Banzhaf W, Kuo PD (2004) Network motifs in natural and artificial transcriptional regulatory networks. J Biol Phys Chem 4(2):85–92
Dobrin R, Beg Q, Barabási A-L, Oltvai Z (2004) Aggregation of topological motifs in the Escherichia coli transcriptional regulatory network. BMC Bioinform 5:10
Gould SJ (2002) The Structure of Evolutionary Theory. Harvard University Press, Cambridge
Hartwell LH, Hopfield JJ, Leibler S, Murray AW (1999) From molecular to modular cell biology. Nature 402:C47–C52
Kuo PD, Banzhaf W, Leier A (2006) Network topology and the evolution of dynamics in an artificial regulatory network model created by whole genome duplication and divergence. Biosystems 85:177–200
Solé RV, Pastor-Satorras R, Smith E, Kepler TS (2002) A model of large-scale proteome evolution. Adv Complex Syst 5:43–54
Zhang LV et al (2005) Motifs, themes and thematic maps of an integrated Saccharomyces cerevisiae interaction network. J Biol 4(2):6
Multi-Granular Computing and Quotient Structure

LING ZHANG¹, BO ZHANG²
¹ Artificial Intelligence Institute, Anhui University, Hefei, Anhui, China
² Department of Computer Science, State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing, China

Article Outline

Glossary
Definition of the Subject
Introduction
The Granularity Relation
Hierarchy
Combination
Future Directions
Acknowledgments
Bibliography

Glossary

Granularity Granularity is the relative size, scale, level of detail, or depth of penetration that characterizes an object or system.

Multi-granular computing Humans are good at viewing and solving a problem at different grain-sizes (abstraction levels) and at translating easily from one abstraction level to another. This is one of the basic characteristics of human intelligence. Multi-granular computing aims to investigate the granulation problem in human cognition and to endow computers with the same capability, making them more efficient in problem solving.

Quotient set Given a universe X and an equivalence relation R on X, define a set $[X] = \{[x] \mid x \in X\}$, where $[x] = \{y \mid y \sim x,\ y \in X\}$. [X] is called the quotient set with respect to R, or simply a quotient set.

Quotient space Given a topological space (X, T), where T is a topology on X, and an equivalence relation R on X, define a quotient structure on [X] as $[T] = \{u \mid p^{-1}(u) \in T,\ u \subseteq [X]\}$, where $p : X \to [X]$ is the natural projection from X to [X]. This constructs a topological space ([X], [T]), called the quotient space corresponding to R. Some authors keep the neighborhood-system structure but remove the axioms of topology [13,14].

Quotient space model The quotient space model is a mathematical model that represents a problem at different grain-sizes by using the concept of quotient space from algebra. In the model a problem (or a system) is described by a triple (X, T, f), with universe (domain) X, structure T, and attribute f. If X represents the universe composed of the objects with the finest grain-size, then when we view the same universe X at a coarser grain-size, we have a coarse-grained universe denoted by [X]. We then have a new problem space ([X], [T], [f]), where [X] is the quotient universe of X, [T] the corresponding quotient structure, and [f] the quotient attribute. The coarse space ([X], [T], [f]) is called a quotient space of the space (X, T, f). Therefore, a problem at different grain-sizes can be represented by a family of quotient spaces.

Definition of the Subject

On one hand, any system in the world, either natural or artificial, has a multi-granular structure; this is called structural granulation. On the other hand, man always conceptualizes the world at different granularities and handles it hierarchically; this is called cognitive granulation. We believe the granulation in cognition underlies the human power in problem solving. Unfortunately, computers so far are generally capable of dealing with problems at only one abstraction level. The motivation of the research on multi-granular computing is to investigate granulation both in human cognition and in the real world and to endow computers with the same capacity. This will greatly reduce the computational complexity of computerized problem solving. The strategy can be used to improve many algorithms in broad areas such as planning, search, and machine learning. The key to multi-granular computing is to construct a mathematical model for representing a problem at different grain-sizes. We present an algebraic model (the quotient space model) of granulation that is used to analyze both human cognition and the real world, especially human problem-solving behaviors.
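The quotient-space construction in the glossary can be checked on a tiny finite example. The script below (all sets are hypothetical toy data) builds the quotient set [X] from a partition and computes the quotient topology $[T] = \{u \mid p^{-1}(u) \in T\}$ by brute-force enumeration:

```python
from itertools import combinations

# A small topological space (X, T): X = {0, 1, 2, 3} with a topology
# whose open sets are the empty set, the two blocks, and X itself.
X = frozenset({0, 1, 2, 3})
T = {frozenset(), frozenset({0, 1}), frozenset({2, 3}), X}

# Equivalence classes of a relation R (here: "belongs to the same block").
classes = [frozenset({0, 1}), frozenset({2, 3})]

def preimage(u):
    """p^{-1}(u) for the natural projection p: X -> [X]."""
    return frozenset(x for cls in u for x in cls)

def quotient_topology(classes, T):
    """[T] = {u subset of [X] : p^{-1}(u) in T}, found by enumerating
    every subset u of the (small) quotient set [X]."""
    qt = set()
    for r in range(len(classes) + 1):
        for u in combinations(classes, r):
            if preimage(frozenset(u)) in T:
                qt.add(frozenset(u))
    return qt

print(len(quotient_topology(classes, T)))  # -> 4: the discrete topology on [X]
```

Because each equivalence class happens to be open here, every subset of [X] has an open preimage, so [T] is the full (discrete) topology on the two-element quotient set.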
The model was developed in order to represent granules and compute with them easily.

Introduction

Granularity is the relative size, scale, level of detail, or depth of penetration that characterizes an object or system. This term is used frequently in astronomy, physics, geography, photography, and information science and technology. When a system is divided into components, it is important to get the right degree of componentization. Fine-grained components give much more detail, allowing precise construction of the right functionality of a system. Coarse-grained components are easier to manage but may lose some important details. Performance and management considerations tend to favor the use of multiple granularities. For example, a text contains several levels such as chapter, paragraph, sentence, and word, each with a different grain-size. A video has scenes, shots, and frames. A country may be split into states, cities, and communities. Man always conceptualizes an object (or system) at different granularities and deals with it hierarchically. This is one of man's characteristics that underlies his/her power. The aim of multi-granular computing is to investigate the granulation problem in human cognition and endow computers with the same capability, and multi-granular computing has therefore been widely investigated for a long time. In the data and knowledge management research community, the formalization of the concept of time granularity and its applications have been addressed. A time granularity is defined by a countable set of granules; each granule can be composed of a single instant, a set of contiguous instants (a time interval), or even a set of non-contiguous instants [3]. The paper [4] investigates the algorithms, computational complexity, and several optimization techniques for temporal CSPs (Constraint Satisfaction Problems) involving constraints expressed in terms of different time granularities. In image processing, multiresolution models are used extensively; such a model captures a wide range of levels of detail of an object and can be used to reconstruct any one of those levels on demand. The overall goal of multiresolution modeling is to use information from objects with different spatial granularities, extracting from complex models the details necessary for rendering a scene and discarding the unnecessary ones [10,11]. In GIS applications, the representation of multi-granular spatio-temporal objects in commercially available DBMSs has been addressed [2].
The research works above mostly focus on specific application domains and lack a general theoretical framework. The key to the investigation is to construct a multi-granular mathematical model of a problem. Recently, there have been several models addressing this issue, such as fuzzy sets [24], rough sets [20], and others [1,15,18]; each has its own characteristics. We presented a new model called the quotient-space model [25,26]. In the model a problem (or a system) is described by a triple (X, T, f). The universe (or domain) X represents the whole set of objects of the problem that we intend to deal with. It may be a point set, and its power may be either finite or infinite. T is the structure of X and represents the relationships among objects of X. Structure T may have different forms, but in the following discussion it will be described by a topology on X. Attribute f is a set of functions
defined on universe X; $f : X \to Y$ may be multi-component, i.e., $f = (f_1, f_2, \ldots, f_n)$, $f_i : X \to Y_i$, where each $Y_i$ is either a set of real numbers or another kind of set. The triple (X, T, f) is called a problem space (or simply a space). We will focus on the granularity relationships. Suppose that X represents the universe composed of the components with the finest grain-size, each component being regarded as indecomposable. When we view the same universe X at a coarser grain-size, we have a coarse-grained universe denoted by [X]. We then have a new problem space ([X], [T], [f]), where [X] is the quotient universe of X, [T] the corresponding quotient structure, and [f] the quotient attribute. For example, a text X consists of a set of words. If "word" is regarded as an indecomposable component, then X is the finest universe. T is the text structure, i.e., the syntactic relationship among words, and f is the property of words. If we instead consider "sentence" as the component, then we have a new (quotient) universe [X] composed of a set of sentences, each sentence consisting of a set of words. Then [X] is a coarser-grained universe. The quotient structure [T] represents the contextual relationship among sentences, and the quotient attribute [f] represents the property of each sentence [x]. The space ([X], [T], [f]) then represents the same text with sentences as its components. The coarse space ([X], [T], [f]) is called a quotient space of (X, T, f) (Fig. 1). The coarse universe [X] can be defined in many ways. Generally, universe [X] is defined by an equivalence relation R on X; [X] then consists of the equivalence classes obtained from R, and each equivalence class in X is regarded as a component of [X]. Universe [X] can also be defined by a fuzzy relation or a tolerance relation; in these cases, the components may have blurry boundaries or may overlap each other. Assume R is a family of equivalence relations on X.
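The universe/quotient-universe construction just described can be sketched concretely; the equivalence relation (same remainder mod 3) and the attribute (squaring, averaged over each class) are hypothetical toy choices:

```python
from collections import defaultdict

def quotient_set(X, key):
    """Partition universe X into equivalence classes; two elements are
    equivalent when they share the same key (here, the same remainder)."""
    classes = defaultdict(set)
    for x in X:
        classes[key(x)].add(x)
    return [frozenset(c) for c in classes.values()]

def quotient_attribute(classes, f):
    """[f]: assign each coarse component the mean of f over its members."""
    return {cls: sum(map(f, cls)) / len(cls) for cls in classes}

X = range(9)
qX = quotient_set(X, key=lambda x: x % 3)        # [X]: three coarse components
qf = quotient_attribute(qX, f=lambda x: x * x)   # [f] on the coarse universe
print(len(qX), sorted(qf.values()))  # -> 3 [15.0, 22.0, 31.0]
```

Averaging is only one possible way to induce [f]; any summary of f over a class (maximum, sum, a representative element) yields a valid quotient attribute, and the right choice depends on the problem.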
Define an order relation ‘<’ on ℜ …

… if the probability that a switching operation is performed correctly is p (p > 1/2), then its energy consumption – assuming that the operation is irreversible – is bounded from below by kT ln(2p) J. A reduced reliability of switching limits the range of possible algorithms to those that accept some degree of randomness, like probabilistic algorithms [146], image processing algorithms, and classification algorithms (e.g. [66]). Somewhat related is a scheme proposed by Kish [103] that drives computations by noise, demonstrated on a simple model circuit containing a comparator with an error rate around 0.5. Inspired by the highly efficient way in which information is processed in the brain, this thermal-noise-driven computing scheme requires a minimum energy of 1.1 kT J per bit, though it may require extensive error correction to make it work. The use of noise and fluctuations as a computational resource is also outlined in [42], as likely being a driving force for mechanisms in biological organisms [216]. Brownian motion provides a free search mechanism, allowing a molecule, signal, or other entity to explore its environment as it randomly moves around, thereby undergoing certain operations (like molecular binding) when it arrives at certain locations (like molecular binding sites). An early mention of Brownian motion for computation is made by Bennett [17], who uses it for reversible computation in case thermal noise cannot be avoided. Bennett considers a Brownian implementation of a mechanical Turing machine, in which signals move around randomly, searching their way through the machine's circuitry, restricted only by the circuit's topology. This search would, in principle, not require any energy from a power supply; the drawback is that the computation would move forward and backward evenly, without any bias, and thus would take a long time to complete. Funneling the random motions into preferred directions would require the selective use of ratchets, but ratchets tend to consume energy, as shown by Feynman [61]. A mechanism somewhat similar to a Brownian Turing machine is found in nature, in the form of the Brownian tape-copying machine embodied by RNA polymerase [17].
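The kT ln(2p) bound quoted above is easy to evaluate numerically; the following sketch (temperature and p values chosen arbitrarily for illustration) shows that the bound vanishes as reliability approaches one half:

```python
import math

# Sketch of the energy bound for irreversible switching: an operation
# that succeeds with probability p (p > 1/2) dissipates at least
# kT*ln(2p) joule. At p = 1 this reduces to the Landauer limit kT*ln(2).

K_BOLTZMANN = 1.380649e-23  # Boltzmann constant, J/K
T = 300.0                   # assumed room temperature, K

def min_switching_energy(p, T=T):
    """Lower bound (in joule) on the energy dissipated by one
    irreversible switching operation with success probability p."""
    if not 0.5 < p <= 1.0:
        raise ValueError("bound holds only for 1/2 < p <= 1")
    return K_BOLTZMANN * T * math.log(2 * p)

# Unreliable switches may dissipate arbitrarily little energy:
for p in (1.0, 0.9, 0.6, 0.51):
    print(f"p = {p}: >= {min_switching_energy(p):.2e} J")
```

At p = 1 the bound is about 2.9 × 10⁻²¹ J at room temperature, which is why trading reliability for energy is attractive only for algorithms that tolerate randomness.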
The efficient – in terms of energy consumption – use of Brownian motion in various biological mechanisms, such as those involving Brownian molecular motors, suggests that biological organisms may provide inspiration for future nanocomputing systems in this respect. Noise and fluctuations have also been used in the simulated annealing process of a Boltzmann machine that can be implemented by SET devices [215] (see also Sect. "Neural Network-Based Architectures"), though reducing power consumption has not been the main purpose of that work.

Fault-Tolerance

When a system experiences a failure in one of its parts and can continue to operate, possibly at a reduced level,
rather than – as most systems would do – fail completely, it is called fault-tolerant. A fault-tolerant system may experience a reduced throughput or an increased response time when one (or more) of its components fails. Fault-tolerance is an important property for nanocomputers, since the reliability of their components will likely be low due to:
– the probabilistic nature of interactions on nanometer scales, rooted in thermodynamics, quantum mechanics, etc.;
– the individual behavior of particles, molecules, etc. gaining importance over their averaged behavior as their numbers decrease, causing the "Law of Large Numbers" to become invalid below the device level [21,48].
When faults have a permanent character – whether incurred at manufacturing time or later – they are usually referred to as defects. A nanocomputer architecture that can cope with defects is called defect-tolerant. When faults have a transient character, i.e. when they occur during operation and are temporary, they are referred to as transient faults or, simply, faults, like the more general term. They are mainly caused by environmental conditions, such as thermal noise, quantum fluctuations, and electromagnetic perturbations. Cosmic radiation, a major source of transient faults in CMOS technology, may on the other hand be less of a concern for nanoelectronics based on materials other than semiconductor silicon [72]. The term used to describe the robustness of an architecture to transient faults is fault-tolerance, like the more general term. Faults may also be intermittent, meaning that they occur during a certain (extended) time interval, then disappear, and may (or may not) later reappear. Intermittent faults tend to be caused by unstable or marginal hardware, due to, for example, manufacturing residues and process variations. These problems tend to increase with increasing integration density [38].
Changes in operational parameters, such as temperature, voltage, and frequency, may trigger intermittent faults [38]. Though transient faults and intermittent faults have many characteristics in common, the latter differ in that they tend to occur in bursts, at the same location. Repairing a part affected by intermittent faults usually eliminates them, whereas transient faults cannot be prevented by repairs. Tolerance to faults and defects can be accomplished at various levels of abstraction [76]: the physical device level, the architectural level, and the application level. An example relating to the physical device level is the tolerance of
digital devices to noise, due to the digital encoding of signals. Tolerance at the architectural level compensates for the malfunctioning of devices by organizing them in structures with a certain level of redundancy. Such a structure may consist, for example, of replicated (identical) components, operating in parallel, on which majority voting is used to extract the result of an operation. Tolerance at the application level refers to the ability of a computing application to operate correctly despite defects and faults in the underlying hardware. Research aiming to improve tolerance at the application level may be found in the framework of probabilistic algorithms or image processing algorithms that tolerate some noise in their input. As tolerance at the physical device level tends to decrease with increasing integration density, tolerance at the architectural and application levels needs to improve to compensate. It also works the other way around: it may make less sense, for example, to improve tolerance at the physical device level if the application level is able to cope with a certain level of faults or defects on its own. Most research efforts on fault-tolerance in the nanocomputing field are in practice directed at the architectural level, and this section is no exception. Redundancy in an architecture is achieved by equipping it with additional resources [112]. Especially important are the following types of redundancy: Information redundancy refers to the addition of information to the original data so that the data can be recovered in case of faults. Error Correcting Codes (ECC) are the usual way to represent redundant information (e.g. [207]). Used extensively to achieve reliable operation in communications and memory, error correcting codes provide a way to cope with the corruption of bits by encoding messages as code words that contain redundant information.
This information is used to reconstruct the original code word in case errors occur. Hardware redundancy refers to the inclusion of additional hardware in the system. This may encompass the use of extra components working in parallel to mask any faults that occur – a technique called static redundancy, which includes the use of ECC. When only one of the extra components is used at a time, to be switched out and replaced by a spare in case of a fault, the term dynamic redundancy is used. Dynamic redundancy is usually achieved through periodic testing or self-checking of circuits, followed by the reconfiguration of a circuit, or even its self-repair. Time redundancy involves the repetition of operations in a system. This may take place after an error is detected, in which case a recovery procedure needs to be
started up. Alternatively, as a standard procedure, operations may be repeated a fixed number of times, after which the consecutive results are compared to each other. Time redundancy is suitable only for coping with non-permanent faults, since repetitions of operations by a defective component will only repeat incorrect results. The different nature of defects as opposed to faults is reflected in the different strategies usually followed to make an architecture robust to them, though there tends to be overlap. The permanent nature of a defect implies that it somehow needs to be isolated from non-defective parts to prevent it from affecting a system's correct functioning. Strategies to cope with defects thus tend to first detect defects and then reconfigure the system around them. This requires spare components that become activated by the reconfiguration, while the defective components (or defective connections) are made inaccessible. Absent reconfiguration, a system's performance will be adversely affected by a defect, in terms of throughput, speed, reliability, etc., but this may be ameliorated to an acceptable level if sufficient spare resources are available. Depending on whether or not reconfiguration takes place while a nanocomputer is processing data, the terms on-line and off-line reconfiguration, respectively, are used. On-line reconfiguration implies continuous monitoring of a computer for defects and taking action accordingly if they are detected. Usually envisioned as being self-contained in the computer, this type of operation needs the detection of defects and the associated reconfiguration to be robust to defects and faults themselves, which complicates its realization. Off-line reconfiguration is usually done under the direction of a host computer, which creates a defect table and reconfigures the system accordingly.
While easier to realize than on-line reconfiguration, it may only be feasible if the size of a nanocomputer is limited, since otherwise the detection and reconfiguration operations would consume prohibitive amounts of time. A well-known example of off-line reconfiguration of a prototype defect-tolerant nanocomputer is the Teramac [84]. Consisting of 10^6 gates operating at 1 MHz, it is based on Field Programmable Gate Arrays (FPGAs), 75 percent of which contain at least one defect. With about three percent of its total of 7,670,000 resources – including wiring and gates – defective, the Teramac needs a rich interconnection structure to find alternative routes around defects. To this end, it employs a crossbar array, which is described in more detail in Sect. "Crossbar Array-Based Architectures".
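The defect-table idea behind off-line reconfiguration can be sketched in a few lines; the function name, data, and resource model below are hypothetical and not taken from the Teramac design:

```python
# Hypothetical sketch of off-line reconfiguration: a host computer
# holds a defect table produced by testing, and maps logical units
# only onto physical resources that are not listed in it.

def reconfigure(num_physical, defect_table, num_logical):
    """Map logical units 0..num_logical-1 onto defect-free physical
    resources, skipping every entry of the defect table."""
    healthy = [r for r in range(num_physical) if r not in defect_table]
    if len(healthy) < num_logical:
        raise RuntimeError("not enough defect-free resources")
    return {logical: healthy[logical] for logical in range(num_logical)}

defects = {2, 5, 6}  # resources found defective by off-line testing
mapping = reconfigure(num_physical=10, defect_table=defects, num_logical=6)
# mapping -> {0: 0, 1: 1, 2: 3, 3: 4, 4: 7, 5: 8}
```

In a real system the mapping must also respect the interconnect topology, which is exactly why a rich interconnection structure such as a crossbar array makes rerouting around defects tractable.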
Fault-tolerance requires a different strategy than defect-tolerance, due to the temporary nature of faults. Such a strategy usually involves a combination of information redundancy and hardware redundancy. Motivated by the low reliability of electronic components in his time, von Neumann studied how reliable computation could be obtained through the use of unreliable components [201]. His multiplexing technique employs a circuit built from faulty universal gates – all of the same type – to simulate a single gate of this type, but with increased reliability. Each (input or output) wire of the gate is replaced by a bundle of N wires in the circuit. The circuits investigated by von Neumann are based on the NAND-gate, but other gates could in principle also be used. The logic state 1 or 0 of a bundle of wires is determined by whether the fraction of wires in state 1 – called its excitation level – lies above or below a certain threshold, respectively. The scheme consists of three stages, of which the first – the Executive Unit – conducts the logic operation, and the last two – the Restorative Units – act as nonlinear amplifiers that restore the excitation level of the output bundle to its nominal level (see Fig. 1). The resulting circuit has two input bundles of wires and one output bundle of wires. Wires from the first input bundle are randomly paired with the wires from the second bundle by a permutation network labeled U; the resulting pairs of wires are then used as inputs to the faulty NAND-gates in the Executive Unit. The wires in the output bundle are split, and again randomly paired to form the inputs to the faulty NAND-gates in the first Restorative Unit. This is repeated for the second Restorative Unit. This scheme suffers from limited reliability while requiring high redundancy. For example, an error rate of 0.005 per device requires a redundancy of
Nanocomputers, Figure 1 Von Neumann's fault-tolerant circuit that replaces a single faulty NAND-gate. It consists of three stages: the Executive Unit conducts the actual logic operation, whereas the two subsequent Restorative Units reduce the fault rate of the Executive Unit's output. The bundle of output wires of the circuit is effectively reduced to one wire (not shown here), whose value is determined by the excitation level of the bundle
1000 to achieve a probability of failure of the gate of merely 0.027. A similar scheme by von Neumann that uses three-input majority gates instead of NAND-gates does not fare better. A drawback of von Neumann's scheme is that it focuses entirely on the error rate of gates, but not on that of wires. While the latter can be modeled in terms of the former if wires have a constant length, this condition will not hold in densely connected circuits, causing failure rates of gates to depend on the length of their input wires [163]. Von Neumann's scheme is at the root of many follow-up schemes proposed in the decades thereafter. Most of these efforts aim to minimize the number of required gates, given a certain failure probability of the components. Dobrushin and Ortyukov, for example, show in [54] that a function computed by m fault-free elements can, with high probability, be reliably computed by O(m log m) unreliable elements. This is further improved in [161,162] to O(m) unreliable elements. An excellent review of these models is found in [163]. This review also shows some variations on this line of research, in which it is not gates that may fail, but the contacts of switches. Thought to have become irrelevant once relays became obsolete for computational purposes, these variations regain significance in the context of nanocomputing, given that nanoscale switching devices with mechanical contacts have attracted renewed interest [36,37]. Another method, R-fold modular redundancy [52], achieves tolerance to faults by employing R copies of a unit (R preferably an odd number), the output being established by a majority gate over the copies' outputs. If R = 3, this technique is called Triple Modular Redundancy (TMR) [52].
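With a perfect voter and independent faults, a TMR unit fails when at least two of its three copies fail; a quick calculation (a sketch under those assumptions, with illustrative error rates) shows when replication actually helps:

```python
def tmr_failure(p):
    """Probability that majority voting over three independent copies
    gives the wrong result, each copy failing with probability p
    (the majority gate itself is assumed fault-free)."""
    # Exactly two copies fail, or all three fail.
    return 3 * p**2 * (1 - p) + p**3

# TMR helps only if the individual copies are already fairly reliable:
assert tmr_failure(0.01) < 0.01  # ~3.0e-4: a large improvement
assert tmr_failure(0.6) > 0.6    # beyond p = 1/2, TMR makes things worse
```

The same crossover behavior is what limits von Neumann's multiplexing scheme: redundancy amplifies reliability only when the underlying device error rate is well below one half.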
When the outputs of three TMR units are combined by a majority gate on a second level, and so on in a hierarchy of levels, the result is a model whose reliability increases higher up the hierarchy: Cascaded Triple Modular Redundancy (CTMR) [186]. TMR, CTMR, and von Neumann's multiplexing technique lead to high redundancies compared to reconfiguration techniques [149], but on the other hand they are more suitable for transient faults. Von Neumann's original multiplexing technique is reconsidered in [172], which claims that its redundancy can be reduced by using output wire bundles directly as input bundles to other gates, rather than – as von Neumann advocated – reducing them to a single wire between the output bundles of one gate-equivalent unit and the input bundles of the next. Von Neumann's multiplexing technique is applied in a hierarchical way, in combination with reconfiguration techniques, in [78]; a significantly reduced redundancy is claimed for this design, even for device error rates up to 0.01.
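The multiplexing idea can be illustrated with a small Monte Carlo sketch; this is not von Neumann's analytical treatment, only the executive stage of his scheme, and the bundle size, error rate, and seed are chosen arbitrarily:

```python
import random

def faulty_nand(a, b, eps):
    """NAND gate that flips its output with probability eps."""
    out = 1 - (a & b)
    return out ^ (random.random() < eps)

def multiplexed_nand(bundle_a, bundle_b, eps):
    """One Executive Unit of von Neumann multiplexing: randomly pair
    the wires of the two input bundles and feed each pair to a faulty
    NAND gate. (The Restorative Units are omitted in this sketch.)"""
    paired = random.sample(bundle_b, len(bundle_b))  # permutation network U
    return [faulty_nand(a, b, eps) for a, b in zip(bundle_a, paired)]

random.seed(1)
N, eps = 1000, 0.005
out = multiplexed_nand([1] * N, [1] * N, eps)

# Excitation level: the fraction of wires in state 1. NAND(1,1) = 0,
# so a level well below the threshold of 1/2 is read as logic 0.
level = sum(out) / N
print(f"excitation level: {level:.3f}")
```

Only a handful of the 1000 output wires carry the wrong value, so the bundle is still read correctly as 0; the restorative stages exist to keep such small deviations from accumulating across cascaded units.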
The usefulness of Markov chains in the analysis of these models is reviewed in [79]. The constructions of von Neumann and others bear some resemblance to error correcting codes, especially to so-called repetition codes, which are codes (though not very efficient ones) in which each symbol of a message is repeated a number of times to create redundancy. The use of error correcting codes in this context leads to a scheme in which computation takes place in the encoded space, whereby errors are corrected locally, and encoding and decoding are necessary only at the beginning and the end of the computation, respectively. This line of thought is followed in [187], but using generalized Reed–Solomon codes (e.g. [207]) rather than repetition codes, to realize a fault-tolerant computing model with improved reliability. Fault-tolerance in architectures based on cellular automata and on crossbar arrays is discussed in the sections describing these architectures, i.e. in Sects. "Cellular Automaton-Based Architectures" and "Crossbar Array-Based Architectures", respectively.

Cellular Automaton-Based Architectures

A Cellular Automaton (see Cellular Automata as Models of Parallel Computation) consists of cells organized in a regular array. Each cell contains a finite automaton that, taking as input its own state and the states of its neighboring cells, changes its state according to a set of transition rules. These rules are designed so as to produce certain global behavior. When cellular automata are used as architectures for nanocomputers, this behavior is usually general-purpose computation, but application-specific behavior, like pattern recognition tasks, has also been investigated. Originating in von Neumann's research on self-reproduction in the 1940s, cellular automata have experienced a resurgence of interest in recent years, especially in the context of nanocomputing architectures.
Their simplicity – evidenced by a tiling-based structure in combination with local neighborhoods of cells of finite-automaton-like complexity – has much potential for bottom-up fabrication by molecular self-assembly. Moreover, their local interconnection structure implies a constant length of wires, which makes them efficiently scalable to high integration densities [11]. Cellular automata are unlikely to be used in combination with top-down fabrication methods, because of their significant overhead. Apart from the low percentage of cells that typically participate simultaneously in a computation, there may also be – depending on the design – relatively high hardware requirements
for each cell, given its limited functionality as compared to conventional VLSI circuit designs. One of these factors may be improved, but this tends to come at the cost of the other: increasing the functionality of each cell may increase the percentage of cells that engage in a computation, since configurations of cells implementing certain tasks can be smaller, but a cell will then require more hardware resources to implement its functionality. A cellular automaton's overhead, which may exceed a factor of 10, could still be acceptable if it can be compensated for by the low-cost availability of cells in the huge quantities that can be expected with bottom-up fabrication. Even when only bottom-up fabrication methods are available, however, it is important to minimize the complexity of cells, because success in this area translates directly into more efficient physical realizations. Though traditionally the complexity of a cellular automaton's cells is measured in terms of the number of cell states, this is inadequate in a nanocomputing framework, as it leaves unmentioned a cell's functionality in terms of the (number of) transition rules. The centralized storage of the table of transition rules, common in traditional cellular automata and in software simulating them, is infeasible for nanocomputers, because it violates one of the key tenets of cellular automata, i.e., locality. That leaves storage of the rule table in each cell individually as the only viable option, implying that this table should be as small as possible, lest a huge overhead occur. The alternative to storing the rule table is finding an efficient representation through inherently available physical interactions implied by the physical structure of a cell; this is ultimately what many researchers aim for when using cellular automata for nanocomputers. This alternative, however, imposes even stricter limitations on transition rules.
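A rough sketch of how the storage cost of an explicitly stored rule table scales (the state, rule, and neighborhood counts below are illustrative):

```python
import math

def cell_complexity_bits(n_states, n_rules, n_neighbors):
    """Bits to store a cell's state plus an explicit transition-rule
    table. Each rule encodes the cell's own state and the states of
    its neighbors (left-hand side) plus the new state (right-hand side)."""
    bits_per_state = math.ceil(math.log2(n_states))
    bits_per_rule = bits_per_state * (1 + n_neighbors + 1)
    return bits_per_state + n_rules * bits_per_rule

# Example: a 5-state cell with von Neumann neighborhood (4 neighbors)
# needs 3 bits per state and therefore 18 bits per rule.
print(cell_complexity_bits(n_states=5, n_rules=53, n_neighbors=4))  # prints 957
```

The per-rule cost grows linearly in the neighborhood size and logarithmically in the state count, so trimming the number of rules is usually the most effective way to shrink a cell.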
A cell's complexity is determined by the information required to represent its functionality, which includes the cell's state as well as the transition rules or any equivalent physical structure or mechanism. There are various ways to represent this information. A representation requiring fewer bits for storage of the transition rules may require more resources, both in time and in hardware, to decode the information. It is therefore important to consider the overall costs, including the required supporting circuitry, when evaluating a cell's complexity. Absent a concrete physical implementation, it tends to be easiest to settle for a cost measurement in terms of the number of bits to represent the cell states plus the number of bits to represent the transition rules, under the assumption that this information is decoded through a straightforward and simple scheme.

Nanocomputers, Figure 2 Bits required to encode a transition rule in a cellular automaton with 5-state cells and von Neumann neighborhood. The left-hand side of a transition rule requires three bits for the cell state itself and three bits each for the four neighboring cells' states. The right-hand side requires three bits to represent the new state of the cell after the transition

For the cellular automaton in [117], for example, three bits are required to represent a cell's five states. The von Neumann neighborhood of a cell in this model requires each transition rule to encode in its left-hand side the states of the cell itself and of its four neighboring cells, and in its right-hand side the new state of the cell (see Fig. 2). Given that there are 53 transition rules in this model, this results in a total of 3 + 53 × 3 × (1 + 4 + 1) = 957 bits required to encode the functionality of a cell. A similar calculation yields 148 bits of information for the model in [157] and 100 bits of information for the model in [158]. A further reduction may be required to allow feasible physical realizations. The use of cellular automata for nanocomputers originates in the late-1970s work of Fredkin, Toffoli, and Margolus, who aimed for computation inspired by physical models. A major issue in this research is the reduction of energy consumption as a road to high-performance, high-integration computers. Signals in this paradigm have a discrete particle-like character and are conserved throughout operations. Moreover, a reversible scheme of logic (see Reversible Cellular Automata) is adopted. One of the first proposals for a (2-dimensional) cellular automaton based on molecular interactions is due to Carter [28]. Written from a chemist's perspective, this proposal focuses on how particular operations of cells – including propagation in wires and switching by various automata – could be realized by means of the cascaded exchange of single and double bonds in molecules. This bond exchange – termed soliton propagation in conjugated systems – is relatively slow, but due to the small distances at which switches can be placed (on the order of 20 nm), switching times on the subnanosecond time scale are claimed to be feasible. Follow-up on this work is needed to show how the proposed operations could be
Nanocomputers, Figure 3 Quantum Dot Cellular Automaton (QCA). a A cell in the QCA contains four quantum dots storing two electrons. Since the electrons repel each other, they are likely to be in opposite corners of the cell. This allows for two possible cell configurations, which are interpreted as signals 0 and 1. b Due to the Coulomb interactions between electrons in quantum dots of neighboring cells, signals propagate along cells. A wire with cells in configuration 0 will change from the left to the right into a wire with cells in configuration 1 if the left-most cell is kept in configuration 1. c The same mechanism also works for wires taking left- and right-turns. d A NOT-gate formed from cells in the QCA. Change of input signal 0 to 1 results in the output signal changing from 1 to 0. e A majority-gate formed from cells in the QCA. From the majority-gate an AND-gate can be constructed by setting one of the inputs to the constant value 0. An OR-gate can be constructed by setting one input to 1
combined into a cellular automaton capable of useful computations, but unfortunately the work has remained relatively unknown in the computer science community. The Quantum Dot Cellular Automaton (QCA) [121,164,194] has cells that consist of four quantum dots, two of which contain an electron each (Fig. 3a). Due to Coulomb interactions inside a cell, the electrons settle along one of two possible polarizations, interpreted as signals 0 and 1, respectively. On a larger scale, there are also Coulomb interactions between cells, through the electrons in them. These interactions can be used for the propagation of signals along cells (Fig. 3b and c). Logic operations are also possible, like the NOT-gate in Fig. 3d, or the majority gate in Fig. 3e, which covers both an AND-gate and an OR-gate in its functionality. The QCA model is not a cellular automaton in the true sense, since it assumes a layout of cells along the topology of the circuitry they represent. This may even include layouts in which a cell is placed with an offset of half a cell at the side of another cell [194]. QCA promise extremely low power consumption, as their operation does not involve flows of electrons, but rather tunneling of electrons on a local scale, i.e., between quantum dots within a cell. The problem with QCA, however, is that they tend to be very sensitive to noise, which imposes a low operating temperature on them – an obstacle to their efficient application. A supposedly more practical variation on QCA is the Magnetic QCA [39], in which the dots are implemented as submicron magnetic dots and which may allow operating temperatures as high as room temperature. There is considerable doubt among some researchers [24] about whether QCA will ever provide a significant competitive advantage over CMOS in terms of integration density, speed, power consumption, and the implementation of complex systems. A 2-dimensional cellular automaton using optical signals to trigger transitions has been proposed by Biafore [20]. Employing physical interactions rather than explicit transition rules, this model has very simple partitioned cells that are structured so as to give control over both (a) which cells interact as neighbors and (b) when these interactions occur. Its design is based on Margolus' reversible cellular automaton [131] implementing Fredkin's billiard ball model (BBM) [65] (see Reversible Cellular Automata). The cells are suitable for implementation by quantum-dot devices organized such that electrical charge is transferred under the control of optical clock signals of different wavelengths supplied in a specific order. A similar way to control operations by electromagnetic or optical signals in a cellular automaton – though applied to binary addition rather than universal computation – is described in [14]. Somewhat related is the cellular automaton model with less regular topologies in [15]. A 1-dimensional cellular automaton in which transitions are triggered by electromagnetic pulses of well-defined frequency and length is described in [126]. Apart from classical operations like AND, OR, and NOT, quantum-mechanical operations are in principle possible in this scheme.
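The way the QCA majority gate (Fig. 3e) subsumes AND and OR is easy to verify with a plain Boolean sketch, independent of any physical implementation:

```python
def majority(a, b, c):
    """Three-input majority vote, the native QCA logic gate."""
    return int(a + b + c >= 2)

# Fixing one input to a constant specializes the majority gate:
AND = lambda a, b: majority(a, b, 0)
OR  = lambda a, b: majority(a, b, 1)
NOT = lambda a: 1 - a  # realized by a separate QCA cell layout (Fig. 3d)

# Exhaustive check over all input combinations:
for a in (0, 1):
    for b in (0, 1):
        assert AND(a, b) == (a & b)
        assert OR(a, b) == (a | b)
```

Together with the NOT-gate, the majority gate is thus functionally complete, which is why these two primitives suffice for general QCA circuit design.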
Cellular automata governed by nonlinear dynamics, so-called Cellular Neural Networks [35], hold another promise for nanocomputers, since the required physical implementations may be relatively simple. They are explored for this purpose in more detail in [101,217]. The logic states of such cellular automata can be expressed in terms of the electrical phases in a dynamic physical process. A possible implementation of such a model is tunneling phase logic [217]. Tunneling phase logic still suffers from many problems [23], however, ranging from the difficulty of manufacturing uniform tunneling diodes to the occurrence of stray background charges. Implementations of cellular automata utilizing physical interactions between CO molecules arranged on a copper (Cu) surface are described in [58,85]. The atoms in
the Cu surface form a triangular grid, on which the CO molecules settle, with the carbon atom of a molecule oriented toward a grid point. When two molecules occupy nearest grid points, they point slightly away from each other, due to a weak repulsive force between them. In some configurations of CO molecules this force is sufficiently strong to move certain CO molecules by one grid point, particularly in the so-called Chevron configuration (Fig. 4a). Organized in suitable configurations, CO molecules will move on the grid in succession, triggering each other's hops, like dominoes falling in succession. These configurations, called Molecule Cascades, can be exploited to transmit signals on the grid (Fig. 4b) and to conduct Boolean AND operations on signals (Fig. 4c). More complex circuits can also be formed; these require configurations that cross wires, and in some cases OR-gates. Compared to the interactions based on the Chevron configuration, the interactions required for wire crossings and OR-gates are more complicated, and unfortunately less reliable. Based on a wide variety of configurations, even a three-bit sorter is possible, as shown in [85]. A formal analysis of the configurations possible with molecule cascades is given in [25]. Though molecule cascades allow impressive integration densities – the three-bit sorter would require an area of merely 200 nm² – their speed of computation is very low: it takes about one hour for the three-bit sorter to complete its computation. There is no fundamental reason, though, why different systems of molecules could not be much faster. Setting up a molecule cascade system for computation involves careful and time-consuming manipulation of individual molecules by a Scanning Tunneling Microscope (STM), so these types of systems are far from practical applications.
A distinct disadvantage of the molecule cascades in [85] is that they allow only one-time computation: once a computation ends, the molecules have to be moved back to their initial positions to conduct the computation once more. An important step toward a practical system would be the development of a mechanism inherently present in the system to reinitialize molecules. The cellular automata discussed above utilize the physical interactions inherent in their design to conduct transitions, an approach perhaps best illustrated in [132]. Though this tends to result in extremely simple cells, it may be challenging to build additional functionality into cells, such as functionality to configure cells into states necessary to initiate certain computations. This may be especially difficult if such a configuration process needs to take place on only part of the cell space. Likewise, error correction functionality is difficult to implement through simple physical interactions in a cellular space. For these
Nanocomputers
Nanocomputers, Figure 4 Molecule Cascades based on CO molecules (indicated by circles) placed on a grid of copper atoms (indicated by dots). a Chevron configuration, in which the center molecule moves one grid point away due to repulsive forces. b Wire based on the Chevron configuration. The arrows indicate the hops made by the CO molecules. c AND-gate based on the Chevron configuration (C = A ∧ B)
reasons some researchers advocate cells that are more complex, but still sufficiently simple to potentially allow a huge number of them to be fabricated by bottom-up techniques and organized in arrays at nanometer scale integration densities. One approach along these lines is the Cell Matrix [57]. Each cell in this model, consisting of a memory of less than 100 bytes and a few dozen gates, can be configured to conduct an operation like NAND, XOR, one-bit addition, etc. The above cellular automaton models are all timed synchronously, requiring the delivery of a central clock signal to each cell. A synchronous mode of timing causes a wide array of problems, as pointed out in Sect. “Heat Dissipation”, such as high power consumption and heat dissipation. For this reason, cellular automata in which the principle of locality is carried one step further, to the timing of the architecture, have started to attract interest in recent years. The resulting asynchronous cellular automata have a mode of operation in which all cells conduct their transitions at random times, independently of each other. Notwithstanding the advantage of asynchronous timing from a physical point of view, it raises the question of how to compute on such cellular automata in a deterministic way. The usual method is to simulate a timing mechanism on an asynchronous cellular au-
tomaton to force the cells approximately into lock-step, and then to use well-established methods to compute synchronously. Unfortunately, such an approach causes a significant overhead in terms of the number of cell states and transition rules. To simulate an n-state synchronous cellular automaton on its asynchronous counterpart, one needs, depending on the method, a number of states of up to O(n²) [157]. Worse yet, to synchronize different parts of an asynchronous cellular automaton with each other – necessary, because in the end it is a synchronous cellular automaton that is being simulated – exchange of signals between these parts is required. This exchange takes the form of so-called synchronization waves that propagate along the cell space [157]. Since all cells need to continuously change their states to let these waves pass, many transitions are conducted that do not contribute to the computation, even in areas of the cell space where no signals or active configurations are present. Implemented physically, such asynchronous cellular automata would thus need to consume much energy for dummy transitions, making them hardly better candidates for nanocomputer architectures than synchronous cellular automata. A more efficient approach is to conduct asynchronous computations directly, using synchronization between cells only on strictly local scales. This approach
Nanocomputers, Figure 5 A Self-Timed Cellular Automaton (STCA) consists of cells with four partitions. All four partitions as well as one partition of each of the four neighboring cells are rewritten by transition rules. a Example of a small configuration of cells with partitions having values 0 (white) and 1 (black). b Transition rule for signal propagation. c Applying this rule to the configuration at the left results in a 1-partition moving to the right
is based on embedding a circuit on the cellular space and then simulating the circuit’s operation through the cooperative transitions of the cells. While the circuits in synchronous models are typically standard logic circuits involving AND, OR, and NOT gates, different functionalities are required for asynchronous circuits, but the general idea of embedding circuits on the cell space remains the same. The approach in [1,117,119,157,158] is to use delay-insensitive circuits (see Sect. “Heat Dissipation”), which, due to their robustness to signal delays, combine very well with asynchronous cellular automata, since the requirement no longer holds that signals must arrive at certain locations at certain times dictated by a central clock, as in synchronous circuits. This takes away concerns about variations in the operational speed of cells and considerably simplifies the design of configurations representing circuit elements embedded in the cellular space. To avoid the randomness in the timing of transitions interfering with proper deterministic operation of an asynchronous cellular automaton, its design involves transition rules that serialize transitions into well-controlled sequences of cell states. Some asynchronous cellular automata designed according to these principles can be found in [1,117,119]. Unfortunately, many of these models still require tens of transition rules, which may hinder
efficient physical implementations. To this end, alternative models have been proposed, called Self-Timed Cellular Automata (STCA) [157], in which cells are subdivided into four partitions, each representing a part of a cell’s state information (see Fig. 5a). This results in a substantial reduction of the number of transition rules. The model in [158] employs merely six rules. One of these rules, a rule for signal propagation, is shown in Fig. 5b, and its effect on a cell configuration of a signal is shown in Fig. 5c. In order to compute on a cellular automaton, it is necessary to configure it for the computation. One way to do this is by having a separate mechanism for (re)configuration, through a separate layer of circuitry to address each individual cell as in a memory, and writing the required state into it. Another way is to have the reconfiguration done by the cells themselves. This requires the transition rules to contain sufficient functionality to represent the reconfiguration mechanism. Such functionality resembles self-reproduction of configurations in cellular automata. In other words, such cellular automata require construction universality ( Self-Replication and Cellular Automata). Unfortunately, it turns out that implementing construction universality on a cellular automaton tends to give a large overhead in terms of the number of transition rules. A self-contained mechanism for reconfigura-
tion may thus not be practical, absent a novel ingenious way to implement construction universality, probably inspired by biological mechanisms. Reconfiguration in cellular automata can also be used to achieve defect-tolerance. In the approach in [90], defects are detected and isolated from non-defective cells through waves of cells in certain states propagating over the cell space; defective cells, which are unable to adhere to the state changes, stand out and can thus be detected. This approach is taken one step further in [93] by additionally scanning the cell space for defect-free areas on which circuits can be configured. Being self-contained, this scanning process uses reconfiguration mechanisms resembling the self-reproduction functionality on cellular automata. An online approach to defect-tolerance is followed in [92], in which configurations called random flies float around randomly in the cellular space, and stick to configurations that are static. A configuration unable to compute due to defects is static in this model and will thus be isolated by a layer of random flies stuck to it. Being highly experimental, these approaches require an overhead in terms of the number of transition rules that may be too high to be of practical significance for nanocomputers. Approaches to fault-tolerance in cellular automaton-based nanocomputers tend to focus on economic ways to implement redundancy. Early work in this line of research is reported in [150]. This model can correct at most one error in 19 cells, but to this end each cell requires a neighborhood of 49 cells in its transition rules, which is far larger than in usual cellular automata. The increased complexity of cells suggests that they will be very error-prone in physical implementations, limiting the practical value of this work. The model in [80] suffers from similar problems.
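The wave-based defect detection of [90] described above can be illustrated with a deliberately simplified one-dimensional sketch; the function name and the reduction to a single left-to-right sweep are assumptions made for illustration only. A wave of state changes passes over the cell space, and cells that cannot follow the rule are left behind in the old state, where they stand out as defective.

```python
def detect_defects(n_cells: int, defective: set) -> set:
    """A wave flips every healthy cell from state 0 to state 1; cells
    stuck at 0 afterwards stand out as defective."""
    state = [0] * n_cells
    for i in range(n_cells):          # the wave sweeps across the space
        if i not in defective:        # defective cells ignore the rule
            state[i] = 1
    return {i for i, s in enumerate(state) if s == 0}

print(detect_defects(8, {2, 5}))      # -> {2, 5}
```

In the cited cellular-automaton approach the sweep itself is realized by local transition rules rather than a global loop, but the detection principle is the same: inability to change state is what makes a defect visible.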
Better fault-tolerance is obtained in [67,68,70] with synchronous cellular automata, and in [69,204] with asynchronous cellular automata simulating synchronous cellular automata: the idea is to organize cells in blocks that perform a fault-tolerant simulation of a second cellular automaton, which in its turn is also organized in blocks, simulating even more reliably a third cellular automaton, and so on. This results in a hierarchical structure with high reliability at the higher levels, as with the CTMR technique mentioned in Sect. “Fault-Tolerance”. The cells in these models, however, are too complex to be suitable for efficient physical implementations, since they contain information regarding the hierarchical organization, such as block structure, address within a block, programs selecting transition rules, and timing information. In [91] a fault-tolerant STCA based on BCH codes (e.g. [207]) is pro-
posed, but computation on this model is inefficient, since only one signal at a time is allowed. Spielman’s work [187] on implementing a computation scheme in an encoded space based on an error correcting code, mentioned in Sect. “Fault-Tolerance”, has resulted in related work [158] in the context of STCAs. Each partition of an STCA’s cell and each adjacent partition of a neighboring cell is stored in a memory and encoded by an error correcting code. Up to one third of the memory’s bits can – if corrupted – be corrected by this method. Cellular automata in which cells have the complexities of processing elements [62,202], rather than (much simpler) finite automata, form the other end of the spectrum of cellular nanocomputer architectures. Since the fabrication of even simple processing elements is likely to be beyond the reach of bottom-up techniques, this approach may be infeasible for the same reasons as to why alternatives to von Neumann architectures are proposed in the first place for nanocomputers. In conclusion, the main challenges for cellular automaton-based nanocomputers are: (1) the design of models requiring a very small number of transition rules, i.e., on the order of less than ten; (2) the design or discovery of a mechanism to implement transition rules in a physically feasible way; (3) the implementation of reconfiguration ability while keeping the number of transition rules small; and (4) the implementation of fault- and defect-tolerance while keeping the number of transition rules and cell states small.
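The key point of the asynchronous models above – that a suitably designed transition rule yields a deterministic result no matter when cells happen to fire – can be demonstrated with a minimal one-dimensional sketch. This is a simplification assumed for illustration, not one of the cited models: a signal token moves right whenever a randomly selected cell’s rule is enabled, and the final configuration is independent of the random update order.

```python
import random

def run_async(n: int, steps: int = 10_000, seed: int = 0) -> list:
    """Asynchronous CA: a randomly chosen cell fires each step, but the
    rule (move the 1-token right into an empty cell) makes the outcome
    independent of the update order."""
    random.seed(seed)
    state = [1] + [0] * (n - 1)        # the signal starts at the left
    for _ in range(steps):
        i = random.randrange(1, n)     # a random cell attempts a transition
        if state[i - 1] == 1 and state[i] == 0:
            state[i - 1], state[i] = 0, 1
    return state

print(run_async(8))                    # the token ends at the rightmost cell
```

Given enough steps, every seed produces the same final configuration, which is the sense in which computation on asynchronous cellular automata can be deterministic despite random timing; the serialization of transitions into well-controlled sequences mentioned above generalizes this idea to full circuits.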
Crossbar Array-Based Architectures

A crossbar consists of a set of, say, N horizontal wires that cross another set of M vertical wires. The N × M cross points of the wires form a matrix of switches, which can be individually set (programmed) in a low-resistance state (closed) or a high-resistance state (opened). The simple structure of two planes of parallel wires allows for a wide variety of bottom-up fabrication techniques, such as self-assembly by fluidics-based alignment [87], directed self-assembly based on chemical techniques in combination with an alternating current (AC) electric field [53], or fabrication including top-down techniques like nanoimprinting [99,214]. Alignment of wires in order to connect them – a major problem in nanoelectronics – is relatively straightforward for crossbar arrays, due to the wires being perpendicular. Combined with a significant flexibility in making and altering connections, this accounts for much of the attention given to crossbar arrays in the context of nanocomputers. Nanowires with diameters of only
a few nanometers that can be used in implementations of crossbar arrays have already been fabricated, like single-crystal nanowires [41,87,100,144] and carbon nanotube wires [184]. The crossbar in [171] employs carbon nanotube wires that can be programmed by applying a voltage across a crosspoint of two wires. This voltage, which is higher than that for normal (non-programming) operation of the crossbar, results in the wires being bent toward each other to close the switch. After removing the voltage, these deformations are preserved due to van der Waals forces, so memory functionality is inherent to this mechanism. To open the switch again, an opposite voltage is applied, which drives the wires apart. Nanometer scale implementations of crossbar arrays like the above provide significant efficiency as compared to their CMOS-based counterparts, the latter requiring a separate 1-bit memory for each cross point to store the state of the corresponding switch. Another nanoscale crossbar design places a very thin layer of rotaxane molecules between the two wire planes in the crossbar [32]. Molecules in this layer at crosspoints can be set into low-resistance or high-resistance states – again through applying a voltage of sufficient strength – and these states tend to remain unchanged throughout successive read operations. The crossbar is used in the following contexts: Routing When used as an interconnection array, the crossbar array allows the routing of any input from a (single) vertical wire to any output on a (single) horizontal wire, by closing the switch between these wires. The resulting flexibility in connectivity provides an efficient way to route around defects. Logic Logic when implemented by crossbar arrays usually takes the form of wired (N)OR-planes or (N)AND-planes [47,109,182,183], whereby wires in one plane provide inputs and wires in the other plane provide outputs.
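A wired OR-plane of the kind just described reduces, at the logical level, to a boolean matrix-vector product. The sketch below (names assumed for illustration) evaluates the output wires from a switch matrix and an input vector.

```python
def wired_or(switches, inputs):
    """Crossbar OR-plane: output wire j carries 1 if any closed switch
    in its row connects it to an input wire carrying 1."""
    return [any(s and x for s, x in zip(row, inputs)) for row in switches]

# 2 output wires crossing 3 input wires; True marks a closed switch
switches = [[True, False, True],
            [False, True, False]]
print(wired_or(switches, [0, 1, 0]))   # -> [False, True]
```

Programming the plane amounts to choosing the boolean `switches` matrix; an AND-plane follows the same pattern with `all` in place of `any` over the inputs selected by closed switches.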
Memory The 1-bit memories at the crosspoints offer the potential of huge storage capacities when combined in large crossbar arrays [32,77,107,214]. Reading out a memory bit is usually done by applying a voltage to one of the corresponding wires, resulting in the state of the memory appearing on the other wire. Realizations as associative memories have also been considered, whereby information storage is distributed over the switches such that recovery is possible even if some switches fail [45,179]. Address Decoding The connections between the two wire planes can also be set such that a demultiplexer is obtained, which can address a single wire out of
the N wires in one plane by much less than N wires in the other plane, e.g. down to O(√N) or even O(log N) wires [32,47,208,219]. Apart from use in combination with crossbar arrays for routing, logic, or memory, a demultiplexer is also an excellent interface between micro-scales on one hand and nano-scales on the other, allowing for a limited number of microwires in one plane of a crossbar array to address a much larger number of much thinner nanowires in the other plane. Such demultiplexers tend to be static, i.e., they do not require reprogramming of the switch settings. When used as micro/nano-interfaces, reprogramming may even be impossible as nanowires cannot be accessed directly; prepatterned connections [47] or stochastically assembled connections [208] may be suitable options in this case. The cross points in crossbar arrays can take various forms, the most common of which are resistors, diodes – both combined with a switch in each cross point – and Field Effect Transistors (FETs). Resistor crossbars, followed by diode crossbars, are easiest to fabricate by bottom-up techniques, since both are 2-terminal devices, which significantly simplifies alignment problems. Resistor crossbars, additionally, can be built using metal nanowires, facilitating low output impedances, but their linear device characteristics may complicate the approximation of nonlinear functionalities [180]. Though diode and FET crossbars will do better with regard to the latter point, they carry significant disadvantages with existing fabrication technology [180]: limited current density, forward-biased voltage drops, relatively high impedance of semiconductor nanowires, difficulty of configuration, etc. Resistor and diode crossbars, on the other hand, come with another disadvantage [189]: being 2-terminal devices, they are passive, thus lacking signal gain – a major disadvantage as compared to 3-terminal devices like transistors – complicating the restoration of signals.
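The O(log N) demultiplexer addressing mentioned above can be sketched logically as follows; this is a simplified model with assumed names. Each nanowire carries a hard-wired binary address, each address bit is presented on a pair of complementary microwires, and a nanowire is driven only when every microwire it connects to is asserted, so 2·log₂ N microwires suffice to select one of N nanowires.

```python
def selected(applied_address: int, nanowire_id: int, n_bits: int) -> bool:
    """A nanowire is driven only when the applied address matches its
    hard-wired address in every bit (true and complement microwires)."""
    return all(((applied_address >> k) & 1) == ((nanowire_id >> k) & 1)
               for k in range(n_bits))

n_bits = 3                             # 2*3 = 6 microwires address 8 nanowires
driven = [w for w in range(8) if selected(0b101, w, n_bits)]
print(driven)                          # -> [5]
```

The static nature of such demultiplexers is reflected in the model: the per-nanowire address pattern is fixed at fabrication time, and only the applied address changes during operation.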
Moreover, 2-terminal devices tend to have higher power consumption than devices like transistors, due to the inherently higher currents flowing to ground [189]. The lack of signal restoration is especially problematic when many levels of logic are required through which signals pass. Minimizing this number of levels, however, causes an inefficient use of hardware: it prevents the exploitation of the logic structure of a circuit design, since each level – such as an OR-plane or an AND-plane in a crossbar array – has a homogeneous structure. This is a well-known problem in Programmable Logic Arrays (PLAs) [49]: an n-input XOR, for example, requires a number of product terms that is exponential in n when
expressed by the 2-level logic of a single PLA. It is thus unsurprising that signal restoration in crossbar arrays has received significant attention. Several proposals have been made to include signal restoration functionality in crossbar arrays. DeHon [50] realizes some of the cross points between two wires as FETs, such that the electric field of one wire controls the conductivity of the other wire. This requires doping of (parts of) a wire, but the problems related to that, such as alignment of doped parts of wires with crossing wires, may require further research. Kuekes et al. [110] propose to use a bistable-switch latch to restore signals, in which the logic value of a wire is controlled by two switches, one connecting to a logic 0 and one to a logic 1, such that only one switch is closed at a time. Goldstein [75] also uses a latch for signal restoration, though in this case the latch is composed of a wire with two Negative Differential Resistance molecules at either end. There are also schemes in which the inherently poor voltage margins of resistor-based crossbar logic are improved by including error-correcting codes in the scheme [109,111]. Within the class of crossbar arrays for memory applications, there is a pronounced difference in terms of storage capacity between crossbar arrays based on diodes and FETs on the one hand, and crossbar arrays based on resistors on the other hand. The use of diodes or FETs ensures that each cross point is independent of the other cross points, since electrical current can only flow in one direction in these designs. In resistor-based crossbar arrays, on the other hand, there will often be an indirect (parasitic) path through which a current flows when wires are connected at multiple cross points to perpendicular wires. This has a profound impact on the storage capacity of memories based on crossbar arrays.
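The parasitic-path problem in resistor crossbars can be made concrete with a small sketch (assumed names; resistances reduced to open/closed for clarity). Reading a cross point by checking whether current can flow between its two wires is fooled by a “sneak” path through three other closed switches, which is exactly what unidirectional devices like diodes at the cross points prevent.

```python
def read_bit(closed: set, row: int, col: int) -> bool:
    """Naive resistive read: current flows if ANY path of closed
    switches connects `row` to `col`, including the parasitic path
    row -> col' -> row' -> col through three closed cross points."""
    direct = (row, col) in closed
    sneak = any((row, c) in closed and (r, col) in closed
                for (r, c) in closed if r != row and c != col)
    return direct or sneak

closed = {(0, 1), (1, 1), (1, 0)}      # three closed switches
print(read_bit(closed, 0, 0))          # -> True, although bit (0, 0) is open
```

The middle hop of the sneak path requires current to flow “backwards” through one cross point; with diodes or FETs that hop is blocked, which is why each cross point then stores an independent bit.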
Whereas an N × N crossbar array will have an O(N²) storage capacity when based on diodes or FETs, it will have only O(N log N) storage capacity when based on resistors (for more detailed estimates of storage capacities see [185]). Though a standalone crossbar array is insufficient as an architecture for nanocomputers, it can be applied at different places and at different levels in a hierarchy to form a more or less “complete” architecture. Goldstein’s NanoFabrics [75], for example, consist of so-called nanoBlocks and switch-blocks organized in tiles called clusters that are connected to each other through long nanowires of varying lengths (e.g. 1, 2, 4, and 8 clusters long). These wires allow signals to traverse a few clusters without the need to go through switches. The nanoBlocks themselves consist of logic based on crossbar arrays, and the switch-blocks consist of interconnections based on crossbar arrays. Defect-tolerance is obtained by
testing the circuits – like in the Teramac, mentioned in Sect. “Fault-Tolerance” – but unlike the Teramac, testing is conducted in incremental stages in which an increasing part of the hardware – once it has been configured around defects – is used in the testing of the remaining parts [140]. Testing the initial part is conducted by a host. DeHon’s Nanowire-Based Programmable Architecture [49] provides designs for Wired-OR logic, interconnection, and signal restoration implemented by atomic-scale nanowire crossbar arrays. The CMOL architecture [125,190] is based on a reconfigurable crossbar array of nanodevices that is superimposed at a small angle on a grid of CMOS cells. It is used to implement memory, FPGA logic, or neural network inspired models (see Sect. “Neural Network-Based Architectures”). Each CMOS cell consists of an inverter and two pass transistors serving two pins that connect to the crossbar array. Snider et al.’s Field-Programmable Nanowire Interconnect (FPNI) [181] improves on this proposal by adjusting the crossbar array’s structure so that all CMOS-fabricated pins can be of the same height. Logic functionality in the architecture is restricted to the CMOS part, and routing to the crossbar of nanowires. Reconfigurability is one of the strong points of crossbar arrays, but it requires a time-intensive process of testing for defects to guarantee defect-free routes. In effect, manufacturing yield is traded for an expensive die-per-die multi-step functional test and programming process [24]. A combination with error-correcting codes, like in [108], is thought to be more economical by some authors [7]. It is unclear to what extent reconfiguration around defects can be made self-contained in crossbar-array-based architectures.

Neural Network-Based Architectures

A Neural Network is a collection of threshold elements, called neurons, of which the human brain contains about 10¹¹, each connected to 10³ to 10⁴ other neurons.
Neurons receive inputs from each other through weighted connections. The strengths (weights) of the connections in a neural network are changed in accordance with the input to it; this process is called learning. Learning is usually accomplished through a Hebbian learning rule, which was first discovered by Hebb to be of fundamental importance in real brains. Hebbian learning can be briefly described as “neurons that fire together, wire together”. In other words, two neurons tend to strengthen the connection between them when their patterns of fired pulses correlate positively with each other. In artificial neural networks this rule is often implemented in terms of the correlation of
the time-averaged analog values of neurons, rather than the actual pulses. Hebbian learning lies at the basis of an impressive self-organizing ability of neural networks. The model has much flexibility with regard to the connectivity of the network between the neurons, as a result of which little detailed information is required about the underlying network structure to conduct learning processes. This tends to result in a good tolerance to defects and faults, which is an important reason for the interest in neural networks for nanocomputer architectures. Neural networks tend to be especially useful for problems that involve pattern recognition, classification, associative memory, and control. Recognition of visual images, for example, is conducted by brains to an extent that is unsurpassed by traditional computers, in terms of speed (taking approximately 100 ms), as well as in terms of fidelity. Though individual neurons are slow – it takes about 10 ms for a neural signal to propagate from one neuron to the other – the system as a whole is very fast. Contrast that to the time it takes computers with clock speeds of many GHz to perform the same task, which tends to be in the order of minutes to hours. Most studied applications of artificial neural networks have difficulty going beyond the toy-problem level. A better knowledge about the brain may lead to better results, and there is also the expectation [125] that the availability of huge amounts of hardware resources may be of help. It is important to realize, however, that the large-scale organization of the brain is directed not only by relatively easy-to-characterize learning algorithms, but also by genetic factors, which involve a certain degree of “wetness”. The latter is not very compatible with the usual perceptions about computers. It is questionable whether the abundance of hardware resources alone will result in a breakthrough in neural-network-based architectures.
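Hebbian learning and associative recall, which are central to the architectures discussed here, fit in a few lines of code. The sketch below (function names assumed) stores a ±1 pattern by outer-product Hebbian learning and then restores it from a corrupted input by repeated thresholding, in the standard Hopfield style of associative memory.

```python
def hebbian_weights(patterns):
    """Hebbian rule: units that fire together strengthen their mutual
    connection (outer-product learning, no self-connections)."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j]
    return W

def recall(W, x, steps=5):
    """Associative recall: repeatedly threshold the weighted input sums."""
    for _ in range(steps):
        x = [1 if sum(w * v for w, v in zip(row, x)) >= 0 else -1
             for row in W]
    return x

stored = [[1, -1, 1, -1, 1, -1]]
W = hebbian_weights(stored)
noisy = [1, -1, 1, -1, 1, 1]           # one unit corrupted
print(recall(W, noisy))                # -> [1, -1, 1, -1, 1, -1]
```

Note that only local quantities enter the weight update (the activities of the two connected units), which is why such rules need so little information about the global network structure, and why they map naturally onto crossbar arrays of synapses as in the CrossNets discussed below.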
Neuromorphic Engineering is the term used to describe the interdisciplinary field that takes inspiration from neuroscience, computer science and engineering to design artificial neural systems. Though initially associated with VLSI systems engineering – due to Mead [135] coining the term neuromorphic – it is increasingly attracting contributions related to nanocomputers. One of the early papers in this context is [170], in which quantum dots arranged in a two-dimensional array and deposited on a resonant tunneling diode (RTD) are connected by nanowires, to form nodal points of nonlinear bistable elements in a conductive network. The dynamics of this system bears strong similarities to the equations representing associative memories. Another proposal related to neuromorphic nanoarchitectures is the Nanocell. A Nanocell is an element con-
taining randomly ordered metallic islands that are interlinked with molecules – acting as switches – between micrometer-sized metallic input/output leads [196]. To impose certain functionalities on the nanocell, like an AND, a NOR, an adder, etc., it is necessary for these switches to be set into certain configurations. Genetic Algorithms (GA) are used for this in [195], but this method requires a detailed knowledge of the random structure inside the nanocell, so it is impractical. This has resulted in proposals to use neural networks, in which – ideally – the switches can be programmed through a learning process employing electrical pulses on input and output terminals of the nanocell [88]. To create useful circuits from nanocells it is necessary to connect them – after programming – into certain circuits. Creating such inter-cell connectivity has not been researched in detail, but neuromorphic learning methods may be suitable in this context as well. Boltzmann machine neural networks based on single electron tunneling devices are proposed in [215]. Simulated annealing, the processing in Boltzmann machines through which a minimum energy state is found in an optimization problem, is implemented through a digital oscillator based on the stochastic nature of the tunneling phenomenon. The architecture consists of a network in which outputs of neural elements are fed back to their inputs through weights in an interconnection network, which are implemented by capacitance values reflecting the particular problem to be optimized. CrossNets [197] are neuromorphic architectures based on the CMOL architecture. Neurons in this model, being relatively sparse, are represented by CMOS logic, and the neural interconnections, being very dense, are represented by a crossbar array of nanometer scale wires, whereby a cross-connection between two wires represents a so-called synapse.
The crossbar array offers a very flexible framework to connect neurons to each other: it allows connectivity ranging from hierarchical structures that are typically used in multi-layered perceptrons [83] to more homogeneous networks with quasi-local connections. CrossNets can be used for pattern classification in feedforward multilayered perceptrons as well as for pattern restoration in associative memories. The crossbar array, though very flexible, also has disadvantages in this context, because it limits interconnection weights to be binary, and it is hard to avoid interference between different switches when setting or updating weights. Some solutions to these problems may be reached by using multiple-wire cross connections per weight [197]. The above models employ neural signals that have analog values, as opposed to the spiking signals that are common in real brains. Spikes offer the opportunity for decreased power consumption, and,
more importantly, their timing appears to encode important information in brains, and is thought to allow the semantic binding of different brain areas with each other. Some preliminary results on spiking neuromorphic architectures based on CMOL CrossNets and comparisons to non-spiking models are given in [71]. A different neuromorphic architecture based on crossbar arrays is explored in [45] for use in associative memories.

Future Directions

In pondering the future directions nanocomputers may take, it is important to realize that CMOS technology is a formidable competitor. Detailed predictions on the pace at which this technology progresses have been made for decades, with varying degrees of success. When the emphasis is on technological limitations, the resulting predictions tend to be overly conservative, as they do not take into account the many ways often found around those limitations [86]. When on the other hand trends are simply extrapolated from the past, predictions tend to be on the optimistic side, as they ignore hard physical limitations that might still be ahead [86]. Notwithstanding the uncertainties of the lifetime of CMOS technology, there seems to be a consensus that it can be extended to at least the year 2015. The exponential growth experienced over so many decades, however, may significantly slow down towards the end [143]. Beyond this, it is expected [23] that the benefits of increased integration densities will be extended for 30 more years through the use of emergent technologies. These technologies are likely to be used in combination with silicon technologies and gradually replace more of the latter. Many of the limitations that are perceived in technology – and CMOS is no exception to this – are likely to be circumvented by reconsidering the underlying assumptions, as was observed by Feynman [60]. This extends to the framework of circuits and architectures as well.
Reconsideration of underlying assumptions has frequently resulted in paradigm shifts in computer science. A notable example is formed by Reduced Instruction Set Computers (RISC) [73], which use a small set of rapidly executed instructions rather than the larger and slower instruction sets that were common until the 1980s in Complex Instruction Set Computers (CISC). The CISC approach was advantageous in the context of limited sizes of memories, necessitating a dense storage of instructions at the cost of increased processing in the processor itself. That changed with the increasing sizes and decreasing costs of memories, coupled with the realization that only a small percentage of the instructions available in CISC processors
were actually used in practice. In a similar way, a changing context will determine the direction in which architectures of nanocomputers develop. This chapter has identified a number of factors of importance in this respect, like the growing problems faced in the fabrication of arbitrary structures, the increasing occurrence of defects in those structures, the increasing occurrence of faults during computations, and the increasing problems with heat dissipation. These problems will have to be addressed one way or the other, and architectural solutions are likely to play an important role in addressing them; regular or random structures in combination with (re)configuration techniques seem especially attractive in this context. A gradual move away from the traditional von Neumann architecture toward the increasing use of parallelism is likely to occur. Initially it will be conventional processors that are placed in parallel configurations – a trend that has already started in the first decade of the 21st century. As feature sizes decrease further, fabrication technology will become more and more bottom-up, resulting in parallelism becoming increasingly fine-grained, to the point that computers consist of a huge number of elements with extremely simple logic functionality that can be configured according to specifications by a programmer. In the end, architectures may look more like intelligent memory than like the computers known today. Programming methods, though apt to change as architectures change, may be relatively unaffected by the trend toward fine-grained parallelism, as witnessed by the resemblance of conventional programming languages to languages for fine-grained hardware platforms like FPGAs, some of which were designed with the programming language C in mind. Trends in technology are often highly influenced by breakthroughs, and it is entirely possible that future nanocomputers will develop in different directions.
The increasing research effort into unconventional computing offers a glimpse of the wide range of what is possible in this regard. Computation models based on new media such as reaction-diffusion systems, solitons, and liquid crystals [2] are just a few of the fascinating examples that may some day work their way into practical computer architectures.
Bibliography
1. Adachi S, Peper F, Lee J (2004) Computation by asynchronously updating cellular automata. J Stat Phys 114(1/2):261–289
2. Adamatzky A (2002) New media for collision-based computing. In: Collision-Based Computing. Springer, London, pp 411–442
Nanocomputers
3. Adleman LM (1994) Molecular computation of solutions to combinatorial problems. Science 266(11):1021–1024 4. Appenzeller J, Joselevich E, Hönlein W (2003) Carbon nanotubes for data processing. In: Nanoelectronics and Information Technology. Wiley, Berlin, pp 473–499 5. Athas WC, Svensson LJ, Koller JG, Tzartzanis N, Chou EYC (1994) Low-power digital systems based on adiabatic-switching principles. IEEE Trans Very Large Scale Integr Syst 2(4):398–407 6. Aviram A, Ratner MA (1974) Molecular rectifiers. Chem Phys Lett 29(2):277–283 7. Bahar RI, Hammerstrom D, Harlow J, Joyner WH Jr, Lau C, Marculescu D, Orailoglu A, Pedram M (2007) Architectures for silicon nanoelectronics and beyond. Computer 40(1):25–33 8. Ball P (2006) Champing at the bits. Nature 440(7083):398–401 9. Banu M, Prodanov V (2007) Ultimate VLSI clocking using passive serial distribution. In: Future Trends in Microelectronics: Up the Nano Creek. Wiley, Hoboken, pp 259–276 10. Bashirullah R, Liu W (2002) Raised cosine approximation signalling technique for reduced simultaneous switching noise. Electron Lett 38(21):1256–1258 11. Beckett P, Jennings A (2002) Towards nanocomputer architecture. In: Lai F, Morris J (eds) Proc. 7th Asia-Pacific Computer Systems Architecture Conf. ACSAC’2002 (Conf. on Research and Practice in Information Technology), vol 6. Australian Computer Society, Darlinghurst, Australia 12. Benioff P (1980) The computer as a physical system: A microscopic quantum mechanical Hamiltonian model of computers as represented by Turing machines. J Stat Phys 22(5):563– 591 13. Benioff P (1984) Comment on: Dissipation in computation. Phys Rev Lett 53(12):1203 14. Benjamin SC, Johnson NF (1997) A possible nanometer-scale computing device based on an adding cellular automaton. Appl Phys Lett 70(17):2321–2323 15. Benjamin SC, Johnson NF (1999) Cellular structures for computation in the quantum regime. Phys Rev A 60(6):4334–4337 16. Bennett CH (1973) Logical reversibility of computation. 
IBM J Res Dev 17(6):525–532 17. Bennett CH (1982) The thermodynamics of computation – a review. Int J Theor Phys 21(12):905–940 18. Bennett CH (1984) Thermodynamically reversible computation. Phys Rev Lett 53(12):1202 19. Bennett CH (1988) Notes on the history of reversible computation. IBM J Res Dev 32(1):16–23 20. Biafore M (1994) Cellular automata for nanometer-scale computation. Physica D 70:415–433 21. Birge RR, Lawrence AF, Tallent JR (1991) Quantum effects, thermal statistics and reliability of nanoscale molecular and semiconductor devices. Nanotechnology 2(2):73–87 22. Bohr MT, Chau RS, Ghani T, Mistry K (2007) The high-k solution. IEEE Spectr 44(10):23–29 23. Bourianoff G (2003) The future of nanocomputing. Computer 36(8):44–53 24. Brillouët M (2007) Physical limits of silicon CMOS: real showstopper or wrong problem? In: Future Trends in Microelectronics: Up the Nano Creek. Wiley, Hoboken, pp 179–191 25. Carmona J, Cortadella J, Takada Y, Peper F (2006) From molecular interactions to gates: a systematic approach. In: ICCAD '06: Proceedings of the 2006 IEEE/ACM international conference on Computer-aided design, San Jose, 5–9 Nov 2006
26. Carter FL (1983) The chemistry in future molecular computers. In: Computer Applications in Chemistry, Proc. 6th Int. Conf. on Computers in Chemical Research and Education. Elsevier, Amsterdam, pp 225–262 27. Carter FL (1983) Molecular level fabrication techniques and molecular electronic devices. J Vac Sci Technolog B 1(4):959– 968 28. Carter FL (1984) The molecular device computer: point of departure for large scale cellular automata. Physica D 10(1– 2):175–194 29. Cavin RK, Zhirnov VV, Hutchby JA, Bourianoff GI (2005) Energy barriers, demons, and minimum energy operation of electronic devices. In: Proc. SPIE, vol 5844, pp 1–9 30. Ceruzzi P (1998) A history of modern computing. MIT Press, Cambridge 31. Chan SC, Shepard KL, Restle PJ (2005) Uniform-phase uniform-amplitude resonant-load global clock distributions. IEEE J Solid-State Circuits 40(1):102–109 32. Chen Y, Jung GY, Ohlberg DAA, Li X, Steward DR, Jeppesen JO, Nielsen KA, Stoddard JF, Williams RS (2003) Nanoscale molecular-switch crossbar circuits. Nanotechnology 14(4):462–468 33. Choi H, Mody C (2007) Molecular electronics in the longue durée: the microelectronics origins of nanotechnology. In: Joint Wharton-Chemical Heritage Foundation Symposium on the Social Studies of Nanotechnology, Philadelphia, 7–8 Jun 2007 34. Chou SY, Krauss PR, Renstrom PJ (1996) Imprint lithography with 25-nanometer resolution. Science 272(5258):85–87 35. Chua LO, Yang L (1988) Cellular neural networks: theory. Circuit Syst IEEE Trans 35(10):1257–1272 36. Collier CP, Wong EW, Belohradský M, Raymo FM, Stoddart JF, Kuekes PJ, Williams RS, Heath JR (1999) Electronically configurable molecular-based logic gates. Science 285(5426):391– 394 37. Collier CP, Mattersteig G, Wong EW, Luo Y, Beverly K, Sampaio J, Raymo FM, Stoddart JF, Heath JR (2000) A [2]Catenane-based solid state electronically reconfigurable switch. Science 289(5482):1172–1175 38. Constantinescu C (2007) Impact of intermittent faults on nanocomputing devices. 
In: Workshop on Dependable and Secure Nanocomputing. Edinburgh, 28 Jun 2007 39. Cowburn RP, Welland ME (2000) Room temperature magnetic quantum cellular automata. Science 287(5457):1466– 1468 40. Cui Y, Lieber CM (2001) Functional nanoscale electronic devices assembled using silicon nanowire building blocks. Science 291(5505):851–853 41. Cui Y, Lieber C, Lauhon L, Gudiksen M, Wang J (2001) Diameter-controlled synthesis of single crystal silicon nanowires. Appl Phys Lett 78(15):2214–2216 42. Dasmahapatra S, Werner J, Zauner KP (2006) Noise as a computational resource. Int J Unconv Comput 2(4):305–319 43. Davari B (1999) CMOS technology: present and future. In: Proc. IEEE Symp. on VLSI circuits. Digest of Technical Papers, pp 5–9 44. Davis A, Nowick SM (1997) An introduction to asynchronous circuit design. Tech Rep UUCS-97-013, Computer Science Department, University of Utah 45. Davis BA, Principe JC, Fortes JAB (2004) Design and performance analysis of a novel nanoscale associative memory.
In: Proceedings of 4th IEEE Conference on Nanotechnology, pp 314–316
46. Debray P, Raichev OE, Rahman M, Akis R, Mitchel WC (1999) Ballistic transport of electrons in T-shaped quantum waveguides. Appl Phys Lett 74(5):768–770
47. DeHon A (2003) Array-based architecture for FET-based nanoscale electronics. IEEE Trans Nanotechnol 2(1):23–32
48. DeHon A (2004) Law of large numbers system design. In: Nano, quantum and molecular computing: implications to high level design and validation. Kluwer, Norwell, pp 213–241
49. DeHon A (2005) Nanowire-based programmable architectures. ACM J Emerg Technol Comput Syst 1(2):109–162
50. DeHon A, Lincoln P, Savage JE (2003) Stochastic assembly of sublithographic nanoscale interfaces. IEEE Trans Nanotechnol 2(3):165–174
51. Dennard RH, Gaensslen FH, Yu HN, Rideout VL, Bassous E, LeBlanc AR (1974) Design of ion-implanted MOSFETs with very small physical dimensions. IEEE J Solid-State Circ 9(5):256–268
52. Depledge PG (1981) Fault-tolerant computer systems. IEE Proceedings A 128(4):257–272
53. Diehl MR, Yaliraki SN, Beckman RA, Barahona M, Heath JR (2002) Self-assembled deterministic carbon nanotube wiring networks. Angewandte Chem Int Ed 41(2):353–356
54. Dobrushin RL, Ortyukov SI (1977) Upper bound for the redundancy of self-correcting arrangements of unreliable functional elements. Probl Inform Transm 13(3):203–218
55. Drexler KE (1986) Engines of creation. Anchor Books, New York
56. Drexler KE (1992) Nanosystems: molecular machinery, manufacturing, and computation. Wiley, New York
57. Durbeck LJK, Macias NJ (2001) The cell matrix: an architecture for nanocomputing. Nanotechnology 12(3):217–230
58. Eigler DM, Lutz CP, Crommie MF, Mahoran HC, Heinrich AJ, Gupta JA (2004) Information transport and computation in nanometer-scale structures. Phil Trans R Soc Lond A 362(1819):1135–1147
59. Feynman RP (1985) Quantum mechanical computers. Optics News 11:11–20
60. Feynman RP (1992) There's plenty of room at the bottom (reprint of 1959 lecture). J Microelectromech Syst 1(1):60–66
61. Feynman RP, Leighton R, Sands M (2006) Ratchet and pawl. In: The Feynman Lectures on Physics, vol 1. Addison Wesley, San Francisco, pp 1–9
62. Fountain TJ, Duff MJB, Crawley DG, Tomlinson CD, Moffat CD (1998) The use of nanoelectronic devices in highly parallel computing systems. IEEE Trans VLSI Syst 6(1):31–38
63. Frank MP (2005) Introduction to reversible computing: motivation, progress, and challenges. In: CF '05: Proceedings of the 2nd conference on Computing frontiers. ACM Press, New York, pp 385–390
64. Frazier G, Taddiken A, Seabaugh A, Randall J (1993) Nanoelectronic circuits using resonant tunneling transistors and diodes. In: Digest of Technical Papers. IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, 24–26 Feb 1993, pp 174–175
65. Fredkin E, Toffoli T (1982) Conservative logic. Int J Theor Phys 21:219–253
66. Fukś H (2002) Nondeterministic density classification with diffusive probabilistic cellular automata. Phys Rev E 66(6):066106
67. Gács P (1986) Reliable computation with cellular automata. J Comput Syst Sci 32(1):15–78 68. Gács P (1989) Self-correcting two-dimensional arrays. In: Micali S (ed) Randomness in Computation. Advances in Computing Research (a scientific annual), vol 5. JAI Press, Greenwich, pp 223–326 69. Gács P (1997) Reliable cellular automata with self-organization. In: IEEE Symposium on Foundations of Computer Science, pp 90–99 70. Gács P, Reif J (1988) A simple three-dimensional real-time reliable cellular array. J Comput Syst Sci 36(2):125–147 71. Gao C, Hammerstrom D (2007) Cortical models onto CMOL and CMOS – architectures and performance/price. IEEE Trans Circ Syst I: Regul Pap 54(11):2502–2515 72. Gil D, de Andrés D, Ruiz JC, Gil P (2007) Identifying fault mechanisms and models of emerging nanoelectronic devices. In: Workshop on Dependable and Secure Nanocomputing (DSN'07). Online proceedings www.laas.fr/WDSN07/WDSN07_files/Texts/WDSN07-POST-01-Gil.pdf. Accessed 5 Aug 2008 73. Gimarc CE, Milutinovic VM (1987) A survey of RISC processors and computers of the mid-1980s. Computer 20(9):59–69 74. Goldstein SC (2005) The impact of the nanoscale on computing systems. In: IEEE/ACM International Conference on Computer-Aided Design (ICCAD 2005). San Jose, CA, pp 655–661. Online proceedings www.cs.cmu.edu/~seth/papers/goldstein-iccad05.pdf. Accessed 5 Aug 2008 75. Goldstein SC, Budiu M (2001) Nanofabrics: Spatial computing using molecular electronics. In: Proceedings of the 28th annual international symposium on Computer architecture, pp 178–191 76. Graham P, Gokhale M (2004) Nanocomputing in the presence of defects and faults: a survey. In: Nano, Quantum and Molecular Computing. Kluwer, Boston, pp 39–72 77. Green JE, Choi JW, Boukai A, Bunimovich Y, Johnston-Halperin E, Delonno E, Luo Y, Sheriff BA, Xu K, Shin YS, Tseng HR, Stoddart JF, Heath JR (2007) A 160-kilobit molecular electronic memory patterned at 10^11 bits per square centimeter. Nature 445(7126):414–417 78.
Han J, Jonker P (2003) A defect- and fault-tolerant architecture for nanocomputers. Nanotechnology 14(2):224–230 79. Han J, Gao J, Qi Y, Jonker P, Fortes JAB (2005) Toward hardware-redundant, fault-tolerant logic for nanoelectronics. IEEE Des & Test Comput 22(4):328–339 80. Harao M, Noguchi S (1975) Fault tolerant cellular automata. J Comput Syst Sci 11(2):171–185 81. Hartmanis J (1995) On the weight of computations. Bull Eur Assoc Theor Comput Sci 55:136–138 82. Hauck S (1995) Asynchronous design methodologies: an overview. Proc IEEE 83(1):69–93 83. Haykin S (1998) Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Upper Saddle River, NJ 84. Heath JR, Kuekes PJ, Snider GS, Williams RS (1998) A defect-tolerant computer architecture: Opportunities for nanotechnology. Science 280(5370):1716–1721 85. Heinrich AJ, Lutz CP, Gupta JA, Eigler DM (2002) Molecule cascades. Science 298(5597):1381–1387 86. Ho R, Mai KW, Horowitz MA (2001) The future of wires. Proc IEEE 89:490–504
87. Huang Y, Duan X, Wei Q, Lieber C (2001) Directed assembly of one-dimensional nanostructures into functional networks. Science 291(5504):630–633 88. Husband CP, Husband SM, Daniels JS, Tour JM (2003) Logic and memory with nanocell circuits. IEEE Trans Electron Dev 50(9):1865–1875 89. Hush NS (2003) An overview of the first half-century of molecular electronics. Ann N Y Acad Sci 1006:1–20 90. Isokawa T, Abo F, Peper F, Kamiura N, Matsui N (2003) Defect-tolerant computing based on an asynchronous cellular automaton. In: Proceedings of SICE Annual Conference, Fukui, Japan, pp 1746–1749 91. Isokawa T, Abo F, Peper F, Adachi S, Lee J, Matsui N, Mashiko S (2004) Fault-tolerant nanocomputers based on asynchronous cellular automata. Int J Mod Phys C 15(6):893–915 92. Isokawa T, Kowada S, Peper F, Kamiura N, Matsui N (2006) Online marking of defective cells by random flies. In: Yacoubi SE, Chopard B, Bandini S (eds) Lecture Notes in Computer Science, vol 4173. Springer, Berlin, pp 347–356 93. Isokawa T, Kowada S, Takada Y, Peper F, Kamiura N, Matsui N (2007) Defect-tolerance in cellular nanocomputers. New Gener Comput 25(2):171–199 94. International Roadmap Committee (2005) International Technology Roadmap for Semiconductors 95. International Roadmap Committee (2005) International Technology Roadmap for Semiconductors, Emerging Research Devices. www.itrs.net/Links/2005ITRS/ERD2005.pdf. Accessed 5 Aug 2008 96. International Roadmap Committee (2005) International Technology Roadmap for Semiconductors, Interconnect. www.itrs.net/Links/2005ITRS/ERD2005.pdf. Accessed 5 Aug 2008 97. Iwai H (2004) CMOS scaling for sub-90 nm to sub-10 nm. In: VLSID '04: Proceedings of the 17th International Conference on VLSI Design, IEEE Computer Society, Washington, DC, p 30 98. Jablonski DG (1990) A heat engine model of a reversible computation. Proc IEEE 78(5):817–825 99.
Jung GY, Johnston-Halperin E, Wu W, Yu Z, Wang SY, Tong WM, Li Z, Green JE, Sheriff BA, Boukai A, Bunimovich Y, Heath JR, Williams RS (2006) Circuit fabrication at 17 nm half-pitch by nanoimprint lithography. Nano Lett 6(3):351–354
100. Kamins TI, Williams RS, Chen Y, Chang YL, Chang YA (2000) Chemical vapor deposition of Si nanowires nucleated by TiSi2 islands on Si. Appl Phys Lett 76(5):562–564
101. Kiehl RA (2006) Information processing in nanoscale arrays: DNA assembly, molecular devices, nano-array architectures. In: ICCAD '06: Proceedings of the 2006 IEEE/ACM international conference on Computer-aided design, San Jose, 5–9 Nov 2006
102. Kish LB (2002) End of Moore's law: thermal (noise) death of integration in micro and nano electronics. Phys Lett A 305(3–4):144–149
103. Kish LB (2006) Thermal noise driven computing. Appl Phys Lett 89(14):144104
104. Knap W, Deng Y, Rumyantsev S, Lu JQ, Shur MS, Saylor CA, Brunel LC (2002) Resonant detection of subterahertz radiation by plasma waves in a submicron field-effect transistor. Appl Phys Lett 80(18):3433–3435
105. Korkmaz P, Akgul BES, Palem KV, Chakrapani LN (2006) Advocating noise as an agent for ultra-low energy computing: probabilistic complementary metal-oxide-semiconductor devices and their characteristics. Jpn J Appl Phys 45(4B):3307–3316
106. Kreup F, Graham AP, Liebau M, Duesberg GS, Seidel R, Unger E (2004) Carbon nanotubes for interconnect applications. In: Electron Devices Meeting, 2004. IEDM Technical Digest. IEEE International, pp 683–686
107. Kuekes PJ, Williams RS, Heath JR (2000) Demultiplexer for a molecular wire crossbar network. US Patent 6 128 214
108. Kuekes PJ, Robinett W, Seroussi G, Williams RS (2005) Defect-tolerant interconnect to nanoelectronic circuits: internally redundant demultiplexers based on error-correcting codes. Nanotechnology 16(6):869–881
109. Kuekes PJ, Robinett W, Williams RS (2005) Improved voltage margins using linear error-correcting codes in resistor-logic demultiplexers for nanoelectronics. Nanotechnology 16(9):1419–1432
110. Kuekes PJ, Steward DR, Williams RS (2005) The crossbar latch: Logic value storage, restoration, and inversion in crossbar circuits. J Appl Phys 97(3):034301
111. Kuekes PJ, Robinett W, Roth RM, Seroussi G, Snider GS, Williams RS (2006) Resistor-logic demultiplexers for nanoelectronics based on constant-weight codes. Nanotechnology 17(4):1052–1061
112. Lala PK (2001) Self-checking and fault-tolerant digital design. Morgan Kaufmann, San Francisco, CA
113. Landauer R (1961) Irreversibility and heat generation in the computing process. IBM J Res Dev 5(3):183–191
114. Landauer R (1984) Dissipation in computation. Phys Rev Lett 53(12):1205
115. Landauer R (1992) Information is physical. In: PhysComp'92: Workshop on Physics and Computation, Dallas, 2–4 Oct 1992, pp 1–4
116. Le J, Pinto Y, Seeman NC, Musier-Forsyth K, Taton TA, Kiehl RA (2004) DNA-templated self-assembly of metallic nanocomponent arrays on a surface. Nano Lett 4(12):2343–2347
117. Lee J, Adachi S, Peper F, Morita K (2003) Embedding universal delay-insensitive circuits in asynchronous cellular spaces. Fundamenta Informaticae 58(3/4):295–320
118. Lee J, Peper F, Adachi S, Mashiko S (2004) On reversible computation in asynchronous systems. In: Quantum Information and Complexity. World Scientific, Singapore, pp 296–320
119. Lee J, Adachi S, Peper F, Mashiko S (2005) Delay-insensitive computation in asynchronous cellular automata. J Comput Syst Sci 70:201–220
120. Lee J, Peper F, Adachi S (2006) Reversible logic elements operating in asynchronous mode. US Patent 6 987 402
121. Lent CS, Tougaw PD, Porod W, Bernstein GH (1993) Quantum cellular automata. Nanotechnology 4(1):49–57
122. Li C, Fan W, Lei B, Zhang D, Han S, Tang T, Liu X, Liu Z, Asano S, Meyyappan M, Han J, Zhou C (2004) Multilevel memory based on molecular devices. Appl Phys Lett 84(11):1949–1951
123. Liebmann LW (2003) Layout impact of resolution enhancement techniques: impediment or opportunity? In: Proc. 2003 Int. Symp. on Physical Design (ISPD'03), ACM Press, pp 110–117
124. Likharev KK, Semenov VK (1991) RSFQ logic/memory family: a new Josephson-junction technology for sub-terahertz-clock-frequency digital systems. IEEE Trans Appl Supercond 1(1):3–28
125. Likharev KK, Strukov DB (2005) CMOL: Devices, circuits, and architectures. In: Cuniberti G et al (eds) Introduction to Molecular Electronics. Springer, Berlin, pp 447–477 126. Lloyd S (1993) A potentially realizable quantum computer. Science 261(5128):1569–1571 127. Lloyd S (2000) Ultimate physical limits to computation. Nature 406(6799):1047–1054 128. Madou MJ (2002) Lithography. In: Fundamentals of Microfabrication, The Science of Miniaturization. CRC Press, Florida, pp 1–76 129. Maezawa K, Förster A (2003) Quantum transport devices based on resonant tunneling. In: Nanoelectronics and Information Technology, pp 407–424 130. Manohar R, Martin AJ (1995) Quasi-delay-insensitive circuits are Turing-complete. Tech. Rep. CaltechCSTR:1995.cs-tr-95-11, California Institute of Technology, Pasadena, CA 131. Margolus NH (1984) Physics-like models of computation. Physica D 10(1/2):81–95 132. Margolus NH (1999) Crystalline computation. In: Feynman and computation: exploring the limits of computers. Perseus books, Cambridge, pp 267–305 133. Martin AJ (1990) Programming in VLSI: From communicating processes to delay-insensitive circuits. In: Hoare CAR (ed) Developments in Concurrency and Communication. Addison-Wesley, Reading, pp 1–64 134. Mayor M, Weber HB, Waser R (2003) Molecular Electronics. In: Nanoelectronics and Information Technology. Wiley, Berlin, pp 501–525 135. Mead C (1990) Neuromorphic electronic systems. Proc IEEE 78(10):1629–1636 136. Mead C, Conway L (1980) Introduction to VLSI Systems. Addison-Wesley, Boston 137. Meindl JD (1995) Low power microelectronics: retrospect and prospect. Proc IEEE 83(4):619–635 138. Meindl JD, Chen Q, Davis JA (2001) Limits on silicon nanoelectronics for terascale integration. Science 293(5537):2044–2049 139. Miller DAB (2000) Rationale and challenges for optical interconnects to electronic chips. Proc IEEE 88(6):728–749 140. Mishra M, Goldstein SC (2003) Defect tolerance at the end of the roadmap.
In: Proceedings of the IEEE International Test Conference (ITC), vol 1, pp 1201–1210 141. Mizuno M, Anjo K, Surni Y, Wakabayashi H, Mogami T, Horiuchi T, Yamashina M (2000) On-chip multi-ghz clocking with transmission lines. In: 2000 IEEE International Solid-State Circuits Conference (ISSCC). Digest of Technical Papers, pp 366– 367 142. Montemerlo MS, Love JC, Opiteck GJ, Goldhaber-Gordon DJ, Ellenbogen JC (1996) Technologies and designs for electronic nanocomputers. Tech. Rep. 96W0000044, MITRE 143. Moore GE (2003) No exponential is forever: but “forever” can be delayed! In: Solid-State Circuits Conference. Digest of Technical Papers. ISSCC. IEEE International Solid-State Circuits Conference (ISSCC), vol 1, pp 20–23 144. Morales A, Lieber C (2001) A laser ablation method for the synthesis of crystalline semiconductor nanowires. Science 291(5348):208–211 145. Morita K (2003) A simple universal logic element and cellular automata for reversible computing. Lect Notes Comput Sci 2055:102–113
146. Motwani R, Raghavan P (1995) Randomized Algorithms. Cambridge University Press, New York, NY 147. Muller DE, Bartky WS (1959) A theory of asynchronous circuits. In: Proceedings of an International Symposium on the Theory of Switching. Harvard University Press, pp 204–243 148. Nikolić K, Forshaw M (2003) The current status of nanoelectronic devices. Int J Nanosci 2(1/2):7–29 149. Nikolić K, Sadek A, Forshaw M (2002) Fault-tolerant techniques for nanocomputers. Nanotechnology 13(3):357–362 150. Nishio H, Kobuchi Y (1975) Fault tolerant cellular spaces. J Comput Syst Sci 11(2):150–170 151. O KK, Kim K, Floyd B, Mehta J, Yoon H, Hung CM, Bravo D, Dickson T, Guo X, Li R, Trichy N, Caserta J, Bomstad W, Branch J, Yang DJ, Bohorquez J, Gao L, Sugavanam A, Lin JJ, Chen J, Martin F, Brewer J (2003) Wireless communications using integrated antennas. In: Proc. 2003 IEEE International Interconnect Technology Conference, San Francisco, 2–4 June 2003, pp 111–113 152. O'Mahony F, Yue CP, Horowitz MA, Wong SS (2003) A 10-GHz global clock distribution using coupled standing-wave oscillators. IEEE J Solid-State Circ 38(11):1813–1820 153. O'Mahony F, Yue CP, Horowitz M, Wong SS (2003) 10 GHz clock distribution using coupled standing-wave oscillators. In: Solid-State Circuits Conference. Digest of Technical Papers. IEEE International Solid-State Circuits Conference (ISSCC), vol 1, pp 428–504 154. Ono Y, Fujiwara A, Nishiguchi K, Inokawa H, Takahashi Y (2005) Manipulation and detection of single electrons for future information processing. J Appl Phys 97:031101 155. Palem KV (2005) Energy aware computing through probabilistic switching: a study of limits. IEEE Trans Comput 54(9):1123–1137 156. Parviz BA, Ryan D, Whitesides GM (2003) Using self-assembly for the fabrication of nano-scale electronic and photonic devices. IEEE Trans Adv Packag 26(3):233–241 157.
Peper F, Lee J, Adachi S, Mashiko S (2003) Laying out circuits on asynchronous cellular arrays: a step towards feasible nanocomputers? Nanotechnology 14(4):469–485 158. Peper F, Lee J, Abo F, Isokawa T, Adachi S, Matsui N, Mashiko S (2004) Fault-tolerance in nanocomputers: a cellular array approach. IEEE Trans Nanotechnol 3(1):187–201 159. Petty M (2007) Molecular Electronics, from Principles to Practice. Wiley, West Sussex 160. Pinto YY, Le JD, Seeman NC, Musier-Forsyth K, Taton TA, Kiehl RA (2005) Sequence-encoded self-assembly of multiple-nanocomponent arrays by 2D DNA scaffolding. Nano Lett 5(12):2399–2402 161. Pippenger N (1985) On networks of noisy gates. In: 26th Annual Symposium on Foundations of Computer Science, 21– 23 October 1985, Portland, Oregon, IEEE, pp 30–38 162. Pippenger N (1989) Invariance of complexity measures for networks with unreliable gates. J ACM 36(3):531–539 163. Pippenger N (1990) Developments in: “The synthesis of reliable organisms from unreliable components”. In: Proc. of Symposia in Pure Mathematics, vol 50. pp 311–324 164. Porod W (1998) Quantum-dot cellular automata devices and architectures. International journal of high-speed electronics and systems 9(1):37–63 165. Porod W, Grondin RO, Ferry DK (1984) Dissipation in computation. Phys Rev Lett 52(3):232–235
166. Rahman A, Reif R (2000) System-level performance evaluation of three-dimensional integrated circuits. IEEE Trans Very Large Scale Integr Syst 8(6):671–678
167. Keyes RW (1985) What makes a good computer device? Science 230(4722):138–144
168. Robinson AL (1984) Computing without dissipating energy. Science 223(4641):1164–1166
169. Rothemund PW, Papadakis N, Winfree E (2004) Algorithmic self-assembly of DNA Sierpinski triangles. PLoS Biol 2(12):2041–2053
170. Roychowdhury VP, Janes DB, Bandyopadhyay S, Wang X (1996) Collective computational activity in self-assembled arrays of quantum dots: a novel neuromorphic architecture for nanoelectronics. IEEE Trans Electron Dev 43(10):1688–1699
171. Rueckes T, Kim K, Joselevich E, Tseng G, Cheung C, Lieber C (2000) Carbon nanotube based nonvolatile random access memory for molecular computing. Science 289(5476):94–97
172. Sadek AS, Nikolić K, Forshaw M (2004) Parallel information and computation with restitution for noise-tolerant nanoscale logic networks. Nanotechnology 15(1):192–210
173. Sathe V, Chueh JY, Kim J, Ziesler CH, Kim S, Papaefthymiou M (2005) Fast, efficient, recovering, and irreversible. In: CF '05: Proceedings of the 2nd Conference on Computing Frontiers. ACM, New York, pp 407–413
174. Seitz CL (1980) System timing. In: Mead CA, Conway LA (eds) Introduction to VLSI Systems. Addison-Wesley, Boston
175. Sherman WB, Seeman NC (2004) A precisely controlled DNA biped walking device. Nano Lett 4(7):1203–1207
176. Shor PW (2004) Progress in quantum algorithms. Quantum Inf Process 3(1–5):5–13
177. Smith PA, Nordquist CD, Jackson TN, Mayer TS, Martin BR, Mbindyo J, Mallouk TE (2000) Electric-field assisted assembly and alignment of metallic nanowires. Appl Phys Lett 77(9):1399–1401
178. Snepscheut JvD (1985) Trace theory and VLSI design. In: Lecture Notes in Computer Science, vol 200. Springer, Berlin
179. Snider GS, Kuekes PJ (2003) Molecular-junction-nanowire-crossbar-based associative array.
US Patent 6 898 098
180. Snider GS, Robinett W (2005) Crossbar demultiplexers for nanoelectronics based on n-hot codes. IEEE Trans Nanotechnol 4(2):249–254
181. Snider GS, Williams RS (2007) Nano/CMOS architectures using a field-programmable nanowire interconnect. Nanotechnology 18(3):1–11
182. Snider GS, Kuekes PJ, Williams RS (2004) CMOS-like logic in defective, nanoscale crossbars. Nanotechnology 15(8):881–891
183. Snider GS, Kuekes PJ, Hogg T, Williams RS (2005) Nanoelectronic architectures. Appl Phys A 80(6):1183–1195
184. Soh HT, Quate CF, Morpurgo AF, Marcus CM, Kong J, Dai H (1999) Integrated nanotube circuits: controlled growth and ohmic contacting of single-walled carbon nanotubes. Appl Phys Lett 75(5):627–629
185. Sotiriadis PP (2006) Information capacity of nanowire crossbar switching networks. IEEE Trans Inf Theory 52(7):3019–3032
186. Spagocci S, Fountain T (1999) Fault rates in nanochip devices. Proc Electrochem Soc 98-19:582–596
187. Spielman DA (1996) Highly fault-tolerant parallel computation. In: Proceedings of the 37th IEEE Symposium on Foundations of Computer Science (FOCS), Burlington, 14–16 Oct 1996, pp 154–163
188. Srivastava N, Banerjee K (2004) Interconnect challenges for nanoscale electronic circuits. TMS J Mater (JOM) 56(10):30–31
189. Stan MR, Franzon PD, Goldstein SC, Lach JC, Ziegler MM (2003) Molecular electronics: from devices and interconnect to circuits and architecture. Proc IEEE 91(11):1940–1957
190. Strukov DB, Likharev KK (2005) CMOL FPGA: a reconfigurable architecture for hybrid digital circuits with two-terminal nanodevices. Nanotechnology 16(6):888–900
191. Taubin A, Cortadella J, Lavagno L, Kondratyev A, Peeters A (2007) Design automation of real life asynchronous devices and systems. Found Trends Electron Des Autom 2(1):1–133
192. Theis TN (2000) The future of interconnection technology. IBM J Res Dev 44(3):379–390
193. Toffoli T (1984) Comment on: Dissipation in computation. Phys Rev Lett 53(12):1204
194. Tougaw PD, Lent CS (1994) Logical devices implemented using quantum cellular-automata. J Appl Phys 75:1818–1825
195. Tour JM, Van Zandt L, Husband CP, Husband SM, Wilson LS, Franzon PD, Nackashi DP (2002) Nanocell logic gates for molecular computing. IEEE Trans Nanotechnol 1(2):100–109
196. Tour JM, Cheng L, Nackashi DP, Yao Y, Flatt AK, St Angelo SK, Mallouk TE, Franzon PD (2003) Nanocell electronic memories. J Am Chem Soc 125(43):13279–13283
197. Türel Ö, Lee JH, Ma X, Likharev K (2005) Architectures for nanoelectronic implementation of artificial neural networks: new results. Neurocomputing 64:271–283
198. Uchida K (2003) Single-electron devices for logic applications. In: Nanoelectronics and Information Technology. Wiley, Berlin, pp 425–443
199. Unger SH (1969) Asynchronous Sequential Switching Circuits. Wiley, New York
200. von Hippel AR (1956) Molecular engineering. Science 123(3191):315–317
201. von Neumann J (1956) Probabilistic logics and the synthesis of reliable organisms from unreliable components. In: Automata Studies. Princeton University Press, Princeton, pp 43–98
202. Waingold E, Taylor M, Srikrishna D, Sarkar V, Lee W, Lee V, Kim J, Frank M, Finch P, Barua R, Babb J, Amarasinghe S, Agarwal A (1997) Baring it all to software: Raw machines. Computer 30(9):86–93
203. Wang KL, Khitun A, Flood AH (2005) Interconnects for nanoelectronics. In: Proc. 2005 IEEE International Interconnect Technology Conference, San Francisco, 6–8 June 2005, pp 231–233
204. Wang W (1990) An asynchronous two-dimensional self-correcting cellular automaton. PhD thesis, Boston University, Boston, MA. Short version in: Proc. 32nd IEEE Symposium on the Foundations of Computer Science, San Juan, 1–4 Oct 1990. IEEE Press, pp 188–192, 1991
205. Weeber JC, González MU, Baudrion AL, Dereux A (2005) Surface plasmon routing along right angle bent metal strips. Appl Phys Lett 87(22):221101
206. Whitesides GM, Grzybowsky B (2002) Self-assembly at all scales. Science 295(5564):2418–2421
207. MacWilliams FJ, Sloane NJA (1978) The Theory of Error-Correcting Codes. North-Holland, Amsterdam
208. Williams RS, Kuekes PJ (2001) Demultiplexer for a molecular wire crossbar network. US Patent 6 256 767
209. Winfree E, Liu F, Wenzler LA, Seeman NC (1998) Design and self-assembly of two-dimensional DNA crystals. Nature 394(6693):539–544 210. Wolf SA, Awschalom DD, Buhrman RA, Daughton JM, von Molnar S, Roukes ML, Chtchelkanova AY, Treger DM (2001) Spintronics: a spin-based electronics vision for the future. Science 294(5546):1488–1495 211. Wong HSP, Frank DJ, Solomon PM, Wann CHJ, Wesler JJ (1999) Nanoscale CMOS. Proc IEEE 87(4):537–570 212. Wood J, Edwards TC, Lipa S (Nov 2001) Rotary traveling-wave oscillator arrays: a new clock technology. IEEE J Solid-State Circ 36(11):1654–1665 213. Worschech L, Beuscher F, Forchel A (1999) Quantized conductance in up to 20 µm long shallow etched GaAs/AlGaAs quantum wires. Appl Phys Lett 75(4):578–580 214. Wu W, Jung GY, Olynick DL, Straznicky J, Li Z, Li X, Ohlberg DAA, Chen Y, Wang SY, Liddle JA, Tong WM, Williams RS
215.
216. 217.
218.
219.
(2005) One-kilobit cross-bar molecular memory circuits at 30nm half-pitch fabricated by nanoimprint lithography. Appl Phys A 80(6):1173–1178 Yamada T, Akazawa M, Asai T, Amemiya Y (2001) Boltzmann machine neural network devices using single-electron tunneling. Nanotechnology 12(1):60–67 Yanagida T, Ueda M, Murata T, Esaki S, Ishii Y (2007) Brownian motion, fluctuation and life. Biosystems 88(3):228–242 Yang T, Kiehl R, Chua L (2001) Tunneling phase logic cellular nonlinear networks. Int J Bifurc Chaos 11(12):2895– 2911 Zhirnov VV, Cavin RK, Hutchby JA, Bourianoff GI (2003) Limits to binary logic switch scaling – a gedanken model. Proc IEEE 91(11):1934–1939 Zhong Z, Wang D, Cui Y, Bockrath MW, Lieber CM (2003) Nanowire crossbar arrays as address decoders for integrated nanosystems. Science 302(5649):1377–1379
Network Analysis, Longitudinal Methods of
TOM A. B. SNIJDERS
University of Oxford, Oxford, United Kingdom

Article Outline

Glossary
Definition of the Subject
Introduction
Stochastic Models for Network Dynamics
Statistical Estimation and Testing
Example: Dynamics of Adolescent Friendship
Models for the Co-evolution of Networks and Behavior
Example: Co-evolution of Adolescent Friendship and Alcohol Use
Extensions
Future Directions
Bibliography

Glossary

Actors  The social actors who are represented by the nodes of the network, and indicated by a label denoted i or j in the set {1, ..., n}.

Behavior  An umbrella term for changing characteristics of actors, considered as components of the outcome of the stochastic system: e.g., behavioral tendencies or attitudes of human actors, performance, etc. Each behavior variable Z_h is assumed to be measured on an ordinal discrete scale with values 1, 2, ..., M_h for some M_h ≥ 2. The value of behavior variable Z_h for actor i is denoted Z_ih.

Change determination process  The stochastic model defining the probability distribution of changes, conditional on the event that there is an opportunity for change.

Change opportunity model  The stochastic process defining the moments where tie indicators can change. This can be either tie-based, meaning that an ordered pair of actors (i, j) is chosen and the possibility arises that the tie variable from i to j is changed; or actor-based, meaning that an actor i is chosen and the possibility arises that one of the outgoing tie variables from actor i is changed.

Covariates  Variables which can depend on the actors (actor covariates) or on pairs of actors (dyadic covariates), and which are considered to be deterministic, or determined outside of the 'stochastic system' under consideration.
Effects  Components of the objective function.

Influence  The phenomenon that change probabilities for actors' behavior depend on the network positions of the actors, usually in combination with the current behavior of the other actors.

Markov chain  A stochastic process where the probability distribution of future states, given the present state, does not depend on past states.

Method of moments  A general method of statistical estimation, where the parameters are estimated in such a way that the expected values of a vector of selected statistics are equal to their observed values.

Network  A simple directed graph representing a relation on the set of actors with binary tie indicators X_ij, which can be regarded as a state which can change, but will normally change slowly.

Objective function  Usually denoted by f_i; the informal description is that this is a measure of how attractive it is to go from an old to a new state. More formally, when there is an opportunity for change, the probability of the change is assumed to be proportional to the exponential transform of the objective function. The objective function has a similar role as the linear predictor in generalized linear models in statistics, and is specified here as a linear combination of effects.

Rate function  Usually denoted by λ, the expected number of opportunities for change per unit of time.

Selection  The phenomenon that change probabilities for network ties depend on the behavior of one or both of the two actors involved.

Tie indicator  A variable X_ij indicating by the value X_ij = 1 that there is a tie i → j, and by the value 0 that there is no such tie. Also called tie variables.

Definition of the Subject

Social networks represent the patterns of ties between social actors. To analyze empirically the mechanisms that determine the creation and termination of ties, especially if several mechanisms that may be complementary are studied simultaneously, statistical methods are needed.
This chapter is aimed at the case in which network panel data are available to the researcher, and treats recently developed statistical models for such data, with corresponding estimation methods. To represent the feedback processes inherent in network dynamics, it is helpful to regard such panel data as momentary observations of a continuous-time stochastic process on the space of directed graphs. Tie-oriented and actor-oriented stochastic models are presented, which can reflect endogenous network dynamics as well as effects of exogenous variables. These models can
be regarded as agent-based models, and they can be implemented as computer simulation models. To estimate the parameters of the model, stochastic approximation methods can be used. Social networks are especially interesting because they are important influences on individual behavior – and in turn the network ties are influenced by individual behavior. This two-way influence can be represented by models for the co-evolution of networks and changeable actor attributes. Such models, and statistical methods to analyze panel data on networks and behavior, are also treated. An extensive example is discussed about friendship among teenagers in a school setting.

Introduction

When we think of social networks, it is quite natural to think of them as being dynamic. Ties are established, gain in strength, blossom and decay, and may wither or be terminated with a bang. This applies to all kinds of relations – friendship or collaboration between humans, joint ventures between companies, bilateral agreements between countries, and so on. (Note that all these are examples of 'positive' ties; 'negative' ties are not dealt with in this chapter.) It is also natural to think of such changes as being dependent on characteristics of the nodes – inclinations and abilities of humans, resources of companies, locations and capacities of countries – and on characteristics of pairs of nodes such as similarity or spatial proximity, as well as on the existing network structure – reciprocation of friendships, transitive closure of friendships (which is the case when friends of friends become friends), group formation of companies or countries. Finally, we are not surprised when such network changes have repercussions – friendships are often thought to have good or bad influences on the individuals concerned, and agreements between companies and between countries will have consequences for the performance of the companies and for how the countries fare.
Indeed, in the cases of the companies and countries, the links often are created with the purpose of having beneficial consequences for the companies or countries, respectively.

This sets the stage for this chapter, which is concerned with inferential data analysis for network dynamics. The focus is on social networks, and the nodes, or vertices, in the network will be referred to as (social) actors. Data analysis means that methods will be presented for analyzing empirical data on network change. Inference means that the aim is to have methods for testing hypotheses about mechanisms that may drive the network dynamics, and for estimating parameters that figure in such mechanisms. As usual in statistical inference, the 'mechanisms' will be expressed as probability models, also called stochastic models. For hypothesis testing it must be possible to play off one theory against another, which means that when we have theories, or mechanisms, T1 and T2, that are not contradictory but could occur together, we need models which can express both mechanisms simultaneously and reflect each mechanism in a set of parameters. The mechanism operates if some of its parameters are nonzero. Then we can test the null hypothesis that T1 operates but not T2 against the alternative hypothesis that both mechanisms operate; in other words, we can test for T2 while controlling for T1. This requires flexibility of the stochastic models being used: what was stated above implies that it must be possible to specify the models in diverse ways, e.g., with some parameters for effects of actor characteristics and others for effects of various aspects of the existing network structure, and these parameters must be estimable from observed data.

For data consisting of independent observations, the elaboration of these principles of statistical inference in the linear regression model and its generalizations is well known. Network data have the complicating feature that they are not composed of independent observations: the occurrence, creation, or termination of one tie is highly dependent on the existence of other ties. Therefore more complex statistical models are needed, giving an adequate representation of the mutual dependence between the existence of various ties between the actors in the network. But readers who are acquainted with generalized linear statistical models will see that, although the models treated in this chapter have this greater complexity, many elements known from generalized linear models do play a role.

Social network analysis is concerned with diverse types of networks, which can be represented by diverse data structures: simple graphs and their generalizations.
In line with the current state of the statistical methodology for network dynamics, this chapter is restricted to data structures where the changing network is a changing simple directed graph, where the arcs represent social ties that can be regarded as states rather than events. This means that, although changeable, the ties have inertia, a tendency to endure. Friendship between humans and agreements between companies are examples of states; conversations and momentary transactions are events. For networks where ties are states, the dependence between ties can be represented by assuming that changes in the network are dependent upon the existing network structure. In mathematical terms, this is to say that it is reasonable to assume that – given the available 'independent' or 'explanatory' variables – the network is a Markov chain: a stochastic process where the probability distribution of future states, given the present state, does not depend on the past states. Such models were first proposed for network dynamics by [22] and elaborated by [59]. For some sociological applications outside the realm of networks, see [2,10]. For a network of events (e.g., the network of conversations ongoing at each given moment) the assumption of a Markov chain would be untenable. For a network of states (e.g., the network of joint ventures between companies) the Markov assumption is usually not totally realistic, but it can be used as a first approximation, and in many cases it is the best assumption one can make given the limitations of the available data. The assumption often can be made more plausible by using relevant explanatory variables.

At the start of this introduction, ties were portrayed as possibly gaining in strength or decaying. It would be attractive to reflect this by measuring the ties on an ordinal scale. This is not considered here: we are dealing with simple and not with valued directed graphs, and the tie indicators, which are the variables X_ij indicating how actor i is tied to actor j, are constrained to having the values 0 or 1, indicating, respectively, absence and presence of the tie i → j. The restriction to binary tie variables is in line with traditional network analysis, but extensions to valued ties are important and are the subject of current work. The Markov assumption will often be more reasonable for valued than for binary ties.

The assumption that we are dealing with a network of relational states must be reflected by the way in which the ties are measured. Measurement in social network analysis is a subject which tends to receive too little attention, and this chapter is no exception. In practice, ties will have to be measured in such a way that observing a tie is a good indicator for the relational state being investigated, such as friendship or collaboration.
When the relation under study is a type of communication, which usually has an ephemeral nature, it will be necessary to aggregate the communication over a sufficiently long time interval, so that the resulting variable can be regarded as indicative of a relational state. For example, [25] aggregated email communication to the binary tie variable defined by at least one email being sent over a 60-day period. In other situations, shorter or longer periods may be relevant.

Stochastic Models for Network Dynamics

This section presents stochastic models for use in statistical modeling. The inferential aspects (parameter estimation and testing of hypotheses) are treated in Section "Statistical Estimation and Testing". These models were applied, e.g., to the testing of theories about dynamics of friendship networks [12,55,57], of trust networks in firms [56], of artistic prestige [13], and of ties between venture capital firms [8].

The network is represented by the node set {1, ..., n} with tie variables x_ij, where x_ij = 1 or 0 indicates whether the tie i → j is present or absent. The tie variables are collected in the n × n adjacency matrix x = (x_ij). Self-ties are excluded, so that x_ii = 0 for all i. The concepts of network (directed graph) and matrix (its adjacency matrix) will be used interchangeably, depending on what is most convenient. In accordance with the usual notation in probability and statistics, random variables will be indicated by capitals; and observations, or other non-random variables, by small letters. The ties are assumed to be outcomes of time-dependent random variables, denoted by X_ij(t) and collected in the time-dependent random matrix X(t).

In addition to the network X(t), which can be regarded as the dependent variable of the model, there can be other variables regarded as independent or explanatory variables in the sense that their values are not modeled but accepted as given, and they may influence the network. Such variables are called covariates; when depending on the actors they are denoted v_i, while if they depend on pairs of actors (dyads) the notation is w_ij. Examples are the age of actors (actor variable) and their spatial proximity (dyadic variable).

Basic Model Definition

The following basic assumptions are made.

1. Time, denoted by t, is a continuous variable. This does not mean that it is assumed that observations are made continuously; in most practical cases, observations are made at a number (perhaps a small number) of discrete time moments. However, it is natural and mathematically convenient to assume that there is an underlying process X(t) (which may be observed only partially) which proceeds in continuous time.

2. X(t) is a Markov process.
This means that the conditional distribution of future states depends on the past only as a function of the present. In other words, to predict the future it is sufficient to know the present state of the network, and knowledge of past states will not improve predictability. This assumption was discussed above. It can be expressed by saying that the network represents a state, and usually goes together with inertia, i.e., the tendency of ties to remain in existence unless something special happens.

3. At any given moment t, no more than one tie variable X_ij(t) can change.
This assumption, first proposed by [22], means that changes of ties are not directly coordinated, and ties are mutually dependent only because tie changes will depend on the current total configuration of ties. This is an important simplifying condition and excludes, for example, partner swapping and the coordinated formation of groups. Changes of several ties are decomposed as sequences of changes of single ties.

These assumptions still allow an extremely wide array of probability models, and further specification is necessary. In network change, two aspects can be distinguished: the frequency of tie change, which may depend on the actors involved; and the network structures that tend to be formed by the tie changes – the 'direction' of change. Examples of the former are that younger individuals might change their friendship ties more frequently than older individuals, or that more central actors might change their ties more frequently than peripheral actors. Examples of the latter are tendencies toward tie reciprocation, and toward transitive network closure. These two aspects will be represented by distinct components of the model and distinct parameters, which allow the inference about the one aspect to be relatively undisturbed by inference about the other aspect. The first component is the change opportunity process, the second the change determination model.

Change Opportunity Process

Two specifications of the change opportunity process are given. They use the concept of a Poisson process, which is a stochastic process of events occurring at a certain rate λ, which means that the probability that an event occurs in the time interval from t to t + ε, where ε is a small positive number, is given (in the limit for ε tending to 0) by λε. One could say that the rate λ is the probability of occurrence per unit of time (in short time intervals).

1. Tie-based change opportunities. For each tie variable X_ij, opportunities for change occur according to a Poisson process with rate λ_ij.

2. Actor-based change opportunities. For each actor i, opportunities to establish one new outgoing tie i → j, or dissolve one existing tie i → j, occur according to a Poisson process with rate λ_i.

The rates can be constant, or depend on covariates or on functions of the current state of the network; if they are not constant, they are called rate functions. Tie-based rates λ_ij, for example, could depend on the proximity between actors or on their joint embeddedness, such as the current number of common friends Σ_h X_ih(t) X_jh(t). Actor-based rates λ_i could depend on actor variables or on positional variables such as actor i's current outdegree Σ_j X_ij(t). Tie-based change opportunities were proposed by Robins and Pattison (personal communication), and correspond to the Gibbs sampling and Metropolis–Hastings procedures for simulating exponential random graph models; see [41]. Actor-based change opportunities were proposed by [46].

When an opportunity for change occurs, there is, in each opportunity model, a set of potential new networks that could be the result of the change. Denoting the current network by x⁰, for the tie-based opportunity model this set can be denoted by C_ij(x⁰). This is the set of the two possible matrices x where all elements other than x_ij are equal to those in the current matrix x⁰, and where x_ij itself can be either 0 or 1. In the actor-based opportunity model, actor i, when confronted with an opportunity for change, chooses one of his outgoing tie variables and changes it into its opposite value, changing 0 to 1 (creating a new tie) or changing 1 to 0 (terminating an existing tie). Therefore the set of potential new networks here is the set composed of x⁰ itself together with the n − 1 matrices which are equal to x⁰ except for exactly one non-diagonal element in row i, which is replaced by its opposite, x_ij = 1 − x⁰_ij. The set of new possible states is denoted, in shorthand applicable to either case, by C(x⁰). Since it is allowed that the current situation is continued, it always holds that x⁰ ∈ C(x⁰).

Change Determination Model

The choice of the new state of the network is dependent on what is called the objective function, a function f_i(x⁰, x, v, w) depending on the current state of the network x⁰, the potential new state x, the actor i, and the covariates, summarized here as v (actor covariates) and w (dyadic covariates).
The objective function can be interpreted informally as a measure of how attractive it is for actor i to change from state x⁰ to state x. When actor i has the opportunity to change some outgoing tie variable X_ij, given that currently X(t) = x⁰, the set of possible new states of the network is denoted C(x⁰). All x ∈ C(x⁰) differ from x⁰ by at most one element x_ij for some j. When there is an opportunity for change from the current state x⁰, the probabilities of the values of the next state x ∈ C(x⁰) are proportional to exp(f_i(x⁰, x, v, w)). The models can be summarized as follows. For the tie-oriented model, when an opportunity for change occurs, it
refers to some pair (i, j); opportunities for changing X_ij occur at a rate λ_ij for each pair (i, j). When such an opportunity occurs, the probability that x⁰ changes to the different state x is given by

  P{X(t) changes to x | (i, j) has a change opportunity at time t, X(t) = x⁰}
    = p_ij(x⁰, x; v, w)
    = exp(f_i(x⁰, x, v, w)) / [exp(f_i(x⁰, x⁰, v, w)) + exp(f_i(x⁰, x, v, w))],    (1)

where x and x⁰ are identical except for x_ij = 1 − x⁰_ij.

For the actor-oriented model, opportunities for change occur for actors i. Opportunities for actor i to change one of the outgoing tie variables X_ij (j = 1, ..., n; j ≠ i) occur at a rate λ_i. The set of permitted new states, following on a given current state x⁰, is C(x⁰). The probability that the new state is x, provided that x is permitted (i.e., x ∈ C(x⁰)), is given by

  P{X(t) changes to x | i has a change opportunity at time t, X(t) = x⁰}
    = p_i(x⁰, x; v, w)
    = exp(f_i(x⁰, x, v, w)) / Σ_{x' ∈ C(x⁰)} exp(f_i(x⁰, x', v, w)).    (2)
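Formulas (1) and (2) are binary and multinomial logit expressions in the objective-function values. As a minimal Python sketch (the function names are ours, not from the chapter), they can be computed directly:

```python
import math

def tie_change_prob(f_keep, f_change):
    """Eq. (1): probability that the single-tie change is made, given the
    objective-function values of keeping the network (f_keep) and of the
    changed network (f_change)."""
    return math.exp(f_change) / (math.exp(f_keep) + math.exp(f_change))

def actor_choice_probs(f_values):
    """Eq. (2): one probability per candidate network x' in C(x0),
    proportional to exp(f_i(x0, x', v, w))."""
    m = max(f_values)  # subtract the maximum for numerical stability
    weights = [math.exp(f - m) for f in f_values]
    total = sum(weights)
    return [w / total for w in weights]

# A candidate state with a higher objective value is chosen more often:
p = tie_change_prob(f_keep=0.0, f_change=1.0)   # about 0.731
probs = actor_choice_probs([0.0, 1.0, -0.5])    # sums to 1
```

Note that (1) is the special case of (2) with only two candidate networks, the current one and the one with the single tie toggled.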
The two model components can be put together by giving the transition rate matrix, also called the Q-matrix, of which the elements are defined by

  q_{x⁰,x} = lim_{dt ↓ 0} P{X(t + dt) = x | X(t) = x⁰} / dt    (x ≠ x⁰)

(see textbooks on continuous-time Markov chains, such as [34]). Note that the assumptions imply that q_{x⁰,x} = 0 whenever x_ij ≠ x⁰_ij for more than one element (i, j). For digraphs x and x⁰ which differ from each other only in the element with index (i, j), the elements of the Q-matrix are given for the tie-based opportunity process by

  q_{x⁰,x} = λ_ij(x⁰, v, w) p_ij(x⁰, x; v, w)    (3)

and for the actor-based opportunity process by

  q_{x⁰,x} = λ_i(x⁰, v, w) p_i(x⁰, x; v, w).    (4)
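For a very small network the Q-matrix can be written out in full. The following sketch (our own illustration, with made-up parameter values) assembles the tie-based Q-matrix of (3) for n = 2 actors, constant rates λ_ij = λ, and an objective function with only outdegree and reciprocity effects; off-diagonal entries are λ·p_ij for single-tie differences and 0 otherwise, and rows sum to zero as required of a transition rate matrix:

```python
import itertools
import math

lam = 1.0                        # constant tie-based rate lambda_ij (assumed)
beta_out, beta_rec = -1.0, 0.5   # illustrative outdegree / reciprocity weights

# For n = 2 actors a digraph is just the pair (x12, x21):
states = list(itertools.product([0, 1], repeat=2))

def f(i, x):
    """Objective function of actor i (0 or 1): outdegree plus reciprocity."""
    x12, x21 = x
    out = x12 if i == 0 else x21
    return beta_out * out + beta_rec * x12 * x21

def q(x0, x):
    """Rate (3): lam * p_ij(x0, x) if x0 and x differ in exactly one tie, else 0."""
    diff = [k for k in range(2) if x0[k] != x[k]]
    if len(diff) != 1:
        return 0.0
    i = diff[0]                  # actor 0 controls x12, actor 1 controls x21
    num = math.exp(f(i, x))
    return lam * num / (math.exp(f(i, x0)) + num)

Q = [[q(a, b) if a != b else -sum(q(a, c) for c in states if c != a)
      for b in states] for a in states]
```

The zero rate between the empty digraph (0, 0) and the complete one (1, 1) illustrates assumption 3: states differing in more than one tie cannot be reached directly.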
Tie-based models with constant change rates and objective functions defined as f(x, v, w) (not depending on the preceding state x⁰ or on the actor i) can be regarded as Metropolis–Hastings dynamics (cf. [34]) for obtaining random draws from the digraph probability distribution with probability function c exp(f(x, v, w)), where c is a normalizing constant. In statistical mechanics, f(x, v, w) then will be called a potential function; see [32]. When f(x, v, w) is a linear combination as in (6) below, the distribution is an exponential random graph model, for which the Metropolis–Hastings algorithm is treated in [41].

The actor-based opportunity model was proposed in [46] and, for a different data structure, in [45], as a stochastic actor-oriented model. In this model, the network dynamics is regarded as being driven by the social actors. Actors are assumed to control their outgoing ties, subject to inertia and the current network structure. This point of view is in accordance with the methodological approach of structural individualism [54,60], where actors are assumed to be purposeful and to behave subject to structural constraints. The purposes and constraints of the actors are summarized in the objective functions. One way to obtain the probabilities (2) is to assume that at each opportunity for change, actor i myopically optimizes the objective function plus a random term, under the constraint that only one tie can change at a time. The myopia means that the actor optimizes only the state of the network that will be the immediate result of this change, without considering network structures that might result further in the future. The random term expresses otherwise unmodeled purposes and constraints. When the random terms have independent Gumbel distributions (the precise mathematical form of which does not matter for the present exposition), the choice probabilities are given by (2) (cf. [30,45,46]).
Simulation

The Markov process defined above can be simulated by the following algorithm. The various steps in the algorithm can be derived using basic properties of Poisson processes and conditional probabilities (see [34]).

1. The process starts with a given time t and current state X(t) = x⁰.

2. For the tie-based opportunity process define λ = Σ_{i,j} λ_ij, and for the actor-based opportunity process λ = Σ_i λ_i. Let U be an independently drawn random number, uniformly distributed between 0 and 1, and let Δt = −ln(U)/λ. Note that Δt has the exponential distribution with parameter λ. Change t into t + Δt.

3a. In the case of the tie-based opportunity process, choose a random pair (i, j) (with i ≠ j) with probabilities λ_ij/λ. With probability given by (1), change X_ij(t) into 1 − x⁰_ij.

3b. In the case of the actor-based opportunity process, choose a random actor i with probabilities λ_i/λ. To have a way of denoting the permitted new digraphs which are elements of C_i(x⁰), define by x⁰(i ⇝ j), for j ≠ i, the digraph which is equal to x⁰ except only that x⁰(i ⇝ j)_ij = 1 − x⁰_ij; define x⁰(i ⇝ i) = x⁰. Then choose a random j with probabilities

  p_i(x⁰, x⁰(i ⇝ j); v, w) = exp(f_i(x⁰, x⁰(i ⇝ j), v, w)) / Σ_{x' ∈ C(x⁰)} exp(f_i(x⁰, x', v, w)),    (5)

which is just the same as (2). If j ≠ i, change X_ij(t) into 1 − x⁰_ij.

4. Go to step 1.
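The four steps can be turned directly into a small simulator. This is a minimal sketch of the actor-based variant (our own code, with constant rates λ_i and a user-supplied objective function; nothing here is taken from a particular software package):

```python
import math
import random

def simulate(x, lam, objective, t_end, rng):
    """Actor-based simulation of the network process (steps 1-4 above).

    x         -- n x n adjacency matrix as a list of 0/1 lists, zero diagonal
    lam       -- constant rates lambda_i, one per actor
    objective -- objective(i, y) returning f_i evaluated at candidate network y
    """
    n = len(x)
    t, total = 0.0, sum(lam)
    while True:
        # Step 2: exponential waiting time with parameter sum_i lambda_i
        t += -math.log(rng.random()) / total
        if t >= t_end:
            return x
        # Step 3b: choose actor i with probability lambda_i / total
        i = rng.choices(range(n), weights=lam)[0]
        # Candidate networks x0(i ~> j); j == i keeps the network unchanged
        candidates = []
        for j in range(n):
            y = [row[:] for row in x]
            if j != i:
                y[i][j] = 1 - y[i][j]
            candidates.append(y)
        # Multinomial choice (5): probability proportional to exp(f_i)
        weights = [math.exp(objective(i, y)) for y in candidates]
        x = candidates[rng.choices(range(n), weights=weights)[0]]

def f(i, y):
    """Illustrative objective: outdegree and reciprocity effects only."""
    out = sum(y[i])
    rec = sum(y[i][j] * y[j][i] for j in range(len(y)))
    return -1.5 * out + 1.0 * rec

x = simulate([[0] * 4 for _ in range(4)], [1.0] * 4, f, t_end=25.0,
             rng=random.Random(1))
```

In practice the rate functions, effect specification, and estimation are handled together; this sketch only illustrates how the continuous-time process itself unfolds.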
The stochastic process on the space of digraphs can be defined by this simulation algorithm just as well as by the Q-matrices (3) and (4), respectively.

Specification of the Change Determination Process

The changes can be regarded, for tie-based opportunities, as determinations of new values for X_ij according to a binary logistic regression model (see [23]); and for actor-based opportunities, according to a multinomial logistic regression model (see [29]). In the further elaboration the parallel with logistic regression is followed, because the objective function f_i is specified as a linear combination

  f_i(x⁰, x, v, w) = Σ_k β_k s_ki(x⁰, x, v, w),    (6)
where the functions s_ki are so-called effects driving the network dynamics, while the weights β_k are parameters indicating the force of these effects, which can be estimated from the data. The specification of the model will be the choice of a limited set of such effects for use in (6). A list of some effects is the following. In the formulae, replacing an index by a + means that a sum is taken over this index. Many examples of effects do not depend on x⁰ or on v or w, and these arguments are then dropped from the notation.

Outdegree effect: s_1i(x) = x_{i+} = Σ_j x_ij. This effect models the tendency to have ties at all; this tendency
will also be influenced by all other effects, and therefore the interpretation of its parameter is conditional on the further selection of effects included in the model. In many models this is the only effect to which those actors j contribute who have no reciprocal link to i nor any links with any others to whom i is linked. In such models, its weight β_1 can be interpreted as the 'value' for actor i of a tie to such an other actor who is further completely isolated from i's personal network.

Reciprocity effect: s_2i(x) = Σ_j x_ij x_ji. This is the number of reciprocated ties for actor i. It models the tendency toward reciprocation of choices. Thus, a higher value for its parameter β_2 will imply a higher tendency to forming reciprocated ties.

Degree Distribution

The following three effects are related to modeling the dynamics of the degree distribution. Since the data are directed graphs, three distinct aspects of the degree distribution are the variability of indegrees, the variability of outdegrees, and the association between in- and outdegrees. Sometimes the term preferential attachment is used [1,40] for the increased attractiveness of ties to nodes that already have high degrees. In our model these three aspects of preferential attachment can be expressed by including in the objective function terms depending on in- and outdegrees. The precise functional form for the effects is determined also by the requirement that the resulting model be amenable to statistical inference. Experience shows that using the square root of the degrees often leads to more stable estimation of the parameters than using the raw (untransformed) degrees. This suggests that, for many empirical networks, the models with square roots of the degrees are better descriptions of reality than the models with untransformed degrees. Therefore only the models using the square roots are presented here.
Popularity effect (square root measure): s_3i(x) = Σ_j x_ij √(x_{+j}). This effect is defined by the sum of the square roots of the indegrees of the others to whom i is tied. In other words, the popularity of other actors is measured by the square root of their indegree. The root-popularity effect models the tendency to form ties to those actors who already have high indegrees (the Matthew effect in networks; see the chapter on Social Network Analysis, Graph Theoretical Approaches to). This will be reflected by the dispersion of the indegrees.

Activity effect (square root measure): s_4i(x) = Σ_j x_ij √(x_{j+}). This effect is defined by the sum of the square roots of the outdegrees of the others to
whom i is tied. The effect models the tendency to form ties to those actors who already have high outdegrees, and will be reflected in the association between indegrees and outdegrees.

Own outdegree, power 1.5, effect: s_{4i}(x) = Σ_j x_{ij} √(x_{i+}) = x_{i+}^{1.5}.
The first expression shows that each new tie has a 'value' equal to the square root of the actor's outdegree, compared to a unit 'value' for the outdegree effect. This effect expresses that actors who already have many outgoing ties have a higher propensity to establish new ties, which will lead to a greater dispersion of outdegrees.

Network Closure
The following two effects are related to network closure, also called transitivity or clustering: for friendship networks, this expresses that friends of friends tend to become friends. In addition to the two effects mentioned here, tendencies toward structural balance [7] and tendencies to avoid geodesic distances equal to 2 can also be implemented as effects that will lead to transitivity; see also [46,47].

Transitive triplets effect: s_{5i}(x) = Σ_{j,h} x_{ij} x_{ih} x_{jh}.
This formula counts the transitive patterns in i's ties, as indicated in the figure below. A transitive triplet for actor i is a configuration (i, j, h) in which all three of the ties i → j, j → h, i → h are present (irrespective of whether there are also other ties between these three actors). This models the tendency toward network closure, where (for a positive parameter) formation of the tie i → h becomes increasingly likely when there are more indirect connections ('two-paths') i → j → h.

Transitive ties effect: s_{6i}(x) = Σ_h x_{ih} max_j (x_{ij} x_{jh}).
This is the number of actors h to whom i is directly as well as indirectly tied, i.e., for which there exists at least one j such that (i, j, h) is a transitive triplet. The transitive triplets and transitive ties effects are two distinct ways of modeling tendencies toward network closure. For the effect on the probability of forming the tie i → h, the number of two-paths i → j → h makes no difference for the transitive ties effect, as long as there is at least one indirect connection; for the transitive triplets effect the contribution to the objective function increases linearly with the number of two-paths.

Three-cycle effect: s_{7i}(x) = Σ_{j,h} x_{ij} x_{jh} x_{hi}.
This is the number of three-cycles i → j → h → i in which actor i is involved. This effect models the tendency toward forming three-cycles, which is the simplest form of generalized exchange [3,27] (i gives to j, j gives to h, and h gives to i) and which is opposed to hierarchy.
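These closure statistics can be computed directly from a binary adjacency matrix; a minimal sketch (the function names are illustrative, not from the article):

```python
def transitive_triplets(i, x):
    """s_5i: number of ordered pairs (j, h) with i->j, i->h and j->h."""
    n = range(len(x))
    return sum(x[i][j] * x[i][h] * x[j][h] for j in n for h in n)

def transitive_ties(i, x):
    """s_6i: number of actors h tied to i both directly and via some j."""
    n = range(len(x))
    return sum(x[i][h] * max(x[i][j] * x[j][h] for j in n) for h in n)

def three_cycles(i, x):
    """s_7i: number of three-cycles i -> j -> h -> i."""
    n = range(len(x))
    return sum(x[i][j] * x[j][h] * x[h][i] for j in n for h in n)

# toy network: ties 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
x = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
print(transitive_triplets(0, x))  # one triplet: (j=1, h=2)
print(transitive_ties(0, x))      # only h=2 is reached directly and via j=1
print(three_cycles(0, x))         # one cycle: 0 -> 1 -> 2 -> 0
```

Because ties are 0/1 and the diagonal is zero, the three sums reduce to simple counts over ordered pairs of other actors.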
Network Analysis, Longitudinal Methods of, Figure 1 Transitive triplet
Covariate Effects
Actor covariates v can influence the propensities to form or terminate ties in different ways, because this propensity may depend on the value for the sender of the tie, the value for the receiver, or some combination of these two values.

v-related popularity: s_{8i}(x, v) = Σ_j x_{ij} v_j.
The v-related popularity effect is the sum of the covariate over all actors to whom i is tied. Positive parameter values imply that ties to actors with high v values are more attractive, leading to a tendency toward a correlation between v_i and the indegree of i.

v-related activity: s_{9i}(x, v) = v_i x_{i+}.
In words, this is v_i times i's outdegree. Positive parameter values imply that actors with high v values tend to make more ties, leading to a tendency toward a correlation between v_i and the outdegree of i.

v-related similarity: s_{10,i}(x, v) = Σ_j x_{ij} sim_v(i, j).
Here sim_v(i, j) denotes the similarity between actors i and j, defined by sim_v(i, j) = 1 − |v_i − v_j| / Δ_v, where Δ_v = max_{h,k} |v_h − v_k| is the observed range of the covariate v. Thus, the effect is the sum of the similarities between i and the others to whom he is tied. Positive parameter values imply a preference for ties between actors with similar values of v_i and v_j.

v ego-alter interaction: s_{11,i}(x, v) = Σ_j x_{ij} v_i v_j.
This product interaction is an alternative to the similarity effect for expressing that the propensity for a tie to exist depends on the combined values of v for the sending actor ('ego', i) and the receiving actor ('alter', j).

Main effect of w: s_{12,i}(x, w) = Σ_j x_{ij} w_{ij}.
For a dyadic covariate w, this is the sum of the values w_{ij} over all other actors j to whom i is tied.

It is clear that the list can be extended indefinitely, and that researchers have to make a limited choice reflecting theoretical and content-matter knowledge and interest.
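The covariate effects can likewise be computed directly from the data; a sketch of the similarity effect s_10 (function name illustrative, not from the article):

```python
def similarity_effect(i, x, v):
    """s_10,i: sum over i's alters of sim_v(i, j) = 1 - |v_i - v_j| / range."""
    rng = max(v) - min(v)  # observed range of the covariate v
    return sum((1 - abs(v[i] - v[j]) / rng)
               for j in range(len(v)) if x[i][j] == 1)

# toy data: actor 0 is tied to actors 1 and 2
x = [[0, 1, 1],
     [0, 0, 0],
     [0, 0, 0]]
v = [2, 2, 4]   # covariate values; observed range is 2
print(similarity_effect(0, x, v))  # (1 - 0/2) + (1 - 2/2) = 1.0
```

The identical covariate of actor 1 contributes the maximal similarity 1, while actor 2, at the opposite end of the observed range, contributes 0.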
The potential complexity of network dynamics justifies having many candidate effects available for modeling it. For instance, when a researcher wishes to
estimate a model for testing whether there is a preference for choosing network partners with similar v values, while controlling for the tendency to have ties and for the tendencies toward reciprocation and transitive closure, then the effects s_1, s_2, s_5 and/or s_6, and s_10 should be included in the model (6). Since transitive closure can be expressed by effect s_5 as well as s_6, there may be no strong prior arguments for choosing between these two 'control' effects, and empirical grounds can be used to choose either one or both. If there are grounds to suspect that the v_i values may also be associated with in- or outdegrees, the popularity and activity effects s_8 and s_9 may be included as well.

Specification of the Change Opportunity Process
For the model specification it should be noted that the 'social time' determining the speed of change of the network is not necessarily the same as the physical time elapsing between consecutive observation moments. Given the absence of an extraneous definition of this 'social time', it is not a restriction to set the total time elapsed between each pair of consecutive observations to 1. If there are M ≥ 3 observation moments, it is advisable to specify distinct rate parameters ρ_m governing the frequency of opportunities for change between t_m and t_{m+1}, allowing ρ_1, ρ_2, etc., to differ. If, further, the change rate is constant (independent of actors), then ρ_m represents the expected number of opportunities for change between t_m and t_{m+1}: per ordered pair (i, j) in the case of tie-based opportunities, and per actor i in the case of actor-based opportunities. The symbol ρ will denote the vector (ρ_1, …, ρ_{M−1}). In the more general case the rate function can be defined, e.g., for the actor-based model, as a function depending on actor covariates and positional characteristics of the actors.
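The opportunity process can be sketched in simulation. The following toy example (all names and parameter values are hypothetical; the exponentiated linear rate function is one possible specification, not prescribed by the article) draws actor-based change opportunities in continuous time, with exponential waiting times between opportunities:

```python
import math
import random

def simulate_opportunities(rho, alpha1, alpha2, v, outdeg, t_end, seed=0):
    """Simulate actor-based opportunity moments on [0, t_end].

    Each actor i gets opportunities at rate
    lambda_i = rho * exp(alpha1 * v[i] + alpha2 * outdeg[i]);
    the next opportunity arrives after an exponential waiting time
    with the total rate, and the acting actor is drawn with
    probability proportional to lambda_i.
    """
    rng = random.Random(seed)
    rates = [rho * math.exp(alpha1 * vi + alpha2 * d)
             for vi, d in zip(v, outdeg)]
    total = sum(rates)
    t, events = 0.0, []
    while True:
        t += rng.expovariate(total)      # waiting time to next opportunity
        if t > t_end:
            return events
        u, cum = rng.random() * total, 0.0
        for i, lam in enumerate(rates):  # choose the acting actor
            cum += lam
            if u <= cum:
                events.append((t, i))
                break

# toy data: 4 actors with a covariate v and current outdegrees
events = simulate_opportunities(rho=5.0, alpha1=0.0, alpha2=0.0,
                                v=[0, 0, 1, 1], outdeg=[2, 1, 0, 3],
                                t_end=1.0)
```

With alpha1 = alpha2 = 0 the rates are constant, so over a unit interval each actor receives on average rho opportunities, matching the interpretation of ρ_m above.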
When, for example, a dependence on one covariate v_i and the current outdegree x⁰_{i+} is considered, a logarithmic link function can be used, giving a model such as

λ_i(x⁰, v) = ρ_m exp(α_1 v_i + α_2 x⁰_{i+}).

In general, the symbol α will be reserved for parameters expressing the dependence of the rate function on covariates and network characteristics.

Statistical Estimation and Testing
The most usual type of longitudinal network data is panel data, where for M ≥ 2 time points an observation x(t_m) of the network is available on the same set {1, …, n} of actors.
These models can be simulated on computers in rather straightforward ways (the algorithm is written out in [47]). Parameter estimation, however, is more complicated, because the likelihood function or explicit probabilities can be computed only for uninteresting models. This section presents the Method of Moments (MoM) estimators proposed in [46]. Maximum Likelihood (ML) estimators are presented in [48]. In the current implementation, ML estimators are more time-consuming than MoM estimators, and can be used only for relatively small data sets. For the more straightforward models (dynamics of networks only, no endowment functions), the MoM method is hardly less efficient than the ML method. For more complicated models, where it is important to squeeze every bit of information out of the data, it can be useful to employ ML methods. In the following description of the estimation method, the parameter vector (ρ, α, β) is denoted by θ. It is undesirable to make the restrictive assumption that the distribution of the process is stationary. Instead, for each observation moment t_m (m = 1, …, M − 1) the observed network x(t_m) can be used as a conditioning event for the distribution of X(t_{m+1}). The Method of Moments requires a vector of statistics U_{m+1} = U(X(t_m), X(t_{m+1})) such that the expected value

E{U(X(t_m), X(t_{m+1})) | X(t_m) = x(t_m)}

is sensitive to the parameter θ. Given the conditioning on the preceding observation, the moment equations, or estimating equations, can then be written as

Σ_{m=1}^{M−1} E{U(X(t_m), X(t_{m+1})) | X(t_m) = x(t_m)} = Σ_{m=1}^{M−1} U(x(t_m), x(t_{m+1})).   (7)
It turns out that suitable statistics are the following. The number of changed ties between consecutive observations,

Σ_{i,j} |X_{ij}(t_{m+1}) − X_{ij}(t_m)|,

is especially sensitive to the rate of change ρ_m. A vector of statistics especially sensitive to β is the sum of the individual objective functions,

Σ_i f_i(X(t_{m+1})).
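The observed-side statistics can be computed directly from two consecutive panel waves; the sketch below (function names and the choice of effects are illustrative, not the article's implementation) uses the tie-change count and, as effect statistics, the sums over actors of the reciprocity and transitive triplets effects:

```python
def changed_ties(x_old, x_new):
    """Sum over (i, j) of |x_new[i][j] - x_old[i][j]|: sensitive to rho_m."""
    n = len(x_old)
    return sum(abs(x_new[i][j] - x_old[i][j])
               for i in range(n) for j in range(n) if i != j)

def effect_sums(x):
    """Sums over actors of two illustrative effects at the later
    observation (sensitive to the corresponding beta parameters)."""
    n = range(len(x))
    recip = sum(x[i][j] * x[j][i] for i in n for j in n)
    trans = sum(x[i][j] * x[i][h] * x[j][h]
                for i in n for j in n for h in n)
    return {"reciprocity": recip, "transitive triplets": trans}

# toy pair of consecutive panel waves for 3 actors
x0 = [[0, 1, 0],
      [0, 0, 1],
      [0, 0, 0]]
x1 = [[0, 1, 1],   # tie 0 -> 2 was created
      [1, 0, 1],   # tie 1 -> 0 was created
      [0, 0, 0]]

print(changed_ties(x0, x1))  # 2 ties changed between the waves
print(effect_sums(x1))
```

Summing these quantities over the M − 1 wave pairs gives the right-hand side u_obs of (7).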
To solve the estimating equation (7), in the absence of ways to calculate the expected values analytically, stochastic approximation methods can be used. Variants of the
Robbins–Monro [9,42] algorithm have been used with good success. This is a stochastic iteration method producing a sequence of estimates θ̂^(N) that is intended to converge to the solution of (7), and which works here as follows. For a given provisional estimate θ̂^(N), the model is simulated so that for each m = 1, …, M − 1 a random draw is obtained from the conditional distribution of X(t_{m+1}) given X(t_m) = x(t_m). This simulated network is denoted X^(N)(t_{m+1}). Denote U_m^(N) = U(x(t_m), X^(N)(t_{m+1})) and U^(N) = Σ_{m=1}^{M−1} U_m^(N), and let u_obs be the right-hand side of (7). Then the iteration step of the Robbins–Monro algorithm for obtaining the Method of Moments estimate is

θ̂^(N+1) = θ̂^(N) − a_N D^{−1} (U^(N) − u_obs),   (8)
where D is a suitable matrix and a_N a sequence of positive constants tending to 0. Tuning details of the algorithm, including the choices of D and a_N, are given in [46]. Experience with the convergence of this algorithm is quite good. Standard errors can be computed by the usual Method of Moments formulae, based on the delta method and applying simulation methods; such simulation methods are discussed in [44]. Bayesian estimators for these models are presented in [24] and Maximum Likelihood estimators in [48].

Example: Dynamics of Adolescent Friendship
As an example, the adolescent friendship network of a year cohort at a secondary school in Glasgow (Scotland) is considered, studied in the Teenage Friends and Lifestyle Study [36,38]. This data set was collected at three measurement points t_1, t_2, t_3 in 1995–1997, at intervals of roughly one year, starting when the pupils were 12–13 years old. Here the network is studied that is formed by the 129 (out of 160) pupils who were present at all three measurement waves. Sex (boys scored as 1, girls as 2) and drinking behavior are used as actor variables, drinking (alcohol consumption) being measured on a 5-point scale ranging from 1 (not at all) to 5 (more than once a week). Both variables are centered around the mean, which is 1.43 for sex and 2.60 for drinking (averaged over t_1 and t_2). The data set is used as an illustration; more wide-ranging analyses are presented in [37,52,53]. Three models are presented. Rate parameters are assumed constant within periods between observation moments, and the duration of the periods is (arbitrarily but without loss of generality) set at 1. The first model contains only the most basic dyadic and triadic effects: outdegree, reciprocity, transitive triplets, and three-cycles. The
second adds to this the three effects that model the degree distribution more precisely. The third adds to these structural effects the effects of two covariates: gender and alcohol consumption. Parameter estimates are approximately unbiased and normally distributed; for this assertion there is no mathematical proof yet, but it is supported by simulation studies. Therefore, effects can be tested by referring the studentized estimates (or t-ratios, i.e., estimate divided by standard error) to a standard normal distribution. When the t-ratio exceeds 2 in absolute value, the effect can be interpreted as significant at the 5 % level. Table 1 presents first the estimated rate parameters. These indicate that the pupils had about 12 opportunities for change in the first period (t_1–t_2) and about 9 in the second period (t_2–t_3). The parameters for the objective function, shown next in the table, are more important for the interpretation. There are strong tendencies toward reciprocity and transitivity, and a tendency away from three-cycles. This is the case in all three models, although the parameter estimates differ slightly. This indicates that the structural features of transitive closure and of local hierarchy (the interpretation of the negative three-cycle effect) cannot be 'explained away' by tendencies in tie formation and dissolution associated with degrees, sex, or alcohol consumption. The degree effects in Models 2 and 3 show that there is a positive popularity effect (with borderline significance at the 5 % level in Model 3), a negative activity effect, and a negative effect of own outdegree raised to the power 1.5. The positive popularity effect suggests that differential values of in-degrees tend to
Network Analysis, Longitudinal Methods of, Figure 2 Contributions of drinking behaviors vi ; vj to the objective function for friendship
Network Analysis, Longitudinal Methods of, Table 1 Parameter estimates for modeling evolution of friendship network, Glasgow school cohort. Standard errors between parentheses

Effect                      Model 1 par. (s.e.)   Model 2 par. (s.e.)   Model 3 par. (s.e.)
Rate 1                      11.82   (1.00)        12.05   (1.15)        12.19   (1.13)
Rate 2                       9.23   (0.83)         9.05   (0.79)         9.09   (0.77)
Outdegree                   −2.697  (0.047)        0.16   (0.34)         0.18   (0.41)
Reciprocity                  2.38   (0.10)         2.48   (0.12)         2.22   (0.12)
Transitive triplets          0.459  (0.033)        0.569  (0.036)        0.544  (0.034)
Three-cycles                −0.57   (0.10)        −0.55   (0.11)        −0.44   (0.11)
Popularity (sq. root)          –                   0.223  (0.095)        0.184  (0.093)
Activity (sq. root)            –                  −0.89   (0.14)        −0.92   (0.17)
Own outdegree, power 1.5       –                  −0.560  (0.084)       −0.587  (0.094)
Sex (F) popularity             –                     –                   0.16   (0.11)
Sex (F) activity               –                     –                   0.12   (0.12)
Sex similarity                 –                     –                   0.904  (0.097)
Drinking popularity            –                     –                  −0.026  (0.030)
Drinking activity              –                     –                  −0.086  (0.038)
Drinking ego × alter           –                     –                   0.107  (0.024)

be self-sustaining, leading to rather strongly dispersed in-degrees. The other two, negative, degree-related effects indicate that differences in out-degrees, as well as correlations between in- and out-degrees, tend to be self-correcting, leading to relatively low dispersion of out-degrees and low in-degree–out-degree correlations. Of the three gender-related effects, only the similarity effect is significant. In this age range, a strong preference for same-sex friendships is to be expected (any interest in the other sex is not reported as friendship). For alcohol consumption, the interaction effect shows that those who drink more themselves have a higher preference for friends who also drink more; and the activity effect shows that those drinking more tend to mention fewer friends, for friends of average drinking habits (where the contribution of the interaction is nil). This is expressed most clearly by jointly considering the three contributions related to drinking behavior that can be made by the tie i → j. In a formula, this is the coefficient of the tie variable x_{ij} in the objective function. Recall that the actor variables are centered around the mean. Using the formulae for effects s_8, s_9, and s_11 and filling in the coefficients from Table 1 yields

−0.026 (v_j − v̄) − 0.086 (v_i − v̄) + 0.107 (v_j − v̄)(v_i − v̄),

with v̄ = 2.60. Figure 2 illustrates the contributions to the objective function made by ego's drinking behavior v_i and alter's drinking behavior v_j.
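The joint drinking-related contribution to the objective function for a tie i → j can be tabulated directly. A sketch (layout ours, assuming the Model 3 coefficients −0.026 for popularity, −0.086 for activity and 0.107 for ego × alter, with the covariate centered at v̄ = 2.60):

```python
V_BAR = 2.60  # mean drinking score, averaged over t1 and t2

def drinking_contribution(v_i, v_j, v_bar=V_BAR):
    """Joint contribution of the drinking popularity, activity and
    ego x alter effects to the objective function for the tie i -> j."""
    ci, cj = v_i - v_bar, v_j - v_bar
    return -0.026 * cj - 0.086 * ci + 0.107 * cj * ci

# contributions for ego (rows, v_i = 1..5) and alter (columns, v_j = 1..5)
table = [[round(drinking_contribution(vi, vj), 2) for vj in range(1, 6)]
         for vi in range(1, 6)]
```

Scanning the rows reproduces the ranges quoted in the text: about 0.8 for ego score v_i = 1 and about 0.9 for v_i = 5, while the middle rows are nearly flat.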
It can be concluded that those who do not drink alcohol (v_i = 1) prefer friends who drink no or little alcohol, while the reverse holds for those who drink a lot. The range of this contribution is 0.8 for those with v_i = 1 and 0.9 for those with v_i = 5, which is comparable to the value 0.9 of the gender similarity effect. Thus, for those with the highest and the lowest values of alcohol use, the importance of the alcohol use of potential friends, when comparing potential friends with the minimum and the maximum alcohol use, is approximately as great as the importance of their gender. For those with medium values of alcohol use, on the other hand, the alcohol use of potential friends plays virtually no role at all.

Models for the Co-evolution of Networks and Behavior
The importance of networks derives in large part from the effects of networks on the behavior and performance of the actors. This can be, for example, because of influence between friends or between collaborating partners [16], because of exchange of resources [28], or because of structural advantages [4]. A few examples drawn from the many studies in this field concern the influence of adolescents' friends on smoking behavior [14,37,52], effects of acquaintances on labor market outcomes [17], competitive advantage of firms [18], effects of friends on delinquency [6,19], effects of position in patent citation networks on growth rates of companies [39], and job performance [51].
This type of changing individual attribute, which might range from behavioral tendencies and attitudes of individuals to performance of companies, will briefly be referred to as 'behavior'. When the ties in the network are influenced by the behavior and the behavior, in turn, is influenced by the network, a mutual feedback arises between network and behavior. The network structure of the ties between the actors, together with their behavior, constitutes the endogenously changing environment for each of the actors [61]. Thus, this approach is well suited to study macro-micro-macro questions of the type discussed by [11]. The vector of attributes for actor i at time t is denoted z_i(t) = (z_{i1}(t), …, z_{iH}(t)), where z_{ih}(t) denotes the h-th attribute of actor i. For the n actors these are stacked in the matrix z(t), which is regarded as the outcome of a random matrix Z(t). To model the co-evolution of networks and behavior, where the network and the behavior influence one another dynamically, the stochastic process (X(t), Z(t)) is considered. This is treated in quite the same way as the stochastic process X(t) above. It is assumed that z_i(t) represents an enduring, but changeable, state of actor i rather than a momentary behavior, so that it is a sensible approximation to assume that (X(t), Z(t)) is a Markov process in continuous time. Now the change probabilities of X(t) will depend on the current state of Z(t) as well as X(t). The fact that network change is co-determined by behavior is called behavior-dependent selection, and will often be referred to briefly as selection. Similarly, the change probabilities of Z(t) will depend on the current state of X(t) as well as Z(t), and this will be called influence. This terminology was also used, e.g., by [14].
To remain close to the framework for network modeling, it is assumed that all of the behavior variables z_{ih} are measured on a discrete ordinal scale, with values coded as consecutive integers {0, 1, …, M_h} for some M_h ≥ 1. The analogue of the simplifying assumption (3) made above for network change is the following; it decomposes the change between the consecutive observations (x(t_m), z(t_m)) and (x(t_{m+1}), z(t_{m+1})) into a sequence of the smallest possible steps.

3′. At any given moment t, no more than one of all the variables X_{ij}(t), Z_{ih}(t) can change. When Z_{ih}(t) changes, it can change only to an immediately neighboring value, i.e., by a decrease or increase of 1 (permitted only if this step does not take Z_{ih} outside the permitted range from 0 to M_h).
This means that there is no direct coordination between changes in ties and changes in behavior; the dependence between networks and behavior is brought about because both react to each other. The model for the network is just as above, the rate function and objective function now being denoted λ^X and f_i^X, and allowed to depend on the current behavior Z(t), to represent behavior-dependent selection. For each of the dependent behavior variables Z_h there is likewise a rate function λ_i^{Z_h} driving the frequency of changes, and an objective function f_i^{Z_h} defining the probabilities of behavior changes when there is an opportunity for change. Their dependence on the current network represents influence. For each actor i, opportunities to change behavior Z_{ih} occur according to a Poisson process with rate λ_i^{Z_h}. When actor i has the opportunity to change behavior Z_{ih}, given that currently (X(t), Z(t)) = (x⁰, z⁰), there are three possible new states (x⁰, z), where either z = z⁰ or the only difference between z and z⁰ is that z_{ih} = z⁰_{ih} − 1 or z_{ih} = z⁰_{ih} + 1; unless one of these values is outside the range {0, …, M_h}, in which case only the two remaining new states are possible. The probabilities of going from state (x⁰, z⁰) to state (x⁰, z) are proportional to exp(f_i^{Z_h}(z⁰, z, x⁰, v, w)). The change probabilities in this model are given by

P{Z(t) changes to z | i has an opportunity to change Z_{ih} at time t, X(t) = x⁰, Z(t) = z⁰}
  = exp(f_i^{Z_h}(z⁰, z, x⁰, v, w)) / Σ_{z′ ∈ C_{Z_h}(z⁰)} exp(f_i^{Z_h}(z⁰, z′, x⁰, v, w)),   (9)

where C_{Z_h}(z⁰) is the set of two or three permitted new values for Z_h. The effects in the objective function f_i^X for network changes are now denoted s^X_{ki}, with weights β^X_k; likewise, the objective function for behavior changes is assumed to be expressed as a linear combination of effects weighted by parameters,

f_i^{Z_h}(z⁰, z, x⁰, v, w) = Σ_k β_k^{Z_h} s_{ki}^{Z_h}(z⁰, z, x⁰, v, w).   (10)
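The choice rule (9) is a multinomial logit over the two or three admissible new values. A minimal sketch (function names and the single-effect objective are illustrative assumptions, not from the article):

```python
import math

def change_probabilities(z_cur, m_h, objective):
    """Probabilities of the admissible behavior steps under rule (9).

    z_cur     -- current value z0_ih on the scale {0, ..., m_h}
    objective -- maps a candidate new value z to f_i^{Z_h}(z0, z, ...)
    Returns a dict {candidate new value: probability}.
    """
    candidates = [z for z in (z_cur - 1, z_cur, z_cur + 1) if 0 <= z <= m_h]
    weights = {z: math.exp(objective(z)) for z in candidates}
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}

# illustrative objective: a linear shape effect only, f = 0.5 * z
probs = change_probabilities(z_cur=2, m_h=4, objective=lambda z: 0.5 * z)
```

At an interior value there are three candidates; at a boundary (z_cur = 0 or z_cur = m_h) the set C_{Z_h}(z⁰) automatically shrinks to two.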
A set of possible effects that could be included in the objective function for behavior are the following. For simplicity of notation, it is assumed that there is only one behavior variable, so that the index h can be dropped from the notation. The first two effects are used to define the shape of
the objective function as a function of z_i, and other terms depending on z_i could be added.

Linear shape effect: s^Z_{i1}(z) = z_i.
Squared shape effect: s^Z_{i2}(z) = z_i².
The linear and squared shape effects together define what could be regarded as a quadratic preference function for the behavior, β_1^Z z_i + β_2^Z z_i². The word 'preference function' is used with some reluctance, because it serves here only as an easy shorthand for the combined short-term result of preferences and constraints, depending only on the actor's behavior z_i itself, net of the other terms in the behavior objective function. If β_2^Z < 0, this is a unimodal function of z_i. If β_2^Z > 0, on the other hand, and if the minimum of the quadratic function is assumed within the range of the behavior variable, then the behavior is drawn to the extremes of the range, with actors already low on z_i being drawn to low values and actors already high on z_i being drawn to high values. This can represent, e.g., addictive behavior. Other nonlinear functions of z_i could, of course, also be included.

Indegree effect: s^Z_{i3}(z, x) = z_i x_{+i}.
This represents that actors with a high indegree ('popular' actors) have a higher tendency toward high values of the behavior.

Outdegree effect: s^Z_{i4}(z, x) = z_i x_{i+}.
This represents that actors with a high outdegree ('active' actors) have a higher tendency toward high values of the behavior.

Total similarity effect: s^Z_{i5}(z, x) = Σ_j x_{ij} sim_z(i, j).
The dyadic similarity sim_z is as defined above, now applied to the dependent behavior Z. The total similarity effect adds the similarity values between i and the actors toward whom i has a tie. This is the primary representation of social influence: the preference for behavior which is close to that of one's network members.

Average alter effect: s^Z_{i6}(z, x) = z_i x_{i+}^{−1} Σ_j x_{ij} z_j.
The coefficient of z_i is the average behavior of the actors to whom i is tied (i's 'alters', j); it is defined as 0 when the outdegree x_{i+} is 0. This is another representation of social influence: the preference for behavior depends on the average behavior of one's network members. The parameter estimation for this model is discussed in [49].
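Two of these behavior effects in a minimal sketch (function names and all numeric values are illustrative, not estimates from the article):

```python
def average_alter_effect(i, x, z):
    """s_i6^Z: z_i times the mean behavior of i's alters (0 if i has no ties)."""
    alters = [z[j] for j in range(len(z)) if x[i][j] == 1]
    return z[i] * (sum(alters) / len(alters)) if alters else 0.0

def shape_effects(z_i, beta1, beta2):
    """Quadratic 'preference' part beta1 * z_i + beta2 * z_i**2."""
    return beta1 * z_i + beta2 * z_i ** 2

# toy data: actor 0 tied to 1 and 2; actor 2 has no outgoing ties
x = [[0, 1, 1],
     [1, 0, 0],
     [0, 0, 0]]
z = [3, 1, 5]   # behavior scores on a 1..5 scale
print(average_alter_effect(0, x, z))  # 3 * (1 + 5)/2 = 9.0
print(average_alter_effect(2, x, z))  # no ties, so 0.0
print(shape_effects(3, 1.0, -0.25))   # 1.0*3 - 0.25*9 = 0.75
```

With beta2 < 0 the shape part is unimodal in z_i, matching the discussion above.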
Example: Co-evolution of Adolescent Friendship and Alcohol Use
As an example of the co-evolution of networks and behavior, the data of the Glasgow school cohort are used again, but now alcohol consumption is used as a dependent,
Network Analysis, Longitudinal Methods of, Table 2 Parameter estimates for modeling co-evolution of friendship network and alcohol consumption, Glasgow school cohort

Effect                        par.     (s.e.)
Friendship dynamics
Rate 1                        12.13    (1.26)
Rate 2                         9.09    (0.97)
Outdegree                      0.22    (0.43)
Reciprocity                    2.19    (0.12)
Transitive triplets            0.529   (0.036)
Three-cycles                  −0.42    (0.10)
Popularity (sq. root)          0.193   (0.086)
Activity (sq. root)           −0.92    (0.15)
Own outdegree, power 1.5      −0.581   (0.099)
Sex (F) popularity             0.16    (0.10)
Sex (F) activity               0.12    (0.13)
Sex similarity                 0.90    (0.11)
Drinking popularity           −0.026   (0.043)
Drinking activity             −0.118   (0.057)
Drinking ego × alter           0.166   (0.043)
Alcohol consumption dynamics
Rate 1                         1.48    (0.25)
Rate 2                         2.22    (0.37)
Linear shape                  −0.36    (0.42)
Squared shape                 −0.34    (0.14)
Indegree                       0.07    (0.12)
Outdegree                      0.08    (0.17)
Sex (F)                        0.02    (0.23)
Average alter                  0.83    (0.35)
or endogenous, variable. In the analysis of Table 1 the alcohol consumption was used as an exogenous variable, i.e., its values were accepted as if determined by processes independent of the network. Now we follow a co-evolution approach, where it is assumed that the dynamics of alcohol consumption can be co-determined by the network just as the network dynamics can be co-determined by the alcohol consumption. To interpret the numerical parameter values, note that this table refers to the centered actor variables. The raw variable ranges from 1 to 5, with averages at the three observations rising from 2.5 through 2.7 to 3.1. The overall mean, 2.8, is subtracted from the observations. We shall denote by z_i the raw value of actor i's alcohol consumption scored 1–5, by z̄ the average, equal to 2.8, and by z̃_i the average alcohol consumption of i's friends. The interpretation of the friendship dynamics is quite the same as in the analysis where alcohol was treated as an exogenous variable. For the dynamics of alcohol consumption, the indegree, outdegree, and sex of the actor do not seem to have an important influence. The effect of the average drinking behavior of the focal actor's friends is significant, however, with a t-value of 0.83/0.35 = 2.4. When we ignore the small and non-significant effects of outdegree, indegree, and sex, the remaining part of the objective function for drinking behavior is

−0.36 (z_i − z̄) + 0.83 (z̃_i − z̄)(z_i − z̄) − 0.34 (z_i − z̄)²
  = (−0.36 + 0.83 (z̃_i − z̄)) (z_i − z̄) − 0.34 (z_i − z̄)².

This is a unimodal quadratic function of z_i, with a maximum at

z_i = z̄ + 0.83 (z̃_i − z̄) / (2 × 0.34) = −0.62 + 1.22 z̃_i.

Taking account of the integer values of z_i, this implies that those whose friends have the smallest possible average drinking behavior z̃_i = 1 are drawn toward the 'preferred' value of 1 themselves, while those whose friends have the highest possible average z̃_i = 5 are drawn toward the value 5 as a 'preferred' value. It can be concluded that the data provide support for the existence, in the co-evolution of friendship and drinking tendencies, of selection (the tendency to choose friends with similar behavior) as well as influence (the tendency to change behavior in the direction of friends' behavior).

Extensions
The basic model specifications defined above can be extended in various ways. We already mentioned the possibility to let the rates of change depend on covariates or on the current network structure. Another possibility is to introduce an asymmetry between the value of a tie when it is formed and its value when it is lost. For friendship dynamics, e.g., there is theoretical and empirical evidence that the additional 'value' of a tie added by its being reciprocated is higher when considering a potential loss of the tie than when considering the potential formation of a new tie. This asymmetry can be modeled by endowment functions; see [49].
Technically, this means that the effects s_{ki}(x⁰, x, v, w) used in (6) depend not only on the new state x but, unlike the examples given above, also on the preceding state x⁰. For example, the endowment effect of a tie being reciprocated is expressed by

s_{ki} = Σ_j x⁰_{ij} x_{ij} x_{ji},

which is sensitive to the number of i's reciprocal ties only when they are candidates for being terminated, and not when they are candidates for being created.
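A small sketch (setup and names hypothetical) makes the asymmetry concrete: the statistic Σ_j x⁰_ij x_ij x_ji drops when a reciprocated tie is terminated but is unchanged when a new tie is created, so a positive weight on it penalizes only terminations of reciprocated ties:

```python
def reciprocity_endowment(i, x_old, x_new):
    """Endowment statistic for actor i: ties that were present before
    (x_old), are still present (x_new), and are reciprocated."""
    n = len(x_old)
    return sum(x_old[i][j] * x_new[i][j] * x_new[j][i] for j in range(n))

# current state: the tie between actors 0 and 1 is reciprocated
x0 = [[0, 1, 0],
      [1, 0, 0],
      [0, 0, 0]]

drop = [row[:] for row in x0]; drop[0][1] = 0    # ministep: terminate 0 -> 1
add  = [row[:] for row in x0]; add[0][2] = 1     # ministep: create 0 -> 2

print(reciprocity_endowment(0, x0, x0))    # baseline: 1
print(reciprocity_endowment(0, x0, drop))  # termination lowers it to 0
print(reciprocity_endowment(0, x0, add))   # creation leaves it at 1
```

Creating the tie 0 → 2 cannot change the statistic because x⁰_{02} = 0, which is exactly the asymmetry the endowment function encodes.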
Similarly, going upward on a behavior variable might not be the opposite of going downward; this, too, can be modeled by endowment effects in the objective function for behavior.

Future Directions
Although the statistical modeling of network dynamics started already with [22] and [59], this area has been developing rapidly only in recent years. Much work remains to be done, however, to extend these methods to other types of network data and to study their properties. The lists of effects that can be included in the objective functions illustrate the flexibility of this model and its adaptability to research questions and network data. The availability of methods for the analysis of network panel data has also been a stimulus for the further collection of such data. Plausible models and good methods for parameter estimation and testing have now been developed, as summarized in this chapter, and they are available in the SIENA (Simulation Investigation for Empirical Network Analysis) program. This program is available as freeware, with additional material, on the website http://stat.gamma.rug.nl/snijders/siena.html, and has an extensive manual [50]. The examples presented here were analyzed using this program. The tests used in the examples, based on studentized parameter estimates, can be regarded as Wald-type tests. Some limited simulation studies have supported the validity of these tests. Score-type tests associated with the Method of Moments estimators were developed in [43]. The examples in this chapter underline the need for methods to assess the fit of models, and also to compare non-nested models. This could be done formally, based on estimated likelihoods, or informally, based on comparison of observed and expected values of relevant statistics that are not used for parameter estimation. The open question of assessing fit also invites speculation about the robustness of the results against the use of models whose fit is not beyond doubt.
There is an inherent tension between the complexity of processes of network dynamics and the limited amount of data that can in practice be observed about these processes. One issue is that the models proposed here are Markov processes. For two-wave data sets there are no clear alternatives to making such an assumption, but the assumption is certainly debatable. Including more information in the state space (by using covariates, by considering valued rather than dichotomous ties, etc.) may relax the doubts concerning this assumption.
Another issue is the difference between the tie-oriented and actor-oriented models. Which type of model is to be preferred is a matter both of social science theory and of empirical fit. It will be important to know, supported by simulation studies and/or mathematical results, the extent to which results based on particular models for network dynamics are robust to deviations from the precise assumptions made. In addition, it will be useful to develop still other models, e. g., models accounting for actor heterogeneity (as were developed for non-longitudinal network data, e. g., by [20,35,58]) or for measurement error. Other open questions concern the mathematical properties of the estimators and tests proposed. Simulation studies support the conjecture that the Method of Moments estimators have asymptotically normal distributions, but this has not been proven. It is unknown whether the solution to the moment equation (7) is unique under certain conditions. Similar questions can be asked about the Maximum Likelihood estimators. All this indicates that there is ample scope for future work on methods of statistical inference for network dynamics.

Bibliography

Primary Literature

1. Albert R, Barabási A-L (2002) Statistical mechanics of complex networks. Rev Mod Phys 74:47–97
2. Bartholomew DJ (1982) Stochastic Models for Social Processes, 3rd edn. Wiley, New York
3. Bearman PS (1997) Generalized exchange. Am J Sociol 102:1383–1415
4. Burt RS (1992) Structural Holes. Harvard University Press, Cambridge
5. Brass DJ, Galaskiewicz J, Greve HR, Tsai W (2004) Taking stock of networks and organizations: a multilevel perspective. Acad Manag J 47:795–817
6. Burk WJ, Steglich CEG, Snijders TAB (2007) Beyond dyadic interdependence: Actor-oriented models for co-evolving social networks and individual behaviors. Int J Behav Dev 31:397–404
7. Cartwright D, Harary F (1956) Structural Balance: A Generalization of Heider's Theory. Psychol Rev 63:277–292
8. Checkley M, Steglich CEG (2007) Partners in Power: Job Mobility and Dynamic Deal-Making. Eur Manag Rev 4:161–171
9. Chen H-F (2002) Stochastic Approximation and its Applications. Kluwer, Dordrecht
10. Coleman JS (1964) Introduction to Mathematical Sociology. The Free Press of Glencoe, New York
11. Coleman JS (1990) Foundations of Social Theory. Belknap Press of Harvard University Press, Cambridge/London
12. de Federico de la Rúa A (2003) La dinámica de las redes de amistad. La elección de amigos en el programa Erasmus. REDES 4.3, http://revista-redes.rediris.es. Accessed 10 Aug 2007
13. de Nooy W (2002) The dynamics of artistic prestige. Poetics 30:147–167
14. Ennett ST, Bauman KE (1994) The Contribution of Influence and Selection to Adolescent Peer Group Homogeneity: The Case of
Adolescent Cigarette Smoking. J Personal Soc Psychol 67:653–663
15. Doreian P, Stokman FN (eds) (1997) Evolution of Social Networks. Gordon and Breach, Amsterdam
16. Friedkin NE (1998) A Structural Theory of Social Influence. Cambridge University Press, Cambridge
17. Granovetter MS (1973) The strength of weak ties. Am J Sociol 78:1360–1380
18. Gulati R, Nohria N, Zaheer A (2000) Strategic Networks. Strateg Manag J 21:203–215
19. Haynie DL (2001) Delinquent Peers Revisited: Does Network Structure Matter? Am J Sociol 106:1013–1057
20. Hoff PD, Raftery AE, Handcock MS (2002) Latent Space Approaches to Social Network Analysis. J Am Stat Assoc 97:1090–1098
21. Holland PW, Leinhardt S (1975) Local structure in social networks. Sociol Methodol 1976:1–45
22. Holland PW, Leinhardt S (1977) A dynamic model for social networks. J Math Sociol 5:5–20
23. Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley-Interscience, New York
24. Koskinen JH, Snijders TAB (2007) Bayesian inference for dynamic social network data. J Stat Plan Inference 137:3930–3938
25. Kossinets G, Watts DJ (2006) Empirical Analysis of an Evolving Social Network. Science 311:88–90
26. Leenders RTAJ (1995) Models for network dynamics: A Markovian framework. J Math Sociol 20:1–21
27. Lévi-Strauss C (1969; orig. 1947) The Elementary Structures of Kinship. Beacon Press, Boston
28. Lin N, Cook K, Burt RS (eds) (2001) Social Capital: Theory and Research. Aldine de Gruyter, New York
29. Long JS (1997) Regression Models for Categorical and Limited Dependent Variables. Sage Publications, Thousand Oaks
30. Maddala GS (1983) Limited-dependent and Qualitative Variables in Econometrics, 3rd edn. Cambridge University Press, Cambridge
31. Newcomb TM (1962) Student Peer-Group Influence. In: Sanford N (ed) The American College: A Psychological and Social Interpretation of the Higher Learning. Wiley, New York
32. Newman MEJ, Barkema GT (1999) Monte Carlo Methods in Statistical Physics. Clarendon Press, Oxford
33. Newman MEJ, Watts DJ, Strogatz SH (2002) Random graph models of social networks. Proc Natl Acad Sci USA 99:2566–2572
34. Norris JR (1997) Markov Chains. Cambridge University Press, Cambridge
35. Nowicki K, Snijders TAB (2001) Estimation and prediction for stochastic blockstructures. J Am Stat Assoc 96:1077–1087
36. Pearson MA, Michell L (2000) Smoke Rings: Social network analysis of friendship groups, smoking and drug-taking. Drugs Educ Prev Policy 7:21–37
37. Pearson M, Steglich C, Snijders T (2006) Homophily and assimilation among sport-active adolescent substance users. Connections 27(1):47–63
38. Pearson M, West P (2003) Drifting Smoke Rings: Social Network Analysis and Markov Processes in a Longitudinal Study of Friendship Groups and Risk-Taking. Connections 25(2):59–76
39. Podolny JM, Stuart TE, Hannan MT (1997) Networks, knowledge, and niches: competition in the worldwide semiconductor industry, 1984–1991. Am J Sociol 102:659–689
40. de Solla Price D (1976) A general theory of bibliometric and other cumulative advantage processes. J Am Soc Inf Sci 27:292–306
41. Robins GL, Woolcock J, Pattison P (2005) Small and other worlds: Global network structures from local processes. Am J Sociol 110:894–936
42. Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22:400–407
43. Schweinberger M (2007) Statistical modeling of digraph panel data: Goodness-of-fit. Submitted for publication
44. Schweinberger M, Snijders TAB (2006) Markov models for digraph panel data: Monte Carlo-based derivative estimation. Comput Stat Data Anal 51:4465–4483
45. Snijders TAB (1996) Stochastic actor-oriented dynamic network analysis. J Math Sociol 21:149–172
46. Snijders TAB (2001) The statistical evaluation of social network dynamics. In: Sobel M, Becker M (eds) Sociological Methodology. Basil Blackwell, Boston/London, pp 361–395
47. Snijders TAB (2005) Models for longitudinal network data. In: Carrington PJ, Scott J, Wasserman S (eds) Models and Methods in Social Network Analysis. Cambridge University Press, New York
48. Snijders TAB, Koskinen JH, Schweinberger M (2007) Maximum Likelihood Estimation for Social Network Dynamics. Submitted for publication
49. Snijders TAB, Steglich CEG, Schweinberger M (2007) Modeling the co-evolution of networks and behavior. In: van Montfort K, Oud H, Satorra A (eds) Longitudinal models in the behavioral and related sciences. Lawrence Erlbaum, Mahwah, pp 41–71
50. Snijders TAB, Steglich CEG, Schweinberger M, Huisman M (2007) Manual for SIENA version 3. ICS, University of Groningen, Groningen; Department of Statistics, University of Oxford, Oxford. http://stat.gamma.rug.nl/snijders/siena.html. Accessed 4 Dec 2007
51. Sparrowe RT, Liden RC, Wayne SJ, Kraimer ML (2001) Social Networks and the performance of individuals and groups. Acad Manag J 44:316–325
52. Steglich CEG, Snijders TAB, Pearson M (2007) Dynamic networks and behavior: Separating selection from influence. Submitted for publication
53. Steglich CEG, Snijders TAB, West P (2006) Applying SIENA: An Illustrative Analysis of the Coevolution of Adolescents' Friendship Networks, Taste in Music, and Alcohol Consumption. Methodology 2:48–56
54. Udehn L (2002) The changing face of methodological individualism. Ann Rev Sociol 8:479–507
55. van de Bunt GG, van Duijn MAJ, Snijders TAB (1999) Friendship networks through time: An actor-oriented statistical network model. Comput Math Organ Theory 5:167–192
56. van de Bunt GG, Wittek RPM, de Klepper MC (2005) The Evolution of Intra-Organizational Trust Networks; The Case of a German Paper Factory: An Empirical Test of Six Trust Mechanisms. Int Sociol 20:339–369
57. van Duijn MAJ, Zeggelink EPH, Huisman M, Stokman FN, Wasseur FW (2003) Evolution of Sociology Freshmen into a Friendship Network. J Math Sociol 27:153–191
58. van Duijn MAJ, Snijders TAB, Zijlstra BH (2004) p2: a random effects model with covariates for directed graphs. Stat Neerlandica 58:234–254
59. Wasserman S (1980) Analyzing social networks as stochastic processes. J Am Stat Assoc 75:280–294
60. Wippler R (1978) The structural-individualistic approach in Dutch sociology. Neth J Sociol 4:135–155
61. Zeggelink EPH (1994) Dynamics of structure: an individual oriented approach. Soc Netw 16:295–333
Further Reading

Basic concepts of continuous-time Markov processes can be found in [34]. The basic definition of the model presented here and of the statistical estimation methods for network dynamics based on panel data can be studied in [46,47]. The approach to the co-evolution of networks and behavior is presented in [49,52]. Some examples of the methods presented in this chapter can be found in [6,53,55,56,57].
Networks and Stability

FRANK H. PAGE JR.1,2, MYRNA WOODERS3,4
1 Department of Economics, Indiana University, Bloomington, USA
2 Centre d'Économie de la Sorbonne, Université Paris 1 Panthéon–Sorbonne, France
3 Department of Economics, Vanderbilt University, Nashville, USA
4 Department of Economics, University of Warwick, Coventry, UK

Article Outline

Glossary
Definition of the Subject
Introduction
The Primitives
Abstract Games of Network Formation and Stability
Strong Stability, Pairwise Stability, Nash Stability, and Farsighted Consistency
Singleton Basins of Attraction
Future Directions
Acknowledgments
Bibliography

Glossary

Homogeneous networks  A homogeneous network consists of a finite set of nodes together with a finite set of mathematical objects called links or arcs, each identifying a connection between a pair of nodes. Given finite node set N with typical element i, a homogeneous linking network G is a finite collection of sets of the form {i, i'} called links. Link {i, i'} ∈ G indicates that nodes i and i' are connected in network G. A homogeneous directed network G is a finite collection of ordered pairs (i, i') called arcs. Arc (i, i') ∈ G indicates that nodes i and i' are connected in network G via a connection running from i to i'. In a homogeneous network (whether it be a linking network or a directed network) all connections are of the same type.

Heterogeneous networks  A heterogeneous network consists of a finite set of nodes together with a finite set of mathematical objects called labeled links or labeled arcs, each identifying a particular type of connection between a pair of nodes. Given finite node set N with typical element i and given finite label set A with typical element a, a heterogeneous linking network G is a finite collection of ordered pairs of the form (a, {i, i'}) called labeled links. Labeled link
(a, {i, i'}) ∈ G indicates that nodes i and i' are connected in network G via a type a link. A heterogeneous directed network G is a finite collection of ordered pairs of the form (a, (i, i')) called labeled arcs. Labeled arc (a, (i, i')) ∈ G indicates that nodes i and i' are connected in network G via a type a arc running from i to i'. In a heterogeneous network (whether it be a linking network or a directed network) connections can differ and are distinguished by type.

Abstract game of network formation with respect to irreflexive dominance  An abstract game of network formation with respect to irreflexive dominance consists of a feasible set of networks 𝔾 equipped with an irreflexive dominance relation >. A dominance relation on 𝔾 is a binary relation on 𝔾 such that for all G and G' in 𝔾, G' > G (read G' dominates G) is either true or false. The dominance relation is irreflexive if G > G is always false.

Abstract game of network formation with respect to path dominance  An abstract game of network formation with respect to path dominance consists of a feasible set of networks 𝔾 equipped with a path dominance relation ≥p induced by an irreflexive dominance relation > on 𝔾. Given networks G and G' in 𝔾, G' ≥p G (read G' path dominates G) if either G' = G or there is a finite sequence of networks in 𝔾 beginning with G and ending with G' such that each network along the sequence dominates its predecessor.

Definition of the Subject

Our subject is networks, and in particular, stable networks and the game theoretic underpinnings of stable networks. Networks are pervasive. We routinely communicate over the internet, advance our careers by networking, travel to conferences over the transportation network, and pay for the trip using the banking network. Doing this utilizes networks in our brain. The list could go on. While network models have had a long history in sociology, the natural sciences, and engineering (e.
g., in modeling social organizations, brain architecture, and electrical circuits), the rise of the network paradigm in economics is relatively recent. Economists are now beginning to think of political and economic interactions as network phenomena and to model everything from terrorist activities to asset market microstructures as games of network formation. This trend in economics, which began with the seminal paper by Myerson [88] on graphs and cooperation and accelerated with the publication of the papers by Jackson and Wolinsky [64] and Dutta and Mutuswami [39] on stable and efficient networks, is likely to continue with the development of new algorithms, the expansion of computational capacity, and the broad application of network theories to economic, political, and social phenomena.

What economists bring to the study of networks that is new is game theory. For the most part, sociologists, natural scientists, and engineers have used networks descriptively and have focused on the design of networks from the perspective of a single designer or on the random evolution of networks from the perspective of nature. This singularity of perspective is a consequence of the nonstrategic nature of the phenomena being explained or the problem being solved (e. g., the spread of a disease through a given population, the transmission of electrical impulses in the brain, or the optimal design of an integrated circuit). In economics the perspective is often strategic. In particular, in many economic situations, several individuals, guided by their own self interest, behave strategically in putting into place the pieces of the network of economic, political, or social interactions under their control, and in so doing generate payoffs and externalities that determine the network of economic interactions that eventually emerges in equilibrium. Thus in economics, pieces of the network are the strategies, and the network that ultimately prevails is the result of strategic competition rather than the design of a single individual or nature.

Because large computer networks such as the internet are built, operated, and used by many diverse individuals with competing interests, computer scientists are also beginning to use game theoretic models to analyze and understand the optimal design of secure computer networks (see, for example, [105,115]).
Conversely, what networks bring to the study of economics is a new way of modeling the structure of economic interactions and externalities, one that makes possible a game-theoretic analysis of how these structures influence individual payoffs and the economic equilibrium that emerges from competition.

Introduction

Our main objective is to present a unified, game-theoretic development of the main concepts of stability that have appeared in the recent economics literature on strategic network formation, specifically, the notions of strong stability (Jackson and van den Nouweland [61]), pairwise stability (Jackson and Wolinsky [64]), Nash stability, and farsighted consistency (Chwe [26]). In order to accomplish this we follow the approach introduced in Page and Wooders [91,93]. The key ingredient in this approach is an abstract game model of endogenous network formation (i. e., an abstract game in the sense of von Neumann–Morgenstern [122]). The model is built on four primitives: a feasible set of networks, player preferences over networks, the rules of network formation, and an irreflexive dominance relation over networks. In the remainder of this introduction we provide an overview of the four primitives of our model, a summary of the results discussed, and note some important areas of research that are beyond the scope of this entry.

Feasible Sets: The feasible set may consist of networks as simple as homogeneous linking networks or as complex as heterogeneous directed networks. All networks consist of a finite set of nodes (representing, for example, economic agents or players) together with a finite set of mathematical objects called links, labeled links, arcs, or labeled arcs describing the connections between nodes. Here we will focus on homogeneous linking networks, as does most of the literature (see, for example, Myerson [88], Jackson and Wolinsky [64], and Jackson and van den Nouweland [61]), except in our discussion of Nash stability, where we will consider homogeneous directed networks (as in [5]). What distinguishes homogeneous networks (linking or directed) from heterogeneous networks (linking or directed) is that in a homogeneous network all connections between nodes are of the same type, whether represented by a link as in a linking network or by an arc as in a directed network. Thus, in a homogeneous linking network all links are of the same type and in a homogeneous directed network all arcs are of the same type. While homogeneous networks are quite restrictive, they have been very important in developing our understanding of social and economic networks and have proved very useful in many economic applications (see, for example, Belleflamme and Bloch [7], Bramoullé and Kranton [17], Calvó-Armengol [19], and Furusawa and Konishi [41]). Page and Wooders [89,91,93,94] extend the existing literature on economic and social networks by introducing the notion of heterogeneous directed networks.
These types of networks potentially have a rich set of applications (in the natural sciences, engineering, sociology, and politics, as well as economics) because connections or interactions between nodes can be distinguished by direction or intent as well as by type, intensity, or purpose.

Players' Preferences: We will assume throughout that each player's preferences are given by an irreflexive binary relation defined on the feasible set of networks. Thus, we will assume that players have strong (or strict) preferences over networks. Under strong preferences, if a player prefers one network to another, then the player's preference is strict. However, we will comment where appropriate on weak preferences. Under weak preferences, if a player prefers one network to another, then the player's preference is either strict or indifferent.
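To make the preference primitive concrete, here is a small sketch we add for illustration (not part of the original model): a hypothetical payoff function payoff(i, G) induces a strict preference relation, which is irreflexive by construction, and a weak counterpart.

```python
def strict_pref(payoff):
    """Strong (strict) preferences induced by an assumed payoff function:
    player i strictly prefers G2 to G1 iff her payoff is strictly higher.
    The induced relation is irreflexive: no network is preferred to itself."""
    return lambda i, G2, G1: payoff(i, G2) > payoff(i, G1)

def weak_pref(payoff):
    """Weak preferences: strict preference or indifference."""
    return lambda i, G2, G1: payoff(i, G2) >= payoff(i, G1)

# Toy payoff: a player's payoff is the number of links she is involved in.
payoff = lambda i, G: sum(1 for link in G if i in link)
G_empty, G_link = set(), {frozenset({1, 2})}
prefers = strict_pref(payoff)
assert prefers(1, G_link, G_empty)       # player 1 strictly prefers having the link
assert not prefers(1, G_empty, G_empty)  # irreflexivity: G is never preferred to G
```

Any irreflexive binary relation would do; deriving it from a payoff function is just one convenient special case.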
2045
2046
Networks and Stability
Rules of Network Formation: We will focus here on three different sets of rules: Jackson–Wolinsky rules [64], Jackson–van den Nouweland rules [61], and Bala–Goyal rules [5]. In particular, in our discussions of pairwise stable homogeneous linking networks we will assume that the rules of network formation are the Jackson–Wolinsky rules. Under the Jackson–Wolinsky rules the addition of a link is bilateral (i. e., the two players that would be involved in the link must agree to adding the link), the subtraction of a link is unilateral (i. e., at least one player involved in the link must agree to subtract or delete the link), and network changes take place one link at a time (i. e., only one link can be added or subtracted at a time). In our discussion of strongly stable homogeneous linking networks, we will assume that the rules of network formation are the Jackson–van den Nouweland rules. Under the Jackson–van den Nouweland rules link addition is bilateral, link subtraction is unilateral, and in any one play of the game several links can be added and/or subtracted. Thus the Jackson–van den Nouweland rules are the Jackson–Wolinsky rules without the one-link-at-a-time restriction. Finally, in our discussion of Nash homogeneous directed networks we will assume that the rules of network formation are the Bala–Goyal rules. Under the Bala–Goyal rules an arc may be added or subtracted unilaterally by the initiating player involved in the arc and in any one play of the game only network changes brought about by an individual player are allowed. Note that all three of these sets of rules can be described as being uniform across networks. Under uniform rules the rules for changing a network are the same no matter which status quo network is being changed. Page, Wooders and Kamat [94] allow nonuniform rules and introduce a network representation of nonuniform rules. 
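The Jackson–Wolinsky rules can be sketched computationally. The fragment below is our illustration, not the authors' formulation: it enumerates the networks reachable from a status quo in one play of the game, and checks who must consent to the change, using a stand-in payoff function payoff(i, G) that is an assumption of the example.

```python
from itertools import combinations

def jw_one_link_moves(G, N):
    """Jackson-Wolinsky rules (sketch): from status quo G over node set N,
    exactly one link may be added or deleted per play of the game."""
    all_links = {frozenset(p) for p in combinations(N, 2)}
    adds = [G | {l} for l in all_links - G]   # addition is bilateral
    dels = [G - {l} for l in G]               # deletion is unilateral
    return adds + dels

def jw_consent(G_new, G_old, payoff):
    """Do the players whose consent is required agree to the one-link change?
    Addition of {i, i'}: both i and i' must strictly prefer G_new.
    Deletion of {i, i'}: at least one of i, i' must strictly prefer G_new.
    payoff(i, G) is an assumed illustrative payoff function."""
    added, deleted = G_new - G_old, G_old - G_new
    if added:
        (link,) = added
        return all(payoff(i, G_new) > payoff(i, G_old) for i in link)
    (link,) = deleted
    return any(payoff(i, G_new) > payoff(i, G_old) for i in link)

# Three players, one existing link: two possible additions, one deletion.
N = [1, 2, 3]
G = {frozenset({1, 2})}
assert len(jw_one_link_moves(G, N)) == 3
```

Under the Jackson–van den Nouweland rules the one-link-at-a-time restriction in jw_one_link_moves would be dropped; under the Bala–Goyal rules the objects would be arcs changed unilaterally by their initiating player.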
Dominance Relations: Given players' preferences and the rules of network formation, we will define a dominance relation over the feasible set of networks that incorporates both players' preferences and the rules. Here we will focus on dominance relations that are either direct or indirect. Under direct dominance players are concerned with the immediate consequences of their network formation strategies, whereas under indirect dominance players are farsighted and consider the eventual consequences of their strategies.

General Results: A specification of the primitives induces two types of abstract games over homogeneous networks: (i) a network formation game with respect to the irreflexive dominance relation induced by preferences and rules, and (ii) a network formation game with respect to path dominance induced by this irreflexive dominance relation. We will begin by considering the game with respect
to irreflexive dominance and present results on the existence of quasi-stable and stable networks. These results provide a network rendition of classical results from graph theory on the existence of quasi-stable sets and stable sets due to Chvátal and Lovász [25], Berge [8], and Richardson [100]. We will also present a result on the existence and nonemptiness of the set of farsightedly consistent networks. This result is a network rendition of a result due to Chwe [26] for abstract games. Next we will consider the game over homogeneous networks with respect to path dominance, and we will conclude that the following results hold:

1. Given preferences and the rules governing network formation, the set of homogeneous networks (linking or directed) contains a unique, finite, disjoint collection of nonempty subsets, each constituting a strategic basin of attraction. These basins of attraction are the absorbing sets of the competitive process of network formation modeled via the game.

2. A stable set of homogeneous networks (in the sense of von Neumann–Morgenstern) with respect to path dominance consists of one network from each basin of attraction.

3. The path dominance core, defined as the set of networks having the property that no network in the set is path dominated by any other homogeneous network, consists of one network from each basin of attraction containing a single network. Note that the path dominance core is contained in each stable set and is nonempty if and only if there is a basin of attraction containing a single network. As a corollary, we conclude that any homogeneous network contained in the path dominance core is constrained Pareto efficient.

4. From the results above it follows that if the dominance relation is transitive and irreflexive, then the path dominance core is nonempty.

These results are special cases of results due to Page and Wooders [93].
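Results (1)–(3) can be illustrated computationally on a tiny feasible set. The sketch below is our illustration (the dominance digraph used in the example is an assumption, not from the text): it finds the strategic basins of attraction as the absorbing mutual-reachability classes of the domination digraph, and the path dominance core as the union of the singleton basins.

```python
def basins_and_core(networks, dominates):
    """networks: finite list of hashable networks; dominates(G2, G1): an
    irreflexive dominance relation. Returns (basins, core): each basin is a
    set of networks that are mutually reachable along domination paths and
    from which no domination path exits; the core collects singleton basins,
    i.e. networks path dominated by no other network."""
    # reach[g]: networks reachable from g along finite domination paths
    reach = {g: {h for h in networks if h != g and dominates(h, g)}
             for g in networks}
    changed = True
    while changed:                        # transitive closure; fine for small sets
        changed = False
        for g in networks:
            extra = set().union(set(), *(reach[h] for h in reach[g])) - reach[g]
            if extra:
                reach[g] |= extra
                changed = True
    basins = []
    for g in networks:
        if any(g in b for b in basins):
            continue
        cls = {g} | {h for h in reach[g] if g in reach[h]}   # mutual reachability
        if all(reach[h] <= cls for h in cls):                # absorbing: no exit
            basins.append(cls)
    core = {g for b in basins if len(b) == 1 for g in b}
    return basins, core

# Toy dominance digraph on four "networks": 0 -> 1, 1 <-> 2 (a cycle), 3 isolated.
edges = {(1, 0), (2, 1), (1, 2)}
basins, core = basins_and_core([0, 1, 2, 3], lambda h, g: (h, g) in edges)
assert {frozenset(b) for b in basins} == {frozenset({1, 2}), frozenset({3})}
assert core == {3}
```

In the toy example a von Neumann–Morgenstern stable set picks one network from each basin, here {1, 3} or {2, 3}, while only the singleton basin {3} contributes to the path dominance core, matching results (2) and (3).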
Specific Results for Pairwise Stability, Strong Stability, Nash Stability, and Farsighted Consistency

What are the connections between our notions of stability for homogeneous networks (basins of attraction, path dominance stable sets, and path dominance core) and the notions of strong stability [39,61], pairwise stability [64], Nash stability [5], and farsighted consistency [26,94]? From the general results in [91] and [93] for heterogeneous directed networks, we will conclude for the case of homogeneous networks (linking or directed) that, depending on how we specialize the primitives of the model, the path dominance core is equal to the set of strongly stable networks, the set of pairwise stable networks, or the set of Nash networks. In particular, we will conclude that:

(a) If path dominance is induced by a direct dominance relation, then in the set of homogeneous linking networks the path dominance core is equal to the set of strongly stable networks.

(b) If, in addition, the rules of network formation are the Jackson–Wolinsky rules, then in the set of homogeneous linking networks the path dominance core is equal to the set of pairwise stable networks.

(c) If path dominance is induced by a direct dominance relation and if the rules of network formation are the Bala–Goyal rules, then in the set of homogeneous directed networks the path dominance core is equal to the set of Nash networks.

We can then conclude from (3) above that the existence of at least one basin of attraction containing a single network is, depending on how we specialize primitives, both necessary and sufficient for the existence of (i) a strongly stable network, (ii) a pairwise stable network, or (iii) a Nash network.

For path dominance induced by an indirect dominance relation, we can conclude from our prior results that for the case of homogeneous linking networks with Jackson–Wolinsky or Jackson–van den Nouweland rules, or for the case of homogeneous directed networks with Bala–Goyal rules, each strategic basin of attraction has a nonempty intersection with the largest farsightedly consistent set of networks. This result, together with (2) above, implies that there always exists a path dominance stable set of homogeneous networks contained in the largest farsightedly consistent set. Thus, the path dominance core is contained in the largest consistent set.
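The pairwise stability notion appearing in (b) admits a direct check. The following sketch is our illustration of the standard Jackson–Wolinsky definition; the payoff function payoff(i, G) and the toy example are assumptions of the example, not taken from the text.

```python
from itertools import combinations

def pairwise_stable(G, N, payoff):
    """Jackson-Wolinsky pairwise stability (sketch), with payoff(i, G) an
    assumed player-payoff function. G is pairwise stable iff no player gains
    by deleting one of her links, and no absent link can be added that makes
    one of its two players strictly better off without hurting the other."""
    for link in G:                          # no profitable unilateral deletion
        if any(payoff(i, G - {link}) > payoff(i, G) for i in link):
            return False
    for pair in combinations(N, 2):         # no profitable bilateral addition
        link = frozenset(pair)
        if link in G:
            continue
        gains = [payoff(i, G | {link}) - payoff(i, G) for i in pair]
        if all(d >= 0 for d in gains) and any(d > 0 for d in gains):
            return False
    return True

# Toy payoff (number of own links): only the complete network is pairwise stable.
payoff = lambda i, G: sum(1 for l in G if i in l)
N = [1, 2, 3]
complete = {frozenset(p) for p in combinations(N, 2)}
assert pairwise_stable(complete, N, payoff)
assert not pairwise_stable(set(), N, payoff)
```

With a richer payoff function (for example, one with linking costs), the set of networks passing this test can then be compared against the path dominance core computed from the induced direct dominance relation, as result (b) predicts.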
In light of our results on the path dominance core and stability (both strong and pairwise), we conclude that if path dominance is induced by an indirect dominance relation, then any homogeneous network contained in the path dominance core (i. e., the farsighted core) is not only farsightedly consistent but also strongly stable, as well as pairwise stable. Other papers using indirect dominance (or variations thereof) and farsighted consistency in games (not necessarily network formation games) include Li [76,77], Xue [129,130], Luo [79], Mariotti and Xue [80], Diamantoudi and Xue [36], Mauleon and Vannetelbosch [84], Bhattacharya [9], and Herings, Mauleon and Vannetelbosch [55].

We remark that solution concepts defined using dominance relations have a long and distinguished history in the literature of game theory. First, consider the von Neumann–Morgenstern stable set (see [122] and [100]). The vN–M stable set is defined with respect to a dominance relation on a set of outcomes and consists of those outcomes that are externally and internally stable with respect to the given dominance relation. Similarly, Gillies [44] defines the core based on a given dominance relation. These solution concepts, with a few exceptions, have typically been applied to models of economies or cooperative games where the notion of dominance is based on what a coalition can achieve using only the resources owned by its members (cf. Aumann [3]) or a given set of utility vectors for each possible coalition (cf. Scarf [106]). Particularly notable exceptions are Schwartz [107], Panzer, Kalai and Schmeidler [66], Kalai and Schmeidler [67], Shenoy [109], Inarra, Kuipers, and Olaizola [58], and van Deemen [119]. Their motivations are in part similar to ours in that they take as given a set of possible choices for players (here consisting of a set of networks) and a dominance relation and, based on these, describe a set of possible or likely outcomes called, by Kalai and Schmeidler, the admissible set. While their examples treat direct dominance, their general results have wider applications.

Because our objective here is to provide a unified game theoretic treatment of the main stability notions for network formation games, many topics related to strategic networks are not covered here. For example, we do not discuss the conflict between stability and efficiency, which is the main focus of the important papers by Dutta and Mutuswami [39], Currarini and Morelli [30], and Mutuswami and Winter [87]. Nor do we treat the topic of network formation and cooperative games, the topic of the seminal paper by Myerson [88] and the excellent book by Slikker and van den Nouweland [113], among many other contributions, or the topic of network formation and evolution treated in Hojman and Szeidl [56].
Our game theoretic approach provides a snapshot of all possible network formation paths under dynamics which respect preferences and the rules of network formation. Our approach, however, is not explicitly dynamic. For network dynamics, we can only suggest to the reader the elegant papers by Skyrms and Pemantle [111], Watts [124], Jackson and Watts [62], Konishi and Ray [71], and Dutta, Ghosal, and Ray [38]. We do not touch on the topic of learning in networks, a topic which has been the focus of much work by Goyal [45,46], nor do we discuss random networks, introduced in economics by Kirman [68]. For random networks we refer the reader to the recent book by Vega-Redondo [121] and the references contained therein. Finally, we do not discuss the statistical mechanics of network formation. For this topic, we recommend to the reader the excellent papers by Blume [13] and Durlauf [37].
Because our focus is on foundational issues in strategic network formation, and in particular stability, we do not discuss any of the plethora of economic applications that can be found in the exploding literature on social and economic networks. For now, we can only offer the reader the following modest and incomplete list of topics and papers:

Development and insurance: Bloch, Genicot, and Ray [12], Bramoullé and Kranton [18];
Employment and labor markets: Rees [98], Granovetter [50], Boorman [16], Montgomery [86], Topa [118], Calvó-Armengol [19], Calvó-Armengol and Jackson [21], Calvó-Armengol and Jackson [22];
Industrial organization and R&D: Demange and Henriet [32], Bloch [10], Kranton and Minehart [74], Goyal and Moraga-González [49], Goyal and Joshi [47], Belleflamme and Bloch [7], Bloch [11], Deroian and Gannon [35], Mauleon, Sempere-Monerris, and Vannetelbosch [83], Wang and Watts [123];
International trade: Casella and Rauch [23,24], Zissimos [2], Goyal and Joshi [48], Furusawa and Konishi [41];
Market microstructure: Tesfatsion [116,117], Kirman, Herreiner and Weisbuch [69], Kranton and Minehart [75], Corominas-Bosch [28], Even-Dar, Kearns, and Suri [40];
Public goods: Bramoullé and Kranton [17];
Organizations, coordination, and communication: Chwe [27], Currarini [29], Demange [34];
Marketing and advertising: Galeotti and Moraga-González [43].

The Primitives

Our abstract game of network formation rests on four primitives: the feasible set of networks, players' preferences, the rules of network formation, and a dominance relation over feasible networks. In this section we discuss these four primitives in detail, and in the next section, using these primitives, we construct our abstract games of network formation.
ementary notion of a network, the homogeneous linking network, and proceed to a more complex notion, the heterogeneous directed network introduced in [91]. Let N be a finite set of nodes, with typical element denoted by i, and let A be a finite set of link types or arc types, with typical element denoted by a. If the network is directed (to be defined below), we refer to the elements of A as arc types, otherwise we will refer to the elements of A as link types. For any set E, we denote by P(E) the collection of all subsets of E. Finally, for any set E, we denote by jEj the cardinality of E (note that j;j D 0). Linking Networks Definition 1 (Homogeneous Linking Networks, Myerson [88], Jackson–Wolinsky [64]) Let P2 (N) denote the set of all subsets of N of size 2. A linking network, G, is a subset (possibly empty) of P2 (N) and for any G P2 (N), each subset fi; i 0 g 2 G is called a link in G. The collection of all homogeneous linking networks is denoted by P(P2 (N)). Thus, P2 (N) is the set of all possible links and a homogeneous linking network G is simply a subset of all possible links. For example, if N D fi1 ; i2 ; i3 g, then P2 (N) D ffi1 ; i2 g; fi2 ; i3 g; fi1 ; i3 gg ; and the subset G1 D ffi1 ; i2 g; fi1 ; i3 gg of P2 (N) is a homogeneous linking network. Figure 1 depicts homogeneous linking network G1 . Here, the link fi1 ; i3 g 2 G1 denotes that nodes i1 and i3 are connected or linked. Note that all links are the same (i. e., links are homogeneous) and links have no orientation or direction. Also, note that in a homogeneous linking network, loops are not allowed by definition (a loop being a link between a node and itself). Finally, note that in
Feasible Networks Types of Networks Surprisingly, there is no agreedupon definition of a network but rather several definitions depending on the application. But what all definitions have in common is a nonempty set of nodes and a precise mathematical description of how nodes are connected. What differentiates these various definitions then are the details of how nodes are connected. We begin with the most el-
Networks and Stability, Figure 1 Homogeneous Linking Network G1
Networks and Stability
Networks and Stability, Figure 2 Heterogeneous Linking Network G2
a linking network multiple links between any pair of nodes are not allowed. However, because links are homogeneous, multiple links are unnecessary. The following extended definition of a linking network allows for heterogeneous links. This heterogeneity is represented by a labeling of links using elements of the set A of link types.

Definition 2 (Heterogeneous Linking Networks) A heterogeneous linking network, G, is a subset of A × P₂(N). Given any G ⊆ A × P₂(N), each ordered pair (a, {i, i′}) ∈ G consisting of a link type and a link is called a labeled link in G. The collection of all heterogeneous linking networks is denoted by P(A × P₂(N)).

Thus, A × P₂(N) is the set of all possible labeled links and a heterogeneous linking network G is simply a subset of all possible labeled links. For example, given N = {i1, i2, i3} and A = {a1, a2, a3}, the subset

G2 = {(a1, {i1, i2}), (a1, {i1, i3}), (a3, {i1, i3})}

of A × P₂(N) is a heterogeneous linking network. Figure 2 depicts heterogeneous linking network G2. Here, the labeled link (a1, {i1, i3}) ∈ G2 denotes that nodes i1 and i3 are linked by a type a1 link. First, note that in network G2, in addition to being linked by an a1 link, nodes i1 and i3 are also linked by an a3 link. Thus, in a heterogeneous linking network, links are not identical and multiple, distinct links between any given pair of nodes are possible. Second, note that in network G2, nodes i1 and i2 – like nodes i1 and i3 – are linked by an a1 link. Thus, in a heterogeneous linking network, link types can be used multiple times for different pairs of nodes. Finally, note that in a heterogeneous linking network, links are still without orientation or direction and loops are not possible.
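A minimal Python sketch of Definitions 1 and 2, representing a link as a two-element frozenset and a labeled link as a (type, link) pair; the networks G1 and G2 below are those of Figures 1 and 2, and the helper `link_types` is an illustrative name, not notation from the text:

```python
from itertools import combinations

N = {"i1", "i2", "i3"}
A = {"a1", "a2", "a3"}

# P2(N): all two-element subsets of N, i.e. all possible links.
P2 = {frozenset(pair) for pair in combinations(sorted(N), 2)}

# Homogeneous linking network G1 = {{i1,i2},{i1,i3}} (Figure 1): any subset
# of P2(N); loops are impossible since a link has two distinct nodes.
G1 = {frozenset({"i1", "i2"}), frozenset({"i1", "i3"})}

# Heterogeneous linking network G2 (Figure 2): a subset of A x P2(N), so the
# same pair of nodes may carry several distinct link types.
G2 = {
    ("a1", frozenset({"i1", "i2"})),
    ("a1", frozenset({"i1", "i3"})),
    ("a3", frozenset({"i1", "i3"})),
}

def link_types(G, i, j):
    """Link types joining nodes i and j in a heterogeneous linking network."""
    return {a for (a, link) in G if link == frozenset({i, j})}
```

For instance, `link_types(G2, "i1", "i3")` recovers the multiple links {a1, a3} between i1 and i3.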
Networks and Stability, Figure 3 Homogeneous Directed Network G3
Directed Networks

The link orientation problem, as well as the problem of loops, is resolved by moving to directed networks. As with linking networks, there are two categories of directed networks: homogeneous directed networks and heterogeneous directed networks.

Definition 3 (Homogeneous Directed Networks) A homogeneous directed network, G, is a subset of N × N. Given any G ⊆ N × N, each ordered pair (i, i′) ∈ G consisting of a beginning node i and an ending node i′ is called an arc in G. The collection of all directed networks is denoted by P(N × N).

Thus, N × N is the set of all possible arcs and a homogeneous directed network G is simply a subset of all possible arcs. For example, given N = {i1, i2, i3},

G3 = {(i1, i1), (i1, i2), (i1, i3), (i3, i1)}

is a homogeneous directed network. Figure 3 depicts homogeneous directed network G3. Here, the arc (i1, i3) ∈ G3 denotes that nodes i1 and i3 are connected by an arc running from node i1 to node i3. Note that because (i3, i1) ∈ G3 there is also an arc running in the opposite direction, from i3 to i1. Also, note that in a homogeneous directed network, while connections have direction, all connections are of the same type – that is, connections are homogeneous. Finally, note that in a directed network loops are allowed. For example, (i1, i1) ∈ G3, and therefore in network G3 there is an arc running from node i1 to node i1.

The following definition, from [94], allows for heterogeneous, multiple arcs by labeling arcs using the set A of arc types.

Definition 4 (Heterogeneous Directed Networks) A heterogeneous directed network, G, is a subset of A × (N × N). Given any G ⊆ A × (N × N), each ordered pair (a, (i, i′)) ∈ G consisting of an arc type and an arc is
called a labeled arc in G. The collection of all labeled directed networks is denoted by P(A × (N × N)).

Thus, A × (N × N) is the set of all possible labeled arcs and a heterogeneous directed network G is simply a subset of all possible labeled arcs. For example, given N = {i1, i2, i3} and A = {a1, a2, a3}, the subset

G4 = {(a2, (i1, i1)), (a1, (i1, i2)), (a1, (i1, i3)), (a1, (i3, i1)), (a3, (i1, i3))}

of A × (N × N) is a heterogeneous directed network. Figure 4 depicts heterogeneous directed network G4.

Networks and Stability, Figure 4 Heterogeneous Directed Network G4

Here, the labeled arc (a1, (i1, i3)) ∈ G4 denotes that nodes i1 and i3 are connected by an arc of type a1 running from node i1 to node i3. Note that nodes i1 and i3 are also connected by an arc of type a3 running from node i1 to node i3. Thus, in addition to having direction, connections are heterogeneous. Also, note that arc type a1 is used three times in network G4: once in describing the connection running from i1 to i2, once in describing the connection running from i1 to i3, and once in describing the connection running from i3 to i1. Finally, note that loops are allowed. For example, (a2, (i1, i1)) ∈ G4, and therefore in network G4 there is an a2 arc running from node i1 to node i1.

Remarks (1) In the terminology of graph theory (e.g., see Bollobas [15]), a homogeneous linking network is called a graph, while a homogeneous directed network is called a directed graph.

(2) The following notation is useful in describing heterogeneous directed networks. Given heterogeneous directed network G ⊆ A × (N × N), let

G(a) := {(i, i′) ∈ N × N : (a, (i, i′)) ∈ G},
G(i, i′) := {a ∈ A : (a, (i, i′)) ∈ G},
G⁺(i) := {a ∈ A : (a, (i, i′)) ∈ G for some i′ ∈ N},
G⁻(i) := {a ∈ A : (a, (i′, i)) ∈ G for some i′ ∈ N},
G⁺(a, i) := {i′ ∈ N : (a, (i, i′)) ∈ G},
G⁻(a, i) := {i′ ∈ N : (a, (i′, i)) ∈ G}.

For example, referring to heterogeneous directed network G4 above (see Fig. 4),

G4(a1) = {(i1, i2), (i1, i3), (i3, i1)},
G4(i1, i3) = {a1, a3},
G4⁺(i1) = {a1, a2, a3},
G4⁻(i1) = {a1, a2},
G4⁺(a2, i1) = {i1},
G4⁻(a2, i1) = {i1}.

(3) The number |G⁺(i)| is the number of arc types leaving node i in network G, while the number |G⁺(a, i)| is the out-degree of node i for arc type a in network G. The number |G⁻(i)| is the number of arc types entering node i in network G, while the number |G⁻(a, i)| is the in-degree of node i for arc type a in network G. For directed network G4 we have, for example,

|G4⁺(i2)| = 0 and |G4⁻(i2)| = 1,
|G4⁺(a1, i1)| = 2 and |G4⁻(a1, i1)| = 1.
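The notation in Remark (2) and the degree counts in Remark (3) can be checked mechanically. A Python sketch, encoding G4 as a set of labeled arcs (a, (i, i′)); the function names merely mirror the notation and are otherwise arbitrary:

```python
# The heterogeneous directed network G4 of Figure 4, as a set of labeled arcs.
G4 = {
    ("a2", ("i1", "i1")),   # a loop at i1 (allowed in directed networks)
    ("a1", ("i1", "i2")),
    ("a1", ("i1", "i3")),
    ("a1", ("i3", "i1")),
    ("a3", ("i1", "i3")),
}

def G_of_a(G, a):           # G(a): arcs carrying label a
    return {arc for (b, arc) in G if b == a}

def G_of_pair(G, i, ip):    # G(i, i'): labels on the arc from i to i'
    return {b for (b, arc) in G if arc == (i, ip)}

def G_plus(G, i):           # G+(i): arc types leaving node i
    return {b for (b, (j, k)) in G if j == i}

def G_minus(G, i):          # G-(i): arc types entering node i
    return {b for (b, (j, k)) in G if k == i}

def G_plus_ai(G, a, i):     # G+(a, i): heads of a-arcs leaving i
    return {k for (b, (j, k)) in G if b == a and j == i}

def G_minus_ai(G, a, i):    # G-(a, i): tails of a-arcs entering i
    return {j for (b, (j, k)) in G if b == a and k == i}
```

Evaluating these on G4 reproduces the sets and degree counts listed in Remarks (2) and (3) above.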
(4) Rockafellar [101] essentially defines a network G to be a nonempty subset of A × (N × N) such that for all a ∈ A, G(a) ⊆ {(i, i′) ∈ N × N : i ≠ i′} and |G(a)| ≥ 1.

The Feasible Set

In the abstract games of network formation we develop here, we will assume that the game is played over some feasible set of networks G. Some examples of feasible sets are: the set of all homogeneous linking networks P(P₂(N)), as in Jackson–Wolinsky [64] and Jackson–van den Nouweland [61]; the set of all homogeneous directed networks P(N × N), as in Bala and Goyal [5]; and an arbitrary subset of the set of heterogeneous directed networks P(A × (N × N)), as in [91] and [93]. The following example is taken from Page and Wooders [92], where the feasible set is taken to be the set of all club networks – a particular class of heterogeneous directed networks.

Example 1 (Club Networks) Let D be a finite set of players with typical element d and let C be a finite set of clubs or club locations with typical element c. As before, let A be
a finite set of arc types. Finally, let N = D ∪ C be the set of nodes. We consider an abstract game of network formation played over the feasible set G ⊆ P(A × (N × N)) of club networks, where G ∈ G if and only if G is a nonempty subset of A × (D × C) such that (i) for all players d ∈ D, the set

G(d) := {(a, c) ∈ A × C : (a, (d, c)) ∈ G}

is nonempty, and (ii) for all (a, (d, c)) ∈ G, a ∈ A(d, c). Here, A(d, c) is the set of actions (represented by arc types) available to player d in club c. Given club network G ∈ G, (a, (d, c)) ∈ G means that in club network G player d is a member of club c and takes action a ∈ A(d, c) – or, in the terminology of directed networks, that in club network G there is an arc of type a running from node (player) d to node (club) c. Thus, in this example the feasible set is a set of bipartite directed networks. We remark that the basic model of club formation underlying this example has a long history in the literature, going back to economies with essentially homogeneous agents modeled as games in characteristic function form (Shubik (bridge game) [110]), and serves as an example of several models in the more recent literature on coalitional games (cf. Banerjee, Konishi and Sonmez [6], Bogomolnaia and Jackson [14], Diamantoudi and Xue [36]) and in economies with clubs (cf. Arnold and Wooders [2] and Allouch and Wooders [1]). As in Konishi, Le Breton and Weber [70] and Demange [33], for example, we allow “free entry” into clubs.

Paths and Circuits

A sequence of links {{i, i′}_k}_k in G ∈ G ⊆ P(P₂(N)) constitutes a path if each link {i, i′}_k has one node in common with the preceding link {i, i′}_{k−1} and the other node in common with the succeeding link {i, i′}_{k+1}. A circuit is a finite path {{i, i′}_k}_{k=1}^h in G which
Networks and Stability, Figure 5 Path and circuit in Heterogeneous Directed Network G4
begins at a node i and returns to the same node. The length of a path is the number of links in the path.

A sequence of labeled links {(a, {i, i′})_k}_k in G ∈ G ⊆ P(A × P₂(N)) constitutes a path if each labeled link (a, {i, i′})_k has one node in common with the preceding labeled link (a, {i, i′})_{k−1} and the other node in common with the succeeding labeled link (a, {i, i′})_{k+1}. A circuit is a finite path {(a, {i, i′})_k}_{k=1}^h in G which begins at a node i and ends at the same node. The length of a path is the number of labeled links in the path.

A sequence of arcs {(i, i′)_k}_k in G ∈ G ⊆ P(N × N) constitutes a path if the beginning node i of arc (i, i′)_k coincides with the ending node i′ of the preceding arc (i, i′)_{k−1}. A circuit is a finite path {(i, i′)_k}_{k=1}^h in G such that node i of arc (i, i′)_1 and node i′ of arc (i, i′)_h are the same node. The length of a path is the number of arcs in the path.

Finally, a sequence of labeled arcs {(a, (i, i′))_k}_k in G ∈ G ⊆ P(A × (N × N)) constitutes a path if the beginning node i of labeled arc (a, (i, i′))_k coincides with the ending node i′ of the preceding labeled arc (a, (i, i′))_{k−1}. A circuit is a finite path {(a, (i, i′))_k}_{k=1}^h in G such that node i of labeled arc (a, (i, i′))_1 and node i′ of labeled arc (a, (i, i′))_h are the same node. The length of a path is the number of labeled arcs in the path.

In Fig. 5, {(a1, (i3, i1))_1, (a2, (i1, i1))_2, (a1, (i1, i2))_3} is a path in G4 of length 3, while {(a1, (i3, i1))_1, (a2, (i1, i1))_2, (a3, (i1, i3))_3} is a circuit in G4 of length 3.

Players’ Preferences

For the remainder of this entry we will assume that the set of players is given by the set of nodes N. Thus, henceforth the nodes represent players in the game of network formation. Let Γ(N) denote the collection of all coalitions of players (i.e., nonempty subsets of N), with typical element denoted by S.
For each player i ∈ N, let ≻_i be an irreflexive binary relation on G (= P(P₂(N)) or P(N × N)) and write G′ ≻_i G if player i ∈ N prefers network G′ ∈ G to network G ∈ G. Because ≻_i is irreflexive, G ⊁_i G for all networks G ∈ G. Coalition S′ ∈ Γ(N) prefers network G′ to network G, written G′ ≻_{S′} G, if G′ ≻_i G for all players i ∈ S′. Note that because players’ preferences {≻_i}_{i∈N} are irreflexive, coalitional preferences {≻_S}_{S∈Γ(N)} are also irreflexive.

A Remark on Weak Preferences

Players are said to have weak preferences on G (= P(P₂(N)) or P(N × N)), denoted by ≽_i, if G′ ≽_i G means that player i either strongly prefers G′ to G (denoted G′ ≻_i G) or is indifferent between G′ and G (denoted G′ ∼_i G). If coalitional preferences are based on weak preferences, then we say that coalition S′ ∈ Γ(N) weakly prefers network G′ to network G, written G′ ≻ʷ_{S′} G, if G′ ≽_i G for all players i ∈ S′ and G′ ≻_{i′} G for at least one player i′ ∈ S′. Note that if preferences are weak and G′ ≻ʷ_{S′} G, where S′ consists of a single player i′, so that S′ = {i′}, then G′ ≻_{i′} G. Finally, note that weak coalitional preferences {≻ʷ_S}_{S∈Γ(N)} are irreflexive (i.e., G ⊁ʷ_S G for all G ∈ G and S ∈ Γ(N)).

Network Payoff Functions

In many applications, players’ preferences are specified via real-valued network payoff functions {v_i(·)}_{i∈N}. If this is the case, then for each player i ∈ N and each network G ∈ G, v_i(G) is the payoff to player i in network G. Note that the payoff v_i(G) to player i in network G depends on the entire network. Thus, a player may be affected by connections between other players even when he himself has no direct or indirect connection with those players. Intuitively, ‘widespread’ network externalities are allowed. Given payoff functions {v_i(·)}_{i∈N}, player i prefers network G′ to network G if v_i(G′) > v_i(G).
Coalitional preferences can then be specified by stating that coalition S′ ∈ Γ(N) prefers network G′ to network G if v_i(G′) > v_i(G) for all i ∈ S′.

Preference Supernetworks

By viewing each network G in feasible set G as a node in a larger network, we can represent coalitional preferences as a heterogeneous directed network. To begin, let

P := {p_S : S ∈ Γ(N)}
denote the set of arc labels for preference arcs (or p-arcs for short).

Definition 5 (Coalitional Preference Supernetworks, [94]) Given feasible set G (= P(P₂(N)) or P(N × N)), a coalitional preference supernetwork P is a subset of P × (G × G) such that (p_{S′}, (G, G′)) is contained in P if and only if G′ ≻_{S′} G.
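As an illustration, the following Python sketch derives individual and coalitional preferences from payoff functions and then builds the coalitional preference supernetwork of Definition 5 by enumeration. The network labels, payoff values, and the encoding of a p-arc label p_S as the pair ("p", S) are all hypothetical choices for this sketch:

```python
# Hypothetical payoffs v_i(G) over a toy feasible set of three networks.
networks = ["Ga", "Gb", "Gc"]
v = {"i1": {"Ga": 1, "Gb": 2, "Gc": 0},
     "i2": {"Ga": 1, "Gb": 3, "Gc": 2}}

def prefers(i, Gp, G):
    """G' >-_i G  iff  v_i(G') > v_i(G); irreflexive because > is strict."""
    return v[i][Gp] > v[i][G]

def coalition_prefers(S, Gp, G):
    """G' >-_S G  iff every member of S strictly prefers G' to G."""
    return all(prefers(i, Gp, G) for i in S)

coalitions = [frozenset({"i1"}), frozenset({"i2"}), frozenset({"i1", "i2"})]

# The preference supernetwork: a p-arc (p_S, (G, G')) whenever G' >-_S G.
P = {(("p", S), (G, Gp))
     for S in coalitions for G in networks for Gp in networks
     if coalition_prefers(S, Gp, G)}
```

Because the underlying payoff comparison is strict, no p-arc of the form (p_S, (G, G)) ever appears, matching the irreflexivity of coalitional preferences noted above.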
The Rules of Network Formation

The rules of network formation are specified via a collection of coalitional effectiveness relations {→_S}_{S∈Γ(N)} defined on the feasible set of networks G (= P(P₂(N)) or P(N × N)). Each effectiveness relation →_S represents what a coalition S can do. Thus, if G →_S G′, this means that under the rules of network formation coalition S ∈ Γ(N) can change network G ∈ G to network G′ ∈ G by adding, subtracting, or replacing connections in G (where, depending on the feasible set, a connection is a link or an arc).

Examples of Network Formation Rules

Jackson–Wolinsky Rules [64] (Bilateral-Unilateral Rules) Assume that the feasible set of networks G is equal to the set of homogeneous linking networks P(P₂(N)). Under the Jackson–Wolinsky rules of network formation (see [64]),

(i) adding a link from player i to player i′ requires that both players i and i′ agree to add the link (i.e., link addition is bilateral);
(ii) subtracting a link from player i to player i′ requires that player i or player i′ or both agree to subtract the link (i.e., link subtraction can be unilateral);
(iii) link addition or link subtraction takes place one link at a time.
Thus, for any pair of networks G and G′ in G, if G →_S G′ and G ≠ G′, then either

G′ = G ∪ {i, i′} and S = {i, i′} for some {i, i′} ∈ P₂(N),

or

G′ = G \ {i, i′} for some {i, i′} ∈ G and S = {i} or S = {i′} or S = {i, i′}.

To illustrate, consider Fig. 6, depicting two homogeneous linking networks G and G′. Observe that

G′ = G ∪ {i2, i3} and G = G′ \ {i2, i3}.

Under the effectiveness relations implied by the Jackson–Wolinsky rules, for networks G and G′ we have

G →_{i2,i3} G′, G′ →_{i2,i3} G, G′ →_{i2} G, G′ →_{i3} G.
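The Jackson–Wolinsky effectiveness relation can be sketched as a predicate on (coalition, network, network) triples; the networks G and G′ below are those of Figure 6, and the function name is an illustrative choice:

```python
# Jackson-Wolinsky effectiveness: coalition S can change G into G' only one
# link at a time -- addition needs both endpoints of the link in S, deletion
# needs at least one endpoint in S.
def jw_effective(S, G, Gp):
    if G == Gp:
        return False
    added, removed = Gp - G, G - Gp
    if len(added) == 1 and not removed:   # bilateral link addition
        (link,) = added
        return set(link) <= set(S)
    if len(removed) == 1 and not added:   # unilateral link deletion
        (link,) = removed
        return bool(set(link) & set(S))
    return False                          # more than one link changed

# The networks of Figure 6: G' = G plus the link {i2, i3}.
l12, l13, l23 = (frozenset({"i1", "i2"}),
                 frozenset({"i1", "i3"}),
                 frozenset({"i2", "i3"}))
G = {l12, l13}
Gp = {l12, l13, l23}
```

Evaluating the predicate on G and G′ reproduces exactly the four effectiveness relations displayed above.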
Jackson–van den Nouweland Rules [61] (Bilateral-Unilateral Rules) Again assume that the feasible set of networks G is equal to the set of homogeneous linking networks P(P₂(N)). Under the Jackson–van den Nouweland rules of network formation,
Networks and Stability, Figure 6 a Network G. b Network G′
(i) adding a link from player i to player i′ requires that both players i and i′ agree to add the link (i.e., link addition is bilateral);
(ii) subtracting a link from player i to player i′ requires that player i or player i′ or both agree to subtract the link (i.e., link subtraction can be unilateral).

Thus, the Jackson–van den Nouweland rules are the Jackson–Wolinsky rules without the one-link-at-a-time restriction. Note that if link addition is bilateral and link subtraction is unilateral (i.e., if rules (i) and (ii) hold), then G →_S G″ and G ≠ G″ implies that

(i) if {i, i′} ∈ G″ and {i, i′} ∉ G, then {i, i′} ⊆ S;

and

(ii) if {i, i′} ∉ G″ and {i, i′} ∈ G, then {i, i′} ∩ S ≠ ∅.

To illustrate, consider Fig. 7, depicting two homogeneous linking networks G and G″.

Networks and Stability, Figure 7 a Network G. b Network G″

Observe that

G″ = (G \ {i1, i3}) ∪ {i2, i3} and G = (G″ \ {i2, i3}) ∪ {i1, i3}.

Under the effectiveness relations implied by the Jackson–van den Nouweland rules, for networks G and G″ we have

G →_{i2,i3} G″, G →_{i1,i2,i3} G″, G″ →_{i1,i3} G, G″ →_{i1,i2,i3} G.
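Conditions (i) and (ii) translate directly into a membership test for the Jackson–van den Nouweland effectiveness relation. A sketch that treats the two conditions as a characterization (the text states them as implications), with G and G″ read off Figure 7 as described above:

```python
# Jackson-van den Nouweland effectiveness: any number of links may change at
# once; every added link needs both endpoints in S (condition (i)), every
# deleted link needs at least one endpoint in S (condition (ii)).
def jvdn_effective(S, G, Gpp):
    if G == Gpp:
        return False
    S = set(S)
    return (all(set(link) <= S for link in Gpp - G) and
            all(set(link) & S for link in G - Gpp))

# Networks of Figure 7: G'' = (G minus {i1,i3}) plus {i2,i3}.
l12, l13, l23 = (frozenset({"i1", "i2"}),
                 frozenset({"i1", "i3"}),
                 frozenset({"i2", "i3"}))
G = {l12, l13}
Gpp = {l12, l23}
```

Evaluating the predicate reproduces the four relations displayed above, and confirms, for example, that coalition {i1, i2} cannot change G into G″ (it lacks i3's consent to add {i2, i3}).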
Note that under the one-link-at-a-time restriction, it is not possible under the Jackson–Wolinsky rules to move directly from network G to network G″ or directly from network G″ to network G (i.e., G and G″ are not related under the effectiveness relations {→_S}_{S∈Γ(N)}). Instead, under the Jackson–Wolinsky rules, the change from G to G″
or from G″ to G requires two moves. For example, first

G →_{i2,i3} G′, and then G′ →_{i3} G″ or G′ →_{i1} G″;

or first

G″ →_{i1,i3} G′, and then G′ →_{i3} G or G′ →_{i2} G.

Bala–Goyal Rules [5] (Noncooperative Rules – Unilateral-Unilateral Rules) Now assume that the feasible set of networks is equal to the set of homogeneous directed networks P(N × N). Translating the Bala and Goyal rules into our notation and terminology,

(i) adding an arc from player i to player i′ requires only that player i agree to add the arc (i.e., arc addition is unilateral and can be carried out only by the initiator, player i);
(ii) subtracting an arc from player i to player i′ requires only that player i agree to subtract the arc (i.e., arc subtraction is unilateral and can be carried out only by the initiator, player i);
(iii) G →_S G′ implies that |S| = 1 (i.e., only network changes brought about by individual players are allowed).

We shall also refer to rules (i)–(iii) as noncooperative. Note that a player i can add or subtract an arc to player i′ without regard to the preferences of player i′, and can add and/or subtract arcs to several players simultaneously, again without regard to those players’ preferences. Thus, in general, under noncooperative rules effectiveness relations display a type of symmetry: if G →_{i} G′, then G′ →_{i} G.

To illustrate, consider Fig. 8, depicting three homogeneous directed networks G, G′, and G″.

Networks and Stability, Figure 8 a Network G. b Network G′. c Network G″

Under the effectiveness relations implied by noncooperative rules, for networks G and G′ in Fig. 8 we have

G →_{i1} G′ and G′ →_{i1} G.

Note that under noncooperative rules, networks G and G″ in Fig. 8 are not related under the effectiveness relations {→_{i}}_{i∈N}. However, under the noncooperative rules we have, for example, the following effectiveness relations:

G →_{i1} G′, G′ →_{i2} G″ and G″ →_{i2} G′, G′ →_{i1} G.

Rules Supernetworks

Again by viewing each network G in feasible set G as a node in a larger network, we can represent the rules of network formation as a heterogeneous directed network. To begin, let

M := {m_S : S ∈ Γ(N)}
denote the set of arc labels for move arcs (or m-arcs for short).

Definition 6 (Rules Supernetworks, [94]) Given feasible set G (= P(P₂(N)) or P(N × N)), a rules supernetwork R_r is a subset of M × (G × G) such that (m_{S′}, (G, G′)) is contained in R_r if and only if G →_{S′} G′, where r denotes the name of the network formation rules in force. We shall adopt the convention that r = jw if the rules are Jackson–Wolinsky, r = jn if the rules are Jackson–van den Nouweland, and r = bg if the rules are Bala–Goyal or noncooperative.
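A toy rules supernetwork under Bala–Goyal (noncooperative) rules, for two players and three feasible homogeneous directed networks. The encoding of the m-arc label m_i as the pair ("m", i), and the particular feasible set, are illustrative choices for this sketch:

```python
# Bala-Goyal (noncooperative) effectiveness on N = {1, 2}, arcs as ordered
# pairs (i, j): a single player i is effective for a change exactly when
# every added or deleted arc is one that i initiates, i.e. has the form (i, j).
def bg_effective(i, G, Gp):
    changed = (G - Gp) | (Gp - G)
    return G != Gp and all(arc[0] == i for arc in changed)

G_empty = frozenset()
G_12 = frozenset({(1, 2)})
G_both = frozenset({(1, 2), (2, 1)})
networks = [G_empty, G_12, G_both]

# The rules supernetwork: an m-arc (m_i, (G, G')) exactly when G ->_i G'.
R = {(("m", i), (G, Gp))
     for i in (1, 2) for G in networks for Gp in networks
     if bg_effective(i, G, Gp)}
```

The construction exhibits the symmetry noted for noncooperative rules (each m-arc has a reverse m-arc for the same player), and also shows that no single player can move directly between G_empty and G_both, since that change involves arcs initiated by both players.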
Supernetworks

Given feasible set G (= P(P₂(N)) or P(N × N)), coalitional preferences {≻_S}_{S∈Γ(N)} and coalitional effectiveness relations {→_S}_{S∈Γ(N)} can be represented by a heterogeneous directed network called a supernetwork (see [89] and [94]). In particular, given preference supernetwork P and rules supernetwork R_r, the corresponding supernetwork is given by

G_r := P ∪ R_r.

Letting A := P ∪ M (i.e., the union of all preference arc labels and move arc labels), then

G_r ⊆ A × (G × G).

Dominance Relations

Direct Dominance Given feasible set G (= P(P₂(N)) or P(N × N)), coalitional preferences {≻_S}_{S∈Γ(N)}, and coalitional effectiveness relations {→_S}_{S∈Γ(N)}, network G′ ∈ G directly dominates network G ∈ G, written G′ ⊳ G, if for some coalition S′ ∈ Γ(N),

G′ ≻_{S′} G and G →_{S′} G′.

Thus, network G′ directly dominates network G if some coalition S′ prefers G′ to G and if under the rules of network formation coalition S′ has the power to change G to G′. Note that direct dominance is irreflexive but not in general transitive. Also note that if G_r is the supernetwork, then G′ ⊳ G if and only if (p_{S′}, (G, G′)) ∈ G_r and (m_{S′}, (G, G′)) ∈ G_r for some coalition S′.

Indirect Dominance Given feasible set G (= P(P₂(N)) or P(N × N)), coalitional preferences {≻_S}_{S∈Γ(N)}, and coalitional effectiveness relations {→_S}_{S∈Γ(N)}, network G′ ∈ G indirectly dominates network G ∈ G, written G′ ⊳⊳ G, if there is a finite sequence of networks,

G0, G1, ..., Gh,

with G = G0, G′ = Gh, and Gk ∈ G for k = 0, 1, ..., h, and a corresponding sequence of coalitions,

S1, S2, ..., Sh,

such that for k = 1, 2, ..., h,

G_{k−1} →_{S_k} G_k and G_h ≻_{S_k} G_{k−1}.

Note that if network G′ indirectly dominates network G (i.e., if G′ ⊳⊳ G), then what matters to the initially deviating coalition S1, as well as to all the coalitions along the way, is that the ultimate network outcome G′ = Gh be preferred. Thus, for example, the initially deviating coalition S1 will not be deterred from changing network G0 to network G1 even if network G1 is not preferred to network G = G0, as long as the ultimate network outcome G′ = Gh is preferred to G0, that is, as long as Gh ≻_{S1} G0. Finally, note that indirect dominance is irreflexive but not in general transitive. In order to capture the idea of farsightedness in strategic behavior, Chwe [26] analyzed abstract games equipped with indirect dominance relations in great detail, introducing the equilibrium notions of consistency and the largest consistent set. The basic idea of indirect dominance goes back to the work of Guilbaud [53] and Harsanyi [54].

Given the supernetwork representation of preferences and rules, G_r, we can write G′ ⊳⊳ G if there is a finite sequence of networks, G0, G1, ..., Gh, with G = G0, G′ = Gh, and Gk ∈ G for k = 0, 1, ..., h, and a corresponding sequence of coalitions, S1, S2, ..., Sh, such that for k = 1, 2, ..., h,

(m_{S_k}, (G_{k−1}, G_k)) ∈ G_r and (p_{S_k}, (G_{k−1}, G_h)) ∈ G_r.

Path Dominance Any irreflexive dominance relation > on G (= P(P₂(N)) or P(N × N)) – for example, direct or indirect dominance – induces a path dominance relation on the set of networks (sometimes referred to as the transitive closure of >). In particular, corresponding to dominance relation > on networks G there is a path dominance relation >^p on G specified as follows: network G′ ∈ G path dominates network G ∈ G with respect to > (i.e., with respect to the underlying dominance relation >), written G′ >^p G, if G′ = G or if there exists a finite sequence of networks {G_k}_{k=0}^h in G with G_h = G′ and G_0 = G such that for k = 1, 2, ..., h,

G_k > G_{k−1}.

We refer to such a finite sequence of networks as a finite domination path, and we say network G′ is >-reachable from network G if there exists a finite domination path
from G to G′. Thus,

G′ >^p G if and only if G′ is >-reachable from G, or G′ = G.
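Whether G′ path dominates G is thus a plain reachability question; a breadth-first-search sketch, with the dominance relation encoded as a set of arcs (G, G′) meaning G′ > G (an illustrative encoding, with hypothetical network labels in the usage below):

```python
from collections import deque

# Path dominance as reflexive-transitive reachability: a finite domination
# path from G to G' is a directed path of >-arcs (x, y), each meaning y > x.
def path_dominates(arcs, Gp, G):
    if Gp == G:                     # path dominance is reflexive by definition
        return True
    succ = {}
    for (x, y) in arcs:             # y is one >-step from x
        succ.setdefault(x, set()).add(y)
    seen, queue = {G}, deque([G])
    while queue:                    # breadth-first search outward from G
        x = queue.popleft()
        for y in succ.get(x, ()):
            if y == Gp:
                return True
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False
```

For example, with arcs {("G1", "G6"), ("G6", "G2"), ("G2", "G3")}, network "G3" path dominates "G1" but not conversely.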
Note that, even though the underlying dominance relation > is irreflexive and may or may not be transitive, the induced path dominance relation >^p on G is both reflexive (G >^p G) and transitive (G′ >^p G and G″ >^p G′ imply G″ >^p G).

>-Supernetworks

Let > denote the irreflexive dominance relation on G. It is often useful to represent > as a homogeneous directed network, D_>, where D_> is a subset of G × G and where the >-arc (G, G′) is in D_> if and only if G′ > G (i.e., if and only if G′ >-dominates G). We call such a homogeneous directed network (or directed graph) a >-supernetwork. For example, suppose

G = {G0, G1, G2, ..., G7},

and suppose the dominance relation > on G is a direct dominance relation with the supernetwork representation given in Fig. 9. Note that network G5 is >-reachable through D_> from network G1 by the domination path given by the >-arc sequence

{(G1, G6)_1, (G6, G2)_2, (G2, G3)_3, (G3, G4)_4, (G4, G5)_5}.

Thus, G5 path dominates G1. Note that network G2 is >-reachable from network G2 by the domination circuit given by the >-arc sequence

{(G2, G3)_1, (G3, G4)_2, (G4, G5)_3, (G5, G2)_4},
and that network G3 is >-reachable from network G3 by two domination circuits, given by the >-arc sequences

{(G3, G4)_1, (G4, G5)_2, (G5, G2)_3, (G2, G3)_4} and {(G3, G4)_1, (G4, G3)_2}.

Because networks G2 and G5 are on the same circuit, G5 is >-reachable from G2 and G2 is >-reachable from G5. Thus, G5 path dominates G2 (i.e., G5 >^p G2) and G2 path dominates G5 (i.e., G2 >^p G5). The same cannot be said of networks G1 and G5. In particular, while G5 >^p G1, it is not true that G1 >^p G5, because G1 is not >-reachable from G5. Finally, note that network G0 is isolated in D_>. In particular, G0 is not reachable through D_> from any network in G and no network in G is reachable through D_> from G0. In general, a network G ∈ G is isolated if there does not exist a network G′ ∈ G, G′ ≠ G, with G′ >^p G or G >^p G′.

Note that if the direct dominance relation with the >-supernetwork depicted in Fig. 9 has underlying coalitional preferences {≻_S}_{S∈Γ(N)} and coalitional effectiveness relations {→_S}_{S∈Γ(N)}, then the >-arc from network G3 to network G4 in Fig. 9 means that for some coalition S, G4 is preferred to G3 and, more importantly, that coalition S has the power to change network G3 to network G4. Thus, G4 ≻_S G3 and G3 →_S G4. But because there is a >-arc in the opposite direction, from network G4 to network G3, G3 also directly dominates G4. Thus, for some coalition S′ disjoint from coalition S (S′ ∩ S = ∅), G3 ≻_{S′} G4 and G4 →_{S′} G3. Finally, note that if coalitional preferences over networks are weak (i.e., are based on weak preferences), then the statement ‘for some coalition S′ disjoint from coalition S’ can be weakened to ‘for some coalition S′ not equal to coalition S’. With this weakening, the requirement that the intersection of S and S′ be empty is no longer needed.

Abstract Games of Network Formation and Stability
Networks and Stability, Figure 9 >-supernetwork D_>
An abstract game of network formation consists of a feasible set of networks equipped with a dominance relation. We shall consider two classes of games: (i) games where the feasible set of networks G (= P(P₂(N)) or P(N × N)) is equipped with an irreflexive dominance relation >, either direct or indirect (i.e., > equal to ⊳ or ⊳⊳), induced by coalitional preferences and network formation rules; and (ii) games where the feasible set of networks is equipped with the path dominance relation >^p induced by such an irreflexive dominance relation.
Network Formation Games with Respect to Irreflexive Dominance

In this section we consider the abstract game with respect to irreflexive dominance given by the pair

(G, >).

Throughout this section we will assume that the primitives are represented by supernetwork G_r := P ∪ R_r (where r is equal to jw, jn, or bg) and >-supernetwork D_> (where > is equal to ⊳ or ⊳⊳), and that the feasible set of networks G is equal to the set of homogeneous linking networks P(P₂(N)) or the set of homogeneous directed networks P(N × N).

Quasi-Stability and Stability

We define the >-distance from G0 to G1 in D_> to be the length of the shortest >-path from G0 to G1 if G1 is >-reachable from G0 in D_>, and +∞ if G1 is not reachable from G0 in D_>. We denote the distance from G0 to G1 in D_> by d_{D_>}(G0, G1). Thus,

d_{D_>}(G0, G1) := the length of the shortest >-path from G0 to G1 in D_>, if G1 is reachable from G0 in D_>; and d_{D_>}(G0, G1) := +∞, if G1 is not reachable from G0 in D_>.

The following are network renditions of quasi-stable and stable sets.

Definition 7 (Quasi-Stable Sets and Stable Sets, [8,25])

(1) A subset Q of networks in G is said to be quasi-stable for network formation game (G, >) if
(a) Q is internally stable, that is, d_{D_>}(G0, G1) ≥ 2 whenever G0 and G1 are in Q with G0 ≠ G1, and
(b) Q is externally quasi-stable, that is, given any G0 ∉ Q, there exists G1 ∈ Q with d_{D_>}(G0, G1) ≤ 2.

(2) A subset S of networks in G is said to be stable for network formation game (G, >) if
(a) S is internally stable, and
(b) S is externally stable, that is, given any G0 ∉ S, there exists G1 ∈ S with d_{D_>}(G0, G1) ≤ 1.

Thus, if a set is externally quasi-stable, a path of length at most 2 is required to get from any network outside of the set to a network in the set, whereas if a set is externally stable a path of length at most 1 is required. Letting

P_>(G0) := {G ∈ G : G > G0},
an alternative way to write part (2) of the definition above is as follows:

(2)′ A subset S of networks in G is said to be stable if (a) (internal stability) G ∈ S implies that P_>(G) ∩ S = ∅, and (b) (external stability) G ∉ S implies that P_>(G) ∩ S ≠ ∅.

If S is stable (or quasi-stable), then it is automatically nonempty. Note that a stable set S is simply a von Neumann–Morgenstern stable set with respect to the dominance relation > defined on G. Also, note that if S is stable then it is automatically quasi-stable.

We now state a remarkably simple result on the existence of quasi-stable sets. This result is a network rendition of a general result due to Chvatal and Lovasz [25] on the existence of quasi-stable sets in directed graphs (here the directed graph is the >-supernetwork D_>; see also Galeana-Sanchez and Li [42]).

Theorem 1 (Existence of quasi-stable sets for network formation games, [89]) There exists a quasi-stable set Q for network formation game (G, >).

In fact, it follows from the theorem due to Chvatal and Lovasz [25] that any finite set Z equipped with an irreflexive binary relation > has a >-quasi-stable set. Moreover, if the relation > is transitive, then any >-quasi-stable set is >-stable.

Next we state two results on the existence of stable sets.

Theorem 2 (Existence of stable sets for network formation games, [89])

(1) If D_> contains no >-circuits, then there exists a unique stable set S for (G, >).
(2) If D_> contains no >-circuits of odd length, then there exists a stable set S for (G, >).

Part (1) of Theorem 2 is an immediate consequence of a 1958 result due to Berge (see Theorem 4, p. 48 in Berge [8]). Part (2) of Theorem 2 is a supernetwork version of the classical result due to Richardson [100]. A >-circuit in supernetwork D_> is said to be of odd length if there is an odd number of connections in the circuit.

Farsighted Consistency

Chwe [26], in an influential paper, introduced the notion of farsighted consistency in an abstract game.
The following is a definition of farsighted consistency for abstract games of network formation. Definition 8 (Farsighted Consistency, [89]) A subset F of networks in G is said to be farsightedly consistent for
network formation game (G, ≫) if, for all G0 ∈ F, (m_{S1}, (G0, G1)) ∈ G□ for some G1 ∈ G and some coalition S1 implies that there exists G2 ∈ F, with G2 = G1 or G2 ≫ G1, such that (p_{S1}, (G0, G2)) ∉ G□.

In words, a subset of directed networks F is said to be farsightedly consistent if, given any network G0 ∈ F and any m_{S1}-deviation to network G1 ∈ G by coalition S1 (via adding, subtracting, or replacing arcs in accordance with R□), there exist further deviations leading to some network G2 ∈ F where the initially deviating coalition S1 is not better off, and possibly worse off. A network G ∈ G is said to be farsightedly consistent if G ∈ F, where F is a farsightedly consistent set.

If (i) G = P(P2(N)), (ii) coalitional preferences are weak, denoted by {≻^w_S}, so that indirect dominance is weak (denoted by ≫^w and defined in the obvious way), and (iii) coalitional effectiveness relations are determined by Jackson–Wolinsky rules, then the notion of farsighted consistency above (essentially due to Chwe [26]) is closely related to the notion of pairwise farsighted stability introduced in Herings, Mauleon, and Vannetelbosch [55].

For any game (G, ≫), there can be many farsightedly consistent sets. We shall denote by F* the largest farsightedly consistent set; thus, if F is farsightedly consistent, then F ⊆ F*. Unlike quasi-stable sets and stable sets, where existence implies nonemptiness, in considering farsightedly consistent sets two critical questions arise: (i) does there exist a largest farsightedly consistent set of networks for (G, ≫), and (ii) is it nonempty? Our next result provides a positive answer to both questions.

Theorem 3 (Existence, Uniqueness, and Nonemptiness of F*, [94]) There exists a unique, nonempty largest farsightedly consistent set F* for network formation game (G, ≫). Moreover, F* is externally stable; that is, if network G is not contained in F*, then there exists a network G′ contained in F* that indirectly dominates G (i.e., G′ ≫ G).

The method of proving existence and uniqueness is a straightforward supernetwork rendition of Chwe's [26] method and is similar to the method introduced by Roth [103,104]. Page and Kamat [89] provide an alternative proof (to that of Chwe and of [94]) of the nonemptiness and external stability of the largest consistent set (with respect to indirect dominance). In particular, Page and Kamat modify the indirect dominance relation so as to make it transitive as well as irreflexive. They then show that the unique stable set with respect to the path dominance relation induced by this new transitive indirect dominance relation is contained in the largest farsightedly consistent set, and in this way they show that the largest farsightedly consistent set is nonempty and externally stable.
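For finite games, Chwe's characterization suggests a simple fixed-point computation: start from the full set of networks and repeatedly delete any network from which some coalition has an undeterred deviation. The following Python sketch is ours and purely illustrative; the three-network game, its move triples, its payoffs, and the tabulated indirect dominance relation are all hypothetical, not taken from the text.

```python
def largest_consistent_set(networks, moves, indirectly_dominates, prefers):
    """Iteratively remove networks that fail Chwe-style consistency.

    moves: set of (S, G0, G1) triples, coalition S can change G0 into G1.
    indirectly_dominates(G2, G1): True if G2 indirectly dominates G1.
    prefers(S, G0, G2): True if coalition S prefers G2 to G0.
    """
    F = set(networks)
    changed = True
    while changed:
        changed = False
        for G0 in list(F):
            for (S, start, G1) in moves:
                if start != G0:
                    continue
                # the deviation must be deterred by some end network in F
                deterred = any((G2 == G1 or indirectly_dominates(G2, G1))
                               and not prefers(S, G0, G2)
                               for G2 in F)
                if not deterred:
                    F.discard(G0)
                    changed = True
                    break
    return F

# hypothetical 3-network game: one player "i" who can move G0 -> G1 -> G2
v = {"G0": 1, "G1": 0, "G2": 2}                    # i's payoffs (hypothetical)
moves = {(frozenset({"i"}), "G0", "G1"),
         (frozenset({"i"}), "G1", "G2")}
dom = {("G2", "G1"), ("G2", "G0")}                 # indirect dominance pairs
ind = lambda g2, g1: (g2, g1) in dom
pref = lambda S, a, b: v[b] > v[a]                 # single-player coalitions

print(largest_consistent_set({"G0", "G1", "G2"}, moves, ind, pref))  # {'G2'}
```

In this toy game G1 is eliminated first (its move to G2 is undeterred), after which G0's move to G1 is no longer deterred either: farsighted play unravels to G2, even though the first step of the chain makes the player temporarily worse off.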
Network Formation Games with Respect to Path Dominance

In this section we consider the network formation game with respect to path dominance given by the pair (G, ≥^p). Throughout this section we will assume that the underlying primitives,

(G, {≻_S}, {→_S}, >),

are such that G is equal to the set of homogeneous linking networks P(P2(N)) or the set of homogeneous directed networks P(N × N), that the dominance relation > on G is given by either direct dominance ▷ or indirect dominance ≫, and that primitives are represented by supernetwork G□ := P ∪ R□ (with □ equal to jw, jn, or bg) and >-supernetwork D_>. We will present three notions of stability introduced in [91] and [93] for abstract games of network formation with respect to path dominance over heterogeneous directed networks: (i) strategic basins of attraction, (ii) path dominance stable sets, and (iii) the path dominance core.

Preliminaries

Networks Without Descendants

If G1 ≥^p G0 and G0 ≥^p G1, networks G1 and G0 are equivalent, written G1 ≈^p G0. If networks G1 and G0 are equivalent, then either G1 and G0 coincide or G1 and G0 are on the same circuit (see Fig. 9 above for a picture of a circuit). If G1 ≥^p G0 but G1 and G0 are not equivalent (i.e., not G1 ≈^p G0), then network G1 is a descendant of network G0, and we write G1 >^p G0. Referring to Fig. 9, observe that network G5 is a descendant of network G1, that is, G5 >^p G1. Network G′ ∈ G has no descendants in G if for any network G ∈ G, G ≥^p G′ implies that G ≈^p G′.
Thus, if G′ has no descendants, then G ≥^p G′ implies that G and G′ coincide or lie on the same circuit. Note that any isolated network is by definition a network without descendants (e.g., network G0 in Fig. 9). In attempting to identify and characterize stable homogeneous networks, networks without descendants are of particular interest. Here is our main result concerning networks without descendants.

Theorem 4 (All path dominance network formation games have networks without descendants, [91,93]) In network formation game (G, ≥^p), every network G ∈ G is path dominated by a network G′ ∈ G without descendants (i.e., G′ ≥^p G and G′ has no descendants).

By Theorem 4, in network formation game (G, ≥^p), corresponding to any network G ∈ G there is a network G′ ∈ G without descendants which is >-reachable from G. Thus, in any network formation game the set of networks without descendants is nonempty. Referring to Fig. 9, the set of networks without descendants is given by {G0, G2, G3, G4, G5, G7}. We shall denote by Z the set of networks without descendants.

Basins of Attraction

Stated loosely, a basin of attraction is a set of equivalent networks to which the strategic network formation process represented by the game might tend and from which there is no escape. Formally, we have the following definition.

Definition 9 (Basin of Attraction, [91,93]) A set of networks A ⊆ G is said to be a basin of attraction for (G, ≥^p) if (a) the networks contained in A are equivalent (i.e., for all G′ and G in A, G′ ≈^p G), and for no set A′ having A as a strict subset is it true that all the networks in A′ are equivalent, and (b) no network in A has descendants (i.e., there does not exist a network G′ ∈ G such that G′ >^p G for some G ∈ A). A is a strict subset of A′ if A ⊆ A′ and A′ \ A ≠ ∅.

As the following characterization result shows, there is a very close connection between networks without descendants and basins of attraction.
Theorem 5 (A characterization of basins of attraction, [91,93]) Let A be a subset of networks in G. The following statements are equivalent: (1) A is a basin of attraction for (G, ≥^p). (2) There exists a network without descendants, G ∈ Z, such that A = {G′ ∈ Z : G′ ≈^p G}.

In light of Theorem 5, we conclude that in any network formation game (G, ≥^p), G contains a unique, finite, disjoint collection of basins of attraction, say {A1, A2, ..., Am}, where for each k = 1, 2, ..., m (m ≥ 1), A_k = A_G := {G′ ∈ Z : G′ ≈^p G} for some network G ∈ Z. Note that for networks G′ and G in Z such that G′ ≈^p G, A_{G′} = A_G (i.e., the basins of attraction A_{G′} and A_G coincide). Also, note that if network G ∈ G is isolated, then G ∈ Z and A_G := {G′ ∈ Z : G′ ≈^p G} = {G} is, by definition, a basin of attraction, albeit a very uninteresting one.

Example 2 (Basins of Attraction) In Fig. 9 above the set of networks without descendants is given by Z = {G0, G2, G3, G4, G5, G7}. Even though there are six networks without descendants, because networks G2, G3, G4, and G5 are equivalent, there are only three basins of attraction: A1 = {G0}, A2 = {G2, G3, G4, G5}, and A3 = {G7}. Moreover, because G2, G3, G4, and G5 are equivalent, A_{G2} = A_{G3} = A_{G4} = A_{G5} = {G2, G3, G4, G5}.

Stable Sets with Respect to Path Dominance

The formal definition of a ≥^p-stable set is as follows.
Definition 10 (Stable Sets with Respect to Path Dominance, [91,93]) A subset V of networks in G is said to be a stable set for (G, ≥^p) if (a) (internal ≥^p-stability) whenever G0 and G1 are in V, with G0 ≠ G1, neither G1 ≥^p G0 nor G0 ≥^p G1 holds, and (b) (external ≥^p-stability) for any G0 ∉ V there exists G1 ∈ V such that G1 ≥^p G0.
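Definition 10 is mechanical to check once the path dominance relation has been tabulated. A minimal Python sketch; the transitively closed chain of networks below is hypothetical, not an example from the text.

```python
def is_stable_set(V, nodes, path_dominates):
    """Check internal and external stability with respect to path dominance.

    path_dominates: set of pairs (G1, G0) meaning G1 path dominates G0,
    with G1 != G0 (the relation is assumed transitively closed).
    """
    dom = lambda a, b: (a, b) in path_dominates
    # internal: no member of V path dominates another member
    internal = all(not dom(a, b) and not dom(b, a)
                   for a in V for b in V if a != b)
    # external: every outside network is path dominated by some member
    external = all(any(dom(a, b) for a in V) for b in nodes - V)
    return internal and external

# hypothetical chain: G3 dominates G2 dominates G1 dominates G0
nodes = {"G0", "G1", "G2", "G3"}
rel = {("G3", "G2"), ("G3", "G1"), ("G3", "G0"),
       ("G2", "G1"), ("G2", "G0"), ("G1", "G0")}

print(is_stable_set({"G3"}, nodes, rel))        # True: G3 reaches every other network
print(is_stable_set({"G3", "G0"}, nodes, rel))  # False: internal stability fails
```

Because the chain has a single network without descendants (G3), the unique stable set here is the singleton {G3}, in line with the construction given below.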
In other words, a nonempty subset of networks V is a stable set for (G, ≥^p) if, whenever G0 and G1 are in V with G0 ≠ G1, then G1 is not >-reachable from G0, nor is G0 >-reachable from G1, and if G0 ∉ V, then there exists G1 ∈ V reachable from G0. We now have our main results on the existence, construction, and cardinality of stable sets. These results can be viewed as variations on some classical results from graph theory applied to network formation games (e.g., see Berge [8], Chap. 2).

Theorem 6 (Stable sets: Existence, construction, and cardinality, [91,93]) Without loss of generality assume that (G, ≥^p) has basins of attraction given by {A1, A2, ..., Am}, where basin of attraction A_k contains |A_k| many networks (i.e., |A_k| is the cardinality of A_k). Then the following statements are true: (1) V ⊆ G is a stable set for (G, ≥^p) if and only if V is constructed by choosing one network from each basin of attraction, that is, if and only if V is of the form V = {G1, G2, ..., Gm}, where G_k ∈ A_k for k = 1, 2, ..., m. (2) (G, ≥^p) possesses |A1| · |A2| · ... · |Am| := M many stable sets, and each stable set V_q, q = 1, 2, ..., M, has cardinality |V_q| = |{A1, A2, ..., Am}| = m.

Example 3 (Basins of Attraction and Stable Sets) Referring to Fig. 9, it follows from Theorem 6 that because |A1| · |A2| · |A3| = 1 · 4 · 1 = 4, the network formation game (G, ≥^p) has 4 stable sets, each with cardinality 3. By examining Fig. 9 in light of Theorem 6, we see that the stable sets for (G, ≥^p) are given by V1 = {G0, G2, G7}, V2 = {G0, G3, G7}, V3 = {G0, G4, G7}, and V4 = {G0, G5, G7}.

It should be noted that by equipping the abstract network formation game with the path dominance relation rather than the original dominance relation, we avoid the famous Lucas [78] example of a game with no stable set.

The Path Dominance Core

Definition 11 (The Path Dominance Core, [91,93]) A network G ∈ G is contained in the path dominance core C of network formation game (G, ≥^p) if and only if there does not exist a network G′ ∈ G, G′ ≠ G, such that G′ ≥^p G.

Our next results give necessary and sufficient conditions for the path dominance core of a network formation game over homogeneous networks to be nonempty, as well as a recipe for constructing the path dominance core.

Theorem 7 (Path dominance core: Nonemptiness and construction, [91,93]) Without loss of generality assume that (G, ≥^p) has basins of attraction given by {A1, A2, ..., Am}, where basin of attraction A_k contains |A_k| many networks. Then the following statements are true: (1) (G, ≥^p) has a nonempty path dominance core if and only if there exists a basin of attraction containing a single network, that is, if and only if for some basin of attraction A_k, |A_k| = 1. (2) Let {A_{k1}, A_{k2}, ..., A_{kn}} ⊆ {A1, A2, ..., Am} be the subset of basins of attraction containing all basins having cardinality 1. Then the path dominance core C of (G, ≥^p) is given by C = {G_{k1}, G_{k2}, ..., G_{kn}}, where G_{ki} ∈ A_{ki} for i = 1, 2, ..., n.

If coalitional preferences over networks are based on weak preferences, that is, if coalitional preferences are given by {≻^w_S}, then the corresponding path dominance core, the weak path dominance core, is contained in the path dominance core based on strong preference relations.

Example 4 (Basins of Attraction and the Path Dominance Core, [91,93]) It follows from Theorem 7 that the path dominance core of the network formation game (G, ≥^p) with feasible set G = {G0, G1, ..., G7} and path dominance relation ≥^p induced by the dominance relation depicted in Fig. 9 is C = {G0, G7}. Suppose that the >-supernetwork corresponding to the direct dominance relation > on G = {G0, G1, ..., G7} is instead depicted by Fig. 10.
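Theorems 5 through 7 turn all of this bookkeeping into a computation on the dominance digraph: equivalence classes are strongly connected components, basins of attraction are the terminal components, stable sets are products choosing one network per basin, and the core collects the singleton basins. A Python sketch follows; the digraph is hypothetical, merely patterned on the textual description of Fig. 9 (arcs are one-step dominance moves).

```python
from itertools import product

def basins_of_attraction(nodes, arcs):
    """Terminal strongly connected components of the dominance digraph."""
    reach = {g: {h for (a, h) in arcs if a == g} for g in nodes}
    changed = True
    while changed:                      # transitive closure by fixed point
        changed = False
        for g in nodes:
            new = reach[g] | {h2 for h in reach[g] for h2 in reach[h]}
            if new != reach[g]:
                reach[g], changed = new, True
    classes = []
    for g in nodes:                     # mutual-reachability equivalence classes
        cls = {h for h in nodes if h == g or (h in reach[g] and g in reach[h])}
        if cls not in classes:
            classes.append(cls)
    # a basin is a class from which nothing outside the class is reachable
    return [c for c in classes if all(reach[g] <= c for g in c)]

def stable_sets(basins):
    """Theorem 6: choose exactly one network from each basin."""
    return [set(v) for v in product(*[sorted(b) for b in basins])]

def path_dominance_core(basins):
    """Theorem 7: the networks sitting in singleton basins."""
    return {g for b in basins if len(b) == 1 for g in b}

nodes = {"G0", "G1", "G2", "G3", "G4", "G5", "G6", "G7"}
arcs = {("G1", "G2"), ("G2", "G3"), ("G3", "G4"),
        ("G4", "G5"), ("G5", "G2"), ("G6", "G7")}

basins = basins_of_attraction(nodes, arcs)
print(sorted(sorted(b) for b in basins))    # three basins: {G0}, {G2..G5}, {G7}
print(len(stable_sets(basins)))             # 4 stable sets, each of size 3
print(sorted(path_dominance_core(basins)))  # ['G0', 'G7']
```

The fixed-point closure is quadratic and fine for small examples; for large feasible sets one would compute the condensation of the digraph with a linear-time strongly-connected-components algorithm instead.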
Networks and Stability, Figure 10  A different >-supernetwork D_>

Now the network formation game (G, ≥^p) has 3 circuits and 1 basin of attraction, A = {G2, G3, G4, G5}. Because |A1| = 4, by Theorem 7 the path dominance core of (G, ≥^p) is empty. By Theorem 6, (G, ≥^p) has 4 stable sets, each containing 1 network. These stable sets are given by V1 = {G2}, V2 = {G3}, V3 = {G4}, and V4 = {G5}.

The Path Dominance Core and Constrained Pareto Efficiency

We say that a network G ∈ G is constrained Pareto efficient for game (G, ≥^p) if and only if there does not exist another network G′ ∈ G such that (i) some coalition S can change network G to network G′ (that is, G →_S G′ for some coalition S) and (ii) G′ is preferred by all players (that is, G ≺_i G′ for all players i ∈ N). Letting E denote the set of all constrained Pareto efficient networks, it is easy to see that the path dominance core C of network formation game (G, ≥^p) is a subset of E, that is, C ⊆ E. Under the classical notion of Pareto efficiency, a network G is said to be Pareto efficient if and only if there does not exist another network G′ such that G ≺_i G′ for all players i ∈ N, regardless of whether or not some coalition S can change network G to network G′. Letting PE denote the set of all classically Pareto efficient networks, it is easy to see that PE ⊆ E. Note, however, that if under the rules of network formation any network G can be changed to any other network G′ via the actions of some coalition S, then the notions of constrained Pareto efficiency and classical Pareto efficiency are equivalent. Thus, if the collection of coalitional effectiveness relations {→_S} on G is complete, that is, if for any pair of networks G and G′ in G, G →_S G′ for some coalition S, then PE = E, and we have C ⊆ PE = E.

Strong Stability, Pairwise Stability, Nash Stability, and Farsighted Consistency

In this section we continue our discussion of the network formation game with respect to path dominance given by the pair (G, ≥^p). Throughout this section we will continue to assume that the underlying primitives,

(G, {≻_S}, {→_S}, >),

are such that G is equal to the set of homogeneous linking networks P(P2(N)) or the set of homogeneous directed networks P(N × N), that the dominance relation > on G is given by either direct dominance ▷ or indirect dominance ≫, and that primitives are represented by supernetwork G□ := P ∪ R□ (with □ equal to jw, jn, or bg) and >-supernetwork D_>. We will complete our unification of stability results for network formation games by concluding that, depending on how we further specialize the primitives underlying the game (G, ≥^p), the path dominance core is equal to the set of pairwise stable networks [64], the set of strongly stable networks [39,61], or the set of Nash networks [5]. We also present results on the relationships between basins of attraction, the path dominance core, and the largest farsightedly consistent set [26]. All of these results follow immediately from more general results in Page and Wooders [91,93], where the notions of strong stability, pairwise stability, Nash stability, and farsighted consistency are all extended to heterogeneous directed networks.

Strongly Stable Homogeneous Networks

We begin with a formal definition of strong stability based on that of Jackson–van den Nouweland [61].

Definition 12 (Strong Stability, [91,93]) Assume that G is equal to the set of all homogeneous linking networks, P(P2(N)), and that the Jackson–van den Nouweland rules are in force. Network G ∈ G is said to be strongly stable in (G, ≥^p) if for all G′ ∈ G and all coalitions S, G →_S G′ implies that G ⊀_S G′.
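Definition 12 can be read as a one-line test against an explicitly tabulated effectiveness relation. In the hypothetical two-network sketch below (ours, not an example from the text), the same coalitional move is deterred under strong coalitional preferences but not under weak ones.

```python
def is_strongly_stable(G, moves, prefers):
    """Strong stability as a test: no coalition that can move away from G
    prefers the network it can move to.

    moves: set of (S, G0, G1) triples, coalition S can change G0 into G1.
    prefers(S, G0, G1): True if coalition S prefers G1 to G0.
    """
    return all(not prefers(S, G0, G1) for (S, G0, G1) in moves if G0 == G)

# hypothetical example: coalition {a, b} can move Ga -> Gb, where player a
# strictly gains and player b is exactly indifferent
v = {"a": {"Ga": 1, "Gb": 2}, "b": {"Ga": 1, "Gb": 1}}
moves = {(frozenset({"a", "b"}), "Ga", "Gb")}

strong = lambda S, x, y: all(v[i][y] > v[i][x] for i in S)
weak = lambda S, x, y: (all(v[i][y] >= v[i][x] for i in S)
                        and any(v[i][y] > v[i][x] for i in S))

print(is_strongly_stable("Ga", moves, strong))  # True: b does not strictly gain
print(is_strongly_stable("Ga", moves, weak))    # False: the move is weakly preferred
```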
Thus, a network is strongly stable if, whenever a coalition has the power to change the network to another network, the coalition will be deterred from doing so because the change is not preferred by the coalition. If coalitional preferences are strong, the change 'not being preferred' means that the change will not make all members of the coalition better off. If coalitional preferences are weak (i.e., based on weak preferences), the change 'not being preferred' means that the change will either make no members better off or will make some members better off and some members worse off. Note that under our definition of strong stability a network G ∈ G that cannot be changed to another network by any coalition is strongly stable. If (i) coalitional preferences are weak (i.e., {≻^w_S}) and (ii) coalitional effectiveness relations are determined by Jackson–van den Nouweland rules, then the definition of strong stability above is exactly that of Jackson–van den Nouweland. As it stands, our definition is closely related to that given by Dutta–Mutuswami [39]. We now have our main result on the path dominance core and strong stability. Denote the set of strongly stable networks by SS.

Theorem 8 (The path dominance core and strong stability, [91,93]) Assume that G is equal to the set of all homogeneous linking networks, P(P2(N)), and that the Jackson–van den Nouweland rules are in force. (1) If the path dominance core C of (G, ≥^p) is nonempty, then SS is nonempty and C ⊆ SS. (2) If the dominance relation > underlying ≥^p is a direct dominance relation ▷, then C = SS, and SS is nonempty if and only if there exists a basin of attraction containing a single network.

Note that the set of strongly stable homogeneous linking networks is contained in the set of constrained Pareto efficient homogeneous linking networks. Thus, C ⊆ SS ⊆ E.

Pairwise Stable Networks

The following definition of pairwise stability is a translation of the Jackson–Wolinsky definition [64].
Definition 13 (Pairwise Stability, [91,93]) Assume that G is equal to the set of all homogeneous linking networks, P(P2(N)), and that the Jackson–Wolinsky rules are in force. Network G ∈ G is said to be pairwise stable in (G, ≥^p) if for all {i, i′} ∈ P2(N),

(1) G →_{{i,i′}} G ∪ {i, i′} implies that G ⊀_{{i,i′}} G ∪ {i, i′};
(2)(a) G →_{{i}} G \ {i, i′} implies that G ⊀_{{i}} G \ {i, i′}, and
(2)(b) G →_{{i′}} G \ {i, i′} implies that G ⊀_{{i′}} G \ {i, i′}.
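Under Jackson–Wolinsky rules, pairwise stability can therefore be checked link by link. A sketch follows, with a hypothetical payoff function (ours) in which every link yields each endpoint a positive net benefit, so only the complete network survives the test.

```python
from itertools import combinations

def is_pairwise_stable(G, players, Y):
    """Check (a translation of) Jackson-Wolinsky pairwise stability.

    G: linking network as a set of frozenset({i, j}) links.
    Y(G, i): payoff allocated to player i at network G.
    """
    for i, j in combinations(players, 2):
        link = frozenset({i, j})
        if link not in G:
            G2 = G | {link}
            # a missing link must not benefit both endpoints,
            # with at least one strictly better off
            if ((Y(G2, i) > Y(G, i) and Y(G2, j) >= Y(G, j)) or
                    (Y(G2, j) > Y(G, j) and Y(G2, i) >= Y(G, i))):
                return False
        else:
            G2 = G - {link}
            # neither endpoint may strictly gain from deleting the link
            if Y(G2, i) > Y(G, i) or Y(G2, j) > Y(G, j):
                return False
    return True

# hypothetical payoffs: every link yields each endpoint a net benefit of 0.5
players = ("a", "b", "c")
Y = lambda G, i: sum(0.5 for link in G if i in link)
complete = {frozenset(p) for p in combinations(players, 2)}

print(is_pairwise_stable(complete, players, Y))  # True
print(is_pairwise_stable(set(), players, Y))     # False: any pair gains by linking
```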
Thus, a homogeneous linking network is pairwise stable if there is no incentive for any pair of players to add a link to the existing network and there is no incentive for any player who is party to a link in the existing network to dissolve or remove that link. Note that under our definition of pairwise stability a network G ∈ G that cannot be changed to another network by any coalition, or can only be changed by coalitions of size greater than 2, is pairwise stable. Let PS denote the set of pairwise stable networks. It follows from the definitions of strong stability and pairwise stability that SS ⊆ PS. Moreover, if the full set of Jackson–Wolinsky rules is in force, then SS = PS. Jackson–van den Nouweland [61] provide two examples of the potential for strong stability to refine pairwise stability (i.e., two examples where SS is a strict subset of PS). However, under Jackson–Wolinsky rules, because network changes can occur only one link at a time and because deviations by coalitions of more than two players are not possible, such refinements are ruled out, driving SS and PS to equality. We now have our main result on the path dominance core and pairwise stability.

Theorem 9 (The path dominance core and pairwise stability, [91,93]) Assume that G is equal to the set of all homogeneous linking networks, P(P2(N)), and that the Jackson–Wolinsky rules are in force. (1) If the path dominance core C of (G, ≥^p) is nonempty, then PS is nonempty and C ⊆ PS. (2) If the dominance relation > underlying ≥^p is a direct dominance relation ▷, then C = PS, and PS is nonempty if and only if there exists a basin of attraction containing a single network.

Theorem 9 can be viewed as an extension of a result due to Jackson and Watts [62] on the existence of pairwise stable homogeneous linking networks for network formation games induced by Jackson–Wolinsky rules. In particular, Jackson and Watts [62] show that for this particular class of Jackson–Wolinsky network formation games, if there does not exist a closed cycle of networks, then there exists a pairwise stable network. Our notion of a strategic basin of attraction containing multiple networks corresponds to their notion of a closed cycle of networks. Thus, stated in our terminology, Jackson and Watts show that for this class of network formation games, if there does not exist
a basin of attraction containing multiple networks, then there exists a pairwise stable network. Following our approach, by part (2) of Theorem 9 the existence of at least one strategic basin of attraction containing a single network is both necessary and sufficient for the existence of a pairwise stable network.

Nash Networks

The following definition of Nash networks is a variation on the definition from Bala and Goyal [5].

Definition 14 (Nash Networks, [91,93]) Assume that G is equal to the set of all homogeneous directed networks, P(N × N), and that the Bala–Goyal rules are in force. Network G ∈ G is said to be a Nash network in (G, ≥^p) if for all G′ ∈ G and all players i′ ∈ N, G →_{{i′}} G′ implies that G ⊀_{{i′}} G′.

Thus, a homogeneous directed network is Nash if, whenever an individual player has the power to change the network to another network, the player has no incentive to do so. We shall denote by NE the set of Nash networks. Note that under our definition any network that cannot be changed to another network by a coalition of size 1 is a Nash network. Finally, note that the set of strongly stable networks SS is contained in the set of Nash networks NE. We now have our main result on the path dominance core and Nash stability.

Theorem 10 (The path dominance core and Nash networks, [91,93]) Assume that G is equal to the set of all homogeneous directed networks, P(N × N), and that the Bala–Goyal rules are in force. (1) If the path dominance core C of (G, ≥^p) is nonempty, then NE is nonempty and C ⊆ NE. (2) If the dominance relation > underlying ≥^p is a direct dominance relation ▷, then C = NE, and NE is nonempty if and only if there exists a basin of attraction containing a single network.

We close this section by noting that under the Bala–Goyal rules the set of Nash networks NE is contained in the set of constrained Pareto efficient networks E. If in addition the dominance relation is direct, then C = SS = NE ⊆ E.
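Under one common reading of the Bala–Goyal rules, player i unilaterally controls her set of outgoing links, so Nash stability amounts to checking all of i's alternative out-link sets. A hypothetical sketch (the interpretation of the rules and the payoff function are ours, purely for illustration):

```python
from itertools import combinations

def is_nash_network(G, players, v):
    """Nash stability when each player unilaterally controls her out-links.

    G: directed network as a set of (i, j) arcs.  v(G, i): payoff of i at G.
    """
    for i in players:
        others = [j for j in players if j != i]
        rest = {(a, b) for (a, b) in G if a != i}
        # try every alternative out-link set for player i
        for r in range(len(others) + 1):
            for targets in combinations(others, r):
                G2 = rest | {(i, j) for j in targets}
                if v(G2, i) > v(G, i):
                    return False        # i has a profitable deviation
    return True

# hypothetical payoffs: each of i's own links nets i a benefit of 0.4
players = ("a", "b", "c")
v = lambda G, i: sum(0.4 for (a, b) in G if a == i)
complete = {(i, j) for i in players for j in players if i != j}

print(is_nash_network(complete, players, v))  # True
print(is_nash_network(set(), players, v))     # False: adding any link pays
```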
Farsightedly Consistent Networks

Our final result summarizes the relationships between basins of attraction, the path dominance core, and the largest farsightedly consistent set.

Theorem 11 (Basins of attraction, the path dominance core, and the largest consistent set, [91,93]) Assume that (i) G is equal to the set of homogeneous linking networks P(P2(N)) and that the Jackson–Wolinsky rules or the Jackson–van den Nouweland rules are in force; or (ii) G is equal to the set of homogeneous directed networks P(N × N) and that the Bala–Goyal rules are in force. Given network formation game (G, ≥^p), where path dominance is induced by an indirect dominance relation ≫, assume without loss of generality that (G, ≥^p) has nonempty largest consistent set F* and basins of attraction {A1, A2, ..., Am}. Then the following statements are true: (1) Each basin of attraction A_k, k = 1, 2, ..., m, has a nonempty intersection with the largest consistent set F*, that is, F* ∩ A_k ≠ ∅ for k = 1, 2, ..., m. (2) If (G, ≥^p) has a nonempty path dominance core C, then C ⊆ F*.

Singleton Basins of Attraction

In the abstract games (G, ≥^p) that we have considered, the key condition guaranteeing nonemptiness of the path dominance core is the existence of basins of attraction containing a single network. Question: Are there classes of games for which this is true? In general, if the irreflexive dominance relation > inducing path dominance ≥^p is transitive, then the >-supernetwork D_> is without circuits, and therefore all basins of attraction for the game (G, ≥^p) contain a single network. Unfortunately, if the dominance relation is given by direct or indirect dominance, then transitivity fails to hold in general. In the next two sections we identify several classes of network formation games having singleton basins.

Network Formation Games and Potential Functions

Assume that G is equal to the set of homogeneous directed networks, P(N × N), and that the Bala–Goyal rules are in force, so that primitives are represented by supernetwork G_bg := P ∪ R_bg. In addition, assume that player preferences over P(N × N) are specified via payoff functions {v_i(·)}_{i∈N} and that the dominance relation > over P(N × N) is given by direct dominance ▷. Thus, G′ ▷ G if and only if for some player i′ ∈ N, G →_{{i′}} G′ and v_{i′}(G′) > v_{i′}(G).
We say that the noncooperative network formation game (G, ≥^p) is a potential game if there exists a function P(·) : G → R
such that for all G and G′ with G →_{{i′}} G′ for some player i′,

v_{i′}(G′) > v_{i′}(G) if and only if P(G′) > P(G).
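To illustrate, here is a toy two-player directed-network game (entirely hypothetical, ours) with a candidate potential P. The check below confirms the defining property on every single-player move; a maximizer of P is then a Nash network, since no improving move can depart from it.

```python
from itertools import chain, combinations

players = ("a", "b")
possible = [(i, j) for i in players for j in players if i != j]

def powerset(xs):
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

# hypothetical payoffs: each of i's own links nets i a benefit of 0.4
def v(G, i):
    return sum(0.4 for (s, t) in G if s == i)

# candidate potential: total payoff (a potential here because v_i depends
# only on i's own links, which only player i can change)
def P(G):
    return sum(v(G, i) for i in players)

networks = [frozenset(G) for G in powerset(possible)]

# verify the defining property on every single-player move
ok = True
for G in networks:
    for G2 in networks:
        for i in players:
            same_others = ({l for l in G if l[0] != i} ==
                           {l for l in G2 if l[0] != i})
            if same_others and G != G2:
                ok = ok and ((v(G2, i) > v(G, i)) == (P(G2) > P(G)))

print(ok)                           # True: P is a potential for this game
best = max(networks, key=P)
print(best == frozenset(possible))  # True: the complete network maximizes P
```

Because every improving single-player move strictly increases P, the dominance digraph of such a game can contain no circuits, which is exactly the argument made in the text.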
It is easy to see that any noncooperative network formation game (G, ≥^p) possessing a potential function (i.e., any potential game) has no circuits, and thus possesses strategic basins of attraction each consisting of a single network. Thus, we can conclude from Theorem 7 that any noncooperative network formation game possessing a potential function has a nonempty path dominance core. In addition, we know from Theorem 10 that in this example the path dominance core C is equal to the set of Nash networks NE. As has been shown by Monderer and Shapley [85], potential games are closely related to the congestion games introduced by Rosenthal [102]. Page and Wooders [92] introduce a club network formation game, a variant of the noncooperative network formation game described above but for a class of heterogeneous directed networks, and, using methods similar to those introduced by Hollard [57], show that this game possesses a potential function. Prior papers studying potential games in the context of linking networks include Qin [97], Slikker, Dutta, van den Nouweland, and Tijs [112], and Slikker and van den Nouweland [114]. These papers have focused on providing the strategic underpinnings of the Myerson value (Myerson [88] and Aumann and Myerson [4]).

Jackson–Wolinsky Network Formation Games

Assume that G is equal to the set of all homogeneous linking networks, P(P2(N)), and that the Jackson–van den Nouweland rules are in force, so that rules are represented by rules supernetwork R_jn. In addition, assume that player preferences over P(P2(N)) are weak and therefore that coalitional preferences, {≻^w_S}, are weak. Finally, assume that the dominance relation > on P(P2(N)) is given by direct dominance; because coalitional preferences are weak, direct dominance is weak, denoted ▷^w.
In the Jackson–Wolinsky network formation game coalitional preferences are specified by player payoff functions, {v_i(·)}_{i∈N}, and player payoff functions are in turn specified by a network value function

v(·) : G → R

together with an allocation rule Y(G, v) = (Y_i(G, v))_{i∈N} ∈ R^{|N|} satisfying

Σ_{i∈N} Y_i(G, v) = v(G).

Thus, in the Jackson–Wolinsky game each player's payoff function v_i(·) is given by Y_i(·, v), where v(·) is the network value function. The basic idea here is that, given network G, v(G) is the total value generated by network G and Y_i(G, v) is the value allocated to player i. Translating Jackson–Wolinsky into our abstract game model, if G′ ▷^w G then one of the following is true:

(i) G →_{{i,i′}} G′ where G′ = G ∪ {i, i′} (a link between players i and i′ is added) and Y_i(G′, v) ≥ Y_i(G, v) and Y_{i′}(G′, v) ≥ Y_{i′}(G, v), with strict inequality for at least one of the players;
(ii) G →_{{i}} G′ where G′ = G \ {i, i′} (a link between players i and i′ is subtracted) and Y_i(G′, v) > Y_i(G, v);
(iii) G →_{{i′}} G′ where G′ = G \ {i, i′} (a link between players i and i′ is subtracted) and Y_{i′}(G′, v) > Y_{i′}(G, v).

Each homogeneous linking network G can be partitioned into a collection of subnetworks called components as follows. Let H ⊆ G be any subnetwork of G and define

N_H := {i ∈ N : there exists i′ ∈ N such that {i, i′} ∈ H}.

Subnetwork H ⊆ G is said to be a component if, whenever i and i′ are in N_H, there is a path between i and i′, and if i ∈ N_H and {i, i′} ∈ G, then {i, i′} ∈ H. Let C(G) denote the set of all components of G.

If the allocation rule (Y_i(G, v))_{i∈N} is egalitarian (see [64]), that is, if

v_i(G) = Y_i(G, v) = v(G)/|N|,

then any network G ∈ arg max_{G′∈G} v(G′) is pairwise stable, and hence by part (2) of Theorem 9, (G, ≥^p) has a basin of attraction containing a single network. Alternatively, if the allocation rule (Y_i(G, v))_{i∈N} is componentwise egalitarian (see [64]), that is, if

v_i(G) = Y_i(G, v) = v(H)/|N_H| for H ∈ C(G) and i ∈ N_H,

then there exists a pairwise stable network (see Jackson [59]), and hence by part (2) of Theorem 9, (G, ≥^p) has a basin of attraction containing a single network. Finally, Jackson [59] has shown that if the allocation rule is given by the Myerson value (see [88] and [4]), that is, if

v_i(G) = Y_i(G, v) = Σ_{S ⊆ N\{i}} [v(G_{S∪{i}}) − v(G_S)] · |S|!(|N| − |S| − 1)!/|N|!,

where G_S := {{i, i′} ∈ G : i ∈ S and i′ ∈ S}, then (G, ≥^p) has a basin of attraction containing a single network, and hence by part (2) of Theorem 9, (G, ≥^p) has at least one pairwise stable network.

Future Directions

There are many possible directions for future research on the topic of networks and stability. Here we mention only a few that, from the perspective of this entry, seem especially promising. There are a number of potential questions to be addressed concerning the path dominance core with direct or indirect dominance. For example, what is the relationship, if any, between basins of attraction and the path dominance core and partnered (or separating) collections of coalitions, as in, for example, Maschler and Peleg [81], Maschler, Peleg, and Shapley [82], Reny and Wooders [99], and Page and Wooders [90]? Or what is the relationship between basins of attraction and the path dominance core and the inner core, as in, for example, Qin [95,96]? One of the most pressing issues in our view is strategic network dynamics. Future research will address the following open question: Given the rules of network formation, the preferences of individuals, the strategic behavior of coalitions, and the trembles of nature, what network dynamics are likely to emerge and persist? Another direction is large networks, where there are many, but still a finite number of, nodes, or networks with a continuum of nodes. As in the framework of cooperative games (cf. Kovalenkov and Wooders [72] and Wooders [125,127] for cores of transferable utility and nontransferable utility games), does it hold that some notion of the approximate path dominance core is nonempty if the number of players is sufficiently large? Do networks tend towards having some property analogous to the equal treatment property (as in, for example, Debreu and Scarf [31] and Green [52] for exchange economies, [72] or Wooders [126] for cooperative games, Gravel and Thoron [51] for local public goods economies, or Jackson and Watts [63] for a repeated game approach to a matching model)? Then there is the problem of characterizing strategic behavior in large networks (as in, for example, Kalai [65] or Wooders, Cartwright, and Selten [128]). Under what conditions and to what extent might Kalai's "ex post stability" or Wooders, Cartwright, and Selten's social conformity continue to hold in strategic network formation?

Acknowledgments

This paper was begun while Page and Wooders were visiting CERMSEM at the University of Paris 1 in June and October of 2007. The authors thank CERMSEM and Paris 1 for their hospitality. URLs: http://mypage.iu.edu/~fpage, http://www.myrnawooders.com.

Bibliography
1. Allouch N, Wooders M (2007) Price taking equilibrium in economies with multiple memberships in clubs and unbounded club sizes. J Econ Theory. doi:10.1016/j.jet.2007.07.06
2. Arnold T, Wooders M (2006) Club formation with coordination. University of Warwick Working Paper 640
3. Aumann RJ (1964) Markets with a continuum of traders. Econometrica 32:39–50
4. Aumann RJ, Myerson RB (1988) Endogenous formation of links between players and coalitions: An application of the Shapley value. In: Roth A (ed) The Shapley value. Cambridge University Press, Cambridge, pp 175–191
5. Bala V, Goyal S (2000) A noncooperative model of network formation. Econometrica 68:1181–1229
6. Banerjee S, Konishi H, Sonmez T (2001) Core in a simple coalition formation game. Soc Choice Welf 18:135–158
7. Belleflamme P, Bloch F (2004) Market sharing agreements and collusive networks. Int Econ Rev 45:387–411
8. Berge C (2001) The theory of graphs. Dover, Mineola (reprint of the translated French edition published by Dunod, Paris, 1958)
9. Bhattacharya A (2005) Stable and efficient networks with farsighted players: The largest consistent set. Typescript, University of York
10. Bloch F (1995) Endogenous structures of association in oligopolies. Rand J Econ 26:537–556
11. Bloch F (2005) Group and network formation in industrial organization: A survey. In: Demange G, Wooders M (eds) Group formation in economics: Networks, clubs, and coalitions. Cambridge University Press, Cambridge, pp 335–353
12. Bloch F, Genicot G, Ray D (2008) Informal insurance in social networks. J Econ Theory. doi:10.1016/j.jet.2008.01.008
13. Blume L (1993) The statistical mechanics of strategic interaction. Games Econ Behav 5:387–424
14. Bogomolnaia A, Jackson MO (2002) The stability of hedonic coalition structures. Games Econ Behav 38:201–230
15. Bollobas B (1998) Modern graph theory. Springer, New York
16. Boorman SA (1975) A combinatorial optimization model for transmission of job information through contact networks. Bell J Econ 6:216–249
17. Bramoulle Y, Kranton R (2007) Public goods in networks. J Econ Theory 135:478–494
Networks and Stability
18. Bramoulle Y, Kranton R (2007) Risk-sharing networks. J Econ Behav Organ 64:275–294
19. Calvo-Armengol A (2004) Job contact networks. J Econ Theory 115:191–206
20. Calvo-Armengol A, Ballester C, Zenou Y (2006) Who's who in networks. Wanted: The key player. Econometrica 75:1403–1418
21. Calvo-Armengol A, Jackson MO (2004) The effects of social networks on employment and inequality. Am Econ Rev 94:426–454
22. Calvo-Armengol A, Jackson MO (2007) Social networks in labor markets: Wage and employment dynamics and inequality. J Econ Theory 132:27–46
23. Casella A, Rauch J (2002) Anonymous market and group ties in international trade. J Int Econ 58:19–47
24. Casella A, Rauch J (2003) Overcoming informational barriers in international resource allocations: Prices and ties. Econ J 113:21–42
25. Chvatal V, Lovasz L (1972) Every directed graph has a semikernel. In: Hypergraph Seminar. Lecture Notes in Mathematics, vol 411. Springer, Berlin
26. Chwe M (1994) Farsighted coalitional stability. J Econ Theory 63:299–325
27. Chwe M (2000) Communication and coordination in social networks. Rev Econ Stud 67:1–16
28. Corominas-Bosch M (2004) Bargaining in a network of buyers and sellers. J Econ Theory 115:35–77
29. Currarini S (2007) Group stability of hierarchies in games with spillovers. Math Soc Sci 54:187–202
30. Currarini S, Morelli M (2000) Network formation with sequential demands. Rev Econ Des 5:229–249
31. Debreu G, Scarf H (1963) A limit theorem on the core of an economy. Int Econ Rev 4:235–246
32. Demange G, Henreit D (1991) Sustainable oligopolies. J Econ Theory 54:417–428
33. Demange G (1994) Intermediate preferences and stable coalition structures. J Math Econ 23:45–48
34. Demange G (2004) On group stability and hierarchies in networks. J Political Econ 112:754–778
35. Deroian F, Gannon F (2005) Quality improving alliances in differentiated oligopoly. Int J Ind Organ 24:629–637
36. Diamantoudi E, Xue L (2003) Farsighted stability in hedonic games. Soc Choice Welf 21:39–61
37. Durlauf S (1997) Statistical mechanics approaches to socioeconomic behavior. In: Arthur WB, Durlauf S, Lane DA (eds) The economy as an evolving complex system II. Addison-Wesley, Reading, pp 81–104
38. Dutta B, Ghosal S, Ray D (2005) Farsighted network formation. J Econ Theory 122:143–164
39. Dutta B, Mutuswami S (1997) Stable networks. J Econ Theory 76:322–344
40. Even-Dar E, Kearns M, Suri S (2007) A network formation game for bipartite exchange economies. Computer and Information Science typescript, University of Pennsylvania
41. Furusawa T, Konishi H (2007) Free trade networks. J Int Econ 72:310–335
42. Galeana-Sanchez H, Xueliang L (1998) Semikernels and (k, l)-kernels in digraphs. SIAM J Discret Math 11:340–346
43. Galeotti A, Moraga-Gonzalez JL (2007) Segmentation, advertising and prices. Int J Ind Organ. doi:10.1016/j.ijindorg.2007.11.002
44. Gillies DB (1959) Solutions to general non-zero-sum games. In: Tucker AW, Luce RD (eds) Contributions to the theory of games, vol 4. Princeton University Press, Princeton, pp 47–85
45. Goyal S (2005) Learning in networks. In: Demange G, Wooders M (eds) Group formation in economics: Networks, clubs, and coalitions. Cambridge University Press, Cambridge, pp 122–167
46. Goyal S (2007) Connections: An introduction to the economics of networks. Princeton University Press, Princeton
47. Goyal S, Joshi S (2003) Networks of collaboration in oligopoly. Games Econ Behav 43:57–85
48. Goyal S, Joshi S (2006) Bilateralism and free trade. Int Econ Rev 47:749–778
49. Goyal S, Moraga-Gonzalez JL (2001) R&D networks. Rand J Econ 32:686–707
50. Granovetter M (1973) The strength of weak ties. Am J Sociol 78:1360–1380
51. Gravel N, Thoron S (2007) Does endogenous formation of jurisdictions lead to wealth stratification? J Econ Theory 132:569–583
52. Green J (1972) On the inequitable nature of core allocations. J Econ Theory 4:132–143
53. Guilbaud GT (1949) La theorie des jeux. Economie Appliquee 2:18
54. Harsanyi JC (1974) An equilibrium-point interpretation of stable sets and a proposed alternative definition. Manag Sci 20:1472–1495
55. Herings PJ-J, Mauleon A, Vannetelbosch V (2006) Farsightedly stable networks. Meteor Research Memorandum RM/06/041
56. Hojman D, Szeidl A (2006) Endogenous networks, social games and evolution. Games Econ Behav 55:112–130
57. Hollard G (2000) On the existence of a pure strategy equilibrium in group formation games. Econ Lett 66:283–287
58. Inarra E, Kuipers J, Olaizola N (2005) Absorbing and generalized stable sets. Soc Choice Welf 24:433–437
59. Jackson MO (2003) The stability and efficiency of economic and social networks. In: Dutta B, Jackson MO (eds) Networks and groups: Models of strategic formation. Springer, Heidelberg, pp 99–141
60. Jackson MO (2005) A survey of models of network formation: Stability and efficiency. In: Demange G, Wooders M (eds) Group formation in economics: Networks, clubs, and coalitions. Cambridge University Press, Cambridge, pp 11–57
61. Jackson MO, van den Nouweland A (2005) Strongly stable networks. Games Econ Behav 51:420–444
62. Jackson MO, Watts A (2002) The evolution of social and economic networks. J Econ Theory 106:265–295
63. Jackson MO, Watts A (2008) Social games: Matching and the play of finitely repeated games. Games Econ Behav. doi:10.1016/j.geb.2008.02.004
64. Jackson MO, Wolinsky A (1996) A strategic model of social and economic networks. J Econ Theory 71:44–74
65. Kalai E (2004) Large robust games. Econometrica 72:1631–1665
66. Kalai E, Pazner A, Schmeidler D (1976) Collective choice correspondences as admissible outcomes of social bargaining processes. Econometrica 44:233–240
67. Kalai E, Schmeidler D (1977) An admissible set occurring in various bargaining situations. J Econ Theory 14:402–411
68. Kirman A (1983) Communication in markets: A suggested approach. Econ Lett 12:101–108
69. Kirman A, Herreiner D, Weisbuch G (2000) Market organization and trading relationships. Econ J 110:411–436
70. Konishi H, Le Breton M, Weber S (1998) Equilibrium in a finite local public goods economy. J Econ Theory 79:224–244
71. Konishi H, Ray D (2003) Coalition formation as a dynamic process. J Econ Theory 110:1–41
72. Kovalenkov A, Wooders M (2001) Epsilon cores of games with limited side payments: Nonemptiness and equal treatment. Games Econ Behav 36:193–218
73. Kovalenkov A, Wooders M (2003) Approximate cores of games and economies with clubs. J Econ Theory 110:87–120
74. Kranton R, Minehart D (2000) Networks versus vertical integration. RAND J Econ 31:570–601
75. Kranton R, Minehart D (2001) A theory of buyer-seller networks. Am Econ Rev 91:485–508
76. Li S (1992) Far-sighted strong equilibrium and oligopoly. Econ Lett 40:39–44
77. Li S (1993) Stability of voting games. Soc Choice Welf 10:51–56
78. Lucas WF (1968) A game with no solution. Bull Am Math Soc 74:237–239
79. Luo X (2001) General systems and φ-stable sets – a formal analysis of socioeconomic environments. J Math Econ 36:95–109
80. Mariotti M, Xue L (2002) Farsightedness in coalition formation. Typescript, University of Aarhus
81. Maschler M, Peleg B (1967) The structure of the kernel of a cooperative game. SIAM J Appl Math 15:569–604
82. Maschler M, Peleg B, Shapley LS (1971) The kernel and bargaining set for convex games. Int J Game Theory 1:73–93
83. Mauleon A, Sempere-Monerris J, Vannetelbosch V (2008) Networks of knowledge among unionized firms. Can J Econ (to appear)
84. Mauleon A, Vannetelbosch V (2004) Farsightedness and cautiousness in coalition formation games with positive spillovers. Theory Decis 56:291–324
85. Monderer D, Shapley LS (1996) Potential games. Games Econ Behav 14:124–143
86. Montgomery J (1991) Social networks and labor market outcomes: Toward an economic analysis. Am Econ Rev 81:1408–1418
87. Mutuswami S, Winter E (2002) Subscription mechanisms for network formation. J Econ Theory 106:242–264
88. Myerson RB (1977) Graphs and cooperation in games. Math Oper Res 2:225–229
89. Page FH Jr, Kamat S (2005) Farsighted stability in network formation. In: Demange G, Wooders M (eds) Group formation in economics: Networks, clubs, and coalitions. Cambridge University Press, Cambridge, pp 89–121
90. Page FH Jr, Wooders M (1996) The partnered core and the partnered competitive equilibrium. Econ Lett 52:143–152
91. Page FH Jr, Wooders M (2005) Strategic basins of attraction, the farsighted core, and network formation games. FEEM Working Paper 36.05
92. Page FH Jr, Wooders M (2007) Club networks with multiple memberships and noncooperative stability. Indiana University, Department of Economics typescript (paper presented at the Conference in Honor of Ehud Kalai, 16–18 December, 2007)
93. Page FH Jr, Wooders M (2008) Strategic basins of attraction, the path dominance core, and network formation games. Games Econ Behav. doi:10.1016/j.geb.2008.05.003
94. Page FH Jr, Wooders M, Kamat S (2005) Networks and farsighted stability. J Econ Theory 120:257–269
95. Qin C-Z (1993) A conjecture of Shapley and Shubik on competitive outcomes in the cores of NTU market games. Int J Game Theory 22:335–344
96. Qin C-Z (1994) The inner core of an N-person game. Games Econ Behav 6:431–444
97. Qin C-Z (1996) Endogenous formations of cooperation structures. J Econ Theory 69:218–226
98. Rees A (1966) Information networks in labor markets. Am Econ Rev 56:218–226
99. Reny PJ, Wooders M (1996) The partnered core of a game without side payments. J Econ Theory 70:298–311
100. Richardson M (1953) Solutions of irreflexive relations. Ann Math 58:573–590
101. Rockafellar RT (1984) Network flows and monotropic optimization. Wiley, New York
102. Rosenthal RW (1973) A class of games possessing pure-strategy Nash equilibria. Int J Game Theory 2:65–67
103. Roth AE (1975) A lattice fixed-point theorem with constraints. Bull Am Math Soc 81:136–138
104. Roth AE (1977) A fixed-point approach to stability in cooperative games. In: Karamardian S (ed) Fixed points: Algorithms and applications. Academic Press, New York
105. Roughgarden T (2005) Selfish routing and the price of anarchy. MIT Press, Cambridge
106. Scarf H (1967) The core of an N-person game. Econometrica 35:50–69
107. Schwartz T (1974) Notes on the abstract theory of collective choice. Carnegie-Mellon University, School of Urban and Public Affairs typescript
108. Shapley LS, Shubik M (1969) On market games. J Econ Theory 1:9–25
109. Shenoy PP (1980) A dynamic solution concept for abstract games. J Optim Theory Appl 32:151–169
110. Shubik M (1971) The "bridge game" economy: An example of indivisibilities. J Political Econ 79:909–912
111. Skyrms B, Pemantle R (2000) A dynamic model of social network formation. Proc Nat Acad Sci 97:9340–9346
112. Slikker M, Dutta B, van den Nouweland A, Tijs S (2000) Potential maximizers and network formation. Math Soc Sci 39:55–70
113. Slikker M, van den Nouweland A (2001) Social and economic networks in cooperative game theory. Kluwer, Boston
114. Slikker M, van den Nouweland A (2002) Network formation, costs, and potential games. In: Borm P, Peters H (eds) Chapters in game theory. Kluwer, Boston, pp 223–246
115. Tardos E, Wexler T (2007) Network formation games and the potential function method. In: Nisan N, Roughgarden T, Tardos E, Vazirani V (eds) Algorithmic game theory. Cambridge University Press, Cambridge, pp 487–516
116. Tesfatsion L (1997) A trade network game with endogenous partner selection. In: Amman HM, Rustem B, Whinston AB (eds) Computational approaches to economic problems. Kluwer, Boston, pp 249–269
117. Tesfatsion L (1998) Preferential partner selection in evolutionary labor markets: A study in agent-based computational economics. In: Porto VW, Saravanan N, Waagen D, Eiben AE (eds) Evolutionary programming VII. Proceedings of the seventh annual conference on evolutionary programming. Springer, Berlin, pp 15–24
118. Topa G (2001) Social interactions, local spillovers, and unemployment. Rev Econ Stud 68:261–295
119. van Deemen AMA (1991) A note on generalized stable set. Soc Choice Welf 8:255–260
120. van den Nouweland A (2005) Models of network formation in cooperative games. In: Demange G, Wooders M (eds) Group formation in economics: Networks, clubs, and coalitions. Cambridge University Press, Cambridge, pp 58–88
121. Vega-Redondo F (2007) Complex social networks. Cambridge University Press, Cambridge
122. von Neumann J, Morgenstern O (1944) Theory of games and economic behavior. Princeton University Press, Princeton
123. Wang P, Watts A (2006) Formation of buyer-seller trade networks in a quality differentiated product market. Can J Econ 39:971–1004
124. Watts A (2001) A dynamic model of network formation. Games Econ Behav 34:331–341
125. Wooders M (1983) The epsilon core of a large replica game. J Math Econ 11:277–300
126. Wooders M (2008) Competitive markets and market games. Rev Econ Design (forthcoming)
127. Wooders M (2008) Small group effectiveness, per capita boundedness and nonemptiness of approximate cores. J Math Econ. doi:10.1016/j.jmateco.2007.06.006
128. Wooders M, Cartwright C, Selten R (2006) Behavioral conformity in games with many players. Games Econ Behav 57:347–360
129. Xue L (1998) Coalitional stability under perfect foresight. Econ Theory 11:603–627
130. Xue L (2000) Negotiation-proof Nash equilibrium. Int J Game Theory 29:339–357
131. Zissimos B (2005) Why are free trade agreements regional? FEEM Working Paper 67-07
Neuro-fuzzy Systems
Leszek Rutkowski¹, Krzysztof Cpałka¹,², Robert Nowicki¹, Agata Pokropińska³, Rafał Scherer¹
¹ Department of Computer Engineering, Częstochowa University of Technology, Częstochowa, Poland
² Department of Artificial Intelligence, Academy of Humanities and Economics, Lodz, Poland
³ Institute of Mathematics and Computer Science, Jan Dlugosz University, Częstochowa, Poland
Article Outline

Glossary
Definition of the Subject
Introduction
Fuzzy Reasoning
Description of Fuzzy Inference Systems
Logical-Type Neuro-fuzzy Systems
Mamdani-Type Neuro-fuzzy Systems
Simplified Neuro-fuzzy Systems
Takagi–Sugeno Neuro-fuzzy Systems
Neuro-fuzzy Systems with Weights
Neuro-fuzzy Systems for Pattern Classification
Learning Criteria
Isolines Method
Future Directions
Acknowledgments
Bibliography

Glossary

Neuro-fuzzy systems Fusion of fuzzy logic and neural networks, combining automated adaptation to training data with interpretability of the stored knowledge.
Fuzzy reasoning Reasoning which, on the basis of fuzzy premises and fuzzy rules, infers fuzzy conclusions.

Definition of the Subject

Neuro-fuzzy systems (NFS) are part of the soft computing concept. They are a synergistic fusion of fuzzy logic and neural networks, combining automated adaptation to training data with interpretability of knowledge. Thus neuro-fuzzy systems incorporate two goals of machine learning which are difficult to meet at the same time: transparency of the rule base and high accuracy obtained thanks to learning from data. There are various types of NFS, which differ in the form of fuzzy rules and inference methods.
They have a very broad range of applications in everyday life, science, economics, manufacturing, medicine, etc. NFS can replace neural networks in most applications, as most of them are universal approximators. Therefore they can perform well in tasks of classification, prediction, approximation and control, similarly to neural networks. Neural networks work as black boxes (usually it is not possible to tell where the knowledge is stored) and cannot use prior knowledge. NFS can use almost the same learning methods and achieve the same accuracy as neural networks, yet the knowledge in the form of fuzzy rules is easily interpretable for humans. On the other hand, fuzzy systems (not neuro-fuzzy) require manual tuning of their parameters, and fuzzy reasoning in such systems is more troublesome and computationally demanding than the simplified reasoning used in NFS.

Introduction

Over the last decade, fuzzy sets and fuzzy logic, introduced in 1965 by Lotfi Zadeh [62], have been used in a wide range of problem domains, including process control, image processing, pattern recognition and classification, management, economics and decision making. Specific applications include washing-machine automation, camcorder focusing, TV color tuning, automobile transmissions and subway operations [17]. We have also been witnessing a rapid development in the area of neural networks (see e.g. [53,54,63]). Both fuzzy systems and neural networks, along with probabilistic methods [1,11,42], evolutionary algorithms [13,36], rough sets [43,44] and uncertain variables [3,4,5], constitute a consortium of soft computing techniques [1,23,26]. These techniques are often used in combination. For example, fuzzy inference systems are frequently converted into connectionist structures called neuro-fuzzy systems, which exhibit advantages of both neural networks and fuzzy systems. In the literature, various neuro-fuzzy systems have been developed (see, e.g.,
[6,7,8,9,10,14,15,16,19,20,21,23,28,29,30,31,32,33,34,37,38,41,45,46,47,49,50,58]). They combine the natural language description of fuzzy systems and the learning properties of neural networks. Some of them are known in the literature under short names such as ANFIS [18], ANNBFIS [9], DENFIS [24], FALCON [31], GARIC [2], NEFCLASS [40], NEFPROX [39,40], SANFIS [57] and others. The most popular designs of neuro-fuzzy structures fall into one of the following categories, depending on the connective between the antecedent and the consequent in fuzzy rules:

(i) Takagi–Sugeno method – consequents are functions of inputs,
(ii) Mamdani-type reasoning method – consequents and antecedents are related by the min operator or, generally, by a t-norm,
(iii) Logical-type reasoning method – consequents and antecedents are related by fuzzy implications, e.g. binary, Łukasiewicz, Zadeh and others.

It should be noted that most applications are dominated by Mamdani-type fuzzy reasoning. However, it was emphasized by Yager [60,61] that "no formal reason exists for the preponderant use of the Mamdani method in fuzzy logic control as opposed to the logical method other than inertia". In this work we will outline all three types of systems. Moreover, we will compare several variants of these systems in terms of efficiency for an exemplary problem.

Fuzzy Reasoning

In this section we present the idea of fuzzy reasoning with various fuzzy implications, which will be useful for the construction of the neuro-fuzzy systems developed in the next sections. The basic rule of inference in classical logic is modus ponens. The compositional rule of inference describes a composition of a fuzzy set and a fuzzy relation. The fuzzy rule

IF x is A THEN y is B    (1)

is represented by a fuzzy relation R. Having given an input linguistic value A′, we can infer an output fuzzy set B′ by the composition of the fuzzy set A′ and the relation R. The generalized modus ponens is the extension of the conventional modus ponens tautology, to allow partial fulfillment of the premises:

Premise: x is A′
Implication: IF x is A THEN y is B
Conclusion: y is B′

where A, B, A′, B′ are fuzzy sets, and x and y are linguistic variables. Applying the compositional rule of inference [22], we obtain

B′ = A′ ∘ R = A′ ∘ (A → B)    (2)

and

μ_{B′}(y) = μ_{A′∘R}(y) = sup_{x∈X} T{ μ_{A′}(x), μ_R(x, y) } .    (3)

The problem is to determine the membership function of the fuzzy relation described by

μ_R(x, y) = μ_{A→B}(x, y)    (4)

based on the knowledge of μ_A(x) and μ_B(y). We denote

μ_{A→B}(x, y) = I(μ_A(x), μ_B(y)) ,    (5)

where I(·) = I_fuzzy(·) is a fuzzy implication (logical approach to fuzzy reasoning, see Subsect. "Logical Approach to Fuzzy Inference Reasoning") given in Definition 1, or a t-norm (Mamdani approach to fuzzy reasoning, see Subsect. "Mamdani Approach to Fuzzy Inference Reasoning").

Logical Approach to Fuzzy Inference Reasoning

Definition 1 (see Fodor [12]) A fuzzy implication is a function I = I_fuzzy : [0, 1]² → [0, 1] satisfying the following conditions:

(I1) if a1 ≤ a3, then I(a1, a2) ≥ I(a3, a2), for all a1, a2, a3 ∈ [0, 1],
(I2) if a2 ≤ a3, then I(a1, a2) ≤ I(a1, a3), for all a1, a2, a3 ∈ [0, 1],
(I3) I(0, a2) = 1, for all a2 ∈ [0, 1] (falsity implies anything),
(I4) I(a1, 1) = 1, for all a1 ∈ [0, 1] (anything implies tautology),
(I5) I(1, 0) = 0 (Booleanity).

Selected fuzzy implications satisfying all or some of the above conditions are listed in Table 1. In this table, Implications 1–4 are examples of an S-implication associated with a t-conorm,

I(a, b) = S{1 − a, b} .    (6)

Neuro-fuzzy systems based on fuzzy implications given in Table 1 are called logical systems. Implications 6 and 7 belong to a group of R-implications associated with the t-norm T and given by

I(a, b) = sup{ z | T{a, z} ≤ b } ,  a, b ∈ [0, 1] .    (7)

The Zadeh implication belongs to a group of Q-implications given by

I(a, b) = S{ N(a), T{a, b} } ,  a, b ∈ [0, 1] .    (8)

It is easy to verify that S-implications and R-implications satisfy all the conditions of Definition 1. However, the Zadeh implication violates conditions I1 and I4, whereas the Willmott implication violates conditions I1, I3 and I4.

Mamdani Approach to Fuzzy Inference Reasoning

In practice we frequently use Mamdani-type operators given by

I(a, b) = min{a, b} ,  a, b ∈ [0, 1] .    (9)
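The inference scheme of Eqs. (2)–(5) can be sketched on discrete universes. The following is a minimal illustration written for this text, not code from the entry: the universes, the membership values, the choice of the Łukasiewicz implication for I and of min as the t-norm T are all assumptions made here.

```python
def lukasiewicz(a, b):
    # Implication 2 in Table 1: I(a, b) = min{1, 1 - a + b}
    return min(1.0, 1.0 - a + b)

def relation(mu_a, mu_b, impl):
    # Eq. (5): mu_{A->B}(x, y) = I(mu_A(x), mu_B(y))
    return [[impl(a, b) for b in mu_b] for a in mu_a]

def infer(mu_a_prime, R, t_norm=min):
    # Eq. (3): mu_{B'}(y) = sup_x T{ mu_{A'}(x), mu_R(x, y) }
    return [max(t_norm(ap, row[j]) for ap, row in zip(mu_a_prime, R))
            for j in range(len(R[0]))]

mu_A = [0.0, 0.5, 1.0, 0.5]        # A on a 4-point universe X
mu_B = [0.2, 1.0, 0.4]             # B on a 3-point universe Y
R = relation(mu_A, mu_B, lukasiewicz)
mu_B_prime = infer([0.0, 1.0, 0.5, 0.0], R)   # approx. [0.7, 1.0, 0.9]
```

Replacing `lukasiewicz` by a t-norm such as `min` itself turns the same sup-T composition into Mamdani-type reasoning.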
Neuro-fuzzy Systems, Table 1 Fuzzy implications

No  Name                    Implication I_fuzzy(a, b)
1   Kleene–Dienes (binary)  max{1 − a, b}
2   Łukasiewicz             min{1, 1 − a + b}
3   Reichenbach             1 − a + ab
4   Fodor                   1 if a ≤ b; max{1 − a, b} if a > b
5   Rescher                 1 if a ≤ b; 0 if a > b

Description of Fuzzy Inference Systems

Consider a rule base consisting of N fuzzy rules of the form

R^(k): IF x is A^k THEN y is B^k ,

where x = [x1, …, xn] ∈ X, y ∈ Y, A^k = A1^k × A2^k × ⋯ × An^k, and A1^k, A2^k, …, An^k are fuzzy sets characterized by membership functions μ_{Ai^k}(xi), i = 1, …, n, k = 1, …, N, whereas B^k are fuzzy sets characterized by membership functions μ_{B^k}(y), k = 1, …, N. The firing strength of the kth rule, k = 1, …, N, is defined by

τ_k(x̄) = T_{i=1}^{n} { μ_{Ai^k}(x̄i) } = μ_{A^k}(x̄) .
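The conditions of Definition 1 can be checked numerically on a grid. The sketch below is an illustration written for this text, not part of the original entry: it confirms that the Kleene–Dienes S-implication satisfies I1–I5, while the Zadeh implication, obtained from Eq. (8) with S = max, N(a) = 1 − a and T = min, violates I1 and I4 as stated above.

```python
def kleene_dienes(a, b):
    return max(1.0 - a, b)             # implication 1 in Table 1

def zadeh(a, b):
    return max(min(a, b), 1.0 - a)     # Q-implication from Eq. (8)

def check_conditions(impl, steps=11):
    grid = [i / (steps - 1) for i in range(steps)]
    ok = {"I1": True, "I2": True, "I3": True, "I4": True,
          "I5": impl(1.0, 0.0) == 0.0}
    for a1 in grid:
        ok["I3"] &= impl(0.0, a1) == 1.0
        ok["I4"] &= impl(a1, 1.0) == 1.0
        for a2 in grid:
            for a3 in grid:
                if a1 <= a3:
                    ok["I1"] &= impl(a1, a2) >= impl(a3, a2)
                if a2 <= a3:
                    ok["I2"] &= impl(a1, a2) <= impl(a1, a3)
    return ok
```

Running `check_conditions` on each implication of Table 1 reproduces the S-implication and R-implication claims of the text on this finite grid (a grid check can refute a condition but, of course, cannot prove it over all of [0, 1]).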
For a page i with di > 0, p_ij = α(1/di) + (1 − α)(1/N) if there is a link from page i to page j, and p_ij = (1 − α)(1/N) if there is no such link, where α = 0.85. For a page i with no outgoing links, that is, when di = 0, p_ij = 1/N for every j. This represents the idea that with probability α a surfer on a page which has at least one link randomly chooses (with equal probabilities) to follow one of the links on the page, and with probability 1 − α the surfer ignores the links and chooses to surf to a random page on the Web (again, with equal probabilities), while if there are no links on the page, the surfer just chooses a random Web page to surf to. Note that the matrix P = [p_ij] has positive entries, and it is row stochastic (i.e., each of its rows sums up to 1). For the web described in Fig. 1,

    [ 1/40   7/8    1/40   1/40   1/40   1/40  ]
    [ 1/40   1/40   19/80  19/80  19/80  19/80 ]
P = [ 1/6    1/6    1/6    1/6    1/6    1/6   ]
    [ 1/40   1/40   1/40   1/40   9/20   9/20  ]
    [ 1/40   1/40   1/40   9/20   1/40   9/20  ]
    [ 1/6    1/6    1/6    1/6    1/6    1/6   ]

The PageRank of a page i, PR(i), is defined as the probability that a random surfer will be at page i. In this model, the larger PR(i) is, the more important page i is. The relations between these PageRanks are:

PR(1) = p_11 PR(1) + ⋯ + p_N1 PR(N)
  ⋮
PR(N) = p_1N PR(1) + ⋯ + p_NN PR(N)

Let π = (PR(1), PR(2), …, PR(N))ᵀ be the (column) vector of PageRanks. By the above equations, the vector π satisfies π = Pᵀπ. That is, the PageRank vector π is an eigenvector of Pᵀ that corresponds to the eigenvalue 1, with nonnegative elements which sum up to 1. For the model to work we need to know that such an eigenvector exists, and that it is unique. This is guaranteed by the first major theorem on nonnegative matrices, which was proved by Perron in 1907 [43]. The Perron Theorem is a theorem on positive matrices. Its important extension to nonnegative matrices, proved by Frobenius in 1912 [23] and referred to as the Perron–Frobenius Theorem, requires the notion of irreducibility.
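The construction of P and the computation of π can be sketched as follows. The link structure used below (page 1 links to page 2; page 2 to pages 3–6; page 4 to pages 5 and 6; page 5 to pages 4 and 6; pages 3 and 6 have no outgoing links) is inferred from the matrix P displayed above, not stated explicitly in the text.

```python
from fractions import Fraction

N = 6
alpha = Fraction(85, 100)
links = {1: [2], 2: [3, 4, 5, 6], 3: [], 4: [5, 6], 5: [4, 6], 6: []}

def google_row(i):
    d = len(links[i])
    if d == 0:                       # dangling page: jump uniformly
        return [Fraction(1, N)] * N
    return [alpha / d * (1 if j in links[i] else 0) + (1 - alpha) / N
            for j in range(1, N + 1)]

P = [google_row(i) for i in range(1, N + 1)]
# Spot-check against the entries printed above.
assert P[0][1] == Fraction(7, 8) and P[1][2] == Fraction(19, 80)

# Power iteration for the fixed point pi = P^T pi (switching to floats).
pi = [1.0 / N] * N
for _ in range(200):
    pi = [sum(float(P[i][j]) * pi[i] for i in range(N)) for j in range(N)]
```

Exact `Fraction` arithmetic makes the row-stochasticity of P easy to verify, and the iteration converges because, as discussed next, the Perron Theorem applies to the positive matrix P.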
This is the natural place to use graph theory, but this was not done by Frobenius, who did not care much for the then-infant theory of graphs. In a very interesting article [48], Schneider describes the relations between irreducibility, the related concept of full indecomposability, and the Frobenius–König Theorem as they appear in the papers of Frobenius, König and Markov (after which
the highly important Markov Chains are named). He also touches on the acrimonious controversy that developed between Frobenius and König about their respective contributions to the topic. Without taking sides in this controversy, it is clear from a historical perspective that Frobenius underestimated the relevance of digraphs to the theory of nonnegative matrices. This, hopefully, is demonstrated in this survey.

Matrices, Graphs and Digraphs

While we concentrate here on the interrelations between nonnegative matrices and directed graphs, we mention briefly the relations between matrices and graphs, and include the basic terminology of graphs and digraphs that will be used in this article. A (simple, undirected) graph G is a pair G = (V, E), where V is a finite set and E is a set of unordered pairs of distinct elements of V. The elements of V are the vertices of G, and the elements of E are the edges of G. If e = {u, v} is an edge of G, we say that u (or v) and e are incident with each other, and that u and v are adjacent vertices in G. The degree of a vertex v ∈ V is the number of vertices adjacent to v. The added adjective 'simple' means that in these graphs there are no loops (that is, no edge joins a vertex to itself) and no multiple edges (that is, any two distinct vertices are joined by at most one edge). Several types of graphs may be associated with matrices. Given an m × n matrix A, its bipartite graph B(A) = (V, E) has m + n vertices: V = {r1, …, rm, c1, …, cn} (each vertex ri corresponds to a row of A, and each vertex cj corresponds to a column of A). {u, v} is an edge of B(A) if and only if u = ri and v = cj for some i and j with a_ij ≠ 0. With a symmetric n × n matrix A we may associate its graph G(A) = (V, E), with vertices V = {1, …, n} and {i, j} ∈ E if and only if i ≠ j and a_ij (= a_ji) ≠ 0. G(A) is a simple graph, and it conveys the zero-nonzero pattern of A, except for the diagonal entries of A.
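Constructing B(A) and G(A) from a matrix is straightforward. The following small sketch, with arbitrary example matrices chosen for this text, builds each graph as a set of edges:

```python
def bipartite_graph(A):
    # B(A): vertices r_1..r_m and c_1..c_n, with an edge {r_i, c_j}
    # (stored here as the pair (r_i, c_j)) iff a_ij != 0.
    return {(f"r{i+1}", f"c{j+1}")
            for i, row in enumerate(A) for j, a in enumerate(row) if a != 0}

def graph_of_symmetric(A):
    # G(A): edge {i, j} iff i != j and a_ij (= a_ji) != 0; diagonal ignored.
    n = len(A)
    return {(i + 1, j + 1) for i in range(n) for j in range(i + 1, n)
            if A[i][j] != 0}

A = [[1, 0, 2],
     [0, 3, 0]]          # a 2 x 3 matrix
S = [[5, 1, 0],
     [1, 0, 2],
     [0, 2, 7]]          # a symmetric 3 x 3 matrix
```

For these matrices, B(A) has the three edges {r1, c1}, {r1, c3}, {r2, c2}, and G(S) is the path 1–2–3; the nonzero diagonal entries of S leave no trace in G(S), as noted above.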
We may also associate matrices with given graphs. Given a graph G = (V, E) with vertices V = {v1, …, vn} and edges E = {e1, …, em}, its incidence matrix C = C(G) is the n × m matrix with c_ij = 1 if vi is incident with ej in G, and c_ij = 0 otherwise. The adjacency matrix of G, A = A(G), is the n × n (0, 1)-matrix with a_ij = 1 if and only if {vi, vj} ∈ E. (The diagonal of A(G) is zero.) Another matrix associated with G is its (combinatorial) Laplacian, L = L(G) = D(G) − A(G), where D = D(G) is the diagonal matrix of vertex degrees, i.e., d_ii is equal to the degree of the vertex vi. By scaling L one obtains the (normalized) Laplacian 𝓛 = 𝓛(G), defined as 𝓛 = D^(−1/2) L D^(−1/2) (for this purpose D^(−1) is actually D^†, the Moore–Penrose inverse of D: it is the diagonal matrix with D^†_ii = 0 if d_ii = 0, and D^†_ii = 1/d_ii if d_ii > 0). These matrices associated with the graph G are related: A = CCᵀ − D, and hence L = D − A = 2D − CCᵀ. The adjacency matrix A and the Laplacians L and 𝓛 are symmetric matrices, and thus have real eigenvalues. The relations between these eigenvalues and the structure of the graph G were extensively studied. In particular, L (and thus also 𝓛) is a singular positive semidefinite matrix; its smallest eigenvalue is zero (Le = 0, where e is the n-vector with all entries equal to 1). The second smallest eigenvalue of L was called by Fiedler the algebraic connectivity of G, motivated by the fact that the algebraic connectivity is nonzero if and only if the graph G is connected.

While there is a natural relation between symmetric matrices and graphs, general square matrices may be associated with digraphs. A digraph (directed graph) is a pair Γ = (V, E), where V is a finite set (we usually assume V = {1, …, n}) and E is a set of ordered pairs of (not necessarily distinct) elements of V. The elements of V are called the vertices of Γ; the elements of E are the arcs of Γ. We say that (i, j) ∈ E is an arc from i to j; i is its initial vertex or tail, and j is its terminal vertex or head. An arc (i, i) is called a loop (i is both its initial vertex and its terminal vertex). The number of arcs with i as an initial vertex is the outdegree of i. The number of arcs with i as a terminal vertex is the indegree of i. We will consider only digraphs with no multiple arcs; that is, for every two distinct vertices i and j there is at most one arc from i to j. A digraph which has no loops and no multiple arcs is called simple. So in this article, a simple digraph is simply a digraph with no loops. A weighted digraph is a digraph in which each arc (i, j) is assigned a weight ω_ij (usually a real or complex number).
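The identities A = CCᵀ − D and L = D − A = 2D − CCᵀ, together with Le = 0, can be checked on a small example. The path graph on four vertices below is an arbitrary choice made for this illustration:

```python
# Path graph on 4 vertices with edges {0,1}, {1,2}, {2,3}.
edges = [(0, 1), (1, 2), (2, 3)]
n, m = 4, len(edges)

C = [[1 if i in edges[j] else 0 for j in range(m)] for i in range(n)]  # incidence
A = [[1 if i != j and (min(i, j), max(i, j)) in edges else 0
      for j in range(n)] for i in range(n)]                            # adjacency
D = [[sum(A[i]) if i == j else 0 for j in range(n)] for i in range(n)] # degrees
L = [[D[i][j] - A[i][j] for j in range(n)] for i in range(n)]          # Laplacian

CCt = [[sum(C[i][k] * C[j][k] for k in range(m)) for j in range(n)]
       for i in range(n)]
assert all(A[i][j] == CCt[i][j] - D[i][j] for i in range(n) for j in range(n))
assert all(L[i][j] == 2 * D[i][j] - CCt[i][j] for i in range(n) for j in range(n))
assert all(sum(L[i]) == 0 for i in range(n))   # each row of L sums to 0: L e = 0
```

The row-sum check is exactly the statement Le = 0, i.e., that 0 is an eigenvalue of L with eigenvector e.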
A (directed) walk of length m from vertex i to vertex j in a digraph is a sequence of m arcs (i0, i1), (i1, i2), …, (i_{m−1}, i_m), with the initial vertex of each arc equal to the terminal vertex of its predecessor, and i0 = i, i_m = j. If i0 = i_m, it is a closed walk. If all the vertices in the walk are distinct, except, possibly, the first and last, the walk is called a (directed) path. When i_m = i0 the path is called a (directed) cycle. The distance from vertex i to vertex j is the length of the shortest path from i to j, and the diameter of a digraph is the largest distance between two vertices in the digraph. The girth of a digraph is the length of its shortest cycle. A digraph is strongly connected if for every two vertices i ≠ j there are walks from i to j and from j to i. Every digraph on one vertex is strongly connected.
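Distances and the girth can be computed by breadth-first search over the arcs. The digraph below, with one loop, is an illustrative example chosen for this text; its girth is 1, since a loop is a cycle of length 1.

```python
from collections import deque

arcs = {(0, 1), (1, 2), (2, 0), (2, 3), (3, 3)}   # (3, 3) is a loop

def dist(u, v):
    # Length of a shortest directed path from u to v (None if unreachable).
    seen, q = {u: 0}, deque([u])
    while q:
        x = q.popleft()
        if x == v:
            return seen[x]
        for (a, b) in arcs:
            if a == x and b not in seen:
                seen[b] = seen[x] + 1
                q.append(b)
    return seen.get(v)

# Girth: every shortest cycle closes some arc (u, v) with a shortest
# path from v back to u, so it has length dist(v, u) + 1.
girth = min(dist(v, u) + 1 for (u, v) in arcs if dist(v, u) is not None)
```

Here dist(0, 3) = 3 (the path 0 → 1 → 2 → 3), while vertex 0 is unreachable from vertex 3, so the diameter is not defined over all ordered pairs in this example.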
With each simple digraph Γ there is an associated underlying graph. The underlying graph G has the same vertices as Γ, and each arc (i, j) of Γ is replaced by the corresponding edge {i, j} in G. (No multiple edges, though: if (i, j) and (j, i) are both arcs of Γ, there is only one edge {i, j} in G.) Let Γ be a digraph with vertices V = {1, …, n} and arcs E = {e1, …, em}. Its incidence matrix Q = Q(Γ) is the n × m matrix with q_ij = 1 if i is the initial vertex of ej, q_ij = −1 if i is the terminal vertex of ej, and q_ij = 0 otherwise. (If Γ is a simple digraph with incidence matrix Q, and G is its underlying graph with combinatorial Laplacian L, then QQᵀ = L.) The adjacency matrix A = A(Γ) of a digraph Γ with vertices {1, …, n} is an n × n matrix with a_ij = 1 if and only if (i, j) ∈ E, and a_ij = 0 otherwise. The adjacency matrix is a square (0, 1)-matrix. On the other hand, given an n × n matrix A, the digraph of A, Γ(A), has vertices {1, …, n}, and (i, j) is an arc if and only if a_ij ≠ 0. We may also consider a weighted digraph associated with A by assigning the weight a_ij to the arc (i, j) of Γ(A). Obviously, Γ(A(Γ)) = Γ. Re-labeling the vertices of a digraph is equivalent to a simultaneous permutation of the rows and the columns of an associated matrix. Such a simultaneous permutation is achieved by pre-multiplying A by a permutation matrix P and post-multiplying it by Pᵀ. The digraph Γ(PAPᵀ) is obtained from the digraph Γ(A) by relabeling its vertices according to the permutation, and vice versa: if Γ′ is a digraph obtained from Γ(A) by re-labeling the vertices, then there exists a permutation matrix P such that Γ(PAPᵀ) = Γ′. A subdigraph of Γ is a digraph Γ′ = (V′, E′), where V′ ⊆ V and E′ ⊆ E. Such a Γ′ is an induced subdigraph of Γ if E′ = E ∩ (V′ × V′), i.e., E′ consists of all the arcs in E whose initial and terminal vertices are in V′.
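The correspondence between re-labeling the vertices of Γ(A) and the simultaneous permutation PAPᵀ can be verified directly. The 3-cycle and the transposition used below are arbitrary choices made for this sketch:

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]                      # Gamma(A): the directed 3-cycle
tau = [1, 0, 2]                      # a permutation of the indices (swap 0 and 1)
n = len(A)
P = [[1 if j == tau[i] else 0 for j in range(n)] for i in range(n)]
Pt = [[P[j][i] for j in range(n)] for i in range(n)]
B = matmul(matmul(P, A), Pt)

# Entrywise, (P A P^T)_{ij} = a_{tau(i), tau(j)}: (i, j) is an arc of
# Gamma(P A P^T) iff (tau(i), tau(j)) is an arc of Gamma(A).
assert all(B[i][j] == A[tau[i]][tau[j]] for i in range(n) for j in range(n))
```

With this permutation, the arcs of Γ(B) are (0, 2), (1, 0), (2, 1), i.e., the original 3-cycle with the labels 0 and 1 exchanged.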
The adjacency matrix of the subdigraph of Γ induced by V′ is a principal submatrix of the adjacency matrix A = A(Γ) (that is, a submatrix of A lying on rows and columns with the same indices). We denote this submatrix by A[V′]. The converse also holds: if Γ = Γ(A) for some matrix A, then the digraph of a principal submatrix of A is an induced subdigraph of Γ. In what follows, all matrices under discussion will be square, unless specifically stated otherwise. As mentioned before, we deal here only with nonnegative matrices and digraphs. For more on the connections between matrices and graphs, see e.g. Part II of [31] and the references there. For the many connections between the spectrum of the Laplacian of a graph and graph structure parameters see [18].
Non-negative Matrices and Digraphs
Irreducible Nonnegative Matrices and Their Digraphs
A matrix A is nonnegative (denoted A ≥ 0) if all its entries are nonnegative, and positive (denoted A > 0) if all its entries are positive. If A and B are matrices of the same order, we write A ≥ B if A − B ≥ 0 and A > B if A − B > 0. If A is an n × n matrix, then its spectrum consists of n (complex, not necessarily distinct) eigenvalues λ_1, …, λ_n. The spectral radius of A is ρ(A) = max{|λ_i| : 1 ≤ i ≤ n}. Let γ(A) = max{|λ_i| : 1 ≤ i ≤ n, λ_i ≠ ρ(A)}. With these notations, we can state Perron's Theorem [43]:

Theorem 1 (Perron Theorem) If A is a positive square matrix, then
a. ρ(A) > 0.
b. ρ(A) is a simple eigenvalue of A (called the Perron root of A).
c. To ρ(A) there corresponds a positive eigenvector. (Such an eigenvector, normalized so that the sum of its entries is 1, is called the Perron vector of A.)
d. γ(A) < ρ(A), that is, ρ(A) is the only eigenvalue of A of maximum modulus.
e. lim_{m→∞} (A/ρ(A))^m = L, where L = xy^T, Ax = ρ(A)x, x > 0, A^T y = ρ(A)y, y > 0, and x^T y = 1.
f. For every γ(A)/ρ(A) < r < 1 there exists a constant C = C(r, A) such that for every m, ‖(A/ρ(A))^m − L‖_∞ ≤ C r^m.

The Perron Theorem may be extended to a wider class of matrices. A matrix A is primitive if for some natural number m, A^m is positive.

Theorem 2 In Perron's Theorem (Theorem 1) "positive" may be replaced by "primitive".

The theorem can be generalized to irreducible matrices. An n × n matrix A is reducible if n > 1 and there exists a permutation matrix P such that

PAP^T = [ B  C
          0  D ] ,

with square diagonal blocks B and D, or n = 1 and A = 0. A matrix is irreducible if it is not reducible. Irreducibility may be described in terms of the digraph of A: a matrix A of order greater than 1 is irreducible if and only if Γ(A) is strongly connected. Moreover,

Theorem 3 Let A be an n × n nonnegative matrix, n > 1. Then the following are equivalent:
a. A is irreducible.
b. Γ(A) is strongly connected.
c. There is a positive integer m such that (I + A)^m > 0.
d. (I + A)^{n−1} > 0.

In Perron's Theorem (Theorem 1) parts a–c, "positive" may be replaced by "nonnegative and irreducible" – see parts a–c in Theorem 4 below. The new parts d–f of Theorem 4 also hold for such matrices (in particular for positive matrices). Theorem 4, the Perron–Frobenius Theorem, was first proved in [23].

Theorem 4 (Perron–Frobenius Theorem, part I) If A is an n × n nonnegative and irreducible square matrix, then
a. ρ(A) > 0.
b. ρ(A) is a simple eigenvalue of A (called the Perron root of A).
c. To ρ(A) there corresponds a positive eigenvector. (Such an eigenvector, normalized so that the sum of its entries is 1, is called the Perron vector of A.)
In addition,
d. If Ax = λx and the vector x is positive, then λ = ρ(A).
e. If B ≥ A and B ≠ A, then ρ(B) > ρ(A).
f. If 0 ≤ B ≤ A and B ≠ A, then ρ(B) < ρ(A).

If a nonnegative A is irreducible, so is A^T. The spectrum of A^T is equal to that of A. Hence there exists a Perron vector y of A^T. It satisfies y^T A = ρ(A)y^T. The Perron vectors x and y of A and A^T are referred to as the right- and left-Perron vectors of A, respectively. The following result follows from part f of Theorem 4.

Corollary 5 If B ≠ A is a principal submatrix of a nonnegative irreducible matrix A, then ρ(B) < ρ(A).

When A is irreducible and nonnegative, but not primitive, there may be several eigenvalues of maximum modulus. The number of eigenvalues of A with modulus ρ(A) is called the index of cyclicity of A.

Theorem 6 (Perron–Frobenius Theorem, part II) If A is a nonnegative and irreducible square matrix with index of cyclicity k > 1, then
a. The eigenvalues of A of modulus ρ(A) are ρ(A)e^{2πil/k}, l = 0, 1, …, k − 1.
b. The spectrum of A is mapped onto itself under rotation of the complex plane by 2π/k.
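The convergence statements in Theorem 1 (parts e–f) can be illustrated numerically: repeated multiplication by A drives any positive starting vector to the Perron vector. A short Python sketch, ours and not part of the article; the 2 × 2 positive matrix is an arbitrary example whose Perron root is 3 and whose Perron vector is (1/3, 2/3).

```python
# Illustrative sketch: power iteration exhibiting the Perron root and
# Perron vector of a positive matrix (example data is ours).
def power_iteration(A, steps=200):
    n = len(A)
    x = [1.0 / n] * n                 # positive start, entries sum to 1
    rho = 0.0
    for _ in range(steps):
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        rho = sum(y)                  # valid estimate since sum(x) == 1
        x = [yi / rho for yi in y]    # renormalize to a probability vector
    return rho, x

A = [[1, 1], [4, 1]]                  # positive, eigenvalues 3 and -1
rho, v = power_iteration(A)
print(round(rho, 6), [round(c, 6) for c in v])   # → 3.0 [0.333333, 0.666667]
```

Because γ(A)/ρ(A) = 1/3 here, the error decays like (1/3)^m, in line with part f.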
c. There exists a permutation matrix P such that

PAP^T = [ 0      A_12   0      …  0
          0      0      A_23   …  0
          ⋮      ⋮             ⋱  ⋮
          0      0      0      …  A_{k−1,k}
          A_k1   0      0      …  0 ] ,

where the zero diagonal blocks are square.

Corollary 7 A nonnegative irreducible matrix A is primitive if and only if its index of cyclicity is 1.

Using digraphs of nonnegative matrices simplifies the proof of the Perron–Frobenius Theorem. Digraphs are particularly useful in the computation of the index of cyclicity of an irreducible matrix [45]:

Theorem 8 (Romanovsky Theorem) Let A be an n × n irreducible nonnegative matrix, and for every 1 ≤ i ≤ n let d_i be the greatest common divisor of all the lengths of closed walks through i in Γ(A). Then all the d_i's are equal, and are equal to the index of cyclicity of A.

An actual computation of the index of cyclicity by this theorem may be very hard when A is a large matrix and Γ(A) has many cycles, but it can be done for small matrices:

Example 9 Let

A_1 = [ 0  0  0  1
        1  0  0  0
        0  1  0  0
        0  0  1  0 ]    and    A_2 = [ 1  0  0  1
                                       1  0  0  0
                                       0  1  0  0
                                       0  0  1  0 ] .

Then Γ(A_1) is the digraph shown on the left and Γ(A_2) is the digraph shown on the right in Fig. 2; both are strongly connected. For Γ(A_1) we have d_1 = d_2 = d_3 = d_4 = 4, and the index of cyclicity of A_1 is 4. For Γ(A_2), the existence of the loop at vertex 1 immediately implies that d_1 = d_2 = d_3 = d_4 = 1, and the matrix is primitive.

Non-negative Matrices and Digraphs, Figure 2 The digraphs of Example 9

Combining Romanovsky's Theorem and the Perron–Frobenius Theorem, part II, we get:

Corollary 10 A nonnegative irreducible matrix A is primitive if trace(A) > 0.

Every primitive matrix is irreducible (see Theorem 3). However, the converse is not true. For example,

[ 0  1
  1  0 ]

is irreducible and not primitive. Note that an n × n, n > 1, nonnegative matrix A is irreducible if and only if its digraph is strongly connected, and it is primitive if and only if there exists m such that for every pair of (not necessarily distinct) vertices u and v of Γ(A) there is a walk of length m from u to v. The smallest such m, that is, the smallest m for which A^m is positive, is called the exponent of A. For information about exponents see Sect. 3.5 in [14]. The index of cyclicity is sometimes called the period of the matrix. A primitive matrix is sometimes called an aperiodic matrix. We mention also that in a different approach, every 1 × 1 matrix is considered irreducible, even if it is zero. This corresponds well with the definition of every digraph on one vertex as a strongly connected digraph, but requires some changes in the statement of Theorem 4.

The above results are classical, and details may be found in many books, e.g., Chap. 2 in [7], or Chap. 8 in [33].

Special Irreducible Matrices
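Romanovsky's criterion is easy to apply mechanically to matrices like those of Example 9. The Python sketch below (our code, assuming the input matrix is irreducible) computes the same gcd via a breadth-first search: for a strongly connected digraph, the gcd of the quantities level(u) + 1 − level(v) over arcs (u, v) equals the gcd of closed-walk lengths through any vertex.

```python
from math import gcd
from collections import deque

# Sketch: index of cyclicity (period) of an irreducible matrix from its
# digraph; assumes the digraph is strongly connected.
def index_of_cyclicity(A):
    n = len(A)
    adj = [[j for j in range(n) if A[i][j]] for i in range(n)]
    level = [None] * n
    level[0] = 0
    q = deque([0])
    g = 0
    while q:
        u = q.popleft()
        for v in adj[u]:
            if level[v] is None:
                level[v] = level[u] + 1
                q.append(v)
            else:
                # an arc closing a walk contributes level[u] + 1 - level[v]
                g = gcd(g, level[u] + 1 - level[v])
    return g

A1 = [[0,0,0,1],[1,0,0,0],[0,1,0,0],[0,0,1,0]]   # a 4-cycle
A2 = [[1,0,0,1],[1,0,0,0],[0,1,0,0],[0,0,1,0]]   # 4-cycle plus a loop at 1
print(index_of_cyclicity(A1), index_of_cyclicity(A2))   # → 4 1
```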
Nearly Reducible Matrices

A strongly connected digraph Γ is minimally connected (or minimally strong) if removal of any arc of Γ results in a digraph which is not strongly connected. A matrix is nearly reducible if its digraph is minimally connected. That is, an n × n matrix A with n > 1 is nearly reducible if it is irreducible, and replacing any one nonzero entry by a zero yields a reducible matrix. The 1 × 1 zero matrix is the only 1 × 1 nearly reducible matrix (though by our definition it is also reducible). A minimally connected digraph has no loops, and a nearly reducible matrix has zero diagonal. Nearly reducible matrices have a specific zero–nonzero pattern, described inductively in the following theorem, due to Hartfiel [29]:

Theorem 11 Let A be an n × n nearly reducible matrix, n ≥ 2. Then there exist an integer 1 ≤ m < n and a permutation matrix P such that

PAP^T = [ B  C
          D  E ] ,

where B is an m × m matrix of the form

B = [ 0  ★  0  …  0
      0  0  ★  …  0
      ⋮  ⋮     ⋱  ⋮
      0  0  0  …  ★
      0  0  0  …  0 ] ,

where each ★ stands for a nonzero entry, C is an m × (n − m) matrix with a single nonzero entry, located in the last row, D is an (n − m) × m matrix with a single nonzero entry, located in the first column, and E is an (n − m) × (n − m) nearly reducible matrix.

The following theorem is a translation to nearly reducible matrices of a result of Gupta on minimally connected digraphs [27]:

Theorem 12 Let A be an n × n nearly reducible matrix, n > 1. Then the number of nonzero entries in A is at least n and at most 2(n − 1). The lower bound is attained when Γ(A) is an n-cycle, and only in this case. The upper bound is attained only when Γ(A) is a digraph obtained from a tree by replacing each edge {i, j} of the tree with two oppositely directed arcs, (i, j) and (j, i).

The results concerning nearly reducible matrices and minimally connected digraphs may be found in Sect. 3.3 in [14], and Sect. 4.5 in [42]. See also Sect. 29.8 in [31].

Fully Indecomposable Matrices

A concept related to irreducibility is that of full indecomposability. We should first mention that reducible and irreducible matrices are sometimes called decomposable and indecomposable, which explains the terminology of partly- and fully-decomposable matrices introduced below. An n × n matrix A is partly decomposable if n > 1 and A has a zero m × k submatrix with m and k positive integers such that m + k = n, or n = 1 and A = 0. That is, a square matrix A of order greater than 1 is partly decomposable if and only if there exist permutation matrices P and Q such that

PAQ = [ B  0
        C  D ] ,

with square diagonal blocks B and D. A square n × n matrix A is fully indecomposable if and only if it is not partly decomposable.
In terms of the bipartite graph of A: A is partly decomposable if there exist a subset S = {r_{i_1}, …, r_{i_m}} of the "row vertices" {r_1, …, r_n} and a subset T = {c_{j_1}, …, c_{j_{n−m}}} of the "column vertices" {c_1, …, c_n} such that there are no edges in B(A) from S to T. In terms of the digraph of A: A is partly decomposable if there exist subsets V_1 and V_2 of the vertex set {1, …, n}, not necessarily disjoint, of sizes m and k such that m + k = n, and such that there are no arcs in Γ(A) from V_1 into V_2. In comparison, the matrix A is reducible if there exist such subsets V_1 and V_2 which are disjoint and V_1 ∪ V_2 = {1, …, n}. Clearly a reducible matrix of order greater than 1 is also partly decomposable, and a fully indecomposable matrix is irreducible. The converse does not hold: the matrix

[ 0  1
  1  1 ]

is an example of a partly decomposable matrix (V_1 = V_2 = {1}) which is irreducible. However, irreducibility and full indecomposability are not too far apart, as the following results show. For the first of the two see Brualdi, Parter and Schneider [13], and for the second see Brualdi [12].

Theorem 13 A square matrix A of order greater than 1 is fully indecomposable if and only if for some permutation matrix P the matrix PA is irreducible with all the entries on its main diagonal nonzero.

Theorem 14 A square nonnegative matrix A of order greater than 1 is irreducible if and only if I + A is fully indecomposable.

Henrik Minc opened a course on nonnegative matrices that he gave at the Technion in 1974 (which was the basis for [42]) by stating that the Frobenius–König Theorem is the fundamental result on the zero pattern of square matrices.

Theorem 15 (Frobenius–König Theorem) A necessary and sufficient condition that every diagonal of an n × n matrix contains a zero is that the matrix contains an m × k zero submatrix with m + k = n + 1.

Recall that the permanent of a matrix A is given by

per(A) = Σ_{σ ∈ S_n} Π_{i=1}^{n} a_{iσ(i)} ,

where S_n is the group of all permutations on {1, …, n}. That is, it is the sum of all diagonal products of A.

Corollary 16 If A is a square nonnegative matrix and per(A) = 0, then A is partly decomposable.
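Both the permanent formula and the Frobenius–König condition can be checked by brute force on small matrices. A Python sketch (functions and example data are ours; feasible only for small n):

```python
from itertools import permutations, combinations

# Sketch: per(A) as the sum of all diagonal products, plus a brute-force
# search for an m x k zero submatrix with m + k = n + 1.
def permanent(A):
    n = len(A)
    total = 0
    for sigma in permutations(range(n)):
        p = 1
        for i in range(n):
            p *= A[i][sigma[i]]
        total += p
    return total

def has_fk_zero_submatrix(A):
    n = len(A)
    for m in range(1, n + 1):
        k = n + 1 - m                  # m + k = n + 1, both at least 1
        for rows in combinations(range(n), m):
            for cols in combinations(range(n), k):
                if all(A[i][j] == 0 for i in rows for j in cols):
                    return True
    return False

A = [[1, 1, 0],
     [1, 0, 0],
     [1, 0, 0]]                        # rows 2,3 vanish on columns 2,3
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(permanent(A), has_fk_zero_submatrix(A))     # → 0 True
print(permanent(I3), has_fk_zero_submatrix(I3))   # → 1 False
```

As Corollary 16 predicts, the matrix with vanishing permanent contains a 2 × 2 zero submatrix with 2 + 2 = n + 1.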
We mention two more results, due to Marcus and Minc [40]:

Theorem 17 A nonnegative fully indecomposable matrix is primitive. (The converse is not true:

[ 0  1
  1  1 ]

is primitive but not fully indecomposable.)

Theorem 18 A fully indecomposable n × n matrix can contain at most n(n − 2) zeros.

For details on fully indecomposable matrices see, e.g., Sect. II.5.4 in [41], and Sect. 4.2 in [14]. It is worth noting that the Frobenius–König Theorem played an important role in the proof of the Perron–Frobenius Theorem (see the discussion in [48]). Another important application of the Frobenius–König Theorem is Birkhoff's Theorem [9], which states that the set of all n × n doubly stochastic matrices, i.e. matrices that are both row stochastic and column stochastic, forms a convex polyhedron with the permutation matrices as vertices. A slightly more general form of the Frobenius–König Theorem is P. Hall's theorem on systems of distinct representatives [28].

Reducible Nonnegative Matrices and Their Digraphs

For reducible nonnegative matrices, there is also a strong connection between the spectral properties of the matrices and the graph-theoretic properties of their digraphs. This connection has been studied extensively, and we review here a sample of the results. If A is a reducible matrix, there is a permutation matrix P such that

PAP^T = [ B  0
          C  D ] .

If any of the square diagonal blocks is reducible, we may further reduce the matrix A, until we get a lower block-triangular matrix, with irreducible and/or 1 × 1 zero diagonal blocks, which is permutationally equivalent to A. This can also be described by the digraph of A.

Let Γ be any digraph on vertices V = {1, …, n}. We say that a vertex i has access to vertex j in Γ if i = j or there is a walk from i to j in Γ. We say that i and j communicate if i has access to j and j has access to i. Communication is an equivalence relation on V, and it therefore induces a partition of V into equivalence classes V_1, …, V_k. The induced digraphs Γ_i = Γ(V_i) are strongly connected. These are the strongly connected components of Γ.

Let R(Γ) be the digraph with vertices V_1, …, V_k and an arc from V_i to V_j, i ≠ j, if and only if there is a walk from a vertex in V_i to a vertex in V_j (and thus from any vertex in V_i to any vertex in V_j). R(Γ) is called the reduced graph of Γ.

Theorem 19 Let Γ be a digraph. Then R(Γ) is a simple digraph with no cycles, and its vertices may be labeled V_1, …, V_k so that if (V_i, V_j) is an arc of R(Γ) then i > j.

When Γ is Γ(A), the digraph of a matrix A, the equivalence classes of the communication relation on {1, …, n} are called the classes of A. The reduced graph of Γ(A) is denoted also by R(A) and referred to as the reduced graph of A. The rows and columns of A may be permuted according to the order of the classes described in Theorem 19. We thus get

Theorem 20 Let A be an n × n matrix. Then there exists a permutation matrix P such that

PAP^T = [ A_11  0     …  0
          A_21  A_22  …  0
          ⋮     ⋮     ⋱  ⋮
          A_k1  A_k2  …  A_kk ] ,        (1)

where each of the diagonal blocks A_11, …, A_kk is square and irreducible or a 1 × 1 zero matrix. The diagonal blocks are uniquely determined, up to a simultaneous permutation of rows and columns, but their ordering is not necessarily unique.

The block matrix (1) is the (lower block-triangular) Frobenius normal form of A. The spectrum of A is the union of the spectra of the diagonal blocks in the Frobenius normal form. In particular, the spectral radius of A, ρ(A), is an eigenvalue of A, and there exists a class V_i of A such that ρ(A[V_i]) = ρ(A). Such a class is called a basic class.

If there is a walk in R(A) from V_i to V_j or V_i = V_j, we say that the class V_i has access to the class V_j, and denote it by V_i ≥ V_j. We will write V_i > V_j if V_i has access to V_j and V_i ≠ V_j. We will also write i ≥ V_j if i is a vertex and either i ∈ V_j or there is a walk from i to a vertex in V_j. A class is final if it has no access to another class (i.e., its outdegree in R(A) is 0). A class is initial if no other class has access to it (that is, its indegree in R(A) is 0). The level of a basic class V_i is the maximal number of basic classes on a path in R(A) that terminates at V_i. The level of an initial basic class is 1.
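A Frobenius normal form can be computed by finding the strongly connected components and listing them so that each class follows the classes it has access to. The Python sketch below (our code, a hand-rolled Kosaraju-style pass; the 7 × 7 matrix is the one of Example 21 below) produces one valid ordering, which need not coincide with the labeling chosen in the example, since the ordering is not unique.

```python
# Sketch: Frobenius normal form via strongly connected components.
def sccs_in_access_order(A):
    n = len(A)
    adj  = [[j for j in range(n) if A[i][j]] for i in range(n)]
    radj = [[i for i in range(n) if A[i][j]] for j in range(n)]
    order, seen = [], [False] * n
    def dfs1(u):
        seen[u] = True
        for v in adj[u]:
            if not seen[v]:
                dfs1(v)
        order.append(u)
    for u in range(n):
        if not seen[u]:
            dfs1(u)
    comp = [None] * n
    def dfs2(u, c):
        comp[u] = c
        for v in radj[u]:
            if comp[v] is None:
                dfs2(v, c)
    c = 0
    for u in reversed(order):
        if comp[u] is None:
            dfs2(u, c)
            c += 1
    return [[u for u in range(n) if comp[u] == k] for k in range(c)]

def frobenius_normal_form(A):
    # reversed: accessed classes come first, so the permuted matrix is
    # lower block-triangular
    classes = sccs_in_access_order(A)[::-1]
    perm = [u for cl in classes for u in cl]
    B = [[A[s][t] for t in perm] for s in perm]
    return B, classes

A = [[0,1,0,4,0,0,0],
     [3,2,0,0,0,0,0],
     [0,0,0,0,0,5,3],
     [0,0,0,0,0,0,0],
     [0,0,0,0,0,7,0],
     [0,8,0,9,0,0,0],
     [0,0,3,0,0,0,0]]
B, classes = frobenius_normal_form(A)
print(classes)          # → [[3], [0, 1], [5], [2, 6], [4]]  (0-indexed)
```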
Example 21 Let

A = [ 0  1  0  4  0  0  0
      3  2  0  0  0  0  0
      0  0  0  0  0  5  3
      0  0  0  0  0  0  0
      0  0  0  0  0  7  0
      0  8  0  9  0  0  0
      0  0  3  0  0  0  0 ] .

Then Γ(A) is the digraph shown on the left in Fig. 3, and R(A) is the digraph shown on the right. The classes are {1, 2}, {3, 7}, {4}, {5} and {6}. The classes {5} and {3, 7} are initial classes, and {4} is a final class. One labeling that fits the requirements of Theorem 19 is V_1 = {4}, V_2 = {1, 2}, V_3 = {6}, V_4 = {5} and V_5 = {3, 7}. Let

P = [ 0  0  0  1  0  0  0
      1  0  0  0  0  0  0
      0  1  0  0  0  0  0
      0  0  0  0  0  1  0
      0  0  0  0  1  0  0
      0  0  1  0  0  0  0
      0  0  0  0  0  0  1 ]

be the permutation matrix corresponding to the permutation

( 1  2  3  4  5  6  7
  4  1  2  6  5  3  7 ) ;

then

PAP^T = [ 0  0  0  0  0  0  0
          4  0  1  0  0  0  0
          0  3  2  0  0  0  0
          9  0  8  0  0  0  0
          0  0  0  7  0  0  0
          0  0  0  5  0  0  3
          0  0  0  0  0  3  0 ]

is A's Frobenius normal form. The spectra of the diagonal blocks are {0}, {3, −1}, {0}, {0} and {3, −3}. Hence V_2 and V_5 are the basic classes of A. Finally, the basic classes V_2 and V_5 are of levels 2 and 1, respectively.

Non-negative Matrices and Digraphs, Figure 3 The digraph and reduced digraph of Example 21

The algebraic multiplicity of the Perron eigenvalue ρ(A) is obviously equal to the number of basic classes of A. The size of the largest Jordan block corresponding to ρ(A) can also be "read" from the reduced graph of A. Recall that the size of the largest Jordan block associated with the eigenvalue ρ(A) is the multiplicity of ρ(A) as a root of the minimal polynomial of A. It is called the index of the Perron eigenvalue ρ(A) of the matrix A, and we denote it by ν = ν(A). Equivalently, the index of ρ(A) is the smallest nonnegative integer ν such that N((ρ(A)I − A)^ν) = N((ρ(A)I − A)^{ν+1}), where N(B) denotes the nullspace of a matrix B. The following was proved in [46]:

Theorem 22 (Rothblum Index Theorem) Let A be a square nonnegative matrix. Then the maximal level of a basic class is equal to ν(A).

The matrix A in Example 21 thus has index ν(A) = 2. Theorem 22 implies the following (earlier) result by Schneider [47]:

Corollary 23 Let A be a square nonnegative matrix. Then in the Jordan canonical form of A there is only one block corresponding to the eigenvalue ρ(A) (equivalently, dim N(ρ(A)I − A) = 1) if and only if for every two basic classes in R(A), one of the two classes has access to the other.

When A is a reducible nonnegative matrix, there need not exist a positive eigenvector corresponding to ρ(A). The following theorem appeared in vol. II, p. 77 of [24]:

Theorem 24 Let A be a nonnegative matrix. There exists a positive eigenvector corresponding to the spectral radius of A if and only if the final classes of A are exactly the basic ones.

The algebraic eigenspace (or the Perron generalized eigenspace) of A, denoted by E(A), is defined to be N((ρ(A)I − A)^n) = N((ρ(A)I − A)^ν). The nonzero elements of the algebraic eigenspace of A are called generalized eigenvectors of A associated with the eigenvalue ρ(A). There are several results on the existence of various nonnegative bases of the algebraic eigenspace of a nonnegative matrix A. The first part of the next theorem is the Nonnegative Basis Theorem, due to Rothblum in [46], and the second is the Preferred Basis Theorem, due to Richman and Schneider [44]. For an n-vector x, supp(x) denotes the support of the vector x, that is, supp(x) = {1 ≤ i ≤ n | x_i ≠ 0}.

Theorem 25 Let A be a square nonnegative matrix with basic classes V_{i_1}, …, V_{i_m}. Then
a. There exists a set of vectors {x^1, …, x^m} in E(A) such that x^j ≥ 0 and

supp(x^j) = {i | i ≥ V_{i_j}} ,   1 ≤ j ≤ m .        (2)

Moreover, any such set of nonnegative vectors forms a basis of E(A).
b. There exists a basis for E(A) that satisfies, in addition to (2), also

(A − ρ(A)I)x^j = Σ_{V_{i_l} > V_{i_j}} c_{jl} x^l ,   1 ≤ j ≤ m ,        (3)
where the c_{jl} are positive.

A basis of E(A) which satisfies (2) and (3) is called a preferred basis. One preferred basis for the matrix A in Example 21 is {x^1, x^2}, where

(x^1)^T = [ 0  0  1  0  0  0  1 ]   and
(x^2)^T = [ 1/20  3/20  1  0  14/15  2/5  2/3 ] .
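The preferred-basis conditions for this pair can be checked by exact rational arithmetic. A Python sketch (our code; A is the matrix of Example 21): x^1 is a nonnegative eigenvector for ρ(A) = 3 supported on the vertices with access to V_5, and (A − 3I)x^2 = c·x^1 with the positive coefficient c = 1, as required by (3).

```python
from fractions import Fraction as F

# Sketch: exact check of the preferred-basis equations for Example 21.
A = [[0,1,0,4,0,0,0],
     [3,2,0,0,0,0,0],
     [0,0,0,0,0,5,3],
     [0,0,0,0,0,0,0],
     [0,0,0,0,0,7,0],
     [0,8,0,9,0,0,0],
     [0,0,3,0,0,0,0]]
x1 = [0, 0, 1, 0, 0, 0, 1]
x2 = [F(1,20), F(3,20), 1, 0, F(14,15), F(2,5), F(2,3)]

def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

print(matvec(A, x1) == [3 * c for c in x1])                     # → True
print([matvec(A, x2)[i] - 3 * x2[i] for i in range(7)] == x1)   # → True
```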
The existence of a nonnegative Jordan basis was also studied. Recall that a Jordan chain of length r corresponding to the eigenvalue λ of A is a sequence of r nonzero vectors x^0 = x, x^1 = (A − λI)x, …, x^{r−1} = (A − λI)^{r−1}x such that (A − λI)^r x = 0. A Jordan basis for the algebraic eigenspace E(A) of a nonnegative matrix A is a basis of E(A) consisting of several Jordan chains corresponding to the Perron eigenvalue ρ(A). As is well known, there always exists a Jordan basis for E(A). However, there does not always exist a nonnegative Jordan basis (i.e., a Jordan basis all of whose vectors are nonnegative) for the algebraic eigenspace of a nonnegative matrix A. Necessary and sufficient conditions for the existence of a nonnegative Jordan basis were given by Richman and Schneider [44]. To state the theorem, we need some definitions and notations. Let l_i denote the number of basic classes of level i. By the Rothblum Index Theorem, there is a basic class of level ν, and no basic class of higher level. We denote by l(A) the sequence of levels (l_1, l_2, …, l_ν). Let j(A) be the non-increasing sequence (j_1, j_2, …, j_r) of the sizes of the Jordan blocks corresponding to ρ(A) (of course, j_1 = ν). We need also the notion of a dual sequence. If α = (α_1, α_2, …, α_r) is a sequence of non-increasing positive integers, the dual sequence α* is obtained as follows: create a diagram of r left-aligned rows of dots, with α_i dots in the ith row from the top; let α* be the (non-increasing) sequence of column lengths of the diagram. For example, if α = (4, 3, 2, 2, 1), then α* = (5, 4, 2, 1), as seen in the following diagram:

• • • •
• • •
• •
• •
•

With these definitions, Richman and Schneider's theorem may now be stated.

Theorem 26 Let A be an n × n nonnegative matrix. Then the following are equivalent:
a. There exists a nonnegative Jordan basis for the algebraic eigenspace E(A).
b. For every 1 ≤ l ≤ ν there exists a nonnegative basis for N((A − ρ(A)I)^l).
c. l(A) = j(A)*.

For the matrix A of Example 21, there are two basic classes, one of level 2 and one of level 1. Hence l(A) = (1, 1). The index of A is 2, and the algebraic multiplicity of ρ(A) = 3 is also 2, so j(A) = (2) (see also Corollary 23). Clearly j(A)* = l(A), so there exists a nonnegative Jordan basis for the algebraic eigenspace E(A). Indeed, the preferred basis of A presented above is also a Jordan basis for E(A).

A reducible nonnegative matrix may have nonnegative eigenvectors associated with eigenvalues other than the Perron eigenvalue. An eigenvalue λ of A is a distinguished eigenvalue if it has a corresponding nonnegative eigenvector. A distinguished eigenvalue is necessarily nonnegative. A class V_i in R(A) is a distinguished class if ρ(A_ii) > ρ(A_jj) for every j ≠ i such that V_j has access to V_i. In Example 21 the basic class V_5 is distinguished, as is the (non-basic) class V_4, while the basic class V_2 and the classes V_1 and V_3 are not distinguished. The following theorem was proved by Victory [54], and was already implicit in the original paper by Frobenius [23].

Theorem 27 (Frobenius–Victory Theorem) Let A be an n × n nonnegative matrix. Then
a. A real number λ is a distinguished eigenvalue of A if and only if there exists a distinguished class V_i of A such that ρ(A_ii) = λ.
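The dual sequence used in condition c of Theorem 26 is simply the conjugate partition, and is easy to compute. A Python one-liner of ours, counting column lengths of the dot diagram:

```python
# Sketch: the dual (conjugate) of a non-increasing positive sequence.
def dual_sequence(alpha):
    if not alpha:
        return []
    return [sum(1 for a in alpha if a >= c) for c in range(1, alpha[0] + 1)]

print(dual_sequence([4, 3, 2, 2, 1]))   # → [5, 4, 2, 1], as in the text
print(dual_sequence([2]))               # → [1, 1]: j(A)* = l(A) in Example 21
```

Duality is an involution: applying it twice returns the original sequence.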
b. If V_i is a distinguished class of A and λ = ρ(A_ii), then there is a nonnegative eigenvector x corresponding to λ such that

supp(x) = {1 ≤ j ≤ n | j ≥ V_i in Γ(A)} .

The vector x with these properties is unique up to a positive scalar multiple.

For the matrix A in Example 21, in addition to ρ(A) = 3 the eigenvalue 0, which is ρ(A_44), is a distinguished eigenvalue. The only vertex with access in Γ(A) to V_4 = {5} is 5 itself, and

x = [ 0  0  0  0  1  0  0 ]^T

is the unique nonnegative eigenvector corresponding to 0 whose support is {5}.

Some remarks about the terminology: in Rothblum's paper, level was termed height. Later, level was used (see [30]), reserving the term height characteristic for the Weyr characteristic, that is, the sequence η(A) = (η_1, …, η_ν) with η_i = dim N((ρ(A)I − A)^i) − dim N((ρ(A)I − A)^{i−1}). The sequence l(A) is called the level characteristic of A, and the sequence j(A) of Jordan blocks is known as the Segré characteristic, e.g. [30]. Nonnegative bases are referred to as semipositive bases in many of the works mentioned in this section, where a semipositive matrix or vector is a nonnegative matrix or vector which is not equal to zero.

The results here are over 20 years old – see [49], where additional references may be found. Much work has been done on this subject in the years since then, which we don't cover here. In a series of papers by Tam, Tam and Wu, and Tam and Schneider, the above results, and more, were extended to cone-preserving maps, providing also alternative proofs for the case of nonnegative matrices. The latest installment in this series of papers is [51], where references can be found to the previous papers in the series, as well as to works of others. The existence of nonnegative bases, relations between characteristics of the reduced graph R(A) and characteristics of the matrix A, and extensions of such results to general (not necessarily nonnegative) matrices were further studied by Hershkowitz, Schneider and others; see [30] and the references therein. See also Sect. 9.3 in [31], and the references there.

Examples: Ranking Problems

Google PageRank Revisited

We now consider in more detail the Google PageRank example.
In the Google model, we may think of a weighted digraph W representing the links on the Web. Each of the N pages is represented by a vertex, each link is represented by an arc, and a weight is assigned to each arc: if the outdegree of a vertex i is d_i, the weight of each of the d_i arcs leaving i is 1/d_i. Let T be the matrix associated with this weighted digraph, that is, t_ij is 1/d_i if (i, j) is an arc in W, and 0 otherwise. Make T row stochastic by replacing each zero row by (1/N)e^T, where e is the N-vector with all entries equal to 1. Now Te = e. The Google matrix is

P = αT + (1 − α)·(1/N)J ,

where J is the (in this case, N × N) matrix of all ones and α = 0.85. By this representation of P it is easy to see that P is positive and that Pe = e (since both Te = e and (1/N)Je = e). This implies (by Theorem 4 d) that 1 is the Perron root of P. Since a matrix and its transpose have the same spectrum, 1 is also the Perron root of P^T. The existence of a unique Perron vector π for P^T is therefore guaranteed by the Perron Theorem. It is not known how exactly Google calculates (an approximation to) π, but probably some version of the power method is used. This is done essentially by starting with a vector x^0 and computing lim_{m→∞} x^m, where x^m = P^T x^{m−1} = (P^T)^m x^0. By Perron's Theorem (Theorem 1 e), for every vector x^0 whose entries sum up to 1, (P^T)^m x^0 converges to (e^T x^0)π = π. Part f of the Perron Theorem gives an idea about the rate of convergence: by a known bound for a row stochastic matrix (Chap. 5, Theorem 5.10 in [7]),

γ(P) ≤ 1 − Σ_{j=1}^{N} min_i p_ij .
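The construction just described can be sketched end to end on a toy example. The following Python code is ours (the 4-page "web" is invented, not Google's data): build T, form P, and iterate x^m = P^T x^{m−1}.

```python
# Toy sketch of the PageRank construction described above.
links = {0: [1, 2], 1: [2], 2: [0], 3: []}     # page 3 has no out-links
N, alpha = 4, 0.85

T = [[0.0] * N for _ in range(N)]
for i in range(N):
    outs = links[i]
    if outs:
        for j in outs:
            T[i][j] = 1.0 / len(outs)          # weight 1/d_i per arc
    else:
        for j in range(N):
            T[i][j] = 1.0 / N                  # zero row -> (1/N) e^T

# Google matrix P = alpha*T + (1 - alpha)*(1/N)*J
P = [[alpha * T[i][j] + (1 - alpha) / N for j in range(N)] for i in range(N)]

pi = [1.0 / N] * N                             # entries sum to 1
for _ in range(300):                           # x_m = P^T x_{m-1}
    pi = [sum(P[i][j] * pi[i] for i in range(N)) for j in range(N)]
print([round(p, 4) for p in pi])
```

The iterate stays a probability vector because P is row stochastic, and it converges at a rate roughly α^m, as discussed next.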
The construction of P and the Web structure imply that the quantity on the right-hand side is at most α. Hence the choice of α determines the rate of convergence of the power method: it is roughly the rate of convergence of α^m to zero (see Theorem 1 f).

In the definition of P, the matrix T is the crucial part to the idea of Google of ranking pages according to the hyperlink structure of the Web. It is T that contains the information on that hyperlink structure. But the matrix T is not irreducible. (The digraph representing the Web is certainly not strongly connected. Studies of the Web suggest that it has a large strongly connected component, "the core", but also three other, equally large, groups of vertices: one consisting of vertices which have access to the core, but no access from the core; another consisting of vertices which can be accessed from the core, but have no access to the core; and finally, there is a group of vertices which have no
access to or from the core [15].) The addition of the matrix (1/N)J may be explained as we did in Sect. "Matrices, Graphs and Digraphs", but technically it has the role of making the matrix P irreducible (in fact, positive), thus guaranteeing the existence of a unique PageRank vector. If we replace the matrix (1/N)J by a matrix E = ev^T, where v is a positive probability vector (i.e., v^T e = 1), the new matrix P would again be positive and row stochastic. Of course, a choice of v ≠ (1/N)e (E ≠ (1/N)J) means that certain pages are favored. Indeed, it is reported that Google now uses such biased perturbation matrices in the construction of P. The choice of α is the result of a balancing act: a large α means that more weight is given to the hyperlink structure of the Web; a small α means faster convergence to the PageRank vector.

PageRank is not the only Web information retrieval method that is based on the hyperlink structure of the Web and uses nonnegative matrices associated with the Web digraph. See [38] for an illuminating survey on three such methods, PageRank included.

Tournaments

Suppose that n players participate in a round-robin tournament. In such a tournament, every player plays against every other player once, and there is a clear winner in each of the games. The outcomes of the games can be represented by a digraph on n vertices, each representing a player, with (i, j) an arc if and only if player i beats player j. Such a digraph has no loops, and for every i ≠ j exactly one of (i, j) and (j, i) is an arc. It is called a tournament. The adjacency matrix of a tournament is called a tournament matrix. That is, a (0, 1)-matrix A is a tournament matrix if and only if A + A^T = J − I. Figure 4 shows a tournament on 4 vertices and next to it its adjacency matrix.

The problem is: given the outcomes of all games in a round-robin tournament, how does one rank the players? It seems that it is not enough to count the scores of the players.
As in the Google PageRank example, it seems reasonable to rate players not only by their scores, but also by the scores of the players they beat. The vector of scores of the n players is Ae. The vector of second-level scores of the players, whose ith component is the sum of the scores of the players defeated by player i, is A(Ae) = A²e. The vector of kth-level scores is A^k e. The entries of the kth-level scores grow large with k, but we are interested in their relative magnitude, so we may normalize each kth-level scores vector. By Theorem 2, if the matrix A is primitive, then

lim_{k→∞} A^k e / ρ(A)^k = s ,
Non-negative Matrices and Digraphs, Figure 4 A tournament and its tournament matrix
where s is a scalar multiple of the Perron vector of A. The vector s can then be used to rank the players in the tournament A: the larger s_i is, the higher is the rank of player i. For example, the tournament matrix in Fig. 4 is primitive, its Perron root is ρ = 1.3953, and [0.6256, 0.5516, 0.3213, 0.4484]^T is a corresponding eigenvector, which leads to the ranking of 1 as the top player, followed by 2, then 4, and finally 3 in the last place.

A tournament matrix need not be primitive or even irreducible; however, by the next theorem every irreducible tournament matrix of order greater than 3 is primitive.

Theorem 28 An n × n irreducible tournament matrix is primitive if and only if n ≥ 4.

Tournament matrices of orders 1 and 2 are reducible, and the only irreducible tournament matrices on 3 vertices are the matrices whose digraph is a 3-cycle. All these irreducible 3 × 3 tournament matrices are not primitive. It is quite obvious from Fig. 5 that in these tournaments all three players should have equal rank in any reasonable ranking method.

Here is how we may rank the players when the tournament matrix A is reducible. Let P be a permutation matrix such that

PAP^T = [ A_11  0     …  0
          A_21  A_22  …  0
          ⋮     ⋮     ⋱  ⋮
          A_k1  A_k2  …  A_kk ]
Non-negative Matrices and Digraphs, Figure 5 A 3-cycle tournament
is in Frobenius normal form, with each diagonal block square and irreducible (of size at least 3 × 3) or a 1 × 1 zero matrix. Let V_i, i = 1, …, k, be the classes of A (V_i corresponds to A_ii). Since A is a tournament matrix, each of the blocks below the diagonal has all entries equal to 1. Thus in this case V_i has access to V_j if and only if i > j (which means that each player in V_i beats each player in V_j for every j < i). Now if |V_i| = 3, rank all the players in V_i as equal. If |V_i| > 3, rank the players in V_i according to the Perron vector of A_ii; and for i > j, rank all players in V_i higher than the players in V_j. This is the Kendall–Wei ranking ([34,56]).

For the results of this section see Sect. 10.7 in [10]. See also [22] and the references therein for more on tournament matrices and ranking tournament players.
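The kth-level-score ranking can be sketched in a few lines of Python. The code below is ours; since Fig. 4 is not reproduced here, the 4 × 4 tournament matrix is one reconstruction consistent with the Perron data quoted above (ρ ≈ 1.3953 and the stated eigenvector), though the actual figure may label the players differently.

```python
# Sketch: ranking a tournament by normalized kth-level scores A^k e.
A = [[0, 1, 1, 0],    # player 1 beats 2 and 3
     [0, 0, 1, 1],    # player 2 beats 3 and 4
     [0, 0, 0, 1],    # player 3 beats 4
     [1, 0, 0, 0]]    # player 4 beats 1

def level_scores(A, k):
    n = len(A)
    s = [1.0] * n                     # s = A^k e after k multiplications
    for _ in range(k):
        s = [sum(A[i][j] * s[j] for j in range(n)) for i in range(n)]
    m = max(s)
    return [c / m for c in s]         # only relative magnitudes matter

s = level_scores(A, 60)
ranking = [i + 1 for i in sorted(range(4), key=lambda i: -s[i])]
print(ranking)                        # 1-indexed players, best first
```

For a primitive tournament matrix the normalized scores converge to the Perron direction, so the order stabilizes; here it reproduces the ranking 1, 2, 4, 3 quoted in the text.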
The Laplacian Matrix of a Directed Graph

Let Γ be a digraph with no loops on vertices {1, …, n}. Let A = A(Γ) be the adjacency matrix of Γ, and let D = D(Γ) be the diagonal matrix of outdegrees of the vertices of Γ (that is, d_ii is the outdegree of the vertex i). The matrix L(Γ) = D − A is the Laplacian matrix of Γ. Observe that L(Γ)e = 0. If Γ is a weighted digraph, we may replace the matrix A with the matrix Ã = Ã(Γ) whose i,j entry is the weight of the arc (i,j), and the diagonal matrix of outdegrees by the diagonal matrix D̃ = D̃(Γ) whose i,i entry is the sum of the weights of the arcs with i as initial vertex. The Laplacian of the weighted digraph is L̃(Γ) = D̃(Γ) − Ã(Γ). An arborescence is a digraph with no cycles, in which all the vertices but one vertex u have outdegree 1, and u has outdegree 0. The vertex u is called the root of the arborescence. That is, an arborescence is a digraph whose underlying graph is a tree, in which all the arcs are directed towards the root u. A spanning arborescence of a digraph Γ is a subdigraph of Γ with the same set of vertices as Γ, which is an arborescence. The Matrix-Tree Theorem below is due to Tutte [52].

Theorem 29 (The Matrix-Tree Theorem) Let L̃(Γ) be the Laplacian matrix of a weighted simple digraph Γ. Let L̃_i(Γ) be the principal submatrix of L̃(Γ) obtained by deleting the ith row and ith column. Then det(L̃_i(Γ)) is the sum of weights of all spanning arborescences of Γ rooted at the vertex i, where the weight of a subdigraph is the product of the weights of its arcs.

A non-weighted digraph Γ may be considered as having weight 1 assigned to each arc. With these weights, L̃(Γ) = L(Γ). Thus we have:
Corollary 30 Let L(Γ) be the Laplacian matrix of a simple digraph Γ. Let L_i(Γ) be the principal submatrix of L(Γ) obtained by deleting the ith row and ith column. Then det(L_i(Γ)) is the number of spanning arborescences of Γ rooted at the vertex i.

For the results here see Sect. 9.6 in [14], and Chap. VI in [53]. They are generalizations of Kirchhoff's Theorem on the (combinatorial) Laplacian of a graph. A more general form of the Matrix-Tree Theorem is the All Minors Matrix-Tree Theorem, which relates minors of order n − k of the Laplacians to sums of weights of certain spanning (directed) forests consisting of k disjoint arborescences. See e.g. [16] and the references there. See also [17] and the references there for more on such spanning forests and their relations to the Laplacian. As mentioned in Sect. "Matrices, Graphs and Digraphs" of this article, for graphs there are many results on the connections between the spectrum of the Laplacian and the structure of the graph (see [18,21] and the references therein). There are not many such results concerning the spectrum of the Laplacian of a digraph. Attempts in this direction have only begun in recent years, and we will touch upon these briefly in the next section, on research directions.

Research Directions

We conclude by mentioning a few directions in current research.

Relating analytic properties of a nonnegative matrix to properties of its digraph: The Romanovsky Theorem is an (old) example of such a theorem. More recent results in this spirit can be found in [35,36,37]. In the first of these, a lower bound on the second largest eigenvalue of a stochastic matrix A is given in terms of the girth (i.e., the length of the shortest cycle) of its digraph. In the second and third, directed graphs are used in measuring the sensitivity of stationary distributions (i.e., the Perron left eigenvector of the transition matrix) under perturbations of the transition matrix.
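Corollary 30 can be checked directly on a small example. The 3-vertex digraph below (arcs 1→2, 1→3, 2→3, 3→1, written 0-based in the code) is a made-up illustration, not one of the article's figures; deleting the ith row and column of L = D − A and taking the determinant counts the spanning arborescences rooted at i:

```python
# Count spanning arborescences of a small digraph via Corollary 30.
# Vertices are 0, 1, 2; arcs as 0-based pairs.
arcs = [(0, 1), (0, 2), (1, 2), (2, 0)]
n = 3

# Adjacency matrix A and Laplacian L = D - A (D = diagonal outdegrees).
A = [[0] * n for _ in range(n)]
for u, v in arcs:
    A[u][v] = 1
L = [[(sum(A[i]) if i == j else 0) - A[i][j] for j in range(n)] for i in range(n)]

def det(M):
    # Laplace expansion along the first row (fine for tiny matrices)
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def arborescences_rooted_at(i):
    # principal submatrix of L with row i and column i deleted
    Li = [[L[r][c] for c in range(n) if c != i] for r in range(n) if r != i]
    return det(Li)

counts = [arborescences_rooted_at(i) for i in range(n)]
```

A hand count confirms the result: rooted at vertex 0 only {1→2, 2→0} works, rooted at 1 only {2→0, 0→1}, while rooted at 2 both {0→2, 1→2} and {0→1, 1→2} are spanning arborescences.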
Spectral theory for digraphs: One disadvantage of the Laplacian of a directed graph, compared to that of a graph, is that it is not symmetric and need not have real eigenvalues. Fan Chung [19] and Chai Wah Wu [57] independently derived from the Laplacian L = D − A of a digraph a symmetric matrix whose eigenvalues may be considered instead of the eigenvalues of the original Laplacian. Following Chung, for a strongly connected digraph let

    S = (ΦP + P^T Φ) / 2 ,
where P = D^{-1} A and Φ is the diagonal matrix with the left Perron vector of P on its diagonal. By scaling S we get a normalized version:

    S̃ = Φ^{-1/2} S Φ^{-1/2} .
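The construction of S can be sketched numerically. The 3-vertex strongly connected digraph below is a hypothetical example; φ is obtained by iterating φ ← φP, and the resulting S = (ΦP + P^T Φ)/2 is symmetric as promised:

```python
# Symmetrize digraph Laplacian data following the Chung/Wu construction:
# P = D^{-1} A, phi = left Perron (stationary) vector of P,
# S = (Phi P + P^T Phi) / 2 with Phi = diag(phi).
# The digraph (arcs 0->1, 0->2, 1->2, 2->0) is a made-up example.
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
n = 3
P = [[A[i][j] / sum(A[i]) for j in range(n)] for i in range(n)]

# Left Perron vector of the stochastic matrix P by iterating phi <- phi P.
phi = [1.0 / n] * n
for _ in range(500):
    phi = [sum(phi[i] * P[i][j] for i in range(n)) for j in range(n)]

# S = (Phi P + P^T Phi) / 2; symmetry holds by construction.
S = [[(phi[i] * P[i][j] + P[j][i] * phi[j]) / 2 for j in range(n)] for i in range(n)]
```

For this example the stationary vector is φ = (0.4, 0.2, 0.4); the iteration converges because the chain is irreducible and aperiodic.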
The eigenvalues of S (see [57]; Wu's matrix is actually a scalar multiple of S) or of S̃ (see [19,20]) may be considered the directed-graph equivalents of the eigenvalues of the combinatorial or normalized Laplacian of a graph. Both Chung and Wu examine connections between these eigenvalues (mainly the second smallest one) and some digraph parameters (such as the diameter and the Cheeger constant). This study seems to be at its beginning.

Searching the Web and Google: There is a lot of activity in analyzing the Google matrix, the ideas behind Google, possible improvements and alternative approaches. The best way to find the current state of affairs of these subjects is to google related terms (try that on http://scholar.google.com).

Bibliography

1. Alon U (2006) An Introduction to Systems Biology: Design Principles of Biological Circuits. CRC Mathematical and Computational Biology Series. Chapman, Boca Raton
2. Bang-Jensen J, Gutin G (2001) Digraphs: Theory, Algorithms and Applications. Springer, London
3. Bapat RB, Raghavan TES (1997) Nonnegative Matrices and Applications. Encyclopedia of Mathematics and its Applications 64. Cambridge University Press, Cambridge
4. Beineke LW, Wilson RJ (2004) Topics in Algebraic Graph Theory. Encyclopedia of Mathematics and its Applications 102. Cambridge University Press, Cambridge
5. Berman A (2003) Graphs of matrices and matrices of graphs. Numer Math J Chin Univ (Engl Ser) 12(suppl):12–14
6. Berman A, Neumann M, Stern RJ (1989) Nonnegative Matrices in Dynamical Systems. Wiley-Interscience, New York
7. Berman A, Plemmons R (1994) Nonnegative Matrices in the Mathematical Sciences. SIAM, Philadelphia
8. Berman A, Shaked-Monderer N (2003) Completely Positive Matrices. World Scientific, River Edge
9. Birkhoff G (1946) Tres observaciones sobre el álgebra lineal. Univ Nac Tucumán Rev Ser A 5:147–150
10. Bondy JA, Murty USR (1976) Graph Theory with Applications. North-Holland, New York
11. Brin S, Page L (1998) The anatomy of a large-scale hypertextual Web search engine. Comput Netw ISDN Syst 33(1–7):107–117
12. Brualdi RA (1979) Matrices permutation equivalent to irreducible matrices and applications. Linear Multilinear Algebra 7:1–12
13. Brualdi RA, Parter SV, Schneider H (1966) The diagonal equivalence of a nonnegative matrix to a stochastic matrix. J Math Anal Appl 16:31–50
14. Brualdi RA, Ryser HJ (1991) Combinatorial Matrix Theory. Encyclopedia of Mathematics and its Applications 39. Cambridge University Press, Cambridge
15. Broder A, Kumar R, Maghoul F, Raghavan P, Rajagopalan S, Stata R, Tomkins A, Wiener J (2000) Graph structure in the Web. Comput Netw 33(1–3):309–320
16. Chaiken S (1982) A combinatorial proof of the all-minors matrix tree theorem. SIAM J Alg Dis Meth 3:319–329
17. Chebotarev P, Agaev R (2002) Forest matrices around the Laplacian matrix. Linear Algebra Appl 356:253–274
18. Chung FRK (1997) Spectral Graph Theory. CBMS Regional Conference Series in Mathematics 92. AMS, Providence
19. Chung F (2005) Laplacians and the Cheeger inequality for directed graphs. Ann Comb 9:1–19
20. Chung F (2006) The diameter and Laplacian eigenvalues of directed graphs. Electron J Combin 13(4):6
21. Cvetković D, Doob M, Sachs H (1980) Spectra of Graphs: Theory and Applications. Academic Press, New York
22. Eschenbach C, Hall F, Hemasinha R, Kirkland S, Li Z, Shader B, Stuart J, Weaver J (2000) Properties of Tournaments among Well-Matched Players. Amer Math Month 107(10):881–892
23. Frobenius G (1912) Über Matrizen aus nicht negativen Elementen. Sitzungsbericht, Preussische Akademie der Wissenschaften, Berlin, pp 456–477
24. Gantmacher FR (1959) The Theory of Matrices (translated from Russian). Chelsea, New York
25. Godsil C, Royle G (2001) Algebraic Graph Theory. Graduate Texts in Mathematics 207. Springer, New York
26. Gulli A, Signorini A (2005) The indexable web is more than 11.5 billion pages. In: Proc. 14th WWW (Posters), pp 902–903
27. Gupta RP (1967) On basis digraphs. J Combin Theory 3:309–311
28. Hall P (1935) On representatives of subsets. J London Math Soc 10:26–30
29. Hartfiel DJ (1970) A simplified form for nearly reducible and nearly decomposable matrices. Proc Amer Math Soc 24:388–393
30. Hershkowitz D (1999) The combinatorial structure of generalized eigenspaces – from nonnegative matrices to general matrices. Linear Algebra Appl 302–303:173–191
31. Hogben L (2007) Handbook of Linear Algebra. Discrete Mathematics and Its Applications Series 39. CRC Press, Boca Raton
32. van der Holst H, Lovász L, Schrijver A (1999) The Colin de Verdière graph parameter. In: Graph Theory and Computational Biology. Bolyai Soc Math Stud 7:29–85
33. Horn RA, Johnson CR (1985) Matrix Analysis. Cambridge University Press, Cambridge
34. Kendall MG (1955) Further contributions to the theory of paired comparisons. Biometrics 11:43–62
35. Kirkland S (2004) Digraph-based conditioning for Markov chains. Linear Algebra Appl 385:81–93
36. Kirkland SJ (2004) A combinatorial approach to the conditioning of a single entry in the stationary distribution for a Markov chain. Electron J Linear Algebra 11:168–179
37. Kirkland SJ (2004/5) Girth and subdominant eigenvalues for stochastic matrices. Electron J Linear Algebra 12:25–41
38. Langville AN, Meyer CD (2005) A survey of eigenvector methods of Web information retrieval. SIAM Rev 47(1):135–161
39. Lee D, Seung H (2001) Algorithms for nonnegative matrix factorization. Adv Neural Process 13:556–562
40. Marcus M, Minc H (1963) Disjoint pairs of sets and incidence matrices. Illinois J Math 7:137–147
41. Marcus M, Minc H (1964) A Survey of Matrix Theory and Matrix Inequalities. Allyn and Bacon, Boston
42. Minc H (1988) Nonnegative Matrices. Wiley-Interscience Series in Discrete Mathematics and Optimization. Wiley, New York
43. Perron O (1907) Zur Theorie der Matrizen. Math Ann 64:248–263
44. Richman D, Schneider H (1978) On the singular graph and the Weyr characteristic of an M-matrix. Aequationes Math 17:208–234
45. Romanovsky V (1936) Recherches sur les chaînes de Markoff. Acta Math 66:147–251
46. Rothblum UG (1975) Algebraic eigenspaces of nonnegative matrices. Linear Algebra Appl 12:281–292
47. Schneider H (1956) The elementary divisors associated with 0 of a singular M-matrix. Proc Edinburgh Math Soc 10(2):108–122
48. Schneider H (1977) The concepts of irreducibility and full indecomposability of a matrix in the works of Frobenius, König and Markov. Linear Algebra Appl 18:139–162
49. Schneider H (1986) The influence of the marked reduced graph of a nonnegative matrix on the Jordan form and on related properties: a survey. Linear Algebra Appl 84:161–189
50. Seneta E (1981) Nonnegative Matrices and Markov Chains, 2nd edn. Springer Series in Statistics. Springer, New York
51. Tam BS (2004) The Perron generalized eigenspace and the spectral cone of a cone-preserving map. Linear Algebra Appl 393:375–429
52. Tutte WT (1948) The dissection of equilateral triangles into equilateral triangles. Proc Cambridge Philos Soc 44:463–482
53. Tutte WT (1980) Graph Theory. Encyclopedia of Mathematics and its Applications 21. Cambridge University Press, Cambridge
54. Victory HD Jr (1985) On nonnegative solutions to matrix equations. SIAM J Alg Dis Meth 6:406–412
55. Wasserman S, Faust K (1994) Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge
56. Wei TH (1952) The Algebraic Foundations of Ranking Theory. PhD Thesis, Cambridge University
57. Wu CW (2005) On Rayleigh–Ritz ratios of a generalized Laplacian matrix of directed graphs. Linear Algebra Appl 402:207–227
Non-standard Analysis, an Invitation to
Wei-Zhe Yang
Department of Mathematics, National Taiwan University, Taipei, Taiwan

Article Outline

Glossary
Definition of the Subject
Introduction
Pre-Robinson Infinitesimals
Hyperreals
General Idea of NSA
Loeb Construction
A Future for Non-standard Analysis?
Bibliography

Glossary

Hyperreal A proper extension of the real numbers, containing infinities and infinitesimal numbers, such that a transfer principle allows first-order theorems in the reals to be extended therein.
Infinity An element in an ordered field which is larger than 1 + 1 + ⋯ + 1 for any number of 1's.
Infinitesimal An element in an ordered field whose absolute value is less than 1/n for any positive integer n.
Non-archimedean Describes an ordered field in which infinitesimal or infinite values exist.
Transfer principle A rule which transforms assertions about standard sets, mappings, etc., into ones about internal sets and mappings. Intuitively, it is an elementary embedding between structures.

Definition of the Subject

Generically speaking, non-standard analysis is any form of mathematics that relies on non-standard models and the transfer principle. However, it is likely safe to say that the most common meaning of the term "non-standard analysis" is a rigorous formulation of analysis using infinitesimals, as an alternative to the analysis we all know in the ε-δ formulation. Abraham Robinson is usually regarded as the first person to do this coherently [12]. It is sometimes claimed that non-standard analysis can achieve more than standard models [2]. This may not be entirely true, but philosophically, non-standard analysis justifies the standard pedagogical process (infinitesimals in Leibniz notation) in which most people learned calculus, and represents the completion of a key developmental stage of science.

Introduction

This article, written by a (general) mathematician, is aimed at a reader with some mathematical maturity. There are different constructions of non-standard analysis, and we have chosen to start with the "ultrapower" construction of the semantic kind, which is not as difficult to comprehend as the original presentation.
Warning 1 What a (general) undergraduate is to a student of mathematics is what a (general) mathematician is to a logician.
Differential = Infinitesimal Difference?

Suppose a freshman is questioned on the derivative dy/dx for the cubic function y = x³. Then, after the difference quotient is found,

    Δy/Δx = 3x² + 3x(Δx) + (Δx)² ,

he just says "substituting Δx = 0", dy/dx = 3x². And the instructors often have trouble explaining the difference between "Δx set to 0" and "lim_{Δx→0}". It is estimated that, among the mathematically literate (i.e., those who learnt calculus), less than 10% are competent with "ε-δ-logy". But, of course, they are assured that what they learnt is no nonsense. The calculus we learn is from Leibniz, not Seki Kowa (even if the textbook is in Japanese), and not even Newton, who must have the priority. And the discoveries of calculus and of rigorous analysis (chiefly of Cauchy) are some 150 years apart. In essence, the success of Leibniz is due to his elegant and mnemonic notation, centered around the notion of differential. When a freshman understands that "the differential of y", written as dy, is "the infinitesimal difference of y", he deals with calculus problems just like doing routine high school algebra. Indeed calculus as envisaged by Leibniz is a computation scheme in a number system, denoted by *R below. This system of "hyperreals" must include, besides the usual real numbers, infinitesimals. Leibniz conceived of the notion of infinitesimals out of necessity, because he was pre-Cauchy. He needed the infinitesimals (as differentials) in the calculus. But from the very beginning many people, notably Bishop G. Berkeley, "discover impossibilities and contradictions". Infinitesimals cannot be, simultaneously, zero and non-zero. If they are zero, the division is illegal; if they are not, then when y = x³, dy/dx = 3x² is wrong because 3x·dx + (dx)² does not just evaporate away. Of course this is a contradiction that Leibniz, who was a first-rate philosopher-logician, could not fail to discover. Thus calculus as a science was, for many people before Cauchy, recognized only de facto and not de jure, just like the nations of Taiwan and Kosovo.

Non-standard Analysis, an Invitation to, Table 1  Notations of this chapter

Notation            Meaning
iff                 if and only if
¬, ∧, ∨, ⇒          "negation of", "and", "or", "imply"
∀, ∃                "for all", "exists"
:=                  "defined to be equal to"
R, C, Z             set of reals, complex numbers, integers
N                   set of natural numbers (zero excluded)
(a…b), [a…b]        open and closed intervals
(a…b], [a…b)        semi-open intervals
{ }                 the braces holding a set
⟨ ⟩                 angles holding a sequence
∪, ⊔, ∩, \          set union, disjoint union, intersection, set difference
P(A)                power set (set of all subsets, including A and ∅)
R^N_β               set of all bounded sequences
R^N_c               set of all convergent sequences
R^N_0               set of all sequences converging to zero (= vanishing)
R^N_•               set of all eventually-zero (= nullifying) sequences
R•                  = R \ {0}, non-zero reals
R+                  = [0…∞), non-negative reals
R̈                   = R ⊔ {−∞, +∞}, R with signed infinities
Ṙ                   = R ⊔ {∞}, R with unsigned infinity
*R                  (one possible) hyperreal number system
U                   ultrafilter; also denotes (as superscript) the ultrapower construction
σ                   superscript denoting the set of standard elements
π_U                 the projection for the ultrapower construction

Structure of This Article

In the next Sect. "Pre-Robinson Infinitesimals", we will tell you how people tried to handle infinitesimals before [12]. In Sect. "Hyperreals", we define the hyperreal field using the ultrapower construction. Section "General Idea of NSA" builds a small framework of non-standard analysis using hyperreals, and Sect. "Loeb Construction" introduces the Loeb construction for Wiener processes in NSA.
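The Leibniz computation above, and its modern resolution, can be written out explicitly. The notation st for the standard-part map (assigning to a limited hyperreal the unique real infinitely close to it) is the usual one in the non-standard analysis literature, not introduced in this excerpt:

```latex
% Leibniz computation for y = x^3 with an infinitesimal dx \neq 0:
\[
  \frac{dy}{dx} \;=\; \frac{(x+dx)^3 - x^3}{dx}
            \;=\; 3x^2 + 3x\,dx + (dx)^2 .
\]
% The quotient is not exactly 3x^2; it differs from it by the
% infinitesimal 3x\,dx + (dx)^2. Taking the real number infinitely
% close to the quotient (its standard part) removes that error:
\[
  \operatorname{st}\!\left(\frac{(x+dx)^3 - x^3}{dx}\right) \;=\; 3x^2 .
\]
```

This is how non-standard analysis answers Berkeley's objection: dx is never set to zero, and the leftover infinitesimal is discarded by st, not by fiat.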
Recap: Naive Set Notations

We assume the concept of a set without recourse to the axioms of set theory.

Definition 1 (Binary Boolean Set-Operations) For two sets A and B, there are these Boolean combinations of them:

    A ∪ B := {x : x ∈ A or x ∈ B} ,
    A ∩ B := {x : x ∈ A and x ∈ B} ,
    A \ B := {x : x ∈ A and x ∉ B} .

Definition 2 (Function) Let A and B be two non-empty sets. If for any x ∈ A there is correspondingly assigned one y ∈ B, written as f(x), then we call this assignment f a function. Its domain is A, its codomain B. The notation is f : A → B, or f : x ∈ A ↦ f(x) ∈ B. Also, for C ⊆ A, or for D ⊆ B, we write

    f[C] := {f(x) : x ∈ C} ;
Pre-Robinson Infinitesimals

Herein we describe the work prior to Robinson (apart from, say, the work of [13]), who is credited with being the creator of the field.
    f^{-1}[D] := {x ∈ A : f(x) ∈ D} .

If f[A] = B, then the function f is surjective. If f(x) = f(u) implies x = u, then the function f is injective. When
the function is both injective and surjective, it is called bijective. Sometimes the codomain of a function is just understood, but suppressed, and the function is then written as an "indexed family" ⟨f_j : j ∈ I⟩. In this case, we write f_j instead of f(j), and the domain of definition, here I, is called the indexing set; it is often omitted when no confusion is likely. When the indexing set is N (and in a few other situations) the indexed family is called a sequence.

Definition 3 (Cartesian Product, Couple, Triple, and Sequence) For two non-empty sets A, B, their Cartesian product is defined as

    A × B := {⟨a, b⟩ : a ∈ A and b ∈ B} .

Here "a couple", or "an ordered pair", is denoted by ⟨a, b⟩. An ordered n-tuple is analogously denoted ⟨x₁, x₂, …, x_n⟩.

Definition 4 (General Operations among a Family of Sets) If we have an indexed family of sets (X_j : j ∈ I), then we define:

    ∪_{j∈I} X_j := {x : x ∈ X_j for some j ∈ I} ,
    ∩_{j∈I} X_j := {x : x ∈ X_j for each j ∈ I} ,
    ∏_{j∈I} X_j := {g : g is a function, with g(j) ∈ X_j for all j ∈ I} .

Definition 5 (Quotient Set and Equivalence Relation) This is a review of the most difficult notion in naive set theory. In studying a non-empty set A, there often occurs an "equivalence relation", denoted by, say, "∼". This means we write "x ∼ y" to mean that these two elements x, y of A are "of the same kind", "similar", or "equivalent", referring to something of our concern. If we collect all similar members into "bunches", each bunch is called an "equivalence class" (of this relation ∼). The set of (bunches =) equivalence classes is called the quotient set of A with respect to the relation ∼, and written as A/∼. This process of forming the quotient set from the set A is usually described as "identifying equivalent members". Formally speaking: a binary relation ∼ on A is called an equivalence relation iff it has the following three properties:

r° (Reflexivity): for any x ∈ A, x ∼ x.
s° (Symmetry): when {x, y} ⊆ A and x ∼ y, then y ∼ x.
t° (Transitivity): when {x, y, z} ⊆ A, x ∼ y, and also y ∼ z, then x ∼ z.

For x ∈ A, its equivalence class (modulo ∼) is then

    π(x) := {y ∈ A : x ∼ y} .

(It is a non-empty subset of A.) The quotient set is then:

    A/∼ := {π(x) : x ∈ A} .

Ordered Field Extension

The relation between the system R of reals and the system *R of hyper-reals may be announced thus: First of all, *R should be a proper extension of the ordered field R. Secondly, one should also be able to extend, from R to *R, the usual functions exp, cos, sin, etc. This is one of the topics of this article, and we explain the meaning of "extension" in this subsection.
Definition 6 (Commutative Semigroup) Let us consider the arithmetic operation + in R. Though it is a two-variable function, just like, e.g., f : f(x, y) = sin(xy)(x³ − y²), it is written "infix" (so that x + y is "really" +(x, y)). There are the noteworthy laws of associativity and of commutativity:

    ∀x ∀y ∀z : (x + y) + z = x + (y + z) ,    x + y = y + x .

(We mathematics students read ∀ as "for all".) By abstraction, we call a binary operation (= two-variable function) on a non-empty set A satisfying these laws a "commutative semigroup structure" on this set. This system, the set A together with the commutative semigroup structure, is called a commutative semigroup. We realize that in R, multiplication is another commutative semigroup structure.

Definition 7 (Commutative Group) Now in R, 0 is distinguished by the "unitality law":

    ∀x : x + 0 = 0 + x = x .

There is then a negation function, mapping x into its negative −x, and the reciprocal law holds:

    x + (−x) = 0 ,    (−x) + x = 0 .

By abstraction, we call a triple of a two-variable function, a distinguished element, and a one-variable function, all on a non-empty set A, satisfying all the laws mentioned above, a "commutative group structure". And the set, together with this commutative group structure, is called a commutative group. We realize that (R, (+, 0, −)) is a commutative group. As to multiplication, the distinguished unity is 1, but reciprocation fails at precisely 0. If all the functions are restricted to the zero-deleted set

    R• := R \ {0} ,

then we do have a commutative group structure on it.
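The laws defining a commutative group structure can be checked mechanically on a finite example. The structure (Z₅, (+, 0, −)), i.e. integers modulo 5, is an assumption chosen here for illustration; the three maps below mirror the triple (+, 0, −) described above:

```python
# Exhaustively verify the commutative-group laws for Z_5 under addition:
# binary operation, distinguished element, and one-variable inverse map.
A = range(5)
add = lambda x, y: (x + y) % 5
zero = 0
neg = lambda x: (-x) % 5

assoc = all(add(add(x, y), z) == add(x, add(y, z)) for x in A for y in A for z in A)
comm  = all(add(x, y) == add(y, x) for x in A for y in A)
unit  = all(add(x, zero) == x == add(zero, x) for x in A)
recip = all(add(x, neg(x)) == zero == add(neg(x), x) for x in A)
```

Such a brute-force check is only possible for finite structures, of course; for R the laws are theorems, not computations.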
Definition 8 (Field) Suppose F is a set with more than one element. Let the triple (+, 0, −) be a commutative group structure on F, and (·, 1, ⁻¹) a commutative group structure on F• = F \ {0}. And suppose the distributive law holds:

    ∀x ∀y ∀z : x·(y + z) = x·y + x·z .

Then this couple of triples is called a "field structure" on F. (Of course, there are, altogether, six items in the structure: two distinguished elements (= "null-variable functions"), two one-variable functions, and two two-variable functions.) And the set F, together with this field structure, is called a field.

Definition 9 (Total Ordering, and Ordered Field) A total ordering on a set T is a binary relation on T which satisfies the transitive law and the trichotomy law. For example, in R, the relation "less than" is a total ordering: the transitive law says that if a < b and b < c, then a < c, while the trichotomy law requires that for any two members a, b, exactly one sentence among the following three is true: a < b, a = b, b < a. An ordered-field structure on a set K is a combination of a field structure and a total ordering. It is precisely the Archimedean axiom which denies the existence of infinitesimals: a positive infinitesimal would not be usable as a unit for measurement.

Definition 14 (Infinitesimal) In an ordered field F, an element ε is called infinitesimal if for any n ∈ N, n·|ε| < 1. In other words, no matter how many terms there are, the sum nε = ε + ε + ⋯ + ε (n terms) < 1.
And in an Archimedean ordered field, the only infinitesimal is zero. When people become acquainted with the concept of limit, they know that "infinitesimal" should not be interpreted statically. Rather, it is the "vanishing behavior" of, e.g., the function (sin³x)/x², as x → 0, that is to be identified with "infinitesimal". Along this direction there appeared several attempts to find (= construct) the system *R promised by Leibniz. We now describe a good and very natural try, just years before the discovery of Robinson, by Schmieden–Laugwitz.

Definition 15 (Guiding Idea) A nullifying sequence is considered as zero, a vanishing sequence is considered as an infinitesimal, and a constant sequence is identified with the number itself. There may be better terminology, but here a sequence ⟨x_n⟩ is called nullifying if, after a finite number of terms, the rest of them are all zero; and a sequence ⟨x_n⟩ is called vanishing if it converges to zero.

Definition 16 (Confusing!) Limit and limiting point: recall the most confusing duo of definitions for a sophomore: "b is the limit of a sequence ⟨a_n⟩", and "c is a limiting point of a sequence ⟨a_n⟩":

For any ε > 0, there exists an m ∈ N such that ∀n ≥ m, |a_n − b| < ε.
For any ε > 0, and for any m ∈ N, there exists an n ∈ N such that n > m and |a_n − c| < ε.

(We support Kelley's advocacy of replacing "limiting" by "cluster".)

Definition 17 (Eventual Set, Frequent Set) A subset A of N is called "eventual" if it is "cofinite" (= the complement of a finite subset); that is, N \ A is finite. It is called "frequent" if it is infinite. If a property P(i) holds true for i belonging to an eventual set (the mention of which is unnecessary and omitted intentionally), we say P(i) holds true "for eventual i". A similar meaning is applied to the phrase "for frequent i".
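The limit/cluster-point distinction of Definition 16 can be illustrated on a concrete sequence. The check below is a finite surrogate (it inspects only the first 2000 terms, so it is an illustration, not a proof); the sequence a_n = (−1)^n(1 + 1/n) is a hypothetical example with two cluster points and no limit:

```python
# Finite illustration of "limit" vs "limiting (cluster) point":
# a_n = (-1)^n (1 + 1/n) clusters at both 1 and -1, hence has no limit.
N = 2000
a = [(-1) ** n * (1 + 1 / n) for n in range(1, N + 1)]
eps = 0.1

def frequently_near(seq, c, eps):
    # "for frequent i": beyond every sampled index m there is a term near c
    return all(any(abs(x - c) < eps for x in seq[m:]) for m in range(0, len(seq), 100))

def eventually_near(seq, c, eps):
    # finite surrogate for "some tail lies within eps of c"; only tails of
    # length >= len(seq)//2 are inspected, to avoid degenerate short tails
    return any(all(abs(x - c) < eps for x in seq[m:]) for m in range(len(seq) // 2))

near_plus  = frequently_near(a, 1.0, eps)    # even-indexed terms approach 1
near_minus = frequently_near(a, -1.0, eps)   # odd-indexed terms approach -1
has_limit  = eventually_near(a, 1.0, eps) or eventually_near(a, -1.0, eps)
```

Both candidate values pass the "frequently" test and fail the "eventually" test, matching the rephrasing in terms of eventual and frequent sets given above.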
With this kind of abuse of the English language, we may rephrase the definitions above as: for a sequence a := ⟨a_n : n ∈ N⟩, b ∈ R is a limit of a iff, for any ε > 0, a_i ∈ (b − ε … b + ε) for eventual i; c is a limiting point of a iff, for any ε > 0, a_i ∈ (c − ε … c + ε) for frequent i.

Remark 2 See p. 65 in [7] for the definitions of "eventually" and "frequently" for a general directed set of indices. Here the (naturally) directed set N is very special, because it is the least well-ordered infinite set.

Notation 1 (Sequence Spaces) We write x ∈ R^N to mean: x = ⟨x_n⟩ is a sequence of real numbers. (This is in accordance with our notation for the set of mappings.) Here are some subsets of R^N, arranged decreasingly:

R^N_β is the set of all bounded sequences;
R^N_c is the set of all convergent sequences;
R^N_0 is the set of all sequences converging to zero (= vanishing);
R^N_• is the set of all eventually-zero (= nullifying) sequences.

Definition 18 (Coordinatewise Operations on Sequence Spaces)

    ⟨x_n⟩ + ⟨y_n⟩ := ⟨x_n + y_n⟩ ;
    ⟨x_n⟩ − ⟨y_n⟩ := ⟨x_n − y_n⟩ ;
    ⟨x_n⟩ · ⟨y_n⟩ := ⟨x_n · y_n⟩ .

Lemma 2 The system R^N is a commutative unital ring, a real vector space, and thence a commutative algebra (with identity). The following inclusions are subalgebra embeddings:

    R^N_c ⊆ R^N_β ⊆ R^N .
Now according to the guiding idea, a number λ ∈ R is identified with the constant sequence with all terms = λ. In this sense R^N_β (and a fortiori R^N) is an extension of R, considered as an algebra. But this is not a field extension, because:

Example 1 Let 1_E ∈ R^N_β and 1_O ∈ R^N_β be the indicator sequences of the sets of even and odd indices, respectively. Then they are complementary idempotents:

    1_E · 1_E = 1_E ,  1_O · 1_O = 1_O ,  1_E · 1_O = 0 ,  1_E + 1_O = 1 .

Since the two elements are neither the 1 nor the 0 of the algebra R^N_β, the latter cannot be a field, because even an 8th grader knows that in a field F, the only idempotent elements x = x² are x = 0 (the zero) and x = 1 (the identity).

Definition 19 (Extension of a Function) Consider the two functions cos and sin. For a sequence x = ⟨x_n⟩ ∈ R^N, x is a "number" in a generalized sense, and we would like to define cos(x), sin(x). If x is a constant sequence, x_n = a for all n, then cos(x), sin(x) should be the constant sequences with terms cos(a), sin(a) correspondingly. So we see that a coordinatewise definition is almost forced on us: if f : R → R is everywhere defined, then we define:

    f⟨x_n⟩ := ⟨f(x_n)⟩ .

So far as it is defined, it works fine. As an example, the addition formula

    sin(⟨x_n⟩ + ⟨y_n⟩) = sin⟨x_n⟩ cos⟨y_n⟩ + sin⟨y_n⟩ cos⟨x_n⟩
would simply mean that, for each index n,

    sin(x_n + y_n) = sin(x_n) cos(y_n) + sin(y_n) cos(x_n) .

Theorem 1 (A Transfer Principle: Restricted Form) If functions f, g, …, are all defined on R, and there is a formula connecting them, then this formula still holds for the extended functions on the whole sequence space R^N.

We remark that coordinatewise extension is equally applicable to two-variable functions, so long as they are defined everywhere. The Transfer Principle remains valid. And indeed the principle has nothing to do with the differentiability, continuity, etc., of the functions involved. This is pure naive set theory!

Remark 3 If bounded sequences are called "moderate numbers", then only bounded functions will map "moderate numbers" to "moderate numbers".

Definition 20 (Ordering) For two elements x = ⟨x_n⟩, y = ⟨y_n⟩ of R^N, we define:

    x ≤ y  iff  (∀n, x_n ≤ y_n) .
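The Schmieden–Laugwitz sequence algebra and its restricted transfer principle can be sketched concretely. Sequences are truncated to finitely many terms here purely for illustration (the actual objects are infinite sequences):

```python
# Sketch of the sequence algebra R^N: "numbers" are sequences, operations
# and function-extension are coordinatewise. The indicator sequences of
# the even and odd index sets are zero divisors, so R^N is not a field.
import math

N = 20
one_E = [1.0 if n % 2 == 0 else 0.0 for n in range(1, N + 1)]  # even indices
one_O = [1.0 if n % 2 == 1 else 0.0 for n in range(1, N + 1)]  # odd indices

def add(x, y): return [a + b for a, b in zip(x, y)]
def mul(x, y): return [a * b for a, b in zip(x, y)]
def ext(f, x): return [f(a) for a in x]          # coordinatewise extension

product = mul(one_E, one_O)          # the zero sequence: zero divisors!
unity   = add(one_E, one_O)          # the constant sequence 1

# Restricted transfer: the identity sin(2t) = 2 sin(t) cos(t) holds
# coordinatewise for the extended functions.
t = [1.0 / n for n in range(1, N + 1)]
lhs = ext(math.sin, add(t, t))
rhs = mul([2.0] * N, mul(ext(math.sin, t), ext(math.cos, t)))
```

The vanishing sequence t plays the role of an infinitesimal candidate, and the idempotent pair 1_E, 1_O reproduces Example 1 above.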
We write $x < y$ iff $x \le y$ and $x \ne y$. (This means the following: $x_n \le y_n$ for all $n \in \mathbb{N}$, but there is an index $n$, at least one, and one is enough, such that $x_n < y_n$.) The binary relation $\le$ in $\mathbb{R}^{\mathbb{N}}$ is called a partial ordering, because it is transitive and antisymmetric: if $x \le y$ and also $y \le x$, then $x = y$. This partial ordering is not a total ordering, because some pairs of members, for example $1_E$ and $1_O$, may be 'incomparable', as neither $1_E \le 1_O$ nor $1_O \le 1_E$ is true.

Lemma 3 (Nullifying and Vanishing Sequences) The set of nullifying (eventually zero) sequences is an ideal of $\mathbb{R}^{\mathbb{N}}$; the set $\mathbb{R}^{\mathbb{N}}_0$ of vanishing sequences is an ideal of $\mathbb{R}^{\mathbb{N}}_b$ and of $\mathbb{R}^{\mathbb{N}}_c$.

Doing modulo $\mathbb{R}^{\mathbb{N}}_0$ means that two sequences with vanishing difference are considered the same. Or: modulo $\mathbb{R}^{\mathbb{N}}_0$, a magnifying-glass is not powerful enough to distinguish a vanishing sequence from the zero sequence, and you don't have non-trivial infinitesimals in the system.

Lemma 4 The quotient algebra $\mathbb{R}^{\mathbb{N}}_c \bmod \mathbb{R}^{\mathbb{N}}_0$ is (isomorphic to) $\mathbb{R}$. The isomorphism identifies any convergent sequence $\langle x_n\rangle$ with its limit $\lim_{n\to\infty} x_n$.

The Lemma says that $\lim_{n\to\infty} x_n y_n = \lim_{n\to\infty} x_n \cdot \lim_{n\to\infty} y_n$, and the like. (Just the first theorems of our calculus textbook.)

Definition 21 (Quotient by Eventuality) We turn to the quotient algebra $\mathbb{R}^{\mathbb{N}}_E := \mathbb{R}^{\mathbb{N}} \bmod {=_E}$, where two sequences $x \in \mathbb{R}^{\mathbb{N}}$ and $y \in \mathbb{R}^{\mathbb{N}}$ are equivalent mod $E$ iff $x_n = y_n$ for eventual $n$. As we denote by $E$ the family of all eventual sets of indices, we also write $x \sim_E y$ or $x =_E y$ for this equivalence, and the equivalence class of $x$ is written $E(x) \in \mathbb{R}^{\mathbb{N}}_E$. The system $\mathbb{R}^{\mathbb{N}}_E$ is a gentle modification of $\mathbb{R}^{\mathbb{N}}$; it has essentially the same good or bad properties as $\mathbb{R}^{\mathbb{N}}$. Because, in essence, the idea of mod $E$ is simply: when studying a sequence, never worry about its few exceptional terms! So we can also extend all the everywhere-defined functions to these "E-numbers". And the transfer principle still applies! Something better comes out of this modification mod $E$: ordering.

Definition 22 (Ordering) For two elements $x = \langle x_n\rangle$, $y = \langle y_n\rangle$ of $\mathbb{R}^{\mathbb{N}}$, we write $x \le_E y$, and also $E(x) \le E(y)$, iff
$$x_n \le y_n \quad\text{for eventual } n\,.$$
Naturally we write $x <_E y$ iff $x \le_E y$ and $x \ne_E y$. A hyperreal $\alpha \in {}^\star\mathbb{R}$ is called infinite (unlimited) iff $|\alpha| > n$ for every $n \in \mathbb{N}$.
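The distinction between the everywhere-order and the eventual-order $\le_E$ can be illustrated at a finite level. This is only a sketch: genuine E-numbers quotient an infinite product, so finite truncations of the sequences stand in for them here.

```python
# Illustrative sketch only: finite truncations stand in for infinite sequences.

def leq_everywhere(x, y):
    """Coordinatewise partial order: x_n <= y_n for ALL sampled n."""
    return all(a <= b for a, b in zip(x, y))

def leq_eventual(x, y):
    """Eventual order (mod E): x_n <= y_n for all n beyond some index."""
    return any(all(a <= b for a, b in zip(x[n:], y[n:]))
               for n in range(len(x)))

N = 50
one_odd  = [n % 2 for n in range(N)]        # 1_O: 0,1,0,1,...
one_even = [1 - n % 2 for n in range(N)]    # 1_E: 1,0,1,0,...
x = [3.0] * N
y = [100.0, -5.0] + [4.0] * (N - 2)         # two exceptional terms, then 4,4,...

print(leq_everywhere(one_odd, one_even), leq_everywhere(one_even, one_odd))
print(leq_everywhere(x, y))   # False: y_1 < x_1
print(leq_eventual(x, y))     # True: the exceptional terms are ignored mod E
```

The pair $1_O$, $1_E$ is incomparable in both orders, while $x \le_E y$ holds even though $x \le y$ fails at one index.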
An equivalent definition of this infinity is: $\alpha^{-1} \in I_0$. The set of all limited hyperreals is a subalgebra (with identity) of ${}^\star\mathbb{R}$, with the set $I_0$ of infinitesimals as an ideal. The quotient algebra is isomorphic to the real field $\mathbb{R}$, so the set of all limited hyperreals is indeed
$$\mathbb{R} + I_0 = \{c + \varrho : c \in \mathbb{R},\ \varrho \in I_0\}\,.$$
Note: keeping $I = \mathbb{N}$, and $O, E$ the subsets of odd and even indices respectively, one of the two sequences $1_O$, $1_E$ is (equivalent to) the unit ($= 1$) of ${}^\star\mathbb{R}$, while the other is the zero ($= 0$) of ${}^\star\mathbb{R}$. (They are idempotents complementing each other.) The choice is made by the ultrafilter $U$. Or: we can choose one or the other of the two sets $O, E$ to be a member of $U$. The axiom of choice allows us to do this before the final set $U$ is ultimately found.
Example 2 (A monad is at least a continuum) When $I = \mathbb{N}$, the hyperreal $\varepsilon = U\langle \frac1n : n \in \mathbb{N}\rangle$ is infinitesimal. Then so is $\varepsilon^3$, and indeed $\varepsilon^r$ is infinitesimal for any positive real number $r > 0$. These $\varepsilon^r$ are inversely ordered as $r$; this is easily seen from the representing sequences $\langle n^{-r} : n \in \mathbb{N}\rangle$. So the monad of infinitesimals is at least a continuum, cardinally speaking.

Example 3 (A smaller infinitesimal) Suppose real numbers $r > 1$, $s > 1$ are fixed. Then for eventual $n \in \mathbb{N} = I$,
$$s^{-n} < \frac{1}{n^r}\,.$$
A fortiori, this is true for ultimate $n$; therefore
$$\sigma = U\langle s^{-n} : n \in \mathbb{N}\rangle < \varepsilon^r\,.$$
To an atheist-Buddhist, an ultrafilter $U$ is a Buddha's magnifying glass, viewing through which a tiny monad is bigger than a solar system.

Star-Extensions

Now we come to the central issue of function-extension and the intimately related transfer principle. For an everywhere-defined function $f$, the coordinatewise definition still works fine, and we surely have an extension ${}^\star f$ of $f$ to the hyperreals. (The same proof works. Or rather, the assertion here is then a corollary of the former principle.) For a function like $\csc$, which is undefined at $n\pi$, $n \in \mathbb{Z}$, the natural definition at $x = \langle x_n\rangle$ is
$$\csc(x) := \langle \csc(x_n)\rangle \in \mathbb{R}^{\mathbb{N}}\,.$$
In $\mathbb{R}^{\mathbb{N}}$ the definition breaks down if only one term $x_n \in \pi\mathbb{Z}$. How about the situation in $\mathbb{R}^{\mathbb{N}}_E$? Write temporarily the domain of definition of $\csc$ as $A := \mathrm{Dom}(\csc) = \mathbb{R} \setminus \pi\mathbb{Z}$; then if for eventual $n$, $x_n \in A$, the definition goes through! And for such an $x \in \mathbb{R}^{\mathbb{N}}$, $\csc(E(x)) \in \mathbb{R}^{\mathbb{N}}_E$ is defined. In complete analogy, for an $x \in \mathbb{R}^{\mathbb{N}}$, if only "for ultimate $n$, $x_n \in A$", then $U\langle\csc(x_n)\rangle \in {}^\star\mathbb{R}$ is well-defined!

Definition 29 (Star-Extension of a Set) For a non-empty set $A \subseteq \mathbb{R}$, the set ${}^\star A$ is the set of all $U(x)$, where $x = \langle x_j\rangle$ is a "sequence" such that, for ultimate $j$, $x_j \in A$.

Definition 30 (Extension of a Mapping) For any mapping $f : A \to B$ and $x = \langle x_j\rangle \in A^I$, we have the "sequence" $y = f \circ x = \langle f(x_j)\rangle \in B^I$, whose equivalence class $\eta = U(y) \in {}^\star B$ is unambiguously determined by $\xi = U(x) \in {}^\star A$, and is independent of the representative $x \in A^I$. We then write $\eta = {}^\star f(\xi)$. This mapping
Non-standard Analysis, an Invitation to
$${}^\star f : {}^\star A \to {}^\star B$$
is called the star-extension of $f$. (It is surely an extension.)
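At the level of representing sequences, Definitions 29 and 30 amount to applying $f$ coordinatewise to any sequence that eventually stays inside $\mathrm{Dom}(f)$. The following finite sketch (no ultrafilter; "for eventual $n$" is checked on a truncation, and the domain test `in_dom` is an assumed numerical stand-in) illustrates this for $\csc$.

```python
import math

# Sketch of Definition 30 at the level of representing sequences: terms
# before the last domain violation are exceptional and may be discarded.

def star_apply(f, domain, x):
    """Apply f coordinatewise to the tail of x lying in `domain`."""
    last_bad = max((i for i, t in enumerate(x) if not domain(t)), default=-1)
    if last_bad == len(x) - 1:
        raise ValueError("sequence does not stay in the domain eventually")
    return [f(t) for t in x[last_bad + 1:]]

csc = lambda t: 1.0 / math.sin(t)
in_dom = lambda t: abs(math.sin(t)) > 1e-12   # t not (numerically) in pi*Z

# x_0 = 0 lies outside Dom(csc); all later terms are fine, so mod E the
# extension csc(E(x)) is defined.
x = [0.0] + [1.0 + 1.0 / n for n in range(1, 40)]
tail = star_apply(csc, in_dom, x)
print(len(tail), tail[0])
```

The single exceptional term at index $0$ does not obstruct the definition, exactly as in the $\mathbb{R}^{\mathbb{N}}_E$ discussion above.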
It is of utmost importance to note here that these definitions of "star-extensions" are strictly notions of naive set theory. Of course $A, B$ can be any two non-empty sets, and $f$ any function from $A$ to $B$.

Theorem 3 (Transfer Principle) All the identities for the standard functions hold true for their star-extensions, simply because they hold true for the original functions.

Example 4 (Hyperintegers) A hyperinteger $U(x) \in {}^\star\mathbb{Z}$ is the equivalence class of a sequence of integers $x \in \mathbb{Z}^{\mathbb{N}}$. If "integer" is changed to "even integer", then we get the even hyperintegers $U(x) \in {}^\star(2\mathbb{Z})$, where $2\mathbb{Z} := \{2n : n \in \mathbb{Z}\}$ is the set of even integers. And there are odd hyperintegers, too. All the hyperintegers may be classified as even or odd, because of the dichotomy property of the ultrafilter $U$.

Example 5 (Cartesian Product and Two-Variable Function) For two non-empty sets $A, B$, by canonical identification,
$${}^\star(A \times B) = {}^\star A \times {}^\star B\,.$$
So for any mapping $g : A \times B \to C$, the star-extension of $g$ may be defined: ${}^\star g : {}^\star A \times {}^\star B \to {}^\star C$. What this says is easy but tedious to explain. First, we know the identification
$$(A \times B)^I = A^I \times B^I\,:$$
for $x \in A^I$, $u \in B^I$, the couple $\langle x, u\rangle \in A^I \times B^I$ corresponds bijectively to an element $x \otimes u$ of $(A \times B)^I$, by
$$x \otimes u : j \in I \mapsto \langle x_j, u_j\rangle \in A \times B\,,$$
and the equivalence class of $x \otimes u$ depends on the equivalence classes $U(x)$ and $U(u)$ only. In this way we are sure that $U\langle g(x_j, u_j)\rangle$ is unambiguously determined by $U(x)$ and $U(u)$ only, and is independent of the representatives $x \in A^I$, $u \in B^I$. Actually we did give a proof for $g =$ division, with $A = \mathbb{R}$ and $B = \mathbb{R} \setminus \{0\}$. The proof is good for any function of any (finite) number of variables.

Some examples of Transfer are: ${}^\star\mathrm{abs}(\alpha) = |\alpha| \in {}^\star\mathbb{R}$ is meaningful for $\alpha \in {}^\star\mathbb{R}$, and indeed $|\alpha| = -\alpha$ if $\alpha < 0$; $(-1)^\nu$ is meaningful for $\nu \in {}^\star\mathbb{Z}$, and indeed it is $1$ for an even hyperinteger and $-1$ for an odd hyperinteger.

The star-extensions ${}^\star\sin$, ${}^\star\cos$ of the sine and cosine functions are defined on ${}^\star\mathbb{R}$ and valued in the star-extended closed interval ${}^\star[-1 \ldots 1]$. Note: from now on, the star-extensions of such classical functions as $\sin$, $\cos$, ..., will not have the star displayed.

NS View of Continuous Functions

The introduction of hyperreals gives us a different standpoint from which to view many familiar concepts.

Remark 4 (Extended Version of a Sequence) Suppose $s = \langle s_n : n \in \mathbb{N}\rangle$ is a sequence of real numbers. This is a function from $\mathbb{N}$ to $\mathbb{R}$; therefore it may be star-extended to a function
$${}^\star s : \mu \in {}^\star\mathbb{N} \mapsto s_\mu \in {}^\star\mathbb{R}\,.$$
We call this latter the extended version of the original sequence. Indeed, for any unlimited integer $\mu = U\langle n_j : j \in \mathbb{N}\rangle$ we may define
$$s_\mu := U\langle s_{n_j} : j \in \mathbb{N}\rangle \in {}^\star\mathbb{R}\,.$$
It is obvious that the convergence of the sequence $s$,
$$\lim_{n\to\infty} s_n = b\,,$$
is equivalent to:
$$s_\mu \approx b \quad\text{for all unlimited integers } \mu \in {}^\star\mathbb{N} \setminus \mathbb{N}\,.$$
As an example, when $r \in \mathbb{R}$, $y \in \mathbb{R}$, and
$$s_n := \cos\Big(\frac{y\sqrt{m}}{n}\Big)^{\!n}\,, \quad\text{with } m = \mathrm{floor}(n\,r)\,,$$
standard calculus gives us
$$\lim_{n\to\infty} s_n = e^{-y^2 r/2}\,.$$
We will talk about a function $f : \mathbb{R} \to \mathbb{R}$, though usually the domain of definition is only required to be an open interval.

Theorem 4 A function $f$ is bounded around a point $a \in \mathbb{R}$ iff ${}^\star f$ maps the monad of $a$ into the finite hyperreals.

Proof If $f$ maps a neighborhood $(a - e \ldots a + e)$ of $a$ into $[-M \ldots M]$, where $e \in \mathbb{R}$ is positive, then for $\xi = U\langle x_j\rangle \approx a$ we have $a - e < x_j < a + e$ for ultimate $j$, and thence $f(x_j) \in [-M \ldots M]$. Accordingly:
$$-M \le {}^\star f(\xi) \le M\,.$$
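The limit in the example above can be checked numerically at a large finite index, a finite stand-in for an unlimited $\mu$. The sketch below assumes the reading $s_n = \cos(y\sqrt{m}/n)^n$ with $m = \mathrm{floor}(n\,r)$.

```python
import math

def s(n, y, r):
    """s_n = cos(y*sqrt(m)/n)**n with m = floor(n*r)."""
    m = math.floor(n * r)
    return math.cos(y * math.sqrt(m) / n) ** n

# For a large ("unlimited, seen through a finite telescope") index n,
# s_n should be infinitesimally close to exp(-y**2 * r / 2).
for y, r in [(1.0, 2.0), (2.0, 1.0)]:
    n = 10**6
    print(y, r, s(n, y, r), math.exp(-y * y * r / 2))
```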
On the other hand, if $f$ is unbounded around $a$, then for any $j \in I = \mathbb{N}$ there is $x_j$ (in the domain of definition of $f$) with $|x_j - a| < \frac1j$ and $|f(x_j)| > j$. Thus $\xi := U\langle x_j\rangle \approx a$, but $|{}^\star f(\xi)|$ is infinite.

Theorem 5 Let $A \subseteq \mathbb{R}$ be an interval, $b \in A$, and $f : A \to \mathbb{R}$. Then $f$ is continuous at $b$ iff: for all $\xi \in {}^\star A$,
$$\xi \approx b \ \Rightarrow\ {}^\star f(\xi) \approx f(b)\,.$$
In other words, $f$ is continuous at $b$ iff ${}^\star f$ maps the monad $b + I_0$ into the monad $f(b) + I_0$.

Proof By the standard definition, when $f$ is continuous at $b$ and $\epsilon > 0$ is given, there is a $\delta > 0$ such that, for $|x - b| < \delta$ (and $x \in A$), $|f(x) - f(b)| < \epsilon$. Now $\xi = U\langle x_j\rangle$ and $\xi - b \in I_0$, so for ultimate $j$, $|x_j - b| < \delta$. Thus $|f(x_j) - f(b)| < \epsilon$ for ultimate $j$. This means ${}^\star f(\xi) - f(b) \in I_0$, or ${}^\star f(\xi) \approx f(b)$. Suppose $f$ is not continuous at $b$; then there is $\epsilon > 0$ such that for whatever $j \in \mathbb{N}$ there is $x_j \in A$ with $|x_j - b| < \frac1j$ while $|f(x_j) - f(b)| > \epsilon$. Then with $\xi = U\langle x_j\rangle \in {}^\star A$, $\xi \approx b$ while $|{}^\star f(\xi) - f(b)| > \epsilon$.

Corollary 1 If $f$ is continuous at $c$, it is bounded around $c$.

Theorem 6 Let $A \subseteq \mathbb{R}$ be a compact (= closed and bounded) interval, and let $f : A \to \mathbb{R}$ be continuous. Then $f$ is bounded.

Proof If $f$ is unbounded, then for each $j \in \mathbb{N}$ there is $x_j \in A$ with $|f(x_j)| > j$, and we find $\xi = U\langle x_j\rangle$ with ${}^\star f(\xi)$ infinite. But $\xi \in {}^\star A$ is a limited hyperreal, and its standard part $b = {}^\circ\xi \in A$, $\xi \approx b$. By the continuity criterion, ${}^\star f(\xi) \approx f(b) \in \mathbb{R}$. This is a contradiction.

Theorem 7 Let $A \subseteq \mathbb{R}$ be an interval, and $f : A \to \mathbb{R}$. Then $f$ is uniformly continuous iff: for all $\xi \in {}^\star A$, $\eta \in {}^\star A$,
$$\xi \approx \eta \ \Rightarrow\ {}^\star f(\xi) \approx {}^\star f(\eta)\,.$$
Proof By the standard definition, when $f$ is uniformly continuous on $A$ and $\epsilon > 0$ is given, there is a $\delta > 0$ such that, for $|x - y| < \delta$ (and $x \in A$, $y \in A$), $|f(x) - f(y)| < \epsilon$. Now $\xi = U\langle x_j\rangle \in {}^\star A$, $\eta = U\langle y_j\rangle \in {}^\star A$, and $\xi - \eta \in I_0$, so for ultimately many $j$, $|x_j - y_j| < \delta$. Thus $|f(x_j) - f(y_j)| < \epsilon$ for ultimately many $j$. This means ${}^\star f(\xi) - {}^\star f(\eta) \in I_0$, or ${}^\star f(\xi) \approx {}^\star f(\eta)$. Suppose $f$ is not uniformly continuous; then there is $\epsilon > 0$ such that for whatever $j \in \mathbb{N}$ there is a pair $x_j \in A$, $y_j \in A$ with $|x_j - y_j| < \frac1j$ while $|f(x_j) - f(y_j)| > \epsilon$. Then with $\xi = U\langle x_j : j \in I\rangle \in {}^\star A$, $\eta = U\langle y_j : j \in I\rangle \in {}^\star A$, $\xi \approx \eta$ while $|{}^\star f(\xi) - {}^\star f(\eta)| > \epsilon$. Contradiction.

One of the merits of NSA is that a standard notion may be interpreted in a nonstandard way, which is pedagogically advantageous. There will be, as examples, a couple of theorems easily proved by the NS criteria just given. But first we introduce some notions very useful later on.

Definition 31 (Hyperfinite Set) Let $T_j \subseteq A$ for each $j \in I$. Then the sequence defines an "internal" subset $\mathcal{T} := U(T)$ of ${}^\star A$:
$$U(T) := \{U(x) : x_j \in T_j \text{ for ultimate } j\} \subseteq {}^\star A\,.$$
If for ultimate $j$ the set $T_j$ is finite, $\mathrm{card}(T_j) < \infty$, then $\mathcal{T}$ is called a hyperfinite set, with internal cardinal $U\langle \mathrm{card}(T_j)\rangle \in {}^\star\mathbb{N}$.

Example 6 (Hyperfinite Equi-Partition) Consider the hyperfinite set $\mathcal{T}$ of equi-partition points with cardinal $1 + N$:
$$N := U\langle N_j\rangle\,,\ N_j < N_{j+1} \to \infty\,;\qquad T_j := \Big\{\frac{k}{N_j} : k \in \mathbb{Z},\ 0 \le k \le N_j\Big\}\,;\qquad \mathcal{T} := U\langle T_j\rangle\,.$$
Three convenient choices are $N_j = j!$, $N_j = 2^j$, and $N_j = 3^j$. The corresponding partition set $\mathcal{T}$ is then called, respectively, the partition by factorial, binary, or ternary fractions.

Lemma 6 (Density of Equi-Partition Set) The countable set $T_\infty = \bigcup_j T_j$ is dense in the unit interval $\mathcal{U}$: for any number $c \in \mathcal{U}$, there is a hyperreal $\varrho \in \mathcal{T}$, $\varrho \approx c$.

Proof There are two convenient choices, the left- and the right-approximation:
$$\varrho = \mathrm{floor}_{\mathcal{T}}(c) = U\Big\langle \frac{\mathrm{floor}(c\,N_j)}{N_j} : j \in I\Big\rangle\,,\quad\text{or}\quad \varrho = \mathrm{ceil}_{\mathcal{T}}(c) = U\Big\langle \frac{\mathrm{ceil}(c\,N_j)}{N_j} : j \in I\Big\rangle\,.$$
Note that, for any $x \in \mathcal{U}$,
$$\mathrm{st}(\mathrm{floor}_{\mathcal{T}}(x)) = x\,,\qquad \mathrm{st}(\mathrm{ceil}_{\mathcal{T}}(x)) = x\,.$$
Both $\mathrm{floor}_{\mathcal{T}}$ and $\mathrm{ceil}_{\mathcal{T}}$ are (Skolem) "inverse functions" of the standard-part mapping $\mathrm{st}$ from $\mathcal{T}$ onto $\mathcal{U}$. Indeed, for the binary-rational partition $N_j = 2^j$, or the ternary-rational partition $N_j = 3^j$, the floor-approximation gives the binary or ternary expansion respectively.
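At a finite stage $j$ of the binary partition the two approximations of Lemma 6 are ordinary floor/ceil roundings to the grid, and they bracket $c$ within $1/N_j$; "taking the standard part" corresponds to letting $j$ grow. A minimal sketch:

```python
import math

# Finite-level sketch of Lemma 6 with N_j = 2**j: floor_T and ceil_T are
# grid points within 1/N_j of c, so their common "standard part" is c.

def floor_T(c, Nj):
    return math.floor(c * Nj) / Nj

def ceil_T(c, Nj):
    return math.ceil(c * Nj) / Nj

c = 1 / 3
for j in (4, 10, 20):
    Nj = 2 ** j
    lo, hi = floor_T(c, Nj), ceil_T(c, Nj)
    assert lo <= c <= hi and hi - lo <= 1 / Nj
    print(j, lo, hi)
```

The successive `floor_T` values are exactly the truncated binary expansions of $c$, as the Lemma's proof remarks.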
Theorem 8 Let $A \subseteq \mathbb{R}$ be a compact interval, and let $f : A \to \mathbb{R}$ be continuous. Then $f$ has a maximum and a minimum.

Proof (By simple scaling and translation) we may and will assume $A = \mathcal{U} = [0 \ldots 1]$, the unit interval. Take a hyperinteger $N := U\langle N_j\rangle$, and then the equi-partition $\mathcal{T} = U(T)$ into $N$ parts; here $T_j := \{k/N_j : k \in \mathbb{Z},\ 0 \le k \le N_j\}$. Now for each $j$, the maximum value of
$$f[T_j] := \{f(x) : x \in T_j\}$$
is unique, though the maximum point $x_j$ may be non-unique:
$$\forall x \in T_j\,,\quad f(x_j) \ge f(x)\,.$$
We just choose (say) the least such maximum point $x_j$, and define
$$\xi := U(x) \in {}^\star A\,,\qquad x_0 := {}^\circ\xi\,,\qquad x_0 \in A\,.$$
Then for any $c \in A$, and $\varrho = \mathrm{floor}_{\mathcal{T}}(c)$ as before, $\varrho \approx c$ and
$${}^\star f(\varrho) \le {}^\star f(\xi)\,;\qquad \xi \approx x_0\,,\ f(x_0) \approx {}^\star f(\xi)\,;\qquad f(c) \approx {}^\star f(\varrho)\,.$$
We see that $f(c) \le f(x_0)$ for all $c \in A$, i.e., $f$ attains its maximum value at $x_0$.

Theorem 10 Let $A \subseteq \mathbb{R}$ be a compact interval, and let $f : A \to \mathbb{R}$ be continuous. Then $f$ is uniformly continuous.

Proof Suppose $\xi \in {}^\star A$, $\eta \in {}^\star A$, and $\xi \approx \eta$. Now let $c = {}^\circ\xi \in A$. It follows that both $\xi \approx c$ and $\eta \approx c$ hold. The continuity of $f$ implies that both ${}^\star f(\xi) \approx f(c)$ and ${}^\star f(\eta) \approx f(c)$ hold. Therefore ${}^\star f(\xi) \approx {}^\star f(\eta)$.

Theorem 9 (Intermediate Value) Let $f : A \to \mathbb{R}$ be continuous, $v \in \mathbb{R}$, $A = [a \ldots b]$, and $(v - f(a))(v - f(b)) < 0$. Then there is $u \in (a \ldots b)$ such that $f(u) = v$.

Proof We may and will assume $A = \mathcal{U}$, $a = 0$, $b = 1$, $v = 0$, $f(0) < 0 < f(1)$. As before, we take a hyperfinite $N$ equi-partition $\mathcal{T} = U\langle T_j\rangle$, $N = U\langle N_j\rangle$. Find the least $x_j \in T_j$ such that $f(x_j) \ge 0$. Because $0 \in T_j$, $1 \in T_j$, and $f(0) < 0 < f(1)$, we know $f(x_j) \ge 0$, $f(x_j - 1/N_j) < 0$. With $\xi := U\langle x_j\rangle \in {}^\star A$, $u := {}^\circ\xi$, and $\epsilon := U\langle 1/N_j\rangle$, we have:
$${}^\star f(\xi) \ge 0\,,\qquad {}^\star f(\xi - \epsilon) < 0\,.$$
But $\xi \approx u$, $\xi - \epsilon \approx u$, and the continuity of $f$ at $u$ implies:
$${}^\star f(\xi) - f(u) \in I_0\,,\qquad {}^\star f(\xi - \epsilon) - f(u) \in I_0\,.$$
Therefore
$$0 \le {}^\star f(\xi) < {}^\star f(\xi) - {}^\star f(\xi - \epsilon) = \big({}^\star f(\xi) - f(u)\big) - \big({}^\star f(\xi - \epsilon) - f(u)\big) \in I_0\,,$$
so ${}^\star f(\xi) \in I_0$, or ${}^\circ({}^\star f(\xi)) = 0$, which means $f(u) = 0$.

NS Calculus

According to the standard definition,
$$f'(a) = \lim_{\Delta x \to 0} \frac{f(a + \Delta x) - f(a)}{\Delta x}\,;$$
the differentiability of $f$ at $a$ implies the continuity of $f$ at $a$, which means that ${}^\star f$ maps the monad $a + I_0$ of $a$ into the monad $f(a) + I_0$ of $f(a)$.

Definition 32 (Difference) If $f$ is a real-valued function defined around $a$, then the "discern" of $f$ at $a$ is the (non-standard) function
$$\Delta_a f : dx \in I_0 \mapsto {}^\star f(a + dx) - f(a) \in {}^\star\mathbb{R}\,.$$
Of course we know that the continuity of $f$ at $a$ means that the range of $\Delta_a f$ is included in $I_0$. So we are led to consider (non-standard) functions from $I_0$ to $I_0$. From now on, $dx, dy$ will be variables whose values are infinitesimals. (So they live in $I_0$.)

Definition 33 (Tangential and Tangent) If $\Phi : I_0 \to I_0$, $\Psi : I_0 \to I_0$, and for any nonzero infinitesimal $\epsilon \in I$
$$\frac{\Phi(\epsilon) - \Psi(\epsilon)}{\epsilon} \approx 0\,,$$
they are called tangential to each other. For any number $c \in \mathbb{R}$, the associated tangent is the (non-standard) function $dx \in I_0 \mapsto c\,dx \in I_0$.

Theorem 11 (Differentiability) A function $f$ defined on an open interval $U$ is differentiable at $a \in U$, with derivative $f'(a) = c$ there, iff $\Delta_a f$ is tangential to the tangent associated with $c$.

Proof If $f'(a) = c$, then for any $k \in \mathbb{N}$ there is $e_k \in \mathbb{R}$ with $e_k > 0$, such that whenever $0 < |u| < e_k$, $|(f(a + u) - f(a))/u - c| < \frac1k$. Now fix $k$. Then for any (non-zero) $dx = U\langle u_j\rangle \in I$, for ultimate $j$, $0 < |u_j| < e_k$, resulting in $|(f(a + u_j) - f(a))/u_j - c| < \frac1k$. So
$$\Big|\frac{\Delta_a f(dx) - c\,dx}{dx}\Big| = \Big|\frac{{}^\star f(a + dx) - f(a)}{dx} - c\Big| < \frac1k\,.$$
This being true for all $k \in \mathbb{N}$, we have: for any $dx \in I$,
$$\Big|\frac{\Delta_a f(dx) - c\,dx}{dx}\Big| \in I_0\,,$$
which means: $\Delta_a f$ is tangential to the tangent associated with $c$.

Conversely, if $f$ is not differentiable at $a$, one can show that the nonstandard condition of differentiability cannot be satisfied. Indeed, then the following four limits of the difference quotient $\frac{f(a+u)-f(a)}{u}$ are not all finite and coincident:
$$\limsup_{u \downarrow 0}\,,\quad \liminf_{u \downarrow 0}\,,\quad \limsup_{u \uparrow 0}\,,\quad \liminf_{u \uparrow 0}\,.$$
For example, a case may be
$$b := \limsup_{u \downarrow 0} \frac{f(a+u) - f(a)}{u} \;>\; e := \liminf_{u \downarrow 0} \frac{f(a+u) - f(a)}{u}\,.$$
In that case, there are positive sequences $u \in \mathbb{R}^{\mathbb{N}}$, $v \in \mathbb{R}^{\mathbb{N}}$, decreasing to $0$, such that
$$b = \lim_{j\to\infty} \frac{f(a + u_j) - f(a)}{u_j}\,,\qquad e = \lim_{j\to\infty} \frac{f(a + v_j) - f(a)}{v_j}\,.$$
Taking $\epsilon = U\langle u_j\rangle \in I$ and $\delta = U\langle v_j\rangle \in I$, we see that the nonstandard condition of differentiability cannot be satisfied simultaneously for $dx = \epsilon$ and for $dx = \delta$, whatever the choice of $c$.

Theorem 12 (Chain Rule) If a function $f$ is differentiable at $a$, and a function $g$ is differentiable at $b = f(a)$, then the composite function $h := g \circ f$ is differentiable at $a$, with derivative $h'(a) = g'(b)\,f'(a)$.

Proof For any $\epsilon \in I$, $\delta \in I$, we have:
$$\frac{{}^\star f(a + \epsilon) - f(a)}{\epsilon} \approx f'(a)\,,\qquad \frac{{}^\star g(b + \delta) - g(b)}{\delta} \approx g'(b)\,.$$
Let now $\delta = {}^\star f(a + \epsilon) - f(a)$. Then:
$$\frac{{}^\star h(a + \epsilon) - h(a)}{\epsilon} = \frac{{}^\star g({}^\star f(a + \epsilon)) - g(f(a))}{\epsilon} = \frac{{}^\star g(b + \delta) - g(b)}{\delta}\cdot\frac{\delta}{\epsilon} \approx g'(b)\,f'(a)\,.$$
Here it is assumed that $\delta \ne 0$, so as to allow the division by $\delta$. But when $\delta = 0$, then necessarily $f'(a) = 0$, and the proposed identity is trivial.

Theorem 13 (Leibniz' Product Rule) If two functions $f$ and $g$ are differentiable at $a$, then so is their product $h$, $h(x) = f(x)\,g(x)$. And
$$h'(a) = f'(a)\,g(a) + f(a)\,g'(a)\,.$$
Proof The non-standard calculations are essentially the same as the standard ones.

Lemma 7 If $f$ has a local maximum at $x = a$, then
$$\forall dx \in I\,,\quad \Delta_a f(dx) \le 0\,.$$
Theorem 14 ("Fermat") On an open interval, if a function $f$ has a maximum at $a$ and is differentiable there, then $f'(a) = 0$.

Proof Recall the signum function $\mathrm{sign}(x) = \frac{x}{|x|}$ for $x \ne 0$. (We will not put the star on its star-extension.) For $dx \in I$,
$$\frac{\Delta_a f(dx) - f'(a)\,dx}{dx} \in I_0\,.$$
So if $f'(a) \ne 0$, we take $dx \in I$ to have the same signum as $f'(a)$; then there is a contradiction with Lemma 7:
$$\mathrm{sign}(\Delta_a f(dx)) = \mathrm{sign}(f'(a)\,dx) > 0\,.$$
Conventional deductions then give, as corollaries:

Theorem 15 ("Rolle") Let a function $f$ be continuous on a closed interval $[a \ldots b]$ and differentiable on the open interval $(a \ldots b)$. If moreover $f(a) = f(b)$, then there exists a zero of the derived function $f'$ in the open interval. Generally, there exists $\gamma$ in the open interval $(a \ldots b)$ such that
$$f'(\gamma) = \frac{f(b) - f(a)}{b - a}\,.$$
Remark 5 For a function $f$ defined on a closed interval, one-sided derivatives may still be defined at the endpoints naturally. Nelson mentioned that a stronger notion is in some ways more convenient than differentiability: $f$ is called derivable at a point $a$ iff there is a derivative $f'(a) = c$ such that, as $x_1 \to a$, $x_2 \to a$, $x_1 \ne x_2$,
$$\lim \left(\frac{f(x_2) - f(x_1)}{x_2 - x_1} - f'(a)\right) = 0\,.$$
The NS formulation is:
$$(\epsilon_1 \in I\,,\ \epsilon_2 \in I\,,\ \epsilon_1 \ne \epsilon_2) \ \Rightarrow\ \frac{{}^\star f(a + \epsilon_2) - {}^\star f(a + \epsilon_1)}{\epsilon_2 - \epsilon_1} \approx f'(a)\,.$$
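Remark 5's two-point quotient can be watched numerically: small floating-point increments stand in for the infinitesimals $\epsilon_1, \epsilon_2$, and for a smooth function the quotient settles onto $f'(a)$. A minimal sketch, using $f = \sin$:

```python
import math

# Numeric sketch of Remark 5: for a C^1 function, the two-point quotient
# (f(a+e2) - f(a+e1)) / (e2 - e1), with distinct small increments e1 != e2,
# approaches f'(a). Small floats stand in for the infinitesimals.

f = math.sin
a = 0.7
fprime = math.cos(a)

for k in (3, 5, 7):
    e1, e2 = 10.0 ** (-k), -2 * 10.0 ** (-k)
    q = (f(a + e2) - f(a + e1)) / (e2 - e1)
    print(k, q, abs(q - fprime))
```

The error shrinks with the increments, in line with derivability being equivalent to continuous differentiability (Theorem 16 below in the text).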
Of course all the differentiation rules are valid for this stronger-sense derivative.

Theorem 16 A function $f$ is derivable on an interval (i.e., at every point of the interval) iff it is differentiable with continuous derivative everywhere on the interval.

The function $f(x) = x^2 \cos(\frac1x)$ (extended by $f(0) = 0$) is differentiable on $\mathbb{R}$, but $f'(x) = 2x\cos(\frac1x) + \sin(\frac1x)$ (extended by $f'(0) = 0$) is not continuous at $x = 0$. Indeed $f$ is not derivable there.

NS Integral Calculus

Let $N := \langle N_j : j \in I\rangle \in \mathbb{N}^I$ be a strictly increasing sequence of natural numbers, and for each $j \in I$ define
$$T_j = \Big\{\frac{k}{N_j} : k \in \mathbb{Z},\ 0 \le k \le N_j\Big\}\,,\qquad T_j^0 := T_j \setminus \{1\}\,.$$
The unit interval $\mathcal{U}$ is correspondingly partitioned into $N_j$ subintervals $[(k-1)/N_j \ldots k/N_j]$, $k = 1, 2, \ldots, N_j$. Suppose $f$ is a function on the unit interval $\mathcal{U}$. The left Riemann sum of $f$ relative to this partition is
$$S_j := \frac{1}{N_j} \sum \big(f(x) : x \in T_j^0\big)\,.$$
Assuming that $f$ is Riemann-integrable (in the standard sense), this sequence $S = \langle S_j\rangle$ of Riemann sums converges to the Riemann integral $\int_{\mathcal{U}} f(x)\,dx$.

Theorem 17 (Riemann Sum) The Riemann integral of a Riemann-integrable function on the unit interval is the standard part of the "hyper-average" of $f$ over a hyperfinite equi-partition $\mathcal{T}$ as above:
$$\int_{\mathcal{U}} f(x)\,dx = \mathrm{st}\,U\langle S_j\rangle = \mathrm{st}\Big(\frac1N \sum\big({}^\star f(t) : t \in \mathcal{T}^0\big)\Big)\,.$$
Definition 34 Let now $f : \mathbb{R} \to \mathbb{R}$ be a function, bounded and compactly supported, which means $f(x) = 0$ for $x$ outside a finite interval; then we may define (the "integral over $\mathbb{R}$")
$$\int f(x)\,dx := \mathrm{st}\,U\langle S_j\rangle\,,\quad\text{where } S_j := \frac{1}{N_j} \sum\big(f(x) : x \in T_j\big)\,,\qquad T_j = \Big\{\frac{k}{N_j} : k \in \mathbb{Z},\ -N_j^2 \le k < N_j^2\Big\}\,.$$
The definition is valid for any such function $f$ as long as it is bounded. (This is just like the Banach limit for a bounded sequence; the axiom of choice is also involved in the latter!) The Riemann integrability is relevant only to prove that this "integral" is independent of the Riemann summation scheme. We will now skip all these matters, mentioning only that, for any bounded function $f$ defined on $\mathbb{R}$:
$$\int_a^b f(x)\,dx := \int f(x)\,1_{[a \ldots b]}(x)\,dx\,,$$
and, as is easily seen,
$$(a < b < c) \ \Rightarrow\ \int_a^c f(x)\,dx = \int_a^b f(x)\,dx + \int_b^c f(x)\,dx\,.$$
This justifies the usual definition that
$$\int_a^b f(x)\,dx := -\int_b^a f(x)\,dx\,,\quad\text{for } b < a\,.$$
Theorem 18 (Fundamental Theorem of Calculus) (1) If $f$ is continuous on an interval containing a point $a$, and $g(u) := \int_a^u f(x)\,dx$, then $g$ is continuous, and for a continuity point $b$ of $f$, the function $g$ is differentiable at $b$, with $g'(b) = f(b)$. (2) If $f$ is derivable on the interval $[a \ldots b]$, so that the derivative $f'$ is continuous, hence Riemann-integrable, then
$$\int_a^b f'(x)\,dx = f(b) - f(a)\,.$$
Remark 6 For the improper Riemann integral $\int_0^1 \frac{dx}{\sqrt{x}}$, the formula is valid.

NS-ODE

Theorem 19 (Cauchy–Peano) If $f : (x, t) \in \mathbb{R} \times [0 \ldots 1] \mapsto f(x, t) \in \mathbb{R}$ is bounded and continuous, then, for any "initial position" $a \in \mathbb{R}$, there is a "solution path"
$$y : t \in [0 \ldots 1] \mapsto y(t) \in \mathbb{R}$$
such that: it is continuous, and
$$y(0) = a\,,\qquad y'(t) = f(y(t), t)\,,\quad \forall t \in [0 \ldots 1]\,.$$
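Before the proof, note that the finite stage of Theorem 17's "hyper-average" is just an ordinary left Riemann sum, which can be computed directly. A minimal sketch, with $f(x) = x^2$ as an assumed test function:

```python
# Finite-stage sketch of Theorem 17: at stage j the hyper-average is the
# left Riemann sum over T_j^0 = {k/N_j : 0 <= k < N_j}; its standard part
# is approached as N_j grows.

def left_riemann_unit(f, Nj):
    """(1/N_j) * sum of f over the grid T_j^0."""
    return sum(f(k / Nj) for k in range(Nj)) / Nj

f = lambda x: x * x
for Nj in (10, 1000, 100000):
    print(Nj, left_riemann_unit(f, Nj))   # tends to the integral 1/3
```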
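The polygonal recursion used in the proof that follows has an obvious finite-$N_j$ counterpart, Euler's method. A minimal sketch; the test equation $y' = y$ is an assumed illustration only (globally it is not bounded, but it is bounded on the range traversed here):

```python
import math

# Finite-N sketch of the polygonal scheme in the proof of Theorem 19:
# Y(k/N) = Y((k-1)/N) + (1/N) * f(Y((k-1)/N), (k-1)/N).

def polygonal(f, a, N):
    y = [a]
    for k in range(N):
        t = k / N
        y.append(y[-1] + f(y[-1], t) / N)
    return y  # y[k] approximates the solution at time k/N

f = lambda x, t: x           # y' = y, y(0) = 1, exact solution e^t
path = polygonal(f, 1.0, 100000)
print(path[-1], math.e)      # close for large N
```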
Proof The polygonal approximation is the following well-known scheme: $y = \lim_{j\to\infty} y_j$. Take a natural number $N_j$ and divide the time interval into $N_j$ equal subintervals. In the duration of the $k$th subinterval $(k-1)/N_j \le t < k/N_j$, the velocity is approximated by the velocity field at the latest position, $y_j'(t) := f(y_j((k-1)/N_j), (k-1)/N_j)$, and consequently there is a recursion:
$$y_j\Big(\frac{k}{N_j}\Big) := y_j\Big(\frac{k-1}{N_j}\Big) + \frac{1}{N_j}\, f\Big(y_j\Big(\frac{k-1}{N_j}\Big), \frac{k-1}{N_j}\Big)\,,\quad k = 1, 2, \ldots, N_j\,.$$
The standard way is to take a "limiting point" of the sequence of approximate solutions (which are polygonal = "piecewise linear") $y_j(t)$, $j = 1, 2, 3, \ldots$, considered as points of the path space (= function space, with uniform topology). Ascoli's compactness criterion ensures the existence of such a limiting point $y_\infty$. From the uniform convergence
$$\lim_{j\to\infty} y_j(t) = y_\infty(t)\,,\qquad \lim_{j\to\infty} f(y_j(t), t) = f(y_\infty(t), t)\,,$$
there follows easily the sought-for identity:
$$y_\infty(t) = a + \int_0^t f\big(y_\infty(s), s\big)\,ds\,.$$
The standard proof is translated into the NS proof thus: with $N_j \uparrow \infty$, we have an internal hyper-natural number (unlimited, of course) $N = U(N)$, $N := \langle N_j\rangle$, together with the internal set of partition points of the NS time interval:
$$\mathcal{T} := \Big\{\frac{k}{N} : k \in {}^\star\mathbb{Z},\ 0 \le k \le N\Big\}\,.$$
For "time" $\tau = \frac{k}{N}$ on this hyperfinite set, define the $N$th approximate "position" $Y(\tau)$ to be
$$Y\Big(\frac{k}{N}\Big) = a + \frac{1}{N} \sum_{i=0}^{k-1} {}^\star f\Big(Y\Big(\frac{i}{N}\Big), \frac{i}{N}\Big)\,.$$
Finally take
$$y(t) := \mathrm{st}\Big(Y\Big(\frac{\mathrm{floor}(tN)}{N}\Big)\Big)\,.$$
The above NS summation-multiplied-by-infinitesimal goes to the Riemann integral:
$$y(t) = a + \int_0^t f(y(s), s)\,ds\,.$$
It is well known that under the very general conditions of boundedness and continuity the solution is not unique, reflecting the arbitrariness in the choice of one limiting point from a compact set. Of course this corresponds to the arbitrariness of the choice of the unlimited $N \in {}^\star\mathbb{N}$, or rather, of the hyperfinite partition $\mathcal{T}$.

General Idea of NSA

The Set-Terminology

Ever since G. Cantor invented the Mengenlehre, and after the Hilbertian axiomatization (r)evolution, naive set theory became the common language of mathematics. Everything is (to be expressed in terms of) a set! As an example, the color "blue" may be defined as "the set of all blue things". (I'd like to make this my first teaching of pedagogy.) If this sentence is hardly comprehensible, then an easier explanation is as follows. Suppose I say: "This stone $\omega$ is blue". Then what I mean is the mathematical sentence "$\omega \in B$", where $B$ is a conveniently collected set of blue stones. Of course $B$ should be a subset of a convenient "total set" $\Omega$ of stones. With this primitive idea of 'membership relation', simple set notions fit in with our daily logic. Precisely: if $a$ is the sentence "$x \in A$", and $b$ is "$x \in B$", then the logical combinations of assertions correspond to the Boolean combinations of sets in the well-known way:
"$a$ and $b$" is "$x \in A \cap B$"; "$a$ or $b$" is "$x \in A \cup B$"; "not $b$" is "$x \in \Omega \setminus B$".
Here a convenient total set $\Omega$ should be available for the logical analysis. From this thinking, a set $\mathcal{A}$ of subsets of $\Omega$ is called a Boolean algebra of sets if it is closed under the three set operations mentioned above: $\cup$, $\cap$, and "complementation" (set difference from the total set).

My second example of the "triumph of naive set theory" is the axiomatization of probability theory by A. N. Kolmogorov. Its publication was 10 years after N. Wiener's construction of the Wiener process. Wiener wanted to talk about the "probabilistic behavior of the path $W$ of a particle". By a path is meant a function from the time axis to the state space, $W(t)$ signifying "the position at time $t$". Here the time axis is the half-line $[0 \ldots +\infty)$, and the state space is the real line $\mathbb{R}$. And Wiener requires that the path be continuous and start from the origin: $W(0) = 0$. In the viewpoint of Kolmogorov, the phrase "probabilistic behavior of the path $W$ of a particle" should be changed to "behavior of a random particle". "A random particle" just means that it is "one chosen from many"; these many form a set. (This set, the so-called probability space, is always an imagined set. It is almost always an infinite set, so never "concrete" for students of probability theory!) The picture is: Lady Luck collects a big set $\Omega$ of all particles (= chancelings). For each particle $\omega \in \Omega$, its path is $W(\cdot, \omega)$, mapping time $t$ to its position $W(t, \omega)$ at time $t$. (So the Wiener process $W$ is a "two-variable function"; its domain of definition is the set product $\mathbb{R}_+ \times \Omega$. Textbooks of probability theory always suppress the variable $\omega$ which signifies the randomness.)

The success of Kolmogorov's axiomatization is simply this: it makes precise what kind of "questions about probability" one can ask. Now probability comes in only because Lady Luck chooses one chanceling $\omega$ from all conceivables. (Then this chosen one is the "reality".) We are interested (only) in assertions of the form "$\omega \in B$". We can only ask questions of this kind: "What is the probability of the event $\omega \in B$?" And that only for some kinds of sets $B$. The answer $\mu(B)$ is the probability predestined by Her. For what kind of $B \subseteq \Omega$ the value $\mu(B)$ is meaningful is also Her designation. Then we simply call $B$ an event. So $\mu(B)$ is now the probability of the event $B$ (instead of the "event $\omega \in B$"). Since we can do Boolean combinations of assertions, correspondingly the set $\mathcal{A}$ of all events should be a Boolean set algebra. But in probability theory, always the stronger condition of sigma-completeness is required: $\mathcal{A}$ is closed under countable union, thus called a $\sigma$-algebra.

Definition 35 ((Kolmogorov) Probability Space) A probability space is a non-empty set $\Omega$, together with a measure structure $\langle \mathcal{A}, \mu\rangle$, where $\mathcal{A}$ is a $\sigma$-algebra on $\Omega$, and $\mu : \mathcal{A} \to \mathcal{U}$ satisfies the countable-additivity condition: if $\forall n,\ B_n \in \mathcal{A}$ and $\forall m \ne n,\ B_m \cap B_n = \emptyset$, then
$$\mu\Big(\bigcup_{n=1}^\infty B_n\Big) = \sum_{n=1}^\infty \mu(B_n)\,.$$
Returning to Wiener's process: Lady Luck should allow assertions $\omega \in B$ of the form $W(t, \omega) \in (c_1 \ldots c_2]$, where the time $t \in [0 \ldots +\infty)$ and the interval $(c_1 \ldots c_2]$ are arbitrary. With this condition satisfied, the two-variable function $W : \langle t, \omega\rangle \mapsto W(t, \omega)$ is called a stochastic process on the probability space $(\Omega, \mathcal{A}, \mu)$. Actually the central questions asked by (Bachelier and) Wiener concern events of the following form: for
$$s_1 < t_1 \le s_2 < t_2 \le \cdots \le s_n < t_n\,,$$
and intervals $A_1, A_2, \ldots, A_n$,
$$B = \{\omega : W(t_k, \omega) - W(s_k, \omega) \in A_k\,,\ 1 \le k \le n\}\,.$$
Theorem 20 (Wiener) There exists a probability space $(\Omega, \mathcal{A}, \mu)$, and a stochastic process $W$ on it, such that: for each $\omega \in \Omega$, the sample path $W(\cdot, \omega)$ is a continuous function on $[0 \ldots +\infty)$; and, for a set $B$ as above,
$$\mu(B) = \prod_{1 \le k \le n} \int_{A_k} \frac{1}{\sqrt{2\pi(t_k - s_k)}}\, e^{-\frac{x^2}{2(t_k - s_k)}}\, dx\,.$$
Naive Set Operations

As we have explained above, a mathematical system is a set with structure, which is a finite set of items of functions, relations, or constants. We now explain that, starting from a ground set, there is a superstructure which will contain all conceivable structures. Naive as we are, sometimes we just have to be very careful. So we give a very careful definition of couple and relation, which will be a foundation of all that follows.

Definition 36 (Couple and Cartesian Product) For two non-empty sets $A, B$, the Cartesian product is defined as
$$A \times B := \{\langle a, b\rangle : a \in A \text{ and } b \in B\}\,.$$
Here "a couple", or "an ordered pair", is defined by
$$\langle a, b\rangle := \{\{a\},\ \{a, b\}\}\,.$$
The ordered $n$-tuple, for $n > 2$, may be defined recursively as
$$\langle x_1, x_2, \ldots, x_n\rangle := \langle\langle x_1, x_2, \ldots, x_{n-1}\rangle,\ x_n\rangle\,,$$
$$A^n := \{\langle x_1, x_2, \ldots, x_n\rangle : x_1, x_2, \ldots, x_n \in A\}\,.$$
Note that we could write the "1-tuple" as $\langle a\rangle := a$.

… (1): If for any limited $\nu$ there is $\alpha \in A$ such that $\alpha > \nu$, then there is an unlimited positive $\alpha \in A$. (2): If for any unlimited positive $\varrho \in A$ there is another unlimited $\zeta \in A$ such that $\zeta < \varrho$, then there is a limited positive $\alpha \in A$. (3): Both subsets $\mathbb{R}$ ($\subseteq {}^\star\mathbb{R}$) and ${}^\star\mathbb{R} \setminus \mathbb{R}$ of the standard set ${}^\star\mathbb{R}$ (hyper-version) are external.

Proof (1) We may assume that $A$ is bounded above. But then, with $\lambda = \sup(A) \in {}^\star\mathbb{R}$, $\lambda$ is unlimited. Therefore, by the definition of lub, there exists an $\alpha \in A$ with $\alpha \ge \lambda - 2$; such an $\alpha$ is unlimited.
(2) Let $\zeta$ be the infimum (= inf = greatest lower bound) of the subset of positive elements of $A$; then $\zeta$ is limited. And by the definition of inf, there is $\alpha \in A$ such that $\zeta \le \alpha \le \zeta + 1$.

Theorem 26 (Saturation Principle) Let $\langle A_n : n \in \mathbb{N}\rangle$ be a sequence of internal sets with the "finitely-intersecting property":
$$\forall n \in \mathbb{N}\,,\quad \bigcap_{1 \le m \le n} A_m \ne \emptyset\,.$$
Then the sequence is itself intersecting:
$$\bigcap_{m \in \mathbb{N}} A_m \ne \emptyset\,.$$
Proof (For the proof, the countable indicial set $I = \mathbb{N}$ is enough.) Now for each $n \in \mathbb{N}$, $A_n = U\langle A_{n,j} : j \in I\rangle$. And for each fixed $n \in \mathbb{N}$,
$$\bigcap_{1 \le m \le n} A_m = U\Big\langle \bigcap_{1 \le m \le n} A_{m,j} : j \in I\Big\rangle \ne \emptyset\,.$$
Therefore for each $n \in \mathbb{N}$, and for ultimate $j$, $\bigcap_{1 \le m \le n} A_{m,j} \ne \emptyset$. (In particular, we may and will just assume: for all $j$, $A_{1,j} \ne \emptyset$.) For each $j \in I$, we define
$$k_j := \max\Big\{n \in \mathbb{N} : \bigcap_{1 \le m \le n} A_{m,j} \ne \emptyset\,,\text{ and } n \le j\Big\}\,.$$
Now for each $j \in I$, choose an $x_j \in \bigcap_{1 \le m \le k_j} A_{m,j}$. We see that $\xi = U\langle x_j\rangle \in A_n$ for any $n \in \mathbb{N}$. Indeed, for any $n \in \mathbb{N}$,
$$\{j \in I : x_j \in A_{n,j}\} \supseteq \{j : k_j \ge n\} \supseteq \{j \in I : j \ge n\} \cap \Big\{j \in I : \bigcap_{m \le n} A_{m,j} \ne \emptyset\Big\} \in U\,.$$
Corollary 4 For a sequence $\langle A_n : n \in \mathbb{N}\rangle$ of internal sets, their (countably infinite) union is internal, $\bigcup_{n \in \mathbb{N}} A_n \in {}^\star S_1$, iff the union equals a finite sub-union, i.e., iff there exists $m \in \mathbb{N}$ such that
$$\bigcup_{n \in \mathbb{N}} A_n = \bigcup_{1 \le n \le m} A_n\,.$$
Proof If $B := \bigcup_{n \in \mathbb{N}} A_n \in {}^\star S_1$ (is internal), then all $B_n := B \setminus A_n$ are internal, and $\bigcap_{n \in \mathbb{N}} B_n = \emptyset$ by de Morgan's law. Then the saturation principle requires that for a certain $m \in \mathbb{N}$, $\bigcap_{1 \le n \le m} B_n = \emptyset$. Or, $\bigcap_{1 \le n \le m}(B \setminus A_n) = \emptyset$, and $\bigcup_{1 \le n \le m} A_n = B = \bigcup_{n \in \mathbb{N}} A_n$.

Measure

Let a basic non-empty set $\Omega$ be fixed. We recall that a subset $\mathcal{A} \subseteq \mathcal{P}(\Omega)$ is called a Boolean algebra of sets iff the following requirements are satisfied: $\Omega \in \mathcal{A}$, and
$$(\forall x \in \mathcal{A})(\forall y \in \mathcal{A})\ (x \cup y \in \mathcal{A})\,;$$
$$(\forall x \in \mathcal{A})(\forall y \in \mathcal{A})\ (x \cap y \in \mathcal{A})\,;$$
$$(\forall x \in \mathcal{A})(\forall y \in \mathcal{A})\ (x \setminus y \in \mathcal{A})\,.$$
Or the last three sentences can be combined into one:
$$(\forall x \in \mathcal{A})(\forall y \in \mathcal{A})\ \big((x \cup y \in \mathcal{A}) \wedge (x \cap y \in \mathcal{A}) \wedge (x \setminus y \in \mathcal{A})\big)\,.$$
If $\mathcal{A}$ is closed under countable operations, it is then called a $\sigma$-algebra on $\Omega$:
$$\forall (x_n : n \in \mathbb{N})\,,\quad \big((\forall n \in \mathbb{N}\,,\ x_n \in \mathcal{A}) \ \Rightarrow\ (\textstyle\bigcup_{n \in \mathbb{N}} x_n \in \mathcal{A})\big)\,.$$
It is well known that for any set $\mathcal{B}$ of subsets of $\Omega$, there is uniquely a smallest $\sigma$-algebra $\mathcal{A} \supseteq \mathcal{B}$. This is the $\sigma$-algebra generated by $\mathcal{B} \in \mathcal{P}(\mathcal{P}(\Omega))$.

Definition 44 (Probability Content and Measure) Let $\mathcal{A}$ be a Boolean algebra of sets on a total set $\Omega$. A function $\mu : A \in \mathcal{A} \mapsto \mu(A) \in \mathbb{R}$ is called a content on $(\Omega, \mathcal{A})$ if the following conditions are satisfied:
$$(1)\ \forall x \in \mathcal{A}\,,\quad \mu(x) \ge 0\,;$$
$$(2)\ \forall x \in \mathcal{A}\,,\ \forall y \in \mathcal{A}\,,\quad (x \cap y = \emptyset) \Rightarrow \big(\mu(x \cup y) = \mu(x) + \mu(y)\big)\,.$$
If the following condition of "continuity from above at the empty set" is satisfied, then $\mu$ is called a measure:
$$\forall (x_n : n \in \mathbb{N})\,,\quad \big((\forall n \in \mathbb{N}\,,\ x_n \in \mathcal{A}) \wedge (x_{n+1} \subseteq x_n) \wedge (\textstyle\bigcap_n x_n = \emptyset)\big) \Rightarrow \lim_{n\to\infty} \mu(x_n) = 0\,.$$
An equivalent condition is: $\mu$ is $\sigma$-additive, in the sense that if $\langle A_n\rangle$ is a disjoint sequence in $\mathcal{A}$, with union $\bigcup_{n \in \mathbb{N}} A_n \in \mathcal{A}$, then
$$\mu\Big(\bigcup_{n \in \mathbb{N}} A_n\Big) = \sum_{n \in \mathbb{N}} \mu(A_n)\,.$$
Loeb Construction

In this section we provide the Loeb version of constructing a non-standard version of stochastic analysis (cf. [8]).
If a content (or measure) is normalized, $\mu(\Omega) = 1$, then it is called a probability content (or probability measure, respectively).

Theorem 27 (Extension of a Measure) If $\mu$ is a measure on $(\Omega, \mathcal{A})$, and $\mathcal{B}$ is the $\sigma$-algebra generated by $\mathcal{A}$, then there is a unique extension of $\mu$ to a measure $\tilde\mu$ on $(\Omega, \mathcal{B})$. Further, let $\bar{\mathcal{A}}$ be the collection of sets of the form $A = (B \setminus D) \cup (D \setminus B)$, where $B \in \mathcal{B}$, and $D \subseteq C$ with $C \in \mathcal{B}$, $\tilde\mu(C) = 0$; define then $\bar\mu(A) = \tilde\mu(B)$. Then the definition is valid, and $\bar\mu$ is a measure on $(\Omega, \bar{\mathcal{A}})$. This extension of $\mu$ is the "completion of $\mu$". It is "complete" in the sense that whenever $A \in \bar{\mathcal{A}}$, $F \subseteq A$, and $\bar\mu(A) = 0$, then $F \in \bar{\mathcal{A}}$ with $\bar\mu(F) = 0$.

Remark 8 ($\sigma$-Finiteness) The definition of a content or measure is generally extended to allow $+\infty$ in the range. Then in the discussion the condition of $\sigma$-finiteness is generally sufficient and necessary: there is a sequence $A_n \in \mathcal{A}$, with $\bigcup_{n \in \mathbb{N}} A_n = \Omega$, and $\mu(A_n)$ finite for all $n$.

Loeb Measure

Definition 45 (Loeb Content) Let $\Omega \in {}^\star S_1$ be an internal set, and let $\mathcal{A}$ be an internal set of subsets of $\Omega$ which is closed under complementation and finite union. A Loeb content on $(\Omega, \mathcal{A})$ is an internal function $\mu : \mathcal{A} \to {}^\star\mathbb{R}$ which is a content on $\mathcal{A}$, i.e., it is finitely additive and non-negatively valued. (But it is valued in ${}^\star\mathbb{R}$, not real-valued.)

Theorem 28 (Loeb Measure) For a Loeb content space $\langle \Omega, \mathcal{A}, \mu\rangle$, define the standard part of $\mu$ as
$${}^\circ\mu : A \in \mathcal{A} \mapsto {}^\circ\mu(A) := {}^\circ(\mu(A))\,.$$
It is then $\sigma$-additive, and its completion is the Loeb measure $L(\mu)$ of $\mu$, defined on the Loeb algebra $L(\mathcal{A})$ consisting of those $B \subseteq \Omega$ which are "$\mu$-approximable": for any positive real $\epsilon$, there are $A \in \mathcal{A}$, $C \in \mathcal{A}$ such that
$$A \subseteq B \subseteq C\,,\qquad \mu(C) - \mu(A) < \epsilon\,.$$
Indeed there is $D \in \mathcal{A}$ (called Loeb's modification of $B$) such that
$$L(\mu)(B \mathbin{\triangle} D) = 0\,.$$
Here $\triangle$ is the symmetric difference: $B \mathbin{\triangle} D := (B \setminus D) \cup (D \setminus B)$. We omit the proof, but by the saturation property of internal sets, the continuity or $\sigma$-additivity of ${}^\circ\mu$ is trivial. It should be emphasized that $L(\mu)$ is a real-valued measure; therefore the integrands of its integration are real-valued, but defined on $\Omega$, hence not internal.
Definition 46 (Measurable Functions) If temporarily we denote by G the set of all open sets of ℝ (also called the topology of ℝ), and call elements of *G star-open, then an internal function F : Ω → *ℝ is called A-measurable iff (G ∈ *G) ⇒ F⁻¹(G) ∈ A. If F := U⟨F_j⟩ and μ = U⟨μ_j⟩, we define naturally

*∫ F dμ := U⟨∫ F_j dμ_j⟩.

This is valued in *ℝ.

Remark 9 (Loeb Integral) If F is real-bounded, then

°(*∫ F dμ) = ∫ °F dL(μ),

which is then called the Loeb integral. The equality here is guaranteed by the S-integrability of the internal function F:

(1): *∫ |F| dμ is limited.
(2): If μ(A) ≈ 0, then *∫_A |F| dμ ≈ 0.
(3): If F(ω) ≈ 0 for all ω ∈ A, then *∫_A |F| dμ ≈ 0.

Now we turn to the integration of a real-valued function f on Ω, with respect to the real-valued measure L(μ).

Theorem 29 (Lifting of Function) Any L(A)-measurable function f has a "lifting" F : Ω → *ℝ, in the sense that F is an internal, A-measurable function satisfying

L(μ){ω : °F(ω) ≠ f(ω)} = 0.

Definition 47 (Normed Counting Measure) If Ω ∈ *S₁ is a hyperfinite set, and A is the set of all internal subsets of Ω, then the internal probability content
μ(A) := *card(A) / *card(Ω)

is called the normed counting measure on Ω.

Example 11 (Hyperfinite Equi-partition of the Unit Interval) Let N ∈ *ℕ ∖ ℕ be an unlimited hyperinteger, T be the equi-partition hyperset of the unit interval into N parts, and μ be the normalized uniform counting measure. Then we get the Loeb measure L(μ) and the Loeb algebra L(A) on T ⊆ *[0…1]. The Loeb measure space (T, L(A), L(μ)) is simply the pull-back of the standard Lebesgue measure space (U, B(U), λ) by the standard-part map.
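A finite-rank shadow of this example can be computed directly (the hyperfinite and Loeb objects themselves cannot be represented on a computer; the sketch below only illustrates, for a large but standard N, how the normed counting measure on the equi-partition of [0, 1] assigns an interval a value close to its Lebesgue length):

```python
# Finite-rank analogue of Example 11: counting measure on the
# equi-partition T_N = {k/N : 0 <= k < N} of the unit interval.
from fractions import Fraction

def counting_measure(A_size, omega_size):
    """mu(A) = card(A) / card(Omega), computed exactly in rationals."""
    return Fraction(A_size, omega_size)

def measure_of_interval(a, b, N):
    """mu({k/N : 0 <= k < N, a <= k/N <= b}) for 0 <= a <= b <= 1."""
    hits = sum(1 for k in range(N) if a <= Fraction(k, N) <= b)
    return counting_measure(hits, N)

N = 10_000
m = measure_of_interval(Fraction(1, 4), Fraction(3, 4), N)
# The counting measure of [1/4, 3/4] differs from its Lebesgue
# length 1/2 by at most O(1/N).
assert abs(m - Fraction(1, 2)) <= Fraction(2, N)
```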
Here we have a dual of push-down and pull-back (or lifting). The push-down is via the standard-part map st : T → U, which is a surjection, while its inverse st⁻¹ furnishes the pull-back. The push-down of L(A), by st, is by definition

{B ⊆ U : st⁻¹[B] ∈ L(A)}.

This is precisely the Lebesgue algebra B(U). And for a member B in this latter algebra, the push-down of L(μ) will assign the measure value L(μ)({x ∈ T : °x ∈ B}). But we have already shown that when B is an interval, this value is simply its length λ(B). This is enough to identify the pushed-down measure. Actually, the same kind of Loeb construction may be applied to the whole real axis ℝ, or the real half-axis ℝ₊, instead of U (or the analogous finite intervals). A little bit of attention should be paid to σ-finiteness, of course.
Anderson's Wiener Process

Definition 48 (Hyperfinite Time-axis) When N ∈ *ℕ ∖ ℕ is fixed, N = U⟨N_j⟩, we have the hyperfinite time-axis

T := U⟨T_j⟩,  T_j := {k/N_j : k ∈ ℤ, 0 ≤ k/N_j < N_j}.

As before, T_j is the interval [0…N_j) equi-partitioned into N_j² parts, each with length 1/N_j. The right end N_j is deliberately deleted, so that card(T_j) = N_j², while the augmented set is T_j′ := T_j ∪ {N_j}. The set T_j will be called the rank j (discretized) time-axis. The left and right approximations from the semi-axis [0, +∞) to T are well-defined:

floor_T(c) := U⟨floor(c N_j)/N_j : j ∈ I⟩,
ceil_T(c) := U⟨ceil(c N_j)/N_j : j ∈ I⟩.

Definition 49 (Hyperfinite Coin-Tossing) With N ∈ *ℕ ∖ ℕ, and T = {k/N : k = 0, 1, 2, …, N² − 1} as above, we define

Coin := {−1, +1},  Ω_j := Coin^(T_j),  Ω := U⟨Ω_j⟩.

Of course, +1 or −1 means "head" or "tail" of a coin toss. Any element ω_j ∈ Ω_j is called a coin-particle of rank j, and an element ω ∈ Ω is called a coin-particle (of hyperfinite rank). The normalized counting measure P_j on Ω_j is simple indeed, as is the one occurring most often in our high-school textbooks. (We will denote by E_{P_j} the expectation with respect to P_j.) The normalized counting measure P is defined on the algebra of all internal subsets of the hyperfinite internal set Ω. Therefore the Loeb construction gives us the Loeb measure L(P), called the hyperfinite coin-measure, on the Loeb algebra L(A), which is called the hyperfinite coin algebra.

Definition 50 (The (Rank j) Random Walker) For fixed j, and any rank j coin-particle ω_j ∈ Coin^(T_j), its path is the mapping t ↦ B_j(ω_j, t) ∈ ℝ, defined as follows. The domain of definition is

T_j′ := {k/N_j : k = 0, 1, 2, …, N_j²},

and when t = k/N_j,

B_j(ω_j, t) := (1/√N_j) Σ_{0 ≤ m < k} ω_j(m/N_j).

…the spontaneous magnetization goes to zero as (T_c − T)^β if T < T_c, and the isothermal susceptibility χ_T diverges as (T − T_c)^{−γ} if T > T_c and as (T_c − T)^{−γ′} if T < T_c. If we assume that the free energy F, close to the critical point, is a generalized homogeneous function of T − T_c and M, that is, a function satisfying, for all values of λ, the relation

F(λ(T − T_c), λ^β M) = λ^{2−α} F(T − T_c, M),

then, choosing λ = 1/|T − T_c|, we can write F(T − T_c, M) under the form
F(T − T_c, M) = |T − T_c|^{2−α} f(M / |T − T_c|^β),

where f is a function of only one variable. From this expression it can be shown (see [4]) that the critical exponents satisfy the following so-called scaling relations:

α = α′,  γ = γ′,  and  α + 2β + γ = 2.
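As a sketch of where the last relation comes from (assuming the scaling form above, writing t := T − T_c, and assuming a twice-differentiable scaling function f):

```latex
% Differentiate the scaling form F = |t|^{2-\alpha} f(M/|t|^{\beta})
% twice with respect to the order parameter M:
\chi_T^{-1} \;=\; \frac{\partial^2 F}{\partial M^2}
   \;=\; |t|^{\,2-\alpha-2\beta}\, f''\!\left(\frac{M}{|t|^{\beta}}\right).
% Identifying this with the defining behavior \chi_T \sim |t|^{-\gamma}
% above T_c (and \sim |t|^{-\gamma'} below) gives
\gamma \;=\; \gamma' \;=\; 2-\alpha-2\beta,
\qquad\text{i.e.}\qquad \alpha + 2\beta + \gamma \;=\; 2 .
```

The same homogeneity argument, applied to first derivatives of F, recovers the exponent β for the spontaneous magnetization on both sides of T_c.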
An important distinguishing feature of the power-law behavior of physical quantities in the neighborhood of a critical point is that these quantities have no intrinsic scale. The function x ↦ exp(−x/ξ) has an intrinsic scale ξ, whereas the function x ↦ x^{−a} has no intrinsic scale: power laws are self-similar. Quantities exhibiting a power-law behavior have been observed in a variety of disciplines ranging from linguistics and geography to medicine and economics. As mentioned above, the emergence of such a behavior is regarded as the signature of a collective mechanism. Second-order phase transitions are always associated with a broken symmetry. That is, the symmetry group of the ordered phase (the phase characterized by a nonzero value of the order parameter) is a subgroup of the symmetry group of the disordered phase (the phase characterized by an order parameter identically equal to zero). To an order parameter, we can always associate a symmetry-breaking field. In the presence of such a field, the order parameter has a nonzero value, and, in this case, the system cannot exhibit a second-order phase transition. In the case of an ideally simple para-ferromagnetic phase transition, the paramagnetic
Phase Transitions in Cellular Automata
phase is invariant under the tridimensional rotation group whereas the ferromagnetic phase is no longer invariant under that group. The corresponding broken symmetry is characterized by the nonzero value of the vector magnetization M, which plays the role of the order parameter, and the symmetry-breaking field is the magnetic field B, the intensive conjugate parameter of M. The nature of the broken symmetry is not always obvious, as, for instance, in the case of the normal–superfluid or normal–superconductor phase transitions, but it does exist (see [5]). Depending upon the nature of the order parameter, a second-order phase transition exists only above a critical space dimensionality, called the lower critical space dimension. Critical exponents, which depend upon space dimensionality, take their mean-field values (i.e., when space dependence and correlations are ignored) above another critical space dimensionality, called the upper critical space dimension. In the case of the Ising model, the lower and upper critical space dimensions are, respectively, equal to 1 and 4.

In what follows we focus on phase transitions in cellular automata. Deterministic one-dimensional cellular automaton rules are defined as follows. Let Q denote the finite set of integers {0, 1, …, m} and s(i, t) ∈ Q represent the state at site i ∈ ℤ and time t ∈ ℕ; a local evolution rule is a map f : Q^(r_ℓ + r_r + 1) → Q such that

s(i, t + 1) = f(s(i − r_ℓ, t), s(i − r_ℓ + 1, t), …, s(i + r_r, t)),   (1)

where the integers r_ℓ and r_r are, respectively, the left radius and right radius of the rule f; if r_ℓ = r_r = r, r is called the radius of the rule. The local rule f, which is a function of n = r_ℓ + r_r + 1 arguments, is often said to be an n-input rule. The function S_t : i ↦ s(i, t) is the state of the cellular automaton at time t; S_t belongs to the set Q^ℤ of all configurations. Since the state S_{t+1} at time t + 1 is entirely determined by the state S_t at time t and the local rule f, there exists a unique mapping F_f : Q^ℤ → Q^ℤ such that

S_{t+1} = F_f(S_t),   (2)

called the cellular automaton global rule, or the cellular automaton evolution operator, induced by the local rule f. Most models presented in this article will be formulated in terms of probabilistic cellular automata. In this case, the image by the evolution rule of any (r_ℓ + r_r + 1)-block is a discrete random variable with values in Q.

The simplest cellular automata are the so-called elementary cellular automata, in which the finite set of states is Q = {0, 1} and the rule's radii are r_ℓ = r_r = 1. Sites in a nonzero state are sometimes said to be active. It is easy to verify that there exist 2^(2³) = 256 different elementary cellular automaton local rules f : {0, 1}³ → {0, 1}. The local rule of an elementary cellular automaton can be specified by its look-up table, giving the image of each of the eight three-site neighborhoods. That is, any sequence of eight binary digits specifies an elementary cellular automaton rule. Here is an example:

111  110  101  100  011  010  001  000
 1    0    1    1    1    0    0    0

Following Wolfram [6], a code number may be associated with each cellular automaton rule. If Q = {0, 1}, this code number is the decimal value of the binary sequence of images. For instance, the code number of the rule above is 184 since 10111000₂ = 2⁷ + 2⁵ + 2⁴ + 2³ = 184₁₀. More generally, the code number N(f) of a one-dimensional |Q|-state n-input cellular automaton rule f is defined by

N(f) = Σ_{(x₁, x₂, …, x_n) ∈ Q^n} f(x₁, x₂, …, x_n) |Q|^(|Q|^{n−1} x₁ + |Q|^{n−2} x₂ + ⋯ + |Q|⁰ x_n).

The Domany–Kinzel Cellular Automaton

Directed percolation refers to lattice models that mimic coffee brewing, that is, water passing through a bed of ground coffee. Consider the square lattice represented in Fig. 1, in which open bonds are randomly distributed with a probability p. In contrast with the usual bond percolation problem, here bonds are directed downwards, as indicated by the arrows.
Phase Transitions in Cellular Automata, Figure 1 A configuration of directed bond percolation on a square lattice
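The rule-coding scheme described in the previous section is easy to mechanize. The sketch below (a generic illustration, not code from the article) evolves an elementary rule from its Wolfram number and recovers the number 184 from the example look-up table:

```python
# Elementary cellular automata (Q = {0, 1}, radius 1): build the
# look-up table from a Wolfram code number, recover the code number
# from a table, and apply one synchronous update step.

def rule_table(code):
    """Map each 3-site neighborhood (a, b, c) to its image bit."""
    return {(a, b, c): (code >> (4 * a + 2 * b + c)) & 1
            for a in (0, 1) for b in (0, 1) for c in (0, 1)}

def code_number(table):
    """Inverse of rule_table: decimal value of the binary image sequence."""
    return sum(bit << (4 * a + 2 * b + c)
               for (a, b, c), bit in table.items())

def step(state, table):
    """One synchronous update with periodic boundary conditions."""
    n = len(state)
    return [table[(state[i - 1], state[i], state[(i + 1) % n])]
            for i in range(n)]

# The example rule from the text: 111->1, 110->0, 101->1, 100->1,
# 011->1, 010->0, 001->0, 000->0.
table = {(1, 1, 1): 1, (1, 1, 0): 0, (1, 0, 1): 1, (1, 0, 0): 1,
         (0, 1, 1): 1, (0, 1, 0): 0, (0, 0, 1): 0, (0, 0, 0): 0}
assert code_number(table) == 184        # 10111000_2 = 184_10
assert rule_table(184) == table

# Rule 184 (a "traffic" rule) conserves the number of active sites.
s = [1, 0, 1, 1, 0, 0, 1, 0]
assert sum(step(s, table)) == sum(s)
```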
If we imagine a fluid flowing downwards from wet sites in the first row, one problem is to find the probability P(p) that, following directed open bonds, the fluid will reach sites on an infinitely distant last row. There clearly exists a threshold value p_c above which P(p) is nonzero (see [7]). If the downwards direction is considered to be the time direction, the directed bond percolation process may be viewed as the evolution of a two-input one-dimensional cellular automaton rule f such that s(i, t + 1) = f(s(i, t), s(i + 1, t)), where s(i, t + 1) = 0 if s(i, t) + s(i + 1, t) = 0, …

…P(C) ≥ 0, and (ii) P(C) = 0 if and only if C is an exact partition of the selected type. With such a P(C) we can set up a clustering problem to minimize P(C) (given the equivalence
Positional Analysis and Blockmodeling
specified for a blockmodel). If there are exact equivalences of a given type then P(C) is 0. However, if there are no such partitions (i.e., the set of exact partitions is empty), then an optimization approach gives the solution(s) that differ(s) the least from an ideal case – provided that the criterion function is compatible with, and sensitive to, the type of equivalence that is used. The intuition behind using these criterion functions is that of a pair of blockmodels that are compared in terms of the criterion function. One is an ideal blockmodel with only the permitted block types for a given equivalence and the other is a corresponding empirical blockmodel (i.e., with the same partition of the vertices) for a clustering C. Let B(C_i, C_j) denote the set of all of the ideal blocks (in the blockmodel) and let p(C_i, C_j) denote the inconsistency between the empirical block defined by C_i and C_j and the corresponding nearest ideal block. The criterion function for the blockmodel is expressed as P(C) = Σ p(C_i, C_j), where the summation is over all k² blocks. At the block level, the block inconsistency is p(C_i, C_j) = min δ(R(C_i, C_j), B), where the minimization is over all possible ideal blocks B ∈ B(C_i, C_j), and δ(R(C_i, C_j), B) measures the deviation (number of inconsistencies) between R(C_i, C_j) and the nearest ideal block B. For structural equivalence, the simplest inconsistency measure for a block whose nearest ideal block is null is the number of 1s in it. Put differently, δ(R(C_i, C_j), B) is the number of 1s that are present where they ought not be. Similarly, the simplest inconsistency δ(R(C_i, C_j), B) for a block whose nearest ideal block is complete is the number of 0s in it. The total number of inconsistencies is obtained by summing the inconsistencies over all of the blocks in the blockmodel. This gives us P(C). This can be generalized by using differential weights for the types of inconsistencies between the ideal and empirical blockmodels.
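A minimal sketch of this counting for structural equivalence (a generic illustration; only the null and complete block types are considered, and diagonal self-ties are ignored, a common convention assumed here):

```python
# Structural-equivalence inconsistencies for a clustering of a binary
# network R: each block R(C_i, C_j) is compared with the nearer of the
# two ideal block types (null -> count its 1s, complete -> count its
# 0s) and P(C) sums these deviations over all k^2 blocks.

def block_entries(R, rows, cols):
    """Entries of the block R(C_i, C_j), skipping diagonal cells."""
    return [R[u][v] for u in rows for v in cols if u != v]

def block_inconsistency(entries):
    """delta(R(C_i, C_j), B), minimized over null and complete ideals."""
    ones = sum(entries)
    return min(ones, len(entries) - ones)

def P(R, clustering):
    """Criterion function P(C): total inconsistency over all blocks."""
    return sum(block_inconsistency(block_entries(R, Ci, Cj))
               for Ci in clustering for Cj in clustering)

# Toy network in which {0, 1} and {2, 3} are exactly structurally
# equivalent positions, so the first clustering fits perfectly.
R = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [1, 1, 0, 1],
     [1, 1, 1, 0]]
assert P(R, [[0, 1], [2, 3]]) == 0   # exact fit: P(C) = 0
assert P(R, [[0, 3], [1, 2]]) == 4   # each block has one inconsistency
```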
For example, for structural equivalence partitions, it is possible to view 1s in what is specified as a null block as more serious than 0s in what is thought to be a complete block. If so, not all inconsistencies are viewed as being equally important and penalties can be imposed on some of them. Of course, this modifies the definition of, and values returned for, the criterion function, P(C). The 'best' partitions C are not known ahead of time and have to be identified empirically. A simple relocation algorithm can be used to determine one (or more) partitions that minimize the criterion function.

Optimization and Measures of Fit

The relocation algorithm is defined and mobilized as follows. To locate a partition (as consistent as possible with the set of permitted block types defining the equivalence used) into k clusters, the network is partitioned randomly into k clusters. The value of P(C) is then computed. Of course, given that the starting partition is random, this value will be very large. The starting (and any) partition can be changed in two ways: (i) interchanging a pair of vertices between two different clusters or (ii) moving a vertex from one cluster to another. These two types of change (transformations) determine the neighborhood of any clustering, C. The algorithm is expressed as follows: repeat: if in the neighborhood of the current clustering, C, there exists a clustering, C′, such that P(C′) < P(C), then move to the clustering C′. This continues until a smaller value of P(C) cannot be found. The relocation algorithm has to be repeated many times (hundreds or thousands of times) for different random starting partitions for a given value of k in order to obtain solutions with the minimal values of P(C). A criterion function is defined explicitly for a particular type of blockmodel. As noted above, for regular equivalence there are two ideal block types. The value of δ(R(C_i, C_j), B) is defined differently for each of these types. If B is a null block, then one specification of δ(R(C_i, C_j), B) is the number of 1-covered rows, and if B is a regular block, δ(R(C_i, C_j), B) is the number of rows or columns that have only 0s. (Other specifications are possible.) For a structural balance theoretic partition, there are two block types – positive and negative blocks. For positive blocks (on the diagonal) every −1 (for binary signed networks) contributes to δ(R(C_i, C_j), B), and for negative blocks (off-diagonal) every +1 (for binary signed networks) contributes to δ(R(C_i, C_j), B) for that block. If P is the total number of 'positive inconsistencies' and N is the total number of 'negative inconsistencies', then one specification for the criterion function is P(C) = P + N.
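The local search just described can be sketched as follows (a generic illustration with a pluggable criterion function; only the vertex-move neighborhood is implemented, and pairwise interchanges, the other move type in the text, are omitted for brevity):

```python
# Relocation algorithm sketch: from a random start, accept the first
# vertex move that lowers the criterion P(C); stop when no single move
# improves it; repeat from several random starts and keep the best.
import random

def relocate(n, k, criterion, restarts=50, seed=0):
    """Minimize criterion(labels) over assignments of n vertices to k clusters."""
    rng = random.Random(seed)
    best, best_val = None, float("inf")
    for _ in range(restarts):
        labels = [rng.randrange(k) for _ in range(n)]
        improved = True
        while improved:
            improved = False
            val = criterion(labels)
            for v in range(n):
                old = labels[v]
                for c in range(k):
                    if c == old:
                        continue
                    labels[v] = c                 # tentative move
                    if criterion(labels) < val:
                        val, improved = criterion(labels), True
                        break                     # keep the improving move
                    labels[v] = old               # undo a non-improving move
                if improved:
                    break
        if val < best_val:
            best, best_val = labels[:], val
    return best, best_val

# Structural-equivalence criterion for a 4-vertex toy network
# (diagonal self-ties ignored, a common convention assumed here).
R = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [1, 1, 0, 1],
     [1, 1, 1, 0]]

def P(labels):
    total = 0
    for a in range(max(labels) + 1):
        for b in range(max(labels) + 1):
            entries = [R[u][v] for u in range(len(R)) for v in range(len(R))
                       if u != v and labels[u] == a and labels[v] == b]
            total += min(sum(entries), len(entries) - sum(entries))
    return total

labels, val = relocate(4, 2, P)
assert val == 0                                   # exact structural equivalence
assert labels[0] == labels[1] and labels[2] == labels[3]
```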
Alternatively, these two inconsistency types can be weighted differently, as in P(C) = αP + (1 − α)N with 0 ≤ α ≤ 1. This formulation extends naturally to valued signed networks. A more extended discussion is contained in [13] and generalized blockmodeling is implemented in [3].

Deductive and Inductive Uses of Blockmodeling

Within conventional blockmodeling, using the indirect approach, blockmodeling is essentially inductive. A specific type of equivalence is specified, some (dis)similarity measure is specified (often implicitly) and a clustering algorithm is used to identify a clustering. For example, in using structural equivalence as the selected equivalence, corrected Euclidean distances can be computed and used in
Positional Analysis and Blockmodeling, Table 1 Inductive and deductive uses of generalized blockmodeling

                                   Blockmodel unknown (in advance)   Blockmodel given (pre-specification)
Clustering unknown (in advance)    Inductive                         Deductive
Clustering given                   Constraints                       Constrained pre-specification

Positional Analysis and Blockmodeling, Table 2 Two examples of pre-specified blockmodels

Model 1:
Complete   Complete   Null
Complete   Regular    Null
Regular    Null       Null

Model 2:
Row-regular   Row-dominant   Complete
Null          Complete       Null
Null          Regular        Symmetric
conjunction with the Ward criterion in a hierarchical clustering procedure. This usage results in a dendrogram or a cluster diagram, and it is then necessary for a researcher to identify a clustering given the dendrogram. Neither the number of clusters is specified in advance nor is the location of the permitted types in a blockmodel specified. This is purely inductive. In another variant of using structural equivalence, CONCOR can be used to give successive splits of previously established clusters (starting with the whole network as a cluster), where the number of splits is specified in advance. This, too, is inductive and the researcher accepts (or not) the clusters that happen to be returned by the application of the algorithm. In contrast, generalized blockmodeling can be used in a pre-specified (and hence deductive) fashion. The different options are given in Table 1. A simple pre-specified model is one where the block types and their location in the blockmodel are specified in advance but the actual clustering, C, is not known and is to be "discovered". Two examples of pre-specified blockmodels for k = 3 are given in Table 2, where both the types of blocks and their place in the blockmodel are specified. These specifications state that the type of blockmodel is known or conjectured, but not its exact realization. It is permissible to formulate alternative blockmodels for the same network as alternative hypothetical models. However, the two criterion functions cannot be compared directly (in magnitude) because different block types define different inconsistencies. It is possible – and interesting – for different blockmodels to fit the same network. In general, the criterion function for structural equivalence is the most stringent, with every 0 and 1 in the 'wrong' places (blocks of the 'opposite' type) contributing, while the number of inconsistencies under regular equivalence is likely to be much smaller. So comparing the criterion functions for different blockmodels based on structural and regular equivalence for the same data is inadmissible. In short, the metrics differ and the size of the criterion function matters only within a given blockmodel type for a specific network data set – where the aim is to find the closest empirical blockmodel(s) consistent with the selected type of blockmodel for the network studied. What matters is the appropriate type of blockmodel given the problem studied. The use of criterion functions is to identify the best fitting blockmodels of a given type, and they should never be compared across different types of blockmodels. An alternative option is to specify the clustering and leave open the exact nature of the blockmodel. The constraint takes the form of specifying which vertices go in which cluster and, given that, seeing the blockmodel type that results (in terms of the permitted block types). In the first example used below, the presence of two types of organizations in the network suggested a partition of the organizations into the two types. Under this strategy, it is possible also to identify the membership of the clusters only partially, by specifying that some vertices must be clustered together and other pairs of vertices are never clustered together. The third approach listed in Table 1 is the constrained pre-specification, where both the blockmodel and the clustering are specified. (In the organizational example just mentioned, the partition into two types of organizations could be coupled to a specification of one type of a core-periphery blockmodel.) Engaging in pre-specifying blockmodels requires social knowledge about the specific network or the type of network that it represents. And it may be that a lot of such knowledge is required to make blockmodel specifications with confidence. However, often a network analyst knows more about a network than what is implied by a completely inductive approach. As such, a pre-specified blockmodel is one that can be tested rather than one that is discovered. All of the inductive, partially deductive and fully deductive uses of blockmodeling are useful, and the decision over which approach to take depends on the knowledge that is available about a network. For example, a type of blockmodel that is discovered for a network of a given type (e.g. corporate networks in the United States) becomes a candidate for a blockmodel to be tested in another network of the same broad type in another country.

Examples of Generalized Blockmodeling

Example 1. An Interorganizational Network

The first example features a small network with 15 organizations and is taken from [15]. The studied organizations are involved in a specific collaborative enterprise and the relation shown in Fig. 3 is the presence (or not) of reciprocated ties prior to the start of the collaborative enterprise. Even though this is a small network, the structure is not immediately apparent from the diagram. Prior to the analysis, the basic intuition was that these organizations form some kind of a core-periphery structure. Of course, there are many types of core-periphery structures, so something more has to be specified. Those organizations thought to be in the core were expected to be tied to each other and
to most of the other organizations in the network. These organizations occupy the core as Position 1. Another set of organizations was each thought to be tied to most of the core organizations but not to each other. These organizations occupy Position 2 and can be called the semi-periphery. Finally, there was the suspicion that one organization was largely unknown prior to the collaborative effort. This organization (or possibly more than one organization) occupies Position 3. Reflecting these arguments, the specification used in [15] is shown in Table 3.

Positional Analysis and Blockmodeling, Figure 3 An interorganizational network matrix in an arbitrary order

Positional Analysis and Blockmodeling, Table 3 A pre-specified blockmodel

            Position 1   Position 2   Position 3
Position 1  Complete     Regular      Null
Position 2  Regular      Null         Null
Position 3  Null         Null         Null

This pre-specified model was fitted with 3 clusters and the rows of the matrix were permuted so that members of the same position (cluster) are grouped together. The matrix array in Fig. 4 resulted. The blue lines (that extend beyond the matrix on the left and at the bottom) separate the three positions. While the model fits the data well, there are some inconsistencies: among the core organizations in Position 1, s6 and s1 do not have a reciprocated tie (and contribute 2 inconsistencies to the criterion function as 0s in a specified complete block) and, among the organizations in the semi-periphery (Position 2), f9 and f2 do share a tie when the expectation expressed in the model was that there would be no ties there. This brings the number of inconsistencies to 4. There is a lone organization in Position 3 that has no reciprocated ties with the other organizations: the blocks defined in terms of this position contribute no inconsistencies. So the optimized value of the criterion function is 4 and this partition is unique.

Positional Analysis and Blockmodeling, Figure 4 Blockmodel of the interorganizational network with 3 positions

Figure 5 shows the blockmodel image for this network, gotten by using the pre-specified model shown in Table 3. The loop for Position 1 represents a complete block and the edge between Position 1 and Position 2 represents a regular link.

Positional Analysis and Blockmodeling, Figure 5 Blockmodeling image for Fig. 4

This example raises another interesting issue. As noted above in Sect. "Conventional Blockmodeling", the organizations could be grouped into two clusters based on what they do. One was made up of 'practicing' organizations (with regard to health service delivery) and the other was made up of organizations 'supporting' the collaborative effort. The former organizations are labeled f1 through f9 and the second set s1 through s6. An alternative blockmodel was an obvious candidate and is consistent with the third deductive use of blockmodeling listed in Table 1. It has the same specification of block types with three positions: Position 1 made up of the support organizations; Position 2 made up of all of the practicing organizations. (Position 3 could be specified for f6 as an isolate.) Much as this blockmodel made sense a priori, it has a much worse fit. Looking at Fig. 4, it is clear that if s5 and f6 are interchanged between Positions 1 and 2, the count of inconsistencies would jump by 12 to 16. The structure of the second specification is right but the composition of the positions was inferior – which shows, for this network, that the constrained pre-specification did not fit as well as a simpler pre-specification with only the blockmodel stated ahead of the model fitting. It suggests also that testing specifications is important and that the assumed social knowledge for a specified blockmodel need not be correct.

Example 2. Structural Balance and Signed Networks

The data for this example are taken from [18] for a group of 21 women living in a college residence. Measurement for signed social relations was done by using four relations that tapped an underlying dimension of affect. Each woman was asked to provide signed data about the other women in the residence in the form of whom they would like to do activities with and those with whom they did not want to do these things. The items used were: having them as a room mate; going on a double date with them; taking them home to visit the family for a weekend; and being a friend after college. These relations have been combined into counts of activities to be shared and counts of activities not to be shared, as shown in Fig. 6. Black squares indicate the presence of positive ties for all four relations, with red squares showing where all four ties were negative. Shades of gray show lesser counts of positive ties (+3, +2 and +1) and shades of pink show lesser counts of negative ties (−3, −2 and −1). White squares indicate no preference either way. (There were no cases where, for a pair of women, there was a positive tie on one relation and a negative tie on another relation.) If structural balance holds exactly, then a signed relation will partition the actors into clusters (also called plus-sets) where all of the ties within a cluster are positive and all ties between the clusters are negative. In practice, structural balance does not hold exactly and there are some negative ties within plus-sets and some positive ties between them.

Positional Analysis and Blockmodeling, Figure 6 Original signed and valued matrix
Positional Analysis and Blockmodeling, Table 4 Values of P(C) and the number of solutions for values of k

k          2     3    4    5     6
P(C)       48.5  37   32   32.5  33
Solutions  1     1    1    2     1
These are all inconsistencies with structural balance, and the total count of them gives the value of a criterion function. Partitions were sought that minimize this criterion function, P(C), and this depends on the number of clusters (k) used. Table 4 summarizes the outcomes leading to the choice of the partition shown in Fig. 6. (The criterion function, P(C), rises again for higher values of k.) The best fitting (and unique) partition is with 4 clusters. The pre-specified blockmodel for this case is:

Positive (+, 0)   Negative (−, 0)   Negative (−, 0)   Negative (−, 0)
Negative (−, 0)   Positive (+, 0)   Negative (−, 0)   Negative (−, 0)
Negative (−, 0)   Negative (−, 0)   Positive (+, 0)   Negative (−, 0)
Negative (−, 0)   Negative (−, 0)   Negative (−, 0)   Positive (+, 0)
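A hand-checkable sketch of the balance criterion for this kind of pre-specified model (a generic illustration, assuming the reading that P counts positive ties in off-diagonal negative blocks and N counts negative ties in diagonal positive blocks, combined as P(C) = αP + (1 − α)N):

```python
# Balance-theoretic criterion for a signed network S with plus-sets
# given by `labels`: N sums negative tie values inside diagonal
# (positive) blocks, P sums positive tie values in off-diagonal
# (negative) blocks; P(C) = alpha*P + (1 - alpha)*N.

def balance_criterion(S, labels, alpha=0.5):
    P = N = 0
    n = len(S)
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            same = labels[u] == labels[v]
            if same and S[u][v] < 0:
                N += -S[u][v]        # negative tie inside a plus-set
            elif not same and S[u][v] > 0:
                P += S[u][v]         # positive tie between plus-sets
    return alpha * P + (1 - alpha) * N

# Toy signed network with plus-sets {0, 1} and {2, 3} and a single
# inconsistency: the positive tie from 1 to 2.
S = [[ 0,  1, -1, -1],
     [ 1,  0,  1, -1],
     [-1, -1,  0,  1],
     [-1, -1,  1,  0]]
assert balance_criterion(S, [0, 0, 1, 1]) == 0.5   # one positive inconsistency
assert balance_criterion(S, [0, 1, 0, 1]) > balance_criterion(S, [0, 0, 1, 1])
```

Because the values, not just the signs, of the ties are summed, the same function applies unchanged to valued signed networks, as the text notes.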
To locate the best fitting partition(s), the relocation algorithm described above was used with P(C) = αP + (1 − α)N and α = 0.5. The unique best fitting partition is shown in Fig. 7, where the blue lines mark the boundaries between the four plus-sets. Any red or pink square in the diagonal blocks shows the presence of negative ties, in what are specified as positive blocks, as inconsistencies contributing to P(C). Similarly, black and gray squares in the off-diagonal blocks show positive ties that are present in what are specified as negative blocks, and they also contribute to P(C).

Positional Analysis and Blockmodeling, Figure 7 Permuted signed and valued matrix with 4 plus-sets

The plus-set {g, h, o, u} is particularly cohesive, with all their positive ties being distributed among themselves and all their negative ties being sent to women in other plus-sets. Furthermore, the women in this plus-set receive no positive ties from women in other plus-sets. The plus-set {l, n, t} has only positive ties within their plus-set, but two positive ties are sent to women in other plus-sets. All of the negative ties are sent outside the plus-set. The other three-person plus-set is less consistent with the ideal structural balance partition, having two negative ties within it and positive ties to women in other plus-sets. The blocks associated with the large plus-set have both kinds of inconsistencies. In summary, the generalized blockmodeling discerned a structural pattern, as a partition of a valued signed relation, that is readily interpretable.

Example 3. The Supreme Court as Two-Mode Data

Two-mode data involve two sets of social objects. Examples of such data include companies employing people, people attending events, organizations joining alliances and Supreme Court Justices voting on cases they hear. While the row objects and column objects belong to different sets, it is still possible to apply blockmodeling tools to two-mode data. Of course, the rows and columns will be partitioned differently. For the example of the voting of Supreme Court Justices there are two possible weights: 1 for voting in favor of a decision and 0 for not so voting. (An alternative weighting scheme has three values, where 1 is voting for a decision, 0 is not voting at all (for justices recusing themselves from a case) and −1 is voting in dissent.) The simpler weight scheme is used in this example.
Instead of seeking a clustering C as used for one-mode data, a two-clustering is sought, where C = (C₁, C₂) with C₁ a partition of V₁ and C₂ a partition of V₂. As for one-mode data, block types, together with their associated inconsistency counts, can be defined and an optimization problem formulated as (Φ, P(C), minimize). The solution set is made up of two-clusterings C = (C₁, C₂) for which P(C) is minimized over all C ∈ Φ, the set of feasible two-clusterings. The value of the criterion function, P(C) = P(C₁, C₂), is obtained by summing all of the inconsistencies for each block of the two-mode blockmodel. Structural equivalence blocks (null and complete) were used in [14] to examine the voting of the Supreme Court for the 2000–2001 term for the 26 most important decisions identified in [16](a). The justices were partitioned into 4 clusters and the cases into 7 clusters, with 13 inconsistencies.

Positional Analysis and Blockmodeling, Figure 8 A partition of the Supreme Court voting for 2000–2001

Their optimal solution partition is shown in Fig. 8, where the four clusters of justices are in the center column of the figure and the clusters of cases are shown on the two sides. Both sets of clusters are color coded according to cluster membership. The blue lines show the votes of the justices for their decisions. The partition of the justices shows the familiar ideological cleavage of the Supreme Court, with the wing of ‘conservative’ justices at the top and the ‘liberal’ wing at the bottom. Figures 9 and 10 show the corresponding results for the next two years of the Supreme Court’s decisions, based on the important cases identified in [16](b, c). There are subtle differences between the three two-mode blockmodel partitions within the same broad description of the voting patterns of the Supreme Court across the years. The methodological point here is that generalized blockmodeling can be applied to two-mode data and that the resulting partitions are both clear and interpretable. The benefits from using a generalized blockmodeling approach include the use of precise criterion functions and an optimization approach for identifying blockmodels empirically, having a well-defined measure of fit for a blockmodel, being able to define new types of blocks and new blockmodel types, being able to use blockmodeling in a deductive fashion (when our knowledge merits doing so), and being able to differentially weight the different types of inconsistencies in blocks, all within a very flexible, coherent and broad framework for blockmodeling. Further, in cases where the direct and indirect approaches have been compared, for both structural and regular equivalence, the generalized approach has never been outperformed by the indirect approach, and most often it returns blockmodels that fit better for the same equivalence type. However, even though these benefits are clear, it does not follow that the generalized approach to blockmodeling totally dominates the conventional approach, nor that the major problems have been solved.

Future Directions

One drawback to generalized blockmodeling is that the combinatorial computational burden is considerable even when a local optimization method is used. This constrains the size of networks that can be modeled, a major disadvantage. Much larger networks can be analyzed using the indirect approach and, even though attention is confined to structural and regular equivalence with it, the network size restriction for generalized blockmodeling is problematic. This issue becomes even more acute when
Positional Analysis and Blockmodeling, Figure 9 A partition of the Supreme Court voting for 2002–2003
Positional Analysis and Blockmodeling, Figure 10 A partition of the Supreme Court voting for 2003–2004
three-mode networks are considered. For these problems, some combination of indirect methods, direct methods and graph theoretical methods will be needed. So, one open problem set stems from the need to create more efficient algorithms, together with the formulation of better heuristics for partitioning networks in general. Positional analysis gives priority to network locations and network positions. Indeed, this is its hallmark. Because the whole network matters, both approaches to blockmodeling assume that the boundary of the network has been located correctly. In many cases, the assumption is appropriate but, as the ambition of network analysts expands to consider networks of (much) larger size, this assumption becomes questionable. At the moment, we simply do not know the vulnerability (or instability) of blockmodeling solutions when the boundary has been drawn inaccurately. (Leaving out crucial network vertices is far more problematic than including vertices of little structural relevance.) The need to learn the vulnerabilities of blockmodeling to this type of boundary problem is acute and creates a second open problem set. One data analytic response to this problem is to represent the wider network environment of the network being studied in some fashion, so as to reduce the vulnerability that comes from simply ignoring that wider environment. Equally acute is the problem of missing network data when the boundary has been correctly identified. That is, the data may contain all of the actors, but the information from and about them is incomplete, inaccurate or missing. This sets up the third set of open problems: dealing with inadequate data. The missing data part of this problem is easy to sweep under the rug by assuming that missing relational data are the same as null relational data. In reality, a 0 in a network data set can mean either that the relation is null or that no data for the tie are available, and it is folly to treat the two as the same.
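The null-versus-missing distinction above is easy to preserve at the data-coding level. A minimal sketch, with invented tie values, in which unobserved ties are coded as None rather than 0:

```python
# Hypothetical sketch: coding "no data" as None rather than 0 keeps
# null ties and missing ties distinct, whereas a naive 0 coding
# conflates them (the folly noted in the text).

ties = {("a", "b"): 1, ("a", "c"): 0, ("b", "c"): None}  # None = not observed

naive_zeros = sum(1 for v in ties.values() if not v)      # conflates 0 and None
true_nulls  = sum(1 for v in ties.values() if v == 0)     # genuinely null ties
missing     = sum(1 for v in ties.values() if v is None)  # unobserved ties

print(naive_zeros, true_nulls, missing)  # 2 1 1
```

Any downstream blockmodeling step can then decide explicitly how missing ties enter the criterion function instead of silently treating them as null.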
There are three broad approaches to this problem that show promise. One is to experiment with complete extant data by systematically deleting or otherwise corrupting the data and reanalyzing the corrupted data with the same blockmodeling tools, to learn more about which types of corruption matter the most and how much missing or corrupted data will distort the results of blockmodeling the network. A second approach is to use simulation methods to generate data where the sources of missing or incomplete data are built into the generating process and can be studied. The disadvantage of both of these approaches is that the derived knowledge may be restricted to the situations featured in the experiments and simulations. The third approach deals with the analysis of data directly, by marking the missing data as such and incorporating them by defining new block types defined by the missing data in them. All three approaches are suggestions for tackling the third open problem set, arising from the need to take ‘bad’ data seriously – far more seriously than the current cavalier practice of assuming network data are all ‘good’ when positional approaches are used. Most of the examples used in [14] feature binary networks and there is a clear need to extend generalized blockmodeling to consider valued networks, the fourth set of open problems. The balance theoretic partitioning can deal with valued signed networks – as the example discussed above demonstrates. In one respect, conventional blockmodeling does this because many of the (dis)similarity measures that are used are applicable to valued networks. But for generalized blockmodeling this is a real problem. A crude way of dealing with valued networks is to impose some threshold to convert a valued network into one (or more) binary networks. The work in [22] suggests that this strategy is fraught with hazard, and its author has introduced a potentially fruitful generalized blockmodeling approach to valued network data. A fifth open problem set is ushered in by the criterion function used to fit generalized blockmodels. For a given set of block types and a specific blockmodel type, we can say that the optimization approach provides empirical blockmodels that fit the best – subject to the caveat that the relocation method is a local optimization procedure. If the criterion function is well defined (i.e. it is compatible with, and sensitive to, the equivalence used), then identifying partitions minimizing it all but guarantees the claim. And if the values of the criterion function are small (as in the examples used above) this claim is strengthened.
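The hazard of thresholding a valued network, mentioned above, can be made concrete with a small sketch. The helper name and toy values below are invented for illustration; the point is only that different cut-offs yield qualitatively different binary networks, and any blockmodel fitted to them inherits that sensitivity:

```python
# Hypothetical sketch of the thresholding strategy for valued networks:
# the binary image depends strongly on the chosen cut-off.

def binarize(valued, threshold):
    """Keep a tie when its value is at least the threshold."""
    return {pair for pair, v in valued.items() if v >= threshold}

valued = {("a", "b"): 5, ("b", "c"): 3, ("c", "d"): 2, ("d", "a"): 1}

print(sorted(binarize(valued, 2)))  # three ties survive
print(sorted(binarize(valued, 4)))  # only ("a", "b") survives
```

With the lower threshold the binary network is nearly a cycle; with the higher one it is a single dyad, so the two thresholds would support very different blockmodels of the "same" data.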
However, this does not address the issue of how large a criterion function has to be to say that the specified blockmodel does not fit the data well even though the partition returns the lowest possible value of the criterion function for the data at hand. The ‘best’ may not be good enough. Some statistical theory is needed for fully testing the fit of blockmodels to data in a principled fashion. Assembling this constitutes a solution set for the fifth set of open problems. Dealing with multiple relations constitutes the sixth open problem set discussed here. This is acute for both the conventional and generalized approaches to the blockmodeling problem even though, at face value, it seems more acute for generalized blockmodeling. Conventional blockmodeling handles this issue by stacking the relations and computing the (dis)similarity measure across the full set of relations and mobilizing a clustering algorithm. However, this strategy implies that every relationship has
Positional Analysis and Blockmodeling, Figure 11 Two relations with the same partition but different blockmodels
the same set of block types for each relation. The example shown in Fig. 11 demonstrates the serious drawback of this assumption. Two relations are shown. On the left is a hierarchical relation for a hypothetical bureaucratic structure, similar to the one in Fig. 1. The relation shown is ‘has authority over’ and a regular equivalence is shown, with the levels color coded according to the partition. On the right is a hypothetical social relationship of advice seeking. Suppose, further, that in this hypothetical organization there are definite norms about advice seeking. People at the same bureaucratic level are free to seek advice from each other and it is acceptable for lower-ranked subordinates to seek advice from their bosses. In addition, suppose that the informal norms are such that bosses seeking advice from their subordinates is unacceptable and they never do so (or admit to doing so). The data in the right-hand network are fully consistent with a ranked clusters model [12]. Within levels of the hierarchy, advice seeking, when it occurs, is fully symmetrical, and the only directed ties are from subordinates at the lowest level to some of their bosses at the middle level and from the middle level to the top company official. This ranked clusters partition is exact and the clusters are identical for each relation in Fig. 11. The problem with the ‘stacking’ of relations approach is clear for this example. The blockmodels for the two relations are totally different even though the vertex partition for each relation is the same. Assuming, for example, that a single equivalence holds for both relations and computing some (dis)similarity will simply confound the two types of blockmodel, with the result that the actual partitions will not be identified and, even if they were, they
would be misrepresented in terms of the predicates of the blockmodel. Analyzing multiple relations where the different relations can have quite different blockmodels is a deep problem, one that cannot be solved by the use of (dis)similarity measures as if they apply to each relation in the same fashion. The seventh, and final, open problem set considered here has to do with the notion of evolving social networks. One of the basic assumptions of the positional approach, with blockmodeling as the primary strategy, is that there is a large and/or complex social network that we observe, and that the primary task is to understand the structure of the network. This is done by delineating and describing the simpler blockmodel image. Another part of the basic assumption is that the underlying blockmodel image is the fundamental structure and the observed (surface) structure is an indicator only of the fundamental structure. There is a hint in the Supreme Court example that the voting structure did change over time. Of course, this could be due to the nature of the cases that happened to be considered in each of the terms studied. But even if there is some systematic evolution of the structure, the images depicted in Figs. 8 through 10 are no more than a description of change and do not deal with evolution as a structural process. The (deep) problem is to formulate models of network evolution for the fundamental structure(s) represented by the blockmodel while all of the data depict the observed surface model. Blockmodeling methods, whether in the conventional or generalized mode, form a powerful set of tools for modeling, representing and understanding the structure of social networks. The many benefits stemming from the use of these tools, especially in the generalized blockmodeling variant, inspire confidence in this approach and suggest an enormous potential for using these tools. However, there are many serious open problem sets that need to be solved before this potential is fully realized.

Bibliography
1. Batagelj V (1997) Notes on blockmodeling. Soc Netw 19:143–53
2. Batagelj V, Mrvar A (1998) Pajek – A program for large network analysis. Connections 21:47–57
3. Batagelj V, Doreian P, Ferligoj A (1992) An optimizational approach to regular equivalence. Soc Netw 14:121–35
4. Borgatti SP, Everett MG (1989) The class of all regular equivalences: Algebraic structure and computation. Soc Netw 11:159–72
5. Borgatti SP, Everett MG (2002) Notions of position in social network analysis. In: Marsden PV (ed) Sociological Methodology. Am Soc Assoc, Washington DC, pp 1–36
6. Borgatti SP, Everett MG, Freeman LC (2002) Ucinet for Windows: Software for Social Network Analysis. Analytic Technologies, Harvard, Massachusetts
7. Breiger RL, Boorman SA, Arabie P (1975) An algorithm for clustering relational data with applications to social network analysis and comparison to multidimensional scaling. J Math Psychol 12:328–83
8. Burt RS (1976) Positions in networks. Soc Forces 55:93–122
9. Cartwright D, Harary F (1956) Structural balance: A generalization of Heider’s theory. Psychol Rev 63:277–292
10. Davis JA (1967) Clustering and structural balance in graphs. Hum Relat 20:181–7
11. Doreian P, Mrvar A (1996) A partitioning approach to structural balance. Soc Netw 18:149–68
12. Doreian P, Batagelj V, Ferligoj A (2000) Symmetric-acyclic decomposition of networks. J Classif 17:3–28
13. Doreian P, Batagelj V, Ferligoj A (2005) Positional analysis of sociometric data. In: Carrington PJ, Scott J, Wasserman S (eds) Models and Methods in Social Network Analysis. Cambridge University Press, New York
14. Doreian P, Batagelj V, Ferligoj A (2005) Generalized Blockmodeling. Cambridge University Press, New York
15. Gold M, Doreian P, Taylor EF (2008) Understanding a collaborative effort to reduce racial and ethnic disparities in health care: Contributions from social network analysis. Soc Sci Med 67:1018–1027
16. Greenhouse L (2001, 2003, 2004) (a) In year of Florida vote, Supreme Court did much other work. New York Times, July 2; (b) In momentous term, justices remake the law and the court. New York Times, July 1; (c) The year Rehnquist may have lost his court. New York Times, July 4
17. Heider F (1946) Attitudes and cognitive organization. J Psychol 21:107–12
18. Lemann TB, Solomon RL (1952) Group Characteristics as Revealed in Sociometric Patterns and Personality Ratings. Sociometry Monographs, vol 27. Beacon House, Boston
19. Lorrain FP, White H (1971) Structural equivalence of individuals in networks. J Math Soc 1:49–80
20. White DR (1984) REGGE: A regular graph equivalence algorithm for computing role distances prior to blockmodeling. University of California, Irvine (unpublished manuscript)
21. White DR, Reitz KP (1983) Graph and semigroup homomorphisms on networks and relations. Soc Netw 5:193–235
22. Žiberna A (2005) Generalized blockmodeling of valued networks. Soc Netw 29:105–26
Possibility Theory
DIDIER DUBOIS, HENRI PRADE
IRIT-CNRS, Université Paul Sabatier, Toulouse Cedex, France

Article Outline
Glossary
Definition of the Subject
Introduction
Historical Background
Basic Notions of Possibility Theory
Qualitative Possibility Theory
Quantitative Possibility Theory
Probability-Possibility Transformations
Applications and Future Directions
Bibliography

Glossary

Possibility distribution A possibility distribution restricts the set of possible values for a variable of interest in an elastic way. It is represented by a mapping from a universe gathering the potential values of the variable to a scale, such as the unit interval of the real line or a finite linearly ordered set, expressing to what extent each value is possible for the variable. Thus, a possibility distribution restricts a set of more or less possible values belonging to a universe that may be ordered, such as a subpart of the real line for a numerical variable, or not ordered, if for instance the variable takes its value in the set of interpretations of a logical language. This may be used for representing uncertainty if the restriction pertains to possible values of an ill-known state of the world, or for representing preferences if the restriction encodes a set of values that are considered more or less satisfactory for some purpose.
Possibility measure A possibility measure is a set function (increasing in the wide sense) that returns the maximum of a possibility distribution over a subset representing an event.
Necessity measure A necessity measure is a set function, associated by duality to a possibility measure through a relation expressing that an event is all the more necessarily true (all the more certain) as the opposite event is less possible. A necessity measure estimates to what extent the information represented by the underlying possibility distribution entails the occurrence of the event.
Guaranteed possibility A guaranteed possibility measure is a set function (decreasing in the wide sense) that returns the minimum of a possibility distribution over a subset representing an event. While possibility measures evaluate the consistency between an event and the available information represented by the underlying possibility distribution, guaranteed possibility measures capture another view of the idea of possibility, related to the idea of (guaranteed) feasibility, or sufficiency condition.
Possibilistic logic Standard possibilistic logic is a weighted logic where formulas are pairs made of a classical logical formula and a weight that acts as a lower bound on the necessity of the logical formula. Extended possibilistic logics may include formulas weighted in terms of lower bounds of possibility or guaranteed possibility measures.

Definition of the Subject

Possibility theory is the simplest uncertainty theory devoted to the modeling of incomplete information. It is characterized by the use of two basic dual set functions that respectively grade the possibility and the necessity of events. Possibility theory lies at the crossroads between fuzzy sets, probability and non-monotonic reasoning. Possibility theory is closely related to fuzzy sets if one considers that a possibility distribution is a particular fuzzy set (of mutually exclusive possible values). However, fuzzy sets and fuzzy logic are primarily motivated by the representation of gradual properties, while possibility theory handles the uncertainty of classical (or fuzzy) propositions. Possibility theory can be cast either in an ordinal or in a numerical setting. Qualitative possibility theory is closely related to belief revision theory, and to common-sense reasoning with exception-tainted knowledge in Artificial Intelligence. It has been axiomatically justified in a decision-theoretic framework in the style of Savage, thus providing a foundation for qualitative decision theory.
Quantitative possibility theory is the simplest framework for statistical reasoning with imprecise probabilities. As such it has close connections with random set theory and confidence intervals, and can provide a tool for uncertainty propagation with limited statistical or subjective information.

Introduction

Possibility theory is an uncertainty theory devoted to the handling of incomplete information. To a large extent, it is similar to probability theory because it is based on set functions. It differs from the latter by the use of a pair of dual set functions (possibility and necessity measures) instead of only one. Besides, it is not additive and makes sense on ordinal structures. The name “Theory of Possibility” was coined by Zadeh [1], who was inspired by a paper by Gaines and Kohout [2]. In Zadeh’s view, possibility distributions were meant to provide a graded semantics to natural language statements. However, possibility and necessity measures can also be the basis of a full-fledged representation of partial belief that parallels probability. It can be seen either as a coarse, non-numerical version of probability theory, or as a framework for reasoning with extreme probabilities, or yet as a simple approach to reasoning with imprecise probabilities [3]. After reviewing pioneering contributions to possibility theory, we recall its basic concepts and present the two main directions along which it has developed: the qualitative and quantitative settings. Both approaches share the same basic “maxitivity” axiom. They differ when it comes to conditioning, and to independence notions.

Historical Background

Zadeh was not the first scientist to speak about formalizing notions of possibility. The modalities possible and necessary have been used in philosophy at least since the Middle Ages in Europe, based on Aristotle’s works. More recently they became the building blocks of Modal Logics that emerged at the end of the first decade of the XXth century from the works of C.I. Lewis (see Hughes and Cresswell [5]). In this approach, possibility and necessity are all-or-nothing notions, handled at the syntactic level. More recently, and independently from Zadeh’s view, the notion of possibility, as opposed to probability, was central in the works of one economist, Shackle, and in those of two philosophers, D. Lewis and L.J. Cohen.
G.L.S. Shackle

A graded notion of possibility was introduced as a full-fledged approach to uncertainty and decision in the 1940–1970s by the English economist G.L.S. Shackle [6], who called degree of potential surprise of an event its degree of impossibility, that is, the degree of necessity of the opposite event. Shackle’s notion of possibility is basically epistemic; it is a “character of the chooser’s particular state of knowledge in his present”. Impossibility is understood as disbelief. Potential surprise is valued on a disbelief scale, namely a positive interval of the form [0, y], where y denotes the absolute rejection of the event to which it is assigned. In case everything is possible, all mutually exclusive hypotheses have zero surprise. At least one elementary hypothesis must carry zero potential surprise. The degree of surprise of an event, a set of elementary hypotheses, is the degree of surprise of its least surprising realization. The disbelief notion introduced later by Spohn [7] employs the same type of convention as potential surprise, but using the set of natural integers as a disbelief scale. Shackle also introduces a notion of conditional possibility, whereby the degree of surprise of a conjunction of two events A and B is equal to the maximum of the degree of surprise of A, and of the degree of surprise of B, should A prove true.

D. Lewis

In his 1973 book [8] the philosopher David Lewis considers a graded notion of possibility in the form of a relation between possible worlds he calls comparative possibility. He equates this concept of possibility to a notion of similarity between possible worlds. This non-symmetric notion of similarity is also comparative, and is meant to express statements of the form: a world j is at least as similar to world i as world k is. Comparative similarity of j and k with respect to i is interpreted as the comparative possibility of j with respect to k viewed from world i. Such relations are assumed to be complete pre-orderings and are instrumental in defining the truth conditions of counterfactual statements. Comparative possibility relations ≥Π obey the key axiom: for all events A, B, C,

A ≥Π B implies C ∪ A ≥Π C ∪ B.

This axiom was later independently proposed by the first author [9] in an attempt to derive a possibilistic counterpart to comparative probabilities. Interestingly, the connection between numerical possibility and similarity is currently investigated by Sudkamp [10].

L.J. Cohen
A framework very similar to the one of Shackle was proposed by the philosopher L.J. Cohen [11], who considered the problem of legal reasoning. He introduced so-called Baconian probabilities, understood as degrees of provability. The idea is that it is hard to prove someone guilty at a court of law by means of purely statistical arguments. The basic feature of degrees of provability is that a hypothesis and its negation cannot both be provable to any extent (the contrary being a case of inconsistency). Such degrees of provability coincide with necessity measures.

L.A. Zadeh

In his seminal paper [1] Zadeh proposed an interpretation of membership functions of fuzzy sets as possibility distributions encoding flexible constraints induced by
natural language statements. Zadeh articulated the relationship between possibility and probability, noticing that what is probable must preliminarily be possible. However, the view of possibility degrees developed in his paper refers to the idea of graded feasibility (degrees of ease, as in the example of “how many eggs can Hans eat for his breakfast”) rather than to the epistemic notion of plausibility laid bare by Shackle. Nevertheless, the key axiom of “maxitivity” for possibility measures is highlighted. In two subsequent articles [12,13], Zadeh acknowledged the connection between possibility theory, belief functions and upper/lower probabilities, and proposed their extensions to fuzzy events and fuzzy information granules.

Basic Notions of Possibility Theory

The basic building blocks of possibility theory were first extensively described in the authors’ books [14,15] (see also [16]). Let S be a set of states of affairs (or descriptions thereof), or states for short. A possibility distribution is a mapping π from S to a totally ordered scale L, with top 1 and bottom 0, such as the unit interval. The function π represents the state of knowledge of an agent (about the actual state of affairs), distinguishing what is plausible from what is less plausible, what is the normal course of things from what is not, what is surprising from what is expected. It represents a flexible restriction on what is the actual state, with the following conventions (similar to probability, but opposite to Shackle’s potential surprise scale):

π(s) = 0 means that state s is rejected as impossible;
π(s) = 1 means that state s is totally possible (= plausible).

If S is exhaustive, at least one of the elements of S should be the actual world, so that ∃s, π(s) = 1 (normalization). Distinct values may simultaneously have a degree of possibility equal to 1. Possibility theory is driven by the principle of minimal specificity. It states that any hypothesis not known to be impossible cannot be ruled out. A possibility distribution π is said to be at least as specific as another π′ if and only if for each state of affairs s: π(s) ≤ π′(s) (Yager [17]). Then, π is at least as restrictive and informative as π′. In the possibilistic framework, extreme forms of partial knowledge can be captured, namely:

Complete knowledge: for some s₀, π(s₀) = 1 and π(s) = 0 for all s ≠ s₀ (only s₀ is possible);
Complete ignorance: π(s) = 1 for all s ∈ S (all states are possible).
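The specificity ordering and the two extreme distributions can be sketched directly. A minimal illustration, with invented state names and dictionaries standing in for the distributions:

```python
# Hypothetical sketch of the specificity ordering on possibility
# distributions over a small finite universe.

S = ["s1", "s2", "s3"]

complete_knowledge = {"s1": 1.0, "s2": 0.0, "s3": 0.0}  # only s1 possible
complete_ignorance = {s: 1.0 for s in S}                # all states possible

def at_least_as_specific(pi, pi_prime):
    """pi is at least as specific as pi_prime iff pi(s) <= pi_prime(s) for every s."""
    return all(pi[s] <= pi_prime[s] for s in pi)

print(at_least_as_specific(complete_knowledge, complete_ignorance))  # True
print(at_least_as_specific(complete_ignorance, complete_knowledge))  # False
```

Complete knowledge is thus the most specific (most informative) distribution consistent with normalization, and complete ignorance the least specific.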
Given a simple query of the form “does event A occur?”, where A is a subset of states, the response to the query can be obtained by computing degrees of possibility and necessity, respectively (if the possibility scale is L = [0, 1]):

Π(A) = sup_{s∈A} π(s);  N(A) = inf_{s∉A} (1 − π(s)).
Π(A) evaluates to what extent A is consistent with π, while N(A) evaluates to what extent A is certainly implied by π. The possibility-necessity duality is expressed by N(A) = 1 − Π(Aᶜ), where Aᶜ is the complement of A. Generally, Π(S) = N(S) = 1 and Π(∅) = N(∅) = 0. Possibility measures satisfy the basic “maxitivity” property Π(A ∪ B) = max(Π(A), Π(B)). Necessity measures satisfy an axiom dual to that of possibility measures, namely N(A ∩ B) = min(N(A), N(B)). On infinite spaces, these axioms must hold for infinite families of sets. Human knowledge is often expressed in a declarative way, using statements to which belief degrees are attached. This corresponds to expressing constraints the world is supposed to comply with. Certainty-qualified pieces of uncertain information of the form “A is certain to degree α” can then be modelled by the constraint N(A) ≥ α. The least specific possibility distribution reflecting this information is [15]:

π_(A,α)(s) = 1 if s ∈ A; 1 − α otherwise.  (1)

Acquiring further pieces of knowledge leads to updating π_(A,α) into some π < π_(A,α). Apart from Π and N, a measure of guaranteed possibility or sufficiency can be defined [18]: Δ(A) = inf_{s∈A} π(s). In contrast, Π appears to be a measure of potential possibility. Δ(A) estimates to what extent all states in A are actually possible according to evidence, and can be used as a degree of evidential support for A. Uncertain statements of the form “A is possible to degree β” often mean that all realizations of A are possible to degree β. They can then be modelled by the constraint Δ(A) ≥ β. This corresponds to the idea of observed evidence. This type of information is better exploited by assuming an informational principle opposite to that of minimal specificity, namely, any situation not yet observed is tentatively considered impossible. This is similar to the closed-world assumption.
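On a finite universe the three set functions are directly computable from a distribution. A minimal sketch, with an invented distribution π and function names chosen for this illustration:

```python
# Hypothetical sketch of Π, N and Δ on a finite universe, following the
# definitions above (sup/inf become max/min in the finite case).

pi = {"s1": 1.0, "s2": 0.75, "s3": 0.25, "s4": 0.0}

def possibility(A):
    """Π(A) = max of π over A."""
    return max(pi[s] for s in A)

def necessity(A):
    """N(A) = 1 - Π(complement of A), by duality."""
    complement = set(pi) - set(A)
    return 1 - possibility(complement) if complement else 1.0

def guaranteed(A):
    """Δ(A) = min of π over A."""
    return min(pi[s] for s in A)

A = {"s1", "s2"}
print(possibility(A))  # 1.0
print(necessity(A))    # 1 - max(0.25, 0.0) = 0.75
print(guaranteed(A))   # 0.75

# Maxitivity check: Π(A ∪ B) = max(Π(A), Π(B))
B = {"s3"}
assert possibility(A | B) == max(possibility(A), possibility(B))
```

Note that N(A) can be high while Δ(A) is low: certainty that the state lies in A does not mean every state in A is supported by evidence.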
The most specific distribution δ_(A,β) in agreement with Δ(A) ≥ β is:

δ_(A,β)(s) = β if s ∈ A; 0 otherwise.

Acquiring further pieces of evidence leads to updating δ_(A,β) into some wider distribution δ > δ_(A,β). Such evidential support functions do not behave with the same
conventions as possibility distributions: δ(s) = 1 means that s is guaranteed to be possible, because of a high evidential support, while δ(s) = 0 only means that s has not been observed yet (hence is of unknown possibility). Distributions δ are generally not normalized to 1, and serve as lower bounds to possibility distributions (because what is observed must be possible). Such a bipolar representation of information using pairs (δ, π) may provide a natural interpretation of interval-valued fuzzy sets. Note that possibility distributions induced from certainty-qualified pieces of knowledge combine conjunctively, by discarding possible states, while evidential support distributions induced by possibility-qualified pieces of evidence combine disjunctively, by accumulating possible states. Notions of conditioning and independence have been studied for possibility measures. Conditional possibility is defined similarly to probability theory, using a Bayesian-like equation of the form [15]

Π(B ∩ A) = Π(B|A) ⋆ Π(A).

However, in the ordinal setting the operation ⋆ cannot be a product and is changed into the minimum. In the numerical setting, there are several ways to define conditioning, not all of which have this form [19]. There are several variants of possibilistic independence [20,21,22]. Generally, independence in possibility theory is neither symmetric, nor insensitive to negation. For non-Boolean variables, independence between events is not equivalent to independence between variables. Joint possibility distributions on Cartesian products of domains can be represented by means of graphical structures similar to Bayesian networks for joint probabilities (see [23,24]). Such graphical structures can be taken advantage of for evidence propagation [25] or learning [26].

Qualitative Possibility Theory

This section is restricted to the case of a finite state space S, supposed to be the set of interpretations of a formal propositional language.
In other words, S is the universe induced by Boolean attributes. A plausibility ordering is a complete pre-order of states, denoted ≥, which induces a well-ordered partition {E1, …, En} of S. It is the comparative counterpart of a possibility distribution π, i. e., s ≥ s′ if and only if π(s) ≥ π(s′). Indeed, it is more natural to expect that an agent will supply ordinal rather than numerical information about his beliefs. By convention, E1 contains the most normal states of fact, En the least plausible, or most surprising, ones. Denoting by max(A) any most plausible state s0 ∈ A, ordinal counterparts of possibility and necessity measures [9] are then defined as follows:
{s} ≥_Π ∅ for all s ∈ S, and

A ≥_Π B if and only if max(A) ≥ max(B),
A ≥_N B if and only if max(B^c) ≥ max(A^c).

Possibility relations ≥_Π are those of Lewis and satisfy the characteristic property

A ≥_Π B implies C ∪ A ≥_Π C ∪ B,

while necessity relations can also be defined as A ≥_N B if and only if B^c ≥_Π A^c, and satisfy a similar axiom:

A ≥_N B implies C ∩ A ≥_N C ∩ B.

The latter coincide with epistemic entrenchment relations in the sense of belief revision theory [27,28]. Conditioning a possibility relation ≥_Π by a non-impossible event C such that C >_Π ∅ means deriving a relation ≥_Π^C such that

A ≥_Π^C B if and only if A ∩ C ≥_Π B ∩ C.
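As a minimal illustrative sketch (not from the source; all names are hypothetical), the comparative possibility and necessity relations induced by a plausibility ordering on a finite state space can be computed directly from the ranks of states:

```python
# Sketch: comparative possibility and necessity relations induced by a
# plausibility ordering on states (higher rank = more normal state).

def poss_geq(A, B, rank):
    """A >=_Pi B iff the most plausible state of A is at least as
    plausible as the most plausible state of B."""
    best = lambda X: max((rank[s] for s in X), default=-1)  # empty set at bottom
    return best(A) >= best(B)

def nec_geq(A, B, rank, S):
    """A >=_N B iff B^c >=_Pi A^c (duality)."""
    return poss_geq(S - B, S - A, rank)

S = {"s1", "s2", "s3"}
rank = {"s1": 2, "s2": 1, "s3": 0}   # s1 is the most normal state

A, B = {"s1"}, {"s2", "s3"}
assert poss_geq(A, B, rank)
# characteristic axiom: A >=_Pi B implies C ∪ A >=_Pi C ∪ B
C = {"s3"}
assert poss_geq(C | A, C | B, rank)
assert nec_geq(A, B, rank, S)
```

The assertions check the characteristic axiom of possibility relations on one small example; they are not a proof of it.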
The notion of independence for comparative possibility theory was studied in Dubois et al. [22] for independence between events, and in Ben Amor et al. [29] for independence between variables.

Non-monotonic Inference

Suppose S is equipped with a plausibility ordering. The main idea behind qualitative possibility theory is that the state of the world is always believed to be as normal as possible, neglecting less normal states. A ≥_Π B really means that there is a state where A holds that is at least as normal as any state where B holds. The dual case A ≥_N B is intuitively understood as “A is at least as certain as B”, in the sense that there are states where B fails to hold that are at least as normal as the most normal state where A does not hold. In particular, the events accepted as true are those which are true in all the most plausible states, namely the ones such that A >_N ∅. These assumptions lead us to interpret the plausible inference A |∼ B of a proposition B from another A, under a state of knowledge ≥_Π, as follows: B should be true in all the most normal states where A is true, which means B >_Π^A B^c in terms of ordinal conditioning, that is, A ∩ B is more plausible than A ∩ B^c. A |∼ B also means that the agent considers B as an accepted belief in the context A. This kind of inference is non-monotonic in the sense that A |∼ B does not always imply A ∩ C |∼ B for any additional information C. This is similar to the fact that a conditional probability P(B|A ∩ C) may be low even if
P(B|A) is high. The properties of the consequence relation |∼ are now well understood, and are precisely the ones laid bare by Lehmann and Magidor [30] for their so-called “rational inference”. Monotonicity is only partially restored: A |∼ B implies A ∩ C |∼ B provided that A |∼ C^c does not hold (i. e. that states where A is true do not typically violate C). This property is called rational monotony and, along with some more standard ones (like closure under conjunction), characterizes default possibilistic inference |∼. In fact, the set {B : A |∼ B} of accepted beliefs in the context A is deductively closed, which corresponds to the idea that the agent reasons with accepted beliefs in each context as if they were true, until some event occurs that modifies this context. This closure property is enough to justify a possibilistic approach [31], and adding the rational monotonicity property ensures the existence of a single possibility relation generating the consequence relation |∼ [32]. Rather than being constructed from scratch, plausibility orderings can be generated by a set of if-then rules tainted with unspecified exceptions. This set forms a knowledge base supplied by an agent. Each rule “if A then B” is understood as a constraint of the form A ∩ B >_Π A ∩ B^c on possibility relations. There exists a single minimally specific element in the set of possibility relations satisfying all constraints induced by rules (unless the latter are inconsistent). It corresponds to the most compact plausibility ranking of states induced by the rules [32]. This ranking can be computed by an algorithm originally proposed by Pearl [33].

Possibilistic Logic

Qualitative possibility relations can be represented by (and only by) possibility measures ranging on any totally ordered set L (especially a finite one) [9]. This absolute representation on an ordinal scale is slightly more expressive than the purely relational one.
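The non-monotonic consequence relation of the previous section can be sketched numerically. In this illustrative example (not from the source; the "birds fly" instance and all names are hypothetical), A |∼ B is tested as Π(A ∩ B) > Π(A ∩ B^c):

```python
# Sketch: plausible inference A |~ B holds iff A∩B is strictly more
# plausible than A∩B^c, i.e. B holds in all most normal states where A holds.

def pi_max(states, pi):
    return max((pi[s] for s in states), default=0.0)

def entails(A, B, pi, S):
    """A |~ B iff Pi(A ∩ B) > Pi(A ∩ (S − B))."""
    return pi_max(A & B, pi) > pi_max(A & (S - B), pi)

# three illustrative worlds with plausibility degrees
pi = {"bird_flies": 1.0, "penguin_no_fly": 0.6, "penguin_flies": 0.2}
S = set(pi)
birds = {"bird_flies", "penguin_no_fly", "penguin_flies"}
pengs = {"penguin_no_fly", "penguin_flies"}
flies = {"bird_flies", "penguin_flies"}

assert entails(birds, flies, pi, S)       # birds normally fly
assert entails(pengs, S - flies, pi, S)   # penguins normally do not
assert not entails(pengs, flies, pi, S)   # monotonicity fails (pengs ⊂ birds)
```

The last two assertions exhibit the non-monotonicity: strengthening the antecedent from "birds" to "penguins" defeats the conclusion.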
When the finite set S is large and generated by a propositional language, qualitative possibility distributions can be efficiently encoded in possibilistic logic [34]. A possibilistic logic base K is a set of pairs (φ, α), where φ is a Boolean expression and α is an element of L. This pair encodes the constraint N(φ) ≥ α, where N(φ) is the degree of necessity of the set of models of φ. Each prioritized formula (φ, α) has a fuzzy set of models (described in Sect. “Basic Notions of Possibility Theory”), and the fuzzy intersection of the fuzzy sets of models of all prioritized formulas in K yields the associated plausibility ordering on S. Syntactic deduction from a set of prioritized clauses is achieved by refutation using an extension of the standard resolution rule, whereby (φ ∨ ψ, min(α, β)) can be derived from (φ ∨ ξ, α) and (ψ ∨ ¬ξ, β). This rule, which evaluates the validity of an inferred proposition by the validity of the weakest premiss, goes back to Theophrastus, a disciple of Aristotle. Possibilistic logic is an inconsistency-tolerant extension of propositional logic that provides a natural semantic setting for mechanizing non-monotonic reasoning [35], with a computational complexity close to that of propositional logic. See [36] for a detailed introduction to possibilistic logic, its syntactic and semantic aspects, different extensions that involve time, for instance, or that encode constraints on lower bounds of possibility or guaranteed possibility measures, or that perform multiple-source information fusion. See [37] for a sketch of further extensions for handling groups of agents’ beliefs and mutual beliefs. Another compact representation of qualitative possibility distributions is the possibilistic directed graph, which uses the same conventions as Bayesian nets, but relies on an ordinal notion of conditional possibility [15]:

Π(B|A) = 1 if Π(B ∩ A) = Π(A), and Π(B|A) = Π(B ∩ A) otherwise.
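The weighted resolution rule described above can be sketched in a few lines. This is an illustrative implementation (not from the source; clause encoding and names are hypothetical), with clauses as sets of string literals and "~p" denoting the negation of "p":

```python
# Sketch: possibilistic resolution — from (phi ∨ xi, a) and (psi ∨ ~xi, b)
# derive (phi ∨ psi, min(a, b)).

def neg(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    """Yield all possibilistic resolvents of two weighted clauses."""
    (lits1, a), (lits2, b) = c1, c2
    for l in lits1:
        if neg(l) in lits2:
            # drop the clashing pair, keep the weakest weight
            yield (frozenset(lits1 - {l}) | (lits2 - {neg(l)}), min(a, b))

K = [(frozenset({"p", "q"}), 0.8), (frozenset({"~q", "r"}), 0.5)]
resolvents = list(resolve(K[0], K[1]))
assert resolvents == [(frozenset({"p", "r"}), 0.5)]
```

Refutation-based deduction would iterate this rule on K together with the negated goal, keeping the maximal weight of a derived empty clause.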
Joint possibility distributions can be decomposed into a conjunction of conditional possibility distributions (using minimum) in a way similar to Bayes nets [38]. It is based on a symmetric notion of qualitative independence, Π(B ∩ A) = min(Π(A), Π(B)), that is weaker than the causal-like condition Π(B|A) = Π(B) [22]. Ben Amor and Benferhat [39] investigate the properties of qualitative independence that enable local inferences to be performed in possibilistic nets.

Decision-Theoretic Foundations

Zadeh [1] hinted that “since our intuition concerning the behavior of possibilities is not very reliable”, our understanding of them “would be enhanced by the development of an axiomatic approach to the definition of subjective possibilities in the spirit of axiomatic approaches to the definition of subjective probabilities”. Decision-theoretic justifications of qualitative possibility were recently devised, in the style of Savage [40]. On top of the set of states, assume there is a set X of consequences of decisions. A decision, or act, is modelled as a mapping f from S to X assigning to each state s its consequence f(s). The axiomatic approach consists in proposing properties of a preference relation ⪰ between acts so that a representation of this relation by means of a preference functional W(f) is ensured, that is, act f is as good as act g (denoted f ⪰ g) if
and only if W(f) ≥ W(g). W(f) depends on the agent’s knowledge about the state of affairs, here supposed to be a possibility distribution π on S, and on the agent’s goal, modelled by a utility function u on X. Both the utility function and the possibility distribution map to the same finite chain L. A pessimistic criterion W−(f) is of the form

W−(f) = min_{s∈S} max(n(π(s)), u(f(s)))

where n is the order-reversing map of L. n(π(s)) is the degree of certainty that the state is not s (hence the degree of surprise of observing s), and u(f(s)) is the utility of choosing act f in state s. W−(f) is all the higher as all states are either very surprising or have high utility. This criterion is actually a prioritized extension of the Wald maximin criterion. The latter is recovered if π(s) = 1 (top of L) for all s ∈ S. According to the pessimistic criterion, acts are chosen according to their worst consequences, restricted to the most plausible states S* = {s : π(s) ≥ n(W−(f))}. The optimistic counterpart of this criterion is

W+(f) = max_{s∈S} min(π(s), u(f(s))).
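The two criteria, and their recovery as Sugeno integrals with respect to possibility and necessity measures, can be checked on a toy example. This is an illustrative sketch (not from the source; the scale L = {0,…,3} and all values are hypothetical):

```python
# Sketch: pessimistic and optimistic possibilistic criteria on a finite
# chain L = {0, 1, 2, 3}, with n(x) = 3 - x the order-reversing map.

TOP = 3
def n(x):
    return TOP - x

def w_pess(utility, pi):
    """W-(f) = min_s max(n(pi(s)), u(f(s)))."""
    return min(max(n(pi[s]), utility[s]) for s in pi)

def w_opt(utility, pi):
    """W+(f) = max_s min(pi(s), u(f(s)))."""
    return max(min(pi[s], utility[s]) for s in pi)

def sugeno(utility, gamma, levels):
    """S_{gamma,u}(f) = max over lambda of min(lambda, gamma(F_lambda))."""
    return max(min(lam, gamma({s for s in utility if utility[s] >= lam}))
               for lam in levels)

pi = {"s1": 3, "s2": 1}            # s1 fully plausible, s2 marginal
u  = {"s1": 2, "s2": 0}            # utility of act f in each state

Pi = lambda A: max((pi[s] for s in A), default=0)   # possibility measure
N  = lambda A: n(Pi(set(pi) - A))                   # dual necessity measure

L = range(TOP + 1)
assert sugeno(u, Pi, L) == w_opt(u, pi)    # gamma = Pi -> optimistic
assert sugeno(u, N, L)  == w_pess(u, pi)   # gamma = N  -> pessimistic
```

The two assertions instantiate, on this example, the identities S_{Π,u}(f) = W+(f) and S_{N,u}(f) = W−(f) stated below for the Sugeno integral.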
W+(f) is all the higher as there is a very plausible state with high utility. The optimistic criterion was first proposed by Yager [41] and the pessimistic criterion by Whalen [42]. These optimistic and pessimistic possibilistic criteria are particular cases of a more general criterion based on the Sugeno integral [43] specialized to possibility and necessity of fuzzy events [1,14]:

S_{γ,u}(f) = max_{λ∈L} min(λ, γ(F_λ))

where F_λ = {s ∈ S : u(f(s)) ≥ λ}, and γ is a monotonic set function that reflects the decision-maker’s attitude in front of uncertainty: γ(A) is the degree of confidence in event A. If γ = Π, then S_{Π,u}(f) = W+(f). Similarly, if γ = N, then S_{N,u}(f) = W−(f). For any acts f, g and any event A, let fAg denote an act consisting of choosing f if A occurs and g if its complement occurs. Let f ∧ g (resp. f ∨ g) be the act whose results yield the worst (resp. best) consequence of the two acts in each state. Constant acts are those whose consequence is fixed regardless of the state. A result in [44,45] provides an act-driven axiomatization of these criteria, and enforces possibility theory as a “rational” representation of uncertainty for a finite state space S:

Theorem 1 Suppose the preference relation ⪰ on acts obeys the following properties:
1. (X^S, ⪰) is a complete preorder.
2. There are two acts such that f ≻ g.
3. For all A, all constant acts g and h, and all f: g ⪰ h implies gAf ⪰ hAf.
4. If f is constant, f ≻ h and g ≻ h imply f ∧ g ≻ h.
5. If f is constant, h ≻ f and h ≻ g imply h ≻ f ∨ g.

Then there exists a finite chain L, an L-valued monotonic set function γ on S and an L-valued utility function u, such that ⪰ is representable by a Sugeno integral of u(f) with respect to γ. Moreover, γ is a necessity (resp. possibility) measure as soon as property (4) (resp. (5)) holds for all acts. The preference functional is then W−(f) (resp. W+(f)).

Axioms 4 and 5 contradict expected utility theory. They become reasonable if the value scale is finite, decisions are one-shot (no compensation), and provided that there is a big step between any level in the qualitative value scale and the adjacent ones. In other words, the preference pattern f ≻ h always means that f is significantly preferred to h, to the point of considering the value of h negligible in front of the value of f. The above result provides decision-theoretic foundations of possibility theory, whose axioms can thus be tested by observing the choice behavior of agents. See [46] for another approach to comparative possibility relations, more closely relying on Savage’s axioms but giving up any comparability between utility and plausibility levels. The drawback of these and other qualitative decision criteria is their lack of discrimination power [47]. To overcome it, refinements of possibilistic criteria were recently proposed, based on lexicographic schemes [48]. These new criteria turn out to be representable by a classical (but big-stepped) expected utility criterion.

Quantitative Possibility Theory

The phrase “quantitative possibility” refers to the case when possibility degrees range in the unit interval. In that case, a precise articulation between possibility and probability theories is useful to provide an interpretation to possibility and necessity degrees. Several such interpretations can be consistently devised. See [49] for a detailed survey.
A degree of possibility can be viewed as an upper probability bound [50], and a possibility distribution can be viewed as a likelihood function [51]. A possibility measure is also a special case of a Shafer plausibility function [52]. Following a very different approach, possibility theory can account for probability distributions with extreme values, infinitesimal [7] or having big steps [53]. There are finally close connections between possibility theory and idempotent analysis [54,55]. The theory of large deviations in probability theory [56] also handles set-functions that look like possibility measures [57]. Here we focus on the role of possibility theory in the theory of imprecise probability.
Possibility as Upper Probability

Let π be a possibility distribution where π(s) ∈ [0, 1]. Let P(π) be the set of probability measures P such that P ≤ Π, i. e. ∀A ⊆ S, P(A) ≤ Π(A). Then the possibility measure Π coincides with the upper probability function P* such that P*(A) = sup{P(A) : P ∈ P(π)}, while the necessity measure N is the lower probability function P_* such that P_*(A) = inf{P(A) : P ∈ P(π)}; see [50,58] for details. P and π are said to be consistent if P ∈ P(π). The connection between possibility measures and imprecise probabilistic reasoning is especially promising for the efficient representation of non-parametric families of probability functions, and it makes sense even in the scope of modelling linguistic information [59]. A possibility measure can be computed from a set of nested confidence subsets {A1, A2, …, Am} where Ai ⊂ A(i+1), i = 1, …, m−1. Each confidence subset Ai is attached a positive confidence level λi interpreted as a lower bound of P(Ai), hence a necessity degree. It is viewed as a certainty-qualified statement that generates a possibility distribution πi according to Sect. “Basic Notions of Possibility Theory”. The corresponding possibility distribution is

π(s) = min_{i=1,…,m} πi(s) = 1 if s ∈ A1, and 1 − λj otherwise, where j = max{i : s ∉ Ai}.

The information modelled by π can also be viewed as a nested random set {(Ai, νi), i = 1, …, m}, where νi = λi − λ(i−1). This framework allows for imprecision (reflected by the size of the Ai’s) and uncertainty (the νi’s). And νi is the probability that the agent only knows that Ai contains the actual state (it is not P(Ai)). The random set view of possibility theory is well adapted to the idea of imprecise statistical data, as developed in [60,61]. Namely, given a bunch of imprecise (not necessarily nested) observations (called focal sets), π supplies an approximate representation of the data, as π(s) = Σ_{i : s∈Ai} νi. The set P(π) contains many probability distributions, arguably too many. Neumaier [62] has recently proposed a related framework, in a different terminology, for representing smaller subsets of probability measures using two possibility distributions instead of one. He basically uses a pair (δ, π) of distributions (in the sense of Sect. “Basic Notions of Possibility Theory”), which he calls a “cloud”, where δ is a guaranteed possibility distribution (in our terminology) such that π ≥ δ. A cloud models the (generally non-empty) set P(π) ∩ P(1 − δ), viewing 1 − δ as a standard possibility distribution.

Conditioning

There are two kinds of conditioning that can be envisaged upon the arrival of new information E. The first method presupposes that the new information alters the possibility distribution by declaring all states outside E impossible. The conditional measure π(·|E) is such that Π(B|E) · Π(E) = Π(B ∩ E). This is formally Dempster’s rule of conditioning of belief functions, specialized to possibility measures. The conditional possibility distribution representing the weighted set of confidence intervals is

π(s|E) = π(s)/Π(E) if s ∈ E, and 0 otherwise.
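The construction of a possibility distribution from nested confidence sets can be sketched as follows. This is an illustrative example (not from the source; sets, levels, and names are hypothetical), using the certainty-qualified encoding πi(s) = 1 if s ∈ Ai, else 1 − λi:

```python
# Sketch: possibility distribution induced by nested confidence sets
# A1 ⊂ A2 with lower probability bounds lam_1 <= lam_2.
# pi(s) = min_i pi_i(s), with pi_i(s) = 1 if s in A_i, else 1 - lam_i.

def possibility_from_nested(sets_with_levels, universe):
    pi = {}
    for s in universe:
        vals = [1.0 if s in A else 1.0 - lam for (A, lam) in sets_with_levels]
        pi[s] = min(vals)
    return pi

A1, A2 = {1, 2}, {1, 2, 3, 4}
nested = [(A1, 0.5), (A2, 0.75)]         # P(A1) >= 0.5, P(A2) >= 0.75
pi = possibility_from_nested(nested, {1, 2, 3, 4, 5})

assert pi[1] == 1.0       # in the core A1
assert pi[3] == 0.5       # outside A1 only: 1 - lam_1
assert pi[5] == 0.25      # outside A2 as well: 1 - lam_2
```

Each output value matches 1 − λj for j the largest index of a set the element falls outside of, as in the formula above.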
De Baets et al. [63] provide a mathematical justification of this notion in an infinite setting, as opposed to the min-based conditioning of qualitative possibility theory. Indeed, the maxitivity axiom extended to the infinite setting is not preserved by the min-based conditioning. The product-based conditioning leads to a notion of independence of the form Π(B ∩ E) = Π(B) · Π(E), whose properties are very similar to the ones of probabilistic independence [21]. Another form of conditioning [64,65], more in line with the Bayesian tradition, considers that the possibility distribution π encodes imprecise statistical information, and that event E only reflects a feature of the current situation, not of the state in general. Then the value

Π(B‖E) = sup{P(B|E) : P(E) > 0, P ≤ Π}

is the result of performing a sensitivity analysis of the usual conditional probability over P(π) (Walley [66]). Interestingly, the resulting set function is again a possibility measure, with distribution

π(s‖E) = max(π(s), π(s)/(π(s) + N(E))) if s ∈ E, and 0 otherwise.

It is generally less specific than π on E, as is clear from the above expression, and becomes non-informative when N(E) = 0 (i. e. if there is no information about E). This is because π(·‖E) is obtained from the focusing of the generic information π over the reference class E. On the contrary, π(·|E) operates a revision process on π due to additional knowledge asserting that states outside E are impossible. See De Cooman [65] for a detailed study of this form of conditioning.

Probability-Possibility Transformations

The problem of transforming a possibility distribution into a probability distribution and conversely is meaningful in the scope of uncertainty combination with heterogeneous sources (some supplying statistical data, others linguistic data, for instance). It is useful to cast all pieces of information in the same framework. The basic requirement is to respect the consistency principle Π ≥ P. The problem is then either to pick a probability measure in P(π), or to construct a possibility measure dominating P. There are two basic approaches to possibility/probability transformations, both of which respect a form of probability-possibility consistency. One, due to Klir [67,68], is based on a principle of information invariance; the other [69] is based on optimizing information content. Klir assumes that possibilistic and probabilistic information measures are commensurate. Namely, the choice between possibility and probability is then a mere matter of translation between languages “neither of which is weaker or stronger than the other” (quoting Klir and Parviz [70]). It suggests that entropy and imprecision capture the same facet of uncertainty, albeit in different guises. The other approach, recalled here, considers that going from possibility to probability increases the precision of the considered representation (as we go from a family of nested sets to a random element), while going the other way around means a loss of specificity.

From Possibility to Probability

The most basic example of transformation from possibility to probability is the Laplace principle of insufficient reason, claiming that what is equally possible should be considered as equally probable. A generalized Laplacean indifference principle is then adopted in the general case of a possibility distribution π: the weights νi bearing the sets Ai from the nested family of level cuts of π are uniformly distributed on the elements of these cuts Ai. Let Pi be the uniform probability measure on Ai. The resulting probability measure is P = Σ_{i=1,…,m} νi · Pi.
This transformation, already proposed in 1982 [71], comes down to selecting the center of gravity of the set P(π) of probability distributions dominated by π. This transformation also coincides with Smets’ pignistic transformation [72] and with the Shapley value of the “unanimity game” (another name of the necessity measure) in game theory. The rationale behind this transformation is to minimize arbitrariness by preserving the symmetry properties of the representation. This transformation from possibility to probability is one-to-one. Note that the definition of this transformation does not use the nestedness property of the cuts of the possibility distribution. It applies all the same to non-nested random sets (or belief functions) defined by pairs {(Ai, νi), i = 1, …, m}, where the νi are non-negative reals such that Σ_{i=1,…,m} νi = 1.
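A minimal sketch of this center-of-gravity (pignistic) transformation follows (illustrative, not from the source; function and variable names are hypothetical). Each level cut Ai receives mass νi = πi − π(i+1), spread uniformly over its elements:

```python
# Sketch: possibility-to-probability transformation — spread the mass
# nu_i of each level cut uniformly over the cut's elements.

def possibility_to_probability(pi):
    levels = sorted(set(pi.values()), reverse=True)   # pi_1 > pi_2 > ...
    levels.append(0.0)
    p = {s: 0.0 for s in pi}
    for i, lev in enumerate(levels[:-1]):
        cut = [s for s in pi if pi[s] >= lev]         # level cut A_i
        nu = lev - levels[i + 1]                      # mass nu_i of A_i
        for s in cut:
            p[s] += nu / len(cut)
    return p

pi = {"a": 1.0, "b": 0.5}
p = possibility_to_probability(pi)
assert abs(p["a"] - 0.75) < 1e-9 and abs(p["b"] - 0.25) < 1e-9
assert abs(sum(p.values()) - 1.0) < 1e-9
```

Here the cut {a} carries mass 0.5 and the cut {a, b} carries mass 0.5 split evenly, giving the center of gravity of P(π) on this two-state example.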
From Objective Probability to Possibility

From probability to possibility, the rationale of the transformation is not the same according to whether the probability distribution we start with is subjective or objective [73]. In the case of a statistically induced probability distribution, the rationale is to preserve as much information as possible. This is in line with the handling of possibility-qualified pieces of information representing observed evidence, considered in Sect. “Basic Notions of Possibility Theory”; hence we select as the result of the transformation of a probability measure P the most specific possibility measure in the set of those dominating P [69]. This most specific element is generally unique if P induces a linear ordering on S. Suppose S is a finite set. The idea is to let Π(A) = P(A) for those sets A having minimal probability among other sets having the same cardinality as A. If p1 > p2 > … > pn, then Π(A) = P(A) for sets A of the form {si, …, sn}, and the possibility distribution is defined as πP(si) = Σ_{j=i,…,n} pj. Note that πP is a kind of cumulative distribution of P. If there are equiprobable elements, the unicity of the transformation is preserved if equipossibility of the corresponding elements is enforced. In this case it is a bijective transformation as well. Recently, this transformation was used to prove a rather surprising agreement between probabilistic indeterminateness as measured by Shannon entropy, and possibilistic non-specificity. Namely, it is possible to compare probability measures on finite sets in terms of their relative peakedness (a concept adapted from Birnbaum [74]) by comparing the relative specificity of their possibilistic transforms. Namely, let P and Q be two probability measures on S, and πP, πQ the possibility distributions induced by our transformation. It can be proved that if πP ≥ πQ (i. e. P is less peaked than Q), then the Shannon entropy of P is higher than that of Q [75].
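The converse transformation, from probability to its most specific dominating possibility distribution, reduces to a tail sum over masses sorted in decreasing order. This is an illustrative sketch (not from the source; names and the example vector are hypothetical):

```python
# Sketch: optimal probability-to-possibility transform for distinct masses
# p_1 > p_2 > ... > p_n: pi(s_i) = p_i + p_{i+1} + ... + p_n.

def probability_to_possibility(p):
    order = sorted(p, key=p.get, reverse=True)
    pi, tail = {}, sum(p.values())
    for s in order:
        pi[s] = tail            # cumulative tail sum
        tail -= p[s]
    return pi

p = {"a": 0.5, "b": 0.3, "c": 0.2}
pi = probability_to_possibility(p)
assert abs(pi["a"] - 1.0) < 1e-9
assert abs(pi["b"] - 0.5) < 1e-9
assert abs(pi["c"] - 0.2) < 1e-9
# pi dominates p on singletons: pi(s) >= p(s)
assert all(pi[s] >= p[s] - 1e-9 for s in p)
```

The result is the "cumulative distribution" mentioned above: π is normalized (the top element gets 1) and decreases along the probability ranking.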
This result gives some grounds to the intuitions developed by Klir [67], without assuming any commensurability between entropy and specificity indices.

Possibility Distributions Induced by Prediction Intervals

In the continuous case, moving from objective probability to possibility means adopting a representation of uncertainty in terms of prediction intervals around the mode, viewed as the “most frequent value”. Extracting a prediction interval from a probability distribution or devising a probabilistic inequality can be viewed as moving from a probabilistic to a possibilistic representation. Namely, suppose a non-atomic probability measure P on the real line, with unimodal density p, and suppose one wishes to
represent it by an interval I with a prescribed level of confidence P(I) = γ of hitting it. The most natural choice is the most precise interval ensuring this level of confidence. It can be proved that this interval is of the form of a cut of the density, i. e. Iγ = {s : p(s) ≥ θ} for some threshold θ. Moving the degree of confidence from 0 to 1 yields a nested family of prediction intervals that form a possibility distribution π consistent with P, the most specific one actually, having the same support and the same mode as P and defined by ([69]):

π(inf Iγ) = π(sup Iγ) = 1 − γ = 1 − P(Iγ).

This kind of transformation again yields a kind of cumulative distribution, according to the ordering induced by the density p. Similar constructs can be found in the statistical literature (Birnbaum [74]). More recently, Mauris et al. [76] noticed that starting from any family of nested sets around some characteristic point (the mean, the median, …), the above equation yields a possibility measure dominating P. Well-known inequalities of probability theory, such as those of Chebyshev and Camp–Meidel, can also be viewed as possibilistic approximations of probability functions. It turns out that for symmetric unimodal densities, each side of the optimal possibilistic transform is a convex function. Given such a probability density on a bounded interval [a, b], the triangular fuzzy number whose core is the mode of p and whose support is [a, b] is thus a possibility distribution dominating P regardless of its shape (and the tightest such distribution). These results justify the use of symmetric triangular fuzzy numbers as fuzzy counterparts to uniform probability distributions. They provide much tighter probability bounds than the Chebyshev and Camp–Meidel inequalities for symmetric densities with bounded support. This setting is adapted to the modelling of sensor measurements [77].
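The correspondence between the triangular fuzzy number and the uniform density can be checked numerically. In this illustrative sketch (not from the source; interval and names are hypothetical), the tightest interval of confidence γ for the uniform density on [a, b] is centered at the midpoint m, and the optimal transform π(inf Iγ) = π(sup Iγ) = 1 − γ coincides with the symmetric triangular membership function with core m and support [a, b]:

```python
# Sketch: for the uniform density on [a, b], the optimal possibilistic
# transform equals the symmetric triangular fuzzy number on [a, b].

def triangular(x, a, m, b):
    """Triangular membership with support [a, b] and core {m}."""
    if x <= a or x >= b:
        return 0.0
    return (x - a) / (m - a) if x <= m else (b - x) / (b - m)

a, b = 0.0, 4.0
m = (a + b) / 2.0
for g in [0.25, 0.5, 0.75]:
    # tightest interval of confidence g for the uniform density
    lo, hi = m - g * (b - a) / 2, m + g * (b - a) / 2
    assert abs(triangular(lo, a, m, b) - (1 - g)) < 1e-9
    assert abs(triangular(hi, a, m, b) - (1 - g)) < 1e-9
```

For any other symmetric unimodal density on [a, b], the triangular distribution is only an upper bound (it dominates the optimal transform).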
These results are extended to more general distributions by Baudrit et al. [78], and provide a tool for representing poor probabilistic information.

Subjective Possibility Distributions

The case of a subjective probability distribution is different. Indeed, the probability function is then supplied by an agent who is in some sense forced to express beliefs in this form due to rationality constraints and the setting of exchangeable bets. However, his actual knowledge may be far from justifying the use of a single well-defined probability distribution. For instance, in case of total ignorance about some value, apart from its belonging to an interval, the framework of exchangeable bets enforces a uniform
probability distribution, on behalf of the principle of insufficient reason. Based on the setting of exchangeable bets, it is possible to define a subjectivist view of numerical possibility theory that differs from the proposal of Walley [66]. The approach developed by Dubois, Prade and Smets [79] relies on the assumption that when an agent constructs a probability measure by assigning prices to lotteries, this probability measure is actually induced by a belief function representing the agent’s actual state of knowledge. We assume that going from an underlying belief function to an elicited probability measure is achieved by means of the above-mentioned pignistic transformation, changing focal sets into uniform probability distributions. The task is to reconstruct this underlying belief function under a minimal commitment assumption. In the paper [79], we pose and solve the problem of finding the least informative belief function having a given pignistic probability. We prove that it is unique and consonant, thus induced by a possibility distribution. This result exploits a simple partial ordering between belief functions comparing their information content, in agreement with the expected cardinality of random sets. The obtained possibility distribution can be defined as the converse of the pignistic transformation (which is one-to-one for possibility distributions). It is subjective in the same sense as in the subjectivist school in probability theory. However, it is the least biased representation of the agent’s state of knowledge compatible with the observed betting behavior. In particular, it is less specific than the one constructed from the prediction intervals of an objective probability.
This transformation was first proposed in [80] for objective probability, interpreting the empirical necessity of an event as summing the excess of probabilities of realizations of this event with respect to the probability of the most likely realization of the opposite event.

Possibility Theory and Defuzzification

Possibilistic mean values can be defined using Choquet integrals with respect to possibility and necessity measures [65,81], and come close to defuzzification methods [82]. A fuzzy interval is a fuzzy set of reals whose membership function is unimodal and upper semi-continuous. Its α-cuts are closed intervals. Interpreting a fuzzy interval M, associated to a possibility distribution πM, as a family of probabilities, upper and lower mean values E^*(M) and E_*(M) can be defined as [83]:

E_*(M) = ∫₀¹ inf Mα dα ;  E^*(M) = ∫₀¹ sup Mα dα,

where Mα is the α-cut of M.
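These two integrals can be evaluated numerically on a triangular fuzzy interval. This is an illustrative sketch (not from the source; the triangular example and names are hypothetical), where Mα = [a + α(m − a), b − α(b − m)], so that E_*(M) = (a + m)/2 and E^*(M) = (m + b)/2 in closed form:

```python
# Sketch: lower and upper mean values of a triangular fuzzy interval
# (a, m, b), by midpoint-rule integration of the alpha-cut bounds.

def mean_interval(a, m, b, steps=100000):
    lo = hi = 0.0
    for k in range(steps):                 # midpoint rule on alpha in (0, 1)
        alpha = (k + 0.5) / steps
        lo += a + alpha * (m - a)          # inf M_alpha
        hi += b - alpha * (b - m)          # sup M_alpha
    return lo / steps, hi / steps

e_lo, e_hi = mean_interval(1.0, 2.0, 5.0)
assert abs(e_lo - 1.5) < 1e-6              # (a + m) / 2
assert abs(e_hi - 3.5) < 1e-6              # (m + b) / 2
# defuzzified value = midpoint of the mean interval = (a + 2m + b) / 4
assert abs((e_lo + e_hi) / 2 - 2.5) < 1e-6
```

The midpoint of [E_*(M), E^*(M)] is the defuzzified value discussed next; for a triangular (a, m, b) it equals (a + 2m + b)/4.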
Then the mean interval E(M) = [E_*(M), E^*(M)] of M is the interval containing the mean values of all random variables consistent with M, that is, E(M) = {E(P) | P ∈ P(πM)}, where E(P) represents the expected value associated to the probability measure P. That the “mean value” of a fuzzy interval is an interval seems to be intuitively satisfactory. In particular, the mean interval of a (regular) interval [a, b] is this interval itself. The upper and lower mean values are linear with respect to the addition of fuzzy numbers. Define the addition M + N as the fuzzy interval whose cuts are Mα + Nα = {s + t : s ∈ Mα, t ∈ Nα}, defined according to the rules of interval analysis. Then E(M + N) = E(M) + E(N), and similarly for the scalar multiplication E(aM) = aE(M), where aM has membership grades of the form μM(s/a) for a ≠ 0. In view of this property, it seems that the most natural defuzzification method is the middle point Ê(M) of the mean interval (originally proposed by Yager [84]). Other defuzzification techniques do not generally possess this kind of linearity property. Ê(M) has a natural interpretation in terms of simulation of a fuzzy variable [85], and is the mean value of the pignistic transformation of M. Indeed, it is the mean value of the empirical probability distribution obtained by the random process defined by picking an element α in the unit interval at random, and then an element s in the cut Mα at random.

Applications and Future Directions

Possibility theory has not been the main framework for engineering applications of fuzzy sets in the past. However, on the basis of its connections to symbolic artificial intelligence, to decision theory and to imprecise statistics, we consider that it has significant potential for further applied developments in a number of areas, including some where fuzzy sets are not yet always accepted. Only some directions are pointed out here.

1.
Rules with exceptions can be modelled by means of conditional possibility [32], based on its capability to account for non-monotonic inference, as shown in Sect. “Non-monotonic Inference”. Possibility theory has also enabled a typology of fuzzy rules to be laid bare, distinguishing rules whose purpose is to propagate uncertainty through reasoning steps from rules whose main purpose is similarity-based interpolation [86], depending on the choice of a many-valued implication connective that models a rule. The bipolar view of information based on (δ, π) pairs sheds new light on the debate between conjunctive and implicative representations of rules [87]. Representing a rule as a material implication focuses on counterexamples to rules, while
using a conjunction between antecedent and consequent points out examples of the rule and highlights its positive content. Traditionally, the latter representation is adopted in fuzzy control and modelling, while the former is the logical tradition. Introducing fuzzy implicative rules in modelling accounts for constraints or landmark points the model should comply with (as opposed to observed data) [88]. The bipolar view of rules in terms of examples and counterexamples may turn out to be very useful when extracting fuzzy rules from data [89].

2. Possibility theory also offers a framework for preference modeling in constraint-directed reasoning. Both prioritized and soft constraints can be captured by possibility distributions expressing degrees of feasibility rather than plausibility [90]. Possibility theory offers a natural setting for fuzzy optimization, whose aim is to balance the levels of satisfaction of multiple fuzzy constraints (instead of minimizing an overall cost) [91]. Qualitative decision criteria are particularly adapted to the handling of uncertainty in this setting. Applications of possibility theory-based decision-making can be found in scheduling [92,93,94,95].

3. Quantitative possibility theory is the natural setting for a reconciliation between probability and fuzzy sets. An important research direction is the comparison between fuzzy interval analysis [96] and random variable calculations, with a view to unifying them [97]. Indeed, a current major concern, in for instance risk analysis studies, is to perform uncertainty propagation under poor data and without independence assumptions (see the papers in the special issue [98]). Finding the potential of possibilistic representations in computing conservative bounds for such probabilistic calculations is certainly a major challenge [99]. The active area of fuzzy random variables is also connected to this question [100].

4. One might also mention the well-known possibilistic clustering technique [101,102].
However, it is only loosely related to possibility theory. This name is due to the use of fuzzy clusters with (almost) unrestricted membership functions that no longer form a usual fuzzy partition. But one might use it to generate genuine possibility distributions, where possibility derives from similarity, a point of view already mentioned above. Other applications of possibility theory can be found in fields such as data analysis [103,104,105], database querying [106], diagnosis [107,108], belief revision [109], argumentation [110], and case-based reasoning [111,112]. Lastly,
Possibility Theory
possibility theory is also being studied from the point of view of its relevance in cognitive psychology. Experimental results [113] suggest that there are situations where people reason about uncertainty using the rules of possibility theory rather than those of probability theory.
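The max-min view of fuzzy optimization mentioned in point 2 above (balancing the satisfaction levels of several fuzzy constraints rather than minimizing a cost) can be sketched in a few lines of Python. The two constraint membership functions below (a fuzzy due date and a fuzzy budget for choosing a delivery day) are hypothetical, chosen purely for illustration:

```python
# Max-min fuzzy constraint satisfaction: each soft constraint maps a
# candidate decision to a satisfaction degree in [0, 1]; the global
# satisfaction of a decision is the MINIMUM degree over all constraints,
# and the optimal decision maximizes that minimum.

def due_date_satisfaction(day):
    """Fully satisfied up to day 5, dropping linearly to 0 at day 10."""
    if day <= 5:
        return 1.0
    if day >= 10:
        return 0.0
    return (10 - day) / 5.0

def budget_satisfaction(day):
    """Rush work is costly: satisfaction rises linearly from day 2 to day 8."""
    if day <= 2:
        return 0.0
    if day >= 8:
        return 1.0
    return (day - 2) / 6.0

def global_satisfaction(day, constraints):
    # min = conjunctive combination of the fuzzy constraints
    return min(c(day) for c in constraints)

constraints = [due_date_satisfaction, budget_satisfaction]
best_day = max(range(0, 11), key=lambda d: global_satisfaction(d, constraints))
print(best_day, round(global_satisfaction(best_day, constraints), 3))
```

The optimum trades the two constraints off: no day fully satisfies both, and the max-min criterion picks the day whose worst-satisfied constraint is as satisfied as possible.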
Bibliography
1. Zadeh LA (1978) Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets Syst 1:3–28 2. Gaines BR, Kohout L (1975) Possible automata. In: Proc Int Symp Multiple-Valued Logic, Bloomington May 13–16. IEEE Press, pp 183–196 3. Dubois D, Prade H (1998) Possibility theory: Qualitative and quantitative aspects. In: Gabbay DM, Smets P (eds) Handbook of Defeasible Reasoning and Uncertainty Management Systems, vol 1. Kluwer, Dordrecht, pp 169–226 4. Dubois D, Nguyen HT, Prade H (2000) Fuzzy sets and probability: Misunderstandings, bridges and gaps. In: Dubois D, Prade H (eds) Fundamentals of Fuzzy Sets. Kluwer, Boston, pp 343–438 5. Hughes GE, Cresswell MJ (1968) An introduction to modal logic. Methuen, London 6. Shackle GLS (1961) Decision, order and time in human affairs, 2nd edn. Cambridge University Press, Cambridge 7. Spohn W (1990) A general, nonprobabilistic theory of inductive reasoning. In: Shachter RD et al (eds) Uncertainty in Artificial Intelligence, vol 4. North Holland, Amsterdam, pp 149–158 8. Lewis DL (1973) Counterfactuals. Basil Blackwell, Oxford 9. Dubois D (1986) Belief structures, possibility theory and decomposable measures on finite sets. Comput AI 5:403–416 10. Sudkamp T (2002) Similarity and the measurement of possibility. Actes Rencontres Francophones sur la Logique Floue et ses Applications (Montpellier, France). Cepadues Editions, Toulouse, pp 13–26 11. Cohen LJ (1977) The probable and the provable. Clarendon, Oxford 12. Zadeh LA (1979) Fuzzy sets and information granularity. In: Gupta MM, Ragade R, Yager RR (eds) Advances in fuzzy set theory and applications. North-Holland, Amsterdam, pp 3–18 13. Zadeh LA (1982) Possibility theory and soft data analysis. In: Cobb L, Thrall R (eds) Mathematical frontiers of social and policy sciences. Westview Press, Boulder, pp 69–129 14. Dubois D, Prade H (1980) Fuzzy sets and systems: Theory and applications. Academic Press, New York 15. Dubois D, Prade H (1988) Possibility theory.
Plenum, New York 16. Klir GJ, Folger T (1988) Fuzzy sets, uncertainty and information. Prentice Hall, Englewood Cliffs 17. Yager RR (1983) An introduction to applications of possibility theory. Hum Syst Manag 3:246–269 18. Dubois D, Hajek P, Prade H (2000) Knowledge-driven versus data-driven logics. J Log Lang Inf 9:65–89 19. Walley P (1996) Measures of uncertainty in expert systems. Artif Intell 83:1–58 20. De Cooman G (1997) Possibility theory. Part I: Measure- and integral-theoretic groundwork; Part II: Conditional possibility; Part III: Possibilistic independence. Int J Gen Syst 25:291–371
21. De Campos LM, Huete JF (1999) Independence concepts in possibility theory. Fuzzy Sets Syst 103:127–152 & 487–506 22. Dubois D, Farinas del Cerro L, Herzig A, Prade H (1997) Qualitative relevance and independence: A roadmap. In: Proc of the 15th Int Joint Conf on Artif Intell, Nagoya, 23–29 August, 1997. Morgan Kaufmann, San Mateo, pp 62–67 23. Borgelt C, Gebhardt J, Kruse R (2000) Possibilistic graphical models. In: Della Riccia G et al (eds) Computational intelligence in data mining. Springer, Wien, pp 51–68 24. Benferhat S, Dubois D, Garcia L, Prade H (2002) On the transformation between possibilistic logic bases and possibilistic causal networks. Int J Approx Reason 29(2):135–173 25. Ben Amor N, Benferhat S, Mellouli K (2003) Anytime propagation algorithm for min-based possibilistic graphs. Soft Comput 8(2):150–161 26. Borgelt C, Kruse R (2003) Operations and evaluation measures for learning possibilistic graphical models. Artif Intell 148(1–2):385–418 27. Gärdenfors P (1988) Knowledge in flux. MIT Press, Cambridge 28. Dubois D, Prade H (1991) Epistemic entrenchment and possibilistic logic. Artif Intell 50:223–239 29. Ben Amor N et al (2002) A theoretical framework for possibilistic independence in a weakly ordered setting. Int J Uncert Fuzz & Knowl-B Syst 10:117–155 30. Lehmann D, Magidor M (1992) What does a conditional knowledge base entail? Artif Intell 55:1–60 31. Dubois D, Fargier H, Prade H (2004) Ordinal and probabilistic representations of acceptance. J Artif Intell Res 22:23–56 32. Benferhat S, Dubois D, Prade H (1997) Nonmonotonic reasoning, conditional objects and possibility theory. Artif Intell 92:259–276 33. Pearl J (1990) System Z: A natural ordering of defaults with tractable applications to default reasoning. In: Proc 3rd Conf Theoretical Aspects of Reasoning About Knowledge. Morgan Kaufmann, San Francisco, pp 121–135 34. Dubois D, Lang J, Prade H (1994) Possibilistic logic.
In: Gabbay DM et al (eds) Handbook of logic in AI and logic programming, vol 3. Oxford University Press, Oxford, pp 439–513 35. Benferhat S, Dubois D, Prade H (1998) Practical handling of exception-tainted rules and independence information in possibilistic logic. Appl Intell 9:101–127 36. Dubois D, Prade H (2004) Possibilistic logic: A retrospective and prospective view. Fuzzy Sets Syst 144:3–23 37. Dubois D, Prade H (2007) Toward multiple-agent extensions of possibilistic logic. In: Proc IEEE Inter Conf on Fuzzy Systems (FUZZ-IEEE 2007), London, 23–26 July 2007. pp 187–192 38. Benferhat S, Dubois D, Garcia L, Prade H (2002) On the transformation between possibilistic logic bases and possibilistic causal networks. Int J Approx Reason 29:135–173 39. Ben Amor N, Benferhat S (2005) Graphoid properties of qualitative possibilistic independence relations. Int J Uncert Fuzz & Knowl-B Syst 13:59–97 40. Savage LJ (1972) The foundations of statistics. Dover, New York 41. Yager RR (1979) Possibilistic decision making. IEEE Trans Syst Man Cybern 9:388–392 42. Whalen T (1984) Decision making under uncertainty with various assumptions about available information. IEEE Trans Syst Man Cybern 14:888–900
43. Grabisch M, Murofushi T, Sugeno M (eds) (2000) Fuzzy measures and integrals: Theory and applications. Physica, Heidelberg 44. Dubois D, Prade H, Sabbadin R (2000) Qualitative decision theory with Sugeno integrals. In: Grabisch M, Murofushi T, Sugeno M (eds) Fuzzy measures and integrals: Theory and applications. Physica, Heidelberg, pp 314–322 45. Dubois D, Prade H, Sabbadin R (2001) Decision-theoretic foundations of possibility theory. Eur J Oper Res 128:459–478 46. Dubois D, Fargier H, Perny P, Prade H (2003) Qualitative decision theory with preference relations and comparative uncertainty: An axiomatic approach. Artif Intell 148:219–260 47. Dubois D, Fargier H (2003) Qualitative decision rules under uncertainty. In: Della Riccia G et al (eds) Planning based on decision theory. CISM Courses and Lectures, vol 472. Springer, Wien, pp 3–26 48. Fargier H, Sabbadin R (2005) Qualitative decision under uncertainty: Back to expected utility. Artif Intell 164:245–280 49. Dubois D (2006) Possibility theory and statistical reasoning. Comput Stat Data Anal 51(1):47–69 50. Dubois D, Prade H (1992) When upper probabilities are possibility measures. Fuzzy Sets Syst 49:65–74 51. Dubois D, Moral S, Prade H (1997) A semantics for possibility theory based on likelihoods. J Math Anal Appl 205:359–380 52. Shafer G (1987) Belief functions and possibility measures. In: Bezdek JC (ed) Analysis of fuzzy information, vol I: Mathematics and Logic. CRC Press, Boca Raton, pp 51–84 53. Benferhat S, Dubois D, Prade H (1999) Possibilistic and standard probabilistic semantics of conditional knowledge bases. J Log Comput 9:873–895 54. Maslov V (1987) Méthodes Opératorielles. Mir Publications, Moscow 55. Kolokoltsov VN, Maslov VP (1997) Idempotent analysis and applications. Kluwer, Dordrecht 56. Puhalskii A (2001) Large deviations and idempotent probability. Chapman and Hall, Boca Raton 57.
Nguyen HT, Bouchon-Meunier B (2003) Random sets and large deviations principle as a foundation for possibility measures. Soft Comput 8:61–70 58. De Cooman G, Aeyels D (1999) Supremum-preserving upper probabilities. Inf Sci 118:173–212 59. Walley P, De Cooman G (1999) A behavioural model for linguistic uncertainty. Inf Sci 134:1–37 60. Gebhardt J, Kruse R (1993) The context model. Int J Approx Reason 9:283–314 61. Joslyn C (1997) Measurement of possibilistic histograms from interval data. Int J Gen Syst 26:9–33 62. Neumaier A (2004) Clouds, fuzzy sets and probability intervals. Reliab Comput 10:249–272 63. De Baets B, Tsiporkova E, Mesiar R (1999) Conditioning in possibility with strict order norms. Fuzzy Sets Syst 106:221–229 64. Dubois D, Prade H (1997) Bayesian conditioning in possibility theory. Fuzzy Sets Syst 92:223–240 65. De Cooman G (2001) Integration and conditioning in numerical possibility theory. Ann Math AI 32:87–123 66. Walley P (1991) Statistical reasoning with imprecise probabilities. Chapman and Hall, Boca Raton 67. Klir GJ (1990) A principle of uncertainty and information invariance. Int J Gen Syst 17:249–275
68. Geer JF, Klir GJ (1992) A mathematical analysis of information-preserving transformations between probabilistic and possibilistic formulations of uncertainty. Int J Gen Syst 20:143–176 69. Dubois D, Prade H, Sandri S (1993) On possibility/probability transformations. In: Lowen R, Roubens M (eds) Fuzzy logic: State of the art. Kluwer, Dordrecht, pp 103–112 70. Klir GJ, Parviz B (1992) Probability–possibility transformations: A comparison. Int J Gen Syst 21:291–310 71. Dubois D, Prade H (1982) On several representations of an uncertain body of evidence. In: Gupta M, Sanchez E (eds) Fuzzy information and decision processes. North-Holland, Amsterdam, pp 167–181 72. Smets P (1990) Constructing the pignistic probability function in a context of uncertainty. In: Henrion M et al (eds) Uncertainty in artificial intelligence, vol 5. North-Holland, Amsterdam, pp 29–39 73. Dubois D, Prade H (2001) Smets' new semantics for quantitative possibility theory. In: Proc ECSQARU 2001, Toulouse, LNAI 2143. Springer, pp 410–421 74. Birnbaum ZW (1948) On random variables with comparable peakedness. Ann Math Stat 19:76–81 75. Dubois D, Huellermeier E (2005) A notion of comparative probabilistic entropy based on the possibilistic specificity ordering. In: Godo L (ed) Symbolic and Quantitative Approaches to Reasoning with Uncertainty. Proc of 8th European Conference, ECSQARU 2005, Barcelona, 6–8. Lecture Notes in Computer Science, vol 3571. Springer, Berlin 76. Dubois D, Foulloy L, Mauris G, Prade H (2004) Probability-possibility transformations, triangular fuzzy sets, and probabilistic inequalities. Reliab Comput 10:273–297 77. Mauris G, Lasserre V, Foulloy L (2000) Fuzzy modeling of measurement data acquired from physical sensors. IEEE Trans Meas Instrum 49:1201–1205 78. Baudrit C, Dubois D, Fargier H (2004) Practical representation of incomplete probabilistic information. In: Lopez-Diaz M et al (eds) Soft methods in probability and statistics. Proc 2nd Int Conf.
Springer, Oviedo, pp 149–156 79. Dubois D, Prade H, Smets P (2003) A definition of subjective possibility. Badania Operacyjne i Decyzije (Wroclaw) 4:7–22 80. Dubois D, Prade H (1983) Unfair coins and necessity measures: A possibilistic interpretation of histograms. Fuzzy Sets Syst 10(1):15–20 81. Dubois D, Prade H (1985) Evidence measures based on fuzzy information. Automatica 21:547–562 82. Van Leekwijck W, Kerre EE (2001) Defuzzification: Criteria and classification. Fuzzy Sets Syst 108:303–314 83. Dubois D, Prade H (1987) The mean value of a fuzzy number. Fuzzy Sets Syst 24:279–300 84. Yager RR (1981) A procedure for ordering fuzzy subsets of the unit interval. Inf Sci 24:143–161 85. Chanas S, Nowakowski M (1988) Single value simulation of fuzzy variable. Fuzzy Sets Syst 25:43–57 86. Dubois D, Prade H (1996) What are fuzzy rules and how to use them. Fuzzy Sets Syst 84:169–185 87. Dubois D, Prade H, Ughetto L (2003) A new perspective on reasoning with fuzzy rules. Int J Intell Syst 18:541–567 88. Galichet S, Dubois D, Prade H (2004) Imprecise specification of ill-known functions using gradual rules. Int J Approx Reason 35:205–222 89. Dubois D, Huellermeier E, Prade H (2003) A note on quality measures for fuzzy association rules. In: De Baets B, Bilgic T
(eds) Fuzzy sets and systems. Proc of the 10th Int Fuzzy Systems Assoc World Congress IFSA, Istanbul, 2003, LNAI 2715. Springer, pp 346–353 90. Dubois D, Fargier H, Prade H (1996) Possibility theory in constraint satisfaction problems: Handling priority, preference and uncertainty. Appl Intell 6:287–309 91. Dubois D, Fortemps P (1999) Computing improved optimal solutions to max-min flexible constraint satisfaction problems. Eur J Oper Res 118:95–126 92. Dubois D, Fargier H, Prade H (1995) Fuzzy constraints in job-shop scheduling. J Intell Manuf 6:215–234 93. Slowinski R, Hapke M (eds) (2000) Scheduling under fuzziness. Physica, Heidelberg 94. Chanas S, Zielinski P (2001) Critical path analysis in the network with fuzzy activity times. Fuzzy Sets Syst 122:195–204 95. Chanas S, Dubois D, Zielinski P (2002) Necessary criticality in the network with imprecise activity times. IEEE Trans Man Mach Cybern 32:393–407 96. Dubois D, Kerre E, Mesiar R, Prade H (2000) Fuzzy interval analysis. In: Dubois D, Prade H (eds) Fundamentals of fuzzy sets. Kluwer, Boston, pp 483–581 97. Dubois D, Prade H (1991) Random sets and fuzzy interval analysis. Fuzzy Sets Syst 42:87–101 98. Helton JC, Oberkampf WL (eds) (2004) Alternative representations of epistemic uncertainty. Reliability Engineering and Systems Safety, vol 85. Elsevier, Amsterdam, p 369 99. Guyonnet D et al (2003) Hybrid approach for addressing uncertainty in risk assessments. J Env Eng 129:68–78 100. Gil M (ed) (2001) Fuzzy random variables. Inf Sci 133 101. Krishnapuram R, Keller J (1993) A possibilistic approach to clustering. IEEE Trans Fuzzy Syst 1:98–110 102. Bezdek J, Keller J, Krishnapuram R, Pal N (1999) Fuzzy models and algorithms for pattern recognition and image processing. In: The Handbooks of Fuzzy Sets Series. Kluwer, Boston
103. Wolkenhauer O (1998) Possibility theory with applications to data analysis. Research Studies Press, Chichester 104. Tanaka H, Guo PJ (1999) Possibilistic data analysis for operations research. Physica, Heidelberg 105. Borgelt C, Gebhardt J, Kruse R (2000) Possibilistic graphical models. In: Della Riccia G et al (eds) Computational intelligence in data mining. CISM Series, vol 408. Springer, Berlin 106. Bosc P, Prade H (1997) An introduction to the fuzzy set and possibility theory-based treatment of soft queries and uncertain or imprecise databases. In: Smets P, Motro A (eds) Uncertainty management in information systems. Kluwer, Dordrecht, pp 285–324 107. Cayrac D, Dubois D, Prade H (1996) Handling uncertainty with possibility theory and fuzzy sets in a satellite fault diagnosis application. IEEE Trans Fuzzy Syst 4:251–269 108. Boverie S et al (2002) Online diagnosis of engine dyno test benches: A possibilistic approach. Proc 15th Eur Conf on Artificial Intelligence, Lyon. IOS Press, Amsterdam, pp 658–662 109. Benferhat S, Dubois D, Prade H, Williams M-A (2002) A practical approach to revising prioritized knowledge bases. Stud Log 70:105–130 110. Amgoud L, Prade H (2004) Reaching agreement through argumentation: A possibilistic approach. In: Proc of the 9th Int Conf on Principles of Knowledge Representation and Reasoning (KR'04), Whistler. AAAI Press, Cambridge, pp 175–182 111. Dubois D, Huellermeier E, Prade H (2002) Fuzzy set-based methods in instance-based reasoning. IEEE Trans Fuzzy Syst 10:322–332 112. Huellermeier E, Dubois D, Prade H (2002) Model adaptation in possibilistic instance-based reasoning. IEEE Trans Fuzzy Syst 10:333–339 113. Raufaste E, da Silva Neves R, Mariné C (2003) Testing the descriptive validity of possibility theory in human judgements of uncertainty. Artif Intell 148:197–218
Principal-Agent Models
INÉS MACHO-STADLER, DAVID PÉREZ-CASTRILLO
Universitat Autònoma de Barcelona, Barcelona, Spain
Article Outline
Glossary
Definition of the Subject
Introduction
The Base Game
Moral Hazard
Adverse Selection
Future Directions
Bibliography
Glossary
Information economics Information economics studies how information and its distribution among the players affect economic decisions.
Asymmetric information In a relationship or a transaction, there is asymmetric information when one party has more or better information than the other party concerning relevant characteristics of the relationship or the transaction. There are two types of asymmetric information problems: moral hazard and adverse selection.
Principal–agent The principal-agent model identifies the difficulties that arise in situations where there is asymmetric information between two parties, and finds the best contract in such environments. The "principal" is the name used for the contractor, while the "agent" corresponds to the contractee. Both principal and agent could be individuals, institutions, organizations, or decision centers. The optimal solutions propose mechanisms that try to align the interests of the agent with those of the principal, such as piece rates or profit sharing, or that induce the agent to reveal his information, such as self-reporting contracts.
Moral hazard (hidden action) The term moral hazard initially referred to the possibility that the redistribution of risk (such as insurance, which transfers risk from the insured to the insurer) changes people's behavior. This term, which has been used in the insurance industry for many years, was first studied by Kenneth Arrow. In principal-agent models, the term moral hazard is used to refer to all environments where the ignorant
party lacks information about the behavior of the other party once the agreement has been signed, in such a way that the asymmetry arises after the contract is settled.
Adverse selection (hidden information) The term adverse selection was originally used in insurance. It describes a situation where, as a result of private information, the insured are more likely to suffer a loss than the uninsured (for example, offering a life insurance contract at a given premium may imply that only people with an above-average risk of dying take it). In principal-agent models, we say that there is an adverse selection problem when the ignorant party lacks information while negotiating a contract, in such a way that the asymmetry precedes the relationship.
Definition of the Subject
Principal-agent models provide the theory of contracts under asymmetric information. Such a theory analyzes the characteristics of optimal contracts and the variables that influence these characteristics, according to the behavior and information of the parties to the contract. This approach has a close relation to game theory and mechanism design: it analyzes the strategic behavior of agents who hold private information and proposes mechanisms that minimize the inefficiencies due to such strategic behavior. The costs incurred by the principal (the contractor) to ensure that the agents (the contractees) will act in her interest are a type of transaction cost. These costs include the tasks of investigating and selecting appropriate agents, gaining information to set performance standards, monitoring agents, bonding payments by the agents, and residual losses. Principal-agent theory (and information economics in general) is possibly the area of economics that has evolved the most over the past twenty-five years.
It was initially developed in parallel with the new economics of Industrial Organization, although its applications now include almost all areas of economics, from finance and political economy to growth theory. Some early papers centered on incomplete information in insurance contracts, and more particularly on moral hazard problems, are Spence and Zeckhauser [88] and Ross [81]. The theory soon generalized to dilemmas associated with contracts in other contexts [38,46]. It was further developed in the mid-seventies by authors such as Pauly [72,73], Mirrlees [66], Harris and Raviv [39], and Holmström [40]. Arrow [7] worked on the analysis of the optimal incentive contract when the agent's effort is not verifiable.
A particular case of adverse selection is the one where the type of the agent relates to his valuation of a good. Asymmetric information about buyers' valuation of the objects sold is the fundamental reason behind the use of auctions. Vickrey [92] provides the first formal analysis of first- and second-price auctions. Akerlof [3] highlighted the issue of adverse selection in his analysis of the market for second-hand goods. Further analyses include the early work of Mirrlees [65], Spence [87], Rothschild and Stiglitz [83], Mussa and Rosen [68], as well as Baron and Myerson [10], and Guesnerie and Laffont [36]. The importance of the topic has also been recognized by the Nobel Foundation. James A. Mirrlees and William Vickrey were awarded the Nobel Prize in Economics in 1996 "for their fundamental contributions to the economic theory of incentives under asymmetric information". Five years later, in 2001, George A. Akerlof, A. Michael Spence and Joseph E. Stiglitz also obtained the Nobel Prize in Economics "for their analyses of markets with asymmetric information".
Introduction
The objective of the principal-agent literature is to analyze situations in which a contract is signed under asymmetric information, that is, when one party knows certain relevant things of which the other party is ignorant. The simplest situation concerns a bilateral relationship: the contract between one principal and one agent. The objective of the contract is for the agent to carry out actions on behalf of the principal, and to specify the payments that the principal will pass on to the agent for such actions. In the literature, it is always assumed that the principal is in charge of designing the contract. The agent receives an offer and decides whether or not to sign the contract. He will accept it whenever the utility obtained from it is greater than the utility he would get from not signing.
This utility level, which represents the agent's outside opportunities, is his reservation utility. In order to simplify the analysis, it is assumed that the agent cannot make a counteroffer to the principal. This way of modeling implicitly assumes that the principal has all the bargaining power, except for the fact that the reservation utility can be high in those cases where the agent has excellent outside opportunities. If the agent decides not to sign the contract, the relationship does not take place. If he does accept the offer, then the contract is implemented. It is crucial to notice that the contract is a reliable promise by both parties, stating the principal's and agent's obligations for all (contractual) contingencies. It can only be based on verifiable
variables, that is, those for which it is possible for a third party (a court) to verify whether the contract has been fulfilled. When some players know more than others about relevant variables, we have a situation with asymmetric information. In this case, incentives play an important role. Given the description of the game played between principal and agent, we can summarize its timing in the following steps:
(i) The principal designs the contract (or set of contracts) that she will offer to the agent, the terms of which are not subject to bargaining.
(ii) The alternatives open to the agent are to accept or to reject the contract. The agent accepts it if he so desires, that is, if the contract guarantees him greater expected utility than any other (outside) opportunities available to him.
(iii) The agent carries out an action or effort on behalf of the principal.
(iv) The outcome is observed and the payments are made.
From these elements, it can be seen that the agent's objectives may be in conflict with those of the principal. When the information is asymmetric, the informed party tries to take advantage, while the uninformed player tries to control this behavior via the contract. Since a principal-agent problem is a sequential game, the solution concept to use is subgame (Bayesian) perfect equilibrium. The set-up gives rise to three possible scenarios:
1. The symmetric information case, where the two players share the same information, even if they both may ignore some important elements (some elements may be uncertain).
2. The moral hazard case, where the asymmetry of information arises once the contract has been signed: the decision or the effort of the agent is not verifiable and hence cannot be included in the contract.
3. The adverse selection case, where the asymmetry of information precedes the signature of the contract: a relevant characteristic of the agent is not verifiable and hence the principal cannot include it in the contract.
To see an example of moral hazard, consider a laboratory or research center (the principal) that contracts a researcher (the agent) to work on a certain project. It is difficult for the principal to distinguish between a researcher who is thinking about how to push the project through and a researcher who is thinking about how to organize his evening.
It is precisely this difficulty in controlling effort inputs, together with the inherent uncertainty in any
Principal-Agent Models, Figure 1 The figure summarizes the timing of the relationship and the three cases as a function of the information available to the participants
research project, that generates a moral hazard problem, a non-standard labor market problem. For an example of adverse selection, consider a regulator who wants to set the price of the service provided by a public monopoly equal to the average costs in the firm (to avoid subsidies). This policy (like many others) is subject to important informational requirements. It is not enough for the regulator to ask the firm to reveal the required information in order to set the adequate price, since the firm would attempt to take advantage of that information. Therefore, the regulator should take this problem into account.
The Base Game
Consider a contractual relationship between a principal and an agent, who is contracted to carry out a task. The relationship allows a certain result to be obtained, whose monetary value will be referred to as $x$. For the sake of exposition, the set of possible results $X$ is assumed to be finite, $X = \{x_1, \dots, x_n\}$. The final result depends on the effort that the agent devotes to the task, which will be denoted by $e$, and on the value of a random variable for which both participants have the same prior distribution. The
probability of result $x_i$ conditional on effort $e$ can be written as:
$$\mathrm{Prob}[x = x_i \mid e] = p_i(e), \quad \text{for } i \in \{1, 2, \dots, n\},$$
with $\sum_{i=1}^{n} p_i(e) = 1$. Let us assume that $p_i(e) > 0$ for all $e$, $i$, which implies that no result can be ruled out for any given effort level. The Base Game is the reference situation, where principal and agent have the same information (even that concerning the random component that affects the result). Since uncertainty exists, participants react to risk. Risk preferences are expressed by the shape of their utility functions (of the von Neumann–Morgenstern type). The principal, who owns the result and must pay the agent, has preferences represented by the utility function $B(x - w)$, where $w$ represents the payment made to the agent. $B(\cdot)$ is assumed to be increasing and concave: $B' > 0$, $B'' \le 0$ (where the primes represent, respectively, the first and second derivatives). The concavity of the function $B(\cdot)$ indicates that the principal is either risk-neutral or risk-averse.
The agent receives a monetary pay-off for his participation in the relationship, and he supplies an effort which implies some cost to him. For the sake of simplicity we represent his utility function as:
$$U(w, e) = u(w) - v(e),$$
additively separable in the components $w$ and $e$. This assumption implies that the agent's risk-aversion does not vary with the effort he supplies (many results can be generalized to more general utility functions). The utility derived from the wage, $u(w)$, is increasing and concave: $u'(w) > 0$, $u''(w) \le 0$;
thus the agent may be either risk-neutral, $u''(w) = 0$, or risk-averse, $u''(w) < 0$. In addition, greater effort means greater disutility. We also assume that the marginal disutility of effort is not decreasing: $v'(e) > 0$, $v''(e) \ge 0$. A contract can only be based on verifiable information. In the base game, it can depend on the characteristics of the principal and the agent, and includes both the effort $e$ that the principal demands from the agent and the wages $\{w(x_i)\}_{i=1,\dots,n}$. If the agent rejects the contract, he will have to fall back on the outside opportunities that the market offers him. These other opportunities, which by comparison determine the limit for participation in the contract, are summarized in the agent's reservation utility, denoted by $\underline{U}$. So the agent will accept the contract as long as he earns an expected utility equal to or higher than his reservation utility. Since the principal's problem is to design a contract that the agent will accept, (by backward induction) the optimal contract must satisfy the participation constraint, and it is the solution to the following maximization problem:
$$\max_{[e,\, \{w(x_i)\}_{i=1,\dots,n}]} \; \sum_{i=1}^{n} p_i(e)\, B(x_i - w(x_i))$$
$$\text{s.t.} \quad \sum_{i=1}^{n} p_i(e)\, u(w(x_i)) - v(e) \ge \underline{U}.$$
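As a rough numerical illustration of this program (the numbers are invented, not from the text), the following Python sketch takes a risk-neutral principal, an agent with u(w) = sqrt(w) and v(e) = e^2, two possible results, and a fixed demanded effort. Since the participation constraint binds at the optimum, the state-2 wage is pinned down by the choice of the state-1 wage, and a one-dimensional grid search recovers the optimal contract:

```python
import math

# Symmetric-information benchmark: risk-neutral principal, B(x - w) = x - w,
# risk-averse agent, u(w) = sqrt(w), v(e) = e^2. Illustrative numbers:
# two results x1, x2 with p1 = p2 = 0.5 at the demanded effort,
# and reservation utility U_bar = 1.

x1, x2 = 10.0, 20.0
p1, p2 = 0.5, 0.5
U_bar, effort = 1.0, 1.0
required = U_bar + effort**2       # expected utility the wages must deliver

def contract(w1):
    """Wage in state 2 that makes the participation constraint bind."""
    u2 = (required - p1 * math.sqrt(w1)) / p2
    return u2 * u2 if u2 >= 0 else None

def expected_profit(w1):
    w2 = contract(w1)
    if w2 is None:
        return -float("inf")
    return p1 * (x1 - w1) + p2 * (x2 - w2)

# Grid search over w1; with these numbers the optimum is full insurance:
# w1 = w2 = 4, since sqrt(4) = 2 equals the required utility level.
grid = [i / 100.0 for i in range(1, 1600)]
w1_star = max(grid, key=expected_profit)
print(w1_star, contract(w1_star))
```

The constant wage found here anticipates the general result derived from the first-order conditions: with a risk-neutral principal and a risk-averse agent, the optimal contract fully insures the agent.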
The above problem corresponds to a Pareto optimum in the usual sense of the term. The solution to this problem is conditional on the value of the parameter $\underline{U}$, so that even those cases where the agent can keep a large share of the surplus are taken into account. The principal's program is well behaved with respect to pay-offs, given the assumptions on $u(w)$. Hence the Kuhn–Tucker conditions will be both necessary and sufficient for the global solution of the problem. However, we cannot ascertain the concavity (or quasi-concavity) of the functions
with respect to effort, given the assumptions on $v(e)$, because these functions also depend on all the $p_i(e)$. Hence it is more difficult to obtain global conclusions with respect to this variable. Let us denote by $e^o$ the efficient effort level. From the first-order Kuhn–Tucker conditions with respect to the wages in the different contingencies, we can analyze the associated pay-offs $\{w^o(x_i)\}_{i=1,\dots,n}$. We obtain the following condition:
$$\lambda = \frac{B'(x_i - w^o(x_i))}{u'(w^o(x_i))}, \quad \text{for all } i \in \{1, 2, \dots, n\},$$
where $\lambda$ is the multiplier associated with the participation constraint. When the agent's utility is additively separable, the participation constraint binds ($\lambda$ is positive). The previous condition equates marginal rates of substitution and indicates that the optimal distribution of risk requires the ratio of marginal utilities of the principal and the agent to be constant irrespective of the final result. If the principal is risk-neutral ($B''(\cdot) = 0$), then the optimal contract has to be such that $u'(w^o(x_i))$ is constant for all $i$. In addition, if the agent is risk-averse ($u''(\cdot) < 0$), he receives the same wage, say $w^o$, in all contingencies. This wage only depends on the effort demanded and is determined by the participation constraint. If the agent is risk-neutral ($u''(\cdot) = 0$) and the principal is risk-averse ($B''(\cdot) < 0$), then we are in the opposite situation. In this scenario, the optimal contract requires the principal's profit to be independent of the result. Consequently, the agent bears all the risk, insuring the principal against variations in the result. When both the principal and the agent are risk-averse, each of them needs to accept a part of the variability of the result. The precise amount of risk that each of them supports depends on their degrees of risk-aversion. Using the Arrow–Pratt measures of absolute risk-aversion $r_p = -B''/B'$ and $r_a = -u''/u'$ for the principal and the agent respectively, we can show that:
$$\frac{dw^o}{dx_i} = \frac{r_p}{r_p + r_a},$$
which indicates how the agent's wage changes given an increase in the result $x_i$. Since $r_p/(r_p + r_a) \in (0, 1)$, when both participants are risk-averse, the agent only receives a part of the increased result via a wage increase. The more risk-averse the agent, that is to say, the greater $r_a$, the less the result influences his wage. On the other hand, as the risk-aversion of the principal increases (greater $r_p$), changes in the result correspond to more important changes in the wage.
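With CARA (exponential) utilities, the first-order condition above can be solved for the wage in closed form, which makes the risk-sharing slope easy to verify numerically. The parameter values below are illustrative assumptions, not taken from the text:

```python
import math

# First-order condition B'(x_i - w) / u'(w) = lam solved for CARA utilities
# B(z) = -exp(-rp*z) and u(w) = -exp(-ra*w). Taking logs gives
#   w(x) = (log(lam) - log(rp/ra) + rp*x) / (rp + ra),
# so the wage responds to the result with slope rp / (rp + ra).
# rp, ra and lam are hypothetical numbers chosen for illustration.

rp, ra, lam = 0.5, 1.5, 2.0

def optimal_wage(x):
    return (math.log(lam) - math.log(rp / ra) + rp * x) / (rp + ra)

# The slope of the wage schedule matches rp / (rp + ra) = 0.25 here:
slope = optimal_wage(11.0) - optimal_wage(10.0)
print(round(slope, 6), rp / (rp + ra))
```

Because the agent is three times more risk-averse than the principal in this example, he absorbs only a quarter of any variation in the result, in line with the formula in the text.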
Principal-Agent Models
Moral Hazard

Basic Moral Hazard Model

Here we concentrate on moral hazard, the case in which the informational asymmetry relates to the agent’s behavior during the relationship. We analyze the optimal contract when the agent’s effort is not verifiable. This implies that effort cannot be contracted upon, because in the case of a breach of contract, no court of law could know whether the contract had really been breached or not. There are many examples of this type of situation. A traditional example is accident insurance, where it is very difficult for the insurance company to observe how careful a client has been to avoid accidents. The principal will base the contract on any signals that reveal information on the agent’s effort. We will assume that only the result of the effort is verifiable at the end of the period and, consequently, it will be included in the contract. However, if possible, the contract should be contingent on many other things. Any information related to the state of nature is useful, since it allows better estimations of the agent’s effort, thus reducing the risk inherent in the relationship. This is known as the sufficient statistic result, and it is perhaps the most important conclusion in the moral hazard literature [40]. The empirical content of the sufficient statistic argument is that a contract should exploit all available information in order to filter out risk optimally. The timing of a moral hazard game is the following. In the first place, the principal decides what contract to offer the agent. Then the agent decides whether or not to accept the relationship, according to the terms of the contract. Finally, if the contract has been accepted, the agent chooses the effort level that he most desires, given the contract that he has signed. This is a free decision by the agent, since effort is not a contracted variable. Hence, the principal must bear this in mind when she designs the contract that defines the relationship.
To better understand the nature of the problem faced by the principal, consider the case of a risk-neutral principal and a risk-averse agent, which implies that, under symmetric information, the optimal contract completely insures the agent. However, if the principal proposes this contract when the agent’s effort is not a contracted variable, then once he has signed the contract the agent will exert the effort level that is most beneficial for him. Since the agent’s wage does not depend on his effort, he will exert the lowest possible effort. The idea underlying an incentive contract is that the principal can make the agent interested in the consequences of his behavior by making his pay-off dependent
on the result obtained. Note that this has to be done at the cost of distorting the optimal risk sharing between the two participants. The trade-off between efficiency, in the sense of the optimal distribution of risk, and incentives determines the optimal contract. Formally, since the game has to be solved by backwards induction, the optimal contract under moral hazard is the solution to the maximization problem:

Max_{[e, {w(x_i)}_{i=1,…,n}]}  Σ_{i=1}^n p_i(e) B(x_i − w(x_i))

s.t.  Σ_{i=1}^n p_i(e) u(w(x_i)) − v(e) ≥ Ū

      e ∈ Arg Max_{ê} { Σ_{i=1}^n p_i(ê) u(w(x_i)) − v(ê) }
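When the effort set is finite, this program can be solved directly. A minimal numeric sketch (all functional forms and parameter values are illustrative assumptions, not the article’s own example): two results, two effort levels, a risk-neutral principal, and u(w) = √w. To implement high effort, the participation and incentive compatibility constraints bind, giving two linear equations in the utilities u_i = u(w(x_i)):

```python
# Illustrative two-outcome, two-effort instance of the program above
# (all parameter values are assumptions made for this sketch).
x = [0.0, 20.0]                           # verifiable results
p = {"H": [0.2, 0.8], "L": [0.6, 0.4]}    # p_i(e)
v = {"H": 1.0, "L": 0.0}                  # disutility of effort
U_bar = 1.0                               # reservation utility

# Implementing e = H: participation and incentive constraints bind.
#   p1(H)*u1 + p2(H)*u2 = U_bar + v(H)
#   (p1(H)-p1(L))*u1 + (p2(H)-p2(L))*u2 = v(H) - v(L)
a11, a12, b1 = p["H"][0], p["H"][1], U_bar + v["H"]
a21, a22, b2 = p["H"][0] - p["L"][0], p["H"][1] - p["L"][1], v["H"] - v["L"]
det = a11 * a22 - a12 * a21
u1 = (b1 * a22 - a12 * b2) / det          # Cramer's rule
u2 = (a11 * b2 - a21 * b1) / det
w = [u1 ** 2, u2 ** 2]                    # invert u(w) = sqrt(w)

profit_H = sum(pi * (xi - wi) for pi, xi, wi in zip(p["H"], x, w))
# Implementing e = L needs no incentives: a constant wage with u(w) = U_bar + v(L).
profit_L = sum(pi * xi for pi, xi in zip(p["L"], x)) - (U_bar + v["L"]) ** 2
print(w, profit_H, profit_L)
```

In this instance the principal implements high effort, and the risk-averse agent is no longer fully insured: the wage varies with the result, which is precisely the cost of providing incentives.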
The second restriction is the incentive compatibility constraint and the first restriction is the participation constraint. The incentive compatibility constraint, and not the principal as under symmetric information, determines the effort of the agent. The first difficulty in solving this program is that the incentive compatibility constraint is itself a maximization problem. The second difficulty is that the expected utility may fail to be concave in effort, so using the first-order condition of the incentive compatibility constraint may be incorrect. In spite of this, there are several ways to proceed when facing this problem. (a) Grossman and Hart [33] propose solving it in steps, identifying first the optimal payment mechanism for any given effort and then, if possible, the optimal effort. This can be done because the problem is concave in payoffs. (b) The other possibility is to consider situations where the agent’s maximization problem is well behaved. One such scenario is when the set of possible efforts is finite, in which case the incentive compatibility constraint takes the form of a finite set of inequalities. Another is to write the incentive compatibility constraint as the first-order condition of the agent’s maximization problem and to introduce assumptions under which this substitution is valid. The latter solution is known as the first-order approach. Let us assume that the first-order approach is adequate, and substitute the incentive compatibility constraint in the previous program by

Σ_{i=1}^n p'_i(e) u(w(x_i)) − v'(e) = 0 .
Solving the principal’s program with respect to the payoff scheme, and denoting by λ (resp., μ) the Lagrange multiplier of the participation constraint (resp., the incentive
The basic moral hazard setup, with a principal hiring an agent who exerts effort, has been extended in several directions to take into account more complex relationships.
agent’s consumption over time, which depends on the sequence of payments received (as a function of past contingencies). Malcomson and Spinnewyn [58], Fudenberg, Holmström, and Milgrom [30], and Rey and Salanié [77] study when the optimal long-term contract can be implemented through a sequence of optimal short-term contracts. Chiappori, Macho-Stadler, Rey, and Salanié [16] show that, in order for the sequence of optimal short-term contracts to admit the same solution as the long-term contract, two conditions must be met. First, the optimal sequence of single-period contracts should have memory. For this reason, when the reservation utility is invariant (is not history dependent), the optimal sequence of short-term contracts will not replicate the long-term optimum unless there exist means of smoothing consumption, that is, unless the agent has access to credit markets. Second, the long-term contract must be renegotiation-proof. A contract is said to be renegotiation-proof if, at the beginning of any intermediate period, no new contract or renegotiation that would be preferred by all participants is possible. When the long-term contract is not renegotiation-proof (i.e., when at some moment in time the participants would all agree to change the clauses of the contract), it cannot coincide with the sequence of short-term contracts.
Repeated Moral Hazard

Certain relationships in which a moral hazard problem occurs do not take place only once, but are repeated over time (for example, work relationships, insurance, etc.). The duration aspect (the repetition) of the relationship gives rise to new elements that are absent in static models. Radner [76] and Rubinstein and Yaari [84] consider infinitely repeated relationships and show that frequent repetition of the relationship allows convergence towards the efficient solution. Incentives are determined not by the payoff scheme contingent on the result of each period, but rather by average effort, and the information available is very precise when the number of periods is large. A sufficiently threatening punishment, applied when the principal believes that the agent on average does not fulfill his task, may be sufficient to dissuade him from shirking. When the relationship is repeated a finite number of times, the analysis of the optimal contract concentrates on different issues relating long-term agreements to short-term contracts. Note that in a repeated setup, the agent’s wage and the agent’s consumption in a period need not be equal. Lambert [54], Rogerson [80], and Chiappori and Macho-Stadler [15] show that long-term contracts have memory (i.e., the pay-offs in any single period will depend on the results of all previous periods) since they internalize
One Principal and Several Agents

When a principal contracts with more than one agent, the stage where agents exert their effort, which is translated into the incentive compatibility constraint, depends on the game played among the agents. If the agents behave as a coordinated and cooperating group, then the problem is similar to the previous one, with the principal hiring a team. A more interesting case appears when agents play a non-cooperative game and their strategies form a Nash equilibrium. Holmström [40] and Mookherjee [67], in models where there is personalized information about the output of each agent, show that the principal is interested in paying each agent according to his own production and that of the other agents whenever these other results can inform on the actions of the agent at hand. Only if the results of the other agents do not add information or, in other words, if an agent’s result is a sufficient statistic for his effort, will he be paid according to his own result alone. When the only verifiable outcome is the final result of teamwork (joint production models), the optimal contract can only depend on this information, and the conclusions are similar to those obtained in models with only one agent. Alchian and Demsetz [5] and Holmström [41] show that joint production cannot lead to efficiency when all the income is distributed amongst the agents, i.e., if the budget constraint always binds. Another player should be
compatibility constraint), we obtain that for all i:

1/u'(w(x_i)) = λ + μ · p'_i(e)/p_i(e) .
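To see how this condition maps likelihood ratios into wages, one can take u(w) = √w, so that 1/u'(w) = 2√w and w_i = ((λ + μ·p'_i(e)/p_i(e))/2)². A sketch with assumed numbers (chosen only so that Σ_i p'_i(e) = 0, as probabilities require):

```python
# Illustrative evaluation of 1/u'(w_i) = lam + mu*LR_i with u(w) = sqrt(w)
# (probabilities, likelihood ratios, and multipliers are assumed values).
p = [0.2, 0.3, 0.5]            # p_i(e) at the implemented effort
LR = [-1.5, 0.2, 0.48]         # likelihood ratios p'_i(e)/p_i(e)
lam, mu = 2.0, 1.0             # multipliers (both positive under moral hazard)

# Consistency check: sum_i p'_i(e) = sum_i p_i(e)*LR_i must be 0
assert abs(sum(pi * lri for pi, lri in zip(p, LR))) < 1e-12

# 1/u'(w) = 2*sqrt(w)  =>  w_i = ((lam + mu*LR_i)/2)**2
w = [((lam + mu * lri) / 2) ** 2 for lri in LR]
print(w)  # increasing across results, because LR is increasing in i
```

The wage schedule is monotone in the result exactly because the likelihood ratio is monotone in i; with a non-monotone LR the same formula would produce a non-monotone wage.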
This condition shows that the wage does not depend on the value that the principal places on the result. It depends on the results as a measure of how informative they are about effort, in order to serve as an incentive for the agent. Hence, the wage is increasing in the result only in particular cases. The condition for the wage to be increasing with the result is that the likelihood ratio p'_i(e)/p_i(e) be increasing in i. In statistics, this is called the monotone likelihood ratio property. It is a strong condition; for example, first-order stochastic dominance does not guarantee the monotone likelihood ratio property.

Extensions of Moral Hazard Models
contracted to control the productive agents and act as the residual claimant of the relationship. Tirole [90] and Laffont [49] have studied the effect that coalitions among the agents of an organization have on their payment scheme. If collusion is bad for the organization, it adds another dimension of moral hazard (the colluding behavior). The principal may be obliged to apply rules that are collusion-proof, which implies more constraints and simpler, more bureaucratic contracts. When coordination can improve the input of a group of agents, the optimal contract has to find payment methods that strengthen group work (see [44,56]). Another decision the principal faces when she hires several agents concerns the organization she will set up. This includes such fundamental choices as how many agents to contract and how they should be structured. These issues have been studied by Demski and Sappington [23], Melumad and Reichelstein [63], and Macho-Stadler and Pérez-Castrillo [57]. Holmström and Milgrom [42] analyze a situation in which the agent carries out several tasks, each of which gives rise to a different result. They study the optimal contract when tasks are complementary (in the sense that exerting effort in one reduces the cost of the other) or substitutes. Their model allows them to build a theory of job design and to explain the relationship between responsibility and authority.

Several Principals and One Agent

When one agent works for (or signs his contracts with) several principals simultaneously (a common agency situation), the principals are in general better off if they cooperate. When the principals are not able to achieve the coordination and commitment necessary to act as a single individual, and they do not value the results in the same way, they each demand different efforts or actions from the agent.
Bernheim and Whinston [12] show that the effort that principals obtain when they do not cooperate is less than the effort that would maximize their collective profits. However, the final contract offered to the agent minimizes the cost of getting the agent to choose the contractual effort.

Adverse Selection

Basic Adverse Selection Model

Adverse selection is the term used to refer to problems of asymmetric information that appear before the contract is signed. The classic example of Akerlof [3] illustrates the issue very well: the buyer of a used car has much less information about the state of the vehicle than the seller. Similarly, the buyer of a product knows how much he appreciates its quality, while the seller only has statistical information about a typical buyer’s taste [68]; and a regulated firm has more accurate information about its marginal cost of production than the regulator. A convenient way to model adverse selection problems is to consider that the agent can be of different types, and that the agent knows his type before any contract is signed while the principal does not. In the previous examples, the agent’s type is the particular quality of the used car, the level of appreciation of quality, or the firm’s marginal cost. How can the principal deal with this informational problem? Instead of offering just one contract for all types of agents, she can propose several contracts so that each type of agent chooses the one that is best for him. A useful result in this literature is the revelation principle [31,32,69], which states that any mechanism that the principal can design is equivalent to a direct revelation mechanism by which the agent is asked to reveal his type and a contract is offered according to his declaration. That is, a direct revelation mechanism offers a menu of contracts to the agent (one contract for each possible type), and the agent can choose any of the proposed contracts. Clearly, the mechanism must give the agent the right incentives to choose the appropriate contract; that is, it must be a self-selection mechanism. Menus of contracts are not unusual. For instance, insurance companies offer several possible insurance contracts among which clients may freely choose their most preferred. For example, car insurance contracts can come with or without deductible clauses: the contract without a deductible attracts more risk-averse or higher-risk drivers, while deductibles attract less risk-averse or lower-risk drivers. The timing of an adverse selection game is therefore the following. In the first place, the agent’s characteristics (his “type”) are realized, and only the agent learns them.
Then, the principal decides the menu of contracts to offer to the agent. Having received the proposal, the agent decides which of the contracts (if any) to accept. Finally, if a contract has been accepted, the agent chooses the predetermined effort and receives the corresponding payment. A simple model of adverse selection is the following. A risk-neutral principal contracts an agent (who could be risk-neutral or risk-averse) to carry out some verifiable effort on her behalf. Effort e provides an expected payment to the principal of Π(e), with Π'(e) > 0 and Π''(e) < 0. The agent can be one of two types that differ with respect to the disutility of effort, which is v(e) for type G (good) and k·v(e), with k > 1, for type B (bad). Hence, the agent’s utility function is either U_G(w, e) = u(w) − v(e) or U_B(w, e) = u(w) − k·v(e). The principal considers that
the probability for an agent to be of type G is q, where 0 < q < 1. The principal designs a menu of contracts {(e_G, w_G); (e_B, w_B)}, where (e_G, w_G) is directed towards the most efficient type of agent, while (e_B, w_B) is intended for the least efficient type. For the menu of contracts to be a sensible proposal, the agent must be better off by truthfully revealing his type than by deceiving the principal. The principal’s problem is therefore to maximize her expected profits subject to the restrictions that (a) after considering the contracts offered, the agent decides to sign with the principal (participation constraints), and (b) each agent chooses the contract designed for his particular type (incentive compatibility constraints):

Max_{[(e_G, w_G); (e_B, w_B)]}  q[Π(e_G) − w_G] + (1 − q)[Π(e_B) − w_B]

s.t.  u(w_G) − v(e_G) ≥ Ū
      u(w_B) − k·v(e_B) ≥ Ū
      u(w_G) − v(e_G) ≥ u(w_B) − v(e_B)
      u(w_B) − k·v(e_B) ≥ u(w_G) − k·v(e_G)

The main characteristics of the optimal contract menu {(e_G, w_G); (e_B, w_B)} are the following: (i) The contract offered to the good agent, (e_G, w_G), is efficient (“no distortion at the top”). The optimal salary w_G, however, is higher than under symmetric information: this type of agent receives an informational rent. That is, the most efficient agent profits from his private information, and in order to reveal this information he has to receive a utility greater than his reservation level. (ii) The participation constraint binds for the agent with the highest costs (he just receives his reservation utility). Moreover, a distortion is introduced into the efficiency condition for this type of agent. By distorting, the principal loses efficiency with respect to type-B agents, but she pays less informational rent to the G-types.

Principals Competing for Agents in Adverse Selection Frameworks

Starting with the pioneering work by Rothschild and Stiglitz [83] on insurance markets, there have been many studies of markets with adverse selection problems in which principals compete to attract agents. We move from a model where one principal maximizes her profits subject to the above constraints, to a game-theoretic environment where each principal has to take into account the actions of the others when deciding which contract to offer. In this case, the adverse selection problem may be so severe that we may find ourselves in situations in which no equilibrium exists. To highlight the main results in this type of model, consider a simple case in which there are two possible risk-averse agent types: good (G) and bad (B), with G more productive than B. In particular, we assume that G is more careful than B, in the sense that he commits fewer errors. When the agent exerts effort, the result can be either a success (S) or a failure (F). The probability of success is p_G when the agent is of type G and p_B when he is of type B, where p_G > p_B. The principal values a successful result more than a failure. The result is observable, so the principal can pay the agent according to the result if she so desires. There are several risk-neutral principals. We therefore look for the set of equilibrium contracts in the game played by principals competing to attract agents. Equilibrium contracts must be such that there does not exist a principal who can offer a different contract that would be preferred by all or some of the agents and that would give that principal greater expected profits. Hence, if information were symmetric, the equilibrium contracts would be characterized by the following properties: (i) principals’ expected profits are zero; and (ii) each contract must be efficient, so the agent receives a fixed wage insuring him against random events. In particular, the equilibrium salary that the agent receives under symmetric information is higher when he is of type G than when he is of type B. When the principals cannot observe the type of the agent, the previous contracts can no longer be an equilibrium: all the agents would claim to be of the good type.
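Returning to the single-principal menu program above: under assumed functional forms (risk-neutral agent u(w) = w, Π(e) = 10·ln(1+e), v(e) = e²/2, k = 2, q = 1/2, Ū = 0, all values purely illustrative), the binding constraints can be substituted into the objective, as in the standard textbook derivation, and the first-order conditions solved numerically. The sketch reproduces “no distortion at the top”, the downward distortion of e_B, and the informational rent of the G type:

```python
# Numeric sketch of the two-type menu under assumed functional forms
# (risk-neutral agent u(w) = w; all values illustrative, not from the article).
q, k, U_bar = 0.5, 2.0, 0.0
Pi_prime = lambda e: 10.0 / (1.0 + e)     # Pi(e) = 10*ln(1+e), so Pi' > 0, Pi'' < 0
v = lambda e: e * e / 2.0
v_prime = lambda e: e

def solve(f):
    """Bisection for f(e) = 0 on [0, 100]; f is decreasing in e here."""
    lo, hi = 0.0, 100.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# 'No distortion at the top': Pi'(e_G) = v'(e_G)
e_G = solve(lambda e: Pi_prime(e) - v_prime(e))
# Symmetric-information effort of the B type: Pi'(e) = k*v'(e)
e_B_fb = solve(lambda e: Pi_prime(e) - k * v_prime(e))
# Second best (binding PC_B and IC_G substituted into the objective):
#   Pi'(e_B) = k*v'(e_B) + (q/(1-q))*(k-1)*v'(e_B)
e_B = solve(lambda e: Pi_prime(e) - (k + (q / (1 - q)) * (k - 1)) * v_prime(e))
rent_G = (k - 1) * v(e_B)   # informational rent paid to the G type
print(e_G, e_B_fb, e_B, rent_G)  # e_B < e_B_fb: downward distortion for B
```

The distorted effort e_B lies strictly below the symmetric-information level e_B_fb, and the rent paid to the G type is increasing in e_B, which is exactly why the principal distorts it downward.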
An equilibrium contract pair {C_G, C_B} must satisfy the condition that no principal can add a new contract that would give her positive expected profits from the agents that prefer this new contract to C_G and C_B. If the equilibrium contracts for the two agent types turn out to be the same, that is, if there is only one contract that is accepted by both agent types, then the equilibrium is said to be pooling. On the other hand, when there is a different equilibrium contract for each agent type, we have a separating equilibrium. In fact, pooling equilibria never exist, since pooling contracts always give room for a principal to propose a profitable contract that would only be accepted by the G-types (the best agents). If an equilibrium does exist, it must be such that each type of agent is offered a different contract. If the probability that the agent is “good” is large enough, then a separating equilibrium does not exist either. That is, an adverse selection problem in a market may provoke the absence of any equilibrium in that market. When a separating equilibrium does exist, the results are similar to the ones under moral hazard, in spite of the differences in the type of asymmetric information and in the method of solving the model. That is, contingent pay-offs are offered to the best agents to allow the principal to separate them from the less efficient ones. In this equilibrium, the least efficient agents obtain the same expected utility (and even sign the same contract) as under symmetric information, while the best agents lose expected utility due to the asymmetric information.

Extensions of Adverse Selection Models

Repeated Adverse Selection

In this extension, we consider whether the repetition of the relationship over several periods helps the principal and how it influences the form of the optimal contract. Note first that if the agent’s private information is different in each period and the information is not correlated across periods, then any current information revealed does not affect the future, and hence the repeated problem is equivalent to a simple repetition of the initial relationship. The optimal intertemporal contract will be the sequence of optimal single-period contracts. Consider the opposite situation, where the agent’s type is constant over time. If the agent reveals his type truthfully in the first period, then the principal is able to design efficient contracts that extract all the surplus from the agent. Hence, the agent has very strong incentives to misrepresent his information in the early periods of the relationship. In fact, Baron and Besanko [8] show that if the principal can commit herself to a contract that covers all the periods, then the optimal contract is the repetition of the optimal static contract.
This implies that the contract is not sequentially rational, and it is also not robust to renegotiation: once the first period is finished, the principal “knows” the agent’s type, and a better contract for both parties is possible. It is often the case that the principal cannot commit not to renegotiate a long-term contract. Laffont and Tirole [52] show that, in this case, it may be impossible to propose perfect-revelation (separating) contracts in the first periods. This is known as the ratchet effect. Also, Freixas, Guesnerie, and Tirole [29] and Laffont and Tirole [51] have proven that, even when separating contracts exist, they may be so costly that they are often not optimal, and we should expect information to be revealed progressively over time. Baron and Besanko [9] and Laffont and Tirole [53] also introduce frameworks in which it is
possible to propose perfect-revelation contracts but they are not optimal.

Relationships with Several Agents: Auctions

One particularly interesting case of a relationship between one principal and several agents is that of a seller who intends to sell one or several items to several interested buyers, where the buyers have private information about their valuations of the item(s). A very popular selling mechanism in such a case is an auction. As Klemperer [48] puts it, auction theory is one of economics’ success stories in both practical and theoretical terms. Art galleries generally use English auctions, in which the agents bid “upwards”, while fish markets are often examples of Dutch auctions, in which the seller reduces the price of the good until someone stops the auction by buying. Public contracts are generally awarded through (first-price or second-price) sealed-bid auctions, where buyers submit their bids in closed envelopes, the good is sold to the highest bidder, and the price is either the winner’s own bid or the second-highest bid. Vickrey [92,93] was the first to establish the key result in auction theory, the Revenue Equivalence Theorem, which, subject to some reasonable conditions, says that the seller can expect equal profits on average from all the above (and many other) types of auctions, and that buyers are also indifferent among them all. Auctions are efficient, since the buyer who ends up with the object is the one with the highest valuation. Hence, the existence of private information does not generate any distortion with respect to who ends up getting the good, but the revenue of the seller is less than under symmetric information. Myerson [70] solves the general mechanism design problem of a seller who wants to maximize her expected revenue when the bidders have independent types and all agents are risk-neutral. In general, the optimal auction is more complex than the traditional English (second-price) or Dutch (first-price) auction.
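The Revenue Equivalence Theorem can be illustrated by Monte-Carlo simulation in the simplest textbook setting (an illustration constructed here, not the article’s own example): n risk-neutral bidders with private values drawn i.i.d. uniform on [0, 1]. In a second-price auction, bidding one’s value is a dominant strategy, so revenue is the second-highest value; in a first-price auction, the symmetric equilibrium bid is b(v) = v·(n−1)/n:

```python
import random

# Monte-Carlo check of revenue equivalence with uniform private values.
random.seed(0)
n, T = 3, 200_000
rev_first = rev_second = 0.0
for _ in range(T):
    vals = sorted(random.random() for _ in range(n))
    rev_second += vals[-2]                  # second-highest value (dominant bids)
    rev_first += vals[-1] * (n - 1) / n     # winner's first-price equilibrium bid
rev_first /= T
rev_second /= T
print(rev_first, rev_second)  # both near (n-1)/(n+1) = 0.5 for n = 3
```

Both averages converge to (n−1)/(n+1), the expected second-highest of n uniform draws, in line with the theorem.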
Myerson’s work has been extended by many other authors. When the buyers’ types are affiliated (i.e., they are not negatively correlated in any subset of their domain), Milgrom and Weber [64] show that the revenue equivalence theorem breaks down. In fact, in this situation, McAfee, McMillan, and Reny [61] show that the seller may extract the entire surplus from the bidders, as if there were no asymmetric information. Starting with Maskin and Riley [60], several authors have also analyzed auctions of multiple units. Finally, Clarke [19] and Groves [34] initiated another group of models in which the principal contracts with several agents simultaneously but does not attempt to maximize her own profits. This is the case of the provision of
a public good through a mechanism provided by a benevolent regulator.

Relationships with Several Agents: Other Models and Organizational Design

Adverse selection models have been used to analyze the optimal task assignment, the advantages of delegation, and the optimal structure of contractual relationships when the principal contracts with several agents. Riordan and Sappington [79] analyze a situation where two tasks have to be fulfilled and show that, if the person in charge of each task has private information about the costs associated with the task, then the assignment of tasks within the organization is an important decision. For example, when the costs are positively correlated, the principal will prefer to take charge of one of the phases herself, while she will prefer to delegate when the costs are negatively correlated. In a very general framework, Myerson [71] proves a powerful result: in adverse selection situations, centralization cannot be worse than decentralization, since it is always possible to replicate a decentralized contract with a centralized one. This result is really a generalization of the revelation principle. Baron and Besanko [11] and Melumad, Mookherjee, and Reichelstein [62] show that, if the principal can offer complex contracts in a decentralized organization, then a decentralized structure can replicate a centralized organization. When there are problems of communication between the principal and the agents, the equivalence result does not hold: Melumad and Reichelstein [63] show that delegation of authority can be preferable if communication between the principal and the agents is difficult. Still concerning the optimal design of the organization, Dana [22] analyzes the optimal hierarchical structure in industries with several productive phases, when firms have private information about their costs.
This work shows that structures that concentrate all the tasks in a single agent are superior, since the incentives to dishonestly reveal the costs of each of the phases are then weaker. Da-Rocha-Alvarez and De-Frutos [20] argue that the absolute advantage of the centralized hierarchy is not maintained if the differences in costs between the different phases are sufficiently important.

Several Principals

Stole [89] and Martimort [59] point out the difficulty of extending the revelation principle to situations where an agent with private information is contracted by several principals who act separately. Given that not one contract (or menu of contracts) but several contracts coming from different principals are offered to the agent, it is no longer necessarily true that the best a principal can do is to offer a “truth-telling mechanism”.
Consider a situation with two principals hiring a single agent. If we accept that the agent’s messages are restricted to the set of possible types that the agent may have, we can obtain some conclusions. If the activities or efforts that the agent carries out for the two principals are substitutes (for instance, a firm producing for two different customers), then the usual result on the distortion of decisions holds: the most efficient type of agent supplies the efficient level of effort, while the effort demanded from the least efficient type is distorted. However, due to the lack of cooperation between principals, the distortion induced in the effort demanded from the less efficient type of agent is lower than the one that would maximize the principals’ aggregate profits. On the other hand, if the activities that the agent carries out for the principals are complementary (for example, the firm produces a final good that requires two complementary intermediate goods in the production process), then the comparison of the results with and without cooperation between the principals reveals that, in the second case, if a principal reduces the effort demanded from the agent, it also becomes profitable for the other principal to do the same. Therefore, the distortion in decisions is greater than the one produced when the principals cooperate.

Models of Moral Hazard and Adverse Selection

The analysis of principal-agent models in which there are simultaneously elements of moral hazard and adverse selection is a complex extension of classic agency theory. Conclusions can be obtained only in particular scenarios. One common class of models considers situations where the principal cannot distinguish the part of the production level corresponding to effort from the part corresponding to the agent’s efficiency characteristic, because both variables determine production.
Picard [74] and Guesnerie, Picard, and Rey [37] propose a model with risk-neutral participants and show that, if the effort demanded from the different agents is not decreasing in the characteristic (if a higher value of this parameter implies greater efficiency), then the optimal contract is a menu of distortionary deductibles designed to separate the agents. The menu of contracts includes one in which the principal sells the firm to the agent (aimed at the most efficient type) and another in which she sells only a part of the production at a lower price (aimed at the least efficient type). However, there are also cases where fines are needed to induce the agents to honestly reveal their characteristic. In fact, the main message of this literature is that the optimal solution for problems that mix adverse selection and moral hazard does not imply efficiency losses with respect to the pure adverse selection solution when
the agent’s effort is observable. However, in other frameworks (see [50]), a true problem of asymmetric information appears only when both problems are mixed, and efficiency losses are then evident: the same solution as when only the agent’s characteristic is private information cannot be achieved.

Future Directions

Empirical Studies of Principal-Agent Models

The growing interest in empirical issues related to asymmetric information started in the mid-1990s (see the survey by Chiappori and Salanié [18]). A very large part of the literature is devoted to testing the predictions of the canonical models of moral hazard and adverse selection, where there is only one dimension in which information is asymmetric. A great deal of effort is devoted to trying to ascertain whether it is moral hazard, adverse selection, or both that is prevalent in a market. This is a difficult task because both adverse selection and moral hazard generate the same predictions in a cross section. For instance, a positive correlation between insurance coverage and the probability of accident can be due either to intrinsically riskier drivers selecting into contracts with better coverage (as the adverse selection model of [83] predicts) or to drivers with better coverage exerting less effort to drive carefully (as the canonical moral hazard model predicts). Chiappori, Jullien, Salanié, and Salanié [17] have shown that the positive correlation between coverage and risk holds more generally than in the canonical models, as long as the competitive assumption is maintained. Future empirical approaches are likely to incorporate market power (as in [17]), multiple dimensions of asymmetric information (as in [28]), as well as different measures of asymmetric information (as in [91]). These advances will be made partly possible by richer surveys which collect subjective information regarding agents’ attributes usually unobserved by principals, or agents’ subjective probability distributions.
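The identification problem described above can be made concrete with a small simulation (a constructed illustration, with invented accident probabilities): in one world, intrinsic risk type drives both coverage choice and accidents (adverse selection); in the other, coverage is assigned at random but reduces care (moral hazard). Both produce a positive coverage-risk correlation in a cross section:

```python
import random

def correlation(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

random.seed(1)
N = 100_000

# Adverse selection world: risk type drives both coverage choice and accidents.
cov_a, acc_a = [], []
for _ in range(N):
    risky = random.random() < 0.5            # intrinsic type, known to the driver
    coverage = 1 if risky else 0             # riskier drivers select full coverage
    accident = random.random() < (0.3 if risky else 0.1)
    cov_a.append(coverage); acc_a.append(1 if accident else 0)

# Moral hazard world: coverage is assigned at random, but it lowers care.
cov_m, acc_m = [], []
for _ in range(N):
    coverage = random.random() < 0.5
    accident = random.random() < (0.3 if coverage else 0.1)
    cov_m.append(1 if coverage else 0); acc_m.append(1 if accident else 0)

print(correlation(cov_a, acc_a), correlation(cov_m, acc_m))
```

Both correlations are positive and of similar magnitude, so a cross section alone cannot tell the two stories apart; panel data or randomized contract offers are needed, as discussed next.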
The wider availability of panel data will make it easier to disentangle moral hazard from adverse selection (as in [1]). Much is also to be learnt from field experiments that randomly vary the contract characteristics offered to individuals and hence disentangle moral hazard from adverse selection (as in [47]).
Contracts and Social Preferences
Although principal-agent theory has proved fundamental in expanding our understanding of contract situations, real-life contracts frequently do not exactly match its predictions. Many contracts are linear and simpler, incentives
are often stronger, and wage gaps more compressed, than expected. One possible explanation is that the theory has mainly focused on economic agents exclusively motivated by their own monetary incentives. However, this assumption leaves aside issues such as social ties, team spirit or work morale, which the human resources literature highlights. A recent strand of economic literature, known as “behavioral contract theory”, has tried to incorporate social aspects into the economic modeling of contracts. This theory has been motivated by two types of empirical support. On the one hand, extensive interview studies with firm managers and employees [13] have shown not only that agents care about social comparisons such as internal pay equity or effort diversity, but that their incentives to work hard are affected by them, and that principals are aware of this and design their contracts accordingly. On the other hand, one of the most influential contributions of the experimental literature has been to show that assuming that economic agents are not completely selfish (but exhibit some form of social preferences) helps organize many laboratory data. Experiments replicating labor markets (starting with [26]) confirm Akerlof’s [4] insight that contracts may be understood as a form of gift exchange, in which principals may offer a “generous” wage and agents may respond with more than the minimum effort required. Incorporating social and psychological aspects in a systematic manner into agents’ motivations has given rise to several forms of utility functions reflecting inequality aversion [14,24,27], fairness [75] and reciprocity [25]. More recently, such utility functions have been included in standard contract theory models and have helped narrow the gap between theoretical predictions and real-life contracts.
In particular, issues such as employees’ feelings of envy or guilt towards their bosses [45], utility comparisons among employees [35,78] or peer pressure motivating effort decisions [43,55] have proved important in widening the scope of issues principal-agent theory can help to understand.
Principal-Agent Markets
The literature has traditionally treated each principal-agent relationship as an isolated entity. Thus, it normally takes a given relationship between a principal and an agent (or among several principals and/or several agents) and analyzes the optimal contract. In particular, the principal assumes all the bargaining power, as she has the right to offer the contract she likes the most, and the agent’s payoff is determined by his exogenously given reservation utility. However, in markets there is typically not a single partnership but there
are several. It is then interesting to consider the simultaneous determination of both the identity of the pairs that meet (i. e., the matching between principals and agents) and the contracts these partnerships sign. The payoffs to each principal and agent will then depend on the other principal-agent relationships being formed in the market. This analysis requires a general equilibrium-like model. Game theory provides a very useful tool for studying markets where heterogeneous players from one side can form partnerships with heterogeneous players from the other side: two-sided matching models. Examples of classic situations studied in two-sided matching models (see [82,86]) are the marriage market, the college admissions model, and the assignment market (where buyers and sellers transact). Several papers extend these game-theoretic models to situations where each partnership involves contracts, and show that the simultaneous consideration of matching and contracts has important implications. Dam and Pérez-Castrillo [21] show that, in an economy where landowners contract with tenants, a government willing to improve the situation of the tenants may have an interest in creating wealth asymmetries among them; otherwise, the landowners would appropriate all the incremental money that the government is willing to provide to the agents. Serfes [85] shows that higher-risk projects do not necessarily lead to lower incentives, which is the prediction of standard principal-agent theory, and Alonso-Paulí and Pérez-Castrillo [6] apply the theory to markets where contracts (between shareholders and managers) can include Codes of Best Practice. On the empirical side, Ackerberg and Botticini [2] find strong evidence of endogenous matching between landlords and tenants, and that risk sharing is an important determinant of contract choice. Future research will extend the general equilibrium analysis of principal-agent contracts to other markets.
In addition, the literature has only studied one-to-one matching models. This should be extended to situations where each principal can hire several agents, or where each agent deals with several principals. The interplay between (external) market competition and (internal) collaboration between agents or principals can provide useful insights about the characteristics of optimal contracts in complex environments.
Bibliography
Primary Literature
1. Abbring J, Chiappori PA, Heckman JJ, Pinquet J (2003) Adverse Selection and Moral Hazard in Insurance: Can Dynamic Data Help to Distinguish? J European Econ Ass 1:512–521
2. Ackerberg DA, Botticini M (2002) Endogenous Matching and the Empirical Determinants of Contract Form. J Political Econ 110:564–592 3. Akerlof G (1970) The Market for ‘Lemons’: Quality Uncertainty and the Market Mechanism. Quarterly J Econ 84:488–500 4. Akerlof G (1982) Labor Contracts as a Partial Gift Exchange. Quarterly J Econ 97:543–569 5. Alchian A, Demsetz H (1972) Production, Information Costs, and Economic Organization. Am Econ Rev 62:777–795 6. Alonso-Paulí E, Pérez-Castrillo D (2007) Codes of Best Practice in Competitive Markets. Mimeo 7. Arrow K (1985) The Economics of Agency. In: Pratt J, Zeckhauser R (eds) Principals and Agents: The Structure of Business. Harvard University Press, Boston 8. Baron D, Besanko D (1984) Regulation and Information in a Continuing Relationship. Information Econ Policy 1:267–302 9. Baron D, Besanko D (1987) Commitment and Fairness in a Dynamic Regulatory Relationship. Rev Econ Studies 54:413–436 10. Baron D, Myerson R (1982) Regulating a Monopoly with Unknown Costs. Econometrica 50:911–930 11. Baron D, Besanko D (1992) Information, Control and Organizational Structure. J Econ Management Strategy 1(2):237–275 12. Bernheim BD, Whinston MD (1986) Common Agency. Econometrica 54:923–942 13. Bewley T (1999) Why Wages Don’t Fall During a Recession. Harvard University Press, Cambridge 14. Bolton G, Ockenfels A (2000) ERC: A Theory of Equity, Reciprocity and Competition. Am Econ Rev 90:166–193 15. Chiappori PA, Macho-Stadler I (1990) Contrats de Travail Répétés: Le Rôle de la Mémoire. Annales Econ Stat 17:47–70 16. Chiappori PA, Macho-Stadler I, Rey P, Salanié B (1994) Repeated Moral Hazard: The Role of Memory, Commitment and Access to Credit Markets. Eur Econ Rev 38:1527–1553 17. Chiappori PA, Jullien B, Salanié B, Salanié F (2006) Asymmetric Information in Insurance: General Testable Implications. Rand J Econ 37:783–798 18. Chiappori PA, Salanié B (2003) Testing Contract Theory: A Survey of Some Recent Work.
In: Dewatripont, Hansen, Turnovsky (eds) Advances in Economics and Econometrics, vol. 1. Cambridge University Press, Cambridge, pp 115–149 19. Clarke E (1971) Multipart Pricing of Public Goods. Public Choice 11:17–33 20. Da-Rocha-Alvarez JM, De-Frutos MA (1999) A Note on the Optimal Structure of Production. J Econ Theory 89:234–246 21. Dam K, Pérez-Castrillo D (2006) The Principal-Agent Matching Market. Frontiers Econ Theory. Berkeley Electro 2(1):1–34 22. Dana JD (1993) The Organization and Scope of Agents: Regulating Multiproduct Industries. J Econ Theory 59:288–310 23. Demski JS, Sappington D (1986) Line-item Reporting, Factor Acquisition and Subcontracting. J Accounting Research 24:250–269 24. Desiraju R, Sappington D (2007) Equity and Adverse Selection. J Econ Management Strategy 16:285–318 25. Dufwenberg M, Kirchsteiger G (2004) A Theory of Sequential Reciprocity. Games Econ Behavior 47:268–298 26. Fehr E, Kirchsteiger G, Riedl A (1993) Does Fairness Prevent Market Clearing? Quart J Econ 108:437–460 27. Fehr E, Schmidt K (1999) A Theory of Fairness, Competition and Cooperation. Quart J Econ 114:817–868
28. Finkelstein A, McGarry K (2006) Multiple Dimensions of Private Information: Evidence from the Long-term Care Insurance Market. Am Econ Rev 96:938–958 29. Freixas X, Guesnerie R, Tirole J (1985) Planning under Information and the Ratchet Effect. Rev Econ Studies 52:173–192 30. Fudenberg D, Holmström B, Milgrom B (1990) Short-term Contracts and Long-term Agency Relationships. J Econ Theory 51:1–31 31. Gibbard A (1973) Manipulation for Voting Schemes. Econometrica 41:587–601 32. Green JR, Laffont JJ (1977) Characterization of Satisfactory Mechanisms for the Revelation of Preferences for Public Goods. Econometrica 45:427–438 33. Grossman SJ, Hart OD (1983) An Analysis of the Principal-Agent Problem. Econometrica 51:7–45 34. Groves T (1973) Incentives in Teams. Econometrica 41:617–631 35. Grund C, Sliwka D (2005) Envy and Compassion in Tournaments. J Econ Manag Strateg 14:187–207 36. Guesnerie R, Laffont JJ (1984) A Complete Solution to a Class of Principal-Agent Problems with Application to the Control of a Self-Managed Firm. J Public Econ 25:329–369 37. Guesnerie R, Picard P, Rey P (1989) Adverse Selection and Moral Hazard with Risk Neutral Agents. Eur Econ Rev 33:807– 823 38. Harris M, Raviv A (1978) Some Results on Incentive Contracts with Applications to Education and Employment, Health Insurance and Law Enforcement. Am Econ Rev 68:20–30 39. Harris M, Raviv A (1979) Optimal Incentive Contracts with Imperfect Information. J Econ Theory 2:231–259 40. Holmström B (1979) Moral Hazard and Observability. Bell J Econ 10:74–91 41. Holmström B (1982) Moral Hazard in Teams. Bell J Econ 13:324– 340 42. Holmström B, Milgrom P (1991) Multitask Principal-Agent Analysis: Incentive Contracts, Assets Ownership, and Job Design. J Law Econ Organization 7(Suppl):24–52 43. Huck S, Rey-Biel P (2006) Endogenous Leadership in Teams. J Inst Theoretical Econ 162:1–9 44. Itoh H (1990) Incentives to Help in Multi-Agent Situations. Econometrica 59:611–636 45. 
Itoh H (2004) Moral Hazard and Other-Regarding Preferences. Jpn Econ Rev 55:18–45 46. Jensen M, Meckling W (1976) The Theory of the Firm, Managerial Behavior, Agency Costs and Ownership Structure. J Finan Econ 3:305–360 47. Karlan D, Zinman J (2006) Observing Unobservables: Identifying Information Asymmetries with a Consumer Credit Field Experiment. Mimeo 48. Klemperer P (2004) Auctions: Theory and Practice. Princeton University Press, Princeton 49. Laffont JJ (1990) Analysis of Hidden Gaming in a Three Levels Hierarchy. J Law Econ Organ 6:301–324 50. Laffont JJ, Tirole J (1986) Using Cost Observation to Regulate Firms. J Polit Econ 94:614–641 51. Laffont JJ, Tirole J (1987) Comparative Statics of the Optimal Dynamic Incentive Contract. Eur Econ Rev 31:901–926 52. Laffont JJ, Tirole J (1988) The Dynamics of Incentive Contracts. Econometrica 56:1153–1176 53. Laffont JJ, Tirole J (1990) Adverse Selection and Renegotiation in Procurement. Rev Econ Studies 57:597–626
54. Lambert R (1983) Long Term Contracts and Moral Hazard. Bell J Econ 14:441–452 55. Lazear E (1995) Personnel Economics. MIT Press, Cambridge 56. Macho-Stadler I, Pérez-Castrillo D (1993) Moral Hazard with Several Agents: The Gains from Cooperation. Int J Ind Organ 11:73–100 57. Macho-Stadler I, Pérez-Castrillo D (1998) Centralized and Decentralized Contracts in a Moral Hazard Environment. J Ind Econ 46:489–510 58. Malcomson JM, Spinnewyn F (1988) The Multiperiod Principal-Agent Problem. Rev Econ Stud 55:391–408 59. Martimort D (1996) Exclusive Dealing, Common Agency and Multiprincipals Incentive Theory. Rand J Econ 27:1–31 60. Maskin E, Riley J (1989) Optimal Multi-Unit Auctions. In: Hahn F (ed) The Economics of Missing Markets, Information, and Games. Oxford University Press, Oxford, pp 312–335 61. McAfee P, McMillan J, Reny P (1989) Extracting the Surplus in the Common Value Auction. Econometrica 57:1451–1459 62. Melumad N, Mookherjee D, Reichelstein S (1995) Hierarchical Decentralization of Incentive Contracts. Rand J Econ 26:654–672 63. Melumad N, Reichelstein S (1987) Centralization vs Delegation and the Value of Communication. J Account Res 25:1–18 64. Milgrom P, Weber RJ (1982) A Theory of Auctions and Competitive Bidding. Econometrica 50:1089–1122 65. Mirrlees J (1971) An Exploration in the Theory of Optimum Income Taxation. Rev Econ Studies 38:175–208 66. Mirrlees J (1975) The Theory of Moral Hazard and Unobservable Behavior, Part I. WP Nuffield College, Oxford 67. Mookherjee D (1984) Optimal Incentive Schemes with Many Agents. Rev Econ Studies 51:433–446 68. Mussa M, Rosen S (1978) Monopoly and Product Quality. J Econ Theory 18:301–317 69. Myerson R (1979) Incentive Compatibility and the Bargaining Problem. Econometrica 47:61–73 70. Myerson R (1981) Optimal Auction Design. Math Op Res 6:58–73 71. Myerson R (1982) Optimal Coordination Mechanisms in Generalized Principal-Agent Models. J Math Econ 10:67–81 72. Pauly MV (1968) The Economics of Moral Hazard.
Am Econ Rev 58:531–537 73. Pauly MV (1974) Overinsurance and Public Provision of Insurance: The Roles of Moral Hazard and Adverse Selection. Quart J Econ 88:44–62 74. Picard P (1987) On the Design of Incentive Schemes under Moral Hazard and Adverse Selection. J Public Econ 33:305–331 75. Rabin M (1993) Incorporating Fairness into Game Theory and Economics. Am Econ Rev 83:1281–1302 76. Radner R (1981) Monitoring Cooperative Agreements in a Repeated Principal-Agent Relationship. Econometrica 49:1127– 1148 77. Rey P, Salanié B (1990) Long term, Short term and Renegotiation. Econometrica 58:597–619 78. Rey-Biel P (2008) Inequity Aversion and Team Incentives. ELSE WP, Scandinavian J Econ 110:297–320 79. Riordan MH, Sappington DE (1987) Information, Incentives, and the Organizational Mode. Quart J Econ 102:243–263 80. Rogerson W (1985) Repeated Moral Hazard. Econometrica 53:69–76 81. Ross SA (1973) The Economic Theory of Agency: The Principal’s Problem. Am Econ Rev 63:134–139
82. Roth AE, Sotomayor M (1990) Two-Sided Matching: A Study in Game-Theoretic Modeling and Analysis. Cambridge University Press, New York 83. Rothschild M, Stiglitz J (1976) Equilibrium in Competitive Insurance Markets: An Essay in the Economics of Imperfect Information. Quart J Econ 90:629–650 84. Rubinstein A, Yaari ME (1983) Repeated Insurance Contracts and Moral Hazard. J Econ Theory 30:74–97 85. Serfes K (2008) Endogenous Matching in a Market with Heterogeneous Principals and Agents. Int J Game Theory 36:587–619 86. Shapley LS, Shubik M (1972) The Assignment Game I: The Core. Int J Game Theory 1:111–130 87. Spence M (1974) Market Signaling. Harvard University Press, Cambridge 88. Spence M, Zeckhauser R (1971) Insurance, Information, and Individual Action. Am Econ Rev 61:380–387 89. Stole L (1991) Mechanism Design under Common Agency. WP MIT, Cambridge 90. Tirole J (1986) Hierarchies and Bureaucracies: On the Role of Collusion in Organizations. J Law Econ Organ 2:181–214 91. Vera-Hernandez M (2003) Structural Estimation of a Principal-Agent Model: Moral Hazard in Medical Insurance. Rand J Econ 34:670–693
92. Vickrey W (1961) Counterspeculation, Auctions and Competitive Sealed Tenders. J Finance 16:8–37 93. Vickrey W (1962) Auction and Bidding Games. In: Recent Advances in Game Theory. The Princeton University Conference, Princeton. pp 15–27
Books and Reviews
Hart O, Holmström B (1987) The Theory of Contracts. In: Bewley T (ed) Advances in Economic Theory, Fifth World Congress. Cambridge University Press, Cambridge
Hirshleifer J, Riley JG (1992) The Analytics of Uncertainty and Information. Cambridge University Press, Cambridge
Laffont JJ, Martimort D (2002) The Theory of Incentives: The Principal-Agent Model. Princeton University Press, Princeton
Laffont JJ, Tirole J (1993) A Theory of Incentives in Procurement and Regulation. MIT Press, Cambridge
Macho-Stadler I, Pérez-Castrillo D (1997) An Introduction to the Economics of Information: Incentives and Contracts. Oxford University Press, Oxford
Milgrom P, Roberts J (1992) Economics, Organization and Management. Prentice-Hall, Englewood Cliffs
Probability Densities in Complex Systems, Measuring
Probability Densities in Complex Systems, Measuring
GUNNAR PRUESSNER
Department of Mathematics, Imperial College London, London, UK
Article Outline Glossary Definition of the Subject Introduction Simulation Techniques Scaling Histogram Data Representation Future Directions Bibliography
Glossary
Apparent exponent The apparent exponent is the power-law exponent of the probability density function $P(s; s_c)$ in the scaling region. The scaling region is the range of event sizes s which is closest to a straight line in a double-logarithmic plot of $P(s; s_c)$, the intermediate range $s_0 \ll s \ll s_c$.
Consistent estimator An estimator is consistent if it converges to the quantity estimated (the expectation value of the population) as the sample size is increased. For example, the lack of independence of consecutive measurements generated in a numerical simulation can render an estimator inconsistent.
Corrections to scaling In general, pure power-law behavior is found in an observable only to leading order, for example $\langle s^2 \rangle = a L^{\alpha} + b L^{\beta} + \dots$ with $\alpha > \beta$. While the sub-leading terms can have great physical relevance, more emphasis is normally given to the leading order. The quality of the data analysis when determining the leading order can improve significantly by allowing for correction terms. The exponents found in these correction terms are usually expected to be universal as well.
Correlation time Fitting the autocorrelation function of an observable to an exponential $\exp(-t/\tau)$ produces the correlation time $\tau$. Although correlations are in general more complicated than the single exponential suggests, the standard deviation of the estimator of the nth moment from N measurements is often estimated to be $\sqrt{(2\tau+1)/N}$ times the estimated standard deviation of the nth moment,
$$\sigma^2\left(\overline{s^n}\right) = \frac{2\tau+1}{N}\,\sigma^2\left(s^n\right), \qquad (1)$$
as if the number of independent measurements was only $N/(2\tau+1)$.
Estimator A numerical estimator is any function that provides an estimate from the sample, that is, the set of all measurements taken. A good estimator is unbiased, consistent and efficient. Very often, such an estimator coincides with the definition of the observable as taken from the exact distribution, for example $\overline{s^2} = \frac{1}{N}\sum_{i}^{N} s_i^2$ for estimating the second moment from an uncorrelated sample $s_1, s_2, \dots, s_N$, with exact value $\langle s^2 \rangle = \int \mathrm{d}s\, s^2 P(s; s_c)$. However, generally, a function of observables to be estimated is not well estimated by taking the function of the estimates. For example, the square of the first moment $\langle s \rangle^2$ is not well estimated by the numerical estimate $\overline{s}^2 = (\sum_i s_i / N)^2$, as this estimator would be biased.
Finite size scaling Observables in complex systems that display a power-law dependence on a parameter often diverge in the thermodynamic limit. In finite systems they remain finite and their value is expected to diverge as a power law of the system size. The relation between the value of the observable and the system size is known as “finite size scaling” (abbreviated FSS).
Gap exponent Given that a system displays scaling, the exponents $\sigma_n$ characterizing the dependence of the nth moment on a parameter, such as the system size in the case of finite size scaling, often are linear in n. The gap exponent is the gap between consecutive moments, $\sigma_{n+1} - \sigma_n$.
Importance sampling Importance sampling is a numerical technique to bias the frequency with which configurations are generated, so that states of greater importance, e. g. large observables, are generated more often than other, less important ones. Using a Markov chain to generate the states of an Ising model, as opposed to generating them at random and applying a Boltzmann–Gibbs weight, can be regarded as a form of importance sampling.
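The error estimate of Eq. (1) is easy to apply in practice. The sketch below, a non-authoritative illustration, estimates the correlation time as the integrated autocorrelation time (a common stand-in for the exponential fit mentioned above, truncated where the noisy tail first turns negative) and inflates the naive standard error accordingly; all function names are illustrative.

```python
import numpy as np

def correlation_time(x, t_max=100):
    """Estimate tau as the sum of the normalized autocorrelation
    function over positive lags, truncated at the first negative
    value. For uncorrelated data tau is close to 0, so the factor
    2*tau + 1 in Eq. (1) reduces to 1."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.mean(x * x)
    tau = 0.0
    for t in range(1, min(t_max, len(x))):
        c = np.mean(x[:-t] * x[t:]) / var
        if c < 0.0:
            break
        tau += c
    return tau

def corrected_error(x, tau):
    """Standard error of the mean, inflated by sqrt(2*tau + 1) as in
    Eq. (1): only about N/(2*tau + 1) measurements are independent."""
    n = len(x)
    return np.sqrt((2.0 * tau + 1.0) / n) * np.std(x, ddof=1)
```

Applied to a Markov-chain time series, `corrected_error` gives a more honest error bar than the naive $\sigma/\sqrt{N}$, which underestimates the uncertainty whenever consecutive measurements are correlated.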
Lower cutoff The lower cutoff in a probability density function of event sizes in complex systems is the value of the event size above which the distribution displays universal behavior. Below this value the system is governed by microscopic details. For many systems, the probability density functions for different system sizes coincide for event sizes below the lower cutoff.
Markov chain Monte Carlo A Monte Carlo technique whereby configurations are generated by transforming the state of the system according to a transition probability. The stationary probability distribution of the different states corresponds to the target probability distribution, i. e. the distribution to be modeled. In complex systems, Markov chain Monte Carlo (abbreviated MCMC) is the natural method to study a model: configurations of the system are generated with the frequency with which they occur in the exact distribution.
Moment analysis In general it is difficult to identify and quantify scaling behavior in probability density functions. The most general analysis is a data collapse, the quality of which is not easily determined. The most widely used method to determine scaling exponents and moment ratios of the universal scaling function is therefore a moment analysis. Based on the scaling assumption for the probability density function, moments scale as a power of the upper cutoff, with amplitudes certain ratios of which are universal.
Monte Carlo Technique to calculate numerical estimates for expectation values in stochastic models by generating configurations at random. More generally, Monte Carlo (abbreviated MC) is a stochastic technique to numerically integrate a high-dimensional integral, here corresponding to the calculation of an expectation value by integrating over all degrees of freedom constituting the phase space of the system.
Parameter space and phase space The parameter space of a complex system is the space spanned by all parameters of the model, such as the system size and the couplings of the interacting agents. A numerical study usually aims to probe the model throughout a large part of the parameter space. The number of parameters therefore needs to be as small as possible; often only a single parameter exists. Leaving the parameters fixed, an individual numerical simulation samples the phase space available to the system. The phase space is the set of all possible configurations or states of the model. This space is very high dimensional and it is virtually impossible to sample it homogeneously.
A numerical simulation relies on the assumption that the sample taken is nevertheless sufficiently representative to allow for reliable estimates.
Stationary distribution and transient Most complex systems studied possess a limiting distribution, i. e. the probability density distribution in phase space converges. This is the stationary distribution. A free random walker, for example, does not possess a stationary distribution, while a random walker in a harmonic potential does. Due to correlations, the probability distribution of states after any finite time generally depends on the initial condition, which is therefore often chosen to be random. The measurements discarded due to these correlations are called the transient.
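The moment analysis and finite size scaling introduced in this glossary amount, to leading order, to log-log fits of moments against the system size. A minimal sketch with synthetic data standing in for simulation estimates (the amplitudes and exponents below are made up purely for illustration):

```python
import numpy as np

# Hypothetical moment estimates for a range of system sizes L;
# in a real study these numbers come from simulations.
L = np.array([32, 64, 128, 256], dtype=float)
sigma1_true, sigma2_true = 2.0, 4.25        # assumed scaling exponents
m1 = 1.3 * L**sigma1_true                   # <s>   ~ L^sigma_1
m2 = 0.7 * L**sigma2_true                   # <s^2> ~ L^sigma_2

# Leading-order exponents sigma_n from log-log fits:
sigma1 = np.polyfit(np.log(L), np.log(m1), 1)[0]
sigma2 = np.polyfit(np.log(L), np.log(m2), 1)[0]
gap = sigma2 - sigma1                       # estimate of the gap exponent
```

With real data the fitted exponents carry statistical errors and corrections to scaling, so the fit is usually repeated while discarding the smallest system sizes.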
Unbiased estimator An estimator is unbiased if the population average of the estimator is independent of the sample size. For example, $\overline{s} = \sum_{i}^{N} s_i / N$ from an uncorrelated sample $s_1, \dots, s_N$ is an unbiased estimator of the expected first moment. Estimating the variance of s as $\sigma^2(s) = \overline{s^2} - \overline{s}^2$, however, is biased, because the population mean of $\overline{s^2} - \overline{s}^2$ is $((N-1)/N)\left(\langle s^2 \rangle - \langle s \rangle^2\right)$, which depends on the sample size N.
Upper cutoff The upper cutoff is the characteristic scale of the universal part of the event size distribution. It is a measure of the event size at which the scaling function of the event size distribution breaks down. Moments $\langle s^n \rangle$ with sufficiently large n are to leading order a power of the upper cutoff. The upper cutoff itself is expected to be a power law of the system parameter, i. e. of the system size in the case of finite size scaling. The exponent controlling the relation between the upper cutoff and the system size is the gap exponent.
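The bias of the naive variance estimator can be checked directly by averaging it over many independent samples; the following sketch reproduces the factor $(N-1)/N$ numerically (sample size and number of repetitions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N, runs = 5, 200_000

# Many independent samples of size N from a unit-variance distribution:
s = rng.normal(size=(runs, N))

# Naive variance estimator for each sample:
naive = (s**2).mean(axis=1) - s.mean(axis=1) ** 2

# Its population mean is ((N - 1)/N) times the true variance,
# here close to 0.8 rather than 1:
bias_factor = naive.mean()

# Bessel's correction removes the bias:
unbiased = naive * N / (N - 1)
```

The correction matters most for small N; for the large samples typical of complex-systems simulations the factor $(N-1)/N$ is negligible, but correlated measurements reintroduce a bias that is not cured by Bessel's correction.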
Definition of the Subject
Observables in complex systems usually take on a broad range of values. They are random variables of a stochastic process and the system explores a wide phase space. The observables are therefore characterized by a probability density function (PDF), representing the probability (density) of finding the complex system in a state with a particular value of the observable. As in other areas of statistical mechanics, complex systems are often studied in computer simulations, most prominently Monte Carlo [1,2,3], and the probability density function is recorded in the form of a histogram, which frequently has power-law asymptotes. Historically, the probability density function itself plays a dominant rôle in the characterization of complex systems, while more recently derived quantities, in particular moments, are used more frequently.
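Because such histograms have power-law asymptotes, they are commonly accumulated or re-binned on logarithmically spaced bins, which smooths the noisy large-event tail. A minimal sketch of such a binning routine (the function name and the choice of five bins per decade are illustrative, not a standard):

```python
import numpy as np

def log_binned_histogram(sizes, bins_per_decade=5):
    """Estimate the PDF of event sizes on logarithmically spaced bins.
    Counts are divided by the (linear) bin width and the sample size,
    so the result is a normalized density estimate."""
    sizes = np.asarray(sizes, dtype=float)
    lo, hi = sizes.min(), sizes.max()
    n_bins = max(1, int(np.ceil(np.log10(hi / lo) * bins_per_decade)))
    edges = np.logspace(np.log10(lo), np.log10(hi), n_bins + 1)
    counts, edges = np.histogram(sizes, bins=edges)
    widths = np.diff(edges)
    centers = np.sqrt(edges[:-1] * edges[1:])   # geometric bin centers
    density = counts / (widths * len(sizes))    # normalized PDF estimate
    return centers, density
```

Plotting `density` against `centers` on double-logarithmic axes reveals the power-law region far more clearly than a raw, linearly binned histogram.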
Introduction
There are three main stages in the analysis of the PDF of a complex system. The first step is to generate, in a computer simulation, estimates of the PDF itself or of derived quantities, such as moments. This is a general, technical problem considered in computational physics. In a second step, the expected behavior of the observables is to be derived, a process that often feeds back into the stage of data generation, as it might suggest new observables to be estimated. Thirdly, the estimates are to be analyzed and compared to the expected behavior.
The following material highlights the main aspects of these steps. In practice, the first and the second step go hand in hand and might even change places, but as the simulation techniques are so clearly distinct from the theoretical expectations and the data analysis, which, in turn, are closely related, simulation techniques are presented first. The following subsection describes a key example in complexity which has been analyzed in great detail using the techniques described in this article. The subsequent sections present some principal ideas and techniques from the three areas introduced above.
Bak–Tang–Wiesenfeld Model
The Bak–Tang–Wiesenfeld (BTW) Model [4,5] is the first model of Self-Organized Criticality, studied because of the peculiar features of the probability density function of its key observable. The BTW model is a cellular automaton, sometimes also called a “stochastic cellular automaton” when it is updated in a random fashion. The BTW model evolves as follows: in two dimensions, height variables $z_i$, which take integer values and are set to 0 initially, are assigned to every site i of a square lattice of size $L \times L$. Whenever a height variable exceeds a certain critical value, its value is decreased by 4 and the variables of the surrounding four nearest neighboring sites are incremented by 1. These rules are modified for sites on the open boundary, where particles are lost. To drive the model, the height variable at either a single specific site or a randomly chosen site is increased by 1, and the relaxation rules described above are applied until none of the sites exceeds the critical value. This gives rise to an “avalanche” of size s, measuring the number of times the updating rule has been applied. The histogram $\overline{P}(s)$, calculated by repeating the above procedure millions of times, measures the frequency with which a certain avalanche size has been observed. It represents a numerical estimate of the exact, discrete probability function $\langle P(s) \rangle$.
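A minimal, unoptimized sketch of these toppling rules in Python; the critical value is taken here as $z_c = 3$ (an assumption, so that a site topples once its height exceeds 3), and all names are illustrative:

```python
import numpy as np

def btw_avalanche(z, i, j, zc=3):
    """Add one grain at site (i, j) and relax the lattice; return the
    avalanche size, i.e. the number of topplings. A site with z > zc
    topples: z decreases by 4 and each nearest neighbor gains 1.
    Grains pushed across the open boundary are lost."""
    L = z.shape[0]
    z[i, j] += 1
    size = 0
    stack = [(i, j)]
    while stack:
        x, y = stack.pop()
        if z[x, y] <= zc:
            continue
        z[x, y] -= 4
        size += 1
        if z[x, y] > zc:              # may have to topple again
            stack.append((x, y))
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < L and 0 <= ny < L:
                z[nx, ny] += 1
                if z[nx, ny] > zc:
                    stack.append((nx, ny))
    return size

# Drive at randomly chosen sites and record the avalanche sizes:
rng = np.random.default_rng(0)
L = 16
z = np.zeros((L, L), dtype=int)
sizes = [btw_avalanche(z, rng.integers(L), rng.integers(L))
         for _ in range(20000)]
```

The list `sizes` is the raw material for the histogram described above; production runs use much larger lattices and far more iterations, and typically a faster language or vectorized relaxation.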
In the following, an over-line as in $\overline{P}(s)$ indicates a numerical estimate, while brackets, as in $\langle P(s) \rangle$, indicate the exact population average. Strictly speaking, the distribution $\langle P(s) \rangle$ is usually inaccessible, because even a finite complex system has so many different states, and therefore so many different outcomes, some with very small frequencies, that it is virtually impossible to calculate the distribution exactly or to sample all outcomes with the exact weight. As shown in Fig. 1, the PDF of the avalanche size has a very long power-law tail. This is typical for complex systems and distinguishes them from what is often found in simpler problems. Most importantly, the PDF deviates
from a simple Gaussian, which is, by the central limit theorem [6], the expected PDF if the observable were a sum or an average of many independent random variables. The lack of convergence to a Gaussian therefore indicates the presence of correlations, the hallmark of a complex system. The aim of a numerical estimate and analysis of the PDF of a complex system is to characterize it quantitatively in the form of exponents, moment ratios and other features, in order to compare it to other complex systems. To improve the numerical results, various simulation and analysis techniques are employed, many of which aim to minimize the amount of CPU time spent in the simulation. In the following, some standard simulation techniques are described. These methods have been developed since the early 1950s [2] and represent a branch of numerical analysis in their own right. The data analysis is based on the sample, i. e. the set of measurements produced in numerical simulations, the subject of the following section. The analysis can either focus on the PDF itself or on its moments, revealing different features of the system.
Simulation Techniques
In general, the complex system to be studied numerically models a more complicated phenomenon observed in nature, in the economy, in society etc. Typically, such a model is composed of a large number of interacting elements, which might be divisible into different classes. These elements might be agents, particles, organisms, locations etc., and the classes might represent different rôles or species. Depending on the class, the type of interaction might be very different. Only in very rare cases does the model strictly reflect the original natural, economic or social observation it has been derived from. A complete representation of the original problem, even if possible, might be numerically intractable because of too large a number of parameters characterizing the model. These parameters generally are couplings, i. e.
they parametrize the interaction between the individual elements, and generally stay fixed during the simulation. The immediate aim of the simulation is to determine observables as functions of the parameters. Only if the number of parameters is small enough can the parameter space be covered sufficiently densely to allow for a reliable statement about their various rôles and effects. The parameter space is to be distinguished from the phase space, which is the space or set of possible configurations of the system. Here, a configuration or state is the (smallest) set of values describing the system in such a way that they suffice at any point to prescribe the further evo-
2269
2270
Probability Densities in Complex Systems, Measuring
Probability Densities in Complex Systems, Measuring, Figure 1 BTW simulation of a system of size 256 256, with 1 107 iterations used in the transient and 2 107 iterations used for statistics: a Raw histogram, b binned histogram, (dashed line shows exponent 1.1) c rescaled histogram, using D 1:22 and d (attempt) of a data collapse including very small system sizes, L D 32; 64; 128; 256. The different ways to generate, process, present and analyze the raw data are discussed in this article
lution of the system. Each of these values is associated with a degree of freedom in phase space. Reducing the degrees of freedom generally has little impact on the quality of the numerical estimates. Even the simplest computer models usually have so many states that it becomes virtually impossible to calculate any property by realizing all such states. For example, a two dimensional square lattice of size 10 10 sites, which each can be in one out of two states, has 2100 possible states and it would take thousands of billions of years to visit each state, even if a billion states could be realized every second. Such a comprehensive study of this rather small system therefore is out of scope. Moreover, the original process to be modeled, realized in nature or society, cannot be thought of being appropriately represented in this form. Monte Carlo is by far the most widely used technique for measuring probability densities in a complex system. Other approaches, in particular deterministic molecular dynamics, are far less popular. This section focuses on the practice of Monte Carlo and various standard techniques used in conjunction with it.
Monte Carlo Methods

In its most general form, a Monte Carlo algorithm estimates an expectation value of an observable by averaging over a random sample. In the following, this is illustrated for a problem with discrete configurations ω, each of which describes the state of the entire system. In the case of interacting particles at sites, this would be a set of numbers representing the state of every individual site. Assuming the exact, but usually unknown, probability of state ω to be P(ω), which in systems with continuous degrees of freedom is replaced by a PDF, the expectation value of an observable A(ω) is

⟨A⟩ = Σ_{{ω}} P(ω) A(ω)   (2)

where the sum runs over the set {ω} of all states. The exact expectation value can be approximated by the mean \bar{A} over a (small) random sample of N configurations
ω_1, ω_2, …, ω_N,

\bar{A} = (1/N) Σ_{i=1}^{N} A(ω_i)   (3)

if the configurations ω_i occur with probability P(ω). By the (weak) law of large numbers, the estimate \bar{A} converges to the exact value ⟨A⟩ in the limit of a large sample size N. In most classical models of statistical mechanics, the probability P(ω) is known exactly and often corresponds to the Boltzmann–Gibbs weight. In some rare cases, such as percolation [7], all configurations of the system are equally important and are therefore generated easily, at random and independently. For a much greater class of problems, for example in the case of the Ising model [8], the configurations differ greatly in statistical weight, and it is a hard computational problem to generate them in a sensible way. In these cases, Monte Carlo sampling is mainly concerned with finding important configurations, which is often achieved by constructing a pseudo-dynamical process that generates the configurations with the correct probabilities. In complex systems, on the other hand, usually the dynamics is given while the probability is unknown, so that the first concern is to produce measurements with the correct probability (density). However, both issues lead to the same solution, namely Markov Chain Monte Carlo, which can be regarded as a form of importance sampling. Importance sampling is a method for producing measurements with a frequency optimized to reduce the number of measurements required to estimate an expectation value within a prescribed error. The sampling method therefore is, ideally, adapted to the observable. In the vast majority of numerical simulations of complex systems, however, the sampling scheme used corresponds to the simplest form of a Markov chain, where the configuration of the system evolves from one stage to another in discrete steps given by the dynamics, so that the frequency with which a configuration occurs during the simulation converges to the exact probability of that configuration [9].

The Markov condition means that the probability to observe a certain configuration of the system depends only on the directly preceding configuration. By including a sufficient amount of information in the configuration, in principle every process can be rendered Markovian. For example, while the sequence of particle coordinates in a classical gas is not Markovian, the same sequence together with the particles’ momenta is. More sophisticated approaches bias the sampling so that regions in phase space which have a greater contribution A(ω)P(ω) to the expectation value than other regions are visited more frequently, which is discussed further in Subsect. “Rare Events”.
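Where configurations can be generated independently with the correct probability, as in percolation, the estimator (3) reduces to a plain average over directly sampled configurations. A minimal sketch in C; the toy model, observable and parameters are invented purely for illustration:

```c
/* Sketch of direct (simple) sampling, assuming an invented toy model: each
 * of L sites is independently occupied with probability p, and the
 * observable A counts occupied sites, so the exact expectation is <A> = p*L. */
#include <stdlib.h>

#define L 100                          /* sites per configuration */

double direct_sample_mean(double p, long N, unsigned int seed)
{
    long threshold = (long)(p * (double)RAND_MAX);
    double sum = 0.0;
    srand(seed);
    for (long i = 0; i < N; i++) {     /* N independent configurations */
        int A = 0;
        for (int j = 0; j < L; j++)
            if (rand() < threshold)    /* site occupied with probability p */
                A++;
        sum += A;                      /* accumulate A(omega_i), Eq. (3) */
    }
    return sum / (double)N;
}
```

With p = 0.3 and N = 10000 samples, the returned estimate should lie close to the exact value ⟨A⟩ = 30, with a statistical error of about 0.05.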
Monte Carlo Applied to Complex Systems

In the following, it will be assumed that the complex system has its dynamics given in the form of a set of rules specifying how the model evolves from one configuration to another. Furthermore, it will be assumed that these rules involve some randomness, so that a given starting configuration does not necessarily lead to one particular configuration in the next step. The evolution of the system is subject to certain transition probabilities encoded in the transition matrix W(ω; ω′), which gives the probability to go from configuration ω′ to configuration ω. This leads to the notion of the system’s trajectory in phase space, which is a random variable, traced out by application of the transition matrix in every Monte Carlo time step or update. Markov Chain Monte Carlo uses the phase space trajectory to sample the phase space in a fair fashion. By construction, consecutive configurations are generally correlated, i. e. the probability to find the system in a certain configuration depends on the preceding configuration. Given the Markovian nature of the process, the probability depends only on the previous configuration; nevertheless, correlations can be very long-lived, because the previous configuration’s probability depends in turn on its preceding configuration, and so on. The correlation time is a measure for the number of updates over which these correlations decay, the numerical consequences of which are discussed in Subsect. “Correlation Time”. At the beginning of the simulation the system is initialized in a particular form, which in practice is often a rather artificial state. Due to correlations, the influence of the initial condition can last for very many updates. The Markov-chain approach relies on the insight that the sampling frequency after a sufficiently long transient converges to the stationary distribution P(ω), which solves for all states [6]

0 = Σ_{{ω′}} [ P(ω′) W(ω; ω′) − P(ω) W(ω′; ω) ]   (4)

where W(ω; ω′) is the transition probability from ω′ to ω introduced above. Unless relaxation processes are to be studied, the stationary regime is sampled after a sufficiently long transient, which is ignored, i. e. configurations generated during the transient do not enter the numerical estimates. It is general practice to save the configuration of a system regularly at later stages, which might subsequently be used as a starting point for other trajectories in phase space. This method also allows a systematic study of
the influence of the transient. Eq. (4) is to be distinguished from detailed balance,

0 = P(ω′) W(ω; ω′) − P(ω) W(ω′; ω)   (5)

which solves (4) term by term. Detailed balance is the hallmark of equilibrium thermodynamics and has far-reaching consequences for the topology of the network spanned by the configurations and the possible transitions between them [10].

Poisson Processes

Many complex systems consist of a set of concurrent Poisson processes, for example particles at sites that decay with one particular rate and interact with another rate. Maintaining a list of all possible processes and treating the fastest as deterministic is a widely adopted approach. If n processes with rates r_1, …, r_n are possible, with the fastest having the largest rate e, one process, say i, out of the n is picked at random with uniform probability, performed with probability r_i/e [11], and the continuous time is incremented by 1/(en). Within time T, on average enT picks are made, each process is selected eT times and performed r_i T times, which is precisely the average number of times the Poissonian event should occur. This implementation therefore correctly reproduces the averages. On the other hand, an event controlled by a Poisson process can take place an arbitrary number of times within any finite time-span, which this implementation cannot reproduce. If correct time-keeping is expected to play an important rôle, as for example in the case of temporal correlations, the time increment itself is to be drawn randomly from a waiting time distribution [6,12].

Rare Events

In the presence of rare events [13], reliably estimating expectation values might become impossible. The problem affects bounded and unbounded observables, although an unbounded observable is potentially more prone to this phenomenon. The problem is caused by events that occur only very rarely, but for which the observable is particularly large.
If, for example, an event occurs with frequency f = 10^{−6} and the observable A there has the value A = 10^4 a, with a being the average of A over all other realizations, then including the event changes the estimate of the average of A only by f A/a = 10^{−2}. Assuming that the second moment over all other events is a², including the rare event shifts the estimate of the second moment by f A²/a² = 10^2. An even more dramatic effect is expected for higher moments, the details of which are discussed in Sect. “Scaling”. Anticipating Subsect. “Moment Analysis” somewhat, the difficulty of estimating higher moments reliably lies in their higher demand of independent measurements (also
discussed in Subsect. “Estimators”). To prevent systematic underestimation, at least a certain fraction of measurements must be taken from above ⟨s^n⟩^{1/n}. The number of measurements required increases significantly with n and is further increased in cases of long time correlations and pathological distributions. In the presence of rare events, more sophisticated importance sampling techniques have to be used [14,15] to calculate reliable estimates. In financial applications, rare events are often discussed in the context of so-called “fat tails” [16], which refer to PDFs that are close to a normal distribution but still have significant support at about five to ten standard deviations away from the mean. Fat tails are sometimes characterized by power laws, which are often the focus of the analysis of complex systems, as discussed in this article. Rare events are sometimes associated [17] with multiscaling, a form of power law scaling to be discussed in Subsect. “Gap Scaling Versus Multiscaling”.

Random Number Generators

The simulation of a complex system that is essentially discrete and can be represented by integers usually spends a significant fraction of its time producing (pseudo) random numbers. If, on the other hand, the model is essentially continuous, i. e. requires floating point numbers and involves functions from the mathematical library, those are likely to consume most of the CPU-time. This observation suggests two paths of optimization: Firstly, floating point operations should be avoided as much as possible. For example, as most random number generators produce random integers, a frequent comparison of the form rand()/RAND_MAX < q, where q is a constant floating point number, should be replaced by rand() < i, where i is q rescaled by a factor RAND_MAX.
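A sketch of this replacement (the function names are invented for illustration); the integer threshold is computed once, so that each decision costs a single integer comparison instead of a floating point division:

```c
/* Sketch: a Boolean that is true with probability q, avoiding the floating
 * point expression rand()/RAND_MAX < q in the inner loop by precomputing
 * the integer threshold q*RAND_MAX once. */
#include <stdlib.h>

static long threshold;                 /* q rescaled by RAND_MAX */

void set_probability(double q)
{
    threshold = (long)(q * (double)RAND_MAX);
}

int biased_coin(void)                  /* 1 with probability (about) q */
{
    return rand() < threshold;
}
```

The rescaling introduces a relative error of at most about 1/RAND_MAX in the probability, which is negligible in practice.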
Similarly, Boolean random variables, true with probability 1/2, should be replaced by random bits, obtained using a bit-mask on an integer random variable; on many platforms the mask is shifted most efficiently by adding it to itself. The second obvious path of optimization addresses the (pseudo) random number generator itself. All modern random number generators, such as ran2 and others from [18], various generators discussed in [19] or the Mersenne Twister [20], are of comparable quality and usually have passed all standard tests [21,22]. Their periods, that is, the number of pseudo random numbers generated until the sequence repeats, are usually long enough for modern requirements. Different generators differ mainly in CPU-time consumption and general acceptance. Linear congruential random number generators and some of those found in the Standard C Library, on the other hand, generally lack the quality required in modern computer simulations. They often fail standard tests, yet some of them are very fast. Different numerical simulations are independent only if the random number sequences used are independent. Only very few random number generators, mostly those designed for use in parallelized simulations, guarantee the independence of sequences for different seeds, which in some cases have to be generated separately. It is good practice to seed the random number generator in a controlled way that ensures that none of the seeds is used more than once.

Estimators

Similar to experiments, numerical simulations suffer from different sources of error, many of which can be reliably estimated or even cured. In a Monte-Carlo simulation, expectation values are estimated using an estimator; for example, the estimator of the nth moment

⟨s^n⟩ = ∫ ds s^n P(s)   (6)

from a simulation producing a sequence s_1, s_2, …, s_N is

\overline{s^n} = N^{−1} Σ_{i=1}^{N} s_i^n .   (7)
For the sake of the argument, in the following it will be assumed that all moments exist. An estimator is called unbiased if the expectation value of the estimator coincides with the expectation value of the object to be estimated, independent of the sample size N. Provided the sequence s_1, …, s_N introduced above was taken from the stationary regime, the above example, Eq. (7), estimating ⟨s^n⟩, is unbiased, because ⟨\overline{s^n}⟩ = ⟨s^n⟩, since ⟨s_i^n⟩ = ⟨s^n⟩ for every element of the sequence. On the other hand, (\overline{s^n})² is not an unbiased estimator of ⟨s^n⟩², even if the elements of the sequence are independent. In this case,

⟨(\overline{s^n})²⟩ = ⟨s^n⟩² + N^{−1} ( ⟨s^{2n}⟩ − ⟨s^n⟩² )   (8)

which has an explicit dependence on N unless the variance of s^n vanishes. Consequently, the unbiased estimator of the variance of s^n, denoted σ²(s^n), is

σ²(s^n) = N/(N − 1) ( \overline{s^{2n}} − (\overline{s^n})² )   (9)

assuming mutual independence of the measurements. An estimator is called consistent if it converges to the quantity to be estimated as the sample size is increased. The law of large numbers ensures this property for simple means; however, more complicated objects to be estimated might not possess an estimator that is obviously consistent.
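For n = 1, Eqs. (7) and (9) translate directly into code; a minimal sketch:

```c
/* Sketch: sample mean, Eq. (7), and the unbiased variance estimator,
 * Eq. (9), sigma^2(s) = N/(N-1) * (mean of s^2 - (mean of s)^2), for n = 1,
 * assuming mutually independent measurements s[0..N-1]. */
double sample_mean(const double *s, long N)
{
    double sum = 0.0;
    for (long i = 0; i < N; i++)
        sum += s[i];
    return sum / (double)N;
}

double unbiased_variance(const double *s, long N)
{
    double m = sample_mean(s, N), sum2 = 0.0;
    for (long i = 0; i < N; i++)
        sum2 += s[i] * s[i];
    return (double)N / (double)(N - 1) * (sum2 / (double)N - m * m);
}
```

For the sample 1, 2, 3, 4, 5 this yields the mean 3 and the unbiased variance 2.5, the familiar Bessel-corrected result.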
Numerical Error

The central aim of a numerical simulation is to produce, efficiently, large sample sizes with small or even vanishing correlation time, estimating observables using unbiased and consistent estimators. The quality of a numerical estimate, i. e. its error, is usually measured by means of the standard deviation of the estimator, which, in turn, is to be estimated as well. The standard deviation measures the width of the Gaussian describing the expected normal distribution of the estimator. The standard deviation, σ(), is defined as the square root of the variance σ²(). The variance of the estimator of ⟨s^n⟩, introduced in Eq. (7), is

σ²(\overline{s^n}) = ⟨(\overline{s^n})²⟩ − ⟨\overline{s^n}⟩² = N^{−1} ( ⟨s^{2n}⟩ − ⟨s^n⟩² ) = N^{−1} σ²(s^n)   (10)

using Eq. (8) and assuming that the sample is uncorrelated. A naïve estimator of this variance is N^{−1} ( \overline{s^{2n}} − (\overline{s^n})² ), which is, however, a biased estimator, because of Eq. (8):

⟨ N^{−1} ( \overline{s^{2n}} − (\overline{s^n})² ) ⟩ = (N − 1)/N · N^{−1} σ²(s^n)   (11)

so that the unbiased estimator for the variance of the estimator of ⟨s^n⟩ is in fact

σ²(\overline{s^n}) = (N − 1)^{−1} ( \overline{s^{2n}} − (\overline{s^n})² ) .   (12)

When estimating functions of means, such as the variance

f( ⟨s⟩, ⟨s²⟩ ) = ⟨s²⟩ − ⟨s⟩²   (13)

even finding an unbiased estimator might be a difficult task. The most natural choice is of course f( \bar{s}, \overline{s²} ), but, as pointed out above, Eq. (9), this choice is biased. The task of finding an estimator for the variance of the estimator might be even more problematic. The most naïve approach is to assume independence and approximate the variance by error propagation,

σ²(f) ≈ σ²(\bar{s}) (∂f/∂s)²|_{⟨s⟩,⟨s²⟩} + σ²(\overline{s²}) (∂f/∂s²)²|_{⟨s⟩,⟨s²⟩} .   (14)

A first attempt to include correlations is

σ²(f) ≈ σ²(\bar{s}) (∂f/∂s)² + σ²(\overline{s²}) (∂f/∂s²)² + 2 covar(\bar{s}, \overline{s²}) (∂f/∂s)(∂f/∂s²)   (15)
where covar(\bar{s}, \overline{s²}) denotes the covariance

covar(\bar{s}, \overline{s²}) = ⟨\bar{s} \overline{s²}⟩ − ⟨\bar{s}⟩⟨\overline{s²}⟩ .   (16)
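One widely used way of estimating the variance of such an estimator in practice is resampling; a minimal Jackknife sketch, in which the estimator (here, as an example, the naïve sample variance) is applied to the N leave-one-out sub-samples and the spread of the resulting estimates gives the variance of the estimator:

```c
/* Jackknife sketch: the estimator is applied to the N leave-one-out
 * sub-samples; the Jackknife variance of the estimator is
 * (N-1)/N * sum_i (theta_i - theta_bar)^2. */
#include <stdlib.h>

double naive_variance(const double *s, long N)   /* example estimator f */
{
    double m = 0.0, m2 = 0.0;
    for (long i = 0; i < N; i++) { m += s[i]; m2 += s[i] * s[i]; }
    m /= N; m2 /= N;
    return m2 - m * m;
}

double jackknife_variance(const double *s, long N,
                          double (*estimator)(const double *, long))
{
    double *sub = malloc((N - 1) * sizeof *sub);
    double *theta = malloc(N * sizeof *theta);
    double mean = 0.0, var = 0.0;
    for (long i = 0; i < N; i++) {        /* i-th leave-one-out sub-sample */
        long k = 0;
        for (long j = 0; j < N; j++)
            if (j != i) sub[k++] = s[j];
        theta[i] = estimator(sub, N - 1);
        mean += theta[i];
    }
    mean /= N;
    for (long i = 0; i < N; i++)
        var += (theta[i] - mean) * (theta[i] - mean);
    free(sub); free(theta);
    return (N - 1.0) / N * var;           /* variance of the estimator */
}
```

The same skeleton works for any scalar estimator passed as `estimator`, which is precisely what makes resampling plans attractive for complicated functions of means.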
Similar schemes are often used to estimate variances of very complicated functions of the observables, sometimes assuming that the expectation value of the function coincides with the function of the expectation values of the observables, which clearly is only the case for linear functions. One widely used class of methods available to reduce bias in estimators and in the estimators of their variances are the Bootstrap and the Jackknife [23,24]. Both methods are so-called “resampling plans”, which prescribe a method to construct estimators and estimate their variances from the distribution of estimates based on sub-samples. In the case of the Jackknife, a sample containing N individual measurements produces N sub-samples by taking all measurements apart from the first, second, third and so on. The estimator of the desired quantity is applied to each of the N sub-samples, each of which contains N − 1 measurements. More elaborate schemes, particularly suited to small sample sizes, are available. A very similar approach is to measure the desired quantity in, say, M independent measurements, such as different Monte-Carlo simulations with different random number sequences, each consisting of N individual measurements. If N is large enough, the central limit theorem renders the distribution of the M estimates indistinguishable from a normal distribution. The error of the mean is then easily derived. This scheme lends itself to a simulation consisting of many independent runs, if they are necessary anyway to determine a quantity reliably. It can be applied in a weighted fashion if individual runs consist of different sample sizes.

Correlation Time

If the sequence s_1, s_2, …, s_N used in Eq. (7) is observed in a Markov process, the individual measurements will not be mutually independent: If P(s_2|s_1) denotes the probability to observe s_2 given that the directly preceding sample was s_1, then P(s_2|s_1) is, in general, not independent of s_1. The same is true for n > 1 in P(s_n|s_1), which is generally not independent of s_1, and even P(s_n|s_{n−1}, s_{n−2}, …, s_1) or any other probability of s_n given a previous (sub-)sequence. Only if the probability of s_n depends only on the directly preceding sample s_{n−1} but not on any earlier one, is the sequence s_1, …, s_n itself a Markov process, fully specified by the transition probabilities between successive s_i. The independence of two measurements s_i and s_j means that their joint probability factorizes. It implies that they are uncorrelated, which means that the expectation value of their product factorizes, ⟨s_i s_j⟩ = ⟨s_i⟩⟨s_j⟩. The converse is generally not true, but frequently assumed. The correlation time in a sample is derived from the autocorrelation function in the stationary state,

C(j) = ( ⟨s_i s_{j+i}⟩ − ⟨s⟩² ) / σ²(s)   (17)

which is independent of i because of stationarity. The autocorrelation function is normalized with the variance of the observable s,

σ²(s) = ⟨s²⟩ − ⟨s⟩²   (18)

so that C(0) = 1. Motivated by the study of Markovian processes [6], the correlation time τ of a sequence is estimated by fitting C(j) to an exponential exp(−j/τ). One can show that the mean calculated from a finite sequence s_i, i = 1, 2, …, N, with correlation time τ has a variance equal to the variance of the mean derived from a set of N/(2τ + 1) uncorrelated measurements. Obviously, in numerical simulations a sample with a large correlation time is inferior to one with a smaller correlation time. The estimated mean of an uncorrelated sample comprising N measurements has variance σ²(\bar{s}) = σ²(s)/N, so that, along the simple arguments presented above, the variance of the mean increases (almost) linearly with the correlation time and decreases inversely with the sample size. As discussed further below, the square root of the variance of the estimator is usually used as a measure of the statistical error. When comparing different numerical techniques in terms of their computational costs, i. e. their CPU-time, one therefore has to compare the product of the square of the error and the CPU-time.

Reducing the Correlation Time

A sample of size N and correlation time τ contains correlated sub-samples, s_1, s_{1+τ}, s_{1+2τ}, …, which have correlation time 1 only. By pruning the original sample or, equivalently, reducing the sampling rate in a numerical simulation, one can reduce the correlation time to arbitrarily small values. This, however, is achieved only at the additional computational cost of the intermediate states of the system, which are produced but not sampled. Such a technique pays off only if the cost of the sampling is high compared to the production of a new state of the system, as for example if the (expensive) Fourier transform is to be taken in a system whose configuration evolves at virtually no cost. Often, it is more efficient to include all correlated sub-samples.
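The autocorrelation function, Eq. (17), can be estimated from a sample as in the following sketch (the exponential fit for the correlation time itself is omitted):

```c
/* Sketch: estimate C(j) of Eq. (17) from a stationary sample s[0..N-1],
 * C(j) = (<s_i s_{i+j}> - <s>^2) / sigma^2(s), assuming j << N so that
 * averaging the product over the available pairs is meaningful. */
double autocorrelation(const double *s, long N, long j)
{
    double m = 0.0, m2 = 0.0, c = 0.0;
    for (long i = 0; i < N; i++) {
        m  += s[i];
        m2 += s[i] * s[i];
    }
    m /= N;
    m2 /= N;
    for (long i = 0; i < N - j; i++)
        c += s[i] * s[i + j];
    c = c / (double)(N - j) - m * m;  /* <s_i s_{i+j}> - <s>^2 */
    return c / (m2 - m * m);          /* normalized so that C(0) = 1 */
}
```

Fitting the values autocorrelation(s, N, j) for a range of j to exp(−j/τ) then yields the estimate of the correlation time τ.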
Scaling

Scaling is the hallmark of universality [25,26]. One of the aims of the analysis of a complex system is to determine whether it displays scaling and, if so, whether the characteristics of the scaling behavior suggest that it belongs to a certain universality class. The universality hypothesis makes it possible to extend the predictive power of a simple model to more realistic situations and entire classes of stochastic processes.

Finite Size Scaling Hypothesis

The PDF ⟨P(s; s_c)⟩ of an observable expected to display scaling is tested against the scaling hypothesis, also known as simple scaling,

⟨P(s; s_c)⟩ = a s^{−τ} G(s/s_c)   for s > s_0   (19a)
⟨P(s; s_c)⟩ = f(s; s_c)          otherwise     (19b)

where a is a metric factor (constant and in particular independent of s_c), τ denotes a (critical) scaling exponent (not to be confused with the correlation time) and G(x) is the scaling function, usually assumed to be universal up to a pre-factor. The upper cutoff s_c is the typical scale of the observable, beyond which the density function rapidly drops to 0. It is also called the “characteristic event size” and coincides, up to a pre-factor, with the first moment ⟨s⟩ in the case τ = 1. The lower cutoff s_0 separates a range of values of the observable s where the histogram is governed predominantly by the non-universal PDF f(s; s_c). Equation (19) applies only approximately in discrete systems and in general only asymptotically, i. e. for sufficiently large s, s_c. Taking this limit appropriately quickly becomes very technical and for that reason will not be discussed in detail in the following. The upper cutoff s_c is often assumed to diverge with the system size L as a power law, s_c = b L^D, with D being another (critical) exponent, which can be regarded as the spatial dimension of s_c and therefore of the observable s. The pre-factor b is a second metric factor. Together with a, these two factors are the only non-universal parameters entering the universal part of the PDF, Eq. (19).

In order to test for universal behavior, one has to impose that G(s/s_c) is a dimensionless function of a dimensionless argument. For dimensional consistency, a and b both are dimensionful, unless τ = 1, in which case a is a pure number that can in principle be absorbed into G, or if D = 1, in which case b is a number that can be absorbed into G as well. If s_c depends only on L, (19) describes finite size scaling. In standard critical phenomena [25] this is observed only exactly at the critical point. Away from it, the upper cutoff depends on an external tuning parameter, such as the reduced temperature t, which vanishes at the critical point of a ferromagnetic phase transition. The PDF is then still expected to display scaling as in (19), to be investigated in the thermodynamic limit, i. e. L → ∞. In this case one expects s_c = b′ ξ^D, with b′ being another metric factor, ξ the correlation length and D the same spatial dimensionality as above. Similarly, τ is expected to be the same, while the scaling function differs between finite size scaling and critical (point) scaling. Most importantly, the finite size scaling function depends on more details of the system than the critical scaling function, such as the shape, topology and aspect ratio of the system as well as the boundary conditions [26]. The difference between the finite size scaling function and the critical scaling function is generally explained as being caused by the difference in correlation length: The statistics of a finite system at the critical point displaying finite size scaling depends on the geometric properties of the boundaries of the system, as the correlation length reaches them. In the case of critical scaling the system can often be treated as being composed of many independent sub-systems, not probing the boundaries. Therefore, the distribution of some observables, such as the order parameter, is normal if the thermodynamic limit is taken when the system is not at the critical point. Other PDFs, such as the cluster size distribution in percolation, remain non-trivial but different from those at finite L, as they are not subject to the central limit theorem [6,7]. Complex systems, and most notably self-organized criticality, very often display finite size scaling and lack an explicit tuning parameter. Therefore, the following focuses on finite size scaling.
Data Collapse

By plotting the numerical estimate of ⟨P(s; s_c)⟩ in a double-logarithmic plot versus s, the slope of the resulting graph in the scaling region, i. e. the region dominated by the power-law behavior, s_c ≫ s ≫ s_0, gives the apparent exponent τ̃. This is illustrated in Fig. 1b. According to Eq. (19a) the slope is −τ plus any contribution due to the scaling function. Only if the scaling function converges to a constant for small arguments, lim_{x→0} G(x) > 0, will the apparent exponent τ̃ coincide with the (actual) scaling exponent τ. From the slope of the PDF in a double-logarithmic plot a first estimate of τ can be derived, which can subsequently be used in a data collapse, consisting in plotting the estimates of the PDF in the rescaled form s^τ P(s; s_c) versus s/s_c. For each individual measurement s_c is to be estimated, but the same τ is assumed to apply to all of them. If the result follows simple scaling, then there is a value
Probability Densities in Complex Systems, Measuring, Figure 2: Example of a data collapse, here for the avalanche size distribution of the one-dimensional Oslo model [27], driven at a single site, with system sizes L = 640, 1280, 2560. The upper cutoff is expected to scale like s_c ∝ L^{2.25}. The data collapses for large values of s. For small s, the behavior is non-universal
of s_c for each measurement so that all data s^τ P(s; s_c) for s > s_0 collapse onto the same curve, representing the scaling function in the form a G(s/s_c), as exemplified by the so-called Oslo model [27] in Fig. 2. A failed data collapse is shown for small system sizes of the BTW model in Fig. 1d. The construction of a data collapse is discussed further in Subsect. “Binning During Data Analysis”. The different measurements, with s_c varying between them, are usually taken from different system sizes. Plotting s_c versus the system size in a double-logarithmic plot reveals the spatial dimension or gap exponent (see Subsect. “Gap Scaling Versus Multiscaling”) D, since s_c = b L^D, possibly with corrections to scaling (see Subsect. “Corrections to Scaling”). The same strategy is applied if s_c is expected to depend on a different parameter, such as a temperature. There is no established method for quantifying the quality of a data collapse, which therefore remains somewhat subjective. There is, however, a simple method to rule out simple scaling, for example if the tail of the scaling function, i. e. the region of large arguments, changes shape for different values of s_c, as in the BTW model at least in the case of small system sizes, Fig. 1d, or if the lower cutoff changes significantly with increasing s_c, as in the Forest Fire Model [28]. It is particularly important to probe for this phenomenon, because the quantitative analysis based on moments of the PDF is generally unable to detect it, see Subsect. “Variance and Numerical Error of Higher Moments”.
The Lower Cutoff

The lower cutoff s_0 appears in Eq. (19) to distinguish the universal part of the PDF, shaped by G(s/s_c), from the non-universal part, given by f(s; s_c). Below the lower cutoff, complex and critical systems are expected to be governed by microscopic physics, i. e. the specific details of the process. In this region, the PDF might depend on additional parameters. More often, however, the processes on a very small scale are expected to be similar, so that often f(s; s_c) ≈ f(s; s_c′) even for s_c very different from s_c′. This is found, for example, in critical percolation, where s_c is solely determined by the system size and s is chosen to be so small that the finiteness of the system is irrelevant. This feature is also visible in the patterns at low s in Fig. 2. Different definitions of the observable usually have an impact in this region, but not on the asymptotic, large-s behavior. Some complex systems do not possess a lower cutoff, so that the entire region of accessible values of the observable s is governed by the universal behavior. Although discrete systems have a natural lower cutoff given by the smallest possible event, this might be hidden in the definition range of the observable, as for example in the case of the avalanche size distribution of the one-dimensional BTW model, which follows ⟨P(s)⟩ = s_c^{−1} θ(s/s_c) θ(1 − s/s_c) with s ∈ {1, 2, …, s_c}, s_c = L and θ denoting the Heaviside step function. A continuous system without a lower cutoff is physically equivalent to one without an upper cutoff, because by suitable rescaling the range of observables s tested can be made arbitrarily large.

Corrections to Scaling

Equation (19a) describes only the leading order behavior of the PDF, which is only asymptotically valid as s_c ≫ s_0 and s ≫ s_0. The former condition, s_c ≫ s_0, suppresses contributions from the non-universal part and can, in the case of finite size scaling, be achieved by increasing the system size.
The latter condition, s ≫ s_0, is reflected in the condition s > s_0; the larger s_0 is chosen, the smaller the error of Eq. (19a). The corrections accounting for the failure of the exact scaling behavior are known as corrections to scaling [29], originally introduced in the context of ferromagnetic phase transitions. In principle, the scaling form Eq. (19a) can be generalized to

a s^{−τ} G(s/s_c) + a_1 s^{−(τ+ω_1)} G_1(s/s_c) + …   (20)

with ω_1 > 0. It is much more common to account for corrections to scaling on the level of individual observables, such as moments or s_c, instead of the entire distribution. For example, the standard finite size scaling ansatz including corrections to scaling is

s_c = b L^D ( 1 + b_1 L^{−ω_1′} + b_2 L^{−ω_2′} + … )   (21)
with the exponents ω_i and ω_i′ expected to be universal.

Moments

The data collapse described in the previous section suffers from the problem that its quality is not easily quantifiable. Only an obvious failure gives a strong indication of the absence of finite size scaling in the form of Eq. (19). A more quantitative tool is the moment analysis. In principle, moments can be derived from the histogram after the simulation, but the error introduced by binning methods, which are almost always a necessity, or by rounding in the case of continuous event sizes, can be very large. These errors are easily avoided by calculating the moments during the simulation.

Numerics of Moments

Moments are usually calculated during the simulation and provide access to the scaling behavior of the system without large memory or computational requirements. However, as individual contributions to moments can vary greatly in value, the quality of the estimate relies crucially on the precision of the numerical calculation. The standard method to calculate the nth moment of P(s; s_c) is to sum the nth powers of all individual event sizes s and finally divide by the number of contributions. If the process takes place in continuous time, each event size might carry an additional weight, and the required normalization is the sum over all weights. This weight usually corresponds to the amount of time for which the observable had the particular value now entering the estimator. Depending on the type of process, measuring the moments can contribute significantly to the overall computing time. One method to minimize it is to reduce the number of additions and multiplications. An implementation could read

    // Normalization
    power = weight;
    moment[0] += power;
    // First moment
    power *= event_size;
    moment[1] += power;
    // Second moment, etc.
    power *= event_size;
    moment[2] += power;
    power *= event_size;
    moment[3] += power;
    ...

which can be further simplified in the form of a loop. Moments of very high order can be generated in the form

    // Normalization
    moment[0] += weight;
    // First moment
    power = event_size;
    moment[1] += power * weight;
    // Second moment
    power *= power;
    moment[2] += power * weight;
    // Fourth moment
    power *= power;
    moment[3] += power * weight;
    ...

Precision is crucial, in particular when it is a priori unknown whether the main contribution to a moment is due to many small events or a few very big ones. Integer variables are an option only if the weight is integer-valued or can be rendered so. They are ideal, because they do not suffer from rounding errors. To avoid overruns, many platforms offer 64-bit integers. Integers are often advantageous computationally, i.e. in terms of CPU time, as well. Where floating point numbers are a necessity, they have to be of appropriate size. A central criterion is whether small events can still enter with sufficient accuracy at the end of the simulation, when the various variables holding sums of different powers of the observables contain very large numbers. Adding a sufficiently small number to a large floating point number might not actually change the latter, depending on the size of the mantissa. The IEEE 754 standard [30] describes floating point numbers with 24, 53 and 64 bit mantissae. It is computationally very inefficient to study moments ⟨s^n⟩ for n ∉ ℕ, because each event then requires a floating point operation, such as sqrt or pow. Where such moments are unavoidable, these operations can often be "recycled" by, say, calculating pow(event_size, 1./3.), taking its square and constructing from these two values plus event_size all powers 1/3, 2/3, 1, 4/3, 5/3, … using a minimal number of multiplications.

Moment Analysis

Simple finite size scaling, Eq. (19), implies that the moments ⟨s^n⟩ of the PDF scale like a power of the upper cutoff s_c,

lim_{s_c → ∞} ⟨s^n⟩ / s_c^{n+1−τ} = a lim_{y → 0} g_n(y)   for n > τ − 1   (22)
where the amplitude g_n(0) is a moment of the universal scaling function,

g_n(y) = ∫_y^∞ dx x^{n−τ} G(x) ,   (23)
which is universal up to a pre-factor, i.e. any ratio of these amplitudes is universal. Assuming that the scaling function G(x) is continuous and has no singularity at finite argument, one can prove that the limit lim_{y→0} g_n(y) exists. Moreover, as g_0(0) > 0, the exponent τ cannot be less than unity. Simplifying Eq. (22),

⟨s^n⟩ ∝ s_c^{σ_n}   with σ_n = 1 + n − τ   for n > τ − 1,   (24)
the scaling exponents σ_n are usually determined for n ranging from 1 up to typically 8, as the slope in a double-logarithmic plot of the moment versus the system parameter, which in the case of finite size scaling is the linear size of the system. If no such parameter is quantifiable, moments can be measured with respect to each other, noting that

⟨s^n⟩ ∝ ⟨s^m⟩^{σ_n/σ_m}   for n, m > τ − 1.   (25)
A priori the lower cutoff and the corrections to scaling are unknown, and in order to determine the (asymptotic) finite size scaling behavior a numerical study has to probe system sizes as large as possible. The true asymptote will, by definition, strictly remain inaccessible.

Gap Scaling Versus Multiscaling

Moments following the scaling behavior prescribed by Eq. (22) are said to display gap scaling, with D being the gap exponent [31]. This term refers to the constant gap between the scaling exponents of consecutive moments, taken as a function not of s_c itself but of the tuning parameter, usually the system size. Assuming Eq. (21) and using the simplified notation of Eq. (24), finite size scaling implies

⟨s^n⟩ ∝ L^{σ_n′}   with σ_n′ = D(1 + n − τ)   (26)

so that two consecutive moments have exponents differing by σ_{n+1}′ − σ_n′ = D, visible as a constant slope when plotting the scaling exponent as a function of n, see Fig. 3. Measuring this slope is the standard method for calculating the exponent D. The intersection of the linear continuation of the σ_n′ with the abscissa gives τ − 1, which is a standard method of estimating τ. An exponent σ_n > 0 indicates that the corresponding moment diverges as s_c diverges. For τ > 1 there are some moments with non-negative n which do not diverge. If τ = 1, the only non-negative moment that does not diverge is the normalization, n = 0. If the exponents σ_n do not increase linearly in n yet the moments still display scaling behavior, the system is said to exhibit multiscaling. In this case, the scaling form Eq. (22) is often extended to

lim_{s_c → ∞} ⟨s^n⟩ / ( ln(s_c)^{γ_n} s_c^{σ_n} )   (27)
Probability Densities in Complex Systems, Measuring, Figure 3 The exponents σ_n estimated from the moments calculated from ⟨P(s; s_c)⟩ as introduced in Fig. 4. The dotted line shows the exponents estimated from s_c ranging from 25 to 1·10^4, the dashed line shows exponents based on s_c ranging from 1·10^4 to 6.4·10^5. The full line shows the exact values the numerical data converges to. The rounding in the numerical estimates can be mistaken for multiscaling
to allow for logarithmic contributions to the scaling behavior. The presence of such logarithmic "corrections" is often interpreted as a sign of multiscaling, as is the presence of rare events, see Subsect. "Rare Events". Multiscaling often means that the σ_n converge for large n and only the exponents γ_n of the logarithm continue to increase. Multiscaling is sometimes confused with the absence of scaling altogether, or with the rounding effect observed close to n = τ − 1, as shown in Fig. 3. The standard method of estimating τ and the gap exponent D is to fit the moments ⟨s^n⟩ to a power law, see Eq. (24) and Eq. (26), possibly including corrections to scaling (Subsect. "Corrections to Scaling"), and usually by plotting not versus s_c but versus some system parameter such as the size L. The resulting σ_n or σ_n′ are then fitted against 1 + n − τ or D(1 + n − τ), respectively, to determine τ and possibly D.

Variance and Numerical Error of Higher Moments

The variance of the nth moment is ⟨s^{2n}⟩ − ⟨s^n⟩² = ⟨(s^n − ⟨s^n⟩)²⟩. This difference is always non-negative, and comparing the scaling exponents of both contributions from Eq. (24) confirms σ_{2n} ≥ 2σ_n, where equality holds only if τ = 1. Thus, unless τ = 1, the standard deviation of the estimator of the nth moment is expected to scale with exponent (1/2)σ_{2n} D = nD + (1/2)(1 − τ)D. The relative variance, also known as the relative fluctuations, is therefore expected to scale as

⟨s^{2n}⟩ / ⟨s^n⟩² ∝ s_c^{σ_{2n} − 2σ_n} = s_c^{τ−1}   (28)
and therefore diverges asymptotically for τ > 1 as the upper cutoff increases. For τ = 1 the relative variance might not change with s_c, although the ratio ⟨s^{2n}⟩/⟨s^n⟩² might have a strong dependence on n. Because Eq. (24) determines only the leading order scaling of the moments but makes no statement about the respective amplitudes, the difference ⟨s^{2n}⟩ − ⟨s^n⟩² might have a leading order that scales more slowly than σ_{2n} if τ = 1. In this case the relative variance might asymptotically vanish, which is known as self-averaging [32]. Self-averaging is more commonplace in the context of scaling with system size away from a critical point. In this case, spatial densities usually follow a Gaussian distribution and the system can be decomposed into finite patches, the number of which grows linearly in the volume. Therefore, the relative variance of a density that converges to a finite value in the thermodynamic limit decreases like L^{−d}. This effect is known as strong self-averaging [33]. Where the variance decays with increasing L, but not as fast as L^{−d}, the effect is known as weak self-averaging. The variance can decrease faster than L^{−d} only in the (rare) presence of anti-correlations. A Monte Carlo simulation can be regarded as a means to integrate a function, so that estimating ⟨s^n⟩ in fact means performing the integral

⟨s^n⟩ = ∫ ds P(s; s_c) s^n .   (29)
The most efficient method to estimate ⟨s^n⟩ samples exactly with a density ∝ ⟨P(s; s_c)⟩ s^n, which in principle can be achieved using (ideal) importance sampling. In practice this is rarely possible, and only very crude approximations to ⟨P(s; s_c)⟩ s^n are available as sampling frequencies. In the context of complex systems, ⟨P(s; s_c)⟩ itself is the sampling rate in the vast majority of problems. The comparison to ideal importance sampling suggests an alternative, much stronger criterion to assess the quality of a numerical estimate. The product ⟨P(s; s_c)⟩ s^n usually has its maximum at some s_max close to s_c, and one might therefore compare the density of measurements around s_max to the weight with which this range of s enters the integral (29) [34]. Correspondingly, one can derive the number of measurements needed to perform the integral (29) reliably by imposing a lower bound on the density of measurements around s as a fraction of ⟨P(s; s_c)⟩ s^n. Demanding that the density of measurements around s is nowhere less than P(s; s_c) s^n then gives rise to a scale s*, so that at least one sample is to be taken above the "characteristic largest event" s*,

∫_{s*}^∞ ds P(s; s_c) s^n = 1   (30)

which depends on n and is to be compared to the total weight in the sample actually produced in the simulation, P̃(s; s_c),

Ñ(s*) = N ∫_{s*}^∞ ds P̃(s; s_c)   (31)
where N is the total number of measurements taken and P̃(s; s_c) = P(s; s_c) is the sampling frequency unless importance sampling is used. The event size s* is usually of the order of the upper cutoff s_c, and the number of measurements needed typically scales like s_c^{σ_n}. Thus this much stronger criterion, which constrains the sampling density even in the extreme region of very large s, produces a much stronger constraint on the sample size than suggested by the relative variance, which increases with s_c only as s_c^{τ−1}, see Eq. (28). Figure 4 shows the product s^n ⟨P(s; s_c)⟩/⟨s^n⟩ for a range of n as a function of s. As indicated above, the larger n, the more weight enters from the tail of the distribution, which implies that small-s features are more and more ignored with increasing n. While increasing n are needed in order to derive the gap exponent reliably (see Subsect. "Gap Scaling Versus Multiscaling"), an analysis that relies solely on moments can miss a lack of finite size scaling at small s, such as a divergent lower cutoff. This problem can be avoided either by careful inspection of the data collapse, or by accurate determination of
Probability Densities in Complex Systems, Measuring, Figure 4 Moments for larger n draw more weight from larger s. A moment analysis investigating large moments might possibly miss features of the distribution at small arguments. The distribution used here is ⟨P(s; s_c)⟩ = a s^{−3/2} G(s/s_c) for s ∈ [1, ∞) with G(x) = (1 + x²) exp(−x³/s_c) and s_c = 10 000
Probability Densities in Complex Systems, Measuring, Figure 5 Scaling of the first moment ⟨s⟩ of ⟨P(s; s_c)⟩ (see Fig. 4). a In a double-logarithmic plot, the slope ranges from 0.95 (dashed line) to about 0.5 (asymptotically exact; full line) as indicated. Correction terms (corrections to scaling) or very large system sizes are necessary for the estimated exponent to be sufficiently close to the asymptote, which is σ_1 = 0.5, known from construction. This becomes more apparent when plotted in the form shown in b: the same data but plotted as ⟨s⟩ s_c^{−1/2}. In this form it is easier to see that the asymptotic behavior is only reached by probing system sizes at least as big as the largest one shown
the scaling of moments for very small n > τ − 1, which might necessitate the investigation of fractional moments, see Fig. 3. The closer the moment is to τ − 1, the slower the convergence to the asymptotic behavior. This is illustrated in Fig. 5a, which shows the scaling of the first moment ⟨s⟩ in a system with τ = 1.5. Whether displaying and analyzing the histogram or moments, it is generally significantly more informative to plot the data with the leading order divided out, for example ⟨s⟩ s_c^{−1/2} if ⟨s⟩ ∝ s_c^{1/2} is expected, as shown in Fig. 5b. If the numerical data converges to a constant, this confirms the scaling, while the shape of ⟨s⟩ s_c^{−1/2} indicates the rôle of corrections to scaling. Moreover, the spread of the numerical data across the ordinate is reduced, allowing a more detailed inspection of the details. This is discussed further in Subsect. "Binning During Data Analysis" for the scaling of histograms.

Critical Slowing Down

Large upper cutoffs are needed to isolate the asymptotic behavior on the one hand, but they are numerically very demanding on the other. Another effect which reduces the effectiveness of Monte Carlo simulations, known from ferromagnetic phase transitions, is critical slowing down. The (auto-)correlation time discussed in Subsect. "Correlation Time" effectively reduces the number of independent measurements produced in the simulation. Critical slowing down describes the behavior of the correlation time at the critical point or in finite size scaling: the correlation time diverges like a power law of the tuning parameter, which in finite size scaling means that it diverges like L^z, where z is known as the dynamical exponent.
Taking all the different effects into account, namely the increasing relative variance for τ > 1 (exponent τ, see Eq. (19)), the minimal number of measurements required for a certain minimal coverage, and finally critical slowing down, large system sizes always represent a challenge to the quality of the estimate of moments, in particular for large n. They are, however, necessary for a reliable estimate of the asymptotic behavior of the PDF.

Moment Ratios

A priori, the amplitude a on the RHS of Eq. (22) is unknown; the definition Eq. (19a) fixes only τ and otherwise states only that there exist quantities a and G(x), fixed only up to a pre-factor, so that the universal part of the PDF obeys Eq. (19a). By taking ratios of quantities containing non-universal pre-factors, these are removed. For example,

⟨s^n⟩ ⟨s⟩^{n−2} / ⟨s²⟩^{n−1} = g_n(0) g_1^{n−2}(0) / g_2^{n−1}(0)   (32)
contains only moments of the scaling function, with any non-universal pre-factor removed. Moreover, one can show that (32) converges to a non-zero value if ⟨s^n⟩ follows Eq. (24). For τ = 1 the metric factor a becomes dimensionless and can therefore be fully absorbed into the scaling function. In ferromagnetic critical phenomena and related phase transitions, such as percolation, where τ = 1, ratios of the form ⟨s^{nm}⟩/⟨s^m⟩^n are universal already.

Histogram

Data Representation

Naïvely, collecting the histogram in a simulation of a complex system with integer valued event sizes
amounts only to incrementing a variable, histogram[event_size]++. While this procedure is very efficient, provided that the number of processor cycles needed to access histogram[event_size] is minimal, the memory requirements quickly exceed those of standard computers as s_c increases. Moreover, histogram entries for small event sizes might overflow, necessitating very large data types to accommodate the largest entry in the histogram. At the same time, entries for very large event sizes are very sparse. Using this technique for continuous event sizes means that they have to be rounded. Various methods of binning are established to handle these problems. Binning is usually also used at the stage of data analysis, i.e. in the preparation of the data after the actual simulation has taken place. The key is to map the complete range of event sizes to the range available in the representation of the histogram. In the context of computing, the map is effectively a hash function, in the following h(s). Assuming that the event size lies within a certain finite range and is non-negative, s ∈ [0, s_max] = Ω, not necessarily discrete, sets S_l ⊂ Ω are defined so that all s ∈ S_l are equivalent under the hash algorithm and distinct otherwise, i.e. s, s′ ∈ S_l implies h(s) = h(s′) = l and vice versa. The sets are usually continuous, so that the hash algorithm h(s) can be represented by a set of ranges, for example h(s) = l for s_l ≤ s < s_{l+1}. The hash function is then used in the form histogram[h(event_size)]++. After collecting the data into bins, they are normalized by the bin size and often shown with respect to the geometric mean √(s_l s_{l+1}) of the bin edges. There is obviously some freedom of choice, but if the choice makes a qualitative or quantitative difference, this implies that the binning is too coarse.

Binning Schemes Within the Simulation

Binning can take place either during the simulation or in the data analysis.
In general, it is advisable to keep the simulation as simple as possible: if the binning scheme within the simulation fails, all data might be lost or rendered useless; if binning fails at the stage of data analysis, it can easily be fixed and repeated. However, due to memory constraints, it is very often necessary to bin within the simulation. In these cases, it is vital to choose the most efficient binning scheme to avoid a negative impact on the CPU time.

Power Law Binning

The aim of binning generally is to reduce the histogram's memory requirements while avoiding potential overflows. The first aim can be achieved
by choosing the size of the ranges s_{l+1} − s_l according to the (expected) PDF, which a priori is, however, unknown. Assuming a pure power law, ⟨P(s; s_c)⟩ = a s^{−τ}, the ranges would need to be s_l = (c(M − l))^{1/(1−τ)} for τ > 1, where c determines the size of the ranges and M is the maximum l. In the case of discrete event sizes, the ranges obviously need to be rounded. An even distribution of the events across the histogram has the additional advantage of an equal error for each bin, provided that the error is only a function of the number of events. This might not be the case for very small events, where correlations might be significantly larger than for larger event sizes. Although the binning ranges described above would produce ideal bins, it is generally not advisable to use an expected result, such as the exponent τ, in the process of collecting the raw data intended to measure this very quantity.

Exponential Binning

The scheme most suitable for presenting the data in a double-logarithmic plot is exponential binning. In this case the range limits are powers of some base s_b, so that s_l = r s_b^l with some pre-factor r. As a result, the data points are evenly spread with a spacing of ln(s_b) after taking the logarithm of the abscissa. The error, on the other hand, increases towards larger s because of the smaller number of events collected in bins for larger event sizes. Approximating the PDF by a pure power law, the number of events in the bin with hash l is, up to the normalization,

∫_{s_l}^{s_{l+1}} ds s^{−τ} = (r s_b^l)^{1−τ} (1 − s_b^{1−τ}) / (τ − 1)   (33)

for τ > 1, which decreases with increasing l. The width of each bin is s_{l+1} − s_l = s_l (s_b − 1). Exponential binning is frequently used and is the scheme most suited for a direct calculation of the hash value during the simulation. Two simple methods for determining the hash value can be distinguished. Firstly, a function h̃(s) can be devised so that, for example, its integer part gives the hash value. In the case of exponential binning, this function is h̃(s) = ln(s/r)/ln(s_b). Using functions from the mathematical library, such as log, is, however, computationally very expensive. Secondly, one can compare s to the various s_l until l is found so that s_l ≤ s < s_{l+1}. Since small events are the most frequent ones, this might be implemented by simply linearly increasing l. A crude estimate of the expected number of comparisons shows that this number generally quickly converges with increasing s_c, as ⟨ln(s)⟩ is finite. The constant number of expected comparisons is to be compared to the obvious "divide and conquer" or tree-search approach: given M bins, s is first compared to s_{M/2} and, in the next step, depending on the outcome of the previous comparison, to s_{M/4} or s_{3M/4}, etc. This method can be improved further by choosing the values compared against so that the probabilities for s to be above or below are roughly equal, again assuming a certain, simplified form of ⟨P(s; s_c)⟩. The expected number of comparisons in this scheme is of the order of ln(M). Usually, the (logarithmic) bin size s_b is fixed, so that M increases with s_c like ln(s_c), i.e. the expected number of comparisons is proportional to the double logarithm of s_c. Even though asymptotically tree-search requires a greater number of comparisons than the linear search discussed above, in practice the double logarithm means that tree-search will almost always outperform a linear search by far. Similarly, even though direct calculation of the bin l using a function h̃(s) has constant computational costs, these will typically be greater than for the tree search, in particular if the observable is discrete and provisions must be made in the region of its small values.

Coarse Graining

The simplest binning scheme consists of dividing s by a constant, h(s) = s/q, so that the resulting range of hash values is small enough to fit in the memory available. In the case of integer valued event sizes, the simplest and potentially fastest way of determining the hash value is a bit shift, i.e. a division by a power of 2. The key problem is the same as the initial motivation for binning, namely that entries for small s might overflow, while entries for large s might be very sparse. The problem is more serious in the case of coarse graining if the factor q is so large that the majority of events arrive in the first bin. Moreover, not fully resolving the distribution for small s means that most of the non-universal part of the distribution is lost, potentially hiding problems in the behavior of the lower cutoff.
A very powerful alternative is a combination of coarse graining and exponential binning. A small number of thresholds is introduced, and between each pair of thresholds a different rescaling factor q is used to implement a different degree of coarse graining. This method also allows the use of different variable types in different regions of the histogram, i.e. large types for small s and smaller, less memory consuming ones for larger values.
Binning During Data Analysis

To simplify visual inspection, in practice the raw histogram produced in the simulation is always binned afterwards. The most widely used method is exponential binning, as it provides equally spaced data points in a double-logarithmic plot. To perform a data collapse, the data is plotted in the form s^τ P(s; s_c) versus s/s_c. If Eq. (19) applies, the data for different s_c collapse onto the same line, approximating G(s/s_c) up to an error due to finite size corrections and the presence of the non-universal part. An example is shown in Fig. 2. If Eq. (19) applies, the data also collapse for any function f(x) when plotted in the form s^τ f(s/s_c) P(s; s_c) versus s/s_c, showing G(s/s_c) f(s/s_c). A particularly important case is f(x) = x^{−τ}, which leads to s_c^τ P(s; s_c) versus s/s_c. To expose as many details as possible of the assumed scaling function, it is generally advisable to choose f(x) so that the ordinate of the resulting data spreads as little as possible (see also Fig. 5). If lim_{x→0} G(x) ≠ 0, this is usually the case for f(x) = 1. If G(x) is expected to have a power-law dependence on its argument, G(x) = x^α G̃(x) with lim_{x→0} G̃(x) ≠ 0, then f(x) = x^{−α} is the appropriate choice. If the exponent τ and the values s_c are unknown, the first step in performing a data collapse is to collapse the binned data (Fig. 6a) approximately, with a preliminary choice for τ (Fig. 6b) and possibly also for the different values of s_c, and to determine an outstanding feature in the resulting data, such as the maximum (arrows in Fig. 6b). A different choice for τ will change the relative vertical position of this "landmark"; τ is to be chosen so that all landmarks line up at the same value on the ordinate, Fig. 6c. Next, the s_c for the different simulations are chosen to collapse the data horizontally, usually by estimating s_c as the position of the landmark with respect to the abscissa, Fig. 6d. In this last figure the lower cutoff s_0 can be estimated as well, as the value of s from which on the data roughly coincide. A double-logarithmic plot of both cutoffs indicates a roughly constant s_0 and a gap exponent D = 2.25 for s_c.
A data collapse often reveals that much greater system sizes are needed, as scaling applies to the asymptotic behavior only. The attempted data collapse for the BTW model, Fig. 1d, shows an example of a failed collapse in the case of small system sizes. While binning greatly improves the visual quality of the data, it assumes that ⟨P(s; s_c)⟩ does not change too suddenly on the scale of the size of a bin, which therefore needs to be chosen small enough. However, choosing s_b too small results in too many bins and therefore too much statistical noise. For large s most of the bins will be empty, some will contain a single entry, some two, etc. Because the width of the bin is (s_b − 1) s_l, the density within bins containing a single entry scales like s_l^{−1}, leading in a double-logarithmic plot to a sequence of points with slope −1. Parallel to this line lies a set of points corresponding to
Probability Densities in Complex Systems, Measuring, Figure 6 A data collapse in detail. Using the data shown in Fig. 2, a shows the binned data. In b a preliminary estimate has been made for the scaling exponent and the data is plotted in the form s^{1.5} P(s; s_c). A "landmark" has been chosen (the maxima, indicated by arrows), which, connected by a line (dotted), indicate that the exponent is to be chosen larger than the preliminary choice. In c, τ = 1.55 is chosen to make all maxima line up (dotted line). The position of the maxima on the abscissa (arrows) indicates the value of s_c. d shows the final collapse, after shifting the data horizontally by dividing s by the respective s_c. The approximate value of s_0, estimated for the two smaller system sizes as the point where the data approximately coincide, is indicated by an arrow
bins with two entries, etc. The resulting plot displays a spurious scaling exponent τ = 1, see Fig. 7.

Future Directions

The numerical methods discussed above are commonly used in other areas of statistical physics and have mainly been developed there. This applies to the Monte Carlo methods as well as to the data analysis. Within complexity, the development of methods hardly forms a research branch in its own right. Most methods are developed by practitioners for the specific problem at hand. As complexity is a very diverse field, only very few models have received so much attention that specialized algorithms have been developed and discussed broadly in the literature, for example [11,35]. Most models are defined by a dynamic process, which normally makes a simple implementation readily available. Algorithmic improvements mainly focus on efficiency and resources, as CPU time and memory requirements are the two key factors limiting the quality of results, in terms of sample size and system size respectively. In other areas, for example in the context of networks, new observables often require new, efficient algorithms. Even though almost all models are constrained by the computer hardware available and therefore benefit significantly from improved algorithms, optimization often clashes with the universality of the algorithm and, on
Probability Densities in Complex Systems, Measuring, Figure 7 If the sample is too sparse in the tail of the distribution, exponential binning produces a spurious slope corresponding to τ = 1
a more technical level, with the elegance and readability of the code. With respect to data analysis, some literature is concerned with new approaches to established observables, such as the moment analysis discussed in Subsect. "Moment Analysis" to derive the exponents of the PDF. While algorithms and methods have advanced significantly, there seems to be a maximum amount of information about a complex system accessible through a given amount of CPU time spent in a computer simulation. The most significant general advancement in computational physics therefore comes from the continuous improvement of the computer hardware available, producing an ever increasing sample size within a given amount of CPU time. While this relieves research from many constraints, it does not imply that sophisticated algorithms and well thought-out methods become less relevant. It is the combined effect of methods, software and hardware that moves the subject forward continuously. A good algorithm that copes with today's technical constraints will do an even better job in the years to come.

Bibliography

Primary Literature

1. Metropolis N, Ulam S (1949) The Monte Carlo Method. J Am Stat Ass 44:335–341
2. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of State Calculations by Fast Computing Machines. J Chem Phys 21:1087–1092
3. Landau DP, Binder K (2000) A Guide to Monte Carlo Simulations in Statistical Physics. Cambridge University Press, Cambridge
4. Bak P, Tang C, Wiesenfeld K (1987) Self-Organized Criticality: An Explanation of 1/f Noise. Phys Rev Lett 59:381–384
5. Bak P, Tang C, Wiesenfeld K (1988) Self-Organized Criticality. Phys Rev A 38:364–374
6. van Kampen NG (1992) Stochastic Processes in Physics and Chemistry, 3rd impression 2001, enlarged and revised edn. Elsevier, Amsterdam
7. Stauffer D, Aharony A (1994) Introduction to Percolation Theory. Taylor, London
8. Binney JJ, Dowrick NJ, Fisher AJ, Newman MEJ (1998) The Theory of Critical Phenomena. Clarendon Press, Oxford
9. Binder K, Heermann DW (1997) Monte Carlo Simulation in Statistical Physics. Springer, Berlin
10. Zia RKP, Schmittmann B (2007) Probability currents as principal characteristics in the statistical mechanics of non-equilibrium steady states. J Stat Mech P07012
11. de Oliveira MM, Dickman R (2005) How to simulate the quasi-stationary state. Phys Rev E 71:016129
12. Spitzer F (1970) Interaction of Markov processes. Adv Math 5:256–290
13. Bouchaud JP, Comtet A, Georges A, Le Doussal P (1990) Classical Diffusion of a Particle in a One-Dimensional Random Force Field. Ann Phys 201:285–341
14. Grassberger P (2002) Go with the winners: a general Monte Carlo strategy. Comp Phys Comm 147:64–70
15. Pradhan P, Dhar D (2006) Sampling rare fluctuations of height in the Oslo ricepile model. J Phys A 40:2639–2650
16. Bouchaud JP, Potters M (2003) Theory of Financial Risk and Derivative Pricing: From Statistical Physics to Risk Management. Cambridge University Press, Cambridge
17. De Menech M, Stella AL, Tebaldi C (1998) Rare events and breakdown of simple scaling in the Abelian sandpile model. Phys Rev E 58:R2677–R2680
18. Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) Numerical Recipes in C, 2nd edn. Cambridge University Press, New York
19. Knuth DE (1997) The Art of Computer Programming, vol 2: Seminumerical Algorithms, 2nd edn. Addison Wesley, Massachusetts
20. Matsumoto M, Nishimura T (1998) Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudorandom Number Generator. ACM Trans Model Comp Sim 8:3–30
21. Gentle JE (1998) Random Number Generation and Monte Carlo Methods. Springer, Berlin
22. Ferrenberg AM, Landau DP (1992) Monte Carlo Simulations: Hidden Errors from "Good" Random Number Generators. Phys Rev Lett 69:3382–3384
23. Efron B (1982) The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Philadelphia
24. Berg BA (1992) Double Jackknife bias-corrected estimators. Comp Phys Comm 69:7–14
25. Fisher ME (1967) The Theory of equilibrium critical phenomena. Rep Prog Phys 30:615–730
26. Privman V, Hohenberg PC, Aharony A (1991) Universal Critical-Point Amplitude Relations. In: Domb C, Lebowitz JL (eds) Phase Transitions and Critical Phenomena, vol 14. Academic Press, New York, pp 1–134
27. Christensen K, Corral A, Frette V, Feder J, Jøssang T (1996) Tracer Dispersion in a Self-Organized Critical System. Phys Rev Lett 77:107–110
Probability Densities in Complex Systems, Measuring
28. Pruessner G, Jensen HJ (2002) Broken scaling in the forest-fire model. Phys Rev E 65:056707–1–8. Preprint cond-mat/0201306 29. Wegner FJ (1972) Corrections to Scaling Laws. Phys Rev B 5:4529–4536 30. Dowd K, Severance C (1998) High Performance Computing, 2nd edn. O’Reilly, Sebastopol 31. Pfeuty P, Toulouse G (1977) Introduction to the Renormalization Group and to Critical Phenomena. Wiley, Chichester 32. Ferrenberg AM, Landau DP, Binder K (1991) Statistical and Systematic Errors in Monte Carlo Sampling. J Stat Phys 63:867– 882 33. Milchev A, Binder K, Heermann DW (1986) Fluctuations and Lack of Self-Averaging in the Kinetics of Domain Growth. Z Phys B 63:521–535 34. Christensen K, Moloney NR (2005) Complexity and Criticality. Imperial College Press, London 35. Pruessner G, Jensen HJ (2004) Efficient algorithm for the forest fire model. Phys Rev E 70:066707–1–25. Preprint condmat/0309173
Books and Reviews Anderson TW (1964) The Statistical Analysis of Time Series. Wiley, London Barenblatt GI (1996) Scaling, self-similarity, and intermediate asymptotics. Cambridge University Press, Cambridge Berg BA (2004) Markov Chain Monte Carlo Simulations and Their Statistical Analysis. World Scientific, Singapore Brandt S (1998) Data Analysis. Springer, Berlin Jensen HJ (1998) Self-Organized Criticality. Cambridge University Press, New York Liggett TM (2005) Stochastic Interacting Systems: Contact, Voter and Exclusion Processes. Springer, Berlin Marro J, Dickman R (1999) Nonequilibrium Phase Transitions in Lattice Models. Cambridge University Press, New York Newman MEJ, Barkema GT (1999) Monte Carlo Methods in Statistical Physics. Oxford University Press, New York Stanley HE (1971) Introduction to Phase Transitions and Critical Phenomena. Oxford University Press, New York
Probability Distributions in Complex Systems

DIDIER SORNETTE
Department of Management, Technology and Economics, ETH Zurich, Zurich, Switzerland

Article Outline

Glossary
Definition of the Subject
Introduction
The Fascination with Power Laws
Testing for Power Law Distributions in your Data
Beyond Power Laws: "Kings"
Future Directions
Bibliography

Glossary

Complex system A system with a large number of mutually interacting parts, often open to its environment, which self-organizes its internal structure and its dynamics with novel and sometimes surprising macroscopic ("emergent") properties.

Criticality (in physics) A state in which spontaneous fluctuations of the order parameter occur at all scales, leading to a diverging correlation length and a diverging susceptibility of the system to external influences.

Power law distribution A specific family of statistical distributions appearing as a straight line on a log-log plot; it exhibits the property of scale invariance and therefore does not possess characteristic scales.

Self-organized criticality Occurs when the system dynamics are attracted spontaneously, without any obvious need for parameter tuning, to a critical state with infinite correlation length and power law statistics.

Stretched-exponential distribution A specific family of sub-exponential distributions interpolating smoothly between the exponential distribution and the power law family.

Definition of the Subject

This core article for the Encyclopedia of Complexity and System Science (Springer Science) briefly reviews the concepts underlying complex systems and probability distributions. The latter are often taken as the first quantitative characteristics of complex systems, allowing one to detect the possible occurrence of regularities, providing a step toward defining a classification of the different levels of organization (the "universality classes"). A rapid
survey covers the Gaussian law, the power law and the stretched exponential distributions. The fascination with power laws is then explained, starting from the statistical physics approach to critical phenomena, out-of-equilibrium phase transitions, self-organized criticality, and ending with a large, but not exhaustive, list of mechanisms leading to power law distributions. A checklist for testing and qualifying a power law distribution from data is described in seven steps. This essay enlarges the description of distributions by proposing that "kings", i.e., events even beyond the extrapolation of the power law tail, may reveal information which is complementary to, and perhaps sometimes even more important than, the power law distribution. We conclude with a list of future directions.

Introduction

Complex Systems

The study of out-of-equilibrium dynamics (e.g., dynamical phase transitions) and of heterogeneous systems (e.g., spin glasses) has progressively popularized in physics the concept of complex systems and the importance of systemic approaches: systems with a large number of mutually interacting parts, often open to their environment, self-organize their internal structure and their dynamics with novel and sometimes surprising macroscopic ("emergent") properties. The complex system approach, which involves "seeing" inter-connections and relationships, i.e., the whole picture as well as the component parts, has become pervasive in modern control of engineering devices and business management. It also plays an increasing role in most of the scientific disciplines, including biology (biological networks, ecology, evolution, origin of life, immunology, neurobiology, molecular biology, etc.), geology (plate tectonics, earthquakes and volcanoes, erosion and landscapes, climate and weather, environment, etc.), economics and social sciences (cognition, distributed learning, interacting agents, etc.).
There is a growing recognition that progress in most of these disciplines, in many of the pressing issues for our future welfare as well as for the management of our everyday life, will need such a systemic complex system and multidisciplinary approach. A central property of a complex system is the possible occurrence of coherent large-scale collective behaviors with very rich structure, resulting from the repeated nonlinear interactions among its constituents: the whole turns out to be much more than the sum of its parts. Most complex systems around us exhibit rare and sudden transitions that occur over time intervals which are short compared to the characteristic time scales of their prior evolution. Such extreme events express more than anything
else the underlying "forces" usually hidden by almost perfect balance and thus provide the potential for a better scientific understanding of complex systems. These crises have fundamental societal impacts and range from large natural catastrophes such as earthquakes, volcanic eruptions, hurricanes and tornadoes, landslides, avalanches, lightning strikes, and catastrophic events of environmental degradation, to the failure of engineering structures, crashes in the stock market, social unrest leading to large-scale strikes and upheaval, economic drawdowns on national and global scales, regional power blackouts, traffic gridlock, diseases and epidemics, etc. Given the complex dynamics of these systems, a first standard attempt to quantify and classify the characteristics and the possible different regimes consists of (1) identifying discrete events, (2) measuring their sizes, and (3) constructing their probability distribution. The interest in probability distributions in complex systems has the following roots. They offer a natural metric of the relative rate of occurrence of small versus large events, and thus of the associated risks. As such, they constitute essential components of risk assessment and prerequisites of risk management. Their mathematical form can provide constraints and guidelines to identify the underlying mechanisms at their origin and thus at the origin of the behavior of the complex system under study. This improved understanding may lead to better forecasting skills, and even to the option (or illusion (?)) of (a certain degree of) control [1,2].

Probability Distributions

Let us first establish some notation and vocabulary. Consider a process X whose outcome is a real number. The probability density function P(x) of X (also called probability distribution or pdf) is such that the probability that X is found in a small interval dx around x is P(x)dx.
The probability that X is between a and b is therefore given by the integral of P(x) between a and b:

P(a < X < b) = \int_a^b P(x) \, dx .   (1)
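As a concrete illustration of Eq. (1), the probability of finding X in an interval can be approximated by numerically integrating the pdf. The exponential density and the helper name `prob_between` below are illustrative assumptions, not taken from the text:

```python
import math

def prob_between(pdf, a, b, n=100_000):
    """Approximate P(a < X < b), the integral of pdf from a to b (midpoint rule)."""
    h = (b - a) / n
    return sum(pdf(a + (k + 0.5) * h) for k in range(n)) * h

# Example pdf: the exponential density P(x) = exp(-x) on x >= 0.
p = prob_between(lambda x: math.exp(-x), 1.0, 2.0)
# Analytically, P(1 < X < 2) = e**-1 - e**-2.
```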
The pdf P(x) depends on the units used to quantify the variable x and has the dimension of the inverse of x, such that P(x)dx, being a probability, i.e., a number between 0 and 1, is dimensionless. In a change of variable, say x \to y = f(x), the probability is invariant. Thus, the invariant quantity is the probability P(x)dx and not the pdf P(x). We thus have

P(x) \, dx = P(y) \, dy ,   (2)

leading to P(y) = P(x) \, |df/dx|^{-1}, taking the limit of infinitesimal intervals. By definition, P(x) \geq 0. It is normalized, \int_{x_{min}}^{x_{max}} P(x) \, dx = 1, where x_{min} and x_{max} (often \pm\infty) are the smallest and largest possible values for x, respectively. The empirical estimation of the pdf P(x) is usually plotted with the horizontal axis scaled as a graded series for the measure under consideration (the magnitude of the earthquakes, etc.) and the vertical axis scaled for the number of outcomes or measures in each interval of horizontal value (the number of earthquakes of magnitude between 1 and 2, between 2 and 3, etc.). This implies a "binning" into small intervals. If the data is sparse, the number of events in each bin becomes small and can fluctuate, leading to a poor representation of the data. In this case, it is useful to construct the cumulative distribution P_{\leq}(x) defined by

P_{\leq}(x) = P(X \leq x) = \int_{-\infty}^{x} P(y) \, dy ,   (3)
which is much less sensitive to fluctuations. P_{\leq}(x) gives the fraction of events with values less than or equal to x. P_{\leq}(x) increases monotonically with x from 0 to 1. Similarly, we can define the complementary cumulative (or survivor) distribution P_{>}(x) = 1 - P_{\leq}(x). For random variables which take only discrete values x_1, x_2, \ldots, x_n, the pdf is made of a discrete sum of Dirac functions: (1/n)[\delta(x - x_1) + \delta(x - x_2) + \cdots + \delta(x - x_n)]. The corresponding cumulative distribution function (cdf) P_{\leq}(x) is a staircase. There are also more complex distributions made of a continuous cdf but which are singular with respect to the Lebesgue measure dx. An example is the Cantor distribution constructed from the Cantor set (see for instance Chap. 5 in [3]). Such a singular cdf is continuous but its derivative is zero almost everywhere: the pdf does not exist (see e.g. [4]).

Brief Survey of Probability Distributions

Statistical physics is rich with probability distributions. The most famous is the Boltzmann distribution, which describes the probability that the configuration of a system in thermal equilibrium has a given energy. Its extension to out-of-equilibrium systems is the subject of intense scrutiny [5]; see also Chap. 7 in [3] and references therein. Special cases include the Maxwell–Boltzmann distribution, the Bose–Einstein distribution and the Fermi–Dirac distribution.
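The binning-free construction behind Eq. (3) can be sketched directly in code: sorting the sample gives the empirical cumulative distribution, and the survivor distribution follows by complementation. The function names here are illustrative, not from the text:

```python
def ecdf(samples):
    """Empirical P_<=(x): at the k-th smallest value, the fraction of samples
    less than or equal to it is k/n. No binning is required, which is why
    the cumulative estimate fluctuates less than a binned density estimate."""
    xs = sorted(samples)
    n = len(xs)
    return xs, [(k + 1) / n for k in range(n)]

def survivor(samples):
    """Complementary cumulative (survivor) distribution P_>(x) = 1 - P_<=(x)."""
    xs, cum = ecdf(samples)
    return xs, [1.0 - c for c in cum]

xs, cum = ecdf([2.0, 1.0, 3.0, 2.0])
```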
In the quest to characterize complex systems, two distributions have played a leading role: the normal (or Gaussian) distribution and the power law distribution. The Gaussian distribution is the paradigm of the "mild" family of distributions. In contrast, the power law distribution is the representative of the "wild" family. The contrast between "mild" and "wild" is illustrated by the following questions. What is the probability that someone has twice your height? Essentially zero! Height, weight and many other variables are distributed with "mild" pdfs with a well-defined typical value and relatively small variations around it. The Gaussian law is the archetype of "mild" distributions. What is the probability that someone has twice your wealth? The answer of course depends somewhat on your wealth, but in general there is a non-vanishing fraction of the population twice, ten times or even one hundred times as wealthy as you are. This was noticed at the end of the 19th century by Pareto, after whom the Pareto law has been named, which describes the power law distribution of wealth [6,7], a typical example of "wild" distributions.

The Normal (or Gaussian) Distribution

The expression of the Gaussian probability density function of a random variable x with mean x_0 and standard deviation \sigma reads

P_G(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - x_0)^2}{2\sigma^2} \right) ,
defined for -\infty < x < +\infty .   (4)
The importance of the normal distribution as a model of quantitative phenomena in the natural and social sciences can be in large part attributed to the central limit theorem. Many measurements of physical as well as social phenomena can be well approximated by the normal distribution. While the mechanisms underlying these phenomena are often unknown, the use of the normal model can be theoretically justified by assuming that many small, independent effects additively contribute to each observation. The Gaussian distribution is also justified as the most parsimonious choice in absence of information other than just the mean and the variance: it maximizes the information entropy among all distributions with known mean and variance. As a result of the central limit theorem, the normal distribution is the most widely used family of distributions in statistics and many statistical tests are based on the assumption of asymptotic normality of the data.
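The additive mechanism invoked above can be illustrated with a minimal simulation of the central limit theorem (an illustration, not taken from the text): sums of many small, independent, uniformly distributed effects are approximately Gaussian.

```python
import random
import statistics

random.seed(0)
n_terms, n_samples = 50, 20_000
# Each observation is the sum of 50 small, independent effects.
sums = [sum(random.uniform(-1, 1) for _ in range(n_terms))
        for _ in range(n_samples)]

mean = statistics.fmean(sums)
std = statistics.stdev(sums)
# Each uniform(-1, 1) term has variance 1/3, so the sum has mean 0 and
# standard deviation sqrt(50/3), roughly 4.08.
within_one_sigma = sum(abs(s - mean) < std for s in sums) / n_samples
# For a Gaussian, about 68% of the mass lies within one standard deviation.
```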
In probability theory, the standard Gaussian distribution arises as the limiting distribution of a large class of distributions of random variables (with suitable centering and normalization) characterized by a finite variance, which is nothing but the statement of the central limit theorem (see Chapter 2 in [3]). At the beginning of the twenty-first century, when power laws are often taken as the hallmark of complexity, it is interesting to reflect on the fact that the previous giants of science in the eighteenth and nineteenth centuries (Halley, Laplace, Quetelet, Maxwell and so on) considered that the Gaussian distribution expressed a kind of universal law of nature and of society. In particular, the Belgian astronomer Adolphe Quetelet was instrumental in popularizing the statistical regularities discovered by Laplace in the frame of the Gaussian distribution, which influenced the likes of John Herschel and John Stuart Mill and led Comte to define the concept of "social physics."

The Power Law Distribution

A probability distribution function P(x) exhibiting a power law tail is such that

P(x) \propto \frac{C}{x^{1+\mu}} ,   for x large,   (5)
possibly up to some large limiting cut-off. The exponent \mu (also referred to as the "index") characterizes the nature of the tail: for \mu < 2, one speaks of a "heavy tail" for which the variance is theoretically not defined. For power laws, the scale factor C plays a role analogous to the role of the variance in Gaussian distributions (see Chapter 4 in [3]). In particular, it enjoys the additivity property: the scale factor of the distribution of the sum of several independent random variables, each with a distribution exhibiting a power law tail with the same exponent \mu, is equal to the sum of the scale factors characterizing each distribution of each random variable in the sum. A more general form is
P(x) \propto \frac{L(x)}{x^{1+\mu}} ,   for x large,   (6)

where L(x) is a slowly varying function, defined by \lim_{x \to \infty} L(tx)/L(x) = 1 for any finite t (typically, L(x) is a logarithm \ln(x) or a power of a logarithm such as (\ln(x))^n with n finite). In mathematical language, a function such as (6) is said to be "regularly varying". This more general form means that the power law regime is an asymptotic statement, holding only as a better and better approximation as one considers larger and larger x values. Power laws obey the symmetry of scale invariance, that is, they verify the following defining property: for an
arbitrary real number \lambda, there exists a real number \gamma such that

P(x) = \gamma \, P(\lambda x) ,   \forall x .   (7)

Obviously, \gamma = \lambda^{1+\mu}. The relation (7) means that the ratio of the probabilities of occurrence of two sizes x_1 and x_2 depends only on their ratio x_1/x_2 and not on their absolute values. For instance, according to the Zipf law (\mu = 1) for the distribution of city sizes, the ratio of the number of cities with more than 1 million inhabitants to those with more than 100,000 persons is the same as the ratio of the number of cities with more than 100,000 inhabitants to those with more than 10,000 persons, both ratios being equal to 1/10. The symmetry of scale invariance (7) extends to the space of functions the concept of scale invariance which characterizes fractal geometric objects. It should be stressed that, when they exhibit a power-law-like shape, most empirical distributions do so only over a finite range of event sizes, either bounded between a lower and an upper cut-off [8,9,10,11], or above a lower threshold, i.e., only in the tail of the observed distribution [12,13,14,15]. Power law distributions and, more generally, regularly varying distributions remain robust functional forms under a large number of operations, such as linear combinations, products, minima, maxima, order statistics, and powers, which may also explain their ubiquity and attractiveness. Jessen and Mikosch [16] give the conditions under which transformations of power law distributions are also regularly varying, possibly with a different exponent (see also Sect. 4.4 in [3] for a heuristic presentation of similar results).

The Stretched Exponential Distribution

The so-called stretched exponential (SE) distribution has been found to be a versatile intermediate distribution interpolating between "thin tail" (Gaussian, exponential, . . . ) and very "fat tail" distributions.
In particular, Laherrère and Sornette [17] have found that several examples of fat-tailed distributions in the natural and social sciences, often considered to be good examples of power laws, could sometimes be represented as well as or even better by an SE distribution. Malevergne et al. [18] present systematic statistical tests comparing the SE family with the power law distribution in the context of financial return distributions. The SE family is defined by the following expression for the survival distribution (also called the complementary cumulative distribution function):

\bar{P}_u(x) = 1 - P_u(x) = \exp\left[ -\left(\frac{x}{d}\right)^c + \left(\frac{u}{d}\right)^c \right] ,   for x \geq u .   (8)
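A small numerical sketch of the survival function (8) (the function name and parameter values are illustrative assumptions) shows its two limiting behaviors: c = 1 gives a pure exponential tail, and small c with c(u/d)^c held fixed at \mu approaches the Pareto tail (u/x)^\mu:

```python
import math

def se_survival(x, u, d, c):
    """Survival function of Eq. (8): exp(-(x/d)**c + (u/d)**c) for x >= u."""
    return math.exp(-((x / d) ** c) + (u / d) ** c)

# c = 1: pure exponential tail, so survival ratios over equal increments agree.
r1 = se_survival(2.0, 1.0, 1.0, 1.0) / se_survival(1.0, 1.0, 1.0, 1.0)
r2 = se_survival(3.0, 1.0, 1.0, 1.0) / se_survival(2.0, 1.0, 1.0, 1.0)

# Small c with c*(u/d)**c = mu: close to the Pareto tail (u/x)**mu.
mu, u, c = 1.5, 1.0, 0.01
d = u / (mu / c) ** (1.0 / c)  # solves c*(u/d)**c = mu for d
se_tail = se_survival(5.0, u, d, c)
pareto_tail = (u / 5.0) ** mu
```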
The constant u is a lower threshold that can be changed to increasingly emphasize the tail of the distribution as u is increased. The structural exponent c controls the "thin" versus "heavy" nature of the tail.

1. For c = 2, the SE distribution (8) has the same asymptotic tail as the Gaussian distribution.
2. For c = 1, expression (8) recovers the pure exponential distribution.
3. For c < 1, the tail of \bar{P}_u(x) is fatter than an exponential, and corresponds to the regime of sub-exponentials (see Chapter 6 in [3]).
4. For c \to 0 with

c \left(\frac{u}{d}\right)^c \to \mu ,   (9)

the SE distribution converges to the Pareto distribution with tail exponent \mu. Indeed, we can write the pdf as

c \frac{x^{c-1}}{d^c} \exp\left( \frac{u^c - x^c}{d^c} \right)
  = \frac{c}{u} \left(\frac{u}{d}\right)^c \left(\frac{x}{u}\right)^{c-1} \exp\left[ -\left(\frac{u}{d}\right)^c \left( \left(\frac{x}{u}\right)^c - 1 \right) \right]
  \simeq \frac{1}{x} \, c \left(\frac{u}{d}\right)^c \exp\left[ -c \left(\frac{u}{d}\right)^c \ln \frac{x}{u} \right] ,   as c \to 0 ,
  \simeq \frac{\mu}{x} \exp\left( -\mu \ln \frac{x}{u} \right)
  = \mu \frac{u^\mu}{x^{1+\mu}} ,   (10)

which is the pdf of the Pareto power law model with tail index \mu. This implies that, as c \to 0, the characteristic scale d of the SE model must also go to zero with c to ensure its convergence towards the Pareto distribution. This shows that the Pareto model can be approximated with any desired accuracy on an arbitrary interval (u > 0, U) by the SE model with parameters (c, d) satisfying Eq. (9), where the arrow is replaced by an equality. The limit c \to 0 provides any desired approximation to the Pareto distribution uniformly on any finite interval (u, U). This deep relationship between the SE and power law models allows us to understand why it can be very difficult to decide, on a statistical basis, which of these models best fits the data [17,18]. This insight can be made rigorous to develop a formal statistical test of the SE hypothesis versus the Pareto hypothesis [18,19].

From a theoretical viewpoint, the class of distributions (8) is motivated in part by the fact that large deviations of multiplicative processes are generically distributed with stretched exponential distributions [20]. Stretched exponential distributions are also parsimonious examples of the important subset of sub-exponentials, that is, of the general class of distributions decaying more slowly than an exponential [21]. This class of sub-exponentials shares several important properties of heavy-tailed distributions [22] not shared by exponentials or distributions decreasing faster than exponentials: for instance, they have "fat tails" in the sense of the asymptotic probability weight of the maximum compared with the sum of large samples [4] (see also Chaps. 1 and 6 in [3]). Notwithstanding their fat-tailness, stretched exponential distributions have only finite moments, in contrast with regularly varying distributions for which moments of order equal to or larger than the tail index \mu are not defined. However, they do not admit an exponential moment, which leads to problems in the reconstruction of the distribution from the knowledge of their moments [23]. In addition, the existence of all moments is an important property allowing for an efficient estimation of any high-order moment, since it ensures that the estimators are asymptotically Gaussian. In particular, for stretched-exponentially distributed random variables, the variance, skewness and kurtosis can be accurately estimated, contrary to random variables with a regularly varying distribution with tail index smaller than about 5.

The Fascination with Power Laws

Probability distribution functions with a power law dependence in terms of event or object sizes seem to be ubiquitous statistical features of natural and social systems. It has repeatedly been argued that such an observation relies on an underlying self-organizing mechanism, and therefore power laws should be considered as the statistical imprints of complex systems. It is often claimed that the observation of a power law relation in data points to specific kinds of mechanisms at its origin, which can often suggest a deep connection with other, seemingly unrelated systems. In complex systems, the appearance of power law distributions is often thought to be the signature of hierarchy and robustness.
In the last two decades, such claims have been made, for instance, for earthquakes, weather and climate changes, solar flares, the fossil record, and many other systems, to promote the relevance of self-organized criticality as an underlying mechanism for the organization of complex systems [24]. This claim is often unwarranted, as there are many non-self-organizing mechanisms that produce power law distributions [3,25,26,27]. Research on the origins of power law relations, and efforts to observe and validate them in the real world, is extremely active in many fields of modern science, including physics, geophysics, biology, medical sciences, computer science, linguistics, sociology and economics. This section briefly summarizes the present understanding.
Statistical Physics in General and the Theory of Critical Phenomena

The study of critical phenomena in statistical physics suggests that power laws emerge close to special critical or bifurcation points separating two different phases or regimes of the system. In systems at thermodynamic equilibrium modeled by general spin models, renormalization group theory [28] has demonstrated the existence of universality, so that diverse systems exhibit the same critical exponents and identical scaling behavior as they approach criticality, i.e., they share the same fundamental macroscopic properties. For instance, the behavior of water and CO2 at their boiling points at a certain critical pressure, and that of a magnet at its Curie point, fall in the same universality class because they can be characterized by order parameters with the same symmetries and dimensions in the same space dimension. In fact, almost all material phase transitions are described by a small set of universality classes. From this perspective, the fascination with power laws reflects the fact that they characterize the many coexisting and delicately interacting scales at a critical point. The existence of many scales leading to complex geometrical properties is often associated with fractals [12]. While it is true that critical points and fractals share power law relations, power law relations and power law distributions are not the same. The latter, which are the subject of this essay, describe the probability density function or frequency of occurrence of objects or events, such as the frequency of earthquakes in a given magnitude range. In contrast, power law relations between two variables (such as the magnetization and temperature in the case of the Curie point of a magnet) describe a functional relation between these two variables. Both power law relations and power law distributions can result from the existence of a critical point.
A simple example in percolation is (i) the power law dependence of the size of the largest cluster as a function of the distance from the percolation threshold, and (ii) the power law distribution of cluster sizes at the percolation threshold [29].

Out-of-Equilibrium Phase Transition and Self-Organized Critical Systems (SOC)

In the broadest sense, self-organized criticality (SOC) refers to the spontaneous organization of a system, driven from the outside, into a globally stationary state, which is characterized by self-similar distributions of event sizes and fractal geometrical properties. This stationary state is dynamical in nature and is characterized by statistical fluctuations, which are generically referred to as "avalanches."
The term "self-organized criticality" contains two parts. The word "criticality" refers to the state of a system at a critical point at which the correlation length and the susceptibility become infinite in the infinite size limit, as in the preceding section. The label "self-organized" is often applied indiscriminately to pattern formation among many interacting elements. The concept is that the structuration, the patterns and the large-scale organization appear spontaneously. The notion of self-organization refers to the absence of control parameters. In this class of mechanisms, where the critical point is the attractor, the situation becomes more complicated as the number of universality classes proliferates. In particular, it is not generally well understood why sometimes local details of the dynamics may change the macroscopic properties completely, while in other cases the universality class is robust. In fact, the more we learn about complex out-of-equilibrium systems, the more we realize that the concept of universality developed for critical phenomena at equilibrium has to be enlarged to embody a more qualitative meaning: the critical exponents defining the universality classes are often very sensitive to many (but not all) details of the models [30]. Of course, one of the hailed hallmarks of SOC is the existence of power law distributions of "avalanches" and of other quantities [3,24,31].

Non-exhaustive List of Mechanisms Leading to Power Law Distributions

There are many physical and mathematical mechanisms that generate power law distributions and self-similar behavior. Understanding how a mechanism is affected by microscopic laws constitutes an active field of research. We can propose the following non-exhaustive list of mechanisms that have been found to operate in different complex systems, and which can lead to power law distributions of avalanches or cluster sizes. For most of these mechanisms, we refer the reader to Chaps.
14 and 15 in [3] and to [27,32] for detailed explanations and the relevant bibliography. However, some of the mechanisms mentioned here have not been reviewed in these three references and are thus new to the list developed in particular in [3]. We should also stress that some of the mechanisms in this list are actually different incarnations of the same underlying idea (for instance preferential attachment, which is a re-discovery of the Yule process; see [33] for an informative historical account).

1. percolation, fragmentation and other related processes,
2. directed percolation and its universality class of so-called "contact processes,"
3. cracking noise and avalanches resulting from the competition between frozen disorder and local interactions, as exemplified in the random field Ising model, where avalanches result from hysteretic loops [34],
4. random walks and their properties associated with their first passage statistics [35], in homogeneous as well as in random landscapes,
5. flashing annihilation in Verhulst kinetics [36],
6. sweeping of a control parameter towards an instability [25,37],
7. proportional growth by multiplicative noise with constraints (the Kesten process [38] and its generalization, for instance in terms of generalized Lotka–Volterra processes [39]), whose ancestry can be traced to Simon and Yule,
8. competition between multiplicative noise and birth-death processes [40],
9. growth by preferential attachment [32],
10. exponential deterministic growth with random times of observation (which gives the Zipf law) [41],
11. constrained optimization with power law constraints (HOT for "highly optimized tolerant"),
12. control algorithms, which employ optimal parameter estimation based on past observations, shown to generate broad power law distributions of fluctuations and of their corresponding corrections in the control process [42,43],
13. on-off intermittency as a mechanism for power law pdfs of laminar phases [44,45],
14. self-organized criticality, which comes in many flavors as explained in Chapter 15 of [3]: cellular automata sandpiles with and without conservation laws, systems made of coupled elements with threshold dynamics, critical de-synchronization of coupled oscillators of relaxation, nonlinear feedback of the order parameter onto the control parameter, generic scale invariance, mapping onto a critical point, and extremal dynamics.
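Mechanism 7 above can be illustrated by a minimal simulation (an illustrative sketch, not from the text): the Kesten recursion X_{t+1} = a_t X_t + b_t, with E[ln a_t] < 0 but a_t > 1 with positive probability, relaxes to a stationary distribution with a power law tail.

```python
import random

random.seed(1)

def kesten_sample(n_steps=300):
    """One draw from (approximately) the stationary state of the Kesten
    recursion x -> a*x + b, here with a drawn from {0.5, 1.5} and b = 1."""
    x = 1.0
    for _ in range(n_steps):
        a = random.choice([0.5, 1.5])  # E[ln a] < 0: contraction on average
        x = a * x + 1.0                # the additive term keeps x away from 0
    return x

samples = sorted(kesten_sample() for _ in range(3000))
median = samples[len(samples) // 2]
# Heavy tail: the sample maximum dwarfs the median, unlike for "mild"
# (e.g. Gaussian) fluctuations. For this choice of a, E[a] = 1, which
# corresponds to a tail exponent of 1.
tail_ratio = samples[-1] / median
```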
If there is one lesson to extract from this impressive list, it is that, when observing an approximate linear trend in the log-log plot of some data distribution, one should refrain from jumping to hasty conclusions about the implications of this approximate power law behavior. Another lesson is that power laws appear to be so ubiquitous perhaps because many roads lead to them!
Probability Distributions in Complex Systems
Testing for Power Law Distributions in Your Data

Although power law distributions are attractive for their simplicity (they are straight lines on log-log plots) and may be justified on theoretical grounds as discussed above, demonstrating that data do indeed follow a power law distribution requires more than simple fitting. Indeed, several alternative functional forms can appear to follow a power law over some range, such as stretched exponentials and log-normal distributions. Thus, validating that a given distribution is a power law is not easy and there is no silver bullet. Clauset et al. [46] have recently summarized some statistical techniques for making accurate parameter estimates for power law distributions, based on maximum likelihood methods and the Kolmogorov–Smirnov statistic. They illustrate these statistical methods on 24 real-world data sets from a range of different disciplines. In some cases, they find that power laws are consistent with the data, while in others the power law is ruled out. The log-likelihood ratio that they propose is, however, not warranted for non-nested models [47]. Here, we offer some advice for the characterization of a power law distribution as a possibly adequate representation of a given data set. We emphasize good sense and practical aspects.

1. Survivor distribution. First, the survival distribution should be constructed using raw data by ranking the values in decreasing order. Then, rank versus value immediately gives a non-normalized survival distribution. The advantage of this construction is that it does not require binning or kernel estimation, which is a delicate art, as we have alluded to.

2. Probability density function. The previous construction of the complementary cumulative (or survivor) distribution function should be complemented with that of the density function.
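The constructions of steps 1 and 2 can be sketched as follows; this is a generic rank-ordering and logarithmic-binning recipe with function names of our own choosing, assuming strictly positive data.

```python
import math

def survival_distribution(data):
    """Rank-ordering construction of the (non-normalized) survival function.

    Sorting in decreasing order, the rank of a value counts how many
    observations are >= that value, i.e. N * P(X >= x) up to normalization.
    """
    ordered = sorted(data, reverse=True)
    return [(x, rank) for rank, x in enumerate(ordered, start=1)]

def log_binned_pdf(data, n_bins=20):
    """Histogram with logarithmically spaced bins, normalized by bin width.

    Logarithmic bins keep a roughly constant number of counts per bin in the
    tail of a broad distribution, where linear bins would be mostly empty.
    Assumes all data are > 0 and not all equal.
    """
    lo, hi = min(data), max(data)
    edges = [lo * (hi / lo) ** (k / n_bins) for k in range(n_bins + 1)]
    counts = [0] * n_bins
    for x in data:
        # bin index from the log position of x; the last edge is inclusive
        k = min(int(n_bins * math.log(x / lo) / math.log(hi / lo)), n_bins - 1)
        counts[k] += 1
    n = len(data)
    return [((edges[k] * edges[k + 1]) ** 0.5,             # geometric bin center
             counts[k] / (n * (edges[k + 1] - edges[k])))  # density estimate
            for k in range(n_bins)]

surv = survival_distribution([1.0, 5.0, 2.0])
# → [(5.0, 1), (2.0, 2), (1.0, 3)]
```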
Indeed, it is well-known that the cumulative distribution, being a "cumulative" integral of the density function as its name indicates, may be contaminated by disturbances at one end of the density function, leading to rather long cross-overs that may hide or perturb the power law. For instance, if the generating density distribution is a power law truncated by an exponential, as found for critical systems not exactly at their critical point or in the presence of finite-size effects [48], the power law part of the cumulative distribution will be strongly distorted, leading to a spurious estimation of the exponent μ. This problem can be in large part alleviated by constructing the pdf using binning or, even better, kernel methods (see the very readable article [49] and references therein). By testing
and comparing the survival and the probability density distributions, one obtains either a confirmation of power law scaling or an understanding of the origin(s) of the deviations from the power law.

3. Structural analysis by visual inspection. Once these first two steps have been performed, we recommend a preliminary visual exploration by plotting the survival and density distributions in (i) linear-linear coordinates, (ii) log-linear coordinates (linear abscissa and logarithmic ordinate) and (iii) log-log coordinates (logarithmic abscissa and logarithmic ordinate). The visual comparison between these three plots provides a fast and intuitive view of the nature of the data. A power law distribution will appear as a convex curve in the linear-linear and log-linear plots and as a straight line in the log-log plot. A Gaussian distribution will appear as a bell-shaped curve in the linear-linear plot, as an inverted parabola in the log-linear plot and as a strongly concave, sharply falling curve in the log-log plot. An exponential distribution will appear as a convex curve in the linear-linear plot, as a straight line in the log-linear plot and as a concave curve in the log-log plot. Having in mind the shapes of these three reference distributions in these three representations provides fast and useful reference points for classifying the unknown distribution under study. For instance, if the log-linear plot shows a convex shape (upward curvature), we can conclude that the distribution has a tail fatter than an exponential. Then, the log-log plot will confirm whether a power law is a reasonable description. If the log-log plot shows a downward curvature (concave shape), together with the information that the log-linear plot shows a convex shape, we can conclude that the distribution has a tail fatter than an exponential but thinner than a power law.
For example, it could be a gamma distribution (~ x^n exp[-x/x_0] with n > 0) or a stretched exponential distribution (expression (8) with c < 1). Only a more detailed quantitative analysis will allow one to refine the diagnostic, often with less-than-definite conclusions (see as an illustration the detailed statistical analysis comparing the power law to the stretched exponential distribution to describe the distribution of financial returns [18]). The deviations from linearity in the log-log plot suggest the boundaries within which the power law regime holds. We say "suggest," as a visual inspection is only a first step which can actually be misleading. While we recommend a first visual inspection, it is only a first indication, not a proof. It is a necessary step to convince
oneself (and the reviewers and journal editors) but certainly not a sufficient condition. It is a standard rule of thumb that power law scaling is thought to be meaningful if it holds over at least two to three decades on both axes and is bracketed by deviations on both sides whose origins can be understood (for instance, due to insufficient sampling and/or finite-size effects). As an illustration of the potential errors stemming from visual inspection, we refer to the discussion by Sornette et al. [50] of the claim of Pacheco et al. [51] of the existence of a break in the Gutenberg–Richter distribution of earthquake magnitudes at m = 6.4 for California, a break claimed to reveal the finiteness of the crust thickness. This claim has subsequently been shown to be unsubstantiated, as the Gutenberg–Richter law (which is a power law when expressed in earthquake energies or seismic moments) seems to remain valid up to magnitudes of 7.5 in California and up to magnitudes of about 8–8.5 worldwide. This visual break at m = 6.4 turned out to be just a statistical deviation, completely expected from the nature of power law fluctuations [15,52].

4. OLS fitting. The next step is often to perform an ordinary least-squares (OLS) regression of the data (survival distribution or kernel-reconstructed density) on the logarithm of the variables, in order to estimate the parameters of the power law. These parameters are the exponent μ, the scale factor C and possibly an upper threshold or other parameters controlling the crossover to other behaviors outside the scaling regime. Using logarithms ensures that all the terms in the sum of squares over the different data points contribute approximately equally in the OLS.
Otherwise, without logarithms, given the large range of values spanned by a typical power law distribution, a relative error of, say, 1% around a value of the order of 10^4 would have a weight in the sum ten thousand times larger than the weight due to the same relative error of 1% around a value of the order of 10^2, biasing the estimation of the parameters towards fitting preferentially the large values. In addition, in logarithmic units, the estimation of the exponent of a power law constitutes a linear problem which is solved analytically. When performing the OLS estimation on the survival distribution, it is optimal to shift the ranks by 1/2 [53]. With this improvement, the OLS method is typically more robust to deviations from a pure power law form than the Hill estimator discussed below.

5. Maximum likelihood estimation. Using an OLS method to estimate the parameters of a power law implicitly assumes that the distribution of the deviations
from the power law (actually the difference between the logarithm of the data and the logarithm of the power law distribution) is normally distributed. This may not be a suitable approximation. An estimation which removes this assumption consists in using the likelihood method, in which the parameters of the power law are chosen so as to maximize the likelihood function. When the data points are independent, the likelihood function is nothing but the product ∏_{i=1}^{N} P(x_i) over the N data points x_1, x_2, ..., x_N of the power law distribution P(x). In this case, the exponent which maximizes this likelihood (or equivalently and more conveniently its logarithm, called the log-likelihood) is called the Hill estimator [54]. It reads

1/μ̂ = (1/n) Σ_{j=1}^{n} ln(x_j / x_min) ,   (11)

where x_min is the smallest value among the n values used in the data set for the estimation of μ. Since power laws are often asymptotic characteristics of the tail, it is appropriate not to use the full data set but only the upper tail with data values above a lower threshold. Then, plotting 1/μ̂ or μ̂ as a function of the lower threshold usually provides a good sense of the existence of a power law regime: one should expect an approximate stability of 1/μ̂ over some scaling regime. Note that the Hill estimator provides an unbiased estimate of 1/μ, while μ̂ obtained by inverting 1/μ̂ is slightly biased (see e.g., Chapter 6 in [3]). We refer to [55,56] for improved versions and procedures of the Hill estimator which deal with finite ranges and dependence.

6. Non-parametric methods. Methods testing for power law behavior in a given empirical distribution which are non-parametric and not sensitive to the value of the exponent provide useful complements to the above fitting and parametric estimation approaches. Pisarenko et al.
[57] and Pisarenko and Sornette [58] have developed new statistics such that a power law behavior is associated with a zero value of the statistic, independently of the numerical value of the exponent μ, and with a nonzero value otherwise. Plotting these statistics as a function of the lower threshold of the data sample allows one to detect subtle deviations from a pure power law. Lasocki [59] and Lasocki and Papadimitriou [60] have developed another non-parametric approach to detect deviations from a power law, the smoothed bootstrap test for multimodality, which makes it possible to test the complexity of the distribution without specifying any particular probabilistic model. The method relies on testing the hypotheses that the number of modes or the number of bumps exhibited by the distribution
function equals 1. Rejection of one of these hypotheses indicates that the distribution has more complexity than is described by a simple power law. Once the evidence for a power law distribution has been reasonably demonstrated, the most difficult task remains: finding a mechanism and model which can explain the data. Note that the term "explain" has different meanings depending on the expert you are speaking to. For a statistician, having been unable to reject the power law function (5) given the data amounts to saying that the power law model "explains" the data. The emphasis of the statistician will be on refining parametric and non-parametric procedures to test the way the power law "fits" or deviates from the empirical data. In contrast, a physicist or a natural scientist sees this as only a first step, and attributes the word "explain" to the stage where a mechanism described in terms of a more fundamental process or first principles can derive the power law. But even among natural scientists, there is no consensus on what constitutes a suitable "explanation." The reason stems from the different cultures and levels of study in different fields, well addressed in the famous paper "More is different" by Anderson [61]: a suitable explanation for a physicist will frustrate a chemist, whose explanation, in turn, will not satisfy a biologist. Each scientific discipline's concepts are anchored in its characteristic fundamental scientific level, which provides the underpinning for the next scientific level of description (think for instance of the hierarchy: physics → chemistry → molecular biology → cell biology → animal biology → ethology → sociology → economics → ...). Once a model at a given scientific description level has been proposed, the action of the model on inputs gives outputs which are compared with the data. Verifying that the model, inspired by the preliminary power law evidence, adequately fits this power law is a first step.
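Steps 4 and 5 above can be made concrete with the following sketch of the rank-shifted log-log OLS fit and of the Hill estimator of Eq. (11); it assumes clean, positive, independent data and omits the finite-range refinements of [55,56].

```python
import math

def ols_loglog_exponent(data):
    """Estimate the power law exponent mu by OLS on the survival function.

    Regresses ln(rank - 1/2) on ln(value); the 1/2 rank shift reduces the
    bias of the naive rank-ordering fit [53].
    """
    ordered = sorted(data, reverse=True)
    xs = [math.log(x) for x in ordered]
    ys = [math.log(rank - 0.5) for rank in range(1, len(ordered) + 1)]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope  # ln P(X >= x) ~ -mu * ln x, so mu = -slope

def hill_estimator(data, x_min):
    """Hill estimator: 1/mu_hat = (1/n) * sum_j ln(x_j / x_min), Eq. (11)."""
    tail = [x for x in data if x >= x_min]
    inv_mu = sum(math.log(x / x_min) for x in tail) / len(tail)
    return 1.0 / inv_mu

# Tiny synthetic check: with x = [1, e, e^2] and x_min = 1, the mean
# log-ratio is (0 + 1 + 2)/3 = 1, so mu_hat ≈ 1.
print(hill_estimator([1.0, math.e, math.e ** 2], x_min=1.0))
```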
Unfortunately, much too often, scientists stop there and are happy to report that they have a model that fits their empirical power law data. This is not good science. Keeping in mind the many possible mechanisms at the origin of power law distributions reviewed above, a correct procedure is to run the candidate model to get other predictions that can themselves be put to the test. This validation is essential to determine the degree to which the model is an accurate representation of the real world from the perspective of its intended uses. Reviewing a large body of literature devoted to the problem of validation, Sornette et al. [62] have proposed a synthesis in which the validation of a given model is formulated as an iterative construction process that mimics the often implicit process occurring in
the minds of scientists. Validation is nothing but the progressive build-up of trust in the model, based on testing it against non-redundant novel experiments or data, that allows one to make a decision and act on that basis. The applications of this validation program to a cellular automaton model for earthquakes, to a multifractal random walk model for financial time series, to an anomalous diffusion model for solar radiation transport in the cloudy atmosphere, and to a computational fluid dynamics code for the Richtmyer–Meshkov instability, exemplify the importance of going beyond the simple qualification of a power law.

Beyond Power Laws: "Kings"

The Standard View

Power law distributions embody the notion that extreme events are not exceptional 9-sigma events (to refer to the terminology using the Gaussian bell curve and its standard deviation as the metric to quantify deviations from the mean). Instead, extreme events should be considered as rather frequent and part of the same organization as the other events. In this view, a great earthquake is just an earthquake that started small... and did not stop; it is inherently unpredictable because it shares all the properties and characteristics of smaller events (except for its size), so that no genuinely informative precursor can be identified [63]. This is the view expounded by Bak and coworkers in their formulation of the concept of self-organized criticality [24,64]. In the following, we outline several promising directions of research that expand on these ideas.

Self-Organized Criticality Versus Criticality

However, there are many suggestions that inherent unpredictability does not need to be the case. One argument is that criticality and self-organized criticality (SOC) can actually co-exist. The hallmark of criticality is the existence of specific precursory patterns (increasing susceptibility and correlation length) in space and time.
Continuing with the example of earthquakes, the idea that a great earthquake could result from a critical phenomenon has been put forward by different groups, starting almost three decades ago [65,66,67]. Attempts to link earthquakes and critical phenomena find support in the evidence that rupture in heterogeneous media is similar to a critical phenomenon (see Chapter 13 of [3] and references therein). Also indicative is the often-reported observation of increased intermediate-magnitude seismicity before large events [68,69]. An illustration of the coexistence of criticality and of SOC is found in a simple sandpile model of earthquakes on a hierarchical fault structure [70]. Here, the important ingredient is to take into account both nonlinear dynamics and complex geometry. From the point of view of self-organized criticality, this is surprising news: large earthquakes do not lose their identity. In the model of Huang et al. [70], a large earthquake is different from a small one, a very different story from the one told by common SOC wisdom, in which any precursory state of a large event is essentially identical to a precursory state of a small event and an earthquake does not "know" how large it will become. The difference comes from the absence of geometry in standard SOC models. Reintroducing geometry is essential. In models with hierarchical fault structures, one finds a degree of predictability of large events.

Beyond Power Laws: Five Examples of "Kings"

Are power laws the whole story? The following examples suggest that some extreme events are even "wilder" than predicted by the extrapolation of power law distributions. They can be termed "outliers" or, even better, "kings" [17]. According to the definition in the Engineering Statistics Handbook [71], "An outlier is an observation that lies an abnormal distance from other values in a random sample from a population." Here, we follow Laherrère and Sornette [17] and use the term "king" to refer to events which lie even beyond the extrapolation of the fat tail distribution of the rest of the population.

Material failure and rupture processes. There is now ample evidence that the distribution of damage events, for instance quantified by the acoustic emission radiated by micro-cracking in heterogeneous systems, is well-described by a Gutenberg–Richter-like power law [72,73,74,75]. But consider now the energy released in the final global event rupturing the system in pieces!
This release of energy is many, many times larger than the largest event ever recorded in the power law distribution. Material rupture thus exemplifies the co-existence of a power law distribution and a catastrophic event lying beyond the power law.

Gutenberg–Richter law and characteristic earthquakes. In seismo-tectonics, the situation is muddy because of the difficulties of unambiguously defining the spatial domain of influence of a given fault. Researchers who have delineated a spatial domain surrounding a clearly mapped large fault claim to find a Gutenberg–Richter distribution up to a large-magnitude region characterized by a bump or anomalous rate of large earthquakes. These large earthquakes have rupture lengths comparable with the fault length [76,77].
If proven valid, this concept of a characteristic earthquake provides another example in which a "king" coexists with a power law distribution of smaller events. Others have countered that this bump disappears when removing the somewhat artificial partition of the data [78,79], so that the characteristic earthquake concept may be a statistical artifact. In this view, a particular fault may appear to have characteristic earthquakes, but the stress-shedding region, as a whole, behaves according to a pure scale-free power law distribution. Several theoretical models have been offered to support the idea that, in some seismic regimes, there is a coexistence between a power law and a large-size regime (the "king" effect). Gil and Sornette [80] reported that this occurs when the characteristic rate for local stress relaxations is fast compared with the diffusion of stress within the system. The interplay between dynamical effects and heterogeneity has also been shown to change the Gutenberg–Richter behavior to a distribution of small events combined with characteristic system-size events [81,82,83,84]. On the empirical side, progress should be made in testing the characteristic earthquake hypothesis by using the predictions of the models to identify, independently of seismicity, those seismic regions in which the king effect is expected. This remains to be done [85].

Extreme king events in the pdf of turbulent velocity fluctuations. The evidence for kings does not require, and is not even synonymous in general with, the existence of a break or a bump in the distribution of event sizes. This point is well-illustrated by shell models of turbulence, which are believed to capture the essential ingredients of these flows while being amenable to analysis.
Such "shell" models replace the three-dimensional spatial domain by a series of uniform onion-like spherical layers with radii increasing as a geometric series 1, 2, 4, 8, ..., 2^n and communicating mostly with nearest neighbors. The quantity of interest is the distribution of velocity variations between two instants at the same position or between two points at the same time. L'vov et al. [86] have shown that they could collapse the pdfs of velocity fluctuations for different scales only for small velocity fluctuations, while no scaling held for large velocity fluctuations. The conclusion is that the distribution of velocity increments seems to be composed of two regions: a region of so-called "normal scaling" and a domain of extreme events. They could also show that these extreme fluctuations of the fluid velocity correspond to intensive peaks propagating coherently (like solitons) over several shell layers with
a characteristic bell-like shape, approximately independent of their amplitude and duration (up to a rescaling of their size and duration). One could summarize these findings by saying that "characteristic" velocity pulses decorate an otherwise scaling probability distribution function.

Outliers and kings in the distribution of financial drawdowns. In a series of papers, Johansen and Sornette [87,88,89] have shown that the distribution of drawdowns in financial markets exhibits the coexistence of a fat tail with a characteristic regime with "kings" (called "outliers" in those papers). The analysis encompasses exchange markets (the US dollar against the Deutsche Mark and against the Yen), the major world stock markets, the U.S. and Japanese bond markets and commodity markets. Here, a drawdown is defined as a continuous decrease in the closing price over successive trading days. The results are found to be robust when using "coarse-grained drawdowns," which allow for a certain degree of fuzziness in the definition of cumulative losses. Interestingly, the pdf of returns at a fixed time scale, usually the daily returns, does not exhibit any anomalous king behavior in the tail: the pdf of financial returns at fixed time scales seems to be adequately described by power law tails [90]. The interpretation proposed by Johansen and Sornette is that these drawdown kings are associated with crashes, which occur due to a global instability of the market that amplifies the normal behavior via strong positive feedback mechanisms [91].

Paris as the king in the Zipf distribution of French city sizes. Since Zipf [92], it has been well-documented that the distribution of city sizes (measured by the number of inhabitants) is, in many countries, a power law with an exponent close to 1. France is no exception, as it exhibits a nice power law distribution of city sizes...
except for Paris, which is completely out of range, a genuine king with a size several times larger than expected from the distribution of the rest of the population of cities [17]. This king effect reveals a particular historical organization of France, whose roots are difficult to unravel. Nevertheless, we think that this king effect embodied by Paris is a significant signal to explain in order to understand the competition between cities in Europe.

Kings and Crises in Complex Systems

We propose that these kings may reveal information which is complementary to, and perhaps sometimes even more important than, the power law pdf.
Indeed, it is essential to realize that the long-term behavior of complex systems is often controlled in large part by rare catastrophic events: the universe was probably born during an extreme explosion (the "big bang"); the nucleosynthesis of all important atomic elements constituting our matter results from the colossal explosions of supernovae; the largest earthquake in California, repeating about once every two centuries, accounts for a significant fraction of the state's total tectonic deformation; landscapes are shaped more by the "millennium" flood that moves large boulders than by the action of all other eroding agents; the largest volcanic eruptions lead to major topographic changes as well as severe climatic disruptions; evolution is characterized by phases of quasi-stasis interrupted by episodic bursts of activity and destruction; financial crashes can destroy trillions of dollars in an instant; political crises and revolutions shape the long-term geopolitical landscape; even our personal life is shaped in the long run by a few key "decisions/happenstances." The outstanding scientific question is thus how such large-scale patterns of catastrophic nature might evolve from a series of interactions from the smallest to increasingly larger scales. In complex systems, it has been found that the organization of spatial and temporal correlations does not stem, in general, from a nucleation phase diffusing across the system. It results, rather, from a progressive and more global cooperative process occurring over the whole system by repetitive interactions; an example is the many occurrences of simultaneous scientific and technical discoveries signaling the global nature of the maturing process. Standard models and simulations of scenarios of extreme events are subject to numerous sources of error, each of which can have a negative impact on the validity of the predictions [93].
Some of the uncertainties are under control in the modeling process; they usually involve trade-offs between faithful descriptions and manageable calculations. Other sources of error are beyond control, as they are inherent in the modeling methodology of the specific disciplines. The two known strategies for modeling are both limited in this respect: analytical theoretical predictions are out of reach for most complex problems, while brute-force numerical resolution of the equations (when they are known) or of scenarios is reliable only in the "center of the distribution," i.e., in the regime far from the extremes where good statistics can be accumulated. Crises are extreme events that occur rarely, albeit with extraordinary impact, and are thus completely under-sampled and poorly constrained. Even the introduction of teraflop (or, in the near future, petaflop) supercomputers does not qualitatively change this fundamental limitation.
Recent developments suggest that non-traditional approaches, based on the concepts and methods of statistical and nonlinear physics, could provide a middle way: directing the numerical resolution of more realistic models and the identification of relevant signatures of impending catastrophes. Enriching the concept of self-organized criticality, the predictability of crises would then rely on the fact that they are fundamentally outliers, e.g., large earthquakes are not scaled-up versions of small earthquakes but the result of specific collective amplifying mechanisms. To address this challenge, the available theoretical tools comprise in particular bifurcation and catastrophe theories, dynamical critical phenomena and the renormalization group, nonlinear dynamical systems, and the theory of partially (spontaneously or not) broken symmetries. Some encouraging results have been gathered on concrete problems, such as the prediction of the failure of complex engineering structures and the detection of precursors of stock market crashes and of human parturition, with exciting potential for earthquakes. At the beginning of the third millennium, it is tempting to extrapolate and forecast that a larger multidisciplinary integration of the physical sciences together with artificial intelligence and soft-computing techniques, fed by analogies and cross-fertilization across the natural sciences, will provide a better understanding of the limits of predictability of catastrophes and adequate measures of risk for a more harmonious and sustainable future for our complex world.

Future Directions

Our exposition has focused mainly on the concept of distributions of event sizes as a first approach to characterizing the organization of complex systems. But probability distribution functions are just one-point statistics and thus provide only an incomplete picture of the organization of complex systems. This opens the road to several better measures of the organization of complex systems.
Statistical estimation of probability distribution functions is a delicate art. An active research field in mathematical statistics which is insufficiently used by practitioners of other sciences is the domain of "robust estimation." Robust estimation techniques are methods which are insensitive to small departures from the idealized assumptions that were used to optimize the algorithm. Such techniques include M-estimates (which follow from maximum likelihood considerations), L-estimates (which are linear combinations of order statistics), and R-estimates (based on statistical rank tests) [94,95,96].
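As a minimal illustration of an M-estimate, the following sketches a Huber-type robust location estimate via iteratively reweighted averaging; the tuning constant, the MAD scale, and the fixed-point iteration are generic textbook choices, not specifics of [94,95,96].

```python
def huber_location(data, k=1.5, tol=1e-9, max_iter=100):
    """Huber M-estimate of location by iteratively reweighted averaging.

    Observations within k * scale of the current estimate get weight 1;
    farther ones are down-weighted, so a few wild outliers barely move
    the estimate (unlike the sample mean).
    """
    data = sorted(data)
    mu = data[len(data) // 2]  # start from the median
    # robust scale: the median absolute deviation (MAD)
    mad = sorted(abs(x - mu) for x in data)[len(data) // 2]
    scale = mad if mad > 0 else 1.0
    for _ in range(max_iter):
        weights = [min(1.0, k * scale / abs(x - mu)) if x != mu else 1.0
                   for x in data]
        new_mu = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        if abs(new_mu - mu) < tol:
            break
        mu = new_mu
    return mu

sample = [9.8, 10.1, 10.0, 9.9, 10.2, 500.0]  # one gross outlier
print(huber_location(sample))                 # stays near 10; the mean is ~92
```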
Ideally, one would like to measure the full multivariate distribution of events, which can in full generality be decomposed into the set of marginal distributions discussed above and the copula of the system. A copula completely embodies the dependence structure of the system [97,98]. Copulas have recently become fashionable in financial mathematics and financial engineering [19]. Their use in other fields of the natural sciences is embryonic but can be expected to blossom. When analyzing a complex system, a common trap is to assume, without critical thinking and testing, that the statistics are stationary, implying that monovariate (marginal) and multivariate distribution functions are sufficient to fully characterize the system. It is indeed a common experience that the dependencies estimated and predicted by standard models change dramatically at certain times. In other words, statistical properties are conditional on specific regimes. The existence of regime-dependent statistical properties has been discussed in particular in climate science, in the medical sciences and in financial economics. In the latter, a quite common observation is that investment strategies which have some moderate beta (coefficient of regression to the market) in normal times can see their beta jump to a much larger value (close to 1 or larger, depending on the leverage of the investment) at certain times when the market collectively dives. Said differently, investments which are thought to be hedged against negative global market trends may actually lose as much as or more than the global market at certain times when a large majority of stocks plunge simultaneously. In other words, the dependency structure and the resulting distributions at different time scales may change in certain regimes. The general problem of the application of mathematical statistics to non-stationary data (including non-stationary time series) is very important but, alas, not much can be done.
There are only a few approaches which may be used, and only under specific conditions, which we briefly mention.

1. Use of algorithms and methods which are robust with respect to possible non-stationarity in the data, such as normalization procedures or the use of quantile samples instead of initial samples.

2. Modeling the non-stationarity by some low-frequency random process, such as a narrow-band random process X(t) = A(t) cos(ωt + φ(t)), where ω ≪ 1 and A(t) and φ(t) are slowly varying amplitude and phase. In this case, the Hilbert transform can be very useful to characterize φ(t) non-parametrically.
Probability Distributions in Complex Systems
3. The estimation of the parameters of a low-frequency process based on a "short" realization is often hopeless. In this case, the only quantity which can be evaluated is the uncertainty (or scatter) of the results due to the non-stationarity.
4. Regime switching, popularized by Hamilton [99] for autoregressive time series models, is a special case of non-stationarity which can be handled with specific methods.

We already discussed the problem of "kings." One key issue that needs more scrutiny is that these outliers are often identified only with metrics adapted to take into account transient increases of the time dependence in the time series, as for instance in the case of returns of individual financial assets [88] (see also Chap. 3 of [91]). These outliers seem to belong to a statistical population which is different from the bulk of the distribution and to require some additional amplification mechanisms active only at special times. The presence of such outliers, both in marginal distributions and in concomitant events, together with the strong impact of crises and crashes in complex systems, suggests the need for novel measures of dependence, different definitions of events, and other time-varying metrics across different variables. This program is part of the more general need for a joint multi-time-scale and multivariate approach to the statistics of complex systems.

The presence of outliers poses the problem of exogeneity versus endogeneity. An event identified as anomalous could perhaps be cataloged as resulting from exogenous influences. The concept of exogeneity is fundamental in statistical estimation [100,101]. Here, we refer to the question of exogeneity versus endogeneity in the broader context of self-organized criticality, inspired in particular by the physical and natural sciences.
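Point 2 of the list above mentions the Hilbert transform as a non-parametric way to recover the slowly varying amplitude and phase of a narrow-band process. A minimal numpy-only sketch (illustrative, not from the original text; the FFT construction is equivalent to `scipy.signal.hilbert`):

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal x + i*H[x], H the Hilbert transform."""
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1.0
    if N % 2 == 0:
        h[N // 2] = 1.0
        h[1:N // 2] = 2.0
    else:
        h[1:(N + 1) // 2] = 2.0
    return np.fft.ifft(X * h)

# Narrow-band process X(t) = A(t) cos(omega*t + phi(t)) with slowly
# varying amplitude and phase (all parameters illustrative).
t = np.linspace(0.0, 100.0, 10000)
omega = 2.0
A_true = 1.0 + 0.3 * np.sin(0.05 * t)      # slow amplitude modulation
phi_true = 0.5 * np.sin(0.03 * t)          # slow phase drift
x = A_true * np.cos(omega * t + phi_true)

a = analytic_signal(x)
A_est = np.abs(a)                          # instantaneous amplitude
phi_est = np.unwrap(np.angle(a)) - omega * t   # residual slow phase

mid = slice(1000, 9000)                    # ignore FFT edge effects
print(np.max(np.abs(A_est[mid] - A_true[mid])))
```

Away from the window boundaries, the modulus and residual phase of the analytic signal track A(t) and φ(t) closely, precisely because the modulations are slow compared with the carrier ω.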
As we already discussed, according to self-organized criticality, extreme events are seen to be endogenous, in contrast with previous prevailing views (see for instance the discussion in [64,102]). But, how can one assert with 100% confidence that a given extreme event is really due to an endogenous self-organization of the system, rather than a response to an external shock? Most natural and social systems are indeed continuously subjected to external stimulations, noises, shocks, solicitations, and forcing, which can vary widely in amplitude. It is thus not clear a priori if a given large event is due to a strong exogenous shock, to the internal dynamics of the system, or to a combination of both. Addressing this question is fundamental for understanding the relative importance of self-organization versus external forcing in complex systems and under-
pins much of the problem of dependence between variables. The concepts of endogeneity and exogeneity have many applications in the natural and social sciences (see [103] for a review) and we expect this viewpoint to develop into a general strategy of investigation.
Bibliography
1. Satinover JB, Sornette D (2007) "Illusion of Control" in Minority and Parrondo Games. Eur Phys J B 60:369–384
2. Satinover JB, Sornette D (2007) Illusion of Control in a Brownian Game. Physica A 386:339–344
3. Sornette D (2006) Critical Phenomena in Natural Sciences. Chaos, Fractals, Self-organization and Disorder: Concepts and Tools, 2nd edn. Springer Series in Synergetics. Springer, Heidelberg
4. Feller W (1971) An Introduction to Probability Theory and its Applications, vol II. John Wiley, New York
5. Ruelle D (2004) Conversations on Nonequilibrium Physics With an Extraterrestrial. Phys Today 57(5):48–53
6. Zajdenweber D (1976) Hasard et Prévision. Economica, Paris
7. Zajdenweber D (1997) Scale invariance in Economics and Finance. In: Dubrulle B, Graner F, Sornette D (eds) Scale Invariance and Beyond. EDP Sciences and Springer, Berlin, pp 185–194
8. Malcai O, Lidar DA, Biham O, Avnir D (1997) Scaling range and cutoffs in empirical fractals. Phys Rev E 56:2817–2828
9. Biham O, Malcai O, Lidar DA, Avnir D (1998) Is nature fractal? Response. Science 279:785–786
10. Biham O, Malcai O, Lidar DA, Avnir D (1998) Fractality in nature. Response. Science 279:1615–1616
11. Mandelbrot BB (1998) Is nature fractal? Science 279:783–784
12. Mandelbrot BB (1982) The Fractal Geometry of Nature. WH Freeman, San Francisco
13. Aharony A, Feder J (eds) (1989) Fractals in Physics. Phys D 38(1–3). North Holland, Amsterdam
14. Riste T, Sherrington D (eds) (1991) Spontaneous Formation of Space-Time Structures and Criticality. Proc NATO ASI, Geilo, Norway. Kluwer, Dordrecht
15. Pisarenko VF, Sornette D (2003) Characterization of the frequency of extreme events by the Generalized Pareto Distribution. Pure Appl Geophys 160(12):2343–2364
16. Jessen AH, Mikosch T (2006) Regularly varying functions. Publ l'Inst Math, Nouvelle serie 79(93):1–23. Preprint http://www.math.ku.dk/~mikosch/Preprint/Anders/jessen_mikosch.pdf
17. Laherrère J, Sornette D (1999) Stretched exponential distributions in nature and economy: Fat tails with characteristic scales. Eur Phys J B 2:525–539
18. Malevergne Y, Pisarenko VF, Sornette D (2005) Empirical Distributions of Log-Returns: between the Stretched Exponential and the Power Law? Quant Fin 5(4):379–401
19. Malevergne Y, Sornette D (2006) Extreme Financial Risks (From Dependence to Risk Management). Springer, Heidelberg
20. Frisch U, Sornette D (1997) Extreme deviations and applications. J Phys I France 7:1155–1171
21. Willekens E (1988) The structure of the class of subexponential distributions. Probab Theory Relat Fields 77:567–581
22. Embrechts P, Klüppelberg C, Mikosch T (1997) Modelling Extremal Events. Springer, Berlin
23. Stuart A, Ord K (1994) Kendall's Advanced Theory of Statistics. John Wiley, New York
24. Bak P (1996) How Nature Works: the Science of Self-organized Criticality. Copernicus, New York
25. Sornette D (1994) Sweeping of an instability: an alternative to self-organized criticality to get power laws without parameter tuning. J Phys I Fr 4:209–221
26. Sornette D (2002) Mechanism for Power laws without Self-Organization. Int J Mod Phys C 13(2):133–136
27. Newman MEJ (2005) Power laws, Pareto distributions and Zipf's law. Contemp Phys 46:323–351
28. Wilson KG (1979) Problems in physics with many scales of length. Sci Am 241:158–179
29. Stauffer D, Aharony A (1994) Introduction to Percolation Theory, 2nd edn. Taylor & Francis, London
30. Gabrielov A, Newman WI, Knopoff L (1994) Lattice models of fracture: Sensitivity to the local dynamics. Phys Rev E 50:188–197
31. Jensen HJ (2000) Self-Organized Criticality: Emergent Complex Behavior in Physical and Biological Systems. Cambridge Lecture Notes in Physics. Cambridge University Press, Cambridge
32. Mitzenmacher M (2004) A brief history of generative models for power law and lognormal distributions. Internet Math 1:226–251
33. Simkin MV, Roychowdhury VP (2006) Re-inventing Willis. Preprint http://arXiv.org/abs/physics/0601192
34. Sethna JP (2006) Crackling Noise and Avalanches: Scaling, Critical Phenomena, and the Renormalization Group. Lecture notes for the Les Houches summer school on Complex Systems, summer 2006
35. Redner S (2001) A Guide to First-Passage Processes. Cambridge University Press, New York
36. Zygadło R (2006) Flashing annihilation term of a logistic kinetic as a mechanism leading to Pareto distributions. Phys Rev E 77:021130
37. Stauffer D, Sornette D (1999) Self-Organized Percolation Model for Stock Market Fluctuations. Phys A 271(3–4):496–506
38. Kesten H (1973) Random difference equations and renewal theory for products of random matrices. Acta Math 131:207–248
39. Solomon S, Richmond P (2002) Stable power laws in variable economies. Lotka-Volterra implies Pareto-Zipf. Eur Phys J B 27:257–261
40. Saichev A, Malevergne Y, Sornette D (2007) Zipf law from proportional growth with birth-death processes (working paper)
41. Reed WJ, Hughes BD (2002) From Gene Families and Genera to Incomes and Internet File Sizes: Why Power Laws are so Common in Nature. Phys Rev E 66:067103
42. Cabrera JL, Milton JG (2004) Human stick balancing: Tuning Lévy flights to improve balance control. Chaos 14(3):691–698
43. Eurich CW, Pawelzik K (2005) Optimal Control Yields Power Law Behavior. Int Conf Artif Neural Netw 2:365–370
44. Platt N, Spiegel EA, Tresser C (1993) On-off intermittency: A mechanism for bursting. Phys Rev Lett 70:279–282
45. Heagy JF, Platt N, Hammel SM (1994) Characterization of on-off intermittency. Phys Rev E 49:1140–1150
46. Clauset A, Shalizi CR, Newman MEJ (2007) Power-law distributions in empirical data. Preprint http://arxiv.org/abs/0706.1062
47. Gourieroux C, Monfort A (1994) Testing non-nested hypotheses. In: Engle RF, McFadden DL (eds) Handbook of Econometrics, vol IV. Elsevier Science, pp 2583–2637
48. Cardy JL (1988) Finite-Size Scaling. North Holland, Amsterdam
49. Cranmer K (2001) Kernel estimation in high-energy physics. Comput Phys Commun 136(3):198–207
50. Sornette D, Knopoff L, Kagan YY, Vanneste C (1996) Rank-ordering statistics of extreme events: application to the distribution of large earthquakes. J Geophys Res 101:13883–13893
51. Pacheco JF, Scholz C, Sykes L (1992) Changes in frequency-size relationship from small to large earthquakes. Nature 355:71–73
52. Main I (2000) Apparent Breaks in Scaling in the Earthquake Cumulative Frequency-magnitude Distribution: Fact or Artifact? Bull Seismol Soc Am 90:86–97
53. Gabaix X, Ibragimov R (2008) Rank-1/2: A simple way to improve the OLS estimation of tail exponents. NBER working paper
54. Hill BM (1975) A simple general approach to inference about the tail of a distribution. Ann Stat 3:1163–1174
55. Drees H, de Haan L, Resnick SI (2000) How to Make a Hill Plot. Ann Stat 28(1):254–274
56. Resnick SI (1997) Discussion of the Danish Data on Large Fire Insurance Losses. Astin Bull 27(1):139–152
57. Pisarenko VF, Sornette D, Rodkin M (2004) A new approach to characterize deviations in the seismic energy distribution from the Gutenberg–Richter law. Comput Seism 35:138–159
58. Pisarenko VF, Sornette D (2006) New statistic for financial return distributions: power law or exponential? Phys A 366:387–400
59. Lasocki S (2001) Quantitative evidences of complexity of magnitude distribution in mining-induced seismicity: Implications for hazard evaluation. In: van Aswegen G, Durrheim RJ, Ortlepp WD (eds) Dynamic Rock Mass Response to Mining (5th International Symposium on Rockbursts and Seismicity in Mines), Symp Ser, vol S27. S Afr Inst Min Metall, Johannesburg, pp 543–550
60. Lasocki S, Papadimitriou EE (2006) Magnitude distribution complexity revealed in seismicity from Greece. J Geophys Res 111:B11309. doi:10.1029/2005JB003794
61. Anderson PW (1972) More is different (Broken symmetry and the nature of the hierarchical structure of science). Science 177:393–396
62. Sornette D, Davis AB, Ide K, Vixie KR, Pisarenko VF, Kamm JR (2007) Algorithm for Model Validation: Theory and Applications. Proc Natl Acad Sci USA 104(16):6562–6567
63. Geller RG, Jackson DD, Kagan YY, Mulargia F (1997) Earthquakes cannot be predicted. Science 275(5306):1616–1617
64. Bak P, Paczuski M (1995) Complexity, contingency and criticality. Proc Natl Acad Sci USA 92:6689–6696
65. Allègre CJ, Le Mouel JL, Provost A (1982) Scaling rules in rock fracture and possible implications for earthquake predictions. Nature 297:47–49
66. Keilis-Borok V (1990) The lithosphere of the Earth as a large nonlinear system. Geophys Monogr Ser 60:81–84
67. Sornette A, Sornette D (1990) Earthquake rupture as a critical point: Consequences for telluric precursors. Tectonophysics 179:327–334
68. Bowman DD, Ouillon G, Sammis CG, Sornette A, Sornette D (1996) An observational test of the critical earthquake concept. J Geophys Res 103:24359–24372
69. Sammis SG, Sornette D (2002) Positive Feedback, Memory and the Predictability of Earthquakes. Proc Natl Acad Sci USA 99(Suppl 1):2501–2508
70. Huang Y, Saleur H, Sammis CG, Sornette D (1998) Precursors, aftershocks, criticality and self-organized criticality. Europhys Lett 41:43–48
71. National Institute of Standards and Technology (2007) Engineering Statistical Handbook. http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
72. Pollock AA (1989) Acoustic Emission Inspection. In: Metals Handbook, 9th edn, vol 17, Nondestructive Evaluation and Quality Control. ASM International, pp 278–294
73. Omeltchenko A, Yu J, Kalia RK, Vashishta P (1997) Crack Front Propagation and Fracture in a Graphite Sheet: a Molecular-dynamics Study on Parallel Computers. Phys Rev Lett 78:2148–2151
74. Fineberg J, Marder M (1999) Instability in Dynamic Fracture. Phys Rep 313:2–108
75. Lei X, Kusunose K, Rao MVMS, Nishizawa O, Sato T (2000) Quasi-static Fault Growth and Cracking in Homogeneous Brittle Rock Under Triaxial Compression Using Acoustic Emission Monitoring. J Geophys Res 105:6127–6139
76. Wesnousky SG (1994) The Gutenberg-Richter or characteristic earthquake distribution, which is it? Bull Seismol Soc Am 84(6):1940–1959
77. Wesnousky SG (1996) Reply to Yan Kagan's comment on "The Gutenberg-Richter or characteristic earthquake distribution, which is it?" Bull Seismol Soc Am 86(1A):286–291
78. Kagan YY (1993) Statistics of characteristic earthquakes. Bull Seismol Soc Am 83(1):7–24
79. Kagan YY (1996) Comment on "The Gutenberg–Richter or characteristic earthquake distribution, which is it?" by Wesnousky SG. Bull Seismol Soc Am 86:274–285
80. Gil G, Sornette D (1996) Landau–Ginzburg theory of self-organized criticality. Phys Rev Lett 76:3991–3994
81. Fisher DS, Dahmen K, Ramanathan S, Ben-Zion Y (1997) Statistics of Earthquakes in Simple Models of Heterogeneous Faults. Phys Rev Lett 78:4885–4888
82. Ben-Zion Y, Eneva M, Liu Y (2003) Large Earthquake Cycles and Intermittent Criticality on Heterogeneous Faults Due to Evolving Stress and Seismicity. J Geophys Res 108(B6):2307. doi:10.1029/2002JB002121
83. Hillers G, Mai PM, Ben-Zion Y, Ampuero J-P (2007) Statistical Properties of Seismicity Along Fault Zones at Different Evolutionary Stages. Geophys J Int 169:515–533
84. Zöller G, Ben-Zion Y, Holschneider M (2007) Estimating recurrence times and seismic hazard of large earthquakes on an individual fault. Geophys J Int 170:1300–1310
85. Ben-Zion Y (2007) private communication
86. L'vov VS, Pomyalov A, Procaccia I (2001) Outliers, Extreme Events and Multiscaling. Phys Rev E 63:056118
87. Johansen A, Sornette D (1998) Stock market crashes are outliers. Eur Phys J B 1:141–143
88. Johansen A, Sornette D (2001) Large Stock Market Price Drawdowns Are Outliers. J Risk 4(2):69–110. Preprint http://arXiv.org/abs/cond-mat/0010050
89. Johansen A, Sornette D (2007) Shocks, Crashes and Bubbles in Financial Markets. Brussels Economic Review on Non-linear Financial Analysis 149–2/Summer 2007, in press. Preprint http://arXiv.org/abs/cond-mat/0210509
90. Gopikrishnan P, Meyer M, Amaral LAN, Stanley HE (1998) Inverse cubic law for the distribution of stock price variations. Eur Phys J B 3:139–140
91. Sornette D (2003) Why Stock Markets Crash: Critical Events in Complex Financial Systems. Princeton University Press, Princeton, NJ
92. Zipf GK (1949) Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, MA
93. Karplus WJ (1992) The Heavens Are Falling: The Scientific Prediction of Catastrophes in Our Time. Plenum, New York
94. Huber PJ (2003) Robust Statistics. Wiley-Interscience, New York
95. van der Vaart AW (2000) Asymptotic Statistics. Cambridge University Press, Cambridge
96. Wilcox RR (2004) Introduction to Robust Estimation and Hypothesis Testing, 2nd edn. Academic Press, Boston
97. Joe H (1997) Multivariate Models and Dependence Concepts. Chapman & Hall, London
98. Nelsen RB (1998) An Introduction to Copulas. Lecture Notes in Statistics 139. Springer, New York
99. Hamilton JD (1989) A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle. Econometrica 57:357–384
100. Engle RF, Hendry DF, Richard JF (1983) Exogeneity. Econometrica 51:277–304
101. Ericsson N, Irons JS (1994) Testing Exogeneity. Advanced Texts in Econometrics. Oxford University Press, Oxford
102. Sornette D (2002) Predictability of catastrophic events: material rupture, earthquakes, turbulence, financial crashes and human birth. Proc Natl Acad Sci USA 99(Suppl 1):2522–2529
103. Sornette D (2005) Endogenous versus exogenous origins of crises. In: Albeverio S, Jentsch V, Kantz H (eds) Extreme Events in Nature and Society. The Frontiers Collection. Springer, Heidelberg. Preprint http://arxiv.org/abs/physics/0412026
Probability and Statistics in Complex Systems, Introduction to
HENRIK JELDTOFT JENSEN1,2
1 Institute for Mathematical Sciences, London, UK
2 Department of Mathematics, Imperial College London, London, UK
Is the nature of complex systems such that they need a special treatment in terms of statistics and probability? Yes, in the sense that a particular focus suggests itself. This becomes clear if we try to specify what we mean by complex systems. Although no consensus exists on the definition of what constitutes a complex system, the following summary will probably be accepted by most people. Complex systems consist of a large number of interacting components. The interactions give rise to emergent hierarchical structures. The components of the system, and properties at the systems level, typically change with time. A complex system is inherently open, and its boundaries are often a matter of convention.

The large number of components, and the interactions between them, are central, and this has immediate consequences for the nature of the relevant statistics. The interactions between the many components will typically be so strong that correlations cannot be neglected (see Correlations in Complex Systems) and hence, when we sum up contributions from the individual parts, we may not have the central limit theorem to ensure that the macroscopic quantities are Gaussian distributed. Instead of peaked distributions with well-defined averages and higher moments, one typically encounters very broad, heavy-tailed distributions; frequently the tail reaches so far out that the average does not even exist. When this is the case, a description in terms of the typical, on-average scenario is not possible (see Probability Distributions in Complex Systems). Not only does this make fluctuations significant, it makes a description and understanding of the fluctuations absolutely essential (see Fluctuations, Importance of: Complexity in the View of Stochastic Processes). To illustrate this point think of, say, flood protection.
It is no good to protect oneself against some imaginary average event, say a "typical" flood, if effectively any size of flood may occur with non-negligible probability. This makes it important to be able to deal with broad distributions and with distributions of extreme events (see Extreme Value Statistics).
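The failure of the "average scenario" can be made concrete in a few lines (an illustrative sketch added here, not from the original article): for a power-law tail with exponent at or below 1, the mean does not exist, and the running sample average never settles.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200000
gauss = rng.normal(loc=1.0, scale=1.0, size=n)
# Pareto with survival function x^(-0.8) for x >= 1: infinite mean,
# since the tail exponent 0.8 is below 1.
pareto = rng.pareto(a=0.8, size=n) + 1.0

run_avg_gauss = np.cumsum(gauss) / np.arange(1, n + 1)
run_avg_pareto = np.cumsum(pareto) / np.arange(1, n + 1)

# The Gaussian running average converges to the true mean 1; the
# Pareto running average keeps drifting upward, dominated by a few
# ever-larger record events.
print(run_avg_gauss[-1], run_avg_pareto[-1])
```

Plotting the two running averages against sample size shows the Gaussian curve flattening quickly while the heavy-tailed one jumps each time a new extreme event arrives.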
The hierarchical nature of complex systems (see Hierarchical Dynamics) leads to situations where the existence of many time scales cannot be ignored, and as a result it may be inappropriate to assume the statistics to be stationary. The time dependence of the statistics may come about for a number of reasons. One is that the emergent hierarchical components change their internal state with time. An alternative reason may be that the strong interaction between components, themselves with no internal time evolution, prevents the collective set of components from reaching an equilibrium or asymptotic stationary state. In such cases one needs to understand how to describe and predict the behavior of a system that remains in a transient for all the relevant time scales, such as, say, the age of life on earth if one is dealing with the history of the biosphere.

It seems natural to divide the discussion of probability and statistics for complex systems into at least three aspects: 1. analyzing data from complex systems, 2. the phenomenology, and 3. modeling. When analyzing data it is important to be able to handle the effects of long memory, correlations and exceptionally strong fluctuations. These effects make it necessary to exert special care when trying to identify probability distributions extracted from observational or experimental data, or from data generated by computer simulations (see Probability Densities in Complex Systems, Measuring). The often very large amounts of data, and the lack of fundamental theory derived from first principles, have made it important to develop methods to identify structures in data sets; in this respect Bayesian statistics is often very relevant to complex systems (see Bayesian Statistics). From experiments and observations we know that one signature of complexity may be power-law-like probability distributions.
Determining the exponent can be complicated, since the exponent one reads off from the slope in a log–log plot of the distribution may only be an "apparent" exponent. Not that this makes the exponent less important, but it does make it more specific: an apparent exponent might very well depend on system size and on the amount of collected data. In contrast, we know from the lesson of the statistical mechanics of critical phenomena that the asymptotic behavior of an infinite system may be the same even when microscopic details differ. This encourages the study of simplistic models in the hope that, though simplistic, the models might nevertheless capture the essential mechanisms relevant for the phenomena under consideration. With this in mind, particular emphasis is placed on how apparent exponents converge towards their "universal" values in the limit of long distance and long time scales, in the limit of a system composed of infinitely many components. Of course we do not know how applicable this form of universality is in the case of non-equilibrium complex systems. It might very well be that the many power laws observed in experiments are in fact only approximate and only persist for a limited range of support of the stochastic variable being studied. To settle this, careful experiments and theoretical studies are needed, which will make use of some of the methods relevant to power laws discussed in the present section of the encyclopedia.

The focus on the functional form of the probability distributions is complicated by the fact that many complex systems never enter a stationary state during time scales accessible to observations or simulations. This does not exclude that the system relaxes towards a stationary state in the mathematically asymptotic limit of infinite time, but this limit might be of little physical relevance. Then it becomes essential to study the very nature of how systems relax towards the stationary state for all observational times. Inspired by the slow dynamics in glassy systems, and the very widespread intermittent dynamics encountered when many components interact, it has been suggested that the statistics of records may be of relevance to the intermittent relaxation of complex systems. The analysis of the statistics of the time instances marked by the occurrence of abrupt activity can be interesting. It might allow insight into the question of how relevant record dynamics is for the complex dynamics, and can in this way help to provide understanding of the collective dynamics of the components (see Record Statistics and Dynamics). To construct and analyze theoretical models of complex systems a number of methods have been developed.
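The contrast between an "apparent" slope and a principled estimate can be sketched in a few lines (illustrative, not from the original article): for a pure power-law tail, a maximum-likelihood (Hill-type) estimator is more reliable than a least-squares fit to a log–log histogram. All names and parameters below are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha_true = 2.5
# Pareto samples with survival function x^(-alpha) for x >= x_min = 1,
# generated by inverse-transform sampling.
x = (1.0 - rng.random(100000)) ** (-1.0 / alpha_true)

def hill_mle(sample, x_min=1.0):
    """Maximum-likelihood estimate of the power-law exponent above x_min."""
    tail = sample[sample >= x_min]
    return len(tail) / np.sum(np.log(tail / x_min))

print(hill_mle(x))  # close to alpha_true = 2.5
```

With real data the estimate must be repeated for a range of `x_min` values (a Hill plot), since the apparent exponent can drift with the chosen tail region, the system size, and the amount of data, exactly as discussed above.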
At the intuitive level we have attempts to develop phenomenological theories by generalizing the well-studied branching process originating in Galton's and Watson's sociological studies (see Branching Processes). Among attempts to formulate theories with a more basic foundation we encounter methods developed in physics to deal with phase transitions in materials where many interacting particles enter into a critical state. Here a critical state is taken to mean that no particular length or time scale can be identified; one talks about scale invariance to indicate that all length and time scales are involved in the phenomena. To analyze such systems, field-theoretic methods are particularly useful as a tool to study the asymptotic behavior, i.e., long length and time scales (see Field Theoretic Methods). Stochastic analysis is particularly relevant since we are dealing with large numbers of components and, moreover, a stochastic element is often present even at the basic level, e.g. through thermal fluctuations in the case of physical systems (see Stochastic Processes).

In traditional statistical mechanics the concept of entropy has played a very important role. It is therefore only natural that work is being done to try to extend the definition of entropy to make it applicable beyond the traditional areas. In particular, strong interactions make the entropy non-extensive, and appropriate generalizations are needed. The ordinary random walker is a bit of a workhorse in statistical mechanics. When random walks are used to model aspects of complex systems, the walker often needs to be dressed up with a broad distribution of step sizes (see Levy Statistics and Anomalous Transport: Levy Flights and Subdiffusion) or to walk in a background that makes the step-size distribution location dependent (see Random Walks in Random Environment). This more sophisticated walker is no longer necessarily controlled by the central limit theorem, and new mathematics is needed. The theory of random matrices is another example of methods developed to understand statistical aspects of physical systems that have now become of much broader importance. This field developed in response to the needs of physicists when, some time ago, they started trying to understand the energy spectra of heavy nuclei. The formalism has since been used in a range of very different fields, including the analysis of the stability of ecosystems. It is very likely that random matrices will also become a standard tool in complexity (see Random Matrix Theory). One reason for this is that complex systems can often be represented in terms of networks, and random matrix theory naturally relates to network analysis.
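The breakdown of the central limit theorem for a walker with broad step sizes can be demonstrated directly (an illustrative sketch, not from the original text; parameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n_walkers, n_steps = 200, 5000
# Ordinary diffusive walker: Gaussian steps.
gauss_walks = np.cumsum(rng.normal(size=(n_walkers, n_steps)), axis=1)
# Simple Levy flight: Cauchy-distributed steps (infinite variance).
levy_walks = np.cumsum(rng.standard_cauchy(size=(n_walkers, n_steps)), axis=1)

# The mean displacement does not exist for the Cauchy walk, so we
# compare medians, a quantile-based (robust) measure.
med_gauss = np.median(np.abs(gauss_walks[:, -1]))  # grows like sqrt(n_steps)
med_levy = np.median(np.abs(levy_walks[:, -1]))    # grows like n_steps
print(med_gauss, med_levy)
```

The Gaussian walker spreads diffusively, of order the square root of the number of steps, while the Cauchy walker's displacement is dominated by a few huge jumps and grows linearly in the number of steps, an example of the anomalous transport mentioned above.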
Finally, I feel it is appropriate to explain why this section contains an article about the stochastic Löwner equation (see Stochastic Loewner Evolution: Linking Universality, Criticality and Conformal Invariance in Complex Systems). The topic is included as a spectacular example of how successful fairly abstract mathematics, from the pure end of the mathematical spectrum, can sometimes be in providing a detailed quantitative understanding of the intricacies of systems that are complex in the sense that they are far from equilibrium and in a non-stationary state. One might hope that this degree of detail in the mathematical understanding of the statistics of complex systems may become the norm in the future, as research into the statistical and probabilistic analysis of complex systems develops.
Quantum Algorithms
MICHELE MOSCA1,2,3
1 Institute for Quantum Computing and Dept. of Combinatorics and Optimization, University of Waterloo, Waterloo, Canada
2 St. Jerome's University, Waterloo, Canada
3 Perimeter Institute for Theoretical Physics, Waterloo, Canada

Article Outline
Glossary
Definition of the Subject
Introduction and Overview
Early Quantum Algorithms
Factoring, Discrete Logarithms and the Abelian Hidden Subgroup Problem
Algorithms Based on Amplitude Amplification
Simulation of Quantum Mechanical Systems
Generalizations of the Abelian Hidden Subgroup Problem
Quantum Walk Algorithms
Adiabatic Algorithms
Topological Algorithms
Quantum Algorithms for Quantum Tasks
Future Directions
Bibliography

Glossary

Quantum circuit model One of the standard and most commonly used models of quantum computation which generalizes the classical model of acyclic circuits and closely models most of the proposed physical implementations of quantum computers. When studying algorithms for a problem with an infinite number of possible inputs, one usually restricts attention to uniform families of circuits, which are families of circuits in which the circuit for inputs of size n can be generated efficiently as a function of n. For example, one might require that there is a classical Turing machine that can generate the nth circuit in time polynomial in n.

Black box model A model of computation where the input to the problem includes a "black box" that can be applied (equivalently, an "oracle" that can be "queried"). This is the only way to extract information from the black box. For example, the black box could accept inputs j ∈ {0, 1}^n and output a value X_j ∈ {0, 1}. In this particular case, we can think of the black box as a means for querying the bits of the string X = X_1 X_2 X_3 … X_{2^n}. In the black-box model, one usually measures complexity in terms of the number of applications of the black box.

Computational complexity When referring to an algorithm, the computational complexity (often just called the complexity) is a measure of the resources used by the algorithm (which we can also refer to as the cost of the algorithm), usually measured as a function of the size of the input to the algorithm. The complexity for input size n is taken to be the cost of the algorithm on a worst-case input of size n to the problem. This is also referred to as worst-case complexity. When referring to a problem, the computational complexity is the minimum amount of resources required by any algorithm to solve the problem. See Quantum Computational Complexity for an overview.

Query complexity When referring to a black-box algorithm, the query complexity is the number of applications of the black box or oracle used by the algorithm. When referring to a black-box problem, the query complexity is the minimum number of applications of the black box required by any algorithm to solve the problem.

Definition of the Subject

The strong Church–Turing thesis states that a probabilistic Turing machine can efficiently simulate any realistic model of computation. By "efficiently", we mean that there is a polynomial p such that the amount of resources used by the Turing machine simulation is not more than p(M), where M is the amount of resources used by the given realistic model of computation. Since a computer is a physical device, any reasonable model of computation must be cast in a realistic physical framework, hence the condition that the model be "realistic" is very natural. The probabilistic Turing machine is implicitly cast in a classical framework for physics, and it appears to hold as long as the competing model of computation is also cast in a classical framework.
However, roughly a century ago, a new framework for physics was developed: quantum mechanics. The impact of this new framework on the theory of computation was not taken very seriously until the early 1970s, when Stephen Wiesner [177] proposed using quantum mechanics for cryptographic purposes (an idea developed later by Bennett and Brassard [27]). Benioff [23] proposed using quantum devices in order to implement reversible computation. Feynman [79] noted that a classical computer seems incapable of efficiently simulating the dynamics of rather simple quantum mechanical systems, and proposed that a "quantum" computer, with components evolving according to quantum mechanical rules, should be able to perform such simulations efficiently (see Sect. "Simulation of Quantum Mechanical Systems"). Manin made a similar observation independently [136]. Deutsch [64] worked on proving the original Church–Turing thesis (which was only concerned with effective computability, not efficient computability) in a quantum mechanical framework, and defined two models of quantum computation; he also gave the first quantum algorithm. One of Deutsch's ideas is that quantum computers could take advantage of the computational power present in many "parallel universes" and thus outperform conventional classical algorithms. While thinking of parallel universes is sometimes a convenient way for researchers to invent quantum algorithms, the algorithms and their successful implementation are independent of any particular interpretation of standard quantum mechanics.

Quantum algorithms are algorithms that run on any realistic model of quantum computation. The most commonly used model of quantum computation is the circuit model (more strictly, the model of uniform families of acyclic quantum circuits), and the quantum strong Church–Turing thesis states that the quantum circuit model can efficiently simulate any realistic model of computation. Several other models of quantum computation have been developed, and indeed they can be efficiently simulated by quantum circuits. Quantum circuits closely resemble most of the currently pursued approaches for attempting to construct scalable quantum computers.

The study of quantum algorithms is very important for several reasons. Computationally secure cryptography is widely used in society today, and relies on the believed difficulty of a small number of computational problems.
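Deutsch's first quantum algorithm, mentioned above, is small enough to simulate with explicit matrices. The sketch below (an illustrative numpy simulation added here, not part of the original article) shows how a single quantum query to a black box f: {0,1} → {0,1} decides whether f is constant or balanced, whereas any classical algorithm needs two queries.

```python
import numpy as np

H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)  # Hadamard gate
I2 = np.eye(2)

def oracle(f):
    """Unitary U_f |x>|y> = |x>|y XOR f(x)> on two qubits (basis |xy>)."""
    U = np.zeros((4, 4))
    for x in (0, 1):
        for y in (0, 1):
            U[2 * x + (y ^ f(x)), 2 * x + y] = 1.0
    return U

def deutsch(f):
    state = np.kron([1.0, 0.0], [0.0, 1.0])  # start in |0>|1>
    state = np.kron(H, H) @ state            # Hadamard both qubits
    state = oracle(f) @ state                # one query (phase kickback)
    state = np.kron(H, I2) @ state           # Hadamard the first qubit
    p1 = np.sum(np.abs(state[2:]) ** 2)      # P(first qubit measures 1)
    return "balanced" if p1 > 0.5 else "constant"

print(deutsch(lambda x: 0))      # constant
print(deutsch(lambda x: x))      # balanced
print(deutsch(lambda x: 1 - x))  # balanced
```

The oracle is applied once, yet interference between the two "branches" of the superposition deposits the global property f(0) XOR f(1) into the first qubit, a minimal instance of the quantum query advantage discussed in the glossary.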
Quantum computation appears to redefine what is a tractable or intractable problem, and one of the first breakthroughs in the development of quantum algorithms was Shor's discovery of efficient algorithms [165] for factoring and finding discrete logarithms. The difficulty of factoring and finding discrete logarithms was (and still is!) at the core of currently used public-key cryptography, and his results showed that if and when a quantum computer is built, any messages that had previously been encrypted using our current public-key cryptographic tools could be compromised by anyone who had recorded the ciphertext and public keys. Furthermore, the vast public-key infrastructure currently in place would be compromised, with no clear alternative to replace it; moreover, any alternative will take many years to deploy. There was a sudden rush to answer two fundamental questions. Firstly, can we actually build a sufficiently large quantum computer? Perhaps this isn't a reasonable model of computation. Subsequent work on quantum fault-tolerant error correction indicates that the answer is "yes", and experimental progress has steadily grown. Secondly, what other interesting and important problems can quantum computers solve more efficiently than the best known classical algorithms? The following sections survey the state of the art on the second question. As we gain confidence about which problems are still hard in a quantum mechanical framework, we can start to rebuild confidence in a secure public-key cryptographic infrastructure that is robust in the presence of quantum technologies. Although the cryptographic implications are dramatic and of immediate relevance, in the longer term the most important significance of quantum algorithms will likely be for a wider range of applications, where important problems cannot be solved because there are no known (or possible) efficient classical algorithms, but there are efficient quantum mechanical solutions. At present, we know of applications such as searching and optimizing (see Sect. "Algorithms Based on Amplitude Amplification") and simulating physical systems (see Sect. "Simulation of Quantum Mechanical Systems"), but the full implications are still unknown and very hard to predict. The next few sections give an overview of the current state of quantum algorithmics.

Introduction and Overview

There are several natural models of quantum computation. The most common one is a generalization of the classical circuit model. A detailed description of the circuit model of quantum computation can be found in several textbooks [122,129,149]. One can also define continuous models of computation, where one specifies the Hamiltonian H(t) of the system at time t, where H(t) is a "reasonable" Hamiltonian (e.g. a sum of Hamiltonians each involving a constant number of nearby subsystems), and one evolves the system for a period of time T. A reasonable measure of the total cost might be ∫₀ᵀ ‖H(t)‖ dt.
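As a toy illustration of this cost measure (not from the original article), the sketch below discretizes evolution under a hypothetical single-qubit Hamiltonian H(t) = cos(t)·X + sin(t)·Z into many short steps and accumulates a Riemann sum for ∫₀ᵀ ‖H(t)‖ dt. Since ‖H(t)‖ = 1 for every t in this example, the total cost comes out to T.

```python
import numpy as np

def expm_herm(H, dt):
    """exp(-i H dt) for a Hermitian matrix H, via eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * w * dt)) @ V.conj().T

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def H(t):
    """A hypothetical time-dependent Hamiltonian (chosen for illustration)."""
    return np.cos(t) * X + np.sin(t) * Z

T, steps = 1.0, 2000
dt = T / steps
psi = np.array([1, 0], dtype=complex)        # start in |0>
cost = 0.0
for j in range(steps):
    t = (j + 0.5) * dt                       # midpoint of the j-th time slice
    psi = expm_herm(H(t), dt) @ psi          # one discrete evolution step
    cost += np.linalg.norm(H(t), 2) * dt     # Riemann sum for the cost integral

# Here ||H(t)|| = 1 for all t, so cost is approximately T, and ||psi|| stays 1.
```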
Most algorithmic work in quantum computing has been developed in discrete models of computation, that is, with a discrete state space and with discrete time steps. In Subsect. “Continuous Time Quantum Walk Algorithms” and Sect. “Adiabatic Algorithms”, we discuss algorithms developed in a continuous-time model of computation. Even if, as in the case of classical computers, implementations of scalable fault-tolerant quantum computers have discrete time steps and state spaces, these algorithms are still very useful since there are efficient simulations using any universal discrete model of quantum computation. Note that if no such efficient simulation exists, then either
the continuous model of computation in question is physically unrealistic, or the quantum strong Church–Turing thesis is incorrect. Discrete state spaces can be used to approximate continuous state spaces in order to solve problems normally posed with continuous state spaces. One must choose appropriate discretizations, analyze the errors in the approximation, and quantify the scaling of the algorithm as the overall approximation error gets arbitrarily small. Quantum algorithms for such continuous problems are surveyed in Quantum Algorithms and Complexity for Continuous Problems. Many of the key ideas that led to the development of quantum computation emerged from earlier work on reversible computation [24]. Many facts from the theory of reversible computing are fundamental tools in the development of quantum algorithms. For example, suppose we have a classical algorithm for computing a function f: {0,1}ⁿ → {0,1}ᵐ (we use binary encoding for convenience). The details of the computing model are not so important, as long as it is a realistic model. For concreteness, suppose we have a reasonable encoding of a circuit of size C, using gates from a finite set, that takes x ∈ {0,1}ⁿ as input and outputs y ∈ {0,1}ᵐ (discarding any additional information it might have computed). Then this circuit can be efficiently converted into a circuit, composed only of reversible gates, that maps |x⟩|y⟩|0⟩ ↦ |x⟩|y ⊕ f(x)⟩|0⟩, where ⊕ denotes the bitwise XOR (addition modulo 2), and the third register of 0s is ancilla workspace that is reset to all 0s by reversible operations. This new circuit uses O(C) reversible gates from a finite set, so the overhead is modest. A basic introduction to this and other important facts about reversible computing can be found in most standard textbooks on quantum computing. For example, in Sect.
"Algorithms Based on Amplitude Amplification" on quantum searching, we use the fact that we can convert any classical heuristic algorithm that successfully guesses a solution with probability p into a reversible quantum algorithm that guesses a solution with probability amplitude √p. Most of the known quantum algorithms can be phrased as black-box algorithms solving black-box problems. A black-box, or oracle, is a subroutine or subcircuit that implements some operation or function. It does so in a way that provides no information other than simply taking an input and giving the prescribed output. One cannot, for example, look at the inner workings of the circuit or device implementing the black-box to extract additional information. For example, Shor's factoring algorithm can be viewed as an algorithm that finds the order of an element in a black-box group (that is, a group for which
the group operations are computed by a black-box), or the period of a black-box function, where the black-box is substituted with a subcircuit that implements exponentiation modulo N. The quantum search algorithm is described as a black-box algorithm, but it is straightforward to substitute a subcircuit that checks whether a given input is a valid certificate for some problem in NP. If we take a black-box algorithm that uses T applications of the black-box and A other computational steps, and replace each black-box with a subcircuit that uses B elementary gates, then we get an algorithm that uses TB + A gates. Thus if T and A are both polynomial in size, then an efficient black-box algorithm yields an efficient algorithm whenever we replace the black-box with a polynomial-time computable function. Many lower bounds have been found in the black-box model. The query complexity of a black-box problem is the number of applications of the black-box (or queries to the oracle) that a black-box algorithm must make in order to solve the problem. If we try to solve a problem that has query complexity T, where the black-box is implemented by some subcircuit, then we can conclude that any algorithm that treats the subcircuit as a black-box must apply the subcircuit a total of T times and thus use Ω(T) gates (we use the fact that any implementation of the black-box uses at least one gate; if we had a lower bound on the complexity of implementing the black-box, then we could derive a better lower bound in this case). However, this does not imply that an Ω(T) lower bound applies to any algorithm that uses the subcircuit, since it might exploit the information within the subcircuit in a way other than just applying the subcircuit. A discussion of the black-box model and its practical relevance can be found in [52,122,173].
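The TB + A accounting above can be made concrete with a small classical sketch (illustrative only; the function, the costs B and A, and the counts are made up): wrap an oracle so that its queries are counted, then charge B gates per query and A gates for everything else.

```python
class CountingOracle:
    """Wraps a function as a black box that counts how often it is queried."""
    def __init__(self, f):
        self._f = f
        self.queries = 0

    def __call__(self, x):
        self.queries += 1
        return self._f(x)

# Hypothetical black-box problem: find the unique x with f(x) = 1,
# here by naive exhaustive search over an 8-element domain.
f = CountingOracle(lambda x: 1 if x == 5 else 0)
solution = next(x for x in range(8) if f(x) == 1)

# If each query costs B elementary gates and the non-query work costs A gates,
# the total gate count of the instantiated algorithm is T*B + A.
T, B, A = f.queries, 100, 50
total_gates = T * B + A
```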
In the literature, one can often see a progression from studying basic algorithmic primitives (such as the convergence properties of a generic quantum walk), to the application to solve a black-box problem (such as element distinctness), to solving some concrete computational problem (like factoring an integer) that doesn't involve a black-box. This survey includes both black-box and non-black-box results. It is infeasible to detail all the known quantum algorithms, so a representative sample is given in this article. For a subset of this sample, there is an explicit definition of the problem, a discussion of what the best known quantum algorithm can do, and a comparison to what can be achieved with classical algorithms. For black-box problems, we attempt to give the number of queries and the number of non-query operations used by the algorithm, as well as the best known lower bounds on the
query complexity. In some cases, not all of this information is readily available in the literature, so there will be some gaps. As a small technical note, when we refer to a real number r as an input or output to a problem, we are referring to a finite description of a real number from which, for any integer n, one can efficiently compute (in time polynomial in n) an approximation of r with error at most 1/2ⁿ. In this article, we start with a brief sketch of the very early quantum algorithms; in the subsequent sections the algorithms are grouped according to the kind of problems they solve, or the techniques or algorithmic paradigms used. Section "Early Quantum Algorithms" summarizes the early quantum algorithms. Section "Factoring, Discrete Logarithms and the Abelian Hidden Subgroup Problem" describes the Abelian hidden subgroup algorithms, including Shor's factoring and discrete logarithm algorithms. Section "Algorithms Based on Amplitude Amplification" describes quantum searching and amplitude amplification and some of the main applications. Section "Simulation of Quantum Mechanical Systems" describes quantum algorithms for simulating quantum mechanical systems, which are another important class of algorithms that appear to offer an exponential speedup over classical algorithms. Section "Generalizations of the Abelian Hidden Subgroup Problem" describes several non-trivial generalizations of the Abelian hidden subgroup problem, and related techniques. Section "Quantum Walk Algorithms" describes the quantum walk paradigm for quantum algorithms and summarizes some of the most interesting results and applications. Section "Adiabatic Algorithms" describes the paradigm of adiabatic algorithms. Section "Topological Algorithms" describes a family of "topological" algorithms. Section "Quantum Algorithms for Quantum Tasks" describes algorithms for quantum tasks that cannot be done by a classical computer. In Sect.
"Future Directions" we conclude with a discussion.

Early Quantum Algorithms

The first explicitly defined quantum algorithm was the one described by David Deutsch in his landmark paper [64], where he defined the model of quantum computation, including a circuit and Turing machine model. The problem was to decide whether a given function f: {0,1} → {0,1} is constant or "balanced" (the function f is balanced if there are an equal number of 0 and 1 outputs); in other words, to output f(0) ⊕ f(1), where ⊕ denotes addition modulo 2. One is given a circuit that implements |x⟩|0⟩ ↦ |x⟩|f(x)⟩. Deutsch showed that using one application of the circuit, we can compute

(1/√2)|0⟩|f(0)⟩ + (1/√2)|1⟩|f(1)⟩.

If f(0) = f(1), then the first qubit is in the state (1/√2)|0⟩ + (1/√2)|1⟩, and thus applying a Hadamard transformation to the first qubit will output |0⟩ with certainty. However, if f(0) ≠ f(1), then applying a Hadamard gate and measuring will output |0⟩ with probability 1/2 and |1⟩ with probability 1/2. Thus if we obtain |1⟩, we are certain that the function is balanced, and we would output "balanced" in this case. If we obtain |0⟩, we cannot be certain; however, by guessing "constant" with probability 2/3 in this case, we get an algorithm that guesses correctly with probability 2/3. This is not possible with a classical algorithm that evaluates f only once. If one knows, for example, that the unitary transformation also maps |x⟩|1⟩ ↦ |x⟩|1 ⊕ f(x)⟩, then one can solve this problem with certainty with only one query [54]. A common tool used in this and many other quantum algorithms is to note that if one applies such an implementation of f to the input |x⟩((1/√2)|0⟩ − (1/√2)|1⟩), then the output can be written as (−1)^{f(x)} |x⟩((1/√2)|0⟩ − (1/√2)|1⟩). This technique can be used to replace the three-step process of first mapping |x⟩|0⟩ ↦ |x⟩|f(x)⟩, then applying a Z gate, |x⟩|f(x)⟩ ↦ |x⟩(−1)^{f(x)}|f(x)⟩ = (−1)^{f(x)}|x⟩|f(x)⟩, and then applying the function evaluation a second time in order to "uncompute" the value of f(x) from the second register: (−1)^{f(x)}|x⟩|f(x)⟩ ↦ (−1)^{f(x)}|x⟩|0⟩. This problem was generalized to the Deutsch–Jozsa problem [65], which asks the same constant-versus-"balanced" question, but for a function f: {0,1}ⁿ → {0,1}, with the promise that the function is either constant or balanced. The notion of a promise problem appears frequently in the study of quantum algorithms and complexity.
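Deutsch's one-query procedure described above can be checked with a small state-vector simulation (a numpy sketch, not code from the source): apply a Hadamard to the first qubit, one application of U_f: |x⟩|y⟩ ↦ |x⟩|y ⊕ f(x)⟩, a Hadamard again, and read off the probability of measuring |0⟩ on the first qubit.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard gate
I = np.eye(2)

def U_f(f):
    """The oracle |x>|y> -> |x>|y XOR f(x)> as a 4x4 permutation matrix."""
    U = np.zeros((4, 4))
    for x in (0, 1):
        for y in (0, 1):
            U[2 * x + (y ^ f(x)), 2 * x + y] = 1
    return U

def deutsch_prob_zero(f):
    """Probability that the first qubit measures 0 after Deutsch's circuit."""
    psi = np.zeros(4)
    psi[0] = 1                    # start in |0>|0>
    psi = np.kron(H, I) @ psi     # Hadamard on the first qubit
    psi = U_f(f) @ psi            # one query: (|0>|f(0)> + |1>|f(1)>)/sqrt(2)
    psi = np.kron(H, I) @ psi     # Hadamard on the first qubit again
    return psi[0] ** 2 + psi[1] ** 2

# Constant f gives Pr(0) = 1; balanced f gives Pr(0) = 1/2, as in the text.
```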
One can either think of the condition as a promise, or alternatively, one can accept any answer as correct in the case that the promise is not satisfied. This problem was solved with two queries by a similar algorithm, and the two queries can be reduced to one if the oracle evaluating f has the form |x⟩|b⟩ ↦ |x⟩|b ⊕ f(x)⟩. A classical deterministic algorithm requires 2^{n−1} + 1 evaluations of the function in order to decide whether f is constant or balanced with certainty. However, a classical randomized algorithm can decide whether f is constant or balanced with error probability ε using only O(log(1/ε)) queries. Other oracle separations were given in [30,31]. Bernstein and Vazirani [28] defined a special family of functions that are either constant or balanced: in particular, for any a ∈ {0,1}ⁿ and b ∈ {0,1}, let f_a(x) = a·x ⊕ b.
They showed how to find a using only 2 evaluations of a black-box implementing |x⟩|0⟩ ↦ |x⟩|f(x)⟩. This can be reduced to 1 evaluation given the standard black-box U_f: |x⟩|b⟩ ↦ |x⟩|b ⊕ f(x)⟩. This algorithm can be generalized quite naturally to finding the hidden matrix M in the affine map x ↦ Mx + b, where x ∈ Z_N^n, b ∈ Z_N^m, and M is an m × n matrix of values in Z_N (with element-wise addition modulo N), using only m queries [104,143]. In some sense, the quantum algorithm takes a black-box for right-multiplying by M and turns it into a black-box for left-multiplying by M. This allows us to determine M with only m queries instead of n + 1 queries, which can be advantageous if m < n, although no practical application has been developed to date. Bernstein and Vazirani also gave the first instance of a black-box problem (recursive Fourier sampling) where a quantum algorithm gives an exponential improvement over bounded-error randomized classical algorithms. Simon [167] gave a problem where there is an exponential gap. The problem is somewhat simpler to state, and is a special case of a broad family of important problems for which efficient quantum algorithms were subsequently found (using similar methods).

Simon's Problem
Input: A black-box U_f implementing |x⟩|0⟩ ↦ |x⟩|f(x)⟩ for a function f: {0,1}ⁿ → {0,1}ⁿ with the property that f(x) = f(y) if and only if x ⊕ y ∈ K = {0, s} for some s ∈ {0,1}ⁿ.
Problem: Find s.

Note that f is one-to-one if s = 0, the string of 0s, and f is two-to-one if s ≠ 0. Simon's algorithm is very simple and elegant. We sketch it here, since many of the algorithms presented in the upcoming sections follow the same overall structure. We give a modified version that has a definite running time (versus letting it run indefinitely and analyzing the expected running time).

Simon's Algorithm
1. Set i = 1.
2. Prepare the 2n-qubit state |00…0⟩|00…0⟩.
3. Apply a Hadamard gate to each of the first n qubits to obtain the state (1/√2ⁿ) Σ_x |x⟩|0⟩.
4. Apply U_f to create the state (1/√2ⁿ) Σ_x |x⟩|f(x)⟩.
5. Apply a Hadamard gate to each of the first n qubits.
6. Measure the first n qubits to obtain a string y_i ∈ {0,1}ⁿ.
7. If i = n + 3, go to the next step. Otherwise, increment i and go to step 2.
8. Let M be the (n + 3) × n matrix whose ith row is the vector y_i. Solve the system Mx = 0 (where we treat x and 0 as column vectors). If there is a unique non-zero solution x = s, then output s. If the only solution is x = 0, then output 0. Otherwise, output "FAIL".

Note that after step 4, since we can partition the space Z₂ⁿ into a disjoint union of the cosets Z₂ⁿ/K of the subgroup K = {0, s}, where f(x) is constant on each coset, we can rewrite the state of the system as

(1/√2ⁿ) Σ_x |x⟩|f(x)⟩ = (1/√2^{n−1}) Σ_{{z, z⊕s} ∈ Z₂ⁿ/K} (1/√2)(|z⟩ + |z ⊕ s⟩)|f(z)⟩.
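The classical post-processing in step 8 is ordinary linear algebra over GF(2). A minimal sketch (rows bit-packed as integers; the sample values below are hypothetical, standing in for measured y_i):

```python
def kernel_gf2(rows, n):
    """Solve M x = 0 over GF(2), where each row of M is an n-bit integer
    (bit i is the coefficient of x_i).  Returns a basis of the null space,
    as needed in step 8 of Simon's algorithm."""
    basis = []                          # list of (pivot_column, reduced_row)
    for r in rows:
        for pc, br in basis:            # eliminate existing pivot columns
            if (r >> pc) & 1:
                r ^= br
        if r:                           # independent row: top bit is a new pivot
            basis.append((r.bit_length() - 1, r))
    for i, (pc, _) in enumerate(basis):     # back-substitute to reduced form
        for j in range(len(basis)):
            if j != i and (basis[j][1] >> pc) & 1:
                basis[j] = (basis[j][0], basis[j][1] ^ basis[i][1])
    pivot_cols = {pc for pc, _ in basis}
    kernel = []
    for c in range(n):                  # one kernel vector per free column
        if c in pivot_cols:
            continue
        v = 1 << c
        for pc, br in basis:
            if (br >> c) & 1:           # pivot variable forced by the free one
                v |= 1 << pc
        kernel.append(v)
    return kernel

# Hypothetical samples y_i (all orthogonal to s) for hidden string s = 110:
samples = [0b001, 0b110, 0b111, 0b001]
candidates = kernel_gf2(samples, 3)     # recovers s = 0b110
```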
Since the rest of the algorithm ignores the second register, we can assume that the first register holds a random "coset state" (1/√2)(|z⟩ + |z ⊕ s⟩), selected uniformly at random from all the coset states. After applying the Hadamard gates, the coset state gets mapped to an equally weighted superposition of the states orthogonal to K,

(1/√2^{n−1}) Σ_{y ∈ K^⊥} (−1)^{y·z} |y⟩,

where K^⊥ = {y ∈ Z₂ⁿ : y·s = 0}. Thus, with n + O(1) samples drawn from K^⊥, the sampled vectors y_i will generate K^⊥ with high probability. Then, using linear algebra, we can efficiently compute generators for K. In the generalizations of Simon's algorithm that we see in the next few sections, we'll see a more general formulation of K^⊥ for a hidden subgroup K. Shortly after Simon came up with his black-box algorithm, Shor [165] used similar ideas to derive his famous algorithms for factoring and finding discrete logarithms.

Quantum Algorithms for Simon's Problem Simon's algorithm solves this problem with bounded error using n + O(1) applications of U_f, O(n) other elementary quantum operations, and O(n³) elementary classical operations. Brassard and Høyer [35] combined Simon's algorithm with amplitude amplification in order to make the algorithm exact.

Classical Algorithms for Simon's Problem Simon [167] showed a lower bound of Ω(2^{n/4}) queries, and this can be improved to Ω(2^{n/2}).

This was a historical moment, since the field of quantum algorithms progressed from work on black-box algorithms and complexity to having an algorithm without black-boxes and of broad practical importance. It is important to note the fundamental role of the foundational complexity-theoretic work and black-box algorithms in the development of the practical algorithms.

Factoring, Discrete Logarithms and the Abelian Hidden Subgroup Problem

Factoring, Finding Orders and Periods, and Eigenvalue Estimation

The most widely known quantum algorithm is Shor's algorithm [165,166] for the integer factorization problem.

Integer Factorization Problem
Input: An integer N.
Problem: Output positive integers p₁, p₂, …, p_l, r₁, r₂, …, r_l, where the p_i are distinct primes and N = p₁^{r₁} p₂^{r₂} ⋯ p_l^{r_l}.

This problem can be efficiently reduced (that is, in time polynomial in log N), by a probabilistic classical algorithm, to O(l) instances (note that l ∈ O(log N)) of the problem of finding the multiplicative order of an element modulo N, which can be solved efficiently on a quantum computer.

Order Finding Problem
Input: Positive integers a and N such that gcd(a, N) = 1 (i.e. a is relatively prime to N).
Problem: Find the order of a modulo N.

Essentially the same quantum algorithm can efficiently find the order of an element a in a finite group G given a black-box for performing the group arithmetic. Shor's algorithm in fact solves the more general problem of finding the period of a periodic function f.

Period Finding Problem
Input: A black-box implementing a periodic function f: Z → X for some finite set X, where f(x) = f(y) if and only if r | (x − y).
Problem: Find the period r.

Shor described his algorithm for the specific function f(x) = aˣ mod N, where N was the integer to be factored; however, the algorithm will find the period of any such periodic function (note the assumption that the values f(1), f(2), …, f(r) are distinct; one can also analyze the case where the values are not entirely distinct [32,145]).
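The classical reduction from factoring to order finding can be sketched as follows (illustrative, not code from the source; the order-finding subroutine here is a brute-force classical stand-in for the quantum step, and the sketch assumes N is an odd composite with at least two distinct prime factors):

```python
from math import gcd
from random import randrange

def order_mod(a, N):
    """Classical stand-in for the quantum order-finding subroutine:
    the smallest r > 0 with a^r = 1 (mod N).  This brute-force version
    takes exponential time; the quantum subroutine is what makes it fast."""
    r, x = 1, a % N
    while x != 1:
        x = x * a % N
        r += 1
    return r

def find_factor(N):
    """Shor-style reduction: factor N via the orders of random elements."""
    while True:
        a = randrange(2, N)
        g = gcd(a, N)
        if g > 1:
            return g                  # lucky guess: a shares a factor with N
        r = order_mod(a, N)
        if r % 2 == 0:
            y = pow(a, r // 2, N)     # y^2 = 1 (mod N) and y != 1
            if y != N - 1:            # y is a nontrivial square root of 1 ...
                return gcd(y - 1, N)  # ... so gcd(y - 1, N) splits N

factor = find_factor(15)              # returns 3 or 5
```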
Kitaev later showed that the problem of finding the order of a ∈ G can alternatively be reduced to the problem of estimating the eigenvalues of the operator that multiplies by a, and he described an efficient quantum algorithm for estimating such eigenvalues [126]. His method is to reduce the problem to phase estimation. Although Kitaev's algorithm was qualitatively different [116], it was shown that using an improved phase estimation algorithm yields an order-finding circuit that is essentially equivalent to Shor's factoring algorithm based on period-finding [54]. Both Shor's and Kitaev's approaches find r by finding good estimates of a random integer multiple of 1/r and then applying the continued fractions algorithm to find r.

Sampling Estimates to an Almost Uniformly Random Integer Multiple of 1/r
Input: Integers a and N such that gcd(a, N) = 1. Let r denote the (unknown) order of a.
Problem: Output a number x ∈ {0, 1, 2, …, 2ⁿ − 1} such that for each k ∈ {0, 1, …, r − 1} we have

Pr(|x/2ⁿ − k/r| ≤ 1/2^{n+1}) ≥ c/r

for some constant c > 0.

Shor's analysis of the algorithm works by creating a "periodic state", which is done by preparing Σ_x |x⟩|aˣ mod N⟩ and measuring the second register (in fact, just ignoring or tracing out the second register suffices, since one never uses the value of the measurement outcome). One then applies the quantum Fourier transform, or its inverse, to estimate the period. Kitaev's algorithm works by estimating a random eigenvalue of the operator U_a that multiplies by a.

Eigenvalue Estimation Problem
Input: A quantum circuit implementing the controlled-U, for some unitary operator U, and an eigenstate |ψ⟩ of U with eigenvalue e^{2πiω}.
Problem: Obtain a good estimate for ω.

He solves the eigenvalue estimation problem by solving a version of the well-studied phase estimation problem, and uses the crucial fact (as did Shor) that one can efficiently implement a circuit that computes U_a^{2^j} = U_{a^{2^j}} using j group multiplications instead of 2^j multiplications. Thus one can efficiently reduce the eigenvalue estimation problem to the following phase estimation problem.

Phase Estimation Problem
Input: The states (1/√2)(|0⟩ + e^{2πiωy}|1⟩), for y = 1, 2, 4, …, 2ⁿ, for some ω ∈ [0, 2π).
Problem: Obtain a good estimate of the phase parameter ω.

Kitaev's phase estimation algorithm used O(n) copies of the states (1/√2)(|0⟩ + e^{2πiωy}|1⟩) for y = 1, 8, 64, …, and provides an estimate with error at most 1/2ⁿ with high probability. Although there are slightly more efficient phase
estimation algorithms, one advantage of his algorithm is that it does not require a quantum Fourier transform, and instead performs some efficient classical post-processing of estimates of yω mod 1. This might be advantageous experimentally. The phase estimation problem has been studied for several decades [100,102], recently with some focus on the algorithmic complexity of the optimal or near-optimal estimation procedures.

Quantum Algorithms for Order Finding
Finding the order of a random element in Z_N*
– Quantum complexity is in O((log N)² log log N log log log N).
Order finding in a black-box group
– Quantum black-box complexity (for groups with unique encodings of group elements) is O(log r) black-box multiplications and O(n + log² r) other elementary operations.

Classical Algorithms for Order Finding
Finding the order of a random element in Z_N*
– The best known rigorous probabilistic classical algorithm has complexity in e^{O(√(log N log log N))}.
– The best known heuristic probabilistic classical algorithm has complexity in e^{O((log N)^{1/3} (log log N)^{2/3})}.
Order finding in a black-box group
– Classical black-box multiplication complexity is in Θ(√r) [53].

By a "heuristic" algorithm, we mean that the proof of its running time makes some plausible but unproven assumptions.

Discrete Logarithms

Shor [165,166] also solved the problem of finding discrete logarithms in the multiplicative group of a finite field.

The Discrete Logarithm Problem
Input: Elements a and b = aᵗ in Z_p*, where t is an integer from {0, 1, 2, …, r − 1} and r is the order of a.
Problem: Find t. (The number t is called the discrete logarithm of b with respect to the base a.)

Shor solved this problem by defining f: Z_r × Z_r → Z_p* as f(x, y) = aˣbʸ. Note that f(x₁, y₁) = f(x₂, y₂) if and only if (x₁ − x₂, y₁ − y₂) is in the additive subgroup of Z_r × Z_r generated by (t, −1). Since r is known, there is no need for the continued fractions algorithm. There is also an analogous eigenvalue estimation version of this algorithm.
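The Θ(√r) classical black-box bound mentioned above is achieved by Shanks' baby-step giant-step method, which uses O(√r) group multiplications versus the polylogarithmic cost of the quantum algorithm. A sketch for the multiplicative group mod p (illustrative, not from the source):

```python
from math import isqrt

def bsgs(a, b, p, r):
    """Baby-step giant-step: find t with a^t = b (mod p), where r is the
    order of a mod p.  Uses O(sqrt(r)) group multiplications."""
    m = isqrt(r) + 1
    baby = {}
    x = 1
    for j in range(m):               # baby steps: store a^j -> j
        baby.setdefault(x, j)
        x = x * a % p
    c = pow(a, (r - m) % r, p)       # a^(-m), using a^r = 1 (mod p)
    gamma = b % p
    for i in range(m):               # giant steps: examine b * a^(-i*m)
        if gamma in baby:
            return (i * m + baby[gamma]) % r
        gamma = gamma * c % p
    return None                      # no solution: b is not a power of a

# Example: 5 has order 22 mod 23, and 5^13 = 21 (mod 23), so bsgs recovers 13.
```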
The algorithm can be defined in general for any group G where we have a black-box for computing the group operation: given a, b ∈ G, output the smallest positive t such that b = aᵗ. For example, one can apply this algorithm to finding discrete logarithms in the additive group of an elliptic curve over a finite field [121,150], a group widely used in public-key cryptography [137]. In this case, the group operation is written as addition, and thus the problem is to find the smallest positive integer t such that b = ta, where b and a are points on some elliptic curve.

Quantum Complexities of the Discrete Logarithm Problem
Finding discrete logarithms in F_q*
– Quantum complexity is in O((log q)² log log q log log log q).
Discrete logarithms in a black-box group represented with strings of length n (including the elliptic curve groups discussed above)
– Quantum black-box complexity (for groups with unique encodings of group elements) is O(log r) black-box multiplications and O(n + log² r) other elementary operations.

Classical Complexities of the Discrete Logarithm Problem
Finding discrete logarithms in F_q*
– The best known rigorous probabilistic classical algorithm has complexity in e^{O(√(log q log log q))} for certain values of q (including q = 2ⁿ and prime q).
– The best known heuristic probabilistic classical algorithm has complexity in e^{O((log q)^{1/3} (log log q)^{2/3})}.
Discrete logarithms in a black-box group represented with strings of length n
– Classical black-box complexity is in Θ(√r). For a large class of elliptic curves, the best known classical algorithms have complexity in O(√r) group additions. There are sub-exponential algorithms for special families of curves.

Abelian Hidden Subgroup Problem

Notice how we can rephrase most of the problems we have already discussed, along with some other ones, as a special case of the following problem.

The Abelian Hidden Subgroup Problem
Let f: G → X map an Abelian group G to some finite set X with the property that there exists some subgroup K ≤ G such that for any x, y ∈ G, f(x) = f(y) if and only if x + K = y + K. In other words, f is constant on cosets of K and distinct on different cosets.
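The defining promise can be checked directly on small instances. The sketch below (illustrative, not from the source) verifies it for period finding, where G = Z, f(x) = aˣ mod N, and the hidden subgroup is K = rZ, tested on a finite window of the infinite group:

```python
def hides_subgroup(f, elements, K_members):
    """Check the hidden-subgroup promise on a finite sample of group elements:
    f(x) == f(y) exactly when the difference x - y lies in K."""
    return all((f(x) == f(y)) == ((x - y) in K_members)
               for x in elements for y in elements)

# Period finding as an instance: G = Z, f(x) = a^x mod N, hidden subgroup rZ.
a, N, r = 2, 15, 4                           # order of 2 mod 15 is 4: 2, 4, 8, 1
f = lambda x: pow(a, x, N)
K_window = {r * k for k in range(-10, 11)}   # finite window of the subgroup rZ
promise_holds = hides_subgroup(f, range(20), K_window)   # True
```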
Deutsch's problem G = Z₂, X = {0, 1}, and K = {0} if f is balanced, and K = {0, 1} if f is constant.
Generalized Simon's problem G = Z₂ⁿ, X = {0,1}ⁿ, and K is any subgroup of Z₂ⁿ.
Finding orders G = Z, X is any finite group H, and r is the order of a ∈ H. The subgroup K = rZ is the hidden subgroup of G, and a generator for K reveals r.
Finding the period of a function G = Z, X is any set, and r is the period of f. The subgroup K = rZ is the hidden subgroup of G, and a generator for K reveals the period r.
Discrete logarithms in any group G = Z_r × Z_r, X is any group H. Let a be an element of H with aʳ = 1, and suppose b = aᵏ. Consider the function f(x₁, x₂) = a^{x₁} b^{x₂}. We have f(x₁, x₂) = f(y₁, y₂) if and only if (x₁, x₂) − (y₁, y₂) ∈ {(tk, −t) : t = 0, 1, …, r − 1}. The hidden subgroup K is the subgroup generated by (k, −1), where k is the discrete logarithm.
Hidden linear functions [34] G = Z × Z. Let g be some permutation of Z_N for some integer N. Let h be a function from Z × Z to Z_N defined by h(x, y) = x + ay mod N. Let f = g ∘ h. The subgroup K is the hidden subgroup generated by (a, −1), and the generator reveals the hidden linear function h.
Self-shift-equivalent polynomials [83] Given a polynomial P in l variables X₁, X₂, …, X_l over F_q (the finite field with q elements), the function f which maps (a₁, a₂, …, a_l) ∈ F_q^l to P(X₁ − a₁, X₂ − a₂, …, X_l − a_l) is constant on cosets of a subgroup K of F_q^l. This subgroup K is the set of self-shift-equivalences of the polynomial P.
Abelian stabilizer problem [119] Let G be any group acting on a finite set X. That is, each element of G acts as a map from X to X in such a way that for any two elements a, b ∈ G, a(b(x)) = (ab)(x) for all x ∈ X. For a particular element x ∈ X, the set of elements which fix x (that is, the elements a ∈ G such that a(x) = x) forms a subgroup. This subgroup is called the stabilizer of x in G, denoted St_G(x).
Let f_x denote the function from G to X which maps g ∈ G to g(x). The hidden subgroup of f_x is St_G(x). If we restrict attention to finite Abelian groups, or more generally, finitely generated Abelian groups, then we can efficiently solve the hidden subgroup problem by generalizations of the algorithms for factoring, finding discrete logarithms, and Simon's problem. The Abelian hidden subgroup problem can also be used to decompose a finite Abelian group into a direct sum of cyclic groups if there is a unique representative for each group element [44,143]. For example, the multiplicative group of integers modulo N is an Abelian group, and we
can efficiently perform computations in the group. However, having a decomposition of the group would imply an efficient algorithm for factoring N. The class group of a number field is another Abelian group for which a decomposition is believed to be hard to find on a classical computer. For example, such a decomposition would easily give the size of the class group, which is also known to be as hard as factoring, assuming the Generalized Riemann Hypothesis. Computing the class group of a complex number field is a simple consequence of the algorithm for decomposing an Abelian group into a direct sum of cyclic groups, since there are techniques for computing unique representatives in this case. A unique classical representation is sufficient, but a quantum state could also be used to represent a group element. For example, a uniform superposition over classical representatives of the same group element would also work (this technique was applied by Watrous in the case of solvable groups [175]). Computing the class number of a real number field is not as straightforward, since there is no known way to efficiently compute a unique classical representative for each group element. However, Hallgren [95] used the techniques outlined in Subsect. "Lattice and Number Field Problems" to show how to compute quantum states representing the group elements, assuming the Generalized Riemann Hypothesis, in the case of the class group of a real number field of constant degree, and thus is able to compute the class group in these cases as well.

Quantum Algorithms for the Abelian Hidden Subgroup Problem There exists a bounded-error quantum algorithm for finding generators for the hidden subgroup K ≤ G = Z_{N₁} × Z_{N₂} × ⋯ × Z_{N_l} of f using O(l) evaluations of f and O(log³ N) other elementary operations, where N = N₁N₂⋯N_l = |G|. It can be shown that Ω(l) queries are needed for worst-case K.
Classical Algorithms for the Abelian Hidden Subgroup Problem In the black-box model, Ω(√(|G/K|)) queries are necessary in order to even decide if the hidden subgroup is trivial. Generalizations are discussed in Sect. "Generalizations of the Abelian Hidden Subgroup Problem".
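In the Abelian case, the quantum queries produce random elements orthogonal to the hidden subgroup, and the remaining work is classical linear algebra. As an illustrative sketch (not from the source), consider G = (Z₂)ⁿ, i.e., Simon's problem: the sampled strings y satisfy y·s = 0 (mod 2) for the hidden string s, and solving the resulting linear system over GF(2) recovers s.

```python
def rref_gf2(rows):
    """Reduced row-echelon form over GF(2); each row is a bitmask (int)."""
    res = []
    for r in rows:
        for pr in res:                       # eliminate existing pivots from r
            if (r >> (pr.bit_length() - 1)) & 1:
                r ^= pr
        if r:
            res.append(r)
            res.sort(reverse=True)           # keep pivots in decreasing order
    for i in range(len(res)):                # back-eliminate to full RREF
        for j in range(len(res)):
            if i != j and (res[j] >> (res[i].bit_length() - 1)) & 1:
                res[j] ^= res[i]
    return res

def nullspace_gf2(rows, n):
    """Basis of {x in GF(2)^n : r.x = 0 (mod 2) for every row r}."""
    piv = {r.bit_length() - 1: r for r in rref_gf2(rows)}
    basis = []
    for f in range(n):                       # one basis vector per free column
        if f in piv:
            continue
        x = 1 << f
        for p, r in piv.items():
            if (r >> f) & 1:                 # pivot variable p cancels bit f
                x |= 1 << p
        basis.append(x)
    return basis

s = 0b1011  # hidden string: K = {0, s} inside (Z_2)^4, as in Simon's problem
# The quantum part samples y with y.s = 0 (mod 2); here we just enumerate them.
samples = [y for y in range(16) if bin(y & s).count("1") % 2 == 0]
recovered = nullspace_gf2(samples, 4)        # recovers [s]
```

The same pattern (collect a few more than l equations, solve the linear system) is what the general Abelian algorithm does, with arithmetic modulo the N_i instead of modulo 2.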
Algorithms Based on Amplitude Amplification In 1996 Grover [90] developed a quantum algorithm for solving the search problem that gives a quadratic speedup over the best possible classical algorithm.
Quantum Algorithms
Search Problem Input: A black-box U_f for computing an unknown function f : {0,1}ⁿ → {0,1}. Problem: Find a value x ∈ {0,1}ⁿ satisfying f(x) = 1, if one exists. Otherwise, output "NO SOLUTION". The decision version of this problem is to output 1 if there is a solution, and 0 if there is no solution. This problem is stated in a very general way, and can be applied to solving a wide range of problems, in particular any problem in NP. For example, suppose one wanted to find a solution to an instance Φ of the 3-SAT problem. The Boolean formula Φ is in "3-conjunctive normal form" (3-CNF), which means that it is a conjunction (logical AND) of clauses, each of which is a disjunction (logical OR) of three Boolean variables (or their negations). For example, the following is a 3-CNF formula in the variables b₁, b₂, …, b₅:

Φ = (b₁ ∨ b̄₂ ∨ b₅) ∧ (b₁ ∨ b̄₄ ∨ b̄₅) ∧ (b₄ ∨ b₂ ∨ b₃),

where we let b̄ denote the logical NOT of b. A "satisfying assignment" of a particular 3-CNF formula Φ is an assignment of 0 or 1 values to each of the n variables such that the formula evaluates to 1. Given an assignment, it is easy to check whether it satisfies the formula. Define f_Φ to be the function that, for any x = x₁x₂…x_n ∈ {0,1}ⁿ, maps x ↦ 1 if the assignment b_i = x_i, i = 1, 2, …, n, satisfies Φ, and x ↦ 0 otherwise. Solving the search problem for f_Φ is equivalent to finding a satisfying assignment for Φ. If we only learn about f by applying the black-box U_f, then any algorithm that finds a solution with high probability for any f (even if we restrict to functions f with at most one solution) requires Ω(√(2ⁿ)) applications of U_f [25]. The basic intuition as to why a quantum algorithm might provide some speed-up is that a quantum version of an algorithm that guesses the solution with probability 1/2ⁿ will in fact produce the correct answer with probability amplitude 1/√(2ⁿ).
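The oracle f_Φ is classically trivial to construct from the formula. A minimal Python sketch (illustrative; the clause encoding is ours, not from the source) using the text's example formula, together with the O(2ⁿ) exhaustive search a classical black-box algorithm is reduced to:

```python
from itertools import product

# The example formula from the text:
#   Phi = (b1 v ~b2 v b5) ^ (b1 v ~b4 v ~b5) ^ (b4 v b2 v b3)
# Each clause is three signed, 1-based variable indices (negative = negated).
PHI = [(1, -2, 5), (1, -4, -5), (4, 2, 3)]

def f_phi(x, clauses=PHI):
    """The predicate f_Phi: assignment x (tuple of 0/1) -> 1 iff Phi(x) = 1."""
    def lit(v):
        val = x[abs(v) - 1]
        return val if v > 0 else 1 - val
    return int(all(any(lit(v) for v in clause) for clause in clauses))

def exhaustive_search(f, n):
    """Classical black-box search: O(2^n) oracle calls in the worst case."""
    for x in product((0, 1), repeat=n):
        if f(x):
            return x
    return None  # "NO SOLUTION"

sol = exhaustive_search(f_phi, 5)  # a satisfying assignment of Phi
```

A quantum computer runs the same predicate reversibly as U_f; the quadratic speed-up below comes entirely from how the queries are combined, not from evaluating f any faster.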
The hope is that additional guesses increase the amplitude of finding a correct solution by Ω(1/√(2ⁿ)), and thus O(√(2ⁿ)) guesses and applications of U_f might suffice in order to get the total probability amplitude close to 1. Of course, one needs to find a unitary algorithm for implementing this, and obvious approaches do not permit the amplitudes to add constructively. Lov Grover discovered a quantum algorithm that does permit the amplitudes to add up in such a way. This algorithm was analyzed and generalized to the technique known as "amplitude amplification" [34,35,38]. Further generalizations
of these search algorithms are discussed in Sect. "Quantum Walk Algorithms". Quantum Algorithms for Searching: There exists a quantum algorithm that finds a solution with probability at least 2/3 if one exists (otherwise it outputs "NO SOLUTION"), and uses O(√(2ⁿ)) applications of U_f. The algorithm uses O(n√(2ⁿ)) elementary gates. Note that Ω(√(2ⁿ)) applications of U_f are necessary in order to find a solution for any f, even if we restrict attention to functions f with at most one solution. If, as an additional input, we are also given an algorithm A that will find a solution to f(x) = 1 with probability p, then amplitude amplification will find a solution with high probability using O(1/√p) applications of U_f and of unitary versions of A and A⁻¹. This more general statement implies that if there are m ≥ 1 solutions, then there is an algorithm which makes an expected number of queries in O(√(2ⁿ/m)) (since guessing uniformly at random will succeed with probability m/2ⁿ). This algorithm works even if the unitary black-box U_f computes f with a small bounded error [108]. Classical Algorithms for the Search Problem: Any classical algorithm must make Ω(2ⁿ) queries in order to succeed with probability at least 2/3 on any input f, even if we restrict attention to functions with at most one solution. Exhaustive searching will find a solution using O(2ⁿ) applications of U_f. If we are also given a black-box A that successfully guesses a solution to f(x) = 1 with probability p ≥ 1/2ⁿ, then Θ(1/p) applications of U_f are necessary and sufficient (using random sampling). If instead of a black-box for U_f we are given a circuit for f or some other description of f (such as a description of the 3-SAT formula Φ in the case of f_Φ), then there might be more efficient algorithms that use this additional information to do more than just apply U_f.
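The query counts above can be checked numerically. A minimal statevector simulation (not from the source; numpy-based, single marked item) showing that roughly (π/4)√(2ⁿ) iterations of "oracle phase flip, then inversion about the mean" drive the success probability near 1:

```python
import numpy as np

def grover_probability(n, marked, iters):
    """Statevector simulation of Grover search on n qubits: phase-flip the
    marked states, then invert about the mean; return the probability of
    measuring a marked state after `iters` iterations."""
    N = 2 ** n
    state = np.full(N, 1 / np.sqrt(N))    # uniform superposition
    for _ in range(iters):
        state[marked] *= -1               # oracle U_f as a phase flip
        state = 2 * state.mean() - state  # diffusion: inversion about the mean
    return float(sum(state[m] ** 2 for m in marked))

n, marked = 8, [42]
best = int(round(np.pi / 4 * np.sqrt(2 ** n)))  # ~ (pi/4) sqrt(2^n / m), m = 1
p = grover_probability(n, marked, best)         # close to 1
```

With zero iterations the success probability is just 1/2ⁿ, as for a single random guess; after ~(π/4)√(2ⁿ) iterations it is close to 1, matching the O(√(2ⁿ)) bound.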
One can directly use amplitude amplification to get a quadratic speed-up for any such heuristic algorithm that guesses a solution with probability p, and then repeats until a solution is found. The quantum searching algorithm has often been called a ‘database’ search algorithm, but this term can be misleading in practice. There is a quadratic speed-up if there is an implicit database defined by an efficiently computable function f . However, if we are actually interested in searching a physical database, then a more careful analysis of the physical set-up and the resources required is necessary. For example, if a database is arranged in a linear array, then to query an N-element database will take time
proportional to N. If the data is arranged in a square grid, then a general lookup could possibly be done in time proportional to √N. This limitation holds for both classical and quantum searching; however, the hardware assumptions in the two cases are often different, so one needs to be careful when making a comparison (e.g., [158,180] and Chap. 6 of [149] discuss these issues). Local searching [2] restricts attention to architectures or models where subsystems can only interact locally, and does not allow access to any memory location in one unit of time regardless of the memory size (which is not truly possible in any case, but can sometimes be an appropriate idealization in practice). The algorithms used include quantum walk algorithms (e.g., [182]) and can achieve quadratic or close to quadratic speed-ups depending on details such as the spatial dimension of the database. It is also important to note that there has been much work studying the quantum query complexity of searching an ordered list. It is known that Θ(log N) queries are necessary and sufficient, but the precise constant factor is not yet known [49,98,107]. Other Applications of Amplitude Amplification Apart from the application to the searching problem, one of the first applications of the quantum searching algorithm was to counting [37], and more generally amplitude amplification can be used to estimate amplitudes [38]. The quantum algorithm for amplitude estimation combines the techniques of the order-finding algorithm with amplitude amplification. Bounds on optimal phase estimation translate to bounds on optimal amplitude estimation. It has several applications, including approximate and exact counting [37,38], and approximating the mean (or, in the continuous case, the integral; see Quantum Algorithms and Complexity for Continuous Problems) of a function [172]. These applications offer up to a quadratic speed-up over classical algorithms.
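The counting application can be made concrete with a classical sanity check (this numerical sketch is not from the source): the Grover iterate G = D·O has eigenvalues e^{±2iθ} with sin²θ = m/N on the relevant two-dimensional subspace, so the eigen-angle of G encodes the number of solutions m; amplitude estimation reads that angle out by phase estimation, while here we simply diagonalize G.

```python
import numpy as np

def count_solutions(n, marked):
    """Recover m = |marked| from the spectrum of the Grover iterate G = D*O:
    on the relevant 2D subspace G has eigenvalues e^{+-2i*theta} with
    sin^2(theta) = m/N, which is what amplitude estimation extracts by phase
    estimation.  (Assumes 1 <= m < N/2.)"""
    N = 2 ** n
    oracle = np.eye(N)
    for x in marked:
        oracle[x, x] = -1.0                # phase flip on marked states
    s = np.full((N, 1), 1 / np.sqrt(N))    # uniform superposition |s>
    diffusion = 2 * (s @ s.T) - np.eye(N)  # inversion about the mean
    G = diffusion @ oracle
    phases = np.angle(np.linalg.eigvals(G))
    # The remaining eigenphases are 0 and pi; the smallest other one is 2*theta.
    two_theta = min(p for p in phases if 1e-6 < p < np.pi - 1e-6)
    return round(N * np.sin(two_theta / 2) ** 2)

m_est = count_solutions(6, [3, 17, 40])    # recovers 3
```

Estimating θ to precision 1/√N takes O(√N) controlled applications of G on a quantum computer, which is the source of the quadratic speed-up for counting.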
We have already mentioned the straightforward, and very useful, application of amplitude amplification to searching: providing a quadratic speed-up for any algorithm that consists of repeating a subroutine that guesses a solution until a solution is found. However, there are many applications of amplitude amplification that are not so straightforward, and require some additional non-trivial algorithmic insights. Other applications of amplitude amplification include the collision finding problem, which has applications to breaking hash functions in cryptography. Collision Finding Problem: Input: A black-box U_f for computing a function
f : {0,1}ⁿ → {0,1}^m. The function f is r-to-1, for some positive integer r. Problem: Find two distinct x, y ∈ {0,1}ⁿ with f(x) = f(y), if r > 1. Otherwise output "NO COLLISION". Quantum Algorithms for Collision Finding There exists a quantum algorithm [36] that uses O((2ⁿ/r)^{1/3}) applications of U_f, O((2ⁿ/r)^{1/3} polylog(2ⁿ/r)) other elementary operations, and O((2ⁿ/r)^{1/3}) space, and outputs a collision with probability at least 2/3 if r > 1, and outputs "NO COLLISION" otherwise. It can be shown that Ω((2ⁿ/r)^{1/3}) applications of U_f are needed. Classical Algorithms for Collision Finding There exists a classical probabilistic algorithm that uses O((2ⁿ/r)^{1/2}) applications of U_f, O((2ⁿ/r)^{1/2} polylog(2ⁿ/r)) other elementary operations, and O(n) space, and outputs a collision with probability at least 2/3 if r > 1, and outputs "NO COLLISION" otherwise. It can be shown that Ω((2ⁿ/r)^{1/2}) applications of U_f are needed. Other applications include finding claws [37], finding the maximum (or minimum) value of a function [106], string matching [153], estimating the median of a function [91,148] and many others [67,84]. There is also a non-trivial application to the element distinctness problem [42], which we define in Sect. "Quantum Walk Algorithms", since there is a better, optimal, quantum walk algorithm. Amplitude amplification is also a basic tool in making algorithms exact [35,146]. Simulation of Quantum Mechanical Systems Feynman's reason for considering a quantum computer was to simulate other quantum mechanical systems [79]. This general idea was described in more detail by Lloyd [131], Wiesner [178] and Zalka [180], and later by many other authors (e.g., [149]). More detailed situations have been studied by numerous authors, including [43,45,119,170].
There are other notions of what one might mean by simulating a physical system, such as computing properties of the ground state of the Hamiltonian, or other properties of the spectrum of the Hamiltonian. In this section, we focus on the problem of simulating the evolution of a quantum mechanical system, given the initial state (or a description of it) and a description of the Hamiltonian of the system. For convenience, we will restrict attention to a discrete Hilbert space. In practice, when modelling systems with a continuous state space, the state space will need to be approximated by a discrete state space [180] with enough
states so that the accumulated error is below the desired error threshold. This general problem of dealing with continuous problems is discussed in Quantum Algorithms and Complexity for Continuous Problems. We will also restrict attention to time-independent Hamiltonians, since well-behaved time-dependent Hamiltonian evolution can be approximated by a sequence of time-independent Hamiltonian evolutions. There are two natural ways that have been studied for representing the Hamiltonian. The first way is to represent H = Σ_{k=1}^{M} H_k, where each H_k is a simple enough Hamiltonian that we know how to efficiently simulate its evolution. For convenience, we assume the simulation of each H_k term for a time t takes unit cost and is exact. In practice, if the simulation can be done with error ε in time polynomial in 1/ε, τ = ‖H_k‖t, and the number of qubits n, then the effect on the overall cost is a factor that is polynomial in these quantities. In many situations, the simulation is polynomial in log(1/ε) and log τ. This leads to the following formulation: Hamiltonian Simulation Problem 1 Input: An integer n and black-boxes A₁, A₂, …, A_M, where A_j takes a non-negative real number r and executes e^{iH_j r}, for a set of Hamiltonians H_j acting on n qubits. The value of ‖H‖, the trace norm of H = Σ_{k=1}^{M} H_k. A positive real number t. A positive real number ε < 1. An n-qubit state |ψ⟩. Output: A state |ψ_f⟩ satisfying ‖ |ψ_f⟩ − e^{iHt}|ψ⟩ ‖ < ε. In practice, the input will likely be a classical description of how to prepare the state |ψ⟩. It suffices to have an upper bound on ‖H‖ (if the bound is within a constant factor, this won't affect the stated complexities). In particular, it suffices that we can efficiently compute or approximate the eigenvalue λ_{E_k} of H_k for a given eigenvector |E_k⟩.
This is the case, for example, when the Hilbert space is a tensor product of n finite-dimensional subsystems and H_k acts non-trivially only on a finite number of subsystems, say c of them. In other words, up to a reordering of the subsystem labels, H_k = I^{⊗(n−c)} ⊗ H̃_k. Since

e^{iH_k t} = I^{⊗(n−c)} ⊗ e^{iH̃_k t},

in order to simulate the evolution of H_k for a time interval of size t, one only needs to simulate H̃_k on the relevant c qubits for a time interval of size t. In this case, since we know H̃_k, one can use "brute-force" methods to approximately implement the map |E_k⟩ ↦ e^{2πiλ_{E_k} t}|E_k⟩, for any time interval t. An easy example is when the state space is n qubits, and H_k is a tensor product of Pauli operators. This means that we can easily diagonalize H̃_k as

H̃_k = (P₁ ⊗ P₂ ⊗ ⋯ ⊗ P_c) Z^{⊗c} (P₁ ⊗ P₂ ⊗ ⋯ ⊗ P_c)†

for some one-qubit unitary operations P₁, P₂, …, P_c. Thus we have

e^{iH̃_k t} = (P₁ ⊗ P₂ ⊗ ⋯ ⊗ P_c) e^{iZ^{⊗c} t} (P₁ ⊗ P₂ ⊗ ⋯ ⊗ P_c)†.

Since

e^{iZ^{⊗c} t} |x₁x₂…x_c⟩ = e^{i(−1)^{f(x₁…x_c)} t} |x₁x₂…x_c⟩,

where f(x₁, …, x_c) = x₁ ⊕ ⋯ ⊕ x_c, this simulation can be done easily. In fact, as pointed out in [149], in the case that H_k is such a product Hamiltonian, c does not need to be constant. Another example [180] is where the eigenvectors of H_k are of the form

|E_k⟩ = (1/√(2ⁿ)) Σ_{j=0}^{2ⁿ−1} e^{2πijk/2ⁿ} |j⟩
(i.e., "momentum" eigenstates), in which case the inverse quantum Fourier transform will map |E_k⟩ ↦ |k⟩; then one can easily approximate |k⟩ ↦ e^{iλ_k t}|k⟩ for any efficiently computable function λ_k, and then apply the quantum Fourier transform to map |k⟩ ↦ |E_k⟩ and thus effectively compute the transformation |E_k⟩ ↦ e^{iλ_k t}|E_k⟩. Thus we can study any situation where H = Σ_k H_k and we have some means of efficiently simulating the time evolution of each H_k. If the H_k commute pairwise, then

e^{iHt} = e^{iH₁t} e^{iH₂t} ⋯ e^{iH_M t},

and the problem is straightforward. For the non-trivial case when the H_k do not commute pairwise, we can take advantage of approximations derived from Trotter formulas, like

e^{iHt} = (e^{iH₁t/n} e^{iH₂t/n} ⋯ e^{iH_M t/n})ⁿ + O(t²/n),

and other improved versions. A good estimate of the overall cost of this family of simulations is the number of terms of the form e^{iH_j r} that are used, for any choice of r (we can assume 0 ≤ r ≤ t, and that r is some efficiently computable number). Quantum Complexity of Hamiltonian Simulation Problem 1 There is a quantum algorithm that simulates e^{iHt} on a given input |ψ⟩ with trace distance error at most ε < 1 that uses a number of exponential terms N_exp satisfying

N_exp ∈ (M+1) M^{1+o(1)} τ^{1+o(1)} (1/ε)^{O(1/√s)},

where τ = ‖H‖t and ε = 1/2^s. Since the running time of simulating each exponential term is assumed to be polynomial in n, the overall running time is polynomial in n. In general, there are Hamiltonians for which Ω(τ) time is necessary.
The product τ = ‖H‖t is the relevant parameter, since one can effectively speed up time by a factor of s by rescaling the Hamiltonian by a multiplicative factor of s, for any s > 0 (e.g., one can perform an operation in half the time by doubling all the energies). It is hard to give an explicit comparison to the best known classical complexity of this general problem. There are classical algorithms and heuristics for special cases, but in the worst case the best solutions known require time exponential in n and polynomial in τ (e.g., in τ^{1+o(1)}). Another natural formulation of the problem [5,29] considers the case of sparse Hamiltonians, where for any basis state |x⟩ there are at most d basis states |y⟩ such that ⟨x|H|y⟩ ≠ 0, and one can efficiently compute this neighborhood for any input x. This leads to the following black-box formulation of the problem.
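The first-order Trotter approximation can be checked numerically. A minimal numpy sketch (not from the source) with two non-commuting single-qubit terms, confirming that the operator-norm error shrinks like O(t²/n) as the number of slices n grows:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def U(H, t):
    """e^{iHt} for Hermitian H, via eigendecomposition (numpy only)."""
    w, V = np.linalg.eigh(H)
    return V @ np.diag(np.exp(1j * w * t)) @ V.conj().T

H1, H2, t = X, Z, 1.0  # two non-commuting terms, H = H1 + H2

def trotter_error(n):
    """Spectral-norm error of the n-slice first-order Trotter product."""
    approx = np.linalg.matrix_power(U(H1, t / n) @ U(H2, t / n), n)
    return np.linalg.norm(approx - U(H1 + H2, t), 2)

errs = [trotter_error(n) for n in (4, 8, 16, 32)]  # shrinks roughly like 1/n
```

Each doubling of n roughly halves the error, consistent with the O(t²/n) term in the Trotter formula; higher-order product formulas improve this scaling, which is what the N_exp bound above exploits.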
Some applications involve trying to study properties of the ground state of a Hamiltonian by combining simulation of the dynamics with some approach for generating the ground state with non-trivial initial amplitude. For example, if one had a procedure A for generating a ground state |E₀⟩ of H with probability p, then one can combine the algorithm for simulating H with eigenvalue estimation, and then apply amplitude amplification on the eigenvalue estimates in order to generate the ground state with high fidelity. The algorithm would use O(1/√p) applications of A and A⁻¹ and O(1/√p) eigenvalue estimations. The eigenvalue estimations would have to be precise enough to distinguish |E₀⟩ from the next highest energy eigenstate produced by A. If the gap between these two energies is Δ, then the eigenvalue estimation would involve simulating the evolution of H for a time in Ω(1/Δ). Several papers address techniques for generating the ground states for various problems of interest, including [3]. Another application of such simulations is to implement algorithms that are designed in the continuous-time model in a discrete-time quantum computation model.
Hamiltonian Simulation Problem 2 Input: Integers n and d, and a black-box U_H that maps |x, i⟩|0⟩ ↦ |x, i⟩|y_i, H_{x,y_i}⟩, where y_i is the index of the ith non-zero entry in column x of the Hermitian matrix H (if there are d′ < d non-zero entries, then U_H can output any index for i > d′). The values H_{x,y} = ⟨x|H|y⟩ are the (x, y) entries of H represented in the computational basis. A positive real number t. A positive real number ε < 1. An n-qubit state |ψ⟩. Output: A state |ψ_f⟩ satisfying ‖ |ψ_f⟩ − e^{iHt}|ψ⟩ ‖ < ε. The best known general solution was shown in [29]. Quantum Complexity of Hamiltonian Simulation Problem 2 There is a quantum algorithm that simulates e^{iHt} on a given input |ψ⟩ with trace distance error at most ε < 1 that uses a number of black-box calls, N_bb, satisfying, for any positive integer k,

N_bb ∈ O(5^{2k} d^{4+1/k} τ^{1+1/2k} (log* n)),

where τ = ‖H‖t and log* n is the smallest positive integer r such that log₂^{(r)} n < 2, where log₂^{(r)} refers to iterating the log₂ function r times. In general, there are Hamiltonians that require Ω(τ) black-box evaluations.
For example, setting k = √s, where ε = 1/2^s, we get

N_bb ∈ τ^{1+O(1/√s)} d^{4+O(1/√s)} (log* n).
Generalizations of the Abelian Hidden Subgroup Problem Non-Abelian Hidden Subgroup Problem One of the most natural ways to generalize the Abelian hidden subgroup problem is to non-Abelian groups. The problem definition is the same, apart from letting G be non-Abelian. One natural problem in this class is the graph automorphism problem. Graph automorphism problem: Consider G = S_n, the symmetric group on n elements, which corresponds to the permutations of {1, 2, …, n}. Let G be a graph on n vertices labeled {1, 2, …, n}. For any permutation π ∈ S_n, let f_G map S_n to the set of n-vertex graphs by mapping f_G(π) = π(G), where π(G) is the graph obtained by permuting the vertex labels of G according to π. For the function f_G, the hidden subgroup of G is the automorphism group of G (i.e., the permutations π such that π(G) = G). The graph isomorphism problem (deciding if two graphs G₁ and G₂ are isomorphic) can be reduced to solving the graph automorphism problem. The best known classical algorithm takes time in e^{O(√(n log n))} to decide if
two graphs are isomorphic, and there is no substantially better quantum algorithm known. There has been much work attacking the non-Abelian hidden subgroup problem. Ettinger, Høyer and Knill [71] showed the following. Query Complexity of the Non-Abelian Hidden Subgroup Problem For any finite group G, the non-Abelian hidden subgroup problem can be solved with high probability using O(log |G|) queries to U_f. Thus, the main question remaining is whether it is possible for the entire algorithm, including black-box queries and other elementary operations, to be efficient. Quantum Fourier Transform Approaches One natural approach is to mimic the algorithm for the Abelian HSP, which starts by computing

(1/√|G|) Σ_{x∈G} |x⟩|f(x)⟩

and noticing that this equals

Σ_{y∈f(G)} (1/√|f(G)|) ((1/√|K|) Σ_{x∈f⁻¹(y)} |x⟩) |y⟩.

For convenience, we suppose we measure the second register (it suffices just to trace out the second register) and get some random outcome y, thereby projecting the first register into the state

|a + K⟩ = (1/√|K|) Σ_{x∈K} |a + x⟩,

where a is any element of G such that f(a) = y. In other words, as in the Abelian HSP algorithm, the first register is an equal superposition of all the elements in some coset of K. The question is: how do we extract information about K given a random coset state of K? In the Abelian case, a quantum Fourier transform of the coset state allowed us to sample elements that were orthogonal to K, which we illustrated in the case of Simon's algorithm. A more general way of looking at the quantum Fourier transform of an Abelian group G = Z_{N_1} × Z_{N_2} × ⋯ × Z_{N_l} is in terms of representation theory. The Abelian group G can be represented by homomorphisms that map G to the complex numbers. There are in fact |G| such homomorphisms, in a natural one-to-one correspondence with the elements of G. For each g = (a₁, a₂, …, a_l) ∈ G, define χ_g to be the homomorphism that maps any h = (b₁, b₂, …, b_l) ∈ G according to

χ_g(h) = e^{2πi Σ_{i=1}^{l} a_i b_i / N_i}.
Using these definitions, we derive the following quantum Fourier transform map:

|g⟩ ↦ (1/√|G|) Σ_{h∈G} χ_g(h) |h⟩.

It is easy to verify that the quantum Fourier transform QFT_{N_1} ⊗ QFT_{N_2} ⊗ ⋯ ⊗ QFT_{N_l} maps a coset state of a subgroup K to a superposition of labels h where K is in the kernel of χ_h, that is, χ_h(k) = 1 for all k ∈ K. Thus, after applying the quantum Fourier transform, one will only measure labels h = (b₁, b₂, …, b_l) such that, for every g = (a₁, a₂, …, a_l) ∈ K,

χ_g(h) = e^{2πi Σ_{i=1}^{l} a_i b_i / N_i} = 1,
which generalizes the notion of orthogonality defined in the explanation of Simon's algorithm in a very natural way, since it means Σ_{i=1}^{l} a_i b_i / N_i ≡ 0 mod 1. This gives us a linear equation that must be satisfied by all (a₁, a₂, …, a_l) ∈ K, and thus after l + O(1) such random equations are collected, we can efficiently find generators for K by solving a linear system of equations. One can also define representations for finite non-Abelian groups G, except that in order to fully capture the structure of G, we allow homomorphisms to invertible matrices over ℂ (in the Abelian case, we only need 1 × 1 matrices). For any finite group G, one can define a finite set of such homomorphisms ρ₁, ρ₂, …, ρ_k that map elements of G to unitary matrices of dimension d₁ × d₁, d₂ × d₂, …, d_k × d_k, respectively, with the property that Σ_i d_i² = |G|, and any other such representation is equivalent to a representation that can be factored into a collection of some number of these representations. More precisely, any representation ρ : G → M_{d×d}(ℂ) has the property that there exists some invertible P ∈ M_{d×d}(ℂ) and a list ρ_{j₁}, ρ_{j₂}, … such that Σ_i d_{j_i} = d and, for every g ∈ G, Pρ(g)P⁻¹ = ⊕_i ρ_{j_i}(g) (that is, Pρ(g)P⁻¹ is block-diagonal, with the matrices ρ_{j_i}(g) along the diagonal). Furthermore, the representations ρ_j are irreducible in the sense that they cannot be decomposed into the sum of two or more smaller representations in this way. They are also unique up to conjugation. Since Σ_i d_i² = |G|, one can define a unitary transformation from G to the vector space spanned by the labels of
all the entries ρ_i(j, k) of these irreducible representations ρ_i. In particular, we can map

|g⟩ ↦ Σ_i √(d_i/|G|) Σ_{1≤j,k≤d_i} ρ_i(j, k)(g) |i, j, k⟩,
where ρ_i(j, k)(g) is the value of the (j, k) entry in the matrix ρ_i(g). Such a mapping is unitary and is called a quantum Fourier transform for the group G. There is freedom in the specific choice of ρ_i within the set of unitary representations that are conjugate to ρ_i, so there is more than one quantum Fourier transform. There has been much study of such quantum Fourier transforms for non-Abelian groups, which are sometimes possible to implement efficiently [21,96,140,151,156], but efficient constructions are not known in general. It appears they are of limited use in solving the non-Abelian hidden subgroup problem, except in special cases [68,87,96,156] such as when K is a normal subgroup of G. In the next sections we discuss several other lines of attack on the non-Abelian hidden subgroup problem that have yielded some partial progress. "Sieving" Kuperberg [130] introduced a method for attacking the hidden subgroup problem for the dihedral group that leads to a subexponential algorithm. The dihedral group D_N is a non-Abelian group of order 2N, which corresponds to the set of symmetries of a regular N-gon. It can be defined as the set of elements {(x, d) | x ∈ {0, 1, 2, …, N−1}, d ∈ {0, 1}}, where {(x, 0) | x ∈ {0, 1, 2, …, N−1}} is the Abelian subgroup of D_N corresponding to the rotations (satisfying (x, 0) + (y, 0) = (x + y mod N, 0)), and the sets {(0, 0), (y, 1)} are Abelian subgroups of order 2 corresponding to reflections. In general, (x, 0) + (y, 1) = (y − x, 1) and (y, 1) + (x, 0) = (x + y, 1), so the two orderings differ in general. If the hidden subgroup is a subgroup of the Abelian subgroup of order N, then finding the hidden subgroup easily reduces to the Abelian hidden subgroup problem. However, there is no known efficient algorithm for finding hidden subgroups of the form {(0, 0), (y, 1)}. So we can focus attention on the following restricted version of the dihedral hidden subgroup problem.
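The dihedral group law can be made concrete. A small Python sketch (illustrative; the additive convention chosen here is one of several equivalent ones, picked to reproduce the rules quoted in the text):

```python
def d_add(a, b, N):
    """Composition in the dihedral group D_N, with elements (x, d):
    x in Z_N a rotation amount, d in {0, 1} a reflection bit.  Convention:
    (x1, d1) + (x2, d2) = (x2 + (-1)**d2 * x1 mod N, d1 xor d2)."""
    (x1, d1), (x2, d2) = a, b
    return ((x2 + (-1) ** d2 * x1) % N, d1 ^ d2)

N = 12
ab = d_add((3, 0), (5, 1), N)    # (x,0)+(y,1) = (y-x, 1) -> (2, 1)
ba = d_add((5, 1), (3, 0), N)    # (y,1)+(x,0) = (x+y, 1) -> (8, 1): non-Abelian
inv = d_add((7, 1), (7, 1), N)   # every reflection is its own inverse -> (0, 0)
```

The last line checks that {(0, 0), (y, 1)} really is a subgroup of order 2, which is exactly the kind of hidden subgroup the hard case below asks for.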
Dihedral Hidden Subgroup Problem (Hard Case) Input: An integer N, and a black-box implementing U_f : |x, d⟩|0⟩ ↦ |x, d⟩|f(x, d)⟩, where f(x₁, d₁) = f(x₂, d₂) if and only if (x₁, d₁) − (x₂, d₂) ∈ {(0, 0), (y, 1)} for some fixed y. Problem: Find y.
As mentioned earlier, the dihedral hidden subgroup problem can be efficiently reduced to the following phase estimation problem [70]. Ettinger-Høyer Phase Estimation Problem for the Dihedral HSP Input: An integer N, and a black-box O_d that outputs a classical value k ∈ {0, 1, …, N−1} uniformly at random, along with the qubit (1/√2)(|0⟩ + e^{iθk}|1⟩), where θ = 2πd/N, for some integer d ∈ {0, 1, 2, …, N−1}. Problem: Find d. Note that if we could sample the values k = 2^j for j = 1, 2, …, ⌈log N⌉, then the phase estimation problem could be solved directly using the quantum Fourier transform [54]. Regev designed a clever method for generating states of the form (1/√2)(|0⟩ + e^{i2^j θ}|1⟩) using polynomially many calls to the black-box O_d, given an oracle that solves the subset sum problem on average. Kuperberg [130] developed a "sieving" method of generating states of the form (1/√2)(|0⟩ + e^{i2^j θ}|1⟩), and the method was refined and improved to use less memory by Regev [154]. Quantum Algorithms for the Dihedral HSP There exists a quantum algorithm that solves the dihedral HSP
with running time in e^{O(√(log N))} and space in e^{O(√(log N))}. There is also an algorithm with running time in e^{O(√(log N log log N))} that uses space in O(log N).
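To give a flavor of the sieve, here is a purely classical, labels-only toy model (a loose illustration, not Kuperberg's algorithm: the real sieve stores a qubit per label and each combination succeeds only with probability 1/2). Pairing labels that agree on their low bits and keeping the difference k − k′ steadily increases the number of guaranteed trailing zero bits, until the useful label N/2 appears.

```python
import random

def kuperberg_label_sieve(nbits, pool_size, seed=0):
    """Toy, labels-only model of the sieve for N = 2**nbits.  Each stage pairs
    labels agreeing on their lowest `zeros` bits and keeps k - k' (mod N),
    which has at least `zeros` trailing zero bits.  Returns True if the target
    label N/2 (which would reveal one bit of d) is produced."""
    rng = random.Random(seed)
    N = 1 << nbits
    pool = [rng.randrange(N) for _ in range(pool_size)]
    zeros = 0
    step = max(1, round(nbits ** 0.5))       # ~sqrt(nbits) bits per stage
    while zeros < nbits - 1:
        zeros = min(zeros + step, nbits - 1)
        waiting, new_pool = {}, []
        for k in pool:
            key = k % (1 << zeros)           # match on the low bits
            if key in waiting:
                new_pool.append((k - waiting.pop(key)) % N)
            else:
                waiting[key] = k
        pool = new_pool                      # all labels now = 0 mod 2**zeros
    return (N >> 1) in pool
```

The pool shrinks by roughly half per stage, so about 2^{O(√(log N))} starting labels suffice, which is the origin of the subexponential cost.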
Classical Algorithms for the Dihedral HSP The classical complexity of the dihedral hidden subgroup problem is Θ(√N) evaluations of the function f. Similar sieving methods were applied [11] to yield a subexponential time algorithm for the HSP over the product groups Gⁿ for a fixed non-Abelian group G. It has also been shown that these kinds of quantum sieve algorithms will not give efficient quantum algorithms for graph isomorphism [141]. "Pretty Good Measurements" A natural approach to solving the non-Abelian hidden subgroup problem is to prepare several instances of a random coset state for the hidden subgroup K, and then try to determine what K is. More precisely, after preparing

Σ_{x∈G} |x⟩|f(x)⟩ = Σ_{y+K∈G/K} |y + K⟩|f(y)⟩

(up to normalization) and discarding the second register, we are left with the mixed state

ρ_K = (1/|G/K|) Σ_{y+K∈G/K} |y + K⟩⟨y + K|.
Thus one could try to implement or approximate the optimal quantum measurement for identifying the mixed states ρ_K, over all possible hidden subgroups K ≤ G. Furthermore, one could sample the state ρ_K a polynomial number of times t and try to guess K given ρ_K ⊗ ρ_K ⊗ ⋯ ⊗ ρ_K = ρ_K^{⊗t}. Holevo [102] determined the optimal measurement for the following general state distinguishability problem. Given ρ ∈ {ρ_j}, output a label m such that the probability that ρ = ρ_m is maximized. Let p_j denote the probability that ρ = ρ_j. Holevo proved that the maximum probability of guessing the correct input state is achieved by a POVM with elements {G_j} satisfying the conditions Σ_i p_i ρ_i G_i = Σ_i p_i G_i ρ_i and Σ_i p_i ρ_i G_i ≥ p_j ρ_j. However, it is not in general easy to efficiently find and implement such a POVM. Hausladen and Wootters [99] defined a 'pretty good' measurement for distinguishing quantum states that is not necessarily optimal, but has a simpler form. The measurement uses POVM elements G_j = T^{−1/2} ρ_j T^{−1/2} and, if these do not sum to the identity, also I − Σ_j G_j, where T = Σ_i ρ_i. For the case of the dihedral hidden subgroup problem, it was shown [19] that the pretty good measurement is in fact optimal; however, in this case, it is still not known how to efficiently implement the pretty good measurement. It was, however, later shown how to implement the pretty good measurement for the Heisenberg group HSP [19]. For example, in the case of the dihedral hidden subgroup problem for D_{2N}, after a quantum Fourier transform on each of n coset states, one can view the resulting state of n registers as

(|0⟩ + e^{2πik₁d/N}|1⟩)|k₁⟩ ⊗ (|0⟩ + e^{2πik₂d/N}|1⟩)|k₂⟩ ⊗ ⋯ ⊗ (|0⟩ + e^{2πik_n d/N}|1⟩)|k_n⟩

(omitting normalization), for k_i selected independently and uniformly at random from {0, 1, …, N−1}. The

(|0⟩ + e^{2πik₁d/N}|1⟩) ⊗ (|0⟩ + e^{2πik₂d/N}|1⟩) ⊗ ⋯ ⊗ (|0⟩ + e^{2πik_n d/N}|1⟩)

part of the state can be rewritten as

Σ_r α_r |S_r⟩,

where r spans all the possible sums of the form Σ_i b_i k_i, b_i ∈ {0, 1} (the 'subset sums'), |S_r⟩ is the uniform superposition of the strings |b₁b₂…b_n⟩ that satisfy Σ_i b_i k_i = r, and α_r is the appropriate normalization factor (i.e., √(n_r/2ⁿ), where n_r is the number of solutions to Σ_i b_i k_i = r).
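The quantities n_r and α_r can be tabulated directly for small n. A small sketch (illustrative, not from the source; the sample values k_i are hypothetical):

```python
from itertools import product
from math import sqrt

def subset_sum_amplitudes(ks):
    """Tabulate n_r = #{b in {0,1}^n : sum_i b_i * k_i = r} and the amplitudes
    alpha_r = sqrt(n_r / 2^n) of the state  sum_r alpha_r |S_r>."""
    n = len(ks)
    counts = {}
    for bits in product((0, 1), repeat=n):
        r = sum(b * k for b, k in zip(bits, ks))
        counts[r] = counts.get(r, 0) + 1
    return {r: sqrt(c / 2 ** n) for r, c in counts.items()}

alphas = subset_sum_amplitudes([3, 5, 5, 9])  # hypothetical sampled k_i
```

Since every bit string contributes to exactly one sum r, the squared amplitudes add up to 1, and the measurement statistics of the subset-sum register follow n_r/2ⁿ, which is why inverting this basis change is tied to solving subset sum.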
The optimal measurement [19] (in the restricted case of order-two subgroups) can be thought of as mapping |S_r⟩ ↦ |r⟩, performing an inverse quantum Fourier transform and then measuring a value d̃, which will be the guess of the value d (interestingly, a similar measurement is optimal [61] in the case of trying to optimally estimate an arbitrary phase parameter θ ∈ [0, 2π) given the state (|0⟩ + e^{iθk₁}|1⟩)(|0⟩ + e^{iθk₂}|1⟩)⋯(|0⟩ + e^{iθk_n}|1⟩)). Note that implementing such a basis change in reverse would solve the subset sum problem (which is NP-hard). In fact, it suffices to solve the subset sum problem on average [154]. A nice discussion of this connection can be found in [19]. For groups that are semidirect products of an Abelian group and a cyclic group, the pretty good measurement corresponds to solving what is referred to in [19] as a 'matrix sum problem', which naturally generalizes the subset sum problem. They also show that the pretty good measurements are optimal in these cases, and similarly relate their implementation to solving the matrix sum problem, and hence to certain average-case algebraic problems. They show that the pretty good measurement can be implemented for several groups, including semidirect product groups of the form Z_p^r ⋊ Z_p for constant r (when r = 2, this is the Heisenberg group), and of the form Z_N ⋊ Z_p with p prime (which are metacyclic) and where the ratio N/p is polynomial in log N. Other Methods and Results There are also some algorithms for solving other cases of the non-Abelian hidden subgroup problem that do not use any of the above techniques, e.g., [82,110,111]. These results use sophisticated classical and quantum group-theoretic techniques to reduce an instance of a non-Abelian HSP to instances of the HSP in Abelian groups. One of the most recent results [112] shows that such a reduction is possible for nil-2 groups, which are nilpotent groups of class 2. The group G is nilpotent of class n if the following holds¹.
Let $A_1 = G$, and let $A_{i+1} = [A_i, G]$ for $i > 0$. A group $G$ is nilpotent if $A_{n+1} = \{1\}$ for some integer $n$, and the class of a nilpotent group is the smallest positive integer $n$ for which $A_{n+1} = \{1\}$. (For any two elements $g, h$ of a group, the commutator, denoted $[g,h]$, is defined as $[g,h] = g^{-1}h^{-1}gh$; for any two subgroups $H, K \leq G$, $[H,K]$ denotes the (normal) subgroup of $G$ generated by all the commutators $[h,k]$ with $h \in H$, $k \in K$.)

One of their techniques is to generalize the Abelian HSP to a slightly more general problem, in which the hidden subgroup $K$ is hidden by a quantum procedure with the following properties. For every $g_1, g_2, \ldots, g_N \in G$ the algorithm maps

$|g_1\rangle |g_2\rangle \cdots |g_N\rangle |0\rangle |0\rangle \cdots |0\rangle \mapsto |g_1\rangle |g_2\rangle \cdots |g_N\rangle |\gamma^1_{g_1}\rangle |\gamma^2_{g_2}\rangle \cdots |\gamma^N_{g_N}\rangle$

where, for each $i = 1, 2, \ldots, N$, the set of states $\{|\gamma^i_g\rangle \mid g \in G\}$ is a hiding set for $K$. A set of normalized states $\{|\gamma_g\rangle \mid g \in G\}$ is a hiding set for the subgroup $K$ of $G$ if:
If $g$ and $h$ are in the same left coset of $K$, then $|\gamma_g\rangle = |\gamma_h\rangle$.
If $g$ and $h$ are in different left cosets of $K$, then $\langle \gamma_g | \gamma_h \rangle = 0$.
Generators for the subgroup $K$ of $G$ can be found in time polynomial in $\log|G|$ using a quantum hiding procedure with $N \in O(\log|G|)$. They find a series of non-trivial reductions of the standard HSP in nil-2 groups to instances of the Abelian HSP with a quantum hiding function.

Lattice and Number Field Problems

The Abelian hidden subgroup problem also works for finitely generated groups $G$. We can thus define the hidden subgroup problem on $G = \mathbb{Z} \times \mathbb{Z} \times \cdots \times \mathbb{Z} = \mathbb{Z}^n$. The hidden subgroup $K$ will be generated by some $n$-tuples in $\mathbb{Z}^n$. We can equivalently think of $G$ as a lattice and $K$ as a sublattice. The function $f : G \to X$, for some finite set $X$, that satisfies $f(x) = f(y)$ if and only if $x - y \in K$, can be thought of as hiding the sublattice $K$. By generalizing the problem to hiding sublattices of $\mathbb{R}^n$, one can solve some interesting and important number-theoretic problems. The solutions in these cases were not a simple extension of the Abelian hidden subgroup algorithm. Hallgren [93,95] found a quantum algorithm for finding the integer solutions $x, y$ to Pell's equation $x^2 - dy^2 = 1$, for any fixed integer $d$. He also found an efficient quantum algorithm for the principal ideal problem, and later generalized it to computing the unit group of a number field of constant degree [94]. Solving Pell's equation is known to be at least as hard as factoring integers. We do not have room to introduce all the necessary definitions, but we will sketch some of the main ideas.
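For contrast with the quantum algorithm, Pell's equation can be solved classically via the continued-fraction expansion of $\sqrt{d}$; the catch is that the fundamental solution can have a number of digits exponential in $\log d$, so this baseline takes time exponential in the input size. A minimal sketch:

```python
from math import isqrt

def pell(n):
    """Fundamental solution (x, y) of x^2 - n*y^2 = 1 (n not a square),
    found from the continued-fraction convergents of sqrt(n)."""
    a0 = isqrt(n)
    if a0 * a0 == n:
        raise ValueError("n must not be a perfect square")
    m, d, a = 0, 1, a0            # continued-fraction state
    p_prev, p = 1, a0             # convergent numerators p/q
    q_prev, q = 0, 1
    while p * p - n * q * q != 1:
        m = d * a - m
        d = (n - m * m) // d
        a = (a0 + m) // d
        p_prev, p = p, a * p + p_prev
        q_prev, q = q, a * q + q_prev
    return p, q
```

For example, `pell(61)` already returns a ten-digit $x$, illustrating how quickly the answer outgrows the input.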
A number field $F$ can be defined as a subfield $\mathbb{Q}(\theta)$ of the complex numbers, generated by the rationals $\mathbb{Q}$ together with a root $\theta$ of an irreducible polynomial with rational coefficients; the degree of $F$ is the degree of that irreducible polynomial. The "integers" of $F$ are the elements of $F$ that are roots of monic polynomials with integer coefficients (i.e. polynomials with leading coefficient equal to 1, such as $x^2 + 5x + 1$). The integers of $F$ form a ring, denoted $\mathcal{O}$.
One can define a parameter $\Delta$ called the discriminant of $F$ (we will not define it here), and an algorithm is considered efficient if its running time is polynomial in $\log \Delta$ and $n$. The unit group of the ring $\mathcal{O}$, denoted $\mathcal{O}^*$, is the set of elements $\alpha$ of $\mathcal{O}$ that have an inverse $\alpha^{-1} \in \mathcal{O}$. "Computing" the unit group corresponds to finding a description of a system of "fundamental" units $\epsilon_1, \epsilon_2, \ldots, \epsilon_r$ that generate $\mathcal{O}^*$, in the sense that every unit $\epsilon \in \mathcal{O}^*$ can be written as $\epsilon = \mu\,\epsilon_1^{k_1} \epsilon_2^{k_2} \cdots \epsilon_r^{k_r}$ for some $k_1, k_2, \ldots, k_r \in \mathbb{Z}$ and some root of unity $\mu$. However, in general, an exact description of a fundamental unit requires an exponential number of bits. There are some finite-precision representations of the elements of the unit group, such as the "Log" embedding into $\mathbb{R}^r$. This representation describes a unit $\alpha$ by an element of $\mathbb{R}^r$, where some finite-precision representation of each coordinate suffices. This Log representation of the unit group $\mathcal{O}^*$ corresponds to a sublattice $L = \mathrm{Log}\,\mathcal{O}^*$ of $\mathbb{R}^r$. Hence we have a relationship between several important computational number field problems and lattice problems. By the above correspondence between $\mathcal{O}^*$ and the lattice $L \subseteq \mathbb{R}^r$, we can formulate [94,162] the problem of computing the unit group as the problem of finding elements that approximate generators of the sublattice $L$ of $\mathbb{R}^r$. One important non-trivial step is defining a function $f : \mathbb{R}^r \to X$ (for some infinite set $X$) such that $f(x) = f(y)$ if and only if $x - y \in L$, as well as appropriate discrete approximations to this function. The definition of these functions involves substantial knowledge of algebraic number theory, so we will not describe them here. By designing quantum algorithms for approximating generators of the lattice $L$ [94,162], one can find polynomial-time algorithms for computing the unit group $\mathcal{O}^*$ of an algebraic number field $F = \mathbb{Q}(\theta)$.
A corollary of this result is a somewhat simpler solution to the principal ideal problem (in a constant-degree number field) that had been found earlier by Hallgren [93]. An ideal $I$ of the ring $\mathcal{O}$ is a subset of elements of $\mathcal{O}$ that is closed under addition, and is also closed under multiplication by elements of $\mathcal{O}$. An ideal $I$ is principal if it can be generated by one element $\alpha \in I$; in other words, $I = \alpha\mathcal{O} = \{\alpha\beta \mid \beta \in \mathcal{O}\}$. The principal ideal problem is, given generators for $I$, to decide whether $I$ is principal, and if it is, to find a generator $\alpha$. As mentioned in Subsect. "Abelian Hidden Subgroup Problem", the tools developed can also be applied to find unique (quantum) representatives of elements of the class group for constant-degree number fields [95] (assuming the generalized Riemann hypothesis), and thus allow for the computation of the class group in these cases.
Hidden Non-linear Structures

Another way to think of the Abelian hidden subgroup problem is as an algorithm for finding a hidden linear structure within a vector space. For simplicity, let us consider the Abelian HSP over the additive group $G = \mathbb{F}_q \times \mathbb{F}_q \times \cdots \times \mathbb{F}_q = \mathbb{F}_q^n$, where $q = p^m$ is a prime power. The elements of $G$ can also be regarded as a vector space over $\mathbb{F}_q$. A hidden subgroup $H \leq G$ corresponds to a subspace of this vector space, and its cosets correspond to parallel affine subspaces or flats. The function $f$ is constant on these linear structures within the vector space $\mathbb{F}_q^n$. A natural way to generalize this [51] is to consider functions that are constant on sets defined by non-linear equations. One problem they study is the hidden radius problem. The circle of radius $r \in \mathbb{F}_q$ centered at $t = (t_1, t_2, \ldots, t_n) \in \mathbb{F}_q^n$ is the set of points $x = (x_1, x_2, \ldots, x_n) \in \mathbb{F}_q^n$ that satisfy $\sum_i (x_i - t_i)^2 = r$. A point $x$ on a circle centered at $t$ will be represented either as $(x, \pi(s))$, where $s = x - t$ and $\pi(s)$ is a random permutation of $s$, or as $(x, \pi(t))$, where $\pi(t)$ is a random permutation of $t$. We define $f_1$ and $f_{-1}$ to be the functions satisfying $f_1(x, \pi(s)) = \pi(t)$ and $f_{-1}(x, \pi(t)) = \pi(s)$.

Hidden Radius Problem
Input: A black box $U_{f_1}$ that implements $|x\rangle|\pi(s)\rangle|0\rangle \mapsto |x\rangle|\pi(s)\rangle|f_1(x,\pi(s))\rangle$, and a black box $U_{f_{-1}}$ that implements $|x\rangle|\pi(t)\rangle|0\rangle \mapsto |x\rangle|\pi(t)\rangle|f_{-1}(x,\pi(t))\rangle$.
Problem: Find $r$.

Quantum Algorithms for the Hidden Radius Problem
For odd $d$, there is a bounded-error quantum algorithm that makes polylog$(q)$ queries to $U_{f_1}$ and $U_{f_{-1}}$ and finds $r$. However, there is no known polynomial bound on the non-query operations. There is a bounded-error quantum algorithm that makes polylog$(q)$ operations in total and determines $\chi(r)$, where $\chi(r) = 1$ if $r$ is a quadratic residue (that is, $r = u^2$ for some $u \in \mathbb{F}_q$) and $0$ otherwise.
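The indicator $\chi(r)$ itself is classically easy to evaluate for a prime field: by Euler's criterion, $r^{(q-1)/2} \equiv 1 \pmod q$ exactly when $r$ is a nonzero quadratic residue. A minimal sketch for prime $q$ (the hard part of the hidden radius problem is recovering $r$ from the black boxes, not evaluating $\chi$):

```python
def chi(r, q):
    """Euler's criterion over the prime field F_q: return 1 if r is a
    nonzero quadratic residue mod q, 0 otherwise (r = 0 also gives 0)."""
    if r % q == 0:
        return 0
    return 1 if pow(r, (q - 1) // 2, q) == 1 else 0
```

For $q = 7$, the nonzero quadratic residues are $\{1, 2, 4\}$, which the function reproduces.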
Classical Algorithms for the Hidden Radius Problem
It was shown in [51] that the expected number of queries needed to be able to guess any bit of information about $r$ correctly with probability greater than $1/2 + 1/\mathrm{poly}(d \log q)$ is exponential in $d \log q$.

A number of other black-box problems of this kind were defined in [51], with quantum algorithms that are exponentially more efficient than any classical algorithm, in some cases just in terms of query complexity, and in other cases in terms of all operations. These problems fit into the frameworks of shifted subset problems and hidden polynomial problems. They use a variety of non-trivial techniques for these various problems, including the quantum Fourier transform and quantum walks on graphs, and make some non-trivial connections to various Kloosterman sums. Further work along these lines has been done in [63].
Hidden Shifts and Translations

There have been a variety of generalizations of the hidden subgroup problem to the problem of finding some sort of hidden shift or translation. Grigoriev [89] addressed the problem of the shift-equivalence of two polynomials (the self-shift-equivalence problem is a special case of the Abelian hidden subgroup problem). Given two polynomials $P_1, P_2$ in $l$ variables $X_1, X_2, \ldots, X_l$ over $\mathbb{F}_q$ (the finite field with $q$ elements), does there exist an element $(a_1, a_2, \ldots, a_l) \in \mathbb{F}_q^l$ such that $P_1(X_1 - a_1, X_2 - a_2, \ldots, X_l - a_l) = P_2(X_1, X_2, \ldots, X_l)$? More generally, if there is a group $G$ ($\mathbb{F}_q^l$ in this case) acting on a set $X$ (the set of polynomials in $l$ variables over $\mathbb{F}_q$ in this case), one can ask whether two elements $x, y \in X$ are in the same orbit of the action of $G$ on $X$ (that is, whether there is a $g \in G$ such that $g(x) = y$). In general, this seems like a hard problem, even for a quantum computer. The dihedral hidden subgroup problem [70] is a special case of the hidden translation problem [82], in which there is a finite group $G$, with unique representation of the elements of $G$, and two injective functions $f_0$ and $f_1$ from $G$ to some finite set $X$.

Hidden Translation Problem
Input: Two black boxes $U_{f_0}$ and $U_{f_1}$ that, for any $x \in G$, implement the maps $U_{f_0} : |x\rangle|0\rangle \mapsto |x\rangle|f_0(x)\rangle$ and $U_{f_1} : |x\rangle|0\rangle \mapsto |x\rangle|f_1(x)\rangle$. A promise that $f_1(x) = f_0(ux)$ for some $u \in G$.
Problem: Find $u$.

The same problem expressed with additive group notation has been called the hidden shift problem, and instances of this problem were solved efficiently by van Dam, Hallgren and Ip [59]. For example, they find an efficient solution in the case that $f_1(x) = f_0(x + s)$, where $f_0 = \chi$ is a multiplicative character function over a finite field, which implies a method for breaking a class of "algebraically homomorphic" cryptosystems. They also describe a more general hidden coset problem.
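As a concrete classical picture of the character hidden shift problem (a sketch on a hypothetical tiny instance, not the quantum algorithm itself): take $f_0$ to be the Legendre symbol over $\mathbb{F}_p$ and $f_1(x) = f_0(x+s)$, and recover $s$ by brute force over all candidate shifts.

```python
def legendre(x, p):
    """Legendre symbol chi(x) over F_p: 1, -1, or 0."""
    if x % p == 0:
        return 0
    return 1 if pow(x, (p - 1) // 2, p) == 1 else -1

def find_shift(f0, f1, p):
    """Brute-force the hidden shift s with f1(x) = f0(x + s).
    Worst case ~p^2 evaluations; the quantum algorithm of van Dam,
    Hallgren and Ip finds s with polylog(p) effort."""
    for s in range(p):
        if all(f1(x) == f0((x + s) % p) for x in range(p)):
            return s
    return None

p, s = 43, 17                      # hypothetical toy instance
f0 = lambda x: legendre(x, p)
f1 = lambda x: f0((x + s) % p)
```

The shift is unique here: $f_0$ vanishes only at $x \equiv 0$, so the position of the zero in $f_1$ pins down $s$.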
In [82] it is shown how to solve the hidden translation problem in $G = \mathbb{Z}_p^n$ in polynomial time, and then how to use this to solve the problem for any group that they call "smoothly solvable". Let us briefly define what this means.
The derived subgroup of a group $G$ is $G^{(1)} = [G, G]$. In general, we define $G^{(0)} = G$ and $G^{(n+1)} = [G^{(n)}, G^{(n)}]$ for $n \geq 0$. A group $G$ is solvable if $G^{(m)} = \{1\}$, the trivial group, for some positive integer $m$, and the series of subgroups $G^{(0)} \supseteq G^{(1)} \supseteq \cdots \supseteq G^{(m)}$ is called the derived series of $G$. A group $G$ is called smoothly solvable if the length $m$ of its derived series is bounded above by a constant, and if the factor groups $G^{(j)}/G^{(j+1)}$ are isomorphic to a direct product of a group of bounded exponent (the exponent of a group is the smallest positive integer $r$ such that $g^r = 1$ for all $g$ in the group) and a group of size polynomial in $\log|G|$. The algorithm for smoothly solvable groups works by solving a more general orbit coset problem, for which they prove a "self-reducibility" property. In particular, the orbit coset problem for a finite group $G$ is reducible to the orbit coset problems in $G/N$ and $N$, for any solvable normal subgroup $N$ of $G$.
Quantum Algorithms for the Hidden Translation Problem
For groups $G = \mathbb{Z}_p^n$, the hidden translation problem can be solved with bounded error probability using $O(p(n+p)^{p-1})$ queries and $(n+p)^{O(p)}$ other elementary operations. In general, for smoothly solvable groups $G$, the hidden translation problem can also be solved with a polynomial number of queries and other elementary operations. Another consequence of the tools they developed is a polynomial-time solution to the hidden subgroup problem for such smoothly solvable groups.

Classical Algorithms for the Hidden Translation Problem
In general, including the case $G = \mathbb{Z}_p^n$, the hidden translation problem requires $\Omega(\sqrt{|G|})$ queries on a classical computer.

Another natural generalization of the hidden translation or hidden shift problem and the Abelian hidden subgroup problem is the generalized hidden shift problem introduced in [46]. There is a function $f : \{0, 1, \ldots, M-1\} \times \mathbb{Z}_N \to X$ for some finite set $X$, with the property that for each fixed $b \in \{0, 1, \ldots, M-1\}$ the mapping $x \mapsto f(b, x)$ is one-to-one, and there is some hidden value $s \in \mathbb{Z}_N$ such that $f(b, x) = f(b+1, x+s)$ for all $b \in \{0, 1, \ldots, M-2\}$. Note that for $M = 2$ this is equivalent to the dihedral hidden subgroup problem for the group $D_N$, and for $M = N$ this problem is equivalent to the Abelian hidden subgroup problem for the hidden subgroup $\langle(1, s)\rangle$ of the group $\mathbb{Z}_N \times \mathbb{Z}_N$.

Generalized Hidden Shift Problem
Input: Positive integers $M$ and $N$. A black box $U_f$ that maps $|b, x\rangle|0\rangle \mapsto |b, x\rangle|f(b,x)\rangle$ for all $b \in \{0, 1, \ldots, M-1\}$ and $x \in \mathbb{Z}_N$, where $f$ satisfies the properties defined above.
Problem: Find $s$.

Quantum Algorithms for the Generalized Hidden Shift Problem
There is a quantum algorithm that, for any fixed $\epsilon > 0$ and $M \geq N^\epsilon$, solves the generalized hidden shift problem in time polynomial in $\log N$. The algorithm uses a "pretty good measurement" that involves solving instances of the following matrix sum problem: given $x \in \mathbb{Z}_N^k$ and $w \in \mathbb{Z}_N$ chosen uniformly at random, find $b \in \{0, 1, \ldots, M-1\}^k$ such that $\sum_i b_i x_i \equiv w \bmod N$. Note how this generalizes the subset sum problem, which was shown to be related to the dihedral hidden subgroup problem [154]. While there is no efficient solution known for small $M$ (even an average-case solution would suffice for the dihedral HSP), for $M \geq N^\epsilon$, Lenstra's integer programming algorithm allows for an efficient solution to the matrix sum problem.

Classical Algorithms for the Generalized Hidden Shift Problem
Any classical algorithm requires $\Omega(\sqrt{N})$ evaluations of the function $f$.
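For toy parameters, the matrix sum problem can be solved by exhaustive search over $\{0, \ldots, M-1\}^k$ (an illustrative sketch; the efficient route for $M \geq N^\epsilon$ mentioned in the text is Lenstra's integer programming algorithm, not this brute force):

```python
from itertools import product

def matrix_sum(x, w, M, N):
    """Find b in {0,...,M-1}^k with sum(b_i * x_i) = w (mod N),
    or None if no such b exists.  Exponential in k."""
    for b in product(range(M), repeat=len(x)):
        if sum(bi * xi for bi, xi in zip(b, x)) % N == w:
            return list(b)
    return None
```

With $M = 2$ this is exactly a subset sum search, matching the special case tied to the dihedral HSP.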
Other Related Algorithms

There are a variety of other problems that are not (as far as we know) generalizations of the hidden subgroup problem, and arguably deserve a separate section. We mention them here because various parts of the algorithms for these problems use techniques related to those discussed in one of the other subsections of this section. Van Dam and Seroussi [56] give an efficient quantum algorithm for estimating Gauss sums. Consider a finite field $\mathbb{F}_{p^r}$ (where $p$ is prime and $r$ is a positive integer). The multiplicative characters are homomorphisms of the multiplicative group $\mathbb{F}_{p^r}^*$ to the complex numbers $\mathbb{C}$, which also map $0 \mapsto 0$. Each multiplicative character can be specified by an integer $\alpha \in \{0, 1, \ldots, p^r - 2\}$ by defining $\chi_\alpha(g^j) = \omega^{\alpha j}$, where $g$ is a generator of $\mathbb{F}_{p^r}^*$ and $\omega = e^{2\pi i/(p^r - 1)}$. The additive characters are homomorphisms of the additive group of the field to $\mathbb{C}$, and can be specified by a value $\beta \in \mathbb{F}_{p^r}$ according to $e_\beta(x) = \zeta^{\mathrm{Tr}(\beta x)}$, where $\mathrm{Tr}(y) = \sum_{j=0}^{r-1} y^{p^j}$ and $\zeta = e^{2\pi i/p}$. The Gauss sum $G(\mathbb{F}_{p^r}, \chi_\alpha, e_\beta)$ is defined as

$G(\mathbb{F}_{p^r}, \chi_\alpha, e_\beta) = \sum_{x \in \mathbb{F}_{p^r}} \chi_\alpha(x)\, e_\beta(x)$.

It is known that the norm of the Gauss sum is $|G(\mathbb{F}_{p^r}, \chi_\alpha, e_\beta)| = \sqrt{p^r}$, and thus the hard part is determining, or approximating, the parameter $\gamma$ in the equation $G(\mathbb{F}_{p^r}, \chi_\alpha, e_\beta) = e^{i\gamma}\sqrt{p^r}$.
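For the prime-field case $r = 1$, the Gauss sum can be evaluated directly from its defining formula, which makes the norm property easy to check numerically (an illustrative sketch; the quantum algorithm's task is approximating the phase $\gamma$ for exponentially large fields, where this direct summation is infeasible):

```python
import cmath

def gauss_sum(p, alpha, beta):
    """G(F_p, chi_alpha, e_beta) = sum_x chi_alpha(x) * e_beta(x) over
    the prime field F_p, with chi_alpha(g^j) = omega^(alpha*j) for a
    generator g of F_p^*, and e_beta(x) = exp(2*pi*i*beta*x/p)."""
    # find a generator of the multiplicative group F_p^* (brute force)
    for g in range(2, p):
        if len({pow(g, j, p) for j in range(p - 1)}) == p - 1:
            break
    omega = cmath.exp(2j * cmath.pi / (p - 1))
    chi = {pow(g, j, p): omega ** (alpha * j) for j in range(p - 1)}
    chi[0] = 0
    e = lambda x: cmath.exp(2j * cmath.pi * beta * x / p)
    return sum(chi[x] * e(x) for x in range(p))
```

For nontrivial $\chi_\alpha$ and $\beta \neq 0$ the computed value has norm $\sqrt{p}$, in agreement with the formula above.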
Gauss Sum Problem for Finite Fields
Input: A prime number $p$, a positive integer $r$, and a standard specification of $\mathbb{F}_{p^r}$ (including a generator $g$). A positive integer $\alpha \in \{0, 1, \ldots, p^r - 2\}$. An element $\beta \in \mathbb{F}_{p^r}$. A parameter $\epsilon$, $0 < \epsilon < 1$.
Problem: Output an approximation, with error at most $\epsilon$, to $\gamma$ in the equation $G(\mathbb{F}_{p^r}, \chi_\alpha, e_\beta) = e^{i\gamma}\sqrt{p^r}$.

One noteworthy feature of this problem is that it is not a black-box problem.

Quantum Algorithms for the Finite Field Gauss Sum Problem
There is a quantum algorithm running in time $O(\epsilon^{-1}\,\mathrm{polylog}(p^r))$ that outputs a value $\tilde\gamma$ such that $|\tilde\gamma - \gamma| < \epsilon$ with probability at least $\frac{2}{3}$.

Classical Complexity of the Finite Field Gauss Sum Problem
It was shown that solving this problem is at least as hard as the discrete logarithm problem in the multiplicative group of $\mathbb{F}_{p^r}$ (see Subsect. "Discrete Logarithms").

Various generalizations of this problem were also studied in [56]. Other examples include [57], which studies the problem of finding solutions to equations of the form $af^x + bg^y = c$, where $a, b, c, f, g$ are elements of a finite field and $x, y$ are integers.

Quantum Walk Algorithms

Quantum walks, sometimes called quantum random walks, are quantum analogues of (classical) random walks, which have proved to be a very powerful algorithmic tool in classical computer science. The quantum walk paradigm is still being developed; for example, the relationship between the continuous-time and discrete-time models of quantum walks is still not fully understood. In any case, the best known algorithms for several problems are some type of quantum walk. Here we restrict attention to walks on discrete state spaces. Because of the quantum strong Church–Turing thesis, we expect that any practical application of a walk on a continuous state space will have an efficient (up to polynomial factors) simulation on a discrete system.
In general, any walk algorithm (classical or quantum) consists of a discrete state space, which is usually finite in size, though sometimes infinite state spaces are considered when it is convenient to do so. The state space is usually modeled as the vertices of a graph $G$, and the edges of the graph denote the allowed transitions. In a classical discrete-time walk, the system starts in some initial state $v_i$. Then at every time step the system moves to a random neighbor $w$ of the current vertex $v$, according to some probability distribution $p(v, w)$. Let $M$ denote the matrix whose $(v, w)$ entry is $p(v, w)$. Let $v_0$ be the column vector with value $p_i$ in the $i$th position, where $p_i$ is the probability that the initial vertex is $v_i$. Then the vector $v_t = M^t v_0$ describes the probability distribution of the system after $t$ time steps, having started in a state described by the probability distribution $v_0$.

The walks are usually analyzed as abstract walks on a graph. In practice, the vertices represent more sophisticated objects. For example, suppose one wishes to solve a 3-SAT formula $\Phi$ on $n$ Boolean variables. One could define a random walk on the $2^n$ possible assignments of the Boolean variables, so the vertices of the graph would represent the $2^n$ Boolean strings of length $n$. One could start the walk on a random vertex (which corresponds to a random assignment of the $n$ Boolean variables). At every step of the walk, if the current vertex $v$ corresponds to a satisfying assignment, then $p(v, v) = 1$ and the walk does not leave the vertex. Otherwise, a random unsatisfied clause is picked, and one of the variables in that clause is picked uniformly at random and flipped. This implicitly defines a probability distribution $p(v, w)$.

In a quantum walk, instead of just having classical probability distributions over the vertices $v_i \in V(G)$, one can have superpositions $\sum_{v_i \in V(G)} \alpha_i |v_i\rangle$, and more generally any quantum mixed state over the vertices. If we restrict to unitary transitions, then there is a unitary matrix $U$ that contains the transition amplitudes $\alpha(v, w)$ of going from vertex $v$ to vertex $w$, and if the system starts in the initial state $|\psi_0\rangle$, then after $t$ time steps the state of the system is $U^t |\psi_0\rangle$. These unitary walks are not really "random", since the evolution is deterministic.
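The classical 3-SAT walk described above can be written out directly (a sketch on a hypothetical toy formula; this is essentially Schöning's classical random walk, included only to make the transition rule $p(v, w)$ concrete):

```python
import random

def walk_sat(clauses, n, max_steps=10000, seed=1):
    """Random walk on the 2^n assignments: at a non-satisfying vertex,
    pick an unsatisfied clause and flip one of its variables at random.
    Literals are nonzero ints: +i means x_i, -i means NOT x_i."""
    rng = random.Random(seed)
    v = [rng.randint(0, 1) for _ in range(n)]
    sat = lambda lit: v[abs(lit) - 1] == (1 if lit > 0 else 0)
    for _ in range(max_steps):
        unsat = [c for c in clauses if not any(sat(l) for l in c)]
        if not unsat:
            return v                 # p(v, v) = 1: a satisfying vertex absorbs
        lit = rng.choice(rng.choice(unsat))
        v[abs(lit) - 1] ^= 1         # flip one variable of a violated clause
    return None

# hypothetical toy instance on 4 variables
clauses = [(1, 2, -3), (-1, 3, 4), (2, -4, 1), (-2, 3, -4)]
```

On this tiny satisfiable instance the walk reaches a satisfying assignment almost immediately; the point is only to make the vertex set and transition rule explicit.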
More generally, the transition function could be a completely positive map $\mathcal{E}$, and if the system starts in the initial state $\rho = |\psi_0\rangle\langle\psi_0|$, then after $t$ time steps the state of the system will be $\mathcal{E}^t(\rho)$. One cannot in general define a unitary walk on any graph [164]; however, if one explicitly adds a "coin" system of dimension as large as the maximum degree $d$ of the vertices (i.e. the new state space consists of the states $|v_i\rangle|c\rangle$, $v_i \in V(G)$ and $c \in \{0, 1, \ldots, d-1\}$), then one can define a unitary walk on the graph derived from the combined graph-coin system. In particular, the state of the coin system indicates to which neighbor of a given vertex the system should evolve. More generally, one can define a unitary walk on states of the form $(v_i, v_j)$, where $\{v_i, v_j\}$ is an edge of $G$.

A continuous version of quantum walks was introduced by Farhi and Gutmann [72]. The idea is to let the adjacency matrix of the graph be the Hamiltonian driving the evolution of the system. Since the adjacency matrix is Hermitian, the resulting evolution will be unitary. The reason such a unitary is possible even for a graph where there is no unitary discrete-time evolution is that in this continuous-time Hamiltonian model, for any non-zero time evolution, there is some amplitude with which the walk has taken more than one step.

In classical random walks, one is often concerned with the "mixing time", which is the time it takes for the system to reach its equilibrium distribution. In a purely unitary (and thus reversible) walk, the system never reaches equilibrium, but there are alternative ways of arriving at an effective mixing time (e.g. averaging over time). In general, quantum walks offer at most a quadratically faster mixing. Another property of random walks is the "hitting time", which is the time it takes to reach some vertex of interest. There are examples where quantum walks offer exponentially faster hitting times.

The study of what are essentially quantum walks has been around for decades, and the algorithmic applications have been developed for roughly 10 years. Much of the early algorithmic work developed the paradigm and discovered the properties of quantum walks on abstract graphs, such as the line or circle, and also on general graphs (e.g. [6,17]). There have also been applications to more concrete computational problems, and we will outline some of them here.

Element Distinctness Problem
Input: A black box $U_f$ that maps $|i\rangle|b\rangle \mapsto |i\rangle|b \oplus f(i)\rangle$ for some function $f : \{0, 1, \ldots, N-1\} \to \{0, 1, \ldots, M\}$.
Problem: Decide whether there exist inputs $i$ and $j$, $i \neq j$, such that $f(i) = f(j)$.

Prior to the quantum walk algorithm of Ambainis, the best known quantum algorithm used $O(N^{3/4}\log N)$ queries [42].

Quantum Algorithms for the Element Distinctness Problem
The quantum walk algorithm in [15] uses $O(N^{2/3})$ evaluations of $U_f$, $O(N^{2/3}\,\mathrm{polylog}\,N)$ non-query operations and $O(N^{2/3}\,\mathrm{polylog}\,N)$ space.
Classical Algorithms for Element Distinctness
A classical computer requires $N - O(1)$ applications of $U_f$ in order to guess correctly with bounded error for worst-case instances of $f$.

As is often the case with classical random walk algorithms, the graph is only defined implicitly, and is usually exponentially large in the size of the problem instance. For the element distinctness algorithm, the graph is defined as follows. The vertices are the subsets of $\{1, 2, \ldots, N\}$ of size $\lceil N^{2/3}\rceil$. Two vertices are joined if the subsets differ in exactly two elements. A detailed description and analysis of this walk is beyond the scope of this survey.
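For comparison, the straightforward classical algorithm meets the roughly $N$-query bound with a hash table (a sketch, not from the survey; the quantum walk's advantage is in the query count, not in simplicity):

```python
def colliding_pair(f, N):
    """Return (i, j) with i < j and f(i) == f(j), or None if f is
    injective on {0,...,N-1}.  Uses at most N evaluations of f."""
    seen = {}
    for i in range(N):
        v = f(i)
        if v in seen:
            return seen[v], i
        seen[v] = i
    return None
```

The quantum walk instead queries only $O(N^{2/3})$ positions, maintaining the function values on a size-$\lceil N^{2/3}\rceil$ subset as it walks on the graph described above.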
Szegedy [171] extended the approach of Ambainis to develop a powerful general framework for quantizing classical random walks in order to solve search problems. Suppose we wish to search a solution space of size $N$ in which there are $\epsilon N$ solutions to $f(x) = 1$. Furthermore, suppose there is a classical random walk with transition matrix $M$ with the property that $p(v, w) = p(w, v)$ (known as a "symmetric" walk). It can be shown that the matrix $M$ has maximum eigenvalue 1; suppose the next highest eigenvalue is $1 - \delta$, for $\delta > 0$. The classical theory of random walks implies the existence of a bounded-error classical random walk search algorithm with query complexity in $O(1/(\delta\epsilon))$. Szegedy developed a "$\sqrt{\delta\epsilon}$-rule" that gives a quantum version of the classical walk with query complexity in $O(1/\sqrt{\delta\epsilon})$. This technique was generalized further in [134] and summarized nicely in [160]. Quantum walk searching has been applied to other problems such as triangle finding [135], commutativity testing [133], matrix product verification [40], associativity testing when the range is restricted [66], and element $k$-distinctness [15]. A survey of results in quantum walks can be found in [13,14,123,160].

Continuous-Time Quantum Walk Algorithms

In this section we describe two very well-known continuous-time quantum walk algorithms. The first algorithm [48] illustrates how a quantum walk algorithm can give an exponential speed-up in the black-box model. A problem instance of size $n$ corresponds to an oracle $O_{G_n}$ that encodes a graph $G_n$ on $O(2^n)$ vertices in the following way. The graph $G_n$ is formed by taking two binary trees of depth $n$ and "gluing" the two trees together by adding edges that create a cycle alternating between the leaves of the first tree (selected at random) and the leaves of the second tree (selected at random).
The two root vertices are called the ENTRANCE vertex, labeled with some known string, say the all-zeroes string $00\ldots0$ of length $2n$, and the EXIT vertex, which is labeled with a random string of length $2n$. The remaining vertices of the graph are labeled with distinct random bit strings of length $2n$. The oracle $O_{G_n}$ encodes the graph $G_n$ in the following way: for $|x\rangle$, where $x \in \{0,1\}^{2n}$ encodes a vertex label of $G_n$, $O_{G_n}$ maps $|x\rangle|00\ldots0\rangle$ to $|x\rangle|n_1(x), n_2(x), n_3(x)\rangle$, where $n_1(x), n_2(x), n_3(x)$ are the labels of the neighbors of $x$ in any order (for the EXIT and ENTRANCE vertices, there will only be two distinct neighbors).

"Glued-Trees" Problem
Input: A black box implementing $O_{G_n}$ for a graph $G_n$ of the above form.
Problem: Output the label of the EXIT vertex.
Quantum Algorithms, Figure 1: An example of a "glued-trees" graph with random labelings for all vertices (except the ENTRANCE vertex). The goal is to find the label of the EXIT vertex (in this case, 101010), given a black box that outputs the vertex labels of the neighbors of a given vertex label.
Quantum Algorithms for the "Glued-Trees" Problem
There is a continuous-time quantum walk which starts at the ENTRANCE vertex (in this case $|00\ldots0\rangle$) and evolves according to the Hamiltonian defined by the adjacency matrix of the graph $G_n$ for an amount of time $t$ selected uniformly at random in $[0, n^4/(2\epsilon)]$, where $0 < \epsilon < 1$. Measuring will then yield the EXIT label with probability at least $\frac{1-\epsilon}{2n}$. The authors show how to efficiently simulate this continuous-time quantum walk using a universal quantum computer that makes a polynomial number of calls to $O_{G_n}$.
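The graph itself is easy to construct classically for small $n$, which is a useful way to picture the problem (an illustrative sketch with a toy tuple-based labeling scheme in place of the random $2n$-bit labels):

```python
import random

def glued_trees(n, seed=0):
    """Adjacency structure of a glued-trees graph: two complete binary
    trees of depth n joined by a random alternating cycle on their
    leaves.  Vertices are tuples (side, level, index)."""
    rng = random.Random(seed)
    adj = {}
    def add(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    for side in "LR":
        for level in range(n):
            for i in range(2 ** level):
                add((side, level, i), (side, level + 1, 2 * i))
                add((side, level, i), (side, level + 1, 2 * i + 1))
    left = [("L", n, i) for i in range(2 ** n)]
    right = [("R", n, i) for i in range(2 ** n)]
    rng.shuffle(left)
    rng.shuffle(right)
    # random cycle alternating between the two leaf sets
    for i in range(2 ** n):
        add(left[i], right[i])
        add(right[i], left[(i + 1) % (2 ** n)])
    return adj
```

Every vertex has degree 2 or 3, matching the promise the oracle makes about the number of neighbors, and the vertex count $2(2^{n+1}-1)$ shows why a classical walker gets lost in the exponentially wide middle layers.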
Classical Algorithms for the "Glued-Trees" Problem
Any classical randomized algorithm must evaluate the black box $O_{G_n}$ an exponential number of times in order to output the correct EXIT vertex label with non-negligible probability. More precisely, any classical algorithm that makes $2^{n/6}$ queries to $O_{G_n}$ can only find the EXIT with probability at most $4 \cdot 2^{-n/6}$.

Another very interesting and recent problem for which a quantum walk algorithm has given the optimal algorithm is the problem of evaluating a NAND-tree (or AND-OR tree). The problem is nicely described by a binary tree of depth $n$ whose leaves are labeled by the integers $i \in \{1, 2, \ldots, 2^n\}$. The input is a black box $O_X$ that encodes a binary string $X = X_1 X_2 \ldots X_N$, where $N = 2^n$. The $i$th leaf vertex is assigned the value $X_i$, and the parent of any pair of vertices takes on the value which is the NAND of the values of its child vertices (the NAND of two input bits is 0 if both inputs are 1, and 1 if either bit is 0). Thus, given the assignment of values to the leaves of the binary tree, one can compute the values of the remaining vertices in the tree, including the root vertex. The value of the NAND-tree for a given assignment is the value of the root vertex.

NAND-Tree Evaluation
Input: A black box $O_X$ that encodes a binary string $X = X_1 X_2 \ldots X_N \in \{0,1\}^N$, $N = 2^n$.
Problem: Output the value of the binary NAND-tree whose $i$th leaf has value $X_i$.
Classical Algorithms for NAND-Tree Evaluation
The best known classical randomized algorithm uses $O(N^{0.753\ldots})$ evaluations of the black box, and it is also known that $\Omega(N^{0.753\ldots})$ evaluations are required for any classical randomized algorithm. Until recently, no quantum algorithm worked better.

Quantum Algorithms for NAND-Tree Evaluation
Farhi, Goldstone and Gutmann [76] exhibited a continuous-time walk that could solve this in time $O(\sqrt{N})$ using a continuous version of the black box; it was subsequently shown that $O(N^{1/2+\epsilon})$ queries to the discrete oracle suffice, for any real constant $\epsilon > 0$, and discrete-walk versions of the algorithm and other generalizations were developed [18,50]. This was a very interesting breakthrough in solving a fundamental problem that had stumped quantum algorithms experts for a number of years. The general idea is inspired by techniques in particle physics and scattering theory. They consider a graph formed by taking a binary tree and making two additions to it. First, for each leaf vertex with $X_i = 1$, add another vertex and join it to that leaf. Second, attach the root vertex to the middle of a long path of length in $\Omega(\sqrt{N})$. Then evolve the system according to the Hamiltonian equal to the adjacency matrix of this graph. One starts the system in a superposition of states on the left side of the line graph, with phases defined so that if the NAND-tree were not attached, the "wave packet" would move from left to right along the line. If the packet gets reflected with non-negligible amplitude, then the NAND-tree has value 1; otherwise, if the packet gets mostly transmitted, then the NAND-tree has value 0. Thus one measures the system: if one obtains a vertex to the left of the NAND-tree, one guesses "1", and if one obtains a vertex to the right of the NAND-tree, one guesses "0". This algorithm outputs the correct answer with high probability.
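The classical $N^{0.753\ldots}$ bound mentioned above comes from recursive short-circuit evaluation in random child order: if the first child of a NAND node evaluates to 0, the node is 1 without looking at the second child. A minimal sketch:

```python
import random

def nand_tree(X, lo=0, hi=None, rng=random):
    """Evaluate the depth-log2(N) NAND tree on leaf values X[lo:hi],
    querying the two subtrees in random order and short-circuiting."""
    if hi is None:
        hi = len(X)
    if hi - lo == 1:
        return X[lo]                    # a leaf: one black-box query
    mid = (lo + hi) // 2
    halves = [(lo, mid), (mid, hi)]
    rng.shuffle(halves)                 # random evaluation order
    a = nand_tree(X, *halves[0], rng=rng)
    if a == 0:
        return 1                        # NAND(0, anything) = 1: skip subtree
    return 1 - nand_tree(X, *halves[1], rng=rng)
```

The random ordering is what pushes the expected number of leaf queries below $N$, to $O(N^{0.753\ldots})$; the quantum walk reaches $O(N^{1/2+\epsilon})$.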
In a discrete query model, one can carefully simulate the continuous-time walk [50] (as discussed in Sect. "Simulation of Quantum Mechanical Systems"), or one can apply the results of Szegedy [171] to define a discrete-time coined walk with the same spectral properties [18], and thus obtain a discrete query complexity of $N^{1/2+\epsilon}$ for any constant $\epsilon > 0$. This algorithm has also been applied to solve MIN-MAX trees with a similar improvement [55]. The NAND-tree and related problems are related to deciding the winner of two-player games. Another very recent class of quantum algorithms for evaluating a wider class of formulas, based on "span programs", was developed in [155].
Adiabatic Algorithms

It is possible to encode the solution to a hard problem into the ground state of an efficiently simulatable Hamiltonian. For example, in order to try to solve 3-SAT for a formula $\Phi$ on $n$ Boolean variables, one could define a Hamiltonian

$H_1 = \sum_{x \in \{0,1\}^n} f_\Phi(x)\,|x\rangle\langle x|$
where f_Φ(x) is the number of clauses of Φ that are violated by the assignment x. Then one could try to define algorithms to find such a ground state, such as quantum analogues of classical annealing or other heuristics (e.g. [62,74,101]).

Adiabatic algorithms (also known as adiabatic optimization algorithms) are a new paradigm for quantum algorithms invented by Farhi, Goldstone, Gutmann and Sipser [74]. The paradigm is based on the fact that, under the right conditions, a system that starts in the ground state of a Hamiltonian H(0) will with high probability remain in the ground state of the Hamiltonian H(t) of the system at a later time t, if the Hamiltonian of the system changes "slowly enough" from H(0) to H(t). This fact is called the adiabatic theorem (see e.g. [115]). This theorem inspires the following algorithmic paradigm: convert your problem to that of generating the ground state of some easy-to-simulate Hamiltonian H_1; initialize your quantum computer in an easy-to-prepare ground state |ψ_0⟩ of an easy-to-simulate Hamiltonian H_0; then, on the quantum computer, simulate a time-dependent Hamiltonian H(t) = (1 − t/T)H_0 + (t/T)H_1, for t going from 0 to T.

An important detail is how slowly to transition from H_0 to H_1 (in other words, how large T should be). This is related to two important parameters. Let λ_0(t) be the smallest eigenvalue of H(t), and assume that the corresponding eigenspace is non-degenerate. Let λ_1(t) be the second smallest eigenvalue of H(t), and define g(t) = λ_1(t) − λ_0(t) to be the gap between the two lowest eigenvalues. The norm of the Hamiltonian is also relevant. This is to be expected, since one can effectively speed up time by a factor of s just by multiplying the Hamiltonian of the system by s. In any realistic implementation of the Hamiltonian one pays for such a speed-up by at least a factor of s in some resource. For example, if we simulate the Hamiltonian using quantum circuits (e.g. as described in Sect. "Simulation of Quantum Mechanical Systems"), the overhead is a factor of s^{1+o(1)} in the circuit depth; thus there is no actual speed-up. Furthermore, the norm of the derivatives of the Hamiltonian is also relevant.
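As a small numerical illustration of these quantities (a sketch, not the algorithm of [74] itself): the following NumPy code builds the diagonal cost Hamiltonian H_1 for a hand-made 3-SAT instance with a unique satisfying assignment, pairs it with a transverse-field-style H_0 (a common but here assumed choice of mixer), and scans the spectral gap g(s) of H(s) = (1 − s)H_0 + sH_1 along the interpolation. All names and the instance are constructed for this example only.

```python
import itertools
import numpy as np

n = 3
target = (1, 1, 1)
# Hand-made unique-SAT 3-CNF: one clause per non-target assignment,
# violated only by that assignment (every literal is false there).
clauses = [tuple((i + 1) if a[i] == 0 else -(i + 1) for i in range(n))
           for a in itertools.product((0, 1), repeat=n) if a != target]

def f_violated(x):
    """f(x): number of clauses violated by the assignment x (tuple of 0/1)."""
    return sum(not any((x[abs(l) - 1] == 1) == (l > 0) for l in c)
               for c in clauses)

# H1 = sum_x f(x) |x><x| : diagonal cost Hamiltonian
f = np.array([f_violated(x)
              for x in itertools.product((0, 1), repeat=n)], dtype=float)
H1 = np.diag(f)

# H0 = sum_i (I - X_i)/2 : ground state |+>^n with energy 0, gap 1
X = np.array([[0.0, 1.0], [1.0, 0.0]])
H0 = np.zeros((2 ** n, 2 ** n))
for i in range(n):
    op = np.eye(1)
    for j in range(n):
        op = np.kron(op, X if j == i else np.eye(2))
    H0 += 0.5 * (np.eye(2 ** n) - op)

# Scan the gap g(s) of H(s) = (1-s) H0 + s H1 along the path
gaps = []
for s in np.linspace(0.0, 1.0, 101):
    evals = np.linalg.eigvalsh((1 - s) * H0 + s * H1)  # ascending eigenvalues
    gaps.append(evals[1] - evals[0])
g_min = min(gaps)
print(f"g_min = {g_min:.4f}; suggested T ~ 1/g_min^2 = {1 / g_min ** 2:.1f}")
```

Because the satisfying assignment is unique, the ground space of H_1 is non-degenerate and g(s) stays strictly positive, so the folklore T ~ 1/g_min² estimate is finite; with a degenerate ground space the naive gap would vanish at s = 1.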
Quantum Algorithms
There are a variety of theorems and claims in the literature proving, or arguing, that a value of T polynomial in the operator norm of dH(t)/dt (or even some higher derivative), in the inverse of the minimum gap (i.e. the minimum g(t) for 0 ≤ t ≤ T), and in 1/δ, is sufficient in order to generate the final ground state with probability at least 1 − δ. The general folklore is that, with the right assumptions, the dependence on the minimum gap g_min is Ω(1/g_min²), and one can find examples where this is the case; however, more sophisticated descriptions of the dependence on g(t) are known (see e.g. [115]). Assuming there is exactly one satisfying assignment w, then |w⟩⟨w| is the unique ground state of H_1. This algorithm has been studied numerically and analytically [58,75,157], and variations have been introduced as well to work around various lower bounds that were proved [77]. Unfortunately, it is not known what the worst-case complexity of these algorithms is on such NP-hard problems, since it has proved very hard to provide rigorous or even convincing heuristic bounds on the minimum gap for the problems of interest. It is widely believed that these algorithms will not solve an NP-hard problem in worst-case polynomial time, partly because it is believed that no quantum algorithm can do this. A slight generalization of adiabatic algorithms, called adiabatic computation (where the final Hamiltonian does not need to be diagonal in the computational basis), was shown to be polynomially equivalent to general quantum computation [7].

Topological Algorithms

The standard models of quantum computation (e.g. quantum Turing machine, quantum acyclic circuits) are known to be equivalent in power (up to a polynomial factor), and the quantum strong Church–Turing thesis states that any realistic model of computation can be efficiently simulated by such a quantum computer.
If this were not the case, then one should seek to define a stronger model of computation that encapsulates the full computational power that the laws of physics offer. Freedman [80] proposed defining a computing model based on topological quantum field theories. The main motivation was that such a computer might naturally solve an NP-hard or #P-hard topological problem, in particular evaluating the Jones polynomial at certain points. A natural family of such topological quantum field theory computers was shown to be equivalent in power to the standard model of quantum computation; thus such a new model of computation would not provide additional computational power, but it was hoped that this
new paradigm might inspire new quantum algorithms. In fact, it has been shown that "topological" algorithms can approximate the value of the Jones polynomial at certain points more efficiently than any known classical algorithm [8,81]. The known approximations are not good enough to solve an NP-hard problem. Several other generalizations and related problems and algorithms have been found recently [10,85,132,179]. We will briefly sketch some of the definitions, results and techniques.

A knot is a closed non-intersecting curve embedded in R³, usually represented via a knot diagram, which is a projection of the knot into the plane with additional information at each cross-over to indicate which strand goes over and which goes under. Two knots are considered equivalent if one can be manipulated into the other by an isotopy (i.e. by transformations one could make to an actual knot that can be moved and stretched but not broken or passed through itself). A link is a collection of non-intersecting knots embedded in R³. Links can be represented by similar diagrams, and there is a similar notion of equivalence. The Jones polynomial of a link L is a polynomial V_L(t) that is a link invariant; in other words, it has the property that if two links L_1 and L_2 are equivalent, then V_{L_1}(t) = V_{L_2}(t). Computing the Jones polynomial is in general #P-hard for all but a finite number of values of t. In particular, it is #P-hard to evaluate the Jones polynomial exactly at any primitive rth root of unity for any integer r ≥ 5. However, certain approximations of these values are not known to be #P-hard.

The Jones polynomial is a special case of the Tutte polynomial of a planar graph. For a planar graph G = (V, E) with edge weights v = {v_e | e ∈ E}, the multivariate Tutte polynomial is defined as

Z_G(q; v_{e_1}, v_{e_2}, \ldots) = \sum_{A \subseteq E} q^{k(A)} \prod_{e \in A} v_e

where q is another variable and k(A) is the number of connected components in the subgraph (V, A). The standard Tutte polynomial T_G(x, y) is obtained by setting v_e = v for all e ∈ E, x = 1 + q/v and y = 1 + v. Connections with physics are discussed, for example, in [120,169,176].

Here we briefly sketch a specific instance of such a problem, and the approach of [10] taken to solve this problem on a quantum computer. Firstly, for any planar graph G, one can efficiently find its medial graph L_G, a 4-regular planar graph which can be drawn from a planar embedding of G as follows. Draw a new vertex with weight u_i in the middle of each edge that had label v_i. For each new vertex, on each side of the original edge on which the vertex is placed,
draw a new edge going in the clockwise direction joining the new vertex to the next new vertex encountered along the face of the original graph. Do the same in the counter-clockwise direction, and remove all the original edges and vertices. From this medial graph, one can define another polynomial called the Kauffman bracket, denoted ⟨L_G⟩(d, u_1, u_2, …), which satisfies ⟨L_G⟩(d, u_1, u_2, …) = d^{|V|} Z_G(d², du_1, du_2, …). The next section sketches a quantum algorithm that approximates the Kauffman bracket.

Additive Approximation of the Multivariate Tutte Polynomial for a Planar Graph

Input: A description of a planar graph G = (V, E); complex-valued weights v_1, v_2, …, v_m corresponding to the edges e_1, e_2, …, e_m ∈ E of G; and a complex number q.
Problem: Output an approximation of Z_G(q; v_1, v_2, …, v_m).

Quantum Algorithms for Approximating the Tutte Polynomial

Aharonov et al. [10] give a quantum algorithm that solves the above problem in time polynomial in n, with an additive approximation scale, denoted Δalg/poly(m), that is described below. The value Δalg depends on the embedding of the graph. The results are hard to compare to what is known about classical algorithms for this problem (see [33] for a discussion), but there are special cases that are BQP-hard.

Classical Algorithms for Approximating the Tutte Polynomial

It was shown [10] that for certain ranges of parameter choices, the approximations given by the quantum algorithms are BQP-hard, and thus we do not expect a classical algorithm to be able to provide as good an approximation unless classical computers can efficiently simulate quantum computers.

Sketch of the Structure of the Algorithm

We only have room to give a broad overview of the algorithm. One of the main points is to emphasize that this algorithm looks nothing like any of the other algorithms discussed in the previous sections.
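Since Z_G is defined by a sum over edge subsets, tiny instances can be checked by brute force. The sketch below (plain Python, exponential in |E|, illustrative only, not the quantum algorithm of [10]) evaluates the multivariate Tutte polynomial literally from the definition; setting every v_e = −1 specializes Z_G(q; v) to the chromatic polynomial, a standard identity that serves as a sanity check.

```python
from itertools import combinations

def k_components(vertices, subset):
    """k(A): number of connected components of the subgraph (V, A),
    computed by union-find over the vertex set."""
    parent = {v: v for v in vertices}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in subset:
        parent[find(a)] = find(b)
    return len({find(v) for v in vertices})

def multivariate_tutte(vertices, edges, q, v):
    """Z_G(q; v) = sum over A subset of E of q^k(A) * prod_{e in A} v_e."""
    total = 0
    m = len(edges)
    for r in range(m + 1):
        for idx in combinations(range(m), r):
            term = q ** k_components(vertices, [edges[i] for i in idx])
            for i in idx:
                term *= v[i]
            total += term
    return total

# Triangle K3 with unit edge weights, evaluated at q = 2
V, E = [0, 1, 2], [(0, 1), (1, 2), (0, 2)]
print(multivariate_tutte(V, E, 2, [1, 1, 1]))     # -> 28
# v_e = -1 specializes Z_G to the chromatic polynomial:
print(multivariate_tutte(V, E, 2, [-1, -1, -1]))  # -> 0 (K3 is not 2-colorable)
print(multivariate_tutte(V, E, 3, [-1, -1, -1]))  # -> 6 proper 3-colorings
```

The 2^|E| cost of this direct evaluation is exactly what makes additive approximations, classical or quantum, the interesting question for large graphs.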
At a very high level, the idea is that the medial graph L_G can be represented by a linear operation Q_G such that ⟨L_G⟩(d, u_1, u_2, …) = ⟨1|Q_G|1⟩. The quantum algorithm approximates the inner product between |1⟩ and Q_G|1⟩, and therefore gives an approximation to the Kauffman bracket for G, and thus to the generalized Tutte polynomial for G.
The medial graph L_G will be represented as a product of basic elements T_i from a generalized Temperley–Lieb algebra. These basic elements T_i will be represented by simple linear transformations on finite-dimensional Hilbert spaces, which can be implemented on a quantum computer. Below we briefly sketch this decomposition and correspondence. One can easily draw the medial graph L_G in the plane so that one can slice it with horizontal lines such that between each pair of consecutive horizontal lines there is a diagram with only one of the following: a crossing of two lines, a "cap" or a "cup". One can think of the gluing together of these adjacent diagrams T_i to form the graph L_G as a product operation. We will sketch how to map each T_i to a linear operation ρ(T_i) acting on a finite-dimensional Hilbert space. The state space that the operation ρ(T_i) acts on is the set of finite walks starting at vertex 1 on the infinite path graph with vertices labeled by the non-negative integers and edges {i, i + 1} for all i ≥ 0. For example, 1 2 3 2 3 is a walk of length 4 from 1 to 3. The linear transformation ρ(T_i) maps a walk w to a linear combination of walks that are "compatible" with T_i (we won't explain here the details of defining this linear transformation). In order to apply this technique to a diagram with a crossing, one eliminates the crossing by replacing it with two non-intersecting lines. This replacement can be done in two different ways, and one can represent the diagram with a crossing as a formal linear combination of the two diagrams one gets by replacing a crossing at a vertex (say, with label u) with two non-intersecting lines, where one of the two diagrams gets the coefficient u (by a simple rule that we don't explain here). We can then apply the construction to each of these new diagrams, and combine them linearly to get the linear transformation corresponding to the diagram with the crossing.
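The crossing-resolution idea, replacing each crossing by a formal linear combination of its two planar smoothings, can be made concrete in the classical single-variable Kauffman bracket, a simpler relative of the weighted version used in the text. The sketch below assumes a hand-rolled input format (not from the source) in which each crossing explicitly lists which arcs its two smoothings merge; the Hopf-link encoding is chosen by hand so that its four smoothing states have the Hopf link's loop counts (2, 1, 1, 2).

```python
from itertools import product

def poly_mul(p, q):
    # Multiply Laurent polynomials in A, stored as {exponent: coefficient}.
    r = {}
    for e1, c1 in p.items():
        for e2, c2 in q.items():
            r[e1 + e2] = r.get(e1 + e2, 0) + c1 * c2
    return {e: c for e, c in r.items() if c != 0}

def poly_add(p, q):
    r = dict(p)
    for e, c in q.items():
        r[e] = r.get(e, 0) + c
    return {e: c for e, c in r.items() if c != 0}

D = {2: -1, -2: -1}  # value of a closed loop: d = -A^2 - A^(-2)

def kauffman_bracket(arcs, crossings):
    """State-sum Kauffman bracket.

    arcs: list of arc labels; crossings: list of (A_pairs, B_pairs),
    each listing which arcs get merged under that smoothing choice.
    """
    total = {}
    for state in product("AB", repeat=len(crossings)):
        # Union-find over arcs to count the loops of this smoothing state.
        parent = {a: a for a in arcs}
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for choice, (a_pairs, b_pairs) in zip(state, crossings):
            for x, y in (a_pairs if choice == "A" else b_pairs):
                parent[find(x)] = find(y)
        loops = len({find(a) for a in arcs})
        # Each state contributes A^(#A - #B) * d^(loops - 1).
        term = {state.count("A") - state.count("B"): 1}
        for _ in range(loops - 1):
            term = poly_mul(term, D)
        total = poly_add(total, term)
    return total

# Hopf link: two components (arcs 1,2 and 3,4), two crossings.
hopf = ([1, 2, 3, 4],
        [([(1, 3), (2, 4)], [(1, 4), (2, 3)]),
         ([(1, 3), (2, 4)], [(1, 4), (2, 3)])])
print(kauffman_bracket(*hopf))  # -> {4: -1, -4: -1}, i.e. -A^4 - A^(-4)
```

The exponential sum over all 2^n smoothing states is the classical analogue of the formal linear combination described above; the quantum algorithm instead tracks the same expansion through the Temperley–Lieb representation.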
These linear transformations Q_G are not necessarily unitary (they were unitary in the earlier work on the Jones polynomial, and other related work also constructs unitary representations); however, the authors show how one can use ancilla qubits and unitaries to implement non-unitary transformations and approximate the desired inner product using the "Hadamard test" (see Sect. "Quantum Algorithms for Quantum Tasks").

Quantum Algorithms for Quantum Tasks

At present, when we think of quantum algorithms, we usually think of starting with a classical input, running some quantum algorithm, or series of quantum algorithms, with
some classical post-processing in order to get a classical output. There might be some quantum sub-routines that have been analyzed, but the main goal in mind is to solve a classical problem. There are, however, tasks whose inputs and outputs are themselves quantum. For example, quantum error correction (see [124] for a recent survey) can be thought of as an algorithm having a quantum input and a quantum output. There are many algorithms for transferring a qubit of information through a network of qubits under some physical constraints. We might develop a quantum cryptographic infrastructure where objects like money and signatures [88] are quantum states that need to be maintained and manipulated as quantum states for long periods of time.

Several of the algorithms described in the previous sections have versions which have a quantum input or a quantum output or both. For example, the amplitude amplification algorithm can be rephrased in terms of producing a quantum state |ψ⟩ given a black-box U that recognizes |ψ⟩ by mapping |ψ⟩ ↦ −|ψ⟩ and acting as the identity on all states orthogonal to |ψ⟩. The end result of such a quantum search is the quantum state |ψ⟩. Amplitude estimation is estimating a transition probability of a unitary operator. The topological algorithms require as a basic subroutine a quantum algorithm for approximating the inner product of |00…0⟩ and U|00…0⟩, which the authors call the "Hadamard test". The algorithm consists of using a controlled-U operation to create the state (1/√2)|0⟩|00…0⟩ + (1/√2)|1⟩U|00…0⟩. Note that if we apply the Hadamard gate to the first qubit, we get

\sqrt{\frac{1 + \mathrm{Re}\,\langle 00\ldots 0|U|00\ldots 0\rangle}{2}} \, |0\rangle|\psi_0\rangle + \sqrt{\frac{1 - \mathrm{Re}\,\langle 00\ldots 0|U|00\ldots 0\rangle}{2}} \, |1\rangle|\psi_1\rangle
for normalized states |ψ_0⟩ and |ψ_1⟩. We can thus estimate the real part of ⟨00…0|U|00…0⟩ by repeating several times, or by applying the quadratically more efficient amplitude estimation algorithm described earlier (the goal is a superpolynomial speed-up, so a quadratic improvement is not substantial in this case). We can estimate the imaginary part similarly. Another example is the orbit coset problem [82] mentioned in Sect. "Generalizations of the Abelian Hidden Subgroup Problem". The input to this problem consists of two quantum states |ψ_0⟩ and |ψ_1⟩ from a set of mutually orthogonal states, together with black-boxes for implementing the action of a group G on this set. For a state |ψ⟩, let |ψ^u⟩ denote the state resulting from the action of u ∈ G on |ψ⟩, and let G_{|ψ⟩} denote the subgroup of G that stabilizes the state |ψ⟩ (i.e. the set of u ∈ G such that |ψ^u⟩ = |ψ⟩). The question is whether there exists a u ∈ G such that |ψ_1^u⟩ = |ψ_0⟩. If the answer is "yes", then the set of u satisfying |ψ_1^u⟩ = |ψ_0⟩ is a left coset of G_{|ψ_1⟩}. Thus, the algorithm should output a coset representative u along with O(log n) generators of G_{|ψ_1⟩}. The solution to this problem was an important part of the solution to the hidden translation problem. Several other quantum algorithms have such "quantum" sub-routines.

Other examples of quantum tasks include quantum data compression, which was known to be information-theoretically possible and for which efficient quantum circuits were also developed [26]. Another example is entanglement concentration and distillation. Researchers have also developed quantum algorithms, with quantum inputs and quantum outputs, for implementing some natural basis changes; for example, the Clebsch-Gordan transformation [20] or other transformations (e.g. [69,103]). We've only listed a few examples here. In the future, as quantum technologies develop, in particular quantum computers and reliable quantum memories, quantum states and their manipulation will become an end in themselves.

Future Directions

Roughly 10 years ago, many people said that there were essentially only two quantum algorithms. One serious omission was the simulation of quantum mechanical systems, which was in fact Feynman's initial motivation for quantum computers. Apart from this omission, it was true that researchers were developing a better and deeper understanding of the algorithms of Shor and Grover, analyzing them in different ways, generalizing them, and applying them in non-trivial ways. It would have been feasible to write a reasonably sized survey of all the known quantum algorithms with substantial details included.
In addition to this important and non-trivial work, researchers were looking hard for "new" approaches, and fresh ideas like quantum walks and topological algorithms were being investigated, as well as continued work on the non-Abelian version of the hidden subgroup problem. The whole endeavor of finding new quantum algorithms was very hard and often frustrating. Fortunately, in the last 10 years there have been many non-trivial developments, enough to make writing a full survey of quantum algorithms in a reasonable number of pages impossible. Some directions in which future progress might be made are listed below.
- The complexity of the non-Abelian hidden subgroup problem will hopefully be better understood. This includes addressing the question: does there exist a quantum polynomial-time algorithm for solving the graph isomorphism problem? Of course, a proof that no such quantum algorithm exists would imply that P ≠ PSPACE. So, more likely, we might develop a strong confidence that no such algorithm exists, in which case this can form the basis of quantum computationally secure cryptography [142].
- There are many examples where amplitude amplification was used to speed up a classical algorithm in a non-trivial way; that is, in a way that was more than just treating a classical algorithm as a guessing algorithm A and applying amplitude amplification to it. There are likely countless other algorithms which can be improved in non-trivial ways using amplitude amplification.
- The quantum walk paradigm for quantum algorithms emerged roughly 10 years ago, has recently been applied to find the optimal black-box algorithm for several problems, and has become a standard approach for developing quantum algorithms. Some of these black-box problems are fairly natural, and the black-boxes can be substituted with circuits for actual functions of interest. For example, collision finding can be applied to find collisions in actual hash functions used in cryptography. We will hopefully see more instances where black-box algorithms can be applied to solve a problem without a black-box, or where there is no black-box in the first place.
- In addition to the development of new quantum walk algorithms, we will hopefully have a more elegant and unified general theory of quantum walks that unites continuous and discrete walks, coined and non-coined walks, and quantum and classical walks.
- The adiabatic algorithm paradigm has not reached the level of success of quantum walks, partly because it is hard to analyze the worst-case complexity of the algorithms. To date there is no adiabatic algorithm with a proof that it works more than quadratically faster than the best known classical algorithm. Can we do better with an adiabatic algorithm? If and when we have large-scale quantum computers, we will be able to just test these algorithms to see if indeed they have the conjectured running times on instances of interesting size.
- The topological algorithms have received limited attention to date. This is partly because the initial work in this field was largely inaccessible to researchers without substantial familiarity with topological quantum field theory and related areas of mathematics. The more recent work summarized in this paper and other recent papers is a sign that this approach could mature into a fruitful paradigm for developing new important quantum algorithms.
- The paradigm of measurement-based computation (see e.g. [118] for an introduction) has to date mostly been studied for its utility as a paradigm for possibly implementing a scalable fault-tolerant quantum computer. We might see the development of algorithms directly in this paradigm. Similarly for globally controlled architectures.
- There is also a growing group of researchers looking at the computational complexity of various computational problems in physics, in particular of simulating certain Hamiltonian systems, often coming from condensed matter physics. Much of the work has been complexity-theoretic, such as proving the QMA-hardness of computing ground states of certain Hamiltonians (e.g. [9]). Other work has focused on understanding which quantum systems can be simulated efficiently on a classical computer. This work should lead to the definition of some simulation problems that are not known to be in BPP, nor believed to be NP-hard or QMA-hard, and thus might be good candidates for a quantum algorithm. There has been a language and culture barrier between physicists and theoretical computer scientists when it comes to discussing such problems. However, it is slowly breaking down, as more physicists become familiar with algorithms and complexity, and more quantum computer scientists become familiar with language and notions from physics. This will hopefully lead to more quantum algorithms for computational problems in physics, and to new algorithmic primitives that can be applied to a wider range of problems.

In summary, as daunting as it is to write a survey of quantum algorithms at this time, it will be a much harder task in another 10 years.
Furthermore, in another 10 years we will hopefully have a better idea of when we might expect to see quantum computers large enough to solve problems faster than the best available classical computers.

Bibliography

1. Aaronson S (2003) Algorithms for Boolean function query properties. SIAM J Comput 32:1140–1157
2. Aaronson S, Ambainis A (2003) Quantum search of spatial regions. In: Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE Computer Society, Washington, DC, pp 200–209
3. Abrams D, Lloyd S (1999) Quantum algorithm providing exponential speed increase for finding eigenvalues and eigenvectors. Phys Rev Lett 83:5162–5165 4. Aharonov D, Regev O (2005) Lattice problems in NP intersect coNP. J ACM 52:749–765 5. Aharonov D, Ta-Shma A (2007) Adiabatic quantum state generation. SIAM J Comput 37:47–82 6. Aharonov D, Ambainis A, Kempe J, Vazirani U (2001) Quantum walks on graphs. In: Proceedings of ACM Symposium on Theory of Computation (STOC’01). ACM, New York, pp 50–59 7. Aharonov D, van Dam W, Kempe J, Landau Z, Lloyd S, Regev O (2004) Adiabatic auantum computation is equivalent to standard quantum computation. In: Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science (FOCS’04). IEEE Computer Society, Washington, DC, pp 42–51 8. Aharonov D, Jones V, Landau Z (2006) A polynomial quantum algorithm for approximating the Jones polynomial. In: Proceedings of the thirty-eighth annual ACM symposium on Theory of computing (STOC). ACM, New York, pp 427–436 9. Aharonov D, Gottesman D, Irani S, Kempe J (2007) The power of quantum systems on a line. In: Proc. 48th IEEE Symposium on the Foundations of Computer Science (FOCS). IEEE Computer Society, Washington, DC, pp 373–383 10. Aharonov D, Arad I, Eban E, Landau Z Polynomial quantum algorithms for additive approximations of the Potts model and other points of the Tutte plane. quant-ph/0702008 11. Alagic G, Moore C, Russell A (2007) Quantum algorithms for Simon’s problem over general groups. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms (SODA). Society for Industrial and Applied Mathematics, Philadelphia, pp 1217–1224 12. Ambainis A (2002) Quantum lower bounds by quantum arguments. J Comput Syst Sci 64:750–767 13. Ambainis A (2003) Quantum walks and their algorithmic applications. Int J Quantum Inf 1:507–518 14. Ambainis A (2004) Quantum search algorithms. SIGACT News. 35(2):22–35 15. 
Ambainis A (2004) Quantum walk algorithm for element distinctness. In: Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science (FOCS’04). IEEE Computer Society, Washington, DC, pp 22–31 16. Ambainis A, Spalek R (2006) Quantum algorithms for matching and network flows. In: Proceedings of STACS’06. Lecture Notes in Computer Science, vol 3884. Springer, Berlin, pp 172–183 17. Ambainis A, Bach E, Nayak A, Vishwanath A, Watrous J (2001) One-dimensional quantum walks. In: Proceedings of the 33rd ACM Symposium on Theory of Computing. ACM, New York, pp 37–49 18. Ambainis A, Childs A, Reichardt B, Spalek R, Zhang S (2007) Any AND-OR formula of size N can be evaluated in time N1/2Co(1) on a quantum computer. In: 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07). IEEE Computer Society, Washington, DC, pp 363–372 19. Bacon D, Childs A, van Dam W (2005) From optimal measurement to efficient quantum algorithms for the hidden subgroup problem over semidirect product groups. In: Proc 46th IEEE Symposium on Foundations of Computer Science (FOCS 2005). IEEE Computer Society, Washington, DC, pp 469–478
20. Bacon D, Chuang I, Harrow A (2006) Efficient quantum circuits for Schur and Clebsch-Gordan transforms. Phys Rev Lett 97:170502 21. Beals R (1997) Quantum computation of Fourier transforms over symmetric groups. In: Proceedings of the 29th Annual ACM Symposium on Theory of Computing (STOC). ACM, New York, pp 48–53 22. Beals R, Buhrman H, Cleve R, Mosca M, deWolf R (2001) Quantum lower bounds by polynomials. J ACM 48:778–797 23. Benioff P (1982) Quantum mechanical models of Turing machines that dissipate no energy. Phys Rev Lett 48(23): 1581–1585 24. Bennett C (1988) Notes on the history of reversible computation by Charles Bennett. IBM J Res Dev 32(1):16–23 25. Bennett C, Bernstein E, Brassard G, Vazirani U (1997) Strengths and weaknesses of quantum computing. SIAM J Comput 26:1510–1523 26. Bennett C, Harrow A, Lloyd S (2006) Universal quantum data compression via gentle tomography. Phys Rev A 73:032336 27. Bennett CH, Brassard G (1984) Quantum cryptography: Public-key distribution and coin tossing. In: Proceedings of IEEE International Conference on Computers, Systems and Signal Processing, India. IEEE Computer Society, Washington, DC, pp 175–179 28. Bernstein E, Vazirani U (1997) Quantum complexity theory. SIAM J Comput 26:1411–1473 29. Berry DW, Ahokas G, Cleve R, Sanders BC (2007) Efficient quantum algorithms for simulating sparse Hamiltonians. Comm Math Phys 270:359 30. Berthiaume A, Brassard G (1992) The quantum challenge to structural complexity theory. In: Proc 7th Conference Structure Complexity Theory, IEEE Comp Soc Press, pp 132–137 31. Berthiaume A, Brassard G (1994) Oracle quantum computing. J Modern Opt 41(12):2521–2535 32. Boneh D, Lipton R (1995) Quantum cryptanalysis of hidden linear functions (Extended Abstract). In: Proceedings of 15th Annual International Cryptology Conference (CRYPTO’95). Springer, London, pp 424–437 33. Bordewich M, Freedman M, Lovasz L, Welsh DJA (2005) Approximate counting and quantum computation. 
Comb Probab Comput 14:737–754 34. Boyer M, Brassard G, Høyer P, Tapp A (1998) Tight bounds on quantum searching. Fortschr Phys 56(5-5):493–505 35. Brassard G, Høyer P (1997) An exact quantum polynomialtime algorithm for Simon’s problem. In: Proc of Fifth Israeli Symposium on Theory of Computing and Systems (ISTCS’97). IEEE Computer Society, Washington, DC, pp 12–23 36. Brassard G, Høyer P, Tapp A (1997) Cryptology column– quantum algorithm for the collision problem. ACM SIGACT News 28:14–19 37. Brassard G, Høyer P, Tapp A (1998) Quantum Counting. In: Proceedings of the ICALP’98. Lecture notes in computer science. Springer, Berlin, pp 1820–1831 38. Brassard G, Høyer P, Mosca M, Tapp A (2002) Quantum amplitude amplification and estimation. In: Quantum Computation and Quantum Information Science. AMS Contemp Math Ser 35:53–74 39. Brown M (2003) Classical cryptosystems in a quantum setting. Master Thesis, University of Waterloo 40. Buhrman H, Špalek B (2006) Quantum verification of matrix products. In: Proceedings of the 14th Annual ACM-SIAM Sym-
2329
2330
Quantum Algorithms
41.
42.
43. 44. 45. 46.
47.
48.
49.
50. 51.
52.
53.
54. 55.
56. 57.
58.
59.
posium on Discrete Algorithms (SODA). IEEE Computer Society, Washington, DC, pp 880–889 Buhrman H, Fortnow L, Newman I, Röhrig H (2003) Quantum property testing. In: Proceedings of the 14th Annual ACMSIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, pp 480–488 Buhrman H, Dürr C, Heiligman M, Høyer P, Magniez F, Santha M, de Wolf R (2005) Quantum algorithms for element distinctness. SIAM J Comput 34:1324–1330 Byrnes T, Yamamoto Y (2006) Simulating lattice gauge theories on a quantum computer. Phys Rev A 73:022328 Cheung K, Mosca M (2001) Decomposing finite Abelian groups. Quantum Inf Comput 1(2):26–32 Childs A (2002) Quantum information processing in continuous time. Ph D thesis, MIT Childs A, van Dam W (2007) Quantum algorithm for a generalized hidden shift problem. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). ACM, New York, pp 1225–1232 Childs A, Farhi E, Gutmann S (2002) An example of the difference between quantum and classical random walks. Quantum Inf Process 1:35–43 Childs A, Cleve R, Deotto E, Farhi E, Gutmann S, Spielman D (2003) Exponential algorithmic speedup by a quantum walk. In: Proceedings of the 35th Annual ACM Symposium on Theory of Computing. IEEE Computer Society, Washington, DC Childs A, Landahl A, Parrilo P (2007) Improved quantum algorithms for the ordered search problem via semidefinite programming. Phys Rev A 75:032335 Childs AM, Cleve R, Jordan SP, Yeung D Discrete-query quantum algorithm for NAND trees. quant-ph/0702160 Childs A, Schulman L, Vazirani U (2007) Quantum algorithms for hidden nonlinear structures. In: Proceeding of 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2007). IEEE Computer Society, Washington, DC Cleve R (2000) An introduction to quantum complexity theory. In: Macchiavello C, Palma GM, Zeilinger A (eds) Collected Papers on Quantum Computation and Quantum Information Theory. 
World Scientific, Singapore, pp 103–127 Cleve R (2000) The query complexity of order-finding. In: IEEE Conference on Computational Complexity. IEEE Computer Society, Washington, DC, p 54 Cleve R, Ekert A, Macchiavello C, Mosca M (1998) Quantum algorithms revisited. Proc Royal Soc Lond A 454:339–354 Cleve R, Gavinsky D, Yeung D (2008) Quantum algorithms for evaluating Min-Max trees. In: Proceedings of TQC 2008. Lecture Notes in Computer Science, vol 5106. Springer, Berlin van Dam W, Seroussi G Efficient quantum algorithms for estimating Gauss sums. quant-ph/0207131 van Dam W, Shparlinski I (2008) Classical and quantum algorithms for exponential congruences. In: Proceedings of TQC 2008. Lecture Notes in Computer Science, vol 5106. Springer, Berlin, pp 1–10 van Dam W, Mosca M, Vazirani U (2001) How powerful is adiabatic quantum computation? In: Proceedings 46th IEEE Symposium on Foundations of Computer Science (FOCS’01). IEEE Computer Society, Washington, DC, pp 279–287 van Dam W, Hallgren S, Ip L (2003) Quantum algorithms for some hidden shift problems. In: Proceedings of the ACMSIAM Symposium on Discrete Algorithms (SODA’03). Soci-
60.
61.
62. 63.
64.
65. 66.
67.
68.
69. 70.
71.
72. 73. 74. 75.
76. 77. 78.
79. 80.
ety for Industrial and Applied Mathematics, Philadelphia, pp 489–498 van Dam W, D’Ariano GM, Ekert A, Macchiavello C, Mosca M (2007) General optimized schemes for phase estimation. Phys Rev Lett 98:090501 van Dam W, D’Ariano GM, Ekert A, Macchiavello C, Mosca M (2007) Optimal phase estimation in quantum networks. J Phys A: Math Theor 40:7971–7984 Das A, Chakrabarti BK (2008) Quantum annealing and analog quantum computation. Rev Mod Phys 80:1061 Decker T, Draisma J, Wocjan P (2009) Efficient quantum algorithm for identifying hidden polynomials. Quantum Inf Comput (to appear) Deutsch D (1985) Quantum theory, the Church–Turing principle and the universal quantum computer. Proc Royal Soc Lond A 400:97–117 Deutsch D, Jozsa R (1992) Rapid solutions of problems by quantum computation. Proc Royal Soc Lond A 439:553–558 Dörn S, Thierauf T (2007) The quantum query complexity of algebraic properties. In: Proceedings of the 16th International Symposium on Fundamentals of Computation Theory (FCT). Lecture Notes in Computer Science, vol 4639. Springer, Berlin, pp 250–260 Durr C, Heiligman M, Høyer P, Mhalla M (2004) Quantum query complexity of some graph problems. In: Proceedings of 31st International Colloquium on Automata, Languages, and Programming (ICALP’04). Lecture Notes in Computer Science, vol 3142. Springer, Berlin pp 481–493 Ettinger JM (1998) On noncommutative hidden subgroups. Lecture at AQIP’98. http://www.brics.dk/~salvail/aqip/ abstlist.html#upload-paper.10480.1 Ettinger JM Quantum time-frequency transforms. quantph/0005134 Ettinger M, Høyer P (2000) On quantum algorithms for noncommutative hidden subgroups. Adv Appl Math 25(3): 239–251 Ettinger M, Høyer P, Knill E (2004) The quantum query complexity of the hidden subgroup problem is polynomial. Inf Process Lett 91:43–48 Farhi E, Gutmann S (1998) Quantum computation and decision trees. Phys Rev A 58:915–928 Farhi E, Gutmann S An analog analogue of a digital quantum computation. 
quant-ph/9612026 Farhi E, Goldstone J, Gutmann S, Sipser M (2000) Quantum computation by adiabatic evolution. quant-ph/0001106 Farhi E, Goldstone J, Gutmann S, Lapan J, Lundgren A, Preda D (2001) A quantum adiabatic evolution algorithm applied to random instances of an NP. Science 20 April:472 Farhi E, Goldstone J, Gutmann S A quantum algorithm for the Hamiltonian NAND tree. quant-ph/0702144v2 Farhi E, Goldstone J, Gutmann S Quantum adiabatic evolution algorithms with different paths. quant-ph/0208135 Fenner S, Zhang Y (2005) Quantum algorithms for a set of group theoretic problems. In: Proceedings of the Ninth ICEATCS Italian Conference on Theoretical Computer Science. Lecture Notes in Computer Science, vol 3701. Springer, Berlin, pp 215–227 Feynman R (1982) Simulating physics with computers. Int J Theor Phys 21(6,7):467–488 Freedman M (1998) P/NP, and the quantum field computer. Proc Natl Acad Sci 95(1):98–101
Quantum Algorithms
81. Freedman M, Kitaev A, Wang Z (2002) Simulation of topological field theories by quantum computers. Comm Math Phys 227(3):587–603
82. Friedl K, Ivanyos G, Magniez F, Santha M, Sen P (2003) Hidden translation and orbit coset in quantum computing. In: Proceedings of the thirty-fifth annual ACM symposium on Theory of computing (STOC'03). IEEE Computer Society, Washington, DC, pp 1–9
83. Friedl K, Ivanyos G, Santha M (2005) Efficient testing of groups. In: Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing (STOC). IEEE Computer Society, Washington, DC, pp 157–166
84. Furrow B (2008) A panoply of quantum algorithms. Quantum Inf Comput 8(8–9):0834–0859
85. Geraci J, Lidar D (2008) On the exact evaluation of certain instances of the Potts partition function by quantum computers. Comm Math Phys 279:735
86. Gerhardt H, Watrous J (2003) Continuous-time quantum walks on the symmetric group. In: 6th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (APPROX) and 7th International Workshop on Randomization and Approximation Techniques in Computer Science (RANDOM), pp 290–301
87. Grigni M, Schulman L, Vazirani M, Vazirani U (2001) Quantum mechanical algorithms for the nonabelian hidden subgroup problem. In: Proceedings of the thirty-third annual ACM symposium on Theory of computing (STOC'01), pp 68–74
88. Gottesman D, Chuang I Quantum digital signatures. quant-ph/0105032
89. Grigoriev D (1997) Testing shift-equivalence of polynomials by deterministic, probabilistic and quantum machines. Theor Comput Sci 180:217–228
90. Grover L (1996) A fast quantum mechanical algorithm for database search. In: Proceedings of the 28th Annual ACM Symposium on the Theory of Computing (STOC 1996), pp 212–219
91. Grover L (1998) A framework for fast quantum mechanical algorithms. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing (STOC), pp 53–62
92. Hales L, Hallgren S (2000) An improved quantum Fourier transform algorithm and applications. In: FOCS 2000, pp 515–525
93. Hallgren S (2002) Polynomial-time quantum algorithms for Pell's equation and the principal ideal problem. In: STOC 2002, pp 653–658
94. Hallgren S (2005) Fast quantum algorithms for computing the unit group and class group of a number field. In: Proceedings of the 37th ACM Symposium on Theory of Computing (STOC 2005), pp 468–474
95. Hallgren S (2007) Polynomial-time quantum algorithms for Pell's equation and the principal ideal problem. J ACM 54(1):653–658
96. Hallgren S, Russell A, Ta-Shma A (2000) Normal subgroup reconstruction and quantum computation using group representations. In: Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, pp 627–635
97. Hallgren S, Moore C, Roetteler M, Russell A, Sen P (2006) Limitations of quantum coset states for graph isomorphism. In: Proceedings of the 38th Annual ACM Symposium on Theory of Computing. IEEE Computer Society, Washington, DC
98. Hassidim A, Ben-Or M (2008) The Bayesian learner is optimal for noisy binary search (and pretty good for quantum as well). In: Proceedings of the 49th Symposium on Foundations of Computer Science. IEEE Computer Society, Washington, DC
99. Hausladen P, Wootters WK (1994) A pretty good measurement for distinguishing quantum states. J Mod Opt 41:2385
100. Helstrom CW (1976) Quantum detection and estimation theory. Academic Press, New York
101. Hogg T (2000) Quantum search heuristics. Phys Rev A 61:052311
102. Holevo AS (1982) Probabilistic and statistical aspects of quantum theory. North Holland, Amsterdam
103. Høyer P Efficient quantum transforms. quant-ph/9702028
104. Høyer P (1999) Conjugated operators in quantum algorithms. Phys Rev A 59(5):3280–3289
105. Høyer P (2001) Introduction to recent quantum algorithms. In: Proceedings of 26th International Symposium on Mathematical Foundations of Computer Science (MFCS'01). Lecture Notes in Computer Science, vol 2136. Springer, Berlin, pp 62–73
106. Høyer P, Dürr C A quantum algorithm for finding the minimum. quant-ph/9607014
107. Høyer P, Neerbek J, Shi Y (2002) Quantum complexities of ordered searching, sorting, and element distinctness. Algorithmica 34(4):429–448
108. Høyer P, Mosca M, de Wolf R (2003) Quantum search on bounded-error inputs. In: Proceedings of the Thirtieth International Colloquium on Automata, Languages and Programming (ICALP'03), pp 291–299
109. Inui Y, Le Gall F (2008) Quantum property testing of group solvability. In: Proceedings of the 8th Latin American Symposium (LATIN'08: Theoretical Informatics). Lecture Notes in Computer Science, vol 4957. Springer, Berlin, pp 772–783
110. Ivanyos G, Magniez F, Santha M (2003) Efficient quantum algorithms for some instances of the non-Abelian hidden subgroup problem. Int J Found Comput Sci 14(5):723–739
111. Ivanyos G, Sanselme L, Santha M (2007) An efficient quantum algorithm for the hidden subgroup problem in extraspecial groups. In: 24th STACS. Lecture Notes in Computer Science, vol 4393. Springer, Berlin, pp 586–597
112. Ivanyos G, Sanselme L, Santha M (2008) An efficient quantum algorithm for the hidden subgroup problem in nil-2 groups. In: Proceedings of the 8th Latin American Symposium (LATIN'08: Theoretical Informatics). Lecture Notes in Computer Science, vol 4957. Springer, Berlin, pp 759–771
113. Ioannou L (2002) Continuous-time quantum algorithms: Searching and adiabatic computation. Master's thesis, University of Waterloo
114. Jaeger F, Vertigan DL, Welsh DJA (1990) On the computational complexity of the Jones and Tutte polynomials. Math Proc Cambridge Phil Soc 108(1):5–53
115. Jansen S, Ruskai M, Seiler R (2007) Bounds for the adiabatic approximation with applications to quantum computation. J Math Phys 48:102111
116. Jozsa R (1998) Quantum algorithms and the Fourier transform. Proc Royal Soc Lond A 454:323–337
117. Jozsa R (2003) Notes on Hallgren's efficient quantum algorithm for solving Pell's equation. quant-ph/0302134
118. Jozsa R (2006) An introduction to measurement based quantum computation. NATO Science Series, III: Computer and
Systems Sciences. Quantum Information Processing – From Theory to Experiment, vol 199, pp 137–158
119. Kassal I, Jordan S, Love P, Mohseni M, Aspuru-Guzik A (2008) Quantum algorithms for the simulation of chemical dynamics. arXiv:0801.2986
120. Kauffman L (2001) Knots and Physics. World Scientific, Singapore
121. Kaye P (2005) Optimized quantum implementation of elliptic curve arithmetic over binary fields. Quantum Inf Comput 5(6):474–491
122. Kaye P, Laflamme R, Mosca M (2007) An introduction to quantum computation. Oxford University Press, Oxford
123. Kempe J (2003) Quantum random walks – an introductory overview. Contemp Phys 44(4):307–327
124. Kempe J (2007) Approaches to quantum error correction. In: Quantum Decoherence, Poincaré Seminar 2005. Progress in Mathematical Physics. Birkhäuser, Basel, pp 85–123
125. Kempe J, Shalev A (2005) The hidden subgroup problem and permutation group theory. In: Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms (SODA'05), pp 1118–1125
126. Kitaev A (1995) Quantum measurements and the Abelian stabilizer problem. quant-ph/9511026
127. Kitaev A (1996) Quantum measurements and the Abelian stabilizer problem. Electronic Colloquium on Computational Complexity (ECCC), p 3
128. Kitaev A (1997) Quantum computations: Algorithms and error correction. Russ Math Surv 52(6):1191–1249
129. Kitaev A, Shen A, Vyalyi M (2002) Classical and quantum computation. American Mathematical Society
130. Kuperberg G (2005) A subexponential-time quantum algorithm for the dihedral hidden subgroup problem. SIAM J Comput 35:170–188
131. Lloyd S (1996) Universal quantum simulators. Science 273:1073–1078
132. Lomonaco S, Kauffman L (2006) Topological quantum computing and the Jones polynomial. In: Proc SPIE 6244
133. Magniez F, Nayak A (2007) Quantum complexity of testing group commutativity. Algorithmica 48(3):221–232
134. Magniez F, Nayak A, Roland J, Santha M (2007) Search via quantum walk. In: 39th ACM Symposium on Theory of Computing (STOC), pp 575–584
135. Magniez F, Santha M, Szegedy M (2007) Quantum algorithms for the triangle problem. SIAM J Comput 37(2):413–424
136. Manin Y (2000) Classical computing, quantum computing, and Shor's factoring algorithm. Séminaire Bourbaki, 41 (1998–1999), 862, pp 375–404. The appendix translates an excerpt about quantum computing from a 1980 paper (in Russian)
137. Menezes A, van Oorschot P, Vanstone S (1996) Handbook of Applied Cryptography. CRC Press, Boca Raton
138. Moore C, Russell A (2007) For distinguishing conjugate hidden subgroups, the pretty good measurement is as good as it gets. Quantum Inf Comput 7(8):752–765
139. Moore C, Rockmore D, Russell A, Schulman L (2004) The power of basis selection in Fourier sampling: hidden subgroup problems in affine groups. In: Proceedings of the fifteenth annual ACM-SIAM symposium on discrete algorithms (SODA'04), pp 1113–1122
140. Moore C, Rockmore D, Russell A (2006) Generic quantum Fourier transforms. ACM Trans Algorithms 2:707–723
141. Moore C, Russell A, Sniady P (2007) On the impossibility of a quantum sieve algorithm for graph isomorphism: unconditional results. In: Proceedings of the thirty-ninth annual ACM symposium on Theory of computing (STOC), pp 536–545
142. Moore C, Russell A, Vazirani U A classical one-way function to confound quantum adversaries. quant-ph/0701115
143. Mosca M (1999) Quantum computer algorithms. PhD thesis, Oxford
144. Mosca M (2001) Counting by quantum eigenvalue estimation. Theor Comput Sci 264:139–153
145. Mosca M, Ekert A (1998) The hidden subgroup problem and eigenvalue estimation on a quantum computer. In: Proceedings 1st NASA International Conference on Quantum Computing & Quantum Communications. Lecture Notes in Computer Science, vol 1509. Springer, Berlin, pp 174–188
146. Mosca M, Zalka C (2004) Exact quantum Fourier transforms and discrete logarithm algorithms. Int J Quantum Inf 2(1):91–100
147. Motwani R, Raghavan P (1995) Randomized Algorithms. Cambridge University Press, Cambridge
148. Nayak A, Wu F (1999) The quantum query complexity of approximating the median and related statistics. In: Proceedings of the Thirty-first Annual ACM Symposium on Theory of Computing (STOC), pp 384–393
149. Nielsen M, Chuang I (2000) Quantum computation and quantum information. Cambridge University Press, Cambridge
150. Proos J, Zalka C (2003) Shor's discrete logarithm quantum algorithm for elliptic curves. Quantum Inf Comput 3:317–344
151. Püschel M, Rötteler M, Beth T (1999) Fast quantum Fourier transforms for a class of non-Abelian groups. In: Proceedings of the 13th International Symposium on Applied Algebra, Algebraic Algorithms and Error-Correcting Codes, pp 148–159
152. Radhakrishnan J, Roetteler M, Sen P (2005) On the power of random bases in Fourier sampling: Hidden subgroup problem in the Heisenberg group. In: Proceedings of the 32nd International Colloquium on Automata, Languages and Programming (ICALP), pp 1399–1411
153. Ramesh H, Vinay V (2003) String matching in Õ(√n + √m) quantum time. J Discret Algorithms 1:103–110
154. Regev O (2004) Quantum computation and lattice problems. SIAM J Comput 33:738–760
155. Reichardt B, Spalek R (2008) Span-program-based quantum algorithm for evaluating formulas. In: Proceedings of the fortieth annual ACM symposium on Theory of computing (STOC 2008), pp 103–112
156. Roetteler M, Beth T Polynomial-time solution to the hidden subgroup problem for a class of non-Abelian groups. quant-ph/9812070
157. Roland J, Cerf N (2002) Quantum search by local adiabatic evolution. Phys Rev A 65(4):042308
158. Rudolph T, Grover L (2004) How significant are the known collision and element distinctness quantum algorithms? Quantum Inf Comput 4:201–206
159. Russell A, Shparlinski I (2004) Classical and quantum function reconstruction via character evaluation. J Complex 20:404–422
160. Santha M (2008) Quantum walk based search algorithms. In: Proceedings of the 5th Annual International Conference on Theory and Applications of Models of Computation (TAMC 2008). Lecture Notes in Computer Science, vol 4978. Springer, Berlin, pp 31–46
161. Sasaki M, Carlini A, Jozsa R (2001) Quantum template matching. Phys Rev A 64:022317
162. Schmidt A, Vollmer U (2005) Polynomial time quantum algorithm for the computation of the unit group of a number field. In: Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pp 475–480
163. Schumacher B (1995) Quantum coding. Phys Rev A 51:2738–2747
164. Severini S (2003) On the digraph of a unitary matrix. SIAM J Matrix Anal Appl (SIMAX) 25(1):295–300
165. Shor P (1994) Algorithms for quantum computation: discrete logarithms and factoring. In: Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pp 124–134
166. Shor P (1997) Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM J Comput 26:1484–1509
167. Simon D (1994) On the power of quantum computation. In: Proceedings of the 35th IEEE Symposium on the Foundations of Computer Science (FOCS), pp 116–123
168. Simon D (1997) On the power of quantum computation. SIAM J Comput 26:1474–1483
169. Sokal A (2005) The multivariate Tutte polynomial (alias Potts model) for graphs and matroids. In: Surveys in Combinatorics. Cambridge University Press, Cambridge, pp 173–226
170. Somma R, Ortiz G, Gubernatis JE, Knill E, Laflamme R (2002) Simulating physical phenomena by quantum networks. Phys Rev A 65:042323
171. Szegedy M (2004) Quantum speed-up of Markov chain based algorithms. In: Proceedings of the 45th IEEE Symposium on the Foundations of Computer Science (FOCS), pp 32–41
172. Terhal B (1999) Quantum algorithms and quantum entanglement. PhD thesis, University of Amsterdam
173. Vazirani U (1998) On the power of quantum computation. Phil Trans Royal Soc Lond A 356:1759–1768
174. Vazirani U (1997) Berkeley Lecture Notes. Lecture 8. http://www.cs.berkeley.edu/~vazirani/qc.html
175. Watrous J (2001) Quantum algorithms for solvable groups. In: Proceedings of the thirty-third annual ACM symposium on Theory of computing (STOC), pp 60–67
176. Welsh DJA (1993) Complexity: knots, colourings and counting. Cambridge University Press, Cambridge
177. Wiesner S (1983) Conjugate coding. Sigact News 15(1):78–88
178. Wiesner S (1996) Simulations of many-body quantum systems by a quantum computer. quant-ph/9603028
179. Wocjan P, Yard J (2008) The Jones polynomial: quantum algorithms and applications in quantum complexity theory. Quantum Inf Comput 8(1–2):147–180
180. Zalka C (1998) Efficient simulation of quantum systems by quantum computers. Proc Roy Soc Lond A 454:313–322
181. Zalka C (2000) Using Grover's quantum algorithm for searching actual databases. Phys Rev A 62(5):052305
182. Zhang S (2006) New upper and lower bounds for randomized and quantum local search. In: Proceedings of the thirty-eighth annual ACM symposium on Theory of computing (STOC), pp 634–643
Quantum Algorithms and Complexity for Continuous Problems
ANARGYROS PAPAGEORGIOU, JOSEPH F. TRAUB
Department of Computer Science, Columbia University, New York, USA

Article Outline

Glossary
Definition of the Subject
Introduction
Overview of Quantum Algorithms
Integration
Path Integration
Feynman–Kac Path Integration
Eigenvalue Approximation
Qubit Complexity
Approximation
Elliptic Partial Differential Equations
Ordinary Differential Equations
Gradient Estimation
Simulation of Quantum Systems on Quantum Computers
Future Directions
Acknowledgments
Bibliography

Glossary

Black box model This model assumes we can collect knowledge about an input f through queries without knowing how the answer to the query is computed. A synonym for black box is oracle.
Classical computer A computer which does not use the principles of quantum computing to carry out its computations.
Computational complexity In this article, complexity for brevity. The minimal cost of solving a problem by an algorithm. Some authors use the word complexity when cost would be preferable. An upper bound on the complexity is given by the cost of an algorithm. A lower bound is given by a theorem which states there cannot be an algorithm which does better.
Continuous problem A problem involving real or complex functions of real or complex variables. Examples of continuous problems are integrals, path integrals, and partial differential equations.
Cost of an algorithm The price of executing an algorithm. The cost depends on the model of computation.
Discrete problem A problem whose inputs are from a countable set. Examples of discrete problems are integer factorization, traveling salesman and satisfiability.
ε-Approximation Most real-world continuous problems can only be solved numerically and therefore approximately, that is, to within an error threshold ε. The definition of ε-approximation depends on the setting. See worst-case setting, randomized setting, quantum setting.
Information-based complexity The discipline that studies algorithms and complexity of continuous problems.
Model of computation The rules stating what is permitted in a computation and how much it costs. The model of computation is an abstraction of a physical computer. Examples of models are Turing machines, the real number model, and the quantum circuit model.
Optimal algorithm An algorithm whose cost equals the complexity of the problem.
Promise A statement of what is known about a problem a priori, before any queries are made. An example in quantum computation is the promise that an unknown 1-bit function is constant or balanced. In information-based complexity a promise is also called global information.
Quantum computing speedup The amount by which a quantum computer can solve a problem faster than a classical computer. To compute the speedup one must know the classical complexity, and it is desirable to also know the quantum complexity. Grover proved quadratic speedup for search in an unstructured database. It is only conjectured that Shor's algorithm provides exponential speedup for integer factorization, since the classical complexity is unknown.
Query One obtains knowledge about a particular input through queries. For example, if the problem is numerical approximation of ∫₀¹ f(x) dx, a query might be the evaluation of f at a point. In information-based complexity the same concept is called an information operation.
Quantum setting There are a number of quantum settings.
An example is a guarantee of error at most ε with probability greater than 1/2.
Qubit complexity The minimal number of qubits to solve a problem.
Query complexity The minimal number of queries required to solve the problem.
Randomized setting In this setting the expected error, with respect to the probability measure generating the random variables, is at most ε. The computation is randomized. An important example of a randomized algorithm is the Monte Carlo method.
Worst-case setting In this setting an error of at most ε is guaranteed for all inputs satisfying the promise. The computation is deterministic.
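The black box model and query counting from the glossary can be made concrete with a small sketch. The wrapper class and the trapezoid rule below are illustrative choices of mine, not constructions from the article: an algorithm learns about the input f only through evaluations at chosen points, and the query count is what complexity bounds are stated in.

```python
# Sketch of the black box (oracle) model: the algorithm learns about the
# input f only through queries, i.e., evaluations f(t) at chosen points.
# CountingOracle and the trapezoid example are illustrative, not from the text.

class CountingOracle:
    """Wraps an input function f and counts how many queries are made."""
    def __init__(self, f):
        self._f = f
        self.queries = 0

    def __call__(self, t):
        self.queries += 1          # each evaluation is one query
        return self._f(t)

def trapezoid(oracle, n):
    """Approximate the integral of f over [0, 1] using n + 1 queries."""
    h = 1.0 / n
    total = 0.5 * (oracle(0.0) + oracle(1.0))
    for i in range(1, n):
        total += oracle(i * h)
    return h * total

f = CountingOracle(lambda x: x * x)   # the exact integral of x^2 on [0,1] is 1/3
estimate = trapezoid(f, 100)
print(estimate, f.queries)            # estimate near 1/3, after 101 queries
```

The cost of this algorithm (101 queries for its error level) is an upper bound on the query complexity of the problem; a lower bound requires an adversary-style theorem, as discussed later in the article.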
Definition of the Subject

Most continuous mathematical formulations arising in science and engineering can only be solved numerically and therefore approximately. We shall always assume that we are dealing with a numerical approximation to the solution. There are two major motivations for studying quantum algorithms and complexity for continuous problems.

1. Are quantum computers more powerful than classical computers for important scientific problems? How much more powerful? This would answer the question posed by Nielsen and Chuang (p. 47 in [48]).
2. Many important scientific and engineering problems have continuous formulations. These problems occur in fields such as physics, chemistry, engineering and finance. The continuous formulations include path integration, partial differential equations (in particular, the Schrödinger equation) and continuous optimization.

To answer the first question we must know the classical computational complexity (for brevity, complexity) of the problem. There have been decades of research on the classical complexity of continuous problems in the field of information-based complexity; see, for example, the monographs [49,57,59,67,68,71,75]. The reason we know the complexity of many continuous problems is that we can use adversary arguments to obtain their query complexity. See, for example, [54] for an exposition. This may be contrasted with the classical complexity of discrete problems, where we have only conjectures such as P ≠ NP. Even the classical complexity of the factorization of large integers is unknown. Knowing the classical complexity of a continuous problem, we obtain the quantum computation speedup if we know the quantum complexity. If we know an upper bound on the quantum complexity through the cost of a particular quantum algorithm, then we can obtain a lower bound on the quantum speedup.

To state and prove complexity theorems, the mathematical formulation of the problem, the promise about the class of inputs, and the model of computation must be precisely specified. See [66] for a discussion of the model of computation for continuous problems.

Regarding the second motivation, in this article we will report on high-dimensional integration, path integration, Feynman path integration, the smallest eigenvalue of a differential equation, approximation, partial differential equations, ordinary differential equations and gradient estimation. We will also briefly report on the simulation of quantum systems on a quantum computer.

Introduction

We provide a summary of the contents of the article.
Section “Overview of Quantum Algorithms” We define basic concepts and notation including quantum algorithm, continuous problem, query complexity and qubit complexity.

Section “Integration” High-dimensional integration, often in hundreds or thousands of variables, is one of the most commonly occurring continuous problems in science. In Subsect. “Classical Computer” we report on complexity results on a classical computer. For illustration we begin with a one-dimensional problem and give a simple example of an adversary argument. We then move on to the d-dimensional case and indicate that in the worst case the complexity is exponential in d; the problem suffers the curse of dimensionality. The curse can be broken by the Monte Carlo algorithm. In Subsect. “Quantum Computer” we report on the algorithms and complexity results on a quantum computer. Under certain assumptions on the promise the quantum query complexity enjoys exponential speedup over the classical worst case query complexity. A number of the problems we will discuss enjoy exponential quantum query speedup. This does not contradict Beals et al. [6], who prove that the speedup can only be polynomial; the reason is that [6] deals with problems concerning total Boolean functions. For many classes of integrands there is quadratic speedup over the classical randomized query complexity. This is the same speedup as enjoyed by Grover's search algorithm for an unstructured database. To obtain the quantum query complexity one has to give matching upper and lower bounds. As usual the upper bound is given by an algorithm, the lower bound by a theorem. The upper bound is given by the amplitude amplification algorithm of Brassard et al. [12]. We outline a method for approximating high-dimensional integrals using this algorithm. The quantum query complexity lower bounds for integration are based on the lower bounds of Nayak and Wu [47] for computing the mean of a Boolean function.
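The Monte Carlo algorithm mentioned above can be sketched in a few lines. Its expected error decays like n^(−1/2) regardless of the dimension d, which is how it breaks the curse of dimensionality; the particular integrand below is an arbitrary illustrative choice, not one from the article.

```python
import random

def monte_carlo_integral(f, d, n, seed=0):
    """Estimate the integral of f over the unit cube [0,1]^d by averaging
    n random samples; the expected error is O(n**-0.5) independent of d."""
    rng = random.Random(seed)
    return sum(f([rng.random() for _ in range(d)]) for _ in range(n)) / n

# Illustrative integrand (my choice): the mean coordinate; its exact
# integral over [0,1]^d is 1/2 for every dimension d.
est = monte_carlo_integral(lambda x: sum(x) / len(x), d=10, n=10000)
print(est)   # close to 0.5
```

Note that this gives only an upper bound in the randomized setting; the worst-case deterministic lower bounds discussed in Subsect. “Classical Computer” are proved separately by adversary arguments.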
Section “Path Integration” We define a path integral and provide an example due to Feynman. In Subsect. “Classical Computer” we report on complexity results
on a classical computer. If the promise is that the class of integrands has finite smoothness, then path integration is intractable in the worst case. If the promise is that the class of integrands consists of entire functions, the query complexity is tractable even in the worst case. For smooth functions intractability is broken by Monte Carlo. In Subsect. “Quantum Computer” we report on the algorithm and complexity on a quantum computer. The quantum query complexity enjoys Grover-type speedup over the classical randomized query complexity. We outline the quantum algorithm.

Section “Feynman–Kac Path Integration” The Feynman–Kac path integral provides the solution to the diffusion equation. In Subsect. “Classical Computer” we report on algorithms and complexity on a classical computer. In the worst case for a d-dimensional Feynman–Kac path integral the problem again suffers the curse of dimensionality, which can be broken by Monte Carlo. In Subsect. “Quantum Computer” we indicate the algorithm and query complexity on a quantum computer.

Section “Eigenvalue Approximation” One of the most important problems in physics and chemistry is approximating the ground state energy governed by a differential equation. Typically, numerical algorithms on a classical computer are well known. Our focus is on the Sturm–Liouville eigenvalue (SLE) problem, where the first complexity results were recently obtained. The SLE equation is also called the time-independent Schrödinger equation. In Subsect. “Classical Computer” we present an algorithm on a classical computer. The worst case query complexity suffers the curse of dimensionality. We also state a randomized algorithm. The randomized query complexity is unknown for d > 2 and is an important open question. In Subsect. “Quantum Computer” we outline an algorithm for a quantum computer. The quantum query complexity is not known when d > 1. It has been shown that it is not exponential in d; the problem is tractable on a quantum computer.
Section “Qubit Complexity” So far we've focused on query complexity. For the foreseeable future the number of qubits will be a crucial computational resource. We give a general lower bound on the qubit complexity of continuous problems. We show that because of this lower bound there's a problem that cannot be solved on a quantum computer but that's easy to solve on a classical computer using Monte Carlo. A definition of a quantum algorithm is given by (1); the queries are deterministic. Woźniakowski [77] introduced the quantum setting with randomized queries for continuous problems. For path integration there is an exponential reduction in the qubit complexity.

Section “Approximation” Approximating functions of d variables is a fundamental and generally hard problem. The complexity is sensitive to the norm, p, on the class of functions and to several other parameters. For example, if p = ∞, approximation suffers the curse of dimensionality in the worst and randomized classical cases. The problem remains intractable in the quantum setting.

Section “Partial Differential Equations” Elliptic partial differential equations have many important applications and have been extensively studied. In particular, consider an equation of order 2m in d variables. In the classical worst case setting the problem is intractable. The classical randomized and quantum settings were only recently studied. The conclusion is that the quantum setting may or may not provide a polynomial speedup; it depends on certain parameters.

Section “Ordinary Differential Equations” Consider a system of initial value ordinary differential equations in d variables. Assume that the right hand sides satisfy a Hölder condition. The problem is tractable even in the worst case, with the exponent of ε⁻¹ depending on the Hölder class parameters. The complexity of classical randomized and quantum algorithms has only recently been obtained. The quantum setting yields a polynomial speedup.

Section “Gradient Estimation” Jordan [37] showed that approximating the gradient of a function can be done with a single query on a quantum computer, although it takes d + 1 function evaluations on a classical computer. A simplified version of Jordan's algorithm is presented.

Section “Simulation of Quantum Systems on Quantum Computers” There is a large and varied literature on simulation of quantum systems on quantum computers. The focus in these papers is typically on the cost of particular classical and quantum algorithms, without complexity analysis and therefore without speedup results.
To give the reader a taste of this area we list some sample papers.

Section “Future Directions” We briefly indicate a number of open questions.
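As a point of comparison for Sect. “Gradient Estimation”, the classical count of d + 1 function evaluations comes from a one-sided finite-difference scheme. The sketch below (with an arbitrary test function of my choosing) illustrates that classical count; it is not Jordan's quantum algorithm.

```python
def gradient_forward_difference(f, x, h=1e-6):
    """Approximate the gradient of f at x with d + 1 evaluations of f:
    one at x and one at x shifted by h along each of the d coordinates."""
    d = len(x)
    f0 = f(x)                                  # evaluation 1
    grad = []
    for i in range(d):                         # evaluations 2 .. d+1
        shifted = list(x)
        shifted[i] += h
        grad.append((f(shifted) - f0) / h)
    return grad

# Illustrative test function: f(x) = sum of squares, whose gradient is 2x.
g = gradient_forward_difference(lambda x: sum(t * t for t in x), [1.0, 2.0, 3.0])
print(g)   # approximately [2.0, 4.0, 6.0]
```

Jordan's result is that a quantum computer can obtain the same d-component estimate with a single (quantum) query, instead of the d + 1 evaluations counted here.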
Overview of Quantum Algorithms

A quantum algorithm consists of a sequence of unitary transformations applied to an initial state. The result of the algorithm is obtained by measuring its final state. The quantum model of computation is discussed in detail in [6,7,8,17,27,48] and we summarize it here as it applies to continuous problems.
The initial state |ψ₀⟩ of the algorithm is a unit vector of the Hilbert space H_ν = C² ⊗ ··· ⊗ C² (ν times), for some appropriately chosen integer ν, where C² is the two-dimensional space of complex numbers. The dimension of H_ν is 2^ν. The number ν denotes the number of qubits used by the quantum algorithm. The final state |ψ⟩ is also a unit vector of H_ν and is obtained from the initial state |ψ₀⟩ through a sequence of unitary 2^ν × 2^ν matrices, i.e.,

|ψ_f⟩ := U_T Q_f U_{T−1} Q_f ··· U_1 Q_f U_0 |ψ₀⟩ .   (1)
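A minimal classical simulation of the structure in (1) — a state vector transformed by alternating fixed unitaries and query unitaries — can make the definitions concrete. The bit-query convention Q_f|j⟩|k⟩ = |j⟩|k ⊕ f(j)⟩ used below is the one defined in the text; the particular Boolean input f, and the fact that only a single query (and no fixed unitaries) is applied, are illustrative simplifications of mine.

```python
# Minimal simulation of the form in Eq. (1): a state vector over nu qubits
# acted on by a query unitary. The input f is an illustrative choice.

def apply(matrix, state):
    """Multiply a 2^nu x 2^nu matrix into a state vector."""
    n = len(state)
    return [sum(matrix[r][c] * state[c] for c in range(n)) for r in range(n)]

def bit_query(f, m):
    """Permutation matrix realizing Q_f |j>|k> = |j>|k XOR f(j)> on m+1 qubits."""
    n = 2 ** (m + 1)
    q = [[0.0] * n for _ in range(n)]
    for j in range(2 ** m):
        for k in range(2):
            q[(j << 1) | (k ^ f(j))][(j << 1) | k] = 1.0
    return q

m = 2
f = lambda j: j % 2                      # illustrative Boolean input
Q = bit_query(f, m)
state = [0.0] * 2 ** (m + 1)
state[0] = 1.0                           # |psi_0> = |0...0>, i.e., |j=0>|k=0>
state = apply(Q, state)                  # one query: |0>|0> -> |0>|f(0)>
print(state.index(1.0))                  # basis index (0 << 1) | f(0) = 0
```

Since f(0) = 0 here, the state |0⟩|0⟩ is unchanged; a basis state with f(j) = 1 would have its last qubit flipped. Being a permutation of basis states, Q_f is unitary, as (1) requires.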
The unitary matrix Q_f is called a quantum query and is used to provide information to the algorithm about a function f. Q_f depends on n function evaluations f(t₁), …, f(t_n), at deterministically chosen points, n ≤ 2^ν. The U₀, U₁, …, U_T are unitary matrices that do not depend on f. The integer T denotes the number of quantum queries.

For algorithms solving discrete problems, such as Grover's algorithm for the search of an unordered database [26], the input f is considered to be a Boolean function. The most commonly studied quantum query is the bit query. For a Boolean function f: {0, …, 2^m − 1} → {0, 1}, the bit query is defined by

Q_f |j⟩|k⟩ = |j⟩|k ⊕ f(j)⟩ .   (2)

Here ν = m + 1, |j⟩ ∈ H_m, and |k⟩ ∈ H₁, with ⊕ denoting addition modulo 2. For a real function f the query is constructed by taking the most significant bits of the function f evaluated at some points t_j. More precisely, as in [27], the bit query for f has the form

Q_f |j⟩|k⟩ = |j⟩|k ⊕ β(f(τ(j)))⟩ ,   (3)

where the number of qubits is now ν = m′ + m″ and |j⟩ ∈ H_{m′}, |k⟩ ∈ H_{m″}. The functions β and τ are used to discretize the domain D and the range R of f, respectively. Therefore, β: R → {0, 1, …, 2^{m″} − 1} and τ: {0, 1, …, 2^{m′} − 1} → D. Hence, we compute f at t_j = τ(j) and then take the m″ most significant bits of f(t_j) by β(f(t_j)); for the details and the possible use of ancillary qubits see [27].

At the end of the quantum algorithm, the final state |ψ_f⟩ is measured. The measurement produces one of M outcomes, where M ≤ 2^ν. Outcome j ∈ {0, 1, …, M − 1} occurs with probability p_f(j), which depends on j and the input f. Knowing the outcome j, we classically compute the final result φ_f(j) of the algorithm.

In principle, quantum algorithms may have measurements applied between sequences of unitary transformations of the form presented above. However, any algorithm with multiple measurements can be simulated by a quantum algorithm with only one measurement [8].

Let S be a linear or nonlinear operator such that

S: F → G .   (4)

Typically, F is a linear space of real functions of several variables, and G is a normed linear space. We wish to approximate S(f) to within ε for f ∈ F. We approximate S(f) using n function evaluations f(t₁), …, f(t_n) at deterministically and a priori chosen sample points. The quantum query Q_f encodes this information, and the quantum algorithm obtains this information from Q_f.

Without loss of generality, we consider algorithms that approximate S(f) with probability p ≥ 3/4. We can boost the success probability of an algorithm to become arbitrarily close to one by repeating the algorithm a number of times. The success probability becomes at least 1 − δ with a number of repetitions proportional to log δ⁻¹.

The local error of the quantum algorithm (1) that computes the approximation φ_f(j), for f ∈ F and the outcome j ∈ {0, 1, …, M − 1}, is defined by

e(f, S) = min { α : ∑_{j: ‖S(f) − φ_f(j)‖ ≤ α} p_f(j) ≥ 3/4 } ,   (5)

where p_f(j) denotes the probability of obtaining outcome j for the function f. The worst case error of a quantum algorithm φ is defined by

e^{quant}(φ, S) = sup_{f ∈ F} e(f, S) .   (6)

The query complexity comp^{query}(ε, S) of the problem S is the minimal number of queries necessary for approximating the solution with accuracy ε, i.e.,

comp^{query}(ε) = min { T : ∃ φ such that e^{quant}(φ, S) ≤ ε } .   (7)

Similarly, the qubit complexity of the problem S is the minimal number of qubits necessary for approximating the solution with accuracy ε, i.e.,

comp^{qubit}(ε) = min { ν : ∃ φ such that e^{quant}(φ, S) ≤ ε } .   (8)

Integration

Integration is one of the most commonly occurring mathematical problems. One reason is that when one seeks the expectation of a continuous process one has to compute
Quantum Algorithms and Complexity for Continuous Problems
an integral. Often the integrals are over hundreds or thousands of variables. Path integrals are taken over an infinite number of variables; see Sect. "Path Integration".

Classical Computer

We begin with a one-dimensional example to illustrate some basic concepts before moving to the d-dimensional case. (Number of dimensions and number of variables are used interchangeably.) Our simple example is to approximate

  I(f) = ∫₀¹ f(x) dx .
For most integrands we cannot use the fundamental theorem of calculus to compute the integral analytically; we have to approximate it numerically (most real-world continuous problems have to be approximated numerically). We have to make a promise about f. Assume

  F₁ = { f : [0, 1] → ℝ | f continuous and |f(x)| ≤ 1, x ∈ [0, 1] } .

Use queries to compute y_f = [f(x₁), …, f(x_n)]. We show that with this promise one cannot guarantee an ε-approximation on a classical computer. We use a simple adversary argument. Choose arbitrary numbers x₁, …, x_n ∈ [0, 1]. The adversary answers all these queries by f(x_i) = 0, i = 1, …, n. What is f? It could be f₁ ≡ 0, with ∫₀¹ f₁(x) dx = 0, or it could be the f₂ shown in Fig. 1, whose integral ∫₀¹ f₂(x) dx can be arbitrarily close to 1. Since f₁ and f₂ are indistinguishable with these query answers and this promise, it is impossible to guarantee an ε-approximation on a classical computer with ε < 1/2. We will return to this example in Sect. "Qubit Complexity".

Quantum Algorithms and Complexity for Continuous Problems, Figure 1
All the function evaluations are equal to zero but the integral is close to one

We move on to the d-dimensional case. Assume that the integration limits are finite; for simplicity we assume we are integrating over the unit cube [0, 1]^d, so our problem is

  I(f) = ∫_{[0,1]^d} f(x) dx .

Our promise is that f ∈ F_d, where

  F_d = { f : [0, 1]^d → ℝ | f continuous and |f(x)| ≤ 1, x ∈ [0, 1]^d } .
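The adversary argument can be checked numerically. In the sketch below (pure Python; the spike function `f2` is one concrete, hypothetical choice of the f₂ in Fig. 1, not a construction from the text), two members of F₁ agree on every query yet have integrals differing by almost 1:

```python
import random

random.seed(1)
n, delta = 10, 1e-4
xs = [random.random() for _ in range(n)]      # the algorithm's query points

def f1(x):                                    # f1 == 0, integral 0
    return 0.0

def f2(x):
    # capped distance to the nearest query point: continuous, 0 <= f2 <= 1,
    # and f2 vanishes exactly at every query point
    return min(1.0, min(abs(x - p) for p in xs) / delta)

# Both integrands answer every query with 0: indistinguishable to the algorithm.
assert all(f1(p) == 0.0 and f2(p) == 0.0 for p in xs)

# Yet each query point removes at most 2*delta of mass from f2, so
# I(f2) >= 1 - 2*n*delta while I(f1) = 0: no epsilon < 1/2 can be guaranteed.
print("I(f2) >=", 1 - 2 * n * delta)
```

Shrinking `delta` pushes I(f2) as close to 1 as desired, which is why no accuracy below 1/2 is achievable under this promise.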
Then we cannot compute an ε-approximation regardless of the value of d. Our promise has to be stronger. Assume that the integrand class has smoothness r. There are various ways to define smoothness r for functions of d variables; see, for example, the definition on p. 25 in [67]. (For other definitions see [71].) With this definition Bakhvalov [4] showed that the query complexity is

  comp_clas-wor^query(ε) = Θ(ε^{−d/r}) .   (9)

This is the worst case query complexity on a classical computer. What does this formula imply? If r = 0 the promise is that the functions are only continuous but have no smoothness; then the query complexity is infinite, that is, we cannot guarantee an ε-approximation. If r and ε are fixed the query complexity is an exponential function of d, and we say the problem is intractable. Following R. Bellman this is also called the curse of dimensionality. In particular, let r = 1. Then the promise is that the integrands have one derivative and the query complexity is Θ(ε^{−d}).

The curse of dimensionality is present for many continuous problems in the worst case setting. Breaking the curse is one of the central issues of information-based complexity. For high-dimensional integration the curse can be broken for F_d by the Monte Carlo algorithm, which is a special case of a randomized algorithm. The Monte Carlo algorithm is defined by

  MC(f) = (1/n) Σ_{i=1}^{n} f(x_i) ,   (10)
where the x_i are chosen independently from a uniform distribution. Then the expected error is

  e^MC(f) = √(var(f)) / √n ,

where

  var(f) = ∫_{[0,1]^d} f²(x) dx − ( ∫_{[0,1]^d} f(x) dx )²

is the variance of f. For the promise f ∈ F_d the query complexity is given by

  comp_clas-ran^query(ε) = Θ(ε^{−2}) .   (11)
This is the randomized query complexity on a classical computer. Thus Monte Carlo breaks the curse of dimensionality for the integration problem; the problem is tractable. Why should picking sample points at random be much better than picking them deterministically in the optimal way? This is possible because we have replaced the guarantee of the worst case setting by the stochastic assurance of the randomized setting. There is no free lunch!

The reader will note that (11) is a complexity result even though it is the cost of a particular algorithm, the Monte Carlo algorithm. The reason is that for integrands satisfying this promise Monte Carlo has been proven optimal. It is known that if the integrands are smoother Monte Carlo is not optimal; see p. 32 in [67].

In the worst case (deterministic) setting the query complexity is infinite if f ∈ F_d. In the randomized setting the query complexity is independent of d if f ∈ F_d. This will provide us guidance when we introduce Monte Carlo sampling into the model of quantum computation in Sect. "Qubit Complexity".

Generally pseudo-random numbers are used in the implementation of Monte Carlo on a classical computer. The quality of a pseudo-random number generator is usually tested statistically; see, for example, Knuth [41]. Will (11) still hold if a pseudo-random number generator is used? An affirmative answer is given by [69], provided some care is taken in the choice of the generator and f is Lipschitz.

Quantum Computer

We have seen that Monte Carlo breaks the curse of dimensionality for high-dimensional integration on a classical computer and that the query complexity is of order ε⁻². Can we do better on a quantum computer?
The short answer is yes. Under certain assumptions on the promises, which will be made precise below, the quantum query complexity enjoys exponential speedup over the classical worst case query complexity and quadratic speedup over the classical randomized query complexity. The latter is the same speedup as enjoyed by Grover's search algorithm for an unstructured database [26].

To show that the quantum query complexity is of order ε⁻¹ we have to give matching, or close to matching, upper and lower bounds. Usually the upper bound is given by an algorithm and the lower bound by a theorem. The upper bound is given by the amplitude amplification algorithm of Brassard et al. [12], which we describe briefly.

The amplitude amplification algorithm of Brassard et al. computes the mean

  SUM(f) = (1/N) Σ_{i=0}^{N−1} f(i)

of a Boolean function f : {0, 1, …, N − 1} → {0, 1}, where N = 2^k, with error ε and probability at least 8/π², using a number of queries proportional to min{ε⁻¹, N}. Moreover, Nayak and Wu [47] show that the order of magnitude of the number of queries of this algorithm is optimal. Without loss of generality we can assume that ε⁻¹ ≤ N.

Perhaps the easiest way to understand the algorithm is to consider the operator

  G = (2|ψ⟩⟨ψ| − I) O_f

that is used in Grover's search algorithm. Here

  O_f |x⟩ = (−1)^{f(x)} |x⟩ ,  x ∈ {0, 1}^k ,

denotes the query, which is slightly different from, yet equivalent to [48], the one we defined in (2). Let

  |ψ⟩ = (1/√N) Σ_x |x⟩

be the equally weighted superposition of all the states. If M denotes the number of assignments for which f has the value 1, then

  SUM(f) = M/N .

Without loss of generality 1 ≤ M ≤ N − 1. Consider the space H spanned by the states

  |ψ₀⟩ = (1/√(N − M)) Σ_{x : f(x)=0} |x⟩   and   |ψ₁⟩ = (1/√M) Σ_{x : f(x)=1} |x⟩ .
Then

  |ψ⟩ = cos(θ/2)|ψ₀⟩ + sin(θ/2)|ψ₁⟩ ,

where sin(θ/2) = √(M/N) and θ/2 is the angle between the states |ψ⟩ and |ψ₀⟩. Thus

  sin²(θ/2) = M/N .

Now consider the operator G restricted to H, which has the matrix representation

  G = [ cos θ  −sin θ
        sin θ   cos θ ] .

Its eigenvalues are λ± = e^{±iθ}, i = √−1; let |ξ±⟩ denote the corresponding eigenvectors. We can express |ψ⟩ in terms of the |ξ±⟩ to get

  |ψ⟩ = a|ξ−⟩ + b|ξ+⟩ ,

with a, b ∈ ℂ, |a|² + |b|² = 1. This implies that phase estimation [48] with G^p, p = 2^j, j = 0, …, t − 1, and initial state |0⟩^{⊗t}|ψ⟩ can be used to approximate either θ or 2π − θ with error proportional to 2^{−t}. Note that the first t qubits of the initial state determine the accuracy of phase estimation. Indeed, let θ̃ be the result of phase estimation. Since sin²(θ/2) = sin²((2π − θ)/2),

  | sin²(θ̃/2) − M/N | = O(2^{−t}) ,

with probability at least 8/π²; see [12,48] for the details. Setting t = Θ(log ε⁻¹) and observing that phase estimation requires a number of applications of G (or queries O_f) proportional to 2^t = ε⁻¹ yields the result. (The complexity of quantum algorithms for the average case approximation of the Boolean mean has also been studied [35,52].)

For a real function

  f : {0, 1, …, N − 1} → [0, 1] ,

we can approximate
  SUM(f) = (1/N) Σ_{i=0}^{N−1} f(i)

by reducing the problem to the computation of the Boolean mean. One way to derive this reduction is to truncate f(i) to the desired number of significant bits, typically polynomial in log ε⁻¹, and then to use the bit representation of the function values to derive a Boolean function whose mean is equal to the mean of the truncated real function; see, e.g., [77]. The truncation of the function values is formally expressed through the mapping β in (3). Variations of this idea have been used in the literature [3,27,50].

Similarly, one discretizes the domain of a function f : [0, 1]^d → [0, 1] using the mapping τ in (3) and then uses the amplitude amplification algorithm to compute the average

  SUM(f) = (1/N) Σ_{i=0}^{N−1} f(x_i) ,   (12)
for x_i ∈ [0, 1]^d, i = 0, …, N − 1. The quantum query complexity of computing the average (12) is of order ε⁻¹. On the other hand, the classical deterministic worst case query complexity is proportional to N (recall that ε⁻¹ ≤ N), and the classical randomized query complexity is proportional to ε⁻².

We now turn to the approximation of high-dimensional integrals and outline an algorithm for solving this problem. Suppose f : [0, 1]^d → [0, 1] is a function for which we are given some promise, for instance that f has smoothness r. The algorithm integrating f with accuracy ε has two parts. First, using function evaluations f(x_i), i = 1, …, n, at deterministic points, it approximates f classically by a function f̂ with error ε₁, i.e.,

  ‖f − f̂‖ ≤ ε₁ ,

where ‖·‖ is the L∞ norm. The complexity of this problem has been extensively studied and there are numerous results [49,67,71] specifying the optimal choice of n and the points x_i that must be used to achieve error ε₁. Thus

  ∫_{[0,1]^d} f(x) dx = ∫_{[0,1]^d} f̂(x) dx + ∫_{[0,1]^d} g(x) dx ,

where g = f − f̂. Since f̂ is known and depends linearly on the f(x_i), the algorithm proceeds to integrate it exactly. So it suffices to approximate ∫_{[0,1]^d} g(x) dx knowing that ‖g‖ ≤ ε₁. The second part of the algorithm approximates the integral of g using the amplitude amplification algorithm to compute

  SUM(g) = (1/N) Σ_{i=0}^{N−1} g(y_i)

for certain points y_i ∈ [0, 1]^d, with error ε₂. Once more there are many results (see [67,71] for surveys) specifying the optimal N and the points y_i so that SUM(g) approximates ∫_{[0,1]^d} g(x) dx with error ε₂. Finally, the algorithm approximates the original integral by the sum of the results of its two parts. The overall error of the algorithm is proportional to ε = ε₁ + ε₂.
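The amplitude amplification step that these reductions rely on can be checked with a short classical simulation of the Grover iterate (a sketch; the Boolean function `f` below is an arbitrary hypothetical instance, and no quantum library is assumed): the state rotates by the angle θ with sin²(θ/2) = M/N, which is exactly what phase estimation reads off.

```python
import math

N, M = 16, 3                       # M inputs where the Boolean function is 1
f = [1] * M + [0] * (N - M)        # a hypothetical f : {0,...,N-1} -> {0,1}

psi = [1 / math.sqrt(N)] * N       # equally weighted superposition |psi>

def oracle(v):                     # O_f |x> = (-1)^f(x) |x>
    return [(-1) ** f[x] * a for x, a in enumerate(v)]

def diffusion(v):                  # 2|psi><psi| - I
    ip = sum(p * a for p, a in zip(psi, v))
    return [2 * ip * p - a for p, a in zip(psi, v)]

def grover(v):                     # G = (2|psi><psi| - I) O_f
    return diffusion(oracle(v))

theta = 2 * math.asin(math.sqrt(M / N))   # rotation angle: sin^2(theta/2) = M/N
v = psi[:]
for k in range(1, 4):
    v = grover(v)
    # amplitude on the f(x) = 1 states after k iterations is sin((2k+1) theta/2)
    amp1 = math.sqrt(sum(a * a for x, a in enumerate(v) if f[x] == 1))
    assert abs(amp1 - abs(math.sin((2 * k + 1) * theta / 2))) < 1e-12

print("sin^2(theta/2) =", math.sin(theta / 2) ** 2, "  M/N =", M / N)
```

Estimating θ to t bits by phase estimation therefore recovers M/N with error O(2^{−t}), at the cost of about 2^t applications of G.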
Quantum Algorithms and Complexity for Continuous Problems, Table 1
Query complexity of high-dimensional integration

                          Worst case      Randomized            Quantum
  F_d^{r,α}               ε^{−d/(r+α)}    ε^{−2d/(2(r+α)+d)}    ε^{−d/(r+α+d)}
  W_{p,d}^r, 2 ≤ p ≤ ∞    ε^{−d/r}        ε^{−2d/(2r+d)}        ε^{−d/(r+d)}
  W_{p,d}^r, 1 ≤ p < 2    ε^{−d/r}        ε^{−pd/(rp+pd−d)}     ε^{−d/(r+d)}
  W_{1,d}^r               ε^{−d/r}        ε^{−d/r}              ε^{−d/(r+d)}
Variations of the integration algorithm we described are known to have optimal query complexity for a number of different promises [27,29,34,50]. The quantum query complexity lower bounds for integration are based on the lower bounds of Nayak and Wu [47] for computing the Boolean mean. The quantum algorithms offer an exponential speedup over classical deterministic algorithms and a polynomial speedup over classical randomized algorithms for the query complexity of high-dimensional integration.

Table 1 summarizes the query complexity (up to polylog factors) of high-dimensional integration in the worst case, randomized and quantum settings for functions belonging to Hölder classes F_d^{r,α} and Sobolev spaces W_{p,d}^r. Heinrich obtained most of the quantum query complexity results in a series of papers, which we cited earlier, and summarized them in [28], where a corresponding table showing error bounds can be found. Abrams and Williams [3] were the first to apply the amplitude amplification algorithm to high-dimensional integration. Novak [50] was the first to spell out his promises and thus obtained the first complexity results for high-dimensional integration.

Path Integration

A path integral is defined as

  I(f) = ∫_X f(x) μ(dx) ,   (13)
where μ is a probability measure on an infinite-dimensional space X. It can be viewed as an infinite-dimensional integral.

For illustration we give an example due to R. Feynman. In classical mechanics a particle at a certain position at time t₀ has a unique trajectory to its position at time t₁. Quantum mechanically there are an infinite number of possible trajectories, which Feynman called histories; see Fig. 2. Feynman summed over the histories; passing to the limit one gets a path integral. Setting t₀ = 0, t₁ = 1, this integral is

  I(f) = ∫_{C[0,1]} f(x) μ(dx) ,

which is a special case of (13).
Quantum Algorithms and Complexity for Continuous Problems, Figure 2 Different trajectories of a particle
Path integration occurs in numerous applications including quantum mechanics, quantum chemistry, statistical mechanics, and mathematical finance.

Classical Computer

The first complexity analysis of path integration is due to Wasilkowski and Woźniakowski [72]; see also Chap. 5 in [67]. They studied the deterministic worst case and randomized settings and assume that μ is a Gaussian measure; an important special case of a Gaussian measure is the Wiener measure. They make the promise that F, the class of integrands, consists of functions f : X → ℝ whose rth Fréchet derivatives are continuous and uniformly bounded by unity. If r is finite then path integration is intractable. Curbera [20] showed that the worst case query complexity is of order ε^{−ε^{−β}}, where β is a positive number depending on r.

Wasilkowski and Woźniakowski [72] also considered the promise that F consists of entire functions and that μ is the Wiener measure. Then the query complexity is polynomial in ε⁻¹. More precisely, they provide an algorithm for computing a worst case ε-approximation with cost of order ε^{−p}, and the problem is tractable with this promise. The exponent p depends on the particular Gaussian measure; for the Wiener measure p = 2/3.

We return to the promise that the smoothness r is finite. Since this problem is intractable in the worst case, Wasilkowski and Woźniakowski [72] ask whether settling for a stochastic assurance will break the intractability. The obvious approach is to approximate the infinite-dimensional integral by a d-dimensional integral, where d may be large (or even huge) since d is polynomial in ε⁻¹. Then Monte Carlo may be used, since its speed of convergence does not depend on d. Modulo an assumption that the nth eigenvalue of the covariance operator of μ does not decrease too fast, the randomized query complexity is roughly ε⁻². Thus Monte Carlo is optimal.

Quantum Computer

Just as with finite-dimensional integration, Monte Carlo makes path integration tractable on a classical computer and the query complexity is of order ε⁻². Again we ask whether we can do better on a quantum computer, and again the short answer is yes. Under certain assumptions on the promises, which will be made precise below, the quantum query complexity is of order ε⁻¹. Thus quantum query complexity enjoys exponential speedup over the classical worst case query complexity and quadratic speedup over the classical randomized query complexity. Again the latter is the same speedup as enjoyed by Grover's search algorithm for an unstructured database.

The idea for solving path integration on a quantum computer is fairly simple, but the analysis is not so easy, so we outline the algorithm without considering the details. We start with a classical deterministic algorithm that uses an average as in (12) to approximate the path integral I(f) with error ε in the worst case. The number of terms N of this average is an exponential function of ε⁻¹. Nevertheless, on a quantum computer we can approximate the average, using the amplitude amplification algorithm as we discussed in Sect. "Integration (Quantum Computer)", with cost that depends on log N and is, therefore, polynomial in ε⁻¹; see Sect. 6 in [70].

A summary of the promises and the results in [70] follows. The measure μ is Gaussian and the eigenvalues of its covariance operator are of order j^{−h}, h > 1.
For the Wiener measure, which occurs in many applications, we have h = 2. The class of integrands consists of functions f whose rth Fréchet derivatives are continuous and uniformly bounded by unity; in particular, assume the integrands are at least Lipschitz. Then:

- Path integration on a quantum computer is tractable.
- Query complexity on a quantum computer enjoys exponential speedup over the worst case and quadratic speedup over the classical randomized query complexity. More precisely, the number of quantum queries is at most 4.22 ε⁻¹.
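The classical reduction underlying this algorithm, namely replacing the path integral by a d-dimensional integral and then averaging, can be sketched with Monte Carlo standing in for the quantum summation. The Karhunen-Loève truncation below is one standard way to discretize the Wiener measure, and both it and the test functional are illustrative assumptions of this sketch, not constructions from [70]:

```python
import math
import random

def brownian_kl(xi, t):
    """Truncated Karhunen-Loeve expansion of Brownian motion on [0,1]:
    W(t) ~ sum_j xi_j * sqrt(2) * sin((j - 1/2) pi t) / ((j - 1/2) pi),
    xi_j iid standard normal. The covariance eigenvalues decay like j^{-2},
    i.e. h = 2 for the Wiener measure."""
    return sum(x * math.sqrt(2) * math.sin((j + 0.5) * math.pi * t)
               / ((j + 0.5) * math.pi) for j, x in enumerate(xi))

def path_integral_mc(f, d=10, n_paths=20000, seed=0):
    """Approximate I(f) over the Wiener measure by a d-dimensional
    integral (KL truncation) estimated with Monte Carlo."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        xi = [rng.gauss(0.0, 1.0) for _ in range(d)]
        total += f(lambda t: brownian_kl(xi, t))
    return total / n_paths

# Example functional with known mean: f(x) = x(1)^2, and E[W(1)^2] = 1
# (the d-term truncation slightly underestimates the variance).
print(path_integral_mc(lambda x: x(1.0) ** 2))  # close to 1, up to MC error
```

On a quantum computer the inner average over paths is what amplitude amplification computes with ε⁻¹ queries instead of the ε⁻² samples used here.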
Results on the qubit complexity of path integration will be given in Sect. "Qubit Complexity". Details of an algorithm for computing an ε-approximation to a path integral on a quantum computer may be found in Sect. 6 of [70].

Feynman–Kac Path Integration

An important special case of a path integral is a Feynman–Kac path integral. Assume that X is the space C of continuous functions and that the measure μ is the Wiener measure w. Feynman–Kac path integrals occur in many applications; see [23]. For example, consider the diffusion equation

  ∂z/∂t (u, t) = (1/2) ∂²z/∂u² (u, t) + V(u) z(u, t) ,
  z(u, 0) = v(u) ,

where u ∈ ℝ, t > 0, V is a potential function, and v is an initial condition function. Under mild conditions on v and V the solution is given by the Feynman–Kac path integral

  z(u, t) = ∫_C v(x(t) + u) e^{∫₀ᵗ V(x(s)+u) ds} w(dx) .   (14)
The problem generalizes to the multivariate case by considering the diffusion equation

  ∂z/∂t (u, t) = (1/2) Δz(u, t) + V(u) z(u, t) ,
  z(u, 0) = v(u) ,

with u ∈ ℝ^d, t > 0, and V, v : ℝ^d → ℝ the potential and the initial value function, respectively. As usual, Δ denotes the Laplacian. The solution is given by the Feynman–Kac path integral

  z(u, t) = ∫_C v(x(t) + u) e^{∫₀ᵗ V(x(s)+u) ds} w(dx) ,

where C is the set of continuous functions x : ℝ₊ → ℝ^d such that x(0) = 0. Note that there are two kinds of dimension here: a Feynman–Kac path integral is infinite dimensional, since we are integrating over continuous functions, and, furthermore, v and V are functions of d variables.

Classical Computer

We begin with the case when u is a scalar and then move to the multivariate case. There have been a number of papers on the numerical solution of (14); see, for example, [14].
The usual attack is to solve the problem with a stochastic assurance using randomization. For simplicity we make the promise that v ≡ 1 and V is four times continuously differentiable. Then by Chorin's algorithm [16] the total cost is of order ε^{−2.5}.

The first complexity analysis may be found in Plaskota et al. [58], where a new algorithm is defined which enjoys certain optimality properties. They construct an algorithm which computes an ε-approximation at cost of order ε^{−0.25} and show that the worst case complexity is of the same order. Hence the exponent of ε⁻¹ is an order of magnitude smaller, and with a worst case rather than a stochastic guarantee. However, the new algorithm requires a numerically difficult precomputation which may limit its applicability.

We next report on multivariate Feynman–Kac path integration. First consider the worst case setting with the promise that v and V are r times continuously differentiable with r finite. Kwas and Li [43] proved that the query complexity is of order ε^{−d/r}. Therefore in the worst case setting the problem suffers the curse of dimensionality. The randomized setting was studied by Kwas [42]. He showed that the curse of dimensionality is broken by Monte Carlo, using a number of queries of order ε⁻². The number of queries can be further improved to ε^{−2/(1+2r/d)}, which is the optimal number of queries, by a somewhat more complicated algorithm. The randomized algorithms require extensive precomputing [42,43].

Quantum Computer

Multivariate Feynman–Kac path integration on a quantum computer was studied in [42]. With the same promise as in the worst case and randomized settings, Kwas presents an algorithm and complexity analysis. He exhibits a quantum algorithm, based on the Monte Carlo algorithm, that uses a number of queries of order ε⁻¹. He shows that the query complexity is of order ε^{−1/(1+r/d)} and is achieved by a somewhat more complicated quantum algorithm.
Just as in the randomized case the quantum algorithms require extensive precomputing [42,43].
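A Monte Carlo sketch of (14) for scalar u clarifies what these algorithms compute (the choices of v, V, and the crude Euler discretization below are illustrative assumptions, not the algorithms of [16], [58], or [42]):

```python
import math
import random

def feynman_kac_mc(v, V, u=0.0, t=1.0, n_steps=50, n_paths=10000, seed=0):
    """Monte Carlo sketch of the Feynman-Kac integral (14):
    z(u,t) = E[ v(W_t + u) * exp( int_0^t V(W_s + u) ds ) ],
    with Brownian paths W discretized by n_steps Gaussian increments
    and the time integral by a left-endpoint Riemann sum.
    """
    rng = random.Random(seed)
    dt = t / n_steps
    total = 0.0
    for _ in range(n_paths):
        w, s = 0.0, 0.0
        for _ in range(n_steps):
            s += V(w + u) * dt                 # accumulate int_0^t V(x(s)+u) ds
            w += rng.gauss(0.0, math.sqrt(dt))
        total += v(w + u) * math.exp(s)
    return total / n_paths

# Hypothetical test case: V = 0 reduces (14) to the heat equation, whose
# solution for v(x) = x^2 at u = 0, t = 1 is E[W_1^2] = 1.
print(feynman_kac_mc(lambda x: x * x, lambda x: 0.0))  # ~ 1, up to MC error
```

The cost here scales like ε⁻² in the number of paths, which is exactly the randomized complexity that the quantum algorithm improves to roughly ε⁻¹.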
Eigenvalue Approximation

Eigenvalue problems for differential operators arising in physics and engineering have been extensively studied in the literature; see, e.g., [5,18,19,22,25,40,63,65]. Typically, the mathematical properties of the eigenvalues and the corresponding eigenfunctions are known, and so are numerical algorithms approximating them on a classical computer. Nevertheless, the complexity of approximating eigenvalues in the worst, randomized and quantum settings has only recently been addressed, for the Sturm–Liouville eigenvalue problem [53,55] (see also [10] for quantum lower bounds with a different kind of query than the one we discuss in this article). In some cases we have sharp complexity estimates, but there are important questions that remain open.

Most of our discussion here concerns the complexity of approximating the smallest eigenvalue of a Sturm–Liouville eigenvalue problem. We will conclude this section by briefly addressing quantum algorithms for other eigenvalue problems.

In the physics literature this problem is called the time-independent Schrödinger equation; the smallest eigenvalue is the energy of the ground state. In the mathematics literature it is called the Sturm–Liouville eigenvalue problem.

Let I^d = [0, 1]^d and consider the class of functions

  Q = { q : I^d → [0, 1] | q, D_j q := ∂q/∂x_j ∈ C(I^d), ‖D_j q‖_∞ ≤ 1, ‖q‖_∞ ≤ 1 } ,

where ‖·‖_∞ denotes the supremum norm. For q ∈ Q, define L_q := −Δ + q, where Δ = Σ_{j=1}^d ∂²/∂x_j² is the Laplacian, and consider the eigenvalue problem

  L_q u = λ u ,  x ∈ (0, 1)^d ,   (15)
  u(x) = 0 ,  x ∈ ∂I^d .   (16)

In the variational form, the smallest eigenvalue λ = λ(q) of (15), (16) is given by

  λ(q) = min_{0 ≠ u ∈ H₀¹} [ ∫_{I^d} ( Σ_{j=1}^d [D_j u(x)]² + q(x) u²(x) ) dx ] / [ ∫_{I^d} u²(x) dx ] ,   (17)

where H₀¹ is the space of all functions vanishing on the boundary of I^d that have square integrable first order partial derivatives. We consider the complexity of classical and quantum algorithms approximating λ(q) with error ε.
Classical Computer

In the worst case we discretize the differential operator on a grid with mesh size h = Θ(ε) and obtain a matrix M_ε = −Δ_ε + B_ε of size proportional to ε^{−d} × ε^{−d}. The matrix Δ_ε is the (2d + 1)-point finite difference discretization of the Laplacian. The matrix B_ε is a diagonal matrix containing evaluations of q at the grid points. The smallest eigenvalue of M_ε approximates λ(q) with error O(ε) [73,74].
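For d = 1 this discretization is a tridiagonal matrix, and its smallest eigenvalue can be found by bisection; the Sturm-sequence pivot count below is one standard way to run the bisection, and the test potentials are illustrative assumptions of this sketch:

```python
import math

def smallest_eigenvalue(q, n=200):
    """Smallest eigenvalue of the d = 1 finite difference discretization
    M = tridiag(-1, 2, -1)/h^2 + diag(q(x_i)),  h = 1/(n+1),
    found by bisection with Sturm-sequence counts (no dense solver)."""
    h = 1.0 / (n + 1)
    diag = [2.0 / h**2 + q((i + 1) * h) for i in range(n)]
    off2 = (1.0 / h**2) ** 2                 # squared off-diagonal entry

    def count_below(lam):
        # number of eigenvalues of M below lam = number of negative
        # pivots in the LDL^T factorization of M - lam*I (Sylvester)
        count, d = 0, 1.0
        for i in range(n):
            d = diag[i] - lam - (off2 / d if i else 0.0)
            if d == 0.0:
                d = 1e-300
            if d < 0:
                count += 1
        return count

    lo, hi = 0.0, max(diag) + 2.0 / h**2     # Gershgorin upper bound
    while hi - lo > 1e-8:
        mid = 0.5 * (lo + hi)
        if count_below(mid) >= 1:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# For q == 0 the smallest eigenvalue of -u'' on (0,1) with u(0)=u(1)=0
# is pi^2; the discrete value converges to it as n grows.
print(smallest_eigenvalue(lambda x: 0.0))   # close to pi^2 ~ 9.8696
```

Each evaluation of the diagonal uses one query of q per grid point, which is the ε^{−d} query count discussed next.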
We compute the smallest eigenvalue of M_ε using the bisection method (p. 228 in [22]). The resulting algorithm uses a number of queries proportional to ε^{−d}. It turns out that this number of queries is optimal in the worst case, and the problem suffers from the curse of dimensionality.

The query complexity lower bounds are obtained by reducing the eigenvalue problem to high-dimensional integration. For this we use the perturbation formula [53,55]

  λ(q) = λ(q̄) + ∫_{I^d} (q(x) − q̄(x)) u_{q̄}²(x) dx + O(‖q − q̄‖₁²) ,   (18)

where q, q̄ ∈ Q and u_{q̄} is the normalized eigenfunction corresponding to λ(q̄). The same formula is used for lower bounds in the randomized setting; namely, the query complexity is

  Ω(ε^{−2d/(d+2)}) .

Moreover, we can use (18) to derive a randomized algorithm. First we approximate q by a function q̄, and then we approximate λ(q) by approximating the first two terms on the right hand side of (18); see Papageorgiou and Woźniakowski [55] for d = 1 and Papageorgiou [53] for general d. However, this algorithm uses

  O(ε^{−max(2/3, d/2)})

queries, so it is optimal only when d ≤ 2. Determining the randomized complexity for d > 2, and determining whether the randomized complexity is an exponential function of d, are important open questions.

Quantum Computer

The perturbation formula (18) can be used to show that the quantum query complexity is

  Ω(ε^{−d/(d+1)}) .

As in the randomized case, we can use (18) to derive an algorithm, one that uses O(ε^{−d/2}) quantum queries. The difference between the quantum algorithm and the randomized algorithm is that the former uses the amplitude amplification algorithm, instead of Monte Carlo, to approximate the integral on the right hand side of (18). The algorithm is optimal only when d = 1 [53,55]. For general d the query complexity is not known exactly; we only have the upper bound O(ε^{−p}), p ≤ 6. The best quantum algorithm known is based on phase estimation. In particular, we discretize the problem as in the worst case and apply phase estimation to the unitary
matrix

  e^{iγM_ε} ,

where γ is chosen appropriately so that the resulting phase belongs to [0, 2π). We use a splitting formula to approximate the necessary powers of the matrix exponential. The largest eigenvalue of Δ_ε is O(ε⁻²) and ‖q‖_∞ ≤ 1; this implies that the resulting number of queries does not grow exponentially with d.

Finally, there are a number of papers providing quantum algorithms for eigenvalue approximation without carrying out a complete complexity analysis. Abrams and Lloyd [2] have written an influential paper on eigenvalue approximation of a quantum mechanical system evolving with a given Hamiltonian. They point out that phase estimation, which requires the corresponding eigenvector as part of its initial state, can also be used with an approximate eigenvector.

Jaksch and Papageorgiou [36] give a quantum algorithm which computes a good approximation of the eigenvector at low cost. Their method can be applied generally to the solution of continuous Hermitian eigenproblems on a discrete grid. It starts with a classically obtained eigenvector for a problem discretized on a coarse grid and constructs an approximate eigenvector on a fine grid. We describe this algorithm briefly for the case d = 1. Suppose N = 2^k and N₀ = 2^{k₀} are the numbers of points in the fine and coarse grids, respectively. Given the eigenvector |U^{(N₀)}⟩ for the coarse grid, we approximate the eigenvector |U^{(N)}⟩ for the fine grid by

  |Ũ^{(N)}⟩ = |U^{(N₀)}⟩ ⊗ ( (|0⟩ + |1⟩)/√2 )^{⊗(k−k₀)} .

The effect of this transformation is to replicate the coordinates of |U^{(N₀)}⟩ 2^{k−k₀} times. The resulting approximation is good enough in the sense that the success probability of phase estimation with initial state |Ũ^{(N)}⟩ is greater than 1/2.

Szkopek et al. [64] use the algorithm of Jaksch and Papageorgiou in the approximation of low order eigenvalues of a differential operator of order 2s in d dimensions. Their paper provides an algorithm with cost polynomial in ε⁻¹ and generalizes the results of Abrams and Lloyd [2]. However, [64] does not carry out a complexity analysis but only considers known classical algorithms in the worst case for comparison.

Qubit Complexity

For the foreseeable future the number of qubits will be a crucial computational resource. We give a general lower bound on the qubit complexity of continuous problems.
Recall that in (1) we defined a quantum algorithm as

  |ψ_f⟩ = U_T Q_f U_{T−1} Q_f ⋯ U_1 Q_f U_0 |ψ_0⟩ ,

where |ψ_0⟩ and |ψ_f⟩ are the initial and final state vectors, respectively. They are column vectors of length 2^ν. The query Q_f, a 2^ν × 2^ν unitary matrix, depends on the values of f at n ≤ 2^ν deterministic points; that is,

  Q_f = Q_f( f(x₁), …, f(x_n) ) .

It is important to note that in the standard model the evolution is completely deterministic; the only probabilistic element is in the measurement of the final state. For the qubit complexity (8) we have the following lower bound:

  comp_std^qubit(ε; S) = Ω( log comp_clas^query(2ε; S) ) .   (19)

Here S specifies the problem, see (4), and comp_std^qubit(ε; S) is the qubit complexity in the standard quantum setting. On the right hand side comp_clas^query(ε; S) is the query complexity on a classical computer in the worst case setting. See [77] for details.

We provide an intuition about (19). Assume 2^ν function evaluations are needed to solve the problem specified by S to within ε. Then ν qubits are needed to generate the Hilbert space H_ν of size 2^ν to store the evaluations.

Equation (19) can be interpreted as a certain limitation of the standard quantum setting, which considers queries using function evaluations at deterministic points. We will show why this is a limitation and how we can do better. Consider multivariate integration, which was discussed in Sect. "Integration". We seek to approximate

  S(f) = ∫_{[0,1]^d} f(x) dx .

Assume our promise is f ∈ F_d, where

  F_d = { f : [0, 1]^d → ℝ | f continuous and |f(x)| ≤ 1, x ∈ [0, 1]^d } .

For the moment let d = 1. We showed that with this promise we cannot guarantee an ε-approximation on a classical computer with ε < 1/2. That is,

  comp_clas^query(ε; S) = ∞ .

By (19),

  comp_std^qubit(ε; S) = ∞ .
compclasran ("; S) D ("2 ) : Thus we have identified a problem which is easy to solve on a classical computer and is impossible to solve on a quantum computer using the standard formulation of a quantum algorithm (1). Our example motivates extending the notion of quantum algorithm by permitting randomized queries. The quantum setting with randomized queries was introduced by Wo´zniakowski [77]. The idea of using randomized queries is not entirely new. Shor’s algorithm [60] uses a special kind of randomized query, namely, Q x jxi D j jx modNi ; with j D 0; : : : ; N 1 and a random x from the set f2; 3; : : : ; N 1g. In our case, the randomization affects only the selection of sample points and the number of queries. It occurs prior to the implementation of the queries and the execution of the quantum algorithm. In this extended setting, we define a quantum algorithm as j
f ;! i
D
U T! Q f ;! U T! 1 Q f ;! : : : U1 Q f ;! U0 j
0i ;
(20)
where ! is a random variable and Q f ;! D Q f ;! ( f (x1;! ); : : : ; f (x n;! )) ; and the x j;! are random points. The number of queries T! and the points x j;! are chosen at random initially and then remain fixed for the remainder of the computation. Note that (20) is identical to (1) except for the introduction of randomization. Randomized queries require that we modify the criterion (6) by which we measure the error of an algorithm. One possibility is to consider the expected error and another possibility is to consider the probabilistic error with respect to the random variable !. Both cases have been considered in the literature [77] but we will avoid the details here because they are rather technical. A test of the new setting is whether it buys us anything. We compare the qubit complexity of the standard and randomized settings for integration and path integration.
Quantum Algorithms and Complexity for Continuous Problems
Integration

We make the same promise as above, namely, f ∈ F_d.

Quantum setting with deterministic queries. We remind the reader that

comp^{query}_{std}(ε) = ∞ ,  comp^{qubit}_{std}(ε) = ∞ .

Quantum setting with randomized queries. Then [77]

comp^{query}_{ran}(ε) = ε^{-1} ,  comp^{qubit}_{ran}(ε) = log ε^{-1} .

Therefore, there is infinite improvement in the randomized quantum setting over the standard quantum setting. As the analogue of (19), we have the following lower bound on the randomized qubit complexity [77]:

comp^{qubit}_{ran}(ε; S) = Ω(log comp^{query}_{clas-ran}(ε; S)) ,   (21)

where comp^{query}_{clas-ran}(ε; S) is the query complexity on a classical computer in the randomized setting.

Path Integration

Quantum setting with deterministic queries. With an appropriate promise it was shown [77] that

comp^{query}_{std}(ε) = ε^{-1} ,  comp^{qubit}_{std}(ε) = ε^{-2} log ε^{-1} .

Thus, modulo the log factor, the qubit complexity of path integration is a second-degree polynomial in ε^{-1}. That seems pretty good, but we probably will not have enough qubits for a long time to do new science, especially with error correction.

Quantum setting with randomized queries. Then [77]

comp^{query}_{ran}(ε) = ε^{-1} ,  comp^{qubit}_{ran}(ε) = log ε^{-1} .

Thus there is an exponential improvement in the randomized quantum setting over the standard quantum setting.

Approximation

Approximating functions of d variables is a fundamental and generally hard problem. Typically, the literature considers the approximation of functions that belong to the Sobolev space W^r_p([0, 1]^d) in the norm of L_q([0, 1]^d). The condition r/d > 1/p ensures that functions in W^r_p([0, 1]^d) are continuous, which is necessary for function values to be well defined. Thus, when p = ∞ the dimension d can be arbitrarily large while the smoothness r is held fixed, which cannot happen when p < ∞. For p = ∞ the problem suffers the curse of dimensionality in the worst-case and randomized classical settings [49,71]. Recently, Heinrich [30] showed that quantum computers do not offer any advantage relative to classical computers, since the problem remains intractable in the quantum setting. For other values of the parameters p, q, r, d the classical and quantum complexities are also known [30,49,71]. In some cases quantum computers can provide a roughly quadratic speedup over classical computers, but there are also cases where the classical and quantum complexities coincide. Table 2 summarizes the order of the query complexity (up to polylog factors) of approximation in the worst-case, randomized and quantum settings; it is based on a similar table in [30] describing error bounds.
Elliptic Partial Differential Equations

Elliptic partial differential equations have many important applications and have been extensively studied in the literature; see [75] and the references therein. A simple example is the Poisson equation, for which we want to find a function u: Ω̄ → R that satisfies

−Δu(x) = f(x) ,  x ∈ Ω ,
u(x) = 0 ,  x ∈ ∂Ω ,

where Ω ⊂ R^d. More generally, we consider elliptic partial differential equations of order 2m on a smooth bounded domain Ω ⊂ R^d with smooth coefficients and homogeneous
Quantum Algorithms and Complexity for Continuous Problems, Table 2 Query complexity of approximation

Class                                  | Worst case               | Randomized               | Quantum
1 ≤ p < q ≤ ∞, r/d ≥ 2/p − 2/q         | ε^{−dpq/(rpq−d(q−p))}    | ε^{−dpq/(rpq−d(q−p))}    | ε^{−d/r}
1 ≤ p < q ≤ ∞, r/d < 2/p − 2/q         | ε^{−dpq/(rpq−d(q−p))}    | ε^{−dpq/(rpq−d(q−p))}    | ε^{−dpq/(2rpq−2d(q−p))}
1 ≤ q ≤ p ≤ ∞                          | ε^{−d/r}                 | ε^{−d/r}                 | ε^{−d/r}
boundary conditions, with the right-hand side belonging to C^r(Ω) and the error measured in the L_∞ norm; see [32] for details. In the worst case the complexity is proportional to ε^{−d/r} [75] and the problem is intractable. The randomized complexity of this problem was only recently studied, along with the quantum complexity, by Heinrich [31,32]. In particular, the randomized query complexity (up to polylog factors) is proportional to

ε^{−max{d/(r+2m), 2d/(2r+d)}} ,

and the quantum query complexity is proportional to

ε^{−max{d/(r+2m), d/(r+d)}} .

Thus the quantum setting may provide a polynomial speedup over the classical randomized setting, but not always. Moreover, for fixed m and r and for d > 4m the problem is intractable in all three settings.

Ordinary Differential Equations

In this section we consider the solution of a system of ordinary differential equations with initial conditions,

z′(t) = f(z(t)) ,  t ∈ [a, b] ,  z(a) = η ,

where f: R^d → R^d, z: [a, b] → R^d and η ∈ R^d, with f(η) ≠ 0. For the right-hand side function f = [f_1, ..., f_d], where f_j: R^d → R, we assume that the f_j belong to the Hölder class F^{r,α}_d, r + α ≥ 1. We seek to compute a bounded function on the interval [a, b] that approximates the solution z. Kacewicz [38] studied the classical worst-case complexity of this problem and found it to be proportional to ε^{−1/(r+α)}. Recently he also studied the classical randomized and quantum complexity of the problem and derived algorithms that yield upper bounds differing from the lower bounds by only an arbitrarily small positive parameter γ in the exponent [39]. The resulting randomized and quantum complexity bounds (up to polylog factors) are

O(ε^{−1/(r+α+1/2−γ)})  and  O(ε^{−1/(r+α+1−γ)}) ,

respectively, where γ ∈ (0, 1) is arbitrarily small. Observe that the randomized and quantum complexities (up to polylog factors) satisfy the lower bounds Ω(ε^{−1/(r+α+1/2)})
and Ω(ε^{−1/(r+α+1)}), respectively. Even more recently, Heinrich and Milla [33] showed that the upper bound for the randomized complexity holds with γ = 0, thereby establishing tight upper and lower randomized complexity bounds, up to polylog factors. Once more, the quantum setting provides a polynomial speedup over the classical setting.

Gradient Estimation

Approximating the gradient of a function f: R^d → R with accuracy ε requires a minimum of d + 1 function evaluations on a classical computer. Jordan [37] shows how this can be done using a single query on a quantum computer. We present Jordan's algorithm for the special case where the function is a plane passing through the origin, i.e., f(x_1, ..., x_d) = Σ_{j=1}^{d} a_j x_j, and is uniformly bounded by 1. Then ∇f = (a_1, ..., a_d)^T. Using a single query and phase kickback we obtain the state

(1/√(N^d)) Σ_{j_1=0}^{N−1} ... Σ_{j_d=0}^{N−1} e^{2πi f(j_1,...,j_d)} |j_1⟩ ... |j_d⟩ ,

where N is a power of 2. Equivalently, we have

(1/√(N^d)) Σ_{j_1=0}^{N−1} ... Σ_{j_d=0}^{N−1} e^{2πi (a_1 j_1 + ... + a_d j_d)} |j_1⟩ ... |j_d⟩ .

This is equal to the product state

((1/√N) Σ_{j_1=0}^{N−1} e^{2πi a_1 j_1} |j_1⟩) ... ((1/√N) Σ_{j_d=0}^{N−1} e^{2πi a_d j_d} |j_d⟩) .
We apply the Fourier transform to each of the d registers and then measure each register in the computational basis to obtain m_1, ..., m_d. If each a_j can be represented with finitely many bits and N is sufficiently large, then m_j/N = a_j, j = 1, ..., d. For functions whose second-order partial derivatives are not identically zero the analysis is more complicated, and we refer the reader to [37] for the details.

Simulation of Quantum Systems on Quantum Computers

So far this article has been devoted to work on algorithms and complexity of problems where the query and qubit complexities are known or have been studied. In a number
of cases, the classical complexity of these problems is also known and we know the quantum computing speedup. The notion that quantum systems could be simulated more efficiently by quantum computers than by classical computers was first mentioned by Manin [44], see also [45], and discussed thoroughly by Feynman [24]. There is a large and varied literature on simulation of quantum systems on quantum computers. The focus in these papers is typically on the cost of particular quantum and classical algorithms without complexity analysis and therefore without speedup results. To give the reader a taste of this area we list some sample papers: Berry et al. [9] present an efficient quantum algorithm for simulating the evolution of a sparse Hamiltonian. Dawson, Eisert and Osborne [21] introduce a unified formulation of variational methods for simulating ground state properties of quantum many-body systems. Morita and Nishimori [46] derive convergence conditions for the quantum annealing algorithm. Brown, Clark, Chuang [13] establish limits of quantum simulation when applied to specific problems. Chen, Yepez and Cory [15] report on the simulation of Burgers equation as a type-II quantum computation. Paredes, Verstraete and Cirac [56] present an algorithm that exploits quantum parallelism to simulate randomness. Somma et al. [61] discuss what type of physical problems can be efficiently simulated on a quantum computer which cannot be simulated on a Turing machine. Yepez [78] presents an efficient algorithm for the many-body three-dimensional Dirac equation. Nielsen and Chuang [48] discuss simulation of a variety of quantum systems. Sornborger and Stewart [62] develop higher order methods for simulations. Boghosian and Taylor [11] present algorithms for efficiently simulating quantum mechanical systems. Zalka [79] shows that the time evolution of the wave function of a quantum mechanical many particle system can be simulated efficiently. 
Abrams and Lloyd [1] provide fast algorithms for simulating many-body Fermi systems. Wiesner [76] provides two quantum many-body problems whose solution is intractable on a classical computer.

Future Directions

The reason there is so much interest in quantum computers is to solve important problems fast. The belief is that we will be able to solve scientific problems, and in particular quantum mechanical systems, which cannot be solved on a classical computer. That is, quantum computation will lead to new science. Research consists of two major parts. The first is the identification of important scientific problems with substantial speedup. The second is the construction of machines with a sufficient number of qubits and long enough decoherence times to solve the problems identified in the first part. Abrams and Lloyd [2] have argued that with 50 to 100 qubits we can solve interesting classically intractable problems from atomic physics. Of course, this does not include the qubits needed for fault-tolerant computation.

There are numerous important open questions. We will limit ourselves here to some open questions regarding the problems discussed in this article.

1. In Sect. "Path Integration" we reported big wins for the qubit complexity of integration and path integration. Are there big wins for other problems?
2. Are there problems for which we get big wins for query complexity using randomized queries?
3. Are there tradeoffs between the query complexity and the qubit complexity?
4. What are the classical and quantum complexities of approximating the solution of the Schrödinger equation for a many-particle system? How do they depend on the number of particles? What is the quantum speedup?

Acknowledgments

We are grateful to Erich Novak, University of Jena, and Henryk Woźniakowski, Columbia University and University of Warsaw, for their very helpful comments. We thank Jason Petras, Columbia University, for checking the complexity estimates appearing in the tables.

Bibliography

1. Abrams DS, Lloyd S (1997) Simulation of Many-Body Fermi Systems on a Universal Quantum Computer. Phys Rev Lett 79(13):2586–2589; http://arXiv.org/quant-ph/9703054
2. Abrams DS, Lloyd S (1999) Quantum Algorithm Providing Exponential Speed Increase for Finding Eigenvalues and Eigenvectors. Phys Rev Lett 83:5162–5165
3. Abrams DS, Williams CP (1999) Fast quantum algorithms for numerical integrals and stochastic processes. http://arXiv.org/quant-ph/9908083
4. Bakhvalov NS (1977) Numerical Methods. Mir Publishers, Moscow
5. Babuska I, Osborn J (1991) Eigenvalue Problems. In: Ciarlet PG, Lions JL (eds) Handbook of Numerical Analysis, vol II. North-Holland, Amsterdam, pp 641–787
6. Beals R, Buhrman H, Cleve R, Mosca R, de Wolf R (1998) Quantum lower bounds by polynomials. Proceedings FOCS'98, pp 352–361. http://arXiv.org/quant-ph/9802049
7. Bennett CH, Bernstein E, Brassard G, Vazirani U (1997) Strengths and weaknesses of quantum computing. SIAM J Comput 26(5):1510–1523
8. Bernstein E, Vazirani U (1997) Quantum complexity theory. SIAM J Comput 26(5):1411–1473
9. Berry DW, Ahokas G, Cleve R, Sanders BC (2007) Efficient quantum algorithms for simulating sparse Hamiltonians. Commun Math Phys 270(2):359–371; http://arXiv.org/quant-ph/0508139
10. Bessen AJ (2007) On the complexity of classical and quantum algorithms for numerical problems in quantum mechanics. PhD thesis. Department of Computer Science, Columbia University
11. Boghosian BM, Taylor W (1998) Simulating quantum mechanics on a quantum computer. Physica D 120:30–42; http://arXiv.org/quant-ph/9701019
12. Brassard G, Hoyer P, Mosca M, Tapp A (2002) Quantum Amplitude Amplification and Estimation. Contemporary Mathematics, vol 305. Am Math Soc, Providence, pp 53–74. http://arXiv.org/quant-ph/0005055
13. Brown KR, Clark RJ, Chuang IL (2006) Limitations of Quantum Simulation Examined by Simulating a Pairing Hamiltonian using Magnetic Resonance. Phys Rev Lett 97(5):050504; http://arXiv.org/quant-ph/0601021
14. Cameron RH (1951) A Simpson's rule for the numerical evaluation of Wiener's integrals in function space. Duke Math J 8:111–130
15. Chen Z, Yepez J, Cory DG (2006) Simulation of the Burgers equation by NMR quantum information processing. Phys Rev A 74:042321; http://arXiv.org/quant-ph/0410198
16. Chorin AJ (1973) Accurate evaluation of Wiener integrals. Math Comp 27:1–15
17. Cleve R, Ekert A, Macchiavello C, Mosca M (1996) Quantum Algorithms Revisited. Proc R Soc Lond A 454(1969):339–354
18. Collatz L (1960) The Numerical Treatment of Differential Equations. Springer, Berlin
19. Courant C, Hilbert D (1989) Methods of Mathematical Physics, vol I. Wiley Classics Library. Wiley-Interscience, New York
20. Curbera F (2000) Delayed curse of dimension for Gaussian integration. J Complex 16(2):474–506
21. Dawson CM, Eisert J, Osborne TJ (2007) Unifying variational methods for simulating quantum many-body systems. http://arxiv.org/abs/0705.3456v1
22. Demmel JW (1997) Applied Numerical Linear Algebra. SIAM, Philadelphia
23. Egorov AD, Sobolevsky PI, Yanovich LA (1993) Functional Integrals: Approximate Evaluation and Applications. Kluwer, Dordrecht
24. Feynman RP (1982) Simulating physics with computers. Int J Theor Phys 21:476
25. Forsythe GE, Wasow WR (2004) Finite-Difference Methods for Partial Differential Equations. Dover, New York
26. Grover L (1997) Quantum mechanics helps in searching for a needle in a haystack. Phys Rev Lett 79(2):325–328; http://arXiv.org/quant-ph/9706033
27. Heinrich S (2002) Quantum Summation with an Application to Integration. J Complex 18(1):1–50; http://arXiv.org/quant-ph/0105116
28. Heinrich S (2003) From Monte Carlo to Quantum Computation. In: Entacher K, Schmid WC, Uhl A (eds) Proceedings of the 3rd IMACS Seminar on Monte Carlo Methods MCM2001, Salzburg. Special Issue of Math Comput Simul 62:219–230
29. Heinrich S (2003) Quantum integration in Sobolev spaces. J Complex 19:19–42
30. Heinrich S (2004) Quantum Approximation II. Sobolev Embeddings. J Complex 20:27–45; http://arXiv.org/quant-ph/0305031
31. Heinrich S (2006) The randomized complexity of elliptic PDE. J Complex 22(2):220–249
32. Heinrich S (2006) The quantum query complexity of elliptic PDE. J Complex 22(5):691–725
33. Heinrich S, Milla B (2007) The randomized complexity of initial value problems. Talk presented at the First Joint International Meeting between the American Mathematical Society and the Polish Mathematical Society, Warsaw, Poland
34. Heinrich S, Novak E (2002) Optimal summation by deterministic, randomized and quantum algorithms. In: Fang KT, Hickernell FJ, Niederreiter H (eds) Monte Carlo and Quasi-Monte Carlo Methods 2000. Springer, Berlin
35. Heinrich S, Kwas M, Woźniakowski H (2004) Quantum Boolean Summation with Repetitions in the Worst-Average Setting. In: Niederreiter H (ed) Monte Carlo and Quasi-Monte Carlo Methods 2002. Springer, New York, pp 27–49
36. Jaksch P, Papageorgiou A (2003) Eigenvector approximation leading to exponential speedup of quantum eigenvalue calculation. Phys Rev Lett 91:257902; http://arXiv.org/quant-ph/0308016
37. Jordan SP (2005) Fast Quantum Algorithm for Numerical Gradient Estimation. Phys Rev Lett 95:050501; http://arXiv.org/quant-ph/0405146
38. Kacewicz BZ (1984) How to increase the order to get minimal-error algorithms for systems of ODEs. Numer Math 45:93–104
39. Kacewicz BZ (2006) Almost optimal solution of initial-value problems by randomized and quantum algorithms. J Complex 22(5):676–690
40. Keller HB (1968) Numerical methods for two-point boundary-value problems. Blaisdell Pub Co, Waltham
41. Knuth DE (1997) The Art of Computer Programming, vol 2: Seminumerical Algorithms, 3rd edn. Addison-Wesley Professional, Cambridge
42. Kwas M (2005) Quantum algorithms and complexity for certain continuous and related discrete problems. PhD thesis. Department of Computer Science, Columbia University
43. Kwas M, Li Y (2003) Worst case complexity of multivariate Feynman–Kac path integration. J Complex 19:730–743
44. Manin Y (1980) Computable and Uncomputable. Sovetskoye Radio, Moscow (in Russian)
45. Manin YI (1999) Classical computing, quantum computing, and Shor's factoring algorithm. http://arXiv.org/quant-ph/9903008
46. Morita S, Nishimori H (2007) Convergence of Quantum Annealing with Real-Time Schrödinger Dynamics. J Phys Soc Jpn 76(6):064002; http://arXiv.org/quant-ph/0702252
47. Nayak A, Wu F (1999) The quantum query complexity of approximating the median and related statistics. In: Proc STOC 1999. Association for Computing Machinery, New York, pp 384–393. http://arXiv.org/quant-ph/9804066
48. Nielsen MA, Chuang IL (2000) Quantum Computation and Quantum Information. Cambridge University Press, Cambridge
49. Novak E (1988) Deterministic and Stochastic Error Bounds in Numerical Analysis. Lecture Notes in Mathematics, vol 1349. Springer, Berlin
50. Novak E (2001) Quantum complexity of integration. J Complex 17:2–16; http://arXiv.org/quant-ph/0008124
51. Ortiz G, Gubernatis JE, Knill E, Laflamme R (2001) Quantum algorithms for fermionic simulations. Phys Rev A 64(2):022319; http://arXiv.org/cond-mat/0012334
52. Papageorgiou A (2004) Average case quantum lower bounds for computing the boolean mean. J Complex 20(5):713–731
53. Papageorgiou A (2007) On the complexity of the multivariate Sturm–Liouville eigenvalue problem. J Complex 23(4–6):802–827
54. Papageorgiou A, Traub JF (2005) Qubit complexity of continuous problems. http://arXiv.org/quant-ph/0512082
55. Papageorgiou A, Woźniakowski H (2005) Classical and Quantum Complexity of the Sturm–Liouville Eigenvalue Problem. Quantum Inf Process 4(2):87–127; http://arXiv.org/quant-ph/0502054
56. Paredes B, Verstraete F, Cirac JI (2005) Exploiting Quantum Parallelism to Simulate Quantum Random Many-Body Systems. Phys Rev Lett 95:140501; http://arXiv.org/cond-mat/0505288
57. Plaskota L (1996) Noisy Information and Computational Complexity. Cambridge University Press, Cambridge
58. Plaskota L, Wasilkowski GW, Woźniakowski H (2000) A new algorithm and worst case complexity for Feynman–Kac path integration. J Comp Phys 164(2):335–353
59. Ritter K (2000) Average-Case Analysis of Numerical Problems. Lecture Notes in Mathematics, vol 1733. Springer, Berlin
60. Shor PW (1997) Polynomial-time algorithms for prime factorization and discrete logarithm on a quantum computer. SIAM J Comput 26(5):1484–1509
61. Somma R, Ortiz G, Knill E, Gubernatis (2003) Quantum Simulations of Physics Problems. In: Pirich AR, Brant HE (eds) Quantum Information and Computation. Proc SPIE 2003, vol 5105. The International Society for Optical Engineering, Bellingham, pp 96–103. http://arXiv.org/quant-ph/0304063
62. Sornborger AT, Stewart ED (1999) Higher Order Methods for Simulations on Quantum Computers. Phys Rev A 60(3):1956–1965; http://arXiv.org/quant-ph/9903055
63. Strang G, Fix GJ (1973) An Analysis of the Finite Element Method. Prentice-Hall, Englewood Cliffs
64. Szkopek T, Roychowdhury V, Yablonovitch E, Abrams DS (2005) Eigenvalue estimation of differential operators with a quantum algorithm. Phys Rev A 72:062318
65. Titchmarsh EC (1958) Eigenfunction Expansions Associated with Second-Order Differential Equations, Part B. Oxford University Press, Oxford
66. Traub JF (1999) A continuous model of computation. Phys Today May:39–43
67. Traub JF, Werschulz AG (1998) Complexity and Information. Cambridge University Press, Cambridge
68. Traub JF, Woźniakowski H (1980) A general theory of optimal algorithms. ACM Monograph Series. Academic Press, New York
69. Traub JF, Woźniakowski H (1992) The Monte Carlo algorithm with a pseudorandom generator. Math Comp 58(197):323–339
70. Traub JF, Woźniakowski H (2002) Path integration on a quantum computer. Quantum Inf Process 1(5):365–388; http://arXiv.org/quant-ph/0109113
71. Traub JF, Wasilkowski GW, Woźniakowski H (1988) Information-Based Complexity. Academic Press, New York
72. Wasilkowski GW, Woźniakowski H (1996) On tractability of path integration. J Math Phys 37(4):2071–2088
73. Weinberger HF (1956) Upper and Lower Bounds for Eigenvalues by Finite Difference Methods. Commun Pure Appl Math IX:613–623
74. Weinberger HF (1958) Lower Bounds for Higher Eigenvalues by Finite Difference Methods. Pacific J Math 8(2):339–368
75. Werschulz AG (1991) The Computational Complexity of Differential and Integral Equations. Oxford University Press, Oxford
76. Wiesner S (1996) Simulations of Many-Body Quantum Systems by a Quantum Computer. http://arXiv.org/quant-ph/96
77. Woźniakowski H (2006) The Quantum Setting with Randomized Queries for Continuous Problems. Quantum Inf Process 5(2):83–130
78. Yepez J (2002) An efficient and accurate quantum algorithm for the Dirac equation. http://arXiv.org/quant-ph/0210093
79. Zalka C (1998) Simulating quantum systems on a quantum computer. Proc Royal Soc Lond A 454(1969):313–322; http://arXiv.org/quant-ph/9603026
Quantum Cellular Automata
KAROLINE WIESNER1,2
1 School of Mathematics, University of Bristol, Bristol, UK
2 Centre for Complexity Sciences, University of Bristol, Bristol, UK
Article Outline

Glossary
Definition of the Subject
Introduction
Cellular Automata
Early Proposals
Models of QCA
Computationally Universal QCA
Modeling Physical Systems
Implementations
Future Directions
Bibliography
Glossary

Configuration  The state of all cells at a given point in time.
Neighborhood  All cells with respect to a given cell that can affect this cell's state at the next time step. A neighborhood always contains a finite number of cells.
Space-homogeneous  The transition function / update table is the same for each cell.
Time-homogeneous  The transition function / update table is time-independent.
Update table  Takes the current state of a cell and its neighborhood as an argument and returns the cell's state at the next time step.
Schrödinger picture  Time evolution is represented by a quantum state evolving in time according to a time-independent unitary operator acting on it.
Heisenberg picture  Time evolution is represented by observables (elements of an operator algebra) evolving in time according to a unitary operator acting on them.
BQP complexity class  Bounded-error quantum polynomial time: the class of decision problems solvable by a quantum computer in polynomial time with an error probability of at most 1/3.
QMA complexity class  Quantum Merlin–Arthur: the class of decision problems such that a "yes" answer can be verified by a 1-message quantum interactive proof (verifiable in BQP).
Quantum Turing machine  A quantum version of a Turing machine – an abstract computational model able to compute any computable sequence.
Swap operation  The one-qubit unitary gate U = [0 1; 1 0], which exchanges the basis states |0⟩ and |1⟩.
Hadamard gate  The one-qubit unitary gate U = (1/√2) [1 1; 1 −1].
Phase gate  The one-qubit unitary gate U = [1 0; 0 e^{iφ}].
Pauli operator  The three Pauli operators are σ_x = [0 1; 1 0], σ_y = [0 −i; i 0], σ_z = [1 0; 0 −1].
Qubit  A 2-state quantum system, representable as a vector a|0⟩ + b|1⟩ in complex space with |a|² + |b|² = 1.

Definition of the Subject

Quantum cellular automata (QCA) are a generalization of (classical) cellular automata (CA) and in particular of reversible CA. The latter are reviewed briefly. An overview is given of early attempts by various authors to define one-dimensional QCA. These turned out to have serious shortcomings, which are discussed as well. Various proposals subsequently put forward by a number of authors for a general definition of one- and higher-dimensional QCA are reviewed, and their properties, such as universality and reversibility, are discussed.

Quantum cellular automata (QCA) are a quantization of classical cellular automata (CA): d-dimensional arrays of cells with a finite-dimensional state space and a local, spatially homogeneous, discrete-time update rule. For QCA each cell is a finite-dimensional quantum system and the update rule is unitary. CA as well as some versions of QCA have been shown to be computationally universal. Apart from a theoretical interest in a quantized version of CA, QCA are a natural framework for what is most likely going to be the first application of quantum computers – the simulation of quantum physical systems. In particular, QCA are capable of simulating quantum dynamical systems whose dynamics are uncomputable by classical means. QCA are now considered one of the standard models of quantum computation, next to quantum circuits and various types of measurement-based quantum computational models.1 Unlike their classical counterpart, an axiomatic, all-encompassing definition of (higher-dimensional) QCA is still missing.

1 For details on these and other aspects of quantum computation see the article by Kendon in this Encyclopedia.
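The single-qubit gates collected in the glossary above can be written out and their unitarity (UU† = 1) checked directly; a quick numpy sketch:

```python
import numpy as np

swap = np.array([[0, 1], [1, 0]])                     # exchanges |0> and |1>
hadamard = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
phase = lambda phi: np.array([[1, 0], [0, np.exp(1j * phi)]])
sigma_x = np.array([[0, 1], [1, 0]])
sigma_y = np.array([[0, -1j], [1j, 0]])
sigma_z = np.array([[1, 0], [0, -1]])

# Every gate U above satisfies U U^dagger = identity (unitarity).
for U in (swap, hadamard, phase(0.3), sigma_x, sigma_y, sigma_z):
    assert np.allclose(U @ U.conj().T, np.eye(2))

# The Hadamard gate maps |0> to the equal superposition (|0> + |1>)/sqrt(2).
print(hadamard @ np.array([1, 0]))
```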
Introduction

Automata theory is the study of abstract computing devices and the class of functions they can perform on their inputs. The original concept of cellular automata is most strongly associated with John von Neumann (1903–1957), a Hungarian mathematician who made major contributions to a vast range of fields including quantum mechanics, computer science, functional analysis and many others. According to Burks, an assistant of von Neumann [45], von Neumann had posed the fundamental question: "What kind of logical organization is sufficient for an automaton to reproduce itself?" It was Stanislaw Ulam who suggested using the framework of cellular automata to answer this question. In 1966 von Neumann presented a detailed analysis of the above question in his book Theory of Self-Reproducing Automata [45]. Thus, von Neumann initiated the field of cellular automata. He also made central contributions to the mathematical foundations of quantum mechanics, and, in a sense, von Neumann's quantum logic ideas were an early attempt at defining a computational model of physics. But he did not pursue this, and did not go in the directions that have led to modern ideas of quantum computing in general or quantum cellular automata in particular.

The idea of quantum computation is generally attributed to Feynman who, in his now famous lecture in 1981, proposed a computational scheme based on quantum mechanical laws [19]. A contemporary paper by Benioff contains the first proposal of a quantum Turing machine [6]. The general idea was to devise a computational device based on and exploiting quantum phenomena that would outperform any classical computational device. These first proposals were sequentially operating quantum mechanical machines imitating the logical operations of classical digital computation. The idea of parallelizing the operations was found in classical cellular automata. However, how to translate cellular automata into a quantum mechanical framework turned out not to be trivial, and to a certain extent how to do this in general remains an open question today.

The study of quantum cellular automata (QCA) started with the work of Grössing and Zeilinger, who coined the term QCA and provided a first definition [21]. Watrous developed a different model of QCA [46]. His work led to further studies by several groups [16,17,18]. Independently of this, Margolus developed a parallelizable quantum computational architecture building on Feynman's original ideas [26]. For various reasons to be discussed below, none of these early proposals turned out to be physical. The study of QCA gained new momentum with the work of Richter, Schumacher, and Werner [36,37] and others [3,4,34], who avoided the unphysical behavior allowed by the early proposals [4,37]. It is important to note that in spite of the over two-decade-long history of QCA there is no single agreed-upon definition of QCA, in particular of higher-dimensional QCA. Nevertheless, many useful properties have been shown for the various models. Most importantly, quite a few models were shown to be computationally universal, i.e. they can simulate any quantum Turing machine and any quantum circuit efficiently [16,34,35,38,46]. Very recently, their ability to generate and transport entanglement has been illustrated [14].

A comment is in order on a class of models which are often labeled as QCA but in fact are classical cellular automata implemented in quantum mechanical structures. They do not exploit quantum effects for the actual computation. To make this distinction clear they are now called quantum-dot QCA. These types of QCA will not be discussed here.

Cellular Automata

Definition (Cellular Automata) A cellular automaton (CA) is a 4-tuple (L, Σ, N, f) consisting of (1) a d-dimensional lattice of cells L, indexed by i ∈ Z^d, (2) a finite set of states Σ, (3) a finite neighborhood scheme N ⊆ Z^d, and (4) a local transition function f: Σ^N → Σ.

A CA is discrete in time and space. It is space and time homogeneous if at each time step the same transition function, or update rule, is applied simultaneously to all cells. The update rule is local if for a given lattice L and lattice site x, f(x) is localized in x + N = {x + n | x ∈ L, n ∈ N}, where N is the neighborhood scheme of the CA. In addition to the locality constraint, the local transition function f must generate a unique global transition function mapping a lattice configuration C_t ∈ Σ^L at time t to a new configuration C_{t+1} at time t + 1: F: Σ^L → Σ^L.
Most CA are defined on infinite lattices or, alternatively, on finite lattices with periodic boundary conditions. For finite CA only a finite number of cells is not in a quiescent state, i.e. a state that is not affected by the update. The most studied CA are the so-called elementary CA: 1-dimensional lattices with a set of two states and a neighborhood scheme of radius 1 (nearest-neighbor interaction), i.e. the state of a cell at point x at time t + 1 only depends on the states of cells x − 1, x, and x + 1 at time t. There are 256 such elementary CA, easily enumerated using a scheme invented by Wolfram [48]. As an example and for later reference, the update table of rule 110 is given
Quantum Cellular Automata, Table 1 Update table for CA rule '110' (the second row is the decimal number '110' in binary notation)

M_110 =  111 110 101 100 011 010 001 000
           0   1   1   0   1   1   1   0
in Table 1. CA with update rule '110' have been shown to be computationally universal, i.e. they can simulate any Turing machine in polynomial time [15].

A possible approach to constructing a QCA would be to simply "quantize" a CA by rendering the update rule unitary. There are two problems with this approach. One is that applying the same unitary to each cell does not yield a well-defined global transition function, nor necessarily a unitary one. The second problem is the synchronous update of all cells. "In practice", the synchronous update of, say, an elementary CA can be achieved by storing the current configuration in a temporary register, then updating all cells with odd index in the original CA, updating all cells with even index in the register, and finally splicing the updated cells together to obtain the original CA at the next time step. Quantum states, however, cannot be copied in general due to the so-called no-cloning theorem [49]. Thus, parallel update of a QCA in this way is not possible. Sequential update, on the other hand, leads either to an infinite number of time steps for each update or to inconsistencies at the boundaries. One solution is a partitioning scheme as it is used in the construction of reversible CA.

Reversible Cellular Automata

Definition (Reversible CA) A CA is said to be reversible if for every current configuration there is exactly one previous configuration. The global transition function F of a reversible CA is bijective.

In general, CA are not reversible. Only 16 out of the 256 elementary CA rules are reversible. However, one can construct a reversible CA using a partitioning scheme developed by Toffoli and Margolus for 2-dimensional CA [40]. Consider a 2-dimensional CA with nearest-neighborhood scheme N = {x ∈ Z² | |x_i| ≤ 1 for all i}.
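The synchronous update of an elementary CA such as rule 110 (Table 1) is straightforward classically; a minimal sketch with periodic boundary conditions:

```python
def step_elementary_ca(cells, rule=110):
    """One synchronous update of an elementary CA; bit k of `rule` gives the
    successor state for the neighborhood whose three bits encode the number k."""
    n = len(cells)
    return [
        (rule >> ((cells[(i - 1) % n] << 2) | (cells[i] << 1) | cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

# A single live cell under rule 110: the active region grows to the left.
row = [0] * 7 + [1] + [0] * 7
row = step_elementary_ca(row)
print(row)  # → [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```

Note that this classical implementation relies on reading the full current configuration while writing the next one, which is exactly the copy step that the no-cloning theorem forbids for quantum states.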
In the partitioning scheme introduced by Toffoli and Margolus each block of 2×2 cells is treated as a unit, such that the even translates of a block (blocks shifted by 2x with x ∈ Z²) and the odd translates (blocks shifted by 1 + 2x), respectively, form a partition of the lattice, see Fig. 1. The update rule of a partitioned CA takes as input an entire block of cells and outputs the updated state of the entire block. The rule is then applied alternatingly
Quantum Cellular Automata, Figure 1 Even (solid lines) and odd (dashed lines) blocks of a Margolus partitioning scheme in d = 2 dimensions using blocks of size 2×2. For each partition one block is shown shaded. Update rules are applied alternatingly to the solid and dashed partition
to the even and to the odd translates. The Margolus partitioning scheme is easily extended to d-dimensional lattices. A generalized Margolus scheme was introduced by Schumacher and Werner [37]. It allows for different cell sizes in the intermediate step. A partitioned CA is then a CA with a partitioning scheme such that the set of cells is partitioned in some periodic way: every cell belongs to exactly one block, and any two blocks are connected by a lattice translation. Such a CA is neither time homogeneous nor space homogeneous anymore, but periodic in time and space. As long as the rule for evolving each block is reversible, the entire automaton will be reversible.

Early Proposals
Grössing and Zeilinger were the first to coin the term and formalize a QCA [21]. In the Schrödinger picture of quantum mechanics the state of a system at some time t is described by a state vector |ψ_t⟩ in Hilbert space H. The state vector evolves unitarily,

  |ψ_{t+1}⟩ = U |ψ_t⟩ .   (1)
U is a unitary operator, i. e. U U† = 1, with the adjoint U† and the identity operator 1. If {|i⟩} is a computational basis of the Hilbert space H, any state |ψ⟩ ∈ H can be written as a superposition |ψ⟩ = Σ_i c_i |i⟩, with coefficients c_i ∈ C and Σ_i c_i* c_i = 1. The QCA constructed by Grössing and Zeilinger is an infinite 1-dimensional lattice where at time t lattice site i is assigned the complex amplitude c_i of state |ψ_t⟩. The update rule is given by a unitary operator U.
Definition (Grössing–Zeilinger QCA) A Grössing–Zeilinger QCA is a 3-tuple (L, H, U) which consists of (1) an infinite 1-dimensional lattice L ⊆ Z representing basis states of (2) a Hilbert space H with basis set {|i⟩}, and (3) a band-diagonal unitary operator U.
Band-diagonality of U corresponds to a locality condition. It turns out that there is no Grössing–Zeilinger QCA with nearest-neighbor interaction and nontrivial dynamics. In fact, later on, Meyer showed more generally that “in one dimension there exists no nontrivial homogeneous, local, scalar QCA. More explicitly, every band r-diagonal unitary matrix U which commutes with the one-step translation matrix T is also a translation matrix T^k for some k ∈ Z, times a phase” [27]. Grössing and Zeilinger also introduced QCA where the unitarity constraint is relaxed to only approximate unitarity. After each update the configuration can be normalized, which effectively causes non-local interactions. The properties of Grössing–Zeilinger QCA were studied by Grössing and co-workers in some more detail in the following years, see [20] and references therein. This pioneering definition of QCA, however, was not studied much further, mostly because the “non-local” behavior renders the Grössing–Zeilinger definition non-physical. In addition, it has little in common with the concepts developed in quantum computation later on. The Grössing–Zeilinger definition really concerns what one would call today a quantum random walk (for further details see the review by Kempe [23]).
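Meyer's no-go statement quoted above can be checked numerically on a small ring (our own sketch, assuming numpy): a translation-invariant nearest-neighbor operator is the band-1 circulant a·T⁻¹ + b·1 + c·T, and it is unitary only when it reduces to a phase times a power of the translation T:

```python
import numpy as np

def circulant_band1(a, b, c, n=8):
    """Translation-invariant nearest-neighbor operator on a ring of n cells:
    U = a*T^(-1) + b*1 + c*T, with T the one-step translation matrix."""
    T = np.roll(np.eye(n), 1, axis=0)
    return a * T.conj().T + b * np.eye(n) + c * T

def is_unitary(U):
    return np.allclose(U @ U.conj().T, np.eye(len(U)))

# a generic normalized band-1 circulant is not unitary ...
print(is_unitary(circulant_band1(1, 1, 1) / np.sqrt(3)))   # → False
# ... but a phase times a pure translation T^k is, as the theorem asserts
print(is_unitary(circulant_band1(0, 0, np.exp(0.3j))))     # → True
```

The unitarity conditions force all but one of the coefficients a, b, c to vanish, which is exactly the "translation times a phase" conclusion of Meyer's theorem for radius r = 1.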
The first model of QCA researched in depth was that introduced by Watrous [46], whose ideas were further explored by van Dam [16], Dürr, Lê Thanh, and Santha [17,18], and Arrighi [2]. A Watrous-QCA is defined over an infinite 1-dimensional lattice with a finite set of cell states including a quiescent state. The transition function maps a neighborhood of cells to a single quantum state instantaneously and simultaneously.
Definition (Watrous-QCA) A Watrous-QCA is a 4-tuple (L, Σ, N, f) which consists of (1) a 1-dimensional lattice L ⊆ Z, (2) a finite set of cell states Σ including a quiescent state ε, (3) a finite neighborhood scheme N, and (4) a local transition function f: Σ^N → H_Σ.
Here, H_Σ denotes the Hilbert space spanned by the cell states Σ. This model can be viewed as a direct quantization of a CA where the set of possible configurations of the CA is extended to include all linear superpositions of the classical cell configurations, and the local transition function now maps the cell configurations of a given neighborhood to a quantum state. One cell is labeled the “accept” cell. The quiescent state is used to allow only a finite number of cells to be active and renders the lattice effectively finite. This is crucial to avoid an infinite product of unitaries and, thus, to obtain a well-defined QCA. The Watrous-QCA, however, allows for non-physical dynamics. It is possible to define transition functions that do not represent unitary evolution of the configuration, either by producing superpositions of configurations which do not preserve the norm, or by inducing a global transition function which is not unitary. This leads to non-physical properties such as super-luminal signaling [37]. The set of Watrous-QCA is also not closed under composition and inverse [37]. Watrous defined a restricted class of QCA by introducing a partitioning scheme.
Definition (Partitioned Watrous QCA) A partitioned Watrous QCA is a Watrous QCA with Σ = Σ_l × Σ_c × Σ_r for finite sets Σ_l, Σ_c, and Σ_r, and a unitary matrix Λ of size |Σ| × |Σ|. For any states s₁, s₂, s₃, s ∈ Σ, with s = (s_l, s_c, s_r), the transition function f is defined as

  f(s₁, s₂, s₃, s) = Λ_{(s₃^l, s₂^c, s₁^r), s} ,   (2)
with matrix elements Λ_{s_i, s_j}. In a partitioned Watrous QCA each cell is divided into three sub-cells – left, center, and right. The neighborhood scheme is then a nearest-neighbor interaction confined to each cell. The transition function consists of a unitary acting on each partitioned cell and swap operations among sub-cells of neighboring cells. Figure 2 illustrates the swap operation between neighboring cells. For the class of partitioned Watrous QCA, Watrous provided the first proof of computational universality of a QCA by showing that any quantum Turing machine can be efficiently simulated by a partitioned Watrous-QCA with constant slowdown and that any partitioned Watrous-QCA can be simulated by a quantum Turing machine with linear slowdown.
Theorem ([46]) Given any quantum Turing machine M_TM, there exists a partitioned Watrous QCA M_CA which simulates M_TM with constant slowdown.
Quantum Cellular Automata, Figure 2 Each cell is divided into three sub-cells labeled l, c, and r for left, center, and right, respectively. The update rule consists of swapping left and right sub-cells of neighboring cells and then updating each cell internally using a unitary operation acting on the left, center, and right part of each cell
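The sub-cell data movement of Fig. 2 can be sketched on basis configurations (our own illustration, using one common convention consistent with Eq. (2): left sub-cells move left, right sub-cells move right). A real partitioned Watrous QCA would follow the swap with the internal unitary on each cell, for which the placeholder per-cell function u stands in:

```python
def partitioned_step(cells, u):
    """One step of a partitioned (Watrous-style) QCA on basis configurations.

    cells: list of (l, c, r) sub-cell tuples on a ring.
    u: map applied to each cell after the sub-cell swap (stands in for the
       internal unitary; here just a function on classical triples).
    Left sub-cells move left, right sub-cells move right, centers stay."""
    n = len(cells)
    swapped = [(cells[(j + 1) % n][0],   # left sub-cell from right neighbor
                cells[j][1],             # center sub-cell stays in place
                cells[(j - 1) % n][2])   # right sub-cell from left neighbor
               for j in range(n)]
    return [u(s) for s in swapped]

# with the identity as internal update, a mark in the right sub-cell
# moves one cell to the right per step
cfg = [(0, 0, 1), (0, 0, 0), (0, 0, 0)]
print(partitioned_step(cfg, lambda s: s))  # → [(0, 0, 0), (0, 0, 1), (0, 0, 0)]
```

The swap alone is a permutation of sub-cells, hence reversible; composing it with a per-cell unitary keeps the global step unitary, which is the point of the partitioning.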
Theorem ([46]) Given any partitioned Watrous QCA M_CA, there exists a quantum Turing machine M_TM which simulates M_CA with linear slowdown.
Watrous’ model was further developed by van Dam [16], who defined a QCA as an assignment of a product vector to every basis state in the computational basis. Here the quiescent state is eliminated and thus the QCA is made explicitly finite. Van Dam showed that the finite version is also computationally universal. Efficient algorithms to decide whether a given 1-dimensional QCA is unitary were presented by Dürr, Lê Thanh, and Santha [17,18]. Due to substantial shortcomings such as non-physical behavior, these early proposals were replaced by a second wave of proposals to be discussed below. Today, there is no generally accepted QCA model that has all the attributes of the CA model: unique definition, simple to describe, and computationally powerful. In particular, there is no axiomatic definition, contrary to its classical counterpart, that yields an immediate way of constructing/enumerating all of the instances of this model. Rather, each set of authors defines QCA in their own particular fashion. Common to most models are the following features: the states s ∈ Σ are basis states spanning a finite-dimensional Hilbert space; at each point in time a cell represents a finite-dimensional quantum system in a superposition of basis states; and the unitary operators represent a discrete-time evolution of strictly finite propagation speed.

Models of QCA

Reversible QCA
Schumacher and Werner used the Heisenberg picture rather than the Schrödinger picture in their model [37]. Thus, instead of associating a d-level quantum system with each cell they associated an observable algebra with each cell. Taking a quasi-local algebra as the tensor product of
observable algebras over a finite subset of cells, a QCA is then a homomorphism of the quasi-local algebra which commutes with lattice translations and satisfies a locality condition on the neighborhood. The observable-based approach was first used in Ref. [36] with focus on the irreversible case. However, this definition left questions open, such as whether the composition of two QCA will again form a QCA. The following definition does avoid this uncertainty. Consider an infinite d-dimensional lattice L ⊆ Z^d of cells x ∈ Z^d, where each cell is associated with an observable algebra A_x, and each of these algebras is an isomorphic copy of the algebra of complex d×d matrices. For a finite subset of cells Λ ⊆ Z^d, denote by A(Λ) the algebra of observables belonging to all cells in Λ, i. e. the tensor product ⊗_{x∈Λ} A_x. The completion of this algebra is called a quasi-local algebra and will be denoted by A(Z^d).
Definition (Reversible QCA) A quantum cellular automaton with neighborhood scheme N ⊆ Z^d is a homomorphism T: A(Z^d) → A(Z^d) of the quasi-local algebra, which commutes with lattice translations, and satisfies the locality condition T(A(Λ)) ⊆ A(Λ + N) for every finite set Λ ⊆ Z^d. The local transition rule of a cellular automaton is the homomorphism T₀: A₀ → A(N).
Schumacher and Werner presented and proved the following theorem on one-dimensional QCA.
Theorem (Structure Theorem [37]) Let T be the global transition homomorphism of a one-dimensional nearest-neighbor QCA on the lattice Z with single-cell algebra A₀ = M_d. Then T can be represented in the generalized Margolus partitioning scheme, i. e. T restricts to an isomorphism

  T : A(Λ) → ⊗_{q∈Q} B_q ,   (3)
where for each quadrant vector q ∈ Q, the subalgebra B_q ⊆ A(Λ + q) is a full matrix algebra, B_q ≅ M_{n(q)}. These algebras and the matrix dimensions n(q) are uniquely determined by T.
The Structure Theorem does not hold in higher dimensions [47]. A central result obtained in this framework is that almost any [47] 1-dimensional QCA can be represented using a set of local unitary operators and a generalized Margolus partitioning [37], as illustrated in Fig. 3. Furthermore, if the local implementation allows local ancillas, then any QCA, in any lattice dimension, can be built from local unitaries [37,47]. In addition, they proved the following Corollary.
Quantum Cellular Automata, Figure 3 Generalized Margolus partitioning scheme in 1 dimension using two unitary operations U and V

Corollary ([37]) The inverse of a nearest-neighbor QCA exists, and is a nearest-neighbor QCA.
The latter result is not true for CA. A similar result for finite configurations was obtained in [4]. Here evidence is presented that the result does not hold for two-dimensional QCA. The work by Schumacher and Werner can be considered the first general definition for 1-dimensional QCA. A similar result for many-dimensional QCA does not exist.

Local Unitary QCA
Pérez-Delgado and Cheung proposed a local unitary QCA [34].
Definition (Local-unitary QCA) A local-unitary QCA is a 5-tuple (L, Σ, N, U₀, V₀) consisting of (1) a d-dimensional lattice of cells indexed by integer tuples L ⊆ Z^d, (2) a finite set of orthogonal basis states Σ, (3) a finite neighborhood scheme N ⊆ Z^d, (4) a local read function U₀: (H_Σ)^⊗N → (H_Σ)^⊗N, and (5) a local update function V₀: H_Σ → H_Σ.
The read operation carries the further restriction that any two lattice translations U_x and U_y must commute for all x, y ∈ L. The product VU is a valid local, unitary quantum operation. The resulting global update rule is well defined and space homogeneous. The set of states includes a quiescent state as well as an “ancillary” set of states/subspace which can store the result of the “read” operation. The initial state of a local-unitary QCA consists of identical k^d blocks of cells initialized in the same state. Local-unitary QCA are universal in the sense that for any arbitrary quantum circuit there is a local-unitary QCA which can simulate it. In addition, any local-unitary QCA can be simulated efficiently using a family of quantum circuits [34]. Adding an additional memory register to each cell allows this class of QCA to model any reversible QCA of the Schumacher/Werner type discussed above.

Block-Partitioned and Nonunitary QCA
Brennen and Williams introduced a model of QCA which allows for unitary and nonunitary rules [14].
Definition (Block-partitioned QCA) A Block-partitioned QCA is a 4-tuple {L, Σ, N, M} consisting of (1) a 1-dimensional lattice of n cells indexed L = 0, …, n − 1, (2) a 2-dimensional state space Σ, (3) a neighborhood scheme N, and (4) an update rule M applied over N.
Given a system with nearest-neighbor interactions, the simplest unitary QCA rule has radius r = 1, describing a unitary operator applied over a three-cell neighborhood j − 1, j, j + 1:

  M(u₀₀, u₀₁, u₁₀, u₁₁) = |00⟩⟨00| ⊗ u₀₀ + |01⟩⟨01| ⊗ u₀₁ + |10⟩⟨10| ⊗ u₁₀ + |11⟩⟨11| ⊗ u₁₁ ,   (4)
where |ab⟩⟨ab| ⊗ u_ab means: update the qubit at site j with the unitary u_ab if the qubit at site j − 1 is in state |a⟩ and the qubit at site j + 1 is in state |b⟩. M commutes with its own 2-site translation. Thus, a partitioning is introduced by updating simultaneously all even qubits with rule M before updating all odd qubits with rule M. Periodic boundaries are assumed. However, by addressability of the end qubits, simulation of a block-partitioned QCA by a QCA with boundaries can be achieved. Nonunitary update rules correspond to completely positive maps on the quantum states, where the neighboring states act as the environment. Take a nearest-neighbor 1-dimensional block-partitioned QCA. In the density operator formalism each quantum system is given by the probability distribution ρ = Σ_i p_i |ψ_i⟩⟨ψ_i| over outer products of quantum states |ψ_i⟩. A completely positive map S(ρ) applied to state ρ is represented by a set of Kraus operators F_μ, whose products F_μ† F_μ are positive operators summing to the identity, Σ_μ F_μ† F_μ = 1. The map S_j^{ab}(ρ), acting on cell j conditioned on state a of the left neighbor and state b of the right neighbor, can then be written as

  S_j^{ab}(ρ) = Σ_μ (|ab⟩⟨ab| ⊗ F_μ^{ab}) ρ (|ab⟩⟨ab| ⊗ F_μ^{ab†}) .   (5)

As an example, the CA rule ‘110’ can now be translated into an update rule for cell j in a block-partitioned nonunitary QCA:

  F₁^j = |00⟩⟨00| ⊗ 1_j + |10⟩⟨10| ⊗ 1_j + |11⟩⟨11| ⊗ σ_x + |01⟩⟨01| ⊗ |1⟩_j⟨1| ,   (6)
  F₂^j = |01⟩⟨01| ⊗ |1⟩_j⟨0| ,   (7)

where σ_x is the Pauli operator.
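A small numpy sketch (our own, with our own helper names) of the conditioned map: for each neighbor pattern ab the Kraus operators acting on cell j are read off from Eqs. (6) and (7); the map is trace preserving and reproduces the rule-110 table on classical basis states:

```python
import numpy as np

I2 = np.eye(2)
X  = np.array([[0.0, 1.0], [1.0, 0.0]])    # Pauli sigma_x
P1 = np.array([[0.0, 0.0], [0.0, 1.0]])    # |1><1|
R  = np.array([[0.0, 0.0], [1.0, 0.0]])    # |1><0|

# Kraus operators on cell j conditioned on (left, right) neighbor states:
# identity for 00 and 10, a bit flip for 11, and the genuinely nonunitary
# "set to 1" pair {|1><1|, |1><0|} for 01, as in Eqs. (6)-(7)
KRAUS = {(0, 0): [I2], (1, 0): [I2], (1, 1): [X], (0, 1): [P1, R]}

def apply_rule110(rho, a, b):
    """S^{ab}(rho) = sum_mu F_mu rho F_mu^dagger for the rule-110 QCA."""
    return sum(F @ rho @ F.conj().T for F in KRAUS[(a, b)])

# trace preservation: sum_mu F_mu^dag F_mu = 1 for every neighborhood
for ops in KRAUS.values():
    assert np.allclose(sum(F.conj().T @ F for F in ops), I2)
```

On diagonal (classical) density matrices this recovers Table 1, e.g. neighborhood 001 maps the center qubit |0⟩⟨0| to |1⟩⟨1|, while 111 flips |1⟩⟨1| back to |0⟩⟨0|.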
The implementation of such a block-partitioned nonunitary QCA is proposed in the form of a lattice of even order constructed from an alternating array of two distinguishable species ABABABAB … that are globally addressable and interact via the Ising interaction. Update rules that generate and distribute entanglement were studied in this framework [14].

Continuous-Time QCA
Vollbrecht and Cirac initiated the study of continuous-time QCA [44]. They showed that computing the ground state energy of a translationally invariant n-neighbor Hamiltonian is QMA-hard. Their QCA model was taken up by Nagaj and Wocjan [31], who used the term Hamiltonian QCA.
Definition (Hamiltonian QCA) A Hamiltonian QCA is a tuple {L, Σ = Σ_p × Σ_d} consisting of (1) a 1-dimensional lattice of length L, and (2) a finite set of orthogonal basis states Σ = Σ_p × Σ_d containing (2a) a data register Σ_d and (2b) a program register Σ_p.
The initial state encodes both the program and the data, stored in separate subspaces of the state space:

  |Φ⟩ = ⊗_{j=1}^{L} |p_j⟩ ⊗ |d_j⟩ .   (8)
The computation is carried out autonomously. Nagaj and Wocjan showed that, if the system is left alone for a period of time t = O(L log L), polynomial in the length of the chain, the result of the computation is obtained with probability p ≥ 5/6 − O(1/log L). Hamiltonian QCA are computationally universal; more precisely, they are in the complexity class BQP. Two constructions for Hamiltonian QCA are given in [31]: one uses a 10-dimensional state space, and the resulting system can be thought of as the diffusion of a system of free fermions; the second uses a 20-dimensional state space and can be thought of as a quantum walk on a line.

Examples of QCA
Raussendorf gave an explicit construction of a QCA and proved its computational universality [35]. The QCA lives on a torus with a 2×2 Margolus partitioning. The update rule is given by a single 4-qubit unitary acting on 2×2 blocks of qubits. The four-qubit unitary operation consists of swap operations, the Hadamard transformation, and a phase gate. The initial state of the QCA is prepared such that columns encode alternatingly data and program. When the QCA is running, the data travel in one direction while the program (encoding classical information in orthogonal states) travels in the opposite direction. Where the two cross, the computation is carried out through nearest-neighbor interaction. After a fixed number of steps the computation is done and the result can be read out of a dedicated “data” column. This QCA is computationally universal; more precisely, it is within a constant as efficient as a quantum logic network with local and nearest-neighbor gates. Shepherd, Franz, and Werner compared classically controlled QCA to autonomous QCA [38]. The former is controlled by a classical compiler that selects a sequence of operations acting on the QCA at each time step. The latter operates autonomously, performing the same unitary operation at each time step. The only steps that are classically controlled are the measurement (and initialization). They show the computational equivalence of the two models. Their result implies that a particular quantum simulator may be as powerful as a general one.

Computationally Universal QCA
Quite a few models have been shown to be computationally universal, i. e. they can simulate any quantum Turing machine and any quantum circuit efficiently. A Watrous-QCA simulates any quantum Turing machine with constant slowdown [46]. The QCA defined by van Dam is a finite version of a Watrous-QCA and is computationally universal as well [16]. Local-unitary QCA can simulate any quantum circuit and thus are computationally universal [34]. Block-partitioned QCA can simulate a quantum computer with linear overhead in time and space [14]. Continuous-time QCA are in complexity class BQP and thus computationally universal [44]. The explicit construction of a 2-dimensional QCA by Raussendorf is computationally universal; more precisely, it is within a constant as efficient as a quantum logic network with local and nearest-neighbor gates [35].
Shepherd, Franz, and Werner provided an explicit construction of a 12-state 1-dimensional QCA which is in complexity class BQP. It is universally programmable in the sense that it simulates any quantum-gate circuit with polynomial overhead [38]. Arrighi and Fargetton proposed a 1-dimensional QCA capable of simulating any other 1-dimensional QCA with linear overhead [3]. Implementations of computationally universal QCA have been suggested by Lloyd [24] and Benjamin [8].

Modeling Physical Systems
One of the goals in developing QCA is to create a useful modeling tool for physical systems. Physical systems that
can be simulated with QCA include Ising and Heisenberg interaction spin chains, solid state NMR, and quantum lattice gases. Spin chains are perhaps the most obvious systems to model with QCA. The simplest cases of such 1-dimensional lattices of spins are Hamiltonians which commute with their own lattice translations. Vollbrecht and Cirac showed that computing the ground state energy of a translationally invariant n-neighbor Hamiltonian is in complexity class QMA [44]. For simulating noncommuting Hamiltonians a block-wise update such as the Margolus partitioning has to be used (see Sect. “Reversible Cellular Automata”). Here the fact is used that any Hamiltonian can be expressed as the sum of two Hamiltonians, H = H_a + H_b. H_a and H_b can then, to a good approximation, be applied sequentially to yield the original Hamiltonian H, even if the two do not commute. It has been shown that such 1-dimensional spin chains can be simulated efficiently on a classical computer [43]. It is not known, however, whether higher-dimensional spin systems can be simulated efficiently classically.

Quantum Lattice Gas Automata
Any numerical evolution of a discretized partial differential equation can be interpreted as the evolution of some CA, using the framework of lattice gas automata. In the continuous time and space limit such a CA mimics the behavior of the partial differential equation. In quantum mechanical lattice gas automata (QLGA) the continuous limit of a set of so-called quantum lattice Boltzmann equations recovers the Schrödinger equation [39]. The first formulation of a linear unitary CA was given in Ref. [10]. Meyer coined the term quantum lattice gas automata (QLGA) and demonstrated the equivalence of a QLGA and the evolution of a set of quantum lattice Boltzmann equations [27,28]. Meyer [29], Boghosian and Taylor [12], and Love and Boghosian [25] explored the idea of using QLGA as a model for simulating physical systems.
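The sequential application of H_a and H_b is the usual Trotter splitting, exp(−i(H_a + H_b)t) ≈ (exp(−iH_a t/n) exp(−iH_b t/n))^n. A toy numpy check with two noncommuting single-qubit terms (our own example, not the spin-chain Hamiltonians of the text):

```python
import numpy as np

def U(H, t):
    """exp(-i H t) for Hermitian H, via the spectral decomposition."""
    w, V = np.linalg.eigh(H)
    return V @ np.diag(np.exp(-1j * w * t)) @ V.conj().T

# two noncommuting terms standing in for the even/odd partition
# Hamiltonians H_a and H_b
Ha = np.array([[1, 0], [0, -1]], dtype=complex)   # sigma_z
Hb = np.array([[0, 1], [1, 0]], dtype=complex)    # sigma_x

t, n = 1.0, 1000
exact = U(Ha + Hb, t)
trotter = np.linalg.matrix_power(U(Ha, t / n) @ U(Hb, t / n), n)
print(np.max(np.abs(exact - trotter)))   # error shrinks as O(t^2 / n)
```

Each factor U(Ha, t/n) and U(Hb, t/n) is a strictly local (block) unitary, which is why the Margolus-style alternation of even and odd blocks can approximate the global noncommuting evolution.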
Algorithms for implementing QLGA on a quantum computer have been presented in [13,30,32].

Implementations
A large effort is being made in many laboratories around the world to implement a model of a quantum computer. So far all of them are confined to a small, finite number of elements and are nowhere near a quantum Turing machine (which in itself is a purely theoretical construct but can be approximated by a very large number of computational elements). One existing experimental set-up that is very promising for quantum information processing and does not suffer from this “finiteness” is the optical lattice (for a review, see [11]). Optical lattices possess a translation symmetry which makes QCA a very suitable framework in which to study their computational power. They are artificial crystals of light consisting of hundreds of thousands of microtraps. One or more neutral atoms can be trapped in each of the potential minima. If the potential minima are deep enough, any tunneling between the traps is suppressed and each site contains the same number of atoms. A quantum register – here in the form of a so-called Mott insulator – has been created. The biggest challenge at the moment is to find a way to address the registers individually in order to implement quantum gates. For a QCA, all that is needed is to implement the unitary operation(s) acting on the entire lattice simultaneously. The internal structure of the QCA guarantees the locality of the operations. This is a huge simplification compared to individual manipulation of the registers. Optical lattices are created routinely by superimposing two or three orthogonal standing waves generated from laser beams of a certain frequency. They are used to study Fermionic and Bosonic quantum gases, nonlinear quantum dynamics, and strongly correlated quantum phases, to name a few. A type of locally addressed architecture by global control was put forward by Lloyd [24]. In this scheme a 1-dimensional array is built out of three atomic species, periodically arranged as ABCABCABC. Each species encodes a qubit and can be manipulated without affecting the other species. The operations on any species can be controlled by the states of the neighboring cells. The end cells are used for readout, since they are the only individually addressable components. Lloyd showed that such a quantum architecture is universal. Benjamin investigated the minimum physical requirements for such a many-species implementation and found a similar architecture using only two types of species, again arranged periodically ABABAB [7,8,9].
By giving explicit sequences of operations implementing one-qubit and two-qubit (CNOT) operations, Benjamin showed computational universality. But the reduction in spin resources comes at a cost: each logical qubit is encoded into four spin sites, with a buffer space of at least four empty spin sites between logical qubits. A continuation of this multi-species QCA architecture is found in the work of Twamley [42]. Twamley constructed a proposal for a QCA architecture based on Fullerene (C₆₀) molecules doped with the atomic species ¹⁵N and ³¹P, respectively, arranged alternatingly in a one-dimensional array. Instead of electron spins, which would be too sensitive to stray electric charges, the quantum information is encoded in the nuclear spins. Twamley constructed sequences of pulses implementing Benjamin’s scheme for one- and two-qubit operations. The weakest
point of the proposal is the readout operation, which is not well defined. A different scheme for implementing QCA was suggested by Tóth and Lent [41]. Their scheme is based on the technique of quantum-dot CA. The term quantum-dot CA is usually used for CA implementations in quantum dots (for classical computation). The authors, therefore, called their model a coherent quantum-dot CA. They illustrated the usage of an array of N quantum dots as an N-qubit quantum register. However, the set-up and the allowed operations allow for individual control of each cell. This coherent quantum-dot CA is more a hybrid of a quantum circuit with individual qubit control and a QCA with constant nearest-neighbor interaction. The main property of a QCA, operating under global control only, is not taken advantage of.

Future Directions
The field of QCA is developing rapidly. New definitions have appeared very recently. Since QCA are now considered to be one of the standard measurement-based models of quantum computation, further work on a consistent and sufficient definition of higher-dimensional QCA is to be expected. One proposal for such a “final” definition has been put forward in [4,5]. In the search for robust and easily implementable quantum computational architectures, QCA are of considerable interest. The main strength of QCA is global control without the need to address cells individually (with the possible exception of the read-out operation). It has become clear that the global update of a QCA would be a way around practical issues related to the implementation of quantum registers and the difficulty of their individual manipulation. More concretely, QCA provide a natural framework for describing the quantum dynamical evolution of optical lattices, a field in which the experimental physics community has made huge progress in the last decade. The main focus so far has been on reversible QCA.
Irreversible QCA are closely related to measurement-based computation and remain to be explored further.

Bibliography

Primary Literature
1. Aoun B, Tarifi M (2004) Introduction to quantum cellular automata. http://arxiv.org/abs/quant-ph/0401123
2. Arrighi P (2006) Algebraic characterizations of unitary linear quantum cellular automata. In: Mathematical Foundations of Computer Science 2006. Lecture Notes in Computer Science, vol 4162. Springer, Berlin, pp 122–133
3. Arrighi P, Fargetton R (2007) Intrinsically universal one-dimensional quantum cellular automata. http://arxiv.org/abs/0704.3961
4. Arrighi P, Nesme V, Werner R (2007) One-dimensional quantum cellular automata over finite, unbounded configurations. http://arxiv.org/abs/0711.3517
5. Arrighi P, Nesme V, Werner R (2007) N-dimensional quantum cellular automata. http://arxiv.org/abs/0711.3975
6. Benioff P (1980) The computer as a physical system: A microscopic quantum mechanical hamiltonian model of computers as represented by turing machines. J Stat Phys 22:563–591
7. Benjamin SC (2000) Schemes for parallel quantum computation without local control of qubits. Phys Rev A 61:020301
8. Benjamin SC (2001) Quantum computing without local control of qubit-qubit interactions. Phys Rev Lett 88(1):017904
9. Benjamin SC, Bose S (2004) Quantum computing in arrays coupled by “always-on” interactions. Phys Rev A 70:032314
10. Bialynicki-Birula I (1994) Weyl, Dirac, and Maxwell equations on a lattice as unitary cellular automata. Phys Rev D 49:6920
11. Bloch I (2005) Ultracold quantum gases in optical lattices. Nature Phys 1:23–30
12. Boghosian BM, Taylor W (1998) Quantum lattice-gas model for the many-particle Schrödinger equation in d dimensions. Phys Rev E 57:54
13. Boghosian BM, Taylor W (1998) Simulating quantum mechanics on a quantum computer. Physica D: Nonlinear Phenomena 120:30–42
14. Brennen GK, Williams JE (2003) Entanglement dynamics in one-dimensional quantum cellular automata. Phys Rev A 68:042311
15. Cook M (2004) Universality in elementary cellular automata. Complex Syst 15:1
16. van Dam W (1996) Quantum cellular automata. Master’s thesis, University of Nijmegen
17. Dürr C, Santha M (2002) A decision procedure for unitary linear quantum cellular automata. SIAM J Comput 31:1076–1089
18. Dürr C, Lê Thanh H, Santha M (1997) A decision procedure for well-formed linear quantum cellular automata. Random Struct Algorithms 11:381–394
19. Feynman R (1982) Simulating physics with computers. Int J Theor Phys 21:467–488
20. Fussy S, Grössing G, Schwabl H, Scrinzi A (1993) Nonlocal computation in quantum cellular automata. Phys Rev A 48:3470
21. Grössing G, Zeilinger A (1988) Quantum cellular automata. Complex Syst 2:197–208
22. Gruska J (1999) Quantum Computing. Osborne/McGraw-Hill. QCA are treated in Section 4.3
23. Kempe J (2003) Quantum random walks: an introductory overview. Contemp Phys 44:307
24. Lloyd S (1993) A potentially realizable quantum computer. Science 261:1569–1571
25. Love P, Boghosian B (2005) From Dirac to diffusion: Decoherence in quantum lattice gases. Quantum Inf Process 4:335–354
26. Margolus N (1991) Parallel quantum computation. In: Zurek WH (ed) Complexity, Entropy, and the Physics of Information, Santa Fe Institute Series. Addison Wesley, Redwood City, pp 273–288
27. Meyer DA (1996) From quantum cellular automata to quantum lattice gases. J Stat Phys 85:551–574 28. Meyer DA (1996) On the absence of homogeneous scalar unitary cellular automata. Phys Lett A 223:337–340 29. Meyer DA (1997) Quantum mechanics of lattice gas automata: One-particle plane waves and potentials. Phys Rev E 55:5261 30. Meyer DA (2002) Quantum computing classical physics. Philos Trans Royal Soc A 360:395–405 31. Nagaj D, Wocjan P (2008) Hamiltonian quantum cellular automata in 1d. 0802.0886. http://arxiv.org/abs/0802.0886 32. Ortiz G, Gubernatis JE, Knill E, Laflamme R (2001) Quantum algorithms for fermionic simulations. Phys Rev A 64:022319 33. Perez-Delgado CA, Cheung D (2005) Models of quantum cellular automata. http://arxiv.org/abs/quant-ph/0508164 34. Perez-Delgado CA, Cheung D (2007) Local unitary quantum cellular automata. Phys Rev A (Atomic, Molecular, and Optical Physics) 76:032320–15 35. Raussendorf R (2005) Quantum cellular automaton for universal quantum computation. Phys Rev A (Atomic, Molecular, and Optical Physics) 72:022301–4 36. Richter W (1996) Ergodicity of quantum cellular automata. J Stat Phys 82:963–998 37. Schumacher B, Werner RF (2004) Reversible quantum cellular automata. quant-ph/0405174. http://arxiv.org/abs/quant-ph/ 0405174 38. Shepherd DJ, Franz T, Werner RF (2006) Universally programmable quantum cellular automaton. Phys Rev Lett 97:020502–4
39. Succi S, Benzi R (1993) Lattice Boltzmann equation for quantum mechanics. Physica D: Nonlinear Phenomena 69:327–332
40. Toffoli T, Margolus NH (1990) Invertible cellular automata: A review. Physica D: Nonlinear Phenomena 45:229–253
41. Tóth G, Lent CS (2001) Quantum computing with quantum-dot cellular automata. Phys Rev A 63:052315
42. Twamley J (2003) Quantum-cellular-automata quantum computing with endohedral fullerenes. Phys Rev A 67:052318
43. Vidal G (2004) Efficient simulation of one-dimensional quantum many-body systems. Phys Rev Lett 93(4):040502
44. Vollbrecht KGH, Cirac JI (2008) Quantum simulators, continuous-time automata, and translationally invariant systems. Phys Rev Lett 100:010501
45. von Neumann J (1966) Theory of Self-Reproducing Automata. University of Illinois Press, Champaign
46. Watrous J (1995) On one-dimensional quantum cellular automata. In: Proceedings of the 36th Annual Symposium on Foundations of Computer Science, Milwaukee, pp 528–537
47. Werner R. Private communication
48. Wolfram S (1983) Statistical mechanics of cellular automata. Rev Mod Phys 55:601
49. Wootters WK, Zurek WH (1982) A single quantum cannot be cloned. Nature 299:802–803
Books and Reviews
Summaries of the topic of QCA can be found in chapter 4.3 of Gruska [22], and in Refs. [1,32].
Quantum Computational Complexity
JOHN WATROUS
Institute for Quantum Computing and School of Computer Science, University of Waterloo, Waterloo, Canada

Article Outline
Glossary
Definition of the Subject
Introduction
The Quantum Circuit Model
Polynomial-Time Quantum Computations
Quantum Proofs
Quantum Interactive Proof Systems
Other Selected Notions in Quantum Complexity
Future Directions
Bibliography

Glossary
Quantum circuit A quantum circuit is an acyclic network of quantum gates connected by wires: the gates represent quantum operations and the wires represent the qubits on which these operations are performed. The quantum circuit model is the most commonly studied model of quantum computation.
Quantum complexity class A quantum complexity class is a collection of computational problems that are solvable by a chosen quantum computational model that obeys certain resource constraints. For example, BQP is the quantum complexity class of all decision problems that can be solved in polynomial time by a quantum computer.
Quantum proof A quantum proof is a quantum state that plays the role of a witness or certificate to a quantum computer that runs a verification procedure. The quantum complexity class QMA is defined by this notion: it includes all decision problems whose yes-instances are efficiently verifiable by means of quantum proofs.
Quantum interactive proof system A quantum interactive proof system is an interaction between a verifier and one or more provers, involving the processing and exchange of quantum information, whereby the provers attempt to convince the verifier of the answer to some computational problem.
Definition of the Subject

The inherent difficulty, or hardness, of computational problems is a fundamental concept in computational complexity theory. Hardness is typically formalized in terms of the resources required by different models of computation to solve problems, such as the number of steps of a deterministic Turing machine. A variety of models and resources are often considered, including deterministic, nondeterministic and probabilistic models; time and space constraints; and interactions among models of differing abilities. Many interesting relationships among these different models and resource constraints are known. One common feature of the most commonly studied computational models and resource constraints is that they are physically motivated. This is quite natural, given that computers are physical devices, and to a significant extent it is their study that motivates and directs research on computational complexity. The predominant example is the class of polynomial-time computable functions, which ultimately derives its relevance from physical considerations; for it is a mathematical abstraction of the class of functions that can be efficiently computed without error by physical computing devices. In light of its close connection to the physical world, it seems only natural that modern physical theories should be considered in the context of computational complexity. In particular, quantum mechanics is a clear candidate for a physical theory to have the potential for implications, if not to computational complexity then at least to computation more generally. Given the steady decrease in the size of computing components, it is inevitable that quantum mechanics will become increasingly relevant to the construction of computers—for quantum mechanics provides a remarkably accurate description of extremely small physical systems (on the scale of atoms) where classical physical theories have failed completely.
Indeed, an extrapolation of Moore’s Law predicts sub-atomic computing components within the next two decades [74,78]; a possibility inconsistent with quantum mechanics as it is currently understood. That quantum mechanics should have implications to computational complexity theory, however, is much less clear. It is only through the remarkable discoveries and ideas of several researchers, including Richard Feynman [49], David Deutsch [40,41], Ethan Bernstein and Umesh Vazirani [29,30], and Peter Shor [88,89], that this potential has become evident. In particular, Shor’s polynomial-time quantum factoring and discrete-logarithm algorithms [89] give strong support to the conjecture that quantum and classical computers yield differing notions of
computational hardness. Other quantum complexity-theoretic concepts, such as the efficient verification of quantum proofs, suggest a wider extent to which quantum mechanics influences computational complexity. It may be said that the principal aim of quantum computational complexity theory is to understand the implications of quantum physics to computational complexity theory. To this end, it considers the hardness of computational problems with respect to models of quantum computation, classifications of problems based on these models, and their relationships to classical models and complexity classes.

Introduction

This article surveys quantum computational complexity, with a focus on three fundamental notions: polynomial-time quantum computations, the efficient verification of quantum proofs, and quantum interactive proof systems. Based on these notions one defines quantum complexity classes, such as BQP, QMA, and QIP, that contain computational problems of varying hardness. Properties of these complexity classes, and the relationships among these classes and classical complexity classes, are presented. As these notions and complexity classes are typically defined within the quantum circuit model, this article includes a section that focuses on basic properties of quantum circuits that are important in the setting of quantum complexity. A selection of other topics in quantum complexity, including quantum advice, space-bounded quantum computation, and bounded-depth quantum circuits, is also presented. Two different but closely related areas of study are not discussed in this article: quantum query complexity and quantum communication complexity. Readers interested in learning more about these interesting and active areas of research may find the surveys of Brassard [33], Cleve [34], and de Wolf [39] to be helpful starting points.
It is appropriate that brief discussions of computational complexity theory and quantum information precede the main technical portion of the article. These discussions are intended only to highlight the aspects of these topics that are non-standard, require clarification, or are of particular importance in quantum computational complexity. In the subsequent sections of this article, the reader is assumed to have basic familiarity with both topics, which are covered in depth by several textbooks [13,43,60,67,79,82].

Computational Complexity

Throughout this article the binary alphabet {0, 1} is denoted Σ, and all computational problems are assumed
to be encoded over this alphabet. As usual, a function f : Σ* → Σ* is said to be polynomial-time computable if there exists a polynomial-time deterministic Turing machine that outputs f(x) for every input x ∈ Σ*. Two related points on the terminology used throughout this article are as follows.
1. A function of the form p : ℕ → ℕ (where ℕ = {0, 1, 2, ...}) is said to be a polynomial-bounded function if and only if there exists a polynomial-time deterministic Turing machine that outputs 1^{p(n)} on input 1^n for every n ∈ ℕ. Such functions are upper-bounded by some polynomial, and are efficiently computable.
2. A function of the particular form a : ℕ → [0, 1] is said to be polynomial-time computable if and only if there exists a polynomial-time deterministic Turing machine that outputs a binary representation of a(n) on input 1^n for each n ∈ ℕ. References to functions of this form in this article typically concern bounds on probabilities that are functions of the length of an input string to some problem.
The notion of promise problems [44,52] is central to quantum computational complexity. These are decision problems for which the input is assumed to be drawn from some subset of all possible input strings. More formally, a promise problem is a pair A = (A_yes, A_no), where A_yes, A_no ⊆ Σ* are sets of strings satisfying A_yes ∩ A_no = ∅. Languages may be viewed as promise problems that obey the additional constraint A_yes ∪ A_no = Σ*. Although complexity theory has traditionally focused on languages rather than promise problems, little is lost and much is gained in shifting one’s focus to promise problems. Karp reductions and the notion of completeness are defined for promise problems in the same way as for languages. Several classical complexity classes are referred to in this article, and compared with quantum complexity classes when relations are known.
The following classical complexity classes, which should hereafter be understood to be classes of promise problems and not just languages, are among those discussed.
P A promise problem A = (A_yes, A_no) is in P if and only if there exists a polynomial-time deterministic Turing machine M that accepts every string x ∈ A_yes and rejects every string x ∈ A_no.
NP A promise problem A = (A_yes, A_no) is in NP if and only if there exists a polynomial-bounded function p and a polynomial-time deterministic Turing machine M with the following properties. For every string x ∈ A_yes, it holds that M accepts (x, y) for some string y ∈ Σ^{p(|x|)}, and for every string x ∈ A_no, it holds that M rejects (x, y) for all strings y ∈ Σ^{p(|x|)}.
BPP A promise problem A = (A_yes, A_no) is in BPP if and only if there exists a polynomial-time probabilistic Turing machine M that accepts every string x ∈ A_yes with probability at least 2/3, and accepts every string x ∈ A_no with probability at most 1/3.
PP A promise problem A = (A_yes, A_no) is in PP if and only if there exists a polynomial-time probabilistic Turing machine M that accepts every string x ∈ A_yes with probability strictly greater than 1/2, and accepts every string x ∈ A_no with probability at most 1/2.
MA A promise problem A = (A_yes, A_no) is in MA if and only if there exists a polynomial-bounded function p and a probabilistic polynomial-time Turing machine M with the following properties. For every string x ∈ A_yes, it holds that Pr[M accepts (x, y)] ≥ 2/3 for some string y ∈ Σ^{p(|x|)}; and for every string x ∈ A_no, it holds that Pr[M accepts (x, y)] ≤ 1/3 for all strings y ∈ Σ^{p(|x|)}.
AM A promise problem A = (A_yes, A_no) is in AM if and only if there exist polynomial-bounded functions p and q and a polynomial-time deterministic Turing machine M with the following properties. For every string x ∈ A_yes, and at least 2/3 of all strings y ∈ Σ^{p(|x|)}, there exists a string z ∈ Σ^{q(|x|)} such that M accepts (x, y, z); and for every string x ∈ A_no, and at least 2/3 of all strings y ∈ Σ^{p(|x|)}, there are no strings z ∈ Σ^{q(|x|)} such that M accepts (x, y, z).
SZK A promise problem A = (A_yes, A_no) is in SZK if and only if it has a statistical zero-knowledge interactive proof system.
PSPACE A promise problem A = (A_yes, A_no) is in PSPACE if and only if there exists a deterministic Turing machine M running in polynomial space that accepts every string x ∈ A_yes and rejects every string x ∈ A_no.
EXP A promise problem A = (A_yes, A_no) is in EXP if and only if there exists a deterministic Turing machine M running in exponential time (meaning time bounded by 2^{p(n)}, for some polynomial-bounded function p), that accepts every string x ∈ A_yes and rejects every string x ∈ A_no.
NEXP A promise problem A = (A_yes, A_no) is in NEXP if and only if there exists an exponential-time non-deterministic Turing machine N for A.
PL A promise problem A = (A_yes, A_no) is in PL if and only if there exists a probabilistic Turing machine M running in polynomial time and logarithmic space that accepts every string x ∈ A_yes with probability strictly greater than 1/2 and accepts every string x ∈ A_no with probability at most 1/2.
Quantum Computational Complexity, Figure 1
A diagram illustrating known inclusions among most of the classical complexity classes discussed in this paper. Lines indicate containments going upward; for example, AM is contained in PSPACE
NC A promise problem A = (A_yes, A_no) is in NC if and only if there exists a logarithmic-space generated family C = {C_n : n ∈ ℕ} of poly-logarithmic depth Boolean circuits such that C(x) = 1 for all x ∈ A_yes and C(x) = 0 for all x ∈ A_no.
For most of the complexity classes listed above, there is a standard way to attach an oracle to the machine model that defines the class, which provides a subroutine for solving instances of a chosen problem B = (B_yes, B_no). One then denotes the existence of such an oracle with a superscript; for example, P^B is the class of promise problems that can be solved in polynomial time by a deterministic Turing machine equipped with an oracle that solves instances of B (at unit cost). When classes of problems appear as superscripts, one takes the union, as in the following example:

    P^NP = ⋃_{B ∈ NP} P^B .
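The verifier style used in the definitions of NP and MA above can be made concrete with a small sketch. The following checker uses subset sum as the underlying problem; the problem choice and the encodings of instances and witnesses are illustrative assumptions, not taken from the article.

```python
# A polynomial-time verifier M in the style of the NP definition: M takes an
# instance x and a candidate witness y, and accepts iff y certifies that x is
# a yes-instance. Here x = (list of integers, target) and y is a bit string
# selecting a subset of the integers. (Hypothetical example problem.)
def M(x, y):
    numbers, target = x
    if len(y) != len(numbers):          # witness must have the prescribed length
        return False
    chosen = [n for n, bit in zip(numbers, y) if bit == "1"]
    return sum(chosen) == target

# A yes-instance has at least one accepted witness...
x = ([3, 5, 8, 13], 16)
print(M(x, "1001"))   # 3 + 13 = 16 -> True
# ...while incorrect witnesses are rejected (a no-instance has none at all).
print(M(x, "0110"))   # 5 + 8 = 13 -> False
```

Each call runs in time polynomial in the instance size, matching the requirement that M be a polynomial-time machine.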
Quantum Information

The standard general description of quantum information is used in this article: mixed states of systems are represented by density matrices and operations are represented
by completely positive trace-preserving linear maps. The choice to use this description of quantum information is deserving of a brief discussion, for it will likely be less familiar to many non-experts than the simplified picture of quantum information where states are represented by unit vectors and operations are represented by unitary matrices. This simplified picture is indeed commonly used in the study of both quantum algorithms and quantum complexity theory; and it is often adequate. However, the general picture has major advantages: it unifies quantum information with classical probability theory, brings with it powerful mathematical tools, and allows for simple and intuitive descriptions in many situations where this is not possible with the simplified picture. Classical simulations of quantum computations, which are discussed below in Subsect. “Classical Upper Bounds on BQP”, may be better understood through a fairly straightforward representation of quantum operations by matrices. This representation begins with a representation of density matrices as vectors based on the function defined as vec(|x⟩⟨y|) = |x⟩|y⟩ for each choice of n ∈ ℕ and x, y ∈ Σ^n, and extended by linearity to all matrices indexed by Σ^n. The effect of this mapping is to form a column vector by reading the entries of a matrix in rows from left to right, starting at the top. For example,

    vec( [α β; γ δ] ) = (α, β, γ, δ)^T .

Now, the effect of a general quantum operation Φ, represented in the typical Kraus form as

    Φ(ρ) = ∑_{j=1}^{k} A_j ρ A_j† ,

is expressed as a matrix by means of the equality

    vec(Φ(ρ)) = ( ∑_{j=1}^{k} A_j ⊗ Ā_j ) vec(ρ) .

The matrix

    K_Φ = ∑_{j=1}^{k} A_j ⊗ Ā_j

is sometimes called the natural representation (or linear representation) of the operation Φ. Although this matrix could have negative or complex entries, one can reasonably view it as being analogous to a stochastic matrix that describes a probabilistic computation.
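The identity vec(Φ(ρ)) = K_Φ vec(ρ), with K_Φ = ∑_j A_j ⊗ Ā_j (Ā denoting the entrywise complex conjugate), can be checked numerically. The sketch below uses pure Python with hand-rolled matrix helpers (the helper names are my own) and the single-qubit phase-damping channel as a test case.

```python
# Sketch: the natural (linear) representation of a quantum operation.
# Matrices are lists of rows; vec reads entries row by row, as in the text.

def vec(M):
    return [e for row in M for e in row]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def dagger(M):      # conjugate transpose
    return [[M[j][i].conjugate() for j in range(len(M))] for i in range(len(M[0]))]

def conj(M):        # entrywise conjugate
    return [[e.conjugate() for e in row] for row in M]

def kron(A, B):     # Kronecker product, row-major convention
    return [[A[i][j] * B[k][l] for j in range(len(A[0])) for l in range(len(B[0]))]
            for i in range(len(A)) for k in range(len(B))]

def apply_channel(kraus, rho):      # Phi(rho) = sum_j A_j rho A_j^dagger
    out = [[0, 0], [0, 0]]
    for A in kraus:
        T = mat_mul(mat_mul(A, rho), dagger(A))
        out = [[out[i][j] + T[i][j] for j in range(2)] for i in range(2)]
    return out

def natural_rep(kraus):             # K_Phi = sum_j A_j (x) conj(A_j)
    K = [[0] * 4 for _ in range(4)]
    for A in kraus:
        T = kron(A, conj(A))
        K = [[K[i][j] + T[i][j] for j in range(4)] for i in range(4)]
    return K

# Test case: the complete phase-damping channel, with Kraus operators
# |0><0| and |1><1|; its natural representation is diag(1, 0, 0, 1).
kraus = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
rho = [[0.5, 0.5j], [-0.5j, 0.5]]   # an arbitrary one-qubit density matrix
assert natural_rep(kraus) == [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]]
assert mat_vec(natural_rep(kraus), vec(rho)) == vec(apply_channel(kraus, rho))
```

The final assertion is exactly the equality vec(Φ(ρ)) = K_Φ vec(ρ) for this channel.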
For example, the complete phase-damping channel for a single qubit can be written

    D(ρ) = |0⟩⟨0| ρ |0⟩⟨0| + |1⟩⟨1| ρ |1⟩⟨1| .

The effect of this mapping is to zero-out the off-diagonal entries of a density matrix:

    D( [α β; γ δ] ) = [α 0; 0 δ] .

The natural representation of this operation is easily computed:

    K_D = [1 0 0 0; 0 0 0 0; 0 0 0 0; 0 0 0 1] .

The natural matrix representation of quantum operations is well-suited to performing computations. For the purposes of this article, the most important observation about this representation is that a composition of operations corresponds simply to matrix multiplication.

The Quantum Circuit Model

General Quantum Circuits

The term quantum circuit refers to an acyclic network of quantum gates connected by wires. The quantum gates represent general quantum operations, involving some constant number of qubits, while the wires represent the qubits on which the gates act. An example of a quantum circuit having four input qubits and three output qubits is pictured in Fig. 2. In general a quantum circuit may have n input qubits and m output qubits for any choice of integers n, m ≥ 0. Such a circuit induces some quantum operation from n qubits to m qubits, determined by composing the actions of the individual gates in the appropriate way. The size of a quantum circuit is the total number of gates plus the total number of wires in the circuit. A unitary quantum circuit is a quantum circuit in which all of the gates correspond to unitary quantum operations. Naturally this requires that every gate, and hence the circuit itself, has an equal number of input and output qubits. It is common in the study of quantum computing that one works entirely with unitary quantum circuits. The unitary model and general model are closely related, as will soon be explained.

A Finite Universal Gate Set

Restrictions must be placed on the gates from which quantum circuits may be composed if the quantum circuit
Quantum Computational Complexity, Figure 2
An example of a quantum circuit. The input qubits are labeled X_1, ..., X_4, the output qubits are labeled Y_1, ..., Y_3, and the gates are labeled by (hypothetical) quantum operations Φ_1, ..., Φ_6
Quantum Computational Complexity, Figure 3
A universal collection of quantum gates: Toffoli, Hadamard, phase-shift, ancillary, and erasure gates
model is to be used for complexity theory—for without such restrictions it cannot be argued that each quantum gate corresponds to an operation with unit cost. The usual way in which this is done is simply to fix a suitable finite set of allowable gates. For the remainder of this article, quantum circuits will be assumed to be composed of gates from the following list:
1. Toffoli gates. Toffoli gates are three-qubit unitary gates defined by the following action on standard basis states:

    T : |a⟩|b⟩|c⟩ ↦ |a⟩|b⟩|c ⊕ ab⟩ .

2. Hadamard gates. Hadamard gates are single-qubit unitary gates defined by the following action on standard basis states:

    H : |a⟩ ↦ (1/√2) |0⟩ + ((−1)^a/√2) |1⟩ .
3. Phase-shift gates. Phase-shift gates are single-qubit unitary gates defined by the following action on standard basis states:

    P : |a⟩ ↦ i^a |a⟩ .

4. Ancillary gates. Ancillary gates are non-unitary gates that take no input and produce a single qubit in the state |0⟩ as output.
5. Erasure gates. Erasure gates are non-unitary gates that take a single qubit as input and produce no output. Their effect is represented by the partial trace on the space corresponding to the qubit they take as input.
The symbols used to denote these gates in quantum circuit diagrams are shown in Fig. 3. For the sake of simplifying some of the diagrams that appear later in this article, some additional gates are illustrated in Fig. 4 along with their realizations as circuits with gates from the chosen basis set.
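The three unitary gates above act on basis states exactly as written, and a minimal state-vector simulator makes this concrete. The gate matrices below follow the definitions in the list; the simulator scaffolding (function and variable names) is my own sketch, not part of the article.

```python
import math

# State vectors over n qubits are lists of 2^n amplitudes, with basis strings
# read as binary numbers (qubit 0 taken as the most significant bit).

def apply_gate(gate, targets, state, n):
    """Apply a k-qubit gate (a 2^k x 2^k matrix) to the listed target qubits."""
    k = len(gate).bit_length() - 1
    new = [0j] * len(state)
    for i, amp in enumerate(state):
        if amp == 0:
            continue
        bits = [(i >> (n - 1 - q)) & 1 for q in range(n)]
        col = int("".join(str(bits[t]) for t in targets), 2)
        for row in range(1 << k):
            if gate[row][col] == 0:
                continue
            out_bits = list(bits)
            for j, t in enumerate(targets):
                out_bits[t] = (row >> (k - 1 - j)) & 1
            idx = int("".join(map(str, out_bits)), 2)
            new[idx] += gate[row][col] * amp
    return new

s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]                 # Hadamard
P = [[1, 0], [0, 1j]]                 # phase shift: |a> -> i^a |a>
TOFFOLI = [[1 if (r == c and r < 6) or (r, c) in ((6, 7), (7, 6)) else 0
            for c in range(8)] for r in range(8)]   # |a,b,c> -> |a,b,c^(ab)>

# H on |0> yields the uniform superposition (|0> + |1>)/sqrt(2):
print(apply_gate(H, [0], [1, 0], 1))
# Toffoli on |110> flips the third qubit, yielding |111>:
state = [0] * 8
state[0b110] = 1
print(apply_gate(TOFFOLI, [0, 1, 2], state, 3))
```

Because the Toffoli matrix is a permutation of the basis states, repeated application returns the input, matching its definition as a reversible gate.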
Quantum Computational Complexity, Figure 4
Four additional quantum gates, together with their implementations as quantum circuits. Top left: a NOT gate. Top right: a constant |1⟩ ancillary gate. Bottom left: a controlled-NOT gate. Bottom right: a phase-damping (or decoherence) gate
The above gate set is universal in a strong sense: every quantum operation can be approximated to within any desired degree of accuracy by some quantum circuit. Moreover, the size of the approximating circuit scales well with respect to the desired accuracy. Theorem 1, stated below, expresses this fact in more precise terms, but requires a brief discussion of a specific sense in which one quantum operation approximates another. A natural and operationally meaningful way to measure the distance between two given quantum operations Φ and Ψ is given by

    δ(Φ, Ψ) = (1/2) ‖Φ − Ψ‖⋄ ,

where ‖·‖⋄ denotes a norm usually known as the diamond norm [64,67]. A technical discussion of this norm is not necessary for the purposes of this article and is beyond its scope. Instead, an intuitive description of the distance measure δ(Φ, Ψ) will suffice. When considering the distance between quantum operations Φ and Ψ, it must naturally be assumed that these operations agree on their numbers of input qubits and output qubits; so assume that Φ and Ψ both map n qubits to m qubits for nonnegative integers n and m. Now, suppose that an arbitrary quantum state on n or more qubits is prepared, one of the two operations Φ or Ψ is applied to the first n of these qubits, and then a general measurement of all of the resulting qubits takes place (including the m output qubits and the qubits that were not among the inputs to the chosen quantum operation). Two possible probability distributions on measurement outcomes arise: one corresponding to Φ and the other corresponding to Ψ. The quantity δ(Φ, Ψ) is simply the maximum possible total variation distance between these distributions, ranging over all possible initial states and general measurements. This is a number between 0 and 1 that intuitively represents the observable difference between quantum operations.
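The total variation distance between two fixed outcome distributions, the quantity that δ(Φ, Ψ) maximizes over initial states and measurements, has a simple closed form: half the ℓ1-distance. A sketch (the two example distributions are hypothetical stand-ins for the measurement statistics of Φ and Ψ):

```python
# Total variation distance between two probability distributions, given as
# dicts mapping outcomes to probabilities. It equals the maximum advantage
# any single test has in telling the two distributions apart.
def total_variation(p, q):
    outcomes = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in outcomes) / 2

# A deterministic outcome versus a fair coin: distinguishable with
# advantage exactly 1/2.
print(total_variation({"0": 1.0}, {"0": 0.5, "1": 0.5}))   # 0.5
```

Identical distributions give distance 0 and disjointly supported ones give distance 1, matching the statement that δ lies between 0 and 1.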
In the special case that Φ and Ψ take no inputs, the quantity δ(Φ, Ψ) is simply one-half the trace norm of the difference between the two output states; a common and useful way to measure the distance between quantum states. Now the Universality Theorem, which represents an amalgamation of several results that suits the needs of this article, may be stated. In particular, it incorporates the Solovay–Kitaev Theorem, which provides a bound on the size of an approximating circuit as a function of the accuracy.

Theorem 1 (Universality Theorem) Let Φ be an arbitrary quantum operation from n qubits to m qubits. Then for every ε > 0 there exists a quantum circuit Q with n input qubits and m output qubits such that δ(Φ, Q) < ε. Moreover, for fixed n and m, the circuit Q may be taken to satisfy size(Q) = poly(log(1/ε)).

Note that it is inevitable that the size of Q is exponential in n and m in the worst case [68]. Further details on the facts comprising this theorem can be found in both Nielsen and Chuang [79] and Kitaev, Shen, and Vyalyi [67].

Unitary Purifications of Quantum Circuits

The connection between general and unitary quantum circuits can be understood through the notion of a unitary purification of a general quantum circuit. This may be thought of as a very specific manifestation of the Stinespring Dilation Theorem [90], which implies that general quantum operations can be represented by unitary operations on larger systems. It was first applied to the quantum circuit model by Aharonov, Kitaev, and Nisan [9], who gave several arguments in favor of the general quantum circuit model over the unitary model. The term purification is borrowed from the notion of a purification of a mixed quantum state, as the process of unitary purification for circuits is similar in spirit.

Quantum Computational Complexity, Figure 5
A general quantum circuit (left) and its unitary purification (right)

The universal gate set described in the previous section has the effect of making the notion of a unitary purification of a general quantum circuit nearly trivial at a technical level. Suppose that Q is a quantum circuit taking input qubits (X_1, ..., X_n) and producing output qubits (Y_1, ..., Y_m), and assume there are k ancillary gates and l erasure gates among the gates of Q, to be labeled in an arbitrary order as G_1, ..., G_k and K_1, ..., K_l, respectively. A new quantum circuit R may then be formed by removing the gates labeled G_1, ..., G_k and K_1, ..., K_l; and to account for the removal of these gates the circuit R takes k additional input qubits (Z_1, ..., Z_k) and produces l additional output qubits (W_1, ..., W_l). Figure 5 illustrates this process. The circuit R is said to be a unitary purification of Q. It is obvious that R is equivalent to Q, provided the qubits (Z_1, ..., Z_k) are initially set to the |0⟩ state and the qubits (W_1, ..., W_l) are traced-out, or simply ignored, after the circuit is run—for this is precisely the meaning of the removed gates. Despite the simplicity of this process, it is often useful to consider the properties of unitary purifications of general quantum circuits.
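A numerical sketch of the purification idea for a single non-unitary gate: the phase-damping gate on one qubit equals a controlled-NOT onto a fresh |0⟩ ancilla (an ancillary gate's output) followed by an erasure gate, i.e. a partial trace, on that ancilla. The matrix helpers below are my own scaffolding; only the construction itself comes from the standard purification picture.

```python
# Purifying the phase-damping gate: adjoin an ancilla in state |0><0|,
# apply CNOT (control = the first, data qubit; target = the ancilla),
# then trace out the ancilla. The result damps the off-diagonal entries.
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def dagger(M):
    return [[M[j][i].conjugate() for j in range(len(M))] for i in range(len(M[0]))]

def kron(A, B):
    return [[A[i][j] * B[k][l] for j in range(len(A[0])) for l in range(len(B[0]))]
            for i in range(len(A)) for k in range(len(B))]

CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

def purified_phase_damping(rho):
    zero = [[1, 0], [0, 0]]                      # fresh ancilla |0><0|
    big = mat_mul(mat_mul(CNOT, kron(rho, zero)), dagger(CNOT))
    # erasure gate: partial trace over the second (ancilla) qubit
    return [[big[2 * i][2 * j] + big[2 * i + 1][2 * j + 1] for j in range(2)]
            for i in range(2)]

rho = [[0.5, 0.5], [0.5, 0.5]]                   # the state |+><+|
print(purified_phase_damping(rho))               # off-diagonals are damped away
```

The unitary part (the CNOT) together with the deferred ancilla preparation and erasure reproduces exactly the general operation, just as the circuit R reproduces Q.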
Oracles in the Quantum Circuit Model

Oracles play an important, and yet uncertain, role in computational complexity theory; and the situation is no different in the quantum setting. Several interesting oracle-related results, offering some insight into the power of quantum computation, will be discussed in this article. Oracle queries are represented in the quantum circuit model by an infinite family {R_n : n ∈ ℕ} of quantum gates, one for each possible query length. Each gate R_n is a unitary gate acting on n + 1 qubits, with the effect on computational basis states given by

    R_n |x, a⟩ = |x, a ⊕ A(x)⟩    (1)

for all x ∈ Σ^n and a ∈ Σ, where A is some predicate that represents the particular oracle under consideration. When quantum computations relative to such an oracle are to be studied, quantum circuits composed of ordinary quantum gates as well as the oracle gates {R_n} are considered; the interpretation being that each instance of R_n in such a circuit represents one oracle query. It is critical to many results concerning quantum oracles, as well as most results in the area of quantum query complexity, that the above definition (1) takes each R_n to be unitary, thereby allowing these gates to make queries “in superposition”. In support of this seemingly strong definition is the fact, discussed in the next section, that any efficient algorithm (quantum or classical) can be converted to a quantum circuit that closely approximates the action of these gates. Finally, one may consider a more general situation in which the predicate A is replaced by a function that outputs multiple bits. The definition of each gate R_n is adapted appropriately. Alternately, one may restrict their attention to single-bit queries as discussed above, and use the Bernstein–Vazirani algorithm [30] to simulate one multiple-bit query with one single-bit query as illustrated in Fig. 6.
Quantum Computational Complexity, Figure 6
The Bernstein–Vazirani algorithm allows a multiple-bit query to be simulated by a single-bit query. In the example pictured, f : Σ^3 → Σ^3 is a given function. To simulate a query to this function, the gate R_6 is taken to be a standard oracle gate implementing the predicate A(x, z) = ⟨f(x), z⟩, for ⟨·,·⟩ denoting the modulo-2 inner product
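On computational basis states, each oracle gate R_n is simply the reversible map (x, a) ↦ (x, a ⊕ A(x)); since this permutes basis states and is its own inverse, its linear extension to superpositions is unitary. A sketch, with a hypothetical predicate A standing in for the oracle:

```python
# Oracle gate R_n on basis states: |x, a> -> |x, a XOR A(x)>.
# A is a hypothetical 3-bit predicate; any {0,1}-valued function works.
def A(x):
    return bin(x).count("1") % 2     # parity of the bits of x (illustrative)

def R(n, x, a):
    return x, a ^ A(x)

# Applying R twice returns every basis state to itself, so R_n is its own
# inverse: a permutation of basis states, hence unitary when extended
# linearly to superpositions.
for x in range(8):
    for a in (0, 1):
        assert R(3, *R(3, x, a)) == (x, a)

print(R(3, 0b101, 0))   # A(101) has even parity, so a is unchanged: (5, 0)
```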
Polynomial-Time Quantum Computations

This section focuses on polynomial-time quantum computations. These are the computations that are viewed, in an abstract and idealized sense, to be efficiently implementable by means of a quantum computer. In particular, the complexity class BQP (short for bounded-error quantum polynomial time) is defined. This is the most fundamentally important of all quantum complexity classes, as it represents the collection of decision problems that can be efficiently solved by quantum computers.

Polynomial-Time Generated Circuit Families and BQP

To define the class BQP using the quantum circuit model, it is necessary to briefly discuss encodings of circuits and the notion of a polynomial-time generated circuit family. It is clear that any quantum circuit formed from the gates described in the previous section could be encoded as a binary string using any number of different encoding schemes. Such an encoding scheme must be chosen, but its specifics are not important so long as the following simple restrictions are satisfied:
1. The encoding is sensible: every quantum circuit is encoded by at least one binary string, and every binary string encodes at most one quantum circuit.
2. The encoding is efficient: there is a fixed polynomial-bounded function p such that every circuit of size N has an encoding with length at most p(N). Specific information about the structure of a circuit must be computable in polynomial time from an encoding of the circuit.
3. The encoding disallows compression: it is not possible to work with encoding schemes that allow for extremely short (e.g., polylogarithmic-length) encodings of circuits; so for simplicity it is assumed that the length of every encoding of a quantum circuit is at least the size of the circuit.
Now, as any quantum circuit represents a finite computation with some fixed number of input and output qubits, quantum algorithms are modeled by families of quantum circuits.
The typical assumption is that a quantum circuit family that describes an algorithm contains one circuit for each possible input length. Precisely the same situation arises here as in the classical setting, which is that it should be possible to efficiently generate the circuits in a given family in order for that family to represent an efficient, finitely specified algorithm. The following definition formalizes this notion.

Definition 2 Let S ⊆ Σ* be any set of strings. Then a collection {Q_x : x ∈ S} of quantum circuits is said to be polynomial-time generated if there exists a polynomial-time
deterministic Turing machine that, on every input x ∈ S, outputs an encoding of Q_x. This definition is slightly more general than what is needed to define BQP, but is convenient for other purposes. For instance, it allows one to easily consider the situation in which the input, or some part of the input, for some problem is hard-coded into a collection of circuits; or where a computation for some input may be divided among several circuits. In the most typical case that a polynomial-time generated family of the form {Q_n : n ∈ ℕ} is referred to, it should be interpreted that this is a shorthand for {Q_{1^n} : n ∈ ℕ}. Notice that every polynomial-time generated family {Q_x : x ∈ S} has the property that each circuit Q_x has size polynomial in |x|. Intuitively speaking, the number of quantum and classical computation steps required to implement such a computation is polynomial; and so operations induced by the circuits in such a family are viewed as representing polynomial-time quantum computations. The complexity class BQP, which contains those promise problems abstractly viewed to be efficiently solvable using a quantum computer, may now be defined. More precisely, BQP is the class of promise problems that can be solved by polynomial-time quantum computations that may have some small probability to make an error. For decision problems, the notion of a polynomial-time quantum computation is equated with the computation of a polynomial-time generated quantum circuit family Q = {Q_n : n ∈ ℕ}, where each circuit Q_n takes n input qubits, and produces one output qubit. The computation on a given input string x ∈ Σ* is obtained by first applying the circuit Q_{|x|} to the state |x⟩⟨x|, and then measuring the output qubit with respect to the standard basis. The measurement results 0 and 1 are interpreted as yes and no (or accept and reject), respectively. The events that Q accepts x and Q rejects x are understood to have associated probabilities determined in this way.
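The acceptance probability just described can be read directly off a final state vector: prepare |x⟩, apply the circuit, and sum the squared amplitude mass on states whose output qubit reads 0. A toy sketch, in which the "circuit" for length-1 inputs is a single Hadamard gate (my own illustrative choice, not from the article):

```python
import math

# For input x, apply the circuit for inputs of length |x| to |x> and measure
# the output qubit; outcome 0 means accept. With the length-1 circuit taken
# to be one Hadamard gate, Pr[accept] = |<0| H |x>|^2 = 1/2 for both inputs.
def accept_probability(x):
    s = 1 / math.sqrt(2)
    H = [[s, s], [s, -s]]
    state = [0, 0]
    state[int(x, 2)] = 1                 # prepare the basis state |x>
    out = [sum(H[i][j] * state[j] for j in range(2)) for i in range(2)]
    return abs(out[0]) ** 2              # probability of measuring outcome 0

print(accept_probability("0"))           # approximately 1/2
print(accept_probability("1"))           # approximately 1/2
```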
BQP(a, b). Let A = (A_yes, A_no) be a promise problem and let a, b : ℕ → [0, 1] be functions. Then A ∈ BQP(a, b) if and only if there exists a polynomial-time generated family of quantum circuits Q = {Q_n : n ∈ ℕ}, where each circuit Q_n takes n input qubits and produces one output qubit, that satisfies the following properties:
1. if x ∈ A_yes then Pr[Q accepts x] ≥ a(|x|), and
2. if x ∈ A_no then Pr[Q accepts x] ≤ b(|x|).

The class BQP is defined as BQP = BQP(2/3, 1/3). As is the case for BPP, there is nothing special about the particular choice of error probability 1/3, other than that it is
Quantum Computational Complexity
a constant strictly smaller than 1/2. This is made clear in the next section.

There are several problems known to be in BQP but not known (and generally not believed) to be in BPP. Decision-problem variants of the integer factoring and discrete logarithm problems, shown to be in BQP by Shor [89], are at present the most important and well-known examples.

Error Reduction for BQP

When one speaks of the flexibility, or robustness, of BQP with respect to error bounds, it is meant that the class BQP(a, b) is invariant under a wide range of "reasonable" choices of the functions a and b. The following proposition states this more precisely.

Proposition 3 (Error reduction for BQP). Suppose that a, b : ℕ → [0, 1] are polynomial-time computable functions and p : ℕ → ℕ is a polynomial-bounded function such that a(n) − b(n) ≥ 1/p(n) for all but finitely many n ∈ ℕ. Then for every choice of a polynomial-bounded function q : ℕ → ℕ satisfying q(n) ≥ 2 for all but finitely many n ∈ ℕ, it holds that

  BQP(a, b) = BQP = BQP(1 − 2^{−q}, 2^{−q}).

The above proposition may be proved in the same standard way that similar statements are proved for classical probabilistic computations: by repeating a given computation some large (but still polynomial) number of times, overwhelming statistical evidence is obtained so as to give the correct answer with an extremely small probability of error. It is straightforward to represent this sort of repeated computation within the quantum circuit model in such a way that the requirements of the definition of BQP are satisfied.

Simulating Classical Computations with Quantum Circuits

It should not be surprising that quantum computers can efficiently simulate classical computers, for quantum information generalizes classical information, and it would be absurd if there were a loss of computational power in moving to a more general model. This intuition may be confirmed by observing the containment BPP ⊆ BQP.
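The repetition argument behind Proposition 3 can be checked numerically. The sketch below (a purely classical calculation; the specific probabilities 2/3 and 1/3 and repetition counts are illustrative) computes the error probability of a majority vote over independent runs, which decays exponentially in the number of repetitions:

```python
from math import comb

def majority_accepts(p, t):
    """Probability that a strict majority of t independent runs accept,
    when each run accepts independently with probability p."""
    return sum(comb(t, k) * p**k * (1 - p)**(t - k)
               for k in range(t // 2 + 1, t + 1))

# With per-run acceptance probabilities 2/3 on yes-inputs and 1/3 on
# no-inputs, the majority vote errs with rapidly shrinking probability.
err_yes = 1 - majority_accepts(2/3, 101)   # Pr[wrongly reject a yes-input]
err_no = majority_accepts(1/3, 101)        # Pr[wrongly accept a no-input]
```

By symmetry the two error probabilities coincide here; the same computation with any inverse-polynomial gap a(n) − b(n) and polynomially many repetitions yields the exponentially small error claimed by the proposition.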
Here an advantage of working with the general quantum circuit model arises: if one truly believes the Universality Theorem, there is almost nothing to prove. Observe first that the complexity class P may be defined in terms of Boolean circuit families in a manner similar to BQP. In particular, a given promise problem A =
(A_yes, A_no) is in P if and only if there exists a polynomial-time generated family C = {C_n : n ∈ ℕ} of Boolean circuits, where each circuit C_n takes n input bits and outputs 1 bit, such that
1. C(x) = 1 for all x ∈ A_yes, and
2. C(x) = 0 for all x ∈ A_no.

A Boolean circuit-based definition of BPP may be given along similar lines: a given promise problem A = (A_yes, A_no) is in BPP if and only if there exists a polynomial-bounded function r and a polynomial-time generated family C = {C_n : n ∈ ℕ} of Boolean circuits, where each circuit C_n takes n + r(n) input bits and outputs 1 bit, such that
1. Pr[C(x, y) = 1] ≥ 2/3 for all x ∈ A_yes, and
2. Pr[C(x, y) = 1] ≤ 1/3 for all x ∈ A_no,
where y ∈ Σ^{r(|x|)} is chosen uniformly at random in both cases.

In both definitions, the circuit family C includes circuits composed of constant-size Boolean logic gates, which for the sake of brevity may be assumed to be NAND gates and FANOUT gates. (FANOUT operations must be modeled as gates for the sake of the simulation.) For the randomized case, it may be viewed that the random bits y ∈ Σ^{r(|x|)} are produced by gates that take no input and output a single uniformly random bit. As NAND gates, FANOUT gates, and random bits are easily implemented with quantum gates, as illustrated in Fig. 7, the circuit family C can be simulated gate-by-gate to obtain a quantum circuit family Q = {Q_n : n ∈ ℕ} for A that satisfies the definition of BQP. It follows that BPP ⊆ BQP.

The BQP Subroutine Theorem

There is an important issue regarding the above definition of BQP, which is that it is not an inherently "clean" definition with respect to the modularization of algorithms. The BQP subroutine theorem of Bennett, Bernstein, Brassard and Vazirani [27] addresses this issue. Suppose that it is established that a particular promise problem A is in BQP, which by definition means that there must exist an efficient quantum algorithm (represented by a family of quantum circuits) for A.
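The reversible simulation of a NAND gate can be made concrete. One standard construction (assumed here for illustration; Fig. 7 may differ in its details) uses a Toffoli gate with its target qubit initialized to 1, so that the target is flipped exactly when both controls are 1 and ends up holding NAND(a, b):

```python
import numpy as np

# The Toffoli gate as an 8x8 permutation matrix on basis states |a,b,c>.
TOFFOLI = np.eye(8)
TOFFOLI[[6, 7]] = TOFFOLI[[7, 6]]   # swap |110> and |111>

def reversible_nand(a, b):
    """Compute NAND(a, b) by applying the Toffoli gate to |a, b, 1>."""
    state = np.zeros(8)
    state[(a << 2) | (b << 1) | 1] = 1.0
    out = TOFFOLI @ state
    result_index = int(np.argmax(out))
    return result_index & 1          # the target qubit after the gate

# The construction agrees with classical NAND on all four inputs.
table = {(a, b): reversible_nand(a, b) for a in (0, 1) for b in (0, 1)}
```

Since the Toffoli gate is unitary (indeed a permutation of basis states), this is a valid quantum gate that reproduces classical NAND behavior on classical inputs.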
It is then natural to consider the use of that algorithm as a subroutine in other quantum algorithms for more complicated problems, and one would like to be able to do this without worrying about the specifics of the original algorithm. Ideally, the algorithm for A should function as an oracle for A, as defined in Subsect. “Oracles in the Quantum Circuit Model”.
Quantum Computational Complexity, Figure 7: Quantum circuit implementations of a NAND gate, a FANOUT gate, and a random bit. The phase-damping gates, denoted by D, are included only for aesthetic reasons: they force the purely classical behavior that would be expected of classical gates, but are not required for the quantum simulation of BPP.
A problem arises, however, when queries to an algorithm for A are made in superposition. Whereas it is quite common and useful to consider quantum algorithms that query oracles in superposition, a given BQP algorithm for A is only guaranteed to work correctly on classical inputs. It could be, for instance, that some algorithm for A begins by applying phase-damping gates to all of its input qubits, or perhaps this happens inadvertently as a result of the computation. Perhaps it is too much to ask that the existence of a BQP algorithm for A implies the existence of a subroutine having the characteristics of an oracle for A? The BQP subroutine theorem establishes that, up to exponentially small error, this is not too much to ask: the existence of an arbitrary BQP algorithm for A implies the existence of a "clean" subroutine for A with the characteristics of an oracle. A precise statement of the theorem follows.

Theorem 4 (BQP subroutine theorem). Suppose A = (A_yes, A_no) is a promise problem in BQP. Then for any choice of a polynomial-bounded function p there exists a polynomial-bounded function q and a polynomial-time generated family of unitary quantum circuits {R_n : n ∈ ℕ} with the following properties:
1. Each circuit R_n implements a unitary operation U_n on n + q(n) + 1 qubits.
2. For every x ∈ A_yes and a ∈ Σ it holds that ⟨x, 0^m, a ⊕ 1| U_n |x, 0^m, a⟩ ≥ 1 − 2^{−p(n)} for n = |x| and m = q(n).
3. For every x ∈ A_no and a ∈ Σ it holds that ⟨x, 0^m, a| U_n |x, 0^m, a⟩ ≥ 1 − 2^{−p(n)} for n = |x| and m = q(n).
Quantum Computational Complexity, Figure 8: A unitary quantum circuit approximating an oracle-gate implementation of a BQP computation. Here Q_n is a unitary purification of a circuit having exponentially small error for some problem in BQP.
The proof of this theorem is remarkably simple: given a BQP algorithm for A, one first uses Proposition 3 to obtain a circuit family Q having exponentially small error for A. The circuit illustrated in Fig. 8 then implements a unitary operation with the desired properties. This is essentially a bounded-error quantum adaptation of a classic construction that allows arbitrary deterministic computations to be performed reversibly [26,92]. The following corollary expresses the main implication of the BQP subroutine theorem in complexity-theoretic terms.

Corollary 5. BQP^BQP = BQP.

Classical Upper Bounds on BQP

There is no known way to efficiently simulate quantum computers with classical computers, and there would be little point in seeking to build quantum computers if there were. Nevertheless, some insight into the limitations of quantum computers may be gained by establishing containments of BQP in the smallest classical complexity classes for which this is possible.
The strongest containment known at this time is given by counting complexity. Counting complexity began with Valiant's work [94] on the complexity of computing the permanent, and was further developed and applied in many papers (including [22,47,91], among many others). The basic notion of counting complexity that is relevant to this article is as follows. Given a polynomial-time nondeterministic Turing machine M and an input string x ∈ Σ*, one denotes by #acc(M, x) the number of accepting computation paths of M on x, and by #rej(M, x) the number of rejecting computation paths of M on x. A function f : Σ* → ℤ is then said to be a GapP function if there exists a polynomial-time nondeterministic Turing machine M such that f(x) = #acc(M, x) − #rej(M, x) for all x ∈ Σ*.

A variety of complexity classes can be specified in terms of GapP functions. For example, a promise problem A = (A_yes, A_no) is in PP if and only if there exists a function f ∈ GapP such that f(x) > 0 for all x ∈ A_yes and f(x) ≤ 0 for all x ∈ A_no. The remarkable closure properties of GapP functions allow many interesting facts to be proved about such classes. Fortnow's survey [50] on counting complexity explains many of these properties and gives several applications of the theory of GapP functions. The following closure property is used below.

GapP-multiplication of matrices. Let p, q : ℕ → ℕ be polynomial-bounded functions. Suppose that for each n ∈ ℕ a sequence of p(n) complex-valued matrices

  A_{n,1}, A_{n,2}, …, A_{n,p(n)}

is given, each having rows and columns indexed by strings in Σ^{q(n)}. Suppose further that there exist functions f, g ∈ GapP such that

  f(1^n, 1^k, x, y) = Re(A_{n,k}[x, y])
  g(1^n, 1^k, x, y) = Im(A_{n,k}[x, y])

for all n ∈ ℕ, k ∈ {1, …, p(n)}, and x, y ∈ Σ^{q(n)}. Then there exist functions F, G ∈ GapP such that

  F(1^n, x, y) = Re((A_{n,p(n)} ⋯ A_{n,1})[x, y])
  G(1^n, x, y) = Im((A_{n,p(n)} ⋯ A_{n,1})[x, y])

for all n ∈ ℕ and x, y ∈ Σ^{q(n)}.
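The definition of a GapP function can be illustrated with a toy computation. In the sketch below (the predicate and path length are invented for illustration, not taken from the text), f(x) counts the accepting computation paths minus the rejecting ones over all nondeterministic binary choices of a fixed length:

```python
from itertools import product

def gap(accept, x, path_len):
    """Accepting minus rejecting paths of a toy nondeterministic machine
    whose nondeterminism is a binary path of length path_len."""
    paths = list(product((0, 1), repeat=path_len))
    accepting = sum(1 for path in paths if accept(x, path))
    return accepting - (len(paths) - accepting)

# Example predicate: a path "accepts" when its bits, read as an integer,
# are strictly less than x. Then the gap is x - (2^path_len - x), which is
# positive exactly when x > 2^(path_len - 1): a PP-style acceptance criterion.
def less_than_x(x, path):
    value = int("".join(map(str, path)), 2)
    return value < x

f_5 = gap(less_than_x, 5, 3)
```

Of course, a genuine GapP function is defined by a polynomial-time machine whose path count is exponential; the brute-force enumeration here only serves to make the accepting-minus-rejecting quantity concrete.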
In other words, if there exist two GapP functions describing the real and imaginary parts of the entries of a polynomially long sequence of matrices, then the same is true of the product of those matrices.

Now, suppose that Q = {Q_n : n ∈ ℕ} is a polynomial-time generated family of quantum circuits. For each n ∈ ℕ, let us assume that the quantum circuit Q_n consists
of gates G_{n,1}, …, G_{n,p(n)} for some polynomial-bounded function p, labeled in an order that respects the topology of the circuit. By tensoring these gates with the identity operation on the qubits they do not touch, and using the natural matrix representation of quantum operations, it is possible to obtain matrices M_{n,1}, …, M_{n,p(n)} with the property that

  M_{n,p(n)} ⋯ M_{n,1} vec(ρ) = vec(Q_n(ρ))

for every possible input density matrix ρ to the circuit. The probability that Q_n accepts a given input x is then

  ⟨1, 1| M_{n,p(n)} ⋯ M_{n,1} |x, x⟩,

which is a single entry in the product of these matrices. By padding the matrices with rows and columns of zeroes, it may be assumed that each matrix M_{n,k} has rows and columns indexed by strings of length q(n) for some polynomial-bounded function q. The assumption that the family Q is polynomial-time generated then allows one to easily conclude that there exist GapP functions f and g so that

  f(1^n, 1^k, x, y) = 2 Re(M_{n,k}[x, y])
  g(1^n, 1^k, x, y) = 2 Im(M_{n,k}[x, y])
for all n ∈ ℕ, k ∈ {1, …, p(n)}, and x, y ∈ Σ^{q(n)}. (This fact makes use of the observation that the natural matrix representations of all of the gates listed in Subsect. "A Finite Universal Gate Set" have entries whose real and imaginary parts come from the set {−1, −1/2, 0, 1/2, 1}; the numbers ±1/2 are needed only for the Hadamard gates.) By the closure property of GapP functions stated above, it follows that there exists a GapP function F such that

  Pr[Q accepts x] = F(x) / 2^{p(|x|)}.
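The natural matrix representation underlying this derivation can be checked numerically for a single gate. In the sketch below (the density matrix is an arbitrary illustrative choice), the representation of a unitary gate G is G ⊗ conj(G), acting on row-major vectorizations, and for the Hadamard gate its entries are exactly ±1/2, in line with the parenthetical note above:

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard gate

def natural_representation(G):
    """Matrix M with M vec(rho) = vec(G rho G†), for row-major vec."""
    return np.kron(G, G.conj())

def vec(rho):
    return rho.reshape(-1)       # row-major vectorization

# An arbitrary single-qubit density matrix for the check.
rho = np.array([[0.75, 0.25j], [-0.25j, 0.25]])
M = natural_representation(H)
lhs = M @ vec(rho)
rhs = vec(H @ rho @ H.conj().T)
```

The identity vec(A X B) = (A ⊗ Bᵀ) vec(X) for row-major vectorization is what makes gate-by-gate matrix products track the whole circuit's action on density matrices.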
The containment BQP ⊆ PP follows easily. This containment was first proved by Adleman, DeMarrais, and Huang [7] using a different method, and was first argued using counting complexity, along similar lines to the above proof, by Fortnow and Rogers [51]. There are two known ways in which this upper bound can be improved. Using the fact that BQP algorithms have bounded error, Fortnow and Rogers [51] proved that BQP is contained in a (somewhat obscure) counting complexity class called AWPP, which implies the following theorem.

Theorem 6. PP^BQP = PP.

Another improvement comes from the observation that the above proof that BQP ⊆ PP makes no use of the
bounded-error assumption of BQP. It follows that an unbounded-error variant of BQP is equal to PP.

PQP. Let A = (A_yes, A_no) be a promise problem. Then A ∈ PQP if and only if there exists a polynomial-time generated family of quantum circuits Q = {Q_n : n ∈ ℕ}, where each circuit Q_n takes n input qubits and produces one output qubit, that satisfies the following properties: if x ∈ A_yes then Pr[Q accepts x] > 1/2, and if x ∈ A_no then Pr[Q accepts x] ≤ 1/2.

Theorem 7. PQP = PP.

Oracle Results Involving BQP

It is difficult to prove separations among quantum complexity classes, for apparently the same reasons that this is so for classical complexity classes. For instance, one clearly cannot hope to prove BPP ≠ BQP while major collapses such as NC = PP or P = PSPACE remain undisproved. In some cases, however, separations among quantum complexity classes can be proved in relativized settings, meaning that the separation holds in the presence of some cleverly defined oracle. The following oracle results are among several that are known:

1. There exists an oracle A such that BQP^A ⊄ MA^A [18,30,96]. Such an oracle A intuitively encodes a problem that is solvable using a quantum computer but is not even efficiently verifiable with a classical computer. As the containment BPP ⊆ BQP holds relative to every oracle, this implies that BPP^A ⊊ BQP^A for this particular choice of A.

2. There is an oracle A such that NP^A ⊄ BQP^A [27]. In less formal terms: a quantum computer cannot find a needle in an exponentially large haystack in polynomial time. This result formalizes a critically important idea, which is that a quantum computer can only solve a given search problem efficiently if it is able to exploit that problem's structure. It is easy to define an oracle search problem, represented by A, that has no structure whatsoever, which allows the conclusion NP^A ⊄ BQP^A to be drawn.
It is not currently known whether the NP-complete problems, in the absence of an oracle, have enough structure to be solved efficiently by quantum computers; but there is little hope and no indication whatsoever that this should be so. It is therefore a widely believed conjecture that NP ⊄ BQP.

3. There is an oracle A such that SZK^A ⊄ BQP^A [1,5]. This result is similar in spirit to the previous one, but is technically more difficult, and rules out the existence of quantum algorithms for unstructured collision-detection problems. The graph isomorphism problem and
various problems that arise in cryptography are examples of collision-detection problems. Once again, it follows that quantum algorithms can only solve such problems if their structure is exploited. It is a major open question in quantum computing whether the graph isomorphism problem is in BQP.

Quantum Proofs

There are many quantum complexity classes of interest beyond BQP. This section concerns one such class, which is a quantum computational analogue of NP. The class is known as QMA, short for quantum Merlin–Arthur, and is based on the notion of a quantum proof: a quantum state that plays the role of a certificate or witness to a quantum computer that functions as a verification procedure. Interest in both the class QMA and the general notion of quantum proofs is primarily based on the fundamental importance of efficient verification in computational complexity. The notion of a quantum proof was first proposed by Knill [69] and considered more formally by Kitaev (presented in a talk in 1999 [65] and later published in [67]).

Definition of QMA

The definition of QMA is inspired by the standard definition of NP included in Subsect. "Computational Complexity" of this article. That definition is of course equivalent to the other well-known definition of NP, based on nondeterministic Turing machines, but is much better suited to consideration in the quantum setting, for nondeterminism is arguably a non-physical notion that does not naturally extend to quantum computing. In the definition of NP from Subsect. "Computational Complexity", the machine M functions as a verification procedure that treats each possible string y ∈ Σ^{p(|x|)} as a potential proof that x ∈ A_yes. The conditions on M are known as the completeness and soundness conditions, which derive their names from logic: completeness refers to the condition that true statements have proofs, while soundness refers to the condition that false statements do not.
To define QMA, the set of possible proofs is extended to include quantum states, which of course presumes that the verification procedure is quantum. As quantum computations are inherently probabilistic, a bounded probability of error is allowed in the completeness and soundness conditions. (This is why the class is called QMA rather than QNP: it is really MA, and not NP, that is the classical analogue of QMA.)

QMA. Let A = (A_yes, A_no) be a promise problem, let p be a polynomial-bounded function, and let a, b : ℕ →
[0, 1] be functions. Then A ∈ QMA_p(a, b) if and only if there exists a polynomial-time generated family of circuits Q = {Q_n : n ∈ ℕ}, where each circuit Q_n takes n + p(n) input qubits and produces one output qubit, with the following properties:
1. Completeness. For all x ∈ A_yes, there exists a p(|x|)-qubit quantum state ρ such that Pr[Q accepts (x, ρ)] ≥ a(|x|).
2. Soundness. For all x ∈ A_no and all p(|x|)-qubit quantum states ρ it holds that Pr[Q accepts (x, ρ)] ≤ b(|x|).

Also define QMA = ⋃_p QMA_p(2/3, 1/3), where the union is over all polynomial-bounded functions p.

Problems in QMA

Before discussing the general properties of QMA that are known, it is appropriate to mention some examples of problems in this class. It is of course clear that thousands of interesting combinatorial problems are in QMA, as the containment NP ⊆ QMA is trivial. What is more interesting is the identification of problems in QMA that are not known to be in NP, for these are the examples that provide insight into the power of quantum proofs. There are presently just a handful of known problems in QMA that are not known to be in NP, but the list is growing.

The Local Hamiltonian Problem

Historically speaking, the first problem identified as complete for QMA was the local Hamiltonian problem [65,67], which can be seen as a quantum analogue of the MAX-k-SAT problem. Its proof of completeness can roughly be seen as a quantum analogue of the proof of the Cook–Levin Theorem [38,72]. The proof has subsequently been strengthened to achieve better parameters [61].

Suppose that M is a Hermitian matrix whose rows and columns are indexed by strings of length n, for some integer n ≥ 1. Then M is said to be k-local if and only if it can be expressed as

  M = P (A ⊗ I) P^{−1}

for an arbitrary Hermitian matrix A indexed by Σ^k, P a permutation matrix defined by P|x_1 ⋯ x_n⟩ = |x_{π(1)} ⋯ x_{π(n)}⟩ for some permutation π ∈ S_n, and I denoting the identity matrix indexed by Σ^{n−k}.
In less formal terms, M is a matrix that arises from a "gate" on k qubits, but where the gate is described by a 2^k × 2^k Hermitian matrix A rather than a unitary matrix. It is possible to express such a matrix compactly by specifying A along with the bit positions on which A acts. Intuitively, a k-local matrix assigns a real number (representing an energy) to any quantum state on n qubits. This number depends only on the reduced state of the k qubits on which M acts nontrivially, and a limit on this value can be thought of as a local constraint on a given quantum state. Loosely speaking, the k-local Hamiltonian problem asks whether there exists a quantum state that satisfies a collection of such constraints.

THE k-LOCAL HAMILTONIAN PROBLEM
Input: A collection H_1, …, H_m of k-local Hermitian matrices indexed by strings of length n and satisfying ‖H_j‖ ≤ 1 for j = 1, …, m.
Yes: There exists an n-qubit quantum state |ψ⟩ such that ⟨ψ| H_1 + ⋯ + H_m |ψ⟩ ≤ −1.
No: For every n-qubit quantum state |ψ⟩ it holds that ⟨ψ| H_1 + ⋯ + H_m |ψ⟩ ≥ 1.

Theorem 8. The 2-local Hamiltonian problem is complete for QMA with respect to Karp reductions.

The completeness of this problem has been shown to be closely related to the universality of the so-called adiabatic model of quantum computation [11].

Other Problems Complete for QMA

Aside from the local Hamiltonian problem, there are other promise problems known to be complete for QMA, including the following:

1. Restricted versions of the local Hamiltonian problem. For example, the 2-local Hamiltonian problem remains QMA-complete when the local Hamiltonians are restricted to nearest-neighbor interactions on a two-dimensional array of qubits [81]. The local Hamiltonian problem with nearest-neighbor interactions on one-dimensional systems is known to be QMA-complete for 12-dimensional particles in place of qubits [10], but the question remains open for smaller systems, including qubits.

2. The density matrix consistency problem. Here, the input is a collection of density matrices representing the reduced states of a hypothetical n-qubit state.
The input is a yes-instance of the problem if there exists a state of the n-qubit system that is consistent with the given reduced density matrices, and a no-instance if every state of the n-qubit system has reduced states that disagree significantly with one or more of the input density matrices. This problem is known to be complete for QMA with respect to Cook reductions [73], but is not known to be complete with respect to Karp reductions.
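The spectral quantity at the heart of the local Hamiltonian problem can be computed directly for small instances. The sketch below (an illustrative instance with ZZ interactions on three qubits, invented for this example and not taken from the text) embeds 2-local terms into the full 2^n-dimensional space by tensoring with the identity, and computes the minimum of ⟨ψ| H_1 + ⋯ + H_m |ψ⟩ as the smallest eigenvalue of the sum:

```python
import numpy as np

Z = np.diag([1.0, -1.0])
I = np.eye(2)

def two_local_term(A, positions, n):
    """Embed a 4x4 Hermitian matrix A, acting on a pair of adjacent qubit
    positions, into an n-qubit operator (a 2-local matrix). This sketch
    assumes the two positions are adjacent, so no permutation is needed."""
    ops = []
    i = 0
    while i < n:
        if i == positions[0]:
            ops.append(A)     # A occupies qubits i and i+1
            i += 2
        else:
            ops.append(I)
            i += 1
    M = ops[0]
    for op in ops[1:]:
        M = np.kron(M, op)
    return M

# Two ZZ-interaction terms on a line of 3 qubits.
ZZ = np.kron(Z, Z)
H_total = two_local_term(ZZ, (0, 1), 3) + two_local_term(ZZ, (1, 2), 3)
ground_energy = float(np.linalg.eigvalsh(H_total).min())
```

For this instance the minimum energy is −2, attained by an alternating computational-basis state; the hardness of the general problem comes from the exponential growth of the matrices with n.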
3. The quantum clique problem. Beigi and Shor [23] have proved that a promise version of the following problem is QMA-complete for any constant k ≥ 2: given a quantum operation Φ, are there k different inputs to Φ that are perfectly distinguishable after passing through Φ? They name this the quantum clique problem because there is a close connection between computing the zero-error capacity of a classical channel and computing the size of the largest clique in the complement of a graph representing the channel; this is a quantum variant of that problem.

4. Several problems about quantum circuits. All of the problems mentioned above are proved to be QMA-hard through reductions from the local Hamiltonian problem. Other problems concerning properties of quantum circuits can be proved QMA-complete by more direct means, particularly when the problem itself concerns the existence of a quantum state that causes an input circuit to exhibit a certain behavior. An example in this category is the non-identity check problem [59], which asks whether a given unitary circuit implements an operation that is distinct from every scalar multiple of the identity.

The Group Non-Membership Problem

Another example of a problem in QMA that is not known to be in NP is the group non-membership problem [17,96]. This problem is quite different from the QMA-complete problems mentioned above in two main respects. First, the problem corresponds to a language, meaning that the promise is vacuous: every possible input can be classified as a yes-instance or a no-instance of the problem. (In all of the above problems it is not possible to do this without invalidating the known proofs that the problems are in QMA.) Second, the problem is not known to be complete for QMA; and indeed it would be quite surprising if QMA were shown to have a complete problem with a vacuous promise.

In the group non-membership problem the input is a subgroup H of some finite group G, along with an element g ∈ G.
The yes-instances of the problem are those for which g ∉ H, and the no-instances are those for which g ∈ H.

THE GROUP NON-MEMBERSHIP PROBLEM
Input: Group elements h_1, …, h_k and g from some finite group G. Let H = ⟨h_1, …, h_k⟩ be the subgroup generated by h_1, …, h_k.
Yes: g ∉ H.
No: g ∈ H.

Like most group-theoretic computational problems, there are many specific variants of the group non-membership problem that differ in the way that group elements are represented. For example, group elements could be represented by permutations in cycle notation, by invertible matrices over a finite field, or in any number of other ways. The difficulty of the problem clearly varies depending on the representation. The framework of black-box groups, put forth by Babai and Szemerédi [20], sidesteps this issue by assuming that group elements are represented by meaningless labels that can only be multiplied or inverted by means of a group oracle. Any algorithm or protocol that solves a problem in this framework can then be adapted to any specific group, provided that an efficient implementation of the group operations exists to replace the group oracle.

Theorem 9. The group non-membership problem is in QMA for every choice of a group oracle.

Error Reduction for QMA

The class QMA is robust with respect to error bounds, which are reflected by the completeness and soundness probabilities, in much the same way as BQP. Two specific error reduction methods are presented in this section: weak error reduction and strong error reduction. The two methods differ with respect to the length of the quantum proof needed to obtain a given error bound, a consideration unique to the setting of quantum proofs.

Suppose that A = (A_yes, A_no) is a given promise problem for which a QMA-type quantum verification procedure exists. To describe both error reduction procedures, it is convenient to assume that the input to the problem is hard-coded into each circuit in the family representing the verification procedure. The verification procedure therefore takes the form Q = {Q_x : x ∈ Σ*}, where each circuit Q_x takes a p(|x|)-qubit quantum state as input, for some polynomial-bounded function p representing the quantum proof length, and produces one output qubit.
The completeness and soundness probabilities are assumed to be given by polynomial-time computable functions a and b, as usual.

The natural approach to error reduction is repetition: for a given input x the verification circuit Q_x is evaluated multiple times, and a decision to accept or reject is made based on the frequency of accepts and rejects among the outcomes of the repetitions. Assuming a(|x|) and b(|x|) are not too close together, this results in a decrease in error that is exponential in the number of repetitions [67]. The problem that arises, however, is that each repetition of the verification procedure apparently destroys the quantum proof it verifies, which necessitates the composite verification procedure receiving p(|x|) fresh qubits for each repetition of the original procedure, as illustrated in Fig. 9. The form of error reduction implemented by this procedure is called weak error reduction, given that the length of the quantum proof must grow as the error decreases. (Of course one may view that it also has a strength, which is that it does not require a significant increase in circuit depth over the original procedure.)

Quantum Computational Complexity, Figure 9: An illustration of the weak error reduction procedure for QMA. The circuits Q_x represent the original verification procedure, and the circuit labeled F outputs acceptance or rejection based on the frequency of accepts among its inputs (which can be adjusted depending on a and b). It cannot be assumed that the input state takes the form of a product state ρ ⊗ ⋯ ⊗ ρ; but it is not difficult to prove that there will always be at least one such product state among the states maximizing the acceptance probability of the procedure.

The second error reduction procedure for QMA is due to Marriott and Watrous [76]. Like weak error reduction, it gives an exponential reduction in error for roughly a linear increase in circuit size, but it has the advantage that it does not require any increase in the length of the quantum proof. This form of error reduction is called strong error reduction because of this advantage, which turns out to be quite handy in some situations. The procedure is illustrated in Fig. 10, and the following theorem follows from an analysis of it.

Theorem 10. Suppose that a, b : ℕ → [0, 1] are polynomial-time computable functions and q : ℕ → ℕ is a polynomial-bounded function such that a(n) − b(n) ≥ 1/q(n) for all but finitely many n ∈ ℕ. Then for every choice of polynomial-bounded functions p, r : ℕ → ℕ such that r(n) ≥ 2 for all but finitely many n, it holds that QMA_p(a, b) = QMA_p(1 − 2^{−r}, 2^{−r}).

Quantum Computational Complexity, Figure 10: Illustration of the strong error reduction procedure for QMA. The circuit Q_x represents a unitary purification of a given QMA verification procedure on an input x, while the circuit C determines whether to accept or reject based on the number of alternations of its input qubits. The quantum proof is denoted by ρ.

Containment of QMA in PP

By means of strong error reduction for QMA, it can be proved that QMA is contained in PP, as follows. Suppose that some promise problem A = (A_yes, A_no) is contained in QMA. Then by Theorem 10 it holds that

  A ∈ QMA_p(1 − 2^{−(p+2)}, 2^{−(p+2)})    (2)

for some polynomial-bounded function p. What is important here is that the soundness probability is smaller than the reciprocal of the dimension of the space corresponding to the quantum proof, which is where strong error reduction is essential. Now, consider an algorithm that does not receive any quantum proof, but instead simply guesses a quantum proof on p qubits at random and feeds this guess into a verification procedure having completeness and soundness probabilities consistent with the above inclusion (2). To be more precise, the quantum proof is substituted by the totally mixed state on p qubits. A simple analysis shows that this algorithm accepts every string x ∈ A_yes with probability at least 2^{−(p(|x|)+1)} and accepts every string
x ∈ A_no with probability at most 2^{−(p(|x|)+2)}. The gap between these two probabilities is enough to establish that A ∈ PQP = PP.

Classical Proofs for Quantum Verification Procedures

One may also consider quantum verification procedures that receive classical proofs rather than quantum proofs. Aharonov and Naveh [8] defined a complexity class MQA accordingly.

MQA. Let A = (A_yes, A_no) be a promise problem. Then A ∈ MQA if and only if there exists a polynomial-bounded function p and a polynomial-time generated family of circuits Q = {Q_n : n ∈ ℕ}, where each circuit Q_n takes n + p(n) input qubits and produces one output qubit, with the following properties: for all x ∈ A_yes, there exists a string y ∈ Σ^{p(|x|)} such that Pr[Q accepts (x, y)] ≥ 2/3; and for all x ∈ A_no and all strings y ∈ Σ^{p(|x|)} it holds that Pr[Q accepts (x, y)] ≤ 1/3.

(This class was originally named QCMA, and is more commonly known by that name, but it is impossible to resist the urge to give this interesting class a more sensible name.) When considering the power of quantum proofs, it is the question of whether MQA is properly contained in QMA that arguably cuts to the heart of the issue. Aaronson and Kuperberg [4] studied this question and, based on a reasonable group-theoretic conjecture, argued that the group non-membership problem is likely to be in MQA.

Are Two Quantum Proofs Better than One?

An unusual yet intriguing question about quantum proofs was asked by Kobayashi, Matsumoto, and Yamakami [71]: are two quantum proofs better than one? The implicit setting in this question is similar to that of QMA, except that the verification procedure receives two quantum proofs that are guaranteed to be unentangled. The complexity class QMA(2) is defined to be the class of promise problems having such two-proof systems with completeness and soundness probabilities 2/3 and 1/3, respectively.
(It is not known to what extent this type of proof system is robust with respect to error bounds, so it is conceivable that other reasonable choices of error bounds could give distinct complexity classes.) The restriction on entanglement that is present in the definition of QMA(2) may seem artificial from a physical perspective, as there is no obvious mechanism that would prevent two entities with otherwise unlimited computational power from sharing entanglement. Nevertheless, the question of the power of QMA(2) is interesting in that it turns around the familiar question in quantum computing about the computational power of entanglement. Rather than asking whether computational power is limited by a constraint on entanglement, here the question is whether a constraint on entanglement enhances computational power. It is clear that two quantum proofs are no less powerful than one, as a verification procedure may simply ignore one of the proofs, which means that QMA ⊆ QMA(2). Good upper bounds on QMA(2), however, are not known: the best upper bound presently known is QMA(2) ⊆ NEXP, which follows easily from the fact that a nondeterministic exponential-time algorithm can guess explicit descriptions of the quantum proofs and then perform a deterministic exponential-time simulation of the verification procedure.

Quantum Interactive Proof Systems

The notion of efficient proof verification is generalized by means of the interactive proof system model, which has fundamental importance in complexity theory and theoretical cryptography. Interactive proof systems, or interactive proofs for short, were first introduced by Babai [17,19] and Goldwasser, Micali, and Rackoff [55,56], and have led to remarkable discoveries in complexity theory such as the PCP Theorem [14,15,42]. In the most commonly studied variant of interactive proof systems, a polynomial-time verifier interacts with a computationally unbounded prover that tries to convince the verifier of the truth of some statement. The prover is not trustworthy, however, and so the verifier must be specified in such a way that it cannot be convinced that false statements are true. Quantum interactive proof systems [66,99] are interactive proof systems in which the prover and verifier may exchange and process quantum information.
This ability of both the prover and verifier to exchange and process quantum information endows quantum interactive proofs with interesting properties that distinguish them from classical interactive proofs, and illustrates unique features of quantum information.

Definition of Quantum Interactive Proofs and the Class QIP

A quantum interactive proof system involves an interaction between a prover and verifier as suggested by Fig. 11. The verifier is described by a polynomial-time generated family

V = {V_j^n : n ∈ ℕ, j ∈ {1, …, p(n)}}
Quantum Computational Complexity, Figure 11 A quantum interactive proof system. There are six messages in this example, labeled 1, …, 6. (There may be polynomially many messages in general.) The arrows each represent a collection of qubits, rather than single qubits as in previous figures. The superscript n is omitted in the names of the prover and verifier circuits, which can safely be done when the input length n is determined by context
of quantum circuits, for some polynomial-bounded function p. On an input string x having length n, the sequence of circuits

V_1^n, …, V_{p(n)}^n

determines the actions of the verifier over the course of the interaction, with the value p(n) representing the number of turns the verifier takes. For instance, p(n) = 4 in the example depicted in Fig. 11. The inputs and outputs of the verifier's circuits are divided into two categories: private memory qubits and message qubits. The message qubits are sent to, or received from, the prover, while the memory qubits are retained by the verifier as illustrated in the figure. It is always assumed that the verifier receives the last message; so if the number of messages is to be m = m(n), then it must be the case that p(n) = ⌊m/2⌋ + 1.

The prover is defined in a similar way to the verifier, although no computational assumptions are made: the prover is a family of arbitrary quantum operations P = {P_j^n : n ∈ ℕ, j ∈ {1, …, q(n)}} that interface with a given verifier in the natural way. Again, this is as suggested by Fig. 11. Now, on a given input string x, the prover P and verifier V have an interaction by composing their circuits as described above, after which the verifier measures an output qubit to determine acceptance or rejection. In direct analogy to the classes NP, MA, and QMA, one defines quantum complexity classes based on the completeness and soundness properties of such interactions.

QIP Let A = (A_yes, A_no) be a promise problem, let m be a polynomial-bounded function, and let a, b : ℕ → [0, 1] be polynomial-time computable functions. Then A ∈ QIP(m, a, b) if and only if there exists an m-message quantum verifier V with the following properties:
1. Completeness. For all x ∈ A_yes, there exists a quantum prover P that causes V to accept x with probability at least a(|x|).
2. Soundness. For all x ∈ A_no, every quantum prover P causes V to accept x with probability at most b(|x|).

Also define QIP(m) = QIP(m, 2/3, 1/3) for each polynomial-bounded function m, and define QIP = ∪_m QIP(m), where the union is over all polynomial-bounded functions m.

Properties of Quantum Interactive Proofs

Classical interactive proofs can trivially be simulated by quantum interactive proofs, and so the containment PSPACE ⊆ QIP follows directly from IP = PSPACE [75,87]. Unlike classical interactive proofs, quantum interactive proofs are not known to be simulatable in PSPACE. Simulation is possible in EXP through the use of semidefinite programming [66].

Theorem 11 QIP ⊆ EXP.

Quantum interactive proofs, like ordinary classical interactive proofs, are quite robust with respect to the choice of completeness and soundness probabilities. In particular, the following facts hold [66].

1. Every quantum interactive proof can be transformed into an equivalent quantum interactive proof with perfect completeness: the completeness probability is 1 and the soundness probability is bounded away from 1. The precise bound obtained for the soundness probability depends on the completeness and soundness probabilities of the original proof system. This transformation to a perfect-completeness proof system comes at the cost of one additional round (i.e., two messages) of communication.
2. Parallel repetition of quantum interactive proofs with perfect completeness gives an exponential reduction in soundness error. An exponential reduction in completeness and soundness error is also possible for quantum interactive proofs not having perfect completeness, but the procedure is more complicated than for the perfect completeness case.

One of the major differences between quantum and classical interactive proof systems is that quantum interactive proofs can be parallelized: every quantum interactive proof system, possibly having polynomially many rounds of communication, can be transformed into an equivalent quantum interactive proof system with three messages [66]. This transformation comes at the cost of weakening the error bounds. However, it preserves perfect completeness, and the soundness error can subsequently be reduced by parallel repetition without increasing the number of messages beyond three. If classical interactive proof systems could be parallelized in this way, then it would follow that AM = PSPACE, a collapse that would surprise most complexity theorists and have major implications for the theory.

The following theorem summarizes the implications of the facts just discussed for the complexity classes QIP(m, a, b) defined above.

Theorem 12 Let a, b : ℕ → [0, 1] be polynomial-time computable functions and let p be a polynomial-bounded function such that a(n) − b(n) ≥ 1/p(n) for all but finitely many n ∈ ℕ. Also let m and r be polynomial-bounded functions. Then

QIP(m, a, b) ⊆ QIP(3, 1, 2^−r).

For a wide range of completeness and soundness probabilities this leaves just four complexity classes among those defined above: QIP(0) = BQP, QIP(1) = QMA, QIP(2), and QIP(3) = QIP.

The final property of quantum interactive proofs that will be discussed in this section is the existence of an interesting complete promise problem [85]. The problem is to determine whether the operations induced by two
quantum circuits are significantly different, or are approximately the same, with respect to the same metric on quantum operations discussed in Subsect. “A Finite Universal Gate Set”.

THE QUANTUM CIRCUIT DISTINGUISHABILITY PROBLEM
Input: Quantum circuits Q_0 and Q_1, both taking n input qubits and producing m output qubits.
Yes: δ(Q_0, Q_1) ≥ 2/3.
No: δ(Q_0, Q_1) ≤ 1/3.

Theorem 13 The quantum circuit distinguishability problem is QIP-complete with respect to Karp reductions.

Zero-Knowledge Quantum Interactive Proofs

Interactive proof systems can sometimes be made to have a cryptographically motivated property known as the zero-knowledge property [56]. Informally speaking, an interactive proof system is zero-knowledge if a verifier “learns nothing” from an interaction with the prover on an input x ∈ A_yes, beyond the fact that it is indeed the case that x ∈ A_yes. This should hold even if the verifier deviates from the actions prescribed to it by the interactive proof being considered. At first this notion seems paradoxical, but nevertheless there are many interesting examples of such proof systems. For an introductory survey on zero-knowledge, see [93].

Several variants of zero-knowledge are often studied in the classical setting that differ in the particular way that the notion of “learning nothing” is formalized. This article will only consider statistical zero-knowledge, which is the variant of zero-knowledge that is most easily adapted to the quantum setting. Suppose that A = (A_yes, A_no) is a promise problem, and (V, P) is a quantum interactive proof system for A. By this it is meant that V is the honest verifier and P is the honest prover that behave precisely as the proof system specifies. Whether or not this proof system possesses the zero-knowledge property depends on the characteristics of interactions between a cheating verifier V′ and the honest prover P on inputs x ∈ A_yes. Figure 12 illustrates such
Quantum Computational Complexity, Figure 12
A cheating verifier V′ performs a quantum operation Φ_x with the unintentional help of the honest prover
an interaction, wherein it is viewed that the cheating verifier V′ is using the interaction with P on input x to compute some quantum operation Φ_x. Informally speaking, the input to this operation represents the verifier's state of knowledge before the protocol is run, while the output represents the verifier's state of knowledge after. Now, the proof system (V, P) is said to be quantum statistical zero-knowledge if, for any choice of a polynomial-time cheating verifier V′, the quantum operation Φ_x can be efficiently approximated for all x ∈ A_yes. More precisely, the assumption that V′ is described by polynomial-time generated quantum circuits must imply that there exists a polynomial-time generated family {Q_x : x ∈ Σ*} of quantum circuits for which δ(Φ_x, Q_x) is negligible for every x ∈ A_yes. Intuitively this definition captures the notion of learning nothing, for anything that the cheating verifier could compute in polynomial time with the help of the prover could equally well have been computed in polynomial time without the prover's help.

The class of promise problems having statistical zero-knowledge quantum interactive proofs is denoted QSZK.

QSZK A promise problem A = (A_yes, A_no) is in QSZK if and only if it has a statistical zero-knowledge quantum interactive proof system.

Although it has not been proved, it is reasonable to conjecture that QSZK is properly contained in QIP, for the zero-knowledge property seems to be quite restrictive. Indeed, it was only recently established that there exist nontrivial quantum interactive proof systems that are statistical zero-knowledge [100]. The following facts [97,100] are among those known about QSZK.

1. Statistical zero-knowledge quantum interactive proof systems can be parallelized to two messages. It follows that QSZK ⊆ QIP(2).
2. The class QSZK is closed under complementation: a given promise problem A has a quantum statistical zero-knowledge proof if and only if the same is true for the problem obtained by exchanging the yes- and no-instances of A.
3. Statistical zero-knowledge quantum interactive proof systems can be simulated in polynomial space: QSZK ⊆ PSPACE.

Classical analogues of the first and second facts in this list were shown first [86]. (The classical analogue of the third fact is SZK ⊆ PSPACE, which follows trivially from IP ⊆ PSPACE.) A key step toward proving the above properties (which is also similar to the classical case) is to establish that the following promise problem is complete
for QSZK. The problem is a restricted version of the QIP-complete quantum circuit distinguishability problem.

THE QUANTUM STATE DISTINGUISHABILITY PROBLEM
Input: Quantum circuits Q_0 and Q_1, both taking no input qubits and producing m output qubits. Let ρ_0 and ρ_1 be the density matrices corresponding to the outputs of these circuits.
Yes: δ(ρ_0, ρ_1) ≥ 2/3.
No: δ(ρ_0, ρ_1) ≤ 1/3.

Theorem 14 The quantum state distinguishability problem is QSZK-complete with respect to Karp reductions.

A quantum analogue of a different SZK-complete problem known as the entropy difference problem [53] has recently been shown to be complete for QSZK [16].

Multiple-Prover Quantum Interactive Proofs

Multiple-prover interactive proof systems are variants of interactive proofs where a verifier interacts with two or more provers that are not able to communicate with one another during the course of the interaction. Classical multiple-prover interactive proofs are extremely powerful: the class of promise problems having multiple-prover interactive proofs (denoted MIP) coincides with NEXP [21]. This is true even for two-prover interactive proofs wherein the verifier exchanges just one round of communication with each prover in parallel [46]. The key to the power of multiple-prover interactive proofs is the inability of the provers to communicate during the execution of the proof system. Similar to a detective interrogating two suspects in separate rooms in a police station, the verifier can ask questions of the provers that require strongly correlated answers, limiting the ability of cheating provers to convince the verifier of a false statement.

The identification of MIP with NEXP assumes a completely classical description of the provers. Quantum information, however, delivers a surprising twist for this model: even when the verifier is classical, an entangled quantum state shared between two provers can allow for non-classical correlations between their answers to the verifier's questions.
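The measure δ in the distinguishability problems above is the trace distance (for states), δ(ρ_0, ρ_1) = ½‖ρ_0 − ρ_1‖₁, which can be evaluated directly for explicit density matrices. The two states below are toy choices for illustration, not instances drawn from the article:

```python
import numpy as np

# delta(rho0, rho1) = (1/2) * trace norm of (rho0 - rho1),
# where the trace norm is the sum of the singular values.
def trace_distance(rho0, rho1):
    return 0.5 * np.linalg.svd(rho0 - rho1, compute_uv=False).sum()

rho0 = np.array([[1.0, 0.0], [0.0, 0.0]])  # pure state |0><0|
rho1 = np.eye(2) / 2                       # maximally mixed state

print(trace_distance(rho0, rho1))  # 0.5
```

Under the 2/3 versus 1/3 promise, only pairs of states farther apart than 2/3 or closer than 1/3 are valid instances; the pair above, at distance 1/2, would violate the promise.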
This phenomenon is better known as a Bell-inequality violation [24] in the quantum physics literature. Indeed, there exist two-prover interactive proof systems that are sound against classical provers, but can be cheated by entangled quantum provers [36].

MIP* A promise problem A = (A_yes, A_no) is in MIP* if and only if there exists a multiple-prover interactive proof system for A wherein the verifier is classical and the provers may share an arbitrary entangled state.
One may also consider fully quantum variants of multiple-prover interactive proofs, which were first studied by Kobayashi and Matsumoto [70].

QMIP A promise problem A = (A_yes, A_no) is in QMIP if and only if there exists a multiple-prover quantum interactive proof system for A.

Various refinements of these classes have been studied, where parameters such as the number of provers, the number of rounds of communication, completeness and soundness probabilities, and bounds on the amount of entanglement shared between provers are taken into account. The results proved in both [36] and [70] support the claim that it is entanglement shared between provers that is the key issue in multiple-prover quantum interactive proofs. Both models are equivalent in power to MIP when provers are forbidden to share entanglement before the proof system is executed.

At the time of the writing of this article, research into these complexity classes and general properties of multiple-prover quantum interactive proofs is highly active and has led to several interesting results (such as [37,62,63], among others). Despite this effort, little can be said at this time about the relationship among the above classes and other known complexity classes. For instance, only the trivial lower bounds PSPACE ⊆ MIP* and QIP ⊆ QMIP, and no good upper bounds, are known. It has not even been ruled out that non-recursive languages could have multiple-prover quantum interactive proof systems. These difficulties seem to stem from two issues: (i) no bounds are known on the size of entangled states needed for provers to perform optimally, or nearly optimally, in an interactive proof, and (ii) the possible correlations that can be induced with entanglement are not well understood.

One highly restricted variant of MIP* that has been studied, along with its unentangled counterpart, is as follows.
⊕MIP* A promise problem A = (A_yes, A_no) is in ⊕MIP* if and only if there exists a one-round two-prover interactive proof system for A wherein the provers each send a single bit to the verifier, and the verifier's decision to accept or reject is determined by the questions asked along with the XOR of these bits. The verifier is classical and the provers are quantum and share an arbitrary entangled state.

⊕MIP This class is similar to ⊕MIP*, except that the provers may not share entanglement (and therefore can be assumed to be classical without loss of generality).
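A concrete example of such a one-round XOR proof system is the CHSH game, whose classical value can be found by brute force over deterministic strategies; entangled provers can do strictly better (reaching cos²(π/8) ≈ 0.853), which is exactly the Bell-inequality phenomenon mentioned earlier:

```python
import itertools

# CHSH as an XOR game: questions (s, t) are uniform over {0,1}^2, and the
# provers win when the XOR of their answer bits equals s AND t.
best = 0.0
for a in itertools.product([0, 1], repeat=2):      # prover 1's answer to question s
    for b in itertools.product([0, 1], repeat=2):  # prover 2's answer to question t
        wins = sum((a[s] ^ b[t]) == (s & t)
                   for s in (0, 1) for t in (0, 1))
        best = max(best, wins / 4)
print(best)  # 0.75 -- the optimal classical winning probability
```

Exhausting the 16 pairs of deterministic strategies suffices because shared classical randomness cannot improve the maximum winning probability.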
It holds that ⊕MIP = NEXP for some choices of completeness and soundness probabilities [25,58]. On the other hand, it has been proved [101] that ⊕MIP* ⊆ QIP(2) and therefore ⊕MIP* ⊆ EXP.

Other Variants of Quantum Interactive Proofs

Other variants of quantum interactive proof systems have been studied, including public-coin quantum interactive proofs and quantum interactive proofs with competing provers.

Public-Coin Quantum Interactive Proofs

Public-coin interactive proof systems are a variant of interactive proofs wherein the verifier performs no computations until all messages with the prover have been exchanged. In place of the verifier's messages are sequences of random bits, visible to both the prover and verifier. Such interactive proof systems are typically called Arthur–Merlin games [17,19], and the verifier and prover are called Arthur and Merlin, respectively, in this setting. Quantum variants of such proof systems and their corresponding classes are easily defined. For instance, QAM is defined as follows [76].

QAM A promise problem A = (A_yes, A_no) is in QAM if and only if it has a quantum interactive proof system of the following restricted form. Arthur uniformly chooses some polynomial-bounded number of classical random bits, and sends a copy of these bits to Merlin. Merlin responds with a quantum state on a polynomial-bounded number of qubits. Arthur then performs a polynomial-time quantum computation on the input, the random bits, and the state sent by Merlin to determine acceptance or rejection.

It is clear that QAM ⊆ QIP(2), but unlike the classical case [54] equality is not known in the quantum setting. It is straightforward to prove the upper bound QAM ⊆ PSPACE, while the containment QIP(2) ⊆ PSPACE is not known.

The class QMAM may be defined through a similar analogy with classical Arthur–Merlin games. Here, Merlin sends the first message, Arthur responds with a selection of random bits, and then Merlin sends a second message.
QMAM A promise problem A = (A_yes, A_no) is in QMAM if and only if it has a quantum interactive proof system of the following restricted form. Merlin sends a quantum state on a polynomial-bounded number of qubits to Arthur. Without performing any computations, Arthur uniformly chooses some polynomial-bounded number of classical random bits, and sends a copy of these bits to Merlin. Merlin responds with a second quantum state. Arthur then performs a polynomial-time quantum computation on the input, the random bits, and the two states sent by Merlin in order to determine acceptance or rejection.

Even when Arthur is restricted to a single random bit, this class has the full power of QIP [76].

Theorem 15 QMAM = QIP.

Quantum Interactive Proofs with Competing Provers

Another variant of interactive proof systems that has been studied is one where a verifier interacts with two competing provers, sometimes called the yes-prover and the no-prover. Unlike ordinary two-prover interactive proof systems, it is now assumed that the provers have conflicting goals: the yes-prover wants to convince the verifier that a given input string is a yes-instance of the problem being considered, while the no-prover wants to convince the verifier that the input is a no-instance. The verifier is sometimes called the referee in this setting, given that interactive proof systems of this form are naturally modeled as competitive games between the two provers. Two complexity classes based on interactive proofs with competing provers are the following.

RG A promise problem A = (A_yes, A_no) is in RG (short for refereed games) if and only if it has a classical interactive proof system with two competing provers. The completeness and soundness conditions for such a proof system are replaced by the following conditions:
1. For every x ∈ A_yes, there exists a yes-prover P_yes that convinces the referee to accept with probability at least 2/3, regardless of the strategy employed by the no-prover P_no.
2. For every x ∈ A_no, there exists a no-prover P_no that convinces the referee to reject with probability at least 2/3, regardless of the strategy employed by the yes-prover P_yes.
QRG A promise problem A = (A_yes, A_no) is in QRG (quantum refereed games) if and only if it has a quantum interactive proof system with two competing provers. The completeness and soundness conditions for such a proof system are analogous to those of RG.

Classical refereed games have the computational power of deterministic exponential time [45], and the same is true in the quantum setting [57]: RG = QRG = EXP. The containment EXP ⊆ RG represents an application of the arithmetization technique, while QRG ⊆ EXP exemplifies the power of semidefinite programming.

Other Selected Notions in Quantum Complexity

In this section of the article, a few other topics in quantum computational complexity theory are surveyed that do not fall under the headings of the previous sections. While incomplete, this selection should provide the reader with some sense of the topics that have been studied in quantum complexity.

Quantum Advice

Quantum advice is a formal abstraction that addresses this question: how powerful is quantum software? More precisely, suppose that some polynomial-time generated family of quantum circuits Q = {Q_n : n ∈ ℕ} is given. Rather than assuming that each circuit Q_n takes n input qubits as for the class BQP, however, it is now assumed that Q_n takes n + p(n) qubits for p some polynomial-bounded function: in addition to a given input x ∈ Σ^n, the circuit Q_n takes an advice state ρ_n on p(n) qubits. This advice state may be viewed as pre-loaded quantum software for a quantum computer. The advice state may depend on the input length n, but not on the particular input x ∈ Σ^n. Similar to a quantum proof, the difficulty of preparing this state is ignored; but unlike a quantum proof the advice state is completely trusted. The quantum complexity class BQP/qpoly is now defined to be the class of promise problems that are solved by polynomial-time quantum algorithms with quantum advice in the natural way. The first published paper on quantum advice was [80], while the definitions and most of the results discussed in this section are due to Aaronson [2].

BQP/qpoly A promise problem A = (A_yes, A_no) is in BQP/qpoly if and only if there exists a polynomial-bounded function p, a collection of quantum states {ρ_n : n ∈ ℕ} where each ρ_n is a p(n)-qubit state, and a polynomial-time generated family of quantum circuits Q = {Q_n : n ∈ ℕ} with the following properties.
For all x ∈ A_yes, Q accepts (x, ρ_|x|) with probability at least 2/3; and for all x ∈ A_no, Q accepts (x, ρ_|x|) with probability at most 1/3.

Similar to BQP without advice, it is straightforward to show that the constants 2/3 and 1/3 in this definition can be replaced by a wide range of functions a, b : ℕ → [0, 1]. As the notation suggests, BQP/qpoly is a quantum analogue of the class P/poly. This analogy may be used as the
basis for several relevant points about BQP/qpoly, quantum advice, and their relationship to classical complexity classes and notions.

1. As is well known, P/poly may be defined either in a manner similar to the above definition of BQP/qpoly, or more simply as the class of promise problems solvable by polynomial-size Boolean circuit families with no uniformity restrictions. Based on similar ideas, the class BQP/qpoly does not change if the circuit family {Q_n : n ∈ ℕ} is taken to be an arbitrary polynomial-size circuit family without uniformity constraints.
2. There is a good argument to be made that quantum advice is better viewed as a quantum analogue of randomized advice rather than deterministic advice. That is, BQP/qpoly can equally well be viewed as a quantum analogue of the (suitably defined) complexity class BPP/rpoly. It happens to be the case, however, that BPP/rpoly = P/poly. (The situation is rather different for logarithmic-length advice, where randomized advice is strictly more powerful than ordinary deterministic advice.)
3. Any combination of quantum, randomized, or deterministic advice with quantum, randomized, or deterministic circuits can be considered. This leads to classes such as BQP/rpoly, BQP/poly, P/qpoly, P/rpoly, and so on. (The only reasonable interpretation of BPP/qpoly and P/qpoly is that classical circuits effectively measure quantum states in the standard basis the instant that they touch them.) At most three distinct classes among these possibilities arise: BQP/qpoly, BQP/poly, and P/poly. This is because BQP/rpoly = BQP/poly and

BPP/qpoly = BPP/rpoly = BPP/poly = P/qpoly = P/rpoly = P/poly.

The principle behind these equalities is that nonuniformity is stronger than randomness [6].

The following theorem places an upper bound on the power of polynomial-time quantum algorithms with quantum advice [2].

Theorem 16 BQP/qpoly ⊆ PP/poly.
Although in all likelihood the class PP/poly is enormous, containing many interesting problems that one can have little hope of solving in practice, the upper bound represented by this theorem is far from obvious. The most important thing to notice is that the power of quantum advice (to a BQP machine) is simulated by deterministic advice (to a PP machine). This means that no matter how complex, a polynomial-size quantum advice state can never encode more information accessible to a polynomial-time quantum computer than a polynomial-length string can, albeit to an unbounded-error probabilistic machine.

Quantum advice has also been considered for other quantum complexity classes such as QMA and QIP. For instance, the following bound is known on the power of QMA with quantum advice [3].

Theorem 17 QMA/qpoly ⊆ PSPACE/poly.

Generally speaking, the study of both quantum and randomized advice is reasonably described as a sticky business when unbounded-error machines or interactive protocols are involved. In these cases, complexity classes defined by both quantum and randomized advice can be highly non-intuitive and dependent on specific interpretations of models. For example, Raz [83] has shown that both QIP/qpoly and IP/rpoly contain all languages, provided that the advice is not accessible to the prover. Likewise, Aaronson [2] has observed that both PP/rpoly and PQP/qpoly contain all languages. These results are quite peculiar, given that significantly more powerful models can become strictly less powerful in the presence of quantum or randomized advice. For instance, even a Turing machine running in exponential space cannot decide all languages with bounded error given randomized advice.

Space-Bounded Quantum Computation

The quantum complexity classes that have been discussed thus far in this article are based on the abstraction that efficient quantum computations are those that can be performed in polynomial time. Quantum complexity classes may also be defined by bounds on space rather than time, but here a different computational model is required to reasonably compare with classical models of space-bounded computation.
One simple choice of a suitable model is a hybrid between quantum circuits and classical Turing machines. Although this model is different from the variants of quantum Turing machines that were originally studied in the theory of quantum computing [30,40], the term quantum Turing machine (QTM for short) is nevertheless appropriate; there should be no confusion given that no other models of quantum Turing machines are discussed in this article. Figure 13 illustrates a quantum Turing machine. A quantum Turing machine has a read-only classical input tape, a classical work tape, and a quantum tape consisting of an infinite sequence of qubits, each initialized to the zero state. Three tape heads scan the quantum tape,
Quantum Computational Complexity, Figure 13 A quantum Turing machine. This variant of quantum Turing machine is classical with the exception of a quantum work tape, each square of which contains a qubit. Quantum operations and measurements can be performed on the qubits scanned by the quantum tape heads
allowing quantum operations to be performed on the corresponding qubits. (It would be sufficient to have just two, but allowing three tape heads parallels the choice of a universal gate set that allows three-qubit operations.) A single step of a QTM’s computation may involve ordinary moves by the classical parts of the machine and quantum operations on the quantum tape: Toffoli gates, Hadamard gates, phase-shift gates, or single-qubit measurements in the computational basis. The running time of a quantum Turing machine is defined as for ordinary Turing machines. The space used by such a machine is the number of squares on the classical work tape plus the number of qubits on the quantum tape that are ever visited by one of the tape heads. Similar to the classical case, the input tape does not contribute to the space used by the machine because it is a read-only tape. The following complexity classes are examples of classes that can be defined using this model. BQL A promise problem A D (Ayes ; Ano ) is in BQL (bounded-error quantum logarithmic space) if and only if there exists a quantum Turing machine M running in polynomial time and logarithmic space that accepts every string x 2 Ayes with probability at least 2/3 and accepts every string x 2 Ano with probability at most 1/3. PQL A promise problem A D (Ayes ; Ano ) is in PQL (unbounded-error quantum logarithmic space) if and only if there exists a quantum Turing machine M running in polynomial time and logarithmic space that accepts every string x 2 Ayes with probability strictly greater than 1/2 and accepts every string x 2 Ano with probability at most 1/2. BQPSPACE A promise problem A D (Ayes ; Ano ) is in BQPSPACE (bounded-error quantum polynomial
space) if and only if there exists a quantum Turing machine M running in polynomial space that accepts every string x ∈ A_yes with probability at least 2/3 and accepts every string x ∈ A_no with probability at most 1/3.

PQPSPACE A promise problem A = (A_yes, A_no) is in PQPSPACE (unbounded-error quantum polynomial space) if and only if there exists a quantum Turing machine M running in polynomial space that accepts every string x ∈ A_yes with probability strictly greater than 1/2 and accepts every string x ∈ A_no with probability at most 1/2.

Unlike the polynomial-time case, it is known that quantum information does not give a significant increase in computational power in the space-bounded case [95,98].

Theorem 18 The following relationships hold.
1. BQL ⊆ PQL = PL.
2. BQPSPACE = PQPSPACE = PSPACE.

The key relationship in the above theorem, from the perspective of quantum complexity, is PQL ⊆ PL, which can be shown using space-bounded counting complexity. In particular, the proof relies on a theory of GapL functions [12] that parallels the theory of GapP functions and allows a variety of matrix computations to be performed in PL. The above theorem, together with the containment PL ⊆ NC [32], implies that both BQL and PQL are contained in NC. An interpretation of this fact is that logarithmic-space quantum computations can be very efficiently simulated in parallel.
Quantum Computational Complexity
Bounded-Depth Quantum Circuits

The depth of a classical or quantum circuit is the maximum number of gates encountered on any path from an input bit or qubit to an output bit or qubit in the circuit. One may reasonably think of circuit depth as the parallel running time: the number of time units needed to apply a circuit when operations may be parallelized in any way that respects the topology of the circuit. The following complexity class, first defined by Moore and Nilsson [77], represents a bounded-error quantum variant of the class NC.

BQNC A promise problem A = (A_yes, A_no) is in BQNC (bounded-error quantum NC) if and only if there exists a logarithmic-space generated family {Q_n : n ∈ ℕ} of poly-logarithmic depth quantum circuits, where each circuit Q_n takes n input qubits and produces one output qubit, such that Pr[Q_n accepts x] ≥ 2/3 for all x ∈ A_yes, and Pr[Q_n accepts x] ≤ 1/3 for all x ∈ A_no.

Many other complexity classes based on bounded-depth quantum circuits have been studied as well. The survey of Bera, Green, and Homer [28] discusses several examples. In the classical case there is a very close relationship between space-bounded and depth-bounded computation [31,32]. This close relationship is based on two main ideas: the first is that space-bounded computations can be simulated efficiently by bounded-depth circuits using parallel algorithms for matrix computations, and the second is that bounded-depth Boolean circuits can be efficiently simulated by space-bounded computations via depth-first traversals of the circuit to be simulated. For quantum computation this close relationship is not known to exist. One direction indeed holds, as was discussed in the previous subsection: space-bounded quantum computations can be efficiently simulated by depth-bounded circuits. The other direction, an efficient space-bounded simulation of bounded-depth quantum circuits, is not known to hold and is arguably quite unlikely.
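The depth measure defined at the start of this subsection can be computed directly from a circuit's gate list: each gate begins one time step after the latest-finishing earlier gate acting on any of its qubits. A minimal sketch (the representation of gates as tuples of qubit indices is an illustrative choice, not something from the text):

```python
def circuit_depth(gates, n_qubits):
    """Depth of a circuit given as an ordered list of gates, each gate a
    tuple of the qubit indices it acts on. Equals the length of the longest
    input-to-output path, i.e. the parallel running time."""
    finish = [0] * n_qubits  # time step at which each qubit is next free
    for qubits in gates:
        start = max(finish[q] for q in qubits)
        for q in qubits:
            finish[q] = start + 1
    return max(finish) if gates else 0

# Example: H on qubit 0, then CNOT on qubits (0,1); a gate on qubit 2
# can run in parallel with the first layer, so the depth is 2.
print(circuit_depth([(0,), (0, 1), (2,)], 3))
```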
Informally speaking, bounded-depth quantum circuits are computationally powerful, whereas space-bounded quantum Turing machines are not. Three facts that support this claim are as follows.
1. Computing the acceptance probability of even constant-depth quantum circuits is as hard as computing acceptance probabilities for arbitrary polynomial-size quantum circuits [48].
2. Shor's factoring algorithm can be implemented by quantum circuits of logarithmic depth, together with classical pre- and post-processing [35].
Quantum Computational Complexity, Figure 14 A diagram of inclusions among most of the complexity classes discussed in this article
3. The quantum circuit distinguishability problem remains complete for QIP when restricted to logarithmic-depth quantum circuits [84].

It is reasonable to conjecture that BQNC is incomparable with BPP and properly contained in BQP.

Future Directions

There are many possible future directions for research in quantum computational complexity theory. The following list suggests just a few of many possibilities.
1. The power of multiple-prover quantum interactive proofs, for both quantum and classical verifiers, is very poorly understood. In particular: (i) no interesting upper bounds are known for either MIP* or QMIP, and (ii) neither the containment NEXP ⊆ MIP* nor NEXP ⊆ QMIP is known to hold.
2. The containment NP ⊆ BQP would be a very powerful incentive to build a quantum computer, to say the least. While there is little reason to hope that this containment holds, there is at the same time little evidence against it aside from the fact that it fails relative to an oracle [27]. A better understanding of the relationship between BQP and NP, including possible consequences of one being contained in the other, is an important direction for further research in quantum complexity.
3. Along similar lines, an understanding of the relationship between BQP and the polynomial-time hierarchy has remained elusive. It is not known whether BQP is contained in the polynomial-time hierarchy, whether there is an oracle relative to which this containment fails, or even whether there is an oracle relative to which BQP is not contained in AM.
4. Interest in complexity classes is ultimately derived from the problems that they contain. An important future direction in quantum complexity theory is to prove the inclusion of interesting computational problems in the quantum complexity classes for which this is possible. Of the many problems that have been considered, one of the most perplexing from the perspective of quantum complexity is the graph isomorphism problem. It is a long-standing open problem whether the graph isomorphism problem is in BQP. A seemingly easier task than showing the inclusion of this problem in BQP is proving it is in co-QMA; but even this has remained unresolved.
Bibliography
1. Aaronson S (2002) Quantum lower bound for the collision problem. In: Proceedings of the 35th Annual ACM Symposium on Theory of Computing. ACM Press, New York
2. Aaronson S (2005) Limitations of quantum advice and one-way communication. Theory Comput 1:1–28
3. Aaronson S (2006) QMA/qpoly is contained in PSPACE/poly: de-Merlinizing quantum protocols. In: Proceedings of the 21st Annual IEEE Conference on Computational Complexity. IEEE Computer Society Press, Los Alamitos, pp 261–273
4. Aaronson S, Kuperberg G (2007) Quantum versus classical proofs and advice. Theory Comput 3:129–157
5. Aaronson S, Shi Y (2004) Quantum lower bounds for the collision and the element distinctness problems. J ACM 51(4):595–605
6. Adleman L (1978) Two theorems on random polynomial time. In: Proceedings of the 19th Annual IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, pp 75–83
7. Adleman L, DeMarrais J, Huang M (1997) Quantum computability. SIAM J Comput 26(5):1524–1540
8. Aharonov D, Naveh T (2002) Quantum NP – a survey. Available as arXiv.org e-print quant-ph/0210077
9. Aharonov D, Kitaev A, Nisan N (1998) Quantum circuits with mixed states. In: Proceedings of the 30th Annual ACM Symposium on Theory of Computing. ACM Press, New York, pp 20–30
10. Aharonov D, Gottesman D, Irani S, Kempe J (2007) The power of quantum systems on a line. In: Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, pp 373–383
11. Aharonov D, van Dam W, Kempe J, Landau Z, Lloyd S, Regev O (2007) Adiabatic quantum computation is equivalent to standard quantum computation. SIAM J Comput 37(1):166–194
12. Allender E, Ogihara M (1996) Relationships among PL, #L, and the determinant. RAIRO – Theor Inf Appl 30:1–21
13. Arora S, Barak B (2006) Complexity Theory: A Modern Approach. Web draft available at http://www.cs.princeton.edu/theory/complexity/
14. Arora S, Safra S (1998) Probabilistic checking of proofs: a new characterization of NP. J ACM 45(1):70–122
15. Arora S, Lund C, Motwani R, Sudan M, Szegedy M (1998) Proof verification and the hardness of approximation problems. J ACM 45(3):501–555
16. Ben-Aroya A, Ta-Shma A (2007) Quantum expanders and the quantum entropy difference problem. Available as arXiv.org e-print quant-ph/0702129
17. Babai L (1985) Trading group theory for randomness. In: Proceedings of the Seventeenth Annual ACM Symposium on Theory of Computing. ACM Press, New York, pp 421–429
18. Babai L (1992) Bounded round interactive proofs in finite groups. SIAM J Discret Math 5(1):88–111
19. Babai L, Moran S (1988) Arthur-Merlin games: a randomized proof system, and a hierarchy of complexity classes. J Comput Syst Sci 36(2):254–276
20. Babai L, Szemerédi E (1984) On the complexity of matrix group problems I. In: Proceedings of the 25th Annual IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, pp 229–240
21. Babai L, Fortnow L, Lund C (1991) Non-deterministic exponential time has two-prover interactive protocols.
Comput Complex 1(1):3–40
22. Beigel R, Reingold N, Spielman D (1995) PP is closed under intersection. J Comput Syst Sci 50(2):191–202
23. Beigi S, Shor P (2007) On the complexity of computing zero-error and Holevo capacity of quantum channels. Available as arXiv.org e-print 0709.2090
24. Bell J (1964) On the Einstein-Podolsky-Rosen paradox. Physics 1(3):195–200
25. Bellare M, Goldreich O, Sudan M (1998) Free bits, PCPs, and non-approximability – towards tight results. SIAM J Comput 27(3):804–915
26. Bennett CH (1973) Logical reversibility of computation. IBM J Res Dev 17:525–532
27. Bennett CH, Bernstein E, Brassard G, Vazirani U (1997) Strengths and weaknesses of quantum computing. SIAM J Comput 26(5):1510–1523
28. Bera D, Green F, Homer S (2007) Small depth quantum circuits. ACM SIGACT News 38(2):35–50
29. Bernstein E, Vazirani U (1993) Quantum complexity theory (preliminary abstract). In: Proceedings of the 25th Annual ACM Symposium on Theory of Computing. ACM Press, New York, pp 11–20
30. Bernstein E, Vazirani U (1997) Quantum complexity theory. SIAM J Comput 26(5):1411–1473
31. Borodin A (1977) On relating time and space to size and depth. SIAM J Comput 6:733–744
32. Borodin A, Cook S, Pippenger N (1983) Parallel computation for well-endowed rings and space-bounded probabilistic machines. Inf Control 58:113–136
33. Brassard G (2003) Quantum communication complexity. Found Phys 33(11):1593–1616
34. Cleve R (2000) An introduction to quantum complexity theory. In: Macchiavello C, Palma GM, Zeilinger A (eds) Collected Papers on Quantum Computation and Quantum Information Theory. World Scientific, Singapore, pp 103–127
35. Cleve R, Watrous J (2000) Fast parallel circuits for the quantum Fourier transform. In: Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science. pp 526–536
36. Cleve R, Høyer P, Toner B, Watrous J (2004) Consequences and limits of nonlocal strategies. In: Proceedings of the 19th Annual IEEE Conference on Computational Complexity. pp 236–249
37. Cleve R, Slofstra W, Unger F, Upadhyay S (2007) Perfect parallel repetition theorem for quantum XOR proof systems. In: Proceedings of the 22nd Annual IEEE Conference on Computational Complexity. pp 109–114
38. Cook S (1972) The complexity of theorem proving procedures. In: Proceedings of the Third Annual ACM Symposium on Theory of Computing. ACM Press, New York, pp 151–158
39. de Wolf R (2002) Quantum communication and complexity. Theor Comput Sci 287(1):337–353
40. Deutsch D (1985) Quantum theory, the Church–Turing principle and the universal quantum computer. Proc Roy Soc Lond A 400:97–117
41. Deutsch D (1989) Quantum computational networks. Proc Roy Soc Lond A 425:73–90
42. Dinur I (2007) The PCP theorem by gap amplification. J ACM 54(3)
43. Du D-Z, Ko K-I (2000) Theory of Computational Complexity. Wiley, New York
44. Even S, Selman A, Yacobi Y (1984) The complexity of promise problems with applications to public-key cryptography. Inf Control 61:159–173
45.
Feige U, Kilian J (1997) Making games short. In: Proceedings of the 29th Annual ACM Symposium on Theory of Computing. ACM Press, New York, pp 506–516
46. Feige U, Lovász L (1992) Two-prover one-round proof systems: their power and their problems. In: Proceedings of the 24th Annual ACM Symposium on Theory of Computing. ACM Press, New York, pp 733–744
47. Fenner S, Fortnow L, Kurtz S (1994) Gap-definable counting classes. J Comput Syst Sci 48:116–148
48. Fenner S, Green F, Homer S, Zhang Y (2005) Bounds on the power of constant-depth quantum circuits. In: Proceedings of the 15th International Symposium on Fundamentals of Computation Theory. Lect Notes Comput Sci 3623:44–55
49. Feynman R (1982) Simulating physics with computers. Int J Theor Phys 21(6/7):467–488
50. Fortnow L (1997) Counting complexity. In: Hemaspaandra L, Selman A (eds) Complexity Theory Retrospective II. Springer, New York, pp 81–107
51. Fortnow L, Rogers J (1999) Complexity limitations on quantum computation. J Comput Syst Sci 59(2):240–252
52. Goldreich O (2005) On promise problems (a survey in memory of Shimon Even [1935–2004]). Electronic Colloquium on Computational Complexity, Report TR05-018
53. Goldreich O, Vadhan S (1999) Comparing entropies in statistical zero-knowledge with applications to the structure of SZK. In: Proceedings of the 14th Annual IEEE Conference on Computational Complexity. pp 54–73
54. Goldwasser S, Sipser M (1989) Private coins versus public coins in interactive proof systems. In: Micali S (ed) Randomness and Computation, vol 5 of Advances in Computing Research. JAI Press, Greenwich, Conn, pp 73–90
55. Goldwasser S, Micali S, Rackoff C (1985) The knowledge complexity of interactive proof systems. In: Proceedings of the Seventeenth Annual ACM Symposium on Theory of Computing. ACM Press, New York, pp 291–304
56. Goldwasser S, Micali S, Rackoff C (1989) The knowledge complexity of interactive proof systems. SIAM J Comput 18(1):186–208
57. Gutoski G, Watrous J (2007) Toward a general theory of quantum games. In: Proceedings of the 39th ACM Symposium on Theory of Computing. ACM Press, New York, pp 565–574
58. Håstad J (2001) Some optimal inapproximability results. J ACM 48(4):798–859
59. Janzing D, Wocjan P, Beth T (2005) Non-identity-check is QMA-complete. Int J Quantum Inf 3(2):463–473
60. Kaye P, Laflamme R, Mosca M (2007) An Introduction to Quantum Computing. Oxford University Press, Oxford
61. Kempe J, Kitaev A, Regev O (2006) The complexity of the local Hamiltonian problem. SIAM J Comput 35(5):1070–1097
62. Kempe J, Kobayashi H, Matsumoto K, Toner B, Vidick T (2007) Entangled games are hard to approximate. In: Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, 2008
63. Kempe J, Regev O, Toner B (2007) The unique games conjecture with entangled provers is false.
In: Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, 2008
64. Kitaev A (1997) Quantum computations: algorithms and error correction. Russ Math Surv 52(6):1191–1249
65. Kitaev A (1999) Quantum NP. Talk at AQIP'99: Second Workshop on Algorithms in Quantum Information Processing. DePaul University, Chicago
66. Kitaev A, Watrous J (2000) Parallelization, amplification, and exponential time simulation of quantum interactive proof systems. In: Proceedings of the 32nd ACM Symposium on Theory of Computing. ACM Press, New York, pp 608–617
67. Kitaev A, Shen A, Vyalyi M (2002) Classical and Quantum Computation. Graduate Studies in Mathematics, vol 47. American Mathematical Society, Providence
68. Knill E (1995) Approximation by quantum circuits. Technical Report LAUR-95-2225, Los Alamos National Laboratory. Available as arXiv.org e-print quant-ph/9508006
69. Knill E (1996) Quantum randomness and nondeterminism. Technical Report LAUR-96-2186, Los Alamos National Laboratory. Available as arXiv.org e-print quant-ph/9610012
70. Kobayashi H, Matsumoto K (2003) Quantum multi-prover interactive proof systems with limited prior entanglement. J Comput Syst Sci 66(3):429–450
71. Kobayashi H, Matsumoto K, Yamakami T (2003) Quantum Merlin–Arthur proof systems: are multiple Merlins more helpful to Arthur? In: Proceedings of the 14th Annual International Symposium on Algorithms and Computation. Lecture Notes in Computer Science, vol 2906. Springer, Berlin
72. Levin L (1973) Universal search problems. Probl Inf Transm 9(3):265–266 (English translation)
73. Liu Y-K (2006) Consistency of local density matrices is QMA-complete. In: Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Springer, Berlin; Lect Notes Comput Sci 4110:438–449
74. Lloyd S (2000) Ultimate physical limits to computation. Nature 406:1047–1054
75. Lund C, Fortnow L, Karloff H, Nisan N (1992) Algebraic methods for interactive proof systems. J ACM 39(4):859–868
76. Marriott C, Watrous J (2005) Quantum Arthur-Merlin games. Comput Complex 14(2):122–152
77. Moore C, Nilsson M (2002) Parallel quantum computation and quantum codes. SIAM J Comput 31(3):799–815
78. Moore G (1965) Cramming more components onto integrated circuits. Electronics 38(8):82–85
79. Nielsen MA, Chuang IL (2000) Quantum Computation and Quantum Information. Cambridge University Press, Cambridge
80. Nishimura H, Yamakami T (2004) Polynomial time quantum computation with advice. Inf Process Lett 90(4):195–204
81. Oliveira R, Terhal B (2005) The complexity of quantum spin systems on a two-dimensional square lattice. Available as arXiv.org e-print quant-ph/0504050
82. Papadimitriou C (1994) Computational Complexity. Addison-Wesley, Reading, Mass
83. Raz R (2005) Quantum information and the PCP theorem. In: Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science. pp 459–468
84. Rosgen B (2008) Distinguishing short quantum computations. In: Proceedings of the 25th Annual Symposium on Theoretical Aspects of Computer Science. IBFI, Schloss Dagstuhl, pp 597–608
85. Rosgen B, Watrous J (2005) On the hardness of distinguishing mixed-state quantum computations.
In: Proceedings of the 20th Annual IEEE Conference on Computational Complexity. pp 344–354
86. Sahai A, Vadhan S (2003) A complete promise problem for statistical zero-knowledge. J ACM 50(2):196–249
87. Shamir A (1992) IP = PSPACE. J ACM 39(4):869–877
88. Shor P (1994) Algorithms for quantum computation: discrete logarithms and factoring. In: Proceedings of the 35th Annual IEEE Symposium on Foundations of Computer Science. pp 124–134
89. Shor P (1997) Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM J Comput 26(5):1484–1509
90. Stinespring WF (1955) Positive functions on C*-algebras. Proc Am Math Soc 6(2):211–216
91. Toda S (1991) PP is as hard as the polynomial-time hierarchy. SIAM J Comput 20(5):865–887
92. Toffoli T (1980) Reversible computing. Technical Report MIT/LCS/TM-151, Laboratory for Computer Science, Massachusetts Institute of Technology
93. Vadhan S (2007) The complexity of zero knowledge. In: Proceedings of the 27th International Conference on Foundations of Software Technology and Theoretical Computer Science. Lect Notes Comput Sci 4855:52–70
94. Valiant L (1979) The complexity of computing the permanent. Theor Comput Sci 8:189–201
95. Watrous J (1999) Space-bounded quantum complexity. J Comput Syst Sci 59(2):281–326
96. Watrous J (2000) Succinct quantum proofs for properties of finite groups. In: Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science. pp 537–546
97. Watrous J (2002) Limits on the power of quantum statistical zero-knowledge. In: Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science. pp 459–468
98. Watrous J (2003) On the complexity of simulating space-bounded quantum computations. Comput Complex 12:48–84
99. Watrous J (2003) PSPACE has constant-round quantum interactive proof systems. Theor Comput Sci 292(3):575–588
100. Watrous J (2006) Zero-knowledge against quantum attacks. In: Proceedings of the 38th ACM Symposium on Theory of Computing. ACM Press, New York, pp 296–305
101. Wehner S (2006) Entanglement in interactive proof systems with binary answers.
In: Proceedings of the 23rd Annual Symposium on Theoretical Aspects of Computer Science. Lect Notes Comput Sci 3884:162–171
Quantum Computing
VIV KENDON
School of Physics and Astronomy, University of Leeds, Leeds, UK

Article Outline
Glossary
Definition of the Subject
Introduction
Digital Quantum Computing
Unconventional Extensions
Future Directions
Bibliography

Glossary
Bit A two-state classical system used to represent a binary digit, zero or one.
Bose–Einstein particles Integer-spin quantum particles like to be together: any number can occupy the same quantum state.
Classical computing What we can compute within the laws of classical physics.
Error correction In a quantum context, fixing errors without disturbing the quantum superposition.
Entanglement Quantum states can be more highly correlated than classical systems: the extra correlations are known as entanglement.
Fermi–Dirac particles Half-integer-spin quantum particles like to be alone: only one such particle can occupy each quantum state.
Quantum communications Using quantum mechanics to gain an advantage when transmitting information.
Quantum dense coding Classical bits can be encoded two for one into qubits for communications purposes.
Quantum key distribution Quantum mechanics allows for secure key distribution in the presence of eavesdroppers and noisy environments. The keys can then be used for encrypted communication.
Quantum teleportation A method to transmit an unknown quantum state using only classical communications plus shared entanglement.
Quantum computing Computation based on the laws of quantum mechanics for the allowed logical operations.
Qubit A two-state quantum system such as the spin of an electron or the polarization of a photon. More complex quantum particles (such as atoms) can be used as qubits if just two of their available states are chosen to represent the qubit.
Qubus A quantum version of a computer bus, the fast communications linking memory and processing registers. Can be implemented using a coherent light source (such as a laser).
Scalable A computer architecture designed from modular units that can be efficiently expanded to an arbitrary size.
Squeezing With a pair of complementary quantum observables, making one more uncertain so that the other can be measured more precisely.
Threshold result In the context of quantum computation, this result says that error correction can work if the error rate is low enough.
Tunneling Quantum particles can get through barriers that classical particles remain stuck behind. If the barrier is not infinitely high, there is some probability for the quantum particle to be on the other side of it even though it doesn't on average have enough energy to jump over the top.
Unitary operations How to control quantum systems while preserving quantum properties. Quantum systems evolving without any influence from environmental disturbance follow unitary dynamical evolution.

Definition of the Subject

Quantum computing is computing that follows the logic of quantum mechanics. The first part of this subject is well-specified after the hundred-odd years of development of quantum theory. Definitions of computing are fuzzier, and argued over by philosophers: for the purposes of this article we will leave such nuances aside in favor of exploring what is possible when the constraints of classical logic are put aside, but the limitations of the physical world are kept firmly in hand. Quantum computing is not simply computing that involves quantum effects; that would be too broad to be useful. Since transistors and lasers exploit quantum properties of matter and light, most classical computers would thus be included. Yet it is worth remembering that definitions are never as clear-cut as we would like.
Figuring out which quantum systems can be simulated efficiently by classical computers is an active area of current research, and pinning down the boundary between quantum and classical is the object of ongoing experimental investigation. Introduction This article on quantum computing is in the Unconventional Computing section of the encyclopedia. It is thus a somewhat distinctive view of quantum computing,
adapted to the context in which you are likely to be reading it. It is self-contained, in that it introduces the concepts you will need to understand the contents, but those desiring more details and a broader perspective on the subject can refer to the introductory article Quantum Information Science, Introduction to and the referenced articles therein. To many, quantum computing is already unconventional computing, depending for its conception on the elusive properties of matter at the scale of atoms rather than the simple everyday experience of counting and deduction that underpins classical computation. Yet it is already an established field in which there is a standard approach, and the possibility to be unconventional within a quantum computing context. This article is thus divided into two parts, the first forming a lightning introduction to the standard digital version of quantum computing, and the second expanding into less well-charted territory beside and beyond. In order to write about such an interdisciplinary subject for a broad audience, I have tried to keep everything at a level a non-specialist can follow, while not oversimplifying to the point of inaccuracy. Please be forgiving about the parts you could have written better yourself, and inquisitive about the parts that may have something new for you.

Digital Quantum Computing

Digital quantum computing developed a more or less parallel structure to classical digital computing, with a quantum version of each component. There are good reasons for this: many of the optimal features of classical digital computing carry over to quantum computing with little or no change. In order to explain why there is a fundamental advantage to using quantum logic for computation, this section will explain the components step by step and then provide examples of how the parts combine into the whole.

What Is a Qubit?
(Quantum Mechanics for Dummies) Quantum computing exploits the logic of quantum mechanics, which is notoriously tricky to understand. The mathematical structure is simple and linear, so treating it as “maths we happen to be able to build (we think)”, is one option for getting to grips with it. Here I’ve chosen the pictorial intuition approach, with some cartoon qubits to help explain how they behave. Those already familiar with quantum mechanics can safely skip over this section. Those still bemused by the end of it may like to sample from the wide range of available textbooks to find an approach which suits them best.
Classical computers use bits as their basic unit; any classical system with two states can represent a single bit. Tossing a coin, with the resulting "heads" or "tails", is an example of a two-state classical system. In computers, the two states are usually labeled "0" (zero) and "1" (one). Quantum computers use quantum bits, usually called qubits, as their basic unit; these are two-state quantum systems, and the two states are also usually labeled zero and one. Examples of physical systems with two quantum states include electron spin, photon polarization and phosphorus nuclear spin. These have all been used in experiments aimed towards building a quantum computer. Those who know a bit of quantum mechanics should note that in a quantum computer we usually localize or otherwise separate individual qubits so they are distinguishable from each other, and there are thus no Fermi or Bose statistics to complicate the picture. We will revisit this aspect of quantum mechanics briefly in Subsect. "Topological Quantum Computing". Here are a couple of cartoon qubits to help explain
their quantum properties. They are wriggly, fidgety little things; they move so fast you can't tell which way up they are just by looking. If you reach down to grab hold of one, the only way you can get a grip is by an arm or a leg. First we have to choose our basis states, labeled |0⟩ and |1⟩. We write the zero and one in those funny brackets ("kets") to remind us they are quantum states rather than classical bits. To simplify notation, we will sometimes write the state of several qubits together inside one ket thus: |0⟩|0⟩ ≡ |00⟩. In terms of the cartoon qubits, this means we decided to label "grabbed by the arm" as the zero state and "grabbed by the leg" as the one state. The key quantum property is that qubits can fidget around anywhere in between the zero and one states, in what is called a superposition state, for example, (|0⟩ + |1⟩)/√2 or (|0⟩ − |1⟩)/√2. If you reach down and grab a qubit in one of these superposition states, you are equally likely to grab an arm or a leg, indicating zero or one in a basis state. So you can't tell the qubit was in a superposition state by measuring it; the outcome is always a basis state. What's more, grabbing it by the arm hauls it onto its feet, and grabbing it by the leg shoves it into a handstand. So after measuring, what you found is what you now have, regardless of what it was before. Luckily, it is possible to do other things to qubits besides measuring them.
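The measurement behavior just described is easy to simulate numerically: the outcome probabilities are the squared amplitudes, and the state collapses to the observed basis state. A minimal NumPy sketch (the function name `measure` is illustrative, not from the text):

```python
import numpy as np

ket0 = np.array([1.0, 0.0])          # |0>
ket1 = np.array([0.0, 1.0])          # |1>
plus = (ket0 + ket1) / np.sqrt(2)    # superposition (|0> + |1>)/sqrt(2)

def measure(state, rng=np.random.default_rng()):
    """Computational-basis measurement: outcome 0 or 1 with probability
    given by the squared amplitudes; the state collapses to the outcome
    ('what you found is what you now have')."""
    p0 = abs(state[0]) ** 2
    outcome = 0 if rng.random() < p0 else 1
    return outcome, (ket0 if outcome == 0 else ket1)

outcome, post = measure(plus)   # outcome is 0 or 1, each with probability 1/2
```

Repeating the measurement on `post` now returns the same outcome every time, since the superposition is gone.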
The most general state of one qubit can be written α|0⟩ + β|1⟩ with |α|² + |β|² = 1, where α and β are complex numbers. The probability of measuring zero is given by |α|² and the probability of measuring one is given by |β|². (Why complex numbers? Because it works conveniently. If you don't like complex numbers, there are ways to formulate quantum mechanics that avoid using them.) Cartoon qubits with limbs don't quite capture all the subtle quantum effects, but they are a useful image to keep in mind if you are new to these concepts. This account also ignores all of the more fundamental issues to do with the interpretation of quantum mechanics, especially those generally referred to as "the quantum measurement problem". Now consider two qubits (traditionally belonging to Alice and Bob) in the state (|0⟩_A|0⟩_B + |1⟩_A|1⟩_B)/√2. This is an example of an entangled state: the two qubits are completely correlated. Suppose Alice measures her qubit: if she finds |0⟩_A then she knows Bob will find |0⟩_B when he measures his qubit. If she finds |1⟩_A then she knows Bob will find |1⟩_B. Alice can't communicate anything to Bob this way because she can't control whether she will find 0 or 1 when she measures. After they have both measured their qubits, they then share one random bit, which is a resource they can use for other tasks. The purely quantum feature of entanglement that cannot be reproduced with suitably correlated classical systems is that Alice and Bob can use any basis states they like and the perfect correlation still appears. For example, (|0⟩ + |1⟩)/√2 and (|0⟩ − |1⟩)/√2 are another possible pair of basis states: the rule is they must be orthogonal to each other. This is the equivalent of changing the orientation of your classical coordinate system. However, Alice and Bob do both have to use the same basis: for a thorough discussion of the consequences of this and other reference frames in quantum mechanics, see Bartlett et al. [1].
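The perfect correlation of this entangled state can be checked in a small simulation: sampling joint computational-basis measurements of the two-qubit state always gives Alice and Bob matching, individually random bits. A sketch (assuming a four-amplitude vector ordered |00⟩, |01⟩, |10⟩, |11⟩, with Alice's bit first):

```python
import numpy as np

# Bell state (|00> + |11>)/sqrt(2); amplitudes ordered |00>,|01>,|10>,|11>
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)

rng = np.random.default_rng(7)
probs = np.abs(bell) ** 2                    # joint outcome probabilities
samples = rng.choice(4, size=1000, p=probs)  # sample 1000 joint measurements
alice, bob = samples // 2, samples % 2       # split each outcome into two bits

assert np.all(alice == bob)   # the two bits always agree...
print(alice[:10])             # ...yet each bit on its own is a fair coin flip
```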
Now that we have qubits to make a quantum register for our quantum computer, and measurements to read out the state of the quantum register, we need some quantum gates to perform our calculation with. There are two ways to change a quantum state: measurements, which are what happen when you (or something else) observes the state of a quantum system, and unitary dynamics, which is what quantum systems do when no one is looking. Happily, it is possible to steer a quantum system without looking at it (see Fig. 1), so we can make it evolve the way we choose through applying unitary operations. Unitary operations are reversible; the simplest example, acting on a single qubit, is the Hadamard gate H. We can define
Quantum Computing, Figure 1 Unitary evolution: controlling qubits without looking at them
what it does by its effect on the basis states:

H|0⟩ = (|0⟩ + |1⟩)/√2
H|1⟩ = (|0⟩ − |1⟩)/√2                                    (1)
and of course it can also act on superposition states,

H(α|0⟩ + β|1⟩) = {(α + β)|0⟩ + (α − β)|1⟩}/√2 .          (2)

The Hadamard gate rotates the qubit through an angle of π/2. Notice also that applying the Hadamard gate twice gets you back where you started, H(H|ψ⟩) = |ψ⟩: the Hadamard gate is its own inverse. Rotation gates can rotate through any other angle θ, for example,

R_θ|0⟩ = cos(θ/2)|0⟩ + sin(θ/2)|1⟩
R_θ|1⟩ = sin(θ/2)|0⟩ − cos(θ/2)|1⟩ .                     (3)
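These gates are just 2×2 matrices, so their properties can be checked directly. A minimal numpy sketch (illustrative, not from the article), using the rotation convention of Eq. (3):

```python
# Sketch: the Hadamard and rotation gates as 2x2 matrices.
import numpy as np

H = np.array([[1, 1],
              [1, -1]]) / np.sqrt(2)

def R(theta):
    # Columns are the images of |0> and |1>:
    # R|0> = cos(theta/2)|0> + sin(theta/2)|1>
    # R|1> = sin(theta/2)|0> - cos(theta/2)|1>
    return np.array([[np.cos(theta / 2),  np.sin(theta / 2)],
                     [np.sin(theta / 2), -np.cos(theta / 2)]])

# The Hadamard gate is its own inverse: H(H|psi>) = |psi>
assert np.allclose(H @ H, np.eye(2))

# In this convention, H is the rotation through pi/2
assert np.allclose(H, R(np.pi / 2))

# Acting on a superposition reproduces Eq. (2)
alpha, beta = 0.6, 0.8
out = H @ np.array([alpha, beta])
assert np.allclose(out, np.array([alpha + beta, alpha - beta]) / np.sqrt(2))
print("all gate identities hold")
```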
To carry out computations, we also need gates that act on two qubits at once, for example, the familiar CNOT gate:

C|0⟩_c|0⟩_t = |0⟩_c|0⟩_t
C|0⟩_c|1⟩_t = |0⟩_c|1⟩_t
C|1⟩_c|0⟩_t = |1⟩_c|1⟩_t
C|1⟩_c|1⟩_t = |1⟩_c|0⟩_t                                 (4)

where the first qubit (labeled c) is the control and remains unchanged, while the second qubit (labeled t) is the target and is flipped if the control qubit is in state |1⟩. Just as with classical computation, small sets of universal quantum gates exist from which any computation can be constructed efficiently [2].
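In the computational basis |ct⟩ = |00⟩, |01⟩, |10⟩, |11⟩, the CNOT of Eq. (4) is a 4×4 matrix, and combined with a Hadamard on the control it manufactures exactly the entangled state met earlier. A sketch (illustrative):

```python
# Sketch: CNOT as a 4x4 matrix, and H + CNOT producing a Bell state.
import numpy as np

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],   # |10> -> |11>
                 [0, 0, 1, 0]])  # |11> -> |10>

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
I = np.eye(2)

# Hadamard on the control qubit, then CNOT, turns |00> into the
# entangled state (|00> + |11>)/sqrt(2).
ket00 = np.array([1, 0, 0, 0])
bell = CNOT @ np.kron(H, I) @ ket00
assert np.allclose(bell, np.array([1, 0, 0, 1]) / np.sqrt(2))
print("Bell state prepared")
```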
Quantum Computing
A particular set is usually chosen to suit the physical components of the quantum computer; there are many possible choices, and the gates described here are just a few of the one- and two-qubit gates available.

How Do Qubits Give Us Cool Computing?

Now that we have qubits to form a quantum register, quantum gates to perform our computation, and measurements to read out the result, the next step is to put them all together. We will begin with an example, which will illuminate the basic idea much more clearly than a general description. The Deutsch–Jozsa algorithm was one of the earliest quantum algorithms; it solves the following promise problem. We are promised that the function f(x): x ↦ {0, 1} is either balanced, i.e., the number of times f(x) outputs zero is equal to the number of times it outputs one, compared over all possible input values x (with 0 < x ≤ N), or f(x) is constant, i.e., the output is always either zero or one (but we don't know which) for any input x. The cost of this computation is measured in "queries", or evaluations of the function f(x). The circuit diagram in Fig. 2 shows how to solve this problem with a quantum computer. There is one register of n = ⌈log₂ N⌉ qubits to hold the input x, and one extra single qubit |y⟩. We start with all the qubits in the |0⟩ state. The first trick is the set of Hadamards applied to the |x⟩ register:

H|0...0⟩ = (|0...00⟩ + |0...01⟩ + |0...10⟩ + ... + |1...11⟩)/2^(n/2)

which is a superposition of all numbers 0 ... 2^n − 1. We then make our one query to the oracle that computes f(x) for us, using the superposition of all possible inputs now stored in |x⟩. We get back a superposition of all the answers, which we store by adding it to the |y⟩ qubit. Remember we cannot detect a superposition by measurement alone, so the superposition of all the possible answers doesn't help us yet. The second trick allows us to detect what sort of superposition of answers we got back.

Quantum Computing, Figure 2 Circuit diagram for the Deutsch–Jozsa algorithm for N < 8. See text for details. Normalization factors have been omitted from the qubit states

Before adding the answer to |y⟩, we prepare this qubit in the state (|0⟩ − |1⟩)/√2. This can be done by flipping it from |0⟩ to |1⟩ then applying a Hadamard gate. Then, when we add the answers to |y⟩, there are three possibilities:

1. the answers are all |0⟩: |y⟩|x⟩ = (|0⟩ − |1⟩)|x⟩/√2 = unchanged.
2. the answers are all |1⟩: |y⟩|x⟩ = (|1⟩ − |0⟩)|x⟩/√2 = minus what it was before.
3. the answers are half and half: |y⟩|x⟩ = something else.

The second sequence of Hadamard gates then attempts to undo the first set and convert |x⟩ back to the all-zero state. In the first two cases this succeeds (we can't detect the overall minus sign when we measure), but in the third case the Hadamards don't get us back where we started, so there will be some ones in the |x⟩ qubits, which we find when we measure them. So, if we measure all zeros for |x⟩ then f(x) is constant, while if we measure any ones then f(x) is balanced. To solve this problem classically, you have to examine the value of f(x) for one value of x at a time and compare them. As soon as you accumulate both a zero and a one in the set of answers, you know (because of the promise) that f(x) is balanced. But you can't be sure it is constant until you accumulate more than half the answers and see that they are all the same. So the best possible classical algorithm must take at least two queries, and could require up to N/2 + 1. Thus the quantum algorithm is fundamentally more efficient, requiring only the one evaluation of f(x).

How Quantum Computing Got Started (A Little Bit of History)

The idea that quantum mechanics could give us fundamentally faster computers first occurred to Feynman [3] and independently to Deutsch [4]. They both observed that quantum systems are like a parallel computer (cf. Feynman paths or many worlds, respectively). It remained a cute idea for the next decade, until a key result showing that error correction is theoretically possible implied it might be feasible to build one that really works [5,6,7].
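The Deutsch–Jozsa algorithm described above is small enough to verify with a brute-force state-vector simulation. The sketch below (illustrative, not from the article) applies the oracle as the phase (−1)^f(x), which is exactly the effect achieved by the |y⟩ = (|0⟩ − |1⟩)/√2 trick, and for simplicity takes the inputs to run over 0 ... 2^n − 1:

```python
# Sketch: brute-force state-vector simulation of the Deutsch-Jozsa circuit,
# with the oracle applied as a phase (-1)^f(x) (the standard phase-kickback
# shortcut for the |y> qubit trick).
import numpy as np

def deutsch_jozsa(f, n):
    N = 2 ** n
    # Hadamards on |0...0> give the uniform superposition over all x
    state = np.full(N, 1 / np.sqrt(N))
    # One oracle query: adding f(x) into (|0>-|1>)/sqrt(2) flips the sign
    state = state * np.array([(-1) ** f(x) for x in range(N)])
    # Second set of Hadamards, applied as the full n-qubit transform
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    Hn = H
    for _ in range(n - 1):
        Hn = np.kron(Hn, H)
    state = Hn @ state
    # Probability of measuring all zeros: 1 for constant f, 0 for balanced f
    return abs(state[0]) ** 2

n = 3
constant = lambda x: 1
balanced = lambda x: x & 1                # half the outputs 0, half 1
print(round(deutsch_jozsa(constant, n)))  # 1 -> "constant"
print(round(deutsch_jozsa(balanced, n)))  # 0 -> "balanced"
```

One oracle query decides the promise problem, exactly as the circuit analysis predicts.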
Around the same time, Shor [8] found an algorithm for factoring that could in theory break crypto schemes based on products of large primes being hard to factorize. At this point the funders sat up and took notice. Not just because factoring could become easier with a quantum computer: there are public-key crypto schemes based on other "one-way" functions, which are hard to compute given only the public key but easy to compute given the private key. By shifting the boundary between what's easy and what's hard, quantum computing threatens to break other classical one-way functions too.
At the same time, quantum communications protocols were being developed; in particular, quantum key distribution solves the crypto problem created by Shor's algorithm [9]. Quantum resources can double the channel capacity, a technique known as quantum dense coding [10], and quantum teleportation can transmit a quantum state through a classical channel [11] with the help of some shared entanglement. Commercial quantum communications devices are available now for key distribution. They work over commercial fibre optics, they work well enough to reach from earth to satellites, and handheld devices are under development [12,13]. It remains a niche market, but the successful technology in this closely related quantum information field has helped to keep the funding and optimism for quantum computers buoyant.

Error Correction for Quantum Computers

The most important result that allowed quantum computing to move from being an obscure piece of theory to a future technology is that error correction can work to protect the delicate quantum coherences (those signs, or phases, between different components of the superposition). Usually, the ubiquitous environmental disturbances randomize them very rapidly: quantum systems are very sensitive to noise, which is why we inhabit a largely classical world. The basic idea is simple: correct the errors faster than they accumulate. The main method for doing this is also conceptually simple: logical qubits are encoded in a larger physical quantum system, and the extra degrees of freedom are used for error correction. A very simple example: for every qubit in your computation, use three identical qubits. If a single-qubit error occurs, the remaining two will give you the correct answer. This is illustrated in Fig. 3. If two errors hit the three qubits, though, the majority outcome will be wrong, so this code is only good if the error rates are very low. What is not at all obvious is that this has any chance of working.
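The majority-vote intuition behind the three-qubit example can be checked with a little classical arithmetic. A minimal sketch (illustrative, not from the article), treating each of the three copies as independently hit with probability p:

```python
# Sketch: classical intuition behind the three-qubit code. Majority voting
# suppresses errors whenever the per-bit error probability p is small.

def logical_error_rate(p):
    # A majority vote fails if 2 or 3 of the three copies are hit:
    # 3 * p^2 * (1-p) + p^3 = 3p^2 - 2p^3
    return 3 * p**2 * (1 - p) + p**3

for p in [0.1, 0.01, 0.001]:
    print(p, logical_error_rate(p))  # encoded error rate beats p for small p

# Encoding helps for any p below 1/2, and helps dramatically for small p
assert logical_error_rate(0.01) < 0.01
```

The quadratic suppression (p → roughly 3p²) is the essence of why concatenating codes can win the race against accumulating errors, provided p starts below a threshold.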
Adding extra degrees of freedom (more qubits, for example) adds more places where errors can occur. Adding yet more qubits to correct these extra errors brings in yet more errors, and so on. Nonetheless, if the error rate is low enough, even the errors arising from the error correction can be corrected fast enough [5,6]. This is known as the threshold result. The details of exactly how to accomplish this are considerable, and employ many techniques from classical error correcting codes [7,8]. For those wanting to know more, there are lots of tutorials available, e. g., [14]. The crucial piece of information is thus what the actual threshold error rate is. It can only be estimated, and
Quantum Computing, Figure 3 Error correction using three qubits to encode one
current estimates suggest it is around 10⁻³ to 10⁻⁴ errors per qubit per quantum operation. This is rather smaller than the error rates in most physical systems that have been proposed for quantum computers, but not so small that it is clearly unattainable as technology is refined. Various additional methods for avoiding the effects of decoherence have since been developed, such as decoherence-free subspaces [15,16], and using the quantum Zeno effect [17,18] to keep the system evolving in the direction you want.

Programming a Quantum Computer

The general structure of many quantum algorithms follows the same pattern we already saw in the Deutsch–Jozsa algorithm. This is illustrated in Fig. 4. Each algorithm needs its own trick to make it work, so progress on expanding the number of quantum algorithms has been slow. Sometimes the same trick will work for a family of similar problems, but the number of distinct types remains limited. We have already seen an example of a promise problem in the Deutsch–Jozsa algorithm. Shor's algorithm and the many variants on this method have an essentially similar structure, while Grover's search and quantum walks have a distinctly different way of iterating towards the desired result. Quantum simulation, the earliest suggested application for quantum computation, is in a different category altogether, and we will return to it near the end, once we have all the tools necessary to understand how it works. There are also quantum versions of almost anything else computational you can think of, such as quantum game theory and quantum neural networks.

Quantum Computing, Figure 4 General structure of many quantum algorithms

Most of these other types of quantum information processing fall into the category of communication protocols, where the advantage is gained through quantum entanglement shared between more than one person or location. To consolidate our understanding of the functioning of quantum algorithms, we will briefly outline Shor's algorithm and then give an overview of quantum walks.

Shor's Algorithm

The task is to find a factor of a number N = pq, where p and q are both prime. First choose a number a < N co-prime to N. If a turns out not to be co-prime, then the task is done, since its common factor with N is a factor we were seeking. This can be checked efficiently using Euclid's algorithm, but it only happens in a very small number of cases. Then run the quantum computation shown in Fig. 5 to
find r, the periodicity of a^x (mod N). The first half of the computation creates a superposition of a^x (mod N) for all possible inputs x up to N². This upper limit is not essential; it determines the precision of the output, and thus the number of times the computation has to be repeated on average to succeed. Shor's first insight was that this modular exponentiation step can be done efficiently. The answer is then contained in the entanglement between the qubits in the upper and lower registers [19]. Shor's second insight was that a quantum version of the familiar discrete Fourier transform can be used to rearrange the state so that measuring the upper register allows one to calculate the value of r with high probability. Once we have found r, then gcd(a^(r/2) ± 1, N) gives a factor of N (with high probability). If the first attempt doesn't find a factor, repeating the computation a few times with different values of a rapidly increases the chances of success.
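As an illustrative sketch (not from the article), here is the classical skeleton of the algorithm for the textbook case N = 15, with the quantum period-finding step replaced by brute force; everything except find_period is cheap classically:

```python
# Sketch: Shor's reduction from factoring to period finding, with the
# quantum part replaced by an (exponentially slow) classical search for r.
from math import gcd

def find_period(a, N):
    # This is the step the quantum computer speeds up: find the smallest
    # r > 0 with a^r = 1 (mod N).
    r, x = 1, a % N
    while x != 1:
        x = (x * a) % N
        r += 1
    return r

N, a = 15, 7
assert gcd(a, N) == 1          # a must be co-prime to N
r = find_period(a, N)          # r = 4: 7^4 = 2401 = 1 (mod 15)
assert r % 2 == 0              # need an even period for the trick to work
p = gcd(a ** (r // 2) - 1, N)  # gcd(48, 15) = 3
q = gcd(a ** (r // 2) + 1, N)  # gcd(50, 15) = 5
print(p, q)                    # 3 5
```

For some choices of a the period is odd or the gcd step yields a trivial factor; as the text says, retrying with a different a quickly succeeds.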
Quantum Computing, Figure 5 Circuit diagram for Shor's algorithm for factoring, shown for the example of factoring 15. H is a Hadamard gate, Rn is a rotation by π/n, and the U gates perform the modular exponentiation (see text). IQFT is the inverse quantum Fourier transform, and M is the final measurement
Quantum Computing, Figure 6 Comparison of a quantum walk of 100 steps with the corresponding classical random walk. Progressively increasing the decoherence produces a “top-hat” distribution for the right choice of decoherence rate
The classical difficulty of factoring is not known, but the best classical algorithms we have are sub-exponential (exponential in a slower-than-linear function of the number of digits of N), whereas Shor's algorithm is polynomial in the number of digits. This is, roughly speaking, an exponential improvement, and it promises to bring factoring into the set of "easy to solve" problems (i.e., polynomial resources) from the "hard to solve" set (exponential resources). There is a whole family of problems using the same method, all based around identifying a hidden subgroup of an Abelian group. Some extensions have been made beyond Abelian groups, but this is in general much harder (for a review, see [20]).

Quantum Walks

Classical random walks underpin many of the best classical algorithms, so finding a faster quantum version of a random walk is a good bet for a new type of quantum algorithm. Quantum random walks that are faster than classical random walks have been found in a variety of guises, but turning them into algorithms is harder. They have been applied most successfully to problems related to searching, such as subset finding [21] and element distinctness [22]. A simple recipe for a quantum walk on a line goes as follows:

1. Start at the origin.
2. Toss a qubit (quantum coin):
H|0⟩ → (|0⟩ + |1⟩)/√2
H|1⟩ → (|0⟩ − |1⟩)/√2
3. Move left or right according to the qubit state:
S|x, 0⟩ → |x − 1, 0⟩
S|x, 1⟩ → |x + 1, 1⟩
4. Repeat steps 2 and 3 T times.
5. Measure the position of the walker, −T ≤ x ≤ T.

If you repeat steps 1 to 5 many times, you get a probability distribution P(x, T), which turns out to spread quadratically faster than a classical random walk. If we add a little decoherence (measure the quantum walker with probability p at each step), then we get a top-hat distribution for just the right amount of noise [23]; see Fig. 6. For related results on mixing times on cycles and tori, see Refs. [24,25]. Quantum walks can be used to find a marked item in an unsorted database, the equivalent of finding a person's name from their phone number by searching the telephone directory. Classical solutions have to check on average N/2 entries, and in the worst case need to check them all. This problem was investigated by Grover [26,27], who
found a quantum algorithm that is quadratically faster than classical searching. His algorithm doesn't use quantum walks, but Shenvi et al. [28] showed that quantum walks can perform the task just as well. The recipe for this is:

1. Start in a uniform distribution over a graph with N nodes.
2. Use the Grover coin everywhere except the marked node.
3. Run for approximately (π/2)√N steps.
4. The particle will now be at the marked node with high probability.

This is the inverse of starting at the origin and trying to reach a uniform (top-hat) distribution. A quadratic speed-up is the best that can be achieved by any algorithm tackling this problem [29]. The quantum walk version of Grover's search has been generalized to find more than one item. Magniez et al. [30,31] show how to detect triangles in graphs, Ambainis [22] applies quantum walks to deciding element distinctness, and Childs and Eisenberg [21] generalize to finding subsets. These generally obtain a polynomial improvement over classical algorithms, for example, reducing the running time to O(N^(2/3)) compared to O(N) for classical methods. There is one known problem for which a quantum walk can provide a more impressive speed-up. The "glued trees" problem [32] is also about finding a marked state, but this time the starting point is also specified, and an oracle is available to give information about the possible paths between the start and finish. This achieves an exponential speed-up, but has not been generalized to other problems. For a short review of quantum walk algorithms, see [33].

How Powerful Is Quantum Computing?

What we've seen so far is quantum computers providing a speed-up over classical computation, i.e., new complexity classes to add to the zoo [34], but nothing outside of the classical computability limits. The mainstream view is that quantum computing achieves the same computability as classical computing, and the quantum advantage may be characterized completely by complexity comparisons.
The intuitive way to see that this is likely to be correct is to note that, given the mathematical definition of quantum mechanics, we can always simulate quantum dynamics using classical digital computers; we just generally can't do it efficiently. The chink in this argument is that quantum mechanics is not digital. Qubits can be in an arbitrary superposition α|0⟩ + β|1⟩, where |α| can take any real value between zero and one. Thus we can only simulate the quantum dynamics to within some precision dictated by the amount of digital resources we employ. However, we cannot measure α; we can only infer an approximate value from the various measurements we do make. And classical computers also have continuous values for quantities such as voltages that we choose to interpret as binary zeros and ones. So it isn't clear there could be any fundamental difference between classical and quantum in this respect that could separate their computational ability. So will quantum computers give us a practical advantage over classical computers? How many qubits do we need to make a useful quantum computer? This depends on what we want to do.

Simulating a quantum system: for example, N two-state particles have 2^N possible different states, and the state of the system could be in a superposition of all of these 2^N possible states. Classical simulations thus require one complex number per state, i.e., 2^(N+1) doubles: 1 Gbyte can store the state of just N = 26 two-state systems. The current record is N = 36 in 1 Terabyte of storage [35]; each additional particle doubles the memory required. Thus more than 40 or so qubits is beyond current classical resources. In cases where we don't need to keep track of all the possible superpositions, e.g., if only nearest-neighbor interactions are involved, larger classical simulations can be performed; for example, see [36].

Shor's factoring algorithm: the best classical factoring to date is a 200-digit number (RSA-200), which is approximately 665 bits. Shor's quantum algorithm needs 2n qubits in the QFT register plus 5n qubits for the modular exponentiation (lower register plus ancilla qubits), a total of 7n logical qubits. A 665-bit number therefore needs 4655 logical qubits. We now need to take account of error correction. Again, this depends on the physical quantum computer and the error rates that have to be corrected.
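The resource arithmetic above is easy to verify. A small sketch (illustrative; assumes 8-byte doubles, so 16 bytes per complex amplitude):

```python
# Sketch: checking the classical-simulation memory figures. A complex
# amplitude is two 8-byte doubles, so an N-qubit state needs 16 * 2^N bytes.

def qubits_storable(num_bytes):
    n = 0
    while 16 * 2 ** (n + 1) <= num_bytes:
        n += 1
    return n

GB, TB = 2**30, 2**40
print(qubits_storable(GB))  # 26 qubits fit in 1 Gbyte
print(qubits_storable(TB))  # 36 qubits fit in 1 Tbyte

# Logical qubits for Shor on an n-bit number: 2n (QFT) + 5n (mod-exp) = 7n
n_bits = 665
print(7 * n_bits)           # 4655 logical qubits for a 665-bit number
```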
If the error rate is close to the threshold of 10⁻³ to 10⁻⁴, then more error correction is needed. For low error rates, maybe 20-200 physical qubits per logical qubit are required. For high error rates the numbers blow up quickly, to maybe 10⁵ physical qubits per logical qubit. This suggests that factoring won't be the first useful application of quantum computing, despite being the most famous; we may need Tera-qubit quantum computers to produce useful results here. Although the scaling favors quantum computing, the crossover point is very high.

Ultimate Physical Limits to Computation

This is a subject that is revisited regularly. The version by Lloyd [37] is the best known of these calculations that has an explicitly quantum flavor. Physical limits can be deduced from these fundamental constants:

The speed of light: c = 3 × 10⁸ m s⁻¹
Boltzmann's constant: k = 1.38 × 10⁻²³ J K⁻¹
The gravitational constant: G = 6.67 × 10⁻¹¹ m³ kg⁻¹ s⁻²
Planck's constant: ħ = 1.055 × 10⁻³⁴ J s

Consider 1 kg of matter occupying a volume of 1 liter = 10⁻³ m³ (a little smaller than today's laptops). How fast can this matter compute? Quantum mechanics tells us that the speed of computation is limited by its average energy E to 2E/πħ ≈ 5 × 10⁵⁰ operations per second, with E = mc² given by the mass of 1 kg. This is because the time-energy Heisenberg uncertainty principle, ΔE Δt ≥ ħ, says that a state with a spread in energy of ΔE takes a time Δt to evolve into a different state. This was extended by Margolus and Levitin [38,39] to show that a state with average energy E takes a time t ≥ πħ/2E to evolve to a different state. The number of bits it can store is limited by its entropy, S = k ln W, where W is the number of states the system can be in. For m bits, W = 2^m, so S = km ln 2, and the system can store at most m bits. Our estimate of m is thus dependent on what our laptop is made of. It also depends on the energy E: since not all of the W possible states have the same energy, in practice we don't get the maximum possible W = 2^m number of possible states, because we decided our laptop had a mass-energy of 1 kg. In practice this restriction doesn't change the numbers very much. There are various ways to estimate the number of states; one way is to assume the laptop is made up of stuff much like the early universe: high energy photons (the leftover ones being the cosmic microwave background radiation). This gives around 2 × 10³¹ bits as the maximum possible memory. Having noted that the entropy S is a function of the energy E, we can deduce that the number of operations per second per bit is given by 2Ek/ħS ∝ kT/ħ, where 1/T = ∂S/∂E defines the effective temperature of the laptop. For our 1 kg laptop this means the number of operations per second per bit is about 10¹⁹, and the temperature is about 5 × 10⁸ degrees, a very hot plasma.
This also implies the computer must have a parallel type of operation, since the bit-flip time is far smaller than the time it takes for light to travel from one side of the computer to the other. If you compress the computer into a smaller volume to reduce the parallelism, it reaches serial operation at the black hole density! The evaporation time for a 1 kg black hole is 10⁻¹⁹ seconds, after 10³² ops on 10¹⁶ bits. This is very fast, but similar in size to conventional computers. Prototype quantum computers already operate at these limits, only with most of their energy locked up as mass.
The salutary thing about these calculations is that once you scale back to a computer in which the mass isn’t counted as part of the available energy, we are actually within some sort of spitting distance of the physical limits. Computational power has been increasing exponentially for so long we have become used to it doubling every few years, and easily forget that it also requires increasing resources to give us this increased computational capacity. Based on current physics, there isn’t some vast reservoir of untapped computational power out there waiting for us to harness as our technology advances, and most of us will see a transition to a significantly different regime in our lifetimes. Which, one may argue, highlights the importance of unconventional computation, where further increases in speed and efficiency may still be available after the standard model has run out of steam. Can We Build a Quantum Computer? The short answer to this question is “yes”, and people already have built small ones, but we don’t know if we can build one big enough to be useful for something we can’t already calculate more easily with our classical computers. In 2000, DiVincenzo [40] provided a checklist of five criteria to assess whether a given architecture was capable of being used to build a scalable quantum computer, i. e., one that can be made large enough to be useful: 1. scalable qubits – to allow quantum computers to be built large enough to solve real problems 2. prepare initial state – e. g. all qubits in state j0i 3. decoherence times far longer than gate times – to allow many quantum operations before the quantum coherence is degraded 4. universal set of quantum gates – a suitable two qubit gate is sufficient 5. measurement of single qubits to read out result – can be hardest part and can be much slower than other operations. The notion of scalability is crucial in the quest to build a useful quantum computer. 
We need designs that can be tested at small sizes, registers of a few qubits, then scaled up without fundamentally changing the operations we tested at small scales. The most feasible architecture for achieving this is using small, repeatable, connectable units [41]. One of the most important models for scalable quantum computers is the combination of stationary and flying qubits. Flying qubits are usually photons while stationary qubits can be almost any other type that can interact with photons, such as atoms or quantum dots. The motivation for this model is that it can be hard to do everything in one
type of physical system; for example, one-qubit gates are easy to apply to photons while two-qubit gates are hard. Also, stationary qubits are limited to nearest-neighbor interactions, which involves many swap operations to allow qubits further apart to be gated together, but this can be overcome using flying intermediaries to connect distant qubits. Atoms in traps combined with photons are currently seen as the best scalable architecture, but this is by no means the only player in the ring.

What Have We Got so Far?

Single qubits, with enough control to initialize them in a chosen state and apply single-qubit gates to them, have been demonstrated in a wide variety of systems. This is only a partial list:

photons – using polarization or path to represent the qubit
atoms and ions in linear traps – qubit registers
atoms loaded into optical lattices one per site
quantum dots – with single spins (electrons or holes)
electrons floating on liquid helium – electron spin
phosphorus nuclei embedded in silicon – the Kane model
superconducting devices – using either currents or magnetic flux
nuclear magnetic resonance (NMR) – using nuclear spins as qubits.

Some of these systems have also been demonstrated performing two-qubit gates, sometimes even more than one gate at a time while maintaining coherence. Progress is slow but steady, and this list will likely be out of date by the time you read it, though probably not by more than a few small numbers. For more details of the main types of architectures, [42] is a good starting point. The current capabilities are limited to:

a few (less than 10) qubits entangled
of order a few (less than 20) quantum gate operations
NMR can factor 15 (using 7 qubits)
NMR qubit record is 7 (running an algorithm), and 12 (benchmarking)
atoms: 6 in a controlled superposition.
The vision is driving beautiful experiments. To find up-to-date information on current progress, visit the websites of the major experimental collaborations, for example:

NIST: http://qubit.nist.gov/
MIT/Quanta: http://www.media.mit.edu/quanta/
University Vienna: http://www.quantum.at/research/quantum-computation.html
EU Road map: http://qist.ect.it/Reports/reports.htm

and look for recent articles in Nature and Science.
Unconventional Extensions

Quantum computing as just described appears to mimic classical digital computing quite closely in terms of bits (qubits), gates, error correction, scalable architectures and so on. In many ways this is just because this part of the picture is easiest to relate to for those familiar with classical computing. In reality, neither the historical development, nor the creativity with which theorists and experimentalists are tackling the difficult task of trying to build a working quantum computer, justify this impression. In this section I will attempt to set the record straight, and hint at some of the wilder territory beyond.

What About Analogue?

Classical analogue computation was once commonplace, yet was swept aside by the advance of digital computation. To understand why this happened, we need to review what it means to binary encode our data in a digital computer. Binary encoding has profound implications for the resources required for computation, as this table explains.

Unary vs. Binary Coding

Number   | Unary                  | Binary
0        | (empty)                | 0
1        | 1                      | 1
2        | 11                     | 10
3        | 111                    | 11
4        | 1111                   | 100
N        | N symbols              | log₂N bits
Read out | distinguish between measurements with N outcomes | log₂N measurements with 2 outcomes each
Accuracy | errors scale linearly in N | errors scale ∝ log N
Binary encoding provides an exponential advantage (reduction) in the amount of memory required to store the data compared to unary encoding. It also means the precision costs are exponentially smaller. To double the accuracy of a unary representation requires double the resources, while a binary representation achieves this with a single extra bit. This is the main reason why digital computing works so well. It is so obvious that it is easy to forget there was once analogue computing, in which the data is represented directly as the magnitude of a voltage or water-level displacement, or, in the slide rule, with the logarithms etched onto the stick. Those of you too young to have owned a slide rule should now play with Derek's virtual slide rules here: http://www.antiquark.com/sliderule/sim/ to get a feel for what one version of analogue computation is like.
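The exponential gap in the table is worth seeing with concrete numbers. A sketch (illustrative):

```python
# Sketch: resource scaling of unary versus binary coding.
from math import ceil, log2

def unary_size(N):
    # one symbol per count
    return N

def binary_size(N):
    # bits needed to write the number N
    return max(1, ceil(log2(N + 1)))

for N in [4, 1000, 10**6]:
    print(N, unary_size(N), binary_size(N))

# Doubling the range costs double the symbols in unary,
# but only one extra bit in binary:
assert binary_size(2 * 10**6) == binary_size(10**6) + 1
```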
It does not, of course, have to be binary encoding; numbers in base three or base ten, for example, gain the exponential advantage just as well. One of the first to discuss the implications of binary encoding in a quantum computing context was Jozsa [43], and this was refined by Blume-Kohout et al. [44] and Greentree et al. [45]. Binary encoding matters for quantum computing too, in the standard digital model: it gains the same exponential advantage from encoding the data into qubits as classical digital computers gain from using bits.

Continuous Variable Quantum Computing

Although many quantum systems naturally have a fixed number of discrete states (the polarization of a photon or an electronic spin, for example), there are also variables like position x and momentum p that can take any real value. These can be used to represent data in the same way as voltages or other continuous quantities are used in classical analogue computers. This quantum version of analogue computation is usually known as continuous variable, or CV, computation. The important difference between quantum and classical continuous variables is that a conjugate pair, such as position and momentum, are not independent of each other. Instead, x and p are related by the commutator

[x, p] = iħ .

Knowing x exactly implies p is completely unknown, and vice versa; trading precision in one variable against the other is called squeezing. The most commonly described continuous variable quantum system uses electromagnetic fields, but rather than specializing to the case of single photons as one does for digital quantum computing, single modes of the field are employed. Braunstein and van Loock [46] provide a comprehensive review of CV quantum information processing for those interested in further details. Lloyd and Braunstein [47] showed that CV computation is universal, i.e., it can compute anything that a digital quantum computer can compute.
They did this by showing that it is possible to obtain any polynomial in x and p efficiently using operations that are known to be available in physical systems. The universal set of operations has to include a nonlinear operation applied to one of the variables, which in this context means higher than quadratic in x and p. For electromagnetic fields, an example of a suitable nonlinear operation is the Kerr effect, which produces a quartic term in x and p; this has the effect of an intensity-dependent refractive index. Unfortunately, although it is observable, in real physical systems the Kerr effect is too weak to be of practical use in CV computation. Without the nonlinear term, the operations available are Gaussian preserving, which means they can be simulated efficiently with a classical computer [48]. Some operations are easier
than in digital quantum computers, the Fourier transform is a single operation, rather than a long sequence of controlled rotations and Hadamard gates. Continuous variable information processing is especially suitable for quantum communications because of the practicality of working with bright beams of light instead of single photons. Most quantum communications tasks only require Gaussian operations, which are easily achieved with standard optical elements. Quantum repeaters [49] to extend quantum communications channels over arbitrary distances are a good example. However, a useful CV quantum computer isn’t a practical option using current experimental capabilities. The precision that can be obtained in a single mode is limited by the degree of squeezing that can be performed on the mode, currently about 10 dB, which is equivalent to 3.7 bits. An extra bit of precision doubles the required squeezing so CV quantum computation suffers from the same precision problems as classical analogue computing. Hybrid Quantum Computing Architectures Having noted above that continuous variable quantum systems are especially suited to quantum communications tasks, the obvious way to use them for quantum computing is to make a high speed quantum “bus” (qubus) to communicate between stationary qubit registers. Although light and matter interact in simple easy ways, creating a suitable gate to transmit the quantum information between qubits is a far from trivial task. Coherent light as used for continuous variable quantum information is also often employed to measure the final state of the matter qubits at the end of the computation. Unless the required operations are done in a way that reveals only trivial information about the quantum state while the computation is in process, the quantum superpositions will instead be destroyed. The solution is built around an elegant implementation of a parity gate, illustrated in Fig. 7. 
The left half shows a coherent light beam (blue) interacting with two qubits in turn (green) by means of an interaction that causes a phase shift in the light that depends on the state of the qubit. The right half shows the phase of the light after the interactions. If the two qubits are in the same state, the light will have a phase in one of the two blue positions. If the two qubits are in different states, the two phase shifts will cancel, leaving the phase unaltered (red). Provided the angle of the shift is large enough, the two cases can be distinguished by measuring only the coordinate along the horizontal axis, not the vertical distance away from it. For a detailed exposition illustrated in a superconducting qubit setting, see [50].
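The geometry of Fig. 7 can be checked with a few lines of arithmetic. In this sketch (illustrative only: the conditional phase theta and beam amplitude alpha are arbitrary assumed values) the probe beam picks up +theta from a qubit in 0 and −theta from a qubit in 1:

```python
import cmath

theta = 0.3   # assumed conditional phase shift per qubit (radians)
alpha = 5.0   # assumed coherent-beam amplitude

def beam_after(qubits):
    """Phase picked up by the probe: +theta for each qubit in 0, -theta for each in 1."""
    phase = sum(theta if q == 0 else -theta for q in qubits)
    return alpha * cmath.exp(1j * phase)

for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, round(beam_after(pair).real, 3))
# (0,1) and (1,0): the shifts cancel and the horizontal coordinate stays at alpha;
# (0,0) and (1,1): it shrinks to alpha*cos(2*theta), identically for both,
# so measuring the horizontal coordinate alone reveals only the parity
```

Because the two even-parity cases give the same horizontal coordinate, the measurement distinguishes even from odd parity without learning which of the two states within each parity class is present, which is why the superpositions survive.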
Quantum Computing, Figure 7 Qubus parity gate operation
Cluster State Quantum Computing

First proposed by Raussendorf and Briegel [51], and called a “one way quantum computer”, cluster state quantum computing has no direct classical equivalent, yet it has become so popular that it is currently almost the preferred architecture for the development of quantum computers. Originally designed for atoms in optical lattices, the idea is to entangle a whole array of qubits in global operations, then perform the computation by measuring the qubits in sequence. This architecture trades space resources for a temporal speed-up, requiring roughly n² physical qubits where the gate model uses n. The measurement sequence can be compressed to allow parallel operation for all but a few measurements that are determined by previous results, and require the outcomes to be fed forward as the computation proceeds. One obvious prerequisite for this architecture is qubits that can be measured quickly and easily. Measurement is often slower than other gate operations, so in practice this limits which types of qubits can be used. Since the measurements proceed from one side of the lattice to the other, see Fig. 8, instead of making the whole array in one go, the qubits can be added in rows as they are needed. This also allows the use of a probabilistic method for preparing the qubits, which can be repeated until it succeeds, in a separate process from the main computation [53]. This idea grew out of schemes for linear optical quantum computation in which gates are constructed probabilistically and teleported into the main computation once they succeed [54]. Adding rows of qubits “just in time” does remove one of the main advantages of the original proposal, i. e., making the cluster state in a few simple global operations. However, it gains in another respect: the data is regularly being moved into fresh qubits, so the whole system doesn't have to be kept free of decoherence throughout the computation.
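A minimal numerical sketch of this measurement-driven style of computation, using numpy (two qubits only; the input state is an arbitrary example): entangle the input qubit with a |+⟩ qubit via a controlled-Z, then measure the first qubit in the X basis. The second qubit is left holding the Hadamard of the input, up to the fed-forward Pauli correction X^m that depends on the measurement outcome m:

```python
import numpy as np

ket0, ket1 = np.array([1, 0], complex), np.array([0, 1], complex)
plus, minus = (ket0 + ket1) / np.sqrt(2), (ket0 - ket1) / np.sqrt(2)
H = np.array([[1, 1], [1, -1]], complex) / np.sqrt(2)   # Hadamard
X = np.array([[0, 1], [1, 0]], complex)                 # Pauli X correction
CZ = np.diag([1, 1, 1, -1]).astype(complex)             # controlled-Z entangler

psi_in = np.array([0.6, 0.8j])              # assumed example input qubit
state = CZ @ np.kron(psi_in, plus)          # two-qubit "cluster"

checks = []
for m, outcome in enumerate([plus, minus]):          # X-basis outcomes m = 0, 1
    # project qubit 1 onto the outcome, leaving qubit 2's state behind
    qubit2 = outcome.conj() @ state.reshape(2, 2)
    qubit2 = qubit2 / np.linalg.norm(qubit2)
    qubit2 = np.linalg.matrix_power(X, m) @ qubit2   # feed-forward X^m
    checks.append(np.allclose(qubit2, H @ psi_in))
print(checks)   # [True, True]: either outcome yields H applied to the input
```

This is the smallest unit of the scheme: no unitary gate is ever applied to the logical qubit, yet a Hadamard has been performed on it purely by entangling and measuring.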
Quantum Computing, Figure 8 Fragment of a cluster state quantum computation. The computation can be thought of as proceeding from left to right, with the qubit register arranged vertically, though all qubits in these gates can be measured at the same time. White qubits do not contribute directly to the computation and are “removed” by measuring in the Z basis. Pink qubits are measured in the Y basis and form the gates, at the top a CNOT and at the bottom a Hadamard. Blue qubits are measured in the X basis and transmit the logical qubit through the cluster state. The black qubits contain the output state (not yet measured, since their measurements are determined by the next operations). Details in [52]

Cluster state quantum computation has also been important on a theoretical level because it made people think harder about the role of measurement in quantum computing [55]. Instead of striving to obtain perfect unitary quantum evolution throughout the computation, cleverly designed measurements can also be used to push the quantum system along the desired trajectory.

Topological Quantum Computing

In Sect. “Digital Quantum Computing” we learned about qubits: distinguishable quantum two-state systems. But there are more exotic quantum species; indeed, quantum particles are indistinguishable when more than one of them is present in the same place. This means that if you swap two of them, you can't tell anything happened. And there are precisely two ways to do this for particles living in our familiar three spatial dimensions:

|ψ_A⟩₁ |ψ_B⟩₂ → |ψ_B⟩₁ |ψ_A⟩₂    (5)

corresponding to Bose–Einstein particles (photons, helium atoms), or

|ψ_A⟩₁ |ψ_B⟩₂ → −|ψ_B⟩₁ |ψ_A⟩₂    (6)
corresponding to Fermi–Dirac particles (electrons, protons, neutrons). That extra minus sign can't be measured directly, but it does have observable effects, so we do know the world really works this way. One important consequence is that Fermi–Dirac particles cannot occupy the same quantum state, whereas Bose–Einstein particles can, and frequently do. From this we get, among many other things, lasers (photons – Bose–Einstein particles) and semiconductors (electrons – Fermi–Dirac particles), without which modern technology would be very different indeed. A simple change of sign when swapping particles won't make a quantum computer, but something in between plus one and minus one can. For this we have to restrict our particles to two spatial dimensions. Then it is possible to have particles, known as anyons, that acquire more complicated phase factors when exchanged. Restricting quantum particles to fewer than three spatial dimensions is not as hard as it sounds. A thin layer of conducting material sandwiched between insulating layers can confine electrons to two-dimensional motion, for example. The first proposal for a quantum computer using anyons was from Kitaev [56], using an array of interacting spins on a planar surface. The anyons are formed from excitations of the spins – think of this as similar to the way electronic displays such as those found in trains and stations often work by flipping pixels between black and yellow to form characters. This opened up the field to a host of connections with group theory, gauge theory, and braid (knot) theory, and produced a new quantum algorithm to approximate the Jones polynomial [57], a fundamental invariant in braid theory. If you already know something about any of the above mathematical subjects, then you can jump into the many accounts of topological quantum computation (also going by the names geometric or holonomic quantum computing). A recent introduction can be found in [58].
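The three families differ only in the phase factor attached to an exchange. A toy bookkeeping sketch (not a simulation of real anyons; the anyon phase of π/4 is an arbitrary illustrative choice):

```python
import cmath, math

def exchange(amplitude, phi):
    """Swap two identical particles: the amplitude picks up a phase exp(i*phi)."""
    return amplitude * cmath.exp(1j * phi)

amp = 1 + 0j
for name, phi in [("boson", 0.0), ("fermion", math.pi), ("anyon", math.pi / 4)]:
    once, twice = exchange(amp, phi), exchange(exchange(amp, phi), phi)
    print(f"{name}: one swap {once:.3f}, two swaps {twice:.3f}")
# bosons: unchanged; fermions: sign flip, undone by a second swap;
# anyons: even two swaps leave a residual phase (here exp(i*pi/2) = i)
```

The fact that a double exchange need not return an anyonic state to where it started is exactly the extra structure that makes braiding anyons computationally useful.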
Quantum Computing, Figure 9 Aharonov–Bohm effect

Here is a simple example of how quantum phases work, to further illuminate the idea. The Aharonov–Bohm effect [59] was discovered in 1959 and has been observed experimentally. It involves two ingredients. First, electrons can be projected through a double slit much like photons can, and they produce an interference pattern when detected on the far side of the slits. If you aren't familiar with Young's double slit experiment [60], note that you need to think about it in terms of single electrons going through both slits in superposition. This is an example of the quantum behavior of particles that combines both wave and particle properties. The interference pattern arises because the electron can't land on the detector in places where the path length difference is half a wavelength, and prefers to land near where the path difference is a whole number of wavelengths. The second ingredient is a solenoid, a coil of wire with a current running through it. It generates a magnetic field inside the coil that does not leak outside. If you put the solenoid between the two slits, see Fig. 9, the interference pattern shifts, even though there is no magnetic field where the electrons are traveling (they can't pass through the solenoid). From this we deduce two things: interference patterns are a way to detect the extra quantum phases that are generated in topological quantum effects, and, even though there was no interaction between the electrons and the magnetic field, i. e., no energy exchanged, the magnetic field still affected the quantum properties of the electrons just by being inside the two paths the electrons take through the slits. Now imagine shrinking that experiment down to a magnetic field with a charge (electron) encircling it. This is one way to think of anyons. Confined to two dimensions, the anyons can be moved round each other, and the charge part of one going round the magnetic field part of the other affects the quantum phase. When you want to measure them, you manipulate them so that they form an interference pattern. The rest of the details of how to make them perform quantum computation comes down to the mathematics of group theory. Apart from delighting the more mathematically-minded quantum computing theorists, this type of quantum computing may have some practical advantages. The effects are generated just by moving the anyons about, and
the exact path isn't critical, only that they go around the right set of other anyons. So there is less chance of errors and disturbance spoiling the quantum computation. However, constructing a physical system in which anyonic excitations can be manipulated to order is not so easy. In fact, it has not yet been demonstrated experimentally at all, though there is no shortage of suggestions for how to do it; see, for example, [61].

Quantum Computing, Figure 10 Anyons: charge (blue) encircling magnetic field (flux, red+yellow)

Adiabatic Quantum Computing

Adiabatic quantum computation is the quantum equivalent of simulated annealing. Simulated annealing is a computational method that finds the minimum solution to a problem by starting from an initial state and evolving towards lower values until a minimum is found. In order to avoid getting stuck in a local minimum that is higher than the desired solution, some random moves that increase the value are necessary. This can be thought of as starting in a high temperature distribution and slowly cooling the simulation until the lowest energy state is found. The slowness of the cooling allows it to get out of local minima on the way. Quantum mechanics provides a more effective tool for this problem, thanks to the quantum adiabatic theorem. Quantum states don't get stuck in local minima: they can “tunnel” through barriers to lower-lying states. So there is no need to start in a random initial state; instead, the evolution starts in a minimum energy state that is easy to prepare and transforms from there into the minimum energy state we are interested in. The quantum adiabatic theorem tells you how long it will take to evolve to the desired state without ending up in some higher energy state by mistake. Suppose you can prepare a system in the ground state |ψ₀⟩ of the Hamiltonian H_simple, and you want to find the ground state |φ₀⟩ of the Hamiltonian H_hard. Then, provided you can evolve the system under both Hamiltonians together, you apply the following Hamiltonian H(t) to |ψ₀⟩:

H(t) = (1 − α(t)) H_simple + α(t) H_hard    (7)
where α(0) = 0, α(T) = 1 and 0 ≤ α(t) ≤ 1. The total time the simulation will take is T, and α(t) controls how fast the Hamiltonians switch over. The key parameter is the difference, or gap Δ, between the ground and first excited energy states. It varies as the state evolves from |ψ₀⟩ to |φ₀⟩, and the smaller it is, the slower the state must evolve to stay in the ground state. Let Δ_min be the smallest gap that occurs during the adiabatic evolution. Then the quantum adiabatic theorem says that provided T ≫ ε/Δ_min², the final state will be |φ₀⟩ with high probability. Here ε is a quantity that relates the rate of change of the Hamiltonian to the ground state and first excited state. It is usually about the same size as the energy of the system. The important parameter is thus Δ_min. For hard problems, Δ_min will become small and the required time T will be large.

Adiabatic quantum computing was introduced by Farhi et al. [62] as a natural way to solve SAT problems. SAT stands for satisfiability, and a typical problem is expressed as a Boolean expression, for example,

B(x₁, x₂, x₃, x₄) = (x₁ ∨ x₃ ∨ x̃₄) ∧ (x̃₂ ∨ x₃ ∨ x₄) ∧ (x̃₁ ∨ x₂ ∨ x̃₃)    (8)

a set of clauses of three variables each combined with “or” that are then combined together with “and”. The tilde over a variable indicates the negation of that variable. The problem is to determine whether there is an assignment of the variables x₁, x₂, x₃, x₄ that makes the expression true (in the above simple example there are several such assignments, such as x₁ = 0, x₂ = 0, x₃ = 0, x₄ = 0). The Boolean expression is turned into a Hamiltonian by constructing a cost function for each clause. If the clause is satisfied, the cost is zero; if not, the cost is one unit of energy. The Hamiltonian is the sum of all these cost functions. If the ground state energy of the Hamiltonian is zero the expression can be satisfied, but if it is non-zero then some clauses remain unsatisfied whatever the values of the variables. The adiabatic quantum computation only works efficiently if it runs in a reasonably short amount of time, which in turn requires the gap Δ_min to remain large enough throughout the evolution. Figuring out whether this is so is not easy in general, and has led to some controversy about the power of quantum computation. The 3SAT problem (clauses with three variables as above) is in
the general case a hard problem for which all known classical algorithms require exponential resources. Clearly adiabatic quantum computing can also solve it, but the question is how long it takes. The best indications are that the gap will become exponentially small for hard problems, and thus the adiabatic quantum computation will require exponential time to solve it. For dissenting views see, for example, [63], discussing the traveling salesman problem, another standard “hard” problem. One of the difficulties in analyzing this issue is that adiabatic quantum computation uses the continuous nature of quantum mechanics, and care must be taken when assessing the necessary resources to correctly account for the precision requirements. On the other hand, we do know that adiabatic quantum computing is at least equivalent to digital computation (i. e., it can solve all the same problems using equivalent resources) [57,64,65]. There are also indications that it can be more resilient to some types of errors [66]. It has even been implemented in a toy system using nuclear magnetic resonance (NMR) quantum computation [67].

Quantum Simulation

We have already noted that quantum simulation was the original application that inspired Feynman [3] and Deutsch [4] to propose the idea of quantum computation. We have also explained why quantum systems cannot generally be simulated efficiently by classical digital computers, because of the need to keep track of the enormous number of superpositions involved. In 1996 Lloyd [68] proved that a quantum system can simulate another quantum system efficiently, making the suggestions from Feynman and Deutsch concrete. The first step in quantum simulation is trivially simple: the quantum system to be simulated is mapped onto the quantum simulator by setting up a one-to-one correspondence between their state spaces (which are called Hilbert spaces for quantum systems).
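The bookkeeping for this first step is straightforward: a system with a d-dimensional state space needs ⌈log₂ d⌉ qubits in the simulator. A quick sketch (the particular spin systems chosen are illustrative examples):

```python
import math

def qubits_needed(dim: int) -> int:
    """Qubits whose Hilbert space is large enough to hold a dim-dimensional one."""
    return math.ceil(math.log2(dim))

print(qubits_needed(2**40))   # 40 spin-1/2 particles -> 40 qubits
print(qubits_needed(3**25))   # 25 three-level systems -> also fits in 40 qubits
```

The qubit count grows only logarithmically in the dimension, which is exactly why a simulator of around 40 qubits can already hold state spaces far beyond the reach of classical memory.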
The second step is non-trivial: Lloyd proved that the unitary evolution of the quantum system being studied can be efficiently approximated in the quantum simulator by breaking it down into a sequence of small steps that use standard operations. Mathematically this can be written

exp{iHt} ≃ (exp{iH₁t/n} exp{iH₂t/n} ⋯ exp{iH_j t/n})^n + O(t²/n) [H_j, H_k]

where exp{iHt} is the unitary evolution of the quantum system (H is called the Hamiltonian) and H₁ … H_j are a sequence of standard Hamiltonians available for the quantum simulator. The last term says that the errors are small provided the number of steps n is chosen large enough, i. e., the step size t/n is small enough. There are variations and improvements on this that use higher order corrections to this approximation, see for example, [69]. Quantum simulation has even been demonstrated experimentally [70] in a small test system using a nuclear magnetic resonance (NMR) quantum computer. Just as with the quantum algorithms discussed in the first section, evolving the quantum simulation is only half of the job. Extracting the pertinent information from the simulation requires some tricks, depending on what it is you want to find out. A common quantity of interest is the spectral gap, the energy difference between the ground and first excited states, the quantity discussed in Subsect. “Adiabatic Quantum Computing”. One way to obtain this is to create a superposition of the two energy states and evolve it for a period of time. The phase difference accumulated between the ground and excited state will be proportional to the spectral gap. To measure the phase difference requires a Fourier transform, either quantum or classical. Either way, the simulation has to be repeated or extended to provide enough data points to obtain the spectral gap to the desired precision. Even creating the initial state in a superposition of ground and excited states is non-trivial. It has been shown that this can be done for some systems of interest by using quasi-adiabatic evolution starting from the ground state. By running the adiabatic evolution a bit too fast, the state of the system doesn't remain perfectly in the ground state, and the required superposition is generated [71,72]. However, unlike classical simulation on a digital computer, there is no binary encoding, so accuracy is a problem. In particular, accuracy does not scale efficiently with the time needed to run the simulation [73]. Is this going to be a problem in practice?
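The operator-splitting approximation quoted above is easy to test numerically. A sketch using numpy, with two small non-commuting Hamiltonians chosen arbitrarily for illustration; the error falls roughly as 1/n, as the O(t²/n) term predicts:

```python
import numpy as np

def expm_i(H, s):
    """exp(i*s*H) for a Hermitian matrix H, via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(H)
    return (vecs * np.exp(1j * s * vals)) @ vecs.conj().T

# Two non-commuting Hamiltonians, chosen arbitrarily for illustration
H1 = np.array([[0, 1], [1, 0]], complex)    # Pauli X
H2 = np.array([[1, 0], [0, -1]], complex)   # Pauli Z
t = 1.0
exact = expm_i(H1 + H2, t)                  # evolution under the full Hamiltonian

errors = []
for n in (4, 16, 64):
    step = expm_i(H1, t / n) @ expm_i(H2, t / n)
    errors.append(np.linalg.norm(exact - np.linalg.matrix_power(step, n)))
print(errors)   # error shrinks roughly like 1/n, matching the O(t^2/n) term
```

Only the ability to apply the individual "standard" evolutions is needed; the full Hamiltonian is never exponentiated directly, which is the whole point of the decomposition.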
Poor scaling for accuracy didn't stop people building useful analogue computers from electrical circuits or water pipes in the days before digital computers were widely available. As we noted, a quantum simulator becomes useful at around 40 qubits in size, which means it may well be the first real application of quantum computing to calculate something we didn't already know. In the end, the accuracy required depends on the problem we are solving, so we won't know whether it works until we try.

Future Directions

The wide range of creative approaches to quantum computing is largely a consequence of the early stage of development of the field. There is as yet no proven working method to build a useful quantum computer, so new ideas stand a fair chance of being the winning combination. We do not even have any concrete evidence that we will be able to build a quantum computer that actually outperforms classical computers. But we also have no evidence to the contrary, that quantum computers this powerful are impossible for some fundamental physical reason, so this open question, plus the open questions such a quantum computer may be able to solve, keeps the field vibrant and optimistic. Beyond the precision quantum engineering required to construct any of the designs described thus far, there is plenty of discussion of quantum computational processes in natural systems. While all natural systems are made up of particles that obey quantum mechanics at a fundamental level, the extent to which their behavior requires quantum logic to describe it is far less ubiquitous. Quantum coherences are more usually dissipated into the surrounding environment than marshaled into co-ordinated activity. It has been suggested that the brain may exhibit quantum computational capabilities on two different levels. Firstly, Hameroff and Penrose [74] have argued that brain cells amplify quantum effects that feed into the way the brain functions. This is disputable on physical grounds when the time-scales and energies involved are considered in detail. Secondly, a quantum-like logic for brain processes has been proposed, for example, see [75]. This does not require actual quantum systems to provide it, and is thus not dependent on individual brain cells amplifying quantum effects. Whether any biological computing exploits quantum logic is an open question. Since typical biological temperatures and time scales are generally not a hospitable regime for maintaining quantum coherences, we might expect any such quantum effects to be rare at the level of complex computations.
Clearly quantum effects are biologically important for basic processes such as photosynthesis, for example, and transport properties such as the conductance and coherence of single electrons through biological molecules are the subject of much current study. These are perhaps more likely to feature as problems solved by one of the first quantum simulators than as examples of natural quantum computers.
Bibliography

Primary Literature

1. Bartlett SD, Rudolph T, Spekkens RW (2006) Reference frames, superselection rules, and quantum information. Rev Mod Phys 79:555; ArXiv:quant-ph/0610030
2. Dawson CM, Nielsen MA (2006) The Solovay–Kitaev algorithm. Quantum Inf Comp 6:81–95; the Solovay–Kitaev theorem dates from 1995, but was only partially published in pieces – this reference gives a more comprehensive review; ArXiv:quant-ph/0505030
3. Feynman RP (1982) Simulating physics with computers. Int J Theor Phys 21:467
4. Deutsch D (1985) Quantum theory, the Church–Turing principle and the universal quantum computer. Proc R Soc Lond A 400(1818):97–117
5. Knill E, Laflamme R, Zurek W (1996) Threshold accuracy for quantum computation. ArXiv:quant-ph/9610011
6. Aharonov D, Ben-Or M (1996) Fault tolerant quantum computation with constant error. In: Proc 29th ACM STOC. ACM, NY, pp 176–188; ArXiv:quant-ph/9611025
7. Steane A (1996) Multiple particle interference and quantum error correction. Proc Roy Soc Lond A 452:2551; ArXiv:quant-ph/9601029
8. Shor PW (1997) Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM J Sci Statist Comput 26:1484
9. Bennett CH, Brassard G (1984) Quantum cryptography: public-key distribution and coin tossing. In: IEEE International Conference on Computers, Systems and Signal Processing. IEEE Computer Society Press, Los Alamitos, pp 175–179
10. Bennett CH, Wiesner SJ (1992) Communication via one- and two-particle operators on Einstein–Podolsky–Rosen states. Phys Rev Lett 69(20):2881–2884
11. Bennett CH, Brassard G, Crépeau C, Jozsa R, Peres A, Wootters WK (1993) Teleporting an unknown quantum state via dual classical and Einstein–Podolsky–Rosen channels. Phys Rev Lett 70:1895–1899
12. Graham-Rowe D (2007) Quantum ATM rules out fraudulent web purchases. New Sci 2629:30–31
13. Duligall JL, Godfrey MS, Harrison KA, Munro WJ, Rarity JG (2006) Low cost and compact quantum cryptography. New J Phys 8:249; ArXiv:quant-ph/0608213v2
14. Steane A (2001) Quantum computing and error correction. In: Gonis, Turchi (eds) Decoherence and its implications in quantum computation and information transfer. IOS Press, Amsterdam, pp 284–298
15. Lidar DA, Chuang IL, Whaley KB (1998) Decoherence free subspaces for quantum computation. Phys Rev Lett 81:2594–2598; ArXiv:quant-ph/9807004v2
16. Lidar DA, Whaley KB (2003) Decoherence-free subspaces and subsystems. In: Benatti F, Floreanini R (eds) Irreversible Quantum Dynamics. Lecture Notes in Physics, vol 622. Springer, Berlin, pp 83–120; ArXiv:quant-ph/0301032
17. Misra B, Sudarshan ECG (1977) The Zeno's paradox in quantum theory. J Math Phys 18:756
18. Beige A, Braun D, Tregenna B, Knight PL (2000) Quantum computing using dissipation to remain in a decoherence-free subspace. Phys Rev Lett 85:1762–1766; ArXiv:quant-ph/0004043v3
19. Nielsen M, Chuang I (1996) Talk at KITP Workshop: Quantum Coherence and Decoherence, organized by DiVincenzo DP and Zurek W. http://www.kitp.ucsb.edu/activities/conferences/past/. Accessed 2 Sep 2008
20. Lomont C (2004) The hidden subgroup problem – review and open problems. ArXiv:quant-ph/0411037
21. Childs A, Eisenberg JM (2005) Quantum algorithms for subset finding. Quantum Inf Comput 5:593–604; ArXiv:quant-ph/0311038
22. Ambainis A (2004) Quantum walk algorithms for element distinctness. In: 45th Annual IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, pp 22–31
23. Kendon V, Tregenna B (2003) Decoherence can be useful in quantum walks. Phys Rev A 67:042315; ArXiv:quant-ph/0209005
24. Richter P (2007) Almost uniform sampling in quantum walks. New J Phys 9:72; ArXiv:quant-ph/0606202
25. Richter P (2007) Quantum speedup of classical mixing processes. Phys Rev A 76:042306; ArXiv:quant-ph/0609204
26. Grover LK (1996) A fast quantum mechanical algorithm for database search. In: Proc 28th Annual ACM STOC. ACM, NY, p 212; ArXiv:quant-ph/9605043
27. Grover LK (1997) Quantum mechanics helps in searching for a needle in a haystack. Phys Rev Lett 79:325; ArXiv:quant-ph/9706033
28. Shenvi N, Kempe J, Whaley BK (2003) A quantum random walk search algorithm. Phys Rev A 67:052307; ArXiv:quant-ph/0210064
29. Bennett CH, Bernstein E, Brassard G, Vazirani U (1997) Strengths and weaknesses of quantum computing. SIAM J Comput 26(5):1510–1523
30. Magniez F, Santha M, Szegedy M (2003) An O(n^1.3) quantum algorithm for the triangle problem. ArXiv:quant-ph/0310134
31. Magniez F, Santha M, Szegedy M (2005) Quantum algorithms for the triangle problem. In: Proceedings of 16th ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, pp 1109–1117
32. Childs AM, Cleve R, Deotto E, Farhi E, Gutmann S, Spielman DA (2003) Exponential algorithmic speedup by a quantum walk. In: Proc 35th Annual ACM STOC. ACM, NY, pp 59–68; ArXiv:quant-ph/0209131
33. Ambainis A (2003) Quantum walks and their algorithmic applications. Int J Quantum Information 1(4):507–518; ArXiv:quant-ph/0403120
34. Aaronson S (2005) The complexity zoo. http://qwiki.caltech.edu/wiki/Complexity_Zoo. Accessed 2 Sep 2008
35. De Raedt K, Michielsen K, De Raedt H, Trieu B, Arnold G, Richter M, Lippert Th, Watanabe H, Ito N (2007) Massive parallel quantum computer simulator. Comp Phys Comm 176:127–136
36. Cirac JI, Verstraete F, Porras D (2004) Density matrix renormalization group and periodic boundary conditions: A quantum information perspective. Phys Rev Lett 93:227205
37. Lloyd S (2000) Ultimate physical limits to computation. Nature 406:1047–1054; ArXiv:quant-ph/9908043
38. Margolus N, Levitin LB (1996) The maximum speed of dynamical evolution. In: Toffoli T, Biafore M, Liao J (eds) PhysComp96. NECSI, Boston
39. Margolus N, Levitin LB (1998) The maximum speed of dynamical evolution. Physica D 120:188–195; ArXiv:quant-ph/9710043v2
40. DiVincenzo DP (2000) The physical implementation of quantum computation. Fortschritte der Physik 48(9–11):771–783
41. Metodi TS, Thaker DD, Cross AW, Chong FT, Chuang IL (2005) A quantum logic array microarchitecture: Scalable quantum data movement and computation. In: 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05). IEEE Computer Society Press, Los Alamitos, pp 305–318; ArXiv:quant-ph/0509051v1
42. Spiller TP, Munro WJ, Barrett SD, Kok P (2005) An introduction to quantum information processing: applications and realisations. Contemporary Physics 46:407
43. Jozsa R (1998) Entanglement and quantum computation. In: Huggett SA, Mason LJ, Tod KP, Tsou S, Woodhouse NMJ (eds) The Geometric Universe, Geometry, and the Work of Roger Penrose. Oxford University Press, Oxford, pp 369–379
44. Blume-Kohout R, Caves CM, Deutsch IH (2002) Climbing mount scalable: Physical resource requirements for a scalable quantum computer. Found Phys 32(11):1641–1670; ArXiv:quant-ph/0204157
45. Greentree AD, Schirmer SG, Green F, Hollenberg LCL, Hamilton AR, Clark RG (2004) Maximizing the Hilbert space for a finite number of distinguishable quantum states. Phys Rev Lett 92:097901; ArXiv:quant-ph/0304050
46. Braunstein SL, van Loock P (2005) Quantum information with continuous variables. Rev Mod Phys 77:513–578
47. Lloyd S, Braunstein SL (1999) Quantum computation over continuous variables. Phys Rev Lett 82:1784; ArXiv:quant-ph/9810082v1
48. Bartlett S, Sanders B, Braunstein SL, Nemoto K (2002) Efficient classical simulation of continuous variable quantum information processes. Phys Rev Lett 88:097904; ArXiv:quant-ph/0109047
49. Ladd TD, van Loock P, Nemoto K, Munro WJ, Yamamoto Y (2006) Hybrid quantum repeater based on dispersive CQED interactions between matter qubits and bright coherent light. New J Phys 8:184; ArXiv:quant-ph/0610154v1; doi:10.1088/1367-2630/8/9/184
50. Spiller TP, Nemoto K, Braunstein SL, Munro WJ, van Loock P, Milburn GJ (2006) Quantum computation by communication. New J Phys 8:30; ArXiv:quant-ph/0509202v3
51. Raussendorf R, Briegel HJ (2001) A one-way quantum computer. Phys Rev Lett 86:5188–5191
52. Raussendorf R, Browne DE, Briegel HJ (2003) Measurement-based quantum computation with cluster states. Phys Rev A 68:022312; ArXiv:quant-ph/0301052v2
53. Nielsen MA (2004) Optical quantum computation using cluster states. Phys Rev Lett 93:040503; ArXiv:quant-ph/0402005
54. Yoran N, Reznik B (2003) Deterministic linear optics quantum computation utilizing linked photon circuits. Phys Rev Lett 91:037903; ArXiv:quant-ph/0303008
55. Jozsa R (2005) An introduction to measurement based quantum computation. ArXiv:quant-ph/0508124
56. Kitaev AY (2003) Fault-tolerant quantum computation by anyons. Annals Phys 303:2–30; ArXiv:quant-ph/9707021v1
57. Aharonov D, van Dam W, Kempe J, Landau Z, Lloyd S, Regev O (2004) Adiabatic quantum computation is equivalent to standard quantum computation. ArXiv:quant-ph/0405098
58. Brennen GK, Pachos JK (2007) Why should anyone care about computing with anyons? Proc Roy Soc Lond A 464(2089):1–24; ArXiv:0704.2241v2
59. Aharonov Y, Bohm D (1959) Significance of electromagnetic potentials in quantum theory. Phys Rev 115:485–491
60. Young T (1804) Experimental demonstration of the general law of the interference of light. Phil Trans Roy Soc Lond 94
61. Jiang L, Brennen GK, Gorshkov AV, Hammerer K, Hafezi M, Demler E, Lukin MD, Zoller P (2007) Anyonic interferometry and protected memories in atomic spin lattices. ArXiv:0711.1365v1
Quantum Computing
62. Farhi E, Goldstone J, Gutmann S, Sipser M (2000) Quantum computation by adiabatic evolution. ArXiv:quant-ph/0001106 63. Kieu TD (2006) Quantum adiabatic computation and the travelling salesman problem. ArXiv:quant-ph/0601151v2 64. Kempe, Kitaev, Regev (2004) The complexity of the local hamiltonian problem. In: Proc 24th FSTTCS. pp 372–383; ArXiv:quant-ph/0406180 65. Kempe, Kitaev, Regev (2006) The complexity of the local hamiltonian problem. SIAM J Comput 35(5):1070–1097 66. Childs AM, Farhi E, Preskill J (2002) Robustness of adiabatic quantum computation. Phys Rev A 65:012322; ArXiv:quantph/0108048 67. van Dam S, Hogg, Breyta, Chuang (2003) Experimental implementation of an adiabatic quantum optimization algorithm. Phys Rev Lett 90(6):067903; ArXiv:quant-ph/0302057 68. Lloyd S (1996) Universal quantum simulators. Science 273:1073–1078 69. Berry DW, Ahokas G, Cleve R, Sanders BC (2007) Efficient quantum algorithms for simulating sparse hamiltonians. Commun Math Phys 270:359; ArXiv:quant-ph/0508139v2 70. Somaroo SS, Tseng CH, Havel TF, Laflamme R, Cory DG (1999) Quantum simulations on a quantum computer. Phys Rev Lett 82:5381–5384; ArXiv:quant-ph/9905045 71. Abrams DS, Lloyd S (1997) Simulation of many-body fermi systems on a universal quantum computer. Phys Rev Lett 79:2586–2589; ArXiv:quant-ph/9703054v1 72. Wu LA, Byrd MS, Lidar DA (2002) Polynomial-time simulation of pairing models on a quantum computer. Phys Rev Lett 89:057904. Due to a proofs mix up there is also an Erratum: 89:139901; ArXiv:quant-ph/0108110v2
73. Brown KR, Clark RJ, Chuang IL (2006) Limitations of quantum simulation examined by simulating a pairing hamiltonian using nuclear magnetic resonance. Phys Rev Lett 97:050504; ArXiv:quant-ph/0601021 74. Hameroff SR, Penrose R (1996) Conscious events as orchestrated spacetime selections. J Conscious Stud 3(1):36–53 75. Khrennikov A (2006) Brain as quantum-like computer. BioSystems 84:225–241; ArXiv:quant-ph/0205092v8
Further Reading

For those who seriously want to learn the quantitative details of quantum computing, this is still the best textbook: Nielsen MA, Chuang IL (2000) Quantum Computation and Quantum Information. Cambridge University Press, Cambridge

For lighter browsing, but still with all the technical details, there are several quantum wikis developed by the scientists doing the research:
Quantiki http://www.quantiki.org/wiki/index.php/Main_Page specifically quantum information
Qwiki http://qwiki.stanford.edu/wiki/Main_Page covers wider quantum theory and experiments

For those still struggling with the concepts (which probably means most people without a physics degree or other formal study of quantum theory), there are plenty of popular science books and articles. Please dive in: it’s the way the world we all live in works, and there is no reason not to dig in deep enough to marvel at the way it fits together and puzzle with the best of us about the bits we can’t yet fathom.
Quantum Computing with Trapped Ions
WOLFGANG LANGE
University of Sussex, Brighton, UK
Article Outline

Glossary
Definition of the Subject
Introduction
Ion Trap Technology
Ions as Carriers of Quantum Information
Laser Cooling and State Initialization
Single-Ion Operations
State Detection of Ionic Qubits
Two-Qubit Interaction and Quantum Gates
Decoherence
Quantum Algorithms
Distributed Quantum Information with Trapped Ions
Future Directions
Bibliography

Glossary

Ion trap Device used for confining charged particles to a small volume of space. There are two types of ion traps: in a Paul trap, the confinement is achieved by using an electric quadrupole radio-frequency field, while a Penning trap uses a static electric quadrupole field and a static magnetic field. For the right field parameters, the charged particles can be stored in the trapping volume permanently.

Lamb–Dicke regime Describes the conditions for strong confinement of ions. In the Lamb–Dicke regime, ions are confined to a region smaller than the wavelength of the optical transition of the ion. Another property of ions in this parameter range is that the ion’s recoil energy is less than the energy of a single quantum of vibration in the trap. This suppresses spontaneous sideband transitions. The Lamb–Dicke regime is essential for the reliable operation of laser-induced quantum gates in an ion-trap quantum computer.

Qubit The unit of quantum information processing. A contraction of quantum bit, the term qubit designates a quantum system with two quasi-stable states carrying the quantum equivalent of binary information (“0” or “1”). In an ion-trap quantum computer, two long-lived electronic states are employed as quantum memory. Their wavefunctions are represented here by
the symbols |g⟩ and |e⟩. The qubit states are manipulated by means of laser pulses.

Quantum register A collection of qubits forms a quantum register. The power of a quantum computer increases exponentially with the size N of the quantum register, since in this case 2^N numbers can be processed in parallel. In ion traps, the largest quantum register implemented so far has size N = 8.

Normal modes The motion of ions in a linear string is strongly coupled due to their Coulomb repulsion. The system can still be described by analogy with a set of independent harmonic oscillators when collective modes of motion are considered, in which all ions oscillate in phase and at the same frequency. These are called normal modes. A linear string of N ions has N normal modes of axial vibration. Each mode has a characteristic distribution of motional amplitudes for the ions. In the lowest frequency mode, all ions oscillate at the same amplitude, so that the string moves like a rigid body.

Phonon bus Since all ions participate in the collective motion, it can be used to transfer quantum information between ions in the string. To this end, each normal mode is considered as a quantum mechanical oscillator, in which quanta of vibrational motion (phonons) can be excited or de-excited. By coupling their excitation to transitions of the ion-qubit, the phonons serve as a quantum data bus to other ions. Reliable transfer requires that the phonons are restricted to a binary system. The state with no vibrational excitation encodes the qubit |0⟩; one quantum of vibration corresponds to |1⟩.

Rabi oscillations The term describes the change of state of a two-level atom, induced by the coherent excitation with a laser beam close to resonance. After half a period (phase π), the population is completely transferred; after another half period it is returned to the initial distribution again.
By choosing suitable phases, amplitudes, detunings and durations of the pulses, the individual qubits may be manipulated in a fully controlled way, and quantum gates between different ions may be induced (via the phonon bus).

Coherent superposition According to the laws of quantum mechanics, a qubit can be in an arbitrary superposition of the two basis states with complex amplitudes α and β, written as α|g⟩ + β|e⟩. The relative phase is important; for example, |g⟩ − |e⟩ and |g⟩ + |e⟩ are distinct superposition states. As long as a well-defined phase is preserved, the qubit is in a coherent superposition. This is a precondition for quantum information processing.
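The π- and 2π-pulses described under Rabi oscillations can be made concrete with a short state-vector sketch. This is illustrative only: `rabi_pulse` and the phase convention are assumptions, not tied to any specific experiment.

```python
import numpy as np

# Resonant Rabi oscillations of a single qubit with basis |g>, |e>.
# theta = Omega * t is the pulse area (Omega: Rabi frequency); the overall
# phase convention here is an illustrative choice.
def rabi_pulse(theta):
    """Unitary for a resonant pulse of area theta acting on amplitudes (c_g, c_e)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s],
                     [-1j * s, c]])

g = np.array([1.0, 0.0])                 # qubit initialized in |g>
after_pi = rabi_pulse(np.pi) @ g         # half a period (pulse area pi)
after_2pi = rabi_pulse(2 * np.pi) @ g    # full period (pulse area 2*pi)

print(abs(after_pi[1]) ** 2)             # 1.0: population fully transferred to |e>
print(abs(after_2pi[0]) ** 2)            # 1.0: returned to |g> (up to a global phase)
```

A π/2-pulse, `rabi_pulse(np.pi / 2)`, prepares the coherent superposition (|g⟩ − i|e⟩)/√2 of the kind discussed under Coherent superposition.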
Decoherence Any loss of coherence of a quantum state is called decoherence. There are many sources of decoherence. In an ion-trap quantum processor, they range from spontaneous decay of the qubit levels to phase shifts in fluctuating ambient fields. A quantum computation can only proceed reliably while decoherence is negligible. Quantum error correction is a way to counteract decoherence.

Entanglement Entanglement is one of the most important properties of quantum systems and has no classical analogue. A composite system, for example two qubits, is entangled if the states of the individual particles cannot be separated and regarded as independent. Instead, there are strong non-local links between the components, resulting in quantum correlations and state changes of one partner upon measurement of the other. Entangled states have many applications, from spectroscopy to quantum networking. Up to eight ions have been entangled in a deterministic way.

Quantum network A system of either local or distant nodes performing quantum calculations and exchanging results via transport of ions or via photon links. Setting up the latter requires a controlled exchange of quantum data between ions and photons, providing flying qubits. The goal is distributed entanglement, for example to transfer qubits over long distances via teleportation and eventually to perform distributed quantum computation.

Definition of the Subject

In quantum computation, the classical bit as carrier of information is replaced by a two-state quantum system, the quantum bit (qubit). The enhanced computing power of qubits is based on two facts: (1) they can simultaneously store arbitrary superpositions of the two values 0 and 1, and (2) superpositions in different qubits may be strongly correlated (entangled), even in the absence of any direct physical link. Quantum algorithms exploit these features to efficiently solve problems whose complexity makes any classical method unfeasible.
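These two features, superposition and entanglement, can be illustrated with a minimal state-vector simulation. The gates used (Hadamard, CNOT) are generic textbook constructions, not a description of any particular ion-trap implementation.

```python
import numpy as np

# A register of N qubits carries 2**N complex amplitudes; here N = 2,
# with |g> = (1, 0), |e> = (0, 1) and basis order |gg>, |ge>, |eg>, |ee>.
H = np.array([[1, 1],
              [1, -1]]) / np.sqrt(2)      # Hadamard: |g> -> (|g> + |e>)/sqrt(2)
CNOT = np.array([[1, 0, 0, 0],            # flips the second qubit iff the first is |e>
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

g = np.array([1.0, 0.0])
superpos = np.kron(H @ g, g)              # (|gg> + |eg>)/sqrt(2): superposition, not yet entangled
bell = CNOT @ superpos                    # (|gg> + |ee>)/sqrt(2): an entangled pair

print(np.round(bell, 3))                  # amplitude 0.707 on |gg> and |ee> only
```

Measuring one qubit of `bell` now fixes the state of the other, which is the non-local quantum correlation described in the Entanglement entry above.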
As a new paradigm of computer science, the concept of quantum computation has generated important results in complexity theory. However, from a practical point of view, the power of quantum information processing can only be unleashed if the necessary quantum hardware is available. What is required is a set of individually accessible binary quantum systems to serve as a quantum register. They must have strong externally controllable interactions between them, but at the same time no coupling to the environment. While there is no system in which these condi-
tions are ideally fulfilled, trapped atomic ions are the closest realization. At present, they provide the most advanced implementation of quantum information processing. This article summarizes the state-of-the-art of quantum computing with trapped ions. As shown below, all necessary components of a trapped-ion quantum computer have been demonstrated, from quantum memory and fundamental quantum logic gates to simple quantum algorithms. Current experimental efforts are directed towards scaling up the small systems investigated so far and enhancing the fidelity of operations to a level where error correction can be applied efficiently. The first task at which a quantum computer is expected to outperform a classical one is the efficient simulation of quantum systems too complex for classical treatment. Introduction The idea of using atomic-scale systems for information processing was first suggested by R. P. Feynman in his lecture There’s plenty of room at the bottom [1]. In the early 1980s, Feynman and P. Benioff found that by exploiting the dynamics of quantum systems, computations could be performed far more efficiently than using a classical computer [2,3]. They argued that the difficulty of efficiently simulating quantum systems on a classical computer implied the superior power of quantum computing. The radically novel idea was to not merely use quantum mechanics as a framework for predicting the macroscopic behavior of a system, but to manipulate individual quantum objects directly. While Feynman didn’t propose a particular physical implementation, the spectacular progress achieved in atomic and optical physics laboratories around the world in preparing and manipulating single atoms and ions makes these particles an obvious choice as a qubit. Binary quantum information is typically stored in two internal electronic levels. Owing to their electric charge, ions can be readily trapped by radio-frequency electric fields. 
However, Coulomb repulsion keeps the trapped ions so far apart that any direct interaction involving their internal states is negligible. This seems to preclude logic gates for two or more qubits. However, in 1995, I. Cirac and P. Zoller found a way around this problem in a seminal paper which initiated experimental ion-trap quantum computation as an active research field [4]. They proposed to use the long-distance Coulomb repulsion between ions to couple different qubits in a linear string by exchanging single quanta of their vibration. Trapped ions have other advantages that make them strong contenders as quantum bits. Their internal and external (motional) states can be manipulated using laser light. In atoms or ions, long-lived excited states exist, allowing for the storage and retrieval of quantum superposition states. Since the first experiment demonstrating a simple quantum gate with a single ion, impressive progress has been made in developing or adapting ion trap technologies for quantum information processing. In this article, the range of techniques used in ion-trap quantum computing is presented, starting with the technological foundations of ion trapping (Sect. “Ion Trap Technology”). The subsequent discussion follows a list of general requirements for a system to serve as a practical device for quantum computation. It was compiled as a general guideline by D. DiVincenzo [5]. (1) The quantum system must provide well-characterized qubits forming a scalable quantum register (Sect. “Ions as Carriers of Quantum Information”). (2) It must be possible to initialize the qubits to a known state (Sect. “Laser Cooling and State Initialization”). (3) A method to efficiently measure individual qubits must exist (Sect. “State Detection of Ionic Qubits”). (4) A universal set of quantum gates is required (Sects. “Single-Ion Operations” and “Two-Qubit Interaction and Quantum Gates”). (5) The time over which quantum states lose coherence should be much longer than gate operation times (Sect. “Decoherence”). At present, trapped-ion systems are the only technology for which all five criteria have been successfully demonstrated. Major achievements are two-qubit ion gates with a fidelity around 96%, the entanglement of up to 8 ions and a number of simple algorithms (Sect. “Quantum Algorithms”). In his article, DiVincenzo has added two more requirements in case local quantum processors are to exchange quantum information over some distance. (6) It must be possible to interconvert the stationary qubits of the quantum computer to “flying qubits”, suitable for long-distance transmission.
(7) The flying qubits must be faithfully transmitted between specified locations. These requirements become more important as schemes for distributed quantum computation are being developed. The obvious choice for flying qubits is photons, and there are first results regarding the coupling of ions and photons (Sect. “Distributed Quantum Information with Trapped Ions”).
Ion Trap Technology

Ions are particularly well suited for quantum information processing, since due to their charge, they can be confined by electromagnetic fields without affecting their internal electronic levels. Presently, all realizations of quantum information processing with ions use variants of the Paul trap, in which the ponderomotive force of a time-dependent inhomogeneous electric field leads to stable confinement [6,7]. For a quantum register containing multiple ions, a linear trap geometry is generally chosen.

Linear Paul Trap

The linear Paul trap has its origins in the electric quadrupole mass filter [8], in which transverse confinement of ions with a certain charge-to-mass ratio is achieved by a radiofrequency quadrupole potential in a plane perpendicular to the axis of the device, assumed to be oriented in the z-direction:

Φ(x, y, t) = (U − V cos Ωt) (x² − y²) / (2r₀²)    (1)
where ±U/2 are the dc voltages and ±(V/2) cos Ωt the radiofrequency voltages applied to the quadrupole electrodes at a distance r₀ from the trap center. Equation (1) represents a saddle potential which at any particular time provides confinement in one direction only, as shown in Fig. 1. By using alternating voltages and averaging over a period of the radiofrequency, stable trapping may be achieved in both transverse directions. Figure 2 indicates how the trap parameters V and U must be chosen to obtain confinement. In practice, U = 0 and eV ≪ mΩ²r₀²/2 are chosen, where m and e denote mass and charge of the ion. To generate the potential (1), four hyperbolically shaped electrodes are required. Usually these are approximated by simpler structures like rods with circular or triangular cross section, since close to the trap axis the resulting anharmonicity is negligible. On a timescale long compared to the period of the radiofrequency, the ions move as if they were radially confined in a parabolic pseudopotential given by Ψ = e²|∇Φ|²/(4mΩ²). The motion in the radial directions is therefore harmonic, with the radial secular frequency

ω_r ≈ eV / (√2 mΩr₀²)    (2)
The pseudopotential provides confinement in the radial direction only, while the motion along the z-axis is not restricted. For three-dimensional confinement, an additional static potential must be applied in the z-direction
Quantum Computing with Trapped Ions
Quantum Computing with Trapped Ions, Figure 1
Two-dimensional saddle potential for a charged particle in the linear Paul trap according to Eq. (1). Due to the quadrupole shape, at any given time there is confinement only along one axis. a Potential at t = 0, providing y-confinement only. b Potential at t = π/Ω, providing x-confinement only. Stable confinement in both directions is achieved by rapidly alternating the sign of the potential
Quantum Computing with Trapped Ions, Figure 2
Stability diagram for the linear Paul trap as a function of radiofrequency amplitude V and dc amplitude U, scaled to dimensionless parameters q and a, respectively. Confinement in both x- and y-direction is achieved in the purple region at the center, defining the operating range of the trap
Quantum Computing with Trapped Ions, Figure 3
Schematic drawing of a linear ion trap, as used for quantum information processing. The ions are arranged in a linear string along the trap axis, held by rf and dc electric fields. The state of the ions is detected by imaging them onto a CCD camera. © R. Blatt, University of Innsbruck, Austria
which is either achieved using additional electrodes at either end of the trap or by segmenting the linear rod electrodes and applying a positive dc voltage U_z to the outer segments. This provides a static harmonic well along the z-axis, characterized by the longitudinal trap frequency

ω_z = √(2κeU_z / (mz₀²))    (3)

Here, z₀ is half the length between the axially confining electrodes and κ is a geometric factor accounting for the specific electrode configuration. The presence of the axially confining field weakens the radial confinement, which is reduced to

ω_r′ = √(ω_r² − ω_z²/2)    (4)
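Equations (2)–(4) can be evaluated numerically. The sketch below uses assumed example parameters for a ⁴⁰Ca⁺ trap (V, U_z, Ω, r₀, z₀ and κ are illustrative values, not those of a specific apparatus) and also evaluates the standard Mathieu stability parameter q = 2eV/(mΩ²r₀²) of Fig. 2 as a sanity check.

```python
import math

# Evaluating Eqs. (2)-(4) for a 40Ca+ ion. All trap parameters below are
# assumed example values for illustration, not those of a specific apparatus.
e = 1.602176634e-19          # elementary charge (C)
m = 40 * 1.66053906660e-27   # mass of 40Ca+ (kg)

V = 350.0                    # rf amplitude (V), assumed
U_z = 100.0                  # dc voltage on the axial electrodes (V), assumed
Omega = 2 * math.pi * 20e6   # rf drive frequency (rad/s), assumed
r0 = 0.5e-3                  # electrode distance from the trap axis (m), assumed
z0 = 1.0e-3                  # half distance between axial electrodes (m), assumed
kappa = 0.3                  # geometric factor, assumed

omega_r = e * V / (math.sqrt(2) * m * Omega * r0 ** 2)    # Eq. (2), radial secular frequency
omega_z = math.sqrt(2 * kappa * e * U_z / (m * z0 ** 2))  # Eq. (3), axial trap frequency
omega_r_eff = math.sqrt(omega_r ** 2 - omega_z ** 2 / 2)  # Eq. (4), weakened radial frequency
q = 2 * e * V / (m * Omega ** 2 * r0 ** 2)                # Mathieu stability parameter (Fig. 2)

for name, w in [("omega_r", omega_r), ("omega_z", omega_z), ("omega_r'", omega_r_eff)]:
    print(f"{name}/2pi = {w / (2 * math.pi) / 1e6:.2f} MHz")
print(f"q = {q:.2f}")        # must lie well inside the stable region (below ~0.9)
```

With these assumed values, the secular frequencies fall in the MHz ranges typical of traps used for quantum information processing.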
A sketch of a linear ion trap is shown in Fig. 3. Values for ω_r/2π in ion traps used for quantum information processing range from 3 MHz to 10 MHz, while typical values for ω_z/2π are 1 MHz to 4 MHz. If multiple ions are confined and cooled to a sufficiently low temperature in a Paul trap (see Sect. “Vibrational Cooling of a Single Ion”), they form ordered structures [9,10,11]. If the radial confinement is strong enough (ω_r ≫ ω_z), ions arrange themselves in a linear pattern along the trap axis at distances determined by the equilibrium of their mutual Coulomb repulsion and the potential providing axial confinement. Figure 4 shows an example
Quantum Computing with Trapped Ions, Figure 4
String of eight ions in a linear Paul trap [10]. Color indicates the intensity of the fluorescent light detected with a CCD camera. The average distance between adjacent ions is about 10 µm. See Refs. [9,11,12] for other ion crystal structures. © R. Blatt, University of Innsbruck, Austria
of a string of eight ⁴⁰Ca⁺ ions in a linear trap. The equilibrium spacing of the ions is not uniform and must be determined numerically [12]. The number of ions N that fit on the axis of a linear ion trap is limited by the ratio of radial to axial confinement to approximately N < 1.82 (ω_r/ω_z)^1.13 [14]. For larger ion numbers, the equilibrium positions no longer coincide with the z-axis of the trap. This must be avoided, since off the axis, the ions undergo micromotion, an oscillatory motion around their equilibrium position, driven by the trapping field at frequency Ω. Only on the trap axis are the ions protected from micromotion, since ideally it corresponds to a node of the confining radiofrequency field. However, micromotion can still occur if stray electric fields shift the ions off the nodal line of the radiofrequency field. Since this may result in rf heating of the ions and precludes controlled quantum operations, stray fields must be carefully compensated [15,16].
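The numerically determined equilibrium positions [12] and the axial normal modes described in the Glossary can be sketched in dimensionless units. This is a standard textbook calculation; the function name and the Newton-iteration details are illustrative, not code from the cited references.

```python
import numpy as np

def axial_equilibrium_and_modes(n):
    """Equilibrium positions (units of l = (e^2/(4*pi*eps0*m*omega_z^2))**(1/3))
    and axial normal-mode frequencies (units of omega_z) for n ions held in
    a harmonic axial well with mutual Coulomb repulsion."""
    u = np.linspace(-1.0, 1.0, n) * n ** 0.6          # rough initial guess
    for _ in range(100):                              # Newton's method on the force balance
        d = u[:, None] - u[None, :]
        np.fill_diagonal(d, np.inf)
        grad = u - np.sum(np.sign(d) / d ** 2, axis=1)  # trap force minus Coulomb force
        inv3 = 1.0 / np.abs(d) ** 3
        hess = -2.0 * inv3                            # Hessian of the potential energy
        np.fill_diagonal(hess, 1.0 + 2.0 * np.sum(inv3, axis=1))
        u = u - np.linalg.solve(hess, grad)
    modes = np.sqrt(np.linalg.eigvalsh(hess))         # mode frequencies omega_k / omega_z
    return np.sort(u), np.sort(modes)

pos, modes = axial_equilibrium_and_modes(8)
print(np.round(np.diff(pos), 3))   # spacings: smallest at the center, as in Fig. 4
print(np.round(modes, 3))          # lowest two modes: 1 (center of mass) and sqrt(3)
```

For two ions this reproduces the known equilibrium positions ±(1/4)^(1/3) and the mode frequencies ω_z (center of mass) and √3 ω_z (breathing mode).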
Spherical Paul Trap

If only a single ion is to be stored, the spherical variant of the Paul trap may be used [17]. Here, the average ponderomotive force of an alternating quadrupole potential is used to provide three-dimensional confinement. The corresponding potential is

Φ(x, y, z, t) = (U − V cos Ωt) (x² + y² − 2z²) / (2r₀²)    (5)

The ideal electrodes are two hyperbolic endcaps along the z-axis and a hyperbolic ring in the xy-plane, but simplified configurations with only two cylindrical endcaps or only one ring-electrode [18,19,20,21] are possible. In a spherical Paul trap, the micromotion vanishes completely only at the origin, which is why it is suited only for the storage of a single ion. It was used in early studies of quantum information processing in an ion trap [22,23]. In the future, spherical Paul traps may again play an important role as single-ion microtraps, arranged in large, scalable arrays [24]. Coupling of adjacent ions would be accomplished through phase shifts, mediated by state-dependent Coulomb interaction rather than the exchange of vibrational quanta, as will be discussed in Sect. “Vibrational Coupling” for quantum information processing in linear ion strings.

Large-Scale Ion Traps

There is no fundamental limit to the length of a string of trapped ions and hence to the size of a multi-ion quantum register. However, the technical challenges of manipulating many ions in a single trapping zone are substantial. In particular, the large number of vibrational modes of a long ion crystal makes reliable transfer of quantum information difficult (see Sect. “Vibrational Coupling”). So far, quantum operations have been performed with up to eight ions in a single trap. In order to circumvent the problems associated with long ion chains, trap architectures made up of interconnected individual trap segments have been proposed [25]. In these schemes, processing of quantum information is performed locally in special regions of the trap containing only a small number of ions, which are retrieved from and returned to memory regions. Ion transfer is achieved by changing the axially confining fields with dc electrodes placed along the trapping zones. This type of trap is also known as a quantum charge-coupled device (QCCD). A sketch of a possible electrode layout is shown in Fig. 5. An important issue for quantum computation in segmented traps is to maintain coherence of internal states during transport of the qubits. This has been demonstrated over a straight distance of 1.2 mm after separating two ions originally held in a single trap [26]. In order to extract arbitrary ions from a quantum register, junctions have to be implemented, for example, a T-junction joining two linear trap sections at an angle of 90°. A trap of this type was realized at the University of Michigan and used to swap the positions of two ions [27].

Quantum Computing with Trapped Ions, Figure 5
Sketch of an ion trap with multiple segments, providing memory regions for storing qubits (red) and interaction regions for logic operations, including ions to sympathetically cool the vibrational motion (blue). Shuttling between different locations is accomplished by suitable dc voltages applied to the electrodes shown in green

Microtraps and Surface-Electrode Traps

The number of electrodes required for multi-zone traps quickly grows as more processing or memory segments are included. These devices require integrated fabrication technologies to replace the manual assembly and alignment of electrodes used in simpler traps. Two approaches have been pursued to make ion traps amenable to microfabrication. (1) Three-dimensional microtraps, fabricated from a monolithic multilayer microchip [28,29]. The trapping geometry is similar to that in a macroscopic trap, but minimum features are smaller and no assembly is required. A chip-trap has been fabricated from a doped gallium-arsenide heterostructure [30]. (2) Surface-electrode traps further simplify the fabrication of trapping structures. They are derived from three-dimensional Paul traps, but have all their electrodes moved to a single plane [31,32]. The ions are again trapped in a minimum of the pseudopotential, which is typically located several tens of µm above the surface. A planar trap for single ions has been demonstrated at NIST [31]. Using microfabrication techniques for ion traps makes them suitable for miniaturization and scaling. As an additional benefit of chip-traps, microelectronics, e.g., for electrode potential control, can be placed directly on the chip [33].

Penning Traps

An alternative way to trap ions in three dimensions is the Penning trap [6]. In contrast to the Paul trap, there are no alternating fields.
Stable confinement of ions is accomplished with a static magnetic field, providing a radial force in the xy-plane and a static electric field exerting a restoring force in the z-direction. While the z-motion in the Penning trap is harmonic, in the radial plane the ions orbit the center in a combination of cyclotron and magnetron motion. This makes a fixed radial arrangement of ions difficult. However, by using a rotating electric field technique, large radial ion crystals rotating at a controlled rate were produced in a Penning trap [34]. Such a collection of ions is equivalent to a quantum hard disk and has been proposed for quantum information processing [35].
Another scheme for quantum computing with Penning traps uses an array of miniature Penning traps, each containing only a single ion [36]. Ions could be shuttled between individual traps using suitable arrangements of electric fields. To perform two-bit quantum gates, two ions would be combined in a single trap, forming an axial crystal, which can be manipulated by analogy with the equivalent structure in a Paul trap. Axial two-ion Coulomb crystals have recently been observed in experiment [37]. An array of Penning traps holding single electrons has been proposed as a scalable quantum information processor [38]. Here, gates are implemented by controlling the Coulomb interaction with radio-frequency and microwave techniques. In the remainder of this article, Penning traps will not be further discussed, since to date, all realizations of quantum information processing have been performed in Paul traps.

Ions as Carriers of Quantum Information

In an ion-trap quantum computer, information is encoded in two internal (electronic) states of the ion. Among the large number of levels, only those with a long natural lifetime are suitable for storing quantum information. This excludes levels with electric dipole transitions to lower lying states. For easy manipulation of the qubit transition, the levels should be coupled either by a two-photon Raman transition, an electric quadrupole or a magnetic dipole transition. Ideally, the states should be insensitive to fluctuations of external magnetic or electric fields to minimize decoherence. Two classes of states have been used for quantum information processing with ions so far: two hyperfine levels of the ground state of an ion with nuclear spin, and a combination of an electronic ground state and a low-lying metastable state.
Hyperfine State Qubits

In any isotope with non-vanishing nuclear spin, the electronic ground state energy levels are split by the interaction of the electrons’ angular momentum with the nuclear magnetic moment (and, if applicable, the electric quadrupole moment). These so-called hyperfine states are coupled by magnetic dipole transitions and hence have an extremely long lifetime. Therefore, they are ideally suited for reliably storing quantum information. An example is the ⁹Be⁺ ion, which was used in the first implementations of quantum information processing with ions [22]. The two hyperfine levels are distinguished by the quantum numbers F = 1 and F = 2, differing in the relative orientation of electronic angular momentum and nuclear spin.
Quantum Computing with Trapped Ions, Figure 6
Qubit implementation in ⁹Be⁺. As indicated, the quantum information is stored in the two magnetic sublevels |F = 2, m_F = 2⟩ and |F = 1, m_F = 1⟩. In order to change the state of the qubit, Raman transitions driven by laser fields E₁ and E₂, off-resonant from the level P₁/₂, are employed
The relevant level structure of ⁹Be⁺ is shown in Fig. 6. The two basis states of the quantum bit are represented by a magnetic sublevel of each hyperfine state. A common choice is |F = 2, m_F = 2⟩ and |F = 1, m_F = 1⟩. While the outermost levels with m_F = ±F are most conveniently initialized, they are subject to random shifts by fluctuating magnetic fields, leading to decoherence. For a quantum memory with longer lifetime, a first-order magnetic-field independent transition at a finite magnetic bias field should be used [39].
Metastable State Qubits

Long lifetimes for ionic qubits may also be achieved by employing excited electronic states which are connected to lower lying levels only by higher order electromagnetic multipole transitions. This is the case for ions with a D-state with lower energy than the first excited P-state. It decays to the ground state via an electric quadrupole transition with a rate around 1 s⁻¹. An example is ⁴⁰Ca⁺ [40], which is used for ion-trap quantum information processing in a number of laboratories. Here, the states |²S₁/₂, m_j = −1/2⟩ and |²D₅/₂, m_j = −1/2⟩ are usually chosen as basis states. The level structure of ⁴⁰Ca⁺ is shown in Fig. 7.

All ions investigated for quantum information processing fall in either of these two categories. An overview of isotopes that have been used is given in Table 1. In the remainder of this article, the two physical basis states of a qubit will be designated by the symbols |g⟩ for the lower (ground) state and |e⟩ for the higher (excited) state, as indicated in Figs. 6 and 7. Note that in the literature, alternative notations are used, e.g., |↓⟩ and |↑⟩ for hyperfine qubits, |S⟩ and |D⟩ symbolizing the orbital states involved, or the generic |0⟩ and |1⟩. The 2^N basis states of a quantum register with N ions are represented by analogy with single-qubit states by specifying N entries: |x₁x₂x₃…x_N⟩, where x_i is either g or e.

Quantum Computing with Trapped Ions, Table 1
Isotopes used in investigations of quantum information processing with trapped ions

Scheme           | Ion Species
Hyperfine qubit  | ⁹Be⁺ [22,39], ⁴³Ca⁺ [41,42], ¹¹¹Cd⁺ [43], ¹⁷¹Yb⁺ [44,45], ²⁵Mg⁺ [46]
Metastable qubit | ⁴⁰Ca⁺ [40,41], ⁸⁸Sr⁺ [47,48,49,50]

Laser Cooling and State Initialization

Ionization

Before a quantum register can be initialized, the ion trap must be loaded with the desired isotope. This is most efficiently achieved by photo-ionizing an atomic beam [51,52,53]. If one of the excitation steps in multiphoton ionization is resonant, a specific isotope may be loaded, distinguished by its isotope shift [51,54]. If no suitable laser source for photo-ionization is available, electron-impact ionization may be used instead. It is not state-selective and hence only suited for loading the most abundant species among an ion’s isotopes. Due to the long trap lifetimes of many hours to days, the trap has to be loaded only infrequently.

Initialization of the Quantum Register

Any quantum computation must start from a well-defined, i.e. pure, state of the quantum register. Therefore, initially
Quantum Computing with Trapped Ions
Quantum Computing with Trapped Ions, Figure 7 Qubit-implementation in 40 CaC . Quantum information is stored in a magnetic sublevel of the ground state j2 S1/2 ; mj D 1/2i and of the metastable state j2 D5/2 ; mj D 1/2i. The state of the qubit is changed by resonantly driving the quadrupole transition between S and D with a laser field E at 729 nm
all ions must be transferred to a specific internal state, typically the lower basis state, so that the register is initialized to jg g g : : : gi. In ions, this is achieved by the wellestablished method of optical pumping [55]. A laser with a suitable polarization repeatedly excites the ions to a set of states from which they eventually decay to the desired state jgi. This state is either not coupled to the exciting laser or driven to a state whose only decay channel is back to jgi (cycling transition). In any case, all ions reach their initial state after a few excitation-emission cycles. In contrast to the quantum gates discussed below, the initialization of the register is irreversible. Not all internal states are amenable to initialization by optical pumping with high fidelity. In some cases, it may be necessary to pump to an auxiliary state, which is then coherently transferred to the desired initial state by suitable laser pulses [39]. Vibrational Cooling of a Single Ion In an ion-trap quantum processor, different ions are coupled by the Coulomb interaction via their motional degrees of freedom (see Sect. “Vibrational Coupling”). In order to transfer quantum information in this way, the ions’ harmonic oscillation in the trapping potential must be cooled close to its quantum mechanical ground state. This is achieved by different stages of laser cooling, exploiting the momentum transfer between ions and light. Initial cooling of the ions is provided by Doppler cooling [56,57,58], using an atomic transition with a natural line width larger than the vibrational frequencies (! r or ! z ) of the ions in the trap. If the cooling laser
is slightly red-detuned from resonance, the Doppler shift from the ions' motion ensures that photons are preferentially absorbed when an ion is moving towards the laser. The resulting momentum transfer, averaged over many scattering events, reduces the kinetic energy and thus the temperature of the ion. The final temperature which can be reached by this technique is given by the Doppler temperature T_D = ħΓ/(2k_B) [59], where Γ is the natural linewidth of the cooling transition and k_B denotes Boltzmann's constant. T_D is on the order of a few millikelvin, which for typical trap frequencies corresponds to a residual excitation of one to tens of vibrational quanta.

In order to reach the vibrational ground state, an additional cooling stage must be applied. To this end, a transition is used whose linewidth Γ is smaller than the frequency of the vibration to be cooled, e.g., ω_z. In this case, the ion's absorption spectrum is composed of a line at the transition frequency ω₀ and a series of sidebands at ω₀ ± nω_z with integer n. The strength of these sidebands is given by the vibrational excitation. Very efficient cooling is obtained by tuning a laser to the first red sideband at ω₀ − ω_z (see Fig. 8). If the ion is well localized (Lamb–Dicke regime), subsequent spontaneous emission occurs predominantly at the carrier frequency ω₀, so that there is a net reduction of the ion's kinetic energy by ħω_z. After a few excitation–emission cycles, the ground state of vibration is reached with high probability. The technique is known as sideband cooling. The ion's mean vibrational quantum number in the trapping potential is reduced to ⟨n⟩ = (Γ/2ω_z)² ≪ 1 [60,61]. The technical challenge is to find a transition narrower than the longitudinal trap frequency ω_z. In ions with calcium-like level schemes, this is possible on the S₁/₂–D₅/₂ quadrupole transition [62]. For hyperfine qubits, it is necessary to use a two-photon stimulated Raman transition for sideband cooling [61].

Quantum Computing with Trapped Ions, Figure 8
Resolved sideband cooling in a two-level system (states |g⟩, |e⟩). Associated with each level is a ladder of quantized vibrational states |n⟩, labeled by their vibrational quantum number. Red sideband excitation (red arrow), followed predominantly by spontaneous emission on the carrier, reduces the vibrational excitation by one. After a few cycles, the vibrational ground state (green circle) is reached, which is not coupled to the excitation

Vibrational Cooling of Ion Strings

In an ion string, the motion of the ions is strongly coupled due to their Coulomb repulsion. It is therefore more natural to consider the N normal modes of motion of the string in the z-direction, rather than N individual ions. The lowest normal mode has frequency ω_z, the same as that of a single ion. Normal modes of ion strings will be discussed in Sect. "Normal Modes". The cooling of a normal mode proceeds by analogy with the cooling of a single ion. It is not required that all the ions in a string interact with the laser. Momentum transfer from the mode can already be achieved through a single ion in the string. This fact provides the possibility of using different ions for cooling and data operations. In this case, one ion species is used for sideband-cooling the collective motion of the entire string, while the remaining ions are used in the subsequent quantum processing. This method is called sympathetic cooling [63,64,65,66]. While only the one vibrational mode to be used as a quantum data bus must be cooled to the ground state, the remaining N − 1 vibrational modes should be cold enough not to interfere with quantum operations, ideally also residing in their ground state of motion. When all ions in the string are in a known internal state and the vibrational modes are cooled to (or close to) their ground states of motion, the quantum register is ready for quantum information processing [67].

Single-Ion Operations

After initialization, the state of each qubit may be modified by means of suitable laser pulses applied on the qubit transition. Figure 10 shows the geometry for laser excitation. An important precondition is that the ions may be addressed individually. This requires the laser beam to be focused tighter than the minimum distance between ions, given by [12]

s_min ≈ (2e²/(4πε₀ m ω_z²))^{1/3} N^{−0.56}   (6)
for a string of N ions with mass m. For typical experimental parameters, s_min ranges from 4 to 10 μm. Focusing must be considerably better to minimize residual laser intensity at the position of adjacent ions and scattered light, which would lead to unwanted excitation of other ions in the string. The laser beam is switched between ions by means of an electro-optic deflector.

Single-qubit operations are best visualized in the Bloch picture. Here, a general pure state of a qubit is parametrized by two angles, a polar angle θ (0 ≤ θ ≤ π) and an azimuthal angle φ (0 ≤ φ ≤ 2π), such that

|ψ⟩ = cos(θ/2) |g⟩ + sin(θ/2) e^{iφ} |e⟩ .   (7)
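The parametrization of Eq. (7) is easy to explore numerically. The following minimal sketch (Python with NumPy; illustrative only, not part of the article) constructs the state and checks that the poles of the Bloch sphere reproduce the basis states:

```python
import numpy as np

def bloch_state(theta, phi):
    """Qubit state of Eq. (7): cos(theta/2)|g> + sin(theta/2) e^{i phi}|e>."""
    return np.array([np.cos(theta / 2),
                     np.sin(theta / 2) * np.exp(1j * phi)])

# Poles of the Bloch sphere reproduce the basis states |g> and |e>.
assert np.allclose(bloch_state(0, 0), [1, 0])
assert np.allclose(bloch_state(np.pi, 0), [0, 1])

# Any point on the sphere corresponds to a normalized state.
psi = bloch_state(0.7, 1.9)
assert np.isclose(np.vdot(psi, psi).real, 1.0)
```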
Each qubit state can be represented by a vector pointing in the direction given by (θ, φ) (the Bloch vector), ending on the surface of a sphere of unit radius, the so-called Bloch sphere shown in Fig. 9. The poles represent the basis states |g⟩ and |e⟩, while equal-weight superpositions of these two states lie on the equator. Note that the phase associated with |g⟩ is arbitrarily chosen to be zero. This is no longer possible in two-qubit operations, when differences between the phases of qubits are important.

Quantum Computing with Trapped Ions, Figure 9
Bloch-sphere representation of the state of a qubit. Each point on the surface of the sphere represents a quantum state |ψ⟩ of the two-level system, with the poles representing the basis states |g⟩ and |e⟩. A suitable laser pulse rotates the state through a polar angle θ and an azimuthal angle φ, so that arbitrary points on the sphere can be reached

Quantum Computing with Trapped Ions, Figure 10
Geometry of selective excitation of ions in a quantum register. Shown are the two positions of a laser beam, consecutively addressing control- and target-ion in a two-qubit quantum gate

In ionic qubits, the states can be manipulated by exciting the ion with electromagnetic pulses. The transition frequencies of qubits encoded in metastable atomic states (see Sect. "Metastable State Qubits") are in the optical domain, so that a direct transition driven by a narrowband laser may be used. In the case of ⁴⁰Ca⁺, this is the S–D quadrupole transition (cf. Fig. 7). As any two-level atom resonantly excited by electromagnetic radiation, the qubit undergoes Rabi oscillations, i.e., it periodically changes between the states |g⟩ and |e⟩. This corresponds to a rotation of the Bloch vector. The oscillation frequency Ω₀ (the Rabi frequency for resonant excitation) is given by the time-dependent amplitude E₀(t) of the exciting field and the (quadrupole) coupling strength:

Ω₀(t) = E₀(t) ω_L Q / (2ħc) .   (8)

E(t) = E₀(t) cos(k·r − ω_L t + φ) is the electric field strength of the laser with frequency ω_L, and Q is a component of the electric quadrupole tensor of the transition, determined by the excitation geometry and the polarization of the field. The phase φ of the laser determines the axis in the xy-plane around which the Bloch vector is rotated. For example, φ = 0 leads to a rotation around the x-axis, while φ = π/2 rotates around the y-axis. The most important parameter is the rotation angle Θ. It is given by the pulse area, obtained by integrating the Rabi frequency over the pulse duration: Θ = ∫ Ω₀(t) dt. After excitation by a pulse characterized by the parameters Θ and φ, the initial state c_g |g⟩ + c_e |e⟩ of a qubit is therefore rotated according to the following transformation:

( c_g )     (  cos(Θ/2)             −i e^{−iφ} sin(Θ/2) ) ( c_g )
( c_e )  →  ( −i e^{iφ} sin(Θ/2)     cos(Θ/2)           ) ( c_e ) .   (9)
Table 2 gives an overview of important single-qubit operations and the way they are implemented in an ion-trap quantum computer by choosing the pulse area Θ and phase φ of laser pulses resonant with the qubit transition. Other single-qubit transformations can be implemented by combining rotations around different axes. An example is the Hadamard transform. Its transformation matrix, realized by a combination of X- and Y-pulses, is

( c_g )  →  (1/√2) (  1   1 ) ( c_g )
( c_e )            (  1  −1 ) ( c_e ) .   (10)

Rotations around the z-axis, too, are most conveniently implemented by a combination of pulses around the x- and y-axes. An example is the realization of a π-rotation around the z-axis by three pulses:

( c_g )  →  (  1   0 ) ( c_g )
( c_e )     (  0  −1 ) ( c_e ) .   (11)
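Such compositions can be verified numerically. The following sketch (Python with NumPy; illustrative only) checks one possible decomposition of the Hadamard gate of Eq. (10) and of the Z-gate of Eq. (11) into two rotations each, up to a global phase; the article mentions a three-pulse z-rotation, so the two-pulse sequences here are an assumed alternative, not necessarily the experimental sequence:

```python
import numpy as np

def R(theta, phi):
    """Single-qubit rotation of Eq. (9) in the {|g>, |e>} basis."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * np.exp(-1j * phi) * s],
                     [-1j * np.exp(1j * phi) * s, c]])

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Eq. (10)
Z = np.array([[1, 0], [0, -1]])                # Eq. (11)

# A pi/2 y-pulse followed by a pi x-pulse gives the Hadamard gate
# up to a global phase of -i.
assert np.allclose(R(np.pi, 0) @ R(np.pi / 2, np.pi / 2), -1j * H)

# A pi y-pulse followed by a pi x-pulse gives the Z-gate up to -i.
assert np.allclose(R(np.pi, 0) @ R(np.pi, np.pi / 2), -1j * Z)
```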
For the second class of qubits, hyperfine qubits, the transition frequencies between the basis states are several GHz and hence require microwave pulses if driven directly [68]. Due to the large wavelength of microwaves, different ions in a quantum register can only be addressed individually if their transition frequencies are split, for example by the Zeeman effect when using a strong magnetic-field gradient. An alternative, which has been used in most implementations of hyperfine qubits to date, is to drive the qubit using a two-photon stimulated Raman transition involving a third electronic level, connected to the qubit states through an electric dipole transition. In Fig. 6, the intermediate state is ²P₁/₂. By detuning the driving fields from resonance by an amount Δ which is large compared to the inverse lifetime of the auxiliary state, excitation of this state is largely avoided [61]. The advantage is that individual ions may be addressed with the Raman beams. The effective Rabi frequency for a Raman transition, to be used instead of expression (8), is

Ω₀ = Ω₁Ω₂/Δ ,   Ω_i = E₀ᵢ μ_i/(2ħ) ,   (12)
where Ω_i is the Rabi frequency for transition i and μ_i the corresponding electric dipole moment. The qubit is resonantly excited if the difference frequency ω_L1 − ω_L2 of the two lasers (which replaces ω_L in the direct driving scheme) is tuned to the qubit frequency ω₀. The role of the k-vector in the case of direct driving is played by the difference k₁ − k₂, and the relevant phase determining the single-ion dynamics is given by the difference of the phase constants of the two laser beams, φ = φ₁ − φ₂. With these modifications, Eq. (9) applies.

Quantum Computing with Trapped Ions, Table 2
Important single-qubit operations according to Eq. (9) and their realization with laser pulses. The corresponding transformations on the Bloch sphere are often referred to as R(Θ, φ) in the literature. The symbols are used in the quantum circuits discussed below

Θ   | φ    | Description
4π  | any  | 4π-pulse: identity, no change of the qubit: |g⟩ → |g⟩, |e⟩ → |e⟩
2π  | any  | 2π-pulse: sign change of the qubit state: |g⟩ → −|g⟩, |e⟩ → −|e⟩. This has no effect if applied on the qubit transition, since both levels experience the same phase shift. However, it is one of the most important operations in two-ion gates if applied on a transition with selective coupling to the basis states.
π   | 0    | π-pulse: NOT gate = qubit flip (omitting the common phase factor −i): |g⟩ → |e⟩, |e⟩ → |g⟩
π   | π/2  | π-pulse: bit flip and relative sign change: |g⟩ → |e⟩, |e⟩ → −|g⟩
π/2 | 0    | π/2-pulse (basis states are rotated around the x-axis to equally weighted superpositions): |g⟩ → (|g⟩ − i|e⟩)/√2, |e⟩ → (|e⟩ − i|g⟩)/√2
π/2 | π/2  | π/2-pulse (basis states are rotated around the y-axis to equally weighted superpositions): |g⟩ → (|g⟩ + |e⟩)/√2, |e⟩ → (|e⟩ − |g⟩)/√2
π/2 | −π/2 | π/2-pulse around the negative y-axis: identical to the combination of a Hadamard gate and a Z-gate (sign change of |e⟩): |g⟩ → (|g⟩ − |e⟩)/√2, |e⟩ → (|g⟩ + |e⟩)/√2

The ion dynamics is more complicated if the vibration in the trapping potential is taken into account. This is the basis of the two-qubit quantum gates in ion traps discussed in Sect. "Two-Qubit Interaction and Quantum Gates". Generally, excitation of a transition on a vibrational sideband (see Fig. 8) leads to Rabi oscillations with a modified Rabi frequency. By measuring the state of the qubit as a function of the length of the Rabi pulses (and hence the pulse area Θ), Rabi oscillations can be observed experimentally. An example is shown in Fig. 11. The decay of the contrast of the oscillations is an indication of the loss of coherence between the qubit levels. This can be used to determine the decoherence time of the quantum memory (see Sect. "Decoherence").
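The decay of Rabi contrast described above can be illustrated with a toy model (Python with NumPy). This is purely a sketch: the exponential contrast decay, Rabi frequency, and coherence time below are assumed illustrative parameters, not data from the article:

```python
import numpy as np

def excited_population(t, rabi_freq, coherence_time):
    """Illustrative model of Rabi flopping with decaying contrast:
    P_e(t) = (1 - cos(Omega t) * exp(-t / tau)) / 2."""
    return 0.5 * (1.0 - np.cos(rabi_freq * t) * np.exp(-t / coherence_time))

omega = 2 * np.pi * 50e3   # assumed Rabi frequency: 2*pi x 50 kHz
tau = 1e-3                 # assumed coherence time: 1 ms (cf. Fig. 11)

t = np.linspace(0, 2e-3, 2000)
p = excited_population(t, omega, tau)

# Populations stay physical, and the oscillation contrast decays with time.
assert np.all((p >= 0) & (p <= 1))
early = p[t < 0.2e-3].max() - p[t < 0.2e-3].min()
late = p[t > 1.8e-3].max() - p[t > 1.8e-3].min()
assert late < early
```

Fitting such a model to measured flopping curves is one way to extract a coherence time from data like Fig. 11.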
Quantum Computing with Trapped Ions, Figure 11
Rabi oscillations of a single ⁴⁰Ca⁺ ion on the blue sideband, starting in the motional ground state. The Rabi frequency is given by Eq. (17) instead of Eq. (8), but the flopping between two states is analogous. The decay of contrast is a measure of decoherence in the system (Sect. "Decoherence"). In the example shown, coherence is maintained for up to 1 ms [62]

State Detection of Ionic Qubits

At the end of any data processing, a quantum computer must be subjected to a measurement of its register, in order to determine the result of the calculation. It is important for the success of the quantum computation to achieve a high-efficiency readout. One of the greatest advantages of ion-trap quantum processors is that the state of each qubit may be determined with almost 100% efficiency. This is in contrast to other carriers of quantum information such as photons or nuclear spins in a bulk sample.

Quantum Jump Detection

The quantum state of an ionic qubit is measured optically by monitoring the fluorescent light emitted upon selective laser excitation of one basis state of the qubit on a strong electric dipole transition. The population in the other basis state is either not affected by the probe laser due to selection rules or has previously been transferred (shelved) to another state inaccessible to the probe. The method is therefore also known as electron shelving [69]. The crucial point of the procedure is that the fluorescence is emitted on a transition back to the original qubit level (cycling transition), so that the cycle of excitation and emission of fluorescent light can be repeated indefinitely. In practice, transitions to levels outside the cycle occur, though at a small rate. The number of photons extracted from a single qubit can reach N_ph ≈ 10⁶ [70]. Even for a small photon detection efficiency η_d, the number of detected photons is large and can be distinguished from the other, non-fluorescing level with almost 100% certainty [71]. The probability for a false interpretation of a zero-photon signal is only exp(−η_d N_ph). Originally, the method was used to detect quantum jumps between two atomic levels by observing the intermittent fluorescence on a transition coupled to only one of the levels [69,72], hence the name quantum jump detection. It is now employed as the standard readout procedure in quantum information processing with ions. Figure 12 shows the cycling transitions used in two different qubit systems. Additional repumping lasers may be required to avoid population trapping in uncoupled levels. An example of typical photon-counting statistics in the two qubit states is shown in Fig. 13.

Quantum Computing with Trapped Ions, Figure 12
Transitions used in the electron shelving detection of the state of an ion qubit. Only when the ion is in state |g⟩ does excitation occur (red arrow), and strong fluorescence back to the original state (wavy line) is observed. a Relevant levels in ⁴⁰Ca⁺. b Relevant levels in ⁹Be⁺

Quantum Computing with Trapped Ions, Figure 13
Histogram of detected photons after an ion is prepared in either the |g⟩ or the |e⟩ state. In correspondence with Fig. 12, |g⟩ designates the fluorescing state, while |e⟩ is not coupled to the probe laser and scatters practically no photons. Using the indicated threshold to determine the state of the ion, a high detection fidelity is achieved (97.9% in the case of Yb shown above [45])

By exciting the entire ion string simultaneously and imaging the fluorescent light with a CCD camera (cf. Fig. 3), the state of the entire quantum register can be read out simultaneously. According to the laws of quantum mechanics, a single measurement cannot retrieve the complete information on the quantum state, ruling out single-shot detection of arbitrary superpositions and entanglement between different qubits. Rather, a projection onto the basis states is obtained, with a probability given by the magnitude squared of the corresponding coefficient. Correlations between qubits are lost in this way. Alternatively, single qubits might be subjected to a state rotation (see Sect. "Single-Ion Operations") prior to measurement, mapping coherent superpositions of a qubit's states onto the basis states, which then are probed as described above. A useful transformation is achieved with π/2-pulses. If correlations between different qubits are of interest, e.g., entanglement of two qubits, the two-ion gates described in Sect. "Two-Qubit Interaction and Quantum Gates" may be used to disentangle the ions first and measure the qubits separately. For a more complete characterization of a quantum operation, methods of quantum state tomography must be used [73]. By preparing a certain quantum state repeatedly and averaging over measurements of 3^N different observables (for an N-ion register), the density matrix of the system may be reconstructed, providing full information on the quantum state [74].

Two-Qubit Interaction and Quantum Gates

The most important requirement for any quantum computer is the implementation of gates, logically connecting the quantum states of two or more qubits in a coherent way. Trapped ions in a linear trap are several μm apart and therefore do not interact directly, for example through their dipole moments. Instead, the interaction is established indirectly, based on two principles: (1) Due to the long-range Coulomb repulsion between the ions, their motion is strongly coupled, leading to collective vibrations of the ion string. (2) By driving the qubit transition with a laser tuned to a motional sideband, the internal state of any ion and the collective vibration of the entire string may be coupled. Using the collective vibration to logically connect different ions requires introducing a new qubit to the system, stored in the motional quantum state of the ion string. Since all ions participate in the vibration, it constitutes a quantum data bus for the entire register.

Normal Modes
The description of the collective vibration of a linear string of ions in the trapping potential is based on the concept of normal modes. In a normal mode, all ions oscillate in phase at the same frequency. A single normal mode can provide the bus for the transfer of quantum information between different ions. A linear string of N ions has N normal modes of collective axial vibration. The two with the lowest vibrational frequencies are of particular significance. The fundamental mode has a frequency equal to the axial frequency of the trap (ν = ω_z) and is known as the center-of-mass (COM) mode. Here, all ions move synchronously, i.e., in the same direction and by equal amounts, so that the relative positions of the ions stay the same. It has the advantage that all ions couple to the motion with the same strength. The disadvantage is that it also couples strongly to fluctuating ambient electric fields, leading to large heating rates (see Sect. "Decoherence"). Therefore, the next higher normal mode is often preferable. It is known as the stretch or breathing mode, since the ions on opposite sides of the center move in opposite directions with an amplitude proportional to their distance from the center. Its frequency is ν = √3 ω_z. The motion of a string of four ions in the COM and stretch mode is indicated in Fig. 14. The higher-order modes correspond to a more complicated motion of the ions. Their exact frequencies depend on the number of ions in the string and must be calculated numerically [12]. The vibrational excitation spectrum of an ion string becomes increasingly complex at higher ion number N, since not only does the number of normal modes increase, but oscillation at the sum and difference frequencies also contributes. This makes quantum operations based on the selective coupling of ions to one particular mode difficult in long strings.

Quantum Computing with Trapped Ions, Figure 14
The two lowest normal modes of a string of four ions. The positions of the ions are shown at two different times. Arrows indicate the motion of the ions. a COM mode at ν = ω_z; b stretch mode at ν = √3 ω_z

In the limit of low vibrational excitation, each mode must be treated as a quantum mechanical harmonic oscillator, the state of which can only be changed in units of single vibrational quanta of energy ħν (phonons), where ν is the angular frequency of this particular mode. For quantum computation, one collective normal mode is selected for the exchange of information, while the remaining N − 1 modes are treated as spectator modes, which should remain unexcited during the calculation.

Vibrational Coupling

In order to use the motion of the ions for information exchange, there must be a coupling between the internal state of an ion and the vibration. This is achieved by detuning the exciting laser from the ion's resonance at ω₀ to either the red or the blue vibrational sideband of the desired mode, i.e., to a frequency ω₀ ± ν. Assuming that the vibrational mode is in a state with phonon number n, red sideband excitation drives the transitions

|g⟩|n⟩ → |e⟩|n−1⟩   (n = 1, 2, …) ,   (13)

while blue sideband excitation drives the transitions

|g⟩|n⟩ → |e⟩|n+1⟩   (n = 0, 1, …) .   (14)
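The normal-mode frequencies quoted above (COM mode at ω_z, stretch mode at √3 ω_z) can be cross-checked numerically. The sketch below (Python with NumPy; not from the article) diagonalizes the standard axial coupling matrix of an ion string in scaled units, using the known scaled equilibrium positions ±(1/4)^{1/3} for two ions:

```python
import numpy as np

def axial_mode_frequencies(u):
    """Axial normal-mode frequencies (in units of omega_z) for an ion string
    with scaled equilibrium positions u; length unit (e^2/(4 pi eps0 m omega_z^2))^(1/3)."""
    n = len(u)
    a = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                # Trap restoring force plus Coulomb coupling to all other ions.
                a[i, i] = 1 + 2 * sum(1 / abs(u[i] - u[k]) ** 3
                                      for k in range(n) if k != i)
            else:
                a[i, j] = -2 / abs(u[i] - u[j]) ** 3
    return np.sqrt(np.sort(np.linalg.eigvalsh(a)))

# Scaled equilibrium positions for two ions: +/- (1/4)^(1/3).
d = 0.25 ** (1 / 3)
freqs = axial_mode_frequencies(np.array([-d, d]))

# COM mode at omega_z, stretch mode at sqrt(3)*omega_z.
assert np.allclose(freqs, [1.0, np.sqrt(3)])
```

For longer strings the equilibrium positions have no closed form and must themselves be found numerically, as noted in the text.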
Here, the first symbol represents the ion's state and the second the state of the vibrational mode, with |n⟩ corresponding to a number state with n phonons in the mode. The transitions corresponding to (13) and (14) are depicted in Fig. 15. In each case, a change of the state of the ionic qubit coincides with a change of the phonon number, i.e., of the state of the data bus.

By analogy with the resonant case discussed in Sect. "Single-Ion Operations", the evolution of the system for sideband excitation is characterized by an oscillatory change of populations at a modified Rabi frequency. The probability of sideband transitions scales with a factor of η, the so-called Lamb–Dicke parameter. It is given by

η = 2π a₀/λ ,   (15)

where a₀ is the size of an ion's wavepacket and λ the wavelength of the transition. For the stretch mode of a two-ion crystal, for example, the Lamb–Dicke parameter in the ground state is given by

η = ± 12^{−1/4} √(ħ/(2mω_z)) (ω_L/c) cos θ_L ,   (16)

where θ_L is the angle between the laser beam and the trap axis (cf. Fig. 10). The Rabi frequency for the first sideband transitions (13) and (14) depends on η and the number of excited vibrational quanta:

Red sideband:  Ω⁻(n) = η Ω₀ √n
Blue sideband: Ω⁺(n) = η Ω₀ √(n+1) ,   (17)

where Ω₀ is the Rabi frequency for resonant excitation from Eq. (8) or Eq. (12).

SWAP Gate
A two-qubit gate that illustrates the use of sideband excitation in a system with one ionic and one vibrational qubit is the SWAP gate. In order to restrict the infinite ladder of vibrational states to a binary system, the qubit is represented by the two lowest vibrational states |0⟩ and |1⟩. The computational space is then spanned by the four basis states

|g⟩|0⟩ , |e⟩|0⟩ , |g⟩|1⟩ and |e⟩|1⟩ ,   (18)

i.e., the first two pairs of levels in Fig. 15. Assuming that the vibrational qubit is initially in the state |0⟩ (after sideband cooling), an arbitrary superposition state α|g⟩ + β|e⟩ of the ion can be swapped to the vibrational mode by applying a π-pulse on the first red sideband (Fig. 16). With the state |g⟩|0⟩ not coupled to the red sideband, the mapping obtained is

(α|g⟩ + β|e⟩) ⊗ |0⟩ → |g⟩ ⊗ (α|0⟩ + β|1⟩) .   (19)

This operation is the basis for making the quantum state of an ion in a string accessible to any other ion sharing the collective motion. Two-ion gates may be realized by combining the above SWAP operation with a gate for the vibrational mode and a second ion. The transformation described by Eq. (19) requires that the vibrational mode is not excited initially. In particular, if the state |e⟩|1⟩ were exposed to a red sideband pulse, a transition to |g⟩|2⟩ would occur, which is outside the computational subspace. More elaborate pulse sequences are required to keep the ion dynamics within the space spanned by the four states of Eq. (18). A SWAP gate that preserves the computational basis was realized by pulses of blue sideband excitation [75].

Quantum Computing with Trapped Ions, Figure 15
Vibrational structure of the qubit states and transitions for excitation on a the first red sideband and b the first blue sideband. The three lowest vibrational levels for each qubit state are shown, displaced horizontally according to the number of phonons for clarity

Quantum Computing with Trapped Ions, Figure 16
Mapping of a qubit from an ion to the vibrational mode using a π-pulse on the red sideband

Cirac–Zoller Gate

The coupling of two ions in a string through their collective external motion was first proposed in 1995 in a seminal paper by Cirac and Zoller [4]. At the heart of their proposal is a two-qubit controlled Z-gate, with one qubit stored in the vibrational state of the COM mode and the other stored in the internal state of an ion. The controlled Z-gate is the simplest universal quantum gate, from which arbitrary logic gates may be built in conjunction with single-qubit operations. The vibrational state controls the transformation of the ion state, to which the Z-gate of Eq. (11) is applied if and only if the vibrational qubit is in state |1⟩. Alternatively, this can be expressed as the wave function |ψ⟩ of the system acquiring a change of sign if both input qubits are in the excited state (|1⟩ and |e⟩) and being left unchanged in all other cases. It corresponds to the following truth table:

input    | output
|0⟩|g⟩  | |0⟩|g⟩
|0⟩|e⟩  | |0⟩|e⟩
|1⟩|g⟩  | |1⟩|g⟩
|1⟩|e⟩  | −|1⟩|e⟩

As shown in Table 2, a sign change is achieved by applying a 2π-pulse to the internal states of the ion. In the Cirac–Zoller scheme, the required conditioning on the state |e⟩ of the ion is obtained by using a transition which couples exclusively to the upper internal state of the ion. This requires an auxiliary electronic level, e.g., another Zeeman sublevel of the ground state. Conditioning on the vibrational state |1⟩ is achieved by tuning to the first blue vibrational sideband, which induces a transition to a lower state only if at least one vibrational quantum is present. The scheme of this controlled Z-gate is illustrated in Fig. 17.

Quantum Computing with Trapped Ions, Figure 17
Controlled Z-gate in the Cirac–Zoller scheme. The laser pulse couples the auxiliary level to the state |e⟩|1⟩, leaving all other levels unaffected. Using a 2π-pulse results in a sign change of the |e⟩|1⟩-component

Another important gate may be implemented by combining the controlled Z-gate with single-qubit rotations. By applying a resonant π/2-pulse to the ion (changing the electronic states but leaving the vibration unaffected) before the controlled Z-gate and an inverse π/2-pulse after it, a controlled NOT (CNOT) gate is realized, in which the target bit (carried by the electronic state) is flipped depending on the state of the control bit (represented by the
vibrational state). Its truth table is:

input    | output
|0⟩|g⟩  | |0⟩|g⟩
|0⟩|e⟩  | |0⟩|e⟩
|1⟩|g⟩  | |1⟩|e⟩
|1⟩|e⟩  | |1⟩|g⟩
The π/2-pulses effectively change the computational basis in which the controlled Z-gate is performed. The described CNOT gate for the internal and the vibrational state of a single ion was demonstrated in 1995 at NIST in Boulder [22] with ⁹Be⁺, the first implementation of an ion-trap quantum gate. The auxiliary state |aux⟩ = |S₁/₂, F = 2, m_F = 0⟩ was chosen. The experimental results for different combinations of input states are shown in Fig. 18, confirming the realization of the above truth table.

Quantum Computing with Trapped Ions, Figure 18
Experimental verification of the truth table of a CNOT quantum gate, starting from the four combinations of basis states indicated on the input axis [22]. The bars represent the measured probability for being in the lower atomic state |g⟩ (blue) and in the first excited vibrational state |1⟩ (red). The ionic qubit is flipped if the vibration is initially in state |1⟩, while the vibrational state itself is unchanged

In the original scheme of Cirac and Zoller, the qubits are stored in the electronic states of two ions in a string, while the center-of-mass motion of the string is used as a data bus only during the gate operation. Therefore, before applying the controlled Z-gate to the target ion, the state of the first ion (control qubit) must be transferred to the vibrational motion by means of the SWAP operation described in Sect. "SWAP Gate". After completion of the gate, the state of the control ion must be restored by mapping the state of the vibrational mode back to the first ion. The implementation of the full Cirac–Zoller gate with qubits stored in different ions has been accomplished recently in Innsbruck [76]. In this experiment, a simplified procedure for realizing the controlled Z-gate was used, avoiding the need for an auxiliary level. In principle, a controlled Z-operation can be accomplished by applying a 2π-pulse on the blue sideband of the qubit transition. However, Eq. (17) implies that the state |g⟩|1⟩ has a blue-sideband Rabi frequency √2 times higher than |g⟩|0⟩, so that this state experiences a 2√2 π pulse. A 2π-pulse for all levels (except |e⟩|0⟩, which does not couple to blue sideband excitation) can be achieved by using a transition composed of four pulses, rotating the state of the target ion by different angles and around different axes [77].
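The √2 mismatch of pulse areas that motivates the composite sequence follows directly from Eq. (17); a two-line check (Python with NumPy; the values of η and Ω₀ are assumed, illustrative numbers):

```python
import numpy as np

# Blue-sideband Rabi frequencies from Eq. (17): Omega_plus(n) = eta*Omega0*sqrt(n+1).
eta, omega0 = 0.05, 2 * np.pi * 100e3   # assumed illustrative values

omega_n0 = eta * omega0 * np.sqrt(0 + 1)   # starting from |g>|0>
omega_n1 = eta * omega0 * np.sqrt(1 + 1)   # starting from |g>|1>

# Choose the pulse duration so that |g>|0> undergoes a 2*pi rotation ...
t = 2 * np.pi / omega_n0

# ... then |g>|1> sees a pulse area of 2*sqrt(2)*pi, not a multiple of 2*pi,
# which is why a composite sequence of four pulses is needed.
assert np.isclose(omega_n1 * t, 2 * np.sqrt(2) * np.pi)
```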
Quantum Computing with Trapped Ions, Figure 19
Controlled Z-gate with composite pulses. Using a sequence of four blue-sideband pulses, all states |g⟩|0⟩, |g⟩|1⟩ and |e⟩|1⟩ experience a 2π-rotation, while |e⟩|0⟩ is unchanged. This corresponds to a controlled Z-gate, up to an overall minus sign. The drawing on the right shows the motion of the Bloch vector of a target qubit starting and ending in the state |g⟩|0⟩, picking up a phase shift of π
Quantum Computing with Trapped Ions
Quantum Computing with Trapped Ions, Figure 20 Experimentally observed truth table of the Cirac–Zoller CNOT derived from measurement of the joint probabilities of the two qubits [76]
The dynamics of this gate is illustrated in Fig. 19, showing the levels coupled by effective 2π-pulses. Note that there is no excitation of levels outside the computational subspace, since 2π rotations do not change the occupation of the states involved. The experimentally determined truth table is shown in Fig. 20. The corresponding fidelity of the gate operation is between 70% and 80%.

Geometric Gates

The direct coupling of qubits and vibration in the Cirac–Zoller gate requires the bus-mode to be in the quantum mechanical ground state, a condition which is not easy to satisfy for long times in the presence of motional heating. In addition, individual addressing of control and target ion is needed. In order to relax these requirements, a different type of gate was proposed, based on electronic-state-dependent motional displacements [78,79,80,81,82,83]. Being simpler and less error-prone, this technique offers high-fidelity gate operation. In a geometric gate, the state of the harmonic oscillator corresponding to the vibrational bus-mode of the ions is displaced in phase-space (i. e. its position coordinate ⟨z⟩ and momentum ⟨p⟩ are changed), conditioned on the internal state of the qubits involved. The displacement is applied in such a way that it follows a closed loop in phase space, returning the vibration to its original state. For a detuning δ = ω_L − ω₀ between the effective frequency difference ω_L of the Raman excitation and the qubit transition frequency, this occurs after a time τ given by τδ = 2πm for integer m. At this time, any entanglement between the bus and the qubits vanishes. Therefore, the only effect of the operation is that the system acquires a phase (geometric phase), equal to the area in phase space enclosed by the trajectory of the displacement,
Quantum Computing with Trapped Ions, Figure 21 Phase-space representation of the bus-mode excitation during a geometric gate. The blue and the red circle are the trajectories for internal states |e⟩|g⟩ and |g⟩|e⟩, respectively. In each case, a phase Φ is acquired
as shown in Fig. 21 [82]:

  Φ = (1/ℏ) ∫_area dz dp = 2πm (Ω/δ)² .   (20)
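Equation (20) can be checked numerically: for a circular displacement loop of radius Ω/δ traversed m times, the accumulated phase equals 2πm(Ω/δ)². A small numpy sketch (the dimensionless scaling α = (z + ip)/√2 is my convention, chosen so that the area integral becomes the line integral ∮ Im(α* dα)):

```python
import numpy as np

# Bus-mode trajectory during the gate: a circle of radius Omega/delta in
# dimensionless phase space, traversed m times (a closed loop as in the text).
Omega, delta, m = 0.2, 1.0, 3
r = Omega / delta

t = np.linspace(0.0, 2 * np.pi * m / delta, 20001)
alpha = r * np.exp(1j * delta * t)      # coherent-state amplitude alpha(t)

# Accumulated geometric phase as the line integral  Im( conj(alpha) d alpha )
phase = np.sum(np.imag(np.conj(alpha[:-1]) * np.diff(alpha)))
print(phase, 2 * np.pi * m * r**2)      # both ~0.754, in agreement with Eq. (20)
```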
The phase has properties which make it particularly suitable for a quantum gate: (1) it depends on the internal state of the two qubits involved (e. g., Φ is non-zero if and only if the two qubits are in different states); (2) it is independent of the initial state of the bus-mode, as long as the excursions stay within the Lamb–Dicke regime (i. e. are smaller than the wavelengths of the transitions involved). Therefore, the phase of the system is much less susceptible to motional heating. Insensitivity to the bus-mode excitation could also be reached by making the detuning δ from single-photon resonance large [79]. However, this leads to an unfavorably slow evolution of the interaction between the qubits. The selective displacement of the bus-mode for certain qubit states is achieved through optical forces exerted by suitable Raman pulses. Two different methods have been applied, resulting in different qubit dynamics and therefore different gates.

Mølmer–Sørensen or xy-Gate

In this case, bichromatic Raman beams are used. They provide two equally strong effective fields, with opposite detunings ±δ from the red and the blue vibrational sideband of the qubit-transition. This drives a two-photon transition, corresponding to the two-qubit interaction Hamiltonian

  H_I = (ℏ(ηΩ)²/(2δ)) (σ̂_x ⊗ σ̂_x) ,   (21)
where σ̂_x = (|g⟩⟨e| + |e⟩⟨g|)/√2 is a Pauli spin-matrix. In the Bloch-picture, the Hamiltonian (21) provides a synchronous rotation of the internal states of both qubits around the x-axis (cf. Fig. 9). Alternatively, a rotation around the y-axis could have been chosen, hence the name xy-gate. A gate based on this type of interaction was first implemented in an experiment generating the entangled states (|gg⟩ + i|ee⟩)/√2 of two qubits and (|gggg⟩ − i|eeee⟩)/√2 of four qubits, starting from a string of ions initialized in the internal ground state [84,85]. It has recently been used for the entanglement of magnetic-field-insensitive qubit-states [86,87].

z-Gate

This gate is similar to the xy-gate, except that different basis states are affected by the interaction. Here, the frequency difference ω_L of the two Raman beams is chosen to lie close to the vibrational frequency ν. In this case, the displacement is provided by the optical dipole force from the Raman beams, which is different for each qubit-state. If the spacing between the ions, as well as the polarization and detuning of the Raman-beams, are chosen correctly, there will be a vibrational displacement only if the ions are in different states. Again, for a closed loop of the displacement in phase space, the vibrational state decouples from that of the qubits, and the state of the system acquires a phase shift Φ. The Hamiltonian corresponding to this interaction is similar to Eq. (21), except that the states of the qubits are rotated around the z-axis of the Bloch-sphere. The transformations of the basis states for the two types of geometric gates discussed are summarized in the following table:
Quantum Computing with Trapped Ions, Figure 22 CNOT-gate with control qubit stored in the motional ground state (|0⟩) and second excited state (|2⟩) of the bus-mode. A laser pulse drives a 4π or a 3π transition, returning the ion to its initial electronic state or toggling its state when the bus is in state |0⟩ or |2⟩, respectively [92]
Cirac–Zoller type, which use sideband transitions for coupling internal and motional states, the gate of Monroe et al. uses pulses tuned to resonance with the qubit transition. The necessary conditioning on the state of the vibrational bus-mode is obtained by taking into account the higher-order dependence of the Rabi-frequency on the Lamb–Dicke parameter η from Eq. (15). In fact, if the extension of the wavepacket corresponding to the ion is not small compared with the wavelength of the qubit transition, i. e., η ≳ 1, the system is no longer in the Lamb–Dicke regime and the resonant Rabi-frequency becomes

  Ω(n) = Ω₀ e^(−η²/2) L_n(η²) ,   (22)
A z-gate has been realized experimentally with two ⁹Be⁺-ions [88], interacting via the stretch-mode. The entangled state (|00⟩ − i|11⟩)/√2 was produced with a fidelity of 97%. The high fidelity confirms the benefit of the reduced requirements regarding thermally excited motion and addressing of the ions in a geometric gate. The z-gate was used to generate and investigate entanglement between a single ⁴⁰Ca⁺-ion and its vibrational motion [89], as well as between two ⁴⁰Ca⁺-ions [90].
where Ω₀ is the Rabi-frequency in the Lamb–Dicke limit given by either Eq. (8) or Eq. (12), and L_n(x) is the Laguerre polynomial of order n. By adjusting η, the transition can be driven by a single pulse of a suitable length in such a way that, for example, the phase change of the system is Φ = 4π if the vibration is in state |0⟩ and Φ = 3π if it is in state |2⟩ (note that this requires using the basis states |0⟩ and |2⟩ for the bus, replacing the usual |0⟩ and |1⟩). In the first case, there is no change of the system, while in the latter there is a flip of the qubit state, as required for a CNOT-gate. This gate was realized experimentally at NIST [92] using η = 0.359. Physically, the conditional dynamics of the gate relies on the different size of the ion's wavepacket in different vibrational states. The advantages of this type of gate are that only a single pulse is needed, no auxiliary levels are involved, and there are no differential Stark-shifts of the levels during the gate operation.
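Equation (22) can be evaluated directly to see why η = 0.359 works: at this Lamb–Dicke parameter the ratio Ω(0)/Ω(2) is almost exactly 4/3, so a single pulse whose area is 4π for bus state |0⟩ is simultaneously ≈ 3π for |2⟩. A minimal check (the numbers, not the pulse implementation, come from the text):

```python
import numpy as np
from numpy.polynomial.laguerre import lagval

eta = 0.359                       # Lamb-Dicke parameter used at NIST [92]

def rabi(n, eta, omega0=1.0):
    """Resonant Rabi frequency outside the Lamb-Dicke regime, Eq. (22)."""
    L_n = lagval(eta**2, [0.0] * n + [1.0])   # Laguerre polynomial L_n(eta^2)
    return omega0 * np.exp(-eta**2 / 2) * L_n

ratio = rabi(0, eta) / rabi(2, eta)
print(ratio)                      # ~1.3324, i.e. almost exactly 4/3

# A pulse whose area is 4*pi for bus state |0> has area ~3*pi for |2>:
t = 4 * np.pi / rabi(0, eta)
print(rabi(2, eta) * t / np.pi)   # ~3.002
```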
CNOT-Wavepacket Gate
Fast Gates
A different route to a simplified CNOT-gate was proposed by Monroe and coworkers [91]. In contrast to gates of the
The two-ion gates discussed so far use one normal mode of the string to couple the qubits. In this case, the gate
input     z-gate           MS-gate
|g⟩|g⟩   |g⟩|g⟩           (|g⟩|g⟩ + ie^(iΦ)|e⟩|e⟩)/√2
|g⟩|e⟩   e^(iΦ)|g⟩|e⟩    (|g⟩|e⟩ + ie^(iΦ)|e⟩|g⟩)/√2
|e⟩|g⟩   e^(iΦ)|e⟩|g⟩    (|e⟩|g⟩ + ie^(−iΦ)|g⟩|e⟩)/√2
|e⟩|e⟩   |e⟩|e⟩           (|e⟩|e⟩ + ie^(−iΦ)|g⟩|g⟩)/√2
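The Mølmer–Sørensen transformation |g⟩|g⟩ → (|g⟩|g⟩ + ie^(iΦ)|e⟩|e⟩)/√2 follows from exponentiating the σ̂_x ⊗ σ̂_x interaction of Eq. (21). A minimal numpy sketch (phase convention chosen so that Φ = 0; the effective rotation angle π/4 gives the maximally entangling gate):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
XX = np.kron(X, X)

def ms_gate(theta):
    """U = exp(i*theta*XX); since XX squared is the identity, this equals
    cos(theta)*I + i*sin(theta)*XX."""
    return np.cos(theta) * np.eye(4) + 1j * np.sin(theta) * XX

U = ms_gate(np.pi / 4)                        # maximally entangling MS gate
gg = np.array([1, 0, 0, 0], dtype=complex)    # |g>|g>
out = U @ gg
print(out)   # (|gg> + i|ee>)/sqrt(2), as in the MS-gate column with Phi = 0
```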
Quantum Computing with Trapped Ions, Table 3 Important two-qubit gates employed in ion-trap quantum computation. The table lists the matrices representing the gate operators for the joint basis states, along with the symbols used in quantum circuits. See Tab. 2 for single-ion operations

Gate: SWAP — Ion state swapped to bus-mode
Basis (g0, g1, e0, e1):
  ( 1 0 0 0 )
  ( 0 0 1 0 )
  ( 0 1 0 0 )
  ( 0 0 0 1 )
Method: blue sideband composite π-pulse
Ref.: Sect. "SWAP Gate", [75]

Gate: CZ — Controlled Z-gate: one combination of basis states acquires a minus sign (= π phase shift, hence the name phase gate)
Basis (0g, 0e, 1g, 1e):
  ( 1 0 0  0 )
  ( 0 1 0  0 )
  ( 0 0 1  0 )
  ( 0 0 0 −1 )
Methods: 2π-pulse on auxiliary transition; blue sideband composite 2π-pulses
Ref.: Sect. "Cirac–Zoller Gate", [22], [76]

Gate: CNOT — Controlled NOT gate: NOT applied to target ion if control qubit (vibration) is 1
Basis (0g, 0e, 1g, 1e):
  ( 1 0 0 0 )
  ( 0 1 0 0 )
  ( 0 0 0 1 )
  ( 0 0 1 0 )
Methods: CZ between π/2-pulses; wavepacket-gate; two-ion CNOT if combined with SWAP gate
Ref.: Sect. "Cirac–Zoller Gate", [22], [76]; Sect. "CNOT-Wavepacket Gate", [92]

speed is limited by the mode frequency ν, or, more generally, the trap frequency ω_z, since on a faster time-scale the vibrational sidebands cannot be resolved for selective excitation. Recently, geometric gates have been proposed which offer processing speeds beyond the vibrational frequencies [93,94,95,96]. This is possible in long ion strings, if all normal modes of vibration are excited simultaneously. By using suitably shaped laser pulses [93,95] or a fast sequence of precisely timed pulses [93,94,95,96], the time required for a two-ion phase-gate may be reduced below 2π/ω_z. The reason is that with fast pulses local oscillations of the addressed ions may be excited, in which none of the other ions undergoes significant motion. This requires the coherent excitation of a large number of normal modes with a minimum of five laser pulses [96]. The most important fundamental logic gates realized in ion traps are summarized in Tab. 3, along with their
matrix representation and the corresponding notation in quantum circuit diagrams.

Decoherence

The most important property distinguishing a qubit from a classical bit is that its two states have a well-defined relative phase, i. e., they are coherent. Uncontrollable interactions of a qubit with its environment destroy this coherence, a process known as decoherence. This includes the radiative decay of qubit states. Quantum memory as well as quantum gates are affected by decoherence, and decoherence rates are important figures of merit for any quantum processing device. Trapped ion systems have been demonstrated to reach very low decoherence rates, making them the most successful implementation of quantum information processing to date. One way to measure decoherence
is through the loss of contrast of the coherent Rabi-oscillation, either on the carrier or on a sideband, as shown in Fig. 11.

Quantum Computing with Trapped Ions, Table 3 (continued)

Gate: MS — Mølmer–Sørensen gate: multi-ion entanglement
Basis (gg, ge, eg, ee):
  (1/√2) ×
  (     1          0          0      ie^(−iΦ) )
  (     0          1      ie^(−iΦ)      0     )
  (     0      ie^(iΦ)        1          0     )
  ( ie^(iΦ)       0          0          1     )
Methods: bichromatic Raman excitation; closed loop in motional phase-space; scalable to N ions
Ref.: Sect. "Mølmer–Sørensen or xy-Gate", [79], [84], [97]

Gate: Z — Geometric Z-phase gate: phase shift for ions in different basis states
Basis (gg, ge, eg, ee):
  ( 1     0        0      0 )
  ( 0  e^(iΦ)      0      0 )
  ( 0     0     e^(iΦ)    0 )
  ( 0     0        0      1 )
Methods: optical dipole force; phase shift from closed trajectory in phase space; controlled Z up to single-bit rotations
Ref.: Sect. "z-Gate", [88], [89]

Internal State Decoherence

The coherence of internal states is limited by the radiative decay of the qubit-levels. The excited states are either metastable, with lifetimes on the order of seconds, or hyperfine ground-states with even smaller decay rates. The dominant contribution to the decoherence of internal states in ions used for quantum information processing are fluctuating ambient magnetic fields [98,99,100]. Through the Zeeman-shift of the levels, they result in random fluctuations of the transition frequency between the qubit states. The sensitivity to magnetic field fluctuations may be reduced by using transitions which are independent of magnetic field to first order. This strategy has been successfully applied in atomic clock systems [101]. Examples are states with magnetic quantum number m = 0 at zero magnetic field, but similar transitions also exist at finite magnetic field. For a single magnetic-field-insensitive qubit, memory coherence times greater than 10 s were observed [39], five orders of magnitude larger than in experiments with magnetic-field-dependent states in ⁹Be⁺ [102]. Another source of decoherence are the lasers driving the qubit transition. Via frequency and intensity fluctuations, they contribute to the loss of phase coherence of the qubits. Coherence is more difficult to maintain as the length of the quantum register increases. Collective dephasing is presently the largest source of decoherence and the limiting factor for the generation of large entangled states (see Sect. "Multiparticle Entanglement"). It is caused, for example, by variations of the magnetic field, the electric field, or the laser intensity across the size of a typical ion string. Ions in different locations, or ions being shuttled through a large-scale trap architecture, may be affected by these variations in slightly different ways.

Qubits in Decoherence-Free Subspace

Frequently, the environment couples to each qubit of a quantum register in an identical way. In this case, quantum information may be protected by encoding it not in the physical basis states |g⟩ and |e⟩ of the qubits, but in two entangled states of two physical qubits, for example

  |Ψ±⟩ = (|ge⟩ ± |eg⟩)/√2 .   (23)
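A quick numerical check of the protection offered by Eq. (23): a common phase on |e⟩ of both qubits changes |Ψ±⟩ only by a global phase, while the same noise dephases a bare single-qubit superposition. A small numpy sketch:

```python
import numpy as np

phi = 0.7                                  # arbitrary collective phase error
P = np.diag([1, np.exp(1j * phi)])         # |e> -> e^{i phi}|e> on one qubit
P2 = np.kron(P, P)                         # identical coupling to both qubits

# DFS state |Psi+> = (|ge> + |eg>)/sqrt(2); basis order gg, ge, eg, ee
psi = np.array([0, 1, 1, 0], dtype=complex) / np.sqrt(2)
overlap = abs(np.vdot(psi, P2 @ psi))
print(overlap)                             # ~1.0: coherence fully preserved

# A bare superposition (|g> + |e>)/sqrt(2) is not protected:
q = np.array([1, 1], dtype=complex) / np.sqrt(2)
print(abs(np.vdot(q, P @ q)))              # cos(phi/2) ~ 0.939 < 1
```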
These logical states are invariant against common phase shifts of the individual basis states. For example, if both ions simultaneously acquire a phase shift φ in state |e⟩, i. e., undergo the transformation |e⟩ → exp(iφ)|e⟩, |Ψ±⟩ is not affected, while the individual qubits would lose their phase coherence. The space spanned by |Ψ±⟩ is called a decoherence-free subspace (DFS) [103,104,105]. It has been demonstrated [85] that a physical qubit can be reversibly encoded in a DFS, achieving coherence times two orders of magnitude longer than for the physical qubit. DFS-encoded states have been used in conjunction with qubits based on field-independent transitions to produce even longer-lived entanglement of the logical qubits [39].

Vibrational State Decoherence

The coupling required for quantum gates in a string of trapped ions is provided by their collective motion. The mode selected as a data-bus must preserve coherent superpositions of the motional ground state |0⟩ and the state |1⟩. Any motional decoherence of the ion chain degrades gate fidelities and must be avoided. The ions vibrating in a string constitute an oscillating electric dipole, and hence their motion couples to fluctuating electric fields in the environment. These may be generated, for example, by patch electric fields on the trap electrodes. The resulting noise in the ions' motion is a prime source of decoherence in ion trap quantum information processing. The most obvious environmental influence on the vibration of a string is the excitation of vibrational quanta (phonons), corresponding to motional heating of the mode. This has been investigated in a number of experiments [62,65,106]. By measuring the motional excitation as a function of time starting in the ground state, heating rates between 1/ms [106] and 0.005/ms [62] were observed.
Factors influencing the heating rate are the distance d between ions and electrodes (scaling approximately as d⁻⁴ [107]) and the coating of trap electrodes with atoms during loading of the trap, resulting in patch-potentials. Therefore, careful shielding of the electrodes from atomic-beam exposure is important. Another way to reduce heating rates is cooling the electrodes [107,108]. For a multi-ion string, the rate of motional heating strongly depends on which normal mode is used. The COM-mode has heating rates comparable to that of a single ion, as it can be excited by a homogeneous fluctuating field coupling to each individual ion in an equivalent way. The stretch-mode, on the other hand, can only be excited by a field gradient and is therefore much less sensitive to fluctuating fields emanating from the trap electrodes.
Therefore, for high-fidelity gate operations, the stretch mode is preferable. While motional heating sets an upper limit to the coherence time of bus qubits, for reliable quantum operations dephasing of motional superpositions must also be taken into account. This has been investigated at NIST [109], in Innsbruck [110] and in Oxford [111]. The dephasing times found are on the order of 100 ms. Dephasing is faster for larger differences in the quantum numbers of the states involved, making it advantageous to use the lowest vibrational states |0⟩ and |1⟩. Even though only one normal mode is used for quantum information transfer, heating of the remaining N − 1 axial vibration modes in an N-ion string (spectator modes) is detrimental, since it affects the Rabi-frequency of the bus-mode. This may be alleviated by cooling all modes of the string, ideally to the quantum mechanical ground state. With typical gate durations on the order of 1 μs, decoherence times of 1 s would permit 10⁶ quantum operations to be performed. Another approach is to consider the probability of error during a gate operation [112]. It should be smaller than 10⁻⁴ for quantum error correction to be applied successfully. This corresponds to a fidelity of F > 0.9999, which present quantum gates do not yet achieve.

Quantum Algorithms

The two-ion gates described in the previous section are universal, i. e., any logic circuit may be realized by combining them with single-qubit operations [113,114]. In this section, examples of algorithms implemented in ion traps are presented. The most important quantum algorithms have been demonstrated in ion-trap systems in their simplest form.

Deutsch–Jozsa Algorithm

The Deutsch–Jozsa algorithm [115] is the simplest example of quantum parallelism. The task it performs is to decide whether a function f is constant, i. e., has either always the result f(x) = 0 or always f(x) = 1, independent of the input x, or whether it is balanced, i. e.
its output depends on the input (f(x) = x or f(x) = NOT x). With classical means, two function evaluations are necessary, whereas the quantum algorithm performs the task with a single application of the function to a superposition state. The algorithm has been demonstrated with a single ionic qubit, using its vibrational excitation as an auxiliary qubit [75] (see Fig. 23). The required pulses were applied
Quantum Computing with Trapped Ions, Figure 23 Quantum circuit implementing the Deutsch–Jozsa algorithm with a trapped ion. The upper line represents the ionic qubit |x⟩, the lower line an auxiliary vibrational qubit. In the central box, one of four possible functions f(x) is implemented by acting on y with the identity, a NOT gate and/or a CNOT-gate controlled by x. A final measurement of the ion in state |e⟩ indicates a balanced function after only a single function evaluation [75]
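The logic can be sketched with a one-qubit phase-oracle version of the algorithm (a textbook simplification, not the ion-trap pulse sequence of Fig. 23): between two Hadamard gates, the oracle multiplies |x⟩ by (−1)^f(x); measuring |1⟩ then flags a balanced function after a single evaluation.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)        # Hadamard gate

def deutsch(f):
    """Classify f as constant or balanced after ONE oracle application."""
    oracle = np.diag([(-1) ** f(0), (-1) ** f(1)])  # phase oracle (-1)^f(x)
    psi = H @ (oracle @ (H @ np.array([1.0, 0.0]))) # H, oracle, H on |0>
    return "balanced" if abs(psi[1]) ** 2 > 0.5 else "constant"

results = [deutsch(f) for f in
           (lambda x: 0, lambda x: 1, lambda x: x, lambda x: 1 - x)]
print(results)   # ['constant', 'constant', 'balanced', 'balanced']
```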
either resonantly or on the blue sideband, in the latter case using composite pulses in order to avoid excitation of more than one phonon (cf. Sect. "Cirac–Zoller Gate").

Quantum Teleportation

A more complex task, and one of fundamental importance for quantum computation, is quantum teleportation: the transfer of an unknown quantum state from one qubit to another in a distant location [116]. Even though no finite number of measurements is sufficient to obtain a full specification of a quantum state, the transfer is possible with the help of non-local correlations provided by an entangled pair of qubits. While probabilistic teleportation has been demonstrated with photonic qubits, two ion-trap experiments have implemented a fully deterministic quantum teleportation protocol. At the University of Innsbruck, ⁴⁰Ca⁺-ions were used [99], while at NIST the experiment was performed with ⁹Be⁺ [98]. The teleportation procedure requires three ionic qubits. The essential steps are summarized in Fig. 24. Initially, ions B and C are prepared in a maximally entangled state using the gates described in Sect. "Cirac–Zoller Gate" (Innsbruck) and Sect. "Geometric Gates" (NIST). Subsequently, the state to be transferred is prepared in ion A by a single-qubit rotation. The central step of the teleportation protocol is a so-called Bell-state measurement of the states of ions A and B. This is achieved by using another gate to entangle qubits A and B and, after subjecting them
Quantum Computing with Trapped Ions, Figure 24 Schematic representation of quantum teleportation in a chain of three ions: the state of ion A (green) is transferred to ion C with the help of an entangled pair of ions B and C. Blue arrows indicate the manipulation of ions with laser pulses, while the red arrow corresponds to the transmission of two bits of classical information. This protocol is used in two independent experiments [98,99], which differ only in technical details
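The protocol of Fig. 24 can be sketched in state-vector form (the standard textbook protocol; the gate set and basis conventions here are mine, not the ion-trap pulse sequences of either experiment):

```python
import numpy as np

I2 = np.eye(2)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]])
Z = np.diag([1, -1])

def on(gate, pos):
    """Embed a single-qubit gate at position pos of the register (A, B, C)."""
    ops = [I2, I2, I2]
    ops[pos] = gate
    return np.kron(np.kron(ops[0], ops[1]), ops[2])

# CNOT with control A, target B, acting trivially on C
CNOT_AB = np.kron(np.array([[1, 0, 0, 0], [0, 1, 0, 0],
                            [0, 0, 0, 1], [0, 0, 1, 0]]), I2)

alpha, beta = 0.6, 0.8                        # "unknown" state on ion A
psi_A = np.array([alpha, beta])
bell_BC = np.array([1, 0, 0, 1]) / np.sqrt(2) # entangled pair of ions B and C
state = np.kron(psi_A, bell_BC)

# Bell-state measurement on A,B: CNOT and Hadamard, then projective readout
state = on(H, 0) @ (CNOT_AB @ state)
corrections = {(0, 0): I2, (0, 1): X, (1, 0): Z, (1, 1): Z @ X}

fids = []
for (a, b), corr in corrections.items():
    proj = state.reshape(2, 2, 2)[a, b, :]          # outcome (a, b) on A, B
    psi_C = corr @ (proj / np.linalg.norm(proj))    # conditional rotation on C
    fids.append(abs(np.vdot(psi_A, psi_C)))
print(fids)   # ~1.0 for every measurement outcome
```

Note that no single outcome carries information about (α, β); only the classically communicated two-bit result selects the correction that restores the state on C.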
Quantum Computing with Trapped Ions, Figure 25 a Circuit diagram for the quantum Fourier transform of three qubits, composed of Hadamard transforms and controlled Z-gates (with rotation angles φ = π/2 and π/4). In the actual experiment, these gates were replaced by equivalent rotation sequences around the x- and y-axis; b experimental output for the quantum Fourier transform of the input state |gee⟩ + |eee⟩ with period 4 [118]
to a π/2-pulse, measuring the state of both qubits. Taken separately, none of the four possible measurement outcomes (gg, ge, eg, or ee), nor the remaining ion C, carries any information about the original state. Only when one of four unitary state rotations, chosen depending on the outcome of the measurement, is applied to ion C does the state of ion C become identical to the original state A. In their implementation of this protocol, the two groups have employed quite different technologies. The qubits are stored differently, and different realizations of the phase gate are employed. The Innsbruck group addresses individual ions with tightly focused lasers, protecting the remaining ions from the target ion's fluorescence by hiding them in another internal state. At NIST, ions are selectively moved to separate zones in a segmented trap, where they can be manipulated individually while maintaining entanglement. In spite of these differences, in both experiments the quantum bit was transferred with a fidelity of around 75%. In a quantum computer, teleportation may be applied to transfer quantum information without physically moving qubits.

Quantum Fourier Transform
One of the most celebrated quantum algorithms was developed by P. Shor for finding the prime factors of a large number [117]. A central element of this algorithm is the quantum Fourier transform of a set of qubits. Here, the amplitudes x_k of a superposition of basis states are Fourier-transformed, resulting in a new state with amplitudes y_j:

  Σ_{k=0}^{N−1} x_k |k⟩ → Σ_{j=0}^{N−1} y_j |j⟩  with  y_j = Σ_{k=0}^{N−1} x_k e^(i2πjk/N) .   (24)

Here, |k⟩ is related to the corresponding ion-state through its binary representation, with 0 ↔ g and 1 ↔ e. The quantum Fourier transform was experimentally demonstrated in a system of three beryllium ions for states with periods 1, 2, 4, 8 and approximately 3 [118]. The sequence of gates that was applied is shown in Fig. 25a. It comprises Hadamard transforms and controlled Z-gates with different phase factors. In the experiment, these are replaced by equivalent sequences of X- and Y-pulses. A simplification of the scheme is possible by performing the read-out of each qubit immediately after the last Hadamard-gate is applied. Measurement of the individual ions is achieved by separating them in a segmented linear trap. The subsequent quantum gates in Fig. 25a must then be replaced by classically controlled phase shifts. Figure 25b shows an example of the outcome of such a measurement. A state with period 4 in an 8-element space has non-zero Fourier components with a periodicity of 2 (= 8/4). Therefore, only every other state is observed in the output. Depending on the input state, an accuracy between 87% and 99% was reached.
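Equation (24) can be checked numerically for the period-4 input of Fig. 25b: the state |gee⟩ + |eee⟩ occupies k = 3 and k = 7 (binary with 0 ↔ g, 1 ↔ e), and its transform is non-zero only for every other j. A small numpy sketch:

```python
import numpy as np

N = 8                                  # 3 qubits -> 8 basis states
x = np.zeros(N, dtype=complex)
x[3] = x[7] = 1 / np.sqrt(2)           # (|gee> + |eee>)/sqrt(2), period 4

# Discrete Fourier transform as in Eq. (24): y_j = sum_k x_k e^{i 2 pi jk/N}
j, k = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
y = (np.exp(1j * 2 * np.pi * j * k / N) @ x) / np.sqrt(N)

print(np.round(np.abs(y) ** 2, 3))     # 0.25 on even j, 0 on odd j:
                                       # only every other state appears
```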
Grover's Quantum Search Algorithm

Grover's search algorithm is another example of a quantum computer outperforming its classical counterpart. It succeeds in finding a marked entry in a database of length n after approximately (π/4)√n queries, rather than n [119]. It requires an operator V (oracle) marking the target state |a⟩ of the search by flipping its sign, and an operator W to amplify the contribution of the target state until, after a few iterations, only the target state has a sizable probability. The position of the target state in the database is then revealed by a measurement. The corresponding quantum circuit is shown in Fig. 26.
Quantum Computing with Trapped Ions, Figure 26 Scheme of Grover's search algorithm for N qubits encoding n = 2^N entries, illustrated for N = 2. Initially, all qubits are prepared in the ground state. π/2-pulses prepare an equal superposition of all basis states. The oracle V flips the sign of one marked element |a⟩. Which state is marked is determined by the operators A and B, which can be either NOT (X-gate, see Tab. 2) or the identity. Two more π/2-pulses followed by a Mølmer–Sørensen gate amplify the weight of the marked state. After a number of iterations, the qubits are measured
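For a four-entry database, a single iteration of the scheme in Fig. 26 suffices, which can be checked with a small numpy sketch (oracle and amplification written as the abstract operators V = 1 − 2|a⟩⟨a| and W = 2|ψ⟩⟨ψ| − 1, not as the pulse sequence of the figure):

```python
import numpy as np

n = 4
psi0 = np.ones(n) / np.sqrt(n)          # equal superposition of all entries
a = 2                                   # index of the marked entry

V = np.eye(n) - 2 * np.outer(np.eye(n)[a], np.eye(n)[a])   # sign flip on |a>
W = 2 * np.outer(psi0, psi0) - np.eye(n)                   # inversion about the mean

state = W @ (V @ psi0)                  # one Grover iteration
print(np.round(np.abs(state) ** 2, 6))  # probability 1 on the marked entry
```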
In the experiment [120], a four-element database was set up with two ionic qubits. Using π/2-pulses, an equally weighted superposition of all basis states is prepared as the input: |ψ⟩ = (|gg⟩ + |ge⟩ + |eg⟩ + |ee⟩)/2. The oracle V as well as the amplification W were implemented using the Mølmer–Sørensen gate (Sect. "Mølmer–Sørensen or xy-Gate") together with single-qubit rotations. The iteration step in Fig. 26 takes the state |ψ⟩ through the following transformation:

  |ψ⟩ —V→ (|ψ⟩ − |a⟩) —W→ |a⟩ .

Therefore, the marked state |a⟩ should be retrieved in only a single query. The measured probability was 60%, surpassing the classical limit of 50%. For a register size larger than N = 2, the operators V and W have to be applied iteratively.

Quantum Error Correction

Any quantum computer is subject to noise, resulting in uncontrolled, irreversible changes of the qubits involved. Even using decoherence-free subspaces (see Sect. "Qubits in Decoherence-Free Subspace"), residual noise still reduces the fidelity of quantum memory and quantum gates. The ability to correct these errors is therefore of great importance for the reliable operation of a quantum computer. The challenge is that error correction must be accomplished without gaining any knowledge about the stored information, since this would destroy any quantum superpositions. An example of an error which might occur during a quantum computation is the flip of a single qubit. The correction of an error of this type has been demonstrated at NIST [100]. To this end, a redundant encoding of a single logical qubit in three physical qubits was applied. A qubit state α|g⟩ + β|e⟩ was encoded as an entangled state α|ggg⟩ + β|eee⟩. At NIST, the two-ion phase gate described in Sect. "Geometric Gates", extended to three ions, was used. The logical qubit was then subjected to an artificial bit-flip error of variable rate. After decoding the logical qubit by reversing the three-ion entangling gate, information on which error had occurred was obtained by measuring two of the physical qubits. The outcome determined which correction operation was applied to the remaining qubit to restore the original state. Finally, the qubit was analyzed to determine the efficiency of the method. The protocol is illustrated in Fig. 27. The data showed that quantum error correction improved the fidelity if bit-flips occurred with a probability larger than 25%. For smaller errors, the imperfections of the gates applied during encoding and decoding of the logical qubit outweighed the benefit of correction and led to a reduced fidelity. Improved gates are necessary before fault-tolerant operation can be achieved. Since quantum error correction schemes require a large number of physical quantum bits, passive protection against errors, for example through decoherence-free subspaces, is important.

Multiparticle Entanglement

As shown in the previous section, entanglement is an essential resource in many quantum algorithms. In the course of a quantum computation, a large number of qubits must be entangled. It is therefore of great relevance to investigate highly entangled quantum states of many qubits. This is not only an important demonstration of the amount of control over a quantum system, but can also serve as a resource for quantum information processing [121]. Entangled states may be generated with the help of two- or multi-bit quantum gates (Sect. "Two-Qubit Interaction and Quantum Gates"). An early achievement was the entanglement of four ions [84] using the Mølmer–Sørensen gate (Sect. "Mølmer–Sørensen or xy-Gate").
Since then, the development of new gates and better control over experimental parameters have led to the creation and verification of entangled states of up to eight ions. Each additional particle greatly increases the difficulty of the experiment. Two different classes of entangled states have been investigated. The first type is called Greenberger–Horne–Zeilinger (GHZ) or Schrödinger-cat states and consists of equal superpositions of maximally different quantum states of N ions:

  |GHZ_N⟩ = (|gg…g⟩ + e^(iφ)|ee…e⟩)/√2 ,   (25)

with N ions in each component.
Quantum Computing with Trapped Ions, Figure 27 Schematic representation of quantum error correction of flips of a single logical quantum bit, encoded in a chain of three ions. This protocol was demonstrated experimentally [100]
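The logic of the three-qubit protocol in Fig. 27 can be sketched in state-vector form (gate conventions and the CNOT-based encoding are the textbook repetition code, not the NIST pulse sequence; the experiment, of course, never reads out the data qubit itself):

```python
import numpy as np

I2 = np.eye(2); X = np.array([[0, 1], [1, 0]])

def op(gates):
    """Tensor product of a list of single-qubit operators."""
    m = gates[0]
    for g in gates[1:]:
        m = np.kron(m, g)
    return m

def cnot(control, target, n=3):
    """CNOT on an n-qubit register, written as a sum of projectors."""
    P0, P1 = np.diag([1.0, 0.0]), np.diag([0.0, 1.0])
    a = [I2] * n; a[control] = P0
    b = [I2] * n; b[control] = P1; b[target] = X
    return op(a) + op(b)

alpha, beta = 0.6, 0.8
encode = cnot(0, 2) @ cnot(0, 1)          # (a|0>+b|1>)|00> -> a|000>+b|111>
logical = encode @ np.kron([alpha, beta], [1, 0, 0, 0])

results = []
for flipped in range(3):                  # bit-flip on any one physical qubit
    err = [I2] * 3; err[flipped] = X
    noisy = op(err) @ logical
    decoded = cnot(0, 1) @ (cnot(0, 2) @ noisy)   # reverse the encoding
    d = decoded.reshape(2, 2, 2)
    # Syndrome = states of qubits 1 and 2; both 1 means the data qubit flipped
    s1 = np.abs(d[:, 1, :]).sum() > 1e-9
    s2 = np.abs(d[:, :, 1]).sum() > 1e-9
    data = d[:, int(s1), int(s2)]
    if s1 and s2:
        data = X @ data                   # correct the data qubit
    results.append(np.allclose(data, [alpha, beta]))

print(results)   # [True, True, True]: every single flip is corrected
```

The two measured qubits reveal only which error occurred, never the amplitudes α and β, which is exactly what makes the correction compatible with quantum superpositions.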
Three-ion GHZ-states were generated and controlled in 2004 at NIST (F = 89%) [122] and at the University of Innsbruck (F = 72%) [123]. In both experiments, the potential of entanglement for practical applications was explored. The largest GHZ-state so far was generated at NIST, using a generalization of the phase gate described in Sect. "z-Gate" for up to N = 6 ions [97]. Together with a common single-bit rotation of all the ions before and after the phase gate, the algorithm involved three steps, irrespective of the number of ions to be entangled, making this a very efficient method for multiparticle entanglement generation. The fidelity obtained for the six-ion entangled state was estimated to be better than 51%. In order to verify the coherence of the entangled states, experimenters have subjected them to another phase gate, identical to the entangling operation, but in a reference system rotated around the z-axis of the Bloch-sphere by an angle φ. In this way, the coherence of the state |GHZ_N⟩ is converted into a population difference, which can be directly measured. For a perfect N-ion GHZ-state, the population difference and hence the fluorescence signal varies as

  P_{gg…g} = (1/2)[1 − cos(Nφ)] .   (26)
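The N-fold speed-up of the oscillation in Eq. (26) reflects the fact that a collective phase φ per ion enters the GHZ coherence as Nφ. A small numpy sketch of this scaling, computing the parity signal ⟨σ_φ^⊗N⟩ = cos(Nφ) for the ideal state (an equivalent textbook form of the analysis, rather than the experimental pulse sequence):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])

def ghz(n):
    """(|gg...g> + |ee...e>)/sqrt(2) as a 2^n-dimensional state vector."""
    v = np.zeros(2 ** n, dtype=complex)
    v[0] = v[-1] = 1 / np.sqrt(2)
    return v

def signal(n, phi):
    """Expectation value of the rotated collective observable sigma_phi^(x n)."""
    s = np.cos(phi) * X + np.sin(phi) * Y
    S = s
    for _ in range(n - 1):
        S = np.kron(S, s)
    v = ghz(n)
    return np.vdot(v, S @ v).real

phis = np.linspace(0, 2 * np.pi, 50)
oks = [max(abs(signal(n, p) - np.cos(n * p)) for p in phis) < 1e-10
       for n in (4, 5, 6)]
print(oks)   # [True, True, True]: the signal oscillates as cos(N*phi)
```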
Results of this method for GHZ-states with four, five, and six ions are shown in Fig. 28. The observed contrast provides a lower bound for the coherence and hence the fidelity of the state. This method of probing coherences is closely related to Ramsey spectroscopy and, indeed, GHZ-states of multiple ions have been applied in spectroscopy. The N-fold increased oscillation frequency observed in Fig. 28 can be used to improve the phase sensitivity of a Ramsey spectrometer by a factor of √N compared to unentangled atoms [122]. Other applications of GHZ-states include quantum error correction (Sect. "Quantum Error Correction") and the deterministic preparation of Bell states. A different class of entangled states are the N-particle W-states, consisting of a superposition of N states with exactly one particle in state |g⟩ and all others in |e⟩:

  |W_N⟩ = (e^(iφ₁)|e…eeeg⟩ + e^(iφ₂)|e…eege⟩ + e^(iφ₃)|e…egee⟩ + ⋯ + e^(iφ_N)|gee…ee⟩)/√N .   (27)

These states were generated in Innsbruck with N = 3 [123] and later up to N = 8 [124]. First, all ions are initialized in state |e⟩ and the vibrational COM-mode in state |1⟩. N pulses on the blue sideband of the qubit transition are consecutively applied to each ion in the string, exciting the level |g⟩ of that ion with a probability of 1/N. The entanglement is verified by measuring all N² elements
Quantum Computing with Trapped Ions, Figure 28 Fluorescence measurement of decoded GHZ-states as a function of the decoding phase φ for N = 4, 5, and 6 ions. The sinusoidal dependence on Nφ is a signature of an N-ion GHZ state [97]. The observed contrast is related to the coherence of the state
of the density matrix, completely characterizing the quantum state. For the state |W₈⟩, a fidelity of F = 72% was determined. The entanglement properties of W-states are very different from those of GHZ-states. If the state of one particle is measured to be |e⟩, the remaining particles are still entangled. In addition, W-states are robust against global dephasing. They might serve as a resource for quantum information processing as well as quantum communication. The six-ion GHZ- and eight-ion W-states define the state of the art in the generation of multiparticle entanglement. The generation of even larger entangled states is a major challenge, due to the increased sensitivity to decoherence and inhomogeneous effects.

Distributed Quantum Information with Trapped Ions

The quantum systems discussed so far were confined to single ion traps. However, many important applications
require the distribution of quantum information among different locations. A prime example is quantum communication, which at present is based on photons as carriers of quantum information. Substantial benefits are expected from linking quantum communication with quantum computing. For example, small systems of trapped ions could act as routers or repeaters of quantum information. Teleportation is another application which is relevant mainly over long distances, i.e., between remote ion traps. Even quantum computation itself could gain from being spread among a number of local processing sites (distributed quantum computation), linked by quantum communication channels [125]. It is therefore an important goal to establish reliable quantum networks involving spatially separated traps. A first step is to create entanglement between ions stored in remote locations. There are three methods to achieve this: (1) An entangled pair of ions is prepared in one location, and one of the particles is subsequently transported to a distant site. (2) Entanglement is created between an ionic and a photonic qubit, the latter transferring the entanglement deterministically to an ion in a distant trap. (3) Entanglement is created locally in two distant traps between an ion and a photon on each side. Subjecting the emitted photons to a joint measurement projects the ions left behind onto an entangled state. This scheme works probabilistically, as it depends on a particular measurement outcome. The first method is suited for entanglement distribution within an extended trap architecture (Sect. “Large-Scale Ion Traps”), while the latter two methods allow long-distance entanglement.

Ion–Photon and Remote Ion–Ion Entanglement

A crucial step towards mapping quantum information from an atomic ion to a photon is to generate entanglement between trapped ions and photons, establishing a long-distance communication channel.
This was demonstrated at the University of Michigan using a single ¹¹¹Cd⁺ ion [126]. In the experiment sketched in Fig. 29a, the ion was excited to a state with two channels of spontaneous decay, leading to different hyperfine states of the ion (|g⟩ or |e⟩). Information on which transition has occurred is contained in the polarization of the emitted photon (horizontal = |H⟩ or vertical = |V⟩). Taking into account the relative strengths of the transitions, the resulting state is $\sqrt{1/3}\,\lvert H\rangle\lvert e\rangle + \sqrt{2/3}\,\lvert V\rangle\lvert g\rangle$, which is entangled. In the experiment, it was obtained with a fidelity of F = 97% (see
Quantum Computing with Trapped Ions, Figure 29 a Experimental setup for generation of ion–photon entanglement. Photons scattered from the excitation beam are analyzed by a beamsplitter distinguishing V- and H-polarization. The polarization rotator is used to change the measurement basis. The qubit transition is driven by a microwave field. b Measured conditional probabilities after both atomic and photonic qubits are rotated by a polar angle θ = π/2 on the Bloch sphere with equal phase Φ. If the atomic and photonic qubits were not entangled but in a statistically mixed state, all conditional probabilities in the figure would have been 0.5 [126]
Fig. 29b). Note that this method of generating entanglement is probabilistic, since it relies on the spontaneous emission of a photon. The generation of entanglement between ion and photon is only an intermediate step towards the goal of entangling the quantum states of ions in different traps. This has been achieved by the same group. Initially, single ¹⁷¹Yb⁺ ions are stored in two identical traps 1 m apart. Using the same principle as described above, an entangled ion–photon state $\big(\lvert e\rangle_i\lvert\nu_e\rangle_i - \lvert g\rangle_i\lvert\nu_g\rangle_i\big)/\sqrt{2}$ is created in each trap, where the index i identifies the trap. The photon states $\lvert\nu_g\rangle$ and $\lvert\nu_e\rangle$ are distinguished by their frequency. The two photon states are then brought to interference on a beamsplitter and measured at its two output ports. In case of a coincident detection, the state of the atoms is projected onto the entangled state $\big(\lvert e\rangle_1\lvert g\rangle_2 - \lvert g\rangle_1\lvert e\rangle_2\big)/\sqrt{2}$. An experimental verification yields a fidelity of 63% [127].

Ion-Trap Cavity-QED

One of the drawbacks of entanglement using spontaneously emitted photons is the low success probability, due to the emission of photons into the full solid angle. As a result, only one ion–ion pair was entangled every 8.5 min in the experiment reported in Ref. [127]. A solution to this problem is to place the ion in a high-finesse Fabry–Perot optical cavity resonant with the emitted photons. The cavity enhances the coherent interaction between ion and photon and provides a well-defined mode for the photon, resulting in a predetermined direction of propagation when it eventually escapes through a semitransparent mirror. Individual ⁴⁰Ca⁺ ions have already been stored in and coupled to an optical cavity, localized well within the Lamb–Dicke regime [128,129]. Optical cavities have applications beyond enhancing the success rate of probabilistic entanglement schemes. If the cavity mode volume is small enough, the coherent interaction between ions and photons becomes stronger than all spontaneous decay processes and the system dynamics becomes deterministic. This is known as the strong coupling limit of cavity-QED. It has been proposed as a technique to transfer quantum states or distribute entanglement deterministically in a quantum network [130]. According to this scheme, the qubit state of an ion is mapped to a cavity photon with the help of a laser pulse. This photon leaks out of the first cavity and enters a second cavity, in which it is absorbed by another ion. A requirement for this transfer is that the coupling between the second ion and the second cavity is provided by a laser pulse which is time-reversed with respect to the first pulse, and that the photon wavepacket carrying the quantum bit is time-symmetric.
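Ignoring pulse shaping, cavity decay and all other experimental detail, this network transfer reduces to two consecutive swaps: ion 1 → photon, then photon → ion 2. A toy state-vector sketch (all names and conventions are ours, not the protocol of [130] itself):

```python
import numpy as np

# Two-qubit SWAP gate and single-qubit identity
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

# Registers: ion 1, photonic mode (0 or 1 photon), ion 2
alpha, beta = 1 / np.sqrt(3), np.sqrt(2 / 3)
ion1 = np.array([alpha, beta], dtype=complex)   # arbitrary qubit to transfer
photon = np.array([1, 0], dtype=complex)        # cavity mode empty
ion2 = np.array([1, 0], dtype=complex)          # target ion in |g>

state = np.kron(np.kron(ion1, photon), ion2)

# Step 1: the first laser pulse maps the ion-1 qubit onto the cavity photon
state = np.kron(SWAP, I2) @ state
# Step 2: the photon leaks into the second cavity, where the time-reversed
# pulse maps it onto ion 2
state = np.kron(I2, SWAP) @ state

# Ion 2 now carries the original qubit: |g>|0>|psi>
target = np.kron(np.kron([1, 0], [1, 0]), [alpha, beta])
print(np.allclose(state, target))  # True
```

The time-symmetry condition in the text is precisely what makes the second (absorption) step the time reverse of the first (emission) step, so that the composite map acts as a perfect swap.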
While the strong coupling limit has not yet been reached with single trapped ions, the system has already been used as a source of single photons on demand, emitting photons with a predetermined shape [131]. Cavity-QED techniques have many potential applications in quantum information processing with trapped ions, in particular for the reversible mapping of qubits between photons and ions [125].
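The probabilistic scheme described in this section, two ion–photon pairs whose photons are jointly measured behind a beam splitter, can also be sketched numerically. A coincident detection heralds the photons in the antisymmetric Bell state, projecting the remote ions onto a singlet with ideal success probability 1/4 (a toy model; basis conventions and variable names are ours):

```python
import numpy as np

def kron(*ops):
    out = np.array([1.0])
    for o in ops:
        out = np.kron(out, o)
    return out

# Basis per subsystem: index 0 = |e> (or |nu_e>), index 1 = |g> (or |nu_g>)
e, g = np.array([1.0, 0.0]), np.array([0.0, 1.0])
basis = [e, g]

# Each trap prepares (|e>|nu_e> + |g>|nu_g>)/sqrt(2);
# subsystem ordering: ion1, ion2, photon1, photon2
psi = np.zeros(16)
for a in (0, 1):
    for b in (0, 1):
        psi += 0.5 * kron(basis[a], basis[b], basis[a], basis[b])

# Coincidence detection behind the beam splitter heralds the photons in the
# antisymmetric Bell state (|nu_e nu_g> - |nu_g nu_e>)/sqrt(2)
psi_minus = (kron(e, g) - kron(g, e)) / np.sqrt(2)
proj = np.kron(np.eye(4), np.outer(psi_minus, psi_minus))

heralded = proj @ psi
p_success = np.vdot(heralded, heralded).real   # ideal heralding probability
heralded /= np.sqrt(p_success)

# The ions are left in the entangled singlet (|e g> - |g e>)/sqrt(2)
ion_singlet = (kron(e, g) - kron(g, e)) / np.sqrt(2)
expected = np.kron(ion_singlet, psi_minus)
print(round(p_success, 3), np.allclose(heralded, expected))  # 0.25 True
```

In the laboratory this ideal 1/4 is multiplied by photon collection and detection efficiencies, which is why the entanglement rate in [127] was so low.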
Beyond quantum computation or simulation, the techniques developed for ion-trap quantum computers have already started to find applications in other fields. One example is frequency standards. Here, the use of N entangled ions has been demonstrated to increase the sensitivity by a factor of √N [122]. Methods from ion-trap quantum information processing have also been applied to the spectroscopic investigation of ²⁷Al⁺ by using ⁹Be⁺ to cool, initialize and detect the Al ion, which lacks a suitable level structure [137]. The rapid progress of ion-trap technology not only makes harnessing the power of quantum computation a realistic possibility in the near future, but already provides new tools for fundamental and applied research.

Future Directions

As shown in this review, all essential elements of quantum information processing have been demonstrated with trapped ions, making them the most successful technology for quantum computation to date. The biggest challenge now is to scale up the present systems to a size where algorithms of practical interest can be run. At the same time, the fidelity of gate operations must be improved, in order to reach the point where quantum error correction can be successfully applied. The implementation of quantum error correction will further increase the demand for larger quantum registers, due to the memory overhead of logical qubits. The most promising route to large-scale quantum computation is a segmented trap architecture with a large number of zones, each storing only a few ions [25]. Present activities in laboratories around the world are directed towards building suitable traps using nanofabrication methods. It is still a major challenge to combine all the required techniques in one scalable system. The implementation of a quantum computer with 300 qubits is discussed by A.M. Steane [132]. Before a universal quantum computer becomes available, trapped-ion systems are expected to find applications in the analog simulation of other quantum systems, as was originally proposed by Feynman [2]. The idea is to use the evolution of a system of trapped ions to mimic the behavior of other, less accessible systems with equivalent Hamiltonians. An example is quantum phase transitions in spin systems, which could be investigated with trapped ions [133]. A recent experiment has simulated the transition from paramagnetic to ferromagnetic order in a quantum system of two ions [134]. A quantum simulation that was successfully performed at NIST demonstrated the increased sensitivity of 2nd- and 3rd-order nonlinear beam splitters [135]. Trapped ions are even predicted to simulate cosmological particle creation in the early universe [136].
Bibliography

Primary Literature

1. Feynman RP (1960) There’s plenty of room at the bottom. Eng Sci 23:22–36
2. Feynman RP (1982) Simulating physics with computers. Int J Theor Phys 21:467–488
3. Benioff P (1982) Quantum-mechanical models of Turing machines that dissipate no energy. Phys Rev Lett 48:1581–1585
4. Cirac JI, Zoller P (1995) Quantum computations with cold trapped ions. Phys Rev Lett 74:4091–4094
5. DiVincenzo DP (2000) The physical implementation of quantum computation. Fortschritte Phys-Prog Phys 48:771–783
6. Ghosh PK (1995) Ion Traps. Clarendon, Oxford
7. Paul W (1990) Electromagnetic traps for charged and neutral particles. Rev Mod Phys 62:531–540
8. Paul W, Steinwedel H (1953) Ein neues Massenspektrometer ohne Magnetfeld. Z Naturforsch 8:448–450
9. Waki I, Kassner S, Birkl G et al (1992) Observation of ordered structures of laser-cooled ions in a quadrupole storage ring. Phys Rev Lett 68:2007–2010
10. Nagerl HC, Bechter W, Eschner J et al (1998) Ion strings for quantum gates. Appl Phys B-Lasers Opt 66:603–608
11. Raizen MG, Gilligan JM, Bergquist JC et al (1992) Ionic crystals in a linear Paul trap. Phys Rev A 45:6493–6501
12. Drewsen M, Brodersen C, Hornekaer L et al (1998) Large ion crystals in a linear Paul trap. Phys Rev Lett 81:2827–2830
13. James DFV (1998) Quantum dynamics of cold trapped ions with application to quantum computation. Appl Phys B-Lasers Opt 66:181–190
14. Hughes RJ, James DFV, Gomez JJ et al (1998) The Los Alamos trapped ion quantum computer experiment. Fortschritte Phys-Prog Phys 46:329–361
15. Berkeland DJ, Miller JD, Bergquist JC et al (1998) Minimization of ion micromotion in a Paul trap. J Appl Phys 83:5025–5033
16. Hoffges JT, Baldauf HW, Eichler T et al (1997) Heterodyne measurement of the fluorescent radiation of a single trapped ion. Opt Commun 133:170–174
17. Dehmelt H (1967) Radiofrequency spectroscopy of stored ions. Adv At Mol Phys 3:53
18. Straubel H (1955) Zum Öltröpfchenversuch von Millikan. Naturwissenschaften 42:506–507
19. Brewer RG, Devoe RG, Kallenbach R (1992) Planar ion microtraps. Phys Rev A 46:R6781–R6784
20. Schrama CA, Peik E, Smith WW et al (1993) Novel miniature ion traps. Opt Commun 101:32–36
21. Jefferts SR, Monroe C, Bell EW et al (1995) Coaxial-resonator-driven rf (Paul) trap for strong confinement. Phys Rev A 51:3112–3116
22. Monroe C, Meekhof DM, King BE et al (1995) Demonstration of a fundamental quantum logic gate. Phys Rev Lett 75:4714–4717
23. Turchette QA, Wood CS, King BE et al (1998) Deterministic entanglement of two trapped ions. Phys Rev Lett 81:3631–3634
24. Cirac JI, Zoller P (2000) A scalable quantum computer with ions in an array of microtraps. Nature 404:579–581
25. Kielpinski D, Monroe C, Wineland DJ (2002) Architecture for a large-scale ion-trap quantum computer. Nature 417:709–711
26. Rowe MA, Ben-Kish A, Demarco B et al (2002) Transport of quantum states and separation of ions in a dual RF ion trap. Quantum Inform Comput 2:257–271
27. Hensinger WK, Olmschenk S, Stick D et al (2006) T-junction ion trap array for two-dimensional ion shuttling, storage, and manipulation. Appl Phys Lett 88:034101
28. Madsen MJ, Hensinger WK, Stick D et al (2004) Planar ion trap geometry for microfabrication. Appl Phys B-Lasers Opt 78:639–651
29. Brownnutt M, Wilpers G, Gill P et al (2006) Monolithic microfabricated ion trap chip design for scaleable quantum processors. New J Phys 8:232
30. Stick D, Hensinger WK, Olmschenk S et al (2006) Ion trap in a semiconductor chip. Nat Phys 2:36–39
31. Seidelin S, Chiaverini J, Reichle R et al (2006) Microfabricated surface-electrode ion trap for scalable quantum information processing. Phys Rev Lett 96:253003
32. Chiaverini J, Blakestad RB, Britton J et al (2005) Surface-electrode architecture for ion-trap quantum information processing. Quantum Inform Comput 5:419–439
33. Kim J, Pau S, Ma Z et al (2005) System design for large-scale ion trap quantum information processor. Quantum Inform Comput 5:515–537
34. Mitchell TB, Bollinger JJ, Dubin DHE et al (1998) Direct observations of structural phase transitions in planar crystallized ion plasmas. Science 282:1290–1293
35. Porras D, Cirac JI (2006) Phonon superfluids in sets of trapped ions. Found Phys 36:465–476
36. Castrejon-Pita JR, Ohadi H, Crick DR et al (2007) Novel designs for Penning ion traps. J Mod Opt 54:1581–1594
37. Crick DR, Ohadi H, Bhatti I et al (2008) Two-ion Coulomb crystals of Ca+ in a Penning trap. Opt Express 16:2351–2362
38. Ciaramicoli G, Marzoli I, Tombesi P (2003) Scalable quantum processor with trapped electrons. Phys Rev Lett 91:017901
39. Langer C, Ozeri R, Jost JD et al (2005) Long-lived qubit memory using atomic ions. Phys Rev Lett 95:060502
40. Nagerl HC, Roos C, Rohde H et al (2000) Addressing and cooling of single ions in Paul traps. Fortschritte Phys-Prog Phys 48:623–636
41. Lucas DM, Donald CJS, Home JP et al (2003) Oxford ion-trap quantum computing project. Philos Trans R Soc Lond Ser A-Math Phys Eng Sci 361:1401–1408
42. Benhelm J, Kirchmair G, Rapol U et al (2007) Measurement of the hyperfine structure of the S-1/2–D-5/2 transition in Ca-43+. Phys Rev A 75:032506
43. Deslauriers L, Haljan PC, Lee PJ et al (2004) Zero-point cooling and low heating of trapped Cd-111+ ions. Phys Rev A 70:043408
44. Balzer C, Braun A, Hannemann T et al (2006) Electrodynamically trapped Yb+ ions for quantum information processing. Phys Rev A 73:041407
45. Olmschenk S, Younge KC, Moehring DL et al (2007) Manipulation and detection of a trapped Yb+ hyperfine qubit. Phys Rev A 76:052314
46. Schaetz T, Friedenauer A, Schmitz H et al (2007) Towards (scalable) quantum simulations in ion traps. J Mod Opt 54:2317–2325
47. Berkeland DJ (2002) Linear Paul trap for strontium ions. Rev Sci Instrum 73:2856–2860
48. Brown KR, Clark RJ, Labaziewicz J et al (2007) Loading and characterization of a printed-circuit-board atomic ion trap. Phys Rev A 75:015401
49. Brownnutt M, Letchumanan V, Wilpers G et al (2007) Controlled photoionization loading of Sr-88+ for precision ion-trap experiments. Appl Phys B-Lasers Opt 87:411–415
50. Letchumanan V, Wilpers G, Brownnutt M et al (2007) Zero-point cooling and heating-rate measurements of a single Sr-88+ ion. Phys Rev A 75:063425
51. Kjaergaard N, Hornekaer L, Thommesen AM et al (2000) Isotope selective loading of an ion trap using resonance-enhanced two-photon ionization. Appl Phys B-Lasers Opt 71:207–210
52. Gulde S, Rotter D, Barton P et al (2001) Simple and efficient photo-ionization loading of ions for precision ion-trapping experiments. Appl Phys B-Lasers Opt 73:861–863
53. Deslauriers L, Acton M, Blinov BB et al (2006) Efficient photoionization loading of trapped ions with ultrafast pulses. Phys Rev A 74:063421
54. Lucas DM, Ramos A, Home JP et al (2004) Isotope-selective photoionization for calcium ion trapping. Phys Rev A 69:012711
55. Happer W (1972) Optical pumping. Rev Mod Phys 44:169
56. Wineland DJ, Drullinger RE, Walls FL (1978) Radiation-pressure cooling of bound resonant absorbers. Phys Rev Lett 40:1639–1642
57. Neuhauser W, Hohenstatt M, Toschek P et al (1978) Optical-sideband cooling of visible atom cloud confined in parabolic well. Phys Rev Lett 41:233–236
58. Wineland DJ, Itano WM (1979) Laser cooling of atoms. Phys Rev A 20:1521–1540
59. Stenholm S (1986) The semiclassical theory of laser cooling. Rev Mod Phys 58:699–739
60. Diedrich F, Bergquist JC, Itano WM et al (1989) Laser cooling to the zero-point energy of motion. Phys Rev Lett 62:403–406
61. Monroe C, Meekhof DM, King BE et al (1995) Resolved-sideband Raman cooling of a bound atom to the 3D zero-point energy. Phys Rev Lett 75:4011–4014
62. Roos C, Zeiger T, Rohde H et al (1999) Quantum state engineering on an optical transition and decoherence in a Paul trap. Phys Rev Lett 83:4713–4716
63. Kielpinski D, King BE, Myatt CJ et al (2000) Sympathetic cooling of trapped ions for quantum logic. Phys Rev A 61:032310
64. Schmidt-Kaler F, Roos C, Nagerl HC et al (2000) Ground state cooling, quantum state engineering and study of decoherence of ions in Paul traps. J Mod Opt 47:2573–2582
65. Rohde H, Gulde ST, Roos CF et al (2001) Sympathetic ground-state cooling and coherent manipulation with two-ion crystals. J Opt B-Quantum Semicl Opt 3:S34–S41
66. Barrett MD, DeMarco B, Schaetz T et al (2003) Sympathetic cooling of Be-9+ and Mg-24+ for quantum logic. Phys Rev A 68:042302
67. King BE, Wood CS, Myatt CJ et al (1998) Cooling the collective motion of trapped ions to initialize a quantum register. Phys Rev Lett 81:1525–1528
68. Mintert F, Wunderlich C (2001) Ion-trap quantum logic using long-wavelength radiation. Phys Rev Lett 87:257904
69. Nagourney W, Sandberg J, Dehmelt H (1986) Shelved optical electron amplifier: observation of quantum jumps. Phys Rev Lett 56:2797–2799
70. Wineland DJ, Bergquist JC, Itano WM et al (1980) Double-resonance and optical-pumping experiments on electromagnetically confined, laser-cooled ions. Opt Lett 5:245–247
71. Blatt R, Zoller P (1988) Quantum jumps in atomic systems. Eur J Phys 9:250–256
72. Bergquist JC, Hulet RG, Itano WM et al (1986) Observation of quantum jumps in a single atom. Phys Rev Lett 57:1699–1702
73. Vogel K, Risken H (1989) Determination of quasiprobability distributions in terms of probability distributions for the rotated quadrature phase. Phys Rev A 40:2847–2849
74. Roos CF, Lancaster GPT, Riebe M et al (2004) Bell states of atoms with ultralong lifetimes and their tomographic state analysis. Phys Rev Lett 92:220402
75. Gulde S, Riebe M, Lancaster GPT et al (2003) Implementation of the Deutsch-Jozsa algorithm on an ion-trap quantum computer. Nature 421:48–50
76. Schmidt-Kaler F, Haffner H, Riebe M et al (2003) Realization of the Cirac-Zoller controlled-NOT quantum gate. Nature 422:408–411
77. Childs AM, Chuang IL (2001) Universal quantum computation with two-level trapped ions. Phys Rev A 63:012306
78. Milburn GJ (1999) Simulating nonlinear spin models in an ion trap. arXiv:quant-ph/9908037v1
79. Sorensen A, Molmer K (1999) Quantum computation with ions in thermal motion. Phys Rev Lett 82:1971–1974
80. Molmer K, Sorensen A (1999) Multiparticle entanglement of hot trapped ions. Phys Rev Lett 82:1835–1838
81. Solano E, de Matos RL, Zagury N (1999) Deterministic Bell states and measurement of the motional state of two trapped ions. Phys Rev A 59:R2539–R2543
82. Sorensen A, Molmer K (2000) Entanglement and quantum computation with ions in thermal motion. Phys Rev A 62:022311
83. Milburn GJ, Schneider S, James DFV (2000) Ion trap quantum computing with warm ions. Fortschritte Phys-Prog Phys 48:801–810
84. Sackett CA, Kielpinski D, King BE et al (2000) Experimental entanglement of four particles. Nature 404:256–259
85. Kielpinski D, Meyer V, Rowe MA et al (2001) A decoherence-free quantum memory using trapped ions. Science 291:1013–1015
86. Haljan PC, Brickman KA, Deslauriers L et al (2005) Spin-dependent forces on trapped ions for phase-stable quantum gates and entangled states of spin and motion. Phys Rev Lett 94:153602
87. Haljan PC, Lee PJ, Brickman KA et al (2005) Entanglement of trapped-ion clock states. Phys Rev A 72:062316
88. Leibfried D, DeMarco B, Meyer V et al (2003) Experimental demonstration of a robust, high-fidelity geometric two ion-qubit phase gate. Nature 422:412–415
89. McDonnell MJ, Home JP, Lucas DM et al (2007) Long-lived mesoscopic entanglement outside the Lamb-Dicke regime. Phys Rev Lett 98:063603
90. Home JP, McDonnell MJ, Lucas DM et al (2006) Deterministic entanglement and tomography of ion-spin qubits. New J Phys 8:188
91. Monroe C, Leibfried D, King BE et al (1997) Simplified quantum logic with trapped ions. Phys Rev A 55:R2489–R2491
92. DeMarco B, Ben-Kish A, Leibfried D et al (2002) Experimental demonstration of a controlled-NOT wave-packet gate. Phys Rev Lett 89:267901
93. Garcia-Ripoll JJ, Zoller P, Cirac JI (2003) Speed optimized two-qubit gates with laser coherent control techniques for ion trap quantum computing. Phys Rev Lett 91:157901
94. Duan LM, Blinov BB, Moehring DL et al (2004) Scalable trapped ion quantum computation with a probabilistic ion-photon mapping. Quantum Inform Comput 4:165–173
95. Garcia-Ripoll JJ, Zoller P, Cirac JI (2005) Quantum information processing with cold atoms and trapped ions. J Phys B-At Mol Opt Phys 38:S567–S578
96. Zhu SL, Monroe C, Duan LM (2006) Trapped ion quantum computation with transverse phonon modes. Phys Rev Lett 97:050505
97. Leibfried D, Knill E, Seidelin S et al (2005) Creation of a six-atom ‘Schroedinger cat’ state. Nature 438:639–642
98. Barrett MD, Chiaverini J, Schaetz T et al (2004) Deterministic quantum teleportation of atomic qubits. Nature 429:737–739
99. Riebe M, Haffner H, Roos CF et al (2004) Deterministic quantum teleportation with atoms. Nature 429:734–737
100. Chiaverini J, Leibfried D, Schaetz T et al (2004) Realization of quantum error correction. Nature 432:602–605
101. Fisk PTH, Sellars MJ, Lawn MA et al (1995) Very high-Q microwave spectroscopy on trapped Yb-171+ ions: application as a frequency standard. IEEE Trans Instrum Meas 44:113–116
102. Meekhof DM, Monroe C, King BE et al (1996) Generation of nonclassical motional states of a trapped atom. Phys Rev Lett 76:1796–1799
103. Palma GM, Suominen KA, Ekert AK (1996) Quantum computers and dissipation. Proc R Soc London Ser A-Math Phys Eng Sci 452:567–584
104. Barenco A, Berthiaume A, Deutsch D et al (1997) Stabilization of quantum computations by symmetrization. SIAM J Comput 26:1541–1557
105. Zanardi P, Rasetti M (1997) Noiseless quantum codes. Phys Rev Lett 79:3306–3309
106. Turchette QA, Kielpinski D, King BE et al (2000) Heating of trapped ions from the quantum ground state. Phys Rev A 61:063418
107. Deslauriers L, Olmschenk S, Stick D et al (2006) Scaling and suppression of anomalous heating in ion traps. Phys Rev Lett 97:103007
108. Labaziewicz J, Richerme P, Brown KR et al (2007) Compact, filtered diode laser system for precision spectroscopy. Opt Lett 32:572–574
109. Turchette QA, Myatt CJ, King BE et al (2000) Decoherence and decay of motional quantum states of a trapped atom coupled to engineered reservoirs. Phys Rev A 62:053807
110. Schmidt-Kaler F, Gulde S, Riebe M et al (2003) The coherence of qubits based on single Ca+ ions. J Phys B-At Mol Opt Phys 36:623–636
111. Lucas DM, Keitch BC, Home JP et al (2007) A long-lived memory qubit on a low-decoherence quantum bus. arXiv:0710.4421v1 [quant-ph]
112. Steane AM (2002) Overhead and noise threshold of fault-tolerant quantum error correction. arXiv:quant-ph/0207119
113. Barenco A, Bennett CH, Cleve R et al (1995) Elementary gates for quantum computation. Phys Rev A 52:3457–3467
114. Schmidt-Kaler F, Haffner H, Gulde S et al (2003) How to realize a universal quantum gate with trapped ions. Appl Phys B-Lasers Opt 77:789–796
115. Deutsch D (1985) Quantum theory, the Church-Turing principle and the universal quantum computer. Proc R Soc London Ser A-Math Phys Eng Sci 400:97–117
116. Bennett CH, Brassard G, Crepeau C et al (1993) Teleporting an unknown quantum state via dual classical and Einstein-Podolsky-Rosen channels. Phys Rev Lett 70:1895–1899
117. Shor PW (1994) Algorithms for quantum computation: discrete logarithms and factoring. In: Goldwasser S (ed) Proceedings of the 35th Annual Symposium on Foundations of Computer Science, Santa Fe. IEEE Computer Society Press, Washington, pp 124–134
118. Chiaverini J, Britton J, Leibfried D et al (2005) Implementation of the semiclassical quantum Fourier transform in a scalable system. Science 308:997–1000
119. Grover LK (1997) Quantum mechanics helps in searching for a needle in a haystack. Phys Rev Lett 79:325–328
120. Brickman KA, Haljan PC, Lee PJ et al (2005) Implementation of Grover’s quantum search algorithm in a scalable system. Phys Rev A 72:050306
121. Bennett CH, DiVincenzo DP (2000) Quantum information and computation. Nature 404:247–255
122. Leibfried D, Barrett MD, Schaetz T et al (2004) Toward Heisenberg-limited spectroscopy with multiparticle entangled states. Science 304:1476–1478
123. Roos CF, Riebe M, Haffner H et al (2004) Control and measurement of three-qubit entangled states. Science 304:1478–1480
124. Haffner H, Hansel W, Roos CF et al (2005) Scalable multiparticle entanglement of trapped ions. Nature 438:643–646
125. Monroe C (2002) Quantum information processing with atoms and photons. Nature 416:238–246
126. Blinov BB, Moehring DL, Duan LM et al (2004) Observation of entanglement between a single trapped atom and a single photon. Nature 428:153–157
127. Moehring DL, Maunz P, Olmschenk S et al (2007) Entanglement of single-atom quantum bits at a distance. Nature 449:68–71
128. Guthöhrlein GR, Keller M, Hayasaka K et al (2001) A single ion as a nanoscopic probe of an optical field. Nature 414:49–51
129. Mundt AB, Kreuter A, Becher C et al (2002) Coupling a single atomic quantum bit to a high finesse optical cavity. Phys Rev Lett 89:103001
130. Cirac JI, Zoller P, Kimble HJ et al (1997) Quantum state transfer and entanglement distribution among distant nodes in a quantum network. Phys Rev Lett 78:3221–3224
131. Keller M, Lange B, Hayasaka K et al (2004) Continuous generation of single photons with controlled waveform in an ion-trap cavity system. Nature 431:1075–1078
132. Steane AM (2007) How to build a 300 bit, 1 giga-operation quantum computer. Quantum Inform Comput 7:171–183
133. Porras D, Cirac JI (2004) Effective quantum spin systems with trapped ions. Phys Rev Lett 92:207901
134. Friedenauer A, Schmitz H, Glueckert JT et al (2008) Simulating a quantum magnet with trapped ions. Nat Phys 4:757–761
135. Leibfried D, DeMarco B, Meyer V et al (2002) Trapped-ion quantum simulator: Experimental application to nonlinear interferometers. Phys Rev Lett 89:247901
136. Schutzhold R, Uhlmann M (2007) Analogue of cosmological particle creation in an ion trap. Phys Rev Lett 99:201301
137. Schmidt PO, Rosenband T, Langer C et al (2005) Spectroscopy using quantum logic. Science 309:749–752
Books and Reviews

Wineland DJ, Monroe C, Itano WM et al (1998) Trapped-ion quantum simulator. Phys Scr T76:147–151
Ekert A, Jozsa R (1996) Quantum computation and Shor’s factoring algorithm. Rev Mod Phys 68:733–753
Steane A (1997) The ion trap quantum information processor. Appl Phys B-Lasers Opt 64:623–642
Wineland DJ, Barrett M, Britton J et al (2003) Quantum information processing with trapped ions. Philos Trans R Soc Lond Ser A-Math Phys Eng Sci 361:1349–1361
Leibfried D, Blatt R, Monroe C et al (2003) Quantum dynamics of single trapped ions. Rev Mod Phys 75:281–324
Steane A (1998) Quantum computing. Rep Prog Phys 61:117–173
Wineland DJ, Monroe C, Itano WM et al (1998) Experimental issues in coherent quantum-state manipulation of trapped atomic ions. J Res Natl Inst Stand Technol 103:259–328
Blatt R, Wineland D (2008) Entangled states of trapped atomic ions. Nature 453:1008–1015
Häffner H, Roos CF, Blatt R (2008) Quantum computing with trapped ions. Phys Rep 469:155–204
Quantum Computing Using Optics
Quantum Computing Using Optics GERARD J. MILBURN, ANDREW G. W HITE Centre for Quantum Computer Technology, The University of Queensland, Brisbane, Australia Article Outline Glossary Definition of the Subject Introduction Quantum Computation with Single Photons Conditional Optical Two-Qubit Gates Cluster State Methods Experimental Implementations Reprise: Single Photon States Future Directions Bibliography Glossary Avalanche photodiode (APD) A device for counting photons that absorbs a single photon and, with some probability, produces a single electrical signal. Beam splitter A linear optical device that partially transmits, and partially reflects, an incoming beam of light. Bell inequality The results of local measurements of dichotomic observables on each component of two correlated classical systems satisfy a correlation function that is less than or equal to a universal bound. This bound can be exceeded by correlated quantum systems. Bell states Four orthogonal, maximally entangled states of two qubits that violate a Bell inequality. Cluster State A highly entangled state of many qubits that enables quantum computation by sequences of conditional single qubit measurements, each conditioned on the results of previous measurements. c-SIGN A two-qubit gate that leaves all logical states unchanged unless both qubits take the value one, in which case the state suffers a phase shift. Entangled state The state of a multi-component quantum system which cannot be expressed as a convex combination of tensor product states of each subsystem. Entangled states cannot be prepared by local operations on each subsystem, even when supplemented by classical communication. HOM interference Hong, Ou and Mandel discovered that when indistinguishable single photon pulses arrive simultaneously at each of the two input ports of a fifty-fifty beam splitter, the probability for detecting
two photons in coincidence, one at each of the two output ports, drops to zero. In other words, the photons are either both reflected or both transmitted.
Mixed state A quantum state that is not completely known and thus has non-zero entropy.
Photon A single quantum excitation of an orthonormal optical mode. A field in such a state has definite intensity and completely random phase, so that the average field amplitude is zero.
Pure state A quantum state that is completely known and thus has zero entropy.
Quantum computation The ability to process information in a physical device using unitary evolution of superpositions of the physical states that encode the logical states.
Qubit The fundamental unit of quantum information, in which two orthogonal states encode one bit of information. Unlike a classical bit, the physical system that forms the qubit can be in a superposition of the two logical states simultaneously.
Quantum teleportation A quantum communication protocol based on measurement, feed-forward and shared entanglement.
Shor's algorithm An algorithm for finding the prime factors of large integers by unitarily processing information in a quantum computer.
Tomography A measurement scheme for experimentally determining a quantum state, in which a large sequence of physical systems, prepared in the same state, is subjected to measurements of a carefully chosen set of physical observables.
Unitary A transformation of a quantum state that is physically, and thus logically, reversible. Unitary transformations take pure states to pure states.

Definition of the Subject

Quantum computation is a new approach to information processing based on physical devices that operate according to the quantum principles of superposition and unitary evolution [5,9].
This enables more efficient algorithms than are available for a computer operating according to classical principles, the most significant of which is Shor's efficient algorithm for finding the prime factors of a large integer [51]. There is no known efficient algorithm for this task on conventional computing hardware. Optical implementations of quantum computing have largely focused on encoding quantum information using single photon states of light. For example, a single photon could be excited to one of two carefully defined orthogonal mode functions of the field with different momentum directions. However, as optical photons do not interact with each other directly, physical devices that enable one encoded bit of information to unitarily change another are hard to implement. In principle this can be done using a Kerr nonlinearity, as was noted long ago [34,61], but Kerr nonlinear phase shifts are too small to be useful. Knill et al. [24] discovered another way in which the state of one photon could be made to act conditionally on the state of another, using a measurement based scheme. We discuss this approach in some detail here, as it has led to experiments that have already demonstrated many of the key elements required for quantum computation with optics.

Introduction

The optical fiber communication network is the largest artificial complex system on the planet, enabling the internet, the growth of economies and a massive connectivity of minds, presaging a revolution as yet only dimly seen. Information, encoded in pulses of light, courses through the system at more than 10 terabits per second. It is hard to believe that it is barely 20 years since the first transoceanic optical fiber was installed, yet the system is still growing at an astonishing pace. The insatiable appetite for bandwidth in modern economies shows no sign of abating any time soon. The entire system is held together by thin fibers of glass guiding pulses of light. The huge increase in bandwidth that modern communication networks have enabled follows directly from the very high carrier frequency of optical pulses. A wealth of modulation techniques has been developed to exploit the potential of this bandwidth: it is currently limited only by the speed of the switches in the network required to control the flow of information. Early networks required the switches to be largely electronic, and this meant first converting the light pulses to electronic pulses. However, these switches are increasingly being replaced with all-optical devices.
The routers and switches in the optical fiber network are essentially small computers processing packets of information at the fastest possible speeds. In the 1980s there was a research program to build computers entirely from optical switches, processing information encoded optically. The astonishing progress in silicon technology meant that optical computers could never compete with the shrinking scale of semiconductors: the size of an optical switch is largely determined by the wavelength of light. This is beginning to change with the rise of plasmonics and nanophotonics. However, this size constraint does not pose a problem for an optical communication system that spans the entire planet, and much of the early work on all-optical switches found its way into the optical fiber system.
Optical fiber networks operate by the principles of classical physics; the computers they connect operate by the principles of classical logic. The semiconductors used as light sources and detectors function by quantum principles, but these do not influence network operation or logic. This is the physics and logic of Newton, Faraday and Maxwell. The motion of charges in semiconductor circuits is governed by the classical understanding of electric and magnetic fields, while the light pulses coursing down a glass fiber are perfectly well described by Maxwell’s theory of electromagnetic radiation. However this situation will inevitably change and the first indications are already upon us. We have known since the early part of the last century that the deepest description of the physical world is not classical but quantum. It is now clear that the quantum world enables new computation and communication tasks that are difficult, if not impossible, in a classical world. Over the same period that the optical communication system has been built, a small group of visionaries have speculated on the ultimate limits to conventional computation. There are two strands to this question. The first strand takes us down a road of ever decreasing dimensions. The astonishing growth in semiconductor technology is the direct result of acquiring the technical ability to make transistors smaller and smaller so that billions can fit onto a single chip. A very natural question is: how small can it get? Ultimately the answer to this question is the domain of quantum mechanics. The second strand joins a more abstract path that began with von Neumann and Turing, and took a quantum twist in the mid 1980s when Feynman [9] asked if a physical computer operating by quantum principles would be a more efficient computational machine than a conventional classical computer. We now know that the answer is yes. 
The quantum description of light began with Einstein in 1905 with his explanation of the photoelectric effect [7]: when a light ray starting from a point is propagated, the energy is not continuously distributed over an ever increasing volume, but it consists of a finite number of energy quanta, localized in space, which move without being divided and which can be absorbed or emitted only as a whole. In Einstein's explanation, the energy of each quantum, or photon to use the modern word, is proportional to its frequency, while the intensity of the light determines the number of photons per second passing through some given area. Einstein's insight is now routinely confirmed in a semiconductor device known as the avalanche photodiode (APD). This is a device that produces a current pulse when a single photon is absorbed. If we turn the intensity of the optical source down to very low levels and connect the APD to an audio amplifier, we can hear the individual clicks as the photons arrive on the surface of the detector. In this operational sense, a single photon is a detection event (Fig. 1). The weak pulses of light used in conventional optical communication systems contain a huge number of photons, and furthermore, the number of photons per pulse is not fixed but fluctuates from pulse to pulse. This is a direct consequence of the kind of light source that is used: the laser. The laser is a true quantum device, but it necessarily produces pulses of light with an indeterminate number of photons. The reason for this is intimately connected with the coherence properties of laser light. In a laser light source both the amplitude of the light and its phase fluctuate as little as quantum theory will permit. However, these fluctuations are related by an uncertainty principle: if we try to produce a sequence of pulses with a well defined number of photons, we necessarily make the phase random from pulse to pulse (we shall make this idea more precise in what follows). The random phase would destroy the key feature of first order optical coherence that makes a laser so useful. Are there any light sources that controllably produce pulses of light each with one and only one photon per pulse? Until very recently, the answer was no. However, the discovery of optical communication schemes (quantum key distribution) and computation schemes (as described below) based on quantum principles has given a strong incentive for building such sources. We will now explain how single photons enable quantum computation, and postpone our discussion of exactly what a single photon source is to later.
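The indeterminate photon number of a laser pulse can be made quantitative: an ideal laser pulse is well described by a coherent state, whose photon number follows a Poisson distribution with mean |α|². A minimal numerical sketch (the mean photon number of 4 and the cutoff are arbitrary illustrative choices):

```python
import numpy as np
from math import exp, factorial

def coherent_photon_dist(mean_n, n_max):
    """Poisson photon-number distribution of a coherent state
    with mean photon number |alpha|^2 = mean_n."""
    return np.array([exp(-mean_n) * mean_n**n / factorial(n)
                     for n in range(n_max + 1)])

p = coherent_photon_dist(mean_n=4.0, n_max=20)
mean = sum(n * pn for n, pn in enumerate(p))
var = sum(n**2 * pn for n, pn in enumerate(p)) - mean**2

# For Poisson statistics the variance equals the mean (both 4 here),
# so the photon number genuinely fluctuates from pulse to pulse.
print(f"mean = {mean:.3f}, variance = {var:.3f}")
```

The equal mean and variance are the signature of the pulse-to-pulse photon-number fluctuations described above; a true single photon source would instead give a distribution concentrated on n = 1 with zero variance.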
Quantum Computing Using Optics, Figure 1 A source generates a sequence of optical pulses conditional on some classical input, e.g. electrical pulses. A detector registers a sequence of classical electrical pulses that we interpret as due to the propagation of individual photons from source to detector. We might call this the ballistic picture of a photon
Quantum Computing Using Optics, Figure 2 A single photon pulse incident on a beam splitter in the downward (D) direction. It can be detected in the upward going direction (U) or the downward going direction (D)
Quantum Computation with Single Photons

Let us now consider the experiment depicted in Fig. 2. A source (labeled D for downward) is producing a sequence of single photons which are directed towards a 50/50 beam splitter. This is a classical optical device which in the wave theory would be described as simply dividing the wave intensity equally between a reflected beam and a transmitted beam. If we had a source of very high intensity, each of the photodetectors in the reflected path and the transmitted path would record on average an equal number of counts per second. As we reduce the intensity to the single photon level, however, we see an irreducible uncertainty in which detector will register the photon in each trial. On average it is still the case that over many trials, one half are recorded at the D-detector and one half at the U-detector. If we were to persist in our ballistic view of a single photon, we might be tempted to say that each photon is either reflected or transmitted at random; it is a simple coin toss. This picture would adequately capture the experimental facts for this experiment. The beam splitter experiment has a binary outcome: the photon is either detected at U or D. To encode the result at the output we need a single bit of information. Indeed, if we allow another single photon source into the upward going input to the beam splitter, we will need a single binary number to encode the input state as well. The kind of encoding we have just described is called dual rail. In more precise terms, we have encoded a single bit into one of two perfectly distinguishable momentum states of a single photon. At first sight, it would appear that the experiment we have just described is a fanciful coin toss. That this is not the case can be seen if we 'toss the coin again', that is to say, we take the photon after the beam splitter and, instead
Quantum Computing Using Optics, Figure 3 A single photon pulse incident on a beam splitter in the downward (D) direction. It can be detected in the upward going direction (U) or the downward going direction (D)
of running it into a photodetector, we reflect it back onto an identical beam splitter and then ask for the probability for it to be reflected or transmitted (see Fig. 3). It is well known that in this case we can adjust things so that the photon is certain to be detected at U. Irreducible uncertainty has been turned into certainty simply by changing the experimental conditions for detecting the photon, not by changing the state of the single photon source. Note that the price we pay for certainty is complete loss of knowledge of what happened to the photon at the left beam splitter. A photon can be detected at either detector in Fig. 3 in two indistinguishable ways, corresponding to the two (unknowable) outcomes of reflection or transmission at the left beam splitter. Returning to the simple experiment in Fig. 1, can we say that it is a simple coin toss before the detectors register the photon? Does the output of the beam splitter really encode one bit of information? No. We capture the dual quantum/classical nature of the state of a single photon after a beam splitter by saying that it is a superposition of the two mutually exclusive alternatives. In such a case the output is said to encode a single quantum bit, or qubit. Quantum computation and communication runs on qubits, not bits. The quantum description of the state of the photon after a single beam splitter requires us to give a list of the probability amplitudes for the two mutually exclusive possibilities for detection in either the downward or the upward direction. The detection probabilities are then given by the squares of the probability amplitudes. For example, the input state in Fig. 1 is certain to be in the downward direction, so the ordered pair of probability amplitudes is (1, 0), in which the first (second) entry corresponds to detection in the D (U) direction.
After a general (not 50/50) beam splitter the state of the photon is described by the ordered pair (r, t), where |r|² + |t|² = 1. So in fact the beam splitter has enacted the linear transformation (1, 0) → (r, t).

One might think that the input state (0, 1) is transformed in exactly the same way. However these two input states are physically distinguishable: in one case the photon is going down and in the other it is going up with certainty. Mathematically this is captured by the fact that the input vectors are orthogonal. The beam splitter is a perfectly reversible device and does not destroy information by making two distinguishable alternatives indistinguishable. The problem is avoided when we note that a full quantum theory of beam splitters in fact shows that (0, 1) → (t, −r). There is a central lesson here: operations on qubits must not take two physically distinguishable states and make them indistinguishable. We call this sort of transformation unitary. We now change to more conventional notation. The input states (1, 0) and (0, 1) are written as |1,0⟩ and |0,1⟩ (ordering is preserved), so that the beam splitter transformations are written as

|1,0⟩ → r|1,0⟩ + t|0,1⟩ ,  (1)
|0,1⟩ → t|1,0⟩ − r|0,1⟩ .  (2)

We now formalize the concept of a qubit by making a distinction between the logical state of a qubit and the physical states of the system used to encode it. In the example used here the relationship is

|0⟩ = |1,0⟩ ,  (3)
|1⟩ = |0,1⟩ .  (4)

We then say that we have a qubit code that uses one photon and two optical modes per qubit. The beam splitter transformation acts on the logical code as

|0⟩ → r|0⟩ + t|1⟩ ,  (5)
|1⟩ → t|0⟩ − r|1⟩ .  (6)

This transformation on logical qubits is called a one-qubit gate. If we want to encode two qubits in this dual rail scheme we will need two single photon pulses and four modes, which may be achieved by two independent beam splitters. In a notation that keeps track of the ordering of the beam splitters by an ordering of states from left to right, the logical state |1,0⟩ would then be represented physically by

|1⟩ ⊗ |0⟩ = |0,1⟩ ⊗ |1,0⟩ .  (7)

Note that the number of mutually distinguishable output states from N beam splitters (and N photons) increases exponentially as 2^N. This simply means that for the logical code of N qubits there are 2^N possible logical states.
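The one-qubit beam splitter gate is a small linear-algebra exercise that can be checked directly. In the sketch below a dual-rail qubit is represented by its pair of probability amplitudes (c_D, c_U), and the beam splitter by a 2×2 matrix sending (1, 0) → (r, t) and (0, 1) → (t, −r), one standard unitary convention; the particular value of r is an arbitrary illustrative choice:

```python
import numpy as np

r = np.cos(np.pi / 8)          # reflection amplitude (illustrative choice)
t = np.sin(np.pi / 8)          # transmission amplitude, r^2 + t^2 = 1

# Beam splitter acting on the dual-rail amplitudes (c_D, c_U)
B = np.array([[r, t],
              [t, -r]])

# Unitarity: the matrix times its conjugate transpose is the identity,
# so distinguishable (orthogonal) inputs stay distinguishable.
assert np.allclose(B @ B.conj().T, np.eye(2))

down = np.array([1.0, 0.0])    # photon certainly in the D mode
up = np.array([0.0, 1.0])      # photon certainly in the U mode
print(B @ down)                # amplitudes (r, t)
assert abs(np.vdot(B @ down, B @ up)) < 1e-12   # orthogonality preserved

# 'Tossing the coin again': two identical beam splitters in sequence
# give a deterministic outcome (here the photon returns to the D mode;
# a phase shift between the arms would steer it to U instead).
assert np.allclose(B @ B, np.eye(2))
```

The last assertion is the numerical counterpart of the interferometer discussed around Fig. 3: passing the superposition through a second beam splitter removes the randomness entirely.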
If we continue in this fashion we are not going to have a very compact notation. We can make things easier by using the fact that our qubits are ordered, corresponding to a physical order of the underlying beam splitters. We can then represent a state, for example |1⟩ ⊗ |0⟩ ⊗ |1⟩ ⊗ |0⟩, by regarding it as the binary code for an integer, in this case |10⟩ since 10 = 1·2³ + 0·2² + 1·2¹ + 0·2⁰. It is then clear that with N physical resources we can code 2^N binary numbers (or their corresponding integers). If we could somehow act on all these numbers simultaneously we might have a very efficient processor. This is precisely the hope of quantum computation. For example, if we pass N photons downward through N independent beam splitters, the state at the output is the superposition state

|0⟩ → 2^(−N/2) Σ_{x=0}^{2^N − 1} |x⟩ ,  (8)

an equal superposition of all integers from zero up to 2^N − 1. If we could somehow continue to postpone measurement and act unitarily on all these numbers simultaneously, we might gain a considerable advantage in computational efficiency. This was the idea first captured in precise form by David Deutsch [5]. Suppose for example we set aside some number, say N, of photons to encode an input register and some number, say K, to encode an output register. We first prepare the N photons so as to encode the state produced in Eq. (8) and the K photons to encode the state |0⟩. Then unitary processing through some, as yet unspecified, physical device could implement the transformation

Σ_x |x⟩|0⟩ → Σ_x |x⟩|f(x)⟩ ,  (9)
for some function f (we drop the normalization for simplicity). Of course now when we read out the output register we only get one value of the function with a random distribution. This may not sound very useful if you are in fact interested in a particular value of the function, but in that case why do all computations of the function at once? The primary reason why you might want to evaluate a function on all inputs is when you are not so much interested in any particular value but rather some global property of the function itself – for example is it constant? It is precisely that kind of computation that Deutsch [3] showed could be done more efficiently on a quantum computer. A far more interesting example was described by Shor [51] who gave an efficient quantum algorithm for finding the period of a modular function. The particular
modular function he used is part of an algorithm for finding the prime factors of large integers, so that in effect he gave an efficient quantum algorithm for finding such prime factors. The assumed intractability of this problem for a classical computer is why it is used as the basis of public-key encryption. If someone had a quantum computer, all these cryptosystems would be vulnerable to attack. We describe below a simple optical experiment which captures the kind of methods by which a quantum computer could implement Shor's algorithm. Of course no one yet has a quantum computer capable of posing a threat to public-key cryptosystems, although it is beginning to look possible. The problem is to figure out what kind of physical systems would enable the arrow in Eq. (9). Various suggestions have been made, including single photon optical systems, which we discuss below. At the level of logical qubits we need two types of elementary transformations. The first we have already seen: it is the single qubit transformation in Eq. (5). In addition, all we need is some form of interaction that correlates the states of at least two qubits. One example is the controlled-SIGN, or c-SIGN, gate,

|x⟩|y⟩ → (−1)^(xy) |x⟩|y⟩ .  (10)
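In the two-qubit logical basis {|00⟩, |01⟩, |10⟩, |11⟩} the c-SIGN gate of Eq. (10) is simply a diagonal matrix, and combining it with the one-qubit beam splitter gate produces entanglement. A short sketch (the 50/50 choice r = t = 1/√2 is an illustrative assumption):

```python
import numpy as np

# c-SIGN: |x>|y> -> (-1)^{xy} |x>|y>, basis order |00>, |01>, |10>, |11>
CSIGN = np.diag([1.0, 1.0, 1.0, -1.0])

# 50/50 beam splitter as a one-qubit gate, Eq. (5) with r = t = 1/sqrt(2)
r = t = 1 / np.sqrt(2)
B = np.array([[r, t],
              [t, -r]])

# Prepare the equal superposition of Eq. (8) for N = 2 qubits
zero = np.array([1.0, 0.0])
psi = np.kron(B @ zero, B @ zero)   # amplitudes (0.5, 0.5, 0.5, 0.5)

# c-SIGN flips the sign of the |11> component: (0.5, 0.5, 0.5, -0.5)
out = CSIGN @ psi

# A product state has Schmidt rank 1; the sign flip raises it to 2,
# so the two qubits are now entangled.
s = np.linalg.svd(out.reshape(2, 2), compute_uv=False)
print(int(np.sum(s > 1e-12)))       # Schmidt rank
```

This is exactly why a correlating gate such as c-SIGN is needed on top of the one-qubit gates: the beam splitters alone can only ever produce product states of the logical qubits.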
For single photon, dual rail encoding this presents a major problem. If we look at the c-SIGN gate at the level of physical qubits, it will require some kind of intensity dependent phase change so that only a two photon component acquires a phase change. This is possible with a Kerr nonlinearity, as was noted long ago [34,61]. Nonlinear optical materials with an intensity dependent refractive index do indeed exist and are generically referred to as Kerr nonlinear materials. The problem is that Kerr nonlinear phase shifts are so small that to expect an intensity as low as two photons to give a phase change in an optical material of practical size is to expect too much. Fortunately there is another way.

Conditional Optical Two-Qubit Gates

Measurements play quite a different role in quantum mechanics than they do in classical mechanics. In the latter, it is the objective of accurate measurement to reveal the preexisting values of dynamical variables. Of course in reality both the measurement and the preparation of the initial state are subject to noise. However it is assumed that both may be sufficiently refined to reveal the true dynamical state of a system. If we know all there is to know about a classical system, all measurements are dispersion free in principle.
Quantum Computing Using Optics, Figure 4 A conditional state transformation conditioned on photon counting measurements
Quantum mechanics is an irreducibly statistical theory. This means that even if a system has been prepared in a state about which we have maximal knowledge, there will be at least one measurement whose results are uncertain. Fortunately there is also at least one physical quantity that may be measured for which the results are certain. If we make a perfect measurement of this quantity we have in effect prepared the system in a pure state of which we have maximal knowledge. Measurement thus plays an essential role in state preparation, where knowledge of the prepared state is heralded by the measurement result. Can we use the unique role that measurement plays in quantum mechanics to conditionally implement a required state transformation? Consider the situation shown in Fig. 4. In a dual rail scenario, two modes are mixed on a beam splitter. One mode is assumed to be in the vacuum state (a) or a one photon state (b), while the other mode is arbitrary. A single photon counter is placed in the output port of mode-2. What is the conditional state of mode-1 given a count of n photons? The conditional state of mode a₁ is given by (unnormalized)

|ψ̃^(i)⟩₁ = Υ̂^(i)|ψ⟩₁ ,  (11)

where

Υ̂^(i) = ₂⟨i|U(θ)|i⟩₂ ,  (12)

with i = 1, 0. Here U(θ) is the unitary transformation implemented by the beam splitter, as we described in the introduction. The probability for the specified photocount event is given by

P(i) = ₁⟨ψ|Υ̂^(i)† Υ̂^(i)|ψ⟩₁ ,  (13)

which fixes the normalization of the state,

|ψ^(i)⟩₁ = (1/√P(i)) |ψ̃^(i)⟩₁ .  (14)

It is then possible to show that

Υ̂^(0) = Σ_{n=0}^{∞} [(cos θ − 1)^n / n!] (a₁†)^n a₁^n ,
Υ̂^(1) = cos θ Υ̂^(0) − sin²θ a₁† Υ̂^(0) a₁ .

Quantum Computing Using Optics, Figure 5 A conditional state transformation on three optical modes, conditioned on photon counting measurements on the ancilla modes a₁, a₂. The signal mode a₀ is subjected to a phase shift

In order to see how we can use this kind of transformation to effect a c-SIGN gate, consider the situation shown in Fig. 5. Three optical modes are mixed on a sequence of three beam splitters with beam splitter parameters θᵢ, with corresponding reflection and transmission amplitudes rᵢ = cos θᵢ, tᵢ = sin θᵢ. The ancilla modes a₁, a₂ are restricted to be in the single photon states |1⟩₂, |0⟩₃ respectively. We will assume that the signal mode, a₀, is restricted to have at most two photons, thus

|ψ⟩ = α|0⟩₀ + β|1⟩₀ + γ|2⟩₀ .  (15)

The reason we are only interested in two photon states is that in dual rail encoding a general two qubit state can have at most two photons. We can now choose the beam splitter parameters so that, conditional on the detectors registering the counts n₂ = 1 and n₃ = 0, the signal state is transformed as

|ψ⟩ → |ψ′⟩ = α|0⟩ + β|1⟩ − γ|2⟩ ,  (16)

with a probability that is independent of the input state |ψ⟩. This last condition is essential, as in a quantum computation the input state to a general two qubit gate is completely unknown. We will call this transformation the NS (for nonlinear sign shift) gate. It can be achieved using [24] θ₁ = θ₃ = 22.5° and θ₂ = 65.53°. The probability of the conditioning event (n₂ = 1, n₃ = 0) is 1/4. Note that we can't be sure in a given trial if the correct transformation will be implemented. Such a gate is called a nondeterministic gate. However the key point is
Quantum Computing Using Optics, Figure 6 A conditional state transformation conditioned on photon counting measurements. A c-SIGN gate that works with probability 1/16. It uses HOM interference and two NS gates
that success is heralded by the results on the photon counters (assuming ideal operation). We can now proceed to a c-SIGN gate in the dual rail basis. Consider the situation depicted in Fig. 6. We first take two dual rail qubits encoding for |1⟩|1⟩. The single photon components of each qubit are directed towards a 50/50 beam splitter, where they overlap perfectly in space and time and produce a state of the form |0⟩₂|2⟩₃ + |2⟩₂|0⟩₃, an effect known as Hong–Ou–Mandel (HOM) interference [15]. We then insert an NS gate into each output arm of the HOM interference. When the conditional gates in each arm work, which occurs with probability 1/16, the state is multiplied by an overall minus sign. Finally we direct these modes towards another HOM interference. The output state is thus seen to be −|1⟩|1⟩. One easily checks the three other cases for the input logical states to see that this device implements the c-SIGN gate with a probability of 1/16, and successful operation is heralded. Clearly a sequence of nondeterministic gates is not going to be much use: the probability of success after a few steps will be exponentially small. The key idea in using nondeterministic gates for quantum computation is based on the idea of gate teleportation of Gottesman and Chuang [12]. In quantum teleportation an unknown quantum state can be transferred from A to B provided A and B first share an entangled state. Gottesman and Chuang realized that it is possible to simultaneously teleport a two qubit quantum state and implement a two qubit gate in the process, by first applying the gate to the entangled state that A and B share prior to teleportation. We use a nondeterministic NS gate to prepare the required entangled state, and only complete the teleportation when this stage is known to have worked. The teleportation step itself is nondeterministic but, as we see below, by using the appropriate entangled resource the teleportation step can be made near deterministic.
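The conditional operators Υ̂^(0) and Υ̂^(1) for the single beam splitter of Fig. 4 can be verified numerically on a truncated Fock space. The sketch below is only illustrative: the generator sign convention, the angle θ, and the photon-number cutoff are our assumptions. It builds U(θ) = exp[θ(a₁†a₂ − a₁a₂†)], projects the ancilla mode onto |i⟩, and compares the result with the normally ordered expressions quoted in the text:

```python
import numpy as np
from scipy.linalg import expm

N = 4                                    # Fock cutoff per mode: photon numbers 0..3
a = np.diag(np.sqrt(np.arange(1, N)), 1)   # annihilation operator on one mode
a1 = np.kron(a, np.eye(N))               # mode 1 of the two-mode space
a2 = np.kron(np.eye(N), a)               # mode 2 (the ancilla)

theta = 0.7                              # arbitrary beam splitter angle
U = expm(theta * (a1.conj().T @ a2 - a1 @ a2.conj().T))

def project_ancilla(U, i):
    """Upsilon^(i) = <i|_2 U |i>_2 as an operator on mode 1 alone."""
    e = np.zeros((N, 1)); e[i, 0] = 1.0
    B = np.kron(np.eye(N), e)            # embeds mode 1 with the ancilla fixed to |i>
    return B.T @ U @ B

Ups0 = project_ancilla(U, 0)
Ups1 = project_ancilla(U, 1)

# Upsilon^(0) is cos(theta)^n on the Fock basis: diag(1, cos, cos^2, ...)
assert np.allclose(Ups0, np.diag(np.cos(theta) ** np.arange(N)))

# Upsilon^(1) = cos(theta) Upsilon^(0) - sin(theta)^2 a^dag Upsilon^(0) a,
# compared on <= 2 photons, where the cutoff introduces no truncation error.
rhs = np.cos(theta) * Ups0 - np.sin(theta) ** 2 * (a.conj().T @ Ups0 @ a)
assert np.allclose(Ups1[:3, :3], rhs[:3, :3])
print("conditional operators verified")
```

Restricting the signal to at most two photons, as in the NS gate discussion, is what makes the 3×3 comparison above the physically relevant one.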
The near deterministic teleportation protocol requires only photon counting
Quantum Computing Using Optics, Figure 7 A partial teleportation system for single photon states using linear optics
and fast feed-forward. We do not need to make measurements in a Bell basis. A nondeterministic teleportation measurement is shown in Fig. 7. The client state is a one photon state in mode-0, α|0⟩₀ + β|1⟩₀, and we prepare the entangled ancilla state

|t₁⟩ = |01⟩₁₂ + |10⟩₁₂ ,  (17)

where mode-1 is held by the sender, A, and mode-2 is held by the receiver, B. For simplicity we omit normalization constants wherever possible. This ancilla state is easily generated from |01⟩₁₂ by means of a beam splitter. If the total count is n₀ + n₁ = 0 or n₀ + n₁ = 2, an effective measurement has been made on the client state and the teleportation has failed. However if n₀ + n₁ = 1, which occurs with probability 0.5, teleportation succeeds, with the two possible conditional states being

α|0⟩₂ + β|1⟩₂  if n₀ = 1, n₁ = 0 ,  (18)
α|0⟩₂ − β|1⟩₂  if n₀ = 0, n₁ = 1 .  (19)

When successful, this procedure implements a joint measurement on modes 0 and 1. In the conventional deterministic teleportation protocol the joint measurement is a simultaneous measurement of the commuting operators XX and ZZ, where X, Z are respectively the Pauli-x and Pauli-z operators. This is a Bell measurement. In the teleportation protocol considered here, we only have a partial Bell measurement. We will refer to it as a nondeterministic teleportation protocol, T_1/2. Note that teleportation failure is detected and corresponds to a photon number measurement of the state of the client qubit. Detected number measurements are a very special kind of error and can be easily corrected by a suitable error correction protocol. For further details see [24] and the review [26]. The next step is to use T_1/2 to effect a conditional sign flip, c-SIGN_1/4, which succeeds with probability 1/4. Note that to implement c-SIGN on two bosonic qubits in modes
Quantum Computing Using Optics, Figure 8 A c-SIGN two qubit gate with teleportation to increase the success probability to 1/4. When using the basic teleportation protocol (T1), we may need to apply a sign correction. Since this commutes with c-SIGN, it is possible to apply c-SIGN to the prepared state before performing the measurements, reducing the implementation of c-SIGN to a state-preparation (outlined) and two teleportations. The two teleportation measurements each succeed with probability 1/2, giving a net success probability of 1/4. The correction operations C1 consist of applying the phase shifter when required by the measurement outcomes
1, 2 and 3, 4 respectively, we can first teleport the first modes of each qubit to two new modes (labeled 6 and 8) and then apply c-SIGN to the new modes. When using T_1/2, we may need to apply a sign correction. Since this commutes with c-SIGN, there is nothing preventing us from applying c-SIGN to the prepared state before performing the measurements. The implementation is shown in Fig. 8 and now consists of first trying to prepare two copies of |t₁⟩ with c-SIGN already applied, and then performing two partial Bell measurements. Given the prepared state, the probability of success is (1/2)² = 1/4. The state can be prepared using c-SIGN_1/16, which means that the preparation has to be retried an average of 16 times before it is possible to proceed. The probability of successful teleportation can be boosted to 1 − 1/(n + 1) using the more entangled resource state

|t_n⟩_{1…2n} = Σ_{j=0}^{n} |1⟩^j |0⟩^(n−j) |0⟩^j |1⟩^(n−j) .  (20)
The notation |a⟩^j means |a⟩|a⟩…, j times. The modes are labeled from 1 to 2n, left to right. Note that the state exists in the space of n bosonic qubits, where the kth qubit is encoded in modes n + k and k (in this order). We can teleport the state α|0⟩₀ + β|1⟩₀ using |t_n⟩_{1…2n}. We first couple the client mode to half of the ancilla modes by applying an (n + 1)-point Fourier transform on modes 0 to n. This is defined by the mode transformation

a_k → (1/√(n + 1)) Σ_{l=0}^{n} ω^(kl) a_l ,  (21)

where ω = e^(i2π/(n+1)). This transformation does not change the total photon number and is implementable with passive linear optics. After applying the Fourier transform, we measure the number of photons in each of the modes 0 to n. If the measurement detects k bosons altogether, it is possible to show [24] that if 0 < k < n + 1, then the teleported state appears in mode n + k and only needs to be corrected by applying a phase shift. The modes 2n − l are in state 1 for 0 ≤ l < (n − k) and can be reused in future preparations requiring single bosons. The modes are in state 0 for n − k < l < n. If k = 0 or k = n + 1 an effective measurement of the client is made, and the teleportation fails. The probability of these two events is 1/(n + 1), regardless of the input. Note that again failure is detected and corresponds to measurements in the basis |0⟩, |1⟩ with the outcome known. Note that both the necessary correction and the receiving mode are unknown until after the measurement.

Cluster State Methods

About the same time it was realized that measurements would provide a path to optical single photon computing, Raussendorf and Briegel [48] gave an independent
and remarkably novel method by which measurement alone could be used to do quantum information processing. In their approach to quantum computation, an array of qubits is initially prepared in a special entangled state called a cluster state. The computation then proceeds by making a sequence of single qubit measurements. Each measurement is made in a basis that depends on prior measurement outcomes; in other words, the results of past measurements are fed forward to determine the basis for future measurements. Subsequently Popescu showed that the linear optical measurement based scheme of Knill et al. can be interpreted as a Raussendorf and Briegel measurement based quantum computation [44]. Nielsen [37] realized that the LOQC model of [24] could be used to efficiently assemble the cluster using the nondeterministic teleportation tₙ. As we saw, the failure mode of this gate constitutes an accidental measurement of the qubit in the computational basis. The key point is that such an error does not destroy the entire assembled cluster but merely detaches one qubit from the cluster. This enables a protocol to be devised that produces a cluster that grows on average. The LOQC cluster state method dramatically reduces the number of optical elements required to implement the original LOQC scheme. Of course, if large single photon Kerr nonlinearities were available, the optical cluster state method could be made deterministic [16]. In its simplest form the Raussendorf and Briegel scheme begins with a two dimensional array of qubits. Each qubit is prepared in a superposition of the computational basis states, |0⟩ + |1⟩. Then an entangling operation is performed between nearest neighbor qubits in the lattice using the two-qubit controlled sign operation

|x⟩|y⟩ ↦ (−1)^{xy} |x⟩|y⟩ .   (22)
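The controlled-sign operation of Eq. (22) is a diagonal matrix in the computational basis, and a short numpy sketch (ours, purely illustrative) confirms that applying it to two qubits each prepared in |0⟩ + |1⟩ yields a maximally entangled state, the two-qubit cluster:

```python
import numpy as np

# Controlled-sign gate of Eq. (22): |x>|y> -> (-1)^(x*y) |x>|y>,
# basis ordering |00>, |01>, |10>, |11>.
CZ = np.diag([1.0, 1.0, 1.0, -1.0])

# Each cluster qubit starts in |+> = (|0> + |1>)/sqrt(2).
plus = np.array([1.0, 1.0]) / np.sqrt(2)
cluster2 = CZ @ np.kron(plus, plus)

# Maximal entanglement: the reduced state of one qubit is maximally mixed.
rho = np.outer(cluster2, cluster2.conj())
rho_A = rho.reshape(2, 2, 2, 2).trace(axis1=1, axis2=3)  # trace out qubit B
print(np.allclose(rho_A, np.eye(2) / 2))  # True
```

The same diagonal gate, applied between all nearest neighbors of a lattice of |0⟩ + |1⟩ qubits, generates the full cluster state.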
In the next step one or more qubits are measured in a particular basis and depending on the results of that measurement another basis is chosen for subsequent qubit measurements. Any circuit model of a quantum algorithm can be mapped onto the two dimensional lattice through a sequence of conditional measurements on subsets of qubits. The key difficulty in doing this with a dual rail optical scheme is that, as we have noted, the transformation in Eq. (22) is very difficult to implement using a deterministic unitary transformation as optical nonlinearities are too small. However a conditional scheme of the kind discussed above might work by conditionally entangling sequences of qubits, provided failure at any point did not destroy the entire developing lattice of entanglement. This is indeed the case because of a key feature of the LOQC
model of [24]. If a teleportation gate failure is heralded, it corresponds to an effective measurement of one of the qubits at the input. This feature was used by Knill et al. to establish the scalability of the scheme, as detected measurement errors can easily be protected against by a suitable code. The importance of this feature for conditional cluster state assembly is that an accidental measurement of one of the qubits in a developing cluster simply detaches it from the cluster without destroying the remaining entanglement. The picture then emerges of a kind of random cluster assembly in which the cluster grows when a gate succeeds and gets pruned if a gate fails. As long as the probability of gate success is greater than 0.5 there is a chance that the cluster overall will grow. As this does not require a very large teleportation resource |tₙ⟩_{1…2n}, it can be achieved with quite modest overheads of linear optics and single photons. For this reason optical cluster state methods are preferred over the original LOQC scheme of Knill et al. A number of schemes have been proposed to efficiently assemble a cluster via this probabilistic growth. A recent application of percolation theory is a good example of the kind of optimization that is required [20].

Experimental Implementations

Two Qubit Gates

Considerable progress has been made on demonstrating conditional two-qubit gates with single photon states. After the initial proposal was made by Knill, Laflamme and Milburn [24], there was a flurry of theoretical work to propose linear-optical two-qubit gates that would be easier to realize experimentally. These can be divided into two categories: internal-ancilla gates [22,41,42,46], where the ancilla photons are intrinsic to the operation of the logic gates, and external-ancilla gates [14,47], where ancilla photons can be used to verify correct gate operation by performing a quantum non-demolition measurement [25] on the gate outputs.
There are a variety of configurations of internal-ancilla gates, including simplified [46] and efficient [22] versions of the KLM gate, and gates that gain in efficiency using entangled ancilla photons [41,42]. An entangled internal-ancilla gate was soon realized at Johns Hopkins University, where entangling operation was suggested by a 61.5 ± 7.4% visibility fringe [43]. An unambiguous demonstration of entangling gate operation was performed at the University of Queensland with an external-ancilla gate [39]: all four entangled Bell states were produced as a function of only the logical values of the input qubits, for a single operating condition of the gate. Both these gates filtered on photon number, i.e. they required the four or two input photons to be detected to signal successful gate operation.

Quantum Computing Using Optics, Figure 9
An external-ancilla CNOT gate. When the control is in the logical-one state, the control and target photons interfere non-classically at the central 1/3 beam splitter, which causes a phase shift in the upper arm of the central interferometer, and the target state qubit is flipped. The qubit value of the control is unchanged. Correct operation has probability 1/9 and occurs when a single photon is detected in each output; this can be done with quantum non-demolition (QND) measurements using external ancilla photons [25] or by filtering on photon number in the final detection. Experimentally, the two modes of each qubit are distinguished by orthogonal polarizations. These may be split into separate spatial modes and sent through normal beamsplitters [38,39] or left in one spatial mode and sent through partially-polarizing beamsplitters [21,27,40]

An important requirement for LOQC is that it must be possible to detect successful gate operation by measurement of the ancilla photons and then feed-forward this information to the logic photons: this was realized with an external-ancilla gate at the University of Vienna [10]. Linear-optic gates, both internal and external ancilla, have achieved a wide range of firsts and proof-of-principle demonstrations: the first full characterization of a quantum-logic gate in any architecture [38]; production of cluster [58] and graph [31] states, and their use for Grover’s algorithm [58]; production of the highest entanglement [27] and fastest gate operation [45] of any physical architecture; and the first demonstration of Shor’s algorithm exhibiting the core processes, coherent control, and resultant entangled states required in a full-scale implementation [29,30]. Measuring gate operation is non-trivial. Further, it is desirable to diagnose, and if possible correct, error behaviors introduced by a real gate, such as phase or bit-flip errors, which can induce the wrong amount or type of entanglement, or decoherence. To date, there have been a wide variety of measures used to gauge the quality of
two-qubit gates. A comprehensive comparison of the various measures, and an architecture-independent measurement standard for two-qubit gates, is given in [60]. The key is quantum process tomography, which allows reconstruction of the quantum state transfer function of the gate. Tomography requires sampling the statistics of a fixed number of measurement outcomes, at least 256 for a two-qubit gate. With these statistics, data inversion can be used to reconstruct the gate process. From this we can calculate its overlap, or fidelity, with respect to the ideal quantum-logic gate; e.g. in recent experiments at the University of Queensland we obtained a process fidelity of F_p = 98.2 ± 0.3% for a controlled-π/4 gate [28]. The process fidelity is itself not a metric [11], but forms the basis of several, such as the average gate fidelity, i.e. the fidelity of every gate output state with the ideal, averaged over every possible input state. This is given simply by

F = (d F_p + 1) / (d + 1) ,   (23)
where d is the dimension of the process matrix, e.g. d = 16 for two-qubit gates. These measures are useful for comparing gate performances, but do not provide an error probability per gate to allow direct comparison with fault-tolerance thresholds. A recent publication introduces a semidefinite programming technique to do exactly this, using the measured process matrix [59]. Such fault-tolerant benchmarking identifies the magnitude, but not the source, of errors; identifying the source is critical to finding the technology path for improving gate operation. This requires a comprehensive theoretical model of the quantum-logic gate and its errors. Using such a model, Ref. [59] identifies small amounts of multi-photon emission as the dominant, yet previously unrecognized, source of error in linear-optical quantum computing: higher-order terms of 0.3–0.8% lead to gate errors of 1–6%! Removing multi-photon emission puts photonic quantum computing within striking distance of a recently predicted fault-tolerance threshold [23].

Quantum Optical Algorithms

The simplest useful algorithm for quantum optical information processing is the quantum repeater. This is more conventionally referred to as a quantum communication protocol rather than an algorithm, but realistic systems will require few-qubit quantum computers to do entanglement distillation and purification, giving it an algorithmic flavor. Furthermore, quantum repeaters could enable distributed quantum computation. We also mention it here
as the various quantum optical repeater protocols that have been suggested [6,50,52] will have immediate application in long distance quantum key distribution and thus are pivotal to developing quantum optical networks across the global optical fiber network. This raises many new questions in the area of communication complexity, and a great deal of work remains to be done to understand such systems [18,55]. The general idea of a quantum repeater is sketched in Fig. 10. There are three key elements at each node of the network: (i) a pair of entangled qubits, (ii) quantum memories, (iii) a few-qubit quantum computer for purification and distillation. In the first step two entangled pairs are produced at the origin; one of each pair is held there, while the other member of each pair is sent to a distant location. A measurement is made on the two qubits kept at the origin at step 1, leaving partially entangled qubits at the remote locations, step 2. The remote qubits are then loaded into quantum memory, step 3. The process is repeated, steps 4 and 5, until two qubits are held at the remote locations. A purification algorithm is then run at each remote location so that finally a maximally entangled pair is held at the remote locations, separated by twice the distance that separated the pairs in step 1. In the quantum optical realization both the generation of entanglement and purification can be done by heralded nondeterministic processes.
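Each repetition of the protocol doubles the entangled separation, so the number of swapping rounds grows only logarithmically with the total distance. A toy calculation (the function name and link lengths are ours, purely illustrative):

```python
from math import ceil, log2

def nesting_levels(total_km, link_km):
    """Rounds of entanglement swapping needed if each round doubles the
    entangled separation, starting from elementary links of length link_km."""
    return max(0, ceil(log2(total_km / link_km)))

# Hypothetical numbers: span 1000 km from 50 km elementary links.
print(nesting_levels(1000, 50))  # 5 rounds: 50 -> 100 -> 200 -> 400 -> 800 -> 1600 km
```

This logarithmic scaling is the reason repeaters beat direct transmission, whose success probability decays exponentially with fiber length.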
Quantum Computing Using Optics, Figure 10 A simple schematic for a quantum repeater
The previous scheme of course requires a good quantum memory, and a great deal of research is underway to develop such systems for photons. One promising approach is to use polarized atomic ensembles [35]. The quantum memory itself may need to have error correction algorithms running to maintain the coherence of the entangled pairs until they are required for some future quantum information processing task. Many current cryptographic protocols rely on the computational difficulty of finding the prime factors of a large number: a small increase in the size of the number leads to an exponential increase in computational resources. Shor's quantum algorithm for factoring composite numbers faces no such limitation [51], and its realization represents a major challenge in quantum computation. Only one step of Shor's algorithm to find the factors of a number N requires a quantum routine. Given a randomly chosen co-prime C (where 1 < C < N and the greatest common divisor of C and N is 1), the quantum routine finds the order of C modulo N, defined to be the minimum integer r that satisfies C^r mod N = 1. It is straightforward to find the factors from the order. Consider N = 15: if we choose C = 2, the quantum routine finds r = 4, and the prime factors are given by the nontrivial greatest common divisors of C^{r/2} ± 1 and N, i.e. 3 and 5; similarly, if we choose the next possible co-prime, C = 4, we find the order r = 2, yielding the same factors. The quantum routine in Shor's algorithm can factor a k-bit number using 72k³ elementary quantum gates, e.g. factoring the smallest meaningful number, 15, requires 4608 gates operating on 21 qubits [1], where the gates are one-, two-, or three-qubit logic gates. Recognizing this is well beyond the reach of current technology, Ref. [1] introduced a compiling technique which exploits properties of the number to be factored, allowing exploration of Shor's algorithm with a vastly reduced number of resources.
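The classical wrap-around of the algorithm is easy to sketch; in the snippet below the quantum order-finding step is replaced by brute force, and the function names are ours, for illustration only:

```python
from math import gcd

def order(C, N):
    """Multiplicative order of C modulo N: smallest r >= 1 with C^r = 1 (mod N).
    This is the step Shor's quantum routine performs efficiently; classically
    it takes exponential time in the bit length of N, so we just brute-force it."""
    assert gcd(C, N) == 1
    r, y = 1, C % N
    while y != 1:
        y = (y * C) % N
        r += 1
    return r

def factors_from_order(C, N):
    """Recover nontrivial factors of N from the order r, when r is even."""
    r = order(C, N)
    if r % 2:
        return None
    a, b = gcd(C ** (r // 2) - 1, N), gcd(C ** (r // 2) + 1, N)
    return sorted(f for f in (a, b) if 1 < f < N)

print(order(2, 15))               # 4
print(factors_from_order(2, 15))  # [3, 5]
print(order(4, 15))               # 2
print(factors_from_order(4, 15))  # [3, 5]
```

This reproduces the worked example from the text: C = 2 gives r = 4 and C = 4 gives r = 2, both yielding the factors 3 and 5.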
The compiled algorithms are not scalable in themselves, but do allow the characterization of core processes required in a full-scale implementation of Shor's algorithm – including the ability to generate entanglement between qubits by coherent application of a series of quantum gates. In the only previous demonstration of Shor's algorithm, a compiled set of gate operations was implemented in a liquid NMR architecture [56]. Unfortunately, in such a system the qubits are at all times in a highly mixed state [2], and the dynamics can be fully modeled classically [32], so that neither the entanglement nor the coherent control at the core of Shor's algorithm can be implemented or verified. The quantum order-finding routine consists of three distinct steps: i) register initialization,

|0⟩^⊗n |0⟩^⊗m → (|0⟩ + |1⟩)^⊗n |0⟩^⊗(m−1) |1⟩ = Σ_{x=0}^{2ⁿ−1} |x⟩ |0⟩^⊗(m−1) |1⟩ ,

where the argument-register is prepared in an equal coherent superposition of all possible arguments (normalization omitted by convention); ii) modular exponentiation, which by controlled application of the order-finding function produces the entangled state Σ_{x=0}^{2ⁿ−1} |x⟩|C^x mod N⟩; iii) the inverse Quantum Fourier Transform (QFT) followed by measurement of the argument-register in the logical basis, which with high probability extracts the order r after further classical processing. If the routine is standalone, the inverse QFT can be performed without two-qubit gates using an approach based on local measurement and feedforward. In recent experiments, Shor's algorithm was performed for N = 15 using internal-ancilla logic gates [29,30]. Co-primes of C = 4 and C = 2 were realized using circuits of two controlled-SIGN gates, with three and four qubit inputs, respectively. To no-one's surprise, the prime factors 3 and 5 were found. What was surprising was the near-ideal algorithm performance – far better than expected given the known errors inherent in the logic gates. This result highlights a subtle point with respect to benchmarking quantum algorithms: algorithm performance is not necessarily an accurate indicator of the underlying circuit performance. This is particularly the case in Shor's algorithm, where ideally the algorithm produces mixed states. From algorithm performance alone it is not possible to distinguish between the desired mixture arising from entanglement with the function-register and the undesired mixture due to environmental decoherence. In [29] this was quantified by using quantum state tomography [17] to measure the joint state of the argument and function registers after modular exponentiation. The joint state was both mixed and entangled, only partially overlapping with the expected states, a sure indication of environmental decoherence.
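The period that the inverse QFT extracts in step iii) is already visible classically in the modular-exponentiation sequence written into the function register in step ii); a quick check for the compiled cases N = 15, C = 2 and C = 4 (illustrative only):

```python
# Step ii) writes C^x mod N into the function register; the order r is the
# period of this sequence. N = 15 as in the compiled demonstrations.
N = 15
for C in (2, 4):
    values = [pow(C, x, N) for x in range(8)]  # function-register contents
    r = values[1:].index(1) + 1                # first return to 1 gives the order
    print(C, values, r)
# C = 2 gives the period r = 4; C = 4 gives r = 2
```

Superposing the argument register over all x entangles it with this periodic sequence, which is why the ideal output of the routine is a mixed state on either register alone.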
Such accurate knowledge of circuit performance is crucial if the circuits are to be incorporated as sub-routines in larger algorithms.

Reprise: Single Photon States

The two key requirements for conditional quantum computing with single photons are: (i) single photon sources and (ii) highly efficient single photon detectors capable of resolving zero, one or two photon detection events. Of course we will need a lot more than just this. In order to make the scheme scalable some kind of photonic memory, or fast switching, or both, would be desirable. Once a particular entangled multi-photon state resource has been prepared, it might be necessary to store it for some time until it is required for use. We first discuss in some detail what is required for a single photon source.
It is now time to become much more precise about what is meant by a single photon state. The quantum electromagnetic field is described by an electric field operator [57],

Ê(x⃗, t) = i Σ_{n,λ} √(ℏωₙ / 2ε₀V) ê_{n,λ} [ e^{i(k⃗ₙ·x⃗ − ωₙt)} a_{n,λ} − e^{−i(k⃗ₙ·x⃗ − ωₙt)} a†_{n,λ} ] ,   (24)

where ê_{n,λ} are two orthonormal polarization vectors (λ = 1, 2) which satisfy k⃗ₙ · ê_{n,λ} = 0, as required for a transverse field, and the frequency is given by the dispersion relation ωₙ = c|k⃗ₙ|. The positive and negative frequency amplitude operators are respectively a_{n,λ} and a†_{n,λ}, with bosonic commutation relations

[a_{n,λ}, a†_{n′,λ′}] = δ_{λλ′} δ_{nn′} ,   (25)
with all other commutation relations zero. Typically we are interested in sources defined by optical cavity modes, so that emission is directional and defined by the cavity spatio-temporal mode structure. Much of the difficulty in building single photon sources is in designing the optical cavity to ensure emission into a preferred set of modes. We will assume that the only modes that are excited have the same plane polarization and are all propagating in the same direction, which we take to be the x-direction. The positive frequency components of the quantum electric field for these modes are then
E⁽⁺⁾(x, t) = i Σ_{n=0}^{∞} (ℏωₙ / 2ε₀V)^{1/2} aₙ e^{−iωₙ(t − x/c)} .   (26)

In ignoring all the other modes, we are implicitly assuming that all our measurements do not respond to the vacuum state, an assumption which is justified by the theory of photon-electron detectors [57]. Let us further assume that all excited modes of this form have frequencies centered on a carrier frequency Ω ≫ 1. Then we can approximate the positive frequency components by
E⁽⁺⁾(x, t) = i (ℏΩ / 2ε₀Ac)^{1/2} √(c/L) Σ_{n=0}^{∞} aₙ e^{−iωₙ(t − x/c)} ,   (27)

where A is a characteristic transverse area. This operator has dimensions of electric field. In order to simplify the dimensions we now define a field operator that has dimensions of s^{−1/2}. Taking the continuum limit we thus define the positive frequency operator

a(x, t) = e^{−iΩ(t − x/c)} (1/√(2π)) ∫_{−∞}^{∞} dω′ a(ω′) e^{−iω′(t − x/c)} ,   (28)
where we have made a change of variable ω ↦ Ω + ω′ and used the fact that Ω ≫ 1 to set the lower limit of integration to minus infinity, and

[a(ω₁), a†(ω₂)] = δ(ω₁ − ω₂) .   (29)
In this form the moment n(x, t) = ⟨a†(x, t) a(x, t)⟩ has units of s⁻¹. This moment determines the probability per unit time (the count rate) to count a photon at space-time point (x, t) [57]. The field operators a(t) and a†(t) can be taken to describe the field emitted from the end of an optical cavity, which selects the directionality. We will contrast single photon states with multimode coherent states defined by a multimode displacement operator acting on the vacuum, D|0⟩, defined implicitly by

D† a(ω) D = a(ω) + α(ω) ,   (30)
where, consistent with the preceding assumptions, α(ω) is peaked at ω = 0, which corresponds to a carrier frequency Ω ≫ 1. The average field amplitude for this state is

⟨a(x, t)⟩ = e^{−iΩ(t − x/c)} (1/√(2π)) ∫_{−∞}^{∞} dω α(ω) e^{−iω(t − x/c)} ≡ α(x, t) e^{−iΩ(t − x/c)} ,   (31)
which implicitly defines the average complex amplitude of the field, α(x, t), as the Fourier transform of the frequency dependent displacements α(ω). We can also calculate the probability per unit time to detect a photon in this state at space-time point (x, t). This is given by n(x, t) = |α(x, t)|². Note that in this case the second order moment ⟨a†(x, t) a(x, t)⟩ factorizes, a result characteristic of fields with first order coherence. A coherent state is closest to our intuitive idea of a classical electromagnetic field. The multimode single photon state is defined by

|1⟩ = ∫_{−∞}^{∞} dω ν(ω) a†(ω) |0⟩ .   (32)

Normalization requires that

∫_{−∞}^{∞} dω |ν(ω)|² = 1 .   (33)

This last condition implies that the total number of photons, integrated over all modes, is unity,

∫_{−∞}^{∞} dω ⟨a†(ω) a(ω)⟩ = 1 .   (34)

This state has zero average field amplitude but

n(x, t) = |ν(t − x/c)|² ,   (35)

where ν(t) is the Fourier transform of ν(ω). So while the state has zero average field amplitude, there is apparently some sense in which the coherence implicit in the superposition of Eq. (32) is manifest. In fact, comparing this to the case of a coherent state, Eq. (31), we see that the expression for n(t) is also determined by the Fourier transform of a coherent amplitude. For this state the function ν(θ) is periodic in the phase θ = t − x/c, and it is not difficult to choose a form with a well defined pulse sequence. However, care should be exercised in interpreting these pulses. They do not represent a sequence of pulses each with one photon; rather they represent a single photon coherently superposed over all pulses. Once a photon is counted in a particular pulse, the field is returned to the vacuum state. A review of current efforts to produce such a state may be found in [53]. More recent results may be found in [4,8,13,19].

In the absence of true single photon pulse sources, most of the experimental work we have described has been done with a conditional source based on photon pair production in spontaneous parametric down conversion (SPDC). We now give a model of the kind of state such sources produce. In the case of a continuous wave pump, the entangled two-photon state produced by SPDC may be well approximated by

|ψ⟩ = ∫₀^∞ dω₁ ∫₀^∞ dω₂ Φ(ω₁, ω₂) a†(ω₁) b†(ω₂) |0⟩ ,   (36)

where a, b designate distinguishable modes, for example, distinguished either by polarization or wave vector [54]. More precisely, this is the conditional state given that a down conversion process has in fact taken place at all. The probability per unit time for this event is the down conversion efficiency. For spontaneous parametric down conversion with a continuous pump field at frequency 2Ω we can approximate

Φ(ω₁, ω₂) = δ(ω₁ + ω₂ − 2Ω) α(ω₁) .   (37)

We now make the change of variable ε = Ω − ω₁ and assume that the bandwidth B, over which α(ω₁) is significantly different from zero, satisfies Ω ≫ B; then we can write

|ψ⟩ = ∫_{−∞}^{∞} dε β(ε) a†(ε) b†(ε) |0⟩ ,   (38)

where we have defined β(ε) = α(Ω − ε) and a(ε) ≡ a(Ω − ε), b(ε) ≡ b(Ω + ε). We will also assume that β(ε) = β(−ε). Normalization of |ψ⟩ requires that

∫_{−∞}^{∞} dε |β(ε)|² = 1 .   (39)
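The normalization chain of Eqs. (33)–(35) can be checked numerically; the sketch below (our construction, with an assumed Gaussian spectral amplitude ν(ω)) verifies that an amplitude normalized as in Eq. (33) yields a temporal detection rate n(t) = |ν(t)|² that integrates to exactly one photon, as Eq. (34) demands:

```python
import numpy as np

# Frequency grid and a Gaussian spectral amplitude nu(omega).
M, dw = 2 ** 14, 0.01
omega = (np.arange(M) - M // 2) * dw
nu = np.exp(-omega ** 2)
nu /= np.sqrt(np.sum(np.abs(nu) ** 2) * dw)   # enforce Eq. (33)

# Unitary Fourier transform to the time domain: nu(t), as in Eq. (35).
nu_t = np.fft.fftshift(np.fft.fft(np.fft.ifftshift(nu))) * dw / np.sqrt(2 * np.pi)
dt = 2 * np.pi / (M * dw)

# Total photon number = integral of the count rate n(t) = |nu(t)|^2 over time.
total_photons = np.sum(np.abs(nu_t) ** 2) * dt
print(round(total_photons, 6))  # 1.0, by Parseval's theorem
```

The result is independent of the pulse shape: any ν(ω) obeying Eq. (33) carries exactly one photon, distributed in time according to |ν(t)|².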
The probability per unit time to detect a photon from this field with a unit efficiency detector is in fact unity, n_a(t) = ⟨a†(t) a(t)⟩ = 1. The probability per unit time to detect a photon from mode a is thus independent of time. This simply means that photons will be counted at randomly distributed times from each of the fields a, b. This is a reflection of the fact that the state Eq. (38) is invariant under time translations in a Lorentzian frame. On the other hand, let us now compute the coincidence rate,

C(t, t′) = ⟨ψ| a†(x, t′) a(x, t′) b†(x, t) b(x, t) |ψ⟩ .   (40)

This is given by

C(t, t′) = | ∫_{−∞}^{∞} dε β(ε) e^{−iε(t′ − t)} |²   (41)
         = |β̃(τ)|²   (42)
         ≡ C(τ) ,   (43)

where τ = t′ − t. The coincidence rate is thus symmetrical about t′ − t = 0 and is peaked at t′ = t. Even though photons are detected at random, independently from each beam, they are highly correlated in time. In the case of spontaneous SPDC the distribution function β(ε) is given approximately by the Lorentzian [49]

β(ε) ∝ 1 / (γ² + ε²) ,   (44)

where γ is the linewidth. We can now consider a heralded single photon source made by detecting one photon of the pair and then asking what kind of single photon state is conditionally produced in the other mode. In the case of a CW pump, we must first provide a temporal filter on the detected mode. One can think of this as a time dependent detector that is switched on and off over some time interval. In the frequency domain this is simply a filter. If such a detector is placed at the b mode, the conditional state of the a mode is given by Eq. (32) with

ν(ω) ∝ e^{−ω²/σ²} ,   (45)

which corresponds to a Gaussian temporal pulse. If we drive SPDC with a pulsed pump, the pump itself provides a natural temporal filter. In this case the frequency distribution function in Eq. (37) is not delta correlated but takes the form [54]

Φ(ω₁, ω₂) = exp[ −(ω₁ + ω₂ − 2Ω)² / σ_p² ] α(ω₁) ,   (46)

where Ω is the pump carrier frequency and σ_p is the bandwidth of the pump pulse. It is still the case, however, that no photon down conversion event may take place within the pump pulse window. Thus the source is not a deterministic single photon source. Migdall [33] has proposed a way to overcome this by multiplexing many heralded conditional SPDC sources, with a conditioning detection on one mode of each of the multiplexed pairs. Another approach has been implemented by the Polzik group [36]. They used a cavity to enhance the parametric down conversion to implement a frequency tunable source of heralded single photons with a narrow bandwidth of 8 MHz. This approach is particularly important as frequency tunability makes the source compatible with atomic quantum memories.

Future Directions

Optical systems are certain to be used for future quantum communication protocols. Indeed, the first steps have already been taken with quantum key distribution. In this article we have seen that it is also possible to process quantum information optically using heralded nondeterministic schemes of various kinds, and simple examples have been implemented experimentally. This greatly enhances the practicability of quantum communication schemes that require some quantum processing, such as quantum repeaters. While this has not yet been realized in practice, we expect that the first demonstrations are not far away. In the effort to produce single photon sources we are learning new ways to encode and process information in optical pulses. We have seen that coherent communication can be done using single photons despite the fact that the average field amplitude for such states is zero. If information can be encoded and decoded on photon number states, this would represent a major step beyond quantum communication protocols like quantum key distribution. It is far from clear, however, whether optical schemes will be viable for large scale quantum computation. Currently ion trap schemes and schemes based on superconducting devices offer the most likely way forward for quantum computation per se. However, a number of investigators are turning to the concept of a hybrid quantum computer which combines optical and matter based qubits. Optical qubits with heralded nondeterministic processing, combined with matter based quantum memories, are poised to make significant achievements. Hybrid schemes offer a path to distributed quantum computation between many nodes, each made up of a few hundred qubits. The nodes do not need to be far apart: they could simply be different parts of a single quantum computation device. However, if you will permit us some license, it is not too difficult to imagine a single quantum computation spanning the entire planet, with matter based nodes connected by quantum optical communication channels. Such a system would hold in its web massively entangled quantum states and exhibit a complexity that would make our current optical communication system look rather simple by comparison.
Bibliography
1. Beckman D, Chari AN, Devabhaktuni S, Preskill J (1996) Phys Rev A 54:1034
2. Braunstein SL et al (1999) Phys Rev Lett 83:1054
3. Cleve R, Ekert A, Henderson L, Macchiavello C, Mosca M (1999) On quantum algorithms. LANL quant-ph/9903061
4. Darquie B, Jones MPA, Dingjan J, Beugnon J, Bergamini S, Sortais Y, Messin G, Browaeys A, Grangier P (2005) Science 309:454
5. Deutsch D (1985) Quantum-theory, the Church–Turing principle and the universal quantum computer. Proc R Soc Lond A 400:97–117
6. Duan L-M, Lukin MD, Cirac JI, Zoller P (2001) Nature 414:413
7. Einstein A (1905) On a Heuristic Point of View about the Creation and Conversion of Light. Ann Physik 17:132
8. Eisaman MD, Fleischhauer M, Lukin MD, Zibrov AS (2006) Toward quantum control of single photons. Opt Photonics News 17:22–27
9. Feynman RP (1982) Simulating physics with computers. Int J Theor Phys 21:467
10. Gasparoni S et al (2004) Phys Rev Lett 93:020504
11. Gilchrist A, Langford NK, Nielsen MA (2005) Distance measures to compare real and ideal quantum processes. Phys Rev A 71:062310
12. Gottesman D, Chuang IL (1999) Demonstrating the viability of universal quantum computation using teleportation and single-qubit operations. Nature 402:390–393
13. Hennrich M, Legero T, Kuhn A, Rempe G (2004) Photon statistics of a non-stationary periodically driven single-photon source. New J Phys 6:86
14. Hofmann H, Takeuchi S (2002) Phys Rev A 66:024308
15. Hong CK, Ou ZY, Mandel L (1987) Measurement of subpicosecond time intervals between two photons by interference. Phys Rev Lett 59:2044
16. Hutchinson GD, Milburn GJ (2004) Nonlinear quantum optical computing via measurement. J Modern Opt 51:1211–1222
17. James DFV, Kwiat PG, Munro WJ, White AG (2001) Measurement of qubits. Phys Rev A 64:052312
18. Jiang L, Taylor JM, Khaneja N, Lukin MD (2007) Optimal approach to quantum communication using dynamic programming. arXiv:quant-ph/0710.5808
19. Keller M, Lange B, Hayasaka K, Lange W, Walther H (2003) Continuous generation of single photons with controlled waveform in an ion-trap cavity system. Nature 431:1075
20. Kieling K, Rudolph T, Eisert J (2007) Percolation, renormalization, and quantum computing with nondeterministic gates. Phys Rev Lett 99:130501
21. Kiesel N, Schmid C, Weber U, Ursin R, Weinfurter H (2007) Phys Rev Lett 99:250505
22. Knill E (2002) Phys Rev A 66:052306
23. Knill E (2005) Quantum computing with realistically noisy devices. Nature 434:39–44
24. Knill E, Laflamme R, Milburn GJ (2001) Efficient linear optical quantum computation. Nature 409:46
25. Kok P, Lee H, Dowling JP (2002) Phys Rev A 66:063814
26. Kok P, Munro WJ, Nemoto K, Ralph TC, Dowling JP, Milburn GJ (2007) Linear optical quantum computing. Rev Mod Phys 79:135
27. Langford NK, Weinhold TJ, Prevedel R, Gilchrist A, O’Brien JL, Pryde GJ, White AG (2005) Phys Rev Lett 95:210504
28. Lanyon BP, Barbieri M, Almeida MP, Jennewein T, Ralph TC, Resch KJ, Pryde GJ, O’Brien JL, Gilchrist A, White AG (2007) Quantum computing using shortcuts through higher dimensions. To appear in Nat Phys
29. Lanyon BP, Weinhold TJ, Langford NK, Barbieri M, James DFV, Gilchrist A, White AG (2007) Phys Rev Lett 99:250505
30. Lu C-Y, Browne DE, Yang T, Pan J-W (2007) Phys Rev Lett 99:250504
31. Lu C-Y, Zhou X-Q, Gühne O, Gao W-B, Zhang J, Yuan Z-S, Goebel A, Yang T, Pan J-W (2007) Experimental entanglement of six photons in graph states. Nature Phys 3:91
32. Menicucci NC et al (2002) Phys Rev Lett 88:167901
33. Migdall A et al (2002) Phys Rev A 66:053805
34. Milburn GJ (1989) A quantum optical Fredkin gate. Phys Rev Lett 62:2124–2127
35. Mishina OS, Kupriyanov DV, Muller JH, Polzik ES (2007) Spectral theory of quantum memory and entanglement via Raman scattering of light by an atomic ensemble. Phys Rev A 75:042326
36. Neergaard-Nielsen JS, Melholt Nielsen B, Takahashi H, Vistnes AI, Polzik ES (2007) High purity bright single photon source. Opt Express 15:7940
37. Nielsen MA (2004) Optical quantum computation using cluster states. Phys Rev Lett 93:040503
38. O’Brien JL et al (2004) Phys Rev Lett 93:080502
39. O’Brien JL, Pryde GJ, White AG, Ralph TC, Branning D (2003) Demonstration of an all-optical quantum controlled-NOT gate. Nature 426:264
40. Okamoto R, Hofmann HF, Takeuchi S, Sasaki K (2007) Phys Rev Lett 99:250506
41. Pittman TB, Jacobs BC, Franson JD (2001) Probabilistic quantum logic operations using polarizing beam splitters. Phys Rev A 64:062311
42. Pittman TB, Jacobs BC, Franson JD (2002) Phys Rev Lett 88:257902
43. Pittman TB, Fitch MJ, Jacobs BC, Franson JD (2003) Experimental controlled-NOT logic gate for single photons in the coincidence basis. Phys Rev A 68:032316
44. Popescu S (2006) KLM quantum computation as a measurement based computation. arXiv:quant-ph/0610025
45. Prevedel R et al (2007) Nature 445:65
46. Ralph TC, White AG, Munro WJ, Milburn GJ (2001) Simple scheme for efficient linear optics quantum gates. Phys Rev A 65:012314
47. Ralph TC, Langford NK, Bell TB, White AG (2002) Linear optical controlled-NOT gate in the coincidence basis. Phys Rev A 65:062324
2451
2452
Quantum Computing Using Optics
48. Raussendorf R, Briegel HJ (2001) A one-way quantum computer. Phys Rev Lett 86:5188 49. Rhode PP, Ralph TC (2005) Phys Rev A 71:032320 50. Sangouard N, Simon C, Minar J, Zbinden H, de Riedmatten H, Gisin N (2007) Long-distance entanglement distribution with single-photon sources. arXiv:quant-ph/0706.1924v1 51. Shor P (1994) Algorithms for quantum computation: Discrete logarithms and factoring. Proc 35th annual symposium on foundations of computer science. See also LANL preprint quant-ph/9508027 52. Simon C et al (2007) Phys Rev Lett 98:190503 53. Special Issue on Single photon Sources (2004) Single photon Sources 51(9–10) 54. U’Ren AB, Mukamel E, Banaszek K, Walmsley IA (2003) Phil Trans Roy Soc A 361:1471 55. Van Meter R, Ladd TD, Munro WJ, Nemoto K (2007) System De-
56. 57. 58.
59.
60. 61.
sign for a Long-Line Quantum Repeater. arXiv:quant-ph/0705. 4128v1 Vandersypen LMK et al (2001) Nature 414:883 Walls DF, Milburn GJ (2008) Quantum Optics, 2nd edn. Springer, Berlin Walther P, Resch KJ, Rudolph T, Schenck E, Weinfurter H, Vedral V, Aspelmeyer M, Zeilinger A (2005) Experimental one-way quantum computing. Nature 434:169 Weinhold TJ, Gilchrist A, Resch KJ, Doherty AC, O’Brien JL, Pryde GJ, White AG (2008) Understanding photonic quantum-logic gates: The road to fault tolerance. arxiv 0808. 0794 White AG et al (2007) Measuring two-qubit gates. J Opt Soc Am B 24:172–183 Yamamoto Y, Kitagawa M, Igeta K (1988) Proc 3rd Asia-Pacific phys conf 779. World Scientific, Singapore
Quantum Cryptography
Hoi-Kwong Lo, Yi Zhao
Center for Quantum Information and Quantum Control, Department of Physics and Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada

Article Outline

Glossary
Definition of the Subject
Introduction
Quantum Key Distribution: Motivation and Introduction
Security Proofs
Experimental Fundamentals
Experimental Implementation of BB84 Protocol
Other Quantum Key Distribution Protocols
Quantum Hacking
Beyond Quantum Key Distribution
Future Directions
Acknowledgments
Bibliography

Glossary

One-time pad The one-time pad is a classical encryption algorithm invented by Gilbert Vernam in 1917. In the one-time-pad algorithm, the legitimate users share a random key (e.g., a random binary string) that is not known to anyone else. The message is combined with this random key ("pad"), which is as long as the message, and the key is used only once ("one-time"). In the most typical, binary case, an XOR operation is applied between the message and the key to obtain the ciphertext. Claude Shannon proved in 1949 that the one-time pad provides perfect secrecy, meaning that the ciphertext gives no additional information about the message.
Key distribution problem The key distribution problem originates from one-time-pad encryption, which requires that the two parties share a secret random key before the communication. This key is usually generated by one party; the key distribution problem is how to distribute this random key from one party to the other securely. The problem cannot be solved classically, but it can be solved via quantum key distribution.
Quantum key distribution Quantum key distribution (QKD) is a method for distributing a random key between two parties securely. The main idea is to encode
the bit value on the quantum state of a certain particle (usually a photon) and send the particle to the receiver. The quantum no-cloning theorem guarantees that an eavesdropper cannot duplicate the encoded quantum information perfectly.
BB84 BB84 is the first and so far the most popular quantum cryptography protocol. It was proposed by C. H. Bennett and G. Brassard in 1984 [1]. In the original BB84 proposal, the quantum information is encoded in the polarizations of photons; BB84 was later extended to phase coding. A detailed description of the BB84 protocol can be found in Sect. "Introduction".
B92 B92 is a quantum cryptography protocol proposed by C. H. Bennett in 1992 [2]. It uses two non-orthogonal states (e.g., the horizontally polarized and 45°-polarized photons) to denote "0" and "1". It is simpler to implement than the BB84 protocol.
E91 E91 is a quantum cryptography protocol proposed by A. Ekert in 1991 [3]. It is based on entangled photon pairs and is often used in free-space quantum key distribution.
Uni-directional QKD Uni-directional QKD is the QKD scheme in which Alice (the sender) generates a photon, encodes the quantum information on it, and sends it to Bob (the receiver).
Bi-directional QKD, or "plug & play" QKD Bi-directional QKD is the QKD scheme in which Bob generates strong laser pulses and sends them to Alice. Alice encodes her quantum information on each pulse, attenuates it to the single-photon level, and sends it back to Bob through the same channel. This design automatically compensates for phase and polarization drift in the channel.
Single photon source A single photon source is a light source that can generate a single photon on demand. A perfect single photon source has zero probability of generating multiple photons once triggered. A single photon source is required by the original BB84 protocol, but it is no longer strictly necessary thanks to the discovery and implementation of decoy state QKD.
Faint laser source A faint laser source consists of a standard pulsed laser followed by a heavy attenuator. The average output photon number is usually set to 0.1 photons per pulse; this low mean photon number suppresses the production of multi-photon signals. However, owing to the Poisson photon-number statistics of a laser source, the probability of multi-photon emission can never reach zero unless the laser is turned off.
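As a quick numerical check of the Poisson statistics mentioned above, the following sketch computes the photon-number distribution of a faint laser pulse with mean photon number μ = 0.1 (the figure quoted in the glossary) and the residual multi-photon probability:

```python
import math

def photon_number_probs(mu, n_max=4):
    """Poisson photon-number distribution P(n) = e^{-mu} mu^n / n! of a laser pulse."""
    return [math.exp(-mu) * mu**n / math.factorial(n) for n in range(n_max + 1)]

mu = 0.1                                   # mean photon number per pulse
p = photon_number_probs(mu)
p_multi = 1 - math.exp(-mu) * (1 + mu)     # P(n >= 2): probability of a multi-photon pulse
```

For μ = 0.1, about 90% of pulses are empty, and roughly 0.5% of all pulses (about 5% of non-empty pulses) contain more than one photon, which is why the multi-photon probability is suppressed but never zero.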
Single photon detector A single photon detector is sensitive to the weakest light signals, i.e., signals containing single photons. Most single photon detectors are threshold detectors, meaning that they can detect the arrival of one or more photons but cannot count the number of photons within one signal.
Dark count A dark count is an event in which the detector reports a detection although no photon actually hit it. It is a key parameter of single photon detectors, and it becomes important when the channel loss between the sender and the receiver is high (i.e., when very few photons can reach the receiver).
Qubit A qubit (quantum bit) is the fundamental unit of quantum information. Whereas a classical bit takes the value "0" or "1", a qubit can take a value in any superposition of two distinguishable (i.e., orthogonal) states, commonly labeled $|0\rangle$ and $|1\rangle$. In other words, a (pure) qubit state can be written in the form $a|0\rangle + b|1\rangle$, where a and b are complex numbers subject to the normalization constraint $|a|^2 + |b|^2 = 1$. Physically, a qubit can be encoded in any two-level quantum system, such as the two polarizations of a single photon or two atomic levels of an atom.
Bit-flip A bit-flip is a typical noise process in both classical and quantum communication. In quantum cryptography, a bit-flip in the channel transforms an initial state $|\psi_i\rangle = a|0\rangle + b|1\rangle$ into the final state $|\psi_f\rangle = a|1\rangle + b|0\rangle$.
Phase-flip A phase-flip is a typical noise process unique to quantum communication. A phase-flip in the quantum channel transforms an initial state $|\psi_i\rangle = a|0\rangle + b|1\rangle$ into the final state $|\psi_f\rangle = a|0\rangle - b|1\rangle$.

Definition of the Subject

Quantum cryptography is the synthesis of quantum mechanics with the art of code-making (cryptography). The idea was first conceived in an unpublished manuscript written by Stephen Wiesner around 1970 [4]. However, the subject received little attention until its resurrection by a classic paper published by Bennett and Brassard in 1984 [1].
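The qubit, bit-flip, and phase-flip entries in the glossary above can be made concrete with a small numerical sketch. This is our own illustration (the amplitudes a = 0.6 and b = 0.8i are arbitrary; the Pauli matrices X and Z are the standard representations of the two noise processes):

```python
import numpy as np

# A qubit state a|0> + b|1> as a normalized complex vector.
a, b = 0.6, 0.8j
psi = np.array([a, b])
assert np.isclose(np.vdot(psi, psi).real, 1.0)  # |a|^2 + |b|^2 = 1

X = np.array([[0, 1], [1, 0]])   # bit-flip:   a|0> + b|1>  ->  a|1> + b|0>
Z = np.array([[1, 0], [0, -1]])  # phase-flip: a|0> + b|1>  ->  a|0> - b|1>

bit_flipped = X @ psi    # amplitudes swapped: b|0> + a|1>
phase_flipped = Z @ psi  # relative sign flipped: a|0> - b|1>
```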
The goal of quantum cryptography is to perform tasks that are impossible or intractable with conventional cryptography. Quantum cryptography makes use of the subtle properties of quantum mechanics such as the quantum no-cloning theorem and the Heisenberg uncertainty principle. Unlike conventional cryptography, whose security is often based on unproven computational assumptions, quantum cryptography has an important advantage in that its security is often based on the laws of physics. Thus far, proposed applications of quantum cryptography include quantum key distribution (abbreviated
QKD), quantum bit commitment and quantum coin tossing. These applications have achieved varying degrees of success. The most successful and important application, QKD, has been proven to be unconditionally secure. Moreover, experimental QKD has now been performed over hundreds of kilometers, over both standard commercial telecom optical fibers and open air. In fact, commercial QKD systems are currently available on the market.
In a wider context, quantum cryptography is a branch of quantum information processing, which includes quantum computing, quantum measurements, and quantum teleportation. Among all branches, quantum cryptography is the one closest to real-life applications, and it can therefore serve as a concrete avenue for demonstrating concepts in quantum information processing. On a more fundamental level, quantum cryptography is deeply related to the foundations of quantum mechanics, particularly the testing of Bell inequalities and the detection efficiency loophole. On a technological level, quantum cryptography is related to technologies such as single-photon sources and single-photon measurement and detection.

Introduction

The best-known application of quantum cryptography is quantum key distribution (QKD). The goal of QKD is to allow two distant participants, traditionally called Alice and Bob, to share a long random secret string (commonly called the key) in the presence of an eavesdropper, traditionally called Eve. The key can subsequently be used to achieve a) perfectly secure communication (via the one-time pad, see below) and b) perfectly secure authentication (via the Wegman–Carter authentication scheme), thus achieving two key goals in cryptography. The best-known protocol for QKD is the Bennett and Brassard protocol (BB84), published in 1984 [1]. The procedure of BB84 is as follows (also shown in Table 1).
1.
Quantum communication phase
(a) In BB84, Alice sends Bob a sequence of photons, each independently chosen from one of the four polarizations: vertical, horizontal, 45 degrees and 135 degrees.
(b) For each photon, Bob randomly chooses one of the two measurement bases (rectilinear or diagonal) to perform a measurement.
(c) Bob records his measurement bases and results, and publicly acknowledges his receipt of the signals.
2. Public discussion phase
(a) Alice broadcasts her bases of measurements. Bob broadcasts his bases of measurements.
(b) Alice and Bob discard all events where they used different bases for a signal.
(c) To test for tampering, Alice randomly chooses a fraction, p, of all remaining events as test events. For those test events, she publicly broadcasts their positions and polarizations.
(d) Bob broadcasts the polarizations of the test events.
(e) Alice and Bob compute the error rate of the test events (i.e., the fraction of data for which their values disagree). If the computed error rate is larger than some prescribed threshold value, say 11%, they abort. Otherwise, they proceed to the next step.
(f) Alice and Bob each convert the polarization data of all remaining events into a binary string called a raw key (by, for example, mapping a vertical or 45-degree photon to "0" and a horizontal or 135-degree photon to "1"). They can then perform classical post-processing such as error correction and privacy amplification to generate a final key.

Quantum Cryptography, Table 1  Procedure of the BB84 protocol (↕ vertical, ↔ horizontal, ↗ 45°, ↘ 135°; "–" marks discarded events)

Alice's bit sequence:                0  1  1  1  0  1  0  0  0  1
Alice's basis:                       +  ×  +  +  ×  +  ×  ×  +  ×
Alice's photon polarization:         ↕  ↘  ↔  ↔  ↗  ↔  ↗  ↗  ↕  ↘
Bob's basis:                         +  +  ×  +  +  ×  ×  +  +  ×
Bob's measured polarization:         ↕  ↔  ↘  ↔  ↕  ↗  ↗  ↔  ↕  ↘
Bob's sifted measured polarization:  ↕  –  –  ↔  –  –  ↗  –  ↕  ↘
Bob's data sequence:                 0  –  –  1  –  –  0  –  0  1

Notice that it is important for the classical communication channel between Alice and Bob to be authenticated. Otherwise, Eve can easily launch a man-in-the-middle attack by disguising herself as Alice to Bob and as Bob to Alice. Fortunately, authenticating an m-bit classical message requires an authentication key of length only logarithmic in m. Therefore, QKD provides an efficient way to expand a short initial authentication key into a long key; by repeating QKD many times, one can obtain an arbitrarily long secure key.
This article is organized as follows. In Sect. "Quantum Key Distribution: Motivation and Introduction", we discuss the importance and foundations of QKD; in Sect. "Security Proofs", we discuss the principles of different approaches to proving the unconditional security of QKD; in Sect. "Experimental Fundamentals", we introduce the history and some fundamental components of QKD implementations; in Sect. "Experimental Implementation of BB84 Protocol", we discuss the implementation of the BB84 protocol in detail; in Sect. "Other Quantum Key Distribution Protocols", we discuss the proposals and implementations of other QKD protocols; in Sect. "Quantum Hacking", we introduce a young and exciting area, quantum hacking, in both theory and experiment, and in particular provide a catalog of existing eavesdropping attacks; in Sect. "Beyond Quantum Key Distribution", we discuss topics other than QKD, including quantum bit commitment and quantum coin tossing; in Sect. "Future Directions", we wrap up this article with prospects for quantum cryptography in the future.

Quantum Key Distribution: Motivation and Introduction

Cryptography, the art of code-making, has a long and distinguished history of military and diplomatic applications, dating back to ancient civilizations in Mesopotamia, Egypt, India and China. Moreover, in recent years cryptography has found widespread civilian applications such as electronic commerce and electronic business. Each time we go on-line to access our banking or credit card data, we should be deeply concerned with our data security.

Key Distribution Problem and One-Time-Pad

Secure communication is the best-known application of cryptography. The goal of secure communication is to allow two distant participants, traditionally called Alice and Bob, to communicate securely in the presence of an eavesdropper, Eve. See Fig. 1.

Quantum Cryptography, Figure 1  Communication in the presence of an eavesdropper

A simple example of an encryption scheme is Caesar's cipher: Alice simply shifts each letter in a message alphabetically by three letters. For instance, the word NOW is mapped to QRZ, because N → O → P → Q, O → P → Q → R and W → X → Y → Z. According to legend, Julius Caesar used Caesar's cipher to communicate with his generals. An encryption by an alphabetical shift of a fixed but arbitrary number of positions is also called a Caesar's cipher. Note that Caesar's cipher is not very secure, because an eavesdropper can simply try all 26 possible values of the key exhaustively to recover the original message.
In conventional cryptography, an unbreakable code does exist. It is called the one-time pad and was invented by Gilbert Vernam in 1918 [5]. In the one-time-pad method, a message (traditionally called the plain text) is first converted by Alice into binary form (a string of "0"s and "1"s) by a publicly known method. A key is a binary string of the same length as the message. By combining each bit of the message with the respective bit of the key using XOR (i.e., addition modulo two), Alice converts the plain text into an encrypted form (called the cipher text): for each bit, $c_i = m_i + k_i \bmod 2$. Alice then transmits the cipher text to Bob via a broadcast channel. Anyone, including an eavesdropper, can obtain a copy of the cipher text. However, without knowledge of the key, the cipher text is totally random and gives no information whatsoever about the plain text. For decryption, Bob, who shares the same key with Alice, performs another XOR between each bit of the cipher text and the respective bit of the key to recover the plain text. This is because $c_i + k_i \bmod 2 = m_i + 2k_i \bmod 2 = m_i \bmod 2$.
Notice that it is important not to re-use a key in a one-time-pad scheme. Suppose the same key, k, is used for the encryption of two messages, $m_1$ and $m_2$; the cipher texts are then $c_1 = m_1 + k \bmod 2$ and $c_2 = m_2 + k \bmod 2$. Eve can simply take the XOR of the two cipher texts to obtain $c_1 + c_2 \bmod 2 = m_1 + m_2 + 2k \bmod 2 = m_1 + m_2 \bmod 2$, thus learning non-trivial information, namely the parity of the two messages.
The one-time-pad method is commonly used in top-secret communication. It is unbreakable, but it has a serious drawback: it supposes that
Alice and Bob initially share a random secret string that is as long as the message. Therefore, the one-time pad simply shifts the problem of secure communication to the problem of key distribution. This is the key distribution problem. In top-secret communication, the key distribution problem can be solved by trusted couriers. Unfortunately, trusted couriers can be bribed or compromised. Indeed, in conventional cryptography a key is a classical string of "0"s and "1"s, and in classical physics there is no fundamental principle that can prevent an eavesdropper from copying a key during the key distribution process.
A possible solution to the key distribution problem is public key cryptography. However, the security of public key cryptography is based on unproven computational assumptions. For example, the security of the standard RSA cryptosystem, invented by Rivest, Shamir and Adleman (RSA), is based on the presumed difficulty of factoring large integers. Therefore, public key cryptography is vulnerable to unanticipated advances in hardware and algorithms. In fact, quantum computers, which operate on the principles of quantum mechanics, can break the standard RSA cryptosystem via Shor's celebrated quantum algorithm for efficient factoring [6].

Quantum No-Cloning Theorem and Quantum Key Distribution (QKD)

Quantum mechanics can provide a solution to the key distribution problem. In quantum key distribution, an encryption key is generated randomly between Alice and Bob using non-orthogonal quantum states. In contrast to classical physics, quantum mechanics features a quantum no-cloning theorem (see below), which states that it is fundamentally impossible for anyone, including an eavesdropper, to make an additional copy of an unknown quantum state.
A big advantage of quantum cryptography is forward security. In conventional cryptography, an eavesdropper Eve has a transcript of all communications.
Therefore, she can simply save it for many years and wait for breakthroughs such as the discovery of a new algorithm or new hardware. Indeed, if Eve can factor large integers in the year 2100, she can decrypt communications sent in 2008. We remark that Canadian census data are kept secret for 92 years on average; factoring in the year 2100 may therefore violate the security requirements of our government today! And no one in his or her right mind can guarantee the impossibility of efficient factoring in 2100 (except for the fact that he or she may not live that long). In contrast, quantum cryptography guarantees forward security. Thanks to the quantum no-cloning theorem, an eavesdropper does not have a transcript of the quantum signals sent by Alice to Bob.
For completeness, we include the statement and the proof of the quantum no-cloning theorem below.

Quantum No-Cloning Theorem  An unknown quantum state cannot be copied.
(a) The case without an ancilla: given an unknown state $|\alpha\rangle$, show that a quantum copying machine that can map $|\alpha\rangle|0\rangle \to |\alpha\rangle|\alpha\rangle$ does not exist.
(b) The general case: given an unknown state $|\alpha\rangle$, show that a quantum copying machine that can map $|\alpha\rangle|0\rangle|0\rangle \to |\alpha\rangle|\alpha\rangle|u_\alpha\rangle$ does not exist.

Proof  (a) Suppose the contrary. Then a quantum cloning machine exists. Consider the two orthogonal input states $|0\rangle$ and $|1\rangle$. We have

$|0\rangle|0\rangle \to |0\rangle|0\rangle$

and

$|1\rangle|0\rangle \to |1\rangle|1\rangle .$

Consider a general input $|\alpha\rangle = a|0\rangle + b|1\rangle$. Since a unitary transformation is linear, by linearity we have

$|\alpha\rangle|0\rangle = (a|0\rangle + b|1\rangle)|0\rangle \to a|0\rangle|0\rangle + b|1\rangle|1\rangle . \qquad (1)$

In contrast, for quantum cloning we would need

$|\alpha\rangle|0\rangle \to (a|0\rangle + b|1\rangle)(a|0\rangle + b|1\rangle) = a^2|0\rangle|0\rangle + ab|0\rangle|1\rangle + ab|1\rangle|0\rangle + b^2|1\rangle|1\rangle . \qquad (2)$

Clearly, if $ab \neq 0$, the two results shown in Eqs. (1) and (2) are different. Therefore, quantum cloning (without an ancilla) is impossible. (b) The proof is similar.

More generally, for general quantum states, information gain implies disturbance.

Theorem (Information gain implies disturbance)  Given one state chosen from one of two distinct non-orthogonal states $|u\rangle$ and $|v\rangle$ (i.e., $|\langle u|v\rangle| \neq 0$ or $1$), any operation that can learn any information about its identity necessarily disturbs the state.

Proof  Suppose the system is initially in the state $|u\rangle$ or $|v\rangle$, and suppose an experimentalist applies an operation that leaves the system undisturbed. The most general thing that she can try to do is to prepare some ancilla in a standard state $|0\rangle$ and couple it to the system. Therefore, we have

$|u\rangle|0\rangle \to |u\rangle|u'\rangle \qquad (3)$

and

$|v\rangle|0\rangle \to |v\rangle|v'\rangle \qquad (4)$

for some ancilla states $|u'\rangle$ and $|v'\rangle$. In the end, the experimentalist lets go of the system and keeps the ancilla; he or she may then perform a measurement on the ancilla to learn about the initial state of the system. Recall that quantum evolution is unitary and as such preserves the inner product. Taking the inner product between Eqs. (3) and (4), we get

$\langle u|v\rangle\langle 0|0\rangle = \langle u|v\rangle\langle u'|v'\rangle$
$\langle u|v\rangle = \langle u|v\rangle\langle u'|v'\rangle$
$\langle u|v\rangle\,(1 - \langle u'|v'\rangle) = 0 \qquad (5)$
$1 - \langle u'|v'\rangle = 0$
$|u'\rangle = |v'\rangle ,$

where in the fourth line we have used the fact that $|\langle u|v\rangle| \neq 0$. Now, the condition $|u'\rangle = |v'\rangle$ means that the final state of the ancilla is independent of the initial state of the system. Therefore, a measurement on the ancilla tells the experimentalist nothing about the initial state of the system.
Therefore, any attempt by an eavesdropper to learn information about a key in a QKD process will lead to a disturbance, which can be detected by Alice and Bob, who can, for example, check the bit error rate of a random sample of the raw transmission data.
The standard BB84 protocol for QKD was discussed in Sect. "Introduction". In the BB84 protocol, Alice prepares a sequence of photons, each randomly chosen in one of the four polarizations: vertical, horizontal, 45 degrees and 135 degrees. For each photon, Bob chooses one of the two polarization bases (rectilinear or diagonal) to perform a measurement. Intuitively, the security comes from the fact that the two polarization bases, rectilinear and diagonal, are conjugate observables. Just as position and momentum are conjugate observables in the standard Heisenberg uncertainty principle, no measurement by an eavesdropper Eve can determine the value of both observables simultaneously. Mathematically, two conjugate observables are represented by two non-commuting Hermitian matrices. Therefore, they cannot be simultaneously
diagonalized. This impossibility of simultaneous diagonalization implies the impossibility of simultaneous measurements of two conjugate observables.
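The non-commutativity claim can be checked numerically. The sketch below is our own illustration, representing the rectilinear and diagonal polarization observables by the Pauli matrices Z and X (a conventional choice, not one made in the text):

```python
import numpy as np

# Rectilinear and diagonal polarization observables, represented
# (as one conventional choice) by the Pauli matrices Z and X.
Z = np.array([[1, 0], [0, -1]])
X = np.array([[0, 1], [1, 0]])

# A nonzero commutator [Z, X] = ZX - XZ means the two Hermitian
# observables cannot be simultaneously diagonalized, and hence
# cannot be simultaneously measured.
commutator = Z @ X - X @ Z
print(np.allclose(commutator, 0))  # False: they do not commute
```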
Example of a Simple Eavesdropper Attack: Intercept-Resend Attack

To illustrate the security of quantum cryptography, let us consider the simple example of an intercept-resend attack by an eavesdropper Eve, who measures each photon in a randomly chosen basis and then resends the resulting state to Bob. For instance, if Eve performs a rectilinear measurement, photons prepared by Alice in the diagonal basis will be disturbed by Eve's measurement and give random answers. When Eve resends rectilinear photons to Bob, if Bob performs a diagonal measurement, he will get random answers. Since the two bases are chosen randomly by each party, such an intercept-resend attack gives a bit error rate of $0.5 \times 0.5 + 0.5 \times 0 = 25\%$, which is readily detectable by Alice and Bob. Sophisticated attacks against QKD do exist; fortunately, the security of QKD has now been proven. This subject will be discussed further in Sect. "Security Proofs".

Equivalence Between Phase and Polarization Encoding

Notice that the BB84 protocol can be implemented with any two-level quantum system (qubits). In Sect. "Introduction" and the above discussion, we have described the BB84 protocol in terms of polarization encoding. This is just one of many possible encodings; in particular, phase encoding also exists. In phase encoding, a signal consists of a superposition of two time-separated pulses, known as the reference pulse and the signal pulse; see Fig. 2 for an illustration of the phase encoding scheme. The information is encoded in the relative phase between the two pulses, i.e., the four possible states used by Alice are $\frac{1}{\sqrt{2}}(|R\rangle + |S\rangle)$, $\frac{1}{\sqrt{2}}(|R\rangle - |S\rangle)$, $\frac{1}{\sqrt{2}}(|R\rangle + i|S\rangle)$ and $\frac{1}{\sqrt{2}}(|R\rangle - i|S\rangle)$. Notice that, mathematically, the phase encoding scheme is equivalent to the polarization encoding scheme; they are simply different embodiments of the same BB84 protocol.

Quantum Cryptography, Figure 2  Conceptual schematic for a phase-coding BB84 QKD system. PM: phase modulator; BS: beam splitter; SPD: single photon detector

Security Proofs

"The most important question in quantum cryptography is to determine how secure it really is." (Brassard and Crépeau [7])

Security proofs are very important because a) they provide the foundation of security for a QKD protocol, b) they provide a formula for the key generation rate of a QKD protocol, and c) they may even provide a construction for the classical post-processing protocol (for error correction and privacy amplification) that is necessary for the generation of the final key. Without security proofs, a real-life QKD system is incomplete, because we can never be sure how to generate a secure key and how secure the final key really is.

Classification of Eavesdropping Attacks

Before we discuss security proofs, let us first consider eavesdropping attacks. There are infinitely many eavesdropping strategies that an eavesdropper, Eve, can perform against a QKD protocol. They can be classified as follows:
Individual attacks In an individual attack, Eve attacks each signal independently. The intercept-resend attack discussed in Subsect. "Example of a Simple Eavesdropper Attack: Intercept-Resend Attack" is an example of an individual attack.
Collective attacks A more general class of attacks is the collective attack, in which, for each signal, Eve independently couples it with an ancillary quantum system, commonly called an ancilla, and evolves the combined signal/ancilla system unitarily. She sends the resulting signals to Bob but keeps all the ancillas herself. Unlike the case of individual attacks, Eve postpones her choice of measurement: only after hearing the public discussion between Alice and Bob does Eve decide what measurement to perform on her ancillas to extract information about the final key.
Joint attacks The most general class of attacks is the joint attack. In a joint attack, instead of interacting with each signal independently, Eve treats all the signals as a single quantum system.
She then couples the signal system with her ancilla and evolves the combined signal and ancilla system unitarily. She hears the public discussion between Alice and Bob before deciding on which measurement to perform on her ancilla.
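The intercept-resend attack discussed earlier, the simplest individual attack, can be simulated directly. The following sketch makes idealized assumptions of our own (perfect devices, no channel noise) and recovers the 25% error rate after sifting:

```python
import random

def bb84_intercept_resend(n=100_000, seed=1):
    """Simulate BB84 sifting while Eve measures and resends every photon.

    Bases: 0 = rectilinear (+), 1 = diagonal (x). A measurement in the
    wrong basis yields a uniformly random bit; in the matching basis it
    is faithful. Returns the bit error rate of the sifted key.
    """
    rng = random.Random(seed)
    errors = sifted = 0
    for _ in range(n):
        bit, a_basis = rng.getrandbits(1), rng.getrandbits(1)
        e_basis, b_basis = rng.getrandbits(1), rng.getrandbits(1)
        # Eve measures: a wrong basis randomizes the bit she obtains and resends.
        e_bit = bit if e_basis == a_basis else rng.getrandbits(1)
        # Bob measures Eve's resent photon.
        b_bit = e_bit if b_basis == e_basis else rng.getrandbits(1)
        if b_basis == a_basis:  # event kept after basis sifting
            sifted += 1
            errors += (b_bit != bit)
    return errors / sifted

qber = bb84_intercept_resend()
print(round(qber, 3))  # close to 0.25
```

Half of Eve's measurements use the wrong basis, and each of those randomizes Bob's sifted bit, giving the expected error rate 0.5 × 0.5 = 25%.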
Proving the security of QKD against the most general attack was a very hard problem. It took more than ten years, but the unconditional security of QKD was finally established in several papers in the 1990s. One approach, by Mayers [8], was to prove the security of BB84 directly. A simpler approach, by Lo and Chau [9], made use of the idea of entanglement distillation by Bennett, DiVincenzo, Smolin and Wootters (BDSW) [10] and quantum privacy amplification by Deutsch et al. [11] to prove the security of an entanglement-based QKD protocol. The two approaches were unified by the work of Shor and Preskill [12], who provided a simple proof of the security of BB84 using the entanglement distillation idea. Other early security proofs of QKD include those of Biham, Boyer, Boykin, Mor and Roychowdhury [13], and Ben-Or [14].

Approaches to Security Proofs

There are several approaches to security proofs. We discuss them one by one.
1. Entanglement distillation  The entanglement distillation protocol (EDP) provides a simple approach to security proofs [9,11,12]. The basic insight is that entanglement is a sufficient (but not necessary) condition for a secure key. Consider the noiseless case first. Suppose two distant parties, Alice and Bob, share a maximally entangled state of the form $|\Phi\rangle_{AB} = \frac{1}{\sqrt{2}}(|00\rangle_{AB} + |11\rangle_{AB})$. If Alice and Bob each measure their systems, they will both get "0"s or "1"s, which form a shared random key. Moreover, if we consider the combined system of the three parties, Alice, Bob and an eavesdropper Eve, we can use a pure-state description (the "Church of the Larger Hilbert Space") and consider a pure state $|\Psi\rangle_{ABE}$. In this case, the von Neumann entropy [15] of Eve satisfies $S(\rho_E) = S(\rho_{AB}) = 0$. This means that Eve has absolutely no information on the final key, as a consequence of the standard Holevo theorem; see, e.g., [16].
In the noisy case, Alice and Bob may share N pairs of qubits that are a noisy version of N maximally entangled states. Now, using the idea of the entanglement distillation protocol (EDP) discussed in BDSW [10], Alice and Bob may apply local operations and classical communications (LOCCs) to distill from the N noisy pairs a smaller number, say M, of almost perfect pairs, i.e., a state close to $|\Phi\rangle_{AB}^{\otimes M}$. Once such an EDP has been performed, Alice and Bob can measure their respective systems to generate an M-bit final key.
One may ask: how can Alice and Bob be sure that their EDP will be successful? Whether an EDP will be successful or not depends on the initial state shared by Alice and Bob. In the above, we have skipped the discussion of the verification step. In practice, Alice and Bob can never be sure what initial state they possess; therefore, it is useful for them to add a verification step. By, for example, randomly testing a fraction of their pairs, they obtain a good estimate of the properties (e.g., the bit-flip and phase error rates) of their remaining pairs and can be confident that their EDP will be successful.
The above description of an EDP is for a quantum-computing protocol, where we assume that Alice and Bob can perform local quantum computations. In practice, Alice and Bob do not have large-scale quantum computers at their disposal. Shor and Preskill made the important observation that the security proof of the standard BB84 protocol can be reduced to that of an EDP-based QKD protocol [9,11]. The Shor–Preskill proof [12] makes use of the Calderbank–Shor–Steane (CSS) code, which has the advantage of decoupling the quantum error correction procedure into two parts: bit-flip and phase error correction. They then show that bit-flip error correction corresponds to standard error correction, and phase error correction corresponds to privacy amplification (by random hashing).
2. Communication complexity/quantum memory  The communication complexity/quantum memory approach to security proofs was proposed by Ben-Or [14] and subsequently by Renner and Koenig [17]; see also [18]. It provides a formula for the secure key generation rate in terms of an eavesdropper's quantum knowledge of the raw key. Let Z be a random variable with range $\mathcal{Z}$, let $\rho$ be a random state, and let F be a two-universal function on $\mathcal{Z}$ with range $S = \{0,1\}^s$ which is independent of Z and $\rho$. Then [17]

$d\big(F(Z)\,\big|\,\{F\} \otimes \rho\big) \le \tfrac{1}{2}\, 2^{-\tfrac{1}{2}\left(S_2\left(\rho_{[\{Z\}\otimes\rho]}\right) - S_0\left(\rho_{[\rho]}\right) - s\right)} .$

Incidentally, the quantum de Finetti theorem [19] is often useful for simplifying security proofs of this type.
3. Twisted state approach  What is a necessary and sufficient condition for secure key generation? From the entanglement distillation approach, we know that entanglement distillation is a sufficient condition for secure key generation. For some time, it was hoped that entanglement distillation is also a necessary condition. However, this idea was proven wrong in [20,21], where it was found that a necessary and sufficient condition is the distillation of a private state, rather than of a maximally entangled state. A private state is a "twisted" version of a maximally entangled state.
Quantum Cryptography
They proved the following theorem in [20]: a state is private in the above sense iff it is of the following form

$$\gamma_m = U \bigl( |\Phi_{2^m}\rangle_{AB}\langle\Phi_{2^m}| \otimes \varrho_{A'B'} \bigr) U^\dagger \tag{6}$$

where $|\Phi_d\rangle = \frac{1}{\sqrt{d}}\sum_{i=1}^{d}|ii\rangle$ and $\varrho_{A'B'}$ is an arbitrary state on $A'B'$. Here U is an arbitrary unitary controlled in the computational basis,

$$U = \sum_{i,j=1}^{2^m} |ij\rangle_{AB}\langle ij| \otimes U_{ij}^{A'B'} \,. \tag{7}$$
The operation (7) will be called "twisting" (note that only the $U_{ii}^{A'B'}$ matter here, yet it will be useful to consider general twisting later).

Proof (copied from [20]). The authors of [20] gave the proof for m = 1 (for higher m, the proof is analogous). Start with an arbitrary state held by Alice and Bob, $\rho_{AA'BB'}$, and include its purification to write the total state in the decomposition

$$|\psi\rangle_{ABA'B'E} = a\,|00\rangle_{AB}|\psi_{00}\rangle_{A'B'E} + b\,|01\rangle_{AB}|\psi_{01}\rangle_{A'B'E} + c\,|10\rangle_{AB}|\psi_{10}\rangle_{A'B'E} + d\,|11\rangle_{AB}|\psi_{11}\rangle_{A'B'E} \tag{8}$$

with the states $|ij\rangle$ on AB and $|\psi_{ij}\rangle$ on $A'B'E$. Since the key is unbiased and perfectly correlated, we must have $b = c = 0$ and $|a|^2 = |d|^2 = 1/2$. Depending on whether the key is $|00\rangle$ or $|11\rangle$, Eve will hold the states

$$\varrho_0 = \mathrm{Tr}_{A'B'}\,|\psi_{00}\rangle\langle\psi_{00}| \,, \qquad \varrho_1 = \mathrm{Tr}_{A'B'}\,|\psi_{11}\rangle\langle\psi_{11}| \,. \tag{9}$$

Perfect security requires $\varrho_0 = \varrho_1$. Thus there exist unitaries $U_{00}$ and $U_{11}$ on $A'B'$ such that

$$|\psi_{00}\rangle = \sum_i \sqrt{p_i}\,(U_{00}|i\rangle)_{A'B'}\,|\varphi_i\rangle_E \,, \qquad |\psi_{11}\rangle = \sum_i \sqrt{p_i}\,(U_{11}|i\rangle)_{A'B'}\,|\varphi_i\rangle_E \,. \tag{10}$$

After tracing out E, we thus get a state of the form of Eq. (6), where $\varrho_{A'B'} = \sum_i p_i\,|i\rangle\langle i|$.

The main new ingredient of the above theorem is the introduction of a "shield" part to Alice and Bob's system. That is, in addition to the systems A and B used by Alice and Bob for key generation, we assume that Alice and Bob also hold some ancillary systems, A' and B', often called the shield part. Since we assume that Eve has no access to the shield part, Eve is further limited in her ability to eavesdrop. Therefore, Alice and Bob can derive a higher key generation rate than in the case where Eve does have access to the shield part. An upshot is that even a bound entangled state can give a secure key. A bound entangled state is one whose formation (via local operations and classical communications, LOCCs) requires entanglement, but which does not give any distillable entanglement. In other words, even though no entanglement can be distilled from a bound entangled state, private states (twisted versions of entangled states) can be. In summary, secure key generation is a more general theory than entanglement distillation.
4. Complementarity principle
Another approach to security proofs is to use the complementarity principle of quantum mechanics. Such an approach is interesting because it shows the deep connection between the foundations of quantum mechanics and the security of QKD. In fact, both Mayers' proof [8] and Biham, Boyer, Boykin, Mor, and Roychowdhury's proof [13] make use of this principle. A clear and rigorous discussion of the complementarity approach to security proofs has recently been achieved by Koashi [22]. The key insight of Koashi's proof is that Alice and Bob's ability to generate a random secure key in the Z-basis (by a measurement of the Pauli spin matrix $\sigma_Z$) is equivalent to Bob's ability to help Alice prepare an eigenstate in the complementary basis, i.e., the X-basis ($\sigma_X$), with the help of the shield. The intuition is that an X-basis eigenstate, for example $|+\rangle_A = \frac{1}{\sqrt{2}}(|0\rangle_A + |1\rangle_A)$, when measured along the Z-basis, gives a random answer.
5. Other ideas for security proofs
Here we discuss two other ideas for security proofs, namely, a) device-independent security proofs and b) security from the causality constraint. Unfortunately, these ideas are still very much under development, and so far a complete proof of the unconditional security of QKD with a finite key rate based on these ideas is still missing. Let us start with a) device-independent security proofs. So far we have assumed that Alice and Bob know exactly what their devices are doing. In practice, Alice and Bob may not know their devices for sure.
Recently, there has been much interest in the idea of device-independent security proofs, i.e., how to prove security when Alice and Bob's devices cannot be trusted. See, for example, [23]. The idea is to look only at the input and output variables. A hand-waving argument goes as follows: if, from the observed probability distribution, one can demonstrate the violation of some Bell inequality, then the data cannot be explained by a separable system. How to develop such a hand-waving argument into a full proof of unconditional security is an important question.
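As an illustration of the Bell-inequality test underlying this argument, the following sketch (our own illustrative code, not taken from [23]) evaluates the CHSH combination for the singlet-state correlation $E(a,b) = -\cos(a-b)$ at the standard measurement angles:

```python
from math import cos, pi, sqrt

def E(a, b):
    # Quantum prediction for the correlation of spin measurements at
    # analyzer angles a and b on a singlet (maximally entangled) pair.
    return -cos(a - b)

# Standard CHSH settings: Alice uses angles a1, a2; Bob uses b1, b2.
a1, a2 = 0.0, pi / 2
b1, b2 = pi / 4, 3 * pi / 4

# CHSH combination; any local (separable) model obeys |S| <= 2.
S = E(a1, b1) - E(a1, b2) + E(a2, b1) + E(a2, b2)
print(abs(S))  # 2*sqrt(2) ~ 2.828, violating the local bound of 2
```

Observing |S| > 2 on the measured input–output statistics certifies that the data cannot come from a separable system, regardless of the inner workings of the devices.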
The second idea, b) security from the causality constraint, is even more ambitious. The question it tries to address is the following: how can one prove security even if quantum mechanics is wrong? In [24] and references cited therein, it was suggested that perhaps a more general physical principle, such as the no-signaling requirement for space-like separated observables, could be used to prove the security of QKD.
Classical Post-processing Protocols
As noted in Sect. "Introduction", after the quantum communication phase, Alice and Bob proceed with the classical communication phase. In order to generate a secure key, Alice and Bob have to know what classical post-processing protocol to apply to the raw quantum data. This is a highly non-trivial question. Indeed, a priori, given a particular procedure for classical post-processing, it is very hard to know whether it will give a secure key or how much secure key will be generated. In fact, it is sometimes said that in QKD, the optical part is easy, the electronics part is harder, but the hardest part is the classical post-processing protocol. Fortunately, security proofs often give Alice and Bob clear guidance on what classical post-processing protocol to use. This highlights the importance for QKD practitioners of studying the security proofs of QKD. Briefly stated, the classical post-processing protocol often consists of a) a test for tampering and b) key generation. In a), the test for tampering, Alice and Bob may randomly choose a fraction of the signals for testing. For example, by broadcasting the polarizations of those signals, they can work out the bit error rate of the test signals. Since the test signals are randomly chosen, they have high confidence in the bit error rate of the remaining signals. If the bit error rate of the tested signals is higher than a prescribed threshold value, Alice and Bob abort. On the other hand, if the bit error rate is lower than or equal to the prescribed value, they proceed with the key generation step using the remaining signals. They first convert their polarization data into binary strings, the raw keys, in a prescribed manner. For example, they can map a vertical or 45° photon to "0" and a horizontal or 135° photon to "1". As a result, Alice has a binary string x and Bob has a binary string y. However, two problems remain. First, Alice's string may differ from Bob's string.
Second, since the bit error rate is non-zero, Eve may have some information about Alice's and Bob's strings. The key generation step may be divided into the following stages: 1. Classical pre-processing. This is an optional step. Classical pre-processing has
the advantage of achieving a higher key generation rate and tolerating a higher bit error rate [25,26,27]. Alice and Bob may pre-process their data by either a) some type of error detection algorithm or b) some random process. An example of an error detection algorithm is a B-step [25], in which Alice randomly permutes all her bits and broadcasts the parity of each adjacent pair. In other words, starting from $\vec{x} = (x_1, x_2, \ldots, x_{2N-1}, x_{2N})$, Alice broadcasts the string $\vec{x}_1 = (x_{\sigma(1)} + x_{\sigma(2)} \bmod 2,\; x_{\sigma(3)} + x_{\sigma(4)} \bmod 2,\; \ldots,\; x_{\sigma(2N-1)} + x_{\sigma(2N)} \bmod 2)$, where $\sigma$ is a random permutation chosen by Alice. Moreover, Alice informs Bob which random permutation, $\sigma$, she has chosen. Similarly, starting from $\vec{y} = (y_1, y_2, \ldots, y_{2N-1}, y_{2N})$, Bob randomly permutes all his bits using the same $\sigma$ and broadcasts the parity bit of each adjacent pair, i.e., Bob broadcasts $\vec{y}_1 = (y_{\sigma(1)} + y_{\sigma(2)} \bmod 2,\; \ldots,\; y_{\sigma(2N-1)} + y_{\sigma(2N)} \bmod 2)$. For each pair of bits, Alice and Bob keep the first bit iff the parities of the pair agree. For instance, if $x_{\sigma(2k-1)} + x_{\sigma(2k)} \bmod 2 = y_{\sigma(2k-1)} + y_{\sigma(2k)} \bmod 2$, then Alice keeps $x_{\sigma(2k-1)}$ and Bob keeps $y_{\sigma(2k-1)}$ as their new key bit. Otherwise, they drop the pairs $(x_{\sigma(2k-1)}, x_{\sigma(2k)})$ and $(y_{\sigma(2k-1)}, y_{\sigma(2k)})$ completely. Notice that the above protocol is an error detection protocol. To see this, let us regard the case where $x_i \neq y_i$ as an error during the quantum transmission stage. Suppose that for each bit i, the event $x_i \neq y_i$ occurs with an independent probability p. For each k, the B-step throws away the cases where a single error has occurred at the two locations $\sigma(2k-1)$ and $\sigma(2k)$ and keeps the cases where either no error or two errors have occurred. As a result, the error probability after the B-step is reduced from O(p) to O(p²). The random permutation of all the bit locations ensures that the error model is well described by an independent identical distribution (i.i.d.).
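To make the error-detection property concrete, here is a small simulation of a single B-step (our own illustrative code, assuming i.i.d. bit-flips with probability p = 0.1). Surviving pairs are those whose parities agree, so the residual error rate drops from O(p) to roughly p²/((1−p)² + p²):

```python
import random

def b_step(x, y, rng):
    """One B-step: permute both strings with the same random permutation,
    compare parities of adjacent pairs, and keep the first bit of each
    pair whose parities agree (error detection)."""
    n = len(x)
    perm = list(range(n))
    rng.shuffle(perm)
    xp = [x[i] for i in perm]
    yp = [y[i] for i in perm]
    kx, ky = [], []
    for k in range(0, n - 1, 2):
        if (xp[k] ^ xp[k + 1]) == (yp[k] ^ yp[k + 1]):
            kx.append(xp[k])
            ky.append(yp[k])
    return kx, ky

rng = random.Random(42)
p = 0.10                       # assumed i.i.d. bit-flip probability
n = 200_000
x = [rng.randint(0, 1) for _ in range(n)]
y = [xi ^ (1 if rng.random() < p else 0) for xi in x]

kx, ky = b_step(x, y, rng)
err = sum(a != b for a, b in zip(kx, ky)) / len(kx)
print(f"error rate before: {p:.3f}, after one B-step: {err:.4f}")
```

For p = 0.1 the residual error rate is about 0.012, at the cost of keeping fewer than half the bits; iterating the step reduces the error rate further.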
An example of a random process is an adding-noise protocol [26] where, for each bit $x_i$, Alice randomly and independently chooses to keep it unchanged or flip it, with probabilities $1-q$ and $q$ respectively, where the probability q is publicly known. 2. Error correction. Owing to noise in the quantum channel, Alice and Bob's raw keys, x and y, may differ. Therefore, it is necessary for them to reconcile their keys. One simple way of key reconciliation is forward key reconciliation, whose goal is for Alice to keep the same key x and for Bob to change his key from y to x. Forward key reconciliation can be done either by standard error-correcting codes such as low-density parity-check (LDPC) codes
or by specialized (one-way or interactive) protocols such as the Cascade protocol [28]. 3. Privacy amplification. To remove any residual information Eve may have about the key, Alice and Bob may apply some algorithm to compress their partially secure key into a shorter one that is almost perfectly secure. This is called privacy amplification. Random hashing with a class of two-universal hash functions is often suitable for privacy amplification. See, for example, [29] and [18] for discussion.
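As a concrete, deliberately simplified sketch of privacy amplification by random hashing, the following compresses a raw key with a random binary Toeplitz matrix, a standard two-universal family. This is our own illustration: real implementations use much larger blocks and choose the output length s according to the security proof.

```python
import random

def toeplitz_hash(key_bits, n_out, rng):
    """Compress key_bits (length n) to n_out bits with a random binary
    Toeplitz matrix over GF(2), a two-universal hash family."""
    n = len(key_bits)
    # A Toeplitz matrix is fully determined by its first row and first
    # column: n + n_out - 1 random bits in total.
    diag = [rng.randint(0, 1) for _ in range(n + n_out - 1)]
    out = []
    for i in range(n_out):
        # Entry (i, j) of the Toeplitz matrix is diag[i - j + n - 1].
        bit = 0
        for j in range(n):
            bit ^= diag[i - j + n - 1] & key_bits[j]
        out.append(bit)
    return out

rng = random.Random(7)
raw = [rng.randint(0, 1) for _ in range(128)]
final = toeplitz_hash(raw, 64, rng)
print(len(final), final[:8])
```

Because the matrix is chosen at random and announced publicly, Eve's residual information about the shorter output key is exponentially small in the amount of compression, as quantified by the bound of [17] quoted earlier.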
Composability
A key generated in QKD is seldom used in isolation. Indeed, one may concatenate a QKD process many times, using a small part of the key for authentication each time and the remaining key for other purposes such as encryption. It is important to show that using QKD as a subroutine in a complicated cryptographic process does not create new security problems. This issue is called the composability of QKD and, fortunately, has been solved in [30]. Composability of QKD is not only of academic interest. It allows us to refine our definition of security [30,31] and directly impacts the parameters used in the classical post-processing protocol.

Security Proofs of Practical QKD Systems
As will be discussed in Sect. "Experimental Fundamentals", practical QKD systems suffer from real-life imperfections. Proving the security of QKD with practical systems is a hard problem. Fortunately, this has been done with semi-realistic models by Inamori, Lütkenhaus and Mayers [32] and in a more general setting by Gottesman, Lo, Lütkenhaus, and Preskill [33].

Experimental Fundamentals
Quantum cryptography can ensure secure communication between two or more legitimate parties. It is more than a beautiful idea. Conceptually, it is of great importance to the understanding of both information and quantum mechanics. Practically, it can provide an ultimate solution for confidential communications. By implementing a quantum crypto-system in real life, we can test it, analyze it, understand it, verify it, and even try to break it. Experimental QKD has been performed since about 1989, and great progress has been made. Now, one can even buy QKD systems on the market.
A typical QKD set-up includes three standard parts: a source (Alice), a channel, and a detection system (Bob).

A Brief History
The First Experiment. The BB84 protocol [1] looks simple. However, it took another five years before it was first experimentally demonstrated, by Bennett, Bessette, Brassard, Salvail, and Smolin in 1989 [34]. This first demonstration was based on polarization coding. Heavily attenuated laser pulses instead of single photons were used as quantum signals, which were transmitted over 30 cm of open air at a repetition rate of 10 Hz.
From Centimeter to Kilometer. 30 cm is not that appealing for practical communications. This short distance is largely due to the difficulty of optical alignment in free space. Switching the channel from open air to optical fiber is a natural choice. In 1993, Townsend, Rarity, and Tapster demonstrated the feasibility of phase-coding fiber-based QKD over 10 km of telecom fiber [35], and Muller, Breguet, and Gisin demonstrated the feasibility of polarization-coding fiber-based QKD over 1.1 km of telecom fiber [36]. (Also, Jacobs and Franson demonstrated both free-space [37] and fiber-based QKD [38].) Both were feasibility demonstrations in the sense that neither of them applied random basis choice at Bob's side. Townsend's demonstration seemed more promising than Muller's for the following reasons. 1. The polarization dispersion in fibers is highly unpredictable and unstable. Therefore polarization coding requires much more active control in the fiber than phase coding. In fiber-based QKD implementations, phase coding is in general preferred over polarization coding. 2. Townsend et al. used a 1310 nm laser as the source, while Muller et al. used an 800 nm laser. 1310 nm is the second window wavelength of telecom fibers (the third window wavelength is 1550 nm). The absorption coefficient of standard telecom fiber at 1310 nm is 0.35 dB/km, compared to about 3 dB/km at 800 nm.
Therefore the fiber is more transparent at Townsend et al.'s wavelength. P. D. Townsend demonstrated QKD with Bob's random basis selection in 1994 [39]. It used phase coding over 10 km of fiber. The source repetition rate was 105 MHz (which is quite high even by today's standards), but the phase modulation rate was 1.05 MHz. This mismatch raised a question about its security.
Getting Out of the Lab. It is crucial to test QKD techniques in field-deployed fiber. Muller, Zbinden, and Gisin successfully demonstrated the first QKD experiment outside the lab, with polarization coding, in 1995 [40,41]. This demonstration was performed over 23 km of installed optical fiber under Lake Geneva. (Being under water, quantum communication in the optical fiber suffered less noise.) There is less control over field-deployed fiber than over fiber in the lab, so its stabilization becomes challenging. To solve this problem, A. Muller et al. designed the "plug & play" structure in 1997 [42]. A first experiment with this scheme was demonstrated by H. Zbinden et al. in the same year [43]. Stucki, Gisin, Guinnard, Ribordy, and Zbinden later demonstrated a simplified version of the "plug & play" scheme under Lake Geneva over 67 km of telecom fiber in 2002 [44].
With a Coherent Laser Source. The original BB84 proposal [1] required a single photon source. However, most QKD implementations are based on faint lasers owing to the great challenge of building perfect single photon sources. In 2000, the security of coherent-laser-based QKD systems was first analyzed against individual attacks [45]. Finally, the unconditional security of coherent-laser-based QKD systems was proven in 2001 [32] and in a more general setting in 2002 [33]. Gobby, Yuan, and Shields demonstrated an experiment based on [45] in 2005 [46]. (Note that this work was claimed to be unconditionally secure. However, due to the limitation of [45], this is only true against individual attacks rather than the most general attack.) The security analysis in [32,33] severely limits the performance of unconditionally secure QKD systems. Fortunately, since 2003 the decoy state method has been proposed [47,48,49,50,51,52] by Hwang and extensively analyzed by our group at the University of Toronto and by Wang.
The first experimental demonstration of decoy state QKD was reported by us in 2006 [53] over 15 km of telecom fiber and later over 60 km of telecom fiber [54]. Subsequently, decoy state QKD was further demonstrated by several other groups [55,56,57,58,59]. The reader may refer to Subsect. "Decoy State Protocols" for details of decoy state protocols.

Sources
Single Photon Sources are demanded by the original BB84 proposal [1]. As suggested by the name, single photon sources are expected to generate exactly one photon on demand. The bottom line for a single photon source is that no more than one photon can be gen-
erated at one time. It is very hard to build a perfect single photon source (i.e., one with no multi-photon emission). Despite tremendous effort by many groups, perfect single photon sources are still far from practical. Fortunately, the proposal and implementation of decoy state QKD (see Subsect. "Decoy State Protocols") make it unnecessary to use single photon sources in QKD.
Parametric Down-Conversion (PDC) Sources are often used as entanglement sources. The principle is that a high-energy (∼400 nm) photon propagates through a highly non-linear crystal (usually BBO), producing two entangled photons with halved frequency. PDC sources are usually used for entanglement-based QKD systems (e.g. the Ekert protocol [3]). PDC sources are also used as "triggered single photon sources", in which Alice possesses a PDC source and monitors one arm of its output. When Alice sees a detection, she knows that one photon has been emitted from the other arm. An experimental demonstration of QKD with PDC sources is reported in [60].
Attenuated Laser Sources are the most commonly used sources in QKD experiments. They are essentially the same as the laser sources used in classical optical communication, except that heavy attenuation is applied to them. They are simple and reliable, and they can reach gigahertz repetition rates with little difficulty. In a BB84 system or a differential-phase-shift-keying (DPSK) system (to be discussed in Subsect. "Differential-Phase-Shift-Keying (DPSK) Protocols"), the laser source is usually attenuated to below 1 photon per pulse. In a Gaussian-modulated coherent-state (GMCS) system (to be discussed in Subsect. "Gaussian-Modulated Coherent State (GMCS) Protocol"), the laser source is usually attenuated to around 100 photons per pulse. Attenuated laser sources used to be considered non-ideal for BB84 systems, as they always have a non-negligible probability of emitting multi-photon pulses, regardless of how heavily they are attenuated.
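The multi-photon problem can be quantified directly: an attenuated laser pulse has a Poissonian photon-number distribution, so for a typical mean photon number (μ = 0.1 is our illustrative choice) a small but non-negligible fraction of the non-empty pulses contains more than one photon.

```python
from math import exp, factorial

def poisson(n, mu):
    """Photon-number distribution of an attenuated (coherent) laser
    pulse with mean photon number mu: P(n) = e^(-mu) mu^n / n!."""
    return exp(-mu) * mu ** n / factorial(n)

mu = 0.1  # illustrative mean photon number for a faint-laser BB84 source
p_vac = poisson(0, mu)
p_single = poisson(1, mu)
p_multi = 1.0 - p_vac - p_single
print(f"P(0)={p_vac:.4f}  P(1)={p_single:.4f}  P(n>=2)={p_multi:.5f}")
print(f"fraction of non-empty pulses with n>=2: {p_multi / (1 - p_vac):.3f}")
```

For μ = 0.1, about 0.5% of all pulses, i.e. roughly 5% of the non-empty pulses, are multi-photon, which is exactly the vulnerability exploited by photon-number-splitting attacks and addressed by the decoy method.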
However, the discovery and implementation of the decoy method [47,48,49,50,51,52,53,54,55,56,57,58] has made coherent laser sources much more appealing. With the decoy method, it is possible to make a laser-source BB84 system secure without significant performance loss.

Channels
Standard Optical Single-Mode Fiber (SMF) is the most popular choice for now. It can connect two arbitrary points, and can easily be extended to networks.
Moreover, it is deployed in most developed urban areas. SMF has two "window wavelengths": one is 1310 nm and the other is 1550 nm. The absorption at these two wavelengths is particularly low (≈0.35 dB/km at 1310 nm and ≈0.21 dB/km at 1550 nm). Nowadays most fiber-based QKD implementations use 1550 nm photons as information carriers. The main disadvantage of optical fiber is its birefringence: the strong polarization dispersion makes it hard to implement polarization-coding systems. Fiber also has strong spectral dispersion, which heavily affects high-speed (10+ GHz) QKD systems [61], as the pulses are broadened and overlap with each other. Moreover, the loss in fibers (0.21 dB/km at 1550 nm) puts a limit on the longest distance that a fiber-based QKD system can reach.
Free Space is receiving more and more attention recently. It is ideal for polarization coding: there is negligible dispersion of the polarization and the frequency. However, the alignment of optical beams can be challenging over long distances, particularly due to atmospheric turbulence. Notice that open-air QKD requires a direct line of sight between Alice and Bob (unless some form of mirror is used). Buildings and mountains are serious obstacles for open-air QKD systems. The greatest motivation for the open-air QKD scheme is the hope for ground-to-satellite [62] and satellite-to-satellite quantum communication. As there is negligible optical absorption in outer space, we may be able to achieve inter-continental quantum communication with free-space QKD.

Detection Systems
InGaAs-APD Single Photon Detectors are the most popular type of single-photon detector in fiber-based QKD, and they are commercially available. They utilize the avalanche effect of semiconductor diodes. A strong bias voltage is applied to the InGaAs diode. An incident photon triggers the avalanche effect, generating a detectable voltage pulse.
The narrow band gap of InGaAs makes it possible to detect photons at telecom wavelengths (1550 nm or 1310 nm). InGaAs-APD-based single photon detectors have a simple structure and are commercially packaged. They are easy to calibrate and operate, and their reliability is relatively high. They normally work at −50 °C to −110 °C to lower the dark count rate. This temperature can easily be achieved by thermo-electric coolers. The detection efficiency of InGaAs-APD-based single photon detectors is usually ∼10% [63]. In single photon detectors, a key parameter (besides detection efficiency) is the dark count rate. A dark count is an event in which the detector generates a detection click although no actual photon hits it (i.e., a "false alarm"). The dark count rate of InGaAs single photon detectors is relatively high (∼10⁻⁵ per gate; the concept of gating will be introduced below) even when cooled. The after-pulse effect is that the dark count rate of the detector increases for a time period after a successful detection. This effect is serious for InGaAs single photon detectors. Therefore a blank circuit is often introduced to reduce it: the detector is deactivated for a time period, called the "dead time", after a detection event. The dead time should be set long enough that, when the detector is re-activated, the after-pulse effect is negligible. The dead time for InGaAs single photon detectors is typically on the order of microseconds [63]. The long after-pulse effect, together with the large timing jitter, limits InGaAs-APD-based single photon detectors to work no faster than several megahertz. Moreover, the blank circuit reduces their detection efficiency. An additional method to reduce the dark count rate is to apply gating mode, i.e., the detectors are only activated when photons are expected to hit them. Gating mode reduces the dark count rate by several orders of magnitude and is thus used in most InGaAs-APD single photon detectors. However, it may open up a security loophole [64,65]. There is a trade-off between the detection efficiency and the dark count rate: as the bias voltage on an APD increases, both the detection efficiency and the dark count rate increase.
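The interplay between channel loss and dark counts sets a distance limit. Here is a back-of-the-envelope sketch (our own simplified model, using the figures quoted in the text: mean photon number μ = 0.1, detector efficiency 10%, dark count probability 10⁻⁵ per gate, fiber loss 0.21 dB/km): roughly half of the dark-count clicks land in the wrong detector, so their contribution to the quantum bit error rate grows as the signal is attenuated.

```python
def qber_floor(distance_km, mu=0.1, eta_det=0.10, dark=1e-5, alpha=0.21):
    """Rough QBER contribution from dark counts in a faint-laser BB84
    link. Illustrative only: ignores misalignment errors, after-pulsing,
    and multi-photon effects."""
    eta_ch = 10 ** (-alpha * distance_km / 10)  # fiber transmittance
    p_signal = mu * eta_ch * eta_det            # signal click prob. per gate
    p_click = p_signal + dark                   # total click probability
    return 0.5 * dark / p_click                 # half of dark clicks are errors

for d in (50, 100, 150):
    print(f"{d:>3} km: dark-count QBER ~ {qber_floor(d):.3%}")
```

In this toy model the dark-count QBER is well below 1% at 50 km but approaches 30% at 150 km, far above any tolerable error threshold, which is why lower-dark-count detectors directly extend the reachable distance.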
Recently, it has been reported that, by gating an InGaAs detector in a sinusoidal manner, it is possible to reduce the dead time and operate a QKD system at 500 MHz; see [66]. This result seems to be an important development which could make InGaAs detectors competitive with newer single photon detector technologies such as SSPDs (to be introduced below).
Si-APD Single Photon Detectors are ideal for the detection of visible photons (say, 800 nm). They have a negligible dark count rate and can work at room temperature. They are very compact in size. More importantly, they have high detection efficiency (>60%) and can work at gigahertz rates. These detectors are ideal for free-space QKD
systems. However, the band gap of silicon is too large to detect photons at telecom wavelengths (1550 nm or 1310 nm), and the strong attenuation of telecom fiber at visible wavelengths makes it impractical to use visible photons in long-distance fiber-based QKD systems.
Parametric Up-Conversion Single Photon Detectors use Si-APDs to detect telecom-wavelength photons. A periodically poled lithium niobate (PPLN) waveguide and a pump light up-convert the incoming telecom-frequency photons to visible frequencies, and a Si-APD detects these visible photons. The high speed and low timing jitter of Si-APDs make it possible to perform GHz QKD on fiber-based systems with up-conversion single photon detectors [67,68]. The efficiency of up-conversion detectors is similar to that of InGaAs-APD single photon detectors. There is also a trade-off between the detection efficiency and the dark count rate: when the power of the pump light increases, the conversion efficiency increases, improving the detection efficiency; meanwhile, more pump photons and up-converted pump photons (with doubled frequency) pass through the filter and enter the Si-APD, increasing the dark count rate [68].
Transition-Edge Sensor (TES) detectors are based on a critical-state superconductor rather than a semiconductor APD. They use a square thin film of superconductor (typically tungsten) as a "calorimeter" to measure the electron temperature. A bias voltage is applied to the thin film to keep it in the critical state. Once one or more photons are absorbed by the sensor, the electron temperature changes, leading to a change in the current. This current change can be detected by a superconducting quantum-interference device (SQUID) array [69]. TES single photon detectors can achieve very high detection efficiency (up to 89%) at telecom wavelengths. The dark count rate is negligible. Moreover, TES detectors can resolve photon numbers.
This is because the electron temperature change is proportional to the number of photons absorbed. The thermal nature of TES detectors limits their counting rate: once photons are absorbed by the sensor, it takes a few microseconds before the heat dissipates into the substrate. This long relaxation time limits the counting rate of a TES detector to no more than a few megahertz [69], a major drawback of TES. The bandwidth of a TES detector is extremely wide; the detector is sensitive to all wavelengths. Even the black-body radiation from the fiber or the environment
can trigger a detection event, thus increasing the dark count rate. To reduce the dark counts caused by other wavelengths, a spectral filter is necessary. However, this increases the internal loss and thus reduces the detection efficiency. One of the greatest disadvantages of the TES detector is its working temperature of about 100 mK, which requires complicated cooling devices [69].
Superconducting Single Photon Detectors (SSPDs) also use superconductor thin films to detect incoming photons. However, instead of a piece of plain thin film, a pattern of zigzag superconductor (typically NbN) wire is formed. The superconducting wire is set to the critical state by applying the critical current through it. Once a photon hits the wire, it heats a spot on the wire and makes the spot over-critical (i.e., non-superconducting). As the current is unchanged, the current density in the areas around this hot-spot increases, making these areas non-superconducting as well. As a result, a section of the wire becomes non-superconducting, and a voltage spike can be observed as the current is kept constant [61]. An SSPD can achieve a very high (up to 10 GHz) counting rate, because the superconductor wire used in an SSPD can dissipate the heat in tens of picoseconds. It also has a very low dark count rate (around 10 Hz) due to its superconducting nature. SSPDs should in principle be able to resolve the incident photon number; however, photon-number-resolving SSPDs have not been reported yet. The efficiency of an SSPD is lower than that of a TES, because only part (∼50%) of the sensing area is covered by the wire. The fabrication of such a complicated zigzag superconductor wire with smooth edges is also very challenging. The working temperature of an SSPD (3 K) is significantly higher than that of a TES; SSPDs can work in a closed-cycle refrigerator. Moreover, this relatively high working temperature significantly reduces the relaxation time [61].
Homodyne Detectors are used to count the photon number of a very weak pulse (∼100 photons). The principle is to use a very strong pulse (often called the local oscillator) to interfere with the weak pulse, then use two photodiodes to convert the two resulting optical pulses into electrical signals and take the difference between the two electrical signals. Homodyne detectors are in general very efficient, as there is always some detection output for a given input signal. However, the noise of the detectors, as well as of the electronics, is very significant. Moreover,
the two photodiodes in the homodyne detector have to be identical, which is hard to achieve in practice. Homodyne detectors are commonly used in GMCS QKD systems, and they are so far the only choice for such systems. Recently, homodyne detectors have also been used to implement the BB84 protocol.

Truly Quantum Random Number Generators
An important but often under-appreciated requirement for QKD is a high-data-rate, truly quantum random number generator (RNG). An RNG is needed because most QKD protocols (with the exception of a passive choice of bases in an entanglement-based QKD protocol) require Alice to actively choose random bases/signals. Given the high repetition rate of QKD, such an RNG must have a high data rate. To achieve unconditional security, a standard software-based pseudo-random number generator cannot be used because it is actually deterministic. So, a high-data-rate quantum RNG is a natural choice. Incidentally, some firms such as id Quantique do offer commercial quantum RNGs. Unfortunately, it is very hard to generate random numbers by quantum means at high speed. In practice, some imperfections/bias in the numbers generated by a quantum RNG are inevitable. The theoretical foundation of QKD is at risk here because existing security proofs all assume the existence of perfect RNGs and do not apply to imperfect RNGs.

Experimental Implementation of the BB84 Protocol
In this section, we focus mainly on the optical layer and skip several other important layers. In practice, the control/electronics layer is equally important. Moreover, it is extremely challenging to implement the classical post-processing layer in real time if one chooses the block sizes of the codes to be long enough to achieve unconditional security.
Polarization Coding usually uses four laser sources generating the four polarization states of the BB84 protocol [1]. A conceptual schematic is shown in Fig. 3.
Note that, due to the polarization dispersion of a fiber, some compensating components, such as wave plates, are usually needed. The polarization compensation should be implemented dynamically, as the polarization dispersion in the fiber changes frequently. This was solved by introducing an electrical polarization controller in [58].
Quantum Cryptography, Figure 3 Conceptual schematic for polarization-coding BB84 QKD system. LD: Laser Diode; CWP: Compensating Wave Plate; HWP: Half Wave Plate; PBS: Polarizing Beam Splitter; SPD: Single Photon Detector
Quantum Cryptography, Figure 4 Conceptual schematic for double Mach–Zehnder interferometer phase-coding BB84 QKD system. PM: Phase Modulator; BS: Beam Splitter; SPD: Single Photon Detector
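The state-choice and sifting logic that any BB84 transmitter/receiver pair implements, whether polarization- or phase-coded, can be sketched in a few lines. This is a toy simulation of ideal BB84 (no channel noise, no losses, no Eve), not a model of any particular optical set-up:

```python
import random

# BB84: bit 0/1 in the rectilinear (+) basis -> H/V polarization,
# in the diagonal (x) basis -> +45/-45 degrees.
def bb84_round(rng):
    alice_bit = rng.randint(0, 1)
    alice_basis = rng.randint(0, 1)   # 0: rectilinear, 1: diagonal
    bob_basis = rng.randint(0, 1)
    if bob_basis == alice_basis:
        bob_bit = alice_bit           # same basis: deterministic outcome
    else:
        bob_bit = rng.randint(0, 1)   # conjugate basis: random outcome
    return alice_bit, alice_basis, bob_bit, bob_basis

rng = random.Random(42)
rounds = [bb84_round(rng) for _ in range(10000)]
# Sifting: keep only rounds where the bases agree (announced publicly).
sifted = [(a, b) for a, abasis, b, bbasis in rounds if abasis == bbasis]
assert all(a == b for a, b in sifted)   # sifted keys are identical
print(len(sifted) / len(rounds))        # roughly 1/2 of the rounds survive
```

In a real system, the bit error rate of the sifted key (here zero by construction) is what Alice and Bob monitor to detect eavesdropping.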
Phase Coding The original scheme is basically one big interferometer, as shown in Fig. 2. However, it is not practical, as the stability of such a huge interferometer is extremely poor.

Double Mach–Zehnder Interferometer Scheme is an improved version of the original proposal. It has two interferometers, and there is only one channel connecting Alice and Bob (compared with the two channels in the original proposal). A conceptual set-up is shown in Fig. 4. We can see that the two signals travel through the same channel; they propagate through different paths only locally, in the two Mach–Zehnder interferometers. Therefore, only the phase drift of the local interferometers needs to be compensated (the polarization drift in the channel still needs to be compensated as well). This is a great improvement over the original proposal. However, the local compensation has to be implemented in real time, which is quite challenging. An example set-up that implemented real-time compensation of both polarization and phase drift is reported in [70].

Faraday–Michelson Scheme is an improved version of the double Mach–Zehnder interferometer scheme. It still has two interferometers, but each interferometer has only one beam splitter, and the light propagates through the same section of fiber twice due to the Faraday mirror. The schematic is shown in Fig. 5. We can see that the polarization drift is self-compensated. This is a great advance in uni-directional QKD implementation, and it was first proposed and implemented in [71]. Nonetheless, the phase drift of the local interferometers still needs compensation. Due to the fast fluctuation of the phase drift (a drift of 2π usually takes a few seconds), this compensation should be done in real time. A Faraday–Michelson decoy state QKD implementation over 123.6 km has been reported in [59].

Quantum Cryptography, Figure 5 Conceptual schematic for Faraday–Michelson phase-coding BB84 QKD system. FM: Faraday Mirror; PM: Phase Modulator; BS: Beam Splitter; C: Circulator; SPD: Single Photon Detector

Plug & Play Scheme is another improved version of the double Mach–Zehnder interferometer scheme. It has only one Mach–Zehnder interferometer, and the light propagates through the same channel and interferometer twice due to the Faraday mirror on Alice's side. A conceptual set-up is shown in Fig. 6. We can see that both the polarization drift and the phase drift are automatically compensated. A "Plug & Play" scheme based decoy state QKD implementation over 60 km has been reported in [54]. Nonetheless, the bi-directional design brings complications to security, as Eve can make sophisticated operations on the bright pulses sent from Bob to Alice. This is often called the "Trojan horse" attack [72]. Recently, the security of the "Plug & Play" QKD system has been proven in [73].

Quantum Cryptography, Figure 6 Conceptual schematic for "Plug & Play" phase-coding BB84 QKD system. FM: Faraday Mirror; PM: Phase Modulator; BS: Beam Splitter; PBS: Polarizing Beam Splitter; C: Circulator; SPD: Single Photon Detector

Sagnac Loop Scheme Another bi-directional optical layer design is to use a Sagnac loop, where the quantum signal is encoded in the relative phase between the clockwise and counter-clockwise pulses that go through the loop. The typical schematic is shown in Fig. 7. Sagnac loop QKD is simple to set up and can easily be used in a network setting with a loop topology. However, its security analysis is highly non-trivial.

Quantum Cryptography, Figure 7 Conceptual schematic for Sagnac loop phase-coding BB84 QKD system. PM: Phase Modulator; BS: Beam Splitter; C: Circulator; SPD: Single Photon Detector
Other Quantum Key Distribution Protocols Given the popularity of the BB84 protocol, why should people be interested in other protocols? There are at least three answers to this question. First, to better understand the foundations of QKD and its generality, it is useful to have more than one protocol. Second, different QKD protocols may have different advantages and disadvantages, and may require different technologies to implement; having several protocols allows us to compare and contrast them. Third, while it is possible to implement the standard BB84 protocol with attenuated laser pulses, its performance in terms of key generation rate and maximum transmission distance is somewhat limited. Therefore, we have to study other protocols. Since, from a practical standpoint, the third reason is the most important one, we will elaborate on it in the following paragraph. The original BB84 [1] proposal requires a single photon source. However, most QKD implementations are based on faint lasers, due to the great challenge of building perfect single photon sources. Faint laser pulses are weak coherent states whose photon number follows a Poisson distribution. The existence of multi-photon signals opens up new attacks, such as the photon-number-splitting attack. The basic idea of the photon-number-splitting attack is that Eve can introduce a photon-number-dependent transmittance. In other words, she can selectively suppress single-photon signals and transmit multi-photon signals to Bob. Notice that, for each multi-photon signal, Eve can beam-split it and keep one copy for herself, thus allowing her to gain a lot of information about the raw key. The security of coherent-laser-based QKD systems was analyzed first against individual attacks in 2000 [45], and eventually for a general attack in 2001 [32] and 2002 [33]. Unfortunately, unconditionally secure QKD
based on the conventional BB84 protocol [32,33] severely limits the performance of QKD systems. Basically, Alice has to attenuate her source so that the expected number of photons per pulse is of the same order as the transmittance, η, of the channel. As a result, the key generation rate will scale only quadratically with the transmittance of the channel. Some of the protocols discussed in the following subsections may dramatically improve the performance of QKD over the standard BB84 protocol. For instance, the decoy state protocol has been proven to provide a key generation rate that scales linearly with the transmittance of the channel, and it has been successfully implemented in experiments. We conclude with some simple alternative QKD protocols. In 1992, Bennett proposed a protocol (B92) that makes use of only two non-orthogonal states [2]. A six-state QKD protocol was first noted by Bennett and co-workers [74] and some years later by Bruss [75]; it has the advantage of being symmetric. Even QKD protocols with orthogonal states have been proposed [76]. Efficient BB84 and six-state QKD protocols have been proposed and proven to be secure by Lo, Chau and Ardehali [77]. A Singaporean protocol has also been proposed. Recently, Gisin and co-workers proposed a one-way coherent QKD scheme [78].
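The quadratic-versus-linear scaling, and the decoy-state estimation that restores linear scaling, can both be illustrated numerically. The sketch below first computes the multi-photon fraction of a weak coherent source from Poisson statistics, then applies a standard vacuum + weak decoy lower bound on the single-photon yield Y1 under a simple lossy-channel model. All numbers are illustrative assumptions, not experimental values:

```python
import math

def poisson(n, mu):
    """Probability of n photons in a weak coherent pulse of mean mu."""
    return math.exp(-mu) * mu**n / math.factorial(n)

# --- Why plain BB84 with faint lasers scales quadratically ---
eta = 0.01                      # channel transmittance (illustrative)
rate_plain = eta * eta          # mu must be ~eta  =>  rate ~ eta^2
mu = 0.5                        # decoy methods allow mu = O(1)
rate_decoy = eta * mu           # => rate ~ eta (linear in transmittance)
p_multi = 1 - poisson(0, mu) - poisson(1, mu)
p_nonempty = 1 - poisson(0, mu)
assert 0.22 < p_multi / p_nonempty < 0.24   # ~23% of non-empty pulses

# --- Vacuum + weak decoy lower bound on the single-photon yield Y1 ---
eta2, Y0 = 0.1, 1e-5            # transmittance and background yield
nu = 0.05                       # weak decoy mean photon number

def gain(m):
    """Observed gain of a coherent pulse of mean photon number m."""
    return Y0 + 1 - math.exp(-eta2 * m)

Q_mu, Q_nu = gain(mu), gain(nu)
Y1_bound = (mu / (mu * nu - nu**2)) * (
    Q_nu * math.exp(nu)
    - Q_mu * math.exp(mu) * nu**2 / mu**2
    - (mu**2 - nu**2) / mu**2 * Y0
)
Y1_true = Y0 + eta2 - Y0 * eta2   # true single-photon yield in this model
assert 0 < Y1_bound <= Y1_true    # valid and nearly tight (~0.0987 vs ~0.1000)
```

Note that the bound uses only quantities Alice and Bob can observe (the gains and the background yield), which is what lets them certify the single-photon contribution without trusting the channel.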
Entanglement-Based Protocols Proposals In 1991, Ekert proposed the first entanglement-based QKD protocol, commonly called E91 [3]. The basic idea is to test the security of QKD by using the violation of Bell's inequality. Note that one can also implement the BB84 protocol by using an entanglement source. Imagine Eve prepares an entangled state of a pair of qubits and sends one qubit to Alice and the second qubit to Bob. Each of Alice and Bob randomly chooses one of the two conjugate bases to perform a measurement. Implementations The key part of entanglement-based quantum cryptography is to distribute an entangled pair (usually an EPR pair) to two distant parties, Alice and Bob. Polarization entanglement is preferred in QKD, as it is easy to measure the polarization (typically via a polarizing beam splitter). The air has negligible birefringence and is thus an excellent channel for polarization-entanglement QKD. In free-space QKD, atmospheric turbulence may shift the light beam, so collecting the incident photons is challenging; a large-diameter optical telescope is usually needed to increase the collection efficiency.
Quantum Cryptography, Figure 8 Conceptual schematic of an entanglement-based QKD system with the source in the middle between Alice and Bob
Quantum Cryptography, Figure 9 Conceptual schematic of an entanglement-based QKD system with the source at Alice's side. DET: Alice's detection system
A standard approach is to put the entanglement source right in the middle between Alice and Bob, see Fig. 8. Once an entangled pair is generated, the two particles are directed to different destinations. Alice and Bob measure the particles locally and keep the results as the bit values. This approach has potential for ground–satellite intercontinental entanglement distribution, in which the entanglement source is carried by the satellite and the entangled photons are sent to two distant ground stations. A recent source-in-the-middle entanglement-based quantum communication experiment over 13 km is reported in [79]. A simpler version is to place the entanglement source on Alice's side locally, see Fig. 9. Once Alice generates an entangled pair, she keeps one particle and sends the other to Bob. Alice and Bob each measure their particle locally and keep the result as the bit value. This approach is significantly simpler than the above design because only Bob needs the telescope and compensating parts. A recent experiment of source-in-Alice entanglement-based quantum communication over 144 km of open air is reported in [80].

Decoy State Protocols Proposals Recall that BB84 implemented with weak coherent states has a key generation rate that scales only quadratically with the transmittance. The decoy state protocol can dramatically increase the key generation rate so that it scales linearly with the transmittance. In a decoy state protocol, Alice prepares some decoy states in addition to signal states. The decoy states are the same as the
Quantum Cryptography, Figure 10 Conceptual schematic for decoy state BB84 QKD system (double Mach–Zehnder interferometer phase-coding) with amplitude modulator. PM: Phase Modulator; AM: Amplitude Modulator; BS: Beam Splitter; SPD: Single Photon Detector
signal state, except for the expected photon number. For instance, if the signal state has an average photon number of order 1 (e.g. μ = 0.5), the decoy states have average photon numbers μ1, μ2, etc. The decoy state idea was first proposed by Hwang [47], who suggested using a large μ (e.g. μ = 2) as a decoy state. Our group provided a rigorous proof of security for decoy state QKD [48,49]. Our numerical simulations showed clearly that decoy states provide a dramatic improvement over non-decoy protocols. In the limit of infinitely many decoy states, Alice and Bob can effectively limit Eve's attack to a simple beam-splitting attack. Moreover, we proposed practical protocols. Instead of using a large μ as a decoy state, we proposed using small μ's as decoy states [48]. For instance, we proposed using a vacuum state as a decoy state to test the background and a weak decoy state (small μ) to test the single-photon contribution. We and Wang analyzed the performance of practical protocols in detail [50,51,52]. Notice that the decoy state technique is a rather general idea that can be applied to other QKD sources. For instance, decoy state protocols have recently been proposed in [81,82,83] for parametric down-conversion sources. For a comparison of those protocols, see [84].

Implementations The first experimental demonstration of decoy state QKD was reported by our group in 2006, first over 15 km of telecom fiber [53] and later over 60 km of telecom fiber [54]. Subsequently, decoy state QKD was further demonstrated experimentally by several groups worldwide [55,56,57,58]. The implementation of decoy state QKD is straightforward. The key part is to prepare signals with different intensities. A simple solution is to use an amplitude modulator to modulate the intensity of each signal to the desired level, see Fig. 10. Decoy state QKD implementations using an amplitude modulator to prepare the different states are reported in [53,54,55,56,57]. The amplitude modulator has the disadvantage that the preparation of a vacuum state is quite challenging. An alternative solution is to use laser diodes of different intensities to generate the different states, see Fig. 11. This solution requires multiple laser diodes and a high-speed optical switch, and is thus more complicated than the amplitude modulator solution. It is also challenging to guarantee that all the laser diodes are identical, so that Eve cannot tell which source generated a specific pulse. Nonetheless, perfect vacuum states can easily be prepared in this way. Decoy state QKD implementations using multiple laser diodes are reported in [58,59]. Decoy state QKD with a parametric down-conversion source has been experimentally implemented in [85].

Quantum Cryptography, Figure 11 Conceptual schematic for decoy state BB84 QKD system (double Mach–Zehnder interferometer phase-coding) with multiple laser diodes. LDx: Laser Diodes at different intensities; PM: Phase Modulator; BS: Beam Splitter; SPD: Single Photon Detector

Strong Reference Pulse Protocols Proposals The proposal of strong reference pulse QKD dates back to Bennett's 1992 paper [2]. The idea is to add a strong reference pulse in addition to the signal pulse. The quantum state is encoded in the relative phase between the reference pulse and the signal pulse. Bob decodes by splitting off a part of the strong reference pulse and interfering it with the signal pulse. The strong reference pulse implementation can counter the photon-number-splitting attack by Eve because it removes the neutral signal in the QKD system. Recall that in the photon-number-splitting attack, Eve suppresses single-photon signals by sending a vacuum. This works because the vacuum is a neutral signal that leads to no detection. In contrast, in a strong reference pulse implementation of QKD, a vacuum signal is not a neutral signal. Indeed, if Eve replaces the signal pulse by a vacuum and keeps the strong reference pulse unchanged, then the interference experiment by Bob will give a non-zero detection probability and a random outcome of "0" or "1". On the other hand, if Eve removes both the signal and the reference pulses, then Bob may detect Eve's attack by monitoring the intensity of the reference pulse, which is supposed to be strong. In some recent papers, the unconditional security of B92 QKD with a strong reference pulse has been rigorously proven.
However, those proofs require Bob’s system
to have certain properties and do not apply to standard threshold detectors. Implementations The B92 [2] protocol is simpler to implement than the BB84 [1] protocol. However, its security weaknesses limit interest in its implementation. A recent implementation of the B92 protocol over 200 m of fiber is reported in [86].
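The B92 idea of using two non-orthogonal states can be illustrated with ideal single-qubit statistics. In the sketch below, Alice sends |0⟩ for bit 0 and |+⟩ for bit 1 (one common textbook choice, assumed here for illustration), and Bob keeps only measurement outcomes that are impossible for one of the two states. Losses, noise, and the strong reference pulse are all omitted:

```python
import random

def b92_round(rng):
    """B92 with states |0> (bit 0) and |+> (bit 1), ideal statistics.
    Bob measures randomly in Z or X; an outcome that is impossible
    for one of the two states is conclusive."""
    bit = rng.randint(0, 1)
    basis = rng.choice('ZX')
    if bit == 0:                        # state |0>
        outcome = 0 if basis == 'Z' else rng.randint(0, 1)  # X: +/- 50/50
    else:                               # state |+>
        outcome = rng.randint(0, 1) if basis == 'Z' else 0  # X: always '+'
    # Z-outcome 1 rules out |0>  -> conclusive bit 1;
    # X-outcome 1 ('-') rules out |+> -> conclusive bit 0.
    if basis == 'Z' and outcome == 1:
        return bit, 1
    if basis == 'X' and outcome == 1:
        return bit, 0
    return bit, None                    # inconclusive, discarded

rng = random.Random(1)
rounds = [b92_round(rng) for _ in range(8000)]
conclusive = [(sent, got) for sent, got in rounds if got is not None]
assert all(sent == got for sent, got in conclusive)  # error-free in the ideal case
print(len(conclusive) / len(rounds))                 # ~1/4 of rounds are conclusive
```

The low conclusive fraction (about 1/4 here, before any channel loss) is part of why B92's practical performance is limited.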
Quantum Cryptography, Figure 12 Conceptual schematic for Gaussian-modulated Coherent State QKD system. PM: Phase Modulator; AM: Amplitude Modulator; BS: Beam Splitter; PD: Photo Diode; HD: Homodyne Detector (inside dashed box)
Gaussian-Modulated Coherent State (GMCS) Protocol Proposals Instead of using discrete qubit states as in the BB84 protocol, one may also use continuous variables for QKD. Early proposals of continuous-variable QKD use squeezed states, which are experimentally challenging. More recently, Gaussian-modulated coherent states have also been proposed for QKD. Since a laser naturally emits a coherent state, a GMCS QKD protocol is experimentally more feasible than a squeezed state QKD proposal. In GMCS QKD, Alice sends Bob a sequence of coherent state signals. For each signal, Alice draws two random numbers X_A and P_A from a Gaussian distribution with a mean of zero and a variance of V_A N_0, and sends the coherent state |X_A + iP_A⟩ to Bob. Bob randomly chooses to measure either the X quadrature or the P quadrature with a phase modulator and a homodyne detector. After performing his measurement, Bob informs Alice which quadrature he has measured for each pulse, through an authenticated public classical channel. Alice drops the irrelevant data and keeps only the quadrature that Bob has measured. Alice and Bob now share a set of correlated Gaussian variables, which they regard as the raw key. Alice and Bob randomly select a subset of their signals and publicly broadcast their data to evaluate the excess noise and the transmission efficiency of the quantum channel. If the excess noise is higher than some prescribed level, they abort. Otherwise, Alice and Bob perform key generation by some prescribed protocol. An advantage of GMCS QKD is that every signal can be used to generate key, whereas in qubit-based QKD such as the BB84 protocol losses can substantially reduce the key generation rate. Therefore, it is commonly believed that for short-distance (say < 15 km) applications, GMCS QKD may give a higher key generation rate. GMCS QKD has been proven to be secure only against individual attacks.
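The modulation and sifting steps of GMCS QKD can be sketched numerically. The toy model below draws Alice's quadratures, applies a pure-loss channel of transmittance η with unit-variance shot noise at the homodyne detector (an idealization: excess noise and the key-extraction step are omitted), and checks that Bob's data are correlated with the quadrature Alice keeps. All parameters are illustrative assumptions:

```python
import math
import random

rng = random.Random(0)
eta, VA = 0.5, 10.0   # channel transmittance; modulation variance (shot-noise units)
N = 20000

alice_kept, bob_data = [], []
for _ in range(N):
    XA = rng.gauss(0.0, math.sqrt(VA))       # Alice's Gaussian modulation
    PA = rng.gauss(0.0, math.sqrt(VA))
    q = rng.choice([XA, PA])                 # Bob's random quadrature choice
    noise = rng.gauss(0.0, 1.0)              # shot noise (variance N0 = 1)
    bob = math.sqrt(eta) * q + noise         # pure-loss channel + ideal homodyne
    alice_kept.append(q)                     # Alice keeps only the measured quadrature
    bob_data.append(bob)

# Correlation coefficient between Alice's and Bob's raw data:
ma, mb = sum(alice_kept) / N, sum(bob_data) / N
cov = sum((a - ma) * (b - mb) for a, b in zip(alice_kept, bob_data)) / N
va = sum((a - ma) ** 2 for a in alice_kept) / N
vb = sum((b - mb) ** 2 for b in bob_data) / N
r = cov / math.sqrt(va * vb)
# Expected r = sqrt(eta*VA / (eta*VA + 1)) ~ 0.91 for these parameters.
assert r > 0.85
print(r)
```

Every pulse contributes a correlated sample, which illustrates why GMCS QKD is attractive at short distances; the harder part in practice is distilling a secret key from these noisy continuous data.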
The security of GMCS QKD against the most general type of attack – the joint attack – remains an open question. Implementations The GMCS protocol has a significant advantage over the BB84 [1] protocol at short distances. It
was first implemented by F. Grosshans et al. in 2003 [87]. It was shown to work with a channel loss of up to 3.1 dB, which is equivalent to the loss of 15 km of telecom fiber. Nonetheless, the strong spectral and polarization dispersion of telecom fiber makes it challenging to build a fiber-based GMCS system. Lodewyck, Debuisschert, Tualle-Brouri, and Grangier built the first fiber-based GMCS system in 2005 [88], but only over a few meters. This distance was greatly extended to 14 km by Legré, Zbinden, and Gisin in 2006 [89] with the introduction of the "plug & play" design, which, however, raised questions about its security. Uni-directional GMCS QKD was later implemented over 5 km of optical fiber by Qi, Huang, Qian, and Lo in 2007 [90] and over 25 km of optical fiber by J. Lodewyck et al. in 2007 [91]. GMCS QKD requires dual encoding on both the amplitude quadrature and the phase quadrature, and homodyne detection for decoding, see Fig. 12. Its implementation is in general more challenging than that of the BB84 protocol.

Differential-Phase-Shift-Keying (DPSK) Protocols Proposals In the DPSK protocol, a sequence of weak coherent state pulses is sent from Alice to Bob. The key bit is encoded in the relative phase of adjacent pulses; therefore, each pulse belongs to two signals. The DPSK protocol also defeats the photon-number-splitting attack by removing the neutral signal. Eve may attack a finite train of signals by measuring its total photon number and then splitting off one photon whenever the photon number is larger than one. But, since each pulse belongs to two signals, the pulses at the boundary of the train will interfere with the pulses immediately outside the boundary. Therefore, Eve's attack does not allow her to gain full information about all the bit values associated with the train. Moreover, by splitting the signal, Eve has reduced the amplitude of the pulses at the boundary of the train.
Therefore, Bob will detect Eve's presence through higher bit error rates for the bit values formed between the pulses at the boundary of the train and those just outside it.
Quantum Cryptography, Figure 13 Conceptual schematic for differential phase shift keying QKD system. PM: Phase Modulator; BS: Beam Splitter; SPD: Single Photon Detector
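The differential encoding, and the one-pulse-delay interference by which Bob decodes it, can be sketched as follows. This toy model tracks only the pulse phases (steps of 0 or π) and ignores losses, noise, and photon statistics:

```python
import math
import random

rng = random.Random(3)
key_bits = [rng.randint(0, 1) for _ in range(1000)]

# Alice: encode each key bit in the phase DIFFERENCE of adjacent pulses.
phases = [0.0]
for bit in key_bits:
    phases.append(phases[-1] + math.pi * bit)   # bit 1 -> pi phase step

# Bob: a one-pulse-delay interferometer compares adjacent pulses;
# constructive interference -> bit 0, destructive -> bit 1.
decoded = []
for k in range(1, len(phases)):
    diff = (phases[k] - phases[k - 1]) % (2 * math.pi)
    decoded.append(0 if math.isclose(diff, 0.0, abs_tol=1e-9) else 1)

assert decoded == key_bits   # each adjacent pulse pair yields one key bit
```

Because every pulse is shared by two adjacent bits, Eve cannot excise a block of pulses without disturbing the interference at its boundaries, which is the intuition behind DPSK's resistance to photon-number splitting described above.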
While the DPSK protocol is simpler to implement than BB84, a proof of its unconditional security is still missing. Therefore, it is hard to quantify its secure key generation rate and to perform a fair comparison with, for example, the decoy state BB84 protocol. Attacks against DPSK have been studied in, for example, [92,93]. Implementations The DPSK protocol is simpler in hardware design than the BB84 [1] protocol, as it requires only one Mach–Zehnder interferometer, see Fig. 13. It also has potential for high-speed applications. Honjo, Inoue, and Takahashi experimentally demonstrated this protocol with a planar light-wave circuit over 20 km of fiber in 2004 [94]. This distance was soon extended to 105 km in 2005 [95]. In 2007, DPSK set both the longest-distance and the fastest records in QKD implementations: H. Takesue et al. reported an experimental demonstration of DPSK QKD over 200 km of optical fiber at a 10 GHz clock rate [61]. However, since a proof of unconditional security is still missing (see the last two paragraphs), it is unclear whether the existing experiments generate any secure key. Indeed, the attacks described in [92,93] showed that, with only one-way classical post-processing, all existing DPSK experiments are insecure.

Quantum Hacking Since practical QKD systems exist and commercial QKD systems are on the market, it is important to understand how secure they really are. We remark that there is still a big gap between the theory and practice of QKD. Even though the unconditional security of practical QKD systems with semi-realistic models has been proven [32,33], practical QKD systems may still contain fatal security loopholes. From a historical standpoint, Bennett and Brassard mentioned that the first QKD system [34] was unconditionally secure against any eavesdropper who happened to be deaf! This was because the system made different sounds depending on whether the source was sending a "0" or a "1". Just by listening to the sounds, an eavesdropper could learn the value of the final key. This example highlights the existence of side channels in QKD and how easily an eavesdropper might break the security of a QKD system, despite the existence of security proofs. In this section, we will sketch a few cleverly proposed quantum hacking strategies that fall outside standard security proofs, together with their experimental implementations. We will conclude with counter-measures and a future outlook. Notice that we will skip eavesdropping attacks that have already been covered by standard security proofs.

Quantum Cryptography, Figure 14 Conceptual schematic for detection efficiency mismatch in the time domain. X axis: Time; Y axis: Detector Efficiency

Attacks 1) Large Pulse Attack. In a large pulse attack, Eve sends a strong laser pulse into, for example, Alice's laboratory to try to read off Alice's phase modulator setting from a reflected pulse, see [96]. As a result, Eve may learn which BB84 state Alice is sending to Bob. A simple counter-measure to the large pulse attack is to install an isolator in Alice's system. 2) Faked State Attack. Standard InGaAs detectors suffer from detection efficiency mismatch. More concretely, as noted in Subsect. "Detection Systems", InGaAs detectors are often operated in a gated mode. Therefore, the detection efficiency of each detector is time-dependent. Refer to Fig. 14 for a schematic diagram of the detection efficiencies of two detectors (one for "0" and one for "1") as functions of time. At the expected arrival time, the detection efficiencies of the two detectors are similar. However, if the signal is chosen to arrive at some unexpected time, the detection efficiencies of the two detectors may differ greatly. The faked state attack proposed by Makarov and co-workers [97] is an intercept-resend attack. In a faked state attack, for each signal, Eve randomly chooses one of the
two BB84 bases to perform a measurement. Eve then sends Bob the wrong bit in the wrong basis, timed so that the detector for the wrong bit has a low detection efficiency. For instance, if Eve has chosen the rectilinear basis and has found a "0" in the bit value, she then prepares a state "1" in the diagonal basis and sends it to Bob at an arrival time where the efficiency of the detector for "0" is much higher than that of the detector for "1". Now, should Eve have chosen the wrong basis in her measurement, the detection probability at Bob's side is greatly suppressed. For instance, in our example, if the correct basis is the diagonal basis, let us consider what happens when Bob measures the signal in the correct basis. Since the bit value resent by Eve is "1" in the diagonal basis and Bob's detector for "1" has a low detection efficiency, most likely Bob will not detect any signal. On the other hand, should Eve have chosen the correct basis in her measurement, Bob has a significant detection efficiency. For instance, in our example, if the correct basis is in fact the rectilinear basis, let us consider what happens when Bob measures in the correct basis. In this case, the bit "1" in the diagonal basis sent by Eve can be re-written as a superposition of a bit "0" and a bit "1" in the rectilinear basis. Since the detector for "0" has a much higher detection efficiency than the detector for "1", most likely Bob will detect a "0". Since "0" was exactly what was originally sent by Alice, Bob will find a rather low bit error rate, despite Eve's intercept-resend attack. The faked state attack, while conceptually interesting, is hard to implement in practice. This is because it is an intercept-resend attack and as such suffers from the finite detection efficiency of Eve's detectors and requires precise synchronization between Eve's system and Alice's and Bob's systems. For this reason, the faked state attack has never been implemented in practice. 3) Time-shift Attack.
The time-shift attack was proposed by Qi, Fung, Lo, and Ma [64]. It also exploits the detection efficiency mismatch in the time domain, but it is much easier to implement than the faked state attack. As mentioned above, typical InGaAs APD detectors usually operate in a gated mode. That is, if a photon hits the detector at an unexpected time, the two detectors may have substantially different efficiencies. Therefore, Eve can simply shift the arrival time of each signal, creating a large efficiency mismatch between "0"s and "1"s. Let us take a specific example to illustrate this attack: suppose detector 0 has a higher efficiency than detector 1 if the signal arrives earlier than the expected time, and a lower efficiency than detector 1 if the signal arrives later than expected. Eve can simply shift the arrival time of each bit by sending it through a longer or a shorter path. Consider
the case in which Eve sends bit i through the shorter path. In this case bit i will hit the detector earlier than expected, where detector 0 has a much higher efficiency. If Bob reports a detection event for the ith bit, Eve can guess that this bit is a "0" with a high probability of success. Furthermore, Eve can carefully set how many bits are shifted forward and how many are shifted backward so that Bob gets similar counts of "0"s and "1"s. In this way, Bob cannot observe a mismatch between the numbers of "0"s and "1"s. Note that the time-shift attack does not make any measurement on the qubits; quantum information is therefore not destroyed. That is, Eve does not change the polarization, the phase, or the frequency of any bit. This means that the time-shift attack will not, in principle, increase the bit error rate of the system. Moreover, since Eve does not need to make any measurement or state preparation, the time-shift attack is practically feasible even with current technology. The time-shift attack will introduce some loss, as the overall detection efficiency is lower when a photon hits the detector at an unexpected time. Nonetheless, Eve can compensate for this loss by making the channel more transparent. Notice that, since the quantum channel between Alice and Bob may contain many lossy components such as splices and couplers, it may not be too hard for Eve to make the channel more transparent. The time-shift attack was successfully implemented on a commercial QKD system by Zhao, Fung, Qi, Chen, and Lo [65] in 2007. This is the first, and so far the only, experimentally successful demonstration of quantum hacking on a commercial QKD system. It was shown that the system has a non-negligible probability of being vulnerable to the time-shift attack. Quantitative analysis shows that the final key shared by Alice and Bob (after the error correction and privacy amplification of the most general security analysis) has been compromised by Eve.
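The leverage Eve gains from time-dependent efficiencies can be quantified with a toy model. Below, the two detectors have Gaussian efficiency curves offset in time (purely illustrative shapes and parameters, not measured data); for signals shifted to an early arrival time, Eve's success probability when guessing the bit value, conditioned on Bob reporting a detection, follows from Bayes' rule:

```python
import math

def eff(t, center, peak=0.1, width=0.5):
    """Gated-detector efficiency vs arrival time (illustrative Gaussian gate)."""
    return peak * math.exp(-((t - center) ** 2) / (2 * width ** 2))

# Detector 0's gate opens slightly early, detector 1's slightly late.
eta0 = lambda t: eff(t, center=-0.3)
eta1 = lambda t: eff(t, center=+0.3)

t_shift = -1.0                 # Eve shifts pulses to arrive early
e0, e1 = eta0(t_shift), eta1(t_shift)
# Bits "0" and "1" are a priori equally likely, so given a detection event:
p_guess_zero = e0 / (e0 + e1)  # Eve's success probability guessing "0"
assert p_guess_zero > 0.9      # strong bias at the shifted arrival time

t_expected = 0.0
balanced = eta0(t_expected) / (eta0(t_expected) + eta1(t_expected))
assert abs(balanced - 0.5) < 1e-9   # no bias at the expected arrival time
print(p_guess_zero)
```

Even a modest temporal offset between the two detector gates thus translates into a large information leak at shifted arrival times, while the statistics at the nominal arrival time look perfectly balanced.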
The success of the time-shift attack in [65] is rather surprising, as QKD has been widely believed to be unconditionally secure. The experimental success in quantum hacking highlighted a limit of the whole research program of device-independent security proofs [23], by showing that device-independent security proofs, even if they are found to exist in the future, do not apply to a practical QKD system. The success of the time-shift attack is not due to some technical imperfection; it is deeply connected with the detection efficiency loophole in the verification of Bell inequalities. So far, InGaAs detectors have only about 10% detection efficiency, and the channel connecting Alice and Bob usually has quite a large attenuation for long-distance communication. The low overall detection efficiency invalidates a device-independent security proof.
Notice that even non-gated detectors have dead times and generally suffer from the detection efficiency loophole. Detection efficiency mismatch is also discussed in [98]. 4) Phase Remapping Attack. In a bi-directional implementation of QKD such as the "Plug & Play" set-up, Eve may attempt to tamper with Alice's preparation process so that Alice prepares four wrong states instead of the four standard BB84 states. This is called the phase remapping attack and was proposed in [99]. In a "Plug & Play" QKD system, Alice receives a strong pulse from Bob, attenuates it to the single-photon level, and encodes one of the BB84 states on it. For instance, Alice may encode her state by using a phase modulator. In a phase modulator, the encoded phase is proportional to the voltage applied. In practice, a phase modulator has a finite rise time. For each BB84 setting, one may thus model the applied voltage (and thus the encoded phase) as a trapezium. Ideally, Bob's strong pulse should arrive in the plateau region of the phase modulation, thus getting the maximal phase modulation by Alice's phase modulator. Now, imagine that Eve applies a time shift to Bob's strong pulse so that it arrives in the rise region of the phase modulation graph instead of the plateau region. In this case, Alice has encoded her phase only partially. If we assume that the four settings of Alice's encoding (for the four BB84 states) have the same rise region, then by time-shifting Bob's strong pulse, Eve can force Alice to prepare the four states with phases 0, a, 2a, and 3a (for some a < π/2), rather than 0, π/2, π, and 3π/2. In general, these four states are more distinguishable than the standard BB84 states. Therefore, Eve may subsequently apply an intercept-resend attack to the signal sent out by Alice. It was proven in [99] that in principle Eve can break the security of the QKD system without alerting Alice and Bob. 5) Attack by Passive Listening to Side Channels.
The attack by listening to the sounds made by the source in the first QKD experiment is an example of an attack by passive listening to side channels. Another example is [100]. A counter-measure is to carefully locate all possible side channels and to eliminate them one by one. 6) Saturation Attack. In a recent preprint [101], Makarov studied experimentally how Eve can blind Bob's InGaAs detectors by sending in a moderately bright pulse. A simple counter-measure would be for Bob to measure the intensity of the incoming signal. 7) High Power Damage Attack. In Makarov's thesis, it was proposed that Eve may try to make controlled changes to Alice's and Bob's systems by inflicting high-power laser damage, that is, by sending in a very strong laser pulse. Again, a simple counter-measure would be for Alice and Bob to
measure the intensity of the incoming signals and to monitor the properties of various components from time to time to ensure that they perform properly.

Counter-Measures

Once an attack is known, there are often simple counter-measures. For instance, for the large pulse attack, a simple counter-measure would be to add a circulator in Alice's laboratory. As for the faked state attack and the time-shift attack, a simple counter-measure would be for Bob to use a four-state setting in his phase modulator. Other counter-measures include Bob applying a random time shift to his received signals. However, the most dangerous attacks are the unanticipated ones. Notice that it is not enough to say that a counter-measure to an attack exists. It is necessary to actually implement a counter-measure experimentally in order to see how effective and convenient it really is. This will allow Alice and Bob to select a useful counter-measure. Moreover, notice that the implementation of a counter-measure may itself open up new loopholes. For instance, if Bob implements a four-state setting as a counter-measure to the time-shift attack, Eve may still combine a large pulse attack with the time-shift attack to break a QKD system.

Importance of Quantum Hacking

As noted in Sect. "Security Proofs", there has been a lot of theoretical interest in the connection between the security of QKD and fundamental physical principles such as the violation of Bell's inequality. An ultimate goal of such investigations, which has not been realized yet, is to construct a device-independent security proof [23]. Even if such a goal is achieved in the future, would any of these theoretical security proofs apply to a quantum key distribution system in practice? Unfortunately, the answer is no. As is well known, the experimental testing of Bell inequalities often suffers from the detection efficiency loophole [23].
The low detection efficiency of practical detectors not only nullifies security proofs based on Bell-inequality violation, but also gives an eavesdropper a powerful handle to break the security of a practical QKD system. Therefore, the detection efficiency loophole is of both theoretical and practical interest. A practical QKD system often consists of two or more detectors. In practice, it is very hard to construct detectors of identical characteristics. As a result, two detectors can generally exhibit different detector efficiencies as functions of either one or a combination of variables in the time, frequency, polarization or spatial domains. Now, if an eavesdropper could manipulate a signal in these variables, then
she could effectively exploit the detection efficiency loophole to break the security of a QKD system. In fact, she could even violate a Bell inequality with only a classical source. In the time-shift attack, one considers an eavesdropper's manipulation of the time variable; the detection efficiency loophole and detector efficiency mismatch, however, are more general. We should remark that, for eavesdropping attacks, the sky is the limit. The more imaginative one is, the more new attacks one comes up with. Indeed, what has been done so far is just scratching the surface of the subject. Much more work needs to be done on the battle-testing of QKD systems and on security proofs with testable assumptions. See Sect. "Future Directions".

Beyond Quantum Key Distribution

Besides QKD, many other applications of quantum cryptography have been proposed. Consider, for instance, the millionaires' problem. Two millionaires, Alice and Bob, would like to determine who is richer without disclosing to each other the actual amount of money each has. More generally, in a secure two-party computation, two distant parties, Alice and Bob, with private inputs x and y respectively, would like to compute a prescribed function f(x, y) in such a way that at the end they learn the outcome f(x, y), but nothing about the other party's input other than what can logically be deduced from the value of f(x, y) and their own input. There are many possible functions f(x, y). Instead of implementing them one by one, it is useful to construct some cryptographic primitives which, if available, can be used to implement the secure computation of any function f(x, y). In classical cryptography, implementing secure two-party computation of a general function requires additional assumptions such as a trusted third party or computational assumptions. The question is whether unconditionally secure two-party computation can be achieved with quantum mechanics.
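The goal of secure two-party computation can be phrased as an "ideal functionality": a trusted third party receives both private inputs and announces only f(x, y). A toy sketch (names and numbers are ours) of this ideal for the millionaires' problem, where f(x, y) = [x > y] — the classical and quantum protocols discussed below aim to emulate this without the trusted party:

```python
# Ideal functionality for secure two-party computation: a trusted third
# party computes f on the private inputs and publishes only the result.
# Real protocols must achieve the same leakage profile without the
# trusted party; this sketch only illustrates what "secure" means here.
def trusted_third_party(f, alice_input, bob_input):
    return f(alice_input, bob_input)  # only this one value is revealed

# Millionaires' problem: f(x, y) = [x > y].
alice_is_richer = trusted_third_party(lambda x, y: x > y,
                                      7_000_000,    # Alice's fortune (private)
                                      5_500_000)    # Bob's fortune (private)
print("Alice is richer:", alice_is_richer)
```

Both parties learn the single bit `alice_is_richer` and nothing more, beyond what that bit logically implies about the other input.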
Two important cryptographic primitives are quantum bit commitment (QBC) and one-out-of-two quantum oblivious transfer (QOT). In particular, it was shown by Kilian [102] that in classical cryptography, oblivious transfer can be used to implement a general two-party secure computation of any function f(x, y). Moreover, in quantum cryptography, it was proven by Yao [103] that a secure QBC scheme can be used to implement QOT securely. In the early 1990s, there was high hope that QBC and QOT could be done with unconditional security. In fact, in a paper [104] it was claimed that QBC can be made unconditionally secure. The sky fell around 1996 when Mayers [105] and subsequently Lo and
Chau [106] proved that, contrary to widespread belief at that time, unconditionally secure QBC is, in fact, impossible. Subsequently, Lo [107] proved explicitly that unconditionally secure one-out-of-two QOT is also impossible. The Mayers and Lo-Chau result was a big step backwards and thus a big disappointment for quantum security. After the fall of QBC and QOT, people turned their attention to quantum coin tossing (QCT). Suppose Alice and Bob are having a divorce and would like to determine by a coin toss who is going to keep their kid. They do not trust each other. However, they live far away from each other and have to do the coin toss remotely. How can they do so without trusting each other? Classically, coin tossing requires either a trusted third party or computational assumptions. As shown by Lo and Chau, ideal quantum coin tossing is impossible [108]. Even for the non-ideal case, Kitaev has proven that a strong version of QCT (called strong QCT) cannot be unconditionally secure. However, despite numerous papers on the subject (see, for example, [109] and references therein), whether non-ideal weak QCT is possible remains an open question. Other quantum cryptographic protocols are also of interest. For instance, the sharing of quantum secrets has been proposed in [110]. It is an important primitive for building other protocols such as secure multi-party quantum computation [111]. There are also protocols for quantum digital signatures [112], quantum fingerprinting and unclonable encryption. Incidentally, quantum mechanics can also be used for the quantum sharing of classical secrets [113], conference key agreement and third-man cryptography. For QKD, so far we have only discussed a point-to-point configuration. In real-life applications, it will be interesting to study QKD in a network setting [114]. Note that the multiplexing of several QKD channels in a single fiber has been successfully performed.
So has the multiplexing of a classical channel together with a QKD channel. However, much work remains to be done on the design of both the key management structure and the optical layer of a QKD network.

Future Directions

The subject of quantum cryptography is still in a state of flux. We will conclude with a few examples of future directions.

Quantum Repeaters

Losses in quantum channels greatly limit the distance and key generation rate of QKD. To achieve secure QKD over long distances without trusting the intermediate nodes, it is highly desirable to have quantum repeaters. Briefly
stated, quantum repeaters are primitive quantum computers that can perform some form of quantum error correction, thus preserving the quantum signals used in QKD. In more detail, quantum repeaters often rely on the concept of entanglement distillation, in which, given a large number M of noisy entangled states, two parties, Alice and Bob, perform local operations and classical communication to distill out a smaller number (say N) of less noisy entangled states. The experimental development of a quantum repeater will probably involve the development of quantum memories together with the interface between flying qubits and qubits in a quantum memory.
Multi-party Quantum Key Distribution and Entanglement

Besides its technological interest, QKD is of fundamental interest because it is deeply related to the theory of entanglement, which is the essence of quantum mechanics. So far there have been only limited studies on multi-party QKD. Notice that there are many deep unresolved problems in multi-party entanglement. It would be interesting to study multi-party QKD more deeply and to understand better its connection to multi-party entanglement. Hopefully, this will shed some light on the mysterious nature of multi-party entanglement.
Ground-to-Satellite QKD

Another method to extend the distance of QKD is to perform QKD between a satellite and a ground station. If one trusts the satellite, one can even build a global QKD network via a satellite relay. Basically, the satellite first performs QKD with Alice when it has a line of sight to Alice. Afterwards, it moves along its orbit until it has a line of sight to Bob. Then, the satellite performs a separate QKD session with Bob. By broadcasting the XOR of the two keys, Alice and Bob come to share the same key. Satellite-to-ground QKD appears to be feasible with current or near-future technology; for a discussion, see, for example, [62]. With an untrusted satellite, one can still achieve secure QKD between two ground stations by putting an entangled source at the satellite and sending one half of each entangled pair to each of Alice and Bob.

Calculation of the Quantum Key Capacity

Given a specific theoretical model, it is so far not known how to calculate the actual secure key generation rate in a noisy channel. All that is known is how to calculate some upper and lower bounds. This is a highly unsatisfactory situation because we do not really know the actual fundamental limit of the system. Our ignorance can be highlighted by a simple open question: what is the highest tolerable bit error rate of BB84 that still allows the generation of a secure key? While lower bounds are known [25,26,27] and 25 percent is an upper bound set by a simple intercept-resend attack, we do not know the answer to this simple question. Notice that this question is of both fundamental and practical interest. Without knowing the fundamental limit of the key generation rate, we do not know what the most efficient procedure for generating a key in a practical setting is.
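The trusted-satellite relay described above hinges on one purely classical step: broadcasting the XOR of the two session keys. A minimal sketch (key length and variable names are ours):

```python
import secrets

# Trusted-relay key forwarding: the satellite holds key k_a with Alice
# (from the first QKD session) and an independent key k_b with Bob (from
# the second). Publicly broadcasting k_a XOR k_b lets Bob recover k_a,
# while the broadcast alone reveals nothing to a party holding neither key.
n = 16  # key length in bytes, arbitrary for this sketch
k_a = secrets.token_bytes(n)  # key from the satellite-Alice session
k_b = secrets.token_bytes(n)  # key from the satellite-Bob session

broadcast = bytes(a ^ b for a, b in zip(k_a, k_b))      # public announcement
bob_key = bytes(x ^ b for x, b in zip(broadcast, k_b))  # Bob's reconstruction

assert bob_key == k_a  # Alice and Bob now share k_a
```

Security rests entirely on trusting the satellite, which knows both keys; with an untrusted satellite one must instead use the entangled-source configuration mentioned above.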
Security Proofs with Testable Assumptions

The surprising success of quantum hacking highlights the big gap between the theory and practice of QKD. In our opinion, it is important to work on security proofs with testable assumptions. Every assumption in a security proof should be written down and experimentally verified. This is a long-term research program.

Battle-Testing QKD Systems

Only through battle-testing can we gain confidence in the security of a real-life QKD system. Traditionally, breaking a cryptographic system is as important as building one. Therefore, we need to re-double our efforts on the study of eavesdropping attacks and their counter-measures. As stated before, quantum cryptography enjoys forward security. Thanks to the quantum no-cloning theorem, an eavesdropper Eve does not have a transcript of all the quantum signals sent by Alice to Bob. Therefore, once a QKD process has been performed, the information is gone and it is too late for Eve to go back and eavesdrop. Therefore, for Eve to break a real-life QKD system today, it is imperative for Eve to invest in technologies for eavesdropping now, rather than in the future.

Acknowledgments

We thank various funding agencies including NSERC, the CRC program, QuantumWorks, CIFAR, MITACS, CIPI, PREA, CFI, and OIT for their financial support.

Bibliography

Primary Literature

1. Bennett CH, Brassard G (1984) In: Proceedings of IEEE International Conference on Computers, Systems, and Signal Processing. IEEE, New York, pp 175–179
2. Bennett CH (1992) Phys Rev Lett 68:3121
3. Ekert AK (1991) Phys Rev Lett 67:661
4. Wiesner S (1983) Sigact News 15:78
5. Vernam G (1926) J Am Inst Electr Eng 45:109
6. Shor PW (1997) SIAM J Sci Statist Comput 26:1484
7. Brassard G, Crépeau C (1996) ACM SIGACT News 27:13
8. Mayers D (2001) J ACM 48:351
9. Lo H-K, Chau HF (1999) Science 283:2050
10. Bennett CH, DiVincenzo DP, Smolin JA, Wootters WK (1996) Phys Rev A 54:3824
11. Deutsch D, Ekert A, Jozsa R, Macchiavello C, Popescu S, Sanpera A (1996) Phys Rev Lett 77:2818
12. Shor P, Preskill J (2000) Phys Rev Lett 85:441
13. Biham E, Boyer M, Boykin PO, Mor T, Roychowdhury V (2000) In: STOC '00: Proceedings of the thirty-second annual ACM symposium on Theory of computing. ACM Press, New York, pp 715–724
14. Ben-Or M (2002) http://www.msri.org/publications/ln/msri/2002/qip/ben-or/1/index.html. Accessed 03 Nov 2008
15. Given a density matrix ρ, define the von Neumann entropy of ρ as S(ρ) = −∑_i λ_i log₂ λ_i = −tr(ρ log₂ ρ), where the λ_i are the eigenvalues of the density matrix
16. http://en.wikipedia.org/wiki/Holevo's_theorem. Accessed 03 Nov 2008
17. Renner R, Koenig R (2005) In: TCC 2005. LNCS, vol 3378. Springer, Berlin, (eprint) quant-ph/0403133
18. Renner R (2005) (eprint) quant-ph/0512258
19. Renner R (2007) Nat Phys 3:645
20. Horodecki K, Horodecki M, Horodecki P, Oppenheim J (2005) Phys Rev Lett 94:160502
21. Horodecki K, Horodecki M, Horodecki P, Leung D, Oppenheim J (2008) IEEE Trans Inf Theory 54(6):2604–2620
22. Koashi M (2007) arXiv:0704.3661
23. Acin A, Brunner N, Gisin N, Massar S, Pironio S, Scarani V (2007) Phys Rev Lett 98:230501
24. Masanes L, Winter A (2006) (eprint) quant-ph/0606049. Accessed 03 Nov 2008
25. Gottesman D, Lo H-K (2003) IEEE Trans Inf Theory 49:457
26. Renner R, Gisin N, Kraus B (2005) Phys Rev A 72:012332
27. Chau HF (2002) Phys Rev A 66:060302
28. Brassard G, Salvail L (1994) Lecture Notes in Computer Science, vol 765. Springer, pp 410–423
29. Bennett CH, Brassard G, Crepeau C, Maurer UM (1995) IEEE Trans Info Theory 41:1915
30. Ben-Or M, Horodecki M, Leung DW, Mayers D, Oppenheim J (2005) In: Kilian J (ed) Theory of Cryptography: Second Theory of Cryptography Conference. TCC 2005. Lecture Notes in Computer Science, vol 3378. Springer, Berlin, pp 386–406
31. Koenig R, Renner R, Bariska A, Maurer U (2007) Phys Rev Lett 98:140502
32. Inamori H, Lütkenhaus N, Mayers D (2007) Eur Phys J D 41:599
33. Gottesman D, Lo H-K, Lütkenhaus N, Preskill J (2004) Quant Info Compu 4:325
34. Bennett CH, Bessette F, Brassard G, Salvail L, Smolin J (1992) J Cryptol 5:3
35. Townsend PD, Rarity JG, Tapster PR (1993) Electron Lett 29:634
36. Muller A, Breguet J, Gisin N (1993) Europhys Lett 23:383
37. Jacobs BC, Franson JD (1996) Opt Lett 21:1854
38. Franson JD, Ilves H (1994) Appl Opt 33:2949
39. Townsend PD (1994) Electron Lett 30:809
40. Muller A, Zbinden H, Gisin N (1995) Nature 378:449 41. Muller A, Zbinden H, Gisin N (1996) Europhys Lett 33:335 42. Muller A, Herzog T, Hutter B, Tittel W, Zbinden H, Gisin N (1997) Appl Phys Lett 70:793 43. Zbinden H, Gautier J-D, Gisin N, Hutter B, Muller A, Tittel W (1997) Electron Lett 33:586 44. Stucki D, Gisin N, Guinnard O, Robordy G, Zbinden H (2002) New J Phys 4:41 45. Lütkenhaus N (2000) Phys Rev A 61:052304 46. Gobby C, Yuan ZL, Shields AJ (2004) Electron Lett 40:1603 47. Hwang WY (2003) Phys Rev Lett 91:057901 48. Lo H-K (2004) In: Proceedings of IEEE International Symposium on Information Theory IEEE, New York, p 137 49. Lo H-K, Ma X, Chen K (2005) Phys Rev Lett 94:230504 50. Ma X, Qi B, Zhao Y, Lo H-K (2005) Phys Rev A 72:012326 51. Wang X-B (2005) Phys Rev Lett 94:230503 52. Wang X-B (2005) Phys Rev A 72:012322 53. Zhao Y, Qi B, Ma X, Lo H-K, Qian L (2006a) Phys Rev Lett 96:070502 54. Zhao Y, Qi B, Ma X, Lo H-K, Qian L (2006b) In: Proceedings of IEEE International Symposium of Information Theory, IEEE, New York, pp 2094–2098 55. Schmitt-Manderbach T et al (2007) Phys Rev Lett 98:010504 56. Rosenberg D et al (2007) Phys Rev Lett 98:010503 57. Yuan ZL, Sharpe AW, Shields AJ (2007) Appl Phys Lett 90:011118 58. Peng C-Z et al (2007) Phys Rev Lett 98:010505 59. Yin Z-Q, Han Z-F, Chen W, Xu F-X, Wu Q-L, Guo G-C (2008) Chin Phys Lett 25:3547 60. Wang Q, Karlsson A (2007) Phys Rev A 76:014309 61. Takesue H et al (2007) Nat Photonics 1:343 62. Villoresi P et al (2004) arXiv:quant-ph/0408067v1 63. Stucki D, Ribordy G, Stefanov A, Zbinden H, Rarity JG, Wall T (2001) J Mod Opt 48:1967 64. Qi B, Fung C-HF, Lo H-K, Ma X (2007a) Quant Info Compu 7:73 65. Zhao Y, Fung C-HF, Qi B, Chen C, Lo H-K (2008) Phys Rev A 78:042333 66. Namekata N, Fujii G, Inoue S (2007) Appl Phys Lett 91:011112 67. Thew RT et al (2006) New J of Phys 8:32 68. Diamanti E, Takesue H, Langrock C, Fejer MM, Yamamoto Y (2006) Opt Express 14:13073, (eprint) quant-ph/0608110 69. 
Rosenberg D et al (2006) Appl Phys Lett 88:021108 70. Dynes JF, Yuan ZL, Sharpe AW, Shields AJ (2007) Opt Express 15:8465 71. Mo XF, Zhu B, Han ZF, Gui YZ, Guo GC (2005) Opt Lett 30:2632 72. Gisin N, Fasel S, Kraus B, Zbinden H, Ribordy G (2006) Phys Rev A 73:022320 73. Zhao Y, Qi B, Lo H-K (2008) Phys Rev A 77:052327 74. Bennett CH, Brassard G, Breidbart S, Wiesner S (1984) IBM Technical Disclosure Bulletin 26:4363 75. Bruss D (1998) Phys Rev Lett 81:3018 76. Goldenberg L, Vaidman L (1995) Phys Rev Lett 75:1239 77. Lo H-K, Chau HF, Ardehali M (2005) J Cryptology 18:133 78. Gisin N, Ribordy G, Zbinden H, Stucki D, Brunner N, Scarani V (2004) arXiv:quant-ph/0411022v1 79. Peng C-Z et al (2005) Phys Rev Lett 94:150501 80. Ursin R et al (2007) Nat Phys 3:481 81. Mauerer W, Silberhorn C (2007) Phys Rev A 75:050305 82. Wang Q, Wang X-B, Guo G-C (2007) Phys Rev A 75:012312 83. Adachi Y, Yamamoto T, Koashi M, Imoto N (2007) Phys Rev Lett 99:180503
84. Ma X, Lo H-K (2008) New J Phys 10:073018
85. Wang Q et al (2008) Phys Rev Lett 100:090501
86. Mendonça F, de Brito DB, Silva JBR, Thé GAP, Ramos RV (2008) Microw Opt Tech Lett 50:236
87. Grosshans F, Van Assche G, Wenger J, Brouri R, Cerf NJ, Grangier P (2003) Nature 421:238
88. Lodewyck J, Debuisschert T, Tualle-Brouri R, Grangier P (2005) Phys Rev A 72:050303
89. Legré M, Zbinden H, Gisin N (2006) Quant Info Compu 6:326
90. Qi B, Huang L-L, Qian L, Lo H-K (2007) Phys Rev A 76:052323. arXiv:0709.3666
91. Lodewyck J et al (2007) Phys Rev A 76:042305
92. Curty M, Zhang LLX, Lo H-K, Lütkenhaus N (2007) Quant Info Compu 7:665
93. Tsurumaru T (2007) Phys Rev A 75:062319
94. Honjo T, Inoue K, Takahashi H (2004) Opt Lett 29:2797
95. Takesue H et al (2005) New J Phys 7:232
96. Gisin N, Fasel S, Kraus B, Zbinden H, Ribordy G (2006) Phys Rev A 73:022320
97. Makarov V, Anisimov A, Skaar J (2006) Phys Rev A 74:022313
98. Larsson J-Å (2002) Quant Info Compu 2:434
99. Fung C-HF, Qi B, Tamaki K, Lo H-K (2007) Phys Rev A 75:032314
100. Lamas-Linares A, Kurtsiefer C (2007) Opt Express 15:9388
101. Makarov V (2007) arXiv:0707.3987v1
102. Kilian J (1988) In: Proceedings of the 20th Annual ACM Symposium on Theory of Computing. ACM, New York, pp 20–31
103. Yao AC-C (1995) In: Proceedings of the 26th Annual ACM Symposium on the Theory of Computing. ACM, New York, p 67
104. Brassard G, Crépeau C, Jozsa R, Langlois D (1993) In: Proceedings of the 34th Annual IEEE Symposium on the Foundations of Computer Science. IEEE, New York, p 362
105. Mayers D (1997) Phys Rev Lett 78:3414
106. Lo H-K, Chau HF (1997) Phys Rev Lett 78:3410
107. Lo H-K (1997) Phys Rev A 56:1154
108. Lo H-K, Chau HF (1997) Physica D 120:177
109. Mochon C (2005) Phys Rev A 72:022341
110. Cleve R, Gottesman D, Lo H-K (1999) Phys Rev Lett 83:648
111. Ben-Or M, Crépeau C, Gottesman D, Hassidim A, Smith A (2006) In: Proceedings of 47th Annual IEEE Symposium on the Foundations of Computer Science (FOCS'06). IEEE, New York, pp 249–260
112. Gottesman D, Chuang I (2001) arXiv:quant-ph/0105032v2
113. Hillery M, Bužek V, Berthiaume A (1999) Phys Rev A 59:1829
114. Alleaume R et al (2007) quant-ph/0701168. Accessed 03 Nov 2008
Books and Reviews

Brassard G (1994) A Bibliography of Quantum Cryptography. http://www.cs.mcgill.ca/~crepeau/CRYPTO/Biblio-QC.html. Accessed 03 Nov 2008
Gisin N, Thew R (2007) Quantum Communication. Nat Photon 1(3):165–171. Available online at http://arxiv.org/abs/quant-ph/0703255. Accessed 03 Nov 2008
Gisin N, Ribordy G, Tittel W, Zbinden H (2002) Quantum Cryptography. Rev Mod Phys 74:145–195. Available online at http://arxiv.org/abs/quant-ph/0101098. Accessed 03 Nov 2008
Gottesman D, Lo H-K (2000) From Quantum Cheating to Quantum Security. Physics Today, Nov 2000, p 22
Lo H-K, Lütkenhaus N (2007) Quantum Cryptography: from theory to practice. Invited paper for Physics in Canada, Sept–Dec 2007. Available online at http://arxiv.org/abs/quant-ph/0702202. Accessed 03 Nov 2008
Scarani V, Bechmann-Pasquinucci H, Cerf NJ, Dusek M, Lütkenhaus N, Peev M (2008) The security of practical quantum key distribution. arXiv:0802.4155v2
Wikipedia (2008) Quantum Cryptography. http://en.wikipedia.org/wiki/Quantum_cryptography. Accessed 03 Nov 2008
Quantum Error Correction and Fault Tolerant Quantum Computing

MARKUS GRASSL1, MARTIN RÖTTELER2
1 Institute for Quantum Optics and Quantum Information, Austrian Academy of Sciences, Innsbruck, Austria
2 NEC Laboratories America, Inc., Princeton, USA
Article Outline

Glossary
Definition of the Subject
Introduction
Basics of Classical Error Correction
Basic Ideas of Quantum Error Correction
A Simple Example and Shor's Nine-Qubit Code
Conditions for Quantum Error Correction
Quantum Codes from Classical Codes
Techniques for Fault-Tolerant Quantum Computing
The Threshold Theorem
Further Aspects
Bibliography

Glossary

Ancilla qubits Refers to qubits that are used to facilitate quantum computations, but serve neither as input nor as output of the computation. The two main uses of ancilla qubits are a) as a scratch pad during a quantum computation, i.e., states that are initialized in a known state and can be used during a computation such that they are returned to the initial state at the end, and b) for quantum error correction, where they serve as space for holding the error syndrome.
Bit-flip error An error that affects a single qubit and interchanges the basis states of that qubit.
CSS code A special class of quantum codes named after their inventors Calderbank, Steane, and Shor. CSS codes allow one to construct quantum codes from certain classical error-correcting codes.
CNOT gate The controlled-NOT gate is an example of a quantum gate. It is a two-qubit gate and, together with single-qubit gates, can be shown to be universal for quantum computation.
Error syndrome A classical bit-vector that describes a quantum error that has affected a quantum code.
Contrary to the classical case, where there is a one-to-one correspondence between syndromes and errors, in the quantum case a syndrome does in general not uniquely identify a quantum error.
Fault-tolerant quantum computing The discipline that studies how to perform reliable quantum computations with imperfect hardware.
Phase-flip error An error that affects a single qubit and gives a relative phase of −1 to the basis states of that qubit. Contrary to the bit-flip error, this error has no classical analog.
Quantum channel A term describing a general physical operation that can affect the state of a quantum-mechanical system. Another name for quantum channels is completely positive trace-preserving maps.
Quantum error-correcting code (QECC) A method to introduce redundancy to a quantum-mechanical system in such a way that certain errors affecting the system can be detected or corrected.
Quantum circuits Operations that a quantum computer can carry out. Quantum circuits are composed of quantum gates and, when acting on an n-qubit system, correspond to unitary matrices of size 2^n × 2^n.
Quantum gate A basic operation that a quantum computer can carry out. Common choices for quantum gates are certain unitary matrices of size 2 × 2 (single-qubit gates) or 4 × 4 (two-qubit gates).
Qubit Shortened form of quantum bit. A physical object that can support states in a two-dimensional complex Hilbert space. Contrary to the classical case, where a bit refers to both the physical object and the states it can hold, a qubit refers to the physical object only. Qubits are the basic units of a quantum computer's memory.
Stabilizer code A class of quantum codes that is defined as the joint eigenspace of a group of commuting operators. Stabilizer codes give rise to many examples of quantum codes, comprise the class of CSS codes as a special case, and are equivalent to additive codes over GF(4) that are self-orthogonal with respect to the Hermitian inner product.
Threshold theorem An important result of the theory of fault-tolerant quantum computing, stating that there is a threshold value for the noise level of the elementary gates such that arbitrarily long quantum computations become possible if the gates' noise level is below the threshold.
Transversal gates A special class of operations acting on encoded quantum data. Transversal gates have the important property of exhibiting benign behavior in case errors happen during the application of the gate.
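The bit-flip and phase-flip errors defined in the glossary are the Pauli X and Z matrices. A small numpy sketch (ours, not from the article) verifies their action and the fact that {I, X, Z, XZ} spans all 2 × 2 operators, which is why correcting this discrete set of errors suffices to correct an arbitrary single-qubit error:

```python
import numpy as np

# Computational basis states |0> and |1>.
ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)

X = np.array([[0, 1], [1, 0]], dtype=complex)   # bit-flip
Z = np.array([[1, 0], [0, -1]], dtype=complex)  # phase-flip

# The bit-flip interchanges the basis states ...
assert np.allclose(X @ ket0, ket1) and np.allclose(X @ ket1, ket0)

# ... while the phase-flip gives |1> a relative phase of -1; in the
# |+>, |-> basis it acts like a bit-flip.
plus = (ket0 + ket1) / np.sqrt(2)
minus = (ket0 - ket1) / np.sqrt(2)
assert np.allclose(Z @ ket1, -ket1) and np.allclose(Z @ plus, minus)

# Any 2x2 error operator expands over the orthogonal set {I, X, Z, XZ}
# (inner product tr(A^dag B)/2), so correcting bit-flips and phase-flips
# corrects arbitrary single-qubit errors.
rng = np.random.default_rng(0)
E = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
basis = [np.eye(2), X, Z, X @ Z]
coeffs = [np.trace(B.conj().T @ E) / 2 for B in basis]
assert np.allclose(sum(c * B for c, B in zip(coeffs, basis)), E)
```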
Definition of the Subject

Quantum error correction offers a solution to the problem of protecting quantum systems against noise induced by interactions with the environment or caused by imperfect control of the system. The need for error correction arises not only in communication, when quantum information is sent over some distance, but also locally, when storing and processing quantum information. Fault-tolerant quantum computing builds on quantum error correction and denotes techniques that allow computations to be performed on a quantum system with faulty gates as well as storage errors. Without mechanisms for quantum error correction and fault tolerance, quantum computing would be impossible even for moderate error rates. The idea of quantum error correction was first conceived in a paper [64] by Shor in 1995, in which a particular quantum code was given that encodes one quantum bit (qubit) into nine quantum bits, while being able to correct an arbitrary error on one of these nine qubits. Before long, Bennett et al. [11] developed a theory of error correction describing a quantum code that encodes one quantum bit into five while still being able to correct arbitrary single-qubit errors. Around the same time, Calderbank, Shor [15] and Steane [66] constructed quantum codes from suitable pairs of classical codes. Named after their inventors, this important class of error-correcting codes is called CSS codes. Other noteworthy work in the early days of quantum error correction includes the contributions by Calderbank et al. [14], which related quantum codes to classical codes over the finite field GF(4) that are additively closed and self-orthogonal with respect to the Hermitian inner product, and by Gottesman [25,26], who developed the theory of stabilizer codes. These two constructions are equivalent.
As far as fault-tolerant quantum computing is concerned, again the first method was given by Shor [65], who introduced the idea of performing a universal set of quantum gates fault-tolerantly, including the syndrome measurements required for quantum error correction. Around the same time, ideas based on the concatenation of quantum codes were used by Aharonov and Ben-Or [2], Kitaev [42], Gottesman [26], Knill, Laflamme and Zurek [45], and Preskill [58] to derive fault-tolerant schemes that work even if only faulty gates are available. The characteristic feature of all these constructions is that there is a certain value for the error rate such that if the actual error rate is below this value, then arbitrarily long quantum computations can be performed with any desired accuracy, whereas if the actual error rate is above this value, then quantum computing is impossible with the same scheme. This watershed-type behavior is characterized by the threshold theorem, which is a central result in fault-tolerant quantum computing. As the task of determining the exact value of the threshold is daunting, much work has been done to obtain upper and lower bounds as well as to perform numerical simulations that give indications of the threshold value.
Introduction

In the early days of quantum computing, Haroche and Raimond asked the poignant question whether the dream of quantum computing could ever be realized in a real physical system or if "the large-scale quantum machine . . . is the experimenter's nightmare" [39]. At the time the article was written, the first quantum error-correcting code had just been proposed [64]. However, Haroche and Raimond argued that "the implementation of error-correcting codes will become exceedingly difficult" given any detection efficiency less than 100%. It was only later shown that even with imperfect quantum memory and imperfect quantum operations it is possible to implement an arbitrarily long quantum computation, provided that the failure probability of each element is below a certain threshold [2,45]. Here we provide an overview of the ingredients leading to fault-tolerant quantum computation (FTQC). In the first part, we present the theory of quantum error-correcting codes (QECCs) and, in particular, two important classes of QECCs: CSS codes and stabilizer codes. Both are related to classical error-correcting codes, so we start with some basics from this area. In the second part of the article, we present a high-level view of the main ideas of FTQC and the threshold theorem. For the background of quantum computing in general, we refer the reader to the related articles in this volume as well as the book by Nielsen and Chuang [56].
Basics of Classical Error Correction

Block Codes

Before discussing the principles of quantum error correction, we briefly summarize some results from classical error correction. For more information see MacWilliams and Sloane [53]. Error correction is part of information theory, whose foundations were laid by Claude Shannon in his landmark paper "A Mathematical Theory of Communication" [63]. In that paper, Shannon introduced the basic mathematical concepts for communication systems that are used to send messages from one point in space or time to another point:
"The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design."

Hence we can model the set of all possible messages, for example, by a set of binary strings of fixed length.

Definition 1 (block code) A block code B of length n is a subset of all possible sequences of length n over an alphabet A, i.e., B ⊆ A^n. The rate

R = log |B| / log |A^n| = log |B| / (n log |A|)
of the code is a measure of the amount of information that is transmitted per symbol in the code. While the rate of the code should be as high as possible, using only a proper subset of all possible messages allows us to correct some errors. On the highest level of abstraction, one distinguishes only whether a symbol is transmitted correctly or not. This leads to the following measure of the distance between two sequences of length n.

Definition 2 (Hamming distance) The Hamming distance between two sequences x = (x_1, ..., x_n) and y = (y_1, ..., y_n) equals the number of positions where x and y differ, that is,

d_H(x, y) := |{ i : 1 ≤ i ≤ n, x_i ≠ y_i }| .

If the alphabet contains a symbol zero, the Hamming weight of a sequence is defined as the number of non-zero elements of the sequence. If we send two copies of every symbol to the receiver, we get a repetition code of length n = 2 and rate R = 1/2. The Hamming distance between any two codewords is two. In this situation, the receiver can detect an error if at most one of the symbols has been corrupted during transmission. If we repeat every symbol three times (n = 3, R = 1/3), the receiver can even deduce the correct message from the majority of the symbols, assuming that no more than one symbol is erroneous. The general situation is as follows:
Theorem 1 Let B be a block code with minimum distance d, that is, any two codewords differ in at least d positions. Then one can either detect all errors that affect fewer than d positions or correct all errors that affect strictly fewer than d/2 positions.

Proof As the minimum distance of the code is d, changing up to d−1 positions of a codeword does not yield another codeword. So in order to detect up to d−1 errors, it is sufficient to check whether the received word is a codeword or not. In order to correct errors, note that the spheres of radius ⌊(d−1)/2⌋ around the codewords are disjoint. In the error correction process, every word in such a sphere will be mapped to the corresponding codeword.

In this setting, finding a good error-correcting code, by which we mean a code with both high rate and high minimum distance, amounts to packing spheres of radius ⌊(d−1)/2⌋, with respect to the Hamming distance, in the space of words of length n. In general, the resulting code will be a set of codewords without further structure, so that for decoding we would have to store the list of all codewords.

Linear Block Codes

Developing more efficient ways of describing all codewords and of testing whether a given word lies in the code requires that the code has some additional structure. First, we assume that the alphabet of the code is a finite field GF(q) with q elements. Second, we require that any linear combination of two codewords is again a codeword. The resulting code is a linear block code, denoted by C = [n, k, d] (for more details see [53]). Here k is the dimension of the code as a subspace of the vector space GF(q)^n, and d denotes the minimum distance of the code. Instead of listing all q^k codewords, it is sufficient to specify a basis of k linearly independent vectors of the linear space. Alternatively, a subspace of dimension k is uniquely described as the space of solutions of n−k linearly independent homogeneous linear equations.
Definition 3 (generator matrix/parity check matrix) A generator matrix of a linear block code C = [n, k, d] is a matrix G with k rows and n columns of full rank such that the row span of G equals the code C. A parity check matrix is a matrix H with n−k rows and n columns of full rank such that the row-nullspace of H equals the code C.

The generator matrix G provides both a compact description of a linear block code and an efficient way of encoding a message, which can be represented by a vector i of length k. The corresponding codeword is obtained by the linear mapping i ↦ c := iG. The parity check matrix provides an efficient way to detect errors.
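The encoding map i ↦ c := iG can be sketched in a few lines; the following uses the generator matrix of the binary [3, 1, 3] repetition code as a toy instance (an illustrative sketch, not part of the article):

```python
import numpy as np

# Encoding a message vector i of length k as the codeword c = iG (mod 2).
# G is a generator matrix of the binary [3, 1, 3] repetition code:
# k = 1 row, n = 3 columns, full rank.
G = np.array([[1, 1, 1]])

def encode(i):
    """Linear encoding map i -> iG over GF(2)."""
    return (np.array(i) @ G) % 2

assert encode([0]).tolist() == [0, 0, 0]
assert encode([1]).tolist() == [1, 1, 1]
```

The same two lines work unchanged for any [n, k] code once G is replaced by a k × n generator matrix.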
Quantum Error Correction and Fault Tolerant Quantum Computing
Theorem 2 (error syndrome) Let H be a parity check matrix of the linear block code C = [n, k, d]. Then a vector v is a codeword if and only if the error syndrome s := vH^t is zero. Furthermore, for an erroneous codeword v = c + e the error syndrome depends only on the error e, but not on the codeword c.

Proof By definition, the code C equals the row-nullspace of the parity check matrix H, so cH^t = 0 for every codeword c. Furthermore, any erroneous codeword can be written as v = c + e. Then we have

s = vH^t = (c + e)H^t = cH^t + eH^t = eH^t.
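Theorem 2 can be checked mechanically: the syndrome of a corrupted codeword depends only on the error pattern. A small numpy sketch (illustrative only), using a parity check matrix of the [3, 1, 3] repetition code:

```python
import numpy as np

# Parity check matrix of the binary [3, 1, 3] repetition code.
H = np.array([[0, 1, 1],
              [1, 0, 1]])

def syndrome(v):
    """s = v H^t over GF(2)."""
    return (v @ H.T) % 2

c = np.array([1, 1, 1])   # a codeword
e = np.array([0, 1, 0])   # a single bit-flip error on position 2

# Codewords have syndrome zero ...
assert not syndrome(c).any()
# ... and the syndrome of c + e equals the syndrome of e alone.
assert (syndrome((c + e) % 2) == syndrome(e)).all()
```

This is the basis of syndrome decoding: the receiver never needs to know c, only the syndrome table mapping s to the most likely error e.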
Later we will need the concept of the dual code, which also motivates the definition of the parity check matrix with n columns and n−k rows.

Proposition 1 (dual code) Let C = [n, k, d] be a linear block code. Then the (Euclidean) dual code C^⊥ is the space of all vectors that are orthogonal to all codewords with respect to the Euclidean inner product v·w := Σ_{i=1}^n v_i w_i, i.e.,

C^⊥ = {v ∈ GF(q)^n : v·c = 0 for all c ∈ C}.
If G is a generator matrix and H is a parity check matrix for C, then G is a parity check matrix and H is a generator matrix for C^⊥, i.e., the roles of G and H are interchanged.

The parity check matrix can also be used to compute the minimum distance of a linear block code.

Theorem 3 If any d−1 columns of the parity check matrix H of a linear block code C are linearly independent, then the minimum distance of the code is at least d.

Proof First note that for a linear code, the minimum distance equals the minimum Hamming weight of a nonzero codeword, as d_H(x, y) = d_H(x−y, 0). Assume that c is a nonzero codeword of Hamming weight at most d−1, that is, there are at most d−1 indices i_1, …, i_{d−1} such that c_{i_j} ≠ 0. The syndrome is computed as 0 = cH^t = c_{i_1} h(i_1) + … + c_{i_{d−1}} h(i_{d−1}), where h(i) denotes the ith column of H. Hence we have a non-trivial linear combination of at most d−1 columns of H that is zero, contradicting the fact that any d−1 columns of H are linearly independent.

Hamming Codes

The previous theorem can be used to construct codes with a prescribed minimum distance d. For a code that can correct a single error, the minimum distance d must be at least three, that is, any two columns of the parity check matrix must be linearly independent. For the simplest case in which the field has two elements, GF(2) = {0, 1}, that is, all operations are modulo two, it is sufficient that all columns of H are distinct and nonzero. This yields the following family of single error-correcting codes (see [38,53]).

Proposition 2 (binary Hamming code) The rth binary Hamming code is a linear binary block code of length n = 2^r − 1, dimension k = 2^r − r − 1, and minimum distance d = 3. A parity check matrix H of the Hamming code is a matrix whose 2^r − 1 columns are all nonzero binary vectors of length r.

The columns of H can be arranged such that the ith column equals the binary expansion of i. Then for an error e of weight one, the syndrome eH^t equals the binary expansion of the position of the error. For r = 3, we get the following parity check matrix:

H = ( 0 0 0 1 1 1 1
      0 1 1 0 0 1 1
      1 0 1 0 1 0 1 ).

For the error e = (0, 0, 1, 0, 0, 0, 0), the syndrome is eH^t = (0, 1, 1), that is, the binary expansion of three.

In general it is difficult to deduce the error of smallest Hamming weight (which is often the most likely error) from the error syndrome. More precisely, given a binary parity check matrix H, an error syndrome s, and a positive integer w, it is NP-complete to decide whether there is an error vector e whose weight does not exceed w such that eH^t = s [12]. Nonetheless, for some classes of codes, such as BCH codes or Reed–Solomon codes, there exist efficient algorithms for the correction of all errors up to a certain weight (see [53]).

Basic Ideas of Quantum Error Correction

In the classical setting of error correction, information is represented, for example, by a binary string of length n. The extremal case is that we have only one bit that is either zero or one. In the context of quantum information, the simplest quantum system is modeled by a two-dimensional complex vector space. A quantum bit (or qubit, for short) corresponds to a normalized vector in this space (for more details see the book by Nielsen and Chuang [56]).
The qubit is given by

|ψ⟩ = α|0⟩ + β|1⟩, where |α|² + |β|² = 1, α, β ∈ ℂ.

Here |0⟩ and |1⟩ denote two orthogonal vectors in the two-dimensional vector space. While these basis states correspond to the two classical values 0 and 1 of a bit, a qubit can be in a superposition of both |0⟩ and |1⟩. When the
Quantum Error Correction and Fault Tolerant Quantum Computing, Figure 1 Similarities between classical and quantum error-correcting codes
qubit is measured with respect to the basis {|0⟩, |1⟩}, the result is either “0” or “1” with probability |α|² and |β|², respectively.

As we have seen in Subsect. “Block Codes”, a simple classical one-error-correcting code can be obtained by sending the information three times and taking a majority decision at the receiver’s end. However, this does not work in the context of quantum information. First, it is not possible for the sender to produce copies of an unknown quantum state (see [69]). The main idea of this so-called no-cloning theorem is as follows.

Theorem 4 There is no quantum operation that maps an arbitrary quantum state |ψ⟩ and a fixed state |ψ_0⟩ to two independent copies |ψ⟩|ψ⟩.

Proof Let |ψ⟩ = α|0⟩ + β|1⟩. Two independent copies of |ψ⟩ are given by

|ψ⟩|ψ⟩ = (α|0⟩ + β|1⟩) ⊗ (α|0⟩ + β|1⟩)
       = α²|00⟩ + αβ|01⟩ + αβ|10⟩ + β²|11⟩.

If |ψ⟩ is one of the basis states, we have |0⟩|ψ_0⟩ ↦ |00⟩ and |1⟩|ψ_0⟩ ↦ |11⟩. By the linearity of quantum mechanics, starting with a superposition we get (α|0⟩ + β|1⟩)|ψ_0⟩ ↦ α|00⟩ + β|11⟩. This equals |ψ⟩|ψ⟩ only when α = 0 or β = 0.

Second, even if independent copies of a quantum state |ψ⟩ are sent (for example, if the sender knows how to prepare the state |ψ⟩), the direct quantum mechanical analogue of a majority decision is not possible for the receiver. Instead, the receiver may, for example, check whether the joint quantum state of all received copies is invariant under permutation of the copies. This allows us to detect and correct some errors [9].

Although the classical repetition code does not have a direct quantum mechanical analogue, we will see that quantum error correction can be related to classical error-correcting codes. For this we recall that, in general, a classical code is given by a proper subset of the finite set of all possible messages of fixed length. In contrast, quantum information is represented by an arbitrary normalized vector in a complex vector space. In order to achieve the possibility of correcting quantum errors, a quantum error-correcting code must be a proper subspace of a larger vector space. Similar to the problem of packing spheres for classical codes, the subspace has to be chosen in such a way that the spaces corresponding to the quantum errors do not overlap (see Fig. 1). Before addressing this question in more detail in Sect. “Conditions for Quantum Error Correction”, we present a simple example.

A Simple Example and Shor’s Nine-Qubit Code

The Three-Qubit Code

The shortest classical code that encodes one bit and can correct one error is the repetition code of length three, which coincides with the second binary Hamming code. Hence a parity check matrix H and a generator matrix G are given by

H = ( 0 1 1
      1 0 1 )    and    G = ( 1 1 1 ).
We use the codewords (000) and (111) of the binary code to define the basis states of a three-qubit code, that is, the
Quantum Error Correction and Fault Tolerant Quantum Computing, Figure 2 Quantum circuit for the three-qubit code. Two ancilla qubits are used for the encoding, and another two for the error syndrome. The measurement yields two classical syndrome bits
encoding operation is given by

C : ℂ² → (ℂ²)^⊗3
|0⟩ ↦ |000⟩
|1⟩ ↦ |111⟩.

Hence a superposition α|0⟩ + β|1⟩ is encoded as C(α|0⟩ + β|1⟩) = α|000⟩ + β|111⟩. The encoding transformation C can be implemented using two controlled-NOT (CNOT) gates (see [56]). The CNOT gate is given by

CNOT : |x⟩|y⟩ ↦ |x⟩|x ⊕ y⟩,

where x ⊕ y denotes addition modulo two (XOR). Hence the second (target) qubit is flipped if and only if the first (control) qubit is one.

The states |000⟩ and |111⟩ span the quantum code C, which is a two-dimensional subspace of ℂ² ⊗ ℂ² ⊗ ℂ². A classical error flips a bit, interchanging 0 and 1. The quantum mechanical analogue is given by the matrix

σ_x = ( 0 1 ; 1 0 )

interchanging the basis states |0⟩ and |1⟩. Applying σ_x to at most one of the three subsystems, we obtain the following states:

error          state               subspace
no error       α|000⟩ + β|111⟩     (I ⊗ I ⊗ I) C =: C_0
1st position   α|100⟩ + β|011⟩     (σ_x ⊗ I ⊗ I) C =: C_1
2nd position   α|010⟩ + β|101⟩     (I ⊗ σ_x ⊗ I) C =: C_2
3rd position   α|001⟩ + β|110⟩     (I ⊗ I ⊗ σ_x) C =: C_3     (1)

The four different cases yield four mutually orthogonal subspaces, that is, the Hilbert space of three qubits can be decomposed as follows:

ℂ² ⊗ ℂ² ⊗ ℂ² = (I ⊗ I ⊗ I) C ⊕ (σ_x ⊗ I ⊗ I) C ⊕ (I ⊗ σ_x ⊗ I) C ⊕ (I ⊗ I ⊗ σ_x) C.

In principle it is possible to construct a quantum mechanical observable whose eigenspaces are the four two-dimensional spaces C_i in (1). The corresponding projective measurement projects onto one of these spaces and provides information about the error, but preserves the superposition within the spaces. Alternatively, we can compute information about the error using two auxiliary qubits (ancillae). Recall that for the binary Hamming code, the error syndrome s = eH^t equals the binary expansion of the position of the error, provided that there is at most one error. The computation of the error syndrome can also be implemented using CNOT gates (see Fig. 2). The gates in the box labeled “syndrome computation” implement the transformation

|c⟩|00⟩ = |c_1 c_2 c_3⟩|0⟩|0⟩ ↦ |c_1 c_2 c_3⟩|c_2 ⊕ c_3⟩|c_1 ⊕ c_3⟩ = |c⟩|cH^t⟩.     (2)

Measuring the two ancilla qubits yields a two-bit syndrome s = (s_2 s_1) encoding the position of the error. This classical information can be used to correct the error. Instead of measuring the syndrome qubits, one can use controlled quantum operations to correct the errors, as illustrated in Fig. 3.

While this three-qubit code allows us to correct the quantum mechanical analogue of a single bit-flip error, it cannot correct an arbitrary single qubit error. For instance, measuring any of the three qubits with respect to the standard basis {|0⟩, |1⟩} for the encoded state |Φ⟩ = α|000⟩ + β|111⟩ has the same effect as measuring
Quantum Error Correction and Fault Tolerant Quantum Computing, Figure 3 Quantum circuit for the three-qubit code. In contrast to Fig. 2, the error is corrected without measuring the syndrome qubits
the unencoded qubit α|0⟩ + β|1⟩. In order to turn the three-qubit code into a code that can correct the effect of measuring one qubit, we first use the following identities for the projections onto the basis states:

|0⟩⟨0| = ( 1 0 ; 0 0 ) = ½ (I + σ_z)   and   |1⟩⟨1| = ( 0 0 ; 0 1 ) = ½ (I − σ_z),     (3)

where σ_z = ( 1 0 ; 0 −1 ). The relation between σ_x and σ_z is given by σ_z = H σ_x H, where H = 1/√2 ( 1 1 ; 1 −1 ). Hence the Hadamard matrix H turns bit-flip errors σ_x into phase-flip errors σ_z. Since the three-qubit code C is able to correct a single bit-flip error, the code C_phase = (H ⊗ H ⊗ H) C can correct a single phase-flip error. By linearity, this code can also correct the effect of measuring one subsystem. Assume that we measured the first qubit of the encoded state |Φ⟩ and the result was 0. Using (3), for the state after the measurement we compute

(|0⟩⟨0| ⊗ I ⊗ I) |Φ⟩ = ½ |Φ⟩ + ½ (σ_z ⊗ I ⊗ I) |Φ⟩,

that is, the state is a superposition of the states corresponding to “no error” and “phase error at the first position”. Measuring the error syndrome projects either onto the error-free state |Φ⟩ or onto the state with a single phase-flip error, which can be corrected. The measurement result 1, as well as measuring one of the other qubits, can be treated similarly.

Shor’s Nine-Qubit Code

The three-qubit code C and its Hadamard-transformed version C_phase of the previous section can correct a single bit-flip error or a single phase-flip error, respectively.
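The bit-flip branch of this machinery can be simulated classically on state vectors. The following numpy sketch (illustrative, not from the article) encodes a qubit into the three-qubit code, applies a bit-flip on the second qubit, reads off the syndrome (c_2 ⊕ c_3, c_1 ⊕ c_3), and corrects:

```python
import numpy as np

# State-vector simulation of the three-qubit bit-flip code (a sketch).
X = np.array([[0, 1], [1, 0]])
I = np.eye(2)

def kron(*ops):
    """Tensor product of several one-qubit operators."""
    out = np.array([[1.0]])
    for op in ops:
        out = np.kron(out, op)
    return out

alpha, beta = 0.6, 0.8                    # |alpha|^2 + |beta|^2 = 1
psi = np.zeros(8)
psi[0b000], psi[0b111] = alpha, beta      # encoded state alpha|000> + beta|111>

corrupted = kron(I, X, I) @ psi           # bit-flip error on the second qubit

# Both basis strings of the corrupted state share the same parities
# s = (c2 xor c3, c1 xor c3), so the syndrome can be read off classically
# from the bit pattern of any populated basis state.
idx = int(np.flatnonzero(corrupted)[0])
c1, c2, c3 = (idx >> 2) & 1, (idx >> 1) & 1, idx & 1
s = (c2 ^ c3, c1 ^ c3)
assert s == (1, 0)                        # equals e H^t for e = (0, 1, 0)

recovered = kron(I, X, I) @ corrupted     # apply the indicated correction
assert np.allclose(recovered, psi)
```

A faithful circuit simulation would compute the syndrome into two ancilla qubits via CNOT gates as in Fig. 2; reading the parities from the basis-state labels is a shortcut that gives the same classical syndrome bits.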
However, neither of the two codes can correct both types of errors. For an encoded state |ψ⟩ = α|000⟩ + β|111⟩ of the three-qubit code, a single phase-flip error σ_z results in the state (σ_z ⊗ I ⊗ I)|ψ⟩ = α|000⟩ − β|111⟩. In terms of the encoded or logical basis states |0⟩_L = |000⟩ and |1⟩_L = |111⟩, corresponding to the encoding of |0⟩ and |1⟩, respectively, a single phase-flip has the effect of an encoded σ_z operation. Note that the operations σ_z ⊗ I ⊗ I, I ⊗ σ_z ⊗ I, and I ⊗ I ⊗ σ_z all have the same effect on the code. Hence, in order to also correct phase-flip errors, we can add another layer of encoding using the code C_phase with the encoded basis states

|0⟩ ↦ ½ (|000⟩ + |011⟩ + |101⟩ + |110⟩)
|1⟩ ↦ ½ (|001⟩ + |010⟩ + |100⟩ + |111⟩).     (4)

Each qubit in (4) is replaced by its encoded version with respect to the three-qubit code C, that is, we get the encoding

|0⟩ ↦ ½ (|000000000⟩ + |000111111⟩ + |111000111⟩ + |111111000⟩)
|1⟩ ↦ ½ (|000000111⟩ + |000111000⟩ + |111000000⟩ + |111111111⟩).     (5)
This nine-qubit code has been constructed by Shor [64]. It turns out that it can correct an arbitrary single-qubit error. For this, first note that the errors σ_x and σ_z can be corrected independently on the two levels of encoding. Therefore, the code can correct not only these errors, but also their combination σ_x σ_z. Together with identity we have the matrices

I = ( 1 0 ; 0 1 ),   σ_x = ( 0 1 ; 1 0 ),
σ_z = ( 1 0 ; 0 −1 ),   and   σ_x σ_z = ( 0 −1 ; 1 0 ).     (6)

Any 2 × 2 matrix can be written as a linear combination of the four matrices in (6). Similar to the arguments following (3), it can be shown that in order to correct an arbitrary single-qubit error it is indeed sufficient to correct the errors corresponding to the matrices in (6). The quantum error-correcting code given by (5) is denoted by C = [[9, 1, 3]], indicating that one qubit is encoded into nine qubits and that the minimum distance of the code is three. Before presenting further constructions for quantum codes, we will discuss necessary and sufficient conditions for quantum error correction.

Conditions for Quantum Error Correction

Quantum Channels

Errors in quantum systems are due to the interaction of the system with its environment or to imperfections in the control of quantum operations. The latter can be modeled by a perfect operation that additionally depends on the state of the environment. The Hilbert space of the combined system is given by H_sys/env = H_system ⊗ H_environment. If the dimension of both spaces is sufficiently large, we can assume that the initial state is a pure state. Additionally, we assume that there are no initial correlations between the system and its environment, that is, the initial state is a product state |ψ⟩_sys |ε⟩_env. Again, for sufficiently large dimensions of the Hilbert spaces we can assume that the dynamics of the joint system are given by a unitary transformation U_sys/env. The resulting state of the system is obtained by tracing out the environment:

ρ_out = Tr_env ( U_sys/env (|ψ⟩⟨ψ| ⊗ |ε⟩⟨ε|) U_sys/env† ).     (7)

The output state in (7) can equivalently be expressed as a function of the input state ρ_in = |ψ⟩⟨ψ| in the form

ρ_out = Σ_i E_i ρ_in E_i†,

where the operators E_i are the so-called error operators or Kraus operators [47]. They depend on both the initial state |ε⟩ of the environment and the unitary interaction U_sys/env. In the following, some important special cases of quantum channels are presented.

Example (depolarizing channel) The depolarizing channel on the Hilbert space H with error parameter p, 0 ≤ p ≤ 1, is given by

ρ ↦ (1 − p) ρ + p I / dim H.

The input is transmitted faithfully with probability 1 − p. With probability p, the state is replaced by the completely random state I/dim H. Note that even in this case, the probability of measuring a particular pure state |ψ⟩ is 1/dim H ≠ 0.

While the depolarizing channel treats all input states uniformly, the next quantum channel is basis-dependent.

Example (dephasing channel) The dephasing channel on the Hilbert space H with orthonormal basis B = {|b_i⟩ : i ∈ I} and error parameter p, 0 ≤ p ≤ 1, is given by

ρ ↦ (1 − p) ρ + p Σ_{i∈I} |b_i⟩⟨b_i| ρ |b_i⟩⟨b_i|.

With probability p, the channel performs a projective measurement with respect to the basis B. The name of the channel derives from the fact that this is equivalent to randomizing the phases of the basis states. The dephasing channel allows us to perfectly transmit classical information by encoding the information as basis states. Coherent superpositions of basis states, however, are changed into classical mixtures.

The final example is a channel that provides the side-information that an error has occurred.

Example (quantum erasure channel [35]) The quantum erasure channel on the Hilbert space H with error parameter p, 0 ≤ p ≤ 1, is given by

ρ ↦ (1 − p) ρ + p |φ⟩⟨φ|,

where |φ⟩ is a quantum state in the Hilbert space H′ ⊇ H that is orthogonal to all states in H. The input is transmitted faithfully with probability 1 − p. With probability p, the state is replaced by the state |φ⟩⟨φ|. As |φ⟩ is orthogonal to all states in H, the receiver can perform a measurement which detects that an error has occurred.

The quantum mechanical analogue of a memoryless channel is a product channel, which is defined for a quantum system with n subsystems of, say, equal dimension, that is, on H = H_0^⊗n. The product channel is given by n uses of a channel on H_0 which acts independently on each of the n subsystems. If the channel on H_0 is given by the error operators E_0 = {E_i : i ∈ I}, the error operators of the product channel on H are

E = E_0^⊗n := {E_{i_1} ⊗ E_{i_2} ⊗ … ⊗ E_{i_n} : (i_1, i_2, …, i_n) ∈ I^n}.
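A channel given by Kraus operators can be checked numerically. The sketch below (illustrative; the Kraus decomposition of the depolarizing channel as uniform Pauli noise is a standard identity, not stated in the article) verifies trace preservation, Σ_i E_i† E_i = I, and that the operators reproduce ρ ↦ (1 − p)ρ + p I/2 on a qubit:

```python
import numpy as np

# Qubit depolarizing channel as a Kraus-operator sum (sketch):
# rho -> (1 - 3p/4) rho + (p/4)(X rho X + Y rho Y + Z rho Z)
#      = (1 - p) rho + p I/2.
I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)

p = 0.3
kraus = [np.sqrt(1 - 3 * p / 4) * I] + [np.sqrt(p / 4) * P for P in (X, Y, Z)]

# Completeness relation: the channel is trace preserving.
assert np.allclose(sum(E.conj().T @ E for E in kraus), I)

# Apply the channel to a pure-state density matrix rho = |psi><psi|.
v = np.array([0.6, 0.8j])
rho = np.outer(v, v.conj())
out = sum(E @ rho @ E.conj().T for E in kraus)

# The Kraus sum agrees with the depolarizing map from the text.
assert np.allclose(out, (1 - p) * rho + p * I / 2)
```

The same pattern (a list of Kraus operators plus the completeness check) covers the dephasing and erasure channels as well, with the appropriate operator lists.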
Characterization of Quantum Codes

As we have seen in Fig. 1, a quantum error-correcting code is a subspace C of the Hilbert space H. What is more, the full Hilbert space H can be decomposed into mutually orthogonal unitary images of C, corresponding to different error events (see Eq. (1)). In general, the error operators E_i describing the quantum channel need not be unitary. This leads to the question whether a subspace C of H is a quantum error-correcting code (QECC) for a given quantum channel.

Theorem 5 (QECC characterization [44]) Let Q be a quantum channel on H with error operators {E_i : i ∈ I_Q}. A subspace C ⊆ H with orthonormal basis {|c_i⟩ : i ∈ I_C} is a quantum error-correcting code for Q if and only if the following conditions hold:

∀k, ℓ ∈ I_Q ∀i ≠ j ∈ I_C : ⟨c_i| E_k† E_ℓ |c_j⟩ = 0     (8a)
∀k, ℓ ∈ I_Q ∀i, j ∈ I_C : ⟨c_i| E_k† E_ℓ |c_i⟩ = ⟨c_j| E_k† E_ℓ |c_j⟩ =: α_kℓ ∈ ℂ     (8b)

Denoting by P_C := Σ_{i∈I_C} |c_i⟩⟨c_i| the projection onto the code C, we obtain the following equivalent condition which is independent of the basis of the code:

∀k, ℓ ∈ I_Q : P_C E_k† E_ℓ P_C = α_kℓ P_C.

In principle, the proof of Theorem 5 implies an algorithm that allows the correction of errors (see [31]). However, as error correction is NP-hard in the classical case, we cannot expect to have an efficient algorithm for the more general situation of the quantum case.

If the conditions (8) are fulfilled, then the errors corresponding to the operators E_i can be corrected. It must be stressed that this implies that one can correct any operator that is a linear combination of the error operators E_i. For this, we show that the conditions (8) are linear in the error operators. Consider the new error operators

A := Σ_k λ_k E_k   and   B := Σ_ℓ μ_ℓ E_ℓ,

which are arbitrary linear combinations of the E_i. Using (8) we compute

⟨c_i| A† B |c_j⟩ = Σ_{k,ℓ} λ_k* μ_ℓ ⟨c_i| E_k† E_ℓ |c_j⟩
               = Σ_{k,ℓ} λ_k* μ_ℓ δ_{i,j} α_{k,ℓ}
               = δ_{i,j} α′(A, B),

where α′(A, B) ∈ ℂ is some constant depending on the operators A and B only. From this it also follows that it is sufficient to check the conditions (8) for a vector space basis of the error operators and hence for a finite set of errors. For qubit systems, the Pauli matrices

X = σ_x = ( 0 1 ; 1 0 ),   Y = σ_y = ( 0 −i ; i 0 ),   and   Z = σ_z = ( 1 0 ; 0 −1 )     (9)

together with identity form a vector space basis of all operators in ℂ^{2×2}. For a quantum code using n qubits, we consider the tensor products of Pauli matrices and identity as the so-called error basis. The number of tensor factors different from identity is referred to as the number of errors or the weight of an error.

Quantum Codes from Classical Codes

CSS Codes

The three-qubit code of Sect. “The Three-Qubit Code” is based on the classical triple repetition code. Both codes can correct a single bit-flip error. The Hadamard transformation yields the code C_phase which can correct a single phase-flip error. In the following we present a similar construction of quantum codes based on linear binary block codes, but the resulting codes will allow the correction of both bit-flip errors and phase-flip errors. For the construction we need the following lemma.

Lemma 1 Let C ⊆ GF(2)^n denote a k-dimensional linear subspace of GF(2)^n and let a, b ∈ GF(2)^n be two arbitrary binary vectors. Furthermore, by H_2^⊗n := H^⊗n, where H = 1/√2 ( 1 1 ; 1 −1 ), we denote the Hadamard transformation on n qubits. Then the state

|ψ⟩ := 1/√|C| Σ_{c∈C} (−1)^{a·c} |c + b⟩

is mapped by the Hadamard transformation to
H_2^⊗n |ψ⟩ = (−1)^{a·b} / √|C^⊥| Σ_{d∈C^⊥} (−1)^{b·d} |d + a⟩.

Proof The Hadamard transformation on n qubits can be written as

H_2^⊗n = 1/√2^n Σ_{x,y∈GF(2)^n} (−1)^{x·y} |x⟩⟨y|,

where x·y denotes the inner product of the binary vectors x and y. Then

H_2^⊗n |ψ⟩ = 1/√(2^n |C|) Σ_{x,y∈GF(2)^n} (−1)^{x·y} |x⟩⟨y| Σ_{c∈C} (−1)^{a·c} |c + b⟩
 = 1/√(2^n |C|) Σ_{x,y∈GF(2)^n} Σ_{c∈C} (−1)^{x·y + a·c} |x⟩⟨y|c + b⟩
 = 1/√(2^n |C|) Σ_{x∈GF(2)^n} Σ_{c∈C} (−1)^{x·(c+b) + a·c} |x⟩
 = 1/√(2^n |C|) Σ_{x∈GF(2)^n} (−1)^{b·x} Σ_{c∈C} (−1)^{(x+a)·c} |x⟩
 (*) = |C|/√(2^n |C|) Σ_{x∈C^⊥+a} (−1)^{b·x} |x⟩
 = (−1)^{a·b}/√|C^⊥| Σ_{d∈C^⊥} (−1)^{b·d} |d + a⟩.

In (*) we have used that the sum Σ_{c∈C} (−1)^{x·c} vanishes if and only if x ∉ C^⊥.

Lemma 1 shows that the Hadamard transformation not only changes phase-flip errors into bit-flip errors, it also maps superpositions of all codewords of the linear binary code C to superpositions of all codewords of the dual code C^⊥ (cf. Proposition 1).

Example (seven-qubit code) The 3rd binary Hamming code C = [7, 4, 3] (cf. Proposition 2) contains its dual code C^⊥ = [7, 3, 4], that is, C^⊥ ⊆ C. Hence we can partition the codewords of C into two cosets of C^⊥ as follows:

C = (C^⊥ + x_0) ∪ (C^⊥ + x_1),   where x_0 = (0000000) and x_1 = (1111111).

The Hamming weight of all codewords of C^⊥ is even, while the weight of all vectors in the coset C^⊥ + x_1 is odd. Based on this decomposition, we define the following encoding:

|0⟩ ↦ |0⟩_L = 1/√|C^⊥| Σ_{c∈C^⊥} |c + x_0⟩ = 1/√|C^⊥| Σ_{c∈C^⊥} |c⟩     (10a)
|1⟩ ↦ |1⟩_L = 1/√|C^⊥| Σ_{c∈C^⊥} |c + x_1⟩.     (10b)

The Hadamard transformation of these states yields

H_2^⊗7 |0⟩_L = 1/√|C| Σ_{c∈C} (−1)^{c·x_0} |c⟩ = 1/√|C| Σ_{c∈C} |c⟩ = 1/√2 (|0⟩_L + |1⟩_L)     (11a)
H_2^⊗7 |1⟩_L = 1/√|C| Σ_{c∈C} (−1)^{c·x_1} |c⟩ = 1/√2 (|0⟩_L − |1⟩_L).     (11b)

A superposition |ψ⟩ = α|0⟩_L + β|1⟩_L of the logical qubits is a superposition of words of the Hamming code C = [7, 4, 3]. Similar to the transformation (2) it is possible to compute an error syndrome for the bit-flip errors using a parity check matrix of the Hamming code. Measuring the error syndrome provides information about the position of a single bit-flip, allowing us to correct this error. From (11) it can be seen that the Hadamard transformation of the state |ψ⟩ is again a superposition of words of the Hamming code, so a single phase-flip error can be corrected as well. Similar to Shor’s nine-qubit code, for this seven-qubit code C = [[7, 1, 3]] given by (10), bit-flips and phase-flips can be corrected independently (for more details see [31]).

Equation (11) additionally shows that applying the Hadamard transformation to all seven qubits corresponds to a Hadamard transformation of the logical qubits |0⟩_L and |1⟩_L. Similarly, applying σ_x or σ_z to all qubits transforms the encoded qubits like encoded versions of σ_x and σ_z, respectively (see also Sect. “Transversal Gates”).

The generalization of this construction principle yields so-called CSS codes. This class of codes was independently derived by Calderbank and Shor [15] and Steane [66,67].

Theorem 6 (CSS code) Let C_1 = [n, k_1, d_1] and C_2 = [n, k_2, d_2] be linear binary codes of length n, dimension k_1 and k_2, respectively, and minimum distance d_1 and d_2, respectively, with C_2^⊥ ⊆ C_1. Furthermore, let W = {w_1, …, w_K} ⊆ GF(2)^n be a system of representatives of the cosets of C_2^⊥ in C_1. The K = 2^{k_1 − (n − k_2)} mutually orthogonal states

|ψ_i⟩ = 1/√|C_2^⊥| Σ_{c∈C_2^⊥} |c + w_i⟩     (12)

span a quantum error-correcting code C = [[n, k, d]] with k := k_1 + k_2 − n. The code corrects at least ⌊(d_1 − 1)/2⌋ bit-flip errors and simultaneously at least ⌊(d_2 − 1)/2⌋ phase-flip errors. Its minimum distance is d ≥ min{d_1, d_2}.
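The dual-containment requirement C_2^⊥ ⊆ C_1 can be checked mechanically. For the seven-qubit code, C_1 = C_2 is the [7, 4, 3] Hamming code, and since the rows of the parity check matrix H generate the dual code, the requirement reduces to H H^t = 0 (mod 2). A small numpy sketch (illustrative):

```python
import numpy as np

# Parity check matrix of the [7, 4, 3] Hamming code (columns are the
# binary expansions of 1..7, as in the text).
H = np.array([[0, 0, 0, 1, 1, 1, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [1, 0, 1, 0, 1, 0, 1]])

# Every row of H must itself satisfy the parity checks, i.e. be a
# codeword; this is exactly H H^t = 0 over GF(2).
assert not ((H @ H.T) % 2).any()   # C-perp is contained in C

# Resulting CSS parameters [[n, k, d]] with k = k1 + k2 - n:
n, k1, k2 = 7, 4, 4
assert k1 + k2 - n == 1            # the code encodes a single qubit
```

The same check applies to any candidate pair (C_1, C_2): compute H_1 G_2^t mod 2 with a generator matrix G_2 of C_2 and require it to vanish.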
Proof An arbitrary combination of bit-flip and phase-flip errors can be written as

e := σ_x^{e_{x,1}} σ_z^{e_{z,1}} ⊗ … ⊗ σ_x^{e_{x,n}} σ_z^{e_{z,n}},     (13)

where the binary vectors e_x and e_z indicate the positions with bit-flip and sign-flip errors, respectively. An arbitrary state of the CSS code is a superposition of the encoded basis states |ψ_i⟩ in (12). Rewriting the superposition shows that the state is a superposition of codewords of the binary code C_1:

|ψ⟩ = Σ_{i=1}^K α_i |ψ_i⟩ = Σ_{i=1}^K α_i′ Σ_{c∈C_2^⊥} |c + w_i⟩ = Σ_{c∈C_1} β_c |c⟩.     (14)

Combining (13) and (14), the erroneous state reads

e|ψ⟩ = Σ_{c∈C_1} β_c (−1)^{c·e_z} |c + e_x⟩.     (15)

Using a parity check matrix H_1 for the linear binary code C_1 we can implement the mapping

S_1 : |x⟩|y⟩ ↦ |x⟩|xH_1^t + y⟩.

Applying S_1 to the state (15) and n − k_1 ancilla qubits in the state |0⟩, we get

Σ_{c∈C_1} β_c (−1)^{c·e_z} |c + e_x⟩|(c + e_x)H_1^t⟩ = Σ_{c∈C_1} β_c (−1)^{c·e_z} |c + e_x⟩ ⊗ |e_x H_1^t⟩ = e|ψ⟩ ⊗ |e_x H_1^t⟩.     (16)

As the syndrome s_x = e_x H_1^t depends only on the error e_x, not on the codeword c, the state in (16) is a tensor product. Hence we can measure the syndrome of the bit-flip errors without disturbing the state e|ψ⟩. A classical algorithm for the decoding of the code C_1 can then be used to deduce the error vector e_x from the syndrome s_x.

In order to correct phase-flip errors, we recall that the Hadamard transformation interchanges σ_x and σ_z. Applying the Hadamard transformation to the erroneous state (15) we get

H_2^⊗n e|ψ⟩ = (H_2^⊗n e H_2^⊗n) H_2^⊗n ( Σ_{i=1}^K α_i′ Σ_{c∈C_2^⊥} |c + w_i⟩ )
 = Σ_{i=1}^K α_i″ Σ_{c∈C_2} (−1)^{c·e_x} (−1)^{c·w_i} |c + e_z⟩.

The last equation follows using Lemma 1. Similar to (16), we can use a parity check matrix H_2 of the binary code C_2 to define the mapping S_2 : |x⟩|y⟩ ↦ |x⟩|xH_2^t + y⟩. Using n − k_2 additional ancilla qubits, we obtain the state

H_2^⊗n e|ψ⟩ ⊗ |e_x H_1^t⟩ ⊗ |e_z H_2^t⟩.

Undoing the Hadamard transformation on the first n qubits we finally get

e|ψ⟩ ⊗ |e_x H_1^t⟩ ⊗ |e_z H_2^t⟩ = Σ_{c∈C_1} β_c (−1)^{c·e_z} |c + e_x⟩ ⊗ |e_x H_1^t⟩ ⊗ |e_z H_2^t⟩.

Again we can measure the syndrome s_z = e_z H_2^t and use a decoding algorithm for the linear binary code C_2 to find the error vector e_z.

Stabilizer Codes

The construction of CSS codes uses two binary codes which define a decomposition of the set of all binary strings of length n. Using the binary strings as labels of quantum states, one obtains a decomposition of the complex vector space (ℂ²)^⊗n (cf. Fig. 1). Such a decomposition can also be defined via eigenspaces of operators.

Let P_n denote the group which is generated by tensor products of n Pauli matrices (cf. (9)) and identity. Every element g ∈ P_n has a unique representation of the form

g = i^c σ_x^{g_{x,1}} σ_z^{g_{z,1}} ⊗ … ⊗ σ_x^{g_{x,n}} σ_z^{g_{z,n}},

where c ∈ {0, 1, 2, 3} and g_x and g_z are binary vectors of length n. Two elements g and h of P_n either commute or anti-commute, that is, gh = ±hg.

Let S be an Abelian subgroup of P_n, that is, any two elements of S commute. Furthermore, we assume that S does not contain −I. Then S has 2^{n−k} elements and there are n − k generators. The spectrum of every generator g_i is {+1, −1}. Furthermore, there are 2^{n−k} common eigenspaces, each of dimension 2^k, which can be labeled by the eigenvalues of the generators. We choose one of these eigenspaces as our quantum code C, e.g., the eigenspace with all eigenvalues +1.

For the analysis of the error-correcting properties of this code, recall that any two elements in P_n either commute or anti-commute. Assume that E ∈ P_n anti-commutes with some generator g_i of S. Then for every state |ψ⟩ ∈ C we have

g_i (E|ψ⟩) = −E g_i |ψ⟩ = −E|ψ⟩,     (17)

that is, E|ψ⟩ lies in the eigenspace of g_i with eigenvalue −1, which is orthogonal to the code. Hence every error E that
anti-commutes with at least one of the generators g i of S can be detected. From the definition of S it follows that Ej i D j i for all E 2 S and every state j i 2 C , that is, these “errors” E have no effect on the code. However, if E … S commutes with all elements in S, then E preserves the code space, but acts non-trivially on it. When designing a code, we want the weight of these undetectable errors to be large, that is, we want the number of tensor factors of E that are different from identity to be high. This construction of so-called stabilizer codes has been found independently by Gottesman [25] and Calderbank et al. [14]. In the latter paper, a connection to additive codes over the finite field GF(4) has been established which allows us to use results from classical coding theory for the construction of good quantum codes. Theorem 7 (stabilizer codes) Let S be an Abelian subgroup of the n-qubit Pauli group Pn that does not contain I. The stabilizer code C with stabilizer S is the common eigenspace with eigenvalue + 1 of all operators in S. The dimension of C is 2 k if jSj D 2nk . The normalizer N (S) is defined as N (S) D fx : x 2 Pn j x 1 Sx D Sg
= {x ∈ P_n : x g = g x for all g ∈ S} .

The normalizer has 2^{n+k} elements and contains the stabilizer S. If the weight of all elements of the set N(S) \ S is at least d, then all errors of weight strictly less than d can either be detected or have no effect on the code. Equivalently, one can correct all errors of weight up to ⌊(d − 1)/2⌋. The code is denoted by C = [[n, k, d]].

In a manner similar to error correction for CSS codes, error correction for stabilizer codes is based on measuring a syndrome of the error. Recall that an error can be detected if it anti-commutes with one of the generators g_i of the stabilizer S. Then the erroneous state lies in the eigenspace of g_i with eigenvalue −1 (see (17)). Hence measuring the eigenvalues of the generators g_i provides information about the error E. Since the Pauli matrices are both unitary and Hermitian, the generators define a quantum mechanical observable. However, if the weight of the generator g_i is m, one would have to measure an m-qubit observable. Alternatively, one can use one ancilla qubit per generator g_i to compute an error syndrome. Assume that we want to measure the eigenvalue of g = σ_z ⊗ σ_x ⊗ I ⊗ σ_y. For this, we can use the quantum circuit shown in Fig. 4. The eigenstates of σ_z are |0⟩ and |1⟩ with eigenvalues +1 and −1, respectively. Hence the first CNOT gate flips the state of the ancilla qubit if and only if the first qubit is in
Quantum Error Correction and Fault Tolerant Quantum Computing, Figure 4
Quantum circuit to measure the eigenvalue of g = σ_z ⊗ σ_x ⊗ I ⊗ σ_y
the eigenspace of σ_z with eigenvalue −1. The control qubit of the next CNOT gate is conjugated with the Hadamard matrix, which interchanges σ_x and σ_z. Hence the ancilla qubit is flipped if and only if the second qubit is in the eigenspace of σ_x with eigenvalue −1. In order to measure the eigenvalue of σ_y, the control of the final CNOT gate is conjugated with the matrix M = (1/√2) (1 −i; −i 1), which maps σ_y to σ_z under conjugation. If measuring the ancilla qubit |s⟩ yields the result 0 or 1, the state of the first four qubits is projected onto the eigenspace of g with eigenvalue +1 or −1, respectively.

On the one hand, the elements of the normalizer N(S) which do not lie in the stabilizer S correspond to errors that cannot be detected. On the other hand, these elements can be used to perform operations on the code, as we will see in the following example.

Example (five-qubit code) The shortest quantum code that can correct an arbitrary single-qubit error is the code C = [[5, 1, 3]], which is a stabilizer code. The stabilizer S of this code is generated by

g1 = σ_z ⊗ σ_x ⊗ σ_x ⊗ σ_z ⊗ I = ZXXZI ,
g2 = I ⊗ σ_z ⊗ σ_x ⊗ σ_x ⊗ σ_z = IZXXZ ,
g3 = σ_z ⊗ I ⊗ σ_z ⊗ σ_x ⊗ σ_x = ZIZXX ,
g4 = σ_x ⊗ σ_z ⊗ I ⊗ σ_z ⊗ σ_x = XZIZX .

The normalizer of S is generated by g1, …, g4 together with

h1 = σ_x ⊗ σ_z ⊗ σ_x ⊗ I ⊗ I = XZXII ,
h2 = σ_z ⊗ σ_z ⊗ I ⊗ σ_x ⊗ I = ZZIXI .

Both h1 and h2 commute with all g_i, and h1 and h2 anti-commute with each other. Hence h1 and h2 act on the code as encoded versions of σ_x and σ_z.

We close this section by noting that CSS codes are also stabilizer codes. The stabilizer of the seven-qubit code of Example 4 is generated by

g1 = XIXIXIX ,
g2 = IXXIIXX ,
g3 = IIIXXXX ,
g4 = ZIZIZIZ ,
g5 = IZZIIZZ ,
g6 = IIIZZZZ .
The additional generators of the normalizer are given by

h1 = XXXXXXX ,
h2 = ZZZZZZZ .
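The commutation checks behind these statements reduce to arithmetic over F_2: writing each Pauli string by its binary x- and z-vectors, two elements commute iff the symplectic inner product g_x·h_z + g_z·h_x vanishes mod 2. A minimal sketch in Python (an illustration added here, not part of the original article):

```python
# Sketch: commutation of n-qubit Pauli strings via their binary
# (x, z) vectors; g and h commute iff gx·hz + gz·hx = 0 (mod 2).

def vectors(pauli):
    """Binary x- and z-vectors of a Pauli string such as 'XIXIXIX'."""
    x = [1 if p in "XY" else 0 for p in pauli]
    z = [1 if p in "ZY" else 0 for p in pauli]
    return x, z

def commute(g, h):
    gx, gz = vectors(g)
    hx, hz = vectors(h)
    s = sum(a * b for a, b in zip(gx, hz)) + sum(a * b for a, b in zip(gz, hx))
    return s % 2 == 0

# Generators of the seven-qubit code and the encoded X and Z operators
gens = ["XIXIXIX", "IXXIIXX", "IIIXXXX", "ZIZIZIZ", "IZZIIZZ", "IIIZZZZ"]
h1, h2 = "XXXXXXX", "ZZZZZZZ"

assert all(commute(g, h) for g in gens for h in gens)      # S is Abelian
assert all(commute(x, g) for x in (h1, h2) for g in gens)  # h1, h2 lie in N(S)
assert not commute(h1, h2)                                 # encoded X, Z anti-commute
```

The same two functions verify the five-qubit generators ZXXZI, …, XZIZX and their logical operators XZXII and ZZIXI.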
As already mentioned at the end of Example 4, h1 and h2 correspond to the encoded versions of σ_x and σ_z, respectively.

Techniques for Fault-Tolerant Quantum Computing

The theory of quantum codes as described in the previous sections is a prerequisite for a larger goal, namely, reliable quantum computations performed with imperfect hardware. Imperfections arise from the fact that every physical device is subject to noise. This includes memory elements as well as gate elements. Not only is memory affected by errors, but the operations by which this memory is modified can also be faulty, so the task of devising reliable quantum computations is very delicate. The goal of fault-tolerant quantum computing (FTQC) is to achieve exactly this, provided that the noise level introduced by the memory and the gates is not too high. It should be noted that FTQC is quite different from the techniques used in classical fault tolerance and dependability of systems (see [7] for an overview). In general, the supply of techniques that can be drawn from in the quantum case is less rich than in the classical case, where fault-tolerance techniques are known on basically all levels of abstraction, for hardware as well as for software. The reason for this difference is that in the quantum case there are strong restrictions on how quantum information can be maintained. One example is the no-cloning theorem, already mentioned as Theorem 4. Not having the ability to copy quantum information rules out many classical fault-tolerance techniques such as error detection for rollback recovery [23]. The results shown in FTQC have the flavor of some of the results shown in the early days of gate-level fault tolerance in the classical case, such as the celebrated threshold result for faulty NAND gates due to J. von Neumann [55].
An important result of the theory of FTQC is that also in the quantum case there is a threshold value for the noise level such that arbitrarily long quantum computations become possible if the gates have a noise level below the threshold. We will briefly discuss the current state of
the art of estimates of this threshold value from below and above in Sect. “The Threshold Theorem”. FTQC comprises methods to perform the following tasks on encoded quantum states in such a way that errors do not accumulate. This means that during the application of a single gate, errors do not spread out in an uncontrolled fashion over the whole circuit. Instead, errors stay locally confined so that they can be taken care of by error correction routines that are applied periodically to the system:
– Preparation of initial states and ancilla states
– Algorithms for quantum error-correction
– Measurement of quantum states
– A universal set of quantum gates.
Throughout, in FTQC it is assumed that quantum data is never available in unencoded form. Instead, it is assumed that whenever a qubit consisting of two (logical) basis states |0⟩_L, |1⟩_L is given, these states are encoded into some higher-dimensional Hilbert space, and that the need never arises to decode this quantum information from its encoded physical representation to its logical representation.
Transversal Gates

FTQC requires careful design of how quantum information is encoded. A first indication of the inherent difficulties is the observation that errors tend to be propagated through quantum gates. The CNOT gate serves as a simple example. A simple calculation reveals that it propagates (even when working perfectly) a single X-error on the control qubit to a double X-error on both control and target qubit. Conversely, the same CNOT gate propagates a single Z-error on the target qubit to a double Z-error on control and target qubit. A basic requirement for all quantum gates that are used to operate on encoded quantum data is that they should not propagate errors, that is, errors should stay locally confined to a small set of qubits (see Fig. 5). It was shown in [27] how to perform encoded Clifford operations transversally on any stabilizer code. Clifford operations are those quantum operations that leave the group of tensor products of Pauli matrices unchanged under conjugation. Clifford operations are not universal for quantum computation. For certain inputs, they can even be efficiently simulated classically [1,5]. However, it was also shown in [27] that, using transversal gates together with local measurements, a universal set of quantum gates can be implemented on any stabilizer code.
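The CNOT error propagation just described can be tracked in the binary (x, z) picture introduced earlier: conjugation by a CNOT copies the control's x-bit onto the target and the target's z-bit onto the control. A small illustrative sketch (added here, not from the article):

```python
# Sketch: Pauli-error propagation through a CNOT, tracked as (x, z) bits.
# Under conjugation by CNOT(control c, target t):
#   x_t ^= x_c   (an X on the control copies onto the target)
#   z_c ^= z_t   (a Z on the target copies onto the control)

def cnot_propagate(error, c, t):
    """error: list of (x, z) bit pairs, one per qubit."""
    x = [e[0] for e in error]
    z = [e[1] for e in error]
    x[t] ^= x[c]
    z[c] ^= z[t]
    return list(zip(x, z))

def show(error):
    names = {(0, 0): "I", (1, 0): "X", (0, 1): "Z", (1, 1): "Y"}
    return "".join(names[e] for e in error)

xi = [(1, 0), (0, 0)]  # single X on the control qubit
iz = [(0, 0), (0, 1)]  # single Z on the target qubit
print(show(cnot_propagate(xi, 0, 1)))  # XX: X-error spreads to the target
print(show(cnot_propagate(iz, 0, 1)))  # ZZ: Z-error spreads to the control
```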
Quantum Error Correction and Fault Tolerant Quantum Computing, Figure 5
Fault-tolerant quantum gates. Shown are two examples, namely a local operations that operate on encoded qubits without entangling them, thereby avoiding any error propagation between them, and b transversal gates which in this case are given by a sequence of controlled gates. An error in one of these gates will have only a local effect on at most two of the encoded qubits and therefore can be corrected by one round of quantum error correction, provided the qubits are encoded using a quantum code that can correct at least one error. In both a and b the gates U1, U2, …, are arbitrary unitaries

Teleporting Gates

While in the beginnings of FTQC universal gate sets were realized using ad hoc constructions involving measurements [65], it was later realized that teleportation [10] can also be used to implement universal sets of quantum gates. This has the following unique advantage: suppose that a particular gate U has to be applied during the computation. Then, as shown in [30], under some conditions (which characterize a set of gates more general than the Clifford gates, hence a universal set of gates), such a quantum gate can be “precomputed” and stored in a quantum state |ψ_U⟩. Once the gate U has to be applied, this state |ψ_U⟩ is used in a generalized teleportation scheme which can be shown to have the same effect as applying U. The advantage of this method of teleporting the gate U is that |ψ_U⟩ can be prepared and verified offline. Then only very high fidelity specimens of |ψ_U⟩ are actually used in the computation.

Fault-Tolerant Error Correction

Several fault-tolerant methods are known to measure the syndrome of a quantum code. We first describe Shor’s method, which relies on generalized cat states and which can be applied to any stabilizer code. Next, we describe Steane’s method, which is applicable whenever the code is a CSS code. Finally, we describe Knill’s method, which is based on teleportation and post-selection.
Shor’s Method

In Fig. 4 we have seen an example of how the eigenvalue of a generator of the stabilizer can be measured. The principle underlying this example is to apply a controlled operation for each of the non-trivial Pauli operators contained in the generator and to compute the output of these controlled gates into a fresh qubit which is subsequently measured. Unfortunately, this method is not fault-tolerant, as the CNOT has the mentioned property of propagating X-errors from control qubits to target qubits and Z-errors from target qubits to control qubits. Therefore, a different method is needed to implement a fault-tolerant measurement of stabilizer generators. As shown by Shor [65], such a method of fault-tolerant syndrome measurement exists, provided that a supply of cat states, that is, quantum states of the form (1/√2)(|0…0⟩ + |1…1⟩), is available. Such a cat state is first transformed using the Hadamard transform applied to all qubits, yielding the equal superposition of all even-weight binary words. Next, the same controlled operations as in Fig. 4 are applied, but instead of sharing a single target qubit, each acts on an individual qubit of the cat state. Afterwards, all ancilla qubits are measured in the standard basis, and the parity of the measurement results is computed classically. The remaining problem is how to ensure the supply of cat states. This is done in a separate, offline procedure which first performs an encoding of a cat state followed by suitable tests to verify that, indeed, the correct state has been produced.

Steane’s Method

A different method can be applied if the given quantum code is a CSS code. In this so-called Steane method [68], first an X-error correction is performed, followed by a Z-error correction. Here, these two error-correction routines are performed in a very special way.
To correct X-errors, an ancilla is prepared in an encoded state |+⟩_L = (1/√2)(|0⟩_L + |1⟩_L) that corresponds to the superposition of all logical codewords. Then transversal CNOTs from the encoded codeword (control) into the ancillas (target) are applied. Afterwards, the ancilla is measured in the standard basis and, if necessary, a correction is applied. For the Z-errors, the corresponding operations in the Hadamard basis are applied. A fault-tolerant measurement for a CSS code derived from C_2^⊥ ⊂ C_1 is defined with respect to the orthogonal projectors onto the states |ψ_i⟩ as in Eq. (6):

|i⟩_L = |ψ_i⟩ = (1/√|C_2^⊥|) Σ_{c ∈ C_2^⊥} |c + w_i⟩ .
First, we describe a way of measuring |i⟩_L that is fault-tolerant but does not implement the orthogonal projectors given by the CSS basis states (6). Measuring all qubits of |ψ_i⟩ in the computational basis yields a random codeword of the classical code C_1, possibly with some error added. Using a classical error-correction strategy for C_1, this error can be corrected, resulting in a random element of a coset C_2^⊥ + w_i. Then, computing the error syndrome with respect to the classical code C_2^⊥, it is possible to infer what i was. In order to obtain a measurement that also implements the orthogonal projection operators, we use ancilla qubits which can be prepared in the state |0⟩_L and tested offline [68]. Next, the quantum information is transferred from the encoded qubits to the ancilla using transversal CNOTs. Finally, the ancillas are measured in the computational basis and the information about i is computed as described above.

Knill’s Method

Knill [43] suggested a scheme for fault-tolerant quantum computation based on error detection and post-selection. Knill’s scheme involves a “sieving” step in which ancillas are prepared in a suitable state that is later used to teleport a gate into the computation. The gates involved in this sieving phase are protected using a concatenated error-detecting code. If an error occurs in this phase, the preparation of the ancilla state is aborted and a new attempt is started. A rigorous threshold for Knill’s scheme was given in [4], where it was shown that 1.04 × 10^{−3} is a lower bound on the threshold. It should be noted that Knill’s numerical simulations [43] indicate a much higher value of the threshold of about 3 × 10^{−2}, albeit at the price of a significant (but additive) hardware overhead. Both Knill’s and Steane’s schemes have in common that they are based on post-selection, a feature that also underlies the proposed implementation of a quantum computer based on linear optical elements [46].
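All of these methods ultimately extract the same classical data: one syndrome bit per stabilizer generator, recording whether the error commutes with it. For the five-qubit code of the earlier example, a short sketch in binary-symplectic bookkeeping (an illustration added here, not the article's notation) confirms that the 15 single-qubit Pauli errors produce 15 distinct nonzero syndromes, so each can be identified and corrected:

```python
# Sketch: syndromes of the [[5,1,3]] code. A syndrome bit is 1 iff the
# error anti-commutes with the generator; distance 3 means all 15
# single-qubit errors yield distinct nonzero syndromes.

GENS = ["ZXXZI", "IZXXZ", "ZIZXX", "XZIZX"]

def bits(p):
    """Binary x- and z-vectors of a Pauli string."""
    return ([1 if q in "XY" else 0 for q in p],
            [1 if q in "ZY" else 0 for q in p])

def syndrome(err):
    ex, ez = bits(err)
    syn = []
    for g in GENS:
        gx, gz = bits(g)
        s = sum(a * b for a, b in zip(gx, ez)) + sum(a * b for a, b in zip(gz, ex))
        syn.append(s % 2)
    return tuple(syn)

errors = ["I" * i + p + "I" * (4 - i) for i in range(5) for p in "XYZ"]
table = {syndrome(e): e for e in errors}

# 15 distinct nonzero syndromes: the code is "perfect" for single-qubit errors
assert len(table) == 15 and (0, 0, 0, 0) not in table
print(table[(1, 0, 1, 0)])  # XIIII: an X error on the first qubit
```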
The Threshold Theorem

In its simplest form, FTQC is achieved by recursively encoding qubits using a fixed quantum code C. This is the idea underlying several papers in which accuracy thresholds for quantum computing are proven. We briefly describe the reasoning as to why a threshold for FTQC exists and conclude with some historical remarks and pointers to further reading.

Intuition Behind the Threshold Theorem

Let C be a quantum code that encodes only one qubit into several qubits, that is, an [[n, 1, d]] code which can correct
up to t = ⌊(d − 1)/2⌋ errors. Encoding one qubit with this code reduces the error rate to roughly p^{t+1}, compared with the given error rate of p. Repeating this process h times results in a quantum code C^h with parameters [[n^h, 1, d^h]]. Although the code C^h has a very poor rate, it leads to a dramatic reduction of the error probability at a relatively small price in terms of overhead: for h levels of concatenation with a distance-3 code, the error probability becomes p_th (p/p_th)^{2^h}, where p_th denotes the threshold probability. This threshold is given by p_th = 1/c, where c roughly denotes the number of events that can cause a logical error under the constraining assumptions made about the physical error model. It should be noted that, in FTQC, usually one parameter is given, namely the probability p of any gate failing during the computation. This probability p is actually a property of a given, fixed set of universal quantum gates. The quantity 1 − p is given as the minimum of the probability of projecting the state obtained from the faulty gate onto the correct quantum state, where the minimum is taken over all gates in the universal set and all states of the Hilbert space.

History of the Threshold Theorem

The first method for FTQC was given by Shor [65], who introduced the idea of alternating error correction and actual quantum gates in order to do long quantum computations that can tolerate more noise than unencoded operations. He introduced and used universal fault-tolerant gates and showed how to measure the syndrome of a stabilizer code fault-tolerantly by using cat states. Around the same time, ideas based on concatenation of quantum codes were used by Aharonov and Ben-Or [2], Kitaev [42], Gottesman [26], Knill, Laflamme and Zurek [45], and Preskill [58] to derive FTQC schemes that can perform arbitrarily long computations provided the given error rate lies below a certain value, namely the threshold for FTQC.
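The doubly exponential error suppression achieved by concatenation can be made concrete in a few lines; the numerical values below are illustrative assumptions, not figures from the article:

```python
# Sketch of the concatenation recursion for a distance-3 code: one level
# maps p -> c * p**2 with threshold p_th = 1/c, so after h levels
#   p_h = p_th * (p / p_th) ** (2 ** h).

p_th = 1e-4  # assumed threshold (illustrative)
p = 1e-5     # assumed physical error rate, one order of magnitude below threshold

rates = [p_th * (p / p_th) ** (2 ** h) for h in range(4)]
# roughly 1e-5, 1e-6, 1e-8, 1e-12: doubly exponential suppression

# The closed form agrees with iterating the recursion one level at a time:
q = p
for _ in range(3):
    q = q * q / p_th  # one level of concatenation: p -> c * p**2
assert abs(q - rates[3]) < 1e-18
```

Above threshold (p > p_th) the same formula grows instead of shrinking, which is exactly why p_th separates the two regimes.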
In these early works, the QECCs used had to be at least two-error-correcting. This requirement was later relaxed: it was shown by Reichardt [62] and by Aliferis, Gottesman, and Preskill [3] that one-error-correcting quantum codes also yield a positive threshold. This allowed much higher estimates of the threshold to be obtained.

In the following years, several attempts have been made to obtain good bounds on the true value of the threshold. Starting with [70], there have been attempts to compute numerical approximations to the true value of the threshold via Monte Carlo simulations. For this, a particular quantum error-correcting code and a particular error-correction strategy are fixed. Then, in a classical simulation, errors are introduced randomly and it is checked how many of them cause uncorrectable errors. This result can then be used to obtain an estimate of the threshold. This technique was later criticized [3] as being prone to overestimating the true threshold value. A further disadvantage is that more advanced error models such as local non-Markovian noise cannot be captured by this method. On the other hand, in FTQC there has been a tradition of giving rigorous, analytical proofs for bounds on the threshold by analyzing the recursion. Examples of this method include [2,45]. Typically, this gives a fairly low bound for the true value of the threshold because all error effects are assumed to be worst-case. A recently studied intermediate model of proving bounds on the threshold makes educated guesses that certain effects are negligible and then calculates the threshold under these assumptions. An advantage of this method is that it never underestimates the true value of the threshold while giving significantly higher values.

Quantum Error Correction and Fault Tolerant Quantum Computing, Figure 6
Concatenated quantum codes. Shown is one level of concatenation for the seven-qubit [[7, 1, 3]] code. Each of the seven physical qubits in the first layer is replaced by seven qubits encoded again into a [[7, 1, 3]] quantum code. If the initial error rate is given by ε, then iterating this construction h times yields a quantum code with parameters [[7^h, 1, 3^h]] and a resulting error rate of roughly c^{2^h − 1} ε^{2^h}

Similar to the DiVincenzo criteria for building a quantum computer [21], there is a list of requirements and desiderata for universal FTQC attributable to Gottesman [29], which we repeat here. The point of this list is that any quantum computer that performs its computations in a fault-tolerant fashion must have the following features:
– The error scaling must be benign, that is, the error rates must not increase as the computer gets larger. Also, there must not be any large-scale correlated errors during the computation.
– The gate error rates must be low.
– The architecture must support the ability to perform operations in parallel.
– There must be a way of remaining in, or of returning to, the computational Hilbert space, thereby preventing leakage errors.
– There must be a source of fresh initialized qubits during the computation.

Further Reading

For an overview of results on FTQC including noise thresholds for models based on post-selection and a detailed treatment of many FTQC constructions, we recommend Reichardt’s PhD thesis [61]. That reference also gives a discussion of alternative techniques to achieve FTQC such as topological quantum computing [20]. For an overview of developments on non-Markovian models, extended rectangles and gadgets used for fault-tolerant error correction, the reader is referred to [3]. Complementing the lower-bound results on the threshold are results which give upper bounds, that is, results which determine noise levels above which no quantum computation is possible. Results in this direction were first obtained by Razborov [60] and have been further improved [40]. Finally, for a comparison between different methods for FTQC based on short block codes combined with concatenation, see the recent paper [19].

Further Aspects
In this article we have considered only qubit systems. Both quantum error correction and fault-tolerant quantum computation can be generalized to quantum systems that are composed of higher-dimensional systems (see [6,28,36]). The stabilizer formalism has also been extended from block codes to quantum convolutional codes [17,24,57]. This class of codes allows us to encode and decode a stream of quantum information without partitioning it into blocks of fixed size. For both quantum block codes and quantum convolutional codes there are efficient ways of encoding quantum information [18,33,37]. Building on the connection between stabilizer codes and classical additive codes, various constructions of QECCs have been proposed (for an overview see [41]). Among those are cyclic codes, for which efficient encoding and sub-optimal decoding algorithms are known [32], as well as families of good QECCs based on classical codes from algebraic geometry [54]. More recently, the use of classical LDPC codes for quantum error correction has been proposed [16,52]. Since classical LDPC codes achieve very good performance, it is hoped that their quantum versions result in a high threshold for FTQC. Allowing non-additive classical codes in the construction of QECCs can lead to QECCs whose dimension is larger than that of stabilizer codes [34,59]. Instead of using active methods to obtain information about an error and subsequently correct it, some physical systems allow for passive error protection. If the interaction of the quantum system with its environment possesses some symmetry, the quantum information can be stored in a so-called decoherence-free subspace on which the environment has no effect [22,51,71]. A generalization of this concept is given by decoherence-free subsystems (more details can be found in the review article [50]). Combining the ideas of active quantum error correction and encoding information into subsystems leads to operator quantum error correction [48,49]. Within this framework it is possible to derive simplified active error-correcting schemes in which the error is only partially corrected while the residual error affects only the state of a gauge subsystem [8]. These codes yield a higher threshold for FTQC. In a communication scenario, quantum information can be sent via teleportation. For this one needs entangled quantum states shared by the sender and receiver.
It is possible to distill these entangled states from noisy entangled states [11]. More recently, the construction of QECCs which combine active error correction with pre-shared entanglement has been proposed [13].

Bibliography

1. Aaronson S, Gottesman D (2004) Improved simulation of stabilizer circuits. Phys Rev A 70(5):052328
2. Aharonov D, Ben-Or M (1997) Fault-tolerant quantum computation with constant error. In: Proceedings of 29th ACM Symposium on Theory of Computing (STOC’97). ACM, El Paso, pp 176–188
3. Aliferis P, Gottesman D, Preskill J (2006) Quantum accuracy threshold for concatenated distance-3 codes. Quantum Inf Comput 6(2):97–165
4. Aliferis P, Gottesman D, Preskill J (2008) Accuracy threshold for postselected quantum computation. Quantum Inf Comput 8:181–244
5. Anders S, Briegel HJ (2006) Fast simulation of stabilizer circuits using a graph-state representation. Phys Rev A 73:022334
6. Ashikhmin A, Knill E (2001) Nonbinary quantum stabilizer codes. IEEE Trans Inf Theory 47(7):3065–3072
7. Avizienis A, Laprie JC, Randell B, Landwehr C (2004) Basic concepts and taxonomy of dependable and secure computing. IEEE Trans Dependable Secur Comput 1(1):11–33
8. Bacon D (2006) Operator quantum error-correcting subsystems for self-correcting quantum memories. Phys Rev A 73:012340
9. Barenco A, Berthiaume A, Deutsch D, Ekert A, Macchiavello C (1997) Stabilization of quantum computations by symmetrization. SIAM J Comput 26(5):1541–1557
10. Bennett CH, Brassard G, Crépeau C, Jozsa R, Peres A, Wootters WK (1993) Teleporting an unknown quantum state via dual classical and Einstein–Podolsky–Rosen channels. Phys Rev Lett 70(13):1895–1899
11. Bennett CH, DiVincenzo DP, Smolin JA, Wootters WK (1996) Mixed-state entanglement and quantum error correction. Phys Rev A 54(5):3824–3851
12. Berlekamp ER, McEliece RJ, van Tilborg HCA (1978) On the inherent intractability of certain coding problems. IEEE Trans Inf Theory 24(3):384–386
13. Brun T, Devetak I, Hsieh MH (2006) Correcting quantum errors with entanglement. Science 314(5798):436–439
14. Calderbank AR, Rains EM, Shor PW, Sloane NJA (1998) Quantum error correction via codes over GF(4). IEEE Trans Inf Theory 44(4):1369–1387
15. Calderbank AR, Shor PW (1996) Good quantum error-correcting codes exist. Phys Rev A 54(2):1098–1105
16. Camara T, Ollivier H, Tillich JP (2005) Constructions and performance of classes of quantum LDPC codes. Preprint quant-ph/0502086
17. Chau HF (1998) Quantum convolutional codes. Phys Rev A 58(2):905–909
18. Cleve R, Gottesman D (1997) Efficient computations of encodings for quantum error correction. Phys Rev A 56(1):76–82
19. Cross AW, DiVincenzo DP, Terhal BM (2007) A comparative code study for quantum fault tolerance. Preprint arXiv:0711.1556 [quant-ph]
20. Dennis E, Kitaev A, Landahl A, Preskill J (2002) Topological quantum memory. J Math Phys 43:4452
21. DiVincenzo DP (2000) The physical implementation of quantum computation. Fortschr Phys 48(9–11):771–783
22. Duan LM, Guo GC (1997) Preserving coherence in quantum computation by pairing quantum bits. Phys Rev Lett 79(10):1953–1956
23. Elnozahy EN, Alvisi L, Wang YM, Johnson DB (2002) A survey of rollback-recovery protocols in message-passing systems. ACM Comput Surv 34(3):375–408
24. Forney GD Jr, Grassl M, Guha S (2007) Convolutional and tail-biting quantum error-correcting codes. IEEE Trans Inf Theory 53(3):865–880
25. Gottesman D (1996) A class of quantum error-correcting codes saturating the quantum Hamming bound. Phys Rev A 54(3):1862–1868
26. Gottesman D (1997) Stabilizer codes and quantum error correction. PhD thesis, California Institute of Technology, Pasadena
27. Gottesman D (1998) A theory of fault-tolerant quantum computation. Phys Rev A 57(1):127–137
28. Gottesman D (1999) Fault-tolerant quantum computation with higher-dimensional systems. Chaos Solitons Fractals 10(10):1749–1758
29. Gottesman D (2005) Requirements and desiderata for fault-tolerant quantum computing: Beyond the DiVincenzo criteria. http://www.perimeterinstitute.ca/personal/dgottesman/FTreqs.ppt
30. Gottesman D, Chuang IL (1999) Demonstrating the viability of universal quantum computation using teleportation and single-qubit operations. Nature 402:390–393
31. Grassl M (2002) Algorithmic aspects of quantum error-correcting codes. In: Brylinski RK, Chen G (eds) Mathematics of quantum computation. CRC, Boca Raton, pp 223–252
32. Grassl M, Beth T (2000) Cyclic quantum error-correcting codes and quantum shift registers. Proc Royal Soc London A 456(2003):2689–2706
33. Grassl M, Rötteler M (2006) Non-catastrophic encoders and encoder inverses for quantum convolutional codes. In: Proceedings 2006 IEEE International Symposium on Information Theory (ISIT 2006), Seattle, pp 1109–1113
34. Grassl M, Rötteler M (2008) Quantum Goethals–Preparata codes. In: Proceedings 2008 IEEE International Symposium on Information Theory (ISIT 2008), Toronto, pp 300–304
35. Grassl M, Beth T, Pellizzari T (1997) Codes for the quantum erasure channel. Phys Rev A 56(1):33–38
36. Grassl M, Beth T, Rötteler M (2004) On optimal quantum codes. Int J Quantum Inf 2(1):55–64
37. Grassl M, Rötteler M, Beth T (2003) Efficient quantum circuits for non-qubit quantum error-correcting codes. Int J Found Comput Sci 14(5):757–775
38. Hamming RW (1986) Coding and information theory. Prentice-Hall, Englewood Cliffs
39. Haroche S, Raimond JM (1996) Quantum computing: Dream or nightmare? Phys Today 49(8):51–52
40. Kempe J, Regev O, Unger F, de Wolf R (2008) Upper bounds on the noise threshold for fault-tolerant quantum computing. Preprint arXiv:0802.1464 [quant-ph]
41. Ketkar A, Klappenecker A, Kumar S, Sarvepalli PK (2006) Nonbinary stabilizer codes over finite fields. IEEE Trans Inf Theory 52(11):4892–4914
42. Kitaev A (1997) Quantum computations: Algorithms and error correction. Russ Math Surv 52(6):1191–1249
43. Knill E (2005) Quantum computing with realistically noisy devices. Nature 434:39–44
44. Knill E, Laflamme R (1997) Theory of quantum error-correcting codes. Phys Rev A 55(2):900–911
45. Knill E, Laflamme R, Zurek WH (1998) Resilient quantum computation: Error models and thresholds. Proc Royal Soc London A 454:365–384. Preprint quant-ph/9702058
46. Knill E, Laflamme R, Milburn G (2001) A scheme for efficient quantum computation with linear optics. Nature 409:46–52
47. Kraus K (1983) States, effects, and operations. Lecture Notes in Physics, vol 190. Springer, Berlin
48. Kribs DW, Laflamme R, Poulin D (2005) Unified and generalized approach to quantum error correction. Phys Rev Lett 94(18):180501
49. Kribs DW, Laflamme R, Poulin D, Lesosky M (2006) Operator quantum error correction. Quantum Inf Comput 6(3–4):382–399
50. Lidar DA, Whaley KB (2003) Decoherence-free subspaces and subsystems. In: Benatti F, Floreanini R (eds) Irreversible quantum dynamics. Lecture Notes in Physics, vol 622. Springer, Berlin, pp 83–120
51. Lidar DA, Chuang IL, Whaley KB (1998) Decoherence-free subspaces for quantum computation. Phys Rev Lett 81(12):2594–2597
52. MacKay DJC, Mitchison G, McFadden PL (2004) Sparse-graph codes for quantum error correction. IEEE Trans Inf Theory 50(10):2315–2330
53. MacWilliams FJ, Sloane NJA (1977) The theory of error-correcting codes. North-Holland, Amsterdam
54. Matsumoto R (2002) Improvement of Ashikhmin–Litsyn–Tsfasman bound for quantum codes. IEEE Trans Inf Theory 48(7):2122–2124
55. von Neumann J (1956) Probabilistic logics and the synthesis of reliable organisms from unreliable components. In: Shannon CE, McCarthy J (eds) Automata studies. Princeton University Press, Princeton, pp 43–98
56. Nielsen MA, Chuang IL (2000) Quantum computation and quantum information. Cambridge University Press, Cambridge
57. Ollivier H, Tillich JP (2003) Description of a quantum convolutional code. Phys Rev Lett 91(17):177902
58. Preskill J (1998) Reliable quantum computers. Proc Royal Soc London A 454:385–410. Preprint quant-ph/9705031
59. Rains EM, Hardin RH, Shor PW, Sloane NJA (1997) A nonadditive quantum code. Phys Rev Lett 79(5):953–954
60. Razborov A (2004) An upper bound on the threshold quantum decoherence rate. Quantum Inf Comput 4(3):222–228
61. Reichardt BW (2006) Error-detection-based quantum fault tolerance against discrete Pauli noise. PhD thesis, University of California, Berkeley
62. Reichardt BW (2006) Fault-tolerance threshold for a distance-three quantum code. In: Proceedings of the 2006 International Colloquium on Automata, Languages and Programming (ICALP’06). Lecture Notes in Computer Science, vol 4051. Springer, Berlin, pp 50–61
63. Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423, 623–656. http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html
64. Shor PW (1995) Scheme for reducing decoherence in quantum computer memory. Phys Rev A 52(4):R2493–R2496
65. Shor PW (1996) Fault-tolerant quantum computation. In: Proceedings of 37th Annual Symposium on Foundations of Computer Science (FOCS’96). IEEE Press, Burlington, pp 56–65. Preprint quant-ph/9605011
66. Steane AM (1996) Error correcting codes in quantum theory. Phys Rev Lett 77(5):793–797
67. Steane AM (1996) Simple quantum error correcting codes. Phys Rev A 54(6):4741–4751
68. Steane AM (1997) Active stabilization, quantum computation and quantum state synthesis. Phys Rev Lett 78(11):2252–2255
69. Wootters WK, Zurek WH (1982) A single quantum cannot be cloned. Nature 299(5886):802–803
70. Zalka C (1996) Threshold estimate for fault tolerant quantum computation. Preprint quant-ph/9612028
71. Zanardi P, Rasetti M (1997) Noiseless quantum codes. Phys Rev Lett 79(17):3306–3309
Quantum Information Processing
SETH LLOYD
W.M. Keck Center for Extreme Quantum Information Processing (xQIT), MIT, Cambridge, USA

Article Outline
Glossary
Definition of the Subject
Introduction
Quantum Mechanics
Quantum Computation
Noise and Errors
Quantum Communication
Implications and Conclusions
Bibliography

Glossary
Algorithm A systematic procedure for solving a problem, frequently implemented as a computer program.
Bit The fundamental unit of information, representing the distinction between two possible states, conventionally called 0 and 1. The word ‘bit’ is also used to refer to a physical system that registers a bit of information.
Boolean algebra The mathematics of manipulating bits using simple operations such as AND, OR, NOT, and COPY.
Communication channel A physical system that allows information to be transmitted from one place to another.
Computer A device for processing information. A digital computer uses Boolean algebra (q. v.) to process information in the form of bits.
Cryptography The science and technique of encoding information in a secret form. The process of encoding is called encryption, and a system for encoding and decoding is called a cipher. A key is a piece of information used for encoding or decoding. Public-key cryptography operates using a public key by which information is encrypted, and a separate private key by which the encrypted message is decoded.
Decoherence A peculiarly quantum form of noise that has no classical analog. Decoherence destroys quantum superpositions and is the most important and ubiquitous form of noise in quantum computers and quantum communication channels.
Error-correcting code A technique for encoding information in a form that is resistant to errors. The syndrome is the part of the code that allows the error to be detected and that specifies how it should be corrected.
Entanglement A peculiarly quantum form of correlation that is responsible for many types of quantum weirdness. Entanglement arises when two or more quantum systems exist in a superposition of correlated states.
Entropy Information registered by the microscopic motion of atoms and molecules. The second law of thermodynamics (q. v.) states that entropy does not decrease over time.
Fault-tolerant computation Computation that uses error-correcting codes to perform algorithms faithfully in the presence of noise and errors. If the rate of errors falls below a certain threshold, then computations of any desired length can be performed in a fault-tolerant fashion. Also known as robust computation.
Information When used in a broad sense, information is data, messages, meaning, knowledge, etc. Used in the more specific sense of information theory, information is a quantity that can be measured in bits.
Logic gate A physical system that performs the operations of Boolean algebra (q. v.), such as AND, OR, NOT, and COPY, on bits.
Moore’s law The observation, first made by Gordon Moore, that the power of computers increases by a factor of two every year and a half or so.
Quantum algorithm An algorithm designed specifically to be performed by a quantum computer using quantum logic. Quantum algorithms exploit the phenomena of superposition and entanglement to solve problems more rapidly than classical computer algorithms can. Examples of quantum algorithms include Shor’s algorithm for factoring large numbers and breaking public-key cryptosystems, Grover’s algorithm for searching databases, quantum simulation, the adiabatic algorithm, etc.
Quantum bit A bit registered by a quantum-mechanical system such as an atom, photon, or nuclear spin.
A quantum bit, or ‘qubit’, has the property that it can exist in a quantum superposition of the states 0 and 1.
Qubit A quantum bit.
Quantum communication channel A communication channel that transmits quantum bits. The most common communication channel is the bosonic channel, which transmits information using light, sound, or other substances whose elementary excitations consist of bosons (photons for light, phonons for sound).
Quantum computer A computer that operates on quantum bits to perform quantum algorithms. Quantum computers have the feature that they can preserve quantum superpositions and entanglement.
Quantum cryptography A cryptographic technique that encodes information on quantum bits. Quantum cryptography uses the fact that measuring quantum systems typically disturbs them to implement cryptosystems whose security is guaranteed by the laws of physics. Quantum key distribution (QKD) is a quantum cryptographic technique for distributing secret keys.
Quantum error-correcting code An error-correcting code that corrects for the effects of noise on quantum bits. Quantum error-correcting codes can correct for the effect of decoherence (q. v.) as well as for conventional bit-flip errors.
Quantum information Information that is stored on qubits rather than on classical bits.
Quantum mechanics The branch of physics that describes how matter and energy behave at their most fundamental scales. Quantum mechanics is famously weird and counterintuitive.
Quantum weirdness A catch-all term for the strange and counterintuitive aspects of quantum mechanics. Well-known instances of quantum weirdness include Schrödinger’s cat (q. v.), the Einstein–Podolsky–Rosen thought experiment, violations of Bell’s inequalities, and the Greenberger–Horne–Zeilinger experiment.
Reversible logic Logical operations that do not discard information. Quantum computers operate using reversible logic.
Schrödinger’s cat A famous example of quantum weirdness: a thought experiment proposed by Erwin Schrödinger, in which a cat is put in a quantum superposition of being alive and being dead. Not sanctioned by the Society for Prevention of Cruelty to Animals.
Second law of thermodynamics The second law of thermodynamics states that entropy does not decrease. An alternative formulation of the second law states that it is not possible to build a perpetual motion machine.
Superposition The defining feature of quantum mechanics, which allows particles such as electrons to exist in two or more places at once. Quantum bits can exist in superpositions of 0 and 1 simultaneously.
Teleportation A form of quantum communication that uses pre-existing entanglement and classical communication to send quantum bits from one place to another.
Definition of the Subject
Quantum mechanics is the branch of physics that describes how systems behave at their most fundamental level. The theory of information processing studies how information can be transferred and transformed. Quantum information science, then, is the theory of communication and computation at the most fundamental physical level. Quantum computers store and process information at the level of individual atoms. Quantum communication systems transmit information on individual photons. Over the past half century, the wires and logic gates in computers have halved in size every year and a half, a phenomenon known as Moore’s law. If this exponential rate of miniaturization continues, then the components of computers should reach the atomic scale within a few decades. Even at current (2008) length scales of a little larger than one hundred nanometers, quantum mechanics plays a crucial role in governing the behavior of these wires and gates. As the sizes of computer components press down toward the atomic scale, the theory of quantum information processing becomes increasingly important for characterizing how computers operate. Similarly, as communication systems become more powerful and efficient, the quantum mechanics of information transmission becomes the key element in determining the limits of their power. Miniaturization and the consequences of Moore’s law are not the primary reason for studying quantum information, however. Quantum mechanics is weird: electrons, photons, and atoms behave in strange and counterintuitive ways. A single electron can exist in two places simultaneously. Photons and atoms can exhibit a bizarre form of correlation called entanglement, a phenomenon that Einstein characterized as spukhafte Fernwirkung, or ‘spooky action at a distance’. Quantum weirdness extends to information processing. Quantum bits can take on the values of 0 and 1 simultaneously.
Entangled photons can be used to teleport the states of matter from one place to another. The essential goal of quantum information science is to determine how quantum weirdness can be used to enhance the capabilities of computers and communication systems. For example, even a moderately sized quantum computer, containing a few tens of thousands of bits, would be able to factor large numbers and thereby break cryptographic systems that have until now resisted the attacks of even the largest classical supercomputers [1]. Quantum computers could search databases faster than classical computers. Quantum communication systems allow information to be transmitted in a manner whose security against eavesdropping is guaranteed by the laws of physics.
Prototype quantum computers that store bits on individual atoms and quantum communication systems that transmit information using individual photons have been built and operated. These prototypes have been used to confirm the predictions of quantum information theory and to explore the behavior of information processing at the most microscopic scales. If larger, more powerful versions of quantum computers and communication systems become readily available, they will offer considerable enhancements over existing computers and communication systems. In the meanwhile, the field of quantum information processing is constructing a unified theory of how information can be registered and transformed at the fundamental limits imposed by physical law. The remainder of this article is organized as follows. Section “Introduction” reviews the history of ideas of information, of computation, and of the role of information in quantum mechanics. Section “Quantum Mechanics” introduces the formalism of quantum mechanics and applies it to the idea of quantum information. Section “Quantum Computation” defines quantum computers and presents their properties. Section “Noise and Errors” explores the effects of noise and errors. Section “Quantum Communication” delineates the role of quantum mechanics in setting limits to the capacity of communication channels, and explains quantum cryptography. Section “Implications and Conclusions” discusses implications. This review of quantum information theory is mathematically self-contained in the sense that all the mathematics necessary for understanding the quantum effects treated in detail here is contained in the introductory section on quantum mechanics. By necessity, not all topics in quantum information theory can be treated in detail within the confines of this article. We have chosen to treat a few key subjects in more detail; in the case of other topics we supply references to more complete treatments.
The standard reference on quantum information theory is the text by Nielsen and Chuang [1], to which the reader may turn for in-depth treatments of most of the topics covered here. One topic that is left largely uncovered is the broad field of quantum technologies and techniques for actually building quantum computers and quantum communication systems. Quantum technologies are rapidly changing,
and no brief review like the one given here could adequately cover both the theoretical and the experimental aspects of quantum information processing.
Introduction
Information
Quantum information processing as a distinct, widely recognized field of scientific inquiry has arisen only recently, since the early 1990s. The mathematical theory of information and information processing dates to the mid-twentieth century. Ideas of quantum mechanics, information, and the relationships between them, however, date back more than a century. Indeed, the basic formulae of information theory were discovered in the second half of the nineteenth century, by James Clerk Maxwell, Ludwig Boltzmann, and J. Willard Gibbs [2]. These statistical mechanicians were searching for the proper mathematical characterization of the physical quantity known as entropy. Prior to Maxwell, Boltzmann, and Gibbs, entropy was known as a somewhat mysterious quantity that reduced the amount of work that steam engines could perform. After their work established the proper formula for entropy, it became clear that entropy was in fact a form of information — the information required to specify the actual microscopic state of the atoms in a substance such as a gas. If a system has W possible states, then it takes log2 W bits to specify one state. Equivalently, any system with distinct states can be thought of as registering information, and a system that can exist in one out of W equally likely states can register log2 W bits of information. The formula, S = k log W, engraved on Boltzmann’s tomb, means that entropy S is proportional to the number of bits of information registered by the microscopic state of a system such as a gas. (Ironically, this formula was first written down not by Boltzmann, but by Max Planck [3], who also gave the first numerical value, 1.38 × 10⁻²³ J/K, for the constant k. Consequently, k is called Planck’s constant in early works on statistical mechanics [2].
As the fundamental constant of quantum mechanics, h = 6.63 × 10⁻³⁴ joule-seconds, on which more below, is also called Planck’s constant, k was renamed Boltzmann’s constant and is now typically written k_B.) Although the beginning of the information processing revolution was still half a century away, Maxwell, Boltzmann, Gibbs, and their fellow statistical mechanicians were well aware of the connection between information and entropy. These researchers established that if the probability of the ith microscopic state of some system is p_i, then the entropy of the system is S = −k_B Σ_i p_i ln p_i. The quantity Σ_i p_i ln p_i was first introduced by Boltzmann, who called it H. Boltzmann’s famous H-theorem declares that H never increases [2]. The H-theorem is an expression of the second law of thermodynamics, which declares that S = −k_B H never decreases. Note that this formula for S reduces to that on Boltzmann’s tomb when all the states are equally likely, so that p_i = 1/W. Since the probabilities for the microscopic state of a physical system depend on the knowledge possessed about the system, it is clear that entropy is related to information. The more certain one is about the state of a system – the more information one possesses about the system – the lower its entropy. As early as 1867, Maxwell introduced his famous ‘demon’ as a hypothetical being that could obtain information about the actual state of a system such as a gas, thereby reducing the number of states W compatible with the information obtained, and so decreasing the entropy [4]. Maxwell’s demon therefore apparently contradicts the second law of thermodynamics. The full resolution of the Maxwell’s demon paradox was not obtained until the end of the twentieth century, when the theory of the physics of information processing described in this review had been fully developed.
Quantum Mechanics
For the entropy, S, to be finite, a system can only possess a finite number W of possible states. In the context of classical mechanics, this feature is problematic, as even the simplest of classical systems, such as a particle moving along a line, possesses an infinite number of possible states. The continuous nature of classical mechanics frustrated attempts to use the formula for entropy to calculate many physical quantities such as the amount of energy and entropy in the radiation emitted by hot objects, the so-called ‘black body radiation’.
Calculations based on classical mechanics suggested the amount of energy and entropy emitted by such objects should be infinite, as the number of possible states of a classical oscillator such as a mode of the electromagnetic field was infinite. This problem is known as ‘the ultraviolet catastrophe’. In 1901, Planck obtained a resolution to this problem by suggesting that such oscillators could only possess discrete energy levels [3]: the energy of an oscillator that vibrates with frequency ν can only come in multiples of hν, where h is Planck’s constant defined above. Energy is quantized. In that same paper, as noted above, Planck first wrote down the formula S = k log W, where W referred to the number of discrete energy states of a collection of oscillators. In other words, the very first paper on quantum mechanics
was about information. By introducing quantum mechanics, Planck made information/entropy finite. Quantum information as a distinct field of inquiry may be young, but its origins are old: the origin of quantum information coincides with the origin of quantum mechanics. Quantum mechanics implies that nature is, at bottom, discrete. Nature is digital. After Planck’s advance, Einstein was able to explain the photo-electric effect using quantum mechanics [5]. When light hits the surface of a metal, it kicks off electrons. The energy of the electrons kicked off depends only on the frequency of the light, and not on its intensity. Following Planck, Einstein’s interpretation of this phenomenon was that the energy in the light comes in chunks, or quanta, each of which possesses energy hν. These quanta, or particles of light, were subsequently termed photons. Following Planck and Einstein, Niels Bohr used quantum mechanics to derive the spectrum of the hydrogen atom [6]. In the mid-nineteen-twenties, Erwin Schrödinger and Werner Heisenberg put quantum mechanics on a sound mathematical footing [7,8]. Schrödinger derived a wave equation – the Schrödinger equation – that described the behavior of particles. Heisenberg derived a formulation of quantum mechanics in terms of matrices, matrix mechanics, which was subsequently realized to be equivalent to Schrödinger’s formulation. With the precise formulation of quantum mechanics in place, the implications of the theory could now be explored in detail. It had always been clear that quantum mechanics was strange and counterintuitive: Bohr formulated the phrase ‘wave-particle duality’ to capture the strange way in which waves, like light, appeared to be made of particles, like photons. Similarly, particles, like electrons, appeared to be associated with waves, which were solutions to Schrödinger’s equation.
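The Planck–Einstein quantization rule can be checked numerically. The sketch below is illustrative only: the function name is mine, and the example frequency (typical of green light) does not come from the text.

```python
h = 6.63e-34  # Planck's constant, joule-seconds (value quoted above)

def photon_energy(frequency_hz):
    """Planck-Einstein relation: each quantum of light carries E = h * nu."""
    return h * frequency_hz

# Green light, nu ~ 5.6e14 Hz: light of this frequency delivers energy
# only in integer multiples of a quantum of a few times 1e-19 joules.
E = photon_energy(5.6e14)
print(E)  # ~3.7e-19 J per photon
```

In the photoelectric effect, this is why the energy of an ejected electron depends on the light's frequency ν and not on its intensity: brighter light delivers more quanta, not bigger ones.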
Now that the mathematical underpinnings of quantum mechanics were in place, however, it became clear that quantum mechanics was downright weird. In 1935, Einstein, together with his collaborators Boris Podolsky and Nathan Rosen, came up with a thought experiment (now called the EPR experiment after its originators) involving two photons that are correlated in such a way that a measurement made on one photon appears instantaneously to affect the state of the other photon [9]. Schrödinger called this form of correlation ‘entanglement’. Einstein, as noted above, referred to it as ‘spooky action at a distance’. Although it became clear that entanglement could not be used to transmit information faster than the speed of light, the implications of the EPR thought experiment were so apparently bizarre that Einstein felt that it demonstrated that quantum mechanics was fundamentally incorrect. The EPR experiment will be discussed in detail below. Unfortunately for Einstein, when the EPR experiment was eventually performed, it confirmed the counterintuitive predictions of quantum mechanics. Indeed, every experiment ever performed so far to test the predictions of quantum mechanics has confirmed them, suggesting that, despite its counterintuitive nature, quantum mechanics is fundamentally correct. At this point, it is worth noting a curious historical phenomenon, which persists to the present day, in which a famous scientist who received his or her Nobel prize for work in quantum mechanics, publicly expresses distrust or disbelief in quantum mechanics. Einstein is the best known example of this phenomenon, but more recent examples exist, as well. The origin of this phenomenon can be traced to the profoundly counterintuitive nature of quantum mechanics. Human infants, by the age of a few months, are aware that objects – at least, large, classical objects like toys or parents – cannot be in two places simultaneously. Yet in quantum mechanics, this intuition is violated repeatedly. Nobel laureates typically possess a powerful sense of intuition: if Einstein is not allowed to trust his intuition, then who is? Nonetheless, quantum mechanics contradicts their intuition just as it does everyone else’s. Einstein’s intuition told him that quantum mechanics was wrong, and he trusted that intuition. Meanwhile, scientists who are accustomed to their intuitions being proved wrong may accept quantum mechanics more readily. One of the accomplishments of quantum information processing is that it allows quantum weirdness such as that found in the EPR experiment to be expressed and investigated in precise mathematical terms, so we can discover exactly how and where our intuition goes wrong.
In the 1950s and 60s, physicists such as David Bohm, John Bell, and Yakir Aharonov, among others, investigated the counterintuitive aspects of quantum mechanics and proposed further thought experiments that threw those aspects into high relief [10,11,12]. Whenever those thought experiments have been turned into actual physical experiments, as in the well-known Aspect experiment that realized Bell’s version of the EPR experiment [13], the predictions of quantum mechanics have been confirmed. Quantum mechanics is weird and we just have to live with it. As will be seen below, quantum information processing not only allows us to express the counterintuitive aspects of quantum mechanics in precise terms, it also allows us to exploit those strange phenomena to compute and to communicate in ways that our classical intuitions would tell us are impossible. Quantum weirdness is not a bug, but a feature.
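The Bell tests mentioned above can be made quantitative. In the CHSH form of Bell's inequality (a standard formulation, not spelled out in the text), any local hidden-variable theory obeys |S| ≤ 2 for a certain combination S of correlations, while quantum mechanics predicts correlations E(a, b) = −cos(a − b) between entangled spins measured along directions a and b, which can push |S| up to 2√2. A short numerical check, with the standard angle settings chosen for maximal violation:

```python
import math

def E(a, b):
    # Quantum-mechanical prediction for the correlation between
    # measurements on an entangled (singlet) pair along angles a and b.
    return -math.cos(a - b)

a, a2 = 0.0, math.pi / 2              # Alice's two measurement settings
b, b2 = math.pi / 4, 3 * math.pi / 4  # Bob's two measurement settings

# CHSH combination: bounded by 2 for any local hidden-variable model.
S = E(a, b) - E(a, b2) + E(a2, b) + E(a2, b2)
print(abs(S))  # 2*sqrt(2) ~ 2.828, exceeding the classical bound of 2
```

This violation is exactly what the Aspect experiment measured, confirming the quantum prediction.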
Computation
Although rudimentary mechanical calculators had been constructed by Pascal and Leibnitz, amongst others, the first attempts to build a full-blown digital computer also lie in the nineteenth century. In 1822, Charles Babbage conceived the first of a series of mechanical computers, beginning with the fifteen-ton Difference Engine, intended to calculate and print out polynomial functions, including logarithmic tables. Despite considerable government funding, Babbage never succeeded in building a working Difference Engine. He followed up with a series of designs for an Analytical Engine, which was to have been powered by a steam engine and programmed by punch cards. Had it been constructed, the Analytical Engine would have been the first modern digital computer. The mathematician Ada Lovelace is frequently credited with writing the first computer program, a set of instructions for the Analytical Engine to compute Bernoulli numbers. In 1854, George Boole’s An Investigation of the Laws of Thought laid the conceptual basis for binary computation. Boole established that any logical relation, no matter how complicated, could be built up out of the repeated application of simple logical operations such as AND, OR, NOT, and COPY. The resulting ‘Boolean logic’ is the basis for the contemporary theory of computation. While Schrödinger and Heisenberg were working out the modern theory of quantum mechanics, the modern theory of information was coming into being. In 1928, Ralph Hartley published an article, ‘The Transmission of Information’, in the Bell System Technical Journal [14]. In this article he defined the amount of information in a sequence of n symbols to be n log S, where S is the number of symbols. As the number of such sequences is S^n, this definition clearly coincides with the Planck–Boltzmann formula for entropy, taking W = S^n.
At the same time as Einstein, Podolsky, and Rosen were exploring quantum weirdness, the theory of computation was coming into being. In 1936, in his paper “On Computable Numbers, with an Application to the Entscheidungsproblem”, Alan Turing extended the earlier work of Kurt Gödel on mathematical logic, and introduced the concept of a Turing machine, an idealized digital computer [15]. Claude Shannon, in his 1937 master’s thesis, “A Symbolic Analysis of Relay and Switching Circuits”, showed how digital computers could be constructed out of electronic components [16]. (Howard Gardner called this work, “possibly the most important, and also the most famous, master’s thesis of the century”.) The Second World War provided impetus for the development of electronic digital computers. Konrad Zuse’s
Z3, built in 1941, was the first digital computer capable of performing the same computational tasks as a Turing machine. The Z3 was followed by the British Colossus, the Harvard Mark I, and the ENIAC. By the end of the 1940s, computers had begun to be built with a stored program or ‘von Neumann’ architecture (named after the pioneer of quantum mechanics and computer science John von Neumann), in which the set of instructions – or program – for the computer was stored in the computer’s memory and executed by a central processing unit. In 1948, Shannon published his groundbreaking article, “A Mathematical Theory of Communication”, in the Bell System Technical Journal [17]. In this article, perhaps the most influential work of applied mathematics of the twentieth century (following the tradition of his master’s thesis), Shannon provided the full mathematical characterization of information. He introduced his colleague John Tukey’s word ‘bit’, a contraction of ‘binary digit’, to describe the fundamental unit of information, a distinction between two possibilities, True or False, Yes or No, 0 or 1. He showed that the amount of information associated with a set of possible states i, each with probability p_i, was uniquely given by the formula −Σ_i p_i log2 p_i. When Shannon asked von Neumann what he should call this quantity, von Neumann is said to have replied that he should call it H, ‘because that’s what Boltzmann called it’. (Recalling Boltzmann’s original definition of H, given above, we see that von Neumann had evidently forgotten the minus sign.) It is interesting that von Neumann, who was one of the pioneers both of quantum mechanics and of information processing, apparently did not consider the idea of processing information in a uniquely quantum-mechanical fashion. Von Neumann had many things on his mind, however – game theory, bomb building, the workings of the brain, etc. – and can be forgiven for failing to make the connection.
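Shannon's formula can be checked with a few lines of Python (a minimal sketch; the function name is mine). For W equally likely outcomes it reduces to log2 W, the Hartley and Planck–Boltzmann count of states, which is the connection behind von Neumann's quip:

```python
import math

def shannon_entropy(probs):
    """Shannon information H = -sum_i p_i log2 p_i, measured in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin registers exactly one bit of information.
print(shannon_entropy([0.5, 0.5]))   # 1.0

# A biased coin registers less than one bit.
print(shannon_entropy([0.9, 0.1]))   # ~0.47

# For W equally likely states (p_i = 1/W), H reduces to log2 W,
# matching the count of states in the formula on Boltzmann's tomb.
W = 8
assert abs(shannon_entropy([1.0 / W] * W) - math.log2(W)) < 1e-12
```

Note the leading minus sign, which makes H non-negative: each p_i lies between 0 and 1, so each log2 p_i is negative.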
Another reason that von Neumann may not have thought of quantum computation was that, in his research into computational devices, or ‘organs’, as he called them, he had evidently reached the impression that computation intrinsically involved dissipation, a process that is inimical to quantum information processing [18]. This impression, if von Neumann indeed had it, is false, as will now be seen.
Reversible Computation
The date of Shannon’s paper is usually taken to be the beginning of the study of information theory as a distinct field of inquiry. The second half of the twentieth century saw a huge explosion in the study of information, computation, and communication. The next step towards quantum information processing took place in the early 1960s. Until that point, there was an impression, fostered by von Neumann amongst others, that computation was intrinsically irreversible: according to this view, information was necessarily lost or discarded in the course of computation. For example, a logic gate such as an AND gate takes in two bits of information as input, and returns only one bit as output: the output of an AND gate is 1 if and only if both inputs are 1; otherwise the output is 0. Because the two input bits cannot be reconstructed from the output bit, an AND gate is irreversible. Since computations are typically constructed from AND, OR, and NOT gates (or related irreversible gates such as NAND, the combination of an AND gate and a NOT gate), computations were thought to be intrinsically irreversible, discarding bits as they progress. In 1960, Rolf Landauer showed that because of the intrinsic connection between information and entropy, when information is discarded in the course of a computation, entropy must be created [19]. That is, when an irreversible logic gate such as an AND gate is applied, energy must be dissipated. So far, it seems that von Neumann could be correct. In 1963, however, Yves Lecerf showed that Turing machines could be constructed in such a way that all their operations were logically reversible [20]. The trick for making computation reversible is record-keeping: one sets up logic circuits in such a way that the values of all bits are recorded and kept. To make an AND gate reversible, for example, one adds extra circuitry to keep track of the values of the input to the AND gate. In 1973, Charles Bennett, unaware of Lecerf’s result, rederived it and, most importantly, constructed physical models of reversible computation based on molecular systems such as DNA [21].
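The record-keeping trick can be made concrete with the Toffoli (controlled-controlled-NOT) gate, the standard reversible replacement for AND. The Python sketch below simulates the gate's truth table; it is an illustration of the idea, not a description of any particular hardware:

```python
def toffoli(a, b, c):
    """Reversible AND: maps (a, b, c) -> (a, b, c XOR (a AND b)).
    With the third bit initialized to c = 0, it comes out carrying
    a AND b, while both inputs a and b are kept -- so nothing is
    discarded and the gate can be run backwards."""
    return a, b, c ^ (a & b)

# Forward: compute AND while keeping the inputs on record.
print(toffoli(1, 1, 0))  # (1, 1, 1): third bit holds 1 AND 1
print(toffoli(1, 0, 0))  # (1, 0, 0): third bit holds 1 AND 0

# Backward: the gate is its own inverse, so applying it twice
# restores every input exactly -- no information is ever lost.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert toffoli(*toffoli(a, b, c)) == (a, b, c)
```

Because the inputs survive in the output, Landauer's argument no longer forces any entropy to be created when the gate operates.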
Ed Fredkin, Tommaso Toffoli, Norman Margolus, and Frank Merkle subsequently made significant contributions to the study of reversible computation [22]. Reversible computation is important for quantum information processing because the laws of physics themselves are reversible. It’s this underlying reversibility that is responsible for Landauer’s principle: whenever a logically irreversible process such as an AND gate takes place, the information that is discarded by the computation has to go somewhere. In the case of an conventional, transistorbased AND gate, the lost information goes into entropy: to operate such an AND gate, electrical energy must be dissipated and turned into heat. That is, once the AND gate has been performed, then even if the logical circuits of the computer no longer record the values of the inputs to the gate, the microscopic motion of atoms and electrons
in the circuit effectively ‘remember’ what the inputs were. If one wants to perform computation in a uniquely quantum-mechanical fashion, it is important to avoid such dissipation: to be effective, quantum computation should be reversible.
Quantum Computation
In 1980, Paul Benioff showed that quantum mechanical systems such as arrays of spins or atoms could perform reversible computation in principle [23]. Benioff mapped the operation of a reversible Turing machine onto a quantum system and thus exhibited the first quantum-mechanical model of computation. Benioff’s quantum computer was no more computationally powerful than a conventional classical Turing machine, however: it did not exploit quantum weirdness. In 1982, Richard Feynman proposed the first non-trivial application of quantum information processing [24]. Noting that quantum weirdness made it hard for conventional, classical digital computers to simulate quantum systems, Feynman proposed a ‘universal quantum simulator’ that could efficiently simulate other quantum systems. Feynman’s device was not a quantum Turing machine, but a sort of quantum analog computer, whose dynamics could be tuned to match the dynamics of the system to be simulated. The first model of quantum computation truly to embrace and take advantage of quantum weirdness was David Deutsch’s quantum Turing machine of 1985 [25]. Deutsch pointed out that a quantum Turing machine could be designed in such a way as to use the strange and counterintuitive aspects of quantum mechanics to perform computations in ways that classical Turing machines or computers could not. In particular, just as in quantum mechanics it is acceptable (and in many circumstances, mandatory) for an electron to be in two places at once, so in a quantum computer, a quantum bit can take on the values 0 and 1 simultaneously.
One possible role for a bit in a computer is as part of a program, so that 0 instructs the computer to 'do this' and 1 instructs the computer to 'do that'. If a quantum bit that takes on the values 0 and 1 at the same time is fed into the quantum computer as part of a program, then the quantum computer will 'do this' and 'do that' simultaneously, an effect that Deutsch termed 'quantum parallelism'. Although it would be years before applications of quantum parallelism would be presented, Deutsch's paper marks the beginning of the formal theory of quantum computation. For almost a decade after the work of Benioff, Feynman, and Deutsch, quantum computers remained a curiosity. Despite the development of a few simple algorithms (described in greater detail below) that took advantage of quantum parallelism, no compelling application of quantum computation had been discovered. In addition, the original models of quantum computation were highly abstract: as Feynman noted [24], no one had the slightest notion of how to build a quantum computer. Absent a 'killer app' and a physical implementation, the field of quantum computation languished. That languor dissipated rapidly with Peter Shor's discovery in 1994 that quantum computers could be used to factor large numbers [26]. That is, given the product r of two large prime numbers, a quantum computer could find the factors p and q such that pq = r. While it might not appear so at first glance, solving this problem is indeed a 'killer app': solving the factoring problem is the key to breaking 'public-key' cryptosystems. Public-key cryptosystems are a widely used method for secure communication. Suppose that you wish to buy something from me over the internet, for example. I openly send you a public key consisting of the number r. The public key is not a secret: anyone may know it. You use the public key to encrypt your credit card information, and send me that encrypted information. To decrypt that information, I need to employ the 'private keys' p and q. The security of public-key cryptography thus depends on the factoring problem being hard: to obtain the private keys p and q from the public key r, one must factor the public key. If quantum computers could be built, then public-key cryptography would no longer be secure. This fact excited considerable interest among code breakers, and some consternation within organizations, such as security agencies, whose job it is to keep secrets.
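The connection between factoring and code breaking can be made concrete with a toy example. The sketch below (illustrative, not from the article; the primes are absurdly small, and the brute-force factoring loop stands in for Shor's algorithm) uses the textbook RSA construction, in which the private exponent follows immediately once the public modulus r is factored:

```python
# Toy RSA: why factoring the public modulus r = p*q breaks the cipher.
p_true, q_true = 61, 53
r = p_true * q_true              # public modulus, r = 3233
e = 17                           # public encryption exponent

# Alice encrypts her message m using only the public key (r, e).
m = 1234
c = pow(m, e, r)                 # ciphertext

# An eavesdropper who can factor r recovers the private key:
p = next(k for k in range(2, r) if r % k == 0)   # stand-in for Shor's algorithm
q = r // p
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)              # private decryption exponent (Python 3.8+)
assert pow(c, d, r) == m         # ciphertext read without the owner's help
```

For realistic key sizes the brute-force loop is hopeless for classical machines, which is exactly the hardness assumption that Shor's algorithm undermines.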
Compounding this interest and consternation was the fact that the year before, in 1993, Lloyd had shown how quantum computers could be built using techniques of electromagnetic resonance together with 'off-the-shelf' components such as atoms, quantum dots, and lasers [27]. In 1994, Ignacio Cirac and Peter Zoller proposed a technique for building quantum computers using ion traps [28]. These designs for quantum computers quickly resulted in small prototype quantum computers and quantum logic gates being constructed by David Wineland [29] and Jeff Kimble [30]. In 1996, Lov Grover discovered that quantum computers could search databases significantly faster than classical computers, another potentially highly useful application [31]. By 1997, simple quantum algorithms had been performed using nuclear magnetic resonance based quantum information processing [32,33,34]. The field of quantum computation was off and running. Since 1994, the field of quantum computation has expanded dramatically. The decade between the discovery
of quantum computation and the development of the first applications and implementations saw only a dozen or so papers published in the field of quantum computation. As of the date of publication of this article, it is not uncommon for a dozen papers on quantum computation to be posted on the Los Alamos preprint archive (arXiv) every day.

Quantum Communication

While the idea of quantum computation was not introduced until 1980, and not fully exploited until the mid-1990s, quantum communication has exhibited a longer and steadier advance. By the beginning of the 1960s, J. P. Gordon [35] and Lev Levitin [36] had begun to apply quantum mechanics to the analysis of the capacity of communication channels. In 1973, Alexander Holevo derived the capacity for quantum mechanical channels to transmit classical information [37] (the Holevo–Schumacher–Westmoreland theorem [38,39]). Because of its many practical applications, the so-called 'bosonic' channel has received a great deal of attention over the years [40]. Bosonic channels are quantum communication channels in which the medium of information exchange consists of bosonic quantum particles, such as photons or phonons. That is, bosonic channels include communication channels that use electromagnetic radiation, from radio waves to light, or sound. Despite many attempts, it was not until 1993 that Horace Yuen and Masanao Ozawa derived the capacity of the bosonic channel, and their result holds only in the absence of noise and loss [41]. The capacity of the bosonic channel in the presence of loss alone was not derived until 2004 [42], and the capacity of this most important of channels in the presence of noise and loss is still unknown [43]. A second use of quantum channels is to transmit quantum information, rather than classical information. The requirements for transmitting quantum information are more stringent than those for transmitting classical information.
To transmit a classical bit, one must end up sending a 0 or a 1. To transmit a quantum bit, by contrast, one must also faithfully transmit states in which the quantum bit registers 0 and 1 simultaneously. The quantity which governs the capacity of a channel to transmit quantum information is called the coherent information [44,45]. A particularly intriguing method of transmitting quantum information is teleportation [46]. Quantum teleportation closely resembles the teleportation process from the television series Star Trek. In Star Trek, entities to be teleported enter a special booth, where they are measured and dematerialized. Information about the composition of the entities is then sent to a distant location, where the entities rematerialize. Quantum mechanics at first seems to forbid Trekkian teleportation, for the simple reason that it is not possible to make a measurement that reveals an arbitrary unknown quantum state. Worse yet, any attempt to reveal that state is likely to destroy it. Nonetheless, if one adds just one ingredient to the protocol, quantum teleportation is indeed possible. That necessary ingredient is entanglement. In quantum teleportation, an entity such as a quantum bit is to be teleported from Alice at point A to Bob at point B. (For historical reasons, in communication protocols the sender of information is called Alice and the receiver is called Bob; an eavesdropper on the communication process is called Eve.) Alice and Bob possess prior entanglement in the form of a pair of Einstein–Podolsky–Rosen (EPR) particles. Alice performs a suitable measurement (described in detail below) on the qubit to be teleported together with her EPR particle. This measurement destroys the state of the particle to be teleported ('dematerializing' it), and yields two classical bits of information, which Alice sends to Bob over a conventional communication channel. Bob then performs a transformation on his EPR particle. The transformation Bob performs is a function of the information he receives from Alice: there are four possible transformations, one for each of the four possible values of the two bits he has received. After Bob has performed his transformation of the EPR particle, the state of this particle is now guaranteed to be the same as that of the original qubit that was to be teleported. Quantum teleportation forms an integral part of quantum communication and of quantum computation. Experimental demonstrations of quantum teleportation have been performed with photons and atoms as the systems whose quantum states are to be teleported [47,48].
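The four-outcome bookkeeping of the protocol can be checked with a small state-vector simulation. The sketch below is illustrative (assuming NumPy); realizing Alice's measurement as a CNOT followed by a Hadamard is the standard teleportation circuit, not spelled out in the article:

```python
import numpy as np

# Wire order: Alice's unknown qubit, Alice's EPR half, Bob's EPR half
I = np.eye(2)
X = np.array([[0., 1.], [1., 0.]])
Z = np.array([[1., 0.], [0., -1.]])
H = np.array([[1., 1.], [1., -1.]]) / np.sqrt(2)
P = [np.diag([1., 0.]), np.diag([0., 1.])]        # projectors |0><0|, |1><1|

rng = np.random.default_rng(1)
psi = rng.normal(size=2) + 1j * rng.normal(size=2)
psi /= np.linalg.norm(psi)                        # the unknown qubit to teleport

epr = np.array([1., 0., 0., 1.]) / np.sqrt(2)     # (|00> + |11>)/sqrt(2)
state = np.kron(psi, epr)                         # 3-qubit register

# Alice's Bell measurement = CNOT(0 -> 1), then H on wire 0, then read wires 0, 1
cnot01 = np.kron(np.kron(P[0], I), I) + np.kron(np.kron(P[1], X), I)
state = np.kron(np.kron(H, I), I) @ (cnot01 @ state)

for a in (0, 1):                                  # loop over the four possible
    for b in (0, 1):                              # two-bit messages to Bob
        proj = np.kron(np.kron(P[a], P[b]), I) @ state
        bob = proj.reshape(2, 2, 2)[a, b, :]      # Bob's qubit for outcome (a, b)
        bob = bob / np.linalg.norm(bob)
        corr = (Z if a else I) @ (X if b else I)  # Bob's conditional correction
        fidelity = abs(np.vdot(psi, corr @ bob))
        assert abs(fidelity - 1) < 1e-12          # teleported state matches psi
```

Each of the four outcomes occurs with probability 1/4, and in every branch Bob's corrected qubit matches the original state, even though no quantum system traveled from Alice to Bob after the EPR pair was shared.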
At the time of the writing of this article, teleportation of larger entities such as molecules, bacteria, or human beings remains out of reach of current quantum technology.

Quantum Cryptography

A particularly useful application of the counterintuitive features of quantum mechanics is quantum cryptography [49,50,51]. Above, it was noted that Shor's algorithm would allow quantum computers to crack public-key cryptosystems. In the context of code breaking, then, quantum information processing is a disruptive technology. Fortunately, however, if quantum computing represents a cryptographic disease, then quantum communication represents a cryptographic cure. The feature of quantum mechanics that no measurement can determine an unknown
state, and that almost any measurement will disturb such a state, can be turned into a protocol for performing quantum cryptography, a method of secret communication whose security is guaranteed by the laws of physics. In the 1970s, Stephen Wiesner developed the concept of quantum conjugate coding, in which information can be stored on two conjugate quantum variables, such as position and momentum, or linear and helical polarization [49]. In 1984, Charles Bennett and Gilles Brassard turned Wiesner's quantum coding concept into a protocol for quantum cryptography [50]: by sending suitable states of light over a quantum communication channel, Alice and Bob can build up a shared secret key. Since any attempt of Eve to listen in on their communication must inevitably disturb the states sent, Alice and Bob can determine whether Eve is listening in, and if so, how much information she has obtained. By suitable privacy amplification protocols, Alice and Bob can distill out a secret key that they alone share and which the laws of physics guarantee is shared by no one else. In 1990, Artur Ekert, unaware of Wiesner, Bennett, and Brassard's work, independently derived a protocol for quantum cryptography based on entanglement [51]. Commercial quantum cryptographic systems are now available for purchase by those who desire secrecy based on the laws of physics, rather than on how hard it is to factor large numbers. Such systems represent the application of quantum information processing that is closest to everyday use.

The Future

Quantum information processing is currently a thriving scientific field, with many open questions and potential applications. Key open questions include: just what can quantum computers do better than classical computers? They can apparently factor large numbers, search databases, and simulate quantum systems better than classical computers. That list is quite short, however. What is the full list of problems for which quantum computers offer a speed-up?
How can we build large-scale quantum computers? Lots of small-scale quantum computers, with up to a dozen bits, have been built and operated. Building large-scale quantum computers will require substantial technological advances in precision construction and control of complex quantum systems. While advances in this field have been steady, we're still far away from building a quantum computer that could break existing public-key cryptosystems.
What are the ultimate physical limits to communication channels? Despite many decades of effort, fundamental questions concerning the capacity of quantum communication channels remain unresolved. Quantum information processing is a rich stream with many tributaries in the fields of engineering, physics, and applied mathematics. Quantum information processing investigates the physical limits of computation and communication, and it devises methods for reaching closer to those limits, and someday perhaps to attain them.

Quantum Mechanics

In order to understand quantum information processing in any non-trivial way, some math is required. As Feynman said, ". . . it is impossible to explain honestly the beauties of the laws of nature in a way that people can feel, without their having some deep understanding of mathematics. I am sorry, but this seems to be the case" [52]. The counterintuitive character of quantum mechanics makes it even more imperative to use mathematics to understand the subject. The strange consequences of quantum mechanics arise directly out of the underlying mathematical structure of the theory. It is important to note that every bizarre and weird prediction of quantum mechanics that has been experimentally tested has turned out to be true. The mathematics of quantum mechanics is one of the most trustworthy pieces of science we possess. Luckily, this mathematics is also quite simple. To understand quantum information processing requires only a basic knowledge of linear algebra, that is, of vectors and matrices. No calculus is required. In this section a brief review of the mathematics of quantum mechanics is presented, along with some of its more straightforward consequences. The reader who is familiar with this mathematics can safely skip to the following sections on quantum information. Readers who desire further detail are invited to consult reference [1].

Qubits

The states of a quantum system correspond to vectors.
In a quantum bit, the quantum logic state 0 corresponds to a two-dimensional vector, (1, 0)ᵀ, and the quantum logic state 1 corresponds to the vector (0, 1)ᵀ. It is customary to write these vectors in the so-called 'Dirac bracket' notation:

|0⟩ ≡ (1, 0)ᵀ ,  |1⟩ ≡ (0, 1)ᵀ .  (1)
A general state for a qubit, |ψ⟩, corresponds to a vector |ψ⟩ = α|0⟩ + β|1⟩, where α and β are complex numbers such that |α|² + |β|² = 1. The requirement that the amplitudes squared of the components of a vector sum to one is called 'normalization'. Normalization arises because amplitudes squared in quantum mechanics are related to probabilities. In particular, suppose that one prepares a qubit in the state |ψ⟩, and then performs a measurement whose purpose is to determine whether the qubit takes on the value 0 or 1 (such measurements will be discussed in greater detail below). Such a measurement will give the outcome 0 with probability |α|², and will give the outcome 1 with probability |β|². These probabilities must sum to one. The vectors |0⟩, |1⟩, |ψ⟩ are column vectors: we can also define the corresponding row vectors,

⟨0| ≡ (1, 0) ,  ⟨1| ≡ (0, 1) ,  ⟨ψ| ≡ (α*, β*) .  (2)

Note that creating the row vector ⟨ψ| involves both transposing the vector and taking the complex conjugate of its entries. This process is called Hermitian conjugation, and is denoted by the superscript †, so that ⟨ψ| = |ψ⟩†. The two-dimensional, complex vector space for a qubit is denoted C². The reason for introducing Dirac bracket notation is that this vector space, like all the vector spaces of quantum mechanics, possesses a natural inner product, defined in the usual way by the product of row vectors and column vectors. Suppose that |ψ⟩ = (α, β)ᵀ and |φ⟩ = (γ, δ)ᵀ, so that ⟨φ| = (γ*, δ*). The row vector ⟨φ| is called a 'bra' vector, and the column vector |ψ⟩ is called a 'ket' vector. Multiplied together, these vectors form the inner product, or 'bracket',

⟨φ|ψ⟩ = (γ*, δ*)(α, β)ᵀ = γ*α + δ*β .  (3)

Note that ⟨ψ|ψ⟩ = |α|² + |β|² = 1. The definition of the inner product (3) turns the vector space for qubits C² into a 'Hilbert space', a complete vector space with inner product. (Completeness means that any convergent sequence of vectors in the space attains a limit that itself lies in the space. Completeness is only an issue for infinite-dimensional Hilbert spaces and will be discussed no further here.) We can now express probabilities in terms of brackets: |⟨0|ψ⟩|² = |α|² ≡ p₀ is the probability that a measurement that distinguishes 0 and 1, made on the state |ψ⟩, yields the output 0. Similarly, |⟨1|ψ⟩|² = |β|² ≡ p₁ is the probability that the same measurement yields the output 1. Another way to write these probabilities is to define the
two 'projectors'

P₀ = [[1, 0], [0, 0]] = |0⟩⟨0| ,  P₁ = [[0, 0], [0, 1]] = |1⟩⟨1| .  (4)

Note that

P₀² = |0⟩⟨0|0⟩⟨0| = |0⟩⟨0| = P₀ .  (5)

Similarly, P₁² = P₁. A projection operator or projector P is defined by the condition P² = P. Written in terms of these projectors, the probabilities p₀, p₁ can be defined as

p₀ = ⟨ψ|P₀|ψ⟩ ,  p₁ = ⟨ψ|P₁|ψ⟩ .  (6)
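Equations (4)–(6) can be checked directly with a few lines of linear algebra. The following sketch is illustrative (assuming NumPy; the particular amplitudes are arbitrary):

```python
import numpy as np

# |psi> = alpha|0> + beta|1>, normalized as in the text
alpha, beta = 0.6, 0.8j
psi = np.array([alpha, beta])

P0 = np.array([[1, 0], [0, 0]])              # |0><0|, Eq. (4)
P1 = np.array([[0, 0], [0, 1]])              # |1><1|

# Projectors square to themselves, Eq. (5)
assert np.allclose(P0 @ P0, P0) and np.allclose(P1 @ P1, P1)

# Brackets <psi|P|psi> give the outcome probabilities, Eq. (6)
p0 = np.vdot(psi, P0 @ psi).real
p1 = np.vdot(psi, P1 @ psi).real
assert np.isclose(p0, abs(alpha) ** 2)       # p0 = 0.36
assert np.isclose(p1, abs(beta) ** 2)        # p1 = 0.64
assert np.isclose(p0 + p1, 1.0)              # normalization
```

Here `np.vdot` conjugates its first argument, which is exactly the Hermitian conjugation that turns the ket into the bra.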
Note that ⟨0|1⟩ = ⟨1|0⟩ = 0: the two states |0⟩ and |1⟩ are orthogonal. Since any vector |ψ⟩ = α|0⟩ + β|1⟩ can be written as a linear combination, or superposition, of |0⟩ and |1⟩, {|0⟩, |1⟩} make up an orthonormal basis for the Hilbert space C². From the probabilistic interpretation of brackets, we see that orthogonality implies that a measurement that distinguishes between 0 and 1, made on the state |0⟩, will yield the output 0 with probability 1 (p₀ = 1), and will never yield the output 1 (p₁ = 0). In quantum mechanics, orthogonal states are reliably distinguishable.

Higher Dimensions

The discussion above applied to qubits. More complicated quantum systems lie in higher dimensional vector spaces. For example, a 'qutrit' is a quantum system with three distinguishable states |0⟩, |1⟩, |2⟩ that live in the three-dimensional complex vector space C³. All the mechanisms of measurement and definitions of brackets extend to higher dimensional systems as well. For example, the distinguishability of the three states of the qutrit implies ⟨i|j⟩ = δᵢⱼ. Many of the familiar systems of quantum mechanics, such as a free particle or a harmonic oscillator, have states that live in infinite dimensional Hilbert spaces. For example, the state of a free particle corresponds to a complex-valued function ψ(x) with ∫ ψ*(x)ψ(x) dx = 1, the integral running from −∞ to ∞. The probability of finding the particle in the interval between x = a and x = b is then the same integral ∫ ψ*(x)ψ(x) dx taken from a to b. Infinite dimensional Hilbert spaces involve subtleties that, fortunately, rarely impinge upon quantum information processing except in the use of bosonic systems as in quantum optics [40].

Matrices

Quantum mechanics is an intrinsically linear theory: transformations of states are represented by matrix multiplication. (Nonlinear theories of quantum mechanics can
be constructed, but there is no experimental evidence for any intrinsic nonlinearity in quantum mechanics.) Consider the set of matrices U such that U†U = Id, where Id is the identity matrix. Such a matrix is said to be 'unitary'. (For matrices on infinite-dimensional Hilbert spaces, i.e., for linear operators, unitarity also requires UU† = Id.) If we take a normalized vector |ψ⟩, ⟨ψ|ψ⟩ = 1, and transform it by multiplying it by U, so that |ψ′⟩ = U|ψ⟩, then we have

⟨ψ′|ψ′⟩ = ⟨ψ|U†U|ψ⟩ = ⟨ψ|ψ⟩ = 1 .  (7)
That is, unitary transformations U preserve the normalization of vectors. Equation (7) can also be used to show that any U that preserves the normalization of all vectors |ψ⟩ is unitary. Since, to be given a physical interpretation in terms of probabilities, the vectors of quantum mechanics must be normalized, the set of unitary transformations represents the set of 'legal' transformations of vectors in Hilbert space. (Below, we'll see that when one adds an environment with which qubits can interact, then the set of legal transformations can be extended.) Unitary transformations on a single qubit make up the set of two-by-two unitary matrices U(2).

Spin and Other Observables

A familiar quantum system whose state space is represented by a qubit is the spin-1/2 particle, such as an electron or proton. The spin of such a particle along a given axis can take on only two discrete values: 'spin up', with angular momentum ℏ/2 about that axis, or 'spin down', with angular momentum −ℏ/2. Here, ℏ is Planck's reduced constant: ℏ ≡ h/2π = 1.05457 × 10⁻³⁴ joule-sec. It is conventional to identify the state |↑⟩, spin up along the z-axis, with |0⟩, and the state |↓⟩, spin down along the z-axis, with |1⟩. In this way, the spin of an electron or proton can be taken to register a qubit. Now that we have introduced the notion of spin, we can introduce an operator or matrix that corresponds to the measurement of spin. Let P↑ = |↑⟩⟨↑| be the projector onto the state |↑⟩, and let P↓ = |↓⟩⟨↓| be the projector onto the state |↓⟩. The matrix, or 'operator', corresponding to spin 1/2 along the z-axis is then

I_z = (ℏ/2)(P↑ − P↓) = (ℏ/2)[[1, 0], [0, −1]] = (ℏ/2)σ_z ,  (8)
where σ_z ≡ [[1, 0], [0, −1]] is called the z Pauli matrix. In what way does I_z correspond to spin along the z-axis? Suppose that one starts out in the state |ψ⟩ = α|↑⟩ + β|↓⟩ and then measures spin along the z-axis. Just as in the case of measuring 0 or 1, with probability p↑ = |α|² one obtains the result ↑, and with probability p↓ = |β|² one obtains the result ↓. The expectation value for the angular momentum along the z-axis is then

⟨I_z⟩ = p↑(ℏ/2) + p↓(−ℏ/2) = ⟨ψ|I_z|ψ⟩ .  (9)
That is, the expectation value of the observable quantity corresponding to spin along the z-axis is given by taking the bracket of the state |ψ⟩ with the operator I_z corresponding to that observable. In quantum mechanics, every observable quantity corresponds to an operator. The operator corresponding to an observable with possible outcome values {a} is A = Σₐ a|a⟩⟨a| = Σₐ aPₐ, where |a⟩ is the state with value a and Pₐ = |a⟩⟨a| is the projection operator corresponding to the outcome a. Note that since the outcomes of measurements are real numbers, A† = A: the operators corresponding to observables are Hermitian. The states {|a⟩} are, by definition, distinguishable and so make up an orthonormal set. From the definition of A one sees that A|a⟩ = a|a⟩. That is, the different possible outcomes of the measurement are eigenvalues of A, and the different possible outcome states of the measurement are eigenvectors of A. If more than one state |aᵢ⟩ corresponds to the outcome a, then A = Σₐ aPₐ, where Pₐ = Σᵢ |aᵢ⟩⟨aᵢ| is the projection operator onto the eigenspace corresponding to the 'degenerate' eigenvalue a. Taking, for the moment, the case of non-degenerate eigenvalues, the expectation value of an observable A in a particular state |φ⟩ = Σₐ φₐ|a⟩ is obtained by bracketing the state about the corresponding operator:

⟨A⟩ ≡ ⟨φ|A|φ⟩ = Σₐ |φₐ|² a = Σₐ pₐ a ,  (10)
where pₐ = |φₐ|² is the probability that the measurement yields the outcome a. Above, we saw that the operator corresponding to spin along the z-axis was I_z = (ℏ/2)σ_z. What then are the operators corresponding to spin along the x- and y-axes? They are given by I_x = (ℏ/2)σ_x and I_y = (ℏ/2)σ_y, where σ_x and σ_y are the two remaining Pauli spin matrices out of the trio:

σ_x = [[0, 1], [1, 0]] ,  σ_y = [[0, −i], [i, 0]] ,  σ_z = [[1, 0], [0, −1]] .  (11)

By the prescription for obtaining expectation values (10), for an initial state |φ⟩ the expectation values of spin along
the x-axis and spin along the y-axis are

⟨I_x⟩ = ⟨φ|I_x|φ⟩ ,  ⟨I_y⟩ = ⟨φ|I_y|φ⟩ .  (12)
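The bracket prescription for expectation values can be verified numerically. A brief sketch (illustrative, assuming NumPy, and working in units where ℏ = 1 for numerical convenience) builds I_z from its projectors and compares the bracket of Eq. (9) with the probability-weighted sum of outcomes:

```python
import numpy as np

hbar = 1.0                                   # units where hbar = 1
P_up = np.diag([1., 0.])                     # |up><up|
P_down = np.diag([0., 1.])                   # |down><down|

# Observable built from outcomes and projectors: A = sum_a a P_a
Iz = (hbar / 2) * P_up + (-hbar / 2) * P_down       # = (hbar/2) sigma_z, Eq. (8)

psi = np.array([0.6, 0.8j])                  # alpha = 0.6, beta = 0.8i
p_up, p_down = abs(psi[0]) ** 2, abs(psi[1]) ** 2

# Eq. (9): probability-weighted outcomes equal the bracket <psi|Iz|psi>
expect_probs = p_up * (hbar / 2) + p_down * (-hbar / 2)
expect_bracket = np.vdot(psi, Iz @ psi).real
assert np.isclose(expect_probs, expect_bracket)

# The measurement outcomes are the eigenvalues of Iz: -hbar/2 and +hbar/2
assert np.allclose(np.linalg.eigvalsh(Iz), [-hbar / 2, hbar / 2])
```

The eigenvalue check makes concrete the statement that the possible outcomes of a measurement are the eigenvalues of the corresponding operator.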
The eigenvectors of I_x, σ_x and of I_y, σ_y are also easily described. The eigenvector of I_x, σ_x corresponding to spin up along the x-axis is |→⟩ = (1/√2, 1/√2)ᵀ, while the eigenvector corresponding to spin down along the x-axis is |←⟩ = (1/√2, −1/√2)ᵀ. Note that these eigenvectors are orthogonal and normalized – they make up an orthonormal set. It is easy to verify that σ_x|→⟩ = +1|→⟩ and σ_x|←⟩ = −1|←⟩, so the eigenvalues of σ_x are ±1. The eigenvalues of I_x = (ℏ/2)σ_x are ±ℏ/2, the two different possible values of angular momentum corresponding to spin up or spin down along the x-axis. Similarly, the eigenvector of I_y, σ_y corresponding to spin up along the y-axis is |⊗⟩ = (1/√2, i/√2)ᵀ, while the eigenvector corresponding to spin down along the y-axis is |⊙⟩ = (1/√2, −i/√2)ᵀ. (Here, in deference to the right-handed coordinate system that we are implicitly adopting, ⊗ corresponds to an arrow heading away from the viewer, and ⊙ corresponds to an arrow heading towards the viewer.)

Rotations and SU(2)

The Pauli matrices σ_x, σ_y, σ_z play a crucial role not only in characterizing the measurement of spin, but in generating rotations as well. Because of their central role in describing qubits in general, and spin in particular, several more of their properties are elaborated here. Clearly, σᵢ† = σᵢ: Pauli matrices are Hermitian. Next, note that
σ_x² = σ_y² = σ_z² = Id ≡ [[1, 0], [0, 1]] .  (13)

Since σᵢ† = σᵢ and σᵢ² = Id, it is also the case that σᵢ†σᵢ = Id: that is, the Pauli matrices are unitary. Next, defining the commutator of two matrices A and B to be [A, B] ≡ AB − BA, it is easy to verify that [σ_x, σ_y] = 2iσ_z. Cyclic permutations of this identity also hold, e.g., [σ_z, σ_x] = 2iσ_y. Now introduce the concept of a rotation. The operator e^(−i(θ/2)σ_x) corresponds to a rotation by an angle θ about the x-axis. The analogous operators with σ_x replaced by σ_y or σ_z are expressions for rotations about the y- or z-axes. Exponentiating matrices may look strange at first, but exponentiating Pauli matrices is significantly simpler. Using the Taylor expansion for the matrix exponential, e^A = Id + A + A²/2! + A³/3! + . . . , and employing the fact that σⱼ² = Id, one obtains

e^(−i(θ/2)σⱼ) = cos(θ/2) Id − i sin(θ/2) σⱼ .  (14)
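Equations (13) and (14) are easy to confirm numerically. The following sketch is illustrative, assuming NumPy and SciPy are available; it checks the Pauli identities, the closed form for the exponential against a brute-force matrix exponential, and the overall phase of −1 acquired under a 2π rotation:

```python
import numpy as np
from scipy.linalg import expm

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]])
sz = np.array([[1, 0], [0, -1]], dtype=complex)
Id = np.eye(2)

theta = 0.7                                       # arbitrary rotation angle
for s in (sx, sy, sz):
    assert np.allclose(s.conj().T, s)             # Hermitian
    assert np.allclose(s @ s, Id)                 # Eq. (13): sigma^2 = Id
    lhs = expm(-1j * (theta / 2) * s)             # brute-force matrix exponential
    rhs = np.cos(theta / 2) * Id - 1j * np.sin(theta / 2) * s
    assert np.allclose(lhs, rhs)                  # Eq. (14)

# Commutator identity [sigma_x, sigma_y] = 2i sigma_z
assert np.allclose(sx @ sy - sy @ sx, 2j * sz)

# A rotation by 2*pi gives -Id (an overall phase of -1); only 4*pi returns to Id
assert np.allclose(expm(-1j * np.pi * sz), -Id)
assert np.allclose(expm(-2j * np.pi * sz), Id)
```

The last two lines anticipate the SU(2) versus SO(3) discussion below: a spinor must be rotated around twice to come back to itself.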
It is useful to verify that the expression for rotations (14) makes sense for the states we have defined. For example, rotation by π about the x-axis should take the state |↑⟩, spin z up, to the state |↓⟩, spin z down. Inserting θ = π and j = x in (14), we find that the operator corresponding to this rotation is the matrix −iσ_x. Multiplying |↑⟩ by this matrix, we obtain

−iσ_x|↑⟩ = −i|↓⟩ .  (15)

The rotation does indeed take |↑⟩ to |↓⟩, but it also introduces an overall phase of −i. What does this overall phase do? The answer is: Nothing! Or, at least, nothing observable. Overall phases cannot change the expectation value of any observable. Suppose that we compare expectation values for the state |φ⟩ and for the state |φ′⟩ = e^(iϑ)|φ⟩ for some observable corresponding to an operator A. We have

⟨φ|A|φ⟩ = ⟨φ|e^(−iϑ) A e^(iϑ)|φ⟩ = ⟨φ′|A|φ′⟩ .  (16)

Overall phases are undetectable. Keeping the undetectability of overall phases in mind, it is a useful exercise to verify that other rotations perform as expected. For example, a rotation by π/2 about the x-axis takes |⊗⟩, spin up along the y-axis, to |↑⟩, together with an overall phase. Once rotations about the x-, y-, and z-axes have been defined, it is straightforward to construct rotations about any axis. Let λ̂ = (λ_x, λ_y, λ_z), with λ_x² + λ_y² + λ_z² = 1, be a unit vector along the λ̂ direction in ordinary three-dimensional space. Define σ_λ̂ = λ_xσ_x + λ_yσ_y + λ_zσ_z to be the generalized Pauli matrix associated with the unit vector λ̂. It is easy to verify that σ_λ̂ behaves like a Pauli matrix, e.g., σ_λ̂² = Id. Rotation by θ about the λ̂ axis then corresponds to the operator e^(−i(θ/2)σ_λ̂) = cos(θ/2) Id − i sin(θ/2) σ_λ̂. Once again, it is a useful exercise to verify that such rotations behave as expected. For example, a rotation by π about the (1/√2, 0, 1/√2) axis should 'swap' |↑⟩ and |→⟩, up to some phase. The set of rotations of the form e^(−i(θ/2)σ_λ̂) forms the group SU(2), the set of complex 2-by-2 unitary matrices with determinant equal to 1. It is instructive to compare this group with the 'conventional' group of rotations in three dimensions, SO(3). SO(3) is the set of real 3-by-3 matrices with orthonormal rows/columns and determinant 1. In SO(3), when one rotates a vector by 2π, the vector returns to its original state: a rotation by 2π corresponds to
the 3-by-3 identity matrix. In SU(2), rotating a vector by 2π corresponds to the transformation −Id: in rotating by 2π, the vector acquires an overall phase of −1. As will be seen below, the phase of −1, while unobservable for single-qubit rotations, can be, and has been, observed in two-qubit operations. To return to the original state, with no phase, one must rotate by 4π. A macroscopic, classical version of this fact manifests itself when one grasps a glass of water firmly in the palm of one's hand and rotates one's arm and shoulder to rotate the glass without spilling it. A little experimentation with this problem shows that one must rotate glass and hand around twice to return them to their initial orientation.

Why Quantum Mechanics?

Why is the fundamental theory of nature, quantum mechanics, a theory of complex vector spaces? No one knows for sure. One of the most convincing explanations came from Aage Bohr, the son of Niels Bohr and a Nobel laureate in his own right [53]. Aage Bohr pointed out that the basic mathematical representation of symmetry consists of complex vector spaces. For example, while the apparent symmetry group of rotations in three dimensional space is the real group SO(3), the actual underlying symmetry group of space, as evidenced by rotations of quantum-mechanical spins, is SU(2): to return to the same state, one has to go around not once, but twice. It is a general feature of complex, continuous groups, called 'Lie groups' after Sophus Lie, that their fundamental representations are complex. If quantum mechanics is a manifestation of deep, underlying symmetries of nature, then it should come as no surprise that quantum mechanics is a theory of transformations on complex vector spaces.
Density Matrices

The review of quantum mechanics is almost done. Before moving on to quantum information processing proper, two topics need to be covered. The first topic is how to deal with uncertainty about the underlying state of a quantum system. The second topic is how to treat two or more quantum systems together. These topics turn out to possess a strong connection, which is the source of most counterintuitive quantum effects. Suppose that we don't know exactly what state a quantum system is in. Say, for example, it could be in the state |0⟩ with probability p₀ or in the state |1⟩ with probability p₁. Note that this state is not the same as a quantum superposition, √p₀|0⟩ + √p₁|1⟩, which is a definite state with spin oriented in the x–z plane. The expectation value of an operator A when the underlying state possesses the uncertainty described is

⟨A⟩ = p₀⟨0|A|0⟩ + p₁⟨1|A|1⟩ = tr ρA ,  (17)

where ρ = p₀|0⟩⟨0| + p₁|1⟩⟨1| is the density matrix corresponding to the uncertain state. The density matrix can be thought of as the quantum mechanical analogue of a probability distribution. Density matrices were developed to provide a quantum mechanical treatment of statistical mechanics. A famous density matrix is that for the canonical ensemble. Here, the energy state of a system is uncertain, and each energy state |Eᵢ⟩ is weighted by a probability pᵢ = e^(−Eᵢ/k_BT)/Z, where Z = Σᵢ e^(−Eᵢ/k_BT) is the partition function. Z is needed to normalize the probabilities {pᵢ} so that Σᵢ pᵢ = 1. The density matrix for the canonical ensemble is then ρ_C = (1/Z) Σᵢ e^(−Eᵢ/k_BT)|Eᵢ⟩⟨Eᵢ|. The expectation value of any operator, e.g., the energy operator H (for 'Hamiltonian'), is then given by ⟨H⟩ = tr ρ_C H.

Multiple Systems and Tensor Products

To describe two or more systems requires a formalism called the tensor product. The Hilbert space for two qubits is the space C² ⊗ C², where ⊗ is the tensor product. C² ⊗ C² is a four-dimensional space spanned by the vectors |0⟩⊗|0⟩, |0⟩⊗|1⟩, |1⟩⊗|0⟩, |1⟩⊗|1⟩. (To save space these vectors are sometimes written |0⟩|0⟩, |0⟩|1⟩, |1⟩|0⟩, |1⟩|1⟩, or even more compactly, |00⟩, |01⟩, |10⟩, |11⟩. Care must be taken, however, to make sure that this notation is unambiguous in a particular situation.) The tensor product is multilinear: in performing the tensor product, the distributive law holds. That is, if |ψ⟩ = α|0⟩ + β|1⟩ and |φ⟩ = γ|0⟩ + δ|1⟩, then

|ψ⟩ ⊗ |φ⟩ = (α|0⟩ + β|1⟩) ⊗ (γ|0⟩ + δ|1⟩) = αγ|0⟩⊗|0⟩ + αδ|0⟩⊗|1⟩ + βγ|1⟩⊗|0⟩ + βδ|1⟩⊗|1⟩ .  (18)

A tensor is a thing with slots: the key point to keep track of in tensor analysis is which operator or vector acts on which slot. It is often useful to label the slots, e.g., |ψ⟩₁ ⊗ |φ⟩₂ is a tensor product vector in which |ψ⟩ occupies slot 1 and |φ⟩ occupies slot 2. One can also define the tensor product of operators or matrices. For example, σ_x1 ⊗ σ_z2 is a tensor product operator with σ_x in slot 1 and σ_z in slot 2. When this operator acts on a tensor product vector such as |ψ⟩₁ ⊗ |φ⟩₂, the operator in slot 1 acts on the vector in that slot, and the operator in slot 2 acts on the vector in that slot:

(σ_x1 ⊗ σ_z2)(|ψ⟩₁ ⊗ |φ⟩₂) = (σ_x1|ψ⟩₁) ⊗ (σ_z2|φ⟩₂) .  (19)
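Numerically, the tensor product is the Kronecker product, so the slot rule of Eq. (19) can be verified directly. A minimal sketch (illustrative, assuming NumPy):

```python
import numpy as np

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

psi = np.array([0.6, 0.8])                    # alpha|0> + beta|1>
phi = np.array([1., 1j]) / np.sqrt(2)         # gamma|0> + delta|1>

# The tensor product state |psi> (x) |phi| is numpy's Kronecker product
joint = np.kron(psi, phi)

# Eq. (19): the operator in slot 1 acts on the vector in slot 1, and likewise
# for slot 2, so applying the joint operator equals applying them slot-by-slot
lhs = np.kron(sx, sz) @ joint
rhs = np.kron(sx @ psi, sz @ phi)
assert np.allclose(lhs, rhs)
```

The identity used here, kron(A, B) · kron(u, v) = kron(Au, Bv), is exactly the slot rule in matrix form.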
Quantum Information Processing
The No-Cloning Theorem

Now that tensor products have been introduced, one of the most famous theorems of quantum information – the no-cloning theorem – can immediately be proved [54]. Classical information has the property that it can be copied, so that $0 \to 00$ and $1 \to 11$. How about quantum information? Does there exist a procedure that allows one to take an arbitrary, unknown state $|\psi\rangle$ to $|\psi\rangle \otimes |\psi\rangle$? Can you clone a quantum? As the title of this section indicates, the answer to this question is No. Suppose that you could clone a quantum. Then there would exist a unitary operator $U_C$ that would take the state

$$|\psi\rangle \otimes |0\rangle \to U_C\,|\psi\rangle \otimes |0\rangle = |\psi\rangle \otimes |\psi\rangle \,, \quad (20)$$

for any initial state $|\psi\rangle$. Consider another state $|\phi\rangle$. Since $U_C$ is supposed to clone any state, we would also have $U_C\,|\phi\rangle \otimes |0\rangle = |\phi\rangle \otimes |\phi\rangle$. If $U_C$ exists, then, the following holds for any states $|\psi\rangle$, $|\phi\rangle$:

$$\langle\phi|\psi\rangle = ({}_1\langle\phi| \otimes {}_2\langle 0|)(|\psi\rangle_1 \otimes |0\rangle_2) = ({}_1\langle\phi| \otimes {}_2\langle 0|)(U_C^\dagger U_C)(|\psi\rangle_1 \otimes |0\rangle_2) = ({}_1\langle\phi| \otimes {}_2\langle 0|U_C^\dagger)(U_C\,|\psi\rangle_1 \otimes |0\rangle_2) = ({}_1\langle\phi| \otimes {}_2\langle\phi|)(|\psi\rangle_1 \otimes |\psi\rangle_2) = ({}_1\langle\phi|\psi\rangle_1)({}_2\langle\phi|\psi\rangle_2) = \langle\phi|\psi\rangle^2 \,, \quad (21)$$

where we have used the fact that $U_C$ is unitary, so that $U_C^\dagger U_C = \mathrm{Id}$. So if cloning is possible, then $\langle\phi|\psi\rangle = \langle\phi|\psi\rangle^2$ for any two vectors $|\psi\rangle$ and $|\phi\rangle$. But this is impossible, as it implies that $\langle\phi|\psi\rangle$ equals either 0 or 1 for all $|\psi\rangle$, $|\phi\rangle$, which is certainly not true. You can't clone a quantum.

The no-cloning theorem has widespread consequences. It is responsible for the efficacy of quantum cryptography, which will be discussed in greater detail below. Suppose that Alice sends a state $|\psi\rangle$ to Bob. Eve wants to discover what state this is, without Alice or Bob uncovering her eavesdropping. That is, she would like to make a copy of $|\psi\rangle$ and send the original state $|\psi\rangle$ on to Bob. The no-cloning theorem prevents her from doing so: any attempt to copy $|\psi\rangle$ will necessarily perturb the state. An 'optimal cloner' is a transformation that does the best possible job of cloning, given that cloning is impossible [55].

Reduced Density Matrices

Suppose that one makes a measurement corresponding to an observable $A_1$ on the state in slot 1. What operator do we take the bracket of to get the expectation value? The answer is $A_1 \otimes \mathrm{Id}_2$: we have to put the identity in slot 2. The expectation value of this measurement for the state $|\psi\rangle_1 \otimes |\phi\rangle_2$ is then

$${}_1\langle\psi| \otimes {}_2\langle\phi|\; A_1 \otimes \mathrm{Id}_2\; |\psi\rangle_1 \otimes |\phi\rangle_2 = {}_1\langle\psi|A_1|\psi\rangle_1 \otimes {}_2\langle\phi|\mathrm{Id}_2|\phi\rangle_2 = {}_1\langle\psi|A_1|\psi\rangle_1 \,. \quad (22)$$

Here we have used the rule that operators in slot 1 act on vectors in slot 1. Similarly, the operators in slot 2 act on vectors in slot 2. As always, the key to performing tensor manipulations is to keep track of what is in which slot. (Note that the tensor product of two numbers is simply the product of those numbers.)

In ordinary probability theory, the probabilities for two sets of events labeled by $i$ and $j$ are given by a joint probability distribution $p(ij)$. The probabilities for the first set of events on their own are obtained by averaging over the second set: $p(i) = \sum_j p(ij)$ is the marginal distribution for the first set of events labeled by $i$. In quantum mechanics, the analog of a probability distribution is the density matrix. Two systems 1 and 2 are described by a joint density matrix $\rho_{12}$, and system 1 on its own is described by a 'reduced' density matrix $\rho_1$. Suppose that systems 1 and 2 are in a state described by a density matrix

$$\rho_{12} = \sum_{ii'jj'} \rho_{ii'jj'}\, |i\rangle_1\langle i'| \otimes |j\rangle_2\langle j'| \,, \quad (23)$$

where $\{|i\rangle_1\}$ and $\{|j\rangle_2\}$ are orthonormal bases for systems 1 and 2 respectively. As in the previous paragraph, the expectation value of a measurement made on system 1 alone is given by $\mathrm{tr}\,\rho_{12}(A_1 \otimes \mathrm{Id}_2)$. Another way to write such expectation values is to define the reduced density matrix,

$$\rho_1 = \mathrm{tr}_2\,\rho_{12} = \sum_{ii'jj'} \rho_{ii'jj'}\, |i\rangle_1\langle i'|\; {}_2\langle j'|j\rangle_2 = \sum_{ii'j} \rho_{ii'jj}\, |i\rangle_1\langle i'| \,. \quad (24)$$

Equation (24) describes the partial trace $\mathrm{tr}_2$ over system 2. In other words, if $\rho_{12}$ has components $\{\rho_{ii'jj'}\}$, then the reduced density matrix $\rho_1 = \mathrm{tr}_2\,\rho_{12}$ has components $\{\sum_j \rho_{ii'jj}\}$. The expectation value of a measurement $A$ made on the first system alone is then simply $\langle A\rangle = \mathrm{tr}\,\rho_1 A$. Just as in ordinary probability theory, where the marginal distribution for system 1 is obtained by averaging over the state of system 2, so in quantum mechanics the reduced density matrix that describes system 1 is obtained by tracing over the state of system 2.
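The partial trace of Eq. (24) is a small amount of code once $\rho_{12}$ is stored as a matrix with row index $i\,d_2 + j$ and column index $i'\,d_2 + j'$. A sketch in pure Python (the helper name is ours):

```python
import math

def partial_trace_2(rho, d1, d2):
    # rho is a (d1*d2) x (d1*d2) matrix (nested lists) for systems 1 and 2.
    # tr_2 sums the diagonal in the second slot: rho1[i][i'] = sum_j rho[i,j; i',j].
    return [[sum(rho[i * d2 + j][i2 * d2 + j] for j in range(d2))
             for i2 in range(d1)] for i in range(d1)]

# Example: the pure two-qubit state (|00> + |11>)/sqrt(2).
v = [1 / math.sqrt(2), 0.0, 0.0, 1 / math.sqrt(2)]
rho12 = [[a * b for b in v] for a in v]        # |psi><psi|

rho1 = partial_trace_2(rho12, 2, 2)
# The marginal is the maximally mixed state Id/2, even though rho12 is pure.
assert all(abs(rho1[i][j] - (0.5 if i == j else 0.0)) < 1e-12
           for i in range(2) for j in range(2))
```

This is exactly the quantum analogue of computing a marginal distribution: summing over the index of the system one is ignoring.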
Entanglement

One of the central features of quantum information processing is entanglement. Entanglement is a peculiarly quantum-mechanical form of correlation between quantum systems that has no classical analogue. Entanglement lies at the heart of the various speedups and enhancements that quantum information processing offers over classical information processing. A pure state $|\psi\rangle_{12}$ for two systems 1 and 2 is entangled if the reduced density matrix for either system taken on its own has non-zero entropy. In particular, the reduced density matrix for system 1 is $\rho_1 = \mathrm{tr}_2\,\rho_{12}$, where $\rho_{12} = |\psi\rangle_{12}\langle\psi|$. The entropy of this density matrix is $S(\rho_1) = -\mathrm{tr}\,\rho_1\log_2\rho_1$. For pure states, the entropy of $\rho_1$ is equal to the entropy of $\rho_2$ and is a good measure of the degree of entanglement between the two systems. $S(\rho_1) = S(\rho_2)$ measures the number of 'e-bits' of entanglement between systems 1 and 2. A mixed state $\rho_{12}$ for 1 and 2 is entangled if it is not separable. A density matrix is separable if it can be written $\rho_{12} = \sum_j p_j\, \rho_1^j \otimes \rho_2^j$. In other words, a separable state is one that can be written as a classical mixture of uncorrelated states. The correlations in a separable state are purely classical. Entanglement can take a variety of forms and manifestations. The key to understanding those forms is the notion of Local Operations and Classical Communication (LOCC) [56]. Local operations such as unitary transformations and measurements, combined with classical communication, cannot, on their own, create entanglement. If one state can be transformed into another via local operations and classical communication, then the first state is 'at least as entangled' as the second. LOCC can then be used to categorize the different forms of entanglement. Distillable entanglement is a form of entanglement that can be transformed into pure-state entanglement [57].
Systems 1 and 2 possess $d$ qubits worth of distillable entanglement if local operations and classical communication can transform their state into a pure state that contains $d$ e-bits (possibly with some leftover 'junk' in a separate quantum register). Systems that are non-separable, but that possess no distillable entanglement, are said to possess bound entanglement [58]. The entanglement of formation for a state $\rho_{12}$ is equal to the minimum number of e-bits of pure-state entanglement that are required to create $\rho_{12}$ using only local operations and classical communication [59]. The entanglement of formation of $\rho_{12}$ is greater than or equal to $\rho_{12}$'s distillable entanglement. A variety of entanglement measures exist. Each one is useful for different purposes. Squashed entanglement, for example, plays an important role in quantum cryptography [60]. (Squashed entanglement is a notion of entanglement based on conditional information.) One of the most interesting open questions in quantum information theory is the definition of entanglement for multipartite systems consisting of more than two subsystems. Here, even in the case of pure states, no unique definition of entanglement exists. Entanglement plays a key role in quantum computation and quantum communication. Before turning to those fields, however, it is worthwhile investigating the strange and counterintuitive features of entanglement.

Quantum Weirdness

Entanglement is the primary source of what, for lack of a better term, may be called 'quantum weirdness'. Consider the two-qubit state

$$|\psi\rangle_{12} = \frac{1}{\sqrt 2}\bigl(|0\rangle_1 \otimes |1\rangle_2 - |1\rangle_1 \otimes |0\rangle_2\bigr) \,. \quad (25)$$
This state is called the 'singlet' state: if the two qubits correspond to two spin-1/2 particles, as described above, so that $|0\rangle$ is the spin-$z$ up state and $|1\rangle$ is the spin-$z$ down state, then the singlet state is the state with zero angular momentum. Indeed, rewriting $|\psi\rangle_{12}$ in terms of spin as

$$|\psi\rangle_{12} = \frac{1}{\sqrt 2}\bigl(|{\uparrow}\rangle_1 \otimes |{\downarrow}\rangle_2 - |{\downarrow}\rangle_1 \otimes |{\uparrow}\rangle_2\bigr) \,, \quad (26)$$

one sees that if one makes a measurement of spin $z$, then if the first spin has spin $z$ up, the second spin has spin $z$ down, and vice versa. If one decomposes the state in terms of spin along the $x$-axis, $|{\rightarrow}\rangle = (1/\sqrt 2)(|{\uparrow}\rangle + |{\downarrow}\rangle)$, $|{\leftarrow}\rangle = (1/\sqrt 2)(|{\uparrow}\rangle - |{\downarrow}\rangle)$, then $|\psi\rangle_{12}$ can be rewritten

$$|\psi\rangle_{12} = \frac{1}{\sqrt 2}\bigl(|{\rightarrow}\rangle_1 \otimes |{\leftarrow}\rangle_2 - |{\leftarrow}\rangle_1 \otimes |{\rightarrow}\rangle_2\bigr) \,. \quad (27)$$

Similarly, rewriting in terms of spin along the $y$-axis, we obtain

$$|\psi\rangle_{12} = \frac{1}{\sqrt 2}\bigl(|{\uparrow_y}\rangle_1 |{\downarrow_y}\rangle_2 - |{\downarrow_y}\rangle_1 |{\uparrow_y}\rangle_2\bigr) \,, \quad (28)$$

where $|{\uparrow_y}\rangle$ is the state with spin up along the $y$-axis and $|{\downarrow_y}\rangle$ is the state with spin down along the $y$-axis. No matter what axis one decomposes the spin about, if the first spin has spin up along that axis then the second spin has spin down along that axis, and vice versa. The singlet state has angular momentum zero about every axis. So far, this doesn't sound too strange. The singlet simply behaves the way a state with zero angular momentum
should: it is not hard to see that it is the unique two-spin state with zero angular momentum about every axis. In fact, the singlet state exhibits lots of quantum weirdness. Look at the reduced density matrix for spin 1:

$$\rho_1 = \mathrm{tr}_2\,\rho_{12} = \mathrm{tr}_2\,|\psi\rangle_{12}\langle\psi| = \frac12\bigl(|{\uparrow}\rangle_1\langle{\uparrow}| + |{\downarrow}\rangle_1\langle{\downarrow}|\bigr) = \mathrm{Id}/2 \,. \quad (29)$$

That is, the density matrix for spin 1 is in a completely indefinite, or 'mixed', state: nothing is known about whether it is spin up or spin down along any axis. Similarly, spin 2 is in a completely mixed state. This is already a little strange. The two spins together are in a definite, 'pure' state, the singlet state. Classically, if two systems are in a definite state, then each of the systems on its own is in a definite state: the only way to have uncertainty about one of the parts is to have uncertainty about the whole. In quantum mechanics this is not the case: two systems can be in a definite, pure state taken together, while each of the systems on its own is in an indefinite, mixed state. Such systems are said to be entangled with each other. Entanglement is a peculiarly quantum form of correlation. Two spins in a singlet state are highly correlated (or, more precisely, anticorrelated): no matter what axis one measures spin along, one spin will be found to have the opposite spin of the other. In itself, that doesn't sound so bad, but when one makes a measurement on one spin, something funny seems to happen. Both spins start out in a completely indefinite state. Now one chooses to make a measurement of spin 1 along the $z$-axis. Suppose that one gets the result, spin up. As a result of the measurement, spin 2 is now in a definite state, spin down along the $z$-axis. If one had chosen to make a measurement of spin 1 along the $x$-axis, then spin 2 would also be put in a definite state along the $x$-axis. Somehow, it seems as if one can affect the state of spin 2 by making a measurement of spin 1 on its own. This is what Einstein called 'spooky action at a distance'.
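Equation (29) and the e-bit count for the singlet can be checked directly; a sketch in pure Python (the inline partial trace follows the component formula of Eq. (24); variable names are ours):

```python
import math

# Singlet state (|01> - |10>)/sqrt(2); amplitudes indexed by 2*i + j.
s = 1 / math.sqrt(2)
psi = [0.0, s, -s, 0.0]

# rho12 = |psi><psi|, then trace out qubit 2: rho1[i][i'] = sum_j rho12[ij][i'j].
rho12 = [[a * b for b in psi] for a in psi]
rho1 = [[sum(rho12[2 * i + j][2 * i2 + j] for j in range(2))
         for i2 in range(2)] for i in range(2)]

# rho1 = Id/2: the completely mixed state, as in Eq. (29).
assert all(abs(rho1[i][j] - (0.5 if i == j else 0.0)) < 1e-12
           for i in range(2) for j in range(2))

# S(rho1) = -tr rho1 log2 rho1; rho1 is diagonal here, so use its diagonal.
S = -sum(rho1[i][i] * math.log2(rho1[i][i]) for i in range(2))
assert abs(S - 1.0) < 1e-12   # one e-bit of entanglement
```

The joint state is pure, yet each marginal is maximally mixed, and the entanglement entropy is exactly one e-bit.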
In fact, such measurements involve no real action at a distance, spooky or otherwise. If one could really act on spin 2 by making a measurement on spin 1, thereby changing spin 2’s state, then one could send information instantaneously from spin 1 to spin 2 by measuring spin 1 alone. Such instantaneous transmission of information would violate special relativity and give rise to all sorts of paradoxical capabilities, such as the ability to travel backwards in time. Luckily, it is easy to see that it is impossible to send information superluminally using entanglement: no matter what one does to spin 1, the outcomes of measurements on spin 2 are unaffected by that action. In particular, operations on spin 1 correspond to operators of the form
$A_1 \otimes \mathrm{Id}_2$, while operations on spin 2 correspond to operators of the form $\mathrm{Id}_1 \otimes B_2$. The commutator between such operators is

$$[A_1 \otimes \mathrm{Id}_2,\ \mathrm{Id}_1 \otimes B_2] = A_1 \otimes B_2 - A_1 \otimes B_2 = 0 \,. \quad (30)$$

Since they commute, it doesn't matter if one does something to spin 1 first and then measures spin 2, or measures spin 2 first and then does something to spin 1: the results of the measurement will be the same. That is, nothing one does to spin 1 on its own can affect spin 2. Nonetheless, entanglement is counterintuitive. One's classical intuition would like to believe that before the measurement, the system to be measured is in a definite state, even if that definite state is unknown. Such a definite state would constitute a 'hidden variable', an unknown, classical value for the measured variable. Entanglement implies that such hidden variables can't exist in any remotely satisfactory form. The spin version of the EPR effect described above is due to David Bohm [61]. Subsequently, John Bell proposed a set of relations, the 'Bell inequalities', that a hidden variable theory should obey [62]. Bell's inequalities are expressed in terms of the probabilities for the outcomes of measurements made on the two spins along different axes. Suppose that each particle indeed has a particular value of spin along each axis before it is measured. Designate a particle that has spin up along the $x$-axis, spin down along the $y$-axis, and spin up along the $z$-axis by $(x+,\,y-,\,z+)$. Designate other possible orientations similarly. In a collection of particles, let $N(x+,\,y-,\,z+)$ be the number of particles with orientations $(x+,\,y-,\,z+)$. Clearly, $N(x+,\,y-) = N(x+,\,y-,\,z+) + N(x+,\,y-,\,z-)$. Now, in a collection of measurements made on pairs of particles, originally in a singlet state, let $\#(x_1+,\,y_2-)$ be the number of measurements that give the result spin up along the $x$-axis for particle 1, and spin down along the $y$-axis for particle 2.
Bell showed that for classical particles that actually possess definite values of spin along different axes before measurement, $\#(x_1+,\,y_2+) \le \#(x_1+,\,z_2+) + \#(y_1-,\,z_2-)$, together with inequalities that are obtained by permuting axes and signs. Quantum mechanics decisively violates these Bell inequalities: in entangled states like the singlet state, particles simply do not possess definite, but unknown, values of spin before they are measured. Bell's inequalities have been verified experimentally on numerous occasions [13], although not all exotic forms of hidden variables have been eliminated. Those that are consistent with experiment are not very aesthetically appealing, however (depending, of course, on one's aesthetic ideals). A stronger set of inequalities than Bell's are the CHSH inequalities
(Clauser–Horne–Shimony–Holt), which have also been tested in numerous venues, with the predictions of quantum mechanics confirmed each time [63]. One of the weirdest violations of classical intuition can be found in the so-called GHZ experiment, named after Daniel Greenberger, Michael Horne, and Anton Zeilinger [64]. To demonstrate the GHZ paradox, begin with the three-qubit state

$$|\phi\rangle = (1/\sqrt 2)\bigl(|{\uparrow\uparrow\uparrow}\rangle - |{\downarrow\downarrow\downarrow}\rangle\bigr) \quad (31)$$
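The four GHZ expectation values discussed next can be computed directly from this state. A minimal numerical check in pure Python, using the standard Pauli matrices (the helper functions are ours):

```python
import math

sx = [[0, 1], [1, 0]]        # Pauli sigma_x
sy = [[0, -1j], [1j, 0]]     # Pauli sigma_y

def kron(A, B):
    return [[a * b for a in rowA for b in rowB] for rowA in A for rowB in B]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def expval(ops, v):
    # <v| op1 (x) op2 (x) op3 |v>
    M = kron(kron(ops[0], ops[1]), ops[2])
    w = matvec(M, v)
    return sum(a.conjugate() * b for a, b in zip(v, w)).real

# GHZ state (|up,up,up> - |down,down,down>)/sqrt(2), with |up> = |0>;
# amplitudes indexed by the bit string of the three qubits.
s = 1 / math.sqrt(2)
ghz = [s, 0, 0, 0, 0, 0, 0, -s]

assert abs(expval([sx, sy, sy], ghz) - 1) < 1e-12   # xyy: product of outcomes +1
assert abs(expval([sy, sx, sy], ghz) - 1) < 1e-12   # yxy: +1
assert abs(expval([sy, sy, sx], ghz) - 1) < 1e-12   # yyx: +1
assert abs(expval([sx, sx, sx], ghz) + 1) < 1e-12   # xxx: -1
```

The four products multiply to $(+1)(+1)(+1)(-1) = -1$, which is the quantum prediction that no classical pre-assignment of spin values can reproduce.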
(note that in writing this state we have suppressed the tensor product $\otimes$ signs, as mentioned above). Prepare this state four separate times, and make four distinct measurements. In the first measurement measure $\sigma_x$ on the first qubit, $\sigma_y$ on the second qubit, and $\sigma_y$ on the third qubit. Assign the value $+1$ to the result, spin up along the axis measured, and $-1$ to spin down. Multiply the outcomes together. Quantum mechanics predicts that the result of this multiplication will always be $+1$, as can be verified by taking the expectation value $\langle\phi|\sigma_{x1}\otimes\sigma_{y2}\otimes\sigma_{y3}|\phi\rangle$ of the operator $\sigma_{x1}\otimes\sigma_{y2}\otimes\sigma_{y3}$ that corresponds to making the three individual spin measurements and multiplying their results together. In the second measurement measure $\sigma_y$ on the first qubit, $\sigma_x$ on the second qubit, and $\sigma_y$ on the third qubit. Multiply the results together. Once again, quantum mechanics predicts that the result will be $+1$. Similarly, in the third measurement measure $\sigma_y$ on the first qubit, $\sigma_y$ on the second qubit, and $\sigma_x$ on the third qubit. Multiply the results together to obtain the predicted result $+1$. Finally, in the fourth measurement measure $\sigma_x$ on all three qubits and multiply the results together. Quantum mechanics predicts that this measurement will give the result $\langle\phi|\sigma_{x1}\otimes\sigma_{x2}\otimes\sigma_{x3}|\phi\rangle = -1$.

So far, these predictions may not seem strange. A moment's reflection, however, will reveal that the results of the four GHZ experiments are completely incompatible with any underlying assignment of values of $\pm 1$ to the spin along the $x$- and $y$-axes before the measurement. Suppose that such pre-measurement values existed, and that these are the values revealed by the measurements. Looking at the four measurements, each consisting of three individual spin measurements, one sees that each possible spin measurement appears twice in the full sequence of twelve individual spin measurements. For example, measurement of spin 1 along the $x$-axis occurs in the first of the four three-fold measurements, and in the last one. Similarly, measurement of spin 3 along the $y$-axis occurs in the first and second three-fold measurements. The classical consequence of each individual measurement occurring twice is that the product of all twelve measurements should be $+1$. That is, if measurement of $x_1$ in the first measurement yields the result $-1$, it should also yield the result $-1$ in the fourth measurement. The product of the outcomes for $x_1$ then gives $(-1)\cdot(-1) = +1$; similarly, if $x_1$ takes on the value $+1$ in both measurements, it also contributes $(+1)\cdot(+1) = +1$ to the overall product. So if each spin possesses a definite value before the measurement, classical mechanics unambiguously predicts that the product of all twelve individual measurements should be $+1$. Quantum mechanics, by contrast, unambiguously predicts that the product of all twelve individual measurements should be $-1$. The GHZ experiment has been performed in a variety of different quantum-mechanical systems, ranging from nuclear spins to photons [65,66]. The result: the predictions of classical mechanics are wrong and those of quantum mechanics are correct. Quantum weirdness triumphs.

Quantum Computation

Quantum mechanics has now been treated in sufficient detail to allow us to approach the most startling consequence of quantum weirdness: quantum computation. The central counterintuitive feature of quantum mechanics is quantum superposition: unlike a classical bit, which takes on either the value 0 or the value 1, a quantum bit in the superposition state $\alpha|0\rangle + \beta|1\rangle$ takes on the values 0 and 1 simultaneously. A quantum computer is a device that takes advantage of quantum superposition to process information in ways that classical computers can't. A key feature of any quantum computation is the way in which the computation puts entanglement to use: just as entanglement plays a central role in the quantum paradoxes discussed above, it also lies at the heart of quantum computation. A classical digital computer is a machine that can perform arbitrarily complex logical operations.
When you play a computer game, or operate a spreadsheet, all that is going on is that your computer takes in the information from your joystick or keyboard, encodes that information as a sequence of zeros and ones, and then performs sequences of simple logical operations on that information. Since the work of George Boole in the first half of the nineteenth century, it has been known that any logical expression, no matter how involved, can be broken down into sequences of elementary logical operations such as NOT, AND, OR and COPY. In the context of computation, these operations are called 'logic gates': a logic gate takes as input one or more bits of information, and produces as output one or more bits of information. The output bits are a function of the input bits. A NOT gate, for example, takes
as input a single bit, $X$, and returns as output the flipped bit, NOT $X$, so that $0 \to 1$ and $1 \to 0$. Similarly, an AND gate takes in two bits $X$, $Y$ as input, and returns the output $X$ AND $Y$. $X$ AND $Y$ is equal to 1 when both $X$ and $Y$ are equal to 1; otherwise it is equal to 0. That is, an AND gate takes $00 \to 0$, $01 \to 0$, $10 \to 0$, and $11 \to 1$. An OR gate takes $X$, $Y$ to 1 if either $X$ or $Y$ is 1, and to 0 if both $X$ and $Y$ are 0, so that $00 \to 0$, $01 \to 1$, $10 \to 1$, and $11 \to 1$. A COPY gate takes a single input, $X$, and returns as output two bits $X$ that are copies of the input bit, so that $0 \to 00$ and $1 \to 11$. All elementary logical operations can be built up from NOT, AND, OR and COPY. For example, implication can be written $A \to B \equiv (\text{NOT } A) \text{ OR } B$, since $A \to B$ is false if and only if $A$ is true and $B$ is false. Consequently, any logical expression, e.g.,

$$\bigl(A \text{ AND } (\text{NOT } B)\bigr) \text{ OR } \bigl(C \text{ AND } (\text{NOT } A) \text{ AND NOT}(C \text{ OR } B)\bigr) \,, \quad (32)$$

can be evaluated using NOT, AND, OR, and COPY gates, where COPY gates are used to supply the different copies of $A$, $B$ and $C$ that occur in different places in the expression. Accordingly, {NOT, AND, OR, COPY} is said to form a 'computationally universal' set of logic gates. Simpler computationally universal sets of logic gates also exist, e.g. {NAND, COPY}, where $X$ NAND $Y = \text{NOT}(X \text{ AND } Y)$.

Reversible Logic

A logic gate is said to be reversible if its outputs are a one-to-one function of its inputs. NOT is reversible, for example: since $X = \text{NOT}(\text{NOT } X)$, NOT is its own inverse. AND and OR are not reversible, as the value of their two input bits cannot be inferred from their single output. COPY is reversible, as its input can be inferred from either of its outputs. Logical reversibility is important because the laws of physics, at bottom, are reversible. Above, we saw that the time evolution of a closed quantum system (i.e., one that is not interacting with any environment) is given by a unitary transformation: $|\psi\rangle \to |\psi'\rangle = U|\psi\rangle$. All unitary transformations are invertible: $U^{-1} = U^\dagger$, so that $|\psi\rangle = U^\dagger U|\psi\rangle = U^\dagger|\psi'\rangle$. The input to a unitary transformation can always be obtained from its output: the time evolution of quantum mechanical systems is one-to-one.

As noted in the introduction, in 1961 Rolf Landauer showed that the underlying reversibility of quantum (and also of classical) mechanics implied that logically irreversible operations such as AND necessarily require physical dissipation [19]: any physical device that performs an AND operation must possess additional degrees of freedom (i.e., an environment) which retain the information about the actual values of the inputs of the AND gate after the irreversible logical operation has discarded those values. In a conventional electronic computer, those additional degrees of freedom consist of the microscopic motions of electrons, which, as Maxwell and Boltzmann told us, register large amounts of information. Logic circuits in contemporary electronic computers consist of field-effect transistors, or FETs, wired together to perform NOT, AND, OR and COPY operations. Bits are registered by voltages: a FET that is charged at higher voltage registers a 1, and an uncharged FET at ground voltage registers a 0. Bits are erased by connecting the FET to ground, discharging them and restoring them to the state 0. When such an erasure or resetting operation occurs, the underlying reversibility of the laws of physics ensures that the microscopic motions of the electrons in the ground still retain the information about whether the FET was charged or not, i.e., whether the bit before the erasure operation registered 1 or 0. In particular, if the bit registered 1 initially, the electrons in the ground will be slightly more energetic than if it registered 0. Landauer argued that any such operation that erases a bit requires dissipation of energy $k_B T \ln 2$ to an environment at temperature $T$, corresponding to an increase in the environment's entropy of $k_B \ln 2$.

Landauer's principle can be seen to be a straightforward consequence of the microscopic reversibility of the laws of physics, together with the fact that entropy is a form of information – information about the microscopic motions of atoms and molecules. Because the laws of physics are reversible, any information that resides in the logical degrees of freedom of a computer at the beginning of a computation (i.e., in the charges and voltages of FETs) must still be present at the end of the computation in some degrees of freedom, either logical or microscopic. Note that physical reversibility also implies that if information can flow from logical degrees of freedom to microscopic degrees of freedom, then it can also flow back again: the microscopic motions of electrons cause voltage fluctuations in FETs which can give rise to logical errors. Noise is necessary.

Because AND, OR, and NAND are not logically reversible, Landauer initially concluded that computation was necessarily dissipative: entropy had to increase. As is often true in the application of the second law of thermodynamics, however, the appearance of irreversibility does not always imply the actual fact of irreversibility. In 1963, Lecerf showed that digital computation could always be performed in a reversible fashion [20]. Unaware of Lecerf's
work, in 1973 Bennett rederived the possibility of reversible computation [21]. Most important, because Bennett was Landauer's colleague at IBM Watson laboratories, he realized the physical significance of embedding computation in a logically reversible context. As will be seen, logical reversibility is essential for quantum computation. A simple derivation of logically reversible computation is due to Fredkin, Toffoli, and Margolus [22]. Unaware of Bennett's work, Fredkin constructed three-input, three-output reversible logic gates that could perform NOT, AND, OR, and COPY operations. The best-known example of such a gate is the Toffoli gate. The Toffoli gate takes in three inputs, $X$, $Y$, and $Z$, and returns three outputs, $X'$, $Y'$ and $Z'$. The first two inputs go through unchanged, so that $X' = X$, $Y' = Y$. The third output is equal to the third input, unless both $X$ and $Y$ are equal to 1, in which case the third output is the NOT of the third input. That is, when either $X$ or $Y$ is 0, $Z' = Z$, and when both $X$ and $Y$ are 1, $Z' = \text{NOT } Z$. (Another way of saying the same thing is that $Z' = Z \text{ XOR } (X \text{ AND } Y)$, where XOR is the exclusive OR operation, whose output is 1 when exactly one of its inputs is 1. That is, XOR takes $00 \to 0$, $01 \to 1$, $10 \to 1$, $11 \to 0$.) Because it performs a NOT operation on $Z$ controlled on whether both $X$ and $Y$ are 1, a Toffoli gate is often called a controlled-controlled-NOT (CCNOT) gate.
Quantum Information Processing, Figure 1 A Toffoli gate
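The Toffoli gate's action, and its reversibility, can be written out in a few lines. A sketch in Python (the function name is ours):

```python
def toffoli(x, y, z):
    # CCNOT: pass x and y through; flip z exactly when x = y = 1.
    return x, y, z ^ (x & y)

# Truth table: z' = z XOR (x AND y); applying the gate twice undoes it.
for x in (0, 1):
    for y in (0, 1):
        for z in (0, 1):
            xp, yp, zp = toffoli(x, y, z)
            assert (xp, yp) == (x, y)
            assert zp == z ^ (x & y)
            assert toffoli(xp, yp, zp) == (x, y, z)   # reversible

# Universality via fixed inputs (anticipating the constructions below):
assert toffoli(1, 1, 0)[2] == 1 and toffoli(1, 1, 1)[2] == 0       # NOT z
assert [toffoli(x, y, 0)[2]
        for x, y in ((0, 0), (0, 1), (1, 0), (1, 1))] == [0, 0, 0, 1]  # x AND y
```

Unlike AND or OR on their own, the full three-bit output always determines the three-bit input.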
To see that CCNOT gates can be wired together to perform NOT, AND, OR, and COPY operations, note that when one sets the first two inputs $X$ and $Y$ both to the value 1 and allows the input $Z$ to vary, one obtains $Z' = \text{NOT } Z$. That is, supplying additional inputs allows a CCNOT to perform a NOT operation. Similarly, setting the input $Z$ to 0 and allowing $X$ and $Y$ to vary yields $Z' = X \text{ AND } Y$. OR and COPY (not to mention NAND) can be obtained by similar methods. So the ability to set inputs to predetermined values, together with the ability to apply CCNOT gates, allows one to perform any desired digital computation. Because reversible computation is intrinsically less dissipative than conventional, irreversible computation, it has been proposed as a paradigm for constructing low-power electronic logic circuits, and such low-power circuits have been built and demonstrated [67]. Because additional inputs and wires are required to perform computation reversibly, however, such circuits are not yet used for commercial application. As the miniaturization of the components of electronic computers proceeds according to Moore's law, however, dissipation becomes an increasingly hard problem to solve, and reversible logic may become commercially viable.

Quantum Computation

In 1980, Benioff proposed a quantum-mechanical implementation of reversible computation [23]. In Benioff's model, bits corresponded to spins, and the time evolution of those spins was given by a unitary transformation that performed reversible logic operations. (In 1986, Feynman embedded such computation in a local, Hamiltonian dynamics, corresponding to interactions between groups of spins [68].) Benioff's model did not take into account the possibility of putting quantum bits into superpositions as an integral part of the computation, however. In 1985, Deutsch proposed that the ordinary logic gates of reversible computation should be supplemented with intrinsically quantum-mechanical single-qubit operations [25]. Suppose that one is using a quantum-mechanical system to implement reversible computations using CCNOT gates. Now add to the ability to prepare qubits in desired states, and to perform CCNOT gates, the ability to perform single-qubit rotations of the form $e^{-i\theta\hat{\sigma}/2}$ as described above. Deutsch showed that the resulting set of operations allows universal quantum computation. Not only could such a computer perform any desired classical logical transformation on its quantum bits; it could perform any desired unitary transformation $U$ whatsoever.

Deutsch pointed out that a computer endowed with the ability to put quantum bits into superpositions and to perform reversible logic on those superpositions could compute in ways that classical computers could not. In particular, a classical reversible computer can evaluate any desired function of its input bits: $(x_1 \ldots x_n,\, 0 \ldots 0) \to (x_1 \ldots x_n,\, f(x_1 \ldots x_n))$, where $x_i$ represents the logical value, 0 or 1, of the $i$th bit, and $f$ is the desired function. In order to preserve reversibility, the computer has been supplied with an 'answer' register, initially in the state $00\ldots 0$, into which to place the answer $f(x_1 \ldots x_n)$. In a quantum computer, the input bits to any transformation can be in a quantum superposition. For example, if each input bit is in an equal superposition of 0 and 1, $(1/\sqrt 2)(|0\rangle + |1\rangle)$, then all $n$ qubits taken together are in the superposition

$$\frac{1}{2^{n/2}}\bigl(|00\ldots 0\rangle + |00\ldots 1\rangle + \cdots + |11\ldots 1\rangle\bigr) = \frac{1}{2^{n/2}}\sum_{x_1,\ldots,x_n = 0,1} |x_1 \ldots x_n\rangle \,. \quad (33)$$
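The uniform superposition of Eq. (33) can be constructed explicitly by applying a Hadamard-type single-qubit operation to each of $n$ qubits starting from $|00\ldots 0\rangle$. A sketch in pure Python, representing the $n$-qubit state as a vector of $2^n$ amplitudes (the helper name is ours):

```python
import math

def hadamard_on(state, k, n):
    # Apply H to qubit k (0-indexed from the left) of an n-qubit amplitude vector:
    # |0> -> (|0>+|1>)/sqrt(2),  |1> -> (|0>-|1>)/sqrt(2).
    s = 1 / math.sqrt(2)
    mask = 1 << (n - 1 - k)
    out = [0.0] * len(state)
    for idx, amp in enumerate(state):
        if amp == 0.0:
            continue
        b = 1 if idx & mask else 0
        out[idx & ~mask] += s * amp              # contribution to the bit-0 index
        out[idx | mask] += s * amp * (-1) ** b   # contribution to the bit-1 index
    return out

n = 3
state = [0.0] * (2 ** n)
state[0] = 1.0                     # |00...0>
for k in range(n):
    state = hadamard_on(state, k, n)

# Every computational basis state now has amplitude 1/2^{n/2}, as in Eq. (33).
assert all(abs(a - 1 / 2 ** (n / 2)) < 1e-12 for a in state)
```

One gate per qubit, i.e. $n$ operations, suffices to put $2^n$ basis states into the superposition.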
If such a superposition is supplied to a quantum computer that performs the transformation x1 : : : x n ! f (x1 : : : x n ), then the net effect is to take the superposition
Deutsch–Jozsa Algorithm
X
1 2n/2
jx1 : : : x n ij00 : : : 0i
x 1 ;:::;x n D0;1
!
1 2n/2
counterpoint reveal meaning that is not there in each of the voices taken separately.
X
jx1 : : : x n ij f (x1 : : : x n )i :
(34)
x 1 ;:::;x n D0;1
That is, even though the quantum computer evaluates the function f only once, it evaluates it on every term in the superposition of inputs simultaneously, an effect which Deutsch termed ‘quantum parallelism’. At first, quantum parallelism might seem to be spectacularly powerful: with only one function ‘call’, one performs the function on 2n different inputs. The power of quantum parallelism is not so easy to tease out, however. For example, suppose one makes a measurement on the output state in (33) in the fj0i; j1ig basis. The result is a randomly selected input-output pair, (x1 : : : x n ; f (x1 : : : x n )). One could have just as easily obtained such a pair by feeding a random input string into a classical computer that evaluates f . As will now be seen, the secret to orchestrating quantum computations that are more powerful than classical computations lies in arranging quantum interference between the different states in the superposition of equation (34). The word ‘orchestration’ in the previous sentence was used for a reason. In quantum mechanics, states of physical systems correspond to waves. For example, the state of an electron is associated with a wave that is the solution of the Schrödinger equation for that electron. Similarly, in a quantum computer, a state such as jx1 : : : x n ij f (x1 : : : x n )i is associated with a wave that is the solution of the Schrödinger equation for the underlying quantum degrees of freedom (e. g., electron spins or photon polarizations) that make up the computers quantum bits. The waves of quantum mechanics, like waves of water, light, or sound, can be superposed on each other to construct composite waves. A quantum computer that performs a conventional reversible computation, in which its qubits only take on the values 0 or 1 and are never in superpositions ˛j0i C ˇj1i, can be thought of as an analogue of a piece of music like a Gregorian chant, in which a single, unaccompanied voice follows a prescribed set of notes. 
A quantum computer that performs many computations in quantum parallel is analogous to a symphony, in which many lines or voices are superposed to create chords, counterpoint, and harmony. The quantum computer programmer is the composer who writes and orchestrates this quantum symphony: her job is to make those many voices interfere so that the desired answer emerges.
Let's examine a simple example, due to David Deutsch and Richard Jozsa, in which the several 'voices' of a quantum computer can be orchestrated to solve a problem more rapidly than a classical computer [69]. Consider the set of functions f that take one bit of input and produce one bit of output. There are four such functions:

f(x) = 0 ;  f(x) = 1 ;  f(x) = x ;  f(x) = NOT x .  (35)

The first two of these functions are constant; the second two are 'balanced' in the sense that half of their inputs yield 0 as output, while the other half yield 1. Suppose that one is presented with a 'black box' that implements one of these functions. The problem is to query this black box and to determine whether the function the box contains is constant or balanced. Classically, it clearly takes exactly two queries to determine whether the function in the box is constant or balanced. Using quantum information processing, however, one query suffices. The following quantum circuit shows how this is accomplished.
Quantum Information Processing, Figure 2 2-Qubit Deutsch–Jozsa circuit
Quantum circuit diagrams are similar in character to their classical counterparts: qubits enter on the left, undergo a series of transformations effected by quantum logic gates, and exit at the right, where they are measured. In the circuit above, the first gate, represented by H, is called a Hadamard gate. The Hadamard gate is a single-qubit quantum logic gate that effects the transformation

|0⟩ → (1/√2)(|0⟩ + |1⟩) ,
|1⟩ → (1/√2)(|0⟩ − |1⟩) .  (36)

In other words, the Hadamard performs the unitary transformation

U_H = (1/√2) [ 1   1
               1  −1 ]

on its single-qubit input. Note that the Hadamard transformation is its own inverse: U_H² = Id.
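As a quick numerical check, the transformation (36) and the self-inverse property can be verified directly (a sketch in Python with numpy; all variable names are ours):

```python
import numpy as np

# Hadamard gate, as in (36): |0> -> (|0>+|1>)/sqrt(2), |1> -> (|0>-|1>)/sqrt(2)
U_H = np.array([[1, 1],
                [1, -1]]) / np.sqrt(2)

ket0 = np.array([1.0, 0.0])
ket1 = np.array([0.0, 1.0])

plus = U_H @ ket0    # (|0> + |1>)/sqrt(2)
minus = U_H @ ket1   # (|0> - |1>)/sqrt(2)

# The Hadamard is its own inverse: U_H^2 = Id
identity_check = U_H @ U_H
```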
The second logic gate implements the unknown, black-box function f. It takes two binary inputs, x, y, and gives two binary outputs. The gate leaves the first input unchanged, and adds f(x) to the second input (modulo 2), so that x → x and y → y + f(x) (mod 2). Such gates can be implemented using the controlled-NOT operation introduced above. Recall that the controlled-NOT or CNOT leaves its first input bit unchanged, and flips the second if and only if the first input is 1. In the symbol for a controlled-NOT operation, the • part represents the control bit and the ⊕ part represents the bit that can be flipped. The circuits required to implement the four different functions from one bit to one bit (shown as circuit diagrams in the original figure) are those for

f(x) = 0 ;  f(x) = 1 ;  f(x) = x ;  f(x) = NOT x .  (37)
The black box in the Deutsch–Jozsa algorithm contains one of these circuits. Note that the black-box circuits are 'classical' in the sense that they map input combinations of 0's and 1's to output combinations of 0's and 1's: the circuits of (37) make sense as classical circuits as well as quantum circuits. Any classical circuit that can determine whether f is constant or balanced requires at least two uses of the f gate. By contrast, the Deutsch–Jozsa circuit above requires only one use of the f gate. Going through the quantum logic circuit, one finds that a constant function yields the output |0⟩ on the first output line, while a balanced function yields the output |1⟩ (up to an overall, unobservable phase). That is, only a single function call is required to reveal whether f is constant or balanced. Several comments on the Deutsch–Jozsa algorithm are in order. The first is that, when comparing quantum algorithms to classical algorithms, it is important to compare apples to apples: that is, the gates used in the quantum algorithm to implement the black-box circuits should be the same as those used in any classical algorithm. The difference, of course, is that the quantum gates preserve quantum coherence, a concept which is meaningless in the classical context. This requirement has been respected in the Deutsch–Jozsa circuit above. The second comment is that the Deutsch–Jozsa algorithm is decidedly odd and counterintuitive. The f gates and the controlled-NOT gates from which they are constructed both have the property that the first input passes through unchanged: |0⟩ → |0⟩ and |1⟩ → |1⟩. Yet somehow, when the algorithm is identifying balanced functions, the first bit flips. How can this be? This is the part where quantum weirdness enters. Even though the f and controlled-NOT gates leave their first input unchanged in the logical basis {|0⟩, |1⟩}, the same property does not hold
in other bases. For example, let |+⟩ = (1/√2)(|0⟩ + |1⟩) = U_H|0⟩, and let |−⟩ = (1/√2)(|0⟩ − |1⟩) = U_H|1⟩. A straightforward calculation shows that, when acting on the basis {|+⟩, |−⟩}, the CNOT still behaves like a CNOT, but with the roles of its inputs reversed: now the second qubit passes through unchanged, while the first qubit gets flipped. It is this quantum role reversal that underlies the efficacy of the Deutsch–Jozsa algorithm. Pretty weird. It is important to note that the Deutsch–Jozsa algorithm is not just a theoretical point. The algorithm has been implemented using techniques from nuclear magnetic resonance (NMR) [33]. The results are exactly as predicted by quantum mechanics: a single function call suffices to determine whether that function is constant or balanced. The two-qubit algorithm was first described by David Deutsch. Later, with Richard Jozsa, he extended this algorithm to a multi-qubit algorithm. Now consider functions f from n bits to a single bit. Once again, the problem is to determine whether f is constant or balanced. That is, the function f in the black box is either constant (f(x) = 0 for all n-bit inputs x, or f(x) = 1 for all x) or balanced: f(x) = 0 for exactly half of its 2^n possible input strings, and f(x) = 1 for the other half. (If this problem statement seems somewhat artificial, note that the algorithm works equally well for distinguishing between constant functions and 'typical' functions, which are approximately balanced.) On average, a classical algorithm takes a little more than two function calls to distinguish between a constant and a balanced function. However, in the worst case, it takes 2^{n−1} + 1 calls, as more than half the inputs have to be sampled.
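The two-qubit case can be simulated outright. The following sketch (Python with numpy; the helper names are ours, and the standard |0⟩|1⟩ input convention for the circuit is assumed) applies H to both qubits, the f gate, and a final H to the first qubit, for each of the four functions of (35):

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
I2 = np.eye(2)

def U_f(f):
    """The f gate |x>|y> -> |x>|y XOR f(x)>, as a 4x4 permutation matrix."""
    U = np.zeros((4, 4))
    for x in (0, 1):
        for y in (0, 1):
            U[2 * x + (y ^ f(x)), 2 * x + y] = 1
    return U

def deutsch_jozsa(f):
    # Prepare |0>|1>, Hadamard both qubits, apply the f gate, Hadamard the first qubit
    state = np.kron([1.0, 0.0], [0.0, 1.0])
    state = np.kron(H, H) @ state
    state = U_f(f) @ state
    state = np.kron(H, I2) @ state
    # If the first qubit reads 1, the function is balanced; if 0, constant
    p_first_is_1 = state[2] ** 2 + state[3] ** 2
    return 'balanced' if p_first_is_1 > 0.5 else 'constant'

results = {name: deutsch_jozsa(f) for name, f in [
    ('f=0', lambda x: 0), ('f=1', lambda x: 1),
    ('f=x', lambda x: x), ('f=NOTx', lambda x: 1 - x)]}
```

A single application of U_f suffices in each case, and the phase kickback onto the first qubit is exactly the 'role reversal' described above.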
As before, the quantum algorithm takes but a single function call, as the following circuit shows.
Quantum Information Processing, Figure 3 n-Qubit Deutsch–Jozsa circuit
To determine whether f is constant or balanced, one measures the first n output bits: if they are all 0, then the function is constant; if one or more is 1, then the function is balanced.

Other Algorithms: The Quantum Fourier Transform

While it conclusively demonstrates that quantum computers are strictly more powerful than classical computers for certain problems, the Deutsch–Jozsa algorithm does not solve a problem of burning interest to applied computer scientists. Once it was clear that quantum computers could offer a speedup over classical algorithms, however, other algorithms began to be developed. Simon's algorithm [70], for example, determines whether a function f from n bits to n bits is (a) one-to-one, or (b) two-to-one with a large period s, so that f(x + s) = f(x) for all x. (In Simon's algorithm the addition is bitwise modulo 2, with no carry bits.) Simon's algorithm has a similar 'flavor' to the Deutsch–Jozsa algorithm: it is intriguing but does not obviously admit wide application. A giant step towards constructing more useful algorithms was Coppersmith's introduction [71] of the quantum Fourier transform (QFT). The fast Fourier transform maps a function of n bits to its discrete Fourier transform:

f(x) → g(y) = Σ_{x=0}^{2^n−1} e^{2πixy/2^n} f(x) .  (38)
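Numerically, the transform (38) is just a discrete Fourier transform, so it can be cross-checked against a library FFT (a sketch in Python with numpy; note that the positive-exponent kernel of (38) corresponds to numpy's inverse FFT up to a factor of 2^n, which is the convention assumed here):

```python
import numpy as np

n = 3
N = 2 ** n
rng = np.random.default_rng(0)
f = rng.standard_normal(N)

# Direct evaluation of (38): g(y) = sum_x e^{2 pi i x y / 2^n} f(x)
x = np.arange(N)
F = np.exp(2j * np.pi * np.outer(x, x) / N)   # F[y, x] = e^{2 pi i x y / N}
g = F @ f

# Same result via the FFT: numpy's ifft uses the e^{+2 pi i x y / N} kernel
# with a 1/N normalization, so multiply by N
g_fft = np.fft.ifft(f) * N

# 2^{-n/2} F is unitary, which is what makes the quantum version legitimate
unitary_ok = np.allclose((F / np.sqrt(N)).conj().T @ (F / np.sqrt(N)), np.eye(N))
```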
The fast Fourier transform takes O(n2^n) steps. The quantum Fourier transform takes a wave function over n qubits to a Fourier-transformed wave function:

Σ_{x=0}^{2^n−1} f(x)|x⟩ → 2^{−n/2} Σ_{x,y=0}^{2^n−1} e^{2πixy/2^n} f(x)|y⟩ .  (39)

It is not difficult to show that the quantum Fourier transform is unitary. To obtain a quantum logic circuit that accomplishes the QFT, it is convenient to express states in a binary representation. In the equations above, x and y are n-bit numbers. Write x as x_n…x_1, where x_n, …, x_1 are the bits of x. This is just a more concise way of saying that x = x_1·2^0 + ⋯ + x_n·2^{n−1}. Similarly, the expression 0.y_1…y_m represents the number y_1/2 + ⋯ + y_m/2^m. Using this binary notation, it is not hard to show that the quantum Fourier transform can be written:

|x_1…x_n⟩ → 2^{−n/2} (|0⟩ + e^{2πi·0.x_1}|1⟩)(|0⟩ + e^{2πi·0.x_2x_1}|1⟩) ⋯ (|0⟩ + e^{2πi·0.x_n…x_1}|1⟩) .  (40)

When the quantum Fourier transform is written in this form, it is straightforward to construct a circuit that implements it (see Fig. 4). Note that the QFT circuit for wave functions over n qubits takes O(n^2) steps: it is exponentially faster than the FFT for functions over n bits, which takes O(n2^n) steps. This exponential speedup of the quantum Fourier transform is what guarantees the efficacy of many quantum algorithms. The quantum Fourier transform is a potentially powerful tool for obtaining exponential speedups for quantum computers over classical computers. The key is to find a way of posing the problem to be solved in terms of finding periodic structure in a wave function. This step is the essence of the best-known quantum algorithm, Shor's algorithm for factoring large numbers [26].

Quantum Information Processing, Figure 4 Quantum Fourier Transform

Shor's Algorithm
The factoring problem can be stated as follows: given N = pq, where p, q are prime, find p and q. For large p
and q, this problem is apparently hard for classical computers. The fastest known classical algorithm, the number field sieve, takes a number of steps exponential in (log N)^{1/3}. The apparent difficulty of the factoring problem for classical computers is important for cryptography. The commonly used RSA public-key cryptosystem relies on the difficulty of factoring to guarantee security. Public-key cryptography addresses the following societally important situation. Alice wants to send Bob some secure information (e.g., a credit card number). Bob sends Alice the number N, but does not reveal the identity of p or q. Alice then uses N to construct an encrypted version of the message she wishes to send. Anyone who wishes to decrypt this message must know what p and q are. That is, encryption can be performed using the public key N, but decryption requires the private key p, q. In 1994, Peter Shor showed that quantum computers could be used to factor large numbers and so crack public-key cryptosystems whose security rests on the difficulty of factoring [26]. The algorithm operates by solving the so-called 'discrete logarithm' problem: given N and some number x, find the smallest r such that x^r ≡ 1 (mod N). Solving the discrete logarithm problem allows N to be factored by the following procedure. First, pick x < N at random. Use Euclid's algorithm to check that the greatest common divisor of x and N is 1. (Euclid's algorithm is to divide N by x; take the remainder r_1 and divide x by r_1; take the remainder of that division, r_2, and divide r_1 by that, etc. The final nonzero remainder in this procedure is the greatest common divisor, or g.c.d., of x and N.) If the g.c.d. of x and N is not 1, then it is either p or q and we are done. If the greatest common divisor of x and N is 1, suppose that we can solve the discrete logarithm problem to find the smallest r such that x^r ≡ 1 (mod N). As will be seen, if r is even, we will be able to find the factors of N easily.
If r turns out to be odd, just pick a new x and start again: continue until you obtain an even r (since this occurs half the time, you have to repeat this step no more than twice on average). Once an even r has been found, we have (x^{r/2} − 1)(x^{r/2} + 1) ≡ 0 (mod N). In other words, (x^{r/2} − 1)(x^{r/2} + 1) = bN = bpq for some b. Finding the greatest common divisors of x^{r/2} − 1 and x^{r/2} + 1 with N now reveals p and q. The goal of the quantum algorithm, then, is to solve the discrete logarithm problem to find the smallest r such that x^r ≡ 1 (mod N). If r can be found, then N can be factored. In its discrete logarithm guise, factoring possesses a periodic structure that the quantum Fourier transform can reveal. First, find an x whose g.c.d. with N is 1, as above, and pick n so that N^2 < 2^n < 2N^2. The quantum algorithm uses two n-qubit registers. Begin by constructing a uniform superposition 2^{−n/2} Σ_{k=0}^{2^n−1} |k⟩|0⟩. Next, perform exponentiation modulo N to construct the state

2^{−n/2} Σ_{k=0}^{2^n−1} |k⟩|x^k mod N⟩ .  (41)
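The classical scaffolding around the quantum step can be illustrated for a toy case (a sketch in Python; N = 15 and x = 7 are our example values): find the order r of x modulo N by brute force, then extract the factors from x^{r/2} ± 1 as described above.

```python
import math

# Classical illustration of the reduction from factoring to order-finding
# (the step that Shor's algorithm performs quantumly)
N = 15
x = 7
assert math.gcd(x, N) == 1

# Find the order r: the smallest r with x^r = 1 (mod N). The sequence
# x^k mod N is periodic with period r, which is what the QFT reveals.
r = 1
while pow(x, r, N) != 1:
    r += 1

assert r % 2 == 0                      # if r were odd, we would pick a new x
p = math.gcd(pow(x, r // 2) - 1, N)    # gcd(x^{r/2} - 1, N)
q = math.gcd(pow(x, r // 2) + 1, N)    # gcd(x^{r/2} + 1, N)
```

Here the order of 7 modulo 15 is r = 4, and the two gcd computations recover the factors 3 and 5. The brute-force loop is exactly the step that is exponentially expensive classically and that the quantum algorithm replaces.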
This modular exponentiation step takes O(n^3) operations (note that x^{2^k} mod N can be evaluated by first constructing x^2 mod N, then constructing (x^2)^2 mod N, etc.). The periodic structure in (41) arises because if x^k ≡ a (mod N) for some a, then x^{k+r} ≡ a (mod N), x^{k+2r} ≡ a (mod N), …, x^{k+mr} ≡ a (mod N), up to the largest m such that k + mr < 2^n. The same periodicity holds for any a. That is, the wave function (41) is periodic with period r. So if we apply the quantum Fourier transform to this wave function, we can reveal that period and find r, thereby solving the discrete logarithm and factoring problems. To reveal the hidden period and find r, apply the QFT to the first register in the state (41). The result is

2^{−n} Σ_{j,k=0}^{2^n−1} e^{2πijk/2^n} |j⟩|x^k mod N⟩ .  (42)
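The classical post-processing that follows a measurement of (42), recovering r from an outcome j with j/2^n ≈ s/r, can be sketched in Python (the example values N, n, r_true, and s are ours; `Fraction.limit_denominator` performs the continued-fraction step):

```python
from fractions import Fraction

# A measurement of (42) yields j with j/2^n close to s/r for some integer s.
# Recover the candidate period r as the denominator of the best rational
# approximation with denominator below N, via a continued-fraction expansion.
N = 15          # number to factor (so r < N)
n = 8           # chosen so that N^2 < 2^n < 2N^2: 225 < 256 < 450
r_true = 4      # order of x modulo N in this example
s = 3
j = round(s * 2 ** n / r_true)   # a measurement outcome near (s/r) * 2^n

approx = Fraction(j, 2 ** n).limit_denominator(N - 1)
r_candidate = approx.denominator
# Note: when gcd(s, r) > 1 the recovered denominator is only a divisor of r,
# which is one reason the procedure must be repeated a few times.
```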
Because of the periodic structure, positive interference takes place when j(k + ℓr) is close to a multiple of 2^n for all ℓ. That is, measuring the first register now yields a number j such that jr/2^n is close to an integer: only for such j does the necessary positive interference take place. In other words, the algorithm reveals a j such that j/2^n ≈ s/r for some integer s. That is, to find r, we need to find fractions s/r that approximate j/2^n. Such fractions can be obtained using a continued-fraction approximation. With reasonably high probability, the result of the continued-fraction approximation can be shown to yield the desired answer r. (More precisely, repetition of this procedure O(log N) times suffices to identify r.) Once r is known, the factors p and q of N can be recovered by the reduction of factoring to discrete logarithm given above. The details of Shor's algorithm reveal considerable subtlety, but the basic idea is straightforward. In its reduction to discrete logarithm, factoring possesses a hidden periodic structure. This periodic structure can be revealed using a quantum Fourier transform, and the period itself in turn reveals the desired factors. More recent algorithms also put the quantum Fourier transform to use to extract hidden periodicities. Notably, the QFT can be used to find solutions to Pell's equation (x^2 − ny^2 = 1, for non-square n) [72]. Generalizations of the QFT to transforms over groups (the dihedral group and the permutation group S_n on n objects) have been applied to other problems such as the shortest vector on a lattice [73] (dihedral group, with some success) and the graph isomorphism problem (S_n, without much success [74]).

The Phase-Estimation Algorithm

One of the most useful applications of the quantum Fourier transform is finding the eigenvectors and eigenvalues of unitary transformations. The resulting algorithm is called the 'phase-estimation' algorithm; its original form is due to Kitaev [75]. Suppose that we have the ability to apply a 'black box' unitary transformation U. U can be written U = Σ_j e^{iφ_j} |ψ_j⟩⟨ψ_j|, where the |ψ_j⟩ are the eigenvectors of U and the e^{iφ_j} are the corresponding eigenvalues. The goal of the algorithm is to estimate the e^{iφ_j} and the |ψ_j⟩. (The goal of the original Kitaev algorithm was only to estimate the eigenvalues e^{iφ_j}; however, Abrams and Lloyd showed that the algorithm could also be used to construct and estimate the eigenvectors |ψ_j⟩ as well [76].) The steps of the phase-estimation algorithm are as follows.

(0) Begin with the initial state |0⟩|ψ⟩, where |0⟩ is the n-qubit state |00…0⟩, and |ψ⟩ is the state that one wishes to decompose into eigenstates: |ψ⟩ = Σ_j α_j |ψ_j⟩.

(1) Using Hadamards or a QFT, put the first register into a uniform superposition of all possible states:

→ 2^{−n/2} Σ_{k=0}^{2^n−1} |k⟩|ψ⟩ .

(2) In the kth component of the superposition, apply U^k to |ψ⟩:

→ 2^{−n/2} Σ_{k=0}^{2^n−1} |k⟩U^k|ψ⟩ = 2^{−n/2} Σ_{j,k=0}^{2^n−1} α_j e^{ikφ_j} |k⟩|ψ_j⟩ .

(3) Apply the inverse QFT to the first register:

→ 2^{−n} Σ_{j,k,l=0}^{2^n−1} α_j e^{ikφ_j} e^{−2πikl/2^n} |l⟩|ψ_j⟩ .

(4) Measure the registers. The second register contains an eigenvector |ψ_j⟩. The first register contains |l⟩ where 2πl/2^n ≈ φ_j. That is, the first register contains an n-bit approximation to φ_j.
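Steps (0)-(4) can be simulated directly for the simple case in which |ψ⟩ is itself an eigenvector, so the readout can be checked against a known eigenphase (a sketch in Python with numpy; the register size and the phase are our example choices):

```python
import numpy as np

# Phase estimation for a 'black box' U restricted to one eigenvector |psi>,
# with eigenphase phi chosen to be exactly representable in n bits
n = 6
K = 2 ** n
phi = 2 * np.pi * (21 / 64)

# After steps (0)-(2), the first register holds amplitude 2^{-n/2} e^{i k phi}
# on |k> (the second register stays in |psi> throughout and is omitted)
reg = np.array([np.exp(1j * k * phi) for k in range(K)]) / np.sqrt(K)

# Step (3): inverse QFT on the first register
QFT = np.array([[np.exp(2j * np.pi * k * l / K) for k in range(K)]
                for l in range(K)]) / np.sqrt(K)
reg = QFT.conj().T @ reg

# Step (4): measuring yields l with 2 pi l / 2^n close to phi
probs = np.abs(reg) ** 2
l = int(np.argmax(probs))
estimate = 2 * np.pi * l / K
```

Because the phase here fits exactly in n bits, the interference is perfectly constructive on a single outcome; for a generic phase the distribution peaks on the nearest n-bit approximation.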
By repeating the phase-estimation algorithm many times, one samples the eigenvectors and eigenvalues of U. Note that to obtain n bits of accuracy, one must possess the ability to apply U 2^n times. This feature limits the applicability of the phase-estimation algorithm to a relatively small number of bits of accuracy, or to the estimation of eigenvalues of those U that can easily be applied an exponentially large number of times. We have already seen such a process in modular exponentiation. Indeed, Kitaev originally identified the phase-estimation algorithm as an alternative method for factoring. Even when only a relatively small number of applications of U can be performed, however, the phase-estimation algorithm can provide an exponential improvement over classical algorithms for problems such as estimating the ground state of some physical Hamiltonian [76,77], as will now be seen.

Quantum Simulation

One of the earliest uses for a quantum computer was suggested by Richard Feynman [24]. Feynman noted that simulating quantum systems on a classical computer is hard: computer simulations of systems such as lattice gauge theories take up a substantial fraction of all supercomputer time, and, even then, are often far less effective than their programmers could wish them to be. The reason why it is hard to simulate a quantum system on a classical computer is straightforward: in the absence of any sneaky tricks, the only known way to simulate a quantum system's time evolution is to construct a representation of the full state of the system, and to evolve that state forward using the system's quantum equation of motion. To represent the state of a quantum system on a classical computer is typically exponentially hard, however: an n-spin system requires 2^n complex numbers to represent its state. Evolving that state forward is even harder: it requires exponentiation of a 2^n by 2^n matrix.
Even for a small quantum system, for example, one containing fifty spins, this task lies beyond the reach of existing classical supercomputers. True, supercomputers are also improving exponentially in time (Moore's law). No matter how powerful they become, however, they will not be able to simulate more than 300 spins directly, for the simple reason that recording the 2^300 numbers that characterize the state of the spins would require all of the roughly 2^300 particles in the universe within the particle horizon. Feynman noted that if one uses qubits instead of classical bits, the state of an n-spin system can be represented using just n qubits. Feynman proposed a class of systems called 'universal quantum simulators' that could be programmed to simulate any other quantum system. A universal quantum simulator has to possess a flexible dynamics that can be altered at will to mimic the dynamics of the system to be simulated. That is, the dynamics of the universal quantum simulator form an analog of the dynamics of the simulated system. Accordingly, one might also call quantum simulators 'quantum analog computers'. In 1996, Lloyd showed how Feynman's proposal could be turned into a quantum algorithm [78]. For each degree of freedom of the system to be simulated, allocate a quantum register containing a sufficient number of qubits to approximate the state of that degree of freedom to some desired accuracy. If one wishes to simulate the system's interaction with the environment, a number of registers should also be allocated to simulate the environment (for a d-dimensional system, up to d^2 registers are required to simulate the environment). Now write the Hamiltonian of the system and environment as H = Σ_{ℓ=1}^{m} H_ℓ, where each H_ℓ operates on only a few degrees of freedom. The Trotter formula implies that

e^{−iHΔt} = e^{−iH_1Δt} ⋯ e^{−iH_mΔt} − (1/2) Σ_{jk} [H_j, H_k] Δt^2 + O(Δt^3) .  (43)

Each e^{−iH_ℓΔt} can be simulated using quantum logic operations on the quantum bits in the registers corresponding to the degrees of freedom on which H_ℓ acts. To simulate the time evolution of the system over time t = nΔt, we simply apply e^{−iHΔt} n times, yielding

e^{−iHt} = (e^{−iHΔt})^n = (Π_ℓ e^{−iH_ℓΔt})^n − (n/2) Σ_{jk} [H_j, H_k] Δt^2 + O(nΔt^3) .  (44)

The quantum simulation takes O(mn) steps, and reproduces the original time evolution to an accuracy h^2 t^2 m^2 / n, where h is the average size of ||[H_j, H_k]|| (note that for simulating systems with local interactions, most of these terms are zero, because most of the local interactions commute with each other). A second algorithm for quantum simulation takes advantage of the quantum Fourier transform [79,80]. Suppose that one wishes to simulate the time evolution of a quantum particle whose Hamiltonian is of the form H = P^2/2m + V(X), where P = −i∂/∂x is the momentum operator for the particle, and V(X) is the potential energy operator for the particle expressed as a function of the position operator X. Using an n-bit discretization for the state, we identify the x eigenstates with |x⟩ = |x_n…x_1⟩. The momentum eigenstates are then just the quantum Fourier transform of the position eigenstates: |p⟩ = 2^{−n/2} Σ_{x=0}^{2^n−1} e^{2πixp/2^n} |x⟩. That is, P = U_QFT X U_QFT†. By the Trotter formula, the infinitesimal time evolution operator is

e^{−iHΔt} = e^{−iP^2Δt/2m} e^{−iV(X)Δt} + O(Δt^2) .  (45)
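The commutator error terms in (43)-(44) can be seen numerically with two non-commuting single-qubit terms (a sketch in Python with numpy; the Pauli-X/Z example Hamiltonian is our choice):

```python
import numpy as np

def evolve(H, t):
    """e^{-i H t} for Hermitian H, via eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return V @ np.diag(np.exp(-1j * w * t)) @ V.conj().T

# H = H1 + H2 with [H1, H2] != 0
H1 = np.array([[0, 1], [1, 0]], dtype=complex)    # Pauli X
H2 = np.array([[1, 0], [0, -1]], dtype=complex)   # Pauli Z
H = H1 + H2

t = 1.0
exact = evolve(H, t)

def trotter(n_slices):
    """(e^{-i H1 dt} e^{-i H2 dt})^n with dt = t / n_slices."""
    dt = t / n_slices
    step = evolve(H1, dt) @ evolve(H2, dt)
    return np.linalg.matrix_power(step, n_slices)

# Slicing time more finely shrinks the commutator error term in (43)
error_coarse = np.linalg.norm(trotter(20) - exact)
error_fine = np.linalg.norm(trotter(200) - exact)
```

The error falls in proportion to the number of time slices, as the first-order commutator term in (44) predicts.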
To enact this time evolution operator one proceeds as above. Write the state of the particle in the x-basis: |ψ⟩ = Σ_x ψ_x|x⟩. First apply the infinitesimal e^{−iV(X)Δt} operator:

Σ_x ψ_x|x⟩ → Σ_x ψ_x e^{−iV(x)Δt}|x⟩ .  (46)
To apply the infinitesimal e^{−iP^2Δt/2m} operator, first apply an inverse quantum Fourier transform to the state, then apply the unitary transformation |x⟩ → e^{−ix^2Δt/2m}|x⟩, and finally apply the regular QFT. Because X and P are related by the quantum Fourier transform, these three steps effectively apply the transformation e^{−iP^2Δt/2m}. Applying first e^{−iV(X)Δt} and then e^{−iP^2Δt/2m} yields the full infinitesimal time evolution (45). The full time evolution operator e^{−iHt} can then be built up by repeating the infinitesimal operator t/Δt times. As before, the accuracy of the quantum simulation can be enhanced by slicing time ever more finely. Quantum simulation represents one of the most powerful uses of quantum computers. It is probably the application of quantum computers that will first give an advantage over classical supercomputers, as only one hundred qubits or fewer are required to simulate, e.g., molecular orbitals or chemical reactions, more accurately than the most powerful classical supercomputer. Indeed, special-purpose quantum simulators have already been constructed using nuclear magnetic resonance techniques [81]. These quantum analog computers involve interactions between many hundreds of nuclear spins, and so are already performing computations that could not be performed by any classical computer, even one the size of the entire universe.

Quantum Search

The algorithms described above afford an exponential speedup over the best classical algorithms currently known. Such exponential speedups via quantum computation are hard to find, and are currently limited to a few special problems. There exists a large class of quantum algorithms that afford a polynomial speedup over the best possible classical algorithms, however. These algorithms are based on Grover's quantum search algorithm. Grover's algorithm [31] allows a quantum computer to search an unstructured database. Suppose that this database contains N items, one of which is 'marked' and the remainder of which are unmarked. Call the marked item w, for 'winner'. Such a database can be represented by a function f(x) on the items in the database, such that f of the marked item is 1, and f of any unmarked item is 0. That is, f(w) = 1, and f(x ≠ w) = 0. A classical search for the marked item must take N/2 database calls, on average. By contrast, a quantum search for the marked item takes O(√N) calls, as will now be shown. Unstructured database search is an 'oracle' problem. In computer science, an oracle is a 'black box' function: one can supply the black box with an input x, and the black box then provides an output f(x), but one has no access to the mechanism inside the box that computes f(x) from x. For the quantum case, the oracle is represented by a function on two registers, one containing x, and the other containing a single qubit. The oracle takes |x⟩|y⟩ → |x⟩|y + f(x)⟩, where the addition takes place modulo 2. Grover originally phrased his algorithm in terms of a 'phase' oracle U_w, where U_w|x⟩ = (−1)^{f(x)}|x⟩. In other words, the 'winner' state acquires a phase of −1: |w⟩ → −|w⟩, while the other states remain unchanged: |x ≠ w⟩ → |x⟩. Such a phase oracle can be constructed from the original oracle in several ways. The first way involves two oracle calls. Begin with the state |x⟩|0⟩ and call the oracle once to construct the state |x⟩|f(x)⟩. Now apply a σ_z transformation to the second register. The effect of this is to take the state to (−1)^{f(x)}|x⟩|f(x)⟩. Applying the oracle for a second time yields the desired phase-oracle state (−1)^{f(x)}|x⟩|0⟩.
A second, sneakier way to construct a phase oracle is to initialize the second qubit in the state (1/√2)(|0⟩ − |1⟩). A single call of the original oracle on the state |x⟩(1/√2)(|0⟩ − |1⟩) then transforms this state into (−1)^{f(x)}|x⟩(1/√2)(|0⟩ − |1⟩). In this way a phase oracle can be constructed from a single application of the original oracle. Two more ingredients are needed to perform Grover's algorithm. Let's assume that N = 2^n for some n, so that the different states |j⟩ can be written in binary form. Let U_0 be the unitary transformation that takes |0…0⟩ → −|0…0⟩ and |j⟩ → |j⟩ for j ≠ 0. That is, U_0 acts in the same way as U_w, but applies a phase of −1 to |0…0⟩ rather than to |w⟩. In addition, let H be the transformation that performs Hadamard transformations on all of the qubits individually.
Grover's algorithm is performed as follows. Prepare all qubits in the state |0⟩ and apply the global Hadamard transformation H to create the state |ψ⟩ = (1/√N) Σ_{j=0}^{N−1} |j⟩. Apply, in succession, U_w, then H, then U_0, then H again. These four transformations make up the composite transformation U_G = H U_0 H U_w. Now apply U_G again, and repeat for a total of (π/4)√N times (that is, the total number of times U_G is applied is equal to the integer closest to (π/4)√N). The system is now, with high probability, in the state |w⟩. That is,
U_G^{(π/4)√N} |0…0⟩ ≈ |w⟩. Since each application of U_G contains a single call to the phase oracle U_w, the winner state |w⟩ has now been identified with O(√N) oracle calls, as promised. The quantum algorithm succeeds because the transformation U_G acts as a rotation in the two-dimensional subspace defined by the states |ψ⟩ and |w⟩. The angle of rotation effected by each application of U_G can be shown to be given by sin θ = 2/√N. Note that |ψ⟩ and |w⟩ are approximately orthogonal, ⟨ψ|w⟩ = 1/√N, and that after the initial Hadamard transformation the system begins in the state |ψ⟩. Each successive application of U_G moves it an angle θ closer to |w⟩. Finally, after (π/4)√N iterations, the state has rotated the full π/2 distance to |w⟩. Grover's algorithm can be shown to be optimal [82]: no black-box algorithm can find |w⟩ with fewer than O(√N) iterations of the oracle. The algorithm also works for oracles where there are M winners, so that f(x) = 1 for M distinct inputs. In this case, the angle of rotation for each iteration of U_G is given by sin θ = (2/N)√(M(N − M)), and the algorithm takes (π/4)√(N/M) steps to identify a winner.
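The rotation picture can be verified by direct simulation (a sketch in Python with numpy; the database size and marked item are our example choices, and the inner H U_0 H step is written as the equivalent 'inversion about the mean' operator 2|ψ⟩⟨ψ| − Id, which agrees with it up to an overall phase):

```python
import numpy as np

# Grover search over N = 2^n items with marked item w
n = 6
N = 2 ** n
w = 42

psi = np.ones(N) / np.sqrt(N)          # uniform superposition H|0...0>

U_w = np.eye(N)
U_w[w, w] = -1                         # phase oracle: |w> -> -|w>
D = 2 * np.outer(psi, psi) - np.eye(N) # inversion about the mean (H U_0 H)

state = psi.copy()
iterations = int(round(np.pi / 4 * np.sqrt(N)))
for _ in range(iterations):
    state = D @ (U_w @ state)          # one application of U_G

p_w = state[w] ** 2                    # probability of measuring the winner
```

With N = 64 the loop runs 6 times, and the final measurement finds the marked item with probability close to one, after only O(√N) oracle calls.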
The Adiabatic Algorithm

Many classically hard problems take the form of optimization problems. In the well-known traveling salesman problem, for example, one aims to find the shortest route connecting a set of cities. Such optimization problems can be mapped onto a physical system, in which the function to be optimized is mapped onto the energy function of the system. The ground state of the physical system then represents a solution to the optimization problem. A common classical technique for solving such problems is simulated annealing: one simulates the process of gradually cooling the system in order to find its ground state [83]. Simulated annealing is bedeviled by the problem of local minima, states of the system that are close to the optimal states in terms of energy, but very far away in terms of the particular configuration of the degrees of freedom of the state. To
avoid getting stuck in such local minima, one must slow the cooling process to a glacial pace in order to ensure that the true ground state is reached in the end. Quantum computing provides a method for getting around the problem of local minima. Rather than trying to reach the ground state of the system by cooling, one uses a purely quantum-mechanical technique for finding the state [84]. One starts the system with a Hamiltonian dynamics whose ground state is simple to prepare (e.g., 'all spins sideways'). Then one gradually deforms the Hamiltonian from the simple dynamics to the more complex dynamics whose ground state encodes the answer to the problem in question. If the deformation is sufficiently gradual, then the adiabatic theorem of quantum mechanics guarantees that the system remains in its ground state throughout the deformation process. When the adiabatic deformation is complete, the state of the system can be measured to reveal the answer. Adiabatic quantum computation (also called 'quantum annealing') represents a purely quantum way to find the answer to hard problems. How powerful is adiabatic quantum computation? The answer is, 'nobody knows for sure'. The key question is, what counts as 'sufficiently gradual' deformation? That is, how slow does the deformation have to be to guarantee that the transformation is adiabatic? The answer to this question lies deep in the heart of quantum matter. As one performs the transformation from simple to complex dynamics, the adiabatic quantum computer goes through a quantum phase transition. The maximum speed at which the computation can be performed is governed by the size of the minimum energy gap of this quantum phase transition. The smaller the gap, the slower the computation. The scaling of gaps during phase transitions ('gapology') is one of the key disciplines in the study of quantum matter [85].
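A tiny instance shows the procedure end to end (a sketch in Python with numpy; the three-qubit cost function and the total time T are our example choices): start in the 'all spins sideways' ground state of H_0 = −Σ_i X_i, slowly interpolate H(s) = (1−s)H_0 + sH_1 toward a diagonal cost Hamiltonian H_1, and track the minimum gap along the way.

```python
import numpy as np

def evolve(H, t):
    """e^{-i H t} for Hermitian H, via eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return V @ np.diag(np.exp(-1j * w * t)) @ V.conj().T

n = 3
N = 2 ** n
X = np.array([[0, 1], [1, 0]], dtype=complex)

# H0 = -sum_i X_i: its ground state is the uniform superposition
H0 = np.zeros((N, N), dtype=complex)
for i in range(n):
    term = np.eye(1, dtype=complex)
    for j in range(n):
        term = np.kron(term, X if j == i else np.eye(2, dtype=complex))
    H0 -= term

# H1: diagonal cost Hamiltonian; its ground state (index 5, cost 0) is the answer
cost = np.array([3, 1, 4, 6, 2, 0, 7, 5], dtype=float)
H1 = np.diag(cost).astype(complex)

T, steps = 100.0, 2000
dt = T / steps
state = np.ones(N, dtype=complex) / np.sqrt(N)   # ground state of H0
min_gap = np.inf
for k in range(steps):
    s = (k + 0.5) / steps                        # gradual deformation parameter
    H = (1 - s) * H0 + s * H1
    w = np.linalg.eigvalsh(H)
    min_gap = min(min_gap, w[1] - w[0])
    state = evolve(H, dt) @ state

answer = int(np.argmax(np.abs(state) ** 2))
p_answer = float(np.abs(state[5]) ** 2)
```

Because T is large compared to the inverse square of the minimum gap here, the final measurement finds the minimum-cost configuration with high probability; shrink T and the success probability drops, which is the adiabatic condition in miniature.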
While the scaling of the gap has been calculated for many familiar quantum systems such as Ising spin glasses, calculating the gap for adiabatic quantum computers that are solving hard optimization problems seems to be just about as hard as solving the problem itself. Few quantum computer scientists believe that adiabatic quantum computation can solve the traveling salesman problem, but there is good reason to believe that adiabatic quantum computation can outperform simulated annealing on a wide variety of hard optimization problems. In addition, it is known that adiabatic quantum computation is neither more nor less powerful than quantum computation itself: a quantum computer can simulate a physical system undergoing adiabatic time evolution using the quantum simulation techniques described above; in addition, it is possible to construct devices that perform conventional quantum computation in an adiabatic fashion [86].

Quantum Walks

A final, 'physics-based', type of algorithm is the quantum walk [87,88,89,90]. Quantum walks are coherent versions of classical random walks. A classical random walk is a stochastic Markov process: the random walker steps between different states, labeled by $j$, with probability $w_{ij}$ of making the transition from state $j$ to state $i$. Here $w_{ij}$ is a stochastic matrix: $w_{ij} \ge 0$ and $\sum_i w_{ij} = 1$. In a quantum walk, the stochastic, classical process is replaced by a coherent, quantum process: the states $|j\rangle$ are quantum states, and the transition matrix $U_{ij}$ is unitary. By exploiting quantum coherence, quantum walks can typically be shown to give a square-root speedup over classical random walks. For example, in propagation along a line, a classical random walk is purely diffusive, with the expectation value of the displacement along the line growing as the square root of the number of steps in the walk. By contrast, a quantum walk can be set up as a coherent, propagating wave, so that the expectation value of the displacement is proportional to the number of steps [88]. A particularly elegant example of a square-root speedup in a quantum walk is the evaluation of a NAND tree [90]. A NAND tree is a binary tree containing a NAND gate at each vertex. Given inputs on the leaves of the tree, the problem is to evaluate the outcome at the root of the tree: is it zero or one? NAND trees are ubiquitous in, e.g., game theory: the question of who wins at chess, checkers, or Go is determined by evaluating a suitable NAND tree. Classically, evaluating a depth-$n$ NAND tree requires on the order of $2^{0.753 n}$ steps. A quantum walk, by contrast, can evaluate a NAND tree using only $2^{n/2}$ steps. For some specially designed problems, such as propagation along a random tree, quantum walks can give exponential speedups over classical walks [89].
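The ballistic-versus-diffusive contrast for walks on a line can be seen numerically (an illustrative sketch, not from the text: a discrete-time Hadamard-coin walk; the symmetric initial coin state and all names are my choices):

```python
import numpy as np

def hadamard_walk_std(steps=100):
    """Standard deviation of position after a discrete-time Hadamard walk."""
    n = 2 * steps + 1                       # positions -steps .. steps
    amp = np.zeros((n, 2), dtype=complex)   # amplitude[position, coin]
    amp[steps, 0] = 1 / np.sqrt(2)          # symmetric initial coin state
    amp[steps, 1] = 1j / np.sqrt(2)
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    for _ in range(steps):
        amp = amp @ H.T                     # coin flip
        new = np.zeros_like(amp)
        new[:-1, 0] = amp[1:, 0]            # coin 0 moves left
        new[1:, 1] = amp[:-1, 1]            # coin 1 moves right
        amp = new
    prob = (np.abs(amp) ** 2).sum(axis=1)
    x = np.arange(-steps, steps + 1)
    mean = (prob * x).sum()
    return float(np.sqrt((prob * (x - mean) ** 2).sum()))

print(hadamard_walk_std(100))   # spreads linearly in the number of steps
print(np.sqrt(100))             # classical diffusive spread: sqrt(steps)
```

For 100 steps the quantum standard deviation comes out to roughly 54, versus 10 for the diffusive classical walk: ballistic rather than diffusive spreading.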
The question of what problems can be evaluated more rapidly using quantum walks than using classical walks remains open.

The Future of Quantum Algorithms

The quantum algorithms described above are potentially powerful and, if large-scale quantum computers can be constructed, could be used to solve a number of important problems for which no efficient classical algorithms exist. Many questions concerning quantum algorithms remain open. While the majority of quantum computer scientists would agree that quantum algorithms are unlikely to provide solutions to NP-complete problems, it is not known whether quantum algorithms could provide solutions to such problems as graph isomorphism or the shortest vector problem on a lattice. Such questions are an active field of research in quantum computer science.

Noise and Errors

The picture of quantum computation given in the previous section is an idealized one that does not take into account the problems that arise when quantum computers are built in practice. Quantum computers can be built using nuclear magnetic resonance, ion traps, trapped atoms in cavities, linear optics with feedback from nonlinear measurements, superconducting systems, quantum dots, electrons on the surface of liquid helium, and a variety of other standard and exotic techniques. Any system that can be controlled in a coherent fashion is a candidate for quantum computation. Whether a coherently controllable system can actually be made to compute depends primarily on whether it is possible to deal effectively with the noise intrinsic to that system. Noise induces errors in computation. Every type of quantum information processor is subject to its own particular form of noise, and a detailed discussion of the various technologies for building quantum computers lies beyond the scope of this article. While the types of noise differ from one quantum technology to another, however, the methods for dealing with that noise are common across technologies. This section presents a general formalism for characterizing noise and errors, and discusses the use of quantum error-correcting codes and other techniques for coping with those errors.

Open-System Operations

The time evolution of a closed quantum-mechanical system is given by a unitary transformation: $\rho \to U \rho U^\dagger$, where $U$ is unitary, $U^\dagger = U^{-1}$. For discussing noise and quantum communications, it is necessary to look at the time evolution of open quantum systems, which can exchange quantum information with their environment.
The discussion of open quantum systems is straightforward: simply adjoin the system's environment, and consider the coupled system and environment as a closed quantum system. If the joint density matrix for system and environment is $\rho_{SE}$, its time evolution is

$\rho_{SE}(0) \to \rho_{SE}(t) = U_{SE}\, \rho_{SE}(0)\, U_{SE}^\dagger \,. \qquad (46)$
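The adjoin-evolve-trace prescription can be checked in a few lines (an illustrative sketch, not from the text: a CNOT coupling the system qubit to a one-qubit environment, which realizes the dephasing channel; the Kraus operators at the end are derived from that particular choice of coupling):

```python
import numpy as np

# joint unitary U_SE: CNOT with the system as control, environment as target
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

plus = np.array([1, 1], dtype=complex) / np.sqrt(2)
rho_s0 = np.outer(plus, plus.conj())          # system starts in |+><+|
rho_e0 = np.diag([1.0, 0.0]).astype(complex)  # environment starts in |0><0|

# evolve the joint state, then trace out the environment
rho_se = CNOT @ np.kron(rho_s0, rho_e0) @ CNOT.conj().T
rho_s = rho_se.reshape(2, 2, 2, 2).trace(axis1=1, axis2=3)
print(rho_s.real)  # coherence destroyed: the system ends in Id/2

# the same evolution written as a Kraus (completely positive) map
A0 = np.diag([1.0, 0.0]).astype(complex)      # effect <0|_E U_SE |0>_E
A1 = np.diag([0.0, 1.0]).astype(complex)      # effect <1|_E U_SE |0>_E
rho_kraus = A0 @ rho_s0 @ A0.conj().T + A1 @ rho_s0 @ A1.conj().T
print(np.allclose(rho_kraus, rho_s))          # True
```

The two descriptions agree, as they must: the Kraus operators are just the matrix elements of the joint unitary taken in a basis for the environment.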
The state of the system on its own is obtained by taking the partial trace over the environment, as described above: $\rho_S(t) = \mathrm{tr}_E\, \rho_{SE}(t)$. A particularly useful case of system-environment interaction is one in which the system and environment are initially uncorrelated, so that $\rho_{SE}(0) = \rho_S(0) \otimes \rho_E(0)$. In this case, the time evolution of the system on its own can always be written as $\rho_S(t) = \sum_k A_k \rho_S(0) A_k^\dagger$. Here the $A_k$ are operators that satisfy $\sum_k A_k^\dagger A_k = \mathrm{Id}$; the $A_k$ are called Kraus operators, or effects. Such a time evolution for the system on its own is called a completely positive map. A simple example of such a completely positive map for a qubit is $A_0 = \mathrm{Id}/\sqrt{2}$, $A_1 = \sigma_x/\sqrt{2}$. The pair $\{A_0, A_1\}$ is easily seen to obey $A_0^\dagger A_0 + A_1^\dagger A_1 = \mathrm{Id}$. This completely positive map corresponds to a time evolution in which the qubit has a 50% chance of being flipped about the x-axis (the effect $A_1$) and a 50% chance of remaining unchanged (the effect $A_0$). The infinitesimal version of any completely positive map can be obtained by taking $\rho_{SE}(0) = \rho_S(0) \otimes \rho_E(0)$ and expanding (46) to second order in $t$. The result is the Lindblad master equation:

$\frac{\partial \rho_S}{\partial t} = -i[\tilde{H}_S, \rho_S] - \frac{1}{2} \sum_k \left( L_k^\dagger L_k \rho_S - 2 L_k \rho_S L_k^\dagger + \rho_S L_k^\dagger L_k \right) \,. \qquad (47)$

Here $\tilde{H}_S$ is the effective system Hamiltonian: it is equal to the Hamiltonian $H_S$ for the system on its own, plus a perturbation induced by the interaction with the environment (the so-called 'Lamb shift'). The $L_k$ correspond to open-system effects such as noise and errors.

Quantum Error-Correcting Codes

One of the primary effects of the environment on quantum information is to cause errors. Such errors can be corrected using quantum error-correcting codes, which are quantum analogs of classical error-correcting codes such as Hamming codes or Reed–Solomon codes [91]. Like classical error-correcting codes, quantum error-correcting codes involve first encoding quantum information in a redundant fashion; the redundant quantum information is then subjected to noise and errors; then the code is decoded, at which point the information needed to correct the errors lies in the code's syndrome. More bad things can happen to quantum information than to classical information. The only error that can occur to a classical bit is a bit-flip. By contrast, a quantum bit can be flipped about the x-axis (the effect $\sigma_x$), flipped about the y-axis (the effect $\sigma_y$), flipped about the z-axis (the effect $\sigma_z$), or subjected to some combination of these effects. Indeed, an error on a quantum bit could take the form of a rotation by an unknown angle about an unknown axis. Since specifying that angle and axis precisely could take
an infinite number of bits of information, it might at first seem impossible to detect and correct such an error. In 1996, however, Peter Shor [92] and Andrew Steane [93] independently realized that if an error-correcting code could detect and correct bit-flip errors ($\sigma_x$) and phase-flip errors ($\sigma_z$), then such a code would in fact correct any single-qubit error. The reasoning is as follows. First, since $\sigma_y = i \sigma_x \sigma_z$, a code that detects and corrects first $\sigma_x$ errors, then $\sigma_z$ errors, will also correct $\sigma_y$ errors. Second, since any single-qubit rotation can be written as a combination of $\sigma_x$, $\sigma_y$, and $\sigma_z$ rotations, the code will correct arbitrary single-qubit errors. The generalization of such quantum error-correcting codes to multiple-qubit errors yields the Calderbank–Shor–Steane (CSS) codes [94]. A powerful technique for identifying and characterizing quantum codes is Gottesman's stabilizer formalism [95]. Concatenation is a useful method for constructing codes, both classical and quantum. Concatenation combines two codes, with the second code acting on bits that have already been encoded using the first code. Quantum error-correcting codes can be combined with quantum computation to perform fault-tolerant quantum computation, which allows quantum computation to be performed accurately even in the presence of noise and errors, as long as those errors occur at a rate below some threshold [96,97,98]. For restricted error models [99], this rate can be as high as 1%–3%. For realistic error models, however, the threshold is closer to $10^{-3}$–$10^{-4}$.

Re-focusing

Quantum error-correcting codes are not the only technique available for dealing with noise. If, as is frequently the case, environmentally induced noise possesses some identifiable structure, such as correlations in space and time, or obeys some set of symmetries, then powerful techniques come into play for coping with it. First of all, suppose that the noise is correlated in time.
The simplest such correlation is a static imperfection: the Hamiltonian of the system is supposed to be $H$, but the actual Hamiltonian is $H + \Delta H$, where $\Delta H$ is some unknown perturbation. For example, an electron spin could have the Hamiltonian $H = (\hbar/2)(\omega + \Delta\omega)\sigma_z$, where $\Delta\omega$ is an unknown frequency shift. If not attended to, such a frequency shift will introduce unknown phases into a quantum computation, which will in turn cause errors. Such an unknown perturbation can be dealt with quite effectively simply by flipping the electron back and forth: let the electron evolve for time $T$; flip it about the x-axis; let it evolve for time $T$ again; finally, flip it back about the x-axis.
The total time-evolution operator for the system is then

$\sigma_x\, e^{-i(\omega + \Delta\omega) T \sigma_z}\, \sigma_x\, e^{-i(\omega + \Delta\omega) T \sigma_z} = \mathrm{Id} \,. \qquad (48)$
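This cancellation is easy to verify numerically (a small sketch; the numeric value of the accumulated phase, which stands in for $(\omega + \Delta\omega)T$, need not be known for the identity to hold):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)

def evolve(theta):
    """Free evolution e^{-i theta sigma_z}: diagonal, so written directly."""
    return np.diag([np.exp(-1j * theta), np.exp(1j * theta)])

def refocused(theta):
    """Evolve, flip about x, evolve again, flip back: X U X U."""
    U = evolve(theta)
    return X @ U @ X @ U

theta = 1.2345  # arbitrary unknown phase (omega + delta) * T
print(np.allclose(refocused(theta), np.eye(2)))  # True for any theta
```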
That is, this simple refocusing technique cancels out the effect of the unknown frequency shift, along with the time evolution of the unperturbed Hamiltonian. Even if the environmental perturbation varies in time, refocusing can significantly reduce the effects of such noise. For time-varying noise, refocusing effectively acts as a filter, suppressing noise whose correlation time is longer than the refocusing timescale $T$. More elaborate refocusing techniques can be used to cope with the effects of couplings between qubits. Refocusing requires no additional qubits or syndromes, and so is a simpler (and typically much more effective) technique for dealing with errors than quantum error-correcting codes. For existing experimental systems, refocusing typically makes up the 'first line of defense' against environmental noise. Once refocusing has dealt with time-correlated noise, quantum error correction can then be used to deal with any residual noise and errors.

Decoherence-free Subspaces and Noiseless Subsystems

If the noise has correlations in space, then quantum information can often be encoded in such a way as to be resistant to the noise even in the absence of active error correction. A common version of such spatial correlation occurs when each qubit is subjected to the same error. For example, suppose that two qubits are subjected to noise in the form of a fluctuating Hamiltonian $H(t) = (\hbar/2)\, \gamma(t)\, (\sigma_z^{(1)} + \sigma_z^{(2)})$. This Hamiltonian introduces a time-varying phase $\phi(t)$ between the states $|{\uparrow}\rangle_i$ and $|{\downarrow}\rangle_i$ of each qubit. The key point to note is that this phase is the same for both qubits. A simple way to compensate for such a phase is to encode the logical state $|0\rangle$ as the two-qubit state $|{\uparrow}\rangle_1 |{\downarrow}\rangle_2$, and the logical state $|1\rangle$ as the two-qubit state $|{\downarrow}\rangle_1 |{\uparrow}\rangle_2$.
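The invariance of this encoding can be checked directly (an illustrative numpy sketch, not from the text: collective dephasing applied as a diagonal unitary, with the phase angle chosen arbitrarily):

```python
import numpy as np

def collective_dephasing(phi):
    """e^{-i phi (Z1 + Z2) / 2}: the same phase noise acting on both qubits."""
    Z = np.diag([1.0, -1.0])
    I2 = np.eye(2)
    Ztot = np.kron(Z, I2) + np.kron(I2, Z)
    return np.diag(np.exp(-1j * phi * np.diag(Ztot) / 2))

# logical states living in the decoherence-free subspace
ket01 = np.array([0, 1, 0, 0], dtype=complex)  # |0>_L = |up, down>
ket10 = np.array([0, 0, 1, 0], dtype=complex)  # |1>_L = |down, up>

U = collective_dephasing(0.9)
print(np.allclose(U @ ket01, ket01), np.allclose(U @ ket10, ket10))
```

Both encoded states are eigenstates of $\sigma_z^{(1)} + \sigma_z^{(2)}$ with eigenvalue zero, so the collective phase simply never appears, whatever $\phi$ is.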
It is simple to verify that the two-qubit encoded states are now invariant under the action of the noise: any phase acquired by the first qubit is canceled out by the equal and opposite phase acquired by the second qubit. The subspace spanned by the two-qubit states j0i; j1i is called a decoherence-free subspace: it is invariant under the action of the noise. Decoherence-free subspaces were first discovered by Zanardi [100] and later popularized by Lidar [101]. Such subspaces can be found essentially whenever the generators of the noise possess some symmetry. The general form that decoherence-free subspaces take arises from the following observation concerning the relationship between noise and symmetry.
Let $\{E_k\}$ be the effects that generate the noise, so that the noise takes $\rho \to \sum_k E_k \rho E_k^\dagger$, and let $\mathcal{E}$ be the algebra generated by the $\{E_k\}$. Let $\mathcal{G}$ be a symmetry of this algebra, so that $[g, E] = 0$ for all $g \in \mathcal{G}$, $E \in \mathcal{E}$. The Hilbert space for the system then decomposes into irreducible representations of $\mathcal{E}$ and $\mathcal{G}$ in the following well-known way:

$H = \bigoplus_j H_{\mathcal{E}}^j \otimes H_{\mathcal{G}}^j \,, \qquad (49)$

where the $H_{\mathcal{E}}^j$ are the irreducible representations of $\mathcal{E}$, and the $H_{\mathcal{G}}^j$ are the irreducible representations of $\mathcal{G}$. The decomposition (49) immediately suggests a simple way of encoding quantum information so that it is immune to the effects of the noise. Look at the effect of the noise on states of the form $|\chi\rangle_j \otimes |\psi\rangle_j$, where $|\chi\rangle_j \in H_{\mathcal{E}}^j$ and $|\psi\rangle_j \in H_{\mathcal{G}}^j$ for some $j$. The effect $E_k$ acts on this state as $(E_k^j |\chi\rangle_j) \otimes |\psi\rangle_j$, where $E_k^j$ is the operator corresponding to $E_k$ within the representation $H_{\mathcal{E}}^j$. In other words, if we encode quantum information in the state $|\psi\rangle_j$, then the noise has no effect on it. A decoherence-free subspace corresponds to an $H_{\mathcal{G}}^j$ for which the corresponding representation of $\mathcal{E}$, $H_{\mathcal{E}}^j$, is one-dimensional. The case where $H_{\mathcal{E}}^j$ is higher-dimensional is called a noiseless subsystem [102]. Decoherence-free subspaces and noiseless subsystems represent highly effective methods for dealing with the presence of noise. Like refocusing, these methods exploit symmetry to encode quantum information in a form that is immune to noise possessing that symmetry. Where refocusing exploits temporal symmetry, decoherence-free subspaces and noiseless subsystems exploit spatial symmetry. All such symmetry-based techniques have the advantage that no active error-correcting process is required. Like refocusing, therefore, decoherence-free subspaces and noiseless subsystems form the first line of defense against noise and errors. The tensor-product decomposition into irreducible representations in (49) in fact lies behind all known error-correcting codes [103]. A general quantum error-correcting code begins with a state $|00\ldots0\rangle_A |\psi\rangle$, where $|00\ldots0\rangle_A$ is the initial state of the ancilla. An encoding transformation $U_{en}$ is applied; an error $E_k$ occurs; finally, a decoding transformation $U_{de}$ is applied to obtain the state

$|e_k\rangle_A |\psi\rangle = U_{de}\, E_k\, U_{en}\, |00\ldots0\rangle_A |\psi\rangle \,. \qquad (50)$

Here, $|e_k\rangle_A$ is the state of the ancilla that tells us that the error corresponding to the effect $E_k$ has occurred. Equation (50) shows that an error-correcting code is just a noiseless subsystem for the 'dressed errors' $\{U_{de} E_k U_{en}\}$.
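The encode-error-decode-correct cycle can be made concrete with the simplest quantum code, the three-qubit bit-flip code, which protects against single $\sigma_x$ errors (an illustrative numpy sketch, not from the text; the syndrome, which in a real device would be measured via ancilla qubits, is read off classically from the state vector purely for illustration):

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)

def op(single, wire):
    """Single-qubit operator acting on wire 0, 1, or 2 of three qubits."""
    mats = [I2, I2, I2]
    mats[wire] = single
    return np.kron(np.kron(mats[0], mats[1]), mats[2])

# encode a|0> + b|1>  ->  a|000> + b|111>
a, b = 0.6, 0.8
logical = np.zeros(8, dtype=complex)
logical[0b000] = a
logical[0b111] = b

# a bit-flip error on one (arbitrarily chosen) qubit
corrupted = op(X, 1) @ logical

# syndrome: parities Z0Z1 and Z1Z2, identical on both basis components
support = [i for i, amp in enumerate(corrupted) if abs(amp) > 0]
bits = [(support[0] >> (2 - w)) & 1 for w in range(3)]
s01, s12 = bits[0] ^ bits[1], bits[1] ^ bits[2]
flipped = {(0, 0): None, (1, 0): 0, (1, 1): 1, (0, 1): 2}[(s01, s12)]

recovered = corrupted if flipped is None else op(X, flipped) @ corrupted
print(np.allclose(recovered, logical))  # True
```

The syndrome identifies which qubit flipped without revealing the amplitudes $a$ and $b$, which is precisely what lets the correction preserve the encoded quantum information.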
At bottom, all quantum error-correcting codes are based on symmetry.

Topological Quantum Computing

A particularly interesting form of quantum error correction arises when the underlying symmetry is a topological one. Kitaev [104] has shown how quantum computation can be embedded in a topological context. Two-dimensional systems with the proper symmetries exhibit topological excitations called anyons. The name 'anyon' comes from the properties of these excitations under exchange. Bosons, when exchanged, obtain a phase of $+1$; fermions, when exchanged, obtain a phase of $-1$. Anyons, by contrast, when exchanged, can obtain an arbitrary phase $e^{i\theta}$. For example, the anyons that underlie the fractional quantum Hall effect obtain a phase $e^{2\pi i/3}$ when exchanged. Fractional quantum Hall anyons can be used for quantum computation in a way that makes two-qubit quantum logic gates intrinsically resistant to noise [105]. The most interesting topological effects in quantum computation arise when one employs non-abelian anyons [104]. Non-abelian anyons are topological excitations that possess internal degrees of freedom. When two non-abelian anyons are exchanged, those internal degrees of freedom are subjected not merely to an additional phase, but to a general unitary transformation $U$. Kitaev has shown how, in systems with the proper symmetries, quantum computation can be effected simply by exchanging anyons. The actual computation takes place by dragging anyons around each other in the two-dimensional space. The resulting transformation can be visualized as a braid in two-dimensional space plus the additional dimension of time. Topological quantum computation is intrinsically fault tolerant: the topological excitations that carry quantum information are impervious to locally occurring noise; only a global transformation that changes the topology of the full system can create an error.
Because of their potential for fault tolerance, two-dimensional systems that possess the exotic symmetries required for topological quantum computation are being actively sought out.

Quantum Communication

Quantum mechanics provides the fundamental limits to information processing. Above, quantum limits to computation were investigated. Quantum mechanics also provides the fundamental limits to communication. This section discusses those limits, and closes with a discussion of quantum cryptography, a set of techniques by
which quantum mechanics guarantees the privacy and security of cryptographic protocols.

Multiple Uses of Channels

Each quantum communication channel is characterized by its own open-system dynamics. Quantum communication channels can possess memory, or be memoryless, depending on their interaction with their environment. Quantum channels with memory are a difficult topic, which will be discussed briefly below. Most of the discussion that follows concerns the memoryless quantum channel. A single use of such a channel corresponds to a completely positive map, $\rho \to \sum_k A_k \rho A_k^\dagger$, and $n$ uses of the channel correspond to the transformation

$\rho_{1\ldots n} \to \sum_{k_1 \ldots k_n} (A_{k_1} \otimes \cdots \otimes A_{k_n})\, \rho_{1\ldots n}\, (A_{k_1}^\dagger \otimes \cdots \otimes A_{k_n}^\dagger) \equiv \sum_K A_K\, \rho_{1\ldots n}\, A_K^\dagger \,, \qquad (51)$
where we have used the capital letter $K$ to indicate the $n$ channel-use indices $k_1 \ldots k_n$. In general, the input state $\rho_{1\ldots n}$ may be entangled from use to use of the channel. Many outstanding questions in quantum communication theory remain unsolved, including, for example, the question of whether entangling the inputs of the channel helps in communicating classical information.

Sending Quantum Information

Let's begin with using quantum channels to send quantum information. That is, we wish to send some quantum state $|\psi\rangle$ from the input of the channel to the output. To do this, we encode the state as some state of $n$ inputs to the channel, send the encoded state down the channel, and then apply a decoding procedure at the output of the channel. Such a procedure is immediately seen to be equivalent to employing a quantum error-correcting code. The general formula for the capacity of such quantum channels is known [44,45]. Take some input or 'signal' state $\rho_{1\ldots n}$ for the channel. First, construct a purification of this state. A purification of a density matrix $\rho$ for the signal is a pure state $|\psi\rangle_{AS}$ for the signal together with an ancilla, such that $\rho_S = \mathrm{tr}_A |\psi\rangle_{AS}\langle\psi|$ is equal to the original density matrix $\rho$. There are many different ways to purify a state: a simple, explicit way is to write $\rho = \sum_j p_j |j\rangle\langle j|$ in diagonal form, where $\{|j\rangle\}$ is the eigenbasis for $\rho$. The state $|\psi\rangle_{AS} = \sum_j \sqrt{p_j}\, |j\rangle_A |j\rangle_S$, where $\{|j\rangle_A\}$ is an orthonormal set of states for the ancilla, then yields a purification of $\rho$. To obtain the capacity of the channel for sending quantum information, proceed as follows. Construct a purification for the signal $\rho_{1\ldots n}$: $|\psi_n\rangle = \sum_J \sqrt{p_J}\, |J\rangle_{A^n} |J\rangle_{S^n}$, where we have used an index $J$ instead of $j$ to indicate that these states are summed over $n$ uses of the channel. Now send the signal state down the channel, yielding the state

$\rho_{AS} = \sum_{J,J'} \sqrt{p_J p_{J'}}\, |J\rangle_{A^n}\langle J'| \otimes \sum_K A_K |J\rangle_{S^n}\langle J'| A_K^\dagger \,, \qquad (52)$
where, as above, $K = k_1 \ldots k_n$ indexes the $n$ uses of the channel. $\rho_{AS}$ is the state of the output signal together with the ancilla. Similarly, $\rho_S = \mathrm{tr}_A\, \rho_{AS}$ is the state of the output signal on its own. Let $I(AS) = -\mathrm{tr}\, \rho_{AS} \log_2 \rho_{AS}$ be the entropy of $\rho_{AS}$, measured in bits. Similarly, let $I(S) = -\mathrm{tr}\, \rho_S \log_2 \rho_S$ be the entropy of the output state $\rho_S$, taken on its own. Define $I(S/A) \equiv I(S) - I(AS)$ if this quantity is positive, and $I(S/A) \equiv 0$ otherwise. The quantity $I(S/A)$ is a measure of the capacity of the channel to send quantum information when the signals sent down the channel are described by the density matrix $\rho_{1\ldots n}$. It can be shown, using either CSS codes [106] or random codes [45,107], that encodings exist that allow quantum information to be sent down the channel and properly decoded at the output at a rate of $I(S/A)/n$ qubits per use. $I(S/A)$ is a function only of the properties of the channel and of the input signal state $\rho_{1\ldots n}$. The bigger $I(S/A)$ is, the less coherence the channel has managed to destroy. For example, if the channel is just a unitary transformation of the input, which destroys no quantum information, then $I(AS) = 0$ and $I(S/A) = I(S)$: the state of the signal and ancilla after the signal has passed through the channel is pure, and all quantum information passes down the channel unscathed. By contrast, a completely decohering channel takes the input $\sum_j \sqrt{p_j}\, |j\rangle_A |j\rangle_S$ to the output $\sum_j p_j\, |j\rangle_A\langle j| \otimes |j\rangle_S\langle j|$. In this case, $I(AS) = I(S)$ and $I(S/A) = 0$: the channel has completely destroyed all quantum information sent down it. In order to find the absolute capacity of the channel to transmit quantum information, we must maximize the quantity $I(S/A)/n$ over all $n$-state inputs $\rho_{1\ldots n}$ to the channel and take the limit as $n \to \infty$. More precisely, define

$I_C = \lim_{n\to\infty} \sup\, I(S/A)/n \,, \qquad (53)$
where the supremum (sup) is taken over all $n$-state inputs $\rho_{1\ldots n}$. $I_C$ is called the coherent information [44,45]: it is the capacity of the channel to transmit quantum information reliably. Because the coherent information is defined only in the limit that the length of the input state goes to infinity, it has been calculated exactly in only a few cases. One might hope, in analogy with Shannon's theory of classical communication, that for memoryless channels one need only optimize over single inputs. That hope is mistaken, however: entangling the input states typically increases the quantum channel capacity even for memoryless channels [108].
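The quantities $I(AS)$, $I(S)$, and $I(S/A)$ can be computed directly for single-qubit channels (an illustrative sketch, not from the text: the maximally mixed input is purified by a Bell state, and the two channels compared are the identity channel and the completely dephasing channel discussed above):

```python
import numpy as np

def entropy(rho):
    """von Neumann entropy in bits."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-(evals * np.log2(evals)).sum())

def coherent_info(kraus):
    """I(S/A) for a single-qubit channel with maximally mixed input."""
    # purification |psi>_AS = (|00> + |11>) / sqrt(2), ancilla first
    psi = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
    rho_as = sum(
        np.kron(np.eye(2), A) @ np.outer(psi, psi.conj())
        @ np.kron(np.eye(2), A).conj().T
        for A in kraus
    )
    rho_s = rho_as.reshape(2, 2, 2, 2).trace(axis1=0, axis2=2)  # trace ancilla
    return max(entropy(rho_s) - entropy(rho_as), 0.0)

identity_channel = [np.eye(2, dtype=complex)]
dephasing = [np.diag([1.0, 0.0]).astype(complex),
             np.diag([0.0, 1.0]).astype(complex)]

print(coherent_info(identity_channel))  # 1.0: one qubit survives per use
print(coherent_info(dephasing))         # 0.0: all quantum information lost
```

These two extremes match the unitary and completely decohering cases described above; intermediate channels give intermediate values of $I(S/A)$.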
Capacity of Quantum Channels to Transmit Classical Information

One of the most important questions in quantum communications is the capacity of quantum channels to transmit classical information. All of our classical communication channels (voice, free-space electromagnetic, fiber optic, etc.) are at bottom quantum mechanical, and their capacities are set by the laws of quantum mechanics. If quantum information theory can discover those limits, and devise ways of attaining them, it will have done humanity a considerable service. The general picture of classical communication using quantum channels is as follows. The conventional discussion of communication channels, both quantum and classical, designates the sender of information as Alice and the receiver as Bob. Alice selects an ensemble of input states $\rho_J$ over $n$ uses of the channel, and sends the $J$th input $\rho_J$ with probability $p_J$. The channel takes the $n$-state input $\rho_J$ to the output $\tilde\rho_J = \sum_K A_K \rho_J A_K^\dagger$. Bob then performs a generalized measurement $\{B_\ell\}$ with outcomes $\{\ell\}$ to try to reveal which state Alice sent. A generalized measurement is simply a specific form of open-system transformation: the $B_\ell$ are effects for a completely positive map, $\sum_\ell B_\ell^\dagger B_\ell = \mathrm{Id}$. After making the generalized measurement on an output state $\tilde\rho_J$, Bob obtains the outcome $\ell$ with probability $p_{\ell|J} = \mathrm{tr}\, B_\ell \tilde\rho_J B_\ell^\dagger$, and the system is left in the state $(1/p_{\ell|J})\, B_\ell \tilde\rho_J B_\ell^\dagger$. Once Alice has chosen a particular ensemble of signal states $\{\rho_J, p_J\}$, and Bob has chosen a particular generalized measurement, the amount of information that can be sent along the channel is determined by the input probabilities $p_J$ and the output probabilities $p_{\ell|J}$ and $p_\ell = \sum_J p_J p_{\ell|J}$. In particular, the rate at which information can be sent through the channel and reliably decoded at the output is given by the mutual information $I(\mathrm{in}\!:\!\mathrm{out}) = I(\mathrm{out}) - I(\mathrm{out}|\mathrm{in})$, where $I(\mathrm{out}) = -\sum_\ell p_\ell \log_2 p_\ell$ is the entropy of the output and $I(\mathrm{out}|\mathrm{in}) = -\sum_J p_J \sum_\ell p_{\ell|J} \log_2 p_{\ell|J}$ is the average entropy of the output conditioned on the state of the input. To maximize the amount of information that can be sent down the channel, Alice and Bob need to maximize over both the input states and Bob's measurement at the output. The Schumacher–Holevo–Westmoreland theorem, however, considerably simplifies the problem
of maximizing the information transmission rate of the channel by obviating the need to maximize over Bob's measurements [37,38,39]. Define the quantity

$\chi = S\Big(\sum_J p_J \tilde\rho_J\Big) - \sum_J p_J S(\tilde\rho_J) \,, \qquad (54)$

where $S(\rho) \equiv -\mathrm{tr}\, \rho \log_2 \rho$. That is, $\chi$ is the difference between the entropy of the average output state and the average entropy of the output states. The Schumacher–Holevo–Westmoreland theorem then states that the capacity of the quantum channel for transmitting classical information is given by $\lim_{n\to\infty} \sup \chi/n$, where the supremum is taken over all possible ensembles of input states $\{\rho_J, p_J\}$ over $n$ uses of the channel. For Bob to attain the channel capacity given by $\chi$, he must in general make entangling measurements over the channel outputs, even when the channel is memoryless and Alice does not entangle her inputs. (An entangling measurement is one that leaves the outputs in an entangled state after the measurement is made.) It would simplify the process of finding the channel capacity still further if, for memoryless channels, the optimization over input states could be performed over a single use of the channel, as is the case for classical communication channels, rather than having to take the limit as the number of inputs goes to infinity. If this were the case, then the channel capacity for memoryless channels would be attained with Alice sending unentangled states down the channel. Whether or not one is allowed to optimize over a single use for memoryless channels was for many years one of the primary unsolved conjectures of quantum information theory. Let's state this conjecture precisely. Let $\chi_n$ be the maximum of $\chi$ over $n$ uses of a memoryless channel. We then have the Channel Additivity Conjecture: $\chi_n = n \chi_1$. Shor showed that the channel additivity conjecture is equivalent to two other additivity conjectures, the additivity of minimum output entropy and the additivity of entanglement of formation [109]. Entanglement of formation was discussed in the section on entanglement above.
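The quantity $\chi$ is straightforward to evaluate for small ensembles (an illustrative sketch, not from the text: the ensemble $\{|0\rangle, |+\rangle\}$ with equal probabilities, sent through a noiseless channel so that $\tilde\rho_J = \rho_J$ and the second term vanishes):

```python
import numpy as np

def entropy(rho):
    """von Neumann entropy S(rho) in bits."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-(evals * np.log2(evals)).sum())

# ensemble {|0>, |+>} with equal probabilities through a noiseless channel
ket0 = np.array([1, 0], dtype=complex)
ketp = np.array([1, 1], dtype=complex) / np.sqrt(2)
rhos = [np.outer(k, k.conj()) for k in (ket0, ketp)]
probs = [0.5, 0.5]

avg = sum(p * r for p, r in zip(probs, rhos))
chi = entropy(avg) - sum(p * entropy(r) for p, r in zip(probs, rhos))
print(round(chi, 3))  # about 0.601 bits per use
```

Because the two pure states are non-orthogonal, $\chi$ falls short of one bit: Bob cannot distinguish them perfectly, and the Holevo quantity captures exactly how much classical information the ensemble can carry.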
The minimum output entropy for $n$ uses of a memoryless channel is simply the minimum, over input states $\rho_n$ for $n$ uses of the channel, of $S(\tilde\rho_n)$, where $\tilde\rho_n$ is the output state arising from the input $\rho_n$. We then have the Minimum Output Entropy Additivity Conjecture: the minimum over $\rho_n$ of $S(\tilde\rho_n)$ is equal to $n$ times the minimum over $\rho_1$ of $S(\tilde\rho_1)$.
Shor's result shows that the channel additivity conjecture and the minimum output entropy additivity conjecture are equivalent: each one implies the other. If these additivity conjectures could have been proved true, that would have resolved some of the primary outstanding problems in quantum channel capacity theory. Remarkably, however, Hastings recently showed that the minimum output entropy conjecture is false, by exhibiting a channel whose minimum output entropy for multiple uses is achieved for entangled inputs [110]. As a result, the question of just how much classical information can be sent down a quantum channel, and of just which quantum channels are additive and which are not, remains wide open.

Bosonic Channels

The most commonly used quantum communication channel is the so-called bosonic channel with Gaussian noise and loss [40]. Bosonic channels are ones that use bosons, such as photons or phonons, to communicate. Gaussian noise and loss is the most common type of noise and loss for such channels; it includes the effects of thermal noise, of noise from linear amplification, and of leakage of photons or phonons out of the channel. It has been shown that the capacity of bosonic channels with loss alone is attained by sending coherent states down the channel [42]. Coherent states are the sort of states produced by lasers, and are the states currently used in most bosonic channels. It has been conjectured that coherent states also maximize the capacity of quantum communication channels with Gaussian noise as well as loss [43]. This conjecture, if true, would establish the quantum-mechanical equivalent of Shannon's theorem for the capacity of classical channels with Gaussian noise and loss. The resolution of this conjecture can be shown to be equivalent to that of the following, simpler conjecture. Gaussian Minimum Output Entropy Conjecture: coherent states minimize the output entropy of bosonic channels with Gaussian noise and no loss.
The Gaussian minimum output entropy conjecture is intuitively appealing: an equivalent statement is that the vacuum input state minimizes the output entropy of a channel with Gaussian noise. In other words, to minimize the output entropy of the channel, send nothing. Despite its intuitive appeal, the Gaussian minimum output entropy conjecture has steadfastly resisted proof for decades. This conjecture is related to the additivity conjectures above: in particular, if the additivity conjectures were true, then the Gaussian minimum output entropy conjecture would also be true [110]. It is not known whether the converse implication also holds. Proving the Gaussian minimum output entropy conjecture remains one of the primary goals of quantum information theory.

Entanglement Assisted Capacity

Just as quantum bits possess greater mathematical structure than classical bits, so quantum channels possess greater variety than their classical counterparts. A classical channel has but a single capacity. A quantum channel has one capacity for transmitting quantum information (the coherent information), and another capacity for transmitting classical information (the Holevo quantity $\chi$). We can also ask about the capacity of a quantum channel in the presence of prior entanglement. The entanglement assisted capacity of a channel arises in the following situation. Suppose that Alice and Bob have used their quantum channel to build up a supply of entangled qubits, with Alice possessing half of each entangled pair and Bob the other half. Now Alice sends Bob some qubits over the channel. How much classical information can these qubits convey? At first one might think that the existence of shared prior entanglement should have no effect on the amount of information that Alice can send to Bob. After all, entanglement is a form of correlation, and the existence of prior correlation between Alice and Bob in a classical setting has no effect on the amount of information sent. In the quantum setting, however, the situation is different. Consider, for example, the case in which Alice and Bob have a perfect, noiseless channel. When Alice and Bob share no prior entanglement, a single qubit sent down the channel conveys exactly one bit of classical information. When Alice and Bob share prior entanglement, however, a single quantum bit can convey more than one bit of classical information.
Suppose that Alice and Bob share an entangled pair in the singlet state (1/√2)(|0⟩_A|1⟩_B − |1⟩_A|0⟩_B). Alice then performs one of four actions on her qubit: she does nothing (performs the identity Id on the qubit), flips the qubit about the x-axis (performs σ_x), flips the qubit about the y-axis (performs σ_y), or flips the qubit about the z-axis (performs σ_z). Now Alice sends her qubit to Bob. Bob then possesses one of the four orthogonal states (1/√2)(|0⟩_A|1⟩_B − |1⟩_A|0⟩_B), (1/√2)(|1⟩_A|1⟩_B − |0⟩_A|0⟩_B), (i/√2)(|1⟩_A|1⟩_B + |0⟩_A|0⟩_B), or (1/√2)(|0⟩_A|1⟩_B + |1⟩_A|0⟩_B). By measuring which of these states he possesses, Bob can determine which of the four actions Alice performed.

Quantum Information Processing

That is, when Alice and Bob share prior entanglement, Alice can send two classical bits for each quantum bit she sends. This phenomenon is known as superdense coding [111]. In general, the quantum channel connecting Alice to Bob is noisy. We can then ask, given the form of the quantum channel, how much the existence of prior entanglement helps Alice in sending classical information to Bob. The answer is given by the following theorem, due to Bennett, Shor, Smolin, and Thapliyal: the entanglement assisted capacity of a quantum channel is equal to the maximum of the quantum mutual information between the input and output of the channel [112]. The quantum mutual information is defined as follows. Prepare a purification |ψ⟩_AS of an input state and send the signal state S down the channel, resulting in the state ρ_AS as in (52) above. Defining ρ_S = tr_A ρ_AS and ρ_A = tr_S ρ_AS as before, the quantum mutual information is defined to be I_Q(A : S) = S(ρ_A) + S(ρ_S) − S(ρ_AS). The entanglement assisted capacity of the channel is obtained by maximizing the quantum mutual information I_Q(A : S) over input states. The entanglement assisted capacity of a quantum channel is greater than or equal to the channel's Holevo quantity, which is in turn greater than or equal to the channel's coherent information. Unlike the coherent information, which is known not to be additive over many uses of the channel, or the Holevo quantity, which is suspected to be additive but has not been proved to be so, the entanglement assisted capacity is known to be additive and so can readily be calculated for memoryless channels.
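The superdense coding protocol described above can be checked numerically. The following sketch (illustrative code, not from the article) verifies that Alice's four local Pauli actions on her half of the singlet produce four mutually orthogonal two-qubit states, which is exactly what allows Bob to decode two classical bits from the one qubit he receives:

```python
# Superdense coding, simulated with state vectors.
import numpy as np

I2 = np.eye(2)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)

# Singlet state (1/sqrt 2)(|0>_A |1>_B - |1>_A |0>_B)
singlet = (np.kron(ket0, ket1) - np.kron(ket1, ket0)) / np.sqrt(2)

# Alice applies one of Id, sigma_x, sigma_y, sigma_z to her qubit (A)
states = [np.kron(U, I2) @ singlet for U in (I2, sx, sy, sz)]

# Gram matrix of overlaps: the four states are orthonormal, so this
# should be the 4x4 identity, meaning Bob can distinguish them perfectly.
gram = np.array([[abs(np.vdot(a, b)) for b in states] for a in states])
print(np.round(gram, 10))
```

Each of Alice's two-bit messages maps to one orthogonal state, and a joint (Bell-basis) measurement by Bob recovers the two bits.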
Teleportation

As mentioned in the introduction, one of the strangest and most useful effects in quantum information processing is teleportation [46]. The traditional, science-fiction picture of teleportation works as follows. An object such as an atom or a human being is placed in a device called a teleporter. The teleporter makes complete measurements of the physical state of the object, destroying it in the process. The detailed information about that physical state is sent to a distant location, where a second teleporter uses that information to reconstruct an exact copy of the original object. At first, quantum mechanics would seem to make teleportation impossible: quantum measurements tend to disturb the object measured, and many identical copies of the object are required to obtain even a rough picture of its underlying quantum state. In the presence of shared, prior entanglement, however, teleportation is in
fact possible in principle, and simple instances of teleportation have been demonstrated experimentally. A hint of the possibility of teleportation comes from the phenomenon of superdense coding described in the previous section: if one qubit can be used to convey two classical bits using prior entanglement, then perhaps two classical bits might be used to convey one qubit. This hope turns out to be justified. Suppose that Alice and Bob each possess one qubit out of an entangled pair of qubits (that is, they mutually possess one 'e-bit'). Alice desires to teleport the state |ψ⟩ of another qubit. The teleportation protocol goes as follows. First, Alice makes a Bell-state measurement on the qubit to be teleported together with her half of the entangled pair. A Bell-state measurement on two qubits is one that determines whether the two qubits are in one of the four states |ψ_00⟩ = (1/√2)(|01⟩ − |10⟩), |ψ_01⟩ = (1/√2)(|00⟩ − |11⟩), |ψ_10⟩ = (1/√2)(|00⟩ + |11⟩), or |ψ_11⟩ = (1/√2)(|01⟩ + |10⟩). Alice obtains two classical bits of information as a result of her measurement, depending on which state |ψ_jk⟩ the measurement revealed. She sends these two bits to Bob. Bob now performs a unitary transformation on his half of the entangled qubit pair. If he receives 00, he does nothing. If he receives 01, he applies σ_x to flip his qubit about the x-axis. If he receives 10, he applies σ_y to flip his qubit about the y-axis. If he receives 11, he applies σ_z to flip his qubit about the z-axis. The result? After Bob has performed his transformation conditioned on the two classical bits he received from Alice, his qubit is now in the state |ψ⟩, up to an overall phase. Alice's state has been teleported to Bob. It might seem at first somewhat mysterious how this sequence of operations can teleport Alice's state to Bob. The mechanism of teleportation can be elucidated as follows. Write |ψ⟩ = α|0⟩_i + β|1⟩_i.
Alice and Bob's entangled pair is originally in the state |ψ_00⟩_AB = (1/√2)(|0⟩_A|1⟩_B − |1⟩_A|0⟩_B). The full initial state of the qubit to be teleported together with the entangled pair can then be written as

|ψ⟩|ψ_00⟩_AB = (α|0⟩_i + β|1⟩_i) ⊗ (1/√2)(|0⟩_A|1⟩_B − |1⟩_A|0⟩_B)
  = −(1/(2√2)) (|0⟩_i|1⟩_A − |1⟩_i|0⟩_A) ⊗ (α|0⟩_B + β|1⟩_B)
  + (1/(2√2)) (|0⟩_i|0⟩_A − |1⟩_i|1⟩_A) ⊗ (α|1⟩_B + β|0⟩_B)
  + (1/(2√2)) (|0⟩_i|0⟩_A + |1⟩_i|1⟩_A) ⊗ (α|1⟩_B − β|0⟩_B)
  − (1/(2√2)) (|0⟩_i|1⟩_A + |1⟩_i|0⟩_A) ⊗ (α|0⟩_B − β|1⟩_B)
  = (1/2) ( −|ψ_00⟩_iA ⊗ |ψ⟩_B + |ψ_01⟩_iA ⊗ σ_x|ψ⟩_B − |ψ_10⟩_iA ⊗ iσ_y|ψ⟩_B − |ψ_11⟩_iA ⊗ σ_z|ψ⟩_B ) .  (56)

When the initial state is written in this form, one sees immediately how the protocol works: the measurement that Alice makes contains exactly the right information that Bob needs to reproduce the state |ψ⟩, up to an overall phase, by performing the appropriate transformation on his qubit. Teleportation is a highly useful protocol that lies at the center of quantum communication and fault tolerant quantum computation. There are several interesting features to note. The two bits of information that Alice obtains are completely random: 00, 01, 10, and 11 all occur with equal probability. These bits contain no information about |ψ⟩ taken on their own: it is only when combined with Bob's qubit that those bits suffice to recreate |ψ⟩. During the teleportation process, it is difficult to say just where the state |ψ⟩ 'exists'. After Alice has made her measurement, the state |ψ⟩ is in some sense 'spread out' between her two classical bits and Bob's qubit. The proliferation of quotation marks in this paragraph is a symptom of quantum weirdness: classical ways of describing things are inadequate to capture the behavior of quantum things. The only way to see what happens to a quantum system during a process like teleportation is to apply the mathematical rules of quantum mechanics.
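The whole protocol can be traced numerically. The following NumPy sketch (an illustrative reconstruction, not the article's own code) projects qubits (i, A) onto each Bell state and checks that Bob's conditional state, after the corresponding Pauli correction, equals |ψ⟩ up to an overall phase:

```python
# Teleportation of one qubit, simulated with state vectors.
# Qubit order is (i, A, B): i holds the unknown state, A/B are
# Alice's and Bob's halves of the singlet.
import numpy as np

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2)
k0 = np.array([1, 0], dtype=complex)
k1 = np.array([0, 1], dtype=complex)

rng = np.random.default_rng(0)
psi = rng.normal(size=2) + 1j * rng.normal(size=2)
psi /= np.linalg.norm(psi)                 # random unknown state |psi>

singlet = (np.kron(k0, k1) - np.kron(k1, k0)) / np.sqrt(2)
state = np.kron(psi, singlet)              # |psi>_i |psi00>_AB

# Bell basis on qubits (i, A), matching the article's labels:
bell = {
    (0, 0): (np.kron(k0, k1) - np.kron(k1, k0)) / np.sqrt(2),
    (0, 1): (np.kron(k0, k0) - np.kron(k1, k1)) / np.sqrt(2),
    (1, 0): (np.kron(k0, k0) + np.kron(k1, k1)) / np.sqrt(2),
    (1, 1): (np.kron(k0, k1) + np.kron(k1, k0)) / np.sqrt(2),
}
fix = {(0, 0): I2, (0, 1): sx, (1, 0): sy, (1, 1): sz}

# For each measurement outcome, contract <bell| against qubits (i, A)
# to get Bob's conditional state, apply his correction, and compare.
for bits, b in bell.items():
    bob = b.conj() @ state.reshape(4, 2)   # unnormalized Bob state
    bob = fix[bits] @ (bob / np.linalg.norm(bob))
    overlap = abs(np.vdot(bob, psi))
    print(bits, round(overlap, 10))        # 1.0 for every outcome
```

Note that for outcomes 00, 10, and 11 Bob's corrected state differs from |ψ⟩ only by an overall phase, which the absolute overlap ignores, just as the text describes.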
Quantum Cryptography

A common problem in communication is security. Suppose that Alice and Bob wish to communicate with each other with the secure knowledge that no eavesdropper (Eve) is listening in. The study of secure communication is commonly called cryptography, since to attain security Alice must encrypt her messages and Bob must decrypt them. The no-cloning theorem, together with the fact that measuring a quantum system typically disturbs it, implies that quantum mechanics can play a unique role in constructing cryptosystems. There are a wide variety of quantum cryptographic protocols [49,50,51]. The most common of these fall under the heading of quantum key distribution (QKD). The most secure form of classical cryptographic protocol is the one-time pad. Here, Alice and Bob each possess the same random string of bits. This string is called the key. If no one else possesses the key, then Alice and Bob can send messages securely as follows. Suppose that Alice's message has been encoded in bits in some conventional way (e.g.,
mapping characters to ASCII bit strings). Alice encrypts the message by adding the bits of the key to the bits of her message one by one, modulo 2 (i.e., without carrying). Bob decrypts by adding the same key bits to the encrypted message, again modulo 2, which recovers the original message. The idea of quantum cryptography was proposed, in embryonic form, by Stephen Wiesner in [49].
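The one-time pad step just described is bitwise addition modulo 2, i.e. XOR; a minimal sketch (illustrative names, Python standard library only — in QKD the key bits would come from the quantum protocol rather than a local random generator):

```python
# One-time pad: encryption and decryption are the same XOR operation.
# The key must be as long as the message and must never be reused.
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    assert len(key) >= len(data), "one-time pad key must cover the message"
    return bytes(d ^ k for d, k in zip(data, key))

message = "QKD gives Alice and Bob a shared secret key".encode("ascii")
key = secrets.token_bytes(len(message))

ciphertext = xor_bytes(message, key)
recovered = xor_bytes(ciphertext, key)   # XOR with the same key undoes it
assert recovered == message
```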
The first quantum cryptographic protocol was proposed by Bennett and Brassard in 1984 and is commonly called BB84 [50]. The BB84 protocol, together with its variants, is the one most commonly used by existing quantum cryptosystems. In BB84, Alice sends Bob a sequence of qubits. The protocol is most commonly described in terms of qubits encoded in photon polarization. Here, we will describe the qubits in terms of spin, so that we can use the notation developed in Sect. "Quantum Mechanics"; spin 1/2 is isomorphic to photon polarization, and the quantum mechanics of the protocol remains the same. Alice chooses a sequence of qubits from the set {|↑⟩, |↓⟩, |←⟩, |→⟩} at random, and sends that sequence to Bob. As he receives each qubit in turn, Bob picks at random either the z-axis or the x-axis and measures the
received qubit along that axis. Half of the time, on average, Bob measures the qubit along the same axis along which it was prepared by Alice. Alice and Bob now check to see whether Eve is listening in. Eve can intercept the qubits Alice sends, make a measurement on them, and then send them on to Bob. Because she does not know the axis along which any individual qubit has been prepared, however, her measurement will inevitably disturb the qubits. Alice and Bob can then detect Eve's intervention by the following protocol. Using an ordinary, insecure form of transmission, e.g., the telephone, Alice reveals to Bob the state of some of the qubits that she sent. Half of those qubits, on average, Bob measured along the same axis along which they were sent. Bob then checks whether he measured those qubits to be in the same state in which Alice sent them. If he finds them all to be in the proper state, then he and Alice can be confident that Eve is not listening in. If Bob finds that some fraction of the qubits are not in their proper state, then he and Alice know that either the qubits have been corrupted by the environment in transit, or Eve is listening in. The degree of corruption is related to the amount of information that Eve can have obtained: the greater the corruption, the more information Eve may have. By monitoring the degree of corruption of the received qubits, Alice and Bob can determine just how many bits of information Eve has obtained about their transmission. Alice now reveals to Bob the axis along which she prepared the remainder of her qubits. On half of those, on average, Bob measured using the same axis. If Eve is not listening, the qubits that Bob measured along the same axis along which Alice prepared them now constitute a string of random bits shared by Alice and Bob, and by them only. This shared random string can then be used as a key for a one-time pad.
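A toy simulation of BB84 sifting and eavesdropper detection (an illustrative model, not from the article): qubits are abstracted as (basis, bit) pairs, measuring in the wrong basis gives a random outcome and collapses the state, and Eve mounts an intercept-resend attack, which drives the error rate on the disclosed bits to about 25%:

```python
# Toy BB84: basis 'z' or 'x', bit 0 or 1. Measuring in the preparation
# basis returns the bit; measuring in the other basis returns a random
# bit and re-prepares the qubit in the measured state.
import random

rng = random.Random(1)

def measure(qubit, basis):
    prep_basis, bit = qubit
    if basis == prep_basis:
        return bit, qubit
    bit = rng.randrange(2)            # wrong basis: random outcome,
    return bit, (basis, bit)          # state collapses to what was seen

n = 2000
alice_bases = [rng.choice("zx") for _ in range(n)]
alice_bits = [rng.randrange(2) for _ in range(n)]
qubits = list(zip(alice_bases, alice_bits))

# Eve's intercept-resend attack: measure each qubit in a random basis
# and forward the collapsed state to Bob.
qubits = [measure(q, rng.choice("zx"))[1] for q in qubits]

bob_bases = [rng.choice("zx") for _ in range(n)]
bob_bits = [measure(q, b)[0] for q, b in zip(qubits, bob_bases)]

# Sifting: keep positions where Alice's and Bob's bases agree, then
# disclose a subset to estimate the error rate (about 25% under attack,
# about 0% without Eve).
sifted = [i for i in range(n) if alice_bases[i] == bob_bases[i]]
check = sifted[: len(sifted) // 2]
errors = sum(alice_bits[i] != bob_bits[i] for i in check) / len(check)
print(f"error rate on disclosed bits: {errors:.2f}")
```

Deleting the Eve line leaves the disclosed bits error-free, which is exactly the check Alice and Bob use to decide whether the remaining sifted bits are safe to keep as key material.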
If Eve is listening in, then from their checking stage, Alice and Bob know just how many bits out of their shared random string are also known by Eve. Alice and Bob can now perform classical privacy amplification protocols [113] to turn their somewhat insecure string of shared bits into a shorter string of shared bits that is more secure. Once privacy amplification has been performed, Alice and Bob now share a key whose secrecy is guaranteed by the laws of quantum mechanics. Eve could, of course, intercept all the bits sent, measure them, and send them on. Such a ‘denial of service’ attack prevents Alice and Bob from establishing a shared secret key. No cryptographic system, not even a quantum one, is immune to denial of service attacks: if Alice and Bob can exchange no information then they can exchange no secret information! If Eve lets enough information through,
however, then Alice and Bob can always establish a secret key. A variety of quantum key distribution schemes have been proposed [50,51]. Ekert suggested using entangled photons to distribute keys to Alice and Bob [51]. In practical quantum key distribution schemes, the states sent are attenuated coherent states, consisting mostly of vacuum with a small amplitude of single-photon states and an even smaller amplitude of states with more than one photon. It is also possible to use continuous quantum variables, such as the amplitudes of the electric and magnetic fields, to distribute quantum keys [114,115]. Guaranteeing the full security of a quantum key distribution scheme requires a careful examination of all possible attacks on the actual physical implementation of the scheme.

Implications and Conclusions

Quantum information theory is a rich and fundamental field. Its origins lie with the origins of quantum mechanics itself a century ago. The field has expanded dramatically since the mid 1990s, due to the discovery of practical applications of quantum information processing such as factoring and quantum cryptography, and because of the rapid development of technologies for manipulating systems in ways that preserve quantum coherence. As an example of the rapid pace of development in the field, while this article was in proof a new quantum algorithm for solving linear sets of equations was discovered [116]. Based on the quantum phase algorithm, it solves the following problem: given a sparse matrix A and a vector b⃗, find a vector x⃗ such that A x⃗ = b⃗; that is, construct x⃗ = A⁻¹ b⃗. If A is an n by n matrix, the best classical algorithms for solving this problem run in time O(n). Remarkably, the quantum matrix inversion algorithm runs in time O(log n), an exponential improvement: a problem that could take 10^12 – 10^15 operations to solve on a classical computer could be solved on a quantum computer in fewer than one hundred steps. When they were developed in the mid twentieth century, the fields of classical computation and communication provided unifying methods and themes for all of engineering and science. So at the beginning of the twenty-first century, quantum information is providing unifying concepts, such as entanglement, and unifying techniques, such as coherent information processing and quantum error correction, that have the potential to transform and bind together currently disparate fields in science and engineering. Indeed, quantum information theory has perhaps even greater potential to transform the world than classical information theory.
Classical information theory finds its greatest application in man-made systems such as electronic computers. Quantum information theory applies not only to man-made systems but to all physical systems at their most fundamental level. For example, entanglement is a characteristic of virtually all physical systems at their most microscopic levels. Quantum coherence and the relationship between symmetries and the conservation and protection of information underlie not only quantum information but also the behavior of elementary particles, atoms, and molecules. When or whether techniques of quantum information processing will become tools of mainstream technology is an open question. The technologies of precision measurement are already fully quantum mechanical: for example, the atomic clocks that lie at the heart of the global positioning system (GPS) rely fundamentally on quantum coherence. Ubiquitous devices such as the laser and the transistor have their roots in quantum mechanics. Quantum coherence is relatively fragile, however: until such time as we can construct robust, easily manufactured coherent systems, quantum information processing may have its greatest implications at the extremes of physics and technology. Quantum information processing analyzes the universe in terms of information: at bottom, the universe is composed not just of photons, electrons, neutrinos, and quarks, but of quantum bits or qubits. Many aspects of the behavior of those elemental qubits are independent of the particular physical system that registers them. By understanding how information behaves at the quantum mechanical level, we understand the fundamental behavior of the universe itself.

Bibliography
1. Nielsen MA, Chuang IL (2000) Quantum Computation and Quantum Information. Cambridge University Press, Cambridge
2. Ehrenfest P, Ehrenfest T (1912) The Conceptual Foundations of the Statistical Approach in Mechanics. Cornell University Press, Ithaca, NY
3. Planck M (1901) Ann Phys 4:553
4. Maxwell JC (1871) Theory of Heat. Appleton, London
5. Einstein A (1905) Ann Phys 17:132
6. Bohr N (1913) Phil Mag 26:1–25, 476–502, 857–875
7. Schrödinger E (1926) Ann Phys 79:361–376, 489–527; (1926) Ann Phys 80:437–490; (1926) Ann Phys 81:109–139
8. Heisenberg W (1925) Z Phys 33:879–893; (1925) Z Phys 34:858–888; (1925) Z Phys 35:557–615
9. Einstein A, Podolsky B, Rosen N (1935) Phys Rev 47:777
10. Bohm D (1952) Phys Rev 85:166–179, 180–193
11. Bell JS (1964) Physics 1:195
12. Aharonov Y, Bohm D (1959) Phys Rev 115:485–491; (1961) Phys Rev 123:1511–1524
13. Aspect A, Grangier P, Roger G (1982) Phys Rev Lett 49:91–94; Aspect A, Dalibard J, Roger G (1982) Phys Rev Lett 49:1804–1807; Aspect A (1999) Nature 398:189
14. Hartley RVL (1928) Bell Syst Tech J 535–563
15. Turing AM (1936) Proc Lond Math Soc 42:230–265
16. Shannon CE (1937) Symbolic Analysis of Relay and Switching Circuits. Master's Thesis, MIT
17. Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423, 623–656
18. von Neumann J (1966) In: Burks AW (ed) Theory of Self-Reproducing Automata. University of Illinois Press, Urbana
19. Landauer R (1961) IBM J Res Develop 5:183–191
20. Lecerf Y (1963) Comptes Rendus 257:2597–2600
21. Bennett CH (1973) IBM J Res Develop 17:525–532; (1982) Int J Theor Phys 21:905–940
22. Fredkin E, Toffoli T (1982) Int J Theor Phys 21:219–253
23. Benioff P (1980) J Stat Phys 22:563–591; (1982) Phys Rev Lett 48:1581–1585; (1982) J Stat Phys 29:515–546; (1986) Ann NY Acad Sci 480:475–486
24. Feynman RP (1982) Int J Theor Phys 21:467
25. Deutsch D (1985) Proc Roy Soc Lond A 400:97–117; (1989) Proc Roy Soc Lond A 425:73–90
26. Shor P (1994) In: Goldwasser S (ed) Proceedings of the 35th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, pp 124–134
27. Lloyd S (1993) Science 261:1569; (1994) Science 263:695
28. Cirac JI, Zoller P (1995) Phys Rev Lett 74:4091
29. Monroe C, Meekhof DM, King BE, Itano WM, Wineland DJ (1995) Phys Rev Lett 75:4714
30. Turchette QA, Hood CJ, Lange W, Mabuchi H, Kimble HJ (1995) Phys Rev Lett 75:4710
31. Grover LK (1996) In: Proceedings of the 28th Annual ACM Symposium on the Theory of Computing, p 212
32. Cory D, Fahmy AF, Havel TF (1997) Proc Nat Acad Sci USA 94:1634–1639
33. Chuang IL, Vandersypen LMK, Zhou X, Leung DW, Lloyd S (1998) Nature 393:143–146
34. Chuang IL, Gershenfeld N, Kubinec M (1998) Phys Rev Lett 80:3408–3411
35. Gordon JP (1962) Proc IRE 50:1898–1908
36. Lebedev DS, Levitin LB (1963) Sov Phys Dok 8:377
37. Holevo AS (1973) Prob Per Inf 9:3; Prob Inf Trans (USSR) 9:110
38. Schumacher B, Westmoreland M (1997) Phys Rev A 55:2738
39. Holevo AS (1998) IEEE Trans Inf Th 44:69
40. Caves CM, Drummond PD (1994) Rev Mod Phys 66:481–537
41. Yuen HP, Ozawa M (1993) Phys Rev Lett 70:363–366
42. Giovannetti V, Guha S, Lloyd S, Maccone L, Shapiro JH, Yuen HP (2004) Phys Rev Lett 92:027902
43. Giovannetti V, Guha S, Lloyd S, Maccone L, Shapiro JH (2004) Phys Rev A 70:032315
44. Nielsen M, Schumacher B (1996) Phys Rev A 54:2629–2635
45. Lloyd S (1997) Phys Rev A 55:1613–1622
46. Bennett CH, Brassard G, Crépeau C, Jozsa R, Peres A, Wootters WK (1993) Phys Rev Lett 70:1895–1899
47. Pan JW, Daniell M, Gasparoni S, Weihs G, Zeilinger A (2001) Phys Rev Lett 86:4435–4438
48. Zhang TC, Goh KW, Chou CW, Lodahl P, Kimble HJ (2003) Phys Rev A 67:033802
49. Wiesner S (1983) SIGACT News 15:78–88
50. Bennett CH, Brassard G (1984) Proc IEEE Int Conf Comput 10–12:175–179
51. Ekert A (1991) Phys Rev Lett 67:661–663
52. Feynman RP (1965) The Character of Physical Law. MIT Press, Cambridge
53. Bohr A, Ulfbeck O (1995) Rev Mod Phys 67:1–35
54. Wootters WK, Zurek WH (1982) Nature 299:802–803
55. Werner RF (1998) Phys Rev A 58:1827–1832
56. Bennett CH, Bernstein HJ, Popescu S, Schumacher B (1996) Phys Rev A 53:2046–2052
57. Horodecki M, Horodecki P, Horodecki R (1998) Phys Rev Lett 80:5239–5242
58. Horodecki M, Horodecki P, Horodecki R (2000) Phys Rev Lett 84:2014–2017
59. Wootters WK (1998) Phys Rev Lett 80:2245–2248
60. Christandl M, Winter A (2004) J Math Phys 45:829–840
61. Bohm D (1952) Am J Phys 20:522–523
62. Bell JS (2004) Aspect A (ed) Speakable and Unspeakable in Quantum Mechanics. Cambridge University Press, Cambridge
63. Clauser JF, Horne MA, Shimony A, Holt RA (1969) Phys Rev Lett 23:880–884
64. Greenberger DM, Horne MA, Zeilinger A (1989) In: Kafatos M (ed) Bell's Theorem, Quantum Theory, and Conceptions of the Universe. Kluwer, Dordrecht
65. Pan JW, Daniell M, Weinfurter H, Zeilinger A (1999) Phys Rev Lett 82:1345–1349
66. Nelson RJ, Cory DG, Lloyd S (2000) Phys Rev A 61:022106
67. Frank MP, Knight TF (1998) Nanotechnology 9:162–176
68. Feynman RP (1985) Opt News 11:11; (1986) Found Phys 16:507
69. Deutsch D, Jozsa R (1992) Proc R Soc London A 439:553
70. Simon DR (1994) In: Goldwasser S (ed) Proceedings of the 35th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, pp 116–123
71. Coppersmith D (1994) IBM Research Report RC19642
72. Hallgren S (2007) J ACM 54:1
73. Kuperberg G (2003) arXiv:quant-ph/0302112
74. Hallgren S, Moore C, Rötteler M, Russell A, Sen P (2006) In: Proceedings of the 38th Annual ACM Symposium on Theory of Computing, pp 604–617
75. Kitaev AY, Shen A, Vyalyi MN (2002) Classical and Quantum Computation. American Mathematical Society, Providence
76. Abrams DS, Lloyd S (1997) Phys Rev Lett 79:2586–2589
77. Aspuru-Guzik A, Dutoi AD, Love PJ, Head-Gordon M (2005) Science 309:1704–1707
78. Lloyd S (1996) Science 273:1073–1078
79. Wiesner S (1996) arXiv:quant-ph/9603028
80. Zalka C (1998) Proc Roy Soc Lond A 454:313–322
81. Ramanathan C, Sinha S, Baugh J, Havel TF, Cory DG (2005) Phys Rev A 71:020303(R)
82. Boyer M, Brassard G, Høyer P, Tapp A (1998) Fortsch Phys 46:493–506
83. Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Science 220:671–680
84. Farhi E, Goldstone J, Gutmann S (2001) Science 292:472–476
85. Sachdev S (1999) Quantum Phase Transitions. Cambridge University Press, Cambridge
86. Aharonov D, van Dam W, Kempe J, Landau Z, Lloyd S, Regev O (2004) In: Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2004), pp 42–51. arXiv:quant-ph/0405098
87. Farhi E, Gutmann S (1998) Phys Rev A 58:915–928
88. Aharonov D, Ambainis A, Kempe J, Vazirani U (2001) In: Proceedings of the 33rd ACM Symposium on Theory of Computing (STOC 2001), pp 50–59
89. Childs AM, Cleve R, Deotto E, Farhi E, Gutmann S, Spielman DA (2003) In: Proceedings of the 35th ACM Symposium on Theory of Computing (STOC 2003), pp 59–68
90. Farhi E, Goldstone J, Gutmann S (2007) A quantum algorithm for the Hamiltonian NAND tree. arXiv:quant-ph/0702144
91. Hamming R (1980) Coding and Information Theory. Prentice Hall, Upper Saddle River
92. Shor PW (1995) Phys Rev A 52:R2493–R2496
93. Steane AM (1996) Phys Rev Lett 77:793–797
94. Calderbank AR, Shor PW (1996) Phys Rev A 54:1098–1106
95. Gottesman D (1996) Phys Rev A 54:1862
96. Aharonov D, Ben-Or M (1997) In: Proceedings of the 29th Annual ACM Symposium on Theory of Computing, pp 176–188
97. Knill E, Laflamme R, Zurek W (1996) arXiv:quant-ph/9610011
98. Gottesman D (1998) Phys Rev A 57:127–137
99. Knill E (2005) Nature 434:39–44
100. Zanardi P, Rasetti M (1997) Phys Rev Lett 79:3306–3309
101. Lidar DA, Chuang IL, Whaley KB (1998) Phys Rev Lett 81:2594–2597
102. Knill E, Laflamme R, Viola L (2000) Phys Rev Lett 84:2525–2528
103. Zanardi P, Lloyd S (2003) Phys Rev Lett 90:067902
104. Kitaev AY (2003) Ann Phys 303:2–30
105. Lloyd S (2004) Quant Inf Proc 1:15–18
106. Devetak I (2005) IEEE Trans Inf Th 51:44–55
107. Shor PW (2002) Lecture Notes, MSRI Workshop on Quantum Computation. Mathematical Sciences Research Institute, Berkeley
108. DiVincenzo DP, Shor PW, Smolin JA (1998) Phys Rev A 57:830–839
109. Shor PW (2004) Comm Math Phys 246:453–472
110. Hastings MB (2008) A counterexample to additivity of minimum output entropy. arXiv:0809.3972
111. Bennett CH, Wiesner SJ (1992) Phys Rev Lett 69:2881
112. Bennett CH, Shor PW, Smolin JA, Thapliyal AV (1999) Phys Rev Lett 83:3081–3084
113. Bennett CH, Brassard G, Robert JM (1988) SIAM J Comput 17:210–229
114. Ralph TC (1999) Phys Rev A 61:010303
115. Braunstein S, Pati AK (2003) Quantum Information with Continuous Variables. Springer, New York
116. Harrow AW, Hassidim A, Lloyd S (2008) Quantum algorithm for solving linear sets of equations. arXiv:0811.3171
Quantum Information Science, Introduction to
Quantum Information Science, Introduction to
JOSEPH F. TRAUB
Computer Science Department, Columbia University, New York, USA

Quantum information science is an exciting, fairly new field that has attracted many researchers from diverse fields such as physics, chemistry, computer science, and mathematics. A driving force is that Moore's law will soon end due to a number of factors, including tunneling, the heat problem, approaching atomic size, difficulty of design, and the cost of fabrication facilities. Multicore and manycore processors may provide a near-term fix, but if we are to continue the exponential trajectory of Moore's law in the long term, entirely new technologies are needed. Possible future technologies include molecular, biological, photonic, or quantum computers. Quantum computers are based on the laws of quantum mechanics. We have seen progress on quantum algorithms and complexity, but can quantum computers be built? Two of the impediments are the small number of qubits achieved to date and the short decoherence times. Will there be enough qubits to do new science, especially given the requirements of error correction and fault tolerant computing? Will there be enough time to do new science before decoherence sets in? In contrast to quantum computation, some quantum communication applications have already been realized. This section begins with a broad overview of quantum information science. The articles that follow discuss quantum algorithms, quantum algorithms and complexity for continuous problems, quantum computational complexity, quantum error correction and fault tolerant computing, quantum computing with trapped ions, quantum computing using optics, and finally quantum cryptography. Lloyd (see Quantum Information Processing) argues that the most important reason for studying quantum information science is to construct a unified theory of how information can be registered and transformed at the fundamental limits imposed by physical law.
He introduces the formalism of quantum mechanics and applies it to the idea of quantum information. He then identifies quantum superposition and entanglement as being at the heart of quantum computation. He discusses the Deutsch–Jozsa algorithm, which solves a problem faster than any classical computer could; this is an artificial example. Decidedly not artificial is Shor's algorithm for factoring large integers in polynomial time; the fastest classical algorithm known is exponential in the number of bits. Shor's algorithm is important for cryptography, since the commonly used RSA public-key system relies on the difficulty of factoring to guarantee security. The importance of the quantum Fourier transform for algorithm design is discussed. Lloyd does not describe the various technologies that can be used to build quantum computers, every one of which is subject to its own particular form of noise; he does, however, present a general formalism for characterizing noise and errors and discusses techniques for coping with them. Mosca (see Quantum Algorithms) defines quantum algorithms as algorithms that run on any realistic model of quantum computation. The most commonly used model of quantum computation is the circuit model (more strictly, the model of uniform families of acyclic quantum circuits), and the quantum Strong Church–Turing thesis states that the quantum circuit model can efficiently simulate any realistic model of computation. Several other models of quantum computation have been developed, and indeed they can be efficiently simulated by quantum circuits. Quantum circuits closely resemble most of the currently pursued approaches for attempting to construct scalable quantum computers. Papageorgiou and Traub (see Quantum Algorithms and Complexity for Continuous Problems) point out that most continuous mathematical formulations arising in science and engineering can only be solved numerically, and therefore approximately. There are two major motivations for studying quantum algorithms and complexity for continuous problems.
1. Are quantum computers more powerful than classical computers for important scientific problems? How much more powerful?
2. Many important scientific and engineering problems have continuous formulations. These problems occur in fields such as physics, chemistry, engineering, and finance.
The continuous formulations include path integration, partial differential equations (in particular, the Schrödinger equation), and continuous optimization. To answer the first question, the classical computational complexity of the problem must be known. There have been decades of research on the classical complexity of continuous problems in the field of information-based complexity. The reason the complexity of many continuous problems is known is that adversary arguments can be used to obtain their query complexity. Regarding the second motivation, in this article they report on high-dimensional integration, path integration, Feynman path integration, the smallest eigenvalue of a differential equation, approximation, partial differential equations, ordinary differential equations, and gradient estimation. They also briefly report on the simulation of quantum systems on a quantum computer. Watrous (see Quantum Computational Complexity) provides a survey of quantum computational complexity, with a focus on three fundamental notions: polynomial-time quantum computations, the efficient verification of quantum proofs, and quantum interactive proof systems. Based on these notions he defines quantum complexity classes that contain computational problems of varying hardness. Properties of these complexity classes, and the relationships among these classes and classical complexity classes, are presented. As these notions and complexity classes are typically defined within the quantum circuit model, the article includes a section on basic properties of quantum circuits that are important in the setting of quantum complexity. A selection of other topics in quantum complexity, including quantum advice, space-bounded quantum computation, and bounded-depth quantum circuits, is also presented. Grassl and Rötteler (see Quantum Error Correction and Fault Tolerant Quantum Computing) point out that it has been shown that, even with imperfect quantum memory and imperfect quantum operations, it is possible to implement arbitrarily long quantum computations, provided that the failure probability of each element is below a certain threshold. They provide an overview of the ingredients leading to fault tolerant quantum computation (FTQC). In the first part, they present the theory of quantum error-correcting codes (QECCs) and in particular two important classes of QECCs, namely the so-called CSS codes and stabilizer codes. Both are related to classical error-correcting codes, so they start with some basics from this area.
In the second part of the article, they present a high-level view of the main ideas of FTQC and the threshold theorem. Lange (see Quantum Computing with Trapped Ions) summarizes the state-of-the-art of quantum computing with trapped ions. All the necessary components of a trapped ion quantum computer have been demonstrated, from quantum memory and fundamental quantum logic gates to simple quantum algorithms. Current
experimental efforts are directed towards scaling up the small systems investigated so far and enhancing the fidelity of operations to a level where error correction can be applied efficiently. The first task at which a quantum computer is expected to outperform a classical one is the efficient simulation of quantum systems too complex for classical treatment. Milburn and White (see Quantum Computing Using Optics) point out that optical implementations of quantum computing have largely focused on encoding quantum information using single photon states of light. For example, a single photon could be excited to one of two carefully defined orthogonal mode functions of the field with different momentum directions. However, as optical photons do not interact with each other directly, physical devices that enable one encoded bit of information to unitarily change another are hard to implement. In principle it can be done using a Kerr nonlinearity, but Kerr nonlinear phase shifts are too small to be useful. Knill et al. discovered another way in which the state of one photon could be made to act conditionally on the state of another using a measurement based scheme. They discuss this approach in some detail as it has led to experiments that have already demonstrated many of the key elements required for quantum computation with optics. Lo and Zhao (see Quantum Cryptography) point out that the goal of quantum cryptography is to perform tasks that are impossible or intractable with conventional cryptography. Quantum cryptography makes use of the subtle properties of quantum mechanics such as the quantum no-cloning theorem and the Heisenberg uncertainty principle. Unlike conventional cryptography, whose security is often based on unproven computational assumptions, quantum cryptography has an important advantage in that its security is often based on the laws of physics. 
Thus far, proposed applications of quantum cryptography include quantum key distribution (abbreviated QKD), quantum bit commitment and quantum coin tossing. These applications have varying degrees of success. The most successful and important application (QKD) has been proven to be unconditionally secure. Moreover, experimental QKD has now been performed over hundreds of kilometers over both standard commercial telecom optical fibers and open-air. In fact, commercial QKD systems are currently available on the market.
Random Graphs, a Whirlwind Tour of
FAN CHUNG
University of California, San Diego, USA

Article Outline
Glossary
Introduction
Some Basic Graph Theory
Random Graphs in a Nutshell
Classical Random Graphs
Random Power Law Graphs
On-Line Random Graphs
Remarks
Bibliography

Glossary
Random graph: a graph which is chosen from a certain probability distribution over the set of all graphs satisfying some given set of constraints. Statements about random graphs are thus statements about "typical" graphs for the given constraints.
Graph diameter: the maximum distance between any pair of vertices in a graph. Here the distance between two vertices is the length of a shortest path joining the vertices.
Node degree distribution n_k: the number of vertices having degree k.
Power law distribution: a node degree distribution which (at least approximately) obeys n_k ∝ k^(−β) with fixed positive exponent β.
Erdős–Rényi random graph G(n, p): a graph with n vertices, in which each of the C(n,2) possible edges between pairs of vertices is present with probability p. The fact that the edges are chosen independently makes it relatively easy to compute properties of G(n, p).
Random graph model G(w) for expected degree sequence w: a model which generates random graphs which, on average, have a prescribed degree sequence w = (w_1, w_2, …, w_n).
On-line random graph: a graph which grows in size over time, according to given probabilistic rules, starting from a start graph G_0. One can make statements about these graphs in the limit of long time (and hence large vertex number n, as n approaches infinity).

Introduction
Nowadays we are surrounded by assorted large information networks. For example, the phone network has all
users as vertices which are interconnected by phone calls from one user to another. The Web can be viewed as a network with webpages as vertices which are then linked to other webpages. There are various biological networks arising from numerous databases, such as the gene network which represents the regulatory effects among genes. Of interest are many social networks expressing various types of social interactions. Some noted examples include the Collaboration graph (denoting coauthorship among mathematicians) and the Hollywood graph (consisting of actors and actresses and their joint appearances in feature films), among others. How are these networks formed? What are the basic structures of such large networks? How do they evolve? What are the underlying principles that dictate their behaviors? To answer these questions, graph theory comes into play. Random graphs naturally have a similar flavor to these large information networks. For example, the phone network is formed by making random phone calls, while a random graph results from adding random edges one at a time. Although classical random graphs cannot directly be used to model real networks and seem to exhibit different "shapes", the methods and approaches of random graph theory provide useful tools for the modeling and analysis of these information networks. In this article, we will start with some basic graph theory in Sect. "Some Basic Graph Theory". We then introduce the main themes of random graphs in Sect. "Random Graphs in a Nutshell". Then we consider the classical random graph theory in Sect. "Classical Random Graphs" before we proceed to describe some general random graph models with given degree distributions, in particular the power law graphs, in Sect. "Random Power Law Graphs". In Sect. "On-Line Random Graphs", we will cover two types of "on-line" graph models, including the model of preferential attachment and the duplication model.
Although random graphs can be used to analyze various aspects of realistic networks, we wish to point out that there is no silver bullet to answer all the difficult problems about these large complex networks. In the last section we will put things in perspective by clarifying what random graphs can and cannot do.

Some Basic Graph Theory
All the information networks that we have mentioned can be formulated in terms of graphs. A graph G consists of a vertex set, denoted by V = V(G) (which contains all the objects that we wish to deal with), and an edge set E = E(G), which consists of specified pairwise relations between vertices. For example, a friendship graph has the
Random Graphs, a Whirlwind Tour of, Table 1 Graph models for several networks
Graph | Vertices | Edges
Flight schedule graph | Cities | Flights
Phone graph | Telephone numbers | Phone calls
Collaboration graph | Authors in Math Review | Coauthorship
Web graph | Webpages | Links
Biological graph | Genes | Regulatory effects

vertex set consisting of people of interest and the edge set denoting the pairs of people who are friends. In Table 1 we list a number of graphs associated with various networks.
As an introduction to graph theory, we describe the so-called party problem: Among six people at a party, show that there are at least three people who know each other or three people who do not know each other. In graph-theoretical terms: Any graph on 6 vertices must contain a triangle or three independent vertices with no edge among them. Indeed, 6 is the smallest number for which this holds, since there is a graph on 5 vertices that contains neither a triangle nor three independent vertices. Such a graph is the cycle on 5 vertices, denoted by C_5, as seen in Fig. 1. Let K_n denote the complete graph on n vertices, which has all C(n,2) edges. For example, a triangle is K_3, which turns out to be C_3 as well. The above party problem is a toy case of so-called Ramsey theory, which deals with unavoidable patterns in large graphs. In 1930, Ramsey [40] showed the following: For any two positive integers k and l, there is an associated number R(k, l) such that any graph on n ≥ R(k, l) vertices must contain either K_k as a subgraph or l independent vertices.
Random Graphs, a Whirlwind Tour of, Figure 1 A five-cycle C_5 and a complete graph K_5
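The party problem can be verified exhaustively for small cases. The sketch below (helper name is my own) checks that C_5 avoids both patterns and that all 2^15 graphs on 6 vertices contain one of them:

```python
from itertools import combinations, product

def has_triangle_or_indep3(n, edges):
    """Return True iff the graph on vertices 0..n-1 with the given edge
    list contains a triangle or three mutually non-adjacent vertices."""
    e = {tuple(sorted(p)) for p in edges}
    for trio in combinations(range(n), 3):
        pairs = list(combinations(trio, 2))
        if all(p in e for p in pairs):      # triangle
            return True
        if all(p not in e for p in pairs):  # three independent vertices
            return True
    return False

# The 5-cycle C5 avoids both patterns, so R(3,3) > 5.
c5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(has_triangle_or_indep3(5, c5))  # False

# Every one of the 2^15 graphs on 6 vertices contains a pattern: R(3,3) = 6.
all_pairs = list(combinations(range(6), 2))
assert all(
    has_triangle_or_indep3(6, [p for p, bit in zip(all_pairs, bits) if bit])
    for bits in product([0, 1], repeat=len(all_pairs))
)
print("every graph on 6 vertices checked")
```

The same brute force becomes hopeless already for R(5, 5), where the number of graphs to inspect is astronomically large.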
For example, R(3, 3) = 6, as stated in the party problem. It is not too difficult to show that R(4, 4) = 17. However, the value of R(5, 5) is not yet determined (in spite of the huge computational power we have today). All that is known is 43 ≤ R(5, 5) ≤ 49 (see [25] and [36]). Relatively few exact Ramsey numbers R(k, l) have been determined. For an extensive survey on this topic, the reader is referred to the dynamic survey in the Electronic Journal of Combinatorics at http://www.combinatorics.org/. In 1947, Erdős wrote an important paper [21] that helped start two areas: combinatorial probabilistic methods and Ramsey theory. He established the following lower bound for the Ramsey number R(k, k) by proving

  R(k, k) ≥ 2^(k/2) ,   (1)
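The counting inequality behind bound (1) can be tabulated directly. A small sketch (the helper name good_n is my own) finds, for each k, the largest n at which the union bound still leaves a graph containing neither K_k nor k independent vertices:

```python
from math import comb

def good_n(k):
    """Largest n for which the counting argument still works, i.e.
    2 * C(n,k) * 2^(C(n,2) - C(k,2)) < 2^(C(n,2)),
    equivalently C(n,k) < 2^(C(k,2) - 1)."""
    n = k
    while comb(n + 1, k) < 2 ** (comb(k, 2) - 1):
        n += 1
    return n

# The guarantee n <= 2^(k/2) from the argument is comfortably met:
for k in [4, 6, 8, 10, 12]:
    print(k, good_n(k), round(2 ** (k / 2)))
```

The table shows the counting bound is far from tight for small k, which is consistent with only lower-order improvements being known.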
The argument is quite simple and elegant: Suppose we wish to find a graph on n vertices that contains neither K_k nor an independent set of k vertices. How large can n be? For a fixed integer n, there are altogether 2^(C(n,2)) possible graphs on n vertices. We say a graph is bad if it contains K_k or an independent set of k vertices. How many bad graphs can there be? There are C(n, k) ways to choose k out of n vertices, so there are at most

  2 C(n, k) 2^(C(n,2) − C(k,2))

bad graphs. Therefore there is a graph on n vertices that is not bad if

  2 C(n, k) 2^(C(n,2) − C(k,2)) < 2^(C(n,2)) .

So, for n ≤ 2^(k/2), there must be a graph on n vertices that is not bad, which implies (1). We note that for the upper bound there is an inductive proof showing that R(k, k) ≤ C(2k−2, k−1), which is about 4^k. In the past five decades, there have been improvements only by factors of lower order for both the upper and lower bounds [19,42]. It remains unsettled (with the Erdős award unclaimed) to determine whether lim_{k→∞} (R(k, k))^(1/k) exists or what value it should be.
A basic notion in graph theory is "adjacency". A vertex u is said to be adjacent to another vertex v if {u, v} is an edge; we say u is a neighbor of v and, equivalently, v is a neighbor of u. The degree of a vertex u is the number of edges containing u. If we restrict ourselves to simple graphs (i.e., at most one edge between any pair of vertices), then the degree of u is just the number of neighbors that u has. Suppose that in a graph G the vertex v_i has
degree d_i for 1 ≤ i ≤ n. Then (d_1, d_2, …, d_n) forms a degree sequence for G. Sometimes, we organize the degree sequence so that d_1 ≥ d_2 ≥ … ≥ d_n. Here comes a natural question on graph realization: For what values d_i is the sequence (d_1, d_2, …, d_n) a degree sequence of some graph? To answer this question, first we observe that the sum of all the d_i must be even, since it is exactly twice the number of edges. This is the folklore "Handshake Theorem". In a 1961 paper, Erdős and Gallai [24] answered the above question. They gave a necessary and sufficient condition by showing that a sequence (d_1, d_2, …, d_n), where d_i ≥ d_{i+1}, is a degree sequence of some graph if and only if the sum of the d_i is even and, for each integer r ≤ n − 1,

  Σ_{i=1}^{r} d_i ≤ r(r − 1) + Σ_{i=r+1}^{n} min(r, d_i) .
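The Erdős–Gallai condition translates directly into a test for whether a sequence is graphical. A minimal sketch (function name is my own):

```python
def is_graphical(seq):
    """Erdos-Gallai test: a sequence of non-negative integers is the
    degree sequence of a simple graph iff the sum is even and, for
    every r, the sum of the r largest degrees is at most
    r(r-1) + sum over the remaining degrees of min(r, d_i)."""
    d = sorted(seq, reverse=True)
    n = len(d)
    if sum(d) % 2 != 0:
        return False
    for r in range(1, n + 1):
        lhs = sum(d[:r])
        rhs = r * (r - 1) + sum(min(r, di) for di in d[r:])
        if lhs > rhs:
            return False
    return True

print(is_graphical([3, 3, 3, 3]))  # True  (realized by K_4)
print(is_graphical([3, 3, 1, 1]))  # False (even sum, but the inequality fails at r = 2)
print(is_graphical([1, 1, 1]))     # False (odd degree sum)
```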
Another way to keep track of the degrees of a graph is to consider the degree distribution: let n_k denote the number of vertices having degree k. Instead of writing down the degree sequence (which consists of n numbers, and n can be very large), we just use the n_k. Therefore, the number of values that we need to keep does not exceed the maximum degree. If all degrees have the same value, we say the graph is regular; in this case, only one of the n_k is nonzero. Many real-world graphs have degree distributions satisfying the so-called "power law": the number n_k of vertices of degree k is proportional to k^(−β) for some fixed positive value β. For example, the Collaboration graph, as illustrated in Fig. 2, can be approximated by a power law with exponent β = 2.46. The degree distribution of the Collaboration graph is shown in Fig. 3 in log-log scale.

Random Graphs, a Whirlwind Tour of, Figure 2 The collaboration graph

Random Graphs, a Whirlwind Tour of, Figure 3 The number of vertices for each possible degree for the collaboration graph

In a graph G, a path is a sequence of vertices v_0, v_1, …, v_k such that v_{i−1} is adjacent to v_i for i = 1, …, k. The length of a path is the number of edges in the path. For example, the above-mentioned path has length k and joins v_0 and v_k. If v_0 = v_k, the path is said to be a cycle. A connected graph which contains no cycle is called a tree. A graph is connected if any two vertices can be joined by a path. For a graph G, a maximal subset of vertices, each pair of which can be joined by a path, is called a connected component. Thus, a graph is connected if there is only one connected component. In a connected graph, the distance between two vertices u and v is the length of a shortest path joining u and v. The maximum distance among all pairs of vertices is called the diameter of the graph.
In 1967, the psychologist Stanley Milgram [37] conducted a series of experiments which indicated that any two strangers are connected by a chain of intermediate acquaintances with average chain length about six. Since then, the so-called "small world phenomenon" has long been a subject of anecdotal observation and folklore. Recent studies have suggested that the phenomenon is pervasive in numerous networks arising in nature and technology, and in particular in the structural evolution of the World Wide Web [5,33,44]. In addition to "six degrees of separation", various numbers have emerged for many networks. In 1999, Barabási et al. [5] estimated that any two webpages are at most 19 clicks away from one another (in certain models of the Internet). Broder et al. [11] set up crawlers on a webgraph of 200 million nodes and 1.5 billion links and reported that the average path length is about 16. A mathematician who has written joint papers is likely to have Erdős number at most eight [29] (i.e., a chain of coauthors of length at most 8 connecting to Erdős). The majority of actors and actresses have "Kevin Bacon number" two or three. Before we make sense of these numbers, some clarification is in order: there are in fact two different interpretations of "short distance" in a network. One notion is the diameter of the graph. Another is the average distance (which might be closer to what was measured in these experiments). We will discuss the small world phenomenon further in a later section.

Random Graphs in a Nutshell
What does a "random graph" mean? Before proceeding to describe random graphs, some clarification of "random" is in order. According to the Cambridge Dictionary, "random" means "happening, done or chosen by chance rather than according to a plan". Quite contrary to this explanation, our random graphs have precise meanings and can be clearly defined. In the terminology of probability, a random graph is a random variable defined on a probability space with a probability distribution. In layman's terms, we first put all graphs on n vertices in a lottery box, and the graph we pick out of the box is a random graph. (In this case, all graphs are chosen with equal probability.) What do we want from our random graphs? We would like to say that a random graph (in some given model) has certain properties (e.g., having small diameter). Such a statement means that with probability close to 1 (as the number n of vertices approaches infinity), the random graph we pick out of the lottery box satisfies the property that we specified.
In other words, saying that a random graph has a specified property means that almost all graphs of interest have the desired property. Note that this is quite a strong implication: any statement about a random graph is really a statement about almost all graphs! The beauty of random graphs lies in being able to use relatively few parameters in the model to capture the behavior of almost all graphs of interest (which can be quite numerous and complex). In the early days of the subject, Erdős and Rényi introduced two random graph models. The first is a random graph F(n, m) defined on all graphs with n vertices and m edges, each of which is chosen with equal probability. The second is the celebrated Erdős–Rényi random graph G(n, p) defined on all graphs on n vertices, in which each
edge is chosen independently with probability p. Consequently, a graph on n vertices and x edges is chosen with probability p^x (1 − p)^(C(n,2) − x) in G(n, p). The advantage of the Erdős–Rényi model is the independence of the choices for the edges (i.e., each pair of vertices has its own dice for determining whether it is chosen as an edge). Since the probability of the intersection of independent events is the product of their probabilities, we can compute with ease. For example, the probability that a random graph in G(n, p) contains a fixed triangle is 1/8 for p = 1/2. It is possible to compute such a probability for a random graph in F(n, m) as well, e.g., C(C(n,2) − 3, m − 3) / C(C(n,2), m), which is a more complicated expression. For many problems, such as the diameter problem, the analysis can be quite nontrivial for F(n, m) because the dependency among edges gets in the way. To model real graphs, there are some obvious difficulties. For example, the random graph G(n, p) has all degrees very close to pn if the graph is not too sparse (i.e., p ≫ log n / n). The distribution of the degrees follows the same bell curve for every vertex. As we know, many real-world graphs satisfy the power law, which is very different from the degree distribution of G(n, p). In order to model real-world networks, it is imperative to consider random graphs with general degree distributions and, in particular, the power law distribution. There are basically two types of random graph models for general degree distributions. The configuration model is a take-off from random regular graphs [6]. The way to define a random regular graph G_k of degree k on n vertices is to consider all possible matchings in the complete graph K_{kn}; here a matching means a maximum set of vertex-disjoint edges, i.e., a perfect matching. Each matching is chosen with equal probability. We then get a random k-regular graph by partitioning the kn vertices into n subsets of size k. Each k-subset is then associated with a vertex of the random regular graph G_k.
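The matching step of the configuration model is easy to simulate: give every vertex k "stubs", shuffle the stubs, and pair them up. A minimal sketch (function name is my own, and no attempt is made to reject loops or multi-edges):

```python
import random

def random_regular_multigraph(n, k, seed=0):
    """Configuration-model sketch: give each vertex k stubs, shuffle the
    k*n stubs, and pair consecutive stubs into edges.  A shuffled pairing
    is a uniform random perfect matching of the stubs; loops and repeated
    edges can occur but, as noted in the text, they are rare enough to be
    controlled."""
    assert (n * k) % 2 == 0, "n*k must be even"
    rng = random.Random(seed)
    stubs = [v for v in range(n) for _ in range(k)]
    rng.shuffle(stubs)
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]

edges = random_regular_multigraph(10, 3)
deg = [0] * 10
for u, v in edges:
    deg[u] += 1
    deg[v] += 1
print(deg)  # every vertex has degree 3 (a loop contributes 2)
```

Replacing the equal-size stub groups with groups of sizes d_1, …, d_n gives the general configuration model described next.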
Although such a random regular graph might contain loops (i.e., an edge having both endpoints at the same vertex), the probability of such an event is of lower order and can be controlled. It is then straightforward to define random graphs with general degrees: instead of partitioning the vertex set of the large complete graph into equal parts, we choose a random matching of a complete graph on Σ_i d_i vertices which are partitioned into subsets of sizes d_1, d_2, …, d_n. Then we form the random graph by associating each edge of the matching with an edge between the associated vertices. In the configuration model, there are nontrivial dependencies among the edges. There is another random graph model for general expected degrees, a generalization of the Erdős–Rényi model. Let w = (w_1, w_2, …, w_n) denote the specified degrees. The G(w) model yields random graphs with expected degrees w. The edge between v_i
and v_j is independently chosen with probability w_i w_j / W, where W = Σ_i w_i. In other words, each pair of vertices has its own dice, with the probability assigned so that the expected degree at vertex v_i is exactly w_i. The Erdős–Rényi model is just the case with all w_i equal to pn. Since the G(w) model inherits the robustness and independence of the Erdős–Rényi model, many strong properties can be derived. We will discuss some of these further in Sect. "Parameters for Modeling Power Law Graphs", especially when w satisfies a power law. All the random graph models mentioned above are off-line models. Since real-world graphs are dynamically changing, both in adding and deleting vertices and edges, there are several on-line random graph models in which the probability spaces change at each tick of the clock. In fact, in the study of complex real-world graphs, the on-line models came to attention first. There are a large number of research papers, surveys and books on random graphs, mostly about the Erdős–Rényi model G(n, p). After the year 2000, the study of real-world graphs has led to interesting directions and new methods for analyzing random graphs with general degree distributions. Many on-line models have been proposed and published. Here we will only be able to cover the main ones later: the preferential attachment schemes and the duplication model.

Classical Random Graphs
In the early 1960s, Erdős and Rényi wrote a series of influential papers on random graphs. Their modeling and analysis are thorough and elegant. Their approaches and methods are powerful and have had enormous impact to this day. In this section, we give a brief overview. First we describe the classical results on the evolution of random graphs G(n, p) in the Erdős–Rényi model. Then we discuss the diameter of G(n, p) as the edge density ranges from 0 to 1.

The Evolution of the Erdős–Rényi Graph
What does a random graph in G(n, p) look like?
Erdős and Rényi [23] gave a full answer for the edge density p ranging from 0 to 1. At the start, there is no edge and the edge density is 0: we have isolated vertices. As p increases, the expected number p·C(n,2) of edges gets larger. When there are about √n edges, how many connected components are there, and what sizes and structures do they have? For 0 < p ≪ 1/n, Erdős and Rényi [23] showed that the random graph G is a disjoint union of trees. Furthermore, they gave a beautiful formula: for p = c n^(−k/(k−1)), the probability that there are exactly j connected components of G formed by trees on k vertices is close to λ^j e^(−λ) / j!, where λ = (2c)^(k−1) k^(k−2) / k!. For example, when we have about √n edges, the probability that the random graph G contains j trees on 3 vertices is close to λ^j e^(−λ) / j! for the corresponding λ (if n is large enough). As we have more edges, cycles start to appear. When the graph has a linear number of edges, i.e., p = c/n with c < 1, almost all vertices are in components that are trees and there are only a small number of cycles; namely, the expected number of cycles is (1/2) log(1/(1 − c)) − c/2 − c²/4. When a random graph has edges ranging from slightly below n/2 to slightly over n/2, i.e., p = (1 + o(1))/n, there is an unusual phenomenon called the "double jump". What is the double jump and why is it so unusual? In the study of "threshold functions", or "phase transitions", that happen in natural or evolving systems, it is of interest to identify the critical point, below which the behavior is dramatically different from what lies above. Erdős and Rényi [23] found that when p is smaller than 1/n, the largest component of G has size O(log n) and all components are either trees or unicyclic (i.e., each component contains at most one cycle). If p is (1 + ε)/n with ε > 0, then the giant component emerges. However, when p = 1/n, the largest component has size O(n^(2/3)). There has been detailed analysis examining this tricky transition in detail (see [7] and [31]). When p = c/n for c > 1, the random graph G has one giant component and all others are quite small, of size O(log n). Also, Erdős and Rényi [23] determined the number of vertices in the giant connected component to be f(c)·n, where

  f(c) = 1 − (1/c) Σ_{k=1}^{∞} (k^(k−1) / k!) (c e^(−c))^k .   (2)
Finally, when p = c log n/n and c > 1, the random graph G is almost always connected. When c gets larger, G is not only connected but gets closer to a regular graph; namely, all vertices have degrees close to pn.

The Diameter of the Erdős–Rényi Graph
We consider the diameter of a random graph G in G(n, p) for all ranges of p, including the range for which G(n, p) is not connected. For a disconnected graph G, the diameter of G is defined to be the diameter of its largest connected component. Roughly speaking, the diameter of a random graph in G(n, p) is of order log n/(log(np)) if the expected degree
np is at least 1. Note that this is best possible in the following sense: for any graph with degrees at most d, the number of vertices that can be reached within distance k is at most 1 + d + d(d−1) + d(d−1)² + … + d(d−1)^(k−1). This sum must be at least n if k is the diameter. Therefore we know that the diameter, denoted diam(G), is at least about (log n)/log(d−1). To be precise, it can be shown that the diameter of a random graph G in G(n, p) is (1 + o(1))(log n)/(log np) if the expected degree np goes to infinity as n approaches infinity. When np ≥ c > 1, the diameter diam(G) is within a constant factor of (log n)/(log np), where the constant depends only on c and is independent of n. When np = c < 1, the random graph is surely disconnected and diam(G) equals the diameter of a tree component. In fact, the diameter of a graph G in G(n, p) is quite predictable, as follows. The diameter of G(n, p) is almost surely concentrated on at most two values around (log n)/(log np) if (np)/(log n) = c > 8. When (np)/(log n) = c > 2, the diameter of G(n, p) is almost surely concentrated on at most three values. For the range 2 ≥ (np)/(log n) = c > 1, the diameter of G(n, p) is almost surely concentrated on at most four values. It is of particular interest to consider random graphs G(n, p) in the range np > 1 and np ≤ c log n for some constant c, since this range includes the emergence of the unique giant component. Because of the phase transition in connectivity at p = log n/n, the problem of determining the diameter of G(n, p) and its concentration seems to be difficult for certain ranges of p. If (np)/(log n) = c > c₀ for (small) constants c and c₀, then the diameter of G(n, p) is almost surely concentrated on finitely many values, namely no more than 2⌊1/c₀⌋ + 4 values. These facts are summarized in Table 2 with references listed. As we can see from the table, numerous questions remain.
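The order-of-magnitude claim diam ≈ (log n)/(log np) is easy to probe by simulation. A sketch (function names are my own) samples G(n, p) and computes the diameter of the largest component by repeated breadth-first search:

```python
import random
from collections import deque
from math import log

def sample_gnp(n, p, seed=2):
    """Adjacency lists of a sample from G(n, p)."""
    rng = random.Random(seed)
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj

def diameter_largest_component(adj):
    """Maximum eccentricity over the largest connected component,
    computed by breadth-first search from each of its vertices."""
    def bfs(s):
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist
    seen, best = set(), []
    for v in range(len(adj)):
        if v not in seen:
            comp = list(bfs(v))
            seen.update(comp)
            if len(comp) > len(best):
                best = comp
    return max(max(bfs(s).values()) for s in best)

n, p = 400, 0.05          # expected degree np = 20
adj = sample_gnp(n, p)
print(diameter_largest_component(adj), log(n) / log(n * p))
```

With np = 20 the predicted order (log n)/(log np) is about 2, and sampled diameters come out as a small constant of the same order.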
Random Power Law Graphs

Parameters for Modeling Power Law Graphs
A large realistic network usually has a huge number of parameters with complicated descriptions. By "modeling a realistic network", we mean cutting the number of parameters down to relatively few while still capturing a good part of the character of the network. In choosing the parameters for modeling a real network, the exponent β of the power law is relatively easy to select: we can plot the log-degree versus log-frequency table and choose a good approximation of the slope. In a graph G, suppose that there are y vertices of degree x. Then G is considered to be a power law graph if x and y satisfy (or can be approximated by) the equation

  log y = α − β log x .   (3)

In other words, we have

  |{v : deg(v) = x}| = y = e^α / x^β .

Basically, α is the logarithm of the volume of the graph and β can be regarded as the log-log growth rate of the graph. On taking a closer look at the degree distribution of a typical realistic graph, several impediments become obvious.
(a) When we fit the power law model, there are discrepancies, especially when the degree is very small or very large. There is almost always a heavy-tail distribution at the upper range, and there seems to be scattering at the lower range. For example, for the collaboration graph, should we or shouldn't we include the data point for isolated vertices (an author with no coauthors)? Should we stay with the largest component or include all small components (including the isolated vertices)?
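Choosing the slope on the log-log plot can be done by a least-squares fit of Eq. (3). A minimal sketch (function name and the min_count cutoff are my own, the cutoff being one crude way to sidestep the noisy tail discussed in (a)):

```python
from math import log
import random

def fit_power_law(degrees, min_count=5):
    """Least-squares fit of log y = alpha - beta * log x on the log-log
    degree-frequency plot.  Degrees observed fewer than min_count times
    are dropped to reduce the influence of the scattered tail."""
    counts = {}
    for d in degrees:
        if d >= 1:                      # skip isolated vertices
            counts[d] = counts.get(d, 0) + 1
    pts = [(log(k), log(c)) for k, c in counts.items() if c >= min_count]
    m = len(pts)
    xbar = sum(x for x, _ in pts) / m
    ybar = sum(y for _, y in pts) / m
    beta = -sum((x - xbar) * (y - ybar) for x, y in pts) / \
           sum((x - xbar) ** 2 for x, _ in pts)
    alpha = ybar + beta * xbar
    return alpha, beta

# Synthetic degrees with P(degree = k) proportional to k^(-2.5)
rng = random.Random(3)
weights = [k ** -2.5 for k in range(1, 200)]
degrees = rng.choices(range(1, 200), weights=weights, k=20000)
alpha, beta = fit_power_law(degrees)
print(round(beta, 2))  # should land near the true exponent 2.5
```

Even on clean synthetic data the recovered exponent fluctuates by a few percent, which illustrates why the cutoff choices in (a) matter in practice.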
Random Graphs, a Whirlwind Tour of, Table 2 The diameter of random graphs G(n, p)
Range | diam(G(n, p)) | Reference
(np)/(log n) → ∞ | Concentrated on at most 2 values | [9]
(np)/(log n) = c > 8 | Concentrated on at most 2 values | [12]
8 ≥ (np)/(log n) = c > 2 | Concentrated on at most 3 values | [12]
2 ≥ (np)/(log n) = c > 1 | Concentrated on at most 4 values | [10]
1 ≥ (np)/(log n) = c > c₀ | Concentrated on at most 2⌊1/c₀⌋ + 4 values | [12]
log n > np → ∞ | diam(G(n, p)) = (1 + o(1)) (log n)/(log np) | [12]
np ≥ c > 1 | The ratio diam(G(n, p)) / ((log n)/(log np)) is finite (between 1 and f(c)) | [12]
np < 1 | diam(G(n, p)) equals the diameter of a tree component if (1 − np) n^(1/3) → ∞ | [35]
(b) The power law states that the number of vertices of degree k is proportional to k^(−β). We can approximate the number of vertices of degree k by the function f(k) = c k^(−β) for some constant c. However, f(k) is usually not an integer. By taking either the ceiling or the floor of f(k), some error is inevitable; in fact, such errors are acute when k or f(k) is small.
(c) The power law model is usually a better fit in the middle range than at either end. Still, in many examples there is a visible slight "hump" in the curve instead of the straight line representing the power law in the log-log table.
Among the above three points, (c) is mainly due to first-order approximation: the straight line with slope −β is a linear approximation of the actual plotted data. Thus the power law model is an important and necessary step toward more complicated real cases. Here, we will first discuss (b) and then (a). Item (b) concerns rounding errors, which can be checked by the following basic calculations about the power law graphs described by Eq. (3).
(1) The maximum degree of the graph is at most e^(α/β). Note that 0 ≤ log y = α − β log x.
(2) The number of vertices n can be computed as follows (under the assumption that the maximum degree is e^(α/β)). Summing y(x) for x from 1 to e^(α/β), we have

  n = Σ_{x=1}^{e^(α/β)} e^α / x^β ≈ ζ(β) e^α if β > 1 ;  α e^α if β = 1 ;  e^(α/β) / (1 − β) if 0 < β < 1 ,

where ζ(t) = Σ_{n=1}^{∞} 1/n^t is the Riemann zeta function.
(3) The number of edges E can be computed similarly:

  E = (1/2) Σ_{x=1}^{e^(α/β)} x · e^α / x^β ≈ (1/2) ζ(β − 1) e^α if β > 2 ;  (1/4) α e^α if β = 2 ;  (1/2) e^(2α/β) / (2 − β) if 0 < β < 2 .

Power law graphs with exponent β > 3 are quite different from those with exponent β < 3, as evidenced by the value of the second-order average degree w̃ (assuming m ≫ w, where m denotes the maximum degree and w the average degree): for β > 3, w̃ stays within a constant factor, involving (β − 2)², of w; at β = 3, w̃ grows logarithmically in m; and for 2 < β < 3, w̃ grows like a power of m.
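Calculations (2) and (3) can be verified numerically by summing the degree-frequency function directly and comparing with the zeta-function approximations (function names are my own; the zeta helper uses a partial sum with an integral tail correction):

```python
from math import exp, floor

def zeta(s, terms=200000):
    """Riemann zeta via a partial sum plus an integral tail correction
    (adequate for s > 1)."""
    return sum(n ** -s for n in range(1, terms + 1)) + terms ** (1 - s) / (s - 1)

def alpha_beta_counts(alpha, beta):
    """Vertex and edge counts of the (alpha, beta)-graph, obtained by
    summing y = e^alpha / x^beta over degrees x up to the maximum
    degree e^(alpha/beta), as in calculations (2) and (3)."""
    max_deg = floor(exp(alpha / beta))
    n = sum(exp(alpha) / x ** beta for x in range(1, max_deg + 1))
    e = 0.5 * sum(exp(alpha) / x ** (beta - 1) for x in range(1, max_deg + 1))
    return n, e

alpha, beta = 14.0, 2.5
n, e = alpha_beta_counts(alpha, beta)
print(n / (zeta(beta) * exp(alpha)))            # close to 1 for beta > 1
print(e / (0.5 * zeta(beta - 1) * exp(alpha)))  # approaches 1 as the max degree grows
```

The edge-count ratio converges more slowly because the relevant zeta sum, ζ(β − 1), is dominated less strongly by its initial terms.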
The above values of w̃ are quite useful in the study of the average distance and the diameter of random graphs.

The Evolution of Random Power Law Graphs
A natural question concerning the configuration model is how the random graphs evolve for power law distributions. Can we mimic the classical analysis of the Erdős–Rényi random graph model? Here we consider a configuration model with degree distribution as in the (α, β)-graph. As it turns out, the evolution depends only on β and not on α, as follows.
1. When β > β₀ = 3.47875…, the random graph almost surely has no giant component, where the value β₀ = 3.47875… is the solution of ζ(β − 2) − 2ζ(β − 1) = 0.
2. When β < β₀ = 3.47875…, there is almost surely a unique giant component.
3. When 2 < β < β₀ = 3.47875…, the second largest component is almost surely of size Θ(log n). For any 2 ≤ x < Θ(log n), there is almost surely a component of size x.
4. When β = 2, almost surely the second largest component is of size Θ(log n / log log n). For any 2 ≤ x < Θ(log n / log log n), there is almost surely a component of size x.
5. When 1 < β < 2, the second largest component is almost surely of size Θ(1). The graph is almost surely not connected.
6. When 0 < β < 1, the graph is almost surely connected.
When β = β₀ = 3.47875…, the situation is complicated; it is similar to the double jump of the random graph G(n, p) with p = 1/n. For β = 1, there is a nontrivial probability for either case, that the graph is connected or disconnected.
A useful tool for the configuration model is a result of Molloy and Reed [38,39]: For a random graph with (λ_i + o(1))n vertices of degree i, where the λ_i are nonnegative values summing to 1 and n is the number of vertices, the giant component emerges when Q = Σ_{i≥1} i(i − 2)λ_i > 0, provided that the maximum degree is less than n^{1/4} and some “smoothness” conditions are satisfied. Also, there is almost surely no giant component when Q = Σ_{i≥1} i(i − 2)λ_i < 0 and the maximum degree is less than n^{1/8}.

Let us consider Q for our (α, β)-graphs with β > 3:

    Q = (1/n) Σ_{x=1}^{e^{α/β}} x(x − 2) ⌊e^α / x^β⌋
      ≈ (1/ζ(β)) ( Σ_{x=1}^{e^{α/β}} 1/x^{β−2} − 2 Σ_{x=1}^{e^{α/β}} 1/x^{β−1} )
      ≈ (ζ(β − 2) − 2ζ(β − 1)) / ζ(β).

Hence, we consider the value β0 = 3.47875…, which we recall is a solution to ζ(β − 2) − 2ζ(β − 1) = 0. If β > β0, we have

    Σ_{x=1}^{e^{α/β}} x(x − 2) ⌊e^α / x^β⌋ < 0.
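The critical exponent β0 can be recovered numerically from the zeta-function condition above. The following sketch (an illustration, not from the source) locates the root of ζ(β − 2) − 2ζ(β − 1) by bisection; ζ is approximated by a truncated sum with tail corrections, which is adequate for the arguments involved.

```python
def zeta(s, terms=50_000):
    # Truncated Riemann zeta with Euler-Maclaurin tail corrections;
    # accurate enough here for s > 1.3.
    n = terms
    return (sum(k ** -s for k in range(1, n + 1))
            + n ** (1 - s) / (s - 1) - 0.5 * n ** -s + s * n ** (-s - 1) / 12)

def f(beta):
    # Positive below the threshold (giant component exists), negative above it.
    return zeta(beta - 2) - 2 * zeta(beta - 1)

lo, hi = 3.1, 4.0        # f(3.1) > 0 and f(4.0) < 0
for _ in range(40):
    mid = (lo + hi) / 2
    if f(mid) > 0:
        lo = mid
    else:
        hi = mid
beta0 = (lo + hi) / 2
print(beta0)             # ~3.47875
```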
We remark that for β > 8, Molloy and Reed's result immediately implies that almost surely there is no giant component. When β ≤ 8, additional analysis is needed to deal with the degree constraints [2]. It can be shown that the second largest component almost surely has size O(log n); furthermore, it almost surely has size at least Ω(log n), i.e., its size is Θ(log n).

G(w) Model for Power Law Graphs
In the Erdős–Rényi model G(n, p), the threshold for the phase transition of the giant component is at p = 1/n. Namely, when the average degree pn is less than 1, all connected components are small (of size O(log n)) and there is no giant component. When the average degree is more than 1, the giant component emerges in full swing. (There is a “double jump” which takes place when the average degree is close to 1, as discussed in Sect. “Random Graphs in a Nutshell”.) For the random graph model G(w), with given expected degrees w, it is natural to ask the same question: What parameter in which range will trigger the (sudden) emergence of the giant component? In addition to w, the expected average degree, we have scores of parameters, e.g., w̃ and higher order average degrees. Which parameter, w, w̃ or others, is critical for the rise of the giant component? These questions were answered in [13]: Suppose that G is a random graph in G(w) with expected degree sequence w. If the expected average degree w is strictly greater than 1, then the following holds:

(1) Almost surely G has a unique giant component. Furthermore, the volume of the giant component is at least (1 − 2/√(we) + o(1)) Vol(G) if w ≥ 4/e = 1.4715…, and is at least (1 − (1 + log w)/w + o(1)) Vol(G) if w < 2.

(2) The second largest component almost surely has size at most (1 + o(1)) μ(w) log n, where

    μ(w) =  1/(1 + log w − log 4)   if w ≥ 4/e,
            1/(w − 1 − log w)       if 1 < w < 2.
Moreover, with probability at least 1 − n^{−k}, the second largest component has size at most (k + 1 + o(1)) μ(w) log n, for any k ≥ 1.

There is a sharp asymptotic estimate for the volume of the giant component of a random graph in G(w). In [15], it was proved that if the expected average degree is strictly greater than 1, then almost surely the giant component of a graph G in G(w) has volume λ0 Vol(G) + O(√n log^{3.5} n), where λ0 is the unique nonzero root of the following equation:

    Σ_{i=1}^n w_i e^{−w_i λ0} = (1 − λ0) Σ_{i=1}^n w_i .    (4)
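Equation (4) is easy to solve numerically, and the prediction can be compared against a sampled graph. The sketch below (an illustration, not from the source; all expected degrees are set to 3, so the model reduces to an Erdős–Rényi-like graph) solves (4) by bisection and then measures the largest component of one sample via union-find.

```python
import math
import random

def solve_lambda0(w):
    """Bisection for the unique nonzero root of
    sum_i w_i e^{-w_i lam} = (1 - lam) sum_i w_i on (0, 1]."""
    total = sum(w)
    def f(lam):
        return sum(wi * math.exp(-wi * lam) for wi in w) - (1 - lam) * total
    lo, hi = 1e-9, 1.0        # f < 0 just above 0 when the average degree > 1; f(1) > 0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def largest_component_fraction(w, seed=1):
    """Sample G(w): edge {i, j} is present with probability w_i w_j / sum(w)."""
    rng = random.Random(seed)
    n, total = len(w), sum(w)
    parent = list(range(n))               # union-find over the vertices
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < w[i] * w[j] / total:
                parent[find(i)] = find(j)
    sizes = {}
    for v in range(n):
        r = find(v)
        sizes[r] = sizes.get(r, 0) + 1
    return max(sizes.values()) / n

w = [3.0] * 1500
lam0 = solve_lambda0(w)
frac = largest_component_fraction(w)
print(lam0, frac)         # both near 0.94 for expected average degree 3
```

For equal weights the equation reduces to e^{−wλ} = 1 − λ, the classical giant-component equation for average degree w.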
Because of the robustness of the G(w) model, many properties can be derived for appropriate degree distributions, including power law graphs.

Average Distance and the Diameter

A random graph G in G(w) almost surely has average distance (1 + o(1)) (log n)/(log w̃) if w satisfies certain conditions (called admissible conditions in [14]). The diameter is almost surely Θ((log n)/(log w̃)). In addition to studying the average distance and diameter, the structure of a random power law graph is very interesting, especially in the range 2 < β < 3 where the power law exponents β of numerous real networks reside. In this range, the power law graph can be roughly described as an “octopus” with a dense subgraph of small diameter O(log log n) as the core, while the overall diameter is O(log n) and the average distance is O(log log n). When β > 3 and the average degree w is strictly greater than 1, almost surely the average distance is (1 + o(1)) (log n)/(log w̃) and the diameter is Θ(log n). A phase transition occurs at β = 3, where the graph almost surely has diameter Θ(log n) and average distance Θ(log n / log log n).

Eigenvalues

Eigenvalues of graphs are useful for controlling many graph properties and consequently have numerous algorithmic applications, including clustering algorithms, low rank approximations, information retrieval and computer vision. In the study of the spectra of power law graphs, there are basically two competing approaches. One is to prove analogues of Wigner's semi-circle law (such as for G(n, p)), while the other predicts that the eigenvalues follow a power law distribution [27]. Although the semi-circle law and the power law have nothing in common, both approaches are essentially correct if one considers the appropriate matrices. There are in fact several ways to associate a matrix to a graph. The usual adjacency matrix A associated with a (simple) graph has eigenvalues quite sensitive to the maximum degree (which is a local property).
The combinatorial Laplacian D − A, with D denoting the diagonal degree matrix, is a major tool for enumerating spanning trees and has numerous applications. Another matrix associated with a graph is the
(normalized) Laplacian L = I − D^{−1/2} A D^{−1/2}, which controls the expansion/isoperimetric properties (which are global) and essentially determines the mixing rate of a random walk on the graph. Traditional random matrices and random graphs are regular or almost regular, so the spectra of all three of the above matrices are basically the same (up to a scaling factor or a linear shift). However, for graphs with uneven degrees, the three matrices can have very different eigenvalue distributions. Here we state bounds for the eigenvalues of random graphs in G(w) with a general degree distribution, from which the results on random power law graphs then follow [18].

1. The largest eigenvalue of the adjacency matrix of a random graph with a given expected degree sequence is determined by m, the maximum degree, and w̃, the weighted average of the squares of the expected degrees. In this case the largest eigenvalue of the adjacency matrix is almost surely (1 + o(1)) max{w̃, √m}, provided some minor conditions are satisfied. In addition, if the kth largest expected degree m_k is significantly larger than w̃², then the kth largest eigenvalue of the adjacency matrix is almost surely (1 + o(1)) √(m_k).

2. For a random power law graph with exponent β > 2.5, the largest eigenvalue is almost surely (1 + o(1)) √m, where m is the maximum degree. Moreover, the k largest eigenvalues of a random power law graph with exponent β have a power law distribution with exponent 2β − 1, if the maximum degree is sufficiently large and k is bounded above by a function depending on β, m and w, the average degree. When 2 < β < 2.5, the largest eigenvalue is heavily concentrated at c m^{3−β} for some constant c depending on β and the average degree.

3. The eigenvalues of the Laplacian satisfy the semi-circle law under the condition that the minimum expected degree is relatively large (much larger than the square root of the expected average degree).
This condition covers the basic case when all degrees are equal (the Erdős–Rényi model). If we weaken the condition on the minimum expected degree, we still have the following strong bound for the eigenvalues of the Laplacian, which implies strong expansion rates for rapid mixing:

    max_{i≠0} |1 − λ_i| ≤ (1 + o(1)) 4/√w + g(n) log² n / w_min ,

where w is the expected average degree, w_min is the minimum expected degree and g(n) is any slowly growing function of n.
On-Line Random Graphs

Preferential Attachment Schemes

The preferential attachment scheme is often attributed to Herbert Simon. In his paper [41] of 1955, he gave a model for word distribution using the preferential attachment scheme and derived Zipf's law (i.e., the probability of a word having occurred exactly i times is proportional to 1/i). The basic setup for the preferential attachment scheme is a simple local growth rule which leads to a global consequence: a power law distribution. Since this local growth rule gives preference to vertices with large degrees, the scheme is often described as “the rich get richer”. Of interest is to determine the exponent of the power law from the parameters of the local growth rule.

There are two parameters for the preferential attachment model:
- A probability p, where 0 ≤ p ≤ 1.
- An initial graph G0, that we have at time 0. Usually, G0 is taken to be the graph formed by one vertex having one loop. (We consider the degree of this vertex to be 1, and in general a loop adds 1 to the degree of a vertex.) Note that in this model multiple edges and loops are allowed.

We also have two operations we can perform on a graph:
- Vertex-step: Add a new vertex v, and add an edge {u, v} from v by randomly and independently choosing u in proportion to the degree of u in the current graph.
- Edge-step: Add a new edge {r, s} by independently choosing vertices r and s with probability proportional to their degrees.

Note that for the edge-step, r and s could be the same vertex, so loops can be created. However, as the graph gets large, the probability of adding a loop can be well bounded and is quite small.

The random graph model G(p; G0) is defined as follows: Begin with the initial graph G0. For t > 0, at time t the graph G_t is formed by modifying G_{t−1} as follows: with probability p, take a vertex-step; otherwise, take an edge-step. When G0 is the graph consisting of a single loop, we simplify the notation and write G(p) = G(p; G0).
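The growth rule above is straightforward to simulate; keeping a list in which each vertex appears once per unit of degree makes degree-proportional sampling a single uniform draw. The following sketch (an illustration, not from the source; the parameters t and p are arbitrary) starts from a single loop and applies t steps of G(p).

```python
import random

def simulate_gp(t, p, seed=0):
    """Simulate t steps of the preferential attachment model G(p).
    `chips` lists each vertex once per unit of its degree, so a uniform
    draw from `chips` is a draw proportional to degree."""
    rng = random.Random(seed)
    chips = [0]                    # G0: a single vertex with one loop (degree 1)
    num_vertices = 1
    for _ in range(t):
        if rng.random() < p:       # vertex-step: new vertex v attaches to u ~ degree
            u = rng.choice(chips)
            v = num_vertices
            num_vertices += 1
            chips += [u, v]
        else:                      # edge-step: both endpoints chosen ~ degree
            r = rng.choice(chips)
            s = rng.choice(chips)
            if r == s:
                chips.append(r)    # a loop adds 1 to the degree, as in the model
            else:
                chips += [r, s]
    return num_vertices, chips

t, p = 20_000, 0.6
num_vertices, chips = simulate_gp(t, p)
beta = 2 + p / (2 - p)             # predicted power-law exponent for G(p)
print(num_vertices / t, beta)      # ~p, and ~2.43 for p = 0.6
```

Each vertex-step adds exactly one vertex, so the number of vertices grows like pt, matching the rigorous analysis quoted below.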
There have been quite a number of papers analyzing the preferential attachment model G(p), usually reaching similar conclusions of a power law degree distribution. However, many of these analyses are heuristics that do not specify the ranges in which the power law holds. Heuristics often run into the danger of incorrect deductions and incomplete conclusions. It is quite essential to use rigorous proofs, which help specify the appropriate conditions and ranges for the power law. The following statement was proved in [16]. For the preferential attachment model G(p), almost surely the number of vertices with degree k at time t is

    M_k t + O(2 √(k³ t ln t)) ,

where M_1 = 2p/(4 − p) and

    M_k = (2p/(4 − p)) · Γ(k) Γ(1 + 2/(2 − p)) / Γ(k + 1 + 2/(2 − p)) = O(k^{−(2 + p/(2 − p))})

for k ≥ 2. In other words, almost surely the graphs generated by G(p) have a power law degree distribution with exponent β = 2 + p/(2 − p).

Duplication Models

Networks of interactions are present in all biological systems. The interactions among species in ecosystems, between cells in an organism and among molecules in a cell all lead to complex biological networks. Using current technological advances, extensive data on such interactions has been acquired. To find the underlying structure in these databases, it is of great importance to understand the basic principles of various genetic and metabolic networks. It has been observed that many biological networks are power law graphs with exponents β less than 2. The ranges of the power law exponents for biological networks are quite different from those for nonbiological networks. Various examples, such as the WWW graphs, call graphs, and various social networks, among others, are power law graphs with exponent β between 2 and 3. Table 3 lists the exponents of a variety of biological and nonbiological networks with associated references.

Random Graphs, a Whirlwind Tour of, Table 3
Power law exponents for biological and nonbiological networks

Biological networks          | Exponent β                      | References
Yeast protein–protein net    | 1.6, 1.7                        | [20,43]
E. coli metabolic net        | 1.7, 2.2                        | [3,28]
Yeast gene expression net    | 1.4–1.7                         | [20]
Gene functional interaction  | 1.6                             | [30]
Nonbiological networks       |                                 |
Internet graph               | 2.2 (indegree), 2.6 (outdegree) | [4,27,34]
Phone call graph             | 2.1–2.3                         | [1,2]
Collaboration graph          | 2.4                             | [29]
Hollywood graph              | 2.3                             | [4]

As we saw in Sect. “Preferential Attachment Schemes”, the preferential attachment model generates graphs with power law degree distributions with exponents β between 2 and 3. Therefore there is a need to consider alternative models for biological networks. The duplication of the information in the genome (genes and their controlling elements) is a driving force in evolution and a determinative factor of biological networks. The process of duplication is quite different from the preferential attachment process that is regarded by many as the basic growth rule for most nonbiological networks.

Here we consider a duplication model. If we only allow pure duplication, the resulting graph depends heavily on the initial graph and does not satisfy the power law. So we consider a duplication model that allows randomness within the duplication step, as defined below. We will see that this duplication model generates power law graphs with exponents in a range that includes the interval between 1 and 2, and therefore is more suitable for modeling complex biological networks.

There are two basic parameters for the duplication model:
- A selection probability p, where 0 ≤ p ≤ 1.
- An initial graph G0, that we have at time 0. Usually, G0 is taken to be the graph formed by one vertex. However, G0 can be taken to be any finite simple connected graph. Unlike the preferential attachment model, in this model the generated random graph is always a simple graph.

There is one basic operation:
- Duplication step: A sample vertex u is selected randomly and uniformly from the current graph. A new vertex v and the edge {u, v} are added to the graph. For each neighbor w of u, with probability p, {v, w} is added as a new edge.

The edge {u, v} in the duplication step is called a basic edge. The vertex u is said to be the parent of v, and v is called a child of u. We note that a vertex can have several children or no child at all, and that each vertex not in the initial graph G0 has a parent.
All basic edges from children to parents form a forest in which the vertices of G0 are the roots of the component trees. All terms like “leaves” and “descendants”, if not defined, refer to this forest. The duplication step can be further decomposed into two parts, vertex-duplication and edge-duplication, as follows:
- Vertex-duplication: At time t, randomly select a sample vertex u and add a new vertex v and an edge {u, v}.
- Edge-duplication: At time t, for the newly created vertex v, its parent u and each neighbor w of u, add an edge {v, w} with probability p.
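The duplication step is also simple to simulate. The sketch below (an illustration, not from the source; t = 3000 and p = 0.5 are arbitrary) grows a graph from a single vertex and verifies that the basic edges keep it connected, as remarked below.

```python
import random

def duplication_graph(t, p, seed=7):
    """Grow a graph by t duplication steps from a single vertex G0.
    Each step: pick a uniform random parent u, add child v with the basic
    edge {u, v}, then copy each edge {u, w} to {v, w} with probability p."""
    rng = random.Random(seed)
    adj = {0: set()}                     # G0: one vertex, no edges
    for v in range(1, t + 1):
        u = rng.randrange(v)             # uniform sample vertex
        adj[v] = {u}
        adj[u].add(v)
        for w in list(adj[u]):
            if w != v and rng.random() < p:
                adj[v].add(w)
                adj[w].add(v)
    return adj

adj = duplication_graph(3000, 0.5)

# The basic edges keep the graph connected: check by depth-first search.
seen, stack = {0}, [0]
while stack:
    x = stack.pop()
    for y in adj[x]:
        if y not in seen:
            seen.add(y)
            stack.append(y)
connected = len(seen) == len(adj)
print(len(adj), connected)
```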
For any vertex v, a descendant of v can only be connected to descendants of v's neighbors (including v itself). An edge {x, y} is said to be a descendant of an edge {u, v} if “x is a descendant of u and y is a descendant of v” or “x is a descendant of v and y is a descendant of u”. We remark that having the basic edges {u, v} makes the graph G always connected. This helps avoid degenerate cases such as having mostly isolated vertices. For the above duplication model, it can be shown [17] that its degree distribution obeys a power law, with the exponent β satisfying the following equation:

    1 + p = pβ + p^{β−1} .    (5)

We remark that the solutions of (5), illustrated in Fig. 4, consist of two parts. One is the line β = 1. The other is a curve which is a monotonically decreasing function of p. The two curves intersect at (x, 1), where x = 0.56714329… is the solution of x = −log x. One very interesting range for β is when p is near 1/2. To get a power law with exponent 1.5, for example, one should choose p = 0.535898…. We also see that the second curve intersects zero at p = (√5 − 1)/2, an intriguing number (the “golden mean”). At p = 1/2, one solution for β is 2. Although there are two solutions for each p, the stable solutions are on the curve when p < 0.56714329… and on the line β = 1 for p > 0.56714329….
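Equation (5) can be solved numerically for the nontrivial branch. The bisection sketch below (an illustration, not from the source) recovers the values quoted in the text: β = 2 at p = 1/2, and p ≈ 0.535898 giving β = 1.5.

```python
def beta_of_p(p, lo=1.05, hi=8.0):
    """Nontrivial root beta of 1 + p = p*beta + p**(beta - 1).
    Valid for p below ~0.5671, where the nontrivial root exceeds 1 and the
    residual g changes sign exactly once on [lo, hi]."""
    def g(beta):
        return p * beta + p ** (beta - 1) - (1 + p)
    for _ in range(100):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

b_half = beta_of_p(0.5)          # the text's example: beta = 2 at p = 1/2
b_three_halves = beta_of_p(0.535898)
print(b_half, b_three_halves)    # ~2.0 and ~1.5
```

Note that the lower bracket starts slightly above 1 to exclude the trivial solution β = 1, which satisfies (5) for every p.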
Random Graphs, a Whirlwind Tour of, Figure 4 The value of ˇ as a function of p. The green section of the curve shows the range of ˇ values which are typically found for biological networks
Remarks

The small world phenomenon, which occurs ubiquitously in numerous existing networks, refers to two similar but different properties:
- Small distance: Between any pair of nodes, there is a short path.
- The clustering effect: Two nodes are more likely to be adjacent if they share a common neighbor.

There have been various approaches to model networks that exhibit the small world phenomenon. In particular, the aspect of small distances can be well explained by using random graphs with general degree distributions, which include the power law distribution. However, the other feature, the clustering effect, seems much harder to model. To model the clustering effect, a typical approach is to add random edges to a grid graph or the like [26,32,44]. Such grid-based models are quite restrictive and far from satisfactory for modeling biological networks or collaboration graphs, for example. On the other hand, random power law graphs are good for modeling small distances but fail miserably at modeling the clustering effect. In a way, the aspect of small distances is about neighborhood expansion, while the aspect of the clustering effect is about neighborhood density. The related graph-theoretic parameters seem to be of entirely different scales. For example, while the clustering effect is quite sensitive to the average degree, the small distance effect is not. The heart of the problem can be quite simply stated: For a given network, what is its true geometry? How can we capture the geometry of the network (without invoking too many parameters)?

Bibliography
1. Abello J, Buchsbaum A, Westbrook J (1998) A functional approach to external graph algorithms. In: Proc. 6th European Symposium on Algorithms. Springer, Berlin, pp 332–343
2. Aiello W, Chung F, Lu L (2000) A random graph model for massive graphs. In: Proc. of the Thirty-Second Annual ACM Symposium on Theory of Computing. ACM Press, New York, pp 171–180
3.
Albert R, Barabási A-L (2002) Statistical mechanics of complex networks. Rev Mod Phys 74:47–97 4. Barabási A-L, Albert R (1999) Emergence of scaling in random networks. Science 286:509–512 5. Barabási A-L, Albert R, Jeong H (2000) Scale-free characteristics of random networks: the topology of the world-wide web. Physica A 281:69–77 6. Bender EA, Canfield ER (1978) The asymptotic number of labeled graphs with given degree sequences. J Comb Theory Ser A 24:296–307 7. Bollobás B (1984) The evolution of random graphs. Trans Amer Math Soc 286:257–274
8. Bollobás B (1981) The diameter of random graphs. Trans Amer Math Soc 267:41–52
9. Bollobás B (1985) Random graphs. Academic Press, New York, pp xvi+447
10. Bollobás B (1984) The evolution of sparse graphs. In: Graph theory and combinatorics (Cambridge 1983). Academic Press, London, New York, pp 35–57
11. Broder A, Kumar R, Maghoul F, Raghavan P, Rajagopalan S, Stata R, Tomkins A, Wiener J (2000) Graph structure in the web. In: Proc. of the WWW9 Conference, Amsterdam, May 2000. Paper version appeared in Computer Networks 33, pp 309–320
12. Chung F, Lu L (2001) The diameter of sparse random graphs. Adv Appl Math 26:257–279
13. Chung F, Lu L (2002) Connected components in random graphs with given expected degree sequences. Ann Comb 6:125–145
14. Chung F, Lu L (2002) The average distances in random graphs with given expected degrees. Proc Nat Acad Sci 99:15879–15882
15. Chung F, Lu L (2006) The volume of the giant component of a random graph with given expected degrees. SIAM J Discret Math 20:395–411
16. Chung F, Lu L (2006) Complex graphs and networks. CBMS Lecture Series, vol 107. AMS Publications, Providence, pp vii+264
17. Chung F, Lu L, Dewey G, Galas DJ (2003) Duplication models for biological networks. J Comput Biol 10(5):677–687
18. Chung F, Lu L, Vu V (2003) The spectra of random graphs with given expected degrees. Proc Nat Acad Sci 100(11):6313–6318
19. Conlon D A new upper bound for diagonal Ramsey numbers. Ann Math (in press)
20. Dorogovtsev SN, Mendes JFF (2001) Scaling properties of scale-free evolving networks: continuous approach. Phys Rev E 63(056125):19
21. Erdős P (1947) Some remarks on the theory of graphs. Bull Amer Math Soc 53:292–294
22. Erdős P, Rényi A (1959) On random graphs, vol I. Publ Math Debrecen 6:290–297
23. Erdős P, Rényi A (1960) On the evolution of random graphs. Magyar Tud Akad Mat Kutató Int Közl 5:17–61
24. Erdős P, Gallai T (1961) Gráfok előírt fokú pontokkal (Graphs with points of prescribed degrees, in Hungarian). Mat Lapok 11:264–274
25. Exoo G (1989) A lower bound for R(5, 5). J Graph Theory 13:97–98
26. Fabrikant A, Koutsoupias E, Papadimitriou CH (2002) Heuristically optimized trade-offs: a new paradigm for power laws in the internet. In: Automata, languages and programming. Lecture Notes in Computer Science, vol 2380. Springer, Berlin, pp 110–122
27. Faloutsos M, Faloutsos P, Faloutsos C (1999) On power-law relationships of the Internet topology. In: Proc. of the ACM SIGCOMM Conference. ACM Press, New York, pp 251–262
28. Friedman R, Hughes A (2001) Gene duplications and the structure of eukaryotic genomes. Genome Res 11:373–381
29. Grossman J, Ion P, De Castro R (2007) The Erdős number project. http://www.oakland.edu/enp
30. Gu Z, Cavalcanti A, Chen F-C, Bouman P, Li W-H (2002) Extent of gene duplication in the genomes of drosophila, nematode, and yeast. Mol Biol Evol 19:256–262
31. Janson S, Knuth DE, Łuczak T, Pittel B (1993) The birth of the giant component. Rand Struct Algorithms 4:231–358
32. Kleinberg J (2000) The small-world phenomenon: an algorithmic perspective. In: Proc. 32nd ACM Symposium on Theory of Computing. ACM, New York, pp 163–170
33. Kleinberg J, Kumar R, Raghavan P, Rajagopalan S, Tomkins A (1999) The web as a graph: measurements, models and methods. In: Proc. of the International Conference on Combinatorics and Computing. Lecture Notes in Computer Science, vol 1627. Springer, Berlin, pp 1–17
34. Kumar R, Raghavan P, Rajagopalan S, Tomkins A (1999) Trawling the web for emerging cyber communities. In: Proc. of the 8th World Wide Web Conference, Toronto, 1999
35. Łuczak T (1998) Random trees and random graphs. Rand Struct Algorithms 13:485–500
36. McKay BD, Radziszowski SP (1994) Subgraph counting identities and Ramsey numbers. J Comb Theory (B) 61:125–132
37. Milgram S (1967) The small world problem. Psychol Today 2:60–67 38. Molloy M, Reed B (1995) A critical point for random graphs with a given degree sequence. Rand Struct Algorithms 6:161–179 39. Molloy M, Reed B (1998) The size of the giant component of a random graph with a given degree sequence. Combin Probab Comput 7:295–305 40. Ramsey FP (1930) On a problem of formal logic. Proc London Math Soc 30:264–286 41. Simon HA (1955) On a class of skew distribution functions. Biometrika 42:425–440 42. Spencer J (1975) Ramsey’s theorem—a new lower bound. J Comb Theory (A) 18:108–115 43. Stubbs L (2002) Genome comparison techniques. In: Galas D, McCormack S (eds) Genomic technologies: present and future. Caister Academic Press 44. Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small world’ networks. Nature 393:440–442
Random Matrix Theory
GÜLER ERGÜN
Department of Mathematical Sciences, University of Bath, Bath, UK

Article Outline
Glossary
Definition of the Subject
Introduction
Random Matrix Ensemble Classifications
Construction of Gaussian Ensembles
Eigenvalues of Gaussian Ensembles
Density of States: Semicircle Law
Spacing Distribution: Wigner Surmise
Two-Level Correlation Function
Random Matrix Theory and Complex Systems
Summary
Future Directions
Bibliography

Glossary
Random matrices Large matrices with randomly distributed elements obeying given probability laws and symmetry classes.
Orthogonal ensembles Real symmetric random matrix ensembles which are invariant under all orthogonal transformations. The majority of practical systems are described by these ensembles.
Unitary ensembles Hermitian random matrix ensembles which are invariant under all unitary transformations.
Symplectic ensembles Hermitian self-dual random matrix ensembles which are invariant under all symplectic transformations.

Definition of the Subject
Random Matrix Theory (RMT) is a method of studying the statistical behavior of large complex systems by defining an ensemble which considers all possible laws of interactions within the system. The important question addressed by random matrix theory is: given a random matrix ensemble, what are the probability laws which govern its eigenvalues or eigenvectors? This question is pertinent to many areas of physics and mathematics, for instance the statistical behavior of the compound nucleus, conductivity in disordered metals, the behavior of chaotic systems, or the zeros of the Riemann zeta function. The success of random matrices lies in the universality regime of the eigenvalue statistics. There is compelling evidence that when the size of the matrix is very large, the eigenvalue distribution tends, in a certain sense, towards a limiting distribution. This limit depends only on the symmetry properties of the matrix and is independent of the initial probability law imposed on the matrix entries. For example, in many-body systems, random matrix models are very successful in describing the spectral fluctuation properties of complex atoms, molecules and atomic nuclei, despite the fact that the interactions between the constituents of these systems are very different, and not at all random: the nuclei are bound by short-range nuclear forces, whereas atoms and molecules are governed by the long-range Coulomb forces. Yet the Hamiltonians of these models consider only the symmetry properties, without detailed knowledge of the system under consideration. The reason for such universality is not so clear, but it may be an outcome of a law of large numbers operating in the background. In the following sections, after giving the history and development of the theory, we will explore the random matrix ensembles and eigenvalue statistics, as well as their pertinence to various complex systems.

Introduction
Although Random Matrix Theory was initiated in the early 1900s by statisticians in their studies of the product moment distribution [59], it made its progress in the hands of physicists. It was Eugene Wigner who introduced the random matrix idea to the theoretical physics community in the mid 1950s. During the 1950s, due to the development of nuclear weapons and nuclear power stations, physicists of this era were involved in measuring fission resonances in various nuclei. Wigner was predominantly studying [53] the statistical properties of nuclear spectra of heavy nuclei, but the large number of resonances was making his analysis very difficult. It was clear that the resonance spectra were far too complex to be analyzed by existing models like the atomic shell model. This prompted Wigner to look for statistical methods which could deal with such complexity. Later, the realization that the statistical distribution of the nuclear resonance energies shared the same properties as the eigenvalues of random matrices led him to study a random matrix model [54], in which an infinite real symmetric matrix with identically distributed elements, having the value zero on the diagonal and +1 or −1 off the diagonal, was used, so that all sign interactions were equally probable. In this model he showed that the distribution of the characteristic values of the matrix, i.e., the density of states (DOS), was a semicircle, and it was also apparent that the characteristic values repelled each other.¹

¹ This was already known to Wigner and von Neumann [51].
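Wigner's sign-matrix model and the semicircle law can be probed numerically through moments: for the semicircle density, the second and fourth moments of the normalized matrix A/√n converge to the Catalan numbers 1 and 2. The sketch below (an illustration, not from the source; the matrix size is arbitrary) checks this for one random sign matrix.

```python
import random

rng = random.Random(3)
n = 120

# Wigner's model: real symmetric, zero diagonal, +1 or -1 off the diagonal.
a = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        a[i][j] = a[j][i] = rng.choice((-1, 1))

# Moments of A / sqrt(n) via traces:
# tr(A^2) / n^2 -> 1 (Catalan C_1) and tr(A^4) / n^3 -> 2 (Catalan C_2).
a2 = [[sum(a[i][k] * a[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
m2 = sum(a2[i][i] for i in range(n)) / n ** 2
m4 = sum(x * x for row in a2 for x in row) / n ** 3   # tr(A^4) = sum of (A^2)_{ij}^2
print(m2, m4)
```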
During a number of conferences [55,56,57] in 1956–1957, on the basis of the eigenvalues of a 2 × 2 real symmetric matrix, Wigner put forward a conjectured formula for the level spacing distribution, and he advocated Wishart's work [59], which gave the joint probability distribution of eigenvalues, to confirm his findings. At about the same time, analysis of experimental data by Gurevich and Pevsner [29] was also confirming that the nuclear levels repel each other and that the distribution of consecutive level spacings is not totally random but shows some regularities. Interest in the random matrix hypothesis was becoming widespread; in [44] Porter and Rosenzweig summarized the developments on the statistical properties of atomic and nuclear spectra, including experimental, theoretical, and numerical findings, which helped to strengthen the applicability of such statistical laws to nuclear reactions. It was time to make these statistical theories mathematically rigorous; Wigner used the method of moments to prove that the density of eigenvalues of large symmetric matrices is distributed as a semicircle [58]. Motivated by this, in 1960 Mehta and Gaudin obtained the same distribution by integrating the joint probability density function of the eigenvalues, and simultaneously Mehta showed analytically that the Wigner surmise is a very good approximation for large random matrices. Soon after that, Gaudin gave the exact expression for the spacing distribution [26]. Dyson considered these statistical theories of random matrices as a new kind of statistical mechanics [10], in which one ignores exact knowledge of the nature of the system. That is to say, a compound nucleus can be viewed as a black box, and one defines an ensemble of systems in which all possible laws of interaction are equally probable.
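The 2 × 2 calculation behind Wigner's surmise is easy to reproduce by simulation. In the sketch below (an illustration, not from the source), a 2 × 2 GOE matrix has independent entries H11, H22 ~ N(0, 1) and H12 ~ N(0, 1/2); its eigenvalue spacing s = √((H11 − H22)² + 4 H12²) is then Rayleigh distributed with mean √π, and the vanishing density near s = 0 exhibits level repulsion.

```python
import math
import random

rng = random.Random(42)

def goe2_spacing():
    """Eigenvalue spacing of a random 2 x 2 GOE matrix with diagonal
    entries ~ N(0, 1) and off-diagonal entry ~ N(0, 1/2)."""
    h11 = rng.gauss(0.0, 1.0)
    h22 = rng.gauss(0.0, 1.0)
    h12 = rng.gauss(0.0, math.sqrt(0.5))
    # Eigenvalues are ((h11 + h22) +/- sqrt((h11 - h22)^2 + 4 h12^2)) / 2.
    return math.sqrt((h11 - h22) ** 2 + 4.0 * h12 ** 2)

samples = [goe2_spacing() for _ in range(50_000)]
mean_spacing = sum(samples) / len(samples)
repulsion = sum(s < 0.1 for s in samples) / len(samples)
print(mean_spacing, repulsion)   # mean near sqrt(pi); tiny mass near s = 0
```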
In an attempt to explore this so-called “new kind of statistical mechanics”, Dyson presented a series of papers [10,11,12], in which he showed that all random matrix ensembles fall into three universality classes, called the orthogonal ensemble (OE), the unitary ensemble (UE) and the symplectic ensemble (SE). These ensembles account for the symmetry properties of the Hamiltonians as well as for the mathematical requirement that all possible interactions be equally likely. For matrices with Gaussian distributed entries they are referred to as the GOE, GUE and GSE, respectively. Random matrix theory (RMT) was developed all through the 1960s by Wigner, Mehta, Gaudin and Dyson. Most of the early work on random matrices, up to 1965, is collected in Statistical Theories of Spectra by Porter. RMT took on a new dimension in the 1980s when Bohigas, Giannoni and Schmit (BGS) [5] conjectured that
it could be applied to describe the spectra of all chaotic systems. The BGS statement not only renewed interest in random matrices but also started a surge of research in different subject areas such as quantum chaos, quantum graphs and mesoscopic systems, and there now exists overwhelming evidence that the BGS proposition may be true. Several review articles summarize applied RMT research: up to the 1980s by Brody et al., and up to more recent times by Bohigas and by Guhr et al. A comprehensive book in this subject area is Random Matrices by Mehta [38]; however, it is aimed at the more advanced reader. For newcomers, a pedagogical derivation of most of the basic principles of RMT may be found in Quantum Signatures of Chaos by Haake [30]. Nowadays, RMT is at the center of several areas of mathematics, particularly number theory, combinatorics, diffusion processes, probability and statistics. At the same time it is used as a tool to understand many complex systems, from biology and quantum chaos to wireless communication and finance. In the following sections, first the most commonly used random matrix ensembles, namely the Gaussian ensembles, will be presented. Then some of their central statistical properties, such as the density of states and the spacing distribution, will be discussed.

Random Matrix Ensemble Classifications
Because of the broad use of RMT, a variety of random matrix ensembles have been studied. These ensembles can be loosely divided into three types: Gaussian, circular and hyperbolic. In the next sections, although the emphasis will be on the Gaussian ensembles, we also give very brief descriptions of the circular and hyperbolic ensembles.

Gaussian Ensembles
These ensembles are categorized into three universality classes, which take into account only (1) the time-reversal invariance (more often referred to as time-reversal symmetry) and (2) the strong spin-orbit coupling of the Hamiltonian of the underlying quantum system.
Within the Gaussian ensembles, the type of invariance transformation gives the classifying name to the ensemble. For quick reference these ensembles are grouped in Table 1 according to the space-time symmetries and the values of the angular momentum.

Random Matrix Theory, Table 1 The grouping of the quantum systems by their symmetry properties and the value of the angular momentum. β is the index of the classification

Symmetries                     Total angular mom.   Hamiltonian       Canonical group   Ensemble   β
Time-reversal                  Integer              Real symmetric    Orthogonal        GOE        1
Time-reversal, Rotational      Half-integer         Real symmetric    Orthogonal        GOE        1
Time-reversal, No rotational   Half-integer         Quaternion real   Symplectic        GSE        4
None                           Any                  Hermitian         Unitary           GUE        2

Circular Ensembles

These ensembles were introduced by Dyson [10] under the assumption that the system is characterized not by its Hamiltonian but by a unitary matrix S. The matrix is considered to be a scattering matrix (S-matrix), whose elements give the transition probabilities between various states and whose eigenvalues lie on the unit circle in the complex plane. Because of this compactness of the ensembles, the universal distribution can be obtained with a uniform probability law. From the point of view of the fundamental symmetries, these ensembles can again be categorized into three types: the circular orthogonal ensemble (COE), the circular unitary ensemble (CUE) and the circular symplectic ensemble (CSE). In the large-N limit, the statistical properties of their eigenvalues are identical to those of the Gaussian ensembles.
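The circular ensembles are easy to explore numerically. The sketch below (NumPy assumed; the function name sample_cue is ours) draws a Haar-distributed unitary matrix, i.e. a CUE sample, by QR-decomposing a complex Ginibre matrix and correcting the phases, and then checks that its eigenvalues indeed lie on the unit circle.

```python
import numpy as np

def sample_cue(n, rng):
    """Draw an n x n Haar-distributed unitary matrix (a CUE sample)."""
    z = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    d = np.diagonal(r)
    # Fix the phases of R's diagonal so the distribution is Haar-uniform.
    return q * (d / np.abs(d))

rng = np.random.default_rng(0)
u = sample_cue(50, rng)
eigs = np.linalg.eigvals(u)
print(np.allclose(np.abs(eigs), 1.0))  # True: eigenvalues lie on the unit circle
```

The phase correction matters: the plain QR decomposition of a Gaussian matrix is not Haar-distributed without it.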
Hyperbolic Ensembles

These ensembles are appropriately described in terms of Brownian motion in certain matrix spaces. A seminal study of such a system is Dyson's Brownian motion model [9], which applies the notions of RMT to a new type of Coulomb gas model, so that the "Coulomb gas" can be interpreted as a dynamical system. The matrix defined is a transfer matrix (M-matrix), which relates the fluxes at one edge of a conductor to the fluxes at the other edge (i.e., the number of ingoing electrons minus the number of outgoing electrons must be the same on each side of the conductor). Unlike the S-matrix, the M-matrix does not belong to a compact group. It is also not possible to define a normalizable probability law; instead one defines a diffusion equation as a function of a fictitious time. The "time" can be any property of the model, such as the length of the conductor.

Construction of Gaussian Ensembles

In fixing the elements of random matrices we are not completely free, since first of all the matrix elements must obey the class restrictions given in the previous section. In the GOE case the Hamiltonian matrix is taken to be real symmetric. This means that for an N × N real symmetric matrix there are N(N+1)/2 independent elements, i.e., the elements H_ij with i ≤ j, i, j = 1, ..., N, can be chosen independently and the rest are determined by the symmetry. The ultimate aim is to define a joint probability density P(H) for the matrix elements satisfying the conditions:

1. The probability P(H) must be invariant under any transformation H → W⁻¹ H' W, where W is an orthogonal (β = 1), unitary (β = 2) or symplectic (β = 4) matrix. For example, for β = 1, when W = O is any real orthogonal matrix we have

P(H) dH = P(H') dH',   O⁻¹ = Oᵀ,   O Oᵀ = 1. (1)

2. The matrix elements H_ij are statistically independent. Thus, the joint probability density must be the product of the densities of the individual elements,

P(H) = ∏_{i≤j} P_ij(H_ij). (2)

The first condition is essential, since these transformations preserve the eigenvalues; the second is chosen only to simplify the calculations. However, as will become apparent later in the derivations, these two conditions together force the distribution of the random matrix elements to be Gaussian. To find P(H) for real symmetric Hamiltonian matrices, we consider a 2 × 2 real symmetric matrix H and the orthogonal transformation matrix

O = ( cos θ   sin θ ; −sin θ   cos θ ), (3)

which is the two-dimensional rotation through the angle θ. The rotation O acts as the transformation H = Oᵀ H' O, from which we get the relations

H11 = (H'11 + H'22)/2 + ((H'11 − H'22)/2) cos 2θ − H'12 sin 2θ
H12 = ((H'11 − H'22)/2) sin 2θ + H'12 cos 2θ
H22 = (H'11 + H'22)/2 − ((H'11 − H'22)/2) cos 2θ + H'12 sin 2θ. (4)
The joint probability density is simply the product of the three densities

P(H) = P11(H11) P12(H12) P22(H22), (5)

and the orthogonal invariance of P(H) implies dP(H)/dθ = 0. Therefore we have

[ (1/P11(H11)) (dP11/dH11)(dH11/dθ) + (1/P12(H12)) (dP12/dH12)(dH12/dθ) + (1/P22(H22)) (dP22/dH22)(dH22/dθ) ] P(H) = 0. (6)

From Eqs. (4) we can find the three derivatives dH_ij/dθ:

dH11/dθ = −2H12,   dH12/dθ = H11 − H22,   dH22/dθ = 2H12. (7)

Substituting these relations into Eq. (6) and dividing by H12(H11 − H22) P(H), we get

(2/(H11 − H22)) [ d ln P11(H11)/dH11 − d ln P22(H22)/dH22 ] = (1/H12) d ln P12(H12)/dH12 = −2a. (8)

The two sides of this equation depend on different variables, which means that each side must equal a constant; hence the constant a is introduced. From Eq. (8) we first find a solution for P12(H12) by simple integration,

d ln P12(H12) = −2a H12 dH12,   P12(H12) = C12 exp(−a H12²), (9)

in which the constant C12 arises from the integration. Then, rewriting Eq. (8) as

d ln P11(H11)/dH11 + a H11 = d ln P22(H22)/dH22 + a H22 = b, (10)

where another constant b is introduced because the equation again depends on a different variable on each side, and following the same sequence of steps as before, we get the solutions for P11 and P22,

P11(H11) = C11 exp( −(a/2)H11² + b H11 ),   P22(H22) = C22 exp( −(a/2)H22² + b H22 ). (11)

These solutions have Gaussian form, and their product gives

P(H) = C exp[ −(a/2)(H11² + H22² + 2H12²) + b(H11 + H22) ]. (12)

If we fix the mean values of the diagonal elements to zero, i.e. ⟨Hii⟩ = 0, then the second constant b = 0. The overall constant C is determined by the normalization

∫_{−∞}^{∞} dH11 dH22 dH12 P(H) = 1. (13)

A simple calculation gives C = (1/2)(a/π)^{3/2}. For the convergence of Eq. (13) the constant a must be positive; it is related to the widths of the distributions of the diagonal and off-diagonal elements. The variances of the diagonal matrix elements are

⟨H22²⟩ = ⟨H11²⟩ = √(a/2π) ∫ H11² exp( −(a/2)H11² ) dH11 = 1/a, (14)

whereas for the off-diagonal element

⟨H12²⟩ = √(a/π) ∫ H12² exp( −a H12² ) dH12 = 1/(2a). (15)

Taking σ² as the variance of the off-diagonal matrix elements, we obtain a = 1/(2σ²) for a 2 × 2 matrix. We remark here that for large matrices the off-diagonal elements greatly outnumber the diagonal ones, so for the purpose of numerical simulations it does not matter much if the variances of the diagonal and the off-diagonal elements are not in exactly this ratio. Although we have only considered a 2 × 2 Hamiltonian matrix in finding P(H), the argument extends easily to N × N matrices with a suitable transformation matrix. In this way we find that a real symmetric random matrix with probability density

P(H) = (1/(4σ²π))^{N/2} (1/(2σ²π))^{N(N−1)/4} exp( −(1/(4σ²)) Σ_{i,j} H_ij² ) (16)

defines the Gaussian orthogonal ensemble.
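Eq. (16) translates directly into a sampling recipe: fill a real symmetric matrix with independent Gaussians whose diagonal variance is twice the off-diagonal variance, 2σ² versus σ². A minimal sketch (NumPy assumed; sample_goe is our own name) checks the two variances empirically:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 0.5  # variance of the off-diagonal elements

def sample_goe(n):
    """Real symmetric matrix distributed as in Eq. (16):
    off-diagonal variance sigma2, diagonal variance 2*sigma2."""
    a = rng.standard_normal((n, n)) * np.sqrt(sigma2)
    return (a + a.T) / np.sqrt(2)

samples = np.array([sample_goe(4) for _ in range(20000)])
var_diag = samples[:, 0, 0].var()
var_off = samples[:, 0, 1].var()
print(var_diag, var_off)  # close to 2*sigma2 = 1.0 and sigma2 = 0.5
```

The symmetrization (A + Aᵀ)/√2 automatically produces the 2:1 variance ratio of Eqs. (14)–(15).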
Analogously, we obtain the Gaussian unitary ensemble (GUE) and the Gaussian symplectic ensemble (GSE), now requiring the joint probability distribution of the matrix elements to be invariant under unitary and symplectic transformations, respectively. Thus, the general form of the probability density of a random Hermitian matrix is

P(H) = (1/(4σ²π))^{N/2} (1/(2σ²π))^{N(N−1)/2} exp{ −(1/(4σ²)) Σ_{i,j} [ (Hᴿ)²_ij + (Hᴵ)²_ij ] }, (17)

where (Hᴿ)_ij and (Hᴵ)_ij are the real and imaginary parts of the off-diagonal matrix elements H_ij, respectively. And the probability density of a random quaternion-real matrix is

P(H) = (1/(4σ²π))^{N/2} (1/(2σ²π))^{N(N−1)} exp{ −(1/(4σ²)) Σ_{i,j} [ (H0)²_ij + (H1)²_ij + (H2)²_ij + (H3)²_ij ] }, (18)

where (H0)_ij, (H1)_ij, (H2)_ij and (H3)_ij are the quaternionic components of H_ij. Without loss of generality, the probability density P(H), with H of arbitrary dimension N × N, can be written in a form common to all three ensembles:

P_{Nβ}(H) = C_{Nβ} exp( −(Nβ/(4σ²)) Tr[H²] ), (19)

where Tr stands for the matrix trace, Tr H = Σ_i H_ii; β = 1, 2, 4 is used as an index for the ensemble classification; σ² is the variance of the off-diagonal matrix elements; and the factor N ensures that the moments of the eigenvalues λ_i of H remain finite in the limit N → ∞, with ⟨λ_i²⟩ = 1.

Eigenvalues of Gaussian Ensembles

Using the joint probability distribution of the matrix entries, Eq. (19), we now write down the joint probability distribution of the eigenvalues for all three Gaussian ensembles. Again we consider the simplest case of a real symmetric 2 × 2 Hamiltonian matrix H and the orthogonal transformation matrix O of Eq. (3), such that

H = Oᵀ H_D O, (20)

where H_D is the diagonal matrix containing the eigenvalues of H. In this case there are only two eigenvalues, given by

λ_{1,2} = (1/2)(H11 + H22) ± (1/2)√( (H11 − H22)² + 4H12² ). (21)

Writing out Eq. (20) explicitly,

H11 = λ1 cos²θ + λ2 sin²θ
H22 = λ1 sin²θ + λ2 cos²θ (22)
H12 = (λ1 − λ2) cos θ sin θ,

we see that the elements of H are functions of the eigenvalues and of a single angle parameter θ, which specifies the set of eigenvectors. The Jacobian of this transformation is simply

J = det[ ∂(H11, H22, H12)/∂(λ1, λ2, θ) ] = (λ1 − λ2), (23)

and since Tr[H²] = λ1² + λ2², we can simply replace the matrix elements H_ij with the eigenvalues of H and integrate over θ to get

P(λ1, λ2) = C |λ1 − λ2| exp( −(Nβ/(4σ²))(λ1² + λ2²) ), (24)

which is the joint probability density of the eigenvalues for the GOE in dimension 2 × 2. We follow the same procedure to find the joint probability density of the eigenvalues in the GUE case. Here we consider a 2 × 2 complex Hermitian matrix H and a simple 2 × 2 unitary transformation matrix U to diagonalize it, where

U = ( cos θ   e^{iφ} sin θ ; −e^{−iφ} sin θ   cos θ ). (25)

We note, however, that this unitary matrix is not the most general 2 × 2 unitary transformation matrix, but rather an element of the coset space U(2)/U(1)×U(1). This ensures a one-to-one correspondence H → (λ1, λ2, U) after ordering of the eigenvalues, which are given by

λ_{1,2} = (1/2)(H11 + H22) ± (1/2)√( (H11 − H22)² + 4 H12 H*12 ). (26)

This expression is essentially the same as Eq. (21), the only difference being H12² → |H12|². In complete analogy with Eq. (20) we write

H = U† H_D U (27)

to represent the matrix elements of H in terms of the two eigenvalues λ_{1,2} and the two angles θ, φ:

H11 = λ1 cos²θ + λ2 sin²θ
H22 = λ1 sin²θ + λ2 cos²θ
H12 = (λ1 − λ2)[ cos θ sin θ cos φ − i cos θ sin θ sin φ ] = (λ1 − λ2) cos θ sin θ e^{−iφ}, (28)

from which the Jacobian is

J = det[ ∂(H11, H22, Hᴿ12, Hᴵ12)/∂(λ1, λ2, θ, φ) ] = (λ1 − λ2)² cos θ sin θ. (29)

Again Tr[H²] = λ1² + λ2², so Eq. (19) can be written for the GUE, in terms of the eigenvalues and the two angles, as

P(λ1, λ2, θ, φ) = C |λ1 − λ2|² exp( −(Nβ/(4σ²))(λ1² + λ2²) ) cos θ sin θ; (30)

integrating this over the angles θ and φ, we get

P(λ1, λ2) = C |λ1 − λ2|² exp( −(Nβ/(4σ²))(λ1² + λ2²) ). (31)

Although the constant C is modified by the integration, we keep calling it C to avoid unnecessary complication. We now move on to the GSE case, in which the smallest Hamiltonian matrix H is 4 × 4, but the procedure for finding the joint probability density of the eigenvalues is completely analogous to the preceding GOE and GUE cases. The most convenient way to write H is in quaternion notation, reducing the 2N × 2N matrix to an N × N one by representing each H_ij by a 2 × 2 block, a superposition of the unit matrix and the quaternion matrices. The degenerate eigenvalues of a quaternion-real H are then

λ_{1,2} = (1/2)[ (H0)11 + (H0)22 ] ± (1/2)[ ( (H0)11 − (H0)22 )² + 4 Σ_{μ=0}^{3} ((Hμ)12)² ]^{1/2}; (32)

each of these λ's is doubly degenerate, so that we have four eigenvalues, the matrix being 4 × 4. The total number of freely chosen real entries in a 2N × 2N quaternion-real matrix H is N(2N−1); thus we have six real entries in this case. To diagonalize H, a convenient (though not the most general) choice is the 2 × 2 symplectic matrix

S = ( q̄ cos θ   sin θ ; −sin θ   q cos θ ), (33)

where the unit quaternion q together with the angle θ supplies the four real parameters of S. As before, using the notation H = S† H_D S, we write the elements of H as functions of the eigenvalues and these parameters:

(H0)11 = λ1 cos²θ + λ2 sin²θ
(H0)22 = λ1 sin²θ + λ2 cos²θ, (34)

(Hμ)12 = (λ1 − λ2) cos θ sin θ q_μ,   μ = 0, …, 3, (35)

where q_μ are the components of the unit quaternion q. The Jacobian of this transformation is

J = det[ ∂((H0)11, (H0)22, (Hμ)12)/∂(λ1, λ2, θ, …) ] ∝ (λ1 − λ2)⁴, (36)

where the coefficient of proportionality is not specified since it is independent of the eigenvalues and only contributes to the constant C after the integration over all four angular parameters. Thus, the joint probability distribution of the eigenvalues for the smallest GSE is

P(λ1, λ2) = C |λ1 − λ2|⁴ exp( −(Nβ/(4σ²))(λ1² + λ2²) ). (37)
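The increasing powers of the spacing in Eqs. (24), (31) and (37), |λ1 − λ2|, |λ1 − λ2|² and |λ1 − λ2|⁴, are the level-repulsion signature of the three ensembles, and they are easy to observe numerically. A minimal sketch (NumPy assumed; function names are ours) samples many 2 × 2 GOE and GUE matrices and counts very small normalized spacings: linear repulsion leaves far more near-degeneracies than quadratic repulsion.

```python
import numpy as np

rng = np.random.default_rng(2)

def spacings(beta, n):
    """Normalized spacings of n random 2x2 GOE (beta=1) or GUE (beta=2) matrices."""
    if beta == 1:
        a = rng.standard_normal((n, 2, 2))
    else:
        a = rng.standard_normal((n, 2, 2)) + 1j * rng.standard_normal((n, 2, 2))
    h = (a + np.conj(np.swapaxes(a, 1, 2))) / 2   # Hermitian batch
    ev = np.linalg.eigvalsh(h)                    # ascending eigenvalues
    s = ev[:, 1] - ev[:, 0]
    return s / s.mean()

n = 200_000
frac_goe = np.mean(spacings(1, n) < 0.1)
frac_gue = np.mean(spacings(2, n) < 0.1)
# P(s < eps) ~ eps^2 for beta=1 but ~ eps^3 for beta=2.
print(frac_goe > 3 * frac_gue)  # True
```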
From the general form of Eqs. (24), (31) and (37), we can infer the generic joint probability distribution of the eigenvalues for all three Gaussian ensembles,

P_{Nβ}(λ1, …, λN) = C_{Nβ} ∏_{i<j} |λj − λi|^β exp( −(Nβ/(4σ²)) Σ_{k=1}^{N} λk² ), (38)

in which C_{Nβ} is an ensemble-dependent constant, chosen such that P_{Nβ} is normalized to unity.

Density of States: Semicircle Law

Having at our disposal the generalized joint probability density of the eigenvalues, we can now derive the famous semicircle law for the mean eigenvalue density (see Fig. 1). The essential step is integrating Eq. (38),

∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} P_{Nβ}(λ, λ2, …, λN) dλ2 ⋯ dλN, (39)

over all variables except one, to determine the ensemble-averaged eigenvalue (or level) density ρ(λ); the semicircle law arises asymptotically as N → ∞. This integration is not at all trivial; it was first carried out by Mehta and Gaudin [39] using some remarkable techniques. There are many ways to derive the semicircle law, but the one that is perhaps the most elegant and up-to-date is the representation of determinants by Gaussian integrals over Grassmannian variables; we will use this technique in the following derivation.
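Before going through the formal derivation, the semicircle law itself can be illustrated by direct diagonalization. The sketch below (NumPy assumed) pools the spectra of several GOE matrices, scaled so that the limiting support is [−2, 2], and compares the empirical density with ρ(λ) = √(4 − λ²)/(2π):

```python
import numpy as np

rng = np.random.default_rng(3)

def goe(n):
    """GOE matrix scaled so that the spectrum fills [-2, 2] as n grows."""
    a = rng.standard_normal((n, n))
    return (a + a.T) / np.sqrt(2 * n)

# Pool the eigenvalues of several independent matrices.
eigs = np.concatenate([np.linalg.eigvalsh(goe(400)) for _ in range(20)])

# Empirical density vs. the semicircle rho(x) = sqrt(4 - x^2) / (2 pi).
hist, edges = np.histogram(eigs, bins=20, range=(-2, 2), density=True)
centers = (edges[:-1] + edges[1:]) / 2
rho = np.sqrt(np.clip(4 - centers**2, 0.0, None)) / (2 * np.pi)
max_err = np.max(np.abs(hist - rho))
print(max_err)  # typically a few percent, dominated by the band edges
```

The agreement improves as n grows; the largest deviations sit at the spectrum edges, where finite-n corrections are strongest.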
Random Matrix Theory, Figure 1 Averaged density of states (DOS) for the GOE (with a suitable scaling, all three ensembles have the same form of DOS distribution). The solid line is the theoretical result superimposed on numerical data (circles)

General Formalism

Consider an N × N Hermitian matrix H with ordered eigenvalues λ_k (k = 1, …, N). The normalized spectrum of H can be expressed as a sum of delta functions, which simply picks out the positions of the eigenvalues:

ρ(λ) = (1/N) Σ_{k=1}^{N} δ(λ − λ_k). (40)

Representing the delta function as a Lorentzian of vanishing width,

δ(λ − λ_k) = lim_{ε→0} (1/π) ε/( (λ − λ_k)² + ε² ), (41)

we obtain

ρ(λ) = −(1/(πN)) Im Tr[ 1/(λI − H) ], (42)

where I is the identity matrix and λ is assumed to have a small imaginary part, which we do not write explicitly in order to keep the expressions less cumbersome. To proceed with the supersymmetric calculation we need to represent this last relation in terms of a ratio of determinants, for which we introduce the generating function

Z(λ, λ') = det(λ'I − H)/det(λI − H), (43)

with λ' = λ + ix/(2N) and λ → λ − ix/(2N). Differentiating the generating function with respect to x,

(∂/∂x) Z(λ, λ')|_{x→0} = (i/N) Tr[ 1/(λI − H) ], (44)

and when x → 0 we thus recover, via Eq. (42), the expression for the density of the eigenvalues defined previously. Our starting point will therefore be

det(λ'I − H)/det(λI − H). (45)

We are interested in averaging the expression in Eq. (45) over all realizations of H, for which we use the "supersymmetrization" method. In this method the denominator of the expression is represented by a Gaussian integral over a complex N-dimensional vector S = (S1, …, SN)ᵀ, where T stands for the vector transpose and the S_i are commuting variables:

(2π)^N det⁻¹(λI − H) = (1/iᴺ) ∫ dS† dS exp[ (i/2)(λ S†S − S†HS) ], (46)

with the assumption that λ has a small positive imaginary part, i.e. we choose λ → λ + ix/N so that the integral converges. For the determinant in the numerator, we use a Gaussian integral over anticommuting (Grassmannian) N-component vectors χ, χ̄:

det(λ'I − H) ∝ ∫ dχ̄ dχ exp[ (i/2)(λ' χ̄χ − χ̄Hχ) ] (47)

(the constant prefactors cancel in the ratio (45), since Z(λ, λ) = 1). Substituting the latter relation together with the former, we write the expression in Eq. (45) as

⟨Z(λ, λ')⟩_H = ∫ d²χ ∫ d²S e^{(i/2)(λ' χ̄χ + λ S†S)} ⟨ e^{−(i/2)(S†HS + χ̄Hχ)} ⟩_H. (48)

The commuting and the anticommuting variables appear symmetrically in this expression, which in fact gives the name "supersymmetry" to the method. To perform the average over the GUE matrix H we decompose the elements of H into their real and imaginary parts, H_kl = Hᴿ_kl + i Hᴵ_kl, and write

⟨ e^{−(i/2) Σ_{kl} H_kl A_kl} ⟩_H = ⟨ e^{−(i/2) Σ_k A_kk H_kk} ⟩ ⟨ e^{−(i/2) Σ_{k>l} (A_kl + A_lk) Hᴿ_kl} ⟩ ⟨ e^{(1/2) Σ_{k>l} (A_kl − A_lk) Hᴵ_kl} ⟩, (49)

where A_kl = S_k S̄_l + χ_k χ̄_l. The explicit meaning of the angle brackets is the average over the joint probability density for the GUE; here we take

P(H) dH = C_N e^{−(N/2) Tr[H²]} dH, (50)
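Eq. (42) itself can be checked by brute-force linear algebra, without any supersymmetry: for a small finite ε, the imaginary part of the trace of the resolvent gives a Lorentzian-smoothed density of states. A small sketch (NumPy assumed; the scaling puts the spectrum on [−2, 2]):

```python
import numpy as np

rng = np.random.default_rng(4)

n = 600
a = rng.standard_normal((n, n))
h = (a + a.T) / np.sqrt(2 * n)   # GOE scaled so the spectrum is approx. [-2, 2]

def dos(lam, eps=0.05):
    """Smoothed density of states via the resolvent, Eq. (42):
    rho(lam) = -(1/(pi*N)) Im Tr (lam + i*eps - H)^(-1)."""
    g = np.linalg.inv((lam + 1j * eps) * np.eye(n) - h)
    return -np.trace(g).imag / (np.pi * n)

# Inside the band this tracks the semicircle (rho(0) = 1/pi, about 0.318);
# outside the band it is close to zero.
print(dos(0.0), dos(3.0))
```

The finite imaginary part ε plays exactly the role of the vanishing Lorentzian width in Eq. (41).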
where C_N is a constant normalization factor and the probability distribution has zero mean and unit variance. By means of this probability distribution we get

⟨ e^{−(i/2) A_kk H_kk} ⟩ = exp[ −A_kk²/(8N) ],
⟨ e^{−(i/2)(A_kl + A_lk) Hᴿ_kl} ⟩ = exp[ −(A_kl + A_lk)²/(16N) ], (51)
⟨ e^{(1/2)(A_kl − A_lk) Hᴵ_kl} ⟩ = exp[ (A_kl − A_lk)²/(16N) ].

Substitution of these equations into (49) yields

⟨ e^{−(i/2) Σ_{kl} H_kl A_kl} ⟩_H = e^{−(1/(8N)) Σ_{kl} A_kl A_lk}. (52)

It is easy to show that Σ_{kl} H_kl A_lk = Tr[HA] and Σ_{kl} A_kl A_lk = Tr[A²], so that the expression in Eq. (52) is just the identity

⟨ e^{−(i/2) Tr[HA]} ⟩_GUE = e^{−(1/(8N)) Tr[A²]}. (53)

The exponents on the far right of Eq. (48) satisfy χ̄Hχ = Tr[H χ⊗χ̄] and S†HS = Tr[H S⊗S†]. These relations imply that the exponent on the right-hand side of the identity is

Tr[A²] = Tr[ S⊗S† + χ⊗χ̄ ]². (54)

Expanding the latter we get

Tr[A²] = (S†S)² − 2(S†χ)(χ̄S) − (χ̄χ)². (55)

The quartic term on the far right of Eq. (55) needs to be "decoupled" before we can carry on with the integration; we can do so by using the usual Hubbard–Stratonovich transformation

e^{(χ̄χ)²/(8N)} = ∫_{−∞}^{∞} (dq/√(2π)) e^{−q²/2} e^{q χ̄χ/(2√N)}. (56)

Thus, rewriting Eq. (48) with these new relations, we have

⟨Z⟩_H = ∫ d²S exp[ −(S†S)²/(8N) + (i/2)λ S†S ] ∫_{−∞}^{∞} (dq/√(2π)) e^{−q²/2} ∫ dχ̄ dχ exp{ χ̄ [ (1/2)(iλ' + q/√N) I − S⊗S†/(4N) ] χ }. (57)

The last integral over the Grassmannian variables is the familiar Gaussian integral of Eq. (47), which leads to

⟨Z⟩_H ∝ ∫ d²S e^{−(S†S)²/(8N) + (i/2)λ S†S} ∫_{−∞}^{∞} (dq/√(2π)) e^{−q²/2} det[ (1/2)(iλ' + q/√N) I − S⊗S†/(4N) ]. (58)

Further manipulation can be made by introducing the variable q_F = iλ + q/√N and shifting the contour of integration in such a way that the integral over q_F goes along the real axis; thus we rewrite the above expression as

⟨Z⟩_H ∝ ∫_{−∞}^{∞} dq_F √(N/(2π)) e^{−(N/2)(q_F − iλ)²} ∫ d²S e^{−(S†S)²/(8N) + (i/2)λ S†S} det[ q_F I − S⊗S†/(2N) ], (59)

where we have shifted the contour so that q_F ∈ (−∞, ∞) runs along the real axis. Further simplification can be made by noticing that the N × N matrix S⊗S† is of rank one: it has N−1 zero eigenvalues and a single nonzero eigenvalue equal to S†S. The determinant in the previous expression is therefore

det[ q_F I − S⊗S†/(2N) ] = q_F^{N−1} ( q_F − S†S/(2N) ). (60)

Our next step is to introduce polar coordinates, S = rn with n†n = 1 and d²S = r^{2N−1} dr dn, where ∫ dn = Ω_{2N} corresponds to the area of a 2N-dimensional unit sphere, which only produces a constant factor. Further introducing p = r², then rescaling p → 2Np, and following the obvious manipulations, we get

⟨ det(λ'I − H)/det(λI − H) ⟩_H = C_N e^{Nλ²/2} ∫_{−∞}^{∞} (dq_F/√(2π)) exp{ −(N/2)[ q_F² − 2iλ q_F − 2 ln q_F ] } ∫_0^∞ (dp/p) (q_F − p) exp{ −(N/2)[ p² − 2iλ p − 2 ln p ] }. (61)
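The Hubbard–Stratonovich step of Eq. (56) is, at its core, the elementary Gaussian identity e^{b²/2} = (2π)^{−1/2} ∫ dq e^{−q²/2 + qb}, which trades a squared term in the exponent for an average over an auxiliary variable. For ordinary (commuting) numbers, unlike the Grassmann case in the text, it can be verified numerically; a small sketch (NumPy assumed):

```python
import numpy as np

# Scalar core of the Hubbard-Stratonovich identity used in Eq. (56):
#   exp(b**2 / 2) = (2*pi)**(-1/2) * Integral dq exp(-q**2/2 + q*b).
# The fine Riemann sum below approximates the q-integral.
q = np.linspace(-30.0, 30.0, 200_001)
dq = q[1] - q[0]

for b in (0.0, 0.7, 1.5):
    lhs = np.exp(b**2 / 2)
    rhs = np.exp(-q**2 / 2 + q * b).sum() * dq / np.sqrt(2 * np.pi)
    assert abs(lhs - rhs) < 1e-8
print("Hubbard-Stratonovich identity verified")
```

In the supersymmetric calculation the same trick is applied with b replaced by the Grassmann bilinear χ̄χ, where the algebra, not numerics, does the work.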
In this expression C_N stands for the accumulated constant factors, chosen such that for λ' = λ the right-hand side equals unity identically. Up to now all expressions were valid for finite-size matrices, but for N → ∞, with appropriate scaling, we expect the results to be universal. This means that in the asymptotic limit the results are broadly insensitive to the details of the distribution of the random matrices, and can therefore be applied to complex or quantum-chaotic systems. Moreover, in this limit the integrals in Eq. (61) can be evaluated by the saddle-point method (Laplace's method). To use the latter, we write the terms in the exponents of Eq. (61) as

L(q_F) = q_F² − 2iλ q_F − 2 ln q_F   and   L(p) = p² − 2iλ p − 2 ln p. (62)

The first derivatives of these equations yield the saddle points q_F^{s.p.} = (iλ ± √(4 − λ²))/2 and p^{s.p.} = (iλ + √(4 − λ²))/2 (since Re(p) ≥ 0), for λ² ≤ 4, the so-called bulk of the spectrum. The imaginary part of λ, here denoted ix/N, does not contribute when N → ∞; also, for consistency, we should put λ' = λ. Because of the presence of the factor (q_F − p) in the integrand, it is easy to see that only the choice q_F^{s.p.} = (iλ − √(4 − λ²))/2 yields the leading-order contribution. Substituting this choice into the integrand in Eq. (61) and evaluating the Gaussian fluctuations around the saddle-point values finally yields

⟨ det(λI − H)/det[(λ + ix/N)I − H] ⟩_H |_{x>0}  →  exp{ −(x/2)[ iλ + √(4 − λ²) ] }   (N → ∞), (63)

where we have re-substituted λ → λ + ix/N. To recover the averaged density of states ⟨ρ(λ)⟩ from the expression above, we simply need to differentiate both sides with respect to x and let x → 0, for which we write

(∂/∂x) ⟨ det(λI − H) exp{ −Tr ln[(λ + ix/N)I − H] } ⟩ = (∂/∂x) exp{ −(x/2)[ iλ + √(4 − λ²) ] }. (64)

Performing the differentiation and letting x → 0, we get

−i ⟨ (1/N) Tr[ 1/(λI − H) ] ⟩_H |_{x→0}  →  −(i/2)λ − (1/2)√(4 − λ²)   (N → ∞). (65)

Comparing this last relation with Eq. (42), we see that the averaged density of states is given by

⟨ρ(λ)⟩ = (1/(2π)) √(4 − λ²). (66)

It is easy to repeat all the calculations for x < 0 and to find that, for any real value of x, the result in Eq. (63) can be written as

⟨ det(λI − H)/det[(λ − ix/N)I − H] ⟩_H |_{x<0} = exp{ −(|x|/2)[ iλ + √(4 − λ²) ] }. (67)

With the density normalized to N rather than to unity, and in the original units of Eq. (16), the semicircle takes the form

⟨ρ(λ)⟩ = (1/(2πσ²)) √(4Nσ² − λ²),   λ² ≤ 4Nσ²,

where ⟨ρ(λ)⟩ is normalized to N and σ² is the variance of the matrix elements. The three Wigner laws (referred to as the Wigner surmise) for the nearest-neighbor spacing distributions of 2 × 2 matrices are

P(s) = (π/2) s e^{−πs²/4}   (β = 1, GOE)
P(s) = (32/π²) s² e^{−4s²/π}   (β = 2, GUE)
P(s) = (2¹⁸/(3⁶π³)) s⁴ e^{−64s²/(9π)}   (β = 4, GSE).

They are also found to be very good approximations to the exact results derived in the asymptotic limit N → ∞.

Future Directions

The use of random matrices for sub-atomic interactions has led to many great mathematical theorems and has shed new light on the understanding of nuclear spectra. Other branches of the physical and mathematical sciences have benefited equally from the resulting theorems, yet the biological sciences have not had the same share, despite an important link between random matrices and the field of biological complexity established in the 1970s by Robert May. With recent technological advances, massive
amounts of DNA data sets are being produced; however, the understanding of protein–protein and gene interactions still lags behind. Studies of structured sparse random matrices in the context of DNA data sets may produce fruitful results. Another area where random matrix theory has not yet been fully exploited is the interpretation of the spectra of the adjacency matrices of various graphs, and of the spectra of the Laplacians of these graphs. There is no doubt that random matrix theory is an effective tool for studying a variety of complex systems; one would also hope to see its use in the theory of machine learning and in climate modeling.

Bibliography

Primary Literature

1. Almond DP, Bowen CR (2004) Anomalous power law dispersions in AC conductivity and permittivity shown to be characteristics of microstructural electrical networks. Phys Rev Lett 92:1576011–1576014
2. Almond DP, Vainas B (1999) The dielectric properties of random R-C networks as an explanation of the universal power law dielectric response of solids. J Phys: Condensed Matter 11:9081–9093
3. Bak P, Sneppen K (1993) Punctuated equilibrium and criticality in a simple model of evolution. Phys Rev Lett 71:4083
4. Barabási A-L, Albert R (1999) Emergence of scaling in random networks. Science 286:509
5. Bohigas O, Giannoni M-J, Schmit C (1984) Characterization of chaotic quantum spectra and universality of level fluctuation laws. Phys Rev Lett 52:1
6. Clerc P, Giraud G, Laugier JM, Luck JM (1990) The electrical conductivity of binary disordered systems, percolation clusters, fractals and related models. Adv Phys 39:191–309
7. Dorogovtsev SN, Mendes JFF, Samukhin AN (2000) Structure of growing networks: Exact solution of the Barabási–Albert model. Phys Rev Lett 85:4633
8. Dykhne AM (1970) Anomalous resistance of a plasma in a strong magnetic field. Zh Eksp Teor Fiz 59:641–647
9. Dyson FJ (1962) A Brownian-motion model for the eigenvalues of a random matrix. J Math Phys 3:1191
10. Dyson FJ (1962) Statistical theory of the energy levels of complex systems, I. J Math Phys 3:140
11. Dyson FJ (1962) Statistical theory of the energy levels of complex systems, II. J Math Phys 3:157
12. Dyson FJ (1962) Statistical theory of the energy levels of complex systems, III. J Math Phys 3:166
13. Efetov KB (1983) Supersymmetry and theory of disordered metals. Adv Phys 32:53
14. Eigen M (1971) Selforganisation of sequence space and tensor products of representation spaces. Naturwissenschaften 58:465–523
15. Erdős P, Rényi A (1959) On random graphs I. Publ Math 6:290
16. Ergün G (2002) Human sexual contact network as a bipartite graph. Physica A 308:483
17. Ergün G, Semicircle to triangular distribution of density of states: Supersymmetric approach (submitted)
18. Ergün G, Separation of a large eigenvalue from the bulk of the spectrum (submitted)
19. Ergün G, Fyodorov YV (2003) Level curvature distribution in a model of two uncoupled chaotic subsystems. Phys Rev E 68:046124
20. Ergün G, Rodgers GJ (2002) Growing random networks with fitness. Physica A 303:261
21. Ergün G, Zheng D (2003) Coupled growing networks. Adv Comp Syst 6:4
22. Farkas IJ, Derényi I, Barabási A-L, Vicsek T (2001) Spectra of real-world graphs: Beyond the semi-circle law. Phys Rev E 64:026704
23. Farkas IJ, Derényi I, Jeong H, Neda Z, Oltvai ZN, Ravasz E, Schubert A, Barabási A-L, Vicsek T (2002) Networks in life: Scaling properties and eigenvalue spectra. Physica A 314:25
24. Fyodorov YV, Chubykalo OA, Izrailev FM, Casati G (1996) Wigner random banded matrices with sparse structure: Local spectral density of states. Phys Rev Lett 76:1603
25. Gardner MR, Ashby WR (1970) Connectance of large dynamic systems: Critical values for stability. Nature 228:784
26. Gaudin M (1961) Sur la loi limite de l'espacement des valeurs propres d'une matrice aléatoire. Nucl Phys 25:447
27. Gaudin M, Mehta ML (1960) On the density of eigenvalues of a random matrix. Nucl Phys 18:420
28. Goh K-I, Kahng B, Kim D (2001) Spectra and eigenvectors of scale-free networks. Phys Rev E 64:051903
29. Gurevich II, Pevsner MI (1957) Repulsion of nuclear levels. Nucl Phys 2:575
30. Haake F (2000) Quantum Signatures of Chaos, 2nd edn. Springer, Berlin
31. Jeong H, Mason SP, Barabási A-L, Oltvai ZN (2001) Lethality and centrality in protein networks. Nature 411:41–42
32. Jones BL, Enns RH, Rangnekar SS (1976) On the theory of selection of coupled macromolecular systems. Bull Math Biol 38:15–28
33. Krapivsky PL, Redner S, Leyvraz F (2000) Connectivity of growing random networks. Phys Rev Lett 85:4629
34. Luck JM, Jonckheere T (1998) Dielectric resonances of binary random networks. J Phys A: Math Gen 31:3687–3717
35. Luo F, Zhong J, Yang Y, Scheuermann RH, Zhou J (2006) Application of random matrix theory to biological networks. Phys Lett A 357:420–423
36. May RM (1972) Will a large complex system be stable? Nature 238:413
37. McCaskill JS (1984) Localisation threshold for macromolecular quasispecies from continuously distributed replication rates. J Chem Phys 80:5194–5202
38. Mehta ML (1991) Random Matrices, 2nd edn. Academic, San Diego
39. Mehta ML, Gaudin M (1960) On the density of eigenvalues of a random matrix. Nucl Phys 18:420
40. Mirlin AD (2000) Statistics of energy levels and eigenfunctions in disordered systems. Phys Rep 326:259
41. Mirlin AD, Fyodorov YV (1991) Universality of level correlation function of sparse random matrices. J Phys A 24:2273–2286
42. Newman MEJ, Strogatz SH, Watts DJ (2001) Random graphs with arbitrary degree distributions and their applications. Phys Rev E 64:026118
43. Pellegrini M, Haynor D, Johnson JM (2004) Protein interaction networks. Expert Rev Proteomics 1:239–249
44. Porter CE, Rosenzweig N (1960) Statistical properties of atomic and nuclear spectra. Suomalaisen Tiedeakatemian Toimituksia (Ann Acad Sci Fennicae) AVI Phys 44:166
45. Porter CE, Thomas RG (1956) Fluctuations of nuclear reaction widths. Phys Rev 104:483
46. Potters M, Bouchaud J-P, Laloux L (2005) Financial applications of random matrix theory: Old laces and new pieces. physics/0512090
47. Rodgers GJ, Bray AJ (1988) Density of states of a sparse random matrix. Phys Rev B 37:3557
48. Rumschitzki DS (1987) Spectral properties of Eigen evolution matrices. J Math Biol 24:667–680
50. Stäring J, Mehlig B, Fyodorov YV, Luck JM (2003) On the random symmetric matrices with a constraint: The spectral density of random impedance networks. Phys Rev E 67:047101
51. von Neumann J, Wigner E (1929) Phys Z 30:467
52. Watts DJ, Strogatz SH (1998) Collective dynamics of "small-world" networks. Nature 393:440
53. Wigner EP (1951) On the statistical distribution of the widths and spacing of nuclear resonance levels. Proc Cambridge Phil Soc 47:790
54. Wigner EP (1955) Characteristic vectors of bordered matrices with infinite dimensions. Ann Math 62:548
55. Wigner EP (1957) Distribution of neutron resonance levels. In: International Conference on the Neutron Interactions with the Nucleus, Columbia University, New York, 9–13 September 1957. Columbia Univ Rept CU-175 TID-8547
56. Wigner EP (1957) Results and theory of resonance absorption (Conference on Neutron Physics by Time-of-Flight, Gatlinburg, Tennessee, November 1–2, 1956). Oak Ridge Natl Lab Rept ORNL-2309:59
57. Wigner EP (1957) Statistical properties of real symmetric matrices with many dimensions. In: Can Math Congr Proc, p 174. Univ of Toronto Press, Toronto
58. Wigner EP (1958) On the distribution of the roots of certain symmetric matrices. Ann Math 67:2
59. Wishart J (1928) The generalised product moment distribution in samples from a normal multivariate population. Biometrika 20A:32
Books and Reviews

Abramowitz M, Stegun I (1972) Handbook of Mathematical Functions, 10th edn. Dover, New York
Albert R, Barabási A-L (2002) Statistical mechanics of complex networks. Rev Mod Phys 74:47
Bak P (1996) How Nature Works: The science of self-organised criticality. Copernicus, New York
Barabási A-L (2002) Linked: How everything is connected to everything else and what it means. Perseus, Cambridge
Bellman R (1997) Introduction to Matrix Analysis, 2nd edn. SIAM, Philadelphia
Berezin FA (1987) Introduction to Superanalysis. Reidel, Dordrecht
Bohigas O (1991) Random matrix theories and chaotic dynamics. In: Chaos and Quantum Physics, Les Houches, Session LII, 1989
Bollobás B (2001) Random Graphs, 2nd edn. Cambridge Univ Press, Cambridge
Brody TA, Flores J, French JB, Mello PA, Pandey A, Wong SSM (1981) Random-matrix physics: spectrum and strength fluctuations. Rev Mod Phys 53:385
Cvetković D, Doob M, Sachs H (1995) Spectra of Graphs: Theory and Applications. Johann Ambrosius Barth, Heidelberg
Cvetković D, Rowlinson P, Simić S (1997) Eigenspaces of Graphs. Cambridge Univ Press, Cambridge
Dorogovtsev SN, Mendes JFF (2002) Evolution of networks. Adv Phys 51:1079
Efetov KB (1997) Supersymmetry in Disorder and Chaos. Cambridge Univ Press, Cambridge
Faloutsos M, Faloutsos P, Faloutsos C (1999) On power-law relationships of the internet topology. Proc ACM SIGCOMM Comp Comm Rev 29:251
Fyodorov YV (1995) Basic features of Efetov's supersymmetry approach. In: Mesoscopic Quantum Physics, Les Houches, Session LXI, 1994
Guhr T, Müller-Groeling A, Weidenmüller HA (1998) Random matrix theories in quantum physics: common concepts. Phys Rep 299:189
Hinch EJ (1991) Perturbation Methods. Cambridge Univ Press, Cambridge
Lieb EH, Mattis DC (1966) Mathematical Physics in One Dimension. Academic, New York
Markowitz H (1959) Portfolio Selection: Efficient Diversification of Investments. Wiley, New York
Merzbacher E (1970) Quantum Mechanics, 2nd edn. Wiley, London
Porter CE (1965) Statistical Theories of Spectra: Fluctuations. Academic, New York
Stöckmann H-J (2000) Quantum Chaos: An Introduction. Cambridge Univ Press, Cambridge
Watts DJ (1999) Small Worlds. Princeton Univ Press, Princeton
Wilks S (1972) Mathematical Statistics. Wiley, Japan
Wilson RJ (1996) Graph Theory. Longman, Edinburgh
Random Walks in Random Environment
OFER ZEITOUNI
Department of Mathematics, University of Minnesota, Minneapolis, USA

Article Outline
Glossary
Definition of the Subject
Introduction
One Dimensional RWRE
Multi Dimensional RWRE – non Perturbative Regime
Multi Dimensional RWRE – the Perturbative Regime
Diffusions in Random Environments
Topics Left Out and Future Directions
Bibliography

Glossary

Aging A stochastic process $X_t$ exhibits aging if, for some function $f(x,y)$, the limit of $E f(X_t, X_{t+s_t})$ as $t\to\infty$ exists and is not trivial for some function $s_t \to \infty$ as $t\to\infty$.

Annealed law Average of the quenched law, over all possible environments.

Brownian motion A process $\{X_t\}$ with continuous sample paths, $X_0 = 0$, with increments over non-overlapping intervals independent and normally distributed with zero mean and variance equal to the length of the interval.

Convex A function $f: X \to \mathbb{R}$, where $X$ is a linear space, is convex if for $\alpha \in [0,1]$ and $x, y \in X$, $f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha) f(y)$.

Ergodic law If $P$ is a law on a collection of random variables $\{X_z\}_{z\in\mathbb{Z}^d}$, then it is ergodic if for any event $A$ that is invariant under a transformation $X_z \mapsto X_{z+e}$, with $e \in \mathbb{Z}^d$, $|e|=1$, it holds that $P(A) \in \{0,1\}$.

Environment Collection of transition probabilities indexed by the sites of the lattice $\mathbb{Z}^d$, that is, a collection of vectors belonging to the $2d$-simplex.

i.i.d. Independent, identically distributed.

Large deviations principle (LDP) A sequence of probability measures $P_n$ on a common (topological) space $X$ satisfies the large deviations principle if, for some nonnegative function $I: X \to \mathbb{R}_+$ with closed level sets $\{x \in X : I(x) \le a\}$, for any $A \subset X$,
$$-\inf_{x \in A^o} I(x) \le \liminf_{n\to\infty} n^{-1}\log P_n(A) \le \limsup_{n\to\infty} n^{-1}\log P_n(A) \le -\inf_{x \in \bar{A}} I(x)\,,$$
where $\bar{A}$ is the closure of $A$ and $A^o$ is its interior.

Law Probability distribution. When dealing with a stochastic process $\{X_t\}$, this is the collection of distribution functions $P(X_{t_1} \le x_1, \ldots, X_{t_k} \le x_k)$.

Law of large numbers (LLN) A sequence $X_n$ satisfies the LLN if $X_n/n$ converges to a deterministic limit.

Lipschitz norm Given a function $f$ on a normed space $X$ with norm $\|\cdot\|$, its Lipschitz norm is defined as $\sup_{x \ne y} |f(x)-f(y)|/\|x-y\|$.

Martingale Let $\mathcal{F}_n$ denote a filtration, that is, an increasing sequence of $\sigma$-algebras. A process $\{X_n\}$ is a (discrete time) martingale with respect to $\mathcal{F}_n$ if $E(X_{n+1}\mid\mathcal{F}_n) = X_n$.

Mixing environment An environment whose law satisfies a mixing condition, that is, if $A, B$ are events that depend on a finite number of sites then $P(A \cap T_z B) \to P(A)P(B)$ as $|z|\to\infty$, where $T_z$ is the shift of $B$ by $z \in \mathbb{Z}^d$. For $d=1$, the environment is strongly mixing if the convergence above is uniform in the choice of $A, B$ of the same local dependence.

P-a.s. A property holds $P$-a.s. if the probability of the event that the property does not hold (under the law $P$) vanishes.

Quenched law Law of the random walk, with the environment given.

Variational distance For two probability distributions $\mu, \nu$, the variational distance is defined as $\sup_A |\mu(A) - \nu(A)|$.

2d-Simplex Collection of vectors of length $2d$ with nonnegative entries that sum to 1.

Definition of the Subject

The term random walks in random environment (abbreviated RWRE) refers to a class of Markov chains evolving in $\mathbb{Z}^d$, where the transition mechanism itself is random and forms a stationary random field indexed by $\mathbb{Z}^d$. They first appeared in the literature in the early 1970s, for $d=1$, modeling in a natural way transport properties of a tagged particle in an inhomogeneous medium, as well as biological systems. They can be naturally used to model complex transport phenomena in any dimension. The analysis of the case $d=1$ during the 1970s and early 1980s revealed that the behavior of such random walks can differ drastically from that of ordinary random walks (which correspond to a homogeneous environment). The extension to higher dimensions ($d \ge 2$) was proposed and studied in [51], but it is only in the last decade, starting with the work [102], that rapid progress has been made, although several basic questions remain open.
Introduction

We recall that random walks and their scaling limits, diffusion processes, provide a simple yet powerful description of random processes, and are fundamental in the description of many fields, from biology through economics, engineering, and statistical mechanics. A large body of work has accumulated concerning the properties of such processes, and very detailed information is available. We refer to [61] and [93] for background on random walks and diffusion processes.

In many situations, the medium in which the process evolves is highly irregular. Without further modeling, this results in spatially inhomogeneous Markov processes, and not much can be said. Things are however different if some degree of homogeneity is assumed on the law of the environment. When the underlying state space on which the walk moves with nearest neighbor steps is the lattice $\mathbb{Z}^d$, $d \ge 1$, and the law of the environment is assumed stationary, we call the resulting random walk a random walk in random environment (RWRE). Informally, one starts with an environment, which is a configuration of transition probabilities (one transition probability for each site in $\mathbb{Z}^d$). One then starts a random walk that, when at location $z \in \mathbb{Z}^d$, moves at the next time step according to the transition probability associated with $z$.

In the one dimensional case, this model has been thoroughly analyzed, and it is known that the behavior can differ dramatically from the behavior of ordinary random walk (in particular, the standard central limit theorem can fail). In higher dimensions, it is expected that the RWRE behaves more like ordinary random walk, and much of the available analysis points in this direction; however, gaps in the understanding of the model still remain, and the law of large numbers has not yet been proved in full generality.

A precise formulation of the RWRE model is as follows. Let $\mathcal{S}$ denote the $2d$-dimensional simplex, set $\Omega = \mathcal{S}^{\mathbb{Z}^d}$, and let $\omega(z,\cdot) = \{\omega(z,z+e)\}_{e \in \mathbb{Z}^d, |e|=1}$ denote the coordinate of $\omega \in \Omega$ corresponding to $z \in \mathbb{Z}^d$. $\omega$ is an "environment" for an inhomogeneous nearest neighbor random walk (RWRE) started at $x$, with quenched transition probabilities $P_\omega(X_{n+1} = z+e \mid X_n = z) = \omega(z, z+e)$ ($e \in \mathbb{Z}^d$, $|e|=1$), whose law is denoted $P_\omega^x$. We write $E_\omega^x$ for expectations with respect to the law $P_\omega^x$, and write $P_\omega$ and $E_\omega$ for $P_\omega^0$ and $E_\omega^0$. In the RWRE model, the environment is random, of law $P$, which is always assumed stationary and ergodic. We often assume that the environment is uniformly elliptic, that is, there exists an $\epsilon > 0$ such that $P$-a.s., $\omega(x,x+e) \ge \epsilon$ for all $x, e \in \mathbb{Z}^d$, $|e|=1$. Finally, we denote by $\mathbb{P}$ the annealed law of the RWRE started at 0, that is, the law of $\{X_n\}$ under the measure $P \otimes P_\omega^0$, and we write $\mathbb{E}$ for expectations with respect to $\mathbb{P}$. Note that under the law $P_\omega$, the RWRE is a (space inhomogeneous) Markov process, whereas under $\mathbb{P}$, its law is space homogeneous but not Markovian, since the knowledge of the path of the RWRE up to time $n$ modifies the a-priori law on the environment.

To get an idea of some of the unusual features of the RWRE model, we begin by discussing the one dimensional case. (A further motivation is that the study of certain reinforced random walks can be reduced to that of one dimensional RWRE, a connection first described in [73].) This model is fairly well understood, and we review the results in Sect. "One Dimensional RWRE", emphasizing the points in which there are differences from walks in a homogeneous environment. We then turn in Sect. "Multi Dimensional RWRE – non Perturbative Regime" to the multidimensional case, while Sect. "Multi Dimensional RWRE – the Perturbative Regime" is devoted to the perturbative regime. Sect. "Diffusions in Random Environments" quickly reviews the available results for the related model of (non reversible) diffusions in random environments. In Sect. "Topics Left Out and Future Directions", we collect some information about related models. This article borrows heavily from [101,107] and [108].

One Dimensional RWRE

When $d=1$, we write $\omega_x = \omega(x,x+1)$, $\rho_x = (1-\omega_x)/\omega_x$, and $u = E\log\rho_0$. Because explicit recursions and computations are possible in the one dimensional case, the understanding of the RWRE is rather complete. We summarize the main features below.

Ergodic Behavior, Limit Laws, and Traps

Recall that for a homogeneous environment (that is, when the stationary measure $P$ has a marginal which charges a single value: $\omega_i = \bar\omega$ for all $i$), we have $X_n/n \to v_{\bar\omega} := 2\bar\omega - 1$, and $(X_n - nv_{\bar\omega})/\sqrt{4\bar\omega(1-\bar\omega)n}$ converges in distribution to a standard Gaussian.
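The quenched dynamics just defined are easy to simulate. The following minimal Python sketch (the two-point environment, the window size, and all function names are illustrative assumptions, not taken from the text) draws an i.i.d. environment $\{\omega_z\}$ on a window of sites and runs one trajectory under $P_\omega^0$:

```python
import random

def sample_environment(n_sites, values=(0.6, 0.9), rng=random):
    """Draw an i.i.d. environment: omega[z] = P(step to z+1 | walk at z)."""
    return {z: rng.choice(values) for z in range(-n_sites, n_sites + 1)}

def quenched_walk(omega, n_steps, rng=random):
    """One trajectory under the quenched law P_omega^0, started at 0."""
    x, path = 0, [0]
    for _ in range(n_steps):
        x += 1 if rng.random() < omega[x] else -1  # step right w.p. omega[x]
        path.append(x)
    return path
```

Averaging over fresh environments as well as walks samples the annealed law $\mathbb{P}$; freezing `omega` and resampling only the walk samples the quenched law $P_\omega^0$.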
We describe next the corresponding results for the RWRE model, emphasizing the dramatic difference in behavior. As it turns out, the sign of $u$ determines the direction of escape of the RWRE, while the limiting behavior depends on an explicit function of the law of the environment. The following theorem is essentially due to [92], see also [1,107].

Theorem 1 (Transience, recurrence, limit speed, $d=1$)
(a) If $u < 0$ then $X_n \to_{n\to\infty} \infty$, $\mathbb{P}$-a.s. If $u > 0$ then $X_n \to -\infty$, $\mathbb{P}$-a.s. Finally, if $u = 0$ then the RWRE oscillates, that is, $\mathbb{P}$-a.s.,
$$\limsup_{n\to\infty} X_n = \infty\,, \qquad \liminf_{n\to\infty} X_n = -\infty\,.$$
Further, there is a deterministic $v$ such that
$$\lim_{n\to\infty} \frac{X_n}{n} = v\,, \quad \mathbb{P}\text{-a.s.} \tag{1}$$
If $P$ is a product measure, then
$$v = \begin{cases} (1 - E(\rho_0))/(1 + E(\rho_0))\,, & E(\rho_0) < 1\,,\\ -(1 - E(\rho_0^{-1}))/(1 + E(\rho_0^{-1}))\,, & E(\rho_0^{-1}) < 1\,,\\ 0\,, & \text{else}\,. \end{cases} \tag{2}$$
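Formula (2) can be checked numerically. In the sketch below (a hypothetical two-point environment, uniform on {0.6, 0.9}, chosen so that $E(\rho_0) = 7/18 < 1$; all names are illustrative), the empirical speed of one long quenched trajectory is compared with $v = (1 - E\rho_0)/(1 + E\rho_0) = 11/25$:

```python
import random

def speed_formula(rho_mean):
    """The speed in (2) for a product measure with E(rho_0) < 1."""
    return (1 - rho_mean) / (1 + rho_mean)

def empirical_speed(values, n_steps, seed=0):
    """Monte Carlo estimate of X_n / n in an i.i.d. two-point environment."""
    rng = random.Random(seed)
    # environment on sites -n_steps..n_steps, stored in a shifted list
    omega = [rng.choice(values) for _ in range(2 * n_steps + 1)]
    x = 0
    for _ in range(n_steps):
        x += 1 if rng.random() < omega[x + n_steps] else -1
    return x / n_steps

# omega uniform on {0.6, 0.9}: rho_0 takes values 2/3 and 1/9, E(rho_0) = 7/18
v = speed_formula(7 / 18)   # = 11/25 = 0.44
```

Replacing the values by, e.g., {0.3, 0.8} gives $E\log\rho_0 < 0$ but $E\rho_0 \approx 1.29 > 1$: the walk is then transient to $+\infty$ with zero speed, the situation discussed in Remark 2 below.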
The statement (1), that $X_n/n$ converges to a deterministic limit (under both the quenched and annealed measures), is referred to as a law of large numbers (LLN).

Remark 2 The surprising features of the RWRE model can be appreciated if one notes the following facts, all for a product measure $P$.
(a) Suppose $u < 0$, that is $X_n \to \infty$, $\mathbb{P}$-a.s. By Jensen's inequality, $\log E\rho_0 \ge E\log\rho_0$, but it is quite possible that $E\rho_0 > 1$. Thus, it is possible to construct i.i.d. environments in which the RWRE is transient, but the speed $v$ vanishes.
(b) Suppose $\bar v = 2E\omega_0 - 1$ denotes the speed of a (biased) simple random walk with probability of jump to the right equal, at any site, to $E\omega_0$. It is easy to construct examples with $\bar v > 0$ but $u > 0$, which means that $X_n \to -\infty$ even if the static speed $\bar v$ points to the right. However, by Jensen's inequality, $v < 0$ implies that $\bar v < 0$. Thus, if the static speed $\bar v$ is positive, the RWRE may be transient to the left, but if so, only with zero speed. We come back to this point in Subsubsect. "RWRE with Deterministic Components", where we show that the last property is not necessarily true in high dimension.
(c) Another application of Jensen's inequality reveals that $|v| \le |\bar v|$, with examples of strict inequality readily available. Thus, the effect of the random environment is to force a slowdown with respect to the (averaged, deterministic) environment.

Another aspect in which the RWRE differs from the standard model of random walk is in its fluctuations. Consider product measures $P$ with $u := E(\log\rho_0) < 0$ (i.e., the RWRE is transient to $+\infty$). Set $s = \sup\{r : E(\rho_0^r) < 1\}$ and note that because $u < 0$, necessarily $s \in (0,\infty]$. When $s > 2$, the behavior is similar to standard random walk, and one has a central limit theorem of the following form. For some deterministic strictly positive constant $\sigma$, the random variable $W_n := (X_n - nv)/\sigma\sqrt{n}$ converges, under the annealed law, to a standard Gaussian random variable, that is,
$$\mathbb{P}(W_n > x) \to_{n\to\infty} \frac{1}{\sqrt{2\pi}} \int_x^\infty e^{-y^2/2}\,dy\,;$$
see [46,55,75,107] for this statement and a similar one under the quenched law (with random centering). On the other hand, when $s < 2$, due to the existence of localized pockets of environments ("traps") where the walk spends a large time, one gets stable limit laws. In particular, for $s \in (0,1)$, under mild assumptions on the law of $\rho_0$, the random variable $X_n/n^s$ converges, under the annealed law, to a stable random variable with parameters $(s,b)$, where $b$ is a deterministic constant whose value has been identified in [39]. (A stable law with parameters $(s,b)$ is the distribution of a random variable $S$ with characteristic function
$$E(e^{itS}) = \exp\left(-b|t|^s\left(1 - i\,\frac{t}{|t|}\tan(\pi s/2)\right)\right)\,, \quad t \in \mathbb{R}\,.)$$
We refer to the regime where $s \in (0,2]$, in which the fluctuations are not on the CLT scale [55], as the sub-diffusive regime. It is interesting to note that these statements do not possess a quenched counterpart, and in fact a quenched limit law is not possible for $s < 2$, see [75,76].

Remark 3 When $P$ is a strongly mixing environment, the parameter $s$ has to be defined differently, by means of the large deviations rate function for the variable $n^{-1}\sum_{i=1}^n \log\rho_i$. The Gaussian limit laws apply in such situations when $s > 2$, see [20,75,107]. The stable limit laws are more delicate, and are not known for general ergodic environments with good mixing properties. For a class of Markovian environments, such results are contained in [67].

Limit Laws and Aging, Recurrent RWRE: Sinai's Walk

When $E(\log\rho_0) = 0$, the traps alluded to in the previous section stop being local, and the whole environment becomes a diffused trap. The walk spends most of its time "at the bottom of the trap", and as time evolves it is harder and harder for the RWRE to move. This is the phenomenon of aging, captured in the following theorem:

Theorem 4 There exists a random variable $B_n$, depending on the environment only, such that for any $\eta > 0$,
$$\mathbb{P}\left(\left|\frac{X_n}{(\log n)^2} - B_n\right| > \eta\right) \to_{n\to\infty} 0\,.$$
Further, for $h > 1$,
$$\lim_{\eta\to 0}\lim_{n\to\infty} \mathbb{P}\left(\frac{|X_{n^h} - X_n|}{(\log n)^2} < \eta\right) = \frac{1}{h^2}\left(\frac{5}{3} - \frac{2}{3}\,e^{-(h-1)}\right)\,. \tag{3}$$
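The slow movement asserted in Theorem 4 can be observed qualitatively in simulation. The sketch below uses an illustrative symmetric two-point environment ($\omega$ uniform on {0.25, 0.75}, so that $E\log\rho_0 = 0$ and the walk is recurrent) and tracks how far the walk travels; comparing the range to $(\log n)^2$ and to the diffusive scale $\sqrt{n}$ exhibits the dramatic slowdown. The environment is sampled lazily, which is legitimate since each site's transition probability is drawn exactly once.

```python
import random

def sinai_walk(n_steps, seed=1):
    """Recurrent RWRE: omega uniform on {0.25, 0.75}, so E log rho_0 = 0.
    Returns the endpoint and the maximal distance from 0 reached."""
    rng = random.Random(seed)
    omega = {}                 # environment, sampled lazily site by site
    x, max_range = 0, 0
    for _ in range(n_steps):
        if x not in omega:
            omega[x] = rng.choice((0.25, 0.75))
        x += 1 if rng.random() < omega[x] else -1
        max_range = max(max_range, abs(x))
    return x, max_range

n = 100_000
endpoint, max_range = sinai_walk(n)
# for n = 10^5: (log n)^2 is about 132, while sqrt(n) is about 316
```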
The first part of Theorem 4 is due to Sinai [91], with Kesten [52] providing the evaluation of the limiting law of $B_n$, see also [47]. It is actually not hard to understand the anomalous scaling $(\log n)^2$: indeed, the time for the particle to overcome a stretch of the environment of length $c_1 \log n$ in which the drift points "backwards" is exponential in $c_1 \log n$, i.e., an appropriate $c_1$ can be chosen such that this time is of order $n$. Hence, the range of the RWRE at time $n$ cannot be larger than the distance at which such a stretch occurs. Due to the scaling properties of random walk, this distance is of order $(\log n)^2$. The second part of Theorem 4 is implicit in [47], and also follows from the analysis of the time spent by the RWRE at the "bottom of traps". We refer to [62] for a detailed study of aging in the Sinai model by renormalization techniques, and to [24,30,107] for rigorous proofs that avoid renormalization arguments. For information concerning the time spent by the walk at the most visited site (which can be of order $n$ in the Sinai model), see [28,49,89], and [43] for the transient case.

Tail Estimates and Large Deviations

Another question of interest relates to the probability of seeing atypical behavior of the RWRE. These probabilities turn out to depend on the measure discussed, that is, whether one considers the quenched or annealed measures. Following Varadhan [105], we say that the sequence of random variables $X_n/n$ satisfies the large deviations principle (LDP) with convex, continuous rate function $I$ on a compact set $K$ if, for any measurable subset $A$ of $K$,
$$\lim_{n\to\infty} \frac{1}{n} \log P(X_n/n \in A) = -\inf_{x\in A} I(x)\,.$$
Cramér's theorem (Theorem 2.2.3 in [29]) states that the rescaled random walk $X_n/n$ in a homogeneous environment with $\omega_i = \bar\omega$ for all $i$ satisfies the LDP with a strictly convex rate function $I(x)$ that vanishes only at $v_{\bar\omega} = 2\bar\omega - 1$. The situation is different for the RWRE. The following theorem combines results from [25,32,48].

Theorem 5 For $P$-a.e. realization of the environment $\omega$, the sequence $X_n/n$ satisfies, under $P_\omega^0$, a LDP on $[-1,1]$ with a deterministic convex rate function $I_P(\cdot)$. If $P$ is an i.i.d. measure, then under the annealed measure $\mathbb{P}$, the same sequence satisfies a LDP with convex rate function
$$I(x) = \inf_{Q \in M_1^e} \left(h(Q|P) + I_Q(x)\right)\,, \tag{4}$$
where $h(Q|P)$ is the specific entropy of $Q$ with respect to $P$ and $M_1^e$ denotes the space of stationary ergodic measures on $\Omega$. Always, $I(x) \le I_P(x)$, and both $I$ and $I_P$ may vanish for $x \in [0,v]$, and only for such $x$. In particular, neither $I$ nor $I_P$ need be strictly convex.

The rate function for the RWRE thus differs from the case of homogeneous environments in two important aspects: it may vanish on the whole segment $[0,v]$, indicating subexponential decay for the probability of slowdown, and further, the rate function is in general not strictly convex. When $I(x)$ vanishes for $x \in [0,v]$, it means that the probability of seeing an atypical slowdown of the random walk decays more slowly than exponentially. A precise characterization of these decay rates is available in [33,44,77,78,107]. The specific entropy $h(Q|P)$ (see [29] for its definition) appearing in Theorem 5 measures the rate of decay of the probability that a block of the environment will look as if it came from $Q$, when it was actually generated by $P$. Thus, one may interpret Theorem 5 as follows: to create an annealed large deviation, one may first "modify" the environment (at a certain exponential cost measured by the specific entropy $h$) and then apply the quenched LDP in the new environment.

Multi Dimensional RWRE – non Perturbative Regime

We turn our attention to RWRE on the lattice $\mathbb{Z}^d$ with $d > 1$. Unless explicitly stated otherwise, we consider in the sequel only measures $P$ that are i.i.d. and uniformly elliptic.

Ergodic Properties and a 0–1 Law

A natural starting point for the discussion of ergodic properties of the RWRE $(X_n)$ would have been an analogue of Theorem 1. Unfortunately, obtaining such a statement has been a major challenge since the early 1980s, and it is still open. To explain the challenge, we need to digress and introduce a certain conjectured 0–1 law. Fix $\ell \in S^{d-1}$, i.e., $\ell$ is a unit vector in $\mathbb{R}^d$. Define the events
$$A_\ell^+ = \left\{\lim_{n\to\infty} X_n\cdot\ell = \infty\right\}\,, \qquad A_\ell^- = \left\{\lim_{n\to\infty} X_n\cdot\ell = -\infty\right\}\,.$$
The following proposition is due to Kalikow [51].
Proposition 6 Assume $P$ is i.i.d. and elliptic, i.e., $P(\omega(0,e) > 0) = 1$ for all $e$ with $|e|=1$, and that $\ell \in S^{d-1}$. Then, $\mathbb{P}(A_\ell^+ \cup A_\ell^-) \in \{0,1\}$.

Note that for $d=1$, Theorem 1 implies that $\mathbb{P}(A_\ell^+) \in \{0,1\}$. If one ever hopes to obtain a LLN, then one should be able to prove the following.

Conjecture 7 (Kalikow) Assume $P$ is i.i.d. and uniformly elliptic, and that $\ell \in S^{d-1}$. Then, $\mathbb{P}(A_\ell^+) \in \{0,1\}$.

Efforts to prove Conjecture 7 are ongoing. The following theorem summarizes its status at the current time, and combines results from [19,111].

Theorem 8
(a) Conjecture 7 holds for $d = 1, 2$ and elliptic i.i.d. environments.
(b) There exist ergodic environments that are elliptic (for $d=2$), and even uniformly elliptic and mixing (for $d \ge 3$), for which a deterministic direction $\ell \in S^{d-1}$ exists such that $P_\omega^0(A_\ell^+) \in (0,1)$ for $P$-almost every $\omega$.

As mentioned above, part (a) of Theorem 8 for $d=1$ is a direct consequence of the LLN, Theorem 1. As it turns out, the validity of Conjecture 7 is the only obstruction to a LLN, as demonstrated in the following theorem, which combines results from [4,102,106,110].

Theorem 9 Assume $P$ is i.i.d. and uniformly elliptic.
(a) Fix $\ell \in S^{d-1}$. Then,
$$\lim_{n\to\infty} \frac{X_n\cdot\ell}{n} = v_+ 1_{A_\ell^+} + v_- 1_{A_\ell^-}\,, \quad \mathbb{P}\text{-a.s.} \tag{5}$$
In particular, when $d=2$ the LLN holds true.
(b) $\mathbb{P}$-almost surely, there are at most two possible limit points, denoted $v_1, v_2$, for the sequence $X_n/n$. Further, $v_1, v_2$ are deterministic, and if $v_1 \ne v_2$ then there exists a constant $a \le 0$ such that $v_2 = a v_1$.
(c) When $d \ge 5$, if $v_1 \ne v_2$ then at least one of $v_1$ and $v_2$ equals 0.

Part (a) of the theorem implies that if Conjecture 7 is true, then the LLN holds for $P$ i.i.d. and uniformly elliptic. The proof of Theorem 9, and of many of the other results in this section, uses the machinery of regeneration times, introduced in [102]. Roughly, a random time $k$ is a regeneration time relative to a direction $\ell \in S^{d-1}$ if $X_n\cdot\ell < X_k\cdot\ell$ for all $n < k$ but $X_n\cdot\ell \ge X_k\cdot\ell$ for all $n \ge k$ (i.e., $X_n\cdot\ell$ sets a record at time $k$, and never moves backward from that record). It turns out that the sequence of inter-regeneration times and inter-regeneration distances is an i.i.d. sequence under the annealed measure $\mathbb{P}$, if $P$ is i.i.d. Once such an i.i.d. sequence has been identified, ergodic arguments yield the LLN, and the (annealed) CLT involves studying the tail behavior of the regeneration times. In some ballistic cases, one may translate an annealed CLT to a quenched one, see [7] and [82]. We note in passing that regeneration techniques have also been useful for certain non-i.i.d. environments. We refer the interested reader to [26,27,79]. So far, there is no known criterion that allows one to decide the question of transience or recurrence for RWRE in dimension $d \ge 2$, although one certainly expects transience as soon as $d \ge 3$.
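The definition of regeneration times is easy to state operationally. The helper below (an illustrative sketch; on a finite path, the "never moves backward" requirement can of course only be checked within the observed window) extracts the times $k$ at which the projection $X_n\cdot\ell$ sets a strict record and never drops below it afterwards:

```python
def regeneration_times(proj):
    """Return indices k such that proj[n] < proj[k] for all n < k and
    proj[n] >= proj[k] for all n >= k, where proj[n] stands for X_n . l."""
    n = len(proj)
    suffix_min, m = [0] * n, proj[-1]
    for k in range(n - 1, -1, -1):        # suffix minima in one backward pass
        m = min(m, proj[k])
        suffix_min[k] = m
    times, running_max = [], float("-inf")
    for k in range(n):
        if proj[k] > running_max and suffix_min[k] == proj[k]:
            times.append(k)
        running_max = max(running_max, proj[k])
    return times
```

For the projections 0, 1, 2, 1, 2, 3, 2, 3, 4, 5 the walk regenerates at times 0, 1, 8 and 9. Under the annealed measure (and i.i.d. $P$), the path pieces between successive such times form an i.i.d. sequence, which is what makes the method work.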
Ballistic Behavior and Sznitman's Conditions

Lacking an explicit expression for the speed of the RWRE for $d \ge 2$, a natural goal is to identify a large family of models for which $X_n/n \to v \ne 0$. RWRE's that satisfy such a relation are called ballistic. As we saw in Theorem 1, when $d=1$, $X_n \to \infty$, and the environment is i.i.d., the RWRE is ballistic if and only if $E\rho_0 < 1$. Define $d_0 := \sum_i [\omega(0,e_i) - \omega(0,-e_i)]\,e_i$ as the drift at the origin. If there exists a direction $\ell \in S^{d-1}$ such that $d_0\cdot\ell > 0$ for $P$-a.e. environment, a simple martingale argument shows that $X_n/n \to v$ with $v\cdot\ell > 0$. Following Zerner [109], we call such environments non-nestling. We will be mainly interested in nestling environments, that is, environments in which the origin belongs to the closed convex hull of the support of $d_0$. (The source of the name lies in the fact that when the walk is nestling, it is possible to construct localized regions, called traps, to which the walk returns many times, leading to the mental picture of a bird that keeps returning to a nest. We note that, as we show in Remark 18 below, one should not confuse the condition $Ed_0 \ne 0$ with ballistic behavior, as it does not guarantee a limiting non-zero speed.)

Traps tend to slow down the particle. However, unlike $d=1$, all attempts to build explicit traps that slow down the particle to a sub-diffusive scale quickly fail. One thus suspects that a good control of trapping properties is a key step in the analysis of the RWRE. With this motivation in mind, Sznitman introduced conditions on the environment that eventually lead to a good understanding of the ballistic regime. Fix a direction $\ell \in S^{d-1}$, and for $b > 0$, define the region $U_{\ell,b,L} = \{x \in \mathbb{Z}^d : x\cdot\ell \in (-bL, L)\}$. Let $T_{\ell,b,L} = \min\{n > 0 : X_n \notin U_{\ell,b,L}\}$.

Definition 10 Let $\gamma \in (0,1]$ be given. Then $P$ satisfies condition $T^\gamma$ relative to $\ell$ if for all $\ell'$ in some neighborhood of $\ell$, and all $b > 0$,
$$\limsup_{L\to\infty} \frac{1}{L^\gamma} \log \mathbb{P}\left(X_{T_{\ell',b,L}}\cdot\ell' < 0\right) < 0\,. \tag{6}$$
$P$ satisfies condition $T'$ relative to $\ell$ if it satisfies condition $T^\gamma$ relative to $\ell$ for all $\gamma \in (0,1)$. It satisfies condition $T$ relative to $\ell$ if it satisfies condition $T^1$ relative to $\ell$.

In words, condition $T$ relative to $\ell$ holds if the exit from a slab contained between two hyperplanes perpendicular to $\ell$, located respectively at distance $L$ in the $\ell$ direction and $bL$ in the opposite direction, occurs through the "backward" side with probability that is exponentially small in $L$. Condition $T'$ relaxes the exponential decay to "almost" exponential decay (there is an alternative description of condition $T'$ in terms of regeneration distances, see Proposition 12 below). The power of condition $T'$ is the following.

Theorem 11 (Sznitman) Assume $P$ is i.i.d. and uniformly elliptic, and that condition $T'$ relative to some direction $\ell$ holds. Then, the process $(X_n)$ is ballistic, i.e., $X_n/n \to v \ne 0$ for some deterministic $v$ with $v\cdot\ell > 0$, and there is a deterministic $\sigma^2 > 0$ such that, under the annealed measure $\mathbb{P}$, $(X_n - nv)/\sigma\sqrt{n}$ converges in distribution to a standard Gaussian random variable.

(The convergence in distribution in Theorem 11 actually extends to an invariance principle; the results in [7] yield a quenched CLT as soon as $d \ge 4$.) The key to the usefulness of condition $T'$ is the following result from [99], where $\tau_1$ denotes the first regeneration time.

Proposition 12 Let $d \ge 2$, $\ell \in S^{d-1}$, and $\gamma \in (0,1]$. The following are equivalent:
(a) Condition $T^\gamma$ holds.
(b) $\mathbb{P}(A_\ell^+) = 1$ and, with $X^\star := \sup_{0\le n\le \tau_1} |X_n|$, there exists a $c > 0$ such that
$$\mathbb{E}\exp\left(c (X^\star)^\gamma\right) < \infty\,.$$

The proof is detailed in [99]; see also the exposition in [101]. From part (b) of Proposition 12, tail estimates on $\tau_1$ follow. We omit further details. It can be checked (by a martingale argument) that condition $T$ (and hence $T'$) holds for a certain direction $\ell$ when the environment is non-nestling. The first example of a nestling environment that satisfies condition $T$ was provided by Sznitman, who showed that the class of environments satisfying the Kalikow condition from [51] also satisfies condition $T$ (Kalikow himself had shown that his condition implies the 0–1 law, and it was shown in [102]
to imply ballistic behavior). However, the verification of condition $T'$ seems a priori not obvious. It is thus extraordinary that an effective criterion for checking it exists, see [98]. This criterion is used in [99] to construct an example of a ballistic RWRE that does not satisfy Kalikow's condition but does satisfy $T'$ relative to some $\ell$.

Sznitman's Conjecture

As noted in Proposition 12, condition $T'$ is equivalent to certain exponential moments on the maximal distance from the origin that the RWRE has achieved before time $\tau_1$. In [99], Sznitman actually proves that condition $T^\gamma$ relative to $\ell$ with any $\gamma \in (1/2,1)$ implies condition $T'$ relative to the same $\ell$. This led him to the following conjecture, see [98]:

Conjecture 13 (Sznitman) Assume $P$ is uniformly elliptic and i.i.d. Then, condition $T'$ relative to $\ell$ is implied by condition $T^\gamma$ relative to $\ell$, for any $\gamma \in (0,1)$.

It is also reasonable to expect ("plausible", in the language of [98]) that in addition, ballistic behavior with speed $v$ implies condition $T'$ relative to $\ell = v/|v|$, for $d > 1$. For $d=1$ and i.i.d. environment, all the conditions $T^\gamma$ with respect to the direction $\ell = 1$ are equivalent to $E\log\rho_0 < 0$, see [96]. Hence, Conjecture 13 holds when $d=1$ (note that this is not the case for the conclusions concerning ballistic behavior, which do not hold true for $d=1$). In the ballistic situation, some information on the environment viewed from the point of view of the particle can be deduced. We refer to [12] and [79] for details.

Large Deviations, Quenched and Annealed

In dimension $d=1$, the large deviations for the sequence $X_n/n$ were obtained by considering hitting times. While this approach can be partially extended to obtain quenched LDP's for some RWRE's, see [109], its scope is limited; in particular, it does not apply to all i.i.d. measures $P$, nor to an annealed LDP. A different approach was taken by Varadhan [106], who obtained the following.

Theorem 14 Assume $d \ge 2$.
(a) Assume $P$ is a uniformly elliptic, ergodic measure. Then, for $P$-a.e. environment $\omega$, the sequence of random variables $X_n/n$ under $P_\omega^0$ satisfies the (quenched) LDP (on $[-1,1]^d$) with speed $n$ and deterministic, convex rate function $I$.
(b) Assume further that $P$ is i.i.d. Then, the sequence of random variables $X_n/n$ satisfies, under $\mathbb{P}$, the (annealed) LDP with speed $n$ and convex rate function $\bar I$.
(c) The rate functions $I$ and $\bar I$ possess the same zero set. Further, this (convex) set is either a single point or a segment of a line.

An alternative description of the quenched rate function, more instructive than the sub-additivity argument, has been developed for the related model of diffusions in random environments in [54]. Part (b) of Theorem 14 was extended to certain mixing environments in [80]. As in dimension $d=1$, both $I$ and $\bar I$ are in general not strictly convex. The quenched statement is an application of the ergodic subadditive theorem [64]. The annealed LDP is obtained by noting that the process of histories of the walk is a Markov chain, and applying the general large deviations theory for such chains.

Remark 15 In the multi-dimensional case, a formula like (4), with its intuitive description of the way an annealed deviation is obtained, is not available, since the modification of big chunks of the environment has probability which decays exponentially in volume order, i.e., $n^d$, instead of $n$.

As for $d=1$, it is natural to study slowdown estimates in the region where the rate functions vanish, and in particular to study the probability of slowdown. This study is closely related to the analysis of condition $T'$, and we refer to [96,97] for details.

Non-ballistic Results

The analysis of RWRE for environments that do not exhibit ballistic behavior is still limited. Still, two important classes of models have been identified for which the analysis could be carried out. We sketch those below.

Balanced Environment

Balanced environments satisfy the constraint $\omega(0,e_i) = \omega(0,-e_i)$ for all $i$, in which case the local drift vanishes everywhere. In that case, $X_n$ itself is a martingale with bounded increments, and thus $X_n/n \to 0$, $\mathbb{P}$-a.s. In fact, an invariance principle also holds.

Theorem 16 Assume $P$ is stationary and ergodic, balanced, and uniformly elliptic. Then $X_n/n \to 0$, $\mathbb{P}$-a.s., and there exists a deterministic $\sigma^2 > 0$ such that $X_n/\sqrt{n}$ converges in distribution (under the annealed measure $\mathbb{P}$) to a Gaussian random variable. Further, $X_n$ is recurrent if $d=2$ and transient if $d \ge 3$.

Theorem 16 is essentially due to [59] (the recurrence statement is due to Kesten, and can be found in [107]). It is one of the few instances where "classical" homogenization can be applied to the study of multi-dimensional RWRE.
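A balanced site is straightforward to construct. The sketch below (for $d=2$, with an illustrative ellipticity constant `kappa`; not taken from the text) samples one site of a balanced, uniformly elliptic environment and checks that the local drift $d_0$ vanishes, which is exactly what makes $X_n$ a martingale:

```python
import random

def balanced_site(kappa=0.05, rng=random):
    """One site of a balanced d=2 environment: omega(0,e) = omega(0,-e) on
    each axis, every entry at least kappa, entries summing to 1."""
    p = rng.uniform(kappa, 0.5 - kappa)   # common weight of both horizontal steps
    return {(1, 0): p, (-1, 0): p, (0, 1): 0.5 - p, (0, -1): 0.5 - p}

def local_drift(site):
    """d_0 = sum_e omega(0,e) e; identically zero for balanced sites."""
    return tuple(sum(w * e[i] for e, w in site.items()) for i in (0, 1))
```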
RWRE with Deterministic Components

A key to the analysis of the ballistic case is the existence of certain regeneration times. Those were used to create an i.i.d. sequence under the measure $\mathbb{P}$. In the non-ballistic case, regeneration times as defined above do not exist. However, if the dimension of the space is large enough and some of the components are deterministic, an alternative to regeneration times can be found, based on cut times for simple random walk. For a simple random walk $\{S_n\}$, a cut time $k$ is a time such that $\{S_n, n \le k\} \cap \{S_n, n > k\} = \emptyset$. Such cut times exist for $d \ge 4$ [40]. By considering the cut times induced by the components of the RWRE evolving in the deterministic directions, [14] proved the following.

Theorem 17 Assume $d = d_1 + d_2$ with $d_1 \ge 5$. Assume $P$ is a uniformly elliptic i.i.d. measure, with $\omega(x,x+e) = q(e)$ for $e = \pm e_i$, $i = 1,\ldots,d_1$, and a deterministic $q$. Then, there exists a deterministic constant $v$ such that $X_n/n \to v$, $\mathbb{P}$-a.s. Further, if $d_1 \ge 13$, then the quenched CLT holds, i.e., there exists a deterministic $\sigma^2 > 0$ such that, for $P$-almost every $\omega$, $(X_n - nv)/\sigma\sqrt{n}$ converges in distribution, under $P_\omega^0$, to a standard Gaussian variable.

Remark 18
(a) The convergence in distribution in Theorem 17 extends to a full invariance principle.
(b) An amusing consequence of Theorem 17 is that, for $d > 5$, one may construct $P$ i.i.d. and uniformly elliptic such that $E(d_0\cdot\ell) < 0$ but the resulting RWRE is ballistic with $v\cdot\ell > 0$. Recall that this is impossible in dimension $d=1$, see Remark 2(b). Also, for $d > 6$, one may construct for every $\epsilon > 0$ a $P$ i.i.d. and uniformly elliptic such that $|\omega(x,x+e) - 1/2d| < \epsilon$ and $E(d_0) \ne 0$, but $X_n/n \to 0$, $\mathbb{P}$-a.s., or such that $E(d_0) = 0$ but the walk is ballistic. We refer to [14] for the construction.

Multi Dimensional RWRE – the Perturbative Regime

We discuss in this section the perturbative analysis of the RWRE.
By $P$ being a small perturbation of a kernel $q$ we mean that $q(\pm e_i) \ge 0$, $\sum_i [q(e_i) + q(-e_i)] = 1$, and, for some small $\epsilon$, $|\omega(x,x+e) - q(e)| < \epsilon$ for $e \in \{\pm e_i\}$. When $q(e) = 1/2d$ for $e = \pm e_i$, we say that $P$ is a small perturbation of simple random walk. We already observed, see Remark 18, that in the perturbative regime for simple random walk, the RWRE can exhibit behavior which is very different from that of simple random walk.

Ballistic Walks

Sznitman's criterion [99] for condition $T'$ to hold, together with a renormalization analysis, allowed him to
Random Walks in Random Environment
give in [99] sufficient conditions for ballistic behavior when ε is small. Set η₀(3) = 5/2 and η₀(d) = 3 for d ≥ 4.

Theorem 19  Let d ≥ 3 and η < η₀(d). Then there exists an ε₀ = ε₀(d, η) > 0 such that if P is i.i.d. and an ε-perturbation from simple random walk with ε < ε₀, and E(d₀ · e₁) > ε^η, then the condition T′ relative to e₁ holds.

Contrasting Theorem 19 with the examples in Remark 18 shows that some condition on the strength of the averaged drift E(d₀) as a function of ε is necessary for ballistic behavior. Also, η₀(d) > 2 is used in constructing the examples in [99] that show that Kalikow's condition is strictly included in condition T′. We note that the case d = 2 is still open. In another direction, if one writes ω(x, x+e) = q(e) + ε ξ(x, x+e) with ξ i.i.d., and either Σ_e e q(e) ≠ 0 or Σ_e e q(e) = 0 but Σ_e e E ξ(0, e) ≠ 0, then for ε small enough, Kalikow's condition holds. Expansions in ε of the speed of the RWRE are provided in [83].

Balanced Walks  Recall the balanced walks introduced in Subsect. "Non-Ballistic Results", c.f. Theorem 16. The existence of an invariant measure for the environment viewed from the point of view of the particle, and the control achieved on this measure by approximations with periodized environments, allow one to obtain an expansion of the diffusivity matrix in terms of the strength ε of the perturbation from simple random walk. We refer the reader to [60] for details.

Isotropic RWRE  The existence of sub-diffusive behavior for the RWRE model in d = 1 immediately raises the question as to whether such sub-diffusive behavior is present in higher dimension. As implied by Theorem 11, this is not the case when the environment satisfies condition T′. Since it may be expected that condition T′ characterizes ballistic behavior for d > 1, it is reasonable to expect (but not proved!) that for P i.i.d.
and uniformly elliptic, and d > 1, no sub-diffusive behavior is possible when the walk is transient in direction ℓ (and further, in the ballistic regime, when recentering around the limiting velocity v, one expects fluctuations in the diffusive scale). Outside the ballistic regime, rigorous results are few. Early attempts to address the question of existence of a diffusive regime appeared in [35,41], using a formal renormalization group analysis in the small perturbation regime, with the conclusion that no sub-diffusive behavior exists at d ≥ 3 in the perturbative regime, and that at most logarithmic corrections to diffusive behavior exist at d = 2. While this conclusion certainly conforms with what one would expect, soon after it was pointed out that counter-examples can be constructed (albeit not with i.i.d., or even finite-range dependent, environments), see [15,17,18]. Further, some of the examples discussed in this article, and in particular those of Subsect. "Non-Ballistic Results", see Remark 18, do not seem to be consistent with the formal renormalization analysis. An attempt to put the analysis on a rigorous foundation was made in [22]. Among other things, the authors introduced the following isotropy condition:

Definition 20  The law P on the environment is isotropic if, for any rotation matrix O acting on ℝ^d that fixes ℤ^d, the laws of (ω(0, Oe))_{e : |e|=1} and (ω(0, e))_{e : |e|=1} coincide.

In particular, if P is isotropic then E(d₀) = 0. The main result of [22] is the following:

Theorem 21 (Bricmont–Kupiainen)  Assume d ≥ 3. There exists an ε₀ = ε₀(d) such that if P is i.i.d. and isotropic, and an ε-perturbation of simple random walk with ε < ε₀, then for some deterministic σ² > 0 and for P-almost every ω, the sequence X_n/(σ√n) converges in distribution, under P_ω^0, to a standard Gaussian random variable.

The approach of [22] is to introduce a (diffusive) rescaling in time and space, and to propagate an estimate on both the large-scale behavior of the RWRE and the existence of local traps that have the potential to destroy, at the next level, the diffusivity properties. The restriction to d ≥ 3 is useful because the underlying simple random walk is then transient, and hence Green function computations can be performed. Several attempts have recently been made to provide a rescaling argument (alternative to [22]) that is more transparent. The first approach [103], which is closest to Theorem 21, has been in the context of diffusions in random environments.
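As a numerical aside (not from the article), the consequence E(d₀) = 0 of isotropy can be checked by Monte Carlo: average the local drift Σ_e e·ω(0, e) over environment draws. All names below, and the exchangeable single-site sampler (one simple way to make the single-site law invariant under lattice rotations, which permute the four directions), are hypothetical illustrations.

```python
import random

E2 = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # unit vectors of Z^2

def local_drift(w):
    """d_0 = sum_e e * omega(0, e) for one environment at the origin."""
    return tuple(sum(e[i] * w[e] for e in E2) for i in range(2))

def mean_drift(sample_env, n_samples=1000, seed=0):
    """Monte Carlo estimate of E[d_0]."""
    rng = random.Random(seed)
    acc = [0.0, 0.0]
    for _ in range(n_samples):
        acc = [a + c for a, c in zip(acc, local_drift(sample_env(rng)))]
    return [a / n_samples for a in acc]

def isotropic_sample(rng, eps=0.05):
    """Perturb 1/4 independently in each direction, then shuffle the labels:
    the direction weights become exchangeable, hence the single-site law is
    invariant under lattice rotations."""
    raw = [0.25 + eps * (rng.random() - 0.5) for _ in range(4)]
    rng.shuffle(raw)
    s = sum(raw)
    return {e: r / s for e, r in zip(E2, raw)}
```

For the exact simple-random-walk sampler the estimate is identically zero; for the perturbed sampler it vanishes only in the limit of many samples.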
In the remainder of this section, we describe another approach, due to [13], that yields a result concerning the exit measure of (isotropic) RWRE from large balls. Let V_L = {x ∈ ℤ^d : |x| ≤ L} be the ball of radius L in ℤ^d (where we recall that |·| is the Euclidean norm), and let ∂V_L = {y ∈ ℤ^d : d(y, V_L) = 1} denote the boundary of V_L. Let τ_L = min{n : X_n ∉ V_L} denote the exit time of the RWRE from V_L, and, for x ∈ V_L, z ∈ ∂V_L, let Π_L(x, z) = P_ω^x(X_{τ_L} = z) denote the exit measure of the RWRE from V_L, and let π_L(x, z) denote the corresponding quantity for simple random walk. Finally, let Π^s_{L,l}(x, z) = Π_L ⋆ π_{lζ}, where ⋆ denotes convolution and ζ is a random variable with smooth density supported on (1, 2). Π^s_{L,l} is a smoothed version of Π_L, where the smoothing is at scale l. One expects that, for an isotropic environment that is a small perturbation of simple random walk, the exit measure Π_L approaches that of simple random walk, except for small non-vanishing corrections that are due to localized perturbations near the boundary, and that as soon as some additional smoothing is applied, convergence occurs. Under the assumptions of Theorem 21, this is indeed the case. In what follows, for probability measures μ, ν we write ‖μ − ν‖ for the variational distance between μ and ν.

Theorem 22  Assume d ≥ 3. There exists a δ₀ = δ₀(d) > 0 with the following property: for each δ < δ₀ there exists an ε₀ = ε₀(d, δ) such that if ε < ε₀ and P is an i.i.d. and isotropic law which is an ε-perturbation of simple random walk, then

  limsup_{L→∞} ‖Π_L(0, ·) − π_L(0, ·)‖ ≤ δ .  (7)

Further,

  limsup_{L→∞} ‖Π^s_{L,l}(0, ·) − π_L ⋆ π_{lζ}(0, ·)‖ ≤ c(l), where c(l) → 0 as l → ∞ .  (8)

Diffusions in Random Environments

The model of RWRE possesses a natural analogue in the setup of diffusion processes.

One Dimensional Generators  For dimension d = 1, the study of analogues of the RWRE model goes back to [23] and [86]. Formally, one looks at solutions to the stochastic differential equation

  dX_t = −(1/2) V′(X_t) dt + dβ_t ,  X₀ = 0 ,  (9)

where β is a standard Brownian motion and V, the potential, is itself a Brownian motion with constant drift (independent of β). Of course, (9) does not make sense as written, but one can express the solution to (9) for smooth V in a way that makes sense also when V is replaced by Brownian motion, by saying that, conditioned on the environment V, X_t is a diffusion with generator

  (1/2) e^{V(x)} (d/dx) ( e^{−V(x)} (d/dx) ) .  (10)

The diffusion in (9) inherits many of the asymptotic properties of the RWRE model. Additional tools, borrowed from stochastic calculus, are often needed to obtain sharp statements. We refer to [89] for details and additional references.

Multi Dimensional Diffusions: Finite Range Dependence  Like the RWRE in dimension d = 1, the model (9) leads to a reversible diffusion. A direct generalization of (9) via the expression (10) for the generator, see for example [68,69], preserves the reversibility of the process, and thus for our purpose does not serve as a true analogue of the RWRE model. Instead, we consider diffusions satisfying the equation in ℝ^d:

  dX_t = b(X_t, ω) dt + σ(X_t, ω) dW_t ,  X₀ = 0 ,  (11)

with generator

  L = (1/2) Σ_{i,j=1}^d a_{ij}(x, ω) ∂²_{ij} + Σ_{i=1}^d b_i(x, ω) ∂_i ,  (12)

where a = σσ^T is a d-by-d matrix and the coefficients a, b are assumed to satisfy the following:

Assumption 23
(a) The functions a(·, ω) and b(·, ω) are uniformly (in ω) bounded by K, with Lipschitz norm bounded by K, and a is uniformly elliptic, i.e. a(x, ω) − κI is positive definite for some κ > 0 independent of x or ω.
(b) The random field (a(x, ω), b(x, ω))_{x∈ℝ^d} is stationary with respect to shifts in ℝ^d.
(c) The collections of random variables (a(x, ·), b(x, ·))_{x∈A} and (a(y, ·), b(y, ·))_{y∈B} are independent when d(A, B) > R.

Part (a) of Assumption 23 ensures that (11) possesses a unique strong solution. Part (c) of Assumption 23 is a "finite range dependence" condition. We continue to write P_ω for the quenched law of the trajectories of the diffusion. Many of the results described in Subsect. "Ergodic Properties and a 0–1 Law" and Subsect. "Ballistic Behavior and Sznitman's Conditions" have been proved also in the context of diffusions, when Assumption 23 holds. We refer to [45,84,85,88] for details.

Isotropic Diffusions in the Perturbative Regime  The analogue of the isotropy condition of Definition 20 in the diffusion context is the following:

Assumption 24 (Isotropy)  For any rotation matrix O preserving the union of coordinate axes of ℝ^d, (a(Ox, ω), b(Ox, ω))_{x∈ℝ^d} has the same law under P as (O a(x, ω) O^T, O b(x, ω))_{x∈ℝ^d}.
The analogue of Theorem 21 is the following theorem, due to [103]. Its proof again uses multi-scale arguments, and is based on controlling the (scaled) Hölder norm of the operator associated with the transition probability of the diffusion.

Theorem 25  Let Assumptions 23 and 24 hold. Then there exists a constant ε₀ = ε₀(d, K, R) such that if |a(x, ω) − I| ≤ ε₀ and |b(x, ω)| ≤ ε₀ for all x ∈ ℝ^d, ω ∈ Ω, then for some deterministic σ² > 0 and for a.e. ω, the sequence of random variables X_t/(σ√t) converges in distribution to a standard Gaussian random variable.

(A full quenched invariance principle also holds under the assumptions of Theorem 25.)

Topics Left Out and Future Directions

As pointed out in the text, many open problems remain in the study of RWRE's and motion in random media, the most important being the general validity of the law of large numbers and the existence of diffusive limits in dimensions d ≥ 2. In the rest of this section we briefly mention several topics that are related to this review but that we have not covered in detail.

Random Conductance Model  We have concentrated in this review on RWRE's in i.i.d. environments, which give rise in the multi-dimensional case to non-reversible Markov processes. Although mentioned in several places, we did not discuss in detail the reversible case, where homogenization techniques using the environment viewed from the point of view of the particle are very efficient (note that the reversible case is a very particular case of an environment which is not i.i.d. but rather dependent with finite range dependence). The prototype for such reversible models is the "random conductance model", where each edge (x, y) of ℤ^d is associated with a (random, i.i.d.) conductance C_{x,y}, and the transition probability between x and y is C_{x,y} / (Σ_{z : |z−x|=1} C_{x,z}). Annealed CLT's for the random conductance model are provided in [57,66]. See also [2] for a related model with symmetric transitions. The quenched CLT is obtained in [16] and [90]. One of the motivations to consider the random conductance model is the analysis of random walk on supercritical percolation clusters. The annealed CLT is covered by [66]. Several recent papers discuss the quenched case, first in dimension d ≥ 4 [90], and then in all dimensions, see [5,70]. In another direction, when one discusses biased walks on a percolation cluster, new phenomena occur, for example the lack of monotonicity of the speed of the walk in the strength of the bias, which is again a manifestation of the trapping phenomenon. We refer to [6] and [100] for details.

Brownian Motion in a Field of Random Obstacles  Another closely related (reversible) model is the model of Brownian motion in a field of obstacles in ℝ^d. Here, one defines a potential V(x, ω) = Σ_i W(x − x_i), where the collection {x_i} is a (random) configuration of points in ℝ^d (usually taken according to a Poisson law) and W is a fixed nonnegative shape function. Of interest are the properties of Brownian motion (X_t)_{t∈[0,T]}, perturbed by the change of measure

  dμ_T^ω = (1/Z_T(ω)) exp( −∫₀^T V(X_s, ω) ds ) dP .

It is common to distinguish between "soft traps", with W bounded and typically of compact support, and "hard traps", where W = ∞ · 1_C for a given compact set C. One is interested in understanding various path properties as T gets large, or in understanding the quenched partition function Z_T(ω) and its annealed counterpart E Z_T. Due to reversibility, the problem is closely related to the study of the bottom λ_ω of the spectrum of −Δ/2 + V, and the difficulty is in understanding the structure of those traps that influence λ_ω. A good overview of the model and the techniques developed to analyze it, including the "method of enlargement of obstacles", can be found in [95].

Time Dependent RWRE  An interesting variant of the RWRE model has been proposed in [8]. In this model, the random environment is dynamic, i.e. changes with time, and so we write ω(x, x+e; n) where we wrote before ω(x, x+e). In the simplest version, the collection of random vectors (ω(x, x+·; n))_{x∈ℤ^d, n∈ℕ} is i.i.d. Annealed, the RWRE is then a simple random walk in an averaged environment, but the true interest lies in obtaining quenched statements. Those were obtained in [8,10] by a perturbative approach. An alternative, simpler proof is given by [94]. Another approach to the quenched CLT, that covers other cases of random walk "with a forbidden direction", is developed in [81], based on a general pointwise CLT for additive functionals of Markov chains due to [34]. An interpolation between the RWRE model and the i.i.d. dynamical environment model is when the collection (ω(x, x+·; n))_{x∈ℤ^d, n∈ℕ} is i.i.d. in x but Markovian
in n. This case has been analyzed by perturbative methods in [9], and by regeneration techniques in [3]. In both cases, an annealed CLT holds in any dimension, but the quenched CLT was obtained only in high dimension. Recently, a dynamical approach was developed in [36], that proves the quenched CLT in all dimensions, subject to fast enough mixing of the Markov chain. It is still open to determine whether in the Markovian setup, there are uniformly elliptic, exponentially mixing examples where the quenched CLT fails.
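As a side illustration (not from the article), in the i.i.d. space-time setting above each pair (x, n) is used at most once, so a quenched trajectory can be sampled by drawing the environment on demand. The d = 1 sketch below, with a uniformly elliptic sampler, is a hypothetical example.

```python
import random

def dynamic_rwre(n_steps, env_seed=0, walk_seed=1):
    """Walk in an i.i.d. space-time environment on Z (nearest neighbor).
    omega(x, x+1; n) is drawn fresh at each step, which is a faithful
    quenched sample since each site-time pair (x, n) is used at most once."""
    env = random.Random(env_seed)    # environment randomness
    walk = random.Random(walk_seed)  # walk randomness
    x = 0
    for n in range(n_steps):
        p_right = 0.25 + 0.5 * env.random()  # uniformly elliptic: in [1/4, 3/4]
        x += 1 if walk.random() < p_right else -1
    return x
```

Annealed, the increments here are i.i.d. with mean E[2p − 1] = 0, so the averaged-environment walk is the symmetric simple random walk, as in the text.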
RWRE on Trees and Other Graphs We have already mentioned the interest in considering random walks on random subgraphs of Zd , and in particular percolation clusters. Of course, one may consider instead random walk (or biased random walk) on other random graphs. A particularly important class of models treats random walks on random trees, and in particular Galton–Watson trees. We refer to [65] for an excellent overview of the properties and ergodic theory of such random walks, and to [74] for recent results concerning the CLT. See also [50] for slowdown estimates for the analog of the RWRE on the binary tree. We emphasize that these models are all reversible.
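Reversibility of the models just mentioned is conveniently expressed through conductances. As an illustrative sketch (hypothetical helper names; i.i.d. uniform conductances chosen only as an example), the random conductance transition rule C_{x,y}/Σ_z C_{x,z} quoted earlier reads:

```python
import random

def make_conductances(seed=0, lo=0.5, hi=1.5):
    """Lazily sampled i.i.d. conductances C_{x,y}, one per undirected edge."""
    cache, rng = {}, random.Random(seed)
    def C(x, y):
        edge = frozenset((x, y))
        if edge not in cache:
            cache[edge] = rng.uniform(lo, hi)  # uniformly elliptic
        return cache[edge]
    return C

def neighbors(x):
    """Nearest neighbors of x in Z^d."""
    out = []
    for i in range(len(x)):
        for s in (-1, 1):
            y = list(x); y[i] += s
            out.append(tuple(y))
    return out

def transition_probs(x, C):
    """p(x, y) = C_{x,y} / sum_z C_{x,z} over nearest neighbors z."""
    w = {y: C(x, y) for y in neighbors(x)}
    total = sum(w.values())
    return {y: c / total for y, c in w.items()}
```

With π(x) = Σ_z C_{x,z}, one checks detailed balance π(x) p(x, y) = C_{x,y} = π(y) p(y, x), which is the reversibility exploited by the homogenization techniques mentioned above.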
Non Nearest Neighbor RWRE  Many of the techniques described in this survey have a natural generalization to non-nearest neighbor walks. In particular, the results in [80,106] are already stated in terms of compactly supported transition probabilities, and the development of regeneration times can easily be extended, following the techniques in [26,27], to the non-nearest neighbor, finite range setup. However, to the best of my knowledge, no systematic study of non-nearest neighbor RWRE's in dimension d ≥ 2 has appeared in the literature. The situation is different in dimension d = 1, where the RWRE is no longer reversible. It was realized early on, see [53,63], that ergodic theorems involve the study of certain Lyapounov exponents associated with products of random matrices. For some recent results, we refer to [11] and [21].

Bibliography

Primary Literature

1. Alili S (1999) Asymptotic behaviour for random walks in random environments. J Appl Probab 36:334–349
2. Anshelevich VV, Khanin KM, Sinai YG (1982) Symmetric random walks in random environments. Commun Math Phys 85:449–470 3. Bandyopadhyay A, Zeitouni O (2006) Random Walk in Dynamic Markovian Random Environment. ALEA 1:205–224 4. Berger N (2006) On the limiting velocity of high-dimensional random walk in random environment. Arxiv: math.PR/ 0601656 5. Berger N, Biskup M (2007) Quenched invariance principles for simple random walk on percolation clusters. Probab Theory Relat Fields 137:83–120 6. Berger N, Gantert N, Peres Y (2003) The speed of biased random walk on percolation clusters. Probab Theory Relat Fields 126:221–242 7. Berger N, Zeitouni O (2008) A quenched invariance principle for certain ballistic random walks in i.i.d. environments. In: Siduravicius V, Vares ME (eds) In and out of equilibrium, Progress in probability, vol. pp 137–160 8. Boldrighini C, Minlos RA, Pellegrinotti A (1997) Almost-sure central limit theorem for a Markov model of random walk in dynamical random environment. Probab Theory Relat Fields 109:245–273 9. Boldrighini C, Minlos RA, Pellegrinotti A (2000) Random walk in a fluctuating random environment with Markov evolution. In: On Dobrushin’s way. From probability theory to statistical physics. Amer Math Soc Transl Ser 198(2):13–35; Amer Math Soc, Providence 10. Boldrighini C, Minlos RA, Pellegrinotti A (2004) Random walks in quenched i.i.d. space-time random environment are always a.s. diffusive. Probab Theory Relat Fields 129:133– 156 11. Bolthausen E, Goldsheid I (2000) Recurrence and transience of random walks in random environments on a strip. Commun Math Phys 214:429–447 12. Bolthausen E, Sznitman AS (2002) On the static and dynamic points of view for certain random walks in random environment. Method Appl Analysis 9:345–375 13. Bolthausen E, Zeitouni O (2007) Multiscale analysis of exit distributions for random walks in random environments. Prob Theor Fields 138:581–645 14. 
Bolthausen E, Sznitman AS, Zeitouni O (2003) Cut points and diffusive random walks in random environments. Ann Inst H Poincare 39:527–555 15. Bouchaud JP, Georges A, Le Doussal P (1987) Anomalous diffusion in random media: trapping, correlations and central limit theorems. J Phys 48:1855–1860 16. Boivin D, Depauw J (2003) Spectral homogenization of reversible random walks on Zd in a random environment. Stoch Proc App 104:29–56 17. Bramson M (1991) Random walk in random environment: a counterexample without potential. J Stat Phys 62:863– 875 18. Bramson M, Durrett R (1988) Random walk in random environment: a counterexample? Commun Math Phys 119:199– 211 19. Bramson M, Zeitouni O, Zerner MPW (2006) Shortest spanning trees and a counterexample for random walks in random environments. Ann Probab 34:821–856
20. Brémont J (2004) Behavior of random walks on Z in Gibbsian medium. C R Math Acad Sci Paris 338:895–898 21. Brémont J (2004) Random walks in random medium on Z and Lyapunov spectrum. Ann Inst H Poincar Probab Stat 40:309– 336 22. Bricmont J, Kupiainen A (1991) Random walks in asymmetric random environments. Com mun Math Phys 142:345– 420 23. Brox T (1986) A one-dimensional diffusion process in a Wiener medium. Ann Probab 14:1206–1218 24. Cheliotis D (2005) Diffusions in random environments and the renewal theorem. Ann Probab 33:1760–1781 25. Comets F, Gantert N, Zeitouni O (2000) Quenched, annealed and functional large deviations for one dimensional random walk in random environment. Probab Theory Relat Fields 118:65–114 26. Comets F, Zeitouni O (2004) A law of large numbers for random walks in random mixing environments. Ann Probab 32:880–914 27. Comets F, Zeitouni O (2005) Gaussian fluctuations for random walks in random mixing environments. Isr J Math 148:87– 114 28. Dembo A, Gantert N, Peres Y, Shi Z (2007) Valleys and the maximum local time for random walk in random environment. Probab Theory Relat Fields 137:443–473 29. Dembo A, Zeitouni O (1998) Large deviations techniques and applications. 2nd edn. Springer, New York 30. Dembo A, Guionnet A, Zeitouni O (2001) Aging properties of Sinai’s random walk in random environment. Arxiv:math.PR/0105215 31. Dembo A, Gantert N, Peres Y, Zeitouni O (2002) Large deviations for random walks on Galton–Watson trees: averaging and uncertainty. Probab Theory Relat Fields 122:241–288 32. Dembo A, Gantert N, Zeitouni O (2004) Large deviations for random walk in random environment with holding times. Ann Probab 32:996–1029 33. Dembo A, Peres Y, Zeitouni O (1996) Tail estimates for onedimensional random walk in random environment. Commun Math Physics 181:667–684 34. Derriennic Y, Lin M (2003) The central limit theorem for Markov chains started at a point. Probab Theory Relat Fields 125:73–76 35. 
Derrida B, Luck JM (1983) Diffusion on a random lattice: weak-disorder expansion in arbitrary dimension. Phys Rev B 28:7183–7190 36. Dolgopyat D, Keller G, Liverani C (2007) Random walk in markovian environment. ArXiv:math/0702100v1 [math.PR] (preprint) 37. Donsker MD, Varadhan SRS (1983) Asymptotic evaluation of certain Markov process expectations for large time, IV. Commun Pure Appl Math 36:183–212 38. Doyle PG, Snell JL (1984) Random walks and electric networks, Carus Mathematical Monographs, 22. Mathematical Association of America, Washington 39. Enriquez N, Sabot C, Zindy O (2007) Limit laws for transient random walks in random environments on Z. ArXiv:math/0703660v1 [math.PR] (preprint) 40. Erdös P, Taylor SJ (1960) Some intersection properties of random walks paths. Acta Math Acad Sci Hungar 11:231–248
41. Fisher DS (1984) Random walks in random environments. Phys Rev A 30:960–964 42. Gantert N (2002) Subexponential tail asymptotics for a random walk with randomly placed one-way nodes. Ann Inst H Poincaré – Probab Statist 38:1–16 43. Gantert N, Shi Z (2002) Many visits to a single site by a transient random walk in random environment. Stoch Process Appl 99:159–176 44. Gantert N, Zeitouni O (1998) Quenched sub-exponential tail estimates for one-dimensional random walk in random environment. Commun Math Phys 194:177–190 45. Goergen L (2006) Limit velocity and zero-one laws for diffusions in random environment. Ann Appl Probab 16:1086–1123 46. Goldsheid I (2007) Simple transient random walks in one-dimensional random environment: the central limit theorem. Probab Theory Related Fields 139:41–64 47. Golosov AO (1985) On limiting distributions for a random walk in a critical one dimensional random environment. Commun Moscow Math Soc 199:199–200 48. Greven A, den Hollander F (1994) Large deviations for a random walk in random environment. Ann Probab 22:1381–1428 49. Hu Y, Shi Z (2000) The problem of the most visited site in random environment. Probab Theory Relat Fields 116:273–302 50. Hu Y, Shi Z (2007) A subdiffusive behaviour of recurrent random walk in random environment on a regular tree. Probab Theory Related Fields 138:521–549 51. Kalikow SA (1981) Generalized random walks in random environment. Ann Probab 9:753–768 52. Kesten H (1986) The limit distribution of Sinai's random walk in random environment. Physica A 138:299–309 53. Key ES (1984) Recurrence and transience criteria for random walk in a random environment. Ann Probab 12:529–560 54. Kosygina E, Rezakhanlou F, Varadhan SRS (2006) Stochastic homogenization of Hamilton–Jacobi–Bellman equations. Comm Pure Appl Math 59:1489–1521 55. Kesten H, Kozlov MV, Spitzer F (1975) A limit law for random walk in a random environment. Compos Math 30:145–168 56.
Kozlov SM (1985) The method of averaging and walks in inhomogeneous environments. Russian Math Surv 40:73–145 57. Kunnemann R (1983) The diffusion limit of reversible jump processes in Zd with ergodic random bond conductivities. Commun Math Phys 90:27–68 58. Kuo HJ, Trudinger NS (1990) Linear elliptic difference inequalities with random coefficients. Math Comput 55:37–53 59. Lawler GF (1982) Weak convergence of a random walk in a random environment. Commun Math Phys 87:81–87 60. Lawler GF (1989) Low-density expansion for a two-state random walk in a random environment. J Math Phys 30:145– 157 61. Lawler GF (1991) Intersections of Random walks. Birkhauser, Basel 62. Le Doussal P, Monthus C, Fisher D (1999) Random walkers in one-dimensional random environment: exact renormalization group analysis. Phys Rev E 59:4795–4840
63. Ledrappier F (1984) Quelques propriétés des exposants charactéristiques, Lecture Notes in Mathematics, vol 1097. Springer, New York 64. Liggett TM (1985) An improved subadditive ergodic theorem. Ann Probab 13:1279–1285 65. Lyons R, Peres Y Probability on trees and networks. http://mypage.iu.edu/~rdlyons/prbtree/prbtree.html 66. De Masi A, Ferrari PA, Goldstein S, Wick WD (1989) An invariance principle for reversible Markov processes. Applications to random motions in random environments. J Stat Phys 55:787–855 67. Mayer-Wolf E, Roitershtein A, Zeitouni O (2004) Limit theorems for one-dimensional transient random walks in Markov environments. Ann Inst H Poincaré Probab Stat 40:635–659 68. Mathieu P (1994) Zero white noise limit through Dirichlet forms, with applications to diffusions in a random medium. Probab Theory Relat Fields 99:549–580 69. Mathieu P (1995) Limit theorems for diffusions with a random potential. Stoch Proc App 60:103–111 70. Mathieu P, Piatnitski A (2007) Quenched invariance principles for random walks on percolation clusters. Proc R Soc Lond Ser A Math Phys Eng Sci 463:2287–2307 71. Molchanov SA (1994) Lectures on random media. Lecture Notes in Mathematics, vol 1581. Springer, New York 72. Papanicolaou GC, Varadhan SRS (1982) Diffusions with random coefficients. In: Kallianpur G, Krishnaiah PR, Ghosh JK (eds) Statistics and probability: essays in honor of C. R. Rao. North-Holland, Amsterdam, pp 547–552 73. Pemantle R (1988) Phase transition in reinforced random walk and RWRE on trees. Ann Probab 16:1229–1241 74. Peres Y, Zeitouni O (2008) A quenched CLT for biased random walks on Galton–Watson trees. Probab Theory Relat Fields 140:595–629 75. Peterson J (2008) Ph.D. thesis. University of Minnesota 76. Peterson J, Zeitouni O (2007) Quenched limits for transient, zero speed one-dimensional random walk in random environment. ArXiv:0704.1778v1 [math.PR] (preprint) 77.
Pisztora A, Povel T (1999) Large deviation principle for random walk in a quenched random environment in the low speed regime. Ann Probab 27:1389–1413 78. Pisztora A, Povel T, Zeitouni O (1999) Precise large deviation estimates for a one-dimensional random walk in a random environment. Probab Theory Relat Fields 113:191– 219 79. Rassoul-Agha F (2003) The point of view of the particle on the law of large numbers for random walks in a mixing random environment. Ann Probab 31:1441–1463 80. Rassoul-Agha F (2004) Large deviations for random walks in a mixing random environment and other (non-Markov) random walks. Commun Pure Appl Math 57:1178–1196 81. Rassoul-Agha F, Seppäläinen T (2005) An almost sure invariance principle for random walks in a space-time random environment. Probab Theory Relat Fields 133:299–314 82. Rassoul-Agha F, Seppäläinen T (2007) Almost sure functional central limit theorem for non-nestling random walk in random environment. ArXiv:0704.1022v1 [math.PR] (preprint) 83. Sabot C (2004) Random walks in random environment at low disorder. Ann Probab 32:2996–3023
84. Schmitz T (2006) Diffusions in random environment and ballistic behavior. Ann Inst H Poincaré Prob Statist 42:683– 714 85. Schmitz T (2006) Examples of condition (T) for diffusions in random environment. Electron J Probab 11:540–562 86. Schumacher S (1985) Diffusions with random coefficients. Contemp Math 41:351–356 87. Shen L (2002) Asymptotic properties of certain anisotropic walks in random media. Ann Appl Probab 12:477–510 88. Shen L (2003) On ballistic diffusions in random environment. Ann Inst H Poincare Probab Statist 39:839–876 89. Shi Z (2001) Sinai’s walk via stochastic calculus. In: Comets F, Pardoux E (eds) Milieux Aléatoires. Panoramas et Synthèses 12. Soc Math Fr, pp 53–74 90. Sidoravicius V, Sznitman AS (2004) Quenched invariance principles for walks on clusters of percolation or among random conductances. Probab Theory Relat Fields 129:219– 244 91. Sinai YG (1982) The limiting behavior of a one-dimensional random walk in random environment. Theor Prob Appl 27:256–268 92. Solomon F (1975) Random walks in random environments. Ann Probab 3:1–31 93. Stroock D, Varadhan SRS (1979) Multidimensional diffusion processes. Springer, New York 94. Stannat W (2004) A remark on the CLT for a random walk in a random environment. Probab Theory Relat Fields 130:377– 387 95. Sznitman AS (1998) Brownian motion, obstacles and random media. Springer, New York 96. Sznitman AS (1999) On a class of transient random walks in random environment. Ann Probab 29:724–765 97. Sznitman AS (2000) Slowdown estimates and central limit theorem for random walks in random environment. JEMS 2:93–143 98. Sznitman AS (2002) An effective criterion for ballistic behavior of random walks in random environment. Probab Theory Relat Fields 122:509–544 99. Sznitman AS (2003) On new examples of ballistic random walks in random environment. Ann Probab 31:285–322 100. Sznitman AS (2003) On the anisotropic walk on the supercritical percolation cluster. Commun Math Phys 240.123– 148 101. 
Sznitman AS (2004) Topics in random walks in random environment. School and Conference on Probability Theory. ICTP Lect Notes XVII Abdus Salam, Int Cent Theor Phys Trieste, pp 203–266 102. Sznitman AS, Zerner M (1999) A law of large numbers for random walks in random environment. Ann Probab 27:1851– 1869 103. Sznitman AS, Zeitouni O (2006) An invariance principle for isotropic diffusions in random environments. Invent Math 164:455–567 104. Temkin DE (1972) One dimensional random walk in a twocomponent chain. Soviet Math Dokl 13:1172–1176 105. Varadhan SRS (1966) Asymptotic probabilities and differential equations. Commun Pure Appl Math 9:261–286 106. Varadhan SRS (2003) Large deviations for random walks in a random environment. Commun Pure Appl Math 56:1222– 1245
107. Zeitouni O (2004) Random walks in random environment. XXXI Summer School in Probability, St. Flour, 2001. Lect Notes Math, vol 1837. Springer, New York, pp 193–312 108. Zeitouni O (2006) Random walks and diffusions in random environments. J Phys A: Math Gen 39:R433–R464 109. Zerner MPW (1998) Lyapounov exponents and quenched large deviations for multidimensional random walk in random environment. Ann Probab 26:1446–1476 110. Zerner MPW (2002) A non-ballistic law of large numbers for random walks in i.i.d. random environment. Electron Commun Probab 7(19):191–197 111. Zerner MPW, Merkl F (2001) A zero-one law for planar random walks in random environment. Ann Probab 29:1716–1732
Books and Reviews Bolthausen E, Sznitman AS (2002) Ten lectures on random media DMV Seminar, 32. Birkhäuser, Basel Hughes BD (1996) Random walks and random environments. Oxford University Press, Oxford Révész P (2005) Random walk in random and non-random environments, 2nd edn. World Scientific, Hackensack Revuz D, Yor M (1999) Continuous martingales and Brownian motion, 3rd edn. Springer, New York Spitzer F (1976) Principles of random walk, 2nd edn. Springer, New York Zeitouni O (2002) Random Walks in Random Environments. Proceedings of ICM 2002. Documenta Mathematica III:117–127
Rational, Goal-Oriented Agents

ROSARIA CONTE
LABSS-ISTC, CNR, Rome, Italy

Article Outline

Glossary
Definition of the Subject
Introduction
Agents
Concluding Remarks and Future Directions
Bibliography

Glossary

Agent A system that uses its power to bring about a given state in a physical, social, or mental environment.
Autonomy A property of a subset of agents, which decide to act on the grounds of their internal criteria (internal representations).
Computing agent An entity that executes the instructions contained in an algorithm running on a computer.
Goal-directed agents Intelligent agents that have an internal representation of the goals they achieve.
Goal-oriented agents Entities designed to achieve a certain state of the world wanted by either the agent itself, which in such a case is also a goal-directed system, or the user/designer.
Intelligent agents Goal-oriented agents that use their knowledge to solve problems, including taking decisions and planning actions.
Internal representation Knowledge stored in internal agent structures or distributed over sets of interconnected internal units.
Knowledge Information about the environment, which is stored in agents' memory (archives).
Rational agents Intelligent agents that use their (limited) knowledge to maximize the difference between benefits and costs.

Definition of the Subject

In this article, goal-oriented agents, i.e. agents designed to achieve a given goal, will be treated as a highly general category within agent theory and application, and argued to subsume both rational agents, which maximize the difference between benefits and costs, and goal-directed agents, which have a representation of the goal they achieve. These sub-types of agents exemplify two different but non-exclusive approaches to agent theory, endogenous and exogenous. The endogenous approach is aimed at modeling agents in terms of their internal mechanisms of regulation. The exogenous approach is used to describe them from an external point of view, in terms of the effects they achieve in the world. In this article, these two approaches will be combined with another important classification, weak and strong agents. The resulting specific models and applications will be described. Finally, goal-directed and rational agents will be compared, and future directions of research in the agent field will be discussed.

The field of agent theory and technology is growing fast. Suffice it to consider that the community of scientists interested in multi agent systems, which counted at most a few dozen people in the early nineties, is now one of the largest in the area of computational and technological sciences (the 2008 AAMAS (Autonomous Agents and Multi Agent Systems) conference received around 700 submissions). Furthermore, a growing number of disciplines within the social and behavioral sciences, not to speak of the science of complexity, are turning to agent-based computational modeling with the aim of promoting explicit, controllable models of social and behavioral phenomena, and of realizing computational laboratories for visualizing such phenomena and conducting repeatable experiments on them. Despite, or perhaps because of, such fast development, the field of agent theory and application has received neither an adequate scientific re-elaboration nor commonly shared views and definitions. Agent is still a controversial notion, specified – as observed in [81] – by a growing number of adjectives, for example autonomous, goal-directed, rational, etc., rather than by explicit properties. Developed by different disciplines – physics, biology, AI, psychology, computer science, economics, decision theory, organization science, etc.
– with their own conventions and vocabulary, the notion of agent calls for an explicit confrontation across disciplines, a need that the different communities underestimate. The result is an unpleasant conceptual mix and a consequent loss of scientific credibility for the field. This article is therefore intended to offer a contribution of clarification concerning some properties of agency, in order to compare at least a subset of models and theories that currently interact insufficiently, and to help envisage future trends.

Introduction
In the present contribution, rational agents are seen as a subset of intelligent agents (see Fig. 1), and these in turn as a subset of goal-oriented ones. Agents are said to be
goal-oriented when they exhibit finalistic behavior [3,9], i. e. when their behavior brings about a state of the world that achieves somebody's goals, not necessarily the user's. Note that goal-oriented systems need not be goal-directed. The latter are guided by a representation of the goal they are oriented to [76]: a software agent designed to bring about a given effect is goal-oriented; it becomes goal-directed if, for any reason, the effect needs to be represented as an internal state of the agent.

[Rational, Goal-Oriented Agents, Figure 1: Set relationships among some types of goal-oriented agents]

The distinction between goal-oriented and goal-directed action is crucial, since it allows two main components of computational agents to be distinguished: the design-goals, i. e. the goals of the user/designer, which need not be represented in the agent although they are incorporated in the way it operates and is constructed, and the internal goals, which are represented in the agent that achieves them. Of course, goal-oriented modeling works as a filter for the designer, enabling her to recognize important properties of her domain of interest. It also tells her which internal mechanisms the system ought to possess in order to achieve its design-goal, and whether and which among these goals ought to be internalized. Within the broad category of goal-oriented agents, we identify intelligent agents as those entities that use their own knowledge to bring about certain effects in the world. Intelligent agents include goal-directed agents, which achieve their own goals, and rational agents, which maximize their own utility. Rational and goal-directed agents are here taken as paradigmatic examples of two complementary approaches in the agent field, endogenous and exogenous: the latter describes agents' behavior as perceived by an external observer; the former models it in terms of the agent's internal mechanisms. Each approach will be characterized separately.
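The distinction can be made concrete with a minimal sketch (hypothetical code; the class names and the thermostat example are illustrative assumptions, not taken from the article). Both agents below achieve the same design-goal, but only the second holds the goal as an internal state it compares against its perception:

```python
# Hypothetical illustration of goal-oriented vs goal-directed agents.
# A goal-oriented agent achieves its design-goal without representing it;
# a goal-directed agent holds the goal as an internal representation it
# compares with the perceived world-state.

class GoalOrientedHeater:
    """The design-goal (keep the room warm) is baked into the rule,
    not represented inside the agent."""
    def act(self, temperature):
        return "heat" if temperature < 20.0 else "idle"  # 20.0 is the designer's choice

class GoalDirectedHeater:
    """The goal is an internal state the agent can inspect, compare
    with perception, and in principle revise."""
    def __init__(self, goal_temperature):
        self.goal = goal_temperature          # internal goal representation
    def act(self, temperature):
        # action is triggered and regulated by the goal/percept discrepancy
        return "heat" if temperature < self.goal else "idle"

assert GoalOrientedHeater().act(18.0) == "heat"
agent = GoalDirectedHeater(goal_temperature=22.0)
assert agent.act(21.0) == "heat"
assert agent.act(23.0) == "idle"
```

Both classes produce the same external behavior; only the second could, for instance, have its goal updated at run time, which is what makes it goal-directed.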
In the final section, rational and goal-directed agents will be compared, their differences will be outlined, and the respective advantages for both theory and application will be pointed out.
Agents
Agents are systems acting on some environment, including their own minds; in the latter case, their action consists of taking decisions, manipulating information, etc. More generally, they act on the physical or social environment, producing transitions from one state to another. They are modeled in more or less formal terms. In general, agent models are abstract, formal, and often computational. Computational models describe agents as entities carrying out the operations described in an algorithm: accepting inputs from the environment, and performing actions that depend upon the current inputs and produce outputs that modify the environment. By an abstract model of the agent is meant an ideal-type entity endowed with defining properties as specified in the underlying theory. Such an entity, which may never be found in nature, facilitates general explanation of real-world phenomena involving natural agents, and provides inspiring models for concrete applications to be designed. By a formal description is meant an abstract description that is also explicit, unequivocal, and sound. It is by no means necessary that such a description be translated into a standard logic language, although this is sometimes the case. Strictly speaking, a formal model of the agent is not necessarily transformed into a computational model either, although this is often the case. In this contribution, computational agent models will mainly be addressed, although the discussion of rational agents will involve mathematical notions, and that of BDI agents – a special type of goal-directed system – will bring into play logical definitions of mental states. In all of these cases, however, formal technicalities will not be provided, and the models will only be reported on in informal terms.
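The computational notion of agent just described – accept inputs from the environment, act as a function of current input and internal state, produce outputs that modify the environment – can be sketched minimally (hypothetical code; the function names and the toy environment are illustrative assumptions):

```python
# A bare-bones computing agent in the sense used above: at each step it
# accepts a percept from the environment, selects an action as a function of
# that percept and its internal state, and its output modifies the environment.

def run_agent(policy, environment, steps):
    """policy: (state, percept) -> (new_state, action).
    The environment is a dict whose 'value' each action increments or decrements."""
    state = None
    trace = []
    for _ in range(steps):
        percept = environment["value"]          # input from the environment
        state, action = policy(state, percept)  # depends on input and internal state
        environment["value"] += action          # the output modifies the environment
        trace.append(action)
    return trace

# Example policy: push the environment's value toward zero.
def toward_zero(state, percept):
    return state, (-1 if percept > 0 else 1)

env = {"value": 3}
run_agent(toward_zero, env, steps=3)
assert env["value"] == 0
```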
Criteria for Classifying Agent Models
As shall be argued throughout the article, goal-directed and rational agents represent two complementary ways, endogenous and exogenous, to approach intelligent agents; which way is taken largely depends on whether agents are instruments or objects of scientific inquiry. Taken as scientific instruments, agents are modeled and implemented on a computer in order to generate phenomena [33] that are themselves the primary objects of scientific investigation. This is the case with macro-social phenomena grown in artificial societies, where agents are specified only to the extent required to bring about the (macroscopic) effects of interest. These usually consist of interesting dynamics at the aggregate level, such as emergent macro-social effects like segregation, public
goods provision, opinion dynamics, etc.; or stylized social facts, such as game-theoretic scenarios, used to explore social or collective problems and dilemmas and to account for prosocial action, such as cooperation [6], altruism [34], fairness and justice [10], etc. As targets of scientific interest, agents – in particular autonomous agents [41,96,100] – draw the scientist's attention to the mechanisms of regulation that activate and rule the activity of any system of action. In this sense, agents are studied and described in terms of their internal mechanisms of regulation. Ideally, the language, the concepts, and the formal instruments applied to them describe agents and their behaviors in terms of precisely the mechanisms that operate when they behave. Such a description, of course, corresponds to the real processes only as closely as current scientific standards allow. What consequences do these different ways of conceptualizing agents have? The distinction discussed above leads to two different approaches to agent modeling, endogenous and exogenous, which have not yet been systematically analyzed.

Endogenous vs. Exogenous Approach
Generally speaking, endogenous variables are properties originating within the system under study. Exogenous variables, instead, are properties originating from outside the system, which are neither under its control nor manipulated by the system itself, but can be manipulated by the observer/experimenter. In computational models, exogenous variables are parameters set to defined values by the computer scientist. It is interesting to observe that a large parameter space poses a serious problem of computability for the analyst (see [99] for a recent review of alternatives and prospects), who must arbitrarily choose which parameters to explore extensively. Attention is therefore currently paid to the so-called "endogeneization" of parameters [19].
By endogenous models are here meant models describing agents and their behaviors in terms of a special type of endogenous variable, namely their generative mechanisms. These are supposed to operate within the agents whenever they exhibit a given behavior, including responses to external stimuli. Exogenous models, instead, describe agents' behavior from the outside, in terms of its effects or functions (e. g., a utility or fitness function). An exogenous model is indifferent to how behavior is actually ruled; it merely applies a formal description that accounts for the behavior under study as perceived by the external observer. Notice that the exogenous and endogenous approaches are not necessarily in competition: they tackle
agency from two different perspectives. For example, on the grounds of an exogenous model, one can assume that a certain opinion and the corresponding behavior spread over a population as an effect of physical space (see [68]). It is interesting to notice that, so far, computational models have adopted an exclusively exogenous approach to the study of opinion dynamics [30,51,71,73,90], modeling the external dynamics of opinions, i. e. their dynamics in society, independent of the endogenous mechanisms allowing agents to accept [84], update [58], revise [12], transmit [72], retract, and finally drop their opinions. However, at least in principle, an endogenous model of opinion dynamics is necessary to unfold the social dynamics: to understand how physical or social distance impacts opinion dynamics, we ought to understand how opinions gain ground, agent by agent, mind by mind. In other words, we ought to model the process by which agents influence one another and accept others' opinions. One might say that these two approaches tackle different aspects of the same phenomena, or different steps of the same processes, the exogenous model focusing on so-called distal causes, the endogenous model addressing proximal ones. On closer inspection, however, this does not seem to be the case. Indeed, proximal causes are sometimes exogenously modeled. This is the case with game-theoretic models that describe decision-making (a proximal cause of behavior) by means of a utility function taking into account wealth distribution and risk sensitivity [42,67]. Such a function can hardly be conceived of as an internal mechanism ruling agents' decision-making. While taking what frequently proves to be an intelligent, adaptive decision, agents have not the faintest idea that the process they have gone through might well be described by means of a mathematical function.
Far from identifying the mechanisms generating the agent's choice from within, this function describes behavior from the outside. Game-theoretic models have often been accused of proceeding from implausible assumptions. This accusation is warranted from the point of view of an endogenous model, not from an exogenous standpoint. That agents maximize their expected utility is probably a truism, at least if utility is not meant to coincide with self-interest [39]. If utility is meant to coincide with one's own desires or goals [5], the rationality assumption is perhaps too obvious, even tautological, but not implausible. What makes it implausible is the description in which it is conveyed: agents do not mathematically calculate their expected utility, although they often act as if they were doing so. The utility function is a valid exogenous model, of course, but an implausible endogenous explanation.
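The exogenous, "as if" utility description can be written down directly (a minimal sketch under illustrative assumptions; the action names and payoffs are invented). An observer assigns each action an expected utility, benefits minus costs weighted by outcome probabilities, and predicts that the agent picks the maximizing action, saying nothing about the agent's internal mechanisms:

```python
# Exogenous "as if" model: describe choice as expected-utility maximization
# without any claim about the internal mechanism producing the choice.

def expected_utility(action):
    """action: list of (probability, benefit, cost) triples over possible outcomes."""
    return sum(p * (benefit - cost) for p, benefit, cost in action)

def rational_choice(actions):
    """Predict the action with maximal expected utility."""
    return max(actions, key=lambda name: expected_utility(actions[name]))

# Invented payoffs for illustration only:
actions = {
    "cooperate": [(0.5, 10, 2), (0.5, 0, 2)],  # EU = 0.5*8 + 0.5*(-2) = 3
    "defect":    [(1.0, 4, 0)],                # EU = 4
}
assert rational_choice(actions) == "defect"
```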
The problem is: what is a good endogenous explanation? What are the internal mechanisms that allow agents to maximize their utility? This contribution neither provides, nor is intended to provide, an answer to this question. However, a good step toward "endogenizing" factors of regulation is to turn to current scientific theories that explicitly aim at elaborating an endogenous model. Analogously, we should refrain from the common fallacy that distal explanations can easily do without endogenous models. Things are not so simple. If one aims at generating the behavioral effect starting from a far-removed cause, one needs to construct the whole chain between distal causes and wanted effects. Without an endogenous model, one can get nothing better than a partial and non-generative explanation. This is not to say that such an explanation has no right of its own. On the contrary, and somewhat in contrast with the latest generativists [33], the present writer's claim is that exogenous models are non-generative statements that may have explanatory value [22]. As Nwana [81] argued in his overview of the field of computational agents, one symptom of the ontological deficiency characterizing this field is the number of adjectives often accompanying the word agent. Rather than providing definitions of these adjectives, which are easily accessible in Nwana's or in Wooldridge and Jennings' papers [101], the present contribution discusses major criteria for categorizing agents and combines them with the epistemological approaches discussed above.

Weak vs. Strong Agents
Wooldridge and Jennings [101] distinguished between a weak and a strong form of computing agents. Weak agents are characterized by a number of properties.
Autonomy Computing agents operate without the direct intervention of humans or others, and have some kind of control over their actions and internal state.
Agents can be placed on a continuous dimension of autonomy, as several intermediate cases occur between full autonomy and full slavery. Indeed, limited autonomy [23] is the paradigmatic case of humans. Sources of limited autonomy can be found both in resource scarcity and in agents' sharing a common environment. Due to resource scarcity, agents may not possess all of the means needed for achieving their goals (limited self-sufficiency). In a common environment, agents interfere with one another both positively and negatively. Both factors make them liable to all sorts of mutual influences.
Rational, Goal-Oriented Agents, Table 1 Set relationships among types of goal-oriented agents

Goal-oriented agents        | Endogenous approach                                | Exogenous approach
Sub-agents                  | Reflex                                             | Spin glasses; Cellular automata
Weak agents (intelligent)   | Learning; Evolutionary; Neural nets; Goal-directed | –
Strong agents (intelligent) | Deliberative (BDI, Cognitive)                      | Rational; Bounded rational
Social ability Agents interact with other agents (and possibly humans) via some kind of agent-communication language [44].
Reactivity Agents perceive and respond to their environment, including other agents.
Pro-activeness Agents act in response to their environment, but also take the initiative [101].
In addition to these properties, strong agents are computer systems characterized by mentalistic notions, such as knowledge, belief, intention, and obligation [93], as well as emotions [7,8]. What happens if we combine the dimension of weak and strong agency with the endogenous and exogenous approaches? In Tab. 1, goal-oriented agents are classified. In the first row, sub-agents appear, i. e. those lacking one or another of the minimal properties for weak agency identified above; in particular, none of the systems classified as sub-agents is proactive. Below them, intelligent agents are shown: they include weak and strong agents, endogenously and exogenously modeled.

Overview: Weak Agents
Weak agents are essentially autonomous, interconnected, and communicating systems that effectuate state transitions. Unlike simpler and stronger models, no exogenous examples of weak agents are found: they are all described in terms of their internal mechanisms. Plausibly, simpler models lend themselves to description with mathematical formalisms, which are also applied to model some complex capacities, like rational choice (see rational agents). Weak agents, instead, are not modeled to describe
the effects they produce in the world, but to find the algorithms generating those effects. The interest of these systems is intrinsically computational, even technological, and they gave a strong impulse to endogenous modeling.

Endogenous
Weak endogenous agents are described in terms of their internal mechanisms, essentially rules or sub-symbolic representations. However simplified, these enable the agents to behave autonomously, react to external stimuli, and also show some kind of proactive behavior.

Evolutionary
In their early days, agents were modeled as rigid machines, which might assume a fairly limited set of states on the grounds of a number of fixed rules. This was the case with cellular automata, which originally were fixed to given locations on a toroidal grid and could assume either an active or an inactive state depending on the states of their neighbors; action was taken according to a fixed set of rules [26]. Several factors contributed to promote flexibility, the principal of which was probably the growing interest in evolutionary phenomena and in the capacity of agents to modify themselves and adapt to a changing environment. The fitness function was designed as a function measuring how good a given property or trait is in a pool of traits (for example, a gene in a population): agents performing above a given value survive and reproduce, whereas those performing below it die out. Let us look more closely at computational models based on a fitness function. In these models, agents' behavior is the expression of a vector in which given traits are represented. As the metaphor is clearly the evolutionary mechanism, mutations in the vector may occur with a probability that is established offline. Mutations that contribute to the reproductive success of the agent will be transmitted to future generations after a recombination with the partner's traits.
Genetic algorithms (GAs) are a particular class of algorithms inspired by evolutionary biology, implementing evolutionary mechanisms and processes such as inheritance, mutation, selection, and crossover (also called recombination). Starting from a population of randomly generated individuals, at every generation the fitness of every individual in the population is evaluated, the current population's genotypes are modified (recombined and possibly randomly mutated), and the new population is then used in the next iteration of the algorithm. The fitness function is problem-dependent and, in some problems, hard to define. A GA proceeds through iterated executions of mutation, crossover, and selection operators. Whereas selection is an important genetic operator, the importance of crossover versus mutation is more debatable (references in Fogel [36] support the importance of mutation-based search). Note that, for specific problems, other optimization algorithms may do better than GAs at equal speed, for example simulated annealing. (Simulated annealing (SA) is an algorithm for locating a good approximation to the global optimum of a given function in a large search space. It is often used when the search space is discrete (e. g., all tours that visit a given set of cities). In favorable cases, simulated annealing may be more effective than exhaustive enumeration of the search space [59].) As with other computational models, one main problem is searching the parameter space (mutation probability, recombination probability, etc.). To which value should one set the mutation rate, when no reference can be made to evolutionary phenomena known in nature? A very small mutation rate may lead to genetic drift; a high mutation rate, on the other hand, may lead to loss of good solutions. GAs are an invaluable means for optimizing fitness. A fascinating explanation, the so-called building block hypothesis (BBH), has been proposed to explain this effect [46]. Although heavily criticized (see [102]) and poorly supported by experimental evidence, the BBH provides an intuitive reason for the mechanism of adaptation, which is said to result from iterated recombination of "building blocks", or low-order elements: at any recombination (or generation), past solutions that proved fitter than their own building blocks are recombined to increase the optimality of new solutions. By this means, a genetic algorithm achieves step-wise optimal performance through the iterated recombination of building blocks. Whereas GAs are meant to optimize the genotype of a given population, genetic programming (GP) is used to optimize the ability to perform computational tasks.
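The GA loop described above – evaluate fitness, select, recombine, mutate, iterate – can be sketched minimally (hypothetical code on an invented toy problem: maximize the number of 1-bits in a bit-string genotype; all parameter values are illustrative choices, not recommendations):

```python
import random

# Minimal genetic algorithm over bit-string genotypes.
# Toy fitness function: count the 1-bits (problem-dependent in general).

def fitness(genotype):
    return sum(genotype)

def evolve(pop_size=30, length=16, generations=60, mutation_rate=0.02, seed=1):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # selection: keep the fitter half of the population as parents
        parents = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        offspring = []
        while len(offspring) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, length)        # one-point crossover
            child = mum[:cut] + dad[cut:]
            # mutation: flip each bit with small probability
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            offspring.append(child)
        population = offspring                    # next generation
    return max(population, key=fitness)

best = evolve()
assert fitness(best) >= 14   # near-optimal on this toy problem
```

Truncation selection is used here for brevity; roulette-wheel or tournament selection are common alternatives.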
Building on early work by computer scientists such as John Holland in the 1970s, John R. Koza [62,63,64,65,66] pioneered the application of genetic programming in various contexts. Results obtained by the application of GP include innovations in hardware as well as computer programs.

Learning
Learning algorithms used in agent-based models include classifier systems, reinforcement learning, Q-learning, and neural nets.

Classifier Systems
Classifier systems (CSs) are strongly related to genetic algorithms. Introduced by John Holland [52], CSs consist of sets of rules optimized by a genetic algorithm that operates on reinforcement rather than on a fitness function. There are two main types of CSs: in one, the genetic algorithm recombines different sets of rules; in the other, the genetic algorithm operates within one set only.

Reinforcement Learning
Inspired by behavioral science, reinforcement has been applied to machine learning as a reward mechanism (see [54]). Reinforcement-based algorithms have been applied in economics and game theory as a boundedly rational interpretation of how agents endowed with the same learning mechanisms may converge on one equilibrium. Given a set of world-states, a set of actions A, and a set of "rewards", at each time step the agent chooses an action and receives a certain reward. Reinforcement will cause the agent to maximize the quantity of reward obtained. Recently, reinforcement learning has been used in cognitive models of human problem solving (e. g., [43,47]) and error-processing [53].

Neural Nets
Borrowed from neuroscience, where they are identified as groups of neurons performing a specific physiological function, neural nets currently refer to artificial neural networks [50]. These are made up of interconnected artificial neurons, and are used either to study biological circuits – a simplified view of artificial neural networks may be used to simulate properties of neural networks – or to solve artificial intelligence problems, such as speech recognition, in order to construct software agents or autonomous robots. Neural networks have been viewed as simplified models of the brain, but the relation between this model and brain architecture is far from obvious: after all, computers entail sequential processing based on explicit instructions, whereas neural networks model biological systems as performing parallel processing on implicit instructions. An artificial neural network is an adaptive system: its structure is modified on the basis of information spreading through the network. Neural nets are based on sub-symbolic representations: physical sets of interconnected units, with variable weights on the connections.
The resulting networks are said to take inspiration from the organization and structure of neuronal synapses. As these units stand for nothing in themselves, they hardly fit the notion of "representation" as such; nonetheless, they effectively perform a number of important functions of representations, especially pattern recognition. Given a certain input, a given set of internal units may be gradually "taught" to "recognize" it by contrast with other inputs, thanks to a suitable modification of the weights of given connections. Sub-symbolic representations are:
- Quantifiable, whereas symbolic representations vary, and can be compared, on a qualitative basis.
- Gracefully degradable, whereas symbolic representations are all-or-none. Hence, unlike the latter, they allow for partial and reparable damage to long-term memory.
- Compatible with activity that is neither decided upon nor controlled.
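A toy perceptron shows how such a unit can be gradually "taught" to recognize an input pattern by small adjustments of its connection weights (hypothetical sketch; the AND-pattern task and all parameter values are illustrative assumptions):

```python
# Toy perceptron: a single unit learns to separate two classes of input
# patterns by incremental weight adjustment -- a minimal sub-symbolic
# analogue of "teaching" a network to recognize an input.

def train_perceptron(samples, epochs=20, lr=0.1):
    """samples: list of (input_vector, target in {0, 1})."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, target in samples:
            output = 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
            error = target - output                     # 0 when already correct
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Teach the unit to "recognize" the logical-AND pattern (linearly separable).
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(data)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
assert [predict(x) for x, _ in data] == [0, 0, 0, 1]
```

A single unit can only learn linearly separable patterns; multi-layer networks overcome this limitation.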
Applications of artificial neural networks include pattern recognition [87] (radar systems, face identification, object recognition, etc.), sequence recognition (gesture, speech, and text recognition), medical diagnosis [39], data mining, etc.

Goal-Directed Agents
Goal-directed agents are systems ruled by, and described in terms of, their goals. In a psychological-behavioral sense, a goal is a motivational factor defined on the grounds of a number of observable features, among which are vigorous attainment, persistence in the face of obstacles, and resumption after disruption [69]. Another notion of goal derives from the theory of systems and cybernetic circuits. Control theory introduced the notion of feedback [15] to control the states or outputs of a dynamical system [40]. Its name comes from the information processed in the system: inputs (e. g. a given state of the world) have an effect on the outputs (for example, they elicit a given action) that is measured with sensors, and the result (the world-state brought about by the action) is fed back as input to the process. On the grounds of such a notion of goal, a theory of behavior as planning activity [78] has been constructed, which has played a foundational role in the development of cognitive science. Building on control theory, cognitive scientists define a goal as a wanted state of the world that, if discrepant from the currently perceived world-state, activates the agent and rules its action. A goal is therefore both a trigger and a regulatory state: actions are activated and ruled by their goals [23]. Further analysis has allowed several categories of goals to be identified.
End- and sub-goals. A goal can either be the final end of activity – for instance, survival is probably the supreme finality of natural systems – or a condition for its execution, which may not currently be verified. In such a case, in terms of AI planning, the initial end-goal gives rise to further sub-goals, i. e.
to verify the necessary conditions for action execution.
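The cybernetic notion of goal above – a wanted world-state that activates the agent while discrepant from the perceived state, and regulates action until the discrepancy disappears – can be sketched as a feedback loop (hypothetical code; the thermostat-style example is an illustrative assumption):

```python
# Goal as both trigger and regulator: the agent acts only while the
# perceived world-state differs from the goal, and each action is chosen
# to reduce that discrepancy (a minimal cybernetic feedback loop).

def pursue(goal, perceive, act, max_steps=100):
    """Act until the perceived world-state matches the goal,
    or give up after max_steps."""
    for _ in range(max_steps):
        state = perceive()
        if state == goal:            # no discrepancy: the goal stops driving action
            return True
        act(state, goal)             # action chosen to reduce the discrepancy
    return False

world = {"temperature": 16}
achieved = pursue(
    goal=20,
    perceive=lambda: world["temperature"],
    act=lambda s, g: world.__setitem__("temperature", s + (1 if s < g else -1)),
)
assert achieved and world["temperature"] == 20
```

Once achieved, the goal is dropped (the loop exits), or, in the terminology below, it may be converted into a maintenance goal by keeping the loop running.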
Innate vs. learned goals. Whereas end-goals are usually innate (though examples of sub-goals that become ends in themselves exist), sub-goals are formed, and dropped after successful execution.
Side-goals. Goals may guide the action they trigger, or they may rule actions triggered by other goals; the latter are side-goals. A typical example is utility maximization, which rules a number of activities, especially economic ones, triggered by other goals.
Achievement and maintenance goals. Goals may already be verified, i. e. coincide with currently perceived world-states. In this case, they are usually dropped – as with purchasing a new car or finding a job. In many cases, however, these fortunate circumstances may suggest a new goal, i. e. to keep one's job or car. In the BDI language (see below), the goal to maintain a current world-state is a maintenance goal, whereas the goal to realize a given world-state is an achievement goal [20].
Realizable vs. unrealizable goals. In the logic-based treatment of goals, goals must be consistent and realizable. Unrealizable goals are treated as a subset of desires that are never chosen for action.
The first application of goal-directed agents goes back to AI planning systems [89,97], which soon gave impulse to a Copernican revolution in planning and plan execution, i. e. Distributed Artificial Intelligence (DAI) systems [11]. In the early nineties, the first studies and scientific events [31] on Multi-Agent Systems (MAS) took their cue from the necessity to increase the effective autonomy of distributed systems and their interactive capacity. Both of these requirements, in turn, necessitated the development of mental capacities, especially in the direction of social intelligence [21], which soon led to the appearance of stronger models of agency.

Overview: Strong Agents
Rational, Goal-Oriented Agents

Below, we will review the main approaches to accounting for complex tasks, like deliberative choice. As was premised above, both the endogenous and the exogenous approaches have contributed to this aim. Let us examine them in turn.

Exogenous

Here, rational agents are shown to represent the paradigmatic case of exogenous models of strong agents, whereas BDI and cognitive agents exemplify strong agents endogenously modeled.

Rational Agents

One way of conceptualizing rational action describes it as reason-based action. Hence rational agents are sometimes identified with intelligent agents at large [88,100]. In this contribution we go back to the classic notion of rationality as utility-maximization, and define rational agents as a subset of intelligent agents that act so as to maximize their utility [2,80]. In this more specific sense, a rational agent takes actions expected to maximize its chances of success [4]. Its action is said to bring about this effect no matter how. In other words, a rational agent model is indifferent to the mechanisms that effectively operate within the agent to ensure the hypothesized effect [1]. At most, what is attributed to a rational agent is (limited) knowledge of its environment, consisting of (a memory of) past experience, current information about the environment (possibly including other agents and their expectations [70]), as well as the estimated benefits and the chances of success of its own actions [88]. The classic conceptualization of rational agents derives from the abstract model of homo oeconomicus defined by John Stuart Mill [77] as an entity which aims at possessing wealth and is able to evaluate and adopt the means for achieving it. On these assumptions, rational choice took on the specific meaning of self-interested action. Thus meant, the term "rational" implies a utility function: independent of the benefits themselves, what is said to be rational is the choice that maximizes the distance between the benefits obtained and the costs sustained to obtain them. Rational action theory provides a formal model of practical behavior, mainly applied to social and economic domains. We might consider it the still dominant paradigm in economics, but its hegemony extends to all of the social sciences and to the computational field of agent theory and technology. Essentially, it assumes that individuals choose the best action according to stable utility functions. Rather than describing reality, rational choice theory aims at supporting reasoning (prescriptive use) and at modeling social behavior in highly stylized scenarios.

Requirements of Rational Action

Let us examine the view of the agent implied by the rationality assumptions. Choosing an action rationally requires individual preferences, a set of options or alternatives for action, and a set of expectations about their outcomes. Two important assumptions concern preferences:

Completeness All actions are ranked in an order of preference.
Transitivity If any given action a1 is preferred to action a2, and the latter is preferred to a third option a3, then a1 is preferred to a3.

Consequently, an individual can always order the available options on the grounds of its preferences, which will always be consistent. This gives rise to a utility function, i.e. an ordinal number assigned to alternative actions (such as u(a_i) > u(a_j)), with preferences being defined as relations between these assignments. Unrealistic assumptions are often made, such as full or perfect information and the ability and time to weigh every choice against every other choice. Relaxations of these assumptions are included in theories of bounded rationality, which will be examined later. Although empirical support for rational choice theory is far from satisfactory [48], this paradigm presents three main advantages for the social sciences. First, precisely because it proceeds from stylized facts and abstract phenomena, it proves a powerful instrument for the scientific study of social action. Secondly, and consequently, rational action theory in principle allows future actions to be predicted. Compared to stochastic modeling, the predictive power of social scientific explanation is expected to increase. However, it should be observed that the tautological and a posteriori explanation of behavior deriving from the principle of utility maximization strongly reduces the predictive potential of the theory: as preferences and goals are subjective, any choice, whatever its outcome, results in the maximization of utility (tautology) and can be said to be rational once it has been made (a posteriori). Third, the abstract and strong nature of the theory allows testable hypotheses to be formulated and data to be collected. The experimental potential of social scientific explanation is therefore bound to increase. However, rationality theory has been criticized on empirical grounds. In traditional societies, people's social action, for example the way they exchange goods, follows rules that could not be expected on the grounds of rationality assumptions and which, thanks to the famous work by Mauss [75], seemed to point to a regime of "gift", rather than market, economy.
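The utility-function view sketched in this section can be illustrated computationally. A minimal sketch follows; the actions, outcome probabilities and utility values are invented for illustration and are not taken from the text:

```python
import itertools

# Hypothetical options with expected outcomes: each action maps to a list of
# (probability, utility) pairs over its possible consequences.
actions = {
    "work":  [(0.9, 50), (0.1, -10)],
    "play":  [(1.0, 30)],
    "study": [(0.5, 80), (0.5, 0)],
}

def expected_utility(outcomes):
    return sum(p * u for p, u in outcomes)

# The utility function induces a complete preference order over all options:
ranking = sorted(actions, key=lambda a: expected_utility(actions[a]), reverse=True)

# A rational agent simply picks the top-ranked action.
best = ranking[0]   # "work": 0.9*50 + 0.1*(-10) = 44

# Transitivity holds by construction: if u(a1) >= u(a2) and u(a2) >= u(a3),
# then u(a1) >= u(a3) for every triple of actions.
for a1, a2, a3 in itertools.permutations(actions, 3):
    u1, u2, u3 = (expected_utility(actions[a]) for a in (a1, a2, a3))
    if u1 >= u2 and u2 >= u3:
        assert u1 >= u3
```

Because preferences are reduced to a single numeric assignment, completeness and transitivity come for free; this is exactly why relaxing these assumptions (bounded rationality) complicates the picture.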
In addition, according to experimental scientists (cf. the wide literature cited in [82]), people are much more cooperative than rational choice theory expects. Finally, the rationality assumptions are commonly said to require too much understanding of macroeconomics and economic forecasting for people to make rational decisions. We will return to this argument in the next section. Theoretical critiques are no less strong. Among others, it has been observed that the role of intrinsic motivations [29], for example altruism, is underestimated to the advantage of extrinsic, incentive-based motivations. This objection is not entirely warranted, however, given the indeterminate nature of preferences: rational agents can be subjectively inclined to altruism. In such a case,
even one's extreme sacrifice for the benefit of others is perfectly "rational". Rather, the inadequate account of preference dynamics poses a twofold problem to the social scientist. On one hand, totally indeterminate preferences reduce the predictive potential of the rationality paradigm. On the other, if intrinsic motivations are acknowledged, little is said about how they can be acquired. As a consequence, the origin of altruistic motivations, like that of any other taste and inclination, and the effects of social influence, training, education, and the like on the utility function are overlooked. Another consequence of rational agents' poor internal dynamics is the insufficient account of inner conflicts, such as those between motivations and goals (for example, passing the exam and playing around), between types of motivations, or between individual goals and societal values, which may strongly affect and impair individuals' decision-making. Substantially, these critiques converge on arguing against a purely exogenous approach, which ignores the acquisition, modification and integration of internal states. To say that agents' choices proceed from their preferences is neither particularly interesting nor informative, but making the model more realistic does not help much and may render it tautological. The missing point here is a theory of the internal dynamics of preferences. However, this requires an endogenous approach to rationality.

Endogenous

Strong models of the agent aiming at modeling internal states and mechanisms include a variant of rational choice theory, i.e. bounded rationality, as well as theories and architectures of cognitive agents, including but not limited to BDI agents.

Bounded Rationality

Bounded rationality is a variant of rational choice theory that starts from the obvious consideration that perfect knowledge is never available, and that agents, even investors, deal with uncertainty and risk in business.
Far from regretting the special attention paid to economic behavior and the homo oeconomicus model, the proposers of bounded rationality questioned the utility of full rationality as a paradigm for the study of economic behavior itself. As Fox and Tversky [37] observed, the very notion of uncertainty was first questioned by theoretical economists. In [61], Knight distinguished between measurable uncertainty or risk, and unmeasurable uncertainty, suggesting that entrepreneurs deal with unmeasurable uncertainty, rather than risk. At the same time, Keynes [59] distinguished between probability of occurrence of a given event and the weight of the evidence supporting it. Nonetheless, game theorists showed little interest in the issue of vagueness of probability [91].
When one considers it carefully, bounded rationality is situated at the borderline between endogenous and exogenous approaches. Prospect Theory [56], one of the main theoretical contributions of bounded rationality, was proposed as a psychologically realistic alternative to rationality theory, in particular to expected utility theory. The authors meant to account for empirical evidence showing that when people evaluate potential losses and gains under risk, e.g. in financial decisions, the probability of occurrence acts differently on the values of alternative options depending on whether agents are gaining utility or avoiding losses. Rather than constructing a model that shows how this happens in the minds of the agents, the authors expressed this evidence in an asymmetric utility function. This function shows a bigger impact of losses than of gains (loss aversion) and expresses the fact that people tend to overreact to small-probability events, but underreact to medium and large probabilities. The theory, meant as an endogenous development of full rationality, is thus expressed in an exogenous formalism. Nonetheless, bounded rationality has done much to revise the assumptions of full rationality theory. What is more, Herbert Simon [94] did actually argue for an endogenous model, when he claimed that in order to account for rational action, it is necessary to study the mental processes and mechanisms allowing agents to formulate and solve complex problems and process (receive, store, retrieve, transmit) information (see also [98], p. 553, quoting Simon). One important aspect of the mental endowment is represented by heuristics [57,83], i.e. powerful informal methods to solve problems in complex situations. In psychology, heuristics are simple, efficient rules, rules of thumb, which either evolve or are learned, and which support people in making decisions, coming to judgments, and solving complex problems under incomplete information.
When agents cannot compute the expected utility of every alternative action, they resort to heuristics, which are applied in different contexts and might have been acquired over the course of life, through repeated decision-making processes. These rules work well under most circumstances, but in certain cases lead to systematic cognitive biases, i.e. very basic statistical and memory errors that are common to all human beings. It should be noted that most models of bounded rationality did not strictly follow Simon's ideas. Some authors preferred to model people's decisions as a sub-optimal version of rationality [32] or to study how people cope with their inability to optimize. For others [45], alternatives to full rationality in decision making frequently lead to better decisions.
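The asymmetric value function and probability weighting described above can be sketched numerically. The functional forms below follow the commonly cited Kahneman–Tversky formulation, with their published median parameter estimates; treat the exact numbers as illustrative rather than as part of this text:

```python
def prospect_value(x, alpha=0.88, lam=2.25):
    """Prospect Theory value function: concave for gains, convex and
    steeper for losses (loss aversion, governed by lam > 1)."""
    if x >= 0:
        return x ** alpha
    return -lam * ((-x) ** alpha)

def weight(p, gamma=0.61):
    """Probability weighting: overweights small probabilities and
    underweights medium and large ones."""
    return p ** gamma / ((p ** gamma + (1 - p) ** gamma) ** (1 / gamma))

# Losses loom larger than equal gains:
assert abs(prospect_value(-100)) > prospect_value(100)

# Small probabilities are overweighted, large ones underweighted:
assert weight(0.01) > 0.01
assert weight(0.9) < 0.9
```

Note how the evidence is captured entirely in the shape of the function, with no model of the underlying mental process: exactly the point made above about an endogenous insight expressed in an exogenous formalism.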
Deliberative Agents

Let us now turn our attention to fully endogenous models of complex agents, such as BDI architectures and, more generally, cognitive agents. Both are goal-directed systems, not incompatible with utility maximization but actually implementing it. However, goal-directed agents imply neither explicit computation of utility maximization nor heuristics for achieving it.
Cognitive systems. These are goal-directed agents endowed with mentalistic notions, such as beliefs, intentions and obligations, and with the capacity to manipulate mental symbols and accomplish a number of operations on them, including inference, reasoning, problem-solving, planning, etc. Symbolic representations stand for the states of the world they refer to in such a manner that these world states can be:
Compared: a given representation of a worldstate ws1 can be compared with another to check whether they are equal or not. This is fundamental in goal-directed action.
Manipulated: e.g. asserted or denied, evaluated and possibly selected along common criteria, modified and revised, and therefore used as instruments for reasoning, action planning, etc.
Embedded into one another, thereby giving rise to meta-beliefs. Among other reasons, meta-beliefs are useful for the representation of others' beliefs (the so-called "theory of mind"), which is of vital importance in social life [16,85]. Of late, theory of mind has been used to refer to a specific cognitive capacity: the ability to attribute mental states (beliefs, intents, desires, pretending, knowledge, etc.) to oneself and others, and to understand that others have beliefs, desires and intentions that are different from one's own [27,49].
BDI agents. A BDI agent is an architecture for a special type of cognitive agent, more specifically for intention-driven agents, endowed with particular mental attitudes, viz. Beliefs, Desires and Intentions (BDI).
The philosophical basis of BDI agents is Bratman's theory of practical reasoning; the architecture was formally described by Anand Rao and Michael Georgeff [86], who combined temporal logic with a modal logic enriched with beliefs, desires and intentions.
Beliefs Beliefs are the agent's representations of the world (including itself and other agents), which are neither necessarily true nor complete. Inference rules generating new beliefs make it possible to enrich the agent's belief base.
Desires Desires represent the motivational state of the agent. Goals are a subset of desires, i.e. consistent and achievable ones.
Intentions Intentions are executable goals, i.e. goals that the agent has chosen for action and to which it has to some extent committed (in implemented systems, this means the agent has begun executing a plan).
Aimed at representing the future as an epistemic tree that allows agents to reason upon alternative courses of action, BDI architectures lend themselves to modeling the interplay among different mentalistic notions. In particular, an extension of BDI, the BOID architecture, concerns the representation of the interaction among beliefs, obligations, intentions, and desires, and has been used to support reasoning and deliberative choice on legal norms [14].

Mental Representations and Their Dynamics

An endogenous approach to the computational model of intelligent agents revolves around the main question of how agents represent their goals and the information needed to achieve them [28,74,95]. This issue, called knowledge representation, arises at the intersection between cognitive science and Artificial Intelligence: AI scientists have borrowed methods of knowledge representation from cognitive scientists, since it was of primary importance for them to design programs that could store knowledge and re-apply it to analogous problems. Knowledge representation techniques, such as frames, scripts, etc., stem from theories of human information processing. The main goal is to represent knowledge in such a way as to facilitate reasoning, planning, problem solving and the like. The main questions posed by AI researchers concerning knowledge representation focus on the format of representations, the nature of knowledge, the preferability of general-purpose vs. domain-specific representation schemata, the degree of expressiveness and detail of such schemata, whether representation should be declarative or procedural, etc.
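The BDI attitudes described above (beliefs, desires, intentions) can be sketched as a minimal deliberation loop. The data structures and selection rules here are illustrative simplifications, not the Rao–Georgeff formalism:

```python
class BDIAgent:
    """Minimal BDI-style deliberation: beliefs filter desires into
    goals, and one goal is committed to as the current intention."""

    def __init__(self, beliefs, desires):
        self.beliefs = set(beliefs)    # facts the agent holds true
        self.desires = list(desires)   # (name, preconditions, importance)
        self.intention = None

    def options(self):
        # Goals: desires whose preconditions are all believed to hold.
        return [d for d in self.desires if d[1] <= self.beliefs]

    def deliberate(self):
        # Commit to the most important achievable goal, if any.
        goals = self.options()
        if goals:
            self.intention = max(goals, key=lambda d: d[2])[0]
        return self.intention

agent = BDIAgent(
    beliefs={"exam_scheduled", "has_textbook"},
    desires=[
        ("pass_exam", {"exam_scheduled", "has_textbook"}, 10),
        ("go_travelling", {"has_ticket"}, 7),   # no ticket: not achievable
        ("play_around", set(), 3),
    ],
)
agent.deliberate()   # commits to "pass_exam"
```

Only consistent, achievable desires survive as goals, and the chosen goal becomes an intention: the two refinement steps the text describes.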
In AI, many methods of knowledge representation have been tried since the early seventies, e.g. theorem proving and expert systems, with varying degrees of success. The major knowledge representation systems go back to the eighties, when, for example, large databases of language information were built, rendering knowledge representation more feasible. Several programming languages were developed, such as Prolog, KL-ONE and XML, facilitating information retrieval and data mining.
The necessity to store and manipulate knowledge in a formal way, so that it may be used by mechanisms to accomplish a given task, has drawn the attention of knowledge representation scientists. Expert systems, machine translation, and information retrieval are good examples of applications. Probably the most extensively used structure for knowledge representation since the 1960s is the knowledge frame [35,79], with its own name and a set of attributes, or slots, that contain values. Frames have been used for expert systems in object-oriented programming, with inheritance of features described by an IS-A link, a type of relation that has posed a number of problems of semantic interpretation [13]. However, in an endogenous perspective, it should probably be observed that frame structures are well-suited for the representation of stereotypes and other socio-cognitive patterns. A script [92] is a frame that describes events, especially behavioral events. The usual example is going to a restaurant, with steps including waiting to be seated, receiving a menu, ordering, etc.

Goals

Typologies of goals are not limited to those examined in Sect. "Goal-Directed Agents", but include other sub-categories, which are based on the interplay among goals and other mental notions, such as beliefs and obligations.
Active vs. inactive. Maintenance goals are not always active. A goal is active when it rules the system's behavior. When a goal is represented in the agent's mind but does not rule its behavior, it is inactive. A maintenance goal may be activated when it is likely to be thwarted. An inactive goal may be a maintenance goal or a side-goal, or an achievement goal that becomes (momentarily) inactive because it is incompatible with a more important one (emergency or conflict).
Executed vs. waiting.
An active goal may be chosen for action, or it may be interrupted during its execution and kept waiting in a queue when the conditions for its execution turn out to be false and must be realized before the goal can be achieved.
Chosen. An active goal may be chosen for action and transformed into an intention when (a) it is not yet realized, but it is (b) realizable and (c) more important than other active goals, if any. These clauses are conjunctive: if the goal is already realized, or outcompeted, it becomes inactive; if it is unrealizable, it is dropped (or recedes to the status of a simple desire).
Individual vs. social. Goals are individual when they mention no one other than the hosting agent. Otherwise, they are weak or strong social goals. Weak social goals
mention others as models to imitate or as sources of influence (for example, their future actions are perceived as either obstacles or favoring events). Strong social goals mention others as targets of influence (simple request, persuasion, manipulation or coercion).
Prosocial vs. aggressive. A goal is prosocial when it aims at realizing a worldstate because, and to the extent that, this is wanted by another agent. The process by means of which an agent comes to have another's goal as its own is called goal-adoption. This may be either an end in itself (benevolence) or a sub-goal serving an individual goal (e.g., exchange). An aggressive goal is a goal that aims at thwarting another's goal because, and to the extent that, this is wanted by the latter. Again, an aggressive goal may be an end-goal (hate) or a sub-goal, and in the latter case it may be instrumental to a selfish or a prosocial goal (preventing crime), even to the benefit of the victim (surgery, education, etc.).
Shared vs. collective. Finally, a goal may be shared by a set of agents. It is collective [25] when it mentions a set of agents in which any one is necessary, but none is sufficient, to achieve the goal shared by them all (an orchestra or a football team are typical examples of groups pursuing collective goals).

Cognitive Dynamics

In a rather generic sense, an autonomous agent is a self-interested agent. In a more specific sense, an autonomous agent is one which has internal criteria to select among inputs. Inputs might generate beliefs and goals. An autonomous agent is therefore characterized by a "double filter architecture", allowing both beliefs and goals to be selected (cf. [17]). These two filters are sequential, but at the same time they allow for an integrated processing of mental representations.
Filtering beliefs. Thanks to this filter, agents have control over the beliefs they form. This filter is rather complex and implies that a number of tests be executed on a candidate belief against several distinct criteria.
These criteria are pragmatic or epistemic. Epistemic criteria include:
Credibility: agents control, among other properties, the coherence of candidate beliefs with previous beliefs, and the reliability of the source of the candidate belief: agents accept information from other agents (Gricean principles) provided they have no reason to doubt their sincerity or competence.
Non-negotiability, or the Pascal law: to believe or not is a "decision". However, it cannot be made in view of one's pragmatic utility, but only in view of one's epistemic utility. In social interaction, we cannot use threats ("argumentum ad baculum") or promises to make people believe something. The difference between persuading to do and persuading to believe is crucial. Since beliefs control goals, this represents a further protection of agents' autonomy.
Pragmatic criteria concern the reasons for believing something. Typologies of beliefs generally concern the format of their representation (declarative, procedural), the degree of certainty, and the levels of nesting (one can believe something without believing that one believes it ...). These typologies are well known. Perhaps it is less obvious that beliefs may have a different "status" in the mind according to the motives for acceptance. Language provides a rich vocabulary: superstitious belief, creed, faith, credence, doctrine, postulate, axiom, principle, conception, idea, view, opinion, and many others. These beliefs vary on several, often quantitative, dimensions, such as certitude (subjective truth value), retractability (how likely the belief is to be modified), and connectivity (how much a belief is connected with other beliefs). One interesting quantitative dimension is the "force" of beliefs (cf. the role of this dimension in Social Impact Theory, cf. [68]): beliefs vary on how strongly they are held. This is related to certitude, but also to the motives of acceptance, which may lead the agent to ignore the belief's truth-value. For example:
Self-protection and self-enhancement: agents may be led to accept one among several competing beliefs because of the belief's positive effect on their self-esteem or self-concept.
Commitment: commitment to a given (set of) belief(s) provides one important reason for accepting further, consistent beliefs, despite or independent of incompatible evidence; agents that accept beliefs out of commitment do not check their truth value.
Hypothetical and counterfactual reasoning: beliefs may be (transitorily) held as means to reason and carry out operations (demonstrations, proofs, experiments). A good example is the priest accepting the atheist's point of view in order to dismantle it.
Communication: a psychotherapist may "accept" the delusions of her patient in order to communicate with him and give a clinical sense to his fantasies; here the goal is not to carry out a counterfactual argumentation, but to understand the meaning of the delusions.
Empathy: agents may want to share the views of their close connections.
Risk-taking or gambling: agents may participate in lotteries, accepting one alternative and investing (money) in it; in such a case, agents will hold an uncertain belief but behave as if it were certain.
Prudence: agents may accept uncertain information (for example, rumours, gossip, even calumnies) and behave as if it were certain; unlike the preceding situation, a risk-minimization strategy applies.
Filtering goals. There are at least two fundamental tests which are performed on goals: self-interested goal-generation (an agent is autonomous if, whatever new goal it comes to have, there is at least one other goal of that agent for which, in the agent's beliefs, the former is a means) and belief-driven goal-processing (any modification of an autonomous agent's goals can only be allowed by a modification of its beliefs) (cf. Fig. 2). Both these filters have interesting social and cultural consequences. First, agents' minds are modified through a process of belief-formation or belief-revision. Secondly, belief-formation and -revision are decision-based and selective processes. With Cavalli-Sforza and Feldman [18], we will speak of belief "acceptance". A cultural process is a process interspersed with decisions taken by the agents involved. But a decision-based process is not necessarily explicit and reflected upon: mental filters do not necessarily operate consciously, and agents may not be able to report on them. Thirdly, agents will never accept beliefs under threat, or in order to obtain a benefit in return (non-negotiability). Fourth, agents may accept beliefs for different reasons, and these will affect the probability that such beliefs are held, their strength, and their transmission. The social mechanisms of influence and transmission are strongly intertwined with criteria and motives of acceptance. This leads us to the question of limited autonomy.
Limited autonomy. The model of agent outlined above appears rather abstract and unrealistic. In real life, agents appear liable to external influence, prone to accepting and transmitting prejudices, victims of superstition, prey to false doctrines and creeds. Indeed, autonomy is limited both at the belief and at the goal level: agents are liable to being influenced by external (including social) inputs. At the level of both goals and beliefs, autonomy is limited in a very elementary sense: agents are designed to take external inputs into account, if only to discard them later. If an input is received, the filter processing is activated (cf. [17]). At the goal level, agents cannot avoid accepting very elementary requests: if somebody asks a passer-by the time, the latter cannot simply keep ignoring the request. At most, the passer-by can pretend that she did not perceive it. But if this perception cannot be concealed, some answer will be given, if only to say that one has no idea what the time is (minimal level of adoption). Of course, this type of influence is rather superficial and ephemeral. But it paves the way for other, more relevant types of influence. Obviously, agents' autonomy is also limited because they are not always self-sufficient. They may need the help of other agents to achieve their goals (social dependence), and this causes agents to adopt others' goals and to accept their requests. However, one's adoption of others' goals will always be a means for the achievement of one's own goals, for example through social exchange or cooperation. In turn, these social actions favor the transmission of beliefs, including action plans, techniques, procedures, rules, conventions, and social beliefs. Finally, agents' autonomy is limited by norms, which are aimed at regulating agents' behaviors. But agents may accept or reject norms, comply with them or violate them, always according to their internal criteria for acceptance.
At the level of beliefs, agents' liability to influence varies depending on the type of belief. For example, social agents are strongly permeable to social evaluations, rumors, gossip, even calumnies (cf. [24]). Rumors and gossip are accepted out of prudence, and, as we will see, this favors their spreading. Indeed, these are important phenomena of memetic transmission, which spread social labels, stigmas, and prejudices, but also reputation, social hierarchies and other institutions.

Rational, Goal-Oriented Agents, Figure 2 The "double filter" architecture

Concluding Remarks and Future Directions
What are the similarities and the respective advantages of these different paradigms in agent modeling, especially rational versus goal-directed agents? As initially stated, these should not be viewed as mutually exclusive. Hence, the question arises as to when, for which scientific objectives, and how they could be integrated. But first, let us see what they have in common and where they differ. Rational and goal-directed agents have three major common properties: autonomy, (limited) knowledge, and strategic action. These attributes characterize the fundamental endowment of intelligent agents, allowing them to act on their own criteria, reasons, and current beliefs. It should be noted, however, that these properties are characterized differently within the two approaches.
Autonomy. Whereas rational agents are self-interested systems (with self-interest being closely related to the notion of adaptation and fitness in a biological sense), goal-directed agents are systems pursuing their own goals, whether these are means to their own fitness or not. A person committing suicide is a goal-directed agent who autonomously decides to take his or her own life. As this example shows, goal-directed agency is an abstract and general category that can easily be applied to natural systems, which do not always pursue their self-interest.
Limited knowledge. Both approaches model agents' beliefs as limited and uncertain. However, the rational approach does not account for the internal dynamics of beliefs: their acquisition, revision, and eventual dropping. Even adaptive views of agents fail to account for the internal dynamics of their motivations, and for the role these have in the management, update, and decay of their beliefs. The double filter architecture, by contrast, allows the cognitive dynamics of goals and beliefs to be accounted for.
Strategic action. As to this aspect, i.e. the social application of rational action, one should acknowledge the superiority of the rational agents paradigm, as it is inherently devoted to the solution of social and collective dilemmas. However, in the rationality paradigm, strategic action is meant exclusively in the weak social sense defined above, i.e. as the ability to take into account what other agents know, can do and will do.
The strong meaning, instead, is essentially overlooked, since rational agents are not modeled as goal-directed: they cannot be said to pursue the goal of modifying others' beliefs, goals, and actions. This is one of the main reasons why it is difficult to integrate social influence, training, education, and the other phenomena of social change into the rationality paradigm. Despite the aforesaid shortcomings, the rationality paradigm is undeniably influential not only in the social, but also in the computational field. In comparison, the impact of goal-directed agents is far from satisfactory. There are multiple reasons for such a state of affairs. First, a persistent cultural hegemony of Homo Oeconomicus should not be ignored. However indeterminate
subjective preferences are in principle declared to be, investment and economic business are tacitly assumed to be the master domains of application of rationality. Secondly, the rationality paradigm has all the advantages of an exogenous, in particular a mathematical, approach: (a) quantitative data, (b) theorem-proving, and (c) reduced computational complexity, as the rational agent architecture is rather simple. Thirdly, and most importantly, the rationality paradigm enjoys the powerful simplicity of abstraction, the invaluable scientific appeal of stylized facts. No other paradigm, so far, has been able to design such simple, highly abstract scenarios, characterized by the same heuristic power, as those represented by the games worked out by rational choice theory for the study of strategic action. As a difficult but fascinating exercise, let us endeavor to envisage the future of these paradigms. Whilst the rationality paradigm can reasonably be expected to maintain its hegemony in the short term, what can we expect to happen in the medium and long run? The growing success of the neuroscientific approach (think of mirror neurons) leads us to foresee a dominance of the neural nets approach not only within the behavioral sciences (e.g., psychology), but also, thanks to embodied agents and robotics, within the computational sciences. How and to what extent can this model answer the questions currently answered by rational agents, on one hand, and by goal-directed agents, on the other? Largely, the answer depends on whether or not the predictable advances of the neuroscientific approach will empower agent technologies, for example in the direction of social and collective intelligence, the acquisition of norms and social values, the formation and recognition of institutions, etc. However, the answer might also depend on the potential cross-over between the neuroscientific approach and the complexity one.
To what extent can embodied agents and neural nets be combined with sociophysical models? The future status of the rationality paradigm depends on the probability of this cross-fertilization. More generally, the future of the agent paradigms depends on the probability of integration between endogenous and exogenous paradigms.

Bibliography

1. Allingham M (2002) Choice theory: a very short introduction. Oxford University Press, Oxford
2. Anand P (1993) Foundations of rational choice under risk. Oxford University Press, Oxford
3. Aristotle (2006) Metaphysics book theta. Oxford University Press, Oxford [translated with an introduction and commentary by Stephen Makin]
4. Arrow KJ (1987) Economic theory and the hypothesis of rationality. In: The new palgrave: a dictionary of economics, vol 2. Macmillan, Houndmills, pp 69–75
5. Aumann RJ (2006) War and peace. Nobel lecture. PNAS 103(46):17075–17078
6. Axelrod R (1997) The complexity of cooperation. Princeton University Press, Princeton
7. Bates J (1994) The role of emotion in believable agents. Commun ACM 37(7):122–125
8. Bates J, Loyall BA, Reilly SW (1992) An architecture for action, emotion, and social behaviour. Technical Report CMU-CS-92-144. School of Computer Science, Carnegie Mellon University, Pittsburgh
9. Bigelow J, Rosenblueth A, Wiener N (1943) Behavior, purpose and teleology. Philos Sci 10:18–24
10. Binmore K (1994) Playing fair: game theory and the social contract, vol 1. MIT Press, Cambridge. Vol 2: Just playing: game theory and the social contract
11. Bond AH, Gasser L (eds) (1988) Readings in distributed artificial intelligence. Morgan Kaufmann Publishers, San Mateo
12. Boutilier C (1995) Generalized update: belief change in dynamic settings. In: Proc of the 14th international joint conference on artificial intelligence (IJCAI'95). Morgan Kaufmann, Montreal, pp 1550–1556
13. Brachman RJ (1983) What IS-A is and isn't. An analysis of taxonomic links in semantic networks. IEEE Comput 16(10):30–36
14. Broersen J, Dastani M, Hulstijn J, Huang Z, van der Torre L (2001) The BOID architecture: conflicts between beliefs, obligations, intentions and desires. In: Mueller JP, Andre E, Sen S, Frasson C (eds) Proc of the 5th international conference on autonomous agents, Montreal, 28 May–01 June 2001, pp 9–17
15. Bush V (1929) Operational circuit analysis. Wiley, New York
16. Carruthers P, Smith PK (eds) Theories of theories of mind. Cambridge University Press, Cambridge
17. Castelfranchi C (1997) Representation and integration of multiple knowledge sources: issues and questions. In: Cantoni V, Di Gesù V, Setti A, Tegolo D (eds) Human and machine perception: information fusion. Plenum, New York, pp 236–254
18. Cavalli-Sforza LL, Feldman MW (1981) Cultural transmission and evolution: a quantitative approach. Princeton University Press, Princeton
19. Chavalarias D (2007) Endogenous distribution in multi-agent models: the example of endogeneization of ends and time constants. In: M2M 2007, 3rd international Model-to-Model workshop, Marseille, France, 15–16 March 2007
20. Cohen PR, Levesque HJ (1990) Rational interaction as the basis for communication. In: Cohen PR, Morgan J, Pollack ME (eds) Intentions in communication. MIT Press, Cambridge, pp 221–256
21. Conte R (1999) Artificial social intelligence: a necessity for agent systems' developments. Knowl Eng Rev 14:109–118
22. Conte R (2007) From simulation to theory (and backward). In: Squazzoni F (ed) Proc of the epistemological perspectives of simulation, 2nd edn. Springer (in press)
23. Conte R, Castelfranchi C (1995) Cognitive social action. UCL Press, London
24. Conte R, Paolucci M (2002) Reputation in artificial societies: social beliefs for social order. Kluwer, New York
25. Conte R, Turrini P (2006) Argyll-feet giants: a cognitive analysis of collective autonomy. Cogn Syst Res 7:209–219 26. Conway JH (1976) On numbers and games. Academic Press, London 27. Courtin C, Melot A-M (2005) Metacognitive development of deaf children: lessons from the appearance-reality and false belief tasks. J Deaf Stud Deaf Educ 5:266–276 28. Davis R, Shrobe H, Szolovits P (1993) What is a knowledge representation? AI Mag 14(1):17–33 29. Deci EL (1975) Intrinsic motivation. Plenum, New York 30. Deffuant G, Amblard F, Weisbuch G, Faure, T (2002) How can extremism prevail? A study based on the relative agreement interaction model. J Artif Soc Soc Simul 5(4). http://jasss.soc. surrey.ac.uk/5/4/1.html 31. Demazeau Y, Mueller JP (eds) (1990) Decentralized AI. Elsevier, North Holland 32. Elster J (1983) Sour grapes: studies in the subversion of rationality. Cambridge University Press, Cambridge 33. Epstein JM (2007) Generative social science. studies in agentbased computational modelling. Princeton University Press, Princeton 34. Fehr E, Fischbacher U (2003) The nature of human altruism. Nature 425:785–791 35. Fikes RE, Kehler T (1985) The role of frame-based representation in knowledge representation and reasoning. Commun ACM 28(9):904–920 36. Fogel DB (ed) (1998) Evolutionary computation: the fossil record. IEEE Press, New York 37. Fox CR, Tversky A (1995) Ambiguity aversion and comparative ignorance. Q J Econ 110(3):585–603 38. Frank MJ (2005) Dynamic dopamine modulation in the basal ganglia: a neurocomputational account of cognitive deficits in medicated and non-medicated parkinsonism. J Cogn Neurosci 17:51–72 39. Frank RH (1987) If homo economicus could choose his own utility function, would he want one with a conscience? Amer Econ Rev 77(4):593–604 40. Franklin GF, Powell JD, Emami-Naeini A (1994) Feedback control of dynamic systems. Addison-Wesley, Reading 41. Franklin S, Graesser A (1997) Is it an agent, or just a program? 
A taxonomy for autonomous agents. In: Intelligent agents, vol III. Springer, Berlin, pp 21–35 42. Friedman M, Savage L J (1948) The utility analysis of choices involving risk. J Polit Econ 56(4):279–304 43. Fu W-T, Anderson JR (2006) From recurrent choice to skill learning: a reinforcement-learning model. J Exp Psychol Gen 135(2):184–206 44. Genesereth MR, Ketchpel SP (1994) Software agents. Commun ACM 37(7):48–53 45. Gigerenzer G, Selten R (2002) Bounded rationality. MIT Press, Cambridge. reprint edition 46. Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Kluwer, Boston 47. Gray WD, Sims CR, Fu W-T, Schoelles MJ (2006) The soft constraints hypothesis: a rational analysis approach to resource allocation for interactive behavior. Psychol Rev 113(3):461– 482 48. Green TP, Shapiro I (1994) Pathologies of rational choice theory: A critique of application in political science. Yale University Press, London
2591
2592
Rational, Goal-Oriented Agents
49. Hare B, Call J, Tomasello M (2001) Do chimpanzees know what conspecifics know and do not know? Anim Behav 61:139–151 50. Haykin S (1999) Neural networks: a comprehensive foundation. Prentice Hall, Englewood Cliffs 51. Hegselmann R, Krause U (2002) Opinion dynamics and bounded confidence: models, analysis and simulation. J Artif Soc Soc Simul 5(3). http://jasss.soc.surrey.ac.uk/5/3/2. html 52. Holland J (1989) Using classifier systems to study adaptive nonlinear networks. In: Stein DL (ed) Lectures in the sciences of complexity. Addison Wesley, New York 53. Holroyd CB, Coles MGH (2002) The neural basis of human error processing: reinforcement learning, dopamine, and the error-related negativity. Psychol Rev 109:679–709 54. Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: a survey. J Artif Intell Res 4:237–285 55. Kahneman D (2003) Maps of bounded rationality: psychology for behavioral economics. Amer Econ Rev 93(5):1449–1475 56. Kahneman D, Tversky A (1979) Prospect theory: an analysis of decision under risk. Econometrica XLVII:263–291 57. Kahneman D, Tversky A, Slovic P (eds) (1982) Judgement under uncertainty: heuristics and biases. Cambridge University Press, Cambridge 58. Katsuno H, Mendelzon AO (1991) On the difference between updating a knowledge base and revising it. In: Proc of the 2nd international conference on the principles of knowledge representation and reasoning (KR’91). Morgan Kaufmann, San Matteo, pp 387–394 59. Keynes JM (1921) Treatise on probability. MacMillan, London 60. Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4598):671–680. http://www.cs.virginia.edu/cs432/documents/sa-1983.pdf; http://citeseer.ist.psu.edu/kirkpatrick83optimization.html 61. Knight FH (1921) Risk, uncertainty, and profit. In: Hart, Schaffner and Marx prize essays, vol 31. Houghton Mifflin, Boston, New York 62. 
Koza JR (1990) Genetic programming: a paradigm for genetically breeding populations of computer programs to solve problems. Stanford University Computer Science Department technical report STAN-CS-90-1314 63. Koza JR (1992) Genetic programming: on the programming of computers by means of natural selection. MIT Press, Cambridge 64. Koza JR (1994) Genetic programming II: automatic discovery of reusable programs. MIT Press, Cambridge 65. Koza JR, Bennett FH, Andre D, Keane MA (1999) Genetic programming III: darwinian invention and problem solving. Morgan Kaufmann, San Mateo 66. Koza JR, Keane MA, Streeter MJ, Mydlowec W, Yu J, Lanza G (2003) Genetic programming IV: routine human-competitive machine intelligence. Kluwer, Norwell 67. Kuznar LA, Frederick WG (2003) Environmental constraints and sigmoid utility: implications for value, risk sensitivity, and social status. Ecol Econ 46(2):293–306. September 2003 68. Latané B, Nowak A, Liu JH (1995) Distance matters: physical space and social impact. Pers Soc Psychol Bull 21:795–805 69. Lewin K (1921) Das Problem der Willensmessung und das Grundgesetz der Assoziation, vol I. Psychol Forsch 1:191–302
70. Lewis DK (1969) Convention: a philosophical study. Harvard University Press, Cambridge 71. Lorenz J (2007) Continuous opinion dynamics of multidimensional allocation problems under bounded confidence: More dimensions lead to better chances for consensus. Eur J Econ Soc Syst EJESS 19:213–227. Beliefs, Norms, and Markets 72. Lynch A (1999) Thought contagion: how belief spreads through society by. Basic Books, New York 73. Malarz K (2006) Truth seekers in opinion dynamics models. Int J Mod Phys C 17(10):1521–1524 74. Markman AB (1998) Knowledge representation. Lawrence Erlbaum Associates, Hillsdale 75. Mauss M (2006) The gift. The form and reason for exchange in archaic societies. Routledge, London 76. McFarland D (1995) Opportunity versus goals in robots, animals and people. In: Roitblat HL, Meyer J-A (eds) Comparative approaches to cognitive science. MIT Press, Cambridge, pp 415–433 77. Mill JS (1874) On the definition of political economy; and on the method of investigation proper to it. In: London and Westminster review. Essays on some unsettled questions of political economy, 2nd edn. Longmans, Green, Reader & Dyer, London. 1874, essay 5, paragraphs 38 and 48 78. Miller GA, Galanter E, Pribram KH (1960) Plans and the structures of behaviour. Holt, New York 79. Minsky M (1975) A framework for representing knowledge. In: Winston PH (ed) The psychology of computer vision. McGraw-Hill, New York 80. Nozick R (1993) The nature of rationality. Princeton University Press, Princeton 81. Nwana HS (1996) Software agents: an overview. Knowl Eng Rev 11(3):1–40. September 1996 82. Ostrom E (2000) Collective action and the evolution of social norms. J Econ Pers 14(3):137–158 83. Pearl J (1983) Heuristics: intelligent search strategies for computer problem solving. Addison-Wesley, New York. p vii 84. Perry J (1980) Belief and acceptance. Midwest Stud Philos V:533–542 85. Premack DG, Woodruff G (1978) Does the chimpanzee have a theory of mind? Behav Brain Sci 1:515–526 86. 
Rao AS, Georgeff MP (1995) BDI agents from theory to practice. In: Proc of the 1st international conference on multi agent systems ICMAS. AAAI Press, MIT Press, San Francisco 87. Ripley BD (1996) Pattern recognition and neural networks. Cambridge University Press, Cambridge 88. Russell SJ, Norvig P (2003) Artificial intelligence: a modern approach, 2nd edn. Prentice Hall, Upper Saddle River 89. Sacerdoti ED (1977) A structure for plans and behavior. Elsevier, New York 90. Salzarulo L (2006) A continuous opinion dynamics model based on the principle of meta-contrast. J Artif Soc Soc Simul 9(1). http://jasss.soc.surrey.ac.uk/9/1/13.html 91. Savage L J (1954) The foundations of statistics. Wiley, New York 92. Schank RC, Abelson RP (1977) Scripts, plans, goals, and understanding: an inquiry into human knowledge structures. Lawrence Erlbaum, Hillsdale 93. Shoham Y (1993) Agent-oriented programming. Artif Intell 60(1):51–92
Rational, Goal-Oriented Agents
94. Simon H (1957) A behavioral model of rational choice. In: Models of man, social and rational: mathematical essays on rational human behavior in a social setting. Wiley, New York 95. Sowa JF (2000) Knowledge representation: logical, philosophical, and computational foundations. Brooks/Cole, New York 96. Sun R (2002) Duality of the mind. Erlbaum, New York 97. Wilensky R (1983) Planning and understanding: a computational approach to human reasoning. Advanced book program. Addison-Wesley, Reading 98. Williamson O (1981) The economies of organization: the transaction cost approach. Amer J Sociol 87:548–577
99. Windrum P, Fagiolo G, Moneta A (2007) Empirical validation of agent based models: alternatives and prospects. JASSS J Artif Soc Soc Simul 10(2):8. http://jasss.soc.surrey.ac.uk/10/2/ 8.html 100. Wooldridge MJ (2000) Reasoning about rational agents. MIT Press, Cambridge 101. Wooldridge M, Jennings NR (1995) Intelligent agents: theory and practice. Knowl Eng Rev 10(2):115–152 102. Wright AH, Vose MD, Rowe JE (2003) Implicit parallelism. In: Cantu-Paz E et al (eds) Proc of GECCO (the genetic and evolutionary computation conference). Lecture notes in computer science. Springer, Berlin, pp 1003–1014
2593
2594
Reaction-Diffusion Computing
ANDREW ADAMATZKY
University of the West of England, Bristol, UK

Article Outline
Glossary
Definition of the Subject
Introduction
Computational Geometry
Logical Universality
Memory
Programmability
Robot Navigation and Massive Manipulation
Future Directions
Acknowledgments
Bibliography

Glossary
Belousov–Zhabotinsky (BZ) reaction is a chemical reaction in which an organic substrate is oxidized by bromate ions in the presence of acid and a one-electron-transfer redox catalyst. The reaction produces oscillations in well-stirred reactors and traveling waves in thin layers.
Cellular automaton is an array of locally connected finite automata which update their discrete states in discrete time depending on the states of their neighbors; all automata of the array update their states in parallel.
Collision-based computer is a uniform homogeneous medium which employs mobile compact patterns (particles, wave fragments) that travel in space and perform computation (e.g. implement logical gates) when they collide with each other. Truth values of logical variables are given either by the absence or presence of a localization or by various types of localizations.
Excitable medium is a spatially distributed assembly of coupled excitable systems; spatial distribution and coupling allow for the propagation of excitation waves.
Excitable system is a system with a single steady quiescent state that is stable to small perturbations, but responds with an excursion from its quiescent state (an excitation event) if the perturbation exceeds a critical threshold level. After excitation the system enters a refractory period, during which it is insensitive to further excitation, before returning to its steady state.
Glider, as related to cellular automata, is a compact (neither infinitely expanding nor collapsing) pattern of
non-quiescent states that travels along the cellular-automaton lattice.
Image processing is the transformation of an input image to an output image with desirable properties, using manipulation of images to enhance or extract information.
Logical gate is an elementary building block of a digital, or logical, circuit, which implements (mostly) binary logical operations, e.g. AND, OR, XOR, with two input terminals and one output terminal. In Boolean logic the terminals are in one of two binary conditions (e.g. low voltage and high voltage) corresponding to the TRUE and FALSE values of logical variables.
Logically universal processor is a system which can realize a functionally complete set of logical operations, e.g. conjunction and negation, in its development.
Oregonator is a system of three (or two, in a modified version) coupled differential equations designed to simulate oscillatory phenomena in the Belousov–Zhabotinsky reaction.
Shortest-path problem is the problem of finding a path between two sites (e.g. vertices of a graph, locations in space) such that the length of the path (the sum of the weights of the graph edges or of travel distances) is minimized.
Skeleton of a planar contour is the set of centers of bi-tangent circles lying inside the contour.
Subexcitable medium is a medium whose steady state lies between the excitable and the unexcitable domains. In excitable media, waves initiated by perturbations of sufficient size propagate throughout the media. In an unexcitable medium no perturbation is large enough to trigger a wave. In a subexcitable medium, wave fragments with open ends are formed.
Voronoi diagram of a finite planar set P is a partition of the plane into regions, one for each element of P, such that the region corresponding to a unique point p contains all those points of the plane that are closer to p than to any other node of P.
Wave fragment is an excitation wave formed in a subexcitable medium; it is a segment with free ends, which either expands or contracts depending on its size and the medium's excitability.

Definition of the Subject
A reaction-diffusion computer is a spatially extended chemical system which processes information using interacting growing patterns, and excitable and diffusive waves. In reaction-diffusion processors, both the data and the results of the computation are encoded as concentration
profiles of the reagents. The computation is performed via the spreading and interaction of wave fronts. A reaction-diffusion computer is a thin layer of a reagent mixture which reacts to changes of one reagent's concentration (the data configuration) in a predictable way, forming a stationary pattern corresponding to the concentration of the reagent (the result configuration). A computation in the chemical processor is implemented via the spreading and interaction of diffusive or phase waves [1,7]. Reaction-diffusion computers are parallel because myriads of their micro-volumes update their states simultaneously, and molecules diffuse and react in parallel. Liquid-phase chemical media are wet analogs of massively parallel (millions of elementary processors in a small chemical reactor) and locally connected (every micro-volume of the medium changes its state depending on the states of its closest neighbors) processors. They have parallel inputs and outputs: input can be optical, e.g. control of the initial excitation dynamics by illumination masks, and output is parallel because the concentration profile representing the result of the computation is visualized by indicators. Reaction-diffusion computers show fault tolerance and are capable of automatic reconfiguration: if we remove some quantity of the computing substrate, the topology is restored almost immediately.

Introduction
Reaction-diffusion computers are based on three principles of physics-inspired computing. First, physical action measures the amount of information: we exploit active processes in non-linear systems and interpret the dynamics of the systems as computation. Second, physical information travels only a finite distance: this means that computation is local, and we can regard the non-linear medium as a spatial arrangement of elementary processing units connected locally, i.e. each unit interacts only with its closest neighbors. Third, nature is governed by waves and spreading patterns: computation is therefore spatial.
Surface tension, propagating waves, electricity and chemical reactions have been the principal "engines" of nature-inspired computers for over two centuries [1]. However, only reaction-diffusion computers utilize all these phenomena at once, to solve large-scale NP-complete problems in a parallel and stable manner. Experimental studies and designs of reaction-diffusion computers can be traced back to the pioneering work of Kuhnert [28]. In 1986 he demonstrated that some very basic image transformations can be implemented in a light-sensitive Belousov–Zhabotinsky system [28]. The ideas of Kuhnert, Krinsky and Agladze [29] on image and planar shape transformations in a two-dimensional excitable chemical medium were further developed and modified by Rambidi and colleagues, see e.g. [42,43]. At that time, in the mid and late 1990s, a range of chemical logical gates was experimentally built in the Showalter [52] and Yoshikawa [30] laboratories. Computation of the shortest path, one of the classical optimization problems, has also been implemented in these laboratories using the Belousov–Zhabotinsky medium [7,10,51]. Reaction-diffusion computers give us the best examples of unconventional computers; following Jonathan Mills' classification of conventional vs. unconventional computing [34], they feature: a wet-ware, non-silicon computing substrate; parallel processing; computation occurring everywhere in the substrate space; computation based on analogies; spatial increase in precision; holistic and spatial programming; visual structure; and implicit error correcting. A theory of reaction-diffusion computing was established and a range of practical applications outlined in [1]. Recent discoveries were published in a collective monograph [7]. Herein we provide an account of achievements in reaction-diffusion computing obtained in the late 1990s and early 2000s. Our present article in no way serves as a substitute for these books but is rather an introduction to the field and a case study of several characteristic examples. We give a case-study introduction to the novel paradigm of wave-based computing in chemical systems. We show how selected problems and tasks of computational geometry, robotics and logics can be solved by encoding data in the configuration of the chemical medium's disturbances and programming wave dynamics and interaction. We describe and analyze working reaction-diffusion algorithms for image processing, computational geometry, logical and arithmetical circuits, memory devices, path planning and robot navigation, and control of massively parallel actuators.
Computational Geometry
The Voronoi diagram, or plane subdivision, is our favorite problem for demonstrating the "mechanics" and computational complexity of reaction-diffusion computers. Let P be a nonempty finite set of planar points. A planar Voronoi diagram of the set P is a partition of the plane into regions such that, for any element of P, the region corresponding to a unique point p contains all those points of the plane which are closer to p than to any other node of P. The unique region vor(p) = {z ∈ R² : d(p, z) < d(m, z) ∀ m ∈ P, m ≠ p} assigned to the point p is called the Voronoi cell of p. The boundary of the Voronoi cell of a point p is built of segments of bisectors separating pairs of geographically closest points of the given planar set P. The union of all boundaries of the Voronoi cells determines the planar Voronoi diagram: VD(P) = ∪_{p∈P} ∂vor(p). A variety of Voronoi diagrams and algorithms for their construction can be found in [27]. The basic concept of constructing Voronoi diagrams with reaction-diffusion systems rests on a very simple intuitive technique for detecting the bisector points separating two given points of the set P. If we drop reagents at the two data points, the diffusive waves, or phase waves if the computing substrate is active, spread outwards from the drops with the same speed. The waves travel the same distance from their sites of origination before they meet one another. The points where the waves meet are the bisector points. This idea of Voronoi diagram computation was originally implemented in cellular-automaton models and in experimental parallel chemical processors, see the extensive bibliography in [1,7]. Assuming that the computational space is homogeneous and locally connected, and that every site (micro-volume of the chemical medium or cell of the automaton array) is coupled to its closest neighbors by the same diffusive links, we can easily draw a parallel between distance and time, and thus put our wave-based approach into action. In a cellular-automaton representation of physical reality, the cell neighborhood u constrains all processes in the cellular-automaton model to a discrete metric. So, when studying automata models, we should think of the discrete Voronoi diagram rather than of its Euclidean representation. Chemical laboratory prototypes of reaction-diffusion computers do approximate a continuous Voronoi diagram, as we will see further on. A discrete Voronoi diagram can be defined on lattices or arrays of cells, e.g. the two-dimensional lattice Z². The distance d(·, ·) is calculated not in the Euclidean metric but in one of the discrete metrics, e.g. L₁ or L∞.
A discrete bisector of nodes x and y of Z² is determined as B(x, y) = {z ∈ Z² : d(x, z) = d(y, z)}. However, following this definition we sometimes generate bisectors that fill a quarter of the lattice, or produce no bisector at all [1]. If we want the constructed diagrams to be closer to their real-world counterparts, we can re-define the discrete bisector as follows: B(x, y) = {z ∈ Z² : |d(x, z) − d(y, z)| ≤ 1}. The redefined bisector comprises the edges of Voronoi diagrams constructed in discrete, cellular-automaton models of reaction-diffusion and excitable media. Now we will discuss several versions of reaction-diffusion wave-based construction of Voronoi diagrams, from a naïve model, where the number of reagents grows proportionally to the number of data points, to a minimalist implementation with just one reagent and one substrate [1].
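The redefined discrete bisector can be computed directly from its definition. The following sketch collects the points z with |d(x, z) − d(y, z)| ≤ 1; the grid size, the two data points, and the choice of the Manhattan (L₁) metric are illustrative, not taken from the text:

```python
# Sketch: the redefined discrete bisector B(x, y) on a small grid,
# using the Manhattan (L1) metric. Grid dimensions and the two data
# points below are illustrative choices.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def discrete_bisector(x, y, width, height):
    """B(x, y) = {z : |d(x, z) - d(y, z)| <= 1}."""
    return {(i, j)
            for i in range(width)
            for j in range(height)
            if abs(manhattan(x, (i, j)) - manhattan(y, (i, j))) <= 1}

bisector = discrete_bisector((2, 5), (12, 5), 15, 11)
# Every bisector point is (nearly) equidistant from both sources.
assert all(abs(manhattan((2, 5), z) - manhattan((12, 5), z)) <= 1
           for z in bisector)
```

For two sources on the same row, the bisector is the column of cells halfway between them, which is exactly the set of meeting points of two wavefronts spreading at equal speed.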
Reaction-Diffusion Computing, Figure 1 Computation of a Voronoi diagram in a cellular-automaton model of a chemical processor with O(n) reagents. Precipitate is shown in black
Let us start with the O(n)-reagent model. In a naïve version of reaction-diffusion computation of a Voronoi diagram one needs two reagents and a precipitate to mark the bisector separating two points. Therefore n + 2 reagents, including precipitate and substrate, are required to approximate a Voronoi diagram of n points. We place n unique reagents on the n points of the given data set P; waves of these reagents spread around the space and interact with each other where they meet. When at least two different reagents meet at the same or adjacent sites of the space, they react and form a precipitate; sites that contain the precipitate represent edges of the Voronoi cells, and therefore constitute the Voronoi diagram. In "chemical reaction" notation the idea looks as follows, where α and β are different reagents and # is a precipitate: α + β → #. This can be converted to a cellular-automaton interpretation as follows:

x^{t+1} = ρ, if x^t = ◦ and σ(x)^t ∩ R = {ρ};
x^{t+1} = #, if x^t ≠ # and |σ(x)^t ∩ R| > 1;
x^{t+1} = x^t, otherwise,

where ◦ is a resting state (a cell in this state does not contain any reagents), ρ ∈ R is a reagent from the set R of n reagents, and σ(x)^t = {y^t : y ∈ u(x)} characterizes
what reagents are present in the local neighborhood u(x) of the cell x at time step t. The first transition of the above rule symbolizes diffusion: a resting cell takes the state ρ if only this reagent is present in the cell's neighborhood. If there are two different reagents in the cell's neighborhood, then the cell takes the precipitate state #. Diffusing reagents halt because the formation of precipitate reduces the number of "vacant" resting cells. The precipitate does not diffuse. A cell in state # remains in this state indefinitely. An example of a cellular-automaton simulation of the O(n)-reagent chemical processor is shown in Fig. 1. The O(n)-reagent model is demonstrative but computationally inefficient. Clearly we could reduce the number of reagents to four, using map coloring theorems, but the pre-processing time would be unfeasibly high. The number of participating reagents can be reduced to O(1) when the topology of the spreading waves is taken into account [1]. Now we go from one extreme to the other and consider a model with just one reagent and a substrate. The reagent α diffuses from the sites corresponding to the points of a planar data set P. When two diffusing wave fronts meet, the concentration of the reagent becomes super-threshold and the fronts do not spread further. A cellular-automaton model represents this as follows. Every cell has two possible states: the resting, or substrate, state ◦ and the reagent state α. If the cell is in state α it remains in this state indefinitely. If the cell is in state ◦ and between one and four of its neighbors are in state α, then the cell takes the state α. Otherwise, the cell remains in the state ◦; this reflects the "super-threshold inhibition" idea. The cell-state transition rule is as follows:

x^{t+1} = α, if x^t = ◦ and 1 ≤ σ(x)^t ≤ 4;
x^{t+1} = x^t, otherwise,

where σ(x)^t = |{y ∈ u(x) : y^t = α}|. Increasing the number of reagents to two (one reagent and one precipitate) would make life easy.
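The one-reagent rule above is easy to simulate. In the following minimal sketch, 0 stands for the substrate ◦ and 1 for the reagent α; the eight-cell Moore neighborhood, the grid size, and the two seed points are illustrative assumptions:

```python
# Sketch of the single-reagent model: 0 = substrate, 1 = reagent alpha.
# A resting cell becomes alpha only when between 1 and 4 of its eight
# neighbors hold alpha ("super-threshold inhibition"). Grid and seeds
# are illustrative.

def step(grid):
    h, w = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(h):
        for j in range(w):
            if grid[i][j] == 0:
                n = sum(grid[a][b]
                        for a in range(max(0, i - 1), min(h, i + 2))
                        for b in range(max(0, j - 1), min(w, j + 2))
                        if (a, b) != (i, j))
                if 1 <= n <= 4:
                    new[i][j] = 1
    return new

grid = [[0] * 21 for _ in range(9)]
grid[4][3] = grid[4][17] = 1          # two data points of P
for _ in range(12):
    grid = step(grid)

# Where the two fronts approach each other the neighbor count exceeds
# 4, so a corridor of substrate (the bisector) survives between them.
assert grid[4][10] == 0 and grid[4][9] == 1
```

The surviving substrate column between the two waves is the discrete bisector of the two seeds, obtained without any precipitate reagent.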
A reagent β diffuses on a substrate from the initial points (drops of reagent) of P and forms a precipitate in the reaction mβ → α, where 1 ≤ m ≤ 4. Every cell takes three states: ◦ (resting cell, no reagents), α (e.g. colored precipitate) and β (reagent).

M =
S I I A I I I
S S I A I I
S I I A I
S S I A
S S I
S S
S

The cell-state transition rule, specified by the matrix M, reflects the nonlinearity of activator-inhibitor interactions for sub-threshold concentrations of the activator. Namely, for a small concentration of the inhibitor and for threshold concentrations, the activator is suppressed by the inhibitor, while for critical concentrations of the inhibitor both inhibitor and activator dissociate, producing the substrate. In exact terms, M₀₁ = A symbolizes the diffusion of the activator A, M₁₁ = I represents the suppression of the activator A by the inhibitor I, and M_{z2} = I (z = 0, …, 5) can be interpreted as self-inhibition of the activator at particular concentrations. M_{z3} = A (z = 0, …, 4) means sustained excitation at particular concentrations of the activator. M_{z0} = S (z = 1, …, 7) means that the inhibitor dissociates in the absence of the activator, and that the activator does not diffuse at sub-threshold concentrations. And, finally, M_{zp} = I, p ≥ 4, is an upper-threshold self-inhibition. Amongst the non-trivial localizations found in the medium (see the full "catalog" in [5]) we selected, as components of the memory unit, the gliders G4 and G34, mobile localizations with an activator head and an inhibitor tail, and the eaters E6, stationary localizations that transform gliders colliding with them.
The eater E6 can play the role of a 6-bit flip-flop memory device. The substrate-sites (bit-down) between inhibitor-sites (Fig. 8) can be switched to an inhibitor-state (bit-up) by a colliding glider. An example of writing one bit of information into E6 is shown in Fig. 9. Initially E6 stores no information. We aim to write one bit into the substrate-site between the northern and north-western inhibitor-sites (Fig. 9a). We generate a glider G34 (Fig. 9b,c) traveling west. G34 collides with (or brushes past) the north edge of E6, resulting in G34 being transformed into a different type of glider, G4 (Fig. 9g,h). There is now a record of the collision, evidence that writing was successful: the structure of E6 now has one site (between the northern and north-western inhibitor-sites) changed to an inhibitor-state (Fig. 9j), so a bit was saved [5]. To read a bit from the E6 memory device with one bit-up (Fig. 10a), we collide with it (or brush past) a glider G34 (Fig. 10b). Following the collision, the glider G34 is transformed into a different type of basic glider, G4 (Fig. 10g), and the bit is erased (Fig. 10j).

Programmability
Reaction-Diffusion Computing, Figure 8 Localizations in a reaction-diffusion hexagonal cellular automaton. Cells with inhibitor I are empty circles, cells with activator A are black disks. a Stationary localization eater E6, b–c two forms of glider G34, d glider G4 [5]
Reaction-Diffusion Computing, Figure 9 Write bit [5]
In the chemical laboratory the term programmability means controllability. How can real chemical systems be controlled? The majority of the literature on theoretical and experimental studies of the controllability of reaction-diffusion media deals with the application of an electric field. For example, in a thin-layer Belousov–Zhabotinsky reactor stimulated by an electric field the following phenomena are observed: the velocity of excitation waves is increased by a negative and decreased by
Reaction-Diffusion Computing, Figure 10 Read and erase bit [5]
a positive electric field; a wave splits into two waves that move in opposite directions if a very high electric field is applied across the evolving medium; crescent-shaped waves, not commonly observed in the field-free evolution of the BZ reaction, are formed; and wave fronts can be stabilized or destabilized, see [7]. Other control parameters include temperature (to, e.g., program transitions between periodic and chaotic oscillations), the substrate's structure (controlling formation, annihilation and propagation of waves), and illumination (inputting data and routing signals in light-sensitive chemical systems). Let us demonstrate the concept of control-based programmability in models of reaction-diffusion processors. Firstly, we show how to adjust reaction rates in a chemical medium to make it compute the Voronoi diagram of a given set of points. Secondly, we show how to switch an excitable system between specialized-processor and universal-processor modes; see [7] for additional examples and details. Let a cell x of a two-dimensional lattice take four states: resting (∘), excited (+), refractory (−) and precipitate (★), and update its state in discrete time t depending on the number σ_t(x) of excited neighbors in its eight-cell neighborhood as follows (Fig. 11a). A resting cell x becomes excited if 0 < σ_t(x) ≤ θ₂ and precipitates if θ₂ < σ_t(x). An excited cell precipitates if θ₁ < σ_t(x) and otherwise becomes refractory. A refractory cell recovers to the resting state unconditionally, and a precipitate cell never changes its state. Initially we perturb the medium by exciting it in several sites, thus inputting data.
Reaction-Diffusion Computing, Figure 11 Cell state transition diagrams: a model of precipitating reaction-diffusion medium, b model of excitable system
Waves of excitation are generated; they grow, collide with each other and annihilate as a result of the collision. They may form a stationary inactive concentration profile of precipitate, which represents the result of the computation. Thus, we need only be concerned with the reactions of precipitation: + → ★ (rate k₁) and ∘+ → ★ (rate k₂), where k₁ and k₂ are inversely proportional to θ₁ and θ₂, respectively. Varying θ₁ and θ₂ from 1 to 8, and thus changing precipitation rates from the maximum possible to the minimum, we obtain various kinds of precipitate patterns, as shown in Fig. 12. Precipitate patterns developed for relatively high ranges of reaction rates (3 ≤ θ₁, θ₂ ≤ 4) represent discrete Voronoi diagrams derived from the set of initially excited sites (the given "planar" data set, represented by the sites of initial excitation, is visible in the pattern for θ₁ = θ₂ = 3 as white dots inside the Voronoi cells), see Fig. 13a and b. This example demonstrates that by externally controlling precipitation rates we can force the reaction-diffusion medium to compute a Voronoi diagram.
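The two-threshold automaton described above is straightforward to prototype. The Python sketch below is illustrative only: the state encoding, toroidal boundary conditions and the demo seeding are our assumptions, not part of the original model. Precipitate forms where wave-fronts from different sources collide, roughly tracing the boundary between the seeds.

```python
# States of the precipitating automaton: resting, excited, refractory, precipitate.
RESTING, EXCITED, REFRACTORY, PRECIPITATE = 0, 1, 2, 3

def step(grid, theta1, theta2):
    """One synchronous update of the two-threshold rule on a toroidal lattice."""
    n = len(grid)
    new = [row[:] for row in grid]
    for i in range(n):
        for j in range(n):
            # sigma: number of excited cells in the 8-cell neighborhood
            sigma = sum(grid[(i + di) % n][(j + dj) % n] == EXCITED
                        for di in (-1, 0, 1) for dj in (-1, 0, 1)
                        if (di, dj) != (0, 0))
            s = grid[i][j]
            if s == RESTING:
                if sigma > theta2:
                    new[i][j] = PRECIPITATE
                elif sigma > 0:
                    new[i][j] = EXCITED
            elif s == EXCITED:
                new[i][j] = PRECIPITATE if sigma > theta1 else REFRACTORY
            elif s == REFRACTORY:
                new[i][j] = RESTING
            # a precipitate cell never changes state
    return new

# Two "data" sites; colliding wave-fronts leave precipitate between them.
n = 24
grid = [[RESTING] * n for _ in range(n)]
grid[6][6] = grid[17][17] = EXCITED
for _ in range(60):
    grid = step(grid, theta1=3, theta2=3)
```

With θ₁ = θ₂ = 3 a thin front sees at most three excited neighbors and so travels without precipitating; deposits concentrate where waves meet, which is why this parameter range yields Voronoi-like patterns.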
Reaction-Diffusion Computing, Figure 12 Final configurations of reaction-diffusion medium for 1 ≤ θ₁, θ₂ ≤ 2. Resting sites are black, precipitate is white [3]
Reaction-Diffusion Computing, Figure 13 Exemplary configurations of reaction-diffusion medium for a θ₁ = 3 and θ₂ = 3, b θ₁ = 4 and θ₂ = 3. Resting sites are black, precipitate is white [7]
Let each cell of the 2D automaton take three states: resting (∘), excited (+) and refractory (−), and update its state depending on the number σ of excited neighbors in its 8-cell neighborhood (Fig. 11b). A cell goes from excited to refractory and from refractory to resting unconditionally, and a resting cell becomes excited if σ ∈ [θ₁, θ₂], 1 ≤ θ₁ ≤ θ₂ ≤ 8. By changing θ₁ and θ₂ we can move the medium dynamics into a domain of "conventional" excitation waves, useful for image processing and robot navigation [7] (Fig. 14a), as well as make it exhibit mobile localized excitations (Fig. 14b): quasi-particles, discrete analogs of dissipative solitons, employed in collision-based computing [1]. When dealing with excitable media, excitability is the key parameter for tuning spatio-temporal dynamics. In [1] we demonstrated that by varying excitability we can force the medium to exhibit almost all possible types of excitation dynamics.
Reaction-Diffusion Computing, Figure 14 Snapshots of space-time excitation dynamics for excitability σ ∈ [1, 8] (a) and σ ∈ [2, 2] (b)
Robot Navigation and Massive Manipulation
As we have seen in the previous sections, reaction-diffusion chemical systems can solve complex problems and implement logical circuits. Embedded controllers for non-traditional robotics architectures would be yet another potentially huge field of application of reaction-diffusion computers. Physico-chemical artifacts are well known to be capable of sensible motion. Most famous are Belousov–Zhabotinsky vesicles [26], self-propulsive chemo-sensitive drops [25,37] and ciliar arrays. Their motion is directional but somewhat lacks sophisticated control mechanisms. At the present stage of reaction-diffusion computing research it seems difficult to provide effective solutions for experimental prototyping of combined sensing, decision-making and actuating. However, as a proof-of-concept we can always consider hybrid "wetware + hardware" systems. For example, to fabricate a chemical controller for a robot (Fig. 15a), we can place a reactor with Belousov–Zhabotinsky solution on board a wheeled robot and allow the robot to observe the excitation wave dynamics in the reactor. When the medium is stimulated at one point, target waves are formed, and the robot infers the direction toward the source of stimulation from the topology of the wave-fronts [7]. A set of remarkable experiments was undertaken by Hiroshi Yokoi and Ben De Lacy Costello, who built an interface (Fig. 15b) between a robotic hand and a Belousov–Zhabotinsky chemical reactor [62]. Excitation waves propagating in the reactor were sensed by photo-diodes, which triggered finger motion. Bending fingers touched the chemical medium with their glass nails filled with colloid silver, which triggered circular waves in the medium [7]. Starting from any initial configuration, the chemical-robotic system always reaches a coherent activity mode, in which the fingers move in regular, somewhat melodic patterns, and a few generators of target waves govern the dynamics of excitation in the reactor [62]. The chemical processors for navigating the wheeled robot and for controlling, and actively interacting with, a robotic hand are discussed at length in our recent monograph [7], therefore we will not go into detail herein.
Reaction-Diffusion Computing, Figure 15 Robots controlled by BZ chemical medium: a mobile robot, b robotic hand (courtesy of Hiroshi Yokoi)
Instead we will concentrate on rather novel findings on coupling a reaction-diffusion system with a massively parallel array of virtual actuators. How can a reaction-diffusion medium manipulate objects? To find out, we couple a simulated abstract parallel manipulator with an experimental Belousov–Zhabotinsky (BZ) chemical medium, so that the excitation dynamics in the chemical system are reflected in the changing OFF-ON modes of elementary actuating units. In this case, we convert experimental snapshots of the spatially distributed chemical system to a force vector field and then simulate the motion of manipulated objects in the force field, thus achieving reaction-diffusion-medium-controlled actuation. To build an interface between the recorded space-time snapshots of the excitation dynamics in the BZ medium and simulated physical objects, we calculate the force fields generated by mobile excitation patterns and then simulate the behavior of an object in this force field.
The chemical medium used to perform actuation is prepared following a typical recipe¹, see [8,20], based on a ferroin-catalyzed BZ reaction. A silica gel plate is cut and soaked in a ferroin solution. The gel sheet is placed in a Petri dish and BZ solution is added. The dynamics of the chemical system are recorded at 30-second intervals using a digital camera. The cross-section profile of a BZ wave-front recorded in a digital snapshot shows a steep rise of red color values in the pixels at the wave-front's head and a gradual descent in the pixels along the wave-front's tail. Assuming that excitation waves push the object, the local force vectors generated at each site (pixel of the digitized image) of the medium should be oriented along the local gradients of the red color values. From the digitized snapshot of the BZ medium we extract an array of red components from the snapshot's pixels and then calculate the projection of a virtual force vector at each pixel. Force fields generated by the excitation patterns in a BZ system (Fig. 16) result in tangential forces being applied to a manipulated object, causing translational and rotational motions of the object [8]. Non-linear-medium-controlled actuators can be used for sorting and manipulating both small objects, comparable in size to an elementary actuating unit, and larger objects, with lengths of tens or hundreds of actuating units. Accordingly, we demonstrate here two types of experiments with BZ-based manipulation: of pixel-sized objects and of planar convex shapes. Pixel-objects, due to their small size, are subjected to random forces caused by impurities of the physical medium and imprecision of the actuating units. In this case no averaging of forces is possible, and the pixel-objects themselves sensitively react to single force vectors.
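The red-channel-to-force-field conversion can be sketched with a finite-difference gradient. The array and function names below are our own illustrative choices (in [8] the projections are computed from experimental images); the synthetic "snapshot" mimics a front with a steep head and a gradual tail.

```python
import numpy as np

def force_field(red):
    """Force vectors oriented along local gradients of the red colour value.

    `red` is a 2-D array of red-channel intensities from a digitised snapshot
    of the BZ medium; returns (fy, fx), the vector components at every pixel.
    """
    red = np.asarray(red, dtype=float)
    fy, fx = np.gradient(red)  # finite-difference gradient at each pixel
    return fy, fx

# Synthetic wave-front profile: steep rise at the head, gradual descent at the tail.
red = np.tile(np.r_[np.zeros(5), [80.0, 40.0, 20.0, 10.0], np.zeros(5)], (8, 1))
fy, fx = force_field(red)
```

The largest horizontal component appears just ahead of the steep rise, which is where a manipulated object would feel the strongest push.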
Therefore, we adopt the following model of manipulating a pixel-object: if all force vectors in the 8-pixel neighborhood of the current site of the pixel-object are nil, then the pixel-object jumps to a randomly chosen pixel of its neighborhood; otherwise the pixel-object is translated by the maximum force vector in its neighborhood. When placed on the simulated manipulating surface, pixel-objects move at random in the domains of the resting medium; however, by randomly drifting each pixel-object eventually encounters a domain of co-aligned vectors (representing an excitation wave front in the BZ medium) and is translated along the vectors. An example of several pixel-objects transported on a "frozen" snapshot of the chemical medium is shown in Fig. 17.
¹ Chemical laboratory experiments were undertaken by Dr. Ben De Lacy Costello (UWE, Bristol, UK).
Trajectories of pixel-objects (Fig. 17a) show distinctive intermittent modes of random motion separated by modes of directed "jumps" guided by traveling wave fronts. Smoothed trajectories of pixel-objects (Fig. 17b) demonstrate that, despite a very strong chaotic component in the manipulation, pixel-objects are transported to the sites of the medium where two or more excitation wave-fronts meet. The overall speed of pixel-object transportation depends on the frequency of wave generation by the sources of target waves. As a rule, the higher the frequency, the faster the objects are transported. This is because parts of the medium spanned by low-frequency target waves contain lengthy domains of resting medium, where no force vectors are formed; there pixel-sized objects can wander randomly for a long time until climbing the next wave front [8]. To calculate the contribution of each force we partition the object into fragments using a square grid in which each cell corresponds to one pixel of the image. We assume that the magnitude of the force applied to each fragment above a given pixel is proportional to the area of the fragment and is co-directional with the force vector. The moment of inertia of the whole object, with respect to the axis normal to the object and passing through its center of mass, is calculated from the position of the center of mass and the mass of every fragment. Since the object's shape and size are constant, it is enough to calculate the moment of inertia only once, at the beginning of the simulation. We also take into account the principal rotational momentum created by the forces and the angular acceleration of the object around its center of mass. The object's motion can therefore be sufficiently described by the coordinates of its center of mass and its rotation at every moment of time [8]. Spatially extended objects follow the general pattern of motion observed for pixel-sized objects.
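The pixel-object rule above (random jump in a resting neighborhood, otherwise translation by the strongest neighborhood force) can be sketched as follows. Representing the force field as a dictionary and rounding the translation to lattice steps are our own illustrative choices.

```python
import random

def move_pixel_object(pos, field, rng=random):
    """One update of a pixel-object under the neighborhood-force rule.

    `field[(x, y)]` maps lattice sites to (fx, fy) force vectors; sites
    absent from the dictionary are treated as nil forces.
    """
    x, y = pos
    neighbourhood = [(x + dx, y + dy)
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0)]
    forces = {p: field.get(p, (0.0, 0.0)) for p in neighbourhood}
    if all(f == (0.0, 0.0) for f in forces.values()):
        # resting medium: random drift to a neighboring pixel
        return rng.choice(neighbourhood)
    # otherwise translate by the maximum force vector in the neighborhood
    fx, fy = max(forces.values(), key=lambda f: f[0] ** 2 + f[1] ** 2)
    return (x + round(fx), y + round(fy))
```

Iterating this update on a "frozen" force field reproduces the intermittent behavior described above: random wandering in resting domains, directed jumps once a co-aligned vector domain is encountered.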
However, due to the integration of many force vectors, the motion of planar objects is smoother and less sensitive to the orientation of any particular force vector. The outcome of manipulation depends on the size of the object: with increasing size, owing to the larger number of local force vectors acting on the object, objects become more controllable by the excitation wave-fronts (Fig. 18).
Future Directions
The field of reaction-diffusion computing started 20 years ago [28] as a sub-field of physics and chemistry dealing with image processing operations in uniform thin-layer
Reaction-Diffusion Computing, Figure 16 Force vector field (b) calculated from BZ medium’s image (a) [8]
Reaction-Diffusion Computing, Figure 17 Examples of manipulating five pixel-objects using the BZ medium: a trajectories of pixel-objects, b jump-trajectories of pixel-objects recorded every 100th time step. Initial positions of the pixel-objects are shown by circles [8]
Reaction-Diffusion Computing, Figure 18 Manipulating a planar object in the BZ medium. a Right-angled triangle moved by the fronts of target waves. b Square object moved by the fronts of fragmented waves in a sub-excitable BZ medium. The trajectory of the center of mass of the square is shown by the dotted line. The exact orientation of the objects is displayed every 20 steps. Initial and final positions of the object are marked [8]
excitable chemical media. The basic idea was to apply input data as a two-dimensional profile of heterogeneous illumination, then allow excitation waves to spread and interact with each other, and then optically record the result of the computation. The first reaction-diffusion computers were already massively parallel, with parallel optical input and output. Later, computer engineers entered the field and started to exploit traditional techniques: wires were implemented as channels in which wave-pulses travel, and specifically shaped junctions acted as logical valves. In this manner the most "famous" chemical computing devices were implemented, including Boolean gates, coincidence detectors, memory units and more. The idea of reaction-diffusion computation was, if not ruined, then forced into a cul-de-sac of non-classical computation. The breakthrough happened when paradigms and solutions from the fields of dynamical, collision-based computing and conservative logic were mapped onto the realm of spatially extended chemical systems. The computers became uniform and homogeneous. Through several examples we demonstrated that reaction-diffusion chemical systems are capable of solving combinatorial problems with natural parallelism. In spatially distributed chemical processors, the data and the results of the computation are encoded as concentration profiles of the chemical species. The computation per se is performed via the spreading and interaction of wave fronts. Reaction-diffusion computers are parallel because the chemical medium's micro-volumes update their states simultaneously, and molecules diffuse and react in parallel.
During the last decades a wide range of experimental prototypes of reaction-diffusion computing devices have been fabricated and applied to solve various problems of computer science, including image processing, pattern recognition, path planning, robot navigation, computational geometry, logical gates in spatially distributed chemical media, and arithmetic and memory units. These important results, scattered across many scientific fields, convince us that reaction-diffusion systems can do a lot. Are they capable enough to be intelligent? Yes, reaction-diffusion systems are smart: they show a state of readiness to respond, are able to cope with difficult situations, are capable of determining something by mathematical and logical methods, and are endowed with the capacity to reason. Reaction-diffusion computers allow for massively parallel input of data. Equivalently, reaction-diffusion robots would need no dedicated sensors: each micro-volume of the medium, each site of the matrix gel, is sensitive to changes in one or another physical characteristic of the environment. Electric field, temperature and illumination are "sensed" by reaction-diffusion devices, and these are the three principal parameters for controlling and programming reaction-diffusion robots. Hard computational problems of geometry, image processing and optimization on graphs are resource-efficiently solved in reaction-diffusion media due to the intrinsic natural parallelism of the problems [1]. Herein we demonstrated the efficiency of reaction-diffusion computers with the example of constructing Voronoi diagrams. The Voronoi diagram is a subdivision of the plane determined by a planar data set. Each point of the data set is represented by a drop of a reagent. The reagent diffuses and produces a color precipitate when reacting with the substrate. When two or more diffusive fronts of the "data" chemical species meet, no precipitate is produced (due to concentration-dependent inhibition). Thus, the uncolored domains of the computing medium represent the bisectors of the Voronoi diagram. The precipitating chemical processor can also compute the skeleton of a planar shape, in a similar manner. A contour of the shape is applied to the computing substrate as a disturbance in reagent concentrations. The contour concentration profile induces diffusive waves. A reagent diffusing from the data contour reacts with the substrate and a precipitate is formed; the precipitate is not produced at the sites of collision of diffusive waves. The uncolored domains correspond to the skeleton of the data shape. To compute a collision-free shortest path in a space with obstacles, we can couple two reaction-diffusion media. Obstacles are represented by local disturbances of concentration profiles in one of the media. The disturbances induce circular waves traveling in the medium and approximating a scalar distance-to-obstacle field. This field is mapped onto the second medium, which calculates a tree of "many-sources-one-destination" shortest paths by spreading wave-fronts [7].
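In discrete form, the "many-sources-one-destination" tree of shortest paths computed by spreading wave-fronts is a breadth-first flood of the lattice from the destination, after which any source reaches the destination by descending the distance field. The sketch below is an illustrative Python analogue (function names are ours); it models only the wave-front geometry, not the chemical kinetics.

```python
from collections import deque

def wavefront_distances(grid, dest):
    """Distance-to-destination field grown by a spreading wave-front (BFS).

    `grid[y][x]` is True for free cells and False for obstacles; the wave
    spreads to 4-connected free cells, one lattice step per time step.
    """
    dist = {dest: 0}
    queue = deque([dest])
    while queue:
        x, y = queue.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if (0 <= ny < len(grid) and 0 <= nx < len(grid[0])
                    and grid[ny][nx] and (nx, ny) not in dist):
                dist[(nx, ny)] = dist[(x, y)] + 1
                queue.append((nx, ny))
    return dist

def descend(dist, start):
    """Follow the wave-front field downhill: a shortest path start -> destination."""
    path = [start]
    while dist[path[-1]] > 0:
        x, y = path[-1]
        path.append(min(((x + dx, y + dy)
                         for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                         if (x + dx, y + dy) in dist),
                        key=dist.get))
    return path
```

Because every cell was first reached by the earliest wave-front, following decreasing distances from any start cell yields a collision-free shortest path, and doing so from all cells at once yields the shortest-path tree.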
Just a few words of warning: when thinking about chemical algorithms, some of you may realize that diffusive and phase waves are rather slow in physical time. The sluggishness of computation is the only point that may attract criticism of reaction-diffusion chemical computers. There is, however, a cure: to speed up computation we are implementing the chemical medium in silicon, in micro-processor LSI analogs of reaction-diffusion computers [12]. Further miniaturization of reaction-diffusion computers can be reached when the system is implemented as a two-dimensional array of single-electron nonlinear oscillators diffusively coupled to each other [38]. Yet another direction in developing reaction-diffusion computers is to design embedded controllers for soft-bodied robots, where the usage of conventional silicon materials seems inappropriate.
Acknowledgments
Chemical laboratory prototypes of the reaction-diffusion computers discussed in this article were implemented by Ben De Lacy Costello. I am grateful to Andy Wuensche (hexagonal cellular automata), Hiroshi Yokoi (robotic hand controlled by the Belousov–Zhabotinsky reaction), Chris Melhuish (control of robot navigation), Sergey Skachek (massively parallel manipulation), Tetsuya Asai (LSI prototypes of reaction-diffusion computers) and Genaro Martinez (binary-state cellular automata) for their cooperation. Some pictures, where indicated, were adapted from our co-authored publications. My sincere thanks go to Soichiro Tsuda and Tomohiro Shirakawa, who ignited my interest in experimenting with plasmodium and were my patient advisers.
Bibliography
Primary Literature
1. Adamatzky A (2001) Computing in nonlinear media and automata collectives. Institute of Physics Publishing, London
2. Adamatzky A (ed) (2003) Collision-based computing. Springer, London
3. Adamatzky A (2005) Programming reaction-diffusion computers. In: Unconventional programming paradigms. Springer, New York
4. Adamatzky A, De Lacy Costello BPJ (2002) Experimental logical gates in a reaction-diffusion medium: the XOR gate and beyond. Phys Rev E 66:046112
5. Adamatzky A, Wuensche A (2006) Computing in 'spiral rule' reaction-diffusion hexagonal cellular automaton. Complex Syst 16(4):277–298
6. Adamatzky A, De Lacy Costello BPJ, Melhuish C, Ratcliffe N (2003) Experimental reaction-diffusion chemical processors for robot path planning. J Intell Robot Syst 37:233–249
7. Adamatzky A, De Lacy Costello BPJ, Asai T (2005) Reaction-diffusion computers. Elsevier, Amsterdam
8. Adamatzky A, De Lacy Costello BPJ, Skachek S, Melhuish C (2005) Manipulating objects with chemical waves: open loop case of experimental Belousov–Zhabotinsky medium. Phys Lett A 350(3–4):201–209
9. Adamatzky A, Wuensche A, De Lacy Costello BPJ (2006) Glider-based computation in reaction-diffusion hexagonal cellular automata. Chaos Solitons Fractals 27:287–295
10. Agladze K, Magome N, Aliev R, Yamaguchi T, Yoshikawa K (1997) Finding the optimal path with the aid of chemical wave. Phys D 106:247–254
11. Asai T, De Lacy Costello BPJ, Adamatzky A (2005) Silicon implementation of a chemical reaction-diffusion processor for computation of Voronoi diagram. Int J Bifurc Chaos 15(1)
12. Asai T, Kanazawa Y, Hirose T, Amemiya Y (2005) Analog reaction-diffusion chip imitating Belousov–Zhabotinsky reaction with hardware Oregonator model. Int J Unconv Comput 1:123–147
13. Beato V, Engel H (2003) Pulse propagation in a model for the photosensitive Belousov–Zhabotinsky reaction with external noise. In: Schimansky-Geier L, Abbott D, Neiman A, Van den Broeck C (eds) Noise in complex systems and stochastic dynamics. Proc SPIE, vol 5114, pp 353–362
14. Berlekamp ER, Conway JH, Guy RL (1982) Winning ways for your mathematical plays, vol 2. Academic Press
15. Bode M, Liehr AW, Schenk CP, Purwins H-G (2000) Interaction of dissipative solitons: particle-like behavior of localized structures in a three-component reaction-diffusion system. Phys D 161:45–66
16. Brandtstädter H, Braune M, Schebesch I, Engel H (2000) Experimental study of the dynamics of spiral pairs in light-sensitive Belousov–Zhabotinskii media using an open-gel reactor. Chem Phys Lett 323:145–154
17. Courant R, Robbins H (1941) What is mathematics? Oxford University Press
18. Dupont C, Agladze K, Krinsky V (1998) Excitable medium with left-right symmetry breaking. Phys A 249:47–52
19. Field RJ, Noyes RM (1974) Oscillations in chemical systems: IV. Limit cycle behavior in a model of a real chemical reaction. J Chem Phys 60:1877–1884
20. Field R, Winfree AT (1979) Traveling waves of chemical activity in the Zaikin–Zhabotinsky–Winfree reagent. J Chem Educ 56:754
21. Fredkin E, Toffoli T (1982) Conservative logic. Int J Theor Phys 21:219–253
22. Gerhardt M, Schuster H, Tyson JJ (1990) A cellular automaton model of excitable media. Phys D 46:392–415
23. Grill S, Zykov VS, Müller SC (1996) Spiral wave dynamics under pulsatory modulation of excitability. J Phys Chem 100:19082–19088
24. Hartman H, Tamayo P (1990) Reversible cellular automata and chemical turbulence. Phys D 45:293–306
25. Kitahata H, Yoshikawa K (2005) Chemo-mechanical energy transduction through interfacial instability. Phys D 205:283–291
26. Kitahata H, Aihara R, Magome N, Yoshikawa K (2002) Convective and periodic motion driven by a chemical wave. J Chem Phys 116:5666
27. Klein R (1990) Concrete and abstract Voronoi diagrams. Springer, Berlin
28. Kuhnert L (1986) A new photochemical memory device in a light sensitive active medium. Nature 319:393
29. Kuhnert L, Agladze KL, Krinsky VI (1989) Image processing using light-sensitive chemical waves. Nature 337:244–247
30. Kusumi T, Yamaguchi T, Aliev R, Amemiya T, Ohmori T, Hashimoto H, Yoshikawa K (1997) Numerical study on time delay for chemical wave transmission via an inactive gap. Chem Phys Lett 271:355–360
31. Maeda S, Hashimoto S, Yoshida R (2004) Design of chemo-mechanical actuator using self-oscillating gels. In: Proc IEEE International Conference on Robotics and Biomimetics (ROBIO 2004), p 313
32. Margolus N (1984) Physics-like models of computation. Phys D 10:81–95
33. Markus M, Hess B (1990) Isotropic cellular automata for modeling excitable media. Nature 347:56–58
34. Mills J (2005) The new computer science and its unifying principle: complementarity and unconventional computing. In: The Grand Challenge in Nonclassical Computation, Int Workshop, Position Papers. York, 18–19 April 2005
35. Motoike IN, Yoshikawa K (2003) Information operations with multiple pulses on an excitable field. Chaos Solitons Fractals 17:455–461
36. Motoike IN, Yoshikawa K, Iguchi Y, Nakata S (2001) Real-time memory on an excitable field. Phys Rev E 63:036220
37. Nagai K, Sumino Y, Kitahata H, Yoshikawa K (2005) Mode selection in the spontaneous motion of an alcohol droplet. Phys Rev E 71:065301
38. Oya T, Asai T, Fukui T, Amemiya Y (2005) Reaction-diffusion systems consisting of single-electron oscillators. Int J Unconv Comput 1:179–196
39. Petrov V, Ouyang Q, Swinney HL (1997) Resonant pattern formation in a chemical system. Nature 388:655–657
40. Pour-El MB (1974) Abstract computability and its relation to the general purpose analog computer (some connections between logic, differential equations and analog computers). Trans Am Math Soc 199:1–28
41. Qian H, Murray JD (2001) A simple method of parameter space determination for diffusion-driven instability with three species. Appl Math Lett 14:405–411
42. Rambidi NG (1998) Neural network devices based on reaction-diffusion media: an approach to artificial retina. Supramol Sci 5:765–767
43. Rambidi NG, Shamayaev KR, Peshkov GY (2002) Image processing using light-sensitive chemical waves. Phys Lett A 298:375–382
44. Ramos JJ (2003) Oscillatory dynamics of inviscid planar liquid sheets. Appl Math Comput 143:109–144
45. Saltenis V (1999) Simulation of wet film evolution and the Euclidean Steiner problem. Informatica 10:457–466
46. Schebesch I, Engel H (1998) Wave propagation in heterogeneous excitable media. Phys Rev E 57:3905–3910
47. Schenk CP, Or-Guil M, Bode M, Purwins H-G (1997) Interacting pulses in three-component reaction-diffusion systems on two-dimensional domains. Phys Rev Lett 78:3781–3784
48. Sendiña-Nadal I, Mihaliuk E, Wang J, Pérez-Muñuzuri V, Showalter K (2001) Wave propagation in subexcitable media with periodically modulated excitability. Phys Rev Lett 86:1646–1649
49. Sielewiesiuk J, Gorecki J (2001) Logical functions of a cross junction of excitable chemical media. J Phys Chem A 105:8189–8195
50. Sienko T, Adamatzky A, Rambidi N, Conrad M (eds) (2003) Molecular computing. MIT Press
51. Steinbock O, Toth A, Showalter K (1995) Navigating complex labyrinths: optimal paths from chemical waves. Science 267:868–871
52. Tóth A, Showalter K (1995) Logic gates in excitable media. J Chem Phys 103:2058–2066
53. Tyson JJ, Fife PC (1980) Target patterns in a realistic model of the Belousov–Zhabotinskii reaction. J Chem Phys 73:2224–2237
54. Vergis A, Steiglitz K, Dickinson B (1986) The complexity of analog computation. Math Comput Simul 28:91–113
55. Wang J (2001) Light-induced pattern formation in the excitable Belousov–Zhabotinsky medium. Chem Phys Lett 339:357–361
56. Weaire D, Hutzler S, Cox S, Kern N, Alonso MD, Drenckhan W (2003) The fluid dynamics of foams. J Phys Condens Matter 15:S65–S73
57. Wuensche A (2005) Glider dynamics in 3-value hexagonal cellular automata: the beehive rule. Int J Unconv Comput 1:375–398
58. Wuensche A, Adamatzky A (2006) On spiral glider-guns in hexagonal cellular automata: activator-inhibitor paradigm. Int J Mod Phys 17:1009–1026
59. Yaguma S, Odagiri K, Takatsuka K (2004) Coupled-cellular-automata study on stochastic and pattern-formation dynamics under spatiotemporal fluctuation of temperature. Phys D 197:34–62
60. Yang X (2004) Pattern formation in enzyme inhibition and cooperativity with parallel cellular automata. Parallel Comput 30:741–751
61. Yang X (2006) Computational modeling of nonlinear calcium waves. Appl Math Model 30:200–208
62. Yokoi H, Adamatzky A, De Lacy Costello B, Melhuish C (2004) Excitable chemical medium controller for a robotic hand: closed-loop experiments. Int J Bifurc Chaos 14(9):3347–3354
63. Yoneyama M (1996) Optical modification of wave dynamics in a surface layer of the Mn-catalyzed Belousov–Zhabotinsky reaction. Chem Phys Lett 254:191–196
64. Young D (1984) A local activator–inhibitor model of vertebrate skin patterns. Math Biosci 72:51
Books and Reviews
Adamatzky A (2001) Computing in nonlinear media and automata collectives. Institute of Physics Publishing, London
Adamatzky A (ed) (2003) Collision-based computing. Springer, London
Adamatzky A, Teuscher C (eds) (2006) From utopian to genuine unconventional computers. Luniver Press, London
Adamatzky A, De Lacy Costello B, Asai T (2005) Reaction-diffusion computers. Elsevier, Amsterdam
Chua L (1998) CNN: a paradigm for complexity. World Scientific
Gray P, Scott SK (2002) Chemical oscillations and instabilities: non-linear chemical kinetics. Oxford University Press, Oxford
Scott SK (1994) Oscillations, waves and chaos in chemical kinetics. Oxford University Press, Oxford
Teuscher C, Adamatzky A (eds) (2005) Proceedings of the 2005 workshop on unconventional computing: from cellular automata to wetware. Luniver Press, London
Toffoli T, Margolus N (1987) Cellular automata machines. MIT Press
Record Statistics and Dynamics
Paolo Sibani¹, Henrik Jeldtoft Jensen²
¹ Institut for Fysik og Kemi, SDU, Odense, Denmark
² Department of Mathematics, Imperial College London, London, UK
Article Outline
Glossary
Definition of the Subject
Introduction
Complex Dynamics
The Distribution of Records in Stationary Time Series
Aging and Metastability
Marginal Stability
Record Statistics and Intermittency in Glassy Dynamics
Future Directions
Acknowledgments
Bibliography
Glossary
Aging The slow time evolution of a class of complex systems, which are brought into a far-from-equilibrium state by the sudden change of an external parameter, e.g. the temperature.
Complex dynamics The collective or emergent time-dependent properties of interacting multi-component and multi-agent systems.
Marginal stability A metastable state is marginally stable if it can be destroyed by slight perturbations.
Metastability The ability of a non-equilibrium system to remain in, or close to, the same state for a certain characteristic time.
Record In a time-ordered series of random numbers, a record is an entry larger than all preceding entries.
Scale invariant process A scale invariant process looks the same under re-scaling of time and/or space variables.
Stationary process A stationary process is time homogeneous, i.e. it is invariant under time translations.
Definition of the Subject
The term record statistics covers the statistical properties of records within an ordered series of numerical data obtained from observations or measurements. A record within such a series is simply a value larger (or smaller) than all preceding values. The mathematical properties
of records strongly depend on the properties of the series from which they are extracted. These properties have been investigated for many different cases, the simplest perhaps being series of independent random numbers drawn from the same (arbitrary) distribution, and series produced by a diffusion process with independent random increments. The term record dynamics covers the rather new idea that records may, in special situations, have measurable dynamical consequences. The approach applies to the aging dynamics of glasses and other systems with multiple metastable states. The basic idea is that record-sized fluctuations of, e.g., the energy are able to push the system past some sort of 'edge of stability', inducing irreversible configurational changes whose statistics then closely follows the statistics of record fluctuations.
Introduction
Floods, droughts and earthquakes are spectacular, potentially disastrous, and routinely monitored extremal events. Yet, even with no large-scale effects involved, record performances of all kinds attract great interest, as witnessed by the famous Guinness Book of Records. Aside from their popular appeal, extremes and their properties are an important topic of mathematical statistics [16,28]. Formally, a value larger (or smaller) than all its predecessors within a time-ordered series of numerical data constitutes a record. The number of records in a series encodes information on how the data are produced. E.g. measuring every second the velocity of a steadily accelerating object produces a time series in which each entry is a record. Conversely, with data generated by a random process, e.g. the coordinates of a drunkard taking a step in any direction with equal probability, considerably fewer records occur. An intermediate case, indicative of a climatic warming trend, is temperature records occurring more frequently than expected for fluctuations around a constant level [35].
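For the simplest case mentioned above, a series of independent, identically distributed continuous random numbers, the n-th entry is a record with probability 1/n, so the expected number of records among n entries equals the harmonic number H_n ≈ ln n + 0.5772. This is easy to check numerically; the sketch below is our own illustration, not taken from the article.

```python
import random

def count_records(series):
    """Number of records (entries larger than all predecessors) in a series."""
    records, best = 0, float("-inf")
    for x in series:
        if x > best:
            records, best = records + 1, x
    return records

# For i.i.d. draws, entry n is a record with probability 1/n, so the
# expected record count over n entries is the harmonic number H_n.
rng = random.Random(42)
n, trials = 1000, 500
mean = sum(count_records([rng.random() for _ in range(n)])
           for _ in range(trials)) / trials
harmonic = sum(1.0 / k for k in range(1, n + 1))  # H_1000, about 7.49
```

Note how slowly the record count grows: a thousand i.i.d. entries yield only about seven or eight records on average, foreshadowing the decelerating, log-time dynamics discussed later.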
More generally, record-sized events in spatio-temporal series of data, such as seismic shocks, are indicative of correlations and causal relationships [14]. In some contexts record-sized events have important implications: e.g. in population dynamics, mutants surpassing the current standard for highest reproductive success contribute disproportionately to the genetic pool of the next generation, a bias which automatically raises the bar for future improvements. This naturally leads to the expectation that record events be of significance to biological evolution, as perhaps in particular suggested by the catch-phrase 'survival of the fittest'. Darwinism's impact on biology – and human culture at large – can hardly
be exaggerated. In recent years, computer optimization methods have appeared, e.g. Genetic Algorithms [17] and Extremal Optimization [5], which are inspired by evolution dynamics and which, implicitly or explicitly, utilize records. In a computer program, or in a social context, the function, entity or organization in charge of memorizing the existing record and detecting its eventual obliteration is clearly distinct from the process being monitored. By contrast, record keeping in biological evolution is fully integrated in the dynamics, and we will see in examples below that record dynamics can emerge as a collective property of a collection of co-evolving organisms. In cases where records act as seeds for an amplifying dynamical process, the magnitude of the seed bears no direct relation to the magnitude of the final outcome of the amplification. E.g. a snow avalanche can be triggered by sound vibrations above a certain threshold; however, its impact is determined more by the amount of snow on the ground and by the slope and shape of the valley than by the intensity of the sound. Two important and related characteristics of record-induced dynamics are: i) its temporal non-homogeneity, meaning that record fluctuations and any induced effects occur at a decreasing rate, and ii) the presence of an embedded memory mechanism. Not coincidentally, the same properties are present in a large class of systems known as glassy. In the sequel, we focus on the significance of record-sized fluctuations for glassy dynamics, expanding on the observation that record fluctuations induce transitions between metastable states [43], leaving a so-called intermittent signal as a fingerprint [44].

Complex Dynamics

Systems of many interacting components with a coupling to an external reservoir of e.g. energy and/or particles are usually described using Master or Fokker–Planck equations [57].
These equations combine external and internal deterministic forces with the randomizing influence of the reservoir, and determine the probability flow between a set of allowed configurations or states. A simple yet illustrative example is the Brownian motion of a grain of pollen immersed in a fluid. Collisions with the surrounding molecules exert random forces on the grain. Furthermore, an external force may be applied to the pollen by means of e.g. an electric field. If the motion of the grain is constrained by the walls of a container, the probability distribution of its position reaches, on a certain time scale – called the relaxation time – a stationary form which only depends on the geometry of the container. In this case, all
memory of the initial condition is lost. Master equations or other dynamical equations often lead to a stationary state. The latter can be unique, or it can be one out of many, each state draining a basin of attraction in configuration space. An interesting situation arises when the stationarity is only approximate, i. e. when, given enough time, it is possible to escape from a basin of attraction and enter a different one. Each attraction basin then becomes a metastable region of configuration space, which is characterized by a certain escape time: the time typically needed for a trajectory to exit the basin. Gases, liquids and crystalline solids are mostly found in a stationary state where relevant physical observables, e. g. the local density, fluctuate reversibly around fixed values. A stationary fluctuation signal is invariant under time translations, its average is a constant and its autocorrelation function only depends on the difference between its two time arguments. In some cases, e. g. for equilibrium systems fluctuating near a critical temperature, correlations decay algebraically and may lack an associated time-scale. Interestingly, a similar situation arises in certain slowly driven systems, epitomized by the famous ‘sand pile’ model [4,22], which self-organizes into marginally stable configurations lacking characteristic space and time scales. Friction dominated granular systems, e. g. rice piles, are experimental realizations of this behavior, which is known as Self Organized Critical, or SOC. In SOC and some other driven dissipative systems the initial state is statistically equivalent to the states subsequently visited. In contrast, complex materials such as glasses, polymers, colloids and disordered magnetic materials, undergo a slow but systematic physical aging following a rapid quench of e. g. the temperature or the particle density. 
Typically, these materials get trapped in metastable basins, where physical variables appear to fluctuate reversibly on observation time scales shorter than the age. During this pseudo-equilibrium regime, fluctuations have Gaussian distributions with zero averages [44]. On longer time scales, intermittent events add an asymmetric, and typically exponential, tail to the Probability Density Function (PDF) of the fluctuations. The tail is caused by the non-equilibrium drift which, through a sequence of metastable basins of subtly different nature, slowly erases the memory of the initial state and changes physical properties, e.g. the linear response to a small perturbation.

The Distribution of Records in Stationary Time Series

Record-sized entries within a stationary series of independent and identically distributed random numbers form
a sub-series whose interesting statistical properties follow from simple arguments [16,42,45]. E.g., as the magnitude required to qualify as the next record increases with each new entry, records appear at a decreasing rate. For sufficiently long time series, the number of records found between the first and the $t$th draw can be shown to have a Poisson distribution with average $\ln t$. Equivalently, if the $k$th record appears at time $t_k$, the ratios $\ln(t_k/t_{k-1})$ are independent random numbers with an exponential distribution.

Consider a sequence of independent random numbers drawn from the same distribution at times $1, 2, 3, \ldots, t$. We exclude distributions supported on a finite set, because these eventually produce a record which cannot be beaten; apart from this constraint, the form of the distribution is immaterial. The first number drawn is by definition a record. Subsequent trials lead to a record if their outcome is larger than the previous record. We seek the probability $P_n(t)$ of finding precisely $n$ records in $t$ successive trials, where $1 \le n \le t$. In the derivation we need the auxiliary function $P_{(1,m_1,\ldots,m_{k-1})}(t)$, which is the joint probability that $k$ records happen at times $1 < m_1 < \cdots < m_{k-1}$, with $m_{k-1} \le t$.

$P_1(t)$ is simply the probability that the first outcome is the largest among $t$. As each outcome has, by symmetry, the same probability of being the largest, it follows that $P_1(t) = 1/t$. In order to obtain two records at times $1$ and $m$, the largest of the first $m-1$ random numbers must be drawn at the very first trial. This happens with probability $1/(m-1)$. Secondly, the $m$th outcome must be the largest among $t$. This happens with probability $1/t$, independently of the position of the largest outcome in the first $m-1$ trials. Accordingly,

$$P_{(1,m)}(t) = \frac{1}{(m-1)\,t}\,. \qquad (1)$$

Summing the above over all possible values of $m$ we then obtain

$$P_2(t) = \sum_{m=2}^{t} \frac{1}{(m-1)\,t} \approx \frac{\ln t}{t}\,, \qquad (2)$$

and, in the more general case of $n$ events, we similarly obtain

$$P_{(1,m_1,\ldots,m_{n-1})}(t) = \frac{1}{\prod_{i=1}^{n-1}(m_i - 1)\, t}\,. \qquad (3)$$

We now take $q_i = m_i - 1$ and sum over all possible values of the $q_i$'s, leading to

$$P_n(t) = \sum_{q_1=1}^{t-n+1} \frac{1}{q_1} \cdots \sum_{q_{n-1}=q_{n-2}+1}^{t-1} \frac{1}{q_{n-1}}\, \frac{1}{t}\,. \qquad (4)$$

An approximate closed form expression can now be obtained by replacing the sums by integrals, which is reasonable for $t \gg n \gg 1$. The integrals can then be evaluated, finally yielding

$$P_n(t) = \frac{(\ln t)^{n-1}}{(n-1)!}\, \frac{1}{t}\,. \qquad (5)$$
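A quick numerical check of this result is straightforward. The sketch below (plain Python, with illustrative parameters) counts records in i.i.d. uniform series; since the $k$th draw is a record with probability $1/k$, the exact mean count after $t$ draws is the harmonic number $H_t = \sum_{k=1}^t 1/k$, which approaches the $\ln t$ of Eq. 5 for large $t$.

```python
import math
import random

def count_records(t, rng):
    """Number of records in t i.i.d. uniform draws (the first draw always counts)."""
    best = float("-inf")
    n = 0
    for _ in range(t):
        x = rng.random()
        if x > best:
            best = x
            n += 1
    return n

rng = random.Random(0)
t, runs = 10_000, 1_000
mean_n = sum(count_records(t, rng) for _ in range(runs)) / runs

# Exact mean: draw k is a record with probability 1/k, so <n> = H_t,
# the harmonic number, which is ln(t) + 0.577... for large t; Eq. 5
# gives the asymptotic Poisson form with average ln t.
H_t = sum(1.0 / k for k in range(1, t + 1))
print(mean_n, H_t, math.log(t))
```

The empirical mean sits within sampling error of $H_t$, and the offset between $H_t$ and $\ln t$ is the Euler constant, which is negligible on the logarithmic scales relevant below.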
As anticipated, Eq. 5 is a Poisson distribution, with $\ln t$ replacing the usual time argument $t$. Clearly, the number of records between draws $t_w$ and $t > t_w$ is the difference of two Poisson processes, and hence itself Poisson distributed, with average $\ln(t) - \ln(t_w) = \ln(t/t_w)$. The discrete series of random numbers can be replaced by stationary random white noise, in which case $t$ becomes a continuous time variable, and the restriction $t \gg n$ on the asymptotic validity of Eq. 5 can be lifted.

Let $\bar n(t)$ and $\sigma_n^2(t)$ be the average and variance of the number of events in time $t$. As an immediate consequence of Eq. 5 we note that

$$\bar n(t) = \sigma_n^2(t) = \ln t\,. \qquad (6)$$

Furthermore, the 'current', i.e. the average number of events per unit of time, decays as

$$\frac{\mathrm{d}\bar n}{\mathrm{d}t} = \frac{1}{t}\,. \qquad (7)$$

A third consequence of Eq. 5 is the following: let $t_1 = 1 < t_2 < \cdots < t_k < \cdots$ be the times at which the record-breaking events occur, and let $\tau_1 = \ln t_1 = 0 < \tau_2 < \cdots < \tau_k = \ln t_k < \cdots$ be the corresponding natural logarithms. The stochastic variables $\Delta_k = \tau_{k+1} - \tau_k = \ln(t_{k+1}/t_k)$ are independent and identically distributed. Their common distribution is an exponential with unit average. By writing $\tau_k = \Delta_{k-1} + \Delta_{k-2} + \cdots + \Delta_1$ we have that $(\tau_{k+1} - k)/\sqrt{k}$ approaches a standard Gaussian distribution for large $k$. Hence, the waiting time $t_k$ for the $k$th event is approximately log-normal, and the average of its logarithm grows linearly in $k$. By Jensen's inequality [37] we also find

$$\ln \langle t_k \rangle \ge \langle \ln t_k \rangle = k\,. \qquad (8)$$

We see that the average waiting time from the $k$th to the $(k+1)$th event grows at least exponentially in $k$.

Consider finally the case in which a system is made up of $\alpha$ independent parts. The total number of records is the sum of the contributions from each part, and hence remains Poisson distributed. For the most general case, where records between times $t_w$ and $t > t_w$ are of interest, the Poisson process has average

$$\langle n \rangle(t_w, t) = \alpha \ln(t/t_w)\,. \qquad (9)$$
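The log-time construction above, record times whose logarithmic spacings $\Delta_k$ are i.i.d. unit-mean exponentials, can be simulated directly. In the minimal sketch below, the convention $\tau_1 = 0$ makes the exact mean $\langle\tau_k\rangle = k - 1$, matching the linear growth in $k$ stated above up to an asymptotically irrelevant offset.

```python
import random

rng = random.Random(1)

# Build tau_k = ln t_k as a sum of i.i.d. unit-mean exponential increments
# Delta_k. With tau_1 = 0 the exact mean is k - 1, i.e. <ln t_k> grows
# linearly in k, so the waiting times between consecutive records grow
# exponentially with k.
runs, k_max = 5_000, 20
mean_tau = [0.0] * (k_max + 1)  # mean_tau[k] estimates <tau_k>, k = 1..k_max
for _ in range(runs):
    tau = 0.0
    for k in range(1, k_max + 1):
        mean_tau[k] += tau / runs
        tau += rng.expovariate(1.0)

print(mean_tau[1], mean_tau[5], mean_tau[10], mean_tau[20])
```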
The cumulative distribution of the 'logarithmic waiting times' $\delta_k = \ln(t_k/t_{k-1})$ is correspondingly

$$P(\delta_k < x) = 1 - \exp(-\alpha x)\,, \quad x > 0\,,\ k = 2, 3, \ldots \qquad (10)$$
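Equations 9 and 10 lend themselves to a direct numerical test. The sketch below (plain Python, illustrative parameters) sums the records of $\alpha$ independent i.i.d. series between draws $t_w$ and $t$; the total count should be Poisson distributed, with mean, and hence also variance, close to $\alpha \ln(t/t_w)$ (the exact mean is a difference of harmonic numbers, which introduces a small finite-size offset).

```python
import math
import random

rng = random.Random(2)

def records_between(t_w, t, alpha, rng):
    """Total number of records, summed over alpha independent i.i.d. series,
    occurring after draw t_w and up to draw t."""
    total = 0
    for _ in range(alpha):
        best = float("-inf")
        for i in range(1, t + 1):
            x = rng.random()
            if x > best:
                best = x
                if i > t_w:
                    total += 1
    return total

alpha, t_w, t, runs = 5, 20, 2_000, 400
counts = [records_between(t_w, t, alpha, rng) for _ in range(runs)]
mean = sum(counts) / runs
var = sum((c - mean) ** 2 for c in counts) / runs

# Eq. 9: Poisson with average alpha*ln(t/t_w), so mean and variance should
# both be close to 5*ln(100), up to a small finite-size offset.
print(mean, var, alpha * math.log(t / t_w))
```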
In glassy dynamics, noise fluctuations are drawn from a (quasi-)equilibrium distribution characterizing the current metastable region. This distribution heavily penalizes large excursions from the metastable configuration, making record events rare and uncorrelated. Importantly, spatially extended systems with short-range interactions may break down into a set of weakly interacting subsystems, or domains, which are localized in space and which have independent fluctuations and independent metastable states. E.g. spin glasses are characterized by a thermal correlation length, corresponding to the linear size of the equilibrated domains. The latter grows slowly and remains at all times rather small compared to the linear size of the sample. Each independently thermalized domain evolves in parallel with the other domains, from which it is shielded by a backbone of mainly inactive degrees of freedom. In accord with observations discussed below, fluctuation spectra of physical quantities have a Gaussian component with zero average, and a non-Gaussian tail. The former describes the sum of independent equilibrium-like fluctuations, and the latter describes the rare non-equilibrium large events, or quakes, which carry the drift. In this situation, the parameter $\alpha$ of Eq. 9 can be identified with the number of domains. Its value enters the width of the Gaussian, which grows as the square root of $\alpha$, as well as the statistical weight of the tail, which grows linearly with $\alpha$.

Aging and Metastability

The term glassy dynamics usually refers to the extremely slow relaxation observed in complex systems with many interacting components, where reaching a steady, time-independent state typically lies far beyond experimentally accessible time scales. E.g., when melted alloys are cooled down rapidly, they do not enter a crystalline ordered state.
Instead, the atoms retain the amorphous arrangement characteristic of the liquid phase, while the mobility of the molecules decreases by many orders of magnitude. This colossal change in the characteristic dynamical time scales prevents a structural glass at low temperature from reaching thermodynamic equilibrium. Nevertheless, the properties of glasses over short observation time scales may appear to be time independent, as in thermal equilibrium. The low temperature out-of-equilibrium behavior of structural glasses and of a host of other complex materials is intriguingly similar. E.g. spin glasses [29], a class of disordered magnetic alloys, polymers [53], colloids and gels [10], and type II superconductors [31,32] all remain out of equilibrium, with their macroscopic physical properties changing algebraically or logarithmically as a function of the time elapsed since the initial quench. Small temperature and field variations applied during the aging process uncover many fascinating 'memory' effects, see e.g. [19,23,59], which indicate the presence of a hierarchy of time scales for relaxation in configuration space, a hierarchy which is also explicitly uncovered in numerical studies [49,50]. The topography of complex energy landscapes has been thoroughly investigated, see e.g. [2,30,38,51,52], and various heuristic approaches have been proposed in order to link topography and dynamics. For glassy dynamics, heuristic ideas such as hierarchical mesoscopic models [20,58], and the highly popular trap model [6], have been considered. While many aspects of aging in glassy systems are known and partly understood, important open questions remain. E.g. how are configuration space properties and real space morphology to be combined in a seamless fashion? How should one weigh thermodynamical versus dynamical properties, and, more specifically, what is the rôle of equilibrium statistical properties, a rôle initially strongly advocated from different camps [7,15,33] and hotly debated ever since [24,25,34]? In retrospect, knowing that systems with similar aging behavior have completely different equilibrium properties, e.g. vibrated granular matter versus thermalizing polymers, the connection between aging and equilibrium properties appears rather tenuous. A radically different approach was initiated by Coppersmith and Littlewood, who studied memory effects in Charge Density Waves [12] using a simple model with multiple attractors.
The focus is there on the mechanism for attractor selection, and the suggestion is that the least stable attractors are those typically selected following a quench, in spite of their having little statistical weight. One development of the idea [45] points to fluctuation records, which are likewise statistically insignificant, as key events in aging dynamics. The focus of both experiments and theory has recently shifted from macroscopic averages to the fluctuations around these averages (Buisson et al. [8]), which are observable in meso- and nanoscaled systems. As mentioned, the fluctuation spectra have a Gaussian part, pertaining to pseudo-equilibrium fluctuations around the 'current' metastable value of the macroscopic average of the quantity of interest, and an asymmetric exponential tail, describing rare and irreversible jumps from one metastable
configuration to the next. Such ‘quakes’ carry the full macroscopic evolution of the system. Their decreasing rate is indicative of an entrenchment into gradually more stable configurations.
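The two-component fluctuation spectrum described here is easy to mimic. The toy sketch below (plain Python; all parameter values are illustrative, not fitted to any experiment) superposes zero-mean Gaussian pseudo-equilibrium fluctuations with rare, exponentially distributed negative quake contributions, producing a PDF with a Gaussian core and a one-sided exponential tail.

```python
import random

rng = random.Random(4)

def fluctuation(rng, p_quake=0.02, sigma=1.0, quake_scale=8.0):
    """One fluctuation sample: a zero-mean Gaussian pseudo-equilibrium term
    plus, with small probability, a quake term (a large negative release).
    All parameter values are illustrative, not taken from any experiment."""
    x = rng.gauss(0.0, sigma)
    if rng.random() < p_quake:
        x -= rng.expovariate(1.0 / quake_scale)
    return x

samples = [fluctuation(rng) for _ in range(100_000)]
neg_tail = sum(1 for x in samples if x < -5.0)
pos_tail = sum(1 for x in samples if x > 5.0)

# The Gaussian core puts essentially no weight beyond 5 sigma, so the
# strong asymmetry between the two tails is the fingerprint of the quakes.
print(neg_tail, pos_tail)
```

Counting events beyond five standard deviations of the Gaussian part shows a heavy negative tail and an essentially empty positive one, the intermittency signature discussed in the text.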
Marginal Stability

Broadly speaking, marginal stability relates to how metastable attractor basins are selected during the off-equilibrium evolution of metastable complex systems. Diverse applications include driven systems, e.g. charge density waves [12,45] and the sand-pile model [4] and its conceptual ancestors [3,41,56], and thermal aging of e.g. spin glasses [44] and type II superconductors [32]. The relevance to biological evolution is suggested by a number of model studies, some where a fitness function is explicitly defined and drives the dynamics [46,48], and others based on co-evolutionary interactions without a fitness function [1]. Record-sized fluctuations are irrelevant in stationary processes, where the possibility of triggering permanent changes simply does not exist. They are likewise irrelevant in processes characterized by one or a few time scales, e.g. thermal equilibration between two metastable basins: a record has a rank, but lacks an inherent magnitude which could match the scale of the process. For record fluctuations to trigger irreversible changes, many inequivalent metastable basins must be accessible to the dynamics. We envisage that each metastable basin is associated with a finite characteristic escape time $\tau$, the typical time it takes to leave the basin. For thermal dynamics, the relevant time scale is the Arrhenius time corresponding to a free energy barrier $b = k_B T \ln(\tau)$, where $T$ is the temperature and $k_B$ is the Boltzmann constant. Assuming that attractor basins have a broad range of stability, with the more fickle basins out-numbering the more stable ones, a quench from a random initial condition typically leads to very fickle basins. At any stage of the subsequent evolution, all moves leading from the less to the more fickle basins can be reversed on the time scale at which they first happen.
Irreversible moves into more stable attractors, the quakes, are important for the aging dynamics, and typically entail a tiny, or marginal, increase of stability. The marginal increase implies that a random fluctuation slightly larger than the previous largest fluctuation, i.e. a record fluctuation, can trigger the next quake. De facto irreversibility of the quakes can be strengthened by a thermodynamic mechanism, i.e. the basin change may entail a large decrease in energy and/or a large increase of entropy.
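A toy version of this scenario needs only a noise source and a running record. In the sketch below (plain Python; the uniform noise distribution is chosen purely for convenience, since record statistics do not depend on the distribution), a quake is triggered whenever a fluctuation exceeds all previous ones; the quake count per decade of time then hovers around $\ln 10 \approx 2.3$, i.e. the quake rate decays as $1/t$, as in Eq. 7.

```python
import math
import random

rng = random.Random(3)

def quake_times(t_max, rng):
    """One toy aging run: at each time step a noise value is drawn, and a
    value exceeding all previous ones (a record) is assumed to trigger a
    quake, as in the marginal-stability scenario above."""
    record = -1.0
    quakes = []
    for step in range(1, t_max + 1):
        x = rng.random()
        if x > record:
            record = x
            quakes.append(step)
    return quakes

runs, t_max = 500, 10_000
per_decade = [0.0, 0.0, 0.0]  # mean counts in [10,100), [100,1000), [1000,10000)
for _ in range(runs):
    for q in quake_times(t_max, rng):
        if 10 <= q < 10_000:
            per_decade[int(math.log10(q)) - 1] += 1.0 / runs

# Close to ln(10) ~ 2.3 quakes per decade of time: log-Poisson statistics,
# i.e. a quake rate decaying as 1/t.
print(per_decade)
```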
Record Statistics and Dynamics, Figure 1
The cartoon, taken from [43], illustrates the link between marginal stability and record dynamics: the wedges represent metastable regions of a complex system, with any reference to their internal hierarchical structure omitted for simplicity. The vertical axis is the energy. Jumps from one metastable region to the next (quakes) are indicated by unidirectional arrows. The thermal stability of each metastable region is represented by the vertical distance from the bottom to the boundary of the wedge, i.e. the energy barrier for the corresponding region. Importantly, the barrier of the current attractor only changes minutely through each quake
The record dynamics scenario sketched above implies that, as the system ages, the metastable basins successively explored develop a nested hierarchical substructure, each layer composed of basins of greater and greater stability. If all metastable regions maintain a similar structure, the noise distribution from which the records are drawn does not itself change during the aging process. The temporal statistics of record fluctuations and quakes then has universal properties which are disconnected from the physical changes associated with the quakes. These changes may e.g. depend on the physical quantity being measured and on the noise strength. Using the subordination principle [32,39,47] briefly discussed in the next section, the changes can be described mathematically in a general, albeit approximate, fashion.

Record Statistics and Intermittency in Glassy Dynamics

The left panel of Fig. 2 (taken from [32]) illustrates important qualitative features of intermittency in complex systems. The figure depicts the time evolution of the number
Record Statistics and Dynamics, Figure 2
The figure is taken from [32]. Left panel: the time variation of the total number of vortices N(t) in the system for a single realization of the pinning potential and the thermal noise, in an $8 \times 8 \times 8$ lattice at $T = 0.1$. The monotonous step-function character of the time series, and the fact that the steps have approximately the same duration on a logarithmic time scale, indicate the decelerating nature of the dynamics. Right panel: numerical results for the creep rate versus $T$ for the time interval between $t = 1000$ and $t = 10000$. In the low temperature region the creep rate is constant within our numerical precision over about two orders of magnitude – we observe a nonzero creep rate in the $T \to 0$ limit. Insets: experimental results for the creep rate versus $T$. The right inset shows data from Keller et al. [26,27] for melt processed YBCO crystals with the magnetic field applied along the c-axis (squares) and the ab plane (circles). The left inset shows data from Civale et al. [11] for un-irradiated (squares) and 3 MeV proton-irradiated (circles) YBCO flux grown crystals with a 1 T magnetic field applied parallel to the c-axis
of vortices in the Random Occupancy Model for magnetic flux creep in type II superconductors [31]. The number n(t) of vortices inside the system changes in a step-wise fashion. Each horizontal plateau corresponds to the time spent in a metastable configuration. The plateaus appear to have the same average duration on a logarithmic time scale, showing that the dynamics entrenches itself into gradually more stable basins. Each vertical jump corresponds to a quake, leading from one basin to the next. The size of the quake is the height of the corresponding jump. In a record-dynamics scenario the rate of quakes is independent of the statistical properties of the noise. Specifically, for thermally activated dynamics, it is independent of the temperature. The rate of change of a physical observable may have a temperature dependence, which enters through the quake size distribution, i.e. the statistical distribution of the vertical jumps. The right panel of Fig. 2, also taken from [32], shows the striking temperature independence of the creep rate in the ROM model, and the insets show its experimental counterpart.

Quenching a glassy system from a high temperature invariably leads to a configuration with an energy far above the equilibrium value at the low temperature. It is then of interest to understand how the excess energy leaves the system during the aging process. The release in a spin-glass [44] occurs through sporadic, intermittent quakes, and the average rate of energy flow decreases (nearly) as the inverse of the system age. The p-spin model [54], an Ising spin model with plaquette interactions, shows this type of behavior as well. The left panel of Fig. 3 (taken from [39]) shows the Probability Density Function (PDF) of the amount of heat given off by the system over short intervals of length $\delta t$, for several values of the system age $t_w$. The Gaussian part of the PDF, which is nearly independent of $t_w$, describes reversible energy fluctuations of zero average. The exponential tail describes the outflow of heat. As $t_w$ increases, all other parameters being constant, the tail becomes less prominent. The right panel shows, for a number of low temperatures, that the rate of energy flow decays as the reciprocal of the age.

Experimental probes of glassy dynamics often involve applying a small perturbing field at a certain age $t_w$ and measuring the linear response, $R(t, t_w)$, for $t > t_w$. The glassy response depends on both $t_w$ and $t$, while in a stationary situation the only dependence is on the difference $t - t_w$. There is a large and growing body of experimental and numerical results concerning linear response functions in aging systems and their many facets [23,36,47,55,58]. Attempts to describe the off-equilibrium response as analogous to equilibrium response lead to the idea of an effective temperature [13]. Being defined in the limits $t_w \to \infty$ and $t \to \infty$, with $t/t_w$ finite, the latter quantity is, in practice, difficult to measure [21]. At a more basic level, one may ask whether a glassy system is mainly responding to the applied field, or rather mainly responding to the initial quench. In the latter case, the field has the auxiliary rôle of biasing the course of the quakes, but no influence on their temporal statistics.

Record Statistics and Dynamics, Figure 3
The figure is taken from [39]. Left panel: the PDF of the heat exchanged between system and thermal bath over a time $\delta t = 5$. Negative values correspond to an energy outflow. The data are based on 200 independent runs, taken at temperature $T = 1$ in the intervals $[t_w, t_w + 500]$ for $t_w = 1000$ (diamonds), $t_w = 2000$ (squares), $t_w = 4000$ (polygons), $t_w = 8000$ (hexagons) and $t_w = 16000$ (circles). Right panel: the average rate of flow of the energy $r_E$ is plotted versus the age for the five temperatures shown. The full line has the form $y = c t_w^{-1}$, with the proportionality constant $c$ estimated as the mean of $t_w r_E$. Since data sets for different temperatures are almost overlapping, they are vertically shifted in the plot for the sake of typographical clarity

Technically, the response is then subordinated to the quakes, which are themselves subordinated to record-sized fluctuations. Following this approach [39,47], physical observables are treated as (stochastic) functions of the number of quakes $k$ occurring during the observation interval. Generic eigenvalue expansions in the variable $k$ are available, and the time and age dependence are extracted by averaging them over all possible values of $k$ according to the Poisson distribution with the average given in Eq. 9. The subordination approach is not limited to linear response functions; e.g. it can be applied to a calculation of the configurational autocorrelation function [40], or indeed any other quantity which can be argued to be (mainly) a function of $k$ [32].

The potential relevance of records for the evolution of ecosystems has been investigated using the Tangled Nature model of evolutionary ecology [9,18]. The model deals with a population of individuals undergoing mutation-prone reproduction in a type space. All individuals are equally likely to be removed or killed. Their reproduction, however, depends on the interactions among subsets of individuals: the reproduction probability of an individual of type, say, A will change through interactions with a type B individual which comes into being through a mutation of another existing type. Depending on whether the interaction is cooperative or antagonistic, the reproduction rate of A will increase, respectively decrease. Metastable ecosystems spontaneously arise as sets of types for which the reproduction and death probabilities of the individuals are balanced. Such metastable configurations last for periods of varying duration, and are eventually replaced by new metastable configurations. The model
does not assume a fitness function, and no optimization procedure is explicitly invoked at the level of the microscopic dynamics. Interestingly, the times spent in consecutive metastable configurations have statistics in qualitative agreement with the predictions of record dynamics [1]. This illustrates how record dynamics can be identified by a relatively straightforward analysis of time series, and how it can emerge at the level of collective evolution even when the precise nature of the observable undergoing records is not known.

Future Directions

Metastable systems are ubiquitous in nature, and their interaction with an external noisy environment affects their dynamics in significant ways. When the system evolves irreversibly through metastable configurations of marginally increasing stability, fluctuation records determine the course of the dynamics. The temporal statistics of records is an in-road to an approximate analytical description of the time evolution of complex systems, ranging from physical materials to biological systems. It is an open challenge to test and develop these ideas on the large variety of model and observational data where irreversibility and marginal stability appear in different guises.

Acknowledgments

The authors are indebted to P. Anderson, J. Dall, L. Oliveira and S. Boettcher. P. Sibani did part of this work while visiting the Cherry L. Emerson Center for Scientific Computation at Emory University.
Bibliography

Primary Literature

1. Anderson P, Jensen HJ, Oliveira LP, Sibani P (2004) Evolution in complex systems. Complexity 10:49–56
2. Angelani L, Di Leonardo R, Parisi G, Ruocco G (2003) Topological description of the aging dynamics in simple glasses. Phys Rev Lett 87:055502
3. Bak P (1997) How Nature Works. The science of self-organized criticality. Oxford University Press, Oxford
4. Bak P, Tang C, Wiesenfeld K (1987) Self-organized criticality: An explanation of 1/f noise. Phys Rev Lett 59:381–384
5. Boettcher S, Percus A (2001) Optimization with extremal dynamics. Phys Rev Lett 86:5211–5214
6. Bouchaud JP (1992) Weak ergodicity breaking and aging in disordered systems. J Phys I France 2:1705–1713
7. Bray AJ, Moore MA (1987) Chaotic nature of the spin-glass phase. Phys Rev Lett 58:57–60
8. Buisson L, Bellon L, Ciliberto S (2003) Intermittency in aging. J Phys Condens Matter 15:S1163
9. Christensen K, di Collobiano SA, Hall M, Jensen HJ (2002) Tangled Nature: A Model of Evolutionary Ecology. J Theor Biol 216:73–84
10. Cipelletti L, Manley S, Ball RC, Weitz DA (2000) Universal aging features in the restructuring of fractal colloidal gels. Phys Rev Lett 84:2275–2278
11. Civale L, Marwick AD, McElfresh MW, Worthington TK, Malozemoff AP, Holtzberg FH, Thompson JR, Kirk MA (1990) Defect independence of the irreversibility line in proton-irradiated YBa-Cu-O crystals. Phys Rev Lett 65(9):1164–1167
12. Coppersmith SN, Littlewood PB (1987) Pulse-duration memory effect and deformable charge-density waves. Phys Rev B 36:311–317
13. Cugliandolo LF, Kurchan J, Peliti L (1997) Energy flow, partial equilibration, and effective temperature in systems with slow dynamics. Phys Rev E 55:3898–3914
14. Davidsen J, Grassberger P, Paczuski M (2007) Networks of recurrent events, a theory of records, and an application to finding causal signatures in seismicity. cond-mat/0701190
15. Fisher DS, Huse DA (1986) Ordered phase of short-range Ising spin-glasses. Phys Rev Lett 56:1601–1604
16. Glick N (1978) Breaking records and breaking boards. Am Math Mon 85:2–26
17. Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading
18. Hall M, Christensen K, di Collabiano SA, Jensen HJ (2002) Time dependent extinction rate and species abundance in the Tangled Nature model of biological evolution. Phys Rev E 66:011904
19. Hammann J, Bouchaud J-P, Dupuis V, Vincent E (2001) Separation of time and length scales in spin glasses: Temperature as a microscope. Phys Rev B 65:024439
20. Hoffmann KH, Sibani P (1988) Diffusion in hierarchies. Phys Rev A 38:4261–4270
21. Hérisson D, Ocio M (2002) Fluctuation-dissipation ratio of a spin glass in the aging regime. Phys Rev Lett 88:257202
22. Jensen HJ (1998) Self-Organized Complex Systems. Cambridge University Press, Cambridge
23. Jonason K, Vincent E, Hammann J, Bouchaud JP, Nordblad P (1998) Memory and Chaos Effects in Spin Glasses. Phys Rev Lett 81:3243–3246
24. Jönsson PE, Takayama H, Katori AH, Ito A (2005) Dynamical breakdown of the Ising spin-glass order under a magnetic field. Phys Rev B 71:180412(R)
25. Katzgraber HG, Young AP (2005) Probing the Almeida-Thouless line away from the mean-field model. Phys Rev B 72:184416
26. Keller C, Kupfer H, Meier-Hirmer R, Wiech U, Selvamanickam V, Salama K (1990) Irreversible behaviour of oriented grained YBa2Cu3Ox. Part 2: relaxation phenomena. Cryogenics 30:410–416
27. Keller C, Kupfer H, Meier-Hirmer R, Wiech U, Selvamanickam V, Salama K (1990) Irreversible behaviour of oriented grained YBa2Cu3Ox. Part 1: transport and shielding currents. Cryogenics 30:401–409
28. Krug J (2007) Records in a changing world. cond-mat/0702136
29. Lundgren L, Nordblad P, Sandlund L (1986) Memory behaviour of the spin glass relaxation. Europhys Lett 1:529–534
30. Mossa S, Ruocco G, Sciortino F, Tartaglia P (2002) Quenches and crunches: does the system explore in ageing the same part of the configuration space explored in equilibrium? Phil Mag B 82:695–705
31. Nicodemi M, Jensen HJ (2001) Aging and memory phenomena in magnetic and transport properties of vortex matter: a brief review. J Phys A 34:8425
32. Oliveira LP, Jensen HJ, Nicodemi M, Sibani P (2005) Record dynamics and the observed temperature plateau in the magnetic creep rate of type II superconductors. Phys Rev B 71:104526
33. Parisi G (1979) Infinite number of order parameters for spin-glasses. Phys Rev Lett 43:1754–1756
34. Parisi G, Marinari E, Ruiz-Lorenzo JJ (1998) On the phase structure of the 3D Edwards-Anderson spin-glass. Phys Rev B 58:14852
35. Redner S, Petersen MR (2007) Role of global warming on the statistics of record-breaking temperatures. Phys Rev E 74:061114
36. Rodriguez GF, Kenning GG, Orbach R (2003) Full Aging in Spin Glasses. Phys Rev Lett 91:037203
37. Rudin W (1966) Real and complex analysis. McGraw Hill, New York
38. Schön JC, Putz H, Jansen M (1996) Studying the energy hypersurface of continuous systems – the threshold algorithm. J Phys: Condens Matter 8:143–156
39. Sibani P (2006) Aging and intermittency in a p-spin model. Phys Rev E 74:031115
40. Sibani P (2006) Mesoscopic fluctuations and intermittency in aging dynamics. Europhys Lett 73:69–75
41. Sibani P, Andersen CM (2001) Aging and self-organized criticality in driven dissipative systems. Phys Rev E 64:021103
42. Sibani P, Brandt M, Alstrøm P (1998) Evolution and extinction dynamics in rugged fitness landscapes. Int J Modern Phys B 12:361–391
43. Sibani P, Dall J (2003) Log-Poisson statistics and pure aging in glassy systems. Europhys Lett 64:8–14
44. Sibani P, Jensen HJ (2005) Intermittency, aging and extremal fluctuations. Europhys Lett 69:563–569
45. Sibani P, Littlewood PB (1993) Slow Dynamics from Noise Adaptation. Phys Rev Lett 71:1482–1485
46. Sibani P, Pedersen A (1999) Evolution dynamics in terraced NK landscapes. Europhys Lett 48:346–352
Record Statistics and Dynamics
47. Sibani P, Rodriguez GF, Kenning GG (2006) Intermittent quakes and record dynamics in the thermoremanent magnetization of a spin-glass. Phys Rev B 74:224407 48. Sibani P, Schmidt M, Alstrøm P (1995) Fitness optimization and decay of the extinction rate through biological evolution. Phys Rev Lett 75:2055–2058 49. Sibani P, Schriver P (1994) Phase-structure and low-temperature dynamics of short range Ising spin glasses. Phys Rev B 49:6667–6671 50. Sibani P, Schön C, Salamon P, Andersson J-O (1993) Emergent hierarchical structures in complex system dynamics. Europhys Lett 22:479–485 51. Stillinger FH, Weber TA (1983) Dynamics of structural transitions in liquids. Phys Rev A 28:2408–2416 52. Stillinger FH, Weber TA (1984) Packing Structures and Transitions in Liquids and Solids. Science 225:983–989 53. Struik LCE (1978) Physical aging in amorphous polymers and other materials. Elsevier, New York 54. Swift MR, Bokil H, Travasso RDM, Bray AJ (2000) Glassy behavior in a ferromagnetic p-spin model. Phys Rev B 62:11494– 11498
55. Takayama H, Hukushima K (2002) Numerical Study on Aging Dynamics in 3D Ising Spin-Glass Model. III. Cumulative Memory and ‘Chaos’ Effects in the Temperature Shift Protocol. J Phys Soc JPN 71:3003–3010 56. Tang C, Wiesenfeld K, Bak P, Coppersmith S, Littlewood P (1987) Phase Organization. Phys Rev Lett 58:1161–1164 57. Van Kampen NG (1992) Stochastic Processes in Physics and Chemistry. North Holland, Amsterdam 58. Vincent E (1991) Slow dynamics in spin glasses and other complex systems. In: Ryan DH (ed) Recent progress in random magnets. McGill University, Montreal, pp 209–246 59. Vincent E, Hammann J, Ocio M, Bouchaud J-P, Cugliandolo LF (1996) Slow dynamics and aging in spin-glasses. SPECSACLAY-96/048
Books and Reviews Tinkham M (1998) Introduction to Superconductivity. Krieger, Melbourne Drossel B (2001) Biological evolution and statistical physics. Adv Phys 50:209–295
Repeated Games with Complete Information
OLIVIER GOSSNER 1, TRISTAN TOMALA 2
1 PSE, UMR CNRS-EHESS-ENPC-ENS 8545, Paris, France, and Northwestern University
2 Economics and Finance Department, HEC Paris, Paris, France
Article Outline
Glossary
Definition of the Subject
Introduction
Games with Observable Actions
Games with Non-observable Actions
Acknowledgments
Bibliography

Glossary
Repeated game  A model of repeated interaction between agents.
Behavioral strategy  A decision rule that prescribes a randomized choice of actions after every possible history.
Nash equilibrium  A strategy profile from which no unilateral deviation is profitable.
Sequential equilibrium  A strategy profile, together with Bayesian beliefs on past histories, such that after every history each agent acts optimally given his beliefs.
Monitoring structure  A description of the players' observation of each other's strategic choices. It specifies, for every profile of actions, the probability distribution over the profiles of individual signals received by the agents.

Definition of the Subject
Repeated interactions arise in several domains such as Economics, Computer Science, and Biology. The theory of repeated games models situations in which a group of agents engage in a strategic interaction over and over. The data of the strategic interaction is fixed over time and is known by all the players. This is in contrast with stochastic games, for which the data of the strategic interaction is controlled by the players' choices, and with repeated games with incomplete information, where the stage game is not common knowledge among the players (the reader is referred to the corresponding chapters of this Encyclopedia). Early studies of repeated games include Luce and Raiffa [48] and Aumann [4]. In the context of production games, Friedman [25] shows that, while the competitive outcome is the only one compatible with individual profit maximization under a static interaction, collusion is sustainable at an equilibrium when the interaction is repeated. More generally, repeated games provide a framework in which individual utility maximization by selfish agents is compatible with welfare maximization (the common good), while this is known to fail for many classes of static interactions.

Introduction
The discussion of an example shows the importance of repeated games and introduces the questions studied. Consider the following game, referred to as the Prisoner's Dilemma:

         C       D
    C   4, 4    0, 5
    D   5, 0    1, 1

The Prisoner's Dilemma
Player 1 chooses the row, player 2 chooses the column, and the pair of numbers in the corresponding cell are the payoffs to players 1 and 2 respectively. In a one-shot interaction, the only outcome consistent with game-theoretic predictions is (D, D). In fact, each player is better off playing D whatever the other player does. On the other hand, if players engage in a repeated Prisoner's Dilemma, if they value future payoffs sufficiently compared to present ones, and if past actions are observable, then (C, C) is a sustainable outcome. Indeed, if each player plays C as long as the other one has always done so in the past, and plays D otherwise, both players have an incentive to always play C, since the short-term gain that can be obtained by playing D is more than offset by the future losses entailed by the opponent playing D at all future stages. Hence, a game-theoretic analysis predicts significantly different outcomes for a repeated game than for a static interaction. In particular, in the Prisoner's Dilemma, the cooperative outcome (C, C) can be sustained in the repeated game, while only the non-cooperative outcome (D, D) can be sustained in one-shot interactions. In general, what are the equilibrium payoffs of a repeated game and how can they be computed from the data of the static game? Is there a significant difference between games repeated a finite number of times and infinitely repeated ones? What is the role played by the degree of impatience of the players? Do the conclusions obtained for the Prisoner's Dilemma and for other games rely crucially on the assumption that each player perfectly observes the other players' past choices, or would imperfect observation be sufficient? The theory of repeated games aims at answering these questions, and many more.

Games with Observable Actions
This section focuses on repeated games with perfect monitoring in which, after every period of the repeated game, all strategic choices of all the players are publicly revealed.

Data of the Game, Strategies, Payoffs
Data of the Stage Game  There is a finite set I of players. A stage game is repeated over and over. Each player i's action set in this stage game is denoted $A_i$, and $S_i = \Delta(A_i)$ is the set of player i's mixed actions (for any finite set X, $\Delta(X)$ denotes the set of probability distributions over X). Every degenerate lottery in $S_i$ (which puts probability 1 on one particular action in $A_i$) is identified with the corresponding element of $A_i$. A choice of action for every player determines an outcome $a \in A = \prod_i A_i$. The payoff function of the stage game is $g \colon A \to \mathbb{R}^I$. Payoffs are naturally extended to profiles of mixed actions $s \in S = \prod_i S_i$ by taking expectations: $g(s) = E_s[g(a)]$.

Repeated Game  After every repetition of the stage game, the action profile previously chosen by the players is publicly revealed. After the first t repetitions of the game, a player's information consists of the publicly known history at stage t, which is an element of $H_t = A^t$ ($H_0 = \{\emptyset\}$ by convention). A strategy in the repeated game specifies the choice of a mixed action at every stage, as a function of the past observed history. More specifically, a behavioral strategy for player i is a mapping $\sigma_i \colon \bigcup_t H_t \to S_i$. When all values of $\sigma_i$ belong to $A_i$ ($\sigma_i \colon \bigcup_t H_t \to A_i$), $\sigma_i$ is called a pure strategy.

Other Strategy Specifications  A behavioral strategy allows the player to randomize his action depending on past history. If, at the start of the repeated game, the player were to randomize over the set of behavioral strategies, the result would be equivalent to a particular behavioral strategy choice. This result is a consequence of Kuhn's theorem [5,41].
Furthermore, behavioral strategies are also equivalent to randomizations over the set of pure strategies.

Induced Plays  Every choice of pure strategies $\sigma = (\sigma_i)_i$ by the players induces a play $h = (a_1, a_2, \ldots) \in A^\infty$ in the repeated game, defined inductively by $a_1 = (\sigma_{i,0}(\emptyset))_i$ and $a_t = (\sigma_{i,t-1}(a_1, \ldots, a_{t-1}))_i$. A profile of behavioral strategies $\sigma$ defines a probability distribution $P_\sigma$ over plays.
Preferences  To complete the definition of the repeated game, it remains to define the players' preferences over plays. The literature commonly distinguishes infinitely repeated games with or without discounting, and finitely repeated games. In infinitely repeated games with no discounting, the players care about their long-run stream of stage payoffs. In particular, the payoff in the repeated game associated to a play $h = (a_1, a_2, \ldots) \in A^\infty$ coincides with the limit of the Cesàro means of stage payoffs when this limit exists. When this limit does not exist, the most common evaluation of the stream of payoffs is defined through a Banach limit of the Cesàro means (a Banach limit is a linear form on the set of bounded sequences that always lies between the liminf and the limsup). In infinitely repeated games with discounting, a discount factor $0 < \delta < 1$ characterizes the players' degree of impatience: a payoff of 1 at stage t+1 is equivalent to a payoff of $\delta$ at stage t. Player i's payoff in the repeated game for the play $h = (a_1, a_2, \ldots) \in A^\infty$ is the normalized sum of discounted payoffs: $(1-\delta) \sum_{t \geq 1} \delta^{t-1} g_i(a_t)$. In finitely repeated games, the game ends after some stage T. Payoffs induced by the play after stage T are irrelevant (and a strategy need not specify choices after stage T). The payoff for a player is the average of the stage payoffs during stages 1 up to T: $\frac{1}{T} \sum_{t=1}^{T} g_i(a_t)$.

Equilibrium Notions  What plays can be expected to be observed in repeated interactions of players who observe each other's choices? Non-cooperative game theory focuses mainly on the idea of a stable convention, i.e. of strategy profiles from which no player has an incentive to deviate, knowing the strategies adopted by the other players. A strategy profile forms a Nash equilibrium (Nash [55]) when no player can improve his payoff by choosing an alternative strategy, as long as the other players follow the prescribed strategies.
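The payoff criteria described above can be sketched in code. This is an illustrative sketch (function names are our own, not from the article); infinite streams are approximated by long truncations.

```python
# Evaluating a play from its stream of stage payoffs g = [g_1, g_2, ...].

def cesaro_mean(stage_payoffs):
    """Average payoff over the given horizon: the finite-game criterion,
    and the term whose limit defines the undiscounted criterion."""
    return sum(stage_payoffs) / len(stage_payoffs)

def discounted_payoff(stage_payoffs, delta):
    """Normalized discounted sum (1 - delta) * sum_t delta^(t-1) g_t."""
    return (1 - delta) * sum(delta**t * g for t, g in enumerate(stage_payoffs))

# A constant stream of 4 is worth 4 under both criteria.
stream = [4] * 10_000
print(cesaro_mean(stream))             # 4.0
print(discounted_payoff(stream, 0.9))  # ~4.0 (up to truncation error)
```

The normalization factor $(1-\delta)$ is what makes a constant stream of c worth exactly c for every discount factor.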
In some cases, the observation of past play may not be consistent with the prescribed strategies. When, for every possible history, each player's strategy maximizes the continuation stream of payoffs, assuming that the other players abide by their prescribed strategies at all future stages, the strategy profile forms a subgame perfect equilibrium (Selten [66]). Subgame perfection is more robust, and often considered more satisfactory, as a solution concept than Nash equilibrium; the construction of perfect equilibria is in general also more demanding than the construction of Nash equilibria. The main objective of the theory of repeated games is to characterize the set of payoff vectors that can be sustained by some Nash or perfect equilibrium of the repeated game.

Necessary Conditions on Equilibrium Payoffs  Some properties are common to all equilibrium payoffs. First, under the common assumption that all players evaluate the payoff associated to a play in the same way, the resulting payoff vector in the repeated game is a convex combination of stage payoffs. That is, the payoff vector in the repeated game is an element of the convex hull of $g(A)$, called the set of feasible payoffs and denoted F. A notable exception is the work of Lehrer and Pauzner [47], who study repeated games where players have heterogeneous time preferences. The payoff vector resulting from a play does not necessarily belong to F if players have different evaluations of payoff streams. For instance, in a repetition of the Prisoner's Dilemma, if player 1 cares only about the payoff in stage 1 and player 2 cares only about the payoff in stage 2, it is possible for both players to obtain a payoff of 5 in the repeated game.

Now consider a strategy profile $\sigma$, and let $\tau_i$ be a strategy of player i that plays, after every history $(a_1, \ldots, a_t)$, a best response to the profile of mixed actions chosen by the other players at the next stage. At any stage of the repeated game, the expected payoff of player i using $\tau_i$ is no less than

$$v_i = \min_{s_{-i} \in S_{-i}} \max_{a_i \in A_i} g_i(s_{-i}, a_i) \qquad (1)$$

where $s_{-i} = (s_j)_{j \neq i}$ (we use similar notations throughout the paper: for a family of sets $(E_i)_{i \in I}$, $e_{-i}$ denotes an element of $E_{-i} = \prod_{j \neq i} E_j$, and a profile $e \in \prod_j E_j$ is written $e = (e_i, e_{-i})$ when the i-th component is stressed). The payoff $v_i$ is referred to as player i's min max payoff. A payoff vector that provides each player i with at least [resp. strictly more than] $v_i$ is called individually rational [resp. strictly individually rational], and IR [resp. IR*] denotes the set of such payoff vectors. Since for any strategy profile there exists a strategy of player i that yields a payoff no less than $v_i$, all equilibrium payoffs have to be individually rational. Also note that the players $j \neq i$ collectively have a strategy profile in the repeated game that forces player i's payoff down to $v_i$: they repeatedly play a mixed strategy profile that achieves the minimum in the definition of $v_i$. Such a strategy profile in the one-shot game is referred to as a punishing strategy, or min max strategy, against player i. For the Prisoner's Dilemma, F is the convex hull of (1, 1), (5, 0), (0, 5) and (4, 4), and both players' min max levels are equal to 1. Figure 1 illustrates the set of feasible and individually rational payoff vectors (hatched area):
Repeated Games with Complete Information, Figure 1 F and IR for the Prisoner’s Dilemma
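As a sketch of how the min max levels can be computed from the stage game data: the definition (1) minimizes over mixed profiles $s_{-i}$, but for the Prisoner's Dilemma restricting the punisher to pure actions already attains the min max (punishing with D is optimal). The encoding below is illustrative, not from the article.

```python
# Pure-action min max values for the Prisoner's Dilemma stage game.

g = {  # (a1, a2) -> (payoff to player 1, payoff to player 2)
    ('C', 'C'): (4, 4), ('C', 'D'): (0, 5),
    ('D', 'C'): (5, 0), ('D', 'D'): (1, 1),
}
actions = ['C', 'D']

def pure_minmax(player):
    """min over the opponent's pure actions of player's best-reply payoff."""
    def payoff(own, opp):
        profile = (own, opp) if player == 0 else (opp, own)
        return g[profile][player]
    return min(max(payoff(a, b) for a in actions) for b in actions)

print(pure_minmax(0), pure_minmax(1))  # 1 1
```

In general games the minimizing profile may need to be mixed, in which case (1) becomes a linear program over $\Delta(A_{-i})$; the pure-action restriction here is a simplification.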
The set of feasible and individually rational payoffs can be directly computed from the stage game data.

Infinitely Patient Players  The following result has been part of the folklore of game theory at least since the mid 1960s. Its authorship is obscure (see the introduction of Aumann [7]). For this reason, it is commonly referred to as the "Folk Theorem". By extension, characterizations of sets of equilibrium payoffs in repeated games are also referred to as "Folk Theorems".

Theorem 1  The set of equilibrium payoffs of the repeated game with no discounting coincides with the set of feasible and individually rational payoffs.

Aumann and Shapley [8,9] and Rubinstein [62,64] show that restricting attention to perfect equilibria does not narrow down the set of equilibrium payoffs. They prove:

Theorem 2  The set of perfect equilibrium payoffs of the repeated game with no discounting coincides with the set of feasible and individually rational payoffs.

We outline a proof of Theorem 2. It was established above that any equilibrium payoff is in $F \cap IR$; we need only prove that every element of $F \cap IR$ is a subgame perfect equilibrium payoff. Let $x \in F \cap IR$, and let $h = a_1, \ldots, a_t, \ldots$ be a play inducing x. Consider the strategies that play $a_t$ at stage t; if player i does not respect this prescription at stage $t_0$, the other players punish player i for $t_0$ stages by repeatedly playing the min max strategy profile against player i. After the punishment phase is over, players revert to the play of h, hence playing $a_{2t_0+1}, \ldots$. Now we explain why these strategies form a subgame perfect equilibrium. Consider a strategy of player i starting after any history. The play induced by this strategy for player i and by the other players' prescribed strategies is, up to a subset of stages of null density, given by the sequence h with interweaved periods of punishment for player i. Hence the induced long-run payoff for player i is a convex combination of his punishment payoff and of the payoff induced by h. The result follows since the payoff for player i induced by h is no worse than the punishment payoff.
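Before turning to impatient players, a quick computation illustrates how the discount factor matters. Under grim trigger (cooperate until a defection is observed, then play D forever) in the Prisoner's Dilemma, cooperation is sustainable exactly when players are patient enough. This is a sketch using the payoffs of Figure 1; the threshold $\delta \geq 1/4$ is our own arithmetic, not a statement from the article.

```python
# Grim trigger in the Prisoner's Dilemma with normalized discounted payoffs.
# Cooperating forever yields a constant stream of 4, i.e. value 4.
# The best one-shot deviation yields 5 today, then the Nash payoff 1 forever.

def cooperation_value(delta):
    return 4.0  # constant stream of 4, normalized value

def deviation_value(delta):
    return (1 - delta) * 5 + delta * 1  # 5 once, then 1 at every future stage

def cooperation_sustainable(delta):
    return cooperation_value(delta) >= deviation_value(delta)

# Solving 4 >= (1 - delta) * 5 + delta gives delta >= 1/4.
print(cooperation_sustainable(0.2))   # False
print(cooperation_sustainable(0.25))  # True
```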
Impatient Players  The strategies constructed in the proof of the Folk Theorem for repeated games with infinitely patient players (Theorem 1) do not necessarily constitute a subgame perfect equilibrium if players are impatient. Indeed, during a punishment phase, the punishing players may be receiving low stage payoffs, and these stage payoffs matter in the evaluation of their stream of payoffs. When constructing subgame perfect equilibria of discounted games, one must make sure that, after a deviation by player i, the players $j \neq i$ have incentives to implement player i's punishment.

Nash Reversion  Friedman [25] shows that every feasible payoff that Pareto dominates a Nash equilibrium payoff of the static game is a subgame perfect equilibrium payoff of the repeated game, provided that players are patient enough. In Friedman's proof, punishments take the simple form of reversion to the repeated play of the static Nash equilibrium forever. In the Prisoner's Dilemma, (D, D) is the only static Nash equilibrium, with payoff (1, 1), and thus (4, 4) is a subgame perfect equilibrium payoff of the repeated game if players are patient enough. Note however that in some games, the set of payoffs that Pareto dominate some equilibrium payoff may be empty. Also, Friedman's result constitutes only a partial Folk Theorem, in that it does not characterize the full set of equilibrium payoffs.

The Recursive Structure  Repeated games with discounting possess a structure similar to dynamic programming problems. At any stage in time, players choose actions that maximize the sum of the current payoff and the payoff at the subsequent stages. When strategies form a subgame perfect equilibrium, the payoff vector at subsequent stages must itself be an equilibrium payoff, and players must have incentives to follow the prescribed strategies at the current stage. This implies that subgame perfect equilibrium payoffs have a recursive structure, first studied by Abreu [1]. Subsection "A Recursive Structure" presents the recursive structure in more detail for the more general model of games with public monitoring.

The Folk Theorem for Discounted Games  Relying on Abreu's recursive results, Fudenberg and Maskin [20] prove the following Folk Theorem for subgame perfect equilibria with discounting:

Theorem 3  If the number of players is 2 or if the set of feasible payoff vectors has non-empty interior, then any payoff vector that is feasible and strictly individually rational is a subgame perfect equilibrium payoff of the discounted repeated game, provided that players are sufficiently patient.
Forges, Mertens and Neyman [24] provide an example in which a payoff that is individually rational, but not strictly individually rational, is not an equilibrium payoff of the discounted game. Abreu, Dutta and Smith [2] show that the non-empty interior condition of the theorem can be replaced by the weaker condition of "non-equivalent utilities": no pair of players have the same preferences over outcomes. Wen [73] and Fudenberg, Levine and Takahashi [22] show that a Folk Theorem still holds when the condition of non-equivalent utilities fails, provided one replaces the min max level defining individually rational payoffs by some "effective min max" payoffs.

An alternative representation of impatience to discounted payoffs in infinitely repeated games is the overtaking criterion, introduced by Rubinstein [63]: the play $(a_1, a_2, \ldots)$ is strictly preferred by player i to the play $(a'_1, a'_2, \ldots)$ if the inferior limit of the difference of the corresponding streams of payoffs is positive, i.e. if $\liminf_T \sum_{t=1}^{T} \left( g_i(a_t) - g_i(a'_t) \right) > 0$. Rubinstein [63] proves a Folk Theorem for the overtaking criterion.

Finitely Repeated Games  Strikingly, equilibrium payoffs in finitely repeated games and in infinitely repeated games can be drastically different. This effect is best exemplified in repetitions of the Prisoner's Dilemma.

The Prisoner's Dilemma  Recall that in an infinitely repeated Prisoner's Dilemma, cooperation at all stages is achieved at a subgame perfect equilibrium if players are patient enough. By contrast, at every Nash equilibrium of any finite repetition of the Prisoner's Dilemma, both players play D at every stage with probability 1. We now present a short proof of this result. Consider any Nash equilibrium of the Prisoner's Dilemma repeated T times, and let $a_1, \ldots, a_T$ be a sequence of action profiles played with positive probability at this equilibrium. Since each player can play D at the last stage of the repetition, and D is a dominant action, $a_T = (D, D)$. We now prove by induction on $\tau$ that for any such sequence, $a_{T-\tau}, \ldots, a_T = (D, D), \ldots, (D, D)$. Assume the induction hypothesis valid for $\tau - 1$. Consider a strategy of player i that follows the equilibrium strategy up to stage $T - \tau - 1$ and plays D from stage $T - \tau$ on. This strategy obtains the same payoff as the equilibrium strategy at stages $1, \ldots, T - \tau - 1$, and at least as much as the equilibrium strategy at stages $T - \tau + 1, \ldots, T$. Hence, this strategy cannot obtain more than the equilibrium strategy at stage $T - \tau$, and therefore the equilibrium strategy plays D at stage $T - \tau$ with probability 1 as well.

Sorin [68] proves the more general result:

Theorem 4  Assume that in every Nash equilibrium of G, all players receive their individually rational levels. Then, at every Nash equilibrium of any finitely repeated version of G, all players receive their individually rational levels.

The proof of Theorem 4 relies on a backwards induction type of argument, but it is striking that the result applies to all Nash equilibria, and not only to subgame perfect Nash equilibria. This result shows that, unless some additional assumptions are made on the one-shot game, a Folk Theorem cannot obtain for finitely repeated games.

Games with Unique Nash Payoff  Using a proof by backwards induction, Benoît and Krishna [10] obtain the following result.

Theorem 5  Assume that G admits x as its unique Nash equilibrium payoff. Then every finite repetition of G admits x as its unique subgame perfect equilibrium payoff.

Theorems 4 and 5 rely on the assumption that the last stage of repetition, T, is common knowledge among the players. Neyman [56] shows that a Folk Theorem obtains for the finitely repeated Prisoner's Dilemma (and for other games) if there is lack of common knowledge about the last stage of repetition.

Folk Theorems for Finitely Repeated Games  A Folk Theorem can be obtained when there are two Nash equilibrium payoffs for each player. The following result is due to Benoît and Krishna [10] and Gossner [27].
Theorem 6  Assume that each player has two distinct Nash equilibrium payoffs in G and that the set of feasible payoffs has non-empty interior. Then the set of subgame perfect equilibrium payoffs of the T-fold repetition of G converges to the set of feasible and individually rational payoffs as T goes to infinity.

Hence, with at least two equilibrium payoffs per player, the sets of equilibrium payoffs of finitely repeated games and infinitely repeated games are asymptotically the same.
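The hypotheses of Theorems 4-6 turn on the stage game's Nash equilibrium payoffs. The sketch below enumerates the pure Nash equilibria of a two-player stage game (mixed equilibria are ignored, which happens to suffice for the Prisoner's Dilemma; the encoding is illustrative).

```python
from itertools import product

# Pure Nash equilibria of a two-player game given as a payoff dict.
g = {  # (a1, a2) -> (payoff to player 1, payoff to player 2)
    ('C', 'C'): (4, 4), ('C', 'D'): (0, 5),
    ('D', 'C'): (5, 0), ('D', 'D'): (1, 1),
}
actions = ['C', 'D']

def pure_nash(g, actions):
    """Profiles where each action is a best reply to the other's action."""
    eqs = []
    for a1, a2 in product(actions, actions):
        best1 = all(g[(a1, a2)][0] >= g[(b, a2)][0] for b in actions)
        best2 = all(g[(a1, a2)][1] >= g[(a1, b)][1] for b in actions)
        if best1 and best2:
            eqs.append((a1, a2))
    return eqs

print(pure_nash(g, actions))  # [('D', 'D')]
```

With a unique Nash payoff, Theorem 5 applies: every finite repetition of the Prisoner's Dilemma has the unique subgame perfect equilibrium payoff (1, 1).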
The condition that each player has two distinct Nash equilibrium payoffs in the stage game can be weakened; see Smith [67]. Assume for simplicity that one player, say player i, has two distinct Nash payoffs. By playing one of the two Nash equilibria in the last stages of the repeated game, it is possible to provide incentives for this player to play, in previous stages, actions that are not part of Nash equilibria of the one-shot game. If this construction leads to perfect equilibria in which some player $j \neq i$ has distinct payoffs, we can then provide incentives for both players i and j. If successive iterations of this procedure yield distinct subgame perfect equilibrium payoffs for all players, a Folk Theorem applies.

Games with Non-observable Actions
For infinitely repeated games with perfect monitoring, a complete and simple characterization of the set of equilibrium payoffs is obtained: the feasible and individually rational payoff vectors. In particular, cooperation can be sustained at equilibrium. How the equilibrium payoffs of the repeated game depend on the quality of the players' monitoring of each other's actions is the subject of a very active area of research. Repeated games with imperfect monitoring, in which players observe other players' action choices only imperfectly, were first motivated by economic applications. In Stigler [69], two firms are repeatedly engaged in price competition over market shares. Each firm observes its own sales, but not the price set by the rival. While it is in the best interest of both firms to set a collusive price, each firm has an incentive to secretly undercut the rival's price. Upon observing plunging sales, should a firm deduce that the rival firm is undercutting prices, and retaliate by setting lower prices, or should lower sales be interpreted as the result of an exogenous shock on market demand? Whether collusive behavior is sustainable at equilibrium is one of the motivating questions in the theory of repeated games with imperfect monitoring.
It is interesting to compare repeated games with imperfect monitoring with their perfect monitoring counterparts. The structure of equilibria used to prove the Folk Theorem with perfect monitoring and no discounting is rather simple: if a player deviates from the prescribed strategies, the deviation is detected and the deviating player is identified, and all other players can then punish the deviator. With imperfect monitoring, not all deviations are detectable, and when a deviation is detected, the deviator is not necessarily identifiable. The notions of detection and identification allow fairly general Folk Theorems for undiscounted games. We present these results in Subsect. "Detection and Identification". With discounting, repeated games with perfect monitoring possess a recursive structure that facilitates their study. Recursive methods can also be successfully applied to discounted games with public monitoring. We review the major results of this branch of the literature in Subsect. "Public Equilibria". Almost-perfect monitoring is the natural framework in which to study the effect of small departures from the perfect or public monitoring assumptions. We review this literature in Subsect. "Almost Perfect Monitoring". Little is known about general discounted games with imperfect private monitoring. We present the main known results in Subsect. "General Stochastic Signals". With perfect monitoring, the worst equilibrium payoff for a player is given by the min max of the one-shot game, where the punishing (minimizing) players choose an independent profile of mixed strategies. With imperfect monitoring, correlating past signals of the punishing players may lead to more efficient punishments. We present results on punishment levels in Subsect. "Punishment Levels".

Model
In this section we define repeated games with imperfect monitoring, and describe several classes of monitoring structures of particular interest.

Data of the Game  Recall that the one-shot strategic interaction is described by a finite set I of players, a finite action set $A_i$ for each player i, and a payoff function $g \colon A \to \mathbb{R}^I$. The players' observation of each other's actions is described by a monitoring structure, given by a finite set of signals $Y_i$ for each player i and by a transition probability $Q \colon A \to \Delta(Y)$ (with $A = \prod_{i \in I} A_i$ and $Y = \prod_{i \in I} Y_i$). When the action profile chosen is $a = (a_i)_{i \in I}$, a profile of signals $y = (y_i)_{i \in I}$ is drawn with probability $Q(y \mid a)$, and $y_i$ is observed by player i.
Perfect Monitoring  Perfect monitoring is the particular case in which each player observes the action profile chosen: for each player i, $Y_i = A$ and $Q((y_i)_{i \in I} \mid a) = \mathbf{1}_{\{\forall i,\; y_i = a\}}$.

Almost Perfect Monitoring  The monitoring structure is $\varepsilon$-perfect (see Mailath and Morris [52]) when each player can identify the other players' actions with a probability of error less than $\varepsilon$. This is the case if there exist functions $f_i \colon A_i \times Y_i \to A_{-i}$ for all i such that, for all $a \in A$:

$$Q(\forall i,\; f_i(a_i, y_i) = a_{-i} \mid a) \geq 1 - \varepsilon \,.$$
Almost-perfect monitoring refers to $\varepsilon$-perfect monitoring for small values of $\varepsilon$.

Canonical Structure  The monitoring structure is canonical when each player's observation corresponds to an action profile of the opponents, that is, when $Y_i = A_{-i}$.

Public and Almost Public Signals  Signals are public when all the players observe the same signal, i.e., $Q(\forall i, j,\; y_i = y_j \mid a) = 1$ for every action profile a. For instance, in Green and Porter [33], firms compete over quantities, and the public signal is the realization of the price. Firms can then make inferences about the rivals' quantities based on their own quantity and the market price. The case in which $Q(\forall i, j,\; y_i = y_j \mid a)$ is close to 1 for every a is referred to as almost public monitoring (see Mailath and Morris [52]). Private signals refer to the case in which signals are not public.

Deterministic Signals  Signals are deterministic when the signal profile is uniquely determined by the action profile. When a is played, the signal profile y is given by $y = f(a)$, where f is called the signaling function.

Observable Payoffs  Payoffs are observable when each player i can deduce his payoff from his action and his signal. This is the case if there exists a mapping $\varphi \colon A_i \times Y_i \to \mathbb{R}$ such that for every action profile a, $Q(\forall i,\; g_i(a) = \varphi(a_i, y_i) \mid a) = 1$.

The Repeated Game  The game is played repeatedly, and after every stage t, the profile of signals $y_t$ received by the players is drawn according to the distribution $Q(\cdot \mid a_t)$, where $a_t$ is the profile of actions chosen at stage t. A player's information consists of his past actions and signals. We let $H_{i,t} = (A_i \times Y_i)^t$ denote the set of player i's histories of length t. A strategy for player i is now a mapping $\sigma_i \colon \bigcup_{t \geq 0} H_{i,t} \to S_i$. The set of complete histories of the game after t stages is $H_t = (A \times Y)^t$; it describes the chosen actions and received signals of all the players at all past stages.
A strategy profile $\sigma = (\sigma_i)_{i \in I}$ induces a probability distribution $P_\sigma$ on the set of plays $H_\infty = (A \times Y)^\infty$.

Equilibrium Notions
Nash Equilibria  Players' preferences over game plays are defined according to the same criteria as for perfect monitoring. We focus on infinitely repeated games, both discounted and undiscounted. A choice of players' preferences defines a set of Nash equilibrium payoffs of the repeated game.
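A monitoring structure can be encoded directly as the transition probability Q. The sketch below (an illustrative encoding, not from the article) builds perfect monitoring for the Prisoner's Dilemma and checks it against the "public" and "deterministic" definitions above.

```python
from itertools import product

# Q maps each action profile a to a distribution over signal profiles
# (y1, y2). Under perfect monitoring, both players observe a itself.

actions = ['C', 'D']
profiles = list(product(actions, actions))

# Perfect monitoring: with probability 1, both players' signal is a.
Q = {a: {(a, a): 1.0} for a in profiles}

def is_public(Q):
    """Signals are public when y_i = y_j with probability 1 under every a."""
    return all(all(y1 == y2 for (y1, y2) in dist) for dist in Q.values())

def is_deterministic(Q):
    """Deterministic when the signal profile is a function of a."""
    return all(len(dist) == 1 for dist in Q.values())

print(is_public(Q), is_deterministic(Q))  # True True
```

Perfect monitoring is thus both public and deterministic; the classes defined above (canonical, almost public, etc.) can be tested by similar predicates on Q.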
Sequential Equilibria  The most commonly used refinement of Nash equilibrium for repeated games with imperfect monitoring is the sequential equilibrium concept (Kreps and Wilson [42]), which we recall here. A belief assessment is a sequence $\mu = (\mu_{i,t})_{t \geq 1,\, i \in I}$ with $\mu_{i,t} \colon H_{i,t} \to \Delta(H_t)$; i.e., given the private history $h_i$ of player i, $\mu_{i,t}(h_i)$ is the probability distribution representing the belief that player i holds about the full history. A sequential equilibrium of the repeated game is a pair $(\sigma, \mu)$, where $\sigma$ is a strategy profile and $\mu$ is a belief assessment, such that: 1) for each player i and every history $h_i$, $\sigma_i$ is a best reply in the continuation game, given the strategies of the other players and the belief that player i holds regarding the past; 2) the beliefs are consistent in the sense that $(\sigma, \mu)$ is the limit of a sequence $(\sigma^n, \mu^n)$ where, for every n, $\sigma^n$ is completely mixed (it assigns positive probability to every action after every history) and $\mu^n$ is the unique belief assessment derived from Bayes' rule under $P_{\sigma^n}$. Sequential equilibria are defined both for the discounted and the undiscounted versions of the repeated game. For undiscounted games, the sets of Nash equilibrium payoffs and sequential equilibrium payoffs coincide. The two notions also coincide for discounted games when the monitoring has full support (i.e. under every action profile, every signal profile has positive probability). The results presented in this survey all hold for sequential equilibria, both for discounted and undiscounted games.

Extensions of the Repeated Game  When players receive correlated inputs or may communicate between stages of the repeated game, the relevant concepts are correlated and communication equilibria.
Correlated Equilibria A correlated equilibrium (Aumann [6]) of the repeated game is an equilibrium of an extended game in which: at a preliminary stage, a mediator chooses a profile of correlated random inputs and informs each player of his own input; then the repeated game is played. A characterization of the set of correlated equilibrium payoffs for two-player games is obtained by Lehrer [45]. Correlation arises endogenously in repeated games with imperfect monitoring, as the signals received by the players can serve as correlated inputs that influence players' continuation strategies. This phenomenon is called internal correlation, and was studied by Lehrer [44] and by Gossner and Tomala [29,30]. Communication Equilibria An (extensive form) communication equilibrium (Myerson [54], Forges [23]) of
a repeated game is an equilibrium of an extension of the repeated game in which, after every stage, players send messages to a mediator, and the mediator sends back private outputs to the players. Characterizations of the set of communication equilibrium payoffs are obtained under weak conditions on the monitoring structure, see e.g. Kandori and Matsushima [38], Compte [15], and Renault and Tomala [61]. Detection and Identification Equivalent Actions A player's deviation is detectable when it induces a different distribution of signals for the other players. When two actions induce the same distribution of signals for the other players, they are called equivalent (Lehrer [43,44,45,46]): Definition 1 Two actions $a_i$ and $b_i$ of player $i$ are equivalent, denoted $a_i \sim b_i$, if they induce the same distribution of the other players' signals:
$$Q(y_{-i} \mid a_i, a_{-i}) = Q(y_{-i} \mid b_i, a_{-i}) \quad \text{for every } a_{-i} .$$
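As a concrete illustration of Definition 1, the following sketch (our own toy representation, not from the text: a monitoring structure is encoded as a dictionary mapping each pair $(a_i, a_{-i})$ to a distribution over the other players' signals) checks whether two actions are equivalent. The blind structure corresponds to Example 1 below, where player 2 observes nothing about player 1's action.

```python
from itertools import product

def equivalent(Q, a_i, b_i, opponent_profiles):
    """a_i ~ b_i iff Q(. | a_i, a_-i) == Q(. | b_i, a_-i) for every a_-i."""
    return all(Q[(a_i, a_minus)] == Q[(b_i, a_minus)]
               for a_minus in opponent_profiles)

# Player 2's signal set is a singleton {"*"}: he learns nothing, so the
# two actions of player 1 are equivalent (as in Example 1).
Q_blind = {(a1, a2): {"*": 1.0} for a1, a2 in product("CD", "CD")}

# Perfect monitoring: player 2 observes player 1's action exactly.
Q_perfect = {(a1, a2): {a1: 1.0} for a1, a2 in product("CD", "CD")}

print(equivalent(Q_blind, "C", "D", "CD"))    # True
print(equivalent(Q_perfect, "C", "D", "CD"))  # False
```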
Example 1 Consider the two-player repeated Prisoner's Dilemma where player 2 receives no information about the actions of player 1 (i.e., $Y_2$ is a singleton). The two actions of player 1 are thus equivalent. The actions of player 2 are independent of the actions of player 1: player 1 has no impact on the behavior of player 2. Player 2 has no power to threaten player 1, so in any equilibrium player 1 defects at every stage. Player 2 also defects at every stage: since player 1 always defects, player 2 too loses his threatening power. The only equilibrium payoff in this repeated game is thus $(1, 1)$. Example 1 suggests that, between two equivalent actions, a player chooses at equilibrium the one that yields the highest stage payoff. This is indeed the case when the information received by a player does not depend on his own action. Lehrer [43] studies particular monitoring structures satisfying this requirement. Recall from Lehrer [43] the definition of semi-standard monitoring structures: each action set $A_i$ is endowed with a partition $\bar{A}_i$; when player $i$ plays $a_i$, the corresponding partition cell $\bar{a}_i$ is publicly announced. In the semi-standard case, two actions are equivalent if and only if they belong to the same cell, $a_i \sim b_i \iff \bar{a}_i = \bar{b}_i$, and the information received by a player on the other players' actions does not depend on his own action. If player $i$ deviates from $a_i$ to $b_i$, the deviation is undetected if and only if $a_i \sim b_i$; otherwise it is detected by all other players. A profile of mixed actions is called immune to undetectable deviations if no player can profit
by a unilateral deviation that maintains the same distribution of the other players' signals. The following result, due to Lehrer [43], characterizes equilibrium payoffs for undiscounted games with semi-standard signals: Theorem 7 In an undiscounted repeated game with semi-standard signals, the equilibrium payoffs are the individually rational payoffs that belong to the convex hull of payoffs generated by mixed action profiles that are immune to undetectable deviations. More Informative Actions When the information of player $i$ depends on his own action, some deviations may be detected in the course of the repeated game even though they are undetectable in the stage game. Example 2 Consider the following modification of the Prisoner's Dilemma. The action set of player 1 is $A_1 = \{C_1, D_1\} \times \{C_2, D_2\}$ and the action set of player 2 is $\{C_2, D_2\}$. An action for player 1 is thus a pair $a_1 = (\tilde{a}_1, \tilde{a}_2)$. When the action profile $(\tilde{a}_1, \tilde{a}_2, a_2)$ is played, the payoff to player $i$ is $g_i(\tilde{a}_1, a_2)$. We can interpret the component $\tilde{a}_1$ as a real action (it impacts payoffs) and the component $\tilde{a}_2$ as a message sent to player 2 (it does not impact payoffs). The monitoring structure is as follows: player 2 only observes the message component $\tilde{a}_2$ of the action of player 1; player 1 perfectly observes the action of player 2 if he chooses the cooperative real action ($\tilde{a}_1 = C_1$), and gets no information on player 2's action if he defects ($\tilde{a}_1 = D_1$). Note that the actions $(C_1, C_2)$ and $(D_1, C_2)$ of player 1 are equivalent, and so are the actions $(C_1, D_2)$ and $(D_1, D_2)$. However, it is possible to construct an equilibrium that implements the cooperative payoff along the following lines: i) Using his message component, player 1 reports at every stage $t > 1$ the previous action of player 2. Player 1 is punished in case of a non-matching report.
ii) Player 2 randomizes between both actions, so that player 1 needs to play the cooperative action in order to report player 2's action accurately. The weight on the defective action of player 2 goes to 0 as $t$ goes to infinity, to ensure efficiency. Player 2 has incentives to play $C_2$ most of the time, since player 1 can statistically detect whether player 2 uses the action $D_2$ more frequently than prescribed. Player 1 also has incentives to play the real action $C_1$, as this is the only way to
observe player 2's action, which needs to be reported later on. The key point in the example above is that the two real actions $C_1$ and $D_1$ of player 1 are equivalent, but $D_1$ is less informative than $C_1$ for player 1. For general monitoring structures, an action $a_i$ is more informative than an action $b_i$ if, whenever player $i$ plays $a_i$, he can reconstitute the signal he would have observed had he played $b_i$. The precise definition of the more-informative relation relies on Blackwell's ordering of stochastic experiments [11]: Definition 2 The action $a_i$ of player $i$ is more informative than the action $b_i$ if there exists a transition probability $f \colon Y_i \to \Delta(Y_i)$ such that for every $a_{-i}$ and every profile of signals $y$,
$$\sum_{y_i} f(y_i' \mid y_i)\, Q(y_i, y_{-i} \mid a_i, a_{-i}) = Q(y_i', y_{-i} \mid b_i, a_{-i}) .$$
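Definition 2 can be checked mechanically once a candidate garbling $f$ is given. The sketch below uses our own encoding (for readability the opponents' signal $y_{-i}$ is suppressed, i.e. $Y_{-i}$ is treated as a singleton, so each $Q(\cdot \mid \cdot, a_{-i})$ is just a vector over $Y_i$) and verifies the identity $f^{\top} q_a = q_b$ for every opponent profile.

```python
import numpy as np

def is_garbling(q_a_list, q_b_list, f, atol=1e-9):
    """Blackwell's condition (with trivial opponents' signals): f[y, y'] = f(y'|y)
    must be row-stochastic and satisfy f^T q_a(a_-i) = q_b(a_-i) for all a_-i."""
    f = np.asarray(f)
    assert np.allclose(f.sum(axis=1), 1.0), "each f(.|y) must be a distribution"
    return all(np.allclose(f.T @ np.asarray(qa), qb, atol=atol)
               for qa, qb in zip(q_a_list, q_b_list))

# Signals of player i: {saw_L, saw_R, blank}.  Action a_i reveals the
# opponent's action; action b_i always yields "blank".
q_a = [np.array([1.0, 0.0, 0.0]),   # opponent plays L
       np.array([0.0, 1.0, 0.0])]   # opponent plays R
q_b = [np.array([0.0, 0.0, 1.0])] * 2
f = np.array([[0.0, 0.0, 1.0],      # garble every signal into "blank"
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])

print(is_garbling(q_a, q_b, f))          # True: a_i is more informative than b_i
print(is_garbling(q_a, q_b, np.eye(3)))  # False: the identity does not work here
```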
We write $a_i \succsim b_i$ if $a_i \sim b_i$ and $a_i$ is more informative than $b_i$. Assume that prescribed strategies require player $i$ to play $b_i$ at stage $t$, and let $a_i \succsim b_i$. Consider the following deviation of player $i$: play $a_i$ at stage $t$, and reconstruct a signal at stage $t$ that could have arisen from the play of $b_i$. In all subsequent stages, play as if no deviation had taken place at stage $t$ and as if the reconstructed signal had been observed at stage $t$. Not only is such a deviation undetectable at stage $t$, since $a_i \sim b_i$, but it is also undetectable at all subsequent stages, as it induces the same probability distribution over plays as under the prescribed strategy. This argument shows that, if an equilibrium strategy specifies that player $i$ plays $a_i$, there is no $b_i \succsim a_i$ that yields a higher expected stage payoff than $a_i$. Definition 3 A distribution of action profiles $p \in \Delta(A)$ is immune to undetectable deviations if, for each player $i$ and each pair of actions $a_i, b_i$ such that $b_i \succsim a_i$,
$$\sum_{a_{-i}} p(a_i, a_{-i})\, g_i(a_i, a_{-i}) \ge \sum_{a_{-i}} p(a_i, a_{-i})\, g_i(b_i, a_{-i}) .$$
If $p$ is immune to undetectable deviations, and if player $i$ is supposed to play $a_i$, any alternative action $b_i$ that yields a greater expected payoff cannot be such that $b_i \succsim a_i$. The following proposition gives a necessary condition on equilibrium payoffs that holds both in the discounted and in the undiscounted case. Proposition 1 Every equilibrium payoff of the repeated game is induced by a distribution that is immune to undetectable deviations.
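Definition 3 translates directly into a finite check. In the sketch below (our own representation: the pairs $b_i \succsim a_i$ are supplied externally as `undetectable[i]`, and the Prisoner's Dilemma payoffs 3, 0, 4, 1 are hypothetical), the point mass on $(D, D)$ is immune while the point mass on $(C, C)$ is not, consistent with Example 1.

```python
def immune(p, g, undetectable, players):
    """Check Definition 3: p is a dict profile -> probability, g[i] a payoff
    function, undetectable[i] the pairs (a_i, b_i) with b_i an undetectable
    deviation from a_i (b_i >~ a_i in the text's notation)."""
    for i in players:
        for a_i, b_i in undetectable[i]:
            on_path = [(prof, pr) for prof, pr in p.items() if prof[i] == a_i]
            lhs = sum(pr * g[i](prof) for prof, pr in on_path)
            rhs = sum(pr * g[i](prof[:i] + (b_i,) + prof[i + 1:])
                      for prof, pr in on_path)
            if lhs < rhs - 1e-12:
                return False
    return True

# Hypothetical Prisoner's Dilemma payoffs; player 1 (index 0) is unobserved,
# so each of his actions is an undetectable deviation from the other.
pay = {("C", "C"): (3, 3), ("C", "D"): (0, 4),
       ("D", "C"): (4, 0), ("D", "D"): (1, 1)}
g = {0: lambda prof: pay[prof][0], 1: lambda prof: pay[prof][1]}
undet = {0: [("C", "D"), ("D", "C")], 1: []}

print(immune({("D", "D"): 1.0}, g, undet, [0, 1]))  # True
print(immune({("C", "C"): 1.0}, g, undet, [0, 1]))  # False
```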
The condition of Proposition 1 is tight for some specific classes of games, all of them assuming two players and no discounting. Following Lehrer [45], signals are non-trivial if, for each player $i$, there exist an action $a_i$ for player $i$ and two actions $a_j$, $b_j$ for $i$'s opponent such that the signal for player $i$ is different under $(a_i, a_j)$ and $(a_i, b_j)$. Lehrer [45] proves: Theorem 8 The set of correlated equilibrium payoffs of the undiscounted game with deterministic and non-trivial signals is the set of individually rational payoffs induced by distributions that are immune to undetectable deviations. Lehrer [46] assumes that payoffs are observable and obtains the following result: Theorem 9 In a two-player repeated game with no discounting, non-trivial signals and observable payoffs, the set of equilibrium payoffs is the set of individually rational payoffs induced by distributions that are immune to undetectable deviations. Finally, Lehrer [44] shows that in some cases one may dispense with the correlation device of Theorem 8, as all necessary correlation can be generated endogenously through the signals of the repeated game: Proposition 2 In two-player games with non-trivial signals such that either the action profile is publicly announced or a blank signal is publicly announced, the set of equilibrium payoffs coincides with the set of correlated equilibrium payoffs. Identification of Deviators A deviation is identifiable when every player can infer the identity of the deviating player from his observations. For instance, in a game with public signals, if separate deviations by players $i$ and $j$ induce the same distribution of public signals, these deviations are not identifiable. In order to punish the deviating player, it is sometimes necessary to know his identity. Detectability and identifiability are two separate issues, as shown by the following example.
Example 3 Consider the following 3-player game where player 1 chooses the row, player 2 chooses the column and player 3 chooses the matrix.
         W                      M                      E
       L        R             L        R             L        R
T   1, 1, 1  4, 4, 0      0, 3, 0  0, 3, 0      3, 0, 0  3, 0, 0
B   4, 4, 0  4, 4, 0      0, 3, 0  0, 3, 0      3, 0, 0  3, 0, 0
Consider the monitoring structure in which actions are not observable and the payoff vector is publicly announced. The payoff $(1, 1, 1)$ is feasible and individually rational. The associated action profile $(T, L, W)$ is immune to undetectable deviations, since any individual deviation from $(T, L, W)$ changes the payoff. However, $(1, 1, 1)$ is not an equilibrium payoff. The reason is that player 3, who has the power to punish either player 1 or player 2, cannot punish both players simultaneously: punishing player 1 rewards player 2 and vice versa. More precisely, whatever weights player 3 puts on the actions M and E, the sum of player 1's and player 2's payoffs is at least 3. Any equilibrium payoff vector $v = (v_1, v_2, v_3)$ must thus satisfy $v_1 + v_2 \ge 3$. In fact, it is possible to prove that the set of equilibrium payoffs of this repeated game is the set of feasible and individually rational payoffs that satisfy this constraint. Approachability When the deviating player cannot be identified, it may be necessary to punish a group of suspects altogether. The notion of a payoff that is enforceable under group punishments is captured by the definition of approachable payoffs: Definition 4 A payoff vector $v$ is approachable if there exists a strategy profile $\sigma$ such that, for every player $i$ and every unilateral deviation $\tau_i$ of player $i$, the average payoff of player $i$ under $(\tau_i, \sigma_{-i})$ is asymptotically less than or equal to $v_i$. Blackwell's [12] approachability theorem and its generalization by Kohlberg [40] provide simple geometric characterizations of approachable payoffs. It is straightforward that approachability is a necessary condition on equilibrium payoffs: Proposition 3 Every equilibrium payoff of the repeated game is approachable.
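A quick numeric check of the punishment constraint in Example 3: players 1 and 2 receive the constant payoffs $(0, 3)$ in matrix M and $(3, 0)$ in matrix E, so every mixture over M and E leaves their payoffs summing to exactly 3.

```python
# Player 1 gets 0 in M and 3 in E; player 2 gets 3 in M and 0 in E
# (constant across the cells of M and E).  For every weight w on M the
# sum is 3, which is why any equilibrium payoff satisfies v1 + v2 >= 3.
for w in [k / 10 for k in range(11)]:
    pay1 = w * 0 + (1 - w) * 3
    pay2 = w * 3 + (1 - w) * 0
    assert abs(pay1 + pay2 - 3) < 1e-12
print("v1 + v2 = 3 under every punishment mix of M and E")
```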
Renault and Tomala [61] show that the conditions of Propositions 1 and 3 are tight for communication equilibria: Theorem 10 For every game with imperfect monitoring, the set of communication equilibrium payoffs of the repeated game with no discounting is the set of approachable payoffs induced by distributions which are immune to undetectable deviations. Tomala [70] shows that pure-strategy equilibrium payoffs of undiscounted repeated games with public signals are also characterized through identifiability and approachability conditions (the approachability definition then uses pure strategies). Tomala [71] provides a similar characterization in mixed strategies for a restricted class of public signals. Identification Through Endogenous Communication A deviation may be identified in the repeated game even though it cannot be identified in the stage game. In a network game, players are located at the nodes of a graph, and each player monitors his neighbors' actions. Each player can use his actions as messages that are broadcast to all his neighbors in the graph. The graph is called 2-connected if no deletion of a single node disconnects it. Renault and Tomala [60] show that when the graph is 2-connected, there exists a communication protocol among the players that ensures that the identity of any deviating player becomes common knowledge among all players in finite time. In this case, identification takes place through communication over the graph. Public Equilibria In a seminal paper, Green and Porter [33] introduce a model in which firms are engaged in a production game and publicly observe market prices, which depend both on quantities produced and on non-observable exogenous market shocks. Can collusion be sustained at equilibrium even if prices convey imperfect information on quantities produced? This question motivates the study of public equilibria, for which sharp characterizations of equilibrium payoffs are obtained. Signals are public when all players' signal sets are identical, i.e., $Y_i = Y_{\mathrm{pub}}$ for each $i$, and all players observe the same signal with probability one: $Q(\{\forall i, j,\ y_i = y_j\} \mid a) = 1$ for every $a$. A public history of length $t$ is a record of $t$ public signals, i.e., an element of $H_{\mathrm{pub},t} = (Y_{\mathrm{pub}})^t$. A strategy $\sigma_i$ for player $i$ is a public strategy if it depends on the public history only: if $h_i = (a_{i,1}, y_1, \ldots, a_{i,t}, y_t)$ and $h_i' = (a_{i,1}', y_1', \ldots, a_{i,t}', y_t')$ are two histories for player $i$ such that $y_1 = y_1', \ldots, y_t = y_t'$, then $\sigma_i(h_i) = \sigma_i(h_i')$.
Definition 5 A perfect public equilibrium is a profile of public strategies such that, after every public history, each player's continuation strategy is a best reply to the opponents' continuation strategy profile. The repetition of a Nash equilibrium of the stage game is a perfect public equilibrium, so perfect public equilibria exist. Every perfect public equilibrium is a sequential equilibrium: any consistent belief assigns probability one to the realized public history and thus correctly forecasts the opponents' future choices. A Recursive Structure A perfect public equilibrium (PPE henceforth) is a profile of public strategies that forms
an equilibrium of the repeated game and such that, after every public history, the continuation strategy profile is also a PPE. The set of PPEs, and the set of induced payoffs, therefore possesses a recursive structure, as shown by Abreu, Pearce and Stacchetti [3]. The argument is based on a dynamic programming principle. To state the main result, we first introduce some definitions. Given a mapping $f \colon Y_{\mathrm{pub}} \to \mathbb{R}^I$, $G(\delta, f)$ denotes the one-shot game in which each player $i$ chooses an action in $A_i$ and payoffs are given by
$$(1-\delta)\, g_i(a) + \delta \sum_{y \in Y_{\mathrm{pub}}} Q(y \mid a)\, f_i(y) .$$
In $G(\delta, f)$, the stage game is played, and players receive $f(y)$ as an additional payoff if $y$ is the realized public signal. The weights $1-\delta$ and $\delta$ are the relative weights of present payoffs versus all future payoffs in the repeated game. Definition 6 A payoff vector $v \in \mathbb{R}^I$ is decomposable with respect to the set $W \subseteq \mathbb{R}^I$ if there exists a mapping $f \colon Y_{\mathrm{pub}} \to W$ such that $v$ is a Nash equilibrium payoff of $G(\delta, f)$. $F_\delta(W)$ denotes the set of payoff vectors that are decomposable with respect to $W$. Let $E(\delta)$ be the set of perfect public equilibrium payoffs of the repeated game discounted at the rate $\delta$. The following result is due to Abreu et al. [3]: Theorem 11 $E(\delta)$ is the largest bounded set $W$ such that $W \subseteq F_\delta(W)$. Fudenberg and Levine [19] derive an asymptotic characterization of the set of PPE payoffs as the discount factor goes to 1, as follows. Given a vector $\lambda \in \mathbb{R}^I$, define the score in the direction $\lambda$ as $k(\lambda) = \sup \langle \lambda, v \rangle$, where the supremum is taken over the set of payoff vectors $v$ that are Nash equilibrium payoffs of $G(\delta, f)$, where $f$ is any mapping such that $\langle \lambda, v \rangle \ge \langle \lambda, f(y) \rangle$ for every $y \in Y_{\mathrm{pub}}$.
Scores are independent of the discount factor. The following theorem is due to Fudenberg and Levine [19]: Theorem 12 Let $C$ be the set of vectors $v$ such that, for every $\lambda \in \mathbb{R}^I$, $\langle \lambda, v \rangle \le k(\lambda)$. If the interior of $C$ is non-empty, then $E(\delta)$ converges to $C$ (in the Hausdorff topology) as $\delta$ goes to 1. Fudenberg, Levine and Takahashi [22] relax the non-empty interior assumption. They provide an algorithm for computing the affine hull of $\lim_{\delta \to 1} E(\delta)$ and a corresponding characterization of the set $C$, with continuation payoffs restricted to this affine hull.
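The auxiliary game $G(\delta, f)$ of Theorem 11 is easy to tabulate. The sketch below computes the augmented payoffs $(1-\delta)\, g_i(a) + \delta \sum_y Q(y \mid a) f_i(y)$ for a given continuation-payoff map $f$; the payoffs and signal technology (a noisy Prisoner's Dilemma whose public signal is "good" or "bad") are hypothetical.

```python
def augmented_payoff(g, Q, f, delta):
    """Payoffs of the auxiliary one-shot game G(delta, f) from Theorem 11:
    (1 - delta) * g_i(a) + delta * sum_y Q(y | a) * f_i(y)."""
    return {a: tuple((1 - delta) * ga[i]
                     + delta * sum(Q[a][y] * fy[i] for y, fy in f.items())
                     for i in range(len(ga)))
            for a, ga in g.items()}

# Hypothetical noisy Prisoner's Dilemma with public signals "good"/"bad".
g_pd = {("C", "C"): (3, 3), ("C", "D"): (0, 4),
        ("D", "C"): (4, 0), ("D", "D"): (1, 1)}
Q_pub = {("C", "C"): {"good": 0.9, "bad": 0.1},
         ("C", "D"): {"good": 0.5, "bad": 0.5},
         ("D", "C"): {"good": 0.5, "bad": 0.5},
         ("D", "D"): {"good": 0.2, "bad": 0.8}}
f_cont = {"good": (3, 3), "bad": (0, 0)}   # continuation payoffs f: Y_pub -> W

aug = augmented_payoff(g_pd, Q_pub, f_cont, delta=0.5)
print(aug[("C", "C")])   # each coordinate: 0.5*3 + 0.5*(0.9*3 + 0.1*0) ≈ 2.85
```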
Folk Theorems for Public Equilibria The recursive structure of Theorem 11 and the asymptotic characterization of PPE payoffs given by Theorem 12 are essential tools for finding sufficient conditions under which every feasible and individually rational payoff is an equilibrium payoff, i.e., conditions under which a Folk Theorem holds. A Folk Theorem in PPEs holds under two conditions: 1) a condition of detectability of deviations, and 2) a condition of identifiability of deviating players. Definition 7 A profile of mixed actions $s = (s_i, s_{-i})$ has individual full rank if, for each player $i$, the probability vectors (in the vector space $\mathbb{R}^{Y_{\mathrm{pub}}}$) $\{Q(\cdot \mid a_i, s_{-i}) : a_i \in A_i\}$ are linearly independent. If $s$ has individual full rank, no player can change the distribution of his actions without affecting the distribution of public signals. Individual full rank is thus a condition on detectability of deviations. Definition 8 A profile of mixed actions $s$ has pairwise full rank if, for every pair of players $i \ne j$, the family of probability vectors $\{Q(\cdot \mid a_i, s_{-i}) : a_i \in A_i\} \cup \{Q(\cdot \mid a_j, s_{-j}) : a_j \in A_j\}$ has rank $|A_i| + |A_j| - 1$. Under the condition of pairwise full rank, deviations by two distinct players induce distinct distributions of public signals. Pairwise full rank is therefore a condition of identifiability of deviating players. Fudenberg et al. [21] prove the following theorem: Theorem 13 Assume the set $F$ of feasible and individually rational payoff vectors has non-empty interior. If every pure action profile has individual full rank and if there exists a mixed action profile with pairwise full rank, then every convex and compact subset of the interior of $F$ is a subset of $E(\delta)$ for $\delta$ large enough. In particular, under the conditions of the theorem, every feasible and individually rational payoff vector is arbitrarily close to a PPE payoff for large enough discount factors. Variations of this result can be found in [21] and [19].
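Definitions 7 and 8 are rank conditions that can be verified numerically. The sketch below (the signal distributions are made-up three-signal examples, not derived from any particular game) checks linear independence with `numpy.linalg.matrix_rank`.

```python
import numpy as np

def individual_full_rank(Q_rows):
    """Rows are the distributions Q(.|a_i, s_-i), one per action a_i;
    individual full rank = linearly independent rows."""
    Q = np.asarray(Q_rows, dtype=float)
    return bool(np.linalg.matrix_rank(Q) == Q.shape[0])

def pairwise_full_rank(Q_i, Q_j):
    """The stacked family {Q(.|a_i, s_-i)} U {Q(.|a_j, s_-j)} must have rank
    |A_i| + |A_j| - 1 (full rank |A_i| + |A_j| is impossible: in an actual
    game both families average to the same total signal distribution)."""
    Q_i, Q_j = np.asarray(Q_i, dtype=float), np.asarray(Q_j, dtype=float)
    rank = np.linalg.matrix_rank(np.vstack([Q_i, Q_j]))
    return bool(rank == Q_i.shape[0] + Q_j.shape[0] - 1)

Q_i = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]   # two actions of player i
Q_j = [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]]   # two actions of player j
Q_blind = [[1/3, 1/3, 1/3], [1/3, 1/3, 1/3]]

print(individual_full_rank(Q_i))      # True: deviations are detectable
print(individual_full_rank(Q_blind))  # False: signals carry no information
print(pairwise_full_rank(Q_i, Q_j))   # True: deviators are identifiable
```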
Extensions The Public Part of a Signal The definition of perfect public equilibria extends to the case in which each player's signal consists of two components: a public component and
a private component. The public components of all players' signals coincide with probability one. A public strategy is then a strategy that depends only on the public components of past signals, and the analysis carries through. Public Communication In the public communication extension of the repeated game, players make public announcements between any two stages. The profile of public announcements then forms a public signal, and recursive methods can be applied. The fact that public communication is a powerful instrument for overcoming the difficulties arising from private signals was first observed by Matsushima [49,50]. Ben-Porath and Kahneman [14], Kandori and Matsushima [38], and Compte [15] prove Folk Theorems in games with private signals and public communication. Kandori [37] shows that in games with public monitoring, public communication makes it possible to relax the conditions for the Folk Theorem of Fudenberg et al. [21]. Private Strategies in Games with Public Monitoring PPE payoffs do not cover the full set of sequential equilibrium payoffs, even when signals are public, as some equilibria may rely on players using private strategies, i.e., strategies that depend on past chosen actions and past private signals; see [53] and [39] for examples. In a minority game, there is an odd number of players, and each player chooses between actions A and B. Players choosing the least chosen (minority) action get a payoff of 1; the other players get 0. The public signal is the minority action. Renault et al. [58,59] show that, for minority games, a Folk Theorem holds in private strategies but not in public strategies. Only a few results are known concerning the set of sequential equilibrium payoffs in private strategies of games with public monitoring.
A monotonicity property is obtained by Kandori [36], who shows that the set of payoffs associated with sequential equilibria in pure strategies is increasing with respect to the quality of the public signal. Almost Public Monitoring Some PPEs are robust to small perturbations of public signals. Considering strategies with finite memory, Mailath and Morris [52] identify a class of public strategies which are sequential equilibria of the repeated game with imperfect private monitoring, provided that the monitoring structure is close enough to a public one. They derive a Folk Theorem for games with almost public and almost perfect monitoring. Hörner and Olszewski [35] strengthen this result and prove a Folk Theorem for games with almost public monitoring. Under detectability and identifiability conditions, they prove that
feasible and individually rational payoffs can be achieved by sequential equilibria with finite memory. Almost Perfect Monitoring Monitoring is almost perfect when each player can identify the action profile of his opponents with near certainty. Almost perfect monitoring is the natural framework for studying the robustness of the Folk Theorem to small departures from the assumption that actions are perfectly observed. The first results were obtained for the Prisoner's Dilemma. Sekiguchi [65] shows that the cooperative outcome can be approximated at equilibrium when players are sufficiently patient and monitoring is almost perfect. Under the same assumptions, Bhaskar and Obara [13], Piccione [57] and Ely and Välimäki [16] show that a Folk Theorem obtains. Piccione [57] and Ely and Välimäki [16] study a particular class of equilibria called belief-free. Strategies form a belief-free equilibrium if, whatever player $i$'s belief about the opponent's private history, the action prescribed by $i$'s strategy is a best response to the opponent's continuation strategy. Ely, Hörner and Olszewski [17] extend the belief-free approach to general games. However, they show that, in general, belief-free strategies are not enough to recover a Folk Theorem, even when monitoring is almost perfect. For general games and any number of players, Hörner and Olszewski [34] prove a Folk Theorem with almost perfect monitoring. The strategies that implement the equilibrium payoffs are defined on successive blocks of a fixed length, and are block-belief-free in the sense that, at the beginning of every block, every player is indifferent between several continuation strategies, independently of his belief as to which continuation strategies are used by the opponents. This result closes the almost perfect monitoring case by showing that the equilibrium payoffs of the Folk Theorem are robust to a small amount of imperfect monitoring.
General Stochastic Signals Beyond the case of public (or almost public) monitoring, little is known about equilibrium payoffs of repeated games with discounting and imperfect signals. The Prisoner's Dilemma game is particularly important for economic applications. In particular, it captures the essential features of collusion with the possibility of secret price cutting, as in Stigler [69]. When signals are imperfect but independent conditionally on the pair of actions chosen (a condition called conditional independence), Matsushima [51] shows that
the efficient outcome of the repeated Prisoner's Dilemma is an equilibrium outcome if players are sufficiently patient. In the equilibrium construction, every player's action is constant within every block. The conditional independence assumption is crucial in that it implies that, during every block, a player has no feedback as to what signals the other player has received. The conditional independence assumption is non-generic: it holds only for a set of monitoring structures of empty interior. Fong, Gossner, Hörner, and Sannikov [18] prove that efficiency can be obtained at equilibrium without conditional independence. Their main assumption is that there exists a sufficiently informative signal, but this signal need not be almost perfectly informative. Their result holds for a family of monitoring structures of non-empty interior. It is the first result that establishes cooperation in the Prisoner's Dilemma with impatient players for truly imperfect, private and correlated signals. Punishment Levels Individual rationality is a key concept for Folk Theorems and equilibrium payoff characterizations. Given a repeated game, define the individually rational (IR) level of player $i$ as the lowest payoff down to which this player may be punished in the repeated game. Definition 9 The individually rational level of player $i$ is
$$\lim_{\delta \to 1} \; \min_{\sigma_{-i}} \max_{\sigma_i} \; E_{\sigma_i, \sigma_{-i}} \left[ \sum_{t} (1-\delta)\, \delta^{t-1} g_{i,t} \right]$$
where the min runs over profiles of behavior strategies of player $i$'s opponents, and the max over behavior strategies of player $i$. That is, the individually rational level is the limit (as the discount factor goes to one) of the min max value of the discounted game (other approaches, through undiscounted games or limits of finitely repeated games, yield equivalent definitions, see [29]). Comparison of the IR Level with the min max With perfect monitoring, the IR level of player $i$ is player $i$'s min max in the one-shot game, as defined by equation (1). With imperfect monitoring, the IR level of player $i$ is never larger than $v_i$, since player $i$'s opponents can force player $i$ down to $v_i$ by repeatedly playing the min max strategy against him. With two players, it is a consequence of von Neumann's min max theorem [72] that $v_i$ is the IR level for player $i$.
For any number of players, Gossner and Hörner [28] show that the IR level of player $i$ is equal to the min max in the one-shot game whenever there exists a garbling of player $i$'s signal such that, conditionally on $i$'s garbled signal, the signals of $i$'s opponents are independent. Furthermore, the condition in [28] is also a necessary condition in normal form games extended by correlation devices (as in Aumann [6]). A continuity result on the IR level also applies for monitoring structures close to those that satisfy the conditional independence condition. The following example shows that, in general, the IR level can be lower than $v_i$: Example 4 Consider the following three-player game. Player 1 chooses the row, player 2 the column and player 3 the matrix. Players 1 and 2 perfectly observe the action profile, while player 3 observes player 2's action only. As we deal with the IR level of player 3, we specify the payoff of this player only.
       W              E
     L    R         L    R
T    0    0         1    0
B    0    1         0    0
A simple computation shows that $v_3 = 1/4$ and that the min max strategies of players 1 and 2 are uniform. Consider the following strategies of players 1 and 2 in the repeated game: randomize uniformly at odd stages; at even stages, play $(T, L)$ or $(B, R)$ depending on player 1's previous action. Against these strategies, player 3 cannot obtain more than $1/4$ at odd stages and $1/2$ at even stages, resulting in an average payoff of $3/8$. Entropy Characterizations The exact computation of the IR level in games with imperfect monitoring requires analyzing the optimal trade-off, for the punishing players, between the production of correlated and private signals and the use of these signals for effective punishment. Gossner and Vieille [31] and Gossner and Tomala [29] develop tools based on information theory to analyze this trade-off. At any stage, the amount of correlation generated (or spent) by the punishing players is measured using the entropy function. Gossner and Tomala [30] derive a characterization of the IR level for some classes of monitoring structures. Gossner, Laraki, and Tomala [32] provide methods for the explicit computation of the IR level. In particular, for the above example, the IR level can be computed and is approximately 0.401. Explicit computations of IR levels for other games are derived by Goldberg [26].
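The stage values behind the $3/8$ bound in Example 4 can be checked directly from the two payoff matrices (W pays 1 only at $(B, R)$, E only at $(T, L)$): against the uniform distribution player 3's best reply earns $1/4$, and against the correlated half-half distribution on $(T, L)$ and $(B, R)$ it earns $1/2$.

```python
from itertools import product

def u3(a1, a2, a3):
    """Player 3's payoff in Example 4: W pays 1 only at (B, R), E only at (T, L)."""
    target = ("B", "R") if a3 == "W" else ("T", "L")
    return 1.0 if (a1, a2) == target else 0.0

def best_reply_value(dist):
    """Player 3's best-reply payoff against a distribution over (a1, a2)."""
    return max(sum(p * u3(a1, a2, a3) for (a1, a2), p in dist.items())
               for a3 in ("W", "E"))

uniform = {pair: 0.25 for pair in product("TB", "LR")}
correlated = {("T", "L"): 0.5, ("B", "R"): 0.5}

odd, even = best_reply_value(uniform), best_reply_value(correlated)
print(odd, even, (odd + even) / 2)   # 0.25 0.5 0.375, i.e. an average of 3/8
```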
Acknowledgments The authors are grateful to Johannes Hörner for insightful comments.
Bibliography Primary Literature 1. Abreu D (1988) On the theory of infinitely repeated games with discounting. Econometrica 56:383–396 2. Abreu D, Dutta P, Smith L (1994) The folk theorem for repeated games: a NEU condition. Econometrica 62:939–948 3. Abreu D, Pearce D, Stacchetti E (1990) Toward a theory of discounted repeated games with imperfect monitoring. Econometrica 58:1041–1063 4. Aumann RJ (1960) Acceptable points in games of perfect information. Pac J Math 10:381–417 5. Aumann RJ (1964) Mixed and behavior strategies in infinite extensive games. In: Dresher M, Shapley LS, Tucker AW (eds) Advances in Game Theory. Princeton University Press, New Jersey, pp 627–650 6. Aumann RJ (1974) Subjectivity and correlation in randomized strategies. J Math Econ 1:67–95 7. Aumann RJ (1981) Survey of repeated games. In: Aumann RJ (ed) Essays in game theory and mathematical economics in honor of Oskar Morgenstern. Wissenschaftsverlag, Bibliographisches Institut, Mannheim, pp 11–42 8. Aumann RJ, Shapley LS (1976) Long-term competition – a game theoretic analysis. Re-edited in 1994. See [9] 9. Aumann RJ, Shapley LS (1994) Long-term competition – a game theoretic analysis. In: Megiddo N (ed) Essays on game theory. Springer, New York, pp 1–15 10. Benoît JP, Krishna V (1985) Finitely repeated games. Econometrica 53(4):905–922 11. Blackwell D (1951) Comparison of experiments. In: Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Berkeley, pp 93–102 12. Blackwell D (1956) An analog of the minimax theorem for vector payoffs. Pac J Math 6:1–8 13. Bhaskar V, Obara I (2002) Belief-based equilibria in the repeated prisoners' dilemma with private monitoring. J Econ Theory 102:40–70 14. Ben-Porath E, Kahneman M (1996) Communication in repeated games with private monitoring. J Econ Theory 70(2):281–297 15. Compte O (1998) Communication in repeated games with imperfect private monitoring. Econometrica 66:597–626 16.
Ely JC, Välimäki J (2002) A robust folk theorem for the prisoner’s dilemma. J Econ Theory 102:84–106 17. Ely JC, Hörner J, Olszewski W (2005) Belief-free equilibria in repeated games. Econometrica 73:377–415 18. Fong K, Gossner O, Hörner J, Sannikov Y (2007) Efficiency in a repeated prisoner’s dilemma with imperfect private monitoring. mimeo 19. Fudenberg D, Levine DK (1994) Efficiency and observability with long-run and short-run players. J Econ Theory 62:103–135 20. Fudenberg D, Maskin E (1986) The folk theorem in repeated games with discounting or with incomplete information. Econometrica 54:533–554
21. Fudenberg D, Levine DK, Maskin E (1994) The folk theorem with imperfect public information. Econometrica 62(5):997– 1039 22. Fudenberg D, Levine DK, Takahashi S (2007) Perfect public equilibrium when players are patient. Games Econ Behav 61:27–49 23. Forges F (1986). An approach to communication equilibria. Econometrica 54:1375–1385 24. Forges F, Mertens J-F, Neyman A (1986) A counterexample to the folk theorem with discounting. Econ Lett 20:7–7 25. Friedman J (1971) A noncooperative equilibrium for supergames. Rev Econ Stud 38:1–12 26. Goldberg Y (2007) Secret correlation in repeated games with imperfect monitoring: The need for nonstationary strategies. Math Oper Res 32:425–435 27. Gossner O (1995) The folk theorem for finitely repeated games with mixed strategies. Int J Game Theory 24:95–107 28. Gossner O, Hörner J (2006) When is the individually rational payoff in a repeated game equal to the minmax payoff? DP 1440, CMS-EMS 29. Gossner O, Tomala T (2006) Empirical distributions of beliefs under imperfect observation. Math Oper Res 31(1):13–30 30. Gossner O, Tomala T (2007) Secret correlation in repeated games with signals. Math Oper Res 32:413–424 31. Gossner O, Vieille N (2002) How to play with a biased coin? Games Econ Behav 41:206–226 32. Gossner O, Laraki R, Tomala T (2009) Informationally optimal correlation. Math Programm B 116:147–112 33. Green EJ, Porter RH (1984) Noncooperative collusion under imperfect price information. Econometrica 52:87–100 34. Hörner J, Olszewski W (2006) The folk theorem with private almost-perfect monitoring. Econometrica 74(6):1499–1544 35. Hörner J, Olszewski W (2007) How robust is the folk theorem with imperfect public monitoring? mimeo 36. Kandori M (1992) The use of information in repeated games with imperfect monitoring. Rev Econ Stud 59:581–593 37. Kandori M (2003) Randomization, communication, and efficiency in repeated games with imperfect public monitoring. Econometrica 71:345–353 38. 
Kandori M, Matsushima H (1998) Private observation, communication and collusion. Rev Econ Stud 66:627–652 39. Kandori M, Obara I (2006) Efficiency in repeated games revisited: The role of private strategies. Econometrica 74:499–519 40. Kohlberg E (1975) Optimal strategies in repeated games with incomplete information. Int J Game Theory 4:7–24 41. Kuhn HW (1953) Extensive games and the problem of information. In: Kuhn HW, Tucker AW (eds) Contributions to the Theory of Games, vol II. Annals of Mathematical Studies, vol 28. Princeton University Press, New Jersey, pp 193–216 42. Kreps DM, Wilson RB (1982) Sequential equilibria. Econometrica 50:863–894 43. Lehrer E (1990) Nash equilibria of n-player repeated games with semi-standard information. Int J Game Theory 19:191– 217 44. Lehrer E (1991) Internal correlation in repeated games. Int J Game Theory 19:431–456 45. Lehrer E (1992) Correlated equilibria in two-player repeated games with nonobservable actions. Math Oper Res 17:175–199 46. Lehrer E (1992) Two players repeated games with non observable actions and observable payoffs. Math Oper Res 17:200– 224
47. Lehrer E, Pauzner A (1999) Repeated games with differential time preferences. Econometrica 67:393–412 48. Luce RD, Raiffa H (1957) Games and Decisions: Introduction and Critical Survey. Wiley, New York 49. Matsushima H (1991) On the theory of repeated games with private information: Part i: Anti-folk theorem without communication. Econ Lett 35:253–256 50. Matsushima H (1991) On the theory of repeated games with private information: Part ii: Revelation through communication. Econ Lett 35:257–261 51. Matsushima H (2004) Repeated games with private monitoring: Two players. Econometrica 72:823–852 52. Mailath G, Morris S (2002) Repeated games with almost-public monitoring. J Econ Theory 102:189–229 53. Mailath GJ, Matthews SA, Sekiguchi T (2002) Private strategies in finitely repeated games with imperfect public monitoring. B E J Theor Econ 2 54. Myerson RB (1982) Optimal coordination mechanisms in generalized principal-agent problems. J Math Econ 10:67–81 55. Nash JF (1951) Noncooperative games. Ann Math 54:289–295 56. Neyman A (1999) Cooperation in repeated games when the number of stages is not commonly known. Econometrica 67:45–64 57. Piccione M (2002) The repeated prisoner’s dilemma with imperfect private monitoring. J Econ Theory 102:70–84 58. Renault J, Scarlatti S, Scarsini M (2005) A folk theorem for minority games. Games Econ Behav 53:208–230 59. Renault J, Scarlatti S, Scarsini M (2008) Discounted and finitely repeated minority games with public signals. Math Soc Sci 56:44–74 60. Renault J, Tomala T (1998) Repeated proximity games. Int J Game Theory 27:539–559 61. Renault J, Tomala T (2004) Communication equilibrium payoffs of repeated games with imperfect monitoring. Games Econ Behav 49:313–344 62. Rubinstein A (1977) Equilibrium in supergames. Center for Research in Mathematical Economics and Game Theory. Res Memo 25 63. Rubinstein A (1979) Equilibrium in supergames with the overtaking criterion. J Econ Theory 21:1–9 64. 
Rubinstein A (1994) Equilibrium in supergames. In: Megiddo N (ed) Essays on game theory. Springer, New York, pp 17–28 65. Sekiguchi T (1997) Efficiency in repeated prisoner’s dilemma with private monitoring. J Econ Theory 76:345–361 66. Selten R (1965) Spieltheoretische Behandlung eines Oligopolmodells mit Nachfrageträgheit. Z Gesamte Staatswiss 12:201–324 67. Smith L (1995) Necessary and sufficient conditions for the perfect finite horizon folk theorem. Econometrica 63:425–430 68. Sorin S (1986) On repeated games with complete information. Math Oper Res 11:147–160 69. Stigler G (1964) A theory of oligopoly. J Political Econ 72:44–61 70. Tomala T (1998) Pure equilibria of repeated games with public observation. Int J Game Theory 27:93–109 71. Tomala T (1999) Nash equilibria of repeated games with observable payoff vector. Games Econ Behav 28:310–324 72. von Neumann J (1928) Zur Theorie der Gesellschaftsspiele. Math Ann 100:295–320 73. Wen Q (1994) The “folk theorem” for repeated games with complete information. Econometrica 62:949–954
2633
2634
Repeated Games with Complete Information
Books and Reviews Aumann RJ (1981) Survey of repeated games. In: Aumann RJ (ed) Essays in game theory and mathematical economics in honor of Oskar Morgenstern. Wissenschaftsverlag, Bibliographisches Institut, Mannheim, pp 11–42 Mailath GJ, Samuelson L (2006) Repeated Games and Reputations: Long-Run Relationships. Oxford University Press, Oxford
Mertens J-F (1986) Repeated games. In: Proceedings of the international congress of Mathematicians. Berkeley, California, pp 1528–1577 Mertens J-F, Sorin S, Zamir S (1994) Repeated games. CORE discussion paper. pp 9420–9422, Univeritsé Catholique de Levain, Louvain-la-neuve
Repeated Games with Incomplete Information
JÉRÔME RENAULT
Ceremade, Université Paris Dauphine, Paris, France
Article Outline

Glossary
Definition of the Subject
Introduction
Strategies, Payoffs, Value and Equilibria
The Standard Model of Aumann and Maschler
Vector Payoffs and Approachability
Zero-Sum Games with Lack of Information on Both Sides
Non Zero-Sum Games with Lack of Information on One Side
Non-observable Actions
Miscellaneous
Future Directions
Acknowledgments
Bibliography

Glossary

Repeated game with incomplete information A situation where several players repeat the same stage game, the players having different knowledge of the stage game which is repeated.
Strategy of a player A rule, or program, describing the action taken by the player in any possible situation which may happen, depending on the information available to this player in that situation.
Strategy profile A vector containing a strategy for each player.
Lack of information on one side The particular case where all the players but one perfectly know the stage game which is repeated.
Zero-sum games Two-player games where the players have opposite payoffs.
Value The solution (or price) of a zero-sum game, in the sense of the fair amount that player 1 should give to player 2 to be entitled to play the game.
Equilibrium A strategy profile where each player's strategy is a best reply against the strategies of the other players.
Completely revealing strategy A strategy of a player which eventually reveals to the other players everything known by this player on the selected state.
Non-revealing strategy A strategy of a player which reveals nothing on the selected state.
The simplex of probabilities over a finite set For a finite set $S$, we denote by $\Delta(S)$ the set of probabilities over $S$, and we identify $\Delta(S)$ with $\{p=(p_s)_{s\in S}\in\mathbb{R}^S:\ \forall s\in S,\ p_s\ge 0\ \text{and}\ \sum_{s\in S}p_s=1\}$. Given $s$ in $S$, the Dirac measure on $s$ will be denoted by $\delta_s$. For $p=(p_s)_{s\in S}$ and $q=(q_s)_{s\in S}$ in $\mathbb{R}^S$ we will use, unless otherwise specified, $\|p-q\|=\sum_{s\in S}|p_s-q_s|$.

Definition of the Subject

In a repeated game with incomplete information, there is a basic interaction called the stage game which is repeated over and over by several participants called players. The point is that the players do not perfectly know the stage game which is repeated, but rather have different knowledge about it. As illustrative examples, one may think of the following situations: an oligopolistic competition where firms do not know the production costs of their opponents; a financial market where traders bargain over units of an asset whose terminal value is imperfectly known; a cryptographic model where some participants want to transmit some information (e.g., a credit card number) without being understood by other participants; a conflict where a particular side may be able to understand the communications inside the opponent side (or might have a particular type of weapons). Natural questions arising in this context are the following. What is the optimal behavior of a player with a perfect knowledge of the stage game? Can we determine which part of the information such a player should use? Can we price the value of possessing a particular piece of information? How should a player behave while having only partial information? Foundations of games with incomplete information have been studied in [28] and [56]. Repeated games with incomplete information were introduced in the sixties by Aumann and Maschler [6], and we present here the basic and fundamental results of the domain.

Introduction

Let us start with a few well-known elementary examples [6,91].
Basic Examples

In each example there are two players, and the game is zero-sum, i.e. player 2's payoff is always the opposite of player 1's payoff. There are two states $a$ and $b$, and the possible stage games are given by two real matrices $G^a$ and $G^b$ of identical size. Initially a true state of nature $k\in\{a,b\}$ is selected with probability 1/2 for each state, and $k$ is announced to player 1 only. Then the matrix game $G^k$ is repeated over and over: at every stage, player 1 chooses a row $i$ and simultaneously player 2 chooses a column $j$; the stage payoff to player 1 is then $G^k(i,j)$, but only $i$ and $j$ are publicly announced before proceeding to the next stage. Players are patient and want to maximize their long-run average expected payoffs.
Example 1

$$G^a=\begin{pmatrix}0&0\\0&-1\end{pmatrix}\quad\text{and}\quad G^b=\begin{pmatrix}-1&0\\0&0\end{pmatrix}.$$
This example is very simple. In order to maximize his payoff, player 1 just has to play, at any stage, the Top row if the state is $a$ and the Bottom row if the state is $b$. This corresponds to playing optimally in each possible matrix game.

Example 2

$$G^a=\begin{pmatrix}1&0\\0&0\end{pmatrix}\quad\text{and}\quad G^b=\begin{pmatrix}0&0\\0&1\end{pmatrix}.$$
A naive strategy for player 1 would be to play at stage 1: Top if the state is $a$, and Bottom if the state is $b$. Such a strategy is called completely revealing, or CR, because it allows player 2 to deduce the selected state from the observation of the actions played by player 1. This strategy of player 1 would be optimal if a single stage were to be played, but it is a very weak strategy in the long run and does not guarantee more than zero at any stage $t\ge 2$ (because player 2 can play Left or Right depending on player 1's first action). At the opposite extreme, player 1 may not use his information and play a non-revealing, or NR, strategy, i.e. a strategy which is independent of the selected state. He can consider the average matrix
$$\frac12 G^a+\frac12 G^b=\begin{pmatrix}1/2&0\\0&1/2\end{pmatrix},$$
and play independently at each stage an optimal mixed action in this matrix, i.e. here the unique optimal mixed action $\frac12\,\text{Top}+\frac12\,\text{Bottom}$. It will turn out that this is the optimal behavior for player 1, and the value of the repeated game is the value of the average matrix, i.e. 1/4.

Example 3

$$G^a=\begin{pmatrix}4&0&2\\4&0&-2\end{pmatrix}\quad\text{and}\quad G^b=\begin{pmatrix}0&4&-2\\0&4&2\end{pmatrix}.$$
Playing a CR strategy for player 1 does not guarantee more than zero in the long run, because player 2 will eventually be able to play Middle if the state is $a$, and Left if the state is $b$. But a NR strategy will not do better, because the average matrix
$$\frac12 G^a+\frac12 G^b=\begin{pmatrix}2&2&0\\2&2&0\end{pmatrix}$$
has value 0.
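The stage-game values appearing in these examples can be checked numerically. The following sketch (an illustration, not part of the original text) computes the value of a zero-sum matrix game for the maximizing row player by linear programming with `scipy.optimize.linprog`, and verifies that the average matrix of Example 3 has value 0 while the matrix $\frac34 G^a+\frac14 G^b$ has value 1.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(G):
    """Value of the zero-sum matrix game G for the (maximizing) row player.

    Solves: max v  s.t.  sum_i x_i G[i, j] >= v for all columns j,
    x in the simplex, as the LP  min -v  over variables (x_1..x_m, v)."""
    G = np.asarray(G, dtype=float)
    m, n = G.shape
    c = np.zeros(m + 1); c[-1] = -1.0               # minimize -v
    A_ub = np.hstack([-G.T, np.ones((n, 1))])       # v - sum_i x_i G[i, j] <= 0
    b_ub = np.zeros(n)
    A_eq = np.zeros((1, m + 1)); A_eq[0, :m] = 1.0  # sum_i x_i = 1
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[-1]

Ga = np.array([[4, 0, 2], [4, 0, -2]])
Gb = np.array([[0, 4, -2], [0, 4, 2]])

print(matrix_game_value(0.5 * Ga + 0.5 * Gb))    # average matrix of Example 3: value 0
print(matrix_game_value(0.75 * Ga + 0.25 * Gb))  # matrix after a posteriori 3/4 a + 1/4 b: value 1
```

The same helper reproduces Example 2 as well: the average matrix $\begin{pmatrix}1/2&0\\0&1/2\end{pmatrix}$ has value 1/4.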
We will see later that an optimal strategy for player 1 in this game is to play as follows. Initially, player 1 chooses an element $s$ in $\{T,B\}$ as follows: if $k=a$, then $s=T$ with probability 3/4 and thus $s=B$ with probability 1/4; and if $k=b$, then $s=T$ with probability 1/4 and $s=B$ with probability 3/4. Then at each stage player 1 plays row $s$, independently of the actions taken by player 2. The conditional probabilities satisfy: $P(k=a\mid s=T)=3/4$ and $P(k=a\mid s=B)=1/4$. At the end of stage 1, player 2 will have learned, from the action played by his opponent, something about the selected state: his belief on the state will move from $\frac12 a+\frac12 b$ to $\frac34 a+\frac14 b$ or to $\frac14 a+\frac34 b$. But player 2 still does not know the selected state perfectly. Such a strategy of player 1 is called partially revealing.

General Definition

Formally, a repeated game with incomplete information is given by the following data. There is a set of players $N$ and a set of states $K$. Each player $i$ in $N$ has a set of actions $A^i$ and a set of signals $U^i$, and we denote by $A=\prod_{i\in N}A^i$ the set of action profiles and by $U=\prod_{i\in N}U^i$ the set of signal profiles. Every player $i$ has a payoff function $g^i:K\times A\to\mathbb{R}$. There is a signaling function $q:K\times A\to\Delta(U)$, and an initial probability $\pi\in\Delta(K\times U)$. In what follows, we will always assume the sets of players, states, actions and signals to be non-empty and finite. A repeated game with incomplete information can thus be denoted by $\Gamma=(N,K,(A^i)_{i\in N},(U^i)_{i\in N},(g^i)_{i\in N},q,\pi)$.

The progress of the game is the following. Initially, an element $(k,(u_0^i)_i)$ is selected according to $\pi$: $k$ is the realized state of nature and will remain fixed, and each player $i$ learns $u_0^i$ (and nothing more than $u_0^i$). At each integer stage $t\ge 1$, simultaneously every player $i$ chooses an action $a_t^i$ in $A^i$, and we denote by $a_t=(a_t^i)_i$ the action profile played at stage $t$. The stage payoff of a player $i$ is then given by $g^i(k,a_t)$.
A signal profile $(u_t^i)_i$ is selected according to $q(k,a_t)$, and each player $i$ learns $u_t^i$ (and nothing more than $u_t^i$) before proceeding to the next stage.

Remark
1. We will always assume that during the play, each player remembers the past actions he has chosen, as well as the past signals he has received. Players are allowed to select their actions independently at random.
2. The players do not necessarily know their stage payoff after each stage (as an illustration, imagine the players bargaining over units of an asset whose terminal value will only be known "at the end" of the game). This is without loss of generality, because it is possible to add hypotheses on $q$ so that each player will be able to deduce his stage payoff from his realized stage signal.
3. Repeated games with complete information are a particular case, corresponding to the situation where each initial signal $u_0^i$ reveals the selected state. Such games are studied in the chapter "Repeated Games with Complete Information".
4. Games where the state variable $k$ evolves from stage to stage, according to the actions played, are called stochastic games. These games are not covered here, but in a specific chapter entitled "Stochastic Games".
5. The most standard case of signaling function is when each player exactly learns, at the end of each stage $t$, the whole action profile $a_t$. Such games are usually called games with "perfect monitoring", "full monitoring", "perfect observation" or with "observable actions".
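The progression of the game described by the general definition can be sketched as a small simulator. The sketch below is illustrative, not from the original text: all names are hypothetical, the prior has the product form of the standard model (player 1 observes the state, player 2 observes nothing), the signaling function is taken deterministic, and the payoff matrices of Example 1 are used with the minus signs as given above.

```python
import random

def play_repeated_game(prior, g, q, strategies, T, rng=None):
    """Simulate T stages of a repeated game with incomplete information.

    prior      : dict state -> probability (law of the state of nature)
    g          : dict player -> payoff function g[i](k, action_profile)
    q          : signaling function q(k, action_profile) -> dict player -> signal
    strategies : dict player -> function (own private history) -> action
    Returns the realized state and the list of stage payoff profiles."""
    rng = rng or random.Random(0)
    states = list(prior)
    # the state of nature is drawn once and stays fixed for the whole play
    k = rng.choices(states, weights=[prior[s] for s in states])[0]
    histories = {i: [] for i in strategies}
    histories[1].append(('initial signal', k))       # player 1 is informed
    payoffs = []
    for _ in range(T):
        a = {i: strategies[i](histories[i]) for i in strategies}
        payoffs.append({i: g[i](k, a) for i in g})
        u = q(k, a)                                  # stage signals
        for i in histories:
            histories[i].append(('action', a[i], 'signal', u[i]))
    return k, payoffs

# Example 1 with perfect monitoring: player 1 plays Top (row 0) in state a
# and Bottom (row 1) in state b; player 2 always plays Left (column 0).
Ga = [[0, 0], [0, -1]]
Gb = [[-1, 0], [0, 0]]
G = {1: (lambda k, a: (Ga if k == 'a' else Gb)[a[1]][a[2]]),
     2: (lambda k, a: -(Ga if k == 'a' else Gb)[a[1]][a[2]])}
q = lambda k, a: {1: (a[1], a[2]), 2: (a[1], a[2])}  # actions publicly observed
strategies = {1: lambda h: 0 if h[0][1] == 'a' else 1,
              2: lambda h: 0}
k, payoffs = play_repeated_game({'a': 0.5, 'b': 0.5}, G, q, strategies, T=10)
print(k, payoffs[0])   # player 1's optimal play in Example 1 yields stage payoff 0
```

With these matrices, playing Top in state $a$ and Bottom in state $b$ secures the value 0 of each stage game, as stated for Example 1.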
Strategies, Payoffs, Value and Equilibria

Strategies

A (behavior) strategy for player $i$ is a rule, or program, describing the action taken by this player in any possible case which may happen. These actions may be chosen at random, so a strategy for player $i$ is an element $\sigma^i=(\sigma^i_t)_{t\ge 1}$, where for each $t$, $\sigma^i_t$ is a mapping from $U^i\times(U^i\times A^i)^{t-1}$ to $\Delta(A^i)$ giving the lottery played by player $i$ at stage $t$ as a function of the past signals and actions of player $i$. The set of strategies for player $i$ is denoted by $\Sigma^i$.

A history of length $t$ in $\Gamma$ is a sequence $(k,u_0,a_1,u_1,\dots,a_t,u_t)$, and the set of such histories is the finite set $K\times U\times(A\times U)^t$. An infinite history is called a play; the set of plays is denoted by $\Omega=K\times U\times(A\times U)^\infty$ and is endowed with the product $\sigma$-algebra. A strategy profile $\sigma=(\sigma^i)_i$ naturally induces, together with the initial probability $\pi$, a probability distribution over the set of histories of length $t$. This probability uniquely extends to a probability over plays, and is denoted by $P_{\pi,\sigma}$.

Payoffs

Given a time horizon $T$, the average expected payoff of player $i$, up to stage $T$, if the strategy profile $\sigma$ is played, is denoted by:
$$\gamma^i_T(\sigma)=E_{P_{\pi,\sigma}}\Big(\frac1T\sum_{t=1}^{T}g^i(k,a_t)\Big).$$
The $T$-stage game is the game $\Gamma_T$ where simultaneously each player $i$ chooses a strategy $\sigma^i$ in $\Sigma^i$, then receives the payoff $\gamma^i_T((\sigma^j)_{j\in N})$. Given a discount factor $\lambda$ in $(0,1]$, the $\lambda$-discounted payoff of player $i$ is denoted by:
$$\gamma^i_\lambda(\sigma)=E_{P_{\pi,\sigma}}\Big(\lambda\sum_{t=1}^{\infty}(1-\lambda)^{t-1}g^i(k,a_t)\Big).$$
The $\lambda$-discounted game is the game $\Gamma_\lambda$ where simultaneously each player $i$ chooses a strategy $\sigma^i$ in $\Sigma^i$, then receives the payoff $\gamma^i_\lambda((\sigma^j)_{j\in N})$.

Remark A strategy for player $i$ is called pure if it always plays in a deterministic way. A mixed strategy for player $i$ is defined as a probability distribution over the set of pure strategies (endowed with the product $\sigma$-algebra). Kuhn's theorem (see [3,38] or [85] for a modern presentation) states that mixed strategies and behavior strategies are equivalent in the following sense: for each behavior strategy $\sigma^i$, there exists a mixed strategy $\tilde\sigma^i$ of the same player such that $P_{\pi,(\sigma^i,\tau^{-i})}=P_{\pi,(\tilde\sigma^i,\tau^{-i})}$ for any strategy profile $\tau^{-i}$ of the other players, and vice versa if we exchange the words "behavior" and "mixed". Unless otherwise specified, the word strategy will refer here to a behavior strategy, but we will also sometimes equivalently use mixed strategies, or even mixtures of behavior strategies.

Value of Zero-Sum Games

By definition the game is zero-sum if there are two players, say player 1 and player 2, with opposite payoffs. The $T$-stage game $\Gamma_T$ can then be seen as a matrix game, hence by the minmax theorem it has a value $v_T=\sup_{\sigma^1}\inf_{\sigma^2}\gamma^1_T(\sigma^1,\sigma^2)=\inf_{\sigma^2}\sup_{\sigma^1}\gamma^1_T(\sigma^1,\sigma^2)$. Similarly, one can use Sion's theorem [74] to show that the $\lambda$-discounted game has a value $v_\lambda=\sup_{\sigma^1}\inf_{\sigma^2}\gamma^1_\lambda(\sigma^1,\sigma^2)=\inf_{\sigma^2}\sup_{\sigma^1}\gamma^1_\lambda(\sigma^1,\sigma^2)$.

To study long-term strategic aspects, it is also important to consider the following notion of uniform value. Players are asked to play well uniformly in the time horizon, i.e. simultaneously in all games $\Gamma_T$ with $T$ sufficiently large (or, similarly, uniformly in the discount factor, i.e. simultaneously in all games $\Gamma_\lambda$ with $\lambda$ sufficiently low).

Definition 1 Player 1 can guarantee the real number $v$ in the repeated game $\Gamma$ if: $\forall\varepsilon>0$, $\exists\sigma^1\in\Sigma^1$, $\exists T_0$, $\forall T\ge T_0$, $\forall\sigma^2\in\Sigma^2$: $\gamma^1_T(\sigma^1,\sigma^2)\ge v-\varepsilon$. Similarly, player 2 can guarantee $v$ in $\Gamma$ if $\forall\varepsilon>0$, $\exists\sigma^2\in\Sigma^2$, $\exists T_0$, $\forall T\ge T_0$, $\forall\sigma^1\in\Sigma^1$: $\gamma^1_T(\sigma^1,\sigma^2)\le v+\varepsilon$. If both player 1 and player 2 can guarantee $v$, then $v$ is called the uniform value of the repeated game. A strategy $\sigma^1$ of player 1 satisfying $\exists T_0$, $\forall T\ge T_0$, $\forall\sigma^2\in\Sigma^2$: $\gamma^1_T(\sigma^1,\sigma^2)\ge v$ is then
called an optimal strategy of player 1 (optimal strategies of player 2 are defined similarly). The uniform value, whenever it exists, is necessarily unique. Its existence is a strong property, which implies that both $v_T$, as $T$ goes to infinity, and $v_\lambda$, as $\lambda$ goes to zero, converge to the uniform value.

Equilibria of General-Sum Games

In the general case, the $T$-stage game $\Gamma_T$ can be seen as the mixed extension of a finite game, and consequently possesses a Nash equilibrium. Similarly, the discounted game $\Gamma_\lambda$ always has, by the Nash–Glicksberg theorem, a Nash equilibrium. Concerning uniform notions, pairs of optimal strategies are generalized as follows.

Definition 2 A strategy profile $\sigma=(\sigma^i)_{i\in N}$ is a uniform Nash equilibrium of $\Gamma$ if: 1) $\forall\varepsilon>0$, $\sigma$ is an $\varepsilon$-Nash equilibrium in every finitely repeated game which is sufficiently long, that is: $\exists T_0$, $\forall T\ge T_0$, $\forall i\in N$, $\forall\tau^i\in\Sigma^i$: $\gamma^i_T(\tau^i,\sigma^{-i})\le\gamma^i_T(\sigma)+\varepsilon$; and 2) the sequence of payoffs $((\gamma^i_T(\sigma))_{i\in N})_T$ converges to a limit payoff $(\gamma^i(\sigma))_{i\in N}$ in $\mathbb{R}^N$.

Remark The initial probability $\pi$ will play a great role in the following analyses, so we will often write $\gamma^{i,\pi}_T(\sigma)$ for $\gamma^i_T(\sigma)$, $v_T(\pi)$ for the value $v_T$, etc.

The Standard Model of Aumann and Maschler

This famous model was introduced in the sixties by Aumann and Maschler (see the re-edition [6]). It deals with zero-sum games with lack of information on one side and observable actions, as in the basic examples previously presented. There is a finite set of states $K$, an initial probability $p=(p^k)_{k\in K}$ on $K$, and a family of matrix games $G^k$ of identical size $I\times J$. Initially, a state $k$ in $K$ is selected according to $p$ and announced to player 1 (called the informed player) only. Then the matrix game $G^k$ is repeated over and over: at every stage, player 1 chooses a row $i$ in $I$ and simultaneously player 2 chooses a column $j$ in $J$; the stage payoff to player 1 is then $G^k(i,j)$, but only $i$ and $j$ are publicly announced before proceeding to the next stage. Denote by $M$ the constant $\max_{k,i,j}|G^k(i,j)|$.
Basic Tools: Splitting, Martingale, Concavification, and the Recursive Formula

The following aspects are simple but fundamental. The initial probability $p=(p^k)_{k\in K}$ represents the initial belief, or a priori, of player 2 on the selected state of nature. Assume that player 1 chooses his first action (or, more generally, a message or signal $s$ from a finite set $S$) according to a probability distribution depending on the state, i.e. according to a transition probability $x=(x^k)_{k\in K}\in\Delta(S)^K$. For each signal $s$, the probability that $s$ is chosen is denoted $\lambda(x,s)=\sum_k p^k x^k(s)$, and given $s$ such that $\lambda(x,s)>0$ the conditional probability on $K$, or a posteriori of player 2, is $\hat p(x,s)=\big(p^k x^k(s)/\lambda(x,s)\big)_{k\in K}$. We clearly have:
$$p=\sum_{s\in S}\lambda(x,s)\,\hat p(x,s).\qquad(1)$$

Repeated Games with Incomplete Information, Figure 1: Splitting

So the a priori $p$ lies in the convex hull of the a posteriori. The following lemma expresses a reciprocal: player 1 is able to induce any family of a posteriori containing $p$ in its convex hull.

Lemma 1 (Splitting) Assume that $p$ is written as a convex combination $p=\sum_{s\in S}\lambda_s p_s$ with positive coefficients. There exists a transition probability $x\in\Delta(S)^K$ such that $\forall s\in S$: $\lambda_s=\lambda(x,s)$ and $p_s=\hat p(x,s)$.

Proof Just put $x^k(s)=\lambda_s p_s^k/p^k$ if $p^k>0$.
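The splitting construction of Lemma 1 is easy to verify numerically. The sketch below (illustrative, not from the original text) builds the transition probability $x$ from a convex decomposition of $p$ and recovers the prescribed signal probabilities and posteriors; the decomposition used is the one of Example 3's partially revealing strategy.

```python
def splitting(p, lambdas, posteriors):
    """Given a prior p (dict k -> p^k), weights lambdas (dict s -> lambda_s)
    and posteriors (dict s -> dict k -> p_s^k) with p = sum_s lambda_s p_s,
    return the transition probability x[k][s] = lambda_s * p_s^k / p^k."""
    return {k: {s: lambdas[s] * posteriors[s][k] / p[k] for s in lambdas}
            for k in p if p[k] > 0}

# p = 1/2 a + 1/2 b split into 3/4 a + 1/4 b and 1/4 a + 3/4 b, weight 1/2 each
p = {'a': 0.5, 'b': 0.5}
lambdas = {'T': 0.5, 'B': 0.5}
posteriors = {'T': {'a': 0.75, 'b': 0.25}, 'B': {'a': 0.25, 'b': 0.75}}
x = splitting(p, lambdas, posteriors)

# recover the induced signal probabilities lambda(x, s) and posteriors p_hat(x, s)
for s in lambdas:
    lam = sum(p[k] * x[k][s] for k in p)           # lambda(x, s)
    post = {k: p[k] * x[k][s] / lam for k in p}    # p_hat(x, s)
    print(s, lam, post)
```

Note that `x['a']['T']` equals 3/4: in state $a$, signal $T$ is sent with probability 3/4, exactly the partially revealing strategy described earlier.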
Equation (1) not only tells us that the a posteriori contain $p$ in their convex hull, but also that the expectation of the a posteriori is the a priori. We are here in a repeated context, and for every strategy profile $\sigma$ one can define the process $(p_t(\sigma))_{t\ge 0}$ of the a posteriori of player 2. We have $p_0=p$, and $p_t(\sigma)$ is the random variable of player 2's belief on the state after the first $t$ stages. More precisely, we define for any $t\ge 0$, $h_t=(i_1,j_1,\dots,i_t,j_t)\in(I\times J)^t$ and $k$ in $K$:
$$p_t^k(\sigma,h_t)=P_{p,\sigma}(k\mid h_t)=\frac{p^k\,P_{\delta_k,\sigma}(h_t)}{P_{p,\sigma}(h_t)}.$$
$p_t(\sigma,h_t)=(p_t^k(\sigma,h_t))_{k\in K}\in\Delta(K)$ (arbitrarily defined if $P_{p,\sigma}(h_t)=0$) is the conditional probability on the state of nature given that $\sigma$ is played and $h_t$ has occurred in the first $t$ stages. It is easy to see that as soon as $P_{p,\sigma}(h_t)>0$, $p_t(\sigma,h_t)$ does not depend on player 2's
strategy 2 , nor on player 2’s last action jt . It is fundamental to notice that: Lemma 2 (Martingale of a posteriori) (p t ()) t0 is a P p; -martingale with values in (K). This is indeed a general property of Bayesian learning of a fixed unknown parameter: the expectation of what I will know tomorrow is what I know today. This martingale is controlled by the informed player, and the splitting lemma shows that this player can essentially induce any martingale issued from the a priori p. Notice that, to be able to compute the realizations of the martingale, player 2 needs to know the strategy 1 used by player 1. The splitting lemma also easily gives the following concavification result. Let f be a continuous mapping from (K) to R. The smallest concave function above f is denoted by cav f , and we have:
X cav f (p) D max s f (p s ); S finite; s2S
8ss 0; p s 2 (K);
X s2S
s D 1;
X
s p s D p :
s2S
Lemma 3 (Concavification) If for any initial probability $p$ the informed player can guarantee $f(p)$ in the game $\Gamma(p)$, then for any $p$ this player can also guarantee $\mathrm{cav}\,f(p)$ in $\Gamma(p)$.

Non Revealing Games

As soon as player 1 uses a strategy which depends on the selected state, the martingale of a posteriori will move and player 2 will have learned something about the state. This is the dilemma of the informed player: he cannot use his information without revealing information. Imagine now that player 1 decides to reveal no information on the selected state and plays independently of it. Since payoffs are defined via expectations, it is as if the players were repeating the average matrix game $G(p)=\sum_{k\in K}p^k G^k$. Its value is:
$$u(p)=\max_{x\in\Delta(I)}\min_{y\in\Delta(J)}\sum_{i,j}x(i)y(j)G(p)(i,j)=\min_{y\in\Delta(J)}\max_{x\in\Delta(I)}\sum_{i,j}x(i)y(j)G(p)(i,j).$$
$u$ is a Lipschitz function, with constant $M$, from $\Delta(K)$ to $\mathbb{R}$. Clearly, player 1 can guarantee $u(p)$ in the game $\Gamma(p)$ by playing i.i.d. at each stage an optimal strategy in $G(p)$. By the concavification lemma, we obtain:

Proposition 1 Player 1 can guarantee $\mathrm{cav}\,u(p)$ in the game $\Gamma(p)$.
Repeated Games with Incomplete Information, Figure 2 u and cav u
Let us come back to the examples. In Example 1, we have
$$u(p)=\mathrm{Val}\begin{pmatrix}-(1-p)&0\\0&-p\end{pmatrix}=-p(1-p),$$
where $p\in[0,1]$ stands here for the probability of state $a$. This is a convex function of $p$, and $\mathrm{cav}\,u(p)=0$ for all $p$. In Example 2, $u(p)=p(1-p)$ for all $p$, hence $u$ is already concave and $\mathrm{cav}\,u=u$. Regarding Example 3, Figure 2 shows the functions $u$ (regular line) and $\mathrm{cav}\,u$ (dashed line).

Let us consider again the partially revealing strategy previously described. With probability 1/2, the a posteriori will be $\frac34 a+\frac14 b$, and player 1 will play Top, which is optimal in
$$\frac34 G^a+\frac14 G^b=\begin{pmatrix}3&1&1\\3&1&-1\end{pmatrix}.$$
Similarly, with probability 1/2 the a posteriori will be $\frac14 a+\frac34 b$ and player 1 will play an optimal strategy in $\frac14 G^a+\frac34 G^b$. Consequently, this strategy guarantees $\frac12 u(3/4)+\frac12 u(1/4)=\mathrm{cav}\,u(1/2)=1$ to player 1.

Player 2 Can Guarantee the Limit Value

In the infinitely repeated game with initial probability $p$, player 2 can play as follows: $T$ being fixed, he can play an optimal strategy in the $T$-stage game $\Gamma_T(p)$, then forget everything and play again an optimal strategy in the $T$-stage game $\Gamma_T(p)$, etc. By doing so, he guarantees $v_T(p)$ in the game $\Gamma(p)$. So he can guarantee $\inf_T v_T(p)$ in this game, and this implies that $\limsup_T v_T(p)\le\inf_T v_T(p)$. As a consequence, we obtain:

Proposition 2 The sequence $(v_T(p))_T$ converges to $\inf_T v_T(p)$, and this limit can be guaranteed by player 2 in the game $\Gamma(p)$.

Uniform Value: cav u Theorem

We will see here that $\lim_T v_T(p)$ is nothing but $\mathrm{cav}\,u(p)$, and since this quantity can be guaranteed by both players it is the uniform value of the game $\Gamma(p)$. The idea of the proof is the following. The martingale $(p_t(\sigma))_{t\ge 0}$ is bounded, hence will converge almost surely, and we have a bound on its $L^1$ variation (see Lemma 4 below).
This means that after a certain stage the martingale will essentially remain constant, so approximately player 1 will play in a non-revealing way and thus will not be able to obtain a stage payoff greater than $u(q)$, where $q$ is a "limit a posteriori". Since the expectation of the a posteriori is the a priori $p$, player 1 cannot guarantee more than
$$\max\Big\{\sum_{s\in S}\lambda_s u(p_s)\ :\ S\ \text{finite},\ \forall s\ \lambda_s\ge 0,\ p_s\in\Delta(K),\ \sum_{s\in S}\lambda_s=1,\ \sum_{s\in S}\lambda_s p_s=p\Big\},$$
that is, more than $\mathrm{cav}\,u(p)$.

Let us now proceed to the formal proof. Fix a strategy $\sigma^1$ of player 1, and define the strategy $\sigma^2$ of player 2 as follows: play at each stage an optimal strategy in the matrix game $G(p_t)$, where $p_t$ is the current a posteriori in $\Delta(K)$. Assume that $\sigma=(\sigma^1,\sigma^2)$ is played in the repeated game $\Gamma(p)$. To simplify notations, we write $P$ for $P_{p,\sigma}$, $p_t(h_t)$ for $p_t(\sigma,h_t)$, etc. We use everywhere the norms $\|\cdot\|_1$. To avoid confusion between variables and random variables in the following computations, we will use tildes to denote random variables; e.g. $\tilde k$ will denote the random variable of the selected state.
Lemma 4 For all $T\ge 1$,
$$\frac1T\sum_{t=0}^{T-1}E\big(\|p_{t+1}-p_t\|\big)\ \le\ \frac{1}{\sqrt T}\sum_{k\in K}\sqrt{p^k(1-p^k)}.$$

Proof This is a property of martingales with values in $\Delta(K)$ and expectation $p$. We have for each state $k$ and $t\ge 0$: $E\big((p^k_{t+1}-p^k_t)^2\big)=E\big(E((p^k_{t+1}-p^k_t)^2\mid\mathcal H_t)\big)$, where $\mathcal H_t$ is the $\sigma$-algebra on plays generated by the first $t$ action profiles. So $E\big((p^k_{t+1}-p^k_t)^2\big)=E\big(E((p^k_{t+1})^2+(p^k_t)^2-2p^k_{t+1}p^k_t\mid\mathcal H_t)\big)=E\big((p^k_{t+1})^2\big)-E\big((p^k_t)^2\big)$. So $E\big(\sum_{t=0}^{T-1}(p^k_{t+1}-p^k_t)^2\big)=E\big((p^k_T)^2\big)-(p^k)^2\le p^k(1-p^k)$. By the Cauchy–Schwarz inequality, we also have for each $k$:
$$E\Big(\frac1T\sum_{t=0}^{T-1}\big|p^k_{t+1}-p^k_t\big|\Big)\ \le\ \sqrt{\frac1T\,E\Big(\sum_{t=0}^{T-1}(p^k_{t+1}-p^k_t)^2\Big)},$$
and the result follows.

For $h_t$ in $(I\times J)^t$, $\sigma^1_{t+1}(k,h_t)$ is the mixed action in $\Delta(I)$ played by player 1 at stage $t+1$ if the state is $k$ and $h_t$ has previously occurred, and we write $\bar\sigma^1_{t+1}(h_t)$ for the law of the action of player 1 at stage $t+1$ after $h_t$: $\bar\sigma^1_{t+1}(h_t)=\sum_{k\in K}p^k_t(h_t)\,\sigma^1_{t+1}(k,h_t)\in\Delta(I)$. $\bar\sigma^1_{t+1}(h_t)$ can be seen as the average action played by player 1 after $h_t$, and will be used as a non-revealing approximation for $(\sigma^1_{t+1}(k,h_t))_k$. The next lemma precisely links the variation of the martingale $(p_t(\sigma))_{t\ge 0}$, i.e. the information revealed by player 1, and the dependence of player 1's action on the selected state, i.e. the information used by player 1.

Lemma 5 For all $t\ge 0$ and $h_t\in(I\times J)^t$:
$$E\big(\|p_{t+1}-p_t\|\ \big|\ h_t\big)=E\big(\|\sigma^1_{t+1}(\tilde k,h_t)-\bar\sigma^1_{t+1}(h_t)\|\ \big|\ h_t\big).$$

Proof Fix $t\ge 0$ and $h_t$ in $(I\times J)^t$ such that $P_{p,\sigma}(h_t)>0$. For $(i_{t+1},j_{t+1})$ in $I\times J$, one has:
$$p^k_{t+1}(h_t,i_{t+1},j_{t+1})=P(\tilde k=k\mid h_t,i_{t+1})=\frac{P(\tilde k=k\mid h_t)\,P(i_{t+1}\mid k,h_t)}{P(i_{t+1}\mid h_t)}=\frac{p^k_t(h_t)\,\sigma^1_{t+1}(k,h_t)(i_{t+1})}{\bar\sigma^1_{t+1}(h_t)(i_{t+1})}.$$
Consequently,
$$\begin{aligned}
E\big(\|p_{t+1}-p_t\|\mid h_t\big)&=\sum_{i_{t+1}\in I}\bar\sigma^1_{t+1}(h_t)(i_{t+1})\sum_{k\in K}\big|p^k_{t+1}(h_t,i_{t+1})-p^k_t(h_t)\big|\\
&=\sum_{i_{t+1}\in I}\sum_{k\in K}\big|p^k_t(h_t)\,\sigma^1_{t+1}(k,h_t)(i_{t+1})-\bar\sigma^1_{t+1}(h_t)(i_{t+1})\,p^k_t(h_t)\big|\\
&=\sum_{k\in K}p^k_t(h_t)\,\big\|\sigma^1_{t+1}(k,h_t)-\bar\sigma^1_{t+1}(h_t)\big\|\\
&=E\big(\|\sigma^1_{t+1}(\tilde k,h_t)-\bar\sigma^1_{t+1}(h_t)\|\mid h_t\big).
\end{aligned}$$

We can now control payoffs. For $t\ge 0$ and $h_t$ in $(I\times J)^t$:
$$\begin{aligned}
E\big(G^{\tilde k}(\tilde i_{t+1},\tilde j_{t+1})\mid h_t\big)&=\sum_{k\in K}p^k_t(h_t)\,G^k\big(\sigma^1_{t+1}(k,h_t),\sigma^2_{t+1}(h_t)\big)\\
&\le\sum_{k\in K}p^k_t(h_t)\,G^k\big(\bar\sigma^1_{t+1}(h_t),\sigma^2_{t+1}(h_t)\big)+M\sum_{k\in K}p^k_t(h_t)\,\big\|\sigma^1_{t+1}(k,h_t)-\bar\sigma^1_{t+1}(h_t)\big\|\\
&\le u\big(p_t(h_t)\big)+M\sum_{k\in K}p^k_t(h_t)\,\big\|\sigma^1_{t+1}(k,h_t)-\bar\sigma^1_{t+1}(h_t)\big\|,
\end{aligned}$$
where $u(p_t(h_t))$ comes from the definition of $\sigma^2$. By Lemma 5, we get:
$$E\big(G^{\tilde k}(\tilde i_{t+1},\tilde j_{t+1})\mid h_t\big)\le u\big(p_t(h_t)\big)+M\,E\big(\|p_{t+1}-p_t\|\mid h_t\big).$$
Applying Jensen's inequality yields:
$$E\big(G^{\tilde k}(\tilde i_{t+1},\tilde j_{t+1})\big)\le \mathrm{cav}\,u(p)+M\,E\big(\|p_{t+1}-p_t\|\big).$$
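The martingale property of Lemma 2 and the variation bound of Lemma 4 can be checked exactly on a small example by enumerating all histories. The sketch below is illustrative and not from the original text: the state-dependent action probabilities 0.9/0.1 are an arbitrary choice of a (revealing) i.i.d. strategy for the informed player, with two states and two actions.

```python
from itertools import product
from math import sqrt

# Player 1 plays Top with a state-dependent probability at every stage,
# independently of the past; this strategy reveals information, so the
# posterior martingale (p_t) moves.
x = {'a': 0.9, 'b': 0.1}            # P(Top | state), arbitrary illustrative values
prior = {'a': 0.5, 'b': 0.5}
T = 6

def lik(k, h):
    """Probability of the action sequence h (tuple of 'T'/'B') in state k."""
    r = 1.0
    for a in h:
        r *= x[k] if a == 'T' else 1.0 - x[k]
    return r

def posterior(h):
    """Player 2's belief on the state after observing h (Bayes rule)."""
    w = {k: prior[k] * lik(k, h) for k in prior}
    s = sum(w.values())
    return {k: w[k] / s for k in prior}

def action_prob(pt, a):
    """P(next action is a | current posterior pt)."""
    p_top = pt['a'] * x['a'] + pt['b'] * x['b']
    return p_top if a == 'T' else 1.0 - p_top

total_variation = 0.0
for t in range(T):
    for h in product('TB', repeat=t):
        ph = sum(prior[k] * lik(k, h) for k in prior)      # P(h)
        pt = posterior(h)
        # Lemma 2 (martingale property): E(p_{t+1} | h_t) = p_t
        for k in prior:
            e = sum(action_prob(pt, a) * posterior(h + (a,))[k] for a in 'TB')
            assert abs(e - pt[k]) < 1e-12
        # accumulate E(||p_{t+1} - p_t||_1)
        for a in 'TB':
            pn = posterior(h + (a,))
            step = sum(abs(pn[k] - pt[k]) for k in prior)
            total_variation += ph * action_prob(pt, a) * step

lhs = total_variation / T
bound = sum(sqrt(prior[k] * (1.0 - prior[k])) for k in prior) / sqrt(T)
print(lhs, '<=', bound)   # Lemma 4's bound holds
```

The expectations here are computed exactly (by summing over all 2^t histories), so the comparison with the Lemma 4 bound involves no Monte Carlo error.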
We now apply Lemma 4 and obtain:
$$\gamma_T^{1,p}(\sigma^1,\tau^2) = E\left(\frac1T\sum_{t=0}^{T-1}G^{\tilde k}(\tilde i_{t+1},\tilde j_{t+1})\right) \le \mathrm{cav}\,u(p) + \frac{M}{\sqrt T}\sum_{k\in K}\sqrt{p^k(1-p^k)}.$$
This is true for any strategy $\sigma^1$ of player 1. Considering the case of an optimal strategy for player 1 in the $T$-stage game $\Gamma_T(p)$, we have shown:
Proposition 3  For $p$ in $\Delta(K)$ and $T\ge 1$,
$$v_T(p) \le \mathrm{cav}\,u(p) + \frac{M}{\sqrt T}\sum_{k\in K}\sqrt{p^k(1-p^k)}.$$

It remains to conclude about the existence of the uniform value. We have seen that player 1 can guarantee $\mathrm{cav}\,u(p)$, that player 2 can guarantee $\lim_T v_T(p)$, and we obtain from Proposition 3 that $\lim_T v_T(p) \le \mathrm{cav}\,u(p)$. This is enough to deduce Aumann and Maschler's celebrated "cav u" theorem.

Theorem 1 (Aumann and Maschler [6])  The game $\Gamma(p)$ has a uniform value which is $\mathrm{cav}\,u(p)$.

T-Stage Values and the Recursive Formula

As the $T$-stage game is a zero-sum game with incomplete information where player 1 is informed, we can write:
$$v_T(p) = \inf_{\tau^2\in\Sigma^2}\ \sup_{\sigma^1\in\Sigma^1}\gamma_T^{1,p}(\sigma^1,\tau^2) = \inf_{\tau^2\in\Sigma^2}\ \sup_{\sigma^1\in\Sigma^1}\sum_{k\in K}p^k\,\gamma_T^{1,\delta_k}(\sigma^1,\tau^2) = \inf_{\tau^2\in\Sigma^2}\ \sum_{k\in K}p^k\left(\sup_{\sigma^1\in\Sigma^1}\gamma_T^{1,\delta_k}(\sigma^1,\tau^2)\right).$$
This shows that $v_T$ is the infimum of a family of affine functions of $p$, hence is a concave function of $p$. This concavity represents the advantage of player 1 to possess the information on the selected state. Clearly, we have $v_T(p)\ge u(p)$, hence we get the inequalities:
$$\forall T\ge 1,\qquad \mathrm{cav}\,u(p)\ \le\ v_T(p)\ \le\ \mathrm{cav}\,u(p) + \frac{M}{\sqrt T}\sum_{k\in K}\sqrt{p^k(1-p^k)}.$$
It is also easy to prove that the $T$-stage value functions satisfy the following recursive formula:
$$v_{T+1}(p) = \frac{1}{T+1}\ \max_{x\in\Delta(I)^K}\ \min_{y\in\Delta(J)}\left(G(p,x,y) + T\sum_{i\in I}x(p)(i)\,v_T(\hat p(x,i))\right) = \frac{1}{T+1}\ \min_{y\in\Delta(J)}\ \max_{x\in\Delta(I)^K}\left(G(p,x,y) + T\sum_{i\in I}x(p)(i)\,v_T(\hat p(x,i))\right),$$
where $x=(x^k(i))_{i\in I,k\in K}$, with $x^k$ the mixed action used at stage 1 by player 1 if the state is $k$; $G(p,x,y)=\sum_{k,i,j}p^k\,x^k(i)\,y(j)\,G^k(i,j)$ is the expected payoff of stage 1; $x(p)(i)=\sum_{k\in K}p^k x^k(i)$ is the probability that action $i$ is played at stage 1; and $\hat p(x,i)$ is the conditional probability on $K$ given $i$. The recursive formula simply is a generalization of the dynamic programming principle. The following property interprets easily: the advantage of the informed player can only decrease as the number of stages increases (for a proof, one can show that $v_{T+1}\le v_T$ by induction on $T$, using the concavity of $v_T$).

Lemma 6  The $T$-stage value $v_T(p)$ is non-increasing in $T$.
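The concavification $\mathrm{cav}\,u$ appearing in these bounds is, on a one-dimensional grid, the least concave majorant of the sampled points, which can be computed with an upper-convex-hull sweep. A minimal sketch follows; the grid size and the non-concave test function are arbitrary choices, while the concave $u(p)=p(1-p)$ is the value of the non-revealing game in the classical example with $G^1 = \begin{pmatrix}1&0\\0&0\end{pmatrix}$ and $G^2 = \begin{pmatrix}0&0\\0&1\end{pmatrix}$.

```python
# cav u on a grid: least concave majorant via an upper-convex-hull sweep.

def cav_on_grid(xs, us):
    """Return the least concave majorant of the points (xs[i], us[i]) on xs."""
    hull = []
    for x, u in zip(xs, us):
        while len(hull) >= 2:
            (x1, u1), (x2, u2) = hull[-2], hull[-1]
            # Pop hull points that break concavity (slopes must decrease).
            if (u2 - u1) * (x - x2) <= (u - u2) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append((x, u))
    # Interpolate the hull back onto the grid.
    out, j = [], 0
    for x, _ in zip(xs, us):
        while j + 1 < len(hull) and hull[j + 1][0] < x:
            j += 1
        (x1, u1), (x2, u2) = hull[j], hull[min(j + 1, len(hull) - 1)]
        out.append(u1 if x2 == x1 else u1 + (u2 - u1) * (x - x1) / (x2 - x1))
    return out

xs = [i / 100 for i in range(101)]
u_c = [p * (1 - p) for p in xs]     # concave: cav u_c = u_c
u_v = [abs(p - 0.5) for p in xs]    # made-up convex "V": cav u_v is the chord 1/2

cav_c = cav_on_grid(xs, u_c)
cav_v = cav_on_grid(xs, u_v)
assert max(abs(a - b) for a, b in zip(cav_c, u_c)) < 1e-12
assert abs(cav_v[50] - 0.5) < 1e-12
```

For a concave value function the majorant coincides with $u$ and revealing nothing is optimal; where $\mathrm{cav}\,u > u$, the hull vertices are exactly the posteriors into which player 1 should split the prior.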
Optimal Strategies

In order to determine the optimal behavior of the players, it is important to be able to compute optimal strategies. The recursive formula enables one to compute, by induction on $T$, an optimal strategy for player 1 in the $T$-stage game $\Gamma_T(p)$. But it cannot be used to compute an optimal strategy for player 2 in the finitely repeated games, because such a strategy should not depend on player 1's strategy, and consequently not on the a posteriori of player 2. Constructing such an optimal strategy for player 2 can be done via the recursive formula of a dual game, see Subsect. "Zero-Sum Games". These strategies of player 2 may afterwards be used to construct an optimal strategy in the infinitely repeated game (see Subsect. "Player 2 Can Guarantee the Limit Value"): define consecutive blocks of stages $B^1,\dots,B^T,\dots$ of respective cardinalities $1,\dots,T,\dots$, and play independently on each block $B^T$ an optimal strategy for player 2 in the $T$-stage game $\Gamma_T(p)$. This strategy guarantees $\limsup_T v_T(p) = \mathrm{cav}\,u(p)$ for the uninformed player, hence is optimal for player 2 in $\Gamma(p)$. In the next section we will also see how to directly construct an explicit optimal strategy for player 2 in $\Gamma(p)$, taking care simultaneously of all possible states $k$. It is much simpler to construct an optimal strategy for player 1 in the infinitely repeated game: since player 1 has to guarantee $\mathrm{cav}\,u(p)$, this can be done using the concavification lemma, see Proposition 1.

Vector Payoffs and Approachability

The following model has been introduced by D. Blackwell [8] and is, strictly speaking, not part of the general definition given in Sect. "Definition of the Subject". We still have a family of $I\times J$ matrices $(G^k)_{k\in K}$, where $K$ is a finite set of parameters. At each stage $t$, simultaneously player 1 chooses $i_t\in I$ and player 2 chooses $j_t\in J$, and the stage "payoff" is the full vector $G(i_t,j_t) = (G^k(i_t,j_t))_{k\in K}$ in $R^K$. Notice that there is no initial probability or true state of nature here, and both players have symmetric roles. We assume here that after each stage both players observe exactly the stage vector payoff (but one can check that assuming that the action profiles are observed wouldn't change the results). A natural question is then to determine the sets $C$ in $R^K$ such that player 1 (for example) can force the average long-term payoff to belong to $C$. Such sets will be called approachable by player 1. In Sect. "Vector Payoffs and Approachability", we use Euclidean distances and norms. Denote by $F = \{(G^k(i,j))_{k\in K};\ i\in I,\ j\in J\}$ the finite set of possible stage payoffs, and by $M$ a constant such that $\|u\|\le M$ for each $u$ in $F$. A strategy for player 1, resp. player 2, is an element $\sigma=(\sigma_t)_{t\ge1}$, where $\sigma_t$ maps $F^{t-1}$ into $\Delta(I)$, resp. $\Delta(J)$. Strategy spaces for player 1 and player 2 are respectively denoted by $\Sigma$ and $\mathcal T$. A strategy profile $(\sigma,\tau)$ naturally induces a unique probability on $(I\times J\times F)^\infty$ denoted by $P_{\sigma,\tau}$. Let $C$ be a "target" set, that will always be assumed, without loss of generality, a closed subset of $R^K$.
We denote by $g_t$ the random variable, with values in $F$, of the payoff of stage $t$, and we use $\bar g_t = \frac1t\sum_{t'=1}^{t}g_{t'} \in \mathrm{conv}(F)$, and finally $d_t = d(\bar g_t,C)$ for the distance from $\bar g_t$ to $C$.

Definition 3  $C$ is approachable by player 1 if: $\forall\varepsilon>0$, $\exists\sigma\in\Sigma$, $\exists T$, $\forall\tau\in\mathcal T$, $\forall t\ge T$: $E_{\sigma,\tau}(d_t)\le\varepsilon$. $C$ is excludable by player 1 if there exists $\delta>0$ such that $\{z\in R^K;\ d(z,C)\ge\delta\}$ is approachable by player 1. Approachability and excludability for player 2 are defined similarly.

$C$ is approachable by player 1 if for each $\varepsilon>0$, this player can force that for $t$ large we have $E_{\sigma,\tau}(d_t)\le\varepsilon$, so the average payoff will be $\varepsilon$-close to $C$ with high probability. A set cannot be approachable by a player as well as excludable by the other player. In the usual case where $K$ is a singleton, we are in dimension 1
Repeated Games with Incomplete Information, Figure 3 The Blackwell property
and the minmax theorem implies that for each $t$, the interval $[t,+\infty)$ is either approachable by player 1, or excludable by player 2, depending on the comparison between $t$ and the value $\max_{x\in\Delta(I)}\min_{y\in\Delta(J)}G(x,y) = \min_{y\in\Delta(J)}\max_{x\in\Delta(I)}G(x,y)$.

Necessary and Sufficient Conditions for Approachability

Given a mixed action $x$ in $\Delta(I)$, we write $xG$ for the set of possible vector payoffs when player 1 uses $x$, i.e. $xG = \{G(x,y);\ y\in\Delta(J)\} = \mathrm{conv}\{\sum_{i\in I}x_i\,G(i,j);\ j\in J\}$. Similarly, we write $Gy = \{G(x,y);\ x\in\Delta(I)\}$ for $y$ in $\Delta(J)$.

Definition 4  The set $C$ is a B(lackwell)-set for player 1 if for every $z\notin C$, there exists $z'\in C$ and $x\in\Delta(I)$ such that: (i) $\|z'-z\| = d(z,C)$, and (ii) the hyperplane containing $z'$ and orthogonal to $[z,z']$ separates $z$ from $xG$.

For example, any set $xG$, with $x$ in $\Delta(I)$, is a B-set for player 1. Given a B-set $C$ for player 1, we now construct a strategy $\sigma$ adapted to $C$ as follows. At each positive stage $t+1$, player 1 considers the current average payoff $\bar g_t$. If $\bar g_t\in C$, or if $t=0$, $\sigma$ plays arbitrarily at stage $t+1$. Otherwise, $\sigma$ plays at stage $t+1$ a mixed action $x$ satisfying the previous definition for $z=\bar g_t$.

Theorem 2  If $C$ is a B-set for player 1, a strategy $\sigma$ adapted to $C$ satisfies:
$$\forall\tau\in\mathcal T,\ \forall t\ge1,\qquad E_{\sigma,\tau}(d_t)\le\frac{2M}{\sqrt t}\quad\text{and}\quad d_t\xrightarrow[t\to\infty]{}0\ \ P_{\sigma,\tau}\text{-a.s.}$$

As an illustration, in dimension 1 and for $C=\{0\}$, this theorem implies that a bounded sequence $(x_t)_t$ of reals,
such that the product $x_{T+1}\cdot\big(\frac1T\sum_{t=1}^{T}x_t\big)$ is non-positive for each $T$, Cesàro-converges to zero.

Proof  Assume that player 1 plays a strategy $\sigma$ adapted to $C$, whereas player 2 plays some strategy $\tau$. Fix $t\ge1$, and assume that $\bar g_t\notin C$. Consider $z'\in C$ and $x\in\Delta(I)$ satisfying (i) and (ii) of Definition 4 for $z=\bar g_t$. We have:
$$d_{t+1}^2 = d(\bar g_{t+1},C)^2 \le \|\bar g_{t+1}-z'\|^2 = \left\|\frac{1}{t+1}\sum_{l=1}^{t+1}g_l - z'\right\|^2 = \left\|\frac{1}{t+1}(g_{t+1}-z') + \frac{t}{t+1}(\bar g_t-z')\right\|^2$$
$$= \frac{1}{(t+1)^2}\|g_{t+1}-z'\|^2 + \frac{t^2}{(t+1)^2}\,d_t^2 + \frac{2t}{(t+1)^2}\,\big\langle g_{t+1}-z',\ \bar g_t-z'\big\rangle.$$
By hypothesis, the expectation, given the first $t$ action profiles $h_t\in(I\times J)^t$, of the above scalar product is non-positive, so $E(d_{t+1}^2\mid h_t) \le (t/(t+1))^2\,d_t^2 + (1/(t+1))^2\,E(\|g_{t+1}-z'\|^2\mid h_t)$. Since $E(\|g_{t+1}-z'\|^2\mid h_t)\le(2M)^2$, we have:
$$E(d_{t+1}^2\mid h_t) \le \left(\frac{t}{t+1}\right)^2 d_t^2 + \left(\frac{1}{t+1}\right)^2 4M^2. \tag{2}$$
Taking the expectation, we get, whether $\bar g_t\in C$ or not: $\forall t\ge1$, $E(d_{t+1}^2)\le(t/(t+1))^2\,E(d_t^2) + (1/(t+1))^2\,4M^2$. By induction, we obtain that for each $t\ge1$, $E(d_t^2)\le(4M^2)/t$, and $E(d_t)\le(2M)/\sqrt t$. Put now, as in Sorin 2002 [85], $e_t = d_t^2 + \sum_{t'>t}\frac{4M^2}{t'^2}$. Inequality (2) gives $E(e_{t+1}\mid h_t)\le e_t$, so $(e_t)_t$ is a non-negative supermartingale whose expectation goes to zero. By a standard probability result (see, e.g., Meyer 1966 [58]), we obtain $e_t\to_{t\to\infty}0$ $P_{\sigma,\tau}$-a.s., and finally $d_t\to_{t\to\infty}0$ $P_{\sigma,\tau}$-a.s. $\square$

This theorem implies that any B-set for player 1 is approachable by this player. The converse is true for convex sets.

Theorem 3  Let $C$ be a closed convex subset of $R^K$. The following assertions are equivalent:
(i) $C$ is a B-set for player 1;
(ii) $\forall y\in\Delta(J)$, $Gy\cap C\ne\emptyset$;
(iii) $C$ is approachable by player 1;
(iv) $\forall q\in R^K$, $\displaystyle\max_{x\in\Delta(I)}\ \min_{y\in\Delta(J)}\ \sum_{k\in K}q^k G^k(x,y) \ \ge\ \inf_{c\in C}\langle q,c\rangle$.

Proof  The implication (i) $\Longrightarrow$ (iii) comes from Theorem 2. Proof of (iii) $\Longrightarrow$ (ii): assume there exists $y\in\Delta(J)$ such that $Gy\cap C=\emptyset$. Since $Gy$ is approachable by player 2, $C$ is excludable by player 2 and thus $C$ is not approachable by player 1. Proof of (ii) $\Longrightarrow$ (i): assume that $Gy\cap C\ne\emptyset$ for every $y\in\Delta(J)$. Consider $z\notin C$ and define $z'$ as its projection onto $C$. Consider the matrix game where payoffs are projected onto the direction $z'-z$, i.e. the matrix game $\sum_{k\in K}(z'^k-z^k)G^k$. By assumption, one has: $\forall y\in\Delta(J)$, $\exists x\in\Delta(I)$ such that $G(x,y)\in C$, hence such that:
$$\langle z'-z,\ G(x,y)\rangle \ \ge\ \min_{c\in C}\langle z'-z,\ c\rangle \ \ge\ \langle z'-z,\ z'\rangle.$$
So $\min_{y\in\Delta(J)}\max_{x\in\Delta(I)}\langle z'-z, G(x,y)\rangle \ge \langle z'-z, z'\rangle$. By the minmax theorem, there exists $x$ in $\Delta(I)$ such that $\forall y\in\Delta(J)$, $\langle z'-z, G(x,y)\rangle \ge \langle z'-z, z'\rangle$, that is $\langle z'-z,\ z'-G(x,y)\rangle\le0$: the hyperplane containing $z'$ and orthogonal to $[z,z']$ separates $z$ from $xG$. Finally, (iv) means that any half-space containing $C$ is approachable by player 1; (iii) $\Longrightarrow$ (iv) is thus clear, and (iv) $\Longrightarrow$ (i) is similar to (ii) $\Longrightarrow$ (i). $\square$

Up to minor formulation differences, Theorems 2 and 3 are due to Blackwell [8]. More recently, X. Spinat [86] has proved the following characterization.

Theorem 4  A closed set is approachable by player 1 if and only if it contains a B-set for player 1.

As a consequence, it shows that adding the condition "$d_t\to_{t\to\infty}0$ $P_{\sigma,\tau}$-a.s." in the definition of approachability does not modify the notion.

Approachability for Player 1 Versus Excludability for Player 2

As a corollary of Theorem 3, we obtain that a closed convex set in $R^K$ is either approachable by player 1, or excludable by player 2. One can show that when $K$ is a singleton, then any set is either approachable by player 1, or excludable by player 2. A simple example of a set which is neither approachable by player 1 nor excludable by player 2 is given in dimension 2 by:
$$G = \begin{pmatrix}(0,0) & (0,0)\\ (1,0) & (1,1)\end{pmatrix}\qquad\text{and}\qquad C = \{(1/2,v);\ 0\le v\le 1/4\}\ \cup\ \{(1,v);\ 1/4\le v\le 1\}$$
(see [85]).
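Since any set $xG$ is a B-set for player 1, the strategy adapted to it is simply "always play $x$". The toy simulation below, with hypothetical $2\times2$ payoff matrices for $K=\{1,2\}$, an arbitrary opponent, and a fixed RNG seed, checks that the average vector payoff gets close to the segment $xG$, in line with the $2M/\sqrt t$ bound of Theorem 2.

```python
import random

# Toy Blackwell approachability: player 1 plays the fixed mixed action x,
# so the average payoff approaches xG = conv{G(x, j) : j in J}.
# Payoff matrices are hypothetical; K = {1,2}, I = J = {0,1}.
G = {1: [[1.0, 0.0], [0.0, 0.0]], 2: [[0.0, 0.0], [0.0, 1.0]]}
x = (0.5, 0.5)

def payoff_vec(i, j):
    return (G[1][i][j], G[2][i][j])

def seg_dist(pt, a, b):
    """Euclidean distance from pt to the segment [a, b] in R^2."""
    ax, ay = a; bx, by = b; px, py = pt
    dx, dy = bx - ax, by - ay
    denom = dx * dx + dy * dy
    t = 0.0 if denom == 0 else max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / denom))
    cx, cy = ax + t * dx, ay + t * dy
    return ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5

# Endpoints of the target segment xG: the payoffs G(x, j) for the two columns.
ends = [tuple(sum(x[i] * payoff_vec(i, j)[k] for i in range(2)) for k in range(2))
        for j in range(2)]

rng = random.Random(0)
avg, T = [0.0, 0.0], 20000
for t in range(1, T + 1):
    i = 0 if rng.random() < x[0] else 1   # player 1 draws from x
    j = rng.randrange(2)                  # arbitrary opponent behavior
    g = payoff_vec(i, j)
    avg = [a + (g[k] - a) / t for k, a in enumerate(avg)]

d = seg_dist(tuple(avg), ends[0], ends[1])
assert d < 0.05   # consistent with E(d_T) <= 2M / sqrt(T)
```

The distance goes to zero by the law of large numbers here; the force of Theorem 2 is that the same rate holds for any B-set and any opponent, with $x$ recomputed from the current average payoff at each stage.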
Weak Approachability

One can weaken the definition of approachability by giving up time uniformity.

Definition 5  $C$ is weakly approachable by player 1 if: $\forall\varepsilon>0$, $\exists T$, $\forall t\ge T$, $\exists\sigma\in\Sigma$, $\forall\tau\in\mathcal T$: $E_{\sigma,\tau}(d_t)\le\varepsilon$. $C$ is weakly excludable by player 1 if there exists $\delta>0$ such that $\{z\in R^K;\ d(z,C)\ge\delta\}$ is weakly approachable by player 1.

N. Vieille [87] has proved, via the consideration of certain differential games:

Theorem 5  A subset of $R^K$ is either weakly approachable by player 1 or weakly excludable by player 2.

Back to the Standard Model

Let us come back to Aumann and Maschler's model with a finite family of matrices $(G^k)_{k\in K}$, and an initial probability $p$ on $\Delta(K)$. By Theorem 1, the repeated game $\Gamma(p)$ has a uniform value which is $\mathrm{cav}\,u(p)$, and Blackwell approachability will allow for the construction of an explicit optimal strategy for the uninformed player. Considering a hyperplane which is tangent to $\mathrm{cav}\,u$ at $p$, we can find a vector $l$ in $R^K$ such that $\langle l,p\rangle = \mathrm{cav}\,u(p)$ and
$$\forall q\in\Delta(K),\qquad \langle l,q\rangle \ \ge\ \mathrm{cav}\,u(q)\ \ge\ u(q).$$
Define now the orthant $C = \{z\in R^K;\ z^k\le l^k\ \forall k\in K\}$. Recall that player 2 does not know the selected state, and an optimal strategy for him cannot depend on player 1's strategy, and consequently not on a martingale of a posteriori. He will play in a way such that player 1's long-term payoff is, simultaneously for each $k$ in $K$, not greater than $l^k$ if the state is $k$. Fix $q=(q^k)_k$ in $R^K$. If there exists $k$ with $q^k>0$, we clearly have $\inf_{c\in C}\langle q,c\rangle = -\infty \le \max_{y\in\Delta(J)}\min_{x\in\Delta(I)}\sum_{k\in K}q^kG^k(x,y)$. Assume now that $q^k\le0$ for each $k$, with $q\ne0$, and write $s = \sum_k(-q^k)$. Then
$$\inf_{c\in C}\langle q,c\rangle = \sum_{k\in K}q^k l^k = -s\left\langle l,\frac{-q}{s}\right\rangle \le -s\,u\!\left(\frac{-q}{s}\right) = -s\ \max_{x\in\Delta(I)}\ \min_{y\in\Delta(J)}\ \sum_{k\in K}\frac{-q^k}{s}\,G^k(x,y) = \max_{y\in\Delta(J)}\ \min_{x\in\Delta(I)}\ \sum_{k\in K}q^kG^k(x,y),$$
where the last equality uses the minmax theorem for matrix games. This is condition (iv) of Theorem 3, adapted to player 2. So $C$ is a B-set for player 2, and a strategy $\tau$ adapted to $C$ satisfies, by Theorem 2: $\forall\sigma\in\Sigma$, $\forall k\in K$,
$$E_{\sigma,\tau}\left(\frac1T\sum_{t=1}^{T}G^k(\tilde i_t,\tilde j_t)\right) - l^k \ \le\ E_{\sigma,\tau}\left(d\!\left(\frac1T\sum_{t=1}^{T}G(\tilde i_t,\tilde j_t),\ C\right)\right) \ \le\ \frac{2M}{\sqrt T}$$
(where $M$ is here an upper bound for the Euclidean norms of the vectors $(G^k(i,j))_{k\in K}$, with $i\in I$ and $j\in J$). And this holds as well for any strategy $\sigma$ of player 1 in the repeated game with incomplete information. So for any such strategy $\sigma$,
$$\gamma_T^{1,p}(\sigma,\tau) = \sum_{k\in K}p^k\,E_{\sigma,\tau}\left(\frac1T\sum_{t=1}^{T}G^k(\tilde i_t,\tilde j_t)\right) \ \le\ \langle p,l\rangle + \frac{2M}{\sqrt T} = \mathrm{cav}\,u(p) + \frac{2M}{\sqrt T}.$$
As shown by Kohlberg [35], the approachability strategy $\tau$ is thus an optimal strategy for player 2 in the repeated game $\Gamma(p)$.

Zero-Sum Games with Lack of Information on Both Sides

The following model has also been introduced by Aumann and Maschler [6]. We are still in the context of zero-sum repeated games with observable actions, but it is no longer assumed that one of the players is fully informed. The set of states is here a product $K\times L$ of finite sets, and we have a family of matrices $(G^{k,l})_{(k,l)\in K\times L}$ with size $I\times J$, as well as initial probabilities $p$ on $K$ and $q$ on $L$. In the game $\Gamma(p,q)$, a state of nature $(k,l)$ is first selected according to the product probability $p\otimes q$; then $k$, resp. $l$, is announced to player 1, resp. player 2, only. Then the matrix game $G^{k,l}$ is repeated over and over: at every stage, simultaneously player 1 chooses a row $i$ in $I$, whereas player 2 chooses a column $j$ in $J$; the stage payoff for player 1 is $G^{k,l}(i,j)$, but only $i$ and $j$ are publicly announced before proceeding to the next stage. The average payoff for player 1 in the $T$-stage game is written $\gamma_T^{1,p,q}(\sigma^1,\sigma^2) = E^{p,q}_{\sigma^1,\sigma^2}\big(\frac1T\sum_{t=1}^{T}G^{\tilde k,\tilde l}(\tilde i_t,\tilde j_t)\big)$, and the $T$-stage value is written $v_T(p,q)$. Similarly, the $\lambda$-discounted value of the game will be written $v_\lambda(p,q)$. The non-revealing game now corresponds to the case where player 1 plays independently of $k$ and player 2 plays independently of $l$. Its value is denoted by:
$$u(p,q) = \max_{x\in\Delta(I)}\ \min_{y\in\Delta(J)}\ \sum_{k,l}p^kq^l\,G^{k,l}(x,y). \tag{3}$$
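Several steps above invoke the minmax theorem for matrix games. For $2\times2$ games the value can be checked in closed form (pure saddle point test, otherwise the fully mixed formula); the matrices below are illustrative, and the nonrevealing game used as a check is the classical example $G^1=\begin{pmatrix}1&0\\0&0\end{pmatrix}$, $G^2=\begin{pmatrix}0&0\\0&1\end{pmatrix}$, whose value is $p(1-p)$.

```python
def value_2x2(A):
    """Value of the 2x2 zero-sum matrix game A (row player maximizes)."""
    (a, b), (c, d) = A
    maximin = max(min(a, b), min(c, d))   # best pure guarantee for the row player
    minimax = min(max(a, c), max(b, d))   # best pure guarantee for the column player
    if maximin == minimax:                # pure saddle point
        return maximin
    return (a * d - b * c) / (a + d - b - c)  # both players fully mixed

# Nonrevealing game A(p) = p*G1 + (1-p)*G2 of the classical example:
p = 0.3
Ap = [[p, 0.0], [0.0, 1 - p]]
assert abs(value_2x2(Ap) - p * (1 - p)) < 1e-12
```

The saddle point test matters: the mixed formula is only valid when neither player has an optimal pure strategy, which is exactly the case maximin < minimax.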
Given a continuous function $f:\Delta(K)\times\Delta(L)\to R$, we denote by $\mathrm{cav}_I f$ the concavification of $f$ with respect to the first variable: for each $(p,q)$ in $\Delta(K)\times\Delta(L)$, $\mathrm{cav}_I f(p,q)$ is the value at $p$ of the smallest concave function from $\Delta(K)$ to $R$ which is above $f(\cdot,q)$. Similarly, we denote by $\mathrm{vex}_{II} f$ the convexification of $f$ with respect to the second variable. It can be shown that $\mathrm{cav}_I f$ and $\mathrm{vex}_{II} f$ are continuous, so we can compose $\mathrm{cav}_I\mathrm{vex}_{II} f$ and $\mathrm{vex}_{II}\mathrm{cav}_I f$. These functions are both concave in the first variable and convex in the second variable, and they satisfy $\mathrm{cav}_I\mathrm{vex}_{II} f(p,q) \le \mathrm{vex}_{II}\mathrm{cav}_I f(p,q)$.
Maxmin and Minmax of the Repeated Game

Theorem 1 generalizes as follows.

Theorem 6 ([6])  In the repeated game $\Gamma(p,q)$, the greatest quantity which can be guaranteed by player 1 is $\mathrm{cav}_I\mathrm{vex}_{II}u(p,q)$, and the smallest quantity which can be guaranteed by player 2 is $\mathrm{vex}_{II}\mathrm{cav}_I u(p,q)$.

Aumann, Maschler and Stearns also showed that $\mathrm{cav}_I\mathrm{vex}_{II}u(p,q)$ can be defended by player 2, uniformly in time, i.e. that $\forall\varepsilon>0$, $\forall\sigma^1$, $\exists T_0$, $\exists\sigma^2$, $\forall T\ge T_0$: $\gamma_T^{1,p,q}(\sigma^1,\sigma^2)\le\mathrm{cav}_I\mathrm{vex}_{II}u(p,q)+\varepsilon$. Similarly, $\mathrm{vex}_{II}\mathrm{cav}_I u(p,q)$ can be defended by player 1. The proof uses the martingales of a posteriori of each player, and a useful notion is that of the informational content of a strategy: for a strategy $\sigma^1$ of the first player, it is defined as $I(\sigma^1) = \sup_{\sigma^2}E^{p,q}_{\sigma^1,\sigma^2}\big(\sum_{t=0}^{\infty}\sum_{k\in K}(p^k_{t+1}(\sigma^1)-p^k_t(\sigma^1))^2\big)$, where $p_t(\sigma^1)$ is the a posteriori on $K$ of player 2 after stage $t$ given that player 1 uses $\sigma^1$. By linearity of the expectation, the supremum can be restricted to strategies of player 2 which are both pure and independent of $l$. Theorem 6 implies that $\mathrm{cav}_I\mathrm{vex}_{II}u(p,q) = \sup_{\sigma^1\in\Sigma^1}\liminf_T\inf_{\sigma^2\in\Sigma^2}\gamma_T^{1,p,q}(\sigma^1,\sigma^2)$, and $\mathrm{cav}_I\mathrm{vex}_{II}u(p,q)$ is called the maxmin of the repeated game $\Gamma(p,q)$. Similarly, $\mathrm{vex}_{II}\mathrm{cav}_I u(p,q) = \inf_{\sigma^2\in\Sigma^2}\limsup_T\sup_{\sigma^1\in\Sigma^1}\gamma_T^{1,p,q}(\sigma^1,\sigma^2)$ is called the minmax of $\Gamma(p,q)$. As a corollary, we obtain that the repeated game $\Gamma(p,q)$ has a uniform value if and only if $\mathrm{cav}_I\mathrm{vex}_{II}u(p,q) = \mathrm{vex}_{II}\mathrm{cav}_I u(p,q)$. This is not always the case, and there exist counter-examples to the existence of the uniform value.

Example 4  $K=\{a,a'\}$ and $L=\{b,b'\}$, with $p$ and $q$ uniform.
$$G^{a,b} = \begin{pmatrix}0&0&0&0\\1&1&1&1\end{pmatrix},\quad G^{a,b'} = \begin{pmatrix}1&1&1&1\\0&0&0&0\end{pmatrix},\quad G^{a',b} = \begin{pmatrix}0&0&0&1\\1&1&1&0\end{pmatrix},\quad G^{a',b'} = \begin{pmatrix}0&0&0&0\\1&1&1&1\end{pmatrix}.$$
Mertens and Zamir [52] have shown that here, $\mathrm{cav}_I\mathrm{vex}_{II}u(p,q) = -1/4 < 0 = \mathrm{vex}_{II}\mathrm{cav}_I u(p,q)$.

Limit Values

It is easy to see that for each $T$ and $\lambda$, the value functions $v_T$ and $v_\lambda$ are concave in the first variable, and convex in the second variable. They are all Lipschitz functions, with the same constant $M = \max_{i,j,k,l}|G^{k,l}(i,j)|$, and here also, recursive formulas can be given. In the following result, $v_T$ and $v_\lambda$ are viewed as elements of the set $\mathcal C$ of continuous mappings from $\Delta(K)\times\Delta(L)$ to $R$.

Theorem 7 (Mertens and Zamir [52])  $(v_T)_T$, as $T$ goes to infinity, and $(v_\lambda)_\lambda$, as $\lambda$ goes to zero, both uniformly converge to the unique solution $f$ of the following system:
$$f = \mathrm{vex}_{II}\max\{u,f\},\qquad f = \mathrm{cav}_I\min\{u,f\}.$$
Besides, the above system can also be fruitfully studied without reference to repeated games (see [39,40,55,81]). For a proof of Theorem 7, one can also see Zamir [91] or Sorin [85]. Mertens and Zamir notably consider responses of a player, to a given strategy of his opponent, which are of the following type: play non-revealing up to a particular stopping time, and then start using the information by playing optimally in the remaining subgame.

Remark  Let $\mathcal U$ be the set of all non-revealing value functions, i.e. of functions from $\Delta(K)\times\Delta(L)$ to $R$ satisfying Eq. (3) for some family of matrices $(G^{k,l})_{k,l}$. One can easily show that any mapping in $\mathcal C$ is a uniform limit of elements of $\mathcal U$.
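The inequality $\mathrm{cav}_I\mathrm{vex}_{II}u \le \mathrm{vex}_{II}\mathrm{cav}_I u$ can be verified numerically on a grid: concavify in $p$ column by column, convexify in $q$ row by row, using $\mathrm{vex}\,f = -\mathrm{cav}(-f)$. The sketch below uses a hypothetical smooth $u(p,q)$ and an arbitrary grid size, not the $u$ of Example 4.

```python
# Grid check of cav_I vex_II u <= vex_II cav_I u for a hypothetical u(p,q).

def cav_1d(xs, us):
    """Least concave majorant on a 1-D grid (upper convex hull sweep)."""
    hull = []
    for x, u in zip(xs, us):
        while len(hull) >= 2:
            (x1, u1), (x2, u2) = hull[-2], hull[-1]
            if (u2 - u1) * (x - x2) <= (u - u2) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append((x, u))
    out, j = [], 0
    for x, _ in zip(xs, us):
        while j + 1 < len(hull) and hull[j + 1][0] < x:
            j += 1
        (x1, u1), (x2, u2) = hull[j], hull[min(j + 1, len(hull) - 1)]
        out.append(u1 if x2 == x1 else u1 + (u2 - u1) * (x - x1) / (x2 - x1))
    return out

def vex_1d(xs, us):
    """Greatest convex minorant, via vex f = -cav(-f)."""
    return [-v for v in cav_1d(xs, [-u for u in us])]

n = 41
grid = [i / (n - 1) for i in range(n)]
u = [[(p - q) * (1 - p - q) for q in grid] for p in grid]  # u[p_index][q_index]

def cav_I(mat):   # concavify in p, column by column (q fixed)
    cols = [cav_1d(grid, [row[j] for row in mat]) for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]

def vex_II(mat):  # convexify in q, row by row (p fixed)
    return [vex_1d(grid, row) for row in mat]

cv = cav_I(vex_II(u))
vc = vex_II(cav_I(u))
assert all(cv[i][j] <= vc[i][j] + 1e-9 for i in range(n) for j in range(n))
```

The inequality survives discretization because grid concavification is monotone and preserves convexity in the other variable, mirroring the continuous argument.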
Correlated Initial Information

A more general model can be written, where it is no longer assumed that the initial pieces of information of the players are independent. The set of states is now denoted by $R$ (instead of $K\times L$); initially a state $r$ in $R$ is chosen according to a known probability $p=(p_r)_{r\in R}$, and each player receives a deterministic signal depending on $r$. Equivalently, each player $i$ has a partition $\mathcal R^i$ of $R$ and observes the element of his partition which contains the selected state. After the first stage, player 1 will have played an action $x=(x_r)_{r\in R}$ which is measurable with respect to $\mathcal R^1$, i.e. $(r\mapsto x_r)$ is constant on each atom of $\mathcal R^1$. After having observed player 1's action at the first stage, the conditional
probability on $R$ necessarily belongs to the set:
$$\Pi_I(p) = \left\{(\alpha_r p_r)_{r\in R}\ ;\ \alpha_r\ge0\ \forall r\in R,\ \sum_{r\in R}\alpha_r p_r = 1,\ \text{and}\ (\alpha_r)_r\ \text{is}\ \mathcal R^1\text{-measurable}\right\}.$$
$\Pi_I(p)$ contains $p$, and is a convex compact subset of $\Delta(R)$. A mapping $f$ from $\Delta(R)$ to $R$ is now said to be I-concave if for each $p$ in $\Delta(R)$, the restriction of $f$ to $\Pi_I(p)$ is concave. And given $g:\Delta(R)\to R$ which is bounded from above, we define the concavification $\mathrm{cav}_I g$ as the smallest function above $g$ which is I-concave. Similarly one can define the set $\Pi_{II}(p)$ and the notions of II-convexity and II-convexification. With these generalized definitions, the results of Theorems 6 and 7 perfectly extend [52].

Non Zero-Sum Games with Lack of Information on One Side

We now consider the generalization of the standard model of Sect. "The Standard Model of Aumann and Maschler" to the non-zero-sum case. Hence two players infinitely repeat the same bimatrix game, with player 1 only knowing the bimatrix. Formally, we have a finite set of states $K$, an initial probability $p$ on $K$, and families of $I\times J$ payoff matrices $(A^k)_{k\in K}$ and $(B^k)_{k\in K}$. Initially, a state $k$ in $K$ is selected according to $p$, and announced to player 1 only. Then the bimatrix game $(A^k,B^k)$ is repeated over and over: at every stage, simultaneously player 1 chooses a row $i$ in $I$, whereas player 2 chooses a column $j$ in $J$; the stage payoff for player 1 is then $A^k(i,j)$, the stage payoff for player 2 is $B^k(i,j)$, but only $i$ and $j$ are publicly announced before proceeding to the next stage. Without loss of generality, we assume that $p^k>0$ for each $k$, and that each player has at least two actions. Given a strategy pair $(\sigma^1,\sigma^2)$, it is here convenient to denote the expected payoffs up to stage $T$ by:
$$\alpha_T^p(\sigma^1,\sigma^2) = E_{p,\sigma^1,\sigma^2}\left(\frac1T\sum_{t=1}^{T}A^{\tilde k}(\tilde i_t,\tilde j_t)\right) = \sum_{k\in K}p^k\,\alpha_T^k(\sigma^1,\sigma^2),$$
$$\beta_T^p(\sigma^1,\sigma^2) = E_{p,\sigma^1,\sigma^2}\left(\frac1T\sum_{t=1}^{T}B^{\tilde k}(\tilde i_t,\tilde j_t)\right) = \sum_{k\in K}p^k\,\beta_T^k(\sigma^1,\sigma^2).$$
Given a probability $q$ on $K$, we write $A(q)=\sum_k q^kA^k$, $B(q)=\sum_k q^kB^k$, $u(q)=\max_{x\in\Delta(I)}\min_{y\in\Delta(J)}A(q)(x,y)$ and $v(q)=\max_{y\in\Delta(J)}\min_{x\in\Delta(I)}B(q)(x,y)$. If $\pi=(\pi(i,j))_{(i,j)\in I\times J}\in\Delta(I\times J)$, we put $A(q)(\pi)=\sum_{(i,j)\in I\times J}\pi(i,j)A(q)(i,j)$ and similarly $B(q)(\pi)=\sum_{(i,j)\in I\times J}\pi(i,j)B(q)(i,j)$.

Existence of Equilibria

The question of the existence of an equilibrium remained unsolved for long. Sorin [79] proved the existence of an equilibrium for two states of nature, and the general case has been solved by Simon et al. [76]. Exactly as in the zero-sum case, a strategy pair $\sigma$ induces a sequence of a posteriori $(p_t(\sigma))_{t\ge0}$ which is a $P_{p,\sigma}$-martingale with values in $\Delta(K)$. We will concentrate on the cases where this martingale moves only once.

Definition 6  A joint plan is a triple $(S,\lambda,\pi)$, where: $S$ is a finite non-empty set (of messages); $\lambda=(\lambda^k)_{k\in K}$ (signaling strategy), with $\lambda^k\in\Delta(S)$ for each $k$ and $\lambda_s :=_{\mathrm{def}} \sum_{k\in K}p^k\lambda^k_s > 0$ for each $s$; and $\pi=(\pi_s)_{s\in S}$ (contract), with $\pi_s\in\Delta(I\times J)$ for each $s$.

The idea is due to Aumann, Maschler and Stearns. Player 1 observes $k$, then chooses $s\in S$ according to $\lambda^k$ and announces $s$ to player 2. Then the players play pure actions corresponding to the frequencies $\pi_s(i,j)$, for $i$ in $I$ and $j$ in $J$. Given a joint plan $(S,\lambda,\pi)$, we define:
$\forall s\in S$, $p_s=(p_s^k)_{k\in K}\in\Delta(K)$, with $p_s^k = \frac{p^k\lambda^k_s}{\lambda_s}$ for each $k$; $p_s$ is the a posteriori on $K$ given $s$.
$\varphi=(\varphi^k)_{k\in K}\in R^K$, with for each $k$, $\varphi^k=\max_{s\in S}A^k(\pi_s)$.
$\forall s\in S$, $\psi_s = B(p_s)(\pi_s)$, and $\psi = \sum_{k\in K}p^k\sum_{s\in S}\lambda^k_sB^k(\pi_s) = \sum_{s\in S}\lambda_s\psi_s$.

Definition 7  A joint plan $(S,\lambda,\pi)$ is an equilibrium joint plan if: (i) $\forall s\in S$, $\psi_s\ge\mathrm{vex}\,v(p_s)$; (ii) $\forall k\in K$, $\forall s\in S$ s.t. $p_s^k>0$, $A^k(\pi_s)=\varphi^k$; and (iii) $\forall q\in\Delta(K)$, $\langle\varphi,q\rangle\ge u(q)$.

Condition (ii) can be seen as an incentive condition for player 1 to choose $s$ according to $\lambda^k$. Given an equilibrium joint plan $(S,\lambda,\pi)$, one defines a strategy pair $(\sigma^1,\sigma^2)$ adapted to it. For each message $s$, first fix a sequence $(i_t^s,j_t^s)_{t\ge1}$ of elements of $I\times J$ such that for each $(i,j)$, the empirical frequencies converge to the corresponding probability: $\frac1T|\{t;\ 1\le t\le T,\ (i_t^s,j_t^s)=(i,j)\}| \to_{T\to\infty} \pi_s(i,j)$. We also fix an injective mapping $f$ from $S$ to $I^l$, where $l$ is large enough, corresponding to a code between the players to announce an element of $S$. $\sigma^1$ is precisely
8s 2 S; p s D (p sk ) k2K 2 (K), with p sk D s s for each k. ps is the a posteriori on K given s. ' D (' k ) k2K 2 RK , with for each k, ' k D maxs2S Ak (s ). P P k k 8s 2 S; s D B(p s )(s ) and D k2K p s2S s P k B (s ) D s2S s s . Definition 7 A joint plan (S; ; ) is an equilibrium joint plan if: (i) 8s 2 S; s vexv(p s ), (ii) 8k 2 K; 8s 2 S s.t. p sk > 0; Ak (s ) D ' k , and (iii) 8q 2 (K); h'; qi u(q). Condition (ii) can be seen as an incentive condition for player 1 to choose s according to k . Given an equilibrium joint plan (S; ; ), one define a strategy pair ( 1 ; 2 ) adapted to it. For each message s, first fix a sequence (i st ; jst ) t1 of elements in I J such that for each (i; j), the empirical frequencies converge to the corresponding probability: T1 jft; 1 t T; (i st ; jst ) D (i; j)gj !T!1 s (i; j). We also fix an injective mapping f from S to I l , where l is large enough, corresponding to a code between the players to announce an element in S. 1 is precisely
defined as follows. Player 1 observes the selected state k, then chooses s according to k , and announces s to player 2 by playing f (s) at the first l stages. Finally, 1 plays i st at each stage t > l as long as player 2 plays jst . If at some stage t > l player 2 does not play jst then player 1 punishes his opponent by playing an optimal strategy in the zero-sum game with initial probability ps and payoffs for player 1 given by (B k ) k2K . We now define 2 . Player 2 arbitrarily plays at the beginning of the game, then compute at the end of stage l the message s sent by player 1. Next he plays at each stage t > l the action jst as long as player 1 plays i st . If at some stage t > l, player 1 does not play i st , or if the first l actions of player 1 correspond to no message, then player 2 plays a punishing strategy ¯ 2 such that: 8" > 0; 9T0 ; 8T T0 ; 8 1 2 ˙ 1 ; 8k 2 K; ˛Tk ( 1 ; ¯ 2 ) ' k C ". Such a strategy ¯ 2 exists because of condition (iii): it is an approachability strategy for player 2 of the orthant fx 2 RK ; 8k 2 K x k ' k g (see Subsect. “Back to the Standard Model”). Lemma 7 ([79]) A strategy pair adapted to an equilibrium joint plan is a uniform equilibrium of the repeated game. Proof The payoffs induced by ( 1 ; 2 ) can be easP k k ily computed: 8k; ˛Tk ( 1 ; 2 ) !T!1 s2S s A p 1 k 2 (s ) D ' because of (ii), and ˇ T ( ; ) !T!1 P P k k k . Assume that player 2 k2K p s2S s B (s ) D 2 plays . The existence of ¯ 2 implies that no detectable deviation of player 1 is profitable, so if the state is k, player 1 will gain no more than maxs 0 2S Ak (s 0 ). But this is just ' k . The proof can be made uniform in 1 and we obtain: 8" > 0 9T0 8T T0 ; 8k 2 K; 8 1 2 ˙ 1 ; ˛Tk ( 1 ; 2 ) ' k C ". Finally assume that player 1 plays 1 . Condition (i) implies that if player 2 uses 2 , the payoff of this player will be at least vexv(p s ) if the message is s. 
Since vex v(p s )(D cav(v(p s ))) is the value, from the point of view of player 2 with payoffs (B k ) k , of the zero-sum game with initial probability ps , player 2 fears the punition by player 1, and 8" > 0; 9T0 ; 8T T0 ; 8 2 2 P p ˙ 2 ; ˇ T ( 1 ; 2 ) s2S s s C " D C ". To prove the existence of equilibria, we then look for equilibrium joint plans. The first idea is to consider, for each probability r on K, the set of payoff vectors ' compatible with r being an a posteriori. This leads to the consideration of the following correspondence (for each r; ˚ (r) is a subset of RK ): ˚ : (K) RK r 7! f(Ak ( )) k2K ;
where 2 (I J)
satisfies B(r)( ) vex v(r)g
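The bookkeeping of a joint plan is elementary: the total probability $\lambda_s$ of each message and the a posteriori $p_s$ satisfy the splitting identity $\sum_s\lambda_sp_s = p$. A minimal check with made-up signaling data (two states, two messages):

```python
# Joint-plan bookkeeping: posteriors p_s from a signaling strategy lambda.
# All numbers are hypothetical.
p = {"k1": 0.6, "k2": 0.4}
lam = {"k1": {"s1": 0.75, "s2": 0.25},   # lambda^k in Delta(S)
       "k2": {"s1": 0.25, "s2": 0.75}}

messages = ["s1", "s2"]
lam_s = {s: sum(p[k] * lam[k][s] for k in p) for s in messages}           # prob. of s
p_s = {s: {k: p[k] * lam[k][s] / lam_s[s] for k in p} for s in messages}  # posteriors

# Splitting identity: the posteriors average back to the prior.
for k in p:
    assert abs(sum(lam_s[s] * p_s[s][k] for s in messages) - p[k]) < 1e-12

print(round(p_s["s1"]["k1"], 4))  # -> 0.8182
```

Conversely, any finite family of posteriors whose convex hull contains $p$ can be generated by some signaling strategy $\lambda$, which is what the splitting lemma provides.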
Repeated Games with Incomplete Information, Figure 4  A Borsuk-Ulam type theorem by Simon, Spież and Toruńczyk
It is easy to see that the graph of $\Phi$, i.e. the set $\{(r,\varphi)\in\Delta(K)\times R^K;\ \varphi\in\Phi(r)\}$, is compact, that $\Phi$ has non-empty convex values, and that it satisfies: $\forall r\in\Delta(K)$, $\forall q\in\Delta(K)$, $\exists\varphi\in\Phi(r)$: $\langle\varphi,q\rangle\ge u(q)$. Assume now that one can find a finite family $(p_s)_{s\in S}$ of probabilities on $K$, as well as vectors $\varphi$ and, for each $s$, $\varphi_s$ in $R^K$ such that: 1) $p\in\mathrm{conv}\{p_s;\ s\in S\}$; 2) $\langle\varphi,q\rangle\ge u(q)$ $\forall q\in\Delta(K)$; 3) $\forall s\in S$, $\varphi_s\in\Phi(p_s)$; and 4) $\forall s\in S$, $\forall k\in K$, $\varphi_s^k\le\varphi^k$ with equality if $p_s^k>0$. It is then easy to construct an equilibrium joint plan. Thus we get interested in proving the following result.

Proposition 4  Let $p$ be in $\Delta(K)$, let $u:\Delta(K)\to R$ be a continuous mapping, and let $\Phi:\Delta(K)\rightrightarrows R^K$ be a correspondence with compact graph and non-empty convex values such that: $\forall r\in\Delta(K)$, $\forall q\in\Delta(K)$, $\exists\varphi\in\Phi(r)$: $\langle\varphi,q\rangle\ge u(q)$. Then there exists a finite family $(p_s)_{s\in S}$ of elements of $\Delta(K)$, as well as vectors $\varphi$ and, for each $s$, $\varphi_s$ in $R^K$, such that: $p\in\mathrm{conv}\{p_s;\ s\in S\}$; $\langle\varphi,q\rangle\ge u(q)$ $\forall q\in\Delta(K)$; $\forall s\in S$, $\varphi_s\in\Phi(p_s)$; and $\forall s\in S$, $\forall k\in K$, $\varphi_s^k\le\varphi^k$ with equality if $p_s^k>0$.
The proof of Proposition 4 relies, as explained in [62] or [75], on a fixed point theorem of Borsuk-Ulam type proved by Simon, Spież and Toruńczyk [76] via tools from algebraic topology. A simplified version of this fixed point theorem can be written as follows:

Theorem 8 ([76])  Let $C$ be a compact subset of an $n$-dimensional Euclidean space, $x\in C$, and $Y$ be a finite union of affine subspaces of dimension $n-1$ of a Euclidean space. Let $F$ be a correspondence from $C$ to $Y$ with compact graph and non-empty convex values. Then there exists $L\subset\partial C$ and $y\in Y$ such that: $\forall l\in L$, $y\in F(l)$, and $x\in\mathrm{conv}(L)$.

Notice that for $n=1$ (corresponding to 2 states of nature), the image by $F$ of the connected component of $C$ containing $x$ necessarily is a singleton, hence the result is clear. In the general case, one finally obtains:

Theorem 9 ([76])  There exists an equilibrium joint plan. Thus there exists a uniform equilibrium in the repeated game $\Gamma(p)$.

Characterization of Equilibrium Payoffs

Characterizing equilibrium payoffs, as the Folk theorem does for repeated games with complete information, has been a challenging problem. We denote here by $p_0$ the initial probability, assumed to lie in the interior of $\Delta(K)$. We are interested in the set of equilibrium payoffs, in the following convenient sense:

Definition 8  A vector $(a,\beta)$ in $R^K\times R$ is called an equilibrium payoff of the repeated game $\Gamma(p_0)$ if there exists a strategy pair $(\sigma^1,\sigma^2)$ satisfying: (i) $\forall\varepsilon>0$, $\exists T_0$, $\forall T\ge T_0$: $\forall k\in K$, $\forall\tau^1\in\Sigma^1$, $\alpha_T^k(\tau^1,\sigma^2)\le\alpha_T^k(\sigma^1,\sigma^2)+\varepsilon$; and $\forall\varepsilon>0$, $\exists T_0$, $\forall T\ge T_0$: $\forall\tau^2\in\Sigma^2$, $\beta_T^{p_0}(\sigma^1,\tau^2)\le\beta_T^{p_0}(\sigma^1,\sigma^2)+\varepsilon$; and (ii) $(\alpha_T^k(\sigma^1,\sigma^2))_T$ converges to $a^k$ for each $k$, and $(\beta_T^{p_0}(\sigma^1,\sigma^2))_T$ converges to $\beta$.

Since $p_0$ lies in the interior of $\Delta(K)$, the first line of (i) is equivalent to: $\forall\varepsilon>0$, $\exists T_0$, $\forall T\ge T_0$, $\forall\tau^1\in\Sigma^1$: $\alpha_T^{p_0}(\tau^1,\sigma^2)\le\alpha_T^{p_0}(\sigma^1,\sigma^2)+\varepsilon$. The strategy pair $(\sigma^1,\sigma^2)$ is thus a uniform equilibrium of the repeated game, with the additional requirement that expected average payoffs of player 1 converge in each state $k$. In some sense, player 1 is viewed here as $|K|$ different types, and we require the existence of the limit payoff of each type. We will only consider such uniform equilibria in the sequel. Notice that the above definition implies: $\forall k\in K$, $\forall\varepsilon>0$, $\exists T_0$, $\forall T\ge T_0$, $\forall\tau^1\in\Sigma^1$: $\alpha_T^k(\tau^1,\sigma^2)\le a^k+\varepsilon$. So the orthant $\{x\in R^K;\ x^k\le a^k\ \forall k\in K\}$ is approachable by player 2, and by Theorem 3 and Subsect. "Back to the Standard Model" one can obtain that:
$$\langle a,q\rangle\ \ge\ u(q)\qquad\forall q\in\Delta(K). \tag{4}$$
Condition (4) is called the individual rationality condition for player 1, and does not depend on the initial probability in the interior of $\Delta(K)$. Regarding player 2, we have: $\forall\varepsilon>0$, $\exists T_0$, $\forall T\ge T_0$, $\forall\tau^2\in\Sigma^2$: $\beta_T^{p_0}(\sigma^1,\tau^2)\le\beta+\varepsilon$, so by Theorem 1:
$$\beta\ \ge\ \mathrm{vex}\,v(p_0). \tag{5}$$
Condition (5) is the individual rationality condition for player 2: at equilibrium, this player should have at least the value of the game where player 1 plays in order to minimize player 2's payoffs. Imagine now that $\sigma^1$ is a non-revealing strategy for player 1, and that the players play actions with empirical frequencies corresponding to a given probability distribution $\pi=(\pi_{i,j})_{(i,j)\in I\times J}\in\Delta(I\times J)$. We will have: $\forall k\in K$, $a^k=\sum_{i,j}\pi_{i,j}A^k(i,j)$ and $\beta=\sum_k p_0^k\sum_{i,j}\pi_{i,j}B^k(i,j)$, and if the individual rationality conditions are satisfied, no detectable deviation of a player can be profitable. This leads to the definition of the following set, where $M$ is the constant $\max\{|A^k(i,j)|,|B^k(i,j)|;\ (i,j)\in I\times J,\ k\in K\}$, and $R_M=[-M,M]$.

Definition 9  Let $G$ be the set of triples $(a,\beta,p)\in R_M^K\times R_M\times\Delta(K)$ satisfying:
1. $\forall q\in\Delta(K)$, $\langle a,q\rangle\ge u(q)$;
2. $\beta\ge\mathrm{vex}\,v(p)$;
3. $\exists\pi\in\Delta(I\times J)$ s.t. $\beta=\sum_k p^k\sum_{i,j}\pi_{i,j}B^k(i,j)$ and $\forall k\in K$, $a^k\ge\sum_{i,j}\pi_{i,j}A^k(i,j)$, with equality if $p^k>0$.

We need to consider every possible initial probability because the main state variable of the model is, here also, the belief, or a posteriori, of player 2 on the state of nature. $\{(a,\beta);\ (a,\beta,p_0)\in G\}$ is the set of payoffs of non-revealing equilibria of $\Gamma(p_0)$. The importance of the following definition will appear with Theorem 10 below (which unfortunately has not led to a proof of existence of equilibrium payoffs).

Definition 10  $G^*$ is defined as the set of elements $g=(a,\beta,p)\in R_M^K\times R_M\times\Delta(K)$ such that there exist a probability space $(\Omega,\mathcal A,Q)$, an increasing sequence $(\mathcal F_n)_{n\ge1}$ of finite sub-$\sigma$-algebras of $\mathcal A$, and a sequence of random variables $(g_n)_{n\ge1}=(a_n,\beta_n,p_n)_{n\ge1}$ defined on $(\Omega,\mathcal A)$ with values in $R_M^K\times R_M\times\Delta(K)$, satisfying: (i) $g_1=g$ a.s.; (ii) $(g_n)_{n\ge1}$ is a martingale adapted to $(\mathcal F_n)_{n\ge1}$; (iii) $\forall n\ge1$, $a_{n+1}=a_n$ a.s. or $p_{n+1}=p_n$ a.s.; and (iv) $(g_n)_n$ converges a.s. to a random variable $g_\infty$ with values in $G$.

Let us forget for a while the component of player 2's payoff. A process $(g_n)_n$ satisfying (ii) and (iii) may be called a bimartingale: it is a martingale such that at every stage, one of the two components remains a.s. constant. So the set $G^*$ can be seen as the set of starting points of converging bimartingales with limit points in $G$.

Theorem 10 (Hart [29])  Let $(a,\beta)$ be in $R^K\times R$. Then $(a,\beta)$ is an equilibrium payoff of $\Gamma(p_0)$ if and only if $(a,\beta,p_0)\in G^*$.
Repeated Games with Incomplete Information
Theorem 10 is too elaborate to be proved here, but let us give a few ideas about the proof. First consider the implication ⟹, and fix an equilibrium σ = (σ₁, σ₂) of Γ∞(p₀) with payoff (α, β). The sequence of a posteriori (p_t(σ))_{t≥0} is a P_{p₀,σ}-martingale. Modify now slightly the time structure so that at each stage, player 1 plays first, and then player 2 plays without knowing the action chosen by player 1. At each half-stage where player 2 plays, his a posteriori remains constant. At each half-stage where player 1 plays, the "expectation of player 1's future payoff" (which can be properly defined) remains constant. Hence the heuristic appearance of the bimartingale. And since bounded martingales converge, for large stages everything will be fixed and the players will approximately play a non-revealing equilibrium at a "limit a posteriori", so the convergence will be towards elements of G.

Consider now the converse implication ⟸. Let (α, β) be such that (α, β, p₀) ∈ G*, and assume for simplification that the associated bimartingale (α_n, β_n, p_n) converges in a fixed number N of stages: ∀n ≥ N, (α_n, β_n, p_n) = (α_N, β_N, p_N) ∈ G. One can construct an equilibrium σ = (σ₁, σ₂) of Γ∞(p₀) with payoff (α, β) along the following lines. For each index n, (α_n, β_n) will be an equilibrium payoff of the repeated game with initial probability p_n. Eventually, player 1 will play independently of the state, the a posteriori of player 2 will be p_N, and the players will end up playing a non-revealing equilibrium of the repeated game Γ∞(p_N) with payoff (α_N, β_N). What should be played before? Since we are in an undiscounted setup, any finite number of stages can be used for communication without influencing payoffs. Let n < N be such that α_{n+1} = α_n. To move from (α_n, β_n, p_n) to (α_n, β_{n+1}, p_{n+1}), player 1 can simply use the splitting lemma (Lemma 1) in order to signal part of the state to player 2.
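The splitting lemma invoked here can be made concrete with a short computation. The sketch below is our own illustration (the numerical data are hypothetical, not taken from the article): it builds the signaling kernel x^k(s) = λ_s p_s^k / p^k and checks that signal s is received with probability λ_s and that the induced posterior is exactly the target p_s.

```python
from fractions import Fraction as F

def split(prior, targets, weights):
    """Signaling kernel for the splitting lemma: given p = sum_s weights[s] * targets[s],
    build x^k(s) = w_s * p_s^k / p^k, so that signal s occurs with probability w_s
    and the posterior belief after s is targets[s]."""
    S = len(weights)
    # kernel[k][s] = probability that the informed player of type k sends signal s
    return [[weights[s] * targets[s][k] / prior[k] for s in range(S)]
            for k in range(len(prior))]

# Hypothetical data: two states, prior p = (1/2, 1/2) split into the
# posteriors (3/4, 1/4) and (1/4, 3/4), each with weight 1/2.
p = [F(1, 2), F(1, 2)]
ps = [[F(3, 4), F(1, 4)], [F(1, 4), F(3, 4)]]
lam = [F(1, 2), F(1, 2)]

x = split(p, ps, lam)
for s in range(2):
    prob_s = sum(p[k] * x[k][s] for k in range(2))            # law of the signal
    posterior = [p[k] * x[k][s] / prob_s for k in range(2)]   # Bayes update
    assert prob_s == lam[s] and posterior == ps[s]
print("splitting verified")
```

Exact rational arithmetic makes the two defining properties of the splitting an equality rather than an approximation.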
Now let n < N be such that p_{n+1} = p_n, so that we want to move from (α_n, β_n, p_n) to (α_{n+1}, β_{n+1}, p_n). Player 1 will play independently of the state, and both players will act so as to convexify their future payoffs. This convexification is done through procedures called "jointly controlled lotteries", introduced in the sixties by Aumann and Maschler [6] with the following simple and brilliant idea. Imagine that the players have to decide with even probability whether to play the equilibrium E1 with payoff (α₁, β₁) or the equilibrium E2 with payoff (α₂, β₂). The players may not be indifferent between E1 and E2; e.g. player 1 may prefer E1 whereas player 2 prefers E2. They will proceed as follows, with i and i′ (respectively j and j′) denoting two distinct actions of player 1 (resp. player 2). Simultaneously and independently, player 1 will select i or i′ with probability 1/2, whereas player 2 will behave similarly with j and j′.
        j    j′
  i    E1   E2
  i′   E2   E1
Then the equilibrium E1 will be played if the diagonal has been reached, i.e. if (i, j) or (i′, j′) has been played, and otherwise the equilibrium E2 will be played. This procedure is robust to unilateral deviations: neither player can deviate and prevent E1 and E2 from being chosen with probability 1/2 each. In general, jointly controlled lotteries are procedures allowing the players to select an alternative among a finite set according to a given probability (think of binary expansions if necessary), in a way which is robust to deviations by a single player. S. Hart has precisely shown how to combine steps of signaling and jointly controlled lotteries to construct an equilibrium of Γ∞(p₀) with payoff (α, β).

Biconvexity and Bimartingales

The previous analysis has led to the introduction and study of biconvexity phenomena. The reference here is [4]. Let X and Y be compact convex subsets of Euclidean spaces, and let (Ω, F, P) be an atomless probability space.

Definition 11 A subset B of X×Y is biconvex if for every x in X and y in Y, the sections B_{x·} = {y′ ∈ Y : (x, y′) ∈ B} and B_{·y} = {x′ ∈ X : (x′, y) ∈ B} are convex. If B is biconvex, a mapping f : B → R is called biconvex if for each (x, y) ∈ X×Y, f(·, y) and f(x, ·) are convex.

As in the usual convexity case, we have that if f is biconvex, then for each α in R, the set {(x, y) ∈ B : f(x, y) ≤ α} is biconvex.

Definition 12 A sequence of random variables Z_n = (X_n, Y_n)_{n≥1} with values in X×Y is called a bimartingale if: (1) there exists an increasing sequence (F_n)_{n≥1} of finite sub-σ-algebras of F such that (Z_n)_n is an (F_n)_{n≥1}-martingale, (2) ∀n ≥ 1, X_n = X_{n+1} a.s. or Y_n = Y_{n+1} a.s., and (3) Z₁ is a.s. constant. Notice that (Z_n)_{n≥1}, being a bounded martingale, converges almost surely to a limit Z_∞.

Definition 13 Let A be a measurable subset of X×Y.
A* = {z ∈ X×Y : there exists a bimartingale (Z_n)_{n≥1} converging to a limit Z_∞ such that Z_∞ ∈ A a.s. and Z₁ = z a.s.}.

One can show that any atomless probability space (Ω, F, P), and any product of convex compact spaces X×Y containing A, induce the same set A*. One can also
substitute condition (2) by: ∀n ≥ 1, (X_n = X_{n+1} or Y_n = Y_{n+1}) a.s. Notice that without condition (2), the set A* would just be the convex hull of A. We always have A ⊂ A* ⊂ conv(A), and these inclusions can be strict. For example, if X = Y = [0, 1] and A = {(0, 0), (1, 0), (0, 1)}, it is possible to show that A* = {(x, y) ∈ [0, 1]×[0, 1] : x = 0 or y = 0}. A* is always biconvex and thus contains biconv(A), which is defined as the smallest biconvex set containing A. The inclusion biconv(A) ⊂ A* can also be strict, as shown by the following example:

Example 5 Put X = Y = [0, 1], v₁ = (1/3, 0), v₂ = (0, 2/3), v₃ = (2/3, 1), v₄ = (1, 1/3), w₁ = (1/3, 1/3), w₂ = (1/3, 2/3), w₃ = (2/3, 2/3) and w₄ = (2/3, 1/3), and A = {v₁, v₂, v₃, v₄}. A is biconvex, so A = biconv(A). Consider now the following Markov process (Z_n)_{n≥1}, with Z₁ = w₁. If Z_n ∈ A, then Z_{n+1} = Z_n. If Z_n = w_i for some i, then Z_{n+1} = w_{i+1 (mod 4)} with probability 1/2, and Z_{n+1} = v_i with probability 1/2. (Z_n)_n is a bimartingale converging a.s. to a point in A, hence w₁ ∈ A* \ biconv(A).

We now present a geometric characterization of the set A*, and assume here that A is closed. For each biconvex subset B of X×Y containing A, we denote by nsc(B) the set of elements of B which cannot be separated from A by a continuous bounded biconvex function on A. More precisely, nsc(B) = {z ∈ B : ∀f : B → R bounded, biconvex, and continuous on A, f(z) ≤ sup{f(z′) : z′ ∈ A}}.

Theorem 11 ([4]) A* is the largest biconvex set B containing A such that nsc(B) = B.

Let us now come back to repeated games and to the notations of Subsect. "Approachability for Player 1 Versus Excludability for Player 2". To be precise, we need to add the component of player 2's payoff, and consequently to slightly modify the definitions. G is closed in R_M^K × R_M × Δ(K). For B ⊂ R_M^K × R_M × Δ(K), B is biconvex if for each α in R_M^K and for each p in Δ(K), the sections {(β, p′) : (α, β, p′) ∈ B} and {(α′, β) : (α′, β, p) ∈ B} are convex.
A real function f defined on a biconvex set B is said to be biconvex if ∀α, ∀p, f(α, ·, ·) and f(·, ·, p) are convex.

Theorem 12 ([4]) G* is the largest biconvex set B containing G such that: ∀z ∈ B, ∀f : B → R bounded, biconvex, and continuous on G, f(z) ≤ sup{f(z′) : z′ ∈ G}.
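The Markov process of Example 5 can be checked mechanically. The sketch below is our own illustration: at each transient point w_i it verifies the martingale property (the conditional expectation of the next position equals w_i) and the bimartingale property that exactly one coordinate stays constant across the step.

```python
from fractions import Fraction as F

# The "four frogs" example: absorbing points v_i and transient points w_i.
v = [(F(1, 3), F(0)), (F(0), F(2, 3)), (F(2, 3), F(1)), (F(1), F(1, 3))]
w = [(F(1, 3), F(1, 3)), (F(1, 3), F(2, 3)), (F(2, 3), F(2, 3)), (F(2, 3), F(1, 3))]

for i in range(4):
    succ = [w[(i + 1) % 4], v[i]]          # successors, each with probability 1/2
    mean = tuple(sum(s[c] for s in succ) / 2 for c in range(2))
    assert mean == w[i]                    # martingale property at w_i
    # bimartingale property: one coordinate is a.s. constant across the step
    assert all(s[0] == w[i][0] for s in succ) or all(s[1] == w[i][1] for s in succ)
print("bimartingale verified")
```

Since the chain reaches A with probability 1, this confirms that w₁ is the starting point of a bimartingale converging into A, as claimed.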
Repeated Games with Incomplete Information, Figure 5 The "four frogs" example of Aumann and Hart: A* ≠ biconv(A)
Non-observable Actions

We now consider the case where, as in the general definition of Sect. "Definition of the Subject", there is a signaling function q : K×A → Δ(U) giving the distributions of the signals received by the players as a function of the state of nature and the action profile just played. The particular case where q(k, a) does not depend on k is called state independent signaling. The previous models correspond to the particular case of perfect observation, where the signals received by the players exactly reveal the action profile played.

Theorem 1 has been generalized [6] to the general case of signaling function. We keep the notations of Sect. "The Standard Model of Aumann and Maschler". Given a mixed action x ∈ Δ(I), an action j in J and a state k, we denote by Q(k, x, j) the marginal distribution on U₂ of the law Σ_{i∈I} x(i) q(k, i, j), i.e. Q(k, x, j) is the law of the signal received by player 2 if the state is k, player 1 uses x and player 2 plays j. The set of non-revealing strategies of player 1 is then defined as:

NR(p) = {x = (x^k)_{k∈K} ∈ Δ(I)^K : ∀k ∈ K, ∀k′ ∈ K s.t. p^k p^{k′} > 0, ∀j ∈ J, Q(k, x^k, j) = Q(k′, x^{k′}, j)}.

If the initial probability is p and player 1 plays a strategy x in NR(p) (i.e. plays x^k if the state is k), the a posteriori of player 2 will remain a.s. constant: player 2 can deduce no information on the selected state k. The value of the non-revealing game becomes:

u(p) = max_{x∈NR(p)} min_{y∈Δ(J)} Σ_{k∈K} p^k G^k(x^k, y) = min_{y∈Δ(J)} max_{x∈NR(p)} Σ_{k∈K} p^k G^k(x^k, y),

where G^k(x^k, y) = Σ_{i,j} x^k(i) y(j) G^k(i, j), with the convention u(p) = −∞ if NR(p) = ∅. Theorem 1 perfectly extends here: the repeated game with initial probability p has a uniform value given by cav u(p).
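The concavification operator cav can be computed numerically for two states, identifying p with the probability of the first state. The sketch below is our own illustration (the grid and the sample functions u are hypothetical): cav u is obtained as the upper concave envelope of the points (p, u(p)), computed as an upper convex hull.

```python
def cav_on_grid(ps, us):
    """Upper concave envelope of the points (ps[i], us[i]), ps strictly increasing.
    Returns the vertices of the upper hull; the envelope is piecewise linear
    between them."""
    hull = []  # vertices of the upper hull, as (p, value)
    for pt in zip(ps, us):
        # drop the middle point while it lies on or below the chord of its neighbors
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            x3, y3 = pt
            if (y2 - y1) * (x3 - x2) <= (y3 - y2) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append(pt)
    return hull

n = 100
ps = [i / n for i in range(n + 1)]

# A non-concave u: u(p) = -p(1-p); its concavification is the zero function.
hull = cav_on_grid(ps, [-p * (1 - p) for p in ps])
assert len(hull) == 2 and hull[0][1] == 0 and hull[1][1] == 0

# A concave u is its own concavification: every grid point stays on the hull.
hull2 = cav_on_grid(ps, [p * (1 - p) for p in ps])
assert len(hull2) == n + 1
print("cav computed")
```

The same routine, run on the vex side (apply it to −u), computes the convexification vex v used for player 2's individual rationality.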
The explicit construction of an optimal strategy of player 2 (see Subsect. "Back to the Standard Model") has also been generalized to the general signaling case (see [35], and part B, p. 234 of [57] for random signals). Regarding zero-sum games with lack of information on both sides, the results of Sect. "Zero-Sum Games with Lack of Information on Both Sides" have been generalized to the case of state independent signaling (see [50,52] and [55]). Attention has been paid to the speed of convergence of the value functions (v_T)_T, and the bounds are identical for both models of lack of information on one side and on both sides, if we assume state independent signaling: this speed is of order 1/T^{1/2} for games with perfect observation, and of order 1/T^{1/3} for games with signals (these orders are optimal, both for lack of information on one side and lack of information on both sides, see [89,90]). For state dependent signaling and lack of information on one side, it was shown by Mertens [51] that the convergence occurs with worst case error of order (ln n/n)^{1/3}. A particular class of zero-sum repeated games with state dependent signaling has been studied (games with no signals, see [54,88] and [82]). In these games, the state k is first selected according to a known probability and is not announced to the players; then after each stage both players receive the same signal, which is either "nothing" or "the state is k". It is not possible to obtain here a standard recursive formula with state space Δ(K), or even Δ(K)×Δ(K), because even when the strategies are given, during the play none of the players can compute the a posteriori of the other player. It was shown that the maxmin and the minmax may differ, although lim_T v_T always exists.
In non-zero-sum repeated games with lack of information on one side, the existence of "joint plan" equilibria has been generalized to the case of state independent signaling [62], and more generally to the case where "player 1 can send non-revealing signals to player 2" [77]. The existence of a uniform equilibrium in the general signaling case is still an open question, see [78].
Miscellaneous

Zero-Sum Games

In games with lack of information on one side, it is important that player 1 knows not only the selected state k, but also the a priori p. [84] provides an example of a game with lack of information on "one and a half" sides with no uniform value. More precisely, in this example nature first chooses p in {p₁, p₂} according to a known probability, and announces p to player 2 only; then k is selected according to p, and announced to player 1 only; finally the matrix game G^k is played.

For games with lack of information on one side, the value function v_T is a concave piecewise linear function of the initial probability p (see [61] for more generality). On the contrary, the discounted value v_λ can be quite a complex function of p: in Example 2 of Sect. "Definition of the Subject", Mayberry [49] has proved that for 2/3 < λ < 1, v_λ is non-differentiable at each rational value of p. Convergence of the value functions (v_T)_T and (v_λ)_λ has been widely studied. We have already discussed the speed of convergence in Sect. "Non-observable Actions", but much more can be said.

Example 6 Standard model of lack of information on one side and observable actions. K = {a, b},

G^a = |  3  −1 |     and     G^b = |  2  −2 |
      | −3   1 |                   | −2   2 | .
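The limit of √T v_T(p) in this example turns out to be the standard normal density evaluated at its p-quantile, as discussed next. This limit function can be evaluated directly with the standard library (our own sketch, for illustration):

```python
from statistics import NormalDist

def phi(p):
    """Standard normal density evaluated at its p-quantile:
    phi(p) = (1/sqrt(2*pi)) * exp(-x_p**2 / 2), where P(N(0,1) <= x_p) = p."""
    nd = NormalDist()  # standard normal, mean 0 and standard deviation 1
    return nd.pdf(nd.inv_cdf(p))

# At p = 1/2 the quantile is 0, so phi(1/2) = 1/sqrt(2*pi) ≈ 0.3989.
assert abs(phi(0.5) - 0.3989422804014327) < 1e-12
# phi is symmetric around 1/2 and becomes small near the extremes.
assert abs(phi(0.25) - phi(0.75)) < 1e-12 and phi(0.999) < phi(0.5)
```

The function vanishes as p → 0 or p → 1, consistent with v_T(0) = v_T(1) being the values of known games.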
One can show [53] that for each p ∈ [0, 1], viewed as the initial probability of state a, the sequence √T v_T(p) converges to φ(p), where φ(p) = (1/√(2π)) e^{−x_p²/2}, and x_p satisfies (1/√(2π)) ∫_{−∞}^{x_p} e^{−x²/2} dx = p. So the limit of √T v_T(p) is the standard normal density function evaluated at its p-quantile. The appearance of the normal distribution is by no means an isolated phenomenon, but rather an important property of some repeated games ([12,13,14,15,18], . . . ).

B. de Meyer introduced the notion of the "dual game" (see the previous references and also [17,19,41,69]). Let us now illustrate this on the standard model of Sect. "The Standard Model of Aumann and Maschler". Let z be a parameter in R^K. In the dual game Γ*_T(z), player 1 first secretly chooses the state k. Then at each stage t ≤ T, the players choose as usual actions i_t and j_t, which are announced before proceeding to the next stage. With time horizon T, player 1's payoff finally is (1/T) Σ_{t=1}^T G^k(i_t, j_t) − z^k. This player is thus now able to fix the state equal to k, but has to pay z^k for it. It can be shown that the T-stage dual game Γ*_T(z) has a value w_T(z). w_T is convex, and is linked to the value of the primal game by the conjugate formulas:

w_T(z) = max_{p∈Δ(K)} (v_T(p) − ⟨p, z⟩),  and  v_T(p) = inf_{z∈R^K} (w_T(z) + ⟨p, z⟩).
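These conjugate formulas can be checked numerically. In the sketch below (our own illustration; the piecewise linear v, standing in for a concave value function v_T with K = 2, and the grids are hypothetical), w is computed from v and the infimum formula is seen to recover v:

```python
# Conjugacy check for K = 2 states, identifying p in Delta(K) with (p, 1-p).
def v(p):
    return min(p, 1 - p)   # a hypothetical concave piecewise linear value

P = [i / 16 for i in range(17)]                                   # grid on Delta(K)
Z = [(a / 2, b / 2) for a in range(-2, 3) for b in range(-2, 3)]  # grid on R^K

def w(z):
    # w(z) = max_p ( v(p) - <p, z> ), the conjugate of v
    return max(v(p) - (p * z[0] + (1 - p) * z[1]) for p in P)

W = {z: w(z) for z in Z}

# Biconjugation recovers the concave v: v(p) = inf_z ( w(z) + <p, z> ).
for p in [0, 0.25, 0.5, 0.75, 1]:
    vhat = min(W[z] + p * z[0] + (1 - p) * z[1] for z in Z)
    assert abs(vhat - v(p)) < 1e-9
print("conjugacy verified")
```

The inequality vhat ≥ v(p) holds for every z by construction (weak duality), and equality is attained because the small z-grid happens to contain the supporting "prices" of this v; for a general v_T a finer grid over the subgradient range would be needed.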
The sequence (w_T)_T also satisfies the dual recursive formula:

w_{T+1}(z) = min_{y∈Δ(J)} max_{i∈I} (T/(T+1)) w_T( ( ((T+1)/T) z^k − (1/T) Σ_{j∈J} y_j G^k(i, j) )_{k∈K} ).
There are also strong relations between the optimal strategies of the players in the primal and dual games, and this gives a way to compute recursively optimal strategies of the uninformed player in the finite game (see also [32] on this topic). Repeated games with incomplete information, as well as stochastic games, can also be studied in a functional analysis setup called the operator approach. This general approach is based on the study of the recursive formula [40,70,85]. In [65], the standard model, as well as the proof of Theorem 1, is generalized to the case where the state is not fixed at the beginning of the game, but evolves according to a Markov chain observed by player 1 only (see also [59] for non-observable actions, [48] and [33] for the difficulty of computing the value, [67] for the generalization to a state process controlled and observed by player 1, and [71] for several kinds of stochastic games with lack of information on one side). It has been known since [80] that the uniform value may not exist in general for stochastic games with lack of information on one side (where the stochastic game to be played is first randomly selected and announced to player 1 only). Blackwell's approachability theorem has been extended to infinite dimensional spaces by Lehrer [43]. Approachability theory has strong links with the existence of no-regret strategies (first studied in [31], see also [10,26,30,44,72] and the recent book [9]), with the convergence of simple procedures to the set of correlated equilibria [31], and with calibration [25,42]. The links between merging, reputation phenomena and repeated games with incomplete information have been studied in [83], where several existing results are unified. Finally, no-regret and approachability have also been studied when the players have bounded computational capacities (finite automata, bounded recall strategies) [45,46].
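The no-regret procedures mentioned here can be illustrated with a minimal, deterministic sketch of regret matching in expected payoffs (our own illustration, not code from the cited papers; the payoff stream is hypothetical): play each action with probability proportional to the positive part of its cumulative regret, and the per-round external regret vanishes.

```python
def regret_matching(payoffs, T):
    """Expected-payoff regret matching against a fixed cycle of payoff
    vectors; returns the final per-round external regret."""
    A = len(payoffs[0])
    R = [0.0] * A                      # cumulative regrets, one per action
    for t in range(T):
        u = payoffs[t % len(payoffs)]  # payoff of each action this round
        pos = [max(r, 0.0) for r in R]
        s = sum(pos)
        # mixed action proportional to positive regrets (uniform if none)
        p = [x / s for x in pos] if s > 0 else [1.0 / A] * A
        got = sum(pi * ui for pi, ui in zip(p, u))
        for a in range(A):
            R[a] += u[a] - got         # regret of not having played a
    return max(R) / T

# Hypothetical stream: action 0 always pays 0 and action 1 always pays 1;
# the procedure locks onto action 1 and the average regret is tiny.
assert regret_matching([[0.0, 1.0]], 100) <= 0.01
print("no-regret verified")
```

Against a non-constant environment the same procedure still drives the average regret to zero, which is the property linked to approachability in the text.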
Let us also mention that de Meyer and Moussa Saley studied the modeling of financial markets via Brownian motions [18]. They introduced a market game based on a repeated game with lack of information on one side, and showed the endogenous appearance of a Brownian motion (see [16] for incomplete information on both sides).
Non Zero-Sum Games

In the setup of Sect. "Non Zero-Sum Games with Lack of Information on One Side", it is interesting to study the number of communication stages needed to construct the different equilibria. This number is linked with the convergence of the associated bimartingales (see [4,6,21,24]). Let us also mention that F. Forges [23] gave a similar characterization of equilibrium payoffs for a larger notion of equilibria called communication equilibria (see also [22] for correlated equilibria). Amitai [2] studied the set of equilibrium payoffs in the case of lack of information on both sides. Aumann and Hart [5] characterized the equilibrium payoffs in two-player games with lack of information on one side when long, payoff-irrelevant, preplay communication is allowed (see [1] for incomplete information on both sides).

The particular case where each player knows his own payoffs is particularly worth studying (known own payoffs). In the two-player case with lack of information on one side, this amounts to saying that player 2's payoffs do not depend on the selected state. In this case, Shalev [73] showed that any equilibrium payoff can be obtained as the payoff of an equilibrium which is completely revealing. This result generalizes to the non-zero-sum case of lack of information on both sides (see the unpublished manuscript [37]), but unfortunately uniform equilibria may fail to exist even though both players know their own payoffs.

Another model deals with the symmetric case, where the players have an incomplete, but identical, knowledge of the selected state. After each stage they receive the same signal, which may depend on the state. A. Neyman and S. Sorin have proved the existence of equilibrium payoffs in the case of two players (see [60], the zero-sum case being solved in [36] and [20]). Few papers study the case of more than two players.
The existence of uniform equilibrium has been studied for three players and lack of information on one side [63]; in the case of two states of nature, a completely revealing equilibrium, or a joint plan equilibrium led by one of the informed players, always exists. Concerning n-player repeated games with incomplete information and signals, several papers study how the initial information can be strategically transmitted, independently of the payoffs ([64,66], and [68] for an application to a cryptographic model). As an application, the existence of completely revealing equilibria is obtained in particular cases. Repeated games with incomplete information have been used to study perturbations of repeated games with complete information (see [27] and [11] for Folk theorem-like results, [7] for enforcing cooperation in games with a Pareto-dominant outcome, and [34] for a perturbation with known own payoffs). The case where the players have different discount factors has also been investigated [11,47].

Future Directions

Several open problems are well formulated and deserve attention. Does a uniform equilibrium always exist in two-player repeated games with lack of information on one side and general signaling, or in n-player repeated games with lack of information on one side? More conceptually, one should look for classes of n-player repeated games with incomplete information which allow for the existence of equilibria, and/or for a tractable description of equilibrium payoffs (or at least of some of these payoffs). Regarding applications, there is certainly a lot of room in the vast fields of financial markets, cryptology, and sequential decision problems.

Acknowledgments

I thank Françoise Forges, Sergiu Hart, Dinah Rosenberg, Robert Simon and Eilon Solan for their comments on a preliminary version of this chapter.

Bibliography

Primary Literature
1. Amitai M (1996) Cheap-talk with incomplete information on both sides. PhD thesis, The Hebrew University of Jerusalem, http://ratio.huji.ac.il/dp/dp90.pdf
2. Amitai M (1996) Repeated games with incomplete information on both sides. PhD thesis, The Hebrew University of Jerusalem, http://ratio.huji.ac.il/dp/dp105.pdf
3. Aumann RJ (1964) Mixed and behaviour strategies in infinite extensive games. In: Dresher, Shapley, Tucker (eds) Advances in game theory. Annals of Mathematics Study, vol 52. Princeton University Press, Princeton, pp 627–650
4. Aumann RJ, Hart S (1986) Bi-convexity and bi-martingales. Israel J Math 54:159–180
5. Aumann RJ, Hart S (2003) Long cheap talk. Econometrica 71:1619–1660
6. Aumann RJ, Maschler M (1995) Repeated games with incomplete information, with the collaboration of Stearns RE.
MIT Press, Cambridge (contains a reedition of chapters of Reports to the US Arms Control and Disarmament Agency ST-80, 116 and 143, Mathematica, 1966–1967–1968)
7. Aumann RJ, Sorin S (1989) Cooperation and bounded recall. Games Econ Behav 1:5–39
8. Blackwell D (1956) An analog of the minimax theorem for vector payoffs. Pac J Math 6:1–8
9. Cesa-Bianchi N, Lugosi G (2006) Prediction, learning and games. Cambridge University Press, Cambridge
10. Cesa-Bianchi N, Lugosi G, Stoltz G (2006) Regret minimization under partial monitoring. Math Oper Res 31:562–580
11. Cripps MW, Thomas JP (2003) Some asymptotic results in discounted repeated games of one-sided incomplete information. Math Oper Res 28:433–462
12. de Meyer B (1996) Repeated games and partial differential equations. Math Oper Res 21:209–236
13. de Meyer B (1996) Repeated games, duality and the central limit theorem. Math Oper Res 21:237–251
14. de Meyer B (1998) The maximal variation of a bounded martingale and the central limit theorem. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques 34:49–59
15. de Meyer B (1999) From repeated games to Brownian games. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques 35:1–48
16. de Meyer B, Marino A (2004) Repeated market games with lack of information on both sides. DP 2004.66, MSE Université Paris I
17. de Meyer B, Marino A (2005) Duality and optimal strategies in the finitely repeated zero-sum games with incomplete information on both sides. DP 2005.27, MSE Université Paris I
18. de Meyer B, Moussa Saley H (2003) On the strategic origin of Brownian motion in finance. Int J Game Theory 31:285–319
19. de Meyer B, Rosenberg D (1999) "Cav u" and the dual game. Math Oper Res 24:619–626
20. Forges F (1982) Infinitely repeated games of incomplete information: symmetric case with random signals. Int J Game Theory 11:203–213
21. Forges F (1984) A note on Nash equilibria in repeated games with incomplete information. Int J Game Theory 13:179–187
22. Forges F (1985) Correlated equilibria in a class of repeated games with incomplete information. Int J Game Theory 14:129–149
23. Forges F (1988) Communication equilibria in repeated games with incomplete information. Math Oper Res 13:191–231
24. Forges F (1990) Equilibria with communication in a job market example. Q J Econ 105:375–398
25. Foster D (1999) A proof of calibration via Blackwell's approachability theorem. Games Econ Behav 29:73–78
26. Foster D, Vohra R (1999) Regret in the on-line decision problem. Games Econ Behav 29:7–35
27. Fudenberg D, Maskin E (1986) The folk theorem in repeated games with discounting or with incomplete information. Econometrica 54:533–554
28. Harsanyi J (1967–68) Games with incomplete information played by 'Bayesian' players, parts I–III. Manag Sci 14:159–182, 320–334, 486–502
29. Hart S (1985) Nonzero-sum two-person repeated games with incomplete information. Math Oper Res 10:117–153
30. Hart S (2005) Adaptive heuristics. Econometrica 73:1401–1430
31. Hart S, Mas-Colell A (2000) A simple adaptive procedure leading to correlated equilibrium. Econometrica 68:1127–1150
32. Heuer M (1992) Optimal strategies for the uninformed player. Int J Game Theory 20:33–51
33. Hörner J, Rosenberg D, Solan E, Vieille N (2006) On Markov games with incomplete information on one side. Discussion paper 1412, Center for Mathematical Studies in Economics and Management Science, Northwestern University, Evanston
34. Israeli E (1999) Sowing doubt optimally in two-person repeated games. Games Econ Behav 28:203–216
35. Kohlberg E (1975) Optimal strategies in repeated games with incomplete information. Int J Game Theory 4:7–24
36. Kohlberg E, Zamir S (1974) Repeated games of incomplete information: the symmetric case. Ann Stat 2:1040–1041
37. Koren G (1992) Two-person repeated games where players know their own payoffs. Master's thesis, Tel-Aviv University, Tel-Aviv, http://www.ma.huji.ac.il/hart/papers/koren.pdf
38. Kuhn HW (1953) Extensive games and the problem of information. In: Kuhn, Tucker (eds) Contributions to the theory of games, vol II. Annals of Mathematical Studies 28, Princeton University Press, Princeton, pp 193–216
39. Laraki R (2001) Variational inequalities, system of functional equations and incomplete information repeated games. SIAM J Control Optim 40:516–524
40. Laraki R (2001) The splitting game and applications. Int J Game Theory 30:359–376
41. Laraki R (2002) Repeated games with lack of information on one side: the dual differential approach. Math Oper Res 27:419–440
42. Lehrer E (2001) Any inspection is manipulable. Econometrica 69:1333–1347
43. Lehrer E (2003) Approachability in infinite dimensional spaces. Int J Game Theory 31:253–268
44. Lehrer E (2003) A wide range no-regret theorem. Games Econ Behav 42:101–115
45. Lehrer E, Solan E (2003) No regret with bounded computational capacity. DP 1373, Center for Mathematical Studies in Economics and Management Science, Northwestern University, Evanston
46. Lehrer E, Solan E (2006) Excludability and bounded computational capacity. Math Oper Res 31:637–648
47. Lehrer E, Yariv L (1999) Repeated games with lack of information on one side: the case of different discount factors. Math Oper Res 24:204–218
48. Marino A (2005) The value of a particular Markov chain game. Chaps 5 and 6, PhD thesis, Université Paris I, http://alexandre.marino.free.fr/theseMarino.pdf
49. Mayberry J-P (1967) Discounted repeated games with incomplete information. Report of the US Arms Control and Disarmament Agency, ST-116, Chap V. Mathematica, Princeton, pp 435–461
50. Mertens J-F (1972) The value of two-person zero-sum repeated games: the extensive case. Int J Game Theory 1:217–227
51. Mertens J-F (1998) The speed of convergence in repeated games with incomplete information on one side. Int J Game Theory 27:343–357
52. Mertens J-F, Zamir S (1971) The value of two-person zero-sum repeated games with lack of information on both sides. Int J Game Theory 1:39–64
53. Mertens J-F, Zamir S (1976) The normal distribution and repeated games. Int J Game Theory 5:187–197
54. Mertens J-F, Zamir S (1976) On a repeated game without a recursive structure. Int J Game Theory 5:173–182
55. Mertens J-F, Zamir S (1977) A duality theorem on a pair of simultaneous functional equations. J Math Anal Appl 60:550–558
56. Mertens J-F, Zamir S (1985) Formulation of Bayesian analysis for games with incomplete information. Int J Game Theory 14:1–29
57. Mertens J-F, Sorin S, Zamir S (1994) Repeated games. CORE discussion papers 9420, 9421 and 9422. Université Catholique de Louvain, Belgium
58. Meyer PA (1966) Probability and potentials. Blaisdell, New York (in French: Probabilités et potentiel, Hermann, 1966)
59. Neyman A (2008) Existence of optimal strategies in Markov games with incomplete information. Int J Game Theory 37:581–596
60. Neyman A, Sorin S (1998) Equilibria in repeated games with incomplete information: the general symmetric case. Int J Game Theory 27:201–210
61. Ponssard JP, Sorin S (1980) The LP formulation of finite zero-sum games with incomplete information. Int J Game Theory 9:99–105
62. Renault J (2000) 2-player repeated games with lack of information on one side and state independent signalling. Math Oper Res 25:552–572
63. Renault J (2001) 3-player repeated games with lack of information on one side. Int J Game Theory 30:221–246
64. Renault J (2001) Learning sets in state dependent signalling game forms: a characterization. Math Oper Res 26:832–850
65. Renault J (2006) The value of Markov chain games with lack of information on one side. Math Oper Res 31:490–512
66. Renault J, Tomala T (2004) Learning the state of nature in repeated games with incomplete information and signals. Games Econ Behav 47:124–156
67. Renault J (2007) The value of repeated games with an informed controller. Technical report, Université Paris-Dauphine, Ceremade
68. Renault J, Tomala T (2008) Probabilistic reliability and privacy of communication using multicast in general neighbor networks. J Cryptol 21:250–279
69. Rosenberg D (1998) Duality and Markovian strategies. Int J Game Theory 27:577–597
70. Rosenberg D, Sorin S (2001) An operator approach to zero-sum repeated games. Israel J Math 121:221–246
71. Rosenberg D, Solan E, Vieille N (2004) Stochastic games with a single controller and incomplete information. SIAM J Control Optim 43:86–110
72. Rustichini A (1999) Minimizing regret: the general case. Games Econ Behav 29:224–243
73. Shalev J (1994) Nonzero-sum two-person repeated games with incomplete information and known-own payoffs. Games Econ Behav 7:246–259
74. Sion M (1958) On general minimax theorems. Pac J Math 8:171–176
75. Simon RS (2002) Separation of joint plan equilibrium payoffs from the min-max functions. Games Econ Behav 41:79–102
76. Simon RS, Spież S, Toruńczyk H (1995) The existence of equilibria in certain games, separation for families of convex functions and a theorem of Borsuk–Ulam type. Israel J Math 92:1–21
77. Simon RS, Spież S, Toruńczyk H (2002) Equilibrium existence and topology in some repeated games with incomplete information. Trans AMS 354:5005–5026
78. Simon RS, Spież S, Toruńczyk H (2008) Equilibria in a class of games and topological results implying their existence. RACSAM Rev R Acad Cien Serie A Mat 102:161–179
79. Sorin S (1983) Some results on the existence of Nash equilibria for non-zero sum games with incomplete information. Int J Game Theory 12:193–205
80. Sorin S (1984) Big match with lack of information on one side, part I. Int J Game Theory 13:201–255
81. Sorin S (1984) On a pair of simultaneous functional equations. J Math Anal Appl 98:296–303
82. Sorin S (1989) On recursive games without a recursive structure: existence of lim v_n. Int J Game Theory 18:45–55
83. Sorin S (1997) Merging, reputation, and repeated games with incomplete information. Games Econ Behav 29:274–308
84. Sorin S, Zamir S (1985) A 2-person game with lack of information on 1 and 1/2 sides. Math Oper Res 10:17–23
85. Sorin S (2002) A first course on zero-sum repeated games. Mathématiques et Applications, vol 37. Springer
86. Spinat X (2002) A necessary and sufficient condition for approachability. Math Oper Res 27:31–44
87. Vieille N (1992) Weak approachability. Math Oper Res 17:781–791
88. Waternaux C (1983) Solution for a class of repeated games without recursive structure. Int J Game Theory 12:129–160
89. Zamir S (1971) On the relation between finitely and infinitely repeated games with incomplete information. Int J Game Theory 1:179–198
90. Zamir S (1973) On repeated games with general information function. Int J Game Theory 2:215–229
91. Zamir S (1992) Repeated games of incomplete information: zero-sum. In: Aumann RJ, Hart S (eds) Handbook of Game Theory, vol I. Elsevier Science Publishers, Amsterdam, pp 109–154
Books and Reviews Forges F (1992) Repeated Games of Incomplete Information: Nonzero sum. In: Aumann RJ, Hart S (eds) Handbook of Game Theory, vol I. Elsevier Science Publishers, Amsterdam, pp 155–177 Laraki R, Renault J, Tomala T (2006) Théorie des Jeux, Introduction à la théorie des jeux répétés. Editions de l’Ecole Polytechnique, Journées X-UPS, in French (Chapter 3 deals with repeated games with incomplete information). Palaiseau Mertens J-F (1987) Repeated games. In: Proceedings of the International Congress of Mathematicians 1986. American Mathematical Society, Berkeley, pp 1528–1577
2655
2656
Reputation Effects

GEORGE J. MAILATH
Department of Economics, University of Pennsylvania, Philadelphia, USA

Article Outline

Glossary
Definition of the Subject
Introduction
A Canonical Model
Two Long-Lived Players
Future Directions
Acknowledgments
Bibliography
Glossary

Action type A type of player who is committed to playing a particular action, also called a commitment type or behavioral type.
Complete information Characteristics of all players are common knowledge.
Flow payoff Stage-game payoff.
Imperfect monitoring Past actions of all players are not public information.
Incomplete information Characteristics of some player are not common knowledge.
Long-lived player Player subject to intertemporal incentives, typically with the same horizon as the length of the game.
Myopic optimum An action maximizing stage-game payoffs.
Nash equilibrium A strategy profile from which no player has a profitable unilateral deviation (i.e., it is self-enforcing).
Nash reversion In a repeated game, permanent play of a stage-game Nash equilibrium.
Normalized discounted value The discounted sum of an infinite sequence {a_t}_{t≥0}, calculated as (1 − δ) Σ_{t≥0} δ^t a_t, where δ ∈ (0, 1) is the discount factor.
Perfect monitoring Past actions of all players are public information.
Repeated game The finite or infinite repetition of a stage game.
Reputation bound The lower bound on equilibrium payoffs of a player that the other player(s) believe may be a simple action type (typically the Stackelberg type).
Short-lived player Player not subject to intertemporal incentives, having a one-period horizon, and so myopically optimizing.
Simple action type A type of player who plays the same (pure or mixed) stage-game action in every period, regardless of history.
Stage game A game played in one period.
Stackelberg action In a stage game, the action a player would commit to if that player had the chance to do so, i.e., the optimal commitment action.
Stackelberg type A simple action type that plays the Stackelberg action.
Subgame In a repeated game with perfect monitoring, the game following any history.
Subgame perfect equilibrium A strategy profile that induces a Nash equilibrium on every subgame of the original game.
Type The characteristic of a player that is not common knowledge.
Definition of the Subject

Repeated games have many equilibria, including the repetition of stage-game Nash equilibria. At the same time, particularly when monitoring is imperfect, certain plausible outcomes are not consistent with equilibrium. Reputation effects is the term used for the impact upon the set of equilibria (typically of a repeated game) of perturbing the game by introducing incomplete information of a particular kind. Specifically, the characteristics of a player are not public information, and the other players believe it is possible that the distinguished player is a type that necessarily plays some action (typically the Stackelberg action). Reputation effects fall into two classes: "plausible" phenomena that are not equilibria of the original repeated game are equilibrium phenomena in the presence of incomplete information, and "implausible" equilibria of the original game are not equilibria of the incomplete information game. As such, reputation effects provide an important qualification to the general indeterminacy of equilibria.

Introduction

Repeating play of a stage game often allows for equilibrium behavior inconsistent with equilibrium of that stage game. If the stage game has multiple Nash equilibrium payoffs, a large finite number of repetitions provides sufficient intertemporal incentives for behavior inconsistent with stage-game Nash equilibria to arise in some subgame perfect equilibria. However, many classic games do not have multiple Nash equilibria. For example, mutual defection DD is the unique Nash equilibrium of the prisoners' dilemma, illustrated in Fig. 1.
        C        D
C     2, 2    -1, 3
D     3, -1    0, 0

Reputation Effects, Figure 1
The prisoners' dilemma. The cooperative action is labeled C, while defect is labeled D
A standard argument shows that the finitely repeated prisoner’s dilemma has a unique subgame perfect equilibrium, and in this equilibrium, DD is played in every period: In any subgame perfect equilibrium, in the last period, DD must be played independently of history, since the stage game has a unique Nash equilibrium. Then, since play in the last period is independent of history, there are no intertemporal incentives in the penultimate period, and so DD must again be played independently of history. Proceeding recursively, DD must be played in every period independently of history. (In fact, the finitely repeated prisoners’ dilemma has a unique Nash equilibrium outcome, given by DD in every period.) This contrasts with intuition, which suggests that if the prisoners’ dilemma were repeated a sufficiently large (though finite) number of times, the two players would find a way to play cooperatively (C) at least in the initial stages. In response, Kreps, Milgrom, Roberts and Wilson [15] argued that intuition can be rescued in the finitely repeated prisoners’ dilemma by introducing incomplete information. In particular, suppose each player assigns some probability to their opponent being a behavioral type who mechanistically plays tit-for-tat (i. e., plays C in the first period or if the opponent had played C in the previous period, and plays D if the opponent had played D in the previous period) rather than being a rational player. No matter how small the probability, if the number of repetitions is large enough, the rational players will play C in early periods, and the fraction of periods in which CC is played is close to one. This is the first example of a reputation effect: a small degree of incomplete information (of the right kind) both rescues the intuitive CC for many periods as an equilibrium outcome, and eliminates the unintuitive always DD as one. 
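The backward-induction logic can be checked mechanically. A minimal sketch (not part of the article; the payoff entries (2,2), (−1,3), (3,−1), (0,0) are the standard prisoners' dilemma values assumed here) that enumerates the pure-strategy Nash equilibria of the stage game, confirming that DD is the unique one, which is what forces DD in every period of any finite repetition:

```python
# Enumerate pure-strategy Nash equilibria of the prisoners' dilemma stage game.
# Payoffs are (row player, column player) over actions C and D.
payoffs = {
    ("C", "C"): (2, 2), ("C", "D"): (-1, 3),
    ("D", "C"): (3, -1), ("D", "D"): (0, 0),
}
actions = ["C", "D"]

def is_nash(a1, a2):
    """Check that neither player gains from a unilateral deviation."""
    u1, u2 = payoffs[(a1, a2)]
    no_dev_1 = all(payoffs[(d, a2)][0] <= u1 for d in actions)
    no_dev_2 = all(payoffs[(a1, d)][1] <= u2 for d in actions)
    return no_dev_1 and no_dev_2

stage_nash = [(a1, a2) for a1 in actions for a2 in actions if is_nash(a1, a2)]
print(stage_nash)  # DD is the unique stage-game Nash equilibrium
```

With a unique stage-game equilibrium, backward induction pins down play in the last period, then the penultimate one, and so on, as the surrounding text argues.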
In the same issue of the Journal of Economic Theory containing Kreps, Milgrom, Roberts and Wilson [15], Kreps and Wilson [14] and Milgrom and Roberts [18] explored reputation effects in the finite chain store game of Selten [22], showing that intuition is again rescued, this time by introducing the possibility that the chain store is a "tough" type who always fights entry.
Reputation effects describe the impact upon the set of equilibria of the introduction of small amounts of incomplete information of a particular form into repeated games (and other dynamic games). Reputation effects fall into two classes: "plausible" phenomena that are not equilibria of the complete information game are equilibrium phenomena in the presence of incomplete information, and "implausible" equilibria of the complete information game are not equilibria of the incomplete information game. Reputation effects are distinct from the equilibrium phenomena in complete information repeated games that are sometimes described as capturing reputations. In this latter use, an equilibrium of the complete information repeated game is selected, involving actions along the equilibrium path that are not Nash equilibria of the stage game. As usual, incentives to choose these actions are created by attaching less favorable continuation paths to deviations. Players who choose the equilibrium actions are then interpreted as maintaining a reputation for doing so, with a punishment-triggering deviation interpreted as causing the loss of one's reputation. For example, players who cooperate in the infinitely repeated prisoners' dilemma are interpreted as having (or maintaining) a cooperative reputation, with any defection destroying that reputation. In this usage, the link between past behavior and expectations of future behavior is an equilibrium phenomenon, holding in some equilibria, but not in others. The notion of reputation is used to interpret an equilibrium strategy profile, but otherwise adds nothing to the formal analysis. In contrast, the approach underlying reputation effects begins with the assumption that a player is uncertain about key aspects of her opponent. For example, player 2 may not know player 1's payoffs, or may be uncertain about what constraints player 1 faces on his ability to choose various actions.
This incomplete information is a device that introduces an intrinsic connection between past behavior and expectations of future behavior. Since incomplete information about players' characteristics can have dramatic effects on the set of equilibrium payoffs, reputations in this approach do not describe certain equilibria, but rather place constraints on the set of possible equilibria.

An Example

While reputation effects were first studied in a symmetric example with two long-lived players, they arise in their purest form in infinitely repeated games with one long-lived player playing against a sequence of short-lived players. The chain store game of Selten [22] is a finitely repeated game in which a chain store (the long-lived player) faces a finite sequence of potential entrants in its different
        h        ℓ
H     2, 3     0, 2
L     3, 0     1, 1

Reputation Effects, Figure 2
The product-choice game
markets. Since each entrant only cares about its own decision, it is short-lived. Consider the "product-choice" game of Fig. 2. The row player (player 1), who is long-lived, is a firm choosing between high (H) and low (L) effort, while the column player (player 2), who is short-lived, is a customer choosing between a high (h) or low (ℓ) priced product. (Mailath and Samuelson [17] illustrate various aspects of repeated games and reputation effects using this example.) Player 2 prefers the high-priced product if the firm has exerted high effort, but prefers the low-priced product if the firm has not. The firm prefers that customers purchase the high-priced product and is willing to commit to high effort to induce that choice by the customer. In a simultaneous move game, however, the firm cannot observably choose effort before the customer chooses the product. Since high effort is costly, the firm prefers low effort, no matter the choice of the customer. The stage game has a unique Nash equilibrium, in which the firm exerts low effort and the customer purchases the low-priced product. Suppose the game is played infinitely often, with perfect monitoring (i.e., the history of play is public information). The firm is long-lived and discounts flow profits by the discount factor δ ∈ (0, 1), and is patient if δ is close to 1. The role of the customer is taken by a succession of short-lived players, each of whom plays the game only once (and so myopically optimizes). It is standard to abuse language by treating the collection of short-lived players as a single myopically optimizing player. When the firm is sufficiently patient, there is an equilibrium outcome in the repeated game in which the firm always exerts high effort and customers always purchase the high-priced product. The firm is deterred from taking the immediate myopically optimal action of low effort by the prospect of future customers then purchasing the low-priced product.
Purchasing the high-priced product is a best response for the customer to high effort, so that no incentive issues arise concerning the customer's behavior. In this equilibrium, the long-lived player's payoff is 2 (the firm's payoffs are calculated as the normalized discounted sum, i.e., as the discounted sum of flow payoffs normalized by (1 − δ), so that payoffs in the infinite
horizon game are comparable to flow payoffs). However, there are many other equilibria, including one in which low effort is exerted and the low-priced product purchased in every period, leading to a payoff of 1 for the long-lived player. Indeed, for δ ≥ 1/2, the set of pure-strategy subgame-perfect-equilibrium player 1 payoffs is given by the entire interval [1, 2]. Reputation effects effectively rule out any payoff less than 2 as an equilibrium payoff for player 1. Suppose customers are not entirely certain of the characteristics of the firm. More specifically, suppose they attach high probability to the firm's being "normal," that is, having the payoffs given above, but they also entertain some (possibly very small) probability that they face a firm who fortuitously has a technology or some other characteristic that ensures high effort. Refer to the latter as the "H-action" type of firm. Since such a type necessarily plays H in every period, it is a type described by behavior (not payoffs), and such a type is often called a behavioral or commitment type. This is now a game of incomplete information, with the customers uncertain of the firm's type. Since the customers assign high probability to the firm being "normal," the game is in some sense close to the game of complete information. Nonetheless, reputation effects are present: For a sufficiently patient firm, in any Nash equilibrium of the repeated game, the firm's payoff cannot be significantly less than 2. This result holds no matter how unlikely customers think the H-action type to be, though increasing patience is required from the normal firm as the action type becomes less likely. The intuition behind this result is most easily seen by considering pure strategy Nash equilibria of the incomplete information game where the customers believe the firm is either the normal or the H-action type.
In that case, there is no pure strategy Nash equilibrium with a payoff less than 2δ (which is clearly close to 2 for δ close to 1). In any pure strategy Nash equilibrium, either the firm always plays H (in which case the customers always play h and the firm's payoff is 2), or there is a first period (say t) in which the firm plays L, revealing to future customers that he is the normal type (since the action type plays H in every period). In such an equilibrium, customers play h before t (since both types of firm are choosing H). After observing H in period t, customers conclude the firm is the H-action type. Consequently, as long as H is always chosen thereafter, customers subsequently play h (since they continue to believe the firm is the H-action type, and so necessarily plays H). An easy lower bound on the normal firm's equilibrium payoff is then obtained by observing that the normal firm's payoff must be at least the payoff from mimicking the action type in every period. The payoff from such behavior is at least as large as
(1 − δ) Σ_{τ=0}^{t−1} δ^τ · 2 + (1 − δ) δ^t · 0 + (1 − δ) Σ_{τ=t+1}^{∞} δ^τ · 2
   = (1 − δ^t) · 2 + δ^{t+1} · 2
   = 2 − 2δ^t (1 − δ),

where the first sum is the payoff in periods τ < t from pooling with the H-action type, the middle term is the payoff in period t from playing H when L may be myopically optimal, and the last sum is the payoff in periods τ > t from playing like, and being treated as, the H-action type.
Since δ^t ≤ 1, this payoff is at least 2 − 2(1 − δ) = 2δ.

The outcome in which the stage game Nash equilibrium Lℓ is played in every period is thus eliminated. Since reputation effects are motivated by the hypothesis that the short-lived players are uncertain about some aspect of the long-lived player's characteristics, it is important that the results are not sensitive to the precise nature of that uncertainty. In particular, the lower bound on payoffs should not require that the short-lived players only assign positive probability to the normal and the H-action type (as in the game just analyzed). And it does not: The customers in the example may assign positive probability to the firm being an action type that plays H in even periods and L in odd periods, as well as to an action type that plays H in every period before some period t′ (that can depend on history), and then always plays L. Yet, as long as the customers assign positive probability to the H-action type, for a sufficiently patient firm, in any Nash equilibrium of the repeated game, the firm's payoff cannot be significantly less than 2.

Reputation effects are more powerful in the presence of imperfect monitoring. Suppose that the firm's choice of H or L is not observed by the customers. Instead, the customers observe a public signal y ∈ {y̲, ȳ} at the end of each period, where the signal ȳ is realized with probability p ∈ (0, 1) if the firm chose H, and with the smaller probability q ∈ (0, p) if the firm chose L. Interpret ȳ as a good meal: while customers do not observe effort, they do observe a noisy signal (the quality of the meal) of that effort, with high effort leading to a good meal with higher probability. In the game with complete information, the largest equilibrium payoff to the firm is now given by

   v̄₁ ≡ 2 − (1 − p)/(p − q),   (1)
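The mimicking payoff can be verified numerically. A sketch (not part of the article) that computes the normalized discounted payoff from pooling with the H-action type in every period except a single period t, and checks that it equals 2 − 2δ^t(1 − δ) and is at least 2δ:

```python
# Numerical check of the mimicking bound: flow payoff 2 in every period except
# 0 in period t, normalized by (1 - d). A horizon of 2000 periods approximates
# the infinite sum to far better than the tolerance used below.
def mimic_payoff(d, t, horizon=2000):
    return (1 - d) * sum((d ** tau) * (0 if tau == t else 2)
                         for tau in range(horizon))

d = 0.9
checks = []
for t in range(20):
    closed_form = 2 - 2 * (d ** t) * (1 - d)
    checks.append(abs(mimic_payoff(d, t) - closed_form) < 1e-6
                  and closed_form >= 2 * d)
print(all(checks))
```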
reflecting the imperfect monitoring of the firm's actions (the firm is said to be subject to binding moral hazard; see Sect. 7.6 in [17]). Since deviations from H cannot be detected for sure, there are no equilibria with the deterministic outcome path of Hh in every period. In some periods after some histories, Lℓ must be played in order to provide the appropriate intertemporal incentives to the firm. As under perfect monitoring, as long as customers assign positive probability to the H-action type in the incomplete information game with imperfect monitoring, for a sufficiently patient firm, in any Nash equilibrium of the repeated game, the firm's payoff cannot be significantly less than 2 (in particular, this lower bound exceeds v̄₁). Thus, in this case, reputation effects provide an intuitive lower bound on equilibrium payoffs that both rules out "bad" equilibrium payoffs and rescues outcomes in which Hh occurs in most periods. Proving that a reputation bound holds in the imperfect monitoring case is considerably more involved than in the perfect monitoring case. In perfect-monitoring games, it is only necessary to analyze the evolution of the customers' beliefs when always observing H, the action of the H-action type. In contrast, imperfect monitoring requires consideration of belief evolution on all histories that arise with positive probability. Nonetheless, the intuition is the same: Consider a putative equilibrium in which the normal firm receives a payoff less than 2 − ε. Then the normal and action types must be making different choices over the course of the repeated game, since an equilibrium in which they behave identically would induce customers to choose h and would yield a payoff of 2. As in the perfect monitoring case, the normal firm has the option of mimicking the behavior of the H-action type. Suppose the normal firm does so.
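The consequence of this deviation can be illustrated by simulation. A sketch (hypothetical parameter values, not from the article): customers' beliefs have the normal type playing L while the firm in fact mimics the H-action type, so signal realizations favor the H-action type and Bayes' rule drives the posterior on that type toward one.

```python
import random

# Customers believe: normal type plays L, H-action type plays H.
# The firm actually plays H every period (mimicking the H-action type).
random.seed(0)
p, q = 0.8, 0.3   # Pr(good signal | H) and Pr(good signal | L), hypothetical
mu = 0.1          # hypothetical prior on the H-action type

posterior = mu
for t in range(200):
    good = random.random() < p            # signal drawn from the H distribution
    like_H = p if good else 1 - p         # likelihood under the H-action type
    like_N = q if good else 1 - q         # likelihood under believed normal play
    posterior = (posterior * like_H
                 / (posterior * like_H + (1 - posterior) * like_N))

print(posterior > 0.9)  # the posterior on the H-action type converges to one
```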
Since the customers expect the normal type of firm to behave differently from the H-action type, they will more often see signals indicative of the H-action type (rather than the normal type), and so must eventually become convinced that the firm is the H-action type. Hence, in response to this deviation, the customers will eventually play their best response to H, namely h. While "eventually" may take a while, that time is independent of the equilibrium (indeed, of the discount factor), depending only on the imperfection in the monitoring and the prior probability assigned to the H-action type. Then, if the firm is sufficiently patient, the payoff from mimicking the H-action type is arbitrarily close to 2, contradicting the existence of an equilibrium in which the firm's payoff fell short of 2 − ε. At the same time, because monitoring is imperfect, as discussed in Sect. "Temporary Reputation Effects", the reputation effects are necessarily transient. Under general conditions in imperfect-monitoring games, the incomplete information that is at the core of reputation effects is a short-run phenomenon. Player 2 must eventually come to learn player 1's type and continuation play must converge to an equilibrium of the complete information game. Reputation effects arise for very general specifications of the incomplete information as long as the customers assign strictly positive probability to the H-action type. It is critical, however, that the customers do assign strictly positive probability to the H-action type. For example, in the product-choice game, the set of Nash equilibria of the repeated game is not significantly impacted by the possibility that the firm is either normal or the L-action type only. While reputation effects per se do not arise from the L-action type, it is still of interest to investigate the impact of such uncertainty on behavior using stronger equilibrium notions, such as Markov perfection (see Mailath and Samuelson [16]).

A Canonical Model

The Stage Game

The stage game is a two-player simultaneous-move finite game of public monitoring. Player i has action set Aᵢ, i = 1, 2. Pure actions for player i are denoted by aᵢ ∈ Aᵢ, and mixed actions are denoted by αᵢ ∈ Δ(Aᵢ), where Δ(Aᵢ) is the set of probability distributions over Aᵢ. Player 2's actions are public, while player 1's are potentially private. The public signal of player 1's action, denoted by y, is drawn from a finite set Y, with the probability that y is realized under the pure action profile a ∈ A ≡ A₁ × A₂ denoted by ρ(y | a). Player 1's ex post payoff from the action profile a and signal realization y is r₁(y, a), and so the ex ante (or expected) flow payoff is u₁(a) ≡ Σ_y r₁(y, a) ρ(y | a). Player 2's ex post payoff from the action profile a and signal realization y is r₂(y, a₂), and so the ex ante (or expected) flow payoff is u₂(a) ≡ Σ_y r₂(y, a₂) ρ(y | a).
Since player 2's ex post payoff is independent of player 1's actions, player 1's actions affect player 2's payoffs only through their impact on the distribution of the signals, and so on ex ante payoffs. While the ex post payoffs rᵢ play no explicit role in the analysis, they justify the informational assumptions to be made. In particular, the model requires that histories of signals and past actions are the only information players receive, and so it is important that stage game payoffs uᵢ are not informative about the action choice (and this is the critical feature delivered by the assumptions that ex ante payoffs are not observable and that player 2's ex post payoffs do not depend on a₁).
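As an illustration of how ex ante flow payoffs are induced, the following sketch averages ex post payoffs over signal realizations, uᵢ(a) = Σ_y rᵢ(y, a) ρ(y | a). The ex post payoff numbers and signal probabilities are hypothetical, not taken from the article; player 2's action is held fixed at h.

```python
# Signal distribution: "good" with prob p under H, with prob q < p under L
# (hypothetical values; player 2's action is fixed throughout).
p, q = 0.8, 0.3
rho = {"H": {"good": p, "bad": 1 - p}, "L": {"good": q, "bad": 1 - q}}

# Hypothetical ex post customer payoffs r2(y, a2 = h): depend on the signal only,
# not on player 1's unobserved action, as the informational assumptions require.
r2 = {"good": 4.0, "bad": -1.0}

def ex_ante_u2(a1):
    """Expected flow payoff: average ex post payoffs over signal realizations."""
    return sum(r2[y] * rho[a1][y] for y in ("good", "bad"))

print(ex_ante_u2("H"), ex_ante_u2("L"))
```

Note that the customer's realized payoff reveals only the signal, never the firm's action; the action matters solely through the signal distribution, exactly as in the text.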
Perfect monitoring is the special case where Y = A₁ and ρ(y | a) = 1 if y = a₁, and 0 otherwise. The results in this section hold under significantly weaker monitoring assumptions. In particular, it is not necessary that the actions of player 2 be public. If these are also imperfectly monitored, then the ex post payoff for player 1 is independent of player 2's actions. Since player 2 is short-lived, when player 2's actions are not public, it is then natural to also assume that the period-t player 2 does not know earlier player 2s' actions.

The Complete Information Repeated Game

The stage game is infinitely repeated. Player 1 is long-lived, with payoffs given by the normalized discounted value (1 − δ) Σ_{t=0}^{∞} δ^t u₁^t, where δ ∈ (0, 1) is the discount factor and u₁^t is player 1's period-t flow payoff. Player 1 is patient if δ is close to 1. As in our example, the role of player 2 is taken by a succession of short-lived players, each of whom plays the game only once (and so myopically optimizes). Player 1's set of private histories is H₁ ≡ ∪_{t=0}^{∞} (Y × A)^t and the set of public histories (which coincides with the set of player 2's histories) is H ≡ ∪_{t=0}^{∞} (Y × A₂)^t. If the game has perfect monitoring, histories h = (y⁰, a⁰, y¹, a¹, …, y^{t−1}, a^{t−1}) in which y^τ ≠ a₁^τ for some τ ≤ t − 1 arise with zero probability, independently of behavior, and so can be ignored. A strategy σ₁ for player 1 specifies a probability distribution over 1's pure action set for each possible private history, i.e., σ₁ : H₁ → Δ(A₁). A strategy σ₂ for player 2 specifies a probability distribution over 2's pure action set for each possible public history, i.e., σ₂ : H → Δ(A₂).

Definition 1 The strategy profile (σ₁, σ₂) is a Nash equilibrium if

1. there does not exist a strategy σ₁′ yielding a strictly higher payoff for player 1 when player 2 plays σ₂, and
2. in all periods t, after any history hᵗ ∈ H arising with positive probability under (σ₁, σ₂), σ₂(hᵗ) maximizes E[u₂(σ₁(h₁ᵗ), a₂) | hᵗ], where the expectation is taken over the period-t private histories that player 1 may have observed.
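The equilibrium of the introductory example can be checked with a one-shot deviation computation. A sketch (not part of the article's formal development) using the normalized payoffs of Fig. 2: under grim trigger, play Hh until a deviation, then Lℓ forever.

```python
# Grim trigger in the product-choice game: equilibrium payoff is 2 (Hh forever);
# a one-shot deviation to L yields 3 today and Nash-reversion payoff 1 after.
def grim_trigger_values(d):
    v_eq = 2.0                       # Hh every period, normalized
    v_dev = (1 - d) * 3 + d * 1      # deviate once, then Nash reversion
    return v_eq, v_dev

for d in (0.3, 0.5, 0.9):
    v_eq, v_dev = grim_trigger_values(d)
    print(d, v_eq >= v_dev)   # deviation deterred exactly when d >= 1/2
```

Solving 2 ≥ (1 − δ)·3 + δ·1 gives δ ≥ 1/2, the patience threshold cited throughout the example.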
The Incomplete Information Repeated Game

In the incomplete information game, the type of player 1 is unknown to player 2. A possible type of player 1 is denoted by ξ ∈ Ξ, where Ξ is a finite or countable set (see Fudenberg and Levine [12] for the uncountable case). Player 2's prior belief about 1's type is given by the distribution μ, with support Ξ. The set of types is partitioned into a set of payoff types Ξ₁, and a set of action types Ξ₂ ≡ Ξ \ Ξ₁. Payoff types maximize the average discounted value of payoffs, which depend on their type and which may be nonstationary,

   u₁ : A₁ × A₂ × Ξ₁ × ℕ₀ → ℝ.

Type ξ₀ ∈ Ξ₁ is the normal type of player 1, who happens to have a stationary payoff function, given by the stage game in the benchmark game of complete information,

   u₁(a, ξ₀, t) = u₁(a)   ∀a ∈ A, ∀t ∈ ℕ₀.

It is standard to think of the prior probability μ(ξ₀) as being relatively large, so the games of incomplete information are a seemingly small departure from the underlying game of complete information, though there is no requirement that this be the case. Action types (also called commitment or behavioral types) do not have payoffs, and simply play a specified repeated game strategy. For any repeated-game strategy from the complete information game, σ̂₁ : H₁ → Δ(A₁), denote by ξ(σ̂₁) the action type committed to the strategy σ̂₁. In general, a commitment type of player 1 can be committed to any strategy in the repeated game. If the strategy in question plays the same (pure or mixed) stage-game action in every period, regardless of history, that type is called a simple action type. For example, the H-action type in the product-choice game is a simple action type. The (simple action) type that plays the pure action a₁ in every period is denoted by ξ(a₁), and similarly the simple action type committed to α₁ ∈ Δ(A₁) is denoted by ξ(α₁). As will be seen soon, allowing for mixed action types is an important generalization from simple pure types. A strategy for player 1, also denoted by σ₁ : H₁ × Ξ → Δ(A₁), specifies for each type ξ ∈ Ξ a repeated game strategy such that for all ξ(σ̂₁) ∈ Ξ₂, the strategy σ̂₁ is specified. A strategy σ₂ for player 2 is as in the complete information game, i.e., σ₂ : H → Δ(A₂).

Definition 2 The strategy profile (σ₁, σ₂) is a Nash equilibrium of the incomplete information game if

1. for all ξ ∈ Ξ₁, there does not exist a repeated game strategy σ₁′ yielding a strictly higher payoff for payoff type ξ of player 1 when player 2 plays σ₂, and
2. in all periods t, after any history hᵗ ∈ H arising with positive probability under (σ₁, σ₂) and μ, σ₂(hᵗ) maximizes E[u₂(σ₁(h₁ᵗ, ξ), a₂) | hᵗ], where the expectation is taken over both the period-t private histories that player 1 may have observed and player 1's type.
Example 1 Consider the product-choice game (Fig. 2) under perfect monitoring. The firm is willing to commit to H to induce h from customers. This incentive to commit is best illustrated by considering a sequential version of the product-choice game: The firm first publicly commits to an effort, and then the customer chooses between h and ℓ, knowing the firm's choice. In this sequential game, the firm chooses H in the unique subgame perfect equilibrium. Since Stackelberg [23] was the first investigation of such leader-follower interactions, it is traditional to call H the Stackelberg action, and the H-action type of player 1 the Stackelberg type, with associated Stackelberg payoff 2. Suppose Ξ = {ξ₀, ξ(H), ξ(L)}. For δ ≥ 1/2, the grim trigger strategy profile of always playing Hh, with deviations punished by Nash reversion, is a subgame perfect equilibrium of the complete information game. Consider the following adaptation of this profile in the incomplete information game:
   σ₁(hᵗ, ξ) = H   if ξ = ξ(H), or if ξ = ξ₀ and a^τ = Hh for all τ < t;
               L   otherwise;

and

   σ₂(hᵗ) = h   if a^τ = Hh for all τ < t;
            ℓ   otherwise.
In other words, player 2 and the normal type of player 1 follow the strategies from the Nash-reversion equilibrium in the complete information game, and the action types ξ(H) and ξ(L) play their actions. This is a Nash equilibrium for δ ≥ 1/2 and μ(ξ(L)) < 1/2. The restriction on μ(ξ(L)) ensures that player 2 finds h optimal in period 0. Should player 2 ever observe L, then Bayes' rule causes her to place probability 1 on type ξ(L) (if L is observed in the first period) or the normal type (if L is first played in a subsequent period), making her participation in Nash reversion optimal. The restriction on δ ensures that Nash reversion provides sufficient incentive to make H optimal for the normal player 1. After observing a₁⁰ = H in period 0, player 2 assigns zero probability to ξ = ξ(L). However, the posterior probability that 2 assigns to the Stackelberg type does not converge to 1. In period 0, the prior probability is μ(ξ(H)). After one observation of H, the posterior increases to μ(ξ(H))/[μ(ξ(H)) + μ(ξ₀)], after which it is constant. By stipulating that an observation of H in a history in which L has previously been observed causes player 2 to place probability one on the normal type of player 1, a specification
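The belief dynamics described here can be traced with Bayes' rule. A sketch (with a hypothetical prior, not from the article): along the equilibrium path both the normal type and ξ(H) play H, so one observation of H eliminates ξ(L) and the posterior on ξ(H) is constant thereafter.

```python
# Three types: normal, the H-action type, and the L-action type.
prior = {"normal": 0.8, "H-type": 0.15, "L-type": 0.05}   # hypothetical prior
# Probability each type plays H in a period, along the equilibrium path:
prob_H = {"normal": 1.0, "H-type": 1.0, "L-type": 0.0}

def update(belief):
    """Posterior after observing H (Bayes' rule)."""
    joint = {t: belief[t] * prob_H[t] for t in belief}
    total = sum(joint.values())
    return {t: joint[t] / total for t in joint}

post = dict(prior)
for period in range(3):
    post = update(post)

# After one observation of H the L-type is ruled out, and the posterior on the
# H-action type is prior(H) / (prior(H) + prior(normal)) in every later period.
print(round(post["H-type"], 6))
```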
of player 2's beliefs that is consistent with sequentiality is obtained. As seen in the introduction, for δ close to 1, σ₁(hᵗ, ξ₀) = L for all hᵗ is not part of any Nash equilibrium.

The Reputation Bound

Which type would the normal type most like to be treated as? Player 1's pure-action Stackelberg payoff is defined as

   v₁* ≡ sup_{a₁∈A₁} min_{α₂∈B(a₁)} u₁(a₁, α₂),   (2)

where B(a₁) ≡ argmax_{a₂} u₂(a₁, a₂) is the set of player 2's myopic best replies to a₁. If the supremum is achieved by some action a₁*, that action is an associated Stackelberg action,

   a₁* ∈ argmax_{a₁∈A₁} min_{α₂∈B(a₁)} u₁(a₁, α₂).

This is a pure action to which player 1 would commit, if player 1 had the chance to do so (and hence the name "Stackelberg" action; see the discussion in Example 1), given that such a commitment induces a best response from player 2. If there is more than one such action for player 1, the action can be chosen arbitrarily. However, player 1 would typically prefer to commit to a mixed action. In the product-choice game, for example, a commitment by player 1 to mixing between H and L, with slightly larger probability on H, still induces player 2 to choose h and gives player 1 a larger payoff than a commitment to H. Define the mixed-action Stackelberg payoff as

   v₁** ≡ sup_{α₁∈Δ(A₁)} min_{α₂∈B(α₁)} u₁(α₁, α₂),   (3)

where B(α₁) ≡ argmax_{a₂} u₂(α₁, a₂) is the set of player 2's best responses to α₁. In the product-choice game, v₁* = 2, while v₁** = 5/2. Typically, the supremum is not achieved by any mixed action, and so there is no mixed-action Stackelberg type. However, there are mixed action types that, if player 2 is convinced she is facing such a type, will yield payoffs arbitrarily close to the mixed-action Stackelberg payoff. As with imperfect monitoring, simple mixed action types under perfect monitoring raise issues of monitoring, since a deviation by the normal type from the distribution α₁ of a mixed action type ξ(α₁) to some action in the support cannot be detected. However, when monitoring of the pure actions is perfect, it is possible to statistically detect deviations, and this will be enough to imply the appropriate reputation lower bound.

When monitoring is imperfect, the public signals are statistically informative about the actions of the long-lived player under the next assumption (Lemma 1).

Assumption 1 For all a₂ ∈ A₂, the collection of probability distributions {ρ(· | (a₁, a₂)) : a₁ ∈ A₁} is linearly independent.

This assumption is trivially satisfied in the perfect monitoring case. Reputation effects still exist when this assumption fails, but the bounds are more complicated to calculate (see [12] or Sect. 15.4.1 in [17]). Fixing an action for player 2, a₂, the mixed action α₁ implies the signal distribution Σ_{a₁} ρ(y | (a₁, a₂)) α₁(a₁).

Lemma 1 Suppose ρ satisfies Assumption 1. Then, if for some a₂,

   Σ_{a₁} ρ(y | (a₁, a₂)) α₁(a₁) = Σ_{a₁} ρ(y | (a₁, a₂)) α₁′(a₁)   ∀y,   (4)

then α₁ = α₁′.

Proof Suppose (4) holds for some a₂. Let R denote the |Y| × |A₁| matrix whose y-a₁ element is given by ρ(y | (a₁, a₂)) (so that the a₁-column is the probability distribution on Y implied by the action profile a₁a₂). Then, (4) can be written as Rα₁ = Rα₁′, or more simply as R(α₁ − α₁′) = 0. By Assumption 1, R has full column rank, and so x = 0 is the only vector x ∈ ℝ^{|A₁|} solving Rx = 0.

Consequently, if player 2 believes that the long-lived player's behavior implies a distribution over the signals close to the distribution implied by some particular action α₁′, then player 2 must believe that the long-lived player's action is also close to α₁′. Since A₂ is finite, this then implies that when player 2 is best responding to some belief about the long-lived player's behavior implying a distribution over signals sufficiently close to the distribution implied by α₁′, then player 2 is in fact best responding to α₁′.

We are now in a position to state the main reputation bound result. Let v̲₁(ξ₀, μ, δ) be the infimum over the set of the normal player 1's payoffs in any (pure or mixed) Nash equilibrium in the incomplete information repeated game, given the distribution μ over types and the discount factor δ.

Proposition 1 (Fudenberg and Levine [11,12]) Suppose ρ satisfies Assumption 1 and let ξ̂ denote the simple action type that always plays α̂₁ ∈ Δ(A₁). Suppose μ(ξ₀), μ(ξ̂) > 0. For every η > 0, there is a value K such that, for all δ,
Reputation Effects
all δ,

$$ v_1(\xi_0, \mu, \delta) \;\ge\; (1-\eta)\,\delta^{K} \min_{\alpha_2 \in B(\hat{\alpha}_1)} u_1(\hat{\alpha}_1, \alpha_2) \;+\; \bigl(1 - (1-\eta)\delta^{K}\bigr) \min_{a \in A} u_1(a)\,. \qquad (5) $$
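To get a feel for the bound (5), one can evaluate its right side numerically for the product-choice game. The payoff numbers below follow the standard specification of that game (u1(H,h) = 2, u1(H,l) = 0, u1(L,h) = 3, u1(L,l) = 1), and the values of η and K are purely illustrative, not derived from the proof; this is a numerical sketch, not part of the formal argument.

```python
def reputation_bound(eta, delta, K, stackelberg=2.0, worst=0.0):
    """Right side of (5) when the commitment type plays the pure
    Stackelberg action H, so min over B(alpha_hat) gives v1* = 2
    and the worst stage payoff is u1(H, l) = 0 (assumed payoffs)."""
    w = (1 - eta) * delta ** K      # weight on the Stackelberg payoff
    return w * stackelberg + (1 - w) * worst

# As delta -> 1 (player 1 very patient), the bound approaches
# (1 - eta) * v1*, i.e., nearly the full Stackelberg payoff of 2.
for delta in (0.9, 0.99, 0.9999):
    print(delta, reputation_bound(eta=0.01, delta=delta, K=100))
```

With η = 0.01 and K = 100, the bound is negligible at δ = 0.9 but exceeds 1.9 at δ = 0.9999, illustrating why the proposition has bite only for very patient players.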
This immediately yields the pure-action Stackelberg reputation bound. Fix ε > 0. Taking α̂₁ in the proposition as the degenerate mixture that plays the Stackelberg action a₁* with probability 1, Eq. (5) becomes

$$ v_1(\xi_0, \mu, \delta) \;\ge\; (1-\eta)\delta^{K} v_1^{*} + \bigl(1-(1-\eta)\delta^{K}\bigr)\min_{a \in A} u_1(a) \;\ge\; v_1^{*} - \bigl(1-(1-\eta)\delta^{K}\bigr)\,2M\,, $$

where M ≡ max_a |u₁(a)|. This last expression is at least as large as v₁* − ε when η < ε/(2M) and δ is sufficiently close to 1. The mixed-action Stackelberg reputation bound is also covered:

Corollary 1 Suppose ρ satisfies Assumption 1 and μ assigns positive probability to some sequence of simple types {ξ(α₁ᵏ)}_{k=1}^∞ with each α₁ᵏ in Δ(A₁) satisfying

$$ v_1^{**} = \lim_{k \to \infty} \; \min_{\alpha_2 \in B(\alpha_1^k)} u_1(\alpha_1^k, \alpha_2)\,. $$

For all ε′ > 0, there exists δ̄ < 1 such that for all δ ∈ (δ̄, 1),
$$ v_1(\xi_0, \mu, \delta) \;\ge\; v_1^{**} - \varepsilon'\,. $$

The remainder of this subsection outlines a proof of Proposition 1. Fix a strategy profile (σ₁, σ₂) (which may be Nash, but at this point of the discussion, need not be). The beliefs μ then induce a probability distribution P on the set of outcomes, which is the set of possible infinite histories (denoted by h∞) and realized types, (Y × A)∞ × Ξ. The probability measure P describes how the short-lived players believe the game will evolve, given their prior beliefs about the types of the long-lived player. Let P̂ denote the probability distribution on the set of outcomes induced by (σ₁, σ₂) and the action type ξ̂. The probability measure P̂ describes how the short-lived players believe the game will evolve if the long-lived player's type is ξ̂. Finally, let P̃ denote the probability distribution on the set of outcomes induced by (σ₁, σ₂) conditioning on the long-lived player's type not being the action type ξ̂. Then, P = μ̂P̂ + (1 − μ̂)P̃, where μ̂ ≡ μ(ξ̂).

The discussion after Lemma 1 implies that the optimal behavior of the short-lived player in period t is determined by that player's beliefs over the signal realizations in that period. These beliefs can be viewed as a one-step ahead prediction of the signal y that will be realized conditional on the history hᵗ, P(y | hᵗ). Let μ̂ₜ(hᵗ) = P(ξ̂ | hᵗ) denote the posterior probability after observing hᵗ that the short-lived player assigns to the long-lived player having type ξ̂. Note also that if the long-lived player is the action type ξ̂, then the true probability of the signal y is P̂(y | hᵗ) = ρ(y | (α̂₁, σ₂(hᵗ))). Then,

$$ P(y \mid h^t) \;=\; \hat{\mu}_t(h^t)\,\hat{P}(y \mid h^t) + \bigl(1 - \hat{\mu}_t(h^t)\bigr)\,\tilde{P}(y \mid h^t)\,. $$

The key step in the proof of Proposition 1 is a statistical result on merging. The following lemma essentially says that the short-lived players cannot be surprised too many times. Note first that an infinite public history h∞ can be thought of as a sequence of ever longer finite public histories hᵗ. Consider the collection of infinite public histories with the property that player 2 often sees histories hᵗ that lead to very different one-step ahead predictions about the signals under P̃ and under P̂ and have a "low" posterior that the long-lived player is ξ̂. The lemma asserts that if the long-lived player is in fact the action type ξ̂, this collection of infinite public histories has low probability. Seeing the signals more likely under ξ̂ leads the short-lived players to increase the posterior probability on ξ̂. The posterior probability fails to converge to 1 under P̂ only if the play of the types different from ξ̂ leads, on average, to a signal distribution similar to that implied by ξ̂. For the purely statistical statement and its proof, see Section 15.4.2 in [17].

Lemma 2 For all ε, ψ > 0 and μ* ∈ (0, 1], there exists a positive integer K such that for all μ(ξ̂) ∈ [μ*, 1), for every strategy σ₁ : H₁ → Δ(A₁) and σ₂ : H → Δ(A₂),

$$ \hat{P}\Bigl( \Bigl\{ h^{\infty} : \Bigl|\Bigl\{ t \ge 1 : \bigl(1 - \hat{\mu}_t(h^t)\bigr) \max_{y} \bigl| \tilde{P}(y \mid h^t) - \hat{P}(y \mid h^t) \bigr| \ge \varepsilon \Bigr\}\Bigr| \ge K \Bigr\} \Bigr) \;\le\; \psi\,. \qquad (6) $$
Note that the bound K holds for all strategy profiles (σ₁, σ₂) and all prior probabilities μ(ξ̂) ∈ [μ*, 1). This allows us to bound equilibrium payoffs.

Proof of Proposition 1 Fix η > 0. From Lemma 1, by choosing ε sufficiently small in Lemma 2, with P̂-probability at least 1 − η, there are at most K periods in which the short-lived players are not best responding to α̂₁. Since a deviation by the long-lived player to the simple strategy of always playing α̂₁ induces the same distribution on public histories as P̂, the long-lived player's expected payoff from such a deviation is bounded below by the right side of (5).
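The merging logic behind Lemma 2 can be illustrated with a small Monte-Carlo sketch: when the signals are in fact generated by the commitment type, Bayes' rule drives the posterior on that type upward. The binary signal and the particular probabilities below are assumptions chosen for illustration, not taken from the model above.

```python
import random

random.seed(0)                 # deterministic illustration
p_hat, p_other = 0.9, 0.5      # assumed prob. of signal y = 1 under the
                               # commitment action vs. the other types' play
mu = 0.1                       # prior on the commitment type

for t in range(200):
    y = 1 if random.random() < p_hat else 0          # true type: commitment
    lik_hat = p_hat if y else 1 - p_hat
    lik_other = p_other if y else 1 - p_other
    mu = mu * lik_hat / (mu * lik_hat + (1 - mu) * lik_other)

print(round(mu, 4))   # posterior driven toward 1, as merging predicts
```

Periods in which the two one-step-ahead predictions differ substantially while the posterior is still low are exactly the "surprises" that Lemma 2 says can happen (with high probability) at most K times.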
Temporary Reputation Effects

Under perfect monitoring, there are often pooling equilibria in which the normal and some action type of player 1 behave identically on the equilibrium path (as in Example 1). Deviations on the part of the normal player 1 are deterred by the prospect of the resulting punishment. Under imperfect monitoring, such pooling equilibria do not exist. The normal and action types may play identically for a long period of time, but the normal type always eventually has an incentive to cheat at least a little on the commitment strategy, contradicting player 2's belief that player 1 will exhibit commitment behavior. Player 2 must then eventually learn player 1's type.

In addition to Assumption 1, disappearing reputation effects require full support monitoring.

Assumption 2 For all a ∈ A, y ∈ Y, ρ(y | a) > 0.

This assumption implies that Bayes' rule determines the beliefs of player 2 about the type of player 1 after all histories. Suppose there are only two types of player 1, the normal type ξ₀ and a simple action type ξ̂, where ξ̂ = ξ(α̂₁) for some α̂₁ ∈ Δ(A₁). The analysis is extended to many commitment types in Section 6.1 in Cripps et al. [8]. It is convenient to denote a strategy for player 1 as a pair of functions σ̃₁ and σ̂₁ (so σ̂₁(h₁ᵗ) = α̂₁ for all h₁ᵗ ∈ H₁), the former for the normal type and the latter for the action type. Recall that P ∈ Δ(Ω) is the unconditional probability measure induced by the prior μ and the strategy profile (σ̂₁, σ̃₁, σ₂), while P̂ is the measure induced by conditioning on ξ̂. Since {ξ₀} = Ξ \ {ξ̂}, P̃ is the measure induced by conditioning on ξ₀. That is, P̂ is induced by the strategy profile σ̂ = (σ̂₁, σ₂) and P̃ by σ̃ = (σ̃₁, σ₂), describing how play evolves when player 1 is the commitment and normal type, respectively. The action of the commitment type satisfies the following assumption.
Assumption 3 Player 2 has a unique stage-game best response to α̂₁ (denoted by â₂), and α̂ ≡ (α̂₁, â₂) is not a stage-game Nash equilibrium.

Let σ̂₂ denote the strategy of playing the unique best response â₂ to α̂₁ in each period independently of history. Since α̂ is not a stage-game Nash equilibrium, (σ̂₁, σ̂₂) is not a Nash equilibrium of the complete information infinite horizon game.

Proposition 2 (Cripps, Mailath and Samuelson [8]) Suppose the monitoring distribution ρ satisfies Assumptions 1 and 2, and the commitment action α̂₁ satisfies Assumption 3. In any Nash equilibrium of the game with incomplete information, the posterior probability assigned by player 2 to the commitment type, μ̂ₜ, converges to zero under P̃, i.e.,

$$ \hat{\mu}_t(h^t) \to 0\,, \qquad \tilde{P}\text{-a.s.} $$
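Proposition 2's conclusion, that the posterior on the commitment type collapses under P̃, can be sketched by feeding the same Bayesian update signals generated by the normal type's (different) behavior; the binary signal and its probabilities are again illustrative assumptions, not part of the model.

```python
import random

random.seed(1)
p_hat, p_norm = 0.9, 0.7   # assumed prob. of signal y = 1 under the
                           # commitment action vs. under the normal type's play
mu = 0.5                   # prior on the commitment type

for t in range(1000):
    y = 1 if random.random() < p_norm else 0   # true type: NORMAL
    lik_hat = p_hat if y else 1 - p_hat
    lik_norm = p_norm if y else 1 - p_norm
    mu = mu * lik_hat / (mu * lik_hat + (1 - mu) * lik_norm)

print(round(mu, 6))   # posterior on the commitment type collapses toward 0
```

As long as the two types generate statistically different signal distributions, the martingale belief cannot settle at an interior value, which is the content of the merging argument below.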
The intuition is straightforward: Suppose there is a Nash equilibrium of the incomplete information game in which both the normal and the action type receive positive probability in the limit (on a positive probability set of histories). On this set of histories, player 2 cannot distinguish between signals generated by the two types (otherwise player 2 could ascertain which type she is facing), and hence must believe that the normal and action types are playing the same strategies on average. But then player 2 must play a best response to this strategy, and hence to the action type. Since the action type's behavior is not a best response for the normal type (to this player 2 behavior), player 1 must eventually find it optimal not to play the action-type strategy, contradicting player 2's beliefs.

Assumption 3 requires a unique best response to α̂₁. For example, in the product-choice game, every action for player 2 is a best response to player 1's mixture α₁′ that assigns equal probability to H and L. This indifference can be exploited to construct an equilibrium in which (the normal) player 1 plays α₁′ after every history (Section 7.6.2 in [17]). This will still be an equilibrium in the game of incomplete information in which the commitment type plays α₁′, with the identical play of the normal and commitment types ensuring that player 2 never learns player 1's type. In contrast, player 2 has a unique best response to any other mixture on the part of player 1. Therefore, if the commitment type is committed to any mixed action other than α₁′, player 2 will eventually learn player 1's type.

As in Proposition 1, a key step in the proof of Proposition 2 is a purely statistical result on updating.
Either player 2's expectation (given her history) of the strategy played by the normal type, Ẽ[σ̃₁ₜ | hᵗ] (where Ẽ denotes expectation with respect to P̃), is in the limit identical to the strategy played by the action type (α̂₁), or player 2's posterior probability that player 1 is the action type (μ̂ₜ(hᵗ)) converges to zero (given that player 1 is indeed normal). This is a merging argument, closely related to Lemma 2. If the distributions generating player 2's signals are different for the normal and action type, then these signals provide information that player 2 will use in updating her posterior beliefs about the type she faces. This (convergent, since beliefs are a martingale) belief can converge to an interior probability only if the distributions generating the signals are asymptotically uninformative, which requires that they be asymptotically identical.
Lemma 3 Suppose the monitoring distribution ρ satisfies Assumptions 1 and 2. Then in any Nash equilibrium,

$$ \lim_{t \to \infty} \hat{\mu}_t \max_{a_1} \bigl| \hat{\alpha}_1(a_1) - \tilde{E}[\tilde{\sigma}_{1t}(a_1) \mid h^t] \bigr| = 0\,, \qquad \tilde{P}\text{-a.s.} \qquad (7) $$
Given Proposition 2, it should be expected that continuation play converges to an equilibrium of the complete information game, and this is indeed the case; see Theorem 2 in [8] for the formal statement.

Proposition 2 leaves open the possibility that for any period T, there may be equilibria in which uncertainty about player 1's type survives beyond T, even though such uncertainty asymptotically disappears in any equilibrium. This possibility cannot arise. The existence of a sequence of Nash equilibria with uncertainty about player 1's type persisting beyond ever larger periods T would imply the (contradictory) existence of a limiting Nash equilibrium in which uncertainty about player 1's type persists.

Proposition 3 (Cripps, Mailath and Samuelson [9]) Suppose the monitoring distribution ρ satisfies Assumptions 1 and 2, and the commitment action α̂₁ satisfies Assumption 3. For all ε > 0, there exists T such that for any Nash equilibrium of the game with incomplete information,

$$ \tilde{P}\bigl(\hat{\mu}_t < \varepsilon\,, \ \forall t > T\bigr) > 1 - \varepsilon\,. $$

Example 2 Recall that in the product-choice game, the unique player 2 best response to H is to play h, and Hh is not a stage-game Nash equilibrium. Proposition 1 ensures that the normal player 1's expected value in the repeated game of incomplete information with the H-action type is arbitrarily close to 2, when player 1 is very patient. In particular, if the normal player 1 plays H in every period, then player 2 will at least eventually play her best response of h. If the normal player 1 persisted in mimicking the action type by playing H in each period, this behavior of player 2 would persist indefinitely. It is the feasibility of such a strategy that lies at the heart of the reputation bounds on expected payoffs. However, this strategy is not optimal. Instead, player 1 does even better by attaching some probability to L, occasionally reaping the rewards of his reputation by earning a stage-game payoff even larger than 2.
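The temptation described in Example 2 is simple stage-game arithmetic: whenever player 2 plays h, shifting probability from H to L raises player 1's current payoff. A one-line check under the standard product-choice payoffs (u1(H,h) = 2 and u1(L,h) = 3, assumed here):

```python
def u1_against_h(p_L):
    """Player 1's expected stage payoff against h when mixing p_L on L
    (assumed payoffs: u1(H,h) = 2, u1(L,h) = 3)."""
    return (1 - p_L) * 2 + p_L * 3

# Any weight on L raises today's payoff above the Stackelberg payoff of 2.
print(u1_against_h(0.0), u1_against_h(0.1))
```

The repeated-game cost, of course, is that under imperfect monitoring such deviations are statistically detectable and erode the posterior, which is exactly the tension this subsection describes.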
The result of such equilibrium behavior, however, is that player 2 must eventually learn player 1’s type. The continuation payoff is then bounded below 2 (recall (1)). Reputation effects arise when player 2 is uncertain about player 1’s type, and there may well be a long period of time during which player 2 is sufficiently uncertain of player 1’s type (relative to the discount factor), and in which play does not resemble an equilibrium of the complete information game. Eventually, however, such behavior must
give way to a regime in which player 2 is (correctly) convinced of player 1's type. For any prior probability μ̂ that the long-lived player is the commitment type and for any ε > 0, there is a discount factor δ sufficiently large that player 1's expected payoff is close to the commitment-type payoff. This holds no matter how small μ̂. However, for any fixed δ and in any equilibrium, there is a time at which the posterior probability attached to the commitment type has dropped below the corresponding critical value of μ̂, becoming too small (relative to δ) for reputation effects to operate.

A reasonable response to the results on disappearing reputation effects is that a model of long-run reputations should incorporate some mechanism by which the uncertainty about types is continually replenished. For example, Holmström [13], Cole, Dow and English [6], Mailath and Samuelson [16], and Phelan [19] assume that the type of the long-lived player is governed by a stochastic process rather than being determined once and for all at the beginning of the game. In such a situation, reputation effects can indeed have long-run implications.

Reputation as a State

The posterior probability that short-lived players assign to player 1 being ξ̂ is sometimes interpreted as player 1's reputation, particularly if ξ̂ is the Stackelberg type. When Ξ contains only the normal type and ξ̂, the posterior belief μ̂ₜ is a state variable of the game, and attention is sometimes restricted to Markov strategies (i.e., strategies that only depend on histories through their impact on the posterior beliefs of the short-lived players). An informative example is Benabou and Laroque [2], who study the Markov perfect equilibria of a game in which the uninformed players respond continuously to their beliefs. They show that the informed player eventually reveals his type in any Markov perfect equilibrium.
On the other hand, Markov equilibria need not exist in finitely repeated reputation games (Section 17.3 in [17]). The literature on reputation effects has typically not restricted attention to Markov strategies, since the results do not require the restriction.

Two Long-Lived Players

The introduction of nontrivial intertemporal incentives for the uninformed player significantly reduces reputation effects. For example, when only simple Stackelberg types are considered, the Stackelberg payoff may not bound equilibrium payoffs. The situation is further complicated by the possibility of non-simple commitment types (i.e., types that follow nonstationary strategies).
Consider applying the logic from Sect. "The Reputation Bound" to obtain the Stackelberg reputation bound when both players are long-lived and player 1's characteristics are unknown, under perfect monitoring. The first step is to demonstrate that, if the normal player 1 persistently plays the Stackelberg action and there exists a type committed to that action, then player 2 must eventually attach high probability to the event that the Stackelberg action is played in the future. This argument, a simple version of Lemma 2, depends only upon the properties of Bayesian belief revision, independently of whether the person holding the beliefs is a long-lived or short-lived player. When player 2 is short-lived, the next step is to note that if she expects the Stackelberg action, then she will play a best response to this action. If player 2 is instead a long-lived player, she may have an incentive to play something other than a best response to the Stackelberg type. The key step when working with two long-lived players is thus to establish conditions under which, as player 2 becomes increasingly convinced that the Stackelberg action will appear, player 2 must eventually play a best response to that action. One might begin such an argument by observing that, as long as player 2 discounts, any losses from not playing a current best response must be recouped within a finite length of time. But if player 2 is "very" convinced that the Stackelberg action will be played not only now but for sufficiently many periods to come, there will be no opportunity to accumulate subsequent gains, and hence player 2 might just as well play a stage-game best response. Once it is shown that player 2 is best responding to the Stackelberg action, the remainder of the argument proceeds as in the case of a short-lived player 2. The normal player 1 must eventually receive very nearly the Stackelberg payoff in each period of the repeated game.
By making player 1 sufficiently patient (relative to player 2, so that discount factors differ), this consideration dominates player 1's payoffs, putting a lower bound on the latter. Hence, the obvious handling of discount factors is to fix player 2's discount factor δ₂, and to consider the limit as player 1 becomes patient, i.e., δ₁ approaching one. This intuition misses the following possibility. Player 2 may be choosing something other than a best response to the Stackelberg action out of fear that a current best response may trigger a disastrous future punishment. This punishment would not appear if player 2 faced the Stackelberg type, but player 2 can be made confident only that she faces the Stackelberg action, not the Stackelberg type. The fact that the punishment lies off the equilibrium path makes it difficult to assuage player 2's fear of such punishments. Short-lived players in the same situation are similarly uncertain about the future ramifications of best responding, but being short-lived, this uncertainty does not affect their behavior.

Consequently, reputation effects are typically weak with two long-lived players under perfect monitoring: Celentani, Fudenberg, Levine and Pesendorfer [3] and Cripps and Thomas [7] describe examples with only the normal and the Stackelberg types of player 1, in which the future play of the normal player 1 is used to punish player 2 for choosing a best response to the Stackelberg action when she is not supposed to, and player 1's payoff is significantly below the Stackelberg payoff. Moreover, the robustness of reputation effects to additional types beyond the Stackelberg type, a crucial feature of settings with one long-lived player, does not hold with two long-lived players. Schmidt [21] showed that the possibility of a "punishment" type can prevent player 2 from best responding to the Stackelberg action, while Evans and Thomas [10] showed that the Stackelberg bound is valid if, in addition to the Stackelberg type, there is an action type who punishes player 2 for not behaving appropriately (see Sections 16.1 and 16.5 in [17]). Imperfect monitoring (of both players' actions), on the other hand, rescues reputation effects. With a sufficiently rich set of commitment types, player 1 can be assured of at least his Stackelberg payoff. Indeed, player 1 can often be assured of an even higher payoff, in the presence of commitment types who play nonstationary strategies [3]. At the same time, these reputation effects are temporary (Theorem 2 in [9]). Finally, there is a literature on reputation effects in bargaining games (see [1,4,5,20]), where the issues described above are further complicated by the need to deal with the bargaining model itself.
Future Directions The detailed structure of equilibria of the incomplete information game is not well understood, even for the canonical game of Sect. “A Canonical Model”. A more complete description of the structure of equilibria is needed. While much of the discussion was phrased in terms of the Stackelberg type, Proposition 1 provides a reputation bound for any action type. While in some settings, it is natural that the uninformed players assign strictly positive probability to the Stackelberg type, it is not natural in other settings. A model endogenizing the nature of action types would be an important addition to the reputation literature.
Finally, while the results on reputation effects with two long-lived players are discouraging, there is still the possibility that some modification of the model will rescue reputation effects in this important setting.

Acknowledgments

I thank Eduardo Faingold, KyungMin Kim, Antonio Penta, and Larry Samuelson for helpful comments.

Bibliography

1. Abreu D, Gul F (2000) Bargaining and Reputation. Econometrica 68(1):85–117
2. Benabou R, Laroque G (1992) Using Privileged Information to Manipulate Markets: Insiders, Gurus, and Credibility. Q J Econ 107(3):921–958
3. Celentani M, Fudenberg D, Levine DK, Pesendorfer W (1996) Maintaining a Reputation Against a Long-Lived Opponent. Econometrica 64(3):691–704
4. Chatterjee K, Samuelson L (1987) Bargaining with Two-Sided Incomplete Information: An Infinite Horizon Model with Alternating Offers. Rev Econ Stud 54(2):175–192
5. Chatterjee K, Samuelson L (1988) Bargaining with Two-Sided Incomplete Information: The Unrestricted Offers Case. Oper Res 36(4):605–638
6. Cole HL, Dow J, English WB (1995) Default, Settlement, and Signalling: Lending Resumption in a Reputational Model of Sovereign Debt. Int Econ Rev 36(2):365–385
7. Cripps MW, Thomas JP (1997) Reputation and Perfection in Repeated Common Interest Games. Games Econ Behav 18(2):141–158
8. Cripps MW, Mailath GJ, Samuelson L (2004) Imperfect Monitoring and Impermanent Reputations. Econometrica 72(2):407–432
9. Cripps MW, Mailath GJ, Samuelson L (2007) Disappearing Private Reputations in Long-Run Relationships. J Econ Theory 134(1):287–316
10. Evans R, Thomas JP (1997) Reputation and Experimentation in Repeated Games with Two Long-Run Players. Econometrica 65(5):1153–1173
11. Fudenberg D, Levine DK (1989) Reputation and Equilibrium Selection in Games with a Patient Player. Econometrica 57(4):759–778
12. Fudenberg D, Levine DK (1992) Maintaining a Reputation When Strategies Are Imperfectly Observed. Rev Econ Stud 59(3):561–579
13. Holmström B (1982) Managerial Incentive Problems: A Dynamic Perspective. In: Essays in Economics and Management in Honour of Lars Wahlbeck. Swedish School of Economics and Business Administration, Helsinki, pp 209–230. Published in: Rev Econ Stud 66(1):169–182
14. Kreps D, Wilson R (1982) Reputation and Imperfect Information. J Econ Theory 27:253–279
15. Kreps D, Milgrom PR, Roberts DJ, Wilson R (1982) Rational Cooperation in the Finitely Repeated Prisoner's Dilemma. J Econ Theory 27:245–252
16. Mailath GJ, Samuelson L (2001) Who Wants a Good Reputation? Rev Econ Stud 68(2):415–441
17. Mailath GJ, Samuelson L (2006) Repeated Games and Reputations: Long-Run Relationships. Oxford University Press, New York
18. Milgrom PR, Roberts DJ (1982) Limit Pricing and Entry under Incomplete Information: An Equilibrium Analysis. Econometrica 50:443–459
19. Phelan C (2006) Public Trust and Government Betrayal. J Econ Theory 130(1):27–43
20. Schmidt KM (1993) Commitment Through Incomplete Information in a Simple Repeated Bargaining Game. J Econ Theory 60(1):114–139
21. Schmidt KM (1993) Reputation and Equilibrium Characterization in Repeated Games of Conflicting Interests. Econometrica 61(2):325–351
22. Selten R (1978) Chain-store paradox. Theory Decis 9:127–159
23. Stackelberg HV (1934) Marktform und Gleichgewicht. Springer, Vienna
Reversible Cellular Automata
Reversible Cellular Automata
KENICHI MORITA
Hiroshima University, Higashi-Hiroshima, Japan

Article Outline

Glossary
Definition of the Subject
Introduction
Reversible Cellular Automata
Simulating Irreversible Cellular Automata by Reversible Ones
1-D Universal Reversible Cellular Automata
2-D Universal Reversible Cellular Automata
Future Directions
Bibliography

Glossary

Cellular automaton A cellular automaton (CA) is a system consisting of a large (theoretically, infinite) number of finite automata, called cells, which are connected uniformly in a space. Each cell transits its state depending on the states of itself and the cells in its neighborhood. Thus the state transition of a cell is specified by a local function. Applying the local function to all the cells in the space synchronously, the transition of a configuration (i.e., a whole state of the cellular space) is induced. Such a transition function is called a global function. A CA is regarded as a kind of dynamical system that can deal with various kinds of spatio-temporal phenomena.

Cellular automaton with block rules A CA with block rules was proposed by Margolus [18], and it is often called a CA with Margolus neighborhood. The cellular space is divided into infinitely many blocks of the same size (in the two-dimensional case, e.g., 2 × 2). A local transition function consisting of "block rules", which is a mapping from a block state to a block state, is applied to all the blocks in parallel. At the next time step, the block division pattern is shifted by some fixed amount (e.g., in the north-east direction by one cell), and the same local function is applied to them. This model of CA is convenient for designing a reversible CA, because if the local transition function is injective, then the resulting CA is reversible.

Partitioned cellular automaton A partitioned cellular automaton (PCA) is a framework for designing a reversible CA.
It is a subclass of a usual CA where each cell is partitioned into several parts, whose number is equal to the neighborhood size. Each part of a cell has its own state set, and can be regarded as an output port to a specified neighboring cell. Depending only on the corresponding parts (not on the entire states) of the neighboring cells, the next state of each cell is determined by a local function. We can see that if the local function is injective, then the resulting PCA is reversible. Hence, a PCA makes it feasible to construct a reversible CA.

Reversible cellular automaton A reversible cellular automaton (RCA) is defined as one whose global function is injective (i.e., one-to-one). It can be regarded as a kind of discrete model of a reversible physical space. It is in general difficult to construct an RCA with a desired property such as computation-universality. Therefore, the frameworks of a CA with Margolus neighborhood, a partitioned cellular automaton, and others are often used to design RCAs.

Universal cellular automaton A CA is called computationally universal if it can compute any recursive function when given an appropriate initial configuration. Equivalently, it is also defined as a CA that can simulate a universal Turing machine. Universality of RCAs can be proved by simulating other systems such as arbitrary (irreversible) CAs, reversible Turing machines, reversible counter machines, and reversible logic elements and circuits, which have already been known to be universal.

Definition of the Subject

Reversible cellular automata (RCAs) are defined as cellular automata (CAs) with an injective global function. Every configuration of an RCA has exactly one previous configuration, and thus RCAs are "backward deterministic" CAs. The notion of reversibility originally comes from physics. It is one of the fundamental microscopic physical laws of Nature. In this sense, an RCA is thought of as an abstract model of a physically reversible space as well as a computing model. It is very important to investigate how computation can be carried out efficiently and elegantly in a system having reversibility.
This is because future computing devices will surely be of nanoscale size. In this article, we mainly discuss the properties of RCAs from the computational aspect. In spite of the strong constraint of reversibility, RCAs have a very rich computing ability. We can see that even very simple RCAs have universal computing ability. We can also recognize that, in some reversible cellular automata, computation is carried out in a very different manner from conventional computing systems; thus they may give new ways and concepts for future computing.
Introduction

Problems related to injectivity and surjectivity of global functions of CAs were first studied by Moore [22] and Myhill [33] in the Garden-of-Eden problem. A Garden-of-Eden configuration is one that can exist only at time 0. Therefore, if a CA has such a configuration, then its global function is not surjective, and vice versa. They proved the following Garden-of-Eden theorem: A CA has a Garden-of-Eden configuration if and only if it has an "erasable configuration". After that, many researchers studied injectivity and surjectivity of global functions more generally [1,19,20,34]. In particular, Richardson [34] showed that if a CA is injective, then it is surjective. Toffoli [36] first studied reversible (i.e., injective) CAs from the computational viewpoint. He showed that every k-dimensional irreversible CA can be simulated by a (k+1)-dimensional RCA. Hence, a two-dimensional RCA has universal computing ability. Since then, extensive studies on RCAs have been carried out. After the pioneering work of Bennett [3] on reversible Turing machines, several models of reversible computing were proposed besides RCAs. They are, for example, reversible logic circuits [8,26,37], the Billiard Ball Model of computing [8], and reversible counter machines [25]. Most of these models have a close relation to physical reversibility. In fact, reversible computing plays an important role when considering inevitable power dissipation in computing [3,4,5,16,38]. It is also one of the bases of quantum computing (see, e.g., [9]), because an evolution of a quantum system is a reversible process. In this article, we discuss how RCAs can have universal computing ability, and how simple they can be. Since reversibility is one of the fundamental microscopic properties of physical systems, it is important to investigate whether we can use such physical mechanisms directly for computation. An RCA is a useful framework to formalize and investigate these problems.
Since this article is not an exhaustive survey, many interesting topics related to RCAs, such as the complexity of RCAs [35] and relations to quantum CAs (e.g., [41]), are omitted. An outline of the following sections is as follows. In Sect. "Reversible Cellular Automata", we give basic definitions of RCAs, and design methods for obtaining RCAs. In Sect. "Simulating Irreversible Cellular Automata by Reversible Ones", it is shown how irreversible CAs are simulated by RCAs. In Sect. "1-D Universal Reversible Cellular Automata", two kinds of computation-universal one-dimensional RCAs are shown. In Sect. "2-D Universal Reversible Cellular Automata", several universal two-dimensional RCAs with simple local functions are shown. In
Sect. "Future Directions", we discuss future directions and open problems as well as some other problems on RCAs not treated in the previous sections.

Reversible Cellular Automata

Formal Definitions

We first give definitions on conventional cellular automata, and then their reversibility.

Definition 1 A deterministic k-dimensional (k-D) m-neighbor cellular automaton (CA) is a system defined by

A = (Z^k, Q, (n1, …, nm), f, #),

where Z is the set of all integers (hence Z^k is the set of all k-dimensional points with integer coordinates at which cells are placed), Q is a non-empty finite set of states of each cell, (n1, …, nm) is an element of (Z^k)^m called a neighborhood (m = 1, 2, …), f : Q^m → Q is a local function, and # ∈ Q is a quiescent state satisfying f(#, …, #) = #.

A k-dimensional configuration over the set Q is a mapping α : Z^k → Q. Let Conf_k(Q) denote the set of all k-dimensional configurations over Q, i.e., Conf_k(Q) = {α | α : Z^k → Q}. If k is understood, we write it as Conf(Q). We say that a configuration α is finite iff the set {x | x ∈ Z^k ∧ α(x) ≠ #} is finite. Otherwise, it is called infinite. The global function F : Conf(Q) → Conf(Q) of A is defined as the one that satisfies the following formula:

∀α ∈ Conf(Q), x ∈ Z^k : F(α)(x) = f(α(x + n1), …, α(x + nm)).

Definition 2 Let A = (Z^k, Q, (n1, …, nm), f, #) be a CA. (1) A is called an injective CA iff F is one-to-one. (2) A is called an invertible CA iff there is a CA A′ = (Z^k, Q, N′, f′, #) that satisfies the following condition:

∀α, β ∈ Conf(Q) : F(α) = β iff F′(β) = α,

where F and F′ are the global functions of A and A′, respectively. The following theorem can be derived from the results independently proved by Hedlund [10] and Richardson [34].

Theorem 3 (Hedlund [10] and Richardson [34]) Let A be a CA. A is injective iff it is invertible.
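As a concrete illustration of Definitions 1 and 2, the global function of a one-dimensional radius-1 CA can be sketched in a few lines, here restricted for tractability to cyclic configurations of a fixed size rather than the full Z-indexed configurations of the definitions. All function names and the example rules below are illustrative, not from the article:

```python
from itertools import product

def step(config, f):
    """One application of the global function F of a 1-D radius-1 CA
    (Definition 1 with neighborhood (-1, 0, 1)), on a cyclic configuration."""
    n = len(config)
    return tuple(f(config[(i - 1) % n], config[i], config[(i + 1) % n])
                 for i in range(n))

def is_injective_on_cycles(f, states, n):
    """Brute-force injectivity check of F over all cyclic configurations of size n."""
    images = set()
    for config in product(states, repeat=n):
        img = step(config, f)
        if img in images:
            return False
        images.add(img)
    return True

# The identity rule is trivially injective; the XOR of the two neighbors is not
# on size-3 cycles (e.g. 000 and 111 have the same image).
print(is_injective_on_cycles(lambda l, c, r: c, (0, 1), 4))      # True
print(is_injective_on_cycles(lambda l, c, r: l ^ r, (0, 1), 3))  # False
```

Such a finite check is only suggestive: injectivity in the sense of Definition 2 quantifies over all (infinite) configurations in Conf(Q).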
Reversible Cellular Automata
By the above theorem, we see that the notions of injectivity and invertibility are equivalent. Henceforth, we use the terminology "reversible CA" (RCA) for such a CA, instead of injective CA or invertible CA, because an RCA can be regarded as an analog of a physically reversible space. (Note that, in some other computing models such as Turing machines, counter machines, and logic circuits, injectivity is trivially equivalent to invertibility, if they are suitably defined. Therefore, for these models, we can directly define reversibility without introducing the notions of injectivity and invertibility.)
Reversible Cellular Automata, Figure 1 A cellular space with the Margolus neighborhood
How Can We Find RCAs?

The class of RCAs is a special subclass of CAs, so the problem arises of how we can find or construct RCAs with some desired property. This is in general hard to do within the conventional framework of CAs, because of the following result, shown by Kari [14] for the two-dimensional case (hence it also holds for higher-dimensional CAs).

Theorem 4 (Kari [14]) The problem of whether a given two-dimensional CA is reversible is undecidable.

For the one-dimensional case, Amoroso and Patt [2] showed that it is decidable.

Theorem 5 (Amoroso and Patt [2]) There is an algorithm to test whether a given one-dimensional CA is reversible or not.

There are also several studies on enumerating all reversible one-dimensional CAs (e.g., [6,23]). But it is generally difficult to find RCAs with specific properties such as computation-universality, even in the one-dimensional case. To make the design of RCAs feasible, several methods have been proposed, for example, CAs with block rules [18,38], partitioned CAs [28], CAs with second-order rules [18,38,39], and others (see e.g. [38]). Here, we describe the first two methods in some detail.

Cellular Automata with Block Rules

Margolus [18] proposed an interesting variant of a CA, by which he composed a computation-universal two-dimensional two-state reversible CA. In his model, all the cells are grouped into "blocks" of size 2 × 2 as shown in Fig. 1. A specific example of a transformation specified by "block rules" is shown in Fig. 2. This CA evolves as follows: at time 0 the local transformation is applied to every solid-line block, then at time 1 to every dotted-line block, and so on, alternately. Since this local transformation is one-to-one, the global function of the CA is also one-to-one. Such a neighborhood is called the Margolus neighborhood.

Reversible Cellular Automata, Figure 2
Block rules for the Margolus RCA [18]. (Rotation-symmetry is assumed here. Hence, rules obtained by rotating both sides of a rule are also included.)

One can obtain reversible CAs by giving one-to-one block rules. However, CAs with the Margolus neighborhood are not conventional CAs, because each cell must know its relative position in a block and the parity of time besides its own state. Related to this topic, Kari [15] showed that every one- and two-dimensional RCA can be represented by block permutations and translations.

Partitioned Cellular Automata

The method of using partitioned cellular automata (PCA) has some similarity to the one that uses block rules. However, the resulting reversible CAs are within the framework of conventional CAs (in other words, a PCA is a special case of a CA). In addition, the flexibility of the neighborhood is rather high. A shortcoming of PCAs is that, in general, the number of states per cell becomes large.

Definition 6 A deterministic k-dimensional m-neighbor partitioned cellular automaton (PCA) is a system defined by

P = (Z^k, (Q1, …, Qm), (n1, …, nm), f, (#1, …, #m)),
where Z is the set of all integers, Qi (i = 1, …, m) is a non-empty finite set of states of the i-th part of each cell (thus the state set of each cell is Q = Q1 × ⋯ × Qm), (n1, …, nm) ∈ (Z^k)^m is a neighborhood, f : Q → Q is a local function, and (#1, …, #m) ∈ Q is a quiescent state satisfying f(#1, …, #m) = (#1, …, #m). (In general, the states #1, …, #m may be different from each other. However, by renaming the states in each part appropriately, we can identify the states #1, …, #m as representing the same state #. In what follows, we often assume so, and write the quiescent state as (#, …, #).) The notion of a finite (or infinite) configuration is defined similarly as for a CA. Let p_i : Q → Qi be the projection function such that p_i(q1, …, qm) = qi for all (q1, …, qm) ∈ Q. The global function F : Conf(Q) → Conf(Q) of P is defined as the one that satisfies the following formula:

∀α ∈ Conf(Q), x ∈ Z^k : F(α)(x) = f(p_1(α(x + n1)), …, p_m(α(x + nm))).

By the above definition, a one-dimensional radius-1 (3-neighbor) PCA P1d can be defined as follows:

P1d = (Z, (L, C, R), (1, 0, −1), f, (#, #, #)).

Each cell is divided into three parts, i.e., left, center, and right parts, whose state sets are L, C, and R. The next state of a cell is determined by the present states of the left part of the right-neighbor cell, the center part of the cell itself, and the right part of the left-neighbor cell (not by the whole states of the three cells). Figure 3 shows its cellular space, and how the local function f works. Let (l, c, r), (l′, c′, r′) ∈ L × C × R. If f(l, c, r) = (l′, c′, r′), then this equation is called a local rule (or simply a rule) of the PCA P1d, and it is sometimes written in
Reversible Cellular Automata, Figure 3 Cellular space of a one-dimensional 3-neighbor PCA P1d , and its local function f
Reversible Cellular Automata, Figure 4
A pictorial representation of a local rule f(l, c, r) = (l′, c′, r′) of a one-dimensional 3-neighbor PCA P1d
a pictorial form as shown in Fig. 4. Note that, in the pictorial representation, the arguments of the left-hand side of f(l, c, r) = (l′, c′, r′) appear in reverse order. Similarly, a two-dimensional PCA P2d with a von Neumann-like neighborhood is defined as follows:

P2d = (Z^2, (C, U, R, D, L), ((0, 0), (0, −1), (−1, 0), (0, 1), (1, 0)), f, (#, #, #, #, #)).

Figure 5 shows the cellular space of P2d, and a pictorial representation of a rule f(c, u, r, d, l) = (c′, u′, r′, d′, l′). Let P = (Z^k, (Q1, …, Qm), (n1, …, nm), f, (#1, …, #m)) be a k-dimensional PCA, and F be its global function. It is easy to show the following proposition (a proof for the one-dimensional case given in [28] can be extended to higher dimensions).

Proposition 7 The local function f is one-to-one iff the global function F is one-to-one.

It is also easy to see that the class of PCAs is a subclass of CAs. More precisely, the following proposition is derived by extending the domain of the local function of P.

Proposition 8 For any k-dimensional m-neighbor PCA P, we can obtain a k-dimensional m-neighbor CA A whose global function is identical with that of P.
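Proposition 7 can be illustrated by a short script, again restricted to cyclic configurations of a small size: if the local function f of a one-dimensional 3-neighbor PCA is a bijection on L × C × R, the induced global map is injective. The toy rule and all names are illustrative:

```python
from itertools import product

def pca_step(config, f):
    """One step of a 1-D 3-neighbor PCA (Definition 6 with neighborhood (1, 0, -1)):
    cell x reads the left part of cell x+1, its own center part, and the
    right part of cell x-1.  Configurations are cyclic tuples of (l, c, r)."""
    n = len(config)
    return tuple(f(config[(x + 1) % n][0], config[x][1], config[(x - 1) % n][2])
                 for x in range(n))

# A toy bijective local function on {0,1}^3: cyclically shift the three parts.
shift = lambda l, c, r: (r, l, c)

cells = list(product([0, 1], repeat=3))     # all 8 part-triples of one cell
configs = list(product(cells, repeat=2))    # all cyclic configurations of 2 cells
images = {pca_step(cfg, shift) for cfg in configs}
print(len(images) == len(configs))          # True: global map injective here
```

The converse direction of Proposition 7 is also visible in this setting: applying the inverse of f cell by cell recovers every part of the previous configuration, which is exactly why bijective local rules make PCAs a convenient design framework for RCAs.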
Reversible Cellular Automata, Figure 5 Cellular space of a two-dimensional 5-neighbor PCA P2d , and its local rule
By the above, if we want to construct an RCA, it is sufficient to give a PCA whose local function f is one-to-one. This makes the design of RCAs feasible.

Simulating Irreversible Cellular Automata by Reversible Ones

Toffoli [36] first showed that for every irreversible CA there exists a reversible one that simulates it, at the cost of increasing the dimension by one. From this result, computation-universality of two-dimensional RCAs is derived, since it is easy to embed a Turing machine in an (irreversible) one-dimensional CA.

Theorem 9 (Toffoli [36]) For any k-dimensional (irreversible) CA A, we can construct a (k+1)-dimensional RCA A′ that simulates A in real time.
Reversible Cellular Automata, Figure 6 An example of an evolution in an irreversible one-dimensional CA A
Although Toffoli’s proof is rather complex, the idea of the proof is easily implemented by using a PCA. Here we explain it informally. Consider a one-dimensional 3-neighbor irreversible CA A that evolves as in Fig. 6. Then, we can construct a two-dimensional reversible PCA P that
Reversible Cellular Automata, Figure 7 Simulating the irreversible CA A in Fig. 6 by a two-dimensional reversible PCA P
simulates A as shown in Fig. 7. The configuration of A is kept in some row of P. The state of each cell of A is stored in triplicate in the left, center, and right parts of a cell of P. By this, each cell of P can compute the next state of the corresponding cell of A correctly. At the same time, the previous states of the cell and its left and right neighbors (which were used to compute the next state) are sent downward as "garbage" signals to keep P reversible. In other words, the additional dimension is used to record the entire past history of the evolution of A. In this way, P can simulate A reversibly. As for one-dimensional CAs with finite configurations, reversible simulation is possible without increasing the dimension.
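The idea behind Toffoli's dimension-raising construction can be caricatured in a few lines: treat the higher-dimensional state as the stack of all past rows, so that each step discards nothing and is trivially invertible. This is only a sketch of the bookkeeping idea, not the actual PCA construction; the rule and names are illustrative:

```python
def irreversible_step(row, f):
    """One step of an irreversible 1-D radius-1 CA on a cyclic configuration."""
    n = len(row)
    return tuple(f(row[(i - 1) % n], row[i], row[(i + 1) % n]) for i in range(n))

def reversible_step(state, f):
    """The 2-D state is the full space-time history (newest row first).
    Each step keeps the consumed row as 'garbage', so the map is injective."""
    return (irreversible_step(state[0], f),) + state

def reverse_step(state):
    """Undo one step by discarding the top row, which is recomputable."""
    return state[1:]

# The majority rule is irreversible, yet the history-keeping simulation
# can be run backwards to recover the initial configuration exactly.
majority = lambda l, c, r: 1 if l + c + r >= 2 else 0
start = ((0, 1, 1, 0, 1),)
s = start
for _ in range(3):
    s = reversible_step(s, majority)
for _ in range(3):
    s = reverse_step(s)
print(s == start)  # True
```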
Theorem 10 (Morita [24]) For any one-dimensional (irreversible) CA A with finite configurations, we can construct a one-dimensional RCA A′ that simulates A (but not in real time).

1-D Universal Reversible Cellular Automata

Simulating Reversible Turing Machines by 1-D RCAs

It is possible to prove computation-universality of one-dimensional RCAs by constructing RCAs that simulate reversible Turing machines directly. Here, we first give definitions on reversible Turing machines, and then show how they can be simulated by RCAs. Bennett [3] showed an elegant construction method for a reversible Turing machine that simulates a given irreversible Turing machine and leaves no garbage signals on its tape at the end of a computation. We now give a definition of a one-tape Turing machine and its reversible version (a multi-tape reversible Turing machine can be defined similarly). It is convenient to use the quadruple formalism [3] of a Turing machine to define a reversible one.

Definition 11 A one-tape Turing machine (TM) is defined by

T = (Q, S, q0, qa, qr, s0, δ),

where Q is a non-empty finite set of states, S is a non-empty finite set of symbols, q0 is an initial state (q0 ∈ Q), qa is an accepting state (qa ∈ Q), qr is a rejecting state (qr ∈ Q), s0 is a special blank symbol (s0 ∈ S), and δ is a move relation which is a subset of (Q × S × S × Q) ∪ (Q × {/} × {−, 0, +} × Q). Each element of δ is called a quadruple, and is either of the form [qi, s, s′, qj] ∈ (Q × S × S × Q) or [qi, /, d, qj] ∈ (Q × {/} × {−, 0, +} × Q). The symbols "−", "0", and "+" denote left-shift, zero-shift, and right-shift, respectively. [qi, s, s′, qj] means that if T reads the symbol s in the state qi, then it writes s′ and goes to the state qj. [qi, /, d, qj] means that if T is in the state qi, then it shifts the head in the direction d and goes to the state qj. T is called deterministic iff the following statement holds for any pair of distinct quadruples [p1, b1, c1, q1] and [p2, b2, c2, q2]:

If p1 = p2, then b1 ≠ / ∧ b2 ≠ / ∧ b1 ≠ b2.

On the other hand, T is called reversible iff the following statement holds for any pair of distinct quadruples [p1, b1, c1, q1] and [p2, b2, c2, q2]:

If q1 = q2, then b1 ≠ / ∧ b2 ≠ / ∧ c1 ≠ c2.

Hereafter, we consider only deterministic Turing machines. The next theorem shows computation-universality of a reversible three-tape Turing machine.

Theorem 12 (Bennett [3]) For any (irreversible) one-tape Turing machine, there is a reversible three-tape Turing machine which simulates the former.
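The determinism and reversibility conditions on a quadruple move relation are purely syntactic and easy to check mechanically. The sketch below (illustrative names; quadruples as tuples, with '/' as the shift marker) tests both conditions on the move relation of Tparity from Example 14:

```python
from itertools import combinations

def is_deterministic(delta):
    """Definition 11's determinism condition: distinct quadruples sharing the
    same first state must both be read-write quadruples with different read symbols."""
    for (p1, b1, c1, q1), (p2, b2, c2, q2) in combinations(delta, 2):
        if p1 == p2 and not (b1 != '/' and b2 != '/' and b1 != b2):
            return False
    return True

def is_reversible(delta):
    """The symmetric condition on fourth elements and written symbols."""
    for (p1, b1, c1, q1), (p2, b2, c2, q2) in combinations(delta, 2):
        if q1 == q2 and not (b1 != '/' and b2 != '/' and c1 != c2):
            return False
    return True

# The move relation of T_parity from Example 14:
delta = [('q0', '0', '1', 'q1'), ('q1', '/', '+', 'q2'),
         ('q2', '0', '1', 'qa'), ('q2', '1', '0', 'q3'),
         ('q3', '/', '+', 'q4'), ('q4', '0', '1', 'qr'),
         ('q4', '1', '0', 'q1')]
print(is_deterministic(delta), is_reversible(delta))  # True True
```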
It is also shown in [31] that for any irreversible one-tape TM, there is a reversible one-tape two-symbol TM which simulates the former. In fact, to prove computation-universality of a one-dimensional reversible PCA, it is convenient to simulate a reversible one-tape TM.

Theorem 13 (Morita and Harao [28]) For any reversible one-tape Turing machine T, there is a one-dimensional reversible PCA P that simulates the former.

We show how P simulates T (the method given below is slightly modified from the one in [28]). Let T = (Q, S, q0, qa, qr, s0, δ) be a reversible one-tape TM. We assume that q0 does not appear as a fourth element in any quadruple in δ (because we can always construct such a reversible TM from an irreversible one [31]). A reversible PCA P = (Z, (L, C, R), (1, 0, −1), f, (#, s0, #)) that simulates T is as follows. The state sets L, C, and R are:

L = R = Q ∪ {∗, #},  C = S ∪ (Q × S).

The local function f is as below. Note that, in (3)–(6), if p = q0 then x = ∗, else x = #. Likewise, if q ∈ {qa, qr} then y = ∗, else y = #.

(1) For every s ∈ S,
f(#, s, #) = (#, s, #),
f(#, s, ∗) = (#, s, ∗),
f(∗, s, #) = (∗, s, #).
(2) For every q ∈ {q0, qa, qr} and s ∈ S,
f(#, [q, s], #) = (#, [q, s], #).

(3) For every p, q ∈ Q and s, t ∈ S, if [p, s, t, q] ∈ δ,
f(#, [p, s], x) = (y, [q, t], #).

(4) For every p, q ∈ Q and s, t ∈ S, if [p, /, −, q] ∈ δ,
f(#, [p, s], x) = (q, s, #),
f(q, t, #) = (y, [q, t], #).

(5) For every p, q ∈ Q and s ∈ S, if [p, /, 0, q] ∈ δ,
f(#, [p, s], x) = (y, [q, s], #).

(6) For every p, q ∈ Q and s, t ∈ S, if [p, /, +, q] ∈ δ,
f(#, [p, s], x) = (#, s, q),
f(#, t, q) = (y, [q, t], #).

We can see that the right-hand side of each rule differs from that of any other rule, because T is deterministic and reversible. The rules in (3)–(6) are for simulating T step by step. If the initial computational configuration of T is

⋯ s0 t1 ⋯ t_{i−1} q0 t_i ⋯ t_n s0 ⋯

then set P to the following configuration:

…, (#, s0, #), (#, t1, #), …, (#, t_{i−2}, #), (#, t_{i−1}, ∗), (#, [q0, t_i], #), (#, t_{i+1}, #), …, (#, t_n, #), (#, s0, #), …

The simulation process starts when a right-moving signal ∗ meets a center state of the form [q0, t_i]. It is easily seen that, from this configuration, P correctly simulates T by the rules in (3)–(6). If T enters a halting state qa or qr, then a left-moving signal ∗ is emitted, and the final computational configuration of T is kept in all the successive configurations of P. Note that P itself cannot halt (i.e., it cannot keep exactly the same configuration after reaching it from a different configuration), because P is reversible.

Example 14 Consider a reversible TM Tparity = (Q, {0, 1}, q0, qa, qr, 0, δ), where Q = {q0, q1, q2, q3, q4, qa, qr}, and δ is as below:

δ = {[q0, 0, 1, q1], [q1, /, +, q2], [q2, 0, 1, qa], [q2, 1, 0, q3], [q3, /, +, q4], [q4, 0, 1, qr], [q4, 1, 0, q1]}.

For a given unary number n on the tape, Tparity checks whether n is even or odd. If it is even, then Tparity halts in the accepting state qa; otherwise it halts in the rejecting state qr. All the symbols read by Tparity are complemented. See Fig. 8.

Reversible Cellular Automata, Figure 8
The initial and the final computational configuration of Tparity for a given unary input 11

A simulation process of Tparity by a reversible PCA Pparity, constructed by the method described above, is shown in Fig. 9.
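The quadruple TM of Example 14 can also be executed directly. The following sketch runs a deterministic quadruple-format TM; names are illustrative, the blank symbol is '0', and the head is assumed to start on the blank square to the left of the unary input, which is consistent with the first quadruple of Tparity reading a blank:

```python
def run_quadruple_tm(delta, tape, start='q0', halting=('qa', 'qr'), max_steps=1000):
    """Run a deterministic quadruple-format TM (cf. Definition 11).
    The tape is a dict position -> symbol; '0' plays the blank symbol s0."""
    state, head = start, 0
    tape = dict(tape)
    moves = {'-': -1, '0': 0, '+': 1}
    for _ in range(max_steps):
        if state in halting:
            return state
        sym = tape.get(head, '0')
        for (p, b, c, q) in delta:
            if p == state and b == '/':      # shift quadruple [p, /, d, q]
                head += moves[c]
                state = q
                break
            if p == state and b == sym:      # read-write quadruple [p, s, s', q]
                tape[head] = c
                state = q
                break
        else:
            return None                      # no applicable quadruple
    return None

delta = [('q0', '0', '1', 'q1'), ('q1', '/', '+', 'q2'),
         ('q2', '0', '1', 'qa'), ('q2', '1', '0', 'q3'),
         ('q3', '/', '+', 'q4'), ('q4', '0', '1', 'qr'),
         ('q4', '1', '0', 'q1')]
print(run_quadruple_tm(delta, {1: '1', 2: '1'}))  # unary 11 (n = 2, even): 'qa'
print(run_quadruple_tm(delta, {1: '1'}))          # unary 1 (n = 1, odd):   'qr'
```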
Reversible Cellular Automata, Figure 9 Simulating Tparity by a one-dimensional reversible PCA Pparity . The state # is indicated by a blank
Simulating Cyclic Tag Systems by 1-D RCAs

From Theorem 12 we can see the existence of a universal reversible TM, and thus by Theorem 13 there is a one-dimensional universal RCA. Then the following problem arises: how simple can it be? To get a universal RCA with a small number (say, several dozen) of states, we need another useful framework of a universal system. The cyclic tag system (CTAG) was proposed by Cook [7] to show universality of the elementary cellular automaton of rule 110. As we shall see, it is also useful for constructing simple universal RCAs.

Definition 15 A cyclic tag system (CTAG) is defined by C = (k, {Y, N}, (p0, …, p_{k−1})), where k (k = 1, 2, …) is the length of a cycle (i.e., the period), {Y, N} is the (fixed) alphabet, and (p0, …, p_{k−1}) ∈ ({Y, N}*)^k is a k-tuple of production rules. A pair (v, m) is called an instantaneous description (ID) of C, where v ∈ {Y, N}* and m ∈ {0, …, k−1}.
m is called the phase of the ID. A transition relation ⇒ on the set of IDs is defined as follows. For any (v, m), (v′, m′) ∈ {Y, N}* × {0, …, k−1},

(Yv, m) ⇒ (v′, m′) iff [m′ = m + 1 mod k] ∧ [v′ = v p_m],
(Nv, m) ⇒ (v′, m′) iff [m′ = m + 1 mod k] ∧ [v′ = v].

A sequence of IDs (v0, m0), (v1, m1), … is called a computation starting from v ∈ {Y, N}* iff (v0, m0) = (v, 0) and (v_i, m_i) ⇒ (v_{i+1}, m_{i+1}) (i = 0, 1, …). (In what follows, we write a computation as (v0, m0) ⇒ (v1, m1) ⇒ ⋯.) A CTAG is a variant of a classical tag system (see e.g. [21]), where production rules are applied cyclically. If the first symbol of the host (i.e., rewritten) string is Y, then it is removed and the production rule of the current phase is attached to the end of the host string. If it is N, then it is simply removed and no string is attached.

Example 16 Consider the following CTAG:

C0 = (3, {Y, N}, (Y, NN, YN)).

If we give NYY to C0 as an initial string, then

(NYY, 0) ⇒ (YY, 1) ⇒ (YNN, 2) ⇒ (NNYN, 0) ⇒ (NYN, 1) ⇒ (YN, 2) ⇒ ⋯

is an initial segment of a computation starting from NYY.

Minsky [21] proved that the 2-tag system, a special class of classical tag systems, is universal. The following theorem shows the universality of CTAGs.

Theorem 17 (Cook [7]) For any 2-tag system, there is a CTAG that simulates the former.

It was shown by Morita [27] that there are universal one-dimensional RCAs that can simulate any CTAG.

Theorem 18 (Morita [27]) There is a 36-state one-dimensional reversible PCA P36 that can simulate any CTAG on infinite (leftward-periodic) configurations.

Theorem 19 (Morita [27]) There is a 98-state one-dimensional reversible PCA P98 that can simulate any CTAG on finite configurations. (Note: it can also handle halting of a CTAG.)

The reversible PCA P36 of Theorem 18 is given below:

P36 = (Z, ({#}, {#, Y, N, +, −, ∗}, {#, y, n, +, −, ∗}), (1, 0, −1), f, (#, #, #)),
where f is defined as follows:

f(#, c, #) = (#, c, #)  for c ∈ {#, Y, N, +, −, ∗}
f(#, c, r) = (#, c, r)  for c ∈ {#, Y, N} and r ∈ {y, n, +, −}
f(#, −, r) = (#, −, r)  for r ∈ {y, n}
f(#, ∗, r) = (#, ∗, r)  for r ∈ {y, n, −}
f(#, Y, ∗) = (#, ∗, +)
f(#, N, ∗) = (#, −, ∗)
f(#, c, r) = (#, r, c)  for c, r ∈ {+, −}
f(#, +, y) = (#, Y, ∗)
f(#, +, n) = (#, N, ∗)
f(#, #, ∗) = (#, +, y)

It is easy to see that f is injective. Note that, since the state set of the left part of a cell of P36 has only one element #, it is actually a two-neighbor PCA. Consider the CTAG C0 of Example 16. The computation in C0 starting from NYY is simulated by P36 as shown in Fig. 10. The initial string NYY is placed in the center parts of some consecutive cells of P36. The production rules (Y, NN, YN) of C0 are given, in reverse order, in the right parts of consecutive cells. The state ∗ is used to delimit the production rules. Since the production rules are applied cyclically, infinite copies of the state sequence "∗ny∗nn∗y" should be given in the left half of the cellular space of P36. Figure 10 shows an initial segment (NYY, 0) ⇒ (YY, 1) ⇒ (YNN, 2) ⇒ (NNYN, 0) ⇒ (NYN, 1) ⇒ (YN, 2) of the computation in C0.

2-D Universal Reversible Cellular Automata

Simulating Reversible Logic Gates by 2-D RCAs

A set of logic elements is called logically universal if any sequential machine (i.e., finite automaton with outputs) can be composed using only elements in the set. Since the finite-state control and the tape cells of a Turing machine are in fact sequential machines, we can construct any Turing machine from such elements. In this section, we show several computation-universal two-dimensional RCAs in which universal reversible logic elements are embedded. A Fredkin gate [8] is a reversible (i.e., its logical function is one-to-one) and bit-conserving (i.e., the number of 1's is conserved between inputs and outputs) logic gate, shown in Fig. 11. It is known that any combinational logic element (in particular AND, OR, NOT, and fan-out elements) can be realized using only Fredkin gates [8]. Hence, we can construct any sequential machine from Fredkin gates and delay elements.
Reversible Cellular Automata, Figure 10
Simulating the CTAG C0 by the reversible PCA P36 [27]
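A cyclic tag system itself (Definition 15) is simple to execute directly. The sketch below (illustrative names) reproduces the computation of Example 16:

```python
def ctag_run(rules, word, steps):
    """Run a cyclic tag system: at phase m, remove the first symbol of the
    host word; if it was 'Y', append the production rule p_m; then advance
    the phase mod k.  Returns the sequence of IDs (word, phase)."""
    k, m = len(rules), 0
    ids = [(word, m)]
    for _ in range(steps):
        if not word:
            break                      # the empty word halts the system
        first, word = word[0], word[1:]
        if first == 'Y':
            word += rules[m]
        m = (m + 1) % k
        ids.append((word, m))
    return ids

# Example 16: C0 = (3, {Y, N}, (Y, NN, YN)) started on NYY
for word, m in ctag_run(('Y', 'NN', 'YN'), 'NYY', 5):
    print(word, m)   # NYY 0, YY 1, YNN 2, NNYN 0, NYN 1, YN 2
```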
Reversible Cellular Automata, Figure 11 A Fredkin gate [8]
Reversible Cellular Automata, Figure 12 A switch gate [8]
It is also known that a Fredkin gate can be composed of much simpler gates, called the switch gate and its inverse gate [8]. A switch gate is again a reversible and bit-conserving logic gate (Fig. 12). Furthermore, a switch gate can be realized in the Billiard Ball Model (BBM) of computation [8]. The BBM is a kind of physical model of computation where a signal "1" is represented by an ideal ball, and
Reversible Cellular Automata, Figure 13 A switch gate realized in the Billiard Ball Model [8]
logical operations and routing can be performed by their elastic collisions and reflections by reflectors. Figure 13 shows a BBM realization of a switch gate.
Reversible Cellular Automata
Reversible Cellular Automata, Figure 14
Reflection of a ball by a reflector in Margolus' RCA [18]
Reversible Cellular Automata, Figure 15
The local function of the 2^4-state rotation-symmetric reversible PCA S1 [30]
The 2-State RCA with the Margolus Neighborhood The BBM can be realized in the two-dimensional 2-state RCA proposed by Margolus [18], which has the block rules shown in Fig. 2. Figure 14 shows the reflection of a ball by a mirror in Margolus' CA. Hence, Margolus' CA is computation-universal. A 16-State Reversible PCA Model S1 The model S1 [30] is a 4-neighbor rotation- and reflection-symmetric reversible PCA model. A cell is divided into four parts, and each part has the state set {0, 1}. Its local transition rules are shown in Fig. 15. Rotated rules are omitted since the model is rotation-symmetric (hence each rule can be regarded as a "rule scheme"). The states 0 and 1 are represented by a blank and a dot, respectively. The set of these rules has some similarity to that of Margolus' CA, and in fact it can simulate the BBM in a similar manner. In S1, a ball of the BBM is represented by two dots. Figure 16 gives a configuration of a switch gate.
A 16-State Reversible PCA Model S2 The second model S2 [30] is also a 4-neighbor reversible PCA, having the set of transition rules shown in Fig. 17. It is rotation-symmetric but not reflection-symmetric. In S2, the shape of a mirror and the reflection by it differ from those in S1; i.e., only a left turn is possible. Hence a right turn must be realized by three left turns. The other features are similar to S1. Figure 18 shows a configuration of a switch gate. An 8-State Triangular Reversible PCA Model T1 The model T1 [12] is an 8-state reversible PCA on a triangular grid. It is rotation-symmetric but not reflection-symmetric. Its local function is extremely simple, as shown in Fig. 19. Signal routing, crossing, and delay are very complex to realize, because a kind of "wall" is necessary to make a signal go straight ahead. Thus the configuration of a Fredkin gate is very large (26 × 220), as in Fig. 20.
Reversible Cellular Automata, Figure 16 A switch gate realized in the reversible PCA S1 [30]
Reversible Cellular Automata, Figure 18
A switch gate realized in the reversible PCA S2 [30]
Simulating Reversible Counter Machines by 2-D RCAs Besides reversible logic gates like the Fredkin gate, there are also universal reversible logic elements with memory. A rotary element (RE) [26] is a typical example of such an element. An RE has four input lines {n, e, s, w}, four output lines {n′, e′, s′, w′}, and two states, called the H-state and the V-state, shown in Fig. 21 (hence it has a 1-bit memory). All the values of inputs and outputs are either 0 or 1. Here, the input (and the output) are restricted as follows: at most one "1" appears as an input (output) at a time. The operation of an RE is undefined for the cases in which signal 1's are given to two or more input lines. We employ the following intuitive interpretation of the operation of an RE. Signals 1 and 0 are interpreted as the existence and non-existence of a particle. An RE has a "rotating bar" that controls the moving direction of a particle. When no particle exists, nothing happens to the RE.
Reversible Cellular Automata, Figure 19
The local function of the 2^3-state rotation-symmetric reversible triangular PCA T1 [12]
If a particle comes from a direction parallel to the rotating bar, then it goes out from the output line on the opposite side (i.e., it goes straight ahead) without affecting the direction of the bar (Fig. 22a). If a particle comes from a direction orthogonal to the bar, then it makes a right turn, and rotates the bar by 90 degrees counterclockwise (Fig. 22b). It is clear that its operation is reversible.
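The RE's behavior described above can be written as a tiny state machine, and its reversibility, i.e., injectivity of the map (state, input line) → (new state, output line), checked exhaustively. The exact pairing of right-turn outputs to input lines is inferred from the description and may differ in detail from Fig. 22; names are illustrative:

```python
def rotary_element(state, inp):
    """Rotary element: state 'H' or 'V' is the bar orientation; inp is the
    input line 'n', 'e', 's' or 'w'.  Returns (new state, output line)."""
    parallel = {'H': ('e', 'w'), 'V': ('n', 's')}      # inputs along the bar
    opposite = {'n': 's', 's': 'n', 'e': 'w', 'w': 'e'}
    right_turn = {'n': 'w', 'w': 's', 's': 'e', 'e': 'n'}  # assumed geometry
    if inp in parallel[state]:
        return state, opposite[inp]        # straight through, bar unchanged
    other = 'V' if state == 'H' else 'H'
    return other, right_turn[inp]          # right turn, bar rotated 90 degrees

pairs = [(s, d) for s in 'HV' for d in 'nesw']
outputs = {rotary_element(s, d) for s, d in pairs}
print(len(outputs) == len(pairs))   # True: the map is injective, hence reversible
```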
Reversible Cellular Automata, Figure 17
The local function of the 2^4-state rotation-symmetric reversible PCA S2 [30]
Reversible Cellular Automata
Reversible Cellular Automata, Figure 21 Two states of a rotary element (RE)
Reversible Cellular Automata, Figure 22 Operations of an RE: a the parallel case, and b the orthogonal case
Reversible Cellular Automata, Figure 23 A counter machine with two counters
It has been shown that any reversible two-counter machine can be implemented quite simply using REs and a few additional elements [32]. Since a reversible two-counter machine is known to be universal [25], such a reversible PCA is also universal. A counter machine (CM) is a simple model of computation consisting of a finite number of counters and a finite-state control [21]. In [25] a CM is defined as a kind of multi-tape Turing machine whose heads are read-only and whose tapes are all blank except the leftmost squares, as shown in Fig. 23 (P is a blank symbol). This definition is convenient for introducing the notion of reversibility on a CM. It is well known that a CM with two counters is computation-universal [21]. This result also holds even if the reversibility constraint is added.

Theorem 20 (Morita [25]) For any Turing machine T, there is a deterministic reversible CM with two counters M that simulates T.

Reversible Cellular Automata, Figure 20
A Fredkin gate realized in the 2^3-state reversible triangular PCA T1 [12]
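A counter machine in the general sense used above can be sketched minimally. The instruction format below is illustrative and is not the read-only multi-tape formalism of [25]; the example program computes 2x + 2, echoing the function realized by the embedded machine of Fig. 28:

```python
def run_cm(program, counters, state=0, max_steps=10_000):
    """A minimal counter-machine sketch.  Instructions are
    ('inc', i, next), ('dec', i, next_if_nonzero, next_if_zero), ('halt',).
    Counters hold non-negative integers; 'dec' on zero only branches."""
    for _ in range(max_steps):
        instr = program[state]
        if instr[0] == 'halt':
            return counters
        if instr[0] == 'inc':
            _, i, state = instr
            counters[i] += 1
        else:
            _, i, nz, z = instr
            if counters[i] > 0:
                counters[i] -= 1
                state = nz
            else:
                state = z
    return None   # step budget exhausted

# Compute 2x + 2 into counter 1 from x in counter 0:
prog = [('dec', 0, 1, 3),   # 0: while counter0 > 0: consume one unit ...
        ('inc', 1, 2),      # 1:   counter1 += 1
        ('inc', 1, 0),      # 2:   counter1 += 1, back to the loop test
        ('inc', 1, 4),      # 3: loop done: counter1 += 2 ...
        ('inc', 1, 5),      # 4:
        ('halt',)]          # 5: halt with the result in counter 1
print(run_cm(prog, [3, 0]))  # [0, 8]
```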
A 3^4-State Reversible PCA Model P3 Any reversible CM with two counters can be embedded in the model P3, whose local function is shown in Fig. 24 [32]. In P3, five kinds
Reversible Cellular Automata, Figure 24
The local function of the 3^4-state rotation-symmetric reversible PCA P3 [32]. The rule scheme (m) represents the 33 rules not specified by (a)–(l), where w, x, y, z ∈ {blank, •, ∘}
Reversible Cellular Automata, Figure 27
Pushing and pulling operations on a position marker in the reversible PCA P3
Reversible Cellular Automata, Figure 25 Basic elements realized in the reversible cellular space of P3 [32]
of signal-processing elements, shown in Fig. 25, can be realized. Here, a single • acts as a signal. An LR-turn element, an R-turn element, and a reflector are used for signal routing. Figure 26 shows the operations of an RE in the P3 space. A position marker is used to keep the head position of a CM, and is realized by a single ∘, which rotates clockwise at a fixed position by rule (a) of Fig. 24. Figure 27 shows the pushing and pulling operations on a position marker. Figure 28 shows an example of a whole configuration of a reversible CM with two counters embedded in the P3 space. In this model, no conventional logic elements like AND, OR, and NOT are used. Computation is carried out simply by a single signal that interacts with REs and position markers.
Reversible Cellular Automata, Figure 26 Operations of an RE in the reversible PCA P3 : a the parallel case, and b the orthogonal case
Reversible Cellular Automata, Figure 28
An example of a reversible counter machine, which computes the function 2x + 2, embedded in the reversible PCA P3 [32]
Future Directions In this section, we discuss future directions and open problems, as well as topics not dealt with in the previous sections. How Simple Can Universal RCAs Be? We have seen that there are many kinds of simple RCAs having computation-universality. The universal RCAs with the smallest numbers of states known so far are summarized as follows. One-dimensional case: Finite configurations: 98-state reversible PCA [27]. Infinite configurations: 36-state reversible PCA [27].
Two-dimensional case: Finite configurations: 81-state reversible PCA [32]. Infinite configurations: 2-state RCA with Margolus neighborhood [18]; 8-state reversible PCA on a triangular grid [12]; 16-state reversible PCA on a rectangular grid [30]. We believe the number of states of a universal RCA can be reduced further in each of the above cases. Although the framework of PCAs is useful for designing RCAs of a standard type, the number of states becomes relatively
large, because the state set is the direct product of the state sets of the parts. Hence we shall need some other technique to find a universal RCA with a small number of states. How Can We Realize RCAs in Reversible Physical Systems? This is a very difficult problem, and at present there is no good solution. The Billiard Ball Model [8] is an interesting idea, but it is practically impossible to implement it perfectly. Instead of using mechanical collisions of balls, some quantum-mechanical reversible phenomena should be used. Furthermore, if we want to implement a CA in a real physical system, the following problem arises. In a CA, both time and space are discrete, and all the cells operate synchronously. In a real system, on the other hand, time and space are continuous, and no synchronizing clock can be assumed beforehand. Hence, we need some novel theoretical framework for dealing with such problems. Self-Reproduction in RCAs Von Neumann first devised a self-reproducing cellular automaton using his famous 29-state CA [40]. In his model, the size of a self-reproducing pattern is quite huge, because the pattern has both computing and self-reproducing abilities. Later, Langton [17] created a very simple self-reproducing CA by dropping the condition that the pattern must have computation-universality.
Reversible Cellular Automata, Figure 29 Self-reproduction of a pattern in a three-dimensional RCA [13]
It was shown that self-reproduction of Langton's type is possible in two- and three-dimensional reversible PCAs [13,29]. Figure 29 shows a self-reproducing pattern in a three-dimensional reversible PCA [13]. It is left for future study to design a simple and elegant RCA in which objects with computation-universality can reproduce themselves.

Firing Squad Synchronization in RCAs

It is also possible to solve the firing squad synchronization problem with RCAs. Imai and Morita [11] gave a 99-state reversible PCA that synchronizes an array of n cells in 3n time steps. Though it seems possible to give an optimal-time solution, i.e., a (2n − 2)-step solution, its concrete design has not yet been carried out.

Bibliography

Primary Literature

1. Amoroso S, Cooper G (1970) The Garden of Eden theorem for finite configurations. Proc Amer Math Soc 26:158–164
2. Amoroso S, Patt YN (1972) Decision procedures for surjectivity and injectivity of parallel maps for tessellation structures. J Comput Syst Sci 6:448–464
3. Bennett CH (1973) Logical reversibility of computation. IBM J Res Dev 17:525–532
4. Bennett CH (1982) The thermodynamics of computation. Int J Theor Phys 21:905–940
5. Bennett CH, Landauer R (1985) The fundamental physical limits of computation. Sci Am 253:38–46
6. Boykett T (2004) Efficient exhaustive listings of reversible one dimensional cellular automata. Theor Comput Sci 325:215–247
7. Cook M (2004) Universality in elementary cellular automata. Complex Syst 15:1–40
8. Fredkin E, Toffoli T (1982) Conservative logic. Int J Theor Phys 21:219–253
9. Gruska J (1999) Quantum Computing. McGraw-Hill, London
10. Hedlund GA (1969) Endomorphisms and automorphisms of the shift dynamical system. Math Syst Theor 3:320–375
11. Imai K, Morita K (1996) Firing squad synchronization problem in reversible cellular automata. Theor Comput Sci 165:475–482
12. Imai K, Morita K (2000) A computation-universal two-dimensional 8-state triangular reversible cellular automaton. Theor Comput Sci 231:181–191
13. Imai K, Hori T, Morita K (2002) Self-reproduction in three-dimensional reversible cellular space. Artif Life 8:155–174
14. Kari J (1994) Reversibility and surjectivity problems of cellular automata. J Comput Syst Sci 48:149–182
15. Kari J (1996) Representation of reversible cellular automata with block permutations. Math Syst Theor 29:47–61
16. Landauer R (1961) Irreversibility and heat generation in the computing process. IBM J Res Dev 5:183–191
17. Langton CG (1984) Self-reproduction in cellular automata. Physica 10D:135–144
18. Margolus N (1984) Physics-like model of computation. Physica 10D:81–95
19. Maruoka A, Kimura M (1976) Condition for injectivity of global maps for tessellation automata. Inf Control 32:158–162
20. Maruoka A, Kimura M (1979) Injectivity and surjectivity of parallel maps for cellular automata. J Comput Syst Sci 18:47–64
21. Minsky ML (1967) Computation: Finite and Infinite Machines. Prentice-Hall, Englewood Cliffs
22. Moore EF (1962) Machine models of self-reproduction. Proc Symposia in Applied Mathematics, Am Math Soc 14:17–33
23. Mora JCST, Vergara SVC, Martinez GJ, McIntosh HV (2005) Procedures for calculating reversible one-dimensional cellular automata. Physica D 202:134–141
24. Morita K (1995) Reversible simulation of one-dimensional irreversible cellular automata. Theor Comput Sci 148:157–163
25. Morita K (1996) Universality of a reversible two-counter machine. Theor Comput Sci 168:303–320
26. Morita K (2001) A simple reversible logic element and cellular automata for reversible computing. In: Proc 3rd Int Conf on Machines, Computations, and Universality. LNCS, vol 2055. Springer, Berlin, pp 102–113
27. Morita K (2007) Simple universal one-dimensional reversible cellular automata. J Cell Autom 2:159–165
28. Morita K, Harao M (1989) Computation universality of one-dimensional reversible (injective) cellular automata. Trans IEICE Japan E-72:758–762
29. Morita K, Imai K (1996) Self-reproduction in a reversible cellular space. Theor Comput Sci 168:337–366
30. Morita K, Ueno S (1992) Computation-universal models of two-dimensional 16-state reversible cellular automata. IEICE Trans Inf Syst E75-D:141–147
31. Morita K, Shirasaki A, Gono Y (1989) A 1-tape 2-symbol reversible Turing machine. Trans IEICE Japan E-72:223–228
32. Morita K, Tojima Y, Imai K, Ogiro T (2002) Universal computing in reversible and number-conserving two-dimensional cellular spaces. In: Adamatzky A (ed) Collision-based Computing. Springer, London, pp 161–199
33. Myhill J (1963) The converse of Moore's Garden-of-Eden theorem. Proc Am Math Soc 14:658–686
34. Richardson D (1972) Tessellations with local transformations. J Comput Syst Sci 6:373–388
35. Sutner K (2004) The complexity of reversible cellular automata. Theor Comput Sci 325:317–328
36. Toffoli T (1977) Computation and construction universality of reversible cellular automata. J Comput Syst Sci 15:213–231
37. Toffoli T (1980) Reversible computing. In: de Bakker JW, van Leeuwen J (eds) Automata, Languages and Programming. LNCS, vol 85. Springer, Berlin, pp 632–644
38. Toffoli T, Margolus N (1990) Invertible cellular automata: a review. Physica D 45:229–253
39. Toffoli T, Capobianco S, Mentrasti P (2004) How to turn a second-order cellular automaton into a lattice gas: a new inversion scheme. Theor Comput Sci 325:329–344
40. von Neumann J (1966) Theory of Self-reproducing Automata. Burks AW (ed), University of Illinois Press, Urbana
41. Watrous J (1995) On one-dimensional quantum cellular automata. In: Proc 36th Symp on Foundations of Computer Science. IEEE, Los Alamitos, pp 528–537
Books and Reviews

Adamatzky A (ed) (2002) Collision-Based Computing. Springer, London
Bennett CH (1988) Notes on the history of reversible computation. IBM J Res Dev 32:16–23
Burks A (1970) Essays on Cellular Automata. University of Illinois Press, Urbana
Kari J (2005) Theory of cellular automata: a survey. Theor Comput Sci 334:3–33
Morita K (2001) Cellular automata and artificial life – computation and life in reversible cellular automata. In: Goles E, Martinez S (eds) Complex Systems. Kluwer, Dordrecht, pp 151–200
Wolfram S (2001) A New Kind of Science. Wolfram Media, Champaign
Reversible Computing
Reversible Computing KENICHI MORITA Hiroshima University, Higashi-Hiroshima, Japan
Article Outline

Glossary
Definition of the Subject
Introduction
Reversible Logic Gates and Circuits
Reversible Logic Elements with Memory
Reversible Turing Machines
Other Models of Reversible Computing
Future Directions
Bibliography
Glossary

Billiard ball model The Billiard Ball Model (BBM) is a physical model of computation proposed by Fredkin and Toffoli [19]. It consists of idealized balls and reflectors. Balls can collide with other balls or with reflectors. It is a reversible dynamical system, since it is assumed that collisions are elastic and there is no friction. Fredkin and Toffoli showed that a reversible logic gate called the Fredkin gate, which is known to be logically universal, can be embedded in the BBM. Hence, a universal computer can be realized in the space of the BBM.

Fredkin gate The Fredkin gate is a typical reversible logic gate with 3 inputs and 3 outputs whose operation is defined by the one-to-one mapping (c, p, q) ↦ (c, cp + c̄q, c̄p + cq). It is known that any logic function (even if it is not a one-to-one function) can be realized using only Fredkin gates, by allowing constant inputs and garbage outputs (i.e., useless signals). Furthermore, it was shown by Fredkin and Toffoli [19] that garbage signals can be reversibly erased, and hence any logic function can be realized by a garbage-less circuit composed only of Fredkin gates.

Reversible cellular automaton A cellular automaton (CA) consists of a large number of finite automata, called cells, interconnected uniformly; each cell changes its state depending on its neighboring cells. A reversible cellular automaton (RCA) is one whose global function (i.e., the transition function from configurations to configurations) is one-to-one. RCAs can be thought of as spatio-temporal models of reversible physical systems as well as reversible computing models. (See Reversible Cellular Automata.)
Reversible logic element A reversible logic element is a primitive from which logic circuits can be composed, and whose operation is defined by a one-to-one mapping. There are two kinds of such elements: those without memory, and those with memory. Reversible elements without memory are nothing but reversible logic gates. The Fredkin gate and the Toffoli gate are well-known examples, both of which are universal. Reversible elements with 1-bit memory are also useful when constructing reversible computing systems. The rotary element is a typical element of this type, which is also known to be universal.

Reversible Turing machine A reversible Turing machine (RTM) is a "backward deterministic" Turing machine, and is a standard model of a reversible computing system. Bennett [4] showed that for any (irreversible) Turing machine there is an RTM that simulates the former and leaves no garbage information on its tape when it halts. Hence, Turing machines retain computation-universality even when the constraint of reversibility is added. Furthermore, there is a rather small universal reversible Turing machine.

Rotary element A rotary element (RE) is a reversible logic element with 1-bit memory [36]. Conceptually, it is square-shaped and has a "rotating bar" inside. The direction of the bar is either horizontal or vertical. It has four input ports and four output ports on its four edges. If a "particle" comes from a direction parallel to the bar, it goes straight ahead and does not affect the direction of the bar. If it comes from a direction orthogonal to the bar, it turns rightward and the bar rotates by 90 degrees. It is known that the RE is universal, and reversible Turing machines can be built very concisely using only REs.

Definition of the Subject

A reversible computing system is defined as a "backward deterministic" system, in which every computational configuration (the state of the whole system) has exactly one previous configuration.
Hence, a backward computation can be performed by its "inverse" computing system. Though the definition is rather simple, it is closely related to physical reversibility. The study of reversible computing originated from an investigation of heat generation (or energy dissipation) in computing systems. It deals with the problems of how computation can be carried out efficiently and elegantly in reversible computing models, and how such systems can be implemented in reversible physical systems. Since reversibility is one of the fundamental microscopic physical laws of Nature, and future computing systems will surely
become nano-scale in size, it is very important to investigate these problems. In this article, we discuss several models of reversible computing from the viewpoint of computing theory (hence, we do not discuss problems of physical implementation in detail). We deal with reversible computing models at different levels, from the element level to the system level: the Billiard Ball Model, an idealized reversible physical model; reversible logic elements and circuits; reversible Turing machines; and a few others. These models are related to each other: reversible Turing machines can be composed of reversible logic elements, and reversible logic elements can be implemented in the Billiard Ball Model. We can also see that even very simple reversible systems have a very rich computing ability, in spite of the constraint of reversibility. Such computing models will give new insights for future computing systems.

Introduction

The definition of a reversible computing system is rather simple: it is a system whose computation process can be traced backward in a unique way, from the end of the computation to its start. In other words, reversible computing systems are "backward deterministic" computing models (of course, a more precise definition should be given for each model of reversible computing). But if we explain reversible computing systems only in this way, it is hard to understand why they are interesting. Their importance lies in the fact that reversible computing is closely related to physical reversibility. In [24], Landauer argued the relation between logical irreversibility and physical irreversibility. He pointed out that any logically irreversible operation, such as a simple erasure of information from memory, or a simple merge of two paths of a program, is associated with physical irreversibility, and hence necessarily causes heat generation.
In particular, if 1 bit of information is erased, at least kT ln 2 of energy is dissipated, where k is the Boltzmann constant and T is the absolute temperature. This is called "Landauer's principle". (Note that besides simple erasure and simple merge, there are reversible erasure and reversible merge, which are not irreversible operations and are useful for constructing garbage-less reversible Turing machines, as we shall see in Sect. "Universality of RTMs and Garbage-Less Computation".) After the work of Landauer, various studies related to the "thermodynamics of computation" appeared [4,5,6,7,9,22]. There is also an argument on computation from the physical viewpoint by Feynman [16], which contains an idea leading to quantum computing (see, e.g., [12,15,20]).
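As a worked figure, Landauer's bound can be evaluated numerically. The sketch below uses the exact SI value of the Boltzmann constant; the choice of 300 K as a representative room temperature is ours.

```python
import math

# Landauer's principle: erasing one bit dissipates at least kT ln 2 of energy.
k_B = 1.380649e-23   # Boltzmann constant, J/K (exact in the 2019 SI)
T = 300.0            # an illustrative room temperature, K

bound_per_bit = k_B * T * math.log(2)
print(f"kT ln 2 at {T:.0f} K = {bound_per_bit:.3e} J")  # about 2.87e-21 J
```

At room temperature the bound is roughly 2.9 × 10⁻²¹ J per erased bit, many orders of magnitude below the dissipation of present-day logic circuits, which is why the principle matters as an ultimate limit rather than a practical one today.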
Bennett [4] studied reversible Turing machines from the viewpoint of Landauer's principle. He proved that for any irreversible Turing machine there is a reversible Turing machine that simulates the former and leaves no garbage information on its tape. The notion of garbage-less computation is important, because disposal of garbage information is effectively equivalent to erasure of that information, and hence leads to energy dissipation. Bennett's result suggests that any computation can be performed efficiently in a physically reversible system. Since then, various kinds of reversible computing models have been proposed and investigated: reversible logic elements and circuits [19,36,47,48], the Billiard Ball Model [19] (an idealized physical model of computation), reversible cellular automata [46,49], reversible counter machines [34], and so on. It is known that all these models have computation-universality, i.e., any (irreversible) Turing machine can be simulated in these frameworks. In this article, we discuss reversible computing from the theoretical point of view (though there are also studies on hardware implementation issues). We show how these models can have computation-universality, and how different they are from conventional models. In some cases, computation in reversible systems is carried out in a very different fashion from that of traditional irreversible computing models. An outline of this article is as follows. In Sect. "Reversible Logic Gates and Circuits", reversible logic gates and circuits are discussed; in particular, the Fredkin gate, its logical universality, garbage-less logic circuits, and its realization in the Billiard Ball Model are explained. In Sect. "Reversible Logic Elements with Memory", reversible logic elements with memory are discussed: a rotary element (RE) is introduced, and its logical universality is shown. In Sect. "Reversible Turing Machines", reversible Turing machines (RTMs) are discussed.
A method for converting an RTM into a garbage-less RTM, a simple realization of an RTM by REs, and a universal RTM are shown. In Sect. "Other Models of Reversible Computing", other reversible models of computing, not dealt with in the previous sections, are briefly discussed. In Sect. "Future Directions", future directions and open problems are given.

Reversible Logic Gates and Circuits

A reversible logic gate is a many-input many-output gate that realizes a one-to-one logical (i.e., Boolean) function. Early research on such gates is found in the study of Petri [42]. Later, Toffoli [47,48], and Fredkin and Toffoli [19], studied them from the standpoint of reversible
computing, i.e., how they can be realized in a reversible physical system, and how they are related to power dissipation in computation. In this section, we discuss the logical universality of reversible logic gates, garbage-less logic circuits, and a relation to the Billiard Ball Model of computation, along the lines of reference [19].

Reversible Logic Gates

Definition 1 An m-input n-output logic gate with a domain D is defined as an element that realizes a function D → {0,1}^n, where D ⊆ {0,1}^m. When D = {0,1}^m, we simply call it an m-input n-output logic gate. It is called a reversible logic gate if the function D → {0,1}^n is one-to-one.

It is easy to see that the traditional 1-input 1-output logic gate NOT is reversible, and that the 2-input 1-output logic gates AND and OR are irreversible. An example of an m-input n-output reversible logic gate with a domain D ≠ {0,1}^m will appear in Sect. "Billiard Ball Model (BBM)".

Definition 2 (Toffoli [47,48]) The generalized AND/NAND gate of order n is a reversible logic gate that realizes the function
θ^(n): (x_1, …, x_{n−1}, x_n) ↦ (x_1, …, x_{n−1}, (x_1 ⋯ x_{n−1}) ⊕ x_n).

We can easily verify that θ^(n) is one-to-one, and in fact (θ^(n))⁻¹ = θ^(n). The gate that realizes θ^(2) is sometimes called the controlled NOT (CNOT), an important gate in quantum computing (see, e.g., [20]). In particular, it is shown that all unitary operations can be expressed as compositions of CNOT gates and 1-bit quantum gates [3]. The gate for θ^(3) is called the Toffoli gate (Fig. 1).
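Both the generalized AND/NAND gate and the Fredkin gate (whose mapping is given in the Glossary and formalized below in Definition 3) are small enough to check exhaustively. The following Python sketch (function names are ours) verifies their reversibility and the basic constructions discussed in this section, including a minimal compute–copy–uncompute composition in the style of the garbage-less circuits treated later:

```python
from itertools import product

def gen_and_nand(bits):
    """Generalized AND/NAND gate of order n = len(bits):
    (x1, ..., x_{n-1}, xn) -> (x1, ..., x_{n-1}, (x1 & ... & x_{n-1}) ^ xn)."""
    *xs, xn = bits
    conj = 1
    for x in xs:
        conj &= x
    return tuple(xs) + (conj ^ xn,)

def fredkin(c, p, q):
    """Fredkin gate: (c, p, q) -> (c, cp + ~c q, ~c p + cq); its own inverse."""
    return (c, p, q) if c else (c, q, p)

for n in (2, 3):  # order 2 is CNOT, order 3 is the Toffoli gate
    vecs = list(product((0, 1), repeat=n))
    assert len(set(map(gen_and_nand, vecs))) == len(vecs)         # one-to-one
    assert all(gen_and_nand(gen_and_nand(v)) == v for v in vecs)  # self-inverse

triples = list(product((0, 1), repeat=3))
assert sorted(fredkin(*t) for t in triples) == sorted(triples)    # one-to-one
assert all(fredkin(*fredkin(*t)) == t for t in triples)           # self-inverse
assert all(sum(fredkin(*t)) == sum(t) for t in triples)           # bit-conserving

# Constant-input constructions (cf. Fig. 4; garbage outputs ignored):
two = list(product((0, 1), repeat=2))
assert all(fredkin(c, p, 0)[1] == (c & p) for c, p in two)        # AND
assert all(fredkin(c, 0, 1)[1] == 1 - c for c in (0, 1))          # NOT
assert all(fredkin(c, 1, 0)[1:] == (c, 1 - c) for c in (0, 1))    # fan-out

def garbage_less_and(a, b):
    """Minimal compute-copy-uncompute scheme (cf. Figs. 7 and 8), with
    Phi a single Fredkin gate computing AND, and Phi^-1 = Phi."""
    c1, y, g = fredkin(a, b, 0)            # Phi: (a, a&b, ~a&b), garbage g
    y, y_copy, y_neg = fredkin(y, 1, 0)    # copy y by fan-out: (y, y, ~y)
    assert fredkin(c1, y, g) == (a, b, 0)  # Phi^-1 restores inputs and constant
    return y_copy, y_neg                   # true output and its negation

assert all(garbage_less_and(a, b) == (a & b, 1 - (a & b)) for a, b in two)
```

The assertions mirror the claims of the text: the gates are bijections on their domains, and the garbage produced while computing AND can be reversibly erased, leaving only the true output and its negation.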
Reversible Computing, Figure 1 The generalized AND/NAND gate of order 3 called the Toffoli gate
Reversible Computing, Figure 2 The Fredkin gate and its operations
Definition 3 (Fredkin and Toffoli [19]) The Fredkin gate is a 3-input 3-output reversible logic gate that realizes the function
φ: (c, p, q) ↦ (c, cp + c̄q, c̄p + cq).

It is also easy to verify that φ is one-to-one and φ⁻¹ = φ. The operations of the Fredkin gate are depicted in Fig. 2: if c = 1, the inputs p and q are connected to the outputs x and y in parallel, while if c = 0, the outputs are exchanged. In addition to reversibility, the Fredkin gate has the bit-conserving property: the total number of 1's in the output lines is the same as that in the input lines (note that the Toffoli gate does not have this property). This property is an analogue of the conservation laws of mass, energy, and momentum in physics. Fredkin and Toffoli [19] proposed a design theory of reversible logic circuits composed of Fredkin gates and unit wires, which have unit-time delay (Fig. 3). In their theory, a circuit composed of these elements must satisfy the following conditions: every output of a Fredkin gate must be connected to an input of a unit wire (not directly to another Fredkin gate), and every output of a Fredkin gate or a unit wire can be connected to at most one input of another element (i.e., fan-out is not allowed).

Reversible Computing, Figure 3 A unit wire, which acts as a delay element (y(t) = x(t − 1))

Definition 4 A set E of logic elements is called logically universal if any logic function can be realized by a circuit composed only of the elements of E.

It is well known that, e.g., the set {AND, NOT} is logically universal. If we add a delay element (i.e., a memory) to a logically universal set, we can construct any sequential machine, and thus build a universal computer from them (as an infinite circuit). AND, NOT, and fan-out can be realized by Fredkin gates [19] as shown in Fig. 4, if we allow constant inputs 0 and 1 to be supplied, and allow garbage (useless) outputs besides the true inputs and outputs. Hence, the set {Fredkin gate} is logically universal. It is also possible to show logical universality of the set {Toffoli gate} (this is left as an exercise for the reader).

Reversible Computing, Figure 4 Implementing AND, NOT, and fan-out by Fredkin gates

Theorem 1 The sets {Fredkin gate} and {Toffoli gate} are logically universal.

On the other hand, Toffoli [47] showed that any 2-input 2-output reversible gate can be composed only of CNOTs and NOTs. Since it is easy to see that {CNOT, NOT} is not logically universal, we conclude that no 2-input 2-output reversible gate is logically universal.

Garbage-Less Logic Circuits Composed of Fredkin Gates

As seen from Theorem 1, any logic function f: {0,1}^m → {0,1}^n (even if it is not one-to-one) can be realized by a circuit Φ composed of Fredkin gates and unit wires, by allowing constant inputs c and garbage outputs g (Fig. 5). Disposing of the garbage g outside the circuit is, roughly speaking, analogous to heat generation in an actual circuit. Fredkin and Toffoli [19] showed a method to obtain a garbage-less circuit that computes the function f. This method is a logic-circuit version of the garbage-less reversible Turing machine of Bennett [4], which will be discussed in Sect. "Universality of RTMs and Garbage-Less Computation". First, the notion of an inverse circuit Φ⁻¹ of a given reversible logic circuit Φ composed of Fredkin gates and
Reversible Computing, Figure 5 Embedding a logic function f in a reversible logic circuit Φ
Reversible Computing, Figure 6 a A reversible logic circuit, and b its inverse circuit
unit wires is introduced. Φ⁻¹ is obtained as follows: first take the mirror image of Φ, and then exchange the inputs and outputs of each element. Figure 6 shows an example. We assume that Φ has no feedback loop (thus it realizes some combinatorial logic function f), and that all the delays between inputs and outputs are the same. Then Φ⁻¹ computes the inverse function f⁻¹, because Φ⁻¹ "undoes" the computation performed by Φ (note that the inverse of the Fredkin gate is the Fredkin gate itself). If we connect Φ and Φ⁻¹ in series as in Fig. 7, garbage signals are erased reversibly, and we again obtain constants, which can be used in the next computation of the logic function. However, the true outputs y are also converted back to the inputs x. By inserting a fan-out (i.e., copying) circuit between Φ and Φ⁻¹, we can obtain the outputs y without producing the garbage signals g (Fig. 8).

Theorem 2 (Fredkin and Toffoli [19]) For any logic function f: x ↦ y, there is a reversible logic circuit composed of Fredkin gates and unit wires that computes the
Reversible Computing, Figure 7 Reversibly erasing garbage signals by the inverse circuit Φ⁻¹
mapping f′: (c, x, 0, 1) ↦ (c, x, y, ȳ), and produces no garbage signal.

Although the circuit in Fig. 8 does not give garbage signals g, it produces x and ȳ besides the true output y, which may be regarded as a kind of garbage (though its amount is generally smaller than that of g). Especially when we construct a circuit with memories (i.e., a sequential circuit), this kind of garbage is produced at every time step. In this sense, the above circuit is an "almost" garbage-less one. However, if we consider only "reversible" sequential machines (defined in Sect. "Definitions on Reversible Logic Elements with Memory"), then completely garbage-less circuits can be obtained. This is discussed in Sect. "Garbage-less Construction of RSMs by REs".

Reversible Computing, Figure 8 A garbage-less reversible logic circuit that computes a logic function

Billiard Ball Model (BBM)

Reversible Computing, Figure 10 a The switch gate, and b the inverse switch gate
Fredkin and Toffoli [19] proposed an interesting reversible physical model of computing called the Billiard Ball Model (BBM). It consists of idealized balls and reflectors (Fig. 9), and balls can collide with other balls or with reflectors. We assume collisions are elastic and there is no friction; hence, there is no energy dissipation inside the model. Fredkin and Toffoli showed that any circuit composed of Fredkin gates and unit wires can be embedded in BBM. Therefore, a universal computer can be constructed in the space of BBM. To show that the Fredkin gate is realizable in BBM, we need other reversible logic gates called the switch gate and its inverse. The switch gate is a 2-input 3-output gate shown in Fig. 10a, which realizes the one-to-one logic function (c, x) ↦ (c, cx, c̄x). The inverse switch gate is a 3-input 2-output gate whose logic function is defined only on the set {(0, 0, 0), (0, 0, 1), (1, 0, 0), (1, 1, 0)} as shown in Fig. 10b, and is the inverse function of the switch gate. Figure 11 shows how the Fredkin gate can be built from switch gates and inverse switch gates (this circuit is due to R. Feynman and A. Ressler [19]). Finally, the switch gate is realized in BBM as in Fig. 12. Therefore, the Fredkin gate is also implemented in BBM. Margolus showed that BBM can be realized in a reversible cellular automaton [28]. There are also studies related to BBM and cellular automata [14,35].

Reversible Computing, Figure 9 The Billiard Ball Model (BBM)

Reversible Computing, Figure 11 The Fredkin gate built from two switch gates and two inverse switch gates

Reversible Computing, Figure 12 A realization of the switch gate in BBM

Reversible Logic Elements with Memory

In the classical design theory of logic circuits, two sorts of logic elements are generally assumed: logic gates (such as AND, OR, and NOT) and a delay element (i.e., a memory such as a flip-flop). The design technique is likewise divided into two phases: implement Boolean functions as combinatorial logic circuits consisting of gates, and then construct sequential machines from combinatorial logic circuits and delay elements. In this way, the entire process of constructing digital circuits is systematized, and various methods of circuit minimization can be applied (especially in the first phase). The approach of Sect. "Reversible Logic Gates and Circuits", which uses reversible logic gates and delay elements, is a method along this line. However, logic elements with memory are often more useful in reversible computing. A rotary element (RE) [36] is a typical example of such an element with 1-bit memory. It has 4 input lines and 4 output lines (stated exactly, it has 4 input symbols and 4 output symbols), and its operation is intuitively very simple. In this section, we discuss the logical universality of the RE, and also show that any reversible sequential machine can be realized by REs as a completely garbage-less circuit. In Sect. "Constructing RTMs by REs", we shall also see that, by appropriately designing circuits composed only of REs, a clock signal can be eliminated. We should note that a clock is necessary as long as we use "gates", because two or more incoming signals must be synchronized to perform a gate operation. In particular, we shall see that reversible sequential machines and Turing machines can be realized very concisely by such circuits.

Definitions on Reversible Logic Elements with Memory

Since a reversible logic element with memory is a kind of reversible sequential machine (RSM), we first give definitions of a sequential machine of Mealy type (i.e., a finite automaton with output) and of its reversibility.

Definition 5 A sequential machine (SM) is defined by M = (Q, Σ, Γ, q0, δ), where Q is a finite non-empty set of states, Σ and Γ are finite non-empty sets of input and output symbols, respectively, and q0 ∈ Q is an initial state. δ: Q × Σ → Q × Γ is a mapping called a move function. A variation of an SM M = (Q, Σ, Γ, δ), where no initial state is specified, is also called an SM for convenience. M is called a reversible sequential machine (RSM) if δ is one-to-one (hence |Σ| ≤ |Γ|). An RSM is "reversible" in the sense that, from the present state and the output of M, the previous state and the input are determined uniquely.

A reversible logic element with memory (RLEM) is nothing but an RSM with small numbers of states and input/output symbols, from which reversible computers can be built. However, since there are infinitely many RSMs, we should restrict the candidates for RLEMs somehow. Here, we consider only RLEMs with 2 states (i.e., |Q| = 2) and with k input/output symbols (i.e., |Σ| = |Γ| = k) for k = 2, 3, 4. Consider, e.g., a 2-state 4-symbol RLEM M = (Q, Σ, Γ, δ). Here, we fix the state set as Q = {q0, q1}, and the input and output alphabets as Σ = {a, b, c, d} and Γ = {w, x, y, z}, respectively. The move function δ is then of the form δ: {q0, q1} × {a, b, c, d} → {q0, q1} × {w, x, y, z}.
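The RSM condition is easy to test mechanically. The following Python sketch (a minimal encoding; helper names are ours) represents move functions as dictionaries, checks that they are one-to-one, and verifies, for the two 3-symbol RLEMs of Example 1 below, the equivalence of Definition 6 under the renamings f, g, h given there:

```python
# Move functions as dicts: (state, input symbol) -> (state, output symbol).
# RLEM No. 3-61 (cf. Table 1(a)).
d1 = {('q0', 'a'): ('q0', 'x'), ('q0', 'b'): ('q1', 'x'), ('q0', 'c'): ('q1', 'y'),
      ('q1', 'a'): ('q0', 'y'), ('q1', 'b'): ('q1', 'z'), ('q1', 'c'): ('q0', 'z')}
# RLEM No. 3-235 (cf. Table 1(b); primed states and symbols).
d2 = {("q0'", "a'"): ("q0'", "y'"), ("q0'", "b'"): ("q1'", "z'"),
      ("q0'", "c'"): ("q1'", "y'"), ("q1'", "a'"): ("q0'", "x'"),
      ("q1'", "b'"): ("q1'", "x'"), ("q1'", "c'"): ("q0'", "z'")}

def is_reversible(delta):
    # An SM is an RSM iff its move function is one-to-one.
    return len(set(delta.values())) == len(delta)

assert is_reversible(d1) and is_reversible(d2)

# Renamings relating the two RLEMs (cf. Definition 6 and Example 1).
f = {'q0': "q1'", 'q1': "q0'"}
g = {'a': "b'", 'b': "a'", 'c': "c'"}
h = {'x': "x'", 'y': "z'", 'z': "y'"}
f_inv = {v: k for k, v in f.items()}
h_inv = {v: k for k, v in h.items()}

# Definition 6: delta1(q, s) = psi(delta2(f(q), g(s))),
# where psi(q, t) = (f^-1(q), h^-1(t)).
for (q, s), lhs in d1.items():
    q2, t2 = d2[(f[q], g[s])]
    assert (f_inv[q2], h_inv[t2]) == lhs
```

The final loop passes for all six state/input pairs, confirming that RLEMs No. 3-61 and No. 3-235 are indeed the same element up to renaming.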
Reversible Computing, Table 1 An example of a pair of equivalent 2-state 3-symbol RLEMs (an entry "q x" gives the next state q and the output x)

(a) RLEM No. 3-61
Present state   Input a   Input b   Input c
q0              q0 x      q1 x      q1 y
q1              q0 y      q1 z      q0 z

(b) RLEM No. 3-235
Present state   Input a'  Input b'  Input c'
q0'             q0' y'    q1' z'    q1' y'
q1'             q0' x'    q1' x'    q0' z'

Since δ is one-to-one, it is specified by a permutation of the set
{(q0, w), (q0, x), (q0, y), (q0, z), (q1, w), (q1, x), (q1, y), (q1, z)}.
Hence, there are 8! = 40320 RLEMs for k = 4. They are numbered 0, …, 40319 in the lexicographic order of the permutations. 2-state 2-symbol and 3-symbol RLEMs are also numbered in this way [38]. To indicate a k-symbol RLEM, the prefix "k-" is attached to the serial number.

Example 1 Table 1 shows the two RLEMs No. 3-61 and No. 3-235, specified by the following permutations (in Table 1(b), each symbol has a prime (') for the convenience of the later argument).
No. 3-61: ((q0, x), (q1, x), (q1, y), (q0, y), (q1, z), (q0, z))
No. 3-235: ((q0, y), (q1, z), (q1, y), (q0, x), (q1, x), (q0, z))

There are many kinds of 2-state RLEMs even if we limit k = 2, 3, 4, but we can regard two RLEMs as "equivalent" if one can be obtained from the other by renaming the states and the input/output symbols. This is formalized as follows.

Definition 6 Let M1 = (Q1, Σ1, Γ1, δ1) and M2 = (Q2, Σ2, Γ2, δ2) be two RLEMs. M1 and M2 are called equivalent (denoted by M1 ∼ M2) if there exist one-to-one onto mappings f: Q1 → Q2, g: Σ1 → Σ2, and h: Γ1 → Γ2 that satisfy
∀q ∈ Q1, ∀s ∈ Σ1 [δ1(q, s) = ψ(δ2(f(q), g(s)))],
where ψ: Q2 × Γ2 → Q1 × Γ1 is defined as follows:
∀q ∈ Q2, ∀t ∈ Γ2 [ψ(q, t) = (f⁻¹(q), h⁻¹(t))].

We can see that the two RLEMs No. 3-61 and No. 3-235 in Example 1 are equivalent under the following one-to-one onto mappings:
f(q0) = q1',  f(q1) = q0'
g(a) = b',  g(b) = a',  g(c) = c'
h(x) = x',  h(y) = z',  h(z) = y'

The total numbers of 2-state 2-, 3-, and 4-symbol RLEMs are 4! = 24, 6! = 720, and 8! = 40320, respectively. However, the numbers of essentially different RLEMs are relatively few: it has been shown that the numbers of equivalence classes of 2-state 2-, 3-, and 4-symbol RLEMs are 8, 24, and 82, respectively [38].

A Rotary Element (RE) and Its Logical Universality
A rotary element (RE) [36] is a 2-state 4-symbol RLEM defined as follows:
RE = ({H, V}, {n, e, s, w}, {n', e', s', w'}, δ_RE),
where H and V denote the horizontal and vertical states of the element. The move function δ_RE is shown in Table 2. An RE can be understood by the following intuitive interpretation. It has two states, called the H-state and the V-state, and four input lines {n, e, s, w} and four output lines {n', e', s', w'} corresponding to the input and output alphabets (Fig. 13). All the values of the input and output lines are either 0 or 1, i.e., (n, e, s, w), (n', e', s', w') ∈ {0, 1}^4. However, we restrict the domain of these values to {(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)}, i.e., exactly one "1" appears among the four input (respectively, output) lines at a time, when an input is given (an output is produced). The operation of an RE is left undefined for the cases where signal 1's are given to two or more input lines. Signals 1 and 0 are interpreted as the existence and non-existence of a particle. We can interpret that an RE has

Reversible Computing, Figure 13 Two states of a rotary element (RE)

Reversible Computing, Table 2 The move function δ_RE of a rotary element RE (an entry gives the next state and the output)

Present state   Input n   Input e   Input s   Input w
H-state         V, w'     H, w'     V, e'     H, e'
V-state         V, s'     H, n'     V, n'     H, s'
Reversible Computing
Reversible Computing, Figure 14 Operations of an RE: (a) the parallel case, and (b) the orthogonal case
a “rotating bar” to control the moving direction of a particle. When no particle exists, nothing happens on the RE. If a particle comes from a direction parallel to the rotating bar, then it goes out from the output line of the opposite side (i. e., it goes straight ahead) without affecting the direction of the bar (Fig. 14a). If a particle comes from a direction orthogonal to the bar, then it makes a right turn, and rotates the bar by 90 degrees counterclockwise (Fig. 14b).
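Under this interpretation, the move function δRE can be written out in full. The following sketch (illustrative Python; names such as DELTA_RE are chosen here, not taken from the literature) derives the eight entries from the parallel/orthogonal rules just described and checks that the map is one-to-one, i.e., that the RE is reversible:

```python
# A minimal simulation of a rotary element (RE), following the
# "rotating bar" interpretation in the text.

# Move function delta_RE: (state, input line) -> (new state, output line).
# 'H' = horizontal bar, 'V' = vertical bar; primed names are output lines.
DELTA_RE = {
    # parallel case: the particle goes straight ahead, the bar is unchanged
    ('H', 'e'): ('H', "w'"), ('H', 'w'): ('H', "e'"),
    ('V', 'n'): ('V', "s'"), ('V', 's'): ('V', "n'"),
    # orthogonal case: the particle turns right and rotates the bar by 90 deg
    ('H', 'n'): ('V', "w'"), ('H', 's'): ('V', "e'"),
    ('V', 'e'): ('H', "n'"), ('V', 'w'): ('H', "s'"),
}

def step(state, line):
    """Apply delta_RE once to a single incoming particle."""
    return DELTA_RE[(state, line)]

# Reversibility: distinct (state, input) pairs never map to the same
# (new state, output) pair, so delta_RE is one-to-one.
assert len(set(DELTA_RE.values())) == len(DELTA_RE)
```

The final assertion is exactly the injectivity condition that makes the RE an RLEM rather than a mere LEM.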
Theorem 3 (Morita [36]) The set {RE} is logically universal.

This theorem is proved by giving a circuit composed only of REs that simulates a Fredkin gate. Figure 15 shows one such circuit [38]. (A circuit given in [36] uses 16 REs to simulate a Fredkin gate.) In Fig. 15, small triangles represent delay elements, where the number written inside each of them indicates its delay time. In this circuit, we assume that all the elements work synchronously. Note that an RE itself can operate as a delay element.

Garbage-Less Construction of RSMs by REs

We can show that any RSM can be realized as a completely garbage-less RE circuit. (A circuit with a similar property can also be constructed from Fredkin gates [33].)

Theorem 4 (Morita [37]) For any reversible sequential machine M, there is a garbage-less circuit C that realizes M and is composed only of REs. C is completely garbage-less in the sense that it has no extra input/output lines other than the true input/output lines (hence it is a "closed" circuit except for the true input/output).

We show a construction method of such a circuit C by an example. First, consider the circuit composed of k + 1 REs shown in Fig. 16, called an RE-column. In a resting state, each RE in the RE-column is in the vertical state except the bottom one, indicated by x. If x is in the horizontal
Reversible Computing, Figure 15 A realization of a Fredkin gate out of 8 REs and delay elements
Reversible Computing, Figure 16 An RE-column
Reversible Computing, Table 3 The move function of the RE-column

State of x    Input li           Input ri
(marked)      unmarked, li'      marked, li'
(unmarked)    unmarked, ri'      marked, ri'

Reversible Computing, Table 4 A move function δ1 of an RSM M1 (an example)

State   Input a1    Input a2
q1      q2, b1      q3, b2
q2      q2, b2      q1, b1
q3      q1, b2      q3, b1
state, the RE-column is said to be in the marked state; otherwise, it is in the unmarked state. It can be regarded as a 2-state RSM with 2k input symbols {l1, …, lk, r1, …, rk} and 2k output symbols {l1', …, lk', r1', …, rk'}, though it has many transient states. Its move function is shown in Table 3. We now consider an example of an RSM M1 = ({q1, q2, q3}, {a1, a2}, {b1, b2}, δ1), where its move function δ1 is shown in Table 4. Prepare three RE-columns with k = 2, one for each state of M1, and connect them as shown in Fig. 17. If M1 is in the state q1, then only the first RE-column is set to the marked state. When an input signal comes from, e.g., the input line a2, the signal first goes out from the line l2' of the first RE-column, setting this column to the unmarked state. Then, this signal enters the third column from the line r2. This makes the third RE-column marked,
and finally goes out from the output line b2, which realizes δ1(q1, a2) = (q3, b2). In Fig. 17, the circuit is designed so that the delay between the inputs and outputs is constant. However, if there is no need to do so, the delay elements can be removed. In such a case, the circuit works correctly even if each RE operates asynchronously.

Reversible Turing Machines

A reversible Turing machine (RTM) is a standard model of reversible computing, as in the case of traditional (i.e., irreversible) computation theory. The notion of an RTM first appeared in the paper of Lecerf [25], where unsolvability of the halting problem and some related problems on RTMs is shown. Bennett [4] showed that every irreversible Turing machine (TM) can be simulated by an RTM with three tapes without leaving garbage information on the tapes when it halts. He also pointed out the significance of RTMs, since they are closely related to physical reversibility and the problem of energy dissipation in the computing process. In this section, after giving definitions on RTMs, we explain Bennett's method for converting a given irreversible TM into an equivalent 3-tape RTM. We then discuss the problems of how RTMs can be built from reversible logic elements, and how a small universal RTM can be constructed.

Definitions on Reversible Turing Machines (RTMs)

We first introduce a quadruple formulation of RTMs according to Bennett [4]. This formulation makes it easy to define an "inverse" Turing machine for a given RTM. The
Reversible Computing, Figure 17 A garbage-less circuit made of REs and delay elements that simulates M1
inverse RTM, which "undoes" the computation performed by the original RTM, plays a key role for garbage-less computation. We also give a quintuple formulation of RTMs compatible with the quadruple formulation. This is because most classical TMs are defined in the quintuple form, and thus it becomes convenient to compare RTMs with them.

Definition 7 A 1-tape Turing machine in the quadruple form is defined by

T4 = (Q, S, q0, qf, s0, δ),

where Q is a non-empty finite set of states, S is a non-empty finite set of symbols, q0 is an initial state (q0 ∈ Q), qf is a final (halting) state (qf ∈ Q), and s0 is a special blank symbol (s0 ∈ S). δ is a move relation, which is a subset of (Q × S × S × Q) ∪ (Q × {/} × {−, 0, +} × Q). Each element of δ is a quadruple, either of the form [p, s, s', q] ∈ (Q × S × S × Q) or of the form [p, /, d, q] ∈ (Q × {/} × {−, 0, +} × Q). The symbols "−", "0", and "+" stand for "left-shift", "zero-shift", and "right-shift", respectively. [p, s, s', q] means that if T4 reads the symbol s in the state p, then it writes s' and goes to the state q. [p, /, d, q] means that if T4 is in p, then it shifts the head in the direction d and goes to the state q.

T4 is called a deterministic Turing machine iff the following condition holds for any pair of distinct quadruples [p1, b1, c1, q1] and [p2, b2, c2, q2] in δ:

If p1 = p2, then b1 ≠ / ∧ b2 ≠ / ∧ b1 ≠ b2.

Note that, in what follows, we consider only deterministic Turing machines, so we omit the word "deterministic" henceforth. On the other hand, T4 is called a reversible Turing machine (RTM) iff the following condition holds for any pair of distinct quadruples [p1, b1, c1, q1] and [p2, b2, c2, q2] in δ:

If q1 = q2, then b1 ≠ / ∧ b2 ≠ / ∧ c1 ≠ c2.

It is easy to extend the above definition to multi-tape RTMs.
For example, a 2-tape TM is defined by

T = (Q, (S1, S2), q0, qf, (s1,0, s2,0), δ).

A quadruple in δ is either of the form [p, [s1, s2], [s1', s2'], q] ∈ (Q × (S1 × S2) × (S1 × S2) × Q) or of the form [p, /, [d1, d2], q] ∈ (Q × {/} × {−, 0, +}² × Q). Determinism and reversibility are defined similarly as above; namely, T is reversible iff the following condition holds for any pair of distinct quadruples [p1, X1, [b1, b2], q1] and [p2, X2, [c1, c2], q2] in δ:

If q1 = q2, then X1 ≠ / ∧ X2 ≠ / ∧ [b1, b2] ≠ [c1, c2].
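The determinism and reversibility conditions above are purely syntactic checks on the quadruple set, so they are easy to mechanize. The following sketch (illustrative names; quadruples represented as tuples, with '/' marking shift quadruples) tests both conditions for a 1-tape machine:

```python
# Checks of the determinism and reversibility conditions of Definition 7
# for a quadruple-form TM. A quadruple is either (p, s, s2, q) (read/write)
# or (p, '/', d, q) (shift, with d in {'-', '0', '+'}).
from itertools import combinations

def is_deterministic(delta):
    """No two distinct quadruples may overlap in the forward direction."""
    for (p1, b1, _, _), (p2, b2, _, _) in combinations(delta, 2):
        if p1 == p2 and (b1 == '/' or b2 == '/' or b1 == b2):
            return False
    return True

def is_reversible(delta):
    """No two distinct quadruples may overlap in the backward direction."""
    for (p1, b1, c1, q1), (p2, b2, c2, q2) in combinations(delta, 2):
        if q1 == q2 and (b1 == '/' or b2 == '/' or c1 == c2):
            return False
    return True

# A tiny example: deterministic, and also reversible, since the two
# quadruples entering q1 write distinct symbols.
delta = [('q0', '0', '1', 'q1'),
         ('q0', '1', '0', 'q1'),
         ('q1', '/', '+', 'q0')]
assert is_deterministic(delta) and is_reversible(delta)
```

Note the symmetry of the two functions: reversibility is just determinism with the roles of the first and fourth (and second and third) elements exchanged, which is why the inverse of an RTM is again a deterministic TM.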
Definition 8 A 1-tape Turing machine in the quintuple form is defined by

T5 = (Q, S, q0, qf, s0, δ),

where Q, S, q0, qf, s0 are the same as in Definition 7. δ is a move relation, which is a subset of Q × S × S × {−, 0, +} × Q. Each element of δ is a quintuple of the form [p, s, s', d, q]. It means that if T5 reads the symbol s in the state p, then it writes s', shifts the head in the direction d, and goes to the state q.

We say T5 is deterministic iff the following condition holds for any pair of distinct quintuples [p1, s1, s1', d1, q1] and [p2, s2, s2', d2, q2] in δ:

If p1 = p2, then s1 ≠ s2.

We say T5 is reversible iff the following condition holds for any pair of distinct quintuples [p1, s1, s1', d1, q1] and [p2, s2, s2', d2, q2] in δ:

If q1 = q2, then s1' ≠ s2' ∧ d1 = d2.

Proposition 1 For any RTM T5 in the quintuple form, there is an RTM T4 in the quadruple form that simulates each step of the former in two steps.

Proof Let T5 = (Q, S, q0, qf, s0, δ). We define T4 = (Q', S, q0, qf, s0, δ') as follows. Let Q' = Q ∪ {q' | q ∈ Q}. The set δ' is given by the following procedure. First, set the initial value of δ' to the empty set. Next, for each q ∈ Q, do the following operation. Let [p1, s1, s1', d1, q], [p2, s2, s2', d2, q], …, [pm, sm, sm', dm, q] be all the quintuples in δ whose fifth element is q. Note that d1 = d2 = … = dm holds, and s1', s2', …, sm' are pairwise distinct, because T5 is reversible. Then, include the m + 1 quadruples [p1, s1, s1', q'], [p2, s2, s2', q'], …, [pm, sm, sm', q'], and [q', /, d1, q] in δ'. It is easy to see that T4 has the required property.

By Proposition 1, we see that the definition of an RTM in the quintuple form is compatible with that in the quadruple form. The converse of Proposition 1 is also easy to show; it is left for readers as an exercise.
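The construction in the proof of Proposition 1 can be sketched as follows (illustrative Python; fresh states q' are formed here simply by appending a prime to the state name):

```python
# Conversion of a reversible quintuple-form TM into a quadruple-form TM,
# following the proof of Proposition 1: each quintuple [p, s, s2, d, q]
# is split into a read/write quadruple ending in a fresh state q' and
# a single shift quadruple [q', '/', d, q] shared by all quintuples
# entering q.
from collections import defaultdict

def quintuples_to_quadruples(delta5):
    groups = defaultdict(list)          # group quintuples by target state q
    for (p, s, s2, d, q) in delta5:
        groups[q].append((p, s, s2, d))
    delta4 = []
    for q, quints in groups.items():
        d1 = quints[0][3]               # all directions agree, by reversibility
        assert all(d == d1 for (_, _, _, d) in quints)
        for (p, s, s2, _) in quints:
            delta4.append((p, s, s2, q + "'"))     # write step
        delta4.append((q + "'", '/', d1, q))       # then one shift step
    return delta4

# One quintuple becomes two quadruples executed in sequence.
delta4 = quintuples_to_quadruples([('q0', '1', '0', '+', 'q1')])
assert delta4 == [('q0', '1', '0', "q1'"), ("q1'", '/', '+', 'q1')]
```

The assertion inside the loop is exactly the fact, guaranteed by reversibility, that all quintuples entering a given state shift in the same direction; without it, the shared shift quadruple would be ill-defined.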
(Of course, it is very easy to construct an RTM in the quintuple form so that it simulates each quadruple of a given RTM by a single quintuple. In addition, it is also possible to simulate a consecutive pair of read/write and shift quadruples by one quintuple, and thus we can reduce the numbers of states and quintuples.)

Universality of RTMs and Garbage-Less Computation

We now discuss the computation-universality of RTMs. It is actually easy to convert an irreversible TM into an RTM, because we can construct an RTM that simulates the former
and records, on a special history tape, which quadruples were executed by the irreversible TM. But, by this method, a large amount of garbage information is left at the end of a computation. The important point is that such garbage can be reversibly erased, hence a garbage-less computation is possible, as shown in the next theorem.

Theorem 5 (Bennett [4]) For any (irreversible) 1-tape TM, we can construct a 3-tape RTM that simulates the former and leaves only an input string and an output string on the tapes when it halts (hence, it leaves no garbage information).

Proof Let T = (Q, S, q0, qf, 0, δ) be a given irreversible TM in the quadruple form. We assume T satisfies the following conditions (it is easy to modify a given TM so that it meets them).

(i) T has a rightward-infinite tape whose leftmost square always keeps the blank symbol.
(ii) When T halts, it does so at the leftmost square in the state qf.
(iii) The output is given from the second square of the tape to the right when T halts.
(iv) The output string does not contain blank symbols.
(v) The initial state q0 does not appear as the fourth element of any quadruple.
(vi) There is only one quadruple in δ whose fourth element is qf.

Let m be the total number of quadruples contained in δ, and assume the numbers 1, …, m are assigned uniquely to these quadruples. An RTM TR that simulates T has three tapes: a working tape (for simulating the tape of T), a history tape (for recording the movement of T at each step), and an output tape (for writing an output string). It is defined as follows:

TR = (Q', (S, {0, 1, …, m}, S), q0, p0, (0, 0, 0), δ')
Q' = {q0, q1, …, qf} ∪ {q0', q1', …, qf'} ∪ {c1, c1', c2, c2'} ∪ {p0, p1, …, pf} ∪ {p0', p1', …, pf'}

A computation of TR has three stages: (1) the forward computation stage, (2) the copying stage, and (3) the backward computation stage. The set δ' of quadruples is defined as the union δ1 ∪ δ2 ∪ δ3 of the sets of quadruples given below. We assume that when TR starts to move, an input string is written on the working tape, and the history and the output tapes contain only blank symbols.

(1) The quadruple set δ1 for the forward computation stage: When TR simulates T step by step, it records on the history tape which quadruple of T was used. By this operation TR can be reversible. This is realized by giving the quadruple set δ1 as follows.

(i) If the h-th quadruple (h = 1, …, m) of T is [qi, sj, sk, ql] (qi, ql ∈ Q; sj, sk ∈ S), then include the following quadruples in δ1.

[qi, /, [0, +, 0], qi']
[qi', [sj, 0, 0], [sk, h, 0], ql]

(ii) If the h-th quadruple (h = 1, …, m) of T is [qi, /, d, ql] (qi, ql ∈ Q; d ∈ {−, 0, +}), then include the following quadruples in δ1.

[qi, /, [d, +, 0], qi']
[qi', [x, 0, 0], [x, h, 0], ql]   (x ∈ S)
(2) The quadruple set δ2 for the copying stage: If the forward computation stage is completed and TR is in the state qf, then TR copies the output string on the working tape to the output tape. This is realized by giving the quadruple set δ2 as below. Here, let n be the number of the quadruple that contains qf as its fourth element.

[qf, [0, n, 0], [0, n, 0], c1]
[c1, /, [+, 0, +], c2]
[c2, [y, n, 0], [y, n, y], c1]   (y ∈ S − {0})
[c2, [0, n, 0], [0, n, 0], c2']
[c2', /, [−, 0, −], c1']
[c1', [y, n, y], [y, n, y], c2']   (y ∈ S − {0})
[c1', [0, n, 0], [0, n, 0], pf]
(3) The quadruple set δ3 for the backward computation stage: After the copying stage, the backward computation stage starts in order to reversibly erase the garbage information left on the history tape. This is performed by an inverse TM of the forward computation. The quadruple set δ3 is as follows.

(i) If the h-th quadruple (h = 1, …, m) of T is [qi, sj, sk, ql] (qi, ql ∈ Q; sj, sk ∈ S), then include the following quadruples in δ3.

[pi', /, [0, −, 0], pi]
[pl, [sk, h, 0], [sj, 0, 0], pi']

(ii) If the h-th quadruple (h = 1, …, m) of T is [qi, /, d, ql] (qi, ql ∈ Q; d ∈ {−, 0, +}), then include the following quadruples in δ3. Note that if d = − (0, +, respectively), then d' = + (0, −).

[pi', /, [d', −, 0], pi]
[pl, [x, h, 0], [x, 0, 0], pi']   (x ∈ S)
It is easy to see that TR simulates T correctly. Reversibility of TR is verified by checking the quadruple set δ'. In particular, the set δ3 of quadruples is reversible since T is deterministic.

Example 2 We consider the following irreversible TM Terase, and convert it into an RTM TRerase by the method of Theorem 5.

Terase = ({q0, q1, q2, q3, q4, qf}, {0, 1, 2}, q0, qf, 0, δerase)
δerase = { 1: [q0, /, +, q1],
2: [q1, 0, 0, q4],
3: [q1, 1, 1, q2],
4: [q1, 2, 1, q2],
5: [q2, /, +, q1],
6: [q3, 0, 0, qf],
7: [q3, 1, 1, q4],
8: [q4, /, −, q3] }
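Before turning to the reversible version, the behavior of Terase itself can be checked with a direct simulation (an illustrative sketch; the function name and tape representation are ours):

```python
# A direct simulation of the irreversible TM T_erase above on a
# rightward-infinite tape; '/' marks the shift quadruples.
def run_terase(inp):
    tape = ['0'] + list(inp)            # square 0 always keeps the blank '0'
    delta = {('q0', '/'): ('+', 'q1'), ('q1', '0'): ('0', 'q4'),
             ('q1', '1'): ('1', 'q2'), ('q1', '2'): ('1', 'q2'),
             ('q2', '/'): ('+', 'q1'), ('q3', '0'): ('0', 'qf'),
             ('q3', '1'): ('1', 'q4'), ('q4', '/'): ('-', 'q3')}
    state, head = 'q0', 0
    while state != 'qf':
        if (state, '/') in delta:       # shift quadruple
            d, state = delta[(state, '/')]
            head += {'+': 1, '-': -1, '0': 0}[d]
            if head == len(tape):
                tape.append('0')
        else:                           # read/write quadruple
            s2, state = delta[(state, tape[head])]
            tape[head] = s2
    return ''.join(tape[1:]).rstrip('0')

# T_erase rewrites every occurrence of 2 into 1:
assert run_terase('122') == '111'
```

The run halts at the leftmost square in qf with the output starting at the second square, as conditions (i)-(iii) of Theorem 5 require.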
When Terase is given an input string consisting of 1's and 2's, it rewrites all the occurrences of 2 into 1, and halts (see Fig. 18).

Reversible Computing, Figure 18 The initial and the final computational configurations of Terase for the given input string 122

The set of quadruples of the RTM TRerase that simulates Terase is given below, and its computation process for the input 122 is shown in Fig. 19.

The quadruples for the forward computation stage:

[q0, /, [+, +, 0], q0']
[q0', [x, 0, 0], [x, 1, 0], q1]   (x ∈ S)
[q1, /, [0, +, 0], q1']
[q1', [0, 0, 0], [0, 2, 0], q4]
[q1', [1, 0, 0], [1, 3, 0], q2]
[q1', [2, 0, 0], [1, 4, 0], q2]
[q2, /, [+, +, 0], q2']
[q2', [x, 0, 0], [x, 5, 0], q1]   (x ∈ S)
[q3, /, [0, +, 0], q3']
[q3', [0, 0, 0], [0, 6, 0], qf]
[q3', [1, 0, 0], [1, 7, 0], q4]
[q4, /, [−, +, 0], q4']
[q4', [x, 0, 0], [x, 8, 0], q3]   (x ∈ S)

The quadruples for the copying stage:

[qf, [0, 6, 0], [0, 6, 0], c1]
[c1, /, [+, 0, +], c2]
[c2, [y, 6, 0], [y, 6, y], c1]   (y ∈ S − {0})
[c2, [0, 6, 0], [0, 6, 0], c2']
[c2', /, [−, 0, −], c1']
[c1', [y, 6, y], [y, 6, y], c2']   (y ∈ S − {0})
[c1', [0, 6, 0], [0, 6, 0], pf]

The quadruples for the backward computation stage:

[p0', /, [−, −, 0], p0]
[p1, [x, 1, 0], [x, 0, 0], p0']   (x ∈ S)
[p1', /, [0, −, 0], p1]
[p4, [0, 2, 0], [0, 0, 0], p1']
[p2, [1, 3, 0], [1, 0, 0], p1']
[p2, [1, 4, 0], [2, 0, 0], p1']
[p2', /, [−, −, 0], p2]
[p1, [x, 5, 0], [x, 0, 0], p2']   (x ∈ S)
[p3', /, [0, −, 0], p3]
[pf, [0, 6, 0], [0, 0, 0], p3']
[p4, [1, 7, 0], [1, 0, 0], p3']
[p4', /, [+, −, 0], p4]
[p3, [x, 8, 0], [x, 0, 0], p4']   (x ∈ S)

It is also possible to convert an arbitrary (irreversible) TM into an equivalent 1-tape 2-symbol RTM [31]. There are also studies related to the computational complexity of RTMs [8,10,21]. Bennett [8] showed the following space-time trade-off theorem for RTMs, where an interesting method of reducing space is given. Note that the RTM in Theorem 5 uses a large amount of space, proportional to the time used by the simulated TM.

Theorem 6 (Bennett [8]) For any ε > 0 and any (irreversible) TM using time T and space S, we can construct an RTM that simulates the former using time O(T^(1+ε)) and space O(S log T).

Constructing RTMs by REs
RTMs are related to many other models of reversible computing. For example, for any given RTM, there is a one-dimensional reversible cellular automaton (RCA) that simulates the RTM directly [32]. Hence, a one-dimensional RCA is computation-universal. It is also possible to construct a garbage-less reversible logic circuit that simulates a given RTM. We have already seen that any reversible sequential machine can be realized as a garbage-less RE circuit (Theorem 4). In [36], it is shown that an RTM can be decomposed into tape cell modules and a finite state control module, each of which is a reversible sequential machine. Therefore, any RTM can be composed of REs as a garbage-less circuit. Figure 20 shows an example of an RTM composed only of REs [36]. The finite state control is in the left half of this figure; to the right of it, infinitely many copies of the tape cell module are connected. (Note that, here, some ad hoc techniques are used to reduce the number of REs, rather than the systematic method employed in Theorem 4.) Note that this circuit is closed except for the Begin and End ports. This RTM computes the function f(n) = 2n, where n is given as a unary number [36]. After setting the initial states of the tape cell modules and then giving a signal to the Begin port, a computing process starts. If the RTM halts, a signal goes out from the End port, leaving the answer in the tape cell modules.
Reversible Computing, Figure 19 The computation process of the RTM TRerase for the input string 122: the forward computation stage (from t = 0 to t = 32), the copying stage (from t = 33 to t = 48), and the backward computation stage (from t = 49 to t = 81)
Reversible Computing, Figure 20 A circuit composed only of REs that realizes an RTM [36]
It is also possible to realize RTMs by Fredkin gates and delay elements as a garbage-less circuit, using the method of composing RSMs from Fredkin gates shown in [33]. If we do so, however, we need a larger number of Fredkin gates than of REs. In addition, when designing such a circuit, signal delays and timings must be adjusted very carefully, because signals must be synchronized at each gate. On the other hand, the circuit of Fig. 20 works correctly even if each RE operates asynchronously, since only one signal moves in this circuit. In an RE, a single moving signal interacts with a rotating bar, where the latter can be thought of as a kind of stationary signal. Hence, its operation may be performed at any time, and there is no need to synchronize signals at each RE. Of course, if two or more signals are moving in a circuit, some control mechanism for timing is required.

Universal RTMs

From Theorem 5, we can see that there exists a universal RTM (URTM) that can compute any recursive function. However, if we use the technique in the proof of Theorem 5, the numbers of states and symbols of a URTM become very large. As for classical (i.e., irreversible) universal Turing machines (UTMs), the following UTMs have been proposed so far, where UTM(m, n) denotes an m-state n-symbol UTM: a UTM(7,4) by Minsky [30]; a UTM(24,2), a UTM(10,3), a UTM(7,4), a UTM(5,5), a UTM(4,6), a UTM(3,10) and a UTM(2,18) by Rogozhin [44]; a UTM(19,2) by Baiocchi [2]; a UTM(3,9) by Kudlek and Rogozhin [23]; etc. Most of these small UTMs simulate a 2-tag system [30], which is a simple string rewriting system having computation-universality. In the following, we describe a 17-state 5-symbol URTM (URTM(17,5)) proposed by Morita and Yamaguchi [39]. It simulates a cyclic tag system (CTAG) given by Cook [11], which is also a very simple string rewriting system with computation-universality.
Since the notion of halting was not defined explicitly in the original definition of a CTAG, we use here a modified definition of a CTAG with the halting property, which can simulate a 2-tag system with the halting property [39].

Definition 9 A cyclic tag system (CTAG) is defined by

C = (k, {Y, N}, (halt, p1, …, p_(k−1))),

where k (k = 1, 2, …) is the length of a cycle (i.e., the period), {Y, N} is the (fixed) alphabet, and (p1, …, p_(k−1)) ∈ ({Y, N}*)^(k−1) is a (k − 1)-tuple of production rules. A pair (v, m) is called an instantaneous description (ID) of C, where v ∈ {Y, N}* and m ∈ {0, …, k − 1}; m is called the phase of the ID. A halting ID is an ID of the form (Yv, 0) (v ∈ {Y, N}*). The transition relation ⇒ is defined as follows. For any (v, m), (v', m') ∈ {Y, N}* × {0, …, k − 1},

(Yv, m) ⇒ (v', m') iff [m ≠ 0] ∧ [m' = m + 1 mod k] ∧ [v' = v pm],
(Nv, m) ⇒ (v', m') iff [m' = m + 1 mod k] ∧ [v' = v].

A sequence of IDs ((v0, m0), …, (vn, mn)) is called a complete computation starting from v ∈ {Y, N}* iff (v0, m0) = (v, 0), (vi, mi) ⇒ (vi+1, mi+1) (i = 0, 1, …, n − 1), and (vn, mn) is a halting ID.

Example 3 Consider the CTAG C1 = (3, {Y, N}, (halt, YN, YY)). The complete computation starting from NYY is (NYY, 0) ⇒ (YY, 1) ⇒ (YYN, 2) ⇒ (YNYY, 0).

Theorem 7 (Morita and Yamaguchi [39]) There is a URTM(17,5).
A URTM(17,5) T17,5 in the quintuple form that simulates any CTAG is as follows [39]:

T17,5 = ({q0, …, q16}, {b, Y, N, *, $}, q0, b, δ),

where the set δ of quintuples is shown in Table 5. (Note that, in the construction of a UTM, the final state is usually omitted from the state set.) There are 67 quintuples in total. If a CTAG halts with a halting ID, then T17,5 halts in the state q1. If the string becomes an empty string, then it halts in the state q2; in Table 5, this is indicated by "null". Figure 21 shows how the CTAG C1 with the initial string NYY in Example 3 is simulated by the URTM T17,5. On the tape of the URTM, the production rules (halt, YN, YY) of C1 are expressed by the reversal sequence over {Y, N, *}, i.e., * YY * NY *, where * is used as a delimiter between rules, and "halt" is represented by the empty string. Note that in the initial tape of T17,5 (t = 0), the rightmost * is replaced by b. This indicates that the phase is 0. In general, if the phase is i − 1, then the i-th * from the right is replaced by b; this symbol b is called the "phase marker". On the other hand, the given initial string for C1 is placed to the right of the rules, where $ is used as a delimiter. The IDs in the complete computation (NYY, 0) ⇒ (YY, 1) ⇒ (YYN, 2) ⇒ (YNYY, 0) of C1 appear in the computational configurations of T17,5 at t = 0, 6, 59, and 142, respectively. The symbol $ in the final string (t = 148) should be regarded as Y.
It is easy to see that T17,5 is reversible, by checking the set of quintuples shown in Table 5 against the definition of an RTM. Intuitively, its reversibility is guaranteed by the fact that no information is erased in the whole simulation process. Furthermore, every branch of the program caused by reading the symbol Y or N is "merged reversibly" by writing the original symbol back. For example, the states q11 and q15 transit to the same state q0 by writing Y and N, respectively, using the quintuples [q11, $, Y, +, q0] and [q15, b, N, +, q0].

Other Models of Reversible Computing
Reversible Computing, Figure 21 Simulating the CTAG C1 in Example 3 by the URTM T17,5
Reversible Computing, Table 5 The set of quintuples of the URTM T17,5 [39]

(For each state q0, …, q16 and each tape symbol in {b, Y, N, *, $}, the table gives the symbol written, the shift direction, and the next state; it contains 67 quintuples in total. See [39] for the complete listing.)
There are also several models of reversible computing that were not dealt with in the previous sections; we briefly discuss a few of them here. A reversible cellular automaton (RCA) is an important model, because it can deal with reversible spatio-temporal phenomena; in fact, it can be thought of as an abstract model of a reversible space. So far, a lot of interesting results and properties of RCAs have been shown (see, e.g., [35,49]); some of them are described in Reversible Cellular Automata. A counter machine (CM) is a simple model of computing consisting of a read-only input tape, a finite number of counters, and a finite-state control. It is known that a CM with only two counters is computation-universal [30]. A reversible counter machine (RCM) is a backward-deterministic version of a CM. An RCM with only two counters is known to be computation-universal [34], even though it is a very simple model; this is useful for showing the universality of other reversible systems. A reversible finite automaton (RFA) is likewise a backward-deterministic version of a finite automaton. Pin [43] studied this model and characterized its language-accepting ability in terms of formal language theory. Note that an RSM in Sect. "Definitions on Reversible Logic Elements with Memory" is an RFA augmented by an output mechanism.

Future Directions

How Can We Realize Reversible Computers as Hardware?

Although we have discussed reversible computing mainly from the standpoint of computation theory, how reversible computers can be realized as hardware is a very important problem. So far, there have been many interesting attempts from the engineering side: e.g., implementing reversible logic as electrically controlled switches [29], c-MOS implementation of reversible logic gates and circuits [13], adiabatic circuits for reversible
computers [17,18], and the synthesis of reversible logic circuits [1,45]. Ultimately, however, reversible logic elements and computers should be implemented at the atomic or molecular level. It is plausible that some nano-scale phenomena can be used as primitives of reversible computing, because the microscopic physical law is reversible. Finding such a solution is of course very difficult, but it is an interesting and challenging problem left for future investigations.

How Simple Can Reversible Computers Be?

Finding very simple reversible logic elements with universality is an important problem from both the theoretical and practical viewpoints. We may assume that, in general, hardware implementation will become easier if we can find much simpler logical primitives from which reversible computers can be built. As we have discussed in Sect. "Reversible Logic Gates", the Fredkin gate and the Toffoli gate are logically universal gates with the minimum number of inputs and outputs. On the other hand, as for the reversible logic elements with memory (RLEMs) discussed in Sect. "Reversible Logic Elements with Memory", there are 14 logically universal 2-state 3-symbol RLEMs [38,40], which have fewer symbols than an RE. However, it is an open problem whether there is a single logically universal 2-state 2-symbol RLEM (though it is known that there is a set of two 2-state 2-symbol RLEMs that is logically universal [27]). It is also an interesting problem to find universal reversible Turing machines much smaller than T17,5 of Sect. "Universal RTMs".

Novel Architectures for Reversible Computers

If we try to construct more realistic computers from reversible logic primitives, we shall need new design theories suited to them. For example, the circuit composed of REs shown in Fig. 20 has very different features from traditional logic circuits made of gates. There will surely be many other possibilities for new design techniques that cannot be imagined from the classical theory of logic circuits. To realize efficient reversible computers, the development of novel design methods and architectures is necessary.

Asynchronous and Continuous-Time Reversible Models

There are still other problems between reversible physical systems and reversible computing systems. The first is asynchronous versus synchronous systems. When building a computing system with a huge number of elements, it is preferable if it is realized as an asynchronous
system, because if the clock is eliminated, power dissipation is reduced, and each element can operate at any time, independently of the other elements [26,41]. However, it is a difficult problem to define reversibility in asynchronous systems properly. The second problem is continuous time versus discrete time. Natural physical systems are continuous (at least they seem so), while most reversible computing models are defined as discrete systems. To bridge the gap between reversible physical systems and reversible computing models, and to implement the latter in the former, we shall need further investigations on these problems.

Bibliography

Primary Literature

1. Al-Rabadi AN (2004) Reversible Logic Synthesis. Springer, Berlin
2. Baiocchi C (2001) Three small universal Turing machines. In: Proc 3rd Int Conference on Machines, Computations, and Universality. LNCS, vol 2055. Springer, Berlin, pp 1-10
3. Barenco A, Bennett CH, Cleve R, DiVincenzo DP, Margolus N, Shor P, Sleator T, Smolin J, Weinfurter H (1995) Elementary gates for quantum computation. Phys Rev A 52:3457-3467
4. Bennett CH (1973) Logical reversibility of computation. IBM J Res Dev 17:525-532
5. Bennett CH (1982) The thermodynamics of computation. Int J Theoret Phys 21:905-940
6. Bennett CH, Landauer R (1985) The fundamental physical limits of computation. Sci Am 253:38-46
7. Bennett CH (1987) Demons, engines, and the second law. Sci Am 257:108-117
8. Bennett CH (1989) Time/space trade-offs for reversible computation. SIAM J Comput 18:766-776
9. Bennett CH (2003) Notes on Landauer's principle, reversible computation, and Maxwell's Demon. Stud Hist Philos Mod Phys 34:501-510
10. Buhrman H, Tromp J, Vitanyi P (2001) Time and space bounds for reversible simulation. J Phys A 34:6821-6830
11. Cook M (2004) Universality in elementary cellular automata. Complex Syst 15:1-40
12. Deutsch D (1985) Quantum theory, the Church-Turing principle and the universal quantum computer. Proc R Soc Lond A 400:97-117
13. De Vos A, Desoete B, Adamski A, Pietrzak P, Sibinski M, Widerski T (2000) Design of reversible logic circuits by means of control gates. In: PATMOS 2000. LNCS, vol 1918. Springer, Berlin, pp 255-264
14. Durand-Lose J (2002) Computing inside the billiard ball model. In: Adamatzky A (ed) Collision-Based Computing. Springer, London, pp 135-160
15. Feynman RP (1982) Simulating physics with computers. Int J Theoret Phys 21:467-488
16. Feynman RP (1996) Feynman lectures on computation. Hey AJG, Allen RW (eds). Perseus Books, Massachusetts
17. Frank M, Vieri C, Ammer MJ, Love N, Margolus NH, Knight TE (1998) A scalable reversible computer in silicon. In: Calude CS, Casti J, Dinneen MJ (eds) Unconventional Models of Computation. Springer, Singapore, pp 183-200
18. Frank M (1999) Reversibility for Efficient Computing. PhD Dissertation, MIT, Cambridge
19. Fredkin E, Toffoli T (1982) Conservative logic. Int J Theoret Phys 21:219-253
20. Gruska J (1999) Quantum Computing. McGraw-Hill, London
21. Jacopini G, Mentrasti P, Sontacchi G (1990) Reversible Turing machines and polynomial time reversibly computable functions. SIAM J Disc Math 3:241-254
22. Keys RW, Landauer R (1970) Minimal energy dissipation in logic. IBM J Res Dev 14:152-157
23. Kudlek M, Rogozhin Y (2002) A universal Turing machine with 3 states and 9 symbols. In: Kuich W, Rozenberg G, Salomaa A (eds) Developments in Language Theory (DLT 2001). LNCS, vol 2295. Springer, Berlin, pp 311-318
24. Landauer R (1961) Irreversibility and heat generation in the computing process. IBM J Res Dev 5:183-191
25. Lecerf Y (1963) Machines de Turing réversibles - Récursive insolubilité en n ∈ N de l'équation u = θⁿu, où θ est un isomorphisme de codes. Comptes Rendus Hebd Séances Acad Sci 257:2597-2600
26. Lee J, Peper F, Adachi S, Mashiko S (2004) On reversible computation in asynchronous systems. In: Hida T, Saito K, Si S (eds) Quantum Information and Complexity. World Scientific, Singapore, pp 296-320
27. Lee J, Peper F, Adachi S, Morita K (2008) An asynchronous cellular automaton implementing 2-state 2-input 2-output reversed-twin reversible elements. In: Proc 8th Int Conf on Cellular Automata for Research and Industry, Yokohama, Sep 2008 (to appear)
28. Margolus N (1984) Physics-like models of computation. Physica 10D:81-95
29. Merkle RC (1993) Reversible electronic logic using switches. Nanotechnology 4:20-41
30. Minsky ML (1967) Computation: Finite and Infinite Machines. Prentice-Hall, Englewood Cliffs
31. Morita K, Shirasaki A, Gono Y (1989) A 1-tape 2-symbol reversible Turing machine. Trans IEICE Japan E-72:223-228
32. Morita K, Harao M (1989) Computation universality of one-dimensional reversible (injective) cellular automata. Trans IEICE Japan E-72:758-762
33. Morita K (1990) A simple construction method of a reversible finite automaton out of Fredkin gates, and its related problem. Trans IEICE Japan E-73:978-984
34. Morita K (1996) Universality of a reversible two-counter machine. Theoret Comput Sci 168:303-320
35. Morita K (2001) Cellular automata and artificial life - computation and life in reversible cellular automata. In: Goles E, Martinez S (eds) Complex Systems. Kluwer Academic Publishers, Dordrecht, pp 151-200
36. Morita K (2001) A simple reversible logic element and cellular automata for reversible computing. In: Proc 3rd Int Conf on Machines, Computations, and Universality. LNCS, vol 2055. Springer, Berlin, pp 102-113
37. Morita K (2003) A new universal logic element for reversible computing. In: Martin-Vide C, Mitrana V (eds) Grammars and Automata for String Processing. Taylor & Francis, London, pp 285-294
38. Morita K, Ogiro T, Tanaka K, Kato H (2005) Classification and universality of reversible logic elements with one-bit memory. In: Proc 4th Int Conf on Machines, Computations, and Universality. LNCS, vol 3354. Springer, Berlin, pp 245-256
39. Morita K, Yamaguchi Y (2007) A universal reversible Turing machine. In: Proc 5th Int Conf on Machines, Computations, and Universality. LNCS, vol 4664. Springer, Berlin, pp 90-98
40. Ogiro T, Kanno A, Tanaka K, Kato H, Morita K (2005) Nondegenerate 2-state 3-symbol reversible logic elements are all universal. Int J Unconv Comput 1:47-67
41. Peper F, Lee J, Adachi S, Mashiko S (2003) Laying out circuits on asynchronous cellular arrays: a step towards feasible nanocomputers. Nanotechnology 14:469-485
42. Petri CA (1967) Grundsätzliches zur Beschreibung diskreter Prozesse. In: Proc 3rd Colloquium über Automatentheorie. Birkhäuser, Basel, pp 121-140
43. Pin JE (1992) On reversible automata. In: Proc Latin American Symp on Theoretical Informatics. LNCS, vol 583. Springer, Berlin, pp 401-416
44. Rogozhin Y (1996) Small universal Turing machines. Theor Comput Sci 168:215-240
45. Shende VV, Prasad AK, Markov IL, Hayes JP (2003) Synthesis of reversible logic circuits. IEEE Trans Computer-Aided Des Integr Circuits Syst 22:710-722
46. Toffoli T (1977) Computation and construction universality of reversible cellular automata. J Comput Syst Sci 15:213-231
47. Toffoli T (1980) Reversible computing. In: Automata, Languages and Programming. LNCS, vol 85. Springer, Berlin, pp 632-644
48.
36. Morita K (2001) A simple reversible logic element and cellular automata for reversible computing. In: Proc 3rd Int Conf on Machines, Computations, and Universality. LNCS, vol 2055. Springer, Berlin, pp 102–113 37. Morita K (2003) A new universal logic element for reversible computing. In: Martin-Vide C, Mitrana V (eds) Grammars and Automata for String Processing. Taylor & Francis, London, pp 285–294 38. Morita K, Ogiro T, Tanaka K, Kato H (2005) Classification and universality of reversible logic elements with one-bit memory. In: Proc 4th Int Conf on Machines, Computations, and Universality. LNCS, vol 3354. Springer, Berlin, pp 245–256 39. Morita K, Yamaguchi Y (2007) A universal reversible Turing machine. In: Proc 5th Int Conf on Machines, Computations, and Universality. LNCS, vol 4664. Springer, Berlin, pp 90–98 40. Ogiro T, Kanno A, Tanaka K, Kato H, Morita K (2005) Nondegenerate 2-state 3-symbol reversible logic elements are all universal. Int J Unconv Comput 1:47–67 41. Peper F, Lee J, Adachi S, Mashiko S (2003) Laying out circuits on asynchronous cellular arrays: a step towards feasible nanocomputers. Nanotechnology 14:469–485 42. Petri CA (1967) Grundsätzliches zur Beschreibung diskreter Prozesse. In: Proc 3rd Colloquium über Automatentheorie. Birkhäuser, Basel, pp 121–140 43. Pin JE (1992) On reversible automata. In: Proc Latin American Symp on Theoretical Informatics. LNCS vol 583. Springer, Berlin, pp 401–416 44. Rogozhin Y (1996) Small universal Turing machines. Theor Comput Sci 168:215–240 45. Shende VV, Prasad AK, Markov IL, Hayes JP (2003) Synthesis of reversible logic circuits. IEEE Trans Computer-Aided Des Integr Circuits Syst 22:710–722 46. Toffoli T (1977) Computation and construction universality of reversible cellular automata. J Comput Syst Sci 15:213–231 47. Toffoli T (1980) Reversible computing, Automata, Languages and Programming. In: LNCS, vol 85. Springer, Berlin, pp 632– 644 48. 
Toffoli T (1981) Bicontinuous extensions of invertible combinatorial functions. Math Syst Theory 14:12–23 49. Toffoli T, Margolus N (1990) Invertible cellular automata: a review. Physica D 45:229–253
Books and Reviews Adamatzky A (ed) (2002) Collision-Based Computing. Springer, London Bennett CH (1988) Notes on the history of reversible computation. IBM J Res Dev 32:16–23 Milburn GJ (1998) The Feynman Processor. Perseus Books, Reading Vitanyi P (2005) Time, space, and energy in reversible computing. In: Proc 2005 ACM Int Conf on Computing Frontiers, Ischia, pp 435–444
2701
2702
Rough and Rough-Fuzzy Sets in Design of Information Systems

Theresa Beaubouef 1, Frederick Petry 2
1 Southeastern Louisiana University, Hammond, USA
2 Naval Research Lab, Stennis Space Center, Mississippi, USA
Article Outline

Glossary
Definition of the Subject
Introduction
Rough Sets and Rough-Fuzzy Sets
Rough Relational Database
Information Theory
Entropy and the Rough Relational Database
Rough Fuzzy Relational Database
Rough Set Modeling of Spatial Data
Data Mining in Rough Databases
Future Directions
Acknowledgment
Bibliography

Glossary

Rough sets: Rough set theory is a technique for dealing with uncertainty and for identifying cause-effect relationships in databases. It is based on a partitioning of some domain into equivalence classes and the definition of lower and upper approximation regions based on this partitioning to denote certain and possible inclusion in the rough set.

Fuzzy sets: Fuzzy set theory is another technique for dealing with uncertainty. It is based on the concept of measuring the degree of inclusion in a set through the use of a membership value. Where elements either belong or do not belong to an ordinary set, elements of a fuzzy set belong to the set to a certain degree, with zero indicating non-membership, one indicating complete membership, and values between zero and one indicating partial or uncertain membership.

Information theory: Information theory involves the study of measuring the information content of a signal. In databases, information-theoretic measures can be used to measure the information content of data. Entropy is one such measure.

Database: A collection of data and the application programs that make use of this data for some enterprise.
Information system: An information system is a database enhanced with additional tools that can be used by management for planning and decision making.

Data mining: Data mining involves the discovery of patterns or rules in a set of data. These patterns yield knowledge and information from the raw data that can be used for making decisions. There are many approaches to data mining, and uncertainty management techniques play a vital role in knowledge discovery.

Definition of the Subject

Databases and information systems are ubiquitous in this age of information and technology. Computers have revolutionized the way data can be manipulated and stored, allowing for very large databases with sophisticated capabilities. With so much money and manpower invested in the design and daily use of these systems, it is imperative that they be as correct, secure, and adaptable to the changing needs of the enterprise as possible. It is therefore important to understand the design and implementation of such systems and to be able to utilize all their capabilities.

Scientists and business executives alike know the value of information. The challenge has been to produce relevant information for an ever-changing, uncertain world from data and facts stored on computers and archival devices. These data are considered to be exact, certain, factual values. The real world, however, is uncertain, inexact, and fraught with errors. It is a challenge, then, to extract useful and relevant information from ordinary databases. Uncertainty management techniques such as rough and fuzzy sets can help.

Introduction

Databases are recognized for their ability to store and update data in an efficient manner, providing reliability and the elimination of data redundancy. The relational database model, in particular, has well-established mechanisms built into the model for properly designing the database and maintaining data integrity and consistency. Data alone, however, are only facts.
What is needed is information. Knowledge discovery attempts to derive information from the pure facts, discovering high-level regularities in the data. It is defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [24,27]. An innovative technique in the field of uncertainty and knowledge discovery is based on rough sets. Rough set theory, introduced and further developed mathematically by Pawlak [39], provides a framework for the representation of uncertainty. It has been used in various applications such as the rough querying of crisp data [6], uncertainty management in databases [11], the mining of spatial data [8], and improved information retrieval [48]. These techniques may readily be extended for use with object-oriented, spatial, and other complex databases, and may be integrated with additional data mining techniques for a comprehensive knowledge discovery approach.

Rough Sets and Rough-Fuzzy Sets

Rough set theory, introduced by Pawlak [38], is a technique for dealing with uncertainty and for identifying cause-effect relationships in databases. An extensive theory for rough sets and their properties has been developed, and they have become a well-established approach for the management of uncertainty in a variety of applications. Rough sets involve the following:

$U$ is the universe, which cannot be empty;
$R$ is the indiscernibility relation, an equivalence relation on $U$;
$A = (U, R)$, an ordered pair, is called an approximation space;
$[x]_R$ denotes the equivalence class of $R$ containing $x$, for any element $x$ of $U$;
elementary sets in $A$ are the equivalence classes of $R$;
a definable set in $A$ is any finite union of elementary sets in $A$.

Given an approximation space defined on some universe $U$ that has an equivalence relation $R$ imposed upon it, $U$ is partitioned into equivalence classes called elementary sets that may be used to define other sets in $A$. A rough set $X$, where $X \subseteq U$, can be defined in terms of the definable sets in $A$ by the following:

Lower approximation of $X$ in $A$: $\underline{R}X = \{x \in U \mid [x]_R \subseteq X\}$.
Upper approximation of $X$ in $A$: $\overline{R}X = \{x \in U \mid [x]_R \cap X \neq \emptyset\}$.

$POS_R(X) = \underline{R}X$ denotes the $R$-positive region of $X$, those elements which certainly belong to the rough set. The $R$-negative region of $X$, $NEG_R(X) = U - \overline{R}X$, contains elements which do not belong to the rough set, and the boundary or $R$-borderline region of $X$, $BN_R(X) = \overline{R}X - \underline{R}X$, contains those elements which may or may not belong to the set.
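A minimal Python sketch (not from the article; names are illustrative) of how the lower and upper approximations, and from them the positive, negative, and boundary regions, can be computed from a partition of the universe into equivalence classes:

```python
# Illustrative sketch: lower/upper approximations of a set X with
# respect to a partition of U into equivalence classes of R.

def approximations(partition, X):
    """Return (lower, upper) approximations of X for the given partition."""
    X = set(X)
    lower, upper = set(), set()
    for cls in partition:
        cls = set(cls)
        if cls <= X:      # [x]_R wholly inside X: certain members
            lower |= cls
        if cls & X:       # [x]_R meets X: possible members
            upper |= cls
    return lower, upper

# Example universe and partition (taken from the worked example later
# in this article).
U = {"A11", "A12", "A21", "A22", "B11", "B12", "B13", "C1", "C2"}
partition = [{"A11", "A12", "A21", "A22"}, {"B11", "B12", "B13"}, {"C1", "C2"}]
X = {"A11", "A12", "A21", "A22", "B11", "C1"}

lower, upper = approximations(partition, X)
positive = lower          # POS_R(X)
negative = U - upper      # NEG_R(X)
boundary = upper - lower  # BN_R(X)
```

Here the class {B11, B12, B13} intersects X without being contained in it, so its elements fall in the boundary region rather than the positive region.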
$X$ is $R$-definable if and only if $\underline{R}X = \overline{R}X$. Otherwise, $\underline{R}X \neq \overline{R}X$ and $X$ is rough with respect to $R$. A rough set in $A$ is the group of subsets of $U$ with the same upper and lower approximations.

Fuzzy set theory [53] is another approach for managing uncertainty. It has been around for a few years longer than rough sets and also has a well-developed theory, properties, and applications. Applications involving fuzzy logic are diverse and plentiful, ranging from fuzzy control systems in industry to fuzzy logic in databases. Because there are advantages to both fuzzy set and rough set theories, several researchers have studied various ways of combining the two theories [22,30,36]. Others have investigated the interrelations between the two theories [19,40,52]. A similar approach to our rough-fuzzy set is the fuzzy rough set in [31]; that approach, however, is more in the spirit of functional analysis. Fuzzy sets and rough sets are not equivalent, but complementary. It has been shown in [52] that rough sets can be expressed by a fuzzy membership function $\mu \to \{0, 0.5, 1\}$ representing the negative, boundary, and positive regions. In this model, all elements of the lower approximation, or positive region, have a membership value of one. Elements of the boundary region are assigned a membership value of 0.5. Elements not belonging to the rough set have a membership value of zero. Rough set definitions of union and intersection can be modified so that the fuzzy model satisfies all the properties of rough sets [7]. This allows a rough set to be expressed as a fuzzy set.

We integrate fuzziness into the rough set model in order to quantify levels of roughness in boundary region areas through the use of fuzzy membership values. Therefore, we do not require membership values of elements of the boundary region to equal 0.5, but allow them to range from zero to one, noninclusive. Additionally, the union and intersection operators for fuzzy rough sets are comparable to those for ordinary fuzzy sets, where MIN and MAX are used to obtain membership values of redundant elements. Let $U$ be a universe and $X$ a rough set in $U$.

Definition A fuzzy rough set $Y$ in $U$ is given by a membership function $\mu_Y(x)$ which associates a grade of membership from the interval [0,1] with every element of $U$, where $\mu_Y(\underline{R}X) = 1$, $\mu_Y(U - \overline{R}X) = 0$, and $0 < \mu_Y(\overline{R}X - \underline{R}X) < 1$.
Definition The union of two fuzzy rough sets $A$ and $B$ is a fuzzy rough set $C$ where $C = \{x \mid x \in A \text{ OR } x \in B\}$ and $\mu_C(x) = \mathrm{MAX}[\mu_A(x), \mu_B(x)]$.

Definition The intersection of two fuzzy rough sets $A$ and $B$ is a fuzzy rough set $C$ where $C = \{x \mid x \in A \text{ AND } x \in B\}$ and $\mu_C(x) = \mathrm{MIN}[\mu_A(x), \mu_B(x)]$.
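The MAX/MIN conventions above can be sketched in Python. This is an illustrative representation (not from the article): a fuzzy rough set is modeled simply as a dict mapping elements to membership grades in [0, 1], with absent elements implicitly at grade zero; the element names are hypothetical.

```python
# Illustrative sketch: fuzzy rough union and intersection via MAX/MIN
# of membership grades, as in the definitions above.

def fr_union(A, B):
    """Union: membership is the MAX of the two grades."""
    return {x: max(A.get(x, 0.0), B.get(x, 0.0)) for x in A.keys() | B.keys()}

def fr_intersection(A, B):
    """Intersection: membership is the MIN of the two grades;
    elements with grade 0 drop out."""
    common = {x: min(A[x], B[x]) for x in A.keys() & B.keys()}
    return {x: m for x, m in common.items() if m > 0}

# Hypothetical example sets.
A = {"B11": 0.7, "C1": 1.0}
B = {"B11": 0.4, "A11": 1.0}
```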
Rough Relational Database

The rough relational database model [13] is an extension of the standard relational database model of Codd [23]. It captures all the essential features of rough set theory, including indiscernibility of elements denoted by equivalence classes and lower and upper approximation regions for defining sets which are indefinable in terms of the indiscernibility. Every attribute domain is partitioned by some equivalence relation designated by the database designer or user. Within each domain, those values that are considered indiscernible belong to an equivalence class. This information is used by the query mechanism to retrieve information based on equivalence with the class to which the value belongs, rather than equality, resulting in less critical wording of queries.

Recall is also improved in the rough relational database because rough relations provide possible matches to the query in addition to the certain matches which are obtained in the standard relational database. This is accomplished by using set containment, in addition to equality of attributes, in the calculation of lower and upper approximation regions of the query result.

The rough relational database has several features in common with the ordinary relational database. Both models represent data as a collection of relations containing tuples. These relations are sets. The tuples of a relation are its elements and, like elements of sets in general, are unordered and nonduplicated. A tuple $t_i$ takes the form $(d_{i1}, d_{i2}, \ldots, d_{im})$, where $d_{ij}$ is a domain value of a particular domain set $D_j$. In the ordinary relational database, $d_{ij} \in D_j$. In the rough database, however, as in other non-first normal form extensions to the relational model [34,44], $d_{ij} \subseteq D_j$, and although it is not required that $d_{ij}$ be a singleton, $d_{ij} \neq \emptyset$. Let $P(D_i)$ denote the powerset of $D_i$ minus the empty set.

Definition A rough relation $R$ is a subset of the set cross product $P(D_1) \times P(D_2) \times \cdots \times P(D_m)$.
A rough tuple $t$ is any member of $R$, which implies that it is also a member of $P(D_1) \times P(D_2) \times \cdots \times P(D_m)$. If $t_i$ is some arbitrary tuple, then $t_i = (d_{i1}, d_{i2}, \ldots, d_{im})$ where $d_{ij} \subseteq D_j$. A tuple in this model differs from that of ordinary databases in that the tuple components may be sets of domain values rather than single values. The set braces are omitted from singletons for notational simplicity.

Let $[d_{xy}]$ denote the equivalence class to which $d_{xy}$ belongs. When $d_{xy}$ is a set of values, the equivalence class is formed by taking the union of the equivalence classes of the members of the set: if $d_{xy} = \{c_1, c_2, \ldots, c_n\}$, then $[d_{xy}] = [c_1] \cup [c_2] \cup \cdots \cup [c_n]$.
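The equivalence-class construction for set-valued components can be sketched as follows. This is an illustrative example, not the article's implementation; the `indiscernibility` map is a hypothetical partition of a location domain (the town names reappear in the fuzzy rough example later in this article).

```python
# Illustrative sketch: [d_xy] for a set-valued tuple component is the
# union of the equivalence classes of its members.

# Hypothetical indiscernibility classes for a location domain.
indiscernibility = {
    "OSCAR":    {"OSCAR", "VENTRESS"},   # treated as indiscernible towns
    "VENTRESS": {"OSCAR", "VENTRESS"},
    "ADDIS":    {"ADDIS"},
}

def eq_class(value):
    """Return [d_xy]: the union of member classes when d_xy is a set,
    or the class of the single value otherwise."""
    members = value if isinstance(value, (set, frozenset)) else {value}
    cls = set()
    for c in members:
        cls |= indiscernibility[c]
    return frozenset(cls)
```

A query comparing two components then tests equality of their equivalence classes rather than of the raw values, which is what gives the rough model its improved recall.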
Definition Tuples $t_i = (d_{i1}, d_{i2}, \ldots, d_{im})$ and $t_k = (d_{k1}, d_{k2}, \ldots, d_{km})$ are redundant if $[d_{ij}] = [d_{kj}]$ for all $j = 1, \ldots, m$.

In the rough relational database, redundant tuples are removed in the merging process, since duplicates are not allowed in sets, the structure upon which the relational model is based.

There are two basic types of relational operators. The first type arises from the fact that relations are considered sets of tuples; therefore, operations which can be applied to sets also apply to relations. The most useful of these for database purposes are set difference, union, and intersection. Operators which do not come from set theory, but which are useful for retrieval of relational data, are select, project, and join. In the rough relational database, relations are rough sets as opposed to ordinary sets. Therefore, new rough operators ($-$, $\cup$, $\cap$, $\sigma$, $\pi$, $\Join$), comparable to the standard relational operators, must be developed for the rough relational database. Moreover, a mechanism must exist within the database to mark tuples of a rough relation as belonging to the lower or upper approximation of that rough relation. Properties of the rough relational operators can be found in [13].

Information Theory

In communication theory, Shannon [45] introduced the concept of entropy, which was used to characterize the information content of signals. Since then, variations of these information-theoretic measures have been successfully applied in many diverse fields. In particular, the representation of uncertain information by entropy measures has been applied to all areas of databases, including fuzzy database querying [17], data allocation [25], classification in rule-based systems [42], and measuring uncertainty in rough and fuzzy rough relational databases [12]. In fuzzy set theory, the representation of uncertain information measures has been extensively studied [14,20,29].
This article relates the concepts of information theory to rough sets and compares these information-theoretic measures to established rough set metrics of uncertainty. The measures are then applied to the rough relational database model [13]. The information content of both stored relational schemas and rough relations is expressed as types of rough entropy.

Rough set theory [38] inherently models two types of uncertainty. The first type arises from the indiscernibility relation that is imposed on the universe, partitioning all values into a finite set of equivalence classes. If every equivalence class contains only one value, then there is no loss of information caused by the partitioning. In any coarser partitioning, however, there are fewer classes, and each class will contain a larger number of members. Our knowledge, or information, about a particular value decreases as the granularity of the partitioning becomes coarser.

Uncertainty is also modeled through the approximation regions of rough sets, where elements of the lower approximation region have total participation in the rough set and those of the upper approximation region have uncertain participation in the rough set. Equivalently, the lower approximation is the certain region, and the boundary area of the upper approximation region is the possible region.

Pawlak [41] discusses two numerical characterizations of the imprecision of a rough set $X$: accuracy and roughness. Accuracy, which is simply the ratio of the number of elements in the lower approximation of $X$, $\underline{R}X$, to the number of elements in the upper approximation of $X$, $\overline{R}X$, measures the degree of completeness of knowledge about the given rough set $X$. It is defined as a ratio of the two set cardinalities as follows:

$\alpha_R(X) = \dfrac{\mathrm{card}(\underline{R}X)}{\mathrm{card}(\overline{R}X)}, \quad \text{where } 0 \le \alpha_R(X) \le 1.$
The second measure, roughness, represents the degree of incompleteness of knowledge about the rough set. It is calculated by subtracting the accuracy from 1: $\rho_R(X) = 1 - \alpha_R(X)$.

These measures require knowledge of the number of elements in each of the approximation regions and are good metrics for uncertainty as it arises from the boundary region, implicitly taking into account equivalence classes as they belong wholly or partially to the set. However, the accuracy and roughness measures do not necessarily provide information on the uncertainty related to the granularity of the indiscernibility relation for those values that are totally included in the lower approximation region. For example, let the rough set $X$ be defined as

$X = \{A11, A12, A21, A22, B11, C1\}$

with lower and upper approximation regions

$\underline{R}X = \{A11, A12, A21, A22\}$ and
$\overline{R}X = \{A11, A12, A21, A22, B11, B12, B13, C1, C2\}$.

These approximation regions may result from one of several partitionings. Consider, for example, the following indiscernibility relations:

$A_1 = \{[A11, A12, A21, A22], [B11, B12, B13], [C1, C2]\}$,
$A_2 = \{[A11, A12], [A21, A22], [B11, B12, B13], [C1, C2]\}$,
$A_3 = \{[A11], [A12], [A21], [A22], [B11, B12, B13], [C1, C2]\}$.

All three of the above partitionings result in the same upper and lower approximation regions for the given set $X$, and hence the same accuracy measure ($4/9 = 0.444$), since only those classes belonging to the lower approximation region were re-partitioned. It is obvious, however, that there is more uncertainty in $A_1$ than in $A_2$, and more uncertainty in $A_2$ than in $A_3$. Therefore, a more comprehensive measure of uncertainty is needed. We derive such a measure from techniques used for measuring entropy in classical information theory. Countless variations of the classical entropy have been developed, each tailored for a particular application domain or for measuring a particular type of uncertainty. Our rough entropy is defined such that we may apply it to rough databases. We define the entropy of a rough set $X$ as follows:

Definition The rough entropy $E_r(X)$ of a rough set $X$ is calculated by

$E_r(X) = -\rho_R(X)\left[\sum_i Q_i \log(P_i)\right]$ for $i = 1, \ldots, n$ equivalence classes.

The term $\rho_R(X)$ denotes the roughness of the set $X$. The second term is the summation over the probabilities of each equivalence class belonging either wholly or in part to the rough set $X$. There is no ordering associated with individual class members; therefore, the probability of any one value of the class being named is the reciprocal of the number of elements in the class. If $c_i$ is the cardinality of equivalence class $i$, and all members of a given equivalence class are equally likely, then $P_i = 1/c_i$ represents the probability of one of the values in class $i$. $Q_i$ denotes the probability of equivalence class $i$ within the universe; it is computed by dividing the number of elements in class $i$ by the total number of elements in all equivalence classes combined.
The entropy of the sample rough set $X$, $E_r(X)$, is given below for each of the possible indiscernibility relations $A_1$, $A_2$, and $A_3$ (base-10 logarithms):

Using $A_1$: $-(5/9)[(4/9)\log(1/4) + (3/9)\log(1/3) + (2/9)\log(1/2)] = 0.274$
Using $A_2$: $-(5/9)[(2/9)\log(1/2) + (2/9)\log(1/2) + (3/9)\log(1/3) + (2/9)\log(1/2)] = 0.20$

Using $A_3$: $-(5/9)[(1/9)\log(1) + (1/9)\log(1) + (1/9)\log(1) + (1/9)\log(1) + (3/9)\log(1/3) + (2/9)\log(1/2)] = 0.126$

From the above calculations it is clear that although each of the partitionings results in identical roughness measures, the entropy decreases as the classes become smaller through finer partitionings.
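The worked example above can be reproduced with a short Python sketch (illustrative, not the article's code; base-10 logarithms are assumed, which match the printed figures for $A_1$ and $A_2$):

```python
from math import log10

# Illustrative sketch of the rough entropy E_r(X): roughness times the
# negated sum of Q_i * log(P_i) over classes meeting X.

def rough_entropy(partition, X):
    X = set(X)
    lower = set().union(*(c for c in partition if set(c) <= X))
    upper = set().union(*(c for c in partition if set(c) & X))
    roughness = 1 - len(lower) / len(upper)        # rho_R(X)
    total = sum(len(c) for c in partition)
    # Classes belonging wholly or partly to X contribute Q_i * log(P_i),
    # with Q_i = |class| / |U| and P_i = 1 / |class|.
    s = sum((len(c) / total) * log10(1 / len(c))
            for c in partition if set(c) & X)
    return -roughness * s

X = {"A11", "A12", "A21", "A22", "B11", "C1"}
A1 = [{"A11", "A12", "A21", "A22"}, {"B11", "B12", "B13"}, {"C1", "C2"}]
A2 = [{"A11", "A12"}, {"A21", "A22"}, {"B11", "B12", "B13"}, {"C1", "C2"}]
```

Evaluating `rough_entropy(A1, X)` and `rough_entropy(A2, X)` yields approximately 0.274 and 0.200, confirming that finer partitionings of the lower approximation reduce the entropy while leaving roughness unchanged.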
Entropy and the Rough Relational Database

The basic concepts of rough sets and their information-theoretic measures carry over to the rough relational database model [13]. Recall that in the rough relational database all domains are partitioned into equivalence classes and relations are not restricted to first normal form. We therefore have a type of rough set for each attribute of a relation. This results in a rough relation, since any tuple having a value for an attribute that belongs to the boundary region of its domain is a tuple belonging to the boundary region of the rough relation.

There are two things to consider when measuring uncertainty in databases: the uncertainty, or entropy, of a rough relation that exists in a database at some given time, and the entropy of a relation schema for an existing relation or query result. We must consider both, since the approximation regions only come about through set values for attributes in given tuples. Without the extension of a database containing actual values, we only know about the indiscernibility of attributes; we cannot consider the approximation regions. We define the entropy for a rough relation schema as follows:

Definition The rough schema entropy for a rough relation schema $S$ is

$E_s(S) = -\sum_j \left[\sum_i Q_i \log(P_i)\right]$ for $i = 1, \ldots, n$; $j = 1, \ldots, m$,

where there are $n$ equivalence classes of domain $j$, and $m$ attributes in the schema $R(A_1, A_2, \ldots, A_m)$.

This is similar to the definition of entropy for rough sets without factoring in roughness, since there are no elements in the boundary region (lower approximation = upper approximation). However, because a relation is a cross product among the domains, we must take the sum of all these entropies to obtain the entropy of the schema. The schema entropy provides a measure of the uncertainty inherent in the definition of the rough relation schema, taking into account the partitioning of the domains on which the attributes of the schema are defined.

We extend the schema entropy $E_s(S)$ to define the entropy of an actual rough relation instance $E_R(R)$ of some database $D$ by multiplying each term in the sum by the roughness of the rough set of values for the domain of the given attribute.

Definition The rough relation entropy of a particular extension of a schema is

$E_R(R) = -\sum_j D\rho_j(R)\left[\sum_i DQ_i \log(DP_i)\right]$ for $i = 1, \ldots, n$; $j = 1, \ldots, m$,

where $D\rho_j(R)$ represents a type of database roughness for the rough set of values of the domain for attribute $j$ of the relation, $m$ is the number of attributes in the database relation, and $n$ is the number of equivalence classes for a given domain for the database. We obtain the $D\rho_j(R)$ values by letting the non-singleton domain values represent elements of the boundary region, computing the original rough set accuracy, and subtracting it from one to obtain the roughness. $DQ_i$ is the probability of a tuple in the database relation having a value from class $i$, and $DP_i$ is the probability of a value for class $i$ occurring in the database relation out of all the values which are given.

Information-theoretic measures again prove to be a useful metric for quantifying information content. In rough sets and the rough relational database this is especially useful, since in ordinary rough sets Pawlak's measure of roughness does not capture the information content as precisely as our rough entropy measure. In rough relational databases, knowledge about entropy can either guide the database user toward less uncertain data or act as a measure of the uncertainty of a data set or relation. As rough relations become larger in terms of the number of tuples or attributes, the automatic calculation of some measure of entropy becomes a necessity. Our rough relation entropy measure fulfills this need.

Rough Fuzzy Relational Database

The fuzzy rough relational database, as in the ordinary relational database, represents data as a collection of relations containing tuples. Because a relation is considered a set having the tuples as its members, the tuples are unordered. In addition, there can be no duplicate tuples in a relation. A tuple $t_i$ takes the form $(d_{i1}, d_{i2}, \ldots, d_{im}, d_{i\mu})$, where $d_{ij}$ is a domain value of a particular domain set $D_j$ and $d_{i\mu} \in D_\mu$, where $D_\mu$ is the interval [0,1], the domain for fuzzy membership values. In the ordinary relational database, $d_{ij} \in D_j$. In the fuzzy rough relational database, except for the fuzzy membership value, $d_{ij} \subseteq D_j$, and although $d_{ij}$ is not restricted to be a singleton, $d_{ij} \neq \emptyset$. Let $P(D_i)$ denote any non-null member of the powerset of $D_i$.

Definition A fuzzy rough relation $R$ is a subset of the set cross product $P(D_1) \times P(D_2) \times \cdots \times P(D_m) \times D_\mu$.

For a specific relation $R$, membership is determined semantically. Given that $D_1$ is the set of names of nuclear/chemical plants and $D_2$ is the set of locations, and assuming that RIVERB is the only nuclear power plant located in VENTRESS,

(RIVERB, VENTRESS, 1)
(RIVERB, OSCAR, .7)
(RIVERB, ADDIS, 1)
(CHEMO, VENTRESS, .3)

are all elements of $P(D_1) \times P(D_2) \times D_\mu$. However, of those listed above, only the element (RIVERB, VENTRESS, 1) is a member of the relation R(PLANT, LOCATION, $\mu$), which associates each plant with the town or community in which it is located.

A fuzzy rough tuple $t$ is any member of $R$. If $t_i$ is some arbitrary tuple, then $t_i = (d_{i1}, d_{i2}, \ldots, d_{im}, d_{i\mu})$ where $d_{ij} \subseteq D_j$ and $d_{i\mu} \in D_\mu$.

Definition An interpretation $\alpha = (a_1, a_2, \ldots, a_m, a_\mu)$ of a fuzzy rough tuple $t_i = (d_{i1}, d_{i2}, \ldots, d_{im}, d_{i\mu})$ is any value assignment such that $a_j \in d_{ij}$ for all $j$.

The interpretation space is the cross product $D_1 \times D_2 \times \cdots \times D_m \times D_\mu$, but for a given relation $R$ it is limited to the set of those tuples which are valid according to the underlying semantics of $R$. In an ordinary relational database, because domain values are atomic, there is only one possible interpretation for each tuple $t_i$, and the interpretation of $t_i$ is equivalent to the tuple $t_i$. In the fuzzy rough relational database, this is not always the case.

Let $[d_{xy}]$ denote the equivalence class to which $d_{xy}$ belongs.
When $d_{xy}$ is a set of values, the equivalence class is formed by taking the union of the equivalence classes of the members of the set: if $d_{xy} = \{c_1, c_2, \ldots, c_n\}$, then $[d_{xy}] = [c_1] \cup [c_2] \cup \cdots \cup [c_n]$.

Definition Tuples $t_i = (d_{i1}, d_{i2}, \ldots, d_{in}, d_{i\mu})$ and $t_k = (d_{k1}, d_{k2}, \ldots, d_{kn}, d_{k\mu})$ are redundant if $[d_{ij}] = [d_{kj}]$ for all $j = 1, \ldots, n$.

If a relation contains only those tuples of a lower approximation, i.e., those tuples having a $\mu$ value equal to one, the interpretation $\alpha$ of a tuple is unique. This follows immediately from the definition of redundancy. In fuzzy rough relations there are no redundant tuples. The merging process used in relational database operations removes duplicate tuples, since duplicates are not allowed in sets, the structure upon which the relational model is based.

Tuples may be redundant in all values except $\mu$. As in the union of fuzzy rough sets, where the maximum membership value of an element is retained, it is the convention of the fuzzy rough relational database to retain the tuple having the higher $\mu$ value when removing redundant tuples during merging. If we are supplied with identical data from two sources, one certain and the other uncertain, we want to retain the certain data, avoiding loss of information.

Recall that the rough relational database is in non-first normal form; some attribute values are sets. Another definition, which will be used for upper approximation tuples, is necessary for some of the alternate definitions of operators to be presented. This definition captures redundancy between elements of attribute values that are sets:

Definition Two sub-tuples $X = (d_{x1}, d_{x2}, \ldots, d_{xm})$ and $Y = (d_{y1}, d_{y2}, \ldots, d_{ym})$ are roughly redundant, $\approx_R$, if for some $[p] \subseteq [d_{xj}]$ and $[q] \subseteq [d_{yj}]$, $[p] = [q]$ for all $j = 1, \ldots, m$.

In order for any database to be useful, a mechanism for operating on the basic elements and retrieving specified data must be provided. The concepts of redundancy and merging play a key role in the operations defined. We must first design our database using some type of semantic model. We use a variation of the entity-relationship diagram that we call a fuzzy-rough E-R diagram. This diagram is similar to the standard E-R diagram in that entity types are depicted in rectangles, relationships with diamonds, and attributes with ovals.
However, in the fuzzy-rough model, it is understood that membership values exist for all instances of entity types and relationships. Attributes for which we want to be able to define equivalences are denoted with an asterisk (*) above the oval. These values are defined in the indiscernibility relation, which is not actually part of the database design but inherent in the fuzzy-rough model. Our fuzzy-rough E-R model [7] is similar to the second and third levels of fuzziness defined by Zvieli and Chen [54]. However, in our model all entity and relationship occurrences (second level) are of the fuzzy type, so we do not mark an 'f' beside each one. Zvieli and Chen's third level considers attributes that may be fuzzy; they use triangles instead of ovals to represent these attributes. We do not introduce fuzziness at the attribute level of our model in this paper, only roughness, or indiscernibility, and denote those attributes with the '*'.

From the fuzzy-rough E-R diagram, we design the structure of the fuzzy rough relational database. If we have a priori information about the types of queries that will be involved, we can make intelligent choices that will maximize computer resources.

We next formally define the fuzzy rough relational database operators and discuss issues relating to the real-world problems of data representation and modeling. We may view indiscernibility as being modeled through the use of the indiscernibility relation, imprecision through the use of non-first normal form constructs, and degree of uncertainty and fuzziness through the use of tuple membership values, which are given as the value of the $\mu$ attribute in every fuzzy rough relation.
Fuzzy Rough Relational Operators

In [13], we defined several operators for the rough relational algebra. We now define similar operators for the fuzzy rough relational database as in [5]. Recall that for all of these operators the indiscernibility relation is used for equivalence of attribute values rather than equality of values.

Difference The fuzzy rough relational difference operator is very much like the ordinary difference operator in relational databases and in sets in general. It is a binary operator that returns those elements of the first argument that are not contained in the second argument. In the fuzzy rough relational database, the difference operator is applied to two fuzzy rough relations and, as in the rough relational database, indiscernibility, rather than equality of attribute values, is used in the elimination of redundant tuples. Hence, the difference operator is somewhat more complex. Let X and Y be two union compatible fuzzy rough relations.

Definition The fuzzy rough difference, X − Y, between X and Y is a fuzzy rough relation T where

T = {t(d1, ..., dn, μi) ∈ X | t(d1, ..., dn, μi) ∉ Y} ∪ {t(d1, ..., dn, μi) ∈ X | t(d1, ..., dn, μj) ∈ Y and μi > μj}.

The resulting fuzzy rough relation contains all those tuples which are in the lower approximation of X, but not redundant with a tuple in the lower approximation of Y. It also contains those tuples belonging to upper approximation regions of both X and Y, but which have a higher μ value in X than in Y. For example, let X contain the tuple (MODERN, 1) and Y contain the tuple (MODERN, .02). It would not be desirable to subtract out certain information with possible information, so X − Y yields (MODERN, 1).

Union Because relations in databases are considered as sets, the union operator can be applied to any two union-compatible relations to result in a third relation which has as its tuples all the tuples contained in either or both of the two original relations. The union operator can be extended to apply to fuzzy rough relations. Let X and Y be two union compatible fuzzy rough relations.

Definition The fuzzy rough union of X and Y, X ∪ Y, is a fuzzy rough relation T where

T = {t | t ∈ X OR t ∈ Y} and μT(t) = MAX[μX(t), μY(t)].

The resulting relation T contains all tuples in either X or Y or both, merged together and having redundant tuples removed. If X contains a tuple that is redundant with a tuple in Y except for the μ value, the merging process will retain only that tuple with the higher μ value.

Intersection The fuzzy rough intersection, another binary operator on fuzzy rough relations, can be defined similarly.

Definition The fuzzy rough intersection of X and Y, X ∩ Y, is a fuzzy rough relation T where

T = {t | t ∈ X AND t ∈ Y} and μT(t) = MIN[μX(t), μY(t)].

In intersection, the MIN operator is used in the merging of equivalent tuples having different μ values, and the result contains all tuples that are members of both of the original fuzzy rough relations.

Definition The fuzzy rough intersection of X and Y, X ∩A Y, is a fuzzy rough relation T where

T = {t | t ∈ X and ∃s ∈ Y such that t ≈R s} ∪ {s | s ∈ Y and ∃t ∈ X such that s ≈R t} and μT(t) = MIN[μX(t), μY(t)].

Select The select operator for the fuzzy rough relational database model, σ, is a unary operator which takes a fuzzy
rough relation X as its argument and returns a fuzzy rough relation containing a subset of the tuples of X, selected on the basis of values for a specified attribute. The operation σA=a(X), for example, returns those tuples in X where attribute A is equivalent to the class [a]. In general, select returns a subset of the tuples that match some selection criteria. Let R be a relation schema, X a fuzzy rough relation on that schema, A an attribute in R, a = {ai} and b = {bj}, where ai, bj ∈ dom(A), and ∪x is interpreted as “the union over all x”.

Definition The fuzzy rough selection, σA=a(X), of tuples from X is a fuzzy rough relation Y having the same schema as X and where

Y = {t ∈ X | ∪i [ai] ⊆ ∪j [bj]}, where ai ∈ a, bj ∈ t(A),

and where membership values for tuples are calculated by multiplying the original membership value by card(a)/card(b), where card(x) returns the cardinality, or number of elements, in x.

Assume we want to retrieve those elements where CITY = “ADDIS” from the following fuzzy rough tuples:

(ADDIS, 1)
({ADDIS, LOTTIE, BRUSLY}, .7)
(OSCAR, 1)
({ADDIS, JACKSON}, .9).

The result of the selection is the following:

(ADDIS, 1)
({ADDIS, LOTTIE, BRUSLY}, .23)
({ADDIS, JACKSON}, .45),

where the μ for the second tuple is the product of the original membership value .7 and 1/3.

Project Project is a unary fuzzy rough relational operator. It returns a relation that contains a subset of the columns of the original relation. Let X be a fuzzy rough relation with schema A, and let B be a subset of A. The fuzzy rough projection of X onto B is a fuzzy rough relation Y obtained by omitting the columns of X which correspond to attributes in A − B, and removing redundant tuples. Recall that the definition of redundancy accounts for indiscernibility, which is central to rough set theory, and that higher μ values have priority over lower ones.

Definition The fuzzy rough projection of X onto B, πB(X), is a fuzzy rough relation Y with schema Y(B) where

Y(B) = {t(B) | t ∈ X}.

Join Join is a binary operator that takes related tuples from two relations and combines them into single tuples of the resulting relation. It uses common attributes to combine the two relations into one, usually larger, relation. Let X(A1, A2, ..., Am) and Y(B1, B2, ..., Bn) be fuzzy rough relations with m and n attributes, respectively, and A ∪ B = C, the schema of the resulting fuzzy rough relation T.

Definition The fuzzy rough join, X ⋈ Y, of two relations X and Y, is a relation T(C1, C2, ..., Cm+n) where

T = {t | ∃ tX ∈ X, tY ∈ Y for tX = t(A), tY = t(B)},

and where

tX(A ∩ B) = tY(A ∩ B), μ = 1   (1)

or

tX(A ∩ B) ⊆ tY(A ∩ B) or tY(A ∩ B) ⊆ tX(A ∩ B), μ = MIN(μX, μY).   (2)

The join condition is a conjunction of one or more conditions of the form A = B. Only those tuples which resulted from the “joining” of tuples that were both in lower approximations in the original relations belong to the lower approximation of the resulting fuzzy rough relation. All other “joined” tuples belong to the upper approximation only (the boundary region), and have membership values less than one. The fuzzy membership value of the resultant tuple is simply calculated as in [18] by taking the minimum of the membership values of the original tuples. Taking the minimum value also follows the logic of [37], where in joins of tuples with different levels of information uncertainty, the resultant tuple can have no greater certainty than that of its least certain component.

Functional Dependencies A functional dependency can be defined as in [23] through the use of a universal database relation concept. Let R = {A1, A2, ..., An} be a universal relation schema describing a database having n attributes. Let X and Y be subsets of R. A functional dependency between the attributes
of X and Y is denoted by X → Y. This dependency specifies the constraint that for any two tuples of an instance r of R, if they agree on the X attribute(s) they must agree on their Y attribute(s): if t1[X] = t2[X], then it must be true that t1[Y] = t2[Y]. Tuples that violate the constraint cannot be inserted into the database. The rough functional dependency is based on the rough relational database model. The classical notion of functional dependency for relational databases does not naturally apply to the rough relational database, since all the “roughness” would be lost. In the rough querying of crisp data [6], however, the data is stored in the standard relational model having ordinary functional dependencies imposed upon it, and rough relations result only from querying; they are not a part of the database design in which the designer imposes constraints upon relation schemas. Rough functional dependencies for the rough relational database model are defined as follows [9]:

Definition A rough functional dependency, X → Y, for a relation schema R exists if for all instances T(R),

(1) for any two tuples t, t′ ∈ RT, redundant(t(X), t′(X)) ⇒ redundant(t(Y), t′(Y));
(2) for any two tuples s, s′ ∈ R̄T, roughly-redundant(s(X), s′(X)) ⇒ roughly-redundant(s(Y), s′(Y)).

Y is roughly functionally dependent on X, or X roughly functionally determines Y, whenever the above definition holds. This implies that constraints can be imposed on a rough relational database schema in a rough manner that will aid in integrity maintenance and the reduction of update anomalies without limiting the expressiveness of the inherent rough set concepts. It is obvious that the classical functional dependency for the standard relational database is a special case of the rough functional dependency; indiscernibility reduces to simple equality, and part (2) of the definition is unused since all tuples in relations in the standard relational model belong to the lower approximation region of a similar rough model. The first part of the definition of rough functional dependency compares with that of fuzzy functional dependencies discussed in [46], where adherence to Armstrong’s axioms was proven. The results apply directly in the case of rough functional dependencies when only the lower approximation regions are considered. It is also necessary to show that the axioms hold for upper approximations.

Rough Functional Dependencies Satisfy Armstrong’s Axioms

Proof

(1) Reflexive. If Y ⊆ X ⊆ U, then

redundant(t(X), t′(X)) ⇒ redundant(t(Y), t′(Y)), and
roughly-redundant(t(X), t′(X)) ⇒ roughly-redundant(t(Y), t′(Y)).

Hence, X → Y.

(2) Augmentation. If Z ⊆ U and the rough functional dependency X → Y holds, then

redundant(t(XZ), t′(XZ)) ⇒ redundant(t(YZ), t′(YZ)), and
roughly-redundant(t(XZ), t′(XZ)) ⇒ roughly-redundant(t(YZ), t′(YZ)).

Hence, XZ → YZ.

(3) Transitive. If the rough functional dependencies X → Y and Y → Z hold, then

redundant(t(X), t′(X)) ⇒ redundant(t(Z), t′(Z)), and
roughly-redundant(t(X), t′(X)) ⇒ roughly-redundant(t(Z), t′(Z)).

Hence, X → Z.

Hence, rough functional dependencies satisfy Armstrong’s axioms. Given a set of rough functional dependencies, the complete set of rough functional dependencies can be derived using Armstrong’s axioms. The rough functional dependency, therefore, is an important formalism for design in the rough relational database. Fuzzy and rough set techniques integrated into the underlying data model result in databases that can more accurately represent real world enterprises since they incorporate uncertainty management directly into the data model itself. This is useful as is for obtaining greater information through the querying of rough and fuzzy databases. Additional benefits may be realized when they are used in the process of data mining.
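Part (1) of the rough functional dependency test can be sketched in a few lines of Python (our own illustration, not from the article; the IND mapping and sample rows are invented): over the lower-approximation tuples, any two tuples indiscernible on X must also be indiscernible on Y.

```python
# Hypothetical check of redundant(t(X), t'(X)) => redundant(t(Y), t'(Y)).
from itertools import combinations

IND = {"STREAM": "water", "CREEK": "water"}   # value -> equivalence class
cls = lambda v: IND.get(v, v)                  # unlisted values are singletons

def indiscernible(t, u, attrs):
    return all(cls(t[a]) == cls(u[a]) for a in attrs)

def rough_fd_part1(lower, X, Y):
    # for every pair of lower-approximation tuples, indiscernibility on X
    # must imply indiscernibility on Y
    return all(indiscernible(t, u, Y)
               for t, u in combinations(lower, 2)
               if indiscernible(t, u, X))

rows = [{"NAME": "STREAM", "TYPE": "waterway"},
        {"NAME": "CREEK",  "TYPE": "waterway"},
        {"NAME": "TOWER",  "TYPE": "structure"}]
print(rough_fd_part1(rows, ["NAME"], ["TYPE"]))  # NAME -> TYPE holds here
```

Part (2) would repeat the same pairwise test over upper-approximation tuples with roughly-redundant in place of redundant.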
Rough Set Modeling of Spatial Data

Many of the problems associated with data are prevalent in all types of database systems. Spatial databases and GIS contain descriptive as well as positional data. The various forms of uncertainty occur in both types of data, so many issues that apply to ordinary databases, such as integration of data from multiple sources, time-variant data, uncertain data, imprecision in measurement, inconsistent wording of descriptive data, and “binning” or grouping of data into fixed categories, also arise in spatial contexts. First consider an example of the use of rough sets in representing spatially related data. Let U = {tower, stream, creek, river, forest, woodland, pasture, meadow} and let the equivalence relation R be defined as follows:

R = {[tower], [stream, creek, river], [forest, woodland], [pasture, meadow]}.

Given some set X = {tower, stream, creek, river, forest, pasture}, we would like to define it in terms of its lower and upper approximations:

RX = {tower, stream, creek, river}, and
R̄X = {tower, stream, creek, river, forest, woodland, pasture, meadow}.

A rough set is the group of subsets of U with the same upper and lower approximations. In the example given, the rough set is

{{tower, stream, creek, river, forest, pasture},
{tower, stream, creek, river, forest, meadow},
{tower, stream, creek, river, woodland, pasture},
{tower, stream, creek, river, woodland, meadow}}.

Often spatial data is associated with a particular grid. The positions are set up in a regular matrix-like structure and data is affiliated with point locations on the grid. This is the case for raster data and for other types of non-vector data such as topography or sea surface temperature data. There is a tradeoff between the resolution or the scale of the grid and the amount of system resources necessary to store and process the data. Higher resolutions provide more information, but at a cost of memory space and execution time.
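The approximations in this example can be computed directly with Python set operations (a sketch of ours; the variable names are illustrative): the lower approximation is the union of equivalence classes wholly inside X, the upper approximation the union of classes that intersect X.

```python
# Equivalence classes of R and the target set X, as in the example above.
R = [{"tower"}, {"stream", "creek", "river"},
     {"forest", "woodland"}, {"pasture", "meadow"}]
X = {"tower", "stream", "creek", "river", "forest", "pasture"}

lower = set().union(*(c for c in R if c <= X))   # classes contained in X
upper = set().union(*(c for c in R if c & X))    # classes meeting X

print(sorted(lower))  # ['creek', 'river', 'stream', 'tower']
print(sorted(upper) == sorted(set().union(*R)))  # upper is all of U here
```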
If we approach these data issues from a rough set point of view, it can be seen that there is indiscernibility inherent in the process of gridding or rasterizing data. A data item at a particular grid point in essence may represent
data near the point as well. This is due to the fact that often point data must be mapped to the grid using techniques such as nearest-neighbor, averaging, or statistics. The rough set indiscernibility relation may be set up so that the entire spatial area is partitioned into equivalence classes where each point on the grid belongs to an equivalence class. If the resolution of the grid changes, then, in fact, this is changing the granularity of the partitioning, resulting in fewer, but larger classes. The approximation regions of rough sets are beneficial whenever information concerning spatial data regions is accessed. Consider a region such as a forest. One can reasonably conclude that any grid point identified as FOREST that is surrounded on all sides by grid points also identified as FOREST is, in fact, a point represented by the feature FOREST. However, consider points identified as FOREST that are adjacent to points identified as MEADOW. Is it not possible that these points represent meadow area as well as forest area but were identified as FOREST in the classification process? Likewise, points identified as MEADOW but adjacent to FOREST points may represent areas that contain part of the forest. This uncertainty maps naturally to the use of the approximation regions of the rough set theory, where the lower approximation region represents certain data and the boundary region of the upper approximation represents uncertain data. It applies to spatial database querying and spatial database mining operations. If we force a finer granulation of the partitioning, a smaller boundary region results. This occurs when the resolution is increased. As the partitioning becomes finer and finer, finally a point is reached where the boundary region is non-existent. Then the upper and lower approximation regions are the same and there is no uncertainty in the spatial data as can be determined by the representation of the model. 
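The granularity remark above can be illustrated numerically (our own sketch; the interval, grid extent, and function name are invented): rasterize a 1-D region on grids of decreasing cell size and watch the measure of the boundary region (upper minus lower approximation) shrink, reaching zero once the region aligns with cell edges.

```python
# Hypothetical 1-D rasterization: cells [k*g, (k+1)*g) over [0, extent).
def boundary_measure(lo, hi, g, extent=100):
    lower = upper = 0
    for k in range(extent // g):
        a, b = k * g, (k + 1) * g
        if lo <= a and b <= hi:
            lower += 1            # cell certainly inside the region
        if a < hi and b > lo:
            upper += 1            # cell possibly in the region
    return (upper - lower) * g    # total size of the boundary cells

print([boundary_measure(13, 57, g) for g in (20, 10, 5, 1)])  # [40, 20, 10, 0]
```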
In [50] Worboys models imprecision in spatial data based on the resolution at which the data is represented, and addresses issues related to the integration of such data. This approach relies on the notion of indiscernibility – a core concept for rough sets – but does not carry over the entire framework and is just described as “reminiscent of the theory of rough sets” [51]. Ahlqvist and colleagues [2] used a rough set approach to define a rough classification of spatial data and to represent spatial locations. They also proposed a measure for the quality of a rough classification compared to a crisp classification and evaluated their technique on actual data from vegetation map layers. They considered the combination of fuzzy and rough set approaches for reclassification as required by the integration of geographic data. Another research group in a mapping
2711
2712
Rough and Rough-Fuzzy Sets in Design of Information Systems
and GIS context [49] has developed an approach using a rough raster space for the field representation of a spatial entity and evaluated it on a classification case study for remote sensing images. In [16] Bittner and Stell consider K-labeled partitions, which can represent maps, and then develop their relationship to rough sets to approximate map objects with vague boundaries. Additionally, they investigate stratified partitions, which can be used to capture levels of detail or granularity, such as in consideration of scale transformations in maps, and extend this approach using the concepts of stratified rough sets.
Data Mining in Rough Databases

Association rules capture the idea of certain data items commonly occurring together and have often been considered in the analysis of a “market basket” of purchases. For example, a delicatessen retailer might analyze the previous year’s sales and observe that of all purchases 30% were of both cheese and crackers and, for any of the sales that included cheese, 75% also included crackers. Then it is possible to conclude a rule of the form:

Cheese → Crackers.

This rule is said to have a 75% degree of confidence and a 30% degree of support. This particular form of data mining is largely based on the Apriori algorithm developed by Agrawal [1]. Let a database of possible data items be [28] D = {d1, d2, ..., dn} and the relevant set of transactions (sales, query results, etc.) be T = {T1, T2, ...}, where Ti ⊆ D. We are interested in discovering if there is a relationship between two sets of items (called itemsets) Xj, Xk; Xj, Xk ⊆ D. For such a relationship to be determined, the entire set of transactions in T must be examined and a count made of the number of transactions containing these sets, where a transaction Ti contains Xm if Xm ⊆ Ti. This count, called the support count of Xm, SCT(Xm), will be appropriately modified in the case of rough sets. There are then two measures used in determining rules: the percentage of Ti’s in T that

1. contain both Xj and Xk (i.e. Xj ∪ Xk) – called the support s;
2. if Ti contains Xj then Ti also contains Xk – called the confidence c.

The support and confidence can be interpreted as probabilities:

1. s – Prob(Xj ∪ Xk), and
2. c – Prob(Xk | Xj).

We assume the system user has provided minimum values for these in order to generate only sufficiently interesting rules. A rule whose support and confidence exceed these minimums is called a strong rule. The result of a query is then:

T = {RT, R̄T},

and so we must take into account the lower approximation, RT, and upper approximation, R̄T, results of the rough query in developing the association rules. Recall that in order to generate frequent itemsets, we must count the number of transactions Tj that support an itemset Xj. In the ordinary data mining algorithm one simply counts the occurrence of a value as 1 if in the set, or 0 if not. But now, since the query result T is a rough set, we must modify the support count SCT. So we define the rough support count, RSCT, for the set Xj, to count differently in the upper and lower approximations:

RSCT(Xj) = Σi WTi(Xj), for Xj ⊆ Ti,

where

WTi(Xj) = 1 if Ti ∈ RT; a if Ti ∈ R̄T, 0 < a < 1.
The value a can be a subjective value obtained from the user depending on a relative assessment of the roughness of the query result T. For the data mining example of the next section we chose a neutral default value of a = 1/2. Note that WTi(Xj) is included in the summation only if all of the values of the itemset Xj are included in the transaction, i.e. it is a subset of the transaction. Finally, to produce the association rules from the set of relevant data T retrieved from the spatial database, we must consider the frequent itemsets. For the purposes of generating a rule such as Xj → Xk we can now extend the approach to rough support and confidence as follows:

RS = RSCT(Xj ∪ Xk) / |T|,
RC = RSCT(Xj ∪ Xk) / RSCT(Xj).

In the spatial data mining area there have only been a few efforts using rough sets. In the research described in [8], approaches for attribute induction knowledge discovery [43]
in rough spatial data are investigated. In [15] Bittner considers rough sets for spatio-temporal data and how to discover characteristic configurations of spatial objects focusing on the use of topological relationships for characterizations. In a survey of uncertainty-based spatial data mining Shi et al. [47] provide a brief general comparison of fuzzy and rough set approaches for spatial data mining.
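The rough support count and the derived rough support and confidence described above can be sketched as follows (our own toy transactions; names are illustrative): transactions in the lower approximation of the query result count 1, boundary-region transactions count a, with the neutral default a = 1/2.

```python
# Hypothetical sketch of RSC_T, RS, and RC for a rough query result.
def rough_support_count(itemset, lower, boundary, a=0.5):
    itemset = set(itemset)
    return (sum(1 for t in lower if itemset <= t)      # certain results
            + sum(a for t in boundary if itemset <= t))  # possible results

lower = [{"cheese", "crackers"}, {"cheese"}]   # lower-approximation transactions
boundary = [{"cheese", "crackers", "wine"}]    # boundary-region transactions
n = len(lower) + len(boundary)

rsc_both = rough_support_count({"cheese", "crackers"}, lower, boundary)
rsc_cheese = rough_support_count({"cheese"}, lower, boundary)
print(rsc_both / n)           # rough support RS of cheese -> crackers
print(rsc_both / rsc_cheese)  # rough confidence RC
```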
Future Directions

There are several other approaches to uncertainty representation that may be more suitable for certain applications. Type-2 fuzzy sets have been of considerable recent interest [35]. In these, as opposed to ordinary fuzzy sets, in which the underlying membership functions are crisp, the membership functions are themselves fuzzy. Intuitionistic fuzzy sets, introduced by Atanassov [3,4], are another generalization of a fuzzy set. Two characteristic functions are used, capturing both the ordinary idea of degree of membership in the intuitionistic set and the degree of non-membership of elements in the set; they can be used in database design [10]. Related to the concepts introduced by rough sets is the idea of granularity for managing complex data by abstraction using information granules, as discussed by Lin [32,33]. A granular set approach has also been introduced [30], in which a set and a number of disjoint subsets of it constitute a semi-partition. Some prior database research on ordered relations [26], although not presented in the context of uncertainty of data, may provide approaches to extend our work in this area. A main emphasis for future work is the incorporation of some of these research topics into mainstream databases, commercial GIS products, and semi-structured data on the semantic web.
Acknowledgment The authors would like to thank the Naval Research Laboratory’s Base Program, Program Element No. 0602435N for sponsoring this research.
Bibliography Primary Literature 1. Agrawal R, Imielinski T, Swami A (1993) Mining Association Rules between sets of items in large databases. Proc of the 1993 ACM-SIGMOD International Conference on Management of Data. ACM Press, New York, NY, pp 207–216 2. Ahlqvist O, Keukelaar J, Oukbir K (2000) Rough classification and accuracy assessment. Int J Geogr Inf Sci 14:475–496
3. Atanassov K (1986) Intuitionistic Fuzzy Sets. Fuzzy Sets Syst 20:87–96 4. Atanassov K (2000) Intuitionistic Fuzzy Sets; Theory and Applications. Physica Verlag, Heidelberg 5. Beaubouef T, Petry F (1994) Fuzzy Set Quantification of Roughness in a Rough Relational Database Model. Proc. Third IEEE International Conference on Fuzzy Systems, Orlando, Florida, pp 172–177 6. Beaubouef T, Petry F (1994) Rough Querying of Crisp Data in Relational Databases. Proc. Third Int.l Workshop on Rough Sets and Soft Computing (RSSC’94), San Jose, California, pp 368– 375 7. Beaubouef T, Petry F (2000) Fuzzy Rough Set Techniques for Uncertainty Processing in a Relational Database. Int J Intell Syst 15:389–424 8. Beaubouef T, Petry F (2002) A Rough Set Foundation for Spatial Data Mining Involving Vague Regions. Proc.FUZZ-IEEE’02, Honolulu, Hawaii, pp 767–772 9. Beaubouef T, Petry F (2004) Rough Functional Dependencies. 2004 Multiconferences: International Conf. On Information and Knowledge Engineering (IKE’04), Las Vegas, pp 175– 179 10. Beaubouef T, Petry F (2007) Intuitionistic Rough Sets for Database Applications. In: Peters JF et al (eds) Transactions on Rough Sets VI. LNCS, vol 4374. Springer, Berlin, pp 26–30 11. Beaubouef T, Petry F (2007) Rough Sets: A Versatile Theory for Approaches To Uncertainty Management in Databases. Rough Computing: Theories, Technologies and Applications, Idea Group, Inc. 12. Beaubouef T, Petry F, Arora G (1998) Information-Theoretic Measures of Uncertainty for Rough Sets and Rough Relational Databases. Inf Sci 109:185–195 13. Beaubouef T, Petry F, Buckles B (1995) Extension of the Relational Database and its Algebra with Rough Set Techniques. Comput Intell 11:233–245 14. Bhandari D, Pal NR (1993) Some New Information Measures for Fuzzy Sets. Inf Sci 67:209–228 15. Bittner T (2000) Rough sets in spatio-temporal data mining. Proceedings of International Workshop on Temporal, Spatial and Spatio–Temporal Data Mining. 
Springer, Berlin Heidelberg, pp 89–104 16. Bittner T, Stell J (2003) Stratified Rough Sets and Vagueness. In: Kuhn W, Worboys M, Timpf S (eds) Spatial Information Theory: Cognitive and Computational Foundations of Geographic Information Science, International Conference (COSIT’03). Springer, Berlin, pp 286–303 17. Buckles B, Petry F (1983) Information-Theoretical Characterization of Fuzzy Relational Databases. IEEE Trans Syst Man Cybern 13:74–77 18. Buckles BP, Petry F (1985) Uncertainty models in information and database systems. J Inf Sci 11:77–87 19. Chanas S, Kuchta D (1992) Further remarks on the relation between rough and fuzzy sets. Fuzzy Sets Syst 47:391–394 20. de Luca A, Termini S (1972) A Definition of a Nonprobabilistic Entropy in the Setting of Fuzzy Set Theory. Inf Control 20:301–312 21. Dubois D, Prade H (1987) Twofold Fuzzy Sets and Rough Sets – Some Issues in Knowledge Representation. Fuzzy Sets Syst 23:3–18
22. Dubois D, Prade H (1992) Putting Rough Sets and Fuzzy Sets Together. In: Slowinski R (ed) Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory, Kluwer Academic Publishers, Boston 23. Elmasri R, Navathe S (2004) Fundamentals of Database Systems, 4th edn. Addison Wesley, Boston 24. Frawley W, Piatetsky-Shapiro G, Matheus C (1991) Knowledge Discovery in Databases: An Overview. In: Piatetsky-Shapiro G, Frawley W (eds) Knowledge Discovery in Databases. AAAI, MIT Press, Cambridge, pp 1–27 25. Fung KT, Lam CM (1980) The Database Entropy Concept and its Application to the Data Allocation Problem. INFOR 18(4):354– 363 26. Ginsburg S, Hull R (1983) Order Dependency in the Relational Model. Theor Comput Sci 26:146–195 27. Han J, Cai Y, Cercone N (1992) Knowledge Discovery in Databases: An Attribute–Oriented Approach, Proc. 18th VLDB Conf., Vancouver, Brit. Columbia, pp 547–559 28. Han J, Kamber M (2006) Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufman, San Diego, CA 29. Klir GJ, Folger TA (1988) Fuzzy Sets, Uncertainty, and Information. Prentice Hall, Englewood Cliffs NJ 30. Ligeza A (2002) Granular Sets and Granular Relation. Intelligent Information Systems, Physica Verlag, pp 331–340 31. Lin TY (1992) Topological and Fuzzy Rough Sets. In: Slowinski R (ed) Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory. Kluwer Academic Publishers, Boston, pp 287–304 32. Lin TY (1997) Granular Computing: From Rough Sets and Neighborhood Systems to Information Granulation and Computing in Words, European Congress on Intelligent Techniques and Soft Computing, September 8–12, pp 1602–1606 33. Lin TY (1999) Granular computing: Fuzzy Logic and Rough Sets, In: Zadeh L, Kacprzyk J (eds) Computing with Words in Information/ Intelligent Systems. Physica-Verlag, Heidelberg, pp 183–200 34. 
Makinouchi A (1977) A Consideration on normal form of not-necessarily normalized relation in the relational data model. Proc. of the 3rd Int. Conf. VLDB, pp 447–453 35. Mendel J, John R (2002) Type-2 Fuzzy Sets Made Simple. IEEE Trans Fuzzy Syst 10:117–127 36. Nanda S, Majumdar S (1992) Fuzzy rough sets. Fuzzy Sets Syst 45:157–160 37. Ola A, Ozsoyoglu G (1993) Incomplete Relational Database Models Based on Intervals. IEEE Trans Knowl Data Eng 5:293–308 38. Pawlak Z (1982) Rough Sets. Int J Comput Inform Sci 11:341–356 39. Pawlak Z (1984) Rough Sets. Int J Man-Machine Stud 21:127–134 40. Pawlak Z (1985) Rough Sets and Fuzzy Sets. Fuzzy Sets Syst 17:99–102 41. Pawlak Z (1991) Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer Academic Publishers, Norwell, MA 42. Quinlan JR (1986) Induction of Decision Trees. Mach Learn 1:81–106 43. Raschia G, Mouaddib N (2002) SAINTETIQ: a fuzzy set-based approach to database summarization. Fuzzy Sets Syst 129:137–162 44. Roth M, Korth H, Batory D (1987) SQL/NF: A query language for non-1NF databases. Infor Syst 12:99–114
45. Shannon CE (1948) The Mathematical Theory of Communication. Bell Syst Tech J 27 46. Shenoi S, Melton A, Fan L (1992) Functional Dependencies and Normal Forms in the Fuzzy Relational Database Model. Infor Sci 60:1–28 47. Shi W, Wang S, Li D, Wang X (2003) Uncertainty-based Spatial Data Mining, Proceedings of Asia GIS Association, Wuhan, China, pp 124–135 48. Srinivasan P (1991) The importance of rough approximations for information retrieval. Int J Man-Mach Stud 34:657–671 49. Wang S, Li D, Shi W, Wang X (2002) Rough Spatial Description, International Archives of Photogrammetry and Remote Sensing, XXXII, Commission II, pp 503–510 50. Worboys M (1998) Computation with imprecise geospatial data. Comput Environ Urban Syst 22:85–106 51. Worboys M (1998) Imprecision in Finite Resolution Spatial Data. Geoinformatica 2:257–280 52. Wygralak M (1989) Rough Sets and Fuzzy Sets – Some Remarks on Interrelations. Fuzzy Sets Syst 29:241–243 53. Zadeh L (1965) Fuzzy Sets. Infor Control 8:338–353 54. Zvieli A, Chen P (1986) Entity–Relationship Modeling and Fuzzy Databases. Proc. of International Conference on Data Engineering, pp 320–327
Books and Reviews Aczel J, Daroczy Z (1975) On Measures of Information and their Characterization. Academic Press, New York Angryk R, Petry F (2007) Attribute-Oriented Fuzzy Generalization in Proximity and Similarity-based Relational Database Systems. Int J Intell Syst 22:763–781 Arora G, Petry F, Beaubouef T (1997) Information Measure of Type β Under Similarity Relations. Sixth IEEE International Conference on Fuzzy Systems, Barcelona, pp 857–862 Arora G, Petry F, Beaubouef T (2001) A Note On New Parametric Measures of Information for Fuzzy Sets. J Comb Infor Syst Sci 26:167–174 Beaubouef T, Petry F (2001) Vague Regions and Spatial Relationships: A Rough Set Approach. Fourth International Conference on Computational Intelligence and Multimedia Applications, Yokosuka City, Japan, pp 313–318 Beaubouef T, Petry F (2001) Vagueness in Spatial Data: Rough Set and Egg-Yolk Approaches, 14th International Conference on Industrial & Engineering Applications of Artificial Intelligence, pp 367–373 Beaubouef T, Petry F (2003) Rough Set Uncertainty in an Object Oriented Data Model, Intelligent Systems for Information Processing. In: Bouchon-Meunier B, Foulloy L, Yager R (eds) From Representation to Applications. Elsevier, Amsterdam, pp 37–46 Beaubouef T, Petry F (2005) Normalization in a Rough Relational Database, International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, pp 257–265 Beaubouef T, Petry F (2005) Representation of Spatial Data in an OODB Using Rough and Fuzzy Set Modeling, Soft Comput J 9:364–373 Beaubouef T, Petry F (2007) An Attribute-Oriented Approach for Knowledge Discovery in Rough Relational Databases, Proc FLAIRS’07, pp 507–508 Beaubouef T, Ladner R, Petry F (2004) Rough Set Spatial Data Modeling for Data Mining. Int J Intell Syst 19:567–584
Beaubouef T, Petry F, Arora G (1998) Information Measures for Rough and Fuzzy Sets and Application to Uncertainty in Relational Databases. In: Pal S, Skowron A (eds) Rough-Fuzzy Hybridization: A New Trend in Decision-Making. Springer, Singapore, pp 200–214 Beaubouef T, Petry F, Ladner R (2007) Spatial Data Methods and Vague Regions: A Rough Set Approach. Appl Soft Comput J 7:425–440 Buckles B, Petry F (1982) Security and Fuzzy Databases, Procs 1982 IEEE Int Conf on Cybernetics and Society, pp 622–625 Codd E (1970) A relational model of data for large shared data banks. Comm ACM 13:377–387 Ebanks B (1983) On Measures of Fuzziness and their Representations. J Math Anal Appl 94:24–37
Grzymala-Busse J (1991) Managing Uncertainty in Expert Systems. Kluwer Academic Publishers, Boston Han J, Nishio S, Kawano H, Wang W (1998) Generalization-Based Data Mining in Object-Oriented Databases Using an ObjectCube Model. Data Knowl Eng 25:55–97 Havrda J, Charvat F (1967) Quantification Methods of Classification Processes: Concepts of Structural ˛ Entropy. Kybernetica 3:149–172 Kapur J, Kesavan H (1992) Entropy Optimization Principles with Applications. Academic Press, New York Slowinski R (1992) A Generalization of the Indiscernibility Relation for Rough Sets Analysis of Quantitative Information. 1st Int. Workshop on Rough Sets: State of the Art and Perspectives, Poland, pp 41–48
2715
2716
Rough Set Data Analysis
Rough Set Data Analysis
SHUSAKU TSUMOTO
Department of Medical Informatics, Faculty of Medicine, Shimane University, Shimane, Japan

Article Outline
Introduction
Focusing Mechanism
Definition of Rules
Algorithms for Rule Induction
Experimental Results
What Is Discovered?
Rule Discovery as Knowledge Acquisition and Decision Support
Discussion
Conclusions
Bibliography

Introduction
Rule-induction methods are classified into two categories: induction of deterministic rules and induction of probabilistic ones [4,5,7,10]. On the one hand, deterministic rules are described as if-then rules, which can be viewed as propositions. From the set-theoretical point of view, the set of examples supporting the conditional part of a deterministic rule, denoted by C, is a subset of the set of examples belonging to the consequence part, denoted by D. That is, the relation C ⊆ D holds, and deterministic rules are supported only by positive examples in a data set. On the other hand, probabilistic rules are if-then rules with probabilistic information [10]. When a classical proposition does not hold for C and D, C is not a subset of D but closely overlaps with D. That is, the relations C ∩ D ≠ ∅ and |C ∩ D|/|C| ≥ δ hold in this case, where the threshold δ is the degree of closeness of the overlapping sets and is given by domain experts. (For more information, see Sect. "Definition of Rules".) Thus, probabilistic rules are supported by a large number of positive examples and a few negative examples. The common feature of both deterministic and probabilistic rules is that they deduce their consequence positively if an example satisfies their conditional parts. We call the reasoning by these rules positive reasoning. However, medical experts use not only positive reasoning but also negative reasoning for the selection of candidates, which is represented as if-then rules whose consequences include negative terms.
For example, when a patient complaining of headache does not have a throbbing pain, migraine should not be suspected, with high probability. Thus, negative reasoning also plays an important role in cutting the search space of a differential diagnosis process [10]. Medical reasoning therefore includes both positive and negative reasoning, although conventional rule-induction methods do not reflect this aspect. This is one of the reasons medical experts have difficulty in interpreting induced rules, and why the interpretation of rules in a discovery procedure does not proceed easily. Therefore, negative rules should be induced from databases in order not only to obtain rules that reflect experts' decision processes, but also to obtain rules that are easier for domain experts to interpret; both are important for enhancing the discovery process carried out in cooperation between medical experts and computers. In this chapter, first the characteristics of medical reasoning are discussed, and then two kinds of rules, positive rules and negative rules, are introduced as a model of medical reasoning. Interestingly, from the set-theoretic point of view, the sets of examples supporting both rules correspond to the lower and upper approximations in rough sets [5]. From the viewpoint of propositional logic, both positive and negative rules are defined as classical propositions, or deterministic rules, with two probabilistic measures: classification accuracy and coverage. Second, two algorithms for induction of positive and negative rules are introduced, defined as search procedures that use accuracy and coverage as evaluation indices. Finally, the proposed method is evaluated on several medical databases. The experimental results show that the induced rules correctly represent experts' knowledge, and that several interesting patterns are discovered.
Focusing Mechanism
One of the characteristics of medical reasoning is a focusing mechanism, which is used to select the final diagnosis from many candidates [10,11]. For example, in the differential diagnosis of headache, more than 60 diseases are checked by present history, physical examinations, and laboratory examinations. In diagnostic procedures, a candidate is excluded if a symptom necessary for diagnosis is not observed. This style of reasoning consists of the following two processes: exclusive reasoning and inclusive reasoning. Relations of this diagnostic model with another diagnostic model are discussed in [12]. The diagnostic procedure proceeds as follows (Fig. 1): first, exclusive reasoning excludes a disease from the candidates when a patient does not have a symptom that is necessary to diagnose that disease; second, inclusive reasoning suspects a disease among the output of the exclusive process when the patient has symptoms specific to that disease. These two steps are modeled as two kinds of rules, negative rules (or exclusive rules) and positive rules; the former corresponds to exclusive reasoning, the latter to inclusive reasoning. In the next two sections, these two rules are represented as special kinds of probabilistic rules.

Rough Set Data Analysis, Figure 1 Illustration of focusing mechanism

Definition of Rules
Rough Sets
In the following sections, we use the notation introduced by Grzymala-Busse and Skowron [8], based on rough set theory [5]. These notations are illustrated by a small data set shown in Table 1, which includes symptoms exhibited by six patients who complained of headache. Let U denote a nonempty finite set called the universe and A a nonempty finite set of attributes, i.e., a : U → V_a for a ∈ A, where V_a is called the domain of a. Then a decision table is defined as an information system, A = (U, A ∪ {d}). For example, Table 1 is an information system with U = {1, 2, 3, 4, 5, 6}, A = {age, location, nature, prodrome, nausea, M1}, and d = class. For location ∈ A, V_location is defined as {occular, lateral, whole}.
The atomic formulas over B ⊆ A ∪ {d} and V are expressions of the form [a = v], called descriptors over B, where a ∈ B and v ∈ V_a. The set F(B, V) of formulas over B is the least set containing all atomic formulas over B and closed with respect to disjunction, conjunction, and negation. For example, [location = occular] is a descriptor of B. For each f ∈ F(B, V), f_A denotes the meaning of f in A, i.e., the set of all objects in U with property f, defined inductively as follows:
1. If f is of the form [a = v], then f_A = {s ∈ U | a(s) = v}.
2. (f ∧ g)_A = f_A ∩ g_A; (f ∨ g)_A = f_A ∪ g_A; (¬f)_A = U − f_A.
For example, f = [location = whole] gives f_A = {2, 4, 5, 6}. As an example of a conjunctive formula, g = [location = whole] ∧ [nausea = no] gives g_A = {2, 5}.
Classification Accuracy and Coverage
Definition of Accuracy and Coverage
By use of the preceding framework, classification accuracy and coverage (true positive rate) are defined as follows.
Definition 1. Let R denote a formula in F(B, V) and D a set of objects that belong to a decision d. Classification accuracy and coverage (true positive rate) for R → d
Rough Set Data Analysis, Table 1 An example of a data set

No.  Age    Location  Nature      Prodrome  Nausea  M1   Class
1    50–59  occular   persistent  no        no      yes  m.c.h.
2    40–49  whole     persistent  no        no      yes  m.c.h.
3    40–49  lateral   throbbing   no        yes     no   migra
4    40–49  whole     throbbing   yes       yes     no   migra
5    40–49  whole     radiating   no        no      yes  m.c.h.
6    50–59  whole     persistent  no        yes     yes  psycho
M1: tenderness of M1; m.c.h.: muscle contraction headache; migra: migraine; psycho: psychological pain
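The meaning sets defined above can be checked directly against Table 1. The following sketch assumes a Python dictionary encoding of the table; the variable names and the encoding itself are illustrative, not part of the original formalism:

```python
# Illustrative Python encoding of Table 1; attribute and value spellings
# follow the table, everything else is an assumption of this sketch.
TABLE = {
    1: {"age": "50-59", "location": "occular", "nature": "persistent",
        "prodrome": "no", "nausea": "no", "m1": "yes", "class": "m.c.h."},
    2: {"age": "40-49", "location": "whole", "nature": "persistent",
        "prodrome": "no", "nausea": "no", "m1": "yes", "class": "m.c.h."},
    3: {"age": "40-49", "location": "lateral", "nature": "throbbing",
        "prodrome": "no", "nausea": "yes", "m1": "no", "class": "migra"},
    4: {"age": "40-49", "location": "whole", "nature": "throbbing",
        "prodrome": "yes", "nausea": "yes", "m1": "no", "class": "migra"},
    5: {"age": "40-49", "location": "whole", "nature": "radiating",
        "prodrome": "no", "nausea": "no", "m1": "yes", "class": "m.c.h."},
    6: {"age": "50-59", "location": "whole", "nature": "persistent",
        "prodrome": "no", "nausea": "yes", "m1": "yes", "class": "psycho"},
}
U = set(TABLE)  # the universe {1, ..., 6}

def meaning(a, v):
    """Meaning set [a = v]_A: all objects in U whose attribute a equals v."""
    return {s for s in U if TABLE[s][a] == v}

# f = [location = whole]; g = [location = whole] AND [nausea = no].
f = meaning("location", "whole")
g = meaning("location", "whole") & meaning("nausea", "no")
print(sorted(f), sorted(g))  # [2, 4, 5, 6] [2, 5]
```

Conjunction is set intersection, disjunction is union, and negation is complement in U, exactly as in the inductive definition above.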
Rough Set Data Analysis, Figure 2 Venn diagram of accuracy and coverage

Rough Set Data Analysis, Figure 3 Venn diagram for probabilistic rules
is defined as:

α_R(D) = |R_A ∩ D| / |R_A|  (= P(D|R)),
κ_R(D) = |R_A ∩ D| / |D|  (= P(R|D)),
where |S| denotes the cardinality of a set S, α_R(D) the classification accuracy of R with respect to classification of D, κ_R(D) the coverage (the true positive rate of R for D), and P(S) the probability of S. Figure 2 depicts the Venn diagram of the relations between accuracy and coverage. Accuracy views the overlapping region |R_A ∩ D| from the meaning of the relation R; coverage views the same region from the meaning of the concept D. In the preceding example, when R and D are set to [nausea = yes] and [class = migra], α_R(D) = 2/3 = 0.67 and κ_R(D) = 2/2 = 1.0. It is notable that α_R(D) measures the degree of sufficiency of the proposition R → D, and that κ_R(D) measures the degree of its necessity. For example, if α_R(D) is equal to 1.0, then R → D is true. On the other hand, if κ_R(D) is equal to 1.0, then D → R is true. Thus, if both measures are 1.0, then R ↔ D.

Probabilistic Rules
By use of accuracy and coverage, a probabilistic rule is defined as:
R →(α, κ) d  s.t.  R = ∧_j [a_j = v_k],  α_R(D) ≥ δ_α  and  κ_R(D) ≥ δ_κ.
If the thresholds for accuracy and coverage are set to high values, the meaning of the conditional part of probabilistic rules corresponds to the highly overlapped region.
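As a concrete check, accuracy and coverage for R = [nausea = yes] and d = migra can be computed directly from their meaning sets in Table 1. The threshold values δ_α and δ_κ below are illustrative assumptions, not values from the article:

```python
# R_A and D are the meaning sets of R = [nausea = yes] and the class migra,
# read off Table 1; delta_alpha and delta_kappa are illustrative thresholds.
R_A = {3, 4, 6}   # cases with nausea = yes
D = {3, 4}        # cases of class migra

def accuracy(r, d):
    """alpha_R(D) = |R_A ∩ D| / |R_A|: sufficiency of R -> D."""
    return len(r & d) / len(r)

def coverage(r, d):
    """kappa_R(D) = |R_A ∩ D| / |D|: necessity of R -> D."""
    return len(r & d) / len(d)

delta_alpha, delta_kappa = 0.6, 0.8
fires = accuracy(R_A, D) >= delta_alpha and coverage(R_A, D) >= delta_kappa
print(round(accuracy(R_A, D), 2), coverage(R_A, D), fires)  # 0.67 1.0 True
```

With these thresholds, [nausea = yes] → migra qualifies as a probabilistic rule on the toy table.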
Figure 3 depicts the Venn diagram of probabilistic rules with a highly overlapped region. This rule is a kind of probabilistic proposition with two statistical measures, and is an extension of Ziarko's variable precision rough set model (VPRS) [15]. It is also notable that both a positive rule and a negative rule are defined as special cases of this rule, as shown in the next sections.

Positive Rules
A positive rule is defined as a rule supported only by positive examples, the classification accuracy of which is equal to 1.0. It is notable that the set supporting this rule corresponds to a subset of the lower approximation of a target concept, which is introduced in rough sets [5]. Thus, a positive rule is represented as:

R → d
s.t.  R = ∧_j [a_j = v_k],  α_R(D) = 1.0.
Figure 4 shows the Venn diagram of a positive rule. As shown in this figure, the meaning of R is a subset of that of D. This diagram is exactly equivalent to the classic proposition R → d. In the preceding example, one positive rule of m.c.h. (muscle contraction headache) is:

[nausea = no] → m.c.h.  with  α = 3/3 = 1.0.
This positive rule is often called a deterministic rule. However, we use the term positive (deterministic) rule, because a deterministic rule supported only by negative examples, called a negative rule, is introduced in the next section. (Footnote: the probabilistic rule defined above is also a kind of rough modus ponens [6].)
Rough Set Data Analysis, Figure 4 Venn diagram of positive rules

Negative Rules
Before defining a negative rule, let us first introduce an exclusive rule, the contrapositive of a negative rule [10]. An exclusive rule is defined as a rule supported by all the positive examples, the coverage of which is equal to 1.0. That is, an exclusive rule represents the necessity condition of a decision. It is notable that the set supporting an exclusive rule corresponds to the upper approximation of a target concept, which is introduced in rough sets [5]. Thus, an exclusive rule is represented as:

R → d  s.t.  R = ∨_j [a_j = v_k],  κ_R(D) = 1.0.

Figure 5 shows the Venn diagram of an exclusive rule. As shown in this figure, the meaning of R is a superset of that of D. This diagram is exactly equivalent to the classic proposition d → R. In the preceding example, the exclusive rule of m.c.h. is:

[M1 = yes] ∨ [nausea = no] → m.c.h.  with  κ = 1.0.

Rough Set Data Analysis, Figure 5 Venn diagram of exclusive rules

From the viewpoint of propositional logic, an exclusive rule should be represented as:

d → ∨_j [a_j = v_k],

because the condition of an exclusive rule corresponds to the necessity condition of the conclusion d. Thus, it is easy to see that a negative rule is defined as the contrapositive of an exclusive rule:

∧_j ¬[a_j = v_k] → ¬d,
which means that if a case does not satisfy any of the attribute-value pairs in the condition of a negative rule, then we can exclude the decision d from the candidates. For example, the negative rule of m.c.h. is:

¬[M1 = yes] ∧ ¬[nausea = no] → ¬m.c.h.

In summary, a negative rule is defined as:

∧_j ¬[a_j = v_k] → ¬d  s.t.  ∀[a_j = v_k]  κ_{[a_j = v_k]}(D) = 1.0,

where D denotes the set of samples that belong to a class d. Figure 6 shows the Venn diagram of a negative rule. As shown in this figure, it is notable that this negative region is the "positive region" of the "negative concept." Negative rules should also be included in the category of deterministic rules, because their coverage, a measure of negative concepts, is equal to 1.0. It is also notable that the set supporting a negative rule corresponds to a subset of the negative region, which is introduced in rough sets [5]. In summary, positive and negative rules correspond to the positive and negative regions defined in rough sets. Figure 7 shows the Venn diagram of those rules.
Rough Set Data Analysis, Figure 6 Venn diagram of negative rules
Rough Set Data Analysis, Figure 7 Venn diagram of defined rules
Algorithms for Rule Induction
The contrapositive of a negative rule, i.e., an exclusive rule, is induced by a modification of the algorithm introduced in PRIMEROSE-REX [10], as shown in Fig. 8. This algorithm works as follows. (1) First it selects a descriptor [a_i = v_j] from the list of attribute-value pairs, denoted by L.
(2) Then it checks whether this descriptor overlaps with the set of positive examples, denoted by D. (3) If so, the descriptor is included in a list of candidates for positive rules, and the algorithm checks whether its coverage is equal to 1.0. If the coverage is equal to 1.0, then the descriptor is added to R_er, the formula for the conditional part of the exclusive rule of D. (4) Then [a_i = v_j] is deleted from the list L. This procedure, from (1) to (4), continues until L is empty. (5) Finally, when L is empty, the algorithm generates negative rules by taking the contrapositive of the induced exclusive rules.
On the other hand, positive rules are induced as inclusive rules by the algorithm introduced in PRIMEROSE-REX [10], as shown in Fig. 9. For induction of positive rules, the thresholds of accuracy and coverage are set to 1.0 and 0.0, respectively. This algorithm works in the following way. (1) First it substitutes L_1, which denotes a list of formulas composed of only one descriptor, with the list L_ir generated by the former algorithm shown in Fig. 8.
procedure Exclusive and Negative Rules;
var
  L : List;  /* A list of elementary attribute-value pairs */
begin
  L := P0;  /* P0: A list of elementary attribute-value pairs given in a database */
  while (L ≠ {}) do
    begin
      Select one pair [a_i = v_j] from L;
      if ([a_i = v_j]_A ∩ D ≠ ∅) then  /* D: positive examples of a target class d */
        begin
          L_ir := L_ir + [a_i = v_j];  /* Candidates for positive rules */
          if (κ_[a_i = v_j](D) = 1.0) then
            R_er := R_er ∧ [a_i = v_j];  /* Include [a_i = v_j] in the formula of the exclusive rule */
        end
      L := L − [a_i = v_j];
    end
  Construct negative rules: take the contrapositive of R_er.
end {Exclusive and Negative Rules};
Rough Set Data Analysis, Figure 8 Induction of exclusive and negative rules
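A minimal Python sketch of this procedure, run against Table 1, is given below. The data encoding and function names are illustrative; this is not the PRIMEROSE-REX implementation. Note that on this six-case toy table the descriptor [prodrome = no] also attains coverage 1.0 for m.c.h., so the sketch returns it alongside the two descriptors [M1 = yes] and [nausea = no] quoted in the text:

```python
# Illustrative re-implementation of the exclusive-rule step on Table 1
# (not the PRIMEROSE-REX code); each record is (case no., attributes, class).
RECORDS = [
    (1, {"age": "50-59", "location": "occular", "nature": "persistent",
         "prodrome": "no", "nausea": "no", "m1": "yes"}, "m.c.h."),
    (2, {"age": "40-49", "location": "whole", "nature": "persistent",
         "prodrome": "no", "nausea": "no", "m1": "yes"}, "m.c.h."),
    (3, {"age": "40-49", "location": "lateral", "nature": "throbbing",
         "prodrome": "no", "nausea": "yes", "m1": "no"}, "migra"),
    (4, {"age": "40-49", "location": "whole", "nature": "throbbing",
         "prodrome": "yes", "nausea": "yes", "m1": "no"}, "migra"),
    (5, {"age": "40-49", "location": "whole", "nature": "radiating",
         "prodrome": "no", "nausea": "no", "m1": "yes"}, "m.c.h."),
    (6, {"age": "50-59", "location": "whole", "nature": "persistent",
         "prodrome": "no", "nausea": "yes", "m1": "yes"}, "psycho"),
]

def exclusive_rule(target):
    """Collect every descriptor [a = v] whose coverage for `target` is 1.0.

    The returned disjunction is the exclusive rule d -> OR of descriptors;
    its contrapositive (AND of negated descriptors -> NOT d) is the
    negative rule.
    """
    D = {no for no, _, cls in RECORDS if cls == target}
    L = sorted({(a, v) for _, attrs, _ in RECORDS for a, v in attrs.items()})
    R_er = []
    for a, v in L:                 # step (1): pick a descriptor
        m = {no for no, attrs, _ in RECORDS if attrs[a] == v}
        if m & D and D <= m:       # steps (2)-(3): overlap, then coverage = 1.0
            R_er.append((a, v))    # add to the exclusive-rule formula
    return R_er

print(exclusive_rule("m.c.h."))
# [('m1', 'yes'), ('nausea', 'no'), ('prodrome', 'no')]
```

The check `D <= m` is exactly κ_{[a = v]}(D) = 1.0, since coverage reaches 1.0 precisely when the descriptor's meaning set contains all positive examples.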
procedure Positive Rules;
var
  i : integer;
  M, L_i : List;
begin
  L_1 := L_ir;  /* L_ir: A list of candidates generated by induction of exclusive rules */
  i := 1;  M := {};
  for i := 1 to n do  /* n: Total number of attributes given in a database */
    begin
      while (L_i ≠ {}) do
        begin
          Select one pair R = ∧ [a_i = v_j] from L_i;
          L_i := L_i − {R};
          if (α_R(D) > δ_α) then
            S_ir := S_ir + {R}  /* Include R in a list of the positive rules */
          else
            M := M + {R};
        end
      L_{i+1} := (a list of all conjunctions of the formulas in M);
    end
end {Positive Rules};
Rough Set Data Analysis, Figure 9 Induction of positive rules
(2) Then, until L_1 becomes empty, the following steps are repeated: (a) a formula [a_i = v_j] is removed from L_1; (b) the algorithm checks whether α_R(D) is larger than the threshold (for induction of positive rules, this is equal to checking whether α_R(D) is equal to 1.0). If so, this formula is included in a list of the conditional parts of positive rules; otherwise, it is included in M, which is used for making conjunctions. (3) When L_1 is empty, the next list L_2 is generated from the list M.
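The positive-rule search can be sketched in the same style. This is again an illustrative re-implementation, not PRIMEROSE-REX itself: for brevity the table is reduced to the attributes nausea and M1, and conjunctions are grown only one step from the leftover list M:

```python
# Illustrative sketch of the positive-rule search on a reduced Table 1
# (attributes nausea and M1 only); names and encoding are assumptions.
from itertools import combinations

RECORDS = [
    (1, {"nausea": "no", "m1": "yes"}, "m.c.h."),
    (2, {"nausea": "no", "m1": "yes"}, "m.c.h."),
    (3, {"nausea": "yes", "m1": "no"}, "migra"),
    (4, {"nausea": "yes", "m1": "no"}, "migra"),
    (5, {"nausea": "no", "m1": "yes"}, "m.c.h."),
    (6, {"nausea": "yes", "m1": "yes"}, "psycho"),
]

def positive_rules(target, threshold=1.0):
    """Keep conjunctions of descriptors whose accuracy reaches the threshold."""
    D = {no for no, _, cls in RECORDS if cls == target}

    def accuracy(conj):
        m = {no for no, attrs, _ in RECORDS
             if all(attrs[a] == v for a, v in conj)}
        return len(m & D) / len(m) if m else 0.0

    L1 = sorted({(a, v) for _, attrs, _ in RECORDS for a, v in attrs.items()})
    rules, M = [], []
    for d in L1:                      # single descriptors (list L_1)
        if accuracy((d,)) >= threshold:
            rules.append((d,))        # conditional part of a positive rule
        else:
            M.append(d)               # kept for making conjunctions
    for pair in combinations(M, 2):   # one growth step: list L_2 from M
        if accuracy(pair) >= threshold:
            rules.append(pair)
    return rules

print(positive_rules("m.c.h."))  # [(('nausea', 'no'),)]
```

On the reduced table this recovers exactly the positive rule [nausea = no] → m.c.h. quoted earlier, and [M1 = no] → migra for migraine.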
Experimental Results
For experimental evaluation, a new system, called PRIMEROSE-REX2 (Probabilistic Rule Induction Method for Rules of Expert System ver. 2.0), was developed, in which the algorithms discussed in Sect. "Algorithms for Rule Induction" were implemented. PRIMEROSE-REX2 was applied to the following three medical domains: (1) headache (RHINOS domain), whose training samples consist of 52,119 samples, 45 classes, and 147 attributes; (2) cerebrovascular diseases (CVD), whose training samples consist of 7620 samples, 22 classes, and 285 attributes; and (3) meningitis, whose training samples consist of 1211 samples, 4 classes, and 41 attributes (Table 2).

Rough Set Data Analysis, Table 2 Databases

Domain      Samples  Classes  Attributes
Headache    52,119   45       147
CVD         7620     22       285
Meningitis  1211     4        41

For evaluation, we used the following two types of experiments. One experiment was to evaluate the predictive accuracy using the cross-validation method, which is often used in the machine-learning literature [9]. The other ex-
periment was to evaluate the induced rules by medical experts and to check whether these rules lead to a new discovery.
Performance of Rules Obtained
For comparison of performance, the experiments are conducted by the following four procedures. First, rules are acquired manually from experts. Second, the data sets are randomly split into new training samples and new test samples. Third, PRIMEROSE-REX2 and the conventional rule-induction methods AQ15 [4] and C4.5 [7] are applied to the new training samples for rule generation. Fourth, the induced rules and the rules acquired from experts are tested using the new test samples. The second through fourth steps are repeated 100 times, and the average classification accuracy over the 100 trials is computed. This process is a variant of repeated two-fold cross-validation, introduced in [10]. Experimental results (performance) are shown in Table 3. The first and second rows show the results obtained using PRIMEROSE-REX2; the results in the first row are derived using both positive and negative rules, and those in the second row by positive rules only. The third row shows the results derived from medical experts. For comparison, the classification accuracies of C4.5 and AQ15 are also shown. These results show that the combination of positive and negative rules outperforms positive rules alone, although it is a little worse than the medical experts' rules.

What Is Discovered?
Interesting Rules Are Very Few
One of the most important observations is that there were very few rules interesting or unexpected to medical experts, compared to the number of rules extracted from the data sets. Table 4 shows these results. The second column gives the number of positive (respectively, negative) rules obtained from each data set, and the third column the number of those rules that were interesting or unexpected to domain experts. For example, the first row shows that the number of induced positive rules for headache is 24,335, but the number of interesting rules is only 114; that is, only 0.47% of the rules are interesting to domain experts. This table shows that the number of interesting rules is very small: even in the case of meningitis, only 6.5% are interesting, which suggests that the interpretation step carried out by domain experts is a very hard part of the knowledge discovery process. Next, we show several examples of rules that are interesting to domain experts.
Positive Rules in Differential Diagnosis of Headache
In the domain of differential diagnosis of headache, the following interesting positive rule was found:

[Age < 20] ∧ [History: paroxysmal] → Common Migraine  (coverage: 0.75).

This rule is said to be interesting when it is compared with the following rule:

[Age > 40] ∧ [History: paroxysmal] → Classic Migraine  (coverage: 0.72).

These two rules consist of two parts: although the values of the attribute "Age" are different, those of the attribute "History" are the same. This suggests that the attribute "Age" is important for differential diagnosis between common migraine and classic migraine.
Negative Rules in Differential Diagnosis of Headache
In the domain of differential diagnosis of headache, the following interesting negative rule was found:

¬[Nature: Persistent] ∧ ¬[History: acute] ∧ ¬[History: paroxysmal] ∧ ¬[Neck Stiffness: yes] → ¬Common Migraine.

It is notable that it is difficult even for domain experts to judge the interestingness of this rule if only this rule is shown. The domain experts pointed out that the rule is interesting when compared with the following rules:

[Nature: Persistent] ∧ [History: acute] ∧ [Neck Stiffness: yes] → Meningitis,
[Nature: Persistent] ∧ [History: chronic] → Brain Tumor.

This means that the negative rule includes information that is very important for differential diagnosis between common migraine and meningitis or brain tumor. Note that these two rules are very similar to the negative rule, except for the negation symbols.
Rough Set Data Analysis, Table 3 Experimental results (accuracy: averaged)

Method                                Headache  CVD    Meningitis
PRIMEROSE-REX2 (positive + negative)  91.3%     89.3%  92.5%
PRIMEROSE-REX2 (positive)             68.3%     71.3%  74.5%
Experts                               95.0%     92.9%  93.2%
PR-REX                                88.3%     84.3%  82.5%
C4.5                                  85.8%     79.7%  81.4%
AQ15                                  86.2%     78.9%  82.5%

Rough Set Data Analysis, Table 4 Number of extracted rules and interesting rules

            Induced positive rules  Interesting rules
Headache    24,335                  114 (0.47%)
CVD         14,715                  106 (0.72%)
Meningitis  1,922                   40 (2%)

            Induced negative rules  Interesting rules
Headache    12,113                  120 (0.99%)
CVD         7,231                   155 (2.1%)
Meningitis  77                      5 (6.5%)

Positive Rules in Meningitis
In the domain of meningitis, the following positive rules, which medical experts did not expect, were obtained:

[WBC < 12,000] ∧ [Sex = Female] ∧ [Age < 40] ∧ [CSF_CELL < 1000] → Virus  (coverage: 0.91),
[Age ≥ 40] ∧ [WBC ≥ 8000] ∧ [Sex = Male] ∧ [CSF_CELL ≥ 1000] → Bacteria  (coverage: 0.64).

The former rule means that if WBC (white blood cell count) is less than 12,000, the gender of a patient is female, the age is less than 40, and CSF_CELL (cell count of cerebrospinal fluid) is less than 1000, then the type of meningitis is viral. The latter means that if the age of a patient is no less than 40, WBC is larger than 8000, the gender is male, and CSF_CELL is larger than 1000, then the type of meningitis is bacterial. The most interesting points are that these rules include information about age and gender, which often seem to be unimportant attributes for differential diagnosis of meningitis. The first discovery was that women did not often suffer from bacterial infection compared with men, because such relationships between gender and meningitis had not been discussed in the medical context [1]. By a close examination of the database on meningitis, it was found that most of the patients suffered from chronic diseases, such as DM, LC, and sinusitis, which are risk factors of bacterial meningitis. The second discovery was that [Age < 40] was also an important factor not to suspect viral meningitis, which also matches the fact that most old people suffer from chronic diseases. These results were also reevaluated in medical practice. Recently, the preceding two rules were checked using an additional 21 cases of meningitis (15 viral and 6 bacterial). Surprisingly, the rules misclassified only three cases (two viral, one bacterial); that is, the total accuracy was 18/21 = 85.7%, and the accuracies for viral and bacterial meningitis were 13/15 = 86.7% and 5/6 = 83.3%, respectively. The reasons for the misclassifications are the following: the misclassified case of bacterial infection involved a patient who had a severe immunodeficiency, although he was very young, and the two misclassified cases of viral infection involved patients who suffered from herpes zoster. It is notable that even these misclassified cases can be explained from the viewpoint of immunodeficiency: that is, it was confirmed that immunodeficiency is a key factor for meningitis. The validation of these rules is still ongoing; it will be reported in the near future.
Positive and Negative Rules in CVD
Concerning the database on CVD, several interesting rules were derived. The most interesting results were the following positive and negative rules for thalamic hemorrhage:

[Sex = Female] ∧ [Hemiparesis = Left] ∧ [LOC: positive] → Thalamus,
¬[Risk: Hypertension] ∧ ¬[Sensory = no] → ¬Thalamus.

The former rule means that if the gender of a patient is female and she suffers from left hemiparesis ([Hemiparesis = Left]) and loss of consciousness ([LOC: positive]), then the focus of CVD is the thalamus. The latter rule means that if a patient suffers neither from hypertension ([Risk: Hypertension]) nor from sensory disturbance ([Sensory = no]), then the focus of CVD is not the thalamus.
Interestingly, LOC (loss of consciousness) under the condition [Sex = Female] ∧ [Hemiparesis = Left] was found to be an important factor in diagnosing thalamic damage. In this domain, unlike the database on meningitis, no strong correlations between these attributes and others have been found yet. It remains future work to find what factors lie behind these rules.
Rule Discovery as Knowledge Acquisition and Decision Support
Expert System: RH
Another use of rule discovery is automated knowledge acquisition from databases. Knowledge acquisition is referred to as a bottleneck problem in the development of expert systems [2]; it has not been fully solved, and it is expected to be addressed by induction of rules from databases. However, there are few papers that discuss the evaluation of discovered rules from the viewpoint of knowledge acquisition [12]. For this purpose, we developed an expert system, called RH (rule-based system for headache), using the acquired knowledge. The reason for selecting the domain of headache is that we had earlier developed an expert system, RHINOS (rule-based headache information organizing system), which makes a differential diagnosis of headache [3]; in that system, it took about six months to acquire the knowledge from domain experts. RH consists of two parts. First, it requires inputs and applies exclusive and negative rules to select candidates (the focusing mechanism). Then, it requires additional inputs and applies positive rules for differential diagnosis among the selected candidates. Finally, RH outputs its diagnostic conclusions.
Rough Set Data Analysis, Table 5 Evaluation of RH (accuracy: averaged)

Method                                  Accuracy
PRIMEROSE-REX2 (positive and negative)  91.4% (851/930)
PRIMEROSE-REX2 (positive)               78.5% (729/930)
RHINOS (positive and negative)          93.5% (864/930)
RHINOS (positive)                       82.8% (765/930)
Evaluation of RH
RH was evaluated in clinical practice with respect to its classification accuracy, using 930 patients who came to the outpatient clinic after the development of the system. Experimental results on classification accuracy are shown in Table 5. The first and second rows show the performance of rules obtained using PRIMEROSE-REX2; the results in the first row are derived using both positive and negative rules, and those in the second row using only positive rules. The third and fourth rows show the results derived using both positive and negative rules, and using only positive rules, acquired directly from medical experts. These results show that the combination of positive and negative rules outperforms positive rules alone and achieves almost the same performance as the experts' rules.

Discussion
Hierarchical Rules for Decision Support
One of the problems with rule induction is that conventional rule-induction methods cannot extract rules that plausibly represent experts' decision processes [12]: the description length of induced rules is too short compared with that of the experts' rules. (It may be observed that the extra length does not contribute much to classification performance.) For example, the rule-induction methods introduced in this chapter induce the following common rule for muscle contraction headache from the databases on differential diagnosis of headache:

[Location = whole] ∧ [Jolt Headache = no] ∧ [Tenderness of M1 = yes] → muscle contraction headache.

This rule is shorter than the following rule given by medical experts:

[Jolt Headache = no] ∧ ([Tenderness of M0 = yes]
∨ [Tenderness of M1 = yes] ∨ [Tenderness of M2 = yes]) ∧ [Tenderness of B1 = no] ∧ [Tenderness of B2 = no] ∧ [Tenderness of B3 = no] ∧ [Tenderness of C1 = no] ∧ [Tenderness of C2 = no] ∧ [Tenderness of C3 = no] ∧ [Tenderness of C4 = no] → muscle contraction headache.

These results suggest that conventional rule-induction methods do not reflect the mechanism of knowledge acquisition of medical experts.
Typically, rules acquired from medical experts are much longer than those induced from databases whose decision attributes are given by the same experts. This is because rule-induction methods generally search for shorter rules, compared with decision-tree induction. In the case of decision-tree induction, the induced trees are sometimes too deep, and for the trees to be useful for learning, pruning and examination by experts are required. One of the main reasons rules are short and decision trees are sometimes deep is that these patterns are generated by only one criterion, such as high accuracy or high information gain. The comparative study in this section suggests that experts acquire rules by the use of several measures. These characteristics of medical experts' rules are fully examined not by comparing rules for the same class, but by comparing experts' rules with those for another class. For example, a classification rule for muscle contraction headache is given by:

[Jolt Headache = no] ∧ ([Tenderness of M0 = yes] ∨ [Tenderness of M1 = yes] ∨ [Tenderness of M2 = yes]) ∧ [Tenderness of B1 = no] ∧ [Tenderness of B2 = no] ∧ [Tenderness of B3 = no] ∧ [Tenderness of C1 = no] ∧ [Tenderness of C2 = no] ∧ [Tenderness of C3 = no] ∧ [Tenderness of C4 = no] → muscle contraction headache.

This rule is very similar to the following classification rule for disease of the cervical spine:

[Jolt Headache = no]
∧ ([Tenderness of M0 = yes] ∨ [Tenderness of M1 = yes] ∨ [Tenderness of M2 = yes]) ∧ ([Tenderness of B1 = yes] ∨ [Tenderness of B2 = yes] ∨ [Tenderness of B3 = yes] ∨ [Tenderness of C1 = yes] ∨ [Tenderness of C2 = yes] ∨ [Tenderness of C3 = yes] ∨ [Tenderness of C4 = yes]) → disease of cervical spine.

The differences between these two rules are the attribute-value pairs from tenderness of B1 to C4. Thus, these two rules can be simplified into the following form:

a1 ∧ A2 ∧ ¬A3 → muscle contraction headache,
a1 ∧ A2 ∧ A3 → disease of cervical spine.

The first two terms and the third one represent different kinds of reasoning. The first and second terms, a1 and A2, are used to differentiate muscle contraction headache and disease of the cervical spine from other diseases. The third term, A3, is used to make a differential diagnosis between these two diseases. Thus, medical experts first select several diagnostic candidates that are similar to each other from many diseases, and then make a final diagnosis from those candidates. This problem has been partially solved: Tsumoto introduced a new approach for inducing such rules, induction of hierarchical decision rules, in [13]. In that paper, the characteristics of experts' rules are closely examined and a new approach to extracting plausible rules is introduced, consisting of the following three procedures. First, the characterization of decision attributes (given classes) is extracted from databases, and the classes are classified into several groups with respect to this characterization. Then, two kinds of subrules, characterization rules for each group and discrimination rules for each class in the group, are induced. Finally, these two parts are integrated into one rule for each decision attribute. The proposed method was evaluated on a medical database, and the experimental results show that the induced rules correctly represent experts' decision processes. This observation also suggests that medical experts implicitly look at the relations between rules for different concepts. Future work should discover the relations between induced rules.

Relations Between Rules
In [14], Tsumoto focuses on the characteristics of medical reasoning (the focusing mechanism) and introduces three kinds of rules, positive rules, exclusive rules, and total covering rules, as a model of medical reasoning, which is an extended formalization of the rules defined in [12]. Interestingly, from the set-theoretic point of view, the sets of examples supporting these rules correspond to the lower and upper approximations in rough sets. Furthermore, from the viewpoint of propositional logic, both inclusive and exclusive rules are defined as classical propositions, or deterministic rules, with two probabilistic measures, classification accuracy and coverage. To-
2725
2726
Rough Set Data Analysis
tal covering rules have several interesting relations with inclusive and exclusive rules, which reflects the characteristics of medical reasoning. Originally, a total covering rule is defined as a set of symptoms that can be observed in at least one case of a target disease. That is, this rule is defined as a collection of attribute-value pairs whose accuracy is larger than 0: R!d
s.t.
R D _ j [a j D v k ];
˛R (D) > 0 :
From the definition of accuracy and coverage, this formula can be transformed into:

R → d  s.t.  R = ∨_j [a_j = v_k],  κ_R(D) > 0.
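The equivalence of the two conditions follows because accuracy α_R(D) = |[R] ∩ D| / |[R]| and coverage κ_R(D) = |[R] ∩ D| / |D| share the numerator |[R] ∩ D|, so one is positive exactly when the other is. A minimal Python sketch of the two measures, on an invented symptom table (the attribute names and rows are illustrative, not taken from the chapter's data):

```python
# Accuracy and coverage of a rule "condition -> class" on a toy table.
# alpha_R(D) = |[R] ∩ D| / |[R]|   (classification accuracy)
# kappa_R(D) = |[R] ∩ D| / |D|     (coverage)
# Both share the numerator |[R] ∩ D|, so alpha > 0 iff kappa > 0.

def meaning_set(table, condition):
    """Indices of objects satisfying the rule's condition part."""
    return {i for i, row in enumerate(table) if condition(row)}

def accuracy(table, condition, d_class):
    R = meaning_set(table, condition)
    D = {i for i, row in enumerate(table) if row["class"] == d_class}
    return len(R & D) / len(R) if R else 0.0

def coverage(table, condition, d_class):
    R = meaning_set(table, condition)
    D = {i for i, row in enumerate(table) if row["class"] == d_class}
    return len(R & D) / len(D) if D else 0.0

table = [
    {"jolt": "no",  "tenderness": "yes", "class": "m.c.h."},
    {"jolt": "no",  "tenderness": "yes", "class": "m.c.h."},
    {"jolt": "no",  "tenderness": "yes", "class": "cervical"},
    {"jolt": "no",  "tenderness": "no",  "class": "cervical"},
    {"jolt": "yes", "tenderness": "no",  "class": "migraine"},
]

cond = lambda row: row["jolt"] == "no" and row["tenderness"] == "yes"
print(round(accuracy(table, cond, "m.c.h."), 3))  # 0.667
print(coverage(table, cond, "m.c.h."))            # 1.0
```

Here the condition covers three objects, two of which belong to the class, giving α = 2/3; both objects of the class are covered, giving κ = 1.0.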
For each attribute, the attribute-value pairs form a partition of U. Thus, for each attribute, total covering rules include a covering of all the positive examples. According to this property, the preceding formula is redefined as:

R → d  s.t.  R = ∨_j R(a_j),  R(a_j) = ∨_k [a_j = v_k]  s.t.  κ_{R(a_j)}(D) = 1.0.

It is notable that this definition is an extension of exclusive rules, and this total covering rule can be written as:

d → ∨_j ∨_k [a_j = v_k]  s.t.  κ_{[a_j = v_k]}(D) = 1.0.
Let S(R) denote the set of attribute-value pairs of rule R. For each class d, let R_pos(d), R_ex(d), and R_tc(d) denote the positive rule, the exclusive rule, and the total covering rule, respectively. Then

S(R_ex(d)) ⊆ S(R_tc(d)),

because a total covering rule can be viewed as an upper approximation of exclusive rules. It is also notable that this relation holds between the positive rule (inclusive rule) and the total covering rule, that is,

S(R_pos(d)) ⊆ S(R_tc(d)).

Thus, the total covering rule can be viewed as an upper approximation of inclusive rules. This relation also holds when R_pos(d) is replaced with a probabilistic rule, which shows that total covering rules are the weakest form of diagnostic rules. In this way, rules that reflect the diagnostic reasoning of medical experts have a sophisticated background from the viewpoint of set theory. In particular, the rough set framework provides a good tool for modeling such focusing mechanisms. Our future work will investigate the relation between these three types of rules from the viewpoint of rough set theory.
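The inclusion S(R_ex(d)) ⊆ S(R_tc(d)) can be illustrated directly. The sketch below assumes, per the definitions above, that the total covering rule collects every attribute-value pair observed in at least one case of class d, while the exclusive rule keeps only the pairs observed in every case of d (coverage 1.0); the table and attribute names are invented:

```python
# Attribute-value pair sets of the exclusive rule and the total covering
# rule for a class d, on an illustrative toy table.
# - total covering pairs: [a = v] observed in at least one case of d
# - exclusive-rule pairs: [a = v] observed in every case of d
# By construction S(R_ex(d)) is a subset of S(R_tc(d)).

def pair_sets(table, d_class, attrs):
    D = [row for row in table if row["class"] == d_class]
    tc = {(a, row[a]) for row in D for a in attrs}           # total covering
    ex = {(a, v) for (a, v) in tc
          if all(row[a] == v for row in D)}                  # exclusive
    return ex, tc

table = [
    {"jolt": "no", "m0": "yes", "b1": "no",  "class": "m.c.h."},
    {"jolt": "no", "m0": "no",  "b1": "no",  "class": "m.c.h."},
    {"jolt": "no", "m0": "yes", "b1": "yes", "class": "cervical"},
]

ex, tc = pair_sets(table, "m.c.h.", ["jolt", "m0", "b1"])
print(ex <= tc)    # True: S(R_ex) ⊆ S(R_tc)
print(sorted(ex))  # [('b1', 'no'), ('jolt', 'no')]
```

The pair [m0 = yes] appears in the total covering set but not in the exclusive set, since it is observed in only one of the two m.c.h. cases.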
Conclusions

In this chapter, the characteristics of two measures, classification accuracy and coverage, are discussed. The discussion shows that the two measures are dual, and that accuracy and coverage are measures of positive and negative rules, respectively. Then an algorithm for the induction of positive and negative rules is introduced. The proposed method is evaluated on medical databases; the experimental results demonstrate that the induced rules correctly represent experts' knowledge, and that the method can discover several interesting patterns.

Bibliography
1. Adams RD, Victor M (1993) Principles of Neurology, 5th edn. McGraw-Hill, New York
2. Buchanan BG, Shortliffe EH (eds) (1984) Rule-Based Expert Systems. Addison-Wesley, Reading
3. Matsumura Y, Matsunaga T, Hata Y, Kimura M, Matsumura H (1988) Consultation system for diagnoses of headache and facial pain: RHINOS. Med Inform 11:145–157
4. Michalski RS, Mozetic I, Hong J, Lavrac N (1986) The multi-purpose incremental learning system AQ15 and its testing application to three medical domains. In: Proc 5th National Conference on Artificial Intelligence. AAAI Press, Palo Alto, pp 1041–1045
5. Pawlak Z (1991) Rough Sets. Kluwer, Dordrecht
6. Pawlak Z (1998) Rough Modus Ponens. In: Proc International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems 98, Paris
7. Quinlan JR (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, Palo Alto
8. Skowron A, Grzymala-Busse J (1994) From rough set theory to evidence theory. In: Yager R, Fedrizzi M, Kacprzyk J (eds) Advances in the Dempster-Shafer Theory of Evidence. Wiley, New York, pp 193–236
9. Shavlik JW, Dietterich TG (eds) (1990) Readings in Machine Learning. Morgan Kaufmann, Palo Alto
10. Tsumoto S, Tanaka H (1996) Automated discovery of medical expert system rules from clinical databases based on rough sets. In: Proc 2nd International Conference on Knowledge Discovery and Data Mining. AAAI Press, Palo Alto, pp 63–69
11. Tsumoto S (1998) Modelling medical diagnostic rules based on rough sets. In: Polkowski L, Skowron A (eds) Rough Sets and Current Trends in Computing. Lecture Notes in Artificial Intelligence, vol 1424. Springer, Berlin
12. Tsumoto S (1998) Automated extraction of medical expert system rules from clinical databases based on rough set theory. J Inf Sci 112:67–84
13. Tsumoto S (1998) Extraction of experts' decision rules from clinical databases using rough set model. Intell Data Anal 2(3)
14. Tsumoto S (2001) Medical diagnostic rules as upper approximation of rough sets. In: Proc FUZZ-IEEE 2001
15. Ziarko W (1993) Variable precision rough set model. J Comput Syst Sci 46:39–59
Rough Sets in Decision Making
ROMAN SŁOWIŃSKI1,2, SALVATORE GRECO3, BENEDETTO MATARAZZO3
1 Institute of Computing Science, Poznan University of Technology, Poznan, Poland
2 Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
3 Faculty of Economics, University of Catania, Catania, Italy
Article Outline

Glossary
Definition of the Subject
Introduction
Classical Rough Set Approach to Classification Problems of Taxonomy Type
Dominance-Based Rough Set Approach to Ordinal Classification Problems
DRSA on a Pairwise Comparison Table for Multiple Criteria Choice and Ranking Problems
DRSA for Decision Under Uncertainty
Multiple Criteria Decision Analysis Using Association Rules
Interactive Multiobjective Optimization Using DRSA (IMO-DRSA)
Conclusions
Future Directions
Bibliography
Glossary

Multiple attribute (or multiple criteria) decision support aims at giving the decision maker (DM) a recommendation concerning a set of objects A (also called alternatives, actions, acts, solutions, options, candidates, . . . ) evaluated from multiple points of view called attributes (also called features, variables, criteria, objectives, . . . ).

Main categories of multiple attribute (or multiple criteria) decision problems:
classification, when the decision aims at assigning each object to one of predefined classes,
choice, when the decision aims at selecting the best object,
ranking, when the decision aims at ordering objects from the best to the worst.

Two kinds of classification problems are distinguished:
taxonomy, when the value sets of attributes and the predefined classes are not preference ordered,
ordinal classification (also known as multiple criteria sorting), when the value sets of attributes and the predefined classes are preference ordered.

Two kinds of choice problems are distinguished:
discrete choice, when the set of objects is finite and reasonably small to be listed,
multiple objective optimization, when the set of objects is infinite and defined by constraints of a mathematical program.

If value sets of attributes are preference ordered, they are called criteria or objectives; otherwise they keep the name of attributes.

Criterion is a real-valued function g_i defined on A, reflecting a worth of objects from a particular point of view, such that, in order to compare any two objects a, b ∈ A from this point of view, it is sufficient to compare two values: g_i(a) and g_i(b).

Dominance: object a is non-dominated in set A (Pareto-optimal) if and only if there is no other object b in A such that b is not worse than a on all considered criteria, and strictly better on at least one criterion.

Preference model is a representation of the value system of the decision maker on the set of objects with vector evaluations.

Decision under uncertainty takes into account consequences of decisions that distribute over multiple states of nature with given probability. The preference order, characteristic for data describing multiple attribute decision problems, concerns also decision under uncertainty, where the objects correspond to acts, the attributes are outcomes (gain or loss) obtained with a given probability, and the problem consists in ordinal classification, choice, or ranking of the acts.

Rough set in universe U is an approximation of a set based on available information about objects of U. The rough approximation is composed of two ordinary sets called the lower and upper approximation. Lower approximation is a maximal subset of objects which, according to the available information, certainly belong to the approximated set, and upper approximation is a minimal subset of objects which, according to the available information, possibly belong to the approximated set. The difference between the upper and lower approximation is called the boundary.

Decision rule is a logical statement of the type "if . . . , then . . . ", where the premise (condition part)
specifies values assumed by one or more condition attributes and the conclusion (decision part) specifies an overall judgment.

Definition of the Subject

Scientific analysis of decision problems aims at giving the decision maker (DM) a recommendation concerning a set of objects (also called alternatives, solutions, acts, actions, options, candidates, . . . ) evaluated from multiple points of view considered relevant for the problem at hand and called attributes (also called features, variables, criteria, objectives, . . . ). For example, a decision can concern:

1) diagnosis of pathologies for a set of patients, where the patients are objects of the decision, and the symptoms and results of medical tests are the attributes,
2) assignment of enterprises to classes of risk, where the enterprises are objects of the decision, and financial ratio indices and other economic indicators, such as the market structure, the technology used by the enterprise and the quality of management, are the attributes,
3) selection of a car to be bought from among a given set of cars, where the cars are objects of the decision, and maximum speed, acceleration, price, fuel consumption, comfort, color and so on are the attributes,
4) ordering of students applying for a scholarship, where the students are objects of the decision, and their scores in different subjects are the attributes.

The following three main categories of decision problems are typically distinguished [44]:
classification, when the decision aims at assigning each object to one of predefined classes,
choice, when the decision aims at selecting the best objects,
ranking, when the decision aims at ordering objects from the best to the worst.

Looking at the above examples, one can say that 1) and 2) are classification problems, 3) is a choice problem and 4) is a ranking problem.
The above categorization can be refined by distinguishing two kinds of classification problems: taxonomy, when the value sets of attributes and the predefined classes are not preference ordered, and ordinal classification (also known as multiple criteria sorting), when the value sets of attributes and the predefined classes are preference ordered [12]. In the above examples, 1) is a taxonomy problem and 2) is an ordinal classification problem. If value sets
of attributes are preference ordered, they are called criteria, otherwise they keep the name of attributes. For example, in a decision regarding the selection of a car, its price is a criterion because, obviously, a low price is better than a high price. Instead, the color of a car is not a criterion but simply an attribute, because red is not intrinsically better than green. One can imagine, however, that also the color of a car could become a criterion if, for example, a DM considered red better than green.

Introduction

Scientific support of decisions makes use of a more or less explicit model of the decision problem. The model relates the decision to the characteristics of the objects expressed by the considered attributes. Building such a model requires information about the conditions and parameters of the aggregation of multi-attribute characteristics of objects. The nature of this information depends on the adopted methodology: prices and interest rates for cost-benefit analysis; cost coefficients in objectives and technological coefficients in constraints for mathematical programming; a training set of decision examples for neural networks and machine learning; substitution rates for a value function of multi-attribute utility theory; pairwise comparisons of objects in terms of intensity of preference for the analytic hierarchy process; attribute weights and several thresholds for ELECTRE methods; and so on (see the state-of-the-art survey [4]). This information has to be provided by the DM, possibly assisted by an analyst. Very often this information is not easily definable. For example, this is the case for the price of many immaterial goods and for the interest rates in cost-benefit analysis, or for the coefficients of objectives and constraints in mathematical programming models.
Even if the required information is easily definable, like a training set of decision examples for neural networks, it is often processed in a way which is not clear to the DM, so that (s)he cannot see the exact relations between the provided information and the final recommendation. Consequently, the decision model is very often perceived by the DM as a black box whose result has to be accepted because the analyst's authority guarantees that the result is "right". In this context, the aspiration of the DM to find good reasons for the decision is frustrated, and this raises the need for a more transparent methodology in which the relation between the original information and the final recommendation is clearly shown. The transparent methodology searched for has been called a glass box [32]. Its typical representative uses a learning set of decision examples as the input preference information provided by the DM, and
it expresses the decision model in terms of a set of "if . . . , then . . . " decision rules induced from the input information. On one side, the decision rules are explicitly related to the original information; on the other side, they give understandable justifications for the decisions to be made. For example, in the case of a medical diagnosis problem, the decision rule approach requires as input information a set of examples of previous diagnoses, from which some diagnostic rules are induced, such as "if there is symptom α and the test result is β, then there is pathology γ". Each such rule is directly related to examples of diagnoses in the input information where there is symptom α, test result β and pathology γ. Moreover, the DM can easily verify that the input information contains no example of diagnosis where there is symptom α and test result β but no pathology γ. The rules induced from the input information provided in terms of exemplary decisions represent a decision model which is transparent for the DM and enables understanding of the reasons of past decisions. The acceptance of the rules by the DM justifies, in turn, their use for future decisions. The induction of rules from examples is a typical approach of artificial intelligence. This explains our interest in rough set theory [38,39], which has proved to be a useful tool for the analysis of vague descriptions of decision situations [41,48]. The rough set analysis aims at explaining the values of some decision attributes, playing the role of "dependent variables", by means of the values of condition attributes, playing the role of "independent variables". For example, in the above diagnostic context, data about the presence of a pathology are given by decision attributes, while data about symptoms and tests are given by condition attributes. An important advantage of the rough set approach is that it can deal with partly inconsistent examples, i.e.
cases where the presence of different pathologies is associated with the presence of the same symptoms and test results. Moreover, it provides useful information about the role of particular attributes and their subsets, and prepares the ground for representing the knowledge hidden in the data by means of "if . . . , then . . . " decision rules. The classical rough set approach proposed by Pawlak [38,39] cannot, however, deal with preference order in the value sets of condition and decision attributes. For this reason, the classical rough set approach handles only one of the four decision problems listed above, namely classification of taxonomy type. To deal with ordinal classification, choice and ranking, it is necessary to generalize the classical rough set approach so as to take into
account preference orders and monotonic relationships between condition and decision attributes. This generalization, called the Dominance-based Rough Set Approach (DRSA), has been proposed by Greco, Matarazzo and Slowinski [12,14,15,18,21,49].

Classical Rough Set Approach to Classification Problems of Taxonomy Type

Information Table and Indiscernibility Relation

Information regarding classification examples is supplied in the form of an information table, whose separate rows refer to distinct objects and whose columns refer to the different attributes considered. This means that each cell of this table indicates an evaluation (quantitative or qualitative) of the object placed in the corresponding row by means of the attribute in the corresponding column.

Formally, an information table is the 4-tuple S = ⟨U, Q, V, v⟩, where U is a finite set of objects, called the universe, Q = {q1, . . . , qn} is a finite set of attributes, V_q is the value set of attribute q, V = ⋃_{q∈Q} V_q, and v : U × Q → V is a total function such that v(x, q) ∈ V_q for each q ∈ Q, x ∈ U, called the information function. Therefore, each object x of U is described by a vector (string) Des_Q(x) = [v(x, q1), . . . , v(x, qn)], called the description of x in terms of the evaluations on the attributes from Q; it represents the available information about x. Obviously, x ∈ U can be described in terms of any non-empty subset P ⊆ Q.

To every non-empty subset of attributes P ⊆ Q there is associated an indiscernibility relation on U, denoted by I_P:

I_P = {(x, y) ∈ U × U : v(x, q) = v(y, q), ∀q ∈ P}.

If (x, y) ∈ I_P, it is said that the objects x and y are P-indiscernible. Clearly, the indiscernibility relation thus defined is an equivalence relation (reflexive, symmetric and transitive). The family of all the equivalence classes of the relation I_P is denoted by U/I_P, and the equivalence class containing object x ∈ U by I_P(x). The equivalence classes of the relation I_P are called P-elementary sets.
Approximations

Let S be an information table, X a non-empty subset of U, and ∅ ≠ P ⊆ Q. The P-lower approximation and the P-upper approximation of X in S are defined, respectively, as:

P̲(X) = {x ∈ U : I_P(x) ⊆ X},
P̄(X) = {x ∈ U : I_P(x) ∩ X ≠ ∅}.
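These definitions translate directly into a short Python sketch; the information table below and its attribute names are invented for illustration:

```python
# P-elementary sets and rough approximations of a set X, following the
# definitions of I_P, lower and upper approximation given above.

from collections import defaultdict

def elementary_sets(table, P):
    """Partition U into equivalence classes of the relation I_P."""
    classes = defaultdict(set)
    for x, row in table.items():
        classes[tuple(row[q] for q in P)].add(x)
    return list(classes.values())

def lower_upper(table, P, X):
    """P-lower and P-upper approximations of X."""
    lower, upper = set(), set()
    for E in elementary_sets(table, P):
        if E <= X:          # elementary set contained in X
            lower |= E
        if E & X:           # elementary set intersecting X
            upper |= E
    return lower, upper

U = {
    "x1": {"q1": 0, "q2": "a"},
    "x2": {"q1": 0, "q2": "a"},
    "x3": {"q1": 1, "q2": "a"},
    "x4": {"q1": 1, "q2": "b"},
}
X = {"x1", "x3"}
low, up = lower_upper(U, ["q1", "q2"], X)
print(sorted(low))  # ['x3']
print(sorted(up))   # ['x1', 'x2', 'x3']
```

Object x1 falls out of the lower approximation because it is indiscernible from x2, which lies outside X; together they enter the upper approximation.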
The elements of P̲(X) are all and only those objects x ∈ U which belong to equivalence classes generated by the indiscernibility relation I_P that are contained in X; the elements of P̄(X) are all and only those objects x ∈ U which belong to equivalence classes generated by the indiscernibility relation I_P that contain at least one object belonging to X. In other words, P̲(X) is the largest union of the P-elementary sets included in X, while P̄(X) is the smallest union of the P-elementary sets containing X. The P-boundary of X in S, denoted by Bn_P(X), is defined by

Bn_P(X) = P̄(X) − P̲(X).

The following inclusion relation holds:

P̲(X) ⊆ X ⊆ P̄(X).

Thus, in view of the information provided by P, if an object x belongs to P̲(X), then it certainly belongs to set X, while if x belongs to P̄(X), then it possibly belongs to set X. Bn_P(X) constitutes the "doubtful region" of X: nothing can be said with certainty about the membership of its elements in set X using the subset of attributes P only. Moreover, the following complementarity relation is satisfied:

P̄(X) = U − P̲(U − X).

If the P-boundary of set X is empty, Bn_P(X) = ∅, then X is an ordinary (exact) set with respect to P, that is, it may be expressed as the union of a certain number of P-elementary sets; otherwise, if Bn_P(X) ≠ ∅, set X is an approximate (rough) set with respect to P and may be characterized by means of the approximations P̲(X) and P̄(X). The family of all sets X ⊆ U having the same P-lower and P-upper approximations is called the rough set.

The quality of the approximation of set X by means of the attributes from P is defined as

γ_P(X) = |P̲(X)| / |X|,

such that 0 ≤ γ_P(X) ≤ 1. The quality γ_P(X) represents the relative frequency of the objects correctly classified using the attributes from P. The definition of approximations of a set X ⊆ U can be extended to a classification, i.e. a partition Y = {Y1, . . . , Ym} of U. The P-lower and P-upper approximations of Y in S are defined by the sets P̲(Y) = {P̲(Y1), . . . , P̲(Ym)} and P̄(Y) = {P̄(Y1), . . . , P̄(Ym)}, respectively. The coefficient

γ_P(Y) = Σ_{i=1}^{m} |P̲(Yi)| / |U|
is called the quality of the approximation of classification Y by the set of attributes P, or in short, the quality of classification. It expresses the ratio of all P-correctly classified objects to all objects in the system.

Dependence and Reduction of Attributes

An issue of great practical importance is the reduction of "superfluous" attributes in an information table. Superfluous attributes can be eliminated without deteriorating the information contained in the original table. Let P ⊆ Q and p ∈ P. It is said that attribute p is superfluous in P with respect to classification Y if γ_P(Y) = γ_{P−{p}}(Y); otherwise, p is indispensable in P. The subset of Q containing all the indispensable attributes is known as the core. Given classification Y, any minimal (with respect to inclusion) subset P ⊆ Q such that γ_P(Y) = γ_Q(Y) is called a reduct. It specifies a minimal subset P of Q which keeps the quality of classification at the same level as the whole set of attributes. In other words, the attributes that do not belong to the reduct are superfluous with respect to the classification Y of objects from U. More than one reduct may exist in an information table, and their intersection gives the core.

Decision Table and Decision Rules

In the information table describing examples of classification, the attributes of set Q are divided into condition attributes (set C ≠ ∅) and decision attributes (set D ≠ ∅), where C ∪ D = Q and C ∩ D = ∅. Such an information table is called a decision table. The decision attributes induce a partition of U deduced from the indiscernibility relation I_D in a way that is independent of the condition attributes. D-elementary sets are called decision classes, denoted by Cl_t, t = 1, . . . , m. The partition of U into decision classes is called the classification Cl = {Cl1, . . . , Clm}.
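For small tables, reducts and the core can be found by brute force, using the quality of classification γ_P(Y) defined above. The sketch below is illustrative; its four-object decision table mirrors the traffic-sign example discussed later in this article, and the function names are arbitrary:

```python
# Brute-force search for reducts and the core of a small decision table,
# based on the quality of classification gamma_P(Y).

from itertools import combinations
from collections import defaultdict

def elementary_sets(table, P):
    classes = defaultdict(set)
    for x, row in table.items():
        classes[tuple(row[q] for q in P)].add(x)
    return list(classes.values())

def gamma(table, P, partition):
    """Share of objects lying in some lower approximation of a class."""
    total = 0
    for Y in partition:
        total += sum(len(E) for E in elementary_sets(table, P) if E <= Y)
    return total / len(table)

def reducts(table, C, partition):
    """All minimal subsets of C preserving the full quality gamma_C(Y)."""
    full = gamma(table, C, partition)
    found = []
    for r in range(1, len(C) + 1):
        for P in combinations(C, r):
            if gamma(table, list(P), partition) == full and \
               not any(set(Q) <= set(P) for Q in found):
                found.append(P)
    return found

table = {
    "a": {"S": "tri", "PC": "yel", "SC": "red"},
    "b": {"S": "cir", "PC": "whi", "SC": "red"},
    "c": {"S": "cir", "PC": "blu", "SC": "red"},
    "d": {"S": "cir", "PC": "blu", "SC": "whi"},
}
Y = [{"a"}, {"b", "c"}, {"d"}]   # decision classes
rs = reducts(table, ["S", "PC", "SC"], Y)
core = set.intersection(*(set(r) for r in rs))
print(rs)    # [('S', 'SC'), ('PC', 'SC')]
print(core)  # {'SC'}
```

The subset check against already-found reducts enforces minimality with respect to inclusion, and the intersection of the reducts yields the core.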
There is a tendency to reduce the set C while keeping all important relationships between C and D, in order to make decisions on the basis of a smaller amount of information. When the set of condition attributes is replaced by one of its reducts, the quality of approximation of the classification induced by the decision attributes does not deteriorate.

Since the aim is to underline the functional dependencies between condition and decision attributes, a decision table may also be seen as a set of decision rules. These are logical statements of the type "if . . . , then . . . ", where the premise (condition part) specifies values assumed by one or more condition attributes (descriptions of C-elementary sets) and the conclusion (decision part) specifies an assignment to one or more decision classes. Therefore, the syntax of a rule is the following:

"if v(x, q1) = r_{q1} and v(x, q2) = r_{q2} and . . . v(x, qp) = r_{qp}, then x belongs to decision class Cl_{j1} or Cl_{j2} or . . . Cl_{jk}",

where {q1, q2, . . . , qp} ⊆ C, (r_{q1}, r_{q2}, . . . , r_{qp}) ∈ V_{q1} × V_{q2} × . . . × V_{qp}, and Cl_{j1}, Cl_{j2}, . . . , Cl_{jk} are some decision classes of the considered classification Cl. If the consequence is univocal, i.e. k = 1, then the rule is univocal, otherwise it is approximate.

An object x ∈ U supports decision rule r if its description matches both the condition part and the decision part of the rule. The decision rule r covers object x if it matches the condition part of the rule. Each decision rule is characterized by its strength, defined as the number of objects supporting the rule. In the case of approximate rules, the strength is calculated for each possible decision class separately. If a univocal rule is supported only by objects from the lower approximation of the corresponding decision class, then the rule is called certain or deterministic. If, however, a univocal rule is supported only by objects from the upper approximation of the corresponding decision class, then the rule is called possible or probabilistic. Approximate rules are supported, in turn, only by objects from the boundaries of the corresponding decision classes.

Procedures for the generation of decision rules from a decision table use an inductive learning principle. The objects are considered as examples of classification. In order to induce a decision rule with a univocal and certain conclusion about the assignment of an object to decision class X, the examples belonging to the C-lower approximation of X are called positive and all the others negative. Analogously, in the case of a possible rule, the examples belonging to the C-upper approximation of X are positive and all the others negative.
Possible rules are characterized by a coefficient, called confidence, telling to what extent the rule is consistent, i.e. the ratio of the number of positive examples supporting the rule to the number of all examples covered by the rule. Finally, in the case of an approximate rule, the examples belonging to the C-boundary of X are positive and all the others negative. A decision rule is called minimal if removing any attribute from the condition part gives a rule covering also negative objects. The existing induction algorithms use one of the following strategies [55]:

(a) generation of a minimal representation, i.e. a minimal set of rules covering all objects from a decision table,
(b) generation of an exhaustive representation, i.e. all rules for a given decision table,
(c) generation of a characteristic representation, i.e. a set of rules covering relatively many, though not necessarily all, objects from a decision table.
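The rule statistics above (cover, support, strength, confidence) can be sketched as follows; the decision table is invented, and confidence is read here as the share of positive objects among those covered by the rule, matching the consistency interpretation given above:

```python
# Cover, support, strength and confidence of a decision rule
# "if S = cir and PC = blu, then class I" on an illustrative table.

def covers(rule_cond, row):
    """Does the object match the condition part of the rule?"""
    return all(row[q] == v for q, v in rule_cond.items())

def supports(rule_cond, rule_class, row):
    """Does the object match both the condition and the decision part?"""
    return covers(rule_cond, row) and row["class"] == rule_class

def strength(table, rule_cond, rule_class):
    """Number of objects supporting the rule."""
    return sum(supports(rule_cond, rule_class, row) for row in table)

def confidence(table, rule_cond, rule_class):
    """Positive covered objects divided by all covered objects."""
    covered = [row for row in table if covers(rule_cond, row)]
    positive = [row for row in covered if row["class"] == rule_class]
    return len(positive) / len(covered) if covered else 0.0

table = [
    {"S": "tri", "PC": "yel", "class": "W"},
    {"S": "cir", "PC": "whi", "class": "I"},
    {"S": "cir", "PC": "blu", "class": "I"},
    {"S": "cir", "PC": "blu", "class": "O"},
]
rule = {"S": "cir", "PC": "blu"}
print(strength(table, rule, "I"))    # 1
print(confidence(table, rule, "I"))  # 0.5
```

A confidence below 1.0 signals an inconsistent (possible rather than certain) rule: here the condition covers one object of class I and one of class O.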
Explanation of the Classical Rough Set Approach by an Example

Suppose that one wants to describe the classification of basic traffic signs to a novice. There are three main classes of traffic signs, corresponding to: Warning (W), Interdiction (I), Order (O). These classes may be distinguished by such attributes as the shape (S) and the principal color (PC) of the sign. Finally, one can consider a few examples of traffic signs, like those shown in Table 1. These are:

a) sharp right turn,
b) speed limit of 50 km/h,
c) no parking,
d) go ahead.
The rough set approach is used here to build a model of classification of traffic signs into classes W, I, O on the basis of attributes S and PC. This is a typical problem of taxonomy type. One can remark that the sets of signs indiscernible by "Class" are:

W = {a}_Class,  I = {b, c}_Class,  O = {d}_Class,
Rough Sets in Decision Making, Table 1 Examples of traffic signs described by S and PC

Traffic sign   Shape (S)   Primary color (PC)   Class
a)             triangle    yellow               W
b)             circle      white                I
c)             circle      blue                 I
d)             circle      blue                 O
and the sets of signs indiscernible by S and PC are as follows:

{a}_{S,PC},  {b}_{S,PC},  {c, d}_{S,PC}.
The above elementary sets are generated, on the one hand, by the decision attribute "Class" and, on the other hand, by the condition attributes S and PC. The elementary sets of signs indiscernible by "Class" are denoted by {·}_Class and those by S and PC are denoted by {·}_{S,PC}. Notice that W = {a}_Class is characterized precisely by {a}_{S,PC}. In order to characterize I = {b, c}_Class and O = {d}_Class, one needs {b}_{S,PC} and {c, d}_{S,PC}; however, only {b}_{S,PC} is included in I = {b, c}_Class, while {c, d}_{S,PC} has a non-empty intersection with both I = {b, c}_Class and O = {d}_Class. It follows from this characterization that, by using condition attributes S and PC, one can characterize class W precisely, while classes I and O can only be characterized approximately:

Class W includes sign a certainly and possibly no other sign than a,
Class I includes sign b certainly and possibly signs b, c and d,
Class O includes no sign certainly and possibly signs c and d.

The terms "certainly" and "possibly" refer to the absence or presence of ambiguity between the description of signs by S and PC, on the one side, and by "Class", on the other side. In other words, using the description of signs by S and PC, one can say that all signs from elementary sets {·}_{S,PC} included in elementary sets {·}_Class belong certainly to the corresponding class, while all signs from elementary sets {·}_{S,PC} having a non-empty intersection with elementary sets {·}_Class belong to the corresponding class only possibly.
The two sets of certain and possible signs are, respectively, the lower and upper approximation of the corresponding class by attributes S and PC:

lower_approx_{S,PC}(W) = {a},   upper_approx_{S,PC}(W) = {a},
lower_approx_{S,PC}(I) = {b},   upper_approx_{S,PC}(I) = {b, c, d},
lower_approx_{S,PC}(O) = ∅,     upper_approx_{S,PC}(O) = {c, d}.

The quality of approximation of the classification by attributes S and PC is equal to the number of all the signs in the lower approximations divided by the number of all the signs in the table, i.e. 1/2.
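These approximations and the quality value 1/2 can be recomputed mechanically from Table 1 with a short Python check (the attribute encoding below is illustrative):

```python
# Recomputing the approximations of classes W, I, O from Table 1
# (attributes S and PC) using the rough-set definitions directly.

from collections import defaultdict

signs = {
    "a": ("triangle", "yellow"),
    "b": ("circle", "white"),
    "c": ("circle", "blue"),
    "d": ("circle", "blue"),
}
classes = {"W": {"a"}, "I": {"b", "c"}, "O": {"d"}}

# elementary sets generated by S and PC
elem = defaultdict(set)
for sign, desc in signs.items():
    elem[desc].add(sign)

for name, X in classes.items():
    lower = set().union(*(E for E in elem.values() if E <= X))
    upper = set().union(*(E for E in elem.values() if E & X))
    print(name, sorted(lower), sorted(upper))
# W ['a'] ['a']
# I ['b'] ['b', 'c', 'd']
# O [] ['c', 'd']

quality = sum(len(set().union(*(E for E in elem.values() if E <= X)))
              for X in classes.values()) / len(signs)
print(quality)  # 0.5
```

The output reproduces the three lower and upper approximations listed above and the quality of classification 1/2.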
Rough Sets in Decision Making, Table 2 Examples of traffic signs described by S, PC and SC

Traffic sign   Shape (S)   Primary color (PC)   Secondary color (SC)   Class
a)             triangle    yellow               red                    W
b)             circle      white                red                    I
c)             circle      blue                 red                    I
d)             circle      blue                 white                  O
One way to increase the quality of the approximation is to add a new attribute so as to decrease the ambiguity. Let us introduce the secondary color (SC) as a new condition attribute. The new situation is shown in Table 2. As one can see, the sets of signs indiscernible by S, PC and SC, i.e. the elementary sets {·}_{S,PC,SC}, are now:

{a}_{S,PC,SC},  {b}_{S,PC,SC},  {c}_{S,PC,SC},  {d}_{S,PC,SC}.
It is worth noting that the elementary sets are finer than before, and this enables the ambiguity to be eliminated. Consequently, the quality of approximation of the classification by attributes S, PC and SC is now equal to 1. A natural question occurring here is whether, indeed, all three attributes are necessary to characterize precisely the classes W, I and O. When attribute S or attribute PC is eliminated from the description of the signs, the elementary sets {·}_{PC,SC} or {·}_{S,SC} are defined, respectively, as follows:

{a}_{PC,SC},  {b}_{PC,SC},  {c}_{PC,SC},  {d}_{PC,SC};
{a}_{S,SC},  {b, c}_{S,SC},  {d}_{S,SC}.
Using any one of the above elementary sets, it is possible to characterize (approximate) classes W, I and O with the same quality (equal to 1) as it is when using the elementary sets fgS;PC;SC (i. e. those generated by the complete set of three condition attributes). Thus, the answer to the above question is that the three condition attributes are not all necessary to characterize precisely the classes W, I and O. It is, in fact, sufficient to use either PC and SC or S and SC. The subsets of condition attributes fPC; SCg and fS; SCg are called reducts of fS; PC; SCg because they have
this property. Note that the identification of reducts enables us to reduce the description of the signs in the table to only those attributes which are relevant. Other useful information can be generated from the identified reducts by taking their intersection. This is called the core. In our example, the core contains attribute SC. This tells us that SC is an indispensable attribute, i.e. it cannot be eliminated from the description of the signs without decreasing the quality of the approximation. Note that the other attributes from the reducts (i.e. S and PC) are exchangeable. If there happened to be some other attributes which were not included in any reduct, then they would be superfluous, i.e. they would not be useful at all in the characterization of the classes W, I and O. If, however, column S or PC is eliminated from Table 2, then the resulting table is not a minimal representation of knowledge about the classification of the four traffic signs. Note that, in order to characterize class W in Table 2, it is sufficient to use the single condition "S = triangle". Moreover, class I is characterized by two conditions ("S = circle" and "SC = red") and class O is characterized by the condition "SC = white". Thus, the minimal representation of this information system requires only four conditions (rather than the eight conditions present in Table 2 with either column S or PC eliminated). This representation corresponds to the following set of decision rules, which may be seen as a classification model discovered in the data set contained in Table 2 (in the braces are the symbols of the signs covered by the corresponding rule):

rule #1: if S = triangle, then Class = W   {a}
rule #2: if S = circle and SC = red, then Class = I   {b, c}
rule #3: if SC = white, then Class = O   {d}
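These three rules can be checked mechanically against the four signs of Table 2 (an illustrative sketch; the function name `classify` is ours, not the article's):

```python
# Sketch: applying decision rules #1-#3 to the signs of Table 2.
signs = {
    'a': {'S': 'triangle', 'PC': 'yellow', 'SC': 'red'},
    'b': {'S': 'circle',   'PC': 'white',  'SC': 'red'},
    'c': {'S': 'circle',   'PC': 'blue',   'SC': 'red'},
    'd': {'S': 'circle',   'PC': 'blue',   'SC': 'white'},
}

def classify(sign):
    if sign['S'] == 'triangle':                        # rule #1
        return 'W'
    if sign['S'] == 'circle' and sign['SC'] == 'red':  # rule #2
        return 'I'
    if sign['SC'] == 'white':                          # rule #3
        return 'O'
    return None

predicted = {name: classify(desc) for name, desc in signs.items()}
```

Each sign is matched by exactly one rule, and the four conditions together reproduce the classification of Table 2.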
This is not the only representation, because an alternative set of rules is:

rule #1′: if PC = yellow, then Class = W   {a}
rule #2′: if PC = white, then Class = I   {b}
rule #3′: if PC = blue and SC = red, then Class = I   {c}
rule #4′: if SC = white, then Class = O   {d}
It is interesting to come back to Table 1 and to ask what decision rules represent this information system. As the description of the four signs by S and PC is not sufficient to characterize exactly all the classes, it is not surprising that not all the rules will have a non-ambiguous decision.
Indeed, the following decision rules can be induced:

rule #1″: if S = triangle, then Class = W   {a}
rule #2″: if PC = white, then Class = I   {b}
rule #3″: if PC = blue, then Class = I or O   {c, d}
Note that these rules can be induced from the lower approximations of classes W and I, and from the set called the boundary of both I and O. Indeed, for exact rule #1″, the supporting example is in lower_approx_{S,PC}(W) = {a}; for exact rule #2″ it is in lower_approx_{S,PC}(I) = {b}; and the supporting examples for approximate rule #3″ are in the boundary of classes I and O, defined as:

boundary_{S,PC}(I) = upper_approx_{S,PC}(I) − lower_approx_{S,PC}(I) = {c, d},
boundary_{S,PC}(O) = upper_approx_{S,PC}(O) − lower_approx_{S,PC}(O) = {c, d}.

As a result of the approximate characterization of classes W, I and O by S and PC, an approximate representation in terms of decision rules is obtained. Since the quality of the approximation is 1/2, exact rules (#1″ and #2″) cover one half of the examples and the other half is covered by the approximate rule (#3″). By contrast, the quality of approximation by S and SC, or by PC and SC, was equal to 1, so all examples were covered by exact rules (#1 to #3, or #1′ to #4′, respectively). One can see, from this simple example, that the rough set analysis of data included in an information system provides some useful information. In particular, the following results are obtained:

- a characterization of decision classes in terms of chosen condition attributes through lower and upper approximations,
- a measure of the quality of approximation, which indicates how good the chosen set of attributes is for approximation of the classification,
- the reducts of condition attributes, containing all relevant attributes; at the same time, superfluous and exchangeable attributes are also identified,
- the core, composed of indispensable attributes,
- a set of decision rules induced from the lower and upper approximations of the decision classes; this constitutes a classification model for a given information system.
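Reducts and the core can be found by brute force for a table this small. A minimal sketch (our own helper names, not the article's algorithm; exhaustive search is only feasible for few attributes):

```python
from itertools import combinations

# Sketch: brute-force reducts and core for Table 2 (attributes S, PC, SC).
rows = {
    'a': {'S': 'triangle', 'PC': 'yellow', 'SC': 'red',   'Class': 'W'},
    'b': {'S': 'circle',   'PC': 'white',  'SC': 'red',   'Class': 'I'},
    'c': {'S': 'circle',   'PC': 'blue',   'SC': 'red',   'Class': 'I'},
    'd': {'S': 'circle',   'PC': 'blue',   'SC': 'white', 'Class': 'O'},
}
attrs = ('S', 'PC', 'SC')

def quality(subset):
    # fraction of signs whose elementary set lies inside a single class
    groups = {}
    for sign, row in rows.items():
        key = tuple(row[a] for a in subset)
        groups.setdefault(key, []).append(sign)
    consistent = sum(len(g) for g in groups.values()
                     if len({rows[s]['Class'] for s in g}) == 1)
    return consistent / len(rows)

full = quality(attrs)
# subsets preserving the quality of the full attribute set ...
candidates = [set(c) for r in range(1, len(attrs) + 1)
              for c in combinations(attrs, r) if quality(c) == full]
# ... kept only if minimal with respect to inclusion: the reducts
reducts = [c for c in candidates
           if not any(other < c for other in candidates)]
core = set.intersection(*reducts)
```

For Table 2 this recovers the two reducts {S, SC} and {PC, SC} and the core {SC}, in agreement with the discussion above.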
Dominance-Based Rough Set Approach to Ordinal Classification Problems

Dominance-Based Rough Set Approach (DRSA)

The Dominance-based Rough Set Approach (DRSA) has been proposed by the authors to handle background knowledge about ordinal evaluations of objects from a universe, and about monotonic relationships between these evaluations, e.g. "the larger the mass and the smaller the distance, the larger the gravity" or "the greater the debt of a firm, the greater its risk of failure". Such knowledge is typical for data describing various phenomena. It is also characteristic for data concerning multiple criteria decision making or decision under uncertainty, where the order of the value sets of condition and decision attributes corresponds to increasing or decreasing preference. In the case of multiple criteria decision making, the condition attributes are called criteria. Let us consider a decision table including a finite universe of objects (solutions, alternatives, actions) U evaluated on a finite set of criteria F = {f_1, ..., f_n} and on a single decision attribute d. The set of the indices of the criteria is denoted by I = {1, ..., n}. Without loss of generality, f_i : U → ℝ for each i = 1, ..., n, and, for all objects x, y ∈ U, f_i(x) ≥ f_i(y) means that "x is at least as good as y with respect to criterion f_i", which is denoted by x ≽_i y. Therefore, it is supposed that ≽_i is a complete preorder, i.e. a strongly complete and transitive binary relation, defined on U on the basis of the evaluations f_i(·). Furthermore, decision attribute d makes a partition of U into a finite number of decision classes, Cl = {Cl_1, ..., Cl_m}, such that each x ∈ U belongs to one and only one class Cl_t, t = 1, ..., m. It is assumed that the classes are preference ordered, i.e. for all r, s = 1, ..., m such that r > s, the objects from Cl_r are preferred to the objects from Cl_s. More formally, if ≽ is a comprehensive weak preference relation on U, i.e.
if for all x, y ∈ U, x ≽ y reads "x is at least as good as y", then it is supposed that

[x ∈ Cl_r, y ∈ Cl_s, r > s] ⇒ x ≻ y,

where x ≻ y means x ≽ y and not y ≽ x. The above assumptions are typical for consideration of an ordinal classification (or multiple criteria sorting) problem. Indeed, the decision table characterized above includes examples of ordinal classification which constitute the input preference information to be analyzed using DRSA. The sets to be approximated are called the upward union and downward union of decision classes, respectively:

Cl_t^≥ = ∪_{s≥t} Cl_s,   Cl_t^≤ = ∪_{s≤t} Cl_s,   t = 1, ..., m.
The statement x ∈ Cl_t^≥ reads "x belongs to at least class Cl_t", while x ∈ Cl_t^≤ reads "x belongs to at most class Cl_t". Let us remark that Cl_1^≥ = Cl_m^≤ = U, Cl_m^≥ = Cl_m and Cl_1^≤ = Cl_1. Furthermore, for t = 2, ..., m,

Cl_{t−1}^≤ = U − Cl_t^≥   and   Cl_t^≥ = U − Cl_{t−1}^≤.

The key idea of DRSA is the representation (approximation) of the upward and downward unions of decision classes by granules of knowledge generated by criteria. These granules are dominance cones in the criteria values space. x dominates y with respect to a set of criteria P ⊆ I (shortly, x P-dominates y), denoted by x D_P y, if for every criterion i ∈ P, f_i(x) ≥ f_i(y). The relation of P-dominance is reflexive and transitive, i.e. it is a partial preorder. Given a set of criteria P ⊆ I and x ∈ U, the granules of knowledge used for approximation in DRSA are:

the set of objects dominating x, called the P-dominating set, D_P^+(x) = {y ∈ U : y D_P x},
the set of objects dominated by x, called the P-dominated set, D_P^−(x) = {y ∈ U : x D_P y}.

Let us recall that the dominance principle requires that an object x dominating object y on all considered criteria (i.e. x having evaluations at least as good as y on all considered criteria) should also dominate y on the decision (i.e. x should be assigned to at least as good a decision class as y). Objects satisfying the dominance principle are called consistent, and those violating it are called inconsistent. The P-lower approximation of Cl_t^≥, denoted by P(Cl_t^≥), and the P-upper approximation of Cl_t^≥, denoted by P̄(Cl_t^≥), are defined as follows (t = 2, ..., m):

P(Cl_t^≥) = {x ∈ U : D_P^+(x) ⊆ Cl_t^≥},
P̄(Cl_t^≥) = {x ∈ U : D_P^−(x) ∩ Cl_t^≥ ≠ ∅}.

Analogously, one can define the P-lower approximation and the P-upper approximation of Cl_t^≤ as follows (t = 1, ..., m−1):

P(Cl_t^≤) = {x ∈ U : D_P^−(x) ⊆ Cl_t^≤},
P̄(Cl_t^≤) = {x ∈ U : D_P^+(x) ∩ Cl_t^≤ ≠ ∅}.

The P-lower and P-upper approximations so defined satisfy the following inclusion properties, for all P ⊆ I:

P(Cl_t^≥) ⊆ Cl_t^≥ ⊆ P̄(Cl_t^≥),   t = 2, ..., m,
P(Cl_t^≤) ⊆ Cl_t^≤ ⊆ P̄(Cl_t^≤),   t = 1, ..., m−1.
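The dominance cones and the approximations of the upward unions can be sketched in a few lines (illustrative data of our own, two gain-type criteria and three classes; all names are ours):

```python
# Sketch: P-dominating/P-dominated sets and rough approximations of the
# upward unions Cl_t^>= for a small, consistent set of objects.
objects = {'x1': (3, 3), 'x2': (2, 3), 'x3': (3, 1), 'x4': (1, 1)}
cls = {'x1': 3, 'x2': 2, 'x3': 3, 'x4': 1}   # decision classes 1..3

def dominates(x, y):
    # x D_P y: x at least as good as y on every considered criterion
    return all(fx >= fy for fx, fy in zip(objects[x], objects[y]))

def d_plus(x):    # P-dominating set D_P^+(x)
    return {y for y in objects if dominates(y, x)}

def d_minus(x):   # P-dominated set D_P^-(x)
    return {y for y in objects if dominates(x, y)}

def up_union(t):  # upward union Cl_t^>=
    return {x for x in objects if cls[x] >= t}

def lower_up(t):  # P-lower approximation of Cl_t^>=
    return {x for x in objects if d_plus(x) <= up_union(t)}

def upper_up(t):  # P-upper approximation of Cl_t^>=
    return {x for x in objects if d_minus(x) & up_union(t)}
```

With this data every object is consistent with the dominance principle, so lower and upper approximations coincide; an object violating the dominance principle would open a non-empty boundary between them.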
The P-lower and P-upper approximations of Cl_t^≥ and Cl_t^≤ have an important complementarity property, according to which,

P(Cl_t^≥) = U − P̄(Cl_{t−1}^≤) and P̄(Cl_t^≥) = U − P(Cl_{t−1}^≤), t = 2, ..., m;
P(Cl_t^≤) = U − P̄(Cl_{t+1}^≥) and P̄(Cl_t^≤) = U − P(Cl_{t+1}^≥), t = 1, ..., m−1.

The P-boundaries of Cl_t^≥ and Cl_t^≤, denoted by Bn_P(Cl_t^≥) and Bn_P(Cl_t^≤), respectively, are defined as follows:

Bn_P(Cl_t^≥) = P̄(Cl_t^≥) − P(Cl_t^≥), t = 2, ..., m;
Bn_P(Cl_t^≤) = P̄(Cl_t^≤) − P(Cl_t^≤), t = 1, ..., m−1.

Due to the above complementarity property, Bn_P(Cl_t^≥) = Bn_P(Cl_{t−1}^≤), for t = 2, ..., m. For every P ⊆ C, the quality of approximation of the ordinal classification Cl by a set of criteria P is defined as the ratio of the number of objects P-consistent with the dominance principle to the number of all the objects in U. Since the P-consistent objects are those which do not belong to any P-boundary Bn_P(Cl_t^≥), t = 2, ..., m, or Bn_P(Cl_t^≤), t = 1, ..., m−1, the quality of approximation of the ordinal classification Cl by a set of criteria P can be written as

γ_P(Cl) = |U − (∪_{t=2,...,m} Bn_P(Cl_t^≥))| / |U| = |U − (∪_{t=1,...,m−1} Bn_P(Cl_t^≤))| / |U|.

γ_P(Cl) can be seen as a degree of consistency of the objects from U, when P is the set of criteria and Cl is the considered ordinal classification. Each minimal (with respect to inclusion) subset P ⊆ C such that γ_P(Cl) = γ_C(Cl) is called a reduct of Cl, and is denoted by RED_Cl. Let us remark that for a given set U one can have more than one reduct. The intersection of all reducts is called the core, and is denoted by CORE_Cl. Criteria in CORE_Cl cannot be removed from consideration without deteriorating the quality of approximation. This means that, in set C, there are three categories of criteria:

indispensable criteria, included in the core,
exchangeable criteria, included in some reducts but not in the core,
redundant criteria, neither indispensable nor exchangeable, and thus not included in any reduct.
The dominance-based rough approximations of upward and downward unions of decision classes can serve to induce a generalized description of objects in terms of "if ..., then ..." decision rules. For a given upward or downward union of classes, Cl_t^≥ or Cl_s^≤, the decision rules induced under the hypothesis that objects belonging to P(Cl_t^≥) or P(Cl_s^≤) are positive examples, and all the others are negative, suggest a certain assignment to "class Cl_t or better", or to "class Cl_s or worse", respectively. On the other hand, the decision rules induced under the hypothesis that objects belonging to P̄(Cl_t^≥) or P̄(Cl_s^≤) are positive examples, and all the others are negative, suggest a possible assignment to "class Cl_t or better", or to "class Cl_s or worse", respectively. Finally, the decision rules induced under the hypothesis that objects belonging to the intersection P̄(Cl_s^≤) ∩ P̄(Cl_t^≥) are positive examples, and all the others are negative, suggest an approximate assignment to some classes between Cl_s and Cl_t (s < t). In the case of a preference-ordered description of objects, set U is composed of examples of ordinal classification. Then, it is meaningful to consider the following five types of decision rules:

1) certain D≥-decision rules, providing lower profile descriptions for objects belonging to P(Cl_t^≥): if f_{i1}(x) ≥ r_{i1} and ... and f_{ip}(x) ≥ r_{ip}, then x ∈ Cl_t^≥, where {i1, ..., ip} ⊆ I, t = 2, ..., m, and r_{i1}, ..., r_{ip} ∈ ℝ. If u_1(b, a) > u_1(a, a) for some b ∈ A, the reverse inequality holds for all sufficiently small ε.

Future Directions

Static games have been shown to be a useful framework for analyzing and understanding many situations that involve strategic interaction. At present, a large body of literature is available that develops various solution concepts, some of which are refinements of Nash equilibrium and some of which are coarsenings of it. Nonetheless, several areas for future research remain. One is the application of the theory to particular games to better understand the situations they model, for example auctions. In many markets trade is conducted by auctions of one kind or another, including markets for small domestic products as well as some centralized electricity markets where generators and distributors buy and sell electric power on a daily basis. Also, auctions are used to allocate large amounts of valuable spectrum among telecommunication companies. It would be interesting to calculate the equilibria of many real life auctions. Simultaneously, future research should also focus on the design of auctions whose equilibria have certain desirable properties. Another future direction would be to empirically and experimentally test the theory. The various equilibrium concepts predict certain kinds of behavior in certain games. Our confidence in the predictive and explanatory power of the theory depends on its performance in the field and in the laboratory. Moreover, the experimental and empirical results should provide valuable feedback for further development of the theory. Although some valuable experimental and empirical tests have already been performed (see [17,19,22,25] to name a few), the empirical aspect of game theory in general, and of static games in particular, remains underdeveloped.
Bibliography
1. Athey S (2001) Single crossing properties and the existence of pure strategy equilibria in games of incomplete information. Econometrica 69:861–889
2. Aumann RJ (1974) Subjectivity and correlation in randomized strategies. J Math Econ 1:67–96
3. Aumann RJ (1987) Correlated equilibrium as an expression of Bayesian rationality. Econometrica 55:1–18
4. Aumann RJ (1995) Backward induction and common knowledge of rationality. Games Econ Behav 8:6–19
5. Aumann RJ, Brandenburger A (1995) Epistemic conditions for Nash equilibrium. Econometrica 63:1161–1180
6. Binmore K (2007) Playing for Real. Oxford University Press, New York
7. Brandenburger A, Dekel E (1987) Rationalizability and correlated equilibria. Econometrica 55:1391–1402
8. Crawford V (1990) Equilibrium without independence. J Econ Theory 50:127–154
9. Fudenberg D, Tirole J (1991) Game Theory. MIT Press, Cambridge
10. Geanakoplos J (2003) Nash and Walras equilibrium via Brouwer. Econ Theory 21:585–603
11. Glicksberg IL (1952) A further generalization of the Kakutani fixed point theorem, with application to Nash equilibrium points. Proc Am Math Soc 3:170–174
12. Harsanyi J (1967) Games with incomplete information played by 'Bayesian' players. Manag Sci, Part I 14:159–182; Part II 14:320–334; Part III 14:486–502
13. Krishna V, Morgan J (1997) An analysis of the war of attrition and the all-pay auction. J Econ Theory 72:343–362
14. Smith MJ (1974) The theory of games and the evolution of animal conflicts. J Theor Biol 47:209–221
15. Smith MJ, Price GR (1973) The logic of animal conflict. Nature 246:15–18
16. McAdams D (2003) Isotone equilibrium in games of incomplete information. Econometrica 71:1191–1214
17. McKelvey RD, Palfrey TR (1992) An experimental study of the centipede game. Econometrica 60:803–836
18. Nash JF (1950) Equilibrium points in n-person games. Proc Natl Acad Sci USA 36:48–49
19. O'Neill B (1987) Nonmetric test of the minimax theory of two-person zero-sum games. Proc Natl Acad Sci 84:2106–2109
20. Osborne MJ (2004) An Introduction to Game Theory. Oxford University Press, New York
21. Osborne MJ, Rubinstein A (1994) A Course in Game Theory. MIT Press, Cambridge
22. Palacios-Huerta I (2003) Professionals play minimax. Rev Econ Stud 70:395–415
23. von Neumann J (1928) Zur Theorie der Gesellschaftsspiele. Math Ann 100:295–320
24. von Neumann J (1959) On the theory of games of strategy. In: Tucker AW, Luce RD (eds) Contributions to the Theory of Games, vol IV. Princeton University Press, Princeton
25. Walker M, Wooders J (2001) Minimax play at Wimbledon. Am Econ Rev 91:1521–1538
Statistical Applications of Wavelets
SOFIA OLHEDE
Department of Statistical Science, University College London, London, UK

Article Outline
Glossary
Definition of the Subject
Introduction
Decomposition View of Random Processes
Wavelets
Signal Estimation or Denoising
2-D Extensions of Wavelets
Covariance Estimation Using Wavelets
Future Directions
Bibliography

Glossary

Locally stationary process A stochastic process whose sampled points have joint characteristics in a short time window that depend only on the difference between the sampled time points, and not on the global position of the samples within the time window.
Multiresolution analysis The analysis of a phenomenon of interest over different temporal or spatial scales and locations.
Shrinkage A method of estimation where the initial estimator is decreased by multiplication with a factor of magnitude less than or equal to one. If the object being estimated is small, then a reduced variance will arise that offsets the increased bias in estimation.
Sparsity Sparsity in a vector corresponds to the support of the vector being limited. The sparsity can either be strict, i.e. the vector is perfectly supported on a small subset of all its entries, or the magnitudes of the entries decay sufficiently rapidly when ordered in magnitude.
Stationary process A stochastic process whose sampled points have joint characteristics that depend only on the difference between the sampled time points, and not on the global time points of the samples.
Thresholding A method of estimation where the initial estimator is set to zero if it does not exceed a given threshold. Thresholding is a special case of shrinkage.
Wavelet and scaling functions The functions that the wavelet and scaling filters converge to in increasing scale, when using discrete wavelet filters. These are not, for any finite scale, equivalent to the wavelet and scaling filters.
Wavelet and scaling filters The high- and low-pass digital filters that are used to calculate the discrete wavelet transform of a given sequence or vector. These are a pair of quadrature mirror filters that satisfy the perfect reconstruction property.

Definition of the Subject

A wavelet is a function essentially supported over a small time-window, that is oscillatory, with a local period that ranges over a limited set of values. A wavelet function can be considered jointly local in time and frequency. Wavelets have become an integral part of statistical analysis of structured phenomena. Their utility for inference is based on several of their important properties, namely compression, separation of multiscale features, and differencing of a given set of moments. The statistical analysis of processes using wavelets presents a more faceted view of the phenomenon of interest than traditional methods. The utility of this in many areas is unquestionable, and statistical applications of wavelets include areas as disparate as climatology, econometrics, finance, geophysics, mathematics, as well as signal and image processing.

Introduction

The field of wavelets in statistics can roughly be divided into two areas, namely the usage of continuous wavelets and the usage of discrete wavelets. Continuous wavelets are defined at a continuum of locations, whilst discrete wavelet filters are sequences defined only at a countable number of locations and scales. These sequences are not, strictly speaking, wavelet functions, but converge in increasing scale to such functions. Unfortunately, in the literature both functions and filters are referred to as 'wavelets'. Continuous wavelets are applied to the characterization of random processes, but are not usually used for actual signal estimation, as the transform does not possess an exact discretely implementable inverse. Continuous wavelets do not necessarily satisfy the constraint of compactness in time, and this permits more freedom in their design. A third class of wavelets that often receives separate treatment is complex-valued discrete wavelet filters. These can be thought of as bridging the gap between continuous and discrete representations. Complex wavelets are often chosen to be redundant, and have interpretable magnitude and phase; see for example the review of these in [48]. The statistical representation of structured phenomena using decompositions was first introduced by [47], who
analyzed time series using Fourier decompositions. This analysis method corresponds to decomposing a sample into cosines and sines to determine the power associated with each frequency or mode. The interpretation of the observed weight attached to each frequency depends on the model posited for the observed data. If the process is assumed to be stationary, then the observed weight can be shown to converge in expectation to the true weight, namely the Fourier transform of the autocovariance of the time series. By calculating the full set of weights across frequencies, we may obtain a picture of the frequency content of the process. Even for processes that are not stationary, decomposing the set of observations into sinusoids allows us to visualize the frequency content of the signal. As long as the process is harmonizable, see [36], such decompositions are suitable. Unfortunately, whilst the decomposition of a time-varying process into sines and cosines is interpretable in a global sense, and can be shown to possess some suitable properties, it is not as easily interpreted as might be hoped. Many processes are generated by mechanisms that fundamentally alter over time. It is therefore suitable to use decomposition functions that, unlike sines, are limited in time, but still correspond to oscillatory functions. A popular choice of such functions is wavelets. A wavelet is a time-limited oscillation, or a "little wave". The first family was constructed by Alfred Haar in 1910, see [26]. The subject lay dormant for many years, until the mid-1980s, when its usage in geophysics by Morlet revitalized interest in this set of decompositions; one of the groundbreaking new articles is [24]. A great theoretical leap forward was the construction of discrete wavelet filters, see [10]. Most of the statistical usage of wavelets has focused on discrete wavelets.
The development of statistical methods can be grouped into three main strands: i) the usage of compression for non-parametric signal estimation, following work by [15,16,17], ii) the characterization of stochastic difference stationary processes, see [38,51], and iii) the inference of locally stationary wavelet processes, see [40]. Theoretical results have also been achieved in terms of the characterization of stochastic processes, see [3,4,8,34], but these results have as yet had limited practical implications. After initial usage of the wavelet transform for signal estimation, it became obvious that despite the wavelet transform's orthogonalizing effects, relationships between coefficients must be used to derive better representations, see [58]. The development of data-driven decompositions and the usage of the 'lifting algorithm', [12], followed in the
second strand of developments. Further steps forward included the construction of two-dimensional decompositions for image estimation, where notable contributions include the ridgelet, curvelet and bandelet transforms, see [9,33,52]. Outstanding problems remain in the statistical usage of these transforms, and investigations have mainly been limited to modeling the observed phenomena as deterministic and immersed in some form of noise.

Decomposition View of Random Processes

Wavelets are used to analyze observations whose generating mechanism can be characterized in terms of the location the observation is made at. This could be time, space or some other suitable indexing. We shall denote the location of any observation by t ∈ ℝ^d, where d denotes the dimension of the indexing; this could be d = 1, as for a time series, or d = 2, as for an image. If {x(t)} is a random process then its structure can be determined from its moments. The first-order structure of x(t) is given by its mean, defined by

μ_x(t) = E{x(t)},   t ∈ ℝ^d,   (1)

and the second-order structure of x(t) is given by its covariance, namely

c_x(t1, t2) = cov{x(t1), x(t2)},   t1, t2 ∈ ℝ^d.   (2)

E{·} denotes the expectation operator, and cov{·,·} the covariance operator. If the process is assumed to be a Gaussian process, i.e. if any sample of n observations from {x(t)} is multivariate Gaussian, then {μ_x(t)} in combination with {c_x(t1, t2)} fully characterize the generation of the process {x(t)}. Often we only take n observations from the process {x(t)} and still want to fully characterize the generating mechanism, so that we can test hypotheses of interest regarding {x(t)} or just estimate some parameters of interest. To be able to do this we must make some further assumptions about x(t), and these could correspond to fully parametric or non-parametric assumptions on {μ_x(t)} or {c_x(t1, t2)}. Typical examples might be μ_x(t) = a_x + b_x t, which is a parametric specification of the mean, or μ_x(t) possessing a certain degree of smoothness. If the latter approach is taken then it is convenient to represent μ_x(t) in terms of some functions where the assumed smoothness can be easily incorporated into the modeling. Furthermore, the covariance structure can be assumed to simplify if we choose a suitable representation. If x(t) is represented in terms of linear combinations of known basis functions, such as wavelets, then Gaussian processes will be transformed into Gaussian processes with a different mean and
covariance function. For certain classes of autocovariance functions the analysis is much simplified by considering the transformed rather than the original Gaussian process.

Wavelets

A wavelet function ψ(t) is a complex-valued square integrable function satisfying the admissibility condition, see [27]. The Continuous Wavelet Transform (CWT) of a signal x(t) with respect to a wavelet ψ(t) is defined as a filtration of x(t):

w^(x)(t, s) = ∫_{−∞}^{∞} x(t′) ψ*_{t,s}(t′) dt′,   ψ_{t,s}(t′) = s^{−1/2} ψ((t′ − t)/s).   (3)

The choice of normalization, here s^{−1/2}, varies across application fields. Sometimes −1/2 is replaced by −1 if more convenient for analysis, see [54]. The former is referred to as the energy normalization, and the latter as the compression normalization. The former normalization conserves the variance of the transformed process. If {x(t)} is a stochastic process satisfying some regularity conditions, see [34], then {w^(x)(t, s)} is a doubly indexed stochastic process with mean

E{w^(x)(t, s)} = ∫_{−∞}^{∞} μ_x(t′) ψ*_{t,s}(t′) dt′   (4)
= w^(μ_x)(t, s).   (5)

The form of this expression simplifies if μ_x(t) takes the simpler form of a time-varying oscillation, see [13], or a singularity [28,55], or is a sufficiently regular function, see [56]. Under such circumstances very few of the {w^(μ_x)(t, s)} are non-zero, and inferences can be made about the model from the few non-zero coefficients. In the special case of time-varying oscillations, the theory of wavelet ridge analysis has been developed to characterize the local oscillations. Applications of such methods include mechanical vibratory signals, see [31,59], seismic signals [44,45], as well as ocean eddy time series, see [6,35]. The usage of continuous wavelets does not extend greatly beyond the aforementioned methods of signal inference. The characterization of various stochastic processes has been investigated using continuous wavelets, e.g. [3,8,25,32,34], but this has been put to little practical usage. Some interesting developments for the prediction of continuous signals using wavelets can be found in [1,2].
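The CWT integral of Eq. (3) can be approximated numerically by a Riemann sum. A minimal sketch (our own helper names; the real-valued Mexican-hat wavelet is used here because it has zero mean, so any constant signal transforms to zero):

```python
import math

# Sketch: CWT of Eq. (3) at a single (t, s), approximated by a midpoint
# Riemann sum, with the (real-valued, zero-mean) Mexican-hat wavelet.
def mexican_hat(t):
    return (1.0 - t * t) * math.exp(-t * t / 2.0)

def cwt_point(x, t, s, lo=-50.0, hi=50.0, n=100_000):
    # w^(x)(t, s) = s^{-1/2} * integral of x(u) * psi((u - t)/s) du,
    # with x a callable signal and [lo, hi] covering the wavelet support
    du = (hi - lo) / n
    acc = 0.0
    for k in range(n):
        u = lo + (k + 0.5) * du
        acc += x(u) * mexican_hat((u - t) / s)
    return acc * du / math.sqrt(s)

# A constant signal has (numerically) zero wavelet coefficients at every
# (t, s), because the wavelet integrates to zero.
w = cwt_point(lambda u: 1.0, t=0.0, s=2.0)
```

Evaluating `cwt_point` over a grid of (t, s) values gives a discretized version of the doubly indexed field discussed above.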
The main focus of wavelet usage in statistics has instead evolved around the usage of the Discrete Wavelet Transform (DWT). The DWT is defined as a reversed convolution between a discretely sampled signal and the wavelet filter. We denote the jth level wavelet filter by {h_{j,l}} and the scaling filter by {g_{j,l}}. For a complete review of the properties of these objects see [11,37,46]. These are constructed from the j = 1 level filters {h_l} and {g_l}, which satisfy the quadrature mirror relationship g_l = (−1)^{l+1} h_{L−1−l}, where L is the length of the two filters. Given a sampled process x_n = x(nΔt), the wavelet coefficients are defined by:

w^(x)_{j,l} = Σ_n h_{j,l−n} x_n,   h_{j,l} = Σ_{k=0}^{L−1} h_k g_{j−1, l−2^{j−1}k},   (6)

where the wavelet filters can be determined iteratively given {g_{j−1,l}}. To complement the wavelet coefficients of x_n, additionally the scaling coefficients are computed:

v^(x)_{j,l} = Σ_n g_{j,l−n} x_n,   g_{j,l} = Σ_{k=0}^{L−1} g_k g_{j−1, l−2^{j−1}k},   (7)

this completing the iterative specification of the wavelet and scaling filters. If we collect the wavelet and scaling coefficients in a vector W^(x) = [w^(x)_{1,0}, ..., w^(x)_{1,N/2−1}, w^(x)_{2,0}, ..., v^(x)_{J,0}], then the DWT can be represented as:

W^(x) = W X,   (8)

where the matrix W is composed of the h_{j,l} and g_{J,l}, see [46]. Equation (8) is not the method used to compute the DWT; popularly the transform is implemented using Mallat's pyramid algorithm, see [37], but it enables the easy determination of the stochastic properties of the wavelet transform. The CWT and the DWT are not two disparate objects: if we possess a sequence of coefficients {v^(x)_{0,l}} such that

x(t) = Σ_{l=−∞}^{∞} v^(x)_{0,l} φ_{0,l}(t),   (9)

then {w^(x)_{j,l}} can be calculated from {v^(x)_{0,l}}, as can {v^(x)_{j,l}}, and these correspond to the CWT as well as scaling coefficients of the continuous rather than the sampled process. In practice x_n is equated to v^(x)_{0,n}, even if this is formally inappropriate.
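One step of the pyramid algorithm mentioned above can be sketched with the Haar filter pair (our own function names; the indexing convention w_{1,k} = Σ_l h_l x_{(2k+1−l) mod N} is one common choice, and the scaling filter follows from the quadrature mirror relationship):

```python
import math

# Sketch: one filtering-and-downsampling step of the DWT with the Haar
# wavelet filter; g is derived via g_l = (-1)^(l+1) h_{L-1-l}.
h = [1 / math.sqrt(2), -1 / math.sqrt(2)]               # Haar wavelet filter
L = len(h)
g = [(-1) ** (l + 1) * h[L - 1 - l] for l in range(L)]  # scaling filter

def dwt_step(x):
    """One pyramid step: circular filtering, then downsampling by two."""
    n = len(x)
    w = [sum(h[l] * x[(2 * k + 1 - l) % n] for l in range(L))
         for k in range(n // 2)]                        # wavelet coefficients
    v = [sum(g[l] * x[(2 * k + 1 - l) % n] for l in range(L))
         for k in range(n // 2)]                        # scaling coefficients
    return w, v

w1, v1 = dwt_step([4.0, 6.0, 10.0, 12.0])
# orthonormality preserves energy: sum(w1^2) + sum(v1^2) = sum(x^2)
```

Iterating `dwt_step` on the scaling coefficients `v1` reproduces the cascade structure of Mallat's pyramid algorithm.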
Signal Estimation or Denoising

The most popular statistical application of wavelets is denoising, or function estimation. This proceeds from a model of the observed signal of:

x_n = μ(nΔt) + ε_n,   n = 0, ..., N − 1,   (10)

where Δt is referred to as the sampling period of the signal, and the sample size is N. To fully specify the generation of the observations, {ε_n} must be modeled, and a popular model is to assume this sequence is zero-mean, uncorrelated (white) and Gaussian, with some constant variance σ², which needs to be determined to implement estimation. To see a typical realization of such a signal, see Fig. 1. This signal, 'Heavisine', is famous as one of the four typical signals introduced by [16]. Heavisine has been immersed in noise, see Fig. 1. This signal is not trivial to smooth, as it possesses varying degrees of smoothness across the realization; compare for example the discontinuities with the smooth portions of the signal. One important inference problem in this setting is to estimate {μ(nΔt)} with low expected square deviation, i.e. to choose an estimator μ̂(nΔt) such that

MSE(μ̂, μ) = Σ_{n=0}^{N−1} E{(μ(nΔt) − μ̂(nΔt))²}   (11)
is as small as possible. Common methods of estimating $\mu(n\,\Delta t)$ include many different adaptive linear estimation methods, such as kernel smoothers, smoothing splines, truncated Fourier methods, etc. Smoothing using the DWT was introduced by [15,16], and has enjoyed an incredible success. The simplest form of wavelet estimation is based on the compression of the DWT of x(t) alone, combined with the uniformity of the white process across time. The wavelet coefficients of $\mu(\cdot)$ are given as $\{w^{(\mu)}_{j,l}\}$, and together with the scaling coefficients, $\{v^{(\mu)}_{j,l}\}$, these are sufficient to reconstruct $\mu(n\,\Delta t)$ via calculating the inverse DWT (IDWT), or:

$$\begin{pmatrix} \mu(0\,\Delta t) \\ \vdots \\ \mu((N-1)\,\Delta t) \end{pmatrix} = \mathcal{W}^{-1} \begin{pmatrix} w^{(\mu)}_{1,0} \\ \vdots \\ w^{(\mu)}_{1,N/2-1} \\ w^{(\mu)}_{2,0} \\ \vdots \\ w^{(\mu)}_{2,N/4-1} \\ \vdots \\ v^{(\mu)}_{J,0} \end{pmatrix} . \tag{12}$$

Thus a good estimator of $\{w^{(\mu)}_{j,l}\}$ can easily be converted into a good estimator of $\mu(n\,\Delta t)$ using Eq. (12). To see the motivation behind the most commonly adopted choice of estimation for the wavelet coefficients, observe the wavelet coefficients of the Heavisine signal of Fig. 1, plotted in Fig. 3. Most of the wavelet coefficients are quite small. If a given signal $\mu(\cdot)$ is sufficiently regular then it can be shown that $\{w^{(\mu)}_{j,l}\}$ must decrease in magnitude. For this reason one may argue that most of this set are zero, and the transform exhibits compression or sparsity, see [56]. We are therefore trying to estimate a sequence of variables of which most are zero, and only a few are large. It is not optimal to equate the estimator to the sample DWT; instead shrinkage estimation is suitable. Consider the generic setting of an estimator for w denoted by $\hat{w}$, which is unbiased but potentially has large variance. A shrinkage estimator of this quantity is given by $\hat{w}^{(s)} = c\,\hat{w}$ for some $0 \le c \le 1$. Clearly this estimator has smaller (or possibly equal) variance compared to $\hat{w}$, but introduces some bias into the estimation. The mean square error is in fact given by:

$$\mathrm{MSE}(\hat{w}^{(s)}, w) = c^2 \sigma^2 + (1-c)^2 w^2 \,. \tag{13}$$
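Formula (13) trades variance for bias. A quick Monte Carlo check (Python; the particular values of w, σ and c are arbitrary choices for illustration):

```python
import numpy as np

# Check Eq. (13): for the shrinkage estimator c*w_hat with w_hat = w + noise,
# MSE = c^2 * sigma^2 + (1 - c)^2 * w^2.
rng = np.random.default_rng(0)
w, sigma, c = 0.5, 1.0, 0.3
w_hat = w + sigma * rng.standard_normal(200_000)  # unbiased, variance sigma^2
empirical = float(np.mean((c * w_hat - w) ** 2))
theoretical = c**2 * sigma**2 + (1 - c) ** 2 * w**2
print(empirical, theoretical)  # both close to 0.2125
```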
Thus if the quantity we are trying to estimate is small compared to the variance of the empirical estimator $\hat{w}$, then by choosing $|c|$ smaller than one we reduce the mean square error in estimation. Thus for $|w|$ taking a range of values it would be better to use the shrinkage estimator. This may seem non-intuitive, but corresponds to a whole set of theory due to [53]. If there were an oracle that could, for a given shrinkage rule, inform the estimator how coefficients should be shrunk, then good estimators would be obtained. Fortunately it is not necessary to have an oracle: for large sample sizes one may expect to obtain an estimator with performance equivalent to using an oracle. Looking at Fig. 2 and Fig. 3 we certainly see that replacing most of the empirical wavelet coefficients by zero should greatly improve the estimation of these coefficients. Two special shrinkage functions are commonly used, namely the soft and hard threshold functions. The soft threshold estimator modifies the empirical estimator $\hat{w}$ by

$$\hat{w}^{(st)} = \mathrm{sign}(\hat{w})\left(|\hat{w}| - \delta\right)_+ = \begin{cases} \hat{w} - \delta & \hat{w} > \delta \,, \\ 0 & |\hat{w}| \le \delta \,, \\ \hat{w} + \delta & \hat{w} < -\delta \,, \end{cases}$$

for a given threshold $\delta > 0$.

A strategy profile is a uniform ε-equilibrium if there are $T_0 \in \mathbb{N}$ and $\lambda_0 \in (0,1)$ such that for every $T \ge T_0$ the strategy profile is a T-stage ε-equilibrium, and for every $\lambda \in (0, \lambda_0)$ it is a λ-discounted ε-equilibrium. If for every ε > 0 the game has a (T-stage, λ-discounted, limsup or uniform) ε-equilibrium with corresponding payoff $g_\varepsilon$, then any accumulation point of $(g_\varepsilon)_{\varepsilon > 0}$ as ε goes to 0 is a (T-stage, λ-discounted, limsup or uniform) equilibrium payoff.

Zero-Sum Games

A two-player stochastic game is zero-sum if $u_1(s,a) + u_2(s,a) = 0$ for every $(s,a) \in S \times A$. As in matrix games, every two-player zero-sum stochastic game admits at most one equilibrium payoff at every initial state $s_1$, which is termed the value of the game at $s_1$. Each player's strategy which is part of an ε-equilibrium is termed ε-optimal. The definition of ε-equilibrium implies that an ε-optimal strategy guarantees the value up to ε; for example, in the T-stage evaluation, if $\sigma_1$ is an ε-optimal strategy of player 1, then for every strategy $\sigma_2$ of player 2 we have

$$\gamma^1_T(s_1; \sigma_1, \sigma_2) \ge v_T(s_1) - \varepsilon \,,$$

where $v_T(s_1)$ is the T-stage value at $s_1$. In his seminal work, Shapley [60] presented the model of two-player zero-sum stochastic games with finite state and action spaces, and proved the following.
Theorem 8 [60] For every two-player zero-sum stochastic game, the λ-discounted value at every initial state exists. Moreover, both players have λ-discounted 0-optimal stationary strategies.

Proof Let V be the space of all functions $v : S \to \mathbb{R}$. For every $v \in V$ define a zero-sum matrix game $G_s(v)$ as follows: The action spaces of the two players are $A_1(s)$ and $A_2(s)$ respectively. The payoff function (that player 2 pays player 1) is

$$\lambda\, u_1(s,a) + (1-\lambda) \sum_{s' \in S} q(s' \mid s, a)\, v(s') \,.$$

The game $G_s(v)$ captures the situation in which, after the first stage, the game terminates with a terminal payoff $v(s')$, where $s'$ is the state reached after stage 1. Define an operator $\varphi : V \to V$ by $\varphi_s(v) = \mathrm{val}(G_s(v))$, where $\mathrm{val}(G_s(v))$ is the value of the matrix game $G_s(v)$. Since the value operator is non-expansive, it follows that the operator φ is contracting: $\|\varphi(v) - \varphi(w)\|_\infty \le (1-\lambda)\|v - w\|_\infty$, so that this operator has a unique fixed point $\hat{v}$. One can show that the fixed point is the value of the stochastic game, and that every strategy $\sigma_i$ of player i in which he plays, after each finite history $(s^1, a^1, s^2, a^2, \ldots, s^t)$, an optimal mixed action in the matrix game $G_{s^t}(\hat{v})$, is a λ-discounted 0-optimal strategy in the stochastic game.

Example 9 Consider the following two-player zero-sum game with three states, s0, s1 and s2; each entry of the matrix indicates the payoff that player 2 (the column player) pays player 1 (the row player), followed by the transition, which is deterministic.

State s2:
          L          R
  T   0 (→ s2)   1 (→ s1)
  B   1 (→ s1)   0 (→ s0)

State s1:
          L
  T   1 (→ s1)

State s0:
          L
  T   0 (→ s0)

The states s0 and s1 are absorbing: once the play reaches one of these states it never leaves it. State s2 is non-absorbing. Stochastic games with a single non-absorbing state are called absorbing games. For every $v = (v_0, v_1, v_2) \in V = \mathbb{R}^3$ the game $G_{s_2}(v)$ is the following
matrix game:

              L                  R
  T   (1−λ)v2              λ + (1−λ)v1
  B   λ + (1−λ)v1          (1−λ)v0

The game $G_{s_1}(v)$ is the 1×1 game with the single entry λ + (1−λ)v1, and the game $G_{s_0}(v)$ is the 1×1 game with the single entry (1−λ)v0.

The unique fixed point $\hat{v}$ of the operator $v \mapsto \mathrm{val}(G_\cdot(v))$ must satisfy $\hat{v}_0 = \mathrm{val}(G_{s_0}(\hat{v}))$, so that $\hat{v}_0 = 0$; $\hat{v}_1 = \mathrm{val}(G_{s_1}(\hat{v}))$, so that $\hat{v}_1 = 1$; and $\hat{v}_2 = \mathrm{val}(G_{s_2}(\hat{v}))$. By Theorem 8 both players have a stationary λ-discounted 0-optimal strategy. Denote by x (resp. y) a mixed action for player 1 (resp. player 2) that is part of a λ-discounted 0-optimal strategy at the state s2. Since we know that in the fixed point $\hat{v}_0 = 0$ and $\hat{v}_1 = 1$, $\hat{v}_2$ must be the unique solution of

$$\hat{v}_2 = y(1-\lambda)\hat{v}_2 + (1-y) = y \,,$$

so that $\hat{v}_2 = (1-\sqrt{\lambda})/(1-\lambda)$. The 0-optimal strategy of player 2 at state s2 is $y = \hat{v}_2 = (1-\sqrt{\lambda})/(1-\lambda)$, and the 0-optimal strategy of player 1, $x = \hat{v}_2 = (1-\sqrt{\lambda})/(1-\lambda)$, can be found by finding his 0-optimal strategy in $G_{s_2}(\hat{v})$.

Bewley and Kohlberg [11] proved that when the state and action spaces are finite, the function $\lambda \mapsto v_s^\lambda$, which assigns to every state s and every discount factor λ the λ-discounted value at the initial state s, is a Puiseux function; that is, it has a representation

$$v_s^\lambda = \sum_{k=K}^{\infty} a_k \lambda^{k/M}$$

that is valid for every $\lambda \in (0, \lambda_0)$ for some $\lambda_0 > 0$, where M is a natural number, K is a non-negative integer, and $(a_k)_{k=K}^{\infty}$ are real numbers. In particular, the function $\lambda \mapsto v_s^\lambda$ is monotone in a neighborhood of 0, and its limit as λ goes to 0 exists. This result turned out to be crucial in subsequent study of games with finitely many states and actions.

Shapley's work has been extended to general state and action spaces; for a recent survey see [46]. The tools developed in [46], together with a dynamic programming argument, prove that under proper conditions on the payoff function and on the transitions the two-player zero-sum stochastic game has a T-stage value. Maitra and Sudderth [35] proved that the limsup value exists in a very general setup. Their proof follows closely that of Martin [36] for the determinacy of Blackwell games. The study of the uniform value emanated from an example, called the "Big Match", due to Gillette [28], that was solved by Blackwell and Ferguson [13].

Example 10 Consider the following stochastic game with two absorbing states and one non-absorbing state:
State s2:
          L          R
  T   0 (→ s2)   1 (→ s2)
  B   1 (→ s1)   0 (→ s0)

State s1:
          L
  T   1 (→ s1)

State s0:
          L
  T   0 (→ s0)
Suppose the initial state is s2 . As long as player 1 plays T the play remains at s2 ; once he plays B the play moves to either s0 or s1 , and is effectively terminated. By finding the fixed point of the operator ' one can show that the discounted value at the initial state s2 is 12 , and a -discounted stationary 0-optimal strategy for player 2 is1 [ 12 (L); 12 (R)]. Indeed, if player 1 plays T then the expected stage payoff is 12 and play remains at s2 , while if player 1 plays B then the game moves to an absorbing state, and the expected stage payoff from that stage onwards is 12 . In particular, this strategy guarantees 12 for player 2 both in the limsup evaluation and uniformly. A -discounted 1 0-optimal strategy for player 1 is [ 1C (T); 1C (B)]. What can player 1 guarantee in the limsup evaluation and uniformly? If player 1 plays the stationary strategy [x(T); (1 x)(B)] that plays at each stage the action T with probability x and the action B with probability 1 x, then player 2 has a reply that ensures that the limsup payoff is 0: If x D 1 and player 2 always plays L, the payoff is 0 at each stage; if x < 1 and player 2 always plays R, the payoff is 1 until the play moves to s0 , and then it is 0 forever. Since player 1 plays the action B with probability 1 x > 0 at each stage, the distribution of the stage in which play moves to s0 is geometric. Therefore, the limsup payoff is 0, and if is sufficiently small, the discounted payoff is close to 0. One can verify that if player 1 uses a bounded-recall strategy, that is, a strategy that uses only the last k actions that were played, player 2 has a reply that guarantees that the limsup payoff is 0, and the discounted payoff is close to 0, provided is close to 0. Thus, in the limsup payoff and uniformly finite memory cannot guarantee more than 0 in this game (see also [27]). 
Intuitively, player 1 would like to condition the probability of playing T on the past behavior of player 2: if in the past player 2 played the action L more often than the action R, he would have liked to play B with higher probability; if in the past player 2 played the action R more often than the action L, he would have liked to play T with higher probability. Blackwell and Ferguson [13] constructed a family of good strategies $\{\sigma_1^M, M \in \mathbb{N}\}$ for player 1. The parameter M determines the amount that the strategy guarantees: The strategy $\sigma_1^M$ guarantees a limsup payoff and a λ-discounted payoff of M/(2M+1), provided the discount factor is sufficiently low. In other words, player 1 cannot guarantee 1/2, but he may guarantee an amount as close to 1/2 as he wishes by choosing M to be sufficiently large. The strategy $\sigma_1^M$ is defined as follows: At stage t, play B with probability $1/(M + r_t - l_t)^2$, where $l_t$ is the number of stages up to stage t in which player 2 played L, and $r_t$ is the number of stages up to stage t in which player 2 played R. Since $r_t + l_t = t-1$ one has $r_t - l_t = 2r_t - (t-1)$. The quantity $r_t$ is the total payoff that player 1 received in the first t−1 stages if player 1 played T in those stages (and the game was not absorbed). Thus, this total payoff is a linear function of the difference $r_t - l_t$. When presented this way, the strategy $\sigma_1^M$ depends on that total payoff. Observe that as $r_t$ increases, $r_t - l_t$ increases as well, and the probability to play B decreases.

Mertens and Neyman [38] generalized the idea presented at the end of Example 10 to stochastic games with finite state and action spaces.²

Theorem 11 If the state and action spaces of a two-player zero-sum stochastic game are finite, the game has a uniform value $v_s^0$ at every initial state s ∈ S. Moreover, $v_s^0 = \lim_{\lambda \to 0} v_s^\lambda = \lim_{T \to \infty} v_s^T$.

In their proof, Mertens and Neyman describe a uniform ε-optimal strategy. In this strategy the player keeps a parameter, $\lambda_t$, which is a fictitious discount factor to use at stage t.

¹ That is, at each stage player 2 plays L with probability 1/2 and R with probability 1/2.
This parameter changes at each stage as a function of the stage payoff; if the stage payoff at stage t is high then $\lambda_{t+1} < \lambda_t$, whereas if the stage payoff at stage t is low then $\lambda_{t+1} > \lambda_t$. The intuition is as follows. As mentioned before, in stochastic games there are two forces that influence the player's behavior: He tries to get high stage payoffs, while keeping future prospects high (by playing in such a way that the next stage that is reached is favorable). When considering the λ-discounted payoff there is a clear comparison between the importance of the two forces: The weight of the stage payoff is λ and the weight of future prospects is 1−λ; the lower the discount factor, the more weight is given to the future. When considering the uniform value (or the uniform equilibrium) the weight of the stage payoff is 0. However, if the player never attempts to receive a high stage payoff, the overall payoff in the game will not be high. Therefore, the player has a fictitious discount factor; if past payoffs are low and do not meet the expectation, player 1 increases the weight of the stage payoff by increasing the fictitious discount factor; if past payoffs are high, player 1 increases the weight of the future by lowering this fictitious discount factor.

² Mertens and Neyman's [38] result actually holds in every stochastic game that satisfies a proper condition, which is always satisfied when the state and action spaces are finite.

Multi-Player Games

Takahashi [75] and Fink [21] extended Shapley's [60] result to discounted equilibria in non-zero-sum games.

Theorem 12 Every stochastic game with finite state and action spaces has a λ-discounted equilibrium in stationary strategies.

Proof The proof utilizes Kakutani's fixed point theorem [31]. Let $M = \max_{i,s,a} |u_i(s,a)|$ be a bound on the absolute values of the payoffs. Set

$$X = \times_{i \in N,\, s \in S} \left( \Delta(A_i(s)) \times [-M, M] \right).$$

A point $x = (x^A_{i,s}, x^V_{i,s})_{i \in N, s \in S} \in X$ is a collection of one mixed action and one payoff for each player at every state. For every $v = (v_i)_{i \in N} \in [-M, M]^{N \times S}$ and every s ∈ S define a matrix game $G_s(v)$ as follows: The action space of each player i is $A_i(s)$; the payoff to player i is

$$\lambda\, u_i(s,a) + (1-\lambda) \sum_{s' \in S} q(s' \mid s, a)\, v_i(s') \,.$$
We define a set-valued function $\varphi : X \to X$ as follows. For every i ∈ N and every s ∈ S, $\varphi^A_{i,s}(x,v)$ is the set of all best responses of player i to the strategy vector $x_{-i,s} := (x_{j,s})_{j \ne i}$ in the game $G_s(v)$. That is,

$$\varphi^A_{i,s}(x,v) := \operatorname*{argmax}_{y_{i,s} \in \Delta(A_i(s))} \left\{ \lambda\, u_i(s; y_{i,s}, x_{-i,s}) + (1-\lambda) \sum_{s' \in S} q(s' \mid s; y_{i,s}, x_{-i,s})\, v_{i,s'} \right\} .$$

For every i ∈ N and every s ∈ S, $\varphi^V_{i,s}(x,v)$ is the maximal payoff for player i in the game $G_s(v)$, when the other players play $x_{-i}$:

$$\varphi^V_{i,s}(x,v) := \max_{y_{i,s} \in \Delta(A_i(s))} \left\{ \lambda\, u_i(s; y_{i,s}, x_{-i,s}) + (1-\lambda) \sum_{s' \in S} q(s' \mid s; y_{i,s}, x_{-i,s})\, v_{i,s'} \right\} .$$
The set-valued function φ has convex and non-empty values and its graph is closed, so that by Kakutani's fixed point theorem it has a fixed point. It turns out that every fixed point of φ defines a λ-discounted equilibrium in stationary strategies.

This result has been extended to general state and action spaces by various authors. These results assume a strong continuity condition on either the payoff function or the transition function, see, e.g., [39,44,62]. As in the case of zero-sum games, a dynamic programming argument shows that under a strong continuity assumption on the payoff function or on the transitions a T-stage equilibrium exists. Regarding the existence of the limsup equilibrium and the uniform equilibrium little is known. The most significant result in this direction is that of Vieille [77,78], who proved that every two-player stochastic game with finite state and action spaces has a uniform ε-equilibrium for every ε > 0. This result has been proved for other classes of stochastic games, see, e.g., [6,24,25,61,63,65,76]. Several influential works in this area are [22,32,71,79]. Most of the papers mentioned above rely on the vanishing discount factor approach, which constructs a uniform ε-equilibrium by studying a sequence of λ-discounted equilibria as the discount factor goes to 0. For games with general state and action spaces, a limsup equilibrium exists under an ergodicity assumption on the transitions, see e.g. Nowak [44], Remark 4, and Jaskiewicz and Nowak [30]. A game has perfect information if there are no simultaneous moves, and both players observe past play. Existence of an equilibrium in this case was proven by Mertens [37] in a very general setup.

Correlated Equilibrium

The notion of correlated equilibrium was introduced by Aumann [8,9]; see also Forges, Correlated Equilibria and Communication in Games.
A correlated equilibrium is an equilibrium of an extended game, in which each player receives at the outset of the game a private signal, such that the vector of signals is chosen according to a known joint probability distribution. In repeated interactions, such as stochastic games, there are two natural notions of correlated equilibrium: (a) each player receives one signal at the outset of the game (normal-form correlated equilibrium); (b) each player receives a signal at each stage (extensive-form correlated equilibrium). It follows from Forges [26] that when the state and action sets are finite, the set of all correlated T-stage equilibrium payoffs (either normal-form or extensive-form) is a polytope.
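The polytope structure makes correlated equilibrium payoffs computable by linear programming: the variables are the joint distribution over action profiles, and the incentive constraints are linear. A hedged sketch for a one-shot game (Python with scipy; the "Chicken" payoffs are Aumann's classic illustration, not a game from this entry, and stochastic-game versions would add a signal at each stage):

```python
import numpy as np
from scipy.optimize import linprog

# Aumann's "Chicken": actions C, D; u1[a1][a2] is player 1's payoff.
u1 = np.array([[6.0, 2.0], [7.0, 0.0]])
u2 = u1.T  # the game is symmetric

idx = lambda a1, a2: 2 * a1 + a2      # flatten the joint distribution p
c = -(u1 + u2).flatten()              # maximize total expected payoff
A_ub, b_ub = [], []
for u, player in ((u1, 1), (u2, 2)):
    for a in range(2):                # recommended action
        for dev in range(2):          # possible deviation
            if dev == a:
                continue
            row = np.zeros(4)
            for other in range(2):
                if player == 1:
                    row[idx(a, other)] = u[dev, other] - u[a, other]
                else:
                    row[idx(other, a)] = u[other, dev] - u[other, a]
            A_ub.append(row)
            b_ub.append(0.0)          # expected gain from deviating <= 0
res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
              A_eq=np.ones((1, 4)), b_eq=[1.0],
              bounds=[(0, None)] * 4, method="highs")
print(-res.fun)  # maximal total payoff over correlated equilibria: 10.5
```

The optimum 10.5 exceeds the best total payoff of any Nash equilibrium of this game, which is the usual illustration of the extra power of correlation.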
Nowak and Raghavan [47] proved the existence of an extensive-form correlated discounted equilibrium under weak conditions on the state and action spaces. In their construction, the strategies of the players are stationary, and so is the distribution according to which the signals are chosen after each history; both depend only on the current state, rather than on the whole past play. Roughly, their approach is to apply Kakutani's fixed point theorem to the set-valued function that assigns to each game $G_s(v)$ the set of all correlated equilibrium payoffs in this game, which is convex and compact. Solan and Vieille [66] proved the existence of an extensive-form correlated uniform equilibrium payoff when the state and action spaces are finite. Their approach is to let each player play his uniform optimal strategy in a zero-sum game in which all other players try to minimize his payoff. Existence of a normal-form correlated equilibrium was proved for the class of absorbing games [68]. Solan [64] characterized the set of extensive-form correlated equilibrium payoffs for general state and action spaces and a general evaluation of the stage payoffs, and provided a sufficient condition that ensures that the set of normal-form correlated equilibrium payoffs coincides with the set of extensive-form correlated equilibrium payoffs.

Imperfect Monitoring

So far it has been assumed that at each stage the players know the past play. There are cases in which this assumption is too strong; in some cases players do not know the complete description of the current state (Examples 3 and 4), and in others players do not fully observe the actions of all other players (Examples 2, 3 and 4). For a most general description of stochastic games see Chapter IV in Mertens, Sorin and Zamir [40], and Coulomb [17].
In the study of the discounted equilibrium, the T-stage equilibrium or the limsup equilibrium, one may consider the game as a one-shot game: The players simultaneously choose strategies, and the payoff is either the discounted payoff, the T-stage payoff or the limsup payoff. If the strategy spaces of the players are compact (e.g., if the state and action spaces are finite), and if the payoff is upper-semicontinuous in each player's strategy, keeping the strategies of the other players fixed, then an equilibrium exists. This approach can be used successfully for the discounted equilibrium or the T-stage equilibrium under weak conditions (see, e.g., [4]), and may be used for the limsup equilibrium under a proper ergodicity condition. Whenever there exists an equilibrium in stationary strategies (e.g., a discounted equilibrium in games with finitely many states and actions), the only information that players need in order to follow the equilibrium strategies is the current state. In particular, they need not observe the past actions of the other players. As we now show, in the "Big Match" (Example 10) the limsup value and the uniform value may fail to exist when each player does not observe the past actions of the other player.

Example 13 (Example 10, continued) Assume that no player observes the actions of the other player, and assume that the initial state is s2. Player 2 can still guarantee 1/2 in the limsup evaluation by playing the stationary strategy [½(L), ½(R)]. One can show that for every strategy of player 2, player 1 has a reply such that the limsup payoff is at least 1/2. In other words, $\inf_{\sigma_2} \sup_{\sigma_1} \bar{\gamma}^1(s_2; \sigma_1, \sigma_2) = \frac{1}{2}$. We now argue that $\sup_{\sigma_1} \inf_{\sigma_2} \bar{\gamma}^1(s_2; \sigma_1, \sigma_2) = 0$. Indeed, fix a strategy $\sigma_1$ for player 1, and ε > 0. Let τ be sufficiently large such that the probability that under $\sigma_1$ player 1 plays B for the first time after stage τ is at most ε. Observe that as t increases, the probability that player 1 plays B for the first time after stage t decreases to 0, so that such a τ exists. Consider the following strategy $\sigma_2$ of player 2: Play R up to stage τ, and play L from stage τ+1 on. By the definition of τ, either player 1 plays B before or at stage τ, and then the game moves to s0 and the payoff is 0 at each stage thereafter; or player 1 plays T at each stage, and then the stage payoff after stage τ is 0; or, with probability less than ε, player 1 plays B for the first time after stage τ, the play moves to s1, and the payoff is 1 thereafter. Thus, the limsup payoff is at most ε. A similar analysis shows that

$$\inf_{\sigma_2} \sup_{\sigma_1} \lim_{\lambda \to 0} \gamma^1_\lambda(s_2; \sigma_1, \sigma_2) = \tfrac{1}{2} \,, \qquad \sup_{\sigma_1} \inf_{\sigma_2} \lim_{\lambda \to 0} \gamma^1_\lambda(s_2; \sigma_1, \sigma_2) = 0 \,,$$

so that the uniform value does not exist either. This example shows that in general the limsup value and the uniform value need not exist when the players do not observe past play. Though in general the value (and therefore also an equilibrium) need not exist, in many classes of stochastic games the value and an equilibrium do exist, even in the presence of imperfect monitoring. Rosenberg et al. [55] and Renault [54] showed that the uniform value exists in the one-player setup (a Markov decision problem) in which the player receives partial information regarding the current state. Thus, a single decision maker who faces a dynamic situation and does not fully observe the state of the environment can play in such a way that guarantees a high payoff, provided the interaction is sufficiently long or the discount factor is sufficiently low. Altman et al. [5,6] and Flesch et al. [24] studied stochastic games in which each player has a "private" state,
which only he can observe, and the state of the world is composed of the vector of private states. Altman et al. [5,6] studied the situation in which players do not observe the actions of the other players, and Flesch et al. [24] studied the situation in which players do observe each other's payoffs. Such games arise naturally in wireless communication (see [5]); take for example several mobiles that periodically send information to a base station. The private state of a mobile may depend, e.g., on its exact physical environment, and it determines the power attenuation between the mobile and the base station. The throughput (the amount of bits per second) that a mobile can send to the base station depends on the power attenuations of all the mobiles. Finally, the stage payoff is the stage power consumption. Rosenberg et al. [57] studied the extreme case of two-player zero-sum games in which the players observe neither the current state nor the action of the other player, and proved that the uniform value does exist in two classes of games, which capture the behavior of certain communication protocols. Classes of games in which the actions are observed but the state is not observed were studied, e.g., by Sorin [69,70], Sorin and Zamir [74], Krausz and Rieder [33], Flesch et al. [23], Rosenberg et al. [56], and Renault [52,53]. For additional results, see [16,73].

Algorithms

There are two kinds of algorithms: those that terminate in a finite number of steps, and those that iterate and approximate solutions. Both kinds of algorithms were devised to calculate the value and optimal strategies (or equilibria) in stochastic games. It is well known that the value of a two-player zero-sum matrix game, and optimal strategies for the two players, can be calculated efficiently using a linear program.
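A sketch of that linear program (Python with scipy; the game matrix is an arbitrary 2×2 example of my own, and the encoding follows the standard "maximize the guaranteed payoff v" formulation):

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(A):
    """Value and an optimal mixed strategy for the row player (maximizer)
    of the zero-sum matrix game A, via linear programming."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    # Variables z = (x_1, ..., x_m, v); maximize v <=> minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every column j:  v - sum_i x_i * A[i, j] <= 0.
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.ones((1, m + 1))
    A_eq[0, -1] = 0.0                 # the mixed strategy x sums to 1
    bounds = [(0, None)] * m + [(None, None)]   # v is a free variable
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=bounds, method="highs")
    return res.x[-1], res.x[:m]

value, x = matrix_game_value([[0, 1], [1, 0]])
print(value, x)  # value 0.5 with the optimal strategy (0.5, 0.5)
```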
Equilibria in two-player non-zero-sum games can be calculated by the Lemke–Howson algorithm, which is usually efficient; however, its worst-case running time is exponential in the number of pure strategies of the players [59]. Unfortunately, to date there are no efficient algorithms to calculate either the value in zero-sum stochastic games, or equilibria in non-zero-sum stochastic games. Moreover, as Example 9 shows, the λ-discounted value may be irrational for rational discount factors, even though the data of the game (payoffs and transitions) are rational, so it is not clear whether linear programming methods can be used to calculate the value of a stochastic game. Nevertheless, linear programming methods were used to calculate the discounted and uniform value of several classes of stochastic games, see [20,50]. Other methods that were used to
calculate the value or equilibria in discounted stochastic games include fictitious play [80], value iterates, policy improvement, general methods to find the maximum of a function (see [20,51]), a homotopy method [29], and algorithms to solve sentences in formal logic [15,67].

Additional and Future Directions

The research on stochastic games extends to directions beyond those mentioned in earlier sections. We mention a few here. Approximation of games with infinite state and action spaces by finite games was discussed by Whitt [81], and further developed by Nowak [43]. Stochastic games in continuous time have also been studied, as well as hybrid models that include both discrete and continuous aspects, see, e.g., [2,10].

Among the many directions of future research in this area, we will mention here but a few. One challenging question is the existence of a uniform equilibrium and a limsup equilibrium in multi-player stochastic games with finite state and action spaces. Another is the development of efficient algorithms that calculate the value of two-player zero-sum games. A third direction concerns the identification of applications that can be recast in the framework of stochastic games, and that can be successfully analyzed using the theoretical tools that the literature has developed. Another problem of interest is the characterization of approachable and excludable sets in stochastic games with vector payoffs (see [12] for the presentation of matrix games with vector payoffs, and [41] for partial results regarding this problem).

Acknowledgments

I thank Eitan Altman, János Flesch, Yuval Heller, Jean-Jacques Herings, Ayala Mashiach-Yakovi, Andrzej Nowak, Ronald Peeters, T.E.S. Raghavan, Jérôme Renault, Nahum Shimkin, Robert Simon, Sylvain Sorin, William Sudderth, and Frank Thuijsman for their comments on an earlier version of the entry.

Bibliography

Primary Literature

1. Altman E (2005) Applications of dynamic games in queues. Adv Dyn Games 7:309–342
2. Altman E, Gaitsgory VA (1995) A hybrid (differential-stochastic) zero-sum game with fast stochastic part. Ann Int Soc Dyn Games 3:47–59
4. Altman E, Solan E (2007) Games with constraints with networking applications. Preprint
5. Altman E, Avrachenkov K, Marquez R, Miller G (2005) Zero-sum constrained stochastic games with independent state processes. Math Methods Oper Res 62:375–386
6. Altman E, Avrachenkov K, Bonneau N, Debbah M, El-Azouzi R, Sadoc Menasche D (2008) Constrained cost-coupled stochastic games with independent state processes. Oper Res Lett 36:160–164
7. Amir R (1996) Continuous stochastic games of capital accumulation with convex transitions. Games Econ Behav 15:111–131
8. Aumann RJ (1974) Subjectivity and correlation in randomized strategies. J Math Econ 1:67–96
9. Aumann RJ (1987) Correlated equilibrium as an expression of Bayesian rationality. Econometrica 55:1–18
10. Başar T, Olsder GJ (1995) Dynamic noncooperative game theory. Academic Press, New York
11. Bewley T, Kohlberg E (1976) The asymptotic theory of stochastic games. Math Oper Res 1:197–208
12. Blackwell D (1956) An analog of the minimax theorem for vector payoffs. Pac J Math 6:1–8
13. Blackwell D, Ferguson TS (1968) The big match. Ann Math Stat 39:159–163
14. Chari V, Kehoe P (1990) Sustainable plans. J Political Econ 98:783–802
15. Chatterjee K, Majumdar R, Henzinger TA (2008) Stochastic limit-average games are in EXPTIME. Int J Game Theory 37:219–234
16. Coulomb JM (2003) Absorbing games with a signalling structure. In: Neyman A, Sorin S (eds) Stochastic games and applications. NATO Science Series. Kluwer, Dordrecht, pp 335–355
17. Coulomb JM (2003) Games with a recursive structure. In: Neyman A, Sorin S (eds) Stochastic games and applications. NATO Science Series. Kluwer, Dordrecht, pp 427–442
18. Dutta P, Sundaram RK (1992) Markovian equilibrium in a class of stochastic games: Existence theorems for discounted and undiscounted models. Econ Theory 2:197–214
19. Dutta P, Sundaram RK (1993) The tragedy of the commons? Econ Theory 3:413–426
20. Filar JA, Vrieze K (1996) Competitive Markov decision processes. Springer
21. Fink AM (1964) Equilibrium in a stochastic n-person game. J Sci Hiroshima Univ 28:89–93
22. Flesch J, Thuijsman F, Vrieze K (1997) Cyclic Markov equilibria in stochastic games. Int J Game Theory 26:303–314
23. Flesch J, Thuijsman F, Vrieze OJ (2003) Stochastic games with non-observable actions. Math Meth Oper Res 58:459–475
24. Flesch J, Schoenmakers G, Vrieze K (2008) Stochastic games on a product state space. Math Oper Res 33:403–420
25. Flesch J, Thuijsman F, Vrieze OJ (2007) Stochastic games with additive transitions. Europ J Oper Res 179:483–497
26. Forges F (1990) Universal mechanisms. Econometrica 58:1341–1364
27. Fortnow L, Kimmel P (1998) Beating a finite automaton in the big match. In: Proceedings of the 7th conference on theoretical aspects of rationality and knowledge. Morgan Kaufmann, San Francisco, pp 225–234
28. Gillette D (1957) Stochastic games with zero stop probabilities. In: Contributions to the theory of games, vol 3. Princeton University Press, Princeton
29. Herings JJP, Peeters RJAP (2004) Stationary equilibria in stochastic games: Structure, selection, and computation. J Econ Theory 118:32–60
30. Jaskiewicz A, Nowak AS (2006) Zero-sum ergodic stochastic games with Feller transition probabilities. SIAM J Control Optim 45:773–789
31. Kakutani S (1941) A generalization of Brouwer's fixed point theorem. Duke Math J 8:457–459
32. Kohlberg E (1974) Repeated games with absorbing states. Ann Stat 2:724–738
33. Krausz A, Rieder U (1997) Markov games with incomplete information. Math Meth Oper Res 46:263–279
34. Levhari D, Mirman L (1980) The great fish war: An example using a dynamic Cournot–Nash solution. Bell J Econ 11(1):322–334
35. Maitra A, Sudderth W (1998) Finitely additive stochastic games with Borel measurable payoffs. Int J Game Theory 27:257–267
36. Martin DA (1998) The determinacy of Blackwell games. J Symb Logic 63:1565–1581
37. Mertens JF (1987) Repeated games. In: Proceedings of the international congress of mathematicians, American Mathematical Society, Berkeley, California, pp 1528–1577
38. Mertens JF, Neyman A (1981) Stochastic games. Int J Game Theory 10:53–66
39. Mertens JF, Parthasarathy T (1987) Equilibria for discounted stochastic games. CORE Discussion Paper No. 8750. (Also published in: Neyman A, Sorin S (eds) Stochastic games and applications. NATO Science Series. Kluwer, pp 131–172)
40. Mertens JF, Sorin S, Zamir S (1994) Repeated games. CORE Discussion Papers 9420–9422
41. Milman E (2006) Approachable sets of vector payoffs in stochastic games. Games Econ Behav 56:135–147
42. Neyman A, Sorin S (2003) Stochastic games and applications. NATO Science Series. Kluwer
43. Nowak AS (1985) Existence of equilibrium stationary strategies in discounted noncooperative stochastic games with uncountable state space. J Optim Theory Appl 45:591–620
44. Nowak AS (2003) N-person stochastic games: Extensions of the finite state space case and correlation. In: Neyman A, Sorin S (eds) Stochastic games and applications. NATO Science Series. Kluwer, Dordrecht, pp 93–106
45. Nowak AS (2003) On a new class of nonzero-sum discounted stochastic games having stationary Nash equilibrium points. Int J Game Theory 32:121–132
46. Nowak AS (2003) Zero-sum stochastic games with Borel state spaces. In: Neyman A, Sorin S (eds) Stochastic games and applications. NATO Science Series. Kluwer, Dordrecht, pp 77–91
47. Nowak AS, Raghavan TES (1991) Existence of stationary correlated equilibria with symmetric information for discounted stochastic games. Math Oper Res 17:519–526
48. Phelan C, Stacchetti E (2001) Sequential equilibria in a Ramsey tax model. Econometrica 69:1491–1518
49. Puterman ML (1994) Markov decision processes: Discrete stochastic dynamic programming. Wiley, Hoboken
50. Raghavan TES, Syed Z (2002) Computing stationary Nash equilibria of undiscounted single-controller stochastic games. Math Oper Res 27:384–400
51. Raghavan TES, Syed Z (2003) A policy improvement type algorithm for solving zero-sum two-person stochastic games of perfect information. Math Program Ser A 95:513–532
52. Renault J (2006) The value of Markov chain games with lack of information on one side. Math Oper Res 31:490–512
53. Renault J (2007) The value of repeated games with an informed controller. Preprint
54. Renault J (2007) Uniform value in dynamic programming. Preprint
55. Rosenberg D, Solan E, Vieille N (2002) Blackwell optimality in Markov decision processes with partial observation. Ann Stat 30:1178–1193
56. Rosenberg D, Solan E, Vieille N (2004) Stochastic games with a single controller and incomplete information. SIAM J Control Optim 43:86–110
57. Rosenberg D, Solan E, Vieille N (2006) Protocol with no acknowledgement. Oper Res, forthcoming
58. Sagduyu YE, Ephremides A (2003) Power control and rate adaptation as stochastic games for random access. Proc 42nd IEEE Conf Decis Control 4:4202–4207
59. Savani R, von Stengel B (2004) Exponentially many steps for finding a Nash equilibrium in a bimatrix game. Proc 45th Ann IEEE Symp Found Comput Sci 2004:258–267
60. Shapley LS (1953) Stochastic games. Proc Nat Acad Sci USA 39:1095–1100
61. Simon RS (2003) The structure of non-zero-sum stochastic games. Adv Appl Math 38:1–26
62. Solan E (1998) Discounted stochastic games. Math Oper Res 23:1010–1021
63. Solan E (1999) Three-person absorbing games. Math Oper Res 24:669–698
64. Solan E (2001) Characterization of correlated equilibria in stochastic games. Int J Game Theory 30:259–277
65. Solan E, Vieille N (2001) Quitting games. Math Oper Res 26:265–285
66. Solan E, Vieille N (2002) Correlated equilibrium in stochastic games. Games Econ Behav 38:362–399
67. Solan E, Vieille N (2007) Calculating uniform optimal strategies and equilibria in two-player stochastic games. Preprint
68. Solan E, Vohra R (2002) Correlated equilibrium payoffs and public signalling in absorbing games. Int J Game Theory 31:91–122
69. Sorin S (1984) Big match with lack of information on one side (part 1). Int J Game Theory 13:201–255
70. Sorin S (1985) Big match with lack of information on one side (part 2). Int J Game Theory 14:173–204
71. Sorin S (1986) Asymptotic properties of a non-zero-sum stochastic game. Int J Game Theory 15:101–107
72.
Sorin S (2002) A first course on zero-sum repeated games. Mathématiques et Applications, vol 37. Springer 73. Sorin S (2003) Stochastic gameswith incomplete information. In: Neyman A, Sorin S (eds) Stochastic Games and Applications. NATO Science Series. Kluwer, Berlin, pp 375–395 74. Sorin S, Zamir S (1991) Big match with lack of information on one side (part 3). In: Raghavan TES et al (eds) Stochastic games and related topics. Kluwer, pp 101–112 75. Takahashi M (1962) Stochastic games with infinitely many strategies. J Sci Hiroshima Univ Ser A-I 26:123–134 76. Thuijsman F, Raghavan TES (1997) Perfect information stochastic games and related classes. Int J Game Theory 26:403–408 77. Vieille N (2000) Equilibrium in 2-person stochastic games I: A Reduction. Israel J Math 119:55–91 78. Vieille N (2000) Equilibrium in 2-person stochastic games II: The case of recursive games. Israel J Math 119:93–126 79. Vrieze OJ, Thuijsman F (1989) On equilibria in repeated games with absorbing states. Int J Game Theory 18:293–310
3073
3074
Stochastic Games
80. Vrieze OJ, Tijs SH (1982) Fictitious play applied to sequences of games and discounted stochastic games. Int J Game Theory 12:71–85 81. Whitt W (1980) Representation and approximation of noncooperative sequential games. SIAM J Control Optim 18:33–48
Books and Reviews Ba¸sar T, Olsder GJ (1995) Dynamic noncooperative game theory. Academic Filar JA, Vrieze K (1996) Competitive Markov decision processes. Springer
Maitra AP, Sudderth WD (1996) Discrete gambling and stochastic games. Springer Mertens JF (2002) Stochastic games. In: Aumann RJ, Hart S (eds) Handbook of game theory with economic applications, vol 3. Elsevier, pp 1809–1832 Mertens JF, Sorin S, Zamir S (1994) Repeated games. CORE Discussion Paper 9420-9422 Raghavan TES, Shapley LS (1991) Stochastic games and related topics: In honor of Professor L.S. Shapley. Springer Vieille N (2002) Stochastic games: Recent results. In: Aumann RJ, Hart S (eds) Handbook of game theory with economic applications, vol 3. Elsevier, pp 1833–1850
Stochastic Loewner Evolution: Linking Universality, Criticality and Conformal Invariance in Complex Systems
HANS C. FOGEDBY1,2
1 Department of Physics and Astronomy, University of Aarhus, Aarhus, Denmark
2 Niels Bohr Institute, Copenhagen, Denmark

Article Outline
Glossary
Definition of the Subject
Introduction
Scaling
Conformal Invariance
Loewner Evolution
Stochastic Loewner Evolution
Results and Discussion
Future Directions
Acknowledgments
Bibliography

Glossary

Bessel process The Bessel process operates in d dimensions and describes the radial distance r from the origin of a particle performing Brownian motion. The random motion is governed by the Langevin equation dr/dt = (d − 1)/2r + η, where ⟨η(0)η(t)⟩ = δ(t). For d ≤ 2 the motion is recurrent, i.e., returns to the origin; for d > 2 the motion goes off to infinity. In SLE the Bessel process describes the transition between simple curves and self-intersecting curves.

Brownian motion Brownian motion, B_t, is the scaling limit of random walk. Brownian motion is plane-filling, has the fractal dimension D = 2, and is described by the Langevin equation dB_t/dt = ξ_t, ⟨ξ_t ξ_s⟩ = δ(t − s). B_t is characterized by i) the stationarity property, B_(t+t′) − B_t and B_(t′) identical in distribution, and ii) the independence property, B_(t′) and B_t independent for t ≠ t′. The correlations are given by ⟨|B_t − B_s|²⟩ = |t − s|, and B_t is distributed according to the Gaussian (normal) distribution P(B, t) = (2πt)^(-1/2) exp(−B²/2t).

Chordal SLE In order to map the geometry of the growing hull to a real function by means of SLE one chooses a reference domain. Chordal SLE refers to the case where the SLE trace is grown between two boundary
points, usually the origin and the point at infinity in the upper half complex plane.

Conformal invariance Conformal invariance or local scale invariance is a larger symmetry than scale invariance. Conformal invariance implies invariance under a local rotation, translation, and dilatation. Conformal invariance is particularly powerful in 2D, where a conformal transformation is implemented by an analytic function.

Conformal transformation A conformal transformation is a transformation which preserves angles but allows for rotation, dilatation, and translation. In an elastic medium picture a conformal transformation corresponds to translation, rotation, and compression (dilatation), but not shear. In 2D an analytic complex function w = f(z) from the complex z plane to the complex w plane generates a conformal transformation.

Continuum limit The limit where a lattice model approaches a continuum model; also called the scaling limit. The resulting continuum field theories define the universality classes of the lattice models.

Correlation length The correlation length measures the size of correlations. At the critical point the correlation length diverges, signaling that the system becomes scale invariant.

Critical curves Critical curves are domain walls or cluster boundaries at the critical point. Critical curves are scale invariant and characterized by a fractal dimension.

Exploration Domain walls at the critical point can mathematically be generated by an exploration process where the domain wall is initiated at a boundary point and constructed in steps across the domain. In the percolation case the local step is generated by 'flipping a coin', in the Ising case by evaluating the magnetization at the tip of the 'growing' domain wall. The construction by an exploration process is essential in generating a critical curve by stochastic Loewner evolution.
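The increment law ⟨|B_t − B_s|²⟩ = |t − s| in the Brownian motion entry above is easy to check numerically; the following is a minimal sketch (the sampler and its names are illustrative, not from the article):

```python
import random
import statistics

def brownian_increments(n, dt, seed=0):
    """Sample n independent increments B_{t+dt} - B_t of 1D Brownian
    motion; by stationarity each is Gaussian with mean 0 and variance dt."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, dt ** 0.5) for _ in range(n)]

# Stationarity and independence make the increments i.i.d., so the
# sample variance converges to dt, here dt = 0.5.
incs = brownian_increments(200_000, dt=0.5)
print(statistics.pvariance(incs))  # close to 0.5
```

Summing such increments (a cumulative sum) yields a discrete approximation to the path B_t itself.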
Fortuin–Kasteleyn representation (FK) Based on a high temperature expansion one can represent the configurations in the Ising, O(n), and Potts models by means of the Fortuin–Kasteleyn representation in terms of random clusters. The FK transformation is essential in identifying random curves for the lattice models which then can be accessed by SLE in the scaling limit.

Fractal dimension Irregular objects can be characterized by a fractal dimension. The fractal dimension D is derived by covering the object with N(ℓ) intervals, disks, or spheres of linear dimension ℓ. By letting
ℓ → 0 this definition resolves the fine scale fractal structure, and the scaling relation N(ℓ) ∝ ℓ^(-D), or D = −lim_(ℓ→0) ln N(ℓ)/ln ℓ, yields the fractal dimension. Examples of deterministic fractals are the Cantor set, the Koch curve, the Sierpinski gasket, and the plane-filling Hilbert and Peano curves. Random walk and diffusion limited aggregation (DLA) are examples of random physical fractals.

Hulls For a self-intersecting SLE trace the trace and enclosed regions form a so-called hull. Points in the hull cannot be connected to infinity without crossing the trace. The growing hull of a self-intersecting SLE trace eventually exhausts the half plane.

Ising model The Ising model originates from the theory of magnetism and plays an important role in the theory of phase transitions and in statistical mechanics in general. The model is defined on a lattice where each lattice site is endowed with a local spin variable assuming two distinct values. The spins interact by a short range exchange interaction. Above 1D the Ising model has a second order phase transition.

Locality In the percolation case the domain wall is constructed by an exploration process and a local rule for assigning the next step. This locality property is specific to percolation, which has a geometric phase transition. In the SLE context the locality property implies κ = 6, i.e., the percolation case.

Loewner equation The Loewner equation is the first order nonlinear differential equation for the uniformizing map g_t(z) which maps the complement of a growing hull or curve in the upper half plane back to the upper half plane. The geometrical properties of the hull are encoded in the real function a_t. The Loewner equation has the form dg_t/dt = 2/(g_t − a_t).

Loewner evolution Loewner evolution refers to the parametrized map g_t(z) which uniformizes the complement of a curve or hull, say in the upper half plane.
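The Loewner equation in the entry above can be integrated numerically. The sketch below (an illustration with assumed helper names, not code from the article) uses a plain Euler step with the constant drive a_t = 0, the "growing stick" case treated later in the article, for which the hull is a vertical slit and the uniformizing map is known in closed form, g_t(z) = √(z² + 4t):

```python
import cmath

def loewner_flow(z, t_max, n_steps=100_000, drive=lambda t: 0.0):
    """Euler integration of the Loewner equation dg/dt = 2/(g - a_t),
    with initial condition g_0(z) = z for a point z in the upper half plane."""
    g = complex(z)
    dt = t_max / n_steps
    for i in range(n_steps):
        g += dt * 2.0 / (g - drive(i * dt))
    return g

z, t = 1.0 + 1.0j, 0.5
print(loewner_flow(z, t))         # numerical g_t(z)
print(cmath.sqrt(z * z + 4 * t))  # exact growing-stick map sqrt(z^2 + 4t)
```

Replacing the zero drive by a random one turns this deterministic evolution into SLE, as discussed below.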
Loop erased random walk (LERW) Loop erased random walk is a variant of random walk where loops formed are removed as the walk progresses. LERW has a conformally invariant scaling limit and can be generated by SLE for κ = 2.

Markov property The Markov property refers to a stochastic process without memory. A typical example is random walk or, in the scaling limit, Brownian motion. The Markov property is essential for the application of SLE to domain walls in the scaling limit.

Measures For lattice systems one can define probability distributions according to the rules of statistical mechanics. In the scaling limit the distributions typically diverge and have to be replaced by the more abstract mathematical concept of probability measures.

O(n) models The O(n) model is a generalization of the Ising model. With each lattice site is associated an n-component spin variable. For n = 1 we recover the Ising model, n = 2 corresponds to the XY model, and n = 3 to the Heisenberg model.

Peano curve A Peano curve is a non-crossing curve which is dense in the plane, i.e., it gets arbitrarily close to every point. The Peano curve has the fractal dimension D = 2. In a SLE context a random Peano curve winds around a UST and corresponds to κ = 8.

Percolation Percolation on a lattice is constructed by filling lattice sites at random with a common probability p. At a critical concentration a spanning cluster extends across the system.

Potts model The Potts model is a generalization of the Ising model where the local lattice variable can assume q different values and where sites only interact when they are in the same Potts state. The Potts model has a FK cluster representation. The Ising model corresponds to q = 2; q = 1 corresponds to percolation and q = 0 to the UST.

Random walk Random walk is a ubiquitous phenomenon in nature. In random walk on a square lattice a particle jumps from site to site with a given fixed probability. Each step is independent of the past history; there is no memory, this is the so-called Markov property. The random walk path or history is plane-filling and has the fractal dimension D = 2. The mean square displacement of random walk scales with the number of steps.

Restriction The restriction property in a SLE context implies that the measure on a curve in a domain D conditioned not to hit a bulge L is the same as the measure in the domain D \ L. The restriction property holds in the case of a uniform measure and applies to self-avoiding random walk.

Riemann's mapping theorem Riemann's mapping theorem states that an arbitrary simply connected domain D, i.e., without holes, can be mapped to another simply connected domain D′ by means of a suitable complex function g(z), i.e., g(D) = D′. Often the disk or half plane are used as reference domains. The mapping theorem does not make assumptions about the domain boundaries, which can be fractal.

Scale invariance The scaling property refers to the case where a phenomenon is devoid of a characteristic scale or unit. Scale invariance is typically characterized by a power law behavior and critical exponents.
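The box-counting definition of the fractal dimension given in the glossary can be applied directly to a lattice random walk. The sketch below (illustrative helper names, not from the article) estimates D from two covering scales; at finite walk length the estimate sits somewhat below the asymptotic plane-filling value D = 2 reached in the scaling limit:

```python
import math
import random

def random_walk_2d(n_steps, seed=0):
    """Sites visited by an n-step nearest-neighbor walk on the square lattice."""
    rng = random.Random(seed)
    x = y = 0
    sites = [(0, 0)]
    for _ in range(n_steps):
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        x, y = x + dx, y + dy
        sites.append((x, y))
    return sites

def box_count(sites, box):
    """N(box): number of boxes of side `box` needed to cover the visited sites."""
    return len({(px // box, py // box) for px, py in sites})

walk = random_walk_2d(200_000)
# D = ln(N(l1)/N(l2)) / ln(l2/l1) between two covering scales l1 < l2
d_est = math.log(box_count(walk, 4) / box_count(walk, 16)) / math.log(16 / 4)
print(d_est)  # finite-size estimate, approaching D = 2 in the scaling limit
```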
Scaling limit The limit where the lattice parameter in a lattice model approaches zero. The scaling limit is equivalent to the continuum limit.

Self-avoiding random walk (SAW) A self-avoiding random walk is a walk conditioned not to cross itself. SAW has been used to model polymers. SAW has a uniform probability measure and conforms in a SLE context to the restriction condition.

Stochastic Loewner evolution (SLE) Stochastic Loewner evolution is Loewner evolution driven by a real stochastic function a_t with a distribution given by 1D Brownian motion, i.e., a_t = √κ B_t. SLE is governed by the stochastic equation of motion dg_t/dt = 2/(g_t − a_t).

Schramm's theorem Schramm's theorem refers to Schramm's derivation of SLE for LERW. Schramm showed that the scaling limit of LERW is described by SLE for κ = 2. Schramm also conjectured that the random Peano curve winding around a UST is described by SLE for κ = 8 and that percolation is described by SLE for κ = 6.

Uniformizing maps Conformal transformations which map a domain D to a standard reference domain, e.g., the half plane or the disk, are called uniformizing maps.

Uniform spanning tree (UST) A spanning tree is a collection of vertices and links forming a tree (no loops or cycles). A UST is a random tree picked among all spanning trees with equal probability.

Definition of the Subject

Stochastic Loewner evolution, also called Schramm Loewner evolution (abbreviated SLE), is a rigorous tool in mathematics and statistical physics for generating and studying scale invariant or fractal random curves in two dimensions (2D). The method is based on the older deterministic Loewner evolution introduced by Karl Löwner [64], who demonstrated that an arbitrary curve not crossing itself can be generated by a real function by means of a conformal transformation. A real function defined in one spatial dimension (1D) thus encodes a curve in 2D, in itself an intriguing result.
In 2000 Oded Schramm [77] extended this method and demonstrated that when the Loewner evolution is driven by a 1D Brownian motion, the curves in the complex plane become scale invariant; the fractal dimension turns out to be determined by the strength of the Brownian motion. The one parameter family of scale invariant curves generated by SLE is conjectured, and has in some cases been proven, to represent the continuum or scaling limit
of a variety of interfaces and cluster boundaries in lattice models in statistical physics, ranging from self-avoiding random walks to percolation cluster boundaries and Ising domain walls. SLE operates in the 2D continuum where it generates extended scale invariant objects. SLE delimits scaling universality classes by a single parameter κ, the strength of the 1D Brownian drive, yielding the fractal dimension D of the scale invariant shapes according to the relation D = 1 + κ/8. Moreover, SLE provides the geometrical aspects of conformal field theory (CFT). The central charge c, delimiting scaling universality classes in CFT, is thus related to κ by means of the expression c = (3κ − 8)(6 − κ)/2κ.

Stochastic Loewner evolution derives its importance from the fact that it addresses the issue of extended random fractal shapes in 2D by direct analysis in the continuum. It thus supplements and extends earlier lattice results and also allows for the determination of new scaling exponents. From the point of view of conformal field theory, based on the concept of a local field, operator expansions, and correlations, the geometrical approach afforded by SLE, directly addressing conformally invariant random shapes in the continuum, represents a novel point of view of maybe far reaching consequences; so far it has only been explored in two dimensions.

Introduction

The field of SLE has mainly been driven by mathematicians presenting their results in long and difficult papers. There are, however, presently several excellent reviews of SLE addressing both the theoretical physics community [11,13,25,27,39,43] and the mathematical community [56,85]; see also a complete bibliography up to 2003 [39]. The purpose of this article is to present a heuristic and simple discussion of some of the key aspects of SLE; for details and topics left out we refer the reader to the reviews mentioned above.
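The relations quoted above, D = 1 + κ/8 for the fractal dimension and c = (3κ − 8)(6 − κ)/(2κ) for the central charge, can be tabulated for the κ values appearing in this article; note that the identifications κ = 8/3 for SAW and κ = 3 for the Ising interface are standard but not stated in this excerpt:

```python
def fractal_dimension(kappa):
    """Fractal dimension of the SLE trace: D = 1 + kappa/8 (for kappa <= 8)."""
    return 1.0 + kappa / 8.0

def central_charge(kappa):
    """Central charge of the associated CFT: c = (3k - 8)(6 - k) / (2k)."""
    return (3 * kappa - 8) * (6 - kappa) / (2 * kappa)

# kappa values discussed in the text (SAW and Ising values assumed, see above)
cases = [("LERW", 2), ("SAW", 8 / 3), ("Ising", 3), ("percolation", 6), ("UST Peano", 8)]
for name, k in cases:
    print(f"{name}: kappa = {k}, D = {fractal_dimension(k)}, c = {central_charge(k)}")
```

For instance κ = 6 (percolation) gives D = 7/4 and c = 0, while κ = 8 gives the plane-filling value D = 2.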
However, in order to provide the necessary background and set the stage for SLE we begin with some general remarks on scaling in statistical physics.

General Remarks

In statistical physics we study macroscopic systems composed of many interacting components. In the limit of many degrees of freedom the macroscopic behavior roughly falls into two categories. In the most common case the macroscopic behavior is deterministic and governed by phenomenological theories, like for example thermodynamics and hydrodynamics, operating entirely on a macroscopic level. This behavior is basically a result of the law
of large numbers, permitting an effective coarse graining and yielding for example a macroscopic density or velocity field [29]. In the other case, the macroscopic behavior is dominated by fluctuations and shows a random behavior [29,73]. Typical cases are random walk and equilibrium systems tuned to the critical point of a second order phase transition. Other random cases are for example self-organized critical systems purported to model earthquake dynamics, flicker noise, and turbulence in fluids [5,40]. The distinction between the deterministic and random cases of macroscopic behavior is illustrated by the simple example of a biased random walk described by the Langevin equation dx(t)/dt = v + η(t), ⟨η(t)η(0)⟩ ∝ δ(t). Here x(t) is a macroscopic variable sampling the statistically independent microscopic steps η(t); the velocity v is an imposed drift or bias. Averaging over the steps we obtain for the deviation of x, R = [⟨x²⟩]^(1/2) = [(vt)² + t]^(1/2). For large t the deviation R ≈ ⟨x⟩ = vt and the mean value or deterministic part dominates the behavior, the fluctuational or random part ∼ t^(1/2) being subdominant. Fine tuning the random walk to vanishing bias v = 0 we have ⟨x⟩ = 0 and R ∼ t^(1/2), i.e., the random fluctuations control the phenomenon. The study of complexity encompasses a broader field than statistical physics and is concerned with the emergence of universal properties on a mesoscopic or macroscopic scale in large interacting systems, for example particle systems, networks in biology and sociology, cellular automata, etc. The class of complex systems generally falls in the category of random systems. The emergent properties are a result of many interacting agents or degrees of freedom and can in general not be directly inferred from the microscopic details. A major issue is thus the understanding of generic emerging properties of complex systems [45,70,83].
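The crossover between the deterministic regime R ≈ vt and the fluctuation-dominated regime R ∼ t^(1/2) in this example is visible in a few lines of Monte Carlo simulation; a sketch, assuming a discretization of the Langevin equation into unit-time Gaussian steps:

```python
import random

def biased_walk_deviation(v, t, trials=1000, seed=1):
    """Monte Carlo estimate of R = sqrt(<x^2>) for the biased walk
    dx/dt = v + noise, discretized as unit-time steps x += v + gauss(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = 0.0
        for _ in range(t):
            x += v + rng.gauss(0.0, 1.0)
        total += x * x
    return (total / trials) ** 0.5

print(biased_walk_deviation(1.0, 400))  # ≈ 400: ballistic, vt dominates
print(biased_walk_deviation(0.0, 400))  # ≈ 20:  diffusive, sqrt(t)
```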
Here, however, the methods of statistical physics are an indispensable tool in the study of complexity. The evolution of statistical physics, a branch of theoretical physics, has occurred in steps and is driven both by the introduction of new concepts and the concurrent development of new mathematical methods and techniques. A well-known case is the long standing problem of second order phase transitions or critical phenomena, which gave way to a deeper understanding in the sixties and seventies and spun off the renormalization group techniques for the determination of critical exponents and universality classes [20,24,65,81,86].

Scaling Ideas

This brings us to the fundamental scaling ideas and techniques developed particularly in the context of critical phenomena in equilibrium systems, which now pervade a good part of theoretical physics and, moreover, play an important role in the analysis of complex systems in physical sciences [20,24,29]. Scaling is synonymous with no scale in the sense that a system exhibiting scale invariance is characterized by the absence of any particular scale or unit. Scaling occurs both in the space and/or time behavior and is typically characterized by power law dependencies controlled by scaling exponents. A classical case is random walk discussed above, characterized by the Langevin equation dx/dt = η(t), ⟨η(t)η(0)⟩ ∝ δ(t) [73]. Here the mean square displacement scales like ⟨x²⟩(t) ∼ t^(2H), where H is the Hurst scaling exponent; for random walk H = 1/2 [32,68]. Correspondingly, the power spectrum P(ω) = |x_ω|², x_ω = ∫dt x(t) exp(iωt), scales like P(ω) ∼ ω^(-1-2H), i.e., P(ω) ∼ ω^(-2) for random walk; we note that the underlying reason for the universal scaling behavior of random walk is the central limit theorem [37,73].

Scaling in Equilibrium

Scaling ideas and associated techniques first came to the forefront in statistical physics in the context of second order phase transitions or critical phenomena [20,24,29,65]. More specifically, consider the usual Ising model defined on a lattice with local spin degrees of freedom, σ_i = ±1 at site i, subject to a short range interaction J favoring spin alignment. The Ising Hamiltonian has the form H = −J Σ_⟨ij⟩ σ_i σ_j, where ⟨ij⟩ indicates nearest neighbor sites. The thermodynamic phases are characterized by the order parameter m = ⟨σ_i⟩. Above one dimension the Ising model exhibits a second order phase transition at a finite critical temperature T_c. Above T_c the model is in a disordered paramagnetic state with m = 0 and only microscopic domains of ordered spins.
Below T_c the model favors a ferromagnetic state with long range order and macroscopic domains of ordered spins, corresponding to m ≠ 0; at T = 0 the model assumes the ferromagnetic ground state configuration with totally aligned spins. Regarding the spatial organization, the size of the domains of ordered spins is characterized by the correlation length ξ(T). As we approach the critical point at T_c the order parameter vanishes, m ∝ |T − T_c|^β, with critical exponent β, but more significantly, the correlation length diverges like ξ(T) ∼ |T − T_c|^(-ν) with scaling exponent ν. This indicates that the system is scale invariant at T_c. Regarding the domains of ordered spins, the system is spatially self-similar on scales from the microscopic lattice distance to the system size, the system size diverging in the thermodynamic limit. The scaling behavior at the critical point T_c
is an emergent property in the sense that the scaling exponents β and ν do not depend on microscopic details like the type of lattice, strength of interaction, etc., but only on the dimension of the system and the symmetry of the order parameter [72]. The diverging correlation length at the critical point is the central observation which in the sixties and seventies gave rise to a detailed understanding of critical phenomena, beginning with the coarse graining block scheme proposed by Kadanoff [41] and culminating with the development and application of field theoretical renormalization group techniques by Wilson and others [20,24,29,65,72,86]. For a diverging correlation length much larger than the lattice distance the local spin σ_i can be replaced by a coarse-grained local field φ(r) and the Ising Hamiltonian by the Ginzburg–Landau functional F = ∫d^d r [(∇φ)² + Rφ² + Uφ⁴], where the 'mass' term R ∝ T − T_c. Consequently, the universality class of Ising-type models is described by a scalar field theory. The renormalization group techniques basically quantify the Kadanoff block construction in momentum space and extract scaling properties in an expansion about the upper critical dimension d = 4. To leading order in 4 − d one obtains β = 1/2 − (4 − d)/6 and ν = 1/2 + (4 − d)/12. Alternatively, keeping the correlation length fixed and letting the lattice distance approach zero, we obtain at T_c the so-called scaling limit or continuum limit of the Ising model. The Ising spin σ_i becomes a local field φ(r) and the weight of a configuration is determined by exp(−F). Note that in order to implement the scaling limit we must be at T_c. The two scenarios of a growing correlation length for fixed lattice distance implementing the Kadanoff construction and a fixed correlation length for a vanishing lattice distance yielding a continuum field theory are related by an overall scale transformation [86].
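The leading-order ε-expansion above (with ε = 4 − d) gives concrete numbers; a trivial sketch evaluating it at the upper critical dimension d = 4, where the mean-field values 1/2 are recovered, and at d = 3:

```python
def beta_exponent(d):
    """Order-parameter exponent to leading order in eps = 4 - d."""
    return 0.5 - (4 - d) / 6.0

def nu_exponent(d):
    """Correlation-length exponent to leading order in eps = 4 - d."""
    return 0.5 + (4 - d) / 12.0

print(beta_exponent(4), nu_exponent(4))  # 0.5 0.5 (mean field)
print(beta_exponent(3), nu_exponent(3))  # 1/3 and 7/12 to leading order
```

These leading-order d = 3 estimates are already in rough agreement with the exact 2D values β = 1/8 and ν = 1 quoted below, in the sense of the expected monotone trend away from mean field.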
It is generally assumed that the global or nonlocal scale invariance at the critical point in the continuum limit can be extended to a local scale invariance including translation and rotation, that is, an angle-preserving conformal transformation. This follows heuristically from a local implementation of the Kadanoff coarse-graining block construction and applies to lattice models with short range interactions and discrete translational and rotational invariance [22,24]. The resulting continuum theories then fall in the category of conformal field theories (CFT). In 2D the group of conformal transformations is particularly rich since it corresponds to the class of analytic functions w = f(z), mapping the complex plane z to the complex plane w. The infinite group structure imposes sufficient constraints on the structure of conformal field
theories in 2D that the scaling form of correlations, e.g., ⟨φ(0)φ(r)⟩, and in particular the critical exponents can be determined explicitly [23,24]. One finds β = 1/8 and ν = 1 for the order parameter and correlation length exponents, respectively, in accordance with lattice theory results [14]. Here we also mention the Coulomb gas method for the determination of critical exponents [71]. It is a common feature of both renormalization group calculations based on an expansion about a critical dimension and conformal field theory in 2D that the local field φ(r) and its correlations are the basic building blocks and that the critical properties are encoded in their scaling form, yielding critical exponents, scaling laws, scaling functions, etc. Notwithstanding the fact that the seminal Kadanoff construction [41] was based on a geometrical picture corresponding to coarse-graining the degrees of freedom over larger and larger scales, keeping track of ordered domains on all scales, the actual geometry of critical phenomena, such as the scaling properties of critical clusters, was not well understood and seemed inaccessible in the continuum limit within the context of local field theories. Whereas it is not difficult to generate critical domain walls, interfaces, and clusters for lattice models with appropriate boundary conditions by means of standard Monte Carlo simulation techniques, the continuum or scaling limit of critical shapes appeared until recently, with a few exceptions, beyond present techniques.

Stochastic Loewner Evolution

Here stochastic Loewner evolution (SLE) represents a new insight in 2D critical phenomena with respect to a deeper understanding of scale invariant curves, clusters, and shapes. Also, there appear to be deep connections between SLE and conformal field theory. SLE is an ingenious way of generating critical curves and shapes in the 2D continuum using conformal transformations. Let us mention a characteristic example.
Consider an Ising model on a lattice in a chosen domain. Imposing spin up on a continuous part of the domain boundary and spin down on the remaining part of the boundary, it follows that a specific domain wall or interface will connect the two points on the boundary where a bond is broken. At low temperature the bond energies dominate and the free energy is lowest for a straight domain wall with few kinks. As we approach the critical point, entropy or fluctuations come strongly into play and the domain wall meanders, balancing energy and entropy. At the critical point the system becomes scale invariant with a diverging correlation length and likewise the domain wall becomes scale invariant, i.e., it has kinks on all scales larger than the lattice distance. In the continuum limit the Ising domain wall becomes a random fractal curve with a particular fractal dimension. Here SLE provides a direct analytical method in the continuum to generate such a random curve and, moreover, provides the fractal dimension in terms of the strength of the 1D random walk driving the SLE evolution.

Outline

The outline of the present article is as follows. In Sect. "Scaling" we introduce some of the basic models and concepts necessary for a discussion of SLE: A) Random walk, B) Percolation, C) Ising model, D) Critical curves and exploration, and E) Distributions, Markov properties, and measures. In Sect. "Conformal Invariance" we turn to the essential ingredient in SLE, namely conformal invariance: A) Conformal maps and B) Measures and conformal invariance. Section "Loewner Evolution" is devoted to deterministic or classical Loewner evolution: A) Growing stick, B) Loewner equation and C) Exact solutions. In Sect. "Stochastic Loewner Evolution", constituting the core part of this article, we discuss stochastic Loewner evolution: A) Schramm's theorem, B) SLE properties, C) Curves, hulls, and the Bessel process and D) Fractal dimension. Section "Results and Discussion" is devoted to results and discussions: A) Phase transitions, locality, restriction, and duality, B) Loop erased random walk, C) Self-avoiding random walk, D) Percolation, E) Ising model and O(n) models, F) SLE and conformal field theory, G) Application to 2D turbulence, H) Application to 2D spin glass and I) Further remarks. Finally, in Sect. "Future Directions" we discuss future directions of the field. The bibliography includes books, general reviews, and more technical papers.

Scaling

Stochastic Loewner evolution has been applied to a host of lattice models proved or conjectured to possess scaling limits. However, for the present purpose we will focus on three lattice models: random walk, percolation, and the Ising model.
Random Walk

Random walk is a simple and much studied random process [4,32,73]. Consider an unbiased random walk in the plane composed of N steps, where the ith step, $\vec\eta_i$, is random, isotropic and uncorrelated, i.e., $\langle\vec\eta_i\rangle = 0$ and $\langle\eta_i^\alpha \eta_j^\beta\rangle \propto \delta^{\alpha\beta}\delta_{ij}$. For the end-to-end distance we have $\vec x = \sum_{i=1}^{N}\vec\eta_i$ and for the size $R = [\langle\vec x^2\rangle]^{1/2} \propto N^{1/2}$. Assuming one step per unit time, $N \propto t$, we obtain $R \propto t^{1/2}$, characteristic of diffusive motion. Introducing the fractal dimension D by the usual box counting procedure [32,68],

$N(R) \propto R^{D}$,  (1)

where N is the number of boxes and R the size of the object, and covering the random walk we readily infer D = 2, i.e., the self-crossing random walk is plane-filling modulo the lattice distance. Introducing the scaling exponent $\nu$ according to $R \propto N^{\nu}$ we have for random walk $\nu = 1/2$; note that $D = 1/\nu$. The scaling limit of unbiased random walk is Brownian motion (BM) [4] and is obtained by scaling the step size down and the number of steps N up in such a manner that the size $R \propto N^{1/2}$ stays constant. The resulting BM path is a continuous non-differentiable random curve with fractal dimension D = 2. The BM path is plane-filling and recurrent in 2D, i.e., it returns to a given point with probability one. Focusing on one of the independent cartesian components, 1D BM, $B_t$, is also described by the Langevin equation

$\frac{dB_t}{dt} = \xi_t, \quad \langle\xi_t \xi_s\rangle = \delta(t-s)$,  (2)

where $\xi_t$ is uncorrelated Gaussian white noise with a flat power spectrum. Integrating Eq. (2), $B_t$ samples the noise $\xi_t$ and we find, assuming $B_0 = 0$, $B_t = \int_0^t \xi_{t'}\,dt'$, from which we directly infer the fundamental properties of BM, namely, stationarity and independence,

$B_{T+\Delta T} - B_T \sim B_{\Delta T}$,  (stationarity)  (3)

$B_{\Delta T},\ B_{\Delta T'}$ indep. for $T \neq T'$,  (independence)  (4)

where $\sim$ indicates identical distributions. Moreover, the correlations are given by

$\langle |B_t - B_s|^2\rangle = |t - s|$,  (5)
and $B_t$ is distributed according to the normal (Gaussian) distribution $P(B;t) = (2\pi t)^{-1/2}\exp(-B^2/2t)$. Whereas 1D BM drives SLE, 2D BM is not itself generated by SLE since the path is self-crossing on all scales; by construction SLE is limited to the generation of non-crossing random curves. However, variations of BM have played an important role in the development of SLE. We mention here the scaling limit of loop erased random walk (LERW) and self-avoiding random walk (SAW), to be discussed later.

Percolation

The phenomenon of percolation is relevant in the context of clustering, diffusion, fractals, phase transitions and disordered systems. As a result, percolation theory provides
a theoretical and statistical background to many physical and natural sciences [82]. Percolation is the simplest lattice model exhibiting a geometrical phase transition. The site percolation model is constructed by occupying sites on a lattice with a given common probability p. Let an occupied site be denoted 'plus' and an empty site 'minus'. For p close to zero the sites are mainly unoccupied and the lattice is 'minus'. For p close to one the sites are predominantly occupied and the lattice is 'plus'. At a critical concentration $p_c$, the percolation threshold, an infinite cluster of 'plus' sites embedded in the 'minus' background extends across the lattice. In the scaling limit of vanishing lattice distance the critical cluster has a fractal boundary which can be accessed by SLE. Whereas the scaling properties of critical percolation clusters define a universality class and are independent of the lattice structure, the critical concentration $p_c$ in general depends on the lattice. For site percolation on a triangular lattice the percolation threshold is known to be $p_c = 1/2$. In Fig. 1 we depict a realization of site percolation on a triangular lattice in the upper half plane at the percolation threshold $p_c = 1/2$. The occupied sites are denoted 'plus', the empty sites 'minus'. By imposing appropriate boundary conditions we induce a meandering domain wall across the system from A to B. In the scaling limit the domain wall becomes a fractal non-crossing critical curve.
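The construction just described is easy to realize numerically. The sketch below (a minimal illustration, not part of the article) represents the triangular lattice as a square array with one extra pair of diagonal neighbors and estimates the probability that a 'plus' cluster spans the system; the spanning fraction rises sharply through the threshold $p_c = 1/2$.

```python
import random
from collections import deque

# Neighbor offsets realizing a triangular lattice on a square array:
# the four square-lattice neighbors plus one pair of diagonals.
NEIGHBORS = [(-1, 0), (1, 0), (0, -1), (0, 1), (1, 1), (-1, -1)]

def spans(p, n, rng):
    """Occupy an n-by-n triangular lattice with probability p and test
    whether a cluster of occupied ('plus') sites connects top to bottom."""
    occ = [[rng.random() < p for _ in range(n)] for _ in range(n)]
    seen = set()
    queue = deque((0, c) for c in range(n) if occ[0][c])
    seen.update(queue)
    while queue:
        r, c = queue.popleft()
        if r == n - 1:
            return True
        for dr, dc in NEIGHBORS:
            rr, cc = r + dr, c + dc
            if 0 <= rr < n and 0 <= cc < n and occ[rr][cc] and (rr, cc) not in seen:
                seen.add((rr, cc))
                queue.append((rr, cc))
    return False

def spanning_fraction(p, n=40, trials=40, seed=0):
    """Fraction of independent samples containing a spanning cluster."""
    rng = random.Random(seed)
    return sum(spans(p, n, rng) for _ in range(trials)) / trials
```

On a finite lattice the transition is smoothed, but well below $p_c$ the spanning fraction is close to 0 and well above $p_c$ close to 1.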
Ising Model

The Ising model is probably the simplest interacting many particle system in statistical physics [20,29,81]. The model has its origin in magnetism but has become of paradigmatic importance in the context of phase transitions. The model is defined on a lattice where each lattice site i is occupied by a single degree of freedom, a spin variable $\sigma_i$, assuming two values, $\sigma_i = \pm 1$, i.e., spin up or spin down. In the ferromagnetic case considered here the spins interact via a short range exchange interaction J favoring parallel spin alignment. The model is described by the Ising Hamiltonian

$H = -J \sum_{\langle ij\rangle} \sigma_i \sigma_j$,  (6)

where $\langle ij\rangle$ indicates nearest neighbor spin sites i and j. The statistical weight or probability of a specific spin configuration $\{\sigma_i\}$ is determined by the Boltzmann factor

$P(\{\sigma_i\}) = \exp[-H/kT]/Z$,  (7)

where T is the temperature and k Boltzmann's constant. The partition function Z has the form

$Z = \sum_{\{\sigma_i\}} \exp(-H/kT)$,  (8)

yielding the thermodynamic free energy F according to $F = -kT\log Z$. The entropy is given by $S = -dF/dT$ and the energy follows from $F = E - TS$. The magnetization or order parameter and correlations are given by

$m = \sum_{\{\sigma_i\}} \sigma_i P(\{\sigma_i\})$ and $\langle\sigma_i\sigma_j\rangle = \sum_{\{\sigma_i\}} \sigma_i\sigma_j P(\{\sigma_i\})$,  (9)

respectively. The Ising model possesses a phase transition at a critical temperature $T_c$ from a disordered paramagnetic phase above $T_c$ with vanishing order parameter $m = 0$ to a ferromagnetic ordered phase below $T_c$ with non-vanishing order parameter $m \neq 0$. At the critical temperature $T_c$ the order parameter vanishes like $m \propto |T - T_c|^{\beta}$ with critical exponent $\beta$.

Stochastic Loewner Evolution: Linking Universality, Criticality and Conformal Invariance in Complex Systems, Figure 1
We depict site percolation on a triangular lattice in the upper half plane at the percolation threshold. The critical concentration is $p_c = 1/2$. The occupied sites are denoted 'plus', the empty sites 'minus'. The boundary conditions enforce a meandering domain wall from A to B

The correlation function $\langle\sigma_i\sigma_j\rangle$ monitoring
the spatial organization of ordered domains behaves like

$\langle\sigma_i\sigma_j\rangle \propto \frac{\exp[-|i-j|/\xi]}{|i-j|^{\eta}}$,  (10)

where $|i-j|$ denotes the distance between position i and position j. The algebraic behavior is characterized by the critical exponent $\eta$, and the correlation length $\xi$ scales like $\xi \propto |T - T_c|^{-\nu}$ with critical exponent $\nu$. At the critical point the correlation length diverges and the Ising model becomes scale invariant with an algebraically decaying correlation function $\langle\sigma_i\sigma_j\rangle \propto |i-j|^{-\eta}$. Assigning 'plus' to spin up and 'minus' to spin down, Fig. 1 also illustrates a typical configuration of the 2D Ising model on a triangular lattice at the critical temperature, including an Ising domain wall from A to B.

Critical Curves – Exploration

In the percolation case at the percolation threshold a critical curve is induced by fixing the boundary conditions. Occupying sites from A to B along the right hand side of the boundary and assigning empty sites along the left hand side of the boundary from A to B, a critical curve will meander across the system from A to B. Imagine painting the two sides of the curve: one side is painted with occupied sites, the other side with empty sites. Typically the curve meanders on all scales but by construction does not cross itself. For later purposes the configuration is depicted in Fig. 1. In order to make contact with SLE we observe that a critical interface in the percolation case can also be constructed by an exploration process. We initiate the curve at the boundary point A and toss a coin. In the case of 'head' the site or hexagon in front is chosen to be occupied and the path bends left; in the event of 'tail' the hexagon in front is left unoccupied and the path bends right. In this manner a critical meandering non-crossing curve is generated, terminating eventually at B. The percolation growth process is depicted in Fig. 2, where for clarity we have only indicated the sites involved in the growth process. We note that since there is no interaction between the sites the path depends entirely on the local properties.
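The Boltzmann distribution in Eq. (7) can be sampled by the standard Metropolis algorithm. The sketch below (a minimal square-lattice illustration with J = k = 1, not part of the article, which uses a triangular lattice) recovers the behavior of the order parameter of Eq. (9) deep in the ordered and disordered phases.

```python
import math
import random

def metropolis_ising(n, T, sweeps, seed=0):
    """Sample the 2D Ising Boltzmann distribution, Eq. (7), by the
    Metropolis algorithm on an n-by-n square lattice with periodic
    boundaries (J = k = 1). Returns the magnetization per spin."""
    rng = random.Random(seed)
    spin = [[1] * n for _ in range(n)]          # ordered initial state
    for _ in range(sweeps * n * n):
        i, j = rng.randrange(n), rng.randrange(n)
        # energy cost of flipping spin (i, j): dE = 2 J s_ij * (sum of neighbors)
        nb = (spin[(i + 1) % n][j] + spin[(i - 1) % n][j]
              + spin[i][(j + 1) % n] + spin[i][(j - 1) % n])
        dE = 2 * spin[i][j] * nb
        # accept with probability min(1, exp(-dE/T)), implementing Eq. (7)
        if dE <= 0 or rng.random() < math.exp(-dE / T):
            spin[i][j] = -spin[i][j]
    return sum(map(sum, spin)) / (n * n)
```

Far below the transition the magnetization per spin stays close to 1; far above it the spins decorrelate and m relaxes towards 0, in accordance with the phase behavior described above.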
In the case of the Ising model a critical interface or domain wall at the critical point is again fixed by assigning appropriate boundary conditions, with spin up along the boundary from A to B and spin down from B to A. The critical curve can be constructed in two ways: globally or by an exploration process. In the global case we generate a spin configuration by means of a Monte Carlo simulation, i.e., perform a biased importance sampling implementing the probability distribution in Eq. (7), and
Stochastic Loewner Evolution: Linking Universality, Criticality and Conformal Invariance in Complex Systems, Figure 2
We depict the growth process in the percolation case. The percolation threshold is at $p_c = 1/2$. The interface imposed by the boundary conditions originates at the boundary point A and progresses towards the boundary point B
identify a critical interface. Alternatively, we can generate an interface by an exploration process like in the percolation case, occupying a site i according to the weight $(1/2)(1 + \langle\sigma_i\rangle)$, where $\langle\sigma_i\rangle$ is evaluated in the domain with the spins along the interface fixed, see again Fig. 2.

Distributions – Markov Properties – Measures

In the case of random walk the number of walks of length L grows like $\mu^L$, where $\mu$ is a lattice dependent number, i.e., at each step there are $\mu$ lattice-dependent choices for choosing a direction of the next step. Consequently, the weight or probability of a particular walk $\gamma$ of length L is proportional to $\mu^{-L}$,

$P(\gamma) \propto \mu^{-L}$,  (11)

and all walks of a given length have the same weight. We note here the important Markov property, characteristic of random walk, which can be formulated in the following manner. Assume that the first part $\gamma'$ of the walk has taken place and thus conditions the remaining part $\gamma$ of the walk. In a suggestive notation the conditional distribution is given by $P(\gamma|\gamma') = P(\gamma\gamma')/P(\gamma')$. The Markov property implies that the conditional probability is equal
to the probability of the walk in a new domain where the first part of the walk $\gamma'$ has been removed, i.e., the identity

$P(\gamma|\gamma') = P'(\gamma)$,  (Markov property)  (12)

where the prime refers to the new domain. The Markov property is self-evident for random walk and follows directly from Eq. (11), i.e., $P(\gamma|\gamma') = P(\gamma\gamma')/P(\gamma') = \mu^{-(L+L')}/\mu^{-L'} = \mu^{-L} = P'(\gamma)$, where $L'$ and $L$ are the lengths of segments $\gamma'$ and $\gamma$, respectively. In the scaling limit the lattice distance goes to zero whereas the length of the walk diverges. Consequently, the distribution diverges and must be replaced by an appropriate probability measure [4,13,43]. However, in order to define the probability distribution or measure in the scaling limit we shall assume that the Markov property continues to hold and interpret the P in Eq. (12) as probability measures. The Markov property is essential in carrying the lattice probability distributions over to the scaling limit. For the critical interfaces defined by an exploration process in the case of site percolation, the Markov property follows by inspection since the propagation of the interface is entirely determined by the local process of occupying the next site with probability 1/2. This locality property is specific for percolation, which has a geometrical phase transition, and does in the SLE context, to be discussed later, determine the scaling universality class. In the case of an interface in the Ising model the Markov property also holds, but since the spins interact a little calculation is required [13]. Consider an interface $\gamma$ defined by an exploration process. According to the rules of statistical mechanics the probability distribution for the interface is given by

$P(\gamma) = \frac{Z(\gamma)}{Z}$.  (13)

Here Z is the full partition function defined in Eq. (8) with appropriate boundary conditions imposed. $Z(\gamma)$ is the partial partition function with the spins associated with the interface $\gamma$ fixed,

$Z(\gamma) = \sum_{\{\sigma_i\};\gamma} \exp[-H(\gamma)/kT]$.  (14)

The Hamiltonian $H(\gamma)$ inferred from Eq. (6) is the energy of a spin configuration with the spins determining $\gamma$ fixed; $\{\sigma_i\};\gamma$ indicate the configurations to be summed over. Evidently, we have the identity $Z = \sum_\gamma Z(\gamma)$, i.e., $\sum_\gamma P(\gamma) = 1$. Whereas in the random walk case we only considered an individual path and in the percolation case the interface only feels the nearby sites, an Ising interface is embedded in the interacting spin system and we have to define
the Markov property more precisely with respect to a domain D. Consider an interface across the domain D from A to B and assume that the last part $\gamma$ is conditioned on the determination of the first part $\gamma'$, i.e., given by the distribution $P_D(\gamma|\gamma')$. Next imagine that we cut the domain D along the interface $\gamma'$, i.e., break the interaction bonds between the spins determining $\gamma'$. The right and left faces of $\gamma'$ can then be considered part of the domain boundary, and the Markov property states that the distribution of $\gamma$ in the cut domain $D\setminus\gamma'$ (i.e., D minus $\gamma'$) equals $P_D(\gamma|\gamma')$,

$P_D(\gamma|\gamma') = P_{D\setminus\gamma'}(\gamma)$.  (Markov property)  (15)

In order to demonstrate Eq. (15) we use the definition in Eq. (13). The conditional probability $P_D(\gamma|\gamma') = P_D(\gamma\gamma')/P_D(\gamma') = (Z_D(\gamma\gamma')/Z_D)/(Z_D(\gamma')/Z_D) = Z_D(\gamma\gamma')/Z_D(\gamma')$. Correspondingly, the distribution in the cut domain $D\setminus\gamma'$ is $P_{D\setminus\gamma'}(\gamma) = Z_{D\setminus\gamma'}(\gamma)/Z_{D\setminus\gamma'}$. However, it follows from the structure of the partition function in Eqs. (6–8) that $Z_{D\setminus\gamma'} = \exp[E(\gamma')/kT]\,Z_D(\gamma')$ and $Z_{D\setminus\gamma'}(\gamma) = \exp[E(\gamma')/kT]\,Z_D(\gamma\gamma')$, where $E(\gamma')$ is the energy of the broken bonds. By insertion the interface Boltzmann factors cancel out and we obtain Eq. (15) expressing the Markov property.

Conformal Invariance

In the complex plane analysis implies geometry. The representation of a complex number $z = x + iy$ directly associates complex function theory with 2D geometrical shapes. This connection is of importance in mathematical physics in for example 2D electrostatics and 2D hydrodynamics. In the context of SLE Riemann's mapping theorem plays an essential role [1].

Conformal Maps

Briefly, Riemann's mapping theorem [2] states that any simply connected domain in the complex plane z, i.e., one topologically deformable to a disk, can be uniquely mapped to the unit disk $|w| < 1$ in the complex w plane by means of a complex function $w = g(z)$. By combining complex functions we can map any simply connected domain to any other simply connected domain. For example, if $g_1(z)$ and $g_2(z)$ map domains $D_1$ and $D_2$ to the unit disk, respectively, then $g_2^{-1}(g_1(z))$ maps $D_1$ to $D_2$; here $g_2^{-1}$ is the inverse function of $g_2$, i.e., $g_2^{-1}(g_2(z)) = z$. As an example, the transformation $g(z) = i(1+z)/(1-z)$ maps the unit disk centered at the origin to the upper half plane. Likewise, the Möbius transformation
$g(z) = (az + b)/(cz + d)$, determined by four real parameters with $ad - bc > 0$, maps the upper half plane onto itself. Conformal transformations are angle-preserving and basically correspond to a combination of a local rotation, local translation, and local dilatation. In terms of an elastic medium picture conformal transformations are equivalent to deformations without shear [54]. Expressing $g(z)$ in terms of its real and imaginary parts, $g(z) = u(x,y) + iv(x,y)$, the Cauchy–Riemann equations $\partial u/\partial x = \partial v/\partial y$ and $\partial u/\partial y = -\partial v/\partial x$ hold, implying that u and v are harmonic functions satisfying Laplace's equations $\nabla^2 u = 0$ and $\nabla^2 v = 0$, $\nabla^2 = \partial^2/\partial x^2 + \partial^2/\partial y^2$. In Fig. 3 we have depicted a conformal angle-preserving transformation effectuated by the complex function $g(z)$ from the complex z plane to the complex w plane.

Stochastic Loewner Evolution: Linking Universality, Criticality and Conformal Invariance in Complex Systems, Figure 3
We depict a conformal transformation from the complex z plane to the complex w plane. We note the angle-preserving property, i.e., a shear-free transformation. The map in the figure is given by $w = z^2$

Measures – Conformal Invariance

Whereas the Markov property discussed above holds for lattice curves even away from criticality, we here want to assume another property which only holds in the scaling limit at the critical point, namely conformal invariance. In the scaling limit we anticipate that the probability measure $P(\gamma)$ for an interface $\gamma$ is invariant under a conformal transformation. More precisely, consider a lattice model, say the Ising or percolation model, and specify two domains D and D′ on the lattice. Next, consider an interface, cluster boundary or domain wall $\gamma$ from the boundary points A and B across the domain D. In terms of the partition functions the probability distribution for $\gamma$ is given by Eq. (13). We now perform the scaling or continuum limit of the lattice model keeping the domains D and D′ fixed. The continuous random interface approaches its scaling form and is characterized by the measure $P_D(\gamma)$. At the critical point we assume that the interface is scale invariant under the larger symmetry of conformal transformations. The next step is to consider a specific conformal transformation $g(z)$ which according to Riemann's theorem precisely maps domain D to domain D′, i.e., $D' = g(D)$, and the interface $\gamma$ to $g(\gamma)$. The assumption of conformal invariance then states that the probability measure P is invariant under this transformation expressing the scale invariance, i.e.,

$P_D(\gamma) = P_{g(D)}(g(\gamma))$.  (Conformal property)  (16)

The Markov property and the conformal property in combination with Loewner evolution are sufficient to determine the measures in the scaling limit.

Loewner Evolution
The original motivation of Loewner's work was to examine the so-called Bieberbach conjecture which states that $|a_n| \le n$ for the coefficients in the Taylor expansion $f(z) = \sum_{n=0} a_n z^n$, where $f(z)$ maps the unit disk to the complex plane. The conjecture was proposed in 1916 and finally proven in 1984 by de Branges [2,38,39]. For that purpose Loewner [64] considered growing parametrized conformal maps to a standard domain. In the present context Loewner's method allows us to access growing shapes in 2D in an indirect manner by means of a 'time dependent' conformal transformation $g_t(z)$.

Growing Stick

Before we address the derivation of the Loewner equation let us consider the specific conformal transformation

$g_t(z) = \sqrt{z^2 + 4t}$.  (17)

For t = 0 we have $g_0(z) = z$, i.e., the identity map. Likewise, for $z \to \infty$ we obtain

$g_t(z) \approx z + \frac{2t}{z}$,  (18)

showing that far away in the complex plane we again have the identity map. The coefficient in the next leading term, $C_t/z$, is called the capacity $C_t$; here parametrized by the 'time variable', $t = C_t/2$. The map (17) has a branch point at $z = 2it^{1/2}$ and it follows by inspection that $g_t$ maps the upper half plane minus a stick from the origin 0 to $2it^{1/2}$ back to the upper half plane. From the inverse map $f_t(w) = g_t^{-1}(w)$,

$f_t(w) = \sqrt{w^2 - 4t}$,  (19)
Stochastic Loewner Evolution: Linking Universality, Criticality and Conformal Invariance in Complex Systems, Figure 4
We depict the growing stick corresponding to the conformal transformation $w = \sqrt{z^2 + 4t}$. The vertical cut in the z plane extends from the origin to the point $(0, 2it^{1/2})$. The right and left faces of the cut are mapped to the real axis from $-2t^{1/2}$ to $+2t^{1/2}$, the endpoint to the origin, in the complex w plane
we infer that the right face of the stick is mapped to the real axis from 0 to $2t^{1/2}$, the tip to the origin, and the left face to the interval $-2t^{1/2}$ to 0. Under the map the growing stick thus becomes part of the boundary in the w plane. The growing stick is depicted in Fig. 4. The stick shows up as an imaginary contribution along the interval $-2t^{1/2}$ to $2t^{1/2}$. More precisely, since $f_t(w)$ is analytic in the upper half plane, implementing the asymptotic behavior $f_t(w) \approx w$ for $w \to \infty$, and using Cauchy's theorem, we obtain the dispersion relation or spectral representation

$f_t(w) = w - \int \frac{d\omega}{\pi} \frac{A_t(\omega)}{w - \omega}$,  (20)

with spectral weight $A_t(\omega)$. In the case of the growing stick we find $A_t(\omega) = (4t - \omega^2)^{1/2}$ for $\omega^2 < 4t$ and otherwise $A_t(\omega) = 0$. Using $1/(\omega + i\epsilon) = P\,1/\omega - i\pi\delta(\omega)$ (P denotes principal value) we also have $\mathrm{Im}\, f_t(\omega) = A_t(\omega)$ and $\mathrm{Re}\, f_t(\omega) = \omega - P\int (d\omega'/\pi)\, A_t(\omega')/(\omega - \omega')$. The time dependent spectral weight $A_t(\omega)$ thus characterizes the growing stick. With the chosen parametrization we also have the sum rule

$\int \frac{d\omega}{\pi} A_t(\omega) = C_t = 2t$.  (21)

Finally, we note that the map $g_t$ satisfies the equation of motion

$\frac{dg_t(z)}{dt} = \frac{2}{g_t(z)}$,  (22)

i.e., solving Eq. (22) with the initial condition $g_0(z) = z$ and the boundary condition $g_t(z) \approx z$ for $z \to \infty$ we arrive at the map in Eq. (17).

Loewner Equation

The growing stick nicely illustrates the idea of accessing a growing shape indirectly by the application of Riemann's
theorem mapping the domain adjacent to the shape to a standard reference domain, here the upper half plane. This so-called uniformizing map effectively absorbs the shape and encodes the information about the shape into the spectral weight $A_t(\omega)$ along the real axis. Let us consider a general shape or hull $K_t$ in the upper half plane H. Together with the real axis the shape forms part of the boundary of the domain D. In other words, the domain in question is the upper half plane H with the shape $K_t$ subtracted, $D = H \setminus K_t$. Applying Riemann's theorem we map the simply connected domain D back to the upper half plane H by means of the conformal transformation $g_t(z)$, i.e., $g_t$ absorbs the shape $K_t$. Imagine that the shape grows a little bit further from $K_t$ to $K_{t+\delta t} = K_t + \delta K_t$, where $K_t$ is now part of $K_{t+\delta t}$; $\delta K_t$ is the shape increment. Correspondingly, the map $g_{t+\delta t}$ is designed to absorb $K_{t+\delta t}$, i.e., $H \setminus K_{t+\delta t} \to H$ by means of the map $g_{t+\delta t}$. We now carry out the elimination in two ways. Either we absorb $K_t$ by means of the map $g_t$ and subsequently $\delta K_t$ by means of the map $\delta g_t$, or we absorb $K_{t+\delta t}$ directly in one step by means of the map $g_{t+\delta t}$. Consequently, combining maps we have $g_{t+\delta t}(z) = \delta g_t(g_t(z))$ or $g_t(z) = \delta g_t^{-1}(g_{t+\delta t}(z))$. Since $\delta g_t^{-1}(w)$ is analytic in H we obtain the spectral representation

$\delta g_t^{-1}(w) = w - \int \frac{d\omega}{\pi} \frac{\delta A_t(\omega)}{w - \omega}$,  (23)

with infinitesimal spectral weight $\delta A_t$, or inserting $w = g_{t+\delta t}(z)$,

$g_t(z) = g_{t+\delta t}(z) - \int \frac{d\omega}{\pi} \frac{\delta A_t(\omega)}{g_{t+\delta t}(z) - \omega}$.  (24)

The last step is to set $\delta A_t(\omega) = \rho_t(\omega)\delta t$, yielding a differential equation for the evolution of the map $g_t$ eliminating the shape $K_t$,

$\frac{dg_t(z)}{dt} = \int \frac{d\omega}{\pi} \frac{\rho_t(\omega)}{g_t(z) - \omega}$.  (25)
Stochastic Loewner Evolution: Linking Universality, Criticality and Conformal Invariance in Complex Systems, Figure 5
The combination of maps involved in the derivation of the Loewner equation. First the map $g_t$ eliminates the hull $K_t$. Subject to the growth in the time interval $\delta t$ the incremental hull $\delta K_t$ is subsequently absorbed by the infinitesimal map $\delta g_t$. Correspondingly, the hull $K_{t+\delta t}$ is absorbed by the map $g_{t+\delta t}$ in one step
Specifying the weight or measure $\rho_t(\omega)$ along the real $\omega$-axis, this equation determines, through the uniformizing map $g_t$, how the shape $K_t$ grows. The spectral weight encodes the 2D shape into the real function $\rho_t(\omega)$. Note that since $\rho_t(\omega)$ is not specified and can depend nonlinearly on the map, Eq. (25) still represents a highly nonlinear problem. Invoking the asymptotic condition $g_t(z) \approx z + C_t/z$, where $C_t$ is the time dependent capacity, we infer $dC_t/dt = \int (d\omega/\pi)\,\rho_t(\omega)$ or, since $\rho_t(\omega) = dA_t(\omega)/dt$, the sum rule in Eq. (21). The conformal mapping procedure is depicted in Fig. 5. In the special case where the growth takes place at a point the equation simplifies considerably. Assuming that the spectral weight is concentrated at the point $\omega = a_t$, where $a_t$ is a real function of t, and setting $\rho_t(\omega) = 2\pi\delta(\omega - a_t)$, we arrive at the Loewner equation

$\frac{dg_t(z)}{dt} = \frac{2}{g_t - a_t}$.  (26)
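As a quick numerical consistency check (illustrative, not part of the article), Eq. (26) with vanishing drive $a_t = 0$ reduces to Eq. (22), and a simple Euler integration starting from the initial condition $g_0(z) = z$ reproduces the exact growing-stick map (17).

```python
import cmath

def euler_loewner(z, t_max, steps):
    """Integrate dg/dt = 2/g, i.e., Eq. (26) with a_t = 0, by the
    explicit Euler method from the initial condition g_0 = z."""
    g, dt = complex(z), t_max / steps
    for _ in range(steps):
        g += dt * 2.0 / g
    return g

z, t = 1 + 1j, 0.5
exact = cmath.sqrt(z * z + 4 * t)       # the growing-stick map, Eq. (17)
approx = euler_loewner(z, t, 20000)     # agrees with `exact` to O(dt)
```

The Euler error is first order in the time step, so with 20000 steps the two values agree to several decimal places.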
The Loewner equation describes the growth of a curve or trace $\gamma_t$ with endpoint $z_t$, $0 < t < \infty$, in the upper half complex z-plane. The time-dependent conformal transformation $g_t$ maps the simply connected domain $H \setminus \gamma_t$, i.e., the half plane excluding the curve $\gamma_t$, back to the upper half w plane. At a given time instant t the tip of the curve $z_t$ is determined by $g_t(z_t) = a_t$, i.e., the point where Eq. (26) develops a singularity. The topological properties and shape of the curve are encoded in the real function $a_t$ which lives on the real axis in the w-plane. As $a_t$ develops in time the tip of the curve $z_t$, determined by $z_t = g_t^{-1}(a_t)$, traces out a curve. Since the domain $H \setminus \gamma_t$ must be simply connected for the Riemann theorem to apply, the curve or trace cannot cross itself or cross the real axis. Whenever the curve touches or intersects itself or the real axis the enclosed part will be excluded from the domain. In other words, during the time progression the curve effectively absorbs part of the upper half plane. It is a deep property of Loewner evolution that the topological properties of a 2D non-crossing curve are entirely encoded by the real function $a_t$. The encoding works both ways:

Stochastic Loewner Evolution: Linking Universality, Criticality and Conformal Invariance in Complex Systems, Figure 6
The mechanism in the Loewner equation. The curve in the upper half complex z plane generated by the Loewner equation is mapped onto a finite but growing segment of the real axis of the complex w plane. The endpoint $z_T$ is mapped to the real number $a_T$. As $a_t$ develops in time and makes excursions along the real axis the endpoint $z_t$ of the curve grows into the upper half plane

A given 2D non-crossing curve $\gamma_t$ corresponds to a specific real function $a_t$; a given real function $a_t$ yields a specific 2D non-crossing curve $\gamma_t$. A continuous $a_t$ will yield a continuous curve $\gamma_t$. A discontinuous $a_t$ in general gives rise to branching. Whether or not the curve intersects or touches itself is determined by the singularity structure of the drive $a_t$. In the case where the Hölder condition $\lim_{\epsilon\to 0}|(a_{t+\epsilon} - a_t)/\epsilon^{1/2}|$ yields a value greater than 4 we have self-intersection. Note again that since the curve is defined indirectly by the singularity structure in Eq. (26) we cannot easily identify a curve parametrization and for example determine a tangent vector, etc. The mechanism underlying the Loewner equation is shown in Fig. 6.

Exact Solutions

In a series of simple cases one can solve the Loewner equation analytically [42,44]. For vanishing drive $a_t = 0$ we
obtain the growing vertical stick discussed above. Correspondingly, a constant drive $a_t = a$ yields a vertical stick growing up from the point a on the real axis. In the case of a linear drive $a_t = t$ the tip of the curve $z_t$ is given by $z_t = 2 - 2r_t\cos\theta_t + 2i\theta_t$, where the phase $\theta_t$ is determined from the equations $2\ln r_t - r_t\cos\theta_t = 2\ln 2 + t - 2$ and $r_t = 2\theta_t/\sin\theta_t$. By inspection $\theta_0 = 0$ and $\theta_\infty = \pi$. The curve thus approaches the asymptote $2\pi i$ for $t \to \infty$. For small t the analysis yields $z_t \approx (2/3)t + 2i\sqrt{t}$, i.e., the trace approaches the origin with infinite slope. The square root drives $a_t = 2\sqrt{\kappa t}$ and $a_t = 2\sqrt{\kappa(1-t)}$, $0 < t < 1$, the latter with a finite-time singularity, can also be treated. In the first case, $a_t = 2\sqrt{\kappa t}$, the trace is a straight line $z_t = B\exp(i\phi)\sqrt{t}$ forming the angle $\phi$ with respect to the real axis. The amplitude B and phase $\phi$ depend on the parameter $\kappa$. The angle is $\phi = (\pi/2)(1 - \kappa^{1/2}/(\kappa+4)^{1/2})$. For $\kappa = 0$, $\phi = \pi/2$ and we recover the perpendicular stick; for $\kappa \to \infty$, $\phi \to 0$ and the angle of intersection decreases to zero. In the second case, $a_t = 2\sqrt{\kappa(1-t)}$, the behavior of the trace is more complex. For $0 < \kappa < 4$ the trace forms a finite spiral in the upper half plane; for $\kappa = 4$ the trace has a glancing intersection with the real axis. For $4 < \kappa < \infty$ the trace hits the real axis, in accordance with the Hölder condition discussed above.

Stochastic Loewner Evolution

After these preliminaries we are in position to address stochastic Loewner evolution (SLE). The essential observation made by Oded Schramm [77] within the context of loop erased random walk was that the Markov and conformal properties of the measures or probability distributions for random curves generated by Loewner evolution imply that the random drive $a_t$ must be proportional to an unbiased 1D Brownian motion.
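Schramm's observation can be turned into a numerical recipe: sample a Brownian drive $a_t = \sqrt{\kappa}\,B_t$ and invert the discretized Loewner map. The sketch below (an illustrative discretization, not part of the article) approximates $g_t$ as a composition of elementary slit maps and obtains the tip $z_t = g_t^{-1}(a_t)$ by composing the inverse incremental maps $w \mapsto a + \sqrt{(w-a)^2 - 4\,\delta t}$.

```python
import cmath
import random

def sle_trace(kappa, n_steps, dt, seed=0):
    """Generate points on an SLE trace by discretizing Loewner
    evolution, Eq. (26), with Brownian drive a_t = sqrt(kappa) B_t.
    The tip z_t = g_t^{-1}(a_t) is evaluated by composing the
    inverse elementary vertical-slit maps."""
    rng = random.Random(seed)
    a, drive = 0.0, []
    for _ in range(n_steps):
        a += (kappa * dt) ** 0.5 * rng.gauss(0.0, 1.0)
        drive.append(a)
    trace = []
    for k in range(n_steps):
        w = complex(drive[k])
        # apply the inverse incremental maps in reverse order,
        # picking the square-root branch lying in the upper half plane
        for j in range(k, -1, -1):
            s = cmath.sqrt((w - drive[j]) ** 2 - 4 * dt)
            if s.imag < 0:
                s = -s
            w = drive[j] + s
        trace.append(w)
    return trace
```

The resulting points lie in the upper half plane and trace out a random non-crossing curve whose roughness grows with the strength $\kappa$ of the driving random walk, in line with the discussion of the fractal dimension later in the article.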
Schramm's Theorem

The Loewner Equation (26) generates a non-crossing curve in the upper half plane H originating at the origin O, given a continuous function $a_t$ with initial value $a_0 = 0$. As $a_t$ develops in time the tip of the curve $z_t$, determined by the condition $g_t(z_t) = a_t$, traces out a curve or trace. In the case where $a_t$ is a continuous random function the Loewner Equation (26) likewise becomes a stochastic equation of motion yielding a stochastic map $g_t(z)$. As a result the trace determined by $g_t(z_t) = a_t$ or $z_t = g_t^{-1}(a_t)$ is a random curve. The issue is to establish a contact between the exploration processes defining interfaces in the lattice models, the scaling limit of these curves, and the curves generated by SLE. In the scaling limit we thus invoke the two properties discussed above: i) the Markov property in Eq. (12) and ii) the conformal property in Eq. (16). In order to demonstrate the surprising property that the Markov and conformal properties in combination imply that $a_t$ must be a 1D Brownian motion we focus on chordal SLE, which applies to a random curve or trace connecting two boundary points. Since the probability distribution or measure $P(\gamma)$ on the random curve is by property ii) assumed to be conformally invariant, and since we by Riemann's theorem can map any simply connected domain to the upper half plane by means of a conformal transformation, we are free to consider curves in the upper half plane from the origin O to $\infty$, parametrized with a time coordinate $0 < t < \infty$. Imagine that we grow the curve $\gamma$ from time t = 0 to time T driven by the function $a_t$, $0 < t < T$. The curve or trace is generated by the Loewner Equation (26) with boundary condition $a_0 = 0$ and the trace $z_t$ by $g_t(z_t) = a_t$ or $z_t = g_t^{-1}(a_t)$. With the chosen time parametrization we have $g_t(z) \approx z + 2t/z$ for $z \to \infty$ in the upper half plane. The map $g_t$ thus uniformizes the trace, i.e., the tip $z_t$ is mapped to $a_t$ on the real axis in the w-plane. In order to invoke the Markov property we let the curve grow the time increment $\Delta T$ corresponding to the curve segment $\delta\gamma$. The Markov property then implies that the distribution on $\delta\gamma$ conditioned on the distribution on $\gamma$ is the same as the distribution on $\delta\gamma$ in the cut domain $H \setminus \gamma$, i.e., the domain with the curve $\gamma$ deleted; this stage is illustrated in Fig. 7.

Stochastic Loewner Evolution: Linking Universality, Criticality and Conformal Invariance in Complex Systems, Figure 7
The figure depicts the construction in the derivation of SLE. The first step implements the Markov property by turning the curve into a cut. Subsequently, the conformal transformation $h_T$ maps the curve back to the origin. Finally, the map $h_{\Delta T}$ maps the segment to the origin. The complete process is also implemented by $h_{T+\Delta T}$. The combination of the Markov property and conformal invariance implies that $a_t$ performs a Brownian motion

Next, in order to implement the conformal property we shift the image by $a_T$ in such a way that the curve segment $\delta\gamma$ again starts at the origin O. This is achieved by using the map $h_T = g_T - a_T$ which, since $g_T(z_T) = a_T$, maps the tip $z_T$ back to the origin; this construction is
also depicted in Fig. 7. Moreover, the asymptotic behavior of h_T for large z is h_T(z) ≈ z − a_T + 2T/z. Since the measure by assumption is unchanged under the conformal transformation h_T, we infer that the segment δγ grown during a time δT from the origin has the same distribution as the segment grown from time T to time T + δT conditioned on the segment grown up to time T. Moreover, since the segment γ from O to z_T subject to the Markov property has become part of the boundary, as shown in Fig. 7, we also infer that the measure on δγ is independent of the measure on γ. Finally, applying h_δT we map the segment δγ to the origin as a common reference point, as indicated in Fig. 7. Since the random curves are determined by the random maps g_t and h_t driven by the random function a_t, the issue is how to transfer the properties of the measure on the curve, determined by the Markov and conformal properties, to the measure on the random driving function a_t. In order to combine the Markov property arising from the analysis of the lattice models and the conformal invariance pertaining to critical random curves, we carry out the following steps. First we grow the curve γ from the origin O to the tip z_T. Invoking conformal invariance, the curve is then uniformized back to the origin by means of h_T. The next step is to grow the curve segment δγ in time δT. This segment is subsequently absorbed by means of the map h_δT. According to the Markov property the distribution of δγ from O to z_δT is the same as the distribution of γ grown from T to T + δT conditioned on γ grown from 0 to T. Since the curve is determined by the map h_t, the distribution is reflected in h_t. In particular, the stochastic properties of the curve are transferred to the random function a_t generating the curve by the Loewner evolution.
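The hydrodynamic normalization g_t(z) ≈ z + 2t/z used in this construction can be checked explicitly on the zero-drive case, where the Loewner equation dg/dt = 2/g with g_0(z) = z is solved by the standard slit map g_t(z) = √(z² + 4t); a minimal numerical sketch (the sample point z is an arbitrary choice):

```python
import cmath

def g(z, t):
    # Zero-drive Loewner map (the "growing stick"): solves dg/dt = 2/g,
    # g_0(z) = z, i.e. g_t(z) = sqrt(z^2 + 4t)
    return cmath.sqrt(z * z + 4 * t)

# Hydrodynamic normalization: g_t(z) ~ z + 2t/z for z -> infinity
z, t = 200.0 + 100.0j, 1.0
print(abs(g(z, t) - (z + 2 * t / z)))  # O(t^2/|z|^3), i.e. tiny
```

The error decays like t²/|z|³, which fixes the time parametrization (half-plane capacity) used throughout.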
The last step is now to observe that absorbing the segment δγ from O to z_δT by means of h_δT is the same transformation as first applying the inverse map h_T^(−1) followed by the map h_{T+δT}; in both cases the end result is the absorption of the curve grown up to time T + δT, see Fig. 7. As regards the measure or distribution we have the equivalence h_δT(z) ≡ h_{T+δT}(h_T^(−1)(z)). Using the asymptotic form h_t(z) ≈ z − a_t + 2t/z we obtain a_{T+δT} − a_T ≡ a_δT; note that ≡ indicates identical distributions or measures. In conclusion, the Markov property in combination with conformal invariance implies that a_{T+δT} − a_T is distributed like a_δT (stationarity) and that a_δT and a_δT′ are independently distributed for non-overlapping time intervals δT and δT′ (Markov property). Referring to Sect. "Scaling" on Brownian motion as expressed in Eqs. (3) and (4), i.e., stationarity and independence, we infer that a_t is proportional to a Brownian motion of arbitrary strength κ, i.e., a_t = √κ B_t. Note that the reflection symmetry x → −x holding in the present context rules out a bias or drift in the 1D Brownian motion [13,27]. This is the basic conclusion reached by Schramm in the context of loop erased random walk. Driving the Loewner evolution by means of 1D Brownian motion with different diffusion coefficient or strength κ, we generate a one-parameter family of conformally invariant or scale invariant non-crossing random curves in the plane.

SLE Properties

Stochastic Loewner evolution is determined by the nonlinear stochastic equation of motion

dg_t/dt = 2/(g_t − a_t) ,  a_t = √κ B_t .  (27)
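The composition identity used above, h_δT ≡ h_{T+δT} ∘ h_T^(−1), holds exactly (not just in distribution) for a deterministic drive. A quick sketch for zero drive, where h_t = g_t = √(z² + 4t):

```python
import cmath

def g(z, t):
    # zero-drive Loewner map g_t(z) = sqrt(z^2 + 4t), with its inverse below
    return cmath.sqrt(z * z + 4 * t)

def g_inv(w, t):
    return cmath.sqrt(w * w - 4 * t)

# Absorbing the piece grown between T and T+dT equals undoing g_T and
# then applying g_{T+dT}:
z, T, dT = 1.0 + 2.0j, 0.5, 0.1
lhs = g(g_inv(z, T), T + dT)   # h_{T+dT}(h_T^{-1}(z))
rhs = g(z, dT)                 # h_{dT}(z)
print(abs(lhs - rhs))  # essentially zero (rounding only)
```

For a random drive the same identity holds in distribution, which is what forces stationary independent increments of a_t.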
In the course of time a_t performs a 1D Brownian motion on the real axis starting at the origin, a_0 = 0; a_t is a random continuous function of t, distributed according to √κ B_t. More precisely, a_t is given by the Gaussian distribution

P(a, t) = (2πκt)^(−1/2) exp[−a²/2κt] ,  (28)
with correlations

⟨(a_t − a_s)²⟩ = κ|t − s| .  (29)
First we notice that a constant shift of the drive a_t, a_t → a_t + b, is readily absorbed by a corresponding shift of the map, g_t → g_t + b. Moreover, using the scaling property of Brownian motion, B_{λ²t} ≡ λB_t, following from e.g. Eq. (5), we have a_{λ²t} ≡ λa_t, and we conclude from Eq. (27) that g_t(z) has the same distribution as (1/λ)g_{λ²t}(λz), i.e., g_{λ²t}(λz) ≡ λg_t(z). Note that this dilation invariance is consistent since the origin z = 0 and z = ∞, the endpoints of the curves, are preserved. Note also that the strength κ of the drive is an essential parameter which cannot be scaled away.

Curves – Hulls – Bessel Process

For vanishing drive, κ = 0, the SLE yields a non-random vertical line from z = 0, i.e., the growing stick discussed in Subsect. "Growing Stick". As we increase κ the curve becomes random, with excursions to the right and to the left in the upper half plane. Up to a critical value of κ the random curve is simple, i.e., non-touching or non-self-intersecting. At a critical value of κ the Brownian drive is so strong that the curve begins to intersect itself and the real axis. These intersections take place on all scales
since the curve is self-similar or scale invariant. Denoting the curve by γ_t, we observe that since Riemann's theorem uniformizing H \ γ_t to H only applies to a simply connected domain, the regions enclosed by the self-intersections do not become uniformized but are effectively removed from H. The curve γ_t together with the enclosed parts is called the hull K_t, and the mapping theorem applies to H \ K_t. In order to analyze the critical value of κ we consider the stochastic equation for h_t(z) = g_t(z) − a_t. From Eq. (27) it follows that

dh_t/dt = 2/h_t − ξ_t ,  (30)
where ξ_t = da_t/dt is white noise with correlations ⟨ξ_t ξ_s⟩ = κδ(t − s). The nonlinear complex Langevin Equation (30) maps the tip of the curve z_t back to the origin. Likewise H \ γ_t is mapped onto H. A point x on the real axis is mapped to x_t = h_t(x), where x_t satisfies

dx_t/dt = 2/x_t − ξ_t .  (31)
The Langevin Equation (31) is known as the Bessel equation (the sign of the white noise term is immaterial for its distribution) and governs the radial distance R from the origin of a Brownian particle in d dimensions. Introducing R = (Σ_{i=1}^d B_i²)^(1/2), where B_i, i = 1,…,d, is a 1D Brownian motion with distribution P(B, t) = (2πκt)^(−1/2) exp[−B²/2κt], we find P(R, t) ∝ (κt)^(−d/2) R^(d−1) exp[−R²/2κt], satisfying the Fokker–Planck equation ∂P/∂t = (κ/2)∂²P/∂R² − (κ(d − 1)/2)∂(P/R)/∂R, corresponding to the Langevin equation

dR/dt = κ(d − 1)/2R + ξ_t .  (32)
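The d-dimensional construction is easy to test numerically. A minimal sketch using the representation R = (Σ_i B_i²)^(1/2) with each B_i a 1D Brownian motion of strength κ; Itô's formula applied to Eq. (32) gives d⟨R²⟩/dt = κd exactly, which the sample average reproduces:

```python
import numpy as np

def bessel_radius(d, kappa, t, n_paths=20000, r0=1.0, seed=0):
    """Radial distance R(t) of a d-dimensional Brownian particle.

    Uses R = (sum_i B_i^2)^(1/2) with B_i(t) ~ N(0, kappa*t), started
    on the first axis so that R(0) = r0.
    """
    rng = np.random.default_rng(seed)
    b = np.sqrt(kappa * t) * rng.standard_normal((n_paths, d))
    b[:, 0] += r0
    return np.sqrt((b ** 2).sum(axis=1))

# <R^2(t)> = r0^2 + kappa*d*t; here 1 + 2*3*1 = 7
r = bessel_radius(d=3, kappa=2.0, t=1.0)
print(np.mean(r ** 2))  # close to 7
```

The parameter choices (d = 3, κ = 2, 20000 paths) are illustrative only; any d > 2 shows the transient behavior invoked below.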
For d ≤ 2 Brownian motion is recurrent, i.e., the particle returns to the origin R = 0; for d > 2 the particle goes off to infinity. Referring to Eq. (31) and setting κ(d − 1)/2 = 2, we obtain κ = 4/(d − 1), i.e., R → ∞ for κ < 4 and R → 0 for κ > 4. Since the tip of the curve z_t is mapped to a_t, i.e., h_t(z_t) = 0, the case x_t → ∞ for κ < 4 corresponds to a curve never intersecting the real axis, i.e., the curve is simple. For κ > 4 we have x_t → 0, corresponding to the case where the tip z_t intersects the real axis, forming a hull. Since the curve is self-similar, the intersections take place on all scales and eventually the whole upper half plane is engulfed by the hull. The marginal value κ = 4 can also be inferred from a simple heuristic argument [27]. For small κ we can ignore the noise in Eq. (31), and the particle is repelled according to the solution x_t² ≈ 4t. For large noise we ignore the nonlinear term, and the noise can drive x_t to zero; we have x_t² ≈ κt. The balance is obtained for κ = 4. In conclusion, for κ ≤ 4 the curve is simple; for κ > 4 the curve intersects itself and the real axis infinitely many times on all scales, and eventually the hull swallows the whole plane. For large κ the trace turns out to be plane-filling. The two cases κ ≤ 4 and κ > 4 are depicted in Fig. 8.

Stochastic Loewner Evolution: Linking Universality, Criticality and Conformal Invariance in Complex Systems, Figure 8
The figures depict the phases of SLE. For κ ≤ 4 the SLE trace is a simple non-intersecting scale invariant random curve from the origin to infinity with a fractal dimension between 1 and 3/2. For 4 < κ < 8 the SLE curve is self-intersecting on all scales and also intersects the real axis on all scales. The curve together with the enclosed regions, the hull, eventually exhausts the upper half plane. The scale invariant hull has a fractal dimension ranging between 3/2 and 2. For κ ≥ 8 the fractal dimension of the hull locks onto 2 and the scale invariant hull is dense and plane-filling

Fractal Dimension

The SLE random curves are fractal. An important issue is thus the determination of the fractal dimension D in terms of the Brownian strength or SLE parameter κ. It has been shown that [15,16,74]

D = 1 + κ/8  for 0 ≤ κ ≤ 8 ;  (33)

for κ ≥ 8 the fractal dimension locks onto 2 and the SLE curve is plane-filling. In order to illustrate a typical SLE calculation we follow Cardy [27] in a heuristic derivation of Eq. (33). In order to evaluate D, and in accordance with its definition [32,68], the standard procedure is to cover the object with disks of size ε and follow how the number of disks N(ε) scales with ε for small ε, i.e., N(ε) ∝ ε^(−D). However, since the SLE curve is random, the argument has to be rephrased. Alternatively, we consider a disk of size ε located at a fixed position z and ask for the probability P(z, ε) that the SLE curve crosses the disk. The number of disks covering an area A is N_A = A/ε², i.e., P ∝ ε^(−D)/N_A, and we infer P(z, ε) ∝ ε^(2−D), where D is the fractal dimension. Incorporating the Markov property, we subject the curve to an infinitesimal conformal transformation, h_δt = g_δt − a_δt,
transforming the point z to w = g_δt(z) − a_δt; moreover, all lengths are scaled by |h′_δt(z)| [1]. Setting z = x + iy and w = x′ + iy′ we obtain, expanding Eq. (30), x′ + iy′ = x + iy + 2δt/(x + iy) − √κ δB_t, or x′ = x + 2xδt/(x² + y²) − √κ δB_t, y′ = y − 2yδt/(x² + y²), ε′ = ε|h′_δt(z)|, and |h′_δt(z)| = 1 − 2δt(x² − y²)/(x² + y²)². By conformal invariance the probability measure P(x, y, ε) is unchanged and we infer

P(x, y, ε) = ⟨P(x′, y′, ε′)⟩_δB ,  (34)
where we have averaged over the Brownian motion referring to the initial part of the curve, which has been eliminated by the map h_δt. Expanding Eq. (34) to first order in δt and noting that ⟨(δB_t)²⟩ = δt, we arrive at a partial differential equation for P(x, y, ε),

[ (2x/(x² + y²)) ∂/∂x − (2y/(x² + y²)) ∂/∂y + (κ/2) ∂²/∂x² − (2(x² − y²)/(x² + y²)²) ε ∂/∂ε ] P = 0 .  (35)
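The operator in Eq. (35) can be checked numerically: the power-law eigenfunction quoted in Eq. (36) below should be annihilated by it for any κ. A finite-difference sketch (the sample points and the step h are arbitrary numerical choices):

```python
import numpy as np

def P(x, y, eps, kappa):
    # candidate eigenfunction, Eq. (36)
    return (eps ** (1 - kappa / 8)
            * y ** ((kappa - 8) ** 2 / (8 * kappa))
            * (x ** 2 + y ** 2) ** ((kappa - 8) / (2 * kappa)))

def residual(x, y, eps, kappa, h=1e-4):
    """Apply the differential operator of Eq. (35) to P via central differences."""
    S = x ** 2 + y ** 2
    Px = (P(x + h, y, eps, kappa) - P(x - h, y, eps, kappa)) / (2 * h)
    Py = (P(x, y + h, eps, kappa) - P(x, y - h, eps, kappa)) / (2 * h)
    Pxx = (P(x + h, y, eps, kappa) - 2 * P(x, y, eps, kappa)
           + P(x - h, y, eps, kappa)) / h ** 2
    Pe = (P(x, y, eps + h, kappa) - P(x, y, eps - h, kappa)) / (2 * h)
    return (2 * x / S * Px - 2 * y / S * Py + kappa / 2 * Pxx
            - 2 * (x ** 2 - y ** 2) / S ** 2 * eps * Pe)

print(abs(residual(0.3, 0.7, 0.1, 6.0)))  # ~0 up to discretization error
```

The residual vanishes (up to O(h²) truncation and rounding) for any κ, which is the "by inspection" step of the text made concrete.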
Since P(x, y, ε) ∝ ε^(2−D) we have ε ∂P/∂ε = (2 − D)P, and the determination of D is reduced to an eigenvalue problem. By inspection one finds

P ∝ ε^(1−κ/8) y^((κ−8)²/8κ) (x² + y²)^((κ−8)/2κ) ,  (36)
and we identify the fractal dimension D = 1 + κ/8 for κ < 8; for κ > 8 another solution yields D = 2.

Results and Discussion

Stochastic Loewner evolution based on Eq. (27) generates conformally invariant non-crossing random curves in the upper half plane starting at the origin and going off to infinity. This is the case of chordal SLE, where the random curve connects two boundary points (the origin O and infinity ∞). Another case is radial SLE, for random curves connecting a boundary point and an interior point in a simply connected domain [13,27,43]. Radial SLE is governed by another stochastic equation and will not be discussed here.

Phase Transitions – Locality – Restriction – Duality

Phase Transitions SLE exhibits two phase transitions, at κ = 4 and κ = 8. For 0 < κ ≤ 4 the random curve is non-intersecting, i.e., a simple random continuous curve from O to ∞. For 4 < κ < 8 the curve is self-intersecting on all scales. The curve together with the excluded regions forms a hull which in the course of time absorbs
the upper half plane. For κ just above 4 the half plane is eventually absorbed, but the trace does not visit all regions, i.e., the hull is not dense. As we approach κ = 8 the trace becomes more dense and the hull becomes plane-filling. This is also reflected in the fractal dimension D = 1 + κ/8. For κ ≥ 8 the hull is plane-filling, i.e., D = 2. As we increase the strength of the Brownian drive further, the excursions of the trace to the right and left in the upper half plane become more pronounced and the hull becomes vertically compressed. These results have been obtained by Rohde and Schramm [74] and Lawler et al. [57]. The various phases of SLE are depicted in Fig. 8. In Fig. 9 we have depicted numerical renderings of SLE traces for various values of κ (with permission from V. Beffara, http://www.umpa.ens-lyon.fr/~vbeffara/simu.php).

Locality – Restriction In addition to the phase transitions at κ = 4 and κ = 8, there are special values of κ where SLE shows a behavior characteristic of the scaling limit of specific lattice models: the locality property for κ = 6 and the restriction property for κ = 8/3. The issue here is the influence of the boundary on the SLE trace. To illustrate the locality property, consider for example the SLE trace originating at the origin and purporting to describe the scaling limit of a domain wall in a lattice model. Due to the long range correlations at the critical point, it is intuitively clear that a deformation of the boundary, e.g., a bulge L on the real axis to the right of the origin, will influence the trace and push it to the left. A detailed analysis shows that only for κ = 6 is the trace independent of a change of the boundary, i.e., the trace does not feel the boundary until it encounters a boundary point [59,63].
Returning to the lattice models, the locality property for κ = 6 applies specifically to the percolation case, where the interface generated by the exploration process is governed by a local rule and the model has a geometric phase transition. The restriction property is less obvious to visualize but basically states that the distribution of traces conditioned not to hit a bulge L on the real axis away from the origin is the same as the distribution of traces in the domain where L is part of the boundary, i.e., in the domain H \ L. Analysis shows that the restriction property only applies in the case κ = 8/3. Among the lattice models only the scaling limit of self-avoiding random walk (SAW), where the measure is uniform, conforms to the restriction property and thus corresponds to κ = 8/3 [63].

Duality For κ > 4 the SLE generates a hull of fractal dimension D > 3/2. The boundary, external perimeter, or frontier of the hull is again a simple conformally invariant
Stochastic Loewner Evolution: Linking Universality, Criticality and Conformal Invariance in Complex Systems, Figure 9
We depict numerical renderings of SLE for a variety of κ values. In a we show loop erased random walk (LERW) for κ = 2 with fractal dimension D = 5/4. In b we illustrate the case of self-avoiding random walk (SAW) for κ = 8/3 and fractal dimension D = 4/3; both LERW and SAW have κ < 4 and are simple scale invariant random curves. In c we depict site percolation for κ = 6 with fractal dimension D = 7/4. Since κ > 4 the percolation case is self-intersecting, and duality implies that the boundary or frontier of the hull is described by an SLE curve for κ̄ = 16/6 = 8/3, i.e., the case of SAW. In d we show the Ising case for κ = 3 and fractal dimension D = 11/8. In e we depict the limiting case κ = 8 with fractal dimension D = 2. The hull is dense and plane-filling. The frontier of the hull corresponds to the SLE case κ̄ = 16/8 = 2, i.e., the case of LERW. The so-called uniform spanning tree (UST) has the same properties as LERW, and the SLE case κ = 8 can thus be thought of as a random plane-filling Peano curve wrapping around the UST. Finally, in f we show the SLE trace and hull for κ = 20 and D = 2. Because of the large Brownian excursions the plane-filling hull is vertically compressed (with permission from V. Beffara: http://www.umpa.ens-lyon.fr/~vbeffara/simu.php)
random curve, characterized by the fractal dimension D̄. Using methods from 2D quantum gravity, Duplantier [31] has proposed the relationship

(D − 1)(D̄ − 1) = 1/4 ,  (37)

between the fractal dimension of the hull and its frontier. This result has been proved by Beffara for κ = 6, i.e., the percolation case [16]. Inserting Eq. (33) we obtain for the corresponding SLE parameters the duality relation

κκ̄ = 16 .  (38)
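A quick numerical reading of Eqs. (33), (37) and (38); the helper names dim and frontier_dim are ad hoc choices for this sketch:

```python
def dim(kappa):
    # Eq. (33): D = 1 + kappa/8 (valid for 0 <= kappa <= 8)
    return 1 + kappa / 8

def frontier_dim(kappa):
    # duality, Eq. (38): the frontier is SLE with kappa_bar = 16/kappa
    return dim(16 / kappa)

# percolation hull (kappa = 6): the frontier has the SAW dimension 4/3
print(frontier_dim(6.0))  # 1.333...
# Duplantier's relation (37): (D - 1)(D_bar - 1) = 1/4
print((dim(6.0) - 1) * (frontier_dim(6.0) - 1))
```

The two relations are algebraically equivalent given Eq. (33): (κ/8)(κ̄/8) = 1/4 is exactly κκ̄ = 16.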
Loop Erased Random Walk (LERW)

Whereas the scaling limit of random walk, i.e., Brownian motion, does not fall in the SLE category because of self-crossings rendering Riemann's mapping theorem inapplicable, variations of Brownian motion are described by SLE. Loop erased random walk (LERW), where loops are removed along the way, is by construction self-avoiding and was introduced as a simple model of a self-avoiding random walk. LERW was studied by Schramm in his pioneering work [77]. LERW has the Markov property and has been proved to be conformally invariant in the scaling limit and described by SLE for κ = 2 [57]. According to Eq. (33) LERW has the fractal dimension D = 5/4 [67]. Also, since κ < 4, LERW is non-intersecting. A simulation of LERW based on SLE is shown in Fig. 9a.

Self-Avoiding Random Walk (SAW)

Self-avoiding random walk (SAW) is a random walk conditioned not to cross itself. SAW has been used to model polymers in a dilute solution and has a uniform probability measure. Since SAW satisfies the restriction property, it is conjectured in the scaling limit to fall in the SLE class with κ = 8/3 [46,47,48,61], yielding the fractal dimension D = 4/3. We note that in Flory's mean field theory [30] the size R of a polymer composed of N links (monomers) scales like R ≈ N^ν, where ν = 3/(2 + d) for d ≤ 4. By a box covering we infer N ≈ R^D, where D is the fractal dimension, i.e., D = (2 + d)/3. In d = 2 we obtain D = 4/3, in accordance with the SLE result. SLE induced SAW in the scaling limit is shown in Fig. 9b.

Percolation

The scaling limit of site percolation was conjectured by Schramm [77] to fall in the SLE class for κ = 6. Subsequently, the scaling limit of site percolation on a triangular lattice has been proved by Smirnov [78,80]. Percolation exhibits a geometrical phase transition. In the exploration
process defining a critical interface, the rule for propagation is entirely local. The lack of stiffness, as compared for example to the Ising case to be discussed below, results in a strongly meandering path winding back and, in the scaling limit, intersecting earlier parts of the path. Since κ > 4 the path together with the enclosed part, i.e., the hull, engulfs the whole plane in the course of time. As discussed above, the locality property is specific to percolation and yields κ = 6. The fractal dimension of the percolation interface is according to Eq. (33) D = 7/4. We note that D is close to 2, i.e., the percolation interface nearly covers the plane densely. A series of new results and a proof of Cardy's conjectured formula for the crossing probability have appeared; we refer to [13,27,43] for details. Using the duality relation (38), the frontier of the percolation hull is a simple SLE curve for κ̄ = 16/6 = 8/3, corresponding to SAW. In Fig. 9c we have depicted an SLE generated percolation interface.

Ising Model – O(n) Models

The Ising model in Eq. (6) is a special case of the O(n) model defined by the Hamiltonian

H = −J Σ_⟨ij⟩ σ_i · σ_j ,  (39)
where σ_i = (σ_1, …, σ_n) is an n-component unit vector associated with the site i. For n = 1 we recover the Ising model, n = 2 is the XY model [29], and n = 3 the Heisenberg model. By means of the Fortuin–Kasteleyn (FK) transformation based on a high temperature expansion, the configurations of the O(n) model can be described by clusters or graphs on a dual lattice [35]. The crossing domain wall in Fig. 1 is thus a special case of an FK graph if we interpret the representation as a triangular Ising model. It has been conjectured that n is related to the SLE parameter κ by

n = −2 cos(4π/κ)  for 8/3 ≤ κ ≤ 4 .  (40)
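Equation (40) can be evaluated for the familiar members of the O(n) family; a minimal check:

```python
import math

def o_n(kappa):
    # Eq. (40): n = -2 cos(4*pi/kappa), for 8/3 <= kappa <= 4
    return -2 * math.cos(4 * math.pi / kappa)

print(round(o_n(3.0), 12))         # 1.0 -> Ising
print(round(o_n(4.0), 12))         # 2.0 -> XY model
print(abs(round(o_n(8 / 3), 12)))  # 0.0 -> SAW (n = 0)
```

The endpoint κ = 8/3 thus recovers the n → 0 limit, the standard polymer (SAW) interpretation of the O(n) model.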
In the Ising case n = 1 and we have κ = 3, yielding the fractal dimension D = 11/8 for the Ising domain wall [84]. Since κ < 4, the Ising domain wall is non-intersecting. Unlike the percolation case, the Ising interface is stiffer due to the interaction. We also note that the scaling limit of spin cluster boundaries in the Ising model has recently been proved to correspond to SLE for κ = 3 [79]. The interface is shown in Fig. 9d.

SLE – Conformal Field Theory

Whereas conformal field theory (CFT) is based on the concept of a local field φ(r) and its correlations and therefore
only accesses the underlying geometry indirectly through field correlations, SLE directly produces conformally invariant geometrical objects. A major issue is therefore the connection between CFT and SLE [6,7,26]. In CFT the central charge c plays an important role in delimiting the universality classes of the variety of lattice models yielding conformal field theories in the scaling limit. Percolation thus corresponds to the central charge c = 0, whereas the Ising model is associated with the central charge c = 1/2. It has been conjectured that the connection between the SLE parameter κ and the central charge c is given by

c = (6 − κ)(3κ − 8)/2κ = 1 − 6(κ − 4)²/4κ .  (41)
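The conjectured map (41) and its duality invariance are easy to verify numerically:

```python
def central_charge(kappa):
    # Eq. (41): c = (6 - kappa)(3*kappa - 8) / (2*kappa)
    return (6 - kappa) * (3 * kappa - 8) / (2 * kappa)

print(central_charge(6.0))  # 0.0  percolation
print(central_charge(3.0))  # 0.5  Ising
# invariance under the duality kappa -> 16/kappa
for k in (2.0, 3.0, 6.0, 8.0):
    assert abs(central_charge(k) - central_charge(16 / k)) < 1e-12
```

A dual pair (κ, 16/κ), e.g. a hull and its frontier, thus shares a single central charge, consistent with both describing the same critical model.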
We note that c ≤ 1 and, moreover, that c is invariant under the duality transformation κ → 16/κ.

SLE – 2D Turbulence

There is an interesting application of SLE ideas in the context of 2D turbulence. The issue here is to analyze conformal invariance by comparing the statistical properties of geometrical shapes like domain walls with SLE traces, with the view of determining the SLE parameter κ and the corresponding universality class. In 3D, turbulence is governed by the incompressible Navier–Stokes equation for the velocity field. Since the viscosity is only effective at small length scales, 3D turbulence is characterized by a cascade of kinetic energy (1/2)v² from large scales (driving scale) to small scales (dissipation scale). In the inertial regime the energy spectrum E(k) (k is the wavenumber) is characterized by the celebrated Kolmogorov 5/3 law [46], E(k) ∝ k^(−5/3), indicating an underlying scale invariance in turbulence. In 2D the cascade picture is different. Since both kinetic energy and squared vorticity (enstrophy) are conserved in the absence of dissipation and forcing, two cascades coexist [52,53]: a direct cascade to small scales for the squared vorticity ω² = (∇ × v)² with scaling exponent −3, and an inverse cascade to larger scales for the kinetic energy (1/2)v² with Kolmogorov scaling exponent −5/3. The system is thus characterized by a fine scale vorticity structure together with a large scale velocity structure. Moreover, we can assume that the vorticity structure is equipartitioned, i.e., in equilibrium. In order to investigate whether the scale invariance of the small scale vorticity structure can be extended to conformal invariance, Bernard et al. [17] have considered the statistics of the boundaries of vorticity clusters. By comparing the zero-vorticity isolines with SLE traces, they find that cluster boundaries fall in the universality class corresponding to κ = 6, i.e., the case of percolation. Since 2D turbulence is a driven nonequilibrium system, this observation is very intriguing, in particular since the correlations between vortices are long-ranged. A similar analysis [19] of the isolines in the inverse cascade in surface quasi-geostrophic turbulence corresponds to κ = 4, i.e., from Eq. (40) the domain walls in the equilibrium XY model for n = 2. For comments on the application of SLE in turbulence we refer to Cardy [28].

SLE – 2D Spin Glass

It is a standing issue whether conformal field theory can be applied to disordered systems, in particular systems with quenched disorder. In recent work Amoruso et al. [3] and Bernard et al. [18] have considered zero temperature domain walls in the Ising spin glass [34]; see also [33]. The Ising spin glass is an equilibrium system with quenched disorder. The system is described by the Hamiltonian H = −Σ_⟨ij⟩ J_ij σ_i σ_j, where the random exchange constants J_ij are picked from a Gaussian distribution with zero mean. The glass transition is at T = 0 and the system has a two-fold degenerate ground state. Inducing a scale invariant domain wall between the two ground states and comparing with an SLE trace, it is found that both the Markov and conformal properties are obeyed and that the universality class corresponds to κ ≈ 2.3.

Further Remarks

In this discussion we have left out several topics which have played an important role in the development and applications of SLE. We mention some of them below. There is an interesting connection between LERW and the so-called uniform spanning tree (UST) [13,43,57,77]. A spanning tree is a collection of vertices and edges which form a tree, i.e., without loops or cycles. A uniform spanning tree is a random spanning tree picked among all possible spanning trees with equal probability. Consider the unique path between two vertices on a UST.
Since the path lives on a tree it is by construction non-crossing, and it turns out that it has the same distribution as LERW. The winding random curve enclosing the UST can be visualized as a random plane-filling Peano curve. In the scaling limit the Peano curve is described by SLE for κ = 8 with fractal dimension D = 2. The q-state Potts model [87] constitutes a generalization of the Ising model; here the lattice variable takes q values. The model is defined by the Hamiltonian H = −J Σ_⟨ij⟩ δ_{σ_i σ_j}, where σ_i = 1,…,q; the Ising model obtains for q = 2. Applying the high temperature FK representation, the configurations can be represented by loops
and domain walls. From considerations involving the fractal dimension [76] it has been conjectured that domain walls in the scaling limit of the Potts model fall in the SLE category for q = 2 + 2cos(8π/κ), where 4 ≤ κ ≤ 8. For q = 2 we recover the Ising case for κ = 16/3. In the limit q → 0, the graph representation is equivalent to the uniform spanning tree described by SLE for κ = 8. For a numerical study of the three-state Potts model and its relation to SLE consult [36]. Standard SLE is driven by 1D Brownian motion producing a fractal curve. Ruskin et al. [75] have considered the case of adding a stable Lévy process with shape parameter α to the Brownian motion. Backing their analysis with numerics, they find that the SLE trace branches and exhibits 'phase transitions' related to self-intersections. 2D Brownian motion, the scaling limit of 2D random walk, is an incredibly complex fractal coil owing to the self-crossings on all scales. Although 2D Brownian motion itself falls outside the SLE scheme because of its self-crossings, the outer frontier or perimeter of 2D random walk is a non-crossing and non-intersecting fractal curve which can be accessed by SLE. Verifying an earlier conjecture by Mandelbrot [68], it has been proven using SLE techniques [58] that the fractal dimension of the Brownian perimeter is D = 4/3, i.e., the same as the fractal dimension of self-avoiding random walk and the external perimeter of the percolation hull. Other characteristics of Brownian motion such as intersection exponents have also been obtained [59,60,62]; see also [66]. In recent work Zoia et al. [88] have considered the distribution of first passage times and distances along critical curves generated by SLE for different values of κ.

Future Directions

Stochastic Loewner evolution represents a major step in our understanding of fractal shapes in the 2D continuum limit.
By combining the Markov property (stationarity) with conformal invariance SLE provides a minimal scheme for the generation of a one-parameter family of fractal curves. The SLE scheme also provides calculational tools which have led to a host of new results. SLE is a developing field and we can on the mathematical front anticipate progress and proofs of some yet unproved scaling limits, e. g., the scaling limit of the FK representation of the Potts model and the scaling limit of SAW. On the more physical front many issues also remain open. First there is the fundamental issue of the connection between the hugely successful but non-rigorous CFT and SLE. Here progress is already under way. In a series of papers Bauer and Bernard [6,7,8,9,10,12] have shown how
SLE results can be derived using CFT methods. Cardy [26] has considered a multiple SLE process and the connection to Dyson's Brownian process and random matrix theory. The analysis of the CFT–SLE connection still remains to be investigated further. An obvious limitation of SLE is that it only addresses critical domain walls and not the full configuration of clusters and loops in, for example, the FK representation of the Potts model. In the case of critical percolation this problem has been addressed by Camia and Newman [21]. Another issue is how to provide SLE insight into spin correlations in the Potts or O(n) models. In the original formulation of SLE the Markov and conformal properties essentially require a Brownian drive. It is clearly of interest to investigate the properties of random curves generated by other random drives. Such a program has been initiated by Ruskin et al. [75], who considered adding a Lévy drive to the Brownian drive; see also work by Kennedy [49,50]. Since the SLE trace lives in the infinite upper half plane, the whole issue of finite size effects remains open. In ordinary critical phenomena the concept of a Kadanoff block construction and the diverging correlation length near the transition lead to a theory of finite size scaling and corrections to scaling which can be accessed numerically. It is an open problem how to develop a similar scheme for SLE. In statistical physics it is customary and natural to associate a free energy with a domain wall and an interaction energy with several domain walls. These free energy considerations are entirely absent in the SLE framework, which is based on conformal transformations. A major issue is thus: Where is the free energy in all this, and how do we reintroduce and make use of ordinary physical considerations and estimates [69]?

Acknowledgments

The present work has been supported by the Danish Natural Science Research Council.

Bibliography

1.
Ahlfors LV (1966) Complex analysis: An introduction to the theory of analytic functions of one complex variable. McGraw-Hill, New York
2. Ahlfors LV (1973) Conformal invariance: Topics in geometric function theory. McGraw-Hill, New York
3. Amoruso C, Hartmann AK, Hastings MB, Moore MA (2006) Conformal invariance and stochastic Loewner evolution processes in two-dimensional Ising spin glasses. Phys Rev Lett 97:267202. arXiv:cond-mat/0601711
4. Ash RB, Doléans-Dade CA (2000) Probability and measure theory. Academic, San Diego
5. Bak P (1999) How nature works: The science of self-organized criticality. Springer, New York
6. Bauer M, Bernard D (2002) SLE growth processes and conformal field theory. Phys Lett B 543:135–138. arXiv:math.PR/0206028
7. Bauer M, Bernard D (2003) Conformal field theories of stochastic Loewner evolutions. Comm Math Phys 239:493–521. arXiv:hep-th/0210015
8. Bauer M, Bernard D (2003) SLE martingales and the Virasoro algebra. Phys Lett B 557:309–316. arXiv:hep-th/0301064
9. Bauer M, Bernard D (2004) CFTs of SLEs: The radial case. Phys Lett B 583:324–330. arXiv:math-ph/0310032
10. Bauer M, Bernard D (2004) Conformal transformations and the SLE partition function martingale. Ann Henri Poincare 5:289–326. arXiv:math-ph/0305061
11. Bauer M, Bernard D (2004) Loewner chains. arXiv:cond-mat/0412372
12. Bauer M, Bernard D (2004) SLE, CFT and zig-zag probabilities. Proceedings of the conference 'Conformal Invariance and Random Spatial Processes', Edinburgh, July 2003. arXiv:math-ph/0401019
13. Bauer M, Bernard D (2006) 2D growth processes: SLE and Loewner chains. Phys Rep 432:115–221
14. Baxter RJ (1982) Exactly solved models in statistical mechanics. Academic, London
15. Beffara V (2002) The dimension of SLE curves. arXiv:math.PR/0211322
16. Beffara V (2003) Hausdorff dimensions for SLE6. Ann Probab 32:2606–2629. arXiv:math.PR/0204208
17. Bernard D, Boffetta G, Celani A, Falkovich G (2006) Conformal invariance in two-dimensional turbulence. Nat Phys 2:124–128
18. Bernard D, Le Doussal P, Middleton AA (2006) Are domain walls in 2D spin glasses described by stochastic Loewner evolutions? arXiv:cond-mat/0611433
19. Bernard D, Boffetta G, Celani A, Falkovich G (2007) Inverse turbulent cascades and conformally invariant curves. Phys Rev Lett 98:024501. arXiv:nlin.CD/0602017
20. Binney JJ, Dowrick NJ, Fisher AJ, Newman MEJ (1992) The theory of critical phenomena. Clarendon, Oxford
21. Camia F, Newman CM (2003) Continuum nonsimple loops and 2D critical percolation. arXiv:math.PR/0308122
22. Cardy J (1987) Conformal invariance. In: Domb C, Lebowitz JL (eds) Phase transitions and critical phenomena, vol 11. Academic, London
23. Cardy J (1993) Conformal field theory comes of age. Physics World 6:29–33
24. Cardy J (1996) Scaling and renormalization in statistical physics. Cambridge University Press, Cambridge
25. Cardy J (2002) Conformal invariance in percolation, self-avoiding walks and related problems. Plenary talk given at TH-2002, Paris. arXiv:cond-mat/0209638
26. Cardy J (2003) Stochastic Loewner evolution and Dyson's circular ensembles. J Phys A 36:L379–L408. arXiv:math-ph/0301039
27. Cardy J (2005) SLE for theoretical physicists. Ann Phys 318:81–118. arXiv:cond-mat/0503313
28. Cardy J (2006) The power of two dimensions. Nat Phys 2:67–68
29. Chaikin PM, Lubensky TC (1995) Principles of condensed matter physics. Cambridge University Press, Cambridge
30. de Gennes PG (1985) Scaling concepts in polymer physics. Cornell University Press, Ithaca 31. Duplantier B (2000) Conformally invariant fractals and potential theory. Phys Rev Lett 84:1363–1367. arXiv:cond-mat/ 9908314 32. Feder J (1988) Fractals (physics of solids and liquids). Springer, New York 33. Fisch R (2007) Comment on conformal invariance and stochastic Loewner evolution processes in two-dimensional Ising spin glasses. arXiv:0705.0046 34. Fischer KH, Hertz JA (1991) Spin glasses. Cambridge University Press, Cambridge 35. Fortuin CM, Kasteleyn PW (1972) On the random cluster model. Physica 57:536–564 36. Gamsa A, Cardy J (2007) SLE in the three-state Potts model – a numerical study. arXiv:0705.1510 37. Gardiner CW (1997) Handbook of stochastic methods. Springer, New York 38. Gong S (1999) The Bieberbach conjecture. RI American 19. Mathematical Society. International, Providence 39. Gruzberg IA, Kadanoff LP (2004) The Loewner equation: Maps and shapes. J Stat Phys 114:1183–1198. arXiv:cond-mat/ 0309292 40. Jensen HJ (2000) Self-organized criticality: Emergent complex behavior in physical and biological systems. Cambridge University Press, Cambridge 41. Kadanoff LP (1966) Scaling laws for Ising models near T c . Physics 2:263–271 42. Kadanoff LP, Berkenbusch MK (2004) Trace for the Loewner equation with singular forcing. Nonlinearity 17:R41–R54. arXiv: cond-mat/0402142 43. Kager W, Nienhuis B (2004) A guide to stochastic Loewner evolution and its application. J Stat Phys 115:1149–1229 44. Kager W, Nienhuis B, Kadanoff LP (2004) Exact solutions for loewner evolutions. J Stat Phys 115:805–822 45. Kauffman SA (1996) At home in the universe: The search for the laws of self-organization and complexity. Oxford University Press, Oxford 46. Kennedy T (2002) Monte Carlo tests of stochastic Loewner evolution predictions for the 2D self-avoiding walk. Phys Rev Lett 88(4):130601. arXiv:math.PR/0112246 47. 
Kennedy T (2004) Conformal invariance and stochastic Loewner evolution predictions for the 2D self-avoiding walk – Monte Carlo tests. J Stat Phys 114:51–78. arXiv:math.PR/ 0207231 48. Kennedy T (2005) Monte Carlo comparisons of the self-avoiding walk and SLE as parameterized curves. arXiv:math.PR/ 0510604v1 49. Kennedy T (2006) The length of an SLE – Monte Carlo studies. arXiv:math.PR/0612609v1 50. Kennedy T (2007) Computing the Loewner driving process of random curves in the half plane. arXiv:math.PR/0702071v1 51. Kolmogorov AN (1941) Dissipation of energy in the locally isotropic turbulence. Dokl Akad Nauk SSSR 30:9–13. (Reprinted in Proc Royal Soc Lond A 434:9–13 (1991)) 52. Kraichnan RH (1967) Inertial ranges in two-dimensional turbulence. Phys Fluids 10:1417–1423 53. Kraichnan RH, Montgomery D (1980) Two-dimensional turbulence. Rep Prog Phys 43:567–619 54. Landau LD, Lifshitz EM (1959) Theory of elasticity. Pergamon, Oxford
3095
3096
Stochastic Loewner Evolution: Linking Universality, Criticality and Conformal Invariance in Complex Systems
55. Lawler GF (2004) ICTP Lecture Notes Series. 17:307–348 56. Lawler GF (2005) Conformally invariant processes in the plane. In: Mathematical Surveys and Monographs, vol 114. AMS, Providence 57. Lawler GF, Schramm O, Werner W (2001) Conformal invariance of planar loop-erased random walks and uniform spanning trees. Ann Prob 32:939–995. arXiv:math.PR/0112234 58. Lawler GF, Schramm O, Werner W (2001) The dimension of the planar Brownian frontier is 4/3. Math Res Lett 8:401–411. arXiv: math.PR/00010165 59. Lawler GF, Schramm O, Werner W (2001) Values of Brownian intersections exponents I: Half plane exponents. Acta Math 187:237–273. arXiv:math.PR/9911084 60. Lawler GF, Schramm O, Werner W (2001) Values of Brownian intersections exponents II: Plane exponents. Acta Math 187:275– 308. arXiv:math.PR/0003156 61. Lawler GF, Schramm O, Werner W (2002) On the scaling limit of planar self-avoiding walk. Fractal geometry and application, a jubilee of Benoit Mandelbrot, Part 2, 339–364, Proc. Sympos. Pure Math., 72, Part 2, Amer. Math. Soc., Providence, RI, 2004. arXiv:math.PR/0204277 62. Lawler GF, Schramm O, Werner W (2002) Values of Brownian intersections exponents III: Two-sided exponents. Ann Inst Henri Poincare 38:109–123. arXiv:math.PR/0005294 63. Lawler GF, Schramm O, Werner W (2003) Conformal restriction: The chordal case. J Amer Math Soc 16:917–955. arXiv:math.PS/ 0209343 64. Löwner K (Loewner C) (1923) Untersuchungen über schlichte konforme Abbildungen des Einheitskreises. I Math Ann 89:103–121 65. Ma S-K (1976) Modern theory of critical phenomena. Frontiers in Physics, vol 46. Benjamin, Reading 66. Mackenzie D (2000) Taking the measure of the wildest dance on earth. Science 290:1883–1884 67. Majumbar SN (1992) Exact fractal dimension of the looperased self-avoiding walk in two dimensions. Phys Rev Lett 68:2329–2331 68. Mandelbrot B (1987) The fractal geometry of nature. Freeman, New York 69. Moore M (2007) private communication 70. 
Nicolis G (1989) Exploring complexity: An introduction. Freeman, New York 71. Nienhuis B (1987) Coulomb gas formulation of two-dimensional phase transitions. In: Domb C, Lebowitz JL (eds) Phase
72. 73. 74. 75.
76.
77.
78.
79.
80. 81. 82. 83. 84.
85.
86. 87. 88.
transitions and critical phenomena, vol 11. Academic, London Pfeuty P, Toulouse G (1977) Introduction to the renormalization group and to critical phenomena. Wiley, New York Reichl LE (1998) A modern course in statistical physics. Wiley, New York Rohde S, Schramm O (2001) Basic properties of SLE. Ann Math 161:879–920. arXiv:mathPR/0106036 Rushkin I, Oikonomou P, Kadanoff LP, Gruzberg IA (2006) Stochastic Loewner evolution driven by Levy processes. J Stat Mech (2006) 01:P01001. arXiv:cond-mat/0509187 Saleur H, Duplantier B (1987) Exact determination of the percolation hull exponent in two dimensions. Phys Rev Lett 58:2325–2328 Schramm O (2000) Scaling limit of loop-erased random walks and uniform spanning trees. Israel J Math 118:221–288. arXiv: math.PR/9904022 Smirnov S (2001) Critical percolation in the plane: Conformal invariance, Cardy’s formula, scaling limits. C R Acad Sci Paris Ser I Math 333(3):239–244 Smirnov S (2006) Towards conformal invariance of 2D lattice models. Proceedings of the international congress of mathematicians (Madrid, August 22–30, 2006). Eur Math Soc 2:1421– 1451 Smirnov S, Werner W (2001) Critical exponents for two-dimensional percolation. Math Res Lett 8:729–744 Stanley HE (1987) Introduction to phase transitions and critical phenomena. Oxford University Press, Oxford Stauffer D, Aharony A (1994) Introduction to percolation theory. CRC, Boca Raton Strogatz S (2003) Sync: The emerging science of spontaneous order. Hyperion, New York Vanderzande C, Stella AL (1989) Bulk, surface and hull fractal dimension of critical Ising clusters in d D 2. J Phys A: Math Gen 22:L445–L451 Werner W (2004) Random planar curves and Schramm– Loewner evolutions. Springer Lecture Notes in Mathematics 1840:107–195. arXiv: math.PR/0303354 Wilson KG, Kogut J (1974) The renormalization group and the " expansion. Phys Rep 12:75–199 Wu FY (1982) The Potts model. 
Rev Mod Phys 54:235–268 Zoia A, Kantor Y, Kardar M (2007) Distribution of first passage times and distances along critical curves. arXiv:0705.1474v1 [cond-mat.stat-mech]
Stochastic Processes
ALAN J. MCKANE
Theory Group, School of Physics and Astronomy, University of Manchester, Manchester, UK

Article Outline

Glossary
Definition of the Subject
Introduction
Markov Chains
The Master Equation
The Fokker–Planck Equation
Stochastic Differential Equations
Path Integrals
System Size Expansion
Future Directions
Bibliography

Glossary

Fokker–Planck equation  A second-order partial differential equation for the time evolution of the probability density function of a stochastic process. It resembles a diffusion equation, but has an extra term which represents the deterministic aspects of the process.

Langevin equation  A stochastic differential equation of the simplest kind: linear and with an additive Gaussian white noise. Introduced by Langevin [30] in 1908 to describe Brownian motion; many stochastic differential equations in physics go by this name.

Markov process  A stochastic process in which the current state of the system is determined only by its state in the immediate past, and not by its entire history.

Markov chain  A Markov process where both the states and the time are discrete and where the process is stationary.

Master equation  The equation describing a continuous-time Markov chain.

Stochastic process  A sequence of stochastic variables. The sequence is usually a time sequence, but could also be spatial.

Stochastic variable  A random variable, that is, a function which maps outcomes to numbers (real or integer).

Definition of the Subject

The most common type of stochastic process comprises a set of random variables $\{x(t)\}$, where $t$ represents the time, which may be real or integer valued. Other types of stochastic process are possible, for instance when the stochastic variable depends on the spatial position r, as
well as, or instead of, t. Since in the study of complex systems we will predominantly be interested in applications relating to stochastic dynamics, we will suppose that it depends only on $t$. One of the earliest investigations of a stochastic process was carried out by Bachelier [2], who used the idea of a random walk to analyze stock market fluctuations. The problem of the random walk was discussed more generally by Pearson [42], and applied to the investigation of Brownian motion by Einstein [8,9], Smoluchowski [54] and Langevin [30]. The example of the random walk illustrates the fact that, in addition to time being discrete or continuous, the stochastic variable itself can be discrete (for instance, the walker moves with fixed step size in one dimension) or continuous (for instance, the velocity of a Brownian particle). The modeling of the process may lead to an equation for the stochastic variable, such as a stochastic differential equation, or to an equation which predicts how the probability density function (pdf) for the stochastic variable changes in time. Stochastic processes are ubiquitous in the physical, biological and social sciences; they may come about through the perception of very complicated processes as essentially random (the toss of a coin, the roll of a die, the birth or death of individuals in populations), through the inclusion of diverse and poorly characterized effects external to the system under consideration, or through thermal fluctuations, among others.

Introduction

Deterministic dynamical processes are typically formulated as a set of rules which allow the state of the system at time $t + 1$ (or $t + \delta t$) to be found from the state of the system at time $t$. By contrast, for stochastic systems we can only specify the probability of finding the system in a given state. If this probability depends only on the state of the system at the previous time step, but not on those before it, the stochastic process is said to be Markov.
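To make the discrete/continuous distinction concrete, the sketch below (illustrative code, not part of the original article; all names and parameter values are ours) samples both kinds of stochastic variable: a walker taking unit steps on the integers, and a continuous-state path built from Gaussian increments. In both cases the variance of the displacement grows linearly with time.

```python
import random
import statistics

random.seed(1)

def discrete_walk(steps):
    """Discrete stochastic variable: unit steps left or right with equal probability."""
    n = 0
    for _ in range(steps):
        n += random.choice((-1, 1))
    return n

def continuous_path(steps, dt=0.01):
    """Continuous stochastic variable: a sum of Gaussian increments, each of variance dt."""
    x = 0.0
    for _ in range(steps):
        x += random.gauss(0.0, dt ** 0.5)
    return x

# For the walk, Var[n(t)] = t in units of the squared step size.
walks = [discrete_walk(100) for _ in range(20000)]
print(statistics.pvariance(walks))  # fluctuates around 100
```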
Fortunately, many stochastic processes are Markovian to a very good approximation. This is fortunate because the theory of non-Markov processes is considerably more complicated than that of Markov processes, and much less well developed. In this article we will deal almost exclusively with Markov processes. The mathematical definition of a Markov process follows from the definition of the hierarchy of pdfs for a given process. This involves the joint pdfs $P(x_1, t_1; x_2, t_2; \ldots; x_n, t_n)$, which give the probability that the system is in state $x_1$ at time $t_1$, state $x_2$ at time $t_2$, ..., and state $x_n$ at time $t_n$, and also the conditional pdfs $P(x_1, t_1; \ldots; x_m, t_m \mid x_{m+1}, t_{m+1}; \ldots; x_n, t_n)$, which give the probability that the system is in state $x_1$ at time $t_1$, ..., $x_m$ at time $t_m$, given that it was in state $x_{m+1}$ at time $t_{m+1}$, ..., $x_n$ at time $t_n$. These pdfs are
all non-negative and normalizable, and relations exist between them due to symmetry and reduction (integration over some of the state variables). Nevertheless, for a general non-Markov process, a whole family of these pdfs will be required to specify the process. On the other hand, for a Markov process the history of the system, apart from the immediate past, is forgotten, and so

$$P(x_1, t_1; \ldots; x_m, t_m \mid x_{m+1}, t_{m+1}; \ldots; x_n, t_n) = P(x_1, t_1; \ldots; x_m, t_m \mid x_{m+1}, t_{m+1}).$$

A direct consequence of this is that the whole hierarchy of pdfs can be determined from only two of them: $P(x, t)$ and $P(x, t \mid x', t')$. The hierarchy of defining equations then collapses to only two:

$$P(x_2, t_2) = \int dx_1 \, P(x_2, t_2 \mid x_1, t_1) P(x_1, t_1) \qquad (1)$$

and

$$P(x_3, t_3 \mid x_1, t_1) = \int dx_2 \, P(x_3, t_3 \mid x_2, t_2) P(x_2, t_2 \mid x_1, t_1), \quad t_1 < t_2 < t_3. \qquad (2)$$
The pdf $P(x, t \mid x', t')$ is referred to as the transition probability, and Eq. (2) as the Chapman–Kolmogorov equation. While the pdfs for a Markov process must obey Eqs. (1) and (2), the converse also holds: any two non-negative functions $P(x, t)$ and $P(x, t \mid x', t')$ which satisfy Eqs. (1) and (2) uniquely define a Markov process. We will begin our discussion in Sect. “Markov Chains” with what is probably the simplest class of Markov processes: the case when both the state space and time are discrete. These are called Markov chains and were first investigated, for a finite number of states, by Markov [32] in 1906. The extension to an infinite number of states was carried out by Kolmogorov [27] in 1936. If time is continuous, an analogous formalism may be developed, which will be discussed in Sect. “The Master Equation”. In physics the equation describing the time evolution of the pdf in this case is called the master equation; it was introduced by Pauli [41] in 1928, in connection with the approach to equilibrium for quantum systems, and also by Nordsieck, Lamb and Uhlenbeck [39] in 1940, in connection with fluctuations in cosmic ray physics. The term “master equation” refers to the fact that many of the quantities of interest can be derived from this equation. The connection with previous work on Markov processes was clarified by Siegert [49] in 1949. In many instances when the master equation cannot be solved exactly, it is useful to approximate it by a rather coarser description of the system, known as the Fokker–Planck equation. This approach will be discussed in Sect. “The Fokker–Planck Equation”. This equation was used in its linear form by Rayleigh [44], Einstein [8,9], Smoluchowski [54,55], and Fokker [14], but it
was Planck [43] who derived the general nonlinear form from a master equation in 1917, and Kolmogorov who made the procedure rigorous in 1931. All the descriptions which we have mentioned so far have been based on the time evolution of the pdfs. An alternative specification is to give the time evolution of the stochastic variables themselves. This will necessarily involve random variables appearing in the equations describing this evolution, and they will therefore be stochastic differential equations. The classic example is the Langevin equation [30] used to describe Brownian motion. This equation is linear and can therefore be solved exactly. The Langevin approach, and its relation to the Fokker–Planck equation, is described in Sect. “Stochastic Differential Equations”. A good summary of the understanding of stochastic processes that had been gained by the mid-1950s is given in the book edited by Wax. This covers the basics of the subject, and what is discussed in the first six sections of this article. The article by Chandrasekhar [4], first published in 1943 and reprinted in Wax, gives an extensive bibliography of stochastic problems in physics before 1943. Since then the applications of the subject have grown enormously, and the equations modeling these systems have correspondingly become more complex. We illustrate some of the procedures which have been developed to deal with these equations in the next two sections. In Sect. “Path Integrals” we discuss how the path-integral formalism may be applied to stochastic processes, and in Sect. “System Size Expansion” we describe how master equations can be analyzed when the size of the system is large. We end with a look forward to the future in Sect. “Future Directions”.

Markov Chains

The simplest version of the Markov process is when both the states and the time are discrete, and when the stochastic process is stationary.
When the states are discrete we will denote them by $n$ or $m$, rather than $x$, which we reserve for continuous state variables. In this notation the two Eqs. (1) and (2) governing Markov processes read

$$P(n_2, t_2) = \sum_{n_1} P(n_2, t_2 \mid n_1, t_1) P(n_1, t_1) \qquad (3)$$

$$P(n_3, t_3 \mid n_1, t_1) = \sum_{n_2} P(n_3, t_3 \mid n_2, t_2) P(n_2, t_2 \mid n_1, t_1), \quad t_1 < t_2 < t_3. \qquad (4)$$
A stationary process is one in which the conditional pdf $P(n, t \mid n', t')$ depends only on the time difference $(t - t')$. For such processes, when time is discrete so that $t = t' + 1, t' + 2, \ldots$, we may write $P(n, t' + k \mid n', t')$ as $p^{(k)}_{nn'}$. The most elementary form of the Chapman–Kolmogorov equation (4) may then be expressed as

$$p^{(2)}_{nm} = \sum_{n'} p^{(1)}_{nn'} p^{(1)}_{n'm}. \qquad (5)$$
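Equation (5) says that composing two one-step transitions is just matrix multiplication. The sketch below (an arbitrary 3-state chain chosen purely for illustration; not from the article) computes the two-step probabilities from Eq. (5) and checks that the result is again a stochastic matrix, as required by Eq. (6).

```python
# Column-stochastic matrix p[n][n0]: probability of a transition n0 -> n.
# An arbitrary 3-state example; each column sums to 1.
p = [[0.5, 0.2, 0.1],
     [0.3, 0.5, 0.4],
     [0.2, 0.3, 0.5]]

N = len(p)

def matmul(a, b):
    """Plain matrix product: (a b)[n][m] = sum_k a[n][k] b[k][m]."""
    return [[sum(a[n][k] * b[k][m] for k in range(N)) for m in range(N)]
            for n in range(N)]

# Eq. (5): two-step probabilities p2[n][m] = sum_{n'} p[n][n'] p[n'][m],
# i.e. the matrix square of the one-step matrix.
p2 = matmul(p, p)

# p2 is again column-stochastic, as Eq. (6) requires.
for m in range(N):
    print(round(sum(p2[n][m] for n in range(N)), 12))  # each column sums to 1
```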
This corresponds to the matrix multiplication of $p^{(1)}$ with itself, and therefore $p^{(2)}$ is simply $(p^{(1)})^2$. In the same way $p^{(k)} = (p^{(1)})^k$, and from now on we drop the superscript on $p^{(1)}$ and denote the matrix by $P$. The entries of $P$ are non-negative, with the sum of the entries in each column equal to unity, since

$$\sum_n p_{nn'} = \sum_n P(n, t + 1 \mid n', t) = 1. \qquad (6)$$

Such matrices are called stochastic matrices. From Eq. (5) it is clear that $P^2$ is also a stochastic matrix, and by induction it follows that $P^k$ is a stochastic matrix if $P$ is. The other defining relation for a Markov process, Eq. (3), now becomes

$$P(n, t + 1) = \sum_{n'} p_{nn'} P(n', t). \qquad (7)$$

This relation defines a Markov chain. It has two ingredients: the probability that the system is in state $n$ at time $t$, $P(n, t)$, which is usually what we are trying to determine, and the stochastic matrix with entries $p_{nn'}$, which gives the probability of a transition from state $n'$ to state $n$. The transition probabilities are typically given; they define the model. Note that in many texts the probability of making a transition from $n'$ to $n$ is written as $p_{n'n}$, not $p_{nn'}$. If we write $P(n, t)$ as a vector $\mathbf{P}(t)$, then we may write Eq. (7) as $\mathbf{P}(t+1) = P\,\mathbf{P}(t)$. Therefore,

$$\mathbf{P}(t) = P\,\mathbf{P}(t-1) = P P\,\mathbf{P}(t-2) = \cdots = P^t\,\mathbf{P}(0), \qquad (8)$$

and so if the initial state of the system $\mathbf{P}(0)$ is given, we can find the state of the system at time $t$, $\mathbf{P}(t)$, by multiplying by the $t$th power of the transition matrix.

Examples of Markov Chains

1. A one-dimensional random walk. The most widely known example of a Markov chain is a random walk on the real axis, where the walker takes single steps between the integers on the line. The simplest version is where the walker has to move during every time interval:

$$p_{nn'} = \begin{cases} p, & \text{if } n = n' + 1 \\ q, & \text{if } n = n' - 1 \\ 0, & \text{otherwise,} \end{cases} \qquad (9)$$

where $p + q = 1$. There are many variants. For instance, the walker could have a non-zero probability of staying put, in which case $p_{nn} = r$, with $p + q + r = 1$. The walk could be heterogeneous, in which case $p$ and $q$ (and $r$) could depend on $n'$. If there are boundaries, the boundary conditions have to be given. The two most common are absorbing boundaries, defined by

$$p_{n+1,n} = p, \quad p_{n-1,n} = q \quad (n = 2, \ldots, N-1), \quad p_{11} = 1, \quad p_{NN} = 1, \quad p_{nn'} = 0 \text{ otherwise}, \qquad (10)$$

and reflecting boundaries, defined by

$$p_{n+1,n} = p, \quad p_{n-1,n} = q \quad (n = 2, \ldots, N-1), \quad p_{21} = p, \quad p_{N-1,N} = q, \quad p_{11} = q, \quad p_{NN} = p, \quad p_{nn'} = 0 \text{ otherwise}. \qquad (11)$$

With absorbing boundaries (10), if we reach the state 1 or $N$ we can never leave it, since we stay there with probability 1. When a boundary is reflecting, we can never move beyond it; the only options are to move back towards the other boundary, or to stay put. A well-known example of an absorbing boundary occurs in the gambler's ruin problem, where a gambler bets a given amount against the house at each time step and wins with probability $p$. Here the absorbing boundary is situated at $n = 0$: eventually the gambler arrives at the state $n = 0$, where he has no money left, and so cannot reenter the game. Birth/death processes also have an absorbing state at $n = 0$: if $n$ is the number of individuals at a given time, and $p$ is the probability of a birth and $q$ of a death, then if there are no individuals left ($n = 0$), none can be born. This condition is automatically enforced if the transition probabilities are proportional to the number of individuals in the population.

2. The Ehrenfest urn. This Markov chain was introduced by the Ehrenfests [7] in 1907 to illustrate the approach to equilibrium in a gas. Two containers, A and B, contain molecules of the same gas, with the total number of molecules in A and B fixed at $N$. At each time step a molecule is chosen at random from among the $N$ molecules and moved to the other container. If $n'$ is the number of molecules in container A at a certain time, then at the next time step the transition probabilities are:

$$p_{nn'} = \begin{cases} n'/N, & \text{if } n' = n + 1 \\ (N - n')/N, & \text{if } n' = n - 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (12)$$
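The Ehrenfest urn is easy to put on a computer. The sketch below (illustrative code; the value N = 10 is an arbitrary choice of ours) assembles the transition matrix of Eq. (12), confirms that it is a stochastic matrix in the sense of Eq. (6), and checks that the binomial distribution with parameter 1/2 is left invariant by one step of the evolution (7).

```python
from math import comb

N = 10  # total number of molecules (small, for illustration)

# Transition matrix of Eq. (12): p[n][n0] is the probability of going
# from n0 molecules in urn A to n molecules.
p = [[0.0] * (N + 1) for _ in range(N + 1)]
for n0 in range(N + 1):
    if n0 > 0:
        p[n0 - 1][n0] = n0 / N          # a molecule leaves urn A
    if n0 < N:
        p[n0 + 1][n0] = (N - n0) / N    # a molecule enters urn A

# Every column sums to 1, so p is a stochastic matrix (Eq. (6)).
assert all(abs(sum(p[n][n0] for n in range(N + 1)) - 1.0) < 1e-12
           for n0 in range(N + 1))

def step(P):
    """One application of Eq. (7): P(n, t+1) = sum_{n'} p[n][n'] P(n', t)."""
    return [sum(p[n][n0] * P[n0] for n0 in range(N + 1)) for n in range(N + 1)]

# The binomial distribution with parameter 1/2 is stationary: step(P) = P.
P_binom = [comb(N, n) / 2 ** N for n in range(N + 1)]
P_next = step(P_binom)
print(max(abs(a - b) for a, b in zip(P_binom, P_next)))  # essentially zero (roundoff)
```

Because the walker must move at every step, this chain has period 2, so $P(t)$ oscillates rather than converging pointwise; the binomial distribution is nevertheless stationary.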
For a one-step process with transition rates $g_n$ (for $n \to n+1$) and $r_n$ (for $n \to n-1$) and reflecting boundaries, the stationary distribution of the master equation is

$$P_{st}(n) = P_{st}(0) \prod_{m=1}^{n} \frac{g_{m-1}}{r_m}, \qquad n = 1, \ldots, N. \qquad (29)$$

The constant $P_{st}(0)$ is determined by normalization:

$$\left( P_{st}(0) \right)^{-1} = 1 + \sum_{n=1}^{N} \prod_{m=1}^{n} \frac{g_{m-1}}{r_m}. \qquad (30)$$

As an example, we return to the Ehrenfest urn (12), which in the language of the master equation is defined by $g_n = (N - n)/N$ and $r_n = n/N$ (any overall rate may be absorbed into the time, and this is irrelevant as far as the stationary state is concerned). Here $n = 0, 1, \ldots, N$ and the molecules never go outside this range, so the boundaries are reflecting. Applying Eqs. (29) and (30) shows that the stationary state is the binomial distribution given by Eq. (15).

The Fokker–Planck Equation

The Fokker–Planck equation describes stochastic processes at a more coarse-grained level than those that we have discussed so far. It involves only continuous stochastic variables; these could be, for instance, the fraction of individuals or genes of a certain kind in a population, whereas the master equation recognizes the individuals or genes as discrete entities. To obtain the Fokker–Planck equation we first derive the Kramers–Moyal expansion [29,38]. We begin by defining the jump moments for the system:

$$M_\ell(x, t; \Delta t) = \int d\xi \, (\xi - x)^\ell \, P(\xi, t + \Delta t \mid x, t). \qquad (31)$$

We will assume that these are known; that is, they can be obtained by some other means. They will, however, only be required in the limit of small $\Delta t$. The starting point for the derivation is the Chapman–Kolmogorov equation (4), with the choice of variables analogous to that used in Eq. (17) for the discrete case:

$$P(x, t + \Delta t) = \int d\xi \, P(x, t + \Delta t \mid \xi, t) P(\xi, t).$$

Expanding the integrand in a Taylor series in the jump $\xi - x$ about $x$, and using the definition (31), one finds

$$P(x, t + \Delta t) = \sum_{\ell=0}^{\infty} \frac{(-1)^\ell}{\ell!} \frac{\partial^\ell}{\partial x^\ell} \left\{ M_\ell(x, t; \Delta t) P(x, t) \right\}. \qquad (34)$$

Since $P(\xi, t \mid x, t) = \delta(\xi - x)$, it follows from Eq. (31) that $\lim_{\Delta t \to 0} M_\ell(x, t; \Delta t) = 0$ for $\ell \geq 1$. Also $M_0(x, t; \Delta t) = 1$. Bearing these results in mind, we will now assume that the jump moments for $\ell \geq 1$ take the form

$$M_\ell(x, t; \Delta t) = D^{(\ell)}(x, t)\,\Delta t + o(\Delta t). \qquad (35)$$

Substituting this into Eq. (34), dividing by $\Delta t$ and taking the limit $\Delta t \to 0$ gives

$$\frac{\partial P}{\partial t} = \sum_{\ell=1}^{\infty} \frac{(-1)^\ell}{\ell!} \frac{\partial^\ell}{\partial x^\ell} \left\{ D^{(\ell)}(x, t) P(x, t) \right\}. \qquad (36)$$

Equation (36) is the Kramers–Moyal expansion. So far nothing has been assumed other than the Markov property and the existence of Taylor series expansions. However, in many situations examination of the jump moments reveals that, in a suitable approximation, they may be neglected for $\ell > 2$. In this case we may truncate the Kramers–Moyal expansion (36) at second order and obtain the Fokker–Planck equation:

$$\frac{\partial P}{\partial t} = -\frac{\partial}{\partial x}\left[ A(x, t) P(x, t) \right] + \frac{1}{2} \frac{\partial^2}{\partial x^2}\left[ B(x, t) P(x, t) \right], \qquad (37)$$

where $A = D^{(1)}$ and $B = D^{(2)}$ are independent of $t$ if the process is stationary.
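A Fokker–Planck equation of the form (37) can also be integrated numerically. The sketch below (illustrative code, not from the article; the choice $A(x) = -ax$, $B = 2D$ and all parameter values are ours) evolves Eq. (37) with an explicit, conservative finite-difference scheme and reflecting boundaries, and checks that the solution relaxes to the expected Gaussian stationary state proportional to $\exp(-a x^2 / 2D)$.

```python
import math

a, D = 1.0, 0.5            # drift A(x) = -a*x, diffusion B(x) = 2*D (assumed example)
L, M = 4.0, 80             # domain [-L, L] divided into M cells
dx = 2 * L / M
dt = 0.2 * dx * dx / D     # small enough for stability of the explicit scheme
x = [-L + (i + 0.5) * dx for i in range(M)]  # cell centres

# Start from a uniform pdf and evolve Eq. (37) in flux form,
# dP/dt = -dJ/dx with J = A P - (1/2) d(B P)/dx, and J = 0 at the walls.
P = [1.0 / (2 * L)] * M
for _ in range(int(8.0 / dt)):
    J = [0.0] * (M + 1)    # J[i] sits on the face between cells i-1 and i
    for i in range(1, M):
        xf = -L + i * dx                      # face position
        Pf = 0.5 * (P[i - 1] + P[i])          # pdf interpolated to the face
        J[i] = -a * xf * Pf - D * (P[i] - P[i - 1]) / dx
    P = [P[i] - dt * (J[i + 1] - J[i]) / dx for i in range(M)]

# Probability is conserved exactly by the flux form ...
print(sum(P) * dx)  # stays at 1 up to roundoff
# ... and P has relaxed to the stationary Gaussian for these A and B.
Pst = [math.exp(-a * xi * xi / (2 * D)) for xi in x]
norm = sum(Pst) * dx
Pst = [v / norm for v in Pst]
print(max(abs(u - v) for u, v in zip(P, Pst)))  # small discretization error
```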
To calculate the jump moments (31), it is convenient to write them in terms of the underlying stochastic process $x(t)$. We use the notation

$$\langle x(t) \rangle_{x(t_0) = x_0} = \int dx \, x \, P(x, t \mid x_0, t_0) \qquad (38)$$

for the mean of the stochastic variable at time $t$, conditional on the value of $x(t_0)$ being given to be $x_0$. With this notation $x(t)$ denotes the process, and the angle brackets denote averages over realizations of this process. More generally, we may define $\langle f(x(t)) \rangle$ in a similar way, and in particular the jump moments are given by

$$M_\ell(x, t; \Delta t) = \left\langle (x(t + \Delta t) - x)^\ell \right\rangle_{x(t) = x}. \qquad (39)$$

Examples of Fokker–Planck Equations

1. Simple diffusion. For the simple symmetric random walk, $g_n = 1$ and $r_n = 1$ when expressed in the language of the master equation (after a rescaling of the time so that the rates may be taken equal to unity). From Eqs. (18) and (19) we find that, for a one-step stationary process,

$$\left\langle (n(t + \Delta t) - n)^\ell \right\rangle_{n(t) = n} = \begin{cases} (g_n - r_n)\,\Delta t + o(\Delta t), & \text{if } \ell \text{ is odd} \\ (g_n + r_n)\,\Delta t + o(\Delta t), & \text{if } \ell \text{ is even,} \end{cases} \qquad (40)$$

and so for the symmetric random walk the odd moments all vanish, and the even moments are equal to $2\Delta t + o(\Delta t)$. We now make the approximation which yields the Fokker–Planck equation: we let $x = nL$, where $L$ is the step size, and let $L \to 0$. Since, for $\ell$ even, $\langle (x(t + \Delta t) - x)^\ell \rangle = 2L^\ell\,\Delta t$, if we rescale the time by introducing $\tau = L^2 t$, then all jump moments higher than the second disappear in the limit $L \to 0$, and Eq. (36) becomes

$$\frac{\partial P}{\partial \tau} = \frac{\partial^2 P}{\partial x^2}. \qquad (41)$$

This is the familiar diffusion equation, obtained from a continuum approximation to the discrete random walk.

2. The diffusion limit of the Moran model. For the Moran model with no mutation, we have from Eqs. (26) and (40) that the odd moments again vanish. If we describe the process by $x(t) = n(t)/N$, the fraction of the genes that are of type A at time $t$, then the even jump moments are given by

$$\left\langle (x(t + \Delta t) - x)^\ell \right\rangle_{x(t) = x} = \frac{1}{N^\ell}\, 2x(1 - x)\,\Delta t + o(\Delta t),$$

and so, introducing a rescaled time $\tau = 2t/N^2$ and letting $N \to \infty$, we obtain the Fokker–Planck equation

$$\frac{\partial P}{\partial \tau} = \frac{1}{2} \frac{\partial^2}{\partial x^2}\left[ x(1 - x) P \right]. \qquad (42)$$

The factor of 2 in the rescaling of the time is included so that the diffusive form of the Moran model agrees with that found in the Wright–Fisher model [6].

Now suppose mutations are included. The transition rates are given by Eq. (27) and lead to jump moments $\langle (x(t + \Delta t) - x)^\ell \rangle \sim N^{-\ell}$. So the first and second jump moments are not of the same order, and the introduction of a rescaled time $\tau = t/N$, followed by the limit $N \to \infty$, gives a Fokker–Planck equation of the form (37), but with $B = 0$ and $A(x) = v - (u + v)x$. This corresponds to a deterministic process with $dx/d\tau = v - (u + v)x$ [22]. The fact that the system tends to a macroscopic equation when $N \to \infty$ has to be taken into account when determining the nature of the fluctuations for large $N$. We will discuss this further in Sect. “System Size Expansion”.

On the other hand, suppose that the mutation rates scale with $N$ according to $u = 2\tilde{u}/N$ and $v = 2\tilde{v}/N$, where $\tilde{u}$ and $\tilde{v}$ have a finite limit as $N \to \infty$, and where the 2 has again been chosen to agree with the Wright–Fisher model. Now both the first and second jump moments are of order $N^{-2}$, with the higher moments falling off faster with $N$. Therefore, once again introducing the rescaled time $\tau = 2t/N^2$ and letting $N \to \infty$, we obtain the Fokker–Planck equation

$$\frac{\partial P}{\partial \tau} = -\frac{\partial}{\partial x}\left[ \{\tilde{v} - (\tilde{u} + \tilde{v})x\} P \right] + \frac{1}{2} \frac{\partial^2}{\partial x^2}\left[ x(1 - x) P \right]. \qquad (43)$$

So, depending on the precise scaling with $N$, the Moran model gives different limits as $N \to \infty$ [24]. In the first case, the mutational effects are strong enough that a macroscopic description exists, with fluctuations about the macroscopic state, as discussed in Sect. “System Size Expansion”. In the second case, the mutational effects are weaker, and there is no macroscopic equation describing the system; the large-$N$ limit is a nonlinear Fokker–Planck equation of the diffusive type.
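The diffusive scaling of the neutral Moran model can be seen directly in simulation. The sketch below is illustrative only: it assumes the standard neutral Moran chain, in which at each step $n \to n+1$ or $n \to n-1$ each with probability $n(N-n)/N^2$ (consistent with the jump moments quoted above, but our assumption rather than a quotation of Eq. (26)). The fraction $x = n/N$ then stays constant on average, while for short times its variance grows roughly like $2x_0(1-x_0)t/N^2$.

```python
import random

random.seed(2)

N = 100          # population size (illustrative choice)
n0 = N // 2      # start with half the genes of type A
t_max = 500      # number of steps, short compared with fixation times
replicas = 4000

def moran(n, steps):
    """Neutral Moran chain: n -> n+1 or n -> n-1, each with prob n(N-n)/N^2."""
    for _ in range(steps):
        g = n * (N - n) / N ** 2
        u = random.random()
        if u < g:
            n += 1
        elif u < 2 * g:
            n -= 1
        # otherwise the state is unchanged at this step
    return n

xs = [moran(n0, t_max) / N for _ in range(replicas)]
mean = sum(xs) / replicas
var = sum((x - mean) ** 2 for x in xs) / replicas

print(mean)  # stays near x0 = 0.5: x(t) is a martingale in the neutral model
print(var)   # grows roughly like 2*x0*(1-x0)*t/N^2 = 0.025 for these values
```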
The Fokker–Planck equation (37) for stationary processes, where $A$ and $B$ are functions of $x$ only, can be solved by separation of variables, with solutions of the form $p_\lambda(x)\,e^{-\lambda t}$, where $\lambda$ is a constant. The equation for $p_\lambda(x)$ is then a second-order differential equation which, when boundary conditions are given, is of the Sturm–Liouville type. To specify the boundary conditions we once again introduce the probability current, this time through the continuity equation

$$\frac{\partial P(x, t)}{\partial t} + \frac{\partial J(x, t)}{\partial x} = 0, \qquad (44)$$

where the probability current $J(x, t)$ is given by

$$J(x, t) = A(x, t)P(x, t) - \frac{1}{2} \frac{\partial}{\partial x}\left[ B(x, t)P(x, t) \right]. \qquad (45)$$

Let us suppose that the system is defined on the interval $[a, b]$. Then if the boundaries are reflecting, there is no net flow of probability across $x = a$ and $x = b$. This implies that $J(a, t) = 0$ and $J(b, t) = 0$. If we integrate the equation of continuity (44) from $x = a$ to $x = b$ and apply these boundary conditions, we see that $\int_a^b P(x, t)\,dx$ is independent of time. Therefore, if the pdf is initially normalized, it remains normalized. This is in contrast to the case of absorbing boundary conditions, defined by $P(a, t) = 0$ and $P(b, t) = 0$. If the boundary conditions are at infinity, we require that $\lim_{x \to \pm\infty} P(x, t) = 0$, so that if $P$ is well behaved it is normalizable, and also that $\partial P/\partial x$ is well behaved in this limit: $\lim_{x \to \pm\infty} \partial P/\partial x = 0$. If $A$ or $B$ do not diverge as $x \to \pm\infty$, this implies that $\lim_{x \to \pm\infty} J(x, t) = 0$. Other types of boundary conditions are possible, and we do not attempt a complete classification here [16,46].

If $A$ and $B$ are independent of time, then from Eq. (44) the stationary state of the system must be given by $dJ(x)/dx = 0$, that is, $J$ is a constant. For reflecting boundary conditions this constant is zero, and so from Eq. (45) the stationary pdf $P_{st}(x)$ must satisfy

$$0 = A(x)P_{st}(x) - \frac{1}{2} \frac{d}{dx}\left[ B(x)P_{st}(x) \right]. \qquad (46)$$

This may be integrated to give

$$P_{st}(x) = \frac{C}{B(x)} \exp\left\{ 2 \int^x dx' \, \frac{A(x')}{B(x')} \right\}, \qquad (47)$$

where $C$ is a constant which has to be chosen so that $P_{st}(x)$ is normalized.

The Fokker–Planck equation, with $A$ independent of $t$ and $B$ constant, can be transformed into the Schrödinger-like problem

$$-B\,\frac{\partial \psi}{\partial t} = -\frac{B^2}{2} \frac{\partial^2 \psi}{\partial x^2} + U(x)\,\psi \qquad (48)$$

by the transformation

$$P(x, t) = P_{st}^{1/2}(x)\,\psi(x, t), \qquad (49)$$

where

$$U(x) = \frac{1}{2}\left[ A^2(x) + B\,\frac{dA}{dx} \right]. \qquad (50)$$

So a one-dimensional stationary stochastic process, under certain conditions (such as a constant second jump moment), is equivalent to quantum mechanics in imaginary time, with $B$ taking over the role of Planck's constant. As an example we consider the Ornstein–Uhlenbeck process, defined by

$$\frac{\partial P}{\partial t} = \frac{\partial}{\partial x}\left[ a x P \right] + D\,\frac{\partial^2 P}{\partial x^2}, \qquad x \in (-\infty, \infty), \; a > 0. \qquad (51)$$

In this case the potential (50) is given by $U(x) = [a^2 x^2 - 2aD]/2$, and so the problem is equivalent to the one-dimensional simple harmonic oscillator of quantum mechanics, but with an energy shift. As in that problem [47], the eigenfunctions are Hermite polynomials. Specifically, the right-eigenfunctions are

$$p_m(x) = P_{st}(x)\, \frac{1}{[2^m m!]^{1/2}}\, H_m(\alpha x), \qquad (52)$$
where $H_m$ are the Hermite polynomials [1] and $\alpha = (a/2D)^{1/2}$. The eigenvalue corresponding to the eigenfunction (52) is $\lambda_m = a m$, $m = 0, 1, \ldots$, and the left-eigenfunctions are $q_m(x) = [P_{st}(x)]^{-1} p_m(x)$. From these explicit solutions we may calculate other quantities of interest, such as correlation functions. We note that the eigenvalues are all non-negative and that the stationary state corresponds to $\lambda = 0$. The left-eigenfunction for the stationary state is equal to 1. These latter results hold true for a wide class of such problems.

Stochastic Differential Equations

So far we have described stochastic processes in terms of equations which give the time evolution of pdfs. In this section we will describe equations for the stochastic variables themselves. The best-known instance of such an equation is the Langevin equation for the velocity of a Brownian particle, and so we begin with this particular example. Suppose that a small macroscopic particle of mass $m$ (such as a pollen grain) is immersed in a liquid at a temperature $T$. In addition to any macroscopic motion that the particle may have, its velocity fluctuates due to the random collisions of the particle with the molecules of the liquid.
Stochastic Processes
For simplicity, we confine ourselves to one-dimensional motion along the x-axis. Then the equation of motion of the particle may be written in the form

\[ m\,\frac{d^2x}{dt^2} = -\alpha\,\frac{dx}{dt} - \frac{dV}{dx} + \mathcal{F}(t)\,. \qquad (53) \]
The first term on the right-hand side is due to the viscosity of the fluid, and α is the friction constant. The second term, where V(x) is a potential, represents the interaction of the particle with any external forces, such as gravity. The final term, 𝓕(t), is the random force due to collisions with the molecules of the liquid. Clearly, to complete the specification of the dynamics of the particle we need to give (i) the initial position and velocity of the particle, and (ii) the statistics of the random force 𝓕(t). To make progress with these points, we imagine a large number of realizations of the dynamics, in which the particle starts with the same initial position, x₀, and velocity, v₀, but where the initial positions and velocities of the molecules in the liquid will be different. Taking the average over a large number of such realizations will give the average position ⟨x(t)⟩ and velocity ⟨v(t)⟩ at time t, conditional on x(0) = x₀ and v(0) = v₀. The statistics of the fluctuating force 𝓕(t) are assumed to be such that (a) ⟨𝓕(t)⟩ = 0, since we do not expect one direction to be favored over the other. (b) ⟨𝓕(t)𝓕(t′)⟩ = 2D δ(t − t′), since we expect that after a few molecular collisions the value that 𝓕 takes on will be independent of its former value. That is, the force 𝓕 becomes uncorrelated over times of the order of a few collision times between molecules. This is tiny on observational time scales, and so taking the correlation function to be a delta-function is an excellent approximation. The weight of the delta-function is denoted by 2D, where at this stage D is undetermined. (c) 𝓕(t) is taken to be Gaussianly distributed on grounds of simplicity, but also because, by the central limit theorem, it is assumed that the net effect of the large number of molecules which collide with the pollen grain will lead to a distribution which is Gaussian.
Since a Gaussian distribution is specified by its first two moments, conditions (a), (b) and (c) completely define the statistics of 𝓕(t). Finally, Eq. (53) as it stands does not define a Markov process. This is most easily seen if we write down a discrete-time version of the equation. The second derivative means that x(t + δt) not only depends on x(t), but also on x(t − δt). Therefore only first-order derivatives should be included in such equations if the process is to be Markov. This is easily achieved by promoting v(t) to be a second stochastic variable in addition to x(t). Then Eq. (53) may be equivalently written as

\[ \frac{dx}{dt} = v\,, \qquad m\,\frac{dv}{dt} = -\alpha v - \frac{dV}{dx} + \mathcal{F}(t)\,, \qquad (54) \]

which does define a Markov process. Although, as we remarked in the Introduction, we deal almost exclusively with Markov processes in this article, the situation we have just discussed is a good illustration of one way of dealing with processes which are presented as being non-Markovian. The method simply consists of adding a sufficient number of supplementary variables to the definition of the state variables of the process until it becomes Markovian. There is no guarantee that this will be possible, or will require only a small number of additional variables to be promoted in this way, but it is the most straightforward and direct way of rendering non-Markovian processes tractable.

To begin the analysis of Eq. (54) we assume that there are no external forces, so that the term dV/dx is zero. We may then write Eq. (54) as the Langevin equation

\[ \frac{dv}{dt} = -\gamma v + F(t)\,, \qquad v(0) = v_0\,, \qquad (55) \]

where γ = α/m and F(t) = 𝓕(t)/m. This implies that ⟨F(t)⟩ = 0 and

\[ \langle F(t)\,F(t')\rangle = \frac{2D}{m^2}\,\delta(t - t')\,. \qquad (56) \]
Note that since F(t) is a random variable, solving the Langevin equation will give v(t) as a random variable (having a known distribution). It is therefore a stochastic differential equation. The function F is frequently called "the noise term" or simply "the noise". It is white noise, since the Fourier transform of a delta-function is a constant: all frequencies are present in equal amounts. Multiplying the Langevin equation (55) by the integrating factor e^{γt} gives

\[ \frac{d}{dt}\big[v(t)\,e^{\gamma t}\big] = F(t)\,e^{\gamma t} \;\Longrightarrow\; v(t) = v_0\,e^{-\gamma t} + e^{-\gamma t}\int_0^t dt'\,F(t')\,e^{\gamma t'}\,. \qquad (57) \]

By taking the average of the expression for v(t) we find ⟨v(t)⟩ = v₀ e^{−γt}. More interestingly, if we square the expression for v(t) and take the average, then we find

\[ \langle v^2(t)\rangle = v_0^2\,e^{-2\gamma t} + \frac{D}{\alpha m}\left(1 - e^{-2\gamma t}\right)\,, \qquad (58) \]
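The relaxation of the moments in Eqs. (57)–(58) is easy to reproduce with a direct Euler–Maruyama discretization of the Langevin equation (55). The sketch below is illustrative only; the parameter values (mass, friction constant, noise weight) are assumptions, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values (assumptions): mass, friction constant, noise weight
m_p, alpha, D = 1.0, 2.0, 0.5
gamma = alpha / m_p                     # gamma = alpha/m, as below Eq. (55)
dt, n_steps, n_paths = 1e-3, 4000, 20_000

v = np.full(n_paths, 1.0)               # every realization starts at v0 = 1
for _ in range(n_steps):
    # Euler-Maruyama step: dv = -gamma v dt + F dt, with Eq. (56) statistics
    v += -gamma * v * dt + np.sqrt(2 * D * dt / m_p**2) * rng.standard_normal(n_paths)

print(np.mean(v))        # ~ v0 exp(-gamma t), essentially 0 for t = 4 >> 1/gamma
print(np.mean(v**2))     # ~ D/(alpha m) = 0.25, Eqs. (58)-(59)
```

Averaging over many realizations, as in the text, the sample mean decays to zero and the sample second moment settles at D/(αm), which anticipates the fluctuation-dissipation relation derived next.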
which implies that

\[ \lim_{t\to\infty}\langle v^2(t)\rangle = \frac{D}{\alpha m}\,. \qquad (59) \]

On the other hand, as t → ∞, the Brownian particle will be in thermal equilibrium:

\[ \lim_{t\to\infty}\langle v^2(t)\rangle = v_{\mathrm{eq}}^2 \qquad \text{and} \qquad \tfrac{1}{2}\,m\,v_{\mathrm{eq}}^2 = \tfrac{1}{2}\,kT\,, \]
where T is the temperature of the liquid and k is Boltzmann's constant. This implies that

\[ \tfrac{1}{2}\,m\,\frac{D}{\alpha m} = \tfrac{1}{2}\,kT \;\Longrightarrow\; D = \alpha k T\,. \qquad (60) \]

The molecules of the liquid are acting as a heat bath for the system, which in this case is a single Brownian particle. The equation D = αkT is a simple example of a fluctuation-dissipation theorem, and determines D in terms of the friction constant, α, and the temperature of the liquid, T. Although we have presented a somewhat heuristic rationale for Eq. (54), it may be derived in a more controlled way. A particularly clear derivation has been given by Zwanzig [60], where the starting point is a Hamiltonian which contains three terms: for the system, the heat bath and the interaction between the system and the heat bath. Taking the heat bath to be made up of coupled harmonic oscillators and the interaction term between the system and heat bath to be linear, it is possible to integrate out the bath degrees of freedom exactly, and be left only with the equations of motion of the system degrees of freedom plus the initial conditions of the bath degrees of freedom. Assuming that the bath is initially in thermal equilibrium, so that these initial values are distributed according to a Boltzmann distribution, adds extra "noise" terms to the equations of motion which, with a few more plausible assumptions, make them of the Langevin type.

Examples of Langevin-like Equations

1. Overdamped Brownian motion. Frequently the viscous damping force −αv is much larger than the inertial term m d²x/dt² in Eq. (53), and so to a good approximation the left-hand side of Eq. (53) can be neglected. Scaling time by α, we arrive at the Langevin equation for the motion of an overdamped Brownian particle:

\[ \frac{dx}{dt} = -V'(x) + \mathcal{F}(t)\,, \qquad x(0) = x_0\,, \qquad (61) \]

where the prime denotes differentiation with respect to x and where, due to the rescaling of time by α,

\[ \langle \mathcal{F}(t)\rangle = 0\,, \qquad \langle \mathcal{F}(t)\,\mathcal{F}(t')\rangle = 2\tilde{D}\,\delta(t - t')\,, \qquad \tilde{D} = \frac{D}{\alpha}\,. \qquad (62) \]

A particularly well-known case is when the Brownian particle is moving in the harmonic potential V(x) = ax²/2. Then

\[ \frac{dx}{dt} = -ax + \mathcal{F}(t)\,, \qquad x(0) = x_0\,. \qquad (63) \]

Since Eq. (63) relating x(t) to 𝓕(t) is linear, and since the distribution of 𝓕(t) is Gaussian, x(t) is also distributed according to a Gaussian distribution. Comparing with Eqs. (55) and (56), which also define a linear system, we find that ⟨x(t)⟩ = x₀ e^{−at} and, from Eq. (58), ⟨x²(t)⟩ = x₀² e^{−2at} + (D̃/a)(1 − e^{−2at}). This gives

\[ P(x,t|x_0,0) = \sqrt{\frac{a}{2\pi\tilde{D}\,\big(1 - e^{-2at}\big)}}\;\exp\left\{-\,\frac{a\,\big(x - x_0\,e^{-at}\big)^2}{2\tilde{D}\,\big(1 - e^{-2at}\big)}\right\}\,. \qquad (64) \]

It is straightforward to check that this conditional pdf satisfies the Fokker–Planck equation (51) for the Ornstein–Uhlenbeck process. Below we will show this more directly, by starting from the Langevin equation (61) and deriving Eq. (51) when V(x) is quadratic. Another case of interest is when V(x) is a double-well potential, V(x) = −ax²/2 + bx⁴/4. If the particle is initially located near the bottom of one of the potential wells, it will take on average a time of the order of e^{ΔV/D̃} to hop over the barrier and into the well on the other side. Here ΔV is the height of the barrier that it has to hop over [29].

2. Environmental noise in population biology. One of the simplest models of two species with population sizes N₁ and N₂ which are competing for a common resource is the pair of coupled deterministic ordinary differential equations Ṅᵢ = rᵢNᵢ, i = 1, 2. The growth rates, rᵢ, depend on the population sizes in such a way that as the population sizes increase, the rᵢ decrease, reflecting the increased competition for resources. This could be modeled, for instance, by taking rᵢ = aᵢ − bᵢᵢNᵢ − bᵢⱼNⱼ with i, j = 1, 2 and i ≠ j. In reality, external factors such as climate, terrain, the presence of other species, and indeed any factor which has an uncertain influence on these two species, will also affect the growth rates. This can be modeled by adding an external random term to the rᵢ which represents this environmental stochasticity [33]. Then the equations become

\[ \frac{dN_1}{dt} = a_1 N_1 - b_{11} N_1^2 - b_{12} N_1 N_2 + N_1\,\varrho_1(t)\,, \qquad \frac{dN_2}{dt} = a_2 N_2 - b_{22} N_2^2 - b_{21} N_2 N_1 + N_2\,\varrho_2(t)\,. \qquad (65) \]
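As a consistency check on Eqs. (63)–(64) above, one can simulate the overdamped Langevin equation directly and compare the sample moments with the Gaussian moments ⟨x(t)⟩ = x₀e^{−at} and (D̃/a)(1 − e^{−2at}). The parameter values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative values (assumptions) for a and D-tilde of Eqs. (62)-(63)
a, Dt = 1.0, 0.5
x0, dt, T, n_paths = 1.0, 1e-3, 1.0, 40_000

x = np.full(n_paths, x0)
for _ in range(int(T / dt)):
    # Euler-Maruyama step for dx/dt = -a x + noise, Eq. (63)
    x += -a * x * dt + np.sqrt(2 * Dt * dt) * rng.standard_normal(n_paths)

mean_64 = x0 * np.exp(-a * T)                   # mean of the Gaussian in Eq. (64)
var_64 = (Dt / a) * (1 - np.exp(-2 * a * T))    # variance of the Gaussian in Eq. (64)
print(x.mean(), mean_64)
print(x.var(), var_64)
```

With 40,000 realizations the sample mean and variance agree with Eq. (64) to within sampling and O(dt) discretization error.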
Since the noise terms ϱᵢ(t) are designed to reflect the large number of coupled variables omitted from the description of the model, it is natural, by virtue of the central limit theorem, to assume that they are Gaussianly distributed. It also seems reasonable to assume that any temporal correlation between these external influences is on scales very much shorter than those of interest to us here, and that the noises have zero mean. We therefore assume that

\[ \langle \varrho_i(t)\rangle = 0\,, \qquad \langle \varrho_i(t)\,\varrho_j(t')\rangle = 2D_i\,\delta_{ij}\,\delta(t - t')\,, \qquad (66) \]
where the Di describe the strength of the stochastic effects. The deterministic equations (that is, Eq. (65) without the noise terms) have a fixed point at the origin, one on each of the N 1 and N 2 axes, and may have another at non-zero N 1 and N 2 . For some values of the parameters this latter fixed point may be a saddle, with those on the axes being stable and the origin unstable. In this situation the eventual fate of the species depends significantly on the noise: if the combination of the nonlinear dynamics and the noise drives the system to the vicinity of the fixed point on the N 1 axis, then species 2 will become extinct, and vice-versa. Langevin equations with Gaussian white noise are equivalent to Fokker–Planck equations. This can be most easily seen by calculating the jump moments (39) from the Langevin equation. For instance, if we begin from the Langevin equation for an overdamped Brownian particle (61), Z
tC t
x(t) x(t C t) x(t) D
dt 0 x˙ (t 0 )
tC t
D
dt 0 V 0 (x(t 0 )) C (t) ;
(67)
t
F (t 0 ). From Eq. (62) it is straightwhere (t) D t forward to calculate the moments of (t): h(t)i D 0, 2
Z
dt 0
tC t
h (t)i D
dt t
0
2 @P @ 0 ˜@ P : D V (x)P C D @t @x @x 2
(70)
From Eq. (47), the stationary pdf is Pst (x) D C expf ˜ D C expfV (x)/kTg, as expected. V (x)/ Dg Although in this article we have largely restricted our attention to stochastic processes involving one variable, the construction of a Fokker–Planck equation from the Langevin equation goes through in a similar way for an n-dimensional process x D (x1 ; : : : ; x n ). In this case the jump moments are ˝
˛ x i 1 (t)x i 2 (t) : : : x i ` (t) x(t)Dx D D i 1 :::i ` (x; t)t C o(t) ;
(71)
where x i ˛ D x i ˛ (t C t) x i ˛ . The Kramers–Moyal expansion is then 1 X ˚ (1)` @` @P D i 1 :::i ` (x; t)P : (72) D @t `! @x i 1 : : : @x i ` `D1
The Langevin equation for Brownian motion (54), without going to the overdamped limit, serves as a simple illustration of this generalization. Here x(t) D vt and v(t) D v t m1 V 0 (x) t C m1 (t). This results in the Fokker–Planck equation
t
Z R tC t
nonzero (t)2 . This is a weaker, but more specific, statement than saying it is o (t). Using Eqs. (35) and (36), the Fokker–Planck equation which is equivalent to the Langevin equation (61) is found to be
Z
tC t
00
0
00
˜ dt hF (t )F (t )i D 2 Dt
t
(68) and, since (t) is Gaussian, hn (t)i is zero if n is odd, and at least of order (t)2 for n 4. This implies that M1 (x; t) D V 0 (x)t C O (t)2 ; ˜ M2 (x; t) D 2 Dt C O (t)2 ;
kT @2 P @P @ @ ˚ v C m1 V 0 (x) P C : D [vP]C @t @x @v m @v 2 (73) This is Kramer’s equation. It has a stationary pdf Pst (x; v) D C expfE/kTg, where E D mv 2 /2 C V (x). We end this section by finding the Fokker–Planck equation which is equivalent to the general set of Langevin equations of the form x˙ i D A i (x; t)C
(69)
with all moments of order (t)2 or higher for ` > 2. The notation O (t)2 means that the magnitude of this quantity is less than a constant times (t)2 , for sufficiently small
m X
g i˛ (x; t) %˛ (t);
i D 1; : : : ; n ; (74)
˛D1
where %˛ (t), ˛ D 1; : : : ; m, is a Gaussian white noise with zero mean and with h%˛ (t)%ˇ (t 0 )i D ı˛ˇ ı(t t 0 ) :
(75)
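When the noise in Eq. (74) is multiplicative, i.e. g depends on the state variable, the Langevin equation only becomes well defined once a discretization rule is chosen, as discussed in the following paragraphs. The sketch below illustrates this numerically for an assumed toy case g(x) = σx (the example and all parameter values are illustrative, not from the text): the pre-point update realizes the Itô rule and the mid-point (Heun-type) update the Stratonovich rule, and the two give measurably different averages:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed illustrative multiplicative-noise model: dx = sigma * x * noise
sigma, dt, T, n = 0.5, 1e-3, 1.0, 100_000
x_ito = np.ones(n)
x_str = np.ones(n)

for _ in range(int(T / dt)):
    dW = np.sqrt(dt) * rng.standard_normal(n)
    # pre-point rule: the multiplying function uses x before the noise pulse
    x_ito = x_ito + sigma * x_ito * dW
    # mid-point rule (Heun predictor-corrector): average of x before and after
    pred = x_str + sigma * x_str * dW
    x_str = x_str + sigma * 0.5 * (x_str + pred) * dW

print(x_ito.mean())      # pre-point (Ito):          <x> stays at x0 = 1
print(x_str.mean())      # mid-point (Stratonovich): <x> ~ exp(sigma^2 T/2)
```

For this model the Itô mean is constant while the Stratonovich mean grows like e^{σ²t/2}, a direct consequence of the extra drift term proportional to Θ(0) derived next.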
Proceeding as in Eq. (67), but noting the dependence of the function g_{iα} on the stochastic variable, yields

\[ M_i(x,t;\Delta t) = \left[ A_i(x,t) + \Theta(0)\sum_{j=1}^{n}\sum_{\alpha=1}^{m} g_{j\alpha}(x,t)\,\frac{\partial g_{i\alpha}(x,t)}{\partial x_j} \right]\Delta t + O\big((\Delta t)^2\big)\,, \]
\[ M_{ij}(x,t;\Delta t) = \sum_{\alpha=1}^{m} g_{i\alpha}(x,t)\,g_{j\alpha}(x,t)\,\Delta t + O\big((\Delta t)^2\big)\,, \qquad (76) \]

with all jump moments higher than the second being of order (Δt)² or higher. The quantity Θ(0) is the value of the Heaviside theta function, Θ(x), at x = 0, and is indeterminate. This indicates that the Langevin description does not correspond to a unique Fokker–Planck equation. This situation occurs whenever the white noise in a Langevin equation is multiplied by a function which depends on the state variable, as in Eq. (74). For systems such as this, acted upon by multiplicative noise, the Langevin description has to be supplemented by a rule which says whether the state variable in the multiplying function (g_{iα} in Eq. (74)) is that before or after the noise pulse acts [52]. If it is taken to be the value immediately before the noise pulse acts, then Θ(0) = 0 (Itô rule), whereas if it is taken to be the average of the values before and after, then Θ(0) = 1/2 (Stratonovich rule). The Fokker–Planck equation is now found from Eq. (72) to be
\[ \frac{\partial P}{\partial t} = -\sum_{i=1}^{n}\frac{\partial}{\partial x_i}\big[A_i(x,t)\,P(x,t)\big] + \frac{1}{2}\sum_{i,j=1}^{n}\sum_{\alpha=1}^{m}\frac{\partial^2}{\partial x_i\,\partial x_j}\big[g_{i\alpha}(x,t)\,g_{j\alpha}(x,t)\,P(x,t)\big] \qquad (77) \]

in the Itô case and

\[ \frac{\partial P}{\partial t} = -\sum_{i=1}^{n}\frac{\partial}{\partial x_i}\big[A_i(x,t)\,P(x,t)\big] + \frac{1}{2}\sum_{i,j=1}^{n}\sum_{\alpha=1}^{m}\frac{\partial}{\partial x_i}\Big\{ g_{i\alpha}(x,t)\,\frac{\partial}{\partial x_j}\big[g_{j\alpha}(x,t)\,P(x,t)\big] \Big\} \qquad (78) \]

in the Stratonovich case.

Path Integrals

While most early work on stochastic processes was concerned with linear systems, attention naturally soon moved on to the many interesting systems which could be modeled as nonlinear stochastic processes. These systems are much more difficult to analyze. For example, a nonlinear Langevin equation cannot be solved directly, and so the averaging procedure cannot be carried out in the same explicit way as described in Sect. "Stochastic Differential Equations". There is, however, one method which is applicable to many nonlinear stochastic differential equations of interest: the solution of these equations can be formally written down as a path-integral, and from this correlation functions and other quantities of physical interest can be obtained. This also has the advantage that all the formalism and approximation schemes developed to study functional integrals over the years can be called into play. Path-integrals are intimately related to Brownian motion, and the earliest work on the subject, by Wiener [56,57], emphasized this. If the problem of interest is formulated as a set of Langevin equations, the derivation of the path-integral representation is particularly straightforward, if rather heuristic. For clarity we begin with the simplest case: an overdamped system with a single degree of freedom, x, acted upon by white noise. The Langevin equation is given by Eq. (61) and the noise is defined by Eq. (62). Since the noise is assumed to be Gaussian, Eq. (62) is a complete specification. An equivalent way of giving it is through the pdf [12]:

\[ P[\mathcal{F}]\,\mathcal{D}\mathcal{F} \propto \exp\left\{ -\frac{1}{4\tilde{D}}\int dt\,\mathcal{F}^2(t) \right\}\mathcal{D}\mathcal{F}\,, \qquad (79) \]

where 𝒟𝓕 is the functional measure. The idea is now to regard the Langevin equation (61) as defining a mapping 𝓕 ↦ x. The pdf for the x variable is then given by

\[ P[x] = P[\mathcal{F}]\big|_{\mathcal{F} = \dot{x} + V'(x)}\;J[x] \propto \exp\left\{ -\frac{1}{4\tilde{D}}\int dt\,\big[\dot{x} + V'(x)\big]^2 \right\} J[x]\,, \qquad (80) \]

where

\[ J[x] = \det\left(\frac{\delta\mathcal{F}}{\delta x}\right) \qquad (81) \]

is the Jacobian of the transformation. An explicit expression for the Jacobian may be obtained either by direct calculation of a discretized form of the Langevin equation [21] or through use of the identity relating the determinant of a matrix to the exponential of the trace of the logarithm of that matrix [59]. One finds that J[x] ∝ exp{Θ(0) ∫ dt V''(x)}. The quantity Θ(0) is once again the indeterminate value of the Heaviside theta function Θ(x) at x = 0. Its appearance is a reflection of the fact that,
due to the Brownian-like nature of the paths in the functional integral, the nature of the discretization appears explicitly through this factor [48]. If we consistently use the mid-point rule throughout, then we may take Θ(0) = 1/2, which gives

\[ P[x] \propto \exp\left\{ -\frac{1}{4\tilde{D}}\int dt\,\big[\dot{x} + V'(x)\big]^2 + \frac{1}{2}\int dt\,V''(x) \right\} \equiv \exp\big\{-S[x]/\tilde{D}\big\}\,. \qquad (82) \]

All quantities of interest can now be found from expression (82). For example, the conditional probability distribution P(x, t | x₀, t₀) is given by

\[ P(x,t|x_0,t_0) = \big\langle \delta\big(x - x(t)\big) \big\rangle_{x(t_0)=x_0} = \int_{x(t_0)=x_0} \mathcal{D}x\;\delta\big(x - x(t)\big)\,P[x]\,. \qquad (83) \]
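The path-integral expression (83) can be evaluated numerically by discretizing time and iterating the short-time weight on a spatial grid (a transfer-matrix computation). The sketch below does this for the quadratic potential V(x) = ax²/2, using a pre-point discretization of the Onsager–Machlup weight, and recovers the moments of the Ornstein–Uhlenbeck pdf (64); all parameter values are illustrative assumptions:

```python
import numpy as np

# Illustrative values (assumptions): V(x) = a x^2/2, noise strength D-tilde
a, Dt = 1.0, 0.5
x0, T, dt = 1.0, 1.0, 0.01

x = np.linspace(-6.0, 6.0, 601)
dx = x[1] - x[0]
drift = -a * x                                      # -V'(x)

# Short-time weight exp(-S/D~) in pre-point discretization:
# K[i, j] is the density for a step from x_j to x_i in time dt
K = np.exp(-(x[:, None] - x[None, :] - drift[None, :] * dt) ** 2 / (4 * Dt * dt))
K /= K.sum(axis=0, keepdims=True) * dx              # normalize each column

P = np.zeros_like(x)
P[np.argmin(np.abs(x - x0))] = 1.0 / dx             # delta function at x0
for _ in range(int(T / dt)):
    P = (K @ P) * dx                                # one transfer-matrix step

mean = np.sum(x * P) * dx
var = np.sum(x**2 * P) * dx - mean**2
print(mean, var)   # compare with x0 e^{-aT} = 0.368 and (Dt/a)(1 - e^{-2aT}) = 0.432
```

The computed mean and variance agree with the exact Gaussian moments of Eq. (64) up to O(dt) discretization error, illustrating that the iterated short-time kernel reproduces the conditional pdf.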
The expression (82) has much in common with Feynman's formulation of quantum mechanics as a path-integral [11]. In fact, another way to obtain the result is to exploit the transformation (49) to write the Fokker–Planck equation (70) as a Schrödinger equation in imaginary time, τ = it, with a potential U(x) = (1/2)[V'(x)]² − D̃V''(x), following Eq. (50). The action in the quantum-mechanical path-integral is

\[ \frac{i}{\hbar}\int dt \left[ \frac{1}{2}\dot{x}^2 - U(x) \right] \;\longrightarrow\; -\frac{1}{2\tilde{D}}\int d\tau \left[ \frac{1}{2}\dot{x}^2 + \frac{1}{2}V'(x)^2 - \tilde{D}\,V''(x) \right]\,, \qquad (84) \]

which is Eq. (82), since ∫_{t₀}^{t} dt ẋV'(x) = ∫_{x₀}^{x} dx V'(x) = V(x) − V(x₀) does not depend on the path, only on the end-points. The functional S[x] is analogous to the action in classical mechanics, and is frequently referred to as such. It is also sometimes referred to as the generalized Onsager–Machlup functional, in recognition of the original work carried out by Onsager and Machlup [40], in the case of a linear Langevin equation, in 1953.

The above discussion can be generalized in many ways. For example, if the Langevin equation for an n-dimensional process takes the form

\[ \dot{x}_i = A_i(x) + \varrho_i(t)\,, \qquad \langle \varrho_i(t)\,\varrho_j(t')\rangle = 2D_{ij}\,\delta(t - t')\,, \qquad (85) \]

where ϱᵢ(t) is a Gaussian noise with zero mean and Dᵢⱼ is independent of x, then the general Onsager–Machlup functional is [21]

\[ S[x] = \int dt \left[ \frac{1}{4}\sum_{i,j} \big\{\dot{x}_i - A_i(x)\big\}\,D^{-1}_{ij}\,\big\{\dot{x}_j - A_j(x)\big\} + \frac{1}{2}\sum_i \frac{\partial A_i}{\partial x_i} \right]\,, \qquad (86) \]

if the matrix Dᵢⱼ is non-singular. The generalization to the situation where the noise is multiplicative is more complicated, and is analogous to the path-integral formulation of quantum mechanics in curved space [20].

System Size Expansion

In Example 2 of Sect. "The Fokker–Planck Equation" we explicitly showed how the master equation may have different limits when the size of the system, N, becomes large. In one case both the first and second jump moments were of the same order (and much larger than the higher jump moments), and so a nonlinear Fokker–Planck equation of the diffusion type was obtained in the limit N → ∞. In another case, the first jump moment scaled in a different way from the second moment, and so the N → ∞ limit gave a deterministic macroscopic equation of the form ẋ = f(x), with finite-N effects presumably consisting of 1/√N fluctuations about the macroscopic state, x(t). It is the second scenario that we will explore in this section. It can be formalized by writing

\[ \frac{n}{N} = x(t) + \frac{\xi}{\sqrt{N}}\,, \qquad (87) \]

and substituting this into the master equation, then equating terms of the same order in 1/√N. The leading-order equation obtained in this way will be the macroscopic equation, and the function f(x) will emerge from the analysis. The next-to-leading-order equation turns out to be a linear Fokker–Planck equation in the variable ξ. Higher-order terms may also be included. This formalism was first developed by van Kampen [51] and is usually referred to as van Kampen's system-size expansion. We will describe it in the specific case of a one-step process for a single stochastic variable, in order to bring out the essential features of the method. When using this formalism it is useful to rewrite the master equation (23) using step operators, which act on an arbitrary function of n according to E f(n) = f(n + 1) and E⁻¹ f(n) = f(n − 1). This gives

\[ \frac{dP(n,t)}{dt} = \big(E - 1\big)\big[T(n-1|n)\,P(n,t)\big] + \big(E^{-1} - 1\big)\big[T(n+1|n)\,P(n,t)\big]\,. \qquad (88) \]
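A one-step master equation of the form (88) can also be simulated exactly, event by event, with Gillespie's algorithm [17,18] mentioned under "Future Directions". The sketch below assumes illustrative logistic rates T(n+1|n) = 2bn(1 − n/N) and T(n−1|n) = cn²/N + dn, chosen so that the macroscopic limit is the logistic equation dx/dt = x(r − ax) of Eq. (94) with r = 2b − d and a = 2b + c (this specific rate convention, in which no time rescaling is needed, is an assumption, since the article's Example 3 is not reproduced in this excerpt):

```python
import random

random.seed(3)

# Assumed illustrative one-step rates:
#   T(n+1|n) = 2 b n (1 - n/N),   T(n-1|n) = c n^2/N + d n
b, c, d, N = 1.0, 1.0, 1.0, 1000
r, a = 2 * b - d, 2 * b + c          # macroscopic law dx/dt = x(r - a x)

def gillespie(n, t_end):
    """Exact stochastic simulation of the one-step process up to time t_end."""
    t = 0.0
    while t < t_end:
        up = 2 * b * n * (1 - n / N)
        down = c * n * n / N + d * n
        total = up + down
        if total == 0.0:
            break                                   # n = 0 is absorbing
        t += random.expovariate(total)              # exponential waiting time
        n += 1 if random.random() < up / total else -1
    return n

finals = [gillespie(50, 15.0) / N for _ in range(20)]
print(sum(finals) / len(finals))                    # near the fixed point r/a = 1/3
```

Averaged over a few replicas, trajectories settle near the macroscopic fixed point x* = r/a, with O(1/√N) fluctuations of the kind the system-size expansion below quantifies.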
We begin by using Eq. (87) to write the pdf which appears in the master equation (88) as

\[ P\big(Nx(t) + \sqrt{N}\,\xi,\,t\big) \equiv \Pi(\xi,t) \;\Longrightarrow\; \dot{P} = \frac{\partial \Pi}{\partial t} - N^{1/2}\,\frac{dx}{dt}\,\frac{\partial \Pi}{\partial \xi}\,. \qquad (89) \]

This gives an expression for the left-hand side of the master equation, Ṗ. To get an expression for the right-hand side, (a) the step operators are expanded in powers of 1/√N [53]:

\[ E^{\pm 1} = 1 \pm \frac{1}{\sqrt{N}}\,\frac{\partial}{\partial \xi} + \frac{1}{2!\,N}\,\frac{\partial^2}{\partial \xi^2} + O\big(N^{-3/2}\big)\,, \qquad (90) \]

(b) T(n ± 1|n) is expressed in terms of ξ and N, and (c) P(n, t) is replaced by Π(ξ, t). Steps (a), (b) and (c) give the right-hand side of the master equation as a power series in 1/√N. Equating the left-hand and right-hand sides order by order in 1/√N (this may require a rescaling of the time, t, by a power of N) gives to leading order (the ∂Π/∂ξ terms cancel) an equation of the form dx/dt = f(x). This may be solved subject to the condition x(0) = x₀ = n₀/N, if we take the initial condition on the master equation to be P(n, 0) = δ_{n,n₀}. We denote the solution of this macroscopic equation by x_M(t). To next order in 1/√N, the Fokker–Planck equation

\[ \frac{\partial \Pi}{\partial t} = -f'(x)\,\frac{\partial}{\partial \xi}\big[\xi\,\Pi\big] + \frac{1}{2}\,g(x)\,\frac{\partial^2 \Pi}{\partial \xi^2}\,, \qquad (91) \]

describing a linear stochastic process, is found. The functions f'(x) and g(x) are to be evaluated at x = x_M(t), and so are simply functions of time. If the macroscopic system tends to a fixed point, x_M(t) → x* as t → ∞, then f'(x) and g(x) may be replaced by constants in order to study the fluctuations about this stationary state.

To illustrate the method we use Example 3, Sect. "The Master Equation". Equating both sides of the master equation in this case one finds

\[ \frac{\partial \Pi}{\partial t} - \sqrt{N}\,\frac{dx}{dt}\,\frac{\partial \Pi}{\partial \xi} = \frac{1}{\sqrt{N}}\,\big[f_-(x) - f_+(x)\big]\,\frac{\partial \Pi}{\partial \xi} + \frac{1}{2N}\,\big[f_-(x) + f_+(x)\big]\,\frac{\partial^2 \Pi}{\partial \xi^2} + \frac{1}{N}\,\big[f_-'(x) - f_+'(x)\big]\,\frac{\partial}{\partial \xi}\big[\xi\,\Pi\big] + \cdots\,, \qquad (92) \]

where the functions f₋(x) and f₊(x) are given by

\[ f_-(x) = cx^2 + dx\,, \qquad f_+(x) = 2bx(1 - x)\,. \qquad (93) \]

For the left- and right-hand sides to balance in Eq. (92), a rescaled time τ = t/N needs to be introduced. Then to leading order one finds dx/dτ = f(x), where f(x) = f₊(x) − f₋(x). At next-to-leading order Eq. (91) is found, with g(x) = f₊(x) + f₋(x). The explicit form of the macroscopic equation is

\[ \frac{dx}{d\tau} = x\,(r - ax)\,, \qquad (94) \]

where r = 2b − d and a = 2b + c. Equation (94) is the logistic equation, which is the usual phenomenological way to model intraspecies competition. Since the Fokker–Planck equation (91) describes a linear process, its solution is a Gaussian. This means that the probability distribution Π(ξ, t) is completely specified by the first two moments ⟨ξ(t)⟩ and ⟨ξ²(t)⟩. Multiplying Eq. (91) by ξ and by ξ², and integrating over all ξ, one finds

\[ \frac{d}{dt}\langle \xi(t)\rangle = f'\big(x_M(t)\big)\,\langle \xi(t)\rangle\,, \qquad \frac{d}{dt}\langle \xi^2(t)\rangle = 2f'\big(x_M(t)\big)\,\langle \xi^2(t)\rangle + g\big(x_M(t)\big)\,. \qquad (95) \]

We have chosen the initial condition to be x₀ = n₀/N, which implies that ξ(0) = 0. The first equation in (95) then implies that ⟨ξ(t)⟩ = 0 for all t. Multiplying the second equation by f^{−2}(x_M(t)) one finds that

\[ \langle \xi^2(t)\rangle = f^2\big(x_M(t)\big)\int_0^t dt'\,\frac{g\big(x_M(t')\big)}{f^2\big(x_M(t')\big)}\,, \qquad (96) \]

and so the determination of ⟨ξ²(t)⟩ is reduced to quadrature. The method can be applied to systems with more than one stochastic variable and to those which are not one-step processes. Details are given in van Kampen's book [53].

Future Directions

This article has focused largely on classical topics in the theory of stochastic processes, since these form the foundations on which the subject is built. Much of the current work, and one would expect future work, will be numerical in character. Some of this will begin from a basic Markovian description in the form of reactions among chemical species, even if the system is not chemical in nature (Example 3 of Sect. "The Master Equation" is an example). A straightforward algorithm developed by Gillespie [17,18], and since then extended and improved [3,19], provides an efficient way of simulating such systems. It thus provides a valuable method of investigating systems which may be formulated as complicated multivariable
master equations, which complements the methods we have discussed here. However, many current studies do not begin from a system which can be described in this way, and there is every indication that this will be more true in the future. For instance, in agent based models the stochastic element may be due to mutations in characteristics, traits or behavior, which may be difficult or impossible to formulate mathematically. Such agent based models are certainly individually based, but each individual may have different attributes and generally behave in such a complex way that only numerical simulations can be used to explore the behavior of the system as a whole. Although these complex systems may be used to model more realistic situations, the well-known problems associated with the large number of parameters typically required to describe such systems, will mean that simplified versions will need to be analyzed in order to understand them at a deeper level. These simpler models are likely to include those where the agents of a particular species are essentially identical. In this article we have discussed how the classical equations of the theory of stochastic processes, such as the Fokker–Planck equation, can be obtained from such models. They will therefore form a bridge between the agent-based approaches which are expected to become more prevalent in the future, and the analytic approaches which lie at the heart of the theory of stochastic processes. Bibliography Primary Literature 1. Abramowitz M, Stegun I (Eds) (1965) Handbook of mathematical functions. Dover, New York 2. Bachelier L (1900) Théorie de la spéculation. Annales Scientifiques de L’Ecole Normale Supérieure III(17):21–86 3. Cao Y, Li H, Petzold L (2004) Efficient formulation of the stochastic simulation algorithm for chemically reacting systems. J Chem Phys 121:4059–4067 4. Chandrasekhar S (1943) Stochastic problems in physics and astronomy. Rev Mod Phys 15:1–89. Reprinted in Wax (1954) 5. 
Cox DR, Miller HD (1968) The theory of stochastic processes, Chap 3. Chapman and Hall, London 6. Crow JF, Kimura M (1970) An introduction to population genetics theory. Harper and Row, New York 7. Ehrenfest P, Ehrenfest T (1907) Über zwei bekannte Einwände gegen das Boltzmannsche H-Theorem. Phys Z 8:311–314 8. Einstein A (1905) Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen. Ann Physik 17:549–560. For a translation see: A. Einstein, “Investigations on the Theory of the Brownian Movement” Fürth R (ed), Cowper AD (tr) (Dover, New York, 1956). Chapter I 9. Einstein A (1906) Zur Theorie der Brownschen Bewegung. Ann Physik 19:371–381. For a translation see: A. Einstein, “Investigations on the Theory of the Brownian Movement” Fürth R (ed), Cowper AD (tr) (Dover, New York, 1956). Chapter II
10. Feller W (1968) An introduction to probability theory and its applications, Chap XV, 3rd edn. Wiley, New York 11. Feynman RP (1948) Space-time approach to non-relativistic quantum mechanics. Rev Mod Phys 20:367–387 12. Feynman RP, Hibbs AR (1965) Quantum mechanics and path integrals, Chap 12. McGraw-Hill, New York 13. Fisher RA (1930) The genetical theory of natural selection. Clarendon Press, Oxford 14. Fokker AD (1914) Die mittlere Energie rotierende elektrischer Dipole im Strahlungsfeld. Ann Physik 43:810–820 15. Gantmacher FR (1959) The theory of matrices, Chap 13, Sect 6, vol 2. Chelsea Publishing Co., New York 16. Gardiner CW (2004) Handbook of stochastic methods, 3rd edn. Springer, Berlin 17. Gillespie DT (1976) A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J Comput Phys 22:403–434 18. Gillespie DT (1977) Exact stochastic simulation of coupled chemical reactions. J Phys Chem 81:2340–2361 19. Gillespie DT (2001) Approximate accelerated stochastic simulation of chemically reacting systems. J Chem Phys 115:1716– 1733 20. Graham R (1977) Path integral formulation of general diffusion processes. Z Physik B26:281–290 21. Graham R (1975) Macroscopic theory of fluctuations and instabilities. In: Riste T (ed) Fluctuations, Instabilities, and Phase Transitions. Plenum, New York, pp 215–293 22. Haken H (1983) Synergetics. Springer, Berlin. Sect 6.3 23. Kac M (1947) Random walk and the theory of Brownian motion. Amer Math Mon 54:369–391 24. Karlin S, McGregor J (1964) On some stochastic models in genetics. In: Gurland J (ed) Stochastic problems in medicine and biology. University of Wisconsin Press, Madison, pp 245–279 25. Kendall DG (1948) On some modes of population growth leading to R. A. Fisher’s logarithmic series distribution. Biometrika 35:6–15 26. Kolmogorov AN (1931) Über die analytischen Methoden in der Wahrscheinlichkeitsrechung. Math Ann 104:415–458 27. 
Kolmogorov AN (1936) Anfangsgründe der Theorie der Markoffschen Ketten mit unendlich vielen möglichen Zuständen. Mat Sbornik (N.S.) 1:607–610 28. Krafft O, Schaefer M (1993) Mean passage times for tridiagonal transition matrices and a two-parameter Ehrenfest urn model. J Appl Prob 30:964–970 29. Kramers HA (1940) Brownian motion in a field of force and the diffusion model of chemical reactions. Physica 7:284–304 30. Langevin P (1908) Sur la théorie du mouvement brownien. C R Acad Sci Paris 146:530–533. For a translation see: D. S. Lemons and A. Gythiel, Am. J. Phys. 65: 1079–1081 (1997) 31. Malécot G (1944) Sur un problème de probabilités en chaine que pose la génétique. C R Acad Sci Paris 219:379–381 32. Markov AA (1906) Rasprostranenie zakona bol’shih chisel na velichiny, zavisyaschie drug ot druga. Izv Fiz-Matem Obsch Kazan Univ (Series 2) 15:135–156. See also: “Extension of the limit theorems of probability theory to a sum of variables connected in a chain”, in Appendix B of R. Howard “Dynamic Probabilistic Systems, vol 1: Markov Chains” (John Wiley and Sons, 1971) 33. May RM (1973) Model ecosystems, Chap 5. Princeton University Press, Princeton
Structurally Dynamic Cellular Automata
Structurally Dynamic Cellular Automata
ANDREW ILACHINSKI
Center for Naval Analyses, Alexandria, USA

Article Outline
Glossary
Definition of the Subject
Introduction
The Basic Model
Emerging Patterns and Behaviors
SDCA as Models of Computation
Generalized SDCA Models
Related Graph Dynamical Systems
SDCA as Models of Fundamental Physics
Future Directions and Speculations
Bibliography

Glossary
Adjacency matrix The adjacency matrix of a graph with N sites is an N × N matrix [a_ij] with entries a_ij = 1 if i and j are linked, and a_ij = 0 otherwise. The adjacency matrix is symmetric (a_ij = a_ji) if the links in the graph are undirected.
Coupler link rules Coupler rules are local rules that act on pairs of next-nearest sites of a graph at time t to decide whether they should be linked at t + 1. The decision rules fall into one of three basic classes – totalistic (T), outer-totalistic (OT) or restricted-totalistic (RT) – but can be as varied as those for conventional cellular automata.
Decoupler link rules Decoupler rules are local rules that act on pairs of linked sites of a graph at time t to decide whether they should be unlinked at t + 1. As for coupler rules, the decision rules fall into one of three basic classes – totalistic (T), outer-totalistic (OT) or restricted-totalistic (RT) – but can be as varied as those for conventional cellular automata.
Degree The degree of a node (or site) i of a graph is equal to the number of distinct nodes to which i is linked, where the links are assumed to possess no directional information. In general graphs, the in-degree (= number of incoming links toward i) is distinguished from the out-degree (= number of outgoing links originating at i).
Effective dimension A quantity used to approximate the dimensionality of a graph. It is defined as the ratio of the average number of next-nearest neighbors to
the average degree, both averaged over all nodes of the graph. The effective dimension equals the Euclidean dimension d in cases where the graph is the familiar d-dimensional hypercubic lattice.
Graph A graph is a finite, nonempty set of nodes (referred to as "sites" throughout this article), together with a (possibly empty) set of edges (or links). The links may be either directed (in which case the edge from a site i, say, is directed away from i toward another site j, and is considered distinct from another directed edge originating at j and pointed toward i) or undirected (in which case a link between sites i and j carries no directional information).
Graph grammar Graph grammars (sometimes also referred to as graph rewriting systems) apply formal language theory to networks. Each language specifies the space of "valid structures", and the production (or "rewrite") rules by which given graphs may be transformed into other valid graphs.
Graph metric function The graph metric function defines the distance between any two nodes, i and j. It is equal to the length of the shortest path between i and j. If no path exists (such as when i and j are on two disconnected components of the same graph), the distance is taken to be ∞.
Graph-rewriting automata Graph-rewriting automata are generalized CA-like systems in which both links and (the number of) nodes are allowed to change.
Next-nearest neighbor Two sites i and j are next-nearest neighbors in a graph if (1) they are not directly linked (so that a_ij = 0; see adjacency matrix), and (2) there exists at least one other site k such that k ∉ {i, j}, and i and j are both linked to k.
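The graph metric and the effective dimension defined above are straightforward to compute from an adjacency matrix. The following is a minimal Python sketch (function names are ours, not the article's), using breadth-first search for shortest paths:

```python
from collections import deque

def graph_distances(adj, i):
    """Breadth-first search from site i over an adjacency matrix.
    Returns D[i][j] for all j; unreachable sites get float('inf')."""
    n = len(adj)
    dist = [float('inf')] * n
    dist[i] = 0
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if adj[u][v] == 1 and dist[v] == float('inf'):
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def next_nearest_neighbors(adj, i):
    """Sites at graph distance exactly 2 from i (candidates for coupler rules)."""
    return [j for j, d in enumerate(graph_distances(adj, i)) if d == 2]

def effective_dimension(adj):
    """Ratio of the average number of next-nearest neighbors to the average degree."""
    n = len(adj)
    avg_nnn = sum(len(next_nearest_neighbors(adj, i)) for i in range(n)) / n
    avg_deg = sum(sum(row) for row in adj) / n
    return avg_nnn / avg_deg

# A 6-site cycle: every site has 2 neighbors and 2 next-nearest neighbors.
C6 = [[1 if abs(i - j) % 6 in (1, 5) else 0 for j in range(6)] for i in range(6)]
```

For a one-dimensional ring every site has two neighbors and two next-nearest neighbors, so the effective dimension comes out to 1, matching the d-dimensional hypercubic case described in the glossary.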
Random dynamics approximation The long-term behavior of structurally dynamic cellular automata may be approximated in certain cases (in which the structure and value configurations are both sufficiently random and uncorrelated) by a random dynamics approximation: values of sites are replaced by the probability p_σ of a site having value σ (assumed to be equal for all sites), and links between sites are replaced by the probability p_ℓ of a pair of sites being linked (also assumed to be the same for all pairs of sites). The approximation often yields qualitatively correct predictions about how the real system evolves under a specific set of rules; for example, it can predict whether one expects unbounded growth, or that the lattice will eventually settle onto a low-period state or simply decay.
Restricted totalistic rules Restricted totalistic rules are a generalized class of link rules (operating on pairs of sites, i and j), analogous to the "outer totalistic" rules (that
operate on site values) used in conventional CA. The local neighborhood around i and j is first partitioned into three sets: (1) the two sites, i and j; (2) sites connected to either i or j, but not both; and (3) sites connected to both i and j. The restricted totalistic rule is then completely defined by associating a specific action with each possible 3-tuple of site-value sums (where the individual components represent a unique sum in each of the three neighborhoods).
Structurally dynamic cellular automata Structurally dynamic cellular automata are generalizations of conventional cellular automata models in which the underlying lattice structure is dynamically coupled to the local site-value configurations.
SDCA model hierarchy The SDCA model hierarchy is a set of eight related structurally dynamic cellular automata models, defined explicitly for studying their formal computational capabilities. The hierarchy is ordered (from lowest to highest level) according to the models' relative computational strength. For example, the SDCA model at the top of the hierarchy is capable of simulating a conventional CA with a speedup factor of two.

Definition of the Subject

Structurally dynamic cellular automata (abbreviated SDCA) are a generalized class of CA in which the topological structure of the (usually quiescent) underlying lattice is dynamically coupled to the local site-value configuration. The coupling is defined to treat geometry and value configurations on an approximately equal footing: the lattice structure is altered locally as a function of individual site neighborhood value-states and geometries, while the underlying local topology supports site-value evolution precisely as in conventional nearest-neighbor CA models defined on random lattices. SDCA provide a dynamical framework for a CA-like analysis of the generation, transmission and interaction of topological disturbances in a lattice.
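The three-way neighborhood partition that defines restricted totalistic link rules can be read off directly from the adjacency matrix. A minimal sketch (helper names are ours, not the article's):

```python
def rt_partition(adj, i, j):
    """Partition the local neighborhood of a site pair (i, j) for a
    restricted totalistic link rule: (1) the pair itself, (2) sites
    linked to exactly one of i and j, and (3) sites linked to both."""
    n = len(adj)
    pair = {i, j}
    either = {k for k in range(n) if k not in pair
              and (adj[i][k] == 1) != (adj[j][k] == 1)}  # exactly one of the two
    both = {k for k in range(n) if k not in pair
            and adj[i][k] == 1 and adj[j][k] == 1}
    return pair, either, both

def rt_sums(adj, sigma, i, j):
    """The 3-tuple of site-value sums that completely defines the rule's input."""
    pair, either, both = rt_partition(adj, i, j)
    return (sum(sigma[k] for k in pair),
            sum(sigma[k] for k in either),
            sum(sigma[k] for k in both))
```

A restricted totalistic coupler or decoupler then associates a link/unlink decision with each possible 3-tuple returned by `rt_sums`.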
Moreover, they provide a natural testbed for studying self-organized geometry, by which we mean true structural evolution, and not merely space-time patterns of value configurations that may be interpreted geometrically (but are really just "bits" of information overlaid on top of an otherwise static background lattice).

Introduction

SDCA were formally introduced in 1986 as part of a physics doctoral dissertation by Ilachinski [31], and developed further by Ilachinski and Halpern [29,30],
Halpern [21,23], Halpern and Caltagirone [22], Majercik [39], and Alonso-Sanz and Martín [6,7,8]; in their original incarnation [28], and in at least two subsequent papers [22,61], SDCA were called topological automata. Pedagogical discussions appear in Adamatzky [1] and Ilachinski [32]. Extensions of the basic SDCA model (all discussed in this article) include the addition of probabilistic rules, memory and reversibility. Applications include the simulation of crystal growth [36], the study of pattern formation of random cellular structures [66], modeling synaptic plasticity in neural network models [19], phase transitions in chemical systems [62], chemical self-assembly [26], and gene-regulatory networks [22]. Majercik [39] has studied SDCA as generalized models of computation, and describes a CA-universal SDCA that can simulate any conventional CA of the same dimension. More recently, O'Sullivan [56] and Saidani [63,64] have used graph-based CA models similar to SDCA to study urban dynamics and emergent behaviors of self-reconfigurable robots, respectively. Tomita et al. [67,68,69,70,71] have introduced graph-rewriting automata in which both links and (the number of) nodes are allowed to change, and show that these systems are capable of both self-replication and Turing universality (along with many other emergent behaviors). Since SDCA provide the basic formalism for describing locally induced topological changes within arbitrary graphs, they are a potentially powerful general tool for studying complex adaptive networks, such as communication and social networks [6]. The concept behind SDCA has also been used as a foundation for philosophical musings about computationally emergent artificiality [49]. More ambitious applications of SDCA encroach on fundamental physics.
Because SDCA are inherently self-modifying systems – in which physical events are not just dynamically coupled to, but are an integral part of, the spatio-temporal arena on which their transformations are defined – they are a potentially powerful methodological and ontological tool for exploring discrete pre-geometric theories of space-time [42]. Just as "value structure" solitons are ubiquitous in conventional CA models [32,74], "link structure" solitons might emerge in SDCA; physical particles would, in such a scheme, be viewed as geometrodynamic disturbances propagating within a dynamic lattice. Three SDCA-like theories of pregeometry have recently been proposed in which space-time is a self-organized emergent construct: Hillman [27], Nowotny and Requardt [55] and Wolfram [75].

Finally, we briefly comment on ostensible overlaps between SDCA and four other related fields of study: (1) Lindenmayer (or L-) systems, (2) graph grammars, (3) random graphs (abbreviated RG), and (4) dynamic network analysis (abbreviated DNA). L-systems [57] are generalized CA systems in which the number of sites can grow with time, and consist of recursive rules for rewriting strings of symbols. If interpreted graphically, abstract symbol strings can be used to model growth processes of plants and the evolving morphology of physical organisms. Graph grammars [20,35] apply formal language theory to networks, and consist of production rules that define the set of "valid structures" in a given graph language. The study of RG [15] was introduced by Erdős and Rényi in the late 1950s [16], and is a mathematical framework for exploring the general topological structures of computational systems and the behavior of certain random dynamical systems. Like SDCA, RG describes evolving graphs, but the dynamics are global and random. DNA [41,50] is an emerging field that fuses traditional social network theory with statistical analysis and modeling; part of its charter is to explore general properties of network generation and evolution. While, conceptually speaking, there is a prima facie relationship between SDCA and all four fields of study, the elucidation of the precise nature of the relationship between SDCA and these other systems awaits a future study. (The relationship appears particularly strong between SDCA and a generalized L-system called the graph development system (abbreviated GDS), introduced by Doi [14], but not developed further since its original conception. Using incidence matrices to represent arbitrary topologies, GDS is essentially a grammar by which submatrices of the whole matrix are rewritten to describe topological changes. SDCA also formally fall under the broader rubrics of DNA and RG; however, there is no explicit reference to SDCA in the current literature of either field.)
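The parallel string-rewriting at the heart of L-systems can be sketched in a few lines. The rule set below is Lindenmayer's classic algae example, used here only for illustration; it is not taken from this article:

```python
def lsystem(axiom, rules, steps):
    """Apply recursive string-rewriting rules (an L-system) in parallel:
    every symbol is replaced simultaneously at each step; symbols with
    no rule are copied unchanged."""
    s = axiom
    for _ in range(steps):
        s = ''.join(rules.get(c, c) for c in s)
    return s

# Lindenmayer's algae model: a -> ab, b -> a.
algae = lsystem('a', {'a': 'ab', 'b': 'a'}, 4)  # 'a' -> 'ab' -> 'aba' -> 'abaab' -> 'abaababa'
```

The string lengths grow as Fibonacci numbers, a simple example of how local rewrite rules generate global growth patterns.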
The Basic Model

Conventional CA are defined on fixed, and typically regular, lattices (one-dimensional lines, two-dimensional Euclidean or hexagonal grids, etc.), the sites of which are populated with discrete-valued dynamic elements (σ_i ∈ {0, 1, …, k−1}, where i labels a particular site on the lattice) that evolve according to local transition functions, f: σ_i → σ_i′. We emphasize that the dynamics of conventional CA are confined to the temporal evolution of the σ_i's. SDCA generalize conventional CA in two ways: (1) they relax the assumption that the underlying lattice is uniform, allowing the local site ↔ site connectivity pattern to vary throughout the lattice; and (2) they allow both the set {σ_i} and the lattice to evolve according to local transition rules. The most obvious – also the most dramatic – conceptual change this entails over the dynamics of conventional CA is that the meaning of "local" itself changes as a function of how the SDCA system evolves: previously far-separated sites may become neighbors; and sites that are local at time t may become far separated at some later time, t′.

To properly define SDCA, we first generalize regular lattices to mathematical graphs G possessing arbitrary topology. Assuming G has N lattice sites, and that G is (for now) an undirected graph (meaning that none of G's links carry directional information), G is completely defined by the N × N adjacency matrix, ℓ_ij:

  ℓ_ij = 1 if i and j are linked; ℓ_ij = 0 otherwise.   (1)

Using the graph metric function,

  D_ij = min over all paths P_ij of #{links l_rs | {r, s} ∈ P_ij},   (2)

we can write a general r-neighborhood CA value-transition rule f (which we will from now on refer to generically as a σ-rule) in the form

  σ_i^{t+1} = f[ {σ_j^t} | j ∈ S_r^G(i) ],   (3)

where S_r^G(i) = {j | D_ij ≤ r} is the radius-r graph sphere about the site i. In words, the value σ_i^{t+1} is some function, f, of the values σ_j^t in the radius-r graph sphere around the site i. With this distance measure, G becomes a discrete metric space. If G is a one-dimensional line and r = 1, then S_r^G(i) = {i−1, i, i+1}; i.e., it is equal to the conventional three-site local neighborhood of elementary CA.

We now formally extend a conventional CA's dynamic arena – limited to the values σ_i^t ∈ {0, 1, …, k−1}, i = 1, …, N – to one that includes the components of the underlying lattice's adjacency matrix:

  σ^{t+1} = F_σ[{σ^t}, {ℓ^t}],  ℓ^{t+1} = F_ℓ[{σ^t}, {ℓ^t}],   (4)

where F_σ and F_ℓ are functions (to be defined explicitly below) that explicitly couple the changing value states and geometries. The complete system at time t is specified by the state-vector

  |G⟩^t = |σ_1^t, …, σ_N^t, {ℓ_ij^t}⟩.   (5)
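As a concrete illustration of Eqs. (3)–(5), the state can be held as a value list plus an adjacency matrix, with the radius-r graph sphere found by breadth-first search. The rule used below is an arbitrary totalistic choice for demonstration; all names are ours:

```python
from collections import deque

def graph_sphere(adj, i, r):
    """S_r^G(i): all sites within graph distance r of site i (BFS)."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue  # do not expand past radius r
        for v, linked in enumerate(adj[u]):
            if linked and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

def sigma_step(sigma, adj, f, r=1):
    """One synchronous sigma-rule update, Eq. (3): every site's new value is
    a function of the values inside its radius-r graph sphere."""
    return [f([sigma[j] for j in graph_sphere(adj, i, r)])
            for i in range(len(sigma))]

# Example on a 6-site ring with a simple totalistic rule (value 1 iff the
# neighborhood sum equals 1 or 2 -- an arbitrary choice for illustration).
ring = [[1 if (i - j) % 6 in (1, 5) else 0 for j in range(6)] for i in range(6)]
rule = lambda values: 1 if sum(values) in (1, 2) else 0
state = sigma_step([1, 0, 0, 0, 0, 0], ring, rule)
```

On a ring, `graph_sphere(adj, i, 1)` reproduces the conventional three-site neighborhood {i−1, i, i+1}; a full SDCA step would additionally apply coupler/decoupler rules to update the adjacency matrix itself.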
The time-evolution of |G⟩ proceeds according to the following transition rules: (i) σ-rules of the general form given above and familiar from CA simulations, and (ii) ℓ-rules, which are divided into site couplers, linking previously unconnected vertices, and site decouplers, which disconnect linked points. Because the topology can be altered only by either a deletion of existing links or an addition of links between pairs of vertices i and j with D_ij = 2, the dynamics is strictly local. To be more precise, we first restrict the general σ-rule F_σ to (maximally symmetric) totalistic (T) and outer-totalistic (OT) type. Since the underlying lattice is a fully dynamic object, |G⟩ will, in general, tend towards having a complex local geometry with an unspecified local directionality. The most general rules which can therefore be applied are those which are completely invariant under all rotation and reflection symmetry transformations on local neighborhoods. T (OT) σ-rules are then specified by listing the particular sums {α} (outer-sums {α_0}, {α_1} corresponding to center-site values '0' and '1', respectively) for which the value of the center site becomes '1'. Formally,

  σ_i^{t+1} = φ_{α}( Σ_j ℓ_ij^t σ_j^t , σ_i^t ),   (6)
Structurally Dynamic Cellular Automata, Figure 1 Neighborhood partitioning. In the same way as outer sites can be considered separately for σ-transitions, we may, for topology transitions, distinguish between those sites belonging to both i and j (∈ A_ij) and those belonging to one of the two sites but not both (∈ B_ij). In this way we obtain the analogous totalistic (T) and outer-totalistic (OT) types, and an additional type called restricted totalistic (RT)
where

  φ_{α}(x, a) = Σ_α δ(x + a, α)   (T),
  φ_{α}(x, a) = a Σ_{α_1} δ(x, α_1) + (1 − a) Σ_{α_0} δ(x, α_0)   (OT),   (7)

and δ(x, y) is the Kronecker delta. Note that Σ_j ℓ_ij^t σ_j^t sums the values of all sites j linked to i at time t. The action on the state |G⟩ is represented by

  σ̂_i |σ_1^t, …, σ_i^t, …, σ_N^t⟩ = |σ_1^t, …, σ_i^{t+1} = φ_{α}(Σ_j ℓ_ij^t σ_j^t, σ_i^t), …, σ_N^t⟩,   (8)

where we distinguish the operator σ̂_i acting on the global value state from the actual local transition function which transforms each site value.

Link Rules

Local geometry-altering rules are constructed by direct analogy: for any two selected sites i and j we restrict attention to site values of vertices contained within a 1-sphere of either site; that is, to all k ∈ S_1(i, j) = S_1(i) ∪ S_1(j). Link operators, whose action on the state is represented by

  decouplers: d̂_{β} |ℓ_11^t, …, ℓ_ij^t, …, ℓ_NN^t⟩ = |ℓ_11^t, …, ℓ_ij^{t+1}, …, ℓ_NN^t⟩,
  couplers: ĉ_{ε} |ℓ_11^t, …, ℓ_ij^t, …, ℓ_NN^t⟩ = |ℓ_11^t, …, ℓ_ij^{t+1}, …, ℓ_NN^t⟩,   (9)
either link or unlink two sites i and j depending on whether the actual sum of values in S_1(i, j) matches any of those given in the {β} or {ε} lists, which completely define decouplers and couplers, respectively. In order to construct classes of rules analogous to the two types of σ-rules defined above, we partition the local neighborhood into 3 disjoint sets (see Fig. 1): S_1(i, j) = V_ij ∪ A_ij ∪ B_ij, where

i > j (or i < j). The parameter θ ∈ (−1, 1) governs the coupling asymmetry in the network: the limit θ → 1 (or θ → −1) gives a unidirectional coupling, where the old (or young) nodes drive the young (or old) ones. When θ = 0, the coupling pattern degenerates to the MZK pattern at β = 1. For a generic θ, the spectrum of the coupling matrix G is in the complex plane and the complex eigenvalues appear in pairs of complex conjugates (λ_1 = 0; λ_ℓ = λ_ℓ^r + jλ_ℓ^i, ℓ = 2, …, N). It can be proved that (i) 0 < λ_2^r ≤ λ_N^r ≤ 2, and (ii) |λ_ℓ^i| ≤ 1, ∀ℓ. The best propensity for synchronization is then ensured when both the ratio λ_N^r/λ_2^r and M ≡ max_ℓ {|λ_ℓ^i|} are simultaneously made as small as possible. In scale-free network models, the age of a node can be denoted by the time when it is added to the network. The class of scale-free networks under study is generated from the Barabási–Albert model [82,83].

Synchronization Phenomena on Networks

Synchronization Phenomena on Networks, Figure 8 Eigenratio R as a function of β for four kinds of complex networks specified in [78]. For each model, the synchronizability peaks at β = 1.0 (after [78])

For comparison, a highly homogeneous random network with an arbitrary initial age ordering is considered, with the average degree being equal to that of the scale-free network. Figure 9 shows the variation of the synchronizability of the two networks versus the parameter θ. For the random network, symmetric coupling makes the ratio λ_N^r/λ_2^r smallest, while for the scale-free model, the propensity for synchronization is better (or worse) when θ → 1 (or θ → −1). As for M, there are only very small differences between the scale-free and the random-network models. Thus, it is concluded that in scale-free networks, the network synchronizability is enhanced when the dominant coupling direction is from older to younger nodes [84].

Taking the edge-weights into account, Chavez et al. [85,86] investigated the propensity for synchronization of some weighted complex networks, where the weight of an arbitrary edge, ℓ_ij, is defined as its traffic load [87], which quantifies the traffic of shortest paths that make use of that edge. In this coupling pattern, the off-diagonal entries of the zero-row-sum coupling matrix G are

  G_ij = ℓ_ij^α / Σ_{j ∈ 𝒩_i} ℓ_ij^α ,   (11)
Synchronization Phenomena on Networks, Figure 9 a λ_N^r and λ_2^r, b λ_N^r/λ_2^r, and c M, vs. θ, for a scale-free network (m = 5 and B = 0, solid curves) and a random network (dashed curves) (after [81])
where α is a tunable parameter, and ℓ_ij is the load of the edge connecting nodes i and j. Although G is asymmetric for all α, just like the MZK pattern, it can be proved that all its eigenvalues are nonnegative reals, with only one zero eigenvalue, if the network is connected. The case of α = 0 corresponds to the best synchronizability condition for the MZK pattern. From Eq. (11), it can be seen that in the limit of α = +∞ (or α = −∞) only the edges with the largest (or smallest) loads ℓ_ij are selected as the incoming edges for each node i. Therefore, this generates a network with at least N directed edges, which can be either connected or disconnected. In the connected (or disconnected) case, the ratio λ_N/λ_2 will be equal to 2 (or +∞), thus yielding a very strong (or weak) condition for synchronization. Figure 10a shows the logarithm of λ_N/λ_2 in the parameter space (α, B) for the above-discussed model [82,83]. The parameter B is used to regulate the heterogeneity of the degree distribution. It can be observed that the surface of λ_N/λ_2 has a prominent minimum at α ≃ 1 for all values of B above a given threshold B_c > 0, which means that the weighting procedure based on edge loads always enhances the network's propensity for synchronization. The difference log(λ_N/λ_2) − [log(λ_N/λ_2)]_{α=0}, shown in Fig. 10b, may be used to quantify the synchronizability enhancement.

The coupling patterns proposed by both Hwang et al. [81] and Chavez et al. [85,86] can enhance the propensity for network synchronization. The former works well only for age-ordered networks, while the latter requires knowledge of the load on each edge of the whole network. Therefore, a general coupling pattern using only local information would be very desirable. Based on the idea that different nodes should play different roles in a network, Zhao et al. [73] proposed a coupling pattern which requires only the degrees of neighboring nodes. The coupling matrix G of this pattern is given in terms of the degrees k_i and k_j through two tunable exponents α and β, where G_ij > 0 is the coupling strength from node j to node i if they are connected.
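The load-weighted pattern of Eq. (11) is easy to sketch once the edge loads are given. The snippet below assumes the loads ℓ_ij are supplied externally (e.g., as edge-betweenness values); the helper name is ours, and sign conventions for the diagonal vary across the literature:

```python
def load_weighted_coupling(load, alpha):
    """Zero-row-sum coupling matrix with off-diagonal entries
    G_ij = load_ij**alpha / sum_j load_ij**alpha (Eq. (11));
    load is a symmetric matrix with load[i][j] > 0 iff i and j are linked."""
    n = len(load)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        norm = sum(load[i][j] ** alpha for j in range(n) if load[i][j] > 0)
        for j in range(n):
            if i != j and load[i][j] > 0:
                G[i][j] = load[i][j] ** alpha / norm
        G[i][i] = -sum(G[i][j] for j in range(n) if j != i)  # enforce zero row sum
    return G
```

Each row's off-diagonal entries sum to 1, so every diagonal entry is −1; for alpha = 0 the pattern reduces to uniform nearest-neighbor weighting, the best-synchronizability case of the MZK pattern noted above.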
Here, suppose that the coupling strength from node j to node i is

  G_ij(t) = a_ij V_i(t),   (13)

and that the strength between node i and all its k_i neighbors increases uniformly among the k_i connections, in order to suppress the difference Δ_i between node i's activity and the mean activity of its neighbors; namely,

  V̇_i = γ Δ_i / (1 + Δ_i),   (14)

where Δ_i = |H(x_i) − (1/k_i) Σ_j a_ij H(x_j)|, and γ > 0 is the adaptation parameter. It is clear that, in this adaptive coupling scheme, the input weight (W_ij = V_i) and the output weight (W_ji = V_j) of node i are generally asymmetrical. Next, synchronization of a network of coupled Rössler oscillators and of a chaotic foodweb model on Barabási–Albert scale-free networks are considered, and the two cases of unbounded and bounded stability zones are investigated, respectively. When the stability zone is unbounded, the transition to synchronization is shown in Fig. 12a. Starting from random initial conditions on the chaotic attractors, the local synchronization differences Δ_i and the input weights of each node both increase uniformly on the whole network, i.e., W_ij = V_i(t) ∝ t (Fig. 12a, inset). After a short period of time, the weights V_i of different nodes develop at different rates and then converge to different values Ṽ_i. The input weight is smaller on average for nodes with larger degrees k_i (Fig. 12b). Here, the synchronization error is measured by averaging all local errors over the nodes: E(t) = ⟨|x_i − ⟨x_i⟩|⟩. The dependence of the input weight of a node on its degree follows a power law,

  V(k) ∼ k^{−θ},   (15)

with exponent θ = 0.48 ± 0.01 for both oscillator models. Importantly, this scaling is also robust to the variation of network parameters, such as the minimal degree M (Fig. 13b), which should not be confused with the order parameter M elsewhere, the system size N (Fig. 13c), and the order of magnitude of the adaptation parameter γ (Fig. 13d).

Synchronization Phenomena on Networks, Figure 12 a Transition to synchronization in an adaptive network of Rössler oscillators, indicated by the (averaged) synchronization error E(t) = ⟨|x_i − ⟨x_i⟩|⟩. Inset: the input strength V_i(t) vs. time over three nodes. b The weighted coupling matrix G̃ crystallized after the adaptation (for the foodweb model) (after [89])

When the stability zone is bounded, synchronization can always be achieved by the adaptation mechanism
Synchronization Phenomena on Networks, Figure 13 Average input weight V(k) of nodes with degree k as a function of k for a network of Rössler oscillators (empty circles) and the foodweb model (filled circles) (a), and its dependence on various parameters: M (b), N (c), and γ (d), where M should not be confused with the order parameter M used elsewhere (after [89])
of Eq. (14) if γ ≥ γ_c, for a threshold γ_c depending somewhat on N and the oscillator dynamics. The two resulting weighted networks display the same power-law behavior as in Eq. (15), but with different exponents:
θ = 0.54 ± 0.01 (Rössler oscillator) or θ = 0.36 ± 0.01 (foodweb). The eigenratio R for the weighted networks after the adaptation, and for the unweighted networks (symmetric coupling), is calculated as a function of N (Fig. 14a), and as a function of the ratio S_max/S_min (Fig. 14b), where S_max and S_min are the maximum and minimum intensities of the variable coupling strengths of the model. Clearly, this adaptive coupling scheme is more effective than symmetric coupling for network synchronization. Huang [90] investigated another adaptive coupling pattern, in which a node is coupled with its neighbors non-uniformly through different coupling strengths, and showed that such networks have better synchronizability than networks with symmetric coupling patterns.

In all the coupling patterns discussed above, whether the coupling pattern is static or dynamic, only the coupling strength is tunable while the connectivity matrix always remains unchanged. However, as is intuitively clear, network synchronizability can also be significantly improved by evolving the graph topology, giving rise to a time-varying connectivity matrix. This has recently been confirmed by Boccaletti et al. [91]. It has been shown [91] that to make a network synchronizable, either the coupling matrix G(t) = G remains
unchanged, or, starting from an initial wiring condition G(0) = G_0, the coupling matrix G(t) commutes at all times with G_0, i.e., G_0 G(t) = G(t) G_0, ∀t. At any time, a zero-row-sum symmetric commuting matrix G(t) can be constructed as

  G(t) = V Λ(t) V^T,   (16)

where V = {v_1, …, v_N} is an orthogonal matrix whose columns are the eigenvectors of G_0, and Λ(t) = diag[0, λ_2(t), …, λ_N(t)] with λ_i(t) > 0, ∀i > 1. This set of matrices is referred to as the dissipative commuting set of G(0). A condition ensuring that the network synchronization is stable is

  S_i = lim_{T→∞} (1/T) ∫_0^T λ_max(λ_i(t′)) dt′ < 0,  ∀i ≠ 1,   (17)

where λ_max(λ_i) is the maximal transversal (conditional) Lyapunov exponent along the direction of the ith eigenvector, and S_i is its time average. Hence, it is not required that λ_max(λ_i(t)) < 0 at all times. One can even construct a commutative evolution such that at each time there exists one eigenvalue λ_i for which λ_max(λ_i(t)) > 0, and yet obtain a stable synchronization manifold. Thus, interestingly, synchronization in a dynamical network can be achieved even in the case where each individual commutative graph does not give rise to synchronized behavior.
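The construction of Eq. (16) can be illustrated in pure Python by fixing an orthonormal eigenbasis V of G(0) and substituting arbitrary positive eigenvalues. The triangle-graph basis below is our illustrative choice (its first column is the constant vector spanning the synchronization manifold):

```python
import math

def mat_mul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def commuting_coupling(V, lambdas):
    """Eq. (16): G(t) = V diag(0, lambda_2, ..., lambda_N) V^T, where the
    columns of V are orthonormal eigenvectors of G(0). Any positive lambdas
    give a zero-row-sum symmetric matrix that commutes with G(0)."""
    n = len(V)
    lam = [0.0] + list(lambdas)  # keep the zero eigenvalue
    Vt = [list(row) for row in zip(*V)]
    D = [[lam[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return mat_mul(mat_mul(V, D), Vt)

# Orthonormal eigenbasis of the triangle graph's Laplacian.
s2, s3, s6 = math.sqrt(2), math.sqrt(3), math.sqrt(6)
V = [[1/s3,  1/s2,  1/s6],
     [1/s3, -1/s2,  1/s6],
     [1/s3,  0.0,  -2/s6]]
Gt = commuting_coupling(V, [1.0, 5.0])  # commutes with V diag(0, 3, 3) V^T
```

Because G(0) and every G(t) are diagonal in the same basis, they commute automatically, and the preserved zero eigenvalue keeps the row sums zero.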
Synchronization Phenomena on Networks, Figure 14 Eigenratio R as a function of N (a), and of S_max/S_min (b). The networks are synchronizable if R lies below the corresponding threshold: Rössler oscillators (squares), R = 40 (dashed curve); foodweb model (triangles), R = 29 (dash-dotted curve) (after [89])
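The adaptation rule of Eq. (14) amounts to integrating each node's input weight according to its mismatch with the neighborhood mean. A minimal Euler-step sketch (the identity output function H and all names are our illustrative choices):

```python
def adapt_step(x, V, adj, gamma=0.1, dt=0.01, H=lambda s: s):
    """One Euler step of the adaptive input-weight dynamics:
    V_i' = V_i + dt * gamma * Delta_i / (1 + Delta_i), with
    Delta_i = |H(x_i) - (1/k_i) * sum_j a_ij H(x_j)|."""
    n = len(x)
    V_new = list(V)
    for i in range(n):
        k_i = sum(adj[i])
        if k_i == 0:
            continue  # isolated node: nothing to adapt to
        mean_nbr = sum(adj[i][j] * H(x[j]) for j in range(n)) / k_i
        delta = abs(H(x[i]) - mean_nbr)
        V_new[i] = V[i] + dt * gamma * delta / (1.0 + delta)
    return V_new

def coupling_matrix(adj, V):
    """Instantaneous coupling strengths: entry (i, j) is a_ij * V_i(t)."""
    n = len(V)
    return [[adj[i][j] * V[i] for j in range(n)] for i in range(n)]
```

All incoming weights of node i share the single value V_i, so the input and output weights of a node pair are generally asymmetric, as described above.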
Modifications of Network Structures

It is well known that the synchronizability of a dynamical network is determined simultaneously by the network coupling pattern, the dynamical characteristics of the oscillators on its nodes, and the network structure. Several cases with variable coupling patterns have been discussed above. For some real-world networks, however, the coupling pattern cannot be modified at will. Thus, if the dynamics of the oscillators are given and fixed, and if the coupling pattern cannot be changed, then the only way to enhance the network synchronizability is to change the network structure. Some effective techniques for doing so are discussed in the rest of this section.

Reducing Maximal Betweenness

In scale-free networks, the average distance is often very short while the node-degree and node-betweenness distributions are both quite broad. The bottleneck for network synchronizability seems to be the maximal node betweenness [57]. In order to reduce the node betweenness of the hubs, Zhao et al. [92] suggested a method of structural perturbations. Specifically, for a hub x_0, m − 1 auxiliary nodes, labeled x_1, …, x_{m−1}, are added around it. These m nodes are fully connected together. Then, all the edges incident on x_0 are re-distributed among the nodes x_i (including x_0 itself), i = 0, 1, …, m − 1. After this process, the betweenness of x_0 is divided into m almost equal parts associated with these m nodes. This process is called m-division. A sketch of a 3-division process on node x_0 is shown in Fig. 15.

Synchronization Phenomena on Networks, Figure 15 Sketch map for the 3-division process on x_0. The solid circle on the left is the node x_0 with degree 6. After the 3-division process, x_0 is divided into 3 nodes, x_0, x_1 and x_2, which are fully connected. The six edges incident on x_0 are then re-distributed among the three nodes (after [92])

Due to the huge sizes of many real-life networks, it is usually impossible to obtain the node betweenness of a complex network. Fortunately, studies have shown that there exists a strongly positive correlation between the node-degree and the node-betweenness in Barabási–Albert networks and some other heterogeneous networks [87,93]. That is, a node with larger degree statistically has higher node-betweenness. Therefore, for practical purposes, it can be assumed that the nodes with higher betweenness are those with larger degrees in Barabási–Albert networks. To further explore how the structural perturbations affect the network synchronizability, the eigenratios before and after the m-division process were compared in [92] for a Barabási–Albert scale-free network with a Laplacian coupling matrix. For use in the rest of the article, we define a characteristic value R = r′/r, in which r and r′ are the eigenratios before and after the division, respectively. Figure 16 shows the correlation between R and the fraction of divided nodes. It is clear that even the m-division of a tiny fraction of nodes can sharply enhance network synchronizability.

Shortening the Average Distance

Zhou et al. [94] investigated the synchronizability of a network model named crossed double cycles (CDCs). They not
Synchronization Phenomena on Networks, Figure 16 Behavior of the characteristic value R vs. the fraction of divided nodes. As the number of divided nodes increases, R is reduced, leading to better synchronizability (after [92])
only clarified the relationship between average distance and network synchronizability, but also provided a possible way to make a network more synchronizable. In the language of graph theory [95,96,97], a cycle C_N denotes a network consisting of N nodes (vertices) x_1, …, x_N. These N nodes are arranged in a ring, with each pair of nearest nodes connected. Thus, C_N has N edges, connecting the node pairs x_1x_2, x_2x_3, …, x_{N−1}x_N, x_Nx_1. The set of all such CDCs, denoted G(N, m), can be constructed by adding two edges, called crossed edges, to each node in C_N; the two nodes connected by a crossed edge are at distance m within C_N. For example, the network G(N, 3) can be constructed from C_N by connecting x_1x_4, x_2x_5, …, x_{N−1}x_2, x_Nx_3. A sketch of G(20, 4) is shown in Fig. 17 for illustration. Figure 18 shows how the average distance L affects the network synchronizability (measured by the characteristic value R). It is clear that the network synchronizability is very sensitive to the average distance: as L increases, R sharply spans more than three orders of magnitude, and the network synchronizability is remarkably enhanced by reducing L. When the crossed length m is neither too small nor too large (compared to N), networks with the same average distance have approximately the same synchronizability, regardless of the network sizes. More interestingly, the numerical results show that the characteristic value R approximately obeys a power-law form, R ∼ L^{1.5} (inset of Fig. 18).

Decoupling Nodes by Removing Heavily-Loaded Edges

In the synchronization process, not only hubs may be bottlenecks; some edges with large loads may also limit
Synchronization Phenomena on Networks, Figure 17 Sketch map of G(20; 4) (after [94])
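The construction of G(N, m) sketched in Fig. 17 is easy to reproduce numerically. The snippet below is our own illustration (not code from [94]): it builds the Laplacian of a crossed double cycle and computes the eigenratio used throughout this section to measure synchronizability.

```python
# Sketch: build the Laplacian of a crossed double cycle G(N, m) and
# compute the eigenratio lambda_N / lambda_2; smaller values indicate
# better synchronizability.
import numpy as np

def cdc_laplacian(N, m):
    A = np.zeros((N, N))
    for i in range(N):
        A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1  # ring edges of C_N
        A[i, (i + m) % N] = A[(i + m) % N, i] = 1  # crossed edges
    return np.diag(A.sum(axis=1)) - A

def eigenratio(L):
    w = np.sort(np.linalg.eigvalsh(L))
    return w[-1] / w[1]  # lambda_N / lambda_2

print(eigenratio(cdc_laplacian(20, 4)))  # the G(20, 4) network of Fig. 17
```

Sweeping N for fixed m changes the average distance L, which is how data such as those of Fig. 18 can be generated.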
Synchronization Phenomena on Networks, Figure 18 Characteristic value R vs. average distance L of CDCs. The black squares, red circles, blue triangles and green pentagons represent the cases N = 1000, 2000, 3000 and 4000, respectively. The inset shows the same data in a log-log plot, indicating that the characteristic value R approximately obeys a power-law form R ~ L^1.5. The solid line has slope 1.5, for comparison (after [94])
the network synchronizability. Yin et al. [99] found that a scale-free network can become more synchronizable after some of its heavily-loaded edges have been removed. To reduce the computational cost, they used local information to rank the edges approximately, according to the values of k_i k_j, where k_i and k_j are the degrees of the two adjacent nodes i and j connected by an edge. Then, at each time step, the edge with the highest rank is removed, i.e., the two nodes at the ends of this heavily-loaded edge are decoupled. After this operation, the characteristic value is decreased, as shown in Fig. 19.

Synchronization Phenomena on Networks, Figure 19 Changes of the synchronizability as a function of the proportion of cut edges N_cut/N for different values of the average distance (after [99])

Designing the Output Function

Very recently, the relationship between graph theory and network synchronizability has received special attention [100]. For example, Duan et al. [101,102,103] found that for networks with disconnected complementary graphs, adding edges will often increase their synchronizability. The complementary graph of a given graph G is defined as the graph consisting of all the nodes of G and all the edges that are not in G. In addition, they found [101,102] that when the couplings between nodes are symmetric, an unbounded synchronized region is always easier to analyze than a bounded synchronized region (see Sect. "Basic Concepts of Network Synchronization" to recall their definitions). Therefore, to effectively enhance network synchronizability, they presented a design method for the output function (i.e., H in network (1), or the inner linking matrix in the linear coupling case) such that the resultant network has an unbounded synchronized region, for the case where the synchronous state is an equilibrium of the network. If the synchronous state is an equilibrium, then both DF(s(t)) and DH(s(t)) in network (1), as discussed in Sect. "Enhancing Network Synchronizability" (part B), reduce to constant matrices, denoted by F and H, respectively. The synchronized region is the stability region of the matrix pencil F + αH with respect to the parameter α. It can be proved that there exists a matrix H of rank 1 (meaning that only one component of each state vector is used for coupling) such that the stability region is unbounded. The method for obtaining the desired output function is outlined below: first, take a column vector b such that (F, b) is stabilizable [104]; then, find a matrix P = P^T such that FP + PF^T - 2bb^T < 0; consequently, taking k = b^T P^-1 leads to the stability of F - αbk for all α in the unbounded region; finally, H = bk is the matrix to be found.

Synchronization Phenomena on Networks, Figure 20 A network of 6 nodes (after [101])

For illustration, synchronization of the simple 6-node network shown in Fig. 20 is studied, where each node is a third-order smooth Chua's circuit [105]. At first, arbitrarily take the output function

    H = | 0.8348  9.6619  2.6591 |
        | 0.1002  0.0694  0.1005 |    (18)
        | 0.3254  8.5837  0.9042 |
But the network does not synchronize; the states of node 1 are shown in Fig. 21a. Then, let b = (0, 0, 1)^T and k = (0.0708, 0.15590, 0.4296), and set H = bk, so synchronization is achieved as guaranteed by the theory. Figure 21b shows that the states of node 1 quickly reach the equilibrium.

Synchronization Phenomena on Networks, Figure 21 States of node 1 (after [101])

Future Research Outlook

Complex network synchronization is a rapidly growing subject attracting increasing attention from physics, engineering, mathematics, and biology alike. Despite the great advances and progress to date, many important questions remain open.

In the studies of static coupling, Nishikawa and Motter [48] pointed out that optimal global synchronizability, with the eigenratio equal to 1, can be obtained from a directed network structure without loops. Adding even one loop of length 2 (in a directed network, two opposite edges between nodes i and j can be considered a loop of length 2) doubles the eigenratio [73,85]. Another scenario is shown by extending the conclusion in [48] to the case of non-identical oscillators [106]. Further work in this direction will be helpful for an in-depth understanding of the role of loops in network synchronization.

In the studies of dynamic coupling, the cost of coupling has not been taken into account. However, cost is usually very significant in some self-driven systems (for example, in wireless sensor networks [107,108] and in distributed autonomous robotic systems [109]). For each node to report its current state to its neighbors (or to detect the states of all its neighbors) requires a certain amount of power, while the total power available to each node is often limited, even if such communications are possible. Yet, as found in the collective behaviors of biological swarms, a few effective leaders can organize the whole population well [110], and a recent study has pointed out that partial coupling is more than enough to keep self-propelled particles coherent [111]. Therefore, it is natural to hope to synchronize a complex network at very low cost, which is an important issue for further investigation.

Very recently, there have been some attempts to detect network structure with the help of synchronization phenomena on complex networks [112,113,114]: to discover hierarchical community structure from the dynamic time scales of the network synchronization process [112,113], or to infer the complete connectivity of a network from its stable response dynamics [114]. These approaches seem quite useful for optimal network design, analysis, and utilization in general, and therefore deserve special effort.

Similar to the open questions above, many theoretically attractive and practically important problems about various aspects of synchronization on complex networks can be posed. As network research evolves further in different fields, many new dynamical phenomena and analytic issues will also emerge. The subject of synchronization phenomena on networks will continue to prove a theoretically interesting and technically challenging subject for scientific research in the years to come.

Acknowledgments

The authors thank the American Physical Society and Elsevier for permission to use some of their published simulation figures, which are all referenced in the figure captions. The authors also thank Dr. Jian-Guo Liu for his helpful comments and suggestions. G.R.C. was supported by the NSFC/RGC joint research scheme under grant N-CityU107/07. B.H.W. was supported by the NNSFC under Grants 10472116 and A0524701, and the Specialized Program under the Presidential Funds of CAS. T.Z. was supported by NNSFC under Grants 10635040 and 70471033.

Bibliography

1. Albert R, Barabási AL (2002) Rev Mod Phys 74:47
2. Dorogovtsev SN, Mendes JFF (2002) Adv Phys 51:1079
3. Newman MEJ (2003) SIAM Rev 45:167
4. Boccaletti S, Latora V, Moreno Y, Chavez M, Hwang DU (2006) Phys Rep 424:175
5. da F Costa L, Rodrigues FA, Travieso G, Boas PRV (2007) Adv Phys 56:167
6. Pastor-Satorras R, Vespignani A (2001) Phys Rev Lett 86:3200
7. Zhu CP, Xiong SJ, Tian YJ, Li N, Jiang KS (2004) Phys Rev Lett 92:218702
8. Zhou T, Yan G, Wang BH (2005) Phys Rev E 71:046141
9. Zhou T, Fu ZQ, Wang BH (2006) Prog Nat Sci 16:452
10. Zhou T, Liu JG, Bai WJ, Chen GR, Wang BH (2006) Phys Rev E 74:056109
11. Motter AE, Lai YC (2002) Phys Rev E 66:065102
12. Goh KI, Lee DS, Kahng B, Kim D (2003) Phys Rev Lett 91:148701
13. Motter AE (2004) Phys Rev Lett 93:098701
14. Zhou T, Wang BH (2005) Chin Phys Lett 22:1072
15. Galstyan A, Cohen P (2007) Phys Rev E 75:036109
16. Guimerá R, Díaz-Guilera A, Vega-Redondo F, Cabrales A, Arenas A (2002) Phys Rev Lett 89:248701
17. Tadić B, Thurner S, Rodgers GJ (2004) Phys Rev E 69:036102
18. Zhao L, Lai YC, Park K, Ye N (2005) Phys Rev E 71:026125
19. Yan G, Zhou T, Hu B, Fu ZQ, Wang BH (2006) Phys Rev E 73:046108
20. Wang BH, Zhou T (2007) J Korean Phys Soc 50:134
21. Watts DJ, Strogatz SH (1998) Nature 393:440
22. Barabási AL, Albert R (1999) Science 286:509
23. Strogatz SH, Stewart I (1993) Sci Am 269:102
24. Gray CM (1994) J Comput Neurosci 1:11
25. Néda Z, Ravasz E, Vicsek T, Barabási AL (2000) Phys Rev E 61:6987
26. Glass L (2001) Nature 410:277
27. Chen G, Dong X (1998) From Chaos to Order. World Scientific, Singapore
28. Erdös P, Rényi A (1960) Publ Math Inst Hung Acad Sci 5:17
29. Wang XF (2002) Int J Bifurc Chaos 12:885
30. Heidelberger R (2007) Nature 450:625
31. Heagy JF, Carroll TL, Pecora LM (1994) Phys Rev E 50:1874
32. Stefański A, Perlikowski P, Kapitaniak T (2007) Phys Rev E 75:016210
33. Wu CW, Chua LO (1995) IEEE Trans Circuits Syst I 42:430
34. Jost J, Joy MP (2001) Phys Rev E 65:016201
35. Gade PM (1996) Phys Rev E 54:64
36. Manrubia SC, Mikhailov AS (1999) Phys Rev E 60:1579
37. Gade PM, Hu CK (1999) Phys Rev E 60:4966
38. Gade PM, Hu CK (2000) Phys Rev E 62:6409
39. Lago-Fernández LF, Huerta R, Corbacho F, Sigüenza JA (2000) Phys Rev Lett 84:2758
40. Wang XF, Chen G (2002) Int J Bifurc Chaos Appl Sci Eng 12:187
41. Wang XF, Chen G (2002) IEEE Trans Circuits Syst I 49:54
42. Barahona M, Pecora LM (2002) Phys Rev Lett 89:054101
43. Jiang PQ, Wang BH, Bu SL, Xia QH, Luo XS (2004) Int J Mod Phys B 18:2674
44. Lind PG, Gallas JAC, Herrmann HJ (2004) Phys Rev E 70:056207
45. Pecora LM, Carroll TL (1998) Phys Rev Lett 80:2109
46. Hu G, Yang J, Liu W (1998) Phys Rev E 58:4440
47. Pecora LM, Barahona M (2005) Chaos Complex Lett 1:61
48. Nishikawa T, Motter AE (2006) Phys Rev E 73:065106
49. Kuramoto Y (1975) In: Araki H (ed) International Symposium on Mathematical Problems in Theoretical Physics. Lecture Notes in Physics, vol 30. Springer, New York
50. Kuramoto Y (1984) Chemical Oscillations, Waves and Turbulence. Springer, Berlin
51. Kuramoto Y, Nishikawa I (1987) J Stat Phys 49:569
52. Pikovsky A (2001) Synchronization. Cambridge University Press, Cambridge
53. Acebrón JA, Bonilla LL, Vicente CJP, Ritort F, Spigler R (2005) Rev Mod Phys 77:137
54. Nishikawa T, Motter AE, Lai YC, Hoppensteadt FC (2003) Phys Rev Lett 91:014101
55. Newman MEJ, Strogatz SH, Watts DJ (2001) Phys Rev E 64:026118
56. Dorogovtsev SN, Mendes JFF (2000) Phys Rev E 62:1842
57. Hong H, Kim BJ, Choi MY, Park H (2004) Phys Rev E 69:067105
58. Freeman L (1979) Soc Netw 1:215
59. Newman MEJ (2001) Phys Rev E 64:016132
60. Zhou T, Liu JG, Wang BH (2006) Chin Phys Lett 23:2327
61. Fan ZP (2006) Complex Networks: From Topology to Dynamics. PhD Thesis, City University of Hong Kong
62. Zhao M, Zhou T, Wang BH, Yan G, Yang HJ, Bai WJ (2006) Physica A 371:773
63. Maslov S, Sneppen K (2002) Science 296:910
64. Kim BJ (2004) Phys Rev E 69:045101(R)
65. McGraw PN, Menzinger M (2005) Phys Rev E 72:015101
66. Gómez-Gardeñes J, Moreno Y, Arenas A (2007) Phys Rev Lett 98:034101
67. Wu X, Wang BH, Zhou T, Wang WX, Zhao M, Yang HJ (2006) Chin Phys Lett 23:1046
68. Holme P, Kim BJ (2002) Phys Rev E 65:026107
69. Newman MEJ (2002) Phys Rev Lett 89:208701
70. di Bernardo M, Garofalo F, Sorrentino F (2007) Int J Bifurc Chaos 17:3499
71. Sorrentino F, di Bernardo M, Cuéllar GH, Boccaletti S (2006) Physica D 224:123
72. Chavez M, Hwang DU, Martinerie J, Boccaletti S (2006) Phys Rev E 74:066107
73. Zhao M, Zhou T, Wang BH, Ou Q, Ren J (2006) Eur Phys J B 53:375
74. Huang L, Park K, Lai YC, Yang L, Yang K (2006) Phys Rev Lett 97:164101
75. Radicchi F, Castellano C, Cecconi F, Loreto V, Parisi D (2004) Proc Natl Acad Sci USA 101:2658
76. Zhou T, Zhao M, Chen G, Yan G, Wang BH (2007) Phys Lett A 368:431
77. Donetti L, Hurtado PI, Muñoz MA (2005) Phys Rev Lett 95:188701
78. Motter AE, Zhou C, Kurths J (2005) Phys Rev E 71:016116
79. Motter AE, Zhou C, Kurths J (2005) Europhys Lett 69:334
80. Motter AE, Zhou C, Kurths J (2005) AIP Conf Proc 776:201
81. Hwang DU, Chavez M, Amann A, Boccaletti S (2005) Phys Rev Lett 94:138701
82. Dorogovtsev SN, Mendes JFF, Samukhin AN (2000) Phys Rev Lett 85:4633
83. Krapivsky PL, Redner S (2001) Phys Rev E 63:066123
84. Zou Y, Zhu J, Chen G (2006) Phys Rev E 74:046107
85. Chavez M, Hwang DU, Amann A, Hentschel HGE, Boccaletti S (2005) Phys Rev Lett 94:218701
86. Chavez M, Hwang DU, Amann A, Boccaletti S (2006) Chaos 16:015106
87. Goh KI, Kahng B, Kim D (2001) Phys Rev Lett 87:278701
88. Wang X, Lai YC, Lai CH (2007) Phys Rev E 75:056205
89. Zhou C, Kurths J (2006) Phys Rev Lett 96:164102
90. Huang D (2006) Phys Rev E 74:046208
91. Boccaletti S, Hwang DU, Chavez M, Amann A, Kurths J, Pecora LM (2006) Phys Rev E 74:016102
92. Zhao M, Zhou T, Wang BH, Wang WX (2005) Phys Rev E 72:057102
93. Barthélemy M (2004) Eur Phys J B 38:163
94. Zhou T, Zhao M, Wang BH (2006) Phys Rev E 73:037101
95. Bondy JA, Murty USR (1976) Graph Theory with Applications. MacMillan, London
96. Bollobás B (1998) Modern Graph Theory. Springer, New York
97. Xu JM (2003) Theory and Application of Graphs. Kluwer, Dordrecht
98. Newman MEJ, Watts DJ (1999) Phys Rev E 60:7332
99. Yin CY, Wang WX, Chen G, Wang BH (2006) Phys Rev E 74:047102
100. Comellas F, Gago S (2007) J Phys A Math Theor 40:4483
101. Duan Z, Chen G, Huang L (2007) Phys Rev E 76:056103
102. Duan Z, Chen G, Huang L (2007) Phys Lett A 372:3741
103. Duan Z, Liu C, Chen G (2008) Physica D 237:1006
104. Friedland B (1986) Control System Design. McGraw-Hill, New York
105. Tsuneda A (2005) Int J Bifurc Chaos 15:1
106. Um J, Han SG, Kim BJ, Lee SI (2008) (unpublished)
107. Akyildiz IF, Su W, Sankarasubramaniam Y, Cayirci E (2002) Comput Netw 38:393
108. Ögren P, Fiorelli E, Leonard NE (2004) IEEE Trans Automat Contr 49:1292
109. Arai T, Pagello E, Parker LE (2002) IEEE Trans Robot Automat 18:655
110. Couzin ID, Krause J, Franks NR, Levin SA (2005) Nature 433:513
111. Zhang HT, Chen M, Zhou T (2007) arXiv:0707.3402
112. Arenas A, Díaz-Guilera A, Pérez-Vicente CJ (2006) Phys Rev Lett 96:114102
113. Boccaletti S, Ivanchenko M, Latora V, Pluchino A, Rapisarda A (2007) Phys Rev E 75:045102(R)
114. Timme M (2007) Phys Rev Lett 98:224101
Thermodynamics of Computation
H. JOHN CAULFIELD, LEI QIAN
Fisk University, Nashville, USA

Article Outline

Glossary
Definition of the Subject
Introduction
Thermodynamics
Computer Equivalents of the First and Second Laws
The Thermodynamics of Digital Computers
Analog and Digital Computers
Natural Computing
Quantum Computing
Optical Computing
Thermodynamically Inspired Computing
Cellular Array Processors
Conclusions
Future Directions
Bibliography

Glossary

Thermodynamics Literally, accounting for the impact of temperature and energy flow. It is a branch of physics that describes how temperature, energy, and related properties affect the behavior of an object or event.

First law of thermodynamics Also known as the law of energy conservation. It states that the energy of an isolated system remains constant.

Second law of thermodynamics The second law of thermodynamics asserts that the entropy of an isolated system never decreases with time.

Entropy A measure of the disorder or unavailability of energy within a closed system.

Computing Any activity with input/output patterns mapped onto real problems to be solved.

Turing machine An abstract computing model invented by Alan Turing in 1936, long before the first electronic computer, to serve as an idealized model for computing. A Turing machine has a tape that is unbounded in both directions, a read-write head, and a finite set of instructions. At each step, the head may modify the symbol on the tape directly under it, change its state, and then move one unit to the left or right along the tape. Although extremely simple, Turing machines can solve
any problem that can be solved by any computer that could possibly be constructed (see the Church–Turing thesis).

The Church–Turing thesis A combined hypothesis about the nature of computable functions. It states that any function that is naturally regarded as computable is computable by a Turing machine. In the early 20th century, various computing models, such as the Turing machine, the λ-calculus, and recursive functions, were invented, and it was proved that they are equally powerful. Alonzo Church and Alan Turing independently proposed the thesis that any naturally computable function must be a recursive function or, equivalently, be computable by a Turing machine, or be a λ-definable function. In other words, it is not possible to build a computing device that is more powerful than these machines with the simplest computing mechanisms. Note that the Church–Turing thesis is not a provable or refutable conjecture, because "naturally computable function" is not rigorously defined; there is no way to prove or refute it. But it has been accepted by nearly all mathematicians today.

Analog An analog computer is often negatively defined as a computer that is not digital. More properly, it is a computer that uses quantities that can be made proportional to the amount of signal detected. That analogy between a real number and a physical property gives the name "analog." Many problems arise in analog computing, because it is difficult to obtain a large number of distinguishable levels and because noise builds up in cascaded computations. But, being unencoded, an analog computer can always be run faster than its digital counterpart.

Digital The name derives from digits (the fingers). Digital computing works with discrete items, like fingers. Most digital computing is binary, using 0 and 1. Because signals are restored to a 1 or a 0 after each operation, noise accumulation is not much of a problem.
And, of course, digital computers are much more flexible than analog computers, which tend to be rather specialized.

Quantum computer A quantum computer is a computing device that makes direct use of quantum mechanical phenomena such as superposition and entanglement. Theoretically, quantum computers can solve certain problems, such as factoring large integers, far faster than the best known classical algorithms. Due to many technical difficulties, no practical quantum computer exploiting entanglement has been built to date.
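As a concrete companion to the glossary's Turing machine entry, here is a minimal simulator with a rule table mapping (state, symbol) to (new symbol, move, new state). The bit-flipping machine is a hypothetical example of ours, not one from the text.

```python
# Minimal Turing machine sketch: a tape grown on demand, a head position,
# and a finite transition table; "_" denotes the blank symbol.
def run_tm(tape, rules, state="start", halt="halt"):
    tape = list(tape)
    head = 0
    while state != halt:
        sym = tape[head] if head < len(tape) else "_"
        new_sym, move, state = rules[(state, sym)]
        if head == len(tape):
            tape.append("_")        # extend the tape when the head walks off
        tape[head] = new_sym
        head += 1 if move == "R" else -1
    return "".join(tape)

# A toy machine that inverts a binary string, moving right until blank.
flip = {
    ("start", "0"): ("1", "R", "start"),
    ("start", "1"): ("0", "R", "start"),
    ("start", "_"): ("_", "R", "halt"),
}
print(run_tm("0110", flip))  # prints "1001_"
```

This machine only moves right; a full simulator would also grow the tape on the left, since the glossary's tape is unbounded in both directions.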
Definition of the Subject

Thermodynamics is a very general field that helps to define the physical limits imposed on everything material. When people study an abstract computing model such as a Turing machine, thermodynamics does not constrain its behavior, because nothing physical is being done. Real computers, however, must be made of some physical material, material that has a temperature. The physical stuff must have at least two readily distinguishable states that can be interpreted as two different values, and in the process of computing it must be able to change state under control. The change of state must obey the laws of thermodynamics. For example, there is a minimum amount of energy associated with a bit of information: erasing a bit entails an entropy change of at least k log 2 and an energy cost of at least kT log 2, where k is Boltzmann's constant and T is the absolute temperature of the device and its surroundings. Mysteriously, information and entropy seem linked even though one is physical and the other conceptual. These limits are vital to know, as they allow designers to stop seeking improvements as they reach or near them. At the same time, thermodynamics indicates where computers do not come near what is possible; such aspects of computers represent potentially important fields to explore.

Introduction

Thermodynamics, the science of changes in energy over time, is one of the most fundamental sciences. Arguably, the first and second laws of thermodynamics are the scientific principles most scientists would label as the most fundamental of all the laws of science. No one ever expects them to be violated. We will explore the following aspects of the relationship between thermodynamics and computing:

The energetics and corresponding temporal properties of classical digital computers.
Possible modifications such as conservative computing, analog computing, and quantum computing.
Thermodynamically-inspired computing.
Unsurprisingly, we will find that those topics are all related. Before we can do any of those things, however, we must review thermodynamics itself, in terms suited to this discussion.

Thermodynamics

Thermodynamics originated as a set of general rules characterizing the effects of system parameters such as energy
and temperature over time. Somewhat later, thermodynamics came to be partitioned into two broad classes: classical thermodynamics (now called equilibrium thermodynamics, as it predicts the final system parameters resulting from some perturbing effect) and nonequilibrium thermodynamics (what happens if energy input to a system continues over time, so that equilibrium is never reached). The fact that thermodynamics applies to quantities averaged over the system, and deals only with probabilistic results, means that no physical law is broken if entropy within the system decreases for either of two reasons: the occurrence of an improbable event that is allowable but fleeting, or the occurrence of a state that is not reachable by equilibrium thermodynamics but is reachable with a continuing input of energy. The latter is an example of nonequilibrium thermodynamics, which also connects with topics such as nonlinear dynamics, chaos, and emergence. Two concepts from nonlinear dynamics are especially interesting: attractor states and emergence. Attractor states are system states such that if the state of the system is near a given attractor (within its "attractor basin"), successive states tend to move toward that attractor. Some of the attractor states in systems far from equilibrium can be highly organized. Such highly organized nonequilibrium states can be fixed conditions, periodic events, or even more complex situations called strange attractors, which are somewhat predictable and necessarily chaotic in the technical sense of that word. There is generally a control parameter, such as power input, which determines the system performance. For some values, systems develop excessive sensitivity to the value of the control parameter, leading to long-term unpredictability of the deterministic result, a condition called chaos.
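A one-line toy system makes the control-parameter story concrete. The logistic map below is a standard illustration of our choosing (it is not discussed in the text): one parameter value yields a periodic attractor, another yields chaotic sensitivity to initial conditions.

```python
# Sketch: the logistic map x -> a*x*(1-x), with control parameter a.
# For a = 3.2 the long-run behavior is a period-2 attractor; for a = 4.0
# the map is chaotic, and nearby initial conditions separate rapidly.
def iterate(a, x, n):
    for _ in range(n):
        x = a * x * (1 - x)
    return x

# Attractor: after transients, the orbit repeats with period 2.
x = iterate(3.2, 0.3, 1000)
print(abs(iterate(3.2, x, 2) - x))  # ~0: the orbit has settled on a 2-cycle

# Sensitivity: a difference of 1e-9 in initial condition is hugely amplified.
d = abs(iterate(4.0, 0.3, 40) - iterate(4.0, 0.3 + 1e-9, 40))
print(d)  # far larger than the initial 1e-9 separation
```

Sweeping a from below 3 toward 4 walks the system from a fixed point, through periodic attractors, into chaos, a miniature of the control-parameter behavior described above.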
Systems that are too unorganized, like an ideal gas, are not very interesting from a behavioral viewpoint. Systems that are fully organized, like a crystal, also have limited interest. But between those extremes lie regions of high complexity and great interest. One of the favorite and well-justified sayings of complexity theorists is that life arises on the edge of chaos. You are a system that shows spectacular organization if kept away from equilibrium by a continuing input of energy. The properties that arise far from equilibrium are said to be emergent and are generally unpredictable until discovered experimentally. A general feature of such organized complexity in systems maintained well away from equilibrium is self-organization. That is, the order observed does not have to be created by a scientist; instead it arises spontaneously under the correct circumstances.
Turing machines can be accounted for by conventional thermodynamics, but some computers, such as neural networks and cellular array processors, can become complex enough to exhibit interesting and useful emergent properties that must be discussed in terms of nonlinear thermodynamics. Such self-organized complex processor behavior has also been called "synergistic." The most complex computer we know is the human brain. It not only arises on the edge of chaos, but also uses and consumes chaos; over-regular temporal behavior of the brain is called epilepsy and is not a desirable state. Two of the first and most influential people to explore nonequilibrium thermodynamics were Warren Weaver and Ilya Prigogine. Weaver [1] discussed three ways people can understand the world: simple systems, statistical methods (conventional thermodynamics), and nonequilibrium organized complexity. Organized complexity is compatible with classical thermodynamics from an overall perspective but can be achieved locally in space and time. Prigogine's Nobel-Prize-winning work on nonequilibrium thermodynamics is explained beautifully in his popular books [2,3,4]. Let us now review very briefly what thermodynamics deals with and why it is important to computers. Temperature, T, is a measure of the driving force for energy flow: there can be no systematic energy flow from a lower-temperature region to a higher-temperature region. Entropy, S, is a measure of disorder. The second law of thermodynamics, perhaps science's most sacred law, says that entropy never decreases; indeed, it nearly always increases. Note that entropy is essentially an abstract quantity. Indeed, it is the negative of information as defined by Shannon; therefore information is sometimes called negentropy, negative entropy. There is a metaphysical puzzle here that has been more ignored than solved: how does a mathematical concept like information come to govern physics, as entropy does?
Boltzmann's constant k symbolizes the relationship between information and physics. It appears in purely physical equations such as the ideal gas law, and it also appears in equations about information. Leo Szilard suggested that writing a memory does not produce entropy; it is possible to store information into a memory without increasing entropy. However, entropy is produced whenever the memory is erased. It turns out that the minimum entropy created by erasing one bit is

ΔS per erased bit = k ln 2,

where ln 2 ≈ 0.69 is the natural logarithm of 2.
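The scale of this bound is easy to compute. The snippet below is our illustration (the choice T = 300 K, an assumed room temperature, is not from the text); it evaluates the corresponding minimum erasure energy kT ln 2.

```python
# Minimum energy to erase one bit: E = kT ln 2 (the Landauer bound).
import math

k = 1.380649e-23           # Boltzmann constant in J/K
T = 300.0                  # assumed room temperature in K
E_bit = k * T * math.log(2)
print(E_bit)               # about 2.87e-21 J per erased bit
```

Comparing this with the switching energy of a real logic gate shows the "millions of kT per operation" gap discussed later in the article.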
What's going on here? How are information and the physical world related? Information relates to the answers to yes-or-no questions. Entropy is a physical entity, not necessarily related to questions and answers. Yet the mathematics of both is the same; information is the negative of entropy. What are we to make of this metaphysically? One view (originating in the mind and writing of John Archibald Wheeler) is that information is primary. The slogan is "It from Bit." Wheeler's metaphysics is extremely appealing, because it requires no external scaffolding, but it strikes many as not quite convincing. There is an opposite approach that begins with stuff and energy: the physical world is made of relationships between parts; the universe is self-organized and self-defined; stuff is differentiation (yes and no) within the whole. Another view is that physics (stuff, energy, motion) is primary. Information in a brain or a computer is present only if some stuff is used and some energy is expended in generating, transmitting, or storing it. In that case, information relates to the probability that a given record would appear by chance if all records were equally probable: the less likely the chance appearance, the more information is present. In this way, information ties back to Bayes's theorem. Neither caricature of a position is complete enough to be fair, but perhaps this note will stimulate some readers to dig deeper. This is not a metaphysics chapter, so we take no side in this dispute. But some of us are physicists, and physicists are drawn to physics by the hope of finding answers, not just equations. We cannot help asking such questions, so we think noting this unresolved metaphysical problem is appropriate. Note also that no metaphysical problem can ever be fully resolved, as the criteria for metaphysical truth lie in the domain of metametaphysics (Ockham's razor, etc.) and so on ad infinitum.
There is a fundamental relationship between the change ΔE in the physical quantity energy and the change ΔS in the mathematical quantity entropy, namely

ΔE = T ΔS.

What is remarkable about thermodynamics is that it tells us nothing about how things actually work. It tells us instead what cannot be done. It is the science of fundamental limits and, in that sense, a negative field. Physicists sometimes have to work very hard to find the detailed ways nature enforces these limits in any particular problem. When energy ceases to have any net flow, the system is said to be at equilibrium. Until the latter part of the 20th century, thermodynamics meant near-equilibrium behavior. Far from equilibrium, very interesting things can occur: order can arise from disorder, at least locally. The second law is satisfied by creating even more disorder somewhere else in the universe. Life exists far from equilibrium and seems to increase order. Indeed it does, locally; but it must take energy from its environment and cause a global increase in entropy. The laws of thermodynamics define and thus forbid magic: something from nothing (the first law) or effects without causes (the second law). We will interpret all magic-forbidding laws as "thermodynamic" even if they apply to nonphysical things such as information.
Computer Equivalents of the First and Second Laws

We propose two laws of information, inspired by the two laws of thermodynamics just described.

First law: Information is not increased in a computer. A computer cannot create information; the output is strictly determined by the data and the instructions. If we were smart enough and had unlimited resources (time and memory), we would already know that 1 + 2 + 3 + 4 + 5 = 15. We only need a computer because a computer can compute faster and more reliably. We may not know the answer now, but we could if we had enough time, memory, and patience. The result is not new but derived information.

Second law: The energy required to destroy one bit of information is not smaller than the temperature T multiplied by some minimum entropy change ΔS. This second law applies to conservative as well as dissipative logic (terms to be described later). Like the second law of thermodynamics, it deals with irreversible acts.

The Thermodynamics of Digital Computers

Just as all digital computers are equivalent to a Turing machine, all computations in digital computers can be thought of as being performed by Boolean logic gates. If we understand the thermodynamics of a Boolean logic gate, we understand the thermodynamics of any digital computer. All readers of this chapter will recognize that Boolean logic gates such as AND, OR, NOR, and so forth convert two one-bit inputs into a single one-bit output. Immediately, we notice that such a gate has less information out than in: it is irreversible. From the two laws of computing, we conclude that, since information must be conserved, one bit of information must somehow have been converted into some noninformative state. The amount of energy required is related to the information content of a single bit multiplied by the operating temperature T. The person who first realized this and determined the minimum amount of energy needed to destroy one bit of information,

E = kT log 2,

was an IBM scientist named Landauer. Here k is the Boltzmann constant. The average energy of a molecule at temperature T is about kT, so this makes some intuitive sense as well. This is really a very small amount of energy; molecules of that energy are striking you right now and you cannot even feel them. Two things make this energy remark important. First, most current gates require millions of kT per operation. This is no accident; it is required, as we will see later when we discuss analog computing. Second, the number of operations per second is now in the billions, so trillions of kT are dissipated in current computers each second. They get very hot and consume a great amount of power.

Landauer and fellow IBM physicist Bennett then took a totally unexpected and brilliant direction of research. What would happen, they asked, if we could use logic gates that do not destroy information? Then there would be no thermodynamic minimum energy requirement for such a gate's operation; perhaps the energy cost of computing could be reduced or even eliminated. A trivial example from arithmetic will illustrate these concepts for readers not familiar with them. Here is a conventional (dissipative) arithmetic expression:
2C3D5: It would be more informative to write 2C3)5: If we know the data (2,3) and the instructions (+), the result (5) is determined. But this is irreversible. If we know the result (5) and the instruction to be inverted (+), we cannot recover the data (2, 3). The data could have been (3, 2), (1, 4), etc. Here is an information conserving arithmetic expression: 2; 3 ® 3; 2 :
Thermodynamics of Computation
The operator ® means reverse the order of the data. Such an operation is clearly conservative of information. And, clearly, it is reversible: 3, 2 ® 2, 3. An ® gate would have no inevitable energy price. It destroys no information (our first law is satisfied). Here is a less trivial example. Consider the operation a, b © a + b, a − b. This operation is also conservative of information, because with a + b and a − b one can easily recover the original a and b. Since that time, a number of conservative, reversible logic gates have been proposed and implemented. We can say some things in general about such gates. They will have equal numbers of inputs and outputs (usually three). The changes effected by the gates must either rearrange the data (like the ® operator in our arithmetic example) or mix them (like the © in the second example). All electronic gates so far proposed involve rearrangements. In optical logic, reversible mixing is common. Stringing conservative logic gates together can allow us to produce sought-after results. Inevitably, however, we will also have computed things we do not care about – called garbage. Garbage is inevitable in reversible computing, because the components must retain the information normally destroyed by a conventional Boolean logic gate. Reversibility eliminates one problem (the inevitable dissipation of information) but creates another (the inevitable appearance of garbage bits). Nevertheless, we must detect the desired answer, and that must cost at least kT log 2. The zero-energy possibility applies only to the logic operations, not the entering of data into the system or the extraction of data from the system. Failure to understand this has caused more than a few to proclaim that zero-energy logic operations are impossible. We enter data at the input ports and wait for an answer to reach the output ports.
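The information-conserving mixing operation just described is easy to check mechanically. This is a minimal sketch (our own illustration; the function names are ours): the forward gate sends (a, b) to (a + b, a − b), and an inverse gate recovers the inputs exactly, so no information is destroyed – in contrast to 2 + 3 ⇒ 5.

```python
def mix(a, b):
    # conservative "mixing" gate: (a, b) -> (a + b, a - b)
    return a + b, a - b

def unmix(s, d):
    # inverse gate: recover (a, b) from the sum and the difference
    return (s + d) / 2, (s - d) / 2

# Round trip: no information is lost, unlike the dissipative 2 + 3 -> 5.
s, d = mix(2, 3)           # (5, -1)
print(unmix(s, d))         # recovers (2, 3)
```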
As the system is reversible, not all occurrences lead in the “right direction.” A temperature-dependent random walk will eventually produce the answer. But with a little energy bias in the forward direction, we will get the answer faster at the cost of expending energy – the more energy, the faster the results are likely
to arrive. One might expect that something akin to the energy–time uncertainty principle might apply here, so very low energy gates might require a very long time to operate. Indeed, that has usually been the case. An exception will be noted in the case of certain optical elements.

Analog and Digital Computers

Because digital computers and computation have been so successful, they have influenced how we think about both computers as machines and computation as a process – so much so that it is difficult today to reconstruct what analog computing was all about… It is a history in which digital machines can do things ‘better’ and ‘faster’ than other machines… However, what is at stake here are not matters of speed or precision. Rather, it is an argument about what can be rendered and understood through a machine that does computation [5]. All computers involve the operation of some physical system and the interpretation of physically measurable quantities in numerical terms. The most obvious interpretation is N = cQ: the number N is the measured quantity Q multiplied by some conversion coefficient c. In classical physics Q is a continuous variable. Computers using such numbers are called analog computers. Analog computers have several profound disadvantages relative to digital computers. Sensitivity. By this we mean the ability to distinguish small differences in large numbers. Analog computers tend to have trouble distinguishing, say, 47 from 48. Digital computers only have to distinguish between 0 and 1; they encode bigger numbers in terms of 0s and 1s. Inability to compute complex or even real numbers in most cases, because most useful quantities Q are nonnegative. This is usually “solved” by encoding real or complex numbers as multiple nonnegative numbers. Cumulative errors. When the output from one analog computation is used as an input to another, the errors are carried forward and accumulate.
This effect is made even worse by the fact that some problems amplify the input errors significantly. In linear algebra problems, for instance, the error amplification factor is called the condition number of the matrix. That
number is never less than 1 and is often exceedingly large (ill-conditioned matrices) or even infinite (singular matrices). Inflexibility. This chapter was written using a word processor on a digital computer. An analog word processor would be almost inconceivably complex. Some distinct advantages come from being continuous in space or time [6]. In addition, it is widely believed that anything an analog computer can compute can also be computed by a digital computer. In fact, that belief is a corollary of what is called the strong Church–Turing thesis. It asserts that anything that can be computed in any way (including analog computations) can also be computed by a Turing machine. The more widely accepted form of the Church–Turing thesis simply asserts that anything that is sensibly computable by any computer can be computed by a Turing machine. There are some who believe they have discovered exceptions to the Church–Turing thesis and many more who doubt that. This is not the place to examine that controversy. It is mentioned here for two reasons. First, it suggests that analog computers cannot do anything a digital computer cannot do: there is no advantage in principle. There may, though, be practical advantages to analog computing when it suffices, as we have already suggested. But there is much more to computing than simply getting the right answer – speed, cost, and power consumption are examples. Sometimes analog computing can win on those criteria. Second, the Church–Turing thesis is thermodynamic in the generalized sense that it sets presumably immutable constraints but does not provide information on how those constraints are enforced. So why would anyone bother with analog computers? There are many reasons. Here are some obvious ones. A binary gate is an analog gate followed by a nonlinear operator (a quantizer) that sets low signals to 0 and high signals to 1 or vice versa.
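The error amplification measured by the condition number, discussed above, is easy to demonstrate. In this sketch (our own illustration, using Cramer's rule on a nearly singular 2 × 2 system), a relative perturbation of about 5 × 10⁻⁵ in the right-hand side b moves the solution by order 1.

```python
def solve2(a11, a12, a21, a22, b1, b2):
    # Cramer's rule for a 2x2 linear system Ax = b
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - a12 * b2) / det,
            (a11 * b2 - b1 * a21) / det)

# Nearly singular (ill-conditioned) matrix: the rows are almost parallel.
A = (1.0, 1.0,
     1.0, 1.0001)

x = solve2(*A, 2.0, 2.0001)       # exact solution is (1, 1)
x_pert = solve2(*A, 2.0, 2.0002)  # tiny change in the input b

# The input changed by ~5e-5 relative; the output changed by order 1.
print(x, x_pert)
```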
So the difference between analog and binary computing lies in two things – the number of quantization levels (2 for binary and many more for analog) and how often measurements and corrections are made (after each operation for binary computers and after the whole computation for analog). So we can argue that all computers are analog at their hearts. But because of the encoding into many binary operations, digital computers are inevitably slower and more power hungry than purely analog computers. We could digitize to more than two values. It takes far fewer ternary values than binary values to represent a large number, but the more quantization levels, the more probable it is that we will make an error. It is generally assumed that two is the optimum number overall. But there are more subtle effects that sometimes favor analog over digital. Consider the representation of a physically measured time signal in a digital computer. The sampling theorem tells us how frequently we must sample the signal if we know its bandwidth. But suppose we do not know it. We can do either of two dangerous things: apply a bandpass filter and risk losing information, or guess the bandwidth and risk producing aliasing. So discretization of an analog signal has a significant probability of giving misleading results. Iterative digital solutions of linear algebra problems converge only for condition numbers limited by the digitization itself, while analog solutions have no such limit because the condition number itself is undefined. A more general treatment can be found in [7]. For chaotic systems, discretization can lead to wildly different results with different levels of discretization [8,9,10]. So what can we say about the thermodynamics of analog computers? The operation occurs in three steps:
1. Preparation. The apparatus is set up so that the input parameters are the ones we choose and the physics will carry out the desired transformations.
2. Time evolution. The physical system evolves through time.
3. Readout. The desired physical parameters are measured.
Preparation is necessary in digital computing as well and has not been treated in any thermodynamic discussion. Nor will we treat it here. The last two steps must somehow be governed by the two laws of computing propounded earlier. That is, the system cannot produce what we interpret as information beyond what we inserted in the preparation. The underlying physics may be exceedingly complex, but the informatics must be as described by the first law. The readout requires some minimum energy.
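The aliasing risk mentioned above can be seen directly. In this sketch (our own example; the frequencies are arbitrary choices), a 9 Hz sine sampled at 10 samples per second produces exactly the same samples as a 1 Hz sine of opposite sign – the undersampled signal is indistinguishable from its alias.

```python
import math

fs = 10.0      # sampling rate, Hz
f_real = 9.0   # actual signal frequency, well above the Nyquist rate fs/2
f_alias = 1.0  # frequency the samples appear to have

samples_real = [math.sin(2 * math.pi * f_real * n / fs) for n in range(40)]
samples_alias = [-math.sin(2 * math.pi * f_alias * n / fs) for n in range(40)]

# Sample by sample, the 9 Hz tone is indistinguishable from a 1 Hz tone:
max_diff = max(abs(a - b) for a, b in zip(samples_real, samples_alias))
print(max_diff)  # ~0: the two signals alias onto each other
```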
In a digital computer with L levels the Landauer analysis says we must dissipate kT log L in energy. Here is at least a qualitative explanation of the fact that we need far more than kT to operate a binary logic gate. We need to measure to far more than just two levels in order to make the binary decisions with sufficiently low error rate. Analog computing is an emerging field, in part because of interest in what is called “natural computing.”
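The magnitudes in the Landauer analysis are easy to evaluate numerically. A sketch (our own illustration; the operation rate and the "millions of kT" per gate are the round figures quoted earlier in the text): kT log 2 at room temperature, and the power implied by a gate dissipating a million kT a billion times per second.

```python
import math

k_B = 1.380649e-23  # Boltzmann constant, J/K (exact in the 2019 SI)
T = 300.0           # room temperature, K

# Landauer limit: minimum energy to destroy one bit (log here is natural log)
landauer_limit = k_B * T * math.log(2)
print(f"kT log 2 at 300 K = {landauer_limit:.3e} J")  # about 2.9e-21 J

# A gate dissipating "millions of kT" per operation, switched a billion
# times per second, dissipates a macroscopic amount of power:
ops_per_second = 1e9
kT_per_op = 1e6
power = ops_per_second * kT_per_op * k_B * T  # watts, per gate
print(f"power per such gate = {power:.3e} W")
```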
Natural Computing

In serial von Neumann machines the interconnections between processors have negligible impact on system space and energy consumption. But concurrency (e.g. parallelism) requires more interconnections and more attention to their space and energy consumption. And those problems become worse as the number of concurrent processors increases. For large enough numbers of processors, interconnections are no longer negligible and can become limiting. Two approaches have been proposed to ameliorate such problems: optical interconnection and quantum computing. Optical interconnection may be useful because optical paths can cross in space without effect; that is, many optical paths can share the same space. Quantum computing is of interest because N entangled states can probe 2^N paths. Both avoid massive interconnection with wires. Richard Feynman anticipated both approaches. One of Feynman's goals in suggesting quantum computing was to solve the wiring problem. Another motivation also guided him. He knew that the accuracy with which the results of a quantum experiment could be predicted depended on the number of path integrals evaluated, so (in a sense) each natural event in quantum mechanics is equivalent to infinitely many calculations. The only way to calculate quantum events with perfect accuracy would be to use a quantum computer. Let us describe a natural computation. We begin with a physical system P that operates on an input I_N to produce an output O_N. It does this naturally, without human assistance. Now suppose we wish to use that physical process P to calculate some result. If there is a computer operation C that operates on an input I_C to produce an output O_C, and if there are mappings

M1 : I_N ↔ I_C
M2 : O_N ↔ O_C

then we can declare that P and C are congruent, and thus compute the same results, if the diagram in Fig. 1 commutes.
That is, if nature, doing what it will do automatically, allows us to establish the starting state I_N (state preparation) in a way that can be viewed via M1 as representing the computer input I_C, and if M2 carries the output O_N of the natural process to the very output O_C that a conventional computer produces with input I_C, then the two processes are congruent and we have a natural computer.
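The commuting-diagram condition can be stated as executable code. This toy sketch is entirely our own invention (the "physical" process, the millivolt quantization, and all function names are illustrative assumptions): voltage doubling plays the natural process P, a left shift plays the computer operation C, and we check that M2(P(x)) = C(M1(x)).

```python
def P(v_in):
    # "natural" process: a circuit that doubles a voltage, by physics alone
    return 2.0 * v_in

def C(n):
    # computer operation: shift a nonnegative integer left by one bit
    return n << 1

def M1(v):
    # map natural input to computer input (volts -> integer counts)
    return round(v * 1000)   # e.g. millivolt quantization

def M2(v):
    # map natural output to computer output, using the same scale
    return round(v * 1000)

# The diagram commutes on these inputs, so under the interpretations
# M1 and M2 the physical process P "computes" the same result as C.
for v in [0.001, 0.25, 1.5, 3.125]:
    assert M2(P(v)) == C(M1(v))
print("diagram commutes on all test inputs")
```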
Thermodynamics of Computation, Figure 1 Nature converts input I_N into output O_N by natural law. The digital computer performing some fixed algorithm converts input I_C into output O_C. If, for all allowable I_N, we have a rule that converts it to an equivalent I_C, and if the output O_N of the natural process can be converted to the equivalent output O_C, then we have in the physical operator a “natural computer”
Certainly, the most widely studied form of natural computing at this moment is the quantum computer.

Quantum Computing

The behavior of matter at very small scales obeys the laws of quantum mechanics, which are so different from what we usually experience as to be essentially beyond our understanding. We can write the equations and perform the experiments, but the results inevitably seem weird. Despite this, it is possible to harness some of these effects for computing. Quantum computers usually utilize superposition (a principle of quantum mechanics which holds that while the state of any object is unknown, it is actually in all possible states simultaneously, as long as we don't measure it) and quantum entanglement (a quantum mechanical phenomenon in which the quantum states of two or more objects have to be described with reference to each other, even though the individual objects may be spatially separated) to perform computation [11,12]. In traditional digital computers, information is carried by numerous binary digits, which are called bits. In quantum computing, information is carried by particles which carry quantum states. Such a quantum logic bit is called a qubit. A qubit can be expressed by

|ψ⟩ = a|0⟩ + b|1⟩

where |0⟩ and |1⟩ are two orthonormal states. The coefficients a and b are the normalized probability amplitudes (in general, complex-valued numbers) of states |0⟩ and |1⟩ that satisfy the condition |a|² + |b|² = 1. These two orthonormal states can be polarizations in orthogonal directions, two oppositely directed spins of atoms, etc. It seems that a qubit carries partial information about |0⟩ and partial information about |1⟩. However, when a qubit is measured (which is necessary if we want to read the output), the qubit “collapses” to either state 0 or state 1 with probability |a|² or |b|². This process is irreversible. Once a qubit collapses to a state, there is no way to recover the original state. On the other hand, it is also not possible to clone a quantum state without destroying the original one. A very weird but extremely useful phenomenon in the quantum world is entanglement. It is the coherence among two or more objects even though these objects are spatially separated. Consider the 2-qubit state (|0⟩|0⟩ + |0⟩|1⟩ + |1⟩|0⟩ + |1⟩|1⟩)/2. It can be decomposed as ((|0⟩ + |1⟩)/√2)((|0⟩ + |1⟩)/√2). That means we have a 50% probability to read |0⟩ and a 50% probability to read |1⟩ when we measure the first qubit. We get the same probabilities for the second qubit. The measured value of the first qubit is independent of that of the second qubit. However, consider the state (|0⟩|1⟩ − |1⟩|0⟩)/√2. It is easy to prove that one cannot decompose this state into a form (a|0⟩ + b|1⟩)(c|0⟩ + d|1⟩). Actually, if we measure the first qubit, we have a 50% probability to read |1⟩ and a 50% chance to read |0⟩. However, if we read |0⟩ when the first qubit is measured, we always read |1⟩ for the second qubit; if we read |1⟩ for the first qubit, we always read |0⟩ for the second qubit. The measurement of one of them affects the state of the other qubit immediately, no matter how far they are from each other. Consider a quantum system with n qubits. If there is no entanglement, it can be decomposed into the product of n 1-qubit systems, or 2n independent numbers. However, if entanglement is allowed, then there are 2^n independent coefficients, and the system can therefore hold 2^n independent values. So it can operate on a tremendous number of values in parallel.
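The perfect anticorrelation of the entangled state just described can be simulated classically at the level of probabilities. This sketch is our own illustration (amplitudes stored in a dictionary; it reproduces measurement statistics, not quantum dynamics): measuring the first qubit of (|0⟩|1⟩ − |1⟩|0⟩)/√2 and then the second always gives opposite results.

```python
import math
import random

def measure(state, qubit):
    """Measure one qubit of a 2-qubit state {bitstring: amplitude}.

    Returns the observed bit and the collapsed, renormalized state.
    """
    p1 = sum(abs(a) ** 2 for bits, a in state.items() if bits[qubit] == "1")
    outcome = "1" if random.random() < p1 else "0"
    collapsed = {b: a for b, a in state.items() if b[qubit] == outcome}
    norm = math.sqrt(sum(abs(a) ** 2 for a in collapsed.values()))
    return outcome, {b: a / norm for b, a in collapsed.items()}

random.seed(1)
s = 1 / math.sqrt(2)
for _ in range(200):
    singlet = {"01": s, "10": -s}     # the state (|01> - |10>)/sqrt(2)
    bit0, rest = measure(singlet, 0)  # measure first qubit: 50/50 random
    bit1, _ = measure(rest, 1)        # second qubit is now determined
    assert bit0 != bit1               # outcomes always disagree
print("anticorrelation held in all 200 trials")
```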
It is possible to solve exponentially hard or NP-complete problems in polynomial or even linear time. In 1994, Peter Shor of AT&T devised an algorithm for factoring an integer N in O((log N)³) time and O(log N) space with a quantum computer. The security of RSA, the most widely used public-key cryptography method, is based on the assumption that factoring a large integer is not computationally feasible. A quantum computer could break it easily. Even though quantum computers are much more powerful than traditional digital computers in terms of speed, they still obey the Church–Turing thesis. In other words, no quantum computer can solve a problem that a traditional computer cannot. Quantum computers can still be simulated by traditional computers. The difference is that a traditional computer may have to spend millions of years
to solve a problem that a quantum computer can solve in seconds. Therefore, the first law of computing – information cannot increase in a computer – is still true, because a quantum computer has the same power as a traditional computer in terms of computability. Just as Boolean logic gates play a central role in traditional computers, quantum logic gates are the heart of quantum computers. Unlike Boolean logic gates, all quantum logic gates are reversible. More precisely, a quantum logic gate can be represented by a unitary matrix M (an n × n matrix M is unitary if M†M = I_n, where M† is the conjugate transpose of M). So if M is the matrix that represents a quantum logic gate, M† represents the reverse operation of the gate. While it is possible to define quantum logic gates for any number of qubits, the most common quantum logic gates operate on spaces of one or two qubits. For example, the Hadamard gate and the phase shifter gates are single-qubit gates, and the CNOT (controlled-NOT) gate is a two-qubit gate. With Hadamard gates, phase shifter gates and CNOT gates, one can do any operation that can be done by a quantum computer. Sets of quantum gates like this are called universal quantum gates. Because quantum logic gates are all reversible, quantum computers do not destroy any information until the outputs are measured. Therefore, quantum computers may in theory consume less energy than traditional computers. While quantum computers have tremendous power and may consume much less energy, they are extremely difficult to build. A quantum computer must maintain coherence among its qubits long enough to perform an algorithm. Preventing qubits from interacting with the environment (which would cause decoherence) is extremely hard. We can get around the difficulty by giving up the entanglements. However, the quantum computer is no longer as powerful if we do so. Without entanglements, the state space of n qubits has only 2n dimensions rather than 2^n dimensions.
Exponentially hard or NP-complete problems cannot be computed by these computers within polynomial time (assuming P ≠ NP) [13]. However, these computers use reversible logic gates and therefore have an advantage in energy consumption. In conclusion, even though quantum computers use a quite different model, the two laws of information still apply to them.

Optical Computing

Optical computing has been studied for many decades [14]. This work has led to conclusions that seem obvious in retrospect, namely that
Optical computing can never compete with electronic computing for general-purpose computing, and that optical computing has legitimate roles in niches defined by the characteristic that the information being processed is already in the optical domain, e.g. in optical communication or optical storage. Two types of complexity arise. One is the complexity of the optical effects themselves; e.g., optics often has three or more activities going on at each location, so if the processes are nonlinear (e.g. photorefractives or cross talk among active elements), then complexity factors are likely to arise [15]. This is nonequilibrium thermodynamics at work. Not to be confused with this is the savings in temporal coherence accomplished by the use of space, time, and fan-in complexity to consume computational complexity [16,17]. Optical computing is limited by thermodynamics in a unique sense. Let us suppose that we wish to measure an output signal that is either 1 or 0 in amplitude. Either for speed or power consumption, we seek to measure as few photons as possible to decide which of those two states (0 or 1) is present. Because the photons do not (without encouragement) come evenly spaced in time of arrival, we must detect many photons to be sure that no detected photon means a logical zero rather than a statistical extreme when a logical 1 is intended. With normal statistics, it takes about 88 detected photons to distinguish between what was intended as a logical 1 and what was intended as a zero. It is possible to make a light source with substantially uniform emission rates; this is called squeezed light. Naively, squeezing can overcome that problem to an amazing extent, but the squeezed state is very fragile, and many of the central activities in an optical computer destroy the order squeezing put in before it can be detected [18].

Thermodynamically Inspired Computing

The approach of systems to equilibrium may become a model for computing.
The most prominent of these are called simulated annealing, the Hopfield net, and the Boltzmann machine. Basically, they are ways to encode a quantity to be minimized as a sort of pseudo-energy and then to simulate some energy minimization activity. And, of course, hybridization of these methods makes a clean division among them impossible. Annealing of materials is an ancient art. Domains in a metal sword, for instance, may be randomly disposed with respect to one another, making the metal brittle. The goal of annealing is to get them aligned, so that they behave more like a single crystal and the blade is thus hardened. This is done by a method
sometimes called “heat and beat.” The material is heated to a high temperature where domain changes are fairly easy and then beaten with a hammer to encourage such movement. Then the material is cooled to a slightly lower temperature, where one hopes the gains so far are not re-randomized, and beaten again. The art is in the cooling schedule and in the beating. In thermodynamic terms, we are trying to get the sword into its minimum energy state, where no beating against an opponent will cause it to relax into a lower energy state. Unlike more familiar optimization methods such as steepest descent or Newton–Raphson that find only a local extremum (one near the starting state), simulated annealing aims at finding the global extremum independently of the starting state. It does this by pure imitation of annealing. Some monotonic figure of merit for the problem is chosen. It might be the RMS difference between the output vector of a matrix–vector multiplication and a target vector. Minimizing it is analogous to minimizing the sword's energy. Driving it to or near zero by adjusting the input vector is a way of solving the ubiquitous Ax = b problem for x. In early stages, big random steps (chosen from a thermodynamic expression governing probabilities as a function of energy and “temperature”) are used. The perturbed system is then “cooled” in the sense of allowing the starting fictitious T value to decrease slightly, and the process is repeated. This is very computer intensive but very effective. More information can be found in many places, including [19,20,21]. A Hopfield net [22] is very similar to simulated annealing in that it uses guided stochastic methods to minimize a user-defined energy (figure of merit). It is based on adjusting the connections of a rather simple binary connected neural network by picking a starting state and adjusting one interconnection at a time. A Boltzmann machine is a more thermodynamically inspired version of a Hopfield network [23].
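The annealing recipe described above can be sketched for the Ax = b example in a few lines. This is our own minimal illustration; the 2 × 2 system, the step sizes, and the cooling schedule are arbitrary choices, not a tuned implementation.

```python
import math
import random

random.seed(0)

A = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 5.0]                 # exact solution is x = (0.8, 1.4)

def energy(x):
    # figure of merit: squared residual ||Ax - b||^2
    r0 = A[0][0] * x[0] + A[0][1] * x[1] - b[0]
    r1 = A[1][0] * x[0] + A[1][1] * x[1] - b[1]
    return r0 * r0 + r1 * r1

x = [0.0, 0.0]
T = 1.0
for _ in range(20000):
    # random step whose size shrinks with the fictitious temperature
    cand = [xi + random.gauss(0, T) for xi in x]
    dE = energy(cand) - energy(x)
    # Metropolis rule: always accept improvements, sometimes accept
    # uphill moves so the search can escape local minima
    if dE < 0 or random.random() < math.exp(-dE / T):
        x = cand
    T *= 0.9995                # cooling schedule

print(x, energy(x))            # x is close to (0.8, 1.4)
```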
One more thermodynamically inspired means of computation needs to be mentioned, namely stochastic resonance [24]. It is a creative use of noise. Noise is usually thought of as the enemy of signal. But if, in some nonlinear operation, the signal is so small that it is not useful, adding a certain amount of noise to the input may help. Indeed, there is an optimal amount of noise for each case. Too little or too much noise are both bad, but there is a noise value in resonance with the problem which gives optimal results. This is not the place for detailed discussions of these thermodynamically inspired methods. Rather, our intent is to show that thermodynamics has impacted not only physical computers but also the way those computers are used to achieve certain goals.
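A minimal stochastic-resonance demonstration (our own construction; the square-wave signal, amplitude, threshold, and noise levels are all illustrative assumptions): a subthreshold signal never trips a hard threshold on its own, but with a moderate amount of added Gaussian noise the threshold crossings become correlated with the signal, and with too much noise the correlation washes out again.

```python
import random

random.seed(42)

def detect(noise_sigma, n=4000, amplitude=0.5, threshold=1.0):
    """Correlation between a subthreshold signal and threshold crossings."""
    # square-wave signal, always below the detection threshold
    sig = [amplitude if (t // 20) % 2 == 0 else -amplitude for t in range(n)]
    out = [1.0 if s + random.gauss(0, noise_sigma) > threshold else 0.0
           for s in sig]
    mean_s = sum(sig) / n
    mean_o = sum(out) / n
    cov = sum((s - mean_s) * (o - mean_o) for s, o in zip(sig, out)) / n
    var_s = sum((s - mean_s) ** 2 for s in sig) / n
    var_o = sum((o - mean_o) ** 2 for o in out) / n
    if var_o == 0:
        return 0.0          # detector never fires: no information passes
    return cov / (var_s * var_o) ** 0.5

print(detect(0.0))   # no noise: signal is subthreshold, correlation 0
print(detect(0.5))   # moderate noise: crossings track the signal
print(detect(5.0))   # too much noise: correlation washes out again
```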
Cellular Array Processors

Cellular array processors (some prefer “cellular automata”) are arrays of discrete, binary-valued elements in fixed locations [25]. Usually started with random inputs, they can be observed to produce quite interesting space-time histories as each cell is updated according to a fixed rule using its current value and those of its immediate neighbors. Under some circumstances, those space-time patterns simply terminate or generate chaos. Often, however, the system self-organizes to produce highly organized patterns. One “application” is so-called “Artificial Life,” which began with the immensely popular set of evolution rules for a two-dimensional array called “the Game of Life.” For the few who have not played with it, we urge you to do so. An online version is available free at http://www.bitstorm.org/gameoflife/. Many brilliant computer scientists have become enthralled with it. For instance, Stephen Wolfram analyzed it in terms of thermodynamics [26].

Conclusions

Thermodynamics
- Governs all computing.
- Has inspired conservative, reversible computing.
- Is not easy to apply to non-Turing computers such as analog and quantum computers.
- Has directly inspired some optimization methods.

In the end, computers are at least partially physical in the classical sense, so they are ultimately limited by thermodynamics. The presence of many virtual operations in some forms of quantum computing can reduce the price of physicality dramatically but cannot eliminate it [27].

Future Directions

Classical equilibrium thermodynamics places limits on what can be accomplished under different conditions. More importantly, it tells us when we are approaching a limit state. Things often get exceedingly difficult in such cases, so effort might be better spent where there is greater room for improvement. Optical computing may soon reach the minimum energy required for reversible computing.
Since there is no fundamental lower limit, and negative energy is not defined, we can say that zero-energy computing is the limit. Doing the computation by interferometry of beams passing through passive optics requires no energy. At last, we will have an example of how the choice of medium can be critical: that kind of zero-energy computation is not possible for electronics. Early versions of zero-energy gates have already been described and made [28,29]. That work
might not have been undertaken except for the “permission” for zero-energy systems granted by thermodynamics.
Bibliography

Primary Literature
1. Weaver W (1948) Science and complexity. Am Sci 36:536–541
2. Nicolis G, Prigogine I (1977) Self-organization in nonequilibrium systems. Wiley, New York
3. Prigogine I, Stengers I (1984) Order out of chaos. Bantam Books, New York
4. Prigogine I, Stengers I, Toffler A (1986) Order out of chaos: Man's new dialogue with nature. Flamingo, London
5. Nyce JM (1996) Guest editor's introduction. IEEE Ann Hist Comput 18:3–4
6. Orponen P (1997) A survey of continuous-time computation theory. In: Du DZ, Ko KI (eds) Advances in algorithms, languages, and complexity. Kluwer, Dordrecht, pp 209–224
7. Pour-El MB, Richards JI (1989) Computability in analysis and physics. Springer, Berlin
8. Grantham W, Amitm A (1990) Discretization chaos – feedback control and transition to chaos. In: Control and dynamic systems, vol 34: Advances in control mechanics, pt 1. Academic Press, San Diego, pp 205–277
9. Herbst BM, Ablowitz MJ (1989) Numerically induced chaos in the nonlinear Schrödinger equation. Phys Rev Lett 62:2065–2068
10. Ogorzalek MJ (1997) Chaos and complexity in nonlinear electronic circuits. World Sci Ser Nonlinear Sci Ser A 22
11. Nielsen MA, Chuang IL (2000) Quantum computation and quantum information. Cambridge University Press, New York
12. Bouwmeester D, Ekert A, Zeilinger A (2001) The physics of quantum information. Springer, Berlin
13. Caulfield HJ, Qian L (2006) The other kind of quantum computing. Int J Unconv Comput 2(3):281–290
14. Caulfield HJ, Vikram CS, Zavalin A (2006) Optical logic redux. Opt Int J Light Electron Opt 117:199–209
15. Caulfield HJ, Kukhtarev N, Kukhtareva T, Schamschula MP, Banarjee P (1999) One, two, and three-beam optical chaos and self-organization effects in photorefractive materials. Mat Res Innov 3:194–199
16. Caulfield HJ, Brasher JD, Hester CF (1991) Complexity issues in optical computing. Opt Comput Process 1:109–113
17. Caulfield HJ (1992) Space–time complexity in optical computing. Multidimens Syst Signal Process 2:373–378
18. Shamir J, Caulfield HJ, Crowe DG (1991) Role of photon statistics in energy-efficient optical computers. Appl Opt 30:3697–3701
19. Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4598):671–680
20. Cerny V (1985) A thermodynamical approach to the travelling salesman problem: an efficient simulation algorithm. J Optim Theory Appl 45:41–51
21. Das A, Chakrabarti BK (eds) (2005) Quantum annealing and related optimization methods. Lecture Notes in Physics, vol 679. Springer, Heidelberg
22. Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad Sci USA 79:2554–2558
23. Hinton GE, Sejnowski TJ (1986) Learning and relearning in Boltzmann machines. In: Rumelhart DE, McClelland JL, the PDP Research Group (eds) Parallel distributed processing: explorations in the microstructure of cognition, vol 1: Foundations. MIT Press, Cambridge, pp 282–317
24. Benzi R, Parisi G, Sutera A, Vulpiani A (1983) A theory of stochastic resonance in climatic change. SIAM J Appl Math 43:565–578
25. Ilachinski A (2001) Cellular automata: a discrete universe. World Scientific, Singapore
26. Wolfram S (1983) Statistical mechanics of cellular automata. Rev Mod Phys 55:601–644
27. Kupiec SA, Caulfield HJ (1991) Massively parallel optical PLA. Int J Opt Comput 2:49–62
28. Caulfield HJ, Soref RA, Qian L, Zavalin A, Hardy J (2007) Generalized optical logic elements – GOLEs. Opt Commun 271:365–376
29. Caulfield HJ, Soref RA, Vikram CS (2007) Universal reconfigurable optical logic with silicon-on-insulator resonant structures. Photonics Nanostruct 5:14–20
Books and Reviews
Bennett CH (1982) The thermodynamics of computation – a review. Int J Theor Phys 21:905–940
Bernard W, Callen HB (1959) Irreversible thermodynamics of nonlinear processes and noise in driven systems. Rev Mod Phys 31:1017–1044
Bub J (2002) Maxwell's demon and the thermodynamics of computation. arXiv:quant-ph/0203017
Casti JL (1992) Reality rules. Wiley, New York
Deutsch D (1985) Quantum theory, the Church–Turing principle and the universal quantum computer. Proc Roy Soc Lond Ser A Math Phys Sci 400:97–117
Karplus WJ, Soroka WW (1958) Analog methods: computation and simulation. McGraw-Hill, New York
Leff HS, Rex AF (eds) (1990) Maxwell's demon: entropy, information, computing. Princeton University Press, Princeton
Zurek WH (1989) Algorithmic randomness and physical entropy. Phys Rev A 40:4731–4751
Tiling Problem and Undecidability in Cellular Automata
JARKKO KARI
Department of Mathematics, University of Turku, Turku, Finland

Article Outline
Glossary
Definition of the Subject
Introduction
The Tiling Problem and Its Variants
Undecidability in Cellular Automata
Future Directions
Acknowledgments
Bibliography

Glossary

Cellular automata (CA) A d-dimensional cellular automaton consists of an infinite d-dimensional grid of cells, indexed by Z^d. Each cell stores an element of a finite state set S. A configuration c : Z^d → S specifies the states of all cells. The set of all configurations is S^(Z^d). The neighborhood vector N = (n1, n2, …, nm) is a sequence of m distinct elements of Z^d specifying the relative locations of the neighbors of the cells: a cell located at x ∈ Z^d has m neighbors, in positions x + n1, x + n2, …, x + nm. Finally, the local update rule f : S^m → S specifies the new state of a cell, based on the old states of its neighbors. In one step, configuration c is transformed into configuration e where, for all x ∈ Z^d,

e(x) = f(c(x + n1), c(x + n2), …, c(x + nm)).

The mapping c ↦ e is the global transition function, or the CA function, G : S^(Z^d) → S^(Z^d), specified by the CA A = (d, S, N, f).

Tiles A Wang tile is a unit square tile with colored edges. Tiles have an orientation, i.e. they may not be rotated or reflected. The colors give a local matching rule that specifies which tiles may be placed next to each other: two adjacent tiles must have identical colors on the abutting edges. A Wang tile set consists of a finite number of Wang tiles. A more general definition: a d-dimensional tile set is a quadruple T = (d, T, N, R), where T is a finite set whose elements are called tiles, N = (n1, n2, …, nm) is a neighborhood vector of m distinct elements of Z^d, and R ⊆ T^m is a relation of allowed patterns. The
neighborhood vector has the same interpretation as in the definition of cellular automata: it gives the relative locations of the neighbors of cells.

Tiling A covering of the plane using tiles. A valid Wang tiling by a Wang tile set T is an assignment t : Z^2 → T of tiles to cells such that the local matching rule is satisfied between all adjacent tiles. We say that T admits tiling t. More general definition: a d-dimensional tiling using tile set T = (d, T, N, R) is a mapping t : Z^d → T. Tiling t is valid at cell x ∈ Z^d if

(t(x + n_1), t(x + n_2), ..., t(x + n_m)) ∈ R.

Tiling t is called valid if it is valid at every cell x ∈ Z^d.

Periodic tiling A tiling that is invariant under some nonzero translation. A two-dimensional tiling is called totally periodic if it is invariant under two linearly independent translations. A totally periodic tiling is automatically periodic in the horizontal and vertical directions, which means that it consists of a rectangular pattern that is repeated horizontally and vertically to fill the plane. A two-dimensional tile set that admits a valid periodic tiling automatically admits also a totally periodic tiling. More generally, a d-dimensional tiling t : Z^d → T is periodic with period p ∈ Z^d, p ≠ 0, if t(x) = t(x + p) for all x ∈ Z^d. It is totally periodic if it is periodic with d linearly independent periods p_1, p_2, ..., p_d. Note that when d > 2 it is possible that a tile set admits a periodic tiling but does not admit any totally periodic tiling.

Aperiodic tile set A two-dimensional tile set that admits a valid tiling of the plane, but does not admit any valid periodic tilings. The smallest known aperiodic set of Wang tiles contains 13 tiles [3,13].

Turing machine (TM) Turing machines are computation devices commonly used to formally define the concept of an algorithm. They also provide us with the most basic undecidable decision problems.
A Turing machine consists of a finite state control unit that moves along an infinite tape. The tape has symbols written in cells that are indexed by Z. Depending on the state of the control unit and the symbol currently scanned on the tape, the machine may overwrite the tape symbol, change the internal state and move along the tape one cell to the left or right. We formally define a Turing machine as a 6-tuple M = (Q, Γ, δ, q_0, q_h, b) where Q and Γ are finite sets (the state alphabet and the tape alphabet, respectively), q_0, q_h ∈ Q are the initial and the halting states, respectively, b ∈ Γ is the blank symbol and δ : Q × Γ → Q × Γ × {−1, +1}
is the transition function that specifies the moves of the machine. A configuration (or instantaneous description) of the machine is a triplet (q, i, t) where q ∈ Q is the current state, i ∈ Z is the position of the machine on the tape and t : Z → Γ describes the content of the tape. In one time step configuration (q, i, t) becomes (q', i + d, t') if δ(q, t(i)) = (q', γ, d) and t'(i) = γ and t'(j) = t(j) for all j ≠ i. We denote this move by

(q, i, t) ⊢ (q', i + d, t').

The reflexive, transitive closure of ⊢ is denoted by ⊢*, that is, (q, i, t) ⊢* (q', i', t') if and only if (q', i', t') can be reached from (q, i, t) by executing zero or more Turing machine moves.

Decision problem A decision problem is an algorithmic question with a yes/no answer. The problem has an input (called the instance of the problem) and a well-defined answer "yes" or "no" associated to each instance. The halting problem TURING MACHINE HALTING ON BLANK TAPE is the decision problem whose input is a Turing machine M = (Q, Γ, δ, q_0, q_h, b) and whose answer is positive if and only if the Turing machine eventually enters its halting state q_h when started in the initial state q_0 on a totally blank tape, i.e. initially every tape location has the blank symbol b.

(Un)decidability Some decision problems cannot be solved by any algorithm. Such problems are called undecidable. In contrast, decidable decision problems are solved by some algorithm. An example of an undecidable problem is TURING MACHINE HALTING ON BLANK TAPE.

Semi-algorithm An algorithm-like procedure for a decision problem that correctly returns a positive answer on positive input instances, but on negative instances runs forever without ever returning an answer.

Semi-decidability A decision problem is called semi-decidable if there is a semi-algorithm for it.
For example, the decision problem TURING MACHINE HALTING ON BLANK TAPE is semi-decidable since one can simulate any given Turing machine step-by-step until (if ever) it halts.

Recursive and recursively enumerable (re) A formal language L is called recursive if it is decidable whether a given word belongs to L. The language is called recursively enumerable (re for short) if this membership problem is semi-decidable.
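The step-by-step simulation behind this semi-algorithm can be sketched in code. The following is an illustrative sketch, not part of the article; the dictionary encoding of δ and the optional step bound are assumptions of the sketch (a true semi-algorithm would run without the bound, never answering on non-halting machines).

```python
def halts_on_blank(delta, q0, qh, blank, max_steps=None):
    """Semi-algorithm: simulate M step by step on an initially blank tape.
    Returns True if the halting state is reached; on non-halting machines
    it would run forever (max_steps optionally bounds the simulation)."""
    tape = {}              # sparse tape: unwritten cells hold the blank symbol
    q, i, steps = q0, 0, 0
    while q != qh:
        if max_steps is not None and steps >= max_steps:
            return None    # gave up: halting is undecidable, so no verdict
        sym = tape.get(i, blank)
        q, write, d = delta[(q, sym)]  # delta: (state, symbol) -> (state, symbol, ±1)
        tape[i] = write
        i += d
        steps += 1
    return True

# A toy machine that writes two 1s and halts: q0 -> q1 -> qh, moving right.
delta = {("q0", "b"): ("q1", "1", 1), ("q1", "b"): ("qh", "1", 1)}
assert halts_on_blank(delta, "q0", "qh", "b") is True
```
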
The tiling problem The decision problem that gets as input a tile set T, and asks whether there exists a valid tiling by T. The tiling problem was proved undecidable for Wang tiles by R. Berger [2]. Its complement (i.e., "Does there not exist a valid tiling?") is semi-decidable.

The tiling problem with a seed tile The decision problem that gets as input a tile set T and one tile s, and asks whether T admits a tiling that contains tile s at least once. The problem is undecidable for Wang tiles, but its complement is semi-decidable [24].

The periodic tiling problem The decision problem to determine if a given set of Wang tiles admits a periodic tiling. The problem is undecidable, but it is semi-decidable [7].

The finite tiling problem A decision problem where we are given a set T of Wang tiles and a specific blank tile B ∈ T, all four of whose edges are colored with the same color. A tiling t : Z^2 → T is called finite if {x ∈ Z^2 | t(x) ≠ B} is a finite set. If t(x) = B for all x ∈ Z^2 then t is called trivial. The finite tiling problem asks whether there exist non-trivial valid finite tilings. The problem is undecidable but semi-decidable.

Surjective cellular automata A cellular automaton (CA) is called surjective if every configuration has a pre-image, that is, if its global transition function G : S^(Z^d) → S^(Z^d) is surjective.

Injective (reversible) CA A CA is injective if every configuration has at most one pre-image, that is, the global transition function is one-to-one. It is well known that a CA is injective if and only if it is bijective (every configuration has a unique pre-image), which in turn is equivalent to reversibility (= there exists an inverse CA that traces the CA back in time).

Limit set The limit set of a CA is its maximal attractor. In other words, it is the compact and translation invariant set

⋂_{i=0}^{∞} G^i(S^(Z^d))

where G is the global transition function and S is the state set.

Nilpotent cellular automata Nilpotent CA have trivial dynamics. A CA is called nilpotent if its limit set contains only one configuration. The unique element of the limit set is the quiescent configuration. This is equivalent to every initial configuration eventually becoming the quiescent configuration.
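To make the CA definitions above concrete, here is a minimal sketch (not from the article) of one step of a one-dimensional CA, computing e(x) = f(c(x + n_1), ..., c(x + n_m)); the cyclic finite window is an assumption of the sketch, standing in for an infinite configuration c : Z → S.

```python
def ca_step(config, neighborhood, f):
    """One step of a 1-dimensional CA on a cyclic finite window:
    the new state of cell x is f applied to the states of its
    neighbors at offsets given by the neighborhood vector."""
    n = len(config)
    return [f(tuple(config[(x + v) % n] for v in neighborhood))
            for x in range(n)]

# Example: a XOR rule with neighborhood vector (-1, 1).
step = lambda c: ca_step(c, (-1, 1), lambda nb: nb[0] ^ nb[1])
assert step([0, 0, 1, 0, 0, 0]) == [0, 1, 0, 1, 0, 0]
```
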
Equicontinuous CA Cellular automaton G is called equicontinuous if for every finite A ⊆ Z^d there exists a finite B ⊆ Z^d such that any two initial configurations that agree inside B will agree inside A for all subsequent steps. In other words,

∀x ∈ B : c(x) = e(x) ⇒ ∀t ∈ N and ∀x ∈ A : G^t(c)(x) = G^t(e)(x).

This means that equicontinuous CA can be reliably simulated in finite windows. It is known that a CA is equicontinuous if and only if it is ultimately periodic [17]: ∃n, p ∈ N : G^n = G^(n+p). In this sense the dynamics of equicontinuous CA is trivial.

Sensitive CA Cellular automaton G is called sensitive to initial conditions if there exists a finite set B ⊆ Z^d of cells such that for every configuration c and every finite set A ⊆ Z^d of cells there exists a configuration e and a time t ≥ 0 such that e(x) = c(x) for all x ∈ A but G^t(e)(x) ≠ G^t(c)(x) for some x ∈ B. This means that arbitrarily distant modifications to any configuration c may propagate to a fixed observation window B.

Topological entropy The topological entropy h(G) of a CA G measures the complexity of its dynamics. For any finite A ⊆ Z^d and positive integer n we define the equivalence relation ≡_{A,n} among initial configurations as follows: for all c, e ∈ S^(Z^d),

c ≡_{A,n} e ⇔ ∀x ∈ A, 0 ≤ t < n : G^t(c)(x) = G^t(e)(x).

In other words, two configurations are equivalent if we cannot observe any difference in their orbits in region A within the first n time instances. Let us denote by N_G(A, n) the number of equivalence classes of ≡_{A,n}. Then the topological entropy is

h(G) = sup_A lim_{n→∞} (log N_G(A, n)) / n

where the supremum is over all finite A ⊆ Z^d. The entropy always exists. In the one-dimensional case the entropy is always a finite, non-negative number. If d ≥ 2 then the entropy can also be infinite.

Definition of the Subject

We consider the following algorithmic questions concerning cellular automata. All problems are decision problems, that is, the answer for each input instance is either yes or no. All problems considered are undecidable, i.e. no algorithm can solve them. We only consider problems whose
undecidability is proved using a reduction from the tiling problem or its variants.

INJECTIVITY
Input: Cellular automaton A.
Question: Is A injective (i.e. reversible)?

There is an algorithm that solves INJECTIVITY for one-dimensional CA [1]. But the problem is undecidable among two-dimensional CA [9,11]. The problem is semi-decidable in any dimension.

SURJECTIVITY
Input: Cellular automaton A.
Question: Is A surjective?

Also SURJECTIVITY is decidable among one-dimensional CA. The two-dimensional question is, however, undecidable [11]. The complement of the problem (i.e. non-surjectivity) is semi-decidable in any dimension.

NILPOTENCY
Input: Cellular automaton A.
Question: Is A nilpotent?

Nilpotency is undecidable even among one-dimensional CA [10]. It is undecidable even if A has a spreading state, i.e. a state q such that any cell whose neighborhood contains q becomes q. However, NILPOTENCY is semi-decidable in any dimension. Based on the problem NILPOTENCY we can prove a Rice's theorem for cellular automata limit sets: any non-trivial decision problem concerning limit sets is undecidable [12].

TOPOLOGICAL ENTROPY
Input: Cellular automaton A.
Question: Is the topological entropy of A less than a constant c > 0?

Problem TOPOLOGICAL ENTROPY is undecidable for every constant c > 0, even in the one-dimensional case. This can be proved using a direct reduction from NILPOTENCY [8]. Also, direct reductions from NILPOTENCY prove the undecidability of the following two problems [5,14]:

EQUICONTINUITY
Input: Cellular automaton A.
Question: Is A equicontinuous?

SENSITIVITY TO INITIAL CONDITIONS
Input: Cellular automaton A.
Question: Is A sensitive to initial conditions?
Introduction

Several decision problems concerning cellular automata are known to be undecidable, that is, no algorithm exists that solves them. Some undecidability results easily follow from the universal computation capabilities of cellular automata, while others require more elaborate proofs. Reductions from the tiling problem and its variants turn out to be useful in proving various questions concerning CA undecidable. We consider the problems of determining whether a given two-dimensional CA is surjective or injective, whether a one-dimensional CA is nilpotent or equicontinuous, and whether its topological entropy is less than some constant.

The fact that the tiling problem is closely related to cellular automata is not surprising considering their apparent similarity: both involve assignments of symbols over a finite alphabet onto integer lattice points. The difference is that tilings are static while cellular automata change the assignments dynamically according to the local rule. Undecidability of the tiling problem on the two-dimensional plane naturally leads to undecidability results concerning single-step properties of two- and higher-dimensional cellular automata. But also asymptotic properties of one-dimensional cellular automata can be related to tiling problems by viewing the space-time diagram of the CA as a tiling. This naturally leads to the definition of deterministic tile sets: Wang tiles where tiles are uniquely determined by some of their neighbors.

We start by discussing the tiling problem and its variants. We do not prove the undecidability of all the variants; rather, literature references for the proofs are provided. We then define a particular tile set that has an interesting plane-filling property. This tile set is a useful tool in the reduction to prove that it is undecidable to tell whether a given two-dimensional CA is injective (reversible). We then provide reductions that show that several questions concerning cellular automata are undecidable.
The Tiling Problem and Its Variants

In this section we discuss the tiling problem and several of its variants.

Definition of Tiles

For our purposes it is convenient to define tiles in a way that most closely resembles cellular automata. In the d-dimensional cellular space the cells are indexed by Z^d. A neighborhood vector

N = (n_1, n_2, ..., n_m)
consists of m distinct elements n_i ∈ Z^d. Each n_i specifies the relative location of a neighbor of each cell. More precisely, the i-th neighbor of the cell in position x ∈ Z^d is located at x + n_i. A tile set is a finite set T whose elements are called tiles. A local matching rule tells which patterns of tiles are allowed in valid tilings. The matching rule is given as an m-ary relation R ⊆ T^m where m is the size of the neighborhood. Tilings are assignments t : Z^d → T of tiles into cells. Tiling t is valid at x ∈ Z^d if

(t(x + n_1), t(x + n_2), ..., t(x + n_m)) ∈ R.

Tiling t is called valid if it is valid at every position x ∈ Z^d.

A convenient – and historically earlier – way of defining tiles is in terms of edge labelings. A Wang tile is a two-dimensional unit square with colored edges. The local matching rule is determined by these colors: a tiling is valid at position x ∈ Z^2 if and only if each of the four edges of the tile in position x has the same color as the abutting edge in the adjacent tile. Clearly this is a two-dimensional tile set with the neighborhood vector ((1, 0), (−1, 0), (0, 1), (0, −1)) and a particular way of defining the local relation R.

Computations and Tilings

The basic observation in establishing undecidability results concerning tilings is the fact that valid tilings can be forced to contain a complete simulation of a computation by a given Turing machine. To any given Turing machine M = (Q, Γ, δ, q_0, q_h, b) we associate the Wang tiles shown in Fig. 1, and we call these tiles the machine tiles of M. Note that in the illustrations, instead of colors, we use labeled arrows on the sides of the tiles. Two adjacent tiles match if and only if an arrow head meets an arrow tail with the same label. Such an arrow representation can be converted into the usual coloring representation of Wang tiles by assigning to each arrow direction and label a unique color.
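The edge-matching rule for Wang tiles can be made concrete with a small sketch (an illustration, not the article's formalism): here a Wang tile is encoded as a tuple of edge colors (north, east, south, west), and we check validity of a finite partial tiling at one cell.

```python
# A Wang tile as a tuple of edge colors (north, east, south, west);
# this encoding is an assumption of the sketch, not taken from the article.
def valid_at(tiling, x, y):
    """Check the edge-matching rule at cell (x, y) of a finite dict-based
    tiling: abutting edges of adjacent tiles must carry identical colors."""
    n, e, s, w = tiling[(x, y)]
    checks = [
        ((x + 1, y), lambda t: t[3] == e),  # east neighbor's west edge
        ((x - 1, y), lambda t: t[1] == w),  # west neighbor's east edge
        ((x, y + 1), lambda t: t[2] == n),  # north neighbor's south edge
        ((x, y - 1), lambda t: t[0] == s),  # south neighbor's north edge
    ]
    return all(ok(tiling[p]) for p, ok in checks if p in tiling)

a = ("r", "g", "r", "g")   # a toy tile that tiles the plane periodically
tiling = {(0, 0): a, (1, 0): a}
assert valid_at(tiling, 0, 0) and valid_at(tiling, 1, 0)
```
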
The machine tiles of M contain the following tiles:

(i) For every tape letter a ∈ Γ a tape tile of Fig. 1a,
(ii) For every tape letter a ∈ Γ and every state q ∈ Q an action tile of Fig. 1b or c. Tile (b) is used if δ(q, a) = (q', a', −1) and tile (c) is used if δ(q, a) = (q', a', +1).
Tiling Problem and Undecidability in Cellular Automata, Figure 1 Machine tiles associated to a Turing machine
(iii) For every tape letter a ∈ Γ and every non-halting state q ∈ Q \ {q_h} two merging tiles shown in Fig. 1d.

The idea of the tiles is that a configuration of the Turing machine M is represented as a row of tiles in such a way that the cell currently scanned by M is represented by an action tile, the neighbor into which the machine moves has a merging tile, and all other tiles on the row are tape tiles. If this row is part of a valid tiling then it is clear that the rows above must be similar representations of subsequent configurations in the Turing machine computation, until the machine halts. The machine tiles above are the basic tiles associated to Turing machine M. Additional tiles will be added depending on the actual variant of the tiling problem.

The Tiling Problem

The tiling problem is the decision problem of determining if at least one valid tiling is admitted by the given set of tiles.

TILING PROBLEM
Input: Tile set T.
Question: Does T admit a valid tiling?

The tiling problem is easily seen to be decidable if the input is restricted to one-dimensional tile sets. It is a classical result by R. Berger that the tiling problem of two-dimensional tiles is undecidable, even if the input consists of Wang tiles [2,21]:

Theorem 1 TILING PROBLEM is undecidable for Wang tile sets T. The complement problem (non-existence of valid tilings) is semi-decidable.

We do not prove this result here. The undecidability proofs in [2,21] are based on an explicit construction of an aperiodic tile set such that additional tiles implementing Turing machine simulations can be embedded in valid tilings. The aperiodic set is needed to force the presence of tiles that initiate Turing machine simulation in arbitrarily large regions.

Note that semi-decidability of the complement problem is apparent: a semi-algorithm simply tries to tile larger and larger regions until (if ever) a region is found that cannot be properly tiled. Note also that a semi-algorithm exists for those tile sets that admit a valid, totally periodic tiling: all totally periodic tilings can be effectively enumerated and it is a simple matter to test each for validity of the tiling constraint. Combining the two semi-algorithms above yields a semi-algorithm that correctly identifies tile sets that (i) do not admit any valid tiling, or (ii) admit a valid periodic tiling. Only aperiodic tile sets fail to satisfy either (i) or (ii), so we see that the existence of aperiodic tile sets is implied by Theorem 1.

In the following sections we consider some variants of the tiling problem whose undecidability is easier to establish.

Variants of the Tiling Problem

TILING PROBLEM WITH A SEED TILE
Input: Tile set T and one tile s.
Question: Does T admit a valid tiling such that tile s is used at least once?

The seeded version was shown undecidable by H. Wang [24]. We present the proof here because it is quite simple and shows the general idea of how the Turing machine halting problem can be reduced to problems concerning tiles:

Theorem 2 TILING PROBLEM WITH A SEED TILE is undecidable for Wang tile sets. The complement problem is semi-decidable.

Proof The semi-decidability of the complement problem follows from the following semi-algorithm: for r = 1, 2, 3, ..., try all tilings of the radius-r square around the origin to see if there is a valid tiling of the square such that the origin contains the seed tile s. If for some r such
Tiling Problem and Undecidability in Cellular Automata, Figure 2 a the blank tile, and b three initialization tiles
a tiling is not found then halt and report that there is no tiling containing the seed tile.

Consider then undecidability. We reduce the decision problem TURING MACHINE HALTING ON BLANK TAPE, a problem that is well known to be undecidable. For any given Turing machine M we can effectively construct a tile set and a seed tile in such a way that they form a positive instance of TILING PROBLEM WITH A SEED TILE if and only if M is a negative instance of TURING MACHINE HALTING ON BLANK TAPE. For the given Turing machine M we construct the machine tiles of Fig. 1 as well as the four tiles shown in Fig. 2. These are the blank tile and three initialization tiles. They initialize all tape symbols to be equal to the blank b, and the Turing machine to be in the initial state q_0. The middle initialization tile is chosen as the seed tile s.

Let us prove that a valid tiling containing a copy of the seed tile exists if and only if the Turing machine M does not halt when started on the blank tape:

"⇐": Suppose that the Turing machine M does not halt on the blank tape. Then a valid tiling exists where one horizontal row is formed with the initialization tiles, all tiles below this row are blank, and the rows above the initialization row contain consecutive configurations of the Turing machine.

"⇒": Suppose that a valid tiling containing the middle initialization tile exists. The seed tile forces its row to be formed by the initialization tiles, representing the initial configuration of the Turing machine on the blank tape. The machine tiles force the horizontal rows above the seed row to contain the consecutive configurations of the Turing machine. There is no merging tile containing a halting state, so the Turing machine does not halt – otherwise a valid tiling could not be formed.

Conclusion: Suppose we had an algorithm that solves TILING PROBLEM WITH A SEED TILE.
Then we also have an algorithm (which simply constructs the tile set as above and determines if a tiling with the seed tile exists) that solves TURING MACHINE HALTING ON BLANK TAPE. This contradicts the fact that this problem is known to be undecidable.
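The exhaustive square-tiling search underlying the semi-algorithm in this proof can be sketched as follows (an illustration, not the article's formulation): Wang tiles are encoded as (north, east, south, west) color tuples, and only edges inside the square are checked. A true semi-algorithm would run this for r = 1, 2, 3, ... without bound.

```python
from itertools import product

def square_tilable_with_seed(tiles, seed, r):
    """Try all assignments of Wang tiles, as (n, e, s, w) color tuples,
    to the (2r+1) x (2r+1) square around the origin with the seed tile
    at the origin. Returns True iff some assignment is edge-consistent.
    Exhaustive and exponential: a sketch, usable only for tiny inputs."""
    cells = [(x, y) for x in range(-r, r + 1) for y in range(-r, r + 1)]
    for choice in product(tiles, repeat=len(cells)):
        t = dict(zip(cells, choice))
        if t[(0, 0)] != seed:
            continue
        ok = all(
            t[(x + 1, y)][3] == t[(x, y)][1]   # west edge meets east edge
            for (x, y) in cells if (x + 1, y) in t
        ) and all(
            t[(x, y + 1)][2] == t[(x, y)][0]   # south edge meets north edge
            for (x, y) in cells if (x, y + 1) in t
        )
        if ok:
            return True
    return False

a = ("r", "g", "r", "g")   # a tile matching itself in both directions
assert square_tilable_with_seed([a], a, 1) is True
```
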
In the following tiling problem variant we are given a Wang tile set T and specify one tile B ∈ T as the blank tile. The blank tile has all four sides colored by the same color. A finite tiling is a tiling where only a finite number of tiles are non-blank. A finite tiling where all tiles are blank is called trivial.

FINITE TILING PROBLEM
Instance: A finite set T of Wang tiles and a blank tile B ∈ T.
Problem: Does there exist a valid finite tiling that is not trivial?

Theorem 3 The FINITE TILING PROBLEM is undecidable. It is semi-decidable while its complement is not semi-decidable.

Proof For semi-decidability notice that we can try all valid tilings of larger and larger squares until we find a tiling of a square where all tiles on the boundary are blank, while some interior tile is different from the blank tile. If such a tiling is found then the semi-algorithm halts, indicating that a valid, finite, non-trivial tiling exists.

To prove the undecidability we reduce the problem TURING MACHINE HALTING ON BLANK TAPE. For any given Turing machine M we construct the machine tiles of Fig. 1 as well as the blank tile, the boundary tiles and the halting tiles shown in Fig. 3. The halting tiles of Fig. 3b are constructed for all tape letters a ∈ Γ and the halting state q_h. The purpose of the halting tiles is to erase the Turing machine from the configuration once it halts. The lower border tiles of Fig. 3c initialize the configuration to consist of the blank tape symbol b and the initial state q_0. The top border tiles are made for every tape symbol a ∈ Γ. They allow the absorption of the configuration as long as the Turing machine has been erased. The border tiles on the sides are labeled with symbols L and R to identify the left and the right border of the computation area.

Let us prove that the tile set admits a valid, finite, non-trivial tiling if and only if the Turing machine halts on the empty tape.
Tiling Problem and Undecidability in Cellular Automata, Figure 3 a the blank tile B, b halting tiles, and c border tiles
"⇐": Suppose that the Turing machine halts on the blank tape. Then a tiling exists where the boundary tiles isolate a finite portion of the plane (a "board") for the simulation of the Turing machine, the bottom tiles of the board initialize the Turing machine on the blank tape, and inside the board the Turing machine is simulated until it halts. After halting only tape tiles are used until they are absorbed by the topmost row of the board. If the board is made sufficiently large the entire computation fits inside the board, so the tiling is valid. All tiles outside the board are blank so the tiling is finite.

"⇒": Suppose then that a finite, non-trivial tiling exists. The only non-blank tiles with a blank bottom edge are the lower border tiles of Fig. 3c, so the tiling must contain a lower border tile. Horizontal neighbors of lower border tiles are lower border tiles, so we see that the only way to have a finite tiling is to have a contiguous lower border that ends at both sides in a corner tile where the border turns upwards. The vertical borders must again – due to the finiteness of the tiling – end at corners where the top border starts. All in all we see that the boundary tiles are forced to form a rectangular board. The lower boundary of the board initializes the Turing machine configuration on the blank tape, and the rows above it are forced by the machine tiles to simulate consecutive configurations of the Turing machine. Because the Turing machine state symbol is not allowed to touch the sides or the upper boundary of the board, the Turing machine must be erased by a halting tile, i.e. the Turing machine must halt.
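The semi-algorithm from the first half of this proof can likewise be sketched as a bounded search (an illustration, not the article's formulation; tiles are again assumed to be (north, east, south, west) color tuples, and a true semi-algorithm would run this for k = 1, 2, 3, ... without bound):

```python
from itertools import product

def nontrivial_finite_tiling_in_square(tiles, blank, k):
    """Search all k x k Wang tilings whose boundary cells are blank but
    which are not entirely blank, and which satisfy the edge-matching
    rule inside the square. Cells outside the square are implicitly
    blank, and blank matches blank, so such a tiling extends to a valid,
    finite, non-trivial tiling of the whole plane."""
    cells = [(x, y) for x in range(k) for y in range(k)]
    on_border = lambda x, y: x in (0, k - 1) or y in (0, k - 1)
    for choice in product(tiles, repeat=len(cells)):
        t = dict(zip(cells, choice))
        if any(t[c] != blank for c in cells if on_border(*c)):
            continue                       # boundary must be blank
        if all(t[c] == blank for c in cells):
            continue                       # skip the trivial tiling
        if all(t[(x + 1, y)][3] == t[(x, y)][1]
               for (x, y) in cells if (x + 1, y) in t) and \
           all(t[(x, y + 1)][2] == t[(x, y)][0]
               for (x, y) in cells if (x, y + 1) in t):
            return True
    return False

blank = ("0", "0", "0", "0")
# With only the blank tile available, no non-trivial finite tiling exists.
assert nontrivial_finite_tiling_in_square([blank], blank, 3) is False
```
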
The third variation of the tiling problem we consider is the PERIODIC TILING PROBLEM where we ask whether a given set of tiles admits a valid periodic tiling.

PERIODIC TILING PROBLEM
Input: Tile set T.
Question: Does T admit a valid periodic tiling?

Theorem 4 The PERIODIC TILING PROBLEM is undecidable for Wang tile sets. It is semi-decidable while its complement is not semi-decidable.

For a proof, see [7].

Deterministic Tiles

The tiling problem of one-dimensional tiles is decidable. However, tiles can provide undecidability results for one-dimensional CA when we use the trick that we view space-time diagrams as two-dimensional tilings. But not every tiling can be a space-time diagram of a CA: the tiling must be locally deterministic in the direction that corresponds to time. This leads to the consideration of determinism in Wang tiles.

Consider a set T of Wang tiles, i.e. squares with colored edges. We say that T is NW-deterministic if for all a, b ∈ T, a ≠ b, either the upper (= northern) edges of a and b or the left (= western) edges of a and b have different colors. See Fig. 4a for an illustration. Consider now a valid tiling of the plane by NW-deterministic tiles. Each tile is uniquely determined by its left and upper neighbor. Then the tiles on each diagonal in the NE–SW direction locally determine the tiles on the next diagonal below it. If we interpret these diagonals as configurations of a CA then there is a local rule such that valid tilings are space-time diagrams of the CA, see Fig. 4b.

Tiling Problem and Undecidability in Cellular Automata, Figure 4 NW-deterministic sets of Wang tiles: a there is at most one matching tile z for any x and y, b diagonals of NW-deterministic tilings interpreted as configurations of one-dimensional CA

We define analogously NE-, SW- and SE-deterministic tile sets. Finally, we call a tile set 4-way deterministic if it is deterministic in all four directions simultaneously. The tiling problem is undecidable among NW-deterministic tile sets [10], even among 4-way deterministic tile sets [18]:

Theorem 5 The decision problem TILING PROBLEM is undecidable among 4-way deterministic sets of Wang tiles.

As discussed at the end of Sect. "The Tiling Problem", the theorem also means that 4-way deterministic aperiodic tile sets exist. In fact, the proof of Theorem 5 in [18] uses one such aperiodic set that was reported in [16].

Plane Filling Directed Tiles

A d-dimensional directed tile is a tile that is associated with a follower vector f ∈ Z^d. Let T = (d, T, N, R) be a tile set, and let F : T → Z^d be a function that assigns to tiles their follower vectors. We call D = (d, T, N, R, F) a set of directed tiles. Let t ∈ T^(Z^d) be an assignment of tiles to cells. For every p ∈ Z^d we call p + F(t(p)) the follower of p in t. In other words, the follower of p is the cell whose position relative to p is given by the follower vector of the tile in cell p. A sequence p_1, p_2, ..., p_k where all p_i ∈ Z^d is a (finite) path in t if

p_{i+1} = p_i + F(t(p_i))

for all 1 ≤ i < k. In other words, a path is a sequence of cells such that the next cell is always the follower of the previous cell. One-way infinite and two-way infinite paths are defined analogously. In the following we only discuss the two-dimensional case (d = 2), and the follower of each tile is one of the four adjacent positions:

F(a) ∈ {(±1, 0), (0, ±1)} for all a ∈ T.

In this case the follower is indicated in drawings as a horizontal or vertical arrow over the tile.

A set of two-dimensional directed tiles is said to have the plane-filling property if it satisfies the following two conditions:

(a) There exists t ∈ T^(Z^2) and a one-way infinite path p_1, p_2, p_3, ... such that the tiling in t is valid at p_i for all i = 1, 2, 3, ....
(b) For all t and p_1, p_2, p_3, ... as in (a), there are arbitrarily large n × n squares of cells such that all cells of the squares are on the path.
Intuitively the plane-filling property means that a simple device that moves over the tiling t, repeatedly verifies the correctness of the tiling at its present location and moves on to the follower, necessarily either eventually finds a tiling error or covers arbitrarily large squares. Note that the plane-filling property does not assume that the tiling t is correct everywhere: as long as it is correct along a path, the path must snake through larger and larger squares. Note that conditions (a) and (b) imply that the tile set is aperiodic. There exist tile sets that satisfy the plane-filling property, as proved in [11]:

Theorem 6 There exists a set of directed Wang tiles that has the plane-filling property.

The proof of Theorem 6 in [11] constructs a set of Wang tiles such that a path that does not find any tiling errors is forced to follow the well-known Hilbert curve shown in Fig. 5.

Undecidability in Cellular Automata

Let us begin with one-step properties of two-dimensional CA.

Theorem 7 INJECTIVITY is undecidable among two-dimensional CA. It is semi-decidable in any dimension.

Proof The semi-decidability follows from the fact that injective CA have an inverse CA. One can effectively enumerate all CA and check them one-by-one until (if ever) the inverse CA is found.
identical in c0 and c1 . There is a cell pE1 such that c0 and c1 have different bits at cell pE1 . Since these bits become identical in the next configuration, the D-tiling must be correct at pE1 and c0 and c1 must have different bits in the follower position pE2 . We repeat the reasoning and obtain an infinite sequence of positions pE1 ; pE2 ; pE3 ; : : : such that each pE iC1 is the follower of pE i , and the D tiling is correct at each pE i . It follows from the plane filling property of D that path pE1 ; pE2 ; pE3 ; : : : covers arbitrarily large squares. Also the tiling according to the T-components must be valid at each cell of the path. Hence tile set T admits valid tilings of arbitrarily large squares, and therefore it admits a valid tiling of the entire plane. Tiling Problem and Undecidability in Cellular Automata, Figure 5 Fractions of the plane-filling Hilbert curve through 4 4 and 16 16 squares
Let us next prove INJECTIVITY undecidable by reducing the TILING PROBLEM into it. In the reduction a set D of directed tiles that has the plane filling property is used. The existence of such D was stated in Theorem 6. Let T be a given set of Wang tiles that is an instance of the T ILING PROBLEM. One can effectively construct a twodimensional CA whose state set is S D T D f0; 1g and the local rule updates the bit component of a cell as follows: If either the T- or the D-components contain a tiling error at the cell then the state of the cell is not changed, but if the tilings according to both T- and D-components are valid at the cell then the bit of the follower cell (according to the direction in the D-component) is added to the present bit value modulo 2. The tile components are not changed. Let us prove that this CA G is not injective if and only if T admits a valid tiling. “(H”: Suppose a valid tiling exists. Construct two configurations c0 and c1 where the T- and D-components form 2 2 the same valid tilings t 2 T Z and d 2 DZ , respectively. In c0 all bits are 0 while in c1 they are all 1. Since the tilings are everywhere valid, every cell performs modulo 2 addition of two bits, which means that every bit becomes 0. Hence G(c0 ) D G(c1 ) D c0 , and G is not injective. “H)”: Suppose then that G is not injective. There are two different configurations c0 and c1 such that G(c0 ) D G(c1 ). Tile components are not modified by the CA so they are
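The bit-updating local rule of this reduction can be sketched as follows (a schematic illustration, not the article's construction: the validity of the T- and D-tilings at the cell is abstracted into boolean inputs rather than actual tile comparisons):

```python
def update_bit(t_valid, d_valid, bit, follower_bit):
    """Local rule of the reduction CA, bit component only: the tile
    components never change, and the bit is XORed (addition modulo 2)
    with the follower's bit exactly where both tilings are valid."""
    if t_valid and d_valid:
        return bit ^ follower_bit
    return bit

# If the T- and D-tilings are valid everywhere, the all-0 and all-1 bit
# configurations have the same image: every cell computes b ^ b = 0,
# so the CA is not injective, as in the proof.
n = 8  # a finite window of the (infinite) configuration, for illustration
c0, c1 = [0] * n, [1] * n
image0 = [update_bit(True, True, b, b) for b in c0]
image1 = [update_bit(True, True, b, b) for b in c1]
assert image0 == image1 == [0] * n
```
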
Analogously we can prove the undecidability of SURJECTIVITY. It is convenient to use the well-known Garden-of-Eden theorem of Moore and Myhill to convert the surjectivity property into injectivity on finite configurations:
Theorem 8 (Garden-of-Eden theorem) A cellular automaton is non-surjective if and only if there are two distinct configurations that differ in a finite number of cells and that have the same successor [19,20].
Theorem 9 SURJECTIVITY is undecidable among two-dimensional CA. Its complement is semi-decidable in any dimension.
Proof A semi-algorithm for non-surjectivity enumerates all finite patterns one by one until a pattern is found that cannot appear in G(c) for any configuration c. To prove undecidability we reduce the FINITE TILING PROBLEM, using the set D of 23 directed tiles shown in Fig. 6. These directed tiles are used analogously to the proof of Theorem 7. The topmost tile in Fig. 6 is called blank. All other tiles have a unique incoming and outgoing arrow. In valid tilings arrows and labels must match. The non-blank tiles are considered directed: the follower of a tile is the neighbor pointed to by the outgoing arrow of the tile. Since each non-blank tile has exactly one incoming arrow, it is clear that if the tiling is valid at a tile then the tile is the follower of exactly one of its four neighbors. The tile at the center of Fig. 6, where the dark and light thick horizontal lines meet, is called the cross. It has a special role in the forthcoming proof. A rectangular loop is a valid tiling of a rectangle using tiles in D where the follower path forms a loop that visits every tile of the rectangle, and the outside border of the rectangle is colored blank. See Fig. 7 for an example of a rectangular loop through a rectangle of size 12 × 7. (The edge labels are not shown for the sake of clarity of the figure.) It is easy to see that a rectangular loop of size 2n × m exists for all n ≥ 2
Tiling Problem and Undecidability in Cellular Automata, Figure 6 Tiles used in the proof of the undecidability of SURJECTIVITY
Tiling Problem and Undecidability in Cellular Automata, Figure 7 A rectangular loop of size 12 × 7
and m ≥ 3. Any tile in an even column in the interior of the rectangle can be made to contain the unique cross of the rectangular loop. It is easy to see that the tile set D has the following property:
Finite plane-filling property: Let t ∈ D^{Z²} be a tiling, and p_1, p_2, p_3, … a path in t such that the tiling t is valid at p_i for all i = 1, 2, 3, …. If the path covers only a finite
number of different cells then the cells on the path form a rectangular loop.
Let b and c be the blank and the cross of set D. For any given tile set T with blank tile B we construct the following two-dimensional cellular automaton. The state set S contains the triplets (d, t, x) ∈ D × T × {0, 1}
under the following constraints: if d = c then t ≠ B, and if d = b or d is any tile containing label SW, SE, NW, NE, A, B or C, then t = B. In other words, the cross must be associated with a non-blank tile of T, while the blank of D as well as all the tiles on the boundary of a rectangular loop are forced to be associated with the blank tile of T. The triplet (b, B, 0), where both tile components are blank and the bit is 0, is the quiescent state of the CA. The local rule is as follows: let (d, t, x) be the current state of a cell. If d = b then the state is not changed. If d ≠ b then the cell verifies the validity of the tilings according to both D and T at the cell. If either tile component has a tiling error then the state is not changed. If both tilings are valid then the cell modifies its bit component by adding the bit of its follower modulo 2.
Let us prove that this CA is not surjective if and only if T admits a valid, finite, non-trivial tiling.
"⇐": Suppose a valid, finite, non-trivial tiling t ∈ T^{Z²} exists. Consider a configuration of the CA whose T-components form the valid tiling t and whose D-components form a rectangular loop whose interior covers all non-blank elements of t. Tiles outside the rectangle are all blank and have bit 0. The cross can be positioned so that it is in the same cell as some non-blank tile of t. In such a configuration both the T- and D-tilings are everywhere valid. The CA updates the bits of all tiles in the rectangular loop by performing modulo 2 addition with their followers, while the bits outside the rectangle remain 0. We get two different configurations that have the same image: in c0 all bits in the rectangle are equal to 0 while in c1 they are all equal to 1. The local rule updates the bits so that G(c0) = G(c1) = c0. Configurations c0 and c1 differ only in a finite number of cells, so it follows from the Garden-of-Eden theorem that G is not surjective.
"⇒": Suppose then that the CA is not surjective.
According to the Garden-of-Eden theorem there are two finitely different configurations c0 and c1 such that G(c0) = G(c1). Since only the bit components of states are changed, the tilings in c0 and c1 according to the D- and T-components of the states are identical. There is a cell p_1 such that c0 and c1 have different bits at cell p_1. Since these bits become identical in the next configuration, the D-tiling must be correct at p_1, and c0 and c1 must have different bits in the follower position p_2. We repeat the reasoning and obtain a sequence of positions p_1, p_2, p_3, … such that each p_{i+1} is the follower of p_i, and the D-tiling is correct at each p_i. Moreover, c0 and c1 have different bits in each position
p_i. Because configurations c0 and c1 differ only in a finite number of cells, the path can contain only a finite number of distinct cells. It then follows from the finite plane-filling property of D that the path must form a valid rectangular loop. Also the tiling according to the T-components must be valid at each cell of the path. Because of the constraints on the allowed triplets, the T-components on the boundary of the rectangle are the blank B, while the cross in the interior contains a non-blank element of T. Hence there is a valid tiling of a rectangle according to T that contains a non-blank tile and has a blank boundary. This means that a finite, valid and non-trivial tiling is possible.
Undecidable Properties of One-Dimensional CA
Using deterministic Wang tiles and interpreting space-time diagrams as tilings one obtains undecidability results for the long-term behavior of one-dimensional CA.
Theorem 10 NILPOTENCY is undecidable among one-dimensional CA. It is undecidable even among one-dimensional CA that have a spreading state q, i. e. a state that spreads to all neighbors. NILPOTENCY is semi-decidable in any dimension.
Proof For semi-decidability notice that, for n = 1, 2, 3, …, we can effectively construct a cellular automaton whose global function is G^n and check whether the local rule of this CA maps everything into the same state. If this happens for some n then we halt and report that the CA is nilpotent. To prove undecidability we reduce the TILING PROBLEM of NW-deterministic Wang tiles. Let T be a given NW-deterministic tile set. One can effectively construct a one-dimensional CA whose state set is S = T ∪ {q} and whose local rule turns a cell into the quiescent state q except in the case that the cell and its right neighbor are in states x, y ∈ T, respectively, and a tile z ∈ T exists such that tiles x, y, z match as in Fig. 4a. In this case z is the new state of the cell. Note that state q is a spreading state.
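The one-dimensional CA built from an NW-deterministic tile set can be sketched as follows. This is an illustration with my own naming: `match` stands in for the matching condition of Fig. 4a, which the article gives only pictorially.

```python
# Sketch of the CA in the proof of Theorem 10. States are tiles of T plus
# the spreading quiescent state q; match[(x, y)] is the unique tile z
# fitting with x and its right neighbor y as in Fig. 4a (a hypothetical
# lookup table standing in for the NW-deterministic matching condition).

Q = "q"  # spreading state

def step(config, match):
    """One synchronous update of a finite window (the rightmost cell,
    lacking a right neighbor, is dropped)."""
    new = []
    for i in range(len(config) - 1):
        x, y = config[i], config[i + 1]
        if x == Q or y == Q:
            new.append(Q)                      # q spreads to all neighbors
        else:
            new.append(match.get((x, y), Q))   # no matching tile z: become q
    return new
```

Rows of the space-time diagram of this CA are successive diagonals of a tiling by T, so the CA stays away from the quiescent configuration exactly when T tiles the plane.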
Let us prove that the CA is not nilpotent if and only if T admits a valid tiling.
"⇐": Suppose a valid tiling exists. If c ∈ T^Z is a diagonal of this tiling then the configurations G^n(c) in its orbit are subsequent diagonals of the same tiling, for all n = 1, 2, …. This means that c never becomes quiescent, and the CA is not nilpotent.
"⇒": Suppose no valid tiling exists. Then there is a number n such that no valid tiling of an n × n square exists. This means that for every initial configuration c ∈ S^Z the configuration G^{2n}(c) is quiescent: if it is not quiescent
then a valid tiling of an n × n square can be read from the space-time diagram of the configurations c, G(c), …, G^{2n}(c). We conclude that the CA is nilpotent.
Undecidability of NILPOTENCY has some interesting corollaries. First, it implies that the topological entropy of a one-dimensional CA cannot be calculated, not even approximated [8].
Theorem 11 TOPOLOGICAL ENTROPY is undecidable.
Proof Let us reduce NILPOTENCY. Let c > 0 be any constant, and let n > 2^c be an integer. For any given one-dimensional CA G with state set S and a spreading state q ∈ S construct a new CA whose state set is S × {1, 2, …, n}, and whose local rule applies G in the first components of the states and shifts the second components one cell to the left. In addition, every state (q, i) is turned into the state (q, 1). If G is nilpotent then the new CA is also nilpotent and its topological entropy is 0. If G is not nilpotent then there is a configuration c ∈ S^Z such that no cell ever turns into the spreading state q. But then the second components form a left shift over the alphabet {1, 2, …, n}, so the topological entropy is at least log2 n > c.
It also follows that it is undecidable to determine whether a given one-dimensional CA is ultimately periodic [5]:
Theorem 12 EQUICONTINUITY is undecidable among one-dimensional CA.
Proof Among one-dimensional CA with a spreading state, EQUICONTINUITY is equivalent to NILPOTENCY.
Theorem 13 SENSITIVITY TO INITIAL CONDITIONS is undecidable among one-dimensional CA.
Proof Originally the result was proved in [5] using an elaborate reduction of the Turing machine halting problem. However, undecidability of NILPOTENCY provides the result directly, as pointed out in [14]: a one-dimensional cellular automaton whose neighborhood vector contains only strictly positive numbers is either nilpotent or sensitive, and adding a constant to all elements of the neighborhood vector does not affect the nilpotency status of a CA.
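The product construction in the proof of Theorem 11 can be rendered as a short sketch; `product_step` and `g_step` are my own names, and the finite cyclic window only approximates the biinfinite configuration space.

```python
# Sketch of the CA built in the proof of Theorem 11: states are pairs in
# S × {1, ..., n}. G acts on the first track, the second track is shifted
# one cell to the left, and every state (q, i) is reset to (q, 1).

def product_step(config, g_step, q):
    """config: list of (s, i) pairs on a cyclic window; g_step performs
    one update of a list of S-states; q is the spreading state of G."""
    first = g_step([s for s, _ in config])
    # left shift of the second track (wrap-around on the finite window)
    second = [config[(k + 1) % len(config)][1] for k in range(len(config))]
    return [(s, 1) if s == q else (s, i) for s, i in zip(first, second)]
```

If G never produces q on some configuration, the second track carries an unperturbed left shift over n symbols, forcing entropy at least log2 n, as in the proof.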
So for any given one-dimensional CA we proceed as follows: add a constant to the elements of the neighborhood vector so that they all become strictly positive. The new CA is sensitive if and only if the original CA was not nilpotent. The result then follows from Theorem 10.
As a final application of the undecidability of NILPOTENCY, consider other questions concerning the limit set (= maximal attractor) of one-dimensional CA. One can show
that NILPOTENCY can be reduced to any non-trivial question [12]. More precisely, let PROB be a decision problem that takes arbitrary one-dimensional CA as input. Suppose that PROB always has the same answer for any two CA that have the same limit set. Then we say that PROB is a decision problem concerning the limit sets of CA. We call PROB non-trivial if there exist both positive and negative instances.
Theorem 14 Let PROB be any non-trivial decision problem concerning the limit sets of CA. Then PROB is undecidable [12].
Other Undecidability Results
In the previous sections we only considered decision problems that have been proved undecidable using reductions from the tiling problem or its variants. There are many other decision problems that have been proved undecidable using other techniques. Below are a few, with literature references.
We call a CA G periodic if there is a number n such that G^n is the identity function. This is equivalent to saying that every configuration is periodic, that is, every configuration returns back to itself. Clearly a periodic CA is necessarily injective. In fact, periodic CA are exactly those CA that are injective and equicontinuous.
PERIODICITY
Input: Cellular Automaton A
Question: Is A periodic?
Theorem 15 PERIODICITY is undecidable among one-dimensional CA [15].
A CA is called sensitive to initial conditions if there exists a finite set B ⊆ Z^d of cells such that for every configuration c and every finite set A ⊆ Z^d of cells there exist a configuration e and a time t ≥ 0 such that e(x) = c(x) for all x ∈ A but G^t(e)(x) ≠ G^t(c)(x) for some x ∈ B.
SENSITIVITY TO INITIAL CONDITIONS
Input: Cellular Automaton A
Question: Is A sensitive to initial conditions?
Theorem 16 SENSITIVITY TO INITIAL CONDITIONS is undecidable among one-dimensional CA [5].
The following problems deal with dynamics on finite configurations. We hence suppose that the given CA has a quiescent state, i. e.
a state q such that f(q, q, …, q) = q, where f is the local update rule of the CA. A configuration c ∈ S^{Z^d} is called finite (w.r.t. q) if all but a finite number of cells are in state q. Questions similar to NILPOTENCY
and EQUICONTINUITY can be asked in the space of finite configurations:
NILPOTENCY ON FINITE CONFIGURATIONS
Input: Cellular Automaton A with a quiescent state
Question: Does every finite configuration evolve into the quiescent configuration?
EVENTUAL PERIODICITY ON FINITE CONFIGURATIONS
Input: Cellular Automaton A with a quiescent state
Question: Does every finite configuration evolve into a temporally periodic configuration?
Theorem 17 NILPOTENCY ON FINITE CONFIGURATIONS and EVENTUAL PERIODICITY ON FINITE CONFIGURATIONS are undecidable for one-dimensional CA [4,23].
Future Directions
Several interesting and challenging open questions remain. In particular, the decidability statuses of the following decision problems concerning basic dynamical properties are unknown.
We call a one-dimensional CA G positively expansive if there exists a finite set A ⊆ Z of cells such that for any two distinct configurations c and e there exists t ≥ 0 such that G^t(c) and G^t(e) differ in some cell of A. We call an injective CA G expansive if for some finite set A ⊆ Z the following holds: for any two distinct configurations c and e there exists t ∈ Z such that G^t(c) and G^t(e) differ in some cell of A.
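No finite computation decides positive expansivity (that is exactly what is open), but the defining condition can be illustrated on finite approximations. The sketch below, with my own naming, merely checks whether two spatially periodic configurations become distinguishable inside a fixed observation window within a bounded number of steps.

```python
# Illustration of the positive-expansivity condition: do G^t(c) and G^t(e)
# differ in some cell of the window A for some t <= t_max? Configurations
# are finite lists read periodically; `step` performs one CA update.

def differ_in_window(c, e, step, window, t_max):
    """True iff the two orbits differ inside `window` within t_max steps."""
    for _ in range(t_max + 1):
        if any(c[i % len(c)] != e[i % len(e)] for i in window):
            return True
        c, e = step(c), step(e)
    return False
```

Such a check can only ever exhibit witnesses of the expansivity condition for particular pairs; it can never certify that a CA is positively expansive.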
It is known that two- and higher-dimensional CA cannot be expansive or positively expansive [6,22], so the following decision problems are only asked for one-dimensional CA:
POSITIVE EXPANSIVITY
Input: One-dimensional cellular automaton A
Question: Is A positively expansive?
EXPANSIVITY
Input: One-dimensional cellular automaton A
Question: Is A expansive?
The decidability status of both POSITIVE EXPANSIVITY and EXPANSIVITY is unknown.
Acknowledgments
Research supported by the Academy of Finland grant 211967.
Bibliography
Primary Literature
1. Amoroso S, Patt Y (1972) Decision procedures for surjectivity and injectivity of parallel maps for tessellation structures. J Comput Syst Sci 6:448–464
2. Berger R (1966) The undecidability of the domino problem. Mem Am Math Soc 66:1–72
3. Culik K II (1996) An aperiodic set of 13 Wang tiles. Discret Math 160:245–251
4. Culik K II, Yu S (1988) Undecidability of CA classification schemes. Complex Syst 2:177–190
5. Durand B, Formenti E, Varouchas G (2003) On undecidability of equicontinuity classification for cellular automata. In: Proceedings of Discrete Models for Complex Systems, Lyon, France, 16–19 June 2003, pp 117–128
6. Finelli M, Manzini G, Margara L (1998) Lyapunov exponents versus expansivity and sensitivity in cellular automata. J Complex 14:210–233
7. Gurevich YS, Koryakov IO (1972) Remarks on Berger's paper on the domino problem. Sib Math J 13:319–321
8. Hurd LP, Kari J, Culik K (1992) The topological entropy of cellular automata is uncomputable. Ergod Theory Dyn Syst 12:255–265
9. Kari J (1990) Reversibility of 2D cellular automata is undecidable. Physica D 45:379–385
10. Kari J (1992) The nilpotency problem of one-dimensional cellular automata. SIAM J Comput 21:571–586
11. Kari J (1994) Reversibility and surjectivity problems of cellular automata. J Comput Syst Sci 48:149–182
12. Kari J (1994) Rice's theorem for the limit sets of cellular automata. Theoret Comput Sci 127:229–254
13. Kari J (1996) A small aperiodic set of Wang tiles. Discret Math 160:259–264
14. Kari J (2008) Undecidable properties on the dynamics of reversible one-dimensional cellular automata. In: Proceedings of Journées Automates Cellulaires, Uzès, France, 21–25 April 2008
15.
Kari J, Ollinger N Periodicity and immortality in reversible computing (in press)
16. Kari J, Papasoglu P (1999) Deterministic aperiodic tile sets. Geom Funct Anal 9:353–369
17. Kurka P (1997) Languages, equicontinuity and attractors in cellular automata. Ergod Theory Dyn Syst 17:417–433
18. Lukkarila V (2007) The 4-way deterministic tiling problem is undecidable (in press)
19. Moore EF (1962) Machine models of self-reproduction. Proc Symp Appl Math 14:17–33
20. Myhill J (1963) The converse of Moore's Garden-of-Eden theorem. Proc Am Math Soc 14:685–686
21. Robinson RM (1971) Undecidability and nonperiodicity for tilings of the plane. Invent Math 12:177–209
22. Shereshevsky MA (1993) Expansiveness, entropy and polynomial growth for groups acting on subshifts by automorphisms. Indag Math 4:203–210
23. Sutner K (1989) A note on the Culik–Yu classes. Complex Syst 3:107–115
24. Wang H (1961) Proving theorems by pattern recognition – II. Bell Syst Tech J 40:1–42
Books and Reviews
Codd EF (1968) Cellular automata. Academic Press, New York
Garzon M (1995) Models of massive parallelism: analysis of cellular automata and neural networks. Springer, New York
Hedlund G (1969) Endomorphisms and automorphisms of the shift dynamical system. Math Syst Theory 3:320–375
Kari J (2005) Theory of cellular automata: a survey. Theoret Comput Sci 334:3–33
Toffoli T, Margolus N (1987) Cellular automata machines. MIT Press, Cambridge
Wolfram S (ed) (1986) Theory and applications of cellular automata. World Scientific, Singapore
Wolfram S (2002) A new kind of science. Wolfram Media, Champaign
Topological Dynamics of Cellular Automata
PETR KŮRKA 1,2
1 Département d'Informatique, Université de Nice Sophia Antipolis, Nice, France
2 Center for Theoretical Study, Academy of Sciences and Charles University, Prague, Czechia
Article Outline
Glossary
Definition of the Subject
Introduction
Topological Dynamics
Symbolic Dynamics
Equicontinuity
Surjectivity
Permutive and Closing Cellular Automata
Expansive Cellular Automata
Attractors
Subshifts and Entropy
Examples
Future Directions
Acknowledgments
Bibliography
Glossary
Almost equicontinuous CA has an equicontinuous configuration.
Attractor omega-limit of a clopen invariant set.
Blocking word interrupts information flow.
Closing CA distinct asymptotic configurations have distinct images.
Column subshift columns in space-time diagrams.
Cross section one-sided inverse map.
Directional dynamics dynamics along a direction in the space-time diagram.
Equicontinuous configuration nearby configurations remain close.
Equicontinuous CA all configurations are equicontinuous.
Expansive CA distinct configurations get away.
Finite time attractor is attained in finite time from its neighborhood.
Jointly periodic configuration is periodic both for the shift and the CA.
Lyapunov exponents asymptotic speed of information propagation.
Maximal attractor omega-limit of the full space.
Nilpotent CA maximal attractor is a singleton.
Open CA image of an open set is open.
Permutive CA local rule permutes an extremal coordinate.
Quasi-attractor a countable intersection of attractors.
Signal subshift weakly periodic configurations of a given period.
Spreading set clopen invariant set which propagates in both directions.
Subshift attractor limit set of a spreading set.
Definition of the Subject
A topological dynamical system is a continuous self-map F : X → X of a topological space X. Topological dynamics studies iterations F^n : X → X, or trajectories (F^n(x))_{n≥0}. Basic questions are how trajectories depend on initial conditions, whether they are dense in the state space X, whether they have limits, and what their accumulation points are. While cellular automata were introduced in the late 1940s by von Neumann [36] as regular infinite networks of finite automata, topological dynamics of cellular automata began in 1969 with Hedlund [19], who viewed one-dimensional cellular automata in the context of symbolic dynamics as endomorphisms of the shift dynamical systems. In fact, the term "cellular automaton" never appears in his paper. Hedlund's main results are the characterizations of surjective and open cellular automata. In the early 1980s Wolfram [41] produced space-time representations of one-dimensional cellular automata and classified them informally into four classes using dynamical concepts like periodicity, stability and chaos. Wolfram's classification stimulated mathematical research involving all the concepts of topological and measure-theoretical dynamics, and several formal classifications were introduced using dynamical concepts.
There are two well-understood classes of cellular automata with remarkably different stability properties. Equicontinuous cellular automata settle into a fixed or periodic configuration depending on the initial condition and cannot be perturbed by fluctuations. This is a paradigm of stability.
Positively expansive cellular automata, on the other hand, are conjugated (isomorphic) to one-sided full shifts. They have dense orbits, dense periodic configurations, positive topological entropy, and sensitive dependence on the initial conditions. This is a paradigm of chaotic behavior. Between these two extreme classes there are many distinct types of dynamical behavior which are understood much less. Only some specific classes or particular examples have been elucidated and a general theory is still lacking.
Introduction
Dynamical properties of CA are usually studied in the context of symbolic dynamics. Other possibilities are measurable dynamics (see Pivato, Ergodic Theory of Cellular Automata) or non-compact dynamics in Besicovitch or Weyl spaces (see Formenti and Kůrka, Dynamics of Cellular Automata in Non-compact Spaces). In symbolic dynamics, the state space is the Cantor space of symbolic sequences. The Cantor space has distinguished topological properties which simplify some concepts of dynamics. This is the case for attractors and topological entropy. Cellular automata can be defined in the context of symbolic dynamics as continuous mappings which commute with the shift map.
Equicontinuous and almost equicontinuous CA can be characterized using the concept of blocking words. While equicontinuous CA are eventually periodic, the closely related almost equicontinuous automata are periodic on a large (residual) subset of the state space. Outside of this subset, however, their behavior can be arbitrarily complex.
A property which strongly constrains the dynamics of cellular automata is surjectivity. Surjective automata preserve the uniform Bernoulli measure, they are bounded-to-one, and their unique subshift attractor is the full space. An important subclass of surjective CA are (left- or right-) closing automata. They have dense sets of periodic configurations. Cellular automata which are both left- and right-closing are open: they map open sets to open sets, are n-to-one, and have cross sections.
A fairly well-understood class is that of positively expansive automata. A positively expansive cellular automaton is conjugated (isomorphic) to a one-sided full shift. Closely related are bijective expansive automata, which are conjugated to two-sided subshifts, usually of finite type. Another important concept elucidating the dynamics of CA is that of an attractor. With respect to attractors, CA classify into two basic classes.
In one class there are CA which have disjoint attractors. These have a countably infinite number of attractors and an uncountable number of quasi-attractors, i. e., countable intersections of attractors. In the other class there are CA that have either a minimal attractor or a minimal quasi-attractor, which is then contained in every attractor. An important class of attractors are subshift attractors: subsets which are both attractors and subshifts. They always have non-empty intersection, so they form a lattice with a maximal element.
Factors of CA that are subshifts are useful because factor maps preserve many dynamical properties while they simplify the dynamics. In the special case of column subshifts, they are formed by the sequences of words occurring
in a column of a space-time diagram. Factor subshifts are instrumental in evaluating the entropy of CA and in characterizing CA with the shadowing property. A finer classification of CA is provided by quantitative characteristics. Topological entropy measures the quantity of information available to an observer who can see a finite part of a configuration. Lyapunov exponents measure the speed of information propagation. The minimum preimage number provides a finer classification of surjective cellular automata. Sets of left- and right-expansivity directions provide a finer classification of left- and right-closing cellular automata.
Topological Dynamics
We review basic concepts of topological dynamics as exposed in Kůrka [24]. A Cantor space is any metric space which is compact (any sequence has a convergent subsequence), totally disconnected (distinct points are separated by disjoint clopen, i. e., closed and open, sets), and perfect (no point is isolated). Any two Cantor spaces are homeomorphic. A symbolic space is any compact, totally disconnected metric space, i. e., any closed subspace of a Cantor space. A symbolic dynamical system (SDS) is a pair (X, F) where X is a symbolic space and F : X → X is a continuous map. The nth iteration of F is denoted by F^n. If F is bijective (invertible), the negative iterations are defined by F^{-n} = (F^{-1})^n. A set Y ⊆ X is invariant if F(Y) ⊆ Y, and strongly invariant if F(Y) = Y.
A homomorphism φ : (X, F) → (Y, G) of SDS is a continuous map φ : X → Y such that φ ∘ F = G ∘ φ. A conjugacy is a bijective homomorphism. The systems (X, F) and (Y, G) are conjugated if there exists a conjugacy between them. If φ is surjective, we say that (Y, G) is a factor of (X, F). If φ is injective, (X, F) is a subsystem of (Y, G). In this case φ(X) ⊆ Y is a closed invariant set. Conversely, if Y ⊆ X is a closed invariant set, then (Y, F) is a subsystem of (X, F). We denote by d : X × X →
[0, ∞) the metric and by B_δ(x) = {y ∈ X : d(y, x) < δ} the ball with center x and radius δ. A finite sequence (x_i ∈ X)_{0≤i≤n} is a δ-chain from x_0 to x_n if d(F(x_i), x_{i+1}) < δ for all i < n. A point x ∈ X ε-shadows a sequence (x_i)_{0≤i≤n} if d(F^i(x), x_i) < ε for all 0 ≤ i ≤ n. An SDS (X, F) has the shadowing property if for every ε > 0 there exists δ > 0 such that every δ-chain is ε-shadowed by some point.
Definition 1 Let (X, F) be an SDS. The orbit relation O_F, the recurrence relation R_F, the non-wandering relation N_F, and the chain relation C_F are defined by
(x, y) ∈ O_F ⟺ ∃n > 0 : y = F^n(x)
(x, y) ∈ R_F ⟺ ∀ε > 0, ∃n > 0 : d(y, F^n(x)) < ε
(x, y) ∈ N_F ⟺ ∀ε, δ > 0, ∃n > 0, ∃z ∈ B_δ(x) : d(F^n(z), y) < ε
(x, y) ∈ C_F ⟺ ∀ε > 0, there exists an ε-chain from x to y.
We have O_F ⊆ R_F ⊆ N_F ⊆ C_F. The diagonal of a relation S ⊆ X × X is |S| := {x ∈ X : (x, x) ∈ S}. We denote by S(x) := {y ∈ X : (x, y) ∈ S} the S-image of a point x ∈ X. The orbit of a point x ∈ X is O_F(x) := {F^n(x) : n > 0}. It is an invariant set, so its closure gives a subsystem of (X, F). A point x ∈ X is periodic with period n > 0 if F^n(x) = x, i. e., if x ∈ |O_F|. A point x ∈ X is eventually periodic if F^m(x) is periodic for some preperiod m ≥ 0. The points in |R_F| are called recurrent, the points in |N_F| non-wandering, and the points in |C_F| chain-recurrent. The sets |N_F| and |C_F| are closed and invariant, so (|N_F|, F) and (|C_F|, F) are subsystems of (X, F).
The set of transitive points is T_F := {x ∈ X : the orbit O_F(x) is dense in X}. A system (X, F) is minimal if R_F = X × X. This happens if and only if each point has a dense orbit, i. e., if T_F = X. A system is transitive if N_F = X × X, i. e., if for any non-empty open sets U, V ⊆ X there exists n > 0 such that F^n(U) ∩ V ≠ ∅. A system is transitive if and only if it has a transitive point, i. e., if T_F ≠ ∅. In this case the set of transitive points T_F is residual, i. e., it contains a countable intersection of dense open sets. An infinite system is chaotic if it is transitive and has a dense set of periodic points. A system (X, F) is weakly mixing if (X × X, F × F) is transitive. It is strongly transitive if (X, F^n) is transitive for every n > 0. A system (X, F) is mixing if for all non-empty open sets U, V ⊆ X we have F^n(U) ∩ V ≠ ∅ for all sufficiently large n. A system is chain-transitive if C_F = X × X, and chain-mixing if for any x, y ∈ X and any ε > 0 there exist ε-chains from x to y of every sufficiently large length.
If a system (X, F) has the shadowing property, then N_F = C_F. It follows that a chain-transitive system with the shadowing property is transitive, and a chain-mixing system with the shadowing property is mixing.
A clopen partition of a symbolic space X is a finite system of disjoint clopen sets whose union is X. The join of clopen partitions U and V is U ∨ V = {U ∩ V : U ∈ U, V ∈ V}. The inverse image of a clopen partition U by F is F^{-1}(U) = {F^{-1}(U) : U ∈ U}. The entropy H(X, F, U) of a partition and the entropy h(X, F) of a system are defined by
H(X, F, U) = lim_{n→∞} (1/n) ln |U ∨ F^{-1}(U) ∨ … ∨ F^{-(n-1)}(U)|,
h(X, F) = sup{H(X, F, U) : U is a clopen partition of X}.
Symbolic Dynamics
An alphabet is any finite set with at least two elements. The cardinality of a finite set A is denoted by |A|. We frequently use the alphabet 2 = {0, 1} and, more generally, n = {0, …, n−1}. A word over A is any finite sequence u = u_0 … u_{n−1} of elements of A. The length of u is denoted by |u| := n, and the word of zero length is denoted by λ. The set of all words of length n is denoted by A^n. The set of all non-empty words and the set of all words are
A^+ = ⋃_{n>0} A^n,  A* = ⋃_{n≥0} A^n.
We denote by Z the set of integers, by N the set of nonnegative integers, by N^+ the set of positive integers, by Q the set of rational numbers, and by R the set of real numbers. The set of one-sided configurations (infinite words) is A^N and the set of two-sided configurations (biinfinite words) is A^Z. If u is a finite or infinite word and I = [i, j] is an interval of integers on which u is defined, put u_{[i,j]} = u_i … u_j. Similarly for open or half-open intervals: u_{[i,j)} = u_i … u_{j−1}. We say that v is a subword of u and write v ⊑ u if v = u_I for some interval I ⊆ Z. If u ∈ A^n, then u^∞ ∈ A^Z is the infinite repetition of u defined by (u^∞)_{kn+i} = u_i. Similarly, x = u^∞.v^∞ is the configuration satisfying x_{i+k|u|} = u_i for k < 0, 0 ≤ i < |u| and x_{i+k|v|} = v_i for k ≥ 0, 0 ≤ i < |v|. On A^N and A^Z we have the metrics
d(x, y) = 2^{−n} where n = min{i ≥ 0 : x_i ≠ y_i}, for x, y ∈ A^N,
d(x, y) = 2^{−n} where n = min{i ≥ 0 : x_i ≠ y_i or x_{−i} ≠ y_{−i}}, for x, y ∈ A^Z.
Both A^N and A^Z are Cantor spaces. In A^N and A^Z the cylinder sets of a word u ∈ A^n are [u] := {x ∈ A^N : x_{[0,n)} = u} and [u]_k := {x ∈ A^Z : x_{[k,k+n)} = u}, where k ∈ Z. Cylinder sets are clopen, and every clopen set is a finite union of cylinders. The shift maps σ : A^N → A^N and σ : A^Z → A^Z defined by σ(x)_i = x_{i+1} are continuous. While the two-sided shift is bijective, the one-sided shift is not: every configuration has |A| preimages. A one-sided subshift is any non-empty closed set Σ ⊆ A^N which is shift-invariant, i. e., σ(Σ) ⊆ Σ. A two-sided subshift is any non-empty closed set Σ ⊆ A^Z which is strongly shift-invariant, i. e., σ(Σ) = Σ. Thus a subshift Σ represents an SDS (Σ, σ). The systems (A^Z, σ) and (A^N, σ) are called full shifts. Given a set D ⊆ A* of forbidden words, the set S_D := {x ∈ A^N : ∀u ∈ D, u ⋢ x} is a one-sided subshift, provided it is non-empty. Any one-sided subshift has this
Topological Dynamics of Cellular Automata
form. Similarly, S D :D fx 2 AZ : 8u 2 D; u 6v xg is a two-sided subshift, and any two-sided subshift has this form. A (one- or two-sided) subshift is of finite type (SFT), if the set D of forbidden words is finite. The language of a subshift ˙ is the set of all subwords of configurations of ˙ ,
Topological Dynamics of Cellular Automata, Figure 1 A blocking word
$$L_n(\Sigma) = \{u \in A^n : \exists x \in \Sigma,\ u \sqsubseteq x\},\qquad L(\Sigma) = \bigcup_{n \ge 0} L_n(\Sigma) = \{u \in A^* : \exists x \in \Sigma,\ u \sqsubseteq x\}.$$
The entropy of a subshift $\Sigma$ is $h(\Sigma,\sigma) = \lim_{n\to\infty} \ln|L_n(\Sigma)|/n$. A word $w \in L(\Sigma)$ is intrinsically synchronizing if for any $u,v \in A^*$ such that $uw, wv \in L(\Sigma)$ we have $uwv \in L(\Sigma)$. A subshift is of finite type if and only if all sufficiently long words are intrinsically synchronizing (see Lind and Marcus [29]). A subshift $\Sigma$ is sofic if $L(\Sigma)$ is a regular language, i.e., if $\Sigma = \Sigma_{\mathcal G}$ is the set of labels of paths of a labelled graph $\mathcal G = (V,E,s,t,l)$, where $V$ is a finite set of vertices, $E$ is a finite set of edges, $s,t : E \to V$ are the source and target maps, and $l : E \to A$ is a labelling function. The labelling function extends to a function $\ell : E^{\mathbb Z} \to A^{\mathbb Z}$ defined by $\ell(x)_i = l(x_i)$. A graph $\mathcal G = (V,E,s,t,l)$ determines an SFT $\Sigma_{|\mathcal G|} = \{u \in E^{\mathbb Z} : \forall i \in \mathbb Z,\ t(u_i) = s(u_{i+1})\}$ and $\Sigma_{\mathcal G} = \{\ell(u) : u \in \Sigma_{|\mathcal G|}\}$, so that $\ell : (\Sigma_{|\mathcal G|},\sigma) \to (\Sigma_{\mathcal G},\sigma)$ is a factor map. If $\Sigma = \Sigma_{\mathcal G}$, we say that $\mathcal G$ is a presentation of $\Sigma$. A graph $\mathcal G$ is right-resolving if different outgoing edges of a vertex are labelled differently, i.e., if $l(e) \ne l(e')$ whenever $e \ne e'$ and $s(e) = s(e')$. A word $w$ is synchronizing in $\mathcal G$ if all paths with label $w$ have the same target, i.e., if $t(u) = t(u')$ whenever $\ell(u) = \ell(u') = w$. If $w$ is synchronizing in $\mathcal G$, then $w$ is intrinsically synchronizing in $\Sigma_{\mathcal G}$. Any transitive sofic subshift $\Sigma$ has a unique minimal right-resolving presentation $\mathcal G$, which has the smallest number of vertices. Any word can be extended to a word which is synchronizing in $\mathcal G$ (see Lind and Marcus [29]). A deterministic finite automaton (DFA) over an alphabet $A$ is a system $\mathcal A = (Q,\delta,q_0,q_1)$, where $Q$ is a finite set of states, $\delta : Q \times A \to Q$ is a transition function and $q_0,q_1$ are the initial and rejecting states. The transition function extends to $\delta : Q \times A^* \to Q$ by $\delta(q,\lambda) = q$ and $\delta(q,ua) = \delta(\delta(q,u),a)$. The language accepted by $\mathcal A$ is $L(\mathcal A) := \{u \in A^* : \delta(q_0,u) \ne q_1\}$ (see e.g. Hopcroft and Ullman [20]).
The DFA of a labelled graph $\mathcal G = (V,E,s,t,l)$ is $\mathcal A(\mathcal G) = (\mathcal P(V),\delta,V,\emptyset)$, where $\mathcal P(V)$ is the set of all subsets of $V$ and $\delta(q,a) = \{v \in V : \exists u \in q,\ u \xrightarrow{a} v\}$. Then $L(\mathcal A(\mathcal G)) = L(\Sigma_{\mathcal G})$. We can reduce the size of $\mathcal A(\mathcal G)$ by taking only those states that are accessible from the initial state $V$.
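As a concrete illustration (ours, not part of the original article), the language of an SFT given by forbidden words, and hence its entropy $h(\Sigma,\sigma) = \lim_n \ln|L_n(\Sigma)|/n$, can be computed by brute force for small $n$. For the golden mean subshift $S_{\{11\}}$ every word avoiding $11$ extends to a biinfinite configuration, so the naive count is exactly $|L_n|$; it follows the Fibonacci recurrence, and $\ln|L_n|/n$ converges to $\ln((1+\sqrt5)/2) \approx 0.4812$. A minimal Python sketch:

```python
import math
from itertools import product

def count_words(n, forbidden=("11",)):
    """|L_n| of the SFT S_D over {0,1}, by brute force enumeration.

    Caveat: in general a word avoiding D need not be extendable to a
    biinfinite configuration; for the golden mean shift (D = {11}) it is,
    so the count below is exact.
    """
    return sum(1 for w in product("01", repeat=n)
               if not any(f in "".join(w) for f in forbidden))

counts = [count_words(n) for n in range(1, 11)]   # 2, 3, 5, 8, 13, ... (Fibonacci)
entropy_estimate = math.log(counts[-1]) / 10      # approaches ln((1+sqrt(5))/2)
```

The same brute-force count works for any small forbidden-word set, at exponential cost in $n$.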
A periodic structure $n = (n_i)_{i \ge 0}$ is a sequence of integers greater than 1. For a given periodic structure $n$, let $X_n := \prod_{i \ge 0}\{0,\dots,n_i-1\}$ be the product space with metric $d(x,y) = 2^{-n}$, where $n = \min\{i \ge 0 : x_i \ne y_i\}$. Then $X_n$ is a Cantor space. The adding machine (odometer) of $n$ is the SDS $(X_n,F)$ given by the formula
$$F(x)_i = \begin{cases} (x_i + 1) \bmod n_i & \text{if } \forall j < i,\ x_j = n_j - 1, \\ x_i & \text{if } \exists j < i,\ x_j < n_j - 1. \end{cases}$$
Each adding machine is minimal and has zero topological entropy.

Definition 2 A map $F : A^{\mathbb Z} \to A^{\mathbb Z}$ is a cellular automaton (CA) if there exist integers $m \le a$ (memory and anticipation) and a local rule $f : A^{a-m+1} \to A$ such that for any $x \in A^{\mathbb Z}$ and any $i \in \mathbb Z$, $F(x)_i = f(x_{[i+m,i+a]})$. Call $r = \max\{|m|,|a|\} \ge 0$ the radius of $F$ and $d = a - m \ge 0$ its diameter.

By a theorem of Hedlund [19], a map $F : A^{\mathbb Z} \to A^{\mathbb Z}$ is a cellular automaton if and only if it is continuous and commutes with the shift, i.e., $\sigma \circ F = F \circ \sigma$. This means that $F : (A^{\mathbb Z},\sigma) \to (A^{\mathbb Z},\sigma)$ is a homomorphism of the full shift and $(A^{\mathbb Z},F)$ is a SDS. We can assume that the local rule acts on a symmetric neighborhood of 0, so that $F(x)_i = f(x_{[i-r,i+r]})$, where $f : A^{2r+1} \to A$. There is a trade-off between the radius and the size of the alphabet: any CA is conjugated to a CA with radius 1.

Any $\sigma$-periodic configuration of a CA $(A^{\mathbb Z},F)$ is F-eventually periodic. Hence the set of F-eventually periodic configurations is dense. Thus, a cellular automaton is never minimal, because it always has an F-periodic configuration. A configuration $x \in A^{\mathbb Z}$ is weakly periodic if there exist $p \in \mathbb Z$ and $q \in \mathbb N^+$ such that $F^q\sigma^p(x) = x$. A configuration $x \in A^{\mathbb Z}$ is jointly periodic if it is both F-periodic and $\sigma$-periodic. A CA $(A^{\mathbb Z},F)$ is nilpotent if $F^n(A^{\mathbb Z})$ is a singleton for some $n > 0$.

Equicontinuity

A point $x \in X$ of a SDS $(X,F)$ is equicontinuous if
$$\forall \varepsilon > 0,\ \exists \delta > 0,\ \forall y \in B_\delta(x),\ \forall n \ge 0,\ d(F^n(y),F^n(x)) < \varepsilon.$$
The set of equicontinuous points is denoted by $E_F$. A system is equicontinuous if $E_F = X$. In this case it is uniformly equicontinuous, i.e.,
$$\forall \varepsilon > 0,\ \exists \delta > 0,\ \forall x,y \in X,\ (d(x,y) < \delta \implies \forall n \ge 0,\ d(F^n(x),F^n(y)) < \varepsilon).$$
A system $(X,F)$ is almost equicontinuous if $E_F \ne \emptyset$. A system is sensitive if
$$\exists \varepsilon > 0,\ \forall x \in X,\ \forall \delta > 0,\ \exists y \in B_\delta(x),\ \exists n \ge 0,\ d(F^n(y),F^n(x)) \ge \varepsilon.$$
Clearly, a sensitive system has no equicontinuous points. The converse is not true in general, but it holds for transitive systems (Akin et al. [2]).

Definition 3 A word $u \in A^+$ with $|u| \ge s \ge 0$ is s-blocking for a CA $(A^{\mathbb Z},F)$ if there exists an offset $k \in [0,|u|-s]$ such that
$$\forall x,y \in [u]_0,\ \forall n \ge 0,\ F^n(x)_{[k,k+s)} = F^n(y)_{[k,k+s)}.$$

Theorem 4 (Kůrka [27]) Let $(A^{\mathbb Z},F)$ be a CA with radius $r \ge 0$. The following conditions are equivalent.
(1) $(A^{\mathbb Z},F)$ is not sensitive.
(2) $(A^{\mathbb Z},F)$ has an r-blocking word.
(3) $E_F$ is residual, i.e., it contains a countable intersection of dense open sets.
(4) $E_F \ne \emptyset$.

For a non-empty set $B \subseteq A^+$ define
$$T_n(B) := \{x \in A^{\mathbb Z} : (\exists j > i > n,\ x_{[i,j)} \in B) \text{ and } (\exists j < i < -n,\ x_{[j,i)} \in B)\},\qquad T(B) := \bigcap_{n \ge 0} T_n(B).$$
Each $T_n(B)$ is open and dense, so the set $T(B)$ of B-recurrent configurations is residual. If $B$ is the set of r-blocking words, then $E_F = T(B)$.

Theorem 5 (Kůrka [27]) Let $(A^{\mathbb Z},F)$ be a CA with radius $r \ge 0$. The following conditions are equivalent.
(1) $(A^{\mathbb Z},F)$ is equicontinuous, i.e., $E_F = A^{\mathbb Z}$.
(2) There exists $k > 0$ such that every $u \in A^k$ is r-blocking.
(3) There exist a preperiod $q \ge 0$ and a period $p > 0$ such that $F^{q+p} = F^q$.
In particular, every CA with radius $r = 0$ is equicontinuous. A configuration is equicontinuous for $F$ if and only if it is equicontinuous for $F^n$, i.e., $E_F = E_{F^n}$. This fact enables us to consider equicontinuity along rational directions $\alpha = p/q$.

Definition 6 The sets of equicontinuous directions and almost equicontinuous directions of a CA $(A^{\mathbb Z},F)$ are defined by
$$E(F) = \left\{\frac{p}{q} : p \in \mathbb Z,\ q \in \mathbb N^+,\ E_{F^q\sigma^p} = A^{\mathbb Z}\right\},\qquad A(F) = \left\{\frac{p}{q} : p \in \mathbb Z,\ q \in \mathbb N^+,\ E_{F^q\sigma^p} \ne \emptyset\right\}.$$
Clearly, $E(F) \subseteq A(F)$, and both sets $E(F)$ and $A(F)$ are convex (Sablik [37]): if $\alpha_0 < \alpha_1 < \alpha_2$ and $\alpha_0,\alpha_2 \in A(F)$, then $\alpha_1 \in A(F)$. Sablik [37] considers also equicontinuity along irrational directions.

Proposition 7 Let $(A^{\mathbb Z},F)$ be an equicontinuous CA such that there exists $0 \ne \alpha \in A(F)$. Then $(A^{\mathbb Z},F)$ is nilpotent.

Proof We can assume $\alpha < 0$. There exist $0 \le k < m$ and $w \in A^m$ such that for all $x,y \in A^{\mathbb Z}$ and for all $i \in \mathbb Z$ we have
$$x_{[i,i+m)} = y_{[i,i+m)} \implies \forall n \ge 0,\ F^n(x)_{i+k} = F^n(y)_{i+k},$$
$$w = x_{[i,i+m)} = y_{[i,i+m)} \implies \forall n \ge 0,\ F^n\sigma^{\lfloor n\alpha\rfloor}(x)_{i+k} = F^n\sigma^{\lfloor n\alpha\rfloor}(y)_{i+k}.$$
Take $n$ such that $l := \lfloor n\alpha\rfloor + m \le 0$. There exists $a \in A$ such that $F^n\sigma^{\lfloor n\alpha\rfloor}(z)_k = a$ for every $z \in [w]_0$. Let $x \in A^{\mathbb Z}$ be arbitrary. For a given $i \in \mathbb Z$, take a configuration $y \in [x_{[i,i+m)}]_i \cap [w]_{i-l+m}$. Then $z := \sigma^{i-\lfloor n\alpha\rfloor}(y) = \sigma^{i-l+m}(y) \in [w]_0$ and $F^n(x)_{i+k} = F^n(y)_{i+k} = F^n\sigma^{\lfloor n\alpha\rfloor}(z)_k = a$. Thus, $F^n(x) = a^\infty$ for every $x \in A^{\mathbb Z}$, and $F^{n+t}(x) = F^n(F^t(x)) = a^\infty$ for every $t \ge 0$, so $(A^{\mathbb Z},F)$ is nilpotent (see also Sablik [38]).

Theorem 8 Let $(A^{\mathbb Z},F)$ be a CA with memory $m$ and anticipation $a$, i.e., $F(x)_i = f(x_{[i+m,i+a]})$. Then exactly one of the following conditions is satisfied.
(1) $E(F) = A(F) = \mathbb Q$. This happens if and only if $(A^{\mathbb Z},F)$ is nilpotent.
(2) $E(F) = \emptyset$ and there exist real numbers $\alpha_0 < \alpha_1$ such that $(\alpha_0,\alpha_1) \subseteq A(F) \subseteq [\alpha_0,\alpha_1] \subseteq [-a,-m]$.
(3) There exists $-a \le \alpha \le -m$ such that $A(F) = E(F) = \{\alpha\}$.
Topological Dynamics of Cellular Automata, Figure 2 Perturbation speeds
Topological Dynamics of Cellular Automata, Figure 3 A diamond (left) and a magic word (right)
(4) There exists $-a \le \alpha \le -m$ such that $A(F) = \{\alpha\}$ and $E(F) = \emptyset$.
(5) $A(F) = E(F) = \emptyset$.

This follows from Theorems II.2 and II.5 in Sablik [37] and from Proposition 7. The zero CA of Example 1 belongs to class (1). The product CA of Example 4 belongs to class (2). The identity CA of Example 2 belongs to class (3). The Coven CA of Example 18 belongs to class (4). The sum CA of Example 11 belongs to class (5).

Sensitivity can be expressed quantitatively by Lyapunov exponents, which measure the speed of information propagation. Let $(A^{\mathbb Z},F)$ be a CA. The left and right perturbation sets of $x \in A^{\mathbb Z}$ are
$$W_s^-(x) = \{y \in A^{\mathbb Z} : \forall i \ge s,\ y_i = x_i\},\qquad W_s^+(x) = \{y \in A^{\mathbb Z} : \forall i \le s,\ y_i = x_i\}.$$
The left and right perturbation speeds of $x \in A^{\mathbb Z}$ are
$$I_n^-(x) = \min\{s \ge 0 : \forall i \le n,\ F^i(W_{-s}^-(x)) \subseteq W_0^-(F^i(x))\},$$
$$I_n^+(x) = \min\{s \ge 0 : \forall i \le n,\ F^i(W_s^+(x)) \subseteq W_0^+(F^i(x))\}.$$
Thus, $I_n^-(x)$ is the minimum distance of a perturbation of the left part of $x$ which cannot reach the zero site by time $n$. Both $I_n^-(x)$ and $I_n^+(x)$ are non-decreasing. If $0 < s < t$ and $x_{[s,t]}$ is an r-blocking word (where $r$ is the radius), then $\lim_{n\to\infty} I_n^-(x) \le t$. Similarly, if $s < t < 0$ and $x_{[s,t]}$ is an r-blocking word, then $\lim_{n\to\infty} I_n^+(x) \le |s|$. In particular, if $x \in E_F$, then both $I_n^-(x)$ and $I_n^+(x)$ have finite limits. If $(A^{\mathbb Z},F)$ is sensitive, then $\lim_{n\to\infty}(I_n^-(x) + I_n^+(x)) = \infty$ for every $x \in A^{\mathbb Z}$.
Definition 9 The left and right Lyapunov exponents of a CA $(A^{\mathbb Z},F)$ at $x \in A^{\mathbb Z}$ are
$$\lambda_F^-(x) = \liminf_{n\to\infty} \frac{I_n^-(x)}{n},\qquad \lambda_F^+(x) = \liminf_{n\to\infty} \frac{I_n^+(x)}{n}.$$
If $F$ has memory $m$ and anticipation $a$, then $\lambda_F^-(x) \le \max\{a,0\}$ and $\lambda_F^+(x) \le \max\{-m,0\}$ for all $x \in A^{\mathbb Z}$. If $x \in E_F$, then $\lambda_F^-(x) = \lambda_F^+(x) = 0$. If $F$ is right-permutive (see Sect. "Permutive and Closing Cellular Automata") with $a > 0$, then $\lambda_F^-(x) = a$ for every $x \in A^{\mathbb Z}$. If $F$ is left-permutive with $m < 0$, then $\lambda_F^+(x) = -m$ for every $x \in A^{\mathbb Z}$.

Theorem 10 (Bressaud and Tisseur [9]) For a positively expansive CA (see Sect. "Expansive Cellular Automata") there exists a constant $c > 0$ such that $\lambda^-(x) \ge c$ and $\lambda^+(x) \ge c$ for all $x \in A^{\mathbb Z}$.

Conjecture 11 (Bressaud and Tisseur [9]) Any sensitive CA has a configuration $x$ such that $\lambda^-(x) > 0$ or $\lambda^+(x) > 0$.

Surjectivity

Let $(A^{\mathbb Z},F)$ be a CA with diameter $d \ge 0$ and a local rule $f : A^{d+1} \to A$. We extend the local rule to a function $f : A^* \to A^*$ by $f(u)_i = f(u_{[i,i+d]})$ for $0 \le i < |u| - d$, so $|f(u)| = \max\{|u| - d,\ 0\}$. A diamond for $f$ (Fig. 3 left) consists of words $u,v \in A^d$ and distinct $w,z \in A^+$ of the same length, such that $f(uwv) = f(uzv)$.

Theorem 12 (Hedlund [19], Moothathu [33]) Let $(A^{\mathbb Z},F)$ be a CA with local rule $f : A^{d+1} \to A$. The following conditions are equivalent.
(1) $F : A^{\mathbb Z} \to A^{\mathbb Z}$ is surjective.
(2) For each $x \in A^{\mathbb Z}$, $F^{-1}(x)$ is finite or countable.
(3) For each $x \in A^{\mathbb Z}$, $F^{-1}(x)$ is a finite set.
(4) For each $x \in A^{\mathbb Z}$, $|F^{-1}(x)| \le |A|^d$.
(5) $f : A^* \to A^*$ is surjective.
(6) For each $u \in A^+$, $|f^{-1}(u)| = |A|^d$.
(7) For each $u \in A^+$ with $|u| \le d \log_2|A|\,(2d + |A|^{2d})$, $|f^{-1}(u)| = |A|^d$.
(8) There exists no diamond for $f$.

It follows that any injective CA is surjective and hence bijective. Although (6) asserts equality, the inequality in (4) may be strict. Another equivalent condition states that the uniform Bernoulli measure is invariant for $F$. In this form, Theorem 12 has been generalized to CA on mixing SFTs (see Theorem 2B.1 in Pivato, Ergodic Theory of Cellular Automata).
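Conditions (6)-(7) make surjectivity decidable: one checks that the extended local rule is balanced on words up to a finite length bound. A hedged Python sketch (ours; we check only short words here, which already separates the two examples):

```python
from itertools import product
from collections import Counter

def is_balanced(f, d, alphabet=(0, 1), max_len=4):
    """Check Theorem 12(6) up to max_len: every u in A^n must have exactly
    |A|^d preimages under the extended rule f(v)_i = f(v_[i,i+d]).

    By Theorem 12(7) a finite length bound suffices for a full decision
    procedure; the small cutoff used here is for illustration only.
    """
    target = len(alphabet) ** d
    for n in range(1, max_len + 1):
        counts = Counter(
            tuple(f(v[i:i + d + 1]) for i in range(n))   # extended rule on v, |v| = n + d
            for v in product(alphabet, repeat=n + d))
        # every word must occur, each with exactly |A|^d preimages
        if len(counts) < len(alphabet) ** n or any(c != target for c in counts.values()):
            return False
    return True

xor_rule = lambda u: u[1] ^ u[2]          # ECA102: right-permutive, hence surjective
and_rule = lambda u: u[0] & u[1] & u[2]   # ECA128: not surjective
```

The AND rule fails already at $n = 1$: the word $1$ has the single preimage $111$ instead of $|A|^d = 4$.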
Theorem 13 (Blanchard and Tisseur [5])
(1) Any configuration of a surjective CA is non-wandering, i.e., $N_F = A^{\mathbb Z}$.
(2) Any surjective almost equicontinuous CA has a dense set of F-periodic configurations.
(3) If $(A^{\mathbb Z},F)$ is an equicontinuous and surjective CA, then there exists $p > 0$ such that $F^p = \mathrm{Id}$. In particular, $F$ is bijective.

Theorem 14 (Moothathu [32]) Let $(A^{\mathbb Z},F)$ be a surjective CA.
(1) $R_F$ is dense in $A^{\mathbb Z}$.
(2) $F$ is semi-open, i.e., $F(U)$ has non-empty interior for any open $U \ne \emptyset$.
(3) If $(A^{\mathbb Z},F)$ is transitive, then it is weakly mixing, and hence totally transitive and sensitive.

Conjecture 15 Every surjective CA has a dense set of F-periodic configurations.

Proposition 16 (Acerbi et al. [1]) If every mixing CA has a dense set of F-periodic configurations, then every surjective CA has a dense set of jointly periodic configurations.

Definition 17 Let $(A^{\mathbb Z},F)$ be a CA with local rule $f : A^{d+1} \to A$.
(1) The minimum preimage number $p(F)$ (Fig. 3 right) is defined by
$$p(F,w) = \min_{0 \le t \le |w|} |\{u \in A^d : \exists v \in f^{-1}(w),\ v_{[t,t+d)} = u\}|,\qquad p(F) = \min\{p(F,w) : w \in A^+\}.$$
(2) A word $w \in A^+$ is magic if $p(F,w) = p(F)$.

Recall that $T(w)$ is the (residual) set of configurations which contain an infinite number of occurrences of $w$ both in $x_{(-\infty,0)}$ and in $x_{(0,\infty)}$. Configurations $x,y \in A^{\mathbb Z}$ are d-separated if $x_{[i,i+d)} \ne y_{[i,i+d)}$ for all $i \in \mathbb Z$.

Theorem 18 (Hedlund [19], Kitchens [23]) Let $(A^{\mathbb Z},F)$ be a surjective CA with diameter $d$ and minimum preimage number $p(F)$.
(1) If $w \in A^+$ is a magic word, then any $z \in T(w)$ has exactly $p(F)$ preimages. These preimages are pairwise d-separated.
(2) Any configuration $z \in A^{\mathbb Z}$ has at least $p(F)$ pairwise d-separated preimages.
(3) If every $y \in A^{\mathbb Z}$ has exactly $p(F)$ preimages, then all long enough words are magic.

Theorem 19 Let $(A^{\mathbb Z},F)$ be a CA and $\Sigma \subseteq A^{\mathbb Z}$ a sofic subshift. Then both $F(\Sigma)$ and $F^{-1}(\Sigma)$ are sofic subshifts. In particular, the first image subshift $F(A^{\mathbb Z})$ is sofic.

See e.g. Formenti and Kůrka [16] for a proof. The first image graph of a local rule $f : A^{d+1} \to A$ is $\mathcal G(f) = (A^d, A^{d+1}, s, t, f)$, where $s(u) = u_{[0,d)}$ and $t(u) = u_{[1,d]}$. Then $F(A^{\mathbb Z}) = \Sigma_{\mathcal G(f)}$. It is algorithmically decidable whether a given CA is surjective. One decision procedure is based on Moothathu's condition in Theorem 12(7). Another procedure is based on the construction of the DFA $\mathcal A(\mathcal G(f))$ (see Sect. "Symbolic Dynamics"): a CA with local rule $f : A^{d+1} \to A$ is surjective if and only if the rejecting state $\emptyset$ cannot be reached from the initial state $A^d$ in $\mathcal A(\mathcal G(f))$. See Morita, Reversible Cellular Automata, for further information on bijective CA.

Permutive and Closing Cellular Automata

Definition 20 Let $(A^{\mathbb Z},F)$ be a CA, and let $f : A^{d+1} \to A$ be the local rule of $F$ with smallest diameter.
(1) $F$ is left-permutive if $\forall u \in A^d,\ \forall b \in A,\ \exists! a \in A,\ f(au) = b$.
(2) $F$ is right-permutive if $\forall u \in A^d,\ \forall b \in A,\ \exists! a \in A,\ f(ua) = b$.
(3) $F$ is permutive if it is either left-permutive or right-permutive.
(4) $F$ is bipermutive if it is both left- and right-permutive.
Permutive CA can be seen in Examples 8, 10, 11 and 18.

Definition 21 Let $(A^{\mathbb Z},F)$ be a CA.
(1) Configurations $x,y \in A^{\mathbb Z}$ are left-asymptotic if $\exists n,\ x_{(-\infty,n)} = y_{(-\infty,n)}$.
Topological Dynamics of Cellular Automata, Figure 4 Closingness
(2) Configurations $x,y \in A^{\mathbb Z}$ are right-asymptotic if $\exists n,\ x_{(n,\infty)} = y_{(n,\infty)}$.
(3) $(A^{\mathbb Z},F)$ is right-closing if $F(x) \ne F(y)$ for any left-asymptotic $x \ne y \in A^{\mathbb Z}$.
(4) $(A^{\mathbb Z},F)$ is left-closing if $F(x) \ne F(y)$ for any right-asymptotic $x \ne y \in A^{\mathbb Z}$.
(5) A CA is closing if it is either left- or right-closing.

Proposition 22
(1) Any right-permutive CA is right-closing.
(2) Any right-closing CA is surjective.
(3) A CA $(A^{\mathbb Z},F)$ is right-closing if and only if there exists $m > 0$ such that for any $x,y \in A^{\mathbb Z}$,
$$x_{[-m,0)} = y_{[-m,0)} \text{ and } F(x)_{[-m,m]} = F(y)_{[-m,m]} \implies x_0 = y_0$$
(see Fig. 4 left).

See e.g. Kůrka [24] for a proof. The proposition holds with the obvious modification for left-permutive and left-closing CA. The multiplication CA from Example 14 is both left- and right-closing but neither left-permutive nor right-permutive. The CA from Example 15 is surjective but not closing.

Proposition 23 Let $(A^{\mathbb Z},F)$ be a right-closing CA. For all sufficiently large $m > 0$, if $u \in A^m$, $v \in A^{2m}$, and if $F([u]_{-m}) \cap [v]_{-m} \ne \emptyset$, then (Fig. 4 right)
$$\forall b \in A,\ \exists! a \in A,\ F([ua]_{-m}) \cap [vb]_{-m} \ne \emptyset.$$
See e.g. Kůrka [24] for a proof.

Theorem 24 (Boyle and Kitchens [6]) Any closing CA $(A^{\mathbb Z},F)$ has a dense set of jointly periodic configurations.

Theorem 25 (Coven, Pivato and Yassawi [14]) Let $F$ be a left-permutive CA with memory 0.
(1) If $O(x)$ is infinite and $x_{[0,\infty)}$ is fixed, i.e., if $F(x)_{[0,\infty)} = x_{[0,\infty)}$, then $(O(x),F)$ is conjugated to an adding machine.
(2) If $F$ is not bijective, then the set of configurations $x$ such that $(O(x),F)$ is conjugated to an adding machine is dense.

A SDS $(X,F)$ is open if $F(U)$ is open for any open $U \subseteq X$. A cross section of a SDS $(X,F)$ is any continuous map $G : X \to X$ such that $F \circ G = \mathrm{Id}$. If $F$ has a cross section, it is surjective. In particular, any bijective SDS has a cross section.

Theorem 26 (Hedlund [19]) Let $(A^{\mathbb Z},F)$ be a CA. The following conditions are equivalent.
(1) $(A^{\mathbb Z},F)$ is open.
(2) $(A^{\mathbb Z},F)$ is both left- and right-closing.
(3) For any $x \in A^{\mathbb Z}$, $|F^{-1}(x)| = p(F)$.
(4) There exist cross sections $G_1,\dots,G_{p(F)} : A^{\mathbb Z} \to A^{\mathbb Z}$ such that for any $x \in A^{\mathbb Z}$, $F^{-1}(x) = \{G_1(x),\dots,G_{p(F)}(x)\}$ and $G_i(x) \ne G_j(x)$ for $i \ne j$.
In general, the cross sections $G_i$ are not CA, as they need not commute with the shift. Only when $p(F) = 1$, i.e., when $F$ is bijective, is the inverse map $F^{-1}$ a CA. Any CA which is open and almost equicontinuous is bijective (Kůrka [24]).

Expansive Cellular Automata

Definition 27 Let $(A^{\mathbb Z},F)$ be a CA.
(1) $F$ is left-expansive if there exists $\varepsilon > 0$ such that if $x_{(-\infty,0]} \ne y_{(-\infty,0]}$, then $d(F^n(x),F^n(y)) \ge \varepsilon$ for some $n \ge 0$.
(2) $F$ is right-expansive if there exists $\varepsilon > 0$ such that if $x_{[0,\infty)} \ne y_{[0,\infty)}$, then $d(F^n(x),F^n(y)) \ge \varepsilon$ for some $n \ge 0$.
(3) $F$ is positively expansive if it is both left- and right-expansive, i.e., if there exists $\varepsilon > 0$ such that for all $x \ne y \in A^{\mathbb Z}$, $d(F^n(x),F^n(y)) \ge \varepsilon$ for some $n > 0$.

Any left-expansive or right-expansive CA is sensitive and (by Theorem 12) surjective, because it cannot contain a diamond. A bijective CA is expansive if
$$\exists \varepsilon > 0,\ \forall x \ne y \in A^{\mathbb Z},\ \exists n \in \mathbb Z,\ d(F^n(x),F^n(y)) \ge \varepsilon.$$

Proposition 28 Let $(A^{\mathbb Z},F)$ be a CA with memory $m$ and anticipation $a$.
(1) If $m < 0$ and if $F$ is left-permutive, then $F$ is left-expansive.
(2) If $a > 0$ and if $F$ is right-permutive, then $F$ is right-expansive.
(3) If $m < 0 < a$ and if $F$ is bipermutive, then $F$ is positively expansive.
See e.g. Kůrka [24] for a proof.

Theorem 29 (Nasu [34,35])
(1) Any positively expansive CA is conjugated to a one-sided full shift.
(2) A bijective expansive CA with memory 0 is conjugated to a two-sided SFT.

Conjecture 30 Every bijective expansive CA is conjugated to a two-sided SFT.

Definition 31 Let $(A^{\mathbb Z},F)$ be a CA. The left- and right-expansivity direction sets are defined by
$$X_-(F) = \left\{\frac{p}{q} : p \in \mathbb Z,\ q \in \mathbb N^+,\ F^q\sigma^p \text{ is left-expansive}\right\},$$
$$X_+(F) = \left\{\frac{p}{q} : p \in \mathbb Z,\ q \in \mathbb N^+,\ F^q\sigma^p \text{ is right-expansive}\right\},$$
$$X(F) = X_-(F) \cap X_+(F).$$
All these sets are convex and open. Moreover, $X_-(F) \cap A(F) = X_+(F) \cap A(F) = \emptyset$ (Sablik [37]).

Theorem 32 (Sablik [37]) Let $(A^{\mathbb Z},F)$ be a CA with memory $m$ and anticipation $a$.
(1) If $F$ is left-permutive, then $X_-(F) = (-\infty,-m)$.
(2) If $F$ is right-permutive, then $X_+(F) = (-a,\infty)$.
(3) If $X_-(F) \ne \emptyset$, then there exists $\alpha \in \mathbb R$ such that $X_-(F) = (-\infty,\alpha) \subseteq (-\infty,-m)$.
(4) If $X_+(F) \ne \emptyset$, then there exists $\alpha \in \mathbb R$ such that $X_+(F) = (\alpha,\infty) \subseteq (-a,\infty)$.
(5) If $X(F) \ne \emptyset$, then there exist $\alpha_0,\alpha_1 \in \mathbb R$ such that $X(F) = (\alpha_0,\alpha_1) \subseteq (-a,-m)$.

Theorem 33 Let $(A^{\mathbb Z},F)$ be a cellular automaton.
(1) $F$ is left-closing if and only if $X_-(F) \ne \emptyset$.
(2) $F$ is right-closing if and only if $X_+(F) \ne \emptyset$.
(3) If $A(F)$ is an interval, then $F$ is not surjective and $X_-(F) = X_+(F) = \emptyset$.

Proof (1) The proof is the same as the following proof of (2).
(2, $\Leftarrow$) If $F$ is not right-closing and $\varepsilon = 2^{-n}$, then there exist distinct left-asymptotic configurations such that $x_{(-\infty,n]} = y_{(-\infty,n]}$ and $F(x) = F(y)$. It follows that $d(F^i(x),F^i(y)) < \varepsilon$ for all $i \ge 0$, so $F$ is not right-expansive. The same argument works for any $F^q\sigma^p$, so $X_+(F) = \emptyset$.
(2, $\Rightarrow$) Let $F$ be right-closing, and let $m > 0$ be the constant from Proposition 23. Assume that
$$F^n(x)_{[-m-(m+1)n,\ m+(m+1)n]} = F^n(y)_{[-m-(m+1)n,\ m+(m+1)n]} \quad \text{for all } n \ge 0.$$
By Proposition 23,
$$F^{n-1}(x)_{[-m-(m+1)(n-1),\ m+(m+1)(n-1)+1]} = F^{n-1}(y)_{[-m-(m+1)(n-1),\ m+(m+1)(n-1)+1]}.$$
By induction we get $x_{[-m,m+n]} = y_{[-m,m+n]}$. This holds for every $n > 0$, so $x_{[0,\infty)} = y_{[0,\infty)}$. Thus, $F^{m+1}$ is right-expansive, and therefore $X_+(F) \ne \emptyset$.
(3) If there are blocking words for two different directions, then the CA has a diamond and therefore is not surjective by Theorem 12.

Corollary 34 Let $(A^{\mathbb Z},F)$ be an equicontinuous CA. There are three possibilities.
(1) If $F$ is surjective, then $A(F) = E(F) = \{0\}$, $X_-(F) = (-\infty,0)$, $X_+(F) = (0,\infty)$.
(2) If $F$ is neither surjective nor nilpotent, then $A(F) = E(F) = \{0\}$, $X_-(F) = X_+(F) = \emptyset$.
(3) If $F$ is nilpotent, then $A(F) = E(F) = \mathbb R$, $X_-(F) = X_+(F) = \emptyset$.
The proof follows from Proposition 7 and Theorem 13 (see also Sablik [38]). The identity CA is in class (1). The product CA of Example 3 is in class (2). The zero CA of Example 1 is in class (3).

Attractors

Let $(X,F)$ be a SDS. The limit set of a clopen invariant set $V \subseteq X$ is $\Omega_F(V) := \bigcap_{n \ge 0} F^n(V)$. A set $Y \subseteq X$ is an attractor if there exists a non-empty clopen invariant set $V$ such that $Y = \Omega_F(V)$. We say that $Y$ is a finite time attractor if $Y = \Omega_F(V) = F^n(V)$ for some $n > 0$ (and a clopen invariant set $V$). There always exists the largest attractor $\Omega_F := \Omega_F(X)$. Finite time maximal attractors are also called stable limit sets in the
literature. The number of attractors is at most countable. The union of two attractors is an attractor, and if the intersection of two attractors is non-empty, it contains an attractor. The basin of an attractor $Y \subseteq X$ is the set $B(Y) = \{x \in X : \lim_{n\to\infty} d(F^n(x),Y) = 0\}$. An attractor $Y \subseteq X$ is a minimal attractor if no proper subset of $Y$ is an attractor. An attractor is a minimal attractor if and only if it is chain-transitive. A periodic point $x \in X$ is attracting if its orbit $O(x)$ is an attractor. Any attracting periodic point is equicontinuous. A quasi-attractor is a non-empty set which is an intersection of a countable number of attractors.

Theorem 35 (Hurley [22])
(1) If a CA has two disjoint attractors, then any attractor contains two disjoint attractors and an uncountably infinite number of quasi-attractors.
(2) If a CA has a minimal attractor, then it is a subshift, it is contained in any other attractor, and its basin of attraction is a dense open set.
(3) If $x \in A^{\mathbb Z}$ is an attracting F-periodic configuration, then $\sigma(x) = x$ and $F(x) = x$.

Corollary 36 For any CA, exactly one of the following statements holds.
(1) There exist two disjoint attractors and a continuum of quasi-attractors.
(2) There exists a unique quasi-attractor. It is a subshift and it is contained in any attractor.
(3) There exists a unique minimal attractor contained in any other attractor.

Both equicontinuity and surjectivity yield strong constraints on attractors.

Theorem 37 (Kůrka [24])
(1) A surjective CA has either a unique attractor or a pair of disjoint attractors.
(2) An equicontinuous CA has either two disjoint attractors or a unique attractor which is an attracting fixed configuration.
(3) If a CA has an attracting fixed configuration which is a unique attractor, then it is equicontinuous.

We consider now subshift attractors of CA, i.e., those attractors which are subshifts. Let $(A^{\mathbb Z},F)$ be a CA. A clopen F-invariant set $U \subseteq A^{\mathbb Z}$ is spreading if there exists $k > 0$ such that $F^k(U) \subseteq \sigma^{-1}(U) \cap \sigma(U)$.
If $U$ is a clopen invariant set, then $\Omega_F(U)$ is a subshift iff $U$ is spreading (Kůrka [25]). Recall that a language is recursively enumerable if it is the domain (or the range) of a recursive function (see e.g. Hopcroft and Ullman [20]).

Theorem 38 (Formenti and Kůrka [17]) Let $\Sigma \subseteq A^{\mathbb Z}$ be a subshift attractor of a CA $(A^{\mathbb Z},F)$.
(1) $A^* \setminus L(\Sigma)$ is a recursively enumerable language.
(2) $\Sigma$ contains a jointly periodic configuration.
(3) $(\Sigma,\sigma)$ is chain-mixing.

Theorem 39 (Formenti and Kůrka [17])
(1) The only subshift attractor of a surjective CA is the full space.
(2) A subshift of finite type is an attractor of a CA if and only if it is mixing.
(3) Given a CA $(A^{\mathbb Z},F)$, the intersection of all subshift attractors of all $F^q\sigma^p$, where $q \in \mathbb N^+$ and $p \in \mathbb Z$, is a non-empty F-invariant subshift called the small quasi-attractor $Q_F$. $(Q_F,\sigma)$ is chain-mixing and $F : Q_F \to Q_F$ is surjective.

The system of all subshift attractors of a given CA forms a lattice with join $\Sigma_0 \cup \Sigma_1$ and meet $\Sigma_0 \wedge \Sigma_1 := \Omega_F(\Sigma_0 \cap \Sigma_1)$. There exist CA with an infinite number of subshift attractors (Kůrka [26]).

Proposition 40 (Di Lena [28]) The basin of a subshift attractor is a dense open set.

By a theorem of Hurd [21], if $\Omega_F$ is an SFT, then it is stable, i.e., $\Omega_F = F^n(A^{\mathbb Z})$ for some $n > 0$. We generalize this theorem to subshift attractors.

Theorem 41 Let $U$ be a spreading set for a CA $(A^{\mathbb Z},F)$.
(1) There exists a spreading set $W \subseteq U$ such that $\Omega_F(W) = \Omega_F(U)$ and $\widetilde\Omega(W) := \bigcap_{i \in \mathbb Z} \sigma^i(W)$ is a mixing subshift of finite type.
(2) If $\Omega_F(W)$ is an SFT, then $\Omega_F(W) = F^n(\widetilde\Omega(W))$ for some $n \ge 0$.

Proof (1) See Formenti and Kůrka [17]. (2) Let $D$ be a finite set of forbidden words for $\Omega_F(W)$. For each $u \in D$ there exists $n_u > 0$ such that $u \notin L(F^{n_u}(\widetilde\Omega(W)))$. Take $n := \max\{n_u : u \in D\}$.

By Theorem 5, every equicontinuous CA has a finite time maximal attractor.

Definition 42 Let $\Sigma \subseteq A^{\mathbb Z}$ be a mixing sofic subshift, and let $\mathcal G = (V,E,s,t,l)$ be its minimal right-resolving presentation with factor map $\ell : (\Sigma_{|\mathcal G|},\sigma) \to (\Sigma,\sigma)$.
(1) A homogeneous configuration $a^\infty \in \Sigma$ is receptive if there exist intrinsically synchronizing words $u,v \in L(\Sigma)$ and $n \in \mathbb N$ such that $ua^mv \in L(\Sigma)$ for all $m > n$.
(2) $\Sigma$ is almost of finite type (AFT) if $\ell : \Sigma_{|\mathcal G|} \to \Sigma$ is one-to-one on a dense open set of $\Sigma_{|\mathcal G|}$.
(3) $\Sigma$ is near-Markov if $\{x \in \Sigma : |\ell^{-1}(x)| > 1\}$ is a finite set of $\sigma$-periodic configurations.

Condition (3) is equivalent to the condition that $\ell$ is left-closing, i.e., that $\ell(u) \ne \ell(v)$ for distinct right-asymptotic paths $u,v \in E^{\mathbb Z}$. Each near-Markov subshift is AFT.

Theorem 43 (Maass [30]) Let $\Sigma \subseteq A^{\mathbb Z}$ be a mixing sofic subshift with a receptive configuration $a^\infty \in \Sigma$.
(1) If $\Sigma$ is either SFT or AFT, then there exists a CA $(A^{\mathbb Z},F)$ such that $\Sigma = \Omega_F = F(A^{\mathbb Z})$.
(2) A near-Markov subshift cannot be an infinite time maximal attractor of a CA.

On the other hand, a near-Markov subshift can be a finite time maximal attractor (see Example 21). The language $L(\Omega_F)$ can have arbitrary complexity (see Culik et al. [15]). A CA with a non-sofic mixing maximal attractor has been constructed in Formenti and Kůrka [17].

Definition 44 Let $f : A^{d+1} \to A$ be a local rule of a cellular automaton. We say that a subshift $\Sigma \subseteq A^{\mathbb Z}$ has decreasing preimages if there exists $m > 0$ such that for each $u \in A^* \setminus L(\Sigma)$, each $v \in f^{-m}(u)$ contains as a subword a word $w \in A^* \setminus L(\Sigma)$ such that $|w| < |u|$.

Proposition 45 (Formenti and Kůrka [16]) If $(A^{\mathbb Z},F)$ is a CA and $\Sigma \subseteq A^{\mathbb Z}$ has decreasing preimages, then $\Omega_F \subseteq \Sigma$.

For more information about attractor-like objects in CA see Sect. "Invariance of Maxentropy Measures" of Pivato, Ergodic Theory of Cellular Automata.

Subshifts and Entropy

Definition 46 Let $F : A^{\mathbb Z} \to A^{\mathbb Z}$ be a cellular automaton.
(1) Denote by $S_{(p,q)}(F) := \{x \in A^{\mathbb Z} : F^q\sigma^p(x) = x\}$ the set of all weakly periodic configurations of $F$ with period $(p,q) \in \mathbb Z \times \mathbb N^+$.
(2) A signal subshift is any non-empty $S_{(p,q)}(F)$.
(3) The speed subshift of $F$ with speed $\alpha = \frac{p}{q} \in \mathbb Q$ is $S_\alpha(F) = \bigcup_{n>0} S_{(np,nq)}(F)$.

Note that both $S_{(p,q)}(F)$ and $S_\alpha(F)$ are closed and $\sigma$-invariant. However, $S_{(p,q)}(F)$ can be empty, so it need not be a subshift.

Theorem 47 Let $(A^{\mathbb Z},F)$ be a cellular automaton with memory $m$ and anticipation $a$, so $F(x)_i = f(x_{[i+m,i+a]})$.
(1) If $S_{(p,q)}(F)$ is non-empty, then it is a subshift of finite type.
(2) If $S_{(p,q)}(F)$ is infinite, then $-a \le p/q \le -m$.
(3) If $p_0/q_0 < p_1/q_1$, then $S_{(p_0,q_0)}(F) \cap S_{(p_1,q_1)}(F) \subseteq \{x \in A^{\mathbb Z} : \sigma^p(x) = x\}$, where $p = q(p_1/q_1 - p_0/q_0)$ and $q = \operatorname{lcm}(q_0,q_1)$ (the least common multiple).
(4) $S_{(p,q)}(F) \subseteq S_{p/q}(F) \subseteq \Omega_F$ and $S_{p/q}(F) \ne \emptyset$.
(5) If $X(F) \ne \emptyset$ or if $(A^{\mathbb Z},F)$ is nilpotent, then $F$ has no infinite signal subshifts.

Proof (1), (2) and (3) have been proved in Formenti and Kůrka [16]. (4) Since $F$ is bijective on each signal subshift, we get $S_{(p,q)}(F) \subseteq \Omega_F$, and therefore $S_{p/q}(F) \subseteq \Omega_F$. Since every $F^q\sigma^p$ has a periodic point, we get $S_{p/q}(F) \ne \emptyset$. (5) It has been proved in Kůrka [25] that a positively expansive CA has no signal subshifts. This property is preserved when we compose $F$ with a power of the shift map. If $(A^{\mathbb Z},F)$ is nilpotent, then each $S_{(p,q)}(F)$ contains at most one element.

The identity CA has a unique infinite signal subshift
$S_{(0,1)}(\mathrm{Id}) = A^{\mathbb Z}$. The CA of Example 17 has an infinite number of infinite signal subshifts of the same speed. A CA with infinitely many infinite signal subshifts with infinitely many speeds has been constructed in Kůrka [25]. In some cases, the maximal attractor can be constructed from signal subshifts (see Theorem 49).

Definition 48 Given an integer $c \ge 0$, the c-join $\Sigma_0 \vee_c \Sigma_1$ of subshifts $\Sigma_0,\Sigma_1 \subseteq A^{\mathbb Z}$ consists of all configurations $x \in A^{\mathbb Z}$ such that either $x \in \Sigma_0 \cup \Sigma_1$, or there exist integers $a \le b$ with $b - a \ge c$ such that $x_{(-\infty,b)} \in L(\Sigma_0)$ and $x_{[a,\infty)} \in L(\Sigma_1)$. The operation of join is associative, and the c-join of sofic subshifts is sofic.

Theorem 49 (Formenti, Kůrka [16]) Let $(A^{\mathbb Z},F)$ be a CA and let $S_{(p_1,q_1)}(F),\dots,S_{(p_n,q_n)}(F)$ be signal subshifts with decreasing speeds, i.e., $p_i/q_i > p_j/q_j$ for $i < j$. Set $q := \operatorname{lcm}\{q_1,\dots,q_n\}$ (the least common multiple). There exists $c \ge 0$ such that for $\Sigma := S_{(p_1,q_1)}(F) \vee_c \dots \vee_c S_{(p_n,q_n)}(F)$ we have $\Sigma \subseteq F^q(\Sigma)$ and therefore $\Sigma \subseteq \Omega_F$. If moreover $F^{nq}(\Sigma)$ has decreasing preimages for some $n \ge 0$, then $F^{nq}(\Sigma) = \Omega_F$.
Definition 50 Let $(A^{\mathbb Z},F)$ be a CA.
(1) The k-column homomorphism $\varphi_k : (A^{\mathbb Z},F) \to ((A^k)^{\mathbb N},\sigma)$ is defined by $\varphi_k(x)_i = F^i(x)_{[0,k)}$.
(2) The k-th column subshift is $\Sigma_k(F) = \varphi_k(A^{\mathbb Z}) \subseteq (A^k)^{\mathbb N}$.
(3) If $\varphi : (A^{\mathbb Z},F) \to (\Sigma,\sigma)$ is a factor map, where $\Sigma$ is a one-sided subshift, we say that $\Sigma$ is a factor subshift of $(A^{\mathbb Z},F)$.

Thus, each $(\Sigma_k(F),\sigma)$ is a factor of $(A^{\mathbb Z},F)$, and each factor subshift is a factor of some $\Sigma_k(F)$. Any positively expansive CA $(A^{\mathbb Z},F)$ with radius $r > 0$ is conjugated to $(\Sigma_{2r+1}(F),\sigma)$. This is an SFT which by Theorem 29 is conjugated to a full shift.

Proposition 51 (Shereshevsky and Afraimovich [39]) Let $(A^{\mathbb Z},F)$ be a CA with negative memory and positive anticipation $m < 0 < a$. Then $(A^{\mathbb Z},F)$ is bipermutive if and only if it is positively expansive and $\Sigma_{a-m+1}(F) = (A^{a-m+1})^{\mathbb N}$ is the full shift.

Theorem 52 (Blanchard and Maass [4], Di Lena [28]) Let $(A^{\mathbb Z},F)$ be a CA with radius $r$ and memory $m$.
(1) If $m \ge 0$ and $\Sigma_r(F)$ is sofic, then any factor subshift of $(A^{\mathbb Z},F)$ is sofic.
(2) If $\Sigma_{2r+1}(F)$ is sofic, then any factor subshift of $(A^{\mathbb Z},F)$ is sofic.

If $(x^i)_{i\ge0}$ is a $2^{-m}$-chain in a CA $(A^{\mathbb Z},F)$, then for all $i$, $F(x^i)_{[-m,m]} = (x^{i+1})_{[-m,m]}$, so the words $u_i = (x^i)_{[-m,m]}$ satisfy $F([u_i]_{-m}) \cap [u_{i+1}]_{-m} \ne \emptyset$. Conversely, if a sequence $(u_i \in A^{2m+1})_{i\ge0}$ satisfies this property and $x^i \in [u_i]_{-m}$, then $(x^i)_{i\ge0}$ is a $2^{-m}$-chain.

Theorem 53 (Kůrka [27]) Let $(A^{\mathbb Z},F)$ be a CA.
(1) If $\Sigma_k(F)$ is an SFT for every $k > 0$, then $(A^{\mathbb Z},F)$ has the shadowing property.
(2) If $(A^{\mathbb Z},F)$ has the shadowing property, then any factor subshift is sofic.

Any factor subshift of the Coven CA from Example 18 is sofic, but the CA does not have the shadowing property (Blanchard and Maass [4]). A CA with the shadowing property whose factor subshift is not an SFT has been constructed in Kůrka [24].

Proposition 54 Let $(A^{\mathbb Z},F)$ be a CA.
(1) $h(A^{\mathbb Z},F) = \lim_{k\to\infty} h(\Sigma_k(F),\sigma)$.
(2) If $F$ has radius $r$, then $h(A^{\mathbb Z},F) \le 2\,h(\Sigma_r(F),\sigma) \le 2r\,h(\Sigma_1(F),\sigma) \le 2r \ln|A|$.
(3) If $0 \le m \le a$, then $h(A^{\mathbb Z},F) = h(\Sigma_a(F),\sigma)$.
See e.g. Kůrka [24] for a proof.

Conjecture 55 If $(A^{\mathbb Z},F)$ is a CA with radius $r$, then $h(A^{\mathbb Z},F) = h(\Sigma_{2r+1}(F),\sigma)$.

Conjecture 56 (Moothathu [32]) Any transitive CA has positive topological entropy.

Definition 57 The directional entropy of a CA $(A^{\mathbb Z},F)$ along a rational direction $\alpha = p/q$ is $h_\alpha(A^{\mathbb Z},F) := h(A^{\mathbb Z},F^q\sigma^p)/q$.

The definition is based on the equality $h(X,F^n) = n\,h(X,F)$, which holds for every SDS. Directional entropies along irrational directions have been introduced in Milnor [31].

Proposition 58 (Courbage and Kamiński [11], Sablik [37]) Let $(A^{\mathbb Z},F)$ be a CA with memory $m$ and anticipation $a$.
(1) If $\alpha \in E(F)$, then $h_\alpha(F) = 0$.
(2) If $\alpha \in X_-(F) \cup X_+(F)$, then $h_\alpha(F) > 0$.
(3) $h_\alpha(F) \le (\max(a+\alpha,0) - \min(m+\alpha,0)) \ln|A|$.
(4) If $F$ is bipermutive, then $h_\alpha(F) = (\max\{a+\alpha,0\} - \min\{m+\alpha,0\}) \ln|A|$.
(5) If $F$ is left-permutive and $\alpha < -a$, then $h_\alpha(F) = |m+\alpha| \ln|A|$.
(6) If $F$ is right-permutive and $\alpha > -m$, then $h_\alpha(F) = (a+\alpha) \ln|A|$.

The directional entropy is not necessarily continuous (see Smillie [40]).

Theorem 59 (Boyle and Lind [8]) The function $\alpha \mapsto h_\alpha(A^{\mathbb Z},F)$ is convex and continuous on $X_-(F) \cup X_+(F)$.
Examples

Cellular automata with the binary alphabet 2 = {0, 1} and radius r = 1 are called elementary (Wolfram [42]). Their local rules are coded by numbers between 0 and 255 by

f(000) + 2 f(001) + 4 f(010) + 8 f(011) + 16 f(100) + 32 f(101) + 64 f(110) + 128 f(111).
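The numbering scheme can be sketched in code; the helper names below are assumptions, not part of the article:

```python
# Sketch (hypothetical helper names, not from the article): the Wolfram
# number of an elementary local rule f, following the weighting
# f(000) + 2 f(001) + 4 f(010) + ... + 128 f(111), and one global step.

def wolfram_number(f):
    """f maps a triple of bits (x[i-1], x[i], x[i+1]) to a bit."""
    return sum(f((n >> 2 & 1, n >> 1 & 1, n & 1)) << n for n in range(8))

def local_rule(number):
    """Inverse: recover the local rule from its Wolfram number."""
    return lambda t: (number >> (t[0] << 2 | t[1] << 1 | t[2])) & 1

def step(f, x):
    """One step of the CA on a finite configuration, periodic boundary."""
    n = len(x)
    return [f((x[(i - 1) % n], x[i], x[(i + 1) % n])) for i in range(n)]

# the product rule x[i-1]*x[i]*x[i+1] is ECA128, the identity is ECA204
assert wolfram_number(lambda t: t[0] * t[1] * t[2]) == 128
assert wolfram_number(lambda t: t[1]) == 204
```

Conversely, `local_rule(106)`, say, recovers the local rule of the ECA106 used in the examples below.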
Topological Dynamics of Cellular Automata
Topological Dynamics of Cellular Automata, Figure 5 ECA12
Topological Dynamics of Cellular Automata, Figure 6 Signal subshifts of ECA128
Example 1 (The zero rule ECA0) F(x) = 0^∞. The zero CA is an equicontinuous nilpotent CA. Its equicontinuity directions are E(F) = A(F) = (−∞, ∞).

Example 2 (The identity rule ECA204) Id(x) = x. The identity is an equicontinuous surjective CA which is not transitive. Every clopen set is an attractor and every configuration is a quasi-attractor. The equicontinuity and expansivity directions are A(Id) = E(Id) = {0}, X⁻(Id) = (−∞, 0), X⁺(Id) = (0, ∞). The directional entropy is h_α(Id) = |α|.

Example 3 (An equicontinuous rule ECA12) F(x)_i = (1 − x_{i−1}) x_i.

000:0, 001:0, 010:1, 011:1, 100:0, 101:0, 110:0, 111:0.
The ECA12 is equicontinuous: the preperiod and the period are m = p = 1. The automaton has a finite time maximal attractor Ω_F = F(A^Z) = S_{{11}} = S_{(0,1)}(F), which is called the golden mean subshift.

Example 4 (A product rule ECA128) F(x)_i = x_{i−1} x_i x_{i+1}. The ECA128 is almost equicontinuous and 0 is a 1-blocking word. The first column subshift is Σ_1(F) = S_{{01}}. Each column subshift Σ_k(F) is an SFT with zero entropy, so F has the shadowing property and zero entropy. The nth image F^n(2^Z) = {x ∈ 2^Z : ∀m ∈ [1, 2n], 1 0^m 1 ⋢ x} is an SFT. The first image graph can be seen in Fig. 10 left. The maximal attractor Ω_F = S_{{1 0^n 1 : n>0}} is a sofic subshift and has decreasing preimages. The only other attractor is the minimal attractor {0^∞} = Ω_F([0]_0), which is also the minimal quasi-attractor. The equicontinuity directions are E(F) = ∅ and A(F) = [−1, 1]. For the Lyapunov exponents we have λ⁺_F(0^∞) = λ⁻_F(0^∞) = 0 and λ⁺_F(1^∞) = λ⁻_F(1^∞) = 1. The only infinite signal subshifts are the non-transitive subshifts S_{(−1,1)}(F) = S_{{10}} and S_{(1,1)}(F) = S_{{01}}. The maximal attractor can be constructed using the join construction Ω_F = F²(S_{(−1,1)}(F) ∨ S_{(1,1)}(F)) (Fig. 6).

Example 5 (A product rule ECA136) F(x)_i = x_i x_{i+1}. The ECA136 is almost equicontinuous since 0 is a 1-blocking word. As in Example 4, we have Ω_F = S_{{1 0^k 1 : k>0}}. For any m ∈ Z, [0]_m is a clopen invariant set, which is spreading to the left but not to the right. Thus, Y_m = Ω_F([0]_m) = {x ∈ Ω_F : ∀i ≤ m, x_i = 0} is an attractor but not a subshift. We have Y_{m+1} ⊆ Y_m and ⋂_{m≥0} Y_m = {0^∞} is the unique minimal quasi-attractor.
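The decreasing preimages of ECA128 can be observed directly; a sketch (not from the article, assuming a finite block embedded in 0^∞):

```python
# Sketch (not from the article): under ECA128, F(x)_i = x[i-1]*x[i]*x[i+1],
# a finite block of 1s embedded in 0^inf loses one cell at each end per
# step, so finite islands vanish into the minimal attractor {0^inf}.

def eca128_step(x):
    x = [0] + x + [0]             # embed the window in a sea of 0s
    return [x[i - 1] * x[i] * x[i + 1] for i in range(1, len(x) - 1)]

x = [0, 1, 1, 1, 1, 1, 0]         # a block of five 1s
for _ in range(3):
    x = eca128_step(x)
assert x == [0] * 7               # the island is gone after three steps
```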
Topological Dynamics of Cellular Automata, Figure 7 ECA136
Topological Dynamics of Cellular Automata, Figure 8 A unique attractor F(x)i D xiC1 xiC2
Topological Dynamics of Cellular Automata, Figure 9 The majority rule ECA232
Topological Dynamics of Cellular Automata, Figure 10 First image subshift of ECA128 (left) and ECA106 (right)
Since F²σ^{−1}(x)_i = x_{i−1} x_i x_{i+1} is the ECA128, which has the minimal subshift attractor {0^∞}, F has the small quasi-attractor Q_F = {0^∞}. The almost equicontinuity directions are A(F) = [−1, 0].

Example 6 (A unique attractor) (2^Z, F), where F(x)_i = x_{i+1} x_{i+2}. The system is sensitive and has a unique attractor Ω_F = S_{{1 0^k 1 : k>0}} which is not F-transitive. If x ∈ [10]_0 ∩ Ω_F, then x_{[0,∞)} = 1 0^∞, so for any n > 0, F^n(x) ∉ [11]_0. However, (A^Z, F) is chain-transitive, so it does not have the shadowing property. The small quasi-attractor is Q_F = {0^∞}. The topological entropy is zero. The factor subshift Σ_1(F) = {x ∈ 2^N : ∀n ≥ 0, (x_{[n,n+1]} = 10 ⟹ x_{[n,2n+1]} = 1 0^{n+1})} is not sofic (Gilman [18]).

Example 7 (The majority rule ECA232) F(x)_i = ⌊(x_{i−1} + x_i + x_{i+1})/2⌋.

000:0, 001:0, 010:0, 011:1, 100:0, 101:1, 110:1, 111:1.
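The freezing behaviour of the majority rule can be watched by iterating it to a fixed point; a sketch on a circular configuration (helper names are assumptions, not from the article):

```python
# Sketch (helper names are assumptions, not from the article): iterating
# the majority rule on a circular configuration until it freezes; the
# fixed point contains neither 010 nor 101, and the blocks 00 and 11
# present in it never change.

def majority_step(x):
    n = len(x)
    return [(x[(i - 1) % n] + x[i] + x[(i + 1) % n]) // 2 for i in range(n)]

x = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1]
while (y := majority_step(x)) != x:
    x = y
s = ''.join(map(str, x + x))      # doubled to check the words on the circle
assert '010' not in s and '101' not in s
```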
The majority rule has 2-blocking words 00 and 11, so it is almost equicontinuous. More generally, let E = {u ∈ 2* : |u| ≥ 2, u_0 = u_1, u_{|u|−2} = u_{|u|−1}, 010 ⋢ u, 101 ⋢ u}.
Then for any u ∈ E and for any i ∈ Z, [u]_i is a clopen invariant set, so its limit set Ω_F([u]_i) is an attractor. These attractors are not subshifts. There exists a subshift attractor given by the spreading set U := 2^Z ∖ ([010]_0 ∪ [101]_0). We have Ω_F(U) = S_{(0,1)}(F) = S_{{010,101}}. There are two more infinite signal subshifts S_{(1,1)}(F) = S_{{001,110}} and S_{(−1,1)}(F) = S_{{011,100}}. The maximal attractor is Ω_F = S_{(0,1)}(F) ∪ (S_{(−1,1)}(F) ∨ S_{(1,1)}(F)) = S_{{010^k1, 10^k10, 01^k01, 101^k0 : k>1}}. All column subshifts are SFT, for example Σ_1(F) = S_{{001,110}}, and the entropy is zero. The equicontinuity directions are E(F) = ∅, A(F) = {0}.

Example 8 (A right-permutive rule ECA106) F(x)_i = (x_{i−1} x_i + x_{i+1}) mod 2.

000:0, 001:1, 010:0, 011:1, 100:0, 101:1, 110:1, 111:0.
The ECA106 is transitive (see Kůrka [24]). The first image graph is in Fig. 10 right. The minimum preimage number is p_F = 1 and the word u = 0101 is magic. Its preimages are f^{−1}(0101) = {010101, 100101, 000101, 111001}, and for every v ∈ f^{−1}(u) we have v_{[4,5]} = 01. This can be seen in Fig. 11 bottom left, where all paths in the first image graph with label 0101 are displayed. Accordingly, (01)^∞
Topological Dynamics of Cellular Automata, Figure 11 Preimages in ECA106
Topological Dynamics of Cellular Automata, Figure 12 ECA102
has a unique preimage F^{−1}((01)^∞) = {(10)^∞}. On the other hand, 0^∞ has two preimages F^{−1}(0^∞) = {0^∞, 1^∞} (Fig. 11 bottom right) and 1^∞ has three preimages F^{−1}(1^∞) = {(011)^∞, (110)^∞, (101)^∞}. We have X⁻(F) = ∅ and X⁺(F) = (−1, ∞), and there are no equicontinuity directions. For every x we have λ⁻_F(x) = 1. On the other hand, the right Lyapunov exponents are not constant. For example, λ⁺_F(0^∞) = 0 while λ⁺_F((01)^∞) = 1. The only infinite signal subshift is the golden mean subshift S_{(1,1)}(F) = S_{{11}}.
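The preimage computation for the magic word can be reproduced by brute force; a sketch with hypothetical helper names, not from the article:

```python
# Sketch (hypothetical helper names, not from the article): brute-force
# preimages of a word under the ECA106 local rule f(a,b,c) = (a*b + c) % 2,
# checking that 0101 is magic: every preimage ends with the letters 01.
from itertools import product

def f106(a, b, c):
    return (a * b + c) % 2

def preimages(word):
    """All words v of length len(word) + 2 mapped onto word by f106."""
    return [v for v in product((0, 1), repeat=len(word) + 2)
            if all(f106(*v[i:i + 3]) == word[i] for i in range(len(word)))]

pre = preimages((0, 1, 0, 1))
assert len(pre) == 4 and all(v[-2:] == (0, 1) for v in pre)
```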
Example 9 (The shift rule ECA170) σ(x)_i = x_{i+1}. The shift rule is bijective, expansive and transitive. It has a dense set of periodic configurations, so it is chaotic. Its only signal subshift is S_{(1,1)}(σ) = 2^Z. The equicontinuity and expansivity directions are E(σ) = A(σ) = {−1}, X⁻(σ) = (−∞, −1), X⁺(σ) = (−1, ∞). For any x ∈ 2^Z we have λ⁺_σ(x) = 0, λ⁻_σ(x) = 1.

Example 10 (A bipermutive rule ECA102) F(x)_i = (x_i + x_{i+1}) mod 2.

000:0, 001:1, 010:1, 011:0, 100:0, 101:1, 110:1, 111:0.

The ECA102 is bipermutive with memory 0, so it is open but not positively expansive. The expansivity directions are X⁻(F) = (−∞, 0), X⁺(F) = (−1, ∞), X(F) = (−1, 0). If x ∈ Y := W_0⁺(0^∞.1 0^∞), i.e., if x_0 = 1 and x_i = 0 for all i > 0, then (O̅(x), F) = (Y, F) is conjugated to the adding machine with periodic structure n = (2, 2, 2, ...). If −2^n < i ≤ −2^{n−1}, then (F^m(x)_i)_{m≥0} is periodic with period 2^n. There are no signal subshifts. The minimum preimage number is p_F = 2 and the two cross sections G_0, G_1 are uniquely determined by the conditions G_0(x)_0 = 0, G_1(x)_0 = 1. The entropy is h(A^Z, F) = ln 2.

Example 11 (The sum rule ECA90) F(x)_i = (x_{i−1} + x_{i+1}) mod 2.

000:0, 001:1, 010:0, 011:1, 100:1, 101:0, 110:1, 111:0.

The sum rule is bipermutive with negative memory and positive anticipation. Thus it is open, positively expansive and mixing. It is conjugated to the full shift on four symbols Σ_2(F) = {00, 01, 10, 11}^N. It has four cross sections G_0, G_1, G_2, G_3, which are uniquely determined by the conditions G_0(x)_{[0,1]} = 00, G_1(x)_{[0,1]} = 01, G_2(x)_{[0,1]} = 10, and G_3(x)_{[0,1]} = 11. For every x ∈ 2^Z we have λ⁺_F(x) = λ⁻_F(x) = 1. The system has no almost equicontinuous directions and X⁻(F) = (−∞, 1),
Topological Dynamics of Cellular Automata, Figure 13 Directional entropy of ECA90
Topological Dynamics of Cellular Automata, Figure 14 The traffic rule ECA184
X⁺(F) = (−1, ∞). The directional entropy is continuous and piecewise linear (Fig. 13).
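Since ECA90 is given by a sum modulo 2, its global map is linear over GF(2) and space-time patterns superpose; a sketch (hypothetical helper names, not from the article):

```python
# Sketch (hypothetical helper names, not from the article): ECA90,
# F(x)_i = (x[i-1] + x[i+1]) % 2, is additive over GF(2), so space-time
# patterns superpose: F^n(x XOR y) = F^n(x) XOR F^n(y).

def eca90_step(x):
    n = len(x)
    return [(x[(i - 1) % n] + x[(i + 1) % n]) % 2 for i in range(n)]

def iterate(x, steps):
    for _ in range(steps):
        x = eca90_step(x)
    return x

x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
y = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1]
xy = [a ^ b for a, b in zip(x, y)]
assert iterate(xy, 10) == [a ^ b for a, b in zip(iterate(x, 10), iterate(y, 10))]
```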
Example 12 (The traffic rule ECA184) F(x)_i = 1 if x_{[i−1,i]} = 10 or x_{[i,i+1]} = 11, i.e., F(x)_i = (x_{i−1}(x_i + 1) + x_i x_{i+1}) mod 2.

000:0, 001:0, 010:0, 011:1, 100:1, 101:1, 110:0, 111:1.

The ECA184 has three infinite signal subshifts

S_{(−1,1)}(F) = S_{{11}} ∪ {1^∞},
S_{(0,1)}(F) = S_{{10}},
S_{(1,1)}(F) = S_{{00}} ∪ {0^∞},

and a unique F-transitive attractor Ω_F = S_{(−1,1)}(F) ∨ S_{(1,1)}(F) = S_{{1(10)^n 0 : n>0}}, which is sofic. The system has neither almost equicontinuous nor expansive directions. The directional entropy is continuous, but neither piecewise linear nor convex (Smillie [40]).

Example 13 (ECA62) F(x)_i = f(x_{[i−1,i+1]}), where the local rule f is

000:0, 001:1, 010:1, 011:1, 100:1, 101:1, 110:0, 111:0.

There exists a spreading set U = [0^6]_{−2} ∪ [1^7]_{−1} ∪ ⋃_{v∈f^{−1}(1^7)} [v]_0, and Ω_F(U) is a subshift attractor which contains the σ-transitive infinite signal subshifts S_{(−1,2)}(F) and S_{(0,3)}(F) as well as their join. It follows that Q_F = Ω_F(U) = F²(S_{(−1,2)}(F) ∨ S_{(0,3)}(F)). The only other infinite signal subshifts are S_{(−4,4)}(F) and S_{(1,1)}(F), and

Ω_F = F²(S_{(−4,4)}(F) ∨ S_{(−1,2)}(F) ∨ S_{(0,3)}(F) ∨ S_{(1,1)}(F)).

In the space-time diagram in Fig. 15, the words 00, 111 and 010, which do not occur in the intersection S_{(−1,2)}(F) ∩ S_{(0,3)}(F) = {(110)^∞, (101)^∞, (011)^∞}, are displayed in grey (Kůrka [25]).

Example 14 (A multiplication rule) (4^Z, F), where

F(x)_i = (2 x_i + ⌊x_{i+1}/2⌋) mod 4 = 2 (x_i mod 2) + ⌊x_{i+1}/2⌋.

00:0, 01:0, 02:1, 03:1, 10:2, 11:2, 12:3, 13:3, 20:0, 21:0, 22:1, 23:1, 30:2, 31:2, 32:3, 33:3.

We have

F²(x)_i = (4 x_i + 2 ⌊x_{i+1}/2⌋ + (x_{i+1} mod 2)) mod 4 = x_{i+1} = σ(x)_i.
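The identity F² = σ can also be checked numerically; a sketch on a finite window (hypothetical helper names, not from the article):

```python
# Sketch (hypothetical helper names, not from the article): the
# multiplication CA on {0,1,2,3}, F(x)_i = (2*x[i] + x[i+1]//2) % 4,
# doubles a base-4 digit stream; applying it twice reproduces the shift.

def mult_step(x):
    # one step on a finite window; x[i+1] is consumed, so the output is
    # one cell shorter on the right
    return [(2 * x[i] + x[i + 1] // 2) % 4 for i in range(len(x) - 1)]

x = [3, 1, 0, 2, 2, 1, 3, 0, 1, 2, 0, 3]
# F^2 = sigma: two steps shift the window left by one cell
assert mult_step(mult_step(x)) == x[1:-1]
```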
Topological Dynamics of Cellular Automata, Figure 15 ECA 62 and its signal subshifts
Topological Dynamics of Cellular Automata, Figure 16 The multiplication CA
Thus, the CA is a "square root" of the shift map. It is bijective and expansive, and its entropy is ln 2. The system expresses multiplication by two in base four. If x ∈ A^Z is left-asymptotic with 0^∞, then φ(x) = Σ_{i∈Z} x_i 4^{−i} is finite and φ(F(x)) = 2φ(x).

Example 15 (A surjective rule) (4^Z, F), m = 0, a = 1, and the local rule is

00:0, 11:0, 22:0, 33:0, 01:1, 02:1, 12:1, 21:1, 03:2, 10:2, 13:2, 30:2, 20:3, 23:3, 31:3, 32:3.
The system is surjective but not closing. The first image automaton is in Fig. 17 top. We see that F(4^Z) is the full shift, so F is surjective. The configuration 0^∞.1^∞ has left-asymptotic preimages 0^∞.(21)^∞ and 0^∞.(12)^∞, so F is not right-closing. This configuration also has right-asymptotic preimages 0^∞.(12)^∞ and 2^∞.(12)^∞, so F is not left-closing (Fig. 17 bottom). Therefore, X⁻(F) = X⁺(F) = ∅.
Topological Dynamics of Cellular Automata, Figure 17 Asymptotic configurations
Example 16 (2^Z × 2^Z, Id × σ), i.e., F(x, y)_i = (x_i, y_{i+1}). The system is bijective and sensitive but not transitive. A(F) = E(F) = ∅, X⁻(F) = (−∞, −1), X⁺(F) = (0, ∞). There are infinite signal subshifts

S_{(0,n)} = 2^Z × |O_n|, S_{(n,n)} = |O_n| × 2^Z,

where |O_n| = {x ∈ 2^Z : σ^n(x) = x}, so the speed subshifts are S_0(F) = S_1(F) = A^Z.
Topological Dynamics of Cellular Automata, Figure 18 A bijective almost equicontinuous CA
Topological Dynamics of Cellular Automata, Figure 19 The Coven CA
Example 17 (A bijective CA) (A^Z, F), where A = {000, 001, 010, 011, 100} and

F(x, y, z)_i = (x_i, (1 + x_i) y_{i+1} + x_{i−1} z_i, (1 + x_i) z_{i−1} + x_{i+1} y_i) mod 2.

The dynamics is conveniently described as movement of three types of particles, 1 = 001, 2 = 010 and 4 = 100. The letter 0 = 000 corresponds to an empty cell and 3 = 011 corresponds to a cell occupied by both 1 = 001 and 2 = 010. The particle 4 = 100 is a wall which neither moves nor changes. Particle 1 goes to the left and when it hits a wall 4, it changes to 2. Particle 2 goes to the right and when it hits a wall, it changes to 1. Clearly 4 is a 1-blocking word, so the system is almost equicontinuous. It is bijective and its inverse is

F^{−1}(x, y, z)_i = (x_i, (1 + x_i) y_{i−1} + x_{i+1} z_i, (1 + x_i) z_{i+1} + x_{i−1} y_i) mod 2.

The first column subshift is Σ_1(F) = {0, 1, 2, 3}^N ∪ {4^∞}. We have infinite signal subshifts S_{(1,1)}(F) = {0, 1}^Z, S_{(−1,1)}(F) = {0, 2}^Z, S_{(0,1)}(F) = {0, 4}^Z. For q > 0 we get

S_{(0,q)}(F) = {x ∈ A^Z : ∀u ∈ (A ∖ {4})*, (4u4 ⊑ x ⟹ (∃m) q = 2m|u|, or u ∈ {0}*)},

so the speed subshift is S_0(F) = A^Z. The equicontinuity directions are A(F) = {0}, E(F) = ∅. The expansivity directions are X⁻(F) = (−∞, −1), X⁺(F) = (1, ∞).

Example 18 (The Coven CA, Coven and Hedlund [13], Coven [12]) (2^Z, F), where F(x)_i = (x_i + x_{i+1}(x_{i+2} + 1)) mod 2.
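The algebraic identity for the Coven local rule quoted below can be checked mechanically; a sketch (helper names are assumptions, not from the article):

```python
# Sketch (helper names are assumptions, not from the article): the Coven
# CA local rule f(a,b,c) = (a + b*(c+1)) % 2 and the identity
# f(1a1b) = 1c with c = a + b + 1 (mod 2) quoted in the text.

def f(a, b, c):
    return (a + b * (c + 1)) % 2

def apply_word(w):
    # the radius-2 one-sided rule maps a word of length n to length n - 2
    return [f(*w[i:i + 3]) for i in range(len(w) - 2)]

for a in (0, 1):
    for b in (0, 1):
        assert apply_word([1, a, 1, b]) == [1, (a + b + 1) % 2]
```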
The CA is left-permutive with zero memory. It is not right-closing, since it does not have a constant number of preimages: F^{−1}(0^∞) = {0^∞}, while F^{−1}(1^∞) = {1^∞, (01)^∞, (10)^∞}. It is almost equicontinuous with 2-blocking word 000 with offset 0. It is not transitive, but it is chain-transitive and its unique attractor is the full space (Blanchard and Maass [4]). While Σ_1(F) = 2^N, the two-column factor subshift Σ_2(F) = {10, 11}^N ∪ {11, 01}^N ∪ {01, 00}^N is sofic but not SFT, and the entropy is h(A^Z, F) = ln 2. For any a, b ∈ 2 we have f(1a1b) = 1c, where c = a + b + 1 (here f is the local rule and the addition is modulo 2). Define a CA (2^Z, G) by G(x)_i = (x_i + x_{i+1} + 1) mod 2 and a map φ : 2^Z → 2^Z by φ(x)_{2i} = 1, φ(x)_{2i+1} = x_i. Then φ : (2^Z, G) → (2^Z, F) is an injective morphism and (2^Z, G) is a transitive subsystem of (2^Z, F). If x_i = 0 for all i > 0 and x_{2i} = 1 for all i ≤ 0, then (O̅(x), F) is conjugated to the adding machine with periodic structure n = (2, 2, 2, ...), and I_n⁺((10)^∞) = 2. We have E(F) = ∅, A(F) = {0}, X⁻(F) = (−∞, 0) and X⁺(F) = ∅. We have S_0(F) = 2^Z and there exists an increasing sequence of non-transitive infinite signal subshifts S_{(0,2^n)}(F):

S_{(0,1)}(F) = S_{{10}} ⊂ S_{(0,2)}(F) = S_{{1010,1110}} ⊂ S_{(0,4)}(F) ⊂ ⋯

Example 19 (Gliders and Walls, Milnor [31]) The alphabet is A = {0, 1, 2, 3}, F(x)_i = f(x_{[i−1,i+1]}), where the local rule is (x stands for an arbitrary letter, and the first applicable production is used)

x3x:3, 12x:3, 1x2:3, x12:3, x1x:0, 1xx:1, xx2:2, x2x:0.
Topological Dynamics of Cellular Automata, Figure 20 Directional entropy of Gliders and walls
Topological Dynamics of Cellular Automata, Figure 21 ECA132
Topological Dynamics of Cellular Automata, Figure 22 The first image graph and the even subshift
Directional entropy has a discontinuity at α = 0 (see Fig. 20).

Example 20 (Golden mean subshift attractor: ECA132)

000:0, 001:0, 010:1, 011:0, 100:0, 101:0, 110:0, 111:1.

While the golden mean subshift S_{{11}} is the finite time maximal attractor of ECA12 (see Example 3), it is also an infinite time subshift attractor of ECA132. The clopen set U := [00]_0 ∪ [01]_0 ∪ [10]_0 is spreading and Ω_F(U) = S_{{11}} = S_{(0,1)}(F). There exist infinite signal subshifts S_{(−1,1)}(F) = S_{{10}}, S_{(1,1)}(F) = S_{{01}}, and the maximal attractor is their join

Ω_F = F²(S_{(−1,1)}(F) ∨ S_{(0,1)}(F) ∨ S_{(1,1)}(F)) = S_{{11}} ∪ F²(S_{(−1,1)}(F) ∨ S_{(1,1)}(F)).
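Convergence into the golden mean subshift can be observed on finite configurations; a sketch (helper names assumed, not from the article):

```python
# Sketch (helper names assumed, not from the article): ECA132 keeps
# isolated 1s (f(010) = 1) and the all-1 background (f(111) = 1) and
# kills everything else; blocks of 1s shrink, even-length blocks die and
# odd-length ones leave a fixed isolated 1, so orbits of finite
# perturbations fall into the golden mean subshift with no word 11.

def eca132_step(x):
    x = [0] + x + [0]             # embed the window in a sea of 0s
    return [1 if x[i - 1:i + 2] in ([0, 1, 0], [1, 1, 1]) else 0
            for i in range(1, len(x) - 1)]

x = [0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0]
for _ in range(4):
    x = eca132_step(x)
assert '11' not in ''.join(map(str, x))   # landed in the golden mean subshift
```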
Example 21 (A finite time sofic maximal attractor) (2^Z, F), where m = −1, a = 2 and the local rule is

0000:0, 0001:0, 0010:0, 0011:1, 0100:1, 0101:0, 0110:1, 0111:1, 1000:1, 1001:1, 1010:0, 1011:1, 1100:1, 1101:0, 1110:0, 1111:0.

The system has the finite time maximal attractor Ω_F = F(2^Z) = S_{{0 1^{2n+1} 0 : n≥0}} (Fig. 22 left). This is the even subshift, whose minimal right-resolving presentation is in
Topological Dynamics of Cellular Automata, Figure 23 Logarithmic perturbation speeds
Topological Dynamics of Cellular Automata, Figure 24 Sensitive system with logarithmic perturbation speeds
Fig. 22 right. We have E = {a, b, c}, ℓ(a) = ℓ(b) = 1, ℓ(c) = 0. A word is synchronizing in G (and intrinsically synchronizing) if it contains 0. The factor map ℓ is right-resolving and also left-resolving. Thus ℓ is left-closing and the even subshift is AFT. We have ℓ^{−1}(1^∞) = {(ab)^∞, (ba)^∞}, and |ℓ^{−1}(x)| = 1 for each x ≠ 1^∞. Thus, the even subshift is near-Markov, and it cannot be an infinite time maximal attractor.

Example 22 (Logarithmic perturbation speeds) (4^Z, F), where m = 0, a = 1, and the local rule is

00:0, 01:0, 02:1, 03:1, 10:1, 11:1, 12:2, 13:2, 20:0, 21:0, 22:1, 23:1, 30:3, 31:3, 32:3, 33:3.
The letter 3 is a 1-blocking word. Assume x_i = 3 for i > 0 and x_i ≠ 3 for i ≤ 0. If φ(x) = Σ_{i≤0} x_i 2^{−i} is finite, then φ(F(x)) = φ(x) + 1. Thus, (O̅(x), F) is conjugated to the adding machine with periodic structure n = (2, 2, 2, ...), although the system is not left-permutive. If x = 0^∞.3^∞, then for any i < 0, (F^n(x)_i)_{n≥0} has preperiod −i and period 2^{−i}. For the zero configuration we have

n < 2^s + s ⟹ F^n(W_s(0^∞)) ⊆ W_0(0^∞),
2^{s−1} + s − 1 ≤ n < 2^s + s ⟹ I_n(0^∞) = s − 1,

and therefore log₂ n − 1 < I_n(0^∞) < log₂ n + 1. More generally, for any x ∈ {0, 1, 2}^Z we have lim_{n→∞} I_n(x)/log₂ n = 1.
Example 23 (A sensitive CA with logarithmic perturbation speeds) (4^Z, F), where m = 0, a = 2 and the local rule is (x stands for an arbitrary letter; the more specific productions are applied first)

032:0, 132:1, 232:0, 33x:0, 03x:1, 02x:1, 12x:2, 13x:2, 23x:1, 20x:0, 21x:0, 22x:1.

A similar system is constructed in Bressaud and Tisseur [9]. If i < j are consecutive sites with x_i = x_j = 3, then F^n(x)_{(i,j)} acts as a counter machine whose binary
value is increased by one unless x_{j+1} = 2:

B_{ij}(x) = Σ_{k=1}^{j−i−1} x_{i+k} 2^{j−i−1−k},
B_{ij}(F(x)) = B_{ij}(x) + 1 − χ₂(x_{j+1}) − 2^{j−i−1} χ₂(x_{i+1}).

Here χ₂(2) = 1 and χ₂(x) = 0 for x ≠ 2. If x ∈ {0, 1, 2}^Z, then lim_{n→∞} I_n(x)/log₂(n) = 1. For periodic configurations which contain 3, Lyapunov exponents are positive. We have λ⁺((3 0^n)^∞) ≥ 2^{−n}.

Future Directions

There are two long-standing problems in topological dynamics of cellular automata. The first is concerned with expansivity. A positively expansive CA is conjugated to a one-sided full shift (Theorem 29). Thus, the dynamics of this class is completely understood. An analogous assertion claims that bijective expansive CA are conjugated to two-sided full shifts, or at least to two-sided subshifts of finite type (Conjecture 30). Some partial results have been obtained in Nasu [35]. Another open problem is concerned with chaotic systems. A dynamical system is called chaotic if it is topologically transitive, sensitive, and has a dense set of periodic points. Banks et al. [3] proved that topological transitivity and density of periodic points imply sensitivity, provided the state space is infinite. In the case of cellular automata, transitivity alone implies sensitivity (Codenotti and Margara [10], or a stronger result in Moothathu [32]). By Conjecture 15, every transitive CA (or even every surjective CA) has a dense set of periodic points. Partial results have been obtained by Boyle and Kitchens [6], Blanchard and Tisseur [5], and Acerbi et al. [1]. See Boyle and Lee [7] for further discussion. Interesting open problems are concerned with topological entropy. For CA with non-negative memory, the entropy of a CA can be obtained as the entropy of the column subshift whose width is the radius (Proposition 54). For the case of negative memory and positive anticipation, an analogous assertion would say that the entropy of the CA is the entropy of the column subshift of width 2r + 1 (Conjecture 55). Some partial results have been obtained in Di Lena [28].
Another aspect of information flow in CA is provided by Lyapunov exponents, which have many connections with both topological and measure-theoretical entropies (see Bressaud and Tisseur [9], or Pivato, Ergodic Theory of Cellular Automata). Conjecture 11 states that each sensitive CA has a configuration with a positive Lyapunov exponent.
Acknowledgments

I thank Marcus Pivato and Mathieu Sablik for careful reading of the paper and many valuable suggestions. The research was partially supported by the Research Program CTS MSM 0021620845.

Bibliography

Primary Literature

1. Acerbi L, Dennunzio A, Formenti E (2007) Shifting and lifting of a cellular automaton. In: Computational logic in the real world. Lecture Notes in Computer Science, vol 4497. Springer, Berlin, pp 1–10
2. Akin E, Auslander J, Berg K (1996) When is a transitive map chaotic? In: Bergelson, March, Rosenblatt (eds) Conference in Ergodic Theory and Probability. de Gruyter, Berlin, pp 25–40
3. Banks J, Brooks J, Cairns G, Davis G, Stacey P (1992) On Devaney's definition of chaos. Am Math Monthly 99:332–334
4. Blanchard F, Maass A (1996) Dynamical behaviour of Coven's aperiodic cellular automata. Theor Comput Sci 163:291–302
5. Blanchard F, Tisseur P (2000) Some properties of cellular automata with equicontinuous points. Ann Inst Henri Poincaré 36:569–582
6. Boyle M, Kitchens B (1999) Periodic points for onto cellular automata. Indag Math 10(4):483–493
7. Boyle M, Lee B (2007) Jointly periodic points in cellular automata: computer explorations and conjectures. (manuscript)
8. Boyle M, Lind D (1997) Expansive subdynamics. Trans Am Math Soc 349(1):55–102
9. Bressaud X, Tisseur P (2007) On a zero speed cellular automaton. Nonlinearity 20:1–19
10. Codenotti B, Margara L (1996) Transitive cellular automata are sensitive. Am Math Mon 103:58–62
11. Courbage M, Kamiński B (2006) On the directional entropy of Z²-actions generated by cellular automata. J Stat Phys 124(6):1499–1509
12. Coven EM (1980) Topological entropy of block maps. Proc Am Math Soc 78:590–594
13. Coven EM, Hedlund GA (1979) Periods of some non-linear shift registers. J Comb Theor A 27:186–197
14. Coven EM, Pivato M, Yassawi R (2007) Prevalence of odometers in cellular automata. Proc Am Math Soc 135:815–821
15.
Culik K, Hurd L, Yu S (1990) Computation theoretic aspects of cellular automata. Phys D 45:357–378
16. Formenti E, Kůrka P (2007) A search algorithm for the maximal attractor of a cellular automaton. In: Thomas W, Weil P (eds) STACS 2007. Lecture Notes in Computer Science, vol 4393. Springer, Berlin, pp 356–366
17. Formenti E, Kůrka P (2007) Subshift attractors of cellular automata. Nonlinearity 20:105–117
18. Gilman RH (1987) Classes of cellular automata. Ergod Theor Dynam Syst 7:105–118
19. Hedlund GA (1969) Endomorphisms and automorphisms of the shift dynamical system. Math Syst Theory 3:320–375
20. Hopcroft JE, Ullman JD (1979) Introduction to Automata Theory, Languages and Computation. Addison-Wesley, London
21. Hurd LP (1990) Recursive cellular automata invariant sets. Complex Syst 4:119–129
22. Hurley M (1990) Attractors in cellular automata. Ergod Theor Dynam Syst 10:131–140
23. Kitchens BP (1998) Symbolic Dynamics. Springer, Berlin
24. Kůrka P (2003) Topological and symbolic dynamics. Cours spécialisés, vol 11. Société Mathématique de France, Paris
25. Kůrka P (2005) On the measure attractor of a cellular automaton. Discret Contin Dyn Syst 2005(suppl):524–535
26. Kůrka P (2007) Cellular automata with infinite number of subshift attractors. Complex Syst 17:219–230
27. Kůrka P (1997) Languages, equicontinuity and attractors in cellular automata. Ergod Theor Dynam Syst 17:417–433
28. Di Lena P (2007) Decidable and computational properties of cellular automata. PhD thesis, Università di Bologna e Padova, Bologna
29. Lind D, Marcus B (1995) An Introduction to Symbolic Dynamics and Coding. Cambridge University Press, Cambridge
30. Maass A (1995) On the sofic limit set of cellular automata. Ergod Theor Dyn Syst 15:663–684
31. Milnor J (1988) On the entropy geometry of cellular automata. Complex Syst 2(3):357–385
32. Moothathu TKS (2005) Homogeneity of surjective cellular automata. Discret Contin Dyn Syst 13(1):195–202
33. Moothathu TKS (2006) Studies in topological dynamics with emphasis on cellular automata. PhD thesis, University of Hyderabad, Hyderabad
34. Nasu M (1995) Textile Systems for Endomorphisms and Automorphisms of the Shift. Mem Am Math Soc 546
35. Nasu M (2006) Textile systems and one-sided resolving automorphisms and endomorphisms of the shift. American Mathematical Society, Providence
36. von Neumann J (1951) The general and logical theory of automata. In: Jeffress LA (ed) Cerebral Mechanisms in Behavior. Wiley, New York
37. Sablik M (2006) Étude de l'action conjointe d'un automate cellulaire et du décalage: une approche topologique et ergodique. PhD thesis, Université de la Méditerranée, Marseille
38. Sablik M (2007) Directional dynamics of cellular automata: a sensitivity to initial conditions approach.
Theor Comput Sci 400(1–3):1–18
39. Shereshevski MA, Afraimovich VS (1992) Bipermutive cellular automata are topologically conjugated to the one-sided Bernoulli shift. Random Comput Dynam 1(1):91–98
40. Smillie J (1988) Properties of the directional entropy function for cellular automata. In: Dynamical systems. Lecture Notes in Mathematics, vol 1342. Springer, Berlin, pp 689–705
41. Wolfram S (1984) Computation theory of cellular automata. Comm Math Phys 96:15–57 42. Wolfram S (1986) Theory and Applications of Cellular Automata. World Scientific, Singapore
Books and Reviews

Burks AW (1970) Essays on Cellular Automata. University of Illinois Press, Chicago
Delorme M, Mazoyer J (1998) Cellular Automata: A Parallel Model. Kluwer, Amsterdam
Demongeot J, Goles E, Tchuente M (1985) Dynamical Systems and Cellular Automata. Academic Press, New York
Farmer JD, Toffoli T, Wolfram S (1984) Cellular Automata. North-Holland, Amsterdam
Garzon M (1995) Models of Massive Parallelism: Analysis of Cellular Automata and Neural Networks. Springer, Berlin
Goles E, Martinez S (1994) Cellular Automata, Dynamical Systems and Neural Networks. Kluwer, Amsterdam
Goles E, Martinez S (1996) Dynamics of Complex Interacting Systems. Kluwer, Amsterdam
Gutowitz H (1991) Cellular Automata: Theory and Experiment. MIT Press/Bradford Books, Cambridge, MA. ISBN 0-262-57086-6
Kitchens BP (1998) Symbolic Dynamics. Springer, Berlin
Kůrka P (2003) Topological and symbolic dynamics. Cours spécialisés, vol 11. Société Mathématique de France, Paris
Lind D, Marcus B (1995) An Introduction to Symbolic Dynamics and Coding. Cambridge University Press, Cambridge
Macucci M (2006) Quantum Cellular Automata: Theory, Experimentation and Prospects. World Scientific, London
Manneville P, Boccara N, Vichniac G, Bidaux R (1989) Cellular Automata and the Modeling of Complex Physical Systems. Springer, Berlin
Moore C (2003) New Constructions in Cellular Automata. Oxford University Press, Oxford
Toffoli T, Margolus N (1987) Cellular Automata Machines: A New Environment for Modeling. MIT Press, Cambridge, MA
von Neumann J (1951) The general and logical theory of automata. In: Jeffress LA (ed) Cerebral Mechanisms in Behavior. Wiley, New York
Wolfram S (1986) Theory and Applications of Cellular Automata. World Scientific, Singapore
Wolfram S (2002) A New Kind of Science. Wolfram Media, Champaign
Wuensche A, Lesser M (1992) The Global Dynamics of Cellular Automata. Santa Fe Institute Studies in the Sciences of Complexity, vol 1. Addison-Wesley, London
Two-Sided Matching Models
MARILDA SOTOMAYOR 1,2, ÖMER ÖZAK 2
1 Department of Economics, University of São Paulo/SP, São Paulo, Brazil
2 Department of Economics, Brown University, Providence, USA
Article Outline

Glossary
Definition of the Subject
Introduction
Discrete Two-Sided Matching Models
Continuous Two-Sided Matching Model with Additively Separable Utility Functions
Hybrid One-to-One Matching Model
Incentives
Future Directions
Bibliography

Glossary

Two-sided matching model is a game theoretical model whose elements are (i) two disjoint and finite sets of agents, F with m elements and W with n elements, referred to as the sides of the matching model; (ii) the structure of agents' preferences; and (iii) the agents' quotas. The rules of the game determine the feasible outcomes. The main activity of the agents from one set is to form partnerships with the agents of the other set. Players derive their payoffs from the set of partnerships they form. The agents belonging to F and W are called F-agents and W-agents, respectively.

Quota of an agent in a two-sided matching model is the maximum number of partnerships the agent is allowed to form. When every participant can form one partnership at most, the matching model is called one-to-one. If only the players of one of the sides can form more than one partnership, the matching model is said to be many-to-one. Otherwise the matching model is many-to-many.

Allowable set of partners for f ∈ F with quota r(f) is a family of elements of F ∪ W with k distinct W-agents, 0 ≤ k ≤ r(f), and r(f) − k repetitions of f.

Discrete two-sided matching model In the discrete two-sided matching models agents have preferences over allowable sets of partners. The allowable sets of partners for f of the type {w, f, ..., f} are identified with the individual agent w ∈ W, and the allowable set of partners {f, ..., f} is identified with f. Under this identification, agent w is acceptable to agent f if and only if f likes w as well as him/her/it self. Similar definitions and identifications apply to an agent w ∈ W. These preferences are transitive and complete, so they can be represented by ordered lists of preferences. The model can then be described by (F, W, P, r, s), where P is the profile of preferences and r and s are the arrays of quotas for the F-agents and W-agents, respectively.

Continuous two-sided matching model In this model the structure of preferences is given by utility functions which are continuous in some money variable which varies continuously in the set of real numbers. A particular case is obtained when agents place a monetary value on each possible partner or on each possible set of partners.

Hybrid two-sided matching model is a unification of the discrete and the continuous models. It is obtained by allowing the agents of both markets to trade with each other in the same market.

Matching in a two-sided matching model with sides F and W is a function μ that maps every agent into an allowable set of partners for him/her/it, such that f is in μ(w) if and only if w is in μ(f), for every (f, w) ∈ F × W. If we relax this condition the function is called a pre-matching. A matching μ describes the set of partnerships of the type (f, w), (f, f) or (w, w), with f ∈ F and w ∈ W, formed by the agents. We say that a player that does not enter any partnership is unmatched. Agents compare two matchings by comparing the two allowable sets of partners they obtain.

Feasible assignment for a two-sided matching model with sides F and W is an m × n matrix x = (x_fw) whose entries are zeros or ones, such that Σ_f x_fw ≤ s(w) for all w ∈ W and Σ_w x_fw ≤ r(f) for all f ∈ F. We say that x_fw = 1 if f and w form a partnership and x_fw = 0 otherwise. A feasible assignment x corresponds to a matching μ which matches f to w if and only if x_fw = 1. Thus, if Σ_f x_fw = 0 then w is unassigned at x or, equivalently, unmatched at μ, and if Σ_w x_fw = 0, then f is likewise unassigned at x or, equivalently, unmatched at μ.

Responsive preference in a discrete two-sided matching model with sides F and W. Agent f ∈ F has a responsive preference relation over allowable sets of partners if whenever (i) A and B are two allowable sets of partners for player f, (ii) w and w′ are two elements of W ∪ {f}, and (iii) A = B ∪ {w} ∖ {w′} with w ∉ B and w′ ∈ B, then f prefers A to B if and only if f prefers w to w′. Similarly we define the responsive preference for w ∈ W.
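The quota inequalities in the definition of a feasible assignment translate directly into code; a minimal sketch (the helper `is_feasible` is hypothetical, not from the article):

```python
# Sketch (hypothetical helper, not from the article): verifying the two
# inequality systems in the definition of a feasible assignment, with
# F-agents as rows and W-agents as columns.

def is_feasible(x, r, s):
    m, n = len(x), len(x[0])
    entries_ok = all(v in (0, 1) for row in x for v in row)
    rows_ok = all(sum(x[f]) <= r[f] for f in range(m))                       # quotas r(f)
    cols_ok = all(sum(x[f][w] for f in range(m)) <= s[w] for w in range(n))  # quotas s(w)
    return entries_ok and rows_ok and cols_ok

x = [[1, 0, 1],
     [0, 1, 0]]
assert is_feasible(x, r=[2, 1], s=[1, 1, 1])
assert not is_feasible([[1, 1]], r=[1], s=[1, 1])   # row sum exceeds r(f)
```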
r(f )-Separable preference in a discrete two-sided matching model with sides F and W. Agent f 2 F with a quota of r(f ) has a r(f )-separable preference relation over allowable sets of partners if whenever A D B [ fwg n f f g with w … B and f 2 B, then f prefers A to B if and only if f prefers w to f . Similarly we define s(w)-separable preference for w 2 W. Maximin preferences in a discrete two-sided matching model with sides F and W. Agent f 2 F with a quota of r(f ) has a maximin preference relation over allowable sets of partners if whenever two allowable sets C and C 0 contained in W, such that f prefers C 0 to C and no w in C is unacceptable to f , then a) all of C 0 are acceptable to f and b) if jCj D r( f ), then the least preferred worker in C 0 C is preferred by f to the least preferred worker in C C 0 . Similarly we define maximin preference for w 2 W. Choice set of f 2 F from A W(Ch f (A)) in a discrete two-sided matching model with sides F and W. Let B D fA0 jA0 is an allowable set of partners for f and A0 \ W is contained in Ag. Then, A0 2 Ch f (A) if and only if A0 2 B and f likes A0 at least as well as A00 , for all A00 2 B. Similarly we define Chw (A) for w 2 W and A F. Substitutable preferences in a discrete two-sided matching model with sides F and W. Agent f 2 F has a substitutable preference relation over allowable sets of partners if whenever A W and B W are such that A \ B D then (i) for all S 0 2 Ch f (A [ B) there is some S 2 Ch f (A) such that S 0 \ A S and (ii) for all S 2 Ch f (A) there is some S 0 2 Ch f (A [ B) such that S 0 \ A S. If an agent’s preference is responsive then it is substitutable. When preferences are strict, conditions (i) and (ii) are equivalent to requiring that if Ch f (A [ B) D S 0 then S 0 \ A Ch f (A). This concept is similarly defined for w 2 W. Strongly substitutable preferences in a discrete twosided matching model with sides F and W. 
Agent f ∈ F has a strongly substitutable preference relation over allowable sets of partners if for every pair of allowable sets of partners A and B such that A >_f B, if w ∈ Ch_f(W ∩ (A ∪ {w})), then w ∈ Ch_f(W ∩ (B ∪ {w})). This is a stronger condition than substitutability and responsiveness.

Additively separable preferences in a discrete two-sided matching model with sides F and W. Agent f ∈ F has additively separable preferences if he/she/it assigns a nonnegative number a_fw to each w ∈ W and assigns the value v(A) = ∑_{w∈A} a_fw to each allowable set A of
partners for f. Agent f compares two allowable sets by comparing the values of these sets. This concept is similarly defined for w ∈ W. If the agents have additively separable preferences we can think that, if a partnership (f, w) ∈ F × W is formed, then the partners participate in some joint activity that generates a payoff a_fw for player f and b_fw for player w. These numbers are fixed, i.e., they are not negotiable. If the preferences of the agents are additively separable then they are responsive. The converse is not true (see Kraft, Pratt and Seidenberg [47]).

T-map in a discrete two-sided matching model with sides F and W is defined as follows. For every pre-matching ν, let T(ν)(f) = Ch_f(U(f, ν)) for all f ∈ F, where U(f, ν) = {w ∈ W | f ∈ Ch_w(ν(w) ∪ {f})}. Similarly, T(ν)(w) = Ch_w(U(w, ν)) for all w ∈ W, where U(w, ν) = {f ∈ F | w ∈ Ch_f(ν(f) ∪ {w})}.

Lattice property A set L endowed with a partial order relation ≥ has the lattice property if sup{x, y} = x ∨ y and inf{x, y} = x ∧ y are in L, for all x, y ∈ L. The lattice is complete if all its subsets have a supremum and an infimum. (See Birkhoff [11]).

Pareto-optimal matching A feasible matching μ is Pareto-optimal if there is no feasible matching μ′ which is weakly preferred to μ by all players and strictly preferred by at least one of them.

Outcome For the discrete two-sided matching models the outcome is a matching or at least corresponds to a matching; for the continuous two-sided matching models the outcome specifies a payoff for each agent and a matching.

Stable outcome It is the natural solution concept for a two-sided matching model. It is also referred to as setwise-stable outcome. See the definition below.

F-optimal stable matching (respectively, payoff) for a discrete (respectively, continuous) two-sided matching model is the stable matching (respectively, payoff) that every agent in F likes at least as well as any other stable matching (respectively, payoff). Similarly we define the W-optimal stable matching (respectively, payoff).
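Under additive separability the choice set is easy to compute: f simply keeps the highest-valued workers available, up to its quota. A small illustrative sketch (names and data invented; for definiteness it drops workers valued at 0, about whom f is in fact indifferent):

```python
# Sketch (invented names): the choice set Ch_f(A) under additively separable
# preferences with quota r_f.  Firm f values worker w at a_fw >= 0 and values
# a set by the sum, so a most-preferred allowable subset of A keeps the
# highest-valued workers up to the quota.

def choice_set(a_f, A, r_f):
    """Return one most-preferred allowable subset of A for firm f."""
    positives = [w for w in A if a_f.get(w, 0) > 0]
    positives.sort(key=lambda w: a_f[w], reverse=True)
    return set(positives[:r_f])

def value(a_f, S):
    """The additively separable value v(S) = sum of a_fw over w in S."""
    return sum(a_f.get(w, 0) for w in S)

a_f = {"w1": 5, "w2": 3, "w3": 0, "w4": 8}
print(sorted(choice_set(a_f, {"w1", "w2", "w3", "w4"}, 2)))  # ['w1', 'w4']
print(value(a_f, {"w1", "w4"}))                              # 13
```

Note that this choice rule is responsive: replacing any chosen worker by a higher-valued one always raises the sum, illustrating the claim above that additive separability implies responsiveness.
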
Achievable mate for agent y in a discrete two-sided matching model is any partner of y under some stable matching.

Matching mechanism For the discrete two-sided matching models, a matching mechanism is a function h whose domain is the set of all possible inputs X = (F, W, P, r, s) and whose output h(X) is a matching for X.

Stable matching mechanism It is a matching mechanism h such that h(X) is always stable for the market X. If
h(X) always produces the F-optimal stable matching for X, then it is called an F-optimal stable matching mechanism, and so on.

Revelation mechanism Given the discrete two-sided matching model (F, W, P, r, s), a revelation mechanism is the restriction of a matching mechanism h to the set of discrete two-sided matching markets (F, W, Q, r, s) where the sets of agents and the quotas are fixed.

Revelation game It is the strategic game induced by a revelation mechanism for the discrete two-sided matching market (F, W, P, r, s): the set of players is the union of F and W; a strategy of player j is any possible list of preferences Q(j) that player j can state; the outcome function is given by the mechanism h; and the preferences of the players over the set of outcomes are determined by P.

Sincere strategy for a player j in a revelation game is the true list of preferences P(j).

Manipulable mechanism A mechanism h is manipulable (i.e., it is not strategy-proof) if in some revelation game induced by h, stating the true preferences is not a dominant strategy for at least one player. A mechanism h is collectively manipulable if in some revelation game induced by h there is a coalition whose members can all be better off by misrepresenting their preferences.

Rematching proof equilibrium is a Nash equilibrium profile from which no pair of players (f, w) ∈ F × W can profitably deviate, given that the other players do not change their strategies.

Truncation Let P(a) be agent a's preference list over individuals for a discrete two-sided matching model. For a ∈ F ∪ W, the list of preferences Q(a) over individuals is a truncation of P(a) at some agent b if Q(a) keeps the same ordering as P(a) but ranks as unacceptable all agents ranked below b.

Definition of the Subject

This article describes the basic elements of the cooperative and non-cooperative approaches for two-sided matching models and analyzes the fundamental differences and similarities between some of these models.
Basic Definitions

Feasible outcome is an outcome that is specified by the rules of the game. In the discrete case, a feasible outcome is a feasible matching or at least corresponds to a feasible matching. The usual definition is the following. The matching μ is feasible if it
matches every agent to an allowable set of partners and μ(f) ∈ Ch_f(μ(f) ∩ W) and μ(w) ∈ Ch_w(μ(w) ∩ F) for every (f, w) ∈ F × W. Then, if preferences are responsive, every matched pair is mutually acceptable. An implication of this definition is that a feasible outcome is always individually rational. In the continuous case, the rules of the game may specify, for example, whether the agents negotiate their payoffs individually within each partnership or whether they negotiate in blocks. In the former case a feasible outcome specifies an array of individual payoffs for each player, indexed according to the partnerships formed under the corresponding matching. In the latter case the feasible outcome only specifies a total payoff for each agent.

Corewise-stability is a solution concept that assigns to each two-sided matching model the set of feasible outcomes which are not dominated by any feasible outcome via a coalition. The feasible outcome x dominates the feasible outcome y via the coalition S ≠ ∅ if (i) every player in S prefers x to y and (ii) if j ∈ S then all of j's partners under x belong to S. Coalition S is said to block the outcome y. It is a natural solution concept for the one-to-one matching models and for the continuous many-to-one matching models.

Strong-corewise-stability is a solution concept that assigns to each two-sided matching model the set of feasible outcomes which are not weakly dominated by any feasible outcome via a coalition. The feasible outcome x weakly dominates the feasible outcome y via the coalition S ≠ ∅ if (i) all players in S weakly prefer x to y and at least one of them strictly prefers x to y, and (ii) if j ∈ S then all of j's partners under x belong to S. Coalition S is said to weakly block the outcome y. Strong-corewise-stability is a natural solution concept for the discrete many-to-one matching models.
Pairwise-stability is a solution concept that assigns to each two-sided matching model the set of feasible outcomes which are not quasi-dominated by any feasible outcome via a pair of agents (f, w) ∈ F × W. For the discrete case an outcome can be identified with a matching. We say that the feasible matching x quasi-dominates the feasible matching y via the coalition S ≠ ∅ if (i) every player in S prefers the matching x to the matching y and (ii) if j ∈ S and k ∈ x(j) then k ∈ S ∪ y(j). For the continuous case in which agents negotiate their payoffs individually with each partner, the feasible outcome x quasi-dominates the feasible outcome y via the coalition S ≠ ∅ if (i) every player in S gets a higher total payoff under x than under y and (ii) if j ∈ S and k ∈ x(j) then k ∈ S or k ∈ y(j), in which case k keeps the same individual payment obtained with j under y. For the continuous case in which agents cannot negotiate their individual payments, the definition of quasi-dominance is equivalent to that of dominance. In any case coalition S is said to destabilize the outcome y. Thus, for the discrete models, for example, a matching is pairwise-stable if it is feasible and is not destabilized by any pair (f, w) ∈ F × W. The pair of agents (f, w) destabilizes the matching if these agents prefer each other to some of their current mates. Pairwise-stability is a natural solution concept for the marriage and the college admission models, as well as for continuous matching models in which the agents negotiate their individual payoffs.

Setwise-stability is the solution concept that assigns to each two-sided matching model the set of feasible outcomes which are not quasi-dominated by any feasible outcome via a coalition. It is a generalization of the group stability concept defined by Roth [65] for the college admissions model. An attempt to extend the concept of group stability to a discrete many-to-many matching market with substitutable and strict preferences was not successful in Roth [62]. The concept introduced in that paper is equivalent to pairwise-stability. It turns out that pairwise-stable matchings may be blocked by coalitions of size greater than two in this model. In fact, an example of a pairwise-stable matching that is corewise-unstable is presented in Blair [12] and another one in Sotomayor [80]. Thus, the pairwise-stability concept cannot be regarded as the natural cooperative solution concept for this model.
Setwise-stability, regarded as a new cooperative equilibrium concept different from the core concept, was obtained for the first time in Sotomayor [75] in a many-to-many matching model with additively separable utilities. In this model the set of setwise-stable outcomes equals the set of pairwise-stable outcomes and may be smaller than the core. The setwise-stability concept was introduced in Sotomayor [80] as a refinement of the core concept and as the natural cooperative equilibrium concept for a two-sided matching model. In the discrete many-to-many matching models it is stronger than pairwise-stability plus corewise-stability (see Example 3 below). Since the essential coalitions in the marriage and in the college admission models are pairs of players made up
of one agent from each side of the market, a setwise-stable matching is a pairwise-stable matching in these models. If the preferences are strict, the pairwise-stable matchings are the corewise-stable matchings in the marriage model and are the strong corewise-stable matchings in the college admissions model.

A Brief Historical Account

The theory of two-sided matching markets was born in 1962, the year of the publication of the seminal paper by Gale and Shapley, who formulated the stable matching problem for the marriage and the college admissions markets. The problem of "college admissions" as described in that paper involves a set of colleges and a set of students. Each student lists in order of strict preference those colleges he/she wishes to attend, and each college lists in order of strict preference those students it is willing to admit. Furthermore, each college has a quota, representing the maximum number of students it can admit. The problem is then to devise some allocation procedure of students to colleges that takes account of their respective preferences. More specifically, given any set of colleges and students, together with their preferences and quotas, can one find a stable matching? The answer to this question is affirmative. For the existence proof Gale and Shapley constructed a simple deferred-acceptance algorithm which, starting from the given data, leads to a stable matching in a finite number of steps. Although there may be many stable matchings, the matching obtained in this way is the unique one that is preferred by all students to any other stable matching. For this reason, it is called the optimal stable matching for the students. In their 1962 paper they remark that: ". . . even though we no longer have the symmetry of the marriage problem, we can still invert our admissions procedure to obtain the unique "college optimal" assignment.
The inverted method bears some resemblance to a fraternity "crush week": it starts with each college making bids to those applicants it considers most desirable, up to its quota limit, and then the bid-for students reject all but the most attractive offer, and so on." In the paper mentioned above the authors express some reservations about the possibilities of application of their algorithm. It turns out that, since 1951, a mathematically equivalent algorithm had been in use by the National Resident Matching Program (NRMP), located in Evanston, Illinois. The NRMP has the task each year of assigning graduates of all medical schools in the United States to hospitals where they are required to serve a year's residency. The algorithm used by the NRMP was mathematically equivalent to the one described in Gale and Shapley [30] to produce the optimal stable matching for the colleges. Thus, in this algorithm, each hospital applies to its quota of students. The confirmation of this fact was obtained by David Gale in 1975. In a letter of December 8, 1975, in response to a letter from Gale to the NRMP, Elliott Peranson, a consultant to the NRMP responsible for the technical operation of the matching program, says the following: "However, I might point out that the NRMP algorithm in fact uses the inverse procedure and produces the unique "college optimal" assignment rather than this "student optimal" assignment. This procedure more closely parallels the actual admissions process where a matching algorithm is not used. In this case students apply to all hospitals they would consider (not just their first choice), each hospital then selects the most desirable students, up to its quota limit, from amongst all applicants, then the "bid-for" students reject all but the most desirable offer, and so on". Hence, the proof that the NRMP was yielding a stable matching, namely the optimal stable matching for the colleges, is that such an outcome is always obtained by the Gale and Shapley algorithm with the colleges proposing. The discovery that the two algorithms were mathematically equivalent was first spread orally and later reported in Gale and Sotomayor [32] with an equivalent description of the NRMP algorithm. (Roth [63] also presents the NRMP algorithm.) This was the first application of matching theory of which we have knowledge. The algorithms proposed by Gale and Shapley are described below. Their description is quoted from Gale and Sotomayor [32].

Gale–Shapley Algorithm with the Colleges Proposing to the Applicants

Each hospital H tentatively admits its quota q_H consisting of the top q_H applicants on its list.
Applicants who are tentatively admitted to more than one hospital tentatively accept the one they prefer. Their names are then removed from the lists of all other hospitals which have tentatively admitted them. This gives the first tentative matching. Hospitals which now fall short of their quota again admit tentatively until either their quota is again filled or they have exhausted their list. Admitted applicants again reject all but their favorite hospital, giving the second tentative matching, etc. The algorithm terminates when, after some tentative matching, no hospitals can admit any more applicants either because their quota is full or they have exhausted their list. The tentative matching then becomes permanent.
Gale–Shapley Algorithm with the Applicants Proposing to the Colleges

Each applicant petitions for admission to his/her favorite hospital. In general, some hospitals will have more petitioners than allowed by their quota. Such oversubscribed hospitals now reject the lowest-ranked petitioners on their preference lists so as to come within their quota. This is the first tentative matching. Next, rejected applicants petition for admission to their second-favorite hospital and again oversubscribed hospitals reject the overflow, etc. The algorithm terminates when every applicant is tentatively admitted or has been rejected by every hospital on his list.

The fact that the matching produced by the NRMP algorithm is stable stands as one of the most important applications of game theory to economics. For about fifty years the allocation procedures used to assign interns to hospitals in the United States produced unstable matchings. Unsuccessful procedures were often proposed by the Association of American Medical Colleges. This series of events culminated with a centralized mechanism that employed the NRMP algorithm. Such a centralized mechanism lasted for almost fifty years, suggesting that interns and hospitals had reached an equilibrium. And the paper written by Gale and Shapley corroborated that the game-theoretical predictions were, once more, correct. Before the publication of Gale and Sotomayor [32], a few but important contributions were made to the theory of two-sided matching markets. The famous 1972 paper by Shapley and Shubik establishes the assignment game via the introduction of money, as a continuous variable, into the marriage model. The book Mariages Stables by Knuth was published in 1976. This volume presents the proof, attributed to Conway, that the set of stable matchings for the marriage model is a lattice. The assignment game was generalized by Kelso and Crawford [42] to a model where the utilities satisfy a gross substitutes condition.
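The applicant-proposing procedure quoted above can be sketched as follows. This is an illustrative implementation with invented names and data, not the NRMP code; reversing the roles of the two sides gives the college-proposing version.

```python
# Sketch of applicant-proposing deferred acceptance with quotas
# (college admissions model).  All names and data are invented.

def deferred_acceptance(applicant_prefs, hospital_prefs, quotas):
    """applicant_prefs: applicant -> list of acceptable hospitals, best first.
    hospital_prefs: hospital -> list of acceptable applicants, best first.
    quotas: hospital -> quota.  Returns hospital -> set of admitted applicants.
    """
    rank = {h: {a: i for i, a in enumerate(prefs)}
            for h, prefs in hospital_prefs.items()}
    next_choice = {a: 0 for a in applicant_prefs}   # next hospital to petition
    admitted = {h: set() for h in hospital_prefs}   # tentative admissions
    free = list(applicant_prefs)
    while free:
        a = free.pop()
        prefs = applicant_prefs[a]
        while next_choice[a] < len(prefs):
            h = prefs[next_choice[a]]
            next_choice[a] += 1
            if a not in rank[h]:
                continue                            # a is unacceptable to h
            admitted[h].add(a)
            if len(admitted[h]) <= quotas[h]:
                break                               # tentatively admitted
            # Oversubscribed: reject the lowest-ranked petitioner.
            worst = max(admitted[h], key=lambda x: rank[h][x])
            admitted[h].remove(worst)
            if worst != a:
                free.append(worst)                  # displaced applicant
                break
            # a itself was rejected: keep going down a's list.
    return admitted

apps = {"s1": ["h1", "h2"], "s2": ["h1"], "s3": ["h1", "h2"]}
hosp = {"h1": ["s1", "s3", "s2"], "h2": ["s2", "s1", "s3"]}
print(deferred_acceptance(apps, hosp, {"h1": 1, "h2": 1}))
# {'h1': {'s1'}, 'h2': {'s3'}}  (s2 ends up unmatched)
```

The output does not depend on the order in which free applicants are processed, which mirrors the fact that the deferred acceptance procedure yields the same (student-optimal) stable matching however the tentative steps are scheduled.
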
Another generalization, which considers continuous utility functions, not necessarily linear, was presented in Demange and Gale [18]. Nevertheless, among the contributions of this period one caused considerable impact: the non-manipulability theorem of Dubins and Freedman [21]. In a stable revelation mechanism, for every profile of preferences that can be selected by the agents, some algorithm that yields a stable matching is used. These authors prove that the revelation mechanism that produces the optimal stable matching for a given side of the marriage market is not collectively manipulable by the agents of that side. This non-manipulability result also holds for
the college admission market when the mechanism yields the student-optimal stable matching. An analogue of this result was proved in Demange and Gale [18] for the continuous model through a key lemma that became known in the literature as the Blocking Lemma. The main challenge that motivated Gale and Sotomayor [32] was to prove the discrete version of the Blocking Lemma, which allowed them to prove the non-manipulability theorem in just three lines. Two simple and short proofs (one with the use of the algorithm and one without it) of Dubins and Freedman's theorem were presented as an alternative to the original proof, which was about twenty pages long. An example in Dubins and Freedman [21], where some woman can be better off by falsifying her preference list when the man-optimal stable matching is to be employed, motivated Gale and Sotomayor [31]. This paper proves that such a mechanism can almost always be manipulated by the women and then treats the strategic possibilities for these agents in the corresponding strategic game. Another paper of this period was Roth [60], which proves, via an example, that any rule for selecting a stable matching is manipulable (either by some man or some woman). The existence theorem of manipulability by the women, the Impossibility Theorem and the non-manipulability theorem attracted the authors' interest toward a fruitful line of investigation concerning the incentives facing the agents when an allocation mechanism is employed. Algorithms have been developed for this purpose for several matching models. The games induced by such mechanisms are played non-cooperatively by the agents and, in general, their self-enforcing agreements lead to a stable outcome. In these cases a non-cooperative implementation of the set of stable outcomes is provided.
Analyzing the strategic behavior of the agents in such games has been an important subject of research for several authors in an attempt to get precise answers to the strategic questions raised. In this direction we can cite Roth [60,61], Gale and Sotomayor [31], Perez-Castrillo and Sotomayor [56], Sotomayor [87,88,94], Kamecke [38], Kara and Sönmez [40,41], Alcalde [5], and Alcalde, Perez-Castrillo and Romero-Medina [6], among others. Over all these years the popularity of matching theory has spread among mathematicians and economists, mainly due to the publication in 1990 of the first edition of the book Two-Sided Matching: A Study in Game-Theoretic Modeling and Analysis, by Roth and Sotomayor, which attempts a comprehensive survey of the main results on two-sided matching theory up to that date. (An extensive bibliography can also be found at http://kuznets.fas.harvard.edu/~aroth/bib.html#matchbib).
The stable matching problem has been generalized to several two-sided matching models, which have been widely modeled and analyzed under both cooperative and non-cooperative game-theoretic approaches. Through these models a variety of markets has become better understood, which has contributed considerably to their organization. The deferred acceptance algorithm of Gale and Shapley, and adaptations of it, have been applied in the reorganization of the admission processes of many two-sided matching markets. And the design of these mechanisms has also raised new theoretical questions. (In this connection see e.g. Balinski and Sönmez [8], Ergin and Sönmez [28], Pathak and Sönmez [55], Abdulkadiroglu and Sönmez [1], and Bardella and Sotomayor [9].)

Introduction

The two-sided matching theory is the study of game-theoretical models in which the set of players is partitioned into the disjoint union of two finite sets, and the main activity of the agents from one set is to form partnerships with the agents of the other set. In addition, a structure of preferences is available for the players, as well as an array of quotas, one for each participant, representing the maximum number of partnerships that he/she/it is allowed to form. The rules of the game determine the feasible outcomes. These models are called two-sided matching models. They are suitable for modeling a great variety of labor markets of firms and workers, markets of buyers and sellers, markets of students and schools, etc. They are grouped into three categories: the discrete, the continuous and the hybrid matching models. The distinctions between them are based on the kind of preference structure they are endowed with. Within each category the models vary as to the possibilities of the agents to form multiple partnerships, the rules of the game and some variation of the structure of the preferences.
Given the sets of agents, the structure of preferences and quotas, and the rules of the game, the questions that emerge are:
A. Which partnerships should be formed?
B. If a partnership is formed, what payoff should be awarded to each agent?
The prediction should be outcomes that cannot be upset by any coalition. The idea is that, in games where players form partnerships, coalitions should be allowed to form in which the members keep some of their current partners if they wish and replace the others with new partners belonging to the coalition. Furthermore, by doing this they get a preferable outcome. This intuitive idea is captured by a refinement of the core concept introduced in Sotomayor [80], called setwise-stability (stability, for short). For the marriage model and the college admission model with responsive preferences this concept coincides with the one introduced by Gale and Shapley. This review is an attempt to understand some of the differences and similarities between some matching models. We do that by analyzing both the cooperative and the non-cooperative structure of these models. Some future directions will then be presented. The rest of the article is organized as follows. Section "Discrete Two-Sided Matching Models" is devoted to the cooperative approach of the discrete matching models with responsive preferences. It introduces the basic cooperative one-to-one, many-to-one and many-to-many matching models and discusses the main properties that characterize the set of stable outcomes for these models. This kind of analysis is also provided in Sect. "Continuous Two-Sided Matching Model with Additively Separable Utility Functions" for the continuous matching models where the utilities are additively separable. Section "Hybrid One-to-One Matching Model" discusses some fundamental similarities and differences between a one-to-one hybrid model and its corresponding non-hybrid matching models. Strategic questions are treated in Sect. "Incentives". Section "Future Directions" presents some future directions and open problems.

Discrete Two-Sided Matching Models

There are two finite and disjoint sets of agents, F and W, which we may think of as sets of men and women, colleges and students, firms and workers, etc. To fix ideas, let us describe the model in terms of firms and workers. Then F is a set of m firms and W is a set of n workers. Salaries cannot be negotiated and are part of the job description. Each worker w has a quota s(w) representing the maximum number of jobs in different firms she can take.
Each firm f has a quota r(f) representing the maximum number of workers it can hire. Let r and s denote the arrays of quotas of the firms and the workers, respectively. Since the agents form partnerships, they always have preferences over potential individual partners, that is, over allowable sets of partners containing only one agent from the opposite side of the market. These preferences are assumed to be strict. They are transitive and complete, so they can be represented by ordered lists of preferences. Thus, the individual preference relation of firm f can be represented by an ordered list of preferences P(f) on the set W ∪ {f}; the individual preference relation of
worker w can be represented by an ordered list of preferences P(w) on the set F ∪ {w}. The array of these preferences will be denoted by P. Then, an agent w is acceptable to an agent f if w >_f f. Similarly, an agent f is acceptable to an agent w if f >_w w. If an agent may form more than one partnership then his/her/its preferences are not restricted to the individual potential partners. That is, agents have preferences over any allowable set of partners. Given two allowable sets of partners, A and B, for agent y ∈ F ∪ W, we write A >_y B to mean that y prefers A to B, and A ≥_y B to mean that y likes A at least as well as B. In order to fix ideas we will assume that these preferences are responsive to the agents' individual preferences and are not necessarily strict. The rules of the market are that any firm-worker pair may sign an employment contract with each other if they both agree; any firm may choose to keep one or more of its positions unfilled; and any worker may choose not to fill his or her quota of jobs. When the quota of every agent is one, we have the marriage model. In this case an allowable set of partners for any agent is a singleton. Therefore, every agent only has preferences over individuals. If only the agents of one of the sides are allowed to have a quota greater than one, then we have the so-called college admission model with responsive preferences. In both models it is a matter of verification that the sets of setwise-stable matchings, pairwise-stable matchings and strong corewise-stable matchings coincide. For the many-to-many case this is not always true. The strong corewise-stability concept is not a natural solution concept for this model, as Example 1 shows.

Example 1 (Sotomayor [80]) (The strong corewise-stability concept is not a natural solution concept for the many-to-many case) Consider two firms f1 and f2 and two workers w1 and w2.
Each firm may employ and wants to employ both workers; worker w1 may take, at most, one job and prefers f 1 to f 2 ; worker w2 may work and wants to work for both firms. If the agents can communicate with each other, the outcome that we expect to observe in this market is obvious: f 1 hires both workers and f 2 hires only worker w2 . Of course this outcome is in the strong core. Since f 1 has a quota of two and w1 prefers f 1 to f 2 we cannot expect to observe the strong corewise-stable outcome where f 1 hires only w2 and f 2 hires both workers. That is, both outcomes are in the strong core, but only the first one is expected to occur. Our explanation for this is that only the first outcome is setwise-stable.
On the other hand, the pairwise-stability concept is not a refinement of the core for the discrete many-to-many matching models. See Example 2.
Two-Sided Matching Models, Table 1 Payoff Matrix of Example 2. Each row represents a firm and each column a worker. The values in the cell represent the payoffs to the corresponding firm (first value) and worker (second value)
Example 2 (Sotomayor [80]) (A pairwise-stable matching which is not in the core) Consider the following labor market with a set of firms F = {f1, f2, f3, f4} and a set of workers W = {w1, w2, w3, w4}, where each firm can hire two workers and each worker can work for two firms. If firm fi hires worker wj then fi gets the profit a_ij and wj gets the salary b_ij. The pairs of numbers (a_ij, b_ij) are given in Table 1. Consider the matching where f1 and f2 are matched to {w3, w4} and f3 and f4 are matched to {w1, w2}. (The payoffs of each matched pair are marked in Table 1.) This matching is pairwise-stable. In fact, f3 and f4 do not belong to any pair which causes an instability, because they are matched to their two best choices, w1 and w2; (f1, w1) and (f1, w2) do not cause instabilities, since f1 is the worst choice for w1 and w2 is the worst choice for f1; (f2, w1) and (f2, w2) do not cause instabilities, since w1 is the worst choice for f2 and f2 is the worst choice for w2. Nevertheless, f1 and f2 prefer {w1, w2} to {w3, w4}, and w1 and w2 prefer {f1, f2} to {f3, f4}. Hence this matching is not in the core, since it is blocked by the coalition {f1, f2, w1, w2}.

Example 3 shows that setwise-stability is a strictly stronger requirement than pairwise-stability plus strong corewise-stability. It presents a situation in which the set of setwise-stable matchings is a proper subset of the intersection of the strong core with the set of pairwise-stable matchings.

Example 3 (Sotomayor [80]) (A strong corewise-stable matching which is pairwise-stable and is not setwise-stable) Consider the following labor market with a set of firms F = {f1, f2, f3, f4, f5, f6} and a set of workers W = {w1, w2, w3, w4, w5, w6, w7}, where r1 = 3, r2 = r5 = 2, r3 = r4 = r6 = 1, s1 = s2 = s4 = 2 and s3 = s5 = s6 = s7 = 1. If firm fi hires worker wj then fi gets profit a_ij and the worker gets salary b_ij.
The pairs of numbers (a i j ; b i j ) are given in Table 2. Let be the matching given by ( f1 ) D fw2 ; w3 ; w7 g ; ( f2 ) D fw5 ; w6 g ; ( f3 ) D ( f4 ) D fw1 g ; ( f5 ) D fw2 ; w4 g ; and ( f6 ) D fw4 g : The associated payoffs are shown in bold in Table 2. This matching is in the strong core. In fact, if there is a matching 0 which weakly dominates via some coalition A, then, under 0 , no player in A is worse off and at least one player in A is better off. Furthermore, matching
10,1 1,10 10,4 10,2
1,10 10,1 4,4 4,2
4,10 4,4 2,2 2,1
2,10 2,4 1,2 1,1
Two-Sided Matching Models, Table 2 Payoff matrix of Example 3. Each row represents a firm and each column a worker; the values in the cell represent the payoffs to the corresponding firm (first value) and worker (second value), and the payoffs of matched pairs are marked with *

        w1       w2      w3      w4      w5      w6      w7
f1     13,1   *14,10   *4,10    1,10     0,0     0,0   *3,10
f2     1,10     0,0     0,0    10,1    *4,10   *2,10    0,0
f3    *10,4     0,0     0,0     0,0     0,0     0,0     0,0
f4    *10,2     0,0     0,0     0,0     0,0     0,0     0,0
f5      0,0    *9,9     0,0   *10,4     0,0     0,0     0,0
f6      0,0     0,0     0,0   *10,2     0,0     0,0     0,0
μ′ must match all members of A among themselves. By inspection we can see that the only players who can be better off are f1, f2, w1 and w4, for all remaining players are matched to their best choices. However, if A contains one player of the set {f1, f2, w1, w4}, then A must contain all four players. In fact, if f1 ∈ A, then f1 must form a new partnership with w1, so w1 ∈ A. If w1 ∈ A, then w1 must form a new partnership with f2, so f2 ∈ A. If f2 ∈ A, then f2 must form a new partnership with w4, so w4 ∈ A. Finally, if w4 ∈ A, then w4 must form a new partnership with f1, so f1 ∈ A. Thus, if μ′ weakly dominates μ via A, then f1, f2, w1 and w4 are in A and f1 and f2 form new partnerships with w1 and w4. Nevertheless, f1 must keep his partnership with w2, his best choice. Then w2 must be in A, so she cannot be worse off, and so f5 must also be in A. But f5 requires the partnership with w4, who has a quota of 2 and has already filled her quota with f1 and f2. Hence f5 is worse off at μ′ than at μ, and then μ′ cannot weakly dominate μ via A. The matching μ is clearly pairwise-stable. Nevertheless, the coalition {f1, f2, w1, w4} causes an instability in μ. In fact, if f1 is matched to {w1, w2, w4} and f2 is matched to {w1, w4}, then f1 gets 28 and the rest of the players in the coalition get 11 instead of 6. Hence, the matching μ is not setwise-stable. The question is then to know if, given any two sets of agents with their respective preferences and quotas, one can always find a setwise-stable matching. The answer
is affirmative for the one-to-one matching model and for the many-to-one matching models with substitutable preferences. The existence of setwise-stable matchings for the marriage model was first proved by Gale and Shapley [30]. Sotomayor [76] also provides a simple proof of the existence of stable matchings for the marriage model that connects stability with a broader notion of stability with respect to unmatched agents. Gale and Shapley construct a deferred acceptance algorithm, described below, and prove that it yields a stable matching in a finite number of steps.

The deferred acceptance algorithm with the men making the proposals. Each man begins by proposing to his favorite woman (the first woman on his preference list). Each woman rejects the proposal of any man unacceptable to her, and in case she gets several proposals, she keeps only her most preferred one. If a man is not rejected at this step he is kept engaged. At any step, any man who was rejected at the previous step proposes to his next choice (his most preferred woman among those who have not rejected him), as long as there remains an acceptable woman to whom he has not yet proposed (if at some step a man has been rejected by all of his acceptable women, he issues no further proposals). The algorithm terminates after any step in which no man is rejected, because then every man is either engaged to some woman or has been rejected by every woman on his list of acceptable women. Women who did not receive any acceptable proposals, and men rejected by all women acceptable to them, remain single. To see that the matching yielded by this algorithm is stable, first observe that no agent has an unacceptable partner. In addition, if there are some man f and woman w not matched to each other and such that f prefers w to his current partner, then woman w is acceptable to man f and so he must have proposed to her at some step of the algorithm.
But then he must have been rejected by w in favor of someone she prefers to f. Therefore, w is matched to a man whom she prefers to f, by the transitivity of the preferences, and so f and w do not destabilize the matching. For the college admission model with responsive preferences, Gale and Shapley defined a related marriage model in which each college is replicated a number of times equal to its quota, so that in the related model every agent has a quota of one. If f1, ..., fr(f) are the r(f) copies of college f, then each of these fi's has preferences over individuals that are identical to those of f. Each student's preference list is changed by replacing f, wherever it appears on his/her list, by the string f1, ..., fr(f), in that order of preference. Therefore, the stable matchings of the related marriage market are in natural one-to-one correspondence with the
stable matchings of the college admission market. By using the existence theorem for the marriage model we obtain the corresponding result for the college admission model. The existence proof for the many-to-one case with strict and substitutable preferences was first given by Kelso and Crawford [42] through a variant of the deferred acceptance algorithm. If the market does not have two sides, or the many-to-one matching model does not have substitutable preferences, then the set of setwise-stable matchings may be empty. Gale and Shapley [30] present an example of a one-sided matching model that does not have any stable matchings. This model was called by these authors the "roommate problem". In this example there are four agents {a, b, c, d}, such that a's first choice is b, b's first choice is c, c's first choice is a, and d is the last choice of all the other agents. Of course, if some agent is unmatched then there will be two unmatched agents and they will destabilize the matching. If everyone is matched, the agent who is matched to d will form a blocking pair with the agent who lists him at the head of his list. Therefore, there is no stable matching in this example. A small literature has grown around the issues of finding conditions under which the set of stable matchings is non-empty for the roommate problem, and the performance of algorithms that can produce them when they exist (see Abeledo and Isaak [3], Irving [37], Tan [95], Chung [13] and Sotomayor [89]). If preferences are not substitutable, Example 2.7 of Roth and Sotomayor [69] shows that setwise-stable matchings may not exist in the many-to-one case. Even in the simplest case of preferences representable by additively separable utility functions, setwise-stable matchings may not exist in the many-to-many case. See the example below. Example 4 (Sotomayor [80]) (Nonexistence of stable matchings) Consider again the matching model of Example 2.
We are going to show that the set of setwise-stable matchings is empty. First, observe that f3 prefers {w1, w2} to any other set of players and f3 is the second choice for w1 and w2; w3 prefers {f1, f2} to any other set of players and w3 is the second choice for f1 and f2. Then, in any stable matching μ, f3 must be matched to w1 and w2, while w3 must be matched to f1 and f2. Separate the cases by considering the possibilities for the second partner of w1, under a supposed stable matching μ: Case 1: (w1 is matched to {f2, f3}.) Then f2 is not matched to w4, and we have that {f2, w4} causes an instability in the matching, since f2 prefers w4 to w1 and f2 is the second choice for w4.
Case 2: (w1 is matched to {f3, f4}.) Then the following possibilities occur: (i) (w2 is matched to {f3, f4}.) Then {f1, f2, w1, w2} causes an instability in the matching. (This matching is pairwise-stable, but it is not in the core.) (ii) (w2 is matched to {f3, f1}.) Then {f1, w4} causes an instability in the matching, since f1 is the first choice for w4 and f1 prefers w4 to w2. (iii) (w2 is matched to {f3, f2} or {f3}.) Then {f4, w2} causes an instability in both cases, since w2 is the second choice for f4, and w2 prefers f4 to f2 and prefers f4 to leaving a position unfilled. Case 3: (w1 is matched to {f1, f3} or {f3}.) Then {f4, w1} causes an instability in both cases, since w1 is the first choice for f4, and w1 prefers f4 to f1 and prefers f4 to leaving a position unfilled. Hence, there are no stable matchings in this example. Pairwise-stable matchings always exist when the preferences are substitutable. When preferences are strict, Roth [62] presents an algorithm that finds a pairwise-stable matching for a many-to-many matching model with substitutable preferences. Sotomayor [80] provides a simple and non-constructive proof of the existence of pairwise-stable matchings for the general discrete many-to-many matching model with substitutable and not-necessarily-strict preferences. Martínez et al. [51] construct an algorithm which finds the whole set of pairwise-stable matchings, when they exist, for the many-to-one matching model. Authors have looked for sufficient conditions on the preferences of the agents for the existence of setwise-stable matchings in the many-to-many case. Sotomayor [88] proves that if the preferences of the firms satisfy the maximin property, then the set of pairwise-stable matchings coincides with the set of setwise-stable matchings. An example in that paper shows that the set of setwise-stable matchings may be empty if this condition is not satisfied.
It is assumed there that the preferences are responsive, and it is conjectured that the result above extends to the case of substitutable preferences. Echenique and Oviedo [24] also address this problem with a different condition. They show that if the agents on one side of the market have strongly substitutable preferences, while those on the other side have substitutable preferences, then the set of setwise-stable matchings coincides with the set of pairwise-stable matchings.
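The men-proposing deferred acceptance procedure described above, which underlies several of these existence results, can be sketched in a few lines. This is a minimal illustration with made-up preference lists, not a production implementation:

```python
def deferred_acceptance(men_prefs, women_prefs):
    """Men-proposing deferred acceptance.

    men_prefs[m]  : list of acceptable women, most preferred first.
    women_prefs[w]: list of acceptable men, most preferred first.
    Returns a dict woman -> man with the engaged pairs; everyone else is single.
    """
    rank = {w: {m: r for r, m in enumerate(p)} for w, p in women_prefs.items()}
    next_choice = {m: 0 for m in men_prefs}  # next woman on each man's list
    engaged = {}                             # woman -> man
    free = list(men_prefs)
    while free:
        m = free.pop()
        # m proposes down his list until some woman keeps him or it runs out
        while next_choice[m] < len(men_prefs[m]):
            w = men_prefs[m][next_choice[m]]
            next_choice[m] += 1
            if m not in rank[w]:                  # m is unacceptable to w
                continue
            if w not in engaged:                  # w keeps her first proposal
                engaged[w] = m
                break
            if rank[w][m] < rank[w][engaged[w]]:  # w trades up, rejecting
                free.append(engaged[w])           # her previous fiance
                engaged[w] = m
                break
        # if the inner loop exhausts m's list, m stays single
    return engaged

men = {"m1": ["w1", "w2"], "m2": ["w1", "w2"]}
women = {"w1": ["m2", "m1"], "w2": ["m1", "m2"]}
print(deferred_acceptance(men, women))  # {'w1': 'm2', 'w2': 'm1'}
```

On these illustrative lists both men propose first to w1; she keeps m2, and the rejected m1 moves on to w2, so the algorithm stops after the second round with a stable matching.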
Konishi and Ünver [46] give conditions on the preferences of the agents in a many-to-many matching market under which a pairwise-stable matching cannot be quasi-dominated by a pairwise-unstable matching via a coalition. Eeckhout [26], under the assumption of strict preferences and that every man (woman) is acceptable to every woman (man), presents a sufficient condition for uniqueness of the stable matching in the marriage market. The condition on preferences is simple: for every fi ∈ F = {f1, f2, ..., fm}, wi ≻fi wk for all k > i, and for every wj ∈ W = {w1, w2, ..., wn}, fj ≻wj fk for all k > j. One line of investigation that has been developed in the theory of two-sided matchings concerns the mathematical structure of the set of stable matchings, because it captures fundamental differences and similarities between the several kinds of models. For the marriage model and the college admission model with responsive preferences, assuming that the preferences over individuals are strict, the set of setwise-stable (stable, for short) matchings has the following characteristic properties:

A1. Let μ and μ′ be stable matchings. Then μ ≥F μ′ if and only if μ′ ≥W μ. That is, there exists an opposition of interests between the two sides of the market along the whole set of stable matchings.

A2. Every agent is matched to the same number of mates under every stable matching. Consequently, if an agent is unmatched under some stable matching then he/she/it is unmatched under any other stable matching.

When preferences are strict, there are two natural partial orders on the set of all stable matchings. The partial order ≥F is defined as follows: μ ≥F μ′ if μ(f) ≥f μ′(f) for all f ∈ F. The partial order ≥W is analogously defined. The fact that these partial orders are well defined follows from A1. Then,

A3. The set of stable matchings has the algebraic structure of a complete lattice under the partial orders ≥F and ≥W.
The lattice property means the following: if μ and μ′ are two stable matchings, then some firms (respectively, workers) may get a preferable set of mates under μ than under μ′ and others may be better off under μ′ than under μ. The lattice property implies that there is then a stable matching which gives each firm the more preferred of its two sets of partners (and each worker the less preferred), and also one which does the reverse. That is, if μ and μ′ are stable matchings, the lattice property implies that the functions λ = μ ∨F μ′, λ′ = μ ∧F μ′, ν = μ ∨W μ′ and ν′ = μ ∧W μ′ are stable matchings, where
λ = μ ∨F μ′ is defined by λ(f) = max{μ(f), μ′(f)} and λ(w) = min{μ(w), μ′(w)}, where the max and min are taken with respect to the preferences of f and w, respectively; the function ν = μ ∨W μ′ is analogously defined; the function λ′ = μ ∧F μ′ is defined by λ′(f) = min{μ(f), μ′(f)} and λ′(w) = max{μ(w), μ′(w)}. Analogously we define ν′ = μ ∧W μ′ (notice that μ ∨F μ′ is the same as μ ∧W μ′, and μ ∨W μ′ is the same as μ ∧F μ′). The fact that the lattice is complete implies the existence and uniqueness of a maximal element and a minimal element in the set of stable matchings, with respect to the partial order that is being considered. Thus, there exists one and only one stable matching μF and one and only one stable matching μW such that μF ≥F μ and μW ≥W μ for all stable matchings μ. Property A1 then implies that μ ≥W μF and μ ≥F μW. That is,

A4. There is an F-optimal stable matching μF with the property that, for any stable matching μ, μF ≥F μ and μ ≥W μF; there is a W-optimal stable matching μW with symmetrical properties.

Property A1 was first proved by Knuth [45] for the marriage model. The result for the college admission model with responsive preferences was proved in Roth and Sotomayor [69] by making use of the following proposition of Roth and Sotomayor [68]: Suppose colleges and students have strict individual preferences, and let μ1 and μ2 be stable matchings for the college admission model such that μ1(f) ≠ μ2(f). Let μ1′ and μ2′ be the stable matchings corresponding to μ1 and μ2 in the related marriage model. If μ1′(fi) ≻f μ2′(fi) for some position fi of f, then μ1′(fj) ≻f μ2′(fj) for all positions fj of f. Property A2 was proved by Gale and Sotomayor [32] for both models. For the college admission model with responsive preferences, Roth [66] added that if a college does not fill its quota at some stable matching, then it has the same set of mates at every stable matching. The restriction of property A2 to the marriage model where every pair of partners is mutually acceptable was proved by McVitie and Wilson [52]. For the many-to-one case with substitutable preferences, Martínez et al.
[50] present an example in which there are agents who are unmatched under some stable matching and matched under another. By introducing quotas in the model with substitutable preferences of Roth and Sotomayor [69], these authors prove that if the preferences of the colleges are strict, substitutable and r(f)-separable for every college f, then property A2 holds. Furthermore, Roth's result mentioned above also applies. The lattice property of the set of stable matchings for the marriage model is attributed by Knuth [45] to Conway. The existence of the optimal stable matchings for each side
of the marriage market and the college admission market with responsive preferences was first proved in Gale and Shapley [30] by using the deferred acceptance procedure. The idea of their elegant proof is to show that a proposer is never rejected by an achievable mate, so he/she/it ends up with his/her/its best achievable mate. The lattice property for the college admission model with responsive preferences was obtained in Roth and Sotomayor [69]. To show that the functions λ, λ′, ν and ν′ above are well defined, these authors used the following theorem from Roth and Sotomayor [68]: If colleges and students have strict preferences over individuals, then colleges have strict preferences over those groups of students that they may be assigned at stable matchings. That is, if μ1 and μ2 are stable matchings, then a college f is indifferent between μ1(f) and μ2(f) only if μ1(f) = μ2(f). This result is an immediate consequence of the proposition mentioned above, due to the responsiveness of the preferences. Therefore, if μ1 and μ2 are two stable matchings, then f prefers μ1(f) to μ2(f) if and only if the r(f) students most preferred by f in the union of μ1(f) and μ2(f) are those in μ1(f). For the many-to-many matching market with strict and substitutable preferences, Blair [12] proved that the set of pairwise-stable matchings (not necessarily setwise-stable) has the lattice structure under some partial order that is not defined by the preferences of the agents. The definition of Blair's partial order ≥F uses the choice sets: if μ1 and μ2 are pairwise-stable matchings, then μ1 ≥f μ2 if and only if Chf(μ1(f) ∪ μ2(f)) = μ1(f), for every agent f ∈ F. Similarly the partial order ≥W is defined. As remarked above, the partial order defined by Blair coincides with the one defined by the preferences of the players in the college admission model with responsive preferences.
Adachi [4] introduces a map, called a T-map, defined over the set of pre-matchings, in order to show that the set of stable matchings is a non-empty lattice in the marriage market under strict preferences. Adachi defines the T-map as follows: given a pre-matching ν, T(ν)(f) is f's most preferred point in {w ∈ W | f ≥w ν(w)} ∪ {f}, for all f ∈ F, and similarly, T(ν)(w) is w's most preferred point in {f ∈ F | w ≥f ν(f)} ∪ {w}, for all w ∈ W. Clearly, any fixed point of the T-map is a matching, and Adachi showed it has to be stable. Using the partial order ≥F defined by the agents' preferences, he showed that the set of pre-matchings endowed with this partial order is a complete lattice and the T-map is an isotone (order-preserving) function. Thus, Tarski's fixed-point theorem implies that the set of fixed points of the T-map, which is the set of stable matchings, is a non-empty complete lattice.
Theorem 1 (Tarski's Theorem [96]) Let E be a complete lattice with respect to some partial order ≥, and let f be an isotone function from E to E. Then the set of fixed points of f is non-empty and is itself a complete lattice with respect to the partial order ≥.

In this same vein, Echenique and Oviedo [23,24] extend Adachi's [4] approach and the T-map in order to analyze the many-to-one and many-to-many models, respectively. Again, any fixed point of the T-map is a matching. Echenique and Oviedo [23] show that for the many-to-one model the set of fixed points of the T-map is equal to the set of stable matchings. By making successive iterations of the T-map, starting from some specific pre-matching, until a fixed point is reached, this map can be used to find all the stable matchings, as long as they exist. This procedure is called the T-algorithm. These authors show that as long as the strong core is non-empty, the T-algorithm always converges, and if the strong core is empty, it cycles. They present an example of a situation in which the preferences are not substitutable and the T-algorithm finds strong core allocations, but the algorithm with firms proposing according to their preference lists over allowable sets of workers does not do so. Finally, they give a bound on the computational complexity of the T-algorithm and show how it can be used to calculate both the supremum and the infimum under Blair's partial order, which under non-substitutable preferences might not be easily computed. Furthermore, under substitutability, the set of pre-matchings endowed with the partial order defined by Blair is again a complete lattice and the T-map is isotone, so Tarski's theorem implies that the strong core is a non-empty lattice under the partial order introduced in Blair [12].
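As an illustration of this fixed-point approach, the T-map for a tiny marriage market can be iterated directly. The agents, preference lists and starting pre-matching below are illustrative assumptions; lists are most-preferred first, an unlisted partner is unacceptable, and staying single ranks below every listed partner:

```python
# Illustrative two-firm, two-worker marriage market (names assumed).
F = {"f1": ["w1", "w2"], "f2": ["w2", "w1"]}   # firms' (men's) lists
W = {"w1": ["f2", "f1"], "w2": ["f1", "f2"]}   # workers' (women's) lists

def weakly_prefers(prefs, agent, x, y):
    """True if `agent` weakly prefers x to y (the agent itself = single)."""
    order = prefs[agent] + [agent]
    return order.index(x) <= order.index(y)

def T(nu):
    """One application of the T-map to the pre-matching nu."""
    new = {}
    for f in F:   # f's favorite among the workers who would (weakly) accept f
        cands = [w for w in F[f] if f in W[w] and weakly_prefers(W, w, f, nu[w])]
        new[f] = min(cands + [f], key=(F[f] + [f]).index)
    for w in W:   # symmetrically for each worker
        cands = [f for f in W[w] if w in F[f] and weakly_prefers(F, f, w, nu[f])]
        new[w] = min(cands + [w], key=(W[w] + [w]).index)
    return new

# Start from the firm-best pre-matching (each firm points to its favorite,
# each worker to herself) and iterate to a fixed point, which is stable.
nu = {**{f: F[f][0] for f in F}, **{w: w for w in W}}
while (nxt := T(nu)) != nu:
    nu = nxt
print(nu)  # {'f1': 'w1', 'f2': 'w2', 'w1': 'f1', 'w2': 'f2'}
```

The fixed point reached here is the F-optimal stable matching: each firm gets its first choice, while each worker ends up with her second choice.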
For the many-to-many model, the set of fixed points of the T-map studied by Echenique and Oviedo [24] is shown to be, under substitutability, equal to the set of pairwise-stable matchings, and a superset of the set of setwise-stable matchings. Furthermore, the set of pre-matchings endowed with Blair's partial order is again a complete lattice. Then Tarski's theorem applies and the set of pairwise-stable matchings is a non-empty complete lattice. If both sides of the market satisfy the strong substitutability property, then the set of setwise-stable matchings is a complete lattice, both for Blair's partial order and for the partial order defined by the agents' preferences. Martínez et al. [51] propose an algorithm which allows them to calculate the whole set of pairwise-stable matchings under substitutability. Echenique and Yenmez [25] study the college admission problem when students have preferences over colleagues. Using the T-map they construct an algorithm,
which finds all the core allocations, as long as the core is non-empty. In a similar setup, Pycia [57] finds necessary and sufficient conditions for the existence of stable matchings. Dutta and Massó [22] studied a many-to-one version of the model of Echenique and Yenmez [25]. They showed that under certain conditions on preferences the core is non-empty. Hatfield and Milgrom [36] present a general many-to-one model in which the central concept is that of a contract, which allows a different formalization of a matching. In their model there are a finite set of firms, a finite set of workers and a finite set of wages offered by firms. Each contract c specifies the pair (f, w) involved and the wage worker w gets from firm f, so that the set of contracts is C = F × W × WP (where WP is the set of wages offered by the firms). Clearly, if each firm offers a unique wage level to all workers, their model is a college admission model. Agents have preferences over the contracts in which they could be involved. In this model, a feasible allocation is a set of contracts C′ ⊆ C in which each worker w appears in at most one contract c ∈ C′ and each firm f appears in at most r(f) contracts c1, ..., cr(f) ∈ C′, and such that for each agent a ∈ F ∪ W we have Cha(Ca) = Ca, where Ca is the set of contracts in C′ in which agent a appears. According to Hatfield and Milgrom, a feasible allocation C′ is stable if there does not exist an alternative feasible allocation which is strictly preferred by some firm f and weakly preferred by all of the workers it hires. Making use of Tarski's fixed-point theorem, they prove that if preferences are strict and satisfy substitutability over the set of contracts, then the set of stable allocations is a non-empty lattice. They introduce the condition of the law of aggregate demand on preferences, which requires that for all allowable sets X, Y: if X ⊆ Y then |Chf(Y)| ≥ |Chf(X)|.
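The law of aggregate demand just defined can be checked mechanically for the simplest kind of choice function, a quota rule that keeps the best acceptable contracts. The utilities and quota below are assumptions for illustration, not from the text:

```python
from itertools import combinations

# A hypothetical firm's ranking of four contracts (positive = acceptable)
# and its quota; both are illustrative assumptions.
utility = {"c1": 5, "c2": 3, "c3": -1, "c4": 2}
r = 2

def choice(offered):
    """Ch_f: keep the best acceptable contracts, at most r of them."""
    acceptable = sorted((c for c in offered if utility[c] > 0),
                        key=lambda c: -utility[c])
    return set(acceptable[:r])

# Law of aggregate demand: X subset of Y implies |Ch(Y)| >= |Ch(X)|.
contracts = list(utility)
ok = all(len(choice(set(Y))) >= len(choice(set(X)))
         for k in range(len(contracts) + 1)
         for Y in combinations(contracts, k)
         for m in range(k + 1)
         for X in combinations(Y, m))
print(ok)  # True for this responsive, quota-constrained rule
```

For such a rule |Chf(·)| is simply min(r, number of acceptable contracts offered), which is monotone in the offered set, so the exhaustive check over all nested pairs succeeds.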
By assuming that firms' preferences satisfy substitutability and the law of aggregate demand, they prove some characteristic results on the structure of the set of stable allocations and analyze the incentives facing the agents when a mechanism which produces the W-optimal stable allocation is adopted. They show that under this mechanism it is a dominant strategy for the workers to state their true preferences. Ostrovsky [54] generalizes the model presented in Hatfield and Milgrom [36] to a K-sided discrete many-to-many matching model. A set of contracts, which allows the production and consumption of some goods, is called a network. This author considers supply networks where goods are sold and transformed through many stages, starting from the suppliers of initial inputs and going through different intermediaries until they reach the final consumer. He generalizes the concept of pairwise stability to this setting
and calls it chain stability, which requires that there does not exist a chain of contracts such that all members of this chain are better off. A chain of contracts specifies a sequence of agents, each of whom is the seller in one contract and the buyer in the next one. Under certain conditions on the preferences of the agents he proves the existence of chain stable networks and, by using fixed-point methods, he shows that the set of chain stable networks is a non-empty lattice. Furthermore, he proves that there exists a consumer-optimal network and an initial-supplier-optimal network, similar to the F-optimal and W-optimal matchings in other models. Finally, for the case in which each agent can be the seller (buyer) in at most one contract, he shows that the set of chain stable networks is equal to the core. Crawford [15] proposes allowing offers in the NRMP mechanism to include salaries and demonstrates how the resulting market can generate stable outcomes, which might Pareto dominate the ones produced by the current form of the NRMP. This model can be seen as an application of the Hatfield and Milgrom [36] paper. Another line of investigation that has grown in the last decade concerns the special case of the college admission model in which colleges have fixed preferences, known nowadays as the school choice model. The seminal paper is Sotomayor [77], which was motivated by the admission market of economists to graduate centers of economics in Brazil. The students take some tests and each institution places weights on each of these tests in order to rank the students according to the weighted average of the tests. See Ergin and Sönmez [28], Pathak and Sönmez [55] and Balinski and Sönmez [8].

Continuous Two-Sided Matching Model with Additively Separable Utility Functions

The two-sided matching model with additively separable utility functions involves two finite and disjoint sets of players, which will be denoted by B, with m elements, and Q, with n elements.
Each b ∈ B has a quota r(b) and each q ∈ Q has a quota s(q), representing the maximum number of partnerships they can form. The main characteristic of this model is that agents are able to negotiate their individual payoffs: if a partnership (b, q) is formed, the partners undertake an activity together that produces a payoff vbq, which is divided between them into the payoffs ubq for b and wbq for q, as a result of a negotiation process. Therefore, an outcome for this game is a matching, along with individual payoffs ubq and wbq. Dummy players, denoted by 0, are included for technical convenience in both sides of the market. We have that
vb0 = v0q = 0 for all b ∈ B and q ∈ Q. As for the quotas, a dummy player may form as many partnerships as needed to fill the quotas of the non-dummy players. Then, an allowable set of partners for agent b with only r(b) − k elements of Q has k copies of the dummy Q-agent introduced in the model. A matching is feasible if each player is matched to an allowable set of partners. A feasible outcome, denoted by (u, w, μ), is a feasible matching μ together with a pair of payoffs (u, w), where the individual payoffs of each b ∈ B and q ∈ Q are given by the arrays of numbers ubq, with q ∈ μ(b), and wbq, with b ∈ μ(q), respectively, such that ubq + wbq = vbq, ubq ≥ 0 and wbq ≥ 0. Consequently, ub0 = u0q = wb0 = w0q = 0 in case these payoffs are defined. We say that the matching μ is compatible with the payoff (u, w). The value of μ is Σ_{q∈Q, b∈μ(q)} vbq. The matching μ is optimal if it attains the maximum value among all feasible matchings. This model generates the following game in coalitional function form with side payments. The set of players is N = B ∪ Q, and the characteristic function v satisfies the following:
(a) v(∅) = 0,
(b) v(S) = 0 if S ⊆ B or S ⊆ Q,
(c) v(S) ≤ v(T) if S ⊆ T,
(d) v({b, q}) = vbq for all (b, q) ∈ B × Q.
For every b ∈ B and for all sets S ⊆ Q with |S| > r(b),
(e) v({b} ∪ S) = max{v({b} ∪ S′) : S′ ⊆ S and |S′| = r(b)}.
(c) and (d) imply that for all sets S ⊆ Q with |S| > r(b), v({b} ∪ S) = v({b} ∪ S′) for some S′ ⊆ S with |S′| = r(b). Analogously we define v({q} ∪ S) for every q ∈ Q and all sets S ⊆ B with |S| > s(q). The condition that the game has additively separable utilities means that, for every coalition S = R ∪ T, with R ⊆ B and T ⊆ Q,
(f) v(R ∪ T) = max_x Σ_{(b,q)∈R×T} xbq vbq, where the maximum is taken over every feasible assignment x.
Consequently, for all T ⊆ Q with |T| ≤ r(b) and for all R ⊆ B with |R| ≤ s(q), v({b} ∪ T) = Σ_{q∈T} vbq and v({q} ∪ R) = Σ_{b∈R} vbq. When the quota of any agent is one, the model is the well-known assignment game introduced in Shapley and Shubik [72]. In this case the set of individual payoffs of an agent is a singleton, so these payoffs need not be indexed according to the partnerships formed under the matching. Furthermore, the concepts of setwise-stability, pairwise-stability and corewise-stability are equivalent. The outcome (u, w, μ) is stable if it is feasible and ub + wq ≥ vbq for all pairs (b, q). The existence of stable payoffs for the assignment game was proved in Shapley and Shubik [72] with the use of linear programming. A different, but also simple, proof is obtained in Sotomayor [81] using only combinatorial arguments. Crawford and Knoer [16] consider a discrete version (as well as the continuous version) of the assignment game and develop a version of the deferred acceptance algorithm of Gale and Shapley to prove the non-emptiness of the set of pairwise-stable payoffs. The main properties that characterize the set of stable outcomes of the assignment game of Shapley and Shubik are the following:

B1. Let (u, w) be some stable payoff. Then μ is an optimal matching if and only if it is compatible with (u, w). This result means that the set of stable payoffs is the same under every optimal matching. Then we can concentrate on the payoffs of the agents rather than on the underlying matching.

B2. Let (u, w) and (u′, w′) be stable payoffs. Then u ≥ u′ if and only if w′ ≥ w. That is, there exists an opposition of interests between the two sides of the market along the whole set of stable payoffs.

B3. If an agent is unmatched under some stable outcome then he/she gets a zero payoff under any other stable outcome. This means, for example, that if a worker is unemployed under some stable outcome then this worker will get a zero salary under any other stable outcome.

B4. The set of stable payoffs forms a convex and compact lattice under the partial orders ≥B and ≥Q.

The partial order ≥B on the set of stable payoffs is defined as follows: (u, w) ≥B (u′, w′) if ub ≥ u′b for all b ∈ B. Property B2 implies that wq ≤ w′q for all q ∈ Q, so this partial order is well defined. The partial order ≥Q is symmetrically defined.
Then, (u, w) ∨B (u′, w′) = (max{u, u′}, min{w, w′}) and (u, w) ∧B (u′, w′) = (min{u, u′}, max{w, w′}), where the max and min are taken coordinatewise. The lattice property implies that there exist a maximal element and a minimal element in the set of stable payoffs. The fact that the lattice is complete implies the uniqueness of these extreme points. Thus, there exists one and only one stable payoff (u⁺, w⁻) and one and only one
stable payoff (u⁻, w⁺) such that (u⁺, w⁻) ≥B (u, w) and (u⁻, w⁺) ≥Q (u, w) for all stable payoffs (u, w). That is,

B5. There is a B-optimal stable payoff (u⁺, w⁻) with the property that for any stable payoff (u, w), u⁺ ≥ u and w⁻ ≤ w; there is a Q-optimal stable payoff (u⁻, w⁺) with symmetrical properties.

B6. The set of stable payoffs equals the core and the set of competitive equilibrium payoffs.

Excluding property B3, which follows from property 1 of Demange and Gale [18], all the other properties were first proved in Shapley and Shubik [72]. The general quota case is a version of the model studied in Crawford and Knoer [16] and was first presented in Sotomayor [75] in the context of a labor market of firms and workers. Under this approach, the number r(b) is the maximum number of workers firm b can hire; the number s(q) is the maximum number of jobs worker q can take; and the number vbq is the productivity of worker q in firm b. The natural cooperative solution concept is that of setwise-stability, which is shown to be equivalent to the concept of pairwise-stability. Then, the feasible outcome (u, w, μ) is setwise-stable if ub(min) + wq(min) ≥ vbq for all pairs (b, q) with q ∉ μ(b), where ub(min) is the smallest individual payoff of firm b and wq(min) is the smallest individual payoff of worker q. The existence of setwise-stable payoffs for this model was proved in Sotomayor [75,79] through the use of linear programming. Another interpretation of this model considers a buyer-seller market: B is a set of buyers and Q is a set of sellers. Buyers are interested in sets of objects owned by different sellers and each seller owns a set of identical objects. The number r(b) is the number of objects buyer b is allowed to acquire; the number s(q) is the number of identical objects seller q owns; and the number vbq is the amount of money buyer b considers paying for an object of seller q. We say that vbq is the value of object q (an object owned by seller q) to buyer b.
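For the unit-quota assignment game recalled above, an optimal matching and the stability inequality ub + wq ≥ vbq can be verified by brute force on a small instance. The 3×3 surplus matrix and the candidate payoff below are illustrative assumptions, not from the text:

```python
from itertools import permutations

# Illustrative unit-quota assignment game: v[b][q] is the surplus of pair (b, q).
v = [[5, 8, 2],
     [7, 9, 6],
     [2, 3, 0]]
n = 3

# An optimal matching maximizes the total surplus over all one-to-one matchings.
best = max(permutations(range(n)), key=lambda m: sum(v[b][m[b]] for b in range(n)))
value = sum(v[b][best[b]] for b in range(n))
print(best, value)  # (1, 2, 0) 16

# A candidate stable payoff (assumed numbers): u[b] + w[q] >= v[b][q] for every
# pair, with equality for the matched pairs b0-q1, b1-q2, b2-q0.
u = [4, 5, 0]
w = [2, 4, 1]
stable = all(u[b] + w[q] >= v[b][q] for b in range(n) for q in range(n))
split = all(u[b] + w[best[b]] == v[b][best[b]] for b in range(n))
print(stable, split)  # True True
```

Note that the stable payoff splits exactly the surplus of the matched pairs, in line with property B1: any payoff satisfying the stability inequalities is supported by an optimal matching.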
An artificial null object, 0, owned by a dummy seller, whose value is zero to all buyers and whose price is always zero, is introduced for technical convenience. Under this approach a buyer is assigned to an allowable set of objects at a feasible allocation, meaning that she is matched to the set of sellers who own the objects in the given set. Given a price vector p ∈ ℝ₊^m, with m = Σ_{q∈Q} s(q), the preferences of buyers over objects are completely described by the numbers v_bq: for any two allowable sets of objects S and S′, buyer b prefers S to S′ at prices p if
her total payoff when she buys S is greater than her total payoff when she buys S′. She is indifferent between the two sets if she gets the same total payoff from both. Given the prices of the objects, buyers demand their favorite allowable sets of objects at those prices; the collection of such allowable sets is called the demand set of buyer b at prices p. An equilibrium is reached if every buyer is assigned an allowable set of objects from her demand set, every seller with a positive price sells all of his items, and the number of objects in the market is enough to meet the demand of all buyers. The solution concept that captures this intuitive idea of equilibrium is that of competitive equilibrium payoff, defined in Sotomayor [91] as an extension of the concept of competitive equilibrium price for the assignment game given in Demange, Gale and Sotomayor [19]. Formally, (u, p, μ) is a competitive equilibrium outcome if (i) it is feasible, (ii) μ is a feasible allocation such that, if μ(b) = S then S is in the demand set of b at prices p, for all b ∈ B, and (iii) p_q = 0 if object q is left unsold. If (u, p, μ) is a competitive equilibrium outcome we say that (u, p) is a competitive equilibrium payoff, (p, μ) is a competitive equilibrium and p is a competitive equilibrium price, or an equilibrium price for short. One characteristic of the additively separable utility function is that if a buyer demands a set A of objects at prices p and some of these objects have their prices raised, then the buyer will continue to want to buy the objects in A whose prices were not changed. That is, the function v({b} ∪ A), defined over all allowable sets A of partners for b, satisfies the gross substitute condition. Kelso and Crawford [42] formulated a discrete and a continuous many-to-one matching model where the functions v({b} ∪ A) satisfy the gross substitute condition and are not necessarily additively separable.
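Under additive separability, a buyer's demand set can be computed by enumerating allowable sets. The sketch below (hypothetical values and prices) also illustrates the point about price increases: raising the price of an object the buyer does not demand leaves her demanded set unchanged.

```python
from itertools import combinations

def demand_set(values, prices, quota):
    """All allowable object sets maximizing a buyer's total payoff
    sum_{q in S} (v_q - p_q) under additively separable utility.
    values, prices: dicts object -> number; quota: max set size."""
    objects = list(values)
    best, demand = None, []
    for k in range(quota + 1):
        for S in combinations(objects, k):
            payoff = sum(values[q] - prices[q] for q in S)
            if best is None or payoff > best:
                best, demand = payoff, [set(S)]
            elif payoff == best:
                demand.append(set(S))
    return demand

v = {"q1": 5, "q2": 3, "q3": 2}      # hypothetical valuations
p = {"q1": 1, "q2": 1, "q3": 3}
print(demand_set(v, p, quota=2))      # [{'q1', 'q2'}]: surpluses 4, 2, -1
# Raising the price of the undemanded q3 leaves the demand set unchanged:
p2 = {"q1": 1, "q2": 1, "q3": 5}
print(demand_set(v, p2, quota=2))
```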
In this model, the core, the set of setwise-stable payoffs and the set of competitive equilibrium payoffs coincide and are non-empty. These authors also show, through an example, that without this condition the core may be empty. A consequence of the competitive equilibrium concept for the many-to-many case with additively separable utility functions is that sellers do not discriminate among buyers under a competitive equilibrium payoff, as they might do under a stable outcome. The competitive equilibrium payoffs for this model are characterized as the setwise-stable payoffs at which every seller has identical individual payoffs. It is interesting to point out that if identical objects are owned by different sellers, they need not be sold at the same price, unless the two sellers own the same number of objects and the selling price is the minimum competitive equilibrium price (Sotomayor [91]).
A stable (respectively, competitive equilibrium) payoff is called a B-optimal stable (respectively, competitive equilibrium) payoff if every agent in B weakly prefers it to any other stable (respectively, competitive equilibrium) payoff. That is, the B-optimal stable (respectively, competitive equilibrium) payoff gives each agent in B the maximum total payoff among all stable (respectively, competitive equilibrium) payoffs. Similarly we define a Q-optimal stable (respectively, competitive equilibrium) payoff. The existence and uniqueness of the B-optimal and of the Q-optimal stable payoffs are proved in Sotomayor [79] by showing that the set of stable payoffs is a lattice under two conveniently defined partial orders. This result runs into the difficulty of defining a partial order relation on the set of stable payoffs: on the one hand, the arrays of individual payoffs are unordered sets of numbers indexed according to the current matching; on the other hand, the agents' preferences do not define a partial order relation, since they violate the antisymmetry property. To solve this problem, Sotomayor [79] defines a partnership (b, q) to be nonessential if it occurs in some but not all optimal matchings and essential if it occurs in all optimal matchings. Then, two optimal matchings differ only in their nonessential partnerships. According to Theorem 1 of that paper, (i) in every stable outcome a player gets the same payoff in any nonessential partnership; furthermore, this payoff is less than or equal to any other payoff the player gets under the same outcome; (ii) given a stable outcome (u, w, μ) and a different optimal matching μ′, we can reindex the u_bq's and w_bq's according to μ′ and still get a stable outcome. Therefore, the array of individual payoffs of a player can be represented by a vector in a Euclidean space whose dimension is the quota of the given player.
The first coordinates are the payoffs that the player gets from his essential partners (if any), following some given ordering. The remaining coordinates (if any) are all equal to a single number, which represents the payoff the player gets from all his nonessential partners. This representation is clearly independent of the matching, so any optimal matching is compatible with a stable payoff. Hence, by ordering the players in B (respectively, Q), we can embed the stable payoffs of these players in a Euclidean space whose dimension is the sum of the quotas of all players in B (respectively, Q). The natural partial order of this Euclidean space then induces the partial order relation ≥_B (respectively ≥_Q) on the set of stable payoffs. We say that (u, w) ≥_B (u′, w′) if the vector of individual payoffs of every buyer under (u, w) is greater than or equal to her vector of individual payoffs under (u′, w′). Similarly we define (u, w) ≥_Q (u′, w′). The main results of Sotomayor [79] are that, under the vectorial representation of the stable payoffs, properties B1, B2, B3, B4 and B5 hold for the general many-to-many case. An implication of property B2 is the conflict of interests between the two sides of the market with respect to two comparable stable payoffs: if payoffs (u, w) and (u′, w′) are stable and comparable, then for all (b, q) ∈ B × Q, b's total payoff under the first outcome is greater than b's total payoff under the second outcome if and only if q's total payoff under the second outcome is greater than q's total payoff under the first outcome. From property B3, if a seller has some unsold object under a stable outcome, then one of his individual payoffs will be zero under any other such outcome. Even though the preferences of the players do not define the partial orders ≥_B and ≥_Q, the property stated in B4 is of interest because the two extreme points of the lattice have an important meaning for the model: they are precisely the B-optimal and the Q-optimal stable payoffs. Moreover, every buyer weakly prefers any stable payoff to the Q-optimal stable payoff, and every seller weakly prefers any stable payoff to the B-optimal stable payoff. Indeed, the set of competitive equilibrium payoffs is a sublattice of the set of stable payoffs. This connection is given by the following theorem of Sotomayor [91]: the set of competitive equilibrium payoffs is contained in the set of stable payoffs and is a non-empty and complete lattice under the partial order ≥_B (respectively ≥_Q), whose supremum (respectively infimum) is B-optimal and whose infimum (respectively supremum) is Q-optimal.
The idea of the proof is that the set of competitive equilibrium payoffs can be obtained by “shrinking” the set of stable payoffs through the application of a convenient isotone (order-preserving) map f whose fixed points are exactly the competitive equilibrium payoffs. The desired result then follows from the algebraic fixed point theorem of Alfred Tarski [96]. It is also proved in Sotomayor [91] that the B-optimal stable payoff is a fixed point of f, so the B-optimal stable payoff is the B-optimal competitive equilibrium payoff. As for property B6, Sotomayor [84] shows that the core coincides with the set of pairwise-stable payoffs in the many-to-one case where sellers have a quota of one. Since a seller owns only one object, he cannot discriminate among the buyers, so the core coincides with the set of competitive equilibrium payoffs in this model. Thus, the set of competitive equilibrium payoffs is a lattice in this model.
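The fixed-point argument can be illustrated in the abstract: for an isotone map on a finite complete lattice, iterating from the lattice maximum descends to the greatest fixed point (Tarski). The map f below is an arbitrary illustrative isotone map on a toy lattice, not Sotomayor's actual construction:

```python
def greatest_fixed_point(f, top):
    """Iterate an isotone map from the lattice maximum; by Tarski's
    theorem the descending sequence reaches the greatest fixed point."""
    x = top
    while f(x) != x:
        x = f(x)
    return x

# Toy lattice: integer vectors in [0, 5]^2, ordered coordinatewise.
# An isotone (order-preserving) map that "shrinks" the lattice:
f = lambda x: (min(x[0], x[1] + 1), min(x[1], 3))
print(greatest_fixed_point(f, (5, 5)))  # (4, 3)
```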
The same result is reached in Gül and Stacchetti [33] for the many-to-one case in which the utilities satisfy the gross substitute condition. However, in the general quota case under additively separable utilities, the core may be bigger than the set of stable payoffs, which in turn contains, and may contain properly, the set of competitive equilibrium payoffs, as illustrated in the example below from Sotomayor [91]. This example also shows that the core may not be a lattice, and that the polarization of interests observed in the sets of stable payoffs and of competitive equilibrium payoffs does not always carry over to the core payoffs: the best core payoff for the buyers is not necessarily the worst core payoff for the sellers.

Example 5 (Sotomayor [91]) Consider the following situation. The B-players will be called firms and the Q-players will be called workers. There are two firms, b and b′, and two workers, q and q′. Each firm may employ and wants to employ both workers; worker q may take at most one job, and worker q′ may work and wants to work for both firms. The first row of the matrix v is (3, 2) and the second one is (3, 3). There are two optimal matchings, μ and μ′, where μ(b) = {q, q′}, μ(b′) = {q′, 0} and μ′(b′) = {q, q′}, μ′(b) = {q′, 0}. The core is described by the set of individual payoffs (u, w) whose total payoffs (U, W) satisfy the following system of inequalities: 0 ≤ U_b ≤ 2, 0 ≤ U_b′ ≤ 3; W_q + W_q′ ≥ 3, W_q′ − W_q ≤ 2, 1 ≤ W_q ≤ 3. It is not hard to see that the outcome (u, w) is stable if and only if seller q always gets payoff w_q = 3, seller q′ gets individual payoffs w_bq′ ∈ [0, 2] and w_b′q′ ∈ [0, 3], and the individual payoffs of buyers b and b′ are given by (u_bq = 0, u_bq′ = 2 − w_bq′) and (u_b′q′ = 3 − w_b′q′, u_b′0 = 0), respectively.
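The characterization in Example 5 can be checked mechanically. The sketch below, specialized to the example's data, tests the pairwise-stability inequality u_b(min) + w_q(min) ≥ v_bq for unmatched pairs; '0' marks an unfilled position with payoff zero, and the two candidate outcomes use payoff numbers from the example's discussion.

```python
# v[(firm, worker)] = productivity, from the matrix of Example 5.
v = {("b", "q"): 3, ("b", "q'"): 2, ("b'", "q"): 3, ("b'", "q'"): 3}

def is_pairwise_stable(matching, u, w):
    """matching: firm -> list of workers ('0' marks an unfilled position).
    u[(firm, worker)], w[(firm, worker)]: individual payoffs.
    Assumes every real worker is matched (true in this example)."""
    for b, partners in matching.items():
        u_min = min(u[(b, q)] for q in partners)   # firm b's smallest payoff
        for q in ("q", "q'"):
            if q in partners:
                continue
            # worker q's smallest individual payoff across her jobs:
            w_min = min(w[(f, q)] for f in matching if q in matching[f])
            if u_min + w_min < v[(b, q)]:          # (b, q) is a blocking pair
                return False
    return True

mu = {"b": ["q", "q'"], "b'": ["q'", "0"]}
# A core outcome that is NOT stable (b' and q can profitably deviate):
u1 = {("b", "q"): 1, ("b", "q'"): 1, ("b'", "q'"): 1, ("b'", "0"): 0}
w1 = {("b", "q"): 2, ("b", "q'"): 1, ("b'", "q'"): 2}
# A stable outcome (u', w'):
u2 = {("b", "q"): 0, ("b", "q'"): 1, ("b'", "q'"): 1, ("b'", "0"): 0}
w2 = {("b", "q"): 3, ("b", "q'"): 1, ("b'", "q'"): 2}

print(is_pairwise_stable(mu, u1, w1))  # False: pair (b', q), since 0 + 2 < 3
print(is_pairwise_stable(mu, u2, w2))  # True
```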
To see that corewise-stability is not adequate to define the cooperative equilibrium for this market, let (u, w, μ) be such that u_bq = 1, u_bq′ = 1, u_b′q′ = 1, u_b′0 = 0; w_bq = 2, w_bq′ = 1, w_b′q′ = 2. That is, firm b hires workers q and q′, obtains from each of them a profit of one, and pays two to q and one to q′; firm b′ hires worker q′ at a salary of two and obtains a profit of one. Observe that b′ has a quota of two, so it has one unfilled position. It happens that b′ can pay more than two to q. Thus, if agents can communicate with each other and behave cooperatively, this outcome will not occur, because worker q will not accept receiving only two from firm b, since she knows that she can get more than two by working for firm b′. Hence, this outcome cannot be a cooperative equilibrium. Observe that 2 = u_b′0 + w_bq < v_b′q = 3, so this outcome is not stable. On the other hand, it is in the core. In fact,
if there is a blocking coalition, then it must contain {b′, q}. These agents cannot increase their total payoffs by themselves; b′ needs to hire both workers. However, {b′, q, q′} does not block the outcome, because q′ would be worse off by taking only one job. The coalition of all agents does not block the outcome either, since b would lose worker q and so would be worse off. Now, consider the outcome (u′, w′, μ), where u′_bq = 0, u′_bq′ = 1, u′_b′q′ = 1, u′_b′0 = 0; w′_bq = 3, w′_bq′ = 1, w′_b′q′ = 2. Firm b′ cannot offer more than three to worker q, so the structure of the outcome cannot be ruptured. Then, although both outcomes (u, w, μ) and (u′, w′, μ) are corewise-stable, only the second one can be expected to occur, so only this outcome is a cooperative equilibrium. Our explanation for this is that only (u′, w′, μ) is stable. The connection between the core, the set of stable payoffs and the set of competitive equilibrium payoffs, exhibited in this example, can be better understood via Fig. 1. Here, if the reader prefers, the sets B and Q are better interpreted as the set of buyers and the set of sellers, respectively.

Two-Sided Matching Models, Figure 1 Core, stable and competitive equilibrium payoffs in Example 5

In Fig. 1, C(W) is the set of sellers' total payoffs that can be derived from any core payoff. The segment OP′ is the set of sellers' total payoffs that can be derived from any stable payoff; that is, (W_q, W_q′) ∈ OP′ if and only if there is a stable outcome (u, w, μ) such that W_q = w_bq and W_q′ = w_bq′ + w_b′q′. The segment OP is the set of sellers' total payoffs that can be derived from any competitive equilibrium price; that is, (W_q, W_q′) ∈ OP if and only if there is a competitive equilibrium price p such that W_q = p_q and W_q′ = 2p_q′. We can see that C(W) is bigger than OP′, which in turn is bigger than OP. The point (2, 3) ∈ C(W) \ OP′; it corresponds to the outcome (u, w, μ) described above, which is in the core but is not stable. It is clear in Fig. 1 that C(W) is not a lattice, so the core is a lattice under neither ≥_B nor ≥_Q. In fact, the outcome corresponding to the point (3, 0) gives the individual payoff of three to q and two individual payoffs of zero to q′. On the other hand, the core outcome corresponding to (2, 3) gives the individual payoff of two to q. Then the infimum (respectively supremum) of these two core payoffs under ≥_Q (respectively ≥_B) gives payoff two to seller q and two individual payoffs of zero to seller q′. This payoff corresponds to the vector of total payoffs (2, 0), which is not in C(W). It is evident that the stable payoff corresponding to point P′ = (3, 5) is the Q-optimal stable payoff: seller q receives three, seller q′ receives two from b and three from b′, and buyers get zero from the sellers. By applying the function f we obtain the Q-optimal competitive equilibrium payoff, corresponding to point P = (3, 4), where q receives three and q′ receives two from each buyer. Point O = (3, 0) corresponds to the outcome (u″, w″, μ), where u″_bq = 0, u″_bq′ = 2, u″_b′q′ = 3, u″_b′0 = 0; w″_bq = 3, w″_bq′ = 0, w″_b′q′ = 0. The payoff (u″, w″) is the B-optimal stable payoff, the B-optimal competitive equilibrium payoff and the B-optimal core payoff. It is the worst stable payoff and the worst competitive equilibrium payoff for the sellers. However, it is not the worst core payoff for the sellers, since it is the best core payoff for q. Indeed, as can be observed, there is no minimum core payoff for the sellers. A fruitful line of investigation has been the design of mechanisms that produce a competitive equilibrium price. Demange, Gale and Sotomayor [19] propose a generalization of the English auction for the assignment game, which yields the minimum competitive equilibrium price in a finite number of steps. In the same spirit, Sotomayor [83] presents a descending-bid auction mechanism, a generalization of the Dutch auction, which leads to the maximum competitive equilibrium price. Gül and Stacchetti [34] obtain the minimum competitive equilibrium price through a generalization of the auction of Demange et al.
[19] to the many-to-one case in which the utility functions satisfy the gross substitute condition. Sotomayor [90] obtains the same result by considering a dynamic mechanism for the many-to-many case with additively separable utility functions. This mechanism leads to the B-optimal stable payoff; using the symmetry of the model, the Q-optimal stable payoff is obtained by reversing the roles of buyers and sellers in the mechanism. A number of works related to the assignment game can be found in the literature. Demange and Gale [18] generalize the assignment game by allowing agents' preferences to be represented by any continuous utility function in the money variable. For that model, Roth and Sotomayor [67] generalize a previous result of Rochford [58],
using Tarski's fixed point theorem, and show that the set of fixed points of a “rebargaining” function is a subset of the core which maintains the lattice structure of the core. Another approach to the many-to-many assignment game with additively separable utilities is treated in Sotomayor [75]. There, agents are not allowed to negotiate their individual payments and act in blocks; an outcome only specifies their total payoffs. In this model, the core coincides with the set of stable payoffs and is not a lattice. Also, pairwise-stability is not equivalent to corewise-stability. Sotomayor [84] considers an extension of the assignment game to a many-to-many matching model in the context of firms and workers in which the quotas of the agents are not the number of partnerships they are allowed to form; instead, they are given by the units of labor time they can supply or employ. Bikhchandani and Mamer [10] analyze the existence of market-clearing prices in an exchange economy in which agents have interdependent values over several indivisible objects. Although an agent can be both a buyer and a seller, such an exchange economy can be transformed into a many-to-one matching market where each seller owns only one object and buyers want to buy a bundle of objects, and so it can be viewed as an extension of the assignment game. See also Demange [17], Leonard [48], Perez-Castrillo and Sotomayor [56], Sotomayor [86], Thompson [97] and Kaneko [39].
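The ascending-auction idea of Demange, Gale and Sotomayor [19], mentioned above, can be sketched for the one-to-one assignment game. The code below is an illustrative simplification, not the exact published procedure: it assumes integer values, at least as many objects as bidders, and positive values (so the null object is never demanded), and it raises by one unit the prices of an overdemanded set found via a failed augmenting search.

```python
def ascending_auction(values):
    """Simplified ascending auction for the assignment game.
    values[i][j] = bidder i's (positive integer) value for object j."""
    n_bidders, n_objects = len(values), len(values[0])
    prices = [0] * n_objects

    def demand(i):
        """Objects maximizing bidder i's surplus at current prices."""
        surplus = [values[i][j] - prices[j] for j in range(n_objects)]
        best = max(surplus)
        return [j for j in range(n_objects) if surplus[j] == best]

    while True:
        match = [None] * n_objects          # object -> bidder

        def try_assign(i, visited):
            """Augmenting search: assign bidder i within its demand set."""
            for j in demand(i):
                if j in visited:
                    continue
                visited.add(j)
                if match[j] is None or try_assign(match[j], visited):
                    match[j] = i
                    return True
            return False

        ok, overdemanded = True, None
        for i in range(n_bidders):
            visited = set()
            if not try_assign(i, visited):
                ok, overdemanded = False, visited   # objects blocking bidder i
                break
        if ok:
            return prices, match                    # every bidder gets a demanded object
        for j in overdemanded:                      # raise prices on the overdemanded set
            prices[j] += 1

# Both bidders initially demand only object 0; its price rises until
# bidder 1 becomes indifferent and the market clears.
prices, match = ascending_auction([[4, 1],
                                   [3, 1]])
print(prices)   # [2, 0]: for this instance, the minimum competitive prices
print(match)    # [0, 1]: object 0 to bidder 0, object 1 to bidder 1
```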
Hybrid One-to-One Matching Model

The hybrid one-to-one matching model is the name given in the literature to a unified model due to Eriksson and Karlander [29], inspired by the unification proposed in Roth and Sotomayor [70]. Agents from the marriage market and the assignment market are put together so that they can trade with each other in the same market. We can interpret the hybrid model as a labor market of firms and workers: P is the set of firms and Q is the set of workers. There are two classes of agents in each set: rigid agents and flexible agents. For each pair (p, q) ∈ P × Q there is a number c_pq representing the productivity of the pair. If a firm p hires worker q and both agents are flexible, then the number c_pq is split into a salary v_q for the worker and a profit u_p for the firm as the result of a negotiation process. If one of the agents is rigid, then the payoffs of the agents are pre-set and fixed and are part of the job description: in this case the profit of p and the salary of q will be a_pq and b_pq, respectively. The definitions of feasible outcome, corewise-stable outcome and setwise-stable outcome are straightforward extensions of the respective concepts for the non-hybrid models. In this model, the concepts of setwise-stability and corewise-stability are equivalent.

This model is motivated by the fact that, in practice, a wide range of real-world matching markets are neither completely discrete nor completely continuous. In the United States, for example, new law school graduates may enter the market for associate positions in private law firms, which negotiate salaries, or they may seek employment as law clerks to federal circuit court judges, which are civil service positions with predetermined fixed salaries. In the market for academic positions, American universities compete with each other in terms of salaries, while the French public universities offer a preset and fixed salary. In Brazil, new professors may enter the market for permanent positions (with preset and fixed salaries) in federal universities, or they may seek employment in private universities, which do not offer such positions but compensate the entrants with better and negotiable salaries. Eriksson and Karlander [29] present an algorithm to find a stable outcome under the assumption that the numbers a_pq, b_pq and c_pq are integer multiples of some unit. A non-constructive proof of existence without imposing any restriction is provided in Sotomayor [81]. In this paper it is proved that, under the assumption that the core, C, is equal to the strong core, C̄, the main properties that characterize the stable payoffs of the marriage and of the assignment models carry over to the hybrid model. That is, for the hybrid model,

C1. Let (u, w) and (u′, w′) be stable payoffs for the hybrid model. If C = C̄ then u ≥ u′ if and only if w′ ≥ w.

C2. If C = C̄ and an agent is unmatched under some stable outcome, then he/she gets a zero payoff under any other stable outcome.

C3. If C = C̄ then the set of stable payoffs forms a complete lattice under the partial orders ≥_P and ≥_Q.

C4. If C = C̄ then there is a P-optimal stable payoff (ū, w̲) with the property that ū ≥ u and w̲ ≤ w for any stable payoff (u, w); there is a Q-optimal stable payoff (u̲, w̄) with symmetrical properties.
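The payoff rule of the hybrid model can be sketched directly. The function and variable names below are illustrative, and a negotiated salary is simply passed in for the flexible-flexible case rather than derived from any bargaining process:

```python
def hybrid_pair_payoff(p_flexible, q_flexible, c_pq, a_pq, b_pq, salary=None):
    """Payoff of a matched pair (p, q) in the hybrid model.
    Both flexible: the productivity c_pq is split via a negotiated salary.
    Otherwise: the pre-set payoffs (a_pq, b_pq) from the job description apply."""
    if p_flexible and q_flexible:
        return c_pq - salary, salary      # (profit u_p, salary v_q)
    return a_pq, b_pq                     # rigid pair: payoffs are fixed

print(hybrid_pair_payoff(True, True, c_pq=10, a_pq=6, b_pq=3, salary=4))   # (6, 4)
print(hybrid_pair_payoff(False, True, c_pq=10, a_pq=6, b_pq=3))            # (6, 3)
```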
Sotomayor [92] studies the hybrid model without imposing the assumption that the core is equal to the strong core. Instead, she assumes that the preferences of the rigid agents, as well as the preferences of the flexible agents over rigid agents, are strict. It is shown there that the core of the hybrid model has a non-standard algebraic structure given by a disjoint union of complete lattices endowed with the properties above. The extreme points of the lattices of the core partition are called quasi-optimal stable payoffs for firms and quasi-optimal stable payoffs for workers. When the workers are always flexible, the marriage market is obtained when the flexible firms leave the
hybrid market, and the assignment game is obtained when the rigid firms leave the hybrid market. Each subset of the core partition of the hybrid model is obtained as follows. For any matching μ which is compatible with some stable payoff, decompose the market participants into two disjoint subsets: one subset contains all rigid firms and their mates at μ, and the other contains all flexible firms, their mates at μ, and the unmatched workers. Now, fix such a partition of the agents. The desired subset C(μ) of the core partition is formed by the core payoffs (u, w, μ′) such that all agents in the first set are matched among themselves under μ′ and all agents in the second set are matched among themselves under μ′. Clearly, as the rigid firms exit the hybrid market, the core partition for the corresponding assignment market reduces to only one set, since any stable matching is compatible with any core payoff by property B1. An analogous result holds as the flexible firms leave the hybrid market, due to the fact that the matched agents in the marriage market are the same at every stable matching, which is implied by property A2. Therefore, as all flexible firms leave the hybrid market, or as all rigid firms leave the hybrid market, the restriction of the algebraic structure to the core of the resulting non-hybrid market is that of a complete lattice, and the extreme points of the resulting lattice are exactly the firm-optimal and the worker-optimal stable payoffs. This algebraic structure was used in Sotomayor [92] to investigate the comparative effects on the quasi-optimal stable payoffs for firms and for workers caused by the entrance of rigid firms into the assignment market or by the entrance of flexible firms into the marriage model.
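The two-set decomposition described above can be sketched as follows (a hypothetical matching with one rigid and one flexible firm; the names are illustrative):

```python
def core_partition_sets(matching, rigid_firms, flexible_firms, workers):
    """Split participants as in the core-partition construction:
    one set holds the rigid firms and their mates at the matching, the
    other the flexible firms, their mates, and the unmatched workers.
    matching: firm -> set of matched workers."""
    set_rigid = set(rigid_firms)
    for f in rigid_firms:
        set_rigid |= matching.get(f, set())
    set_flexible = set(flexible_firms)
    for f in flexible_firms:
        set_flexible |= matching.get(f, set())
    matched = set().union(*matching.values()) if matching else set()
    set_flexible |= set(workers) - matched        # unmatched workers
    return set_rigid, set_flexible

mu = {"f_rigid": {"w1"}, "f_flex": {"w2"}}
s1, s2 = core_partition_sets(mu, ["f_rigid"], ["f_flex"], ["w1", "w2", "w3"])
print(s1)   # rigid firm with its mate w1
print(s2)   # flexible firm with its mate w2, plus the unmatched w3
```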
The results of that paper can be summarized as follows: whether agents are allocated according to a quasi-optimal stable payoff for firms or according to a quasi-optimal stable payoff for workers, it will always be the case that if flexible firms enter the rigid market, no rigid firm will be made better off and no worker will be made worse off; and if rigid firms enter the flexible market, no flexible firm will be made better off and no worker will be made worse off. Comparative static results of adding agents from the same side to the marriage market or to the assignment market have been obtained in the literature under the assumption that the agents are allocated according to the optimal stable outcome for firms or according to the optimal stable outcome for workers. However, in the approach considered in Sotomayor [92],

1. The firms that are added are different from the firms already in the market. For example, in the marriage market, where utility is non-transferable, the comparative static adds firms with flexible wages who can transfer utility.

2. The points which are compared belong to cores with quite distinct algebraic structures.

3. There may exist several quasi-optimal stable outcomes for firms and several quasi-optimal stable outcomes for workers in the hybrid model. Despite the multiplicity of these outcomes, all of them reveal the same kind of comparative static effects.

Therefore, the result above has no parallel in the non-hybrid models. It is argued in Sotomayor [92] that if the resulting core partition is not reduced to only one set when, say, the flexible firms leave the hybrid model, the comparative statics may be meaningless. This happens, for example, if we define a set of the core partition as the set of all stable payoffs compatible with some given matching. Then each lattice of the core partition for the marriage market has only one stable matching, which is both the supremum and the infimum of the lattice. Of course, the distinctions between, say, the best stable payoff for workers of some lattice of the core partition of the hybrid market and an arbitrary core point of the marriage model cannot be attributed to the entrance of the flexible firms into the marriage market. Results of comparative statics were originally obtained by Gale and Sotomayor [32] for the marriage model and the college admission model: if agents from the same side of the market enter the market, then no agent from this side is better off and no agent of the opposite side is worse off, if either of the two optimal stable matchings prevails. A similar result was proved by Demange and Gale [18] for a continuous one-to-one matching model that includes the assignment game. For the assignment game, Shapley [71] showed that the optimal stable payoff for an agent weakly decreases when another agent is added to the same side and weakly increases when another agent is added to the other side.
Still with regard to the assignment game, Mo [53] showed that if the incoming worker is allocated to some firm in some stable outcome for the new market, then there is a set of agents such that every firm in it is better off and every worker in it is worse off in the new market than in the previous one. A symmetric result holds when the incoming agent is a firm. An analogous result is demonstrated by Roth and Sotomayor [69] for the marriage market. For the many-to-one matching markets with substitutable preferences, Kelso and Crawford [42] showed that, within the context of (flexible) firms and workers, the addition of one or more firms to the market weakly improves the workers' payoffs, and the addition of one or more workers weakly improves the firms' payoffs, under the firm-optimal stable allocation. Similar conclusions were obtained by Crawford [14] for a many-to-many matching model with strict and substitutable preferences, by comparing pairwise-stable outcomes instead of setwise-stable outcomes.

Incentives

The strategic questions that emerge when a stable revelation mechanism is adopted concern its non-manipulability and, for the games induced by the mechanism, the existence of strategic equilibria and the implementability of the set of stable matchings via such equilibria. For the marriage model with strict preferences, the equilibrium analysis of a game induced by a stable matching mechanism leads to the following results:

1. (Impossibility Theorem; Roth and Sotomayor [69]) When any stable mechanism is applied to a marriage market in which preferences are strict and there is more than one stable matching, then at least one agent can profitably misrepresent his or her preferences, assuming the others tell the truth. (This agent can misrepresent in such a way as to be matched to his or her most preferred achievable mate under the true preferences at every stable matching under the mis-stated preferences.)

2. (Limits on successful manipulation; Demange, Gale, and Sotomayor [20]) Let P be the true preferences (not necessarily strict) of the agents, and let P′ differ from P in that some coalition C of men and women misstate their preferences. Then there is no matching μ, stable for P′, which is preferred to every stable matching under the true preferences P by all members of C. A corollary of this result is the theorem of Dubins and Freedman [21], which states that the man-optimal stable matching mechanism is non-manipulable, individually and collectively, by the men.

3. (Gale and Sotomayor [31]) When all preferences are strict, let μ be any stable matching for (F, W, P).
Suppose each woman w in μ(F) chooses the strategy of listing only μ(w) on her stated preference list of acceptable men (and each man states his true preferences). This is a Nash equilibrium in the game induced by the man-optimal stable matching mechanism (and μ is the matching that results).

4. (Roth [61]) Suppose each man chooses his dominant strategy and states his true preferences, and the women choose any set of strategies (preference lists) P′(w) that form a Nash equilibrium of the revelation game induced by the man-optimal stable mechanism. Then the corresponding man-optimal stable matching for (F, W, P′) is one of the stable matchings for (F, W, P).

5. (Gale and Sotomayor [31]) Suppose each man chooses his dominant strategy and states his true preferences, and the women truncate their true preferences at the mate they get under the woman-optimal stable mechanism. This profile of preferences is a strong equilibrium for the women in the game induced by the man-optimal stable mechanism (and the woman-optimal stable matching under the true preferences is the matching that results).

Results (3) and (4) imply that the man-optimal stable mechanism implements the core correspondence via Nash equilibria. For the college admission model with responsive and strict preferences, the theorem of Dubins and Freedman implies that the student-optimal stable mechanism is non-manipulable, individually and collectively, by the students. Roth [65] shows through an example that the college-optimal stable mechanism is manipulable by the colleges, due to the fact that the colleges may have a quota greater than one. Sotomayor [78,82,94] analyzes the strategic behavior of the students in a school choice model where participants have strict preferences over individuals. These papers prove that the college-optimal stable mechanism implements the set of stable matchings via the Nash equilibrium concept. When some other stable mechanism is used, an example shows that the strategic behavior of the students may lead to unstable matchings under the true preferences. A sufficient condition for the stability of the Nash equilibrium outcome is then proved to be that the set of stable matchings for the Nash equilibrium profile is a singleton. A random stable matching mechanism is proposed, and the ex-ante Nash equilibrium concept is shown to be equivalent to the ex-post Nash equilibrium concept of the game induced by such a mechanism. This refinement of the Nash equilibrium concept is called Nash equilibrium in the strong sense.
Under this equilibrium concept any stable matching mechanism (and in particular the random stable matching mechanism) implements the set of stable matchings. Also, if the students only play truncations of the true preferences, any stable matching mechanism implements the student-optimal stable matching via strong equilibrium in the strong sense and Nash equilibrium in the strong sense. Ergin and Sönmez [28] and Pathak and Sönmez [55] analyze the Boston mechanism, which is used to assign students to schools in many US cities, and show that students’ parents do not have incentives to report their preferences truthfully.
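Result (5) above — that under the man-proposing mechanism the women can secure the woman-optimal outcome by truncating their stated lists — can be sketched in a few lines. The function and the 3×3 preference profile below are illustrative, not from the original text:

```python
def man_proposing_da(men, women):
    """Man-proposing deferred acceptance.  Each stated list contains only
    acceptable partners, in decreasing order of preference."""
    free, nxt, held = list(men), {m: 0 for m in men}, {}   # held: woman -> man
    while free:
        m = free.pop(0)
        if nxt[m] == len(men[m]):
            continue                        # m exhausted his list; stays single
        w = men[m][nxt[m]]; nxt[m] += 1
        if m not in women[w]:               # m is unacceptable to w
            free.append(m)
        elif w not in held:
            held[w] = m
        elif women[w].index(m) < women[w].index(held[w]):
            free.append(held[w]); held[w] = m   # w trades up, rejects old mate
        else:
            free.append(m)
    return {m: w for w, m in held.items()}

men   = {"m1": ["w1", "w2", "w3"], "m2": ["w2", "w3", "w1"], "m3": ["w3", "w1", "w2"]}
women = {"w1": ["m2", "m3", "m1"], "w2": ["m3", "m1", "m2"], "w3": ["m1", "m2", "m3"]}

truthful  = man_proposing_da(men, women)   # -> man-optimal matching
# each woman lists only her woman-optimal mate:
truncated = man_proposing_da(men, {"w1": ["m2"], "w2": ["m3"], "w3": ["m1"]})
# truthful  -> {'m1': 'w1', 'm2': 'w2', 'm3': 'w3'}   (man-optimal)
# truncated -> {'m1': 'w3', 'm2': 'w1', 'm3': 'w2'}   (woman-optimal)
```

Under truthful reporting every man gets his first choice; after truncation the same man-proposing procedure delivers the woman-optimal matching, illustrating the strong-equilibrium result.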
Two-Sided Matching Models
Ma [49] analyzes the strategic behavior of both students and colleges in the college admission model with responsive preferences. This author proves that the set of stable matchings is implemented by any stable mechanism via rematching-proof equilibrium and strong equilibrium in truncation strategies at the match point. The implementability of the set of stable matchings through stable and not necessarily stable mechanisms has also been investigated by several authors. Alcalde [5] presents a mechanism for the marriage market, closely related to the algorithm of Gale and Shapley, which implements the core correspondence in undominated equilibria. Kara and Sönmez [40] analyze the problem of implementation in the college admission market. They show that the set of stable matchings is implementable in Nash equilibrium. Nevertheless, no subset of the core is Nash implementable. Romero-Medina [59] studies the mechanism employed by the Spanish universities to distribute students to colleges, which can produce unstable matchings for the stated preferences. However, when students play in equilibrium, only stable allocations are reached. Sotomayor [85] investigates a mechanism for the marriage model which is not designed to produce stable matchings. Here also the equilibrium outcomes are stable matchings under the true preferences. For the discrete many-to-one matching model with responsive preferences, Alcalde and Romero-Medina [7] analyze the following mechanism: firms announce a set of workers they want to hire; then each worker selects the firm she wants to work for. This paper proves that such a mechanism implements the set of stable matchings in subgame perfect Nash equilibrium. For the many-to-many case, Sotomayor [88] shows that this result does not carry over. This paper proves that subgame perfect Nash equilibria always exist, while strong equilibria may not exist.
The subgame perfect Nash equilibrium outcomes are precisely the pairwise-stable matchings, which may lie outside the core when the preferences of the agents on one of the sides are not maximin. When the maximin condition holds, the equilibrium outcomes are the setwise-stable matchings and every subgame perfect Nash equilibrium is a strong equilibrium. Assuming non-strict preferences, Abdulkadiroğlu, Pathak, Roth and Sönmez [2] show that no mechanism (stable or not, and Pareto optimal or not) that is better for the students than the student-proposing deferred acceptance algorithm with tie breaking can be strategy-proof. Sönmez [73] analyzes a model, which he calls the generalized indivisible allocation problem, that includes the roommate and the marriage markets. He looks for conditions that explain the differences in strategy-proofness
results that have been generated in the literature. He shows how some of the results in the literature can be seen as corollaries of his results. Ehlers and Massó [27] study Bayesian Nash equilibria of stable mechanisms (such as the NRMP) in matching markets under incomplete information. They show that truth-telling is an equilibrium of the Bayesian revelation game induced by a common belief and a stable mechanism if and only if all the profiles in the support of the common belief have singleton cores. For the continuous matching models the idea is to use competitive equilibrium as an allocation mechanism to produce outcomes with the desirable properties of fairness and efficiency. It involves having agents specify their supply and demand functions. The competitive equilibria are then calculated and allocations are made accordingly. Demange [17] and Leonard [48] considered the assignment game of Shapley and Shubik and, independently, proved that the allocation mechanism that yields the minimum competitive equilibrium prices is individually non-manipulable by the buyers. Demange and Gale [18] consider a one-to-one matching model in which the utilities are continuous functions in the money variable and not necessarily additively separable. These authors prove a sort of non-manipulability theorem which states that if the mechanism which produces the buyer-optimal stable payoff is adopted, then no coalition of buyers, by falsifying demands, can achieve, only through the mechanism, higher payoffs for all of its members. We add “only through the mechanism” because this model allows monetary transfers within any coalition. As in the marriage model, this result is an immediate consequence of a more general theorem due to Sotomayor [74] which states the following: Let (u′, w′; μ′) be any stable outcome for the market M′, where B′ ∪ Q′ is the set of agents who misrepresent their utility functions. Let (u, w) be the true payoff under (u′, w′; μ′).
Then there exists a stable payoff (u*, w*) for the original market such that u*_b > u_b for at least one b in B′ or w*_q > w_q for at least one q in Q′. Demange and Gale [18] also address the strategic behavior of the sellers when the mechanism produces the buyer-optimal stable payoff. These authors show that, by specifying their supply functions appropriately, the sellers can force, through strong Nash equilibrium strategies, the payoff to be given by the maximum rather than the minimum equilibrium price. Under the assumption that the sellers manipulate only their reservation prices, if a profile of strategies does not give the maximum-equilibrium-price allocation, then either some seller is using a dominated strategy or the strategy profile is not a Nash equilibrium. For this model Sotomayor [74] proves that the outcome
produced by a Nash equilibrium strategy is stable for the original market. Sotomayor [87] considers, for the assignment game of Shapley and Shubik, the strategic games induced by a class of market-clearing price mechanisms. In these procedures, buyers and sellers reveal their demand and supply functions in different stages, and a competitive equilibrium is produced by the mechanism. For each vector of reservation prices selected by the sellers, the buyers play the subgame that then starts, and they can force the buyer-optimal stable payoff through Nash equilibrium strategies. However, sellers can reverse this outcome by forcing the subgame perfect equilibrium allocation to be the seller-optimal stable payoff for the original market. Kamecke [38] and Pérez-Castrillo and Sotomayor [56] consider the assignment game of Shapley and Shubik. The former paper presents two mechanisms for this market. In the first one, agents act simultaneously. In the second game, the strategies are chosen sequentially. These mechanisms implement the social choice correspondences that yield the core and the optimal stable payoff for the sellers, respectively. The second paper analyzes a sequential mechanism, which implements the social choice correspondence that yields the optimal stable payoff for the sellers.
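To give a concrete sense of the objects these mechanisms select, the minimum (buyer-optimal) and maximum (seller-optimal) competitive-equilibrium price vectors of a tiny Shapley–Shubik assignment game can be found by brute force. The valuations below are made up for illustration; a real market would use the LP dual or an auction procedure rather than enumeration:

```python
from itertools import permutations, product

# v[b][q] is buyer b's integer valuation of item q; sellers' reservation
# prices are taken to be 0.  (Illustrative numbers, not from the text.)
v = [[5, 2],
     [3, 4]]
B, Q = len(v), len(v[0])

def competitive(p):
    """True if some assignment gives every buyer an item in his demand set
    at prices p (a surplus of 0 leaves the buyer indifferent to not buying;
    all items are sold here, so unsold-item checks are not needed)."""
    best = [max(0, *(v[b][q] - p[q] for q in range(Q))) for b in range(B)]
    return any(all(v[b][a[b]] - p[a[b]] == best[b] for b in range(B))
               for a in permutations(range(Q)))

ce = [p for p in product(range(6), repeat=Q) if competitive(p)]
p_min = min(ce, key=sum)   # buyer-optimal: the minimum competitive prices
p_max = max(ce, key=sum)   # seller-optimal: the maximum competitive prices
# p_min == (0, 0) and p_max == (5, 4) for this instance
```

Every buyer weakly prefers `p_min` and every seller weakly prefers `p_max`, which is exactly the polarization of interests between the two extreme points of the lattice of competitive prices that the mechanisms above fight over.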
Future Directions
In this section we present some directions for future investigation and some open problems which have intrigued matching theorists. 1. The discrete two-sided matching models with not necessarily strict preferences have been explored very little in the literature. In the discrete models under strict preferences and in the continuous models, because there are no weak blocking pairs, the set of Pareto-stable outcomes coincides with the set of setwise-stable outcomes. However, under weak preferences, setwise-stable matchings may not be Pareto-optimal. Sotomayor [93] proposes that in this case the Pareto-stability concept, which requires that the matching be stable and Pareto optimal, should be considered the natural solution concept. The justification for the Pareto-stability concept relies on the argument that, in a decentralized setting where agents freely get together in groups, recontracting between pairs of agents already allocated according to a stable matching should be allowed when it leads to a (weak) Pareto improvement of the original matching. Thus, weak blockings can upset a matching when they come from the grand coalition.
We think that the study of the discrete two-sided matching models with not necessarily strict preferences, and the search for algorithms to produce the Pareto-stable matchings, is a new and interesting line of investigation.
2. Consider the hybrid model where no worker is rigid. If rigid firms enter the flexible market or flexible firms enter the rigid market, then no firm gains and no worker loses if a quasi-optimal stable outcome for one of the sides always prevails. Suppose now that some rigid firm becomes flexible or some flexible firm becomes rigid. What kind of comparative static effect is caused by this change in the market?
3. One line of investigation not yet explored in the literature concerns the incentives faced by the agents in the hybrid model when some stable allocation mechanism is used.
4. Consider the discrete many-to-many matching market with substitutable preferences, where a matching μ is feasible if Ch_y(μ(y)) = μ(y) for every agent y. Is the core always non-empty for this model?
5. Consider the assignment game of Shapley and Shubik in the context of buyers and sellers. Consider a sealed-bid auction in which the buyers select a monetary value for each of the items. The auctioneer then chooses a competitive equilibrium price vector for the profile of selected values, according to some pre-set probability distribution. The investigation of the buyers’ strategic behavior is of theoretical interest.
6. We know that the core of the many-to-many assignment model of Sotomayor [75], in which the agents negotiate in blocks, is not a lattice. However, it is still an open problem whether the optimal stable payoffs for each side of the market always exist.
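The phenomenon behind the first direction — that with indifferences a weakly stable matching can fail Pareto optimality — can be checked mechanically. In the toy instance below (our own, with equal utility numbers encoding indifference), the matching `mu` is weakly stable, yet the swap `nu` makes one woman strictly better off and nobody worse off:

```python
# Utilities encoding weak preferences: equal numbers mean indifference.
um = {("m1", "w1"): 1, ("m1", "w2"): 1,    # m1 indifferent between w1, w2
      ("m2", "w1"): 1, ("m2", "w2"): 1}    # m2 indifferent as well
uw = {("m1", "w1"): 1, ("m2", "w1"): 1,    # w1 indifferent
      ("m1", "w2"): 2, ("m2", "w2"): 1}    # w2 strictly prefers m1

def stable(mu):
    """Weak stability: no pair in which BOTH strictly prefer each other."""
    wife, husband = mu, {w: m for m, w in mu.items()}
    return not any(um[m, w] > um[m, wife[m]] and uw[m, w] > uw[husband[w], w]
                   for m in wife for w in husband)

def pareto_dominates(nu, mu):
    """nu weakly improves every agent and strictly improves at least one."""
    hus_mu = {w: m for m, w in mu.items()}
    hus_nu = {w: m for m, w in nu.items()}
    gains = [um[m, nu[m]] - um[m, mu[m]] for m in mu] + \
            [uw[hus_nu[w], w] - uw[hus_mu[w], w] for w in hus_mu]
    return all(g >= 0 for g in gains) and any(g > 0 for g in gains)

mu = {"m1": "w1", "m2": "w2"}   # weakly stable, but not Pareto-optimal
nu = {"m1": "w2", "m2": "w1"}   # a stable Pareto improvement of mu
```

Here (m1, w2) is only a weak blocking pair of `mu` (w2 strictly prefers m1, but m1 is indifferent), so it cannot upset `mu` under weak stability; only the grand-coalition recontract to `nu` reveals the inefficiency, which is exactly the motivation for the Pareto-stability concept.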
Bibliography 1. Abdulkadiroglu A, Sönmez T (2003) School Choice: A Mechanism Design Approach. Am Econ Rev 93(3):729–747 2. Abdulkadiroglu A, Pathak P, Roth A, Sönmez T (2006) Changing the Boston School Choice Mechanism: Strategy-proofness as Equal Access, working paper. Boston College and Harvard University, Boston 3. Abeledo H, Isaak G (1991) A characterization of graphs which assure the existence of stable matchings. Math Soc Sci 22(1):93–96 4. Adachi H (2000) On a characterization of stable matchings. Econ Lett 68(1):43–49 5. Alcalde J (1996) Implementation of Stable Solutions to Marriage Problems. J Econ Theory 69(1):240–254 6. Alcalde J, Pérez-Castrillo D, Romero-Medina A (1998) Hiring Procedures to Implement Stable Allocations. J Econ Theory 82(2):469–480
7. Alcalde J, Romero-Medina A (2000) Simple Mechanisms to Implement the Core of College Admissions Problems. Games Econ Behav 31(2):294–302 8. Balinski M, Sönmez T (1999) A Tale of Two Mechanisms: Student Placement. J Econ Theory 84(1):73–94 9. Bardella F, Sotomayor M (2006) Redesign and analysis of an admission market to the graduate centers of economics in Brazil: a natural experiment in market organization, working paper. Universidade de São Paulo, São Paulo 10. Bikhchandani S, Mamer J (1997) Competitive Equilibrium in an Exchange Economy with Indivisibilities. J Econ Theory 74(2):385–413 11. Birkhoff G (1973) Lattice theory, vol 25. Colloquium Publications. American Mathematical Society, Providence 12. Blair C (1988) The Lattice Structure of the Set of Stable Matchings with Multiple Partners. Math Oper Res 13(4):619–628 13. Chung K (2000) On the Existence of Stable Roommate Matchings. Games Econ Behav 33(2):206–230 14. Crawford V (1991) Comparative statics in matching markets. J Econ Theory 54(2):389–400 15. Crawford V (2008) The Flexible-Salary Match: A Proposal to Increase the Salary Flexibility of the National Resident Matching Program. J Econ Behav Organ 66(2):149–160 16. Crawford V, Knoer E (1981) Job Matching with Heterogeneous Firms and Workers. Econometrica 49(2):437–450 17. Demange G (1982) Strategyproofness in the assignment market game, working paper. Ecole Polytechnique, Laboratoire D’Econometrie, Paris 18. Demange G, Gale D (1985) The Strategy Structure of Two-Sided Matching Markets. Econometrica 53(4):873–888 19. Demange G, Gale D, Sotomayor M (1986) Multi-Item Auctions. J Political Econ 94(4):863–872 20. Demange G, Gale D, Sotomayor M (1987) A further note on the stable matching problem. Discret Appl Math 16(3):217–222 21. Dubins L, Freedman D (1981) Machiavelli and the Gale-Shapley Algorithm. Am Math Mon 88(7):485–494 22. Dutta B, Massó J (1997) Stability of Matchings When Individuals Have Preferences over Colleagues.
J Econ Theory 75(2):464–475 23. Echenique F, Oviedo J (2004) Core many-to-one matchings by fixed-point methods. J Econ Theory 115(2):358–376 24. Echenique F, Oviedo J (2006) A theory of stability in many-to-many matching markets. Theor Econ 1(2):233–273 25. Echenique F, Yenmez M (2007) A solution to matching with preferences over colleagues. Games Econ Behav 59(1):46–71 26. Eeckhout J (2000) On the uniqueness of stable marriage matchings. Econ Lett 69(1):1–8 27. Ehlers L, Massó J (2004) Incomplete Information and Small Cores in Matching Markets, working paper. CREA, Barcelona 28. Ergin H, Sönmez T (2006) Games of school choice under the Boston mechanism. J Public Econ 90(1–2):215–237 29. Eriksson K, Karlander J (2000) Stable matching in a common generalization of the marriage and assignment models. Discret Math 217(1):135–156 30. Gale D, Shapley L (1962) College Admissions and the Stability of Marriage. Am Math Mon 69(1):9–15 31. Gale D, Sotomayor M (1985) Ms. Machiavelli and the Stable Matching Problem. Am Math Mon 92(4):261–268 32. Gale D, Sotomayor M (1983,1985) Some remarks on the stable matching problem. Discret Appl Math 11:223–232
33. Gül F, Stacchetti E (1999) Walrasian Equilibrium with Gross Substitutes. J Econ Theory 87(1):95–124 34. Gül F, Stacchetti E (2000) The English Auction with Differentiated Commodities. J Econ Theory 92(1):66–95 35. Gusfield D (1988) The Structure of the Stable Roommate Problem: Efficient Representation and Enumeration of All Stable Assignments. SIAM J Comput 17:742–769 36. Hatfield J, Milgrom P (2005) Matching with Contracts. Am Econ Rev 95(4):913–935 37. Irving R (1985) An efficient algorithm for the stable roommates problem. J Algorithms 6:577–595 38. Kamecke U (1989) Non-cooperative matching games. Int J Game Theory 18(4):423–431 39. Kaneko M (1982) The central assignment game and the assignment markets. J Math Econ 10(2–3):205–232 40. Kara T, Sönmez T (1996) Nash Implementation of Matching Rules. J Econ Theory 68(2):425–439 41. Kara T, Sönmez T (1997) Implementation of college admission rules. J Econ Theory 9(2):197–218 42. Kelso Jr A, Crawford V (1982) Job Matching, Coalition Formation, and Gross Substitutes. Econometrica 50(6):1483–1504 43. Kesten O (2004) Student placement to public schools in the US: Two new solutions, working paper. University of Rochester, Rochester 44. Kesten O (2006) On two competing mechanisms for priority-based allocation problems. J Econ Theory 127(1):155–171 45. Knuth D (1976) Marriage Stables. Les Presses de l’Université de Montréal, Montréal 46. Konishi H, Ünver MU (2006) Credible group stability in many-to-many matching problems. J Econ Theory 127(1):57–80 47. Kraft C, Pratt J, Seidenberg A (1959) Intuitive Probability on Finite Sets. Ann Math Stat 30(2):408–419 48. Leonard H (1983) Elicitation of Honest Preferences for the Assignment of Individuals to Positions. J Political Econ 91(3):461–479 49. Ma J (2002) Stable matchings and the small core in Nash equilibrium in the college admissions problem. Rev Econ Des 7(2):117–134 50.
Martínez R, Massó J, Neme A, Oviedo J (2001) On the lattice structure of the set of stable matchings for a many-to-one model. Optimization 50(5):439–457 51. Martínez R, Massó J, Neme A, Oviedo J (2004) An algorithm to compute the full set of many-to-many stable matchings. Math Soc Sci 47(2):187–210 52. McVitie D, Wilson L (1970) Stable marriage assignment for unequal sets. BIT Numer Math 10(3):295–309 53. Mo J (1988) Entry and structures of interest groups in assignment games. J Econ Theory 46(1):66–96 54. Ostrovsky M (2008) Stability in Supply Chain Networks. Am Econ Rev 98(3):897–923 55. Pathak P, Sönmez T (2006) Leveling the Playing Field: Sincere and Strategic Players in the Boston Mechanism, working paper. Boston College and Harvard University, Boston 56. Pérez-Castrillo D, Sotomayor M (2002) A Simple Selling and Buying Procedure. J Econ Theory 103(2):461–474 57. Pycia M (2007) Many-to-One Matching with Complementarities and Peer Effects, working paper. Penn State Working Paper, Pennsylvania 58. Rochford S (1984) Symmetrically Pairwise-Bargained Allocations in an Assignment Market. J Econ Theory 34(2):262–281
59. Romero-Medina A (1998) Implementation of stable solutions in a restricted matching market. Rev Econ Des 3(2):137–147 60. Roth A (1982) The Economics of Matching: Stability and Incentives. Math Oper Res 7(4):617–628 61. Roth A (1984) Misrepresentation and Stability in the Marriage Problem. J Econ Theory 34(2):383–387 62. Roth A (1984) Stability and Polarization of Interests in Job Matching. Econometrica 52(1):47–58 63. Roth A (1984) The Evolution of the Labor Market for Medical Interns and Residents: A Case Study in Game Theory. J Political Econ 92(6):991–1016 64. Roth A (1985) Conflict and Coincidence of Interest in Job Matching: Some New Results and Open Questions. Math Oper Res 10(3):379–389 65. Roth A (1985) The College Admissions Problem is not Equivalent to the Marriage Problem. J Econ Theory 36(2):277–288 66. Roth A (1986) On the Allocation of Residents to Rural Hospitals: A General Property of Two-Sided Matching Markets. Econometrica 54(2):425–427 67. Roth A, Sotomayor M (1988) Interior points in the core of two-sided matching markets. J Econ Theory 45(1):85–101 68. Roth A, Sotomayor M (1989) The College Admissions Problem Revisited. Econometrica 57(3):559–570 69. Roth A, Sotomayor M (1990) Two-Sided Matching: A Study in Game-Theoretic Modeling and Analysis, vol 18. In: Econometric Society Monographs. Cambridge University Press, New York 70. Roth A, Sotomayor M (1996) Stable outcomes in discrete and continuous models of two-sided matching: A unified treatment. Braz Rev Econom 16:1–4 71. Shapley L (1962) Complements and substitutes in the Optimal Assignment Problem. Naval Res Logist Q 9:45–48 72. Shapley L, Shubik M (1972) The Assignment Game I: The core. Int J Game Theory 1(1):111–130 73. Sönmez T (1999) Strategy-Proofness and Essentially Single-Valued Cores. Econometrica 67(3):677–689 74. Sotomayor M (1986) On incentives in a two-sided matching market, working paper. Department of Mathematics, PUC/RJ, Rio de Janeiro 75.
Sotomayor M (1992) The multiple partners game. In: Majumdar M (ed) Equilibrium and dynamics: Essays in honour of David Gale. MacMillan Press Ltd, New York, pp 322–336 76. Sotomayor M (1996) A Non-constructive Elementary Proof of the Existence of Stable Marriages. Games Econ Behav 13(1):135–137 77. Sotomayor M (1996) Admission mechanisms of students to colleges. A game-theoretic modeling and analysis. Braz Rev Econom 16(1):25–63 78. Sotomayor M (1998) The strategy structure of the college admissions stable mechanisms. In: Annals of Jornadas Latino Americanas de Teoria Econômica, San Luis, Argentina, 2000; First World Congress of the Game Theory Society, Bilbao, 2000; 4th Spanish Meeting, Valencia, 2000; International Symposium of Mathematical Programming, Atlanta, 2000; World Congress of the Econometric Society, Seattle, 2000, http://www.econ.fea.usp.br/marilda/artigos/roommates_1.doc 79. Sotomayor M (1999) The lattice structure of the set of stable outcomes of the multiple partners assignment game. Int J Game Theory 28(4):567–583
80. Sotomayor M (1999) Three remarks on the many-to-many stable matching problem. Math Soc Sci 38(1):55–70 81. Sotomayor M (2000) Existence of stable outcomes and the lattice property for a unified matching market. Math Soc Sci 39(2):119–132 82. Sotomayor M (2000) Reaching the core through college admissions stable mechanisms. In: Annals of abstracts of the following congresses: International Conference on Game Theory, 2001, Stony Brook; Annals of Brazilian Meeting of Econometrics, Salvador, Brazil, 2001; Latin American Meeting of the Econometric Society, Buenos Aires, Argentina, 2001, http://www.econ.fea.usp.br/marilda/artigos/reaching_%20core_random_stable_allocation_mechanisms.pdf 83. Sotomayor M (2002) A simultaneous descending bid auction for multiple items and unitary demand. Revista Brasileira Economia 56:497–510 84. Sotomayor M (2003) A labor market with heterogeneous firms and workers. Int J Game Theory 31(2):269–283 85. Sotomayor M (2003) Reaching the core of the marriage market through a non-revelation matching mechanism. Int J Game Theory 32(2):241–251 86. Sotomayor M (2003) Some further remark on the core structure of the assignment game. Math Soc Sci 46:261–265 87. Sotomayor M (2004) Buying and selling strategies in the assignment game, working paper. Universidade de São Paulo, São Paulo 88. Sotomayor M (2004) Implementation in the many-to-many matching market. Games Econ Behav 46(1):199–212 89. Sotomayor M (2005) The roommate problem revisited, working paper. Universidade de São Paulo, São Paulo. http://www.econ.fea.usp.br/marilda/artigos/THE_ROOMMATE_PROBLEM_REVISITED_2007.pdf 90. Sotomayor M (2006) Adjusting prices in the many-to-many assignment game to yield the smallest competitive equilibrium price vector, working paper. Universidade de São Paulo, São Paulo. http://www.econ.fea.usp.br/marilda/artigos/A_dynam_stable-mechan_proof_lemma_1.pdf 91.
Sotomayor M (2007) Connecting the cooperative and competitive structures of the multiple partners assignment game. J Econ Theory 134(1):155–174 92. Sotomayor M (2007) Core structure and comparative statics in a hybrid matching market. Games Econ Behav 60(2):357–380 93. Sotomayor M (2008) The Pareto-Stability concept is a natural solution concept for the Discrete Matching Markets with indifferences, working paper. Universidade de São Paulo, São Paulo. http://www.econ.fea.usp.br/marilda/artigos/ROLE_PLAYED_SIMPLE_OUTCOMES_STABLE_COALITI_3.pdf 94. Sotomayor M (2007) The stability of the equilibrium outcomes in the admission games induced by stable matching rules. Int J Game Theory 36(3–4):621–640 95. Tan J (1991) A Necessary and Sufficient Condition for the Existence of a Complete Stable Matching. J Algorithms 12(1):154–178 96. Tarski A (1955) A lattice-theoretical fixpoint theorem and its applications. Pac J Math 5(2):285–309 97. Thompson G (1980) Computing the core of a market game. In: Fiacco AA, Kortane K (eds) Extremal Methods and Systems Analysis, vol 174. Springer, New York, pp 312–324
Unconventional Computing, Introduction to
ANDREW ADAMATZKY
University of the West of England, Bristol, UK
Unconventional computing deals with computing and information processing derived from, or implemented in, physical, chemical and biological systems. As Toffoli once said, “a computing scheme that today is viewed as unconventional may well be so because its time hasn’t come yet – or is already gone” [1]. Mechanical and analog computers are characteristic representatives of such computing schemes. Where mechanical computation is concerned, unconventional computers can be traced back to the centuries-old harmonic analyzers and synthesizers, Napier’s bones, and Pascal’s wheeled calculator, and they have recently been revived in self-assembling devices and molecular machines (see Mechanical Computing: The Computational Complexity of Physical Devices). In 1876 Lord Kelvin envisaged computation that directly exploits the laws of Nature [2] and invented the differential analyzer. Ideas of analog computation emerged, flourished, had almost died out by the 1980s, and were then resurrected in the 1990s in the ever-growing field of computing based on continuous representations by means of continuous processes (see Analog Computation). This field also branched into the theory of computing with light and related continuous state machines (see Optical Computing). Chemical and molecular media and substrates are now widely explored as prototypes of future non-standard computing devices. They can be classified into the following groups. First, models based on interpretations of chemical reactions as computing processes, including autocatalytic polymer chemistries, chemistries inspired by Turing machines, and lattice molecular systems (see Artificial Chemistry). Second, computers which utilize replication of macro-molecules.
Here DNA and other polymer molecules are the basic elements of computing devices: they solve the Hamiltonian path problem, implement chess problems and binary arithmetic (see DNA Computing), and also realize molecular finite state machines and therapeutic and diagnostic automata (see Molecular Automata). Chemical computers of the third type compute by interactions between propagating diffusive and phase wave fronts in chemical media. When waves, e.g. in an excitable chemical medium, are confined by the geometrical restrictions of channels and junctions, one can implement diodes, frequency transformers, logical gates and chemical sensors (see Computing in Geometrical Constrained Excitable Chemical Systems). Waves propagating in unconstrained, or free-space, chemical systems have also proved capable of implementing logical circuits, robot control, and computational geometry (see Reaction-Diffusion Computing). Moving into the domain of physics, we must highlight two more types of computing devices: soliton-based computers and quantum computers. The soliton-based devices compute by colliding self-preserved solitary waves, which change their state or trajectories as a result of the collision and are thus capable of implementing basic logical gates (see Computing with Solitons). The quantum computers are based on the laws of quantum mechanics, such as superposition and entanglement, interpreted in terms of logical operations (see Quantum Computing). Not only can the laws of physics be exploited in computation; conversely, computing and computers themselves can be analyzed in terms of physical laws, e.g. computer equivalents of the first and second laws of thermodynamics, and the thermodynamics of digital and analog computers (see Thermodynamics of Computation). Fundamental microscopic physical properties of natural systems are reflected in the models of reversible computing (see Reversible Computing), which also form the underlying paradigm of thermodynamically efficient computation. The next range of unconventional computers in our – very conditional, and somewhat illusory – classification of non-standard computing devices comprises those relying on biological substrates (in addition to DNA computing). They are bacterial and cellular computers (see Bacterial Computing, Cellular Computing), employing biochemistry, genetic circuitry and inter- and intra-cellular communication for the transformation of information; immune computers, which mimic the behavior of the immune system (see Immunecomputing); and computing based on compartmentalization of a computing medium with membranes (see Membrane Computing).
There is also a family of computers classified not by implementation substrate but by some other characteristic. For instance, nano-computers are based on technology employing devices and wires with feature sizes on the order of a few nanometers (see Nanocomputers). Amorphous computers are built of a collection of computational particles, with no a priori knowledge of their positions or orientations, dispersed irregularly on a surface or throughout a volume (see Amorphous Computing). Evolving unconventional computers can learn, and be programmed, to reconfigure their physical structure to meet certain computational demands; so far we have examples of the evolving computing abilities of liquid crystals,
conducting and electro-activated polymers, and voltage-controlled colloids (see Evolution in Materio). Biological, chemical and other ‘wet’ substrates are indeed exciting prototypes for future computing devices. However, there is still great potential for further enhancement and miniaturization of hardware systems, e.g. hardware analogs of reaction-diffusion systems, digital CMOS quasi-chemical chips, devices based on minority-carrier transport in semiconductors, and networks of single-electron oscillators (see Unconventional Computing, Novel Hardware for).
Bibliography
1. Toffoli T (1998) Programmable matter methods. Future Gener Comput Syst 16:187–201
2. Thomson W (1876) On an instrument for calculating the integral of the product of two given functions (Lord Kelvin). Proc R Soc 24:266–275
Unconventional Computing, Novel Hardware for
TETSUYA ASAI
Hokkaido University, Sapporo, Japan
Article Outline
Glossary
Definition of the Subject
Introduction
Constructing Electrical Analog of Reaction-Diffusion Systems
Digital CMOS Reaction-Diffusion Chips
Analog CMOS Reaction-Diffusion Chip
Reaction-Diffusion Computing Devices Based on Minority-Carrier Transport in Semiconductors
Single-Electron Reaction-Diffusion System
Collision-Based RD Computers
Future Directions
Bibliography
Glossary
Analog circuit An electronic circuit that operates with currents and voltages that vary continuously with time and have no abrupt transitions between levels. Since most physical quantities, e.g., velocity and temperature, vary continuously, as does audio, an analog circuit provides the best means of representing them.
Current mirror A circuit that copies a single input current to one or several output nodes. Two types of current mirror exist: nMOS for current sinks and pMOS for current sources. By combining both types, one can invert the direction of a current, e.g., sink to source or source to sink.
Digital circuit An electronic circuit that can take on only a finite number of states. Binary (two-state) digital circuits are the most common. The two possible states of a binary circuit are represented by the binary digits, or bits, 0 and 1. The simplest forms of digital circuits are built from logic gates, the building blocks of the digital computer.
Diode A device that allows current flow in only one direction. A chemical diode allows propagation of chemical waves in only one direction.
Flip-flop circuit A synchronous bistable device whose output changes state only when the clock input is triggered; that is, changes in the output occur in synchronization with the clock.
Floating-gate transistor A device consisting of a control gate, a floating gate and a thin oxide layer; when the floating gate is given an electrical charge, the charge is trapped by the insulating thin oxide layer. These transistors are used as non-volatile storage devices because they store electrical charge for a long time without powering.
LSI, large-scale integration circuit An electronic circuit built on a semiconductor substrate, usually one of single-crystal silicon. It contains from 100 to 1000 transistors. Some LSI circuits are analog devices; an operational amplifier is an example. Other LSI circuits, such as the microprocessors used in computers, are digital devices.
Minority-carrier transport A physical phenomenon in forward-biased semiconductor p-n junctions. Minority carriers are generated in both the p- and n-type regions; in p-type semiconductors the minority carriers are electrons, while in n-type semiconductors they are holes. Once generated, minority carriers diffuse through the semiconductor and finally disappear through the recombination of electrons and holes.
nMOS FET Abbreviation of n-type metal-oxide-semiconductor field-effect transistor, in which conduction is by the movement of electrons; these transistors have three modes of operation: cut-off, triode, and saturation (active).
pMOS FET A device that works by analogy to the nMOS FET but is turned on and off by the movement of electron vacancies (holes).
Single-electron circuit An electrical circuit that functions by controlling the movements of single electrons. A single-electron circuit consists of tunneling junctions, and the electrons are controlled by exploiting a physical phenomenon called the Coulomb blockade.
Definition of the Subject

Natural systems give us examples of amorphous, unstructured devices capable of fault-tolerant information processing, particularly with regard to the massively parallel spatial problems that digital processors have difficulty with. For example, reaction-diffusion (RD) chemical systems have the unique ability to solve combinatorial problems efficiently thanks to their natural parallelism [5]. In liquid-phase parallel RD processors (RD chemical computers), both the data and the results of the computation are encoded as concentration profiles of the reagents. The computation is performed via the spreading and interaction of the wave
Unconventional Computing, Novel Hardware for
fronts. In experimental chemical processors, data are represented by local disturbances in the concentrations, and computation is accomplished via the interaction of waves caused by the local disturbances. RD chemical computers operate in parallel, since the chemical medium's micro-volumes update their states simultaneously and the molecules diffuse and react in parallel. We see a similar parallelism in cellular automata (CA). Various RD systems can be modeled in terms of CA, including the Belousov–Zhabotinsky (BZ) reaction [5,30], the Turing system [62], a precipitating BZ system for computation of Voronoi diagrams [1,2], and so on. A two-dimensional CA is particularly well suited for the coming generation of massively parallel machines, in which a very large number of separate processors act in parallel. If an elemental processor in the CA is constructed from a smart processor and a photosensor, various CA algorithms can easily be used to develop intelligent image sensors. Implementing RD systems in hardware has several advantages. Hardware RD systems are very useful for simulating RD phenomena, even phenomena that never occur in nature. This implies that a hardware system is a possible candidate for developing an artificial RD system that is superior to a natural one. For instance, hardware RD systems can operate at much higher speeds than actual RD systems. The velocity of chemical waves in a BZ reaction is O(10⁻²) m/s [56], while that of a hardware RD system will be over a million times faster than that of the BZ reaction, independent of system size. This property is useful for developers of RD applications because every RD application benefits from high-speed operation. These properties encouraged us to develop novel hardware for unconventional (RD-based) computing.

Introduction

This chapter presents an overview of the semiconductor implementation of reaction-diffusion (RD) computers in large-scale integrated (LSI) circuits for unconventional computing.
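The wave-based Voronoi computation mentioned above can be illustrated in software by a toy multi-source breadth-first "wavefront" expansion on a grid. This is an illustrative sketch only, not the chemical or silicon processor itself; the function name and the 4-neighbour (Manhattan) metric are our own choices. Every seed launches a wave, and cells where waves from different seeds collide mark the Voronoi bisectors.

```python
from collections import deque

def voronoi_wavefront(width, height, seeds):
    """Toy wave-based Voronoi: expand a wavefront (BFS) from every seed
    in parallel; cells where waves from different seeds meet approximate
    the Voronoi bisectors in the 4-neighbour (Manhattan) metric."""
    label = {}                 # cell -> index of the seed whose wave arrived first
    frontier = deque()
    for i, s in enumerate(seeds):
        label[s] = i
        frontier.append(s)
    boundary = set()
    while frontier:
        x, y = frontier.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nx < width and 0 <= ny < height):
                continue
            if (nx, ny) not in label:
                label[(nx, ny)] = label[(x, y)]   # wave claims a fresh cell
                frontier.append((nx, ny))
            elif label[(nx, ny)] != label[(x, y)]:
                boundary.add((nx, ny))            # waves from different seeds collide
    return label, boundary

labels, bisector = voronoi_wavefront(40, 40, [(5, 5), (30, 30)])
```

With two seeds, the collision cells approximate the perpendicular bisector between them, mimicking how colliding chemical wavefronts leave a precipitate-free boundary.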
There, we see how to model RD processes in LSI circuits, and we discuss several designs of digital RD chips based on cellular-automaton models of RD and excitable systems. The feasibility of a digital RD chip is demonstrated by the construction of a Voronoi diagram and the decomposition of images. The chapter concludes with analog RD chips, in which nonlinear characteristics closer to those of real chemical systems are employed. We see designs of RD chips based on the Oregonator, Turing systems, and so on. Moreover, the functionality of analog RD chips is exemplified in feature-extraction and fingerprint-reconstruction tasks.
An RD chip consists of (i) reaction circuits that emulate elementary interactions between neurons (or chemical substances) and (ii) diffusion devices that imitate synapses (or chemical diffusion of the substances). RD chips have mostly been designed as digital, analog, or mixed-signal complementary metal-oxide-semiconductor (CMOS) circuits of cellular neural networks (CNNs) or cellular automata (CA). Electrical cell circuits have been designed to implement several CA and CNN models of RD systems [8,17,20,38], as well as fundamental RD equations [18,21,24,33,50]. Each cell is arranged on a 2D square or hexagonal grid and is connected with adjacent cells through coupling devices that transmit a cell's state to its neighboring cells, as in conventional CAs. For instance, an analog-digital hybrid RD chip [17] was designed to emulate a conventional CA model of BZ reactions [30]. A precipitating BZ system for computation of Voronoi diagrams [1,2] was also implemented on an analog-digital hybrid RD chip [20]. A full-digital RD processor [38] was also designed on the basis of a multiple-valued CA model, called excitable lattices [5]. Furthermore, an RD CA processor for complex image processing has been proposed [19]; it performs quadrilateral-object extraction based on serial and parallel CA algorithms. An analog cell circuit was also designed to be equivalent to spatially discrete Turing RD systems [24]. A full-analog RD chip that emulates BZ reactions has also been designed and fabricated [21]. Furthermore, blueprints of non-CMOS RD chips have been designed, e.g., an RD device based on minority-carrier transport in semiconductor devices [18]. In the following sections, we see how to construct an artificial RD system on solid-state media and how to develop applications in which the solid-state RD system can compete with conventional digital computers.
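The reaction-circuit/diffusion-device structure described above maps directly onto a simple numerical scheme: each cell applies a local reaction function to its state and exchanges state with its four neighbours through a discrete Laplacian. A minimal sketch follows; the FitzHugh–Nagumo-style reaction term and all coefficients are illustrative assumptions, not the kinetics of any particular chip discussed here.

```python
import numpy as np

def laplacian(u):
    """Discrete 4-neighbour Laplacian with periodic (wrap-around) boundaries."""
    return (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
            np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u)

def rd_step(u, v, du=1.0, dv=0.5, dt=0.01):
    """One explicit Euler step of the Eq. (1) form du/dt = f(u) + D lap(u).
    The FitzHugh-Nagumo-like reaction term below is purely illustrative."""
    f = u - u**3 - v           # "reaction circuit": activator kinetics
    g = 0.1 * (u - 0.5 * v)    # "reaction circuit": inhibitor kinetics
    u2 = u + dt * (f + du * laplacian(u))   # "diffusion device": neighbour coupling
    v2 = v + dt * (g + dv * laplacian(v))
    return u2, v2

rng = np.random.default_rng(0)
u = rng.uniform(-0.1, 0.1, (64, 64))   # small random initial disturbance
v = np.zeros((64, 64))
for _ in range(100):                   # all cells update in parallel each step
    u, v = rd_step(u, v)
```

The vectorized update mirrors the chips' parallelism: every cell's reaction and diffusion are evaluated simultaneously in each time step.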
Constructing Electrical Analog of Reaction-Diffusion Systems

The behavior of RD systems, or the spatiotemporal patterns of chemical concentration, can be expressed by the reaction-diffusion equation, a partial differential equation with chemical concentrations as variables:

∂u/∂t = f(u) + D∇²u,  u = (u1, u2, u3, …),  (1)

where t is time, u is the vector of chemical concentrations, ui is the concentration of the ith substance, and D is the diagonal matrix of diffusion coefficients. The nonlinear function f(u) is the reaction term that represents the reaction kinetics of the system. The spatial-derivative term D∇²u is the diffusion term that represents the change of u due to diffusion of the substances. A greater number of variables results
Unconventional Computing, Novel Hardware for, Figure 1 Simplified model of RD systems, consisting of many chemical oscillators. Each oscillator has variables corresponding to chemical concentrations u1, u2, u3, … in Eq. (1) and interacts with its neighbors through diffusion of substances
in more complex dynamics and a more complicated dissipative structure. A simple reaction-diffusion system with few variables, however, will still exhibit dynamics similar to biological activity. An RD system can be considered an aggregate of coupled chemical oscillators, or a chemical cellular automaton, as described in Fig. 1. Each oscillator represents the local reaction of chemical substances and generates nonlinear dynamics du/dt = f(u) that corresponds to the reaction kinetics in Eq. (1). The oscillator interacts with its neighbors through local diffusion of substances; this corresponds to the diffusion term in Eq. (1) and produces dynamics du/dt = D∇²u. Because of diffusion, all oscillators correlate with one another to generate synchronization and entrainment. Consequently, the system as a whole produces orderly dissipative structures on a macroscopic level. The size of each oscillator, or the size of the local space in which chemical concentrations are roughly uniform, depends on the diffusion coefficients and reaction velocities in the system. It is several micrometers in diameter in many liquid RD systems; therefore, even a tiny RD system in a test tube contains millions of oscillators. An electrical analog of RD systems can be created by using electrical oscillation circuits instead of chemical oscillators and coupling these circuits with one another in a way that imitates diffusion. The variables in this electrical RD system are the electrical potentials of nodes in the oscillation circuits. The system will produce electrical dissipative structures, i.e., orderly spatiotemporal patterns of node potentials, under appropriate conditions. The key to building an electrical RD system is to integrate a large number of oscillation circuits on a chip with coupling subcircuits. A large arrangement of oscillators (e.g., 1000 × 1000 or more) is needed to generate the complex, varied dissipative structures observed in chemical RD systems. Oscillators constructed from micro- or nanoscale circuits are thus useful for achieving such large-scale integration. Such circuits can generate nonlinear oscillation with a simple circuit structure, so they can effectively be used to produce small oscillators for electrical RD systems.

Digital CMOS Reaction-Diffusion Chips

A Reaction-Diffusion Circuit Based on Cellular-Automaton Processing Emulating the Belousov–Zhabotinsky Reaction

The Belousov–Zhabotinsky (BZ) reaction provides us with important clues for controlling 2D phase-lagged stable synchronous patterns in an excitable medium. Because of the difficulty of computing large RD systems with conventional digital processors, a cellular-automaton (CA) circuit that emulates the BZ reaction was proposed [17]. In the circuit, a two-dimensional array of parallel processing cells, shown in Fig. 2, is responsible for the emulation,
Unconventional Computing, Novel Hardware for, Figure 2 Basic construction of RD chip: a diffusion device and reaction circuit; b reaction-diffusion chip
Unconventional Computing, Novel Hardware for, Figure 3 Excitatory modelock operations of RD circuit [17]
and its operation rate is independent of the system size. The operations of the CA circuit were demonstrated by using a simulation program with integrated circuit emphasis (SPICE). In the circuit’s initial state, cells adjacent to inactive cells were in a refractory period (step 0 in Fig. 3). The inactive cells adjacent to the white bar in Fig. 3 were suppressed by adjacent cells in the refractory period (cells in the white bar). The inactive cells then entered an active,
inactive, or refractory period, depending on the degree of the refractory condition. When the inactive cells were in an active or inactive period, the tip of the bar rotated inward (steps 4 to 8 in Fig. 3), resulting in the generation of the modelock (spiral patterns) typically observed in the BZ reaction (step 40 in Fig. 3). A hexagonal distortion of the propagating waves was generated by interactions between adjacent cells. These results indicated that the RD
Unconventional Computing, Novel Hardware for, Figure 4 Snapshots of recorded movie obtained from fabricated RD chip [38]
chip could easily be integrated into existing digital systems and used to study RD systems, with the aim of developing further novel applications.

Reaction-Diffusion Chip Implementing Excitable Lattices with Multiple-Valued Cellular Automata

An RD chip was fabricated based on a multiple-valued CA model of excitable lattices [38]. The experiments confirmed the expected operations, i.e., excitable wave propagation and annihilation. One could obtain a binary stream from the common output wire by selecting each cell sequentially. Using a conventional display technique, the binary stream was reconstructed on a 2-D display. Figure 4 shows snapshots taken from the recorded movie. Each dot represents an excitatory cell where EXC is logical "1". In the experiment, the supply voltage was set to 5 V, and the system clock was set to a low frequency (2.5 Hz) so that "very slow" spatiotemporal activities could be observed visually (the low frequency was used only for visualization and is not the upper limit of the circuit's operation). Pin-spot lights were applied to several cells at the top-left and bottom-right corners of the chip. The circuit exhibited the expected behavior; i.e., two excitable waves of excited cells, triggered by the corner cells, propagated toward the center and disappeared when they collided. This result suggests that if a more advanced (finer-feature) process were used and a larger number of cells were implemented, we would observe the same complex
Unconventional Computing, Novel Hardware for, Figure 5 Skeleton operation with “T” shape [20]
(BZ-like) patterns as observed in the original excitable lattices [5]. This chip can operate much faster than real chemical RD systems, even when the system clock frequency is O(1) Hz, and is much easier to use in various experimental environments. Therefore, the chip should encourage RD application developers who use such properties of excitable waves to develop unconventional computing schemes, e.g., chemical image processing, pattern recognition, path planning, and robot navigation.

Silicon Implementation of a Chemical Reaction-Diffusion Processor for Computation of Voronoi Diagram

RD chemical systems are known to realize sensible computation when both the data and the results of the computation are encoded in concentration profiles of chemical species; the computation is implemented via spreading and interaction of either diffusive or phase waves, while a silicon RD chip is an electronic analog of the chemical RD systems. A prototype RD chip implementing a chemical RD processor for a classical problem of computational geometry – computation of a Voronoi diagram – was fabricated. Here we see experimental results for fabricated RD chips and compare the accuracy of information processing in silicon analogs of RD processors and their experimental 'wetware' prototypes [20]. Figures 5 and 6 show examples of skeleton operation on 'T'- and '+'-
Unconventional Computing, Novel Hardware for, Figure 6 Skeleton operation with “+” shape [20]
shaped images. As the initial images, a glass mask was prepared in which the 'T' and '+' areas were exactly masked. Therefore, cells under the 'T' and '+' areas were initially resting and the rest were initially excited. At equilibrium, the skeletons of 'T' and '+' were successfully obtained.

A Quadrilateral-Object Composer for Binary Images with Reaction-Diffusion Cellular Automata

A CA LSI architecture that extracts quadrilateral objects from binary images was proposed in [19], using a serial combination of parallel CA algorithms based on a model of RD chemical systems. Each cell in the CA was implemented by a simple digital circuit called an elemental processor. The CA LSI can be constructed from a large number of elemental processors and their controllers operating in serial and parallel. Figure 7a demonstrates object extraction for a natural image. The image was quantized and given to the CA LSI. Figure 7b shows the results. The maximum boxes were correctly detected in order, as predicted. The input bitmap image, which consisted of 181 × 238 pixels, was decomposed into 1020 quadrilateral objects. The bitmap image occupied 43,078 bits (5385 bytes) of memory, while the objects used 4080 bytes (8-bit addresses of 4 corners × 1020 boxes). The important point is not the compression rate between bitmap images and extracted objects, but that bitmap images were represented by a small number of vector objects, which speeds up picture drawing if a variable box window is available. One of the most important application targets for the proposed chip is a computer-aided design (CAD) system
for LSIs. Conventional LSI CAD tools use polygons to represent device structures. However, recent LSIs include not only polygon patterns but also graphical patterns consisting of a large number of dots, usually imported from image files such as JPEGs, to implement complex analog structures. In the mask manufacturing process, exposing a large number of dot patterns is quite a time-consuming task. Recently, electron-beam (EB) lithography systems that can expose wide areas through a quadrilateral window have been produced on a commercial basis. The proposed LSI can produce efficient stream files from binary image files that can easily be handled by the new EB systems, given simple software that converts the box format produced by the proposed LSI into a conventional stream format.

Analog CMOS Reaction-Diffusion Chip

Analog Reaction-Diffusion Chip with Hardware Oregonator Model

Silicon devices that imitate the autocatalytic and dissipative phenomena of RD systems were developed [21]. Numerical simulations and experimental results revealed that an RD device could successfully produce concentric and spiral waves in the same way as natural RD systems. These results encouraged us to develop new applications based on natural RD phenomena using hardware RD devices. Figure 8 presents an example in which some spiral (modelock) patterns of cell clusters were observed. Snapshots were taken at intervals of 100 ms. The active-cell clusters and the cores of the spirals are superimposed on the
Unconventional Computing, Novel Hardware for, Figure 7 Simulation results for a 181 × 238 image; a original image, b mixed image of the quantized image given to the CA LSI and the detected quadrilateral objects [19]
Unconventional Computing, Novel Hardware for, Figure 8 Spiral patterns on RD chip (weak connections between cells) [21]
Unconventional Computing, Novel Hardware for, Figure 9 Snapshots of pattern formation from initial fingerprint image [55]
figure by white curves and white circles, respectively. Although observing "beautiful" spirals as in chemical RD systems is difficult because of the small number of cells, the appearance and disappearance of small sections of spiral waves were successfully observed. RD devices and circuits are useful not only for hardware RD systems but also for constructing modern neurochips. The excitatory and oscillatory behaviors of an RD device and circuit are very similar to those of actual neurons, which produce sequences of identically shaped pulses in time, called spikes. Recently, Fukai demonstrated that an inhibitory network of spiking neurons achieves robust and efficient neural competition on the basis of a novel timing mechanism of neural activity [28]. A network with such a timing mechanism may provide an appropriate platform for developing analog LSI circuits and could overcome problems with analog devices, namely their lack of precision and reproducibility.

Striped and Spotted Pattern Generation on RD Cellular Automata

A novel RD model suitable for LSI implementation, and its basic LSI architecture, were proposed in [55]. The model employs linear diffusion fields of activators and inhibitors and a discrete transition rule applied after diffusion. Image-processing LSI circuits based on pattern formation in RD systems were developed. Continuous diffusion fields and an analog state variable were introduced to improve Young's local activator-inhibitor model [62]. A pattern diagram of the model was produced on a 2D parameter space through extensive numerical simulations. The spatial frequency and form (striped or spotted) could be controlled with only two parameters. Theoretical analysis of the one-dimensional model proved that (i) a spatial distribution given by a periodic square function is stable at equilibrium and (ii) the spatial frequency is inversely proportional to the square root of the diffusion coefficient of the inhibitors. A basic circuit for the proposed model was designed, i.e., an RD LSI based on an analog computing method in which the concentration of chemicals is represented by a two-dimensional voltage distribution and the cell voltage is diffused step by step. By mimicking the model's two diffusion fields with one diffusion circuit on the LSI, one can reduce the area of the unit cell circuit. Figure 9a shows snapshots of pattern formation in the circuit. One can estimate from the theory that the system produces striped patterns [55]. Therefore, a fingerprint pattern was used as the initial input. Noisy local patterns were repaired by their surrounding striped patterns as time progressed. The circuit required 50 cycles (8000 clocks) to
reach equilibrium. Figure 9b shows the results for spot-pattern generation. The same initial input as in Fig. 9a was given to the circuit. As expected from the theory, spotted patterns were obtained. The pattern-formation process was the same as in Fig. 9a: noisy local spots were restored by the surrounding global spotted patterns. Therefore, this circuit would be suitable for restoring regularly arranged spotted patterns such as polka-dot patterns. The system took 100 cycles (16,000 clocks) to reach equilibrium and restore the spotted patterns.

Reaction-Diffusion Computing Devices Based on Minority-Carrier Transport in Semiconductors

A massively parallel computing device was designed [18] based on principles of information processing in RD chemical media [4,5] (Figs. 10 and 11). This device imitates the autocatalytic and dissipative phenomena of chemical RD systems; however, compared with a real chemical medium, the semiconductor analog of an RD computer functions much faster. Operational characteristics of the RD silicon devices and the feasibility of the approach on several computational tasks are shown in [18]. The results indicate that the proposed RD device will be a useful tool for developing novel hardware architectures based on RD principles of information processing.
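The striped/spotted RD CA described above (linear activator and inhibitor diffusion fields followed by a discrete transition rule) is in the spirit of Young's local activator-inhibitor model. The sketch below uses illustrative radii and weights of our own choosing, not the parameters of the chip in [55]: every 'on' cell activates nearby cells and inhibits a wider ring, and each cell then thresholds the summed field.

```python
import numpy as np

def young_step(cells, r1=2.3, r2=6.0, w1=1.0, w2=0.18):
    """One update of a Young-style activator-inhibitor CA (illustrative
    parameters): each 'on' cell activates neighbours within radius r1
    (weight +w1) and inhibits those within radius r2 (weight -w2); a cell
    switches on where the summed field is positive, off where negative."""
    n = cells.shape[0]
    yy, xx = np.mgrid[-n // 2:n - n // 2, -n // 2:n - n // 2]
    dist = np.hypot(xx, yy)
    kernel = np.where(dist < r1, w1, np.where(dist < r2, -w2, 0.0))
    # circular convolution via FFT (periodic boundaries), kernel centred at 0
    field = np.real(np.fft.ifft2(np.fft.fft2(cells) *
                                 np.fft.fft2(np.fft.ifftshift(kernel))))
    out = cells.copy()
    out[field > 0] = 1
    out[field < 0] = 0
    return out

rng = np.random.default_rng(1)
cells = (rng.random((64, 64)) < 0.5).astype(float)
for _ in range(10):
    cells = young_step(cells)
```

Depending on the activation/inhibition balance, the random initial state organizes into spotted or striped clusters, which is the mechanism the chip exploits for fingerprint restoration.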
Unconventional Computing, Novel Hardware for, Figure 10 Construction of two-dimensional RD device with vertical p-n-p-n device [18]
The practical value of RD chemical systems is significantly reduced by the low speed of traveling waves, which makes real-time computation impractical. One of the cost-efficient options for overcoming the speed limitations of RD computers, while preserving the unique features of wave-based computing, is to implement RD chemical computers in silicon. The velocity of traveling wavefronts in typical reaction-diffusion systems, e.g., the BZ reaction, is 10⁻² m/s [56], while that of a hardware RD system will be over a million times faster than that of the BZ reaction, independent of system size [7]. This increase in speed will be indispensable for developers of RD computers. Moreover, if an RD system is implemented in integrated circuits, then we would be able to artificially design various types of RD spatiotemporal dynamics and thus develop parallel computing processors for novel applications. Based on experimental evidence of RD-like behavior in p-n-p-n devices, namely traveling current-density filaments [41], a novel type of semiconductor RD computing device was proposed, in which minority carriers diffuse like chemical species and reaction elements are represented by p-n-p-n diodes.

Single-Electron Reaction-Diffusion System

This section introduces a single-electron device that is analogous to an RD system. This electrical RD device consists of a two-dimensional array of single-electron nonlinear oscillators that are combined with one another through diffusive coupling. The device produces animated spatiotemporal patterns of node voltages, e.g., a rotating spiral pattern similar to that of a colony of cellular slime molds and a dividing-and-multiplying pattern reminiscent of cell division. A method of fabricating actual devices by using self-organized crystal-growth technology is also described. The following is an excerpt from [43]; for details, see the reference. Figure 12 shows an electrical oscillator constructed from a single-electron circuit.
It consists of tunneling junction Cj and high resistance R connected in series at node 1 and biased by positive voltage Vdd . This circuit is an elementary component of single-electron circuits known as
Unconventional Computing, Novel Hardware for, Figure 11 Simulation results of RD device producing multiplying patterns [18]
the single-electron transistor (SET) cell (see [31] for a detailed explanation). A SET cell has only a single variable, the voltage V1 of node 1, but it can be oscillatory or excitatory in operation – which is indispensable in creating RD systems – because the node voltage can change discontinuously as a result of electron tunneling. In continuous-variable systems such as chemical reaction systems, two or more variables are needed for oscillatory and excitatory operations. The SET cell operates as a nonlinear oscillator at the low temperatures at which the Coulomb-blockade effect occurs. It is oscillatory (astable) if Vdd > e/(2Cj) (e is the elementary charge) and produces a nonlinear oscillation of the voltage at node 1 (Fig. 13a). The node voltage gradually increases as junction capacitance Cj is charged through resistance R, then drops discontinuously because of electron tunneling through the junction, and again gradually increases to repeat the same cycle. In contrast, the oscillator is excitatory (monostable) if Vdd < e/(2Cj) and produces a single-pulse operation excited by an external trigger (Fig. 13b). A modified Monte Carlo method is used for the simulation; Kuwamura and his colleagues [35] have given details of this method (see also the Appendix in [59]). For constructing electrical RD systems, oscillatory oscillators, excitatory ones, or both can be used.

Unconventional Computing, Novel Hardware for, Figure 12 Single-electron oscillator (a SET cell) consisting of tunneling junction Cj, high resistance R connected at node 1, and positive bias voltage Vdd

The oscillator exhibits discontinuous, probabilistic kinetics resulting from electron tunneling. The kinetics is given in the form

dV1/dt = (Vdd − V1)/(RCj) − (e/Cj) δ(V1 − e/(2Cj) − ΔV),

where δ(·) represents the discontinuous change in node voltage caused by electron tunneling. The probabilistic operation arises from the stochastic nature of tunneling; i.e., a time lag (a waiting time) exists between the moment the junction voltage exceeds the tunneling threshold e/(2Cj) and the moment tunneling actually occurs. This effect is represented by the delay term ΔV in the equation. Because the value of ΔV fluctuates probabilistically in every tunneling event and cannot be expressed in analytical form, Monte Carlo simulation is necessary for studying the behavior of the oscillator. In fabricating actual oscillators, a high resistance of hundreds of megaohms or more is not easy to implement on an LSI chip. A better way is to use a multiple tunneling junction, i.e., a series of many tunneling junctions, instead of the high resistance (Fig. 14a). This structure also enables oscillatory and excitatory operation because sequential electron tunneling through a multiple tunneling junction has an effect similar to current flowing through a high resistance (Fig. 14b). In the following sections, however, the high-resistance SET cell (Fig. 12) is used to construct
Unconventional Computing, Novel Hardware for, Figure 13 Operation of the oscillator. Waveforms of node voltage are shown for a self-induced oscillation and b monostable oscillation, simulated with the following set of parameters: tunneling junction capacitance Cj = 20 aF, tunneling junction conductance = 1 µS, high resistance R = 400 MΩ, and zero temperature. Bias voltage Vdd = 4.2 mV for self-induced oscillation, and Vdd = 3.8 mV for monostable oscillation
Unconventional Computing, Novel Hardware for, Figure 14 Single-electron oscillator with a multiple tunneling junction: a circuit configuration, and b simulated self-induced oscillation. Parameters are: capacitance of the single tunneling junction Cj = 10 aF, conductance of the single tunneling junction = 1 µS, 30 tunneling junctions in the multiple tunneling junction, capacitance and conductance of a tunneling junction in the multiple tunneling junction 300 aF and 50 nS, bias voltage Vdd = 8.6 mV, and zero temperature
Unconventional Computing, Novel Hardware for, Figure 15 Diffusive connection of oscillators. a One-dimensional chain of oscillators (A1, A2, . . . ) with intermediary cells (B1, B2, . . . ) and coupling capacitors C. For the study of the transmission of tunneling, a triggering pulse generator is connected to the left end. b Transmission of tunneling through the chain of excitatory oscillators. The waveform of the node voltage is plotted for each oscillator. A triggering pulse was applied to the leftmost oscillator, and tunneling started at that oscillator and was transmitted along the chain with delay. Jumps in curves A1–A4 result from electron tunneling in oscillators A1–A4. Simulated with the parameter set: Cj = 10 aF, C = 2 aF, R = 77 MΩ, tunneling junction conductance = 5 µS, Vdd = 5 mV, Vss = −5 mV, and zero temperature
electrical RD systems, because less computing time is required to simulate their RD operation. One can expect that the knowledge obtained from high-resistance RD systems can be applied to RD systems consisting of multiple-junction oscillators. To construct RD systems, the oscillators have to be connected with one another so that they interact through "diffusive" coupling to generate synchronization and entrainment. To do this, the oscillators are connected by means of intermediary oscillation cells and coupling capacitors. Figure 15a illustrates the method of connection with a one-dimensional chain of oscillators. The oscillators (SET cells denoted by A1, A2, . . . , with their nodes represented by closed circles) are connected with their neighboring oscillators through intermediary oscillation cells (SET cells denoted by B1, B2, . . . , with their nodes represented by open circles) and coupling capacitors C. An excitatory SET cell biased with a negative voltage Vss is used as the intermediary oscillation cell. When electron tunneling occurs in an oscillator in this structure, the node voltage of the oscillator changes from positive to negative, and this induces, through coupling capacitor C, electron tunneling in an adjacent intermediary cell. The induced tunneling changes the node voltage of
the intermediary cell from negative to positive, and this induces electron tunneling in an adjacent oscillator. In this way, electron tunneling is transmitted from one oscillator to another along the oscillator chain. There is a time lag between two tunneling events in two neighboring oscillators, as if these oscillators interacted through diffusion. This phenomenon is not diffusion itself and cannot be expressed in the form D∇²u of Eq. (1), but it can be used as a substitute for diffusion. The transmission of tunneling with delay is illustrated in Fig. 15b with simulated results for a chain of excitatory oscillators with intermediary cells. Electron tunneling was induced in the leftmost oscillator by a triggering pulse, and it was transmitted to the right along the chain with delay. In other words, an excitation wave of tunneling traveled to the right along the chain. Its delay in traveling from one oscillator to a neighbor has probabilistic fluctuations because of the stochastic nature of tunneling, but this is not a problem for applications to RD systems. An electrical RD system can be constructed by connecting oscillators into a network by means of intermediary cells and coupling capacitors (Fig. 16). Each oscillator is connected to its 4 neighboring oscillators by means of 4 intermediary cells and coupling capacitors. This is a two-dimensional RD system. A three-dimensional RD system can also be constructed in a similar way by arranging oscillators into a cubic structure and connecting each oscillator with its 6 neighboring oscillators by means of 6 intermediary cells and coupling capacitors.
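The tunneling transmission along the chain of Fig. 15 can be caricatured by a toy event-driven model. All numbers below and the exponential waiting-time distribution are illustrative assumptions, not the Monte Carlo physics of [35]: each cell charges through R toward Vdd, crosses the threshold e/(2Cj), and then tunnels after a random waiting time, triggering the next cell.

```python
import math
import random

E = 1.602e-19        # elementary charge (C)

def fire_delay(vdd, r, cj, mean_wait, rng):
    """Delay for one cell, reset to V = 0, to charge through R toward Vdd,
    cross the tunneling threshold e/(2Cj), and then tunnel after a random
    exponential waiting time (the stochastic tunneling delay). Requires
    Vdd above threshold; a toy stand-in for the triggered excitatory case."""
    vth = E / (2.0 * cj)
    # solve vdd * (1 - exp(-t / (r * cj))) = vth for the crossing time
    t_cross = -r * cj * math.log(1.0 - vth / vdd)
    return t_cross + rng.expovariate(1.0 / mean_wait)

rng = random.Random(0)
# illustrative values in the spirit of Fig. 13: Cj = 20 aF, R = 400 Mohm
cj, r, vdd, mean_wait = 20e-18, 400e6, 4.2e-3, 1e-9
times = [0.0]                      # leftmost oscillator fires at t = 0
for _ in range(9):                 # excitation hops down a 10-cell chain
    times.append(times[-1] + fire_delay(vdd, r, cj, mean_wait, rng))
# times[] increases monotonically: the excitation wave travels along the
# chain with a fluctuating, diffusion-like delay from cell to cell
```

The random waiting time is what makes the transmission delay fluctuate from hop to hop, exactly the probabilistic behavior the text describes for the simulated chain.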
In the single-electron RD system, the node voltage of each oscillator changes temporally as the oscillators operate through mutual interactions. Consequently, a two-dimensional spatiotemporal pattern of node voltages is produced on the RD system. Since this voltage pattern corresponds to the dissipative structure in chemical RD systems, it can be called an "electrical dissipative structure". A variety of electrical dissipative structures are produced by different sets of system parameters. To understand the behavior of an electrical RD system entirely, a phase diagram for the system must be drawn, i.e., a diagram that depicts – in the multidimensional space of system parameters – what kind of dissipative structure will appear for each set of parameter values. However, a phase diagram for the RD system cannot be drawn without long numerical computer simulations because its reaction-diffusion kinetics cannot be expressed in analytical form. Instead, a few examples of electrical dissipative structures simulated with a few sample sets of parameter values are demonstrated here. Although a single-electron RD system differs greatly from chemical RD systems in terms of reaction-diffusion kinetics, it can produce dissipative structures similar to those of chemical RD systems. Three examples are exhibited, i.e., an expanding circular pattern, a rotating spiral pattern, and a dividing-and-multiplying pattern. The following shows results simulated for an RD system consisting of 201 × 201 excitatory oscillators and 200 × 200 intermediary cells.

Expanding Circular Pattern
Unconventional Computing, Novel Hardware for, Figure 16 Two-dimensional RD system consisting of the network of single-electron oscillators. Each oscillator (closed-circle node) is connected with 4 neighboring oscillators by means of 4 intermediary cells (open-circle nodes) and coupling capacitors
A single-electron RD system consisting of excitatory oscillators is in a stable uniform state as it stands. Once a triggering signal is applied to an oscillator in the system, an excitation wave of tunneling starts at the oscillator and propagates in all directions to form an expanding circular pattern. This can be seen in Fig. 17; the node voltage of each oscillator is represented by a gray scale: the light shading means high voltage, and the dark means low voltage. The front F of the wave is the region where tunneling just occurred and, therefore, the node voltage of the oscillators is at the lowest negative value. The front line is uneven or irregular because the velocity of the traveling wave fluctuated in each direction throughout the process because of the stochastic waiting time of tunneling. After the excitation wave passed through, the node voltage of each oscillator gradually increased to return to its initial value, the positive bias voltage. This is indicated in the figure by the light shading on the rear R of the wave.
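The expanding circular wave can be imitated by a minimal excitable-lattice sketch (a Greenberg–Hastings-style cellular automaton, not the authors' single-electron model; grid size, refractory length, and the deterministic update are illustrative assumptions):

```python
# States: 0 = resting (positive bias), 1 = excited (tunneling just occurred,
# lowest node voltage), 2..R = refractory (node voltage recovering).
N, R = 41, 5
grid = [[0] * N for _ in range(N)]
grid[N // 2][N // 2] = 1                      # triggering pulse at the center

def step(g):
    new = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            s = g[i][j]
            if s == 0:                         # resting: excite if a neighbor fired
                nbrs = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
                if any(0 <= a < N and 0 <= b < N and g[a][b] == 1
                       for a, b in nbrs):
                    new[i][j] = 1
            elif s < R:                        # excited/refractory: keep recovering
                new[i][j] = s + 1
            # s == R: recovered back to resting (0)
    return new

for _ in range(10):
    grid = step(grid)
# After 10 steps the excited front sits at Manhattan distance 10 from the
# trigger site, while the center has recovered to the resting state.
assert grid[N // 2][N // 2 + 10] == 1
assert grid[N // 2][N // 2] == 0
```

With 4-neighbor coupling the deterministic front is diamond-shaped; the stochastic tunneling delay of the real system is what rounds and roughens the front in Fig. 17.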
Unconventional Computing, Novel Hardware for
This pattern appears when an expanding circular wave is chipped by an external disturbance, thereby making an endpoint appear in the wave front. With this endpoint acting as a center, the wave begins to curl itself to form a rotating spiral pattern (Fig. 18). The principle of curling is similar to that in chemical RD systems. In this example, a triggering signal was applied to the middle oscillator on the left of the RD system. When an excitation wave started and expanded a little, the lower half of the wave was chipped by resetting the node voltage of oscillators to zero (Fig. 18a). After that, the RD system was left to operate freely, and a rotating spiral pattern of node voltages was automatically generated, as can be seen in Figs. 18b–f. Dividing-and-Multiplying Pattern
This pattern appears when the coupling between oscillators is weak (i. e., small coupling capacitance or low bias voltage). When this happens, electron tunneling in an oscillator cannot be transmitted to all four adjacent intermediary cells; e. g., tunneling can be transmitted to the right and left cells but not to the upper and lower cells. As a result, an expanding node-voltage pattern splits into pieces, and each piece again expands and splits. This produces dividing-and-multiplying patterns (Fig. 19). The principle of division is different from that in chemical RD systems, but the behavior of the created patterns is somewhat similar. In a way, we may consider that there are electrical microbes consisting of negative voltages living on the RD system, eating positive charges on nodes as food, and propagating so that they spread all over the system. The unit element in this RD system is a single-electron oscillator coupled with four neighbors. The multiple-tunneling-junction oscillator (Fig. 14) is preferable for this element because it can be made without a high resistance, which is difficult to implement on an LSI chip. Arranging such oscillators into a two-dimensional array produces an RD system, so the next task is to fabricate many identical oscillators on a substrate. Figure 20a shows the three-dimensional and cross-sectional schematics for the structure
Unconventional Computing, Novel Hardware for, Figure 18 Rotating spiral pattern. Snapshots for six time steps. Simulated with the same parameters as used for Fig. 17
Unconventional Computing, Novel Hardware for, Figure 19 Dividing-and-multiplying pattern. Snapshots for six time steps. Simulated with the same parameters as used for Fig. 17 except that R = 150.5 MΩ, Vdd = 15.8 mV, and Vss = −15.8 mV
Unconventional Computing, Novel Hardware for, Figure 17 Expanding circular pattern in the single-electron RD system. Snapshots for three time steps. Simulated with parameters: tunneling junction capacitance Cj = 1 aF, tunneling junction conductance = 1 µS, high resistance R = 137.5 MΩ, coupling capacitance C = 1 aF, bias voltages Vdd = 16.5 mV and Vss = −16.5 mV, and zero temperature
If a triggering signal is applied repeatedly to one oscillator, a concentric circular wave – called a target pattern in chemical RD systems – will be generated. Rotating Spiral Pattern
Unconventional Computing, Novel Hardware for, Figure 20 Device structure for the single-electron RD system. a Three-dimensional and cross-sectional schematics; b SEM photograph of a two-dimensional array of GaAs nanodots with coupling arms and tunneling junctions; c schematic diagram of the nanodots with coupling arms
of the device. Each oscillator consists of a conductive nanodot (minute dot) with four coupling arms, and there is a tunneling junction between the nanodot and the conductive substrate beneath it. Many series-connected junctions run between the nanodot and a positive-bias or a negative-bias electrode. Capacitive coupling between neighboring oscillators can be achieved by laying their coupling arms close to each other. The key in this construction is to prepare a large arrangement of nanodots with coupling arms and tunneling junctions. A process technology that could be used to fabricate the RD-system structure was previously proposed and demonstrated [42]. This technology uses self-organized crystal growth achieved by selective-area metalorganic vapor-phase epitaxy (SA-MOVPE), and it can be used to fabricate GaAs nanodots with arms and tunneling junctions on a GaAs substrate by making use of the dependence of the crystal-growth rate on crystal orientation (for a detailed explanation, see [36] and [37]). With this technology, a nanodot with four coupling arms can be formed automatically in a self-organizing manner. This technology can also be used to automatically create the structure for multiple tunneling junctions on nanodots simply by repeating the growth of an n-type GaAs layer and an insulating AlGaAs layer. Using such a process, GaAs nanodots with their arms and the tunneling junctions beneath them were successfully formed as a two-dimensional array on a substrate (Figs. 20b and c), though the technology is not yet perfect and a complete device has not been fabricated yet. An improved process technology to form GaAs nanodots with arms and multiple tunneling junctions is now under development, in which the arms and multiple tunneling junctions are arranged regularly with a smaller pitch of 100 nm or less (corresponding to 10¹⁰ oscillators/cm²). With the improved process technology, we will be able to integrate coupled single-electron oscillators on a chip and proceed from there to develop reaction-diffusion LSIs. Collision-Based RD Computers
Present digital LSI systems consist of a number of combinational and sequential logic circuits as well as related peripheral circuits. A well-known basic logic circuit is a two-input NAND circuit that consists of four metal-oxide-semiconductor field-effect transistors (MOS FETs), where three transistors are on the current path between the power supply and the ground. Many complex logic circuits can be constructed not only from populations of a large number of NAND circuits but also from special logic circuits that use fewer transistors than NAND-based circuits (though with more than three transistors on the current path). A straightforward way to construct low-power digital LSIs is to decrease the power-supply voltage, because the power consumption of digital circuits is proportional to the square of the supply voltage. In complex logic circuits, where many transistors are on the current paths, the supply voltage cannot be decreased because of the stacking of the transistors’ threshold voltages, even though the threshold voltage decreases as LSI fabrication technology advances year by year. On the other hand, if two-input basic gates that have the minimum number of transistors (three or less) on the current path are used to decrease the supply voltage, a large number of gates will be required for constructing complex logic circuits. The Reed–Muller expansion [40,48], which expands logical functions into combinations of AND and XOR logic,
enables us to design ‘specific’ arithmetic functions with a small number of gates, but it is not suitable for arbitrary arithmetic computation. Pass-transistor logic (PTL) circuits use a small number of transistors for basic logic functions, but additional level-restoring circuits are required for every unit [52]. Moreover, the acceptance of PTL circuits into mainstream digital design critically depends on the availability of tools for logic and physical synthesis and optimization. Current-mode logic circuits also use a small number of transistors for basic logic, but their power consumption is very high due to the continuous current flow in the on state [15]. Subthreshold logic circuits, in which all the transistors operate under their threshold voltage, are expected to exhibit ultra-low power consumption, but their operation speed is extremely slow [53]. Binary decision diagram logic circuits are suitable for next-generation semiconductor devices such as single-electron transistors [16,51], but not for present digital LSIs because of their use of PTL circuits. To address the problems above concerning low-power and high-speed operation in digital LSIs, a method of designing logic circuits with collision-based fusion gates, which is inspired by collision-based RD computing (RDC) [6,7], is described. In the following sections, we see a new interpretation of collision-based RDC, especially concerning the directions and speeds of propagating information quanta. We also see basic logical functions constructed from collision-based fusion gates, and discuss the number of transistors in classical and fusion-gate logic circuits [60]. Collision-Based Reaction-Diffusion Computing for Digital LSIs Adamatzky proposed how to realize an arithmetical scheme using wave fragments traveling in an RD medium where excitable chemical waves disappear when they collide with each other [6,7].
His cellular-automaton model mimicked localized excitable waves (wave fragments) traveling along the columns and rows of the lattice and along diagonals. The wave fragments represented values of logical variables; logical operations were implemented when wave fragments collided and were annihilated or reflected as a result of the collision. One can achieve basic logical gates in the cellular-automaton model, and build an arithmetic circuit using the gates [6]. The cellular-automaton model for basic logic gates has been implemented on digital LSIs [7]. Each cell consisted of several tens of transistors and was regularly arranged on a 2D chip surface. To implement a one-bit adder, for example, by collision-based cellular automata, at least several tens of cells are required to allocate sufficient space for the collision of wave fragments [6]. This implies that several hundreds of transistors are required for constructing just a one-bit adder. Direct implementation of the cellular-automaton model is therefore a waste of chip space, unless the area of a single cell can be decreased to the scale of the chemical compounds in spatially-continuous RD processors. What happens if wave fragments travel in ‘limited directions instantaneously’? Our possible answers to this question are depicted in Fig. 21. Figure 21a shows 2-in 2-out (C22) and 2-in 1-out (C21) units representing two perpendicular ‘limited’ directions of wave fragments, i. e., North-South and West-East fragments. The number of MOS transistors in each unit is written inside the black circle in the figure. The input fragments are represented by values A and B, where A (or B) = ‘1’ represents the existence of a wave fragment traveling North-South (or West-East), and A (or B) = ‘0’ represents the absence of a wave fragment. When A = B = ‘1’, the wave fragments collide at the center position (black circle) and then disappear. Thus, the East and South outputs are ‘0’ because of the disappearance. If A = B = ‘0’, the outputs will be ‘0’ as well because of the absence of the fragments. When A = ‘1’ and B = ‘0’, a wave fragment can travel to the South because it does not collide with a fragment traveling West-East. The East and South outputs are thus ‘0’ and ‘1’, respectively, whereas they are ‘1’ and ‘0’, respectively, when A = ‘0’ and B = ‘1’. Consequently, the logical functions of this simple ‘operator’ are represented by ĀB (East) and AB̄ (South), as shown in Fig. 21a left. We call this operator a collision-based ‘fusion gate’, where the two inputs correspond to perpendicular wave fragments, and one (or two) outputs represent the results of the collision (transparent or disappear) along the perpendicular axes.
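The collision rules just described can be captured in a few lines; this is a behavioral sketch of the truth tables in the text, not a transistor-level model, and the function names are ours:

```python
# Behavioral model of the collision-based fusion gates: a North-South
# fragment A and a West-East fragment B annihilate when both are present;
# a lone fragment passes straight through.
def c22(a, b):
    south = a and not b      # N-S fragment survives only if no W-E fragment
    east = b and not a       # W-E fragment survives only if no N-S fragment
    return int(east), int(south)

def c21(a, b):
    return c22(a, b)[1]      # 2-in 1-out gate: keep only the South output

# Collision: both fragments disappear.
assert c22(1, 1) == (0, 0)
# A lone N-S fragment travels through to the South.
assert c22(1, 0) == (0, 1)
# NOT gate (Fig. 21b): North held at '1', West carries the input A.
assert [c21(1, a) for a in (0, 1)] == [1, 0]
```

Composing these two primitives reproduces the AND, OR, NOR, and XOR constructions of Fig. 21c and d.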
Figures 21b to d represent basic logic circuits constructed by combining several fusion gates. The simplest example is shown in Fig. 21b, where the NOT function is implemented by a C21 gate (2 transistors). The North input is always ‘1’, whereas the West is the input (A) of the NOT function. The output (Ā) appears on the South node. Figure 21c represents a combinational circuit of three fusion gates (two C21 gates and one C22 gate) that produces AND, NOR, and OR functions. Exclusive logic functions are produced by three (for XNOR) or four (for XOR) fusion gates, as shown in Fig. 21d. The number of transistors for each function is depicted in the figure (inside the white boxes). A collision-based fusion gate receives two logical inputs (A and B) and produces one (C21) or two (C22) logical outputs, i. e., AB̄ for C21, and ĀB and AB̄ for C22. Unit circuits for C22 and C21 gates receive logical (voltage) inputs (A and B) and produce these logic functions. The
Unconventional Computing, Novel Hardware for, Figure 21 Construction of collision-based fusion gates [60]. a Definition of 2-in 2-out (C22) and 2-in 1-out (C21) gates and their corresponding circuits; b NOT ; c AND, NOR, and OR; d XOR and XNOR functions
minimum circuit structure is based on PTL circuits in which a single-transistor AND logic is fully utilized. For instance, in Fig. 21a right, a pMOS pass transistor is responsible for the ĀB function, and an additional nMOS transistor is used for discharging operations. When the pMOS transistor receives voltages A and B at its gate and drain, respectively, the source voltage approaches ĀB at equilibrium. If the pMOS transistor is turned off, an nMOS transistor connected between the pMOS transistor and the ground discharges the output node, which significantly increases the upper bound of the operation frequency. When A = B = ‘0’, the output voltage is not completely zero because of the threshold voltage of the MOS transistors; however, this small voltage shift is restored to logical ‘0’ at the next input stage. Therefore, additional level-restoring circuits are unnecessary for this circuit. Figure 22 shows constructions of multiple-input logic functions with fusion gates. In classical circuits, two-input AND and OR gates consist of six transistors. To decrease the power-supply voltage for low-power operation, a small number of transistors (three or less) should be on each unit’s current path. Since each unit circuit has six transistors, n-input AND and OR gates consist of 6(n − 1) transistors (n ≥ 2). On the other hand, in fusion-gate logic [(a) and (b)], an n-input AND gate consisted of 4(n − 1) transistors, whereas 2(n + 1) transistors were used in an n-input OR gate. Therefore, in the case of AND logic, the number of transistors in fusion-gate circuits is smaller than that in classical circuits. The difference expands significantly as n increases. Figure 22c shows the fusion-gate implementation of majority logic circuits with multiple inputs. Again, in classical circuits, the number of transistors on each unit’s current path is fixed at three. For n-bit inputs (n must be an odd number larger than 3), the number of transistors in the classical circuit was 30 + 36(n − 3)², while in the fusion-gate circuit it was 2(n² − n + 1), which indicates that the collision-based circuit has a great advantage in the number of transistors. Half and full adders constructed by fusion-gate logic are illustrated in Figs. 22d and e. The number of transistors in a classical half adder was 22, while it was 10 in a fusion-gate half adder (Fig. 22d). For n-bit full adders (n ≥ 1), the number of transistors in a classical circuit was 50n − 28, while it was 26(n − 1) + 10 in a fusion-gate circuit (Fig. 22e). Again, the fusion-gate circuit has a significantly smaller number of transistors, and the difference increases as n increases. Figure 23 summarizes the comparison of the number of transistors between classical and fusion-gate logic. The number of transistors in fusion-gate logic was always smaller than that in classical logic circuits, especially in majority logic gates.
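The transistor-count comparison can be tabulated directly from the formulas quoted in the text; note that the OCR-damaged symbols are read here as +, −, and squaring, which is an editorial assumption:

```python
# Transistor-count formulas for classical vs. fusion-gate logic, as quoted
# in the text (symbol reconstruction is an assumption of this sketch).
def classical(kind, n):
    return {"and": 6 * (n - 1),                  # n-input AND, n >= 2
            "or": 6 * (n - 1),                   # n-input OR, n >= 2
            "majority": 30 + 36 * (n - 3) ** 2,  # n odd
            "full_adder": 50 * n - 28}[kind]     # n-bit, n >= 1

def fusion(kind, n):
    return {"and": 4 * (n - 1),
            "or": 2 * (n + 1),
            "majority": 2 * (n * n - n + 1),
            "full_adder": 26 * (n - 1) + 10}[kind]

# The fusion-gate circuit uses fewer transistors in every case compared here.
for kind, n in [("and", 8), ("or", 8), ("majority", 7), ("full_adder", 4)]:
    assert fusion(kind, n) < classical(kind, n)
```

For example, a 7-input majority gate needs 2(49 − 7 + 1) = 86 transistors in fusion-gate logic versus 30 + 36·16 = 606 classically, which is the "great advantage" the text refers to.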
Unconventional Computing, Novel Hardware for, Figure 22 Fusion gate architectures of multiple-input functions [60]; a AND, b OR, c majority logic gates. Half and full adders are shown in d and e, respectively
Unconventional Computing, Novel Hardware for, Figure 23 Total number of transistors in classical and collision-based (CB) multiple-input logic gates [60]
Future Directions
Before starting any computation one should input data-information into the RD medium. A parallel input is an essential feature of a cutting-edge parallel computing architecture. Serial inputs, common to the vast majority of massively-parallel processors, dramatically decrease the performance of computing devices, particularly those operating in a transducer mode, where information is constantly fed into the processor (e. g. in tasks of image processing). Experimental RD chemical computers, at least in certain cases, may well have analogs of parallel inputs. It has been widely demonstrated that by applying light of varying intensity we can control excitation dynamics in a BZ medium [22,27,32,44], wave velocity [49], and pattern formation [58]. Of particular interest is experimental evidence of light-induced back-propagating waves, wavefront splitting, and phase shifting [61]; we can also manipulate the medium’s excitability by varying the intensity of the medium’s illumination [23]. In fact, optical input of data-
information was already used at the beginning of RD research [34]. This proved to be particularly important in experiments on image processing in BZ-medium-based computing devices [34,45,46,47]. We are not aware of any rigorous experimental results demonstrating the possibility of optical input to RD semiconductor devices; there is, however, simulation-based evidence of the possibility of parallel optical inputs. Paper [39] discusses the particulars of the photo-response (photon-induced generation of electron-hole pairs) in p-n-p-n devices, depending on the primary color components of the stimulating input; the elementary processors can thus also act as color sensors. There also exists a possibility of parallel optical outputs on semiconductor devices. Technologies for integrating optoelectronic devices and electronic circuitry are fully developed, but only limited hybrid integration is available commercially at present. An important problem of such integration is that pursuing it involves the simultaneous development of sophisticated technologies for optoelectronic devices (III-V semiconductors) and silicon integrated circuits. Indeed, recent developments in optoelectronic integrated circuits (OEICs) enable us to implement light-emitting devices (LEDs) on a silicon substrate by controlling defects at the III-V/silicon interface. For example, Furukawa et al. demonstrated that lattice-matched and defect-free GaPN epilayers can be grown on silicon with a thin GaP buffer layer [29]. The task is enormous, and as a practical matter only small-scale OEICs have been demonstrated. While the integration levels of III-V OEICs have remained low, the degree of integration in commercial GaAs integrated circuits has reached LSI levels in recent years. These advances offer a route to achieving much higher levels of optoelectronic integration through epitaxial growth of III-V heterostructures on GaAs-based LSI electronics.
Most experimental prototypes of chemical excitable computing devices suffer from difficulties with representing the results of wave interactions. This is because, once excited, a medium’s micro-volume becomes refractory and then recovers back to the resting state. Excitation wavefronts usually annihilate as a result of collision. Therefore, experimental excitable RD computers require some external device, like a digital camera, to record their spatio-temporal dynamics. Thus, e. g., to compute a collision-free path around obstacles in a thin-layer BZ medium, one must record snapshots of the BZ medium’s activity and then analyze this series of snapshots to extract the results of the computation [11,14]. This is the cost we pay for reusable (because the excitable medium eventually returns to the resting state) RD processors. Another option for preserving the results of computation may be to employ non-excitable RD media, in which a precipitate is formed (or not formed) as a result of a diffusion wave interacting with a substrate (or of several wave fronts competing for the substrate) [9,10,13,25]. Precipitation is an analog of infinite memory. The feature is priceless; however, it makes experimental prototypes simply disposable: in contrast to excitable media, precipitate-forming media can be used just once. RD semiconductor computers, because they are essentially man-made devices, may allow us to combine the reusability and rich space-time dynamics of excitable RD media with the low post-processing costs of precipitate-forming RD media. This can be done by embedding a lattice of oscillatory semiconductor elements into a lattice of excitatory semiconductor elements. The excitatory elements (EEs) will form a substrate to support traveling excitation waves, while the oscillatory elements (OEs) will play the role of rewritable memory. For example, to represent sites of wave-front collision we must adjust the activation threshold of the OEs in such a manner that the front of a single wave will not trigger the OEs, but when two or more wave-fronts collide, the OEs at the sites of collision are triggered and continue to oscillate even after the wave-fronts annihilate. What computational tasks can be realized in RD semiconductor devices? Most primitive operations of image processing (see overview in [5]), e. g. contour detection and enhancement, map straightforwardly onto the silicon architecture. Silicon implementation of RD (precipitate-formation-based) algorithms for computational geometry – the Voronoi diagram [13,25,26] and skeletonization [12] – requires embedding oscillatory elements in the lattice of excitatory p-n-p-n devices; the oscillating elements will represent the bisectors of a Voronoi diagram and the segments of a skeleton. Universal computation can be realized in RD semiconductor devices using two approaches.
Firstly, by employing the collision-based mode of computation in excitable media [3], where functionally complete sets of Boolean gates are implemented by colliding waves. In the collision-based mode, however, we must force the medium into a sub-excitable regime, which may pose some problems from the fabrication point of view. Secondly, we can ‘physically’ embed logical circuits, namely their diagram-based representations, into the excitable medium [54,57]. In this approach we employ the particulars of the photo-response of p-n-p-n elements and project the logical circuit onto the medium as a pattern of heterogeneous illumination.
Bibliography
1. Adamatzky A (1994) Reaction–diffusion algorithm for constructing discrete generalized Voronoi diagram. Neural Netw World 6:635–643
2. Adamatzky A (1996) Voronoi-like partition of lattice in cellular automata. Math Comput Modelling 23:51–66
3. Adamatzky A (1998) Universal dynamical computation in multi-dimensional excitable lattices. Int J Theor Phys 37:3069–3108
4. Adamatzky A (2000) Reaction-diffusion and excitable processors: A sense of the unconventional. Parallel Distrib Comput Theor Pract 3:113–132
5. Adamatzky A (2001) Computing in nonlinear media and automata collectives. Institute of Physics Publishing, Bristol
6. Adamatzky A (ed) (2002) Collision-based computing. Springer, London
7. Adamatzky A, De Lacy Costello B, Asai T (2005) Reaction-diffusion computers. Elsevier, Amsterdam
8. Adamatzky A, Arena P, Basile A, Carmona-Galán R, De Lacy Costello B, Fortuna L, Frasca M, Rodríguez-Vázquez A (2004) Reaction-diffusion navigation robot control: From chemical to VLSI analogic processors. IEEE Trans Circuit Syst I 51:926–938
9. Adamatzky A, De Lacy Costello BPJ (2002) Experimental logical gates in a reaction-diffusion medium: The XOR gate and beyond. Phys Rev E 66:046112
10. Adamatzky A, De Lacy Costello BPJ (2003) On some limitations of reaction-diffusion computers in relation to Voronoi diagram and its inversion. Phys Lett A 309:397–406
11. Adamatzky A, De Lacy Costello BPJ (2002) Collision-free path planning in the Belousov-Zhabotinsky medium assisted by a cellular automaton. Naturwissenschaften 89:474–478
12. Adamatzky A, De Lacy Costello B, Ratcliffe NM (2002) Experimental reaction-diffusion pre-processor for shape recognition. Phys Lett A 297:344–352
13. Adamatzky A, Tolmachiev D (1997) Chemical processor for computation of skeleton of planar shape. Adv Mater Optics Electron 7:135–139
14. Agladze K, Magome N, Aliev R, Yamaguchi T, Yoshikawa K (1997) Finding the optimal path with the aid of chemical wave. Physica D 106:247–254
15. Alioto M, Palumbo G (2005) Model and design of bipolar and MOS current-mode logic: CML, ECL and SCL digital circuits. Springer, Berlin
16.
Asahi N, Akazawa M, Amemiya Y (1998) Single-electron logic systems based on the binary decision diagram. IEICE Trans Electron E81-C:49–56
17. Asai T, Nishimiya Y, Amemiya Y (2002) A CMOS reaction-diffusion circuit based on cellular-automaton processing emulating the Belousov–Zhabotinsky reaction. IEICE Trans Fundam E85-A:2093–2096
18. Asai T, Adamatzky A, Amemiya Y (2004) Towards reaction-diffusion computing devices based on minority-carrier transport in semiconductors. Chaos Solit Fractals 20:863–876
19. Asai T, Ikebe M, Hirose T, Amemiya Y (2005) A quadrilateral-object composer for binary images with reaction-diffusion cellular automata. Int J Parallel Emergent Distrib Syst 20:57–68
20. Asai T, De Lacy Costello B, Adamatzky A (2005) Silicon implementation of a chemical reaction-diffusion processor for computation of Voronoi diagram. Int J Bifurc Chaos 15:3307–3320
21. Asai T, Kanazawa Y, Hirose T, Amemiya Y (2005) Analog reaction-diffusion chip imitating the Belousov–Zhabotinsky reaction with hardware Oregonator model. Int J Unconv Comput 1:123–147
22. Beato V, Engel H (2003) Pulse propagation in a model for the photosensitive Belousov–Zhabotinsky reaction with external noise. In: Schimansky-Geier L et al (eds) Noise in complex systems and stochastic dynamics. Proc SPIE 5114:353–362
23. Brandtstädter H, Braune M, Schebesch I, Engel H (2000) Experimental study of the dynamics of spiral pairs in light-sensitive Belousov–Zhabotinskii media using an open-gel reactor. Chem Phys Lett 323:145–154
24. Daikoku T, Asai T, Amemiya Y (2002) An analog CMOS circuit implementing Turing’s reaction-diffusion model. Proc Int Symp Nonlinear Theor Appl 809–812
25. De Lacy Costello B, Adamatzky A (2003) On multitasking in parallel chemical processors: experimental findings. Int J Bifurc Chaos 13:521–533
26. De Lacy Costello B, Adamatzky A, Ratcliffe N, Zanin AL, Liehr AW, Purwins HG (2004) The formation of Voronoi diagrams in chemical and physical systems: Experimental findings and theoretical models. Int J Bifurc Chaos 14:2187–2210
27. Flessels JM, Belmonte A, Gáspár V (1998) Dispersion relation for waves in the Belousov–Zhabotinsky reaction. J Chem Soc Faraday Trans 94:851–855
28. Fukai T (1996) Competition in the temporal domain among neural activities phase-locked to subthreshold oscillations. Biol Cybern 75:453–461
29. Furukawa Y, Yonezu H, Ojima K, Samonji K, Fujimoto Y, Momose K, Aiki K (2001) Control of N content of GaPN grown by molecular beam epitaxy and growth of GaPN lattice-matched to Si(100) substrate. Jpn J Appl Phys 41:528–532
30. Gerhardt M, Schuster H, Tyson JJ (1990) A cellular automaton model of excitable media. Physica D 46:392–415
31. Grabert H, Devoret MH (1992) Single charge tunneling – Coulomb blockade phenomena in nanostructures. Plenum, New York
32. Grill S, Zykov VS, Müller SC (1996) Spiral wave dynamics under pulsatory modulation of excitability. J Phys Chem 100:19082–19088
33. Karahaliloglu K, Balkir S (2005) Bio-inspired compact cell circuit for reaction-diffusion systems. IEEE Trans Circuits Syst II Express Briefs 52:558–562
34. Kuhnert L, Agladze KL, Krinsky VI (1989) Image processing using light-sensitive chemical waves. Nature 337:244–247
35. Kuwamura N, Taniguchi K, Hamakawa C (1994) Simulation of single-electron logic circuits. IEICE Trans Electron J77-C-II:221–228
36. Kumakura K, Nakakoshi K, Motohisa J, Fukui T, Hasegawa H (1995) Novel formation method of quantum dot structures by self-limited selective area metalorganic vapor phase epitaxy. Jpn J Appl Phys 34:4387–4389
37. Kumakura K, Motohisa J, Fukui T (1997) Formation and characterization of coupled quantum dots (CQDs) by selective area metalorganic vapor phase epitaxy. J Crystal Growth 170:700–704
38. Matsubara Y, Asai T, Hirose T, Amemiya Y (2004) Reaction-diffusion chip implementing excitable lattices with multiple-valued cellular automata. IEICE Electron Express 1:248–252
39. Mohajerzadeh S, Nathan A, Selvakumar CR (1994) Numerical simulation of a p-n-p-n color sensor for simultaneous color detection. Sens Actuators A 44:119–124
40. Muller DE (1954) Application of boolean algebra to switching circuit design and to error detection. IRE Trans Electr Comp EC-3:6–12
41. Niedernostheide FJ, Kreimer M, Kukuk B, Schulze HJ, Purwins HG (1994) Travelling current density filaments in multilayered silicon devices. Phys Lett A 191:285–290
42. Oya T, Asai T, Fukui T, Amemiya Y (2002) A majority-logic nanodevice using a balanced pair of single-electron boxes. J Nanosci Nanotech 2:333–342
43. Oya T, Asai T, Fukui T, Amemiya Y (2005) Reaction-diffusion systems consisting of single-electron circuits. Int J Unconv Comput 1:123–147
44. Petrov V, Ouyang Q, Swinney HL (1997) Resonant pattern formation in a chemical system. Nature 388:655–657
45. Rambidi NG (1998) Neural network devices based on reaction-diffusion media: An approach to artificial retina. Supramol Sci 5:765–767
46. Rambidi NG, Yakovenchuk D (2001) Chemical reaction-diffusion implementation of finding the shortest paths in a labyrinth. Phys Rev E 63:026607
47. Rambidi NG, Shamayaev KE, Peshkov GY (2002) Image processing using light-sensitive chemical waves. Phys Lett A 298:375–382
48. Reed IS (1954) A class of multiple-error-correcting codes and their decoding scheme. IRE Trans Inform Theory PGIT-4:38–49
49. Schebesch I, Engel H (1998) Wave propagation in heterogeneous excitable media. Phys Rev E 57:3905–3910
50. Serrano-Gotarredona T, Linares-Barranco B (2003) Log-domain implementation of complex dynamics reaction-diffusion neural networks. IEEE Trans Neural Networks 14:1337–1355
51. Shelar RS, Sapatnekar SS (2001) BDD decomposition for the synthesis of high performance PTL circuits. Workshop Notes IEEE IWLS 2001, pp 298–303
52. Song M, Asada K (1998) Design of low power digital VLSI circuits based on a novel pass-transistor logic. IEICE Trans Electron E81-C:1740–1749 53. Soeleman H, Roy K, Paul BC (2001) Robust subthreshold logic for ultra-low power operation. IEEE Trans VLSI Syst 9:90–99 54. Steinbock O, Toth A, Showalter K (1995) Navigating complex labyrinths: optimal paths from chemical waves. Science 267:868–871 55. Suzuki Y, Takayama T, Motoike IN, Asai T (2007) Striped and spotted pattern generation on reaction-diffusion cellular automata: Theory and LSI implementation. Int J Unconv Comput 3:1–13 56. Tóth Á, Gáspár V, Showalter K (1994) Propagation of chemical waves through capillary tubes. J Phys Chem 98:522–531 57. Tóth Á, Showalter K (1995) Logic gates in excitable media. J Chem Phys 103:2058–2066 58. Wang J (2001) Light-induced pattern formation in the excitable Belousov–Zhabotinsky medium. Chem Phys Lett 339:357–361 59. Yamada T, Akazawa M, Asai T, Amemiya Y (2001) Boltzmann machine neural network devices using single-electron tunneling. Nanotechnology 12:60–67 60. Yamada K, Asai T, Hirose T, Amemiya Y (2008) On digital LSI circuits exploiting collision-based fusion gates. Int J Unconv Comput 4:45–59 61. Yoneyama M (1996) Optical modification of wave dynamics in a surface layer of the Mn-catalyzed Belousov–Zhabotinsky reaction. Chem Phys Lett 254:191–196 62. Young DA (1984) A local activator-inhibitor model of vertebrate skin patterns. Math Biosci 72:51–58
3279
3280
Voting
Voting

ALVARO SANDRONI, JONATHAN POGACH, MICHELA TINCANI, ANTONIO PENTA, DENIZ SELMAN
University of Pennsylvania, Philadelphia, USA

Article Outline

Glossary
Definition of the Subject
Introduction
The Collective Choice Problem
Voting Rules
Welfare Economics
Arrow's Impossibility Theorem
Political Ignorance and the Condorcet Jury Theorem
Gibbard–Satterthwaite Theorem
Political Competition and Strategic Voting
The Common Value Setting with Strategic Agents
Future Directions
Bibliography

Glossary

Arrow's impossibility theorem Arrow's Impossibility Theorem states that there does not exist a complete social ranking over alternatives that meets minimal requirements of egalitarianism and efficiency, No Dictatorship and Pareto Optimality, respectively. Consequently, there is no voting mechanism that can simultaneously satisfy basic notions of egalitarianism and efficiency.

Cost of voting Any sacrifice in utility that voting entails, such as cognitive costs, the time cost of going to the polls, etc.

Collective or social choice problem A setting in which a group of individuals must jointly decide on a single alternative from a set. The outcome of such a problem potentially affects the welfare of all individuals.

Common values A common values problem is one in which agents share preferences over the alternatives, but may differ in their information regarding the attributes of the alternatives.

Condorcet jury theorem In a common values setting, the result that under majority voting the correct candidate will (almost always) be elected so long as the following assumptions are satisfied: (1) each voter's belief is correct with probability higher than one half, (2) each voter votes according to her belief, and (3) there is a large number of voters. In political science this result is sometimes referred to as the wisdom of crowds.
Condorcet paradox See voting cycle.

Downsian model of political competition A model in which candidates strategically position themselves on a unidimensional policy space in order to win an election.

Efficiency A broad criterion that may be used to evaluate the social value of alternative outcomes by demanding that society make use of all valuable resources. Examples of efficiency measures are utilitarianism and Pareto Optimality.

Egalitarianism A broad criterion used to evaluate the social value of alternative outcomes by demanding that welfare or resources be evenly distributed across the population.

Gibbard–Satterthwaite theorem The Gibbard–Satterthwaite Theorem states that, under some assumptions, every non-dictatorial social choice function is manipulable.

Majority voting A voting rule stipulating that each agent vote for a single alternative; an alternative that receives more than half of all votes is the collective choice.

Manipulability of a social choice function A social choice function is said to be manipulable if there is some agent who, given the social choice function, prefers to misreport his true preferences. In a voting context, this translates to voting for an alternative different from the most preferred one.

Median voter theorem The median voter theorem states that if individuals' preferences are single peaked, then there exists an alternative that beats all others in a pairwise majority vote. Single peakedness requires that each individual have a bliss point (most preferred alternative) and that alternatives be less preferred the further they are from the bliss point. The selected alternative is then the median bliss point, and the voter who has this bliss point is the median voter.

No Dictatorship The No Dictatorship criterion of Arrow's desiderata demands that there is no single individual whose preferences always determine those of society.
Any egalitarian arrangement must satisfy the No Dictatorship criterion, though arrangements that satisfy No Dictatorship need not be egalitarian.

Pairwise majority voting rule A rule that compares each pair of alternatives in a majority vote. Depending on agents' preferences and votes, this rule may lead to a voting cycle.

Paradox of voting The puzzle of why there is high voter turnout in large elections, when the probability that any single vote determines the outcome is extremely small.
Pareto optimality or Pareto efficiency An outcome is Pareto Optimal or Pareto Efficient if it is not possible to increase the welfare of one individual without lessening the welfare of another.

Political ignorance A state in which voters are not well informed about the issues and/or candidates they must vote on.

Social choice function A mapping of all individuals' preferences into an alternative, the social choice.

Strategic abstention The tactical decision of an uninformed citizen to abstain in order to let informed citizens with the same preferences determine the outcome.

Strategic voting Voting by agents who aim to maximize their utility and may do so by misreporting their true preferences over electoral outcomes.

Utilitarianism A conception of efficiency that evaluates outcomes on aggregate utility.

Utility of voting The benefit to citizens from voting, usually divided into two components: a non-instrumental component, which includes utility derived from the mere act of voting and not related to the actual outcome of the election, and an instrumental component, given by the utility of the outcome a voter would induce if determining the outcome, weighted by the probability that his vote determines the outcome.

Voting cycle or Condorcet's paradox A voting cycle or Condorcet's Paradox results when every feasible alternative is beaten by another in a pairwise majority vote. As such, any collective choice is less preferred than some other alternative by more than half of the population.

Voting rule A mapping of votes into a collective choice. Examples of different voting rules include majority voting and plurality voting. The collective choice may vary under alternative voting rules.

Definition of the Subject

Voting is a fundamental mechanism that individuals use to reach an agreement on which one of many alternatives to implement.
The individuals might all be affected by the outcome of such a process and might have conflicting preferences and/or information over the alternatives. In a voting mechanism, preferences and information are aggregated as individuals submit votes and a voting rule maps the compilation of votes into the alternative that is to be selected. The use of voting as a means of making a group decision dates back at least to ancient Greece, though the works of Condorcet and Borda, contemporaries of the French Revolution, are among the pioneering contributions to voting theory. Meanwhile, welfare economists such as Bentham suggested formal definitions of socially desirable outcomes. As voting theory and welfare economics evolved, Arrow's result in the mid-twentieth century showed that no mechanism, voting or otherwise, can produce outcomes consistent with some of welfare economists' definitions of socially desirable states. Moreover, results in the early 1970s suggested that voters have incentives to misrepresent their true preferences in elections. Contemporary voting theory has developed new models of strategic behavior to address questions on how political agents behave and which outcomes voting might produce.

Introduction

Voting is one of the most commonly used ways of making collective decisions. Two issues arise when a group of people must find an agreement on a choice that will potentially affect the welfare of all. First, individuals might have conflicting preferences over the set of alternatives they are choosing from. An interesting question is what is the best way to aggregate individuals' preferences in order to reach a common decision when preferences conflict. Second, even if agents share the same preferences, they might possess different information about the alternatives. In this situation, an interesting question is what is the best way to aggregate people's conflicting information so as to make the right choice. The optimal ways to aggregate preferences and information are among the most important questions that the literature on voting has tried to answer. Since voting concerns societies, as opposed to single individuals, this literature relies on works in the field of welfare economics. This branch of economics is primarily concerned with the analysis and the definition of the welfare of a society. There is no agreement among social scientists on a single definition of the welfare of a society.
However, two concepts are prominent: efficiency, which concerns the minimal waste of scarce resources, and egalitarianism, which concerns the equal distribution of those resources. Using these concepts to define the welfare of a society, social scientists have attempted to answer questions on preference and information aggregation in making a collective choice. Early works on different voting techniques date back at least to the late eighteenth century, with the works of French mathematicians such as Borda [6] and Condorcet [9]. Their works were among the first to analyze voting procedures with formal mathematical tools, which have since been prominently used in the literature on voting. Continuing in this tradition, a fundamental result in voting theory is Arrow's Impossibility Theorem. This is
a formal treatment of the problem of aggregation of preferences that reached a striking result: there is no way, under some assumptions, to aggregate individuals' preferences that is minimally egalitarian and minimally efficient. Unlike the problem of aggregation of conflicting preferences, studies on the aggregation of conflicting information obtain positive results. In particular, the Condorcet Jury Theorem finds a mathematical justification for the phenomenon known in political science as the wisdom of the crowds, that is, the observation that democracies seem to be better at making decisions than single individuals. However, Condorcet's result relies on the assumption that people vote sincerely, and subsequent works on voters' behavior suggest that this is not always the case. A fundamental theoretical result, known as the Gibbard–Satterthwaite theorem, shows formally that, under some assumptions, in every election at least one individual has an incentive to vote non-sincerely, i.e. to vote strategically. Gibbard and Satterthwaite's result is a starting point for a branch of the voting literature that deals with strategic voting. Numerous works analyze strategic voting through the use of mathematical tools such as game theory. The latter has been used not only to explain voters' behavior, but also to describe competing candidates' behavior in an election. The research in this game-theoretic literature focuses on the outcome of political competition and on the development of a theory of turnout. The former is analyzed through the use of a model of political competition. A main result is that in a two-party election, both parties choose the same political platform in order to maximize the probability of winning the election. The latter is motivated by the paradox of voting, which refers to the empirical observation that citizens vote even when the probability that their vote determines the outcome of the election is negligible, such as in large elections.
At present, there is no widely accepted theory to explain this phenomenon. As with the paradox of voting, the voting literature still has many interesting unanswered questions. As will be mentioned in the section on future directions, voter behavior is still only partially understood and much needs to be explained. On a broader level, the study of the historical evolution of democracies needs further development. Much has been written on the subject of voting, and some interesting results have been obtained, but much still remains to be explained.

The Collective Choice Problem

The basic framework for understanding voting is a collective or social choice problem: a group of agents must reach
an agreement on which alternative to select, and this decision potentially affects the welfare of everyone. Such problems could range from "what should be taught in public schools?" to "who should be president?" Consider two possible alternatives, A and B, and a group of individuals who must jointly choose one of the two. It might be the case that some individuals in society prefer A and some prefer B. These conflicting preferences pose a challenge in determining the appropriate social choice. Alternatively, consider a scenario where A and B are different characteristics that two candidates running for a public office may possess. Suppose that all individuals agree that A is more desirable than B, i.e. this is a common values setting. However, individuals differ in their information; some think that candidate 1 possesses trait A, while others think candidate 2 does. In this case, the challenge is finding the best way to balance the conflicting information in arriving at a collective choice. There are several ways to resolve the problems of conflicting preferences and information. For example, people can bargain to reach an agreement, or they can fight. A collective choice problem might also be solved through a dictatorship, or even through the toss of a coin. A fundamental mechanism used to resolve these conflicts is an election, in which people submit votes on the feasible alternatives and a voting rule maps the collection of votes into a social choice. As can be seen in the next section, there exists a multitude of voting rules. Before discussing voting rules, one should note the following distinction between elections: those in which the alternatives are policies and those in which the alternatives are candidates. The former is known as direct democracy, a common example of which is a referendum. In a referendum, citizens vote on a particular proposal, such as the adoption of a new law.
The outcome of such an election is then to implement or not implement the proposal. This is in contrast to the case, known as representative democracy, where the election is over candidates. The collective choice in such a system is an agent or group of agents with the responsibility of choosing policy. The following voting rules apply to both situations.
Voting Rules

Majority voting is one of the most commonly used voting rules. The rule prescribes that each citizen vote for a single alternative and an alternative becomes the social choice if it receives more than half of all votes. Clearly, when there are more than two alternatives, majority voting does not necessarily produce a social choice.
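As a minimal sketch (the vote profiles below are hypothetical, not taken from the text), a few lines of Python show how majority rule can fail to select any alternative once there are more than two options:

```python
from collections import Counter

def majority_winner(votes):
    """Return the alternative with more than half of all votes, or None."""
    alternative, top = Counter(votes).most_common(1)[0]
    return alternative if top > len(votes) / 2 else None

# With two alternatives, some option clears the bar (ties aside).
print(majority_winner(["A", "A", "B"]))            # A
# With three alternatives, no option may reach more than half of the votes.
print(majority_winner(["x", "x", "y", "y", "z"]))  # None
```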
To ensure a comparison between alternatives, one can resort to the pairwise majority voting rule, in which alternatives are voted over pair by pair with a majority vote. That is, a majority vote is held between every pair of feasible alternatives and, for each majority vote, the winner is deemed socially preferable to the loser. However, this voting rule might generate an intransitive social preference in which society prefers x to y, y to z, but z to x. Consider a three-individual committee comprised of voters 1, 2 and 3 who must choose one of three alternatives, x, y and z. Individual preferences are such that voter 1 prefers x to y to z, voter 2 prefers z to x to y, and voter 3 prefers y to z to x. By pairwise majority voting x beats y, which in turn beats z, which in turn beats x. This intransitivity over alternatives is known as a voting cycle or Condorcet's Paradox. In the presence of a voting cycle, pairwise majority voting might not produce an overall winner, as each alternative might be beaten by another. An agenda setter could, however, end a cycle by specifying the order in which the alternatives are to be compared in a pairwise vote. Pairwise majority voting then places all the decision power in the hands of the agenda setter. Returning to the previous example, suppose the agenda prescribes that voting be carried out between alternatives x and y first and the winner is then to be compared with z. In the first round of voting x beats y and z then beats x, so z becomes the collective choice. However, the agenda setter could instead choose an initial comparison between y and z. Since y beats z in the first round and x beats y in the second, x would then be the collective choice. Hence, the agenda setter decides the outcome of the election by choosing the order of the pairwise voting. A plurality voting rule is an alternative way to make a collective choice: agents each vote for one alternative and the alternative with the most votes is chosen.
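The cycle and the agenda setter's power in the three-voter committee example can be checked mechanically. The following is an illustrative sketch; the function names are chosen here for exposition:

```python
from itertools import combinations

# Rankings from the committee example, most preferred first.
prefs = {1: ["x", "y", "z"], 2: ["z", "x", "y"], 3: ["y", "z", "x"]}

def pairwise_winner(a, b, prefs):
    """Majority vote between a and b: each voter backs whichever ranks higher."""
    a_votes = sum(1 for r in prefs.values() if r.index(a) < r.index(b))
    return a if a_votes > len(prefs) / 2 else b

# Every alternative loses some pairwise contest: a voting cycle.
for a, b in combinations("xyz", 2):
    print(f"{a} vs {b}: {pairwise_winner(a, b, prefs)} wins")

def agenda_outcome(agenda, prefs):
    """Sequential pairwise votes: the running winner faces the next alternative."""
    winner = agenda[0]
    for challenger in agenda[1:]:
        winner = pairwise_winner(winner, challenger, prefs)
    return winner

print(agenda_outcome(["x", "y", "z"], prefs))  # z: x beats y, then z beats x
print(agenda_outcome(["y", "z", "x"], prefs))  # x: y beats z, then x beats y
```

Both agendas reproduce the text's conclusion: by merely reordering the comparisons, the agenda setter obtains z or x as the collective choice.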
Other voting rules require agents to submit scores or rankings of all available alternatives, rather than just voting for a single one. In a Borda Count, agents rank all alternatives, assigning larger numbers to those that are more preferred. The voting rule sums the scores for each alternative across individuals, and the alternative with the highest sum is the social choice. A supramajority voting rule stipulates that the 'status quo' alternative is chosen unless another alternative receives at least some specified percentage of the vote larger than fifty percent. In the limit, there might be a unanimity rule that mandates that one hundred percent of the electorate vote for an alternative for it to be chosen against the status quo. Examples of supramajority rules include the passing of constitutional amendments in the United States, where the current constitution is the status quo. Unanimity rules
are commonly found in the judicial system, in which all jurors must agree on the guilt of the defendant to override the status quo, the presumption of innocence. Voting rules might also grant veto power to one or more agents. For example, each of the five permanent members of the fifteen-member United Nations Security Council has the power to veto resolutions on particular matters. Any collective choice must therefore have the approval of all five permanent members. The different voting rules are not simply different methods of arriving at the same social choice. Rather, the result of an election depends critically on the voting rule that is used. In fact, an alternative that is the social choice according to one voting rule might be the least preferred under another. For instance, consider alternatives x, y, and z and seven voters, three who prefer x to y to z, two who prefer y to z to x, and two who prefer z to y to x. By pairwise majority voting, x loses to both y and z. In contrast, x beats both y and z in a plurality vote. This suggests that in order to determine which voting rule to use, one must carefully analyze the various resulting outcomes. To this end, it is useful to identify criteria that allow one to discriminate among the different outcomes produced under various voting rules. There are two main methods of doing this: efficiency and egalitarianism.

Welfare Economics

The analysis of the efficiency and egalitarianism of a social state is among the objectives of a discipline called welfare economics. The first criterion for efficiency used by welfare economists dates back at least to Jeremy Bentham [2] and is known as utilitarianism. According to utilitarianism, the social interest is judged in terms of the total utility of a community. For example, if by moving from arrangement A to arrangement B Mr 1 benefits more than Ms 2 suffers, then the movement from A to B is judged as a social welfare improvement.
Notice that in order to implement this criterion, the satisfaction intensities of different individuals must be comparable. In the 1930s this criterion was criticized by Lionel Robbins [35] and other welfare economists who claimed that the comparison of utilities across individuals has no scientific basis. In the 1940s a new criterion was developed which required no comparison of individual utilities: the Pareto criterion. A social outcome is said to be Pareto Optimal (Pareto Efficient) if there is no other outcome that would benefit at least one individual without hurting anyone else. Consider a scenario where ten dollars must be split among two individuals who value money and there are two alternatives: either person 1 receives five dollars, person 2 four and the re-
maining dollar is thrown away, or both receive five dollars. Clearly, the first alternative is not Pareto Optimal because person 2 can be made better off and person 1 would remain as well off if 2 is given the dollar that is being thrown away. Notice that the first alternative is also non-utilitarian; in fact, the sum total of utilities cannot be maximized when valuable resources are thrown away. However, a drawback of this efficiency criterion is that there exist multiple non-comparable Pareto Optimal outcomes: any division of the ten dollars among the two individuals is Pareto Optimal as long as no money is thrown away, since to make one person better off one would have to take resources away from the other person. Hence, the Pareto Optimality criterion does not allow one to distinguish among multiple outcomes. Finally, notice that Pareto Optimal outcomes can be extremely non-egalitarian: person 1 receiving ten dollars and person 2 zero is an unequal but Pareto Efficient division. An alternative criterion often used to discriminate among social outcomes is egalitarianism, which focuses on the distribution of welfare across members of a society. One of the abstract principles behind egalitarianism is the veil of ignorance [18,19,33]. Consider a situation in which two persons must share a cake. Pareto Optimality does not help in selecting a division: as in the ten-dollar example, any division of the cake is Pareto Optimal. Suppose that one of the two persons sharing the cake is asked to cut it in two without knowing a priori which piece she will receive. Her ignorance about which piece she will receive makes her cut the cake in two equal shares, an egalitarian division. Notice that there are a number of ways to define egalitarian outcomes. For example, Rawls's [33] maximin rule suggests that the social objective should be to maximize the welfare of the worst-off individual. Finally, notice that an egalitarian outcome might be extremely inefficient.
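The ten-dollar example lends itself to a small check. In this illustrative sketch (the allocation list and helper name are assumptions of the sketch), each pair is the dollars received by person 1 and person 2, with any remainder thrown away:

```python
allocations = [(5, 4), (5, 5), (9, 1), (10, 0), (2, 2)]

def pareto_dominated(a, allocations):
    """a is dominated if some other allocation b leaves nobody worse off
    (and hence, since b differs from a, makes someone strictly better off)."""
    return any(b != a and b[0] >= a[0] and b[1] >= a[1] for b in allocations)

for a in allocations:
    status = "dominated" if pareto_dominated(a, allocations) else "Pareto optimal"
    # Utilitarian sum and Rawlsian worst-off value for comparison.
    print(a, status, "| sum:", sum(a), "| worst-off:", min(a))
```

Mirroring the text, (5, 4) is dominated by (5, 5); (10, 0) is Pareto efficient despite being maximally unequal; and the more egalitarian (2, 2) is dominated because it wastes resources.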
Returning to the ten-dollar example, an arrangement where person 1 is given nine dollars and person 2 one dollar is not as egalitarian as one where both are given two dollars, though the latter does not make use of more than half of the available resources.

Arrow's Impossibility Theorem

Ideally, one would like to use the normative criteria of social efficiency and egalitarianism to discriminate between different voting rules and select the best one. However, a general result known as Arrow's Impossibility Theorem [1] shows mathematically that this is impossible. In his seminal work, Arrow shows that there is no voting mechanism that generates a social consensus on the ordering of the different alternatives while satisfying a number of
axioms, among which are the weakest forms of efficiency and egalitarianism: Pareto Optimality and No Dictatorship. The latter, which states that no individual always determines the preferences of society, is a weak form of egalitarianism. While a non-dictatorial society can be quite unequal, any egalitarian society must be non-dictatorial. A number of possibility results have been obtained by relaxing some of Arrow's axioms. For example, pairwise majority voting with a particular restriction on individual tastes, which violates what Arrow called Unrestricted Domain, satisfies Pareto Optimality and No Dictatorship, while generating an ordering of the social alternatives. Black [4] noticed that pairwise majority voting produces an outcome that is not subject to Condorcet's paradox when individual preferences are single-peaked: every individual must have a most preferred alternative (bliss point) and, between any two alternatives, he prefers the one that is closer to his bliss point. An important result in voting theory, called the median voter theorem, shows that when individuals' preferences satisfy this condition, the bliss point of the median voter beats any other alternative by pairwise majority voting. The median voter is found by ordering voters according to their bliss points. The importance of this theorem derives from its ability to describe how democracies work in practice. It is commonly observed that candidates try to appeal to voters who are politically moderate, or "in the middle": this is consistent with the theory, which suggests that these are the preferences that will eventually prevail in a democratic system.

Political Ignorance and the Condorcet Jury Theorem

As mentioned earlier, voting is not only a way to aggregate conflicting preferences but also a way to aggregate individual, possibly conflicting, information when preferences are partially or totally aligned. This is the case in common value settings.
When people would agree on the best choice if given the same information on the alternatives, but differ in the information they actually receive, a natural question is which voting mechanism aggregates information in a way that maximizes the probability that the right decision is made. A result called the Condorcet Jury Theorem [9] shows that among all possible voting rules, simple majority rule guarantees that the right decision is made under three crucial assumptions: that each voter has a correct belief with probability higher than 50%, that each voter votes according to his belief, and that the electorate is very large. Before going into the details of the theorem, it must be mentioned that this result has a practical importance. A number of works document that voters
are ignorant over both the policy and candidate alternatives over which they are to vote. [7] claim that "many people know the existence of few if any of the major issues of policy", while [12] discusses "mass political ignorance" and "mass political apathy" as playing key roles throughout the history of American politics. More recently, the 2004 American National Election Study found that Americans performed extremely poorly when asked simple questions about the political system and the leaders in charge of it. This evidence is in favor of what is called political ignorance. In a setting where voters are politically ignorant, the Condorcet Jury Theorem provides a valuable insight as to how much political information matters in determining the outcome. Consider a committee that has to elect an administrator from two candidates, one "good" and one "bad." Assume that all the members of the committee share the same preferences: they all prefer the good administrator to be selected. Individuals differ, however, in the information they have about which candidate is the good one. Suppose that each voter has a belief about which candidate is the good one and votes according to his belief. If each voter has a correct belief with probability higher than 50%, then by the Law of Large Numbers, as the number of voters becomes very large the probability that more than half of the electorate votes for the right candidate goes to one. Hence, under simple majority rule the probability that the right choice is made goes to one. Condorcet's conclusion is that in a common value setting a democratic decision is superior to an individual decision, because each voter makes the wrong decision with a non-negligible positive probability, whereas the population as a whole makes the right decision almost always. As far as political ignorance is concerned, this result shows that it is not necessary for an electorate to be well informed for it to make the right decision.
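The law-of-large-numbers argument can be made concrete with an exact binomial computation. This is a sketch; the 60% individual accuracy and the electorate sizes are illustrative choices, not from the text:

```python
from math import comb

def prob_majority_correct(n, p):
    """Probability that more than half of n independent voters, each correct
    with probability p, vote for the good candidate (odd n avoids ties)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# A lone voter errs 40% of the time, yet a large electorate is almost never wrong.
for n in (1, 11, 101, 1001):
    print(n, round(prob_majority_correct(n, 0.6), 4))
```

Reversing the first assumption (individual accuracy below one half) sends the same probability to zero as the electorate grows, which is why the better-than-random condition is crucial.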
So long as each voter has a correct belief with probability higher than one half, the electoral outcome will almost always be identical to one in which the electorate was perfectly informed. Therefore, Condorcet's result implies that the ignorance of individual voters is overcome by the aggregation of information in an election.

Gibbard–Satterthwaite Theorem

Thus far, citizens have been treated as if they disregard any tactical considerations when faced with a voting decision. However, a citizen might find it worthwhile to misrepresent his true preferences in order to achieve a social outcome more preferred than the one that would result if he voted naively. Consider an election in which a status quo will be replaced if a simple majority agrees on one of three
candidates. Suppose the status quo is a conservative government, and the three alternative candidates to be voted for are a conservative, a moderate and a liberal. Imagine that there are only three voters, and two votes have already been cast: one for the moderate candidate and one for the conservative one. Suppose that the last individual who is called to vote is politically liberal. He knows that if he votes for his truly most preferred candidate, the liberal one, there would be a tie and the status quo conservative government would not be replaced. However, by voting for his second most preferred alternative, the moderate candidate, he would break the tie and the status quo government would be replaced with a moderate one. A liberal voter prefers this outcome to the one where the conservatives win. Therefore he has an incentive to misrepresent his true preferences and vote tactically for his second-best alternative. A powerful result in voting theory called the Gibbard–Satterthwaite theorem [17,36] shows formally that in most electoral settings at least one citizen has an incentive to vote tactically. Define a social choice function as a mapping of all individuals' preferences into a collective choice. The theorem states that there is no social choice function that is Non-Dictatorial and Non-Manipulable, i.e. such that no agent has an incentive to vote tactically. Consequently, for every voting rule there is at least one agent with an incentive to misrepresent her preferences. It should be mentioned that the Gibbard–Satterthwaite theorem also places other technical restrictions. Furthermore, in a setting such as a majority vote, a misrepresentation of one's true preferences coincides with voting for an alternative which the voter does not rank top.
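The three-voter example above can be sketched in a few lines. The candidate labels and the convention that the status quo survives a three-way tie follow the text; the helper names are this sketch's own:

```python
liberal_ranking = ["liberal", "moderate", "conservative"]  # most preferred first
cast_votes = ["moderate", "conservative"]  # the two votes already in

def outcome(my_vote, cast_votes):
    """A candidate backed by a simple majority (2 of 3 votes) replaces the
    conservative status quo; otherwise the status quo stays in place."""
    votes = cast_votes + [my_vote]
    for candidate in set(votes):
        if votes.count(candidate) >= 2:
            return candidate
    return "conservative"  # three-way tie: the status quo survives

sincere = outcome("liberal", cast_votes)    # tie, so the conservatives stay
tactical = outcome("moderate", cast_votes)  # the moderate wins
# The tactical vote yields an outcome the liberal voter ranks strictly higher.
print(sincere, tactical)
print(liberal_ranking.index(tactical) < liberal_ranking.index(sincere))  # True
```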
In the original formulation of their theorem, however, Gibbard and Satterthwaite dealt with mechanisms in which agents are required to submit a ranking over all alternatives, and a misrepresentation of tastes in their framework does not necessarily coincide with a misrepresentation of only the most preferred alternative (see [26]). This result suggests that for a deep understanding of voting one should not focus only on the assumption that citizens vote sincerely. Tactical voting is not only an abstract possibility; it is also an actual behavior that must be considered in any voting analysis. The following sections explore how the literature on voting has dealt with strategic voting.

Political Competition and Strategic Voting

The early works of Downs [10] and Tullock [39] initiated the analysis of political issues within a strategic framework, where voters and/or candidates are assumed
to be rational decision makers. Political competition describes a situation in which candidates strategically position themselves in order to win an election, whereas strategic voting refers to individuals’ decisions to vote so as to maximize utility, sometimes by misreporting their true preferences. This game-theoretical framework developed both because of formal results such as the Gibbard–Satterthwaite theorem, which suggests that voters have an incentive to behave strategically, and because of the spread of game theory as a dominant tool in economic analysis.

Political Competition

The classic model analyzing the candidates’ choice of positioning on the political spectrum is that of Downs [10], who adapted the classical Hotelling model [21] to the analysis of the choice of political platforms by candidates. In the Downsian model of political competition, there is a unidimensional policy space, representing the political spectrum. There are two candidates who position themselves on this policy space. Each voter has a most preferred point on this space and prefers points closer to this point to those further away, i. e. each voter has single-peaked preferences. Downs argues that strategic candidates concerned only with winning position themselves at the point that is most preferred by the median voter. If candidate A were positioned anywhere else, say to the left of the median voter, candidate B could get the majority of votes by positioning himself between candidate A and the median voter; all the agents to the right of B, constituting more than half of the voters, would prefer B to A. Given this scenario, candidate A would then (for the same reason) have an incentive to position himself between B and the median voter, and so on. Hence, the result is that both candidates position themselves at the policy platform most preferred by the median voter. 
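The median-voter logic can be illustrated numerically. In the sketch below (ours; the five ideal points are an arbitrary hypothetical electorate), each voter with single-peaked preferences votes for the platform closer to her ideal point, and the platform at the median ideal point defeats every alternative platform in a pairwise majority vote.

```python
import statistics

def majority_winner(pos_a, pos_b, ideal_points):
    # Each voter supports the closer platform; exactly indifferent
    # voters are ignored in this simple sketch.
    a = sum(abs(x - pos_a) < abs(x - pos_b) for x in ideal_points)
    b = sum(abs(x - pos_b) < abs(x - pos_a) for x in ideal_points)
    return "A" if a > b else "B" if b > a else "tie"

ideals = [0.1, 0.3, 0.4, 0.7, 0.9]    # hypothetical ideal points of 5 voters
m = statistics.median(ideals)          # the median voter's ideal point
# The median platform beats any other platform under majority rule:
assert all(majority_winner(m, p, ideals) == "A"
           for p in [0.0, 0.2, 0.55, 1.0])
```

This is why, in Downs's model, both purely office-motivated candidates converge to the median voter's most preferred point.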
An interesting implication of this result is that under a democracy with two parties, both parties act identically, and therefore there are only as many positions (just one) taken by political parties as there would be in a dictatorship. However, it is crucial that two parties exist, so that the competition between them allows the chosen policy point to represent the preferences of the voters. This is in contrast to a dictatorship, in which the ruling party can implement its own preferred policy without voter approval. Notice that the example above assumes a two-party system (for models that allow for more than two parties see [3,29]). Duverger [11] posits that in a representative democracy with a plurality voting rule, only two parties compete in the elections. This theory, known as Duverger’s Law, states that a proportional representation system, in which parties
gain seats in proportion to the number of votes received, fosters elections with numerous parties. In contrast, a plurality system marginalizes smaller parties and results in only two parties entering into political competition.

The Decision to Vote: the Paradox of Voting

Another issue raised by Downs, one which focuses on voters’ behavior rather than candidates’, is known as the paradox of voting. It refers to the fact that in a large election, the probability that any single vote determines the outcome is vanishingly small. If every person voted only for the purpose of influencing the outcome of the election, even a small cost of voting would be sufficient to dissuade anyone from voting. Yet, it is commonly observed that turnout is very high, even in large elections. From the large empirical literature on turnout in elections, some facts seem to be acquired knowledge: (1) turnout is higher in more important elections (e. g. Presidential elections in the US have a significantly higher turnout than Gubernatorial elections), (2) turnout is generally higher in close elections (i. e. those with smaller margins of victory), and (3) turnout rates differ among groups with different demographic characteristics. For instance, from the thorough work in [41], it emerges that education has a substantial effect on the probability that one will vote. Income has less of an effect once the impact of other variables has been controlled for. After education, the second most important variable is age, which appears to have a strong positive relationship with turnout. Other socio-economic variables are also important; in particular, racial minorities appear to be less likely to vote. Finally, turnout seems to be significantly influenced by factors such as the weather conditions on the day of the election and voters’ distance from the polls (see [8]). 
Such comparative statics suggest that it is appropriate to model voters’ behavior as a rational choice problem within a standard utility maximization framework. The modern theory of voting applies the classic utilitarian framework to the voting problem, positing that agents decide whether or not to vote by comparing the cost of voting with the utility of voting. The traditional starting point for the modern theory of voting is Riker and Ordeshook [34], who formalize the insights of [10] and [39] in a simple utilitarian model of voting. The cost of voting comprises any sacrifice in utility that voting entails. The utility of voting is usually divided into two components: a non-instrumental component and an instrumental component. The non-instrumental component includes utility derived from the mere act of voting, not related to the actual outcome of the election. It may include, for instance, the sense of civic duty. There is
considerable evidence that voters are motivated by a sense of civic duty (see, for example, [5]). The instrumental component is the utility of the outcome a voter induces if her vote determines the outcome, weighted by the probability that her vote actually determines the outcome. The instrumental component of the utility of voting has attracted most of the attention in the literature. It is typically analyzed through the rational theory of voting, which is motivated by an empirical observation: there exists a strong positive correlation between turnout rate and closeness of the election. This fact suggests that, ceteris paribus, voters are more likely to vote if their vote is more likely to make a difference. The main theoretical problem is to endogenize the probability that each voter is pivotal, i. e. that his vote is determinant for the outcome of the election. Ledyard’s papers [23,24] are among the early works in the literature on game-theoretical pivotal-voter models. In these models, voters infer the probability of being pivotal from the equilibrium strategies of other voters. Subsequently, they decide whether or not to vote, trading off the cost of voting with the expected (instrumental) utility of voting. Although Ledyard did not focus on the magnitude of turnout in a strategic model, this question is addressed by Palfrey and Rosenthal [30,31], who model elections with uncertainty about the total number of voters. Voters strategically choose whether or not to vote for their favorite alternative among two candidates. However, Palfrey and Rosenthal’s theories do not explain high turnout in large elections unless the cost of voting is unrealistically low. Ultimately, the game-theoretic approach to costly voting could not escape the paradox of voting. Since the probability of being pivotal is very small in large elections, the individual incentives to vote cannot justify high turnouts unless the cost of voting is sufficiently small. 
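How fast does the probability of being pivotal vanish? In the simplest benchmark (ours, not a model from the literature cited above: every other voter votes for one of two candidates independently with probability one half), a vote is pivotal exactly when the rest of the electorate ties, and that probability shrinks like the square root of the electorate's size.

```python
from math import comb, pi, sqrt

def pivot_prob(n_others):
    # Probability that n_others independent 50/50 voters split exactly
    # evenly, i.e. that one additional vote decides the election.
    if n_others % 2:
        return 0.0
    return comb(n_others, n_others // 2) / 2 ** n_others

for n in (10, 1_000, 100_000):
    # exact tie probability vs. its sqrt(2 / (pi * n)) approximation
    print(n, pivot_prob(n), sqrt(2 / (pi * n)))
```

Already with a hundred thousand other voters the pivot probability is below one in a thousand, so even a tiny cost of voting swamps the expected instrumental benefit — the core of the paradox.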
Indeed, regardless of how small the cost of voting is, the theory predicts low turnout as the election becomes arbitrarily large, which is in contrast to empirical evidence. The puzzle that remains open is how to reconcile the evidence of high turnout in large elections with the responsiveness of turnout levels to the closeness of the election.

Mobilization and Group-Based Notion of Welfare

Two strands of the literature try to overcome the paradox of voting by focusing on groups of like-minded people rather than on individual agents. These are models of mobilization and models incorporating a group-based notion of welfare.
In models of mobilization, the population of voters is assumed to be divided into groups, each of which has a leader who has the same preferences as all agents in the group and coordinates their behavior. The turnout decision within each group is determined by how the leaders allocate costly resources to voters. It is as if leaders buy the votes of the agents in their group, compensating for the agents’ costs of voting. Since leaders influence a large number of voters, their decisions have a non-negligible impact on the probability of affecting the electoral outcome and, consequently, on the individual instrumental benefit from voting. Schram [37] and Shachar and Nalebuff [38] test group-based models and provide some empirical support for the mobilization thesis (see also [27,28,40]). The problem for models of mobilization is that it is not clear how leaders affect the individual behavior of voters. Models of group-based welfare consider groups of like-minded individuals whose actions are intended not to maximize their individual utilities, but rather that of the group (see [22,25]). In this case, there is no leader who prescribes behavior as in mobilization models, but instead there is an implicit understanding among agents in the group on appropriate behavior. This idea is developed by Feddersen and Sandroni [16], who appeal to Harsanyi’s group rule-utilitarian theory [20] to endogenize the non-instrumental component of the utility from voting in a way that preserves the positive relation between closeness of the election and incentive to vote typical of the classic pivotal-voter models. In this model, agents derive utility from “doing their part”: in the spirit of [20], this is understood to mean following the rule that, when followed by all the agents in a given group, would maximize some measure of the group’s utility. The outcome is a set of rules for each group, which are mutually optimal (from the point of view of the group) given that individuals follow the rules within their group. 
Abstention still occurs because for some agents (those with higher costs of voting) the rule prescribes not to vote, since their contribution to the group’s utility from the election’s outcome does not compensate for the increase in the group’s total cost of voting. Coate and Conlin [8] provide some empirical support for the group-rule-utilitarian approach.

The Common Value Setting with Strategic Agents

Feddersen and Pesendorfer [13,14] consider the Condorcet Jury Theorem in a strategic setting and reach a different conclusion than Condorcet. They model a voting problem in an almost-common-value setting as a game, that is, as a situation where agents interact strategically. The following simple example provides the basic insights of their model: suppose that there are three voters, 1, 2
and 3, and two candidates, A and B. Suppose that voter 1 is an A-partisan, meaning he prefers candidate A in all states of the world. Agents 2 and 3 instead prefer candidate A in state sA and candidate B in state sB. Let p be the probability of state sA and 1 − p the probability of sB. Now, suppose that agent 2 is informed, i. e. he knows the state of the world before voting, while agent 3 is not. Finally, suppose that there is no cost of voting and that the election is decided by simple majority rule. In this situation, agent 1 votes for A, agent 3 for B, and agent 2 for A if he observes sA and for B if he observes sB. To understand why the uninformed agent 3 votes for B, notice that by doing so the outcome is that A is always selected in state sA, while B is always selected in state sB. This is clearly the best outcome for agent 3, and in all states this is also the best outcome for the majority of the population. In fact, if the true state is sA, A is selected, which is preferred to B by all voters; if the true state is sB, B is selected, which is preferred to A by two voters out of three. The uninformed agent in the example votes for B no matter what the prior probability p is, even if p is close to one, that is, even if he is almost sure that A is the right candidate. By voting for B, individual 3 counterbalances the A-partisan’s vote, thereby allowing the informed voter (2) to induce the “right” outcome with probability one. In Condorcet’s argument, if voters vote according to their belief about the state of the world, information is aggregated in a way that induces the right social choice to be made. In this setting, where preferences are only partially aligned, information is aggregated so as to always generate the decision that is preferred by the majority of the population only if uninformed voters vote strategically. 
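The three-voter example can be verified by enumerating both states of the world. The helper names below (`winner`, `run`) are ours, chosen only for this illustration.

```python
def winner(votes):
    # Simple majority rule between candidates A and B.
    return "A" if votes.count("A") > votes.count("B") else "B"

def run(state, v3_vote):
    v1 = "A"                              # voter 1: the A-partisan
    v2 = "A" if state == "sA" else "B"    # voter 2: informed, votes the state
    return winner([v1, v2, v3_vote])

# If the uninformed voter 3 votes B, the informed voter decides the
# election, so the majority-preferred candidate wins in every state:
assert run("sA", "B") == "A" and run("sB", "B") == "B"
# If instead voter 3 votes for A (his prior favorite), A wins even in
# state sB, where a majority prefers B:
assert run("sA", "A") == "A" and run("sB", "A") == "A"
```

Voting B thus exactly offsets the partisan's vote, whatever the prior p — the strategic logic behind the example.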
In this setting, if voters vote sincerely, as is assumed by Condorcet, the result that the aggregation of information taking place during an election delivers the “right” social choice does not hold. In Feddersen and Pesendorfer’s framework it is strategic voting that induces the “correct” social choice. This is the major difference from Condorcet’s result, which was driven by the assumption of sincere voting. Recognizing the strategic incentives that voters may have in an election, Feddersen and Pesendorfer [15] apply a similar analysis to the unanimity rule in juries. They find that, in the context of a common value setting, unanimity voting might result in convicting the innocent more often than other rules, because strategic jurors consider the probability of being pivotal (like voter 3 in the example) and make their decisions conditional on being pivotal. Now suppose there is another voter, 4, who is uninformed and shares the same preferences as 2 and 3. In this setup, even with a zero cost of voting agent 4 would abstain, so that the informed voter is pivotal with probability one. Voter 4’s behavior is known as strategic abstention: the act of abstaining by uninformed voters not because voting is costly, but because by doing so they allow the informed voters to be pivotal. By abstaining, uninformed voters in effect delegate the decision to the informed voters. In the example, voter 4’s strategic abstention allows information equivalence to arise: making voter 4 informed would not change the outcome of the election, as long as 4 strategically abstains so that the informed voter determines the outcome. Notice that this result is in the same spirit as the Condorcet Jury Theorem’s result: there, having each voter’s belief accurate with a probability slightly higher than 50% or substantially higher than 50% does not make a difference. As long as voters vote according to their beliefs, the outcome is the same no matter what the underlying belief accuracy is (as long as it is greater than a half). Although for different reasons, in both Condorcet’s setting and Feddersen and Pesendorfer’s model (under some circumstances) the aggregation of information that takes place during an election ensures that the outcome of the election does not vary if the electorate is made more informed.

Future Directions

There are a number of open questions in the literature on voting. As mentioned in earlier sections, at the time of writing there is no universally accepted theory of turnout. In particular, no theory delivers all of the relevant comparative statics observed empirically, that is, the variability of turnout across elections of different size, importance, and closeness. Broadly speaking, one may say that the behavior of voters is still only partially understood. Whether voters vote based on strategic considerations or vote without regard to how other citizens might be voting is an unsettled issue. Furthermore, individuals’ choices in hypothetical and real situations might differ. 
The act of voting in large elections is almost a hypothetical choice, in that the likelihood that a vote determines the outcome is negligible. An open question then is whether voters choose candidates as they would in a real situation or as they would in a hypothetical situation. As far as empirical work is concerned, little seems to be known of the empirical impact of political ignorance on the outcome of elections. Also, an issue that is both empirical and theoretical, and that has not been satisfactorily addressed, is what types of election rules are best suited for different decisions. Finally, although much has been written on the origins and evolution of democracies, there is no general consensus as to the possible reasons for the historical evolution of democracies. As is clear from this section, many interesting questions in
politics have not yet been satisfactorily answered, leaving space for future research.

Bibliography

Primary Literature

1. Arrow KJ (1950) A difficulty in the concept of social welfare. J Political Econ 58(4):328–46
2. Bentham J (1789) An introduction to the principles of morals and legislation. Payne, London
3. Besley T, Coate S (1997) An economic model of representative democracy. Quart J Econ 112(1):85–114
4. Black D (1948) The decisions of a committee using a special majority. Econometrica 16(3):245–61
5. Blais A (2000) To vote or not to vote: The merits and limits of rational choice theory. University of Pittsburgh Press, Pittsburgh
6. Borda J (1781) Mathematical derivation of an election system. Isis 44(1–2):42–51
7. Campbell A, Converse PE, Miller WE, Stokes DE (1960) The American voter. Wiley, New York
8. Coate S, Conlin M (2004) A group-rule utilitarian approach to voter turnout: Theory and evidence. Am Econ Rev 95(5):1476–1504
9. Condorcet M (1785) Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix. L’Imprimerie Royale, Paris. English translation by Baker (1976) Chelsea, New York
10. Downs A (1957) An economic theory of democracy. Harper, New York
11. Duverger M (1972) Factors in a two-party and multiparty system. In: Party Politics and Pressure Groups. Crowell Company, New York, pp 23–32
12. Dye TR, Zeigler TH (1970) The irony of democracy, an uncommon introduction to American politics. Duxbury, Belmont
13. Feddersen T, Pesendorfer W (1996) The swing voter’s curse. Am Econ Rev 86(3):408–424
14. Feddersen T, Pesendorfer W (1997) Voting behavior and information aggregation in elections with private information. Econometrica 86(3):408–424
15. Feddersen T, Pesendorfer W (1998) Convicting the innocent: The inferiority of unanimous jury verdicts. Am Political Sci Rev 92:23–35
16. Feddersen T, Sandroni A (2006) A theory of participation in elections. Am Econ Rev 96(4):1271–1282
17. Gibbard A (1973) Manipulation of voting schemes. Econometrica 41:587–601
18. Harsanyi JC (1953) Cardinal utility in welfare economics and the theory of risk-taking. J Political Econ 61:434–435
19. Harsanyi JC (1977) Rational behavior and bargaining equilibrium in games and social situations. Cambridge University Press, New York
20. Harsanyi JC (1980) Rule utilitarianism, rights, obligations and the theory of rational behavior. Theory Decis 12(1):115–33
21. Hotelling H (1929) Stability in competition. Econ J 39(1):41–57
22. Kinder DR, Kiewiet DR (1979) Economic discontent and political behavior: The role of personal grievances and collective economic judgements in congressional voting. Am J Political Sci 23(3):495–527
23. Ledyard J (1981) The paradox of voting and candidate competition: A general equilibrium analysis. In: Hoorwich G, Quick JP (eds) Essays in Contemporary Fields of Economics. Purdue University Press, Lafayette, pp 54–80
24. Ledyard J (1984) The pure theory of two candidate elections. Public Choice 44(1):7–41
25. Markus G (1988) The impact of personal and national economic conditions on the presidential vote: A pooled cross-sectional analysis. Am J Political Sci 32(1):137–54
26. Mas-Colell A, Whinston M, Green J (1995) Microeconomic theory. Oxford University Press, Oxford
27. Morton R (1987) A group majority model of voting. Social Choice and Welfare 4(2):117–31
28. Morton R (1991) Groups in rational turnout models. Am J Political Sci 35:758–76
29. Osborne MJ, Slivinski A (1996) A model of political competition with citizen-candidates. Quart J Econ 111(1):65–96
30. Palfrey T, Rosenthal H (1983) A strategic calculus of voting. Public Choice 41(1):7–53
31. Palfrey T, Rosenthal H (1985) Voter participation and strategic uncertainty. Am Political Sci Rev 79(1):62–78
32. Persson T, Tabellini G (2001) Political economics. MIT Press, Boston
33. Rawls J (1971) A theory of justice. Harvard University Press, Cambridge
34. Riker W, Ordeshook P (1968) A theory of the calculus of voting. Am Political Sci Rev 62:25–42
35. Robbins L (1938) Interpersonal comparisons of utility: A comment. Econ J 48(192):635–41
36. Satterthwaite M (1975) Strategy-proofness and Arrow’s conditions: Existence and correspondence theorems for voting procedures and social welfare functions. J Econ Theory 10:187–217
37. Schram A (1991) Voter behavior in economic perspective. Springer, Heidelberg
38. Shachar R, Nalebuff B (1999) Follow the leader: Theory and evidence on political participation. Am Econ Rev 89(3):525–47
39. Tullock G (1967) Towards a mathematics of politics. The University of Michigan Press, Ann Arbor
40. Uhlaner C (1989) Rational turnout: The neglected role of groups. Am J Political Sci 33(2):390–422
41. Wolfinger RE, Rosenstone SJ (1980) Who votes? Yale University Press, New Haven
Books and Reviews

Arrow KJ (1951) Social choice and individual values. Wiley, New York
Arrow KJ, Sen AK, Suzumura K (eds) (2002) Handbook of social choice and welfare, vol 1. Elsevier, Amsterdam
Austen-Smith D, Banks JS (1999) Positive political theory I: collective preference. The University of Michigan Press, Ann Arbor
Cox GW (1997) Making votes count: strategic coordination in the world’s electoral systems. Cambridge University Press, Cambridge
Dummett M (1984) Voting procedures. Clarendon Press, Oxford
Duverger M (1959) Political parties: their organization and activity in the modern state. Methuen and Co. Ltd, London
Dworkin R (1981) What is equality? Part 2: Equality of resources. Phil and Public Affairs 10:283–345
Feddersen T (2004) Rational choice theory and the paradox of not voting. J Econ Perspect 18(1):99–112
Katz RS (1997) Democracy and elections. Oxford University Press, Oxford
McLean I (1987) Public choice: An introduction. Basil Blackwell Inc., New York
Milnor AJ (1969) Elections and political stability. Little, Brown, Boston
Myerson R (2000) Large Poisson games. J Econ Theory 94(1):7–45
Mueller DC (1989) Public choice II: A revised edition of Public Choice. Cambridge University Press, Cambridge
Niemi RG, Weisberg HF (eds) (1972) Probability models of collective decision making. Charles E. Merrill Publishing Company, Columbus
Ordeshook PC (1986) Game theory and political theory: An introduction. Cambridge University Press, Cambridge
Ordeshook PC (ed) (1989) Models of strategic choice in politics. The University of Michigan Press, Ann Arbor
Rawls J (1958) Justice as fairness. Phil Rev 67(2):164–194
Riker WH (ed) (1993) Agenda formation. The University of Michigan Press, Ann Arbor
Sen AK (1973) On economic inequality. Oxford University Press, Oxford
Sen AK (2002) Rationality and freedom. The Belknap Press of Harvard University Press, Cambridge
Tideman N (2006) Collective decisions and voting: The potential for public choice. Ashgate, Burlington
Tullock G (1998) On voting. Edward Elgar Publishing, Northampton
Voting Procedures, Complexity of
OLIVIER HUDRY
École Nationale Supérieure des Télécommunications, Paris, France

Article Outline

Glossary
Definition of the Subject
Introduction
Common Voting Procedures
Complexity Results
Further Directions
Acknowledgments
Bibliography

Glossary

Condorcet winner A candidate is a Condorcet winner if he or she defeats any other candidate in a one-to-one matchup. Such a candidate may not exist; at most, there is only one. Though it could seem reasonable to adopt a Condorcet winner (if any) as the winner of an election, many common voting procedures bypass the Condorcet winner in favor of a winner chosen by other criteria.

Majority relation, strict majority relation In a pairwise comparison method, each candidate is compared to all others, one at a time. If a candidate x is preferred to a candidate y by at least m/2 voters (a majority), where m denotes the number of voters, x is said to be preferred to y according to the majority relation. The strict majority relation is defined in a similar way, but with (m + 1)/2 instead of m/2. If there is no tie, the strict majority relation is a tournament, i. e., a complete asymmetric binary relation, called the majority tournament.

Preference, preference aggregation A voter’s preference is some relational structure defined over the set of candidates. Such a structure depends on the chosen voting procedure, and usually ranges between a binary relation on one extreme and a linear order on the other. Given a collection, called a profile, of individual preferences defined on a set of candidates, the aggregation problem consists in computing a collective preference summarizing the profile as well as possible (for a given criterion).

Profile A profile Π = (R1, R2, …, Rm) is an ordered collection (or a multiset) of m relations Ri (1 ≤ i ≤ m) for a given integer m. As the relations Ri can be the
same, another representation of a profile Π consists in specifying only the q relations Ri which are different, for an appropriate integer q, and the number mi of occurrences of each relation Ri (1 ≤ i ≤ q): Π = (R1, m1; R2, m2; …; Rq, mq).

Social choice function, social choice correspondence A social choice function maps a collection of individual preferences specified on a set of candidates onto a unique candidate, while a social choice correspondence maps it onto a nonempty set of candidates. This provides a way to formalize what constitutes the most preferred choice for a group of agents.

Voting procedure, voting theory A voting procedure is a rule defining how to elect a winner (single-winner election) or several winners (multiple-winner election) or how to rank the candidates from the individual preferences of the voters. Voting theory studies the (axiomatic, algorithmic, combinatorial, and so on) properties of the voting procedures designed in order to reach collective decisions.

Definition of the Subject

One main concern of voting theory is to determine a procedure (also called, according to the context or the authors, rule, method, social choice function, social choice correspondence, system, scheme, count, rank aggregation, principle, solution and so on) for choosing a winner from among a set of candidates, based on the preferences of the voters. Each voter’s preference may be expressed as the choice of a single individual candidate or, more ambitiously, as a ranked list including all or some of the candidates. 
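The glossary notions above are easy to operationalize. The sketch below (ours; profiles are represented as lists of rankings, best candidate first) computes the strict-majority pairwise comparisons and returns the Condorcet winner of a profile when one exists.

```python
def condorcet_winner(profile):
    # profile: list of rankings (lists of candidates, most preferred first).
    # Returns the candidate who beats every other one in a strict
    # pairwise majority, or None when no Condorcet winner exists.
    candidates = profile[0]
    def beats(x, y):
        pref = sum(r.index(x) < r.index(y) for r in profile)
        return pref > len(profile) / 2
    for c in candidates:
        if all(beats(c, d) for d in candidates if d != c):
            return c
    return None

print(condorcet_winner([["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]))
# A Condorcet cycle (a > b > c > a in the majority relation): no winner.
print(condorcet_winner([["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]))
```

The second profile illustrates why a Condorcet winner "may not exist": the majority tournament can be cyclic.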
Such a situation occurs, obviously, in the field of social choice and welfare (for a broader presentation of the field of social choice and welfare, see for instance [4,7,8,9,14,17,67,75,97,101,136,143,154]) and especially of elections (for more about voting theory, see [27,63,107,109,110,125,141,156,167,168,169]), but also in many other fields: games, sports, artificial intelligence, spam detection, Web search engines, Internet applications, statistics, and so on. For a long time, much attention has been paid to the axiomatic properties fulfilled by the different procedures that have been proposed. These properties are important in choosing a procedure, since there is no “ideal” procedure (see next section). More recently, in the late 1980s and the early 1990s, the question arose of the relative difficulty of computing the winners according to a given procedure. (The first to study the question may have been J. Orlin in 1981 [142]; for a recent introduction to computational social choice, see [38].) From a practical point of view, it is crucial to be able to announce the
winner in a “reasonable” time. This raises the question of the complexity of voting procedures, which should be taken into account to the same extent as their axiomatic characteristics. Below, we will detail the complexity results of several procedures: the plurality rule (one-round procedure), the plurality rule with run-off (two-round procedure), the preferential voting procedure (STV), Borda’s procedure, Nanson’s procedure, Baldwin’s procedure, Condorcet’s procedure, the Condorcet–Kemeny problem, the Slater problem, prudent orders (G. Köhler, K.J. Arrow and H. Raynaud), the maximin procedure (P.B. Simpson), the minimax procedure (K.J. Arrow and H. Raynaud), the ranked pairs procedure (T.N. Tideman), Copeland’s procedure, the top cycle solution (J.H. Smith), the uncovered set solution (P.C. Fishburn, N. Miller), the minimal covering set solution (B. Dutta), Banks’s solution, the tournament equilibrium set solution (T. Schwartz), Dodgson’s procedure, Young’s procedure, the approval voting procedure, the majority-choice approval procedure (F. Simmons), and Bucklin’s procedure. Sect. “Introduction” is devoted to a historic overview and to basic definitions and notation. The common voting procedures are depicted in Sect. “Common Voting Procedures”. Sect. “Complexity Results” specifies the complexity of these procedures. Other considerations linked to complexity in the field of voting theory can be found in Sect. “Further Directions”.

Introduction

The Search for a “Good” Voting Procedure, from Borda to Arrow

It is customarily agreed that the search for a “good” voting procedure goes back at least to the end of the eighteenth century, to the works of the chevalier Jean-Charles de Borda (1733–1799) [24] and of Marie Jean Antoine Nicolas Caritat, marquis de Condorcet (1743–1794) [31] (for references on the historical context, see [7,17,23,80,81,119,120,121,122,123,124] and references below). In the 1770s and the 1780s, J.-C. 
de Borda [24], a member of the French Academy of Sciences, showed that the plurality rule used at that time by the academy was not satisfactory. Indeed, with such a voting procedure, the winner could be contested by a majority of voters who would all agree to choose another candidate over the elected winner. (We may notice that the plurality rule with run-off used in many countries has the same defect.) Borda then suggested another procedure (see below). But, as pointed out by Condorcet in 1784 [31], the procedure advocated by Borda suffered from the same defect that Borda himself had pointed out for the plurality rule: a majority of dissatisfied voters
could agree to constitute a majority coalition against the candidate elected by Borda’s procedure in favor of another candidate. Condorcet then designed a method based on pairwise comparisons. By nature, this method cannot elect a winner who will give rise to a majority coalition against him or her. But, unfortunately, Condorcet’s method does not always succeed in finding a winner. If such a winner does exist, i. e., if there exists a candidate who defeats any other candidate in such a pairwise comparison, then this candidate is said to be a Condorcet winner; if he or she exists, a Condorcet winner is unique. Actually, the Academy of Sciences decided to adopt Borda’s method (until 1803). By the way, notice that, according to I. McLean, H. Lorrey and J. Colomer [124], “both Ramon Llull (ca 1232–1316) and Nicolaus of Cusa (also known as Cusanus, 1401–1464) made contributions which have been believed to be centuries more recent. Llull promotes the method of pairwise comparison, and proposes the Copeland rule to select a winner. Cusanus proposes the Borda rule, which should properly be renamed the Cusanus rule”. Despite these historical discoveries, we shall keep the usual names. After these seminal works, in spite of some works by the Swiss Simon Lhuilier (1750–1840) [112] (see also [128]), the Spanish Joseph Isidoro Morales [139], the French Pierre Claude François Daunou (1761–1840) [54] and Pierre Simon, marquis de Laplace (1749–1827) [105], who, through a slightly different approach, rediscovered Borda’s procedure, it seems that the history of the theory of social choice slowed down and almost disappeared until the 1870s or even the 1950s (see [23]). In the 1870s, the English reverend Charles Lutwidge Dodgson (1832–1898), also (or maybe better) known as Lewis Carroll, proposed a voting system in which the winner is the candidate who becomes a Condorcet winner with the fewest appropriate changes in voters’ preferences [58,59,60]. 
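Condorcet's objection to Borda's procedure can be reproduced numerically. The sketch below (ours; the five-voter profile is an illustrative example, not one from the historical sources) computes Borda scores and shows a Borda winner against whom a majority coalition exists.

```python
def borda(profile):
    # Each ranking awards n-1 points to its top candidate, n-2 to the
    # next, ..., 0 to the last (n = number of candidates).
    n = len(profile[0])
    scores = {}
    for ranking in profile:
        for points, cand in zip(range(n - 1, -1, -1), ranking):
            scores[cand] = scores.get(cand, 0) + points
    return max(scores, key=scores.get), scores

profile = 3 * [["A", "B", "C"]] + 2 * [["B", "C", "A"]]
winner, scores = borda(profile)
print(winner, scores)                 # B wins the Borda count...
majority_for_A = sum(r.index("A") < r.index("B") for r in profile)
print(majority_for_A, "of", len(profile), "voters prefer A to B")
```

Here A is the Condorcet winner (three of the five voters prefer A to B), yet Borda's procedure elects B — precisely the kind of majority coalition against the elected candidate that Condorcet objected to.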
Some years later, another English mathematician, Edward J. Nanson (1850–1936) [140], on the one hand, and the Australian Joseph M. Baldwin (1878–1945) [11] in 1926, on the other hand, slightly modified Borda’s method by iteratively eliminating some candidates until only one remains. None of these methods is utterly satisfactory when we consider the properties which are usually regarded as desirable. In 1951 and 1963, Kenneth J. Arrow [7] showed that there does not exist a “good” voting method with respect to some “reasonable” axiomatic properties; this is the famous “impossibility theorem”. More precisely, assuming that the preferences of the voters are complete preorders (see below; notice that the set of complete preorders includes that of linear orders, quite often considered to model the preferences of the voters) and that the result of the voting procedure should also be a complete preorder,
K.J. Arrow considered the following properties:
– Unrestricted domain (or universality): the voting procedure must be able to provide a result whatever the preferences of the voters are.
– Independence of irrelevant candidates: the collective preference between candidates x and y must depend only on the individual preferences between x and y; in other words, the collective preference between x and y must remain the same as long as the individual preferences between x and y do not change.
– Unanimity (or Pareto property): if a candidate x is preferred to another candidate y by all the voters, then x must be preferred to y in the collective preference, too.
K.J. Arrow showed that, if there are at least three candidates (things are much more comfortable with only two candidates!) and at least two voters, the only procedure which satisfies all these conditions at once is dictatorship, in which one voter (the dictator) imposes his or her preference. Though this impossibility theorem ruins the hope of designing a voting procedure fulfilling the usual desirable properties, several procedures have been suggested since; we shall describe some of them below. Among the ways to escape Arrow’s impossibility theorem, we find the following:
– The definition of other axiomatic systems that would lead to voting procedures which would not be dictatorships.
– The restriction of individual preferences to more constrained domains.
– Adapting the result, when it is not satisfactory with respect to the required axiomatic properties, into a result fulfilling these properties and fitting the genuine result as well as possible, for some criterion which must be defined.
The main questions associated with the first possibility are “given some axiomatic properties, what are the voting procedures satisfying these properties?” or, conversely, “given a voting procedure, what is the proper axiomatic system characterizing this procedure?”; we will not consider this direction here.
The second possibility will be illustrated below, by the restriction of individual preferences to single-peaked preferences. The third direction was followed by J.G. Kemeny in 1959 [102], when he studied the aggregation of orders into a median order (see below). Notice that this approach is also attributed to Condorcet; in the sequel, we will refer to the search for a median linear order as the Condorcet–Kemeny problem (as other people rediscovered this problem or some of its variants [130], the problem has also been known under other names, see [36]).
Another related problem, dealing with the majority tournament (see below), is the one, stated explicitly by P. Slater in 1961 [160], of fitting a tournament into a linear order at minimum distance. This is one of the so-called tournament solutions (see [106] and [138]; for references on tournaments, see also [118,134,149,150]), of which the aim is to determine a winner from a tournament. Besides the Slater solution, we shall describe other tournament solutions, such as the top cycle, the uncovered set, the minimal covering set, Banks’s solution, and the tournament equilibrium set. The other common tournament solutions are polynomial or their complexities are not completely known (see [93] for more details).

Definitions, Notation and Partially Ordered Sets Used to Model Preferences

Let us assume that we are dealing with m voters who must choose between n candidates denoted x_1, x_2, …, x_n or x, y, z, …; X denotes this set of candidates; in the following, we suppose that n is large enough. A binary relation R defined on X is a subset of X × X = {(x, y) : x ∈ X and y ∈ X}. We use the notation xRy instead of (x, y) ∈ R and x R̄ y instead of (x, y) ∉ R. It is customary to represent the preferences R_i (1 ≤ i ≤ m) of the m voters as an ordered collection (or a multiset) Π = (R_1, R_2, …, R_m), called the profile of the m binary relations. Another representation of the preferences of the voters exists. Since these preferences may be the same for two different voters, we may consider only the different relations R_1, R_2, …, R_q arising from the opinions of the voters, where q denotes the number of different opinions. In this way, we combine all the voters sharing the same opinion. Last, if m_i (1 ≤ i ≤ q) denotes the number of voters sharing R_i (1 ≤ i ≤ q) as their preference (notice the equality Σ_{i=1}^{q} m_i = m), we associate this number m_i of occurrences with each type R_i of relations to describe Π.
Then Π can be described as the set of such pairs: Π = {(R_1, m_1), (R_2, m_2), …, (R_q, m_q)}. Such a representation does represent the data more compactly when m is large with respect to n. Usually, the complexity results are the same for the two representations, because their proofs stand even if m is bounded by a polynomial in n, and in this case there is no qualitative difference between the two representations. With respect to the results stated in Sect. “Complexity Results”, there would be a difference only for the L^NP-completeness results of Theorems 5, 14 and 15 (which then should be replaced only by NP-hardness results). So, in the sequel, we shall consider only the first representation. Moreover, we shall assume that the preference of each voter i (1 ≤ i ≤ m) is given by
a binary relation R_i defined on X, and that R_i is described by its characteristic vector (see below), which requires n² bits. So the size of the data set is about mn² bits. We will consider two kinds of elections: we want to elect one candidate, or we want to rank all of them into a partially ordered set (or poset). One of the most common posets is the structure of linear order. Other posets can be defined from the following basic properties:
– Reflexivity: ∀x ∈ X, xRx;
– Irreflexivity: ∀x ∈ X, x R̄ x;
– Antisymmetry: ∀(x, y) ∈ X², (xRy and x ≠ y) ⇒ y R̄ x;
– Asymmetry: ∀(x, y) ∈ X², xRy ⇒ y R̄ x;
– Transitivity: ∀(x, y, z) ∈ X³, (xRy and yRz) ⇒ xRz;
– Completeness: ∀(x, y) ∈ X² with x ≠ y, xRy or (inclusively) yRx.
By combining the above properties, we may define different types of binary relations (see for instance [17,32,74,77]). As a binary relation R defined on X is the same as the directed graph G = (X, R), where (x, y) is an arc (i. e., a directed edge; similarly, a directed cycle will be called a circuit in the sequel) of G if and only if we have xRy, we illustrate these types with graph theoretic examples (for basic references on graph theory, see [12] or [22]). It is often possible to define reflexive or irreflexive versions of the following ordered structures. But, as reflexivity and irreflexivity do not matter for complexity results (see [91]), we give below only one version among these two possibilities. A partial order is an antisymmetric (if reflexive) or asymmetric (if irreflexive) and transitive binary relation (see Fig. 1); O will denote the set of the partial orders defined on X. A linear order is a complete partial order (see Fig. 2); L will denote the set of the linear orders defined on X.
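These properties translate directly into checks on a relation stored as a set of ordered pairs; a small illustrative sketch (the encoding and the function names are ours, not the text's), using the irreflexive versions:

```python
from itertools import product

# A binary relation on X is encoded as a set of ordered pairs (a, b), meaning a R b.

def is_transitive(X, R):
    return all((a, c) in R
               for a, b, c in product(X, repeat=3)
               if (a, b) in R and (b, c) in R)

def is_asymmetric(X, R):
    # Asymmetry also implies irreflexivity.
    return all((b, a) not in R for (a, b) in R)

def is_complete(X, R):
    return all((a, b) in R or (b, a) in R
               for a, b in product(X, repeat=2) if a != b)

def is_linear_order(X, R):
    # A linear order is a complete partial order; in the irreflexive version,
    # a partial order is asymmetric and transitive.
    return is_asymmetric(X, R) and is_transitive(X, R) and is_complete(X, R)

X = {"a", "b", "c", "d"}
partial = {("a", "b"), ("c", "b"), ("c", "d")}      # a and d are not compared
linear = {("a", "b"), ("a", "c"), ("a", "d"),
          ("b", "c"), ("b", "d"), ("c", "d")}        # a > b > c > d
```

On these toy examples, `partial` passes the partial-order checks but fails completeness, while `linear` passes all three and is thus recognized as a linear order.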
If L denotes a linear order defined on X, we will represent L as x_σ(1) > x_σ(2) > … > x_σ(n) for some appropriate permutation σ, with the agreement that the notation x_σ(i) > x_σ(i+1) (for 1 ≤ i < n) means that x_σ(i) is preferred to x_σ(i+1) according to L, the relationship between the other elements of X being implied by transitivity. The element x_σ(1) will be called the winner of L.
Voting Procedures, Complexity of, Figure 1 A partial order
Voting Procedures, Complexity of, Figure 2 A linear order. The partial order of Fig. 1 is not a linear order, for instance because the vertices a and d are not compared
Voting Procedures, Complexity of, Figure 3 A tournament
A tournament is a complete and asymmetric binary relation (see Fig. 3); T will denote the set of the tournaments defined on X; notice that a transitive tournament is a linear order and conversely. As a tournament may contain circuits, tournaments are usually not considered an appropriate structure to represent the collective preference that we seek. Tournaments will be used below to summarize individual preferences in the so-called majority tournament. A preorder is a reflexive and transitive binary relation (see Fig. 4); P will denote the set of the preorders defined on X. A complete preorder is a reflexive, transitive and complete binary relation (see Fig. 5); C will denote the set of the complete preorders defined on X. An acyclic relation (we should rather say without circuit, but acyclic is the usual term) is a relation R of which the associated graph G is without any circuit (see Fig. 6); A will denote the set of the acyclic relations defined on X.

Voting Procedures, Complexity of, Figure 4 A preorder

Voting Procedures, Complexity of, Figure 5 A complete preorder

Voting Procedures, Complexity of, Figure 6 An acyclic relation. Its transitive closure is the partial order of Fig. 1

As stated above, it is possible to obtain other structures by adding or removing reflexivity or irreflexivity from the above definitions. In fact, the distinction between reflexive and irreflexive relations is not relevant from the complexity point of view: the results remain the same. Thus, in the following, we do not take reflexivity or irreflexivity into account. Other types of posets exist, obtained by considering the asymmetric parts of the previous relations (for instance, a weak order is sometimes defined as the asymmetric part of a complete preorder) or by adding extra structural properties (for instance to define interval orders or semiorders from partial orders). We will not consider these posets here; the interested reader is referred to [32] for their definitions and to [91] for some complexity results about them. On the contrary, we will pay attention to generic binary relations without any particular property; the set of the binary relations will be noted R. We may notice several inclusions between these sets, especially the following ones: ∀Z ∈ {A, C, L, O, P, R, T}, L ⊆ Z ⊆ R; in other words, a linear order can be considered as a special case of any one of the other types, and any type is a special case of binary relation. To conclude this section, we summarize the notation for these sets in Table 1.

Common Voting Procedures

In this section, we describe the main voting procedures, i. e., plurality rule (one-round procedure), plurality rule with run-off (two-round procedure), preferential voting procedure (STV), Borda’s procedure, Nanson’s procedure, Baldwin’s procedure, Condorcet’s procedure, the Condorcet–Kemeny problem, the Slater problem, prudent orders (G. Köhler, K.J. Arrow and H. Raynaud), the maximin procedure (P.B. Simpson), the minimax procedure (K.J. Arrow and H. Raynaud), the ranked pairs procedure (T.N. Tideman), Copeland’s procedure, the top cycle solution (J.H. Smith), the uncovered set solution (P.C. Fishburn, N. Miller), the minimal covering set solution (B. Dutta), Banks’s solution, the tournament equilibrium set solution (T. Schwartz), Dodgson’s procedure, Young’s procedure, the approval voting procedure, the majority-choice approval procedure (F. Simmons), and Bucklin’s procedure (see also [27]; for other tournament solutions, see [106] and [138]).

Plurality Rule, Plurality Rule with Run-Off, Preferential Voting Procedure

One of the easiest voting procedures by which to elect one candidate as a winner is the plurality rule (also called one-round procedure or relative majority, or sometimes first-past-the-post, or winner-takes-all, or also majoritarian voting; see [95]). In this procedure, each voter gives one point to his or her favorite candidate (so it is not necessary to know the preferences of the voters on the whole set of candidates). The candidate who gains the maximum number of points is the winner. This procedure belongs to the family of scoring procedures. In such a procedure, a score-vector (s_1, s_2, …, s_n) is fixed independently of the voters, with s_1 ≥ s_2 ≥ … ≥ s_n. For each voter, a candidate x receives s_i points if x is ranked at the ith position by the considered voter. The score of x
Voting Procedures, Complexity of, Table 1 Meaning of the notation A, C, L, O, P, R, T

A: acyclic relation     C: complete preorder     L: linear order
O: partial order        P: preorder              R: binary relation
T: tournament
is the total number of points that x received. The winner is the candidate with the maximum score. If the aim is to rank the candidates, we may also sort them according to decreasing scores and then consider the linear extensions of the complete preorder provided by this sorting. (Another possibility would be to apply the procedure n − 1 times, after having removed the winner of the current iteration.) For the plurality rule, the score-vector is (1, 0, 0, …, 0). There are two rounds in the plurality rule with run-off, also called the two-round (or two-ballot) procedure. The first round is like the plurality rule described above. At the end of this first step, if a candidate has gained at least (m + 1)/2 points (the strict majority), he or she is the winner. Otherwise, the two candidates with the maximum numbers of points remain for a second round and the others are removed from the election. Then the plurality rule is applied again, but only to the remaining two candidates. This method is designed to elect only one winner. If we want to obtain k winners, it is sufficient to apply it k times. Notice that the repetition of a given procedure always makes it possible to elect several winners. A generalization of these procedures consists in performing a given number of rounds. For each round, each voter gives one point to his or her favorite candidate. If, at the end of a round, there is a candidate who has gained at least (m + 1)/2 points, he or she is the winner. Otherwise, the candidates with the lowest numbers of points are eliminated from the competition; the number of candidates eliminated at each round depends on the number of rounds, but is such that only two candidates will remain for the last round. The winner of this last round is the winner of the election. Special cases are the plurality rule with run-off, described above, and the one with at most n − 1 rounds. For this latter case, exactly one candidate is removed at each round.
This variant, in which the candidate who is the least often ranked at the first position is removed, is also known as preferential voting (or preference voting), as single transferable vote (STV), or as instant-runoff voting (IRV). Other variants of these voting procedures may be defined by the successive eliminations of the losers. For instance, the candidate who is most often ranked last is removed, and we iterate this process while there remain at least two candidates.

Borda’s Procedure and Some Variants (Nanson’s and Baldwin’s Procedures)

As related above, Borda considered the plurality rule unsatisfactory because the winner can be contested by a majority of voters who would all agree to choose another candidate instead of the elected winner (the same holds for the plurality rule with run-off, see Example 1 below). Borda suggested another procedure in which the voters rank the n candidates according to their preferences. For each voter, the candidate who is ranked first is given n − 1 points, a candidate ranked second is given n − 2 points, and so on: more generally, a candidate ranked at the ith position is given n − i points. Then, all these points are summed up for each candidate: this sum is the Borda score s_B of the candidate. The candidate with a maximum Borda score is the Borda winner. So, Borda’s procedure is also a scoring procedure, of which the score-vector is (n − 1, n − 2, …, 1, 0). Using Borda’s procedure, one can easily obtain a ranking of all the candidates. A first possibility consists, as for the plurality rule or the plurality rule with run-off, in iterating Borda’s procedure n − 1 times, after having removed the Borda winner of the current iteration. But we may also apply Borda’s procedure only once, and then rank the candidates according to the decreasing values of their Borda scores. This gives a complete preorder. Any linear extension of this complete preorder can be considered as the collective ranking according to Borda’s procedure. Notice that these two possibilities do not necessarily provide the same rankings (see Example 1). Several variants of Borda’s procedure have been studied. For instance, we may apply other score-vectors: instead of n − i points given to the candidate ranked at the ith position, we may choose to assign other values. For instance, given an integer k, we may credit k points to the candidate ranked at the first position, k − 1 points to the second, and so on, through the candidate ranked at the kth position, who gains 1 point, after which the following candidates gain nothing (the shape of the score-vector is (k, k − 1, …, 1, 0, …, 0)).
For k = 1, this system is the plurality rule. For k ≥ n − 1, this system gives the same results as Borda’s procedure. Other systems are based on Borda’s procedure, but with the elimination of some candidates. Nanson’s procedure [140] modifies Borda’s procedure by eliminating the candidates whose Borda scores are below the average Borda score and by repeating the computations of the Borda scores with respect to the remaining candidates after these eliminations, until there remains only one candidate. Another variant of Borda’s procedure is the one suggested by Baldwin in 1926 [11]; as in Nanson’s procedure, candidates are iteratively removed from the election. But, in Baldwin’s procedure, only one candidate is removed at each iteration, the one whose Borda score is the lowest.
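These elimination schemes are easy to implement on a profile of linear orders; a sketch (the function names and the (order, multiplicity) encoding are ours, not the text's):

```python
def borda_scores(candidates, profile):
    """profile: list of (linear order, multiplicity) pairs; orders may mention
    eliminated candidates, which are skipped. Among k remaining candidates,
    the candidate ranked i-th (from 0) gains k - 1 - i points per voter."""
    k = len(candidates)
    scores = {c: 0 for c in candidates}
    for order, mult in profile:
        ranked = [c for c in order if c in candidates]
        for i, c in enumerate(ranked):
            scores[c] += mult * (k - 1 - i)
    return scores

def nanson_winner(candidates, profile):
    """Repeatedly drop every candidate whose Borda score is below the average."""
    remaining = list(candidates)
    while len(remaining) > 1:
        s = borda_scores(remaining, profile)
        avg = sum(s.values()) / len(remaining)
        keep = [c for c in remaining if s[c] >= avg]
        if keep == remaining:        # all scores equal: no further elimination
            break
        remaining = keep
    return remaining

def baldwin_winner(candidates, profile):
    """Repeatedly drop one candidate with the lowest Borda score."""
    remaining = list(candidates)
    while len(remaining) > 1:
        s = borda_scores(remaining, profile)
        remaining.remove(min(remaining, key=lambda c: s[c]))
    return remaining
```

Applied to Example 1 below, `borda_scores` gives 21, 23, 18 and 16 for x, y, z and t, and both `nanson_winner` and `baldwin_winner` return x; ties in the elimination steps, if any, would need an explicit tie-breaking rule.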
Condorcet’s Procedure

Condorcet designed a method based on pairwise comparisons. More precisely, for each candidate x and each candidate y with x ≠ y, we compute the number m_xy, which we will call the pairwise comparison coefficient below, of voters who prefer x to y. Then x is considered as better than y if a majority of voters prefers x to y, i. e., if we have m_xy > m_yx. This defines the (strict) majority relation T: xTy ⇔ m_xy > m_yx. In some cases, there exists a Condorcet winner, i. e., a candidate x defeating any other candidate: ∀y ≠ x, m_xy > m_yx. If there exists a Condorcet winner, then he or she is unique. It may even happen that T is a linear order and allows us to rank all the candidates. Such is the case for the following example (due to B. Monjardet [133]), which illustrates the previous voting procedures.

Example 1 Assume that m = 13 voters must rank n = 4 candidates x, y, z, and t. Suppose the preferences of the voters are given by the following linear orders:
– the preferences of two voters are: x > y > z > t;
– the preference of one voter is: y > z > x > t;
– the preference of one voter is: y > z > t > x;
– the preferences of four voters are: z > y > x > t;
– the preferences of five voters are: t > x > y > z.
According to the plurality rule, t is the winner with 5 points (2 points for x and for y, 4 points for z). According to the plurality rule with run-off, z is the winner (the four voters who voted for x or y prefer z to t). The Borda scores s_B of the candidates are:
– s_B(x) = 2×3 + 5×2 + 5×1 + 1×0 = 21;
– s_B(y) = 2×3 + 6×2 + 5×1 + 0×0 = 23;
– s_B(z) = 4×3 + 2×2 + 2×1 + 5×0 = 18;
– s_B(t) = 5×3 + 0×2 + 1×1 + 7×0 = 16.
So, the winner according to Borda’s procedure is y, and the ranking of the four candidates is y > x > z > t (notice that if, while there are at least two candidates, we apply the variant consisting in removing the winner and reapplying Borda’s procedure, then we obtain the orders y > x > z > t and y > z > x > t as the possible rankings; the distance to the ranking provided by one application of Borda’s procedure may be much larger). In Nanson’s procedure, as the Borda scores of z and t are below the average (which is equal to 19.5), z and t are removed and only x and y remain. Then the Borda scores of x and y for the second step become s_B(x) = 7 and s_B(y) = 6: hence x is the winner according to Nanson’s procedure. In Baldwin’s procedure, t is first removed. The Borda scores of x, y and z become s_B(x) = 14, s_B(y) = 15, s_B(z) = 10. Thus z is removed and the remaining computations are as in Nanson’s procedure: here also x is the winner (but it may happen that Nanson’s procedure and Baldwin’s procedure do not provide the same winners). Let us now compute the pairwise comparison coefficients m_xy necessary to apply Condorcet’s procedure:
– m_xy = 2 + 5 = 7; m_yx = 1 + 1 + 4 = 6;
– m_xz = 2 + 5 = 7; m_zx = 1 + 1 + 4 = 6;
– m_xt = 2 + 1 + 4 = 7; m_tx = 1 + 5 = 6;
– m_yz = 2 + 1 + 1 + 5 = 9; m_zy = 4;
– m_yt = 2 + 1 + 1 + 4 = 8; m_ty = 5;
– m_zt = 2 + 1 + 1 + 4 = 8; m_tz = 5.
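Such coefficients can be tallied mechanically. A sketch (encoding ours) computing the m_xy's, the majority relation and the Condorcet winner for the profile of Example 1:

```python
from itertools import combinations

# Example 1's profile: (linear order, number of voters sharing it).
profile = [
    (("x", "y", "z", "t"), 2),
    (("y", "z", "x", "t"), 1),
    (("y", "z", "t", "x"), 1),
    (("z", "y", "x", "t"), 4),
    (("t", "x", "y", "z"), 5),
]
X = ["x", "y", "z", "t"]

def pairwise_coefficients(X, profile):
    """m[(a, b)] = number of voters who prefer a to b."""
    m = {(a, b): 0 for a in X for b in X if a != b}
    for order, mult in profile:
        pos = {c: i for i, c in enumerate(order)}
        for a, b in combinations(X, 2):
            if pos[a] < pos[b]:
                m[(a, b)] += mult
            else:
                m[(b, a)] += mult
    return m

m = pairwise_coefficients(X, profile)
majority = {(a, b) for (a, b), v in m.items() if v > m[(b, a)]}
# A Condorcet winner defeats every other candidate in the majority relation.
condorcet_winners = [a for a in X
                     if all((a, b) in majority for b in X if b != a)]
```

Here `majority` is exactly the linear order x > y > z > t and `condorcet_winners` contains only x.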
The values at least equal to the strict majority (m + 1)/2 = 7, namely m_xy, m_xz, m_xt, m_yz, m_yt and m_zt, show that the majority relation, here, is a linear order, namely x > y > z > t. By the way, this example shows the importance of the voting procedure, as already noticed by Borda (see [113]): four procedures, four different winners; so all four of our candidates may claim to be the winners of the election. It is often the case that different procedures provide different winners.

Median Orders, Condorcet–Kemeny Problem, Slater Problem

As Condorcet himself discovered, his procedure may lead to a majority relation which is not transitive. The simplest example in this respect is the one with n = 3 candidates x, y and z, and with m = 3 voters whose preferences are respectively x > y > z, y > z > x, z > x > y. It is easy to verify that the majority relation T is defined by xTy, yTz, zTx, hence a lack of transitivity. If there is no tie (which is necessarily the case if m is odd, since the individual preferences of the voters are assumed to be linear orders), then T is a tournament, called the majority tournament. We may or may not weight T if we want to take the intensity of the preferences into account. If T is not weighted, the search for a winner or for a ranking of the candidates leads to the definition of tournament solutions (see [106] for a comprehensive study of these, and [93] for a survey of their complexities, some of which are given below). In the Condorcet–Kemeny problem, T is weighted and the aim is to compute a linear order or, more generally, a poset fitting T “as well as possible”. To specify what “as well as possible” means, we use the symmetric difference distance δ defined, for two binary relations R and S defined on X, by:

δ(R, S) = |{(x, y) ∈ X² : (xRy and x S̄ y) or (x R̄ y and xSy)}|.

This quantity δ(R, S) measures the number of disagreements between R and S. Though it is possible to consider other distances, δ is widely used and is appropriate for many applications. J.-P. Barthélemy [15] shows that δ satisfies a number of naturally desirable properties. J.-P. Barthélemy and B. Monjardet [17] recall that δ(R, S) is the Hamming distance between the characteristic vectors (see below) of R and S and point out the links between δ and the L1-metric or the square of the Euclidean distance between these vectors (see also [129] and [130]). So, for a profile Π = (R_1, R_2, …, R_m) of m relations, we can define the remoteness Δ(Π, R) between a relation R and the profile Π by:

Δ(Π, R) = Σ_{i=1}^{m} δ(R, R_i).
The remoteness Δ(Π, R) measures the total number of disagreements between Π and R. Our aggregation problem can now be seen as a combinatorial optimization problem: given a profile Π, determine a binary relation R minimizing Δ over one of the sets A, C, L, O, P, R, T. Such a relation R will be called a median relation of Π [17]. According to the number m of relations of the profile, and according to the properties assumed for the relations belonging to Π or required from the median relation, we get many combinatorial problems. We note them as follows:

Problems P_m(Y, Z). For Y belonging to {A, C, L, O, P, R, T} and Z also belonging to {A, C, L, O, P, R, T}, for a positive integer m, P_m(Y, Z) denotes the following problem: given a finite set X of n elements and a profile Π of m binary relations all belonging to Y, find a relation R* belonging to Z minimizing Δ over Z: Δ(Π, R*) = Min{Δ(Π, R) for R ∈ Z}.

With this notation, the initial problem, possibly considered by Condorcet and then surely by J.G. Kemeny [102], is P_m(L, L), consisting in aggregating m linear orders into a median linear order. A Condorcet–Kemeny winner, or here simply a Kemeny winner, is the winner of any median linear order (for references about the Condorcet–Kemeny problem, see for example [36,98,131,132,151] and references therein). We may easily state the problems P_m(Y, Z) as 0-1 linear programming problems (see [17,36,89,91,173] for instance) for any profile Π = (R_1, R_2, …, R_m) of m binary relations R_i (1 ≤ i ≤ m) all belonging to Y. To this end, consider the characteristic vectors r^i = (r^i_xy)_{(x,y)∈X²} of the relations R_i (1 ≤ i ≤ m), defined by r^i_xy = 1 if xR_iy and r^i_xy = 0 otherwise, and similarly the characteristic vector r = (r_xy)_{(x,y)∈X²} of any binary relation R. Then, after some computations, we obtain

Δ(Π, R) = C − Σ_{(x,y)∈X²} α_xy r_xy,

where C = Σ_{i=1}^{m} Σ_{(x,y)∈X²} r^i_xy is a constant and α_xy = 2 Σ_{i=1}^{m} r^i_xy − m. Notice that, for the problem P_m(L, L), the sum Σ_{i=1}^{m} r^i_xy is equal, for x ≠ y, to the pairwise comparison coefficient m_xy introduced above. So, because of the equality m = Σ_{i=1}^{m} r^i_xy + Σ_{i=1}^{m} r^i_yx, we get α_xy = m_xy − m_yx: α_xy measures the number of voters who prefer x to y minus the number of voters who prefer y to x. More generally, α_xy is equal to twice the gap between the number of voters who prefer x to y and half the number of voters; it is an integer with the same parity as m. To obtain a 0-1 linear programming statement of P_m(Y, Z), it is then sufficient to express the constraints defining the set Z, which is easy for the posets described above. For instance, the transitivity of R can be expressed by the following inequalities: ∀(x, y, z) ∈ X³, 0 ≤ r_xy + r_yz − r_xz ≤ 1. As stated above, it is also common to represent a preference R defined on X by a graph. The properties of the graph are the properties of R: it can be antisymmetric, complete, transitive, and so on. Similarly, the profile Π = (R_1, R_2, …, R_m) can also be represented by a directed, weighted, complete, symmetric graph G_Π = (X, U_X): its set of vertices is X, and G_Π contains all the possible arcs except the loops, i. e., U_X = X × X − {(x, x) for x ∈ X}. (The loops would be associated with reflexivity; this property, as well as irreflexivity, has no impact on the complexity status of the studied problems, hence this simplification.)
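The identity Δ(Π, R) = C − Σ α_xy r_xy above can be checked numerically. A minimal sketch (names ours), using Condorcet's three-voter cyclic profile as illustration, which compares the direct remoteness with the expression in the α_xy's for every linear order:

```python
from itertools import permutations, product

X = ["x", "y", "z"]
# Condorcet's cyclic profile: three voters, one linear order each.
profile = [("x", "y", "z"), ("y", "z", "x"), ("z", "x", "y")]
m = len(profile)

def characteristic(order):
    """r[(a, b)] = 1 iff a is ranked before b in the linear order."""
    pos = {c: i for i, c in enumerate(order)}
    return {(a, b): int(pos[a] < pos[b])
            for a, b in product(X, X) if a != b}

vecs = [characteristic(o) for o in profile]
C = sum(sum(r.values()) for r in vecs)                         # constant term
alpha = {p: 2 * sum(r[p] for r in vecs) - m for p in vecs[0]}  # = m_ab - m_ba

for order in permutations(X):
    r = characteristic(order)
    direct = sum(sum(abs(r[p] - ri[p]) for p in r) for ri in vecs)  # Δ(Π, R)
    assert direct == C - sum(alpha[p] * r[p] for p in r)
```

For this profile, every linear order has remoteness 8 or 10; the three orders x > y > z, y > z > x and z > x > y are the medians, reflecting the majority cycle.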
The weights of the arcs (x, y) give the intensity of the preference for x over y. The computations above lead us to assign α_xy as the weight of any arc (x, y) of G_Π. With this choice, minimizing Δ(Π, R) is the same, from the graph theoretic point of view, as drawing from G_Π a subset of arcs with a maximum total weight and satisfying the structural properties required from R. Notice that characterizations of the graphs that we can associate with profiles Π have been provided by different authors (see [36,55,56,68,91,117,166]); the construction of the profiles can be done in polynomial time, which allows us to study the complexities of the problems P_m(Y, Z) through their graph theoretic representations G_Π. Example 2 illustrates these considerations.

Example 2 Assume that m = 9 voters must rank n = 4 candidates x, y, z, and t. Assume, also, that the preferences
Voting Procedures, Complexity of, Figure 7 The majority tournament (left) and the graph G_Π (right) associated with the data of Example 2 and weighted by the quantities α_xy = m_xy − m_yx
of the voters are given by the following linear orders:
– the preferences of three voters are: x > y > z > t;
– the preferences of two voters are: y > t > z > x;
– the preference of one voter is: t > z > x > y;
– the preference of one voter is: x > z > y > t;
– the preference of one voter is: t > y > x > z;
– the preference of one voter is: z > t > y > x.
The quantities m_xy involved in Condorcet’s procedure are the following, where, again, the values reaching the strict majority (m + 1)/2 = 5 are those of m_xy, m_xz, m_tx, m_yz, m_yt and m_zt:
– m_xy = 5; m_yx = 4;
– m_xz = 5; m_zx = 4;
– m_xt = 4; m_tx = 5;
– m_yz = 6; m_zy = 3;
– m_yt = 6; m_ty = 3;
– m_zt = 5; m_tz = 4.
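Since there are only 4! = 24 linear orders on four candidates, the median linear order of Example 2 can be found by brute force. A sketch (encoding ours):

```python
from itertools import permutations

# Example 2's profile: (linear order, number of voters sharing it).
profile = [
    (("x", "y", "z", "t"), 3),
    (("y", "t", "z", "x"), 2),
    (("t", "z", "x", "y"), 1),
    (("x", "z", "y", "t"), 1),
    (("t", "y", "x", "z"), 1),
    (("z", "t", "y", "x"), 1),
]
X = ["x", "y", "z", "t"]

def remoteness(order, profile):
    """Total symmetric-difference distance from a candidate linear order to the
    profile; each unordered pair ranked differently contributes 2 disagreements
    (both ordered pairs (a, b) and (b, a) differ)."""
    pos = {c: i for i, c in enumerate(order)}
    total = 0
    for voter, mult in profile:
        vpos = {c: i for i, c in enumerate(voter)}
        diff = sum(1 for i, a in enumerate(X) for b in X[i + 1:]
                   if (pos[a] < pos[b]) != (vpos[a] < vpos[b]))
        total += 2 * mult * diff
    return total

best = min(remoteness(o, profile) for o in permutations(X))
medians = [o for o in permutations(X) if remoteness(o, profile) == best]
```

This returns best == 46 and medians == [('x', 'y', 'z', 't')]: x is the only Kemeny winner, in accordance with the discussion that follows.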
Here, the majority relation is not a linear order, but the tournament of Fig. 7. Figure 7 also displays the graph G_Π summarizing the data. We may observe that the majority tournament is given by the arcs of G_Π with a positive weight. For these data, it is not too difficult to verify that there is only one median linear order, namely x > y > z > t: this order keeps all the positive weights except that of the arc (t, x), whose weight is minimum, while the arcs with positive weights do not define a linear order. Hence, x is the only Kemeny winner.

Attention has also been paid to P_1(T, L), i. e., the approximation of a tournament (which can be the majority tournament of the election) by a linear order at minimum distance. This problem is also known as the Slater problem, since P. Slater explicitly stated it in 1961 ([160]; see also [36] for a survey of this problem). A linear order L* at minimum distance (with respect to the symmetric difference distance) from the tournament T constituting the considered instance of P_1(T, L) is called a Slater order of T: δ(T, L*) = min_{L∈L} δ(T, L). This minimum distance is usually called the Slater index i(T) of T: δ(T, L*) = i(T). A Slater winner of T is the winner of any Slater order of T.

Prudent Orders, Maximin Procedure, Minimax Procedure, Ranked Pairs Procedure

Other procedures are based on the quantities m_xy defined above. Such is the case for the prudent orders proposed by G. Köhler [103] and by K.J. Arrow and H. Raynaud [8]. Given the pairwise comparison coefficients m_xy, let us define the cut relation R_{>t}, for any integer t with −m ≤ t ≤ m, by the following: for x ∈ X and y ∈ X with x ≠ y, x R_{>t} y ⇔ m_xy − m_yx > t. Then define a threshold t_min by: t_min = min{t with −m ≤ t ≤ m and such that R_{>t} contains no circuit}. A prudent order is any linear order which contains R_{>t_min}. A winner according to the procedure of the prudent orders is the winner of any prudent order of the profile being considered. For Example 2, we have t_min = 1; with respect to the graph G_Π of Fig. 7, R_{>t_min} contains only the arcs (y, z) and (y, t). There are eight prudent orders, including x > y > z > t or y > t > z > x, but none of them has z or t as its winner: x and y are the winners according to the prudent order procedure. The maximin procedure, proposed by P.B. Simpson in 1969 [159], is also based on the quantities m_xy. In this
procedure, for each candidate x, we consider the worst performance W(x) in the pairwise comparisons: W(x) = min_{y≠x} m_xy. The winner is any candidate x* with a maximum worst performance: W(x*) = max_{x∈X} W(x). We may also use these quantities to sort the candidates in decreasing order with respect to the quantities W, to obtain a collective preference over the whole set of candidates (as usual, another possibility is to remove the winner and to iterate the process). For instance, applied to Example 2, we obtain W(x) = 4, W(y) = 4, W(z) = 3, W(t) = 3: x and y are the Simpson winners. The minimax procedure, in which the previous roles of “min” and “max” are switched and in which the candidates are sorted in increasing order with respect to the obtained maxima, was proposed by K.J. Arrow and H. Raynaud [8]. For Example 2, the winners would be x, z and t. In 1987, T.N. Tideman introduced the ranked pairs procedure [170]. Here, we first sort the quantities m_xy decreasingly. Then we scan them in this order to build a linear order L for the collective preference. If m_xy is the current quantity that we scan, then we definitely set xLy if such a decision is compatible with the previous ones. We do so until a linear order is completely defined. For instance, for Example 2, we first fix yLz and yLt, since m_yz and m_yt are maxima, and z and t cannot be winners. Then we must choose between adding xLy or xLz or tLx or zLt, since m_xy, m_xz, m_tx and m_zt are equal. We cannot add all of them simultaneously, because of incompatibilities with yLz or yLt, already fixed. If we choose to add xLy, xLz and zLt, we obtain the linear order x > y > z > t, and x is a winner. If we choose to add tLx and zLt, we obtain the linear order y > z > t > x, and y is a winner. Here again, x and y are the winners.

Tournament Solutions

The procedures described in this section apply to tournaments, which can function as the majority tournament of an election.
They deal with unweighted tournaments, but some of them can be extended to weighted tournaments, as in the case of the solution designed by P. Slater (see above). Number of Wins: Copeland's Procedure The procedure designed by A.H. Copeland in 1951 [48] is also based on the mxy's. For any two different candidates x and y, we set: C(x, y) = 1 if mxy > myx, C(x, y) = 0.5 if mxy = myx, C(x, y) = 0 if mxy < myx. The Copeland score C(x) of x is C(x) = Σ_{y≠x} C(x, y). The Copeland winner is any candidate with a maximum Copeland score
(here also, we may rank the candidates according to decreasing Copeland scores). The application of Copeland's procedure to Example 2 gives: C(x) = 2, C(y) = 2, C(z) = 1, C(t) = 1. Here, x and y are the Copeland winners. The Copeland score has a meaning from a graph-theoretic point of view. If we assume that there is no tie, the majority relation is a tournament (the majority tournament). Then the Copeland score of a candidate x is also the out-degree of the vertex associated with x. Copeland's procedure is one of the common tournament solutions (see [106]). It can be extended to weighted tournaments by sorting the vertices x according to the sum Σ_{y≠x} mxy of the pairwise comparison coefficients mxy; this variant leads to the same ranking as Borda's procedure. Top Cycle: Smith's Solution Another tournament solution is the so-called top cycle (also called the Smith set) [161]. Any directed graph G can be decomposed into its strongly connected components. We may then define another graph H derived from G. The vertices of H are associated with the strongly connected components of G. Let x (respectively y) be a vertex of H associated with the strongly connected component Cx (respectively Cy) of G. There will be an arc (x, y) from x to y in H if there is at least one arc in G from Cx to Cy (notice that, in this case, all the arcs between Cx and Cy are directed from Cx towards Cy). Then H does not contain any circuit. Moreover, if G is a tournament, H is a linear order and admits a winner (but H contains only one vertex if G is strongly connected). The top cycle of G is the strongly connected component of G associated with the winner of H. The top cycle solution, when applied to a tournament T, consists of considering all the vertices of the top cycle of T as the winners of T. For instance, for the tournament of Fig. 7, which is strongly connected, the top cycle contains the four vertices x, y, z, and t, which are the four winners according to this solution.
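Both solutions lend themselves to short implementations. The sketch below uses our own helper names: `m[x][y]` is the pairwise coefficient mxy, and `beats[x]` is the set of candidates defeated by x in the majority tournament; the top cycle is computed through the equivalent characterization as the set of vertices from which every other vertex is reachable along the arcs of the tournament:

```python
def copeland_winners(m, candidates):
    """Copeland scores from the pairwise coefficients m[x][y]:
    1 point per pairwise victory, 0.5 per tie; winners maximize C(x)."""
    def pts(x, y):
        if m[x][y] > m[y][x]:
            return 1.0
        return 0.5 if m[x][y] == m[y][x] else 0.0
    C = {x: sum(pts(x, y) for y in candidates if y != x) for x in candidates}
    best = max(C.values())
    return {x for x in candidates if C[x] == best}

def top_cycle(beats, candidates):
    """Top cycle (Smith set) of a tournament: the candidates from
    which every other candidate is reachable in the tournament."""
    def reachable_from(x):
        seen, stack = {x}, [x]
        while stack:
            for v in beats[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen
    return {x for x in candidates if len(reachable_from(x)) == len(candidates)}
```

On coefficients consistent with the figures quoted for Example 2, the Copeland winners are x and y, and the top cycle contains all four candidates, as stated above.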
Uncovered Set: Fishburn's and Miller's Solution A refinement of the top cycle is provided by the set of uncovered vertices. Given a tournament T and two vertices x and y of T, we say that x covers y if (x, y) is an arc of T and if, for any arc (y, z), (x, z) is also an arc of T. (In other words, considering T as a majority tournament, x beats y and any vertex beaten by y is also beaten by x.) A vertex is said to be uncovered if no other vertex covers it. The uncovered set of T is denoted by UC(T). Adopting the elements of UC(T) as the winners of T has been independently suggested by P. Fishburn in 1977 [76] and by N. Miller in 1980 [126].
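The covering relation and the uncovered set translate directly into code. A sketch under the same assumed representation as before (`beats[x]` is the set of candidates defeated by x):

```python
def uncovered_set(beats, candidates):
    """Uncovered set of a tournament: x covers y when x beats y and
    every candidate beaten by y is also beaten by x; the uncovered
    candidates are those covered by nobody."""
    def covers(x, y):
        # x beats y, and beats[y] is a subset of beats[x]
        return y in beats[x] and beats[y] <= beats[x]
    return {y for y in candidates
            if not any(covers(x, y) for x in candidates if x != y)}
```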
For the tournament of Fig. 7, it is easy to see that z is covered by y while x, y and t are uncovered: here, x, y, and t are the three winners according to this solution. The definition of a covered vertex can easily be extended to weighted tournaments, though the possibility of weights equal to 0 makes the situation more difficult than expected (see [35] for details). Minimal Covering Set: Dutta's Solution The uncovered set can also be refined, for example by the minimal covering set proposed by B. Dutta [64]. Consider a tournament T defined on X. We say that a subset Y of X is a covering set of T if we have the following property: ∀x ∈ X \ Y, x ∉ UC(Y ∪ {x}). For instance, UC(T) is a covering set of T. Among the covering sets of T, there exists a minimal element with respect to inclusion; this minimal element is called the minimal covering set MC(T) of T. For the tournament of Fig. 7, the minimal covering set contains x, y, and t, which are the three winners according to this solution, here as for the uncovered set. (But there exist tournaments for which the general inclusion MC(T) ⊆ UC(T) is strict.) Maximal Transitive Subtournaments: Banks's Solution Among other tournament solutions, the one designed by J. Banks in 1985 [13] is of interest. When the tournament T being considered is transitive (i. e., T is a linear order), there exists a unique winner, who is selected as the winner of T by the usual tournament solutions. If that is not the case, we may consider the transitive subtournaments of T which are maximal with respect to inclusion, and then select the winner of each of them as the winners of T. This defines Banks's solution [13]: a Banks winner of T is the winner of any maximal (with respect to inclusion) transitive subtournament of T. If we consider the majority tournament of Example 2 (see Fig.
7), three Banks winners exist: x (because of the maximal transitive subtournament x > y > z), y (because y > z > t) and t (because t > x). Tournament Equilibrium Set: T. Schwartz's Solution The solution that we deal with in this subsection was designed by T. Schwartz [158] and is called the tournament equilibrium set. To define it, we need some extra definitions. Let G be a directed graph. The top set TS(G) of G is defined as the union of the strongly connected components of G with no incoming arcs (if G is a tournament, then TS(G) is equal to the top cycle TC(G)). Let Sol be a tournament solution and T be a tournament. When Sol is applied to T, we define the contestation graph G(Sol, T) associated with Sol and T as follows:
The vertex set of G(Sol, T) is X; there is an arc (x, y) from x to y in G(Sol, T) if and only if (x, y) is an arc of T and x is a winner according to Sol when Sol is applied to the subtournament of T induced by the predecessors of y in T. In other words, the arcs (x, y) of G(Sol, T) describe the following situation: if y is considered as a possible winner, then x will contest the election of y, because x beats y in T and because x is a winner among the candidates who beat y. We may notice that G(Sol, T) is a subgraph of T. By considering the top set TS[G(Sol, T)] of G(Sol, T), we get a new tournament solution. T. Schwartz proved in [158] that there exists a unique tournament solution, which he called the tournament equilibrium set TEQ, that is a fixed point with respect to this process: ∀T, TEQ(T) = TS[G(TEQ, T)]. For the majority tournament of Example 2 (see Fig. 7), the tournament equilibrium set contains x, y and t. Dodgson's Procedure In the procedure proposed in 1876 by C.L. Dodgson (or Lewis Carroll) [60], each voter ranks all the candidates into a linear order. If a Condorcet winner exists, he or she is also the Dodgson winner. Otherwise, Dodgson's procedure consists of choosing as winners all the candidates who are "closest" to being Condorcet winners: for each candidate x ∈ X, let D(x) be the Dodgson score of x, defined as the minimum number of swaps between consecutive candidates in the preferences of the voters such that x becomes a Condorcet winner; a Dodgson winner is any candidate x* minimizing D: D(x*) = min_{x∈X} D(x). Specifically, consider a profile Π = (L1, L2, …, Lm) of m linear orders. Let i be an index between 1 and m and let y and z be two consecutive candidates in Li: y Li z and there is no candidate t with y Li t and t Li z simultaneously. Then define a linear order L′i obtained from Li by swapping y and z in Li: z L′i y and, for any other ordered pair of candidates (t, v) ≠ (y, z), we have t L′i v if and only if we had t Li v.
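The elementary swap just defined, together with a test for a Condorcet winner, can be sketched as follows (our own helper names; a ballot is assumed to be a list of candidates, most preferred first):

```python
def swap_consecutive(order, y, z):
    """Return the linear order L' obtained from L (a list, most
    preferred first) by exchanging y and z, assumed consecutive with
    y just above z; every other pairwise comparison is unchanged."""
    i = order.index(y)
    assert order[i + 1] == z, "y and z must be consecutive, y above z"
    new = list(order)
    new[i], new[i + 1] = z, y
    return new

def condorcet_winner(profile, candidates):
    """Return the Condorcet winner of the profile, or None: the
    candidate preferred to every rival by a strict majority."""
    m = len(profile)
    for x in candidates:
        if all(sum(b.index(x) < b.index(y) for b in profile) > m / 2
               for y in candidates if y != x):
            return x
    return None
```

Repeatedly applying `swap_consecutive` to the ballots of a profile until `condorcet_winner` returns the desired candidate is exactly the process whose minimum number of steps defines the Dodgson score.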
By substituting L′i for Li, we obtain a new profile. By repeating such swaps as many times as necessary, we may generate all the possible profiles of m linear orders. Among them, some admit x as a Condorcet winner. The Dodgson score D(x) of a candidate x is the minimum number of swaps that must be applied from Π in order to obtain a profile with x as a Condorcet winner. Let us illustrate these definitions with Example 2. First, let us compute the Dodgson scores of the four candidates. If we want x to become a Condorcet winner, it is necessary that x defeats t, which is not currently the case. Since x is never just after t in the preferences of the voters, one swap is not enough to reach this aim. On the other hand, two
swaps are enough, for instance by swapping first y and x in the preference of the last voter, and then by swapping x and t in this new preference. Thus D(x) = 2. One swap is sufficient (and necessary) to make y a Condorcet winner, for instance by swapping y and x in the preference of one of the first three voters: D(y) = 1. The computation of D(z) is a little more difficult. To become a Condorcet winner, z must defeat y. Since there are only 3 voters who prefer z to y against 6 who prefer y to z, it is necessary to perform at least 2 swaps so that a new majority of voters will prefer z to y. But such swaps do not help z to defeat x, which requires at least one extra swap. Hence the inequality D(z) ≥ 3. On the other hand, swapping x and z in the preference of the voter whose preference is x > z > y > t and swapping y and z in the preferences of two of the first three voters makes z a Condorcet winner: D(z) = 3. The computation of D(t) is similar to that of D(z), and we also find D(t) = 3. These values show that there is only one Dodgson winner (who is not a Kemeny winner): y. Young's Procedure Like Dodgson's procedure, H.P. Young's procedure [177] is based on altered profiles. But, instead of performing the fewest possible swaps of consecutive candidates in the voters' preferences, we now look for the minimum number of voters that must be removed in order for a Condorcet winner to emerge. More precisely, for each candidate x, we define the Young score Y(x) of x as the minimum number of voters whose simultaneous removal allows x to become a Condorcet winner. Any candidate with a minimum Young score is a Young winner. For Example 2, we may verify that the Young scores of the four candidates are: Y(x) = 2, Y(y) = 2, Y(z) = 4, Y(t) = 4. For instance, for z to become a Condorcet winner, z must defeat both x and y, and defeating y alone already requires at least four removals. But these removals must keep z defeating t.
It is easy to see that removing two of the first three voters, one of the two voters sharing the same preferences, and the voter for whom z is ranked last suffices to make z a Condorcet winner of the new election. Approval Voting Procedure, Majority-Choice Approval Procedure and Variants The approval voting procedure was popularized by Steven J. Brams and Peter C. Fishburn in 1978 [25] and 1983 [26]. But, according to [40,52,111], it was already in use in the 13th century in Venice and in papal elections, and then in elections in England during the 19th century, among other places. Its name seems to come from R.J. Weber in 1976
(see [175]; several other persons seem to have found this procedure independently during the late 1960s and early 1970s). The approval voting procedure consists of giving the voters the possibility of voting for several candidates simultaneously. In other words, instead of naming only their preferred candidate, the voters are invited to answer the question: "who are the candidates for whom you would like to vote?". Each voter then gives one point to each candidate of whom he or she approves. The candidate with the greatest number of points is the winner. Notice that, if each voter actually chooses only one candidate, this voting procedure is the same as the plurality rule. In this respect, the approval voting procedure can be seen as a generalization of the plurality rule. More formally, this procedure assumes that the preferences of the voters are complete preorders with only two classes (one class for the approved candidates and one for the disapproved ones). These preorders must then be aggregated into a collective complete preorder, also with two classes, one of them with only one element (the winner), the other with all the other elements (the losers). The extension to electing several candidates simultaneously is immediate. Variants (sometimes known as range voting, ratings summation, average voting, cardinal ratings, 0–99 voting, the score system or the point system) can be based on the same idea. For instance, each voter has a maximum number of points, and he or she can share them out among the candidates as he or she pleases, with or without a constraint on the maximum number of points per candidate. We can also add several rounds, and thus define new procedures. The majority-choice approval procedure (MCA) designed by F. Simmons in 2002 can also be seen as a variant of approval voting. In this system, each voter has three possibilities for rating each candidate: "favored", "accepted" or "disapproved".
If a candidate is rated "favored" by an absolute majority of the voters, then any candidate marked "favored" by a maximum number of voters is a winner. Otherwise, the winner is any candidate with the largest number of "favored" or "accepted" marks. It is sometimes required that this number be at least (m + 1)/2; otherwise, no one is elected. Ties can be broken using the number of "favored" marks. We may, of course, obtain new variants by increasing the number of levels. Bucklin's Procedure The procedure proposed by James W. Bucklin in the early 20th century is also called the Grand Junction system (Grand Junction is a city in Colorado, where Bucklin's procedure was applied from 1909 to 1922). In this procedure, we first count, for each candidate x, the number of times that x is ranked first, as in the plurality rule. If a candidate obtains at least the absolute majority (m + 1)/2, he or she is the winner. Otherwise, second choices are added to the first choices. Once again, if a candidate obtains the absolute majority (m + 1)/2, he or she becomes a winner. Otherwise, we consider third choices, and so on, until at least one candidate obtains the absolute majority: then he or she becomes a winner. Since after the first round there are more votes than voters, Bucklin's procedure is sometimes considered as a variant of the approval voting procedure, but a main difference is that Bucklin's procedure may require several rounds. It can also be considered as a variant of the scoring procedures, since it is the same as iteratively applying a scoring procedure with successively (1, 0, 0, . . . , 0), (1, 1, 0, 0, . . . , 0), (1, 1, 1, 0, . . . , 0), . . . as the score-vectors, until a candidate obtains the majority. Once again, consider Example 2. In the first round, x gets 4 points; y, 2 points; z, 1 point; t, 2 points. As there is no candidate with at least 5 points, we consider the second choices. Then we obtain: x keeps 4 points; y obtains 6 points; z, 3 points; t, 5 points. As y and t obtain at least 5 points, the process stops here and y and t are the Bucklin winners. A variant of Bucklin's procedure would be to consider, at the last iteration, only the candidates with a maximum number of points as the winners (only y for Example 2). Complexity Results After a reminder of the complexity classes that are useful for our purposes, this section examines the complexity of the voting procedures described in the previous section (see also [72]). Among the tournament solutions which can be applied to the majority tournament, we restrict ourselves to the solutions depicted above; the reader interested in the complexity of the other common tournament solutions will find some answers in [93].
Main Complexity Classes Some complexity classes are well-known: P, NP, co-NP, the sets of polynomial problems, of NP-complete problems, of NP-hard problems. . . Let us recall some other notations (for references on the theory of complexity, see for instance [10,78,82,96]; see also [1] for a list of about 500 complexity classes). Like D.S. Johnson in [96], we will distinguish between decision problems and other types of problems. Let n denote the size of the data. The class LNP, also written PNP[log], PNP[log n], or Θ2^p, contains those decision problems which can be solved by a deterministic Turing machine with an oracle in the following manner: the oracle can solve an appropriate NP-complete problem in unit time, and the number of consultations of the oracle is upper-bounded by log(n). The class PNP, or Δ2^p (or sometimes simply Δ2), also written PNP[n^O(1)] = ⋃_{k>0} PNP[n^k], is defined similarly, but with a polynomial of n instead of log(n). Notice the inclusions: NP ∪ co-NP ⊆ LNP ⊆ PNP. For problems which are not decision problems (i. e., optimization problems in which we look for the optimal value of a given function f, or search problems in which we look for an optimal solution of f, or enumeration problems in which we look for all the optimal solutions of f; these problems are sometimes called function problems), these classes are extended with an "F" in front of their names (see [96]); for instance, we thus obtain the classes FLNP (= FΘ2^p) or FPNP (= FΔ2^p). The usual definition of "completeness" is extended to these classes in a natural way. Though the notations Θ2^p and Δ2^p are sometimes more common in complexity theory (especially when dealing with the polynomial hierarchy), we shall keep the names PNP and LNP (as well as FPNP and FLNP), as perhaps being more informative. Similarly, there exist refinements of P. In particular, the class L denotes the subset of P containing the (decision) problems which can be solved by an algorithm using only logarithmic space (on a deterministic Turing machine), the input itself not being counted as part of the memory. The classes AC0 and TC0 are more technical. The first one, AC0, consists of all the problems solvable by uniform constant-depth Boolean circuits with unbounded fan-in and a polynomial number of gates. The second one, TC0, consists of all the problems solvable by polynomial-size, bounded-depth, unbounded fan-in Boolean circuits augmented by so-called "threshold" gates, i. e., unbounded fan-in gates that output "1" if and only if more than half of their inputs are non-zero (see [96] and the references therein for details). A problem of TC0 is said to be TC0-complete if it is complete under AC0 Turing reductions. Notice the inclusions AC0 ⊆ TC0 ⊆ L ⊆ P. Complexity Results for the Usual Voting Procedures We can now examine the complexity of the voting procedures described above. When dealing with linear orders, we assume that we have direct access to the ordered list of the candidates, especially to their winners (see above). Remember that n denotes the number of candidates and m the number of voters. Theorem 1 The following procedures are polynomial. The plurality rule (one-round procedure) is in O(n + m).
The plurality rule with run-off (two-round procedure) is in O(n + m). The preferential voting procedure (STV) is in O(nm + n²). Borda's procedure is in O(nm). If the preferences of the voters are known through a profile of complete preorders with two classes, the approval voting procedure (and its variant suggested by F. Simmons) is in O(nm). Bucklin's procedure is in O(nm). More generally, the previous polynomial results can usually be extended to procedures based on score-vectors, often with a complexity of O(nm). The previous results are rather easy to obtain. For instance, for the plurality rule, an efficient way to determine the winners consists of computing an n-vector V which provides, for each candidate x, the number of voters whose preferred candidate is x. As the preferred candidate of each voter is assumed to be accessible directly, computing V requires O(n + m) operations (or fewer if we assume that voters sharing similar preferences are gathered, see Subsect. "Definitions, Notation and Partially Ordered Sets Used to Model Preferences"). More precisely, we first initialize V to 0 in O(n). Then, for each voter, we consider his or her preferred candidate x and we increment V(x) by 1 in O(1), thus in O(m) for all the voters. By scanning V in O(n), we determine the maximum value contained in V and, once again by scanning V in O(n), the winners: they are the candidates x whose values V(x) are maximum. The whole process requires O(n + m) operations. When it succeeds in finding an order (or at least a Condorcet winner), Condorcet's procedure is also polynomial. This is also the case for Simpson's procedure, since computing the pairwise comparison coefficients mxy can obviously be done in polynomial time. The same happens for Tideman's ranked pairs procedure, since the detection of a circuit in a graph can be done in a time proportional to the number of arcs of this graph, i.
e., here in O(n²) (see [51]), and for the prudent order procedure or the minimax procedure. Theorem 2 When there exists a Condorcet winner, Condorcet's procedure is polynomial, in O(n²m). The prudent orders procedure, the minimax procedure, Simpson's procedure (maximin procedure), and Tideman's procedure (ranked pairs procedure) are also polynomial. When there is no Condorcet winner, the situation is more difficult to manage. In this case, we may pay attention to the problems called Pm(Y, Z) above. The problems
Pm(Y, T) and Pm(Y, R), i. e., the aggregation of m preferences into a tournament or into a binary relation on which no special property is required, are polynomial for any m and any set Y, as specified by Theorem 3. Theorem 3 Let m be any integer with m ≥ 1 and let Y be any subset of R. Consider a profile Π ∈ Y^m. Then Pm(Y, R) and Pm(Y, T) are polynomial. For the Condorcet–Kemeny problem, and more generally for the problems Pm(Y, Z) with Y belonging to {A, C, L, O, P, R, T} and Z to {A, C, L, O, P} and where m is a positive integer, most are NP-hard when m is large enough (see [5,16,19,34,41,65,85,89,91,92,94,173,174]; see also [91] for the NP-hardness of similar problems Pm(Y, Z) when extended to other posets, including interval orders, interval relations, semiorders, weak orders, quasi-orders). Theorem 4 For m large enough, we have: For any set Y containing L (this is the case in particular for Y belonging to {A, C, L, O, P, R, T} or to unions or intersections of such sets), Pm(Y, Z) is NP-hard and the decision problem associated with Pm(Y, Z) is NP-complete for Z ∈ {A, C, L}. Pm(R, Z) is also NP-hard and the decision problem associated with Pm(R, Z) is also NP-complete for Z ∈ {O, P}. The complexity of Pm(Y, Z) is unknown for Y ∈ {A, C, L, O, P, T} and for Z ∈ {O, P}, but Pm(Y, O) and Pm(Y, P) have the same complexity. The minimum value of m for which Pm(Y, Z) is NP-hard in Theorem 4 is usually unknown and, moreover, the parity of m plays a role. Table 2 gives the ranges of lower bounds of m from which Pm(Y, Z) is known to be NP-hard; for lower values of m, when not trivial, the complexity of Pm(Y, Z) is usually unknown. (This is the case, for instance, for P1(R, C), P1(R, O) or P1(R, P); notice that P2(L, L) is polynomial.) A question mark (?) means that the complexity of the problem is still unknown.
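Since no polynomial algorithm is known in general, a Kemeny order can, for very small n, be found by enumerating all n! linear orders; the remoteness is the sum of the symmetric difference distances to the individual preferences, i. e., twice the number of pairwise disagreements summed over the voters. A sketch under our own naming conventions (not notation from the text):

```python
from itertools import combinations, permutations

def sym_diff_distance(L1, L2):
    """Symmetric difference distance between two linear orders given
    as lists (most preferred first): twice the number of pairs of
    candidates on which the two orders disagree."""
    pos2 = {c: i for i, c in enumerate(L2)}
    return 2 * sum(1 for a, b in combinations(L1, 2) if pos2[a] > pos2[b])

def kemeny_orders(profile, candidates):
    """Exhaustive search for the Kemeny orders: enumerate the n!
    linear orders and keep those minimizing the total distance to the
    profile. Exponential in n, in line with the NP-hardness results."""
    best, argbest = None, []
    for perm in permutations(candidates):
        order = list(perm)
        r = sum(sym_diff_distance(order, ballot) for ballot in profile)
        if best is None or r < best:
            best, argbest = r, [order]
        elif r == best:
            argbest.append(order)
    return best, argbest
```

The value returned as `best` is the Kemeny index of the profile; the first element of any order in `argbest` is a Kemeny winner.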
So, the Condorcet–Kemeny problem Pm(L, L) is NP-hard for m odd and large enough or for m even with m ≥ 4 (and, as noted above, is polynomial for m = 2). More specific results deal with this problem Pm(L, L) (see [19,85,92]). To state them, let us define, for each element x of X and for any given profile Π, the Kemeny score K(x) of x (with respect to Π) as the minimum remoteness between Π and the linear orders Lx having x as their winner. The Kemeny index K(Π) of Π is the minimum remoteness between Π and any linear order; thus it
is the minimum that we look for in Pm(L, L); it is also the minimum of the Kemeny scores over X.

Voting Procedures, Complexity of, Table 2 Lower bounds of m from which Pm(Y, Z) is known to be NP-hard

Median relation (Z)      | Y = L           | Y = T           | Y = R
                         | m odd | m even  | m odd | m even  | m odd | m even
Acyclic relation (A)     | Θ(n²) | 4       | 1     | 2       | 1     | 2
Complete preorder (C)    | Θ(n⁴) | Θ(n²)   | Θ(n³) | Θ(n)    | 3     | 2
Linear order (L)         | Θ(n²) | 4       | 1     | 2       | 1     | 2
Partial order (O)        | ?     | ?       | ?     | ?       | 3     | 2
Preorder (P)             | ?     | ?       | ?     | ?       | 3     | 2

Theorem 5 Let Π be a profile of m linear orders with m large enough. 1. The following problems are NP-complete: Given Π, a candidate x ∈ X and an integer k, is K(x) lower than or equal to k? Given an integer k, is K(Π) lower than or equal to k? 2. The following problems are NP-hard: Given Π and two candidates x ∈ X and y ∈ X with x ≠ y, is K(x) lower than or equal to K(y) (in other words, is x "better" than y)? Moreover, if we assume that the profiles Π are given by the ordered list of the m preferences of the voters (and not by a set of different linear orders with their multiplicities, see Subsect. "Definitions, Notation and Partially Ordered Sets Used to Model Preferences"), this problem is LNP-complete. Given Π and a candidate x ∈ X, is x a Kemeny winner? Moreover, if we assume that the profiles Π are given by the ordered list of the m preferences of the voters, this problem is LNP-complete. Given Π and a candidate x ∈ X, is x the unique Kemeny winner? Moreover, if we assume that the profiles Π are given by the ordered list of the m preferences of the voters, this problem is LNP-complete. Given Π, determine a Kemeny winner of Π. Moreover, this problem belongs to FPNP. Given Π, determine all the Kemeny winners of Π. Moreover, this problem belongs to FPNP. Given Π, determine a Kemeny order of Π. Moreover, this problem belongs to FPNP. 3. The following problem belongs to the class co-NP: Given Π and a linear order L defined on X, is L a Kemeny order of Π? Attention has also been paid to P1(T, L), i. e., the approximation of a tournament (which can be the majority tournament of the election) by a linear order at minimum distance (Slater problem [160]). The complexity of the Slater problem derives from a recent result dealing with the so-called feedback arc set problem, which is known to be NP-complete for general graphs from work by R. Karp [100]. This problem consists of removing a minimum number of arcs from a directed graph in order to obtain a graph without any circuit. Recent results [5,34,41] show that this problem remains NP-complete even when restricted to tournaments. For a tournament T, removing a minimum number of arcs to obtain a graph without any circuit is the same as reversing a minimum number of arcs to make T transitive, i. e., a linear order (see [92]). From this, we may prove the following theorem (see [92] for details): Theorem 6 For any tournament T, we have the following results: The computation of the Slater index i(T) of T is NP-hard; this problem belongs to the class FLNP; the associated decision problem is NP-complete. The computation of a Slater winner of T is NP-hard; this problem belongs to the class FPNP. Checking that a given vertex is a Slater winner of T is NP-hard; this problem belongs to the class LNP. The computation of a Slater order of T is NP-hard; this problem belongs to the class FPNP. The computation of all the Slater winners of T is NP-hard; this problem belongs to the class FPNP. The computation of all the Slater orders of T is NP-hard. Checking that a given order is a Slater order is a problem which belongs to the class co-NP. Some of the previous results may be generalized to other definitions of the remoteness, for instance if the sum of the symmetric difference distances to the individual preferences is replaced by their maximum, or by the sum of their squares, or by the sum of any of their positive powers, as specified below (see [94]):
Theorem 7 Let Φ denote any remoteness defined for any profile Π such that, when m is equal to 1 with Π = (R1), the minimization of Φ(Π, R) over the relations R belonging to A yields the same optimal solutions as the minimization of δ(R1, R) over the same set. Then, for any fixed m with m ≥ 1, the aggregation of m binary relations or m tournaments obtained by minimizing Φ (the problems similar to Pm(R, Z) and Pm(T, Z) for Z ∈ {A, L} but with respect to Φ) is NP-hard, and the associated decision problems are NP-complete. An important case exists for which Pm(L, L) becomes polynomial for any m: the one for which the voters' preferences are single-peaked linear orders. To define them, assume that we can order the candidates on a line, from left to right, and that this linear order Ω does not depend on the voters (from a practical point of view, Ω is not always easy to define, even for political elections). For any voter α, let xα denote the candidate preferred by α. The preference of α is said to be Ω-single-peaked if, for any candidates y and z with xα ≠ y ≠ z ≠ xα and located on the same side of xα along Ω, y is preferred to z by α if and only if y is closer to xα than z with respect to Ω. Let UΩ denote the set of Ω-single-peaked linear orders. D. Black [23] showed that, for any order Ω, the aggregation of Ω-single-peaked linear orders is an Ω-single-peaked linear order (see also [42] for another study dealing with single-peaked preferences). Hence the polynomiality of the aggregation of Ω-single-peaked linear orders into an Ω-single-peaked linear order: Theorem 8 For any linear order Ω defined on X, any positive integer m, and any set Z containing UΩ as a subset (this is the case for the sets A, C, L, O, P, R, T), Pm(UΩ, Z) is polynomial. More precisely, Pm(UΩ, Z) can be solved in O(n²m). The NP-hardness of Slater's problems shows that tournament solutions can be difficult to compute. This obviously depends on the solutions considered. For instance, Copeland's procedure is polynomial, as specified by the next theorem. Theorem 9 Copeland's procedure is polynomial, in O(n²m).
If we assume that the tournament related to Copeland’s procedure is already computed, and if we do not take the memory-space necessary to code this tournament into account, it is easy to see that the memory-space necessary to compute the maximum of the Copeland scores and then to decide whether a given vertex is a Copeland winner can be bounded by a constant. So, deciding whether a given vertex is a Copeland winner is a problem belonging to class L.
In fact, a stronger result is shown by F. Brandt, F. Fischer and P. Harrenstein in [29]: Theorem 10 Checking that a given vertex is a Copeland winner is a TC0-complete problem. Similar results can be stated for Smith's tournament solution (top cycle): Theorem 11 Let T be a tournament. The computation of the Smith winners of T (the elements of the top cycle of T) can be done in O(n²) (or even in linear time with respect to the cardinality of the top cycle if the sorted scores are known). Checking that a given vertex is a Smith winner is a TC0-complete problem [29]. For the other tournament solutions depicted above, we have the results stated in Theorem 12 (see [93] for details and for results on other tournament solutions). This theorem shows that checking whether a given vertex of a given tournament is a Banks winner is NP-complete [176] (see [30] for an alternative proof), while computing a Banks winner is polynomial [90]. Of course, when such a Banks winner is computed, we do not choose which winner we compute among the set of Banks winners (and so there is no contradiction between the two results if P and NP are different). The polynomiality of the minimal covering set solution (through n resolutions of a linear programming problem, which can be done in polynomial time using L. Khachiyan's algorithm [99]) and the NP-hardness of the tournament equilibrium set solution are new results respectively due to F. Brandt and F. Fischer [28] and to F. Brandt, F. Fischer, P. Harrenstein and M. Mair [30]. Theorem 12 Let T be a tournament. Computing the uncovered elements of T can be done within the same complexity as the multiplication of two (n × n)-matrices, and so can be done in O(n^2.38) operations. Computing the elements of the minimal covering set of T can be done in polynomial time with respect to n. The following problem is NP-complete: given a tournament T and a vertex x of T, is x a Banks winner of T?
Computing a Banks winner is polynomial, and more precisely can be done in O(n²) operations. Computing all the Banks winners of T is an NP-hard problem; more precisely, it is a problem belonging to the class FP^NP. The following decision problem is NP-hard (but is not known to be inside NP): given a vertex x of T, does x belong to TEQ(T)?
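The polynomial-time computability of some Banks winner rests on a classical greedy argument: scan the vertices once, extending a transitive chain by any vertex that beats every vertex chosen so far; when the scan ends, no vertex beats the whole chain, so the last vertex added is the top of a maximal transitive subtournament, hence a Banks winner. A sketch under an assumed 0/1 `beats` matrix encoding (illustrative, not the article’s notation):

```python
def some_banks_winner(beats):
    """Return one Banks winner of the tournament `beats` in O(n^2) time."""
    chain = []                        # transitive chain; each new vertex beats all previous ones
    for v in range(len(beats)):
        if all(beats[v][u] for u in chain):
            chain.append(v)           # v beats everything chosen so far: new top
    return chain[-1]                  # no remaining vertex beats the whole chain
```

Note that which Banks winner is returned depends on the scan order, in line with the remark above that we cannot choose which winner the polynomial algorithm produces.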
Voting Procedures, Complexity of
J.J. Bartholdi III, C.A. Tovey and M.A. Trick stated in [19] that Dodgson’s procedure is NP-hard. More precisely, they proved the NP-completeness of the problem of Theorem 13 and the NP-hardness of the first two problems of Theorem 14. E. Hemaspaandra, L. Hemaspaandra and J. Rothe [84] sharpened their results (Theorem 14), assuming that Π is given by the ordered list of the m preferences of the voters (and not by a set of different linear orders with their multiplicities).

Theorem 13 The following problem is NP-complete: given a profile Π of linear orders defined on X, a candidate x ∈ X and an integer k, is the Dodgson score D(x) of x less than or equal to k?

Theorem 14 The following problems are NP-hard, and more precisely L^NP-complete if the considered profiles Π are given by the ordered lists of the m preferences of the voters.
Given a profile Π of linear orders defined on X and a candidate x ∈ X, is x a Dodgson winner?
Given a profile Π of linear orders defined on X and a candidate x ∈ X, is x the unique Dodgson winner?
Given a profile Π of linear orders defined on X and two candidates x ∈ X and y ∈ X with x ≠ y, is D(x) less than or equal to D(y) (in other words, is x “better” than y)?

Notice that a variant of Dodgson’s procedure, called the homogeneous Dodgson procedure (see [76] and [177]), has been shown to be polynomial by J. Rothe, H. Spakowski and J. Vogel [153]. Consider that each voter is replicated p times, each copy having the same preference. A procedure is said to be homogeneous if such a replication does not change the winners. Dodgson’s procedure and Young’s procedure are not homogeneous [76]. Then, instead of the Dodgson score defined above, we may consider, for each candidate x, the limit, as p tends to infinity, of the ratio between, on the one hand, the Dodgson score of x after the replication of each voter p times and, on the other hand, p. J. Rothe, H. Spakowski and J. Vogel provide in [153] a linear program computing such a limit for any candidate; hence the polynomiality of the homogeneous version of Dodgson’s procedure. They give, in the same paper [153], the complexity of Young’s procedure.

Theorem 15 The following problems are NP-hard, and more precisely L^NP-complete.
Given a profile Π of linear orders defined on X and a candidate x ∈ X, is x a Young winner?
Given a profile Π of linear orders defined on X and two candidates x ∈ X and y ∈ X with x ≠ y, is Y(x)
less than or equal to Y(y) (in other words, is x “better” than y)?

Notice that all these problems become polynomial if we assume that the number of candidates is upper-bounded by a constant.

Further Directions

Complexity is a prominent feature of voting procedures. NP-hardness may entail a prohibitive CPU time for solving a given instance if the number of candidates is not small. In this respect, polynomial procedures can be preferable to NP-hard ones. In practice, however, the number of candidates is not always large, and the instances considered may remain tractable. This also depends on the efficiency of the applied algorithms (see [17,35,36,37,41,88,89,98,108,151,152,173] for references on algorithms designed to solve some of the previous problems). Moreover, complexity is stated here in the worst case; a study of the average complexity (for appropriate probability distributions) remains to be done.

Other directions can be investigated. For instance, the parameterized complexity (see [62] for a global presentation of this field) of the feedback arc set problem applied to tournaments has also been studied. V. Raman and S. Saurabh [148] (see also [61], where the case of bipartite tournaments is also considered) show that the feedback arc set problem for weighted or unweighted tournaments is fixed-parameter tractable (FPT) by providing appropriate algorithms. (Remember that a problem is said to be FPT if there exist a function f and a polynomial Q such that any instance (I, k), where I is an instance of the non-parameterized version of the problem and k is the considered complexity parameter, can be solved within a CPU time upper-bounded by f(k)·Q(|I|), where |I| denotes the size of I. In particular, if k is upper-bounded by a constant, then the problem becomes polynomial.) V. Raman and S. Saurabh give several algorithms to solve the parameterized version of the feedback arc set problem applied to tournaments.
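The FPT notion can be illustrated on this very problem with a simple (far from best-possible) branching algorithm: a tournament is acyclic if and only if it contains no directed 3-cycle, and any set of arc reversals making it acyclic must reverse an arc of every directed 3-cycle. Branching on the three arcs of some 3-cycle therefore gives a 3^k · poly(n) sketch; the encoding and the algorithm below are illustrative only, not the algorithms of [148]:

```python
def find_triangle(beats, n):
    """Return a directed 3-cycle (a, b, c) of the tournament, or None."""
    for a in range(n):
        for b in range(n):
            if beats[a][b]:
                for c in range(n):
                    if beats[b][c] and beats[c][a]:
                        return (a, b, c)
    return None

def fas_at_most_k(beats, n, k):
    """Can at most k arc reversals make the tournament `beats` acyclic?"""
    tri = find_triangle(beats, n)
    if tri is None:
        return True                        # already acyclic
    if k == 0:
        return False
    a, b, c = tri
    for u, v in ((a, b), (b, c), (c, a)):  # some arc of the 3-cycle must be reversed
        beats[u][v], beats[v][u] = 0, 1    # reverse arc u -> v
        if fas_at_most_k(beats, n, k - 1):
            beats[u][v], beats[v][u] = 1, 0
            return True
        beats[u][v], beats[v][u] = 1, 0    # undo and try the next arc
    return False
```

The running time f(k)·Q(|I|) here is f(k) = 3^k times a polynomial search for a triangle, matching the FPT definition recalled above.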
The best complexity of their algorithms for this problem is O(2.415^k · n^ω), where ω denotes the exponent of matrix multiplication (for instance, in the method designed by D. Coppersmith and S. Winograd [49], ω is about 2.376). Parameterized complexity appears also in [39,115,116] for Kemeny’s, Dodgson’s and Young’s procedures.

Another direction deals with algorithms with approximation guarantees (see [10] or [172] for a presentation of this field). For instance, while the feedback arc set problem is APX-hard in general [69] (remember that, from a practical point of view, this implies that no polynomial-time approximation scheme – PTAS – is known for this problem), N. Ailon, M. Charikar and A. Newman [3] designed randomized 3-approximation and 2.5-approximation algorithms for the feedback arc set problem when restricted to unweighted tournaments. The 3-approximation algorithm has been derandomized by A. van Zuylen [171], still for unweighted tournaments and with an approximation ratio equal to 3. These methods can be adapted to weighted tournaments with an approximation ratio equal to 5 (see [3] and [50]). N. Alon [5] (see also [2]) shows that, for any fixed ε > 0, it is NP-hard to approximate the minimum size of a feedback arc set for a tournament on n vertices up to an additive error of n^(2−ε) (but approximating it up to an additive error of εn² can be done in polynomial time). Ranking the vertices of a tournament according to their out-degrees (Copeland’s procedure) for unweighted tournaments, or according to the sum of the weights of the arcs leaving them minus the sum of the weights of the arcs entering them (Borda’s procedure) for weighted tournaments, also provides a method with an approximation ratio equal to 5, irrespective of how ties are broken (see [50] and [170]). Probabilistic algorithms (see [6] for a global presentation of these methods) have also been applied to the feedback arc set problem (or rather to the search for a maximum subdigraph without circuits, which is the same problem for tournaments), for unweighted [53,57,144,145,163,164,165] or weighted [53] tournaments.

If NP-hardness is usually considered as a drawback (because the CPU time necessary to solve the instances quickly becomes prohibitive, because the algorithms may be difficult to explain to the voters, and so on), it can also be an asset with respect to manipulation, bribery or other attempts to control the election (by adding or deleting candidates or voters, etc.).
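The out-degree ranking just mentioned is straightforward to implement. The sketch below (with an illustrative 0/1 `beats` encoding, not taken from the article) produces the Copeland ranking of an unweighted tournament and counts the arcs disagreeing with it, i.e. the quantity that the feedback arc set problem minimizes:

```python
def copeland_ranking_cost(beats):
    """Rank vertices by decreasing out-degree; return (ranking, number of backward arcs)."""
    n = len(beats)
    order = sorted(range(n), key=lambda v: -sum(beats[v]))  # ties broken arbitrarily
    pos = {v: i for i, v in enumerate(order)}
    backward = sum(1 for u in range(n) for v in range(n)
                   if beats[u][v] and pos[u] > pos[v])      # arcs pointing against the ranking
    return order, backward
```

By the result of [50] and [170] cited above, the number of backward arcs of this ranking is at most 5 times the minimum size of a feedback arc set, irrespective of how the ties among equal out-degrees are broken.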
This is an emerging but already flourishing topic, which is becoming a subject in its own right (see [18,20,21,33,43,44,45,46,47,66,71,72,73,79,83,86,87,108,114,127,135,137,146,147,155,157,162,169]). A voting procedure is said to be manipulable when a voter who knows how the other voters vote has the opportunity to benefit from strategic or tactical voting, i.e., when the voter supports a candidate other than his or her sincerely preferred one in order to prevent an undesirable outcome (the main difference between manipulation and bribery is that, for manipulation, the manipulators are known as a part of the instance, while, for bribery, the number of corrupted voters is bounded but these manipulators are not given). It is known from the theorem by A. Gibbard [79] and M. Satterthwaite [157] that, with at least three candidates, any voting procedure without a dictator
is manipulable. Because of this result, manipulation cannot be precluded in any reasonable voting procedure with at least three candidates. As suggested in [18] and in [20] (see also [72]), the situation is less severe if we adopt a procedure for which manipulation is NP-hard. Some procedures are easy to manipulate: this is the case, for instance, for the plurality rule, Borda’s procedure, the maximin procedure or Copeland’s procedure [20] (see also [72] for variants). Some others are NP-hard to manipulate: this is the case for some variants of procedures based on score-vectors [44,47,83], or for other procedures [71,87]. But we must remember that NP-hardness refers to the complexity in the worst case; V. Conitzer and T. Sandholm show in [46] that, under certain assumptions, no voting procedure is hard to manipulate on average. Among further related issues, we can also mention complexity issues in voting on combinatorial domains (see for instance [104]), or the complexity of determining whether the outcome of a vote is already decided when some voters have not yet been elicited (see for instance [43]).

Computational complexity has become an important criterion in evaluating voting procedures, even if real elections are not necessarily the most difficult instances (for example because the number of candidates can be limited). Voting methods are increasingly used in multiagent systems. When autonomous software agents vote over all sorts of issues, large numbers of candidates may be more likely, and the agents themselves may be more likely to manipulate. Thus, complexity is now a property which must be taken into account along with axiomatic properties. In spite of Arrow’s impossibility theorem, or maybe because of it, it is still necessary to design new voting procedures, to study them and to compare them to existing procedures. In the mathematical toolbox available to do so, there is now a place for computational complexity.
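For intuition about the manipulation problems discussed above, the constructive manipulation question of [20] (given the other voters’ sincere ballots, can a single strategic voter make a distinguished candidate p win?) can always be decided by brute force over the m! possible ballots when the number m of candidates is small. For Borda’s procedure a polynomial greedy manipulation algorithm is known [20], so the exhaustive search below is purely illustrative; the encoding (ballots as lists of candidates from best to worst) is an assumption of this sketch:

```python
from itertools import permutations

def borda_scores(ballots, m):
    """Borda scores for m candidates: a candidate ranked r-th earns m - 1 - r points."""
    scores = [0] * m
    for ballot in ballots:                 # ballot lists candidates from best to worst
        for rank, cand in enumerate(ballot):
            scores[cand] += m - 1 - rank
    return scores

def manipulable_for(p, others, m):
    """Can one extra voter cast a ballot making p a (co-)winner under Borda?"""
    for ballot in permutations(range(m)):
        scores = borda_scores(others + [list(ballot)], m)
        if scores[p] == max(scores):       # p (co-)wins under this strategic ballot
            return True
    return False
```

The m! factor in this brute force is exactly what hardness-of-manipulation results aim to make unavoidable in the worst case.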
Acknowledgments

I would like to thank Ulle Endriss, Jérôme Lang and Bernard Monjardet for their help. Their comments were very useful to improve the text.

Bibliography

Primary Literature

1. Aaronson S, Kuperberg G (2008) Complexity Zoo. http://qwiki.caltech.edu/wiki/Complexity_Zoo
2. Ailon N, Alon N (2007) Hardness of fully dense problems. Inf Comput 205:1117–1129
3. Ailon N, Charikar M, Newman A (2005) Aggregating inconsistent information: ranking and clustering. Proceedings of the 37th annual ACM symposium on Theory of computing (STOC), pp 684–693
4. Aizerman MA, Aleskerov FT (1995) Theory of choice. North-Holland, Elsevier, Amsterdam
5. Alon N (2006) Ranking tournaments. SIAM J Discret Math 20(1):137–142
6. Alon N, Spencer J (2000) The probabilistic method, 2nd edn. Wiley, New York
7. Arrow KJ (1963) Social choice and individual values, rev edn. Wiley, New York
8. Arrow KJ, Raynaud H (1986) Social choice and multicriterion decision-making. MIT Press, Cambridge
9. Arrow KJ, Sen AK, Suzumura K (eds) (2002) Handbook of social choice and welfare, vol 1. North-Holland, Amsterdam
10. Ausiello G, Crescenzi P, Gambosi G, Kann V, Marchetti-Spaccamela A, Protasi M (2003) Complexity and approximation, 2nd edn. Springer, Berlin
11. Baldwin JM (1926) The technique of the Nanson preferential majority system of election. Proc Royal Soc Victoria 39:42–52
12. Bang-Jensen J, Gutin G (2001) Digraphs: theory, algorithms, and applications. Springer, Berlin
13. Banks J (1985) Sophisticated voting outcomes and agenda control. Soc Choice Welf 2:295–306
14. Barnett WA, Moulin H, Salles M, Schofield NJ (eds) (1995) Social choice, welfare and ethics. Cambridge University Press, New York
15. Barthélemy J-P (1979) Caractérisations axiomatiques de la distance de la différence symétrique entre des relations binaires. Math Sci Hum 67:85–113
16. Barthélemy J-P, Guénoche A, Hudry O (1989) Median linear orders: heuristics and a branch and bound algorithm. Eur J Oper Res 41:313–325
17. Barthélemy J-P, Monjardet B (1981) The median procedure in cluster analysis and social choice theory. Math Soc Sci 1:235–267
18. Bartholdi III JJ, Orlin J (1991) Single transferable vote resists strategic voting. Soc Choice Welf 8(4):341–354
19. Bartholdi III JJ, Tovey CA, Trick MA (1989) Voting schemes for which it can be difficult to tell who won the election. Soc Choice Welf 6:157–165
20. Bartholdi III JJ, Tovey CA, Trick MA (1989) The computational difficulty of manipulating an election. Soc Choice Welf 6:227–241
21. Bartholdi III JJ, Tovey CA, Trick MA (1992) How hard is it to control an election? Math Comput Model 16(8/9):27–40
22. Berge C (1985) Graphs. North-Holland, Amsterdam
23. Black D (1958) The theory of committees and elections. Cambridge University Press, Cambridge
24. Borda J-C (1784) Mémoire sur les élections au scrutin. Histoire de l’Académie Royale des Sciences pour 1781, Paris, pp 657–665. English translation: de Grazia A (1953) Mathematical derivation of an election system. Isis 44:42–51
25. Brams SJ, Fishburn PC (1978) Approval voting. Am Political Sci Rev 72(3):831–857
26. Brams SJ, Fishburn PC (1983) Approval voting. Birkhäuser, Boston
27. Brams SJ, Fishburn PC (2002) Voting procedures. In: Arrow K, Sen A, Suzumura K (eds) Handbook of social choice and welfare, vol 1. Elsevier, Amsterdam, pp 175–236
28. Brandt F, Fischer F (2008) Computing the minimal covering set. Math Soc Sci 58(2):254–268
29. Brandt F, Fischer F, Harrenstein P (2006) The computational complexity of choice sets. In: Endriss U, Lang J (eds) Proceedings of the conference Computational Social Choice 2006. University of Amsterdam, The Netherlands, pp 63–76
30. Brandt F, Fischer F, Harrenstein P, Mair M (2008) A computational analysis of the tournament equilibrium set. In: Fox D, Gomes CP (eds) Proc. of AAAI, pp 38–43
31. Caritat MJAN, marquis de Condorcet (1785) Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix. Imprimerie Royale, Paris
32. Caspard N, Monjardet B, Leclerc B (2007) Ensembles ordonnés finis: concepts, résultats et usages. Springer, Berlin
33. Chamberlin JR (1985) An investigation into the effective manipulability of four voting systems. Behav Sci 30:195–203
34. Charbit P, Thomassé S, Yeo A (2007) The minimum feedback arc set problem is NP-hard for tournaments. Comb Probab Comput 16(1):1–4
35. Charon I, Hudry O (2006) A branch and bound algorithm to solve the linear ordering problem for weighted tournaments. Discret Appl Math 154:2097–2116
36. Charon I, Hudry O (2007) A survey on the linear ordering problem for weighted or unweighted tournaments. 4OR 5(1):5–60
37. Charon I, Guénoche A, Hudry O, Woirgard F (1997) New results on the computation of median orders. Discret Math 165–166:139–154
38. Chevaleyre Y, Endriss U, Lang J, Maudet N (2007) A short introduction to computational social choice. Proceedings of the 33rd Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM-2007). Lecture Notes in Computer Science, vol 4362. Springer, Berlin, pp 51–69
39. Christian R, Fellows M, Rosamond F, Slinko A (2006) On complexity of lobbying in multiple referenda. Proceedings of the First International Workshop on Computational Social Choice (COMSOC 2006). University of Amsterdam, pp 87–96
40. Colomer JM, McLean I (1998) Electing popes: approval balloting and qualified-majority rule. J Interdisciplinary History 29(1):1–22
41. Conitzer V (2006) Computing Slater rankings using similarities among candidates. In: Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-06), Boston, pp 613–619
42. Conitzer V (2007) Eliciting single-peaked preferences using comparison queries. Proceedings of the 6th International Joint Conference on Autonomous Agents and Multi Agent Systems (AAMAS-07), Honolulu, pp 408–415
43. Conitzer V, Sandholm T (2002) Vote elicitation: complexity and strategy-proofness. Proceedings of the National Conference on Artificial Intelligence (AAAI), pp 392–397
44. Conitzer V, Sandholm T (2002) Complexity of manipulating elections with few candidates. Proceedings of the 18th National Conference on Artificial Intelligence (AAAI), pp 314–319
45. Conitzer V, Sandholm T (2003) Universal voting protocol tweaks to make manipulation hard. In: Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI-03), Acapulco, pp 781–788
46. Conitzer V, Sandholm T (2006) Nonexistence of voting rules that are usually hard to manipulate. Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-06), Boston, pp 627–634
47. Conitzer V, Lang J, Sandholm T (2003) How many candidates
are needed to make elections hard to manipulate? Theoretical Aspects of Rationality and Knowledge (TARK), pp 201–214
48. Copeland AH (1951) A “reasonable” social welfare function. Seminar on applications of mathematics to the social sciences, University of Michigan
49. Coppersmith D, Winograd S (1987) Matrix multiplication via arithmetic progressions. In: Proc 19th Ann ACM Symp Theory of Computing, pp 1–6
50. Coppersmith D, Fleischer L, Rudra A (2006) Ordering by weighted number of wins gives a good ranking for weighted tournaments. Proceedings of the 17th annual ACM-SIAM symposium on discrete algorithms (SODA’06), pp 776–782
51. Cormen T, Leiserson C, Rivest R (1990) Introduction to algorithms, 2nd edn 2001. MIT Press, Cambridge
52. Cox GW (1987) The Cabinet and the development of political parties in Victorian England. Cambridge University Press, New York
53. Czygrinow A, Poljak S, Rödl V (1999) Constructive quasi-Ramsey numbers and tournament ranking. SIAM J Discret Math 12(1):48–63
54. Daunou PCF (1803) Mémoire sur les élections au scrutin. Baudoin, Paris, an XI
55. Debord B (1987) Caractérisation des matrices de préférences nettes et méthodes d’agrégation associées. Math Sci Hum 97:5–17
56. Debord B (1987) Axiomatisation de procédures d’agrégation de préférences. Ph D thesis, Université scientifique technologique et médicale de Grenoble
57. de la Vega WF (1983) On the maximal cardinality of a consistent set of arcs in a random tournament. J Comb Theor B 35:328–332
58. Dodgson CL (1873) A discussion of the various methods of procedure in conducting elections. Imprint by Gardner EB, Hall EP, Stacy JH, Printers to the University, Oxford. Reprinted in: Black D (1958) The theory of committees and elections. Cambridge University Press, Cambridge, pp 214–222
59. Dodgson CL (1874) Suggestions as to the best method of taking votes, where more than two issues are to be voted on. Imprint by Hall EP, Stacy JH, Printers to the University, Oxford. Reprinted in: Black D (1958) The theory of committees and elections. Cambridge University Press, Cambridge, pp 222–224
60. Dodgson CL (1876) A method of taking votes on more than two issues. Clarendon Press, Oxford. Reprinted in: Black D (1958) The theory of committees and elections. Cambridge University Press, Cambridge, pp 224–234; and in: McLean I, Urken A (1995) Classics of social choice. University of Michigan Press, Ann Arbor
61. Dom M, Guo J, Hüffner F, Niedermeier R, Truß A (2006) Fixed-parameter tractability results for feedback set problems in tournaments. Lecture Notes in Computer Science, vol 3998. Springer, Berlin, pp 320–331
62. Downey RG, Fellows MR (1999) Parameterized complexity. Springer, Berlin
63. Dummett M (1984) Voting procedures. Clarendon Press, Oxford
64. Dutta B (1988) Covering sets and a new Condorcet choice correspondence. J Economic Theory 44:63–80
65. Dwork C, Kumar R, Naor M, Sivakumar D (2001) Rank aggregation methods for the Web. Proceedings of the 10th international conference on World Wide Web (WWW10), Hong Kong, pp 613–622
66. Elkind E, Lipmaa H (2006) Hybrid voting protocols and hardness of manipulation. In: Endriss U, Lang J (eds) Proceedings of the First International Workshop on Computational Social Choice (COMSOC 2006). University of Amsterdam, pp 178–191
67. Elster J, Hylland A (eds) (1986) Foundations of social choice theory. Cambridge University Press, New York
68. Erdös P, Moser L (1964) On the representation of directed graphs as unions of orderings. Magyar Tud Akad Mat Kutato Int Közl 9:125–132
69. Even G, Naor JS, Sudan M, Schieber B (1998) Approximating minimum feedback sets and multicuts in directed graphs. Algorithmica 20(2):151–174
70. Fagin R, Kumar R, Mahdian M, Sivakumar D, Vee E (2005) Rank aggregation: an algorithmic perspective. Unpublished manuscript
71. Faliszewski P, Hemaspaandra E, Hemaspaandra L (2006) The complexity of bribery in elections. In: Endriss U, Lang J (eds) Proceedings of the First International Workshop on Computational Social Choice (COMSOC 2006). University of Amsterdam, pp 178–191
72. Faliszewski P, Hemaspaandra E, Hemaspaandra L, Rothe J (2006) A richer understanding of the complexity of election systems. Technical report TR-2006-903, University of Rochester. To appear in: Ravi S, Shukla S (eds) (2008) Fundamental problems in computing: essays in honor of Professor Daniel J. Rosenkrantz. Springer, Berlin
73. Faliszewski P, Hemaspaandra E, Hemaspaandra L, Rothe J (2007) Llull and Copeland voting broadly resist bribery and control. Technical report TR-2006-903, University of Rochester, NY
74. Fishburn PC (1973) Interval representations for interval orders and semiorders. J Math Psychol 10:91–105
75. Fishburn PC (1973) The theory of social choice. Princeton University Press, Princeton
76. Fishburn PC (1977) Condorcet social choice functions. SIAM J Appl Math 33:469–489
77. Fishburn PC (1985) Interval orders and interval graphs, a study of partially ordered sets. Wiley, New York
78. Garey MR, Johnson DS (1979) Computers and intractability, a guide to the theory of NP-completeness. Freeman, New York
79. Gibbard A (1973) Manipulation of voting schemes. Econometrica 41:587–602
80. Guilbaud GT (1952) Les théories de l’intérêt général et le problème logique de l’agrégation. Économie Appl 5(4):501–584; Éléments de la théorie des jeux, 1968. Dunod, Paris
81. Hägele G, Pukelsheim F (2001) Llull’s writings on electoral systems. Studia Lulliana 3:3–38
82. Hemaspaandra L (2000) Complexity classes. In: Rosen KH (ed) Handbook of discrete and combinatorial mathematics. CRC Press, Boca Raton, pp 1085–1090
83. Hemaspaandra E, Hemaspaandra L (2007) Dichotomy for voting systems. J Comput Syst Sci 73(1):73–83
84. Hemaspaandra E, Hemaspaandra L, Rothe J (1997) Exact analysis of Dodgson elections: Lewis Carroll’s 1876 voting system is complete for parallel access to NP. J ACM 44(6):806–825
85. Hemaspaandra E, Spakowski H, Vogel J (2005) The complexity of Kemeny elections. Theor Comput Sci 349:382–391
86. Hemaspaandra E, Hemaspaandra L, Rothe J (2006) Hybrid elections broaden complexity-theoretic resistance to control. Proceedings of the First International Workshop on Computational Social Choice (COMSOC 2006), University of Amsterdam, pp 234–247; (2007) Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007). AAAI Press, pp 1308–1314
87. Hemaspaandra E, Hemaspaandra L, Rothe J (2007) Anyone but him: the complexity of precluding an alternative. Artif Intell 171(5–6):255–285
88. Homan C, Hemaspaandra L (2006) Guarantees for the success frequency of an algorithm for finding Dodgson-election winners. Proceedings of the 31st International Symposium on Mathematical Foundations of Computer Science. Lecture Notes in Computer Science, vol 4162. Springer, Berlin, pp 528–539
89. Hudry O (1989) Recherche d’ordres médians: complexité, algorithmique et problèmes combinatoires. Ph D thesis, ENST, Paris
90. Hudry O (2004) A note on “Banks winners in tournaments are difficult to recognize” by G.J. Woeginger. Soc Choice Welf 23:113–114
91. Hudry O (2008) NP-hardness results on the aggregation of linear orders into median orders. Ann Oper Res 163(1):63–88
92. Hudry O (2008) NP-hardness of Slater and Kemeny problems. (submitted)
93. Hudry O (2008) A survey on the complexity of tournament solutions. Math Soc Sci (to appear)
94. Hudry O (2008) Complexity of the aggregation of binary relations. (submitted)
95. Inada K (1969) The simple majority decision rule. Econometrica 37:490–506
96. Johnson DS (1990) A catalog of complexity classes. In: van Leeuwen J (ed) Handbook of theoretical computer science, vol A: algorithms and complexity. Elsevier, Amsterdam, pp 67–161
97. Johnson PE (1998) Social choice theory and research (Quantitative Applications in the Social Sciences Series, vol 123). Sage Publications, Thousand Oaks, CA
98. Jünger M (1985) Polyhedral combinatorics and the acyclic subdigraph problem. Heldermann, Berlin
99. Khachiyan L (1979) A polynomial algorithm in linear programming. Sov Math Dokl 20:191–194
100. Karp RM (1972) Reducibility among combinatorial problems. In: Miller RE, Tatcher JW (eds) Complexity of computer computations. Plenum Press, New York, pp 85–103
101. Kelly JS (1987) Social choice theory: an introduction. Springer, Berlin
102. Kemeny JG (1959) Mathematics without numbers. Daedalus 88:571–591
103. Köhler G (1978) Choix multicritère et analyse algébrique de données ordinales. Ph D thesis, Université scientifique et médicale de Grenoble
104. Lang J (2004) Logical preference representation and combinatorial vote. Ann Math Artif Intell 42:37–71
105. Laplace PS (marquis de) (1795) Journal de l’École Polytechnique, tome II, vol 7–8; Théorie analytique des probabilités. Essai philosophique sur les probabilités. Œuvres de Laplace, tome VII, Paris, 1847
106. Laslier J-F (1997) Tournament solutions and majority voting. Springer, Berlin
107. Laslier J-F (2004) Le vote et la règle majoritaire. Analyse mathématique de la politique. Éditions du CNRS
108. LeGrand R, Markakis E, Mehta A (2006) Approval voting: local search heuristics and approximation algorithms for the minimax solution. Proceedings of the First International Workshop on Computational Social Choice (COMSOC 2006), University of Amsterdam, pp 234–247
109. Levenglick A (1975) Fair and reasonable election systems. Behav Sci 20:34–46
110. Levin J, Nalebuff B (1995) An introduction to vote-counting schemes. J Economic Perspectives 9(1):3–26
111. Lines M (1986) Approval voting and strategy analysis: a Venetian example. Theor Decis 20:155–172
112. Lhuilier S (1794) Examen du mode d’élection proposé à la Convention nationale de France en février 1793 et adopté à Genève, Genève. Reprinted in: (1976) Math Sci Hum 54:7–24
113. Mascart J (1919) La vie et les travaux du chevalier Jean-Charles de Borda (1733–1799): épisodes de la vie scientifique au XVIIIe siècle. Annales de l’université de Lyon, vol II (33). New edition, Presses de l’université de Paris-Sorbonne, 2000
114. Maus S, Peters H, Storcken T (2006) Anonymous voting and minimal manipulability. Proceedings of the First International Workshop on Computational Social Choice (COMSOC 2006), University of Amsterdam, pp 317–330
115. McCabe-Dansted J (2006) Feasibility and approximability of Dodgson’s rule. Master’s thesis, University of Auckland
116. McCabe-Dansted J, Pritchard G, Slinko A (2006) Approximability of Dodgson’s rule. Proceedings of the First International Workshop on Computational Social Choice (COMSOC 2006), University of Amsterdam, pp 234–247
117. McGarvey D (1953) A theorem on the construction of voting paradoxes. Econometrica 21:608–610
118. McKay B (2006) http://cs.anu.edu.au/~bdm/data/digraphs.html
119. McLean I (1995) The first golden age of social choice, 1784–1803.
In: Barnett WA, Moulin H, Salles M, Schofield NJ (eds) Social choice, welfare, and ethics: proceedings of the eighth international symposium in economic theory and econometrics. Cambridge University Press, Cambridge, pp 13–33
120. McLean I, Hewitt F (1994) Condorcet: foundations of social choice and political theory. Edward Elgar, Hants
121. McLean I, Urken A (1995) Classics of social choice. University of Michigan Press, Ann Arbor
122. McLean I, Urken A (1997) La réception des œuvres de Condorcet sur le choix social (1794–1803): Lhuilier, Morales et Daunou. In: Chouillet A-M, Crépel P (eds) Condorcet, homme des Lumières et de la Révolution. ENS éditions, Fontenay-aux-Roses, pp 147–160
123. McLean I, McMillan A, Monroe BL (1995) Duncan Black and Lewis Carroll. J Theor Politics 7:107–124
124. McLean I, Lorrey H, Colomer JM (2007) Social choice in medieval Europe. Workshop Histoire des Mathématiques Sociales, Paris
125. Merrill III S, Grofman B (1999) A unified theory of voting. Cambridge University Press, Cambridge
126. Miller N (1980) A new solution set for tournaments and majority voting: further graph-theoretical approaches to the theory of voting. Am J Political Sci 24(1):68–96
127. Mitlöhner J, Eckert D, Klamler C (2006) Simulating the effects of misperception on the manipulability of voting rules. Proceedings of the First International Workshop on Computational Social Choice (COMSOC 2006), University of Amsterdam, pp 234–247
128. Monjardet B (1976) Lhuilier contre Condorcet au pays des paradoxes. Math Sci Hum 54:33–43
129. Monjardet B (1979) Relations à éloignement minimum de relations binaires, note bibliographique. Math Sci Hum 67:115–122
130. Monjardet B (1990) Sur diverses formes de la “règle de Condorcet” d’agrégation des préférences. Math Inf Sci Hum 111:61–71
131. Monjardet B (2008) Acyclic domains of linear orders: a survey. (to appear)
132. Monjardet B (2008) Mathématique sociale and mathematics. A case study: Condorcet’s effect and medians. Electron J Hist Probab Stat 4(1):1–26
133. Monjardet B. Private communication
134. Moon JW (1968) Topics on tournaments. Holt, Rinehart and Winston, New York
135. Moulin H (1980) On strategy-proofness and single peakedness. Public Choice 35:437–455
136. Moulin H (1983) The strategy of social choice. North-Holland, Amsterdam
137. Moulin H (1985) Fairness and strategy in voting. In: Young HP (ed) Fair allocation. American Mathematical Society, Proc Symp Appl Math 33:109–142
138. Moulin H (1986) Choosing from a tournament. Soc Choice Welf 3:272–291
139. Morales JI (1797) Memoria matemática sobre el cálculo de la opinión en las elecciones. Imprenta Real, Madrid. Translated in: McLean I, Urken A (1995) Classics of social choice. University of Michigan Press, Ann Arbor
140. Nanson EJ (1882) Methods of election. Trans Proc Royal Soc Victoria 18:197–240
141. Nurmi H (1987) Comparing voting systems. D. Reidel Publishing Company, Dordrecht
142. Orlin J (1981) Unpublished
143. Pattanaik PK, Salles M (eds) (1983) Social choice and welfare. North-Holland, Amsterdam
144. Poljak S, Turzík D (1986) A polynomial time heuristic for certain subgraph optimization problems with guaranteed lower bound. Discret Math 58:99–104
145. Poljak S, Rödl V, Spencer J (1988) Tournament ranking with expected profit in polynomial time. SIAM J Discret Math 1(3):372–376
146. Procaccia A, Rosenschein J (2006) Junta distributions and the average-case complexity of manipulating elections. Proceedings of the 5th International Joint Conference on Autonomous Agents and Multiagent Systems, ACM Press, pp 497–504
147. Procaccia A, Rosenschein J, Zohar A (2006) Multi-winner elections: complexity of manipulation, control, and winner-determination. Proceedings of the 8th Trading Agent Design and Analysis and Agent Mediated Electronic Commerce Joint International Workshop (TADA/AMEC 2006), pp 15–28
148. Raman V, Saurabh S (2006) Parameterized algorithms for feedback set problems and their duals in tournaments. Theor Comput Sci 351:446–458
149. Reid KB (2004) Tournaments. In: Gross JL, Yellen J (eds) Handbook of graph theory. CRC Press, Boca Raton, pp 156–184
150. Reid KB, Beineke LW (1978) Tournaments. In: Beineke LW, Wilson RJ (eds) Selected topics in graph theory. Academic Press, London, pp 169–204
151. Reinelt G (1985) The linear ordering problem: algorithms and applications. Research and Exposition in Mathematics 8. Heldermann, Berlin 152. Rothe J, Spakowski H (2006) On determining Dodgson winners by frequently self-knowingly correct algorithms and in average-case polynomial time. Proceedings of the First International Workshop on Computational Social Choice (COMSOC 2006), University of Amsterdam, pp 234–247 153. Rothe J, Spakowski H, Vogel J (2003) Exact complexity of the winner problem for Young elections. Theor Comput Syst 36(4):375–386 154. Rowley CK (ed) (1993) Social Choice Theory, vol 1: The Aggregation of Preferences. Edward Elgar Publishing Company, London 155. Saari D (1990) Susceptibility to manipulation. Public Choice 64:21–41 156. Saari D (2001) Decisions and Elections, Explaining the Unexpected. Cambridge University Press, Cambridge 157. Satterthwaite M (1975) Strategy-proofness and Arrow’s conditions: existence and correspondence theorems for voting procedures and social welfare functions. J Econ Theor 10:187–217 158. Schwartz T (1990) Cyclic tournaments and cooperative majority voting: A solution. Soc Choice Welf 7:19–29 159. Simpson PB (1969) On Defining Areas of Voter Choice. Q J Econ 83(3):478–490 160. Slater P (1961) Inconsistencies in a schedule of paired comparisons. Biometrika 48:303–312 161. Smith JH (1973) Aggregation of preferences with variable electorate. Econometrica 41(6):1027–1041 162. Smith D (1999) Manipulability measures of common social choice functions. Soc Choice Welf 16:639–661 163. Spencer J (1971) Optimal ranking of tournaments. Networks 1:135–138 164. Spencer J (1978) Nonconstructive methods in discrete mathematics. In: Studies in Combinatorics, Rota GC (ed) Mathematical Association of America, Washington, pp 142–178 165. Spencer J (1987) Ten lectures on the probabilistic method. CBMS-NSF Regional Conference Series in Applied Mathematics N 52, SIAM, Philadelphia 166. Stearns R (1959) The voting problem. 
Am Math Monthly 66:761–763 167. Straffin PD Jr. (1980) Topics in the Theory of Voting. Birkhäuser, Boston 168. Taylor AD (1995) Mathematics and Politics Strategy, Voting, Power, and Proof. Springer, Berlin 169. Taylor AD (2005) Social Choice and the Mathematics of Manipulation. Cambridge University Press, Cambridge 170. Tideman TN (1987) Independence of clones as criterion for voting rules. Soc Choice Welf 4:185–206 171. van Zuylen A (2005) Deterministic approximation algorithms for ranking and clusterings. Cornell ORIE Tech Report No. 1431 172. Vazirani VV (2003) Approximation Algorithms. Springer, Berlin 173. Wakabayashi Y (1986) Aggregation of binary relations: algorithmic and polyhedral investigations. Ph D thesis, Augsburg 174. Wakabayashi Y (1998) The Complexity of Computing Medians of Relations. Resenhas 3(3):323–349 175. Weber RJ (1995) Approval voting. J Econ Perspectives 9(1):39–49
Voting Procedures, Complexity of
176. Woeginger GJ (2003) Banks winner in tournaments are difficult to recognize. Soc Choice Welf 20:523–528 177. Young HP (1977) Extending Condorcet’s rule. J Economic Theor 16(2):335–353
Books and Reviews Aleskerov FT (1999) Arrovian Aggregation Models, Mathematical and statistical methods. Theory and decision library, vol 39. Kluwer, Boston Aleskerov FT, Monjardet B (2002) Utility Maximisation, Choice and preference. Springer, Berlin Baker KM (1975) Condorcet from Natural Philosophy to Social Mathematics. The University of Chicago Press, Chicago. Reissued 1982 Balinski M, Young HP (1982) Fair Representation. Yale University Press, New Haven Barthélemy J-P, Monjardet B (1988) The median procedure in data analysis: new results and open problems. In: Bock HH (ed) Classification and related methods of data analysis. North Holland, Amsterdam Batteau P, Jacquet-Lagrèze É, Monjardet B (eds) (1981) Analyse et agrégation des préférences dans les sciences économiques et de gestion. Economica, Paris Black D (1996) Formal contributions to the theory of public choice. In: Brady GL, Tullock G (eds) The unpublished works of Duncan Black. Kluwer, Boston Bouyssou D, Marchant T, Pirlot M, Tsoukias A, Vincke P (2006) Evaluation and decision models with multiple criteria. Springer, Berlin Campbell DE (1992) Equity, Efficiency, and Social Choice. Clarendon Press, Oxford Coughlin P (1992) Probabilistic Voting Theory. Cambridge University Press, Cambridge Danilov V, Sotskov A (2002) Social Choice Mechanisms. Springer, Berlin Dubois D, Pirlot M, Bouyssou D, Prade H (eds) (2006) Concepts et méthodes pour l’aide à la décision. Hermès, Paris Endriss U, Lang J (eds) (2006) Proceedings of the First International Workshop on Computational Social Choice, COMSOC 2006, University of Amsterdam Enelow J, Hinich M (eds) (1990) Advances in the Spatial Theory of Voting. Cambridge University Press, Cambridge Farquharson R (1969) Theory of Voting. Yale University Press, New Haven Feldman AM (1980) Welfare Economics and Social Choice Theory. Martinus Nijhoff, Boston Felsenthal DS, Machover M (1998) The Measurement of Voting Power: Theory and Practise, Problems and Paradoxes. 
Edward Elgar, Cheltenham Gaertner W (2001) Domains Conditions in Social Choice Theory. Cambridge University Press, Cambridge Greenberg J (1990) The Theory of Social Situations. Cambridge University Press, Cambridge Grofman B (1981) When is the Condorcet Winner the Condorcet Winner? University of California, Irvine Grofman B, Owen G (eds) (1986) Information Pooling and Group Decision Making. JAI Press, Greenwich Heal G (ed) (1997) Topological Social Choice. Springer, Berlin Hillinger C (2004) Voting and the cardinal aggregation cardinal of
judgments. Discussion papers in economics 353, University of Munich Holler MJ (ed) (1978) Power Voting and Voting Power. Physica, Wurtsburg-Wien Holler MJ, Owen G (eds) (2001) Indices and Coalition Formation. Kluwer, Boston Kemeny J, Snell L (1960) Mathematical models in the social sciences. Ginn, Boston Laslier J-F (2006) Spatial Approval Voting. Political Analysis 14(2):160–185 Laslier J-F, Van Der Straeten K (2008) A live experiment on approval voting. Exp Econ 11:97–105 Lieberman B (ed) (1971) Social Choice. Gordon and Breach, New York Mirkin BG (1979) Group Choice. Winston, Washington Moulin H (2003) Fair Division and Collective Welfare. Institute of Technology Press, Boston Nurmi H (1999) Voting Paradoxes and how to Deal with Them. Springer, Berlin Nurmi H (2002) Voting Procedures under Uncertainty. Springer, Berlin Pattanaik PK (1971) Voting and Collective Choice. Harvard University Press, Cambridge Pattanaik PK (1978) Strategy and Group Choice. North Holland, Amsterdam Peleg B (1984) Game Theoretic Analysis of Voting in Committes. Cambridge University Press, Cambridge Rothschild E (2001) Economic Sentiments: Adam Smith, Condorcet, and the Enlightenment. Harvard University Press, Cambridge Saari DG (1994) Geometry of Voting. Springer, Berlin Saari DG (1995) Basic Geometry of Voting. Springer, Berlin Saari DG (2000) Chaotic Elections! American Mathematical Society, Providence Schofield N (1984) Social Choice and Democracy. Springer, Berlin Schofield N (ed) (1996) Collective Decision Making: Social Choice and Political Economy. Kluwer, Boston Schwartz T (1986) The Logic of Collective Choice. Columbia University Press, New York Sen AK (1979) Collective Choice and Social Welfare. North Holland, Amsterdam Sen AK (1982) Choice, Welfare and Measurement. Basil Blackwell, Oxford Suzumura K (1984) Rational Choice, Collective Decisions and Social Welfare. 
Cambridge University Press, Cambridge Tanguiane AS (1991) Aggregation and Representation of Preferences, Introduction to Mathematical Theory of Democracy. Springer, Berlin Tideman N (2006) Collective Decisions and Voting: The Potential for Public Choice. Ashgate, Burlington Woodall DR (1997) Monotonicity of single-seat preferential election rules. Discret Appl Math 77:81–98 Young HP (1974) An Axiomatization of Borda’s Rule. J Econ Theor 9:43–52 Young HP (1986) Optimal ranking and choice from pairwise comparisons. In: Grofman B, Owen G (eds) Information pooling and group decision making. JAI Press, Greenwich, pp 113–122 Young HP (1988) Condorcet Theory of Voting. Am Political Sci Rev 82:1231–1244 Young HP (1995) Optimal voting rules. J Econ Perspect 9(1):51–64 Young HP, Levenglick A (1978) A Consistent Extension of Condorcet’s Election Principle. SIAM J Appl Math 35:285–300
3313
3314
Wavelets, Introduction to
EDWARD ABOUFADEL
Department of Mathematics, Grand Valley State University, Allendale, USA

Wavelets emerged in the late 1980s as a valuable new tool in science and engineering, and as a topic for fruitful mathematical research. Wavelet transforms are now regularly applied in areas such as image processing, statistical analysis, and seismic research. Wavelets are at the heart of the WSQ standard used by the United States' Federal Bureau of Investigation to compress fingerprint images [1]. They have been implemented in the RED professional digital video camera [2], and they are now being used in an attempt to analyze paintings by famous artists [3]. In the past twenty years, related "-lets" tools have also emerged, such as ridgelets and shearlets (see Curvelets and Ridgelets). The 1992 book by Daubechies [4] is a classic in the field.

Wavelet analysis shares with Fourier analysis the key idea of having a basis of functions with which to analyze signals and other functions. While Fourier analysis uses sine and cosine functions (or complex exponentials) – which are smooth and periodic – wavelet analysis utilizes functions that tend to be rough and localized (see Wavelets and the Lifting Scheme for examples of wavelet functions). To create the basis in Fourier analysis, the frequencies of the standard functions are varied. To create a system of wavelet functions, however, a scaling function is modified by translations and dilations, leading to various resolutions and hence the term multiresolution that is frequently used in the field. There are both discrete and continuous Fourier transforms, and the same is true for wavelets (see Comparison of Discrete and Continuous Wavelet Transforms). There are also analogues of the Fast Fourier Transform. Ironically, much of the derivation of results in wavelets uses the Fourier transform as a theoretical tool.
To be more specific, the creation of a wavelet basis starts with a scaling function φ and a mother wavelet ψ. (The scaling function is sometimes called the "father wavelet".) These functions tend either to have compact support (meaning they are equal to 0 outside of a bounded interval), or to decay exponentially outside of a relatively small part of their domain. Translations and dilations of the mother wavelet are created in the form ψ(2^n t − k), where n and k are integers. The resolution varies with n, with larger positive values of n corresponding to finer resolutions. By varying k, the support of the function can be translated, allowing localized analysis of other functions or signals.
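As a concrete illustration (not from the article), the translations and dilations ψ(2^n t − k) can be generated in code; the sketch below uses the Haar mother wavelet, the simplest example, and all function names are our own:

```python
# Illustrative sketch: translations and dilations psi(2**n * t - k) of a
# mother wavelet, using the Haar wavelet as a concrete (assumed) example.

def haar_mother(t):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    if 0.0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1.0:
        return -1.0
    return 0.0

def wavelet(n, k):
    """Return the function t -> psi(2**n * t - k)."""
    return lambda t: haar_mother(2**n * t - k)

# Larger n gives a finer resolution (narrower support); k translates it.
psi = wavelet(2, 1)                       # supported on [1/4, 1/2)
samples = [psi(t / 16) for t in range(16)]
```

Here n controls the resolution and k the location, matching the roles described above.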
Wavelet transformations are described and applied in multiple ways. As functions, wavelets can be studied with either the real line or the complex plane as a domain. High pass and low pass wavelet filters can be applied to tease out details in signals or images (see Popular Wavelet Families and Filters and Their Use). To think of wavelet transformations as a combination of "prediction" and "update" steps is the main idea behind the lifting scheme (see Wavelets and the Lifting Scheme). In addition, working with Fourier transforms of wavelet functions leads to various polyphase results (see Multiwavelets, for example). As suggested above, the use of wavelets in various applications has contributed to their popularity. In statistics, wavelets are used to denoise signals and estimate properties of zero-mean stochastic processes (see Statistical Applications of Wavelets). Wavelet techniques have been combined with methods from partial differential equations to clean up images, or fill in damaged parts of an image (which is called inpainting) (see Wavelets and PDE Techniques in Image Processing, A Quick Tour of). Because images from astronomy tend to be isotropic (that is, having similar shapes or values in all directions, such as a sphere), wavelet methods have also been useful in that discipline (see Numerical Issues When Using Wavelets). Wavelet analysis is preferred to Fourier analysis in many applications. One reason is that the relative roughness of the wavelet functions is a better match than trigonometric functions are for real data. This leads to a rather sparse representation of the data in the wavelet domain, which is helpful for compression and noise reduction. The Fourier transform breaks down a function into different frequencies; consequently this representation is based on averages of the function over its whole domain. Wavelets, by contrast, give information about a function at different scales and different locations simultaneously.
The localized nature of wavelet basis functions leads to better identification and elimination of local behavior in data such as spikes, edges, and other discontinuities. In addition, the nature of wavelets allows analysis at multiple resolutions, from coarse to fine. A fertile area of research in wavelets is to consider extensions where the domain and/or range of the wavelet functions are multi-dimensional. For instance, since images are two-dimensional objects, a natural area to investigate is the situation where the domain is R^2 (see Bivariate (Two-dimensional) Wavelets). Basic wavelet families are orthogonal, but they don't need to be (see Numerical Issues When Using Wavelets). In order to work with radial images such as retinal scans [5], or anisotropic images (which are long and narrow, or tall and thin), other wavelet tools have been developed (see Curvelets and Ridgelets).
Multiwavelets are functions on the real line whose range is two or more copies of the complex plane (see Multiwavelets). Finally, there is an intimate connection between wavelets and splines (see Multivariate Splines and Their Applications). For the reader unfamiliar with wavelets, sections I, II, and III of the Wavelets and the Lifting Scheme article, along with the glossary of the Comparison of Discrete and Continuous Wavelet Transforms article, are good places to begin. A familiarity with the Fourier transform is critical to read many of the articles in this section.
Bibliography

1. Bradley J, Brislawn C, Hopper T (1993) The FBI wavelet/scalar quantization standard for gray-scale fingerprint image compression. In: Proc Conf Visual Info Process II, Proc SPIE, vol 1961, pp 293–304
2. RED Digital Cinema Camera Company. www.red.com
3. Lyu S, Rockmore D, Farid H (2004) A digital technique for art authentication. Proc Nat Acad Sci 101:17006–17010
4. Daubechies I (1992) Ten lectures on wavelets. SIAM
5. Feng P et al (2007) Enhancing retinal image by the Contourlet transform. Pattern Recognit Lett 28:516–522
Wavelets and the Lifting Scheme
ANDERS LA COUR-HARBO¹, ARNE JENSEN²
¹ Department of Electronics Systems, Aalborg University, Aalborg East, Denmark
² Department of Mathematical Sciences, Aalborg University, Aalborg East, Denmark

Article Outline

Glossary
Definition of the Subject
Introduction
Lifting
Prediction and Update
Interpretation
Lifting and Filter Banks
Making Lifting Steps from Filters
Bibliography

Glossary

Lifting step  An equation describing the transformation of an odd (even) sample of a signal by means of a linear combination of even (odd) samples, respectively. Changing an odd sample is sometimes called a prediction step, while changing an even sample is called an update step.

Discrete wavelet transform (DWT)  Transformation of a discrete (sampled) signal into another discrete signal by means of a wavelet basis. The transform can be accomplished in a number of ways, typically either by a two channel filter bank or by lifting steps.

Filter bank  A series of bandpass filters, which separates the input signal into a number of components, each with a distinct range of frequencies of the original signal.

z-Transform  A mapping of a vector x = {x[k]} to a function in the complex plane by

X(z) = Σ_k x[k] z^{-k} .

For a vector with a finite number of non-zero entries X(z) is a Laurent polynomial.

Filter  A linear map, which maps a signal x with finite energy to another signal with finite energy. In the time domain it is given by convolution with a vector h, which in the frequency domain is equal to H(z)X(z). The vector h is called the impulse response (IR) of the filter (or sometimes the filter taps), and H(e^{jω}) the transfer function (or sometimes the frequency response). If h is a finite sequence, then h is called a FIR
(finite impulse response) filter. An infinite h is then called an IIR filter. We only consider FIR filters in this article. We restrict ourselves to filters with real coefficients.

Filter taps  The entries in the vector that by convolution with a signal gives a filtering of the signal. The filter taps are also called the impulse response of the filter.

Laurent polynomial  A polynomial in the variables z and z^{-1}. Some examples: 3z^2 + 2z, z^3 - 2z^{-1}, and 1 + z + z^{-3}. The z-transform of a FIR filter h is a Laurent polynomial of the form

H(z) = Σ_{k=k_b}^{k_e} h[k] z^{-k} ,   with k_b ≤ k_e .
This is in contrast to ordinary polynomials, where we only have non-negative powers of z. Assuming that h[k_e] ≠ 0 and h[k_b] ≠ 0, the degree of a Laurent polynomial is defined as |h| = k_e − k_b.

Monomial  A Laurent polynomial of degree 0, for example 3z^7, z^{-8}, or 12.

Definition of the Subject

The objective of this article is to give a concise introduction to the discrete wavelet transform (DWT) based on a technique called lifting. The lifting technique allows one to give an elementary, but rigorous, definition of the DWT, with modest requirements on the reader. A basic knowledge of linear algebra and signal processing will suffice. The lifting based definition is equivalent to the usual filter bank based definition of the DWT. The article does not discuss applications in any detail. The reader is referred to other articles in this collection.

The DWT was introduced in the second half of the eighties, through the work of Y. Meyer, I. Daubechies, S. Mallat, and many other researchers. The lifting technique was introduced by W. Sweldens in 1996, and the equivalence with the filter bank approach was established by I. Daubechies and W. Sweldens in an article published in 1998 [2], but available in preprint form from 1996. This article is based on the book Ripples in Mathematics – The Discrete Wavelet Transform [3] with kind permission of Springer Science and Business Media. For more information on the background and history of wavelets, we again refer to other articles in this collection.

Introduction

We begin the exposition of the lifting technique with a simple, yet very descriptive example. It will shortly become apparent how this is related to lifting and to the discrete wavelet transform. We take a digital signal consisting of eight samples (this idea is due to Mulcahy [5], and we use his example, with a minor modification):

56, 40, 8, 24, 48, 48, 40, 16.

Wavelets and the Lifting Scheme, Table 1  Mean and difference computation. Differences (boldface in the original) appear to the right of the vertical bar

56 40  8 24 48 48 40 16
48 16 48 28 |  8 -8  0 12
32 38 | 16 10  8 -8  0 12
35 | -3 16 10  8 -8  0 12
We choose to believe that these numbers are not random, but have some structure that we want to extract. Perhaps there is some correlation between a number and its immediate successor. To reveal such a correlation we will take the numbers in pairs and compute the mean, and the difference between the first member of the pair and the computed mean. The result of this computation is shown as the second row in Table 1. This row contains the four means followed by the four differences, the latter being typeset in boldface. We then leave the four differences unchanged and apply the mean and difference computations to the first four entries of the second row. We repeat this procedure once more on the third row. The fourth row then contains a first entry, which is the mean of the original eight numbers, and the seven calculated differences. The boldface entries in the table are called the details of the signal. It is important to observe that no information has been lost in this transformation of the first row into the fourth row. We can see this by reversing the calculation. Beginning with the last row, we compute the first two entries in the third row as 32 = 35 + (-3) and 38 = 35 - (-3). In the same way the first four entries in the second row are calculated as 48 = 32 + 16, 16 = 32 - 16, 48 = 38 + 10, and finally 28 = 38 - 10. Repeating this procedure we get the first row in the table. So, what is the purpose of these computations? The four signals contain the same information, just in different ways, so how do we gain anything from this change of representation of the signal? If our assumption (pairwise equality of samples) is correct, the four means in the second row will equal the original numbers, and the four differences will be zero. If, further, the original numbers are equal in groups of four, the two means in the third row will equal the original quadruples, and the differences will be zero. Finally, if all eight original samples are equal, the
mean in row four will be equal to all eight samples, and the seven differences will be zero. Apparently, our assumption is not correct, as the differences are not zero. However, the numbers in the fourth row are generally smaller than the original numbers. This indicates that while the numbers are not equal, they are 'similar', leaving us with small differences. We can say that we have achieved some kind of loss-free compression by reducing the dynamic range of the signal. By loss-free we mean that we can transform back to the original signal without any information being lost in the process. As a simple measure of compression we can count the number of digits used to represent the signal. The first row contains 15 digits. The last row contains 12 digits and two negative signs. So in this example the compression is not very large. But it is easy to give other examples, where the compression can be substantial. We see in this example that the pair 48, 48 does fit our assumption of equality, and therefore the difference is zero. Suppose that after transformation we find that many 'difference entries' are zero. Then we can store the transformed signal more efficiently by only storing the nonzero entries (and their locations).

Processing the Transformed Signal

Let us now suppose that we are willing to accept a certain loss of quality in the signal, if we can get a higher rate of compression. One technique for lossy compression is called thresholding. We choose a threshold and decide to set equal to zero all entries with an absolute value less than this threshold value. Let us in our example choose 4 as the threshold. This means that in Table 1 we replace the entry -3 by 0 and then perform the reconstruction. The result is in the top part of Table 2. The original and transformed signals are both shown on the left in Fig. 1.
We have chosen to join the sample points by straight line segments to get a good visualization of the signals. Clearly the two graphs differ very little. Now let us perform a more drastic compression. This time we choose the threshold equal to 9. The computations are given in the bottom part of Table 2, and the graphs are plotted on the right in Fig. 1. Notice that the peaks in the original signal have been flattened. We also note that the signal now is represented by only four non-zero entries.

Wavelets and the Lifting Scheme, Table 2  Reconstruction with threshold 4 (top) and threshold 9 (bottom)

59 43 11 27 45 45 37 13
51 19 45 25  8 -8  0 12
35 35 16 10  8 -8  0 12
35  0 16 10  8 -8  0 12

51 51 19 19 45 45 37 13
51 19 45 25  0  0  0 12
35 35 16 10  0  0  0 12
35  0 16 10  0  0  0 12

Wavelets and the Lifting Scheme, Figure 1  Original signal (solid) and modified signal (dashed) with threshold 4 (left) and 9 (right)

There are several variations of the transformation. We could have stored differences instead of 'half-differences', or we could have used the difference between the second element of the pair and the computed average. The first choice will lead to boldface entries in the tables that can be obtained from the computed ones by multiplication by a factor of 2. The second variant is obtained by multiplication by -1. The 'mean-difference' transformation can be performed on any signal of even length, since all we need is the possibility of taking pairs. If we want to be able to keep transforming the signal until there is one signal mean left, we need a signal of length 2^N, and it will lead to a table with N + 1 rows, where the first row is the original signal. If the given signal has a length different from a power of 2, then we will have to do some additional operations on the signal to compensate for that. One possibility is to add samples with value zero to one or both ends of the signal until a length of 2^N is achieved. This is called zero padding.
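The whole example, the mean/difference cascade of Table 1 followed by thresholding and reconstruction as in Table 2, can be sketched in Python (an illustrative sketch, not code from the article; all function names are our own):

```python
# Illustrative sketch of the mean/difference transform, thresholding,
# and reconstruction from Tables 1 and 2.

def transform(signal):
    """Repeatedly replace pairs by (mean, first - mean), as in Table 1."""
    means, details = list(signal), []
    while len(means) > 1:
        pairs = list(zip(means[::2], means[1::2]))
        diffs = [(a - b) / 2 for a, b in pairs]   # first - mean == (a - b)/2
        means = [(a + b) / 2 for a, b in pairs]
        details = diffs + details
    return means + details                        # last row of Table 1

def reconstruct(row):
    """Reverse the calculation: a = mean + d, b = mean - d."""
    signal = row[:1]
    k = 1
    while k < len(row):
        diffs = row[k:2 * k]
        signal = [v for m, d in zip(signal, diffs) for v in (m + d, m - d)]
        k *= 2
    return signal

def threshold(row, t):
    """Lossy compression: zero out entries with absolute value below t."""
    return [x if abs(x) >= t else 0 for x in row]

row = transform([56, 40, 8, 24, 48, 48, 40, 16])
print(row)                             # -> [35.0, -3.0, 16.0, 10.0, 8.0, -8.0, 0.0, 12.0]
print(reconstruct(threshold(row, 9))) # -> [51.0, 51.0, 19.0, 19.0, 45.0, 45.0, 37.0, 13.0]
```

The second printed line reproduces the reconstructed signal in the bottom part of Table 2.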
Lifting

The transformation performed by successively calculating means and differences of a signal is an example of the discrete wavelet transform. Specifically, the transform introduced in the example is called the Haar transform, and occasionally also the first of the Daubechies transforms. It can be undone by simply reversing the steps performed, and it provides a number of different representations of the same signal. Actually, all discrete wavelet transforms can be realized by a similar procedure, where the only modification from the Haar transform is the way in which we compute the means part and differences part of the transformed signal. The procedure is known as 'lifting' in the literature, and was introduced by Wim Sweldens in 1996 in a series of papers [8,9]. To fully appreciate the concept of lifting we need to elaborate on the example in the previous section. We have a discrete signal of real (or complex) numbers

x[0], x[1], x[2], x[3], ..., x[N-1] .

Our convention is to start indexing at zero. In implementations one may need to adapt this to the programming language used. Here we assume the signal to have finite length N, but this is not a restriction. The theory also applies to infinite signals, provided they have finite energy. In the mathematical (and signal processing) literature finite energy means

Σ_{n=-∞}^{∞} |x[n]|² < ∞ ,

and the set of such signals is usually denoted by ℓ²(Z). Sometimes we use the mathematical term sequence instead of signal, and we also use the term vector, in particular in connection with use of results from linear algebra.
Introducing Prediction and Update

Let us now return to the example in the Introduction. We took a pair of numbers a, b and computed the mean, and the difference between the first entry and the mean:

s = (a + b)/2 ,   (1)
d = a - s .   (2)

Alternatively, almost the same result can be achieved in the following way:

d = a - b ,   (3)
s = (a + b)/2 = d/2 + b ,   (4)
except we now have the difference rather than half the difference. This mathematical formulation of the Haar transform, albeit quite simple, brings us to the definition of lifting. The two operations, mean and difference, can be viewed as special cases of more general operations. Remember that we assumed that there is some correlation between two successive samples, and we therefore computed the difference. If two samples are almost equal, the difference is, of course, small. Thus one can use the first sample to predict the second sample, the prediction being that they are equal. If it is a good prediction, the difference between them is small. Thus we can call (3) a prediction step. We can use other and more sophisticated prediction steps than one based on just the previous sample. We will give a few examples of this later. We also calculated the mean of the two samples. This step can be viewed in two ways. Either as an operation which preserves some properties of the original signal (later we shall see how the mean value or the energy of a signal is preserved during transformation), or as an extraction of essential features of the signal. The latter viewpoint is based on the fact that the pair-wise mean values contain the overall structure of the signal, but with only half the number of samples. We use the term update step for this operation, and (4) is the update step in the Haar transform. Just as the prediction operation, the update operation can be more sophisticated than just calculating the mean. The prediction and update operations are shown in Fig. 2. The notation and the setup is a little different from the Introduction; we start with a finite sequence s_j of length 2^j and end with two sequences s_{j-1} and d_{j-1}, each of length 2^{j-1}. Let us explain the steps in Fig. 2.

split  The entries are sorted into the even and the odd entries. It is important to note that we do this only to explain the algorithm.
In (effective) implementations the entries are not moved or separated.

Wavelets and the Lifting Scheme, Figure 2  The three steps in a lifting building block. Note that the minus sign means 'the signal from the left minus the signal from the top'

prediction  If the signal contains some structure, then we can expect correlation between a sample and its nearest neighbors. In our first example the prediction is that the signal is constant. More elaborately, given the value at sample number 2n, we predict that the value at sample 2n + 1 is the same. We then compute the difference and in the figure let it replace the value at 2n + 1, since this value is not needed anymore. In our notation this is

d_{j-1}[n] = s_j[2n + 1] - s_j[2n] .

In general, the idea is to have a prediction procedure P and then compute the difference signal as

d_{j-1} = odd_{j-1} - P(even_{j-1}) .   (5)

Thus in the d signal each entry is one odd sample minus some prediction based on a number of even samples.

update  Given an even entry we predicted that the next odd entry has the same value, and stored the difference. We then update our even entry to reflect our knowledge of the signal. In the example above we replaced the even entry by the average. In our notation

s_{j-1}[n] = s_j[2n] + d_{j-1}[n]/2 .

In general we decide on an updating procedure, and then compute

s_{j-1} = even_{j-1} + U(d_{j-1}) .   (6)
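The split/predict/update recipe of (5) and (6) can be sketched as follows (illustrative code, not from the article; the Haar choices of P and U are those of the running example):

```python
# One lifting step with pluggable prediction P and update U, following
# equations (5) and (6). P and U map lists to lists.

def lifting_step(s, P, U):
    even, odd = s[::2], s[1::2]                    # split
    d = [o - p for o, p in zip(odd, P(even))]      # (5) prediction
    s_out = [e + u for e, u in zip(even, U(d))]    # (6) update
    return s_out, d

# Haar: predict that each odd sample equals the preceding even sample,
# and update with half the difference (this reproduces the pairwise means).
haar_P = lambda even: even
haar_U = lambda d: [x / 2 for x in d]

s1, d1 = lifting_step([56, 40, 8, 24, 48, 48, 40, 16], haar_P, haar_U)
# s1 == [48, 16, 48, 28]; d1 holds odd-minus-even differences [-16, 16, 0, -24]
```

Note that d here follows the convention of (5) (odd minus predicted), so it differs by a factor from the half-differences of Table 1.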
The algorithm described here is called one step lifting. It requires the choice of a prediction procedure P, and an update procedure U.

Multiple Transforms

Wavelets and the Lifting Scheme, Figure 3  Two step discrete wavelet transform

The discrete wavelet transform is obtained by combining a number of lifting steps. As in the example in Table 1 we
keep the computed differences d_{j-1} and use the average sequence s_{j-1} as input for one more lifting step. This two step procedure is illustrated in Fig. 3. Obviously, we can apply more lifting steps if we want to transform the signal further. In general, we start with a signal s_j of length 2^j, and we can repeat the transformation j times, until we have a single number s_0[0] and j difference sequences d_{j-1} to d_0. If we let j = 3 and use the Haar transform, s_0[0] will be the mean value of the eight entries in the original sequence, and d_0, d_1, and d_2 will be the numbers in boldface in Table 1. In lifting step notation this table becomes Table 3.

We have previously motivated the prediction operation with the reduction in dynamic range of the signal obtained in using differences rather than the original values, potentially leading to good compression of a signal. The update procedure has not yet been clearly motivated. The update performed in the first example was

s_{j-1}[n] = s_j[2n] + (1/2) d_{j-1}[n] = (1/2)(s_j[2n] + s_j[2n+1]) .

It turns out that this update operation preserves the mean value. The consequence is that all the s sequences have the same mean value. It is easy to verify in the case of the example in Table 1, since

(56 + 40 + 8 + 24 + 48 + 48 + 40 + 16)/8 = (48 + 16 + 48 + 28)/4 = (32 + 38)/2 = 35 .

It is not difficult to see that this holds for any sequence s of length 2^j. In particular, s_0[0] equals the mean value of the original samples s_j[0], ..., s_j[2^j - 1].

Wavelets and the Lifting Scheme, Table 3 This is Table 1 in the lifting step notation

s_3[0]  s_3[1]  s_3[2]  s_3[3]  s_3[4]  s_3[5]  s_3[6]  s_3[7]
s_2[0]  s_2[1]  s_2[2]  s_2[3]  d_2[0]  d_2[1]  d_2[2]  d_2[3]
s_1[0]  s_1[1]  d_1[0]  d_1[1]  d_2[0]  d_2[1]  d_2[2]  d_2[3]
s_0[0]  d_0[0]  d_1[0]  d_1[1]  d_2[0]  d_2[1]  d_2[2]  d_2[3]

Inverse Transform

A lifting step is easy to undo. In fact, looking at Fig. 2 it is clear that we can go backwards from s_{j-1} and d_{j-1} to s_j simply by first 'undoing' the update step and then 'undoing' the prediction step. This is accomplished by using the same update step as in the direct lifting step, but subtracting instead of adding, followed by the same prediction step, but using addition instead of subtraction. Finally, we need to merge the odd and even samples. The direct lifting step and the corresponding inverse lifting step are shown in Fig. 4. This merging step, where the sequences even_{j-1} and odd_{j-1} are merged to form the sequence s_j, is shown to explain the algorithm. It is not performed in implementations, since the entries are not reordered in the first place. Mathematically, the inverse lifting step is accomplished by using the formulas in opposite order, and isolating the odd and even samples. In our example the inverse lifting step becomes

b = s - d/2 ,   (7)
a = d + b ,     (8)
Wavelets and the Lifting Scheme, Figure 4 Direct and inverse lifting step
where (7) is actually (4) with b isolated, and (8) is (3) with a isolated. In the lifting step notation this is first (6) rewritten as

s_j[2n] = s_{j-1}[n] - d_{j-1}[n]/2

to undo the update step, and then (5) rewritten as

s_j[2n+1] = d_{j-1}[n] + s_j[2n]

to undo the prediction step. For the general lifting step

d_{j-1} = odd_{j-1} - P(even_{j-1})
s_{j-1} = even_{j-1} + U(d_{j-1})

the inversion is accomplished by the steps

even_{j-1} = s_{j-1} - U(d_{j-1})
odd_{j-1} = d_{j-1} + P(even_{j-1}) .

That is all there is to the inverse transform.

Prediction and Update

Assuming that a signal is constant was useful in the first example to demonstrate the concept of lifting. However, this assumption is hardly the best for most signals. Fortunately, there are many other possible prediction and update procedures to accommodate various signal assumptions. We will present two examples of other lifting steps. The first example is the next logical step from the Haar transform, where we assume the signal to be piecewise linear rather than just constant. The second example is the Daubechies 4 transform, which is not based on a specific assumption about the signal, but is taken from classical wavelet theory and adapted to the lifting step method.
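The direct and inverse lifting step can be sketched in a few lines of Python. This sketch is not part of the article; the P and U choices below are the Haar ones from the first example, with d stored as the full difference of Eq. (5).

```python
# One lifting step and its inverse, with Haar-style P and U:
# predict each odd sample by its left even neighbor; update by half
# the difference, so the coarse signal is the pairwise average.

def lift(s):
    """One direct lifting step: split, predict, update."""
    even, odd = s[0::2], s[1::2]
    d = [o - e for e, o in zip(even, odd)]                 # d = odd - P(even)
    s_coarse = [e + di / 2 for e, di in zip(even, d)]      # s = even + U(d)
    return s_coarse, d

def unlift(s_coarse, d):
    """Inverse lifting step: undo update, undo prediction, merge."""
    even = [sc - di / 2 for sc, di in zip(s_coarse, d)]    # even = s - U(d)
    odd = [di + e for di, e in zip(d, even)]               # odd  = d + P(even)
    out = []
    for e, o in zip(even, odd):
        out += [e, o]
    return out

signal = [56, 40, 8, 24, 48, 48, 40, 16]
s1, d1 = lift(signal)
assert s1 == [48.0, 16.0, 48.0, 28.0]   # the pairwise averages from the text
assert unlift(s1, d1) == signal          # perfect reconstruction
```

Because the update is the exact algebraic inverse of its own subtraction, any choice of P and U gives perfect reconstruction; only the compression properties change.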
Wavelets and the Lifting Scheme, Figure 5 The linear prediction uses two even samples to predict one odd sample
Linear Prediction

Instead of basing the prediction on the assumption that the signal is constant, we will now base it on the assumption that the signal is linear. We really do mean an affine signal, but we stick to the commonly used term 'linear'. Thus a linear signal is in this context a signal of the form s_j[n] = αn + β, that is, all the samples of the signal lie on a straight line. For a given odd entry s_j[2n+1] we now need the two nearest even neighbors to predict the value (remember that for the constant signal we needed only one neighbor). The predicted value is (1/2)(s_j[2n] + s_j[2n+2]), the mean of the two even samples. This value is the circle on the middle of the line in Fig. 5. The correction is then the difference between what we predict the middle sample to be and what it actually is,

d_{j-1}[n] = s_j[2n+1] - (1/2)(s_j[2n] + s_j[2n+2]) .

We decide to base the update procedure on the two most recently computed differences. We take it of the form

s_{j-1}[n] = s_j[2n] + A(d_{j-1}[n-1] + d_{j-1}[n]) ,

where A is a constant to be determined. In the first example we had the property

Σ_n s_{j-1}[n] = (1/2) Σ_n s_j[n] ,   (9)

that is, the mean was preserved. (Recall that s_j has length 2^j and s_{j-1} has length 2^{j-1}; this explains the factor 1/2.) We
Wavelets and the Lifting Scheme, Figure 6 Multiple prediction and update steps
would like to have the same property here. Let us first rewrite the expression for s_{j-1}[n] above,

s_{j-1}[n] = s_j[2n] + A d_{j-1}[n-1] + A d_{j-1}[n]
           = s_j[2n] + A (s_j[2n-1] - (1/2) s_j[2n-2] - (1/2) s_j[2n])
                     + A (s_j[2n+1] - (1/2) s_j[2n] - (1/2) s_j[2n+2]) .

Using this expression, and gathering even and odd terms, we get

Σ_n s_{j-1}[n] = (1 - 2A) Σ_n s_j[2n] + 2A Σ_n s_j[2n+1] .

To satisfy (9) we must choose A = 1/4. Summarizing, we have the following two steps

d_{j-1}[n] = s_j[2n+1] - (1/2)(s_j[2n] + s_j[2n+2]) ,   (10)
s_{j-1}[n] = s_j[2n] + (1/4)(d_{j-1}[n-1] + d_{j-1}[n]) .   (11)

The transform in this example also has the property

Σ_n n s_{j-1}[n] = (1/4) Σ_n n s_j[n] .   (12)
Note that there is a misprint in this formula in [3]. We say that the transform preserves the first moment of the sequence. The mean is also called the zeroth moment of the sequence. Finally, we want to note that this transform is known in the literature as CDF(2,2). The origin of this name is explained later. In the above presentation we have simplified the notation by not specifying where the finite sequences start and end, thereby for the moment avoiding keeping track of the ranges of the variables. In other words, we have considered our finite sequences as infinite, adding zeroes before and after the given entries. In implementations one has to keep track of these things, but doing so now would obscure the simplicity of the lifting procedure. For more details on transformation of finite signals, see Chap. 10 in [3].
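The CDF(2,2) step (10)–(11) and its two moment properties can be checked numerically. The following Python sketch is not from the article; it treats the finite signal as infinite with zero padding, keeping one extra coefficient on each side so that (9) and (12) hold exactly.

```python
# CDF(2,2) lifting step, eqs (10)-(11), on a zero-padded finite signal.

def cdf22_step(s):
    N = len(s)                       # N is assumed even
    get = lambda n: s[n] if 0 <= n < N else 0.0
    idx = range(-1, N // 2 + 1)      # one extra coefficient on each side
    # prediction, eq (10): d[n] = s[2n+1] - (s[2n] + s[2n+2]) / 2
    d = {n: get(2*n + 1) - (get(2*n) + get(2*n + 2)) / 2 for n in idx}
    # update, eq (11): s1[n] = s[2n] + (d[n-1] + d[n]) / 4
    s1 = {n: get(2*n) + (d.get(n - 1, 0.0) + d[n]) / 4 for n in idx}
    return s1, d

s = [1.0, 2.0, 3.0, 4.0]
s1, d = cdf22_step(s)
# zeroth moment (the mean), eq (9): sum of s1 equals half the sum of s
assert abs(sum(s1.values()) - sum(s) / 2) < 1e-12
# first moment, eq (12): sum of n*s1[n] equals a quarter of sum of n*s[n]
assert abs(sum(n * v for n, v in s1.items())
           - sum(n * v for n, v in enumerate(s)) / 4) < 1e-12
```

Note that both asserts only pass because the boundary coefficients at n = -1 and n = N/2 are kept; truncating them is exactly the bookkeeping issue for finite signals mentioned above.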
Daubechies 4 Wavelet

We now turn to the second example, the Daubechies 4 wavelet. This transform is a bit different from what we have seen so far, so to adapt it to the lifting step method we need to first take a closer look at the construction of the lifting step. Looking at Fig. 4 once more, we see that we can add another prediction step after the update step without changing the lifting concept; we still have even and odd samples, we still have s_{j-1} and d_{j-1} as output, and we can still invert the lifting step by reversing the order of the prediction and update steps. In fact, we can add as many prediction and update steps as we want, and still have the same lifting concept. As an illustration of this, Fig. 6 shows a direct transform consisting of three pairs of prediction and update operations. If we have two predictions and only one update, but insist on having them in pairs (this is occasionally useful in the theory), we can always add an operation of either type which does nothing. It turns out that this generalization is crucial in applications. There are many important transforms where the steps do not occur in pairs. The Daubechies 4 transform is an example, where there is a U operation followed by a P operation and another U operation. Furthermore, in the last two steps, in (16) and (17), we add a new type of operation which is called normalization, or sometimes rescaling. The resulting algorithm is applied to a signal s_j as follows

s^(1)_{j-1}[n] = s_j[2n] + √3 s_j[2n+1]   (13)
d^(1)_{j-1}[n] = s_j[2n+1] - (1/4)√3 s^(1)_{j-1}[n] - (1/4)(√3 - 2) s^(1)_{j-1}[n-1]   (14)
s^(2)_{j-1}[n] = s^(1)_{j-1}[n] - d^(1)_{j-1}[n+1]   (15)
s_{j-1}[n] = ((√3 - 1)/√2) s^(2)_{j-1}[n]   (16)
d_{j-1}[n] = ((√3 + 1)/√2) d^(1)_{j-1}[n]   (17)
Since there is more than one U operation, we have used superscripts on the s and d signals in order to tell them apart. Note that in the normalization steps we have

((√3 - 1)/√2) · ((√3 + 1)/√2) = 1 .

Normalization does not change the information in the signal, but is merely a means to have certain transform properties satisfied. The Daubechies 4 equations are derived from the classical wavelet theory by a procedure called Euclidean factorization, see Sect. "Lifting and Filter Banks". While the Daubechies 4 transform does have a well-described origin in mathematics (see for instance Ten Lectures on Wavelets [1] by Daubechies), it becomes somewhat unclear in the lifting step version what exactly this transform can provide. In particular, starting with an update step does not conform with our previous exposition of the lifting approach. However, this transform has one important property not possessed by the linear prediction transform CDF(2,2), or indeed by any predicting lifting steps based on polynomial interpolation: in a frequency interpretation of the wavelet transforms, the Daubechies 4 behaves somewhat nicer than CDF(2,2). This is because the former is an orthogonal transform, while the latter is a biorthogonal transform. We refer to other wavelet articles in this collection for more information on this subject. While the Daubechies 4 is lacking a proper interpretation in terms of prediction and update, and thus may appear to be less appealing to include in the lifting concept, its ability to be written as lifting steps is in fact a consequence of the pleasant fact that all wavelet transforms (no matter their origin and design criteria) can be written as lifting steps. The two examples then show that some lifting steps come from a sample-by-sample design method, while others may come from completely different design methods (the Daubechies transform was designed within the field of filter theory to have certain nice frequency related properties).
This article touches upon the filter subject, see Sect. "Lifting and Filter Banks". For a more in-depth coverage of the subject, we refer the reader to other articles in this collection. To find the inverse transform we have to use the prescription given above. We do the steps in reverse order and with the signs reversed. The normalization is undone by multiplication by the inverse constants. The result is

d^(1)_{j-1}[n] = ((√3 - 1)/√2) d_{j-1}[n]   (18)
s^(2)_{j-1}[n] = ((√3 + 1)/√2) s_{j-1}[n]   (19)
s^(1)_{j-1}[n] = s^(2)_{j-1}[n] + d^(1)_{j-1}[n+1]   (20)
s_j[2n+1] = d^(1)_{j-1}[n] + (1/4)√3 s^(1)_{j-1}[n] + (1/4)(√3 - 2) s^(1)_{j-1}[n-1]   (21)
s_j[2n] = s^(1)_{j-1}[n] - √3 s_j[2n+1] .   (22)
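Steps (13)–(17) and their inversion (18)–(22) can be sketched directly in Python. This illustration is not from the article; it uses zero padding for the out-of-range coefficients, as discussed next, and also checks the two vanishing moments of the Daubechies 4 wavelet (details of a linear ramp vanish away from the boundary).

```python
import math

SQ3, SQ2 = math.sqrt(3), math.sqrt(2)

def daub4_step(s):
    """Daubechies 4 direct lifting step, eqs (13)-(17), with zero padding."""
    N = len(s) // 2
    s1 = [s[2*n] + SQ3 * s[2*n + 1] for n in range(N)]                  # (13)
    d1 = [s[2*n + 1] - SQ3/4 * s1[n]
          - (SQ3 - 2)/4 * (s1[n - 1] if n else 0.0) for n in range(N)]  # (14)
    s2 = [s1[n] - (d1[n + 1] if n + 1 < N else 0.0) for n in range(N)]  # (15)
    return [(SQ3 - 1)/SQ2 * v for v in s2], [(SQ3 + 1)/SQ2 * v for v in d1]

def daub4_inv(sc, dc):
    """Inverse step, eqs (18)-(22), using the same zero padding."""
    N = len(sc)
    d1 = [(SQ3 - 1)/SQ2 * v for v in dc]                                # (18)
    s2 = [(SQ3 + 1)/SQ2 * v for v in sc]                                # (19)
    s1 = [s2[n] + (d1[n + 1] if n + 1 < N else 0.0) for n in range(N)]  # (20)
    s = [0.0] * (2 * N)
    for n in range(N):
        s[2*n + 1] = (d1[n] + SQ3/4 * s1[n]
                      + (SQ3 - 2)/4 * (s1[n - 1] if n else 0.0))        # (21)
        s[2*n] = s1[n] - SQ3 * s[2*n + 1]                               # (22)
    return s

x = [56.0, 40.0, 8.0, 24.0, 48.0, 48.0, 40.0, 16.0]
assert max(abs(a - b) for a, b in zip(daub4_inv(*daub4_step(x)), x)) < 1e-9
# two vanishing moments: details of a linear ramp vanish away from the edge
_, d = daub4_step([float(n) for n in range(16)])
assert all(abs(v) < 1e-9 for v in d[1:])
```

Because the inverse uses the same padding convention as the direct step, the roundtrip is exact up to floating point rounding even at the boundary.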
Incidentally, this transform illustrates one of the problems that has to be faced in computer implementations. For example, to compute d^(1)_{j-1}[0] we need to know s^(1)_{j-1}[0] and s^(1)_{j-1}[-1]. But to compute s^(1)_{j-1}[-1] one needs the values s_j[-2] and s_j[-1], which are not defined. The easiest solution to this problem is to use zero padding to get a sample at this index value (zero padding means that all undefined samples are defined to be 0). There exist other, more sophisticated methods, but that topic is beyond the scope of this article.

Finally, let us repeat our first example in the above notation. We also add a normalization step. In this form the transform is known as the Haar transform in the literature (we used this term previously, but in fact the scaling needs to be included for the transform to be the true Haar transform). The direct transform is

d^(1)_{j-1}[n] = s_j[2n+1] - s_j[2n]   (23)
s^(1)_{j-1}[n] = s_j[2n] + (1/2) d^(1)_{j-1}[n]   (24)
s_{j-1}[n] = √2 s^(1)_{j-1}[n]   (25)
d_{j-1}[n] = (1/√2) d^(1)_{j-1}[n]   (26)
and the inverse transform is given by

d^(1)_{j-1}[n] = √2 d_{j-1}[n]   (27)
s^(1)_{j-1}[n] = (1/√2) s_{j-1}[n]   (28)
s_j[2n] = s^(1)_{j-1}[n] - (1/2) d^(1)_{j-1}[n]   (29)
s_j[2n+1] = s_j[2n] + d^(1)_{j-1}[n] .   (30)
We note that this transform can be applied to a signal of length 2^j without using zero padding. It turns out to be essentially the only transform with this property.
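The normalized Haar pair (23)–(26) and (27)–(30) can be written out as a short Python sketch (not from the article). The normalization makes the step orthogonal, which shows up numerically as preservation of the signal energy.

```python
import math

def haar_step(s):
    """Normalized Haar lifting step, eqs (23)-(26)."""
    N = len(s) // 2
    d1 = [s[2*n + 1] - s[2*n] for n in range(N)]        # (23)
    s1 = [s[2*n] + d1[n] / 2 for n in range(N)]         # (24)
    sc = [math.sqrt(2) * v for v in s1]                 # (25)
    dc = [v / math.sqrt(2) for v in d1]                 # (26)
    return sc, dc

def haar_inv(sc, dc):
    """Inverse, eqs (27)-(30)."""
    d1 = [math.sqrt(2) * v for v in dc]                 # (27)
    s1 = [v / math.sqrt(2) for v in sc]                 # (28)
    s = []
    for n in range(len(sc)):
        even = s1[n] - d1[n] / 2                        # (29)
        s += [even, even + d1[n]]                       # (30)
    return s

x = [56.0, 40.0, 8.0, 24.0, 48.0, 48.0, 40.0, 16.0]
sc, dc = haar_step(x)
# orthogonality: the normalized transform preserves the squared norm
assert abs(sum(v * v for v in sc + dc) - sum(v * v for v in x)) < 1e-9
assert max(abs(a - b) for a, b in zip(haar_inv(sc, dc), x)) < 1e-12
```

No index in (23)–(30) ever leaves the range 0 to 2^j - 1, which is the "no zero padding needed" property noted above.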
Wavelets and the Lifting Scheme, Figure 7 DWT and inverse DWT over four scales
Discrete Wavelet Transform We have now established a method for transforming one signal into another signal that has a natural division into two parts of equal length. This transform is the discrete wavelet transform (DWT). In many cases one is not interested in the actual implementation of the transform, though, and the DWT is in those cases often shown as a box with one input and two outputs. The direct transform is sometimes called ‘analysis’, and we will represent such a direct transform by the symbol Ta (‘a’ for analysis).
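The building-block view can be sketched in Python (this sketch is not from the article). Ta is any function mapping a signal to its two outputs, Ts its inverse; the unnormalized Haar pair from the text serves as a stand-in building block, and the k-scale transform is obtained by cascading.

```python
def Ta(s):                      # analysis building block: s -> (s_coarse, d)
    even, odd = s[0::2], s[1::2]
    d = [o - e for e, o in zip(even, odd)]
    return [e + di / 2 for e, di in zip(even, d)], d

def Ts(s_coarse, d):            # synthesis building block: (s_coarse, d) -> s
    out = []
    for sc, di in zip(s_coarse, d):
        even = sc - di / 2
        out += [even, even + di]
    return out

def dwt(s, k):
    """Apply Ta k times to the average output, keeping each difference."""
    details = []
    for _ in range(k):
        s, d = Ta(s)
        details.append(d)
    return s, details           # s has length len(original) / 2**k

def idwt(s, details):
    for d in reversed(details):
        s = Ts(s, d)
    return s

s0, dets = dwt([56, 40, 8, 24, 48, 48, 40, 16], 3)
assert s0 == [35.0]             # the single remaining s entry is the mean
assert idwt(s0, dets) == [56, 40, 8, 24, 48, 48, 40, 16]
```

As the text stresses, Ts must be the inverse of the Ta actually used; cascading a mismatched pair produces meaningless output.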
In our notation the input would be s_j and the outputs would be s_{j-1} and d_{j-1}. The inverse transform is then naturally shown as another box with two inputs and one output, and is sometimes called 'synthesis'. Therefore, the inverse transform is represented by the symbol Ts. The contents of the Ta box could be the direct Haar transform as given by (23)–(26), and the contents of the Ts box could be the inverse Haar transform as given by (27)–(30). Obviously, we must make sure to use the inverse transform corresponding to the applied direct transform; otherwise, the results will be meaningless. We can now combine these building blocks to get a multi-scale discrete wavelet transform. We perform the transform over a certain number of scales k, meaning that we combine k of the building blocks, as shown in Fig. 3 in the case of 2 scales, and with the upper-most diagram in Fig. 7 in the case of 4 scales. In the latter figure we use the building block representation of the individual steps. It is possible to apply the DWT to the difference output of the transform as well, instead of just the average (or mean) output. This is known as the wavelet packet transform; it is commonly found in the literature, and is often used in practical implementations. This concept introduces new possibilities and methods. We refer to other articles presenting this subject.

Interpretation
The main topic for this collection of articles is 'Wavelets'. This term refers to a function, often denoted by ψ in the wavelet literature, that takes many shapes and determines the properties of the wavelet transform. So what is the relation between a series of lifting steps and this wavelet function, and is it useful to know this relation if one just wants to use the wavelet transform? After all, the lifting step method works quite well without any apparent need for knowledge of the wavelet function. It turns out that while the transform can be implemented easily with just a few equations, the understanding of what is actually happening to a signal when it is transformed relies heavily on a proper interpretation of the transform. Here we will present an interpretation based on linear algebra, which will provide some useful insights. However, this presentation is incomplete; we state some results from the general theory, and illustrate them with explicit computations, but we do not discuss the general theory in detail, since this is beyond the scope of this article. In the following discussion we will continue to use the Haar transform example from the Introduction, partly because it allows for simple and yet convincing computations, partly because we believe that the reader has become familiar with this example by now. We will end with an interpretation of the Daubechies 4 transform.
Using Linear Algebra

We saw above that the first of the eight numbers in the transformed signal in the first example in the Introduction, that is, the number 35, had a special meaning; it was the mean of the entire signal. This property (though not the number itself) was actually independent of the signal; for any signal that first number s_0[0] will always be the mean of the signal. In fact, it was a property derived from the transform, and it provided an interpretation of the number 35 in the transformed signal. Similar interpretations can be made of the other numbers in the transformed signal. First, we examine the original example in Table 1 once more.

56  40   8  24  48  48  40  16
48  16  48  28   8  -8   0  12
32  38  16  10   8  -8   0  12
35  -3  16  10   8  -8   0  12

This table was constructed from top to bottom by multiple Haar transforms, and the 'signal mean' property was derived using equations, see for instance (9). However, we can reach the same conclusion by examining signal samples, and by starting at the bottom. To isolate the property of interest here, we start with a bottom row consisting of zeros at the entries we are not interested in interpreting, and a 1 at the entry we would like to interpret, that is, the first of the eight numbers. Then we do a three scale synthesis of the signal and get

1  1  1  1  1  1  1  1
1  1  1  1  0  0  0  0
1  1  0  0  0  0  0  0
1  0  0  0  0  0  0  0

Note that the normalization steps (25) and (26) have been omitted to make the table easier to read (this does not interfere with the interpretation, even though this fact may not be clear at this point). It is obvious that if we had started with any other number than 1, this number would then be the one repeated eight times in the top row after the three scale synthesis. This computation indicates that the entry at location 0 (recall that our indexing starts with 0) in the fourth row has an equal contribution from all eight original samples, and thus is the mean value. Let us repeat this procedure with the remaining seven entries in the fourth row. Here are the first two, having a 1 at entry number 1 and 2.

1  1  1  1 -1 -1 -1 -1        1  1 -1 -1  0  0  0  0
1  1 -1 -1  0  0  0  0        1 -1  0  0  0  0  0  0
1 -1  0  0  0  0  0  0        0  0  1  0  0  0  0  0
0  1  0  0  0  0  0  0        0  0  1  0  0  0  0  0

Examining the first row in the left-most table reveals that the value at entry number 1, i.e. the value 1 in the fourth row, can be interpreted as the mean of the first four samples minus the mean of the last four samples. In filter terms this is roughly equal to a band pass filtering just above DC, i.e. DC does not contribute to entry number 1, and neither would any frequency higher than a single cycle. The results of all eight computations are now represented using notation from linear algebra; they are inserted as columns in a matrix
W_s^(3) =
[ 1   1   1   0   1   0   0   0
  1   1   1   0  -1   0   0   0
  1   1  -1   0   0   1   0   0
  1   1  -1   0   0  -1   0   0
  1  -1   0   1   0   0   1   0
  1  -1   0   1   0   0  -1   0
  1  -1   0  -1   0   0   0   1
  1  -1   0  -1   0   0   0  -1 ]   (31)

which is denoted by W with a subscript s for synthesis for the following reason. A signal of length 8 is a vector in the vector space R^8. The process described above of reconstructing the signal whose transform is one of the canonical basis vectors in R^8 is the same as finding the columns in the matrix (with respect to the canonical basis) of the three scale synthesis transform. This is just the well known definition from linear algebra of the matrix of a linear transform. This in turn means that applying this matrix to any length 8 signal performs the inverse Haar
transform. For example, multiplying it with the fourth row of Table 1 (regarded as a column vector) produces the first row of that same table. The matrix of the direct three scale transform is obtained in a similar fashion, by computing the transforms of the eight canonical basis vectors in R^8. In other words, we start with the signal [1, 0, 0, 0, 0, 0, 0, 0] and carry out the three scale transform as shown here.

1    0    0    0    0    0    0    0
1/2  0    0    0    1/2  0    0    0
1/4  0    1/4  0    1/2  0    0    0
1/8  1/8  1/4  0    1/2  0    0    0

Computing the other seven transforms and inserting the resulting signals as column vectors gives

W_a^(3) =
[ 1/8   1/8   1/8   1/8   1/8   1/8   1/8   1/8
  1/8   1/8   1/8   1/8  -1/8  -1/8  -1/8  -1/8
  1/4   1/4  -1/4  -1/4   0     0     0     0
   0     0     0     0    1/4   1/4  -1/4  -1/4
  1/2  -1/2   0     0     0     0     0     0
   0     0    1/2  -1/2   0     0     0     0
   0     0     0     0    1/2  -1/2   0     0
   0     0     0     0     0     0    1/2  -1/2 ]   (32)

Multiplying the matrices we find W_s^(3) W_a^(3) = I and W_a^(3) W_s^(3) = I, where I denotes the 8 × 8 identity matrix. This is the linear algebra formulation of perfect reconstruction, or of the invertibility of the three scale transform. Another interesting property to note here is that the structures of the two matrices are quite similar. The transpose of the one matrix bears a striking resemblance to the other matrix. In fact, if the above computations of the direct and inverse transforms of the canonical basis vectors had used the normalization steps, we would get an orthogonal matrix and its transpose, which by elementary linear algebra equals its inverse. Any wavelet transform with this property is called an orthogonal transform (i.e. all columns are orthogonal to each other and all have norm 1). All transforms in the Daubechies family have this property, and so do other well-known families like Coiflets and Symlets. The CDF families are not orthogonal (but rather biorthogonal), and consequently the structure of the CDF matrices is slightly more complicated. It is clear that analogous constructions can be carried out for signals of length 2^j and transforms to all scales k = 1, ..., j.

The linear algebra point of view is useful in understanding the theory. However, if one is trying to carry out numerical computations, then it is a bad idea to use the matrix formulation. The direct k scale wavelet transform using the lifting steps requires a number of operations (additions, multiplications, etc.) on the computer which is proportional to the length L of the signal. If we perform the transform using its matrix, then in general a number of operations proportional to L^2 is needed. For longer signals this makes a huge difference for the computation time.

From Transform to Wavelets
Let us now consider the column vectors in (31) as a sampling of continuous, piecewise constant signals on the interval [0, 1]. These continuous signals are shown in Fig. 8. In this figure the signals have been grouped according to their origin. Recall that the first column came from a three scale synthesis of [1, 0, 0, 0, 0, 0, 0, 0]. This is equivalent to letting s_0 = 1 and d_0 = 0, d_1 = 0, and d_2 = 0 and using a three scale inverse DWT as shown in Fig. 7. The four right-most graphs in Fig. 8 can be obtained by inverse transforms where d_2 equals [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], and [0, 0, 0, 1], respectively, and s_0 = 0, d_0 = 0, and d_1 = 0 in all four cases. The s group consists of only one signal, which is constant. The d groups all consist of signals with the same shape (a positive signal value followed by a negative signal value). The three d groups differ by a factor 2 scaling in the x-axis direction, and the signals in the d_1 and d_2 groups differ by translation along the x-axis. This pattern can be contained in a single equation. First, define the function

ψ(t) = 1 for t ∈ [0, 1/2) ,  ψ(t) = -1 for t ∈ [1/2, 1] ,  ψ(t) = 0 otherwise .   (33)

This is in fact the d_0 signal. Now all the other d signals can be obtained by

ψ_{j,k}(t) = ψ(2^j t - k) ,

where j is equal to 0, 1, or 2 for the groups d_0, d_1, and d_2, and k ranges from 0 to 2^j - 1. The function ψ is the wavelet. Sometimes it is called the mother wavelet to signify that a wavelet transform consists of numerous similar functions originating from one function. If we include the normalization step the equation reaches its final form

ψ_{j,k}(t) = 2^{j/2} ψ(2^j t - k) .
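The correspondence between the continuous functions ψ_{j,k} and the columns of (31) is easy to check numerically. The short Python sketch below is not from the article; it samples the unnormalized ψ_{j,k} at the eight midpoints of [0, 1] and compares with the ±1 patterns of the d columns.

```python
def psi(t):
    """The Haar mother wavelet, eq (33)."""
    if 0 <= t < 0.5:
        return 1
    if 0.5 <= t <= 1:
        return -1
    return 0

def psi_jk(j, k, t):
    """Unnormalized dilated and translated wavelet: psi(2^j t - k)."""
    return psi(2**j * t - k)

samples = [(n + 0.5) / 8 for n in range(8)]   # midpoints of the 8 subintervals
assert [psi_jk(0, 0, t) for t in samples] == [1, 1, 1, 1, -1, -1, -1, -1]
assert [psi_jk(1, 0, t) for t in samples] == [1, 1, -1, -1, 0, 0, 0, 0]
assert [psi_jk(2, 1, t) for t in samples] == [0, 0, 1, -1, 0, 0, 0, 0]
```

Each sampled pattern is exactly one of the d columns of the synthesis matrix (31), with j selecting the group and k the translation.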
Wavelets and the Lifting Scheme, Figure 8 Graphs showing the 8 column vectors in (31) on the interval [0, 1]. The graphs are sorted according to the multiscale decomposition shown in Fig. 7
This framework does not include the s_0 signal. However, another equation combines s_0 and d_0. We noticed previously that while the vector [1, 1, 1, 1, 1, 1, 1, 1], the first column in (31), represents the mean of the original signal, the vector [1, 1, 1, 1, -1, -1, -1, -1] is the mean of the first half of the original signal minus the mean of the other half of the signal. This link can be expressed with continuous functions. First, define

φ(t) = 1 for t ∈ [0, 1] ,  φ(t) = 0 otherwise ,

that is, the function corresponding to the continuous s_0 signal. This function is called the scaling function in the wavelet literature. Now we can link s_0 and d_0 with

ψ(t) = φ(2t) - φ(2t - 1) .   (34)

This equation is called the two-scale equation, and is found in virtually any literature presenting the wavelet theory. Its general form is

ψ(t) = Σ_n g_n φ(2t - n) ,   (35)

and it is valid for all orthogonal wavelet transforms. For the Haar transform we have g_0 = 1 and g_1 = -1, and g_n = 0 for all indices other than 0 and 1. The g_n for the Daubechies 4 transform are

g_0 = (1 - √3)/(4√2) ,  g_1 = (3 - √3)/(4√2) ,
g_2 = (3 + √3)/(4√2) ,  g_3 = (1 + √3)/(4√2) .
There is also a two-scale equation that produces the scaling function,

φ(t) = Σ_n h_n φ(2t - n) .   (36)

For orthogonal transforms h is obtained from g by h_n = (-1)^n g_{1-n}, that is, reverse sequence order with alternating change of sign and index shift by 1.
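The filter-level consequences of orthogonality can be verified with a quick numerical check. This Python sketch is not from the article and takes the printed Daubechies 4 coefficients at face value; it tests the properties that are insensitive to the sign convention, namely unit norm and orthogonality to double shifts.

```python
import math

s3, r = math.sqrt(3), 4 * math.sqrt(2)
# the Daubechies 4 coefficients as printed in the text
g = [(1 - s3) / r, (3 - s3) / r, (3 + s3) / r, (1 + s3) / r]

# h_n = (-1)^n g_{1-n}; with g supported on 0..3, h lives on n = -2..1
h = {n: (-1) ** n * g[1 - n] for n in range((-2), 2)}

# orthonormality of the filter: unit norm, and orthogonality to double shifts
assert abs(sum(v * v for v in g) - 1.0) < 1e-12
assert abs(g[0] * g[2] + g[1] * g[3]) < 1e-12
assert abs(sum(v * v for v in h.values()) - 1.0) < 1e-12
```

The same two conditions on h are exactly what makes the normalized transform matrix orthogonal, as discussed in the linear algebra section.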
Wavelet and Scaling Function for Daubechies 4

The appearance of the Haar wavelet function through transform synthesis in the previous section may appear to be a special property of the simple Haar transform. However, the principle applies to all wavelets, and the method for generating the wavelet function can be used to obtain any wavelet, at least as a graph. Actually, in many cases no closed form of the wavelet exists. This includes the Daubechies family. Thus, to see the shape of the Daubechies 4 wavelet we start again with an 'all but one entry is zero' signal. We saw that when obtaining the wavelet graph it does not matter which entry is non-zero, as long as it is not the first entry. The only difference in the output is the dilation and translation of the resulting function. As an example we therefore choose [0, 0, 0, 0, 0, 1, 0, 0]. We perform an inverse three scale transform using the inverse Daubechies 4 Equations (18)–(22) and plot the result, see Fig. 9. Note that the problem with finite signals needs to be addressed in this inverse transform. For our purpose zero padding is sufficient. This figure contains very little information. But let us now repeat the procedure for vectors of length 8, 32, 128, and 512, applied to a vector with a single 1 as its sixth entry. This requires inverse transforms over 3, 5, 7, and 9 scales, respectively. We fit each transform to the same interval, as if we have a finer and finer resolution. The result is shown in Fig. 10. This figure shows that the graphs rapidly approach a limiting graph as we increase the length of the vector. This is a result that can be established rigorously, but it is not easy to do so, and it is beyond the scope of this article. One can interpret the limiting function in Fig. 10 as a function whose values, sampled at appropriate points, represent the entries in the inverse transform of a vector of length 2^N with a single 1 as its sixth entry. For N just moderately large, say N = 12, this is a very good approximation to the actual values. See Fig. 11 for the result for N = 12, i.e. a vector of length 4096.
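The graph-generating procedure just described can be sketched as follows. This Python illustration is not from the article; it reuses the inverse lifting step (18)–(22) with zero padding and synthesizes the vector whose transform is the sixth canonical basis vector. For this entry the support stays away from the boundary, so the orthogonality of the Daubechies 4 transform shows up as unit norm (and zero sum, since the result samples a wavelet).

```python
import math

SQ3, SQ2 = math.sqrt(3), math.sqrt(2)

def daub4_inv_step(sc, dc):
    """Inverse Daubechies 4 lifting step, eqs (18)-(22), zero padding."""
    N = len(sc)
    d1 = [(SQ3 - 1)/SQ2 * v for v in dc]
    s2 = [(SQ3 + 1)/SQ2 * v for v in sc]
    s1 = [s2[n] + (d1[n + 1] if n + 1 < N else 0.0) for n in range(N)]
    s = [0.0] * (2 * N)
    for n in range(N):
        s[2*n + 1] = (d1[n] + SQ3/4 * s1[n]
                      + (SQ3 - 2)/4 * (s1[n - 1] if n else 0.0))
        s[2*n] = s1[n] - SQ3 * s[2*n + 1]
    return s

def wavelet_graph(length, scales, hot):
    """Inverse multiscale transform of a unit vector: a wavelet approximation."""
    y = [0.0] * length
    y[hot] = 1.0
    for k in range(scales):          # undo the coarsest split first
        half = length >> (scales - k)
        y = daub4_inv_step(y[:half], y[half:2 * half]) + y[2 * half:]
    return y

y = wavelet_graph(512, 9, 5)         # length 512, a one at the sixth entry
assert len(y) == 512
assert abs(sum(v * v for v in y) - 1.0) < 1e-6   # orthogonality: unit norm
assert abs(sum(y)) < 1e-6                        # a wavelet has zero mean
```

Plotting y against n/512 for increasing lengths reproduces the convergence toward the limiting graph described above.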
Wavelets and the Lifting Scheme, Figure 9 Inverse Daubechies 4 of [0, 0, 0, 0, 0, 1, 0, 0] over three scales
Wavelets and the Lifting Scheme, Figure 10 Inverse Daubechies 4 of sixth basis vector, length 8, 32, 128 and 512
Wavelet and Scaling Function for CDF(2,2)

Finally, let us repeat the procedure for the CDF(2,2) transform in (10) and (11), and at the same time illustrate how the wavelet is translated depending on the placement of the 1 in the otherwise zero vector. An example is given in Fig. 12. The difference between the graphs in Fig. 12 and Fig. 11 is striking. It reflects the result that the Daubechies 4 wavelet has very little regularity (it is not differentiable), whereas the CDF(2,2) wavelet is a piecewise linear function. Recall that the Haar and Daubechies 4 transforms are orthogonal transforms, which means that in linear algebra terms the transpose of the synthesis matrix equals the analysis matrix.

Wavelets and the Lifting Scheme, Figure 11 Inverse Daubechies 4 of sixth and first basis vectors, both length 4096. The result is the Daubechies 4 wavelet on the left and the scaling function on the right

A consequence of this is that the same wavelet and scaling function are used for analysis and for synthesis. This is not the case for CDF(2,2), which is a biorthogonal transform. Therefore, it has one synthesis pair of scaling function and wavelet, and an analysis pair. If we use the forward transform instead of the inverse, we can obtain the analysis functions. They are shown in Fig. 13. These functions are quite complicated, and it is interesting to see that while the synthesis wavelet and scaling function are very simple functions (we have not shown the scaling function of CDF(2,2), though), the analysis functions of that same transform are rather complex.
Wavelets and the Lifting Scheme, Figure 12 Inverse CDF(2,2) of three basis vectors of length 64, entry 40, or 50, or 60, equal to 1 and the remaining entries equal to zero. The result is the same function (the wavelet) with different translations
As an end remark, notice that all CDF(2,2) functions are symmetrical. This is a special property that can only be achieved by biorthogonal wavelets. Orthogonal wavelets can come close to symmetry, but they can never become completely symmetrical. An almost symmetrical family has been constructed, called the Symlets, see for example [4].

The General Case

The above computations lead us to the conclusion that there are just two functions underlying the direct transform, and another two functions underlying the inverse transform, in the sense that if we take sufficiently long vectors, say of length 2^N, and perform a k scale transform, with k large, then we get values that are sampled values of one of the underlying functions. More precisely, inverse transforms of unit vectors with a one in places from 1 to 2^{N-k} yield translated copies of the scaling function. Inverse transforms of unit vectors with a one in places from 2^{N-k} + 1 to 2^{N-k+1} yield translated copies of the wavelet. Finally, inverse transforms of unit vectors with a one at places from 2^{N-k+1} + 1 to 2^N yield scaled and translated copies of the wavelet. These results are strictly correct only in a limiting sense, and they are not easy to establish. There is one further complication which we have omitted to state clearly. If one performs the procedure above with a 1 close to the start or end of the vector, then there will in general be some strange effects, depending on how the transform has been implemented. We refer to these as boundary effects. They depend on how one makes up for missing samples in computations near the start or end of a finite vector, the so-called boundary corrections. We have already mentioned zero padding as one of the correction methods. This is what we have used indirectly in plotting for example Fig. 11, where we have taken a vector with a one at place 24, and zeroes everywhere else. Readers interested in a rigorous treatment of the interpretation of the transforms given here, and with the required mathematical background, are referred to the literature, for example the books by I. Daubechies [1], S. Mallat [4], and M. Vetterli and J. Kovačević [10,11]. Note that these books base their treatment of the wavelet transforms on the concepts of multiresolution analysis and filter theory rather than lifting.
Lifting in the z-Transform Representation

Lifting and Filter Banks

So far, the lifting technique has been carried out in the time domain: the correlation of samples was based on odd and even entries, interpretation was based on the signal in the time domain, and every reference to the signal was made by s and d, i.e. time representations of the signals. However, there is much to be learned about wavelets and about lifting when turning to the frequency domain. The lifting method is a special case of a standard frequency-based signal processing method called filter banks. A filter bank is a series of bandpass filters which separates the input signal into a number of components, each with a distinct range of frequencies from the original signal. Often, in a filter bank, the filters are designed such that no information is lost, that is, it is possible to reconstruct the original signal from the components. The outputs of a filter bank are called subband signals, and it has as many subbands as there are filters in the filter bank.

The following introduction to lifting in the frequency domain requires some knowledge of the z-transform as well as basic Fourier theory. To ease the reading, we briefly state some important relations for the z-transform. The transform maps a signal x = {x[n]} to a function defined in the complex plane

X(z) = Σ_{n∈Z} x[n] z^{-n} .

This is equivalent to the Fourier transform when z = e^{jω}, and we use capital X to denote the z-transform of x. In the following exposition we will need up and down sampling by two of a signal. In the time domain this is accomplished by inserting zeros between all samples and by removing every other sample, respectively. Explicitly, if x = {x[n]}, then the down sampled sequence x_{2↓} is given by x_{2↓}[n] = x[2n]. The up sampled sequence x_{2↑} is given by x_{2↑}[n] = x[n/2], if n is even, and x_{2↑}[n] = 0, if n is odd. In the z-transform, up sampling of X(z) is accomplished by X(z^2), and down sampling is accomplished by

(1/2) ( X(z^{1/2}) + X(-z^{1/2}) ) .   (37)

Finally, we note that in this section all signals have finite length, and thus the z-transform has a finite number of non-zero terms. The z-transform is therefore a Laurent polynomial, which is a polynomial in the variables z and z^{-1}.
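As a concrete illustration, the sampling operations above, and the identity (37), can be checked in a few lines of Python. The sketch below is illustrative only, not part of the original presentation.

```python
# Down and up sampling by two in the time domain, plus a numerical
# check of the z-transform identity (37) for down sampling.
def downsample(x):
    return x[::2]                      # x_down[n] = x[2n]

def upsample(x):
    up = [0] * (2 * len(x))
    up[::2] = x                        # zeros at the odd positions
    return up

def ztrans(x, z):
    # X(z) = sum_n x[n] z^{-n} for a finite signal starting at n = 0
    return sum(c * z ** (-n) for n, c in enumerate(x))

x = [1, 2, 3, 4, 5, 6]
print(downsample(x))                   # [1, 3, 5]
print(upsample([1, 3, 5]))             # [1, 0, 3, 0, 5, 0]

# (37): the z-transform of the down sampled signal equals
# (1/2) (X(z^{1/2}) + X(-z^{1/2}))
z = 1.7
lhs = ztrans(downsample(x), z)
rhs = 0.5 * (ztrans(x, z ** 0.5) + ztrans(x, -(z ** 0.5)))
assert abs(lhs - rhs) < 1e-12
```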
The lifting method consists of three basic operations: it is composed of splitting the signal in odd and even samples, predicting an odd sample, and updating an even sample. This is shown in Fig. 3. The splitting is given in the z-transform representation by

X(z) = X0(z^2) + z^{-1} X1(z^2) ,   (38)
where

X0(z) = Σ_n x[2n] z^{-n} = (1/2) ( X(z^{1/2}) + X(-z^{1/2}) ) ,   (39)

X1(z) = Σ_n x[2n+1] z^{-n} = (1/2) z^{1/2} ( X(z^{1/2}) - X(-z^{1/2}) ) ,   (40)
are the z-transforms of the even and odd samples. Note that the second equality in both formulas comes from (37). We represent this decomposition by the left side of the diagram in Fig. 14. In the diagram the time shift is inserted prior to the down sampling, rather than after as in (40), since this gives multiplication with z instead of z^{1/2}. The decomposition in (38) is called a polyphase decomposition (with two components). The inverse operation is obtained by reading Eq. (38) from right to left. The equation tells us that we can obtain X(z) from X0(z) and X1(z) by first up sampling the two components by 2, then shifting X1(z^2) one time unit right (by multiplication by z^{-1}), and finally adding the two components. We represent this reconstruction by the right hand side of the diagram in Fig. 14. Notice that, as with the lifting method, this inversion simply 'undoes' the changes made to the signal. The transform pair represented in Fig. 14 is sometimes called the 'lazy' wavelet transform and its inverse in the literature.
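The split-and-merge pair just described — the lazy wavelet transform and its inverse — is a one-liner in the time domain. The following sketch is our own illustration, not the article's code.

```python
def lazy_split(x):
    # polyphase decomposition: even and odd samples (X0 and X1)
    return x[::2], x[1::2]

def lazy_merge(even, odd):
    # inverse: interleave the two components again
    x = [0] * (len(even) + len(odd))
    x[::2], x[1::2] = even, odd
    return x

x = [7, 1, 8, 2, 9, 3]
even, odd = lazy_split(x)
print(even, odd)                      # [7, 8, 9] [1, 2, 3]
assert lazy_merge(even, odd) == x     # the 'lazy' transform is invertible
```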
Wavelets and the Lifting Scheme
Wavelets and the Lifting Scheme, Figure 13 Wavelet and Scaling function for CDF(2,2)
Prediction and Update Steps

Let us now see how we can implement the prediction step from Sect. "Lifting" in the z-transform representation. The prediction technique was to form a linear combination of the even entries and then subtract the result from the odd entry under consideration. The linear combination was formed independently of the index of the odd sample under consideration, and based only on the relative location of the even entries. For example, in the CDF(2,2) transform the first step in (10) can be implemented as X1(z) - T(z)X0(z), where T(z) = (1/2)(1 + z). This is because T(z)X0(z) is the convolution t * x0 in the time domain, which is exactly a linear combination of the even entries with weights t_n. Explicitly,

X1(z) - T(z)X0(z)
  = Σ_n x[2n+1] z^{-n} - (1/2)(1 + z) Σ_n x[2n] z^{-n}
  = Σ_n x[2n+1] z^{-n} - (1/2) Σ_n x[2n] z^{-n} - (1/2) Σ_n x[2n] z^{-n+1}
  = Σ_n ( x[2n+1] - (1/2)(x[2n] + x[2n+2]) ) z^{-n} ,

which is the z-transform representation of the right hand side of (10). The transition is described by matrix multiplication as in

[ X0(z) ; X1(z) - T(z)X0(z) ] = [ 1 , 0 ; -T(z) , 1 ] [ X0(z) ; X1(z) ] .

Here we use 2x2 matrices whose entries are Laurent polynomials. An entirely analogous computation shows that if we define S(z) = (1/4)(1 + z^{-1}), then the update step in (11) is implemented in the z-transform representation as multiplication by the matrix

[ 1 , S(z) ; 0 , 1 ] .

The normalization step, as for example given in (26) and (25), can be implemented by multiplication by a matrix of the form

[ K , 0 ; 0 , 1/K ] ,

where K > 0 is a constant.

Entire Transform

Since every prediction step is subtraction of a linear combination of even samples from the odd samples, any prediction step can be implemented by means of multiplication by a matrix of the form

P(z) = [ 1 , 0 ; -T(z) , 1 ] .   (41)

The same result applies to any update step, which can be implemented by multiplication by a matrix of the form

U(z) = [ 1 , S(z) ; 0 , 1 ] .   (42)

Here T(z) and S(z) are both Laurent polynomials. The general one scale DWT is a combination of multiple prediction and update steps and one normalization step. This is shown in Fig. 6 (although the normalization is not included in the figure).

Wavelets and the Lifting Scheme, Figure 14
Splitting in even and odd components by means of down sampling, followed by reconstruction of the signal from the odd and even components by means of up sampling

In the z-transform representation this becomes

H(z) = [ K , 0 ; 0 , 1/K ] [ 1 , S_N(z) ; 0 , 1 ] [ 1 , 0 ; -T_N(z) , 1 ] ... [ 1 , S1(z) ; 0 , 1 ] [ 1 , 0 ; -T1(z) , 1 ] .   (43)
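The matrix products in (43) can be carried out mechanically once Laurent polynomials are represented as coefficient tables. In the sketch below (our own; the Haar-style steps T(z) = 1, S(z) = 1/2 and K = √2 are an assumed example, not the CDF(2,2) transform of the text) one prediction, one update, and the normalization are multiplied out, and the resulting H(z) is confirmed to have determinant 1.

```python
from math import sqrt

# Laurent polynomials as {power: coefficient} dicts.
def lp_mul(a, b):
    out = {}
    for p, c in a.items():
        for q, d in b.items():
            out[p + q] = out.get(p + q, 0.0) + c * d
    return {p: c for p, c in out.items() if abs(c) > 1e-12}

def lp_add(a, b):
    out = dict(a)
    for p, c in b.items():
        out[p] = out.get(p, 0.0) + c
    return {p: c for p, c in out.items() if abs(c) > 1e-12}

def mat_mul(A, B):
    # 2x2 matrices with Laurent-polynomial entries
    return [[lp_add(lp_mul(A[i][0], B[0][j]), lp_mul(A[i][1], B[1][j]))
             for j in range(2)] for i in range(2)]

one, zero = {0: 1.0}, {}
K = sqrt(2.0)
N = [[{0: K}, zero], [zero, {0: 1.0 / K}]]   # normalization
U = [[one, {0: 0.5}], [zero, one]]           # update step, S(z) = 1/2
P = [[one, zero], [{0: -1.0}, one]]          # prediction step, T(z) = 1
H = mat_mul(mat_mul(N, U), P)

det = lp_add(lp_mul(H[0][0], H[1][1]),
             {p: -c for p, c in lp_mul(H[0][1], H[1][0]).items()})
# lifting steps always give det H(z) = 1 (up to rounding)
assert set(det) == {0} and abs(det[0] - 1.0) < 1e-9
```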
The order of the factors is determined by the order in which the steps are applied. First a prediction step, then an update step, repeated N times, and then finally the normalization step. An important property of the DWT implemented via lifting steps was the invertibility of the transform, as illustrated for example in Fig. 4. It is easy to verify that we have
P(z)^{-1} = [ 1 , 0 ; T(z) , 1 ]   and   U(z)^{-1} = [ 1 , -S(z) ; 0 , 1 ] .   (44)
We note that the inverse matrix is again a matrix with Laurent polynomials as entries. Thus, the inversion of the transform is accomplished simply by inverting each 2x2 matrix and reversing their order:

H^{-1}(z) = G(z) = [ 1 , 0 ; T1(z) , 1 ] [ 1 , -S1(z) ; 0 , 1 ] ... [ 1 , 0 ; T_N(z) , 1 ] [ 1 , -S_N(z) ; 0 , 1 ] [ 1/K , 0 ; 0 , K ] .   (45)
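That each lifting step is undone by the matching inverse step can be demonstrated on a concrete signal. The Haar-style steps below (predict an odd sample by its even neighbor, update with half the difference) are an illustrative choice of ours, not the article's CDF(2,2) example.

```python
def haar_lifting(x):
    # one scale: predict d = odd - even, update s = even + d/2
    even, odd = x[::2], x[1::2]
    d = [o - e for e, o in zip(even, odd)]
    s = [e + di / 2 for e, di in zip(even, d)]
    return s, d

def haar_lifting_inv(s, d):
    # undo the steps in reverse order with opposite signs
    even = [si - di / 2 for si, di in zip(s, d)]
    odd = [di + e for e, di in zip(even, d)]
    x = [0] * (2 * len(s))
    x[::2], x[1::2] = even, odd
    return x

x = [4.0, 6.0, 5.0, 5.0, 7.0, 1.0]
s, d = haar_lifting(x)
assert haar_lifting_inv(s, d) == x    # perfect reconstruction
```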
Multiplying out the matrices in the product defining H(z) in (43), we get a 2x2 matrix with entries which are Laurent polynomials. We use the notation

H(z) = [ H00(z) , H01(z) ; H10(z) , H11(z) ]   (46)

for such a general matrix. Written in matrix notation the DWT is given as

[ Y0(z) ; Y1(z) ] = [ H00(z) , H01(z) ; H10(z) , H11(z) ] [ X0(z) ; X1(z) ] ,   (47)
Wavelets and the Lifting Scheme, Figure 15 One scale DWT in the z-representation as given in (47). This is the polyphase representation and it comes directly from the lifting step method
and we can then represent the implementation of the complete one scale DWT in the z-transform representation by the diagram in Fig. 15. This representation of a two channel filter bank, without any reference to lifting, is in the signal analysis literature called the polyphase representation, see for example [7,10,11].

Two Channel Filter Banks with Perfect Reconstruction

We have now seen how the time-based lifting method has an equivalent representation in the frequency domain. Lifting allows for perfect reconstruction of the signal after transformation, and we will now use this property (which is preserved in the frequency representation) to better understand the link between lifting and filter banks. We assume the reader is familiar with the concept of filtering and filter banks. Details on filtering can be found in any book on signal analysis, for example in [6,7]. The filter bank approach to wavelets is the one used in most introductions to the subject.

A two channel filter bank starts with two analysis filters, here denoted by the filter taps h0 and h1, and two synthesis filters, denoted by g0 and g1. Usually the filters with index 0 are chosen to be low pass filters, and the filters with index 1 to be high pass filters. The analysis and synthesis parts of the filter bank are shown in Fig. 16. The analysis part transforms the input X(z) to the output pair Y0(z) and Y1(z). The synthesis part then transforms this pair to the output X~(z). The filtering scheme is said to have the perfect reconstruction property if X(z) = X~(z) for any X(z). The main point that we will now establish is that the two Figs. 15 and 16 do indeed show the same thing, and that it is possible to get the one from the other.

Conditions for Perfect Reconstruction

To do this we first analyze which conditions are needed on the four filters in Fig. 16 in order to obtain the perfect reconstruction property. Filtering by h0 transforms X(z) to H0(z)X(z),
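The analysis part just described — filter, then down sample by two — is easy to emulate numerically. In the sketch below the Haar pair of analysis filters is an assumed illustrative choice; a constant input has no high-frequency content, so the high pass subband vanishes away from the boundaries.

```python
import numpy as np

# Analysis part of a two channel filter bank: filter, then down sample.
h0 = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low pass
h1 = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high pass

x = np.array([3.0, 3.0, 3.0, 3.0, 3.0, 3.0])
y0 = np.convolve(x, h0)[::2]   # low pass subband
y1 = np.convolve(x, h1)[::2]   # high pass subband

print(y1[1:-1])                # [0. 0.]
assert np.allclose(y1[1:-1], 0.0)
assert np.allclose(y0[1:-1], 3.0 * np.sqrt(2.0))
```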
Wavelets and the Lifting Scheme, Figure 16 Two channel analysis and synthesis. This is the standard filter bank representation
and we then use (37) to down sample by two. Thus we have

Y0(z) = (1/2) ( H0(z^{1/2}) X(z^{1/2}) + H0(-z^{1/2}) X(-z^{1/2}) ) ,   (48)
Y1(z) = (1/2) ( H1(z^{1/2}) X(z^{1/2}) + H1(-z^{1/2}) X(-z^{1/2}) ) .   (49)

Up sampling by two followed by filtering by the G-filters, and addition of the results, leads to a reconstructed signal

X~(z) = G0(z) Y0(z^2) + G1(z) Y1(z^2) .   (50)

Perfect reconstruction means that X~(z) = X(z). We combine the above expressions and then regroup terms to get

X~(z) = (1/2) ( G0(z)H0(z) + G1(z)H1(z) ) X(z) + (1/2) ( G0(z)H0(-z) + G1(z)H1(-z) ) X(-z) .

To fulfill the condition X~(z) = X(z) we need

G0(z)H0(z) + G1(z)H1(z) = 2 ,   (51)
G0(z)H0(-z) + G1(z)H1(-z) = 0 .   (52)
These conditions mean that the four filters cannot be chosen independently if we want to have perfect reconstruction. To determine what conditions these equations impose on H and G it is convenient to write them as a matrix equation,

[ H0(z) , H1(z) ; H0(-z) , H1(-z) ] [ G0(z) ; G1(z) ] = [ 2 ; 0 ] .   (53)

We want to solve this equation, that is, determine G0(z) and G1(z) as functions of H0(z) and H1(z). Since the matrix is invertible (it performs a DWT that can be inverted), we can use Cramer's rule to get

G0(z) = det[ 2 , H1(z) ; 0 , H1(-z) ] / d(z) = 2 d(z)^{-1} H1(-z) .

The determinant d(z) of an invertible 2x2 matrix with Laurent polynomials as entries is a monomial (see [3]), which in this case has the property that d(-z) = -d(z). Therefore, the determinant can be assumed to be of the form d(z) = 2 C^{-1} z^{2k-1} for some integer k and some real constant C. Thus,

G0(z) = det[ 2 , H1(z) ; 0 , H1(-z) ] / (2 C^{-1} z^{2k-1}) = C z^{-2k+1} H1(-z) ,   (54)
G1(z) = det[ H0(z) , 2 ; H0(-z) , 0 ] / (2 C^{-1} z^{2k-1}) = -C z^{-2k+1} H0(-z) .   (55)

These equations show that we can choose either the H-filter pair or the G-filter pair. We will assume that we have filters H0 and H1, subject to the condition that

d(z) = H0(z)H1(-z) - H0(-z)H1(z) = 2 C^{-1} z^{2k-1}   (56)

for some integer k and nonzero constant C. Then G0 and G1 are determined by the Eqs. (54) and (55), which implies that they are unique up to a scaling factor and an odd shift in time. Note that this argument is valid for any two channel filter bank, subject to the condition (56).

Lifting and Two Channel Filter Banks with Perfect Reconstruction Are Equivalent

Many presentations of the wavelet transform start with a two channel filter bank with the perfect reconstruction property. The analysis part is then used to define the direct one scale DWT, and the synthesis part is used for reconstruction. It is an important result that the filtering approach and the one based on lifting actually are identical. Thus any set of lifting steps leads to a two channel filter bank with perfect reconstruction, and, quite remarkably, vice versa. This means that they are just two different ways of describing the same transformation from X(z) to Y0(z) and Y1(z). This equivalence will be the subject of the remainder of the article.

The first step is to show that the analysis step in Fig. 16, coming from the two channel filter bank, is equivalent to the analysis step summarized in Fig. 15 and in (47), i.e. coming from the lifting method. Thus, we want to find the equations relating the coefficients in the H matrix (46) and the filters H0, H1, G0, and G1. The analysis step by both methods should yield the same result. To avoid the square root terms we compare the results after up sampling by two. We start with the equality

[ Y0(z^2) ; Y1(z^2) ] = H(z^2) [ (1/2)(X(z) + X(-z)) ; (1/2) z (X(z) - X(-z)) ] ,
where the left hand side is from the filter bank approach (48) and (49), and the right hand side from the lifting approach with polyphase matrix (46) and up sampled X0(z) and X1(z). The first equation can then be written as

(1/2) ( H0(z)X(z) + H0(-z)X(-z) ) = H00(z^2) (1/2) ( X(z) + X(-z) ) + H01(z^2) (1/2) z ( X(z) - X(-z) ) .

This leads to the relation

H0(z) = H00(z^2) + z H01(z^2) .

The relation for H1 is found analogously, and then the relations for G0 and G1 can be found using the perfect reconstruction conditions (54) and (55) in the two cases. The results are summarized here:

H0(z) = H00(z^2) + z H01(z^2) ,   (57)
H1(z) = H10(z^2) + z H11(z^2) ,   (58)
G0(z) = G00(z^2) + z^{-1} G01(z^2) ,   (59)
G1(z) = G10(z^2) + z^{-1} G11(z^2) .   (60)

Note the difference in the decomposition of the H-filters and the G-filters. Thus in the polyphase representation we use

H(z) = [ H00(z) , H01(z) ; H10(z) , H11(z) ] ,
G(z) = H(z)^{-1} = [ G00(z) , G10(z) ; G01(z) , G11(z) ] .

Note the placement of entries in G(z), which differs from the usual notation for matrices. The requirement of perfect reconstruction in the polyphase formulation was the requirement that G(z) should be the inverse of H(z).

From Filters to Lifting Steps

It is easy to start with a filter bank and derive H(z) using (57) and (58). However, going the other way is somewhat more tricky; given H(z) it is necessary to factorize it into a number of 2x2 matrices of a particular structure to obtain the lifting steps. The remarkable result is that this is always possible. This result was obtained by I. Daubechies and W. Sweldens in 1998 in the paper [2]. The factorization result for 2x2 matrices with Laurent entries was previously known in the mathematical literature. The importance of the paper by I. Daubechies and W. Sweldens lies in its impact on signal processing.

Theorem 1 (Daubechies and Sweldens 1998)  Assume that H(z) is a 2x2 matrix of Laurent polynomials, normalized to det H(z) = 1. Then there exists a constant K ≠ 0 and Laurent polynomials S1(z), ..., S_N(z), T1(z), ..., T_N(z), such that

H(z) = [ K , 0 ; 0 , 1/K ] [ 1 , S_N(z) ; 0 , 1 ] [ 1 , 0 ; -T_N(z) , 1 ] ... [ 1 , S1(z) ; 0 , 1 ] [ 1 , 0 ; -T1(z) , 1 ] .   (61)
The normalization det H(z) = 1 in the theorem can always be satisfied. As seen above, in the general case det H(z) is a monomial C z^{2k+1}. One can always get the determinant equal to one by scaling and an odd shift in time. Therefore the result is stated with det H(z) = 1 without loss of generality.

The proof of this theorem is constructive. It gives an algorithm for finding the Laurent polynomials S1(z), ..., S_N(z), T1(z), ..., T_N(z) in the factorization. It is important to note that the factorization is not unique. Once we have a factorization, we can translate it into lifting steps.

The advantage of the lifting approach, compared to the filter approach, is that it is very easy to find perfect reconstruction filters H0, H1, G0, and G1. It is just a matter of multiplying the lifting steps as in (43), and then assembling the filters according to the Eqs. (57)-(60). There can also be algorithmic advantages in implementing a transform using lifting steps, since it may result in fewer arithmetic operations than a filter implementation. This approach should be contrasted with the traditional signal analysis approach, where one tries to find (approximate numerical) solutions to the Eqs. (51) and (52), using for example spectral factorization. The weakness in constructing a transform based solely on the lifting technique is that it is based entirely on considerations in the time domain. Sometimes it is desirable to design filters with certain properties in the frequency domain, and once filters have been constructed in the frequency domain, we can use the constructive proof of the theorem to derive a lifting implementation. We should mention that the numerical stability of transforms designed using lifting can be difficult to analyze.

Examples of Lifting Steps in the Frequency Domain

We will now see how the Daubechies 4 transform looks in the frequency domain, that is, what frequency response we would get when passing a signal through the Daubechies 4 transform. First, we need to find H(z) by multiplying the lifting steps originating in the Daubechies 4 Eqs. (13) through (17). Each equation has an equivalent lifting step matrix (41) or (42). The first of the Daubechies 4 equations,

s^{(1)}_{j-1}[n] = s_j[2n] + √3 s_j[2n+1] ,

is an update step, and it simply multiplies the odd sample with √3 and adds the result to the even sample. In the z-transform this is achieved by multiplying with

[ 1 , √3 ; 0 , 1 ] ,

which makes this matrix the first transform step. The second equation,

d^{(1)}_{j-1}[n] = s_j[2n+1] - (1/4)√3 s^{(1)}_{j-1}[n] - (1/4)(√3 - 2) s^{(1)}_{j-1}[n-1] ,

is a prediction step that uses the present value (index n) and the first previous value (index n-1) of the signal. This is achieved in the z-transform by

[ 1 , 0 ; -(√3/4) - ((√3-2)/4) z^{-1} , 1 ] .
The third equation converts to the z-transform similarly. The two scaling Equations (16) and (17) are joined in one scaling matrix

[ (√3-1)/√2 , 0 ; 0 , (√3+1)/√2 ] .

Now, the polyphase matrix H(z) can be written

H(z) = [ (√3-1)/√2 , 0 ; 0 , (√3+1)/√2 ] [ 1 , -z ; 0 , 1 ] [ 1 , 0 ; -(√3/4) - ((√3-2)/4) z^{-1} , 1 ] [ 1 , √3 ; 0 , 1 ]

     = [ H00(z) , H01(z) ; H10(z) , H11(z) ]

     = [ (1+√3)/(4√2) + ((3-√3)/(4√2)) z , (3+√3)/(4√2) + ((1-√3)/(4√2)) z ;
         ((√3-1)/(4√2)) z^{-1} - (3+√3)/(4√2) , ((3-√3)/(4√2)) z^{-1} + (1+√3)/(4√2) ] .
Thus, from (57) and (58) it follows that the low and high pass filter taps (in z-transform) are given by

H0(z) = H00(z^2) + z H01(z^2)
      = (1+√3)/(4√2) + ((3+√3)/(4√2)) z + ((3-√3)/(4√2)) z^2 + ((1-√3)/(4√2)) z^3 ,

H1(z) = H10(z^2) + z H11(z^2)
      = ((√3-1)/(4√2)) z^{-2} + ((3-√3)/(4√2)) z^{-1} - (3+√3)/(4√2) + ((1+√3)/(4√2)) z .

To determine the frequency response of these transfer functions, we simply replace z with e^{jω}, which gives the Fourier transform of the filter taps. We then plot |H0(e^{jω})| and |H1(e^{jω})| as functions of ω to show the amplitude characteristics of the transfer functions H0(z) and H1(z) on the unit circle in the complex plane. This plot is shown in Fig. 17.

Making Lifting Steps from Filters

There are basically three ways of representing the building block in a DWT: (1) the transform can be represented by a pair of filters (usually low pass and high pass filters) satisfying the perfect reconstruction conditions, or it can be given as lifting steps, which are either given (2) in the time domain as a set of equations, or (3) in the frequency domain as a factored matrix of Laurent polynomials. As an example, the Daubechies 4 transform can be given in these three forms, as shown below.

Equation form:

s^{(1)}[n] = S[2n] + √3 S[2n+1] ,   (62)
d^{(1)}[n] = S[2n+1] - (1/4)√3 s^{(1)}[n] - (1/4)(√3 - 2) s^{(1)}[n-1] ,   (63)
s^{(2)}[n] = s^{(1)}[n] - d^{(1)}[n+1] ,   (64)
s~[n] = ((√3-1)/√2) s^{(2)}[n] ,   (65)
d~[n] = ((√3+1)/√2) d^{(1)}[n] .   (66)

Matrix form:

H(z) = [ (√3-1)/√2 , 0 ; 0 , (√3+1)/√2 ] [ 1 , -z ; 0 , 1 ] [ 1 , 0 ; -(√3/4) - ((√3-2)/4) z^{-1} , 1 ] [ 1 , √3 ; 0 , 1 ] .   (67)
Wavelets and the Lifting Scheme, Figure 17 Frequency response of the Daubechies 4 lifting steps
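The amplitude responses in Fig. 17 can be reproduced by evaluating the transfer functions at z = e^{jω}. The sketch below (ours) uses the taps h and g as given in the filter form below; the index shifts in the z-transforms only change the phase, not the amplitude. At ω = 0 the low pass response is √2 and the high pass response is 0, and at ω = π the roles are reversed.

```python
import numpy as np

# Daubechies 4 low pass and high pass taps.
r3, r2 = np.sqrt(3.0), np.sqrt(2.0)
h = np.array([1 + r3, 3 + r3, 3 - r3, 1 - r3]) / (4 * r2)
g = np.array([r3 - 1, 3 - r3, -(3 + r3), 1 + r3]) / (4 * r2)

H0 = lambda w: sum(c * np.exp(-1j * w * n) for n, c in enumerate(h))
H1 = lambda w: sum(c * np.exp(-1j * w * n) for n, c in enumerate(g))

# Low pass: amplitude sqrt(2) at omega = 0, zero at omega = pi.
assert abs(abs(H0(0.0)) - r2) < 1e-12 and abs(H0(np.pi)) < 1e-12
# High pass behaves the other way around, as in Fig. 17.
assert abs(H1(0.0)) < 1e-12 and abs(abs(H1(np.pi)) - r2) < 1e-12
```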
Filter form:

h = (1/(4√2)) ( 1+√3 , 3+√3 , 3-√3 , 1-√3 ) ,   (68)
g = (1/(4√2)) ( √3-1 , 3-√3 , -(3+√3) , 1+√3 ) .   (69)

The equation form was presented in (13) through (17). Converting this to the matrix form has not been shown directly. However, using (41), (42), and (43) it is not difficult to derive the matrices from the equations. It is not difficult either to obtain the filter taps h and g from the matrix form. Simply multiply the matrices to get the polyphase matrix (46); the even-indexed entries of h are then the coefficients of H00(z), and the odd-indexed entries the coefficients of H01(z). Likewise, the even- and odd-indexed entries of g are the coefficients of H10(z) and H11(z). Now, the hard part is to go in the other direction. Suppose we have the filter taps and want to find the lifting steps (in either time or frequency domain). That it is theoretically possible to obtain the lifting steps was made evident with Theorem 1. Now we want to show how to actually obtain the lifting steps.

The Euclidean Algorithm

The Euclidean algorithm is usually first presented as an algorithm for finding the greatest common divisor of two integers. But it can be applied to many other analogous problems. One application is to finding the greatest common divisor of two Laurent polynomials. This turns out to be the key step in factorizing a polyphase matrix.

Take two Laurent polynomials a(z) and b(z) ≠ 0 with |a(z)| ≥ |b(z)|, where |·| denotes the degree of a Laurent polynomial. Then there always exist a Laurent polynomial q(z), the quotient, with |q(z)| = |a(z)| - |b(z)|, and a Laurent polynomial r(z), the remainder, with |r(z)| < |b(z)|, such that

a(z) = b(z)q(z) + r(z) .   (70)

We use the notation q(z) = a(z)/b(z) for this 'division disregarding the remainder'. Now, let a0(z) = a(z) and b0(z) = b(z), and iterate the following steps starting from n = 0:

a_{n+1}(z) = b_n(z) ,
b_{n+1}(z) = a_n(z) - q_{n+1}(z) b_n(z) ,

which in matrix form is

[ a_{n+1}(z) ; b_{n+1}(z) ] = [ 0 , 1 ; 1 , -q_{n+1}(z) ] [ a_n(z) ; b_n(z) ] ,

where q_{n+1}(z) = a_n(z)/b_n(z). Let N denote the smallest integer, with N ≤ |b(z)| + 1, for which b_N(z) = 0. Then a_N(z) is a greatest common divisor of a(z) and b(z). In short, for two Laurent polynomials a(z) and b(z) with the given constraints we have

[ a_N(z) ; 0 ] = ( ∏_{n=N}^{1} [ 0 , 1 ; 1 , -q_n(z) ] ) [ a(z) ; b(z) ] .
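For ordinary (non-Laurent) polynomials, the iteration just described is a few lines of code. The sketch below is our own, with a simple numerical zero tolerance as an assumption; it computes a greatest common divisor by repeated division with remainder.

```python
# Polynomials as coefficient lists, highest degree first.
def polydiv(a, b):
    a = list(a)
    q = [0.0] * (len(a) - len(b) + 1)
    for i in range(len(q)):
        q[i] = a[i] / b[0]
        for j, c in enumerate(b):
            a[i + j] -= q[i] * c
    return q, a[len(q):]               # quotient and remainder

def polygcd(a, b):
    while any(abs(c) > 1e-9 for c in b):
        _, r = polydiv(a, b)
        while r and abs(r[0]) < 1e-9:  # strip leading (near-)zeros
            r.pop(0)
        a, b = b, r if r else [0.0]
    return [c / a[0] for c in a]       # normalize leading coefficient

# gcd(z^2 - 1, z^2 - 3z + 2) = z - 1
print(polygcd([1.0, 0.0, -1.0], [1.0, -3.0, 2.0]))   # [1.0, -1.0]
```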
We note that there is no unique greatest common divisor for a(z) and b(z), since if d(z) divides both a(z) and b(z), and is of maximal degree, then α z^m d(z) is also a divisor of the same degree. Note the order of the terms in this product, as given by the limits of the product. Inverting the matrices and moving them to the other side, we get (products in reverse order, as shown by the limits)

[ a(z) ; b(z) ] = ( ∏_{n=1}^{N} [ q_n(z) , 1 ; 1 , 0 ] ) [ a_N(z) ; 0 ] .   (71)

Now, take a(z) = H00(z) and b(z) = H01(z) from the polyphase matrix (46). We then get

[ H00(z) ; H01(z) ] = ( ∏_{n=1}^{N} [ q_n(z) , 1 ; 1 , 0 ] ) [ K ; 0 ] .   (72)

The greatest common divisor a_N(z) of H00(z) and H01(z) is a constant. This is because of the determinant assumption

H00(z)H11(z) - H01(z)H10(z) = 1 ;

since a_N(z) divides the left hand side, it also divides the right hand side. Thus a_N(z) = K z^l. By shifting the indices in the z-transform of H0(z) by l steps, the greatest common divisor becomes a constant. Observing that

[ q(z) , 1 ; 1 , 0 ] = [ 1 , q(z) ; 0 , 1 ] [ 0 , 1 ; 1 , 0 ]   and   [ 0 , 1 ; 1 , 0 ] [ q(z) , 1 ; 1 , 0 ] = [ 1 , 0 ; q(z) , 1 ] ,

replacing the vector [ K ; 0 ] with the matrix

[ K , 0 ; 0 , 1/K ] ,

and letting M = ⌈N/2⌉, we can rewrite (72) to

[ H00(z) , H'10(z) ; H01(z) , H'11(z) ] = ( ∏_{n=1}^{M} [ 1 , q_{2n-1}(z) ; 0 , 1 ] [ 1 , 0 ; q_{2n}(z) , 1 ] ) [ K , 0 ; 0 , 1/K ] ,

where these equations define H'10(z) and H'11(z). If N is odd, let q_{2M}(z) = 0. Transposing both sides yields

[ H00(z) , H01(z) ; H'10(z) , H'11(z) ] = [ K , 0 ; 0 , 1/K ] ∏_{n=M}^{1} [ 1 , q_{2n}(z) ; 0 , 1 ] [ 1 , 0 ; q_{2n-1}(z) , 1 ] ,   (73)

which clearly resembles the factorization in Theorem 1, which we are trying to achieve. All we need to do now is to find out how H'10(z) and H'11(z) are connected with H10(z) and H11(z). By using the fact that both polyphase matrices (46) and (73) must have determinant 1, it is possible to show that

[ H00(z) , H01(z) ; H10(z) , H11(z) ] = [ 1 , 0 ; t(z) , 1 ] [ H00(z) , H01(z) ; H'10(z) , H'11(z) ]

for

t(z) = H10(z)H'11(z) - H11(z)H'10(z) .   (74)

Consequently, we have the relation

[ H00(z) , H01(z) ; H10(z) , H11(z) ] = [ K , 0 ; 0 , 1/K ] [ 1 , 0 ; K^2 t(z) , 1 ] ∏_{n=M}^{1} [ 1 , q_{2n}(z) ; 0 , 1 ] [ 1 , 0 ; q_{2n-1}(z) , 1 ] ,   (75)

which by a suitable re-indexing of the q polynomials (and at the same time making K^2 t(z) one of them) determines the S_n(z) and T_n(z) in Theorem 1.

Practical Implementation of the Euclidean Algorithm

We now have a step-by-step method for obtaining the lifting steps from any filter. It goes as follows.

1. Find the z-transform H0(z) of the low pass filter taps h.
2. Determine H00(z) and H01(z) by means of (57).
3. Assign a0(z) = H00(z) and b0(z) = H01(z), and let n = 0.
4. Determine the quotient q_{n+1}(z) = a_n(z)/b_n(z). There may be multiple solutions to this step. Choose one.
5. Determine the remainder as b_{n+1}(z) = a_n(z) - q_{n+1}(z) b_n(z).
6. Assign a_{n+1}(z) = b_n(z).
7. Increment n. If b_n ≠ 0, go to step 4.
8. Let K = a_n(z). Determine H'10(z) and H'11(z) using (73).
9. Find the z-transform H1(z) of g such that it is index-shifted an odd number compared to H0(z).
10. Determine H10(z) and H11(z) by means of (58).
11. Determine t(z) from (74).
12. Let T_n(z) = -q_{2n-1}(z) for n = 1, ..., M.
13. Let S_n(z) = q_{2n}(z) for n = 1, ..., M.
14. Let T_{M+1}(z) = -K^2 t(z).
15. If K has a non-zero power of z, you may discard this. Alternatively, choosing the right z-transform in step 1 will give a 'zero power' K.
This procedure will provide all the ingredients for building the lifting step matrices. Note that this will always give a prediction step as the first matrix, and as such this procedure seems unable to generate lifting steps for, say, Daubechies 4 which starts with an update step. However, in the polyphase representation prediction and update steps are merely linear combinations of even and odd samples applied to odd and even samples, and as such predictions and updates can be interchanged simply by an odd shift of the z-transform of the signal.
Factoring Daubechies 4 into Lifting Steps

We now give an example of creating lifting steps using the algorithm presented in the previous section. The example is factorization of Daubechies 4. This means that we show how to get the matrix form (67) from the filter tap form (68) and (69). We will follow the step-by-step method laid out in the previous section. The Daubechies 4 filter taps are given by

h = (1/(4√2)) ( 1+√3 , 3+√3 , 3-√3 , 1-√3 ) .   (76)

In the z-transform in step 1 we are free to choose a shift. For instance, we could choose

H0(z) = h[0] + h[1]z + h[2]z^2 + h[3]z^3 .   (77)

We know that eventually this choice will provide a monomial K that may or may not have zero power. As we will see later, this choice does in fact lead to a real number K. H0(z) must be separated into even and odd indices. This happens in step 2. Combining this with the assignment to a0(z) and b0(z) in step 3, we get

a0(z) = H00(z) = h[0] + h[2]z = (1+√3)/(4√2) + ((3-√3)/(4√2)) z ,   (78)
b0(z) = H01(z) = h[1] + h[3]z = (3+√3)/(4√2) + ((1-√3)/(4√2)) z .   (79)

Note that the concept of even and odd applies to the chosen z-transform of the filter taps (77), not the filter tap vector itself (76). Then step 4 is to find q1(z). Since a0(z) and b0(z) have the same degree, the quotient is a monomial. Matching the z coefficients yields

q1(z) = (z coefficient of a0(z)) / (z coefficient of b0(z)) = (3-√3)/(1-√3) = (3-√3)(1+√3)/((1-√3)(1+√3)) = 2√3/(-2) = -√3 .

The remainder is then

b1(z) = a0(z) - b0(z)q1(z)
      = (1+√3)/(4√2) + ((3-√3)/(4√2)) z + √3 ( (3+√3)/(4√2) + ((1-√3)/(4√2)) z )
      = (1+√3+3√3+3)/(4√2) = (1+√3)/√2 ,

since the z terms cancel, thus completing step 5. In step 6 we assign

a1(z) = b0(z) = (3+√3)/(4√2) + ((1-√3)/(4√2)) z .

In step 7 we increment n = 0 to n = 1, and since b1 ≠ 0 we repeat from step 4. This time the quotient has degree 1, since b1(z) is one degree less than a1(z). More specifically, q2(z) must be of the form c + dz. Further, since the degree of the remainder decreases, and |b1(z)| = 0, the remainder must be zero this time. So

b1(z)q2(z) = ((1+√3)/√2)(c + dz) = a1(z) = (3+√3)/(4√2) + ((1-√3)/(4√2)) z .

Thus,

c = ((3+√3)/(4√2)) / ((1+√3)/√2) = (3+√3)/(4(1+√3)) = √3/4 ,
d = ((1-√3)/(4√2)) / ((1+√3)/√2) = (1-√3)/(4(1+√3)) = (√3-2)/4 .

Therefore,

q2(z) = √3/4 + ((√3-2)/4) z ,
b2(z) = 0 ,
a2(z) = b1(z) = (1+√3)/√2 .

Now, this time in step 7 we have b2(z) = 0. Thus, we continue to step 8, assign

K = (1+√3)/√2 ,

and use (73) to find H'10 and H'11:

[ H00(z) , H01(z) ; H'10(z) , H'11(z) ]
  = [ K , 0 ; 0 , 1/K ] [ 1 , q2(z) ; 0 , 1 ] [ 1 , 0 ; q1(z) , 1 ]
  = [ (1+√3)/(4√2) + ((3-√3)/(4√2)) z , (3+√3)/(4√2) + ((1-√3)/(4√2)) z ; -(3-√3)/√2 , (√3-1)/√2 ] .   (80)

Now that we have H'10 and H'11, we need H10 and H11 in order to determine t(z). First, we need the z-transform of the high pass filter taps. The high pass filter g is the time reverse of h with alternating signs. From (76) we therefore get

g = H1 = (1/(4√2)) ( √3-1 , 3-√3 , -(3+√3) , 1+√3 ) .   (81)

The z-transform of g must have the index shifted by 1 (or some other odd number) compared to H0(z). For instance,

H1(z) = g[0]z^{-2} + g[1]z^{-1} + g[2] + g[3]z ,   (82)

where g[3] = h[0] now is the coefficient for an odd power, whereas h[0] is the coefficient for an even power in H0(z). Splitting this in even and odd coefficients (where even and odd again relate to the z-transform), we get from (58) that

H10(z) = g[0]z^{-1} + g[2] = ((√3-1)/(4√2)) z^{-1} - (3+√3)/(4√2) ,
H11(z) = g[1]z^{-1} + g[3] = ((3-√3)/(4√2)) z^{-1} + (1+√3)/(4√2) ,

thus completing step 10. We now insert these H10 and H11 together with H'10 and H'11 from (80) into (74):

t(z) = H10(z)H'11(z) - H11(z)H'10(z)
     = ( ((√3-1)/(4√2)) z^{-1} - (3+√3)/(4√2) ) ((√3-1)/√2) + ( ((3-√3)/(4√2)) z^{-1} + (1+√3)/(4√2) ) ((3-√3)/√2)
     = (1/8) ( (√3-1)^2 + (3-√3)^2 ) z^{-1} + (1/8) ( -(√3-1)(3+√3) + (1+√3)(3-√3) )
     = (1/8) ( (16-8√3) z^{-1} + (-2√3 + 2√3) )
     = (2-√3) z^{-1} ,

which completes step 11. We can now determine the extra matrix we need, as shown in (75):

K^2 t(z) = ((1+√3)/√2)^2 (2-√3) z^{-1} = (2+√3)(2-√3) z^{-1} = z^{-1} .

It is now possible to determine all the matrix lifting steps that make up the polyphase matrix, which is steps 12, 13, and 14. We have M = 1, so we get one S matrix and two T matrices:

T1(z) = -q1(z) = √3 ,
S1(z) = q2(z) = √3/4 + ((√3-2)/4) z ,
T2(z) = -K^2 t(z) = -z^{-1} .

Since K is a real number, we do not need to shift the powers of z (as mentioned in step 15), and we can now write the complete factorization of the polyphase matrix H(z) as shown in (61) from Theorem 1:

H(z) = [ (√3+1)/√2 , 0 ; 0 , (√3-1)/√2 ] [ 1 , 0 ; z^{-1} , 1 ] [ 1 , √3/4 + ((√3-2)/4) z ; 0 , 1 ] [ 1 , 0 ; -√3 , 1 ] .

But this is in fact not (67). The reason is that the presented algorithm for finding the lifting steps will always start with a prediction step, whereas the factorization in (67) starts with an update step. So while H(z) in (67) indeed is equal to H(z) above, the factorizations are not the same. This is a consequence of the fact that the distinction of the lifting matrices into update and prediction is useful for interpretation of the transform results, but not necessary for the implementation of the transform.

Actually, this is not the only 'non-uniqueness feature' of the wavelet transform. If instead of (77) and (82) we had chosen

H0(z) = h[0]z^{-3} + h[1]z^{-2} + h[2]z^{-1} + h[3] ,
H1(z) = g[0]z + g[1]z^2 + g[2]z^3 + g[3]z^4 ,

the even coefficients from before are now odd, and vice versa. That means that the relation between h and g changes signs, so for instance, g[3] = -h[0] now, rather than g[3] = h[0]. The resulting factorization becomes

H(z) = [ (1-√3)/√2 , 0 ; 0 , -(√3+1)/√2 ] [ 1 , 0 ; z^2 , 1 ] [ 1 , -(√3/4) z^{-1} - ((√3+2)/4) z^{-2} ; 0 , 1 ] [ 1 , 0 ; √3 z , 1 ] .
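The factorization obtained above can be sanity-checked numerically: at any sample point z0, the product of the lifting matrices must reproduce the polyphase matrix assembled directly from the taps with the shifts (77) and (82). The following check is our own sketch, not part of the article.

```python
import numpy as np

r3, r2 = np.sqrt(3.0), np.sqrt(2.0)
h = np.array([1 + r3, 3 + r3, 3 - r3, 1 - r3]) / (4 * r2)
g = np.array([r3 - 1, 3 - r3, -(3 + r3), 1 + r3]) / (4 * r2)

z0 = 0.8 + 0.3j
# Polyphase matrix from the taps: H00, H01 as in (78)-(79),
# H10, H11 from the shifted high pass transform (82).
H = np.array([[h[0] + h[2] * z0, h[1] + h[3] * z0],
              [g[0] / z0 + g[2], g[1] / z0 + g[3]]])

K = (1 + r3) / r2
q1 = -r3
q2 = r3 / 4 + (r3 - 2) / 4 * z0
F = (np.diag([K, 1 / K])
     @ np.array([[1, 0], [1 / z0, 1]])    # the extra matrix with K^2 t(z)
     @ np.array([[1, q2], [0, 1]])
     @ np.array([[1, 0], [q1, 1]]))

assert np.allclose(F, H)
print("lifting factorization reproduces the polyphase matrix")
```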
There are more examples of factorizations in Jensen and la Cour-Harbo [3] and in the original paper by Daubechies and Sweldens [2].

Bibliography

1. Daubechies I (1992) Ten lectures on wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics, vol 60. SIAM, Philadelphia
2. Daubechies I, Sweldens W (1998) Factoring wavelet transforms into lifting steps. J Fourier Anal Appl 4(3):245-267
3. Jensen A, la Cour-Harbo A (2001) Ripples in mathematics - the discrete wavelet transform. Springer, Heidelberg
4. Mallat S (1998) A wavelet tour of signal processing. Academic Press, San Diego
5. Mulcahy C (1996) Plotting and scheming with wavelets. Math Mag 69(5):323-343
6. Oppenheim AV, Schafer RW (1975) Digital signal processing. Prentice Hall, Upper Saddle River
7. Oppenheim AV, Schafer RW, Buck JR (1999) Discrete-time signal processing. Prentice Hall, Upper Saddle River
8. Sweldens W (1996) The lifting scheme: A custom-design construction of biorthogonal wavelets. Appl Comput Harmon Anal 3(2):186-200
9. Sweldens W (1997) The lifting scheme: A construction of second generation wavelets. SIAM J Math Anal 29(2):511-546
10. Vetterli M, Kovačević J (1995) Wavelets and subband coding. Prentice-Hall, Englewood Cliffs
11. Vetterli M, Kovačević J (2007) Wavelets and subband coding, 2nd edn. http://waveletsandsubbandcoding.org, accessed on Sept 5th, 2008
Wavelets and PDE Techniques in Image Processing, a Quick Tour of
HAO-MIN ZHOU 1, TONY F. CHAN 2, JIANHONG SHEN 3
1 School of Mathematics, Georgia Institute of Technology, Atlanta, USA
2 Department of Mathematics, University of California, Los Angeles, USA
3 Barclays Capital, New York, USA
Article Outline

Glossary
Definition of the Subject
Introduction
Wavelets in Image Processing
PDE Techniques
Wavelet Based Variational PDE Methods
Future Directions
Acknowledgments
Bibliography

Glossary

Wavelets  Wavelets are selected functions that generate orthonormal bases of the square integrable function space L² (or, more generally, frames of Lᵖ spaces) by dilations and translations. The basis functions have certain locality, such as compact support or fast decay properties, and they are usually organized according to different scales or resolutions, an organization called Multi-Resolution Analysis (MRA). Fast wavelet transforms are filtering procedures that compute the projection of any given function onto a wavelet basis.

Digital images  Digital images usually refer to n-dimensional data arrays recorded by optical or other imaging devices, such as digital cameras, radar, Computed Tomography (CT), and Magnetic Resonance Imaging (MRI). They can also be generated by computer graphics software. Most digital images in the literature are 2- or 3-dimensional.

Image restoration  Image restoration rebuilds high-quality images from given images that are corrupted or polluted during acquisition or transmission. The most common restoration tasks are denoising and deblurring. Denoising removes random perturbations of individual pixel values. Deblurring removes the unwanted correlation between nearby pixels to recover the original clear images.
Image compression  Compression converts images from n-dimensional data arrays into "0" and "1" bit streams so that they can be stored or transmitted more efficiently. There are two types of compression, lossy and lossless, depending on whether information is permanently lost or fully recoverable. Many commonly used compression algorithms, such as those behind the international image compression standards JPEG and JPEG2000, are transform-based, consisting of three basic steps: transforming the pixel values into frequency coefficients, quantizing the frequency coefficients, and coding the quantized coefficients into bit streams.

Image segmentation  Segmentation partitions an image into subregions (segments) on which the image shares similar features. Each region often corresponds to the image of an individual object in the 3-dimensional world.

Image inpainting  Inpainting is an artistic word referring to filling in missing image information on damaged regions, e.g., scratches and damage in precious old photos, old Hollywood films, and ancient paintings. The objective of digital image inpainting is to fill in the missing information automatically and meaningfully.

Definition of the Subject

An explosion of information dominates nearly all aspects of modern society, science, and technology. Visualization is one of the most direct and preferred ways to observe the information carried by data, which are often massive in size and uncertain in quality. To better reveal the information, especially when it is hidden, implicit, or corrupted, the data must first be properly processed. In achieving this, image processing, which includes many different tasks such as compression, restoration, inpainting, segmentation, pattern recognition, and registration, has played a critical role. Historically, it has been viewed as a branch of signal processing, and many classical methods are adopted from traditional Fourier-based signal processing algorithms.
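The three compression steps named in the glossary entry above (transform, quantization, coding) can be sketched in a few lines. This is a toy illustration, not the JPEG or JPEG2000 algorithm: it assumes a one-level Haar transform as the transform step, uniform rounding as the quantizer, and a naive (index, value) list as the "bit stream".

```python
import math

def transform(x):
    # one level of the orthonormal Haar transform: averages then differences
    s2 = math.sqrt(2.0)
    a = [(x[i] + x[i + 1]) / s2 for i in range(0, len(x), 2)]
    d = [(x[i] - x[i + 1]) / s2 for i in range(0, len(x), 2)]
    return a + d

def quantize(coeffs, step):
    # uniform quantization: this is the lossy step
    return [round(c / step) for c in coeffs]

def encode(q):
    # toy coding step: store only the nonzero symbols with their locations
    return [(i, v) for i, v in enumerate(q) if v != 0]
```

For a blocky signal the difference coefficients quantize to zero, so the encoded stream is much shorter than the input, which is the whole point of transform-based compression.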
In the past couple of decades, numerous new competing methods have emerged. Among them, wavelets, variational and PDE techniques, and stochastic methods have demonstrated outstanding performance due to their special properties. For instance, wavelets have become the dominant tool in image processing because of their multiresolution structure, energy concentration ability, and fast transform algorithms. The popularity of variational PDE techniques is driven by their extraordinary properties in understanding and manipulating geometrical
features. These new techniques have contributed to new observations, understandings, and discoveries in science and technology. Furthermore, many of these methods have been successfully applied in medicine, the physical sciences, engineering, and even everyday life.

Introduction

Digital image processing analyzes or extracts certain information from digital images, which are often viewed mathematically as 2- or multi-dimensional data sets. Each element in the data sets is called a pixel. Typical image processing tasks include segmentation, restoration, pattern recognition, analysis, compression, registration, and motion detection [41,46]. Image processing has a wide range of applications including communication, computer vision, acoustics, satellite imaging, medical and industrial diagnosis, and many more.

Image processing tasks often require large-scale computations, mainly due to the large amount of data to be processed. A typical gray-scale still image with moderate resolution, such as 1024 × 1024, has over a million pixels. A color image of the same resolution is three times as large. A video sequence usually consists of over 24 color frames per second, with each frame being a still image. A multi-spectral image contains a collection of several (usually more than 3) monochrome images of the same scene, each taken at a different wavelength by a different sensor. In addition, many applications, such as airport screening and unmanned vehicle navigation, require real-time response. All of these demand efficient and reliable algorithms.

Traditional image processing methods are mainly based on Fourier/wavelet or statistical approaches. The best example is the current international image compression standards, JPEG and JPEG2000, which are based on the discrete cosine transform (DCT) and wavelet transforms, respectively. For this reason, more and more images are stored via their wavelet coefficients.
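The storage figures quoted above follow from a line of arithmetic (assuming the usual 8-bit setting of one byte per channel per pixel):

```python
# raw sizes implied by the numbers in the text (8 bits per channel assumed)
gray_pixels = 1024 * 1024               # pixels in one 1024 x 1024 gray-scale image
color_bytes = 3 * gray_pixels           # three color channels per pixel
video_bytes_per_sec = 24 * color_bytes  # 24 color frames per second

print(gray_pixels, color_bytes, video_bytes_per_sec)
```

That is roughly 75 MB of raw data per second of video, which is why fast transforms and aggressive compression matter in practice.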
The tremendous success of wavelets in image processing is due to their attractive properties, including multiresolution data structures, fast transform algorithms, and superb energy concentration ability, which allows one to approximate functions (images) using only a relatively small number of coefficients. Thousands of researchers have devoted their efforts to the development of wavelet theory, analysis, and algorithms in different applications. Groundbreaking contributions include Meyer's wavelet theory [53], Daubechies' compactly supported orthogonal wavelets [33], Mallat's multiresolution analysis [49,50], Shapiro's progressive zerotree image coding algorithm [63], and many other works cited in books such as [28,34,44,51], and [64].

Roughly speaking, wavelet transforms can express any square integrable function as a superposition of wavelet basis functions, which are generated by dilations and translations from a few (if not a single) wavelet functions. The superposition coefficients are called wavelet coefficients; they are standard L² inner products between the wavelets and the given function. Wavelet transforms are realized by filtering procedures. Usually, wavelet coefficients are classified into two types: low or high frequency. Low frequency coefficients correspond to certain kinds of weighted local averages of the data values. High frequency coefficients are related to certain order derivatives. Therefore, high frequency coefficients are small for smooth functions and large for functions containing discontinuities.

In applications, it is inevitable that some of the wavelet coefficients, especially the high frequencies, are unavailable, for intentional or involuntary reasons. For instance, in wavelet-based image compression, insignificant (smaller in magnitude) high frequency coefficients are discarded on purpose to save storage space. In lossy channel communication, coefficients are lost or damaged during transmission due to unwanted disturbances. Obviously, with incomplete wavelet coefficients one cannot synthesize the exact original functions, and many problematic issues can arise as a result. The one that has drawn the most attention is that oscillations are generated near discontinuities: the famous Gibbs' phenomenon in mathematics, and the edge artifacts of image processing. Several directions have been taken to improve the performance of wavelet based image processing methods by reducing the Gibbs' oscillations and by better preserving geometrical information in images.
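The claim above, that high frequency coefficients are small for smooth functions and large near discontinuities, is easy to observe numerically. The sketch below uses one level of Haar high-pass filtering (an assumed choice; any wavelet with at least one vanishing moment behaves similarly) on a smooth ramp versus a step function; the step is placed inside a filter pair so that the filter actually sees it.

```python
import math

def haar_details(x):
    # one level of orthonormal Haar high-pass (difference) coefficients
    s2 = math.sqrt(2.0)
    return [(x[i] - x[i + 1]) / s2 for i in range(0, len(x), 2)]

smooth = [i / 16.0 for i in range(16)]   # slowly varying ramp
step = [0.0] * 7 + [1.0] * 9             # discontinuity between samples 6 and 7

d_smooth = haar_details(smooth)          # all small
d_step = haar_details(step)              # one large coefficient at the jump
```

This size gap is exactly what thresholding-based compression and denoising exploit: the few large coefficients carry the edges, while the many small ones can be discarded.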
One strategy involves the use of nonlinear thresholding procedures to allocate more storage resources to significant coefficients. Well-known examples include translation invariant denoising methods [31], wavelet hard thresholding, and wavelet shrinkage (also called soft thresholding) [37]. Another strategy is to build new geometry-friendly wavelet-like multiresolution representations, such as ridgelets [8], curvelets [9], beamlets [36], bandelets [57], and many more recent developments. With geometry directly incorporated into the construction of the multiresolution representations, the decompositions are expected to perform better near discontinuities. The third direction is to modify the existing wavelet transforms so that fewer large high frequency coefficients are generated near discontinuities. Thus, less information
is truncated in the thresholding process. Many methods have been proposed, such as Harten's remarkable general multiresolution framework [42] and its recent developments [2], the adaptive lifting scheme [30], and the adaptive Essentially Non-Oscillatory (ENO) wavelet transforms [25,26]. Many recent contributions are collected in [65].

In a different direction, PDE techniques for image processing, pioneered by Mumford–Shah's segmentation functional [55], Rudin–Osher–Fatemi's Total Variation (TV) restoration [59], and Perona–Malik's anisotropic diffusion [58], have emerged more recently. Due to their outstanding properties in handling geometrical information, different variational PDE models and methods have been proposed and studied for a variety of image processing goals, such as affine scale space [62], fundamental equations of image processing [1], total variation image analysis [16], active contours for segmentation [12,23], blind deconvolution [24], image interpolation and inpainting [4,5,18,20,52], restoration [17,19], and compression [27,38]. The field has been significantly enriched, and many books have been published in recent years (see [3,13,21,54,56,61,66] and references therein).

Given the developments in both wavelets and PDE techniques in image processing, it is natural to think of combining their advantages to gain more benefits in applications, especially when geometrical features are important. Well designed wavelet PDE methods can retain the good properties of wavelets, such as multiresolution structure and fast algorithms. Meanwhile, they are able to use PDE concepts, such as gradients and curvatures, to capture, control, and manipulate geometrical information, achieving image processing goals in a more systematic manner. Quite a few examples have demonstrated the combined advantages in different applications [10,15,27,29,39,48].
In this paper, it is not our intention to give a complete survey of either wavelets or PDE techniques in image processing. Instead, we focus on a recent trend that combines the two. To be self-contained, we start with a brief introduction to wavelets, followed by PDE techniques in image processing. We hope to use selected topics, based on our own experience, to help readers, especially beginners, learn some basic models and a few commonly used methodologies of the subject. The rest of the paper is arranged as follows. Section "Wavelets in Image Processing" is a brief introduction to wavelets and their applications in image processing. Section "PDE Techniques" presents some well known PDE models in image processing. In Sect. "Wavelet Based Variational PDE Methods" we give some new developments combining
wavelets and PDE techniques. A concise list of future directions is stated at the end.

Wavelets in Image Processing

Historically, Fourier decompositions, which express any given square integrable function as a superposition of sinusoidal functions, have been the major tool of image processing due to their efficient representations and fast Fourier transforms (FFT). This is particularly true for 1-D signals, such as audio sequences. However, all Fourier basis functions have global support, which implies that any local change in the given function results in a global change in the representation. For this reason, Fourier bases are not efficient at representing local information, such as discontinuities. The well-known Gibbs' phenomenon is an exhibition of this limitation. Unfortunately, most salient features in images, such as edges and corners, are local and discontinuous. Thus, all Fourier-based methods for image processing suffer from ringing artifacts.

Facing this shortcoming, it is highly desirable to have efficient representations which can better handle local information, especially discontinuities. More precisely, the basis functions should have local support or fast decay properties, so that a local perturbation only causes changes in a small neighborhood and not in far away places. To a certain extent, wavelets are designed to fulfill this expectation, and they have gained unsurpassed success in many applications of image processing.

After several decades of intensive study, wavelets have developed into a very rich mathematical theory. There are many different types of wavelets, such as Meyer's wavelets, spline wavelets, and bi-orthogonal wavelets. Here, we present a very brief introduction based on Daubechies' compactly supported wavelets and their connections to compression and denoising.

Wavelets

Wavelets can be viewed as orthonormal bases of the square integrable function space L²(R).
It starts with a carefully selected scaling function φ(x) and corresponding wavelet ψ(x) defined on a finite support [0, l], where l is a positive integer. We refer to [34] for the detailed selection procedure for φ(x) and ψ(x). Much commonly used software, such as MATLAB, already has built-in routines for the scaling and wavelet functions. The functions φ(x) and ψ(x) satisfy the dilation equations (also called two-scale relations or refinement equations in some literature):

  φ(x) = √2 Σ_{s=0}^{l} c_s φ(2x − s) ,    (1)
Wavelets and PDE Techniques in Image Processing, a Quick Tour of, Figure 1  a The scaling function of the Daubechies-6 wavelet. b The corresponding wavelet
and

  ψ(x) = √2 Σ_{s=0}^{l} h_s φ(2x − s) ,    (2)

where the c_s's and h_s's are constants, called the low- and high-pass filters, respectively. To give examples, the famous Haar wavelet selects

  φ(x) = 1 for x ∈ [0, 1), and 0 otherwise;

  ψ(x) = 1 for x ∈ [0, 1/2), −1 for x ∈ [1/2, 1), and 0 otherwise,
which are step functions. We also plot the scaling and wavelet functions of Daubechies-6 in Fig. 1. Using dilation and translation, one can form families of functions from φ(x) and ψ(x) as follows:

  φ_{j,k}(x) = 2^{j/2} φ(2^j x − k) ,    (3)

and

  ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k) ,    (4)

where (j, k) are integers. Then the collection of ψ_{j,k}(x) forms an orthonormal basis of L²(R). This means that for any given function f(x) ∈ L²(R), one has

  f(x) = Σ_{j,k} ⟨f(x), ψ_{j,k}(x)⟩ ψ_{j,k}(x) ,    (5)

where ⟨·, ·⟩ denotes the standard L²(R) inner product

  ⟨f(x), g(x)⟩ = ∫_R f(x) g(x) dx .
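The orthonormality behind the expansion (5) can be checked numerically for the Haar family defined above. The code below is a small sketch: it approximates the L² inner products on [0, 1) with Riemann sums on a dyadic grid (the grid size and tolerance are arbitrary choices).

```python
def psi(x):
    # Haar mother wavelet
    if 0.0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1.0:
        return -1.0
    return 0.0

def psi_jk(j, k, x):
    # dilated and translated family, cf. Eq. (4)
    return 2.0 ** (j / 2.0) * psi(2.0 ** j * x - k)

def inner(f, g, n=4096):
    # Riemann-sum approximation of the L2 inner product on [0, 1)
    h = 1.0 / n
    return sum(f(i * h) * g(i * h) for i in range(n)) * h
```

Wavelets at the same scale are orthogonal because their supports are disjoint, while wavelets across scales are orthogonal by cancellation of the + and − parts; both facts show up in the computed inner products.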
There are many desirable properties of the scaling functions and wavelets. Among them, locality and oscillation are the most cited common features of all wavelets. Literally speaking, they make wavelets behave like localized small waves, which also explains the origin of the name. The locality, which often refers to compact support or fast decay properties, enables wavelets to decompose or approximate functions locally. This satisfies the desire of many applications, particularly in image processing. A good mathematical way to describe the oscillatory nature of wavelets is their vanishing moment property, which means

  ∫ ψ(x) x^j dx = 0 ,  j = 0, 1, …, p − 1 ,    (6)

where p is a positive integer. In this case, the wavelet ψ(x) is said to have p vanishing moments. The more vanishing moments, the more oscillations in the wavelet, in general. Locality and oscillation together have been the main driving engines of the success of wavelets in many applications.

Multi-Resolution Analysis

The success of wavelets also relies on their connection to Multi-Resolution Analysis, introduced by Mallat [49,50]. Consider the subspace of L²(R) defined by the scaling functions φ_{j,k}(x),

  V_j = span{φ_{j,k}(x); k ∈ Z} ,
for every fixed j. The dilation Eq. (1) implies that the subspaces form an ordered chain,

  V_{j−1} ⊂ V_j ⊂ V_{j+1} ⊂ V_{j+2} ⊂ … ,  j ∈ Z ,

which also satisfies

  lim_{j→∞} V_j = L²(R) ,  lim_{j→−∞} V_j = {0} .

Here, larger indices j correspond to finer resolutions or scales. Similarly, one can define the subspaces generated by the wavelets ψ_{j,k}(x),

  W_j = span{ψ_{j,k}(x); k ∈ Z} .

The dilation Eq. (2) implies the following connection between W_j and V_j:

  V_j = V_{j−1} ⊕ W_{j−1} ,  j ∈ Z .    (7)
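The vanishing moment property (6) stated earlier can likewise be verified numerically. The sketch below checks that the Haar wavelet has exactly p = 1 vanishing moment: its zeroth moment vanishes while its first moment does not (the midpoint quadrature is an arbitrary choice).

```python
def haar_psi(x):
    # Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1)
    if 0.0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1.0:
        return -1.0
    return 0.0

def moment(j, n=4096):
    # midpoint-rule approximation of the integral of psi(x) * x^j over [0, 1)
    h = 1.0 / n
    return sum(haar_psi((i + 0.5) * h) * ((i + 0.5) * h) ** j for i in range(n)) * h
```

The first moment evaluates to −1/4, confirming that Haar cannot annihilate linear trends; higher-order Daubechies wavelets are constructed precisely to gain more vanishing moments.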
Apparently, fast wavelet transforms involve only the coefficients, and they can be started once one knows the low frequency coefficients {α_{I,k}} at a certain fine resolution I. It is then natural to ask how to obtain {α_{I,k}}. Theoretically, {α_{I,k}} must be computed as ⟨f(x), φ_{I,k}(x)⟩ according to the definition. However, in practice they are often replaced by the point-wise values f(x_k), even though such an action is called a "wavelet crime" in [64]. The replacement makes sense when the function f(x) is smooth and the resolution I is fine enough, because the low frequency coefficients α_{I,k}, which are weighted local averages of f(x), are then very close approximations of the point-wise values.

The wavelet transforms described above are for 1-D functions. Wavelet transforms for 2-D images are realized in practice by a simple tensor product: 2-D transforms are obtained by performing column-wise 1-D transforms followed by row-wise 1-D transforms.
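The tensor-product construction just described can be sketched directly: transform every column with a 1-D transform, then every row. The example below assumes a one-level orthonormal Haar transform as the 1-D building block; for a constant image all of the energy lands in the single low-low coefficient.

```python
import math

def haar_1d(v):
    # one level of the orthonormal Haar transform (averages then differences)
    s2 = math.sqrt(2.0)
    a = [(v[i] + v[i + 1]) / s2 for i in range(0, len(v), 2)]
    d = [(v[i] - v[i + 1]) / s2 for i in range(0, len(v), 2)]
    return a + d

def haar_2d(img):
    # column-wise 1-D transforms followed by row-wise 1-D transforms
    nrows, ncols = len(img), len(img[0])
    cols = [haar_1d([img[r][c] for r in range(nrows)]) for c in range(ncols)]
    tmp = [[cols[c][r] for c in range(ncols)] for r in range(nrows)]
    return [haar_1d(row) for row in tmp]
```

Separability is what keeps 2-D transforms cheap: the cost is just two passes of the 1-D fast transform, rather than a genuinely 2-D computation.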
Therefore, L²(R) can be decomposed as

  L²(R) = V_J ⊕ Σ_{j≥J} W_j = Σ_{j=−∞}^{∞} W_j ,

where J is an arbitrary reference resolution level. Consequently, f(x) ∈ L²(R) can be decomposed into a multiresolution representation as

  f(x) = Σ_k α_{J,k} φ_{J,k}(x) + Σ_{j≥J,k} β_{j,k} ψ_{j,k}(x) ,    (8)
where α_{j,k} = ⟨f(x), φ_{j,k}(x)⟩ is called a low frequency (or scaling) coefficient, and β_{j,k} = ⟨f(x), ψ_{j,k}(x)⟩ is a high frequency (or wavelet) coefficient. Without causing confusion, we call them all wavelet coefficients for simplicity in this paper. The decomposition (7) and the dilation Eqs. (1), (2) lead to the following filtering and down-sampling procedures to compute the coarser scale wavelet coefficients from the finer scale coefficients:

  α_{j,k} = Σ_{s=0}^{l} c_s α_{j+1,2k+s} ,    (9)

and

  β_{j,k} = Σ_{s=0}^{l} h_s α_{j+1,2k+s} .    (10)

These are the famous fast wavelet transforms.

Wavelet Thresholding and Image Processing

The wavelet representations (8) provide a mechanism to approximate functions in a multi-resolution fashion. For instance, the jth scale (resolution) approximation is simply defined as

  f_j(x) = Σ_k α_{j,k} φ_{j,k}(x) = Σ_k α_{J,k} φ_{J,k}(x) + Σ_{J≤i<j,k} β_{i,k} ψ_{i,k}(x) .    (11)

Thresholding modifies the wavelet coefficients so that only the significant ones are retained. Given a threshold λ > 0, the hard thresholding rule keeps the large coefficients and sets the small ones to zero,

  β̄_{j,k} = β_{j,k} if |β_{j,k}| > λ ,  and  β̄_{j,k} = 0 if |β_{j,k}| ≤ λ ,

while the soft thresholding (shrinkage) rule is

  β̄_{j,k} = sign(β_{j,k}) (|β_{j,k}| − λ) if |β_{j,k}| > λ ,  and  β̄_{j,k} = 0 otherwise ,

where sign(·) is the signum function.
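The two thresholding rules above translate directly into code. A minimal sketch (the coefficient lists and threshold value are arbitrary illustrations):

```python
def hard_threshold(coeffs, lam):
    # keep the large coefficients exactly, zero out the rest
    return [c if abs(c) > lam else 0.0 for c in coeffs]

def soft_threshold(coeffs, lam):
    # additionally shrink the surviving coefficients toward zero by lam
    return [(1.0 if c > 0 else -1.0) * (abs(c) - lam) if abs(c) > lam else 0.0
            for c in coeffs]
```

Hard thresholding preserves the kept coefficients exactly, while soft thresholding shrinks them by λ, which tends to damp Gibbs-type oscillations at the cost of a slight bias.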
The selection of the threshold λ has also been investigated by many groups. Among the many proposed strategies, Donoho–Johnstone's SQTWOLOG [37] and Stein's unbiased risk estimate have been widely used.

Thresholding procedures have achieved remarkable success in image processing, especially in compression. It is easy to understand why wavelet thresholding is useful in this application: one does not have to store the coefficients that are zero. However, practical compression schemes are more subtle. The problem is that one needs to remember not only the non-zero wavelet coefficients, but also their locations, and the location information may occupy more storage space than the coefficients themselves if it is recorded in a naive way. Shapiro's zerotree scheme [63] introduces a tree structure on the wavelet coefficients based on their multiresolution property: a branch of the tree can be represented by a single bit '0' if all coefficients in the branch are zero. This is used in conjunction with thresholding to achieve very efficient compression. Many well known state-of-the-art compression methods, such as the Set Partitioning in Hierarchical Trees (SPIHT) [60] and Group Test Wavelet (GTW) [45] compression algorithms, are based on the zerotree idea.

Simple thresholding also provides fast and effective methods for noise removal, with many successful applications to communication, military, and medical images. In Fig. 2, we display the denoising effects of wavelet hard (b) and soft (c) thresholding on a test image with additive white noise (a). From a mathematical point of view, the success of thresholding can be explained by its connections to optimization. It has been shown that many thresholding results are optimal in a certain sense. In layman's terms, those thresholding results are the best under certain criteria.
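The optimality claim above rests on Parseval's identity: in an orthonormal basis, the squared L² error of an approximation equals the energy of the discarded coefficients, so keeping the M largest coefficients in magnitude is the best possible M-term choice. A tiny numerical illustration (the coefficient values are made up):

```python
coeffs = [5.0, -3.0, 2.0, -0.5, 0.1, 0.05]

def sq_error(kept):
    # by Parseval, the squared L2 error equals the energy of the dropped coefficients
    return sum(c * c for i, c in enumerate(coeffs) if i not in kept)

top2 = {0, 1}    # indices of the two largest coefficients in magnitude
other = {4, 5}   # any other choice of two indices
```

Any other selection of M coefficients drops at least as much energy, which is the elementary content of the best M-term approximation result.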
For example, let us assume that the hard thresholding reconstruction

  f̄(x) = Σ_k ᾱ_{J,k} φ_{J,k}(x) + Σ_{j,k} β̄_{j,k} ψ_{j,k}(x)

has M nonzero wavelet coefficients. Then f̄(x) is the minimizer of the following optimization problem:

  min_g ‖f − g‖²  subject to  g has at most M nonzero wavelet coefficients.

This leads to the conclusion that hard thresholding gives the best M-term approximation in L²(R) among all possible combinations. In a more general setting, as discussed in [15] and [67], it is proved that soft thresholding gives the minimizer of the optimization problem

  min_g { ‖f − g‖² + 2λ ‖g‖_{B¹₁(L¹)} } ,

where B¹₁(L¹) is a Besov space. And the linear thresholding gives an approximate minimizer of the optimization problem

  min_g { ‖f − g‖² + 2λ ‖g‖_{W^m(L²)} } ,
where W^m(L²) is a Sobolev space. We refer readers to [15] for a detailed discussion.

PDE Techniques

Compared to wavelets, modern PDE techniques in image processing have appeared more recently, even though some traditional image processing methods can be interpreted from a PDE perspective. For instance, the classical Gaussian filter for image denoising is accomplished by convolving the noisy image u₀ with the Gaussian kernel (also called the heat kernel in the literature) G(x, t) = (1/(2πt)) exp(−|x|²/(2t)):

  u = G ∗ u₀ = ∫ u₀(y) G(x − y, t) dy .    (13)

This denoised image u is actually the solution u(x, t) of the following diffusion PDE,

  u_t(x, t) = D Δu(x, t) ,  u(x, 0) = u₀(x) ,    (14)

where Δ is the Laplace operator and D = 1/2 is the diffusive coefficient.

Modern PDE techniques have drawn great attention and achieved remarkable success in the past two decades. This is due to their extraordinary ability to handle geometrical features, which is lacking in traditional statistical or Fourier/wavelet based approaches. Two different strategies are commonly used to design PDE techniques for different image processing goals:

1. Construct PDE-based evolution processes and incorporate geometry into the equations.
2. Pose the image processing task in a variational framework and derive the corresponding Euler–Lagrange equations to compute the minimizers.

In both strategies, the image processing goals are achieved by solving PDE's. Next, we use a few well-known examples to demonstrate these two strategies.

Anisotropic Diffusion for Denoising

Image denoising removes unwanted disturbances from images. Very often, those disturbances, such as white noise and pepper-and-salt noise, are highly localized and oscillatory. This makes it hard to separate noise from edges, which are also local and discontinuous. As an anti-oscillation procedure, diffusion is a natural choice for denoising. As mentioned earlier, the classical Gaussian filter for denoising is equivalent to the linear isotropic diffusion (14). However, it has been observed in both experimental and theoretical studies that isotropic diffusion unavoidably smears the sharp edges, corners, and other geometrical features embedded in u₀ while filtering out noise. This is because it treats all orientations identically and never recognizes the presence of spatially coherent discontinuities, the edges. In addition, the larger the diffusive coefficient D, the quicker the smoothing. To remedy this drawback, Perona–Malik [58] proposed using anisotropic diffusion instead:

  u_t = ∇ · (D(x, u, ∇u) ∇u) .    (15)

The diffusivity coefficient D is data dependent and must sense the existence of edges, so that the PDE stops diffusing across the discontinuities. For this purpose, it is desirable to have D satisfy the following requirements:

  D = large when |∇u| is small, on intra-regions;
  D = small when |∇u| is large, near edges.    (16)

Therefore, the evolution only smooths out the oscillations away from the edges, not across them. In [58], D is selected as

  D = g(|∇u|²) ,
where g is a smooth positive concave function satisfying g(+∞) = 0. For example, g can be taken as

  g(|∇u|²) = exp(−|∇u|²/(2σ²)) ,  or  g(|∇u|²) = 1/(1 + b|∇u|²) ,

where σ and b > 0 are constants. In practice, the anisotropic diffusion (15) faces a challenge in computing the coefficient D robustly. This may be troublesome especially at the beginning of the diffusion process, when u₀ contains highly oscillatory noise: |∇u| is then large almost everywhere, so D is small everywhere, and the diffusion is not effective in removing noise. To overcome this difficulty, the use of a mollified image in g has been proposed in [14], which takes the form

  u_t = ∇ · (g(|∇(G ∗ u)|²) ∇u) ,  u(x, 0) = u₀(x) ,
where G is again the Gaussian kernel. Along the lines of anisotropic diffusion, much more research has been done, including the well known general axiomatic scale-space theory in [1]. We refer readers to [21,66] for more discussion.

Total Variation Image Denoising

A different viewpoint on denoising is to reduce the uncorrelated local oscillations in images. Mathematically speaking, the total variation (TV) is a quantity that measures the oscillations of a function, and it is intuitive that oscillatory noise greatly increases the TV norm. Naturally, one can therefore think of denoising as reducing the total variation of the image. In fact, this observation leads to the famous TV model proposed by Rudin–Osher–Fatemi [59]:

  min_u ∫ |∇u| dx  subject to  ‖u − u₀‖² ≤ σ² ,    (17)

where σ is related to the noise level. The objective functional reduces the oscillations in the reconstruction, and the constraint is a fitting requirement. This optimization problem can be read as finding the least oscillatory image within a small ball of radius σ centered at the noisy image u₀. The model is often re-formulated as an unconstrained minimization problem:

  min_u ∫ |∇u| dx + (λ/2) ‖u − u₀‖² ,    (18)

where λ ≥ 0 is a Lagrange multiplier, which is the factor balancing the competition between oscillations and fidelity. The smaller the λ, the fewer details in the denoised images. In extreme situations, the solution of (18) is a flat constant when λ is zero, or the noisy image u₀ itself when λ is infinite.

The most outstanding advantage of the TV denoising model (18) is that it allows sharp edges to be preserved in the reconstruction: the TV model has the ability to reduce small oscillations (noise) without penalizing the edges. This feature has been well understood in the context of computational fluid dynamics (CFD), especially in shock capturing, where the TV semi-norm is used intensively. In fact, the authors of [59] are also experts in CFD, and there is no doubt that (17) was inspired by numerical shock capturing.

Another attraction of TV denoising is its geometrical properties. For functions with finite TV semi-norm, this can be seen clearly through the equivalent coarea formula

  ∫ |∇u| dx = ∫_{−∞}^{+∞} ( ∫_{{u=γ}} ds ) dγ .

Here, the inner term ∫_{{u=γ}} ds is the length of the level set {u = γ}, and the TV semi-norm is obtained by integrating along all level contours {u = γ} over all values of γ. This suggests that the TV semi-norm controls both the size of the jumps and the geometry of the level sets.

The geometric connection of TV minimization is more visible if one analyzes the optimization (18) by the calculus of variations. The standard theory shows that the minimizer must satisfy the following Euler–Lagrange equation:

  −∇ · (∇u/|∇u|) + λ (u − u₀) = 0 .    (19)

The first term, the functional derivative of the TV semi-norm, is precisely the curvature of the level sets of the image, which makes the method more geometry friendly. For noisy pixels, the jumps are isolated and their curvature is large, so they are wiped out much more quickly than the edges, which are coherent jumps with relatively smaller curvature.

The best known, but not necessarily the most efficient, algorithm for solving (18) is the gradient descent method, which introduces an artificial time to form an evolution PDE:

  u_t = ∇ · (∇u/|∇u|) − λ (u − u₀) .    (20)

Compared to (15), the gradient descent of TV minimization (20) is also an anisotropic diffusion, with the degenerate diffusive coefficient D = 1/|∇u|, and it satisfies the anisotropic diffusion requirement (16). In particular, if an edge is sharp, D will be zero and no diffusion is performed across the edge. In practice, to prevent numerical blow-
up caused by |∇u| = 0 in the denominator, |∇u| is often replaced by √(|∇u|² + ε), where ε is a small positive number. Actually, this replacement can also be derived from a variational framework. An interesting and natural question is why one would want to use the TV semi-norm in (18) instead of ∫ |∇u|² dx, which is the famous Sobolev H¹ semi-norm in PDE's. In fact, a very similar calculation shows that the H¹ minimization leads exactly to the isotropic diffusion (14), which loses the geometrical properties.

Variational Models for Image Segmentation

The purpose of image segmentation is to divide an image into regions within which the image has similar features, such as intensity values, texture patterns, or belonging to the same objects. Segmentation is a crucial building block for many high level image processing and vision tasks, such as object detection, recognition, and tracking. Obviously, one image may produce different partitions because of different segmentation criteria. This inherent non-uniqueness, which is also true for many other image processing tasks, makes the segmentation problem very challenging. There is extensive literature on the subject, and many methods have been proposed using different strategies. For example, the celebrated intensity-edge mixture model is statistics-based [40], while the widely-used active contour (also called snake) model [47] uses a variational framework. We take the well-known Mumford–Shah segmentation model [55] and the Chan–Vese region-based active contour model [23] as examples to demonstrate how mathematical formulations and computational strategies can contribute to segmentation.

The original Mumford–Shah segmentation model is stated in a variational format:

  min_{u,Γ} λ₁ ∫_Γ ds + (λ₂/2) ∫_{Ω∖Γ} |∇u|² dx + (λ₃/2) ∫_Ω (u − u₀)² dx ,    (21)
where λ₁, λ₂, and λ₃ are three constants, u is the piecewise smooth approximation representing the different segments, Ω is the region on which the image is defined, and Γ is the interior boundary separating the different segments. The first term is the length of the interior boundary curves. The second term enforces isotropic smoothing within each homogeneous region. Similar to the TV minimization model (18), the third term is the fitting term. In the formulation (21), segmentation is achieved by balancing the three competing terms, and different ratios among λ₁, λ₂, and λ₃ give different partitions.
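All of the variational models in this section are commonly minimized by gradient descent, as in (20). A minimal 1-D sketch of one descent step for the ε-regularized TV energy (18) is given below; the grid, time step, regularization ε, and periodic boundaries are all assumptions of the sketch, not part of the models above.

```python
import math

EPS = 0.01  # regularization of |u_x|, cf. the sqrt(|grad u|^2 + eps) replacement

def tv_energy(u, u0, lam):
    # discrete eps-regularized TV term plus the fitting term of (18)
    n = len(u)
    tv = sum(math.sqrt((u[(i + 1) % n] - u[i]) ** 2 + EPS) for i in range(n))
    fit = 0.5 * lam * sum((u[i] - u0[i]) ** 2 for i in range(n))
    return tv + fit

def tv_step(u, u0, lam, dt):
    # one explicit gradient-descent step, cf. the evolution PDE (20)
    n = len(u)
    def flux(i):
        ux = u[(i + 1) % n] - u[i]
        return ux / math.sqrt(ux * ux + EPS)
    return [u[i] + dt * (flux(i) - flux(i - 1)) - dt * lam * (u[i] - u0[i])
            for i in range(n)]
```

Each step decreases the smooth, regularized energy for a small enough time step; iterating drives the reconstruction toward the TV minimizer while the degenerate diffusivity keeps large jumps (edges) largely intact.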
The Mumford–Shah model has many desirable properties and is very general; many other known models can be viewed as special cases of it. However, it also faces serious computational challenges because the partition boundary $\Gamma$ is unknown, and thus the first term, a line integral along $\Gamma$, has no easy way to be computed. To ease these challenges, many other models have been proposed with better computational properties. Among them, Chan–Vese's active contour without edges model [23] has gained remarkable success due to its simplicity and robustness. Assume that $\Gamma$ is a closed curve partitioning the segments. The model is designed to move $\Gamma$ so that the following energy is minimized,

$$\min\ \lambda_1\,\mathrm{Length}(\Gamma) + \lambda_2\,\mathrm{Area}(\mathrm{inside}(\Gamma)) + \lambda_3 \int_{\mathrm{inside}(\Gamma)} (u_0 - c_1)^2\,dx + \lambda_4 \int_{\mathrm{outside}(\Gamma)} (u_0 - c_2)^2\,dx\,, \qquad (22)$$

where $\lambda_1, \lambda_2, \lambda_3$ and $\lambda_4$ are positive fixed parameters. This model uses the piecewise constant approximations $c_1$ and $c_2$ inside and outside the partition curves. If one picks $\lambda_2 = 0$, (22) becomes the minimal partition model, which is a special case of (21). The Chan–Vese model (22) can also be formulated in a level set framework, which leads to a fast and robust computational method and has sparked a large amount of follow-up research using level set-based active contour methods for segmentation in different applications.

PDE Method for Image Inpainting

Image inpainting, or its mathematical synonym image interpolation, fills in missing or damaged image regions based on known surrounding information. It is a fundamental problem with a large body of prior work, and it shares common ground with many other image processing tasks, such as image replacement, error concealment, edge completion and image editing. Here we use 1) a third order nonlinear inpainting PDE by Bertalmio et al. [6], and 2) a variational inpainting model by Chan–Shen [20], as two examples to illustrate how modern mathematics is used for this traditionally labor-intensive task (image inpainting used to be done by hand). Similar to segmentation, image inpainting is an inverse problem with possibly multiple solutions: when information is missing, different people may patch the regions in different ways, and all of them may look reasonable. However, it is commonly agreed that the inpainted regions must have geometrical features and texture patterns consistent with
their surroundings. For this reason, many of the inpainting methods are based on geometrical interpolation or extrapolation. One example is the remarkable third order inpainting PDE introduced by Bertalmio et al. [6]. In fact, the term image inpainting was first used by them, and the work has stimulated a wave of interest in inpainting-related problems. The PDE is given as

$$u_t = \nabla(\Delta u)\cdot \nabla^{\perp} u\,, \qquad (23)$$

where $\nabla^{\perp}$ is the orthogonal gradient direction (the isophote direction, as it is called in the original paper). The idea behind (23) is a brilliant intuition: transport of information along broken level lines (isophotes). The PDE is solved only inside the inpainting regions, with proper boundary conditions based on the gray values and isophote directions. It was discovered later that (23) actually connects to the famous Navier–Stokes equations in CFD [5]. Chan–Shen's inpainting model tackles the problem from a different angle. It starts with a variational principle inspired by the TV restoration model [59],

$$\min_u \int_{\Omega} |\nabla u|\,dx + \frac{\lambda}{2} \int_{\Omega\setminus D} (u - u_0)^2\,dx\,, \qquad (24)$$
where $D$ is the inpainting region. Similar to (17), a straightforward interpretation of this model is that the minimizer $u$ is the least oscillatory image that is close enough to the given image $u_0$ outside the inpainting region. For the region inside $D$, the restored $u$ has no restriction except matching its surroundings in a least oscillatory fashion. This model leads to a nonlinear data-dependent PDE similar to (19) and can be solved numerically. The results are impressive, and much follow-up work has been performed to analyze the model and to extend it with more sophisticated measurements, such as Euler's elastica, for better curve treatment [18].

Wavelet Based Variational PDE Methods

As discussed in the previous sections, both wavelets and PDE techniques have been used extensively in image processing and have achieved tremendous success in numerous applications. Their success rests on different properties of the two approaches. Wavelets have a multi-resolution data structure, energy concentration (sparsity) and fast algorithms. PDE techniques are geometrically friendly and often tied to variational principles. A closer look at both approaches reveals that these properties do not overlap, and one cannot be used to replace the other. In
a certain sense, they are complementary to each other, and it seems natural to combine the advantages of both. In fact, many research efforts have been put forward in this direction. Two different strategies have been explored to merge PDE techniques with wavelets: (a) use computational PDE skills to modify the standard wavelet transforms to form new transforms with better geometric properties; (b) design new wavelet-based variational models for different image processing tasks. In this section, we select a few examples to demonstrate both strategies.

ENO-Wavelet Transforms

As mentioned earlier in Sect. "Wavelets in Image Processing", it is a well-known fact that Fourier-based algorithms suffer from Gibbs' oscillations. Wavelets remarkably reduce the severity of the oscillations thanks to their locality, but the oscillations still exist unless one retains all discontinuity-related coefficients, which is not practical in many applications. To improve image quality, one needs to reduce the oscillations by lowering the threshold in thresholding procedures. As a consequence, more coefficients, especially edge-related ones, must be retained, which is why a majority of the storage is allocated to edge-related coefficients in JPEG2000. To further improve the performance and reduce the ringing artifacts, it is desirable to design wavelet-like transforms that generate fewer significant high-frequency coefficients. One way to achieve this goal is to incorporate geometrical features in the design of wavelet-like basis (or redundant frame) functions so that discontinuous functions can be represented more effectively. Many such constructions have been proposed, for example curvelets [9], beamlets [36], and bandelets [57], to name a few. A different approach is to reduce the wavelet filter length. This is based on the fact that the larger the supports of the wavelets, the longer the filters, and the worse the Gibbs' oscillations.
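This filter-length effect can be checked numerically. In the small sketch below (our own illustration, using the standard Haar and 4-tap Daubechies filter coefficients), we count how many high-pass coefficients a single jump contaminates; the longer filter spreads the jump over more coefficients:

```python
import numpy as np

# High-pass filters: Haar (2 taps) vs the 4-tap Daubechies filter.
s3, s2 = np.sqrt(3.0), np.sqrt(2.0)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * s2)  # Daubechies low-pass
g_db = np.array([h[3], -h[2], h[1], -h[0]])                # its high-pass mate
g_haar = np.array([1.0, -1.0]) / s2

step = np.concatenate([np.zeros(10), np.ones(10)])         # one jump

# 'valid' convolution keeps only coefficients unaffected by the signal ends;
# both filters sum to zero, so constant segments give exactly zero output.
big = lambda g: int(np.sum(np.abs(np.convolve(step, g, mode="valid")) > 1e-12))
print(big(g_haar), big(g_db))  # longer filter -> more jump-affected coefficients
```

Here the 2-tap Haar filter perturbs a single coefficient, while the 4-tap filter perturbs three; in general a length-$L$ filter contaminates $L-1$ coefficients around a jump.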
Then, one can adaptively use shorter wavelet filters in the vicinity of edges. The adaptive lifting scheme proposed in [30] employs this idea. ENO-wavelet transforms approach the problem from a different angle. They are inspired by Harten's original multi-resolution framework [42], which has had a profound impact on many new methods in the field. ENO-wavelet transforms borrow a key idea, the one-sided interpolation strategy, from ENO schemes for shock capturing [43]. Different from the aforementioned methods, which adapt the
Wavelets and PDE Techniques in Image Processing, a Quick Tour of, Figure 3 a The comparison between 4-level ENO-Daubechies-6 (solid line) and standard Daubechies-6 (dash-dotted line) approximations. b A zoom-in of the picture on the left near a discontinuity. Standard Daubechies-6 generates oscillations near discontinuities, but the ENO-Daubechies-6 does not
filters or basis functions to better fit the data, ENO-wavelet transforms change the data near edge areas and feed the modified data into the same standard wavelet filters. The data is changed in a special way so that the filters do not see the discontinuities. Imagine filtering around a jump discontinuity. A high-frequency coefficient is large if the high-pass filter is convolved with data across the jump. However, one can extend the data smoothly from both the left and the right sides and feed the extended data to the filters; then the high-frequency coefficients are small, as the high-pass filter only sees smooth data on either side. Of course, one may immediately object that near the jump we now have two different pieces of data overlapping in the same area. This is indeed a serious issue, as it causes a double storage problem: the number of wavelet coefficients in the jump region is doubled, which goes directly against the goals of many image processing tasks, especially image compression. Fortunately, the problem can be avoided by a strategy called coarse-level extrapolation, which extends the data in such a way that some of the jump-related wavelet coefficients are predictable and need not be stored; the storage is thereby reduced to that of standard wavelet transforms. We refer to [25] for detailed ENO-wavelet transform algorithms; here we just point out the main ideas and some important results. ENO-wavelet transforms can be used as functional replacements for standard wavelet transforms. Indeed, ENO-wavelet transforms reduce to standard wavelet transforms if no discontinuity is detected. They retain the essential properties and advantages of standard wavelet transforms, such as energy concentration, the multiresolution framework and fast transform algorithms, all without edge artifacts. They also achieve uniform approximation accuracy up to the discontinuities. If $\hat f_j(x)$ is the $j$th resolution approximation to $f(x)$ by ENO-wavelet transforms, then

$$\|\hat f_j(x) - f(x)\| \le C\,2^{-jp}\,\|f^{(p)}(x)\|_{\Omega\setminus\Gamma}\,, \qquad (25)$$

where $\Gamma$ is the set of discontinuous points. It is worth noting that the error bound (12) for standard wavelet transforms depends on the $p$th derivative of $f(x)$ on the entire region $\Omega$, which is unbounded if the discontinuity set $\Gamma$ is not empty. In contrast, the error for ENO-wavelet transforms (25) depends on $f^{(p)}(x)$ only on the domain $\Omega$ excluding $\Gamma$. This ensures that ENO-wavelet transforms perform uniformly accurately, regardless of the presence of discontinuities, which is probably the best result one may expect. In Fig. 3, we show a comparison between ENO-wavelets and standard wavelets.

Wavelet Based Minimal Energy Methods for Denoising

As discussed in Sect. "PDE Techniques", anisotropic diffusion and total variation minimization for image denoising have a tremendous ability to extract image features, especially edges, for better image quality preservation. However, it is also commonly recognized that such PDE techniques often pose higher computational demands, because numerical solutions of nonlinear PDEs must be computed iteratively, and many iterations must be performed to achieve reasonable solutions. This has been a major criticism of PDE techniques, especially when one compares them with wavelets, which have ultra-fast filtering algorithms.
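To make the iterative cost concrete, here is a minimal 1D sketch of gradient descent on an $\epsilon$-regularized ROF/TV energy; the function names, step size, regularization $\epsilon$ and synthetic signal are our own choices, and hundreds of sweeps are needed before the energy settles, in contrast to a single pass of wavelet filtering:

```python
import numpy as np

def tv_energy(u, f, lam=1.0, eps=1e-2):
    """Regularized ROF energy: sum sqrt(u_x^2 + eps) + lam/2 * ||u - f||^2."""
    ux = np.diff(u)
    return np.sum(np.sqrt(ux**2 + eps)) + 0.5 * lam * np.sum((u - f) ** 2)

def tv_denoise_1d(f, lam=1.0, eps=1e-2, dt=0.02, iters=500):
    """Explicit gradient descent on tv_energy; each sweep is one nonlinear
    diffusion step, and many sweeps are needed to approach the minimizer."""
    u = f.astype(float).copy()
    for _ in range(iters):
        ux = np.diff(u)
        flux = ux / np.sqrt(ux**2 + eps)   # regularized u_x / |u_x|
        div = np.zeros_like(u)
        div[:-1] += flux                   # discrete divergence of the flux
        div[1:] -= flux
        u -= dt * (-div + lam * (u - f))   # descend the energy gradient
    return u
```

On a noisy step signal, the iteration lowers the energy monotonically (for a sufficiently small step size) and removes most of the noise while keeping the jumps sharp; the point here is simply that this quality comes at the price of many sweeps through the data.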
To retain the capability of feature extraction while eliminating the need for iterations in anisotropic diffusion, there have been efforts to formulate geometry-friendly energy minimizations in wavelet spaces, so that the minimizers can be obtained directly from the wavelet coefficients without iterations. In fact, as discussed in Subsect. "Wavelet Thresholding and Image Processing", classical wavelet thresholdings, including linear, hard and soft thresholding, correspond to energy optimizations in certain function spaces. But those minimization problems are not built to handle geometrical features. In [29], Chui–Wang suggested a new geometrical energy minimization in wavelet space given as

$$\min_{\beta}\ E(\lambda;\beta) = E_i(\lambda;\beta) + \frac{1}{2}\|\beta - \alpha\|^2\,, \qquad (26)$$

where $E_i(\lambda;\beta)$ is a selected internal energy which can be expressed in terms of the wavelet coefficients $\beta$. The second term is the standard $L^2$ fitting requirement. In their original paper, a blended internal energy is chosen as

$$E_i(\lambda;\beta) = \sum_{j,k} \lambda_{j,k}\left(\varphi(m_{j,k}(p)) + \varphi(\beta^{d}_{j,k})\right)\,, \qquad (27)$$

where $m_{j,k}(p) = (|\beta^{h}_{j,k}|^p + |\beta^{v}_{j,k}|^p)^{1/p}$ and $\varphi(s) = |s|$. In (27), the notations $\beta^d$, $\beta^h$ and $\beta^v$ denote the 2-D tensor product wavelet coefficients along the diagonal, horizontal and vertical directions respectively. It is clear that the energy functional is resolution, orientation and spatially dependent. In this way, the energy functional can "see" the corners and edges in wavelet spaces, because those geometrical structures create correlated wavelet coefficients along the diagonal, horizontal and vertical directions. When $p = 2$, the minimizers of (26) and (27) can be attained explicitly from the wavelet coefficients as

$$(\beta^0)^{h,v}_{j,k} = \beta^{h,v}_{j,k}\left(1 - \frac{\lambda_{j,k}}{m_{j,k}}\right)_+\,, \qquad (28)$$

and

$$(\beta^0)^{d}_{j,k} = \operatorname{sign}(\beta^{d}_{j,k})\left(|\beta^{d}_{j,k}| - \lambda_{j,k}\right)_+\,, \qquad (29)$$

where $(\cdot)_+$ denotes the nonnegative part. Along this line, there has been a recent trend in the computational harmonic analysis community to design data-dependent nonlinear filters based on PDE techniques, for example, the adaptive digital TV filter presented in [19]. More recently, Chui and collaborators have proposed a new anisotropic filtering strategy based on finding approximate solutions of the anisotropic diffusion equations discussed in Subsect. "Anisotropic Diffusion for Denoising". Their method realizes image denoising by one sweep of nonlinear filtering.

Diffusion Wavelets

Diffusion wavelets were proposed by Coifman and collaborators in [32]. They are a different way to generalize classical wavelets using PDE and geometry concepts. The goal is to construct a multiresolution analysis framework on general geometric structures, such as manifolds, graphs or even discrete point sets, so that image processing tasks can be performed for functions defined on these structures. As discussed in Subsect. "Wavelets", standard wavelet multi-resolutions are based on dilations and translations. However, this is often impossible for general data structures, especially when little geometric information is known. To overcome this difficulty, diffusion wavelets use dyadic powers of a diffusion operator $T$ (with $\|T\| < 1$), such as the heat operator defined on the general data structure, to create scales. Two properties are crucial for constructing diffusion wavelets. One is that the spectra of high powers of $T$ decay faster as the power gets higher; consequently, one can use a few leading eigenfunctions of $T^j$ ($j$ large) to approximate the range space of $T^j$ accurately. The other is that applying higher powers of $T$ to local functions, such as a Dirac delta function at a point of a discrete data set, produces smoother functions with wider supports, because $T$ is a diffusion process. After a non-trivial process involving orthonormalization, for which we refer to [32] for details, one can construct a multi-resolution analysis based on $T^{2^j}$ ($j \in \mathbb{Z}^+$). In particular, once the multi-resolution analysis is formed, $T^{2^j}$ can be expressed in a highly compressed format. There are many potential applications, for example in data mining and learning theory. Here, we pick the following simple example to illustrate their usage. Let us consider computing the inverse $(I - T)^{-1}$ applied to an arbitrary vector $f$ defined on a general data structure; the operator $(I - T)^{-1}$ is a deblurring process commonly seen in image restoration. It is well known that
$$(I - T)^{-1} f = \sum_{k=0}^{+\infty} T^k f\,.$$

Define

$$S_K = \sum_{k=0}^{2^K - 1} T^k\,;$$

then an approximation to $(I - T)^{-1}$ can be achieved by

$$S_{K+1} f = (S_K + T^{2^K} S_K) f = \prod_{k=0}^{K} (I + T^{2^k}) f\,.$$

Since the powers $T^{2^k}$ have been compressed in the multiresolution analysis and can be applied to $f$ efficiently, $(I - T)^{-1} f$ is computed efficiently.
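This telescoping product can be verified numerically. In the sketch below (our own illustration, with a random symmetric contraction standing in for an actual compressed diffusion operator), $K = 5$ already matches the exact inverse to high precision, using only repeated squaring of $T$:

```python
import numpy as np

def approx_inverse(T, f, K=5):
    """Apply prod_{k=0}^{K} (I + T^(2^k)) to f, which equals
    sum_{j=0}^{2^(K+1)-1} T^j f, an approximation of (I - T)^{-1} f.
    Only repeated squaring of T is needed, mirroring the compressed
    dyadic powers T^(2^k) of the diffusion-wavelet construction."""
    out = f.astype(float).copy()
    P = T.copy()                 # P holds T^(2^k)
    for _ in range(K + 1):
        out = out + P @ out      # multiply the running result by (I + P)
        P = P @ P                # square to get the next dyadic power
    return out
```

For $\|T\| = 1/2$, the truncation error after $K$ steps is of order $\|T\|^{2^{K+1}}$, so the geometric tail vanishes doubly exponentially in $K$.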
TV Wavelet Inpainting

Wavelet inpainting, or more generally wavelet interpolation, refers to the problem of filling in wavelet coefficients that are missing or damaged due to lossy image transmission or communication. Obviously, the task is closely related to classical inpainting problems, as discussed in Subsect. "PDE Method for Image Inpainting", but it also differs remarkably in that the inpainting regions are in the wavelet domain. Working in the wavelet domain, instead of the pixel domain, changes the nature of the inpainting problem, since damage to wavelet coefficients can create correlated damage patterns in the pixel domain. For instance, there usually exist no corresponding regular geometric inpainting regions in the pixel domain, which are, however, necessary for many PDE-based pixel-domain inpainting models. This lack of spatial geometric regularity of the inpainting regions also prevents many other existing pixel-domain inpainting techniques from being applied. On the other hand, direct interpolation in the wavelet domain is also problematic, because wavelet coefficients are constructed to be uncorrelated in the $L^2$ sense, so neighboring coefficients provide minimal information about the missing ones. In addition, the degradation in wavelet inpainting problems is often spatially inhomogeneous, which demands different treatments in different regions. A closer examination shows that all these new challenges are caused by a simple fact: the damage happens in the wavelet domain, while human perception prefers to see images with certain regularity in the pixel domain. Therefore, it seems natural to create wavelet inpainting methods that fill in the coefficients in the wavelet domain while controlling the regularity in the pixel domain. The TV wavelet inpainting models presented in [22] follow exactly this strategy. Two different models have been proposed, depending on the noise level in the images. The first one is for noiseless images, in which the retained coefficients are considered to be correct and will not be altered.

Model I:
$$\min_{\beta_{j,k} :\, (j,k)\in I} F(\beta) = \int_{\mathbb{R}^2} |\nabla_x u(\beta,x)|\,dx = \mathrm{TV}(u(\beta,x))\,, \qquad (30)$$

where $I$ is the inpainting index region in the wavelet domain, and $u(\beta,x)$ is the image synthesized from the wavelet coefficients,

$$u(\beta,x) = \sum_{j,k} \beta_{j,k}\,\psi_{j,k}(x)\,.$$

For noisy images, every retained coefficient is also corrupted by noise, so one has to modify (denoise) them too.

Model II:
$$\min_{\beta_{j,k}} F(\beta) = \int_{\mathbb{R}^2} |\nabla_x u(\beta,x)|\,dx + \sum_{(j,k)} \lambda_{j,k}\,(\beta_{j,k} - \alpha_{j,k})^2\,, \qquad (31)$$

where the parameter $\lambda_{j,k}$ is zero if $(j,k) \in I$; otherwise, it equals a positive constant $\lambda$. Clearly, these two models are inspired by the TV denoising model, chosen for its exceptional ability to handle geometry. Both models recover the wavelet coefficients so that the restored images are least oscillatory while matching the known information. The key difference is that in Model I the arguments are restricted to the inpainting region $I$, so the dimension of the unknowns is the number of coefficients in $I$, while in Model II the parameter $\lambda_{j,k}$ is taken to be zero on the inpainting region $I$ in the wavelet domain, in contrast to the standard denoising and compression models, where $\lambda$ is usually taken to be a constant everywhere. This essentially puts no constraint on the missing wavelet coefficients, so that they can change freely. In Fig. 4, we show an example of wavelet inpainting by the two models.

Compressive Sampling

Compressive sampling [7], which also goes by the name compressed sensing [35], is an emerging theory addressing the sampling problem in image and signal processing. In information theory, the classical Shannon–Nyquist sampling theorem states that "exact reconstruction of a continuous-time baseband signal from its samples is possible if the signal is bandlimited and the sampling frequency is greater than twice the signal bandwidth". More precisely, bandlimited signals are functions whose Fourier frequencies lie in a bounded interval, and the sampling theorem says that a bandlimited signal can be fully reconstructed from its evenly spaced samples, provided that the sampling rate exceeds twice the maximum frequency in the signal. This rate is often called the Nyquist rate. Compressive sampling considers a different scenario. In the simplest case, let us assume that a signal f(t) is
Wavelets and PDE Techniques in Image Processing, a Quick Tour of, Figure 4 a Original synthetic image. b 50 % of the wavelet coefficients are randomly lost, including some low-frequency coefficients, which results in large, damaged regions in the pixel domain. Notice that there are no well-defined inpainting regions in the pixel domain. c Restored image by Model I, d Restored image by Model II. They not only fill in missing regions properly, but also restore the sharp edges and geometrical shapes
sparse in the frequency space, or in another convenient space. For instance, suppose the signal consists of only a few Fourier terms, i.e.,

$$f(t) = \sum_{j=1}^{n} \beta_j\,e^{i k_j t}\,,$$

where $n$ is an integer much smaller than the signal resolution $N$, the $k_j$'s are frequencies, and $i$ is the imaginary unit; the actual values of the frequencies $k_j$ are not known. Obviously, $f(t)$ is a bandlimited signal. Without loss of generality, assume $k_n$ is the highest frequency. Then Shannon–Nyquist sampling theory requires at least $2k_n$ evenly spaced observations to exactly reconstruct $f(t)$. However, since the value $k_n$ is not known, the actual number of samples (resolution $N$) needed may be much higher than $2k_n$.
Compressive sampling asks a different question: can one exactly recover a sparse signal $f(t)$ from a small number of samples $\{f(t_j)\}_{j=1}^{m}$ observed at randomly selected times $t_j$ ($j = 1, \ldots, m$)? Here $m$ may be much smaller than $N$. Given the knowledge of sparsity, ideally one can convert this problem into the following minimization problem,

$$\min \|\beta\|_{\ell^0}\,, \quad \text{subject to } (F\beta)(t_j) = f(t_j)\,, \qquad (32)$$

where $\beta$ is a vector containing the Fourier coefficients of the reconstructed signal, $F$ is the Fourier matrix, and $\|\cdot\|_{\ell^0}$ is the $\ell^0$ norm of a discrete sequence, defined as the number of nonzero elements. Then $F$ gives the inverse Fourier transform, and $F\beta$ is the reconstructed signal. The $\ell^0$ minimization (32) finds the sparsest reconstruction $F\beta$ among all possible functions that agree with the observations $f(t_j)$. However, (32) is essentially a large non-convex integer optimization problem, which is computationally prohibitive. Compressive sampling suggests that it is still possible to exactly recover $f(t)$ from the samples $\{f(t_j)\}_{j=1}^{m}$: the exact reconstruction is realized by the following $\ell^1$ optimization,

$$\min \|\beta\|_{\ell^1}\,, \quad \text{subject to } (F\beta)(t_j) = f(t_j)\,. \qquad (33)$$

In other words, compressive sampling achieves exact recovery by finding the signal having the smallest $\ell^1$ norm in frequency space among all signals matching the sample values $f(t_j)$ at the $t_j$. There are many reasons to select the $\ell^1$ norm in the optimization, including the remarkable mathematical insights given in [11]. We do not intend to present those results here. Instead, we list the following two reasons, which are more intuitive and may explain the essence of $\ell^1$ optimization in sparse recovery. 1. The $\ell^1$ norm promotes sparsity in many applications; in other words, $\ell^1$ minimization in frequency space tends to drive more frequencies to zero. 2. The $\ell^1$ norm has the smallest index $p$ among all $\ell^p$ norms that are convex, and convexity ensures that (33) may be solved efficiently by standard convex optimization algorithms. It is believed that compressive sampling may have many implications. One of the most attractive is that it suggests the possibility of new data acquisition protocols that translate analog information into digital form with fewer sensors than previously considered necessary. There are many interesting studies on how to advance the theory and applications, and even to design new hardware realizing these implications. We refer to [7] for more information on the subject. It is worth noting that even though TV wavelet inpainting and compressive sampling were developed independently, there is an interesting connection between them. For instance, the derivative of a piecewise constant image, $\nabla u$, can be viewed as sparse in the pixel space. If one makes measurements in the wavelet space, then Model I (30) is the $\ell^1$ minimization of the derivative in the pixel space with constraints in the wavelet space, which fits well in the framework of compressive sampling. In this sense, the two are complementary, and can be viewed as dual formulations.
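Problem (33) can be prototyped with an off-the-shelf LP solver. The sketch below is our own illustration (a real Gaussian matrix stands in for the partial Fourier measurements, and the function name and sizes are hypothetical): the $\ell^1$ problem becomes a linear program after splitting $\beta = u - v$ with $u, v \ge 0$.

```python
import numpy as np
from scipy.optimize import linprog

def l1_recover(A, y):
    """Basis pursuit min ||b||_1 subject to A b = y, posed as an LP:
    write b = u - v with u, v >= 0 and minimize sum(u + v)."""
    m, n = A.shape
    c = np.ones(2 * n)                       # objective: sum(u) + sum(v)
    res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y,
                  bounds=[(0, None)] * (2 * n))
    return res.x[:n] - res.x[n:]

# A 3-sparse coefficient vector recovered from 30 random measurements (n = 60).
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 60))
beta = np.zeros(60)
beta[[5, 17, 42]] = [1.0, -2.0, 0.5]
beta_hat = l1_recover(A, A @ beta)
```

With these dimensions, the $\ell^1$ minimizer coincides with the true sparse vector with overwhelming probability, which is precisely the compressive sampling phenomenon: far fewer than 60 measurements suffice for exact recovery of a 3-sparse signal.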
Future Directions

Driven by rapidly developing imaging sciences and technologies, the last couple of decades have witnessed the tremendous success of wavelets and PDE techniques in mathematical image processing. Many researchers have been working in the field, and exciting new developments are constantly reported. However, compared to the even faster growing demands, there is still a long way to go to meet the ever-increasing expectations. The following is just a very short list of directions that are being, or should be, pursued in the near future. (1) Developing more sophisticated models and methods to better preserve features in images, or in general data sets in higher dimensions, such as video or hyper-spectral images. Merging traditional wavelets and PDE techniques seems to be a promising direction; for example, developing combined wavelet and PDE models for segmentation is interesting, and to our knowledge it has not been attempted yet. (2) New applications in high-level vision, such as pattern recognition, auto-navigation and tracking, which demand better understanding of the problems and more accurate extraction of the connections among data sets. (3) Robust and efficient implementation strategies for computing the solutions of mathematical image processing models, especially those involving solutions of nonlinear PDEs. This list is based on the authors' experience and reflects our own perspective; certainly it does not cover all aspects of this large field. Interested readers are encouraged to read up-to-date literature to follow the latest advances on the subject.

Acknowledgments

Research supported in part by grants ONR-N00014-06-1-0345, NSF CCF-0430077, CCF-0528583, DMS-0610079, DMS-0410062 and CAREER Award DMS-0645266, NIH U54 RR021813, and an STTR Program from TechFinity Inc.

Bibliography

1. Alvarez L, Guichard F, Lions PL, Morel JM (1993) Axioms and Fundamental Equations of Image Processing. Arch Ration Mech Anal 16:200–257
2.
Arandiga F, Donat R (2000) Nonlinear Multiscale Decompositions: The Approach of A. Harten. Numer Algorithms 23:175–216
3. Aubert G, Kornprobst P (2001) Mathematical Problems in Image Processing: Partial Differential Equations and the Calculus of Variations, vol 147. Springer
4. Ballester C, Bertalmio M, Caselles V, Sapiro G, Verdera J (2001) Filling-in by Joint Interpolation of Vector Fields and Grey Levels. IEEE Trans Image Process 10(8):1200–1211
5. Bertalmio M, Bertozzi AL, Sapiro G (2001) Navier–Stokes, Fluid Dynamics, and Image and Video Inpainting. In: IEEE Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, December 2001
6. Bertalmio M, Sapiro G, Caselles V, Ballester C (1999) Image Inpainting. Tech Report, ECE-University of Minnesota
7. Candès E (2006) Compressive Sampling. In: Proceedings of the International Congress of Mathematicians, Madrid
8. Candès E, Donoho D (1999) Ridgelets: a Key to Higher-dimensional Intermittency? Phil Trans R Soc Lond A 357(1760):2495–2509
9. Candès E, Donoho D (1999) Curvelets: A Surprisingly Effective Nonadaptive Representation of Objects with Edges. Tech Report, Dept of Stat, Stanford University
10. Candès E, Romberg J, Tao T (2006) Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information. IEEE Trans Inform Theory 52:489–509
11. Candès E, Tao T (2006) Near Optimal Signal Recovery From Random Projections and Universal Encoding Strategies. IEEE Trans Inform Theory 52:5406–5425
12. Caselles V, Kimmel R, Sapiro G (1997) On Geodesic Active Contours. Int J Comput Vis 22(1):61–79
13. Caselles V, Morel JM, Sapiro G, Tannenbaum A (1998) Special Issue on Partial Differential Equations and Geometry-Driven Diffusion in Image Processing and Analysis. IEEE Trans Image Process 7(3)
14. Catte F, Lions PL, Morel JM, Coll T (1992) Image Selective Smoothing and Edge Detection by Nonlinear Diffusion. SIAM J Numer Anal 29:182–193
15. Chambolle A, DeVore R, Lee N, Lucier B (1998) Nonlinear Wavelet Image Processing: Variational Problems, Compression, and Noise Removal Through Wavelet Shrinkage. IEEE Trans Image Process 7(3):319–333
16. Chambolle A, Lions PL (1997) Image Recovery via Total Variational Minimization and Related Problems. Numer Math 76:167–188
17.
Chan TF, Golub GH, Mulet P (1996) A Nonlinear Primal-Dual Method for Total Variation-Based Image Restoration. In: ICAOS'96, 12th International Conference on Analysis and Optimization of Systems: Images, Wavelets and PDEs, Paris, 26–28 June 1996. Lecture Notes in Control and Information Sciences, vol 219. Springer, pp 241–252
18. Chan TF, Kang SH, Shen J (2002) Euler's Elastica and Curvature Based Inpainting. SIAM J Appl Math 63(2):564–592
19. Chan TF, Osher S, Shen J (2001) The Digital TV Filter and Nonlinear Denoising. IEEE Trans Image Process 10(2):231–241
20. Chan TF, Shen J (2002) Mathematical Models for Local Non-Texture Inpainting. SIAM J Appl Math 62(3):1019–1043
21. Chan TF, Shen J (2005) Image Processing and Analysis - Variational, PDE, Wavelet and Stochastic Methods. SIAM, Philadelphia
22. Chan TF, Shen J, Zhou HM (2006) Total Variation Wavelet Inpainting. J Math Imaging Vis 25(1):107–125
23. Chan TF, Vese L (2001) Active Contours Without Edges. IEEE Trans Image Process 10(2):266–277
24. Chan TF, Wong CK (1998) Total Variation Blind Deconvolution. IEEE Trans Image Process 7:370–375
25. Chan TF, Zhou HM (2002) ENO-Wavelet Transforms for Piecewise Smooth Functions. SIAM J Numer Anal 40(4):1369–1404
26. Chan TF, Zhou HM (2003) ENO-Wavelet Transforms and Some Applications. In: Stoeckler J, Welland GV (eds) Beyond Wavelets. Academic Press, New York
27. Chan TF, Zhou HM (2000) Optimal Constructions of Wavelet Coefficients Using Total Variation Regularization in Image Compression. CAM Report 00-27, Dept of Math, UCLA
28. Chui CK (1997) Wavelets: A Mathematical Tool for Signal Analysis. SIAM, Philadelphia
29. Chui CK, Wang J (2007) Wavelet-based Minimal-Energy Approach to Image Restoration. Appl Comput Harmon Anal 23(1):114–130
30. Claypoole P, Davis G, Sweldens W, Baraniuk R (1999) Nonlinear Wavelet Transforms for Image Coding. IEEE Trans Image Process 12(12):1449–1459
31. Coifman R, Donoho D (1995) Translation Invariant De-noising. In: Antoniadis A, Oppenheim G (eds) Wavelets and Statistics. Springer, New York, pp 125–150
32. Coifman R, Maggioni M (2006) Diffusion Wavelets. Appl Comput Harmon Anal 21(1):53–94
33. Daubechies I (1988) Orthonormal Bases of Compactly Supported Wavelets. Comm Pure Appl Math 41:909–996
34. Daubechies I (1992) Ten Lectures on Wavelets. SIAM, Philadelphia
35. Donoho D (2004) Compressed Sensing. Tech Report, Stanford University
36. Donoho D, Huo X (2002) Beamlets and Multiscale Image Analysis. In: Barth TJ, Chan T, Haimes R (eds) Multiscale and Multiresolution Methods. Lecture Notes in Computational Science and Engineering, vol 20. Springer, pp 149–196
37. Donoho D, Johnstone I (1995) Adapting to Unknown Smoothness via Wavelet Shrinkage. J Amer Stat Assoc 90:1200–1224
38. Dugatkin D, Zhou HM, Chan TF, Effros M (2002) Lagrangian Optimization of a Group Testing for ENO Wavelets Algorithm. In: Proceedings of the 2002 Conference on Information Sciences and Systems, Princeton University, New Jersey, 20–22 March 2002
39.
Durand S, Froment J (2001) Artifact Free Signal Denoising with Wavelets. In: Proceedings of ICASSP'01, vol 6, pp 3685–3688
40. Geman S, Geman D (1984) Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Trans Patt Anal Mach Intell 6:721–741
41. Gonzalez RC, Woods RE (1993) Digital Image Processing. Addison Wesley, Reading
42. Harten A (1994) Multiresolution Representation of Data, II: General Framework. CAM Report 94-10, Dept of Math, UCLA
43. Harten A, Engquist B, Osher S, Chakravarthy S (1987) Uniformly High Order Essentially Non-Oscillatory Schemes, III. J Comput Phys 71:231–303
44. Hernandez E, Weiss G (1996) A First Course on Wavelets. CRC Press, Boca Raton
45. Hong ES, Ladner RE (2000) Group Testing for Image Compression. In: IEEE Proceedings of the Data Compression Conference, Snowbird, UT, March 2000, pp 2–12
46. Jain AK (1989) Fundamentals of Digital Image Processing. Prentice Hall, Englewood Cliffs
47. Kass M, Witkin A, Terzopoulos D (1987) Snakes: Active Contour Models. Int J Comput Vis 1:321–331
Wavelets and PDE Techniques in Image Processing, a Quick Tour of
48. Malgouyres F, Guichard F (2001) Edge Direction Preserving Image Zooming: a Mathematical and Numerical Analysis. SIAM J Num Anal 39(1):1–37 49. Mallat S (1989) Multiresolution Approximation and Wavelet Orthonormal Bases of L2 (R). Trans Amer Math Soc 315:69–87 50. Mallat S (1989) A Theory of Multiresolution Signal Decomposition: The Wavelet Representation. IEEE Trans PAMI 11:674–693 51. Mallat S (1998) A Wavelet Tour of Signal Processing. Academic Press, San Diego 52. Masnou M, Morel J (1998) Level-lines Based Disocclusion. IEEE ICIP 3:259–263 53. Meyer Y (1992) Wavelets and operators, Advanced mathematics. Cambridge University Press, Cambridge 54. Morel JM, Solimini S (1994) Variational Methods in Image Segmentation. Birkhauser, Boston 55. Mumford D, Shah J (1989) Optimal Approximation by Piecewise Smooth Functions and Associated Variational Problems. Comm, Pure Appl Math 42:577–685 56. Osher S, Fedkiw R (2002) Level Set Methods and Dynamic Implicit Surfaces. Springer, New York 57. Le Pennec E, Mallat S (2000) Image Compression with Geometrical Wavelets In: IEEE Conference on Image Processing (ICIP), Vancouver, September 2000
58. Perona P, Malik J (1990) Scale-space and edge detection using anisotropic diffusion. IEEE T Pattern Anal 12(7):629–639 59. Rudin L, Osher S, Fatemi E (1992) Nonlinear Total Variation Based Noise Removal Algorithms. Phys D 60:259–268 60. Said A, Pearlman W (1996) A new, fast, and efficient image codec based on set partitioning in hierarchical trees. IEEE Trans Circuits Syst Video Technol 6(3):243–250 61. Sapiro G (2001) Geometric Partial Differential Equations and Image Processing. Cambridge University Press,Cambridge 62. Sapiro G, Tannenbaum A (1993) Affine Invariant Scale-Space. Internet J Comput Vis 11:25–44 63. Shapiro J (1993) Embedded Image Coding Using Zerotrees of Wavelet Coefficients. IEEE Trans Signal Process 41(12):3445– 3462 64. Strang G, Nguyen T (1996) Wavelets and Filter Banks. Wellesley-Cambridge Press, Wellesley 65. Welland GV (ed) (2003) Beyond Wavelets. Academic Press, New York 66. Weickert J (1998) Anisotropic Diffusion in Image Processing. ECMI Series, Teubner, Stuttgart 67. Xu J, Osher S (2006) Iterative Regularization and Nonlinear Inverse Scale Space Applied to Wavelet Based Denoising. UCLA CAM Report 06–11, March 2006
World Wide Web, Graph Structure
LADA A. ADAMIC
School of Information and the Center for the Study of Complex Systems, University of Michigan, Ann Arbor, USA

Article Outline
Glossary
Definition of the Subject
Introduction
Graph Structure
Algorithms
Subsets of the Web Graph
Future Directions
Bibliography

Glossary
Graph A set of nodes (vertices) connected by links (edges, arcs). In the Web graph, the nodes are webpages and the edges are the hyperlinks between them.
Indegree The number of incoming edges to a node; in the case of the Web, the number of webpages pointing to a page.
Outdegree The number of outgoing edges; in the case of the Web, the number of webpages a webpage points to.
Strongly connected component A set of webpages such that any page can be reached from any other page by following hyperlinks.
Weakly connected component A set of webpages such that any page can be reached from any other page by treating the hyperlinks as undirected.
URL A uniform resource locator that corresponds to an online information source.
Hypertext transfer protocol (HTTP) The communications protocol that allows web clients to communicate with web servers.
Web server A program that responds to requests for web pages.
Webpage An information resource, identified by a URL, that is usually but not necessarily in the HTML (Hypertext Markup Language) format. A webpage may be static, meaning that it is stored on the server as a document, or dynamic, meaning that it is generated at the point that it is requested by the browser, using scripts and/or back-end databases.
Domain A name that identifies one or more IP addresses, e.g. umich.edu.
Top level domain The suffix of the domain name, sometimes corresponding to the purpose of the website or its country of origin, e.g. "edu", "com", "gov", "uk", "cn", etc.
Website A collection of webpages that is hosted on one or more web servers and that shares a common root URL, e.g. "www.springer.com".
Randomized network A network that preserves the degree of each node relative to the original, but in which the edges themselves are rewired.

Definition of the Subject

In the span of a few short years, the World Wide Web has become the primary information source. Billions of webpages, tied together by hyperlinks, constitute the world's largest publicly accessible store of data and are accessed by hundreds of millions of people worldwide. Understanding how webpages are connected to form the Web graph is important both for building comprehensive and accurate information retrieval systems and for characterizing real-world phenomena that are reflected in the Web's content and link structure.

Introduction

The Web is becoming the single most indispensable source of information. A massive graph of billions of pages, the Web is continuously growing and evolving. People use the Web to do everything from reading the news, socializing, making purchases, paying their bills, and watching videos to playing games and finding romance. Given how pervasive the Web is in our daily lives, it is hard to believe that it is a relatively recent invention. The World Wide Web was invented by Tim Berners-Lee in 1990 at CERN in Geneva, Switzerland. Berners-Lee's idea was to combine hypertext – the linking of documents via terms – with the internet, allowing users to navigate between documents that are distributed and accessible across the globe. Starting with the first site created at CERN in 1991, the Web grew to 130 sites by 1993 [1]. By 1997 there were around a million sites [1], and by 2006 they numbered 100 million, about half of them active [2].
Such massive growth is only possible through the distributed contribution by many individuals. Thus online information is not contributed by a few vetted experts, nor is access limited to a few individuals. Rather, almost anyone can contribute to online content, especially with the proliferation of easy-to-use web authoring tools. One might expect then that there is a great deal of randomness in how pages are linked to one another, and how users navigate those links. Contrary to this, there are a number of
strong regularities both in the structure of the Web and the pattern of access by its users. These regularities can be discovered when one studies the graph structure of the Web.

Graph Structure

The Web is a graph. The webpages are the nodes or vertices, and the hyperlinks are the directed links or edges between them. Characterizing the Web graph has led to improved crawling, sampling, and ranking algorithms. It has also led to novel insights about the underlying human dynamics that are reflected in the link patterns of the Web.

Empirical Measurements

Given the enormity of the Web graph, only a handful of projects, WebBase [3] and WebGraph [4,5] among them, have sought to gather comprehensive crawls capturing hundreds of millions of the web's hundreds of billions of pages and have made them available to the research community. Early characterizations of the Web graph were made from crawls performed by commercial search engines [6,7], and major search engine companies continue to host a variety of research on the Web graph [8,9,10,11].

Degree Distributions

One of the most striking features of the Web graph, discovered early on, is the power law distribution of incoming hyperlinks. While billions of webpages, comprising a substantial portion of the web, receive few or no hyperlinks, a few very popular pages have attracted millions. More precisely, the distribution of incoming links follows a power law (also known as a Zipf or Pareto distribution) [12,13,14]

p(k) ∝ k^(−α)    (1)
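The exponent α in Eq. (1) is usually estimated by maximum likelihood rather than by fitting a line to a log-log histogram. The following is a minimal sketch, assuming a continuous power law with lower cutoff k_min; the synthetic sample is illustrative, not a Web measurement:

```python
import math
import random

def powerlaw_mle(degrees, k_min=1.0):
    """Continuous maximum-likelihood (Hill) estimate of the exponent alpha
    for samples k >= k_min drawn from p(k) ~ k^(-alpha)."""
    ks = [k for k in degrees if k >= k_min]
    return 1.0 + len(ks) / sum(math.log(k / k_min) for k in ks)

# Illustrative check: draw from a power law with alpha = 2.1 by
# inverse-transform sampling, then recover the exponent.
rng = random.Random(0)
alpha_true = 2.1
sample = [(1.0 - rng.random()) ** (-1.0 / (alpha_true - 1.0)) for _ in range(10000)]
alpha_hat = powerlaw_mle(sample)
```

With ten thousand samples the estimate lands close to 2.1; the standard error of the estimator shrinks as (α − 1)/√n.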
The indegree distribution has shown remarkable consistency across many different studies, with an exponent α of 2.1 [7,15]. The outdegree distribution has a steeper power law exponent, measured at α = 2.5 [15] and α = 2.7 [7]. The steeper exponent corresponds to a smaller variance in the number of hyperlinks stemming from a web page as opposed to pointing to it; while a very useful or informative webpage such as a major search engine or news site can attract millions of hyperlinks, it is rather impractical, due to limitations in content length, for a single webpage to contain a very large number of hyperlinks.

Power law degree distributions have been shown not to be preserved for some subsamples of the Web graph, however. Pennock et al. [16] found that for sets of company, university, newspaper, and scientist homepages, the distributions of inlinks deviated from power laws, especially in the number of pages of low degree. Intuitively, one might expect a university homepage or a newspaper homepage to attract at least a few links, producing a peak in the distribution away from the minimum. This is in contrast to power law distributions, which are scale-free, appearing the same on all scales, e.g. 10 to 100 links or 100 to 1000 links, and having no "typical" degree. Much research has gone into deriving generative models that can produce the observed degree distributions. We will return to this topic in Subsect. "Generative Models".

Degree Correlations

Degree correlation, also termed assortativity, characterizes the extent to which high degree nodes link to other high degree nodes. In the case of the Web graph, one may ask whether pages that receive many links link to other pages with high indegree. In an undirected version of the nd.edu webpage network, the webpages were found to be mildly disassortative, with a correlation of −0.065 in the combined undirected degree of vertices [17]. This webpage network, crawled in 1999, consists of 325,729 documents and 1,469,680 hyperlinks. We can resolve this correlation further by considering pairwise the in- and outdegrees of both the page that is giving the link and the one that is being linked to. Using the nd.edu domain data set, we find a slight negative correlation between both the indegree and outdegree of the webpage containing the hyperlink and the indegree of the page it is pointing to. This indicates a slight tendency of link-rich pages to link to less link-rich and less popular pages, and of link-poor pages to link to link-rich and popular pages. Interestingly, in this domain, there was a strong positive correlation between the indegree of the linking page and the outdegree of the linked page. This may be an indication that the most cited pages themselves link to link-rich pages and so serve a funneling function. The above is summarized in Table 1.
The correlation between a page's own indegree and outdegree was positive (ρ = 0.244), indicating that link-rich pages themselves tended to receive more links.

World Wide Web, Graph Structure, Table 1
Assortativity of webpages in the nd.edu domain

                    to indegree    to outdegree
from indegree         −0.023          0.256
from outdegree        −0.062          0.014

Connected Components and Bow-Tie Structure

The bow-tie structure describes the connectivity of the Web graph as a whole [7]. The middle knot of the bow-tie is the largest strongly connected component (LSCC). The component is called strongly connected because within it, any page can be reached from any other by following directed links. The OUT component consists of those pages that can be reached from the LSCC, but do not have paths leading into the LSCC. The IN component consists of those pages that have paths leading into the LSCC, but are not reachable from the LSCC. The remainder of the graph is composed of tubes, tendrils, and islands. Tubes connect the IN component to the OUT component, bypassing the LSCC. Tendrils are those pages that can either be reached from the IN component, or lead to the OUT component, but are not in tubes or the LSCC itself. Islands are connected components of vertices that are not connected via any links to the other components.

World Wide Web, Graph Structure, Figure 1
A schematic of the bow-tie structure of the web

Each of these four regions (LSCC, IN, OUT, and the rest) of the Web's bow-tie accounts for roughly a fourth of the entire graph. This implies that if two pages A and B are selected at random, there is only roughly a 1 in 4 chance that there is a directed chain of hyperlinks leading from A to B. The reasoning is the following: most often, if there is a path from A to B, A is either in the IN or LSCC components, while B is in the LSCC or OUT components. This is true in approximately 1/2 × 1/2 = 1/4 of the cases. By comparison, about 90% of the webpages are in the largest connected component if the links are treated as undirected. The bow-tie structure was also observed to exist for subsets of webpages corresponding to a given topic, so that the web appears self-similar on several scales [18].

This structure has important implications for a crawler attempting to traverse the web in its entirety. By starting at a single webpage, the crawler will likely not be able to reach significant portions of the Web graph by following directed hyperlinks. Instead, it may adopt a strategy of discovering web pages to start from by scanning IP addresses to see if they are hosting a web server.

Shortest Paths and Diameter

In modeling user behavior and web crawler performance, it is useful to know not just whether page B is reachable via a directed path of hyperlinks from page A, but how many hyperlinks must be traversed. Albert et al. [19] found for the above-mentioned nd.edu domain crawl that the average shortest path over all pairs of vertices was 11.2. This closely matched the value of 11.6 predicted by their preferential attachment model, which will be described in Subsect. "Preferential Attachment and the BA Model". The model produces an average shortest path ⟨d⟩ of

⟨d⟩ = 0.35 + 2.06 log(N)    (2)

The model likewise provides a close prediction of 17.5 for a graph of the size of the 203-million-node sample described by Broder et al. [7]. The measured shortest path between the 24% of the pairs in this sample that were reachable was 16. As discussed in Subsect. "Connected Components and Bow-Tie Structure", the other three quarters of the pairs of web pages have no directed path connecting them, and hence their distance is infinite. On the other hand, treating the graph as undirected produces an average shortest path of just 6 hops for the 90% of the webpages that are connected. Another quantity used to describe the width of the Web is the diameter, the maximal shortest path between any two connected vertices. Broder et al. [7] measured a diameter of at least 28 in the central core and over 500 for the graph as a whole, indicating that while most reachable pairs can be connected in just a small number of hops, others are far removed.

Reciprocity, Clustering and Motifs

Reciprocity is a measure of how often, when one webpage links to another, that page links back. The measure should take into account the expected number of reciprocated links based simply on the link density ā = m/(N(N − 1)), where N is the number of nodes and m is the number of edges. An unbiased form of the reciprocity measure is given by

ρ = (m_bd/m − ā) / (1 − ā)    (3)

where m_bd is the number of bidirectional links [20]. For the crawl of pages in the nd.edu domain, ρ = 0.52 [19]. Clustering or transitivity quantifies how often webpages that are linked to a common page are linked to each other as well. The measure, usually applied to an undirected version of the network, is given by the following equation:

C = 6 × (number of triangles in the network) / (number of connected triples of vertices)    (4)
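Both quantities can be computed directly from an edge list. Below is a minimal sketch of Eqs. (3) and (4) on a toy directed graph (the graph is hypothetical); the clustering routine uses the convention C = 3 × triangles / unordered connected triples, which is equivalent to Eq. (4) when triples are counted as ordered:

```python
from itertools import combinations

def reciprocity(edges, n):
    """Unbiased reciprocity rho = (m_bd/m - abar)/(1 - abar), Eq. (3)."""
    es = set(edges)
    m = len(es)
    m_bd = sum(1 for (u, v) in es if (v, u) in es)  # bidirectional link count
    abar = m / (n * (n - 1))                        # link density
    return (m_bd / m - abar) / (1 - abar)

def clustering(edges):
    """Transitivity of the undirected version of the graph:
    C = 3 * triangles / unordered connected triples."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    # Each triangle is seen once from each of its three corners.
    corner_count = sum(
        1 for u in nbrs for v, w in combinations(sorted(nbrs[u]), 2) if w in nbrs[v]
    )
    triangles = corner_count // 3
    triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in nbrs.values())
    return 3 * triangles / triples
```

For a fully reciprocated triangle the clustering is 1; adding unreciprocated edges lowers the reciprocity toward the density-corrected baseline.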
A 1998 Alexa crawl of 260,000 websites yielded a clustering coefficient of 0.014 at the site level (0.11 using the Watts–Strogatz clustering coefficient [35]), significantly higher than a randomized version of the network [6]. The clustering coefficient captures the degree of transitivity in the graph, but does not reveal the configuration of those triads. Triad motifs [21] shed further insight into the local structure of the Web graph. There are 13 triad motifs, and together their abundance compared to randomized versions of the web graph constitutes a triad significance profile [22]. The significance profile SP is a normalized vector

SP_i = Z_i / (Σ_i Z_i²)^(1/2)    (5)

of z-scores measuring the statistical significance of the number n_i^web of observed motifs relative to the randomized counterparts of the network:

Z_i = (n_i^web − ⟨N_i^rand⟩) / σ_i^rand    (6)

where ⟨N_i^rand⟩ and σ_i^rand are the mean and standard deviation of the frequency of occurrence of motif i in randomized versions of the network. The motif profile of Web graphs is like that of social networks, with closed triads and reciprocated links more frequent than motifs without closure and with unreciprocated links. The most underexpressed motif is that of a webpage that has reciprocated links with two other webpages, but those two webpages do not link to one another. The most highly expressed motif is the fully reciprocated closed triad. The probability of such a triad occurring in a random graph is very small, but it occurs rather frequently in the Web graph, where clusters of related pages link to one another reciprocally.

Time Evolution

The web is growing exponentially, with pages being continuously added, modified, and deleted. Addition and deletion clearly affect which nodes are present in the graph, while modifications in content [23,24] possibly result in link addition and deletion. Cho and Garcia-Molina [25] found that 30% of a sample of 720,000 pages disappeared in an interval of a month. Douglis et al. [26] studied 950,000 web page requests from a server and found that, for pages that were accessed more than once, 60% had changed one or more of their hyperlinks. Making the task of discovering new pages easier is the fact that 85–95% of new pages appear on existing sites [27].

Web Surfing

An important aspect of the Web graph is how it is navigated. Huberman et al. [28] discovered the Law of Web Surfing, which describes the distribution of the number of links a user will access on a particular website. This distribution was found to hold for the surfing patterns of AOL users across a million web sites, for users accessing the Xerox website, and for students, faculty, and staff at Georgia Tech's College of Computing. The underlying process was argued to be the following. The value V_L of the Lth page in a click stream where L links were followed is assumed to be related to the value of the previous page, but also to deviate from it in a random way:

V_L = V_{L−1} + ε_L    (7)

where ε_L is a normally distributed random variable. The individual will continue to surf until the expected discounted value of information to be found on future pages is less than the cost (in time) of continuing. This maps to an option pricing model in finance, and the distribution of the number of links a user follows before the expected utility of continuing falls below a threshold is given by

P(L) = √(λ / (2π L³)) exp(−λ(L − μ)² / (2μ² L))    (8)

with a mean number of clicks E(L) = μ and variance Var(L) = μ³/λ, λ being a scale parameter. This heavy-tailed distribution accurately describes users' surfing behavior on the Web graph: many follow just a link or two, while a few explore dozens or even hundreds of links in a single session.

Crawling

The web is only partly navigated by following hyperlinks. Users jump from one part of the Web graph to another by querying search engines and selecting one or more pages that match their query. In order to deliver web pages matching particular user queries, search engines must first crawl the Web graph. Web crawlers have two tasks: revisiting previously crawled pages in order to refresh them, and discovering new pages by following new links. The crawler needs to prioritize both the order in which it will crawl new content and how often it will recrawl pages it has previously crawled, in part to discover links to new content [27].

Search Engine Coverage

It is not possible for search engines to crawl the entire Web. There are websites that will return valid and unique content for an infinite
World Wide Web, Graph Structure, Figure 2
A basic schematic of a crawler traversing webpages. As new pages are discovered, they are parsed, and their links are added to the queue of pages to be visited if they have not been seen before
number of URLs. Still, it is possible to measure the overlap in search engine coverage of the content that search engines should cover. In 1998 Lawrence and Giles set out to estimate the size of the "indexable" web by issuing queries to multiple search engines [29]. By measuring the overlap in the search results and assuming that each search engine samples the web independently, they estimated the total size of the Web at that point to be at least 320 million pages, with no single search engine achieving more than a third of the coverage.

Generative Models

Generative models help describe the underlying processes that are shaping the Web, and therefore can be used to make predictions about the future evolution of its properties.

Preferential Attachment and the BA Model

The World Wide Web inspired the Barabási–Albert preferential attachment model of network growth [15,30], which has since been applied to networks in many other domains, including biological, social, and technological ones. As discussed in Subsect. "Degree Distributions", Albert et al. observed that the distribution of incoming hyperlinks for webpages is highly skewed. They devised a very simple model of preferential attachment, which had previously been explored in other networked and non-networked contexts [31,32,33], wherein a new page is added at each timestep t and attaches to m existing webpages, each with probability Π(k_i) proportional to that page's current degree k_i:

Π(k_i) = k_i / Σ_j k_j    (9)
An intuitive argument for why this is so is that the more popular a web resource is, the more likely an author of
a new webpage is to learn about it in the course of browsing and searching the Web, and the more likely she is to cite it. The links are treated as undirected, and the model yields a scale-free indegree distribution with a power law exponent α = 3, independent of m. While reproducing some characteristics of the Web graph, such as a scale-free distribution and short average path lengths, the model does not capture others. The power law exponent of 3 is much higher than the consistently observed α = 2.1 for the Web. Its clustering coefficient depends on the size of the graph as N^(−0.75), and is much smaller than that of the Web graph. It also predicts that the degree of each node depends on its time of introduction t_i as follows:

k_i(t) = m (t/t_i)^β,  β = 1/2    (10)

However, there is only a very weak correlation between the age of a webpage or website on the web and its indegree [34]. Extensions of the Barabási–Albert model were proposed to overcome its limitations. Rewiring of existing edges, with the endpoint chosen preferentially, produces power law distributions with cut-offs, as well as exponents in the range 2 ≤ α < ∞. Other models have introduced aging factors, where webpages lose attractiveness over time [36], which produces cutoffs in the power law distributions. In the fitness model, webpages have a different inherent attractiveness η_i [37], with preferential attachment combining this attractiveness with the degree k_i so that Π_i ∝ η_i k_i. Interestingly, it was found that nonlinear preferential attachment [38], where

Π(k_i) ∝ k_i^γ    (11)

does not always produce power law degree distributions when γ ≠ 1 [39].
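The growth rule of Eq. (9) is straightforward to simulate. A common trick is to keep a list in which each node appears once per unit of degree, so that uniform sampling from the list realizes degree-proportional attachment. A minimal sketch, with illustrative parameters and an undirected graph as in the original model:

```python
import random

def barabasi_albert(n, m, seed=0):
    """Grow an undirected preferential-attachment graph: each new node
    attaches to m distinct existing nodes with probability proportional
    to their current degree, Eq. (9)."""
    rng = random.Random(seed)
    # Start from a small fully connected seed graph of m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Node u appears in `slots` once for every unit of its degree, so a
    # uniform draw from `slots` is a degree-proportional draw over nodes.
    slots = [u for e in edges for u in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(slots))
        for t in targets:
            edges.append((new, t))
            slots.extend((new, t))
    return edges

edges = barabasi_albert(200, 2, seed=1)
```

Tallying the resulting degrees over many runs reproduces the k^(−3) tail the model predicts; the early seed nodes accumulate by far the highest degrees.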
Copying Models

The basic preferential attachment models do not explicitly explain why a new webpage would link to existing webpages in proportion to their indegree. Such models also do not capture features such as clustering, since the only factor influencing the likelihood of being linked to is the indegree of the page. In copying models, both clustering and preferential attachment are produced [40,41,42,43] as a result of links being copied, without content being explicitly taken into account. New pages are added over time, and each page selects an existing page, termed a prototype or ambassador page, and copies several of the links on that page. The process leads to preferential attachment, since the probability that any given page is linked to is proportional to the probability that it is the neighbor of the ambassador node, which is in turn proportional to its indegree. In the case where the new page links to the ambassador page in addition to copying some of its links, the process also produces clustering; each copied link results in a closed triad: the new page c links to the ambassador page a; a links to b, and c copies that link and also links to b. In the case where the new page does not link to the ambassador page, the process produces bi-cliques, sets of pages that link to a second set of pages.

Kumar et al. [42] constructed a stochastic copying model where new pages are created linearly or exponentially in time. Each page has a fixed outdegree. With probability p_r, the ith link will point to a webpage chosen at random; otherwise it will copy the ith link of the ambassador node. In the case of linear addition of pages in time, the probability p(k, t) that a node has degree k at time t is given by a power law:

p(k, t) ∝ t k^(−(2−p_r)/(1−p_r))    (12)

where the power-law exponent of the degree distribution is bounded from below by 2 in the case of pure copying, and grows steeper as the proportion p_r of random links is increased. The model also produces a large number of bipartite cliques, commensurate with empirically observed proportions.

Related to the copying model is the forest fire model, where the copying continues recursively to all nodes that are encountered by following links [43]. The process is analogous to a forest fire which spreads from tree to tree. Two probabilities are assigned: with forward burning probability p, the outgoing links from a page are copied, and with probability q, incoming links are copied. This last step models something akin to a webpage author searching for other pages that link to the current page and deciding to link to them as well.

1. As in the copying model, a new node chooses an ambassador node uniformly at random from the set of all nodes.
2. Two geometrically distributed random numbers, x and y, are generated with means p/(1 − p) and q/(1 − q) respectively. x out-links and y in-links are copied from the current node (if x or y are greater than the total number, all the links are copied).
3. Step 2 is applied recursively to each of the nodes the new node now has links to. Cycles are disallowed.

Although the forest fire model is so far analytically intractable, it has several properties that match the web. Heavy-tailed indegree distributions occur because, in copying links from other nodes, new nodes attach preferentially to existing nodes in proportion to their indegree. Heavy-tailed outdegree distributions are produced by the cascading process of recursive link copying. Many cascades will terminate already with the first node, but some will continue for many steps. Community structure and clustering are produced, as in the basic stochastic copying model, because the link copying leads to triadic closure, as a new webpage links to a page and some of its neighbors, and those neighbors' neighbors, etc.

Content-Based Models

The above generative models do not explicitly model the textual content of the webpages. The intuition behind content-based or similarity-driven models is that webpages will not only link to popular pages, but also to topically related webpages (or both). Menczer [44] used a cosine similarity measure

s(p₁, p₂) = Σ_k w_kp₁ w_kp₂ / √(Σ_k w²_kp₁ · Σ_k w²_kp₂)    (13)

where w_kp is some weight function, typically a TF-IDF (term frequency, inverse document frequency) weight for the term k in page p. In 110,000 webpages sampled from the Open Directory, Menczer found that lexical similarity closely correlates with webpages being clustered (either linking directly or overlapping in their neighbors). He proposed a content-based generative network model to take into account that once two pages exceed a threshold overlap (s > s*) in their content, their probability of being linked levels off. Below that threshold, the probability decreases as a power law, so that while neighboring pages tend to be lexically similar, there is a long tail of lexically distant pages that are neighbors in the Web graph. As in the Barabási–Albert model, at each time step t a new page is added and links to m existing pages. But rather than linking to any page in proportion to their degree, it
will only do so above a threshold lexical similarity. Above that threshold, the probability that a new page appearing at time t will link to an existing page i with degree k(i) is proportional to the degree; below it, the probability is proportional to a power of the lexical similarity:

Pr(p, t) = { k(i)/(mt)                 if s(p_i, p_t) > s*
           { c₁ [s(p_i, p_t)]^(α₁)     otherwise    (14)
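The cosine measure of Eq. (13) with TF-IDF weights can be sketched in a few lines; the toy "pages" below are hypothetical stand-ins for webpage text:

```python
import math
from collections import Counter

def tfidf_vectors(pages):
    """Map each page (a list of terms) to a dict of TF-IDF weights w_kp."""
    n = len(pages)
    df = Counter(term for page in pages for term in set(page))  # document frequency
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(page).items()}
        for page in pages
    ]

def cosine(w1, w2):
    """Cosine similarity s(p1, p2) between two weight vectors, Eq. (13)."""
    num = sum(w1[t] * w2[t] for t in w1 if t in w2)
    norm = math.sqrt(sum(w * w for w in w1.values()) * sum(w * w for w in w2.values()))
    return num / norm if norm else 0.0

pages = [
    "web graph structure hyperlink".split(),
    "web graph crawler hyperlink".split(),
    "option pricing finance model".split(),
]
vecs = tfidf_vectors(pages)
```

Here the two web-themed pages score well above the unrelated finance page, which shares no terms with them and so has similarity zero.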
The model produces a degree distribution that deviates from a pure power law, but closely matches the sample webpages from the Open Directory.
Algorithms

Ranking

Most search engines utilize the structure of the Web graph to rank their search results. Some structural node properties, such as indegree, are purely local and utilize only the nearest neighbors of the webpage. Others, such as PageRank, utilize the entire graph, and yet converge within a small number of iterations.

Indegree

The simplest link-based ranking of webpages is based on their indegree,

s(k) ∝ k    (15)

where each inlink serves as a popularity vote. This measure is susceptible to spamming. One can create webpages and websites with relative ease whose primary purpose is to boost the ranking of a particular target page.

PageRank

The PageRank algorithm was developed by Larry Page, one of the co-founders of the Google search engine [45]. PageRank is less susceptible to spamming because it takes into account not just how many pages link to a particular target page, but also how many pages link to those pages, and so on. It does so by simulating a random walk on the Web graph. The PageRank score represents the proportion of time a random web surfer would spend at each webpage, traversing the graph by following random links from a succession of web pages. Roughly speaking, PageRank corresponds to the probability distribution given by the principal eigenvector of the normalized webgraph adjacency matrix. To prevent the random walk from being trapped in a region of the graph (remember that only a fraction of the pages is in the LSCC), a damping factor d (also termed a teleportation probability) is introduced. With probability d the surfer follows a random link from the page she finds herself on. With probability (1 − d) the surfer jumps to a random node in the Web graph. Let A be the adjacency matrix for a Web graph with n pages. Then the matrix B corresponding to the Markov chain is given by

b_ij = d · a_ij / Σ_i a_ij + (1 − d)/n    (16)

Rather than solving for the eigenvector exactly, one can obtain a close approximation by applying a power iteration technique, iteratively multiplying an initial vector x by B:

x ← B x    (17)
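Equations (16)–(17) translate directly into a few lines of code. A minimal sketch on a tiny hypothetical graph; dangling pages (no out-links) have their mass spread uniformly, a common convention not spelled out above:

```python
def pagerank(n, links, d=0.85, iters=50):
    """Power iteration x <- Bx on the matrix B of Eq. (16).
    `links` maps a page to the pages it links to; returns scores summing to 1."""
    x = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - d) / n] * n  # teleportation term (1 - d)/n
        for j in range(n):
            out = links.get(j, [])
            if out:
                share = d * x[j] / len(out)  # d * a_ij / (column sum)
                for i in out:
                    new[i] += share
            else:  # dangling page: spread its mass uniformly
                for i in range(n):
                    new[i] += d * x[j] / n
        x = new
    return x

# Toy graph: pages 0 and 1 link to each other and both link to 2.
scores = pagerank(3, {0: [1, 2], 1: [0, 2], 2: []})
```

In this example page 2 collects links from both other pages and ends up with the highest score, while the symmetric pages 0 and 1 tie.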
The convergence of the iterations depends on the initial vector and the damping factor, but is tractable even for very large web crawls. The damping factor is typically set to d = 0.85. For most applications, PageRank values are precomputed for all the webpages at once, and are combined with a textual match score to a user's query and other factors in determining the final ranking of webpages on any given search engine result page. Although PageRank is designed to take into account more than direct links, studies have found it to be highly correlated with indegree when applied to the Web graph [46]. Personalization of PageRank can be achieved by biasing the random jump toward pages a particular user has visited already, rather than toward all webpages uniformly at random [47]. Since pages that link to one another are likely to be topically similar, the personalized PageRank algorithm will boost the scores of pages in the areas of the web the user has previously explored, and hence bias the search results toward pages of topical interest to the user.

Application to Unbiased Sampling

During a crawl, the probability of encountering a link to a page is roughly proportional to the PageRank of that page, since the PageRank represents the random walk probability of finding the page. Therefore crawling will produce a biased sample in which popular pages are more likely to be included. In order to create an unbiased sample, one can simply select pages in inverse proportion to their estimated PageRank [48].

HITS

The HITS (Hyperlink-Induced Topic Search) algorithm was developed by Jon Kleinberg, and starts with a smaller subset of query results, rather than with the entire web graph [49]. Instead of assigning a webpage a single score, as PageRank does, HITS assigns two scores, a hub
World Wide Web, Graph Structure
score y and an authority score x. A good authority is a page with authoritative information, and a good hub points to good authorities. Hubs may constitute good starting points for a user to explore an information space, while authorities are likely to contain definitive information. In the first step of the HITS algorithm, an initial root set of webpages (e. g. 200 pages) is retrieved based on a textual match. Some of these pages, containing text that is relevant, though not necessarily authoritative, on the subject of the query, can be expected to link to prominent authorities on the subject, and some will be linked to by hubs. This root set is therefore expanded to a base set that includes (up to a cutoff of e. g. 1000–3000) pages linked to by the root set and linking to the root set. At this point, links between pages located on the same website are removed, as they are likely to serve a navigational purpose and do not necessarily confer authority. Much like the PageRank scores, the hub scores y and authority scores x can be computed iteratively, after being initialized to uniform constants. For a page j, the authority score x_j is updated to be the sum of the hub scores y_i over all the pages i that link to j:

x_j = \sum_{i : i \to j} y_i \qquad (18)

Similarly, the hub score y_i of a page i is updated according to the authority scores x_j of the pages it points to:

y_i = \sum_{j : i \to j} x_j \qquad (19)
The above can also be computed using a power iteration technique, with the authority score x corresponding to the principal eigenvector of A^T A, while y corresponds to the principal eigenvector of A A^T, A being the adjacency matrix of the base set:

x \propto A^T y \propto A^T A x , \qquad y \propto A x \propto A A^T y .
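Both ranking schemes are simple power iterations and can be prototyped in a few lines. The following sketch is illustrative Python (not any search engine's implementation; the graph is a small dict from each page to its outlinks): it runs the damped PageRank update and the hub/authority updates of Eqs. (18) and (19), normalizing the HITS scores after each round.

```python
from collections import defaultdict

def pagerank(links, d=0.85, iters=50):
    """Power iteration for PageRank; links maps each page to its outlinks.

    Mass on dangling pages (no outlinks) is redistributed uniformly.
    """
    nodes = set(links) | {v for outs in links.values() for v in outs}
    n = len(nodes)
    pr = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(iters):
        incoming = defaultdict(float)
        dangling = 0.0
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                for v in outs:
                    incoming[v] += pr[u] / len(outs)
            else:
                dangling += pr[u]
        # (1 - d)/n is the teleportation term; d weights the link-following term
        pr = {u: (1 - d) / n + d * (incoming[u] + dangling / n) for u in nodes}
    return pr

def hits(links, iters=50):
    """Iterate Eqs. (18)-(19): x (authority) and y (hub), L2-normalized."""
    nodes = set(links) | {v for outs in links.values() for v in outs}
    inlinks = defaultdict(list)          # reverse index: page -> pages linking to it
    for i, outs in links.items():
        for j in outs:
            inlinks[j].append(i)
    x = dict.fromkeys(nodes, 1.0)        # authority scores
    y = dict.fromkeys(nodes, 1.0)        # hub scores
    for _ in range(iters):
        x = {j: sum(y[i] for i in inlinks[j]) for j in nodes}        # Eq. (18)
        y = {i: sum(x[j] for j in links.get(i, [])) for i in nodes}  # Eq. (19)
        for scores in (x, y):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for k in scores:
                scores[k] /= norm
    return x, y
```

On a real base set one would, as described above, first strip intra-site links before running `hits`; here that preprocessing is left to the caller.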
Unlike PageRank, where the scores for all pages are precomputed and subsequently used for all queries, HITS is computed at query time. Search Engine Bias Since most search engines utilize a link-based ranking algorithm to order their search results, there is the possibility that they further bias web traffic toward already popular sites on a given topic. As shown in Fig. 3, 90% of the search results users click on are on the first page, with 43% being
World Wide Web, Graph Structure, Figure 3 Frequency of clicks on different search result positions in 20 million queries issued by 600,000 users of the AOL search engine, from March to May 2006 [50]
the very top search result. In part the higher rate of clickthroughs for highly ranked results is due to users' tendency to scan content in order, but in part it is also due to search engines' ability to rank relevant content successfully. This may inhibit newer but high-quality pages, or pages presenting a different perspective, from receiving attention from search engine users. While search engines drive traffic to the most heavily linked sites, search engine users spread themselves out much more widely by searching for a wide variety of queries. In a 2005 study, Fortunato et al. [51] found that a majority of queries to Google return fewer than 30,000 hits (representing less than one millionth of the set of webpages crawled by Google at that time). Hence, while some web pages have tens of thousands and even millions of inlinks, they do not grab all of the search engine users' attention. Finding Communities in the Web Graph Webpages on similar topics naturally form densely connected regions of the Web graph, termed communities. Various algorithms have been developed to discover web communities based on the link structure alone, without considering the textual content or starting from a given seed set. Kleinberg et al. [52] trawled for emerging cybercommunities by identifying bipartite cores: sets of i pages that all cite j authoritative pages. Of the thousands of such cores identified, the vast majority corresponded to groups of related pages. This approach has been extended to finding larger, but incomplete, bipartite cores [53]. Communities can also be identified by performing the HITS expansion step from a root set composed of a core [54]. Flake et al. [55] used maximum-flow minimum-cut algorithms to discover communities, defining a community as a set in which any page has more links within the community than outside of it. The Web graph is augmented by creating an artificial source s with infinite-capacity edges routed to all seed vertices S for which one would like to find a community. All preexisting edges are made bidirectional and rescaled to a heuristically chosen constant capacity k. A virtual sink is connected to all vertices except the seed vertices and the virtual source. A maximum-flow procedure is used to produce a residual-flow graph (the graph of edges that have excess capacity). All vertices that are reachable from s through the residual graph satisfy the definition of a community. Many other network-based clustering algorithms exist, some of which have been applied to subsets of the Web. Some, based on the concept of modularity, divide the network into communities so as to maximize the within-community edges compared to what one would expect for a random assignment of pages to communities. The modularity can be expressed as:

Q = \frac{1}{4m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j) \qquad (20)
where A is the adjacency matrix, m is the total number of edges in the network, k_i is the degree of vertex i and c_i is the community i is assigned to. The modularity can be maximized through a greedy hierarchical algorithm that scales up to millions of nodes [56], a simulated annealing approach [57], or by directly computing the eigenvectors of the modularity matrix B:

B_{ij} = A_{ij} - \frac{k_i k_j}{2m} \qquad (21)
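For a given assignment of pages to communities, Eq. (20) can be evaluated directly. A minimal illustrative sketch follows; note that the 1/(4m) prefactor matches the equation as printed here, while some references normalize by 1/(2m) instead, which simply doubles Q.

```python
def modularity(adj, comm):
    """Evaluate Eq. (20) for an undirected graph.

    adj  -- symmetric 0/1 adjacency matrix (list of lists), A_ij
    comm -- comm[i] is the community label c_i of vertex i
    """
    n = len(adj)
    k = [sum(row) for row in adj]        # degrees k_i
    m = sum(k) / 2.0                     # total number of edges
    q = 0.0
    for i in range(n):
        for j in range(n):
            if comm[i] == comm[j]:       # the delta(c_i, c_j) term
                q += adj[i][j] - k[i] * k[j] / (2.0 * m)
    return q / (4.0 * m)
```

For example, two triangles joined by a single edge score well above zero when each triangle is its own community, and exactly zero when all vertices are lumped into one community, as the formula guarantees.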
Beyond identifying topically related sets of pages, community-finding algorithms may be applied to identify spam hosts and pages, since pages within so-called link farms tend to link to one another much more often than to legitimate sites [9,58]. Subsets of the Web Graph The Web contains various interesting subgraphs, reflecting different topics and social and collaborative activity. Here we discuss the graph properties of just a few of those subsets. Query Connection Graphs Subgraphs corresponding to search engine results for given queries can be informative both regarding the quality of the search results and the likelihood that a user will reformulate the query, making it more specific or more
World Wide Web, Graph Structure, Figure 4 The process of constructing a query connection graph. A search engine is used to retrieve web pages relevant to the query. The pages are projected onto the Web graph. The projection subgraph contains only the search results and the hyperlinks between them. The connection subgraph also includes the nodes (webpages) that are minimally needed to reconnect the pages
general [59]. The process of constructing query connection graphs is shown in Fig. 4. A projection graph is constructed, consisting of the search results and the hyperlinks between them. A connection graph is constructed by adding the minimum number of other webpages necessary to reconnect the projection graph. One can also construct the two graphs at the domain level, where a domain encompasses all of e. g. "springer.com" or "umich.edu", and two domains share an edge if a page in one domain links to a page in the other. As one might expect, domain query projection graphs are much more densely linked than projection graphs over individual webpages. A complete domain graph of the web from February 2006 contained 39 million domain names and 720 million directed edges, with an average shortest path of 4 and the largest weakly connected component containing 99% of the nodes. Whether at the page or domain level, when compared with human ratings of the quality of a set of pages for a given query, more tightly connected projection graphs (and correspondingly smaller connection graphs) were good predictors of page quality. The success of a user's search is only in part due to the accuracy of the search engine ranking algorithm. Another component is the skill of the user in formulating a query. By looking at the query subgraphs of issued queries, one can predict whether a user will reformulate a query. Queries with result sets that are more tightly knit and contain a few central high-degree nodes are less likely to be reformulated. On the other hand, if a query projection graph lacks central, high-degree nodes, the search engine user is more likely to reformulate the query. However, if the largest connected component constitutes e. g. 10 of the top 20 search results, the query may be too specialized, and the user is more likely to reformulate the query to be more general. Weblogs A rapidly growing part of the Web consists of weblogs (blogs), webpages containing time-stamped posts. Blogs are interconnected through direct links from one post to another, which may occur when bloggers discuss the same topic. Blogs also frequently include a sidebar of links to other blogs and websites. Millions are used simply as personal journals, but many specialize in particular topics, such as politics or technology. Some blogs today attract as much attention as mainstream media sites [60]. Kumar et al. [61] studied the profiles of 1.3 million blogs on livejournal.com, one of the most popular blogging sites. LiveJournal gives its bloggers the opportunity to specify other bloggers as their friends, to provide basic demographic information, and to join groups based on their interests. The network is highly clustered, with a clustering coefficient of C = 0.2 (meaning that 20% of the time two friends of the same blogger are also friends of each other). This high clustering is consistent with the observation that 70% of the connections correspond to one of three factors: a shared interest, membership in the same 5-year age group, or geographic proximity. The geographic location information of LiveJournal users allowed for a very large scale study of the small world phenomenon [62]. It had been known for decades, ever since the 1960s, when Stanley Milgram conducted his famous small world experiment [63], that people are able to form short chains of acquaintances to reach a target individual using only information about their immediate network neighbors.
Subsequently, Kleinberg [64] proved that for certain spatial network topologies, individuals would be able to use a simple greedy algorithm, choosing the acquaintance closest to the target as the next person in the chain, to connect to targets in a small number of steps. This holds true for LiveJournal, where the probability of two LiveJournal users being connected is inversely proportional to the number of people who live between them, and this spatial distribution of connections allows for the successful formation of small world chains [62]. Blogs are continuously being updated, and this temporal evolution is one of their most interesting aspects. Citation patterns between blogs change rapidly, even on the scale of a few months [76]. Kumar et al. [65] found that as blogging activity grew in popularity from 2002 to 2003, the addition of links became bursty, indicating that activity was centered around communities and events relevant to them. The degree distribution appeared to follow a power law, asymptoting to an exponent of 2.1. This inequality in link distribution was also noted for the subset of political blogs [66,67] in the United States. The subgraph of political blogs also showed a strong tendency of blogs to link predominantly to blogs of similar political leanings [68,69]. The time-resolved nature of blog entries enables researchers to track individual memes, topics that spread contagiously as information cascades from blog to blog. Several models [70,71] were developed to identify topics and their spread through the blog network. In a general cascade model [71] of the directed and weighted transmission graph of blogs, the weight of each edge corresponds to the probability that a blog both reads and copies an item from another blog; this weight is learned with a machine learning algorithm from observed mentions of previous topics by the two blogs. By tracking 7,000 topics over a set of blogs, Kempe et al. [71] found that the amount of information flowing through blogs corresponded to the ranking of those blogs by blog search engines. Similarly, ordering the edges by their transmission probability closely reproduced the pairs of blogs directly linked through their blogroll links (links to other blogs that occur on the sidebar of the page). Adar et al. [72] used the presence of direct links between blogs, and overlap in text and previously cited links, to predict the path of information flow among a set of blogs mentioning the same URL. Leskovec et al. [73] developed an algorithm for near-optimal selection of a set of blogs such that information cascades are detected quickly. Because of the skewed nature of participation and attention in the blogosphere, only a small fraction of blogs need to be monitored in order to capture a large portion of the activity.
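The cascade process just described can be made concrete with a small simulation. In the sketch below (illustrative Python; the edge probabilities are stand-ins for the weights that would, as in [71], be learned from observed topic mentions), each weighted edge gets a single independent chance to pass an item along:

```python
import random

def simulate_cascade(weights, seeds, rng=None):
    """Simulate one information cascade over a directed, weighted blog graph.

    weights -- dict mapping (source, reader) edges to the probability that
               `reader` both reads and copies an item from `source`
    seeds   -- set of blogs where the item first appears
    Each edge is tried at most once (an independent cascade).
    """
    rng = rng or random.Random(0)
    outgoing = {}
    for (src, reader), p in weights.items():
        outgoing.setdefault(src, []).append((reader, p))
    infected = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for src in frontier:
            for reader, p in outgoing.get(src, []):
                if reader not in infected and rng.random() < p:
                    infected.add(reader)      # the item spreads to this blog
                    nxt.append(reader)
        frontier = nxt
    return infected
```

Averaging the size of `infected` over many runs estimates how far an item seeded at a given set of blogs will travel, which is the quantity the monitoring-set selection of [73] seeks to cover with few observed blogs.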
Although blog content is highly distributed, and to a certain extent collaborative, with blogs filtering and generating news at rates often outpacing the mainstream media, the content is highly redundant, often inaccurate, and poorly interlinked. In contrast, Wikipedia is an example of collaboratively generated content that is accurate, nonredundant, and carefully structured. Wikipedia Wikipedia is a web-based encyclopedia that is collaboratively written by volunteers all around the world. Since its creation in 2001, Wikipedia has grown rapidly into one of the most prominent reference sources, its pages frequently appearing among the top search engine results.
World Wide Web, Graph Structure, Figure 5 a Web motif profile b Wikipedia motif profile. The motif profiles of several web samples and social networks compared to profiles found in the different language Wikipedias. The x axis depicts the possible triads, and the y axis represents normalized z-scores comparing the frequency of each motif to randomized versions of the networks
As of August 2007, there were 75,000 active contributors, with nearly 2 million articles in English alone. A study of the 30 largest language Wikipedias in January 2005 yielded degree distribution exponents (γ_in = 2.15 ± 0.13 and γ_out = 2.57 ± 0.27) reflective of the web overall. Unlike the web, however, between 85 and 97 percent of the pages in the various languages belong to the strongly connected component [74], indicating that Wikipedia content is more consistently bound together than the WWW at large. Wikipedia is also demonstrating densification, with the number of links growing superlinearly with the number of Wikipedia pages as L ∝ N^α, with α = 1.14 ± 0.05. Similar growth patterns, with more links than nodes being added over time, have been observed in the physical internet, patent and scientific citation networks [75], as well as blog citation networks [76]. The Wikipedia networks display a high degree of reciprocity, where one article linking to another frequently corresponds to the second article linking back. Using the unbiased mutual reciprocity measure specified in Eq. (3), the degree of reciprocity for the different language Wikipedias is ρ = 0.32 ± 0.05, lower than for the crawl of pages in the nd.edu domain (ρ = 0.52) [19,20]. Wikipedia also exhibits a high degree of clustering, with related articles linking together in triads. Specifically, the motif profile closely follows that of the web overall and falls in the same triad significance superfamily. Beyond its structural characteristics, Wikipedia has been an ideal medium to validate the preferential attachment model. It was used to verify that the probability of acquiring new links is in fact proportional to the number of links already present on the page [77]. Such time-resolved data had previously been difficult to obtain, but Wikipedia provides a complete history of edits for download. Two
interesting observations were made. First, for pages with moderate numbers of inlinks and outlinks, the proportionality is slightly sublinear, but close to unity (an exponent of 0.9 in Eq. (11)). Second, the probability reaches a peak: pages that already contain a large number of links are not as likely to add more, and pages that are already heavily referenced do not tend to attract many additional links. This could reflect both mature pages and mature topics, where the links have already been established. Future Directions Web 2.0 Wikipedia and the blogosphere are just two of the many exciting developments in Web 2.0. Web 2.0 refers to collaborative and interactive applications that are fast becoming integrated into the Web. Rather than simply being consumers of information that they retrieve on the web, web users can now easily contribute content through tools such as blogs and wikis. Unlike simple web pages, which are generated by a single author or through a script from a database, Web 2.0 allows users to interact with one another by modifying pages. For example, bloggers can interact with one another by leaving comments on each other's posts. Wikipedians interact by collaboratively editing the same Wikipedia entries, and by participating in discussions on talk pages dedicated to those entries. Many content providers for text, images, and video allow users to provide reviews in the form of ratings and comments on the content, or to respond by adding content of their own. In addition to reflecting a change in the way that web content is generated, Web 2.0 sites allow users to consume web content in a collaborative way. Through RSS feeds, users receive updates of web content rather than
having to remember to visit individual sites. They can tag pages with keywords for easy subsequent retrieval, not just for themselves, but for others interested in the same topic. Through collaborative filtering, users can learn about products, news articles, and media that their friends enjoyed. The best recommenders need not be friends, however. Collaborative tools such as RSS readers, bookmarking websites, and content providers can automatically identify users with like interests and make recommendations based on items that those users enjoyed. With Web 2.0 technologies, one can view the Web not only as a proliferation of information and content, but also as a giant, distributed machinery that allows the most relevant, personalized content to bubble to the top. While the Web may have created a problem through over-production and over-participation, it has also solved it through the emergence of a collection of exceedingly simple yet effective filtering tools. At this stage, it is unclear whether social networking sites, which allow users to explicitly specify who their friends are, will become platforms that other content providers will be integrated into, or whether many different websites will be able to integrate an open, portable social network that users can bring with them. In either case, the Web graph may soon appear rather different, with webpages becoming mere containers of segmented, timestamped, personalized content. One of the latest trends in combining content is that of mashups. One can use a mashup to overlay, for example, rental property listings or restaurant reviews on a neighborhood map. These mashups will be especially useful on location- and user-aware devices. As our lives become easier to track, from the sites we browse, to the people we know, to our physical movements tracked by cellphones, we may find our lives increasingly caught in the Web.
Most people will readily trade such information for improved access to personalized and geo-specific services and information. It does, however, also raise interesting privacy concerns, since the Web is a highly distributed storage and dissemination medium. Mobile Web Another forcing function for Web innovation is the fact that many individuals access the Web from small mobile devices. For many individuals in developing countries, this may be their primary way of accessing the Web. This has led to the development of web services tailored to the mobile Web. Access patterns from mobile devices have been shown to follow the same regularities as general web browsing [78]. This holds despite the fact that Web access from such devices presents challenges due to limited download speed and small display size [79]. It also presents opportunities, as users are able to contribute content wherever they may find themselves. Moblogging (mobile blogging) involves updating one's blog instantaneously with images and videos as they are captured. Currently several news sites encourage the contribution of such materials. The mobile Web also allows users to access Web content, such as maps, traffic, and tourist information, when they are on the go. Again, this may change the structure of the Web graph, as its content is delivered in small, personalized chunks to a variety of devices. Even so, one would expect its main features to remain: billions of hyperlinked pieces of information, showing strong regularities that are a reflection of the real world that is now inextricably tied to the Web.
Bibliography Primary Literature 1. Gray M (1997) Web growth summary. www.mit.edu/~mkgray/net/web-growth-summary.html. Accessed 9 March 2008 2. Walton M (2006) Web reaches new milestone: 100 million sites. www.cnn.com/2006/TECH/internet/11/01/100millionwebsites/index.html. Accessed 9 March 2008 3. Cho J, Garcia-Molina H, Haveliwala T, Lam W, Paepcke A, Raghavan S, Wesley G (2006) Stanford webbase components and applications. ACM Trans Inter Tech 6(2):153–186 4. Boldi P, Vigna S (2004) The webgraph framework I: compression techniques. In: Proceedings of the 13th International Conference on World Wide Web, New York, NY, pp 595–602 5. Boldi P, Codenotti B, Santini M, Vigna S (2004) UbiCrawler: a scalable fully distributed Web crawler. Softw Pract Experience 34(8):711–726 6. Adamic LA (1999) The Small World Web. Proc ECDL 99:443–452 7. Broder A, Kumar R, Maghoul F, Raghavan P, Rajagopalan S, Stata R, Tomkins A, Wiener J (2000) Graph structure in the Web. Comput Netw 33(1–6):309–320 8. Joshi A, Kumar R, Reed B, Tomkins A (2007) Anchor-based proximity measures. In: Proceedings of the 16th International World Wide Web Conference, Banff, Canada 9. Fetterly D, Manasse M, Najork M (2004) Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In: Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004, Paris, France, pp 1–6 10. Zhang B, Li H, Liu Y, Ji L, Xi W, Fan W, Chen Z, Ma WY (2005) Improving web search results using affinity graph. In: SIGIR '05: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, pp 504–511 11. Broder AZ, Lempel R, Maghoul F, Pedersen J (2006) Efficient PageRank approximation via graph aggregation. Inf Retr 9(2):123–138
12. Adamic LA, Huberman BA (2002) Zipf's law and the internet. Glottometrics 3:143–150 13. Mitzenmacher M (2003) A brief history of generative models for power law and lognormal distributions. Internet Math 1(2):226–251 14. Newman MEJ (2005) Power laws, Pareto distributions and Zipf's law. Contemp Phys 46(5):323–351 15. Barabási AL, Albert R (1999) Emergence of Scaling in Random Networks. Science 286(5439):509 16. Pennock DM, Flake GW, Lawrence S, Glover EJ, Giles CL (2002) Winners don't take all: Characterizing the competition for links on the web. Proc Natl Acad Sci 99(8):5207 17. Newman MEJ (2002) Assortative Mixing in Networks. Phys Rev Lett 89(20):208701 18. Dill S, Kumar R, McCurley KS, Rajagopalan S, Sivakumar D, Tomkins A (2002) Self-Similarity in the Web. ACM Trans Internet Technol 2(3):205–223 19. Albert R, Jeong H, Barabási AL (1999) Internet: Diameter of the world-wide web. Nature 401:130–131. doi:10.1038/43601 20. Garlaschelli D, Loffredo MI (2004) Patterns of link reciprocity in directed networks. Phys Rev Lett 93(26):268701 21. Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network Motifs: Simple Building Blocks of Complex Networks. Science 298(5594):824–827 22. Milo R, Itzkovitz S, Kashtan N, Levitt R, Shen-Orr S, Ayzenshtat I, Sheffer M, Alon U (2004) Superfamilies of designed and evolved networks. Science 303(5663):1538–1542 23. Brewington BE, Cybenko G (2000) How dynamic is the Web? Computer Netw 33(1–6):257–276 24. Fetterly D, Manasse M, Najork M, Wiener JL (2004) A large-scale study of the evolution of Web pages. Softw Pract Experience 34(2):213–237 25. Cho J, Garcia-Molina H (2003) Estimating frequency of change. ACM Trans Internet Technol 3(3):256–290 26. Douglis F, Feldmann A, Krishnamurthy B, Mogul J (1997) Rate of change and other metrics: a live study of the world wide web. USENIX Symposium on Internet Technologies and Systems, vol 119 27.
Dasgupta A, Ghosh A, Kumar R, Olston C, Pandey S, Tomkins A (2007) The discoverability of the web. In: Proceedings of the 16th International Conference on World Wide Web, Banff, Canada, pp 421–430 28. Huberman BA, Pirolli PLT, Pitkow JE, Lukose RM (1998) Strong Regularities in World Wide Web Surfing. Science 280(5360):95 29. Lawrence S, Giles CL (1998) Searching the World Wide Web. Science 280(5360):98 30. Albert R, Barabási AL (2002) Statistical mechanics of complex networks. Rev Mod Phys 74(1):47–97 31. Simon HA (1955) On a Class of Skew Distribution Functions. Biometrika 42(3/4):425–440 32. Yule GU (1925) A Mathematical Theory of Evolution, Based on the Conclusions of Dr. JC Willis, FRS. Philos Trans Roy Soc Lond Ser B, Containing Papers of a Biological Character 213:21–87 33. Price DS (1976) A general theory of bibliometric and other cumulative advantage processes. J Am Soc Inf Sci 27(5–6):292–306 34. Adamic LA, Huberman BA (2000) Power-law distribution of the world wide web. Science 287(5461):2115a 35. Watts DJ, Strogatz SH (1998) Collective dynamics of 'small-world' networks. Nature 393(6684):440–442
36. Albert R, Barabási A-L (2000) Topology of Evolving Networks: Local Events and Universality. Phys Rev Lett 85(24):5234–5237 37. Dorogovtsev SN, Mendes JFF (2000) Scaling behaviour of developing and decaying networks. Europhys Lett 52(1):33–39 38. Bianconi G, Barabasi AL (2001) Competition and multiscaling in evolving networks. Europhys Lett 54(4):436–442 39. Jeong H, Neda Z, Barabasi AL (2003) Measuring preferential attachment in evolving networks. Europhys Lett 61(4):567–572 40. Vázquez A (2003) Growing network with local rules: Preferential attachment, clustering hierarchy, and degree correlations. Phys Rev E 67(5):56104 41. Jackson MO, Rogers BW (2007) Meeting Strangers and Friends of Friends: How Random Are Social Networks? Am Econ Rev 97(3):890–915 42. Kumar R, Raghavan P, Rajagopalan S, Sivakumar D, Tomkins A, Upfal E (2000) Stochastic models for the web graph. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, Redondo Beach, p 57 43. Leskovec J, Kleinberg J, Faloutsos C (2007) Graph evolution: Densification and shrinking diameters. ACM Trans Knowl Discov Data (TKDD) 1(1) 44. Menczer F (2002) Growing and navigating the small world Web by local content. Proc Natl Acad Sci 99(22):14014–14019 45. Brin S, Page L (1998) The anatomy of a large-scale hypertextual Web search engine. Comput Netw ISDN Syst 30(1–7):107–117 46. Litvak N, Volkovich Y, Donato D (2007) Determining factors behind the pagerank log-log plot. Lecture notes in computer science 4863:108 47. Haveliwala TH (2002) Topic-sensitive pagerank. In: WWW '02: Proceedings of the 11th International Conference on World Wide Web, New York, pp 517–526 48. Henzinger MR, Heydon A, Mitzenmacher M, Najork M (2000) On near-uniform URL sampling. Comput Netw 33(1–6):295–308 49. Kleinberg JM (1999) Authoritative sources in a hyperlinked environment. J ACM 46(5):604–632 50. Pass G, Chowdhury A, Torgeson C (2006) A picture of search. Infoscale '06, Hong Kong.
In: Proceedings of the 1st International Conference on Scalable Information Systems. ACM, New York, pp 1
51. Fortunato S, Flammini A, Menczer F, Vespignani A (2006) Topical interests and the mitigation of search engine bias. Proc Natl Acad Sci 103(34):12684
52. Kleinberg JM, Kumar R, Raghavan P, Rajagopalan S, Tomkins AS (1999) The Web as a Graph: Measurements, Models, and Methods. In: Computing and Combinatorics: 5th Annual International Conference, COCOON'99, Tokyo, Japan
53. Dourisboure Y, Geraci F, Pellegrini M (2007) Extraction and classification of dense communities in the web. In: Proceedings of the 16th International Conference on World Wide Web, Banff, Canada, pp 461–470
54. Kumar R, Raghavan P, Rajagopalan S, Tomkins A (1999) Extracting large-scale knowledge bases from the web. In: Proceedings of the 25th Very Large Data Bases Conference, Edinburgh, UK, pp 639–650
55. Flake GW, Lawrence S, Giles CL, Coetzee FM (2002) Self-organization and identification of Web communities. Computer 35(3):66–70
56. Clauset A, Newman MEJ, Moore C (2004) Finding community structure in very large networks. Phys Rev E 70(6):66111
57. Danon L, Díaz-Guilera A, Duch J, Arenas A (2005) Comparing community structure identification. J Stat Mech Theor Exp 9:P09008
58. Castillo C, Donato D, Gionis A, Murdock V, Silvestri F (2007) Know your neighbors: Web spam detection using the web topology. In: Proceedings of SIGIR. ACM Press, Amsterdam, pp 423–430
59. Leskovec J, Dumais S, Horvitz E (2007) Web projections: learning from contextual subgraphs of the web. In: Proceedings of the 16th International Conference on World Wide Web, Banff, Canada, pp 471–480
60. Anderson C (2006) The long tail. Hyperion, New York
61. Kumar R, Novak J, Raghavan P, Tomkins A (2004) Structure and evolution of blogspace. Commun ACM 47(12):35–39
62. Liben-Nowell D, Novak J, Kumar R, Raghavan P, Tomkins A (2005) Geographic routing in social networks. Proc Natl Acad Sci 102(33):11623–11628
63. Travers J, Milgram S (1969) An experimental study of the small world problem. Sociometry 32:425–443
64. Kleinberg J (2000) Navigation in a small world. Nature 406(6798):845
65. Kumar R, Novak J, Raghavan P, Tomkins A (2005) On the bursty evolution of blogspace. World Wide Web 8(2):159–178
66. Hindman M, Tsioutsiouliklis K, Johnson JA (2003) Googlearchy: How a Few Heavily-Linked Sites Dominate Politics on the Web. Annual Meeting of the Midwest Political Science Association
67. Drezner DW, Farrell H (2004) The power and politics of blogs. Download at http://www.danieldrezner.com/research/blogpaperfinal.pdf
68. Adamic LA, Glance N (2005) The political blogosphere and the 2004 US election: divided they blog. In: Proceedings of the 3rd International Workshop on Link Discovery, Aug 21–25, Chicago, IL, pp 36–43
69. Newman MEJ (2006) Modularity and community structure in networks. Proc Natl Acad Sci 103(23):8577–8582
70. Gruhl D, Guha R, Liben-Nowell D, Tomkins A (2004) Information diffusion through blogspace. In: WWW '04: Proceedings of the 13th International Conference on World Wide Web. ACM Press, New York, pp 491–501
71. Kempe D, Kleinberg J, Tardos É (2003) Maximizing the spread of influence through a social network. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, pp 137–146
72. Adar E, Zhang L, Adamic LA, Lukose RM (2004) Implicit structure and the dynamics of blogspace. In: Workshop on the Weblogging Ecosystem, New York, NY
73. Leskovec J, Krause A, Guestrin C, Faloutsos C, VanBriesen J, Glance N (2007) Cost-effective outbreak detection in networks. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, pp 420–429 74. Zlatic V, Bozicevic M, Stefancic H, Domazet M (2006) Wikipedias: Collaborative web-based encyclopedias as complex networks. Phys Rev E 74(1):016115 75. Leskovec J, Kleinberg J, Faloutsos C (2005) Graphs over time: densification laws, shrinking diameters and possible explanations. In: Conference on Knowledge Discovery in Data, Chicago, IL, USA, pp 177–187 76. Shi X, Tseng B, Adamic LA (2007) Looking at the Blogosphere Topology through Different Lenses. Proceedings of ICWSM'07, Boulder, CO, USA 77. Capocci A, Servedio VDP, Colaiori F, Buriol LS, Donato D, Leonardi S, Caldarelli G (2006) Preferential attachment in the growth of social networks: The internet encyclopedia wikipedia. Phys Rev E 74(3):036116 78. Halvey M, Keane MT, Smyth B (2006) Mobile web surfing is the same as web surfing. Commun ACM 49(3):76–81 79. Buyukkokten O, Garcia-Molina H, Paepcke A, Winograd T (2000) Power browser: efficient web browsing for pdas. In: CHI '00: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, New York, pp 430–437
Books and Reviews Aiello W, Broder A, Janssen J, Milios E (eds) (2006) Algorithms and Models for the WebGraph: Proceedings of the 4th International Workshop, WAW 2006, Banff, Canada, 30 Nov– 1 Dec Bonato A (2005) A Survey of Models of the Web graph. In: LópezOrtiz A, Hamel A (eds) Combinatorial and Algorithmic Aspects of Networking. Lecture Notes in Computer Science, vol 3405. Springer, Berlin, pp 159 Bonato A, Chung FRK (eds) (2007) Algorithms and Models for the WebGraph: Proceedings of the 5th International Workshop, WAW 2007, San Diego, CA, USA Caldarelli G (2007) Technological Networks: Internet and WWW in Scale-Free Networks: Complex Webs in Nature and Technology. Oxford University Press, Oxford Huberman BA (2001) The Laws of the Web: Patterns in the Ecology of Information. MIT, Cambridge
3371
3372
Zero-Sum Two Person Games
T.E.S. Raghavan
Department of Mathematics, Statistics and Computer Science, University of Illinois, Chicago, USA

Article Outline

Introduction
Games with Perfect Information
Mixed Strategy and Minimax Theorem
Behavior Strategies in Games with Perfect Recall
Efficient Computation of Behavior Strategies
General Minimax Theorems
Applications of Infinite Games
Epilogue
Acknowledgment
Bibliography

Introduction

Conflicts are an inevitable part of human existence. They are a consequence of competitive greed and the scarcity of resources, which are rarely balanced without open conflict. The epic poems of the Greek, Roman, and Indian civilizations, which document wars between nation-states or clans, reinforce the historical legitimacy of this statement. Domination is the recurring theme in human conflicts. In a primitive sense it can be observed historically in the domination of men over women across cultures, while on a more refined level it appears in the imperialistic ambitions of nation-states. In modern times, a new source of conflict has emerged on an international scale in the form of economic competition between multinational corporations. While conflicts will remain a perennial part of human existence, the real question at hand is how to formalize such conflicts mathematically in order to get a grip on potential solutions. We can use mock conflicts in the form of parlor games to understand and evaluate solutions for real conflicts.

Conflicts are unresolvable when the participants have no say in the course of action. For example, one can lose interest in a parlor game whose entire course of action is dictated by chance. Examples of such games are Chutes and Ladders, Trade, Trouble, etc. Quite a few parlor games combine tactical decisions with chance moves; the game Le Her and the game of Parcheesi are typical examples.
An outstanding example in this category is the game of backgammon, a remarkably deep game. In chess, the player who moves first is usually determined by
a coin toss, but the rest of the game is determined entirely by the decisions of the two players. In such games, players make strategic decisions and attempt to gain an advantage over their opponents.

A game played by two rational players is called zero-sum if one player's gain is the other player's loss. Chess, Checkers, Gin Rummy, Two-finger Morra, and Tic-Tac-Toe are all examples of zero-sum two-person games. Business competition between two major airlines, two major publishers, or two major automobile manufacturers can be modeled as a zero-sum two-person game (even if the outcome is not precisely zero-sum). Zero-sum games can also be used to construct Nash equilibria in many dynamic non-zero-sum games [64].

Games with Perfect Information

Emptying a Box

Example 1  A box contains 15 pebbles. Players I and II remove between one and four pebbles from the box in alternating turns. Player I goes first, and the game ends when all pebbles have been removed. The player who empties the box on his turn is the winner, and he receives $1 from his opponent.

The players can decide in advance how many pebbles to remove on each of their turns. Suppose a player finds x pebbles in the box when it is his turn. He can decide to remove 1, 2, 3, or at most 4 pebbles. Thus a strategy for a player is any function f with domain X = {1, 2, ..., 15} and range contained in {1, 2, 3, 4} such that f(x) <= min(x, 4). Given strategies f, g for Players I and II respectively, the game evolves by executing the strategies decided in advance. For example, let

  f(x) = 2 if x is even, 1 if x is odd ;
  g(x) = 3 if x >= 3, x otherwise .

The alternate depletions lead to the following scenario:

  Move by   Removes   Leaving
  I         1         14
  II        3         11
  I         1         10
  II        3         7
  I         1         6
  II        3         3
  I         1         2
  II        2         0
In this case the winner is Player II. Actually, in his first move Player II made a bad move by removing 3 out of 14; Player I could have exploited this, but did not. Though Player I made a good second move, he reverted to his naive strategy and made a bad third move. The question is: Can
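The play traced above can be reproduced mechanically. Below is a minimal sketch (the function names are ours; the strategies f and g are exactly those of Example 1):

```python
def f(x):
    # Player I's naive strategy from Example 1: remove 2 from an even pile, 1 from an odd pile
    return 2 if x % 2 == 0 else 1

def g(x):
    # Player II's strategy from Example 1: remove 3 when at least 3 pebbles remain, else take the rest
    return 3 if x >= 3 else x

def play(pile, strategies):
    """Alternate moves until the box is empty; return the index (0 = I, 1 = II) of the winner."""
    turn = 0
    while True:
        pile -= strategies[turn](pile)
        if pile == 0:
            return turn
        turn = 1 - turn

print(play(15, [f, g]))  # prints 1: Player II empties the box, exactly as in the trace above
```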
Player II ensure victory for himself by intelligently choosing a suitable strategy? Indeed, Player II can win the game with any strategy satisfying the conditions of g*, where

  g*(x) = 1 if x = 1 (mod 5)
          2 if x = 2 (mod 5)
          3 if x = 3 (mod 5)
          4 if x = 4 (mod 5) .

With such a strategy, Player II always leaves a multiple of 5 pebbles after his move, so it is Player I who always faces a pile that is a multiple of 5.

Given ε > 0, by uniform continuity we can find a δ > 1/n > 0 such that if (i, j) and (i', j') are adjacent vertices of a Hex board Bₙ, then

  | f₁(i/n, j/n) - f₁(i'/n, j'/n) | <= ε ,
  | f₂(i/n, j/n) - f₂(i'/n, j'/n) | <= ε .      (1)

Consider the 4 sets:

  H⁺ = { (i, j) in Bₙ : f₁(i/n, j/n) - i/n > ε } ,    (2)
  H⁻ = { (i, j) in Bₙ : f₁(i/n, j/n) - i/n < -ε } ,   (3)
  V⁺ = { (i, j) in Bₙ : f₂(i/n, j/n) - j/n > ε } ,    (4)
  V⁻ = { (i, j) in Bₙ : f₂(i/n, j/n) - j/n < -ε } .   (5)
Intuitively, the points in H⁺ are moved by f further to the right (with x coordinate increased by more than ε), and the points in V⁻ are moved further down (with y coordinate decreased by more than ε). We claim that these sets cannot cover all the vertices of the Hex board. If they did, then we would have a winner, say Player I with a winning path linking the East and West boundary frames. Since points of the East boundary have the highest x coordinate, they cannot be moved further to the right. Thus H⁺ contains no vertices of the East boundary, and similarly H⁻ contains no vertices of the West boundary. The path must therefore contain vertices from both H⁺ and H⁻. However, for any (i, j) in H⁺ and (i', j') in H⁻ we have
  f₁(i/n, j/n) - i/n > ε ,
  -f₁(i'/n, j'/n) + i'/n > ε .

Summing the two inequalities and using (1) we get

  (i' - i)/n > 2ε - ε = ε .

Thus the points (i, j) and (i', j') cannot be adjacent, and this contradicts that they are part of a connected path. We have a contradiction.

Remark 7  The algorithm attempts to build a winning path and advances by entering mate triangles. Since the algorithm will not be able to cover the Hex board, partial bridge building must fail at some point, giving a vertex that is outside the union of the sets H⁺, H⁻, V⁺, V⁻. Hence we reach an approximate fixed point while building the bridge.

An Application of the Algorithm

Consider the continuous map of the unit square into itself given by:

  f₁(x, y) = [x + max(-2 + 2x + 6y - 6xy, 0)] / [1 + max(-2 + 2x + 6y - 6xy, 0) + max(2x - 6xy, 0)] ,
  f₂(x, y) = [y + max(2 - 6x - 2y + 6xy, 0)] / [1 + max(2 - 6x - 2y + 6xy, 0) + max(2y - 6xy, 0)] .
Zero-Sum Two Person Games, Table 1
Table giving the Hex building path

  (x, y)      |f₁ - x|   |f₂ - y|   L
  (.0, .0)    0          .6667      1
  (.1, .0)    .0167      .5833      2
  (.1, .1)    .01228     .4667      2
  (.0, .1)    .0         .53333     1
  (.1, .2)    .007       .35        2
  (.0, .2)    0          .4         1
  (.1, .3)    .002       .233       2
  (.0, .3)    0          .26667     1
  (.1, .4)    .238       .116       1
  (.2, .4)    .194       .088       1
  (.2, .3)    .007       .177       2
  (.3, .4)    .153       .033       1
  (.3, .3)    .017       .067       2
  (.4, .4)    .116       0          1
  (.4, .3)    .0296      0          *
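The table entries can be checked directly. A short sketch (using the map f₁, f₂ of this example, written with the signs as we read the printed formulas):

```python
def f1(x, y):
    # x-component of the example's map of the unit square into itself
    a = max(-2 + 2*x + 6*y - 6*x*y, 0)
    b = max(2*x - 6*x*y, 0)
    return (x + a) / (1 + a + b)

def f2(x, y):
    # y-component
    c = max(2 - 6*x - 2*y + 6*x*y, 0)
    d = max(2*y - 6*x*y, 0)
    return (y + c) / (1 + c + d)

# the candidate approximate fixed point reached by the Hex-path iteration
x, y = 0.4, 0.3
print(abs(f1(x, y) - x), abs(f2(x, y) - y))  # both below eps = .05
```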
With ε = .05, we can start with a grid of δ = .1 (hopefully adequate) and find an approximate fixed point. In fact, for a spacing of .1 units, the iterations according to the Hex rule passed through the points listed in Table 1, with |f₁(x, y) - x| and |f₂(x, y) - y| as given there. Thus the approximate fixed point is x = .4, y = .3.

Extensive Games and Normal Form Reduction

Any game, as it evolves, can be represented by a rooted tree where the root vertex corresponds to the initial move. Each vertex of the tree represents a particular move of a particular player. The alternatives available at any given move are identified with the edges emanating from the vertex that represents the move. If a vertex is assigned to chance, then the game associates a probability distribution with the descending edges. The terminal vertices are called plays, and they are labeled with the payoff to Player I. In zero-sum games Player II's payoff is simply the negative of the payoff to Player I. The vertices for a player are further partitioned into information sets. Information sets must satisfy the following requirements: The number of edges descending from any two moves within an information set is the same. No information set intersects the unique unicursal path from the root to any end vertex of the tree in more than one move. Any information set which contains a chance move is a singleton.
We will use the following example to illustrate the extensive form representation:

Example 8  Player I has 3 dice in his pocket. Die 1 is a fake die with all sides numbered one. Die 2 is a fake die with all sides numbered two. Die 3 is a genuine unbiased die. He chooses one of the 3 dice secretly, tosses the die once, and announces the outcome to Player II. Knowing the outcome but not knowing the chosen die, Player II tries to guess the die that was tossed. He pays $1 to Player I if his guess is wrong; if he guesses correctly, he pays nothing to Player I.

The game is represented by the tree of Fig. 2, with the root vertex assigned to Player I. The 3 alternatives at this move are to choose the die with all sides 1, to choose the die with all sides 2, or to choose the unbiased die. The end vertices of the edges descending from the root vertex are moves for chance. The outcomes are certain (1 or 2) if the die is fake; the outcome is one of the numbers 1, ..., 6 if the die chosen is genuine. The other ends of these edges are moves for Player II. These moves are partitioned into information sets V₁ (corresponding to outcome 1), V₂ (corresponding to outcome 2), and singleton information sets V₃, V₄, V₅, V₆ corresponding to outcomes 3, 4, 5, and 6 respectively.

Player II must guess the die based on the information given. If he is told that the game has reached a move in information set V₁, it simply means that the outcome of the toss is 1. He has two alternatives at each move of this information set: one corresponds to guessing that the die is fake, the other to guessing that it is genuine. The same applies to the information set V₂. If the outcome is in V₃, ..., V₆, the clear choice is to guess that the die is genuine. Thus a pure strategy (master plan) for Player II is a 2-tuple with coordinates taking the values F or G. Hence there are 4 pure strategies for Player II: (F₁, F₂), (F₁, G), (G, F₂), (G, G).
For example, the first coordinate of the strategy indicates what to guess when the outcome is 1 and the second coordinate indicates what to guess when the outcome is 2. For all other outcomes II guesses that the die is genuine (unbiased). The payoff to Player I when Player I uses a pure strategy i and Player II uses a pure strategy j is simply the expected income to Player I when the two players choose i and j simultaneously. This can be represented by a matrix A = (aᵢⱼ) whose rows and columns are pure strategies and whose entries are the expected payoffs. The payoff matrix A, given by

         (F₁,F₂)  (F₁,G)  (G,F₂)  (G,G)
  F₁    [   0        0       1      1  ]
  F₂    [   0        1       0      1  ]
  G     [  1/3      1/6     1/6     0  ]
is called the normal form reduction of the original extensive game.

Saddle Point

The normal form of a zero-sum two person game has a saddle point when there is a row r and a column c such that the entry a_rc is the smallest in row r and the largest in column c. By choosing the pure strategy corresponding to row r, Player I guarantees a payoff of a_rc = min_j a_rj. By choosing column c, Player II guarantees a loss of no more than max_i a_ic = a_rc. Thus row r and column c are good pure strategies for the two players. In a payoff matrix A = (aᵢⱼ), row r is said to strictly dominate row t if a_rj > a_tj for all j. Player I, the maximizer, will avoid row t when it is dominated. If rows r and t are not identical and a_rj >= a_tj for all j, then we say that row r weakly dominates row t.

Example 9  Player I chooses either 1 or 2. Knowing Player I's choice, Player II chooses either 3 or 4. If the total T is odd, Player I wins $T from Player II; otherwise Player I pays $T to Player II.

The pure strategies for Player I are simply 1 = choose 1 and 2 = choose 2. For Player II there are four pure strategies, given by:

  (3, 3): choose 3 no matter what I chooses.
  (4, 4): choose 4 no matter what I chooses.
  (3, 4): choose 3 if I chooses 1 and choose 4 if I chooses 2.
  (4, 3): choose 4 if I chooses 1 and choose 3 if I chooses 2.

This results in a normal form with payoff matrix A for Player I given by:

        (3,3)  (4,4)  (3,4)  (4,3)
  1    [  -4     5     -4      5  ]
  2    [   5    -6     -6      5  ]

Here we can delete column (4,3), which dominates column (3,4). We do not yet have row domination. We can delete column (4,4), as it weakly dominates column (3,4). Still we have no row domination after these deletions. We can delete column (3,3), as it weakly dominates column (3,4). Now we have strict domination of row 2 by row 1, and we are left with the row 1, column (3,4) entry = -4. This is a saddle point for this game. In fact we have the following:

Theorem 10  The normal form of any zero-sum two person game with perfect information admits a saddle point. A saddle point can be arrived at by a sequence of row or column deletions: a row that is weakly dominated by another row can be deleted, and a column that weakly dominates another column can be deleted. In each iteration we can always find a weakly or strictly dominated row, or a weakly or strictly dominating column, to be deleted from the current submatrix.
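The deletion sequence of Example 9 can be automated. A rough sketch (our own helper, with weak dominance as defined above and 0-based indices; the deletion order may differ from the text's, but the surviving saddle entry is the same):

```python
def solve_by_domination(A):
    """Repeatedly delete weakly dominated rows and weakly dominating columns."""
    rows = list(range(len(A)))
    cols = list(range(len(A[0])))
    changed = True
    while changed and (len(rows) > 1 or len(cols) > 1):
        changed = False
        # delete a row t weakly dominated by another row r (the maximizer avoids t)
        for t in rows:
            if any(all(A[r][j] >= A[t][j] for j in cols) and
                   any(A[r][j] > A[t][j] for j in cols) for r in rows if r != t):
                rows.remove(t)
                changed = True
                break
        if changed:
            continue
        # delete a column d that weakly dominates another column c (the minimizer avoids d)
        for d in cols:
            if any(all(A[i][d] >= A[i][c] for i in rows) and
                   any(A[i][d] > A[i][c] for i in rows) for c in cols if c != d):
                cols.remove(d)
                changed = True
                break
    return rows, cols, A[rows[0]][cols[0]]

A = [[-4, 5, -4, 5],
     [ 5, -6, -6, 5]]          # Example 9 with the minus signs restored
print(solve_by_domination(A))  # ([0], [2], -4): row 1, column (3,4), saddle value -4
```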
Zero-Sum Two Person Games, Figure 2 Game tree for a single throw with fake or genuine dice
Mixed Strategy and Minimax Theorem

Zero-sum two person games do not always have saddle points in pure strategies. For example, in the game of guessing the die (Example 8) the normal form has no saddle point. Therefore it makes sense for players to choose their pure strategies via a random mechanism. Any probability distribution on the set of all pure strategies of a player is called a mixed strategy. In Example 8 a mixed strategy for Player I is a 3-tuple x = (x₁, x₂, x₃) and a mixed strategy for Player II is a 4-tuple y = (y₁, y₂, y₃, y₄). Here xᵢ is the probability that Player I chooses pure strategy i and yⱼ is the probability that Player II chooses pure strategy j. Since the players play independently and make their choices simultaneously, the expected payoff to Player I from Player II is

  K(x, y) = Σᵢ Σⱼ aᵢⱼ xᵢ yⱼ ,

where the aᵢⱼ are the elements of the payoff matrix A. We call K(x, y) the mixed payoff when players choose mixed strategies x and y instead of pure strategies i and j. Suppose x* = (1/8, 1/8, 3/4) and y* = (3/4, 0, 0, 1/4). Here x* guarantees Player I an expected payoff of 1/4 against any pure strategy j of Player II. By the affine linearity of K(x*, y) in y it follows that Player I has a guaranteed expectation of 1/4 against any mixed strategy choice of II. A similar argument shows that Player II can choose the mixed strategy y* = (3/4, 0, 0, 1/4), which limits his maximum expected loss to 1/4 against any mixed strategy choice of Player I. Thus

  max_x K(x, y*) = min_y K(x*, y) = 1/4 .

By replacing the rows and columns with mixed strategy payoffs we have a saddle point in mixed strategies.

Historical Remarks

The existence of a saddle point in mixed strategies for Example 8 is no accident. All finite games have a saddle point in mixed strategies. This important theorem, called the minimax theorem, is the very starting point of game theory. While Borel (see under Ville [65]) considered the notions of pure and mixed strategies for zero-sum two person games that have symmetric roles for the players, he was able to prove the theorem only in some special cases. It was von Neumann [66] who first proved the minimax theorem, using some intricate fixed point arguments. While several proofs are available for the same theorem [29,34,44,48,49,65,69], the proofs by Ville and Weyl are notable from an algorithmic point of view. The proofs by Nash and Kakutani allow immediate extension to Nash equilibrium strategies in many-person non-zero-sum games. For zero-sum two person games, optimal strategies and Nash equilibrium strategies coincide. The following is the seminal minimax theorem for matrix games.

Theorem 11 (von Neumann)  Let A = (aᵢⱼ) be any m × n real matrix. Then there exist a pair of probability vectors x = (x₁, x₂, ..., x_m) and y = (y₁, y₂, ..., yₙ) and a unique constant v such that

  Σ_{i=1}^{m} aᵢⱼ xᵢ >= v ,   j = 1, 2, ..., n ,
  Σ_{j=1}^{n} aᵢⱼ yⱼ <= v ,   i = 1, 2, ..., m .

The probability vectors x, y are called optimal mixed strategies for the players, and the constant v is called the value of the game.
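The guarantee of 1/4 claimed above for x* = (1/8, 1/8, 3/4) and y* = (3/4, 0, 0, 1/4) can be verified with exact arithmetic. A short sketch (our addition; the matrix is the normal form of Example 8):

```python
from fractions import Fraction as F

# rows F1, F2, G; columns (F1,F2), (F1,G), (G,F2), (G,G)
A = [[F(0),    F(0),    F(1),    F(1)],
     [F(0),    F(1),    F(0),    F(1)],
     [F(1, 3), F(1, 6), F(1, 6), F(0)]]

x = [F(1, 8), F(1, 8), F(3, 4)]     # Player I's mixed strategy x*
y = [F(3, 4), F(0), F(0), F(1, 4)]  # Player II's mixed strategy y*

# expectation of x* against every pure column, and of every pure row against y*
against_cols = [sum(x[i] * A[i][j] for i in range(3)) for j in range(4)]
against_rows = [sum(A[i][j] * y[j] for j in range(4)) for i in range(3)]
print(against_cols, against_rows)  # every entry equals 1/4
```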
H. Weyl [69] gave a complete algebraic proof and showed that the value and some pair of optimal strategies for the two players have all their coordinates in the smallest ordered field containing the payoff entries. Unfortunately his proof was non-constructive. It turns out that the minimax theorem can be proved constructively via linear programming, which leads to an efficient computational algorithm à la the simplex method [18]. The key idea is to convert the problem into a pair of dual linear programming problems.

Solving for Value and Optimal Strategies via Linear Programming

Without loss of generality we can assume that the payoff matrix A = (aᵢⱼ)_{m×n} > 0, that is, aᵢⱼ > 0 for all (i, j). Thus we are looking for

  v = min v₁   (6)

such that

  Σ_{j=1}^{n} aᵢⱼ yⱼ <= v₁ ,   i = 1, 2, ..., m ,   (7)

  y₁, ..., yₙ >= 0 ,   (8)

  Σ_{j=1}^{n} yⱼ = 1 .   (9)

Since the payoff matrix is positive, any v₁ satisfying the constraints above will be positive. Setting ηⱼ = yⱼ / v₁, the problem can be reformulated as

  max 1/v₁ = max Σ_{j=1}^{n} ηⱼ   (10)

such that

  Σ_{j=1}^{n} aᵢⱼ ηⱼ <= 1   for all i ,   (11)

  ηⱼ >= 0   for all j .   (12)

With A > 0, the ηⱼ's are bounded. The maximum of the linear function Σⱼ ηⱼ is attained at some extreme point of the convex set defined by the constraints (11) and (12). By introducing nonnegative slack variables s₁, s₂, ..., sₘ we can replace the inequalities (11) by equalities. The problem reduces to

  max Σ_{j=1}^{n} ηⱼ   (13)

subject to

  Σ_{j=1}^{n} aᵢⱼ ηⱼ + sᵢ = 1 ,   i = 1, 2, ..., m ,   (14)

  ηⱼ >= 0 ,   j = 1, 2, ..., n ,   (15)

  sᵢ >= 0 ,   i = 1, 2, ..., m .   (16)

Of the various algorithms to solve a linear programming problem, the simplex algorithm is among the most efficient. It was first investigated by Fourier (1830), but no further work was done for more than a century. The needs of industrial application motivated active research and led to the pioneering contributions of Kantorovich [1939] (see the translation in Management Science [35]) and of Dantzig [18]. It was Dantzig who brought the earlier investigations of Fourier to the forefront of modern applied mathematics.

Simplex Algorithm

Consider the linear programming problem above. Any solution η = (η₁, ..., ηₙ), s = (s₁, ..., sₘ) of the above system of equations is called a feasible solution. We can also rewrite the system as

  η₁C₁ + η₂C₂ + ⋯ + ηₙCₙ + s₁e₁ + s₂e₂ + ⋯ + sₘeₘ = 1 ,
  η₁, η₂, ..., ηₙ, s₁, s₂, ..., sₘ >= 0 .

Here Cⱼ, j = 1, ..., n, are the columns of the matrix A, the eᵢ are the columns of the m × m identity matrix, and 1 is the vector with all coordinates unity. With any extreme point (η, s) = (η₁, η₂, ..., ηₙ, s₁, ..., sₘ) of the convex polyhedron of feasible solutions one can associate a set of m linearly independent columns which form a basis for the column span of the matrix (A, I); the coefficients ηⱼ and sᵢ are equal to zero on all coordinates other than those of these m linearly independent columns. By slightly perturbing the entries we can assume that every extreme point of the set of feasible solutions has exactly m positive coordinates. Two extreme feasible solutions are called adjacent if their associated bases differ in exactly one column. The key idea behind the simplex algorithm is that an extreme point P = (η*, s*) is an optimal solution if and only if there is no adjacent extreme point Q at which the objective function has a higher value. Thus when the algorithm is initiated at an extreme point which is not optimal, there must be an adjacent extreme point that strictly improves the objective function. In each iteration, a column from outside the basis replaces a column in the current basis, corresponding to a move to an adjacent extreme point. Since there are m + n columns in all for the matrix (A, I), and
in each iteration we have strict improvement by our nondegeneracy assumption on the extreme points, the procedure must terminate in a finite number of steps, resulting in an optimal solution.

Example 12  Players I and II simultaneously show either 1 or 2 fingers. If T is the total number of fingers shown, then Player I receives $T from Player II when T is odd and loses $T to Player II when T is even. The payoff matrix is given by

  A = [ -2   3 ]
      [  3  -4 ]

Add 5 to each entry to get a new payoff matrix with all entries strictly positive; the new game is strategically the same as A:

  [ 3  8 ]
  [ 8  1 ]

The linear programming problem is: maximize 1·y₁ + 1·y₂ + 0·s₁ + 0·s₂ such that

  [ 3  8  1  0 ]  (y₁, y₂, s₁, s₂)ᵀ = (1, 1)ᵀ ,   y₁, y₂, s₁, s₂ >= 0 .
  [ 8  1  0  1 ]
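The value that the simplex computation below arrives at can be cross-checked with the standard closed formula for 2 × 2 games (our addition, valid when the matrix has no saddle point):

```python
from fractions import Fraction as F

def solve2x2(a, b, c, d):
    """Mixed-strategy solution of the 2x2 game [[a, b], [c, d]] without a saddle point."""
    den = a + d - b - c
    x1 = F(d - c, den)          # probability Player I puts on row 1
    y1 = F(d - b, den)          # probability Player II puts on column 1
    v = F(a * d - b * c, den)   # value of the game
    return x1, y1, v

print(solve2x2(-2, 3, 3, -4))   # Example 12: both players play (7/12, 5/12), value 1/12
print(solve2x2(3, 8, 8, 1)[2])  # the shifted game: value 61/12 = 1/12 + 5
```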
We can start with the trivial solution (0, 0, 1, 1)ᵀ. This corresponds to the basis e₁, e₂ with s₁ = s₂ = 1. The value of the objective function is 0. If we make y₂ > 0, then the value of the objective function can be increased. Thus we look for a solution with

  s₁ = 0 , s₂ > 0 , y₂ > 0 , satisfying the constraints 8y₂ = 1 , y₂ + s₂ = 1 ,

or with

  s₁ > 0 , s₂ = 0 , y₂ > 0 , satisfying the constraints 8y₂ + s₁ = 1 , y₂ = 1 .

Notice that y₂ = 1/8, s₂ = 7/8 is a solution to the first system, while the second system has no nonnegative solution. The value of the objective function at this extreme solution is 1/8. Now we look for an adjacent extreme point. We find that y₁ = 7/61, y₂ = 5/61 is such a solution. The procedure terminates because no adjacent solution with y₁ > 0, s₁ > 0 or y₂ > 0, s₁ > 0 or y₁ > 0, s₂ > 0 or y₂ > 0, s₂ > 0 has a higher objective function value. The algorithm terminates with the optimal objective value 12/61, so v₁ = 61/12. Thus the value of the modified game is 61/12, and the value of the original game is 61/12 - 5 = 1/12. A good strategy for Player II is obtained by normalizing the optimal solution of the linear program: it is y₁ = 7/12, y₂ = 5/12. Similarly, from the dual linear program we can see that the strategy x₁ = 7/12, x₂ = 5/12 is optimal for Player I.

Fictitious Play

Though optimal strategies are not easily found, even naive players can learn to steer their average payoff towards the value of the game by certain iterative procedures based on past plays. This learning procedure is known as fictitious play. The two players make their next choice under the assumption that the opponent will continue to choose pure strategies at the same frequencies as he or she did in the past. If x^(n), y^(n) are the empirical mixed strategies used by the two players in the first n rounds, then in round n + 1 Player I pretends that Player II will continue to use y^(n) in the future and selects any row i* such that

  Σⱼ a_{i*j} yⱼ^(n) = max_i Σⱼ aᵢⱼ yⱼ^(n) .

The new empirical mixed strategy is given by

  x^(n+1) = (1/(n+1)) I_{i*} + (n/(n+1)) x^(n) .

(Here I_{i*} is the degenerate mixed strategy that chooses pure strategy i* with probability 1.) This intuitive learning procedure was proposed by Brown [14], and the following convergence theorem was proved by Robinson [56].

Theorem 13

  lim_n min_j Σᵢ aᵢⱼ xᵢ^(n) = lim_n max_i Σⱼ aᵢⱼ yⱼ^(n) = v .

We will apply the fictitious play algorithm to the following example and get a bound on the value.

Example 14  Player I secretly picks a card of his choice from a deck of three cards numbered 1, 2, and 3. Player II proceeds to guess Player I's choice. After each guess Player I announces Player II's guess as "High", "Low", or "Correct", as the case may be. The game continues till Player II guesses Player I's choice correctly. Player II pays Player I $N, where N is the number of guesses he made.
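The learning rule of this section is easy to put in code. A compact sketch (our own implementation: cumulative totals rather than averages, ties broken by least index), run here on the matrix of Example 12, whose value is 1/12:

```python
def fictitious_play(A, rounds):
    """Brown's fictitious play; returns (lower, upper) bounds on the game value."""
    m, n = len(A), len(A[0])
    row_tot = [0] * n  # cumulative payoff of each column against Player I's past choices
    col_tot = [0] * m  # cumulative payoff of each row against Player II's past choices
    for _ in range(rounds):
        i = max(range(m), key=lambda r: (col_tot[r], -r))  # I: best reply, least index on ties
        j = min(range(n), key=lambda c: (row_tot[c], c))   # II: best reply, least index on ties
        for jj in range(n):
            row_tot[jj] += A[i][jj]
        for ii in range(m):
            col_tot[ii] += A[ii][j]
    return min(row_tot) / rounds, max(col_tot) / rounds

lo, hi = fictitious_play([[-2, 3], [3, -4]], 5000)
print(lo, hi)  # the two bounds bracket the value 1/12 = 0.0833... (Theorem 13)
```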
Zero-Sum Two Person Games, Table 2

  Round   I plays   I's totals so far against           II plays   II's totals so far against
                    (1,2)  (1,3)  (2)  (3,1)  (3,2)                row 1   row 2   row 3
  1       R1          1      1     2     2      3       C1           1       2       3
  2       R3          4      3     4     3      4       C2           2       5       5
  3       R2          6      6     5     6      6       C3           4       6       7
  4       R3          9      8     7     7      7       C3           6       7       9
  5       R3         12     10     9     8      8       C4           8      10      10
  6       R2         15     12    11     9      9       C4          10      13      11
  7       R2         17     15    12    12     11       C5          13      15      12
  8       R2         19     18    13    15     13       C3          15      16      14
  9       R2         21     21    14    18     15       C3          17      17      16
  10      R1         22     22    16    20     18       C3          19      18      18
The payoff matrix is given by

        (1,2)  (1,3)  (2)  (3,1)  (3,2)
  1    [  1      1     2     2      3  ]
  2    [  2      3     1     3      2  ]
  3    [  3      2     2     1      1  ]
Here the row labels are the possible cards chosen by Player I, and the column labels are pure strategies for Player II. For example, the pure strategy (1, 3) for Player II means that 1 is the first guess and, if 1 is incorrect, then 3 is the second guess. The entries of the matrix are the payoffs to Player I. We can use cumulative totals, instead of averages, for the players to make their next choice of row or column; we choose the row or column with the least index in case more than one row or one column meets the criterion. The totals for the first 10 rounds of fictitious play are given in Table 2. The smallest of Player I's final totals (16) and the largest of Player II's final totals (19) give approximate lower and upper bounds for the total payoff in 10 rounds, yielding 1.6 <= v <= 1.9.

Remark 15  Fictitious play is known to have a very poor rate of convergence to the value. While it works for all zero-sum two person games, it fails to extend to Nash equilibrium payoffs in bimatrix games, even when the game has a unique Nash equilibrium. It extends only to some very special classes, such as 2 × 2 bimatrix games and the so-called potential games. (See Miyasawa [1961], Shapley [1964], Monderer and Shapley [46], Krishna and Sjöström [41], and Berger [7].)

Search Games

Search games are often motivated by military applications. An object is hidden in space. While the space where the
object is hidden is known, the exact location is unknown. The search strategy consists of either targeting a single point of the space and paying a penalty when the search fails, or continuing the search till the searcher gets close to the hidden location. If the search consists of many attempts, then a pure strategy is simply a function of the search history so far. Example 14 is a typical search game. The following are examples of some simple search games that take unexpected turns with respect to the value and optimal strategies.

Example 16  A pet shop cobra of known length t < 1 escapes out of the shop and has settled somewhere along a particular linear branch, of unit length, of a nearby tree. Due to its camouflage, the exact location [x, x + t] where it has settled on the branch is unknown. The shop keeper chooses a point y of his choice and aims a bullet at the point y. In spite of his 100% accuracy, the cobra will escape permanently if his targeted point y is outside the settled location of the cobra. We treat this as a game between the cobra (Player I) and the shop keeper (Player II), and let the probability of survival be the payoff to the cobra. Thus

  K(x, y) = 1 if y < x or y > x + t ; 0 otherwise .

The pure strategy spaces are 0 <= x <= 1 - t for the snake and 0 <= y <= 1 for the shop keeper. It can be shown that the game has no saddle point but has optimal mixed strategies, and that the value function v(t) is a discontinuous function of t. In case 1/t is an integer n, a good strategy for the snake is to hide along [0, t], [t, 2t], ..., or [(n - 1)t, 1], chosen with equal chance. In this case the optimal strategy for the shop keeper is to shoot at a random point of [0, 1], and the value is 1 - 1/n. In case 1/t is not an integer, let n = [1/t]; then an optimal strategy for the snake is to hide along [0, t], [t, 2t], ..., or [(n - 1)t, nt], and an optimal strategy for the shop keeper is to shoot at one of the n points 1/(n + 1), 2/(n + 1), ..., n/(n + 1), chosen at random.
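The claimed value 1 - 1/n for integer 1/t can be checked numerically. A sketch with t = 1/4, so n = 4 (the grid of aim points is ours, chosen to avoid interval endpoints):

```python
t, n = 0.25, 4

def K(x, y):
    # the cobra, settled on [x, x + t], survives iff the shot at y misses that interval
    return 1.0 if y < x or y > x + t else 0.0

# the cobra mixes uniformly over the n segments [0, t], [t, 2t], ..., [(n-1)t, 1]
hides = [k * t for k in range(n)]

# against every aim point y, the cobra's expected survival is at least 1 - 1/n
worst = min(sum(K(x, y) for x in hides) / n
            for y in [(i + 0.5) / 1000 for i in range(1000)])
print(worst)  # 0.75, i.e. 1 - 1/4
```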
Example 17  While mowing the lawn, a lady suddenly realizes that she has lost her diamond engagement ring somewhere in the lawn. She has maximum speed s and will be able to locate the ring from its glitter if she comes sufficiently close to it, say within a certain distance of the ring. What is an optimal search strategy that minimizes her search time? If we treat Nature as a player against her, she is playing a zero-sum two person game in which Nature would find pleasure in her delayed success in finding the ring.
Search Games on Trees

The following is an elegant search game on a tree. For many other search games the reader can refer to the monographs by Gal [27] and Alpern and Gal [1]; see also [55].

Example 18  A bird has to look for a suitable location to build its nest for hatching eggs and protecting them against predator snakes. Having identified a large tree with a single predator snake in the neighborhood, the bird has to further decide where on the chosen tree to build its nest. The chance of survival of the eggs is directly proportional to the distance the snake travels to locate the nest. While birds and snakes work out their strategies based on instinct and evolutionary behavior, we can surely approximate the problem by the following zero-sum two person search game.

Let T = (X, E) be a finite tree with vertex set X and edge set E, and let O in X be the root vertex. A hider hides an object at a vertex x of the tree. A searcher starts at the root and travels along the edges of the tree so that the path traced covers all the terminal vertices. The search ends as soon as the searcher crosses the hidden location, and the payoff to the hider is the distance traveled so far. By a simple domination argument we can as well assume that the optimal hiding locations are simply the terminal vertices.

Theorem 19  The search game has a value and optimal strategies. The value coincides with the sum of all edge lengths. Any optimal strategy for the hider will necessarily restrict hiding to the terminal vertices. Let the least distance traveled to exhaust all end vertices one by one correspond to a permutation π of the end vertices, taken in the order w₁, w₂, ..., wₖ, and let π⁻¹ denote the reverse permutation. Then an optimal strategy for the searcher is to choose one of these two permutations by the toss of a coin. The hider has a unique optimal mixed strategy, and it chooses each end vertex with positive probability.
Suppose the tree is a path with root O and a single terminal vertex x. Since the search begins at O, the longest trip is possible only when the hider hides at x, and the theorem holds trivially. In case the tree has just two terminal vertices besides the root vertex, the possible hiding locations are, say, O, x₁, x₂, with edge lengths a₁, a₂. The possible searches are via the paths O → x₁ → O → x₂, abbreviated Ox₁Ox₂, or O → x₂ → O → x₁, abbreviated Ox₂Ox₁. The payoff matrix can be written as

          Ox₁Ox₂       Ox₂Ox₁
  x₁   [    a₁        2a₂ + a₁ ]
  x₂   [ 2a₁ + a₂        a₂    ]
Zero-Sum Two Person Games, Figure 3 Bird trying to hide at a leaf and snake chasing to reach the appropriate leaf via optimal Chinese postman route starting at root O and ending at O
The value of this game is a₁ + a₂, the sum of the edge lengths. We can use induction on the number of subtrees to establish that the value is the sum of the edge lengths in general. We will use an example to provide the intuition behind the formal proof. Given any permutation of the end vertices (leaves) of the above tree, let P be the shortest path from the root vertex that travels along the leaves in that order and returns to the root, and let P⁻¹ be the reverse path. Observe that such a closed path covers every edge exactly twice. Thus if the two paths P and P⁻¹ are chosen with equal chance by the snake, the average distance traveled by the snake when it locates the bird's nest at an end vertex will be independent of the particular end vertex. For example, along the closed path O → t → d → t → a → t → x → b → x → c → x → t → O → u → s → e → s → y → f → y → g → y → s → u → h → u → j → u → O, the distance traveled by the snake to reach leaf e is (3 + 7 + 7 + ⋯ + 8 + 3) = 74. If the snake travels along the reverse path to reach e, the distance traveled is (9 + 5 + 5 + ⋯ + 5 + 4 + 3) = 66. If instead it is to reach the vertex d, then via path P the distance is (3 + 7) = 10, while via P⁻¹ it must travel to e and then from e to d along the reverse path, a distance of (66 + 3 + ⋯ + 6 + 6 + 7) = 130. Thus in both cases the average distance traveled is 70, and the average distance is the same for every other leaf when P and P⁻¹ are used.

The optimal Chinese postman route can allow all permutations, subject to permuting the leaves of any subtree only among themselves. Thus the subtree rooted at t has leaves a, b, c, d and the subtree rooted at u has leaves e, f, g, h, j. For example, while permuting a, b, c, d only among
themselves, we have the further restriction that between b, c we cannot allow the insertion of a or d. For example, a, b, c, d and a, d, c, b are acceptable permutations, but not a, b, d, c: it can never be the optimal permuting choice. The same applies to the subtree rooted at u. For example, h, j, e, g, f can be part of the optimal Chinese postman route, but h, g, j, e, f cannot. We can think of the snake and bird playing the game as follows. The bird chooses to hide at a leaf of the subgame Gt rooted at t or at a leaf of the subgame Gu rooted at u. These leaves exhaust all leaves of the original game. The snake can restrict attention to the optimal route of each subgame. This can be thought of as a 2 × 2 game where the strategies for the two players (bird and snake) are: Bird: Strategy 1: Hide optimally at a leaf of Gt; Strategy 2: Hide optimally at a leaf of Gu. Snake: Strategy 1: Search first the leaves of Gt along the optimal Chinese postman route of Gt and then search the leaves of Gu; Strategy 2: Search first the leaves of Gu along the optimal Chinese postman route and then search the leaves of Gt along the optimal postman route. The expected outcome can be written as the following 2 × 2 game. (Here v(Gt), v(Gu) are the values of the subgames rooted at t, u respectively.)

                 Gt                             Gu
  Gt Gu  [     3 + v(Gt)               2[3 + v(Gt)] + [9 + v(Gu)] ]
  Gu Gt  [ 2[9 + v(Gu)] + [3 + v(Gt)]          9 + v(Gu)          ]
Observe that the 2 × 2 game has no saddle point and hence has value 3 + 9 + v(Gt) + v(Gu). By induction we can assume v(Gt) = 24, v(Gu) = 34. Thus the value of this game is 70. This is also the sum of the edge lengths of the game tree. An optimal strategy for the bird can be recursively determined as follows.

Umbrella Folding Algorithm

When a folding umbrella is stored in a handbag, the central stem is shrunk first and then the stems around it are collapsed, all in one stroke. We can mimic a similar procedure for our game tree. We simultaneously shrink the edges [xc] and [xb] to x. In the next round the edges [a,t], [x,t], [d,t], with leaves {a, x, d}, can be simultaneously shrunk to t, and so on till the entire tree is shrunk to the root vertex O. We know that the optimal strategy for the bird when the tree is simply the subtree with root x and leaves b, c is given by p(b) = 4/(4+2), p(c) = 2/(4+2). Now for the subtree with vertex t and leaves {a, b, c, d},
we can treat this as collapsing the previous subtree to x, and treat the new subtree with vertices {t, a, x, d} as though the three stems [ta], [tx], [td] have lengths 6, 5 + (4 + 2) = 11, 7. We can check that for this subtree game the leaves a, x, d are chosen with probabilities p(a) = 6/(6+11+7), p(x) = 11/(6+11+7), p(d) = 7/(6+11+7). Thus the optimal mixed strategy for the bird for choosing leaf b in our original tree game is to pass through vertices t, x, b and is given by the product p(t)p(x)p(b). We can inductively calculate these probabilities.

Completely Mixed Games and Perron's Theorem on Positive Matrices

A mixed strategy x for Player I is called completely mixed if it is strictly positive (x > 0). A matrix game A is completely mixed if and only if all optimal mixed strategies for Player I and Player II are completely mixed. The following elegant theorem was proved by Kaplansky [36].

Theorem 20 A matrix game A with value v is completely mixed if and only if
1. The matrix is square.
2. The optimal strategies are unique for the two players.
3. If v ≠ 0, then the matrix is nonsingular.
4. If v = 0, then the matrix has rank n − 1, where n is the order of the matrix.
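Kaplansky's conditions can be illustrated on a classic completely mixed game. The example below (rock-paper-scissors) is not from the text; it is a minimal sketch of conditions 1, 2 and 4 for a game with value zero.

```python
import numpy as np

# Kaplansky's conditions checked on rock-paper-scissors, a classic
# completely mixed game with value v = 0 (illustrative example).
A = np.array([[0., 1., -1.],
              [-1., 0., 1.],
              [1., -1., 0.]])
n = A.shape[0]                             # condition 1: the matrix is square
x = np.full(n, 1.0 / n)                    # the unique optimal strategy, x > 0
assert np.allclose(A @ x, 0.0)             # x equalizes: the value is 0
assert np.linalg.matrix_rank(A) == n - 1   # condition 4: rank n - 1 when v = 0
```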
The theory of completely mixed games is a useful tool in linear algebra and numerical analysis [4]. The following is a sample application of this theorem.

Theorem 21 (Perron 1909) Let A be any n × n matrix with positive entries. Then A has a positive eigenvalue with a positive eigenvector which is also a simple root of the characteristic equation.

Proof Let I be the identity matrix. For any λ > 0, the maximizing player prefers to play the game A rather than the game A − λI: the payoff gets worse when the diagonal entries are decreased. The value function v(λ) of A − λI is a non-increasing continuous function of λ. Since v(0) > 0 and v(λ) < 0 for large λ, we have for some λ0 > 0 that the value of A − λ0 I is 0. Let y° be optimal for Player II; then (A − λ0 I)y° ≤ 0 implies 0 < Ay° ≤ λ0 y°, that is, y° > 0. Since the optimal y° is completely mixed, for any optimal x° of Player I we have x°(A − λ0 I) = 0. Thus x° > 0 and the game is completely mixed. By (2) and (4), if (A − λ0 I)u = 0 then u is a scalar multiple of y°, so the eigenvector y° is geometrically simple. If B = A − λ0 I, then B is singular and of rank n − 1. If (Bij) is the cofactor matrix of the singular matrix B, then
  ∑_j b_ij B_kj = 0,   i = 1, ..., n.

Thus row k of the cofactor matrix is a scalar multiple of y°. Similarly, each column of the cofactor matrix is a scalar multiple of x°. Thus all cofactors are of the same sign and are different from 0. That is,

  (d/dλ) det(A − λI) evaluated at λ = λ0 equals −∑_i B_ii ≠ 0.
Thus λ0 is also algebraically simple. See [4] for the most general extensions of this theorem to the theorems of Perron and Frobenius and to the theory of M-matrices, power positive matrices, and polynomially positive matrices.
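Perron's conclusion can be verified numerically for any particular positive matrix; the 3 × 3 matrix below is a hypothetical example, not one from the text.

```python
import numpy as np

# Perron's theorem illustrated on a hypothetical positive 3x3 matrix:
# a positive, simple eigenvalue with a strictly positive eigenvector.
A = np.array([[2., 1., 1.],
              [1., 3., 1.],
              [1., 1., 4.]])
w, V = np.linalg.eig(A)
k = int(np.argmax(w.real))
perron_root = w.real[k]                    # the largest (Perron) eigenvalue
v = V[:, k].real
v = v / v.sum()                            # normalize; all components positive
assert perron_root > 0 and np.all(v > 0)
# simplicity: no other eigenvalue coincides with the Perron root
assert np.sum(np.isclose(w, perron_root)) == 1
```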
Behavior Strategies in Games with Perfect Recall

Consider any extensive game where the unique unicursal path from an end vertex w to the root x0 intersects two moves x and y of, say, Player I. We say x < y if the unique path from y to x0 is via move x. Let U ∋ x and V ∋ y be the respective information sets. If the game has reached a move y ∈ V, Player I will know that it is his turn and that the game has progressed to some move in V. The game is said to have perfect recall if each player can remember all his past moves and the choices made in those moves. For example, if the game has progressed to a move of Player I in the information set V, he will remember the specific alternative chosen in any earlier move. A move x is possible for Player I with his pure strategy π1 if, for some suitable pure strategy π2 of Player II, the move x can be reached with positive probability using π1, π2. An information set U is relevant for a pure strategy π1 of Player I if some move x ∈ U is possible with π1. Let Π1, Π2 be the pure strategy spaces of players I and II. Let μ1 = {q_π1 : π1 ∈ Π1} be any mixed strategy for Player I. The information set U for Player I is relevant for the mixed strategy μ1 if for some π1 with q_π1 > 0, U is relevant for π1; U is not relevant for μ1 if for all π1 with q_π1 > 0, U is not relevant for π1. Let

  S_ν = {π1 : U is relevant for π1 and π1(U) = ν},
  S = {π1 : U is relevant for π1},
  T_ν = {π1 : U is not relevant for π1 and π1(U) = ν}.

The behavior strategy induced by a mixed strategy pair (μ1, μ2) at an information set U for Player I is simply the conditional probability of choosing alternative ν in the information set U, given that the game has progressed to a move in U, namely

  β1(U, ν) = (∑_{π1 ∈ S_ν} q_π1) / (∑_{π1 ∈ S} q_π1)   if U is relevant for μ1,
  β1(U, ν) = ∑_{π1 ∈ T_ν} q_π1                          if U is not relevant for μ1.
The following theorem of Kuhn [42] is a consequence of the assumption of perfect recall.

Theorem 22 Let μ1, μ2 be mixed strategies for players I and II respectively in a zero-sum two person finite game of perfect recall. Let β1, β2 be the induced behavior strategies for the two players. Then the probability of reaching any end vertex w using μ1, μ2 coincides with the probability of reaching w using the induced behavior strategies β1, β2.

Thus in zero-sum two person games with perfect recall, players can play optimally by restricting their strategy choices to behavior strategies. The following analogy may help us understand the advantages of behavior strategies over mixed strategies. A book has 10 pages with 3 lines per page. Someone wants to glance through the book, reading just 1 line from each page. A master plan (pure strategy) for scanning the book consists of choosing one line number for each page. Since each page has 3 lines, the number of possible plans is 3^10. Thus the set of mixed strategies is a set of dimension 3^10 − 1. There is another randomized approach for scanning the book. When page i is about to be scanned, choose line 1 with probability x_i1, line 2 with probability x_i2 and line 3 with probability x_i3. Since for each i we have x_i1 + x_i2 + x_i3 = 1, the dimension of such a strategy space is just 20. Behavior strategies are easier to work with, and Kuhn's theorem guarantees that we can restrict to behavior strategies in games with perfect recall. In general, if there are k alternatives at each information set for a player and n information sets for that player, the dimension of the mixed strategy space is k^n − 1, while the dimension of the behavior strategy space is simply n(k − 1). Thus while the dimension of the mixed strategy space grows exponentially, the dimension of the behavior strategy space grows linearly. The following example will illustrate the advantages of using behavior strategies.

Example 23 Player I has 7 dice.
All but one are fake. Fake die F_i has the same number i on all faces, i = 1, ..., 6. Die G is an ordinary unbiased die. Player I selects one of them secretly and announces the outcome of a single toss of the die to Player II. It is Player II's turn to guess which die was selected for the toss. He gets no reward for a correct guess but pays $1 to Player I for any wrong guess. Player I has 7 pure strategies while Player II has 2^6 = 64 pure strategies. As an example, the pure strategy (F1, G, G, F4,
G, F6) for Player II is one which guesses the die as fake when the outcome revealed is 1, 4 or 6, and guesses the die as genuine when the outcome is 2, 3 or 5. The normal form game is a payoff matrix of size 7 × 64. For example, if G is chosen by Player I and (F1, G, G, F4, G, F6) is chosen by Player II, the expected payoff to Player I is (1/6)[1 + 0 + 0 + 1 + 0 + 1] = 1/2. If F2 is chosen by Player I, the expected payoff is 1 against the above pure strategy of Player II. Now Player II can use the following behavior strategy: if the outcome is i, then with probability q_i he guesses that the die is genuine and with probability (1 − q_i) he guesses that it is the fake die F_i. The expected behavioral payoff to Player I, when he chooses the genuine die with probability p0 and the fake die F_i with probability p_i, i = 1, ..., 6, is given by

  K(p, q) = p0 (1/6) ∑_{i=1}^{6} (1 − q_i) + ∑_{i=1}^{6} p_i q_i.

Collecting the coefficients of the q_i's, we get

  K(p, q) = ∑_{i=1}^{6} q_i (p_i − (1/6) p0) + p0.
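The equalizer computation that follows from this formula can be checked exactly with rational arithmetic:

```python
from fractions import Fraction as F
from itertools import product

# Behavioral payoff for the one-genuine/six-fake dice game:
# Player I picks the genuine die with prob p0 and fake die F_i with prob p_i;
# Player II, on outcome i, guesses "genuine" with prob q_i.
def K(p0, p, q):
    return p0 * sum(1 - qi for qi in q) / 6 + sum(pi * qi for pi, qi in zip(p, q))

p0, p = F(1, 2), [F(1, 12)] * 6      # Player I's equalizer from the text
q = [F(1, 2)] * 6                    # Player II's equalizer from the text
assert K(p0, p, q) == F(1, 2)        # the value of the game is 1/2

# Player I's strategy equalizes against every pure guessing rule of Player II
assert all(K(p0, p, list(g)) == F(1, 2) for g in product([0, 1], repeat=6))
```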
By choosing p_i − (1/6)p0 = 0, we get p1 = p2 = ... = p6 = (1/6)p0. Thus p = (1/2, 1/12, 1/12, 1/12, 1/12, 1/12, 1/12). For this mixed strategy of Player I, the payoff to Player I is independent of Player II's actions. Similarly, we can rewrite K(p, q) as a function of the p_i's for i = 1, ..., 6, where

  K(p, q) = ∑_{k≠0} p_k [ q_k − (1/6) ∑_{r=1}^{6} (1 − q_r) ] + (1/6) ∑_{k=1}^{6} (1 − q_k).

This expression can be made independent of the p_i's by choosing q_i = 1/2, i = 1, ..., 6. Since the behavioral payoff for these behavioral strategies is 1/2, the value of the game is 1/2, which means that Player I cannot do any better than 1/2 while Player II is able to limit his losses to 1/2.

Efficient Computation of Behavior Strategies

Introduction

In our above example with one genuine and six fake dice, we used Kuhn's theorem to narrow our search among optimal behavior strategies. Our success depended on exploiting the inherent symmetries in the problem. We were also lucky in our search when we were looking for optimals among equalizers. From an algorithmic point of view, this is not possible with an arbitrary extensive game with perfect recall. While the normal form is appropriate for finding optimal mixed strategies, its complexity grows exponentially with the size of the vertex set. The payoff matrix in normal form is in general not a sparse matrix (a sparse matrix is one which has very few nonzero entries), a key issue for data storage and computational accuracy. By sticking to the normal form of a game with perfect recall, we cannot take full advantage of Kuhn's theorem and its drastically narrowed-down search for optimals among behavior strategies. A more appropriate form for these games is the sequence form [63] and the realization probabilities to be described below. The behavior strategies that induce the realization probabilities grow only linearly in the size of the terminal vertex set. Another major advantage is that the sequence form induces a sparse matrix: it has at most as many non-zero entries as the number of terminal vertices, or plays.

Sequence Form
When the game moves to an information set U1 of, say, Player I, the perfect recall condition implies that wherever the true move lies in U1, the player knows the actual alternative chosen in each of his past moves. Let u1 denote the sequence of alternatives chosen by Player I in his past moves. If no past move of Player I occurs, we take u1 = ∅. Suppose in U1 Player I selects an action c with behavioral probability β1(c); if the outcome is c, the new sequence is u1 ∪ c. Thus any sequence s1 for Player I is the string of choices in his moves along the partial path from the initial vertex to any other vertex of the tree. Let S0, S1, S2 be the sets of all sequences for Nature (via chance moves), Player I and Player II respectively. Given behavior strategies β0, β1, β2, let

  r_i(s_i) = ∏_{c ∈ s_i} β_i(c),   i = 0, 1, 2.

The functions r_i : S_i → R, i = 0, 1, 2, satisfy the following conditions:

  r_i(∅) = 1,                                        (17)
  r_i(u_i) = ∑_{c ∈ A(U_i)} r_i(u_i, c),  i = 0, 1, 2,   (18)
  r_i(s_i) ≥ 0  for all s_i,  i = 0, 1, 2.            (19)
Conversely, given any such realization functions r1, r2, we can define behavior strategies, say β1 for Player I, by

  β1(U1, c) = r1(u1 ∪ c) / r1(u1)   for c ∈ A(U1) and r1(u1) > 0.

When r1(u1) = 0 we define β1(U1, c) arbitrarily, so that ∑_{c ∈ A(U1)} β1(U1, c) = 1. If the terminal payoff to Player I at terminal vertex ω is h(ω), then by defining h(a) = 0 for all nodes a that are not terminal vertices, we can easily check that the behavioral payoff is

  H(β1, β2) = ∑_{s ∈ S} h(s) ∏_{i=0}^{2} r_i(s_i).

When we work with realization functions r_i, i = 1, 2, we can associate with them the sequence form of the payoff matrix, whose rows correspond to sequences s1 ∈ S1 for Player I, whose columns correspond to sequences s2 ∈ S2 for Player II, and whose entries are

  K(s1, s2) = ∑_{s0 ∈ S0} h(s0, s1, s2).
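The conversion from a realization plan back to behavior probabilities can be sketched directly from the formula above. The small one-player tree below (an information set at the root and a second one after choice 'l') is a hypothetical example.

```python
# Recovering behavior probabilities from a realization plan:
# beta(U, c) = r(u + c) / r(u) whenever r(u) > 0.
# Hypothetical plan: sequences "", "l", "r", "lL", "lR".
r = {"": 1.0, "l": 0.25, "r": 0.75, "lL": 0.1, "lR": 0.15}

def behavior(r, u, choices):
    total = r[u]
    if total > 0:
        return {c: r[u + c] / total for c in choices}
    return {c: 1.0 / len(choices) for c in choices}   # arbitrary when r(u) = 0

b1 = behavior(r, "", ["l", "r"])
b2 = behavior(r, "l", ["L", "R"])
assert abs(b1["l"] - 0.25) < 1e-12
assert abs(b2["L"] - 0.4) < 1e-12            # 0.1 / 0.25
assert abs(sum(b2.values()) - 1.0) < 1e-12   # consistency with (18)
```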
Unlike mixed strategies, the realization functions r1, r2 are subject to the linear constraints above. It is convenient to denote the sequence functions r1, r2 by vectors x, y respectively. The vector x has |S1| coordinates and the vector y has |S2| coordinates. The constraints on x and y are linear, given by Ex = e, Fy = f, where the first row of both E and F is the unit vector (1, 0, ..., 0) of appropriate size. If U1 is the collection of information sets for Player I, then the number of rows of E is 1 + |U1|. Similarly, the number of rows of F is 1 + |U2|. Except for the first row, each row has −1 as its first nonzero entry, followed by some 1's and 0's.

Consider the extensive game with perfect recall shown in Figure 4.

Zero-Sum Two Person Games, Figure 4

The set of sequences for Player I is S1 = {∅, l, r, L, R}. The set of sequences for Player II is S2 = {∅, c, d}. The sequence form payoff matrix is

                ∅    c    d
        ∅   [   *    *    *
        l       0    0    0
  A  =  r       0    1   -1
        L       0   -2    4
        R       1    0    0  ]

Since no end vertex corresponds to s1 = ∅ for Player I, the first row of A is identically 0, and so it is represented by * entries. The constraint matrices E and F are given by

  E = [  1  0  0  0  0          F = [  1  0  0
        -1  1  1  0  0                -1  1  1 ] .
         0 -1  0  1  1 ] ,

We are looking for a pair of vectors x°, y° such that y° is a best reply to x°, minimizing (x°, Ay) among all vectors y satisfying Fy = f, y ≥ 0, and similarly x° is a best reply to y°, maximizing (x, Ay°) subject to Ex = e, x ≥ 0. The duals to these two linear programming problems are

  max (f, q) such that F^T q ≤ A^T x°, q unrestricted;
  min (e, p) such that E^T p ≥ A y°, p unrestricted.

Since Ex = e, x ≥ 0 and Fy = f, y ≥ 0, these two problems can as well be viewed as the dual linear programs

  Primal: max (f, q) over (q, x) such that A^T x − F^T q ≥ 0, Ex = e, x ≥ 0, q unrestricted;
  Dual:   min (e, p) over (p, y) such that E^T p − A y ≥ 0, Fy = f, y ≥ 0, p unrestricted.

We can essentially prove that:
Theorem 24 The optimal behavior strategies of a zero-sum two person game with perfect recall can be reduced to solving for optimal solutions of the dual linear programs induced by its sequence form. The linear program has a size which, in its sparse representation, is linear in the size of the game tree.
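The dual-LP optimality criterion can be illustrated on a hypothetical one-information-set game (matching pennies), where the sequence form reduces to the normal form; the plans, dual variables, and constraint matrices below are constructed for that toy case only.

```python
import numpy as np

# Optimality certificate for the sequence-form dual LPs on matching pennies.
# Sequences per player: empty, first choice, second choice.
A = np.array([[1., -1.], [-1., 1.]])         # terminal payoffs to Player I
Apad = np.zeros((3, 3))
Apad[1:, 1:] = A                             # no payoff on the empty sequence
E = np.array([[1., 0., 0.], [-1., 1., 1.]])  # realization constraints Ex = e
F = E.copy()
e = f = np.array([1., 0.])
x = y = np.array([1., 0.5, 0.5])             # optimal realization plans
q = p = np.array([0., 0.])                   # dual variables (game value 0)
assert np.allclose(E @ x, e) and np.allclose(F @ y, f)
assert np.all(F.T @ q <= Apad.T @ x + 1e-12) # feasibility of II's dual
assert np.all(E.T @ p >= Apad @ y - 1e-12)   # feasibility of I's dual
assert f @ q == e @ p                        # equal objectives => optimal pair
```

By LP duality, feasible primal and dual solutions with equal objectives certify optimality; for larger trees the same pair of programs is handed to any LP solver.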
General Minimax Theorems

The minimax theorem of von Neumann can be generalized once we notice the possible limitations for extensions.

Example 25 Players I and II choose secretly positive integers i, j respectively. The payoff matrix is given by

  a_ij = 1 if i > j,  a_ij = −1 if i < j;

the value of the game does not exist. The boundedness of the payoff is essential, and the following extension holds.

S-games

Given a closed bounded set S ⊂ R^m, let Players I and II play the following game. Player II secretly selects a point s = (s1, ..., sm) ∈ S. Knowing the set S but not knowing the point chosen by Player II, Player I selects a coordinate i ∈ {1, 2, ..., m}. Player I receives from Player II the amount s_i.

Theorem 26 Given that S is a compact subset of R^m, every S-game has a value and the two players have optimal mixed strategies which use at most m pure strategies. If the set S is also convex, Player II has an optimal pure strategy.

Proof Let T be the convex hull of S. Here T is also compact. Let

  v = min_{t ∈ T} max_i t_i = max_i t*_i.

The compact convex set T and the open convex set G = {s : max_i s_i < v} are disjoint. By the weak separation theorem for convex sets there exist π ≠ 0 and a constant c such that

  (π, s) ≤ c  for all s ∈ G,
  (π, t) ≥ c  for all t ∈ T.

Using the property that v = (v, v, ..., v) ∈ Ḡ and t* ∈ T ∩ Ḡ, we have (π, t*) = c. For any u ≥ 0, t* − u ∈ Ḡ. Thus (π, t* − u) ≤ c; that is, (π, u) ≥ 0. We can assume π is a probability vector, in which case

  (π, v) = v ≤ c = (π, t*) ≤ max_i t*_i = v.

Now Player II has in t* = ∑_j λ_j x_j a mixed strategy which chooses x_j with probability λ_j. Since t* is a boundary point of T = con S, by the Caratheodory theorem for convex hulls the convex combination above involves at most m points of S. Hence the theorem. See [50].

Geometric Consequences

Many geometric theorems can be derived by using the minimax theorem for S-games. Here we give as an example the following theorem of Berge [6], which follows from the theorem on S-games.

Theorem 27 Let S_i, i = 1, ..., m, be compact convex sets in R^n satisfying the following two conditions:
1. S = ∪_{i=1}^m S_i is convex.
2. ∩_{i≠j} S_i ≠ ∅ for j = 1, 2, ..., m.
Then ∩_{i=1}^m S_i ≠ ∅.

Proof Suppose Player I secretly chooses one of the sets S_i and Player II secretly chooses a point x ∈ S. Let the payoff to I be d(S_i, x), where d is the distance of the point x from the set S_i. By our S-game arguments we have, for some probability vector μ = (μ1, ..., μm) for Player I and some mixed strategy of Player II choosing points x_j with probabilities λ_j,

  ∑_i μ_i d(S_i, x) ≥ v  for all x ∈ S   (20)

and

  ∑_j λ_j d(S_i, x_j) ≤ v  for all i = 1, ..., m.   (21)

Since the functions d(S_i, x) are convex in x, the second inequality (21) implies

  d(S_i, ∑_j λ_j x_j) ≤ v.   (22)

The game admits a pure optimal strategy x° = ∑_j λ_j x_j for Player II. We are also given that ∩_{i≠j} S_i ≠ ∅ for j = 1, 2, ..., m. For any optimal mixed strategy μ = (μ1, ..., μm) of Player I, if some μ_i = 0, say μ_1 = 0, and we choose an x ∈ ∩_{i≠1} S_i, then from (20) we have 0 ≥ v and thus v = 0. When the value v is zero, the third inequality (22) shows that x° ∈ ∩_i S_i. If μ > 0, the second inequality (21) becomes an equality for all i and we have d(S_i, x°) = v for every i. But x° ∈ S, so v = 0 and x° ∈ ∩_{i=1}^m S_i. (See Raghavan [53] for other applications.)
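The S-game value v = min over conv S of the largest coordinate can be sketched numerically; the two-point set S below is a hypothetical example (its convex hull is a segment, so a one-parameter sweep suffices).

```python
# S-game sketch with a hypothetical two-point set S = {(1,0), (0,1)}:
# Player II mixes the points of S, Player I picks a coordinate, and
# v = min over conv S of the largest coordinate.
def s_game_value(p0, p1, steps=10001):
    best = float("inf")
    for k in range(steps):                 # walk the segment [p0, p1]
        a = k / (steps - 1)
        t = [(1 - a) * u + a * w for u, w in zip(p0, p1)]
        best = min(best, max(t))
    return best

v = s_game_value((1.0, 0.0), (0.0, 1.0))
assert abs(v - 0.5) < 1e-3                 # optimal pure point t* = (1/2, 1/2)
```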
Ky Fan–Sion Minimax Theorems

General minimax theorems are concerned with the following problem: given two arbitrary sets X, Y and a real function K : X × Y → R, under what conditions on K, X, Y can one assert

  sup_{x ∈ X} inf_{y ∈ Y} K(x, y) = inf_{y ∈ Y} sup_{x ∈ X} K(x, y)?
A standard technique for proving general minimax theorems is to reduce the problem to the minimax theorem for matrix games. Such a reduction is often possible under some form of compactness of the space X or Y and suitable continuity and convexity or quasi-convexity of the kernel K.

Definition 28 A function f : X → R is upper semicontinuous on X if and only if for any real c, {x : f(x) < c} is open in X. A function f : X → R is lower semicontinuous on X if and only if for any real c, {x : f(x) > c} is open in X.

Definition 29 Let X be a convex subset of a topological vector space. A function f : X → R is quasi-convex if and only if for each real c, the set {x : f(x) < c} is convex. A function g is quasi-concave if and only if −g is quasi-convex. Clearly any convex (concave) function is quasi-convex (quasi-concave).

The following minimax theorem is a special case of more general minimax theorems due to Ky Fan [21] and Sion [61].

Theorem 30 Let X, Y be compact convex subsets of linear topological spaces. Let K : X × Y → R be upper semicontinuous (u.s.c.) in x (for each fixed y) and lower semicontinuous (l.s.c.) in y (for each x). Let K(x, y) be quasi-concave in x and quasi-convex in y. Then

  max_{x ∈ X} min_{y ∈ Y} K(x, y) = min_{y ∈ Y} max_{x ∈ X} K(x, y).

Proof The compactness of the spaces and the u.s.c., l.s.c. conditions guarantee the existence of max_{x ∈ X} min_{y ∈ Y} K(x, y) and min_{y ∈ Y} max_{x ∈ X} K(x, y). We always have

  max_{x ∈ X} min_{y ∈ Y} K(x, y) ≤ min_{y ∈ Y} max_{x ∈ X} K(x, y).
If possible let max_x min_y K(x, y) < c < min_y max_x K(x, y). Let A_x = {y : K(x, y) < c} and B_y = {x : K(x, y) > c}. Since min_y max_x K(x, y) > c, the sets {y : K(x, y) > c}, x ∈ X (open by lower semicontinuity in y), cover Y; since max_x min_y K(x, y) < c, the sets {x : K(x, y) < c}, y ∈ Y (open by upper semicontinuity in x), cover X. By compactness there are finite subcovers. Therefore we have finite subsets X1 ⊂ X, Y1 ⊂ Y such that for each y ∈ Y, and hence for each y ∈ Con Y1, there is an x ∈ X1 with K(x, y) > c, and for each x ∈ X, and hence for each x ∈ Con X1, there is a y ∈ Y1 with K(x, y) < c.
Without loss of generality let the finite sets X1 = {x1, ..., xm} and Y1 be of minimum cardinality m and n satisfying the above conditions. The minimum cardinality conditions have the following implications. The sets S_i = {y : K(x_i, y) ≤ c} ∩ Con Y1 are non-empty and convex. Further, ∩_{i≠j} S_i ≠ ∅ for all j = 1, ..., m, but ∩_{i=1}^m S_i = ∅. Now by Berge's theorem (Subsect. "Geometric Consequences"), the union of the sets S_i cannot be convex; in particular it cannot be all of the convex set Con Y1. Therefore there exists y0 ∈ Con Y1 with K(x, y0) > c for all x ∈ X1. Since K(·, y0) is quasi-concave, we have K(x, y0) > c for all x ∈ Con X1. Similarly there exists an x0 ∈ Con X1 such that K(x0, y) < c for all y ∈ Y1, and hence for all y ∈ Con Y1 (by quasi-convexity of K(x0, y)). Hence c < K(x0, y0) < c, and we have a contradiction.

When the sets X, Y are mixed strategies (probability measures) on suitable Borel spaces, one can find games with a value and optimal strategies for one player but not for both. See [50], Alpern and Gal [1988].

Applications of Infinite Games

S-games and Discriminant Analysis

Motivated by Fisher's enquiries [25] into the problem of classifying a randomly observed human skull into one of several known populations, discriminant analysis has exploded into a major statistical tool with applications to diverse problems in business, social sciences and biological sciences [33,54].

Example 31 A population Π has a probability density which is either f1 or f2, where f_i is multivariate normal with mean vector μ_i, i = 1, 2, and the same variance-covariance matrix Σ for both f1 and f2. Given an observation X from the population Π, the problem is to classify the observation into the proper population, with density f1 or f2. The costs of misclassification are c(1/2) > 0 and c(2/1) > 0, where c(i/j) is the cost of misclassifying an observation from Π_j to Π_i. The aim of the statistician is to find a suitable decision procedure that minimizes the worst risk possible.
This can be treated as an S-game where a pure strategy for Player I (Nature) is the secret choice of the population, and a pure strategy for the statistician (Player II) is a partition of the sample space into two disjoint sets (T1, T2), such that observations falling in T1 are classified as from Π1 and observations falling in T2 are classified as from Π2. The payoffs to Player I (Nature) when the observation is chosen from Π1 or Π2 are given by the risks (expected costs):

  r(1, (T1, T2)) = c(2/1) ∫_{T2} f1(x) dx,
  r(2, (T1, T2)) = c(1/2) ∫_{T1} f2(x) dx.

The following theorem, based on an extension of Lyapunov's theorem for non-atomic vector measures [43], is due to Blackwell [10] and Dvoretsky, Wald and Wolfowitz [20].

Theorem 32 Let

  S = { (s1, s2) : s1 = c(2/1) ∫_{T2} f1(x) dx, s2 = c(1/2) ∫_{T1} f2(x) dx, (T1, T2) ∈ T },

where T is the collection of all measurable partitions of the sample space. Then S is a closed bounded convex set.

We know from the theorem on S-games (Theorem 26) that Player II (the statistician) has a minimax strategy which is pure. If v is the value of the game and (ξ1, ξ2) is an optimal strategy for Player I, then we have

  ξ1 c(2/1) ∫_{T2} f1(x) dx + ξ2 c(1/2) ∫_{T1} f2(x) dx ≥ v

for all measurable partitions in T. For any general partition, the above expected payoff to I simplifies to

  ξ1 c(2/1) + ∫_{T1} [ ξ2 c(1/2) f2(x) − ξ1 c(2/1) f1(x) ] dx.

It is minimized whenever the integrand is ≤ 0 on T1. Thus the optimal pure strategy (T1, T2) satisfies

  T1 = { x : ξ2 c(1/2) f2(x) − ξ1 c(2/1) f1(x) ≤ 0 }.

This is equivalent to

  T1 = { x : U(x) = (μ1 − μ2)^T Σ^{-1} x − (1/2)(μ1 − μ2)^T Σ^{-1} (μ1 + μ2) ≥ k }

and

  T2 = { x : U(x) < k }

for some suitable k. Let α = (μ1 − μ2)^T Σ^{-1} (μ1 − μ2). The random variable U is univariate normal with mean α/2 and variance α if x ∈ Π1, and with mean −α/2 and variance α if x ∈ Π2. The minimax strategy for the statistician will be such that

  c(2/1) ∫_{−∞}^{(k − α/2)/√α} (1/√(2π)) e^{−y²/2} dy = c(1/2) ∫_{(k + α/2)/√α}^{∞} (1/√(2π)) e^{−y²/2} dy.

The value of k can be determined by trial and error.

General Minimax Theorem and Statistical Estimation

Example 33 A coin falls heads with unknown probability θ. Not knowing the true value of θ, a statistician wants to estimate θ based on a single toss of the coin, with squared error loss. The outcome is either heads or tails. If heads, he estimates θ by x; if tails, he estimates θ by y. To keep the problem zero-sum, the statistician pays a penalty (θ − x)² when he proposes x as the estimate and a penalty (θ − y)² when he proposes y as the estimate. Thus the expected loss, or risk, to the statistician is given by

  θ(θ − x)² + (1 − θ)(θ − y)².

It was Abraham Wald [68] who pioneered this game theoretic approach. The problem of estimation is one of discovering the true state of nature based on partial knowledge about nature revealed by experiments. While nature reveals itself in part, the aim of nature is to conceal its secrets. The statistician has to make his decisions using observed information about nature. We can think of this as an ordinary zero-sum two person game where a pure strategy for nature is any θ ∈ [0, 1] and a pure strategy for Player II (the statistician) is any point in the unit square I = {(x, y) : 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}, with the payoff given above. Expanding the payoff function we get

  K(θ, (x, y)) = θ²(−2x + 1 + 2y) + θ(x² − 2y − y²) + y².
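The algebraic expansion of the risk can be sanity-checked numerically at random points:

```python
import random

# Numerical sanity check of the expansion of the risk
# theta*(theta - x)**2 + (1 - theta)*(theta - y)**2.
random.seed(0)
for _ in range(1000):
    t, x, y = random.random(), random.random(), random.random()
    lhs = t * (t - x) ** 2 + (1 - t) * (t - y) ** 2
    rhs = t * t * (-2 * x + 1 + 2 * y) + t * (x * x - 2 * y - y * y) + y * y
    assert abs(lhs - rhs) < 1e-12
```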
The statistician may try to choose his strategy in such a way that, no matter what θ is chosen by nature, it has no effect on his penalty given his choice of x, y. We call such a strategy an equalizer. We have an equalizer strategy if we can make the above payoff independent of θ. In fact we have used this trick earlier while dealing with behavior strategies. By choosing x and y such that the coefficients of θ² and θ are zero, we can make the payoff expression independent of θ. We find that x = 3/4, y = 1/4 is the solution. While this may guarantee the statistician a constant risk of 1/16, one may wonder whether this is the best. We will show that it is best by finding a mixed strategy for mother nature that guarantees an expected payoff of 1/16 no matter what (x, y) is proposed by the statistician. Let F(θ) be the cumulative distribution function that is optimal for mother nature. Integrating the above payoff with respect to F we get

  K(F, (x, y)) = (−2x + 1 + 2y) ∫_0^1 θ² dF(θ) + (x² − 2y − y²) ∫_0^1 θ dF(θ) + y².

Let

  m1 = ∫_0^1 θ dF(θ)  and  m2 = ∫_0^1 θ² dF(θ).

In terms of m1, m2, x, y the expected payoff can be rewritten as

  L((m1, m2), (x, y)) = m2(−2x + 1 + 2y) + m1(x² − 2y − y²) + y².

For fixed values of m1, m2, the minimum over (x, y) must satisfy the first order conditions

  x = m2/m1  and  y = (m1 − m2)/(1 − m1).

If m1 = 1/2 and m2 = 3/8, then x = 3/4 and y = 1/4 is optimal. In fact, a simple probability distribution with these moments can be found by choosing the point 1/2 − α with probability 1/2 and the point 1/2 + α with probability 1/2. Such a distribution has mean m1 = 1/2 and second moment m2 = (1/2)(1/2 − α)² + (1/2)(1/2 + α)² = 1/4 + α². When m2 = 3/8, we get α² = 1/8. Thus the optimal strategy for nature, also called the least favorable distribution, is to toss an ordinary unbiased coin: if it falls heads, select a coin which falls heads with probability (√2 − 1)/(2√2); if it falls tails, select a coin which falls heads with probability (√2 + 1)/(2√2). For further applications of zero-sum two person games to statistical decision theory see Ferguson [23]. In general it is difficult to characterize the nature of optimal mixed strategies even for a C^1 payoff function; see [37]. Another rich source of infinite games appears in many simplified parlor games. We give a simplified poker model due to Borel (see the reference under Ville [65]).

Borel's Poker Model

Two risk neutral players I and II initiate the game by adding an ante of $1 each to a pot. Players I and II are dealt hands: a random value u of U and a random value v of V respectively, where U, V are independent random variables uniformly distributed on [0, 1]. The game begins with Player I. After seeing his hand he either folds, losing the pot to Player II, or raises by adding $1 to the pot. When Player I raises, Player II, after seeing his hand, can either fold, losing the pot to Player I, or call by adding $1 to the pot. If Player II calls, the game ends and the player with the better hand wins the entire pot. Suppose players I and II restrict themselves to using only the threshold strategies g, h respectively, where

  Player I:  g(u) = fold if u ≤ x, raise if u > x;
  Player II: h(v) = fold if v ≤ y, raise if v > y.

The computational details of the expected payoff K(x, y), based on the induced partition of the (u, v) space, are given below in Table 3. The payoff K(x, y) is simply the sum of each region's area times its local payoff, which gives

  K(x, y) = 2y² − 3xy + x − y  when x < y.
Zero-Sum Two Person Games, Figure 5 Poker partition when x < y
Zero-Sum Two Person Games, Table 3

  Outcome                Action by players   Region   Area                    Payoff
  u ≤ x                  I drops out         A∪D∪F    x                       -1
  u > x, v ≤ y           II drops out        E∪G      y(1 - x)                 1
  u < v, u > x, v > y    both raise          B        (1 - y)(1 - 2x + y)/2   -2
  u > v, u > x, v > y    both raise          C        (1 - y)²/2               2
Zero-Sum Two Person Games, Table 4

  Outcome                Action by players   Region    Area                    Payoff
  u ≤ x                  I drops out         H∪I∪J∪K   x                       -1
  u > x, v ≤ y           II drops out        N         y(1 - x)                 1
  u < v, u > x, v > y    both raise                    (1 - x)²/2              -2
  u > v, u > x, v > y    both raise                    (1 - x)(1 - 2y + x)/2    2
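The piecewise payoff assembled from these tables can be explored on a grid; this numerical sketch recovers the solution thresholds 1/9 and 1/3 quoted in the text.

```python
# Grid sketch of Borel's poker model using the piecewise payoff K(x, y).
def K(x, y):
    if x < y:
        return 2 * y * y - 3 * x * y + x - y
    return -2 * x * x + x * y + x - y

grid = [i / 900 for i in range(901)]          # includes 1/9 and 1/3 exactly
best = max(min(K(x, y) for y in grid) for x in grid)
assert abs(best - (-1 / 9)) < 1e-9            # value of the game is -1/9
assert abs(min(K(1 / 9, y) for y in grid) - (-1 / 9)) < 1e-9
```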
The payoff K(x, y) is simply the sum of each region's area times its local payoff, given by

  K(x, y) = −2x² + xy + x − y  for x > y,
  K(x, y) =  2y² − 3xy + x − y  for x < y.

Also K(x, y) is continuous on the diagonal, concave in x and convex in y. By the general minimax theorem of Ky Fan or Sion there exist pure optimal strategies x*, y* for K(x, y). We will find the value of the game by explicitly computing min_y K(x, y) for each x and then taking the maximum over x. Let 0 < x̄ < 1. (Intuitively we see that in our search for a saddle point, x = 0 or x = 1 are quite unlikely.) Minimizing K(x̄, y) over y and then maximizing over x̄ yields the value −1/9: a good pure strategy for Player I is to raise when u > 1/9, and a good pure strategy for Player II is to call when v > 1/3. For other poker models see von Neumann and Morgenstern [67], Blackwell and Bellman [5], Binmore [9], and Ferguson and Ferguson [22], who discuss other discrete and continuous poker models.

War Duels and Discontinuous Payoffs on the Unit Square

While general minimax theorems make some stringent continuity assumptions, these are rarely satisfied when modeling many war duels as games on the unit square. The payoffs are often discontinuous along the diagonal. The following is an example of this kind.
Zero-Sum Two Person Games, Figure 6 Poker partition when x > y
Example 34 Players I and II start at a distance of 1 from a balloon and walk toward the balloon at the same speed. Each player carries a noisy gun which has been loaded with just one bullet. Player I's accuracy at x is p(x) and
Player II's accuracy at x is q(x), where x is the distance traveled from the starting point. Because the guns can be heard, each player knows whether or not the other player has fired. The player who fires and hits the balloon first wins the game. Some natural assumptions are: the functions p, q are continuous and strictly increasing; p(0) = q(0) = 0 and p(1) = q(1) = 1. If a player fires and misses, then the opponent can wait until his accuracy equals 1 and then fire. Player I decides to shoot after traveling distance x provided the opponent has not yet used his bullet. Player II decides to shoot after traveling distance y provided the opponent has not yet used his bullet. The following payoff reflects the outcome of this strategy choice:

  K(x, y) = 2p(x) − 1    if x < y,
  K(x, y) = p(x) − q(x)  if x = y,
  K(x, y) = 1 − 2q(y)    if x > y.
We claim that the players have optimal pure strategies and there is a saddle point. Consider

  min_y K(x, y) = min{ 2p(x) − 1, p(x) − q(x), inf_{y < x} (1 − 2q(y)) }
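This minimization can be sketched numerically. The accuracies p(x) = q(x) = x below are an assumed illustration (they satisfy the monotonicity and boundary conditions above); for them the guaranteed payoff peaks where p(x) + q(x) = 1, i.e. at x = 1/2.

```python
# Noisy-duel sketch with assumed equal accuracies p(x) = q(x) = x.
def payoff_floor(x, p, q, steps=2000):
    fire_first = 2 * p(x) - 1                     # fire at x; opponent waited
    simultaneous = p(x) - q(x)                    # both fire at x
    wait = min(1 - 2 * q(t * x) for t in          # opponent fires at some y < x
               [i / steps for i in range(1, steps)])
    return min(fire_first, simultaneous, wait)    # guaranteed payoff at x

p = q = lambda t: t
best_x = max((i / 1000 for i in range(1001)),
             key=lambda x: payoff_floor(x, p, q))
assert abs(best_x - 0.5) < 1e-2                   # fire where p(x) + q(x) = 1
```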